
REGRESSION DIAGNOSTICS Second Edition


Quantitative Applications in the Social Sciences
A SAGE PUBLICATIONS SERIES

1. Analysis of Variance, 2nd Edition  Iversen/Norpoth
2. Operations Research Methods  Nagel/Neef
3. Causal Modeling, 2nd Edition  Asher
4. Tests of Significance  Henkel
5. Cohort Analysis, 2nd Edition  Glenn
6. Canonical Analysis and Factor Comparison  Levine
7. Analysis of Nominal Data, 2nd Edition  Reynolds
8. Analysis of Ordinal Data  Hildebrand/Laing/Rosenthal
9. Time Series Analysis, 2nd Edition  Ostrom
10. Ecological Inference  Langbein/Lichtman
11. Multidimensional Scaling  Kruskal/Wish
12. Analysis of Covariance  Wildt/Ahtola
13. Introduction to Factor Analysis  Kim/Mueller
14. Factor Analysis  Kim/Mueller
15. Multiple Indicators  Sullivan/Feldman
16. Exploratory Data Analysis  Hartwig/Dearing
17. Reliability and Validity Assessment  Carmines/Zeller
18. Analyzing Panel Data  Markus
19. Discriminant Analysis  Klecka
20. Log-Linear Models  Knoke/Burke
21. Interrupted Time Series Analysis  McDowall/McCleary/Meidinger/Hay
22. Applied Regression, 2nd Edition  Lewis-Beck/Lewis-Beck
23. Research Designs  Spector
24. Unidimensional Scaling  McIver/Carmines
25. Magnitude Scaling  Lodge
26. Multiattribute Evaluation  Edwards/Newman
27. Dynamic Modeling  Huckfeldt/Kohfeld/Likens
28. Network Analysis  Knoke/Kuklinski
29. Interpreting and Using Regression  Achen
30. Test Item Bias  Osterlind
31. Mobility Tables  Hout
32. Measures of Association  Liebetrau
33. Confirmatory Factor Analysis  Long
34. Covariance Structure Models  Long
35. Introduction to Survey Sampling, 2nd Edition  Kalton
36. Achievement Testing  Bejar
37. Nonrecursive Causal Models  Berry
38. Matrix Algebra  Namboodiri
39. Introduction to Applied Demography  Rives/Serow
40. Microcomputer Methods for Social Scientists, 2nd Edition  Schrodt
41. Game Theory  Zagare
42. Using Published Data  Jacob
43. Bayesian Statistical Inference  Iversen
44. Cluster Analysis  Aldenderfer/Blashfield
45. Linear Probability, Logit, and Probit Models  Aldrich/Nelson
46. Event History and Survival Analysis, 2nd Edition  Allison
47. Canonical Correlation Analysis  Thompson
48. Models for Innovation Diffusion  Mahajan/Peterson
49. Basic Content Analysis, 2nd Edition  Weber
50. Multiple Regression in Practice  Berry/Feldman
51. Stochastic Parameter Regression Models  Newbold/Bos
52. Using Microcomputers in Research  Madron/Tate/Brookshire
53. Secondary Analysis of Survey Data  Kiecolt/Nathan
54. Multivariate Analysis of Variance  Bray/Maxwell
55. The Logic of Causal Order  Davis
56. Introduction to Linear Goal Programming  Ignizio
57. Understanding Regression Analysis, 2nd Edition  Schroeder/Sjoquist/Stephan
58. Randomized Response and Related Methods, 2nd Edition  Fox/Tracy
59. Meta-Analysis  Wolf
60. Linear Programming  Feiring
61. Multiple Comparisons  Klockars/Sax
62. Information Theory  Krippendorff
63. Survey Questions  Converse/Presser
64. Latent Class Analysis  McCutcheon
65. Three-Way Scaling and Clustering  Arabie/Carroll/DeSarbo
66. Q Methodology, 2nd Edition  McKeown/Thomas
67. Analyzing Decision Making  Louviere
68. Rasch Models for Measurement  Andrich
69. Principal Components Analysis  Dunteman
70. Pooled Time Series Analysis  Sayrs
71. Analyzing Complex Survey Data, 2nd Edition  Lee/Forthofer
72. Interaction Effects in Multiple Regression, 2nd Edition  Jaccard/Turrisi
73. Understanding Significance Testing  Mohr
74. Experimental Design and Analysis  Brown/Melamed
75. Metric Scaling  Weller/Romney
76. Longitudinal Research, 2nd Edition  Menard
77. Expert Systems  Benfer/Brent/Furbee
78. Data Theory and Dimensional Analysis  Jacoby
79. Regression Diagnostics, 2nd Edition  Fox
80. Computer-Assisted Interviewing  Saris
81. Contextual Analysis  Iversen
82. Summated Rating Scale Construction  Spector
83. Central Tendency and Variability  Weisberg
84. ANOVA: Repeated Measures  Girden
85. Processing Data  Bourque/Clark
86. Logit Modeling  DeMaris
87. Analytic Mapping and Geographic Databases  Garson/Biggs
88. Working With Archival Data  Elder/Pavalko/Clipp
89. Multiple Comparison Procedures  Toothaker
90. Nonparametric Statistics  Gibbons
91. Nonparametric Measures of Association  Gibbons
92. Understanding Regression Assumptions  Berry
93. Regression With Dummy Variables  Hardy
94. Loglinear Models With Latent Variables  Hagenaars
95. Bootstrapping  Mooney/Duval
96. Maximum Likelihood Estimation  Eliason
97. Ordinal Log-Linear Models  Ishii-Kuntz
98. Random Factors in ANOVA  Jackson/Brashers
99. Univariate Tests for Time Series Models  Cromwell/Labys/Terraza
100. Multivariate Tests for Time Series Models  Cromwell/Hannan/Labys/Terraza
101. Interpreting Probability Models: Logit, Probit, and Other Generalized Linear Models  Liao
102. Typologies and Taxonomies  Bailey
103. Data Analysis: An Introduction  Lewis-Beck
104. Multiple Attribute Decision Making  Yoon/Hwang
105. Causal Analysis With Panel Data  Finkel
106. Applied Logistic Regression Analysis, 2nd Edition  Menard
107. Chaos and Catastrophe Theories  Brown
108. Basic Math for Social Scientists: Concepts  Hagle
109. Basic Math for Social Scientists: Problems and Solutions  Hagle
110. Calculus  Iversen
111. Regression Models: Censored, Sample Selected, or Truncated Data  Breen
112. Tree Models of Similarity and Association  Corter
113. Computational Modeling  Taber/Timpone
114. LISREL Approaches to Interaction Effects in Multiple Regression  Jaccard/Wan
115. Analyzing Repeated Surveys  Firebaugh
116. Monte Carlo Simulation  Mooney
117. Statistical Graphics for Univariate and Bivariate Data  Jacoby
118. Interaction Effects in Factorial Analysis of Variance  Jaccard
119. Odds Ratios in the Analysis of Contingency Tables  Rudas
120. Statistical Graphics for Visualizing Multivariate Data  Jacoby
121. Applied Correspondence Analysis  Clausen
122. Game Theory Topics  Fink/Gates/Humes
123. Social Choice: Theory and Research  Johnson
124. Neural Networks  Abdi/Valentin/Edelman
125. Relating Statistics and Experimental Design: An Introduction  Levin
126. Latent Class Scaling Analysis  Dayton
127. Sorting Data: Collection and Analysis  Coxon
128. Analyzing Documentary Accounts  Hodson
129. Effect Size for ANOVA Designs  Cortina/Nouri
130. Nonparametric Simple Regression: Smoothing Scatterplots  Fox
131. Multiple and Generalized Nonparametric Regression  Fox
132. Logistic Regression: A Primer  Pampel
133. Translating Questionnaires and Other Research Instruments: Problems and Solutions  Behling/Law
134. Generalized Linear Models: A Unified Approach, 2nd Edition  Gill/Torres
135. Interaction Effects in Logistic Regression  Jaccard
136. Missing Data  Allison
137. Spline Regression Models  Marsh/Cormier
138. Logit and Probit: Ordered and Multinomial Models  Borooah
139. Correlation: Parametric and Nonparametric Measures  Chen/Popovich
140. Confidence Intervals  Smithson
141. Internet Data Collection  Best/Krueger
142. Probability Theory  Rudas
143. Multilevel Modeling, 2nd Edition  Luke
144. Polytomous Item Response Theory Models  Ostini/Nering
145. An Introduction to Generalized Linear Models  Dunteman/Ho
146. Logistic Regression Models for Ordinal Response Variables  O'Connell
147. Fuzzy Set Theory: Applications in the Social Sciences  Smithson/Verkuilen
148. Multiple Time Series Models  Brandt/Williams
149. Quantile Regression  Hao/Naiman
150. Differential Equations: A Modeling Approach  Brown
151. Graph Algebra: Mathematical Modeling With a Systems Approach  Brown
152. Modern Methods for Robust Regression  Andersen
153. Agent-Based Models, 2nd Edition  Gilbert
154. Social Network Analysis, 3rd Edition  Knoke/Yang
155. Spatial Regression Models, 2nd Edition  Ward/Gleditsch
156. Mediation Analysis  Iacobucci
157. Latent Growth Curve Modeling  Preacher/Wichman/MacCallum/Briggs
158. Introduction to the Comparative Method With Boolean Algebra  Caramani
159. A Mathematical Primer for Social Statistics  Fox
160. Fixed Effects Regression Models  Allison
161. Differential Item Functioning, 2nd Edition  Osterlind/Everson
162. Quantitative Narrative Analysis  Franzosi
163. Multiple Correspondence Analysis  LeRoux/Rouanet
164. Association Models  Wong
165. Fractal Analysis  Brown/Liebovitch
166. Assessing Inequality  Hao/Naiman
167. Graphical Models and the Multigraph Representation for Categorical Data  Khamis
168. Nonrecursive Models  Paxton/Hipp/Marquart-Pyatt
169. Ordinal Item Response Theory  Van Schuur
170. Multivariate General Linear Models  Haase
171. Methods of Randomization in Experimental Design  Alferes
172. Heteroskedasticity in Regression  Kaufman
173. An Introduction to Exponential Random Graph Modeling  Harris
174. Introduction to Time Series Analysis  Pickup
175. Factorial Survey Experiments  Auspurg/Hinz
176. Introduction to Power Analysis: Two-Group Studies  Hedberg
177. Linear Regression: A Mathematical Introduction  Gujarati
178. Propensity Score Methods and Applications  Bai/Clark
179. Multilevel Structural Equation Modeling  Silva/Bosancianu/Littvay
180. Gathering Social Network Data  Adams
181. Generalized Linear Models for Bounded and Limited Quantitative Variables  Smithson
182. Exploratory Factor Analysis  Finch
183. Multidimensional Item Response Theory  Bonifay
184. Argument-Based Validation in Testing and Assessment  Chapelle


Sara Miller McCune founded SAGE Publishing in 1965 to support the dissemination of usable knowledge and educate a global community. SAGE publishes more than 1000 journals and over 800 new books each year, spanning a wide range of subject areas. Our growing selection of library products includes archives, data, case studies and video. SAGE remains majority owned by our founder and after her lifetime will become owned by a charitable trust that secures the company’s continued independence. Los Angeles | London | New Delhi | Singapore | Washington DC | Melbourne


For Jesse and Sasha


REGRESSION DIAGNOSTICS
An Introduction
Second Edition

John Fox
McMaster University

Quantitative Applications in the Social Sciences, Volume 79


Copyright ©2020 by SAGE Publications, Inc. All rights reserved. Except as permitted by U.S. copyright law, no part of this work may be reproduced or distributed in any form or by any means, or stored in a database or retrieval system, without permission in writing from the publisher. All third party trademarks referenced or depicted herein are included solely for the purpose of illustration and are the property of their respective owners. Reference to these trademarks in no way indicates any relationship with, or endorsement by, the trademark owner.

FOR INFORMATION:

SAGE Publications, Inc.
2455 Teller Road
Thousand Oaks, California 91320
E-mail: [email protected]

SAGE Publications Ltd.
1 Oliver's Yard
55 City Road
London EC1Y 1SP
United Kingdom

SAGE Publications India Pvt. Ltd.
B 1/I 1 Mohan Cooperative Industrial Area
Mathura Road, New Delhi 110 044
India

SAGE Publications Asia-Pacific Pte. Ltd.
18 Cross Street #10–10/11/12
China Square Central
Singapore 048423

Printed in the United States of America

This book is printed on acid-free paper.

ISBN: 978-1-5443-7522-9

19 20 21 22 23  10 9 8 7 6 5 4 3 2 1

Acquisitions Editor: Helen Salmon
Editorial Assistant: Megan O'Heffernan
Production Editor: Gagan Mahindra
Copy Editor: QuADS Prepress (P) Ltd.
Typesetter: Integra
Proofreader: Rae-Ann Goodwin
Cover Designer: Rose Storey
Marketing Manager: Shari Countryman


CONTENTS

About the Author
Series Editor's Introduction
Acknowledgments

1. Introduction

2. The Linear Regression Model: Review
   The Normal Linear Regression Model
   Least-Squares Estimation
   Statistical Inference for Regression Coefficients
   *The Linear Regression Model in Matrix Form

3. Examining and Transforming Regression Data
   Univariate Displays
   Transformations for Symmetry
   Transformations for Linearity
   Transforming Nonconstant Variation
   Interpreting Results When Variables Are Transformed

4. Unusual Data: Outliers, Leverage, and Influence
   Measuring Leverage: Hat Values
   Detecting Outliers: Studentized Residuals
   Measuring Influence: Cook's Distance and Other Case Deletion Diagnostics
   Numerical Cutoffs for Noteworthy Case Diagnostics
   Jointly Influential Cases: Added-Variable Plots
   Should Unusual Data Be Discarded?
   *Unusual Data: Details

5. Nonnormality and Nonconstant Error Variance
   Detecting and Correcting Nonnormality
   Detecting and Dealing With Nonconstant Error Variance
   Robust Coefficient Standard Errors
   Bootstrapping
   Weighted Least Squares
   *Robust Standard Errors and Weighted Least Squares: Details

6. Nonlinearity
   Component-Plus-Residual Plots
   Marginal Model Plots
   Testing for Nonlinearity
   Modeling Nonlinear Relationships With Regression Splines
   *Transforming Explanatory Variables Analytically

7. Collinearity
   Collinearity and Variance Inflation
   Visualizing Collinearity
   Generalized Variance Inflation
   Dealing With Collinearity
   *Collinearity: Some Details

8. Diagnostics for Generalized Linear Models
   Generalized Linear Models: Review
   Detecting Unusual Data in GLMs
   Nonlinearity Diagnostics for GLMs
   Diagnosing Collinearity in GLMs
   Quasi-Likelihood Estimation of GLMs
   *GLMs: Further Background

9. Concluding Remarks
   Complementary Reading

References
Index


About the Author

John Fox is Professor Emeritus of Sociology at McMaster University in Hamilton, Ontario, Canada, where he was previously the Senator William McMaster Professor of Social Statistics. Professor Fox received a PhD in sociology from the University of Michigan in 1972 and is the author of many articles and books on statistics, including Applied Regression Analysis and Generalized Linear Models, Third Edition (2016), Using the R Commander: A Point-and-Click Interface for R (2018), and, with Sanford Weisberg, An R Companion to Applied Regression, Third Edition (2019). He continues to work on the development of statistical methods and their implementation in software. Professor Fox is an elected member of the R Foundation for Statistical Computing and an associate editor of the Journal of Statistical Software.


Series Editor's Introduction

It is with great pleasure that I introduce the second edition of John Fox's book Regression Diagnostics. This book presents and demonstrates the application of tools to assess the validity of key assumptions of linear regression models, namely, that relationships are specified correctly, residuals are independent and normally distributed, error variance is constant across cases, and the degree of collinearity among explanatory variables is not excessive. Regression Diagnostics will appeal to a wide variety of readers. It serves as an authoritative text for graduate-level courses on linear regression as well as a valuable refresher and handy reference for practitioners.

The first edition of Regression Diagnostics, published almost 30 years ago, is still in use in graduate quantitative methods courses. This is a tribute to its exceptional quality. The second edition is even better. Professor Fox reorganizes material, expanding some chapters, combining others, and adding two new ones: Chapter 3 on graphical methods and displays for transforming data and Chapter 8 on diagnostics for generalized linear regression models. Readers are assumed to be familiar with the linear regression model, although Chapter 2 provides a quick review for those needing it. The graphics are superb, serving as a pedagogical tool as well as a diagnostic approach. The graphics enable readers to "see" the problem, assess its nature, and visualize the solution. Professor Fox is a master at explaining complicated topics in a clear and straightforward manner. He draws on his vast teaching experience to establish a firm foundation and then builds on it in a systematic fashion. His style is inclusive. He aims to bring everyone along with him. Professor Fox anticipates the questions of novices and answers them. He covers topics of interest to more advanced readers but brackets them in a way not to lose others. Parts that require matrix algebra or calculus are starred and are available to readers who would like more depth.

Examples are critical to the pedagogy of the book. Some of these are classics and will be familiar to readers of the earlier edition, although with vivid new graphics. Other examples are new, notably data drawn from the CIA World Factbook on infant mortality, GDP per capita, Gini coefficient of income inequality, and health expenditures. These data are the mainstay of Chapter 3 on transforming data, and reprised again in Chapter 5 on diagnostics for nonnormality and nonconstant residual variance, in Chapter 6 on diagnostics for nonlinearity, and in Chapter 7 on collinearity. All of the data and associated R script are available on a companion website so that readers can replicate the examples. Indeed, at multiple points, Professor Fox invites them to do so.

Linear regression is the “bread and butter” of the social science disciplines. Regression diagnostics are an essential part of the toolkit. Professor Fox shares his insights and provides practical advice based on years of experience. By way of summary, the book concludes with general recommendations that draw on and weave together the various chapters. The common thread? Know thy data!

Barbara Entwisle



Series Editor


Acknowledgments I’ve had the good fortune to work with several collaborators from whom I’ve learned a great deal about regression analysis and related topics in statistics, and some of the knowledge that I acquired from them is reflected in this monograph. I’m particularly grateful to Michael Friendly and Georges Monette, both of York University in Toronto, to Bob Stine, of the University of Pennsylvania, and to Sandy Weisberg, of the University of Minnesota, who, I hope, will recognize the contributions that they made to this monograph and forgive me for its deficiencies. I’m also grateful to Barbara Entwisle, the QASS monograph series editor, and to Helen Salmon, my editor at Sage, for encouraging me to revise the monograph, the first edition of which appeared in 1991, and more generally for their support in undertaking this project. Barbara and several, at the time anonymous, reviewers provided many helpful comments and suggestions on a draft of the revised monograph:

Carl L. Palmer, Illinois State University
Levente Littvay, Central European University
Peter V. Marsden, Harvard University
Jeffrey Harring, University of Maryland, College Park
Helmut Norpoth, Stony Brook University
Erin Leahey, University of Arizona
William G. Jacoby, Michigan State University
Jacques Hagenaars, Tilburg University


Finally, I’d like to acknowledge support for this work from the Social Sciences and Humanities Research Council of Canada.


Chapter 1. Introduction

A regression model describes how the distribution of a response (or dependent) variable—or some characteristic of that distribution, typically its mean—changes with the values of one or more explanatory (or independent) variables. Regression diagnostics are methods for determining whether a regression model that has been fit to data adequately represents the structure of the data. For example, if the model assumes a linear (straight-line) relationship between the response and an explanatory variable, is the assumption of linearity warranted? Regression diagnostics not only reveal deficiencies in a regression model that has been fit to data but in many instances may suggest how the model can be improved.

This monograph considers two important classes of regression models:

- The normal linear regression model, in which the response variable is quantitative and assumed to have a normal (or Gaussian) distribution conditional on the values of the explanatory variables. The observations on the response are further assumed to be independent of one another, to be a linear function (i.e., a weighted sum) of the parameters of the model, and to have constant conditional variance. The normal linear model fit by the method of least squares is the focus of the monograph both because it is often used in practice and because it provides a basis for diagnostics for the other regression models considered here.

- Generalized linear models (GLMs), in which the conditional distribution of the response variable is a member of an exponential family, such as the families of Gaussian, binomial, and Poisson distributions, and in which the mean of the response is transformed to a linear function of the parameters of the model. The GLMs include the normal linear model, logistic regression for a dichotomous response, and Poisson regression for count data as important special cases. GLMs can also be extended to nonexponential distributions and to situations in which an explicit conditional response distribution isn't assumed.

As a preliminary example of what can go wrong in linear least-squares regression, consider the four scatterplots from Anscombe (1973) shown in Figure 1.1 and dubbed "Anscombe's quartet" by Edward Tufte in an influential treatise on statistical graphics (Tufte, 1983). One of the goals of statistical analysis is to provide an adequate descriptive summary of the data. All four of Anscombe's data sets were contrived cleverly to produce the same standard linear regression outputs—slope, intercept, correlation, residual standard deviation, coefficient standard errors, and statistical tests—but, importantly, not the same residuals.

Figure 1.1 Anscombe's quartet: Four data sets with identical standard regression outputs (e.g., the equation of the common least-squares line and correlation coefficient are shown below the graphs). Source: Adapted from Anscombe (1973). Reprinted by permission of the American Statistical Association, www.amstat.org.

[Each of the four panels, (a) through (d), plots y1–y4 against x1–x4 with the same upward-sloping least-squares line; Panel (c) contains a marked outlier and Panel (d) a marked influential point.]

In Figure 1.1(a), the least-squares line is a reasonable description of the tendency for y to increase with x. In Figure 1.1(b), the linear regression fails to capture the obviously curvilinear pattern of the data—the linear model is clearly wrong.

In Figure 1.1(c), one data point (an outlier) is out of line with the others and has an undue influence on the fitted least-squares line. A line through the other points fits them perfectly. Ideally in this case, we want to understand why the outlying case differs from the others—possibly it is special in some way (e.g., it is strongly affected by a variable other than x, or represents an error in recording the data). Of course, we are exercising our imaginations here, because Anscombe's data are simply made up, but the essential point is that we should address anomalous data substantively. In Figure 1.1(d), in contrast, we are unable to fit a line at all but for the rightmost data point; the least-squares line goes through this influential point and through the mean of the remaining values of y above the common x-value of 8. At the very least, we should be reluctant to trust the estimated regression coefficients because of their dependence on one unusual point.
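Anscombe's quartet ships with base R as the data frame anscombe, so the identical numeric summaries are easy to verify. The following sketch fits the four simple regressions, prints their coefficients, R²s, and residual standard deviations, and then plots the data with the common fitted line.

```r
data(anscombe)   # built-in: columns x1-x4 and y1-y4

fits <- lapply(1:4, function(i) {
  lm(as.formula(paste0("y", i, " ~ x", i)), data = anscombe)
})

# Nearly identical standard regression outputs for all four data sets ...
t(sapply(fits, function(m) {
  s <- summary(m)
  c(coef(m), r.squared = s$r.squared, sigma = s$sigma)
}))

# ... but very different residual patterns, visible only in the plots
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
  abline(fits[[i]])
}
par(op)
```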

Anscombe's simple illustrations serve to introduce several of the themes of this monograph, including nonlinearity, outlying data, influential data, and the effectiveness of graphical displays. The usual numeric regression outputs clearly do not tell the whole story. Diagnostic methods—many of them graphical—help to fill in the gaps.

The plan of the monograph is as follows: Chapter 2 reviews the normal linear regression model estimated by the method of least squares. Chapter 3 introduces simple graphical methods for examining regression data and discusses how to transform variables to deal with common data analysis problems. Chapter 4 describes methods for detecting unusual data in least-squares regression, distinguishing among high-leverage cases, outliers, and influential cases.

Chapter 5 takes up the problems of nonnormally distributed errors and nonconstant error variance. Chapter 6 discusses methods for detecting and correcting nonlinearity. Chapter 7 describes methods for diagnosing collinearity. Chapter 8 extends the diagnostics discussed in the preceding chapters to GLMs. Chapter 9 makes recommendations for incorporating diagnostics in the work flow of regression analysis and suggests complementary readings.

My aim is to explain clearly the various kinds of problems that regression diagnostics address, to provide effective methods for detecting these problems, and, where appropriate, to suggest possible remedies. All the problems discussed in this monograph vary in degree from trivial to catastrophic, but I view nonlinearity as intrinsically the most serious problem, because it implies that we're fitting the wrong equation to the data.

The first edition of this monograph was published in 1991. This new edition has been thoroughly revised and rewritten, partly reflecting more recent developments in regression diagnostics, partly extending the coverage to GLMs, and partly reflecting my evolving understanding of the subject. I feel that it is only right to mention that I've addressed partially overlapping material in Fox (2016) and (with Sanford Weisberg) in Fox and Weisberg (2019). Although this monograph was written independently of these other sources, I have adapted some of the examples that appear in them and I'm aware that I may express myself similarly when writing about similar subject matter.

I have prepared a website for the monograph, with data and R code for the examples in the text, at https://tinyurl.com/RegDiag. If you have difficulty finding the website, there is also a link to the supporting materials on the SAGE website at https://www.sagepub.com: After navigating to the SAGE website, search for "John Fox" to locate the SAGE webpage for the monograph.

Chapter 2. The Linear Regression Model: Review

I assume that the reader is generally familiar with the normal linear regression model (along with the other regression models discussed in this monograph), but it is briefly described in this chapter, partly as a review, partly as a basis for developing the diagnostic methods discussed in subsequent chapters, and partly to establish notation. For more information, see the complementary readings suggested in Chapter 9.

The Normal Linear Regression Model

The normal linear regression model is

y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ⋯ + β_k x_ik + ε_i,    ε_i ~ NID(0, σ²)

where y_i is the value of the response variable for the ith of n cases; the x_ijs are values of the k regressors for case i; the β_js are population regression coefficients; ε_i is the regression error for case i, with 0 expectation (mean) and constant variance σ²; and "~ NID" means "normally and independently distributed." As a general matter, I use Greek letters to denote parameters, like the βs, and unobservable random variables, like the εs, and Roman letters, like y and the xs, for observable values. The coefficient β_0 is the intercept or regression constant, and it is the expected value of y when all the xs are 0; most, but not all, linear regression models include an intercept.1

1 For example, if we expect that the average value of y is proportional to x, then we can fit the model y_i = β_1 x_i + ε_i, representing regression through the origin.

As the reader is likely aware, the k + 1 regressors in the linear model are functions of the explanatory variables,2 but there isn't in general a one-to-one correspondence between the two: A single explanatory variable may give rise to several regressors, as, for example, when a factor (categorical explanatory variable) with m levels (categories) is represented in the model by m − 1 zero/one dummy-variable regressors, or a numeric explanatory variable is represented by a polynomial or regression spline term. A regressor may also be a transformation, such as the logarithm, of a numeric explanatory variable.

2 This includes the implicit constant regressor associated with the intercept β_0; the constant regressor is only trivially a function of the explanatory variables.

For example, if y is income in dollars and x is age in years, then the model

y_i = β_0 + β_1 x_i + β_2 x_i² + ε_i

with regressors x and x² generated from the explanatory variable age, represents a quadratic regression of income on age. Similarly, if y is income, x is education in years, and g is the factor gender, with levels male, female, and nonbinary, coding the dummy regressors d_1 = 1 for females and 0 otherwise, and d_2 = 1 for males and 0 otherwise, the model

y_i = β_0 + β_1 x_i + β_2 d_i1 + β_3 d_i2 + ε_i

assumes the same education slope for all three genders but potentially different intercepts.

Furthermore, interaction regressors are functions of two or more explanatory variables. Extending the previous example and letting x d_1 and x d_2 be interaction regressors, the model

y_i = β_0 + β_1 x_i + β_2 d_i1 + β_3 d_i2 + β_4 x_i d_i1 + β_5 x_i d_i2 + ε_i

incorporates the interaction between education and gender by permitting different education slopes and intercepts for the three genders.
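To see concretely how explanatory variables give rise to regressors, the following R sketch applies model.matrix() to a small made-up data frame (the variable names and values are purely illustrative). It displays the polynomial, dummy-variable, and interaction regressors generated from numeric and factor explanatory variables; note that R's default treatment coding takes the first factor level (alphabetically, "female" here) as the baseline, which may differ from the coding used in the text.

```r
# Illustrative (made-up) data: income in dollars, education in years,
# and gender as a factor with three levels
d <- data.frame(
  income    = c(42000, 55000, 38000, 61000, 47000, 52000),
  education = c(12, 16, 11, 18, 14, 15),
  gender    = factor(c("male", "female", "nonbinary", "female", "male", "nonbinary"))
)

# A quadratic in education: one explanatory variable, two non-constant regressors
model.matrix(~ poly(education, 2, raw = TRUE), data = d)

# A factor with m = 3 levels is represented by m - 1 = 2 dummy regressors,
# plus the implicit constant regressor of 1s for the intercept
model.matrix(~ education + gender, data = d)

# Interaction regressors are products of the education and dummy regressors,
# permitting different education slopes for the genders
model.matrix(~ education * gender, data = d)
```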

The normal linear model incorporates several assumptions about the structure of the data and the method by which data are collected:

- Linearity: Because the errors have means of 0, the conditional expectation μ_i of the response is a linear function of the parameters and the regressors, μ_i ≡ E(y_i | x_i1, …, x_ik) = β_0 + β_1 x_i1 + ⋯ + β_k x_ik, where E() is the expectation operator.

- Constant error variance: The conditional variance of the response is the same for all cases, V(y_i | x_i1, …, x_ik) = σ², where V() is the variance operator.

- Normality: The errors are normally distributed, ε_i ~ N(0, σ²), and hence the conditional distribution of the response is also normal, y_i | x_i1, …, x_ik ~ N(μ_i, σ²).

- Independence: The cases are independently sampled, so that ε_i is independent of ε_j for i ≠ j, and hence y_i is independent of y_j.

- Fixed xs or xs independent of ε: The explanatory variables, and thus the regressors, are either fixed with respect to replication of the study (as would be the case in a designed experiment), or the errors are independent of the values of the xs.

- The xs are not perfectly collinear: If the xs are fixed, then no x can be a perfect linear function of others; if the xs are sampled along with y, then the xs cannot be perfectly collinear in the population. If the xs are perfectly collinear, then it is impossible to separate their effects on y.

These are strong assumptions and a goal of regression diagnostics is to determine, to the extent possible, whether they are tenable.

In nonexperimental research, the xs are random, not fixed. Although I don't focus on causation in this monograph, which deals instead with the descriptive accuracy of regression models—that is, the fidelity with which the models represent patterns in the data—the assumption that the xs are independent of the errors is key to the causal interpretation of the linear regression model. Without embedding a regression equation in a more general causal framework (see, e.g., Morgan & Winship, 2015; Pearl, 2009), it's not possible in general to test this assumption against data: Although we can detect certain sorts of departures from the assumption of independent xs and errors (as described in the discussion of nonconstant error variance in Chapter 5 and of nonlinearity in Chapter 6), the least-squares fit, reviewed in the next section, ensures that the regression residuals are linearly uncorrelated with the xs. In effect, the assumption that the xs and the errors are uncorrelated (which is implied by the stronger assumption that the xs and errors are independent) is used to get the least-squares estimates.

Not all the assumptions of the normal linear model are required for all purposes. For example, if the distribution of the errors is nonnormal, then the least-squares estimators of the regression coefficients (reviewed in the next section) may not be efficient, but they are still unbiased, and standard methods of statistical inference for constructing confidence intervals and performing hypothesis tests (reviewed later in this chapter) still produce valid results, at least in large samples. If the xs are fixed or independent of the errors, then only the assumption of linearity is required for the least-squares coefficients to be unbiased estimators of the βs, but if the errors have different variances or are dependent, then the least-squares estimators may be inefficient and their standard errors may be substantially biased. Finally, if the errors have different variances that are known up to a constant of proportionality, say V(ε_i) = σ²/w_i, but the other assumptions of the model hold, then weighted least squares (WLS) regression (discussed in the next section) provides efficient estimates of the βs and correct coefficient standard errors.

Least-Squares Estimation

The normal linear model is estimated by ordinary least squares (OLS) regression, which provides maximum likelihood estimates (MLEs) of the regression coefficients. The fitted model is

y_i = b_0 + b_1 x_i1 + b_2 x_i2 + ⋯ + b_k x_ik + e_i = ŷ_i + e_i

where the b_js are the estimated regression coefficients, the ŷ_is are fitted values, and the e_is are residuals. The method of least squares picks the values of the bs to minimize the sum of squared residuals, Σe_i²; these bs are the solution of the normal (or estimating) equations

b_0 n    + b_1 Σx_1    + b_2 Σx_2    + ⋯ + b_k Σx_k    = Σy
b_0 Σx_1 + b_1 Σx_1²   + b_2 Σx_1x_2 + ⋯ + b_k Σx_1x_k = Σx_1y
  ⋮
b_0 Σx_k + b_1 Σx_kx_1 + b_2 Σx_kx_2 + ⋯ + b_k Σx_k²   = Σx_ky        (2.1)

Because the sums are obviously over i = 1, …, n, I have suppressed the subscript for cases in the interest of brevity (e.g., Σx_1 represents Σ_{i=1}^{n} x_i1). The normal equations are a system of k + 1 linear equations in the k + 1 regression coefficients and have a unique solution for the b_js as long as none of the x_js are constant and none are a perfect linear function of others. The normal equations imply that the least-squares residuals sum to 0 and thus have a mean of 0. Furthermore, the residuals are uncorrelated with the fitted values and with the xs because Σe_iŷ_i = 0 and Σe_ix_ij = 0 for j = 1, …, k.

The MLE of the error variance, σ̂² = Σe_i²/n, is biased, and we typically instead use the unbiased estimate s² = Σe_i²/(n − k − 1), dividing by the residual degrees of freedom. The squared multiple correlation for the fitted model, given by

R² = 1 − Σe_i² / Σ(y_i − ȳ)²

is interpreted as the proportion of variation in y captured by its linear regression on the xs. Here, ȳ is the sample mean of y.

If the variances of the errors are unequal, but are known up to a constant of proportionality, V(ε_i) = σ²/w_i, then we can substitute weighted least squares (WLS) for OLS regression. The WLS estimator finds the values of the bs that minimize the weighted residual sum of squares Σw_ie_i². For normally distributed errors with unequal variances, WLS regression provides the maximum likelihood estimators of the βs.
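A minimal sketch of these results in R, using simulated data as a stand-in for a real data set (the variable names, sample size, and coefficient values below are arbitrary): lm() computes the OLS fit, and the residuals, error-variance estimate, and R² behave as just described.

```r
set.seed(123)
n  <- 100
x1 <- rnorm(n, 50, 10)        # made-up regressors and response
x2 <- rnorm(n, 10, 3)
y  <- 5 + 0.4 * x1 - 1.2 * x2 + rnorm(n, 0, 4)

fit  <- lm(y ~ x1 + x2)       # ordinary least squares
e    <- residuals(fit)
yhat <- fitted(fit)

# Consequences of the normal equations:
sum(e)                        # residuals sum to (essentially) zero
cor(e, yhat)                  # and are uncorrelated with the fitted values
cor(e, cbind(x1, x2))         # ... and with the regressors

# Unbiased error-variance estimate and squared multiple correlation
k  <- 2
s2 <- sum(e^2) / (n - k - 1)
c(s2 = s2, from.summary = summary(fit)$sigma^2)
R2 <- 1 - sum(e^2) / sum((y - mean(y))^2)
c(R2 = R2, from.summary = summary(fit)$r.squared)
```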

Statistical Inference for Regression Coefficients

Estimated coefficient sampling variances for the b_js (i.e., excluding the regression intercept b_0) are given by

V̂(b_j) = s² / [(n − 1) s_j² (1 − R_j²)]        (2.2)

where s_j² is the variance of x_j, and R_j² is the squared multiple correlation from the regression of x_j on the other xs. A t-test statistic for the hypothesis H_0: β_j = β_j^(0) (usually, H_0: β_j = 0) is given by t_0 = (b_j − β_j^(0)) / SE(b_j), where SE(b_j) = √V̂(b_j) is the standard error of b_j. Under H_0, the test statistic t_0 is distributed as a t-variable with n − k − 1 degrees of freedom.

To test the hypothesis that a set of regression coefficients (excluding the constant β_0) is 0, for example, H_0: β_1 = β_2 = ⋯ = β_p = 0 (where p ≤ k),3 we calculate the incremental F-statistic

F_0 = [(R² − R_0²)/p] / [(1 − R²)/(n − k − 1)]        (2.3)

3 For notational convenience, I assume that the hypothesis pertains to the first p βs, but it can more generally include any p coefficients.

Here R² is, as before, the squared multiple correlation from the full model, and R_0² is the squared multiple correlation for the regression of y on the remaining xs: x_{p+1}, …, x_k. If p = k, then R_0² = 0. These t- and F-tests are exact under the assumptions of the model, including the assumption of normally distributed errors.

The boundaries of a 100(1 − α)% confidence interval for β_j are given by

b_j ± t_{α/2} SE(b_j)        (2.4)

where t_{α/2} is the 1 − α/2 quantile of the t-distribution with n − k − 1 degrees of freedom. For example, for a 95% confidence interval, α = .05 and 1 − α/2 = .975. Because the width of the confidence interval is proportional to the estimated coefficient standard error, SE(b_j) is a natural measure of the (im)precision of the estimate b_j.

Likewise, an ellipsoidal joint confidence region for several coefficients can be constructed from the coefficient variances and covariances along with a critical value from the F-distribution (see the next section for details). An illustration for two parameters, β_1 and β_2, showing ellipses for two levels of confidence, appears in Figure 2.1. Just as the confidence interval in Equation 2.4 gives all values of β_j acceptable at level α, each ellipse in Figure 2.1 encloses all jointly acceptable values of β_1 and β_2 at the corresponding level α.

Figure 2.1: Joint confidence regions for two regression coefficients, β_1 and β_2. The confidence ellipses are centered at the estimates, b_1 and b_2. The inner ellipse is drawn at the 85% level of confidence, the outer ellipse at the 95% level. The projections of the 85% joint confidence ellipse onto the β_1 and β_2 axes generate individual confidence intervals for these parameters, each at the 95% level.

Each confidence ellipse is centered on the estimates b_1 and b_2. The projections of the ellipse onto the β_1 and β_2 axes give individual confidence intervals for these parameters, but at a somewhat higher level of confidence than the joint region; for example, the joint confidence ellipse at the 85% level generates confidence intervals at the 95% level, as illustrated in Figure 2.1. Just as the length of a confidence interval expresses the precision of estimation of a single coefficient, the size of a joint confidence region for several coefficients (i.e., area for two βs, volume for three, and hypervolume for four or more) expresses their simultaneous precision of estimation.
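Continuing the simulated example above (so the variable names are placeholders), the corresponding inferential quantities are easy to obtain; the joint confidence ellipse of Figure 2.1 can be drawn with confidenceEllipse() from the car package, assuming that package is installed.

```r
# t-statistics, standard errors, and 95% confidence intervals (Equation 2.4)
summary(fit)$coefficients
confint(fit, level = 0.95)

# Incremental F-test (Equation 2.3) for H0: beta_1 = beta_2 = 0,
# comparing the full model with the intercept-only model
fit0 <- lm(y ~ 1)
anova(fit0, fit)

# Joint confidence region for the coefficients of x1 and x2 (cf. Figure 2.1)
if (requireNamespace("car", quietly = TRUE)) {
  car::confidenceEllipse(fit, which.coef = 2:3, levels = c(0.85, 0.95))
}
```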

*The Linear Regression Model in Matrix Form

This and other starred sections assume a basic knowledge of matrix algebra or calculus. Starred sections can be skipped without loss of continuity, but they provide a deeper understanding of the material in the monograph.

The linear model in matrix form is

y_(n×1) = X_(n×(k+1)) β_((k+1)×1) + ε_(n×1),    ε ~ N_n(0, σ² I_n)        (2.5)

where y = (y_1, y_2, …, y_n)^T is the response vector. In this monograph, vectors are column vectors unless they are explicitly transposed, with the transpose indicated by the superscript "T." I'll occasionally, as in Equation 2.5, show the order of matrices as subscripts, so, for example, y_(n×1) has n rows and 1 column. X is the model matrix (sometimes called the design matrix) of regressors, with an initial column of 1s for the regression constant. ε = (ε_1, ε_2, …, ε_n)^T is the error vector. N_n is the multivariate normal distribution with n elements. 0 is an n-element vector of 0s, and I_n is the order-n identity matrix.

Assuming that the model matrix X is of full-column rank k + 1 (i.e., that none of its columns are collinear), the least-squares estimates of the regression coefficients are

b = (X^T X)^(−1) X^T y        (2.6)

the fitted values are ŷ = Xb, and the residuals are e = y − ŷ. The estimated error variance is then s² = e^T e / (n − k − 1), and the estimated covariance matrix of the regression coefficients is V̂(b) = s² (X^T X)^(−1). The square roots of the diagonal elements of V̂(b) are the standard errors of the least-squares coefficients. A 100(1 − α)% ellipsoidal joint confidence region for the regression coefficients is given by

(β − b)^T X^T X (β − b) ≤ (k + 1) s² F_{α, k+1, n−k−1}

where F_{α, k+1, n−k−1} is the 1 − α quantile of the F-distribution with k + 1 numerator and n − k − 1 denominator degrees of freedom. For a subset β_1 of p regression coefficients, we have the confidence region

(β_1 − b_1)^T V̂_11^(−1) (β_1 − b_1) ≤ p F_{α, p, n−k−1}        (2.7)

where V̂_11 is derived from V̂(b); it is the submatrix of rows and columns of V̂(b) corresponding to the entries of b_1.

F-tests are easily obtained from the expressions for these confidence regions. For example, to test H_0: β_1 = 0, find

F_0 = b_1^T V̂_11^(−1) b_1 / p

which is distributed as F_{p, n−k−1} under H_0. For p = k, this F-statistic is equivalent to the incremental F-statistic given in Equation 2.3.

If the errors have nonconstant variance V(ε_i) = σ²/w_i for known weights w_i, then the WLS estimator is

b_WLS = (X^T W X)^(−1) X^T W y

and the estimated coefficient covariance matrix is

V̂(b_WLS) = s²_WLS (X^T W X)^(−1)

where W = diag{w_1, …, w_n} is the diagonal weight matrix. The procedures for statistical inference for OLS regression described above are straightforwardly adaptable to WLS regression.
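The matrix formulas translate directly into R. This sketch (reusing the simulated y, x1, and x2 from the earlier examples) computes Equation 2.6 and the coefficient covariance matrix, checking the results against lm(); solve() is used rather than explicit matrix inversion.

```r
X <- cbind("(Intercept)" = 1, x1 = x1, x2 = x2)   # model matrix with constant regressor

# b = (X'X)^{-1} X'y, computed by solving the normal equations
b <- solve(crossprod(X), crossprod(X, y))

e  <- y - X %*% b                       # residuals
s2 <- sum(e^2) / (nrow(X) - ncol(X))    # s^2 with n - (k + 1) degrees of freedom
Vb <- s2 * solve(crossprod(X))          # estimated covariance matrix of b
se <- sqrt(diag(Vb))                    # coefficient standard errors

cbind(estimate = b, std.error = se)
summary(lm(y ~ x1 + x2))$coefficients[, 1:2]   # agrees with the hand computation
```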

Chapter 3. Examining and Transforming Regression Data

This chapter has two purposes: to introduce graphical displays and methods for transforming data that will be used later in the book to address a variety of problems in regression analysis, and to show how these displays and transformations can be used prior to fitting a regression model to data, to avoid problems at a later stage. The focus here is on examining and transforming numeric data.

To provide a basis for discussing the methods described in this chapter, I introduce a data set drawn from the CIA World Factbook (Central Intelligence Agency, 2015).1 The subset of the data used here is for 134 nations with complete data on the following four variables:2

- Gross domestic product (GDP) per capita, in thousands of U.S. dollars
- Infant mortality rate per 1,000 live births
- Gini coefficient for the distribution of family income. The Gini coefficient is a standard measure of income inequality that ranges from 0 to 100, with 0 representing perfect equality and 100 maximum inequality
- Health expenditures as a percentage of GDP

1 The data were downloaded in .csv (comma-separated values) files provided at https://github.com/thewiremonkey/factbook.csv. As mentioned in Chapter 1, data used here and elsewhere in the monograph are available on the website for the monograph.

2 The CIA World Factbook data set contains information about 261 countries, supranational entities (e.g., the European Union), territories (e.g., American Samoa), and some other places (e.g., the Gaza Strip). The data set used here eliminates nonnations, but it also eliminates many nations with missing data on one or more of the four variables. Using only complete cases raises issues of selection bias that I won't address but that should be addressed in a serious analysis of the data. See, for example, Allison (2002).

54

Univariate Displays

Figure 3.1 shows several graphs depicting the distribution of national infant mortality rates in the CIA World Factbook data set. I expect that the histogram in Panel (a) and the boxplot in Panel (b) are familiar:

Figure 3.1 Several univariate graphs for the distribution of infant mortality rates in the CIA World Factbook data set: (a) histogram, (b) boxplot, (c) adaptive kernel nonparametric density estimate, and (d) normal quantile-comparison plot.

[All four panels are on the scale of infant mortality per 1,000; the boxplot in Panel (b) flags Mali as an outlier, and Panel (c) includes a rugplot of the data values along its horizontal axis.]

A histogram dissects the range of a numeric variable like infant mortality into (typically) equal-width class intervals or bins—here 0 to 10, 10 to 20, and so on—then counts the number of cases (the frequency) in each bin and graphs the count as a bar.

The central box in the boxplot is drawn between the first (Q_1) and the third (Q_3) quartile of the data, and thus it marks off the middle half of the data. The line in the box represents the median (M). The "whiskers" at the ends of the box are drawn from the quartiles to the most extreme nonoutlying value in each direction; outliers are shown individually and are defined as points that are below Q_1 − 1.5 × IQR or above Q_3 + 1.5 × IQR, where IQR = Q_3 − Q_1 is the interquartile range. In this case, there is just one outlier, Mali, with an infant mortality rate in excess of 100.
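The displays in Figure 3.1, including the two less familiar panels discussed next, can be produced with a few lines of base R. Because the Factbook data are not reproduced here, the sketch below uses a skewed random sample as a stand-in for the infant mortality rates; note that base R's density() uses a fixed bandwidth rather than the adaptive kernel estimator of Panel (c).

```r
# 'infant' stands in for the Factbook infant mortality rates (per 1,000 live births)
set.seed(1)
infant <- rgamma(134, shape = 1.5, scale = 18)

op <- par(mfrow = c(2, 2))
hist(infant, xlab = "Infant mortality rate per 1000", main = "(a) histogram")
boxplot(infant, ylab = "Infant mortality rate per 1000", main = "(b) boxplot")
plot(density(infant), main = "(c) kernel density estimate")
rug(infant)                                   # rugplot along the bottom of panel (c)
qqnorm(infant, main = "(d) normal QQ plot")   # car::qqPlot() would add a confidence envelope
qqline(infant)
par(op)
```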

The graphs in Figure 3.1(c) and (d) are possibly less familiar:

Panel (c) shows a nonparametric kernel density estimate, which is a smoothed version of the histogram. To construct the density estimate, a window, rather like a bin, slides continuously across the values of the variable; at any value x, the density estimate is

p̂(x) = (1 / nh) Σ_{i=1}^{n} K[(x − x_i) / h]

where x_1, x_2, …, x_n are the data values, h is the half-width of the window, and K() is a symmetric density function, such as the standard normal distribution, with a mode at 0, called the kernel function, which serves to smooth the data. The area under the nonparametric density estimate is scaled to 1. The smoothness of the density estimate is controlled by the value of h, which is also called the bandwidth of the kernel density estimate: Larger values of h not only produce smoother density curves but also suppress detail. There are automatic methods for selecting the bandwidth, which can also be picked by visual trial and error. The version of kernel density estimation in Figure 3.1(c), an adaptive kernel density estimator, is a little more sophisticated in that it uses a preliminary density estimate to adjust the bandwidth to the density of the data, using a narrower window where data are plentiful, to resolve more detail, and a wider window where data are sparse, to reduce random variability. The lines at the bottom of Panel (c), called a rugplot or one-dimensional scatterplot, show the location of the data values.
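The kernel density formula can be transcribed directly into R. This sketch computes a fixed-bandwidth estimate with a standard normal kernel on a grid of values and compares it with the built-in density() function; the adaptive estimator of Figure 3.1(c) additionally varies h from point to point. It reuses the stand-in infant values from the previous sketch.

```r
# Fixed-bandwidth kernel density estimate with a standard-normal kernel K
kde <- function(x0, x, h) {
  sapply(x0, function(v) mean(dnorm((v - x) / h)) / h)
}

grid <- seq(min(infant) - 10, max(infant) + 10, length.out = 200)
h    <- bw.nrd0(infant)   # one standard automatic bandwidth rule

plot(grid, kde(grid, infant, h), type = "l",
     xlab = "Infant mortality rate per 1000", ylab = "Density")
lines(density(infant, bw = h), lty = 2)   # essentially coincides with the hand-rolled estimate
```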

Panel (d) is a quantile-comparison plot (also called a quantile–quantile plot or QQ plot), comparing the distribution of infant mortality to the standard normal distribution. More generally, a quantile-comparison plot compares data with a theoretical reference distribution, and in regression analysis, these plots are more useful for derived quantities, such as residuals, than for raw data (as discussed in Chapter 4). To draw a QQ plot, first arrange the data in ascending order, x_(1) ≤ x_(2) ≤ ⋯ ≤ x_(n), where, by convention, the ith order statistic x_(i) has proportion P_i = (i − 1/2)/n of the data below it, so that, for example, P_1 = 1/(2n) and P_n = (n − 1/2)/n. We then plot the order statistics x_(i), on the vertical axis, against the corresponding quantiles q_i of a random variable z drawn from the reference distribution, on the horizontal axis, so that Pr(z ≤ q_i) = P_i. If the data are sampled from the reference distribution, then x_(i) ≈ q_i, where the approximation reflects sampling variation. If, alternatively, the distribution of the data differs from the reference distribution only in its center μ and scale (variation) σ, then the relationship between the order statistics and the corresponding theoretical quantiles should be approximately linear, x_(i) ≈ μ + σ q_i. Thus, if we fit a line to the QQ plot, then its intercept should estimate μ and its slope should estimate σ. Moreover, systematic nonlinearity in the QQ plot is indicative of departures from the reference distribution.

Three examples of QQ plots with the standard normal distribution N(0, 1) as the reference distribution appear in Figure 3.2. In each panel, n = 50 cases are independently sampled from a theoretical distribution: (a) the normal distribution N(100, 15); (b) the heavy-tailed t-distribution with 2 degrees of freedom, t_2; and (c) the highly positively skewed chi-square distribution with 2 degrees of freedom, χ²_2.

Figure 3.2 Normal quantile-comparison plots for samples of size n = 50 drawn from three distributions: (a) N(100, 15), (b) t_2, and (c) χ²_2.

The straight solid line on each graph is drawn between the first and the third quartiles of the data and the reference distribution. The curved broken lines represent an approximate pointwise 95% confidence envelope around the values on the fitted line, ŷ_(i); the confidence envelope is computed as ŷ_(i) ± 2 × SE(x_(i)), where the standard error of the ith order statistic is

SE(x_(i)) = [σ̂ / p(q_i)] √[P_i (1 − P_i) / n]

Here, p(q_i) is the reference density at the quantile q_i. Because this is a pointwise confidence envelope, the confidence statement applies individually to each order statistic, and the probability is considerably higher than 95% that at least one point will stray outside of the confidence envelope even when the data are sampled from a distribution with the same shape as the reference distribution. The heavy tails of the t_2 data are reflected at both ends of the normal QQ plot, where the x_(i) values are more extreme than the fitted line—below the line at the left and above the line at the right. Similarly, the positive skew of the χ²_2 data is apparent in both tails of the distribution, with the observed x_(i) values less extreme than (i.e., above) the line at the left of the graph and more extreme than (also above) the line at the right.
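The construction of a QQ plot is easy to carry out directly. This sketch (again using the stand-in infant values) plots the order statistics against the corresponding standard normal quantiles and adds the comparison line through the quartiles; qqPlot() in the car package adds the pointwise confidence envelope shown in Figure 3.2.

```r
x <- sort(infant)               # order statistics x_(1) <= ... <= x_(n)
n <- length(x)
P <- (seq_len(n) - 0.5) / n     # cumulative proportions (one common convention)
q <- qnorm(P)                   # corresponding quantiles of N(0, 1)

plot(q, x, xlab = "Normal quantiles", ylab = "Infant mortality rate per 1000")

# Comparison line through the first and third quartiles of the data and the reference
Qx <- quantile(x, c(0.25, 0.75))
Qz <- qnorm(c(0.25, 0.75))
slope     <- diff(Qx) / diff(Qz)
intercept <- Qx[1] - slope * Qz[1]
abline(intercept, slope)

# car::qqPlot(infant) would draw the same plot with a 95% confidence envelope
```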

Returning to Figure 3.1 (on p. 14), it’s apparent from all four graphs that the distribution of national infant mortality rates is positively skewed. The boxplot in Panel (b) identifies Mali as an outlier at the high end of the distribution, with infant mortality exceeding 100, but it’s common to observe large values in positively skewed data. The adaptive kernel density estimate in Panel (c) suggests that the distribution of infant mortality is bimodal, a property of the data that’s hidden in the boxplot, hard to discern in the normal QQ plot, and not as apparent in the histogram.

Transformations for Symmetry

The normal linear regression model (discussed in Chapter 2) makes no distributional assumptions about the regressors (the xs), other than that they are independent of the errors, and the assumption of normality applies to the conditional distribution of y (i.e., to the errors, ε), not to its unconditional distribution. Nevertheless, a response variable that is close to normally distributed usually makes the assumption of normal errors more tenable, and there are also advantages to having xs that are close to normally distributed: For example, if all the variables in a linear regression equation, the xs and y, are multivariately normally distributed, then all regressions among them are linear.

The family of power transformations, x → x^λ, is often effective in making the distribution of a numeric variable more nearly normal, or at least more symmetric. I assume here that x takes on only positive values, and I'll consider below what to do if there are 0 or negative values. For example, if the power λ = 2, then x′ = x²; if λ = 1/2, then x′ = √x; and if λ = −1, then x′ = 1/x. For reasons that I'll explain presently, we use the log transformation for λ = 0: x′ = log(x), as if log were the 0th power.³

³ The literal 0 power isn't useful as a data transformation because it changes all values of x to the constant 1: x⁰ = 1.

In contrast, the log transformation is the most generally useful member of the power-transformation family. Here's a quick review of logs (logarithms):

Logs are exponents, in the sense that b^y = x implies that log_b(x) = y, where b is the base of the log function. In this book, logs without an explicit base are natural logs, for base e ≈ 2.718. Other frequently used bases are base 10, which produces common logs, and base 2. Logs are defined only for positive values of x. As a consequence of this definition, logs convert multiplication into addition and exponentiation into multiplication: log(ab) = log(a) + log(b) and log(a^b) = b log(a). For example, log₂(8) = 3 because 2³ = 8, and log₁₀(0.01) = −2 because 10⁻² = 0.01.

To understand how the power transformations affect data, it helps to define a modified family of power transformations, introduced in a seminal paper on transformations by Box and Cox (1964):

t_BC(x, λ) = (x^λ − 1)/λ   for λ ≠ 0
t_BC(x, λ) = log_e(x)      for λ = 0

A graph showing several members of the Box–Cox family of power transformations appears in Figure 3.3. The Box–Cox powers have the following properties:

- For all values of λ, t_BC(1, λ) = 0.
- For all values of λ, the slope of t_BC(x, λ) at x = 1 is 1.⁴
- Because of the division by λ, the order of the transformed data x′ is the same as that of x, even when λ is negative. For the ordinary power transformations, the order of the data values is reversed when λ is negative.
- As is apparent in Figure 3.3, the log transformation t_BC(x, 0) = log(x) fits neatly where it should into the Box–Cox family.⁵
- The transformation t_BC(x, 1) = x − 1 is a straight line, and so it doesn't change the shape of the distribution of x.

The Box–Cox powers clearly reveal the unity and properties of the family of power transformations, but they have essentially the same effect on the shape of the distribution of x as the corresponding ordinary powers, with the caveat that ordinary negative powers reverse the order of the data values. In practice, then, we usually use ordinary power transformations.

⁴ *For those familiar with calculus: the derivative is dt_BC(x, λ)/dx = x^(λ−1), which equals 1 at x = 1 for all λ.

⁵ *For those familiar with the idea of a limit: lim(λ→0) (x^λ − 1)/λ = log_e(x).
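To fix ideas, here is a minimal Python sketch of the Box–Cox family as defined above; the function name and example values are mine:

    import numpy as np

    def t_bc(x, lam):
        """Box-Cox transformation: (x**lam - 1)/lam for lam != 0 and log(x) for lam == 0.
        Defined for strictly positive x; preserves the order of the data for every lam."""
        x = np.asarray(x, dtype=float)
        if np.any(x <= 0):
            raise ValueError("the Box-Cox transformation requires positive values")
        return np.log(x) if lam == 0 else (x**lam - 1.0) / lam

    x = np.array([0.5, 1.0, 2.0, 4.0])
    for lam in (-1, -0.5, 0, 0.5, 1, 2):
        # Every row is increasing in x, and t_bc(1, lam) = 0 for every lam,
        # matching the curves in Figure 3.3.
        print(lam, np.round(t_bc(x, lam), 3))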

Figure 3.3 The Box–Cox family of power transformations for several values of the power λ; when λ = 0, the transformation is log(x).

[Figure 3.3 plots t_BC(x, λ) against x for several values of λ; all of the curves pass through the point (1, 0), and curves for larger λ rise more steeply to the right.]

Tukey (1977) characterized the family of power transformations as the ladder of powers and roots: Descending the ladder of powers and roots from λ = 1 (no transformation) toward λ = 1/2 (square root), λ = 0 (log), and λ = −1 (inverse) increasingly spreads out the small values of x relative to the large values. Conversely, ascending the ladder of powers toward λ = 2 (square) and λ = 3 (cube) spreads out the large values relative to the small values. This observation explains why power transformations can make skewed data more symmetric: For example, the long right tail of a positively skewed variable is drawn in by a transformation down the ladder of powers, such as the log transformation, and the short left tail is stretched. Negatively skewed data are less common but can be made more symmetric by a transformation up the ladder of powers.

Table 3.1 illustrates the effect of various power transformations. In this table, the inverse transformation is taken as −1/x to preserve the order of the original scores, and the log transformation is to the base 10. As mentioned, the base of the log transformation doesn't affect the shape of the transformed data.

Table 3.1

ᵃ The interlinear numbers give the differences between adjacent scores. Thus, for example, in Panel (a), the difference of 1 between 1 and 2 in the x column corresponds to the difference of 3 between 1 and 4 in the x² column.
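In the same spirit as the table, a small Python sketch (my own example scores, not the entries of Table 3.1) shows how the gaps between adjacent scores change under several transformations:

    import numpy as np

    scores = np.arange(1.0, 6.0)              # 1, 2, 3, 4, 5
    transforms = {
        "-1/x": -1.0 / scores,                # inverse, negated to preserve order
        "log10(x)": np.log10(scores),
        "sqrt(x)": np.sqrt(scores),
        "x": scores,
        "x^2": scores**2,
        "x^3": scores**3,
    }
    for name, t in transforms.items():
        # np.diff gives the gaps between adjacent transformed scores: going down
        # the ladder (log, -1/x) the gaps shrink from left to right, spreading out
        # the small values; going up the ladder (x^2, x^3) the gaps grow.
        print(f"{name:9s}", np.round(np.diff(t), 3))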

Recall Figure 3.1 (on p. 14), displaying the positively skewed distribution of national infant mortality rates in the CIA World Factbook data set. Figure 3.4 shows side-by-side boxplots for several power transformations of infant mortality down the ladder of powers and roots. The boxplot of the untransformed data is at the right of the figure, for λ = 1. None of the transformations perfectly corrects the skew of the original distribution, but the log and cube-root (λ = 1/3) transformations work reasonably well. The infant mortality rates range from a minimum of about 2 to a maximum of about 102, a ratio of the largest to the smallest values of about 50. Whenever the ratio of the largest to the smallest values of a strictly positive variable exceeds 10 (an order of magnitude), it's worth considering the log transformation, which often works remarkably well.

Figure 3.4 Boxplots for various transformations of infant mortality down the ladder of powers and roots.

[Figure 3.4 shows parallel boxplots of t_BC(infant mortality, λ) for the powers λ = −1, −0.5, log, 0.33, 0.5, and 1.]

An adaptive kernel density estimate for the log-transformed infant mortality rates is shown in Figure 3.5. The original infant mortality scale appears at the top of the graph. Because the lower part of the distribution of infant mortality is now more clearly resolved, we can see that there are apparently three concentrations of nations, corresponding to three modes of the distribution, near infant mortality rates of 5, 20, and 50, and because the right tail of the distribution is drawn in, there are no outliers.

Figure 3.5 Density estimate for log-transformed infant mortality rates.

[Figure 3.5 shows the density estimate plotted against log(infant mortality), with a rugplot along the horizontal axis and the original infant-mortality scale (1 to 100 per 1000) along the top; the estimate has three peaks.]


*Selecting Transformations Analytically

When Box and Cox (1964) introduced their family of modified power transformations, they did so in the context of the regression model

t_BC(y_i, λ) = β_0 + β_1 x_i1 + ⋯ + β_k x_ik + ε_i   (3.1)

where the transformation of the response y is selected to make the errors ε as nearly normally distributed as possible, and where the transformation parameter λ is formally estimated along with the regression coefficients, β, and the variance σ² of the errors. Box and Cox proposed estimating this model by a method similar to maximum likelihood (ML), and as a shorthand I'll refer to their approach as ML estimation.⁶

⁶ Box and Cox's method isn't strictly speaking ML estimation because the conditional distribution of y, and hence the likelihood, aren't defined until the transformation of y is selected. Moreover, if y is normally distributed for a particular choice of λ, then it isn't normal for a different choice of λ. Their method shares many of the properties of ML estimation, however. I'll discuss the Box–Cox regression model further in Chapter 5. At present, I introduce the closely related idea of transforming a variable (or set of variables) to make its unconditional distribution as close to normal (or multivariate normal) as possible.

Suppose that t_BC(x, λ) is normally distributed with mean μ and variance σ² for a suitable choice of λ. To estimate λ, we find the values of λ, μ, and σ² that maximize the normal log-likelihood

log L(λ, μ, σ²) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_i [t_BC(x_i, λ) − μ]² + (λ − 1) Σ_i log(x_i)

where the last term derives from the Jacobian of the transformation.

The standard error of the ML estimate λ̂ can be obtained in the usual manner from the second derivative of the log-likelihood with respect to λ, allowing us to construct Wald-based confidence intervals and tests for λ. Similarly, comparing the values of the log-likelihood at λ = λ̂ and at λ = 1 (i.e., no transformation) provides a likelihood-ratio test of the null hypothesis H₀: λ = 1.

Applying this procedure to the transformation of infant mortality in the CIA World Factbook data produces the ML estimate λ̂ along with its standard error and a 95% asymptotic confidence interval for λ. That interval includes 0, and consequently the log transformation of infant mortality, which I previously picked by trial and error, and which corresponds to λ = 0, is a tenable value of λ.

The log-likelihood at the estimate λ = λ̂ is substantially larger than the log-likelihood at λ = 1. The likelihood-ratio chi-square statistic for testing the null hypothesis H₀: λ = 1 has 1 degree of freedom, and its p-value is close to 0. There is therefore strong evidence in the data for transforming infant mortality.
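A hedged sketch of this ML-type procedure in Python, using SciPy's Box–Cox utilities; a synthetic positively skewed sample stands in for the infant mortality rates, so the printed numbers are purely illustrative:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    x = rng.lognormal(mean=3.0, sigma=0.9, size=200)   # stand-in for a positive, skewed variable

    # ML-type estimate of lambda: maximizes the Box-Cox profile log-likelihood.
    x_transformed, lam_hat = stats.boxcox(x)
    print("lambda-hat:", round(lam_hat, 3))

    # Likelihood-ratio test of H0: lambda = 1 (no transformation).
    ll_hat = stats.boxcox_llf(lam_hat, x)
    lr = 2 * (ll_hat - stats.boxcox_llf(1.0, x))
    print("LR chi-square:", round(lr, 2), " p:", stats.chi2.sf(lr, df=1))

    # The analogous comparison against lambda = 0 asks whether the log
    # transformation is a tenable value of lambda.
    lr0 = 2 * (ll_hat - stats.boxcox_llf(0.0, x))
    print("LR versus the log transformation:", round(lr0, 2))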

This approach extends to transforming several variables toward multivariate normality by maximizing the multivariate-normal log-likelihood of the transformed data, in which λ = (λ₁, λ₂, …)′ is the vector of transformation parameters, μ is the mean vector, and Σ is the variance–covariance matrix of the transformed data; det(Σ), the determinant of Σ, appears in the multivariate-normal log-likelihood.

For example, Figure 3.6 shows a scatterplot matrix for the four numeric variables in the CIA World Factbook data set. Each off-diagonal panel of the scatterplot matrix displays a scatterplot of the marginal (i.e., bivariate) relationship between a pair of variables, with a least-squares line and a loess line—a method of nonparametric regression that traces the regression of y on x without assuming a particular functional form for their relationship. I'll explain how loess works later in the chapter. The three scatterplots in the first row therefore have infant mortality on the vertical axis, the three scatterplots in the first column have infant mortality on the horizontal axis, and so on. Adaptive kernel density estimates for the four variables appear in the diagonal panels. The four variables are positively skewed to varying degrees, and several scatterplots show substantially nonlinear relationships.

Figure 3.6 Scatterplot matrix for infant mortality, GDP per capita, the Gini coefficient, and health spending in the CIA World Factbook data set. Nonparametric density estimates with rugplots are on the main diagonal; a least-squares line (solid line) and loess (local regression) smooth (broken line) are displayed in the scatterplot in each off-diagonal panel.

[Figure 3.6 is a 4 × 4 scatterplot matrix: density estimates with rugplots for infant mortality, GDP per capita, the Gini coefficient, and health spending appear on the diagonal, and each off-diagonal panel shows the corresponding scatterplot with a least-squares line and a loess smooth.]

Transforming the four variables toward multivariate normality produces the estimated power transformations shown in Table 3.2. The transformations for infant mortality and GDP per capita are quite precisely estimated, with the former including 0 (i.e., the log transformation) in the 95% confidence interval for λ. The transformations for the Gini coefficient and health spending are much less precisely estimated, and in both cases λ = 0 is included in the 95% confidence interval. All four confidence intervals exclude λ = 1 (i.e., no transformation). Figure 3.7 shows what happens when infant mortality, the Gini coefficient, and health spending are log-transformed, while GDP is raised to the 0.2 power. After transformation, the distributions of the variables are more symmetric and the regressions are more nearly linear.

Table 3.2 Estimated power transformations toward multivariate normality for the CIA World Factbook data.

Figure 3.7 Scatterplot matrix for the transformed CIA World Factbook data: log(infant mortality), (GDP per capita)^0.2, log(Gini coefficient), and log(health spending).

[Figure 3.7 repeats the scatterplot-matrix layout of Figure 3.6 for the transformed variables log(infant mortality), (GDP per capita)^0.2, log(Gini coefficient), and log(health spending); the density estimates are more symmetric and the off-diagonal scatterplots are closer to linear.]


Transforming Data With Zero and Negative Values

Ordinary power transformations and the Box–Cox family of modified powers work only if the data values to be transformed are all positive. Some of the power transformations, such as log(x), are defined only for positive x, and others, such as x², don't preserve the order of the data when there are both positive and negative values present: For example, 3² and (−3)² are both 9.

A simple way to apply power transformations to data with 0 or negative values is first to add a positive constant γ (called a start by Tukey, 1977) to the data that is sufficiently large to make all values positive—for example, x′ = x + 10, for an imagined variable x with no values at or below −10.

A related idea is to modify the Box–Cox power family to accommodate 0 or negative values. For example, Hawkins and Weisberg (2017) define a family of modified powers by applying the Box–Cox transformation to the variable z = [x + √(x² + γ²)]/2, which is positive even when x is 0 or negative. When γ = 0 and x is positive, z = x, and the transformation reduces to the ordinary Box–Cox power. Like an ordinary start, γ for the Hawkins–Weisberg family can be selected arbitrarily, or it can be estimated formally along with λ by adapting Box and Cox's method. For positive values of x, transformations in the Hawkins–Weisberg family behave very much like Box–Cox powers and extend this behavior to 0 and negative values.
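The following Python sketch illustrates both devices (an ordinary start and a Hawkins–Weisberg-style modification of the Box–Cox power), with function names and a value of γ that are my own choices rather than the authors' code:

    import numpy as np

    def t_bc(x, lam):
        x = np.asarray(x, dtype=float)
        return np.log(x) if lam == 0 else (x**lam - 1.0) / lam

    def t_start(x, lam, start):
        """Ordinary start: shift the data so all values are positive, then transform."""
        return t_bc(np.asarray(x, dtype=float) + start, lam)

    def t_hw(x, lam, gamma):
        """Hawkins-Weisberg-style power: apply the Box-Cox transformation to
        z = (x + sqrt(x**2 + gamma**2)) / 2, which is positive even when x <= 0.
        With gamma = 0 and positive x, z = x, recovering the ordinary Box-Cox power."""
        x = np.asarray(x, dtype=float)
        z = 0.5 * (x + np.sqrt(x**2 + gamma**2))
        return t_bc(z, lam)

    x = np.array([-5.0, -1.0, 0.0, 1.0, 10.0, 100.0])
    print(t_start(x, lam=0, start=6.0))   # start chosen so every shifted value is positive
    print(t_hw(x, lam=0, gamma=1.0))      # handles 0 and negative values directly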

Finally, power transformations are effective in modifying the shape of a strictly positive variable only when the ratio of the largest to the smallest x-values in the data is sufficiently large. If this ratio is small, then power transformations are nearly linear. In this case, we can increase the ratio by adding a negative start to the data. For example, try transforming a variable x whose values have a small ratio of largest to smallest by taking logs, and then graphing log(x) versus x; compare this to the graph of log(x + start) versus x, where the negative start increases the ratio of the largest to the smallest values.


Transformations for Linearity

The linear regression model can accommodate nonlinear relationships through devices such as polynomial regression and regression splines (both of which are discussed in Chapter 6), but in some circumstances, we can render a nonlinear relationship linear by transforming the response variable y, an explanatory variable x, or both. In particular, if the relationship between y and x is monotone (i.e., strictly increasing or strictly decreasing) and simple (in the sense that its direction of curvature doesn't change), then transformation of y or x might be a viable strategy.

These distinctions are illustrated in Figure 3.8, which shows scatterplots for three idealized patterns of nonlinear relationships:

Figure 3.8 Three patterns of nonlinear relationships.

[Figure 3.8: Panel (a) shows a monotone, simple nonlinear relationship; Panel (b) a monotone relationship that is not simple (an S-shaped curve); and Panel (c) a simple but nonmonotone (U-shaped) relationship.]

1. The relationship in Panel (a) is monotone and simple, and as we'll see presently, transformations such as √x or y² might in this case serve to straighten the relationship between y and x.
2. In Panel (b), y increases with x, and so their relationship is monotone, but the direction of curvature changes–at the left, the regression curve is concave down, while at the right it's concave up. A relationship like this might be modeled with a regression spline.
3. In Panel (c), the relationship of y to x is simple–concave up everywhere–but nonmonotone: y decreases with x at the left, levels off, and then increases at the right. A quadratic regression of the form ŷ = b_0 + b_1 x + b_2 x² is appropriate here.

bulging rule

Mosteller and Tukey (1977) introduced the , depicted in Figure 3.9, to guide trial-and-error selection of power transformations of , , or both to linearize simple monotone nonlinear relationships: Mosteller and Tukey’s bulging rule for selecting linearizing power transformations.

Figure 3.9

x y


Source: Adapted from Mosteller and Tukey (1977). ©1977. Reprinted by permission of Pearson Education, Inc., New York, NY.

[Figure 3.9 shows two crossed arrows: the vertical arrow reads "y up: y², y³, …" at the top and "y down: √y, log(y), …" at the bottom; the horizontal arrow reads "x down: √x, log(x), …" at the left and "x up: x², x³, …" at the right, with a curved bulge in each quadrant pointing outward.]

When the bulge points left, transform x down the ladder of powers and roots, using a transformation such as log(x). When the bulge points right, transform x up the ladder of powers and roots, using a transformation such as x². When the bulge points down, transform y down the ladder of powers and roots, using a transformation such as log(y). When the bulge points up, transform y up the ladder of powers and roots, using a transformation such as y².

In Figure 3.8(a), the bulge points up and to the left—hence my remark that the transformations √x or y² might serve to straighten the relationship.

Now consider the scatterplot in Figure 3.10, showing the relationship between infant mortality and GDP per capita in the CIA World Factbook data set. The scatterplot includes several enhancements that help us interpret the pattern of the data:

Figure 3.10 Scatterplot for infant mortality and GDP per capita in the CIA World Factbook data set, with marginal boxplots, least-squares line (solid line), and loess smooth (broken line).

[Figure 3.10 plots infant mortality (per 1000) against GDP per capita ($1000s), with marginal boxplots on both axes, a least-squares line, and a loess smooth; the points follow a steeply falling, L-shaped pattern.]

The boxplots show the marginal distributions of GDP and infant mortality. In this case, both distributions are positively skewed, as we noticed earlier in the chapter. The solid straight line in the scatterplot is the least-squares regression line, and it's clearly an inappropriate summary of the highly nonlinear relationship between infant mortality and GDP. The broken curved line is a loess smooth, tracing how the conditional average value of infant mortality changes with GDP. I'll explain presently how loess works.

Scatterplots, which I assume are familiar, are the canonical graph for examining the relationship between two quantitative variables. The loess regression in Figure 3.10 reveals that the nonlinear relationship between infant mortality and GDP is monotone and simple, with the bulge pointing down and to the left. The bulging rule suggests, then, that we transform infant mortality, GDP, or both down the ladder of powers. Figure 3.11 shows what happens when both variables are log-transformed. The original infant mortality and GDP per capita scales appear at the right and top of the graph, respectively. The relationship between the two variables is now nearly straight, and both variables are more symmetrically distributed.⁷ The fitted least-squares line in Figure 3.11 has the equation

fitted log(infant mortality) = b_0 − 0.84 × log(GDP per capita)   (3.2)

⁷ The log transformation of both variables slightly overcorrects their original nonlinear relationship, with a small bulge now pointing up and to the right. I invite the reader to try alternative transformations of infant mortality, GDP, or both variables. The remaining nonlinearity is slight, however, and we may prefer to stick with the log transformations because of interpretability, as explained in the text.

Figure 3.11 Scatterplot for infant mortality and GDP per capita with both variables log-transformed.

[Figure 3.11 plots log(infant mortality) against log(GDP per capita), with the original scales shown on secondary axes, marginal boxplots on both axes, and nearly coincident least-squares and loess lines.]

Because both variables are logged, the slope of −0.84 implies that a 1% increase in GDP per capita is on average associated with an approximate 0.84% decline in the infant mortality rate.
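The approximation is easy to check numerically: under a log-log line with slope −0.84, multiplying GDP per capita by 1.01 multiplies the predicted infant mortality by 1.01 raised to that slope. A quick check in Python:

    b1 = -0.84                     # slope of the log-log regression in Equation 3.2
    change = 1.01 ** b1 - 1        # exact proportional change in predicted infant mortality
    print(f"{change:.4%}")         # about -0.83%, i.e., roughly a 0.84% decline per 1% of GDP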


*How the Loess Smoother Works

The loess scatterplot smoother (an acronym for local regression), described originally by Cleveland (1979) (under the alternative acronym lowess, for locally weighted scatterplot smoother), is an implementation of nearest-neighbor local-polynomial regression, and is similar in spirit to kernel density estimation, described earlier in this chapter. Loess is explicated in Figure 3.12, which shows the computation of the loess smoother for the scatterplot of infant mortality versus GDP per capita:

Figure 3.12 Computing the loess regression of infant mortality on GDP per capita: (a) defining a window around the focal value x(96) corresponding to the span s, (b) computing tricube weights for points within the window, (c) computing the weighted quadratic regression for the points within the window and the fitted value ŷ(96) at x(96), and (d) the completed loess regression connecting all the fitted points (x(i), ŷ(i)).

[Figure 3.12 panels: (a) observations within the window, (b) tricube weights, (c) local quadratic regression, and (d) complete local quadratic estimate; the focal value x(96) is marked by a solid vertical line, with broken vertical lines marking the window boundaries.]

A window slides over the scatterplot from left to right across the range of x-values in the data, stopping at a series of focal values of x, such as the ordered data values x(1), x(2), …, x(n). A window of half-width h around each focal value x_0 is defined so that the interval [x_0 − h, x_0 + h] includes 100s% of the data; s is called the span of the loess smoother. Panel (a) of Figure 3.12 shows the window corresponding to the span s = 1/2 centered at x(96), the GDP per capita of Latvia; this window includes the 89 nearest x-neighbors of x(96).

Picking x(96) as the focal value is arbitrary—any of the x(i) could have been selected to provide an illustration—as is picking s = 1/2, although the latter is often a good choice of span. The 89 points within the window in Figure 3.12(a) are colored black and the points outside the window are colored gray; the focal case corresponds to the solid black point. More generally, there are analytic methods for selecting the span of nearest-neighbor smoothers, or the span can be selected by visual trial and error: Larger values of s produce smoother regression curves but suppress detail, and a sensible approach is to select s as the smallest value that produces a reasonably smooth regression. Each case within the loess window is weighted in relation to its x-distance from the center of the window, x_0, using the Tukey tricube weight function

w_i = [1 − (|x_i − x_0|/h)³]³   for |x_i − x_0| < h, and w_i = 0 otherwise

The points in Figure 3.12(b) are the weights assigned to the cases within the window. As illustrated in Figure 3.12(c), a weighted least-squares quadratic regression is fit to the points in the window, using the tricube weights to minimize the weighted sum of squared residuals Σ_i w_i e_i². The local regression equation is used to compute the fitted response ŷ(96) corresponding to the focal value x(96), shown in Figure 3.12(c) as a solid black point.

This procedure is repeated for each of the focal values x(1), x(2), …, x(n), computing the fitted values ŷ(1), ŷ(2), …, ŷ(n). The points (x(i), ŷ(i)) are connected, as shown in Figure 3.12(d), to produce the loess regression curve. The observant reader will notice that the loess regression in Figure 3.12(d) differs slightly from the loess regression in Figure 3.10 (on p. 31). The latter takes the additional step of weighting cases in inverse relation to the size of their residuals to produce a fit that is more resistant to outliers.
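The basic (non-robust) computation just described, with nearest-neighbor windows, tricube weights, and a weighted quadratic fit at each focal value, can be sketched compactly in Python. This is a simplified stand-in for illustration, not Cleveland's full algorithm, and the synthetic data at the end are my own:

    import numpy as np

    def loess(x, y, span=0.5):
        """Basic loess: local quadratic regression with tricube weights."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        order = np.argsort(x)
        x, y = x[order], y[order]
        n = len(x)
        k = max(int(np.ceil(span * n)), 3)         # points per nearest-neighbor window
        fitted = np.empty(n)
        for i, x0 in enumerate(x):
            dist = np.abs(x - x0)
            window = np.argsort(dist)[:k]          # the k nearest x-neighbors of the focal value
            h = dist[window].max()                  # half-width of the window
            if h == 0:                              # guard against ties at the focal value
                h = 1.0
            w = (1 - (dist[window] / h) ** 3) ** 3  # Tukey tricube weights
            # Weighted least-squares quadratic in (x - x0); the fitted value at the
            # focal point is the intercept of this local regression.
            X = np.column_stack([np.ones(k), x[window] - x0, (x[window] - x0) ** 2])
            sw = np.sqrt(w)
            beta, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y[window], rcond=None)
            fitted[i] = beta[0]
        return x, fitted

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 150)
    y = np.sin(x) + rng.normal(scale=0.3, size=150)
    xs, ys = loess(x, y, span=0.5)                  # (xs, ys) traces the smooth regression curve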


Transforming Nonconstant Variation

As reviewed in Chapter 2, the normal linear model assumes the same error variance σ² for all combinations of x-values in the population. A common violation of this assumption is for error variation to increase with the level of the response variable.

A simple setting for examining conditional variation is a categorical explanatory variable that divides the data into groups. Figure 3.13, for example, shows the distribution of GDP per capita in the CIA World Factbook data set for five regions of the world. I've arranged the regions in order of decreasing median GDP, from Europe to Africa. The conditional distributions of GDP within regions are positively skewed, with outliers at the high end of the distributions. Although Oceania is an exception, there is a general tendency of within-region variation in GDP to increase with increasing median GDP.

Figure 3.13 Parallel boxplots for the distribution of GDP per capita by regions of the world.

[Figure 3.13 shows parallel boxplots of GDP per capita ($1000s) for Europe, America, Oceania, Asia, and Africa, with outlying countries labeled at the high end of several regions.]

Tukey (1977) suggested plotting the log of the IQR (a measure of variation or “spread”) against the log of the median (a measure of center or “level”) to create a spread-level plot, as shown in Figure 3.14. The line in the spread-level plot, fit by robust regression, has positive slope b, reflecting the tendency of spread to increase with level.

Figure 3.14 Spread-level plot for GDP per capita by regions.

[Figure 3.14 plots the interquartile range against the median of GDP per capita for the five regions, on log axes, with an upward-sloping fitted line running from Africa through Asia toward Europe.]

Tukey explained further that the slope b of the line fit to the spread-level plot can be used to choose a variation-stabilizing power transformation, selecting the power p = 1 − b. Thus, if variation increases with level, as in the example, we transform the variable down the ladder of powers and roots. Here, p = 1 − b is close to 1/3, the cube-root transformation, GDP^(1/3). Figure 3.15 shows the result of cube-root and log transformations of GDP. In both cases, variation in GDP is now more nearly constant across regions, the skewness of the within-region distributions of GDP has been reduced, and there are fewer outliers. Positive skewness and a positive association of variation with level often have a common origin—in the current example, the lower bound of 0 for GDP per capita and the absence of an upper bound.

Figure 3.15 Parallel boxplots for the distribution of GDP per capita by regions of the world, with GDP transformed: (a) GDP^(1/3) and (b) log(GDP).

[Figure 3.15 shows parallel boxplots of the transformed GDP per capita by region: Panel (a) on the cube-root scale and Panel (b) on the log scale, each with a secondary axis in the original $1000s units; Singapore and the United States are labeled in Panel (a), Moldova and Haiti in Panel (b).]
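A minimal Python sketch of the spread-level calculation, with synthetic groups standing in for the regional GDP data and an ordinary least-squares line standing in for the robust fit used in Figure 3.14:

    import numpy as np

    rng = np.random.default_rng(3)
    # Synthetic positive, right-skewed data for five groups whose spread rises with level.
    groups = {g: rng.lognormal(mean=m, sigma=0.6, size=40)
              for g, m in zip("ABCDE", [1.0, 1.5, 2.0, 2.5, 3.0])}

    levels = np.array([np.median(v) for v in groups.values()])
    spreads = np.array([np.subtract(*np.percentile(v, [75, 25])) for v in groups.values()])

    # Slope b of log(IQR) on log(median); the variation-stabilizing power is p = 1 - b.
    b, a = np.polyfit(np.log(levels), np.log(spreads), deg=1)
    print("slope b =", round(b, 2), " suggested power p = 1 - b =", round(1 - b, 2))
    # A value of p near 1/3 points to a cube-root transformation; p near 0 points to the log.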


Interpreting Results When Variables Are Transformed

Additive linear regression models in which the variables have familiar scales are simple to interpret. If, for example, we regress dollars of annual income on years of education and years of labor force experience, then the coefficient for education represents the average increment in dollars of annual income associated with a 1-year increase in education, holding experience constant.⁸ Simplicity of interpretation isn't, however, an argument for grossly distorting the data, and if the partial relationship between income and education is nonlinear or if education and experience interact in affecting income, it's not sensible to stick with an additive linear model on the original scales of the variables.

⁸ For cross-sectional data, this kind of statement is really a shorthand for a static comparison between pairs of individuals who differ by 1 year in education but have equal labor force experience.

It is nevertheless true that we usually pay a cost in interpretability when we transform variables that have familiar scales. Sometimes, as in Equation 3.2 (on p. 32), where log(infant mortality) is regressed on log(GDP per capita), regression coefficients for transformed variables have a simple interpretation: In this example, and as I mentioned, the log(GDP per capita) coefficient of −0.84 represents the approximate percentage change in infant mortality associated with a 1% increment in GDP per capita—what economists term an elasticity. Similarly, if we were to regress annual income in dollars on the log base 2 of education, then the education coefficient would represent the average increment in annual income associated with doubling education (which is what's entailed by increasing the log₂ of education by 1).

There are other examples of directly interpretable transformations, but in most instances, simple interpretations of regression coefficients are unavailable for models with transformed variables or complex interactions. We can always resort to graphical representation of a regression model, however. One framework for doing so, predictor effect plots, is described in the context of nonlinearity diagnostics in Chapter 6. This is true as well of linear models that use multiple regressors to fit nonlinear relationships, such as polynomial regression and regression splines, also discussed in Chapter 6. The individual regression coefficients for polynomial and regression-spline regressors are generally difficult to interpret and are not of direct interest, but a graphical representation of the nonlinear partial relationship that the coefficients imply in combination is easy to comprehend.


Chapter 4. Unusual Data: Outliers, Leverage, and Influence

Unusual data are problematic in a least-squares regression because they can unduly influence the results of the analysis and because their presence may be a signal that the regression model fails to capture important characteristics of the data. Let's begin by differentiating among different kinds of unusual data.

In simple regression (i.e., a linear regression model with one numeric explanatory variable x), a regression outlier is a case whose response value is unusual given the value of the explanatory variable. In contrast, a univariate outlier is a value of x or y that is unconditionally unusual; such a value may or may not be a regression outlier. The distinction is illustrated in Figure 4.1: The solid black point is a regression outlier in that, unusually for the data configuration in general, it combines a relatively large value of y with a relatively small value of x–other nearby x-values are associated with much smaller y-values. Neither the x- nor the y-value of the outlier is individually unusual, as is clear from the marginal (i.e., univariate) distribution of each variable, shown in the rugplots on the horizontal and vertical axes. The gray dot in the center of the scatterplot is the point of means (the centroid) for x and y. The data ellipses shown in the graph are constructed so that they would include approximately 50%, 75%, and 95% of bivariately normally distributed data,¹ and are drawn ignoring the outlier, as is the least-squares line on the graph.

¹ The data ellipse is similar to the confidence ellipse, introduced in Figure 2.1 (p. 10). I'll explain the relationship between the two in Chapter 7.


Figure 4.1 A regression outlier (the solid black point). The solid gray point near the center of the graph is at the means of x and y, and the data ellipses are drawn for bivariate-normal data to enclose approximately 50%, 75%, and 95% of the data points, ignoring the outlier; the least-squares line on the graph is also fit ignoring the outlier; the rugplots on the x and y axes show the marginal distribution of each variable.

[Figure 4.1 shows the scatterplot with rugplots on both axes, three concentric data ellipses around the centroid, a least-squares line through the ellipses, and the regression outlier marked above and to the left of the centroid, outside the outermost ellipse.]

Some central distinctions are illustrated in Figure 4.2 for the simple regression model y = β_0 + β_1 x + ε. Regression outliers appear in both Panels (a) and (b) of Figure 4.2. In Panel (a), the outlying case has an x-value at the center of the x-distribution; as a consequence, deleting the outlier has no impact on the least-squares slope b_1 and little impact on the intercept b_0. In Panel (b), the outlier has an unusual x-value, and consequently, its deletion markedly affects both the slope and the intercept. Because of its unusual x-value, the outlying point in Figure 4.2(b) has high leverage on the regression coefficients, whereas the regression outlier in Figure 4.2(a) is at a low-leverage point and thus has little impact on the regression coefficients. The rightmost point in Figure 4.2(c) has high leverage, but because it's in line with the rest of the data, it has no influence on the values of the least-squares regression coefficients.

Figure 4.2 Unusual data in regression (outliers, leverage, and influence): (a) an outlier at a low-leverage point (near the mean of x) with little influence on the regression coefficients, (b) an outlier at a high-leverage point (far from the mean of x) with substantial influence on the regression coefficients, and (c) an in-line high-leverage point with no influence on the regression coefficients. In each case, the unusual point is black, the other points are gray, the solid line is the least-squares line for all the data, and the broken line is the least-squares line deleting the black point. The rugplots on the axes in each panel show the marginal distributions of x and y.

These three graphs are labeled a, b and c and each graph has a rug plot along each axes as well as eight data points and an outlier point. Each graph also has a least-squares line that uses all nine data points including the outlier point and the rest of the data set and another least-squares line for only the eight data points, excluding the outlier. In all three graphs, the eight data points are arranged at a 45 degree angle between the axes.

108

In graph a, the outlier point indicated by a solid dark point, is seen almost midway along the y axis and about a third of the way along the x axis. In this graph, the leastsquares line including and excluding this point run at a 45 degree angle, parallelly. The line excluding the data point starts at the intersection of the x and y axes and runs through all eight data points, sloping upward. The leastsquare line excluding the outlier runs parallel to this line, above it, in this graph. In graph b, the outlier point indicated by a solid dark point, is seen at the top right corner of the graph’s plot area, close to the ends of both axes. The least-squares line including and the outlier point run at a 45 degree angle, from the intersection of the x and y axes. The line excluding the data point starts on the x axis and intersects the least-squares line excluding the data point between data points 4 and 5 and steeply slopes upward, ending just below the outlier point. In graph c, the outlier point indicated by a solid dark point, is at the middle of the y axis and almost at the end of the x axis. In this graph, the least-squares line including and excluding this point run at a 45 degree angle, parallelly, almost overlapping each other. The line excluding the data point starts at the intersection of the x and y axes and runs through all eight data points, sloping upward. The least-square line excluding the outlier runs parallel to this line, above it, also running through all eight data points. Both lines pass through the outlier point on the right before they end.

The combination of high leverage with a regression outlier therefore produces substantial influence on the regression coefficients. The following heuristic formula helps to distinguish among these concepts and to explain their relationship:

Influence on coefficients = Leverage × Discrepancy

A simple and transparent example, with real data from Davis (1990), appears in Figure 4.3(a). These data record the measured and reported weight (in kilograms) of 183 male and female subjects who engaged in programs of regular physical exercise. As part of a larger study, Davis was interested in ascertaining whether the subjects reported their weights accurately, and whether men and women reported similarly.²

² The published study is based on the data for the female subjects only and includes additional data for nonexercising women. Davis (1990) reports the correlation between measured and reported weight.

Figure 4.3 (a) Regression of reported weight in kilograms on measured weight and gender for 183 subjects engaged in regular exercise; (b) regression of measured weight on reported weight and gender. In each panel, the black line shows the least-squares regression for women, the gray line the regression for men. Two relatively unusual points are identified in each panel, Cases 12 and 21. The scales of the x and y axes in these graphs are different (i.e., have different kg/cm) to accommodate the measured weight of the 12th case.

The least-squares regression of reported weight (RW) on measured weight (MW), a dummy variable for gender (F) coded 1 for women and 0 for men, and an interaction regressor (MW × F) yields the fitted model

$$\widehat{\text{RW}} = b_0 + b_1\,\text{MW} + b_2\,F + b_3\,(\text{MW} \times F)$$ (4.1)

Were these results taken seriously, we would conclude that men are on average unbiased reporters of their weights (because $b_0 \approx 0$ and $b_1 \approx 1$), whereas women tend to over-report their weights if they are relatively light and under-report if they are relatively heavy. Figure 4.3 makes clear, however, that the differential results for women and men are due to one female subject (Case 12) whose reported weight is about average (for women), but whose measured weight is extremely large. Case 21, an unusually heavy male, is also at a high-leverage point, but is in line with the rest of the data for men. As it turns out, and as Davis discovered after calculating an anomalously low correlation between reported and measured weight among women, Case 12's measured weight and height (in centimeters) were switched erroneously on data entry. Correcting the data produces a regression suggesting that both women and men are approximately equally unbiased reporters of weight.

There is another way to analyze the Davis weight data, illustrated in Figure 4.3(b): One of the investigator's interests was to determine whether subjects reported their weights accurately enough to permit the substitution of reported weight for measured weight, which would decrease the cost of collecting data on weight. It is natural to think of reported weight as influenced by "real" weight, as in the regression presented above, in which reported weight is the response variable. The question of substitution, however, is answered by the regression of measured weight on reported weight, fit here to the uncorrected data. In this second regression, the outlier does not have much impact on the regression coefficients, precisely because the value of RW for Case 12 is near the mean of RW for women.³ There is a marked effect on the multiple correlation and the standard deviation of the residuals, however: For the corrected data, the multiple correlation is much higher and the residual standard deviation much smaller, representing a possibly acceptable average error in predicting measured from reported weight.

³ I invite the reader to redo the regression with the measured weight for Case 12 corrected.

Measuring Leverage: Hat Values

The hat values $h_i$ are a common measure of leverage in least-squares regression and are so named because it is possible to express the fitted values $\hat{y}_j$ in terms of the observed values $y_i$:

$$\hat{y}_j = h_{j1} y_1 + h_{j2} y_2 + \cdots + h_{jn} y_n = \sum_{i=1}^{n} h_{ji} y_i$$

Thus, the weight $h_{ji}$ captures the extent to which $y_i$ can affect $\hat{y}_j$: If $h_{ji}$ is large, then the ith case can have a substantial impact on the jth fitted value. As I'll show in the starred section at the end of this chapter, $h_{ii} = \sum_{j=1}^{n} h_{ij}^2$, and so, dropping the redundant double subscript, the hat value $h_i \equiv h_{ii}$ summarizes the potential influence (the leverage) of $y_i$ on all the fitted values, $\hat{y}_1, \dots, \hat{y}_n$. For a model with an intercept, the hat values are bounded between 1/n and 1, and the average hat value is $\bar{h} = (k+1)/n$.

In simple regression analysis, the hat values measure distance from the mean of x:

$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n}(x_j - \bar{x})^2}$$

In multiple regression, $h_i$ measures distance from the centroid of the x's, taking into account the correlational structure and variation of the x's, as illustrated for two explanatory variables in Figure 4.4, which reveals that contours of constant leverage in two dimensions are data ellipses. Multivariate outliers in the x-space are thus high-leverage cases.

Figure 4.4 Elliptical contours of constant leverage for two x's. Taking the correlational structure and variation of $x_1$ and $x_2$ into account, the two black points are equally unusual and have equally large hat values. The gray point in the center of the graph is the point of means, $(\bar{x}_1, \bar{x}_2)$.

For Davis's regression of reported weight on measured weight, gender, and their interaction (Equation 4.1 on p. 43; also recall Figure 4.3(a) on p. 43), the largest hat value by far belongs to the 12th subject, whose measured weight was erroneously recorded as 166 kg: $h_{12}$ is many times the average hat value, $\bar{h} = (3+1)/183 \approx 0.022$.
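The hat values are easy to compute directly from their definition. The following minimal sketch (my own, not from the text) uses Python with NumPy and simulated stand-in data to compute the $h_i$ from a design matrix and to confirm the bounds and the average given above.

```python
import numpy as np

def hat_values(X):
    """Diagonal of the hat matrix H = X (X'X)^{-1} X' for a design matrix X
    whose first column is the constant regressor."""
    XtX_inv = np.linalg.inv(X.T @ X)
    # h_i = x_i' (X'X)^{-1} x_i, computed row by row without forming all of H
    return np.einsum("ij,jk,ik->i", X, XtX_inv, X)

# Toy illustration with simulated data (purely for demonstration)
rng = np.random.default_rng(0)
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.uniform(1, 10, size=(n, k))])

h = hat_values(X)
print(h.min() >= 1 / n, h.max() <= 1.0)   # bounds for a model with an intercept
print(np.isclose(h.mean(), (k + 1) / n))  # average hat value is (k + 1)/n
```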

Detecting Outliers: Studentized Residuals

To identify regression outliers, we need to measure the conditional unusualness of y given the x's. Regression outliers usually have large residuals, but even if the errors $\varepsilon_i$ have equal variances (as assumed in the normal linear model), the residuals $e_i$ do not: $V(e_i) = \sigma_\varepsilon^2(1 - h_i)$ (see the last section of this chapter). High-leverage cases, therefore, tend to have small (i.e., low-variance) residuals, a sensible result, because these cases can force the regression surface to be close to them.

Although we can form a standardized residual by calculating

$$e_i' = \frac{e_i}{s_e\sqrt{1 - h_i}}$$

this measure suffers from the defect that the numerator and denominator are not independent, preventing $e_i'$ from following a t-distribution: When $|e_i|$ is large, $s_e = \sqrt{\sum e_j^2/(n - k - 1)}$, which contains $e_i$, tends to be large as well.

Suppose, however, that we refit the regression model after removing the ith case, obtaining a deleted estimate $s_{e(-i)}$ of $\sigma_\varepsilon$ based on the rest of the data, where the negative parenthetical subscript indicates that case i has been removed. Then the studentized residual

$$e_i^* = \frac{e_i}{s_{e(-i)}\sqrt{1 - h_i}}$$ (4.2)

has independent numerator and denominator and follows a t-distribution with n − k − 2 degrees of freedom.

An alternative, but equivalent and illuminating, procedure for finding the studentized residuals employs the mean-shift outlier model

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \gamma d + \varepsilon$$ (4.3)

where d is a dummy variable set to 1 for case i and 0 for all other cases. It would be natural to specify Equation 4.3 if, before examining the data, we suspected that case i differed from the others. Then, to test $H_0\colon \gamma = 0$, we would find $t_0 = \hat{\gamma}/\text{SE}(\hat{\gamma})$, which is distributed as $t_{n-k-2}$ under $H_0$, and which (it turns out) is the studentized residual $e_i^*$ of Equation 4.2.

Because in most applications we do not suspect a particular case in advance, we can fit the mean-shift model n times, once for each case, $i = 1, \dots, n$, producing studentized residuals $e_1^*, \dots, e_n^*$. In practice, alternative formulas to Equations 4.2 and 4.3 provide the $e_i^*$ with little computational effort. Usually, our interest then focuses on the largest absolute $e_i^*$, called $e_{\max}^*$. Because we have picked the biggest of n test statistics, however, it is no longer legitimate simply to use $t_{n-k-2}$ to find the p-value for $e_{\max}^*$: For example, even if our model is wholly adequate, and disregarding for the moment the dependence among the $e^*$'s, we would expect to observe about 5% of $e^*$'s beyond ±2, about 1% beyond ±2.6, and so forth. One solution to this problem of simultaneous inference is to perform a Bonferroni adjustment to the p-value for $e_{\max}^*$:⁴ Suppose that $p' = \Pr(t_{n-k-2} > |e_{\max}^*|)$, the ordinary one-sided p-value for the largest absolute studentized residual. Then the Bonferroni p-value for testing $e_{\max}^*$ is $p = 2np'$. The factor 2 reflects the two-sided character of the test: We want to detect large negative as well as large positive outliers. The factor n adjusts for conducting n simultaneous tests, which is implicit in selecting the largest of n test statistics. Consequently, a much larger $e_{\max}^*$ is required to obtain a small p-value than is the case for an ordinary individual t-test. Beckman and Cook (1983) show that this Bonferroni adjustment is usually accurate for testing the largest studentized residual.

⁴ Another way to take into account the joint distribution of the studentized residuals is to construct a quantile-comparison plot, as discussed in Chapter 5.

In Davis's regression of reported weight on measured weight, gender, and their interaction (Equation 4.1 on p. 43), the largest absolute studentized residual by far belongs to the 12th case. The ordinary one-sided p-value for this residual is vanishingly small (58 zeroes to the right of the decimal point, followed by 99), and with n = 183 the Bonferroni p-value for the outlier test is therefore also minuscule, an unambiguous result: The 12th case clearly doesn't belong with the rest of the data.
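As a concrete illustration of the outlier test, the sketch below (Python with simulated stand-in data, since the Davis data set itself is not reproduced here) computes the studentized residuals and the Bonferroni p-value for the largest one, letting statsmodels handle the deleted-estimate bookkeeping.

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, k = 100, 3
X = sm.add_constant(rng.uniform(1, 10, size=(n, k)))
y = X @ np.array([1.0, 2.0, 0.5, -1.0]) + rng.normal(0, 2, n)
y[11] += 25                                   # plant a gross outlier at "case 12" (index 11)

fit = sm.OLS(y, X).fit()
rstudent = fit.get_influence().resid_studentized_external   # e*_i
df = n - k - 2                                               # t degrees of freedom

i_max = int(np.argmax(np.abs(rstudent)))
p_prime = stats.t.sf(np.abs(rstudent[i_max]), df)            # one-sided p-value
p_bonf = min(1.0, 2 * n * p_prime)                           # Bonferroni-adjusted p-value
print(i_max, round(rstudent[i_max], 2), p_bonf)
```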

Measuring Influence: Cook's Distance and Other Case Deletion Diagnostics

As I previously pointed out, influence on the least-squares regression coefficients combines leverage and outlyingness. The most direct measure of influence simply examines the impact on each coefficient of deleting each case in turn:⁵

$$d_{ij} = b_j - b_{j(-i)}, \quad i = 1, \dots, n;\; j = 0, 1, \dots, k$$

where $b_{j(-i)}$ denotes the least-squares estimate of $\beta_j$ produced when the ith case is omitted. To assist in interpretation, it is useful to scale the $d_{ij}$ by (deleted) estimates of the coefficient standard errors:

$$d_{ij}^* = \frac{d_{ij}}{\text{SE}_{(-i)}(b_j)}$$

Following Belsley, Kuh, and Welsch (1980), the $d_{ij}$ are termed DFBETAij, and the $d_{ij}^*$ are called DFBETASij.

⁵ More precisely, if less succinctly, because $b_{j(-i)}$ is subtracted from $b_j$, $d_{ij}$ measures the effect on $b_j$ of adding the ith case to a model fit to the other n − 1 cases.

One problem associated with using the $d_{ij}$ or the $d_{ij}^*$ is their large number: There are n(k + 1) of each. These values can be more quickly examined graphically than in numerical tables. For example, we can construct an index plot of the $d_{ij}$s or $d_{ij}^*$s for each coefficient: a simple scatterplot with $d_{ij}$ or $d_{ij}^*$ on the vertical axis versus the case index i on the horizontal axis. Nevertheless, it is useful to have a global summary measure of the influence of each case on the fit. Cook (1977) proposed measuring the distance between the $b_j$ and the corresponding $b_{j(-i)}$ by calculating the F-statistic for the "hypothesis" that all of the $\beta_j$ are equal to the corresponding $b_{j(-i)}$. This statistic is recalculated for each case $i = 1, \dots, n$. The resulting values should not literally be interpreted as F-tests, which is why "hypothesis" is in scare quotes: Cook's approach merely exploits an analogy to hypothesis testing to produce an overall measure of distance between the full-sample and deleted regression coefficients that is independent of the scales of the x's and y. Cook's statistic may be written (and simply calculated) as

$$D_i = \frac{e_i'^2}{k+1} \times \frac{h_i}{1 - h_i}$$ (4.4)

In effect, the first term is a measure of outlyingness and the second a measure of leverage (see the last section of this chapter). We look for values of $D_i$ that are substantially larger than the rest.

Belsley et al. (1980) suggested a very similar measure that uses studentized residuals rather than standardized residuals:

$$\text{DFFITS}_i = e_i^*\sqrt{\frac{h_i}{1 - h_i}}$$

Except for unusual data configurations, $D_i \approx \text{DFFITS}_i^2/(k+1)$. Other global measures of influence are available (see Chatterjee & Hadi, 1988, Chapter 4, for a comparative discussion). Because all of the deletion diagnostics depend on the hat values and residuals, a graphical alternative to either of the general influence measures is to plot the $e_i^*$ against the $h_i$ and to look for cases for which both are big. A slightly more sophisticated version of this graph displays circles of area proportional to Cook's D or another global measure of influence instead of points (see Figure 4.7 on p. 56, for an example). We can follow up by examining the cases that stand out from the others.

For Davis's regression of reported weight on measured weight, gender, and their interaction (Equation 4.1 on p. 43), all the indices of influence point to the obviously discrepant 12th case, which has by far the largest DFBETAS, DFFITS, and Cook's D. Case 12 is a female subject, and so her removal has no impact on the male intercept $b_0$ and slope $b_1$.

In developing measures of influence in regression, I have focused on changes in the regression coefficients. Other regression outputs may be examined as well, including the standard errors of the regression coefficients and consequently the size of individual confidence intervals for the coefficients, the size of the joint ellipsoidal confidence region for the regression coefficients, and the degree of collinearity among the x's. All these measures depend on the leverages and the residuals, however. For example, consider the in-line high-leverage point at the right of Figure 4.2(c) (on p. 42): This point has no influence on the regression coefficients, but because it greatly increases the variance of x, it greatly decreases the standard error of the slope, which is $\text{SE}(b_1) = s_e/\sqrt{\sum (x_i - \bar{x})^2}$. For an extended discussion of various measures of influence, see Chatterjee and Hadi (1988, Chapters 4 and 5).
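The case-deletion statistics described above are tedious to compute by brute force but follow directly from the hat values and residuals. A minimal Python sketch (simulated stand-in data; the variable names are mine) that obtains DFBETAS, Cook's D, and DFFITS and checks Equation 4.4:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, k = 60, 2
X = sm.add_constant(rng.normal(size=(n, k)))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
X[0, 1] += 4                               # move case 0 far from the centroid of the x's
y[0] += 8                                  # ... and make it a regression outlier as well

infl = sm.OLS(y, X).fit().get_influence()
dfbetas = infl.dfbetas                     # n x (k+1) array of the d*_ij
cooks_d = infl.cooks_distance[0]           # D_i
dffits = infl.dffits[0]                    # DFFITS_i

# Equation 4.4: D_i = (e'_i)^2/(k+1) * h_i/(1 - h_i)
e_std = infl.resid_studentized_internal    # standardized residuals e'_i
h = infl.hat_matrix_diag
print(np.allclose(cooks_d, e_std**2 / (k + 1) * h / (1 - h)))

# All three diagnostics single out the planted influential case (index 0)
print(np.argmax(cooks_d), np.argmax(np.abs(dffits)),
      np.argmax(np.abs(dfbetas).max(axis=1)))
```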

Numerical Cutoffs for Noteworthy Case Diagnostics

I have deliberately refrained from suggesting specific numerical criteria for identifying noteworthy cases on the basis of measures of leverage and influence. I believe that it is generally more useful to examine the distributions of these quantities to locate cases with relatively unusual values. The Bonferroni p-value for the largest absolute studentized residual provides an absolute rule for identifying regression outliers, but even this isn't a substitute for graphical examination of the residuals.

Cutoffs can be of some use, however, as long as they are not given too much weight, and especially when they serve to enhance graphical displays. A horizontal line can be shown on an index plot, for example, to draw attention to values beyond a cutoff. Similarly, such values can be identified individually in a graph (as in Figure 4.7 on p. 56). For some diagnostic statistics, such as measures of influence, absolute cutoffs are unlikely to identify noteworthy cases in large samples. In part, this characteristic reflects the ability of large samples to absorb discrepant data without markedly changing the results, but it is still often of interest to identify relatively unusual points, even if no cases have large absolute influence or leverage. Unusual cases may tell us something interesting and unanticipated about the data.

Hat values: Belsley et al. (1980) suggest that hat values exceeding about twice the average, $2\bar{h} = 2(k+1)/n$, are noteworthy. This cutoff was derived as an approximation nominating the most extreme 5% of cases when the x's are multivariate-normal and k and n − k − 1 are relatively large, but it is recommended by these authors as a rough general guide. They suggest using $3\bar{h}$ as the cutoff when n is small. (See Chatterjee & Hadi, 1988, Chapter 4, for a discussion of alternative cutoffs for hat values.)

Studentized residuals: Beyond outlier testing, discussed above, it sometimes helps to draw attention to residuals that are relatively large. Under ideal conditions, about 5% of studentized residuals fall outside the range $|e_i^*| \le 2$. It is therefore reasonable, for example, to draw lines at ±2 on a display of studentized residuals to highlight cases outside this range.

Standardized change in regression coefficients: The $d_{ij}^*$ (i.e., the DFBETAS) are scaled by standard errors, and consequently $|d_{ij}^*| > 1$ or 2 suggests itself as an absolute cutoff. Because this criterion is unlikely to nominate cases in large samples, Belsley et al. (1980) propose the sample-size-adjusted cutoff $2/\sqrt{n}$ for identifying noteworthy $d_{ij}^*$.

Cook's D and DFFITS: A variety of numerical cutoffs have been recommended for Cook's D and DFFITS, exploiting the analogy between D and an F-statistic, for example. Chatterjee and Hadi (1988) suggest comparing $D_i$ with the sample-size-adjusted cutoff $4/(n - k - 1)$. (Also see Belsley et al., 1980; Cook, 1977; Velleman & Welsch, 1981.) Moreover, because of the approximate relationship between DFFITS and Cook's D, it is simple to translate cutoffs between the two measures. Chatterjee and Hadi's criterion for DFFITS, for example, becomes $|\text{DFFITS}_i| > 2\sqrt{(k+1)/(n-k-1)}$. Absolute cutoffs, such as $D_i > 1$, risk missing unusual data.
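For reference, here is a small helper (mine, not from the text) that collects the rough cutoffs just described so they can be drawn on index plots or used to label points; n is the sample size and k the number of explanatory variables.

```python
import math

def diagnostic_cutoffs(n: int, k: int) -> dict:
    """Rough, sample-size-adjusted cutoffs for regression diagnostics.

    These are guides for enhancing graphical displays, not formal tests.
    """
    p = k + 1                                   # number of coefficients, including the intercept
    return {
        "hat_value": 2 * p / n,                 # Belsley et al.: twice the average hat value
        "hat_value_small_n": 3 * p / n,         # stricter guide for small samples
        "abs_studentized_residual": 2.0,        # roughly the outer 5% under ideal conditions
        "abs_dfbetas": 2 / math.sqrt(n),        # Belsley et al. sample-size-adjusted cutoff
        "cooks_d": 4 / (n - k - 1),             # Chatterjee & Hadi
        "abs_dffits": 2 * math.sqrt(p / (n - k - 1)),
    }

print(diagnostic_cutoffs(n=45, k=2))            # e.g., Duncan's occupational prestige regression
```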

Jointly Influential Cases: Added-Variable Plots

As illustrated in Figure 4.5, subsets of cases can be jointly influential or can offset each other's influence. Often, influential subsets of cases or multiple outliers can be identified by applying single-case deletion diagnostics sequentially. It is potentially important, however, to refit the model after deleting each case, because the presence of a single influential point can dramatically affect the fit at other points. Still, the sequential approach is not always successful.

Figure 4.5 Jointly influential data. In each panel, the solid line gives the regression for all of the data, the broken line gives the regression with the solid black triangle deleted, and the dash-dotted line gives the regression with both the black square and the triangle deleted. (a) Two jointly influential cases located close to one another; deletion of both cases has a much greater impact than deletion of only one. (b) Two jointly influential cases located on opposite sides of the data. (c) Cases that offset one another; the regression with both cases deleted is nearly the same as for the whole data set.

Although it is possible to generalize case-deletion statistics formally to subsets of several points, the very large number of subsets (there are $\binom{n}{p}$ subsets of size p) usually renders the approach impractical (but see Belsley et al., 1980, Chapter 2, and Chatterjee & Hadi, 1988, Chapter 5). An alternative is to employ graphical methods.

An especially useful influence graph, which I consider the grand champion of regression graphs, is the added-variable plot, also called a partial-regression plot. Let $y_i^{(1)}$ represent the residuals from the least-squares regression of y on all the x's with the exception of $x_1$, that is, the residuals from the fitted model

$$y_i = b_0^{(1)} + b_2^{(1)} x_{i2} + \cdots + b_k^{(1)} x_{ik} + y_i^{(1)}$$

Likewise, the $x_i^{(1)}$ are residuals from the least-squares regression of $x_1$ on the other x's:

$$x_{i1} = c_0^{(1)} + c_2^{(1)} x_{i2} + \cdots + c_k^{(1)} x_{ik} + x_i^{(1)}$$

The notation emphasizes the interpretation of the residuals $y^{(1)}$ and $x^{(1)}$ as the parts of y and $x_1$ that remain when the linear dependence of these variables on $x_2, \dots, x_k$ is removed.⁶

⁶ I focus here on $x_1$ for notational simplicity. As I'll explain, we will focus on each $x_j$ in turn, regressing it and y on the other x's.

The added-variable plot for $x_1$ is the scatterplot of $y^{(1)}$ versus $x^{(1)}$, and it has the following very interesting properties (see the final section of this chapter):

The slope of the least-squares simple-regression line of $y^{(1)}$ on $x^{(1)}$ is the same as the least-squares slope $b_1$ for $x_1$ in the full multiple regression.

The residuals from this simple regression are the same as the residuals $e_i$ from the full multiple regression; that is,

$$y_i^{(1)} = b_1 x_i^{(1)} + e_i$$

No intercept is required here because, as least-squares residuals, both $y^{(1)}$ and $x^{(1)}$ have means of 0. Consequently, the standard deviation of the residuals in the added-variable plot is $s_e$ from the multiple regression (if we use the residual degrees of freedom, n − k − 1, from the multiple regression to compute $s_e$). The standard error of $b_1$ in the multiple regression is then

$$\text{SE}(b_1) = \frac{s_e}{\sqrt{\sum x_i^{(1)2}}}$$

Because the $x_i^{(1)}$ are residuals, they are less variable than the explanatory variable $x_1$ if $x_1$ is correlated with the other x's. The added-variable plot therefore shows how collinearity can degrade the precision of estimation by decreasing the conditional variation of an explanatory variable (a topic that I will pursue in Chapter 7).
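These properties are easy to verify numerically. The sketch below (Python with simulated data; not from the text) builds the added-variable plot quantities for $x_1$ by brute force and checks that the simple-regression slope, residuals, and standard error match the full multiple regression.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n, k = 200, 3
Z = rng.normal(size=(n, k))
Z[:, 0] += 0.8 * Z[:, 1]                      # make x1 correlated with x2
X = sm.add_constant(Z)
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

full = sm.OLS(y, X).fit()

others = np.delete(X, 1, axis=1)              # constant, x2, ..., xk
y_1 = sm.OLS(y, others).fit().resid           # y^(1): y with the other x's partialled out
x_1 = sm.OLS(X[:, 1], others).fit().resid     # x^(1): x1 with the other x's partialled out

av = sm.OLS(y_1, x_1).fit()                   # simple regression through the origin
print(np.isclose(av.params[0], full.params[1]))        # same slope b1
print(np.allclose(av.resid, full.resid))               # same residuals
se_b1 = np.sqrt(full.scale / np.sum(x_1**2))           # s_e / sqrt(sum of x^(1) squared)
print(np.isclose(se_b1, full.bse[1]))                  # same standard error
```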

The collection of added-variable plots for $x_1, \dots, x_k$ in effect converts the graph for the multiple regression, the natural scatterplot for which has k + 1 dimensions and consequently cannot be drawn when k + 1 exceeds 2 or 3, into a sequence of two-dimensional scatterplots.

In the context of the current chapter, plotting $y^{(j)}$ against $x^{(j)}$ permits us to examine leverage and influence of the cases on $b_j$. We can even draw an added-variable plot for the intercept $b_0$, regressing y and the constant regressor $x_0 = 1$ on $x_1$ through $x_k$, with no intercepts in the auxiliary regression equations. By the same token, added-variable plots usefully display leverage and influence on the coefficients for all kinds of regressors, including dummy regressors and interaction regressors.

Illustrative added-variable plots for a regression with two explanatory variables appear in Figure 4.6. The data for this example are drawn from a landmark study by Duncan (1961), who regressed the rated prestige of 45 occupations (P, assessed as the percentage of raters in a national survey scoring the occupations as "good" or "excellent") on the income and educational levels of the occupations in the 1950 U.S. Census (I, the percentage of males in the occupation earning an annual income of at least $3,500, and E, the percentage of male high school graduates in the occupation, respectively). The primary aim of this regression was to produce predicted prestige scores for occupations for which there were no direct prestige ratings, but for which income and educational data were available, a methodology that's still in use for constructing socioeconomic status scores for occupations. Duncan's fitted regression equation (with coefficient standard errors in parentheses) is

$$\widehat{P} = -6.06 + 0.599\,I + 0.546\,E$$ (4.5)
$$\qquad\;(4.27)\quad(0.120)\quad(0.098)$$

Figure 4.6 Added-variable plots with least-squares lines for (a) income and (b) education in Duncan's regression of prestige on the income and education levels of 45 U.S. occupations in 1950. Three noteworthy points are identified on the plots. The added-variable plot for the intercept is not shown.

Note: RR = railroad.

The added-variable plot for income in Figure 4.6(a) reveals three relatively high-leverage cases: the occupation minister, whose income is unusually low given the educational level of the occupation, and railroad conductor and railroad engineer, whose incomes are unusually high given their levels of education. Recall that the horizontal variable in the added-variable plot is the residual from the regression of income on education, and thus values far from 0 in this direction are those for which income is unusual given education. The point for railroad engineer is more or less in line with most of the points in the plot, but the points for minister and conductor appear to be working jointly to decrease the income slope, minister pulling the regression line up at the left and conductor down at the right.

The added-variable plot for education in Figure 4.6(b) shows that the same three cases have relatively high leverage on the education coefficient: Minister and conductor serve jointly to increase the education coefficient, while railroad engineer is roughly in line with the rest of the data.

Examining the single-case deletion diagnostics for Duncan's regression reveals that minister has the largest Cook's D ($D_{\text{minister}}$ = 0.566) and the largest absolute studentized residual. This residual is not especially big, however: The Bonferroni p-value for the outlier test is well above conventional levels of statistical significance. Figure 4.7 shows a "bubble plot" of studentized residuals versus hat values, with the areas of the plotted circles proportional to Cook's D. Case names are shown on the plot for occupations with noteworthy studentized residuals, hat values, or values of Cook's D.

Figure 4.7 Plot of studentized residuals versus hat values for Duncan's regression of occupational prestige on income and education. Each point is plotted as a circle with area proportional to Cook's D. The horizontal lines on the graph are drawn at studentized residuals of 0 and ±2, and the vertical lines at hat values of $2\bar{h}$ and $3\bar{h}$. The identified occupations are those with noteworthy studentized residuals, hat values, or values of Cook's D.

Note: RR = railroad.

Deleting the occupations minister and conductor produces a fitted regression (Equation 4.6) with income slope $b_I = 0.867$ and education slope $b_E = 0.332$, which, as expected from the added-variable plots, has a considerably larger income slope and smaller education slope than the original regression. Additionally deleting railroad engineer further increases the income slope and decreases the education slope, but, as also expected from the added-variable plots, the changes are not dramatic: $b_I = 0.931$, $b_E = 0.285$.
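If the Duncan data are available in machine-readable form, refitting after deletions takes only a few lines. A Python sketch follows; the file name Duncan.csv and the column names prestige, income, and education are my assumptions about how the data might be stored.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file layout: one row per occupation, indexed by occupation name,
# with columns prestige, income, and education.
duncan = pd.read_csv("Duncan.csv", index_col=0)

full = smf.ols("prestige ~ income + education", data=duncan).fit()
reduced = smf.ols("prestige ~ income + education",
                  data=duncan.drop(["minister", "conductor"])).fit()

# Compare the income and education slopes with and without the two jointly
# influential occupations identified in the added-variable plots.
print(full.params[["income", "education"]])
print(reduced.params[["income", "education"]])
```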

Should Unusual Data Be Discarded?

The discussion in this section has proceeded as if outlying and influential data are simply discarded. Although problematic data should not be ignored, they should also not be deleted mechanically and thoughtlessly:

It is important to investigate why data are unusual. Obviously bad data (e.g., essentially random errors in data collection or data entry, as in Davis's regression) can often be corrected or, if correction is not possible, thrown away without qualm. Alternatively, when a discrepant data point is known to be correct, we may be able to understand why the case is unusual. For Duncan's regression, for example, it makes sense that the occupation minister enjoys prestige not accounted for by its income and educational levels. Similarly, it may be the case that the high incomes of the railroad occupations relative to their educational levels and prestige reflect the power of railroad unions around 1950. In situations like this, we may choose to deal with outlying cases separately.

Outliers or influential data may also motivate model respecification. For example, the pattern of outlying data may suggest the introduction of additional explanatory variables. If, in Duncan's regression, we can identify a factor, such as level of unionization, that produces the unusually high income levels of the railroad occupations given their relatively low levels of education, and we can measure that factor for other occupations, then this variable could be added to the regression.⁷ In some instances, transformation of the response variable or of an explanatory variable may, by rendering the error distribution symmetric or correcting nonlinearity (see Chapters 5 and 6), draw apparent outliers toward the rest of the data. We should, however, be careful to avoid overfitting the data, permitting a small portion of the data to determine the form of the model.

Except in clear-cut cases, we are justifiably reluctant to delete cases or to respecify a regression model to accommodate unusual data. Overenthusiasm in deleting outliers can lead to the situation satirized in Figure 4.8. Alternatively, we can abandon least squares and adopt an estimation strategy, called robust regression, that continuously downweights outlying data rather than simply including or discarding them. These methods are termed robust because they behave well even when the errors are not normally distributed: Robust estimation is nearly as efficient as least squares when the errors are normally distributed, and much more efficient in the presence of outliers. Because robust regression assigns 0 or very small weight to highly discrepant data, however, the result is not generally very different from careful application of least squares, and indeed, robust-regression weights may be used to identify outliers. Moreover, robust-regression methods may be vulnerable to high-leverage points (but see the high-breakdown estimators described by Rousseeuw & Leroy, 1987, which are subject to their own problems). Robust regression does not absolve us from looking at the data.

⁷ See Pearce (2002) for a nice example of the interplay between unusual data and model specification in the context of research on fertility preferences in Nepal. Pearce followed up an initial statistical analysis of social survey data with semistructured interviews of unusual cases, leading to respecification of the statistical analysis.

Figure 4.8 Outlier detection in action.

Source: Reprinted with permission from the announcement of the Summer Program of the Inter-University Consortium for Political and Social Research, 1990.

*Unusual Data: Details

Hat Values and the Hat Matrix

The fitted values in least-squares regression are a linear function of the observed values:

$$\hat{\mathbf{y}} = \mathbf{X}\mathbf{b} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \mathbf{H}\mathbf{y}$$

Here, $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ is the hat matrix, so named because it transforms y into $\hat{\mathbf{y}}$ ("puts the hat on y"). The hat matrix is symmetric ($\mathbf{H}' = \mathbf{H}$) and idempotent ($\mathbf{H}\mathbf{H} = \mathbf{H}$), as can easily be verified. Consequently, the diagonal entries of the hat matrix, called hat values, are

$$h_i \equiv h_{ii} = \sum_{j=1}^{n} h_{ij}^2$$

which implies that $0 \le h_i \le 1$. If X includes the constant regressor, then $h_i \ge 1/n$. Finally, because H is a projection matrix, projecting y orthogonally onto the subspace spanned by the k + 1 columns of X, it follows that $\mathrm{trace}(\mathbf{H}) = k + 1$, which implies that $\sum_{i=1}^{n} h_i = k + 1$, and thus $\bar{h} = (k+1)/n$.

The Distribution of the Least-Squares Residuals

The least-squares residuals are given by

$$\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I}_n - \mathbf{H})\mathbf{y} = (\mathbf{I}_n - \mathbf{H})\boldsymbol{\varepsilon}$$

where $\mathbf{I}_n$ is the order-n identity matrix. Thus, $E(\mathbf{e}) = \mathbf{0}$ and

$$V(\mathbf{e}) = \sigma_\varepsilon^2(\mathbf{I}_n - \mathbf{H})$$

because $\mathbf{I}_n - \mathbf{H}$, like $\mathbf{H}$, is symmetric and idempotent. The matrix $\mathbf{I}_n - \mathbf{H}$ is not diagonal, and its diagonal entries are usually unequal; the residuals, therefore, are correlated and usually have unequal variances,⁸ even though the errors are, by assumption, independent with equal variances.

⁸ An interesting exception is an analysis-of-variance model for a multifactorial designed experiment in which there are equal numbers of cases in all combinations of factor levels (cells) of the design. In that case, the variances of the residuals are all equal, and residuals in different cells are uncorrelated, but residuals in the same cell are still correlated.
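These matrix facts are easy to confirm numerically. A brief check in Python (simulated design matrix; nothing here is specific to any of the chapter's examples):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])

H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix
I_n = np.eye(n)

print(np.allclose(H, H.T))                    # symmetric
print(np.allclose(H @ H, H))                  # idempotent
print(np.isclose(np.trace(H), k + 1))         # trace equals the number of coefficients

# V(e) = sigma^2 (I - H): check by simulating many error vectors
sigma = 2.0
E = rng.normal(0, sigma, size=(100_000, n))   # rows are independent error vectors
resid = E @ (I_n - H).T                       # e = (I - H) eps for each simulated sample
print(np.allclose(np.cov(resid, rowvar=False), sigma**2 * (I_n - H), atol=0.15))
```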

Case Deletion Diagnostics

Let $\mathbf{b}_{(-i)}$ denote the vector of least-squares regression coefficients calculated with the ith case omitted. Then $\mathbf{d}_i = \mathbf{b} - \mathbf{b}_{(-i)}$ (i.e., DFBETAi) represents the influence of case i on the regression coefficients; $\mathbf{d}_i$ may be calculated efficiently by

$$\mathbf{d}_i = \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\, e_i}{1 - h_i}$$ (4.7)

where $\mathbf{x}_i'$ is the ith row of X. Cook's $D_i$ is the F-value for testing the "hypothesis" that $\boldsymbol{\beta} = \mathbf{b}_{(-i)}$:

$$D_i = \frac{(\mathbf{b} - \mathbf{b}_{(-i)})'\,\mathbf{X}'\mathbf{X}\,(\mathbf{b} - \mathbf{b}_{(-i)})}{(k+1)\, s_e^2} = \frac{(\hat{\mathbf{y}} - \hat{\mathbf{y}}_{(-i)})'(\hat{\mathbf{y}} - \hat{\mathbf{y}}_{(-i)})}{(k+1)\, s_e^2}$$

An alternative interpretation of $D_i$, therefore, is that it measures the aggregate influence of case i on the fitted values, which is why Belsley et al. (1980) call their similar statistic "DFFITS." Using Equation 4.7,

$$D_i = \frac{e_i^2\, h_i}{(k+1)\, s_e^2\,(1 - h_i)^2} = \frac{e_i'^2}{k+1} \times \frac{h_i}{1 - h_i}$$

which is the formula given previously.
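A short numerical check of Equation 4.7 and the Cook's D identity, comparing the efficient formulas with brute-force deletion (Python, simulated data; not from the text):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # hat values
s2 = e @ e / (n - k - 1)                      # s_e^2

i = 7                                         # any case will do
# Efficient formula (Equation 4.7) for d_i = b - b_(-i)
d_i = XtX_inv @ X[i] * e[i] / (1 - h[i])

# Brute force: refit without case i
keep = np.arange(n) != i
b_minus_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
print(np.allclose(d_i, b - b_minus_i))

# Cook's D from d_i agrees with the hat-value/residual formula (Equation 4.4)
D_from_d = d_i @ (X.T @ X) @ d_i / ((k + 1) * s2)
e_std_sq = e[i]**2 / (s2 * (1 - h[i]))        # squared standardized residual
print(np.isclose(D_from_d, e_std_sq / (k + 1) * h[i] / (1 - h[i])))
```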

The Added-Variable Plot

In vector form, the fitted multiple regression model is

$$\mathbf{y} = b_0 \mathbf{1}_n + b_1 \mathbf{x}_1 + \cdots + b_k \mathbf{x}_k + \mathbf{e}$$ (4.8)

where $\mathbf{y}$ and the $\mathbf{x}_j$ are n × 1 column vectors, respectively, for the response and regressors, $\mathbf{1}_n$ is an n × 1 vector of 1s, and $\mathbf{e}$ is the n × 1 residual vector. In least-squares regression, $\hat{\mathbf{y}}$ is the orthogonal projection of y onto the subspace spanned by the regressors (including the constant regressor $\mathbf{1}_n$). Let $\mathbf{y}^{(1)}$ and $\mathbf{x}^{(1)}$ be the projections of $\mathbf{y}$ and $\mathbf{x}_1$, respectively, onto the orthogonal complement of the subspace spanned by $\mathbf{1}_n$ and $\mathbf{x}_2, \dots, \mathbf{x}_k$ (i.e., the residual vectors from the regressions of y and $\mathbf{x}_1$ on the other x's). Then, by the geometry of projections, the orthogonal projection of $\mathbf{y}^{(1)}$ onto $\mathbf{x}^{(1)}$ is $b_1 \mathbf{x}^{(1)}$, and $\mathbf{y}^{(1)} - b_1 \mathbf{x}^{(1)} = \mathbf{e}$ is the residual vector from the overall regression in Equation 4.8.

Chapter 5. Nonnormality and Nonconstant Error Variance

As explained in Chapter 2, the standard linear regression model assumes that the errors are normally and independently distributed with 0 means and constant variance. The assumption that the errors all have 0 means is equivalent to assuming that the functional form of the model is correct, an assumption that I'll address in the next chapter on nonlinearity. This chapter discusses diagnostics for nonnormal errors and nonconstant error variance, two problems that both concern the conditional distribution of the response variable, that often occur together, and that also often have a common solution.

The assumption of normally distributed errors is almost always arbitrary and made for mathematical convenience. Nevertheless, the central limit theorem ensures that under very broad conditions, inference based on the least-squares regression coefficients is approximately valid in all but small samples. Why, then, should we be concerned about nonnormal errors?

First, although the validity of inferences for least-squares estimation is robust (p-values for tests and the coverage of confidence intervals are approximately correct in large samples even when the assumption of normality is violated), the method is not robust in efficiency: The least-squares estimator is maximally efficient (has smallest sampling variance) among unbiased estimators when the errors are normal. For some types of error distributions, however, particularly those with heavy tails, the efficiency of least-squares estimation decreases markedly. In these cases, the least-squares estimator becomes much less efficient than alternatives (e.g., robust estimators, or least squares augmented by diagnostics). To a large extent, heavy-tailed error distributions are problematic because they give rise to outliers, an issue that I addressed in the previous chapter.

A commonly quoted justification of least-squares estimation, called the Gauss–Markov theorem, states that the least-squares coefficients are the most efficient unbiased estimators that are linear functions of the response observations $y_i$. This result depends on the assumptions of linearity, constant error variance, and independence, but does not require normality. Although the restriction to linear estimators produces simple sampling properties, it is not compelling in light of the vulnerability of least squares to heavy-tailed error distributions.

Second, highly skewed error distributions, apart from their propensity to generate outliers in the direction of the skew, compromise the interpretation of the least-squares fit. This fit is, after all, a conditional mean (of y given the x's), and the mean is not a good measure of the center of a highly skewed distribution. Consequently, we may prefer to transform the data to produce a symmetric error distribution.

Finally, a multimodal error distribution suggests the omission from the model of one or more discrete explanatory variables that divide the data naturally into groups. An examination of the distribution of the residuals may therefore motivate elaboration of the regression model.

The consequences of seriously violating the assumption of constant error variance are somewhat different: The least-squares coefficients are still unbiased estimators of the population regression coefficients if the assumptions of (1) linearity and (2) independence of the x's and the errors hold, but statistical inference may be compromised, with distorted p-values for hypothesis tests and confidence intervals that don't have the stated coverage. For these negative consequences to occur, nonconstant error variance must typically be severe (see, e.g., Fox, 2016, section 12.2.4 and Exercise 12.5).

Detecting and Correcting Nonnormality

For concreteness, I'll return to the CIA World Factbook data set, introduced in Chapter 3, to regress national infant mortality rates (Infant) on per-capita GDP (GDP), per-capita health expenditures (Health), and the Gini coefficient of income inequality (Gini). The fitted model is

$$\widehat{\text{Infant}} = b_0 + b_1\,\text{GDP} + b_2\,\text{Health} + b_3\,\text{Gini}$$ (5.1)

Although the signs of the regression coefficients make sense (holding the other explanatory variables constant, infant mortality declines with GDP and health expenditures, and increases with increasing inequality), the results of the regression are otherwise disappointing, with only the coefficient of GDP exceeding twice its standard error. What's at issue here isn't just our ability to detect the relationships that we expected but more generally the precision with which the regression coefficients are estimated, reflected in their standard errors. This example is artificial, in that our preliminary examination of the data in Chapter 3 suggested strongly that these variables shouldn't be analyzed untransformed on their original scales, but it provides us with an opportunity to see how regression diagnostics can help us detect deficiencies in the model.

The regression residuals are the key to the behavior of the errors, but, as explained in Chapter 4, the distribution of the residuals $e_i$ isn't as simple as the distribution of the errors $\varepsilon_i$: Even if the errors are independent with equal variances,

the residuals are correlated, with generally different variances. In Chapter 4, I introduced the studentized residuals, $e_i^*$, which, though correlated, have equal variances and are distributed as $t_{n-k-2}$ if the assumptions of the regression model are correct. One way, therefore, to address the assumption of normality is to compare the distribution of the studentized residuals with $t_{n-k-2}$ in a quantile-comparison plot. Unlike in Chapter 3, however, where we drew QQ plots for independently sampled data, the studentized residuals from a least-squares fit are correlated. Assessing their sampling variation is therefore more difficult. Atkinson (1985) suggested constructing a pointwise confidence envelope by a parametric bootstrap, using simulated random sampling, as follows:

1. Compute the n fitted values $\hat{y}_i$ and the residual standard deviation $s_e$ from the regression model.

2. Construct B random samples of y-values, each of size n. For the bth such sample, draw n independent random errors $\varepsilon_{ib}$ from the normal distribution $N(0, s_e^2)$ and compute $y_{ib} = \hat{y}_i + \varepsilon_{ib}$, for $i = 1, \dots, n$.

3. For each simulated sample b, regress the $y_{ib}$ on the original regressors $x_{ij}$, obtaining studentized residuals $e_{ib}^*$. Arrange these studentized residuals in ascending order as if for a QQ plot, $e_{(1)b}^* \le e_{(2)b}^* \le \cdots \le e_{(n)b}^*$.

4. To construct a $100(1 - \alpha)\%$ confidence interval for the ith ordered studentized residual $e_{(i)}^*$, find the $\alpha/2$ and $1 - \alpha/2$ quantiles of the B simulated studentized residuals in position (i), $e_{(i)1}^*, \dots, e_{(i)B}^*$. For example, if $\alpha = .05$ and B = 1,000, then the confidence interval runs from the 25th to the 975th ordered $e_{(i)b}^*$.
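The procedure is straightforward to code. Here is a compact Python sketch of the parametric-bootstrap envelope (my implementation of the steps above, using statsmodels for the studentized residuals; plotting is left out to keep it short):

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

def qq_envelope(results, B=1000, alpha=0.05, seed=0):
    """Pointwise (1 - alpha) envelope for ordered studentized residuals from a
    fitted statsmodels OLS results object, following Atkinson (1985)."""
    rng = np.random.default_rng(seed)
    X = results.model.exog
    n = X.shape[0]
    fitted = results.fittedvalues
    s_e = np.sqrt(results.scale)

    sims = np.empty((B, n))
    for b in range(B):
        y_b = fitted + rng.normal(0.0, s_e, size=n)           # step 2
        r_b = sm.OLS(y_b, X).fit().get_influence().resid_studentized_external
        sims[b] = np.sort(r_b)                                # step 3
    lower = np.quantile(sims, alpha / 2, axis=0)              # step 4
    upper = np.quantile(sims, 1 - alpha / 2, axis=0)

    observed = np.sort(results.get_influence().resid_studentized_external)
    df = n - X.shape[1] - 1
    t_quantiles = stats.t.ppf((np.arange(1, n + 1) - 0.5) / n, df)
    return t_quantiles, observed, lower, upper

# Usage: fit a model, then plot `observed` against `t_quantiles` with the envelope.
# results = sm.OLS(y, X).fit()
# tq, obs, lo, hi = qq_envelope(results, B=1000)
```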

Applying Atkinson's procedure with B = 1,000 to the regression model fit to the CIA World Factbook data (Equation 5.1) produces the QQ plot in Figure 5.1. It's clear from this graph that the distribution of studentized residuals is positively skewed, with points straying above the fitted line and the confidence envelope both at the left and at the right of the QQ plot.

Figure 5.1 QQ plot comparing the studentized residuals from the CIA World Factbook regression with the t-distribution with n − k − 2 degrees of freedom. The broken lines represent a pointwise 95% confidence envelope computed by simulated sampling.

The QQ plot draws attention to the tail behavior of the studentized residuals but is less effective in visualizing their distribution as a whole. Figure 5.2 shows an adaptive kernel nonparametric density plot for the studentized residuals from the CIA World Factbook regression. The positive skew of the studentized residuals is apparent, and there's also a hint of a bimodal distribution.

Figure 5.2 Adaptive kernel density estimate for the studentized residuals from the CIA World Factbook regression.

The positive skew of the studentized residuals suggests transforming the response variable, infant mortality, down the ladder of powers. Boxplots of the studentized residuals for several such power transformations are shown in Figure 5.3. As we might have anticipated from the work we did in Chapter 3, the log transformation does a good job of making the distribution of the studentized residuals symmetric, with, however, one outlier (Luxembourg) at the high end of the distribution.¹

¹ The studentized residual for Luxembourg, $e^* = 4.02$, is associated with a Bonferroni p-value of 0.013. Luxembourg, which has the largest per-capita GDP in the data set, is an outlier only because I fit an unreasonable regression model to the data, specifying a linear partial relationship between infant mortality or log-transformed infant mortality and GDP. We'll see many indications in this chapter, and even more clearly in the next chapter on nonlinearity, that the model is poorly specified. Indeed, and as I mentioned, the work that we did in Chapter 3 examining the data in the CIA World Factbook data set suggests transforming GDP per capita prior to fitting a regression model to the data.

Figure 5.3 Boxplots of studentized residuals resulting from various transformations of infant mortality down the ladder of powers.
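A comparison like Figure 5.3 can be computed in a few lines. The Python sketch below assumes the CIA World Factbook variables are in a pandas data frame named cia with columns infant, gdp, health, and gini (these names are my assumption); the power family is the simple Box–Cox family used in Chapter 3, $(y^\lambda - 1)/\lambda$, with the log for $\lambda = 0$.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def box_cox(y, lam):
    """Box-Cox power transformation: (y^lam - 1)/lam, or log(y) when lam = 0."""
    y = np.asarray(y, dtype=float)
    return np.log(y) if lam == 0 else (y**lam - 1) / lam

def studentized_by_power(cia: pd.DataFrame, powers=(-1, -0.5, 0, 0.5, 1)):
    """Studentized residuals from regressing each transformation of infant
    mortality on gdp, health, and gini; one column per power."""
    out = {}
    for lam in powers:
        data = cia.assign(t_infant=box_cox(cia["infant"], lam))
        fit = smf.ols("t_infant ~ gdp + health + gini", data=data).fit()
        out[lam] = fit.get_influence().resid_studentized_external
    return pd.DataFrame(out, index=cia.index)

# e.g., studentized_by_power(cia).boxplot() produces a display like Figure 5.3
```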

The regression for log(infant mortality) takes the form

$$\widehat{\log_e(\text{Infant})} = b_0 + b_1\,\text{GDP} + b_2\,\text{Health} + b_3\,\text{Gini}$$ (5.2)

The regression coefficients and the residual standard deviation aren't directly comparable with those in Equation 5.1 (p. 63) because the response in Equation 5.2 is on the log scale, but the $R^2$ for the regression is now much larger than before, and the coefficients of GDP, Health, and Gini now all easily exceed twice their standard errors. Nevertheless, and as we'll discover in the next chapter, the regression model still has serious deficiencies.

*Selecting a Normalizing Transformation Analytically

As I mentioned in Chapter 3 in discussing the Box–Cox powers, Box and Cox (1964) introduced their family of transformations in the context of transforming the response variable in a regression toward normality (see Equation 3.1 on p. 21). In this context, the pseudo log-likelihood is of the form

$$\log_e L(\lambda, \boldsymbol{\beta}', \sigma_\varepsilon'^2) = \sum_{i=1}^{n} \log_e\left[\frac{1}{\sigma_\varepsilon'}\,\phi\!\left(\frac{\varepsilon_i'}{\sigma_\varepsilon'}\right)\right] + (\lambda - 1)\sum_{i=1}^{n}\log_e y_i$$

where the $\varepsilon_i'$ are the errors and $\sigma_\varepsilon'^2$ is the error variance associated with the transformed response; $\phi(\cdot)$ is the standard normal density function, with mean 0 and variance 1. Applying Box and Cox's method to the CIA World Factbook regression produces an estimate $\hat{\lambda}$ of the transformation parameter, along with its standard error. The Wald 95% confidence interval for the power transformation is then $\hat{\lambda} \pm 1.96\,\text{SE}(\hat{\lambda})$. Because $\lambda = 1$ (i.e., no transformation) is far outside the confidence interval, there is strong support for transforming the response, but the log transformation ($\lambda = 0$) that I selected by trial and error is slightly outside of the confidence interval for $\lambda$.

Atkinson (1985) introduced a constructed-variable diagnostic plot based on an approximate score test for the transformation parameter in the Box–Cox regression model. The constructed variable is computed as

$$g_{\text{BC},i} = y_i\left[\log_e\!\left(\frac{y_i}{\tilde{y}}\right) - 1\right]$$

where $\tilde{y}$ is the geometric mean of the response. We then regress y on the x's and the constructed variable,

$$y_i = \beta_0' + \beta_1' x_{i1} + \cdots + \beta_k' x_{ik} + \gamma\, g_{\text{BC},i} + \varepsilon_i'$$ (5.3)

I put primes on the $\beta$'s and $\varepsilon$'s because they differ from the corresponding values in the original regression model. The hypothesis $H_0\colon \gamma = 0$ is equivalent to the hypothesis that the transformation parameter $\lambda$ in the Box–Cox regression model is equal to 1, that is, that no transformation is required. The t-statistic for the constructed-variable coefficient, $t_0 = \hat{\gamma}/\text{SE}(\hat{\gamma})$, is an approximate score test for this hypothesis, with n − k − 2 degrees of freedom. More interestingly, an added-variable plot for the constructed variable $g_{\text{BC}}$ in Equation 5.3 shows leverage and influence on the estimation of $\lambda$ and hence on the decision whether or not to transform y.

For the original regression model in Equation 5.1 (p. 63) fit to the CIA World Factbook data, the t-statistic for the constructed variable is very large, with a minuscule p-value: As we already know, there is strong evidence in the data of the need to transform infant mortality. The corresponding constructed-variable plot in Figure 5.4 reveals that while there are high-leverage cases in the determination of $\lambda$, evidence for the transformation is spread across the whole data set.

Figure 5.4 Constructed-variable plot for the transformation of infant mortality in the Box–Cox regression for the CIA World Factbook data.
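The score test is easy to compute. A minimal Python sketch of the constructed-variable approach (again assuming a data frame cia with columns infant, gdp, health, and gini; these names are my own):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import gmean

def constructed_variable_test(cia: pd.DataFrame):
    """Approximate score test for the Box-Cox transformation of infant mortality:
    regress the response on the x's plus g_BC = y * (log(y / geometric mean) - 1)
    and examine the t-statistic for g_BC."""
    y = cia["infant"].to_numpy(dtype=float)
    g_bc = y * (np.log(y / gmean(y)) - 1)
    data = cia.assign(g_bc=g_bc)
    fit = smf.ols("infant ~ gdp + health + gini + g_bc", data=data).fit()
    return fit.tvalues["g_bc"], fit.pvalues["g_bc"]

# A large |t| (tiny p-value) for g_bc is evidence that the response should be
# transformed; an added-variable plot for g_bc shows which cases drive the result.
```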

I've emphasized transformation as an approach to modeling a conditionally nonnormal response variable, partly because the mean of a skewed distribution (here, the skewed conditional distribution of the response) is not a good measure of its center. An alternative to response transformation is to abandon the normal linear model fit by least squares and instead to construct a regression model for the conditional median of the response,² or, even more generally, for several conditional quantiles of the response. This approach, called quantile regression, is feasible, if more complex than linear least-squares regression; see Koenker (2005).

² Median regression is equivalent to finding the regression coefficients that minimize the sum of absolute, as opposed to squared, residuals, $\sum_{i=1}^{n} |e_i|$; that is, the least absolute values (LAV) criterion rather than the least-squares criterion.

Detecting and Dealing With Nonconstant Error Variance

As I pointed out in Chapter 4 in discussing added-variable plots, the natural graph for a regression with a numeric response variable and k numeric explanatory variables is a (k + 1)-dimensional scatterplot, which, of course, we can only draw when k is at most 2. Were we able to draw the natural (k + 1)-dimensional scatterplot of the data, we would in principle be able to detect arbitrary patterns of nonconstant conditional variation around a fitted k-dimensional regression surface.

The challenge of diagnosing nonconstant error variance, then, is that we have to decide what more specific patterns to look for, and then to project the (k + 1)-dimensional cloud of data points onto a two- or three-dimensional space, similar in spirit to what we did to construct two-dimensional added-variable plots to examine leverage and influence on the regression coefficients. A caveat is that projecting the higher dimensional point cloud into two or three dimensions may lose important information, causing us to mistake another problem, such as nonlinearity or unmodeled interaction, for nonconstant error variance (see, e.g., Cook, 1998, section 1.2.1).

A very common pattern is for the conditional variation of y to increase with the level of y. It consequently seems natural to plot residuals e against y-values to see if the spread of the residuals changes as we scan the plot from left to right, but plotting e versus y suffers from two problems:

1. As we know, the least-squares residuals e_i generally have unequal variances even if the errors ε_i have constant variance σ_ε². The studentized residuals e_i*, however, have constant variance under the assumptions of the regression model.

2. The residuals e_i and the y_i-values are correlated because the former is a component of the latter: It turns out that the correlation between y and e is √(1 − R²), where R is the multiple correlation from the regression, ensuring that a plot of residuals against y-values will be tilted. The tilt can make it difficult to judge departures from constant conditional variation.3

This observation suggests the alternative of plotting residuals against fitted values, because a property of the least-squares fit is that the residuals and fitted values are uncorrelated, r(e, ŷ) = 0.

3 See Figure 5.6 (p. 73) for a similar phenomenon.

Two examples plotting studentized residuals versus fitted values appear in Figure 5.5.4 Both are for data artificially generated from the simple linear regression model y = β_0 + β_1x + ε, with fixed values of β_0 and β_1, and with x-values sampled randomly and uniformly on the interval [1, 10]; the same xs are used for both panels of the graph.

4 An alternative is to plot squared or absolute studentized residuals versus fitted values, which may reveal more clearly whether the magnitude of the studentized residuals changes with the level of the response. We consider a similar diagnostic—a spread-level plot of studentized residuals—below.

In Panel (a), the errors are sampled from a common normal distribution with 0 mean and constant variance equal to 1, ε_i ∼ NID(0, 1). In Panel (b), the errors are sampled from normal distributions with 0 means and variances equal to x_i², ε_i ∼ N(0, x_i²).

Figure 5.5: Plots of studentized residuals versus fitted values for regressions with two artificially generated data sets: (a) errors generated with constant variance; (b) errors generated with standard deviation proportional to x. The solid horizontal line is drawn at e* = 0; the broken line is for a loess smooth of the points; and the dash-dotted lines are for separate loess smooths of the positive and negative residuals from the central smooth. [In Panel (a) the outer smooths are roughly parallel; in Panel (b) they fan out from left to right.]

The broken line in each panel is a loess smooth (introduced in Chapter 3). The dash-dotted lines are separate loess smooths of the positive and negative residuals from the central smooth, and the vertical separation of these lines represents the variation of the conditional distribution of the studentized residuals given the fitted values. If the variation of the studentized residuals is approximately constant as we scan the plot from left to right, then the separation of the two outer smooths should be roughly constant, as is the case in Panel (a) but not in Panel (b), where the variation of the studentized residuals increases with the level of the fitted values. Finally, there is a horizontal line on each panel drawn at e* = 0. If the functional form of the model is correct, then the central smooth should roughly track this horizontal line, as is the case in both panels.
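The following is a rough Python sketch that mimics the construction of Figure 5.5 under stated assumptions: the intercept and slope (1 and 2) and the sample size are arbitrary choices, since the values used in the text are not shown.

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(1, 10, n)                  # x sampled uniformly on [1, 10]
X = sm.add_constant(x)
errors = {
    "(a) constant variance": rng.normal(0, 1, n),
    "(b) sd proportional to x": rng.normal(0, x),   # sd_i = x_i
}

fig, axes = plt.subplots(1, 2, figsize=(9, 4), sharey=True)
for ax, (title, e) in zip(axes, errors.items()):
    y = 1 + 2 * x + e                      # arbitrary intercept and slope
    fit = sm.OLS(y, X).fit()
    rstudent = fit.get_influence().resid_studentized_external
    smooth = sm.nonparametric.lowess(rstudent, fit.fittedvalues)
    ax.scatter(fit.fittedvalues, rstudent, s=10)
    ax.plot(smooth[:, 0], smooth[:, 1])    # central loess smooth
    ax.axhline(0, linestyle="--")          # reference line at e* = 0
    ax.set_title(title)
    ax.set_xlabel("fitted values")
axes[0].set_ylabel("studentized residuals")
plt.show()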

Figure 5.6 shows plots of studentized residuals versus fitted values for two regression models fit to the CIA World Factbook data.

Figure 5.6: Plots of studentized residuals versus fitted values for two regressions fit to the CIA World Factbook data: (a) with untransformed infant mortality as the response; (b) with log-transformed infant mortality as the response. [Both panels show a central loess smooth and outer smooths of the positive and negative residuals.]

The plot in Panel (a) is for the original model that we fit to the untransformed data (Equation 5.1 on p. 63). The extreme nonlinearity apparent in this graph suggests that there are serious problems with the model, and, as a consequence, it is hard to examine the plot for nonconstant conditional variation: The plot is so tilted at the left and the right that it’s difficult to judge how the vertical separation of the outer smooths is changing, because our eye is drawn to the least distance between the two lines, not to the vertical distance. Looking closely, however, it appears as if the spread generally increases from left to right. Another problem with the fit is that there are a few negative fitted values, which aren’t sensible for the necessarily positive response variable, infant mortality. The plot in Panel (b) is for the regression fit to the log-transformed response (Equation 5.2 on p. 67). Nonlinearity is still apparent, but variation around the central smooth now seems approximately constant.

A word about plotting residuals versus fitted values as a nonlinearity diagnostic: As here, the graph can reveal problems with the functional specification of the model, but it isn’t very helpful for determining where the problem lies and how to correct it. In the next chapter, I’ll introduce more generally useful graphs for diagnosing nonlinearity.

Thinking about the size of the studentized residuals as a measure of variation, we can adapt Tukey’s spread-level plot (introduced in Chapter 3), plotting the log of the absolute studentized residuals versus the log of the fitted values. The latter are only defined, however, when the fitted values are positive.

Examples of studentized-residual spread-level plots for the two CIA regressions appear in Figure 5.7. The lines on both graphs are fit by robust regression.

Figure 5.7: Spread-level plots for two regressions fit to the CIA World Factbook data: (a) with untransformed infant mortality as the response and (b) with log-transformed infant mortality as the response. [Both panels plot absolute studentized residuals against positive fitted values on log-log axes, with a robust regression line.]

The tilted plot in Panel (a) suggests that conditional variation increases with the level of infant mortality. As in a traditional spread-level plot, we can use the slope b of the fitted line to suggest a variance-stabilizing power transformation of y, with power p = 1 − b. Here, the suggested power is not far from the log transformation (p = 0) that I selected to make the distribution of the residuals more symmetric, but closer to the cube-root transformation (p = 1/3). To draw this graph, 17 cases with negative fitted values were ignored. The approximately horizontal pattern in Panel (b) suggests that the log transformation was successful in stabilizing residual variation. Here, the slope is close to 0, and the suggested transformation is very close to no transformation at all (i.e., p = 1).
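As a hedged sketch (assumed names, not the book's code), the spread-level slope and the suggested power can be computed as follows; fit stands for an already-fitted statsmodels OLS results object.

import numpy as np
import statsmodels.api as sm

def spread_level(fit):
    """Slope b of a robust regression of log|studentized residual| on
    log(fitted value), and the suggested variance-stabilizing power 1 - b.
    Cases with nonpositive fitted values are dropped, as in the text."""
    rstudent = np.asarray(fit.get_influence().resid_studentized_external)
    fitted = np.asarray(fit.fittedvalues)
    keep = fitted > 0
    ly = np.log(np.abs(rstudent[keep]))
    lx = np.log(fitted[keep])
    robust = sm.RLM(ly, sm.add_constant(lx)).fit()   # robust (Huber M-estimator) line
    b = robust.params[1]
    return b, 1 - b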

Testing for Nonconstant Error Variance

Breusch and Pagan (1979) introduced a simple and commonly used score test for nonconstant error variance based on an auxiliary model in which the variance of the errors in the main regression model depends on a linear function of regressors z_1, …, z_q through an unspecified function g(·):

V(ε_i) = g(γ_0 + γ_1 z_i1 + ⋯ + γ_q z_iq)

In most applications, there is just one z, the fitted values ŷ_i from the main regression (producing a test for nonconstant error variance independently suggested by Cook & Weisberg, 1983), or the zs are the same as the regressors x_1, …, x_k from the main regression.

The Breusch–Pagan test is implemented by regressing the squared standardized residuals u_i = e_i²/σ̂_ε²—computed using the MLE of the error variance, σ̂_ε² = Σe_i²/n—on the zs:

u_i = γ_0 + γ_1 z_i1 + ⋯ + γ_q z_iq + ω_i

The score statistic S = Σ(û_i − ū)²/2 follows an asymptotic chi-square distribution with q degrees of freedom under the null hypothesis of constant error variance, V(ε_i) = σ_ε², i = 1, …, n. The û_i are the fitted values from the auxiliary regression of the u_i on the zs.
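The following is a minimal Python sketch of the score test just described; fit is an assumed name for a fitted statsmodels OLS results object, and by default the single z is the vector of fitted values (the Cook and Weisberg variant).

import numpy as np
import statsmodels.api as sm
from scipy import stats

def breusch_pagan_score(fit, Z=None):
    """Breusch-Pagan score test for nonconstant error variance; Z defaults
    to the fitted values, testing dependence of the error variance on the
    level of the response."""
    e = np.asarray(fit.resid)
    n = len(e)
    sigma2_mle = np.sum(e**2) / n              # MLE of the error variance
    u = e**2 / sigma2_mle                      # squared standardized residuals
    if Z is None:
        Z = fit.fittedvalues
    aux = sm.OLS(u, sm.add_constant(Z)).fit()  # auxiliary regression of u on the zs
    S = aux.ess / 2                            # score statistic: half the regression SS
    df = int(aux.df_model)
    return S, df, stats.chi2.sf(S, df)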

Returning to our initial regression for the CIA World Factbook data (Equation 5.1 on p. 63), the Breusch–Pagan test for the dependence of the error variance on the fitted values produces a score statistic on 1 degree of freedom for which p = 0.025—that is, some but not overwhelming evidence for the dependence of the error variance on the level of the response. Applying the same test to the model for log-transformed infant mortality (Equation 5.2 on p. 67) curiously produces stronger evidence for nonconstant error variance: on 1 degree of freedom, p = 0.0058. Plotting squared residuals against fitted values for each auxiliary regression reveals the source of this anomaly:5 Luxembourg is a high-leverage outlier in both auxiliary regressions, decreasing the score statistic for the first test and greatly increasing it for the second. Moreover, Luxembourg derives its leverage in the auxiliary regressions by virtue of having a very large negative fitted value of infant mortality, produced by misspecifying the partial relationship between infant mortality (or transformed infant mortality) and GDP per capita.6

5 These graphs aren’t shown; I invite the reader to draw them.

6 I’ll address misspecification of the regression function in the next chapter. Despite this deficiency, I retained the example here because it nicely illustrates how different apparent problems with regression models can be interrelated, and why (as advocated in Chapter 3) it pays to examine the data prior to specifying a regression model.

Robust Coefficient Standard Errors

I’ve explained how transforming the response can stabilize the error variance when there’s a systematic relationship between the conditional variation and the level of the response. There are also other strategies for dealing with nonconstant error variance, one of which I’ll describe briefly in this section: sticking with the ordinary least squares (OLS) estimates of the regression coefficients but adjusting their standard errors to reflect nonconstant error variance. Another approach to nonconstant error variance, weighted least squares (WLS) estimation, is described later in the chapter.

There are additional approaches to nonconstant error variance that I won’t elaborate here. One such approach is to fit a GLM—such as a gamma model, an inverse-Gaussian model, or (for count data) a Poisson or negative binomial model—in which nonconstant conditional variance is built into the model (see Chapter 8). Yet another possibility is to model both the conditional mean and the conditional variance of the response (see, e.g., Harvey, 1976), and quantile regression (mentioned earlier in the chapter in connection with nonnormality) can also reveal changes in conditional variation. Indeed, how the variability of the response depends on the explanatory variables may be of direct substantive interest.

As I explained, the OLS estimators of the regression coefficients are unbiased even when the errors have different variances, but the conventionally estimated sampling variances of the OLS coefficients can be seriously biased. One solution to this problem, independently suggested by Huber (1967) and White (1980), is to retain the OLS estimator of the regression coefficients but to compute robust coefficient standard errors that are consistent even in the presence of nonconstant error variance. Huber–White robust standard errors have subsequently been tweaked in various ways to produce better performance in small samples (see Long & Ervin, 2000). I’ll defer the details of robust standard errors to a starred section and simply apply the method to the original regression (in Equation 5.1 on p. 63) that I fit to the CIA World Factbook data, as shown in Table 5.1. As expected, the robust standard errors are larger than the corresponding conventional standard errors, especially for the GDP coefficient.

Table 5.1: Conventional, Huber–White (HC3), and bootstrapped standard errors for the OLS regression of infant mortality fit to the CIA World Factbook data.

Note: SE = standard error.

Although they are often employed in practice, there are two disadvantages of using Huber–White standard errors:

1. The OLS coefficients are unbiased estimates of the population regression coefficients if the other assumptions of the regression model hold, but they aren’t generally efficient estimates when the error variance isn’t constant. Another approach, such as WLS estimation (discussed immediately below), may produce more efficient estimates.

2. Nonconstant error variance often co-occurs with other problems, such as nonnormal errors and nonlinearity. In these cases, there is more wrong with the model than misleading standard errors. This is true, for example, for the OLS regression that I fit to the CIA World Factbook data. Transforming the response often helps address these other associated problems along with nonconstant error variance.
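As a brief, hedged illustration (artificial data, not the CIA World Factbook regression), Huber-White standard errors of the HC3 variety are available directly in Python's statsmodels:

import numpy as np
import statsmodels.api as sm

# Artificial data with error SD increasing in x, to mimic nonconstant variance
rng = np.random.default_rng(3)
x = rng.uniform(1, 10, 200)
y = 1 + 2 * x + rng.normal(0, x)
fit = sm.OLS(y, sm.add_constant(x)).fit()

print(fit.bse)       # conventional OLS standard errors
print(fit.HC3_se)    # Huber-White standard errors, HC3 variant
# Equivalent: fit.get_robustcov_results(cov_type="HC3").bse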


Bootstrapping

Another way to obtain robust statistical inferences for OLS regression in the presence of nonconstant error variance (and nonnormality) is to employ nonparametric bootstrapping, which can be used either to estimate coefficient standard errors or directly for confidence intervals and hypothesis tests.7 Bootstrapping entails resampling repeatedly from the observed data to build up empirical sampling distributions for the regression coefficients. Each bootstrap sample draws n cases with replacement from the n cases in the original sample. The regression model is then fit to each of the B bootstrap samples, the fitted regression coefficients are saved, and the bootstrap distribution of each of the k + 1 regression coefficients is examined. The standard deviations of the bootstrapped regression coefficients estimate the sampling standard deviations of the coefficients and consequently serve as bootstrapped coefficient standard errors.

7 See Efron and Tibshirani (1993) and Davison and Hinkley (1997) for extensive treatments of bootstrapping, or Fox (2016, Chapter 21) for a briefer exposition. I used a parametric version of the bootstrap, assuming normally distributed errors, earlier in the chapter to construct a pointwise confidence envelope in a QQ plot of studentized residuals.

The last column of Table 5.1 shows bootstrapped standard errors for the coefficients of the OLS regression fit to the CIA World Factbook data, based on B = 1,000 bootstrap samples, each of size n = 134. The bootstrapped standard errors are similar to, if slightly smaller than, the corresponding Huber–White standard errors and larger than the conventional OLS standard errors.
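A minimal sketch of the nonparametric case bootstrap for regression coefficients follows; the formula string and data frame passed to the function are hypothetical placeholders, and B defaults to 1,000 resamples as in the text.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def bootstrap_se(formula, data, B=1000, seed=0):
    """Bootstrap standard errors for the coefficients of an OLS regression,
    resampling n cases with replacement B times."""
    rng = np.random.default_rng(seed)
    n = len(data)
    boot_coefs = []
    for _ in range(B):
        idx = rng.integers(0, n, n)                     # n cases with replacement
        sample = data.iloc[idx]
        boot_coefs.append(smf.ols(formula, data=sample).fit().params)
    boot_coefs = pd.DataFrame(boot_coefs)
    return boot_coefs.std(ddof=1)                       # bootstrap standard errors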


Weighted Least Squares

The standard linear model described in Chapter 2 assumes that the errors are normally and independently distributed with 0 means and common variance: ε_i ∼ NID(0, σ_ε²). Suppose that the other assumptions of the linear model hold but that the variances of the errors differ, ε_i ∼ N(0, σ_i²). As it stands, the resulting linear model isn’t estimable, although we can proceed as in the previous section to adjust the standard errors of the OLS regression coefficients for unequal error variances.

If, however, we know the pattern of unequal error variances, so that σ_i² = σ_ε²/w_i for known values w_i, then we can compute the MLEs of the regression coefficients by finding the bs that minimize the weighted sum of squared residuals Σw_i e_i², where, as in OLS regression, e_i = y_i − b_0 − b_1 x_i1 − ⋯ − b_k x_ik.

Cases with larger weights w_i have smaller error variances and, consequently, carry more information about the location of the regression surface. The diagnostics described in this book generally work for linear models fit by weighted least squares (WLS) regression, substituting the Pearson residuals √w_i × e_i for the ordinary residuals e_i.

The fly in the ointment of WLS regression is the necessity of knowing the variances of the errors up to a constant of proportionality (i.e., the parameter σ_ε²). In some contexts, it isn’t hard to determine weights. For example, if the explanatory variables are discrete and divide the data into a relatively small number of groups, each with a reasonably large number of cases, we can then use the inverse of the within-group sample variances as weights for the cases in each group. If the number of cases in each group isn’t sufficiently large, however, the resulting uncertainty in the weights may cause us to seriously overestimate the precision of estimation. Similarly, if we believe that the average magnitude of the errors (say, their standard deviation) is proportional to the magnitude of the response, and if the response variable is necessarily positive, then we can base the weights on the fitted values from a preliminary OLS regression. This situation apparently applies to the initial OLS regression (Equation 5.1 on p. 63) that I fit to the CIA World Factbook data, but (1) because the model is poorly specified, there are many negative fitted values, and (2) the distribution of the OLS residuals is positively skewed, casting doubt on the assumption of normally distributed errors.
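A minimal sketch of WLS estimation in Python's statsmodels follows, with artificial data in which the error standard deviation grows with x; statsmodels treats the weights as proportional to the inverse error variances, matching the convention used above.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 100
x = rng.uniform(1, 10, n)
y = 1 + 2 * x + rng.normal(0, x)          # error SD proportional to x
X = sm.add_constant(x)
w = 1.0 / x**2                            # inverse-variance weights
wls_fit = sm.WLS(y, X, weights=w).fit()
print(wls_fit.params, wls_fit.bse)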


*Robust Standard Errors and Weighted Least Squares: Details

In the starred section at the end of Chapter 2, I introduced the matrix form of the normal linear regression model, y = Xβ + ε, where ε ∼ N_n(0, σ_ε²I_n). To accommodate nonconstant error variance, we can stick with the linear model but instead specify ε ∼ N_n(0, Σ), where Σ = diag{σ_1², …, σ_n²} is a diagonal matrix of potentially different error variances. Under this new, more general, model, the variance–covariance matrix of the OLS estimator is

V(b) = (X′X)⁻¹X′ΣX(X′X)⁻¹   (5.4)

Huber (1967) and White (1980) suggest estimating V(b) by substituting Σ̂ = diag{e_1², …, e_n²} for Σ in Equation 5.4, where the e_i are the residuals from the OLS regression. The resulting coefficient-variance estimator is called a sandwich estimator because the equation is like a sandwich, with (X′X)⁻¹ as the “bread” and X′Σ̂X as the “filling.” The robust coefficient standard errors are then the square-root diagonal elements of the estimated V(b).

To compute the robust standard errors shown in Table 5.1, I used a variant of the Huber–White sandwich estimator recommended by Long and Ervin (2000), termed HC3, which substitutes e_i²/(1 − h_i)² for e_i² in the definition of the sandwich estimator.8 Here, h_i is the hat value for case i.

8 HC3 is an abbreviation of heteroscedasticity-consistent estimator number 3, and heteroscedasticity is a Greek-derived mouth-filling synonym for nonconstant variation.
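To connect the formulas to computation, here is a minimal sketch (fit is an assumed name for a fitted statsmodels OLS results object) that evaluates the HC3 sandwich estimator directly from Equation 5.4; the result should agree with statsmodels' built-in fit.HC3_se.

import numpy as np

def hc3_by_hand(fit):
    """HC3 sandwich standard errors from Equation 5.4, with e_i^2/(1 - h_i)^2
    on the diagonal of the 'filling'."""
    X = np.asarray(fit.model.exog)
    e = np.asarray(fit.resid)
    h = fit.get_influence().hat_matrix_diag           # hat values h_i
    bread = np.linalg.inv(X.T @ X)
    filling = X.T @ np.diag(e**2 / (1 - h)**2) @ X    # n x n diagonal; fine for moderate n
    V = bread @ filling @ bread                       # sandwich estimate of V(b)
    return np.sqrt(np.diag(V))                        # robust standard errors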

Sandwich standard errors have applications beyond nonconstant error variance—for example, to dependent observations in panel data collected on individuals who are observed on different occasions. When applied to dependent data, the matrix in the center of the sandwich isn’t diagonal.

The linear model leading to the WLS estimator is y = Xβ + ε, where ε ∼ N_n(0, σ_ε²W⁻¹) and W = diag{w_1, …, w_n} contains the weights. The WLS estimators of β and σ_ε² are then

b_WLS = (X′WX)⁻¹X′Wy
σ̂_ε² = e′We/n

where e = y − Xb_WLS, and the asymptotic estimated covariance matrix of the WLS regression coefficients is

V̂(b_WLS) = σ̂_ε²(X′WX)⁻¹

It is common, as in OLS regression, to substitute the unbiased estimator S_E² = e′We/(n − k − 1) for the ML estimator of the error variance.

Chapter 6. Nonlinearity

The assumption that E(ε) is everywhere 0 implies that the specified regression surface captures the dependency of the conditional mean of y on the xs. Violating the assumption of linearity—broadly construed to mean a departure from the specified functional form of the regression model—therefore implies that the model fails to represent the systematic pattern of relationship between the average response and the explanatory variables. For example, a partial relationship specified to be linear may be nonlinear, or two explanatory variables specified to have additive partial effects may interact in determining y. The fitted model may very well be a useful approximation even if the regression surface E(y) is not precisely captured. In other instances, however, the model can be extremely misleading. I consequently think of nonlinearity (i.e., fitting the wrong equation to the data) as the most fundamental of the problems discussed in this monograph.

Component-plus-residual plots are the primary graphical device for diagnosing nonlinearity. I’ll explain how component-plus-residual plots are constructed, how they can be extended and generalized in various ways, and when they can break down. I’ll also address other topics in this chapter, including tests for nonlinearity and analytic choice of linearizing transformations of the explanatory variables in a regression.

The regression surface is generally high dimensional, even after accounting for regressors (e.g., polynomial terms, dummy variables, and interactions) that are functions of a smaller number of fundamental explanatory variables. As in the case of nonconstant error variance, therefore, it is necessary to focus on particular patterns of departure from linearity. The graphical diagnostics discussed in this chapter represent two-dimensional views of the higher dimensional point-cloud of cases (y_i, x_i1, …, x_ik). With modern computer graphics, the ideas in this chapter can usefully be extended to three dimensions (see, e.g., Cook, 1998). Even so, two- and three-dimensional projections of the data can fail to capture their systematic structure.

Component-Plus-Residual Plots

Although it is useful in multiple regression to plot y against each x (e.g., the row pertaining to y in the scatterplot matrix for y and the xs), these marginal plots do not tell the whole story—and can be misleading—because our interest centers on the partial relationship between y and each x, controlling for the other xs, not on the marginal relationship between y and a single x. Residual-based plots are consequently more relevant in this context.

Plotting residuals or studentized residuals against each x, perhaps augmented by a loess smooth (introduced in Chapter 3), is helpful in detecting departures from linearity. As Figure 6.1 illustrates, however, residual plots cannot distinguish between monotone (i.e., strictly increasing or decreasing) and nonmonotone (e.g., falling and then rising) nonlinearity. The distinction between monotone and nonmonotone nonlinearity is lost in the residual plots because the least-squares fit ensures that the residuals are linearly uncorrelated with each x—that is, are untilted. The distinction is important, because, as we saw in Chapter 3, monotone nonlinearity frequently can be corrected by simple transformations of the variables. In Figure 6.1, for example, Case (a) might be modeled by a power transformation of x, whereas Case (b) cannot be linearized by a power transformation of x and might instead be dealt with by a quadratic specification, y = β_0 + β_1x + β_2x² + ε.

Figure 6.1: Scatterplots (a) and (b) of y versus x and corresponding plots (a′) and (b′) of residuals e versus x in simple regression. The residual plots do not distinguish between (a) a nonlinear but monotone relationship and (b) a nonlinear, nonmonotone relationship. [The two residual plots, (a′) and (b′), are essentially indistinguishable from one another.]

In contrast to the marginal scatterplot of y versus an x, the added-variable plot, introduced in Chapter 4 for detecting influential data, is a partial plot. It turns out, however, that added-variable plots are not well tuned for detecting nonlinearity because they are biased toward linearity (as Cook, 1998, section 14.5 demonstrates). Component-plus-residual plots, also called partial-residual plots, are often an effective alternative. Component-plus-residual plots, however, are not as suitable as added-variable plots for revealing leverage and influence on the regression coefficients.

Define the partial residuals for the jth regressor as

e_i^(j) = e_i + b_j x_ij    (6.1)

That is, add the linear component b_j x_ij of the partial relationship between y and x_j to the least-squares residuals e_i, which may include an unmodeled nonlinear component. Then plot e^(j) versus x_j. By construction, the multiple-regression coefficient b_j is the slope of the simple linear regression of e^(j) on x_j, but nonlinearity may be apparent in the plot as well. This essentially simple idea was suggested independently by Larsen and McCleary (1972) and Wood (1973) and can be traced to work by Ezekiel (1930). A loess smooth may help in interpreting the plot.
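A minimal sketch of component-plus-residual plots in Python's statsmodels follows (not the book's code); plot_ccpr_grid draws one panel per regressor using the partial residuals of Equation 6.1 for a fitted OLS model.

import matplotlib.pyplot as plt
import statsmodels.api as sm

def cpr_plots(fit):
    """Component-plus-residual (partial-residual) plots, one panel per
    regressor, for a fitted statsmodels OLS results object."""
    fig = plt.figure(figsize=(10, 6))
    sm.graphics.plot_ccpr_grid(fit, fig=fig)
    return fig

# For a single regressor named "gdp" (a hypothetical name), the partial
# residuals of Equation 6.1 can also be computed directly:
#     partial_resid = fit.resid + fit.params["gdp"] * data["gdp"]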

To illustrate, I’ll return to the regression of the log infant mortality rate on GDP per capita, per-capita health expenditures, and the Gini coefficient for income inequality in the CIA World Factbook data set, repeating Equation 5.2 (from p. 67):

log(infant mortality) = b_0 + b_1 GDP + b_2 Health + b_3 Gini    (6.2)

Figure 6.2 shows component-plus-residual plots for the three explanatory variables in the regression. The partial relationship of log(infant mortality) to GDP appears to be nonlinear but monotone; to health expenditures, nonlinear and nonmonotone; and to the Gini coefficient, approximately linear.

Figure 6.2: Component-plus-residual plots for the regression of log(infant mortality) on (a) GDP per capita, (b) per-capita health expenditures, and (c) the Gini coefficient of income inequality in the CIA World Factbook data. In each panel, the broken line represents the fitted model and the solid line a loess smooth.

On the basis of Panel (a) in Figure 6.2, where the bulge points to the left, I transformed GDP down the ladder of powers to the log transformation,1 and on the basis of Panel (b), I specified a quadratic partial relationship of log(infant mortality) to health expenditures, obtaining the revised fitted model:

log(infant mortality) = b_0 + b_1 log(GDP) + b_2 Health + b_3 Health² + b_4 Gini    (6.3)

1 Recall the discussion in Chapter 3 of the bulging rule for selecting a linearizing power transformation.

In the context of multiple regression, we generally prefer to transform an x rather than y, unless we see a common pattern of nonlinearity in the partial relationships of y to several xs: Transforming y changes the shape of its relationship to all of the xs simultaneously, and also changes the shape of the distribution of the residuals.

Figure 6.3 shows the component-plus-residual plots for the respecified model, and all three plots are now straight. As a consequence, the R² for the fitted model has increased, from 0.712 to 0.844, and the standard deviation of the residuals has decreased, from 0.590 to 0.436.

Figure 6.3: Component-plus-residual plots for the respecified regression of log(infant mortality) on (a) log(GDP per capita), (b) per-capita health expenditures and health expenditures squared, and (c) the Gini coefficient of income inequality. In each panel, the broken line represents the fitted model and the solid line a loess smooth.

The component-plus-residual plots in Panels (a) and (c) of Figure 6.3, for log(GDP) and the Gini coefficient, respectively, are entirely straightforward: Partial residuals on the vertical axis are computed directly from Equation 6.1, and log(GDP) and the Gini coefficient appear on the horizontal axes of the graphs. The plot in Panel (b) for per-capita health expenditures is more complicated, in that the partial fit is quadratic and therefore uses two regressors, Health and Health². Here, the partial residuals are computed as

e_i^(Health) = e_i + b_2 Health_i + b_3 Health_i²

and the variable on the horizontal axis is the partial fit,

b_2 Health_i + b_3 Health_i²

If the partial relationship of log(infant mortality) to health expenditures is quadratic as specified, then the resulting component-plus-residual plot should be linear—as is the case for this example.

An alternative to plotting against a transformed x (e.g., log(GDP) in the example), or against a partial fit (for the quadratic relationship to health expenditures in the example), is to plot partial residuals against the original untransformed variable but to show the partial fit as a curve on the graph. I’ve done this for GDP and health expenditures in Figure 6.4. In each panel, the loess line matches the fitted partial regression curve well.

Figure 6.4: Component-plus-residual plots for the respecified regression for log(infant mortality), showing (a) GDP per capita and (b) per-capita health expenditures on the horizontal axis. In each panel, the broken line represents the fitted model and the solid line a loess smooth.

When Are Component-Plus-Residual Plots Accurate?

Cook (1993) explored the circumstances under which component-plus-residual plots accurately visualize the unknown partial regression function f(x_1) in the model2

y_i = β_0 + f(x_i1) + β_2 x_i2 + ⋯ + β_k x_ik + ε_i    (6.4)

where, as in the standard linear model, ε_i ∼ NID(0, σ_ε²). The partial regression function f(·) isn’t necessarily linear, but the other explanatory variables enter the model linearly. Instead of fitting the model in Equation 6.4,3 we fit the working model

y_i = β_0′ + β_1′x_i1 + β_2′x_i2 + ⋯ + β_k′x_ik + ε_i′

in which x_1 enters the model linearly along with the other xs. The partial residuals for the working model estimate β_1′x_1 plus the error rather than f(x_1) plus the error. We hope that any nonlinear part of f(x_1) is captured in the residuals from the working model. Cook (1993) showed that this is the case either if the partial regression function f(·) is linear after all or if the other xs are each linearly related to x_1. We can then legitimately smooth the scatterplot of the partial residuals e^(1) versus x_1 to estimate f(x_1).

2 I focus here on x_1 for notational convenience. I hope that it’s clear that we can equally focus on any of the xs.

3 To fit Equation 6.4 directly requires knowledge of the function f(·) up to one or more unknown parameters, which could then be estimated from the data, by linear least squares if the model can be written as a linear model, or more generally by nonlinear least squares, a topic not discussed in this monograph (see, e.g., Fox, 2016, section 17.4).

The takeaway message is that there’s an advantage in having linearly related xs, a goal that’s promoted, for example, by transforming the xs toward multivariate normality, as described in Chapter 3. In practice, it’s only nonlinearly related xs that seriously threaten the validity of component-plus-residual plots. If, say, x_2 has a strong nonlinear relationship to x_1 and the partial relationship of y to x_1 is also strongly nonlinear, then the population partial residuals may not adequately capture f(x_1). We’ll see another way to deal with this issue in the next section.

A problem can also arise if y is nonlinearly related to a different x (say, x_2) rather than to x_1. If x_1 and x_2 are correlated, that can induce spurious nonlinearity in the component-plus-residual plot for x_1. This possibility suggests trying to correct nonlinearity for one x at a time, though, in my experience, it’s rarely necessary to proceed sequentially: Earlier in this chapter, for example, I simultaneously and successfully addressed nonlinearity discovered in the component-plus-residual plots for both GDP and health expenditures in the regression for log(infant mortality), even though these two explanatory variables are correlated.

More Robust Component-Plus-Residual Plots

Transforming the xs toward linear relationships won’t work, for example, if x_2 or another x in Equation 6.4 is quadratically related to the focal explanatory variable x_1. Mallows (1986) suggested adapting to this situation by fitting a working model that’s quadratic in x_1:

y_i = α_0 + a_11 x_i1 + a_12 x_i1² + α_2 x_i2 + ⋯ + α_k x_ik + ε_i

and then computing augmented partial residuals for x_1 as

e_i′ = e_i + â_11 x_i1 + â_12 x_i1²

where â_11 and â_12 are the least-squares estimates of a_11 and a_12, and the e_i are the least-squares residuals for the working model. This strategy will be successful if the partial relationship of y to the focal x_1 is quadratic, or if relationships among the xs are either linear or quadratic, and it can be extended to higher order polynomial working models.

Cook (1993) introduced an even more generally applicable approach, which he termed CERES plots (where CERES is an acronym for Combining conditional Expectations and RESiduals), that can accommodate arbitrary nonlinear relationships between the other xs and the focal x_1 in Equation 6.4 (p. 87). CERES proceeds by first performing nonparametric regressions of each other x on x_1, using loess for example. These preliminary nonparametric regressions produce fitted values, x̂_ij = Ê(x_j | x_i1) for j = 2, …, k, which then are substituted for x_1 in the working model,

y_i = α_0 + α_2 x_i2 + ⋯ + α_k x_ik + γ_2 x̂_i2 + ⋯ + γ_k x̂_ik + ε_i

fit by OLS regression. The partial residuals for x_1 are computed as

e_i^(CERES) = e_i + g_2 x̂_i2 + ⋯ + g_k x̂_ik

where the g_j are the least-squares estimates of the γ_j and the e_i are the residuals from the working model. The CERES plot graphs e^(CERES) versus x_1.
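A minimal sketch (hypothetical argument names) of the CERES construction just described, using loess for the preliminary nonparametric regressions and statsmodels for the OLS and lowess steps:

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

def ceres_plot(y, x1, others):
    """CERES plot for the focal predictor x1; `others` is a list of 1-d
    arrays holding the remaining explanatory variables."""
    lowess = sm.nonparametric.lowess
    # Estimated conditional expectations E(x_j | x1), at the observed x1 values
    m_hats = [lowess(xj, x1, return_sorted=False) for xj in others]
    X = sm.add_constant(np.column_stack(list(others) + m_hats))
    fit = sm.OLS(y, X).fit()
    gammas = fit.params[1 + len(others):]             # coefficients of the E(x_j | x1) terms
    ceres_resid = fit.resid + np.column_stack(m_hats) @ gammas
    smooth = lowess(ceres_resid, x1)
    plt.scatter(x1, ceres_resid, s=10)
    plt.plot(smooth[:, 0], smooth[:, 1])              # loess smooth of the CERES residuals
    plt.xlabel("x1")
    plt.ylabel("CERES partial residual")
    plt.show()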

Although augmented component-plus-residual plots and CERES plots can produce accurate nonlinearity diagnostics when conventional component-plus-residual plots break down, the latter generally work well, as I mentioned previously. Figure 6.5, for example, shows the conventional component-plus-residual plot, the quadratic component-plus-residual plot, and the CERES plot for GDP per capita in the regression of Equation 6.2 (p. 83). Except for the different scaling of the vertical axis in the CERES plot, the three graphs are nearly indistinguishable.

Figure 6.5: (a) Conventional component-plus-residual plot, (b) quadratic component-plus-residual plot, and (c) CERES plot for GDP in the regression of log(infant mortality) on per-capita GDP, per-capita health expenditures, and the Gini coefficient for income inequality in the CIA World Factbook data. In each panel, the broken line represents the fitted model and the solid line a loess smooth.

Augmented component-plus-residual plots and CERES plots accommodate nonlinear relationships between the other xs and x_1, but, like ordinary component-plus-residual plots, they can be corrupted when there is a nonlinear partial relationship between y and an x other than x_1. For example, if y is nonlinearly related to x_2, then the CERES plot for x_1 won’t necessarily accurately visualize f(x_1).

Component-Plus-Residual Plots for Interactions

Fox and Weisberg (2018) described a framework for component-plus-residual plots that’s general enough to accommodate not only nonlinear terms in a linear model, such as polynomials and transformations of xs, but also interactions of arbitrary complexity. This framework applies to predictor effect plots, which focus serially on each explanatory variable (“predictor”)—termed the focal predictor—in a linear model, partitioning the other explanatory variables into two subsets: (1) conditioning predictors, which interact with the focal predictor, either individually or in combination, and (2) fixed predictors, which simply are to be controlled statistically. The general procedure is as follows:

The focal predictor ranges over its values in the data on the horizontal axis of a multipanel array of two-dimensional scatterplots, each scatterplot of partial residuals versus the focal predictor for a specific combination of values of the conditioning predictors, which also range in combination over their values in the data to define the panels, while the fixed predictors are set to typical values. Conditioning is straightforward for factors, which simply take on in turn each of their various levels; numeric conditioning predictors are set successively to each of several representative values over their ranges. Conversely, fixing predictors is straightforward for numerical explanatory variables, which are typically set to their means; categorical fixed predictors are typically set to their distribution in the data, which is equivalent to computing the means of the dummy regressors that represent them in the linear model. The regression surface is graphed by computing the fitted values under the model for the combinations of predictors in each panel of the predictor effect display. Each case in the data is allocated to one panel, and the residual for the case is added to its fitted value (which is on the portion of the partial regression surface shown in the panel), forming a partial residual.

Even without partial residuals, predictor effect plots are useful for visualizing complex regression models, such as models with transformed explanatory variables, polynomial regressors, regression splines (described later in this chapter), and interactions. Predictor effect plots can be drawn not just for linear models but also for a wide variety of regression models, including the GLMs discussed in Chapter 8.

To illustrate, I’ll elaborate the regression model in Equation 6.2 (p. 83), initially fit to the CIA World Factbook data in this chapter, by including interactions between the numeric explanatory variables GDP per capita, per-capita health expenditures, and the Gini coefficient for income inequality, on the one hand, and the factor region (representing five regions of the world), on the other. The three explanatory variables enter this model linearly, in contrast to the model in Equation 6.3 (on p. 85), in which GDP is log-transformed and there is a quadratic partial regression for health expenditures.

A Type II analysis of variance table for the model with region interactions is shown in Table 6.1, with tests obeying Nelder’s principle of marginality (Nelder, 1977): Tests for a main effect (e.g., the main effect of GDP) are formulated ignoring the interactions to which the main effect is marginal (the interaction between GDP and region). There is very strong evidence for the interaction of region with GDP, moderately strong evidence for the interaction of region with health expenditures, and weaker evidence for the interaction of region with the Gini coefficient.

Because it is difficult to directly interpret the coefficients for regression models with interactions, I’ll instead display the model with predictor effect plots for GDP, health expenditures, and the Gini coefficient (Figure 6.6). In addition to showing the fitted model, the partial residuals in the effect plots, along with the loess smooths, allow us to judge departures from linearity. I used a large span for the loess smooths because of the small number of countries in each region; indeed, the smooths can’t be computed for Oceania because there aren’t enough countries in that region. Several noteworthy points are identified on these graphs: Luxembourg, the United States, Canada, and Singapore stand out within their regions for their relatively high GDP per capita, and the United States stands out for its relatively high per-capita health expenditures.

Table 6.1: Type II analysis of variance for the model with region interactions fit to the CIA World Factbook data. Note: df = degrees of freedom.

Because there aren’t many data points in the several regions, I don’t want to overinterpret the patterns in Figure 6.6. The within-region partial regressions for GDP and health expenditures are more nearly linear than the partial regressions ignoring region in Figure 6.2, but the partial residuals in several of the effect plots in Panels (a) and (b) of Figure 6.6 still show some nonlinearity. Moreover, as the reader can verify, log-transforming GDP and fitting a quadratic in health expenditures greatly decreases the evidence for the interactions of these explanatory variables with region.

Figure 6.6: Predictor effect plots for (a) GDP, (b) health expenditures, and (c) the Gini coefficient for the model in which these variables interact with region. In each panel, the broken line represents the fitted model and the solid line a loess smooth. The gray bands display pointwise 95% confidence intervals around the fitted regression surface. [Each row contains five panels, one per region (Europe, America, Oceania, Asia, Africa); the horizontal axes are GDP per capita and health expenditures per capita in $1,000s and the Gini coefficient of income inequality, and the vertical axes are log(infant mortality).]

Marginal Model Plots

Recall that in a regression with a response variable y and k numeric explanatory variables x1, …, xk, the natural graph of the data is a (k + 1)-dimensional scatterplot of y against the xs. Having estimated a parametric regression model, we could add the fitted regression surface to the graph and compare it with a multidimensional nonparametric regression fit to determine whether the model adequately represents the conditional mean of y as a function of the xs. Of course, this plot is impractical if k > 2, both because we can't draw high-dimensional coordinate graphs and because unconstrained nonparametric regression in high dimensions isn't generally feasible.

Marginal model plots are an essentially simple idea introduced by Cook and Weisberg (1997) (and also described by Cook, 1998, section 15.2.3). As Cook and Weisberg explain, the information in the imagined (k + 1)-dimensional scatterplot is also contained in the infinite set of two-dimensional scatterplots of y against all linear combinations a1x1 + ⋯ + akxk of the xs. In each such two-dimensional plot, we can smooth the points in the graph and compare that with a smooth of the fitted values computed from the regression model. This turns out to be a very general result that applies to many kinds of regression models (including the GLMs described in Chapter 8).

For a linear regression model, it is also of interest to check the assumption of constant conditional variance of y given the xs (discussed more generally in Chapter 5), and this can be accomplished by smoothing the two sets of residuals from the conditional-mean smoothers of the data and the fitted values in a marginal plot (described in the preceding paragraph). Because the fitted values from the model lack residual variation, it's necessary to add in the estimated error variance to the smoothed conditional variance of the fitted values in the marginal plot. Even if the errors have constant variance, the variation around the smoothers in a marginal plot is not in general constant. What's of interest is to compare the conditional variation of the observed data with the conditional variation implied by the model: If the two are similar, then the assumption of constant conditional variance of y given the xs is supported.

Each linear combination a of the xs represents a plotting direction, or axis, in the k-dimensional x-space.4 Drawing an infinite set of two-dimensional graphs is no more practical than drawing a high-dimensional scatterplot, but selecting a potentially interesting small subset of these graphs is feasible.

4 Cook and Weisberg (1997) normalize the as, but this is an inessential mathematical simplification because the directions in the x-space represented by a and ca for a nonzero constant c are the same.

Moreover, by extension, and after adjusting for the lack of residual variation in the fitted values, marginal plots of the data and the fitted values against any x variable should show similar conditional means and variation if the model is correct. In the remainder of this section, I'll consider the marginal plots of y against each x and against the fitted values—these are marginal model plots that often reveal problems with the specification of a regression model.5

5 A relatively subtle point is that I have implicitly assumed that the regressors in the linear model are the same as the explanatory variables, but of course we know that this isn't necessarily the case. If, for example, an explanatory variable x is represented by a transformation (such as a log transformation) or by a polynomial or regression-spline term (the latter described later in this chapter), then we can still plot against x. The fitted values for a linear model are always computed from the coefficients and the regressors, and the model can include dummy regressors for factors, interaction regressors, and so on. Following Cook and Weisberg (1997), we normally construct marginal model plots only for numeric explanatory variables and for fitted values.

Figure 6.7 shows marginal model plots for the regression of log(infant mortality) on GDP per capita, per-capita health expenditures, and the Gini coefficient fit to the CIA World Factbook data set (Equation 6.5 on p. 83). In each panel, the points represent the observed data, the solid gray lines are for a loess smooth of the data, and the broken black lines are for a loess smooth of the fitted values from the regression model.6 Positive and negative residuals from the estimated loess curves are smoothed separately (as in the residual plots used in Chapter 5 to detect nonconstant error variance) so that it's possible to visualize discrepancies in conditional skewness as well as in conditional spread. It's clear that only in the plot against the Gini coefficient does the model perform adequately—in this graph, but in none of the others, the smooths for the fitted values and for the data are similar.

6 These loess smoothers use the span 2/3. As a general matter, and as Cook and Weisberg (1997) point out, it's important to use the same span for smoothing both the data and the fitted values, so that the biases in the smoothed values are the same and cancel when the two smooths are compared.

Figure 6.7: Marginal model plots for the initial regression fit to the CIA World Factbook data set with log(infant mortality) as the response. The points represent the observed data, the solid gray lines show loess smooths for the data, and the broken black lines loess smooths of the fitted values. Both conditional mean and conditional spread smooths are shown.

[Figure: four marginal model plots, labeled (a)–(d), for the initial model: log(infant mortality) plotted against (a) GDP per capita, (b) health expenditures, (c) the Gini coefficient, and (d) the fitted values. In Panels (a), (b), and (d) the loess smooths for the data and for the fitted values diverge noticeably; in Panel (c) they nearly coincide.]

Notice that even though the model specifies linear partial relationships between log(infant mortality) and each explanatory variable, the marginal plots of the fitted values versus each explanatory variable are not necessarily linear: What would validate the model are similar, not necessarily linear, smooths for the data and for the fitted values. Of course, the smoothed conditional means of the fitted values plotted against themselves (the central broken line in Panel (d) of Figure 6.7) are necessarily linear.
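To make the construction concrete, here is a minimal sketch (not the book's code) of a marginal model plot for a single explanatory variable, written in Python with statsmodels and matplotlib; the data frame and column names (`data`, `gdp`, `gini`, `log_infant`) are illustrative assumptions rather than the actual CIA World Factbook file.

```python
# Marginal model plot sketch: smooth the observed response against x and
# smooth the model's fitted values against the same x, using the same span,
# then compare the two smooths.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# Illustrative, simulated stand-in for the real data (names are assumptions).
rng = np.random.default_rng(1)
data = pd.DataFrame({"gdp": rng.uniform(1, 100, 150),
                     "gini": rng.uniform(25, 60, 150)})
data["log_infant"] = (4 - 0.8 * np.log(data["gdp"]) + 0.02 * data["gini"]
                      + rng.normal(0, 0.3, 150))

model = smf.ols("log_infant ~ gdp + gini", data=data).fit()

def marginal_model_plot(x, y, fitted, span=2/3, ax=None):
    """Lowess smooths of the data and of the fitted values against x."""
    ax = ax or plt.gca()
    ax.scatter(x, y, s=10, color="lightgray")
    data_smooth = sm.nonparametric.lowess(y, x, frac=span)
    fit_smooth = sm.nonparametric.lowess(fitted, x, frac=span)
    ax.plot(data_smooth[:, 0], data_smooth[:, 1], color="gray", label="data")
    ax.plot(fit_smooth[:, 0], fit_smooth[:, 1], "k--", label="fitted values")
    ax.legend()
    return ax

marginal_model_plot(data["gdp"], data["log_infant"], model.fittedvalues)
plt.xlabel("GDP per capita")
plt.ylabel("log(infant mortality)")
plt.show()
```

If the two smooths track each other closely for every explanatory variable (and for the fitted values), the specification of the conditional mean is supported.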

Earlier in this chapter, following examination of component-plus-residual plots for the initial CIA regression, I log-transformed GDP and specified a quadratic partial regression in health expenditures (producing the estimated model in Equation 6.3 on p. 85). Marginal model plots for the respecified model appear in Figure 6.8; notwithstanding some mildly unusual data points, they generally support the respecified model.

Figure 6.8: Marginal model plots for the respecified regression model fit to the CIA World Factbook data set. A few unusual data points are identified. The solid gray lines show loess smooths for the data and the broken black lines loess smooths of the fitted values. Only the conditional mean smooths are shown.

[Figure: four marginal model plots, labeled (a)–(d), for the respecified model: log(infant mortality) against (a) GDP per capita, (b) health expenditures, (c) the Gini coefficient, and (d) the fitted values. The smooths for the data and for the fitted values nearly coincide in each panel; Singapore and Luxembourg are labeled in Panel (a) and the Central African Republic in Panel (d).]

Testing for Nonlinearity

A general approach to testing for nonlinearity—or, more expressively, for lack of fit if the model isn't linear in an explanatory variable x—is to specify another, larger model that can capture a more general partial relationship of the response to the explanatory variable. The two models can then be compared by a likelihood-ratio F-test.

This approach is perhaps most straightforward when a numeric explanatory variable in a regression is discrete with a relatively small number of unique values. To illustrate, I introduce data collected by the General Social Survey (GSS), a representative cross-sectional survey of adult Americans conducted periodically over many years by the NORC at the University of Chicago. An interesting feature of the GSS is that it repeats questions, facilitating comparisons over time. I focus here on a 10-item vocabulary test included in 20 GSS surveys between 1978 and 2016. There are 28,867 cases in these 20 surveys, 27,360 of which have complete data for the variables in the models fit below. Because I may alter the specification of the model based on the data, I begin by dividing the data set randomly into two subsamples: one, for data exploration, containing 5,000 complete cases, and the other, for hypothesis testing, containing the remaining 22,360 cases. Regressing vocabulary test score, which ranges from 0 to 10 words correct,7 on age (in years, from 18 to 89), education (in years, from 0 to 20), the year of the survey, a dummy regressor for gender (coded 1 for men and 0 for women), and a dummy regressor for nativity (coded 1 for those who were native-born and 0 for the foreign-born) produces the following results (in the exploratory subsample of cases):

(6.5) [estimated regression equation not reproduced]

7 That the response variable in the regression is limited (with a minimum score of 0 and a maximum of 10) and discrete (with integer scores) suggests that a linear model with normal errors may not be appropriate. As it turns out, however, the normal linear model works reasonably well. I invite the reader to examine the distribution of the residuals from the models fit in this section. As well, I'll revisit this example in Chapter 8 on GLMs.

Thus, holding the other explanatory variables constant, the vocabulary score rises on average by a bit more than 1/7 of a word for each 10 years of age (i.e., multiplying the coefficient for age by 10), rises by a bit more than 1/3 of a word for each year of education, declines slightly over time at the rate of about 1/10 of a word per decade (multiplying the coefficient for year by 10),8 is about 1/9 of a word lower for men than for comparable women, and is more than half a word higher for the native-born than for comparable foreign-born individuals. All the estimated coefficients are at least several times their standard errors in magnitude, with the exception of the coefficient for gender, which is slightly more than twice its standard error. These explanatory variables are moderately predictive of vocabulary score; the standard deviation of the residuals is 1.84 words.

8 The same 10 words are used in all the administrations of the GSS vocabulary test. In a private communication, Barbara Entwisle made the interesting observation that the small decline in vocabulary score over time might in part or in whole reflect declining usage and hence familiarity of (some of) the 10 words. The words are secret, however, to protect the integrity of the test.

I'd like to know whether the partial relationships of vocabulary to age, education, and year are really linear, and of course it would have been better to begin by exploring the data (as advocated in Chapter 3). Instead, I'll look directly at the component-plus-residual plots for these three explanatory variables, which are shown in Figure 6.9. In light of the fairly large sample and the discreteness of the xs (age, education, and year are all measured in whole years), I used a small span of 0.2 for the loess smoothers. All three relationships look nearly linear, but (1) there is a small, possibly quadratic, bend in the partial residuals for age, and (2) there are small jumps in the partial residuals for education around 12 and 15 or 16 years (i.e., at high school graduation and near university graduation). Because these small apparent departures from linearity make some substantive sense, it might be interesting to pursue them.

Figure 6.9: Component-plus-residual plots for age, education, and year in the preliminary regression fit to the exploratory subsample of the GSS data. The broken line in each panel is the fitted linear partial regression and the solid line is a loess smooth of the partial residuals.

[Figure: three component-plus-residual plots, labeled (a)–(c), with component + residual on the vertical axis and (a) age in years, (b) education in years, and (c) survey year on the horizontal axis. In each panel the loess smooth of the partial residuals and the linear partial fit nearly coincide.]

As noted, age, education, and year in the GSS data set are all discrete. That raises the possibility of modeling these explanatory variables using dummy regressors, as if they were factors. Such a model can capture any pattern of nonlinearity, at the expense of having many regression coefficients, in that there are 72 unique ages, 21 unique values of education, and 20 unique years in the data set.

First, examining age, and using the validation subsample of about 22,000 cases, I fit the following four models to the data, regressing vocabulary score on
1. 20 dummy regressors for education, 19 dummy regressors for year, and the dummy regressors for gender and nativity (i.e., removing age from the model);
2. a linear term in age, 20 dummy regressors for education, 19 dummy regressors for year, and the dummy regressors for gender and nativity;
3. a similar model, but with a quadratic polynomial in age; and
4. a similar model, but with 71 dummy regressors for age.

These models are properly nested for likelihood-ratio F-tests in that each is a generalization of the preceding one. In particular, because Model 4 can capture any form of partial relationship of vocabulary to age, linear or nonlinear, it's a generalization of the quadratic Model 3, which specifies a particular form of nonlinear relationship. Comparing Models 1 (with no age effect) and 2 tests the linear term in age. Table 6.2 shows the analysis of variance comparing these models. Comparing Models 1 and 2 and Models 2 and 3, there is overwhelming evidence for a linear trend in age and for the superiority of the quadratic model relative to the linear model for age. Comparing Models 3 and 4, there is also strong evidence for nonlinearity more complicated than a quadratic. The sums of squares for the various terms make it clear, however, that the departures from linearity are relatively small, and the most general Model 4, which uses 71 degrees of freedom for age, isn't much better than the linear Model 2: In a large sample, we're able to detect even small departures from linearity, and here the linear or quadratic partial relationship of vocabulary to age remains a reasonable approximation.
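Here is a minimal sketch of this kind of nested-model comparison using statsmodels; the file name and variable names (`vocab`, `age`, `educ`, `year`, `gender`, `nativeborn`) are hypothetical stand-ins for the GSS variables, not the book's actual code or data.

```python
# Nested models for the age effect: none, linear, quadratic, and factor (dummies).
# anova_lm on a sequence of nested OLS fits gives the incremental F-tests.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

gss = pd.read_csv("gss_vocabulary.csv")  # hypothetical file name

m1 = smf.ols("vocab ~ C(educ) + C(year) + gender + nativeborn", data=gss).fit()
m2 = smf.ols("vocab ~ age + C(educ) + C(year) + gender + nativeborn", data=gss).fit()
m3 = smf.ols("vocab ~ age + I(age**2) + C(educ) + C(year) + gender + nativeborn",
             data=gss).fit()
m4 = smf.ols("vocab ~ C(age) + C(educ) + C(year) + gender + nativeborn",
             data=gss).fit()

print(anova_lm(m1, m2, m3, m4))  # compares each model with the preceding one
```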

Proceeding similarly for education, I fit the following four models:
1. 71 dummy regressors for age, 19 dummy regressors for year, and the dummy regressors for gender and nativity (i.e., removing education from the model);
2. a linear term in education, 71 dummy regressors for age, 19 dummy regressors for year, and the dummy regressors for gender and nativity;
3. a similar model but adding dummy regressors for 12 and 16 years of education; and
4. a similar model but with 20 dummy regressors for education.

Table 6.2: Analysis of variance comparing the four models for age. [Table not reproduced.]
Note: RSS = residual sum of squares; df = degrees of freedom; SS = sum of squares.
a. Decrease in the RSS.
b. Numerator degrees of freedom for the test.

An analysis of variance comparing these models appears in Table 6.3. The model with two dummy regressors for high school and university graduation fits better than the model with only a linear term in education, but the pattern of departure from linearity appears to be more complicated than that. Again, although the evidence for a complicated partial relationship of vocabulary to education is strong, the departure from linearity is very small, and so it’s reasonable to retain the linear specification as a useful approximation.

Table 6.3: Analysis of variance comparing the four models for education. [Table not reproduced.]
Note: RSS = residual sum of squares; df = degrees of freedom; SS = sum of squares.

Finally, because there's no obvious pattern to the small departures from linearity in the partial relationship of vocabulary score to year, I fit the following three models:
1. 71 dummy regressors for age, 20 dummy regressors for education, and the dummy regressors for gender and nativity (i.e., removing year from the model);
2. a linear term in year, 71 dummy regressors for age, 20 dummy regressors for education, and the dummy regressors for gender and nativity; and
3. a similar model, but with 19 dummy regressors for year.9

9 The final model in all three cases is the same, with dummy regressors for all of age, education, and year.

Table 6.4 reveals that there is very strong evidence for a linear trend over time, together with strong evidence of nonlinearity, but once more the sum of squares for nonlinearity is fairly small, especially given the substantial additional complexity of the general Model 3.10

10 Moving beyond hypothesis testing, the Bayesian Information Criterion (BIC), for example, a model selection criterion that strongly penalizes models for complexity, prefers the model that includes linear terms in age, education, and year to the much more complex model that treats these three explanatory variables as factors. For a discussion of model selection criteria see, for example, Fox (2016, section 2.1.1) or Weisberg (2014, section 10.2.1).

Table 6.4: Analysis of variance comparing the three models for year. [Table not reproduced.]
Note: RSS = residual sum of squares; df = degrees of freedom; SS = sum of squares.

Modeling Nonlinear Relationships With Regression Splines

I've thus far described three strategies for dealing with nonlinearity: (1) in the case of simple monotone nonlinearity, transforming the response or, more commonly in the context of multiple regression, an explanatory variable to straighten their relationship; (2) in the case of nonmonotone nonlinearity, fitting a low-degree polynomial function of an explanatory variable, such as a quadratic or a cubic; and (3) simply treating a discrete numeric explanatory variable as a factor to model any pattern of nonlinearity. One can also treat a continuous numeric variable as a factor by dissecting its range into class intervals (analogous to the bins of a histogram), but doing so loses information and is generally inadvisable.

Although these strategies suffice for the examples in this chapter, transformations work only for simple, monotone nonlinearity; low-degree polynomials can accommodate only relationships of certain shapes, and they can be disconcertingly nonlocal, in the sense that data in one region of the x-space can strongly affect the fit in another region; and treating numeric explanatory variables as factors produces complicated models that may overfit the data and can be hard to describe succinctly.

In contrast to polynomials, nonparametric regression methods such as loess are sensitive to local characteristics of the data and can fit smooth relationships of arbitrary shape, but nonparametric regression entails the high overhead of abandoning the linear regression model. As it turns out, we can often produce results very similar to nonparametric regression by using constrained piecewise polynomials called regression splines, which can be included as regressors in a linear model and consequently are fully parametric.

For notational simplicity, I'll suppose that there is a single numeric explanatory variable x; the extension to an explanatory variable xj in multiple regression is immediate. To fit a regression spline, we first divide the range of x into a small number p + 1 of contiguous intervals, at the p preselected values k1, k2, …, kp, called knots. The knots may be selected to dissect the range of x into equal-size intervals or may be placed at evenly spaced quantiles of x, such as the three quartiles.

We then fit a low-degree polynomial—typically a cubic function—to the (x, y) data in each region,

$$[x_{\min}, k_1),\ [k_1, k_2),\ \ldots,\ [k_p, x_{\max}]$$

where x_min and x_max are the smallest and the largest x-values in the data, respectively.11 The cubic functions are fit by least squares but are constrained to join smoothly at the knots, so that the level, slope, and curvature of the fitted regression function are the same on both sides of each knot.12 Were we to fit an unconstrained cubic polynomial in each interval, the regression would require 4(p + 1) regression coefficients—that is, an intercept and coefficients for x, x², and x³ in each of the p + 1 intervals. The constraints at the knots drastically reduce the number of coefficients required to p + 4. Although there are numerically cleverer ways to fit a regression spline, a simple approach is to use the regressors x, x², and x³, along with the p additional regressors (x − kj)³, j = 1, …, p, coded (x − kj)³ for x ≥ kj and 0 for x < kj; there's also the constant regressor for the intercept.

11 The square bracket "[" indicates that each interval is closed at the left, and the parenthesis ")" that it is open at the right—except for the last interval, which is also closed at the right to include the largest x-value.

12 *Technically, the regression function and its first and second derivatives are the same on both sides of each knot.

To illustrate, I randomly generated n = 200 y-values according to a model whose population regression function is a wiggly function of x built from the trigonometric cosine function (with the angle measured in radians13), with the xi values sampled uniformly from the interval x = 0 to 10. Figure 6.10 shows the data points along with the rather wiggly population regression function, graphed as a broken line. The solid line in the graph is the fitted regression spline with p = 3 knots evenly placed at 2.5, 5.0, and 7.5, thus using p + 4 = 7 regression coefficients. For real data, a cubic regression spline with three knots placed at the quartiles often does a good job of capturing complex nonlinearity.

13 Recall that π radians is equal to 180 degrees.
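As a concrete illustration of the truncated-power construction described above (a simple sketch, not the numerically preferable B-spline basis and not the book's code; the simulated y below is an arbitrary wiggly function, not the book's exact model):

```python
# Cubic regression spline via the truncated power basis: regressors x, x^2, x^3,
# plus (x - k_j)^3 for x >= k_j (and 0 otherwise) at each knot k_j.
import numpy as np

def truncated_power_basis(x, knots):
    """n x (3 + p) spline basis for x (constant column not included)."""
    x = np.asarray(x, dtype=float)
    cols = [x, x**2, x**3]
    for k in knots:
        cols.append(np.where(x >= k, (x - k) ** 3, 0.0))
    return np.column_stack(cols)

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.cos(x) + 0.1 * x + rng.normal(0, 0.5, 200)   # arbitrary wiggly example

knots = [2.5, 5.0, 7.5]                              # p = 3 knots
X = np.column_stack([np.ones_like(x), truncated_power_basis(x, knots)])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)        # p + 4 = 7 coefficients
print(coefs)
```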

Polynomial regression and regression splines both use two or more regressors to represent the nonlinear relationship (or, when there are several explanatory variables, the nonlinear partial relationship) between the response and an explanatory variable. The individual coefficients of polynomial and regression-spline regressors are not generally of direct interest,14 and the coefficients must be combined to construct the nonlinear partial relationship of the response to the predictors. This is best done graphically, as in a predictor effect plot.

14 That's not to say that the coefficients are uninterpretable, just that their interpretation is complex, and it's occasionally of interest to test hypotheses about individual coefficients of polynomial regressors or regression-spline regressors. For example, if we fit a polynomial of order p ≥ 2 with coefficients β1, β2, …, βp, then testing the hypothesis H0: β2 = ⋯ = βp = 0 is a test of nonlinearity.

Figure 6.10: Cubic regression spline with knots at k1 = 2.5, k2 = 5.0, and k3 = 7.5 (shown as vertical broken lines) fit to artificially generated data; the vertical solid lines are at the boundaries of the data. The broken curve is the population regression function and the solid curve is the fitted regression spline.

[Figure: scatterplot of y against x, with x from 0 to 10 and y from 0 to 12; the population regression function (broken curve) and the fitted regression spline (solid curve) nearly coincide, and vertical lines mark the three knots.]

*Transforming Explanatory Variables Analytically

In Chapter 3, I discussed an ML-like analytic method for selecting the transformation of a variable or set of variables toward normality, and in Chapter 5, I described the Box–Cox regression model for transforming the errors in a linear model toward normality. Box and Tidwell (1962) introduced a similar regression model in which power transformations λ1, λ2, …, λk of the explanatory variables are estimated as parameters:

$$y_i = \beta_0 + \beta_1 x_{i1}^{\lambda_1} + \beta_2 x_{i2}^{\lambda_2} + \cdots + \beta_k x_{ik}^{\lambda_k} + \varepsilon_i$$

where all the xij are positive. The regression coefficients are typically estimated after and conditional on the transformations, because the βs don't really have meaning until the λs are selected. If, as is often the case, some of the xs (e.g., dummy regressors) aren't candidates for transformation, then these xs can simply enter the model linearly without a corresponding power transformation parameter. Box and Tidwell's approach produces MLEs of the transformation parameters, along with a score test for each transformation.

Consider, for example, the following model for the CIA World Factbook data:

$$\log(\text{infant mortality}) = \beta_0 + \beta_1\,\text{GDP}^{\lambda_1} + \beta_2\,\text{Health} + \beta_3\,\text{Health}^2 + \beta_4\,\text{Gini}^{\lambda_2} + \varepsilon$$

Here, I treat GDP and the Gini coefficient as candidates for power transformation, but not health expenditures, because log(infant mortality) appears to have a quadratic relationship to the latter.

Applying Box and Tidwell's method produces the MLEs of the two transformation parameters. The asymptotically normally distributed score-test statistics for these estimates, testing the null hypothesis that no transformation is required (i.e., that the corresponding λ is 1), have two-sided p-values of ≈ 0 and 0.67, respectively. We therefore have very strong evidence of the need to transform GDP, along with an estimated transformation close to the log transformation that I previously selected informally, and little evidence of the need to transform the Gini coefficient.

Box and Tidwell's model also leads to a constructed-variable diagnostic for the transformation parameters, similar to the constructed-variable diagnostic for the Box–Cox model described in Chapter 5. In the current context, for each explanatory variable xj that is subject to transformation, we add the constructed variable xj loge(xj) to the linear regression model and then draw added-variable plots for the constructed variables to judge leverage and influence on the estimated transformations. The t-statistics for these constructed variables (i.e., the ratios of the estimates to their standard errors) are the score tests for the transformations.
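A minimal sketch of the constructed-variable idea, assuming a data frame with positive explanatory variables and hypothetical column names (`gdp`, `health`, `gini`, `log_infant`); this is an illustration, not the book's code:

```python
# Box-Tidwell-style constructed variable for one explanatory variable:
# add x * log(x) to the model that already contains x; the t-statistic of the
# constructed variable is an approximate score test of "no transformation".
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

cia = pd.read_csv("cia_factbook.csv")  # hypothetical file name
cia["gdp_constructed"] = cia["gdp"] * np.log(cia["gdp"])

augmented = smf.ols(
    "log_infant ~ gdp + health + I(health**2) + gini + gdp_constructed",
    data=cia).fit()

print(augmented.tvalues["gdp_constructed"])  # score test for lambda = 1 for GDP
```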

For our model for the CIA World Factbook data, the constructed-variable plots for λ1 (for the transformation of GDP) and λ2 (for the transformation of the Gini coefficient) are shown in Panels (a) and (b), respectively, of Figure 6.11. Luxembourg has high leverage on the estimate of λ1 in Panel (a) but seems more or less in line with the rest of the data; the shallow slope in Panel (b) indicates that there's little evidence for transforming the Gini coefficient.

Figure 6.11: Constructed-variable plots for the Box–Tidwell power transformations of (a) GDP per capita and (b) the Gini coefficient of income inequality in the regression model for the CIA data.

[Figure: two added-variable plots; Panel (a) plots log(infant mortality) | others against the constructed variable for GDP | others, with an upward-sloping line passing near the high-leverage point labeled Luxembourg; Panel (b) plots log(infant mortality) | others against the constructed variable for the Gini coefficient | others, with a nearly flat line.]

Chapter 7. Collinearity

Collinearity is different from the other problems discussed in this monograph in two related respects: (1) Except in exceptional circumstances (explained below), collinearity is fundamentally a problem with the data rather than with the specification of the regression model. (2) As a consequence, there is usually no satisfactory solution for a true collinearity problem.

As mentioned in Chapter 2, when there is a perfect linear relationship among the regressors in a linear regression model, the least-squares coefficients are not uniquely defined. This result is easily seen for k = 2 xs, for which the normal equations (shown in general form in Equation 2.1 on p. 8) are, suppressing the subscript i for cases,

$$\begin{aligned}
b_0\, n + b_1 \sum x_1 + b_2 \sum x_2 &= \sum y \\
b_0 \sum x_1 + b_1 \sum x_1^2 + b_2 \sum x_1 x_2 &= \sum x_1 y \\
b_0 \sum x_2 + b_1 \sum x_1 x_2 + b_2 \sum x_2^2 &= \sum x_2 y
\end{aligned} \qquad (7.1)$$

Solving the normal equations produces the least-squares coefficients,

$$b_1 = \frac{\sum x_1^* y^* \sum x_2^{*2} - \sum x_2^* y^* \sum x_1^* x_2^*}{\sum x_1^{*2} \sum x_2^{*2} - \left(\sum x_1^* x_2^*\right)^2}, \qquad
b_2 = \frac{\sum x_2^* y^* \sum x_1^{*2} - \sum x_1^* y^* \sum x_1^* x_2^*}{\sum x_1^{*2} \sum x_2^{*2} - \left(\sum x_1^* x_2^*\right)^2} \qquad (7.2)$$

where x1* = x1 − x̄1, x2* = x2 − x̄2, and y* = y − ȳ are variables in mean-deviation form. The correlation between x1 and x2 is

$$r_{12} = \frac{\sum x_1^* x_2^*}{\sqrt{\sum x_1^{*2} \sum x_2^{*2}}}$$

Thus, if r12 = ±1, then the denominator of b1 and b2 in Equation 7.2 is 0, and these coefficients are undefined. More properly, there is an infinity of combinations of values of b0, b1, and b2 that satisfy the normal equations (Equation 7.1) and thus minimize the sum of squared residuals.

A strong, but less than perfect, linear relationship among the xs causes the least-squares regression coefficients to be unstable: Coefficient standard errors are large, reflecting the imprecision of estimation of the βs, and consequently, confidence intervals for the βs are broad. Small changes in the data—even, in extreme cases, due to rounding errors—can substantially alter the least-squares coefficients, and relatively large changes in the coefficients from the least-squares values hardly increase the sum of squared residuals.

Collinearity and Variance Inflation

Recall from Equation 2.2 (p. 9) that the estimated variance of the least-squares regression coefficient bj is

$$\widehat{V}(b_j) = \frac{s^2}{(n-1)\, s_j^2} \times \frac{1}{1 - R_j^2} \qquad (7.3)$$

where Rj² is the squared multiple correlation from the regression of xj on the other xs. The impact of collinearity on the precision of estimation is captured by 1/(1 − Rj²), called the variance inflation factor, VIFj. It is important to keep in mind that it is not the pairwise correlations among the regressors (when k > 2) that appear in the VIF, but the multiple correlation for the regression of a particular x on the others. For this reason, collinearity in multiple regression is sometimes termed multicollinearity.
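A minimal sketch (not from the text) of computing VIFs directly from this definition, by regressing each xj on the other xs; the data frame `X` and its columns are made up for illustration:

```python
# VIF_j = 1 / (1 - R_j^2), where R_j^2 is from the regression of x_j on the other xs.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def vifs(X: pd.DataFrame) -> pd.Series:
    """Variance inflation factors for the columns of X (X has no constant column)."""
    out = {}
    for col in X.columns:
        others = sm.add_constant(X.drop(columns=col))
        r2 = sm.OLS(X[col], others).fit().rsquared
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out)

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=100)   # x2 strongly related to x1
X = pd.DataFrame({"x1": x1, "x2": x2})
print(np.sqrt(vifs(X)))   # square-root VIFs, the measure recommended later in this chapter
```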

Compare Equation 7.3 to the estimated variance of the slope coefficient b in the simple regression of y on xj alone,

$$\widehat{V}(b) = \frac{s'^2}{(n-1)\, s_j^2}$$

where s′² is the residual variance from the simple regression. This residual variance is typically larger than the residual variance s² from the multiple regression, which has additional explanatory variables, but there is no VIF in the simple-regression variance.

b

The other factors affecting the variance of j in Equation 7.3 are the estimated error variance 2, the sample size , and the

s

n

x

variance of j. Thus, small error variance, large sample size, and highly variable explanatory variables all contribute to precise estimation in regression. It is my experience that imprecise regression estimates in social research are more frequently the product of large error variance (i.e., weak relationships between the response and the explanatory variables), relatively small samples, and homogeneous samples (i.e., explanatory variables with little variation) than of serious collinearity.

β

Because the precision of estimation of j is most naturally expressed as the width of the confidence interval for this parameter, and because the width of the confidence interval is proportional to the standard error of j, I recommend examining the square root of the VIF in preference to the VIF itself. Table 7.1 reveals that linear relationships among the s must be very strong before collinearity seriously degrades the precision of estimation: For example, it is not until j approaches 0.9 that the precision of estimation is halved.

b
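To spell out the arithmetic behind that last statement (my own worked example, not the book's): the standard error of bj is doubled exactly when the square root of the VIF reaches 2, that is, when

$$\sqrt{\mathrm{VIF}_j} = \frac{1}{\sqrt{1 - R_j^2}} = 2
\quad\Longrightarrow\quad
R_j^2 = 0.75
\quad\Longrightarrow\quad
R_j \approx 0.87 .$$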

Table 7.1 [Table not reproduced.]
a. Impact on the standard error of bj.

To illustrate VIFs, I'll return to Duncan's regression of occupational prestige (P) on the income (I) and education (E) levels of 45 U.S. occupations in 1950 (repeating Equation 4.5 from p. 54). Income and education are moderately highly correlated in this data set. A peculiarity of having two xs is that the VIFs for income and education are the same, because the R² for the regression of income on education and the R² for the regression of education on income are both equal to the squared correlation between the two variables; the two VIFs, and hence their square roots, are therefore identical.

The situation changes slightly when the influential pair of cases, minister and railroad conductor, is removed (as discussed in Chapter 4), producing the fitted regression repeated in Equation 4.6 (p. 56). With these two cases deleted, the correlation between income and education increases, and the square root of the VIF for income and education rises to 1.68. Thus, confidence intervals for βI and βE are 68% wider than they would be for uncorrelated explanatory variables. Ninety-five percent confidence intervals for the regression coefficients are not terribly broad, however, roughly the coefficient ±0.25 for income and ±0.2 for education, but neither are they very precise.

To the degree that there's a culprit here, it's the small sample of size n = 45, not collinearity. I introduced this example because it's an extreme case: We rarely see correlations between explanatory variables in the social sciences as high as 0.8, sample sizes as small as 45, and squared multiple correlations as large as 0.9.

Visualizing Collinearity

In this section, I describe two ways to visualize the impact of collinearity on estimation: (1) examining data and confidence ellipses and (2) comparing marginal scatterplots with corresponding added-variable plots. These aren't really diagnostic graphs but rather graphs for further understanding how collinearity affects the precision of estimation.

Figure 7.1: Data ellipses (a) and (b) and corresponding confidence ellipses (c) and (d) for two artificial regression data sets: In (a) the correlation between x1 and x2 is approximately 0 and in (b) it is approximately 0.95. The data ellipses in Panels (a) and (b) are 95% concentration ellipses, and the solid dot in the center of each is the point of means. The outer ellipses in Panels (c) and (d) are joint 95% confidence regions for β1 and β2; the inner ellipses generate individual 95% confidence intervals when projected onto the β1 and β2 axes. The black dot at the center of each ellipse represents the sample regression coefficients (b1, b2) and the black square at the origin represents β1 = β2 = 0. The true population regression coefficients are (β1 = 2, β2 = 3).

[Figure: Panels (a) and (b) plot x2 against x1 (both axes from −3 to 3), with a nearly circular data ellipse in (a) and a narrow, positively tilted ellipse in (b); Panels (c) and (d) plot β2 against β1 (both axes from −5 to 10), with nearly circular confidence ellipses in (c) and elongated, negatively tilted ellipses in (d).]

Figure 7.1 shows the scatterplots and data ellipses for the regressors x1 and x2 in two artificial, randomly generated data sets (in the top Panels (a) and (b)), along with the corresponding confidence ellipses for the coefficients β1 and β2 in the regression of y on x1 and x2 (in the bottom Panels (c) and (d)). The data were generated as follows:

1. n = 200 values of the explanatory variables x1 and x2 were sampled from a bivariate normal distribution with means μ1 = μ2 = 0 and standard deviations σ1 = σ2 = 1. For the data in Figure 7.1(a), the population correlation between x1 and x2 was set to ρ12 = 0, while for the data in Figure 7.1(b), it was set to ρ12 = 0.95. The realized sample correlations are r12 = −0.007 in Panel (a) and r12 = 0.957 in Panel (b).
2. Then values of the response y were randomly generated according to the regression model y = β0 + β1x1 + β2x2 + ε, with β0 = 0, β1 = 2, and β2 = 3, and with normally distributed errors having σ = 10 (a simulation sketch in this spirit appears after this list).
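A sketch reproducing this kind of simulation in Python (the particular random draws, and hence the realized correlations and estimates, will differ from the book's):

```python
# Simulate x1, x2 with correlation rho, generate y = 2*x1 + 3*x2 + error with
# sigma = 10, and compare coefficient standard errors for rho = 0 and rho = 0.95.
import numpy as np
import statsmodels.api as sm

def simulate_fit(rho, n=200, sigma=10, seed=0):
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    y = 2 * X[:, 0] + 3 * X[:, 1] + rng.normal(0, sigma, n)  # beta0 = 0
    return sm.OLS(y, sm.add_constant(X)).fit()

for rho in (0.0, 0.95):
    fit = simulate_fit(rho)
    print(f"rho = {rho}: se(b1) = {fit.bse[1]:.2f}, se(b2) = {fit.bse[2]:.2f}")
```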

The estimated regression coefficients, their standard errors, the residual standard deviation, the R², and the square root of the VIF for each regression are shown in Table 7.2.2

2 As I mentioned, when there are just two xs, their VIFs are necessarily equal: VIF1 = VIF2 = 1/(1 − r12²).

Table 7.2: Regressions of y on x1 and x2 in the two artificial data sets. [Table not reproduced.]

I introduced the data ellipse in Chapter 4 and the confidence ellipse in Chapter 2. It is evident from Figure 7.1, particularly Panels (b) and (d), that the confidence and data ellipses are related in the following manner: The shape of the confidence ellipse is the same as that of the data ellipse, except that the former is a 90° rotation (and rescaling) of the latter. In a technical sense, explained in the starred section at the end of this chapter, the data and confidence ellipses are inverses of each other. As a consequence, if x1 and x2 are positively correlated, as in Panel (b), then the regression coefficients b1 and b2 for these xs are negatively correlated, as in Panel (d).3

3 A possibly subtle point is that the correlations between the regression coefficients are estimated sampling correlations, analogous to estimated sampling variances—that is, they pertain to how the coefficients behave with respect to repeated sampling. Thus, if b1 and b2 are negatively correlated, samples that have unusually large values of b1 tend to have unusually small values of b2, and vice versa.

In Panel (a) of Figure 7.1, the xs are virtually uncorrelated, and as a consequence, the axes of the data ellipse are nearly parallel to the x1 and x2 axes. Moreover, because x1 and x2 have nearly identical standard deviations, the data ellipse is nearly circular. The same is true of the confidence ellipses in Panel (c): The coefficients b1 and b2 are nearly uncorrelated and have similar standard deviations (i.e., standard errors).

Because x1 and x2 are strongly correlated in Panel (b) of Figure 7.1 but are uncorrelated in Panel (a), the confidence ellipses in Panel (d) are larger than those in Panel (c). The confidence intervals produced for the individual regression coefficients in Panel (d) by projecting the inner ellipse onto the β1 and β2 axes both include 0, while neither individual confidence interval in Panel (c) includes 0.

In contrast, the outer 95% confidence ellipses in both Panels (c) and (d) exclude the point β1 = β2 = 0, and so the (false) joint null hypothesis H0: β1 = β2 = 0 can be rejected at the .05 level in both cases. In Panel (d), where the xs are highly collinear, we therefore have the ambiguous result that we are reasonably sure that at least one (and possibly both) of the βs is nonzero, but we can't say which. This ambiguity is intuitively sensible because the high correlation of x1 and x2 makes it difficult to disentangle their partial effects.

Yet another way of visualizing the impact of collinearity on the precision of estimation is illustrated in Figure 7.2, drawn for the same two artificial data sets as Figure 7.1. The correlation between the two xs in the regression is approximately 0 in Panel (a) and approximately 0.95 in Panel (b). I focus in both panels on the relationship between y and x1. The hollow points in each graph are for the marginal scatterplot of y versus x1 ignoring x2, but expressing both y and x1 as deviations from their respective means; the filled points are for the added-variable plot for x1—that is, the residuals from the regression of y on x2 (symbolized by y | x2) versus the residuals from the regression of x1 on x2 (symbolized by x1 | x2).

Figure 7.2: Marginal scatterplot of y versus x1 and the superimposed added-variable plot for x1 in the regression of y on x1 and x2 in each of the two artificial data sets: In (a) the correlation between x1 and x2 is approximately 0 and in (b) it is approximately 0.95. The arrows in each panel show the correspondence between points in the marginal and added-variable plots, which are shown as open and filled circles, respectively. The solid line is the least-squares line for the marginal scatterplot, showing the slope for the simple regression, and the broken line is the least-squares line for the added-variable plot, giving the multiple regression slope b1.

[Figure: two panels, (a) and (b); the horizontal axes are labeled x1, with arrows to x1 | x2, and the vertical axes y, with arrows to y | x2 (both in mean-deviation form). Arrows connect each open (marginal) point to its filled (added-variable) counterpart; in Panel (b), where x1 and x2 are highly correlated, the arrows are much longer and the filled points are compressed horizontally toward 0.]

The solid line in each panel is the least-squares line for the hollow points, and it gives the slope of the simple regression of y on x1; because expressing the variables as deviations from their means eliminates the intercept from the regression equation, the regression line goes through the origin. The broken line in each panel is the least-squares line for the solid points, and it gives the slope b1 for x1 in the multiple regression of y on the two xs. The arrows connect the hollow point for each case to the corresponding solid point.

In Panel (a), because x1 and x2 are virtually uncorrelated, the horizontal coordinates of the hollow and solid points are nearly the same. The vertical coordinates of the points tend on average to move toward 0 because the residuals from the regression of y on x2 are somewhat less variable (i.e., less spread out vertically) than y. As well, and also because the xs are nearly uncorrelated, the slope of the marginal regression of y on x1 is nearly the same as the partial regression slope b1.

The configuration of the two sets of points is very different in Panel (b): As in Panel (a), as we go from the hollow to the filled points, the vertical coordinates of the points tend slightly toward 0, but the horizontal coordinates of the points tend strongly toward 0: Because x1 and x2 are highly correlated, the residuals from the regression of x1 on x2 are much less variable (i.e., much less spread out horizontally) than the original x1 values. As it turns out, the simple regression slope for x1, given by the solid line, is somewhat different from the multiple regression slope b1, given by the broken line, but even more dramatic is the effect of collinearity on the precision of estimation of the slope: The greatly reduced conditional variability of x1 given x2 doesn't provide a broad base of support for the broken regression line and so greatly increases the standard error of b1.
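A minimal sketch of constructing the added-variable coordinates just described (residuals of y on x2 against residuals of x1 on x2), and of verifying that the slope through those residuals reproduces the multiple-regression coefficient b1; the simulated data are illustrative only:

```python
# Added-variable plot coordinates for x1 in the regression of y on x1 and x2.
# By the Frisch-Waugh result, regressing (y | x2) on (x1 | x2) recovers b1.
import numpy as np
import statsmodels.api as sm

def added_variable_coords(y, x1, x2):
    Z = sm.add_constant(x2)
    y_resid = sm.OLS(y, Z).fit().resid     # y | x2
    x1_resid = sm.OLS(x1, Z).fit().resid   # x1 | x2
    return x1_resid, y_resid

rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.normal(size=200)  # r close to 0.95
y = 2 * x1 + 3 * x2 + rng.normal(0, 10, size=200)

x1_r, y_r = added_variable_coords(y, x1, x2)
b1_added_var = sm.OLS(y_r, x1_r).fit().params[0]   # slope through the origin
b1_multiple = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit().params[1]
print(b1_added_var, b1_multiple)                    # the two slopes agree
```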

Generalized Variance Inflation

The VIF is a simple diagnostic that directly expresses the effect of collinearity on the precision of estimation of each coefficient in a regression, but it is only a sensible measure for terms in a model that are represented by a single parameter. Examples of multiple-parameter terms are sets of dummy-variable coefficients for a factor with more than two levels, and polynomial or regression-spline coefficients for a numeric explanatory variable. The reason for this limitation is a little subtle: The correlations among a set of dummy regressors generally depend on which level of the factor is selected as the baseline level (and indeed there are alternative but equivalent ways of coding regressors for factors beyond 0/1 dummy regressors), but the fit of the model to the data and the intrinsic meaning of the model don't change with this essentially arbitrary choice of regressors. Similarly, we can generally reduce correlations among polynomial regressors by expressing a numeric explanatory variable x as deviations from its mean (and can even reduce the correlations to 0 by using so-called orthogonal polynomial regressors), but once again the partial relationship of y to x doesn't change with these arbitrary and inconsequential choices. The same is true of the regressors chosen to represent a regression spline.

Fox and Monette (1992) introduced generalized variance inflation factors (GVIFs) to deal properly with sets of related regression coefficients. The GVIF for two coefficients (e.g., for two dummy regressors) is interpretable as the increase in the squared area of the joint-confidence ellipse for the two corresponding parameters relative to the area of this ellipse for otherwise similar data in which the two regressors are unrelated to the other regressors in the model. Fox and Monette show that this ratio of squared areas is unaffected by the choice of baseline level for the set of dummy regressors or by other similar arbitrary choices. If there are three coefficients in a set, then the GVIF represents inflation in the squared volume of the joint-confidence ellipsoid for the coefficients, and the generalization beyond three coefficients is the squared hypervolume of the multidimensional confidence ellipsoid for the coefficients.

Because the size of the GVIF tends to grow with dimensionality, that is, with the number p of regressors in a set, Fox and Monette recommend taking the 2p-th root of the GVIF, effectively reducing it to a linear measure of imprecision, analogous to taking the square root of the VIF in the one-coefficient case. When p = 1 (i.e., when there is only one coefficient for a term in the model), the GVIF reduces to the usual VIF.

The details of computing GVIFs are postponed to the last section of this chapter, but here is an application, using the regression model fit to the CIA World Factbook data, regressing the log infant mortality rate for 134 nations on the log of GDP per capita, a quadratic in per-capita health expenditures, and the Gini coefficient of income inequality (repeating Equation 6.3 from p. 85). The GVIFs and their 2p-th roots for GDP, health expenditures, and the Gini coefficient are shown in Table 7.3. Clearly, collinearity isn't an issue here.
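In practice, GVIFs can be computed with the vif() function in the car package (Fox & Weisberg, 2019). The following is a minimal R sketch under assumed names: the data frame CIA and its variable names are hypothetical stand-ins for the World Factbook data used above, not the actual data set.

```r
# A hedged sketch of computing GVIFs with car::vif(); the data frame 'CIA'
# and its variable names are assumed for illustration.
library(car)

mod <- lm(log(infant.mortality) ~ log(gdp) + poly(health, 2) + gini,
          data = CIA)

vif(mod)  # for multi-df terms, reports GVIF, Df, and GVIF^(1/(2*Df))
```

Because the quadratic in health expenditures contributes two coefficients, vif() reports its GVIF along with the 2p-th root just described.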


Dealing With Collinearity

As I stated in the introduction to this chapter, because collinearity is fundamentally a problem with the data and not (typically) with the model, there generally isn't a satisfactory solution to the problem. That is, in formulating a regression model, a researcher should specify the model to reflect hypotheses about the structure of the data or questions to be put to the data. If we include both x1 and x2 in a regression model, for example, that should mean that we're interested in the partial relationship of y to x1 holding x2 constant, the partial relationship of y to x2 holding x1 constant, or both. If x1 and x2 are so highly correlated in our data that we can't adequately separate their effects, then that's too bad, but there's little we can do about it short of collecting new data in which the two explanatory variables aren't so correlated, or collecting more data or data in which the xs are more variable so that we can estimate their coefficients with adequate precision despite their correlation.

That’s usually the end of the collinearity story. There are, however, several strategies that have been suggested for estimating regression models in the presence of collinearity, which I’ll briefly discuss here. None are general solutions for the problem.

Model respecification: What I mean by "model respecification" in this context is removing xs from the regression equation to reduce collinearity. When we do this, we implicitly ask different questions of the data. For example, if we remove x2 from the regression of y on x1 and x2, then we estimate the marginal relationship between y and x1 ignoring x2 rather than the partial relationship conditioning on x2, and we no longer address the relationship between y and x2 at all.

Justification for respecifying the model in this manner is most straightforward when we made a mistake in formulating the model in the first place, for example, if we foolishly used p rather than p − 1 dummy regressors for a factor with p levels, or put x1, x2, and their sum in the same regression equation. Similarly, if slightly less egregiously, it generally doesn't make sense to include alternative measures of the same explanatory variable as xs in a regression. If we're interested in how employee absenteeism is related to clinical depression, for example, we shouldn't use three different measures of depression as regressors in the model. After all, we're probably not interested in the partial relationship of absenteeism to Depression Measure 1 holding constant Depression Measures 2 and 3. In a case like this, we can create an index that combines the alternative measures, just pick one if they're very highly correlated, or use a regression method that allows us to specify multiple indicators of a latent (i.e., not directly measured) explanatory variable, such as the construct "depression."

If, as is more typically the case, we didn't make a mistake in formulating the original model, then respecifying the model by removing one or more explanatory variables changes the questions that we ask of the data. This is only OK if the new questions are interesting, and we should in any event appreciate that we gave up our original research goals. Put another way, if the model was carefully formulated, removing explanatory variables will generally produce biased estimates of the coefficients of the remaining explanatory variables. For example, if we want to control statistically for x2 in examining the relationship of y to x1 because we think that x2 is a common prior cause of x1 and y, removing x2 from the regression because it's highly correlated with x1 can

produce an incorrect causal inference about the effect of x1 on y.

Variable selection: Variable selection methods are automatic techniques for specifying a regression model by including only a subset of candidate explanatory variables. In their most sophisticated form, these methods are often called machine learning (e.g., Hastie, Tibshirani, & Friedman, 2009), and they can be effective when the goal is to produce a model that does a good job of predicting the response. When our interest in a regression model is in understanding how the explanatory variables influence the response, using a mechanical method to select the model automatically is not a reasonable strategy. Doing so is tantamount to allowing an algorithm to decide what questions to ask of the data.

Regularization: Regularization methods resolve the ambiguity produced by collinearity by in effect driving the regression coefficients, or some of the regression coefficients, toward 0, with the goal of producing estimates of the βs that have smaller mean-squared errors than the least-squares estimates even though the regularized estimates are biased. The most common regularization methods in regression analysis are ridge regression (Hoerl & Kennard, 1970a, 1970b) and the lasso (an acronym for least absolute shrinkage and selection operator; Tibshirani, 1996). The fundamental problem is that for regularization to achieve its goal, it's necessary to know something about the population regression coefficients, or, more likely, implicitly to pretend to know. Otherwise, there's no guarantee that the regularized estimates will perform better than least squares. Like variable selection, regularization can work well, however, when the goal is prediction.

Prior information about the βs: In specifying a normal linear model, we make certain assumptions (reviewed in Chapter 2) about the structure of the data. As explained in much of this monograph, some of these assumptions are checkable. Perhaps we're willing to make additional assumptions about the population regression coefficients.

These assumptions may take a very simple form, such as that two βs are equal, or they may take the form of statements about what are plausible values for the βs. In the former case, we can get constrained least-squares estimates of the coefficients, and in the latter case, we can employ Bayesian methods of estimation (e.g., Gelman et al., 2013) in place of least squares. Of course, for these approaches to work, the prior information must be sufficiently specific to reduce the ambiguity due to collinearity, and we have to be honest about the state of our prior knowledge: If, for example, we assume incorrectly that two population regression coefficients are equal or specify unreasonably precise prior constraints on the values of regression coefficients, then the resulting estimates and their estimated precision will be misleading.

These strategies have more in common than it might at first appear. For example, variable selection in effect respecifies the model, albeit mechanically, and regularization (particularly the lasso) can drive coefficients to 0, effectively eliminating the corresponding regressors from the model. Regularization also entails tacit assumptions about plausible values of the βs.
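For readers who want to see what regularized estimation looks like in practice, here is a brief R sketch using the glmnet package; the predictor matrix X and response y are placeholders rather than data from the text, and the sketch is illustrative, not an endorsement of regularization for interpretive regression.

```r
# A minimal sketch of ridge and lasso estimation with glmnet; X is an
# assumed numeric predictor matrix and y an assumed response vector.
library(glmnet)

ridge <- cv.glmnet(X, y, alpha = 0)  # alpha = 0 gives ridge regression
lasso <- cv.glmnet(X, y, alpha = 1)  # alpha = 1 gives the lasso

coef(ridge, s = "lambda.min")  # coefficients shrunk toward 0
coef(lasso, s = "lambda.min")  # some coefficients may be exactly 0
```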

*Collinearity: Some Details


The Joint Confidence Ellipse and the Data Ellipse

Equation 2.7 (p. 12) gives the joint confidence ellipse for a subset of p regression coefficients in a model with k + 1 coefficients. When k = 2, focusing on the p = 2 slope coefficients β1 and β2 in effect removes the intercept β0 from consideration, which is equivalent to expressing the xs in mean-deviation form. We therefore have b = (b1, b2)′ and β = (β1, β2)′. Then the 95% joint confidence ellipse for β1 and β2 is

(β − b)′ S_X (β − b) ≤ [2s²/(n − 1)] F_{2, n−3}^{0.95},    (7.4)

where S_X is the sample covariance matrix for x1 and x2, s² is the residual variance, and F_{2, n−3}^{0.95} is the 0.95 quantile of the F-distribution with 2 and n − 3 degrees of freedom.

In contrast, the 95% data ellipse for x1 and x2 is the set of points x = (x1, x2)′ for which

(x − x̄)′ S_X⁻¹ (x − x̄) ≤ 2 F_{2, n−1}^{0.95},    (7.5)

where x̄ = (x̄1, x̄2)′ is the vector of means of the two explanatory variables. Comparing Equation 7.5 for the data ellipse to Equation 7.4 for the confidence ellipse, we see that the shape matrix for the latter is the inverse of the shape matrix S_X for the former, and that the factors on the right of the inequalities also differ. These observations account for the relationship between the two ellipses: the confidence ellipse is the 90° rotation and rescaling of the data ellipse.
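The two ellipses are easy to draw in R with the dataEllipse() and confidenceEllipse() functions in the car package. The sketch below assumes a data frame dat with columns y, x1, and x2 (placeholder names, not a data set from the text).

```r
# A sketch of the 95% data ellipse for two predictors and the 95% joint
# confidence ellipse for their coefficients; 'dat' and its variable names
# are assumed for illustration.
library(car)

mod <- lm(y ~ x1 + x2, data = dat)

dataEllipse(dat$x1, dat$x2, levels = 0.95)  # data ellipse for x1 and x2
confidenceEllipse(mod, which.coef = 2:3,    # the two slope coefficients
                  levels = 0.95)            # joint confidence ellipse
```

Comparing the two plots shows the 90° rotation and rescaling described above: when the data ellipse is elongated (collinear xs), the confidence ellipse is elongated in the perpendicular direction.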

Computing Generalized Variance Inflation

Partition the regressors in the regression model into two sets: (1) those for the term in question (e.g., the set of dummy regressors for a factor), and (2) the remaining regressors in the model, with the exception of the constant regressor, which is ignored. Let R represent the correlation matrix among all the regressors (again, ignoring the constant), R11 the correlations among the regressors in the first set, and R22 the correlations among the regressors in the second set. Then, as Fox and Monette (1992) demonstrate, the GVIF for the first (or indeed the second) set of regressors is

GVIF = det(R11) det(R22) / det(R),

where det(·) is the determinant. It's also possible to express the GVIF in terms of the correlation matrix Rb of the regression coefficients, computed from the coefficient covariance matrix after eliminating the first row and column for the intercept. Let Rb11 be the submatrix of Rb pertaining to the correlations of the coefficients in Set 1 and Rb22 the submatrix pertaining to the correlations of the coefficients in Set 2. Then

GVIF = det(Rb11) det(Rb22) / det(Rb).
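A direct translation of the second formula into R might look like the following sketch, which computes the GVIF for a designated set of coefficients from a fitted model's coefficient correlation matrix; the function name and the example call are illustrative, and the result should agree with car::vif().

```r
# A sketch of the GVIF computed from the coefficient correlation matrix Rb,
# following the determinant formula above. 'mod' is any fitted lm or glm
# with an intercept, and 'set1' gives the positions of the coefficients for
# the term in question, counted after dropping the intercept.
gvif <- function(mod, set1) {
  V  <- vcov(mod)[-1, -1]   # drop the intercept row and column
  Rb <- cov2cor(V)          # correlation matrix of the coefficients
  det(Rb[set1, set1, drop = FALSE]) *
    det(Rb[-set1, -set1, drop = FALSE]) / det(Rb)
}

# For example, if the first term after the intercept is a factor with three
# levels (two dummy regressors), its GVIF would be gvif(mod, set1 = 1:2).
```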

Chapter 8. Diagnostics for Generalized Linear Models

Many of the unusual-data diagnostics, nonlinearity diagnostics, and collinearity diagnostics introduced in previous chapters for the linear regression model fit by least squares can be straightforwardly extended to GLMs. The chapter begins with a brief review of GLMs and then proceeds to sketch the application of various regression diagnostics to this important class of statistical models.


Generalized Linear Models: Review

GLMs, introduced in a remarkable paper by Nelder and Wedderburn (1972), represent both a synthesis and, importantly, a true generalization of many preexisting regression models, including the normal linear model described in Chapter 2. A generalized linear model (GLM) consists of three components:

1. A distributional family specifying the conditional distribution of the response yi, i = 1, …, n, given one or more explanatory variables zi,1 which may be numeric or categorical. In the initial formulation of GLMs, the response distribution was a member of an exponential family, in which the conditional variance of the response variable is a function of its conditional mean, μi, and a dispersion parameter, φ, so that V(yi | zi) = φ v(μi). The function v(·) is called the variance function.

1 If vector notation is unfamiliar, just think of zi as the collection of explanatory variables for the ith case.

The traditional exponential families are the following:

The Gaussian or normal family (named in honor of Carl Friedrich Gauss, the codiscoverer of the normal distributions), for which the response variable is numeric and can take on values from −∞ to ∞, and for which v(μi) = 1, a constant (i.e., the conditional variance is simply the dispersion parameter, φ = σ²).

The binomial family, for which the response variable is the proportion of "successes" in ni binomial trials,2 and which therefore takes on one of the discrete values 0, 1/ni, 2/ni, …, 1. An important special case is binary data, where all the ni = 1 and where yi consequently takes on either the value 0 or 1. For the binomial family, v(μi) = μi(1 − μi)/ni, and thus the dispersion parameter is set to φ = 1. Here, the mean μi of yi is just the probability of success on an individual trial.

2 Binomial trials are independent realizations of a process, like flipping a coin, each of which can give rise to two possible outcomes, formally called "success" (say, a head) and "failure" (a tail), where the probability of success is the same for each trial.

The Poisson family, for which the response variable is a nonnegative integer, 0, 1, 2, 3, …, such as a count, and for which the conditional variance is equal to the mean, v(μi) = μi, with the dispersion parameter fixed to φ = 1.

The gamma family, for which the response variable is a nonnegative real number, and for which the conditional variance is proportional to the square of the mean, v(μi) = μi².

The inverse-Gaussian family, for which the response is also a nonnegative real number, and for which the conditional variance increases with the mean even more rapidly, v(μi) = μi³.

The family component of GLMs has subsequently been extended to other distributions that aren't members of simple exponential families, such as the negative binomial family for count data, which has a shape parameter in addition to a dispersion parameter. As well, the variance function and the link function (described below) can be specified directly without making an explicit assumption about the form of the conditional distribution of y, analogous to a linear model without the assumption of normality but with the assumptions of constant error variance and linearity.3

3 As explained in a later section of this chapter, specifying the variance function and the link function without an explicit distributional family leads to the so-called quasi-likelihood estimation of the resulting GLM.

2. A linear predictor, ηi, on which the mean of the response depends, and which is a linear function of regressors, xi, derived from the explanatory variables zi, and of regression coefficients β:

ηi = β0 + β1 xi1 + β2 xi2 + ⋯ + βk xik

As indicated, the regression coefficients typically include an intercept, β0, associated with the constant regressor x0 = 1. The xs are just like the regressors in the linear model and may include numeric explanatory variables, transformations of numeric explanatory variables, polynomial and regression spline regressors, sets of dummy regressors for factors, and interaction regressors.

3. An invertible link function g(·), which transforms the mean of the response to the linear predictor, g(μi) = ηi, making the GLM a linear model for the transformed mean response.4 Because the link function is invertible, we can also think of the GLM as a nonlinear model for the mean of y: μi = g⁻¹(ηi). The inverse link g⁻¹(·) is also called the mean function.

4 Notice, however, that it's the mean of y that's transformed and not y itself. One of the strengths of the GLM paradigm is that it divorces the linearizing transformation from distributional modeling of the response. That is, the linearizing link function and distributional family are specified separately.

The range of permissible link functions varies with distributional families and, to a degree, with different instantiations of GLMs in statistical software, but each traditional exponential family is associated with a so-called canonical link function, which simplifies the structure of the GLM. The canonical links are as follows:

The identity link for the Gaussian family, defined as g(μi) = μi. Pairing the identity link with the Gaussian family produces the familiar normal linear model.

The logit link for the binomial family, g(μi) = ln[μi/(1 − μi)], that is, the log of the odds of success versus failure (the log of the probability of success divided by the probability of failure). A common noncanonical link for binomial data is the probit link, g(μi) = Φ⁻¹(μi), where Φ⁻¹(·) is the quantile function of the standard normal distribution. Pairing the logit link with the binomial family produces the logistic regression model; pairing the probit link with the binomial family produces the probit model.

The log link for the Poisson family, g(μi) = ln(μi). Pairing the log link with the Poisson family produces (depending on the context) the Poisson regression model for count data or the log-linear model for a contingency table.

The inverse link for the gamma family, g(μi) = 1/μi.

The inverse-square link for the inverse-Gaussian family, g(μi) = 1/μi².

GLMs are typically estimated by the method of ML. Unlike in the case of linear models, where the normal equations have a closed-form solution (Equation 2.6 on p. 12), iterative methods are generally required to maximize the likelihood for a GLM. A common algorithm that is very convenient in the context of regression diagnostics for GLMs is iteratively weighted least squares (IWLS), described in the final starred section of this chapter.
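In R, for example, the family and link components of a GLM are specified together in a call to glm(); the sketch below uses generic variable and data-frame names (y, x1, x2, dat) assumed for illustration rather than any data set from the text.

```r
# A sketch of specifying distributional families and links with glm();
# 'y', 'x1', 'x2', and 'dat' are assumed names.
logit.mod  <- glm(y ~ x1 + x2, family = binomial(link = "logit"),  data = dat)
probit.mod <- glm(y ~ x1 + x2, family = binomial(link = "probit"), data = dat)
pois.mod   <- glm(y ~ x1 + x2, family = poisson(link = "log"),     data = dat)
gamma.mod  <- glm(y ~ x1 + x2, family = Gamma(link = "inverse"),   data = dat)

summary(logit.mod)  # glm() maximizes the likelihood by IWLS
```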

Detecting Unusual Data in GLMs

In extending regression diagnostics to GLMs (specifically, to logistic regression), Pregibon (1981) suggested computing hat values for GLMs from the last iteration of the IWLS fit.5

5 Hat values can be straightforwardly defined for the weighted least-squares estimator discussed in Chapter 5, but a peculiarity of using the last iteration of the IWLS procedure is that, unlike in a linear model estimated by either ordinary or weighted least squares, the hat values depend on the y values as well as on the configuration of the xs. Nevertheless, the hat values measure the weight of each case in determining the estimated coefficients of a GLM.

Many unusual-data diagnostics are based on residuals, and in GLMs there are several ways to define residuals:

The most direct approach is also the least useful. The fitted values for a GLM are simply the estimated means of the response variable, μ̂i = g⁻¹(η̂i), where the estimated linear predictor η̂i is computed from the bs, the MLEs of the βs in the GLM. The response residuals are then simply yi − μ̂i.

I mentioned that GLMs are typically fit by IWLS, and the working residuals, eWi, are then just the residuals from the last iteration of the IWLS procedure (see the last section of the chapter).

The Pearson statistic, X² = Σ (yi − μ̂i)²/v(μ̂i), is an overall measure of model fit for a GLM analogous to the residual sum of squares for a linear model. The dispersion parameter is typically estimated as φ̂ = X²/(n − k − 1), where k + 1 is the number of regression coefficients, and the Pearson residuals are then ePi = (yi − μ̂i)/√v(μ̂i).

The residual deviance D is another overall measure of fit, also analogous to the residual sum of squares for a linear model, and equal to twice the difference in the maximized log-likelihood for a particular GLM and the maximized log-likelihood for a similar GLM that dedicates one parameter to each case, therefore achieving a perfect fit to the data (see the last section of the chapter). The deviance residuals eDi are the square roots of the casewise components of the deviance, with signs chosen to agree with the response residuals.

To define studentized residuals for a GLM, we could fit a mean-shift GLM (analogous to the mean-shift linear model in Equation 4.3 on p. 47), with dummy regressors set to 1 for each of the n cases in turn, dividing the n estimated dummy-regressor coefficients by their standard errors. This procedure is more computationally intensive than is desirable for a diagnostic statistic, and so Williams (1987) developed an approximation that combines the deviance and Pearson residuals, weighted by the hat values.

Deletion measures of influence on the regression coefficients can also be generalized from linear models. Approximations to the change dij and standardized change d*ij, respectively, in the regression coefficients bj, for j = 0, 1, …, k, associated with deleting the ith case can be taken from the final IWLS iteration, and an approximation to Cook's distance (also due to Williams, 1987) is (cf. Equation 4.4 on p. 49 for Cook's D in a linear model)

Di = [ePi² / (φ̂ (k + 1))] × [hi / (1 − hi)²],

where ePi is the Pearson residual defined above.
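All of these quantities are available in R for a fitted glm object; in the sketch below, mod is a placeholder name for such a model.

```r
# A sketch of the residual and deletion diagnostics discussed above, for a
# fitted glm object 'mod' (placeholder name).
residuals(mod, type = "response")  # y - fitted mean
residuals(mod, type = "working")   # residuals from the last IWLS iteration
residuals(mod, type = "pearson")   # Pearson residuals
residuals(mod, type = "deviance")  # deviance residuals

rstudent(mod)        # approximate studentized residuals
hatvalues(mod)       # hat values from the final IWLS fit
dfbeta(mod)          # changes in the coefficients from deleting each case
cooks.distance(mod)  # approximate Cook's distances
```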

Finally, Wang (1987) extended added-variable plots to GLMs. For each regressor xj:

1. refit the model without xj, saving the working residuals and the weights from the final iteration of the IWLS procedure;
2. regress xj on the other xs by weighted least squares, using the weights from step 1 as weights and saving the residuals from this auxiliary regression; and
3. plot the working residuals from step 1 versus the residuals from step 2.

An Illustration: Mroz's Data on Women's Labor Force Participation

To illustrate unusual-case (and other) diagnostics for GLMs, I introduce data described originally by Mroz (1987) and used by other authors, including Long (1997), as an example of logistic regression, the most common GLM beyond the Gaussian linear model.6 Mroz's data are for 753 married women drawn from the U.S. Panel Study of Income Dynamics and include the following variables (using abbreviations introduced by Long):

6 Mroz (1987) used the data for a different purpose.

lfp Labor force participation, coded 1 for a woman in the paid labor force and 0 otherwise; there are 428 women in the labor force and 325 women not working outside the home.

k5 Number of children 5 years old or younger residing in the household, a value ranging from 0 to 3.

k618 Number of children between the ages of 6 and 18 years, which ranges from 0 to 8 (but with only three cases over 5).

age The woman's age in years, from 30 to 60.

wc A dummy regressor for the woman's college attendance, coded 1 if she attended college (212 women) and 0 otherwise (541 women).

hc A dummy regressor for the woman's husband's college attendance (of whom 295 attended college and 458 did not).

inc Annual family income in $1,000s, exclusive of the woman's own income, if any (ranging from −0.029, presumably a small loss, to 96).

lwg The log of the woman's wage rate (ranging from −2.05 to 3.22), recorded as the actual log(wage rate) for a woman in the labor force or as her imputed log(wage rate) if she was not in the labor force. Wages were imputed by first regressing log(wages) on the other variables for the 428 women in the labor force, and then using the resulting estimated regression equation to compute predicted values for the 325 nonworking women on the basis of their other observed characteristics.

The logistic regression of lfp on the other explanatory variables produces coefficient estimates that, with the exception of those for k618 and hc, are several times their standard errors.

Figure 8.1 shows an influence plot of studentized residuals versus hat values, with the areas of the circles proportional to the values of Cook's distance. The peculiar vertical separation of the points in the graph is typical of diagnostic plots for a GLM with a discrete response (here, a binary response with only two distinct values, 0 and 1). None of the points stands out: Although there are some hat values beyond the cutoffs of two and three times the average hat value, the largest studentized residual, 2.25, isn't noteworthy in a sample of n = 753 cases; the Bonferroni p-value for this point, computed from the normal distribution, exceeds 1. The largest Cook's D is 0.036, also quite small.

Figure 8.1: Influence plot of studentized residuals versus hat values, with the areas of the circles proportional to Cook's D, for the logistic regression fit to the Mroz women's labor force participation data.


The added-variable plots for the coefficients in the logistic regression, shown in Figure 8.2, confirm that none of the cases, either individually or in combination, exert much influence on the coefficients. The visual separation of the points can make added-variable plots for a binary GLM (and, to a lesser extent, other GLMs for discrete responses) hard to decode.
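These diagnostics are simple to reproduce in R with the car package, assuming the Mroz data frame distributed in the carData package, whose variable names match those listed above; the model and object names below are otherwise illustrative.

```r
# A sketch of the unusual-data diagnostics for the Mroz logistic regression,
# assuming the 'Mroz' data frame from the carData package (loaded with car).
library(car)

mroz.mod <- glm(lfp ~ k5 + k618 + age + wc + hc + lwg + inc,
                family = binomial, data = Mroz)

influencePlot(mroz.mod)  # studentized residuals vs. hat values,
                         # circle areas proportional to Cook's D
avPlots(mroz.mod)        # added-variable plots for the coefficients
```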


Nonlinearity Diagnostics for GLMs

As explained earlier in the chapter, GLMs are linear in the scale of the linear predictor (e.g., the logit scale for a logistic regression) but are usually nonlinear in the scale of the response (the probability scale for a logistic regression). Even on the scale of the linear predictor, GLMs are linear in the parameters and not necessarily in the explanatory variables, because, like linear models, GLMs can include transformed explanatory variables, polynomial regressors, and regression splines. Consequently, as in Chapter 6, we're concerned here with nonlinearity in the broad sense of lack of fit, that is, an incorrectly specified regression function.

Figure 8.2: Added-variable plots for the logistic regression fit to the Mroz women's labor force participation data.


Figure 8.3: Component-plus-residual plots for the numeric explanatory variables in the logistic regression fit to Mroz's data. The solid line in each panel is a nonrobust loess smooth and the broken line is the least-squares line.


Landwehr, Pregibon, and Shoemaker (1984) showed how component-plus-residual plots can be extended to GLMs; the theory justifying the extension was later developed by Cook and Croos-Dabrera (1998).

Constructing component-plus-residual plots for a GLM is straightforward and uses the last iteration of the IWLS procedure for fitting the GLM: The working residuals eWi from the last iteration are added to the component bj xij to produce the partial residuals bj xij + eWi, which are then plotted against xij (cf. Equation 6.1 on p. 83 for linear models). The various generalizations of component-plus-residual plots described in Chapter 6 are produced in an analogous manner.

Figure 8.3 shows component-plus-residual plots for the numeric predictors in the logistic regression fit to the Mroz data. In each panel, the broken line is the least-squares line representing the fitted model and the solid line is a nonrobust loess smooth (i.e., one that doesn't downweight outlying points): Landwehr et al. (1984) explain that using a nonrobust smoother can be important in smoothing partial residuals for GLMs fit to discrete data. With the notable exception of the graph for lwg, which I'll discuss below, the component-plus-residual plots support linearity on the logit scale, with the loess lines close to the least-squares lines in each graph. The vertical separation of the points in the component-plus-residual plots into two groups once again reflects the binary response variable, and makes smoothing the points especially helpful in interpreting the graphs.
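Component-plus-residual plots like those in Figure 8.3 can be drawn with crPlots() in the car package; the call below assumes the mroz.mod object from the earlier sketch, and the choice of smoother (robust or not) can be adjusted through the function's smooth argument.

```r
# A sketch: component-plus-residual plots for the numeric predictors in the
# Mroz logistic regression (mroz.mod assumed from the earlier sketch).
library(car)
crPlots(mroz.mod, terms = ~ k5 + k618 + age + inc + lwg)
```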

As in the case of a linear model (as discussed in Chapter 6), when an explanatory variable in a GLM is discrete, we can formulate a likelihood-ratio test for lack of fit. Consider, for example, the variable k5 in the Mroz logistic regression, which takes on only four distinct values, the integers from 0 through 3. Treating k5 as a factor creates three dummy regressors, and the resulting model fits the data with residual deviance D1 = 904.71. The original model, in which k5 is specified to have a linear effect on the logit scale, has slightly larger residual deviance D0 = 905.27 but uses two fewer coefficients. The likelihood-ratio chi-square statistic (analogous to the incremental F-statistic for a linear model) for the null hypothesis of linearity is the difference in deviance between the two models, 905.27 − 904.71 = 0.56, on 2 degrees of freedom, for which p = 0.76, supporting the linear specification.
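This test is easy to carry out in R by refitting the model with k5 as a factor and comparing residual deviances; the sketch again assumes the mroz.mod object.

```r
# A sketch of the likelihood-ratio test for linearity of k5 (mroz.mod
# assumed from the earlier sketch).
mroz.mod.f <- update(mroz.mod, . ~ . - k5 + factor(k5))  # k5 as a factor
anova(mroz.mod, mroz.mod.f, test = "Chisq")              # compare deviances
```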

Turning now to the obviously nonlinear partial relationship of lfp to lwg, we could try modeling the relationship as quadratic,7 but the component-plus-residual plot for this variable (at the lower right of Figure 8.3) reveals a logical problem with the model: As is typical of component-plus-residual plots for a binary response, the points are divided into two groups, with the upper group here for women in the labor force and the lower group for those not in the labor force. The values of lwg for the first group are their actual wages, while the values for the second group are their imputed wages. It's clear that the distribution of imputed wages is much less variable (i.e., has smaller horizontal spread) than the distribution of actual wages, which is what induces the apparently quadratic partial relationship of lfp to lwg. Imputed wages are less variable because, as fitted values, they lack residual variation, and the quadratic relationship is therefore an artifact of the manner in which lwg is defined.

7 I invite the reader to refit the model specifying a quadratic in lwg and then redraw the component-plus-residual plot for this explanatory variable. The quadratic specification appears to work well.

Figure 8.4: Marginal model plots for the logit model fit to the Mroz data set. The gray solid line in each panel is a smooth of the data and the black broken line is a smooth of the fitted values. The marginal model plots are for the numeric explanatory variables in the model and for the estimated linear predictor.


Figure 8.4 shows marginal model plots for the logistic regression fit to the Mroz data set. Recall from Chapter 6 that a marginal model plot has the response (here, the 0/1 binary response lfp) on the vertical axis and some other variable on the horizontal axis (here, each numeric explanatory variable in the model and the estimated linear predictor from the GLM). Because the response is dichotomous, the plotted points aren't informative, and where the explanatory variable takes on only a few distinct values (that is, in the plots for k5 and k618) many of the points are overplotted. The gray solid line in each panel smooths the data points, and this is to be compared with the broken black line, which smooths the fitted values from the model; the fitted values are on the scale of the response (i.e., the probability scale) and aren't themselves shown in the graph.8

8 Unlike for a linear regression model for a continuous response, it doesn't make sense to compute spread smoothers for the data and fitted values of a GLM fit to binary data. We should also be careful in applying a smoother to highly discrete and bounded y-values such as a binary response: The loess smoother applied to a 0/1 response, for example, can produce values outside of the range from 0 to 1. In Figure 8.4, I therefore used a different, spline-based, smoother that works on the unbounded logit scale rather than on the bounded probability scale, translating the resulting smooth back to the probability scale to draw the graph. See the discussion of regression splines, a topic related to smoothing splines, in Chapter 6.

Comparing the smooths for the response to those for the fitted values, we see substantial lack of fit in the graphs for lwg and for the linear predictor, and some evidence of lack of fit in the graph for age. Modeling age as a quadratic makes the smooths for the response and the fitted values in the marginal model plot for age agree, but evidence for the quadratic fit is weak: A likelihood-ratio test for the squared term in age produces a large p-value.9

9 Neither the marginal model plots for the respecified model nor the likelihood-ratio test are shown; I invite the reader to produce them.

An interesting feature of marginal model plots is that, unlike other graphical lack-of-fit diagnostics such as component-plus-residual plots, they are not based on residuals. A consequence, pointed out by Cook and Weisberg (1997), is that marginal model plots are applicable to regression models for which it isn't obvious how to define residuals. A potential disadvantage of marginal model plots, however, is that they are always defined on the scale of the response variable. For a binary logit model, for example, the values of the response are either 0 or 1, and the fitted values are estimated probabilities, which are therefore bounded by 0 and 1. It is more natural in a GLM to work on the scale of the linear predictor, that is, on the log-odds scale for a logit model. As well, because they are marginal, not partial, plots, it's not necessarily obvious how to correct problems revealed by marginal model plots.
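Marginal model plots of the kind shown in Figure 8.4 can be produced with mmps() (an abbreviation of marginalModelPlots()) in the car package; the call below assumes the mroz.mod object and leaves the smoother at its default.

```r
# A sketch: marginal model plots for the Mroz logistic regression
# (mroz.mod assumed from the earlier sketch). By default, plots are drawn
# for the numeric predictors and for the linear predictor.
library(car)
mmps(mroz.mod)
```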

Diagnosing Collinearity in GLMs

I pointed out at the end of Chapter 7 that GVIFs (and hence ordinary VIFs) can be computed from the sampling correlations among the regression coefficients as well as from the correlations among the xs. It doesn't matter for a linear model which approach is used, but for a GLM, where the correlations among the xs can't be used to compute VIFs, we can still work with the coefficient correlations.

There is, however, a complication of interpretation: For a linear model, VIFs and GVIFs represent a comparison of the data at hand to similar utopian data for which the xs (or, in the case of the GVIF, for which xs in different sets of regressors) are uncorrelated. That simple interpretation breaks down for GLMs, where the utopian data are for uncorrelated coefficients that arise from uncorrelated weighted data from the last step of the IWLS fit that produces the estimated regression coefficients.

Applying this procedure to the Mroz logistic regression, where each of the terms in the model is represented by a single coefficient, the largest VIF is 1.65 for age. There is, therefore, no evidence that collinearity compromises the precision of estimation of the logistic regression coefficients in this example.

Quasi-Likelihood Estimation of GLMs

A careful look at the IWLS procedure for estimating a GLM, described in the next section, reveals that the only properties of the GLM that are used are the link function and the conditional variance of the response; the probability density or mass function for the response only appears indirectly through the conditional variance of the response. This observation raises the possibility of estimating a GLM in which the link and conditional variance are specified directly without an explicit distributional family. Shortly after Nelder and Wedderburn (1972) introduced the class of GLMs, Wedderburn (1974) showed that quasi-likelihood estimation of a GLM for which only the link and conditional variance are specified shares many of the properties of ML estimation for GLMs in which the response is a member of an exponential family. All the diagnostics discussed in this chapter apply to GLMs estimated by quasi-likelihood.

To illustrate, I'll return to the General Social Survey (GSS) vocabulary data discussed in Chapter 6, where I regressed the number of words correct on a 10-word vocabulary test on several explanatory variables. The linear model fit by least squares to the vocabulary data worked well, but it isn't strictly sensible to assume normally distributed errors for a response that can take on only 11 discrete values (i.e., the integers from 0 to 10). We could, alternatively, convert the response into the proportion of words correct out of 10 and fit a binomial GLM, but doing so implicitly assumes that for each individual there's a fixed probability of getting the various words correct, and the words (which aren't revealed by the GSS) might well differ in difficulty. If that's the case, then we'd expect responses to be more variable than implied by the binomial distribution, what's often called overdispersion.

One way to accommodate overdispersion is to retain the logit link but to modify the conditional variance of the response for the binomial family to include a dispersion parameter, φ; recall that in the binomial family, the dispersion parameter is fixed to 1.10 I fit the resulting quasi-binomial GLM by quasi-likelihood, producing an estimate of the dispersion parameter, φ̂. The estimated regression coefficients are the same as for a traditional binomial GLM fit by ML, but the coefficient standard errors are each inflated by the factor √φ̂; in this case the inflation is minimal, and overdispersion is therefore slight.

10 An even better way to proceed here would be to fit a generalized linear mixed-effects model for the responses to the individual vocabulary items, or possibly a formal measurement model, such as an item response theory (IRT) model (see, e.g., Andrich, 1988; Baker & Kim, 2004).
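In R, this amounts to swapping the binomial family for the quasibinomial family. The sketch below uses hypothetical names: vocab for the number of words correct out of 10, two generic predictors, and a data frame gss; these are stand-ins, not the actual GSS variables.

```r
# A sketch of an overdispersed (quasi-binomial) GLM for a count of successes
# out of 10 trials; 'vocab', the predictors, and 'gss' are assumed names.
quasi.mod <- glm(cbind(vocab, 10 - vocab) ~ age + education,
                 family = quasibinomial, data = gss)

summary(quasi.mod)  # reports the estimated dispersion parameter; coefficient
                    # standard errors are inflated by its square root
```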

It's interesting to compare the coefficients from the overdispersed logistic regression with those from the linear regression in Chapter 6.11 To make the comparison directly, recall that the logistic regression is for the proportion correct of 10 words, while the linear regression is for the number correct. As well, a slope on the probability scale for a logistic regression near the probability of success of 0.5 is roughly the corresponding logistic regression coefficient divided by 4. Making these adjustments produces the estimates and standard errors for the two regressions shown in Table 8.1. As the diagnostics that we performed in Chapter 6 would lead us to expect (see, in particular, the component-plus-residual plots in Figure 6.9 on p. 99), the linear model does a reasonable job of approximating the logit model; this happens because the regression is approximately linear on the logit scale and the fitted probability of success doesn't get too close to 0 or 1.12

Table 8.1: Estimates and standard errors for the linear and overdispersed logistic regressions fit to the GSS vocabulary data.

11 Equation 6.5 (on p. 98) gives the estimated linear least-squares regression for the 5,000 individuals in the exploratory subsample that I drew from the GSS data. The results that I show here are for the linear and logistic regression models fit to all n = 27,360 cases in the GSS data set.

12 I invite the reader to construct component-plus-residual plots for the overdispersed logistic regression.


*GLMs: Further Background

The log-likelihood for a GLM takes the form log L(β, φ) = Σi log p(yi; μi, φ), summing over the n cases, where p(·) is the probability density function for a continuous response y or the probability mass function for a discrete y, corresponding to the distributional family of the GLM, and φ is the dispersion parameter of the GLM. The expected response μi is a function of the regression coefficients, and the goal is to find the values of the βs that maximize the log-likelihood. One way to do so is via the IWLS algorithm described below.

The saturated model dedicates one parameter to each of the n cases and so is able to reproduce the observed data yi perfectly. The maximized log-likelihood for the saturated model is therefore the largest attainable for the data. The residual deviance D for a model is twice the difference between the maximized log-likelihood for the saturated model and for the model in question, that is,

D = 2 (log L_saturated − log L_model).

Because no model can fit better than the saturated model, D is nonnegative.

Two models are nested if one is a special case (or restriction) of the other, for example, if a smaller model is formulated by eliminating a term or terms (e.g., an interaction term) from a larger model. The likelihood-ratio test for alternative nested GLMs is formulated by comparing the residual deviances for the two models: For families with fixed dispersion, such as the binomial family, the difference in the residual deviances for nested models is asymptotically distributed as chi-square with degrees of freedom equal to the difference in the number of parameters for the two models. For families in which the dispersion parameter is estimated from the data, the likelihood-ratio test uses the scaled deviance, which is D/φ̂. The resulting test statistic is F-distributed, with numerator degrees of freedom equal to the difference in the number of parameters in the models, and denominator degrees of freedom equal to the residual degrees of freedom for the larger model, n minus the number of parameters in that model.

Iteratively Weighted Least Squares

As explained in the earlier sections of this chapter, most of the diagnostics for GLMs are based on the last step of the IWLS algorithm used to find MLEs for a GLM, effectively linearizing the model. Here, briefly, is how IWLS works:

Let b^(j) represent the estimates of the regression coefficients at iteration j. Write the corresponding linear predictor for the ith case in matrix form as ηi^(j) = xi′ b^(j), where xi′ is the ith row of the model matrix X of regressors for the GLM.

The current fitted values are μi^(j) = g⁻¹(ηi^(j)), where, recall, g⁻¹(·) is the inverse-link function, and the current variance function is v(μi^(j)).

The working response is then zi^(j) = ηi^(j) + (yi − μi^(j)) dηi/dμi, and the working weights are

wi^(j) = 1 / [ai v(μi^(j)) (dηi/dμi)²],

where the ai are constants that depend on the distributional family of the GLM through the conditional variance of the response; for example, for the binomial family, ai is the inverse of the number of trials for the ith case, ai = 1/ni.

In these equations, dηi/dμi is the derivative of the linear predictor with respect to the mean of the response, and it therefore depends on the link function. For the logit link, for example, where ηi = ln[μi/(1 − μi)], this derivative is dηi/dμi = 1/[μi(1 − μi)]. The working residuals are eWi = zi^(j) − ηi^(j) = (yi − μi^(j)) dηi/dμi.

Using the working weights wi^(j), perform a WLS regression of the zi^(j) on X, obtaining updated estimates of the regression coefficients, b^(j+1). Repeat these steps (e.g., for j = 1, 2, …, J iterations) until the regression coefficients converge, within a small numerical tolerance, at which point the MLE of β is (to a close approximation) b = b^(J), that is, the value of b from the last iteration. It's necessary to select initial estimates b^(0) of the regression coefficients to get the IWLS algorithm started. For a binomial GLM, for example, it suffices to start with a simple choice such as b^(0) = 0.

At convergence, the estimated asymptotic covariance matrix of the regression coefficients is simply V̂(b) = φ̂ (X′WX)⁻¹, where W is the diagonal matrix of working weights from the last iteration and φ̂ is the estimated dispersion parameter. The square-root diagonal elements of V̂(b) are the coefficient standard errors, and Wald tests for individual coefficients are formulated by dividing the estimated coefficients by their standard errors. The resulting Wald test statistics are either z-values, for GLMs with fixed φ (e.g., binomial GLMs), or t-values, for GLMs with dispersion estimated from the data. Wald confidence intervals follow similarly.
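To make the algorithm concrete, here is a compact R sketch of IWLS for a binary logistic regression (logit link); the model matrix X, response y, and function name are assumptions for illustration, and the result can be checked against glm().

```r
# A minimal sketch of IWLS for binary logistic regression (logit link).
# X is an assumed model matrix (including a column of 1s) and y an assumed
# 0/1 response; the function name is illustrative.
iwls.logit <- function(X, y, max.iter = 25, tol = 1e-8) {
  b <- rep(0, ncol(X))                  # initial estimates b^(0) = 0
  for (j in seq_len(max.iter)) {
    eta <- as.vector(X %*% b)           # current linear predictor
    mu  <- 1 / (1 + exp(-eta))          # inverse logit link
    v   <- mu * (1 - mu)                # binomial variance function
    z   <- eta + (y - mu) / v           # working response (d eta/d mu = 1/v)
    w   <- v                            # working weights 1/[v (d eta/d mu)^2]
    b.new <- lm.wfit(X, z, w)$coefficients  # one weighted least-squares step
    if (max(abs(b.new - b)) < tol) {    # converged?
      b <- b.new
      break
    }
    b <- b.new
  }
  b
}

# Check against R's own IWLS-based fit (X and y assumed as above):
# cbind(iwls = iwls.logit(X, y), glm = coef(glm(y ~ X - 1, family = binomial)))
```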

Chapter 9. Concluding Remarks

The preceding chapters cover quite a bit of ground, and so I believe that it's useful to conclude with some general recommendations, in part repeating important points made earlier in the monograph:

Examine your data carefully (e.g., using the methods described in Chapter 3) before specifying a regression model for the data. Doing so will allow you to correct errors in the data and to anticipate and deal with problems that you would otherwise have to discover at a later stage in the data analysis.

If you have a moderately large (or larger) data set in which the individual cases aren't of intrinsic and direct interest (e.g., comprising survey respondents rather than nations), consider randomly dividing the data into two parts, one to be used to explore the data and specify a tentative statistical model for them and the other to be used to validate the model. Doing so will protect you from overfitting the data and exaggerating the precision of your results. The alternative of not looking at the data to preserve the purity of statistical inference is generally inadvisable.

Be especially concerned about unusual data in small data sets, but don't ignore the issue in larger data sets, because unusual data can represent gross errors in data collection or data management and can also reveal unanticipated interesting characteristics of the phenomenon under study. That said, gross errors in the data (e.g., the classic mistake of treating 999 thousand dollars of annual income as a real value rather than an unaccounted-for missing-data code) should have been revealed and dealt with in your initial examination of the data.

In smaller data sets, a nice overview of unusual data is provided by a Cook's D bubble plot of studentized residuals versus hat values, and added-variable plots are generally of interest even in larger data sets.

Although it's advisable to remain open to the possibility that different problems (e.g., skewed residuals and nonlinearity) are related, generally attend to the shape of the conditional distribution of the response before checking the functional specification of the model. In some cases, specifying a more realistic GLM in place of a linear model can deal with this kind of problem.

Plotting (studentized) residuals against fitted values is a useful diagnostic for detecting nonconstant error variance, although here too we should be aware that another problem, such as unmodeled interaction, can induce an apparent relationship between residual spread and the level of the response.

Nonlinearity, in the general sense of lack of fit, should always be a concern unless you started with a flexible model such as one employing regression splines (a strategy advocated by Harrell, 2015), and even so, you might be concerned about unmodeled interactions. Component-plus-residual plots and their various extensions are the go-to diagnostics for detecting lack of fit.

If the estimated regression coefficients are less precise than you hoped, don't be quick to blame collinearity. Resist the urge to make a collinearity problem go away by removing explanatory variables from the model if it was initially carefully thought out in relation to the goals of the research.

Regression diagnostics are of little practical use if they are difficult to compute. Most of the methods described in this monograph are readily available in standard statistical software, including R, Stata, and SAS. Computing for the examples in the monograph was done in R (see, in particular, Fox & Weisberg, 2018), and R scripts for the various examples are available on the website for the book.


Complementary Reading

This monograph assumes a general familiarity with linear and generalized linear regression models. There are many texts that cover this material at various levels and in varying detail. Two sources that also take up a variety of diagnostics, some in more detail than provided here, are Weisberg (2014) and Fox (2016). Two books on regression diagnostics and associated topics that elaborate many of the ideas discussed in this monograph, and which are well worth reading despite their age, are Cook and Weisberg (1982) and Cook (1998), as is the now-classic presentation of GLMs in McCullagh and Nelder (1989).

This monograph focuses on diagnostics for linear and generalized linear models. Many of the methods developed here can be extended to other classes of regression models, perhaps most notably to linear and generalized linear mixed-effects models for dependent clustered data. Mixed models are often applied in social research to hierarchical data, in which individuals are clustered within groups, and to longitudinal data, in which repeated observations are made on individuals, each of whom therefore comprises a "cluster." I've not covered mixed-effects models primarily because of lack of space, but also partly because diagnostic methods for mixed models aren't entirely mature. Fox and Weisberg (2019, section 8.7) briefly describe the extension of component-plus-residual plots, transformation of the response toward normality, and influence diagnostics to mixed models, in the latter instance assessing the effect of deleting clusters as well as of deleting individual cases within clusters. Fox and Weisberg also provide some additional references.

References

Allison, P. D. (2002). Missing data. Thousand Oaks, CA: Sage.
Andrich, D. (1988). Rasch models for measurement. Newbury Park, CA: Sage.
Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27, 17–21.
Atkinson, A. C. (1985). Plots, transformations, and regression: An introduction to graphical methods of diagnostic regression analysis. Oxford, England: Clarendon.
Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York, NY: Dekker.
Beckman, R. J., & Cook, R. D. (1983). Outliers. Technometrics, 25, 119–163.
Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression diagnostics: Identifying influential data and sources of collinearity. New York, NY: Wiley.
Box, G. E. P., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society, Series B (Methodological), 26(2), 211–252.
Box, G. E. P., & Tidwell, P. W. (1962). Transformation of the independent variables. Technometrics, 4, 531–550.
Breusch, T. S., & Pagan, A. R. (1979). A simple test for heteroscedasticity and random coefficient variation. Econometrica, 47, 1287–1294.
Central Intelligence Agency. (2015). The world factbook 2015–2016. Washington, DC: Author. Retrieved from https://www.cia.gov/library/publications/the-world-factbook/
Chatterjee, S., & Hadi, A. S. (1988). Sensitivity analysis in linear regression. New York, NY: Wiley.
Cleveland, W. S. (1979). Robust locally-weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836.
Cook, R. D. (1977). Detection of influential observation in linear regression. Technometrics, 19, 15–18.
Cook, R. D. (1993). Exploring partial residual plots. Technometrics, 35, 351–362.
Cook, R. D. (1998). Regression graphics: Ideas for studying regressions through graphics. New York, NY: Wiley.
Cook, R. D., & Croos-Dabrera, R. (1998). Partial residual plots in generalized linear models. Journal of the American Statistical Association, 93, 730–739.
Cook, R. D., & Weisberg, S. (1982). Residuals and influence in regression. New York, NY: CRC Press.
Cook, R. D., & Weisberg, S. (1983). Diagnostics for heteroscedasticity in regression. Biometrika, 70, 1–10.
Cook, R. D., & Weisberg, S. (1997). Graphics for assessing the adequacy of regression models. Journal of the American Statistical Association, 92, 490–499.
Davis, C. (1990). Body image and weight preoccupation: A comparison between exercising and non-exercising women. Appetite, 15, 13–21.
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge, England: Cambridge University Press.
Duncan, O. D. (1961). A socioeconomic index for all occupations. In A. J. Reiss Jr. (Ed.), Occupations and social status (pp. 109–138). New York, NY: Free Press.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York, NY: Chapman & Hall.
Ezekiel, M. (1930). Methods of correlation analysis. New York, NY: Wiley.
Fox, J. (2016). Applied regression analysis and generalized linear models (3rd ed.). Thousand Oaks, CA: Sage.
Fox, J., & Monette, G. (1992). Generalized collinearity diagnostics. Journal of the American Statistical Association, 87, 178–183.
Fox, J., & Weisberg, S. (2018). Visualizing fit and lack of fit in complex regression models with predictor effect plots and partial residuals. Journal of Statistical Software, 87(9), 1–27.
Fox, J., & Weisberg, S. (2019). An R companion to applied regression (3rd ed.). Thousand Oaks, CA: Sage.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis (3rd ed.). Boca Raton, FL: Chapman & Hall.
Harrell, F. E., Jr. (2015). Regression modeling strategies: With applications to linear models, logistic and ordinal regression, and survival analysis (2nd ed.). New York, NY: Springer.
Harvey, A. C. (1976). Estimating regression models with multiplicative heteroscedasticity. Econometrica, 44, 461–465.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). New York, NY: Springer.
Hawkins, D. M., & Weisberg, S. (2017). Combining the Box–Cox power and generalised log transformations to accommodate negative responses in linear and mixed-effects linear models. South African Statistical Journal, 51, 317–328.
Hoerl, A. E., & Kennard, R. W. (1970a). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.
Hoerl, A. E., & Kennard, R. W. (1970b). Ridge regression: Applications to nonorthogonal problems. Technometrics, 12, 69–82.
Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions.
Koenker, R. (2005). Quantile regression. New York, NY: Cambridge University Press.
Landwehr, J. M., Pregibon, D., & Shoemaker, A. C. (1984). Graphical methods for assessing logistic regression models. Journal of the American Statistical Association, 79, 61–71.
Larsen, W. A., & McCleary, S. J. (1972). The use of partial residual plots in regression analysis. Technometrics, 14, 781–790.
Long, J. S. (1997). Regression models for categorical and limited dependent variables. Thousand Oaks, CA: Sage.
Long, J. S., & Ervin, L. H. (2000). Using heteroscedasticity consistent standard errors in the linear regression model. The American Statistician, 54, 217–224.
Mallows, C. L. (1986). Augmented partial residuals. Technometrics, 28, 313–319.
McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). London, England: CRC Press.
Morgan, S. L., & Winship, C. (2015). Counterfactuals and causal inference: Methods and principles for social research (2nd ed.). New York, NY: Cambridge University Press.
Mosteller, F., & Tukey, J. W. (1977). Data analysis and regression: A second course in statistics. Reading, MA: Addison-Wesley.
Mroz, T. A. (1987). The sensitivity of an empirical model of married women's hours of work to economic and statistical assumptions. Econometrica, 55, 765–799.
Nelder, J. A. (1977). A reformulation of linear models. Journal of the Royal Statistical Society, Series A (General), 140, 48–77.
Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society, Series A (General), 135(3), 370–384.
Pearce, L. D. (2002). Integrating survey and ethnographic methods for systematic anomalous case analysis. In R. M. Stolzenberg (Ed.), Sociological methodology (Vol. 32, pp. 103–132). Washington, DC: American Sociological Association.
Pearl, J. (2009). Causality: Models, reasoning, and inference (2nd ed.). New York, NY: Cambridge University Press.
Pregibon, D. (1981). Logistic regression diagnostics. Annals of Statistics, 9, 705–724.
Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression and outlier detection. New York, NY: Wiley.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.
Tufte, E. R. (1983). The visual display of quantitative information. Cheshire, CT: Graphics Press.
Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.
Velleman, P. F., & Welsch, R. E. (1981). Efficient computing of regression diagnostics. The American Statistician, 35, 234–241.
Wang, P. C. (1987). Residual plots for detecting nonlinearity in generalized linear models. Technometrics, 29, 435–438.
Wedderburn, R. W. M. (1974). Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika, 61, 439–447.
Weisberg, S. (2014). Applied linear regression (4th ed.). Hoboken, NJ: Wiley.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48, 817–838.
Williams, D. A. (1987). Generalized linear model diagnostics using the deviance and single case deletions. Applied Statistics, 36, 181–191.
Wood, F. S. (1973). The use of individual effects and residuals in fitting equations to data. Technometrics, 15, 677–695.

Index

adaptive kernel density estimate, 14, 15, 21, 24, 65, 66
added-variable plot, 52–57, 61, 83, 107, 111, 114, 115, 141
  for generalized linear model, 127, 130, 131
  properties of, 53
analysis of variance, 91, 101
Anscombe's quartet, 2, 3
bandwidth, of kernel density estimate, 15
base, of log function, 18
Bayesian Information Criterion (BIC), 102
Bayesian methods, 120
binomial family, 124, 136
bins, of histogram, 14
Bonferroni adjustment, 47, 50, 55, 66, 130
bootstrap, 64, 77
Box–Cox powers, 68
Box–Cox regression model, 68–69, 105
Box–Tidwell regression model, 105–107
boxplot, 14, 15, 23, 32, 36, 37, 66, 67
Breusch–Pagan test, 75
bulging rule, 30, 32, 84, 85
canonical link function, 125
categorical explanatory variable, see factor
causation, 7
centroid, 40
CERES plot, 89–90
clustered data, 143
collinearity, 7, 50, 142
  in generalized linear models, 135–136
  visualizing, 111–116
common logs, 19
component-plus-residual plot, 81–86, 90, 98, 99, 142, 143
  accuracy of, 87–88
  augmented, 90
  for generalized linear model, 130, 132, 133, 137, 138
  for interactions, 89–92
  more robust, 88–89
component-plus-residual plot vs. marginal model plot, 135
conditioning predictors, 90
confidence ellipse, see regression coefficients, joint confidence region for
confidence ellipsoid, see regression coefficients, joint confidence region for
confidence envelope, for quantile-comparison plot, 16
constant error variance, assumption of, 6, 35, 94
constant regressor, 5, 54, 61, 125
constructed-variable plot
  for Box–Cox model, 68–70
  for Box–Tidwell model, 106, 107
Cook's D, 49–51, 55, 61, 141
  for generalized linear model, 127, 129, 130
correlation
  of regression coefficients, 113, 135
  of regressors, 108, 109, 113, 114, 135
data ellipse, 40, 41, 111–113, 120–121
  and leverage, 45, 46
data sets
  CIA World Factbook data, 14, 21, 24–27, 31, 36, 63, 65, 66, 68–70, 72–79, 83, 84, 89, 91, 95–97, 106, 117
  Davis's data on measured and reported weight, 41, 44, 45, 48, 49, 57
  Duncan's occupational prestige data, 54–57
  GSS vocabulary data, 96, 97, 99, 100, 136, 137
  Mroz's women's labor-force data, 128, 129, 131–134, 136
deleted estimate, 46
dependent variable, 1, see also response
design matrix, see model matrix
deviance residuals, 127
DFBETA, 48, 60
  for generalized linear model, 127
DFBETAS, 48, 51
  for generalized linear model, 127
DFFITS, 49, 51, 61
dispersion parameter, 123, 136–140
  fixed, 124, 136, 138
dummy regressor, 5, 6, 43, 106, 116, 128, 132
effect plot, 90–93
efficiency of least-squares estimator, 7
  robustness of, 62
elasticity, 38
ellipse, see data ellipse; regression coefficients, joint confidence region for
ellipsoid, see regression coefficients, joint confidence region for
error variance, estimation of, 8, 12, 75, 80
errors, in regression model, 5
estimating equations, see normal equations
experiment, 6
explanatory variable, 1, 5
  discrete, 100
exponential family, 1, 123, 136
factor, 5, 6, 100, 103
family, for generalized linear model, 123, 136, 138
fitted values, 8, 12, 58, 71–74, 139, 142
fixed explanatory variables, 6
fixed predictors, 90
focal predictor, 90
focal values, 33
frequency, 14
F-test, incremental, 9, see also likelihood-ratio test
gamma family, 124
Gauss–Markov theorem, 62
Gaussian or normal family, 123
generalized linear model, 1, 123–126, 142
generalized variance inflation factor, 116–117, 121–122, 135
geometric mean, 68
GLM, 76, see generalized linear model
GVIF, see generalized variance inflation factor
hat matrix, 58
hat values, 45, 49, 50, 55, 58, 79, 126, 141
  for generalized linear model, 126, 129, 130
HC3, 79, see also regression coefficients, standard errors of, robust
heavy-tailed distribution, 16, 17
heteroscedasticity, 79, see also nonconstant error variance
hierarchical data, 143
high-breakdown estimators, 58
histogram, 14
identity link, 125
independence, assumption of, 6
independent variable, 1, see also explanatory variable
index plot, 48
influence, 2, 41, 42, 48–50, 57, 69, 107, 111, 143
  in generalized linear model, 127, 129, 130
  joint, 51–57
interaction, 70, 89–92, 142
interaction regressor, 6, 43
intercept, 5
interquartile range, 15
inverse link, 126, 139
inverse-Gaussian family, 124
inverse-square link, 126
item response theory (IRT), 137
iteratively weighted least squares, 126, 128, 130, 136, 139–140
IWLS, see iteratively weighted least squares
joint confidence region, see regression coefficients, joint confidence region for
joint influence, see added-variable plot; influence, joint
kernel density estimate, 15, see also adaptive kernel density estimate
kernel function, 15
knots, for regression spline, 103
lack of fit, 95, 130, 135, 142, see also linearity, assumption of; nonlinearity
ladder of powers and roots, 20, 22, 32, 38, 66, 67, 84, see also power transformation
lasso, 119, 120
least absolute values, 69
least squares, 8–9, 12, 76
least-squares coefficients, 108, see also regression coefficients
  efficiency of, 76
  standard errors of, 109
  unbias of, 76
levels, of a factor, 5
leverage, 40–42, 44–46, 48, 50, 69, 75, 107, see also hat values
likelihood-ratio test, 24
  for lack of fit, 96–102, 132, 135
  for nested models, 138
linear model, see normal linear regression model
linear predictor, 124, 139
linearity, assumption of, 6, 7, 81
link function, 124, 125, 136
  canonical, 125
  noncanonical, 125
LM, see normal linear regression model
loess, 25, 32–35, 71, 83, 84, 86, 87, 90, 91, 93, 95–97, 99, 103, 131–133
  to show spread, 71, 95
log link, 126
log transformation, 66, 85
  as 0th power, 18
log-linear model, 126
logarithms, 18
logistic regression, 1, 125, 126, 128–134, 136–138
logit link, 125, 136
logit model, see logistic regression
longitudinal data, 143
lowess, 32, see also loess
machine learning, 119
main effect, 91
marginal model plots, 92–95, 133
maximum likelihood estimation, 23, 78, 106, 126, 136, 138, 139
  and weighted least squares, 9
  of normal linear regression model, 8
  of the error variance, 8
mean function, 125
mean-shift outlier model, 47
median, 15
median regression, 69
mixed-effects model, 137, 143
MLE, see maximum likelihood estimation
model matrix, 11, 139
model respecification, 57, 118, 120, 142
multicollinearity, 109, see also collinearity
multimodal residual distribution, 63, 65
multiple correlation, 8, 9
natural logs, 19
nearest-neighbor local-polynomial regression, 32, see also loess
negative binomial family, 124
nonconstant error variance, 69–75, 142
  and weighted least squares, 12
  testing for, 74–75
nonlinear least squares, 87
nonlinearity, 2, 25, 32, 66, 70, 72, 73, 92–95, 141, 142, see also lack of fit
  in generalized linear models, 130–135
  monotone, 28, 29, 82, 83, 102
  simple, 28, 29, 102
  testing for, 95–102
nonnormality, 63–69
nonparametric regression, see loess
normal equations, 8, 108
normal linear regression model, 1, 5–7
  matrix form of, 11–12, 79
normality, assumption of, 6, 62
odds, 125
OLS, see least squares
one-dimensional scatterplot, see rugplot
order of magnitude, 21
order statistics, 16
  standard errors of, 16
ordinary least squares, see least squares
orthogonal polynomial regressors, 116
outlier, 2, 15, 36, 38, 40–42, 45–48, 50, 57, 59, 62, 66, 75
  test for, 47, 55, 66, 130
overdispersion, 136–138
overfitting, 58, 141
parametric bootstrap, 64
partial fit, 85, 86
partial residuals, 83, 85, 91, 92, 99, 130
  augmented, 88
partial versus marginal relationship, 82, 114, 118
partial-regression plot, see added-variable plot
partial-residual plot, see component-plus-residual plot
Pearson residuals, 78, 127
Pearson statistic, 127
Poisson family, 124
Poisson regression, 1, 126
polynomial regression, 6, 28, 30, 35, 83, 85, 86, 92, 100, 103, 106, 116, 117, 125, 133, 135
  nonlocal character of, 103
power transformation, 22, 83
  analytic choice of
    for nonlinearity, 105–107
    for normality, 21–26, 68–69
  Box–Cox family of, 19–21
  confidence interval for, 24, 68
  family of, 18
  Hawkins–Weisberg family of, 28
  likelihood-ratio test for, 25
  of data with 0 or negative values, 26–28
  start for, 28
predictor effect plot, 39, 90–93
principle of marginality, 91
probit link, 125
probit model, 125
QQ plot, see quantile-comparison plot
quadratic regression, see polynomial regression
quantile regression, 69
quantile–quantile plot, see quantile-comparison plot
quantile-comparison plot, 14, 16, 17
  of studentized residuals, 64–65
quartiles, 15
quasi-binomial family, 137
quasi-likelihood estimation, 124, 136–138
regression coefficients, 5
  confidence intervals for, 10, 50, 110, 112, 114, 140
  correlations of, 113, 135
  covariance matrix of, 12
    for generalized linear model, 140
    for weighted least squares, 12, 80
  estimation of, 7
  hypothesis test for, 9, 12, 140
  joint confidence region for, 10, 12, 50, 111–114, 116, 120–121
  prior information about, 120
  standard errors of, 9, 12, 50, 53, 113
    bootstrapped, 77–78
    for a generalized linear model, 140
    Huber–White, see regression coefficients, standard errors of, robust
    robust, 75–77, 79
    sandwich, see regression coefficients, standard errors of, robust
  statistical inference for, 9–11
regression constant, 5
regression model, 1
regression spline, 6, 28, 102–104, 116, 125, 142
regression through the origin, 5
regressors, 5, see also dummy regressor; interaction regressor; polynomial regression; regression spline
regularization, 119, 120
residual degrees of freedom, 8, 139
residual deviance, 127, 132, 138
residuals, 8, 12, 49, 50, 53, 70, 73, 82, 126, 142
  distribution of, 60, 64
  properties of, 8
  variances of, 46, 60, 71
response, 1, 5
  binary, 130, 132, 133
  count, 124
  discrete, 98, 130, 131
  limited, 98
response residuals, 126
ridge regression, 119
robust regression, 58, 62
rugplot, 15, 40, 41
saturated model, 138
scaled deviance, 139
scatterplot, 31, 32, 94, 111, 114, 115
scatterplot matrix, 25–27
score test
  for constructed variable, 69
  for nonconstant error variance, 74
  for nonlinearity, 106
shape matrix, 121
skewness, 16, 17, 20–22, 32, 63, 65, 66, 95, 141
smoother, see loess; spline, smoothing
span, of loess smoother, 33–35, 95, 99
spline, smoothing, 133, see also regression spline
spread-level plot, 36, 37, 73, 74
standardized residuals, 46, 75
studentized residuals, 46–48, 50, 51, 55, 64, 66, 67, 71–73, 82, 141, 142
  for generalized linear model, 127, 129, 130
transformation, see also ladder of powers and roots; log transformation; power transformation
  for linearity, 28–32, 103
  for nonconstant variation, 35–38
  for symmetry, 18–21
  interpretation of, 38–39
  of explanatory variable, 6
  parameter, 21
  toward multivariate normality, 25–26
  toward normality, 23–25, 143
tricube weight function, 34, 35
unbias, of least-squares estimator, 7, 63
validity, robustness of, 62
variable selection, 119, 120
variance function, 123, 124, 136, 139
variance inflation factor, 109–111, 113, 135
variation, of explanatory variable, 110
  conditional, 53, 116
VIF, see variance inflation factor
website, 4, 13, 142
weighted least squares, 7, 9, 12, 35, 76, 78–80, 126, 128
weighted sum of squared residuals, 78
WLS, see weighted least squares
working model, 87
working residuals, 126, 128, 130, 140
working response, 139
working weights, 139, 140