INTRODUCTORY ECONOMETRICS Intuition, Proof, and Practice
Jeffrey S. Zax
Stanford Economics and Finance
An Imprint of Stanford University Press
Stanford, California

Stanford University Press
Stanford, California

© 2011 by the Board of Trustees of the Leland Stanford Junior University. All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying and recording, or in any information storage or retrieval system without the prior written permission of Stanford University Press.

Printed in the United States of America on acid-free, archival-quality paper.

Library of Congress Cataloging-in-Publication Data
Zax, Jeffrey S.
Introductory econometrics: intuition, proof, and practice / Jeffrey S. Zax.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-8047-7262-4 (cloth : alk. paper)
1. Econometrics—Textbooks. I. Title.
HB139.Z39 2011
330.01′5195—dc22
2010027083

Typeset by Newgen in 11/13 Times New Roman
To all the students who have trusted this book to help them understand; to Judy, Jacob, and Talya, who help me understand every day; and to Zvi Griliches, who helped us all understand.
BRIEF CONTENTS
Contents
List of Tables and Figures
Preface

Chapter 1 What Is a Regression?
Chapter 2 The Essential Tool
Chapter 3 Covariance and Correlation
Chapter 4 Fitting a Line
Chapter 5 From Sample to Population
Chapter 6 Confidence Intervals and Hypothesis Tests
Chapter 7 Inference in Ordinary Least Squares
Chapter 8 What If the Disturbances Have Nonzero Expectations or Different Variances?
Chapter 9 What If the Disturbances Are Correlated?
Chapter 10 What If the Disturbances and the Explanatory Variables Are Related?
Chapter 11 What If There Is More Than One x?
Chapter 12 Understanding and Interpreting Regression with Two x's
Chapter 13 Making Regression More Flexible
Chapter 14 More Than Two Explanatory Variables
Chapter 15 Categorical Dependent Variables

Epilogue
Appendix
References
Index
CONTENTS
List of Tables and Figures
Preface

Chapter 1 What Is a Regression?
1.0 What We Need to Know When We Finish This Chapter
1.1 Why Are We Doing This?
1.2 Education and Earnings
1.3 What Does a Regression Look Like?
1.4 Where Do We Begin?
1.5 Where's the Explanation?
1.6 What Do We Look for in This Explanation?
1.7 How Do We Interpret the Explanation?
1.8 How Do We Evaluate the Explanation?
1.9 R² and the F-statistic
1.10 Have We Put This Together in a Responsible Way?
1.11 Do Regressions Always Look Like This?
1.12 How to Read This Book
1.13 Conclusion
Exercises

Chapter 2 The Essential Tool
2.0 What We Need to Know When We Finish This Chapter
2.1 Is This Really a Math Course in Disguise?
2.2 Fun with Summations
2.3 Constants in Summations
2.4 Averages
2.5 Summations of Sums
2.6 More Fun with Summations of Sums
2.7 Summations of Products
2.8 Time to Reflect
Exercises

Chapter 3 Covariance and Correlation
3.0 What We Need to Know When We Finish This Chapter
3.1 Introduction
3.2 The Sample Covariance
3.3 Understanding the Sample Covariance
3.4 The Sample Correlation
3.5 Another Note about Numerical Presentation
3.6 Another Example
3.7 Conclusion
Appendix to Chapter 3
Exercises

Chapter 4 Fitting a Line
4.0 What We Need to Know When We Finish This Chapter
4.1 Introduction
4.2 Which Line Fits Best?
4.3 Minimizing the Sum of Squared Errors
4.4 Calculating the Intercept and Slope
4.5 What, Again, Do the Slope and Intercept Mean?
4.6 R² and the Fit of This Line
4.7 Let's Run a Couple of Regressions
4.8 Another Example
4.9 Conclusion
Appendix to Chapter 4
Exercises

Chapter 5 From Sample to Population
5.0 What We Need to Know When We Finish This Chapter
5.1 Introduction
5.2 The Population Relationship
5.3 The Statistical Characteristics of εi
5.4 The Statistical Characteristics of yi
5.5 Parameters and Estimators
5.6 Unbiased Estimators of β and α
5.7 Let's Explain This Again
5.8 The Population Variances of b and a
5.9 The Gauss-Markov Theorem
5.10 Consistency
5.11 Conclusion
Exercises

Chapter 6 Confidence Intervals and Hypothesis Tests
6.0 What We Need to Know When We Finish This Chapter
6.1 Introduction
6.2 The Basis of Confidence Intervals and Hypothesis Tests
6.3 Confidence Intervals
6.4 Hypothesis Tests
6.4.1 Two-Tailed Tests
6.4.2 One-Tailed Tests
6.4.3 Type I and Type II Errors
6.5 The Relationship between Confidence Intervals and Hypothesis Tests
Exercises

Chapter 7 Inference in Ordinary Least Squares
7.0 What We Need to Know When We Finish This Chapter
7.1 The Distributions of b and a
7.2 Estimating σ²
7.3 Confidence Intervals for b
7.4 Hypothesis Tests for β
7.5 Predicting yi Again
7.6 What Can We Say So Far about the Returns to Education?
7.7 Another Example
7.8 Conclusion
Appendix to Chapter 7
Exercises

Chapter 8 What If the Disturbances Have Nonzero Expectations or Different Variances?
8.0 What We Need to Know When We Finish This Chapter
8.1 Introduction
8.2 Suppose the εi's Have a Constant Expected Value That Isn't Zero
8.3 Suppose the εi's Have Different Expected Values
8.4 Suppose Equation (5.6) Is Wrong
8.5 What's the Problem?
8.6 σi², εi², ei², and the White Test
8.7 Fixing the Standard Deviations
8.8 Recovering Best Linear Unbiased Estimators
8.9 Example: Two Variances for the Disturbances
8.10 What If We Have Some Other Form of Heteroscedasticity?
8.11 Conclusion
Exercises

Chapter 9 What If the Disturbances Are Correlated?
9.0 What We Need to Know When We Finish This Chapter
9.1 Introduction
9.2 Suppose Equation (5.11) Is Wrong
9.3 What Are the Consequences of Autocorrelation?
9.4 What Is to Be Done?
9.5 Autocorrelation, Disturbances, and Shocks
9.6 Generalized Least Squares and the Example of First-Order Autocorrelation
9.7 Testing for First-Order Autocorrelation
9.8 Two-Step Estimation for First-Order Autocorrelation
9.9 What If We Have Some Other Form of Autocorrelation?
9.10 Conclusion
Exercises

Chapter 10 What If the Disturbances and the Explanatory Variables Are Related?
10.0 What We Need to Know When We Finish This Chapter
10.1 Introduction
10.2 How Could This Happen?
10.3 What Are the Consequences?
10.4 What Can Be Done?
10.5 Two-Stage Least Squares and Instrumental Variables
10.6 The Properties of Instrumental Variables Estimators
10.7 What's a Good Instrument?
10.8 How Do We Know If We Have Endogeneity?
10.9 What Does This Look Like in Real Data?
10.10 Conclusion
Exercises

Chapter 11 What If There Is More Than One x?
11.0 What We Need to Know When We Finish This Chapter
11.1 Introduction
11.2 Is There Another Assumption That We Can Violate?
11.3 What Shall We Fit This Time?
11.4 How Do b1 and b2 Really Work?
11.5 The Expected Values of b1 and b2
11.6 Conclusion
Exercises

Chapter 12 Understanding and Interpreting Regression with Two x's
12.0 What We Need to Know When We Finish This Chapter
12.1 Introduction
12.2 The Variances of b1 and b2
12.3 The Interaction between x1i and x2i
12.4 Estimated Standard Deviations
12.5 Unrestricted and Restricted Regressions
12.6 Joint Hypothesis Tests
12.7 Wait! What If It's All a Mistake?
12.8 What Happens to Chapters 8, 9, and 10 Now?
12.9 Conclusion
Exercises

Chapter 13 Making Regression More Flexible
13.0 What We Need to Know When We Finish This Chapter
13.1 Introduction
13.2 Dummy Variables
13.3 Nonlinear Effects: The Quadratic Specification
13.4 Nonlinear Effects: Logarithms
13.5 Nonlinear Effects: Interactions
13.6 Conclusion
Exercises

Chapter 14 More Than Two Explanatory Variables
14.0 What We Need to Know When We Finish This Chapter
14.1 Introduction
14.2 Can We Have More Than Two Explanatory Variables?
14.3 Inference in Multivariate Regression
14.4 Let's See Some Examples
14.5 Assumptions about the Disturbances
14.6 Panel Data
14.7 Conclusion
Exercises

Chapter 15 Categorical Dependent Variables
15.0 What We Need to Know When We Finish This Chapter
15.1 Introduction
15.2 Suppose Income Is Measured Discretely
15.3 How Would Parameter Estimates Work?
15.4 The Maximum Likelihood Estimator
15.5 How Do We Actually Maximize the Likelihood?
15.6 What Can We Do with Maximum Likelihood Estimates?
15.7 How about Some Examples?
15.8 Introduction to Selection Problems
15.9 Are There Other Ways to Do This?
15.10 Conclusion
Appendix to Chapter 15
Exercises

Epilogue
Appendix
References
Index
TABLES AND FIGURES

TABLES

1.1 Our first regression, revisited
1.2 Our first regression, revisited for the second time
1.3 Our first regression, revisited for the third time
1.4 Our first regression, revised
1.5 Our first regression, revised for the second time
2.1 Wage and salary income in 1999 for eight individuals (in dollars)
2.2 Two components of income for eight individuals (in dollars)
2.3 Components of income for eight individuals (in dollars)
2.4 Components of annual housing costs for eight individuals (in dollars)
3.1 Hypothetical data
3.2 Sample covariance calculation with hypothetical data
3.3 Sample covariance calculation with actual data
3.4 Data with a zero sample covariance
3.5 Sample variance calculations with hypothetical data
3.6 Child mortality rates and water availability by country
3.7 Census schooling levels and approximate years of schooling
4.1 Regression calculation for the sample of table 3.1
4.2 Predicted values and prediction errors for the regression in table 4.1
4.3 Rooms and rent
4.4 R² calculation, rooms and rent
4.5 Hypothetical data
5.1 Simulated data
5.2 Fifty simulated regressions
5.3 Fifty regressions on Census data
5.4 Effect of sample size on the range of estimates for simulated data
5.5 Effect of sample size on the range of estimates for Census data
6.1 Hypothesis tests: Actions and consequences
6.2 Contrasts between confidence intervals and hypothesis tests
7.1 Two methods to estimate the standard deviation of b
7.2 95% confidence intervals for β in 50 simulated regressions
7.3 Effect of sample size on the width of 95% confidence intervals for simulated data
7.4 Effect of sample size on the width of 95% confidence intervals for Census data
7.5 Simulated distributions of xi for table 5.4
7.6 Effect of variation in xi on the width of 95% confidence intervals for simulated data
7.7 Effect of variation in xi on the width of 95% confidence intervals for Census data
7.8 Effect of sample size on the width of 95% confidence intervals for ŷ0 in the simulated data of table 5.1
7.9 Effect of differences in x0 on the width of 95% confidence intervals for ŷ0 in simulated data
7.10 Gross national income per capita and corruption by country in 2002
8.1 Simulated regressions with zero and nonzero values for E(εi)
8.2 OLS on a simulated pooled sample with low and high disturbance variances
8.3 OLS on a simulated sample with a high disturbance variance
8.4 OLS on a simulated sample with a low disturbance variance
8.5 WLS on a pooled sample with low and high disturbance variances
8.6 White's heteroscedasticity test applied to Census data
8.7 Stratified regressions for the sample in table 8.6
8.8 WLS applied to Census data
10.1 OLS regressions on six simulated samples with known measurement error
10.2 Theoretical and actual OLS bias due to measurement error
10.3 2SLS regressions on six simulated samples with known measurement error
10.4 IV estimators with n = 100,000 and different degrees of measurement error
10.5 IV estimates of β when V(vi) = 4 and sample size increases
10.6 Population standard deviations for b and asymptotic standard deviations for bIV
10.7 Population standard deviations for b and asymptotic standard deviations for bIV
10.8 Estimated standard deviations for b and bIV in simulated data
10.9 Sample correlations and IV estimates with weak instruments
10.10 Hausman test for endogeneity in simulated data
12.1 Correlation between x1i and x2i and the population standard deviations of b1
12.2 Sample size and the population standard deviation of b1
12.3 Multicollinearity and slope estimates
12.4 Correlation between x1i and x2i and estimated standard deviations of slope estimates
12.5 Sample size and estimated standard deviations of slope estimates
12.6 R² and adjusted R² from restricted and unrestricted regressions
12.7 Test of the null hypothesis H0: β1 = 0 and β2 = 0
12.8 Effects of an irrelevant explanatory variable
13.1 Predicted earnings from equation (13.62)
14.1 Earnings regressions stratified by sex, sample from chapter 1
14.2 Abbreviated earnings regressions stratified by sex, sample from chapter 1
14.3 Earnings regressions stratified by sex, sample from section 7.6
14.4 Earnings regressions stratified by sex, sample from section 7.6
14.5 The regression of table 1.4, revisited
14.6 Earnings regressions stratified by minority status, sample from section 7.6
14.7 Earnings regressions stratified by sex, sample from chapter 1, omitting those with zero earnings
14.8 Abbreviated earnings regressions stratified by sex, sample from chapter 1, omitting those with zero earnings
14.9 Earnings regressions stratified by sex, sample from section 7.6, omitting those with zero earnings
14.10 Pooled earnings regression, sample from section 7.6
15.1 Determinants of the probability that earnings exceed $50,000
15.2 Extended determinants of the probability that earnings exceed $50,000
15.3 Determinants of the probability of working in 1999
15.4 Sample selection correction to table 1.5
A.1 Standard normal probabilities
A.2 t-distribution critical values
A.3 F values for α = .05
A.4 Chi-square distribution
A.5 Durbin-Watson critical values

FIGURES

1.1 Our first regression
1.2 Potential effects of age on earnings
1.3 Our first regression, restricted
1.4 Our second regression
3.1 Graph of data in 3.1
3.2 Graph of y = x²
4.1 Example of a bad line
4.2 Geometric distance from a point to a line
4.3 Vertical distance from a point to a line
4.4 Lines with identical sums of errors
4.5 Line with nonzero average prediction errors
4.6 Line with a positive correlation between X and the errors
5.1 Distribution of slope estimates, simulated data
5.2 Distribution of intercept estimates, simulated data
5.3 Distribution of slope estimates, Census data
5.4 Distribution of intercept estimates, Census data
6.1 Density function for d*
6.2 (1 − α)% confidence interval
6.3 Two-tailed hypothesis test at α% significance
6.4 P(t(19) > .501)
6.5 Upper-tailed hypothesis test at α% significance
6.6 Two-tailed and one-tailed hypothesis tests with 20 observations
6.7 Probability of a Type II error for an upper-tailed hypothesis test
6.8 Effects of lower significance levels on the probability of a Type II error
6.9 Reduced SD(d) and the probability of a Type II error
7.1 Probability of a Type II error for the Census data of table 3.3 with H0: β0 = 4,000 and H1: β1 = 5,000
8.1 Sample with a small disturbance variance
8.2 Sample with a high disturbance variance
9.1 Random shocks for the sample of equation (9.9)
9.2 Serially correlated disturbances for the sample of equation (9.9)
9.3 Serially correlated disturbances created with fourth-order autocorrelation from the shocks of figure 9.1 with γ = .9
9.4 Serially correlated disturbances for the sample of equation (9.9), ρ = .6
9.5 Serially correlated disturbances for the sample of equation (9.9), ρ = .3
11.1 Predicting yi with two explanatory variables
11.2 Predicted plane of equation (11.13)
15.1 Probability of income > $50,000 for 16 years of schooling
15.2 E(yi) with two observations
15.3 E(yi) with three observations
15.4 Predicted probabilities for two different values of xi
15.5 Changing the predicted probabilities for two different values of xi
15.6 Marginal effect of a change in xi when −(a + bxi) is close to zero
15.7 Marginal effect of a change in xi when −(a + bxi) is far from zero
15.8 Contributions of εi to the likelihood function
15.9 Intersecting typologies of estimators
PREFACE

1 Basic Objectives
2 Innovations
3 Math
4 Statistics
5 Conclusion
1 Basic Objectives The purpose of this book is to teach sound fundamentals to students, at any level, who are being introduced to econometrics for the first time and whose preparation or enthusiasm may be limited. This book is a teaching tool rather than a reference work. The way that a typical student needs to approach this material initially is different from the way that a successful student will return to it. This book engages novices by starting where they are comfortable beginning. It then stimulates their interests, reinforces their technical abilities, and trains their instincts. For these purposes, the first objective of this book is the same as that of a medical doctor: Do no harm. Armed with only some vocabulary from a conventional textbook and a standard computer software package, anyone can run a regression, regardless of his or her preparation. The consequences of econometric work that lacks precision, care, and understanding can range from inept student research to bad public policy and disrepute for the econometric enterprise. These outcomes are best prevented at the source, by ensuring that potential perpetrators are exposed to the requirements, as well as the power, of econometric techniques.
Many of the students who take this course will actually run regressions in their subsequent professional life. Few will take another econometrics course. For most, this is a one-shot opportunity. It is crucial that we—students and teacher—make the best of it. At minimum, this book is designed so that those who finish it are aware that econometric tools shouldn’t be used thoughtlessly. This book’s additional objectives form a hierarchy of increasing optimism. Beyond basic awareness of the power of econometrics, we want students to understand and use responsibly the techniques that this book has taught them. Better still, we want students to recognize when they have an econometric challenge that goes beyond these techniques. Best of all, we want students to have enough preparation, respect, and perhaps even affection for econometrics that they’ll continue to explore its power, formally or otherwise.
2 Innovations In pursuit of these objectives, this text makes five distinctive choices:

1. It emphasizes depth rather than breadth.
2. It is curiosity driven.
3. It discusses violations of the standard assumptions immediately after the presentation of the two-variable model, where they can be handled insightfully with ordinary algebra.
4. The tone is conversational whenever possible, in the hope of encouraging engagement with the formalities. It is precise whenever necessary, in order to eliminate confusion.
5. The text is designed to evolve as students learn.

This book engages only those subjects that it is prepared to address and that students can be expected to learn in depth. The formal presentations are preceded, complemented, and illuminated by extensive discussions of intuition. The formal presentations themselves are usually complete, so that there is no uncertainty regarding the required steps. End-of-chapter exercises provide students with opportunities to experiment on their own—both with derivations and their interpretations. Consequently, this book devotes more pages to the topics that it does cover than do other books. The hope is that, in compensation, lectures will be more productive, office hours less congested, and examinations more satisfying.

Equivalently, this book does not "expose" students to advanced topics, in order to avoid providing them with impressions that would be unavoidably superficial and untrustworthy. The text includes only one chapter that goes beyond the clas-
sical regression model and its conventional elaborations. This chapter introduces limited dependent variables, a topic that students who go on to practice econometrics without further training are especially likely to encounter. A typical course should complete this entire text, with the possible exception of chapter 15, in a single semester. Students who do so will then be ready, should they choose, for one of the many texts that offer more breadth and sophistication. Motivation is the second innovation in this book. The sequence of topics is driven by curiosity regarding the results, rather than by deduction from first principles. The empirical theme throughout is the relationship between education and earnings, which ought to be of at least some interest to most students. The formal presentation begins in chapter 3, which confronts students with data regarding these two variables and simply asks, “What can be made of them?” Initial answers to this question naturally raise further questions: The covariance indicates the direction of the association, but not its strength. The correlation indicates its strength, but not its magnitude. The desire to know each of these successive attributes leads, inexorably, to line fitting in chapter 4. Finally, the desire to generalize beyond the observed sample invokes the concept of the population. This contrasts with the typical presentation, which begins with the population regression. That approach is philosophically appealing. Pedagogically, it doesn’t work. First-time students are not inclined to philosophic rigor. Worse, many first-time students never really get the concept of the population. It’s too abstract. If the course starts with it, students begin with a confusion that many never successfully dispel. For this reason, this book refers only obliquely to the contemporary approach to regression analysis in terms of conditional expectations. This approach is deeply insightful to those with a thorough understanding of statistical fundamentals. The many undergraduates who struggle with the summations of chapter 2 do not truly understand expectations, much less those that are conditional. We cannot expect them to appreciate any approach that is based first on population relationships. The sample is a much more accessible concept. It’s concrete. In several places in this book, it actually appears on the page. The question of how to deal with the sample provokes curiosity regarding its relationship with the underlying population. This gives the student both foundation and motivation to confront the population concept in chapter 5. Subsequent chapters repeat this pattern. Each begins by raising questions about the conclusion of a previous chapter. In each, the answers raise the level of econometric sophistication. The text aspires to engage students to the point that they are actually eager to pursue this development.
The third innovation places the discussions of inference, heteroscedasticity, correlated disturbances, and endogeneity in chapters 6 through 10, immediately after the presentation of the model with one explanatory variable. In the case of inference, students may have seen the basic results in the univariate context before. If they did, they probably didn’t understand them very well. Bivariate regression provides the most convenient and accessible context for review. Moreover, it is all that is necessary for most of the results. Heteroscedasticity, correlated disturbances, and endogeneity all concern violations regarding the ordinary assumptions about the disturbance terms. The methods of diagnosis and treatment do not vary in any important way with the number of explanatory variables. The conventional formulas are often especially intuitive if only one explanatory variable is present. This arrangement relieves the presentation of the multivariate model of the burden of conveying these ancillary issues. Accordingly, chapter 11 concentrates on what is truly novel in the multivariate model: the effects of omitted variables, the implications of including an irrelevant variable, the consequences of correlations among the explanatory variables, and statistical tests of joint significance. In addition, this arrangement ensures that shorter courses, such as those that take place within a quarter rather than a semester, or courses in which progress is slower can still address these alternative assumptions. If the students in such courses must omit some basic material, they are better served by exposure to the problems to which regression may be subject than to an expansion of the basic results regarding the estimation of slopes. This innovation, like the second, is a departure from the ordinary order of presentation. These discussions usually appear after the multivariate model. However, the presence of additional explanatory variables adds nothing, apart from additional notation. With an audience that is wary of notation to begin with, this is counterproductive. As its fourth innovation, this book adopts a conversational tone wherever possible. Appropriate respect for the power of statistical analysis requires some familiarity with formal derivations. However, formal discussions reinforce the prejudice that this material is incompatible with the typical student sensibility. This prejudice defeats the purpose of the derivations. Their real point is that they are almost always intuitive, usually insightful, and occasionally surprising. The real objective in their presentation is to develop the intuitive faculty, in the same way that repeated weight-lifting develops the muscles. The book pursues this objective by placing even greater emphasis on the revelations in the formulas than on their derivations. At the same time, the book is meticulous about formal terminology. The names of well-defined concepts are consistent throughout. In particular, the text is rigorous in distinguishing between population and sample concepts.
This rigor is distinctive. The word “mean” provides an egregious example of ambiguity in common usage. As a noun, this term appears as a synonym for both the average in the sample and the expected value in the population. With an audience who can never confidently distinguish between the sample and the population in the first place, this ambiguity can be lethal. Here, “mean” appears only as a verb. This ambiguity is also rampant in discussions of variances, standard deviations, covariances, and correlations. The same nouns are typically employed, regardless of whether samples or populations are at issue. This text distinguishes carefully between the two. Chapter 3 qualifies all variances, standard deviations, covariances, and correlations as sample statistics. The common Greek symbols for variances, standard deviations, and correlations appear only after chapter 5 explains the distinction between sample statistics and population parameters. Finally, this book is designed to evolve with the students’ understanding. Each chapter begins with the section “What We Need to Know When We Finish This Chapter,” which sets attainable and appropriate goals as the material is first introduced. The text of each chapter elaborates on these goals so as to make them accessible to all. Once mastery is attained, the “What We Need to Know . . .” sections, the numbered equations, and the tables and figures serve as a concise but complete summary of the material and a convenient source for review. A companion Web site—www.sup.org/econometrics—provides these review materials independent of the book. It also provides instructors with password-protected solutions to the end-of-chapter exercises. Instructors can release these to students as appropriate so that they can explore the material further on their own.
3 Math Many students would probably prefer a treatment of econometrics that is entirely math-free. This book acknowledges the mathematical hesitations and limitations of average students without indulging them. It carefully avoids any mathematical development that is not essential. However, derivation can’t be disregarded, no matter how ill-prepared the student. Proof is how we know what it is that we know. Students have to have some comfort with the purpose and process of formal derivation in order to be well educated in introductory econometrics. Some respect for the formal properties of regression is also a prerequisite for responsible use. This book accommodates the skills of the typical student by developing all basic results regarding the classical linear regression model and the
elaborations associated with heteroscedasticity, correlated disturbances, and endogeneity in the language of ordinary algebra. Virtually all of the mathematical operations consist entirely of addition, subtraction, multiplication, and division. There is no reference to linear algebra. The only sophistication that the book requires is a facility with summation signs. The second chapter provides a thorough review. In general, individual steps in the algebraic derivations are simple. The text explains all, or nearly all, of these steps. This level of detail replicates what, in my experience, instructors end up saying in class when students ask for explanations of what’s in the book. With these explanations, the derivations, as a whole, should be manageable. They also provide opportunities to emphasize emerging insights. Such insights are especially striking in chapter 11, where ordinary algebra informs intuitions that would be obscured in the matrix presentation. Chapters 11 through 14 present the three-variable model entirely with ordinary algebra. By the time students reach this point, they are ready to handle the complexity. This treatment has two advantages. First, students don’t need matrices as a prerequisite and faculty don’t have to find time to teach, or at least review, them. Second, the algebraic formulas often reveal intuitions that are obscure or unavailable in matrix formulations. Similarly, this book contains very few references to calculus. Only five derivatives are inescapable. Two occur when minimizing the sum of squared errors in the two-variable model of chapter 4. Three appear when executing the same task in the three-variable model of chapter 11. All five are presented in their entirety, so that the students are not responsible for their derivation. All other derivatives are optional, in the appendices to chapters 4 and 7, in chapter 15, and in several exercises. In sum, this book challenges students where they are most fearful, by presenting all essential derivations, proofs, and results in the language with which they are most familiar. Students who have used this book uniformly acknowledge that they have an improved appreciation for proof. This will be to their lasting benefit, well beyond any regressions they might later run.
4 Statistics This text assumes either that students have not had a prior course in introductory statistics or that they don’t remember very much of it. It does not derive the essential concepts, but reviews all of them. In contrast to other texts, this review does not appear as a discrete chapter. This conventional treatment suggests a distinction between statistics and
whatever else is going on in econometrics. It reinforces the common student suspicion, or even hope, that statistics can be ignored at some affordable cost. Instead, the statistics review here is consistent with the curiosity-driven theme. The text presents statistical results at the moment when they are needed. As examples, it derives the sample covariance and sample correlation from first principles in chapter 3, in response to the question of how to make any sense at all out of a list of values for earnings and education. The text defines expectations and population variances in section 5.3, as it introduces the disturbances. Results regarding the expectation of summations first appear in section 5.4, where they are necessary to derive the expected value of the dependent variable. Variances of summations are not required until the derivation of the variance of the ordinary least squares slope estimator. Accordingly, the relevant formulas do not appear until section 5.8. Similarly, a review of confidence intervals and hypothesis tests immediately precedes the formal discussion of inference regarding regression estimates. This review is extensive both because inference is essential to the responsible interpretation of regression results and because students are rarely comfortable with this material when they begin an econometrics course. Consequently, it appears in the self-contained chapter 6. The application to the bivariate regression context appears in chapter 7.
5 Conclusion This book tries to strike a new balance between the capacities and enthusiasms of instructors and students. The former typically have a lot more of both than the latter. This book aspires to locate a common understanding between the two. It attempts to distill the instructor’s knowledge into a form that preserves its essence but is acceptable and even appealing to the student’s intellectual palate. This book insists on rigor where it is essential. It emphasizes intuition wherever it is available. It seizes upon entertainment. This book is motivated by three beliefs. First, that students are, perhaps despite themselves, interested in questions that only econometrics can answer. Second, through these answers they can come to understand, appreciate, and even enjoy the enterprise of econometrics. Third, this text, with its innovations in presentation and practice, can provoke this interest and encourage responsible and insightful application. With that, let’s go!
CHAPTER 1

WHAT IS A REGRESSION?

1.0 What We Need to Know When We Finish This Chapter
1.1 Why Are We Doing This?
1.2 Education and Earnings
1.3 What Does a Regression Look Like?
1.4 Where Do We Begin?
1.5 Where's the Explanation?
1.6 What Do We Look for in This Explanation?
1.7 How Do We Interpret the Explanation?
1.8 How Do We Evaluate the Explanation?
1.9 R² and the F-statistic
1.10 Have We Put This Together in a Responsible Way?
1.11 Do Regressions Always Look Like This?
1.12 How to Read This Book
1.13 Conclusion
Exercises
1.0 What We Need to Know When We Finish This Chapter

This chapter explains what a regression is and how to interpret it. Here are the essentials.

1. Section 1.4: The dependent or endogenous variable measures the behavior that we want to explain with regression analysis.
2. Section 1.5: The explanatory, independent, or exogenous variables measure things that we think might determine the behavior that we want to explain. We usually think of them as predetermined.
3. Section 1.5: The slope estimates the effect of a change in the explanatory variable on the value of the dependent variable.
4. Section 1.5: The t-statistic indicates whether the explanatory variable has a discernible association with the dependent variable. The association is discernible if the p-value associated with the t-statistic is .05 or less. In this case, we say that the slope is statistically significant. This generally corresponds to an absolute value of approximately two or greater for the t-statistic itself. If the t-statistic has a p-value that is greater than .05, the associated slope coefficient is insignificant. This means that the explanatory variable has no discernible effect.
5. Section 1.6: The intercept is usually uninteresting. It represents what everyone has in common, rather than characteristics that might cause individuals to be different.
6. Section 1.6: We usually interpret only the slopes that are statistically significant. We usually think of them as indicating the effect of their associated explanatory variables on the dependent variable ceteris paribus, or holding constant all other characteristics that are included in the regression.
7. Section 1.6: Continuous variables take on a wide range of values. Their slopes indicate the change that would be expected in the dependent variable if the value of the associated explanatory variable increased by one unit.
8. Section 1.6: Discrete variables, sometimes called categorical variables, indicate the presence or absence of a particular characteristic. Their slopes indicate the change that would occur in the dependent variable if an individual who did not have that characteristic were given it.
9. Section 1.7: Regression interpretation requires three steps. The first is to identify the discernible effects. The second is to understand their magnitudes. The third is to use this understanding to verify or modify the behavioral understanding that motivated the regression in the first place.
10. Section 1.7: Statistical significance is necessary in order to have interesting results, but not sufficient. Important slopes are those that are both statistically significant and substantively large. Slopes that are statistically significant but substantively small indicate that the effects of the associated explanatory variable can be reliably understood as unimportant.
11. Section 1.7: A proxy is a variable that is related to, but not exactly the variable we really want. We use proxies when the variables we really want aren't available. Sometimes this makes interpretation difficult.
12. Section 1.8: If the p-value associated with the F-statistic is .05 or less, the collective effect of the ensemble of explanatory variables on the dependent variable is statistically significant.
13. Section 1.8: Observations are the individual examples of the behavior under examination. All of the observations together constitute the sample on which the regression is based.
14. Section 1.8: The R², or coefficient of determination, represents the proportion of the variation in the dependent variable that is explained by the explanatory variables. The adjusted R² modifies the R² in order to take account of the numbers of explanatory variables and observations. However, neither measures statistical significance directly.
15. Section 1.9: F-statistics can be used to evaluate the contribution of a subset of explanatory variables, as well as the collective statistical significance of all explanatory variables. In both cases, the F-statistic is a transformation of R² values, as the sketch following this list illustrates.
16. Section 1.10: Regression results are useful only to the extent that the choices of variables in the regression, variable construction, and sample design are appropriate.
17. Section 1.11: Regression results may be presented in one of several different formats. However, they all have to contain the same substantive information.
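As a preview of items 14 and 15, we can check the relationship between R² and the F-statistic against the values reported in figure 1.1, which appears later in this chapter. The short sketch below uses the standard formulas, stated here without derivation, so treat it only as an illustrative calculation rather than as anything this chapter has yet justified.

```python
# Check, using the values reported in figure 1.1, that the adjusted R-squared
# and the F-statistic can be computed from R-squared, the number of
# observations, and the number of explanatory variables.
r_squared = 0.1652
n = 1000   # observations
k = 8      # explanatory variables, not counting the intercept

f_statistic = (r_squared / k) / ((1 - r_squared) / (n - k - 1))
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(f"F-statistic: {f_statistic:.1f}")                # roughly 24.5
print(f"Adjusted R-squared: {adjusted_r_squared:.4f}")  # roughly .158
```

Both numbers match the ones reported in figure 1.1 up to rounding.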
1.1 Why Are We Doing This? The fundamental question that underlies most of science is, how does one thing affect another? This is the sort of question that we ask ourselves all the time. Whenever we wonder whether our grade will go up if we study more,
whether we’re more likely to get into graduate school if our grades are better, or whether we’ll get a better job if we go to graduate school, we are asking questions that econometrics can answer with elegance and precision. Of course, we probably think we have answers to these questions already. We almost surely do. However, they’re casual and even sloppy. Moreover, our confidence in them is almost certainly exaggerated. Econometrics is a collection of powerful statistical tools that are devoted to helping provide answers to the question of how one thing affects another. Econometrics not only teaches us how to answer questions like this more accurately but also helps us understand what is necessary in order to obtain an answer that we can legitimately treat as accurate. We begin in this chapter with a primer on how to interpret regression results. This will allow us to read work based on regression and even to begin to perform our own analyses. We might think that this would be enough. However, this chapter will not explain why the interpretations it presents are valid. That requires a much more thorough investigation. We prepare for this investigation in chapter 2. There, we review the summation sign, the most important mathematical tool for the purposes of this book. We actually embark on this investigation in chapter 3, where we consider the precursors to regression: the covariance and the correlation. These are basic statistics that measure the association between two variables, without regard to causation. We might have seen them before. We return to them in detail because they are the mathematical building blocks from which regressions are constructed. Our primary focus, however, will be on the fundamentals of regression analysis. Regression is the principal tool that economists use to assess the responsiveness of some outcome to changes in its determinants. We might have had an introduction to regression before as well. Here, we devote chapters 4, 5, and 7 through 14 to a thorough discussion. Chapter 6 intervenes with a discussion of confidence intervals and hypothesis tests. This material is relevant to all of statistics, rather than specific to econometrics. We introduce it here to help us complete the link between the regression calculations of chapter 4 and the behavior that we hope they represent, discussed in chapter 5. Chapter 15 discusses what we can do in a common situation where we would like to use regression, but where the available information isn’t exactly appropriate for it. This discussion will introduce us to probit analysis, an important relative of regression. More generally, it will give us some insight as to how we might proceed when faced with other situations of this sort. As we learn about regression, we will occasionally need concepts from basic statistics. Some of us may have already been exposed to them. For those of us in this category, chapters 3 and 6 may seem familiar, and perhaps even
chapter 4. For those of us who haven’t studied statistics before, this book introduces and reviews each of the relevant concepts when our discussion of regression requires them.1
1.2 Education and Earnings Few of us will be interested in econometrics purely for its theoretical beauty. In fact, this book is based on the premise that what will interest us most is how econometrics can help us organize the quantitative information that we observe all around us. Obviously, we’ll need examples. There are two ways to approach the selection of examples. Econometric analysis has probably been applied to virtually all aspects of human behavior. This means that there is something for everyone. Why not provide it? Well, this strategy would involve a lot of examples. Most readers wouldn’t need that many to get the hang of things, and they probably wouldn’t be interested in a lot of them. In addition, they could make the book a lot bigger, which might make it seem intimidating. The alternative is to focus principally on one example that may have relatively broad appeal and develop it throughout the book. That’s the choice here. We will still sample a variety of applications over the course of the entire text. However, our running example returns, in a larger sense, to the question of section 1.1: Why are we doing this? Except now, let’s talk about college, not this course. Presumably, at least some of the answer to that question is that we believe college prepares us in an important way for adulthood. Part of that preparation is for jobs and careers. In other words, we probably believe that education has some important effect on our ability to support ourselves. This is the example that we’ll pursue throughout this book. In the rest of this chapter, we’ll interpret a somewhat complicated regression that represents the idea that earnings are affected by several determinants, with education among them. In chapter 3, we’ll return to the basics and simply ask whether there’s an association between education and earnings. Starting in chapter 4, we’ll assume that education affects earnings and ask: by how much? In chapter 10, we’ll examine whether the assumption that education causes earnings is acceptable, and what can be done if it’s not. As we can see, we’ll ask this question with increasing sophistication as we proceed through this book.2 The answers will demonstrate the power of econometric tools to address important quantitative questions. They will also serve as illustrations for applications to other questions. Finally, we can hope that they will confirm our commitment to higher education.
Figure 1.1 Our first regression
Earnings = −19,427 × Intercept + 3,624.3 × Years of schooling
           (−2.65)               (9.45)
         + 378.60 × Age − 17,847 × Female
           (3.51)         (−7.17)
         − 10,130 × Black − 2,309.9 × Hispanic
           (−2.02)          (−.707)
         − 8,063.9 × American Indian or Alaskan Native
           (−.644)
         − 4,035.6 × Asian − 3,919.1 × Native Hawaiian, other Pacific Islander
           (−.968)            (−.199)

R² = .1652
Adjusted R² = .1584
F-statistic = 24.5, p-value < .0001
Number of observations = 1,000
Note: Parentheses contain t-statistics.
1.3 What Does a Regression Look Like? Figure 1.1 is one way to present a regression. Does that answer the question? Superficially, yes. But what does it all mean? This question can be answered on two different levels. In this chapter, we’ll talk about how to interpret the information in figure 1.1. This should put us in a position to read and understand other work based on regression analysis. It should also allow us to interpret regressions of our own. In the rest of the book, we’ll talk about why the interpretations we offer here are valid. We’ll also talk about the circumstances under which these interpretations may have to be modified or may even be untrustworthy. There will be a lot to say about these matters. But, for the moment, it will be enough to work through the mystery of what figure 1.1 could possibly be trying to reveal.
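Before we dig into the pieces, it may help to see how a table of numbers like figure 1.1 is typically produced in practice. The sketch below uses Python's statsmodels package on a simulated data set. The variable names, the made-up earnings process, and the shortened list of explanatory variables are all illustrative assumptions; this is not the actual data or output behind figure 1.1.

```python
# A minimal sketch of how a regression like the one in figure 1.1 might be
# estimated in practice, using simulated data. The data set, the variable
# names, and the earnings process below are invented for illustration; they
# are not the sample that produced figure 1.1.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000

data = pd.DataFrame({
    "schooling": rng.integers(0, 22, n),   # years of schooling, 0 to 21
    "age": rng.integers(18, 66, n),        # age in years, 18 to 65
    "female": rng.integers(0, 2, n),       # 1 = female, 0 = male
    "black": rng.integers(0, 2, n),        # 1 = reports being black
})
# A made-up earnings process, included only so that the example runs.
data["earnings"] = (
    -20000
    + 3600 * data["schooling"]
    + 380 * data["age"]
    - 18000 * data["female"]
    - 10000 * data["black"]
    + rng.normal(0, 30000, n)
)

# The dependent variable, and the explanatory variables plus an intercept.
y = data["earnings"]
X = sm.add_constant(data[["schooling", "age", "female", "black"]])

# Ordinary least squares. The summary reports a slope, a t-statistic, and a
# p-value for each explanatory variable, along with R-squared, an adjusted
# R-squared, an F-statistic, and the number of observations.
results = sm.OLS(y, X).fit()
print(results.summary())
```

The printed summary contains the same ingredients as figure 1.1: slopes, t-statistics, an R², an F-statistic, and the number of observations. It simply arranges them in a different format, which is the point that section 1.11 returns to.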
1.4 Where Do We Begin? The first thing to understand about regression is the very first thing in figure 1.1. The word "earnings" identifies the dependent variable in the regression. The dependent variable is also referred to as the endogenous variable.
The dependent variable is the primary characteristic of the entities whose behavior we are trying to understand. These entities might be people, companies, governments, countries, or any other choice-making unit whose behavior might be interesting. In the case of figure 1.1, we might wonder if “earnings” implies that we’re trying to understand company profits. However, “earnings” here refers to the payments that individuals get in return for their labor. So the entities of interest here are workers or individuals who might potentially be workers. “Dependent” and “endogenous” indicate the role that “earnings” plays in the regression of figure 1.1. We want to explain how it gets determined. “Dependent” suggests that “earnings” depends on other things. “Endogenous” implies the same thing, though it may be less familiar. It means that the value of “earnings” is determined by other, related pieces of information. The question of what it means to “explain” something statistically can actually be quite subtle. We will have some things to say about this in chapters 4 and 10. Initially, we can proceed as if we believe that the things that we use to “explain” earnings actually “cause” earnings.
1.5 Where’s the Explanation? Most of the rest of figure 1.1, from the equality sign to “(−.199),” presents our explanation of earnings. The equality sign indicates that we’re going to represent this explanation in the form of an equation. On the right side of the equation, we’re going to combine a number of things algebraically. Because of the equality, it looks as though the result of these mathematical operations will be “earnings.” Actually, as we’ll learn in chapter 4, it will be more accurate to call this result “predicted earnings.” The material to the right of the equality sign in figure 1.1 is organized into terms. The terms are separated by signs for either addition (+) or subtraction (−). Each term consists of a number followed by the sign for multiplication (×), a word or group of words, and a second number in parentheses below the first. In each term, the word or group of words identifies an explanatory variable. An explanatory variable is a characteristic of the entities in question, which we think may cause, or help to create, the value that we observe for the dependent variable. Explanatory variables are also referred to as independent variables. This indicates that they are not “dependent.” For our present purposes, this means that they do not depend on the value of the dependent variable. Their values arise without regard to the value of the dependent variable.
Explanatory variables are also referred to as exogenous variables. This indicates that they are not “endogenous.” Their values are assigned by economic, social, or natural processes that are not under study and not affected by the process that is. The variables listed in the terms to the right of the equality can be thought of as causing the dependent variable, but not the other way around. We often summarize this assumption as “causality runs in only one direction.” This same idea is sometimes conveyed by the assertion that the explanatory variables are predetermined. This means that their values are already known at the moment when the value of the dependent variable is determined. They have been established at an earlier point in time. The point is that, as a first approximation, behavior that occurs later, historically, can’t influence behavior that preceded it.3 This proposition is easy to accept in the case of the regression of figure 1.1. Earnings accrue during work. Work, or at least a career, typically starts a couple of decades into life. Racial or ethnic identity and sex are usually established long before then. Age accrues automatically, starting at birth. Schooling is usually over before earnings begin as well. Therefore, it would be hard to make an argument at this stage that the dependent variable, earnings, causes any of the explanatory variables.4 In each term of the regression, the number that multiplies the explanatory variable is its slope. The reason for this name will become apparent in chapter 4.5 The slope estimates the magnitude of the effect that the explanatory variable has on the dependent variable. Finally, the number in parentheses measures the precision of the slope. In figure 1.1, these numbers are t-statistics. What usually matters most with regard to interpretations of the t-statistic is its p-value. However, the p-value doesn’t appear in figure 1.1. This is because they don’t appear in the most common presentations of regression results, which is what our discussion of figure 1.1 is preparing us for. We’ll offer an initial explanation of p-values and their interpretations in section 1.8. We’ll present the p-values for figure 1.1 in table 1.3 of section 1.11. Finally, we’ll explore the calculation and interpretation of t-statistics at much greater length in chapter 6. In the presentation of figure 1.1, what matters to us most is the absolute value of the t-statistic. If it is approximately two or greater, we can be pretty sure that the associated explanatory variable has a discernible effect on the dependent variable. In this case, we usually refer to the associated slope as being statistically significant, or just significant. It is our best guess of how big this effect is. If the absolute value of the t-statistic is less than approximately two, regression has not been able to discern an effect of the explanatory variable on
the dependent variable, according to conventional standards. There just isn’t enough evidence to support the claim that the explanatory variable actually affects the dependent variable. In this case, we often refer to the associated slope as statistically insignificant, or just insignificant. As we’ll see in chapter 6, this is not an absolute judgment. t-statistics that are less than two in absolute value, but not by much, indicate that regression has identified an effect worth noting by conventional standards. In contrast, t-statistics that have absolute values of less than, say, one, indicate that there’s hardly a hint of any discernible relationship.6 Nevertheless, for reasons that will become clearer in chapter 6, we usually take two to be the approximate threshold distinguishing explanatory variables that have effects worth discussing from those that don’t. As we can see in figure 1.1, regression calculates a value for the slope regardless of the value of the associated t-statistic. This discussion demonstrates, however, that not all of these slopes have the same claim on our attention. If a t-statistic is less than two in absolute value, and especially if it’s a lot less than two, it’s best to assume, for practical purposes, that the associated explanatory variable has no important effect on the dependent variable.
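The rule of thumb of two and the .05 standard for p-values are two ways of stating the same criterion. The sketch below, offered only as an illustration, translates a few of the t-statistics from figure 1.1 into two-sided p-values using the t-distribution with 991 degrees of freedom, which corresponds to 1,000 observations and the 9 estimated coefficients in figure 1.1; chapter 6 explains where calculations like this come from.

```python
# Illustrative check of the rule of thumb: with a sample as large as the one
# in figure 1.1, a t-statistic of roughly 2 in absolute value corresponds to
# a two-sided p-value of roughly .05. The t-statistics below are absolute
# values taken from figure 1.1.
from scipy import stats

degrees_of_freedom = 1000 - 9   # 1,000 observations, 9 estimated coefficients

for t_statistic in [0.199, 0.707, 2.02, 3.51, 9.45]:
    p_value = 2 * stats.t.sf(abs(t_statistic), degrees_of_freedom)
    verdict = "significant at .05" if p_value <= 0.05 else "not significant"
    print(f"|t| = {t_statistic:5.3f}  ->  p-value = {p_value:.3f}  ({verdict})")
```

The borderline case is the slope for black: its t-statistic of −2.02 just clears the threshold, while the t-statistics for the remaining racial and ethnic categories fall well short of it.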
1.6
What Do We Look for in This Explanation? The regression in figure 1.1 contains nine terms. Eight contain true explanatory variables and are potentially interesting. The first term, which figure 1.1 calls the intercept, does not contain a true explanatory variable. As we’ll see in chapter 4 and, more important, in chapter 7, we need it to predict values of the dependent variable. Otherwise, the intercept is ordinarily uninteresting.7 It measures a part of the dependent variable that is common to all entities under examination. In other words, it measures a part that doesn’t depend on the other characteristics of these entities that are included as explanatory variables in the regression. This isn’t usually interesting because we’re typically concerned with explaining why different people or organizations are different from each other. The intercept usually tells us only about what they share. Consequently, it isn’t informative about the relevant question. In the case of figure 1.1, the interesting question is why some people have higher earnings than others. The intercept in the regression there tells us that everyone starts out with −$19,427 in annual earnings, regardless of who they are. This can’t be literally true. It’s probably best to take the intercept as simply a mechanical device. As we’ll see in chapter 4, its purpose is just to provide
an appropriate starting point from which to gauge the effects of the genuine explanatory variables. The eight true explanatory variables are potentially interesting because they measure specific characteristics of each person. Regression can attempt to estimate their contributions to earnings because they appear explicitly in the regression. The first question that must be asked with regard to any of them is whether the regression contains any evidence that they actually affect the dependent variable. As we learned in the last section, the t-statistic answers this question. Therefore, the first number to look at with regard to any of the explanatory variables in figure 1.1 is the number in parentheses. If it has an absolute value that is greater than two, then the regression estimates a discernible effect of the associated explanatory variable on the dependent variable. These variables deserve further attention. In figure 1.1, t-statistics indicate that four explanatory variables are statistically significant, or have discernible effects: years of schooling, age, female, and black. With regard to these four, the next question is, how big are these effects? As we said in the last section, the answer is in the slopes. Two of these explanatory variables, years of schooling and age, are continuous. This means that they can take on a wide range of values. This regression is based on individuals whose years of schooling range from 0 to 21. Age varies from 18 to 65.8 For these variables, the simplest interpretation of their slopes is that they estimate how earnings would change if years of schooling or age increased by a year. For example, the slope for years of schooling is 3,624.3. This indicates that earnings could be expected to increase by $3,624.30 for each additional year devoted to study. Similarly, the slope for age is 378.60. Earnings would increase by $378.60 annually simply as an individual grows older. This interpretation is based on the image of following individuals as their lives evolve. This sort of image will usually be helpful and not grossly inappropriate. However, it’s not exactly what’s going on in figure 1.1. That regression is not comparing different moments in the life of the same individual. Instead, it’s comparing many different individuals, of different ages, all observed at the same moment, to each other. This suggests a more correct, though more subtle interpretation of, for example, the slope associated with years of schooling. It actually compares the earnings of two individuals who have the same values for all of the other explanatory variables, but who differ by one year in their schooling. In other words, the regression of figure 1.1 tells us that if we had two individuals who had the same racial or ethnic identification, were of the same sex and age, but differed by one year in schooling attainment, we would expect the individual
with greater schooling to have annual earnings that exceeded those of the other individual by $3,624.30. Chapter 11 will demonstrate formally why this interpretation is appropriate. Until then, it’s enough to summarize the interpretation of the preceding paragraph as follows: Any slope estimates the effect of the associated explanatory variable on the dependent variable, holding constant all other independent variables. This interpretation is often conveyed by the Latin phrase ceteris paribus.9 It’s important to remember that, regardless of the language in which we state this interpretation, it means that we are holding constant only the other variables that actually appear as explanatory in the regression.10 We will often summarize this condition by stating that we are comparing individuals or entities that are otherwise similar, except for the explanatory variable whose slope is under discussion. The ceteris paribus interpretation of the slope for the age variable is analogous to that of years of schooling. Again, if we compared two individuals who had identical racial or ethnic identities, the same sex and level of education, but who differed by one year in age, the older’s earnings would exceed those of the younger by $378.60. The two remaining statistically significant explanatory variables are female and black. Both are discrete variables, or categorical variables, meaning that they identify the presence of a characteristic that we ordinarily think of as indivisible. In this case, the variable female distinguishes women from men, and the variable black distinguishes individuals who reported themselves as at least partially black or African American from those who did not.11 For this reason, the interpretation of the slopes associated with discrete explanatory variables differs somewhat from that of the slopes associated with continuous explanatory variables. In the latter case, the slope indicates the effect of a marginal change in the explanatory variable. In the former case, the slope indicates the effect of changing from one category to another. At the same time, the interpretation of slopes associated with discrete variables is ceteris paribus, as is that of slopes associated with continuous variables. In other words, if we compare a man and a woman who have the same age, the same amount of schooling, and the same racial or ethnic identities, we expect their incomes to differ by the amount indicated by the slope associated with the female variable. In figure 1.1, the slope indicates that the woman would have annual earnings that are $17,847 less than those of the man. Similarly, the slope for blacks indicates that, if we compare two individuals of the same age, schooling, and sex, annual earnings for the black person will be $10,130 less than those of the otherwise similar white person.
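To make the "otherwise similar" comparison concrete, we can simply evaluate the regression equation of figure 1.1 for two hypothetical people who differ only in schooling. The following is a minimal sketch, not part of the original text; the function name and the two invented profiles are ours, and for brevity it sets the racial or ethnic indicators other than black to zero.

```python
def predicted_earnings(schooling, age, female, black):
    # Intercept and slopes as reported in figure 1.1; the remaining
    # ethnicity indicators are omitted here (set to zero) for brevity.
    return (-19427
            + 3624.3 * schooling
            + 378.60 * age
            - 17847 * female
            - 10130 * black)

# Two hypothetical people, identical except for one year of schooling.
person_a = predicted_earnings(schooling=12, age=40, female=1, black=0)
person_b = predicted_earnings(schooling=13, age=40, female=1, black=0)

print(person_b - person_a)  # 3,624.30: the ceteris paribus slope for schooling
```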
These must seem like very large differences. We’ll talk about this in the next section. We conclude this section by noting that the slopes of the variables identifying individuals who are Hispanic, American Indian or Alaskan Native, and Asian are all statistically insignificant. Nevertheless, they also seem to be large. Even the smallest of them, that for Hispanics, indicates that their annual earnings might be $2,309.90 less than those of otherwise similar whites. Although the magnitudes of the slopes for these variables might seem large and even alarming, it’s inappropriate to take them seriously. Not only are their t-statistics less than two, they’re a lot less than two. This means that, even though regression has calculated slopes for these variables, it really can’t pin down their effects, if any, with any precision. In later chapters, we’ll discuss what we might do if we wanted to identify them more clearly.
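The "absolute value greater than two" rule corresponds, in a sample this large, roughly to a two-sided test at the .05 level. Here is a quick sketch of that correspondence; it is our illustration, not the book's, and it uses the normal distribution as a large-sample approximation.

```python
from scipy.stats import norm

def approximate_two_sided_p(t_statistic):
    # Large-sample approximation: twice the upper tail of the standard normal.
    return 2 * norm.sf(abs(t_statistic))

print(approximate_two_sided_p(2.0))     # about .046, just under the .05 threshold
print(approximate_two_sided_p(9.45))    # essentially zero: years of schooling
print(approximate_two_sided_p(-0.707))  # about .48: Hispanic, far from discernible
```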
1.7 How Do We Interpret the Explanation?

Regression interpretation proceeds through three steps. We've already taken the first. It was to identify the explanatory variables that have statistically significant slopes. For the most part, regression is only informative about these explanatory variables. They are the only variables for which regression can estimate reliable effects.

We're also halfway through the second step, which is to interpret the magnitude of these effects. Effects that are both statistically significant and substantively large are the ones that really catch our attention. The coefficient on the categorical variable for females is an example. Not only is it estimated very reliably, it indicates that women have annual earnings that are almost $18,000 less than those of otherwise similar men. In economic terms, this difference seems huge.12 The slope associated with blacks is similar. Its t-statistic is smaller, as is its magnitude. However, the t-statistic is still large enough to indicate statistical significance. The magnitude is still big enough to be shocking.

It takes a little more work to evaluate the effect of years of schooling. Its t-statistic indicates that it is reasonably precise. Its magnitude, however, is markedly smaller than that of the slopes for females and blacks. Nevertheless, this slope indicates that a worker with one more year of schooling than an otherwise similar worker will have earnings that are greater by $3,624.30 in every year that he or she is of working age. If a typical working career lasts for, perhaps, 40 years, this advantage accumulates to something quite substantial.

Another way to think of this is to calculate the earnings advantage conferred by completing an additional level of schooling. People with college
degrees have approximately four more years of schooling than those who end their formal education with high school graduation. Relative to these people, an individual with a college degree will get the $3,624.30 annual earnings premium for each of his or her four additional years of schooling. This amounts to a total annual earnings premium of $14,497.20. This premium is again quite large. It explains why so many people continue on to college after high school and why there is so much concern regarding the career prospects for those who don’t. It also presents an interesting comparison to the slopes for women and blacks. The slope for women is larger than the earnings premium for four years of schooling. This suggests that, in order for women to have the same annual earnings as otherwise similar men, they would have to have more than four additional years of schooling. Similarly, blacks would have to have nearly three additional years of schooling in order to attain the same earnings level as otherwise similar whites. The remaining explanatory variable with a statistically significant effect is age. Its slope is about one-tenth of that associated with years of schooling, so its effect is substantively much smaller. Two otherwise similar individuals would have to differ in age by about 27 years in order to have an earnings differential similar to that between an otherwise similar black and white individual of the same age.13 This raises a very interesting point. It is possible for an explanatory variable to have an effect that is statistically significant, but economically, or substantively, unimportant. In this case, regression precisely identifies an effect that is small. While this may be of moderate interest, it can’t be nearly as intriguing as an effect that is both precise and large. In other words, statistical significance is necessary in order to have interesting results, but not sufficient. Any analysis that aspires to be interesting therefore has to go beyond identifying statistical significance to consider the behavioral implications of the significant effects. If these implications all turn out to be substantively trivial, their significance will be of limited value. This second interpretive step reveals another useful insight. The slope for females in figure 1.1 is given as −$17,847. The computer program that calculated this slope added several digits to the right of the decimal point. However, the most important question that we’ve asked of this slope is whether it is substantively large or small. Our answer, just above, was that it is “huge.” Which of the digits in this slope provide this answer? It can’t be the digits to the right of the decimal point. They aren’t even presented in figure 1.1. It also can’t be the digits in the ones or tens place in the slope as it is presented there. If this slope had been −$17,807 instead of −$17,847, would we have concluded that it wasn’t huge, after all? Hardly. In
fact, this effect would have arguably looked huge regardless of what number was in the hundreds place. In other words, the substantive interpretation that we applied to this variable really depended almost entirely on the first two digits. The rest of the digits did not convey any really useful information, except for holding their places. At the same time, they didn’t do much harm. They wouldn’t, unless they’re presented in such profusion that they distract us from what’s important. Unfortunately, that happens a lot. We should be careful not to invest too much effort into either interpreting these digits in other people’s work or presenting them in our own.14 The third interpretive step is to formulate a plausible explanation of the slope magnitudes, based on our understanding of economic and social behavior. In fact, this is something we should have done already. Why did we construct the regression in figure 1.1 in the first place? Presumably, because we had reasons to believe that the explanatory variables were important influences on earnings. It’s now time to revisit those reasons. We compare them to the actual regression results. Where they are consistent, our original beliefs are confirmed and strengthened. Where they are inconsistent, we have to consider revising our original beliefs. This is the step in which we consolidate what we have learned from our regression. Without it, the first two steps aren’t of much value. We begin this step by simply asking, “Why?” For example, why does education have such a large positive effect on income? It seems reasonable to believe that people with additional education might be more adept at more sophisticated tasks. It seems reasonable to believe that more sophisticated tasks might be more profitable for employers. Therefore, it may be reasonable to expect that employers will offer higher levels of pay to workers with higher levels of education. The regression in figure 1.1 confirms these expectations. This, in itself, is valuable. Moreover, our expectations had very little to say about exactly how much employers would be willing to pay for an additional year of schooling. The slope estimates this quantity for us, in this case with a relatively high degree of precision. Of course, there was no guarantee that the regression would be so congenial. How would we have responded to a slope for years of schooling that was inconsistent with our expectations, either too low or too high?15 What would we have done if the data, representing actual experience, were inconsistent with our vision about what that experience should be like? A contradiction of this sort raises two possibilities. Either our expectations were wrong, or something was wrong with the regression that we constructed
in order to represent them. It would be our obligation to review both in order to reconcile our expectations and experience. Ordinarily, we would begin with the issue where we were least confident in our initial choices. If we were deeply committed to our expectations, we would suspect the regression. If we believed that the regression was appropriate, we would wonder first what was faulty in our expectations. In the case of years of schooling, its estimated effect probably leaves most of us in the following position: We were already fairly certain that more education would increase earnings. Our certainty is now confirmed: We have an estimate of this effect that is generally consistent with our expectations. In addition, we have something much more concrete than our own intuition to point to for support when someone else takes a contrary position. The slope for age presents a different explanatory problem. Why might we expect that earnings would change with age? We can certainly hope that workers become more productive as they learn more about their work. This would suggest that earnings should be greater for older workers. At the same time, at some point in the aging process, workers become less vigorous, both physically and mentally. This should reduce their productivity and therefore their wages. This might be an explanation for why the slope for age is relatively small in magnitude. Perhaps it combines an increase in earnings that comes from greater work experience and a reduction in earnings that comes from lower levels of activity? The first effect might be a little stronger than the second, so that the net effect is positive but modest. This is superficially plausible. However, a little more thought suggests that it is problematic. For example, is it plausible that these two effects should cancel each other to the same degree, regardless of worker age? It may seem more reasonable to expect that the effects of experience should be particularly strong when the worker has very little, at the beginning of his or her career. At the ages when most people begin to work, it’s hard to believe that vigor noticeably deteriorates from one year to the next. If so, then productivity and, therefore, earnings, should increase rapidly with age for young workers. Conversely, older workers may have little more to learn about the work that they do. At the same time, the effects of aging on physical and mental vigor may be increasingly noticeable. This implies that, on net, productivity and earnings might decline with age among older workers. Figure 1.2 illustrates these ideas. As shown in the figure, a more thorough understanding of the underlying behavior suggests that the effects of increasing age on earnings should depend on what age we’re at. We’ll learn how to incorporate this understanding into a regression analysis in chapter 13. For the moment, it suggests that we should be cautious about relying on the slope
Figure 1.2 Potential effects of age on earnings

[The figure plots the change in productivity (vertical axis, with zero marked) against age (horizontal axis); its curves are labeled "Knowledge of job," "Effort," and "Apparent effect of age."]
for age in the regression of figure 1.1. It’s estimated reliably, but it’s not clear what it represents. This difficulty is actually a symptom of a deeper issue. The confusion arises because, according to our explanation, the single explanatory variable for age is being forced to do two different jobs. The first is to serve as a rough approximation, or proxy, for work experience. The second is to serve, again as only a proxy, for effort. In other words, the two variables that matter, according to our explanation, aren’t in the regression. Why? Because they aren’t readily available. The one variable that is included in the regression, age, has the advantage that it is available. Unfortunately, it doesn’t reproduce either of the variables that we care about exactly. It struggles to do a good job of representing both simultaneously. We’ll return to this set of issues in chapters 11 and 12. What about the large negative effects of being female or black? We might be tempted to explain them by surmising that women and blacks have lower earnings than do white males because they have less education, but that would be wrong. Education is entered as an explanatory variable in its own right. This means that, as we said in section 1.6, it’s already held constant. The slope for females compares women to males who are otherwise similar, including having the same years of schooling. In the same way, the slope for blacks compares blacks to whites who are otherwise similar, again with the same years of schooling. The explanation must lie elsewhere. The most disturbing explanation is that women and blacks suffer from discrimination in the labor market. A second possibility is that women and blacks differ from white males in some other way that is important for productivity, but not included in the regression of figure 1.1. We will have to leave the exploration of both of these possibilities to some other time and context.
A third possibility, however, is that the quality of the education or work experience that women and blacks get is different from that of white males. We’ll talk about how we might address this in chapters 13 and 14.
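The magnitude comparisons in this section reduce to a few lines of arithmetic. A short sketch of those calculations, using the slopes reported in figure 1.1 (the variable names are ours):

```python
schooling_slope = 3624.3   # dollars of annual earnings per year of schooling
age_slope = 378.60
female_slope = -17847
black_slope = -10130

print(4 * schooling_slope)              # 14,497.20: the four-year college premium
print(-female_slope / schooling_slope)  # about 4.9 years of schooling to offset the female gap
print(-black_slope / schooling_slope)   # about 2.8 years to offset the black-white gap
print(-black_slope / age_slope)         # about 27 years of age for a comparable differential
```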
1.8 How Do We Evaluate the Explanation?

At this point, we know what explanatory variables seem to be important in the determination of earnings, and we have explanations for why they might be so. Can we say anything about how good these explanations are as a whole? The answer, perhaps not surprisingly, is yes. The remaining information in figure 1.1 allows us to address this question from a statistical perspective.

The most important piece of additional information in figure 1.1 is the p-value associated with the F-statistic. The F-statistic tests whether the whole ensemble of explanatory variables has a discernible collective effect on the dependent variable. In other words, the F-statistic essentially answers the question of whether the regression has anything at all useful to say about the dependent variable. Chapter 12 will explain how it does this in detail. For the moment, the answer is most clearly articulated by the p-value, rather than by the F-statistic with which it is associated. If the p-value is .05 or less, then the joint effect of all explanatory variables on the dependent variable is statistically significant.16 If the p-value is larger than .05, the ensemble of explanatory variables does not have a jointly reliable effect on the dependent variable. We'll explore the implications of this in chapter 12. It could be that a subgroup of explanatory variables really does have a reliable effect that is being obscured by the subgroup of all other explanatory variables. But it could also be that the regression just doesn't tell us anything useful.

In the case of figure 1.1, the p-value associated with the F-statistic is so small that the computer doesn't calculate a precise value. It simply tells us that the p-value is less than .0001. Further precision isn't necessary, because this information alone indicates that the true p-value is not even one five-hundredth of the threshold value of .05. This means that there can be almost no doubt that the joint effect of the collection of explanatory variables is statistically significant.

What's left in figure 1.1 are two R2 measures. The first, R2, is sometimes written as the "R-square" or "R-squared" value. It is sometimes referred to as the coefficient of determination. The R2 represents the proportion of the variation in the dependent variable that is explained by the explanatory variables. The natural interpretation is that if this proportion is larger, the explanatory
variables have a more dominant influence on the dependent variable. So bigger is generally better. The question of how big R2 should be is difficult. First, the value of R2 depends heavily on the context. For example, the R2 value in figure 1.1 is approximately .17. This implies that the eight explanatory variables explain a little less than 17% of the variation in annual earnings. This may not seem like much. However, experience shows that this is more or less typical for regressions that are comparing incomes of different individuals. Other kinds of comparisons can yield much higher R2 values, or even lower values. The second reason why the magnitude of R2 is difficult to evaluate is that it depends on how many explanatory variables the regression contains and how many individuals it is comparing. If the first is big and the second is small, R2 can seem large, even if the regression doesn’t provide a very good explanation of the dependent variable.17 The adjusted R2 is an attempt to correct for the possibility that R2 is distorted in this way. Chapter 4 gives the formula for this correction, which naturally depends on the number of explanatory variables and the number of individuals compared, and explains it in detail. The adjusted R2 is always less than R2. If it’s a lot less, this suggests that R2 is misleading because it is, in a sense, trying to identify a relatively large number of effects from a relatively small number of examples. The number of individuals involved in the regression is given in figure 1.1 as the number of observations. “Observation” is a generic term for a single example or instance of the entities or behavior under study in a regression analysis. In the case of figure 1.1, each individual represents an observation. All of the observations together constitute the sample on which the regression is based. According to figure 1.1, the regression there is based on 1,000 observations. That is, it compares the value of earnings to the values of the eight explanatory variables for 1,000 different individuals. This is big enough, and the number of explanatory variables is small enough, that the R2 should not be appreciably distorted. As figure 1.1 reports, the adjusted R2 correction doesn’t reduce R2 by much. We’ve examined R2 in some detail because it gets a lot of attention. The reason for its popularity is that it seems to be easy to interpret. However, nothing in its interpretation addresses the question of whether the ensemble of explanatory variables has any discernible collective effect on the dependent variable. For this reason, the attention that R2 gets is misplaced. It has real value, not because it answers this question directly, but because it is the essential ingredient in the answer.
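Chapter 4 gives the book's formula for the adjusted R2. One standard version of the correction, which reproduces the values reported in figures 1.1 and 1.3, shrinks the explained share according to the number of observations n and the number of estimated coefficients k (including the intercept). The sketch below is ours and assumes that form.

```python
def adjusted_r_squared(r_squared, n, k):
    # Assumed adjustment: rescale the unexplained share by (n - 1) / (n - k),
    # where k counts all estimated coefficients, including the intercept.
    return 1 - (1 - r_squared) * (n - 1) / (n - k)

# Figure 1.1: R2 of about .1652 with 1,000 observations and 9 coefficients.
print(adjusted_r_squared(0.1652, n=1000, k=9))  # about .158, as reported
```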
1.9 R2 and the F-statistic

The F-statistic, which we discussed in the last section, addresses directly the question of whether there is a discernible collective effect. That's why we'll emphasize it in preference to R2. Ironically, as chapter 12 will prove, the F-statistic is just a transformation of R2. It's R2 dressed up, so to speak, so as to be more presentable. For this reason, R2 is necessary even though it's not sufficient to serve our interests.

Moreover, there are other transformations of R2 that are also very useful. For example, suppose we were wondering whether the whole category of racial or ethnic identity has any relevance to earnings. The evidence in figure 1.1 is mixed. The slope for blacks is negative and statistically significant. The slopes for the four other racial or ethnic categories, however, are not statistically significant. Is it possible that we could responsibly discard all of the information on racial or ethnic identity?

Figure 1.3 attempts exactly this. It presents a modification of the regression in figure 1.1. All five of the variables indicating racial or ethnic identity have been omitted. Is there any way to tell whether this regression is better or worse than that of figure 1.1?

Figure 1.3 Our first regression, restricted

Earnings = −22,756 × Intercept          (−3.83)
         + 3,682.8 × Years of schooling (10.8)
         + 396.94 × Age                 (3.79)
         − 18,044 × Female              (−7.27)

R2 = .1611   Adjusted R2 = .1586
F-statistic = 63.8, p-value < .0001
Number of observations = 1,000

Note: Parentheses contain t-statistics.

There are two considerations. The regression of figure 1.3 has fewer explanatory variables. Therefore, it's simpler. If nothing else were to change, this would be a good thing. However, its R2 is lower. It has less explanatory power. If nothing else were to change, this would be a bad thing.18

The question, then, of whether the regression of figure 1.3 is better than that of figure 1.1 turns on whether the advantages of a simpler explanation outweigh the disadvantages of reduced explanatory power. This is the first question that we've asked of the data that can't be answered with a single number already present in figure 1.1 or 1.3.
Most statistical software packages will provide the necessary number if instructed to do so. Here, let's take the opportunity to introduce a formula for the first time. We'll talk about this formula at some length in chapter 12. For the moment, it's sufficient to note that it depends on the comparison between the R2 values for the regressions with and without the variables measuring racial or ethnic identity. Let's distinguish them by referring to the R2 value for the regression of figure 1.3 as the restricted R2, or $R^2_{restricted}$. This is because that regression is restricted to have fewer explanatory variables. In addition, let's designate the size of the sample in both regressions as n, the number of explanatory variables in the regression of figure 1.1 as k, and the number of explanatory variables that are omitted from the regression of figure 1.3 as j. The general form for the comparison that we need is then
$$\frac{\left(R^2 - R^2_{restricted}\right)/j}{\left(1 - R^2\right)/(n - k)}. \qquad (1.1)$$

With the relevant values from figures 1.1 and 1.3, this becomes

$$\frac{(.1652 - .1611)/5}{(1 - .1652)/(1{,}000 - 9)} = .9734.$$

The value for this comparison, .9734, is another F-statistic. It has its own p-value, in this case, .567.19 This is a lot bigger than the threshold of .05 that we discussed at the beginning of this section. Using the language there, this indicates that the ensemble of explanatory variables representing racial or ethnic identity does not have a discernible collective effect on the dependent variable. In other words, the regression of figure 1.3 provides a more effective summary of the evidence in this sample than does that of figure 1.1.
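Equation (1.1) is easy to compute directly. The sketch below is ours; it plugs in the values from figures 1.1 and 1.3 and obtains the p-value from the F distribution with j and n − k degrees of freedom.

```python
from scipy.stats import f

def restriction_f(r2_full, r2_restricted, n, k, j):
    # Equation (1.1): the F-statistic for omitting j explanatory variables.
    return ((r2_full - r2_restricted) / j) / ((1 - r2_full) / (n - k))

f_stat = restriction_f(r2_full=0.1652, r2_restricted=0.1611, n=1000, k=9, j=5)
p_value = f.sf(f_stat, 5, 991)  # survival function of F with (j, n - k) df

print(f_stat)   # about .97, matching the calculation in the text
print(p_value)  # well above the .05 threshold
```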
1.10 Have We Put This Together in a Responsible Way?

Section 1.8 describes how information contained in figure 1.1 can help evaluate the statistical performance of the regression there. In section 1.9, we ask if the regression in that figure should be modified by omitting some explanatory variables. The answer takes us beyond figure 1.1 because it requires the additional regression of figure 1.3. Here, we ask more fundamental questions about the construction of the regression in figure 1.1. These questions address whether the regression
analysis is constructed appropriately in the first place. The answers to these questions again require information that isn’t in figure 1.1. This information would typically appear in the associated text rather than in the regression presentation itself. The answers would not necessarily take the form of precise statistical or mathematical statements. Nevertheless, they are essential to establish the credibility of the statistical analysis. There are four main areas of potential concern. The first is the choice of variables, or model specification. Where did we get the explanatory variables of figure 1.1? The best answer is, from economic theory and intuition. That is, we ordinarily rely on our understanding of the economic and social forces that determine the values of our dependent variable to direct our choice of explanatory variables. In the case of figure 1.1, the dependent variable is earnings. What are the economic determinants of earnings? For the purpose of the current discussion, let’s rely on Gary Becker for the answers. At the same time, let’s recognize that we could say a lot more about the determinants of earnings if that, rather than developing the fundamentals of regression analysis, was our principal interest. The theory of human capital (Becker 1993) begins with the presumption that the market for workers is largely competitive. If people are paid more than what they’re worth, at least for an extended period, they will probably lose their jobs. If people are paid less than what they’re worth, they’ll probably find a job that pays them better. In general, then, people get paid what their labor is worth. What determines what labor is worth? The value of a person’s labor is determined by the value of what he or she produces. That value depends, in part, on the prices people are willing to pay for the product. We’re going to assume that these prices are the same for all workers, so they don’t play any direct role in our analysis. The value of a worker’s output also depends on how much he or she can produce. Productive capacity can vary across workers and needs to be recognized in our regressions. Productive capacity may depend on innate ability. But ability is difficult to measure. We’ll ignore it for the moment and return to it in chapters 10 and 14. Instead, we’ll focus on the part of productive capacity that is more readily measured, acquired skills. The obvious places in which to acquire skills are in school and at work. Consequently, we would like to distinguish between workers with different levels of acquired skills by measuring how much schooling and how much work experience each of them has. In figure 1.1, we attempt this by including variables measuring years of schooling and age. We’ve already discussed the extent to which age is satisfactory for this purpose. We’ll address the question of whether we’ve measured schooling appropriately later. For the moment, the important observation is that, conceptually, it’s clear that the regression in figure 1.1 should attempt to assess the contributions of acquired skills to earnings.
What happens to earnings if there are instances in which the labor market is not completely competitive? Here, the theory of economic discrimination (Becker 1971) suggests to us that personal characteristics, such as sex and racial or ethnic identity, might affect earnings. If, for example, employers have preferences regarding the personal characteristics of those who work for them, and if they have profits that they can spend to satisfy those preferences, then they may pay more to workers whose personal characteristics they prefer. This suggests that we should include variables measuring these personal characteristics in our regression, so that we can see if there is evidence of noncompetitive labor market behavior. In sum, then, we can find substantial motivation for the selection of the explanatory variables in figure 1.1 from the accumulated economic reasoning regarding sources of earnings. We should ordinarily seek similar inspiration, either in previous work or through our own analysis of the relevant economic issues, for regression analyses of other dependent variables. The second source of concern is the construction of each variable. While the attributes that we wish to measure may be well defined, the available measurements may not match these definitions as well as we might like. As just mentioned, we have already raised this issue in connection with our interpretation of the variable for age. Our conclusion there was that, while age was the variable we had, the variables we would ideally like to have were probably work experience and work effort. Naturally, this is an example of a more general issue. It is frequently the case that relevant explanatory variables aren’t available. We may be able to replace them, as in the case of work experience and work effort, with a somewhat plausible proxy. In these cases, as we saw in the example of the age variable in figure 1.1, interpretation can get complicated. Sometimes, even a plausible proxy won’t be available. This makes interpretation even more complicated. We’ll discuss the issue of interpretation when relevant variables are missing at length in chapter 11. The third area of concern is in the construction of the included variables. We’ve only had a hint of this so far, in the repeated references to the dependent variable as representing “annual” earnings. We might wonder whether that is the best measure of the behavior that we want to investigate. The issue is that annual earnings depend on both what an individual’s labor is worth and how much labor that individual provides. The first might be measured by something like an hourly wage, representing the value that an employer places on an hour, a single “unit,” of labor from that individual. The second might be measured by the number of annual hours that this individual works. In the regression of figure 1.1, individuals who work a lot of annual hours at relatively low wages and individuals who work fewer hours at relatively
high wages can end up with similar values for annual earnings, the dependent variable. Consequently, this regression treats them as if they are the same. For some purposes this may be fine. However, both of these variables might be interesting in their own right. For other purposes, the differences between hours worked and wages paid may be important or even primary. This implies that the purpose of the regression may have to be clarified in order to ensure that the dependent variable is appropriate. The construction of the explanatory variables merits the same kind of examination. We’ve already talked about the variable for age. What about the others? Superficially, there isn’t anything offensive about using years of schooling as a measure of skill. At the same time, a little reflection might suggest that all years of schooling are probably not equal. For example, employers probably take graduation as an especially important skill signal. Therefore, school years that culminate in graduation, such as the 12th grade, may be more valuable than school years that don’t, such as the 11th grade.20 It’s also possible that different levels of schooling have different value. For example, the difference between the productivity of an illiterate worker and a literate worker is probably very large. This suggests that the effect of literacy on earnings may also be large. Consequently, the return to primary school, where literacy is acquired, may be especially big. More generally, economic theory suggests to us that the economic value of additional years of schooling has to go down eventually as the number of years of schooling already completed goes up. If, to the contrary, each additional year of schooling was worth more than the previous year, it would be very hard to stop going. The evidence that almost everyone eventually does implies that, at some point, the return to additional education has to decline. In other words, as we may have learned in our courses on principles of economics, education is probably subject to diminishing marginal returns.21 All of this suggests that the regression in figure 1.1 might be improved if it identified different levels of schooling attainment separately. We’ll investigate this refinement in chapter 14. The remaining explanatory variables in the regression of figure 1.1 represent racial or ethnic identity. How was that done? The data on which the regression is based come from the U.S. Census conducted in 2000. The Census survey allowed individuals to identify themselves as members of many different racial groups simultaneously. In addition, they could separately designate themselves as Hispanic, from one of many geographic heritages.22 The regression of figure 1.1 identifies only five racial or ethnic identities. It assigns individuals to each of the identities that they reported for themselves.
Consequently, the regression would assign the effects of all chosen identities to an individual who reported more than one. This strategy makes the judgment that the more detailed variation in identity available in the Census data isn’t useful for the purpose of the regression in figure 1.1. This, in itself, may be incorrect. In contrast, it treats the five major nonwhite identity categories symmetrically. It doesn’t make any judgments as to which of them is most important. Moreover, this treatment implies that the five major categories are additive. That is, if an individual who chooses to identify as white and another individual who chooses to identify as Hispanic also both choose to identify as black, regression assigns them both the −$10,130 earnings discount associated with the latter identity. This might not be reasonable. For example, imagine that some of this discount is attributable to discrimination in the labor market. The first individual probably doesn’t suffer discrimination as a consequence of identifying, at least partially, as white. Therefore, to the extent that this individual also identifies as black, that individual might experience the full force of discrimination against this group. Imagine that, in contrast, Hispanics also experience labor market discrimination. It’s plausible that the second individual doesn’t suffer much additional discrimination as a consequence of also identifying as black. In this case, the assumption in the regression of figure 1.1, that all individuals who share a particular identity are treated similarly, regardless of whether they differ in other identifications, would be incorrect. This discussion demonstrates that variables that seem straightforward may often embody a great deal of concealed complexity. Alternative definitions of the same concept may yield results that differ. This doesn’t mean that one definition would be wrong and another correct. It means that results must be interpreted with constant reference to the specific definitions on which they are based. The fourth area of general concern in the evaluation of regression results is the sample design. All we know from figure 1.1 is that the sample contains observations on 1,000 individuals. Who are they? In the case of figure 1.1, they are all between the ages of 18 and 65 and not enrolled in school. In other words, they are what we conventionally call working-age adults. There are two potential problems with this. First, individuals aged 65 and older are increasingly likely to continue working. Second, just because individuals are of working age doesn’t mean that they’re actually working. In fact, 210 of the individuals in this sample don’t seem to be, because they report no earnings.23 Does it make sense to exclude older workers? Probably, because Social Security and many pension programs treat people differently once they reach the age of 65. This is not to say that these people might not be working. The point, instead, is that the circumstances under which their earnings are
determined are sufficiently different from those of younger workers that they should probably be analyzed separately. Does it make sense to include nonworking adults of working age? If a purpose of the regression in figure 1.1 is to understand why some workers are worth more to employers than are others, it may seem awkward to include those who don’t work. We don’t know what employers would be willing to pay them, at least, not exactly. At the same time, the question of why working-age adults aren’t at work is certainly interesting. Suppose it’s because they think that they’re worth more than what employers are willing to pay. If so, then the fact that they’re not working tells us at least something about what they might be offered were they to apply for work. If we leave these individuals in the sample for the regression, we have to treat them as if they earned nothing, when, if they were really working, they would certainly earn more. This would create a problem of mismeasurement. If we omit these individuals from the sample, then we ignore what they are implicitly telling us about what earnings they might expect. This would create a problem of sample selection bias. Neither of these seems like a great strategy. In fact, the best thing to do is probably to include these individuals in the analysis, but account explicitly for the difference between their labor market participation and that of individuals who are actually at work. To do this right, we need a more sophisticated technique, which we will examine in chapter 15. As with variable definitions, the point here is not that one sample design is correct and another is not. The point is rather that different sample designs support answers to different questions. The issue when reviewing results such as those in figure 1.1 is whether the chosen sample design answers the question at issue.
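Mechanically, both the variable construction and the sample design described in this section are just transformations and filters applied to the raw survey records before any regression is run. The sketch below is ours; the record layout and field names are invented, and the thresholds are only meant to illustrate the choices discussed above.

```python
# Hypothetical survey records: (age, enrolled_in_school, annual_earnings, races)
people = [
    (25, False, 32000, ["white"]),
    (70, False, 15000, ["white"]),          # past the age-65 cutoff
    (30, True, 5000, ["black"]),            # still enrolled in school
    (45, False, 0, ["white", "black"]),     # working age, but no earnings
    (62, False, 54000, ["asian"]),
]

# The figure 1.1 design: working-age adults not enrolled in school,
# including those who report zero earnings.
sample = [p for p in people if 18 <= p[0] <= 65 and not p[1]]

# The alternative the text warns against: dropping non-earners, which trades
# a mismeasurement problem for a sample selection problem.
earners_only = [p for p in sample if p[2] > 0]

# Indicator construction: a respondent who reports several identities gets a 1
# for each one, which is exactly the additive treatment the text questions.
black_indicator = [1 if "black" in p[3] else 0 for p in sample]

print(len(sample), len(earners_only), black_indicator)  # 3, 2, [0, 1, 0]
```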
1.11 Do Regressions Always Look Like This?

Not exactly. Regression presentations always have to convey the same information, but it's not always, or even most frequently, organized in the form of figure 1.1. Table 1.1 presents the same information in a format that is more common. As we can see, table 1.1 makes no attempt to represent everything as an explicit equation. However, all of the information that was in figure 1.1 is reproduced here. It's just arranged differently. The explanatory variables are listed in the first column. Each row contains all of the information for that explanatory variable. The second column contains the slope for that variable, and the third contains the associated t-statistic.
TABLE 1.1 Our first regression, revisited

Dependent variable: annual earnings

Explanatory variable                       Slope       t-statistic
Intercept                                  −19,427     −2.65
Years of schooling                         3,624.3     9.45
Age                                        378.60      3.51
Female                                     −17,847     −7.17
Black                                      −10,130     −2.02
Hispanic                                   −2,309.9    −.707
American Indian or Alaskan Native          −8,063.9    −.644
Asian                                      −4,035.6    −.968
Native Hawaiian, other Pacific Islander    −3,919.1    −.199

R2 = .165    Adjusted R2 = .158    F-statistic = 24.5
p-value < .0001    Number of observations = 1,000

3.5  (a) What is CORR(X, aX) if a > 0?
     (b) What is CORR(X, aX) if a < 0?
3.6  Prove that COV(X, a) = 0, remembering that a is a constant.
3.7  If a is a constant, what is COV(X, Y + a)? What is CORR(X, Y + a)?
3.8  To summarize all of the results so far, assume that a, b, c, and d are constants. What, then, is COV(aX + b, cY + d)? What is CORR(aX + b, cY + d)?
3.9  Calculate the sample covariance for the data in table 3.4.
3.10 Which equation in chapter 2 guarantees that the third and fifth columns in table 3.2 should sum to zero?
3.11 The discussion at the end of section 3.4 gives the sample standard deviation of education in table 3.3 as 4.6803, and the sample standard deviation of earnings in that table as $30,507. Calculate these sample standard deviations from the information in table 3.3 and verify these assertions.
3.12 Return to the data regarding income presented in table 2.3 of exercise 2.6.
     (a) Calculate the sample covariances for wage and salary income and self-employment income, for wage and salary income and interest income, and for self-employment income and interest income. Are there any more covariances that can be calculated based on the data in this table? Why or why not?
     (b) Calculate the sample variances for wage and salary income, self-employment income, and interest income.
     (c) Calculate the sample standard deviations for wage and salary income, self-employment income, and interest income.
     (d) Calculate the sample correlations for wage and salary income and self-employment income, for wage and salary income and interest income, and for self-employment income and interest income.
3.13 Return to the data regarding rent payments presented in table 2.4 of exercise 2.7.
     (a) Calculate the sample covariances for all unique pairs of variables in this table. How many are there? Why?
     (b) Calculate the sample variances for rent, electricity, gas, and water.
     (c) Calculate the sample standard deviations for rent, electricity, gas, and water.
     (d) Calculate the sample correlations for all unique pairs of variables in this table.
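For exercises like these, each statistic can be computed directly from its definition. The sketch below is ours, with invented numbers just to show the calls; it divides by n − 1, so adjust the divisor if the chapter's equation (3.7) uses n instead.

```python
from math import sqrt

def sample_cov(x, y):
    # Average product of deviations from the means, divided by n - 1
    # (an assumed convention; see the note above).
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

def sample_sd(x):
    # Standard deviation: the square root of the variance, COV(X, X).
    return sqrt(sample_cov(x, x))

def sample_corr(x, y):
    # Correlation: the covariance scaled by both standard deviations.
    return sample_cov(x, y) / (sample_sd(x) * sample_sd(y))

# Invented monthly figures, only to illustrate the calculations.
rent = [500, 650, 700, 800]
electricity = [40, 55, 60, 70]

print(sample_cov(rent, electricity), sample_corr(rent, electricity))
```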
NOTES 1. Those of us with a more extensive background in math may have encountered complicated algebraic expressions as function arguments. While this is possible, it won’t happen in this book. 2. In some ways, the construction of this sample was more arduous than that of the artificial sample in table 3.1. It wasn’t appropriate to discuss this issue in chapter 1, where we first encountered these data. The appendix to this chapter describes the process. It gives us some useful insight into the research tasks that must typically be undertaken before we can even begin to apply the econometric techniques that we study here. 3. The sample covariance has an additional property that we will prove in exercise 3.1. It is symmetric with regard to X and Y. This means that we get the same number for the covariance if we interchange X and Y in the formula for the sample covariance. In other words, COV(X, Y) is the same as COV(Y, X). 4. Equation (3.11), like equation (3.7), defines a function. The expression to the left of the equality, V(X), names the function as “V.” The parentheses in this expression indicate that this function takes only one variable as its argument, in this case, X. The expression to the right of the equality gives the actual calculation that yields the sample variance. In this expression, the parentheses serve to organize the order in which the calculations take place. 5. In this case, the function CORR is defined as the combination of other functions to the left of the equality. All of the parentheses in equation (3.13) contain function arguments. If we wanted to explicitly write the calculation that yields the sample correlation between X and Y, we would have to replace COV(X, Y) with its calculation as given in equation (3.7), and SD(X) and SD(Y) with their calculations as given in equation (3.12).
6. In addition, as exercise 3.1 will demonstrate, the sample correlation is symmetric in X and Y. 7. Exercise 3.5a invites us to demonstrate that CORR(X, aX) = 1 for any constant a > 0. 8. Exercise 3.5b invites us to demonstrate that CORR(X, aX) = −1 for any constant a < 0. 9. Exercise 3.11 invites us to calculate these standard deviations ourselves. 10. Both of the variables in table 3.6 are from the United Nations Millennium Indicators Database, accessible at http://mdgs.un.org/unsd/mdg/default.aspx. 11. The data in figure 1.1, exercise 1.6, and tables 2.3 and 2.4 are also from this source. 12. "Your Gateway to Census 2000," at http://www.census.gov/main/www/cen2000.html, provides access to all data sets of the 2000 Census of Population and Housing, as well as to discussions of their construction and guidance with respect to their use. 13. We discussed this issue in note 20 of chapter 1. The assignment for the category "One or more years of college, no degree" is only 13 in order to impose the presumption that the associate's degree, usually obtained after 14 years of schooling, is more valuable.
CHAPTER 4

FITTING A LINE
4.0 What We Need to Know When We Finish This Chapter
4.1 Introduction
4.2 Which Line Fits Best?
4.3 Minimizing the Sum of Squared Errors
4.4 Calculating the Intercept and Slope
4.5 What, Again, Do the Slope and Intercept Mean?
4.6 R2 and the Fit of This Line
4.7 Let's Run a Couple of Regressions
4.8 Another Example
4.9 Conclusion
Appendix to Chapter 4
Exercises
4.0 What We Need to Know When We Finish This Chapter

This chapter develops a simple method to measure the magnitude of the association between two variables in a sample.
The generic name for this method is regression analysis. The precise name, in the case of only two variables, is bivariate regression. It assumes that the variable X causes the variable Y. It identifies the best-fitting line as that which minimizes the sum of squared errors in the Y dimension. The quality of this fit is measured informally by the proportion of the variance in Y that is explained by the variance in X. Here are the essentials.

1. Equation (4.1), section 4.3: The regression line predicts $y_i$ as a linear function of $x_i$:
$$\hat{y}_i = a + bx_i.$$

2. Equation (4.2), section 4.3: The regression error is the difference between the actual value of $y_i$ and the value predicted by the regression line:
$$e_i = y_i - \hat{y}_i.$$

3. Equation (4.20), section 4.3: The average error for the regression line is equal to zero:
$$\bar{e} = 0.$$

4. Equation (4.28), section 4.3: The errors are uncorrelated with the explanatory variable:
$$\mathrm{CORR}(e, X) = 0.$$

5. Equation (4.35), section 4.4: The regression intercept is the difference between the average value of Y and the slope times the average value of X:
$$a = \bar{y} - b\bar{x}.$$

6. Equation (4.40), section 4.5: The slope is a function of only the observed values of $x_i$ and $y_i$ in the sample:
$$b = \frac{\sum_{i=1}^{n}(y_i - \bar{y})\,x_i}{\sum_{i=1}^{n}(x_i - \bar{x})\,x_i} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}.$$
7. Equation (4.57), section 4.6: The R2 measures the strength of the association represented by the regression line:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} e_i^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = \frac{b^2\sum_{i=1}^{n}(x_i - \bar{x})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}.$$

8. Equations (4.58) and (4.59), section 4.6: The R2 in the bivariate regression is equal to the squared correlation between X and Y and to the squared correlation between Y and its predicted values:
$$R^2 = \left(\mathrm{CORR}(X, Y)\right)^2 = \left(\mathrm{CORR}(Y, \hat{Y})\right)^2.$$
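The essentials above translate directly into a few lines of code. The following sketch is ours, not the book's, and the toy data are invented; it applies equations (4.40), (4.35), (4.1), (4.2), and (4.57) and then checks properties (4.20), (4.28), and (4.58).

```python
import numpy as np

# Invented toy sample in which x plausibly causes y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

x_bar, y_bar = x.mean(), y.mean()

# Equation (4.40): the slope, from the observed values alone.
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Equation (4.35): the intercept, from the sample means and the slope.
a = y_bar - b * x_bar

# Equations (4.1) and (4.2): predicted values and errors.
y_hat = a + b * x
e = y - y_hat

# Equation (4.57): R2 from the sum of squared errors.
r_squared = 1 - np.sum(e ** 2) / np.sum((y - y_bar) ** 2)

print(a, b, r_squared)
print(e.mean())                      # equation (4.20): approximately zero
print(np.corrcoef(e, x)[0, 1])       # equation (4.28): approximately zero
print(np.corrcoef(x, y)[0, 1] ** 2)  # equation (4.58): equals the R2
```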
4.1 Introduction It’s unquestionably nice to know the direction, if any, in which X and Y are associated. So the covariance is a handy thing to have around. It’s also useful to know the reliability of that association, for which we can thank the correlation. In many cases, however, this is not enough. For example, most of us would not be willing to enroll in another year of schooling if all we knew was that it was pretty likely to raise our subsequent earnings. Schooling, as we are well aware, is expensive. We pay tuition, we pay for textbooks, and, what is often most important, we forgo the income that we would have earned had we worked instead. So what we need to know is not whether our future incomes will increase with more education, but, rather, whether they will increase by enough to compensate us for the costs of that education. In other words, we need an estimate of the magnitude of the association, how much of a change in income we can expect for a given change in education. More generally, we’d like to estimate what change in Y, ΔY, is associated with any particular change in X, ΔX. In mathematical notation, we’re interested in the ratio of the two: ΔY . ΔX
This ratio is immediately recognizable as the slope of a line. It shows us that we can identify the magnitude of the association between X and Y by fitting a line to the data in our sample.1
Now the question becomes, which line? As an example, the points in figure 3.1 certainly don’t all fall on a single line. Many lines could be drawn through that data with at least a claim to roughly represent the general relationship. Is it obvious which one we should choose? Almost surely not. The purpose of this chapter is to make this choice. Our objective is first to define the “best-fitting line.” Second, we develop the formulas that allow us to calculate its slope and intercept, using only the information in the sample. Third, we compare the formula for the slope to those for the covariance and correlation in order to understand how the slope depends on the concepts of chapter 3. The definition that we choose for “best-fitting line” and the formulas that implement it are jointly referred to as regression analysis. In the case of this chapter, where there are only two variables, the analysis is sometimes identified more precisely as bivariate regression.
4.2 Which Line Fits Best?

Returning to figure 3.1, it's clear that, regardless of which line we choose, many points will not lie on it. The best-fitting line should, in some sense, try to get as close as possible to as many of the points as possible. Our first step is therefore to define what "close" means.

We could imagine simply choosing the single line that actually hits the most points in the existing sample. This turns out to be a bad idea. First, this line would be really inconvenient for the purposes of chapter 5. Second, this idea may sound simple, but, mathematically, it's a monster. Third, this criterion doesn't consider the extent to which this line would miss the points that don't lie on it. Figure 4.1 gives an example of a sample in which the line that satisfies this criterion looks terrible. The dashed line goes through three points. Any other line would go through only two. However, the dashed line seems to completely misrepresent the general tendency in these data. The solid line seems much truer to this tendency, even though it doesn't actually include any point in the sample!

Figure 4.1 Example of a bad line

So this idea was superficially appealing but actually stupid. However, it raises a very useful point: We ought to consider the distance between whatever line we choose and the points that don't happen to lie on it. We'll call this distance e, for reasons that will become apparent in section 4.3. Basic geometry tells us to do this by measuring the perpendicular distance from the point to the line, as in figure 4.2.

Figure 4.2 Geometric distance from a point to a line: $e = \left[(y_1 - y_0)^2 + (x_1 - x_0)^2\right]^{.5}$

This also turns out to be a bad idea. First, it is again a mathematical pain. Second, and much more important, it treats deviations from the line in the horizontal dimension, where xi is measured, and the vertical dimension,
where yi is measured, symmetrically. As we can see from the formula shown in figure 4.2, deviations in both dimensions get handled in the same way: They’re squared and added together. This may not seem like a serious problem. After all, the covariance and the correlation also treat X and Y symmetrically.2 But the mathematical symmetry in equations (3.7) and (3.13) simply mirrors a conceptual symmetry: In chapter 3, we explored the association between X and Y without regard for the source of that association. Neither X nor Y has any sort of primacy in that exploration. Here, we want to introduce an entirely new level in our understanding of this association. We’re now concerned with the notion of causality. The reason to wonder about how much Y will change if X changes is because we believe that the change in X causes the change in Y. To return to our opening example, the association between education and earnings is interesting to us because we believe, as we discussed briefly in section 1.10, that changes in our earnings occur as a consequence of additional investments in education. This is a very important step. We now recognize that associations between two variables, even if they are very reliable, can be misleading if their causal foundations are not considered. To elaborate on our example, earnings and consumption expenditures have a positive covariance and a high correlation coefficient. The association between the two is more reliable than that between earnings and education. Does this mean that we can raise our earnings by spending more? Try it.3
Figure 4.3 Vertical distance from a point to a line: $e = y_1 - y_0$
Now we realize that we are only interested in associations where we can plausibly claim that changes in X cause changes in Y. This allows us to return to some terminology to which we first became acquainted in chapter 1. X’s role as the source of Y entitles it to be referred to as the explanatory variable, because it explains part of Y; the independent variable, because it affects Y but isn’t affected by Y; the exogenous variable, because the system we are currently examining determines values for Y, while values for X are determined somewhere else; or the predetermined variable, because its value must be known in order to establish a value for Y. Y’s role as the consequence of X assigns to it the status of the dependent variable, because its value depends on that of X; or the endogenous variable, because, as we just said, its values are determined within the system we are examining. The process of fitting a line to our data is sometimes referred to as regressing Y on X. We’re interested in knowing what values of Y are likely to arise from any particular value of X. In other words, we’re going to interpret our chosen line as giving us the value of Y that we would predict for any value of X, given the information in the sample. We’re then interested in whether the actual values of Y for that value of X are different from the value that we’ve predicted. For our purposes, the interesting distance between the point and a line is not the geometrical distance in figure 4.2, but rather the vertical distance in figure 4.3. If the point doesn’t lie on our line, our line predicts its Y value incorrectly. This vertical distance therefore measures the extent of this error, which we represent as ei. The introduction of causality explains why the geometric measure of the distance between each sample point and our chosen line isn’t useful for our present purposes. Because it treats X and Y symmetrically, it essentially implies that we make errors in our values of X as well as in our values of Y. Referring to the point in figure 4.2, the geometrical distance, in effect, asserts not only that y1 is too high but also that x1 is too low. In the context of our example, that point would represent an individual with earnings that are surprisingly high for his or her level of education. The geometrical distance, in effect, would simultaneously congratulate this individual for having a high income and scold him or her for not having enough education.
We’re going to take a different position. We’re going to accept, without judgment, whatever educational decision this individual has made. This is what it means for X to be predetermined. We’re going to focus solely on how intriguing it is that this individual’s income is so much higher than our line predicts. So now we’ve agreed to measure our line’s prediction errors in the vertical, or Y, dimension only. If we choose any particular line, we can measure these errors for all of the points in our sample. The question then is, what do we do with this collection? To be more precise, how do we combine this collection of error measurements to form a single aggregate measure of how well the line fits the sample? A natural idea would be to sum the errors. This is not a good idea, however, for two reasons. The first is that these errors are signed. If we sum them, positive and negative errors will tend to cancel. Figure 4.4 demonstrates the possible consequences. The sample there consists of only two points. The solid line includes them both. It makes no errors in predicting their Y values, so the sum of its errors is zero. In contrast, the dashed line makes pretty sizable errors in predicting both points. Moreover, because this line predicts a Y value that is too high for the leftmost point and too low for the rightmost point, these errors must have opposite signs. The way this line is drawn, these errors are of equal magnitude, so they also sum to zero. In other words, the sums of errors for these two lines are identical. Nevertheless, it’s obvious that we prefer the solid to the dashed line. So the sum of errors, by itself, can’t distinguish between the line that we like and the line that we don’t.
Figure 4.4 Lines with identical sums of errors (prediction errors e1 and e2)
We might address this problem by summing the absolute values of the errors in the vertical dimension. This would certainly prevent positive and negative errors from canceling each other. However, it is again mathematically inconvenient. Moreover, it doesn’t address another problem. The second problem with simply summing the errors in the Y dimension is that this aggregation gives equal weight to all errors. This isn’t actually how we feel. We don’t care much if we choose a line that misses a point, or even lots of points, by a little. No one expects predictions to be perfect. For most purposes, predictions that are reliably close to actual outcomes will be quite useful. We should be really concerned, however, if our line misses even one point by a lot. A prediction that turns out to be badly misleading can lead to dangerous or even disastrous choices. Therefore, we want our line to be particularly careful not to make really big mistakes. In other words, we want it to be especially sensitive to large errors. The convenient way to both remove the signs on the errors and emphasize the bigger errors is to square each of them. Each of the squared errors will, of course, be positive. Moreover, small errors, those less than one in absolute value, will actually become smaller. Large errors will become much larger. If we sum the squared errors, instead of the errors themselves, we will get a total that doesn’t allow negative and positive errors to cancel and that depends heavily on the points that are farthest from the line. This total is therefore the score that we’re going to use to rate any line drawn through our sample. We’re going to choose the line that yields the minimum possible value for this total. In other words, we’re going to define the best-fitting line as the line that minimizes the sum of squared errors.4
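To make this concrete, here is a small numerical sketch (our own illustration, not from the text) on a made-up four-point sample. The raw sum of errors cannot distinguish a line that passes through every point from a badly placed line whose positive and negative errors cancel, while the sum of squared errors can.

```python
# Hypothetical data and candidate lines, chosen only to illustrate the scoring issue.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def prediction_errors(a, b):
    # vertical prediction error of the line a + b*x at each sample point
    return [y - (a + b * x) for x, y in zip(xs, ys)]

for a, b in [(0.0, 2.0),   # passes through every point
             (5.0, 0.0)]:  # flat line: large errors that cancel in the sum
    e = prediction_errors(a, b)
    print(a, b, sum(e), sum(abs(v) for v in e), sum(v * v for v in e))
# Both lines have a sum of errors of 0, but only the first has a sum of
# squared errors of 0.
```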
4.3 Minimizing the Sum of Squared Errors

Now that we've decided, conceptually, how to choose our line, we can implement this choice mathematically. Let's define the line that we're going to choose as

\[ a + bx_i, \]

where a represents its intercept and b represents its slope. If we put any value for x_i into this formula, the value that it gives back is the predicted value of Y for that value of X. We'll identify this predicted value of Y as ŷ_i.5 Therefore, the equation for our line is

\[ \hat{y}_i = a + bx_i. \tag{4.1} \]
Any of the points in our sample has a value for both X and Y, x_i and y_i. The value of Y that actually appears with that particular value of X is y_i. The value of Y that our line predicts for that particular value of X is ŷ_i. The difference between the actual value of Y and the predicted value of Y for the ith observation is the prediction error:

\[ e_i = y_i - \hat{y}_i. \tag{4.2} \]

Rearranged, equation (4.2) states that the observed value y_i is composed of two parts, the predicted value, ŷ_i, and the prediction error, e_i:

\[ y_i = \hat{y}_i + e_i. \tag{4.3} \]

If we make the substitution of equation (4.1) for ŷ_i, we find that our line partitions the observed value y_i into three components:

\[ y_i = a + bx_i + e_i. \tag{4.4} \]

The first component is the intercept, a; the second is the component that depends on the explanatory variable, bx_i; and the third is the prediction error, e_i. The prediction error is also referred to as the residual. It is the part of y_i that remains after accounting for the components that are constant and that can be attributed to x_i. We can see this explicitly by rearranging equation (4.4) to obtain

\[ e_i = y_i - a - bx_i. \tag{4.5} \]
At the end of the previous section, we agreed that we would use the sum of squared errors as the score for any line fitted to our data. We calculate this score by following three steps:

1. Calculate the prediction error for each of the n observations in our sample.
2. Square each of the errors.
3. Sum them together.

Equation (4.2) or (4.5) performs the first step. The second step calculates values of e_i² for each observation. The third step sums them to yield Σ_{i=1}^{n} e_i². The line we choose is that which yields the lowest value for this score. In other words, it's the line that has the smallest sum of squared errors. Let's be clear on what, exactly, we're choosing. Equation (4.2) tells us that the sum of squared errors can be rewritten as

\[ \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2. \tag{4.6} \]
Equation (4.5) tells us that it can also be rewritten as

\[ \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - bx_i)^2. \tag{4.7} \]
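As a concrete sketch (our own illustration) of what this score looks like in code, the sum of squared errors in equation (4.7) is simply a function of a and b once the sample is fixed. The five observations used here are the artificial education-and-earnings pairs that reappear in table 4.1.

```python
def sum_of_squared_errors(a, b, xs, ys):
    # the score in equation (4.7): different (a, b) pairs give different totals
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))

xs = [12, 14, 16, 18, 20]                       # years of schooling
ys = [20000, 53000, 35000, 80000, 62000]        # annual earnings

print(sum_of_squared_errors(0, 3000, xs, ys))
print(sum_of_squared_errors(-38800, 5550, xs, ys))  # the line chosen later in the chapter
```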
Equation (4.7) makes our meaning apparent. The sum of squared errors depends on four quantities. Two, x_i and y_i, represent values that are given to us in our data. We can use them in calculations, but we cannot change them. In contrast, the second two, a and b, are not given to us. We are free to try different values for either or both. Each time we alter the value of either, we choose a different line, because the intercept and slope define the line. Each time we choose a different line, we get different prediction errors. We therefore get different sums of squared errors. Consequently, different choices of a and b yield different sums of squared errors. In other words, the sum of squared errors is a function of the slope and intercept. The set of all possible lines consists of the set of all possible pairs of values for the slope and intercept. How many possible pairs are there? Lots. We want to choose the one pair that minimizes the sum of squared errors. There are two ways to do this. One way would be to write down each of the possible pairs of slope and intercept values, calculate the sums of squared errors for each, and compare them. The other way is to use calculus. Which is worse? We can be sure that it's the first. So we're going to use the second. We may be comfortable with calculus at this level. If we're not, we should make every effort to understand the discussion of the two derivatives that we're about to take. If these efforts fail, we should make sure, at the very least, to pick up the discussion again at equation (4.16). Calculus tells us that, if we want to minimize a function with respect to two variables, we begin by taking the derivatives of that function with respect to each. In this case, this means that we take the derivatives of the sum of squared errors with respect to the intercept a and the slope b. We'll do this with some care, in order to ensure that we understand the answers. The derivative of the sum of squared errors with respect to the intercept of the line is represented as6

\[ \frac{\partial \sum_{i=1}^{n} e_i^2}{\partial a}. \tag{4.8} \]
This is the derivative of a sum with respect to a. We know this to be equal to the sum of the derivatives of the individual terms within the original sum with respect to a:

\[ \frac{\partial \sum_{i=1}^{n} e_i^2}{\partial a} = \sum_{i=1}^{n} \frac{\partial e_i^2}{\partial a}. \tag{4.9} \]
In order to take the derivative of e_i² with respect to a, we need to express e_i in terms of a. Using equation (4.5), we get

\[ \sum_{i=1}^{n} \frac{\partial e_i^2}{\partial a} = \sum_{i=1}^{n} \frac{\partial (y_i - a - bx_i)^2}{\partial a}. \tag{4.10} \]

This reduces the problem to evaluating the derivative inside the summation to the right of the equality:7

\[ \frac{\partial (y_i - a - bx_i)^2}{\partial a} = -2(y_i - a - bx_i). \tag{4.11} \]
Finally, substituting equation (4.11) into equation (4.10), and the result into equation (4.9), we can rewrite the derivative that we began with in equation (4.8) as8

\[ \frac{\partial \sum_{i=1}^{n} e_i^2}{\partial a} = \sum_{i=1}^{n} \bigl(-2(y_i - a - bx_i)\bigr) = -2\sum_{i=1}^{n} (y_i - a - bx_i). \tag{4.12} \]
Similarly, we need the derivative of the sum of squared errors with respect to the slope b. This is slightly more difficult to obtain than the derivative with respect to the intercept a. However, we should be able to work it out more quickly because all of the steps are similar or identical to those that we've just completed. We begin by combining equations (4.9) and (4.10) and modifying them to refer to derivatives with respect to b rather than to a:

\[ \frac{\partial \sum_{i=1}^{n} e_i^2}{\partial b} = \sum_{i=1}^{n} \frac{\partial e_i^2}{\partial b} = \sum_{i=1}^{n} \frac{\partial (y_i - a - bx_i)^2}{\partial b}. \tag{4.13} \]
Following equation (4.11), we have

\[ \frac{\partial (y_i - a - bx_i)^2}{\partial b} = -2(y_i - a - bx_i)x_i. \tag{4.14} \]

If we substitute equation (4.14) into equation (4.13), we obtain the second of our two derivatives:

\[ \frac{\partial \sum_{i=1}^{n} e_i^2}{\partial b} = \sum_{i=1}^{n} \bigl(-2(y_i - a - bx_i)x_i\bigr) = -2\sum_{i=1}^{n} (y_i - a - bx_i)x_i. \tag{4.15} \]
Our objective is not simply to obtain these derivatives but to use them to minimize the sum of squared errors. Calculus tells us that, to do so, we must find the values for b and a that simultaneously equate both derivatives to zero. In other words, when we set equations (4.12) and (4.15) equal to zero, we'll have two equations in the two unknowns b and a. We'll begin with equation (4.12). Setting this equal to zero yields

\[ 0 = -2\sum_{i=1}^{n} (y_i - a - bx_i). \tag{4.16} \]

If we divide both sides of equation (4.16) by the factor −2, we obtain

\[ 0 = \sum_{i=1}^{n} (y_i - a - bx_i). \tag{4.17} \]

The result of these same steps, applied to equation (4.15), is

\[ 0 = \sum_{i=1}^{n} (y_i - a - bx_i)x_i. \tag{4.18} \]
Equations (4.17) and (4.18) are called the normal equations. The source of the name is unimportant, but it's convenient to have one because these two equations are absolutely central to what we're doing. The normal equations allow us to identify our best-fitting line and tell us a great deal about what it looks like. Let's address this second issue first. The intercept and slope of our chosen line are going to have to make both equations (4.17) and (4.18) true. Equation (4.5) reminds us that we can rewrite the first normal equation, equation (4.17), as

\[ 0 = \sum_{i=1}^{n} (y_i - a - bx_i) = \sum_{i=1}^{n} e_i. \tag{4.19} \]
The best-fitting line will therefore have to make this equation true as well. But this equation states that the sum of the prediction errors must be zero! This may be a bit of a surprise, because, in the example of figure 4.4, we demonstrated that more than one line might achieve this. As we're about to see, the second normal equation is going to take care of this problem for us. Of all the lines that might yield prediction errors that sum to zero, only one will satisfy the second normal equation. So in the end, we'll get a unique best-fitting line. Moreover, the property in equation (4.19) is an excellent property to have. If the sum of the errors is zero, then the average of the errors has to be zero as well:

\[ \bar{e} = \frac{\sum_{i=1}^{n} e_i}{n} = \frac{0}{n} = 0. \tag{4.20} \]
In equation (4.20), the ratio after the second equality replaces the sum of the errors by zero, as given in equation (4.19). The result demonstrates that the average prediction error of the best-fitting line is also zero. The appeal of this property is obvious if we consider the alternative. Suppose someone says to us, “I’ve just fitted this great line. Its average prediction error is only two!” Wouldn’t we immediately say, “You mean, on average, your line is wrong?” The answer is a subdued, “Yes.” We suggest, cautiously, “Why don’t you just lower the line by two units?” Embarrassed silence ensues. Figure 4.5 makes the same point graphically. The dashed line has a positive average prediction error. The solid line has the same slope, but an average prediction error of zero. Which looks better? What about the second normal equation, equation (4.18)? This one is a bit more complicated, but equally insightful. Again making the substitution indicated by equation (4.5), we find that the best-fitting line must satisfy

\[ 0 = \sum_{i=1}^{n} (y_i - a - bx_i)x_i = \sum_{i=1}^{n} e_i x_i. \tag{4.21} \]
The average value of X, x̄, is a constant, a fixed number for a particular sample. We can multiply both sides of equation (4.19) by this average. The term to the left of the equality in that equation will remain zero, but, by reversing equation (2.7), the equation as a whole becomes

\[ 0 = \bar{x}\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} e_i\bar{x}. \tag{4.22} \]
Figure 4.5 Line with nonzero average prediction errors
In equation (2.23), we introduced the trick of adding zero to an expression in order to reformulate it in a useful way. We return to that trick here, by subtracting the last term of equation (4.22) from the term to the right of the last equality in equation (4.21):

\[ 0 = \sum_{i=1}^{n} e_i x_i - \sum_{i=1}^{n} e_i\bar{x}. \tag{4.23} \]
Both summations in equation (4.23) have the same limits. Therefore, equation (2.14), read in reverse, tells us that we can combine them:

\[ 0 = \sum_{i=1}^{n} e_i x_i - \sum_{i=1}^{n} e_i\bar{x} = \sum_{i=1}^{n} (e_i x_i - e_i\bar{x}). \]
The terms in parentheses share a common factor. In consequence,

\[ 0 = \sum_{i=1}^{n} (e_i x_i - e_i\bar{x}) = \sum_{i=1}^{n} e_i (x_i - \bar{x}). \tag{4.24} \]
Recall that equation (4.20) establishes that the average error is zero. Accordingly, we can subtract the average error from any individual error without changing the latter's value:

\[ e_i = e_i - \bar{e}. \tag{4.25} \]
We can substitute the difference to the right of the equality in equation (4.25) for e_i into the last term of equation (4.24) to obtain

\[ 0 = \sum_{i=1}^{n} e_i (x_i - \bar{x}) = \sum_{i=1}^{n} (e_i - \bar{e})(x_i - \bar{x}). \tag{4.26} \]
Does this seem like an improvement over equation (4.21), where we started? It will in a moment. We can divide both the zero to the left of the first equality and the summation to the right of the last equality in equation (4.26) by n − 1. Zero remains to the left of the equality, which now is

\[ 0 = \frac{\sum_{i=1}^{n} (e_i - \bar{e})(x_i - \bar{x})}{n - 1}. \tag{4.27} \]
If we stop to catch our breath, we'll recognize the quantity to the right of the equality in equation (4.27) as the sample covariance between the errors and the x_i's! Equation (4.27) proves that this sample covariance is zero. This is therefore the implication of the second normal equation, equation (4.18). Moreover, this proves that the correlation between the errors and the x_i's is also zero:

\[ 0 = \mathrm{COV}(e, X) = \mathrm{CORR}(e, X). \tag{4.28} \]
Equation (4.28) states that there is no correlation between the values of X in the sample and the prediction errors from the best-fitting line. In other words, the prediction errors are not linearly related to the values of X. For our purposes, this means that the best-fitting line is as likely to miss the true value, yi, by predicting too high as by predicting too low, no matter what the value of xi. This is again a very appealing property. Consider figure 4.6. The dashed line overpredicts Y for low values of X and underpredicts Y for high values of X. In other words, ei is negative for low values of X and positive for high values of X. This means that X and e have a positive correlation. Compare this to the solid line. At low values for X, this line generates both positive and negative prediction errors. The same is true at high values for X. In other words, the correlation between X and the prediction errors for this line is zero. Which of the two lines do we like best? The corresponding dialogue is easy to imagine. Someone says to us, “I have a best-fitting line. For low values of X, it overpredicts Y. For high values of X, it underpredicts Y.” We could not avoid replying, “Why don’t you just rotate the line counterclockwise a bit?” As promised, the requirement that the errors and Xs be uncorrelated resolves the puzzle of figure 4.4. As discussed with respect to equation (4.19),
Figure 4.6 Line with a positive correlation between X and the errors
both lines in figure 4.4 satisfy the first normal equation. However, only one satisfies the second normal equation. The dashed line in figure 4.4 makes a large negative prediction error for the small value of X and a large positive prediction error for the large value of X. In other words, the correlation between the errors and the value of X is positive. Therefore, this line doesn’t satisfy equation (4.26). It violates the second normal equation. In contrast, the solid line in figure 4.4 makes no prediction errors. Therefore, its prediction errors are the same, regardless of whether X is small or large. In consequence, the correlation between its errors and the X values is zero. This is the line chosen by the second normal equation. As advertised, the two normal equations select the line in figure 4.4 with the smallest sum of squared errors. The prediction errors associated with the dashed line are both nonzero, as will be their squares and the sum of squared errors. The sum of squared errors for the solid line must be zero, because all of the individual prediction errors for this line are themselves zero. Therefore, the solid line clearly minimizes the sum of squared errors.9
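The two properties just derived are easy to verify numerically. The sketch below is our own check, not part of the text; it leans on numpy's polyfit as a stand-in for the formulas derived in the next section, fits the least-squares line to the artificial sample from chapter 3, and confirms that the residuals sum to zero and are uncorrelated with X.

```python
import numpy as np

x = np.array([12, 14, 16, 18, 20], dtype=float)
y = np.array([20000, 53000, 35000, 80000, 62000], dtype=float)

b, a = np.polyfit(x, y, 1)           # returns slope, then intercept
e = y - (a + b * x)                  # prediction errors of the best-fitting line

print(round(e.sum(), 6))             # first normal equation: essentially 0
print(round((e * x).sum(), 6))       # second normal equation: essentially 0
print(round(np.cov(e, x)[0, 1], 6))  # sample covariance of e and x: essentially 0
```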
4.4 Calculating the Intercept and Slope

We can now use the normal equations to explicitly identify the best-fitting line. To accomplish this, we need values for the intercept a and slope b. Both
appear in each of the two normal equations. The only other quantities in those equations are x_i and y_i, whose values we know from the sample. Therefore, as we observed before, a and b are the only two unknown quantities in the two equations. As we know from our training in algebra, we can generally expect a unique solution from a system such as this, where the number of unknowns equals the number of equations.10 A simple way to derive this solution is through replacement:

1. Use one of the equations to solve for one of the variables in terms of the other.
2. Insert this solution into the second.
3. Solve the second equation for the remaining variable.

We return to equation (4.17):

\[ 0 = \sum_{i=1}^{n} (y_i - a - bx_i). \]
Equation (2.14) and, even better, exercise 2.6c remind us that we can separate the sum to the right of the equality into the sums of the individual quantities within the parentheses:

\[ 0 = \sum_{i=1}^{n} (y_i - a - bx_i) = \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} a - \sum_{i=1}^{n} bx_i. \tag{4.29} \]
Equation (2.5) gives the second summation to the right of the last equality in equation (4.29) as

\[ \sum_{i=1}^{n} a = na. \tag{4.30} \]
Equation (2.7) demonstrates that the last summation can be rewritten as

\[ \sum_{i=1}^{n} bx_i = b\sum_{i=1}^{n} x_i. \tag{4.31} \]
Substituting equations (4.30) and (4.31) into equation (4.29), we obtain

\[ 0 = \sum_{i=1}^{n} y_i - na - b\sum_{i=1}^{n} x_i. \tag{4.32} \]
If we divide all quantities in equation (4.32) by n, zero remains to the left of the equality. The equation as a whole becomes

\[ 0 = \frac{\sum_{i=1}^{n} y_i}{n} - a - b\,\frac{\sum_{i=1}^{n} x_i}{n}. \]

According to the definition of the average in equation (2.8), the first term to the right of the equality is the average value for Y. The ratio in the last term is the average value for X. Therefore, the first normal equation implies that

\[ 0 = \bar{y} - a - b\bar{x}. \tag{4.33} \]
Equation (4.33) is very interesting. Rearranging, we have

\[ \bar{y} = a + b\bar{x}. \tag{4.34} \]
This provides yet another useful interpretation of the first normal equation. The average value for Y must be equal to the average value for X multiplied by the correct value for b and added to the correct value for a. In other words, equation (4.34) proves that the point (x̄, ȳ) lies exactly on the best-fitting line. The line predicts that Y should take on the value of ȳ when X takes on the value of x̄. The phrase that we ordinarily use to express this property is regression goes through the means. Equation (4.33), rearranged again, also allows us to solve conveniently for a in terms of the other quantities:

\[ a = \bar{y} - b\bar{x}. \tag{4.35} \]
This equation states that, if we knew the value of b, we could multiply it by the average value for X and subtract this product from the average value for Y. This difference would give us the value of a, the intercept for the best-fitting line. Equation (4.35) will also help us identify the value for b. If we substitute the expression it gives us for a into equation (4.18), we get

\[ 0 = \sum_{i=1}^{n} (y_i - a - bx_i)x_i = \sum_{i=1}^{n} \bigl(y_i - (\bar{y} - b\bar{x}) - bx_i\bigr)x_i. \]
Rearranging, we obtain

\[ 0 = \sum_{i=1}^{n} \bigl(y_i - (\bar{y} - b\bar{x}) - bx_i\bigr)x_i = \sum_{i=1}^{n} \bigl((y_i - \bar{y}) - (bx_i - b\bar{x})\bigr)x_i. \]
Both terms in the second of the inner parentheses in the last summation contain the constant b. It can therefore be factored out:

\[ 0 = \sum_{i=1}^{n} \bigl((y_i - \bar{y}) - (bx_i - b\bar{x})\bigr)x_i = \sum_{i=1}^{n} \bigl((y_i - \bar{y}) - b(x_i - \bar{x})\bigr)x_i. \tag{4.36} \]
For the next step, refer to equation (2.40). Replace the quantity x_i in that equation by the term y_i − ȳ, the quantity y_i in that equation by the term b(x_i − x̄), and, somewhat confusingly, the quantity z_i in that equation by the quantity x_i. Then equation (4.36) becomes

\[ 0 = \sum_{i=1}^{n} \bigl((y_i - \bar{y}) - b(x_i - \bar{x})\bigr)x_i = \sum_{i=1}^{n} (y_i - \bar{y})x_i - \sum_{i=1}^{n} b(x_i - \bar{x})x_i. \]
We’re very close now. Factor the constant b out of the last summation, as in equation (2.7): n
0=
∑ i =1
( yi − y ) xi −
n
∑ i =1
b ( xi − x ) xi =
n
∑ i =1
n
∑( x − x ) x .
( yi − y ) xi − b
i
i
i =1
Add the last term to the terms before the first equality and after the last, yielding

\[ b\sum_{i=1}^{n} (x_i - \bar{x})x_i = \sum_{i=1}^{n} (y_i - \bar{y})x_i. \]
Finally, divide both sides by the summation multiplying b, Σ_{i=1}^{n}(x_i − x̄)x_i:

\[ b = \frac{\sum_{i=1}^{n} (y_i - \bar{y})x_i}{\sum_{i=1}^{n} (x_i - \bar{x})x_i}. \tag{4.37} \]
Equation (4.37) is the crowning achievement of this chapter. The slope of the best-fitting line, b, is to the left of the equality sign. The expression to the right of the equality sign, though somewhat complicated, contains only xi, yi, and their respective averages. The sample consists of nothing but values for xi and yi. These values and the sample size are all that’s required to calculate the averages.11 Therefore, the value of b, the slope of the best-fitting line, is given by a formula that requires only our data. Returning to equation (4.35), we recall
that b and the averages for X and Y are all that we need to calculate the value of a as well. In other words, equations (4.17) and (4.18) allow us to identify the intercept and the slope of the best-fitting line, using only the information that is available to us in our sample.12 We actually make these calculations in sections 4.7 and 4.8 and in exercise 4.14.
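A minimal implementation of these two formulas, assuming the sample arrives as two equal-length lists (the function name and data layout are our own):

```python
def fit_line(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # slope from equation (4.37), intercept from equation (4.35)
    b = sum((y - y_bar) * x for x, y in zip(xs, ys)) / sum((x - x_bar) * x for x in xs)
    a = y_bar - b * x_bar
    return a, b

# The artificial sample of table 4.1 reproduces the chapter's numbers:
a, b = fit_line([12, 14, 16, 18, 20], [20000, 53000, 35000, 80000, 62000])
print(a, b)   # -38800.0 5550.0
```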
4.5 What, Again, Do the Slope and Intercept Mean?

Now that we've identified the slope and intercept of our best-fitting line, we can return to the question of what they mean. Our original purpose was to predict the change in Y that would occur as a consequence of a change in X:

\[ \frac{\Delta Y}{\Delta X}. \]
How does our best-fitting line do this? As stated in equation (4.1), our line predicts that the Y value associated with any value of X, x_i, is

\[ \hat{y}_i = a + bx_i. \]

If we increase x_i by some amount Δx, we expect the predicted value of Y to change in response. Call this change Δy. In other words, our line will predict a new value for Y:

\[ \hat{y}_i + \Delta y = a + b(x_i + \Delta x). \tag{4.38} \]

Rearranging equation (4.38), we obtain

\[ \hat{y}_i + \Delta y = a + b(x_i + \Delta x) = (a + bx_i) + b\Delta x. \]

Once again, equation (4.1) identifies the term in parentheses following the last equality as ŷ_i:

\[ \hat{y}_i + \Delta y = (a + bx_i) + b\Delta x = \hat{y}_i + b\Delta x. \tag{4.39} \]

Subtracting ŷ_i from both sides of equation (4.39) yields

\[ \Delta y = b\Delta x. \]

Finally, dividing both sides of the equality by Δx, we find that

\[ b = \frac{\Delta y}{\Delta x}. \]
In other words, b is precisely the ratio in search of which we began this chapter! It tells us how much of a change we would expect in Y in response to a change in X of one unit. This expression also reveals to us the units assigned to b. It is in terms of the units of Y per unit of X. The product bΔx is therefore in units of Y, as it must be if it is to equal Δy. This, along with equation (4.35), implies that a is also in units of Y.13 The value that equation (4.37) assigns to the change in Y caused by a one-unit change in X has a great deal of intuitive appeal. To see this, use equation (2.37) to replace the numerator of the ratio to the right of the equality there, Σ_{i=1}^{n}(y_i − ȳ)x_i, with Σ_{i=1}^{n}(y_i − ȳ)(x_i − x̄). Use equation (2.28) to replace the denominator, Σ_{i=1}^{n}(x_i − x̄)x_i, with Σ_{i=1}^{n}(x_i − x̄)². The slope coefficient b then becomes

\[ b = \frac{\sum_{i=1}^{n} (y_i - \bar{y})x_i}{\sum_{i=1}^{n} (x_i - \bar{x})x_i} = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}. \tag{4.40} \]
Here, we’ll recall a trick that we first used in conjunction with equation (2.18). We will multiply the ratio to the right of the second equality in equation (4.40) by one. The trick, once again, comes in how we choose to represent “one.” In this case, the useful representation is 1 / ( n − 1) 1 / ( n − 1)
.
The product of this ratio and the last ratio of equation (4.40) is n yi − y xi − x i=1 n −1
∑(
n
∑( y − y )( x − x ) i
b=
i
i=1
n
∑( x − x ) i
i=1
2
=
)(
n xi − x = 1 i n −1
∑(
)
2
)
.
(4.41)
In the last expression of equation (4.41), we recognize the numerator as the sample covariance between Y and X from equation (3.7), and the denominator as the sample variance of X from equation (3.11):

\[ b = \frac{\mathrm{COV}(X, Y)}{\mathrm{V}(X)}. \tag{4.42} \]
This demonstrates two remarkable things. First, as predicted in chapter 3, the covariance between X and Y has reappeared in an essential role. Second, what we learned about the direction of the association between X and Y in that chapter is preserved by our best-fitting line. The variance of X must be positive. Therefore, the sign of b comes solely from the sign of the covariance between X and Y.14 Where does the magnitude of b come from? The trick of multiplying by “one” worked so well in equation (4.41) that we'll do it again, this time in the form of SD(Y)/SD(Y). Let's also replace V(X) with the square of SD(X), as implied by equation (3.12):

\[ b = \frac{\mathrm{COV}(X, Y)}{\mathrm{V}(X)} = \frac{\mathrm{COV}(X, Y)\,\mathrm{SD}(Y)}{\mathrm{SD}(X)^2\,\mathrm{SD}(Y)}. \]
Rearranging, we find

\[ b = \frac{\mathrm{COV}(X, Y)}{\mathrm{SD}(X)\,\mathrm{SD}(Y)}\;\frac{\mathrm{SD}(Y)}{\mathrm{SD}(X)}. \tag{4.43} \]
We recognize the first ratio to the right of the equality in equation (4.43) as the sample correlation between X and Y from equation (3.13):

\[ b = \mathrm{CORR}(X, Y)\,\frac{\mathrm{SD}(Y)}{\mathrm{SD}(X)}. \tag{4.44} \]
What does this mean? Think of SD(X) as representing the “typical” variation in X and SD(Y) as representing the “typical” variation in Y. Then b attributes a fraction of the typical variation in Y to the typical variation in X. That fraction is determined by CORR(X, Y)! As examples, if CORR(X, Y) = 1, then all differences in X across the observations in our sample are matched by corresponding differences in Y. Accordingly, regression analysis characterizes all of the typical variation in Y as attributable to the typical variation in X. In this special case,

\[ b = \frac{\mathrm{SD}(Y)}{\mathrm{SD}(X)}. \]
If, instead, CORR(X, Y) = 0, then differences in X are not associated with any systematic differences in Y.15 Regression therefore attributes none of the typical variation in Y to typical variations in X. In this case, b = 0. More generally, regression attributes a larger fraction of the typical variation in Y to the typical variation in X, the more reliable is the association between X and Y. It therefore predicts a larger change in Y for any change in X, the more reliable is this association. And the measure of reliability, as we discovered in chapter 3, is the sample correlation between the two! The intercept term, a, has less to tell us. Literally, it is the predicted value of Y if xi = 0. This is usually uninteresting, because this is rarely a realistic value for X. In the example that we’re developing, how often does someone have no education at all?16 Otherwise, it’s most useful to think of a as serving a geometrical purpose. Given the value for b, a adjusts the height of the line so as to ensure that the errors sum to zero, as in equation (4.19).
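Equations (4.42) and (4.44) are two ways of writing the same number, which is easy to confirm numerically. A small check of our own, using the statistics module available in Python 3.10 and later, on the artificial sample of table 4.1:

```python
import statistics as st

x = [12, 14, 16, 18, 20]
y = [20000, 53000, 35000, 80000, 62000]

b_from_cov = st.covariance(x, y) / st.variance(x)                # equation (4.42)
b_from_corr = st.correlation(x, y) * st.stdev(y) / st.stdev(x)   # equation (4.44)

print(b_from_cov, b_from_corr)   # both equal 5550, apart from floating-point noise
```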
4.6 R² and the Fit of This Line

Now that we have our best-fitting line, it's legitimate to ask if there is any way to measure how well it actually fits. The answer, not surprisingly, is that there is. Begin by asking how we would predict Y if our sample did not contain any information about X. With only Y values available, the best guess we could make regarding any future value of Y would be the average of the existing values. The prediction error, in this case, would be the difference between the actual value of Y and this average:

\[ y_i - \bar{y}. \tag{4.45} \]
The next question to ask is, how much better would our prediction be if we had information about X? With values for X, our prediction of y would be ŷ_i, as in equation (4.1). So we are asking how the difference in equation (4.45) compares to y_i − ŷ_i. To make this comparison, we again employ the trick of adding zero to the expression in equation (4.45). In this case, we'll represent zero as ŷ_i − ŷ_i:

\[ y_i - \bar{y} = y_i - \bar{y} + \hat{y}_i - \hat{y}_i. \]

Rearranging, we obtain

\[ y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y}). \tag{4.46} \]
Equation (4.46) demonstrates that the error we make when we predict Y by its average alone can be decomposed into two parts. The first term to the right of the equality sign in equation (4.46) is the prediction error from the regression, e_i, as given in equation (4.2). The second term is the difference between the prediction from the regression, ŷ_i, and the prediction without, ȳ. The closer ŷ_i is to ȳ, the less we learn about Y from the information about X. Correspondingly, the prediction error from the regression, y_i − ŷ_i, is closer to the prediction error we would make without the regression, y_i − ȳ. The regression line provides “value added” to the extent that its predictions are superior to those that we would make in its absence, based on ȳ alone. The regression line does a better job of fitting the data to the extent that its predictions differ from the average for Y, in the second term to the right of the equality sign in equation (4.46), and to the extent that these predictions are closer to the actual value of Y, in the first term to the right of the equality sign. We can make this intuition more formal if we square the terms in equation (4.46):

\[ (y_i - \bar{y})^2 = \bigl((y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})\bigr)^2 = (y_i - \hat{y}_i)^2 + (\hat{y}_i - \bar{y})^2 + 2(y_i - \hat{y}_i)(\hat{y}_i - \bar{y}). \tag{4.47} \]
We then sum the result for all n observations:

\[ \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + 2\sum_{i=1}^{n} (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}). \tag{4.48} \]
Here, exercise 2.6c allows us to sum each of the terms to the right of the second equality in equation (4.47) separately. Equation (2.7) allows us to factor the constant two out of the last summation in equation (4.48). The last term to the right of the equality sign in equation (4.48) is a nuisance. Let's see if we can make it disappear. Begin with the difference ŷ_i − ȳ in its last parentheses. Recalling equation (4.1) for ŷ_i and equation (4.34) for ȳ, we can rewrite it as follows:

\[ \hat{y}_i - \bar{y} = (a + bx_i) - (a + b\bar{x}) = bx_i - b\bar{x} = b(x_i - \bar{x}). \tag{4.49} \]
Once again, equation (4.2) reminds us that the difference y_i − ŷ_i is simply the prediction error of the regression, e_i. With this substitution and that of equation (4.49), the last term to the right of the equality sign in equation (4.48) can be rewritten as

\[ 2\sum_{i=1}^{n} (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) = 2\sum_{i=1}^{n} e_i\,b(x_i - \bar{x}). \tag{4.50} \]
Equation (2.7) allows us to factor the slope coefficient b out of the summation, and equation (2.40) allows us to distribute the remaining product. Accordingly, equation (4.50) becomes

\[ 2\sum_{i=1}^{n} e_i\,b(x_i - \bar{x}) = 2b\left(\sum_{i=1}^{n} e_i x_i - \sum_{i=1}^{n} e_i\bar{x}\right). \tag{4.51} \]

Finally, equation (2.7) allows us to rewrite the last summation in equation (4.51) to yield

\[ 2b\left(\sum_{i=1}^{n} e_i x_i - \sum_{i=1}^{n} e_i\bar{x}\right) = 2b\left(\sum_{i=1}^{n} e_i x_i - \bar{x}\sum_{i=1}^{n} e_i\right). \tag{4.52} \]
The parentheses after the last equality in equation (4.52) contain two summations, both of which should be familiar to us. The first, Σ_{i=1}^{n} e_i x_i, is equal to zero according to the second normal equation, equation (4.18). The second, Σ_{i=1}^{n} e_i, is equal to zero according to the first normal equation, equation (4.17). In consequence, the last term to the right of the equality sign in equation (4.48) is equal to zero itself. We can now restate equation (4.48) as

\[ \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2. \]
Replacing, yet again, y_i − ŷ_i with e_i and dividing each term by n − 1, this becomes

\[ \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1} = \frac{\sum_{i=1}^{n} e_i^2}{n-1} + \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{n-1}. \tag{4.53} \]
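The decomposition in equation (4.53) just above can be checked directly on data. The sketch below is our own, again leaning on numpy.polyfit for the fitted line: the sum of squared deviations of Y equals the sum of squared errors plus the sum of squared deviations of the predictions, up to floating-point rounding.

```python
import numpy as np

x = np.array([12, 14, 16, 18, 20], dtype=float)
y = np.array([20000, 53000, 35000, 80000, 62000], dtype=float)

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x
e = y - y_hat

total = ((y - y.mean()) ** 2).sum()             # 2,178,000,000 for this sample
unexplained = (e ** 2).sum()                    #   945,900,000
explained = ((y_hat - y.mean()) ** 2).sum()     # 1,232,100,000

print(total, unexplained + explained)           # the two totals agree
```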
The fraction to the left of the equality sign is the sample variance of Y from equation (3.11). The terms to the right demonstrate that this variance can be decomposed into two parts. The first part is

\[ \frac{\sum_{i=1}^{n} e_i^2}{n-1}. \]
This term must be nonnegative, because the numerator is the sum of squares and the denominator has to be positive. Exercise 4.6 proves that it is the variance of the error terms. It is the component of the original variance of Y that remains in the prediction errors. In other words, this is the part of the variance of Y that X does not pick up, that is not explained by the regression line. As we recall from section 4.3, this line minimizes the numerator in this ratio. Therefore, it maximizes the portion of the variance in Y that is attributed to X. That portion is the second term in equation (4.53):

\[ \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{n-1} = \frac{\sum_{i=1}^{n} \bigl(b(x_i - \bar{x})\bigr)^2}{n-1} = \frac{b^2\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = b^2\,\mathrm{V}(X). \tag{4.54} \]
The first equality in equation (4.54) is a consequence of equation (4.49), the second of equation (2.7), and the third of equation (3.11). As with the first term in equation (4.53), this must be nonnegative. Exercise 4.6 will prove that it represents the sample variance of the regression predictions. It is the part of the original variance of Y that is captured, or explained, by the variance in X through the regression line. Obviously, the regression is more successful at explaining the variation in Y the larger is the second term and the smaller is the first term to the right of the equality in equation (4.53). The relative sizes of these terms can be assessed by dividing all three terms by the sample variance of Y. This yields

\[ 1 = \frac{\dfrac{\sum_{i=1}^{n} e_i^2}{n-1}}{\dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} + \frac{\dfrac{b^2\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}{\dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}. \tag{4.55} \]
The first term to the right of the equality sign in equation (4.55) is the proportion of the variance in Y that remains in the errors of the regression predictions. Our line fits better, the smaller is this proportion. The second term to the right of the equality sign in equation (4.55) is the proportion of
the variance in Y that is explained by the regression. Our line fits better, the larger is this proportion. Our measure of goodness of fit, then, should go down when the first term is greater and should go up when the second term is greater. Accordingly, we define it by first canceling the common factors of 1/(n − 1) in equation (4.55):

\[ 1 = \frac{\sum_{i=1}^{n} e_i^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} + \frac{b^2\sum_{i=1}^{n} (x_i - \bar{x})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}. \tag{4.56} \]
We then subtract the first term to the right of the equality in equation (4.56) from both sides of that equation. The result is our measure of goodness of fit, which we call “R²”:

\[ R^2 = 1 - \frac{\sum_{i=1}^{n} e_i^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \frac{b^2\sum_{i=1}^{n} (x_i - \bar{x})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}. \tag{4.57} \]
The two fractions to the right of the equality in equation (4.56) must both be positive. Therefore, they must also each be no greater than one. Consequently, R2 itself can vary only between zero and one: 0 ≤ R 2 ≤ 1.
If R2 were equal to zero, the regression prediction for each observation would be the same as the prediction that would be made in the absence of information about X, yˆi = y . In this case, the exercise of trying to improve our predictions of Y by considering the information available to us in X wouldn’t tell us anything we didn’t already know. The regression calculations would be pretty useless. If R2 = 1, ei = 0 for each observation in the sample. This would mean that all of the sample points lie exactly on our regression line. The variation in Y would be completely explained by the variation in X. This would be very impressive. In fact, if it occurs, it’s probably too impressive. It probably means that X and Y are measuring the same thing. For example, an easy way to get R2 = 1 is to regress income in pennies, as the Y variable, on income in dollars, as the X variable. X and Y would appear to be different numbers, but, obviously, they are just different names for the same quantities. Why bother? We already know the result. Our R2 will be one, and our b will be .01. What will our a be?
In other words, a situation in which X predicts Y perfectly is probably so trivial that it is apparent long before we run the regression. Most of the problems that we care about are sufficiently complicated that the outcome Y is never going to be perfectly explained by any single determinant X.17

Ordinarily, we encounter intermediate values for R². Higher values for R² indicate instances where X explains a greater proportion of the variance in Y. In chapter 3, we made an analogous statement: Higher values for the sample correlation coefficient indicate that the relationship between X and Y is more reliable. The similarity between these two interpretations is not accidental. For the regression line of equation (4.1), R² and CORR(X, Y) are intimately related. In fact, exercise 4.10 will prove that

\[ R^2 = \bigl(\mathrm{CORR}(X, Y)\bigr)^2. \tag{4.58} \]
Here, R² is just the square of the correlation coefficient! In other words, the measure that we derived in chapter 3 for the reliability of the association between X and Y also measures, when squared, the proportion of the variance in Y that is explained by the variance in X in a bivariate regression. This phrasing is careful because equation (4.58) is valid only through chapter 10. It is no longer true in chapter 11, where we consider situations in which there are two or more explanatory variables. A simple explanation for this is that the correlation is a characteristic of the relationship between only two variables. When there are two explanatory variables, each has its own correlation with the dependent variable. The R² can't just be the square of one of them. It has to incorporate both. Nevertheless, exercise 4.11 demonstrates that R² has another interpretation that is valid both for the bivariate regression of this and the next seven chapters and for the multivariate regression of the chapters that follow:

\[ R^2 = \bigl(\mathrm{CORR}(Y, \hat{Y})\bigr)^2. \tag{4.59} \]
In other words, R2 is the square of the sample correlation between the actual and the predicted values of Y. It indicates the extent to which the predicted value of Y tracks its actual value. Naturally, we would like this tracking to be pretty good. From this perspective, it’s especially easy to see why high values of R2 are better than low values. Equations (4.58) and (4.59) demonstrate that R2 gives us a highly intuitive indication of how well our line fits the data in the sample. However, it is only a heuristic, informal measure. As we suggested in section 1.9, chapter 12 will demonstrate that it is closely related to a formal test of regression “success.”
In section 1.8, we mentioned that we have to be a bit cautious about high R² values when there are lots of explanatory variables. That's rarely a concern here, where we have only one. Nevertheless, if n is small enough, we might still worry about being misled. As we discussed in section 1.8, the typical response to this concern is the adjusted R². With a single explanatory variable, it is

\[ \text{adjusted } R^2 = 1 - \frac{\dfrac{\sum_{i=1}^{n} e_i^2}{n-2}}{\dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} = 1 - \frac{\sum_{i=1}^{n} e_i^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}\,\frac{n-1}{n-2}. \tag{4.60} \]
As we can see, the first equality in equation (4.60) is reminiscent of the first equality of equation (4.57). The differences are that, here, the numerator is divided by n − 2 and the denominator is divided by n − 1. The consequences of this correction are apparent in the restatement after the second equality of equation (4.60). The ratio of the sum of squared errors to the sum of squared deviations of y_i from its average is multiplied by the ratio (n − 1)/(n − 2), which is always greater than one. Therefore, the term that's subtracted from one to obtain the adjusted R² in equation (4.60) always exceeds the term that's subtracted from one to obtain the R² in equation (4.57):

\[ \frac{\sum_{i=1}^{n} e_i^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}\,\frac{n-1}{n-2} > \frac{\sum_{i=1}^{n} e_i^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}. \]
This guarantees that the adjusted R² is always less than the R² itself:

\[ R^2 > \text{adjusted } R^2. \tag{4.61} \]
The difference between the adjusted R2 and the R2 here will almost always be small, unless n itself is tiny. However, the correction embodied in the adjusted R2 gets bigger with more explanatory variables. We’ll see this in detail in section 12.5. There, the adjusted R2 and the R2 statistics can occasionally tell quite different stories.
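The relationships in equations (4.57) through (4.60) can all be seen at once in a short computation. This sketch is our own; it uses the five-observation artificial sample that the next section works through by hand.

```python
import numpy as np

x = np.array([12, 14, 16, 18, 20], dtype=float)
y = np.array([20000, 53000, 35000, 80000, 62000], dtype=float)
n = len(y)

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x
e = y - y_hat

sse = (e ** 2).sum()
tss = ((y - y.mean()) ** 2).sum()

r2 = 1 - sse / tss                                   # equation (4.57)
r2_corr_xy = np.corrcoef(x, y)[0, 1] ** 2            # equation (4.58)
r2_corr_y_yhat = np.corrcoef(y, y_hat)[0, 1] ** 2    # equation (4.59)
adjusted_r2 = 1 - (sse / tss) * (n - 1) / (n - 2)    # equation (4.60)

print(round(r2, 4), round(r2_corr_xy, 4),
      round(r2_corr_y_yhat, 4), round(adjusted_r2, 4))
# 0.5657 0.5657 0.5657 0.4209
```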
4.7 Let’s Run a Couple of Regressions At this point, we should see how our formulas actually work. Table 4.1 reproduces the five observations in the artificial sample that we first saw in table 3.1. It also presents all of the calculations that we need to make for each observation in order to form the required summations. Regardless of how this table looks, we have already performed all of these calculations, with the exception of the last column, in table 3.2. We calculated the last column in table 3.5. In other words, all of the information that we needed to calculate b and a was already available to us at the end of chapter 3. The required summations appear in the bottom row of the table. According to these calculations, n
∑( y − y ) ( x − x ) = 222, 000 i
i
i =1
and n
∑( x − x ) i
2
= 40.
i =1
Therefore, recalling equation (4.40), we obtain n
∑( y − y ) ( x − x ) i
b=
i
i =1
n
∑( x − x )
2
=
222, 000 = 5, 550. 40
i
i =1
TABLE 4.1 Regression calculation for the sample of table 3.1

Observation    yi         yi − ȳ      xi    xi − x̄    (yi − ȳ)(xi − x̄)    (xi − x̄)²
1               20,000    −30,000     12    −4          120,000             16
2               53,000      3,000     14    −2           −6,000              4
3               35,000    −15,000     16     0                0              0
4               80,000     30,000     18     2           60,000              4
5               62,000     12,000     20     4           48,000             16
Sum            250,000          0     80     0          222,000             40
Average         50,000          0     16     0
TABLE 4.2 Predicted values and prediction errors for the regression in table 4.1

Observation    bxi        ŷi = a + bxi    ei = yi − ŷi    ei²
1               66,600         27,800          −7,800      60,840,000
2               77,700         38,900          14,100     198,810,000
3               88,800         50,000         −15,000     225,000,000
4               99,900         61,100          18,900     357,210,000
5              111,000         72,200          10,200     104,040,000
Sum                           250,000               0     945,900,000
This implies that, within this sample, an additional year of schooling increases income by $5,550 per year. As reported in the discussion following table 3.1, the average value for income in this sample is $50,000, and the average value for education is 16 years. According to equation (4.35), a = y − bx = 50, 000 − 5, 550 (16 ) = −38, 800.
Taken literally, this implies that individuals with zero years of schooling would have large negative incomes. This isn’t very informative, first because we don’t expect to see many people with no schooling and second because we don’t expect to see anyone with persistently negative income, regardless of his or her education. As we said at the end of section 4.5, however, the most important role of a is to ensure that our regression line has errors that average to zero. Table 4.2 calculates the predicted values of Y and the prediction errors for each observation. The predicted values are given by the regression line: yˆi = −38, 800 + 5, 550 xi ,
and the prediction errors are given by equation (4.2): ei = yi − yˆi .
Table 4.2 demonstrates that the sum of the predicted values equals the sum of the original Y values. Correspondingly, the sum of the prediction errors is zero. This implies, of course, that the average prediction error is zero.18 Table 4.2 also calculates and sums the squared error terms. This sum is one of the components of the R2, as given in equation (4.57). The other component, ∑in=1 ( yi − y )2 , is, according to table 3.5, 2,178,000,000.
Consequently,

\[ R^2 = 1 - \frac{\sum_{i=1}^{n} e_i^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{945{,}900{,}000}{2{,}178{,}000{,}000} = 1 - .4343 = .5657. \]
In this sample, the variation in X explains approximately 56.57% of the variation in Y. As demonstrated by equation (4.58), this is just the square of .7521, the sample correlation coefficient that we calculated between X and Y following table 3.5. Now let's do it again with some real data. Also, let's exploit the formulas that we have already developed to calculate b, a, and R² without returning to the information for each of the individual observations. Instead, we'll rely on the statistics that we have already calculated. Return to the sample introduced in table 3.3. Recall that the discussion following that table gave the sample covariance between education and income as 88,684. The discussion at the end of section 3.4 gave the sample standard deviation of education as 4.6803, implying a sample variance of 21.905. Therefore, according to equation (4.42),

\[ b = \frac{\mathrm{COV}(X, Y)}{\mathrm{V}(X)} = \frac{88{,}684}{21.905} = 4{,}048.5. \]
In other words, in this sample, annual income increases by $4,048.50 for each additional year of schooling. Doesn’t sound bad at all. Moreover, it’s going to seem much bigger when we recall that we collect the increase in annual income resulting from an additional year of schooling throughout our working lives. In other words, the true value of the additional year of schooling is not the increase in a single year’s income, but rather the net present value of the stream of increases in annual income for all subsequent working years. We might reconsider going to graduate school, especially if we’re young. As given in the discussion after table 3.3, the average values for Y and X are $28,415 and 11.7, respectively. Equation (4.35) then gives the intercept as a = y − bx = 28, 415 − 4, 048.5 (11.7) = −18, 952.
Once again, the literal interpretation would be that individuals with no schooling have to pay nearly $19,000 every year just to hang around.19
A more useful interpretation would combine the information in the intercept with that in the slope. The former is a little less than five times as large as the latter in absolute value. This means that, if we set xi equal to five years of schooling, yˆi , as given by equation (4.1), is just barely positive. The implication is that, in this sample, at least five years of schooling are necessary in order to have any value in the labor market at all. Finally, we can also calculate the R2 value with what we already know. The sample correlation between education and earnings in this sample is given at the end of section 3.4 as .6211. The square of this value indicates that the R2 for this regression is .3858. The variance in education explains approximately 38.58% of the variance in income within this sample.
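The second example in this section works entirely from summary statistics, so the whole computation fits in a few lines. The sketch below is our own; it plugs in the values quoted in the text, and small rounding differences from the text's figures are expected.

```python
cov_xy = 88_684      # sample covariance of education and income (table 3.3 discussion)
var_x = 21.905       # sample variance of education (4.6803 squared)
x_bar, y_bar = 11.7, 28_415
corr_xy = 0.6211

b = cov_xy / var_x          # equation (4.42): about 4,048.5
a = y_bar - b * x_bar       # equation (4.35): about -18,950 (the text rounds to -18,952)
r2 = corr_xy ** 2           # equation (4.58): about 0.3858

print(b, a, r2)
```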
4.8 Another Example Ever wonder why our apartments cost so much? We introduced this question without a lot of discussion in exercise 1.6. One of the reasons that we offered there was its size. We can return to this idea in the context of this chapter with a small sample from the 2000 Census. Table 4.3 contains a sample of 20 rental dwelling units in California. The second column reports the number of rooms. This is a pretty good measure of dwelling unit size, for two reasons. First, it’s actually a measure that people think about when they decide what kind of dwelling they need. Second, it’s easy to count accurately. The natural alternative, square feet of area, is a lot harder to collect. The Census doesn’t. This sample contains only one studio, observation 1. It also contains only one dwelling unit that might be thought of as a mansion, or at least really big, observation 20. Most of the dwelling units are of a familiar size for an apartment, either three or four rooms. The average number of rooms per dwelling unit in this sample is 3.55. The fifth column of table 4.3 reports rent, the dependent variable in which we are interested. The studio in observation 1 is pretty cheap, at $200 per month. Observation 5 is nearly as inexpensive, even though it has two rooms. Makes one wonder what might be wrong with it. In contrast, the dwelling units in observations 18 and 19 both cost $1,300 per month. The average rent in this sample is $657.50. As in our previous samples, casual inspection suggests that most observations satisfy equation (3.2): Apartment size and rent appear to be positively associated. Once again, however, there are observations that deviate from this pattern. As examples, observations 2 and 6 have only two rooms but relatively
TABLE 4.3 Rooms and rent

Observation    Rooms, xi    xi − x̄    (xi − x̄)xi    Rent, yi    yi − ȳ     (yi − ȳ)xi
1              1            −2.55      −2.55            200       −457.5      −457.5
2              2            −1.55      −3.10            890        232.5       465.0
3              2            −1.55      −3.10            540       −117.5      −235.0
4              2            −1.55      −3.10            420       −237.5      −475.0
5              2            −1.55      −3.10            230       −427.5      −855.0
6              2            −1.55      −3.10            700         42.5        85.0
7              3            −0.55      −1.65            400       −257.5      −772.5
8              3            −0.55      −1.65            350       −307.5      −922.5
9              3            −0.55      −1.65            680         22.5        67.5
10             3            −0.55      −1.65            680         22.5        67.5
11             4             0.45       1.80            880        222.5       890.0
12             4             0.45       1.80            680         22.5        90.0
13             4             0.45       1.80            450       −207.5      −830.0
14             4             0.45       1.80            920        262.5     1,050.0
15             4             0.45       1.80            410       −247.5      −990.0
16             4             0.45       1.80            450       −207.5      −830.0
17             4             0.45       1.80            840        182.5       730.0
18             5             1.45       7.25          1,300        642.5     3,212.5
19             6             2.45      14.70          1,300        642.5     3,855.0
20             9             5.45      49.05            830        172.5     1,552.5
Sum            71            0          58.95         13,150         0       5,697.5
Average        3.55          0                           657.5       0
high rents. Similarly, observations 13, 15, and 16 are relatively large but have low rents. These five observations, at least, satisfy equation (3.3). This time, we’ll fit a line to these data using equation (4.37) to calculate the slope. The fourth and seventh items in the second-to-last row of table 4.3 give us the required summations. This equation then yields 96.65 as the value for b. According to equation (4.35), this value, along with the averages in the last row of the table, imply that a is 314.4. Together, they predict yi in this sample as yˆi = 314.4 + 96.65 xi .
Taken literally, the constant means that, in this sample, we have to pay more than $300 just to think about getting an apartment. More important, the slope gives the implied monthly price of an additional room as nearly $100. This is all that it would cost to provide a roommate with a room in this sample. If the roommate was going to split the cost in the intercept, it might be worthwhile to get one.
TABLE
4.4 R2 calculation, rooms and rent
Observation
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
bxi
ŷi = a + bxi
ei = yi − yˆi
96.65 193.30 193.30 193.30 193.30 193.30 289.95 289.95 289.95 289.95 386.60 386.60 386.60 386.60 386.60 386.60 386.60 483.25 579.90 869.85
411.04 507.69 507.69 507.69 507.69 507.69 604.34 604.34 604.34 604.34 700.99 700.99 700.99 700.99 700.99 700.99 700.99 797.64 894.29 1,184.24
−211.04 382.31 32.31 −87.69 −277.69 192.31 −204.34 −254.34 75.66 75.66 179.01 −20.99 −250.99 219.01 −290.99 −250.99 139.01 502.36 405.71 −354.24
44,539.26 146,158.67 1,043.74 7,690.06 77,113.38 36,982.00 41,755.92 64,690.19 5,724.03 5,724.03 32,043.73 440.68 62,997.17 47,964.34 84,676.56 62,997.17 19,323.12 252,363.49 164,599.17 125,486.60
209,306.25 54,056.25 13,806.25 56,406.25 182,756.25 1,806.25 66,306.25 94,556.25 506.25 506.25 49,506.25 506.25 43,056.25 68,906.25 61,256.25 43,056.25 33,306.25 412,806.25 412,806.25 29,756.25
0
1,284,313.32
1,834,975.00
Sum
ei2
(yi − y)2
Finally, let’s calculate the R2 value in the old-fashioned way of table 4.2 one more time. Table 4.4 presents the calculation of yˆi, ei, and yi − y for each observation. The last row presents the sums that equation (4.57) requires. That equation with these sums yields R2 = 1 −
1, 284, 313 = .3001. 1, 834, 975
In this sample, the number of rooms in an apartment explains just over 30% of the variation in monthly rents.20
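For completeness, the rooms-and-rent regression can be reproduced from the raw data in table 4.3. The sketch below is our own; it recovers the slope, intercept, and R² reported in the text.

```python
import numpy as np

rooms = np.array([1, 2, 2, 2, 2, 2, 3, 3, 3, 3,
                  4, 4, 4, 4, 4, 4, 4, 5, 6, 9], dtype=float)
rent = np.array([200, 890, 540, 420, 230, 700, 400, 350, 680, 680,
                 880, 680, 450, 920, 410, 450, 840, 1300, 1300, 830], dtype=float)

b = ((rent - rent.mean()) * rooms).sum() / ((rooms - rooms.mean()) * rooms).sum()
a = rent.mean() - b * rooms.mean()
e = rent - (a + b * rooms)
r2 = 1 - (e ** 2).sum() / ((rent - rent.mean()) ** 2).sum()

print(round(b, 2), round(a, 1), round(r2, 4))   # roughly 96.65, 314.4, 0.3001
```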
4.9 Conclusion

This chapter has explained how we fit a line to a sample. It has developed, intuitively, the appropriate definition of "best fitting" and then implemented it, mathematically. This is a great accomplishment. We now know how to answer questions like, "In this sample, how much does income increase when education goes up?"
The problem is that the question we really want to ask probably goes like, “Based on what we observe in this sample, how much would my income go up if I went to school for another year?” It may not seem so, but this question is radically different from the last one. Assuming that we’re not members of the sample, it requires a generalization of the results from the sample to someone, or some group, outside of it. In order to do this, we have to establish some reason why results that we derive from this particular sample apply to others. In the next chapter, we characterize the population from which we assume that our sample has been taken. We use this characterization to help us assess how well the line that we have drawn approximates the relationship that generates the population and how precisely it can predict outcomes for other population members.
Appendix to Chapter 4

This appendix expands our discussion regarding the choice of the best-fitting line. First, for those of us who might want them, it provides all the details for the derivations in equations (4.11) and (4.14). Second, it fulfills the promise of note 12 and proves that our choices of a and b in equations (4.35) and (4.37) minimize the sum of squared errors. The derivative in equation (4.11),

\[ \frac{\partial (y_i - a - bx_i)^2}{\partial a}, \]
may seem a bit complicated because it involves the square of an expression that depends on quantities in addition to a, the object of the derivative. The chain rule helps us unravel this. It tells us that this derivative is equal to the product of the derivative of the squared quantity, (y_i − a − bx_i), with respect to itself and the derivative of this quantity with respect to a:

\[ \frac{\partial (y_i - a - bx_i)^2}{\partial a} = \frac{\partial (y_i - a - bx_i)^2}{\partial (y_i - a - bx_i)}\,\frac{\partial (y_i - a - bx_i)}{\partial a}. \tag{4.62} \]
The derivative of a quantity with an exponent with respect to itself is that quantity multiplied by the value of the exponent, with a new exponent equal to one less than the original exponent. Therefore, the first derivative to the right of the equality in equation (4.62) is

\[ \frac{\partial (y_i - a - bx_i)^2}{\partial (y_i - a - bx_i)} = 2(y_i - a - bx_i)^{2-1} = 2(y_i - a - bx_i). \tag{4.63} \]
The second derivative to the right of the equality in equation (4.62) contains two terms, y_i and −bx_i, that are not functions of a. Therefore, their derivatives with respect to a are simply zero. Accordingly,

\[ \frac{\partial (y_i - a - bx_i)}{\partial a} = \frac{\partial (-a)}{\partial a}. \tag{4.64} \]
The derivative of a negative is just the negative of the derivative:

\[ \frac{\partial (-a)}{\partial a} = -\frac{\partial a}{\partial a}. \tag{4.65} \]
Finally, the derivative of a with respect to itself is simply one. Therefore,

\[ -\frac{\partial a}{\partial a} = -1. \tag{4.66} \]
We substitute equation (4.66) into equation (4.65) and the result into equation (4.64). When we then substitute the new equation (4.64) and equation (4.63) into equation (4.62), we confirm the derivative in equation (4.11). We can similarly confirm equation (4.14). Following equation (4.62), we have

\[ \frac{\partial (y_i - a - bx_i)^2}{\partial b} = \frac{\partial (y_i - a - bx_i)^2}{\partial (y_i - a - bx_i)}\,\frac{\partial (y_i - a - bx_i)}{\partial b}. \tag{4.67} \]
The first derivative to the right of the equality in equation (4.67) is again given by equation (4.63). The second derivative to the right of the equality contains two terms, y_i and −a, that are not functions of b. As in the discussion of equation (4.64), their derivatives with respect to b are zero. Accordingly,

\[ \frac{\partial (y_i - a - bx_i)}{\partial b} = \frac{\partial (-bx_i)}{\partial b}. \tag{4.68} \]
The derivative to the right of the equality in equation (4.68) is the only one that differs at all from its analogue in the derivative of the sum of squared errors with respect to a. Here, b, the object of the derivative, is multiplied by −x_i rather than by −1, as in equation (4.65). Once again, however, the object of the derivative appears linearly. Therefore, this multiplicative factor is again the derivative in question:

\[ \frac{\partial (-bx_i)}{\partial b} = -x_i. \tag{4.69} \]
If we substitute equations (4.63) and (4.69) into equation (4.67), we obtain equation (4.14).

In order to prove that our solutions for a and b minimize the sum of squared errors in equation (4.6), we must evaluate the three second derivatives of the sum of squared errors: that with respect to a, that with respect to b, and the "cross-derivative" with respect to a and b. The first of these second derivatives emerges directly out of equation (4.12):
\[
\frac{\partial \sum_{i=1}^{n} e_i^2}{\partial a} = -2 \sum_{i=1}^{n} (y_i - a - b x_i).
\]
By definition, the second derivative with respect to a is
\[
\frac{\partial^2 \sum_{i=1}^{n} e_i^2}{\partial a^2}
= \frac{\partial \bigl( \partial \sum_{i=1}^{n} e_i^2 / \partial a \bigr)}{\partial a}
= \frac{\partial}{\partial a} \left( -2 \sum_{i=1}^{n} (y_i - a - b x_i) \right). \tag{4.70}
\]
As in equation (4.69), the derivative of the product of a constant and the object of the derivative is just the constant multiplied by the derivative of the object:
\[
\frac{\partial^2 \sum_{i=1}^{n} e_i^2}{\partial a^2}
= \frac{\partial}{\partial a} \left( -2 \sum_{i=1}^{n} (y_i - a - b x_i) \right)
= -2\, \frac{\partial}{\partial a} \sum_{i=1}^{n} (y_i - a - b x_i).
\]
Following equation (4.9), we have
\[
\frac{\partial^2 \sum_{i=1}^{n} e_i^2}{\partial a^2}
= -2\, \frac{\partial}{\partial a} \sum_{i=1}^{n} (y_i - a - b x_i)
= -2 \sum_{i=1}^{n} \frac{\partial (y_i - a - b x_i)}{\partial a}.
\]
Equations (4.64), (4.65), and (4.66) give
\[
\frac{\partial (y_i - a - b x_i)}{\partial a} = -1. \tag{4.71}
\]
Therefore, substituting equation (4.71), the second derivative in equation (4.70) becomes
\[
\frac{\partial^2 \sum_{i=1}^{n} e_i^2}{\partial a^2}
= -2 \sum_{i=1}^{n} \frac{\partial (y_i - a - b x_i)}{\partial a}
= -2 \sum_{i=1}^{n} (-1) = 2n. \tag{4.72}
\]
This second derivative is positive, as is required for a minimum.21

The second derivative with respect to b must also be positive. Here, we begin with equation (4.15):
\[
\frac{\partial \sum_{i=1}^{n} e_i^2}{\partial b} = -2 \sum_{i=1}^{n} (y_i - a - b x_i) x_i.
\]
Taking the derivative and again following the steps from equations (4.70) through (4.71), we obtain
\[
\frac{\partial^2 \sum_{i=1}^{n} e_i^2}{\partial b^2}
= \frac{\partial}{\partial b} \left( -2 \sum_{i=1}^{n} (y_i - a - b x_i) x_i \right)
= -2 \sum_{i=1}^{n} \frac{\partial \bigl( (y_i - a - b x_i) x_i \bigr)}{\partial b}. \tag{4.73}
\]
The object of the derivative following the last equality sign can be rewritten as
\[
(y_i - a - b x_i) x_i = y_i x_i - a x_i - b x_i^2.
\]
Only the last of these three terms depends on b. In it, b is multiplied by $-x_i^2$, which functions as a constant with respect to the derivative. Therefore,
\[
\frac{\partial (-b x_i^2)}{\partial b} = -x_i^2. \tag{4.74}
\]
Combining equations (4.73) and (4.74) yields
\[
\frac{\partial^2 \sum_{i=1}^{n} e_i^2}{\partial b^2}
= -2 \sum_{i=1}^{n} (-x_i^2) = 2 \sum_{i=1}^{n} x_i^2. \tag{4.75}
\]
As in the case of the second derivative with respect to a in equation (4.72), this derivative is unambiguously positive, as required.22

The final second derivative is the cross-derivative, the second derivative with respect to both b and a. It also emerges from equation (4.12). Following the development from equations (4.70) through (4.71), we have
\[
\frac{\partial^2 \sum_{i=1}^{n} e_i^2}{\partial b\, \partial a}
= \frac{\partial}{\partial b} \left( -2 \sum_{i=1}^{n} (y_i - a - b x_i) \right)
= -2 \sum_{i=1}^{n} \frac{\partial (y_i - a - b x_i)}{\partial b}.
\]
Equations (4.68) and (4.69) give the derivative after the second equality as −xi. Therefore,
\[
\frac{\partial^2 \sum_{i=1}^{n} e_i^2}{\partial b\, \partial a}
= -2 \sum_{i=1}^{n} (-x_i) = 2 \sum_{i=1}^{n} x_i. \tag{4.76}
\]
Equation (4.76) demonstrates that this cross-derivative can be either positive or negative. However, that's not important. The only requirement for a minimum that involves this derivative is the following condition:
\[
\left( \frac{\partial^2 \sum_{i=1}^{n} e_i^2}{\partial a^2} \right)
\left( \frac{\partial^2 \sum_{i=1}^{n} e_i^2}{\partial b^2} \right)
- \left( \frac{\partial^2 \sum_{i=1}^{n} e_i^2}{\partial b\, \partial a} \right)^2 > 0.
\]
Substituting equations (4.72), (4.75), and (4.76), we get
\[
\bigl( 2n \bigr) \left( 2 \sum_{i=1}^{n} x_i^2 \right) - \left( 2 \sum_{i=1}^{n} x_i \right)^2
= 4 \left( n \sum_{i=1}^{n} x_i^2 - \Bigl( \sum_{i=1}^{n} x_i \Bigr)^2 \right) > 0.
\]
Exercise 2.5 proves that the term in parentheses following the equality is positive. Therefore, this final condition is satisfied. Our estimators, b and a, minimize the sum of squared errors.
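This optimality is easy to verify numerically. The sketch below is only an illustration: it uses the hypothetical data of table 4.5, computes a and b from equations (4.35) and (4.37), and confirms that perturbing either one raises the sum of squared errors.

```python
import numpy as np

# Hypothetical data of table 4.5.
x = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
y = np.array([12.0, 6.0, 8.0, 5.0, 4.0])

def sse(a, b):
    """Sum of squared errors for the line a + b*x."""
    e = y - a - b * x
    return np.sum(e ** 2)

# Slope and intercept from equations (4.37) and (4.35).
b = np.sum((y - y.mean()) * x) / np.sum((x - x.mean()) * x)
a = y.mean() - b * x.mean()

# Any small perturbation of a or b should increase the sum of squared errors.
base = sse(a, b)
for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    assert sse(a + da, b + db) > base
print(a, b, base)
```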
EXERCISES

4.1 Equation (4.12) factors the quantity −2 out of a summation. Which equation in chapter 2 justifies this?

4.2 Imagine that we want to fit a horizontal line to our data. In other words, we want to set b = 0 and then choose the value for a that minimizes the sum of squared errors in the regression yi = a + ei.
(a) Express the sum of squared errors as a function of a alone.
(b) This regression contains only one estimator, the intercept a. The derivative of the sum of squared errors with respect to this estimator is
\[
\frac{\partial \sum_{i=1}^{n} e_i^2}{\partial a} = -2 \sum_{i=1}^{n} (y_i - a).
\]
Set this derivative equal to zero and solve for the value of a that minimizes the sum of squared errors. Why does this value make intuitive sense?
(c) Replace this solution in the answer to part a and restate the sum of squared errors.

4.3 Consider a sample that consists of only two observations.
(a) Demonstrate that, according to equation (4.37),
\[
b = \frac{y_1 - y_2}{x_1 - x_2}.
\]
Why does this result make intuitive sense?
(b) Demonstrate that, according to equation (4.35),
\[
a = \frac{y_2 x_1 - y_1 x_2}{x_1 - x_2}.
\]
(c) Demonstrate that e1 = e2 = 0. What, then, is R2? What does it tell us about the fit of this regression?

4.4 The regression of Y on X is yi = a + bxi + ei, where a is the intercept, b is the slope, and ei is the prediction error. Consider the effect of changing the units of measurement for X and Y. Multiply X by the constant r and Y by the constant s. The regression of sY on rX is then
\[
(s y_i) = c + d (r x_i) + f_i,
\]
where c represents the intercept, d represents the slope, and fi represents the prediction error.
(a) Prove that d = (s/r)b.
(b) Prove that c = sa.
(c) What is the relationship between ei and fi?

4.5 Demonstrate that the average value for $\hat{y}_i$, which we'll represent as $\overline{\hat{y}_i}$, for lack of a better notation, is $\bar{y}$.
(a) Use equation (4.1) to prove that $\overline{\hat{y}_i} = a + b\bar{x}$.
(b) Recall equation (4.34) for $\bar{y}$ and compare it to our answer to part a.

4.6 Recall equation (4.53).
(a) Use the result of exercise 4.5 to demonstrate that the sample variation of the predicted value of yi, V($\hat{y}_i$), is equal to the last term in that equation,
\[
\frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{n - 1}.
\]
(b) Use equation (4.25) to replace ei in equation (4.53) with $e_i - \bar{e}$. Demonstrate that the second term in equation (4.53), $\sum_{i=1}^{n} e_i^2 / (n-1)$, is the sample variation of the error, V(ei).
(c) Conclude that equation (4.53) implies that
\[
V(y_i) = V(e_i) + V(\hat{y}_i).
\]

4.7 Imagine that $\bar{x} = 0$ and $\bar{y} = 0$.
(a) According to equation (4.37), what is b in this circumstance?
(b) According to equation (4.35), what is a in this circumstance?
(c) What does our answer to part b imply about the location of this line?
4.8 Let's derive another expression for b. Begin with equation (4.37):
\[
b = \frac{\sum_{i=1}^{n} (y_i - \bar{y}) x_i}{\sum_{i=1}^{n} (x_i - \bar{x}) x_i}.
\]
(a) Rewrite the denominator as in exercise 2.5a:
\[
\sum_{i=1}^{n} (x_i - \bar{x}) x_i = \sum_{i=1}^{n} x_i^2 - \bar{x} \sum_{i=1}^{n} x_i.
\]
Multiply the last summation by n/n and simplify.
(b) In the numerator, distribute the product within the summation sign and sum the terms separately to demonstrate that
\[
\sum_{i=1}^{n} (y_i - \bar{y}) x_i = \sum_{i=1}^{n} y_i x_i - \bar{y} \sum_{i=1}^{n} x_i.
\]
(c) Multiply the last summation by n/n and simplify.
(d) Express b as
\[
b = \frac{\sum_{i=1}^{n} y_i x_i - n \bar{y}\bar{x}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}.
\]

4.9 Let's construct an alternative derivation of R2 by reconsidering the sum of squared errors. Begin with equation (4.5):
\[
e_i = y_i - a - b x_i.
\]
(a) Replace a with its equivalent, according to equation (4.35), and group related terms.
(b) Square ei to get
\[
e_i^2 = (y_i - \bar{y})^2 + b^2 (x_i - \bar{x})^2 - 2 b (x_i - \bar{x})(y_i - \bar{y}).
\]
(c) Prove that
\[
\sum_{i=1}^{n} b (x_i - \bar{x})(y_i - \bar{y}) = b^2 \sum_{i=1}^{n} (x_i - \bar{x})^2.
\]
(d) Use this last result to prove that
\[
\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 - b^2 \sum_{i=1}^{n} (x_i - \bar{x})^2.
\]
(e) Rearrange to derive the definition of R2 in equation (4.57).

4.10 Consider the definition of R2 in equation (4.57).
(a) Multiply the numerator and denominator of the ratio to the right of the second equality of equation (4.57) by 1/(n − 1) to get
\[
R^2 = b^2\, \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2 / (n-1)}{\sum_{i=1}^{n} (y_i - \bar{y})^2 / (n-1)}.
\]
(b) Return to equation (3.11). What is the numerator of the new fraction to the right of the equality? What is the denominator? Rewrite the expression for R2 accordingly.
(c) Replace b as indicated by equation (4.42) to obtain
\[
R^2 = \left( \frac{\mathrm{COV}(X, Y)}{\mathrm{V}(X)} \right)^2 \frac{\mathrm{V}(X)}{\mathrm{V}(Y)}.
\]
(d) Simplify to obtain
\[
R^2 = \left( \frac{\mathrm{COV}(X, Y)}{\mathrm{SD}(X)\, \mathrm{SD}(Y)} \right)^2.
\]
(e) Why does this imply, as asserted in equation (4.58), that
\[
R^2 = \bigl( \mathrm{CORR}(X, Y) \bigr)^2 ?
\]
4.11 Consider again the definition of R2 in equation (4.57).
(a) Replace b with its equivalent, as given in equation (4.40), and simplify to obtain
\[
R^2 = \frac{\left( \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) \right)^2}
{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}.
\]
(b) Multiply the numerator and denominator by $b^2$. Rearrange so as to form the term $b(x_i - \bar{x})$ in both. In both, invoke equation (4.49) to replace this term with $\hat{y}_i - \bar{y}$ and obtain
\[
R^2 = \frac{\left( \sum_{i=1}^{n} (\hat{y}_i - \bar{y})(y_i - \bar{y}) \right)^2}
{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}.
\]
(c) Multiply the numerator and denominator by [1/(n − 1)]2 to obtain
\[
R^2 = \frac{\left( \sum_{i=1}^{n} (\hat{y}_i - \bar{y})(y_i - \bar{y}) / (n-1) \right)^2}
{\left( \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 / (n-1) \right) \left( \sum_{i=1}^{n} (y_i - \bar{y})^2 / (n-1) \right)}.
\]
Return to chapter 3 and identify each of the terms in parentheses.
(d) With these identifications, prove equation (4.59), which states that
\[
R^2 = \bigl( \mathrm{CORR}(Y, \hat{Y}) \bigr)^2.
\]

4.12 How can equations (4.58) and (4.59) both be true?
(a) What equation defines ŷi as a linear function of xi?
(b) Use the result of exercise 3.7 to demonstrate that the correlation between xi and a linear function of itself is equal to one.
(c) Use this fact to provide an intuitive explanation for the result that
\[
\mathrm{CORR}(X, Y) = \mathrm{CORR}(\hat{Y}, Y).
\]
(d) Finally, use this equality to demonstrate that equations (4.58) and (4.59) must both be true.

4.13 Consider the second normal equation, equation (4.18).
(a) Using the information in tables 4.1 and 4.2, verify that the solution we gave for a and b in section 4.7 satisfies the second normal equation, equation (4.18).
(b) Using the information in table 3.3, verify that the solution we gave for a and b in section 4.7 satisfies the first and second normal equations, equations (4.17) and (4.18).
(c) Explain how table 4.4 verifies the first normal equation for our solution of a and b in section 4.8. Using the information in tables 4.3 and 4.4, verify that the solution we gave for a and b in section 4.8 satisfies the second normal equation, equation (4.18).

4.14 Consider the hypothetical data given in table 4.5.
(a) Calculate a, b, and R2.
(b) Verify that the solutions satisfy the first and second normal equations, equations (4.17) and (4.18).

4.15 Just for practice, calculate the sample covariance and the sample correlation coefficient between the number of rooms per dwelling unit and rent per month in the data of table 4.3.

4.16 Use equation (4.60) to perform the following calculations.
(a) Calculate the adjusted R2 for the regression of tables 4.1 and 4.2.
(b) Calculate the adjusted R2 for the regression at the end of section 4.7 using the data of table 3.3.
(c) Calculate the adjusted R2 for the regression of table 4.4.
TABLE 4.5 Hypothetical data

Observation    xi    yi
1               1    12
2               3     6
3               5     8
4               7     5
5               9     4
NOTES
1. In fact, "slope" is the name that we gave in chapter 1 to the effect of an explanatory variable on the dependent variable.
2. We proved this in exercise 3.1.
3. Chapter 10 returns to the question of causality in much greater detail.
4. The line that minimizes the sum of squared errors has deeper statistical appeal. It yields maximum likelihood estimators under the assumption that the disturbances are drawn from a single normal population (Greene 2003, chapter 17). It also provides, under very general conditions, the best linear predictor of yi (Angrist and Pischke 2009, chapter 3). Both of these qualities require more sophistication to appreciate, so for current purposes, the justification in the text is going to be sufficient.
5. We may recall that we referred to this obliquely at the beginning of section 1.5.
6. The parentheses in this expression contain arguments to a function. Explain why this should be apparent.
7. The appendix to this chapter presents a detailed derivation of this derivative.
8. Exercise 4.1 examines the algebra in this step.
9. Figure 4.4 also makes the somewhat obvious point that there is no sense in fitting a line to fewer than two points. Moreover, there is no mystery in fitting a line to exactly two points, as two points define a line. Exercise 4.3 makes this point formally. The question we are dealing with in this chapter only gets interesting when there are at least three observations in the sample.
10. This expectation is satisfied whenever, as here, the system of equations is not degenerate. That is, the individual equations are not linearly dependent.
11. Equation (4.37) reiterates the point that regression analysis requires at least two observations. With only one observation, xi = x̄, the denominator of the expression for b is zero and the expression itself is undefined.
12. We claim that these values for a and b minimize the sum of squared errors. In principle, we should check the second derivatives of the sum of squared errors in order to ensure that we have actually found a minimum rather than a maximum or a saddle point. It's not required here, so we'll do it for the record in the appendix to this chapter.
13. Exercise 4.4 investigates how changes in the units of X and Y affect the values for a and b.
14. Equations (4.41) and (4.42) demonstrate why, as mentioned in the discussion of equation (3.7), it is so convenient that the sample covariance and sample variance both have the quantity n − 1 as their denominators.
15. At least, no systematic linear differences in Y.
16. We asserted, without much discussion, that the intercept was uninteresting in section 1.6. Now we can provide a justification. If we set the values of all eight explanatory variables in the regression of figure 1.1 equal to zero, we would be referring to a white male with no schooling, aged zero. It is impossible to think of a situation where we would have any interest in predicting the income of such an individual.
17. When we get to chapter 11, we will find that outcomes we care about are usually so complicated that even many explanatory variables together don't explain them perfectly.
18. Exercise 4.13a asks us to confirm that our results here also satisfy the second normal equation, equation (4.18).
19. In exercise 4.13b, we confirm that these results satisfy the first and second normal equations.
20. Table 4.4 verifies that these results satisfy the first normal equation. We check the second normal equation in exercise 4.13c.
21. This is not the place to derive these requirements. However, some intuition may help. Recall that, as in functions of two variables, extreme values in functions of single variables occur where the first derivative of the function with respect to the variable is zero. In functions of one variable, the second derivative reveals how the first derivative is changing as the variable increases. In this context, a positive second derivative would indicate that the derivative was increasing. This means that it was less than zero just before the extreme value and more than zero just after. Recalling that the first derivative is also the slope of the function, the slope of the function must be negative just before the extreme value and positive just after. Loosely speaking, this means that the points just before and after the extreme value must be a bit above it. That makes the extreme value itself a minimum.
22. We should make sure that we see why.
CHAPTER 5
FROM SAMPLE TO POPULATION

5.0 What We Need to Know When We Finish This Chapter
5.1 Introduction
5.2 The Population Relationship
5.3 The Statistical Characteristics of εi
5.4 The Statistical Characteristics of yi
5.5 Parameters and Estimators
5.6 Unbiased Estimators of β and α
5.7 Let's Explain This Again
5.8 The Population Variances of b and a
5.9 The Gauss-Markov Theorem
5.10 Consistency
5.11 Conclusion
Exercises
5.0
What We Need to Know When We Finish This Chapter

This chapter discusses the reasons why we can generalize from what we observe in one sample to what we might expect in others. It defines the population, distinguishes between parameters and estimators, and discusses why we work with samples when populations are what we are interested in: The sample is an example. It demonstrates that, with the appropriate assumptions about the structure of the population, a and b, as calculated in chapter 4, are best linear unbiased (BLU) estimators of the corresponding parameters in the population relationship. Under these population assumptions, regression is frequently known as ordinary least squares (OLS) regression. Here are the essentials.

1. Section 5.1: We work with samples, even though populations are what we are interested in, because samples are available to us and populations are not. Our intent is to generalize from the sample that we see to the population that we don't.

2. Equation (5.1), section 5.2: The population relationship between xi and yi is
\[
y_i = \alpha + \beta x_i + \varepsilon_i.
\]

3. Equation (5.5), section 5.3: Our first assumption about the disturbances is that their expected values are zero:
\[
E(\varepsilon_i) = 0.
\]

4. Equation (5.16), section 5.4: The deterministic part of yi is
\[
E(y_i) = \alpha + \beta x_i.
\]

5. Equation (5.6), section 5.3, and equation (5.20), section 5.4: Our second assumption about the disturbances is that their variances are the same. Their variances are equal to those of the dependent variable:
\[
V(\varepsilon_i) = V(y_i) = \sigma^2.
\]

6. Equation (5.11), section 5.3, and equation (5.22), section 5.4: Our third assumption about the disturbances is that they are uncorrelated in the population. This implies that the values of the dependent variable are uncorrelated in the population:
\[
\mathrm{COV}(\varepsilon_i, \varepsilon_j) = \mathrm{COV}(y_i, y_j) = 0.
\]

7. Equation (5.37), section 5.6: Our slope, b, is an unbiased estimator of the population coefficient, β:
\[
E(b) = \beta.
\]

8. Equation (5.42), section 5.6: Our intercept, a, is an unbiased estimator of the population constant, α:
\[
E(a) = \alpha.
\]

9. Section 5.7: A simulation is an analysis based on artificial data that we construct from parameter values that we choose. (A small sketch of this kind, illustrating items 2 through 6, follows this list.)

10. Equation (5.50), section 5.8: The variance of b is
\[
V(b) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.
\]

11. Equation (5.51), section 5.8: The variance of a is
\[
V(a) = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right).
\]

12. Equation (5.60), section 5.9: According to the Gauss-Markov theorem, no linear unbiased estimator of β is more precise than b. In other words, the variance of any other linear unbiased estimator of β, d, is at least as large as that of b:
\[
V(d) \geq V(b).
\]

13. Section 5.10: Consistent estimators get better as the amount of available information increases. With enough information, consistent estimators are perfect.
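The following sketch illustrates items 2 through 6 with artificial data. It is only an illustration under assumed parameter values of our own choosing (α = 1, β = 0.5, and σ = 2 are arbitrary, and the normal distribution is just one convenient way to generate disturbances with the required properties); the formal development occupies the rest of the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameter values we choose for the artificial population (arbitrary).
alpha, beta, sigma = 1.0, 0.5, 2.0
n = 10_000

x = rng.uniform(0, 20, size=n)                   # predetermined explanatory variable
eps = rng.normal(loc=0.0, scale=sigma, size=n)   # E(eps)=0, V(eps)=sigma^2, uncorrelated
y = alpha + beta * x + eps                       # the population relationship, equation (5.1)

# Sample analogues of the population assumptions.
print(eps.mean())                              # close to 0, as in equation (5.5)
print(eps.var(ddof=1))                         # close to sigma**2, as in equation (5.6)
print(np.corrcoef(eps[:-1], eps[1:])[0, 1])    # close to 0: disturbances unrelated to one another
```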
5.1
Introduction Chapter 4 was entirely empirical, that is, concerned with calculations using observed data. In it, we learned how to fit a line to a sample of data on an explanatory variable, xi, and a variable that we believe it affects, yi. The slope of this line represents the typical change in yi that occurs as a consequence of a change in xi within this sample.
The problem is that we almost surely don’t care about this sample. First, there isn’t any mystery with regard to what happened to its members, or observations. Each of them already has an observed value for yi. This is exactly what happened to them. So what would be the point of predicting what would happen to them with the value of yˆi from the estimated line? Second, in many applications, we don’t even know who is in the sample. Most of the data sets that are of interest to economists, and to social scientists in general, are anonymous. That is, in most cases, people require that their identities be protected before they reveal information about themselves. This is true about all of the information collected and disseminated by, for example, the U.S. Census Bureau. Third, even if we can identify the individuals in the sample, it’s unlikely that the individuals about whom we are most interested are among them. To continue with the example of education and income, we are most likely interested in how our own educations would affect our future income. But it’s impossible for us to be among the sample. Our educations aren’t over, so our ultimate values for xi aren’t known. We’re probably not working, at least not full time, so our values of yi aren’t known, either. The sample is really only useful to us if we can take what we observe in it as representative of what would happen in other circumstances that are similar and of greater relevance. In other words, our ultimate objective must be to generalize from what we observe in the sample to what we would expect in other cases. We want to use the sample as an example. Then the question must be, an example of what? In other words, what is the larger realm of experience from which our particular example has been drawn? The answer to this question, in statistical terms, is the population. In the example of education and earnings, the population includes anyone currently working whose earnings might depend on his or her education in the same way that earnings depend on education for the members of our sample. It also includes anyone whose earnings have ever depended on education in the same way. Finally, it includes anyone who has not yet worked, but who expects that, when he or she does, his or her earnings will depend on education in the same way. More generally, the population consists of all instances of the relationship under examination that might be of interest. The population includes the members of the sample whose behavior we are exploring, as well as every other instance of this behavior that has occurred in the past and every instance that might occur in the future. What defines the population, and unites all of its members, is what is common in their behavior or experiences. If our purpose is to understand what is common in the relationship between yi and xi in the population, this raises the question of why we don’t attempt to examine the population directly. The answer is that, for any interesting
problem, the population is not available to us. The number of times that this relationship has already played itself out is almost surely huge. We have neither the time nor the money necessary to identify xi and yi in each of these instances. That part of the population that consists of instances of this behavior in the future is, of course, unavailable because it hasn’t even occurred yet. So samples are what we work with, even though populations are what we are interested in. Our task is, first, to specify how we think the relevant population is formed. This task is theoretical, that is, concerned with quantities that we can’t observe. Then we investigate the relationship between what we think the population looks like and what we have seen in our sample.1
5.2
The Population Relationship To begin with, we have to believe that there is a relationship between xi and yi in the population. If we didn’t believe this, then why would we waste our time trying to discern the expression of this relationship in our sample? Moreover, we have to believe that this relationship is permanent, or at least stable. If not, why would we believe that the manifestation of this relationship in our particular sample had anything to do with what we would observe in any other sample or for any other individual? In principle, then, we begin with the belief that the relationship between xi and yi in the population is determined by the fundamental mechanisms embedded in the economy (or society or nature, depending on the issue under examination). Because these mechanisms are stable, we assume that they are governed by fixed numerical values. We call these fixed values parameters. Parameters are constants that direct the evolution of behavior, whether in the economic, social, or natural realm. The parameter values that govern the relationship of interest to us are never immediately apparent, because the economy, society, and nature are complex. Our hope is that we can discover something about them through the examination of our sample.2 With these beliefs, we can specify what we think the form of this relationship must be. First, we allow for the possibility that, even in the absence of xi, individuals may have some value of yi. This part of yi is the constant α. Next, we identify the role that xi plays in the formation of yi. First, we return to the attributes of xi that we discussed in section 4.2. There, we agreed that we would think of xi as independent, exogenous, or predetermined. Practically speaking, this means that the process of determining values for yi requires values for xi but doesn’t alter the values for xi. In other words, the question we are trying to answer is, what determines values for yi? In some other context, we might be tempted to ask, what
determines values for xi? In the present context, however, we disregard this latter question. We assume that the values we observe for xi were established elsewhere. More important, we assume that the process by which values for xi were established was not affected by the possibility that these values might later be used to determine values of yi.3 For the purposes of this text, we will usually assume that the contribution of xi to yi is linear: It consists of the product of xi and a coefficient, β. In other words, the economic or social processes that generate values for yi operate so as to multiply xi by β and then add the product to α. So far, this specification of the population relationship implies that all members of the population with the same value for xi share a common contribution to their values of yi, α + βxi. We will often refer to this contribution as the deterministic component of yi. This means that, once we have a value for xi, this component is fixed and immutable. It must be obvious that we can’t stop here. If the deterministic component of yi were all there were to it, then everyone with the same value of xi would have the same value of yi. In this case, the relationship between the two would be immediately apparent. To demonstrate, observations on any two population members with different values for the independent variable would be sufficient to identify the true values of α and β without any ambiguity. If the observed values of the dependent and independent variables for member i were yi and xi, and the values for member j were yj and xj, then yi = α + βxi and yj = α + βxj. This would imply that
\[
y_i - y_j = (\alpha + \beta x_i) - (\alpha + \beta x_j) = \beta (x_i - x_j)
\]
or
\[
\beta = \frac{y_i - y_j}{x_i - x_j}.
\]
Similarly,
\[
\alpha = y_i - \beta x_i
\]
or
\[
\alpha = y_i - \frac{y_i - y_j}{x_i - x_j}\, x_i.
\]
These would be the actual values of the constant and the coefficient.
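To see the arithmetic concretely, the fragment below solves for α and β from two hypothetical population members; the particular numbers are invented for illustration, and the two formulas come directly from the expressions above.

```python
# Two hypothetical population members (illustrative values only).
x_i, y_i = 12.0, 7.0   # member i
x_j, y_j = 16.0, 9.0   # member j

# With no disturbances, two members pin down the parameters exactly.
beta = (y_i - y_j) / (x_i - x_j)
alpha = y_i - beta * x_i
print(beta, alpha)   # 0.5 and 1.0
```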
In this situation, it would be easy to ascertain the true population relationship. However, it is easy only because it is trivial. It must be obvious that, in any problem that really matters to us, observations with the same values of xi are associated with different values of yi. With respect to our running example, we all know of different individuals with the same levels of schooling but different salaries.4 We therefore have to allow for this possibility in our conception of the population relationship. We accomplish this by designating a component of yi that does not depend on xi and is unique to observation i. We identify this component as εi. This component contains all elements of yi that do not depend on xi. We refer to εi as the disturbance that the economy assigns to the ith observation. We assume, for the purposes of this book, that the disturbance is added to the deterministic component of yi to give yi its final value: yi = α + β xi + ε i .
(5.1)
In other words, yi is a linear function of εi as well as of xi. This disturbance may sometimes seem like a nuisance. However, εi is the component of yi that gives each observation its distinctive identity. It rescues us from the lethal monotony that would characterize our experience if all observations with the same value for xi had exactly the same value for yi. For this, we should be grateful. We will scrupulously avoid referring to εi as an “error.” It’s pejorative. It’s also improper, because it implies that “the economy,” “society,” or “nature” makes “mistakes.” It’s meaningless to impute motive to any of them, because they aren’t sentient. In any case, we have already reserved this term, in section 4.2, for a difference that truly is a mistake: that between the actual value of yi in the sample and the value that we can predict based on our fitted line.5 In sum, equation (5.1) completely specifies the population relationship between xi and yi. The population relationship looks very similar to the sample relationship that we expressed in equation (4.4). There is only one apparent distinction between the two: The population relationship is expressed in terms of Greek letters whereas the sample relationship is expressed in terms of Latin letters. We will emphasize the conceptual distinction between the sample and population relationships by maintaining this typographical distinction throughout the book. In general, Latin letters will represent numbers that we calculate from our samples, and the corresponding Greek letters will represent the parameters that describe our populations. In essence, the purpose of this chapter is to explore the analogies between these parameters and the numbers that we calculated in chapter 4.
5.3
The Statistical Characteristics of i As will become apparent throughout the rest of this chapter, εi is at the very root of the econometric enterprise. Its properties reproduce themselves in yi, a, and b. Therefore, it is important to understand both what they are and why we assert them. The essential property of εi is that it is a random variable. The previous section has already suggested that εi can have different values for different observations. This implies that, ex ante, or “before the fact of producing a specific value for yi,” a range of values for εi must be available for each observation. This is the variable part. The random part is that the actual value that gets assigned to each observation is completely unrelated to the value of its explanatory variable. For example, imagine a group of recent college graduates with the same courses and grade point averages. Despite the similarity in their educational qualifications, we would never expect them all to receive exactly identical salaries for their first jobs. Why do we expect their first salaries to differ, at least by a little? Because, intuitively, we understand that these salaries will be affected by disturbances, εi: unpredictable differences in their locations, industries, employers, and job responsibilities. Intuitively, we can think of the economy as first using the value of xi to establish the deterministic part of yi for an observation. It then completes the value of yi by assigning a value of εi. It accomplishes this assignment by randomly choosing a specific value for εi from those that are possible. The specific value assigned to εi for a specific observation is sometimes referred to as a realization. In order to explore the consequences of this construction, we need to provide formal specifications for the range of values that are possible for εi and for the properties of the random assignment. Two assumptions specify what we require regarding the range of values, and two additional assumptions establish the properties that are necessary regarding the assignment process. The essential characteristics of the set of possible values for εi consist of its expected value and population variance. These are the first two things that we would typically want to know about any random variable. They may be familiar to us from prior courses in statistics. In any case, we introduce them briefly here. The expected value of a random variable measures that variable’s central tendency. That is, the expected value gives some indication of what a typical outcome might be. Its meaning is most readily apparent in the case of a discrete random variable, a random variable with a finite number of possible outcomes.
For such a random variable, Z, the expected value, or expectation E(Z), is the sum of the possible outcomes, zj, each weighted by the probability that they will occur, P(zj):
\[
E(Z) = \sum_{j=1}^{m} z_j P(z_j). \tag{5.2}
\]
As we can see, this measure of typicality gives greater weight to outcomes, zj, that are either large in magnitude or that are especially likely to occur because P(zj) is large.6

Equation (5.2) also illustrates a larger principle. If we take all of the outcomes for any random variable, weight them by the probability that they will occur, and sum these products, we obtain the expected value of that random variable. In equation (5.2), the random variable is simply Z and its outcomes are zj. However, the same principle applies to random variables that may have more complicated names and representations. This will be very useful to us later.

The variance of a random variable measures that random variable's dispersion. In other words, the variance gives some indication of the extent to which individual outcomes across the population might differ from the expected value. Continuing with the discrete random variable Z, we can write the population variance as
\[
V(Z) = \sum_{j=1}^{m} \bigl( z_j - E(Z) \bigr)^2 P(z_j). \tag{5.3}
\]
From Sample to Population
145
Therefore, equation (5.3) also represents E(W). Recognizing that W can be written as (Z − E(Z))2, this equation also represents E((Z − E(Z))2). Consequently,7 V Z ≡ E Z − E Z
(
( )
( ))
2
.
(5.4)
The population variance of Z is just the expected value of the new random variable that we would get if we took each outcome for Z, subtracted from it E(Z), and squared the difference to obtain (Z − E(Z))2. This identity will be very useful in what follows. The formulas for the expected value and population variance of discrete random variables readily convey the intuition upon which they are built. However, they don’t apply exactly to εi. Our disturbance is a continuous random variable, because we believe that the range of possible outcomes for εi is infinite. This means that formal definitions of its expected value and population variance would require modifications to the notation of equations (5.2) and (5.3). Fortunately, we don’t need formal definitions here.8 The intuition in these two equations applies to continuous as well as to discrete random variables. Moreover, equation (5.4) applies without modification to both. With these preliminaries, we can make the first of our assumptions regarding the range of values that are possible for εi. We assume that its central tendency is zero: E(εi ) = 0.
(5.5)
The specific value that we assume for E(εi) is not especially important, as we will see in chapter 8. What is critical about this assumption, however, is that E(εi) does not depend on i. This is apparent in equation (5.5) because the quantity after the equality does not have a subscript i. In other words, the expected value of the disturbances is a constant, the same value for each member of the population.9 The importance of this assumption will become more apparent in section 5.6. For the moment, it can be justified by recalling that we have chosen to believe that the observations in our sample come from a much larger population, with which they share common characteristics. At the very least, the common characteristics should include the expected value of the disturbances. If we were not willing to make an assumption like that in equation (5.5), the fundamental claim that our sample was part of a population would look pretty foolish.10 The second of our assumptions regarding the range of values that are possible for εi addresses its dispersion. Its population variance is
(
V (ε i ) = E ε i − E (ε i )
)
2
= E (ε i ) = σ 2 . 2
(5.6)
146
Chapter 5
The first equality of equation (5.6) follows from equation (5.4). Equation (5.5) implies the second equality, because E(εi) = 0. The third equality of equation (5.6) states the assumption that we make about the variance of εi. It is equal to the constant σ 2, where σ is positive.11 The population standard deviation of εi, SD(εi), is therefore σ. The assumption in equation (5.6) has an important similarity to that we made regarding E(εi). As in equation (5.5), the subscript i appears before the first equality sign, but not after the last equality sign. In other words, we have once again made the assertion that the disturbances for each observation share the same characteristic, in this case, the same variance. As before, the purpose of this assertion is to identify another characteristic that is common between the observations in our sample and the members of the population. In this case, this assertion has a name: The assertion of equal variances is called the assumption of homoscedasticity. We will examine the alternatives to this assumption in chapter 8. The assumption of homoscedasticity also embodies an important difference with the assumption of equation (5.5). There, we were willing to assert a specific value for E(εi), zero. In equation (5.6), we are willing to assert that V(εi) has a value that is constant across observations, but not what that value might be. Instead, we represent that value by a symbol, σ 2. There are two very important reasons for this distinction between equations (5.5) and (5.6). First, it seems intuitively plausible that, whatever the context, disturbances could be positive or negative. Therefore, it seems reasonable to assume that the central tendency of the disturbances will be in the neighborhood of zero. From there to the assertion that this central tendency is exactly zero is not a large step. In contrast, we have no such intuition regarding the range of values that might appear for εi or the variance that describes this range. We know that the variance has to be positive. However, it is quite possible that it could be different for the different populations that would be relevant to different contexts. Second, as will become apparent in chapter 8, even if E(εi) ≠ 0, the sample can’t tell us. In other words, the sample does not provide us with an estimate of E(εi). We have no choice but to make an assumption about its value. As we will see in section 7.2, however, the sample yields a natural estimate of V(εi). There’s no reason to assert a particular value for this quantity if we can estimate it. At this point, it’s important to remember that we are discussing the characteristics of the population from which εi is chosen. The expected value of equation (5.5) and the population variance of equation (5.6) describe this population. We must be careful not to confuse these two characteristics with the average of equation (2.8) and the sample variance of equation (3.11). While analogous, these latter numbers describe only the sample on which they are based.
From Sample to Population
There are two ways to keep track of the difference. First, numbers that we calculate from samples will always have different names than numbers that describe populations. In cases where the population and sample names share the same word, we will always preface that word with one of the two qualifiers "sample" or "population."12 Second, numbers that describe populations will always be based on calculations involving the probabilities of the possible outcomes, for example, the P(zi) that appear in equations (5.2) and (5.3). In contrast, numbers drawn from samples will always weight each of the observations by a constant derived from the sample. In the cases of the average and the sample variance, these constants are, respectively, 1/n and 1/(n − 1), where n is the sample size. Equation (5.38) will reveal the weights that generate b.

We still need two additional assumptions that describe the process by which the economy assigns specific values of εi to each observation. The first addresses the relationships between the disturbances of different population members. The second describes the relationship between the disturbance and the value of xi for each member.

In general, the first characteristic of the relationship between two random variables in which we would be interested is the population covariance. Equation (3.7) defined the sample covariance for two variables. There, the contributions of each observation had the same weights as in the sample variance, the common denominator n − 1. As indicated previously, the population covariance, instead, weights the joint outcomes of two random variables by the probability that they will occur. In the discrete case, the population covariance between the random variables W and Z is
\[
\mathrm{COV}(W, Z) = \sum_{j=1}^{m} \sum_{k=1}^{n} \bigl( w_j - E(W) \bigr) \bigl( z_k - E(Z) \bigr) P(w_j, z_k). \tag{5.7}
\]
m
n
mn
∑∑ (w j − E (W )) ( zk − E ( Z )) P (w j , zk ) = ∑ vl P (vl ). j =1 k =1
l =1
148
Chapter 5
From this perspective, equation (5.7) just states that13
(
)(
)
COV (W , Z ) = E (V ) = E W − E (W ) Z − E ( Z ) .
(5.8)
While the notation of equation (5.7) isn’t appropriate for continuous random variables, the intuition it embodies is again applicable. Moreover, equation (5.8) is correct for continuous as well as discrete random variables. Therefore,
(
)
(
)(
( ))
COV ε i , ε j = E ε i − E (ε i ) ε j − E ε j .
(5.9)
Equation (5.5) asserts that E(εi) and E(εj) are both zero. Consequently, equation (5.9) becomes
(
)(
( ))
COV(εi , ε j ) = E εi − E (εi ) ε j − E ε j = E(εiε j ).
(5.10)
The population covariance between εi and εj is just the expected value of their product, εiεj. This will be very useful at later points in this text. Regardless of how it is written, we assume that
(
)
COV ε i , ε j = 0,
(5.11)
the population covariance between all disturbances, is equal to zero.14 In strict analogy to the sample correlation of equation (3.13), the population correlation is the ratio of the population covariance to the population standard deviations of the two random variables. Therefore, equation (5.11) also implies that, for our purposes, the value of the disturbance for one observation is uncorrelated to that of any other. Knowledge of one does not allow us to predict anything about any other. This is a fairly strong assumption. Because it is powerful, it plays a critical role in the development of the properties of b and a in section 5.8. However, its strength is also a weakness. Because it rules out any important relationship between disturbances, it may also be inappropriate in some contexts. This assumption means that the disturbance of one population member is not related to that of others with whom the member might have important affiliations. For example, the disturbances assigned to each individual are not related to the disturbances of family members, neighbors, schoolmates, or work colleagues of that individual, if the population contains them. Chapter 9 will examine situations where this assumption is likely to be inappropriate. The last of the assumptions required here asserts that nothing can be predicted about any of the disturbances from any of the values for the explanatory variable. In other words, nothing can be deduced regarding εi from the
From Sample to Population
149
set of values {x1, x2, x3, . . . , xm}, where the subscripts indicate individual population members and m represents the size of the population. This assumption is again fairly strong. It reiterates the intent of the assumption that COV(εi, εj) = 0: The problem that we are currently considering is one in which the disturbances are genuinely random. Their specific values are unrelated to anything that is observable. We are interested in populations with this general property because, first, we can often make arguments that our samples are likely to embody this property. Second, and perhaps as important, the analysis of populations with this property is especially straightforward. We reserve the more complicated analysis required for populations where the disturbances might be related to xi for chapter 10. In particular, this last assumption implies that the disturbance and the explanatory variable for the same observation are unrelated. This implication is often represented as COV(εi, xi) = 0. This representation is convenient, because the calculation of the population covariance in this case should yield a value of zero. It also summarizes concisely the property that will be important in section 5.8. The representation COV(εi, xi) = 0 is also somewhat casual, because we have defined the population covariance only for pairs of random variables. For the purposes of this chapter, xi is definitely not a random variable. So this notation does not represent precisely what we have in mind. This will not bother us here. This discussion of εi has been extensive, deliberately and somewhat paradoxically so. We never even pretend to observe an actual disturbance. So why all the fuss? The answer is that we observe many other random variables and calculate yet more. Every one of them is random solely because it contains one or more disturbances. So, even though we never observe the disturbances themselves, we observe, so to speak, their progeny. As with any progeny, their properties and behavior reflect the characteristics that they inherit from their parental disturbances.
5.4
The Statistical Characteristics of yi These last points become readily apparent when we consider the statistical properties of the dependent variable, yi. Equation (5.1) demonstrates that yi is a function of εi and therefore a random variable itself. This equation also implies that E( yi ) = E(α + β xi + ε i ).
(5.12)
150
Chapter 5
As we said in section 5.2, yi is a linear function of εi. Therefore, we can analyze this expected value by taking advantage of the general properties of expectations of linear combinations of random variables. These properties may be familiar if we’ve had a previous statistics course. Two are useful at this moment. First, the expected value of a sum is the sum of the expected values. For example, n E yi = i =1
∑
n
∑E ( y ). i
(5.13)
i =1
Therefore, equation (5.12) becomes E ( yi ) = E (α ) + E ( β xi ) + E (εi ) .
(5.14)
Second, the expected value of a constant is itself. Using α as a representative constant, we have E (α ) = α.
(5.15)
It may not be immediately apparent, but equation (5.15) also applies to βxi. β is obviously a constant, because it represents a single value that applies to all members of the population. However, xi has to have at least some different values.15 Nevertheless, we have assumed that, whatever these values are, they are predetermined for each member of the population. So, while the value of xi varies across members, it is fixed for each member. Therefore, its expected value for each member is again itself, and E(βxi) = βxi. With the assumption that E(εi) = 0, from equation (5.5), equation (5.14) becomes E ( yi ) = α + β xi .
(5.16)
In other words, what we earlier referred to as the deterministic part of yi is also its expected value. Equation (5.5) asserts that the typical value for εi is zero. Therefore, the typical value for yi is just that part of yi that does not depend on εi, or yi = E ( yi ) + ε i .
(5.17)
The assumptions that we have already made regarding the population relationship and the properties of εi also determine the population variance of yi.
From Sample to Population
151
Following the first equality of equation (5.6), we can define the population variance of yi as V( yi ) = E( yi − E( yi ))2 .
According to equations (5.1) and (5.16), the difference yi − E(yi) can be written as yi − E ( yi ) = (α + β xi + ε i ) − (α + β xi ) = εi .
(5.18)
Therefore,
(
V ( yi ) = E yi − E ( yi )
)
2
( )
= E ε i2 .
(5.19)
Finally, equation (5.6) gives
( )
V ( yi ) = E εi2 = σ 2 .
(5.20)
The population standard deviation of yi is, consequently, σ. In other words, the population variance of yi is completely determined by the variance of εi. This should not be surprising. εi is the only component of yi whose value is not fixed. Therefore, whatever variation is possible in the value of yi around its expected value must come entirely from εi. The final property of the dependent variable describes the relationship between values of yi for two different population members. According to equation (5.8), the population covariance between these values is COV( yi , y j ) = E ( yi − E( yi ))( y j − E( y j )).
Equation (5.18) demonstrates that yi − E(yi) = εi and yj − E(yj) = εj. Therefore,
(
) ( )
COV yi , y j = E εiε j .
(5.21)
Equation (5.10) identifies the expected value to the right of the equality as COV(εi, εj). Consequently, equation (5.11) gives
(
)
(
)
COV yi , y j = COV εi , ε j = 0.
(5.22)
In other words, the deviations of the dependent variable from its expected value for different population members are uncorrelated with each other. This follows because all of the deviations in these variables from their expected values come from the variations in εi. As these deviations are uncorrelated by the assumption embodied in equation (5.11), the value for yi can tell us nothing about the value for yj. They are also uncorrelated.
152
5.5
Chapter 5
Parameters and Estimators The assumptions of the previous two sections completely specify the population relationship in which we are interested. This specification is based on three parameters. As we anticipated in section 5.2, parameters are numbers that have three properties: 1. They are fixed. 2. They govern the generation of the value for yi, given any value for xi. 3. Their values are unknown to us. Our objective, and that of any statistical endeavor, is to develop estimators of these parameters. Estimators are formulas that instruct us as to how to use the information in any sample to construct an estimate. Estimates are values, derived from a particular sample, that we can take as reasonable approximations of the unknown parameter values. We also refer to estimators as sample statistics. This label reminds us of two important characteristics of estimators. First, we construct them from the information in our sample. Second, we would almost certainly get somewhat different values for our approximations from other samples. Therefore, these estimators have statistical properties that describe the extent to which they might vary from sample to sample. These properties help us assess the usefulness of our estimators. Two of the parameters here were introduced in the population relationship of equation (5.1), α and β. The third, σ 2, was introduced as the variance of εi in equation (5.6) and identified as the variance of yi in equation (5.20). The rest of this chapter will consider estimators of α and β.16 Equation (5.1) gives the population relationship as yi = α + β xi + ε i .
This relationship describes the actual process by which observations in the population, including those in the sample, acquired their values for yi. Equation (4.4) gives the relationship, which we impute to xi and yi in our sample by minimizing the sum of squared errors, as yi = a + bxi + ei .
This is clearly not the same as the population relationship. Moreover, the specific values that we would get for a and b would probably be different for different samples. Nevertheless, the striking analogy between equation (4.4) and equation (5.1) suggests that it might be natural to take b as our estimator of β and a as our estimator of α.
From Sample to Population
153
The next five sections examine this suggestion. We will devote most of our effort to understanding the relationship between b and β. This is because β is the most interesting of the three parameters that define the population. We will prove some results regarding the estimation of α by a. In many cases, however, we will simply assert that a possesses properties that we have proven for b. This will simplify the exposition, without depriving us of any essential intuition.
5.6
Unbiased Estimators of b and The first question we might ask is, could b be equal to β? To answer this, we begin with the formula for b given in equation (4.37): n
∑( y − y ) x i
b=
i
i =1 n
.
∑( x − x ) x i
i
i =1
Equation (2.37) tells us that we can replace the numerator with ∑in=1 ( xi − x ) yi . The new expression is n
∑( x − x ) y i
b=
i
i =1 n
(5.23)
.
∑( x − x ) x i
i
i =1
Replacing yi with the population relationship of equation (5.1) yields n
∑( x − x ) (α + β x + ε ) i
b=
i
i =1
i
.
n
(5.24)
∑( x − x ) x i
i
i =1
Equation (5.24) immediately hints at the answer to our question. The slope b depends on all of the disturbances εi that appear in the sample. As these are random variables, b must be a random variable as well. This implies that it cannot always be equal to β, which, as a parameter, is constant.
154
Chapter 5
To make this intuition explicit, repeatedly apply equation (2.40) to the numerator of equation (5.24) to obtain n
n
n
∑( x − x )α + ∑( x − x ) β x + ∑( x − x )ε i
b=
i
i =1
i
i
i =1 n
i
i =1
.
∑( x − x ) x i
i
i =1
The application of equation (2.7) allows us to factor out the population parameters: n
α
∑ i =1
b=
( xi − x ) + β
n
∑ i =1 n
( xi − x ) xi +
n
∑( x − x )ε i
i =1
i
.
∑( x − x ) x i
i
i =1
Finally, the ratio to the right of the equality can be separated into three separate ratios with common denominators: n
b =α
i =1 n
∑( x − x ) x i
n
n
∑( xi − x )
+β
∑ ( xi − x ) xi ∑( xi − x )εi i =1 n
+
.
(5.25)
∑( x − x ) x ∑( x − x ) x i
i
i
i =1
i =1
i =1 n
i
i
i =1
The numerator of the first term to the right of the equality in equation (5.25), ∑in=1 ( xi − x ), is equal to zero according to equation (2.19). Therefore, this entire
term is equal to zero as well: n
∑( x − x ) i
α
i =1 n
∑( x − x ) x i
i
i =1
=α
0 n
∑( x − x ) x i
= 0..
(5.26)
i
i =1
The ratio multiplying β in the second term to the right of equality is simply one. The term as a whole is therefore just β: n
∑( x − x ) x i
β
i
i =1 n
∑( x − x ) x i
i =1
i
= β.
(5.27)
From Sample to Population
155
If we substitute equations (5.26) and (5.27) into equation (5.25), we can rewrite b as n
∑( x − x )ε i
b=β+
i
i =1 n
.
(5.28)
∑( x − x ) x i
i
i =1
The remaining action in the formula for b is in the second term to the right of the equality in equation (5.28). We can replace its denominator by ∑in=1 ( xi − x )2, courtesy of equation (2.28). As we will demonstrate in exercise 5.4, equation (2.33) allows us to replace the numerator by ∑in=1 ( xi − x )(εi − ε ). Finally, if we multiply the entire term by one, in the form of the ratio of 1/(n − 1) to itself, we obtain n
∑
n
( xi − x )εi
i =1 n
∑( x − x ) x i
i =1
i
∑ ( x − x )(ε − ε ) 1 / (n − 1) i
=
i
i =1
n
∑( x − x )
2
1 / ( n − 1)
=
COV( xi , ε i ) . V( xi )
(5.29)
i
i =1
According to equations (3.7) and (3.11), this term is the ratio of the sample covariance between xi and εi to the sample variance of xi. If we insert equation (5.29) into equation (5.28), the result is b=β+
COV( xi , εi ) . V( xi )
(5.30)
The good news in equation (5.30) is that the value of β plays at least some role in determining the estimate b that we calculate from our sample. The bad news is that this estimate also depends on the sample variance of xi and, especially, the sample covariance between xi and εi. It will equal β only if this sample covariance is zero. The last of our assumptions in section 5.3 implies that this covariance must be zero in the population. It will therefore probably be zero, on average, across samples. However, it will be exactly zero within a particular sample only if that sample perfectly reproduces the pattern of combinations of xi and εi that appears in the population. The random allocation of population members to samples will almost never yield this reproduction. If it did, we would never know it, because we don’t observe the disturbances εi in the sample, much less in the population. Consequently, we cannot assume that b is equal to β. We must, instead, rely
on the properties of b developed below to ascertain whether it is acceptably close.17

One indication that b was acceptably close to β would be if b tended to be accurate. In other words, if we were to calculate b many times, from many independent samples, the resulting collection of estimated values would tend to cluster around the true value for β. Another way to say this is that the central tendency for b would be β. This phrasing recalls section 5.3. Formally speaking, it would be nice if the expected value of b were equal to β: E(b) = β.

This condition specifies that the expected value of an estimator equals the parameter that it is estimating. It is called unbiasedness. Intuitively, an unbiased estimator tends to be correct. The appeal of this property should be obvious. Consider the alternative: Would we ordinarily prefer an estimator that tended to be incorrect?

The difference between the expected value of the estimator and the parameter that it is trying to estimate is called the bias. If, in general, d is an estimator of the parameter δ,

$$\mathrm{bias} = E(d) - \delta. \qquad (5.31)$$
An unbiased estimator has, naturally, zero bias.

While it should be easy to see why we might prefer unbiased estimators, it might be hard to see how we can tell. After all, how can we determine whether the expected value of our estimator is equal to the parameter that we are trying to estimate, when the whole reason that we are trying to estimate it is because we don't know its value? It might seem as if a little magic is required. In fact, however, all that is required is a careful development of the properties of b, based on the assumptions that we have already made.

Return to equation (5.28). That equation represents b as the sum of two terms. Therefore, its expected value is the expected value of this sum:

$$E(b) = E\!\left(\beta + \frac{\sum_{i=1}^{n}(x_i - \bar{x})\varepsilon_i}{\sum_{i=1}^{n}(x_i - \bar{x})x_i}\right). \qquad (5.32)$$

The application of equation (5.13) to equation (5.32) yields

$$E(b) = E(\beta) + E\!\left(\frac{\sum_{i=1}^{n}(x_i - \bar{x})\varepsilon_i}{\sum_{i=1}^{n}(x_i - \bar{x})x_i}\right). \qquad (5.33)$$
The first term to the right of the equality in equation (5.33) is easy. As β is a constant, equation (5.15) asserts that its expected value is itself. This already implies that β plays some role in E(b), even though we don't know its value.

Consider the second term to the right of the equality in equation (5.33). The denominator of the ratio inside the parentheses doesn't contain any random variables. We can therefore treat it as a multiplicative constant. Statistical theory tells us that the expected value of a random variable multiplied by a constant is the constant multiplied by the expected value of the random variable. For example, if k is a constant,

$$E(k y_i) = k\,E(y_i). \qquad (5.34)$$
Accordingly, we can extract the denominator from the expectation:

$$E\!\left(\frac{\sum_{i=1}^{n}(x_i - \bar{x})\varepsilon_i}{\sum_{i=1}^{n}(x_i - \bar{x})x_i}\right) = \frac{1}{\sum_{i=1}^{n}(x_i - \bar{x})x_i}\,E\!\left(\sum_{i=1}^{n}(x_i - \bar{x})\varepsilon_i\right). \qquad (5.35)$$
Equation (5.13) again allows us to replace the expectation of a sum with the sum of the expectations:

$$\frac{1}{\sum_{i=1}^{n}(x_i - \bar{x})x_i}\,E\!\left(\sum_{i=1}^{n}(x_i - \bar{x})\varepsilon_i\right) = \frac{1}{\sum_{i=1}^{n}(x_i - \bar{x})x_i}\sum_{i=1}^{n}E\big((x_i - \bar{x})\varepsilon_i\big).$$
The application of equation (5.34) to the quantity xi − x̄ inside the expectation allows us to restate the expected value in the term to the right of the equality as18

$$\frac{1}{\sum_{i=1}^{n}(x_i - \bar{x})x_i}\sum_{i=1}^{n}E\big((x_i - \bar{x})\varepsilon_i\big) = \frac{1}{\sum_{i=1}^{n}(x_i - \bar{x})x_i}\sum_{i=1}^{n}(x_i - \bar{x})\,E(\varepsilon_i). \qquad (5.36)$$
By equation (5.5), E(εi) = 0. Therefore, the entire term in equation (5.35) is also zero! Consequently, equation (5.33) becomes

$$E(b) = E(\beta) + E\!\left(\frac{\sum_{i=1}^{n}(x_i - \bar{x})\varepsilon_i}{\sum_{i=1}^{n}(x_i - \bar{x})x_i}\right) = \beta + 0 = \beta. \qquad (5.37)$$
Miraculously, b is an unbiased estimator of β! It estimates β with zero bias.

It is worth our while to stop and review how it is that we were able to demonstrate this. As we said before, we don't actually know the value of β. However, we know, from equation (5.16), how that value, whatever it is, is embedded in the structure of E(yi). As the discussion we have just concluded demonstrates, all we need to prove that b is an unbiased estimator of β is this and repeated applications of the rules for expectations in equations (5.13) and (5.34).

The relationship between a and α is analogous to that between b and β. Once again, although we will forgo the proof, a would be exactly equal to α only in the unlikely event that the sample from which it is calculated perfectly reproduces the pattern of xi and εi pairs in the population. According to equation (4.35), however, a is a linear function of ȳ and b x̄. Implicitly, the definition of averages in equation (2.8) identifies ȳ as a linear function of the yi's:

$$\bar{y} = \frac{\sum_{i=1}^{n}y_i}{n} = \frac{1}{n}y_1 + \frac{1}{n}y_2 + \frac{1}{n}y_3 + \cdots + \frac{1}{n}y_n.$$
Equation (5.23) implies that the slope b is also a linear function of the yi's:

$$b = \frac{\sum_{i=1}^{n}(x_i - \bar{x})y_i}{\sum_{i=1}^{n}(x_i - \bar{x})x_i} = \frac{x_1 - \bar{x}}{\sum_{i=1}^{n}(x_i - \bar{x})x_i}\,y_1 + \frac{x_2 - \bar{x}}{\sum_{i=1}^{n}(x_i - \bar{x})x_i}\,y_2 + \cdots + \frac{x_n - \bar{x}}{\sum_{i=1}^{n}(x_i - \bar{x})x_i}\,y_n. \qquad (5.38)$$
Accordingly, a itself is a linear function of the yi's. We can therefore once again rely on equations (5.13) and (5.34) to help us demonstrate that a is an unbiased estimator of α. To see this, let us first explore the expected value of ȳ in the sample:

$$E(\bar{y}) = E\!\left(\frac{\sum_{i=1}^{n}y_i}{n}\right) = \frac{1}{n}\,E\!\left(\sum_{i=1}^{n}y_i\right) = \frac{1}{n}\sum_{i=1}^{n}E(y_i). \qquad (5.39)$$
The first equality simply replaces ȳ with its definition. The second equality is a consequence of equation (5.34). Equation (5.13) gives us the third equality. Equation (5.16) allows us to replace E(yi):

$$E(\bar{y}) = \frac{1}{n}\sum_{i=1}^{n}E(y_i) = \frac{1}{n}\sum_{i=1}^{n}(\alpha + \beta x_i).$$
Equation (2.14) allows us to rewrite this as
( )
E y =
1 n
n n 1 α + β x = α + β x ( i) i. n i =1 i =1 i =1 n
∑
∑ ∑
Applying equation (2.5) to the first summation and equation (2.7) to the second summation to the right of the last equality, we get n 1 E y = α+ n i =1
( )
n 1 β xi = nα + β xi . n i =1 i =1 n
∑ ∑
∑
Finally, multiplying each of the terms in the parentheses to the right of the last equality by 1/n and recalling the definition of the average value for xi in equation (2.8), we have n 1 E y = nα + β xi = α + β x . n i =1
( )
∑
(5.40)
With these preliminaries, we can now derive the expected value of a. By equation (4.35),

$$E(a) = E(\bar{y} - b\bar{x}).$$

By equation (5.13),

$$E(a) = E(\bar{y} - b\bar{x}) = E(\bar{y}) - E(b\bar{x}).$$

Equation (5.40) replaces the first term to the right of the last equality, while equation (5.34) allows us to rewrite the second term:

$$E(a) = E(\bar{y}) - E(b\bar{x}) = (\alpha + \beta\bar{x}) - \bar{x}\,E(b). \qquad (5.41)$$

Finally, we can replace E(b) with β by virtue of equation (5.37):

$$E(a) = (\alpha + \beta\bar{x}) - \bar{x}\beta = \alpha. \qquad (5.42)$$
As predicted, a is an unbiased estimator of α. Once again, it was not necessary to know the actual value of α to deduce that, whatever it is, it is the expected value of a. It was sufficient merely to understand how α is embedded in E(yi), as well as the rules regarding expectations.
5.7 Let's Explain This Again

We've proven that b and a are unbiased estimators of β and α, even though we don't know the values of β and α. Nevertheless, the idea may still seem hard to accept, emotionally. Let's work through it again in a situation where we actually do know these values. A situation such as this is called a simulation. In a simulation, we construct artificial data that are based on parameter values that we choose ourselves.19 In other words, we create the population! Obviously, the point of a simulation is different from the point of an analysis with real data. It's not to learn about actual behavior. It's to learn about the properties of a statistical procedure.

Let's begin with values of xi. We'll choose 20, so that our artificial samples will have the same number of observations as did our actual sample in table 3.3. For computational simplicity and in order to have a reasonable array of artificial educational levels, we assign 8 years of schooling to one observation, 9 to 17 years of schooling to each of two observations, and 18 years of schooling to the last observation.

Next, we'll choose parameter values. Let's set β and α equal to 4,000 and −20,000, respectively. These values are similar to those that we estimated in section 4.7 using the Census data of table 3.3. With these values for β, α, and xi, we tell a computer to begin by computing E(yi), using equation (5.16).

Finally, we ask the computer to generate random values for εi. If the computer is to simulate them, it needs to know everything about the distribution from which they are to be drawn. We'll set σ equal to 25,000, for reasons that we'll explain in section 7.2. However, even this standard deviation and the expected value from equation (5.5) are not enough. We have to actually specify the distribution of εi. We choose a well-known and well-defined distribution that may already be familiar to us, the normal.20 The computer draws a random realization of εi from the normal distribution with expected value zero and population standard deviation 25,000 and adds it to E(yi) according to equation (5.17) to generate the value of yi for each observation. In other words, we actually know the values of εi for these data because we fabricate them.

Table 5.1 presents a simulated data set containing these values, as well as those for xi, E(yi), and yi.

TABLE 5.1 Simulated data

Observation | Simulated education, xi | Simulated expected value of earnings, E(yi) = α + βxi | Simulated disturbance, εi | Simulated earnings, yi
1 | 8 | 12,000 | −11,604 | 395.88
2 | 9 | 16,000 | 3,381 | 19,381
3 | 9 | 16,000 | 21,952 | 37,952
4 | 10 | 20,000 | −13,212 | 6,787.6
5 | 10 | 20,000 | −11,097 | 8,902.9
6 | 11 | 24,000 | −4,041 | 19,959
7 | 11 | 24,000 | 20,765 | 44,765
8 | 12 | 28,000 | 18,108 | 46,108
9 | 12 | 28,000 | −30,297 | −2,297.2
10 | 13 | 32,000 | 35,206 | 67,206
11 | 13 | 32,000 | 21,102 | 53,102
12 | 14 | 36,000 | 4,217 | 40,217
13 | 14 | 36,000 | −3,048 | 32,952
14 | 15 | 40,000 | −18,598 | 21,402
15 | 15 | 40,000 | −16,347 | 23,653
16 | 16 | 44,000 | 29,436 | 73,436
17 | 16 | 44,000 | 612 | 44,612
18 | 17 | 48,000 | 12,134 | 60,134
19 | 17 | 48,000 | −4,354 | 43,646
20 | 18 | 52,000 | 6,132 | 58,132

We must admit that the column for E(yi) is boring. However, the simulated disturbances look pretty random, don't they? Thanks to them, simulated actual earnings, yi, are a lot more interesting than expected earnings, E(yi).

Nothing except our will and interest prevents us from running a regression of these simulated values for earnings on these simulated values for education. In the case of the data in table 5.1, the regression calculations yield an intercept of −23,979 and a slope of 4,538.6. Recall that the actual parameter values for the constant and coefficient are −20,000 and 4,000. Not bad!

The results of section 5.6, however, don't tell us that individual intercept and slope estimates will be close to the true parameter values. Instead, they tell us that groups of estimates will cluster around the true values. How can we obtain groups of estimates? Let's just fabricate more samples and repeatedly run the same regression.

First, we fabricate 49 more samples of 20 observations each. All 50 begin with the same parameter values and the same set of values for xi. Therefore, the second and third columns of table 5.1, for xi and E(yi), are the same for each sample. What distinguishes each of the samples is that the computer assigns each observation in each sample a different random draw for the value of εi. Accordingly, each sample has a different random set of 20 disturbance values, the fourth column of table 5.1. Consequently, each has a unique set of values for yi and will ordinarily yield different values for b and a.
TABLE 5.2 Fifty simulated regressions

Regression | Intercept estimate | Slope estimate | Regression | Intercept estimate | Slope estimate
1 | −23,979 | 4,538.6 | 26 | −2,451.6 | 3,257.1
2 | −10,021 | 3,933.6 | 27 | 23,650 | 1,638.0
3 | −6,824.6 | 3,265.6 | 28 | −4,806.9 | 2,719.6
4 | −44,221 | 5,425.3 | 29 | −3,826.1 | 2,314.5
5 | −9,526.3 | 3,332.9 | 30 | 5,877.3 | 2,249.4
6 | −13,768 | 3,344.1 | 31 | 6,903.3 | 2,464.0
7 | −57,953 | 6,666.0 | 32 | −18,670 | 3,999.1
8 | −34,687 | 5,857.4 | 33 | 9,747.6 | 1,421.4
9 | −32,100 | 4,856.9 | 34 | −59,806 | 6,820.5
10 | −18,448 | 3,687.6 | 35 | 15,791 | 2,364.4
11 | −26,415 | 4,064.1 | 36 | −7,295 | 3,157.1
12 | −78,350 | 9,249.5 | 37 | −77,039 | 7,664.0
13 | 11,796 | 2,101.6 | 38 | −72,376 | 7,388.4
14 | −12,248 | 3,502.2 | 39 | 4,354.3 | 2,143.9
15 | −26,499 | 3,399.7 | 40 | −42,787 | 5,558.9
16 | 1,941.9 | 2,289.6 | 41 | −16,651 | 3,461.9
17 | −54,580 | 6,459.3 | 42 | −42,120 | 6,598.3
18 | −8,699.1 | 3,488.0 | 43 | −32,503 | 4,135.5
19 | −40,969 | 5,724.4 | 44 | 10,326 | 1,685.4
20 | −2,228.6 | 2,182.5 | 45 | −41,507 | 5,768.3
21 | 25,101 | 722.5 | 46 | 8,041 | 1,613.2
22 | −55,624 | 6,467.6 | 47 | −26,266 | 4,491.1
23 | −11,866 | 3,729.3 | 48 | −55,105 | 7,045.5
24 | −30,993 | 4,970.7 | 49 | −40,477 | 5,345.9
25 | −38,250 | 5,075.4 | 50 | 58,862 | −1,271.8
The set of disturbances assigned to each sample is unrelated to that assigned to any other sample, because all are random. This means that the samples are independent of each other. Consequently, the estimators that arise from each sample are independent as well. This is exactly the situation that the expected value describes: It identifies the central tendency of a collection of independent estimates, each calculated in the same way. We obtain this collection by running the regression of simulated earnings on simulated education for each of the 50 samples. Table 5.2 presents the results of this exercise. It reports the intercept and slope estimates for all 50 regressions. The first thing we notice is that, in a sense, we got lucky with our first sample. The intercept and slope from the regression on the data of table 5.1 are unusually close to the true values. This implies that the sample of random disturbances that the computer happened to assign to this particular sample did a pretty good job of representing the population of possible disturbance values. In contrast, the sample of disturbances assigned randomly to, say, the 50th sample looks as though it was not very representative of the population. The
regression on this sample estimates that its members begin with approximately $59,000 in annual earnings and then lose $1,271.80 per year in annual earnings for every year of schooling. Similarly unrepresentative disturbance samples must underlie some of the other regressions. The estimated return to education is quite small in regression 21. However, it is more than $7,000 for four other regressions. Once again, the previous section says nothing about what each of these individual estimates should look like. Therefore, none of these anomalies are in conflict with the results there. To the contrary, the collection of estimates in table 5.2 looks roughly like section 5.6 says that it should. Equation (5.37) demonstrates that the distribution of slope estimates should be approximately centered on the true value of β, 4,000. The average value of the 50 slope estimates in table 5.2 is very close, at 4,047.4. Figure 5.1 presents the distribution of these slope estimates. They cluster quite noticeably around the true value.
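An experiment like the one summarized in table 5.2 is straightforward to reproduce. The sketch below is not the author's code; it assumes Python with NumPy, reuses the design above (50 independent samples with the same xi values and parameters), and will produce its own collection of estimates, which should likewise cluster around the true values.

```python
import numpy as np

alpha, beta, sigma = -20000.0, 4000.0, 25000.0
x = np.array([8, 9, 9, 10, 10, 11, 11, 12, 12, 13,
              13, 14, 14, 15, 15, 16, 16, 17, 17, 18], dtype=float)
rng = np.random.default_rng(seed=2)

intercepts, slopes = [], []
for _ in range(50):                                    # 50 independent samples
    eps = rng.normal(0.0, sigma, size=x.size)          # fresh disturbances each time
    y = alpha + beta * x + eps
    b = np.sum((x - x.mean()) * y) / np.sum((x - x.mean()) * x)
    intercepts.append(y.mean() - b * x.mean())
    slopes.append(b)

# The averages should lie close to the true parameter values
print("average intercept:", round(np.mean(intercepts), 1))
print("average slope:", round(np.mean(slopes), 1))
```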
[Figure 5.1 Distribution of slope estimates, simulated data: a histogram of the 50 slope estimates (horizontal axis: estimate values, in bins of 1,000 from −2,000 to 10,000; vertical axis: number of estimates).]
[Figure 5.2 Distribution of intercept estimates, simulated data: a histogram of the 50 intercept estimates (horizontal axis: estimate values, in bins of 10,000 from −80,000 to 60,000; vertical axis: number of estimates).]
The average value of the intercept estimates in table 5.2 is −19,991. Consistent with the assertion in equation (5.42), this is also very close to the actual parameter value of −20,000. Figure 5.2 presents the distribution of the 50 individual intercept estimates. Here, the clustering around the true value is present, but a little less distinct than that in figure 5.1.21 Equations (5.37) and (5.42) are the point of section 5.6. Figures 5.1 and 5.2 give us a visual representation of what these two equations mean. In both cases, they demonstrate that collections of estimates from independent samples tend to cluster around the values that we have assigned to the parameters. What happens if we perform the same exercise using actual data? We can draw 50 more random samples of 20 observations from the Census data that gave us the sample in table 3.3 and run the regression at the end of section 4.7 on each. Table 5.3 catalogues the results. Once again, we have 50 independent slope estimates and 50 independent intercept estimates. Once again, we can examine the distributions of each. However, we chose the values of β and α for figures 5.1 and 5.2. We could therefore use those figures to verify equations (5.37) and (5.42). In contrast, we don’t know the values of β and α for the Census data. To the contrary, the point of examining the Census data is to try to get a sense of what they might be. In this case, therefore, we apply equations (5.37)
TABLE 5.3 Fifty regressions on Census data

Regression | Intercept estimate | Slope estimate | Regression | Intercept estimate | Slope estimate
1 | −103,607 | 10,911 | 26 | 46,766 | −1,553.5
2 | −38,101 | 6,958.4 | 27 | −24,157 | 3,960.2
3 | 15,950 | 1,873.3 | 28 | −36,746 | 4,479.5
4 | −141,532 | 13,018 | 29 | 4,592.0 | 1,208.5
5 | −8,872.7 | 3,554.8 | 30 | −9,852.2 | 3,261.6
6 | −32,426 | 5,112.3 | 31 | −7,218.8 | 2,095.8
7 | −29,957 | 5,027.2 | 32 | −15,199 | 3,393.2
8 | 5,873.5 | 1,070.1 | 33 | −2,110.8 | 2,341.6
9 | −80,158 | 8,038.0 | 34 | 7,534.3 | 1,813.1
10 | −7,960.2 | 1,978.5 | 35 | −52,758 | 6,794.5
11 | −108,539 | 11,091 | 36 | −9,547.0 | 3,166.7
12 | −17,994 | 4,707.7 | 37 | −119,609 | 11,491.0
13 | −35,523 | 5,301.7 | 38 | 591.92 | 2,202.2
14 | −8,801.7 | 2,698.1 | 39 | −17,354 | 3,388.3
15 | −57,610 | 6,417.9 | 40 | 11,922 | 1,358.7
16 | −46,698 | 6,209.6 | 41 | −64,561 | 9,610.8
17 | −21,293 | 4,211.6 | 42 | −88,746 | 9,230.4
18 | −84.53 | 3,256.3 | 43 | 5,328.4 | 2,057.1
19 | −2,710.4 | 1,342.2 | 44 | −5,246.3 | 3,402.1
20 | −139,728 | 14,948.3 | 45 | 1,358.1 | 2,066.4
21 | −6,149.4 | 3,177.5 | 46 | 17,892 | 875.68
22 | −21,488 | 5,046.2 | 47 | −45,500 | 6,036.5
23 | 16,263 | 1,003.8 | 48 | −14,469 | 3,079.0
24 | −145,918 | 14,062.3 | 49 | −12,822 | 2,364.0
25 | 9,811.6 | 565.75 | 50 | 8,024.7 | 938.49
and (5.42) to the distributions of estimates to indicate the likely values for β and α. Figure 5.3 presents the distribution of slope estimates from table 5.3. The range of estimates is quite wide. One suggests that the return to education is negative, while at least two imply that it exceeds $14,000 per year. However, there is a distinct mass of estimates in the range of $1,000 to $4,000. Equation (5.37) suggests that the true value of β probably lies somewhere in there. Figure 5.4 demonstrates that the distribution of intercept estimates is also very wide. Most of the values are less than zero. However, the mass of estimates probably ranges from −$40,000 to 10,000. The implication of equation (5.42) is that α probably lies somewhere in that range.
5.8 The Population Variances of b and a

As we've already said, unbiased estimators have to be more useful than biased estimators, all else equal. Therefore, it's good news that a and b are unbiased
[Figure 5.3 Distribution of slope estimates, Census data, and Figure 5.4 Distribution of intercept estimates, Census data: histograms of the 50 slope estimates and 50 intercept estimates from table 5.3.]
$$d + t_{\alpha/2}^{(df)}\,\mathrm{SD}(d) > \delta > d - t_{\alpha/2}^{(df)}\,\mathrm{SD}(d).$$
Finally, we obtain the conventional representation of the confidence interval by reversing the order of the terms and, consequently, the inequalities, and reinserting them into equation (6.7):

$$1 - \alpha = P\big(d - t_{\alpha/2}^{(df)}\,\mathrm{SD}(d) < \delta < d + t_{\alpha/2}^{(df)}\,\mathrm{SD}(d)\big).$$
(6.8)
The probabilities in equations (6.7) and (6.8) have to be the same because values for d that satisfy the inequality in one must always satisfy the inequality in the other. Values that violate one inequality must violate both. Figure 6.2 presents a graphical representation of the confidence interval in equation (6.8). This representation restates the interpretation we provided for the confidence interval just before the formal development. First, the confidence interval is centered on d, the estimator from our sample.
[Figure 6.2 (1 − α)% confidence interval: the interval is centered on d and extends $t_{\alpha/2}^{(df)}\,\mathrm{SD}(d)$ on either side, from $d - t_{\alpha/2}^{(df)}\,\mathrm{SD}(d)$ to $d + t_{\alpha/2}^{(df)}\,\mathrm{SD}(d)$.]
Second, the lower and upper bounds of the confidence interval are known: Given our choice of a specific value for α and the degrees of freedom in our sample, we obtain a specific value for tα( df/ 2 ) from the table for the t-distribution, table A.2. Our sample has provided us with a specific value for d and an estimate of its standard deviation, SD(d ). We believe, with (1 − α)% confidence, that the unknown parameter δ lies between the known bounds d − tα( df/ 2 )SD(d ) and d + tα( df/ 2 )SD(d ). Third, figure 6.2 doesn’t contain the density function for d because we don’t know where to put it. The problem is that the density function is centered on δ. But that’s what we’re looking for! All we know about δ is that our confidence interval is (1 − α)% likely to contain it. The only remaining question is the choice of α. The advantage of confidence intervals is that, while point estimators are almost surely “wrong,” in the sense of not being precisely right, confidence intervals can be constructed to provide any level of assurance. So, in general, we construct them so that they provide a very high degree of confidence. Of course, the highest degree would be 100% confidence, or α = 0. This degree of confidence would certainly be reassuring. The problem is, as exercise 6.2 asks us to demonstrate, that the lower bound of the corresponding confidence interval is −∞ and the upper bound is ∞. In other words, we can be 100% certain that δ lies between −∞ and ∞. This assertion is as useless as it is true. Imagine that someone told us, in our running example, that β, the annual return to a year of schooling, was surely between −∞ and ∞. Would we dash to the nearest institution of higher learning and sign up for 20 credits? This illustrates a more general problem. Confidence intervals that include a wide range of possible behaviors don’t give us much guidance as to which particular type of behavior we can expect. Predictions at different points of such confidence intervals would lead us to make very different choices. In consequence, confidence intervals of this sort aren’t very helpful. This is the essential tension in the construction of confidence intervals. All else equal, higher certainty requires that we expand the range of possible values for the parameter. If we expand the range of possible values, we are more confident that we are correct. However, we run a greater risk that values at different points of this range represent very different behaviors. The choice of α always represents a compromise between confidence and usefulness.
In practice, we require confidence of at least 90%. Therefore, we always choose α to be .1 or less. We usually set it at .05, indicating that we wish to be 95% confident that our interval contains the true value of δ.

Usefulness requires that confidence intervals be narrow enough to imply relatively consistent behavior throughout their ranges. Equation (6.8) and figure 6.2 demonstrate that the width of the confidence interval is equal to $2\,t_{\alpha/2}^{(df)}\,\mathrm{SD}(d)$. More confidence requires lower values of α, correspondingly higher values of $t_{\alpha/2}^{(df)}$, and wider confidence intervals. With α fixed, the width of the confidence interval depends only on SD(d). We attempt to ensure that our confidence intervals are useful by relying on data that are rich enough to yield acceptably small estimates for the sample standard deviation.

Estimators will tend to have smaller sample standard deviations when based on data in which the behavior at issue is expressed clearly and consistently. In the example of the effects of education on earnings, clarity would require that both variables be measured with some precision. Consistency would require that education typically increased productivity, and productivity was reliably related to earnings. We might encounter behavior that was neither clear nor consistent in, for example, a poor command economy. There, records might be haphazard, and earnings determined by ideology or access to economic rents.

Estimators will also, of course, ordinarily have smaller sample standard deviations if the number of observations is larger. This is because we're more likely to be able to discern the true relationship between xi and yi, no matter how weak, the more evidence we have. If narrower, more informative confidence intervals are desired, here is the most effective strategy: Get more data.

We can illustrate all of these issues by continuing with the examples of earlier chapters. Let's engage in an exercise we may have already seen if we've had a previous statistics course. We'll construct a 95% confidence interval for the expected value of earnings in the population that gave rise to the sample of table 3.3.

This expected value is the unknown population parameter in which we are interested. We usually represent expected values with the Greek letter µ. This symbol, representing a specific population parameter, will replace the generic parameter δ in equation (6.8). We typically estimate µ, an expected value in a population, with an average from a corresponding sample. We've already labeled average earnings in the sample of table 3.3 as ȳ. Accordingly, this symbol, representing a specific estimator, will replace the generic estimator d in equation (6.8).

In addition, equation (6.8) requires three interrelated quantities, SD(d), df, and $t_{\alpha/2}^{(df)}$. The first is the sample standard deviation of the estimator. We'll
prove, in exercise 6.4b, that the sample standard deviation of our particular estimator, ȳ, can be written as

$$\mathrm{SD}(\bar{y}) = \frac{\mathrm{SD}(y_i)}{\sqrt{n}}. \qquad (6.9)$$
n
According to equation (3.11), the calculation of the sample variance, and therefore SD(yi), uses n − 1 in the denominator. This is therefore the degrees of freedom, or df, that we need in equation (6.8). Finally, we'll follow our convention and set α = .05. Consequently, for this example, equation (6.8) becomes

$$.95 = P\!\left(\bar{y} - t_{.025}^{(n-1)}\,\frac{\mathrm{SD}(y_i)}{\sqrt{n}} < \mu < \bar{y} + t_{.025}^{(n-1)}\,\frac{\mathrm{SD}(y_i)}{\sqrt{n}}\right). \qquad (6.10)$$

Now we need actual quantities for all of the symbols in equation (6.10) with the exception of µ. Three come from what we already know about this sample: Table 3.3 reports that the sample average for earnings, ȳ, is $28,415 and the sample size, n, is 20. The end of section 3.4 gives the standard deviation of earnings in this sample, SD(yi), as $30,507. The fourth quantity in equation (6.10), now that we know what n − 1 must be, comes from table A.2: $t_{.025}^{(19)} = 2.093$.

With these values, the confidence interval for µ is

$$.95 = P\!\left(28,415 - 2.093\,\frac{30,507}{\sqrt{20}} < \mu < 28,415 + 2.093\,\frac{30,507}{\sqrt{20}}\right) = P\big(14,137 < \mu < 42,693\big).$$
In other words, we are 95% confident that the expected value of earnings in the population from which we drew the sample of table 3.3 is somewhere between the lower bound of $14,137 and the upper bound of $42,693. This isn’t very helpful. The lower bound would be, for many families, below the federal poverty level. The upper bound could be consistent with a relatively comfortable lifestyle. Suppose we would like to have more precise knowledge of this expected value. As we’ve just said, there’s really only one responsible way to accomplish this: Augment or replace our sample with more or richer data, in the hopes of reducing SD (y). Let’s replace the sample of 20 from table 3.3 with a sample of 1,000 observations such as those tabulated in the third line of table 5.5. Average earnings in this sample are very close to those in our smaller sample, y = 29,146. In
Confidence Intervals and Hypothesis Tests
203
999 ) ≈ 1.960. Thereaddition, SD(yi) = 42,698 and df = 999. Consequently, t.(025 fore, equation (6.10) becomes
42, 698 42, 698 .95 = P 29,146 − 1.960 < µ < 29,146 + 1.960 1, 000 1, 000
(
)
= P 26, 500 < µ < 31, 792 .
(6.12)
The confidence interval in equation (6.11) is $28,556 wide. That in equation (6.12) is only $5,292 wide. Clearly, the latter is a lot more informative than the former. The difference is mostly because SD(y ) is much smaller for df ) in this equathe larger sample of equation (6.12). The smaller value for t.(025 tion also plays a subsidiary role. As we can see, when we construct these confidence intervals, we assume an intellectual posture that is without prejudice. In other words, we impose no preconceived notions regarding the value of δ. We simply ask the data to instruct us as to where it might lie. We can illustrate this posture with what might be a useful metaphor. Imagine playing horseshoes with an invisible post. Ordinarily, this would be frustrating. But, here, think of the unknown value of the population parameter, δ, as the invisible post. Think of d − tα( df/ 2 )SD(d ) and d + tα( df/ 2 )SD(d ) as the ends of the horseshoe. This horseshoe is different from ordinary horseshoes, because it actually has an affinity for the post. This affinity arises from the relationship between the unknown value of the parameter δ and the known value of d, given in equation (6.1). In consequence, if we throw this particular horseshoe at this particular invisible post, the probability of a ringer is (1 − α)%.
6.4 Hypothesis Tests The arithmetic of hypothesis tests is virtually identical to that of confidence intervals. However, the intellectual posture is completely different. We invoke hypothesis tests when we have hypotheses, or strong prior convictions, regarding what values for δ might be of interest. For our purposes, the source of these convictions will usually be in behavioral intuition, economic theory, or substantively relevant thresholds. When we make hypothesis tests, we address the data not for instruction but for validation. This requires that the test be constructed with a certain amount of artifice. Imagine that we believe something strongly. We want to convince someone else. They say, “Prove it!” What would be an effective response?
204
Chapter 6
A response like “I just believe it” would ordinarily be considered pretty weak. If we were able to cite a couple of facts that were consistent with our belief, that would be stronger. If we were able to point out a large number of empirical observations that couldn’t be explained any other way, that would usually be pretty convincing. The objective here is to be very convincing. Therefore, we begin, essentially, by asserting the opposite of what we really expect. We refer to this assertion as the “null hypothesis,” represented by H0. It specifies a value for the parameter that would be incompatible with our expectations, δ0: H 0 : δ = δ0 .
As will become apparent, we construct the test so as to “stack the deck” in favor of this hypothesis. Then, if we’re still able to conclude that it isn’t plausible, we can reasonably claim that it must be discarded, and our actual expectations adopted.
6.4.1
Two-Tailed Tests Intuitively, a hypothesis test takes the form of a question. Is d, the observable estimate of δ, close enough to δ0 to be taken as validation of the null hypothesis that δ0 is the true value? In order to put this question formally, we identify a fairly wide range of values for the estimator d that we would think of as being sufficiently close to δ0. This range is called the acceptance region. It is an interval defined by a known lower bound and a known upper bound, within which values for d should lie with prespecified probability. As such, it seems natural to derive this region by returning to the inequality in equation (6.7). Now we replace δ with its hypothesized value of δ0 and rearrange the inequality so as to isolate d in the middle term. As in the construction of confidence intervals, we multiply all three terms in this inequality by SD(d ): −tα( df/ 2)SD ( d ) < d − δ0 < tα( df/ 2)SD ( d ) .
However, this time we simply add δ0 to each term, yielding δ0 − tα( df/ 2)SD ( d ) < d < δ0 + tα( df/ 2)SD ( d ) .
(6.13)
This expression defines the acceptance region. Its lower bound is δ0 − tα( df/ 2 )SD(d ), and its upper bound is δ0 + tα( df/ 2 )SD(d ). Both bounds are known: The null hypothesis gives us the value for δ0. The choice of α, the
Confidence Intervals and Hypothesis Tests
205
degrees of freedom in the sample, and the distributional assumptions that we have made about d give us tα( df/ 2 ). Finally, the data give us SD(d ). Values for d between these bounds are taken as sufficiently close to δ0 as to be consistent with, or supportive of, the null hypothesis that δ = δ0. Correspondingly, values of d below δ0 − tα( df/ 2 )SD(d ) and above δ0 + ( df ) tα / 2 SD(d ) are in the rejection region. These values are sufficiently far from δ0 as to be inconsistent with the null hypothesis, even when the “deck is stacked” in its favor. Because the rejection region has two components, one at either extreme of the range of possible values for d, we refer to this type of hypothesis test as a two-tailed test. The hypothesis test itself is formed by reinserting the expression of equation (6.13) into equation (6.7):
(
)
1 − α = P δ0 − tα( df/ 2)SD ( d ) < d < δ0 + tα( df/ 2)SD ( d ) .
(6.14)
This equation states that, if δ = δ0, d, an unbiased estimate of δ, should lie within the acceptance region with probability 1 − α. In order to use this test, we ask if our calculated value of d lies within the known bounds of the probabilistic interval in equation (6.13).6 The decision rule accompanying this test is as follows: If the calculated value of d actually lies within the acceptance region, the data are not inconsistent with the null hypothesis that δ = δ0. We choose to retain δ0 as our belief about δ.7 In this case, we announce that we have “failed to reject the null hypothesis.” If, instead, this value lies in the rejection region, it is clearly inconsistent with the null hypothesis. Accordingly, we choose to abandon it. Figure 6.3 illustrates this test. The null hypothesis specifies that E(d ) = δ0. Under the null hypothesis, the density function for d is therefore centered on δ0. The acceptance region extends tα( df/ 2 )SD(d ) above and below δ0. The rejection region lies beyond in either direction.
Density function for d under H0
) t(df SD(d) α/2
) t(df SD(d ) α/2 ) δ0 − t(df SD(d ) α/2
δ0
) δ0 + t(df SD(d ) α/2
Figure 6.3 Two-tailed hypothesis test at α% significance
Rejection region
Acceptance region
Rejection region
206
Chapter 6
In the context of hypothesis tests, the value of α is called the significance level. As in the case of the confidence interval, the value of α affects the width of the relevant interval. Higher values imply more probability in the rejection regions, lower values for tα( df/ 2 ), and narrower acceptance regions. Therefore, higher values of α make it easier to reject the null hypothesis. For example, suppose α = 1. In this case, the acceptance region becomes the single point δ0.8 If d is equal to anything else, the decision rule states that we should reject the null hypothesis. Obviously, it feels as though this test is fixed to guarantee rejection.9 Returning to the beginning of this section, it would amount to dismissing an opposing position without any consideration of its merits. While we all know people who behave in this way, it’s not good science. Moreover, if this is how we’re going to behave, why bother with the test? We already know what decision we’ll make. So the question of what value we should assign to α is clearly important. It becomes the question of the extent to which we want to predispose the test to rejection or acceptance. To begin an answer, recall that we actually want to make a very strong case that the null hypothesis is wrong. To do so, we have to give it the benefit of the doubt. In other words, we want to be very careful that we don’t reject the null hypothesis when it is really true. Therefore, we are prepared to construe a relatively wide range of values for d as not inconsistent with the hypothesis that δ = δ0. For this reason, we choose values for α that are low. At the same time, we never set α as low as zero. This significance level would imply boundaries for the acceptance region of −∞ and ∞. In other words, we would take any evidence as consistent with the null hypothesis, the proposition that we dispute! This makes even less sense than setting α = 1. Therefore, we always set α > 0. As in the case of confidence intervals, α is never more than .1 and is most often .05. These values ensure that we reject the null hypothesis only when the value of the estimator is so far from the hypothesized value that it would be most unlikely to occur if the null hypothesis were correct. When we reject the null hypothesis, we declare that the hypothesis test is significant at the level of α. This practice, unfortunately, involves some awkward terminology. If we choose a smaller value for α, we give greater benefit of the doubt to the null hypothesis. If we then reject the null hypothesis, it’s more noteworthy. So we often describe rejections at lower values for α, lower significance levels, as being of greater significance. This terminology is almost universal, so there’s no way to fix the ambiguity. We just have to be alert to it. In any case, rejection is a big moment. We have constructed the test so that a wide range of values for d would validate the null hypothesis. Nevertheless, the data contradict it. We can therefore discard it with conviction.
Confidence Intervals and Hypothesis Tests
207
If, instead, we fail to reject the null hypothesis, we say that the test is insignificant. It can also be called a “failure to disprove” and, more derisively, “proving the null hypothesis.” No one becomes famous proving null hypotheses. After all, we’ve constructed our test so that this hypothesis is very hard to disprove. Also, there is the nagging possibility that the reason we have failed to disprove is not because the null hypothesis is actually true but because we just weren’t industrious enough to collect data that were informative enough to reveal its falsity. Better data might have yielded a more precise estimate of δ, in the form of a smaller sample standard deviation for d, and an unambiguous rejection. For these reasons, an insignificant test is usually regarded as a modest accomplishment, if an accomplishment at all. We can illustrate these ideas by returning to the problem of the expected value for earnings. We rewrite equation (6.14) with the symbols that are appropriate for this problem, as we discussed in section 6.3:
( )
( ) .
SD yi SD yi n−1) n−1) .95 = P µ0 − t.(025 < y < µ0 + t.(025 n n
(6.15)
The only new symbol here is µ0. Therefore, the only thing we need to add to the information from section 6.3 is a null hypothesis. Let’s use H 0 : µ0 = 25, 000.
This null hypothesis asserts that the expected value of earnings is $25,000. Section 6.3 reproduced the values for all of the other quantities in the sample of 20 observations presented in table 3.3. If we now replace all of the symbols in equation (6.15) with these values, we obtain 30, 507 30, 507 .95 = P 25, 000 − 2.093 < y < 25, 000 + 2.093 20 20
(
)
= P 10, 722 < y < 39, 278 .
(6.16)
In other words, the acceptance region for this null hypothesis, given this sample, ranges from $10,722 to $39,278. If the average value for earnings in our sample is outside of this range, it would be in the rejection region. We would conclude that the evidence in our sample was inconsistent with, and therefore contradicted, our null hypothesis. As we’ve already seen, however, the sample average, y , is $28,415. This estimator lies within the acceptance region. The appropriate conclusion is that this sample is not inconsistent with the null hypothesis and therefore fails to reject it.
208
Chapter 6
The next step would be to wonder whether we have failed to reject the null hypothesis because it is actually plausible or because our sample isn’t sufficiently informative. After all, the acceptance region in equation (6.16) is $28,556 wide! We can compare these two possibilities by asking the same question of the larger sample of 1,000 observations that we examined in section 6.3: 42, 698 42, 698 .95 = P 25, 000 − 1.960 < y < 25, 000 + 1.960 1, 000 1, 000
(
)
= P 22, 354 < y < 27, 646 .
(6.17)
The acceptance region in equation (6.17) offers a much more discriminating test. Although it is at the same significance level as the test in equation (6.16), it is only $5,292 wide. Therefore, a sample average that failed to reject the null hypothesis would actually have to be somewhat close, in substantive terms, to the hypothesized value of $25,000. In this case, it’s not. The average in this sample is $29,146. It lies in the rejection region. We conclude that the first sample was consistent with the null hypothesis, not because the hypothesis was true but rather because the average for that sample was estimated too imprecisely. It would have been consistent with a wide range of values for the null hypothesis. The average for the second sample is estimated much more precisely. Consequently, it does a much better job of distinguishing between values for µ0 that are plausible and those that are not. Now that we understand the foundations of the two-tailed hypothesis test, we’re in a position to discuss two methods that perform the same test but simplify the execution a bit. The first method begins with equation (6.14), but rewrites it in the form of equation (6.7): d − δ0 1 − α = P −tα( df/ 2) < < tα( df/ 2) . SD ( d )
(6.18)
The standardized value of d under the null hypothesis is d0* =
d − δ0
( )
SD d
.
(6.19)
Equation (6.18), with the substitution of equation (6.19), becomes df d − δ0 df df df 1 − α = P −tα( / 2) < < tα( / 2) = P −tα( / 2) < d0* < tα( / 2) . SD d
( )
(6.20)
Confidence Intervals and Hypothesis Tests
209
The second equality of equation (6.20) demonstrates that the test rejects the null hypothesis if d0* ≤ −tα( / 2) or tα( / 2) ≤ d0* . df
df
These two conditions can be combined as tα( / 2) ≤ d0* . df
(6.21)
In other words, the test can be construed as a comparison between the absolute value of d0* and tα( df/ 2 ). We can calculate the former using d and SD(d ) from our sample, and δ0 from H0. We obtain the latter from table A.2 for the t(df ) random variable. The absolute value of d0*, | d0*|, is called the test statistic. The test rejects the null hypothesis if the test statistic equals or exceeds tα( df/ 2 ) and fails to reject it otherwise. Therefore, tα( df/ 2 ) is the decisive value, or critical value, for this test. Exercise 6.7 demonstrates that d lies in the rejection region of equation (6.14) whenever equation (6.21) is true. Conversely, d lies in the acceptance region of equation (6.14) whenever equation (6.21) is false. In our example of earnings, equation (6.19) becomes µ0* =
y − µ0
( )
SD y
(6.22)
.
Using the values from the first of our two samples of earnings observations, this is µ0* =
y − µ0
( )
SD y
=
28, 415 − 25, 000
= .501.
(6.23)
30.507 / 20
(19) In section 6.3, we identified the critical value for this sample as t.025 = 2.093. This clearly exceeds the test statistic of equation (6.23). Therefore, just as we concluded in our discussion of equation (6.16), we do not reject the null hypothesis with these data. As we might now expect, the second of our two samples of earnings yields a different outcome. Using the values we’ve already established for this sample, equation (6.22) becomes
µ*0 =
y − µ0
( )
SD y
=
29,146 − 25, 000
= 3.071.
(6.24)
42, 698 / 1, 000
The critical value for this sample is tα( df/ 2 ) = 1.960. This is much less than 3.071, the test statistic in equation (6.24). Therefore, as we saw in equation (6.17), this sample decisively rejects the null hypothesis that µ0 = 25,000.
210
Chapter 6
The second simplification of the two-tailed hypothesis test also returns to equation (6.14). The decision rule that we have adopted there implies that we reject the null hypothesis if d differs so greatly from δ0 that, were δ = δ0, the probability of observing d would be α or less. Equation (6.20) tells us that this is equivalent to rejecting the null hypothesis if the probability of observing the value | d0*| for the random variable t (df ) is α or less. This probability is the p-value.10 In other words, another way to evaluate the null hypothesis is to identify this probability, P(t (df ) ≤ −| d0*| or t(df ) ≥ | d0*|), where | d0*| is the test statistic. We first locate the probability of a positive outcome |d0*| units or more above zero directly from the entry for this value in table A.2 for the random variable t (df ).11 This is equal to the probability of a negative outcome | d0*| units or more below zero because of the symmetry of the distribution for this random variable:
(
) (
)
P t ( df ) ≤ − d0* = P t ( df ) ≥ d0* .
Therefore, the probability of observing an outcome for t (df ) at least | d0*| units away from zero in either direction is twice the probability of observing a value at least | d0*| greater than zero, as given in table A.2 for t (df ):
(
) (
) (
)
(
)
P t ( df ) ≤ − d0* or t ( df ) ≥ d0* = P t ( df ) ≤ − d0* + P t ( df ) ≥ d0* = 2 P t ( df ) ≥ d0* .
Therefore, the p-value is
(
)
p -value = 2 P t ( df ) ≥ d0* .
The hypothesis test then becomes a comparison of the p-value—the probability of observing the estimated absolute value or a greater value—to the significance level α—the probability that we have established as our threshold for rejection in either direction from the null hypothesis. If the p-value is larger than α, the probability of observing our value of d0* is greater than our significance level. Therefore, our value of d is in the acceptance region for the hypothesis test and consistent with H0. In other words, we accept the null hypothesis if p-value > α.
(6.25)
Analogously, we reject the null hypothesis if the probability of observing our value of d0* is less than or equal to the threshold probability: p-value ≤ α.
(6.26)
Confidence Intervals and Hypothesis Tests
211
When this is true, d must be in the rejection region for H0. Exercise 6.7 asks us to confirm that the decision rules in equations (6.25) and (6.26) are consistent with those that we originally established for the hypothesis test in equation (6.14) and reiterated for the reformulation of equation (6.21) in terms of critical values. Once again, we can illustrate this version of the two-tailed hypothesis test with the two samples that we have been following since section 6.3. From equation (6.23), the test statistic for the first is .501. This value doesn’t appear in table A.2 in the row for df = 19. However, that row gives us all that we really need to know. It tells us that
(
)
P t (19 ) ≥ .688 = .25.
Figure 6.4 illustrates this situation. The probability in the density function above .688 is .25. The test statistic .501 lies below .688. Therefore, the probability in the density function above .501 must exceed the probability above .688,
(
)
P t (19 ) ≥ .501 > .25.
(6.27)
Similarly, the probability that our test statistic would be less than −.501 must also exceed .25. Together, these implications demonstrate that the p-value associated with the test of the null hypothesis µ0 = 25,000 must exceed .5. This is more than 10 times as large as our significance level, α = .05. In this case, equation (6.25) directs us to accept the null hypothesis. The procedure is similar for the second sample. Equation (6.24) gives the test statistic for this sample as 3.071. Table A.2 doesn’t have a row for df = 999, but this is so large that we can safely use the row for df = ∞. Still, 3.071 doesn’t appear in that row. What does appear is
(
)
P t (∞ ) ≥ 2.576 = .005. Density function
(6.28)
>.25 .25
.025
Figure 6.4 P(t (19) > .501)
0
.501 .688
2.093
212
Chapter 6
This implies that
(
)
P t (∞ ) ≥ 3.071 < .005.
This, in turn, implies that the p-value associated with the test statistic in this sample is less than .01.12 With α still set at .05, equation (6.26) instructs us to reject the null hypothesis. The null hypothesis δ0 = 0 is often of special interest, as will be discussed in section 7.4. In this case, equation (6.7) simplifies to d 1 − α = P −tα( df/ 2) < < tα( df/ 2) . SD (δ )
The test of this hypothesis is especially simple: It consists of a comparison between the ratio of the absolute value of d/SD(d ) and the critical value tα( df/ 2 ). If tα( df/ 2) ≤
d
( )
SD d
= dδ* =0 , 0
then d must lie in the rejection region. Informal applications of this test compare the ratio of the estimator to its estimated standard deviation to two. We previously alluded to this practice in section 1.5. When α = .05, two is a reasonable approximation to Zα/2 or to tα( df/ 2 ) for all but the smallest samples. The quantity | dδ* =0 | = |d|/SD(d ) is often referred to as the t-statistic. This 0 is misleading, because d0* has the t-distribution for all values of δ0, not just for δ0 = 0. However, this designation emphasizes the importance of the null hypothesis that δ0 = 0 in statistical practice, and in regression analysis in particular. If tα( df/ 2 ) < | dδ* =0 |, we often summarize this result by stating that “the t-statistic 0 is significant at the α% level.” If it’s not, we often describe this as a circumstance in which “we cannot statistically distinguish the estimate from zero.” Sometimes we’re more dismissive and leave out the qualifier “statistically.” 6.4.2
One-Tailed Tests When we reject the null hypothesis, regardless of its value, we have to confront the question of what we will choose to believe instead. What is our alternative hypothesis, or H1? Though we haven’t discussed it yet, part of the answer to this question is implicit in the construction of the rejection region. In the hypothesis test of equation (6.14), we reject the null hypothesis if the observed value of d
Confidence Intervals and Hypothesis Tests
213
is either too low or too high. In other words, when we reject, we don’t care whether values of d are above or below δ0, so long as they are far enough away from it. Implicitly, our alternative hypothesis is “anything but δ0,” or H1 : δ ≠ δ 0 . This is a somewhat negative posture. It’s also largely nonconstructive. How comfortable can we be with predictions, decisions, or policies based wholly on the belief that δ isn’t δ0? Often, we find ourselves with more instructive instincts with regard to the alternative to δ0. If nothing else, we’re often in the position to suspect that if δ isn’t δ0, it is likely to lie on one side of δ0. How should this belief be reflected in our rejection region? For example, imagine that we believe that δ must be greater than δ0 if it isn’t equal to it. Our alternative hypothesis is H1 : δ > δ0. It should be obvious that it no longer makes sense to reject the null hypothesis of δ = δ0 if d is well below δ0. No matter how ridiculously far below, any value of d less than δ0 is going to be more consistent with the null hypothesis that δ = δ0 than with the alternative hypothesis that δ > δ0. So all values of d below δ0 should be included in the acceptance region. What about values of d above δ0? Here, we have to be more careful. What should we do about the highest values of d included in the acceptance region of equation (6.14)? These are the kinds of values that might naturally occur if the alternative hypothesis were true. We would therefore run a serious risk if we took them as not inconsistent with the null hypothesis. This means that, just as we have extended the acceptance region in the direction away from the alternative hypothesis, we need to shrink it in the other direction. In other words, rather than splitting the probability indicated by the significance level α between an upper and a lower section of the rejection region, we should consolidate that probability in a single rejection region located in the direction where the null and alternative hypotheses are going to conflict. A hypothesis test with a unitary rejection region is called a onetailed test. Formally, when the alternative hypothesis is δ > δ0, we define the rejection region as consisting solely of values for d that are so high as to indicate that the alternative is preferable to the null. This is sometimes referred to as an upper-tailed test. The range of outcomes in this single region must therefore occur with probability α. The corresponding modification to equation (6.7) is d −δ 1− α = P < tα( df ) . SD (δ )
(6.29)
The lower bound for the inequality in the parentheses of equation (6.29) is implicit and equal to −∞. The upper bound, tα( df ), is a value from table A.2
214
Chapter 6
for the t-distribution with df degrees of freedom. This value is the point at which α% of the probability in this distribution is above and (1 − α)% below. The transformation of equation (6.29) that is analogous to that yielding equation (6.14) is13
(
)
1 − α = P d < δ0 + tα( df )SD ( d ) .
(6.30)
Again, the lower bound is implicitly −∞. This hypothesis test directs us to accept the null hypothesis of δ = δ0 if we observe a value for d that is less than δ0 + tα( df ) SD(d ). In other words, the acceptance region for this null hypothesis is between −∞ and δ0 + tα( df ) SD(d ). The test directs us to reject the null hypothesis in favor of the alternative that δ > δ0 if we observe a value for d in the rejection region. This region is at or above δ0 + tα( df )SD(d ).14 In other words, the test accepts the null hypothesis automatically if d < δ0. In addition, it accepts the null hypothesis if d lies above δ0, but not by too much. How much? As long as d is no more than tα( df )SD(d ) above δ0. Anything above that rules in favor of H1. Figure 6.5 illustrates the acceptance and rejection regions for this upper-tailed test. Equivalently, we can reject the hypothesis if d* equals or exceeds the critical value of tα( df ) and accept if it does not. Or, we can calculate P(t (df ) ≥ d*) from table A.2. In the context of a one-tailed test, there is no need to consider the possibility of an extreme value for t (df ) on the side of its distribution away from the alternative hypothesis. Therefore, this probability alone is the p-value for a one-tailed test:
(
)
p -value for one-tailed test = P t ( df ) ≥ d * .
As with the two-tailed test, d is in the rejection region and the null hypothesis is rejected if p-value for one-tailed test ≤ α.
Density function for d under H0
Figure 6.5 Upper-tailed hypothesis test at α% significance
δ0 Acceptance region
δ0 + tα(df )SD(d ) Rejection region
Confidence Intervals and Hypothesis Tests
215
The estimator is in the acceptance region and we accept the null hypothesis if p-value for one-tailed test > α.
Once again, we can illustrate the points of this section with our continuing analysis of earnings. Let’s adopt this alternative hypothesis: H1 : µ1 > 25, 000.
(6.31)
This implies that we need an upper-tailed hypothesis test. Equation (6.30), rewritten with the appropriate notation, is
(
)
.95 = P y < µ0 + tα( df )SD ( y ) .
(6.32)
This is the general form of the test that we need. For our first sample, table A.2 reports that t.(0519 ) = 1.729. Accordingly, equation (6.32) becomes 30, 507 .95 = P y < 25, 000 + 1.729 = P ( y < 36, 794 ) . 20
(6.33)
There are two interesting things to note about this test. First, average earnings in this sample, $28,415, are in the acceptance region and therefore consistent with the null hypothesis, even in this upper-tailed test. Second, figure 6.6 compares the acceptance and rejection regions for the one-tailed test in equation (6.33) to those in the two-tailed test, using the same sample, in equation (6.16). As we discussed previously, the lower end of the acceptance region in the upper-tailed test includes all values below the null hypothesis. This means that it contains all of the area below $10,722, the area that was in the lower of the two rejection regions for the two-tailed hypothesis test of equation (6.16). The probability associated with this area was 2.5%. The rejection region for the upper-tailed hypothesis test of equation (6.33) contains all of the area above $39,278, the area in the upper rejection region for the two-tailed hypothesis test of equation (6.16). However, the probability associated with this area is only 2.5%. The upper-tailed rejection region for a test at 5% significance must have probability equal to 5%. Therefore, it also incorporates the area between $36,794 and $39,278, which contributes the necessary additional probability of 2.5%. The other two methods of evaluating this hypothesis test, naturally, yield the same result. According to equation (6.23), the test statistic is µ0* = .501 for this sample. We have just established that the critical value is t.(0519 ) = 1.729. Therefore, the critical value exceeds the test statistic. We do not reject the null hypothesis.
[Figure 6.6: Two-tailed and one-tailed hypothesis tests with 20 observations. The density function for ȳ under H0 is centered at 25,000. The two-tailed acceptance region runs from 10,722 to 39,278, with probability .025 in each tail beyond it; the one-tailed acceptance region runs up to 36,794, with probability .025 between 36,794 and 39,278 and .025 above 39,278.]
Similarly, in equation (6.27), we found that

P(t(19) ≥ .501) > .25.
As we said earlier, this probability is equal to the p-value for a one-tailed test. It still exceeds our chosen significance level, α = .05. Yet again, we conclude that we cannot reject the null hypothesis.15

6.4.3 Type I and Type II Errors

Now we have a thorough understanding of how to construct a hypothesis test and what decisions to make. We next have to consider how this whole procedure could go wrong. There are two ways.

First, imagine that the null hypothesis is correct. In this case, given the way that we have designed the test, the probability of getting an observed value for d in the rejection region is exactly α. As we said previously, we deliberately set α to be low in order to give the benefit of the doubt to the null hypothesis. Nevertheless, if we test 100 true null hypotheses, we will typically reject 100α of them. For example, if α = .05, we would ordinarily expect to reject 5 out of every 100 true null hypotheses. In these cases of mistaken rejection, we have simply observed a value for d that is relatively rare. We have mistaken a low-probability event, generated by the null hypothesis, for an indication that the null hypothesis is false and the alternative hypothesis is true.

For example, imagine that we observed a series of coin tosses that yielded 10 consecutive heads. Would we conclude that we have been extremely lucky or that we are dealing with a crooked coin? Our procedure for hypothesis testing directs us to accept the alternative hypothesis, that the coin is crooked.16 However, we must have some concern that the null hypothesis—that the coin
is fair—is true and that we have simply been fortunate to observe something that almost never occurs. This kind of mistake, rejecting the null hypothesis when it is true, is called a Type I error. This is the kind of mistake that worries us most. Accordingly, we design our statistical test so as to ensure that the probability that it will occur is acceptably low. As before, we set that probability at the significance level α. We also refer to α as the size of the hypothesis test.

The other mistake that we can make is accepting the null hypothesis when it is false. How does this occur? We can only accept the null hypothesis if the observed value of d lies in the acceptance region. So this type of mistake, a Type II error, occurs when the alternative hypothesis is true, but happens to yield a value for the estimator that appears to validate the null hypothesis. In the example of the coin, a Type II error would occur if the coin were truly crooked, but we concluded that it was fair. This could happen if, for example, the coin yielded only 7 heads in 10 tosses. While this is a lot of heads for a fair coin, it's not that far from the 5 that we would ordinarily expect. And remember, we're giving the null hypothesis of a fair coin the benefit of the doubt. Accordingly, we would probably not reject it on the basis of this evidence.

Table 6.1 provides a convenient summary of the actions we might take and the mistakes we might make in the process of testing a statistical hypothesis.

TABLE 6.1  Hypothesis tests: Actions and consequences

Action                                    In fact               Consequence
If we accept the null hypothesis . . .    and it's true . . .   we've done the right thing
                                          and it's false . . .  Type II error
If we reject the null hypothesis . . .    and it's true . . .   Type I error
                                          and it's false . . .  we've done the right thing

Unfortunately, the names that are universally given to these two mistakes, Type I and Type II errors, don't help us remember what's actually at issue. They are not especially descriptive, and they are very easy to confuse. An analogy should help us to remember the distinctions between them. Consider the criminal justice system in the United States. In general, an individual accused of criminal activity is considered "innocent until proven guilty." In other words, the null hypothesis in this context is that of innocence. A Type I error occurs when the null hypothesis of innocence is true, but rejected. In other words, a Type I error is the statistical equivalent of convicting an innocent person. This kind of mistake, when it is discovered, is usually very upsetting. Therefore, we limit the probability that it will occur by requiring that evidence
demonstrate guilt “beyond a reasonable doubt.” This is the legal equivalent of our practice of setting a low significance level, choosing a low value for α, in a hypothesis test. However, the cost of this stringent evidentiary standard is that sometimes the evidence is not sufficient to support a conviction when it would be appropriate. In this case, a Type II error occurs: The null hypothesis of innocence is accepted even though the individual is guilty. In other words, a Type II error is the statistical equivalent of acquitting a guilty person. We obviously care about Type II errors as well as Type I errors, but not as much. That’s why we design our tests specifically to control the probability of the latter, without regard to the probability of the former, at least initially. But that leaves the obvious question, what is the probability that a Type II error will occur? Based on what we have said so far about H1, we can’t tell. If the alternative hypothesis is true, but simply states that δ > δ0, it’s possible that the true value for δ is only slightly greater than δ0. In this case, it would be easy for a value of d to occur that was in the acceptance region, and the probability of a Type II error would be very high.17 Under this alternative hypothesis, however, it is also possible that the true value for δ is much greater than δ0. In this case, it would be very unlikely that d would nevertheless be small enough to appear within the acceptance region. The probability of a Type II error here could be small and even negligible. What this discussion reveals is that, in order to identify a specific probability of a Type II error, we have to specify precisely the value of δ under the alternative hypothesis. We have to be prepared to assert H1 : δ = δ1 for some explicit value δ1. Before we attempt this, we need to be sure we understand what’s at stake. What’s not at stake is the hypothesis test itself. The only bearing that the alternative hypothesis has on this test is to indicate whether it should be two-tailed, as in equation (6.14), or one-tailed, as in equation (6.30). Once that’s decided, neither of these equations depends on the value for δ1. Therefore, the choice of a specific value for δ1 has no effect on the calculation or outcome of the hypothesis test. It affects only the probability of incorrectly concluding that the null hypothesis shouldn’t be rejected. Any specific alternative hypothesis can be either above or below the value for the null hypothesis, but not both. In other words, either δ1 < δ0 or δ1 > δ0. In either case, the appropriate test is one-tailed. For example, δ1 > δ0 implies an upper-tailed test. Equation (6.30) gives the acceptance region for this test as d < δ0 + tα( df )SD(d ). If the alternative hypothesis is true, the probability that d will nevertheless lie in the acceptance region for the null hypothesis is18
P(Type II error) = P(d ≤ δ0 + tα(df)SD(d) | δ = δ1).    (6.34)
[Figure 6.7: Probability of a Type II error for an upper-tailed hypothesis test. The density function for d under H1 is centered at δ1. The acceptance region lies below δ0 + tα(df)SD(d) and the rejection region above it; the area under this density to the left of that boundary is P(Type II error).]
Figure 6.7 illustrates this probability. If the alternative hypothesis were true, δ = δ1. In this case, the correct standardization for d would be

d1* = (d − δ1)/SD(d) ~ t(df).    (6.35)
If we subtract δ1 from both sides of the inequality in equation (6.34) and divide both sides by SD(d ), we obtain
P(Type II error) = P((d − δ1)/SD(d) ≤ (δ0 + tα(df)SD(d) − δ1)/SD(d))
                 = P(t(df) ≤ (δ0 + tα(df)SD(d) − δ1)/SD(d)).    (6.36)
The inequality in parentheses to the right of the first equality sign in equation (6.36) contains two terms. The quantity to the left of the inequality sign is simply d1*, a t random variable, as given in equation (6.35). The quantity to the right of the inequality looks pretty fearsome, but it's just a number. The value of δ0 is given by the null hypothesis H0, that of δ1 is given by the alternative hypothesis H1, tα(df) comes from table A.2 for the t-distribution, and SD(d) is given by our data. Therefore, the probability of a Type II error is simply the probability that a t random variable will generate a value less than the known quantity

(δ0 + tα(df)SD(d) − δ1)/SD(d).
We locate this quantity in the margins of the table for the t random variable, table A.2. The probability that we’re looking for is associated with this quantity in the body of the table.19
There are only two actions we can take when the null hypothesis is false. We can either accept it, incorrectly, or reject it. The probability of the first action is given by equation (6.36). The probability of the second action is the power of the hypothesis test:

power = 1 − P(Type II error).    (6.37)
The power of the test is the probability that the test will correctly reject the null hypothesis when it is false. For the last time, we illustrate these issues with our sample of 20 observations on earnings. First, we rewrite equation (6.36) in the appropriate notation:

P(Type II error) = P(t(df) ≤ (µ0 + tα(df)SD(ȳ) − µ1)/SD(ȳ)).    (6.38)
We have values for all of the quantities to the right of the inequality with the exception of µ1. Let's now assert an alternative hypothesis: The expected earnings in the population are $35,000, or H1 : µ1 = 35,000.

Now we can take the expression to the right of the inequality in equation (6.38) and replace all of the symbols with actual values:

P(Type II error) = P(t(19) ≤ (25,000 + 1.729(30,507/√20) − 35,000)/(30,507/√20)) = P(t(19) ≤ .263).
This probability doesn’t appear explicitly in table A.2. However, this table tells us that
P(t(19) ≤ .688) = .750.
In addition, we know that
P(t(19) ≤ 0) = .500,
because zero is the expected value of this random variable and its density function is symmetric around zero. Finally, we know that
P(t(19) ≤ .688) > P(t(19) ≤ .263) > P(t(19) ≤ 0).    (6.39)
Equation (6.39) implies that
.750 > P(t(19) ≤ .263) > .500.
In other words, the probability of a Type II error in our sample of 20 observations, when the null hypothesis specifies µ0 = 25,000 and the alternative hypothesis specifies µ1 = 35,000, is between 50% and 75%. Correspondingly, according to equation (6.37), the power of this test is no greater than 50% and could be as low as 25%.

It should be evident that a hypothesis test is more convincing, the lower is its significance level, or size, and the higher is its power. A lower significance level means that we are less likely to reject the null hypothesis when it's true. Higher power means that we are less likely to accept the null hypothesis when it's false. In other words, tests with lower significance and higher power are less likely to make either of the possible mistakes.

The unfortunate truth is that, if we don't change our null hypothesis, our alternative hypothesis, or our data, reducing the significance level also reduces the power. Algebraically, reducing the significance level means reducing α. Reducing α means increasing tα. Increasing tα means moving the boundary of the acceptance region for H0 closer to the value for δ specified by the alternative hypothesis H1. This increases the probability that, if δ1 were the correct expected value for d, we could obtain an actual value for d that we would take as consistent with δ0.

Figure 6.8 demonstrates this dilemma. The original significance level is α0. The associated acceptance region is below δ0 + tα0(df)SD(d). The new, lower significance level is α1, with α1 < α0. The new acceptance region is below δ0 + tα1(df)SD(d), with δ0 + tα1(df)SD(d) > δ0 + tα0(df)SD(d). The upper boundary of the new acceptance region is closer to δ1 than was that of the original acceptance region. Consequently, the probability of a Type II error is greater.

Intuitively, the problem is as follows. In order to reduce the probability of rejecting the null hypothesis when it's correct, we have to accept a wider range of values for d as being consistent with our hypothesized value δ0. As we expand this range, we inevitably include more of the values of d that could be expected if the true value of δ were actually δ1. Therefore, we run an increasing risk of mistaking data that were actually generated by δ1 as validating δ0.20

Is there any way, then, to improve the test? It's evident from figure 6.7 that we could increase the power of our test simply by increasing δ1. This wouldn't affect the acceptance region, and therefore the significance level, at all. However, it would clearly shift the density function of d, given δ = δ1, to the right in that figure, reducing the area in the tail to the left of the upper bound of the acceptance region.
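As an aside, table A.2 can only bracket the probability P(t(19) ≤ .263) between .50 and .75; statistical software pins it down exactly. The following minimal sketch, again assuming SciPy is available, evaluates equation (6.38) for this example and yields a Type II error probability of roughly .60, and so power of roughly .40, consistent with the bounds just derived:

```python
# Exact evaluation of equation (6.38): the probability of a Type II error
# when H0: mu = 25,000, H1: mu = 35,000, n = 20, s = 30,507, alpha = .05.
from math import sqrt
from scipy import stats

n, alpha = 20, 0.05
mu_0, mu_1, s = 25_000, 35_000, 30_507
sd_ybar = s / sqrt(n)                                   # estimated SD of the sample average

t_crit = stats.t.ppf(1 - alpha, df=n - 1)               # about 1.729
threshold = (mu_0 + t_crit * sd_ybar - mu_1) / sd_ybar  # about .263
type_ii = stats.t.cdf(threshold, df=n - 1)              # roughly .60
power = 1 - type_ii                                     # roughly .40
print(round(threshold, 3), round(type_ii, 2), round(power, 2))
```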
[Figure 6.8: Effects of lower significance levels on the probability of a Type II error. The density function for d under H1 is centered at δ1. Lowering the significance level from α0 to α1 moves the acceptance boundary from δ0 + tα0(df)SD(d) out to δ0 + tα1(df)SD(d), so P(Type II error | α1) exceeds P(Type II error | α0) and the power of the test falls.]
To see this formally, rewrite equation (6.36) as

P(Type II error) = P((d − δ1)/SD(d) ≤ −(δ1 − δ0)/SD(d) + tα(df)).    (6.40)
As we have already said, the first term in the parentheses after the equality is simply a t random variable. It could be positive or negative with equal probability. The second term in the parentheses after the equality of equation (6.40) is equal to the corresponding term in equation (6.36) but has been rearranged. Equation (6.40) makes clear that the probability of a Type II error appears to depend on the algebraic difference between the values specified by the alternative and null hypotheses, δ1 − δ0. This difference is positive, as we have illustrated in figure 6.7. Therefore, increasing δ1 increases this difference, making the second term either less positive or more negative. Either way, the probability defined by equation (6.40) goes down. Superficially, increasing the algebraic distance between δ1 and δ0 seems like a painless way to increase the power of the test without affecting its size. The problem is that we’re only interested in the alternative hypothesis if there is some compelling economic or behavioral reason why the particular value δ1 is of interest. If we care so little about its value that we’re willing to change it simply to effect a mechanical increase in the power of the test, why should anyone else care about it? In other words, this strategy puts us in the
position of saying that "our test has high power with respect to an alternative hypothesis that no one has any reason to care about." Go ahead, say it. How does it feel?21

Upon reflection, then, increasing the algebraic distance between δ1 and δ0 by arbitrarily redefining H1, or, for that matter, H0, is a bad idea. However, there is a good idea lurking quite close by. Instead of increasing the algebraic distance between H0 and H1 by arbitrarily changing δ0 or δ1, we could try to increase the statistical distance between our chosen values of δ0 and δ1. What is the statistical distance between the two constants δ0 and δ1? Rewrite the second term in the parentheses of equation (6.40) as

−(δ1 − δ0)/SD(d) + tα(df).    (6.41)
In this form, the term has two components. The second component, tα( df ) , depends only on α. As we can see from figure 6.7 or equation (6.29), reducing α increases tα( df ) and expands the acceptance region. As we can see from figure 6.7 or equation (6.34), this increases the probability of a Type II error, which is not attractive. At the same time, we’re certainly not going to increase the significance level because we’ve chosen it to ensure an acceptable probability of a Type I error. So we’re going to leave this component alone. The numerator of the first component is the negative of what we have called the algebraic distance between the values for the two hypotheses. However, we notice here that this distance itself is not what’s important. Instead, it’s this distance divided by the sample standard deviation of the estimator d. In other words, what matters is how many sample standard deviations of our estimator fit within the algebraic difference between δ1 and δ0. This quantity is the statistical distance between the two. It’s measured in units of the sample standard deviation of d. SD(d ) is the metric for the statistical distance. Holding constant the algebraic difference between δ1 and δ0, we could increase the statistical difference between the two by reducing the sample standard deviation of our estimator. How can we do this? For the estimators we consider in this book, additional data will generally do the trick. We’ve already demonstrated this with regard to the regression slope and intercept, b and a, in section 5.10. Figure 6.9 illustrates how reductions in the sample standard deviation of d affect hypothesis tests. SD(d ) represents the original sample standard deviation of d. SD(d )* represents the reduced sample standard deviation of d, after obtaining additional data. The first consequence of this reduction is to move the boundary of the acceptance region to the left, algebraically closer to δ0. Notice that we have
[Figure 6.9: Reduced SD(d) and the probability of a Type II error. Density functions for d under H1 are shown with the original SD(d) and with the smaller SD(d)*. The acceptance boundary moves down from δ0 + tα(df)SD(d) to δ0 + tα(df)SD(d)*, and the density with SD(d)* is narrower, so P(Type II error) with SD(d)* is smaller than P(Type II error) with SD(d).]
not changed α, the significance level of the test. This effect is primarily because the value of SD(d ) is reduced to SD(d )*. There is also a secondary effect because more data increase df. This reduces the value of tα( df ) perhaps substantially in samples that are initially small, though infinitesimally in those that are already large. This shift also moves the boundary of the acceptance region farther, algebraically, from δ1. By itself, this would reduce the probability of a Type II error to the area under the original density function for d to the left of the new boundary. However, the reduction in the sample standard deviation of d also changes its density function. The new density function, based on the smaller SD(d )*, is narrower and taller than the original, based on the larger SD(d ). This narrows the range of values that would be likely to arise were the alternative hypothesis to be true. As can be seen in figure 6.9, this change alone would reduce the area under the density function to the left of the original boundary for the acceptance region and reduce the probability of a Type II error.
In sum, additional data usually reduce the sample standard deviation of the estimator. This has two very positive consequences. First, it increases the algebraic distance between the boundary of the acceptance region and the value for the alternative hypothesis. Second, it increases the statistical distance between the two, as measured in units of the sample standard deviation, by even more. Together, the two changes can increase the power of the test dramatically, without reducing its significance.22 Of course, we could also use additional data to reduce the significance level, holding constant the power of the test. Or we could choose some combination of reduced significance level and increased power. Any of these three options would constitute an improvement in the properties of the test and in our confidence in the results.23 In terms of our earlier legal analogy, “more data” are equivalent to evidence of higher quality. The recent increase in the use of DNA tests for the purposes of identification is an example of improved evidence. Reportedly, it has substantially reduced the probability of mischaracterizing someone who is innocent, Type I error. It has also reduced the probability that a true perpetrator will not be identified, Type II error. It has therefore both reduced the size and increased the power of the legal hypothesis test. The admirable consequence is a more just legal system.
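To see the effect of additional data concretely, we can redo the Type II error calculation of equation (6.38) at larger sample sizes. The sketch below, assuming SciPy is available and, purely for illustration, holding the estimated standard deviation fixed at 30,507 (a larger sample would of course produce its own estimate), shows the power of the test climbing toward one as n grows:

```python
# Power of the upper-tailed test of H0: mu = 25,000 vs. H1: mu = 35,000
# at several sample sizes, holding s = 30,507 and alpha = .05 fixed.
from math import sqrt
from scipy import stats

mu_0, mu_1, s, alpha = 25_000, 35_000, 30_507, 0.05

for n in (20, 50, 100, 200):
    sd_ybar = s / sqrt(n)
    t_crit = stats.t.ppf(1 - alpha, df=n - 1)
    threshold = (mu_0 + t_crit * sd_ybar - mu_1) / sd_ybar
    power = 1 - stats.t.cdf(threshold, df=n - 1)
    print(n, round(power, 2))   # power rises toward one as n grows
```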
6.5 The Relationship between Confidence Intervals and Hypothesis Tests

As should be evident, confidence intervals and hypothesis tests are intimately related. Perhaps too intimately related. They require essentially the same arithmetic, which often makes it difficult to remember which is which. The critical difference between the two is really a matter of presentation and dramatics, as described in table 6.2. The confidence interval often seems like the more natural approach. It is data driven; that is, it takes the posture that the data should be explored for any information they may contain regarding the value of δ, without any preconceived notions. In addition, the confidence interval contains all of the information that would be available in all of the possible two-tailed hypothesis tests. To be more precise, if any value contained within the (1 − α)% confidence interval were to be adopted as the null hypothesis, it would not be rejected at the α% significance level in a two-tailed hypothesis test. If any value outside of the (1 − α)% confidence interval were to be adopted as the null hypothesis, it would be rejected at the α% significance level in such a test.
TABLE 6.2  Contrasts between confidence intervals and hypothesis tests

                         Quantity
Interval estimate        Known                    Unknown
Confidence interval      Estimator                Population parameter
Hypothesis test          Population parameter     Estimator
To demonstrate, suppose we were to adopt the lower bound on the confidence interval in equation (6.8) as our null hypothesis:

H0 : δ0 = d − tα/2(df)SD(d).

Inserting this value for the null hypothesis into the two-tailed hypothesis test of equation (6.14) yields

1 − α = P(d − 2tα/2(df)SD(d) ≤ d ≤ d).    (6.42)
The inequality here is obviously true. Just as obviously, however, any smaller value for H0 would yield an upper bound to the acceptance region that was less than d. Therefore, the evidence in the sample at hand would reject any null hypothesis that specified a lower value for δ0. Exercise 6.17 invites us to demonstrate that, similarly, any specified value for δ0 in excess of d + tα/2(df)SD(d) would also be rejected by the sample estimate of d.

Why, then, perform hypothesis tests? Again, the answer has to do with effective argumentation rather than statistical validity. In many circumstances, there is a clear alternative to the proposition that we want to refute. In these circumstances, it is often more constructive to compare the alternative directly rather than to identify the set of all refutable propositions and then note that the proposition at issue is a member. In other words, a hypothesis test is probably going to make this point more compellingly than a confidence interval.

For example, suppose someone challenges us about our enrollment in college: "Education is a waste of time!" Presumably, what's meant is the null hypothesis that the return to education, β, is low. We could respond, as we will in section 7.3, with a confidence interval: "What do you mean? The annual return to a year of schooling is between $2,000 and $2,500 with 95% confidence!" Or we could respond, as we will in section 7.4, with, "Oh yeah? Well I tested your stupid idea, gave it all the benefit of the doubt, and it is REJECTED!" Both statements would be true. Given the context, which would be more effective?
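This duality is easy to verify numerically. The sketch below, assuming SciPy is available, builds the 95% confidence interval from the 20-observation earnings figures (ȳ = 28,415, s = 30,507) and confirms that values inside it are not rejected by the two-tailed test at the 5% level, while values just outside it are:

```python
# Numerical check of the duality between the 95% confidence interval and
# two-tailed tests at the 5% level, for the 20-observation earnings sample.
from math import sqrt
from scipy import stats

n, alpha = 20, 0.05
y_bar, s = 28_415, 30_507
sd_ybar = s / sqrt(n)
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)      # about 2.093

ci = (y_bar - t_crit * sd_ybar, y_bar + t_crit * sd_ybar)   # roughly (14,137, 42,693)

def rejected(mu_0):
    """Two-tailed test of H0: mu = mu_0 at the 5% significance level."""
    return abs((y_bar - mu_0) / sd_ybar) >= t_crit

print(ci)
print(rejected(25_000))      # False: 25,000 lies inside the interval
print(rejected(ci[0] - 1))   # True: just below the lower bound
print(rejected(ci[1] + 1))   # True: just above the upper bound
```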
EXERCISES

6.1 The derivation of equation (6.4) requires the results that V(δ) = 0 and COV(d, δ) = 0. Let's prove them here.
(a) Begin with equation (5.15), which demonstrates that the expected value of a constant is itself. Based on this result, what is E(δ)?
(b) Replace this value in equation (5.4) to demonstrate that V(δ) = 0.
(c) Rewrite COV(d, δ) as in equation (5.8). Replace E(δ) and demonstrate that COV(d, δ) = 0.

6.2 Demonstrate by reference to table A.2 that the confidence interval in equation (6.8) becomes unbounded as α approaches zero.

6.3 Section 6.3 gives the width of a confidence interval as 2tα/2(df)SD(d). If α = .05, why might it be reasonable to approximate this width as 4 SD(d)? When would this approximation be dangerous?

6.4 Let's verify the statistical assertions that we made in the course of constructing the confidence interval in equation (6.10).
(a) Prove that ȳ, in the example of confidence intervals of section 6.3, is an unbiased estimator of µ, the expected value of earnings in the population. Begin with the definition of the average in equation (2.8). Take the expected value of both sides of the equation. Simplify the expected value to the right of the equality using equations (5.13) and (5.34). Finally, substitute E(yi) = µ.
(b) Prove equation (6.9). Begin with the definition of the average in equation (2.8). Take the population variance of both sides of the equation. Simplify the population variance to the right of the equality using equations (5.43) and (5.45).

6.5 Demonstrate that the acceptance region in equation (6.14) degenerates to a point as α increases toward one by adopting progressively higher values for α and constructing the associated hypothesis tests.

6.6 Consider the confidence interval in section 6.3 and the two-tailed hypothesis test of section 6.4.1.
(a) What is the width of the confidence interval in equation (6.11)? What is the width of the acceptance region in equation (6.16)? How do they compare? Why?
(b) More generally, what is the width of the confidence interval in equation (6.8)? What is the width of the acceptance region in equation (6.14)? How do they compare? Why?

6.7 Consider the three methods for evaluating two-tailed hypothesis tests.
(a) Demonstrate that if equation (6.21) is true, tα/2(df) ≤ |d0*|,
then either d < δ0 − tα/2(df)SD(d) or δ0 + tα/2(df)SD(d) < d in equation (6.14). Conversely, demonstrate that if equation (6.21) is false, δ0 − tα/2(df)SD(d) < d < δ0 + tα/2(df)SD(d).
(b) Demonstrate that if equation (6.26) is true, equation (6.21) must also be true. Conversely, demonstrate that if equation (6.25) is true, equation (6.21) is false.

6.8 Consider the calculation of the p-value for the test of the null hypothesis in the sample of 1,000 observations on earnings in section 6.4.1.
(a) Alter figure 6.4 to represent the analogous situation with respect to this test.
(b) Prove explicitly that, as a consequence of equation (6.28), the p-value for this test must be less than .01.

6.9 Derive equation (6.30) from equation (6.29).

6.10 Construct the one-tailed hypothesis test when the alternative hypothesis specifies a value below that specified by the null hypothesis, H1 : δ < δ0.

6.11 Test the null hypothesis H0 : µ0 = 25,000 against the alternative hypothesis H1 : µ1 > 25,000 using the information for the sample of 1,000 observations presented in sections 6.3 and 6.4.1.
(a) Reformulate equation (6.32) for this sample, identify the acceptance region, and determine if the sample average lies within or outside of it.
(b) What is the test statistic for this sample? What is the critical value? How do they compare? What should we conclude?
(c) What is the p-value for this sample? How does it compare to the significance level that we have chosen? Again, what should we conclude?
(d) Respecify the alternative hypothesis as H1 : µ1 = 35,000. Calculate the probability of a Type II error and the power of the hypothesis test. Compare it to the probability of a Type II error in the sample of 20 earnings observations.

6.12 Section 6.4.3 states that more data will reduce SD(d). It also echoes note 17 in chapter 5 to the effect that data embodying richer contrasts in the explanatory variable will also reduce SD(d). Why is this so? Consider the consequences of more data, as given in section 6.4.3. Which of them also apply to data containing richer contrasts? Which do not?

6.13 Derive the formulas for the probability of a Type II error and the power of a test for a lower-tailed test.

6.14 Consider the upper-tailed hypothesis test of equation (6.30) and figure 6.5. If we can change only α, what must we do to it in order to increase the power of our hypothesis test? What are the other consequences of this change? Are they good or bad? Why? Are they more or less important than the increase in the power of the test?
6.15 Consider the upper-tailed hypothesis test of equation (6.29) and figure 6.7. Imagine that we choose a new null hypothesis, H0 : δ = δ0*, where δ0* < δ0, and retain the same values for α and δ1. What is the consequence for the power of our test? Why is this a bad idea?

6.16 Consider the upper-tailed hypothesis test of equation (6.29) and figure 6.7. Imagine that our sample size increases from n to n*.
(a) Derive the formula for the power of our test if we maintain the original significance level, α. How does it compare to the power when the sample size is just n?
(b) Derive the formula for the significance level that our test can attain if we maintain the original power. How does it compare to the original significance level, α?
(c) Construct the test with the original sample size, n, so as to equate the probability of Type I and Type II errors. How does this test change when the sample size increases?

6.17 Recall equation (6.42) and its development. Demonstrate that any specified value for δ0 in excess of d + tα/2(df)SD(d), the upper bound for the confidence interval in equation (6.8), would be rejected by the two-tailed hypothesis test of equation (6.14).

6.18 Recalculate the hypothesis tests of equations (6.16), (6.17), and (6.33) at the 10% significance level. Do any of the results change? If yes, explain the change.
NOTES

1. It's good to be aware of the notational ambiguity. Here, we use "V" to represent the population and "SD" to refer to the sample.

2. How big does df have to be? Most tables for the t-distribution present approximately 40 values for df: all integer values from 1 to 30, several values up to, perhaps, 120, and then df = ∞. If we compare tables A.1 and A.2, we can verify that the t-distribution is identical to the standard normal distribution when df = ∞. Moreover, if we examine table A.2, we'll see that the critical values for the t-distribution with df > 30 begin to look very similar to those for the t-distribution with df = ∞. Therefore, the normal distribution becomes an acceptable approximation for the t-distribution somewhere in the range of df > 30.

3. It is really unfortunate that α is the standard notation for this probability. It is also, as in chapter 5, frequently used to represent the intercept of the population relationship. We're going to have to decide which interpretation is appropriate based on context. Fortunately, we're rarely going to be in the business
of constructing interval estimates for the intercept, so we shouldn't have to use α for both of its meanings at the same time. This should make it relatively easy for us to ascertain which usage of α is relevant.

4. Just to reiterate, if df were large, we would treat d* as a standard normal random variable in equation (6.5). Under this treatment, we would replace tα/2(df) with Zα/2.

5. Figure 6.1, and indeed the rest of this book, uses strict inequalities to define the area comprising (1 − α)% of the probability. Nothing would change if this area included the points that define its boundaries, because the probability associated with any individual value for a continuous random variable is negligible. The convention that we adopt here is consistent with the typical practice of forming hypothesis tests. We won't discuss this practice until section 6.4, but, in general, it takes critical values identically equal to the standardized estimator or p-values identically equal to α as rejections of the null hypothesis.

6. Notice the typographical similarity between equations (6.8) and (6.14). The only big difference is that d and δ have switched places. There's also the minor difference that δ has acquired a subscript 0. This similarity is both good news and bad. The good news is that once we've mastered the arithmetic for either the confidence interval or the hypothesis test, the other should be easy to learn. The bad news is that we have to be careful not to mistake one for the other.

7. In this circumstance, the data are sometimes described as "consistent" with the null hypothesis, which is therefore "accepted." This language is convenient, but gives δ0 too much credit. "Consistency" should indicate some measure of active agreement. Here, the test accepts the null hypothesis unless a serious contradiction occurs. Therefore, a test can often fail to reject the null hypothesis even if the point estimate is quite different from δ0, in terms of what it implies about the behavior in question.

8. Why? Exercise 6.5 will explain.

9. Essentially, this is a restatement of the point made previously that point estimates are almost certainly wrong. Again, can we see why?

10. We first encountered the p-value in section 1.8, where it was introduced as a companion to the F-statistic. Here, it's a companion to the t-statistic, but the underlying meaning is the same.

11. Remember, |d*| is always positive, regardless of the sign on d*. Similarly, −|d*| is always negative.

12. Exercise 6.8 directs us to derive this result explicitly.

13. We make this transformation explicitly in exercise 6.9.
14. Exercise 6.10 directs us to construct and interpret the lower-tailed test, the one-tailed hypothesis test when the alternative hypothesis suggests a value lower than that of the null hypothesis.

15. We repeat these analyses with the sample of 1,000 observations on earnings in exercise 6.11.

16. For entertainment, we might try to calculate the probability of observing this outcome for a fair coin.

17. Of course, if the null and alternative hypotheses are this close, the substantive differences between them might be so small that it doesn't matter which one we believe.

18. The vertical bar in equation (6.34) indicates conditional probability. This notation reminds us that we are assessing the associated probability under the assumption that the value of δ is δ1.

19. Exercise 6.13 addresses the calculation of the probability of a Type II error in the context of a lower-tailed test. The probability of a Type II error is occasionally represented as β. Can we see why we would want to avoid this notation? There must not be enough Greek letters to go around.

20. Exercise 6.14 asks us to identify what must be done to α to increase the power of a hypothesis test, if nothing else can change, and to evaluate the consequences.

21. Exercise 6.15 invites us to consider the consequences for the power and meaningfulness of a hypothesis test if we set a lower value for H0.

22. Exercise 6.11d directs us to calculate the probability of a Type II error for the sample of 1,000 earnings observations and to compare it to that for the sample of 20 that we calculated earlier.

23. We explore these options in exercise 6.16.
CHAPTER 7

INFERENCE IN ORDINARY LEAST SQUARES
7.0 What We Need to Know When We Finish This Chapter
7.1 The Distributions of b and a
7.2 Estimating σ2
7.3 Confidence Intervals for b
7.4 Hypothesis Tests for β
7.5 Predicting yi Again
7.6 What Can We Say So Far about the Returns to Education?
7.7 Another Example
7.8 Conclusion
Appendix to Chapter 7
Exercises
269
272
Appendix to Chapter 7 Exercises
232
273
276
What We Need to Know When We Finish This Chapter This chapter addresses the question of how accurately we can estimate the values of β and α from b and a. Regression produces an estimate of the standard deviation of εi. This, in turn, serves as the basis for estimates of
232
Inference in Ordinary Least Squares
233
the standard deviations of b and a. With these, we can construct confidence intervals for β and α and test hypotheses about their values. Here are the essentials. 1. Section 7.1: We assume that the true distributions of b and a are the normal. Because we have to estimate their variances in order to standardize them, however, we have to treat their standardized versions as having t-distributions if our samples are small. 2. Section 7.2: Degrees of freedom count the number of independent observations that remain in the sample after accounting for the sample statistics that we’ve already calculated. 3. Equation (7.3), section 7.2: The ordinary least squares (OLS) estimator for σ 2, the variance of εi, is n
∑e
2 i
s2 =
i =1
n−2
.
4. Equation (7.9), section 7.3: The (1 − α)% confidence interval for β is n− 2 1 − α = P b − tα( / 2 )
n−2 < β < b + tα( / 2 )
s2 n
∑(
xi − x
i =1
)
2
s2 n
∑( x − x )
2
i
i =1
.
5. Section 7.3: Larger samples and greater variation in xi yield narrower confidence intervals. So does smaller σ 2, but we can’t control that. 6. Equation (7.16), section 7.4: The two-tailed hypothesis test for H0 : β = β0 is n− 2 1 − α = P β0 − tα( / 2 )
s2 n
∑( x − x ) i
i =1
2
< b < β0 + tα( n/−22)
s2 n
∑( x − x ) i
i =1
2
.
The alternative hypothesis is H1 : β ≠ β0. 7. Section 7.4: The test of the null hypothesis H0 : β = 0 is always interesting because, if true, it means that xi doesn’t affect yi.
234
Chapter 7
8. Equation (7.26), section 7.4: The upper-tailed hypothesis test for H0 : β = β0 is n− 2 1 − α = P b < β0 + tα( )
s2 n
∑( x − x ) i
i =1
2
.
9. Section 7.5: The best linear unbiased estimator of E(y0) is yˆ0 = a + bx0 . 10. Section 7.5: Predictions are usually more reliable if they are based on larger samples and made for values of the explanatory variable that are similar to those that appear in the sample.
7.1
The Distributions of b and a Our review of confidence intervals and hypothesis tests in chapter 6 demonstrates that we have to determine the distributions of our bivariate regression estimators b and a in order to know anything more about the relationships between them and the underlying population parameters β and α. These distributions depend on the distributions of the εi’s. This can’t be much of a surprise. According to equation (5.38), b is a linear combination of all of the individual yi’s. As stated immediately after that equation, so is a. Each yi is a linear function of an εi, according to equation (5.1). Therefore, b and a are both linear functions of all of the εi’s. Moreover, in sections 5.6 and 5.8, we’ve already shown that the expected values and population variances of b and a depend on the expected values, population variances, and population covariances of the εi’s. The expected values and population variances of random variables are just two properties of the distributions of those random variables. Naturally, the other properties of the distributions of b and a depend on the distributions of the εi’s as well. Here, the simplest and most common strategy would be to assume that the εi’s are normally distributed. Linear combinations of normally distributed random variables are themselves normally distributed. This would directly imply that the true distributions of b and a are normal. If, for some reason, we didn’t want to return to this assumption, we have an alternative strategy. If n is sufficiently large, we can invoke a central limit theorem. The proof—and even statement—of these theorems requires statistical machinery beyond anything we need elsewhere in this course. So we
Inference in Ordinary Least Squares
235
won’t present one explicitly. All we really need is the result: If the sample is large enough, sample statistics such as b and a can be treated as if they are normally distributed.1 In other words, we end up assuming that b and a are distributed normally regardless of whether we assume that the εi’s themselves are distributed in this way or if we assume that the sample is large enough to justify a central limit theorem. Given the results of sections 5.6 and 5.8, this means that we can treat b as b ~ N β,
σ2 n 2 ( xi − x ) i =1
∑
and its standardization as b−β
~ N (0,1) .
σ2 n
∑( x − x )
(7.1)
2
i
i =1
Similarly, we can treat a as 1 a ~ N α,σ 2 + n
n
∑( i =1
x2 , 2 xi − x
)
with the accompanying standardization a −α 2 1 σ + n
n
∑( i=1
x2 2 xi − x
)
( )
~ N 0,1 .
(7.2)
236
7.2
Chapter 7
Estimating s 2 The remaining problem is that the standardized distributions of both b and a, in equations (7.1) and (7.2), depend on σ, the third population parameter in the population relationship of chapter 5. This parameter is no more known to us than are β and α. In order to make further progress, we must estimate it. σ 2 is the variance of the εi’s. We don’t observe the εi’s, either. However, we have a natural analogue for them in the residuals from the regression, the ei’s. So a natural estimator for σ 2 might be the sample variance of the ei’s. Equation (3.11) gives the sample variance for any variable xi as n
V ( xi ) =
∑( x − x )
2
i
i =1
n −1
.
This isn’t quite right for our present purpose. The problem is the denominator. We didn’t discuss where it came from when we introduced the sample variance in chapter 3. However, now that we’ve introduced the idea of unbiasedness in section 5.6 and the phrase “degrees of freedom” in chapter 6, we can address it, at least informally. The denominator isn’t intended to measure the sample size. It measures the degrees of freedom. This raises two questions. The first is, why divide by the degrees of freedom rather than the sample size? The answer is that this is necessary to obtain an unbiased estimator of the population variance. The proof of this assertion requires statistical foundations that are beyond the ambitions of this book. Therefore, this is one of the few assertions that we make but won’t prove. The second question is, how do we calculate the degrees of freedom? The answer is that the degrees of freedom count the number of independent observations that remain in the sample after accounting for the sample statistics that we’ve already calculated. How many independent observations do we start out with? Remember, we assumed in equation (5.11) and demonstrated in equation (5.22) that all of the covariances among the εi’s and the yi’s are equal to zero. If we assume, in addition, that the εi’s are distributed normally, zero covariances guarantee that they are all independent as well. Even if we don’t assume that the εi’s are distributed normally, we’ll rely on the implication that the observations in our sample are independent of each other throughout this book with the exception of chapter 9. Consequently, we have n independent observations. If all of the observations are independent, why do we have to adjust the sample size to obtain the degrees of freedom? The answer is that, while the observations are all initially independent of each other, this isn’t true after we’ve used them to calculate sample statistics.
Inference in Ordinary Least Squares
237
We can illustrate this with equation (3.11). The numerator of this calculation has n terms in it, one for each of the xi’s. However, we’ve already used these n values to calculate a sample statistic, x . If we know the values of all of the xi’s from x1 to xn−1 and their average, we can figure out what the missing xn must be.2 So xn is redundant. With x already in hand, we have only n − 1 independent values of xi remaining with which to calculate the sample variance. That’s why we can only divide by n − 1 instead of n. The reason that we can’t apply equation (3.11) directly to the ei’s is that they are regression residuals. In order to obtain them, we have already calculated two sample statistics, b and a. With these values, and those of n − 2 of the ei’s, we could deduce the values of the remaining two residuals. Therefore, in the case of the bivariate regression model, we have only n − 2 independent ei’s left with which to calculate their variance. The unbiased estimator of the underlying population variance, σ 2, is accordingly n
s2 =
n
∑(ei − e ) ∑ ei2 i =1
2
n−2
=
i =1
n−2
.
(7.3)
The second equality invokes equation (4.20) to replace e with zero. The numerator of equation (7.3) should be familiar. It’s the sum of squared errors, first introduced in equation (4.6). There, we chose b and a to minimize this sum. We now see that, as a consequence, our regression line also provides a smaller value for s2 than would any other line through the same data. Although we won’t prove this, s2 is an unbiased estimator of σ 2 precisely because it is based on the minimum sum of squared errors. What does s2 look like? We’ve actually calculated the sum of squared errors ourselves only a couple of times so far. Tables 4.2 and 4.4 are examples. In table 4.2, it was 945,900,000. There, n was five. Therefore, s2 =
945, 900, 000 = 315, 300, 000. 5− 2
Our estimate of the variance of the disturbances in this example is 315,300,000. This is a huge number. We have to remember, however, that its units are those of the dependent variable, squared. In this case, that means 315,300,000 dollars squared. What are those? No one knows. For interpretative purposes, it’s much easier to work with an estimate of the standard deviation of the disturbances. This would be, according to equation (3.12), simply s = + s2 .
238
Chapter 7
In the case of our example, the estimated standard deviation of the earnings disturbances is $17,457. The computer has calculated the sum of squared errors for the other regressions that we have presented. For example, the sum of squared errors for the regression on the Census data of table 3.3, reported at the end of section 4.7, is 10,861,246,076. That regression has 20 observations, one intercept estimate, and one slope estimate, or 18 degrees of freedom. Therefore, s2 =
10, 861, 246, 076 = 603, 402, 560. 20 − 2
Consequently, the estimated standard deviation of the earnings disturbances is $24,564.3 As another example, the simulated data in table 5.1 yield 5,782,678,689 as the sum of squared errors. With 18 degrees of freedom, this implies that s2 = 321,259,927. Accordingly, s = $17,924. Now that we have s2 as an unbiased estimator of σ 2, we can substitute it into equations (5.50) and (5.51). Consequently, our estimate of the variance of b is V (b ) =
s2
.
n
∑( x − x )
(7.4)
2
i
i =1
According to equation (3.12), our estimate of its standard deviation is SD (b ) = +
s2 n
∑( x − x )
=s 2
i
1 n
(7.5)
.
∑( x − x )
2
i
i =1
i =1
Similarly, our estimates of the variance and standard deviation of a are 1 V a = s2 + n
()
n
∑( i =1
x2 2 xi − x
(7.6)
)
and 1 SD a = + s 2 + n
()
n
∑( i =1
x2 =s 2 xi − x
)
1 + n
n
∑( i =1
x2 . 2 xi − x
)
(7.7)
Inference in Ordinary Least Squares
239
Notice that we used the notation V(b) and V(a) to represent the population variances of the sample statistics b and a in equations (5.50) and (5.51). We are now using the same notation to represent the sample estimates of these variances. As we’ve said before, this ambiguity is present in all statistical literature. That’s why we don’t attempt to fix it here. However, we can be certain that the formulas assigned to these notations will never contain both σ 2 and s2. We’re talking about the population variance if σ 2 appears, and sample estimates if, as in equations (7.4) and (7.6), it’s s2. We sometimes acknowledge this ambiguity by referring to the estimates in equations (7.4) through (7.6) as empirical variances and standard deviations. This serves two purposes. First, it indicates that they can be calculated using sample information. Second, it distinguishes them from the population variances in equations (5.50) and (5.51). The population variances cannot be observed, because σ 2 is unknown, at least in actual data. Now that we have equations (7.5) and (7.7), we might ask if we really need them. For example, didn’t we already have standard deviations for b in table 5.4? We did, and table 7.1 demonstrates that they were pretty accurate! Remember, the regressions summarized in table 5.4 were run on simulated data, so we know the true value for σ 2 and, therefore, for SD(b). The second column of table 7.1 gives the true standard deviations of b for those regressions in four different sample sizes. We may have even calculated these standard deviations already. They’re the answers to exercise 5.15e. The third column of table 7.1 reproduces the sample standard deviations from table 5.4. They were all calculated by applying equation (3.12) to the samples of 50 b’s for each sample size. They are all quite close to the true values. Isn’t that good enough? The fourth column of table 7.1 presents the sample standard deviations, calculated using equation (7.5). These are from the single regressions for each sample size that appear with greater detail in table 7.3. They’re actually a little closer to the true values than are the sample standard deviations in the third column.4 However, that’s not the reason to prefer them. The real reason is that in chapter 5 we ran 50 regressions in order to get each of the sample standard
TABLE
7.1
Sample size
100 1,000 10,000 100,000
Two methods to estimate the standard deviation of b True standard deviation of b, based on equation (5.50)
857.49 271.16 85.749 27.116
Sample standard deviation of 50 slope estimates, from table 5.4
Estimated standard deviation of slope, from equation (7.5)
879.73 299.71 79.271 29.899
830.27 276.94 86.957 27.483
240
Chapter 7
deviations in the third column. We only had to run one regression to get the sample standard deviations in the fourth column. How are we able to do now, with 1 regression, what took us 50 regressions in chapter 5? The answer is equation (7.5). This equation embodies what we know about the structure of the population, as described in chapter 5, and our estimate of σ 2 from equation (7.3). This additional information is ignored by equation (3.12). It is so useful that, in this example, it actually gets us closer to the true answer than do 49 additional regressions. As we mentioned in conjunction with equation (6.5), the substitution of an estimated variance for the true variance of a random variable introduces some additional variability into the standardized form of the random variable. In principle, we should account for the additional variability associated with replacing σ 2 with s2 here by replacing the standard normal distributions of equations (7.1) and (7.2) with the t-distributions for n − 2 degrees of freedom. Again, the importance of this adjustment depends on the size of the sample. Note 2 of chapter 6 explains that the t-distribution is essentially indistinguishable from the standard normal distribution when n is large. Certainly, if we think that n is large enough to allow us to rely on a central limit theorem to establish the essential normality of b and a, it is also large enough for the normal distribution to acceptably approximate the relevant t-distribution. In this case, for example, we could retain the normal distribution even though we are using estimated variances. In small samples, however, the difference between the t-distribution and the standard normal distribution can be substantial. In these cases, the substitution of the former for the latter is really mandatory. For this reason, we make this substitution here. This way, we have formulas that are correct regardless of sample size. Accordingly, b−β s
~ t(
n−2 )
(7.8)
2
n
∑( x − x )
2
i
i =1
and a −α 2 1 s + n
n
∑( i =1
x2 2 xi − x
)
~ t(
n− 2
).
Inference in Ordinary Least Squares
7.3
241
Confidence Intervals for b With all of these preliminaries accomplished, we can finally turn to the question of how much information b can provide us with respect to the location of β. We begin with the general expression for confidence intervals in equation (6.8). We then modify this equation further to fit the specifics of our bivariate regression problem. In this context, we replace δ, the generic notation in equation (6.8) for the population parameter, with β, the specific population parameter of interest here. We replace d, the generic notation for the estimator there, with b, the bivariate regression estimator of β. Finally, we replace the generic notation SD(d) in equation (6.8) with our actual estimate of the standard deviation of b from equation (7.5). The result is our (1 − α)% confidence interval for β: n− 2 1 − α = P b − tα( / 2 )
n− 2 ≤ β ≤ b + tα( / 2 )
s2 n
∑( x − x ) i
i =1
2
s2 n
∑( x − x ) i
i =1
2
.
(7.9)
There’s a lot of notation here, but all of the pieces are actually pretty familiar. First, the middle term of the inequality in parentheses consists solely of β, the parameter we’re trying to estimate. So, just as in section 6.3, we have a lower bound and an upper bound that we are (1 − α)% certain contain the true value of β. Second, the lower and upper bounds, despite their apparent complexity, are simply numbers that we already know. Our sample provides us with values for ∑in=1 ( xi − x )2. We have already seen this sum as the numerator in the calculation of the sample variance for our explanatory variable in equation (3.11) and one form of the denominator in the formula for b in equation (5.47). Our sample also provides us with n, the sample size. Our bivariate regression calculations provide us with the value for b, the slope of our best-fitting line. Equation (7.3) provides us with the value for s2. Finally, given n and our chosen value of α, table A.2 gives us the value n− 2 for tα( / 2 ) . Let’s construct confidence intervals for the regressions on the data in tables 3.1, 3.3, and 5.1. In the case of the first sample, table 4.1 gives n
∑( x − x ) i
i =1
2
= 40.
242
Chapter 7
Section 4.7 gives b = 5,550. We just calculated that s2 = 315,300,000 and n − 2 = 3 in section 7.2. If we choose the conventional value of α = .05, then table A.2 tells us that tα( 3/)2 = 3.182. Therefore, the 95% confidence interval for the population coefficient embedded in these data is 315, 300, 000 315, 300, 000 .95 = P 5, 550 − 3.182 ≤ β ≤ 5, 550 + 3.182 40 40 = P (−3, 384 ≤ β ≤ 14, 484 ) .
(7.10)
In other words, there is a 95% chance that the true effect of a year of schooling on annual earnings in these data lies between a reduction of $3,384 and an increase of $14,484. This range is so large that it is at best uninteresting and at worst alarming. It doesn’t rule out the possibility that education could reduce our earnings! However, our objective at the moment is simply to get comfortable with the process of constructing it. We’ll talk about how to evaluate and, perhaps, improve it in a moment. For the Census data in table 3.3, section 4.7 gave the sample variance of education as 21.905. With a sample of 20, this implies that5 n
∑( x − x )
2
i
= 416.20.
i =1
Section 4.7 calculates b = 4,048.5. The previous section gives s2 = 603,402,560 and n − 2 = 18. With α = .05 and tα(18/ 2) = 2.101, 603, 402, 560 603, 402, 560 .95 = P 4, 048.5 − 2.101 ≤ β ≤ 4, 048.5 + 2.101 416.20 416.20 = P (1, 518.7 ≤ β ≤ 6, 578.3) .
(7.11)
This is much more comforting than the confidence interval in equation (7.10). At least, all of the values in this entire interval are positive. Moreover, even the lower bound, $1,518.70, indicates a noteworthy return to education. It might not be enough to make it worth our while to enroll, but at least it makes us think about whether we should. Finally, n
∑( x − x ) i
i =1
2
= 170
243
Inference in Ordinary Least Squares
for the simulated data in table 5.1, as stated in section 5.8 and proven in exercise 5.15b. The text following table 5.1 reports that b = 4,538.6. According to section 7.2, s2 = 321,259,927 and n − 2 = 18. With, once again, α = .05, and ) tα(18 / 2 = 2.101, 321, 259, 927 321, 259, 927 .95 = P 4, 538.6 − 2.101 ≤ β ≤ 4, 538.6 − 2.101 170 170 = P (1, 650.4 ≤ β ≤ 7, 426.8) .
(7.12)
To reiterate, equation (7.9) says that we are (1 − α)% certain that β lies between
$$b - t^{(n-2)}_{\alpha/2}\sqrt{\frac{s^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}} \qquad (7.13)$$
and
$$b + t^{(n-2)}_{\alpha/2}\sqrt{\frac{s^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}. \qquad (7.14)$$
Again, what does this actually mean? To return to the metaphor that concluded section 6.3, it is equivalent to saying that a horseshoe with these end points would ring the invisible post on approximately 95 throws out of 100. Perhaps a more helpful way to think of it is that if we were able to construct confidence intervals from 100 independent samples, the true parameter value should typically lie within about 95 of them. We won’t attempt to play horseshoes with an invisible post, but we can attempt to verify this latter assertion. Table 5.2 presented intercept and slope estimates from 50 independent, simulated samples, all of which shared the parameter value β = 4,000. Table 7.2 presents the lower and upper bounds, as given in equations (7.13) and (7.14), of the 95% confidence interval for this coefficient from each sample. Many of these confidence intervals have upper bounds that are well above the true value. A number have lower bounds that are negative. Nevertheless, most of the samples generate confidence intervals that include the true value of β = 4,000. In fact, only two do not, the 12th and the 50th. In other words, the true parameter value lies within 48 out of 50, or 96%, of the independent 95% confidence intervals in table 7.2. This is almost exactly what we expected!
TABLE 7.2  95% confidence intervals for b in 50 simulated regressions

  Regression  Lower bound for β  Upper bound for β     Width  |  Regression  Lower bound for β  Upper bound for β     Width
           1            1,650.5            7,426.7   5,776.2  |          26           −1,123.4            7,637.5   8,760.9
           2             590.89            7,276.3   6,685.4  |          27           −2,755.5            6,031.5   8,787.0
           3            −743.71            7,274.9   8,018.6  |          28            −434.37            5,873.6   6,307.9
           4            1,513.8            9,336.8   7,823.0  |          29           −2,356.7            6,985.8   9,342.5
           5           −1,608.3            8,274.0   9,882.3  |          30           −1,426.7            5,925.4   7,352.1
           6            −331.33            7,019.5   7,350.8  |          31           −1,873.8            6,801.7   8,675.5
           7            2,943.8             10,388   7,444.2  |          32             688.64            7,309.5   6,620.8
           8            1,507.9             10,207   8,699.1  |          33           −1,342.3            4,185.0   5,527.3
           9             735.32            8,978.5   8,243.2  |          34            1,884.0             11,757   9,873.0
          10            −921.16            8,296.4   9,217.5  |          35           −2,650.2            7,378.9    10,029
          11            −386.36            8,514.5   8,900.9  |          36           −1,080.7            7,394.8   8,475.5
          12            5,854.6             12,644   6,789.4  |          37            3,803.9             11,524   7,720.1
          13           −2,235.4            6,438.6   8,674.0  |          38            3,278.5           11,498.3   8,219.8
          14            −656.85            7,661.2   8,318.1  |          39           −2,373.7            6,661.6   9,035.3
          15            −256.27            7,055.6   7,311.9  |          40            1,693.6            9,424.2   7,730.6
          16           −1,471.5            6,050.6   7,522.1  |          41            −864.63            7,788.4   8,653.0
          17            3,274.0            9,644.6   6,370.6  |          42            2,525.0             10,672   8,147.0
          18             382.98            6,593.1   6,210.1  |          43             241.57            8,029.4   7,787.8
          19            1,802.3            9,646.6   7,844.3  |          44           −2,116.7            5,487.5   7,604.2
          20           −2,716.5            7,081.6   9,798.1  |          45            3,085.6            8,451.0   5,365.4
          21           −2,773.1            4,218.2   6,991.3  |          46           −1,984.2            5,210.6   7,194.8
          22            2,353.5             10,582   8,228.5  |          47             424.54            8,557.7   8,133.2
          23             708.47            6,750.2   6,041.7  |          48            2,493.2             11,598   9,104.8
          24            1,854.8            8,086.6   6,231.8  |          49            1,362.9            9,328.8   7,965.9
          25            2,628.5            7,522.3   4,893.8  |          50           −4,493.3            1,949.8   6,443.1
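The coverage claim is also easy to check by simulation. The sketch below is mine, not the author's: it assumes the simulated design behind table 5.1 (β = 4,000, α = −20,000, σ = 25,000, and one observation at 8 and at 18 years of schooling plus two at each year from 9 through 17, which reproduces a sum of squared deviations of 170), draws 50 independent samples, and counts how many of the resulting 95% confidence intervals contain the true slope.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    beta, alpha_true, sigma, reps = 4000.0, -20000.0, 25000.0, 50
    # Assumed schooling design: reproduces sum((x - xbar)^2) = 170 with n = 20.
    x = np.array([8] + [yr for yr in range(9, 18) for _ in range(2)] + [18], dtype=float)
    n, xbar = len(x), x.mean()
    ssx = ((x - xbar) ** 2).sum()
    t_crit = stats.t.ppf(0.975, df=n - 2)

    covered = 0
    for _ in range(reps):
        y = alpha_true + beta * x + rng.normal(0.0, sigma, size=n)
        b = ((x - xbar) * (y - y.mean())).sum() / ssx     # OLS slope
        a = y.mean() - b * xbar                           # OLS intercept
        s2 = ((y - a - b * x) ** 2).sum() / (n - 2)       # unbiased estimator of sigma^2
        half = t_crit * np.sqrt(s2 / ssx)
        covered += (b - half <= beta <= b + half)
    print(covered, "of", reps, "intervals contain beta = 4,000")  # typically 47 or 48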
Now that we have a better handle on how to interpret a confidence interval, we can address an issue that we first raised in regard to equation (7.10). The confidence interval in equation (7.10) is enormously wide. That of equation (7.12) is pretty big, too, as are most of the intervals in table 7.2. Even the confidence interval in equation (7.11) includes values at the low end that would make it questionable whether schooling was economically worthwhile, and values at the high end that suggest that schooling is extremely lucrative. With this information, we don't have any clear idea as to whether we should continue to go to school or not.
What can we make of this? As we discussed in section 6.3, the range in which we expect β to lie has width equal to
$$2t^{(n-2)}_{\alpha/2}\sqrt{\frac{s^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}, \qquad (7.15)$$
or $2t^{(n-2)}_{\alpha/2}$ times the estimated standard deviation of b. If this range is wide, as in many of the confidence intervals we have just constructed, it will be useless. Among the plausible values for β will be some with radically different behavioral implications. While we would be (1 − α)% confident that β would lie in this range, we would have hardly any confidence that we knew what behavior to actually choose or expect.
In contrast, if this range is narrow, it could be quite informative. Most or all of the plausible values would imply relatively similar behavior. We wouldn't know the exact value of β, but we could be quite certain regarding the type of behavior we would expect to see. For example, imagine that we had obtained a lower bound for the 95% confidence interval around the true annual return to education of $3,000 and an upper bound of $3,200. With this confidence interval, the range of likely values for β would be quite limited. We would still have to compare the present discounted value of these returns over our working lives to the costs of education in order to decide whether to enroll. However, the outcome of that comparison wouldn't vary much whether we used the lowest or the highest values in the interval. Either way, we would get a pretty clear indication as to whether or not additional schooling was financially attractive.
What determines the width of the interval in equation (7.9), as given in equation (7.15)? Two quantities are fixed, or nearly so. The multiplicative factor of two is obviously a constant. Less obviously, s² is unlikely to change much. As we said in the last section, it's an unbiased estimator of the constant σ². Unbiasedness doesn't depend on anything other than the basic assumptions that we made about the population in chapter 5. Therefore, s² will be unbiased for σ² regardless of sample size. This means that, given the population relationship, its value shouldn't fluctuate a lot as the sample size changes, and certainly not predictably.
In contrast, the remaining quantities are, to some degree, within our control. As we discussed in section 6.3, increasing n, the sample size, will clearly reduce the width of the confidence interval. First, it will increase the degrees of freedom for the t-distribution, leading to a lower value of $t^{(n-2)}_{\alpha/2}$ when n is large. We can see the potential effects of this in the comparison between equations (7.10) and (7.11): The former has only 3 degrees of freedom. The latter has 18. Consequently, the value of $t^{(n-2)}_{\alpha/2}$ for equation (7.10) is 3.182. That for equation (7.11) is approximately two-thirds as large, at 2.101. All else equal, equation (7.15) therefore implies that the confidence interval for equation (7.11) would be approximately one-third narrower as well.
Second, in the specific context of the standard deviation of b, each new observation adds a term of the form $(x_i - \bar{x})^2$ to the denominator within the square root. As we discussed in section 5.10, this term will be positive unless $x_i = \bar{x}$ for that observation; except in that case, each new observation reduces the fraction within the square root.
TABLE 7.3  Effect of sample size on the width of 95% confidence intervals for simulated data

    Sample size   Slope estimate   Estimated SD of slope estimate   Lower bound for β   Upper bound for β      Width
             10          8,213.7                          1,351.5             5,097.0              11,330    6,233.0
            100          2,382.1                           830.27               734.5             4,029.7    3,295.2
          1,000          3,640.9                           276.94             3,097.5             4,184.3    1,086.8
         10,000          4,008.1                           86.957             3,837.7             4,178.6     340.91
        100,000          3,945.7                           27.483             3,891.8             3,999.5     107.73
      1,000,000          4,016.6                           8.6950             3,999.6             4,033.7     34.084
     10,000,000          3,995.8                           2.7523             3,990.4             4,001.2     10.788
    100,000,000          3,999.8                           .87029             3,998.1             4,001.5     3.4115
This square root is the estimated standard deviation of b. We've already demonstrated, in sections 6.3 and 6.4.3, that, in general, reductions in the standard deviation of an unbiased estimator increase the precision with which we can locate the underlying population parameter. The current discussion, in the context of equation (7.15), just restates the point for the case of b.
Table 7.3 demonstrates the effects of increased sample size on the sample standard deviation of b and the associated confidence intervals for β. It presents the results of eight regressions on simulated data. Each of the eight samples is based on the same parameter values and the same distribution of xi values as are the regressions of table 5.4. The difference between the two tables is that, for table 5.4, we calculated 50 independent estimates of b for each sample size. As we recalled in our discussion of table 7.1, we calculated the sample standard deviation for that collection of estimates according to equation (3.12). Here, we follow our recommendation from sections 5.8 and 7.2: We calculate only one regression for each sample size and simply use equation (7.5) to give us the estimated standard deviation of b. Table 7.3 presents the value for b, its estimated standard deviation as given by equation (7.5), and the corresponding confidence interval of equation (7.9).
Once again, the appeal of increased sample size is obvious. Here, the estimated standard deviation of the slope estimate declines impressively with sample size. We can also see the implications of consistency: The slope estimates get closer to the true value of 4,000 as their estimated standard deviation declines. Finally, table 7.3 confirms the point we've been investigating: The confidence interval for β narrows as the sample size increases. The regression with 10 observations not only yields a confidence interval that is more than $6,000
wide, but it is also one of the 5% of all 95% confidence intervals that actually doesn't include the true value! For the seven regressions on larger samples, six of the confidence intervals contain the true value. Each is narrower than those based on fewer observations. The confidence interval from the regression on 100,000,000 observations is less than $3.50 wide! In other words, if we didn't know the true value of the annual returns to education and had only this last regression to work with, we could be 95% certain that the true value was within an interval of less than $3.50.
Table 7.4 presents the exercise of table 7.3 applied to our Census data. Again, we use only one sample of each size because the purpose is to compare results across sample sizes, not across different samples of the same size. The largest sample size here is 100,000, because the 1% Public Use Microdata Sample (PUMS) for California doesn't contain many more appropriate observations. Nevertheless, the point of table 7.3 reappears clearly here. The estimated standard deviation of b declines regularly and substantially with each increase in sample size. Consequently, the confidence intervals narrow markedly as n increases. The confidence interval for n = 10, and even that for n = 100, is so wide as to be of little value. In contrast, that for n = 100,000 implies that β, whose value is unknown to us in this case, almost surely lies within a very small interval around $3,856.
The denominator of the variance of b may also be controlled in a more subtle sense, to which we first referred in section 6.3. Holding n constant, this term will be larger when the values for xi are more varied. To understand this intuition, imagine that the values for xi didn't vary at all; that is, xi = x̄ for all observations. In this case, the denominator within the square root would be zero, the ratio within the square root would be infinite, and the confidence interval would be very, very wide. Very wide. Regardless of how many observations were in our sample. This would be regression's way of telling us that we are being dull-witted. The question we are trying to answer is how yi would differ for different values of xi.
TABLE 7.4  Effect of sample size on the width of 95% confidence intervals for Census data

    Sample size   Slope estimate   Estimated SD of slope estimate   Lower bound for β   Upper bound for β      Width
             10          1,500.8                          1,255.1            −1,393.4             4,395.0    5,788.4
            100          3,362.8                           946.71             1,484.1             5,241.5    3,757.4
          1,000          3,374.0                           342.21             2,702.4             4,045.5    1,343.1
         10,000          3,793.9                           113.03             3,572.3             4,015.5     443.13
        100,000          3,855.9                           35.678             3,785.9             3,925.8     139.86
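The 1/√n pattern in tables 7.3 and 7.4 follows directly from equation (7.15): with the distribution of xi held fixed, the sum of squared deviations grows roughly in proportion to n. The sketch below is illustrative only; it uses assumed values (σ² = 625,000,000 and 8.5 for the sum of squared deviations per observation, the figures implied by the simulated design) rather than each regression's own s², so its widths track table 7.3 only approximately.

    import numpy as np
    from scipy import stats

    s2 = 625_000_000.0      # assumed: sigma^2, with sigma = 25,000
    ssx_per_obs = 8.5       # assumed: sum((x - xbar)^2) / n in the simulated design
    for n in (10, 100, 1_000, 10_000, 100_000):
        sd_b = np.sqrt(s2 / (ssx_per_obs * n))
        width = 2 * stats.t.ppf(0.975, df=n - 2) * sd_b
        print(f"n = {n:>7,}: SD(b) about {sd_b:8.2f}, CI width about {width:9.2f}")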
values of xi. But we don’t observe different values of xi. So how can we expect to answer the question? We can’t, and regression won’t.6 In contrast, imagine that we had a relatively small number of observations, but with a very wide range of values for xi among them. All of the observations would make positive contributions to the denominator. Moreover, the many values that appeared far from the average would contribute large squared terms to the summation. Accordingly, the summation could be surprisingly large even if the sample were small. In this case, the confidence interval could be quite narrow. Here, regression would be telling us that, even though we don’t have much data, the data we do have are very rich and very informative. They provide lots of contrast in the explanatory variable. Therefore, it’s relatively easy to be confident about what the resulting contrasts in yi should be. We can explore this point with simulated data. Table 7.5 presents the distribution of values for xi in 10 different simulated samples. Each of the samples embodies the same parameter values that we have used in all previous simulations: β = 4,000, α = −20,000, and σ = 25,000. In each, the average value for xi is 13. Each of these samples contains 200 observations.7 The only thing that distinguishes these samples is the distributions of xi values in each. Sample 1 reproduces that in all of our simulations to this point. Sample 2 takes the 20 observations in sample 1 with 13 years of schooling and replaces them with 10 observations that have 8, and 10 that have 18 years. Sample 3 takes 10 of the observations with 12 years of schooling and 10 of those with 14 years of schooling and reassigns them to 8 and 18 years, respectively. Similar reassignments take place in each successive sample, until sample 10 is split evenly between individuals with 8 and with 18 years of schooling. The second column of table 7.6 demonstrates how this sequence of samples embodies the issues at hand. The sum of squared deviations between xi and its average increases, usually substantially, from one sample to the next. The estimated standard deviation of b, as given in equation (7.5), and the width of any confidence interval, as given in equation (7.15), should decline as well. The fourth through seventh columns demonstrate that this is, approximately, what occurs. They present the results of regressions of earnings on education in each of these simulated samples. In the fourth and seventh columns, the estimated standard deviations for b and the widths of the corresponding 95% confidence intervals decline from sample to sample with only a couple of random exceptions. Those of samples 9 and 10 are less than twothirds as large as those of sample 1. These reductions are attributable entirely to the increase in ∑in=1 ( xi − x )2 from sample to sample, because nothing else changes. The reductions aren’t entirely consistent because, in some samples, the random collection of
TABLE 7.5  Simulated distributions of xi for table 5.4

                                 Years of schooling
  Sample     8    9   10   11   12   13   14   15   16   17   18
       1    10   20   20   20   20   20   20   20   20   20   10
       2    20   20   20   20   20    0   20   20   20   20   20
       3    30   20   20   20   10    0   10   20   20   20   30
       4    40   20   20   20    0    0    0   20   20   20   40
       5    50   20   20   10    0    0    0   10   20   20   50
       6    60   20   20    0    0    0    0    0   20   20   60
       7    70   20   10    0    0    0    0    0   10   20   70
       8    80   20    0    0    0    0    0    0    0   20   80
       9    90   10    0    0    0    0    0    0    0   10   90
      10   100    0    0    0    0    0    0    0    0    0  100

NOTE: The row of each entry identifies the sample. The column identifies the years of schooling. The entry itself reports the number of observations in that sample with those years of schooling.
TABLE 7.6  Effect of variation in xi on the width of 95% confidence intervals for simulated data

  Regression   Σ(xi − x̄)²   Slope estimate   Estimated SD of slope estimate   Lower bound for β   Upper bound for β      Width
           1        1,700          3,952.9                           570.95             2,827.0             5,078.8    2,251.8
           2        2,200          2,768.3                           496.08             1,790.0             3,746.6    1,956.6
           3        2,680          3,070.9                           447.21             2,189.0             3,952.8    1,763.8
           4        3,160          3,811.4                           437.15             2,949.3             4,673.4    1,724.1
           5        3,580          3,912.3                           377.84             3,167.1             4,657.3    1,490.2
           6        4,000          3,853.9                           399.50             3,066.1             4,641.7    1,575.6
           7        4,320          3,691.3                           395.60             2,911.0             4,471.3    1,560.3
           8        4,640          4,223.4                           385.90             3,462.4             4,984.4    1,522.0
           9        4,820          3,744.8                           327.28             3,099.3             4,390.2    1,290.9
          10        5,000          3,981.8                           349.01             3,293.6             4,670.1    1,376.5
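The second column of table 7.6 can be reproduced directly from the designs in table 7.5. The short sketch below (mine, not the author's) computes the sum of squared deviations for each of the ten samples; every design has an average of 13 years, so the sum grows only because observations move toward the extremes.

    import numpy as np

    # counts[k] is the number of observations at 8, 9, ..., 18 years for sample k+1,
    # copied from table 7.5.
    years = np.arange(8, 19)
    counts = np.array([
        [ 10, 20, 20, 20, 20, 20, 20, 20, 20, 20,  10],
        [ 20, 20, 20, 20, 20,  0, 20, 20, 20, 20,  20],
        [ 30, 20, 20, 20, 10,  0, 10, 20, 20, 20,  30],
        [ 40, 20, 20, 20,  0,  0,  0, 20, 20, 20,  40],
        [ 50, 20, 20, 10,  0,  0,  0, 10, 20, 20,  50],
        [ 60, 20, 20,  0,  0,  0,  0,  0, 20, 20,  60],
        [ 70, 20, 10,  0,  0,  0,  0,  0, 10, 20,  70],
        [ 80, 20,  0,  0,  0,  0,  0,  0,  0, 20,  80],
        [ 90, 10,  0,  0,  0,  0,  0,  0,  0, 10,  90],
        [100,  0,  0,  0,  0,  0,  0,  0,  0,  0, 100],
    ])
    for k, row in enumerate(counts, start=1):
        xbar = (row * years).sum() / row.sum()            # always 13 by construction
        ssx = (row * (years - xbar) ** 2).sum()
        print(f"sample {k:2d}: sum of squared deviations = {ssx:,.0f}")
    # Output runs 1,700, 2,200, 2,680, ..., 5,000, matching table 7.6.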
Exercise 7.6 explores these differences. It directs us to derive the population standard deviations of b using equation (5.50), to derive the true width of the corresponding 95% confidence interval by replacing s² with σ² in equation (7.9), and to compare them to the estimated values in table 7.6.
It should be obvious that we can't replicate the demonstration of tables 7.5 and 7.6 with our Census data. Each individual already has an xi value. It would be meaningless to reassign them. However, we can investigate the underlying point in a different way. Return to table 5.5. The third row of each panel in that table summarizes the results of 50 bivariate regressions on independent samples of 1,000 observations each. Table 7.7 presents the values of $\sum_{i=1}^{n}(x_i - \bar{x})^2$, b, SD(b), $b - t^{(998)}_{\alpha/2}\,SD(b)$, $b + t^{(998)}_{\alpha/2}\,SD(b)$, and $2t^{(998)}_{\alpha/2}\,SD(b)$ for each of them.
TABLE 7.7  Effect of variation in xi on the width of 95% confidence intervals for Census data

  Regression   Σ(xi − x̄)²   Slope estimate   Estimated SD of slope estimate   Lower bound for β   Upper bound for β      Width
           1       12,851          3,912.4                           394.99             3,137.3             4,687.5    1,550.2
           2       13,006          3,981.1                           356.55             3,281.4             4,680.8    1,399.4
           3       13,023          4,532.2                           439.78             3,669.2             5,395.2    1,726.0
           4       13,061          4,353.4                           420.35             3,528.5             5,178.2    1,649.7
           5       13,452          4,748.2                           385.53             3,991.7             5,504.8    1,513.1
           6       13,724          4,097.2                           352.20             3,406.0             4,788.3    1,382.3
           7       13,738          3,577.8                           367.54             2,856.6             4,299.1    1,442.5
           8       13,778          3,678.0                           316.15             3,057.6             4,298.4    1,240.8
           9       13,864          4,302.7                           396.47             3,524.7             5,080.7    1,556.0
          10       13,871          3,937.2                           364.89             3,221.2             4,653.3    1,432.1
          11       13,904          3,869.5                           383.02             3,117.8             4,621.1    1,503.3
          12       14,103          4,018.2                           339.09             3,352.8             4,683.6    1,330.8
          13       14,142          4,575.4                           415.93             3,759.2             5,391.6    1,632.4
          14       14,143          3,932.0                           347.29             3,250.5             4,613.5    1,363.0
          15       14,247          3,892.1                           360.46             3,184.7             4,599.4    1,414.7
          16       14,300          4,266.3                           380.32             3,519.9             5,012.6    1,492.7
          17       14,316          4,405.3                           377.40             3,664.7             5,145.9    1,481.2
          18       14,453          3,993.4                           388.60             3,230.8             4,755.9    1,525.1
          19       14,472          4,198.0                           378.97             3,454.3             4,941.7    1,487.4
          20       14,495          4,433.5                           352.78             3,741.2             5,125.8    1,384.6
          21       14,503          3,732.3                           357.64             3,030.5             4,434.2    1,403.7
          22       14,561          3,198.3                           320.66             2,569.0             3,827.5    1,258.5
          23       14,599          3,892.4                           361.38             3,183.2             4,601.5    1,418.3
          24       14,643          3,913.5                           356.21             3,214.4             4,612.5    1,398.1
          25       14,658          3,389.2                           307.68             2,785.4             3,993.0    1,207.6
          26       14,710          4,144.2                           356.12             3,445.4             4,843.0    1,397.6
          27       14,762          3,283.5                           346.28             2,604.0             3,963.0    1,359.0
          28       14,885          4,730.9                           403.41             3,939.3             5,522.5    1,583.2
          29       14,929          3,659.0                           303.11             3,064.1             4,253.8    1,189.7
          30       14,943          3,943.6                           354.18             3,248.5             4,638.6    1,390.1
          31       15,060          3,929.5                           385.72             3,172.6             4,686.4    1,513.8
          32       15,105          3,326.7                           292.61             2,752.5             3,900.9    1,148.4
          33       15,127          4,469.0                           398.11             3,687.8             5,250.3    1,562.5
          34       15,242          3,718.9                           325.78             3,079.6             4,358.2    1,278.6
          35       15,313          3,778.3                           311.58             3,166.9             4,389.7    1,222.8
          36       15,470          3,665.2                           356.59             2,965.4             4,364.9    1,399.5
          37       15,559          3,732.8                           361.99             3,022.5             4,443.2    1,420.7
          38       15,584          3,953.2                           330.33             3,304.9             4,601.4    1,296.5
          39       15,769          4,016.9                           366.77             3,297.1             4,736.6    1,439.5
          40       15,772          4,011.3                           350.42             3,323.6             4,698.9    1,375.3
          41       15,781          3,722.8                           370.50             2,995.8             4,449.8    1,454.0
          42       15,803          3,723.7                           320.04             3,095.6             4,351.7    1,256.1
          43       15,918          3,492.6                           316.71             2,871.1             4,114.0    1,242.9
          44       15,986          3,426.5                           327.76             2,783.3             4,069.7    1,286.4
          45       16,027          3,582.1                           276.45             3,039.6             4,124.6    1,085.0
          46       16,161          3,644.0                           330.07             2,996.3             4,291.7    1,295.4
          47       16,165          3,749.3                           340.74             3,080.6             4,417.9    1,337.3
          48       16,203          3,254.9                           322.58             2,621.9             3,887.9    1,266.0
          49       16,307          3,871.4                           362.04             3,161.0             4,581.8    1,420.8
          50       16,934          3,357.0                           371.66             2,627.6             4,086.3    1,458.7
Table 7.7 presents these regressions in ascending order of their values for $\sum_{i=1}^{n}(x_i - \bar{x})^2$. This means that, as we read down the table, the denominator of the estimated variance of b increases. In general, this should imply that the estimated variance of b should decrease, SD(b) should decrease, and the width of the 95% confidence interval should decrease.
Given the random variations in the collection of disturbances from sample to sample, it's a little hard to discern this pattern visually in table 7.7. But, of course, the whole reason why we have statistics is because many important relationships aren't readily apparent to the naked eye. In this case, we expect to find a reasonably reliable negative relationship between $\sum_{i=1}^{n}(x_i - \bar{x})^2$ and the width of the confidence interval. How would we go about looking for it?
Recall that, in chapter 3, we agreed that the sample correlation coefficient measured both the direction and the reliability of an association between two variables. We can return to it here. The sample correlation between the numbers in the second and last columns of table 7.7 is −.4775. Despite all random fluctuations, there is a negative and reasonably reliable relationship between the sum of squared deviations of xi from its average and the width of the 95% confidence interval for β in table 7.7.
This demonstrates that, if we're able to design our own data collection strategy, we should ensure that we sample the largest possible range of values for our explanatory variable. However, in many cases for most social scientists and in most cases for economists, we analyze data that have already been collected by someone else. We're therefore not in a position to influence the range of variation in our explanatory variable. We simply run the risk that it may turn out to be too small to yield useful confidence intervals.
In sum, if we want narrow confidence intervals, there are three things that we can try to do. First, hope that σ² is small so that its estimator, s², will be small as well. Second, get as many observations into our sample as possible. Third, to the extent that we can choose, make sure that there is as much variation in xi as possible among the observations that are available.
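If the two columns of table 7.7 are typed into arrays, the correlation is one line of NumPy. The array names below are my own and only the first three rows of the table are shown; with all 50 rows entered, the result should come out close to the −.4775 reported above.

    import numpy as np

    ssx   = np.array([12_851, 13_006, 13_023])      # second column of table 7.7 (first 3 rows)
    width = np.array([1_550.2, 1_399.4, 1_726.0])   # last column of table 7.7 (first 3 rows)
    print(np.corrcoef(ssx, width)[0, 1])            # with all 50 rows: about -0.48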
7.4 Hypothesis Tests for b

We form the two-tailed hypothesis test for β in the same way that we formed its confidence interval. We take equation (6.14) and adapt it to the bivariate regression context. Again, we replace d with b and SD(d) with $+\sqrt{s^2\big/\sum_{i=1}^{n}(x_i - \bar{x})^2}$. In addition, we replace δ0 with β0, the value for our null hypothesis regarding the slope of the population relationship.
The result is the hypothesis test
$$1 - \alpha = P\!\left(\beta_0 - t^{(n-2)}_{\alpha/2}\sqrt{\frac{s^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}} < b < \beta_0 + t^{(n-2)}_{\alpha/2}\sqrt{\frac{s^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}\right). \qquad (7.16)$$
The acceptance region lies between
$$\beta_0 - t^{(n-2)}_{\alpha/2}\sqrt{\frac{s^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$
and
$$\beta_0 + t^{(n-2)}_{\alpha/2}\sqrt{\frac{s^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}.$$
If our sample were to yield a value of b that lies within this range, it would be consistent with the hypothesized value of β0. We would fail to reject the null hypothesis. If it were to yield a value outside of this range, it would be inconsistent with the null hypothesis. We would reject it.
As we stated in section 6.4.1, the null hypothesis that the true parameter value is zero, H0 : β0 = 0, is often of interest. In the context of regression analysis, this null hypothesis is always of interest. Why? What's so special about it? Returning to the population relationship of chapter 5, we see that this null hypothesis states that xi has no effect on yi. This would require a pretty radical revision of our original motivation for the whole regression enterprise. For this reason, it's easy to see why the first thing we always do with a regression is check to see if we can reject this hypothesis. If we can't, then there's usually not much more to say. If we can, then we can go on to discuss what we've learned about the behavioral consequences of the relationship between xi and yi, and perhaps the policy implications.
Under the null hypothesis H0 : β0 = 0, equation (7.16) becomes
$$1 - \alpha = P\!\left(-t^{(n-2)}_{\alpha/2}\sqrt{\frac{s^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}} < b < t^{(n-2)}_{\alpha/2}\sqrt{\frac{s^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}\right). \qquad (7.17)$$
Here, the "t-statistic" of section 6.4.1 is
$$\frac{b}{\sqrt{s^2\big/\sum_{i=1}^{n}(x_i - \bar{x})^2}}. \qquad (7.18)$$
If its absolute value equals or exceeds $t^{(n-2)}_{\alpha/2}$, then b lies in the rejection region for the null hypothesis H0 : β0 = 0, and we conclude that xi affects yi.
Let's test this hypothesis, at the 5% significance level, in the samples of tables 3.1, 3.3, and 5.1. In equation (7.10), we implicitly gave SD(b) for the sample of table 3.1 as
$$SD(b) = \sqrt{\frac{315,300,000}{40}} = 2,807.6.$$
That equation also gave $t^{(n-2)}_{.025} = 3.182$ for this sample. Accordingly, equation (7.17) gives the test at 5% significance for the null hypothesis H0 : β0 = 0 as
$$.95 = P\bigl(-3.182(2,807.6) < b < 3.182(2,807.6)\bigr) = P(-8,933.8 < b < 8,933.8). \qquad (7.19)$$
In other words, the acceptance region for this hypothesis lies between −$8,933.80 and $8,933.80. Point estimates in this range would not be inconsistent with the null hypothesis. The rejection region lies below −$8,933.80 and above $8,933.80. Point estimates in this range would be inconsistent with the null hypothesis and would require us to reject it. As we already know from the discussion of table 4.1, in this sample b = 5,550. This value falls in the acceptance region. Therefore, we cannot reject the null hypothesis that the true annual returns to education are zero. How did this happen? After all, an estimated annual return to education of $5,550 seems pretty hefty. Why, then, can’t we be sure that the true return isn’t zero? As is often the case, the issue is not with the estimate. It’s with its standard deviation. The value of the null hypothesis, β0 = 0, is less than two standard deviations away from the point estimate. Recalling our discussion in section 6.4.3, this isn’t very far in statistical terms. In the case of this sample, it’s even closer. Table A.2 for the t-statistic is telling us that, with only three degrees of freedom, the two have to be at least 3.182 standard deviations away from each other in order to tell them apart.
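For reference, here is the same test carried out numerically. The sketch is mine, not the author's; it assumes only the two summary numbers used above, b = 5,550 and SD(b) = 2,807.6, with 3 degrees of freedom, and the p-value line uses scipy's survival function, which table A.2 can only bracket.

    from scipy import stats

    b, sd_b, df = 5550.0, 2807.6, 3
    t_stat = b / sd_b                           # about 1.98
    t_crit = stats.t.ppf(0.975, df)             # about 3.18
    p_value = 2 * stats.t.sf(abs(t_stat), df)   # two-tailed p-value, about 0.14
    print(t_stat, t_crit, p_value)              # |t| < critical value: fail to reject H0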
Intuitively, here’s what’s going on. The sample is tiny. This has two negative implications. First, the standard deviation for b is large. This means that values for b could vary substantially from one sample to the next, simply because of random variations in the collections of εi’s embodied in each sample. Regression is telling us that the estimate we actually got, of $5,550, is no big deal because the next estimate we get could just as easily be $42. Or something worse. The second implication of the sample size is that s2 is not a very trustn−2 ) has to be so worthy estimator of σ 2. That’s why the critical value for t.(025 much larger than the value of 1.96 that we would use if the sample were large enough to approximate the normal distribution. Together, the uncertainty in b and the uncertainty in s2 are so great that we can’t be sure education has any positive return at all, even though our best single guess is that it increases annual earnings by $5,550. The regression on the Census data from table 3.3 is more promising. If 18) is 2.101. Moreover, nothing else, with 18 degrees of freedom the value for t.(025 implicit in equation (7.11) is SD (b ) =
603, 402, 560 = 1, 204.1. 416.20
With a smaller critical value and a smaller estimated standard deviation for the estimate, the acceptance region for the null hypothesis H0 : β0 = 0 has to be smaller:
$$.95 = P\bigl(-2.101(1,204.1) < b < 2.101(1,204.1)\bigr) = P(-2,529.8 < b < 2,529.8). \qquad (7.20)$$
In this sample, the acceptance region for the null hypothesis that xi truly has no effect on yi is between −$2,529.80 and $2,529.80. Our estimator, b = 4,048.5, is clearly in the rejection region. We reject the null hypothesis and conclude that earnings are definitely affected by education.
Let's restate this test in the two other equivalent forms presented in section 6.4.1. According to equation (7.18), the t-statistic for this test is
$$\frac{b}{\sqrt{s^2\big/\sum_{i=1}^{n}(x_i - \bar{x})^2}} = \frac{4,048.5}{1,204.1} = 3.362. \qquad (7.21)$$
We already know that the critical value for this test is $t^{(18)}_{.025} = 2.101$. Clearly, the t-statistic exceeds the critical value. Once again, we conclude that the null hypothesis must be rejected.
We can also consider the p-value for this test. Unfortunately, table A.2 for the t-statistic only identifies the values associated with a limited number of probabilities. Our t-statistic of equation (7.21) doesn't appear among them.8 However, according to table A.2, $t^{(18)}_{.005} = 2.878$ and $t^{(18)}_{.001} = 3.610$. Obviously, the value of the t-statistic in equation (7.21) lies between these two values: 2.878 < 3.362 < 3.610. This proves that
$$t^{(18)}_{.005} < t\text{-statistic} < t^{(18)}_{.001}.$$
Consequently,
$$.001 < p\text{-value} < .005.$$
Whatever the true p-value is, it has to be less than .005. This means that it's much less than the threshold of .05 significance that we have chosen here. Once again, we reject the null hypothesis.
Finally, we can reformulate the confidence interval in equation (7.12) for the simulated data in table 5.1 into the hypothesis test for H0 : β0 = 0:
$$.95 = P\!\left(-2.101\sqrt{\frac{321,259,927}{170}} < b < 2.101\sqrt{\frac{321,259,927}{170}}\right) = P(-2,888.2 < b < 2,888.2). \qquad (7.22)$$
As in the case of the sample of table 3.3, the estimate of b = 4,538.6 is in the rejection region. Equivalently, the t-statistic is greater than the critical value:
$$\frac{b}{\sqrt{s^2\big/\sum_{i=1}^{n}(x_i - \bar{x})^2}} = \frac{4,538.6}{\sqrt{321,259,927/170}} = 3.302 > 2.101.$$
Finally, the p-value must be less than .005, because the t-statistic is greater than $t^{(18)}_{.005} = 2.878$. No matter how we do it, we still find that we have to reject the null hypothesis for this sample.
None of these results should surprise us. The confidence interval for the sample of table 3.1 in equation (7.10) contains zero. Those of equations (7.11) and (7.12) do not. We said, as far back as section 6.5, that any value within the
(1 − α)% confidence interval for a parameter could not be rejected by a hypothesis test at the α% significance level applied to its estimator. Conversely, any value outside of the confidence interval would be rejected. We’ve just spent a fair amount of time confirming these assertions. Let’s spend a little more. All three confidence intervals in equations (7.10), (7.11), and (7.12) contain $4,000. Therefore, none of these samples should reject the null hypothesis H0 : β0 = 4,000. The appropriate test for the sample of table 3.1 follows equation (7.16):
$$.95 = P\bigl(4,000 - 3.182(2,807.6) < b < 4,000 + 3.182(2,807.6)\bigr) = P(-4,933.8 < b < 12,934). \qquad (7.23)$$
As given there, the only difference between this hypothesis test and that of equation (7.19) is that the value of zero for β0 there is replaced by the value of $4,000 here. As predicted by the confidence interval in equation (7.10), the estimate of b = 5,550 lies within the acceptance region for H0 : β0 = 4,000. The same is true for the samples of tables 3.3 and 5.1. For the former,
$$.95 = P\bigl(4,000 - 2.101(1,204.1) < b < 4,000 + 2.101(1,204.1)\bigr) = P(1,470.2 < b < 6,529.8), \qquad (7.24)$$
and the estimate is b = 4,048.5. For the latter,
$$.95 = P\bigl(4,000 - 2.101(1,374.7) < b < 4,000 + 2.101(1,374.7)\bigr) = P(1,111.8 < b < 6,888.2), \qquad (7.25)$$
and the estimate is b = 4,538.6.9 Both estimated values for b lie within their corresponding acceptance regions. As predicted by the confidence intervals, none of these samples can reject the null hypothesis that the true coefficient is $4,000. The test of the null hypothesis H0 : β0 = 4,000 illustrates our initial interpretation of the significance level. As we said in section 6.4.3, α is the probability of rejecting the null hypothesis when it’s true. In the sample of table 5.1, the true value of β is $4,000. Here, with α = .05, 5% of all independent tests of the null hypothesis H0 : β0 = 4,000 should reject. Can we verify this? Table 5.2 presents 50 regressions based on 50 independent samples that share this same value for β. Table 7.2 presents the confidence intervals for β from each. As we said in the discussion of that table, only two of those confidence intervals do not contain $4,000. The estimated b values associated with those two regressions would therefore incorrectly reject the null hypothesis that β = $4,000. These two estimates represent 4% of the sample of estimates, about as close as one can get to the 5% that we would expect, with only 50 samples.
Now that we’ve thoroughly investigated this null hypothesis, it might seem a little late to ask under what circumstances we might care about it. Nevertheless, this question allows us to reacquaint ourselves with the rest of what we learned about hypothesis tests in section 6.4, so it’s still quite useful. It seems unlikely that we would care about this hypothesis in a way that would motivate the tests that we’ve just performed in its regard. The tests in equations (7.23), (7.24), and (7.25) are all two-tailed tests. In other words, they are formulated so as to represent the idea that we don’t care why the hypothesis that β = $4,000 is wrong, we just care that it’s wrong. It’s hard to imagine any situation in which this attitude would be appropriate. It’s easier to imagine the following scenario: Suppose we had examined all of the costs of additional schooling and determined that it only made economic sense for us if we expected that the return to schooling would be greater than $4,000 per year. In this case, $4,000 would be the value that mattered. However, returns of $4,000 or less would all amount to the same thing for us: We wouldn’t enroll. We would do so only if returns were greater than $4,000. In other words, we’re most likely to be interested in the possibility that $4,000 is the true value of β in a context where a one-tailed test is appropriate, rather than a two-tailed test. The appropriate replacements in equation (6.30) lead to the upper-tailed test for β: n− 2 1 − α = P b < β0 + tα( )
s2 n
∑( x − x ) i
i =1
2
.
(7.26)
If b is less than
$$\beta_0 + t^{(n-2)}_{\alpha}\sqrt{\frac{s^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}},$$
we accept the null hypothesis H0 : β = β0. If not, we reject the null hypothesis in favor of the alternative H1 : β > β0.
The only difference between the arithmetic for the one-tailed test in equation (7.26) and the two-tailed test in equation (7.17) is that the two-tailed test requires $t^{(n-2)}_{\alpha/2}$ and the one-tailed test needs $t^{(n-2)}_{\alpha}$. At the 5% significance level, the latter value is 2.353 with 3 degrees of freedom and 1.734 with 18 degrees of freedom. Accordingly, the one-tailed test for the sample from table 3.1 is
$$.95 = P\bigl(b < 4,000 + 2.353(2,807.6)\bigr) = P(b < 10,606). \qquad (7.27)$$
It is
$$.95 = P\bigl(b < 4,000 + 1.734(1,204.1)\bigr) = P(b < 6,087.9) \qquad (7.28)$$
for the sample of table 3.3 and
$$.95 = P\bigl(b < 4,000 + 1.734(1,374.7)\bigr) = P(b < 6,383.7) \qquad (7.29)$$
for the sample of table 5.1. The estimates of b for all three samples lie within the acceptance regions for these hypothesis tests. Therefore, we cannot reject the hypothesis that an additional year of schooling won't provide the required return. Consequently, we shouldn't continue in school.10
Or should we? What is the possibility that, in making this decision, we're committing a Type II error? Recall, from section 6.4.3, that a Type II error occurs when we fail to reject the null hypothesis, even though it's false. As we said there, we have to have an alternative hypothesis that specifies a precise alternative value in order to assess the probability that we've made this mistake.
Let's adopt an alternative hypothesis that specifies annual returns to education well in excess of the minimum necessary to justify continuing in school. How about H1 : β1 = 5,000? If the true annual returns to education were $5,000, what is the probability that we would, nevertheless, estimate annual returns to education that would not reject H0 : β0 = 4,000?
Equation (6.36) provides the answer. Let's apply it to the Census data of table 3.3. We replace δ0, δ1, and d in that equation with the values for β0, β1, and b. In addition, we have $t^{(18)}_{.05} = 1.734$ and SD(b) = 1,204.1:
$$P(\text{Type II error}) = P\!\left(t^{(18)} \le \frac{4,000 + 1.734(1,204.1) - 5,000}{1,204.1}\right) = P\bigl(t^{(18)} \le .904\bigr). \qquad (7.30)$$
Figure 7.1 illustrates equation (7.30). According to equation (7.28), the acceptance region for H0 includes all values less than $6,087.90. The probability of a Type II error, if the alternative hypothesis is true, is the area between −∞ and $6,087.90 under the density function for the t-distribution centered at β1 = 5,000. The probability of observing an outcome in the half of this density function that lies below 5,000 is, naturally, .5. According to equation (7.30), the probability of observing an outcome in the part of this density function between 5,000 and 6,087.9 is equal to the probability of observing an outcome above E(b) by less than .904 of the standard deviation of b.
[Figure 7.1: Probability of a Type II error for the Census data of table 3.3 with H0 : β0 = 4,000 and H1 : β1 = 5,000. The figure plots the density function for b under H1, centered at 5,000, with the null value 4,000 marked; the acceptance region lies below 6,087.9, and the area under the density below that cutoff is labeled P(Type II error).]
Once again, table A.2 for the t-statistic doesn't give this probability exactly. However, it gives $t^{(18)}_{.25} = .69$ and $t^{(18)}_{.10} = 1.33$. Therefore,
$$t^{(18)}_{.25} < .904 < t^{(18)}_{.10}$$
and
$$.25 < P\bigl(0 < t^{(18)} < .904\bigr) < .40.$$
Adding in the .5 probability that $t^{(18)}$ will be less than zero, we get
$$.75 < P\bigl(-\infty < t^{(18)} < .904\bigr) < .90. \qquad (7.31)$$
The probability of a Type II error with H1 : β1 = 5,000 is between 75% and 90%! To summarize, we’ve designed the hypothesis test in equation (7.28) to set the probability of a Type I error to 5%. This means that if we had rejected the null hypothesis, we would have run only a 5% risk of doing so when it was really true. However, that’s not the action we’ve chosen. Instead, we’ve failed to reject the null hypothesis. Consequently, we can make a mistake only if the null hypothesis is actually false. Equation (7.31) tells us that, if β is really $5,000, the probability of making this mistake is at least 75%. This is troubling. The difference between not rejecting and rejecting is a big deal. It’s the difference between dropping out and continuing in school. If the right thing to do is to continue in school, there’s at least a 75% chance that, on the basis of the hypothesis test in equation (7.28), we’ll make the wrong choice.
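Table A.2 only brackets this probability. A statistical library gives it exactly; the sketch below is my own, using scipy, and simply repeats the calculation in equation (7.30) before evaluating the t-distribution function directly.

    from scipy import stats

    sd_b = 1204.1
    t05 = stats.t.ppf(0.95, 18)                       # one-tailed 5% critical value, about 1.734
    cutoff = (4000.0 + t05 * sd_b - 5000.0) / sd_b    # about 0.904
    print(stats.t.cdf(cutoff, 18))                    # about 0.81, inside the .75-.90 bracket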
Yet again, the problem here is really with SD(b). It's large because our sample is small. Therefore, we can't discriminate between lots of values that have very different substantive implications. Our value for β0 implies that we shouldn't invest any more in education, and our value for β1 implies that we should. However, the difference between them is less than one estimated standard deviation. Statistically, we can't really tell them apart.
We can demonstrate this point forcefully by examining the probability of a Type II error in the regressions on larger samples of Census data from the second, third, and fourth regressions of table 7.4. Equation (7.30) serves as our model. We retain the values there for our null hypothesis, our alternative hypothesis, and our level of significance. We only have to change two things in that equation.
First, the value for $t^{(df)}_{.05}$ has to correspond to the actual degrees of freedom. The second regression of table 7.4 has 98 degrees of freedom. This is arguably enough to approximate the relevant t-distribution with that for the standard normal. Consequently, we adopt the value $t^{(\infty)}_{.05} = 1.64$ for that regression and retain it for the others, where the degrees of freedom are even greater. Looking back at equation (7.26), we see that the effect of this change is to move the boundary of the acceptance region closer to the value for β0. With 98 degrees of freedom, this boundary is 1.64 standard deviations of b above β0. With only 18 degrees of freedom, it was 1.734 standard deviations above. This reduces the chances that we will mistake an estimate based on β1 as consistent with β0.
Second, we have to use the standard deviation for b estimated by each regression. These standard deviations appear in the third column of table 7.4. As discussed there, these standard deviations decline with sample size. This will also shrink the acceptance regions and reduce the probability of a Type II error.
For the sample with 100 observations, the probability of a Type II error is therefore
$$P(\text{Type II error}) = P\!\left(t^{(98)} \le \frac{4,000 + 1.64(946.71) - 5,000}{946.71}\right) = P\bigl(t^{(98)} \le .583\bigr).$$
The same reasoning that we employed after equation (7.30) gives
$$.5 < P(\text{Type II error}) < .75.$$
Expanding the sample by a factor of five has noticeably reduced the probability of a Type II error, though not yet to the point where we’re comfortable with it.
The corresponding calculation for the sample with 1,000 observations is
$$P(\text{Type II error}) = P\!\left(t^{(998)} \le \frac{4,000 + 1.64(342.21) - 5,000}{342.21}\right) = P\bigl(t^{(998)} \le -1.28\bigr).$$
According to table A.2, this probability is exactly 10%. The increase in sample size from 100 to 1,000 has reduced the probability of a Type II error from something astronomical to something that’s acceptable. Of course, it gets even better. With the sample of 10,000,
$$P(\text{Type II error}) = P\!\left(t^{(9,998)} \le \frac{4,000 + 1.64(113.03) - 5,000}{113.03}\right) = P\bigl(t^{(9,998)} \le -7.207\bigr).$$
This probability is infinitesimal. Because of the size of this sample, the estimated standard deviation of b is only 113.03. In this case, the null and alternative hypotheses, β0 and β1, are nearly nine standard deviations apart. That is an enormous distance in statistical terms. Consequently, there is essentially no chance that we could be led to believe that the annual returns to education were $4,000 when they were really $5,000.11
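The three calculations above differ only in the estimated standard deviation of b, so they collapse into a short loop. This sketch is mine, not the author's; like the text, it uses the normal-approximation critical value of 1.64 throughout.

    from scipy import stats

    for n, sd_b in [(100, 946.71), (1_000, 342.21), (10_000, 113.03)]:
        cutoff = (4000.0 + 1.64 * sd_b - 5000.0) / sd_b
        print(n, round(stats.t.cdf(cutoff, n - 2), 4))
    # Roughly 0.72, 0.10, and essentially 0: the Type II error probability
    # collapses as the sample grows.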
7.5 Predicting yi Again

The last analytical issue that we have to deal with in this chapter is the problem that originally attracted our attention in chapter 4: predicting the value of y0 that we would expect to be associated with a given value x0. There, we agreed that we would form that prediction according to equation (4.1):
$$\hat{y}_0 = a + bx_0.$$
The question now is, what are the statistical properties of ŷ0? Perhaps the prior question should be, why does ŷ0 have interesting statistical properties at all? The answer to this is that, as given in equation (4.1), ŷ0 is a linear combination of the regression estimators b and a. Each of these estimators is a linear combination of the yi's and, therefore, a linear combination of the εi's. Consequently, ŷ0 is also a linear combination of the εi's. This
means that it is a random variable, with properties that derive from those originally assumed for the εi's.
The first property of interest is the expected value of ŷ0. Beginning with equation (4.1), we find
$$E(\hat{y}_0) = E(a + bx_0) = E(a) + E(bx_0).$$
The first equality just expresses the definition of ŷ0. The second equality is an implication of the general rule in equation (5.13) that the expected value of a sum is the sum of the expected values. Replacing E(bx0) with x0E(b), according to equation (5.34), and the expected values of b and a with β and α, as given in equations (5.37) and (5.42), we have
$$E(\hat{y}_0) = E(a) + E(bx_0) = \alpha + x_0 E(b) = \alpha + \beta x_0.$$
Finally, we recall equation (5.16) to recognize that
$$E(\hat{y}_0) = \alpha + \beta x_0 = E(y_0). \qquad (7.32)$$
What does this say? Remember, y0 is the value that the population relationship actually assigns to x0. ŷ0 is the value that the regression line we have fitted to our sample assigns to x0. Equation (7.32) says that the expected values of these two assignments are the same. In other words, ŷ0 is an unbiased estimator of E(y0).
Intuitively, equation (7.32) says that the typical value of ŷ0 that we predict for x0 from equation (4.1) is the same as the typical value that the population relationship would assign to x0 from equation (5.1). Imagine assembling a sample of observations, all of which share the same value for the explanatory variable, x0. The collection of values for y0 would form a cluster concentrated around E(y0), whose true value we don't know. However, ŷ0 is a good estimator of this central tendency, because it is an unbiased estimator of this value.
Why isn't ŷ0 an unbiased estimator of a specific y0, the actual value of the dependent variable for a specific observation with the value of x0 for its explanatory variable? According to equations (5.16) and (5.17), the difference between y0 and E(y0) is that the latter doesn't contain ε0. We don't have to worry about estimating ε0 if we only want to obtain an unbiased estimator of E(y0), so this is easy. This is what ŷ0 accomplishes. Another way to say this is to recognize that E(y0) contains E(ε0), which doesn't need to be estimated because we've already assumed that it's equal to zero in equation (5.5).
We would have to estimate the value of the disturbance assigned to a specific observation, ε0 itself, if we wanted to estimate y0. But ε0 is random. Given the assumptions of chapter 5, there is no way for us to know anything about
the specific value of ε0 that will be assigned to a particular observation with x0 as its value for the explanatory variable.12 For the purposes of this course, at least, an unbiased estimator of E(y0) is the best that we can expect to do.
As in the analysis that we undertook in chapter 5 of the properties of a and b, it's very nice to know that ŷ0 is an unbiased estimator of E(y0), but it is also important to understand how different a specific value for ŷ0 might be from the true value of E(y0). As in chapter 5, the answer to this question is given by the population variance of the estimator, in this case ŷ0.
The first thing to note in this regard is that, according to equation (4.1), ŷ0 is a linear combination of a and b. In section 5.9, we established that a and b were the best linear unbiased (BLU) estimators of α and β. Therefore, ŷ0 must be the best linear unbiased estimator of E(y0). Whatever V(ŷ0) turns out to be, we already know that it has to be as small or smaller than the population variance for any other linear unbiased estimator of E(y0).
It's helpful to be fortified with this knowledge because the actual derivation of V(ŷ0) offers a nice opportunity to exercise our derivational skills. The easiest approach is through the first equality of equation (5.4):
$$V(\hat{y}_0) = E\bigl(\hat{y}_0 - E(\hat{y}_0)\bigr)^2 = E\bigl((a + bx_0) - (\alpha + \beta x_0)\bigr)^2,$$
where we replace ŷ0 according to equation (4.1) and E(ŷ0) according to equation (7.32). Regrouping, we have
$$V(\hat{y}_0) = E\bigl((a - \alpha) + x_0(b - \beta)\bigr)^2.$$
Squaring, we obtain
$$V(\hat{y}_0) = E\Bigl((a - \alpha)^2 + x_0^2(b - \beta)^2 + 2x_0(a - \alpha)(b - \beta)\Bigr).$$
We now have the expected value of a sum of three terms. Again, equation (5.13) allows us to rewrite the expected value of a sum as the sum of the expected values:
$$V(\hat{y}_0) = E(a - \alpha)^2 + E\bigl(x_0^2(b - \beta)^2\bigr) + E\bigl(2x_0(a - \alpha)(b - \beta)\bigr).$$
Equation (5.34) reminds us that the expected value of a constant times a random variable is the constant times the expected value of the random variable, so the second and third terms can be rewritten to produce
$$V(\hat{y}_0) = E(a - \alpha)^2 + x_0^2\,E(b - \beta)^2 + 2x_0\,E\bigl((a - \alpha)(b - \beta)\bigr).$$
Here, we recognize that E(a) = α from equation (5.42) and E(b) = β from equation (5.37). Therefore, equation (5.4) tells us that the first term is V(a) and the second term is V(b). Equation (5.8) identifies the expectation in the third term as COV(a, b). Therefore,
$$V(\hat{y}_0) = V(a) + x_0^2\,V(b) + 2x_0\,\mathrm{COV}(a, b).$$
It is now evident that, in order to make further progress, we're going to have to consider COV(a, b). We ducked this one in chapter 5 because it wasn't immediately relevant and it isn't much fun. It still isn't much fun, but now we need it, so we're simply going to state its value without derivation:13
$$\mathrm{COV}(a, b) = \frac{-\sigma^2\,\bar{x}}{\sum_{i=1}^{n}(x_i - \bar{x})^2}.$$
If we combine this result with the formulas for V(b) in equation (5.50) and for V(a) in equation (5.51) and rearrange as if possessed, we get the population variance14
$$V(\hat{y}_0) = \frac{\sigma^2\sum_{i=1}^{n}(x_i - x_0)^2}{n\sum_{i=1}^{n}(x_i - \bar{x})^2} \qquad (7.33)$$
and the population standard deviation
$$SD(\hat{y}_0) = +\sqrt{\frac{\sigma^2\sum_{i=1}^{n}(x_i - x_0)^2}{n\sum_{i=1}^{n}(x_i - \bar{x})^2}}. \qquad (7.34)$$
What does this mean? Well, it's obvious that, if we choose to predict the dependent variable at the average value for the explanatory variable, x0 = x̄, then
$$V(\hat{y}_0 \mid x_0 = \bar{x}) = \frac{\sigma^2}{n} \qquad (7.35)$$
and
$$SD(\hat{y}_0 \mid x_0 = \bar{x}) = \frac{\sigma}{\sqrt{n}}. \qquad (7.36)$$

TABLE 7.8  Effect of sample size on the width of 95% confidence intervals for ŷ0 in the simulated data of table 5.1

   Sample size        ŷ0   Population SD of ŷ0   Lower bound for E(y0)   Upper bound for E(y0)      Width
            20    35,023               5,590.2                  23,278                  46,768     23,490
           100    35,023               2,500.0                  30,060                  39,985    9,925.0
         1,000    35,023                790.57                  33,473                  36,572    3,099.0
        10,000    35,023                250.00                  34,533                  35,513     980.00
       100,000    35,023                79.057                  34,868                  35,178     309.90

NOTE: The prediction of ŷ0 is at the sample average value for x0, 13, for all samples.
We immediately notice that the variance of this prediction varies inversely with the sample size. This has two implications. Formally, ŷ0 calculated at x̄ is a consistent estimator for E(y0 | x0 = x̄). It is unbiased, and its variance becomes negligible as n approaches infinity. Practically speaking, our predictions of ŷ0 calculated at x̄ are more reliable if they are based on larger samples.15
Table 7.8 demonstrates this conclusion. The second column of the first row reports ŷ0, as predicted by the regression equation for the simulated sample in table 5.1 for the value x0 = x̄ = 13:16
$$\hat{y}_0 = -23,979 + 4,538.6(13) = 35,023. \qquad (7.37)$$
The third column of this row presents the population standard deviation of this prediction, according to equation (7.36), based on the simulated values of σ = 25,000 and n = 20. Equation (7.32) demonstrates that the expected value of ŷ0 is simply E(y0). Accordingly, the fourth and fifth columns of the first row report the lower and upper bounds on the 95% confidence interval for E(y0) using equation (6.8) with all of the appropriate replacements. Finally, the sixth column of the first row of table 7.8 gives the width of this confidence interval from the analogue to equation (7.15) in the case where the estimator is ŷ0.
This confidence interval is huge! Taken seriously, it implies that someone with one year of college could reasonably expect annual earnings anywhere from $23,278 to $46,768. In practical terms, that's not much different from saying that we really don't know what someone with this much schooling can expect.
The second column of the second row of table 7.8 again predicts the value of ŷ0 for x0 = 13 using equation (7.37). However, the rest of this row assumes that n = 100 rather than 20. The consequences are readily apparent. The population standard deviation of ŷ0 in the third column of this row is less than that in the first row by a factor of 2.236.17 Consequently, the width of the confidence interval in this row is smaller than that in the preceding row by exactly the same factor. It's also small enough that, were it based on authentic data, we might begin to feel comfortable taking guidance from it as we make our educational choices.
The succeeding rows of table 7.8 continue to simulate the same confidence interval using equation (7.37) and σ = 25,000 and simply increasing the sample by a factor of 10 from row to row. The population standard deviations in column 3 decline by a factor of 3.162. The width of the corresponding confidence interval declines by the same factor. By the time the sample size reaches 100,000, the confidence interval for E(y0) is so narrow that we would almost certainly make the same choices whether we believed the lowest or the highest value within it.
What about when we want to predict the value of yi for a value of xi that is not equal to the average, x0 ≠ x̄? To answer this, we have to examine the behavior of the numerator in the second ratio of equation (7.33), $\sum_{i=1}^{n}(x_i - x_0)^2$. The appendix to this chapter proves that this summation attains its smallest value when x0 = x̄. Its value increases monotonically as x0 moves away from x̄ in either direction. Therefore, V(ŷ0) also increases as |x0 − x̄| increases.
In other words, our predictions of E(y0 | x0) are most reliable when we make them for x0 = x̄. They become less reliable as we choose values of x0 that are farther away from the typical values observed in our sample, represented by their average, x̄. For this reason, we will frequently hear the statement that "out-of-sample" predictions from a regression line (meaning predictions for values of the explanatory variable that are outside the range of values that appears in the sample) are to be treated with special caution.18 In our terms, what this means is that, if we calculate ŷ0 for values of x0 that are so extreme that they don't even appear in the sample, we can expect the variance of our prediction to be large or even huge. Confidence intervals for E(y0) may be too wide to be informative. This is another way of saying that our estimator might be, for practical purposes, unreliable.
Table 7.9 illustrates this point. It is based on the same assumptions as is the third row of table 7.8. Accordingly, the row for x0 = 13 in this table is identical to the row for n = 1,000 there. The difference between the two tables is that the other rows of table 7.9 present predictions, standard errors, and confidence intervals for ŷ0 predicted at other values for x0. The predictions in the second column of table 7.9 are no surprise. They indicate that predicted annual earnings are very low at low levels of education.
TABLE 7.9  Effect of differences in x0 on the width of 95% confidence intervals for ŷ0 in simulated data

   x0         ŷ0   Σ(xi − x0)²   Standard deviation of ŷ0   Lower bound for E(y0)   Upper bound for E(y0)      Width
    4   −5,824.6        89,500                    2,565.3                 −10,853                 −796.57     10,056
    8     12,330        33,500                    1,569.5                 9,253.6                  15,406    6,152.4
    9     16,868        24,500                    1,342.2                  14,238                  19,499    5,261.4
   10     21,407        17,500                    1,134.4                  19,184                  23,630    4,446.7
   11     25,946        12,500                     958.71                  24,067                  27,825    3,758.1
   12     30,484         9,500                     835.78                  28,846                  32,122    3,276.3
   13     35,023         8,500                     790.57                  33,473                  36,572    3,099.0
   14     39,561         9,500                     835.78                  37,923                  41,200    3,276.3
   15     44,100        12,500                     958.71                  42,221                  45,979    3,758.1
   16     48,639        17,500                    1,134.4                  46,415                  50,862    4,446.7
   17     53,177        24,500                    1,342.2                  50,547                  55,808    5,261.4
   18     57,716        33,500                    1,569.5                  54,640                  60,792    6,152.3
   22     75,870        89,500                    2,565.3                  70,842                  80,898     10,056

NOTE: Sample size is 1,000. σ/√n = 790.57.
They increase by approximately $4,539 with each additional year of schooling. This is, of course, the value of the slope in equation (7.37). The third column contains the essential news in this table. It demonstrates the extent to which $\sum_{i=1}^{n}(x_i - x_0)^2$ varies with x0. For extreme values, such as x0 = 4 or x0 = 22, this sum is more than 10 times its value for x0 = x̄ = 13.
The fourth column of table 7.9 reports the consequences for the standard deviation of ŷ0, as given in equation (7.34). It is smallest for x0 = 13. It increases with values for x0 farther from the average for xi. At the extremes of x0 = 4 and x0 = 22, SD(ŷ0) is more than three times its value for x0 = 13.
Inevitably, this pattern and equation (7.15) imply the final column of table 7.9. Confidence intervals for predicted values of the dependent variable are wider the farther x0 is from the average value for xi. Those for the values outside the sample, x0 = 4 and x0 = 22, are approximately 40% larger than those for the largest values within the sample, x0 = 8 and x0 = 18. Whether confidence intervals for the more extreme values are useful depends, of course, on the context. There can be no doubt, however, that they are wider than those for more typical values of the explanatory variable.19
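The pattern in table 7.9 can be reproduced from equation (7.34) alone. The sketch below is mine; it uses the design behind the table (n = 1,000, σ = 25,000, x̄ = 13, and a sum of squared deviations of 8,500) and the identity $\sum_{i=1}^{n}(x_i - x_0)^2 = \sum_{i=1}^{n}(x_i - \bar{x})^2 + n(x_0 - \bar{x})^2$.

    import numpy as np
    from scipy import stats

    sigma, n, xbar, ssx = 25000.0, 1000, 13.0, 8500.0
    z = stats.norm.ppf(0.975)                       # about 1.96, appropriate for n = 1,000
    for x0 in (4, 8, 13, 18, 22):
        sd_yhat = sigma * np.sqrt(1.0 / n + (x0 - xbar) ** 2 / ssx)
        print(x0, round(sd_yhat, 1), round(2 * z * sd_yhat, 1))
    # SD(yhat0) is smallest at x0 = 13 and roughly triples at x0 = 4 or 22,
    # matching the fourth and last columns of table 7.9.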
7.6 What Can We Say So Far about the Returns to Education?

After all of this, what can we actually say about the returns to education? The 1% PUMS for California contains a total of 179,549 observations
that meet the restrictions we imposed in the appendix to chapter 3. The regression of equation (4.1), applied to this sample, estimates b = 3,845.7 from equation (4.40) and SD(b) = 26.649 from equation (7.5). What can we conclude? First, the 95% confidence interval for β, from equation (7.9), is
$$.95 = P\bigl(3{,}845.7 - 1.96(26.649) \le \beta \le 3{,}845.7 + 1.96(26.649)\bigr) = P(3{,}793.5 \le \beta \le 3{,}897.9). \tag{7.38}$$
Its width is only $104.40!20 Based on this regression, it’s hard to believe that the true annual returns to education are much less than $3,800 or much more than $3,900. In particular, it’s very clear that they’re not zero. Zero isn’t even close to being in the confidence interval. For the record, the hypothesis test for the null hypothesis H0 : β0 = 0, from equation (7.16), is
$$.95 = P\bigl(-1.96(26.649) < b < 1.96(26.649)\bigr) = P(-52.232 < b < 52.232).$$
The acceptance region for this hypothesis runs only from −$52.23 to $52.23. Our estimate is firmly in the rejection region. We get the same result from our other two testing methods. The t-statistic in this case is 3,845.7/26.649, or 144.31. As we can see from equation (7.38), the critical value is 1.96. The t-statistic exceeds the critical value, by a lot, so again we reject the null hypothesis. Finally, this t-statistic is so big that we can’t find it in table A.2, or even table A.1, which is the table A.2 equivalent for large samples. All we can really say is that the p-value associated with this t-statistic is less than .0001. With α set at .05, the null hypothesis doesn’t have a chance. Suppose we had determined that an additional year of schooling only made economic sense for us if we could obtain an annual return of more than $3,500. This implies a one-tailed test, where the null hypothesis specifies that the return isn’t enough, H0 : β0 = 3,500, and an alternative hypothesis that it is, H1 : β1 > 3,500. The test itself, from equation (7.26), is
$$.95 = P\bigl(b < 3{,}500 + 1.645(26.649)\bigr) = P(b < 3{,}543.8). \tag{7.39}$$
Our estimate, $3,845.70, isn’t in this acceptance region, either. Accordingly, we would go to school.21 There’s no opportunity for a Type II error here, because we haven’t accepted the null hypothesis. Nevertheless, let’s work out a probability for this type of error, just for practice. Imagine that our alternative hypothesis is precise rather than diffuse: H1 : β1 = 4,000. The probability that we would observe an
estimate in the acceptance region for the null hypothesis H0 : β0 = 3,500 when the alternative hypothesis is really true is given by equation (6.36):
$$P(\text{Type II error}) = P\left(t(179{,}547) \le \frac{3{,}500 + 1.645(26.649) - 4{,}000}{26.649}\right) = P\bigl(t(179{,}547) \le -17.12\bigr).$$
The critical value here, −17.12, isn't as far off the charts as is our t-statistic of 144.31, but it's far enough. The probability of a Type II error in this case is again less than .0001. This is no surprise. The null and alternative hypotheses are $500 apart. That might not seem like a lot in financial terms. However, the standard deviation is only 26.649. In statistical terms, they're more than 17 standard deviations apart. Might as well be from here to the moon. With the data at hand, there's virtually no chance that we might mistake one for the other.

This is as far as this chapter can take us. Our final observation returns to chapter 4 and anticipates chapter 11. The R² statistic for the regression discussed here is .104. In other words, education explains approximately 10% of the variation in earnings within this sample. We seem to be getting a pretty strong signal that education affects earnings, but an equally strong signal that there's a lot of other stuff going on as well. As things currently stand, we don't have any way to investigate whether other factors are also important in the determination of earnings. This suggests that the analysis we've developed until this point may not be able to address all of the questions that we would find interesting. We'll talk about how to extend it to incorporate other explanatory variables in chapter 11.
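For readers who want to retrace these calculations, here is a minimal sketch. It assumes only the reported estimates b = 3,845.7 and SD(b) = 26.649, it assumes scipy is available, and it uses the standard normal in place of the t-distribution with 179,547 degrees of freedom, just as the text does when it takes 1.96 and 1.645 as critical values.

```python
# Re-trace the confidence interval, tests, and Type II error probability for
# the Census regression, using the reported b and SD(b).
from scipy.stats import norm

b, sd_b = 3_845.7, 26.649

# 95% confidence interval for beta, as in equation (7.38)
lower, upper = b - 1.96 * sd_b, b + 1.96 * sd_b
print(f"95% CI for beta: ({lower:,.1f}, {upper:,.1f})")

# Two-tailed test of H0: beta_0 = 0
t_stat = b / sd_b
p_two_tailed = 2 * norm.sf(t_stat)      # underflows to zero; the text reports it as < .0001
print(f"t = {t_stat:.2f}, two-tailed p-value = {p_two_tailed:.1e}")

# One-tailed test of H0: beta_0 = 3,500 against H1: beta_1 > 3,500, equation (7.39)
threshold = 3_500 + 1.645 * sd_b
print(f"acceptance region: b < {threshold:,.1f}; observed b = {b:,.1f}, so reject")

# Probability of a Type II error if beta is really 4,000
p_type2 = norm.cdf((threshold - 4_000) / sd_b)
print(f"P(Type II error | beta = 4,000) = {p_type2:.1e}")
```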
7.7 Another Example

Here's another example of inference with respect to regression estimates. Corruption is the exercise of extralegal power to divert resources from those with legal claims to them. Bribes and nepotism are probably its most familiar manifestations. How does corruption affect the overall level of economic activity? If individuals and businesses fear that they could lose wealth that legitimately belongs to them, it probably makes the acquisition of wealth less attractive. Consequently, individuals and businesses probably don't put as much effort into acquiring, protecting, or developing wealth. This suggests that corruption should reduce the overall level of economic activity.22
Table 7.10 presents data on levels of economic activity and perceived levels of corruption in 100 different countries. It measures the level of economic activity using gross national income (GNI) per capita. This income measure is a recent variant on gross domestic product (GDP). It is explained in greater detail at the World Bank Web site, http://data.worldbank.org/indicator/NY.GNP.PCAP.CD, which is also the source of the GNI per capita data in table 7.10.

The average value of GNI per capita in this sample is $11,061. The range around this value is huge. The minimum, in unfortunate Tanzania, is $550. The maximum value is nearly two orders of magnitude, or roughly 100 times, greater than the minimum. It occurs in Luxembourg, where GNI is equal to $51,060 per inhabitant.

The corruption index in table 7.10 is from Transparency International. Its Web site, http://www.transparency.org/policy_research/surveys_indices/cpi/2002, is the source of these data, as well as the following definition: "The Transparency International Corruption Perceptions Index [CPI] ranks countries in terms of the degree to which corruption is perceived to exist among public officials and politicians."23 Theoretically, the CPI would assign a score of 10 to a country that was free of corruption and a score of 0 to a country that was totally corrupt. In practice, the maximum measured value is 9.7, achieved by Finland. The minimum of 1.2 occurs in Bangladesh. The average CPI score across all 100 countries is 4.568.

If the question is how corruption affects income, the natural answer, at least at this stage, would be to run the regression of equation (4.1), in which the income measure of table 7.10 is the dependent variable and the corruption measure in that table is the explanatory variable. Why not? Here's what it looks like:

$$\text{GNI per capita} = -\$6{,}927.50 + \$3{,}938.10\,\text{CPI}. \tag{7.40}$$
What does it mean? If this were the end of chapter 4, we would say that, according to the slope, each point on the CPI is worth nearly $4,000 in GNI per capita. This seems like a very large effect. After all, for example, the entire GNI per capita in the Russian Federation is only $7,820, or less than the effect associated with a two-point difference in our corruption measure.

We wouldn't take the intercept literally, but, in conjunction with the slope, we would conclude that a country has to rise noticeably above the level of total corruption in order to have any measurable economic activity at all. In fact, the minimum corruption score in the sample isn't enough to yield a positive prediction for GNI per capita.24

However, this is the end of chapter 7. Perhaps the most important thing that we have learned is that we shouldn't offer either of the interpretations in the previous two paragraphs without checking for statistical significance first.
TABLE 7.10 Gross national income per capita and corruption by country in 2002

| Country | Gross national income per capita | Corruption index | Country | Gross national income per capita | Corruption index |
|---|---|---|---|---|---|
| Finland | 25,440 | 9.7 | Mauritius | 10,530 | 4.5 |
| Denmark | 29,450 | 9.5 | South Korea | 16,480 | 4.5 |
| New Zealand | 20,020 | 9.5 | Greece | 18,240 | 4.2 |
| Iceland | 28,590 | 9.4 | Brazil | 7,250 | 4.0 |
| Singapore | 23,090 | 9.3 | Bulgaria | 6,840 | 4.0 |
| Sweden | 25,080 | 9.3 | Jamaica | 3,550 | 4.0 |
| Canada | 28,070 | 9.0 | Peru | 4,800 | 4.0 |
| Luxembourg | 51,060 | 9.0 | Poland | 10,130 | 4.0 |
| Netherlands | 27,470 | 9.0 | Ghana | 2,000 | 3.9 |
| United Kingdom | 25,870 | 8.7 | Croatia | 9,760 | 3.8 |
| Australia | 26,960 | 8.6 | Czech Republic | 14,500 | 3.7 |
| Norway | 35,840 | 8.5 | Latvia | 8,940 | 3.7 |
| Switzerland | 31,250 | 8.5 | Morocco | 3,690 | 3.7 |
| Hong Kong | 26,810 | 8.2 | Slovak Republic | 12,190 | 3.7 |
| Austria | 28,240 | 7.8 | Sri Lanka | 3,290 | 3.7 |
| United States | 35,060 | 7.7 | Colombia | 5,870 | 3.6 |
| Chile | 9,180 | 7.5 | Mexico | 8,540 | 3.6 |
| Germany | 26,220 | 7.3 | China | 4,390 | 3.5 |
| Israel | 19,260 | 7.3 | Dominican Republic | 5,870 | 3.5 |
| Belgium | 27,350 | 7.1 | Ethiopia | 720 | 3.5 |
| Japan | 26,070 | 7.1 | Egypt | 3,710 | 3.4 |
| Spain | 20,460 | 7.1 | El Salvador | 4,570 | 3.4 |
| Ireland | 28,040 | 6.9 | Thailand | 6,680 | 3.2 |
| Botswana | 7,770 | 6.4 | Turkey | 6,120 | 3.2 |
| France | 26,180 | 6.3 | Senegal | 1,510 | 3.1 |
| Portugal | 17,350 | 6.3 | Panama | 5,870 | 3.0 |
| Slovenia | 17,690 | 6.0 | Malawi | 570 | 2.9 |
| Namibia | 6,650 | 5.7 | Uzbekistan | 1,590 | 2.9 |
| Estonia | 11,120 | 5.6 | Argentina | 9,930 | 2.8 |
| Italy | 25,320 | 5.2 | Ivory Coast | 1,430 | 2.7 |
| Uruguay | 12,010 | 5.1 | Honduras | 2,450 | 2.7 |
| Hungary | 12,810 | 4.9 | India | 2,570 | 2.7 |
| Malaysia | 8,280 | 4.9 | Russian Federation | 7,820 | 2.7 |
| Trinidad and Tobago | 8,680 | 4.9 | Tanzania | 550 | 2.7 |
| Belarus | 5,330 | 4.8 | Zimbabwe | 2,120 | 2.7 |
| Lithuania | 9,880 | 4.8 | Pakistan | 1,940 | 2.6 |
| South Africa | 9,870 | 4.8 | Philippines | 4,280 | 2.6 |
| Tunisia | 6,280 | 4.8 | Romania | 6,290 | 2.6 |
| Costa Rica | 8,260 | 4.5 | Zambia | 770 | 2.6 |
| Jordan | 4,070 | 4.5 | Albania | 4,040 | 2.5 |
| Guatemala | 3,880 | 2.5 | Moldova | 1,560 | 2.1 |
| Venezuela | 5,080 | 2.5 | Uganda | 1,320 | 2.1 |
| Georgia | 2,210 | 2.4 | Azerbaijan | 2,920 | 2.0 |
| Ukraine | 4,650 | 2.4 | Indonesia | 2,990 | 1.9 |
| Vietnam | 2,240 | 2.4 | Kenya | 990 | 1.9 |
| Kazakhstan | 5,480 | 2.3 | Angola | 1,730 | 1.7 |
| Bolivia | 2,300 | 2.2 | Madagascar | 720 | 1.7 |
| Cameroon | 1,640 | 2.2 | Paraguay | 4,450 | 1.7 |
| Ecuador | 3,130 | 2.2 | Nigeria | 780 | 1.6 |
| Haiti | 1,580 | 2.2 | Bangladesh | 1,720 | 1.2 |
As it turns out, SD(b) = 211.98, the t-statistic for the slope is 18.58, and the associated p-value is less than .0001. In other words, there is no real doubt that the underlying coefficient in the population is not zero. With this reassurance, our initial interpretation of the slope stands. Similarly, SD(a) = 1,091.2, the t-statistic for the intercept is 6.35, and the associated p-value is again less than .0001. Notice that, although this t-statistic is only about one-third of that associated with b, its minuscule p-value indicates that it's still huge in statistical terms. There's no real doubt that the intercept is nonzero, either. So our original interpretation of the constant turns out to be appropriate as well.

This analysis would seem to suggest that dishonest behavior is exceptionally destructive to economic activity. This is a conclusion that most of us might find reassuring. However, the R² value is .779. Is it really plausible that the level of corruption explains essentially 78% of the variation in GNI per capita? And, if so, why is there so much corruption? After all, the average country doesn't even make it to the halfway mark on the CPI scale.

Is it possible that people are especially likely to engage in dishonest behavior when they're especially desperate? But this would imply that corruption is a consequence of low levels of economic activity, as much as a cause. Does this mean that we've got our independent and dependent variables mixed up? Or that each of our variables should be both? Perhaps. However, we don't have any way, at the moment, to deal with the possibility that the direction of causality here is ambiguous. As at the end of the previous section, the analysis here raises questions that can't be answered with the results that we've developed so far. We'll address this particular question in chapter 10.
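For readers who would like to reproduce equation (7.40) and the statistics just quoted, here is a minimal sketch. It uses the single-regressor formulas from chapters 4 and 7 (the intercept's standard error is the standard simple-regression formula), and it assumes the two columns of table 7.10 have been typed into arrays named cpi and gni; those names, and the commented-out call, are placeholders rather than anything supplied with the book.

```python
# Simple OLS of GNI per capita on the corruption index, with standard errors,
# t-statistics, and R-squared computed from the single-regressor formulas.
import numpy as np

def simple_ols(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    b = np.sum((x - x.mean()) * (y - y.mean())) / sxx      # slope
    a = y.mean() - b * x.mean()                             # intercept
    e = y - a - b * x                                       # calculated errors
    s2 = np.sum(e ** 2) / (n - 2)                           # estimate of sigma^2
    sd_b = np.sqrt(s2 / sxx)
    sd_a = np.sqrt(s2 * np.sum(x ** 2) / (n * sxx))
    r2 = 1.0 - np.sum(e ** 2) / np.sum((y - y.mean()) ** 2)
    return a, b, sd_a, sd_b, r2

# Example call, once the 100 pairs from table 7.10 are available:
# a, b, sd_a, sd_b, r2 = simple_ols(cpi, gni)
# The results should come out near the values reported in the text: a slope of
# about 3,938, an intercept of about -6,928, t-statistics of roughly 18.6 and
# 6.4 in absolute value, and an R-squared of about .78.
```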
7.8 Conclusion

Inference is at the core of the statistical enterprise. If we stop at the end of chapter 4, all we can say is that we have fitted a line to a sample. If we stop at the end of chapter 5, all we can say is that collections of slope and intercept estimators, calculated as in chapter 4, would cluster around the true values of the underlying coefficient and constant. Chapter 7 allows us to locate that coefficient and constant with any degree of certainty that we desire.

It also reveals the fundamental tension in this endeavor: There is an unavoidable tradeoff between certainty and precision. If we want to be more certain, we have to allow for more alternatives. If we want to narrow the number of alternatives, we have to relinquish some of our certainty.
Finally, this chapter states emphatically the case for good data. In its terms, good data consist of many observations with a wide range of values for xi. The tension between certainty and precision is inevitable. However, with data that are good enough, it can be rendered irrelevant. If standard errors are small enough, confidence intervals can be narrow and probabilities of Type I and II errors can both be tiny. In this case, who cares if greater confidence requires slightly wider intervals? Who cares if the probability of a Type I error can only be reduced further by increases in the probability of a Type II error?

This is about all that there is to be said with regard to interpreting the regression of equation (4.1) under the assumptions of chapter 5. Why is there anything left in this book? Because, as we said in chapter 5, and implied again at the end of each of the previous two sections, there is no guarantee that all of its assumptions are true, all of the time. The next four chapters examine how we might tell, what the consequences might be, and how we might respond.
Appendix to Chapter 7

Let's examine the behavior of $\sum_{i=1}^{n}(x_i - x_0)^2$ as x0 varies. To do this, take its derivative with respect to x0:

$$\frac{\partial \sum_{i=1}^{n}(x_i - x_0)^2}{\partial x_0}.$$

The first step is to apply equation (4.9), asserting that the derivative of a sum is the sum of the derivatives:

$$\frac{\partial \sum_{i=1}^{n}(x_i - x_0)^2}{\partial x_0} = \sum_{i=1}^{n}\frac{\partial (x_i - x_0)^2}{\partial x_0}.$$

The second step is to apply the chain rule, as in equation (4.62), to the individual derivatives within the summation:

$$\frac{\partial \sum_{i=1}^{n}(x_i - x_0)^2}{\partial x_0} = \sum_{i=1}^{n}\frac{\partial (x_i - x_0)^2}{\partial x_0} = \sum_{i=1}^{n}\frac{\partial (x_i - x_0)^2}{\partial (x_i - x_0)}\,\frac{\partial (x_i - x_0)}{\partial x_0}.$$
Finally, we apply the reasoning of equations (4.9) through (4.12):

$$\frac{\partial \sum_{i=1}^{n}(x_i - x_0)^2}{\partial x_0} = \sum_{i=1}^{n}\frac{\partial (x_i - x_0)^2}{\partial (x_i - x_0)}\,\frac{\partial (x_i - x_0)}{\partial x_0} = \sum_{i=1}^{n}2(x_i - x_0)^{2-1}(-1) = -2\sum_{i=1}^{n}(x_i - x_0). \tag{7.41}$$

Now we're in a position to set this derivative equal to zero:

$$0 = -2\sum_{i=1}^{n}(x_i - x_0) = \sum_{i=1}^{n}(x_i - x_0) = \sum_{i=1}^{n}x_i - \sum_{i=1}^{n}x_0 = \sum_{i=1}^{n}x_i - nx_0.$$

The second equality is the result of dividing the terms on either side of the first equality by −2. The third equality relies on equation (2.14). The last equality is based on equation (2.5). To find the extreme point, solve for x0:

$$nx_0 = \sum_{i=1}^{n}x_i.$$

Dividing both sides of the equation by n yields

$$x_0 = \frac{\sum_{i=1}^{n}x_i}{n} = \bar{x},$$

where the last equality invokes the definition of equation (2.8).

To prove that this value for x0 yields a minimum rather than a maximum or a saddle point for $\sum_{i=1}^{n}(x_i - x_0)^2$, take the second derivative of this sum with respect to x0:

$$\frac{\partial^2 \sum_{i=1}^{n}(x_i - x_0)^2}{\partial x_0^2} = \frac{\partial}{\partial x_0}\left(\frac{\partial \sum_{i=1}^{n}(x_i - x_0)^2}{\partial x_0}\right) = \frac{\partial\left(-2\sum_{i=1}^{n}(x_i - x_0)\right)}{\partial x_0}.$$
The first equality states the definition of the second derivative of the original function: It is just the first derivative of its first derivative. The second equality replaces the symbol for the first derivative with the value that we just derived in equation (7.41). The rules that we have already cited regarding the role of multiplicative constants and summations in derivatives allow us to further simplify:

$$\frac{\partial^2 \sum_{i=1}^{n}(x_i - x_0)^2}{\partial x_0^2} = \frac{\partial\left(-2\sum_{i=1}^{n}(x_i - x_0)\right)}{\partial x_0} = -2\sum_{i=1}^{n}\frac{\partial (x_i - x_0)}{\partial x_0}.$$

Finally, a derivation analogous to that in equations (4.9) through (4.12) allows us to express this derivative as

$$\frac{\partial^2 \sum_{i=1}^{n}(x_i - x_0)^2}{\partial x_0^2} = -2\sum_{i=1}^{n}\frac{\partial (x_i - x_0)}{\partial x_0} = -2\sum_{i=1}^{n}(-1) = 2n > 0.$$

This second derivative is always positive. Therefore, the first derivative increases as x0 increases. This proves that $\sum_{i=1}^{n}(x_i - x_0)^2$ attains its minimum value when x0 = x̄. The same must therefore be true of V(ŷ0). For all other values of x0,

$$\frac{\sum_{i=1}^{n}(x_i - x_0)^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} > 1$$

and

$$V(\hat{y}_0 \mid x_0 \neq \bar{x}) > V(\hat{y}_0 \mid x_0 = \bar{x}).$$

Moreover, the first and second derivatives of $\sum_{i=1}^{n}(x_i - x_0)^2$ with respect to x0 demonstrate that this sum increases monotonically as the difference |x0 − x̄| increases. Therefore, the variance of ŷ0 and the difference

$$V(\hat{y}_0 \mid x_0 \neq \bar{x}) - V(\hat{y}_0 \mid x_0 = \bar{x})$$

must increase as |x0 − x̄| increases.
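The result is easy to verify numerically. The sketch below uses an arbitrary, made-up sample of five x values, chosen purely for illustration, and also checks the equivalent expression $\sum_{i=1}^{n}(x_i - x_0)^2 = \sum_{i=1}^{n}(x_i - \bar{x})^2 + n(x_0 - \bar{x})^2$ that exercise 7.11 asks us to prove.

```python
# Numerical check of the appendix result: sum_i (x_i - x0)^2 is smallest at
# x0 = x-bar, grows as x0 moves away, and equals
# sum_i (x_i - x-bar)^2 + n * (x0 - x-bar)^2.
x = [2.0, 5.0, 7.0, 11.0, 13.0]          # any sample will do
n = len(x)
x_bar = sum(x) / n
ss_about_mean = sum((xi - x_bar) ** 2 for xi in x)

for x0 in [x_bar - 2, x_bar - 1, x_bar, x_bar + 1, x_bar + 2]:
    direct = sum((xi - x0) ** 2 for xi in x)
    identity = ss_about_mean + n * (x0 - x_bar) ** 2
    print(f"x0 = {x0:5.2f}   sum = {direct:7.2f}   identity gives {identity:7.2f}")
# The two columns agree, and the minimum occurs in the middle row, at x0 = x-bar.
```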
EXERCISES

7.1 In table 7.1, the estimated standard deviations for b are close to, but not identical with, the true standard deviations.
(a) Why might the estimated standard deviations in the third column differ from the true standard deviations in the second column? What could we do to make them more accurate?
(b) Why might the estimated standard deviations in the fourth column differ from the true standard deviations in the second column? What could we do to make them more accurate?

7.2 Consider the regression results in table 7.2.
(a) Use the information in tables 5.2 and 7.2 to deduce the value of SD(b) for regression 19.
(b) Use the information in tables 5.2 and 7.2 and the value for SD(b) to construct the two-tailed test of the null hypothesis H0 : β0 = 4,000 for regression 19. Does this hypothesis test accept or reject the null?
(c) Use the information in tables 5.2 and 7.2 to deduce the value for SD(b) and to construct the two-tailed test of the null hypothesis H0 : β0 = 4,000 for regression 12. Does this hypothesis test accept or reject the null?

7.3 Consider the confidence interval of equation (7.11).
(a) Recalculate this confidence interval at the 99% confidence level. How does its width change?
(b) Calculate the acceptance region for the null hypothesis H0 : β0 = 0 at the 1% significance level. Does the sample reject or fail to reject this null hypothesis?
(c) Recalculate this confidence interval at the 90% confidence level. How does its width change?
(d) Calculate the acceptance region for the null hypothesis H0 : β0 = 0 at the 10% significance level. Does the sample reject or fail to reject this null hypothesis?
(e) Discuss the differences between the answers to parts a and c. Are they plausible? Why or why not? Do the same with the answers to parts b and d.

7.4 Examine tables 7.2, 7.3, 7.4, and 7.6. In each of these tables, which regressions reject the null hypothesis H0 : β0 = 0? Which fail to reject?

7.5 Examine the confidence intervals in table 7.6. How many contain the true parameter value? Does this number seem reasonable? Why or why not?

7.6 Recall the estimated values for SD(b) and the corresponding 95% confidence intervals in table 7.6.
(a) Use equation (5.50) to derive the population values for SD(b) in each sample. Use the comparison between the population and estimated values
to deduce the value of s² for each sample. What does this tell us about the collection of disturbances assigned to each sample?
(b) Replace s² with the true value for σ² in equation (7.9). Derive the true 95% confidence intervals for each of the samples.

7.7 Recalculate the hypothesis tests of equations (7.19), (7.20), and (7.22) at the 10% significance level. Do any of the results change? If yes, how might we explain the change?

7.8 Construct the two-tailed hypothesis test for H0 : β0 = 10,000 for the samples from tables 3.1, 3.3, and 5.1. Which, if any, of these tests rejects the null hypothesis?

7.9 Imagine that we had determined that we would invest in additional years of schooling if the annual return was greater than $2,000. Reformulate the hypothesis tests in equations (7.27), (7.28), and (7.29) to determine whether we should enroll.

7.10 Recall the analysis in section 7.4 of the probability of a Type II error for the Census samples of table 7.4.
(a) Use the information in table 7.4 and equation (7.26) to test the null hypothesis H0 : β0 = 4,000 for the sample of 100,000. What do we conclude?
(b) Replicate the analysis of equation (7.30) for this sample to calculate the probability of a Type II error in this sample if the alternative hypothesis is H1 : β1 = 5,000. Is this probability relevant here? If yes, why? If no, why not?

7.11 Consider equation (7.33).
(a) Derive this equation using the instructions that precede it.
(b) We may often see the variance in equation (7.33) written as

$$V(\hat{y}_0) = \sigma^2\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right).$$

Prove that the two definitions are equivalent. First, write the numerator on the right side of the equality in equation (7.33) as

$$\sigma^2\sum_{i=1}^{n}(x_i - x_0)^2 = \sigma^2\sum_{i=1}^{n}\bigl((x_i - \bar{x}) - (x_0 - \bar{x})\bigr)^2.$$

Then prove that the resulting term can be rewritten as

$$\sigma^2\left(\sum_{i=1}^{n}(x_i - \bar{x})^2 + n(x_0 - \bar{x})^2\right).$$
Finally, divide this expression by the denominator in the term on the right side of the equality to verify that the two definitions are equivalent.
(c) Can we tell from the formula we have just derived for V(ŷ0) whether ŷ0 is a consistent estimator of E(y0)? How?

7.12 Form the (1 − α)% confidence interval for ŷ0 calculated at x0 = x̄.

7.13 Return to table 7.8 and equation (7.36).
(a) Why is the population standard deviation in the second row exactly equal to that in the first row divided by 2.236? Would this be true for the estimated rather than the population standard deviation? Why or why not?
(b) Why is the population standard deviation in each subsequent row exactly equal to that in the previous row divided by 3.162? Would this be true for the estimated rather than the population standard deviation? Why or why not?

7.14 Consider the regression on the Census data presented in section 7.6.
(a) Section 7.6 gives the value of SD(b) for this regression. The value of s is 43,427. Use equation (7.5) to figure out what $\sum_{i=1}^{n}(x_i - \bar{x})^2$ must be.
(b) The value of a for this regression is −17,136. What is ŷ0 if x0 = 4?
(c) The average value for years of schooling in these data is 12.43. In exercise 7.11, we proved that

$$V(\hat{y}_0) = \sigma^2\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right).$$

Using this information and the answer to part a, what is the variance of ŷ0 if x0 = 4? What is the 95% confidence interval for this predicted value?
(d) Answer parts b and c again under the assumption that x0 = 13.
(e) Answer parts b and c again under the assumption that x0 = 23.

7.15 Consider again the regression on the Census data presented in section 7.6.
(a) Use the information in exercise 7.14 and equation (7.7) to calculate SD(a).
(b) Use the answer to part a to construct the 99% confidence interval for the constant α in the underlying population relationship.
(c) Test the null hypothesis H0 : α0 = 0 against the alternative hypothesis H1 : α1 > 0.
NOTES

1. Intuitively, central limit theorems can be thought of as demonstrating that if we do something often enough, it becomes normal.

2. Of course, this is a more general result. If we know the average value for the xi's and the values of any set of n − 1 individual xi's, then we can calculate the value of the remaining xi.

3. Recall that we set σ = $25,000 for our simulations of section 5.7 and promised to give a reason in this section. Here it is. This is a close approximation of the value for s that we estimated from our sample of Census data in table 3.3.

4. Exercise 7.1 discusses the differences between the true standard deviations and the sample standard deviations in greater depth.

5. Verify this.

6. We made essentially the same point in note 15 of chapter 5.

7. With σ = 25,000, as here, random variations in the set of εi's assigned to each sample obscure the relationship that we are looking for with fewer observations.

8. This would be much less of a problem if n − 2 were large enough to justify recourse to table A.1 for the standard normal distribution. That table presents probabilities for many more values and so has fewer and smaller gaps.

9. Let's test our understanding. Where did the value of 1,374.7 come from?

10. Of course, this decision depends crucially on what we've chosen as our required return. Exercise 7.9 revisits this decision with a lower minimum return. In that case, we might make a different choice.

11. Exercise 7.10 directs us to apply the analysis of this section to the sample of 100,000.

12. We could have a computer simulate a random value for the disturbance term, as in the construction of the simulated data in table 5.1, and add it to ŷ0. This would simulate a value of the dependent variable associated with x0. However, the simulated disturbance value would be uncorrelated with the actual value of the disturbance assigned by the population relationship to any specific observation. Therefore, the simulated value of y0 would be no better as an estimate of the actual value of y0 for that observation than would be ŷ0.

13. Of course, the derivation is possible using regular algebra and actually fairly easy with matrix algebra. We'll probably encounter it if we take a more advanced course in econometrics.

14. We actually get a chance to do this in exercise 7.11.
15. Exercise 7.12 invites us to form a confidence interval for the predicted value of y at x̄.

16. What does this predicted value tell us about the average value of yi in table 5.1? Equation (4.34) and the accompanying text can help.

17. Exercise 7.13 asks us to explain why it is exactly this factor that translates the population standard deviation in the first row to that in the second.

18. Nevertheless, they are still consistent. Exercise 7.11 addresses this question. The point here is that the benefits of consistency will appear at lower values of n for predictions based on values for xi in the neighborhood of x̄, and at higher values of n for predictions based on more extreme values for xi.

19. Exercise 7.14 analyzes the properties of predicted values from regressions on the Census data.

20. Where did the value of 1.96 in equation (7.38) come from?

21. This one is a bit harder than the last one. Where did the value of 1.645 in equation (7.39) come from?

22. For a formal presentation including these ideas, try Murphy, Shleifer, and Vishny (1993).

23. The initials "CPI" are a bit unfortunate, because they have already been appropriated by the Consumer Price Index, the best-known measure of inflation in the United States. There shouldn't be any confusion in this section, however, because inflation is not at issue.

24. Verify this.
CHAPTER 8

What If the Disturbances Have Nonzero Expectations or Different Variances?
8.0 What We Need to Know When We Finish This Chapter
8.1 Introduction
8.2 Suppose the εi's Have a Constant Expected Value That Isn't Zero
8.3 Suppose the εi's Have Different Expected Values
8.4 Suppose Equation (5.6) Is Wrong
8.5 What's the Problem?
8.6 σi², εi², ei², and the White Test
8.7 Fixing the Standard Deviations
8.8 Recovering Best Linear Unbiased Estimators
8.9 Example: Two Variances for the Disturbances
8.10 What If We Have Some Other Form of Heteroscedasticity?
8.11 Conclusion
Exercises
8.0 What We Need to Know When We Finish This Chapter

The εi's must have the same expected value for our regression to make any sense. However, we can't tell if the εi's have a constant expected value that
is different from zero, and it doesn't make any substantive difference. If the disturbances have different variances, ordinary least squares (OLS) estimates are still unbiased. However, they're no longer best linear unbiased (BLU). In addition, the true variances of b and a are different from those given by the OLS variance formulas. In order to conduct inference, either we can estimate their true variances, or we may be able to get BLU estimators by transforming the data so that the transformed disturbances share the same variance. Here are the essentials.

1. Section 8.2: If E(εi) equals some constant other than zero, b is still an unbiased estimator of β and a is still an unbiased estimator of the fixed component of the deterministic part of yi.

2. Section 8.2: An identification problem arises when we don't have a suitable estimator for a parameter whose value we would like to estimate.

3. Section 8.2: A normalization is a value that we assign to a parameter when we don't have any way of estimating it and when assigning a value doesn't have any substantive implications.

4. Section 8.2: If E(εi) equals some constant other than zero, it wouldn't really matter and we couldn't identify this constant. Therefore, we always normalize it to zero.

5. Section 8.3: If E(εi) is different for each observation, then the observations don't come from the same population. In this case, b and a don't estimate anything useful.

6. Section 8.4: When the disturbances don't all have the same variance, it's called heteroscedasticity.

7. Section 8.4: The OLS estimators b and a remain unbiased for β and α regardless of what we assume for V(εi).

8. Section 8.5: The OLS estimators b and a are not BLU and their true variances are probably not estimated accurately by the OLS variance formulas.

9. Section 8.6: An auxiliary regression does not attempt to estimate a population relationship in an observed sample. It provides supplemental information that helps us interpret regressions that do.

10. Section 8.6: The White test identifies whether heteroscedasticity is bad enough to distort OLS variance calculations.

11. Equation (8.15), section 8.7: The White heteroscedasticity-consistent variance estimator for b is
$$V_W(b) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2 e_i^2}{\left(\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^2}.$$
It and the corresponding variance estimator for a are consistent even if heteroscedasticity is present. (A computational sketch of this estimator follows the list below.)

12. Equation (8.19), section 8.8: Weighted least squares (WLS) provides BLU estimators for β and α if the different disturbance variances are known or can be estimated:

$$\frac{y_i}{s_i} = a_{WLS}\frac{1}{s_i} + b_{WLS}\frac{x_i}{s_i} + e_i.$$
13. Section 8.10: Heteroscedasticity can take many forms. Regardless, the White heteroscedasticity-consistent variance estimator provides trustworthy estimates of the standard deviations of OLS estimates. In contrast, WLS estimates require procedures designed specifically for each heteroscedastic form.
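As a concrete illustration of item 11, here is a minimal sketch of the estimator in equation (8.15). It is an illustration rather than the book's own code: it takes the sample and the OLS estimates a and b as inputs and returns the White heteroscedasticity-consistent variance estimate for b.

```python
# White heteroscedasticity-consistent variance estimator for the OLS slope b,
# as in equation (8.15): sum of (x_i - x-bar)^2 * e_i^2 divided by the squared
# sum of (x_i - x-bar)^2, where e_i are the calculated OLS errors.
import numpy as np

def white_variance_b(x, y, a, b):
    x, y = np.asarray(x, float), np.asarray(y, float)
    e = y - a - b * x                      # OLS errors e_i = y_i - a - b x_i
    dev2 = (x - x.mean()) ** 2             # (x_i - x-bar)^2
    return np.sum(dev2 * e ** 2) / np.sum(dev2) ** 2
```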
8.1 Introduction

We now know how to fit a line to a sample and how to extract as much information as possible from that line regarding the underlying population from which our sample was drawn. Now we have to talk about what could go wrong. We'll examine each of the assumptions that we made in chapter 5, beginning in this chapter and continuing through the three that follow.

In sections 8.2 and 8.3, we'll discuss what happens if our first assumption about the properties of the εi's is incorrect. That assumption appeared in equation (5.5):
$$E(\varepsilon_i) = 0.$$
As we’ll see, violations of this assumption are either trivial or fatal. In the first case, b is still the BLU estimator. In the second, it is useless. Consequently, in practice we never question the assumption of equation (5.5). This begs the question of why we raise it here. First, we do so to demonstrate that we’re not afraid to examine and justify any of the assumptions on
which chapters 5 and 7 are based. Second, the necessary analysis is fairly simple. We undertake it, in part, so that we can practice some of the relevant techniques before moving on to the rest of this chapter and to chapter 9, where the issues are more complicated. Third, section 8.2 provides a very convenient and manageable opportunity to introduce the issue of identification. In sections 8.4 to 8.10, we discuss what happens when our assumption about the variance of the εi’s is incorrect. That assumption appears in equation (5.6):
$$V(\varepsilon_i) = \sigma^2.$$
In contrast to violations of equation (5.5), violations of equation (5.6) present problems that are potentially bad but potentially fixable. At this stage, there are two possible fixes. The first is to make the most that we can out of the OLS line as fitted to the original data, which will still be unbiased but will no longer be BLU. The second will be to transform the original data so that OLS, applied to the new data, will again provide BLU estimates.
8.2 Suppose the εi's Have a Constant Expected Value That Isn't Zero

If equation (5.5) is wrong, then E(εi) ≠ 0. This can happen in two different ways. The first way is inconsequential. The second invalidates the entire enterprise. The harmless violation of equation (5.5) occurs when the disturbances all have the same expected value, but it's not zero:1
$$E(\varepsilon_i) = \lambda, \qquad \lambda \neq 0. \tag{8.1}$$
In this case, as in equation (5.14),
$$E(y_i) = E(\alpha + \beta x_i + \varepsilon_i) = E(\alpha) + E(\beta x_i) + E(\varepsilon_i).$$
However, if equation (8.1) replaces equation (5.5),
$$E(y_i) = \alpha + \beta x_i + \lambda = (\alpha + \lambda) + \beta x_i. \tag{8.2}$$
The expected value of yi is different in form from its expression in equation (5.16), based on the assumption of equation (5.5). However, it’s the same
in substance. The expected value of yi still consists of a variable component that depends on the associated value of xi, βxi, and a fixed component that is shared by all members of the population. The only change is that the fixed component now consists of α, the constant component of E(yi) in equation (5.16), and λ, the value for the typical εi. This change has no practical import. Under the assumption of equation (8.1), b is still an unbiased estimator of β. We prove this by returning to equations (5.33) through (5.36), which collectively demonstrate that
$$E(b) = \beta + \frac{1}{\sum_{i=1}^{n}(x_i - \bar{x})x_i}\sum_{i=1}^{n}(x_i - \bar{x})E(\varepsilon_i). \tag{8.3}$$
With the assumption of equation (8.1), this becomes
$$E(b) = \beta + \frac{1}{\sum_{i=1}^{n}(x_i - \bar{x})x_i}\sum_{i=1}^{n}(x_i - \bar{x})\lambda = \beta + \frac{\lambda}{\sum_{i=1}^{n}(x_i - \bar{x})x_i}\sum_{i=1}^{n}(x_i - \bar{x}).$$
Once again, the final summation is equal to zero, according to equation (2.19). Therefore,
$$E(b) = \beta + \frac{\lambda}{\sum_{i=1}^{n}(x_i - \bar{x})x_i}(0) = \beta. \tag{8.4}$$
This proves that b is an unbiased estimator of β as long as E(εi) is some constant. It doesn't matter whether that constant is zero, as assumed in equation (5.5), or nonzero, as assumed in equation (8.1).

The consequences of equation (8.1) for E(a) are a little more complicated, but not, in the end, problematic. As in equation (5.39), we begin the analysis with the expected value of the average value for the yi's:

$$E(\bar{y}) = E\left(\frac{\sum_{i=1}^{n}y_i}{n}\right) = \frac{1}{n}\sum_{i=1}^{n}E(y_i).$$
Substituting equation (8.2) for E(yi), we obtain
$$E(\bar{y}) = \frac{1}{n}\sum_{i=1}^{n}\bigl((\alpha + \lambda) + \beta x_i\bigr) = (\alpha + \lambda) + \beta\bar{x}. \tag{8.5}$$
As in equation (5.41), the expected value of a can be written as
$$E(a) = E(\bar{y} - b\bar{x}) = E(\bar{y}) - \bar{x}E(b).$$
Here, however, we substitute equation (8.4) for E(b) and equation (8.5) for E( y ) to obtain
$$E(a) = \bigl((\alpha + \lambda) + \beta\bar{x}\bigr) - \beta\bar{x} = \alpha + \lambda. \tag{8.6}$$
In other words, a is no longer an unbiased estimator of α. This is inconsequential. α, by itself, is no longer of any interest. As we've already noted, it's now only a piece of the fixed component that appears in all values of yi. That fixed component is now the sum α + λ. According to equation (8.6), the expected value of a is exactly equal to this sum. This demonstrates that, even under the assumption of equation (8.1), a continues to be an unbiased estimator of the fixed component that is assigned by the population relationship to the value of yi for each population member. E(a) is still the part of E(yi) that doesn't depend on xi.

Once again, we can illustrate this with our simulations. Table 8.1 presents the results of two simulated regressions. We have already seen part of the regression in the second column. The fifth row of table 7.3 reported the value for b, its standard deviation, and some details about the associated confidence interval for β. Table 8.1 restates the value for b and its standard deviation. It adds the value for a and its standard deviation, as well as the parameter values on which this simulated regression is based.

TABLE 8.1 Simulated regressions with zero and nonzero values for E(εi)

| Parameters and estimators | Regression from table 7.3 | Regression with E(εi) ≠ 0 |
|---|---|---|
| β | 4,000 | 4,000 |
| b | 3,945.7 | 4,024.4 |
| SD(b) | 27.483 | 27.401 |
| α | −20,000 | −15,000 |
| E(εi) = λ | 0 | −5,000 |
| a | −19,373 | −20,402 |
| SD(a) | 352.49 | 351.43 |
| Sample size | 100,000 | 100,000 |
The third column presents a new regression. It and the first regression have the same sample size and the same value for β. There are two differences. The constant α is −20,000 for the first regression and −15,000 for the second. At the same time, E(εi) = 0 for the first regression, as in equation (5.5). E(εi) = −5,000 for the second regression, as in equation (8.1).

Can we see any differences? Well, . . . no. At least, none that matter. The values for b are, for practical purposes, almost the same and both very close to the value for β. Their standard deviations are virtually identical. The value for a in the first regression is very close to the value of α on which it is based. The value for a in the second regression is not close to the corresponding value for α. But then, according to equation (8.6), it shouldn't be. Its expected value is α + λ. Table 8.1 demonstrates that the value of a in this regression is very close to this sum. Because α + λ for the second regression is −20,000, or the value of α for the first regression, the estimated values for a in the two regressions are very close to each other.

The simulations of table 8.1 illustrate an underlying theme. The part of E(yi) that doesn't depend on xi can be separated arbitrarily into a component that is truly fixed, α, and a component that each observation can expect to receive, E(εi) = λ. This separation is arbitrary, from a behavioral perspective, because we would expect people to make decisions based on what they expect about yi. Because both components contribute equally to E(yi), decisions should be the same regardless of how we partition that part of E(yi) that doesn't depend on xi.

The separation between α and λ is also arbitrary, econometrically, because we have only one estimator, a, with which to capture the contributions of both. As we have just demonstrated, this estimator does a good job of capturing their joint contribution. But we would need a second estimator in order to distinguish between α and λ. None is available.2

This is our first example of an identification problem. An identification problem arises when the population relationship contains a parameter in which we may be interested, but we can't figure out a way to construct a suitable estimator for it from the sample. In this situation, we can't identify a value that we're comfortable using as an estimate. Identification problems arise in many econometric contexts. For example, we'll encounter them again in chapter 10.

Here, the identification problem is that we might believe that there are two underlying parameters, but we can only estimate one quantity. In this case, the identification problem isn't a big one. We have a perfectly fine estimate of the important quantity, α + λ. There is nothing substantive to be gained from separate estimates of α and λ. Therefore, we lose nothing of importance, and gain in simplicity, by simply asserting a value for one of the two. An assertion of this type, where a parameter can't be identified but
nothing of interest depends on its value, is called a normalization. The normalization we choose here, and, for that matter, for the rest of econometrics, is that of equation (5.5), E(εi) = 0.
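The point is easy to reproduce by simulation. The sketch below is a minimal illustration rather than the book's own simulation code: the parameter values mirror the second regression in table 8.1, while the distribution of the x values and the normal disturbances are arbitrary assumptions made purely for the illustration.

```python
# Simulation of the point made by table 8.1: with E(eps_i) = lambda != 0,
# b still estimates beta, while a estimates alpha + lambda rather than alpha.
import numpy as np

rng = np.random.default_rng(0)
n, alpha, beta, lam, sigma = 100_000, -15_000.0, 4_000.0, -5_000.0, 25_000.0

x = rng.integers(8, 21, size=n).astype(float)     # years of schooling, 8-20 (illustrative)
eps = rng.normal(loc=lam, scale=sigma, size=n)    # disturbances with E(eps_i) = lambda
y = alpha + beta * x + eps

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print(f"b = {b:,.1f}  (beta = {beta:,.0f})")
print(f"a = {a:,.1f}  (alpha + lambda = {alpha + lam:,.0f})")
```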
8.3 Suppose the εi's Have Different Expected Values

The situation is radically different if we replace equation (5.5), not with equation (8.1), but with
$$E(\varepsilon_i) = \lambda_i \tag{8.7}$$
instead. Typographically, the difference between equations (8.1) and (8.7) is tiny, just the subscript i at the end of equation (8.7). Econometrically and behaviorally, the difference is immense. Equation (5.1) tells us that individual observations with the same values for xi will usually turn out to have different actual values for yi. However, equation (5.16) or (8.2) tells us that, even so, their deterministic parts are identical. We can expect them to have the same values for yi. Therefore, the regression calculations of chapter 4 ultimately give us an unbiased estimator of E(yi) in either case.3 If, instead, we believed equation (8.7), we would also have to believe that
$$E(y_i) = \alpha + \beta x_i + E(\varepsilon_i) = \alpha + \beta x_i + \lambda_i. \tag{8.8}$$
This is fundamentally different from either equation (5.16) or equation (8.2). In both of those equations, E(yi) has two components. One is fixed and identical for all observations. The other depends on the value of xi for each observation. These two components reappear as the first and second terms following the second equality in equation (8.8). However, a third term also follows the last equality in equation (8.8), λi. This term represents a component of E(yi) that varies across observations, is not random, and is not related to anything observable. In other words, the deterministic part of yi is different for observations with the same value of xi. This takes us well beyond any place we’ve been up until now. If we assert equation (8.7), we are asserting not only that individual observations should usually have different actual yi values, as in equation (5.1), but that they shouldn’t even expect to have the same yi values. This suggests, unfortunately, that we shouldn’t expect the regression estimates to give us unbiased estimators of E(yi). In turn, this implies that we should expect biased estimators of β, α, or both.
We can confirm this by returning to equation (8.3). This time, we replace E(εi) with λi, following equation (8.7). This immediately yields

$$E(b) = \beta + \frac{\sum_{i=1}^{n}(x_i - \bar{x})\lambda_i}{\sum_{i=1}^{n}(x_i - \bar{x})x_i} \neq \beta. \tag{8.9}$$
As demonstrated in the discussion of equation (2.22), the numerator in the second term after the equality in equation (8.9) would not be zero. Therefore, under the assumption of equation (8.7), b is a biased estimator of β.4 If we can't get an unbiased estimator of β, the true effect of xi on yi, there's not much point in continuing.

The fundamental problem here is that our estimation procedure in chapters 4 through 7 is based on the idea that all of the observations in the sample come from the same population. However, equation (8.7) asks us to believe that the εi's don't even share the same expected value. If we accept this, the idea that they should still be considered members of the same population becomes very hard to sustain.

If it doesn't make sense to consider the observations as members of the same population, then it doesn't make sense to think of a single population relationship. If it doesn't make sense to think of a single population relationship, then there's no single set of parameters that describe the behavior. If there's no single set of parameters, then it's no surprise that our single best-fitting line of chapter 4 doesn't estimate anything useful. The assumption that E(εi) = λi implies that the data we've observed simply can't be addressed by the techniques we're developing in this course.5
8.4 Suppose Equation (5.6) Is Wrong Violations of the assumption in equation (5.6) are much more interesting than are violations of the assumption in equation (5.5). While alternatives to the assumption of equation (5.5) are either trivial or ridiculous, alternatives to that of equation (5.6) can arise under plausible and even important circumstances. When alternatives to this assumption seem compelling, the relationships between b and a, on the one hand, and β and α, on the other, become more difficult to interpret. According to chapter 5, equation (5.6) imposes the assumption of homoscedasticity. If homoscedasticity specifies that all disturbances have the same variance, the alternative must obviously be that the disturbances can have
290
Chapter 8
different variances. This assumption is called, not surprisingly, heteroscedasticity. It looks like this:
$$V(\varepsilon_i) = \sigma_i^2. \tag{8.10}$$
As in equation (8.7), the typographical difference between this and homoscedasticity is simply the subscript i at the end of equation (8.10). However, in contrast to equation (8.7), this subscript is not fatal. The reason for this contrast is that equation (5.5), or at least equation (8.1), is necessary to prove that b and a are unbiased estimators of β and α. If we review the proofs in sections 5.6 or 8.2, however, we see that they didn’t require any assumption at all about V(εi). In particular, they didn’t require equation (5.6), so they aren’t affected if that equation is replaced with equation (8.10). In other words, b and a remain unbiased for β and α regardless of what we assume for V(εi). This is the good news. The OLS estimators are still linear and unbiased. As unbiasedness is the first property we typically look for in an estimator, b and a still have something to offer even if we adopt equation (8.10).
8.5 What's the Problem?

If the assumption of equation (8.10) creates a problem, it's not in the expected values of our estimators. It must therefore be in their variances. Chapter 5 uses equation (5.6) to first derive the variance of the yi's in equation (5.20) and then the variance of b in equation (5.50). Obviously, these derivations must be re-examined if we have to adopt equation (8.10) instead. As with equation (5.6), it's still true that
$$V(\varepsilon_i) = E(\varepsilon_i^2).$$
This required only that E(εi) = 0, which we agreed to accept in section 8.2. As in equation (5.19), it’s still true that
$$V(y_i) = E(\varepsilon_i^2).$$
With equation (8.10), these two equations imply that
$$V(y_i) = E(\varepsilon_i^2) = V(\varepsilon_i) = \sigma_i^2. \tag{8.11}$$
So if equation (8.10) replaces equation (5.6), then equation (8.11) must replace equation (5.20). What, then, of equation (5.50)? Equation (5.48) can be condensed to give the variance of b as

$$V(b) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2 V(y_i)}{\left(\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^2}. \tag{8.12}$$
With equation (8.11), this implies that

$$V(b) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sigma_i^2}{\left(\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^2}. \tag{8.13}$$
Now that subscript i in equation (8.10) makes all the difference. Without it, the variances of the yi's would all be identical. They would factor out of the numerator in the ratio to the right of the equality in equation (8.13), leaving the formula for the variance of b in equation (5.50). With this subscript, the formula for this variance cannot be simplified beyond equation (8.13).

This is where the bad news begins. It comes in two pieces. First, section 5.9 proved that b and a are BLU estimators on the basis of the assumption in equation (5.6). If we adopt the assumption of equation (8.10) instead, we no longer know if OLS estimates are best. We won't explore this issue formally, but we can develop some intuition about it.

Equation (8.10) says that different observations have disturbances with different variances. Observations with small variances, on average, have small disturbances. For the most part, then, the values of yi for these observations lie fairly close to their values of E(yi). Therefore, the points given by xi and yi for these observations lie fairly close to the line given by equation (5.16),

$$E(y_i) = \alpha + \beta x_i.$$
Figure 8.1 displays a sample of points representing observations whose disturbances have small variances. It doesn't portray the line of equation (5.16). However, if we examine these points, we'll probably feel pretty confident that we know where that line lies. This means that, at least intuitively, we're pretty sure of the line's coefficient and constant. In other words, this sample is very informative regarding what β and α must be.

Figure 8.1 Sample with a small disturbance variance (earnings plotted against years of schooling)

In contrast, observations with large variances, on average, have large disturbances. The differences between their actual values for the dependent variable, yi, and their expected values for this variable, E(yi), could be quite large. Therefore, these observations could lie quite far from the line relating E(yi) to xi, in all directions. Figure 8.2 displays a sample of points with disturbances that have large variances. Again, it omits the line of equation (5.16). Do we have any idea where it lies? The point here is that if we examine only points such as this, we would never claim to know much about the values of β and α. In other words, observations with large variances for their disturbances are not very informative regarding the part of the population relationship that we care about, the deterministic part.

Figure 8.2 Sample with a high disturbance variance (earnings plotted against years of schooling)
This discussion demonstrates that, if we have a sample that includes some observations with small variances for their disturbances and some with large variances for their disturbances, we would like to pay more attention to the former and less to the latter. The formulas for OLS regression don't do this. The formula for b, equation (4.37), adds in terms for each observation without regard for the variance of the disturbance associated with that observation. The formula for a in equation (4.35) requires the value for b and the simple averages for yi and xi, calculated again by treating all observations symmetrically, regardless of their variances. The formulas for the variances of these estimators in equations (5.50) and (5.51) do the same.

This illustrates the first piece of bad news associated with the assumption of equation (8.11). It is obvious that, when equation (8.11) holds, we want to place more emphasis on the observations with low variances, because they are more informative. We want to place less emphasis on observations with high variances, because they are less informative. OLS weights them all equally. It therefore doesn't take advantage of this insight. A procedure, in statistics or anywhere else, for that matter, that doesn't take advantage of all the information or resources available to achieve the goal at hand is inefficient. Under equation (8.11), OLS is inefficient. This means that its job could be done better.

What would it mean to do this job better? Well, OLS is still unbiased. So improved performance would have to mean lower variances for b and a. As we stated in section 5.9, this is the definition of efficiency in statistical terms: More efficient estimators have lower variances. Under equation (8.11), OLS estimators do not have the smallest possible variances.

The second piece of bad news is that, unless otherwise directed, computers proceed as if they are estimating equation (5.50) with equation (7.4) when they calculate the estimated variance of b. Instead, they should be trying to estimate equation (8.13). Similarly, they use equation (7.6) to estimate the variance of a in equation (5.51). Although we haven't discussed the correct variance for a under the assumption of equation (8.11), we can be sure that it isn't equation (5.51). Therefore, if equation (8.11) holds, the variances that we ordinarily read in our computer outputs will be wrong. Moreover, they could be either too small or too large.

Why is this? According to equation (8.13), the true variance of b under equation (8.11) multiplies the variance for the disturbance of each observation, σi², by the term (xi − x̄)² for the same observation. If, for example, observations with big variances tend to also be observations with values for xi that deviate substantially from their average value, the products for these observations will tend to be huge. Consequently, the numerator of equation (8.13)
will be large, and the true variance of b will be much greater than the value given by equation (5.50). In contrast, if observations with small variances tend to have values for xi that are far from their average value, the products for these observations will be relatively small. The products for observations with large variances will also be small, because large values for σi² will be multiplied by small values for (xi − x̄)². As a result, the numerator of equation (8.13) will be small, and the true variance of b will be less than the value given by equation (5.50).

Finally, if observations with large values for σi² are just as likely to have small values for (xi − x̄)² as large values, then, on average, these observations as a group will make the same contribution to equation (5.50) as they would to equation (8.13). The same is true for observations with small values for σi², if they are just as likely to have large values for (xi − x̄)² as small values. In this case, equation (5.50) would yield a value similar to the correct one provided by equation (8.13).

In other words, the correlation between σi² and (xi − x̄)² determines the extent to which the true variance of b under heteroscedasticity differs from what the computer reports it to be, using the homoscedastic formulas. It's possible that, if we have reason to believe equation (8.11), we will also have some belief about the sign and perhaps even the magnitude of this correlation. Ordinarily, however, this won't be good enough. Confidence intervals and hypothesis tests require specific values for the variance, not guesses as to how the valid variance estimate differs from the calculated values.
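A small simulation makes the point concrete. The sketch below is an illustrative assumption from start to finish: the variance pattern, the distribution of x, and the number of replications are arbitrary choices rather than anything from the book. It generates disturbances whose standard deviation grows with |xi − x̄|, so that σi² and (xi − x̄)² are positively correlated, and then compares the true sampling variance of b with what the usual homoscedastic formula would typically report.

```python
# Monte Carlo comparison of the true sampling variance of b with the average
# homoscedastic variance estimate when sigma_i grows with |x_i - x-bar|.
import numpy as np

rng = np.random.default_rng(1)
n, beta, alpha, replications = 200, 4_000.0, -20_000.0, 2_000

x = rng.uniform(8, 20, size=n)
sigma_i = 5_000.0 * (1.0 + np.abs(x - x.mean()))   # larger variance far from x-bar
dev = x - x.mean()

b_draws, reported_variances = [], []
for _ in range(replications):
    y = alpha + beta * x + rng.normal(0.0, sigma_i)
    b = np.sum(dev * (y - y.mean())) / np.sum(dev ** 2)
    a = y.mean() - b * x.mean()
    e = y - a - b * x
    s2 = np.sum(e ** 2) / (n - 2)
    b_draws.append(b)
    reported_variances.append(s2 / np.sum(dev ** 2))   # usual homoscedastic estimate

print(f"variance of b across replications: {np.var(b_draws):12,.1f}")
print(f"average OLS-reported variance:     {np.mean(reported_variances):12,.1f}")
# The first number exceeds the second, as the text predicts when sigma_i^2 and
# (x_i - x-bar)^2 are positively correlated.
```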
8.6 σi², εi², ei², and the White Test

The previous section demonstrates that, if we have reason to believe equation (8.11) instead of equation (5.6), something must be done. The first thing to do would be to try to figure out if there's any evidence to support this belief. Because this belief is about σi², some estimator of σi² is necessary. Recall from equation (8.11) that

$$\sigma_i^2 = E(\varepsilon_i^2).$$
This suggests that one way to obtain an estimator of σi² might be to begin with an estimator of εi². To get an estimator of εi², we might start with an estimator of εi, the disturbance embodied in the ith observation.

This is a fairly subtle thing. Without going into too much detail, let's just recall that we've already committed to estimating three population parameters: β and α in chapter 5 and σ in chapter 7. There are n values of εi in our
sample. If we wanted to estimate them as well, we would be in the position of trying to squeeze n + 3 estimators out of only n observations, or pieces of information. This is like spending more money than you have. It can't be done. Technically, what's happening is that we can't obtain independent estimates of all three parameter values and of all n εi's. However, the intuition that we get from pretending that this is what we're doing is quite valuable. Therefore, we'll motivate the following discussion by taking the posture that we're going to use the calculated error for the ith observation from the OLS regression line, ei, to estimate εi, the disturbance for that observation. We'll remind ourselves along the way when we're making statements for heuristic purposes that can't be taken literally.

For the present, then, ei can be thought of as an estimator of εi. It is consistent, which can be demonstrated as follows. Repeating equation (4.5), we have

$$e_i = y_i - a - bx_i.$$
Replacing yi with the population relationship of equation (5.1), we get
$$e_i = (\alpha + \beta x_i + \varepsilon_i) - a - bx_i.$$
Rearranging yields
$$e_i = \varepsilon_i + (\alpha - a) + (\beta - b)x_i.$$
In other words, the observed error differs from the unobserved disturbance to the extent that b is not identical to β and a is not identical to α. As we've said, this is generally the case. But, as we discussed in section 5.10, it becomes less and less the case as the sample gets bigger and bigger.

To review, b and a are consistent estimators of β and α. Bigger samples mean that the variances of b and a get smaller. These variances converge to zero as the number of observations approaches infinity. Therefore, the actual values of b and a converge to their expected values, β and α. This implies that, as the sample size approaches infinity, β − b and α − a converge to zero. The consequence is that ei converges to εi as well.

This demonstrates that ei can be taken as a consistent estimator of εi. Therefore, ei² can be taken as a consistent estimator for εi². It turns out that, in large samples, εi² can replace σi². Therefore, we can proceed as if we could use the observed value of ei² as an estimator for the unknown σi² in large samples as well.6

The first thing that we'll do is use the metaphor of ei² as an estimator for σi² to help us understand how we assess whether we actually have a problem. This is not quite the same as identifying whether we have heteroscedasticity.
As we said previously, equation (5.50) won’t be too far off if σi2 and ( xi − x )2 are uncorrelated. In this case, there’s really no need to figure out the variance of each individual εi2. All we really care about is the variance of b, and perhaps that of a. The computer’s ordinary estimates of these variances should be acceptable. Therefore, the critical question is not whether heteroscedasticity is present, but whether, if it is, it implies that σi2 and ( xi − x )2 are correlated. It might appear that a natural way to answer this question would be to simply calculate the correlation between ei2, our metaphorical estimator for σi2, and xi. This isn’t quite good enough, however, because
(xi − x̄)² = xi² − 2x̄xi + x̄².
This demonstrates that any correlation between σi2 and ( xi − x )2 could come from a correlation between σi2 and either the linear component, −2x̄xi, the squared component, xi2, or both.7 This suggests that we might need to calculate the correlations between ei2, xi, and xi2. The problem with this is that there are three of them, two of which involve ei2. It isn't obvious how we would combine them to get a diagnosis. In fact, we do combine them. Even though we cannot literally take individual values of ei2 as valid estimators for individual values of σi2, we can make a valid diagnosis regarding the presence or absence of problematic heteroscedasticity by running a regression in which ei2 is the dependent variable and both xi and xi2 are explanatory variables:8

ei² = c + dxi + fxi² + error.    (8.14)
The regression of equation (8.14) is an example of an auxiliary regression. It is not meant to estimate some explicit population relationship, such as equation (5.1). Its purpose is to help us interpret regressions that do estimate explicit population relationships. We’ll encounter several more of them in chapters 10 and 11. For that matter, we’ll also have to wait until chapter 11 to figure out how we actually calculate values for c, d, and f. The formulas of chapter 4 won’t work here, because they were developed for sample regressions with only one explanatory variable. Equation (8.14) has two. But we don’t have to address this for the present task. What we do need from this regression is the value for its R2, which we’ll learn to calculate in chapter 12. There is also more to this statistic when there are multiple explanatory variables than was apparent in chapter 4. However,
the only thing that is important here is the following fact, already stated in equation (4.59), proven in exercise 4.11, and to be proven for two explanatory variables in exercise 12.10: R2 is the square of the correlation between the dependent variable and the value predicted for the dependent variable based on the values of all of the explanatory variables:
R² = (CORR(Yi, Ŷi))².
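This fact is easy to verify numerically. The sketch below (Python with numpy, which is not part of the text's own apparatus, and entirely made-up data) computes R2 both from the sums of squares and as the squared correlation between the dependent variable and its fitted values:

```python
import numpy as np

# Made-up data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 4.0, 8.0, 7.0])

# OLS fit of y on a constant and x.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ coef

# R-squared from sums of squares ...
r2_ss = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
# ... and as the squared correlation between y and its fitted values.
r2_corr = np.corrcoef(y, yhat)[0, 1] ** 2
print(r2_ss, r2_corr)  # the two agree
```

The two numbers coincide because the fitted values come from an OLS regression that includes an intercept.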
Therefore, small R2 values for equation (8.14) can be taken, intuitively, to indicate that xi and xi2 don't predict ei2 very well. This would imply that ei2 and ( xi − x )2 were not closely related. We can take this as implying, though at some distance removed, that σi2 and ( xi − x )2 are not closely related. In this case, equations (5.50) and (8.13) would probably yield similar values. Therefore, the estimate of V(b) that we get from equation (7.4) would be acceptable, even though equation (7.4) was designed to estimate equation (5.50) rather than equation (8.13). In other words, whether or not we actually have heteroscedasticity, we wouldn't have to worry about it. In contrast, big values for R2 could be taken to indicate that xi and xi2 predict ei2 pretty well. This would imply that ei2 and ( xi − x )2, and probably σi2 and ( xi − x )2, are closely related. Therefore, equations (5.50) and (8.13) would probably yield substantially different results. Consequently, equation (7.4) would not be an acceptable estimate of the true variance of b in equation (8.13). In other words, what the computer would tell us about the variance of b would be very different from what it actually was. We would have a problem. How big does R2 have to be before we get concerned? It turns out that R2 multiplied by the sample size, or nR2, is a random variable with a χ2-distribution, with degrees of freedom equal to the number of explanatory variables in equation (8.14), or two.9 Therefore, we can frame the question of how big R2 has to be as a formal hypothesis test, as described in section 6.4. This particular test is known as the White test. The null hypothesis is that equations (5.50) and (8.13) yield the same results:

H0: [Σi=1…n (xi − x̄)² σi²] / [Σi=1…n (xi − x̄)²]² = [Σi=1…n (xi − x̄)² σ²] / [Σi=1…n (xi − x̄)²]².
The alternative hypothesis is that they don’t, which implies heteroscedasticity of unspecified character: n
H1 :
≠
n
∑( x − x ) i
i =1
∑( x − x ) σ 2
i
σ i2 2
2 i
i =1
n xi − x i =1
∑(
)
2 2
.
The test statistic is nR2. The critical value comes from table A.4 for the χ2-distribution. Table A.4 requires a value for α, the probability of a Type I error, and the degrees of freedom. The degrees of freedom are equal to the number of explanatory variables in the auxiliary regression of equation (8.14), or two, in the case of this chapter. If the test statistic is less than the critical value,
nR² < χ²α(2),
then xi and xi2 have little association with ei2. The OLS variance calculations are reasonably trustworthy, and the null hypothesis is not rejected. In this case, we can proceed with the OLS estimates of the variances of a and b from equations (7.4) and (7.6). If, instead, the test statistic equals or exceeds the critical value,
nR² ≥ χ²α(2),
then xi and xi2 are so closely associated with ei2 that the OLS variances of b and a cannot be trusted. The null hypothesis must be rejected. What does this test tell us about some of the regressions that we've already run? Let's begin with the regression on the data of table 3.3, which we presented in section 4.7. In that regression, we concluded that education had a statistically significant effect on earnings, estimated at approximately $4,000 in additional earnings for each additional year of schooling. The χ2(2) value for the White test on this regression data is 2.24. Table A.4 indicates that the critical value for this random variable at the 5% significance level is 5.99. Therefore, the test statistic is less than the critical value. Equivalently, the p-value associated with this test statistic is .3261, well in excess of our conventional threshold value of .05. Either way, we fail to reject the null hypothesis. This means that we can trust the standard deviations that we calculated in section 4.7. The regression in section 7.7 provides a contrary example. Here, we estimated that gross national income per capita rises by nearly $4,000 with
each additional point on the Corruption Perceptions Index. Both the slope and the intercept are many times greater than their estimated standard deviations. The White test, applied to this regression, yields a χ2(2) value of 13.41. This exceeds the critical value of 5.99 by a lot. The p-value associated with this test statistic is only .0012. In this case, we reject the null hypothesis of homoscedasticity. This suggests that the standard deviations that we estimated for the slope and intercept in this regression are not trustworthy. The data in table 3.6 provide a last example for this section. Section 3.6 discussed the correlation between child mortality and the availability of clean water to rural populations. The regression of the former on the latter yields the following results:

child mortality rate = 198.27 − 1.75 (percentage of rural population with access to improved drinking water sources).
Both the slope and the intercept are very significant: Their respective t-statistics are 14.26 and 9.39. The R2 and adjusted R2 are .3798 and .3755. Finally, the F-statistic is 88.18, with a p-value of less than .0001. However, the White test indicates that heteroscedasticity is present. The test yields a χ2(2) value of 15.36. This is, again, far above the critical value at the 5% significance level of 5.99. It has a p-value of .0005. This indicates that, for the moment, we shouldn’t take the indicated t-statistics too seriously.
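All of these steps are easy to automate. The following sketch (Python with numpy and scipy; the function and variable names are ours, not the text's) carries out the White test for a single explanatory variable exactly as described above, with the 5% critical value of 5.99 coming from the chi-squared distribution with two degrees of freedom:

```python
import numpy as np
from scipy import stats

def white_test(x, y):
    """Sketch of the White test for a one-regressor model, following
    equation (8.14): regress the squared OLS errors on x and x squared,
    then compare n*R^2 with a chi-squared(2) critical value."""
    n = len(y)

    # OLS of y on a constant and x, to obtain the errors e_i.
    X = np.column_stack([np.ones(n), x])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b

    # Auxiliary regression of e_i^2 on x_i and x_i^2 (equation 8.14).
    Z = np.column_stack([np.ones(n), x, x ** 2])
    c = np.linalg.lstsq(Z, e ** 2, rcond=None)[0]
    fitted = Z @ c
    r2 = 1 - np.sum((e ** 2 - fitted) ** 2) / np.sum((e ** 2 - np.mean(e ** 2)) ** 2)

    stat = n * r2                           # the test statistic nR^2
    critical = stats.chi2.ppf(0.95, df=2)   # 5% critical value, 2 degrees of freedom
    p_value = stats.chi2.sf(stat, df=2)
    return stat, critical, p_value
```

As in the decision rule above, the null hypothesis is rejected when the returned statistic equals or exceeds the critical value.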
8.7 Fixing the Standard Deviations

In the case of the last two regressions, we have to worry about what to do. The White test has told us that heteroscedasticity is so bad that we can't trust the OLS variance calculations. However, the alternative hypothesis for this test is entirely nonspecific. It doesn't tell us what to use for V(εi) in place of σ². We have two options. The first is to get valid estimates of the variances of the OLS estimators b and a. Even though they're not BLU estimators, we might still be content with them if their variances are small enough to support useful confidence intervals and discerning hypothesis tests. The second is to figure out what the BLU estimators would be under equation (8.11). At this point, the first option is easy to execute. We know that equation (8.13) gives us the correct value for V(b). The only quantity in this equation that isn't observed is σi2. If we were to continue with the metaphor that ei2 can be thought of as an estimator of σi2, we could simply replace the latter by the former.
While this isn’t literally accurate, it turns out that the result of this replacement is nevertheless a valid estimator of V(b), at least in large samples.10 The White heteroscedasticity-consistent variance estimator is n
()
VW b =
∑( x − x ) e i
2 2 i
i =1
n xi − x i =1
∑(
)
2 2
.
(8.15)
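Once the OLS errors are in hand, equation (8.15) is a one-line calculation. A minimal sketch in Python, assuming numpy arrays x and e holding the explanatory variable and the OLS errors:

```python
import numpy as np

def white_variance_of_b(x, e):
    """Equation (8.15): heteroscedasticity-consistent estimate of V(b)
    from the regressor values x and the OLS errors e."""
    dev = x - np.mean(x)
    return np.sum(dev ** 2 * e ** 2) / np.sum(dev ** 2) ** 2
```

Its square root is the White heteroscedasticity-consistent standard deviation used in the comparisons below.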
With this variance estimator, we can use the formulas of chapter 7 to create valid confidence intervals and hypothesis tests for b. The White variance estimator, applied to the regression in section 7.7, estimates SD(b) = 268.3. In contrast, the computer estimated this standard deviation, under the mistaken OLS assumption of equation (5.6), as 212.0. In other words, the OLS estimate of the standard deviation, based on the assumption that it was the same for all observations, was too small by approximately one-quarter. This implies that, instead, observations with values of the Corruption Perceptions Index that are far from the average value are likely to have disturbances with larger variances. We might wonder whether, in this case, this was worth the effort. The slope in this regression is 3,938.1. OLS estimated a t-statistic of 18.58, with a p-value of less than .0001. With the White variance estimator, this t-statistic is now 14.68. Its p-value is still less than .0001. Either way, it seems, there can be no doubt that β is not equal to zero. The problem with this reasoning is that, given the result of the White test in the previous section, the OLS t-statistic is essentially meaningless. All we know is that it’s based on an incorrect calculation of SD(b). The fact that it was big doesn’t change the fact that it was probably wrong. It couldn’t be taken to indicate that β was not equal to zero. Moreover, the estimated standard deviation on which it was based couldn’t be used to construct valid tests of any other hypothesis. We had to fix it if we wanted to make anything out of the estimated b. The fix for the regression in table 3.6 yields similar results. The OLS estimate of SD(b) is .187. The White heteroscedasticity-consistent estimate is larger, at .210. This means that the actual t-statistic for the null hypothesis H0 : β0 = 0 is 8.36 rather than 9.39. Once again, our conclusion regarding this null hypothesis doesn’t change when we correct the estimate of b’s standard deviation. However, we now have a good reason, rather than a bad reason, to reject it. Moreover, we might be in a position where the appropriate value for our null hypothesis is something other than zero. Certainly, for some values, we would make a different
decision if we based our test on the incorrect OLS standard deviation estimate rather than on the White heteroscedasticity-consistent estimate. Of course, there will be other regressions in which the incorrect, heteroscedasticity-ridden OLS standard deviations imply that a slope is significant when the White heteroscedasticity-consistent estimates indicate that it is not. Moreover, there will be yet other regressions in which the incorrect, heteroscedasticity-ridden OLS standard deviations imply that a slope is not significant when the White heteroscedasticity-consistent estimates indicate that it is. We won’t know which of these situations we have until we actually check, by calculating the White heteroscedasticity-consistent standard deviations.
8.8 Recovering Best Linear Unbiased Estimators

The alternative to the White heteroscedasticity-consistent estimate of the standard deviation promises better estimators but requires more from us in return. In essence, we have to replace the nonspecific alternative hypothesis we used for the White test with something very specific. We'll talk about how we might do that in a moment, but first let's discuss the advantages of doing so. Assume, for the moment, that we were willing to assert specific values for σi2. What could we do with them? If we knew the variances of the εi's, could we transform the εi's into some other quantities that are homoscedastic? Consider this. Suppose we divide the population relationship of equation (5.1) by σi:

yi/σi = α(1/σi) + β(xi/σi) + εi/σi.    (8.16)
This transformed population relationship has a transformed disturbance, εi/σi. What are its properties? This transformation of εi amounts to multiplying it by 1/σi. Using equation (5.34), we can derive the expected value of εi/σi as follows:

E(εi/σi) = (1/σi)E(εi) = (1/σi)(0) = 0.    (8.17)
In other words, the transformed disturbances all have the same expected value as did the original disturbances, namely zero. This is probably not a great surprise.
The more interesting result is the variance of the transformed disturbances. Using equation (5.43), we get

V(εi/σi) = (1/σi)²V(εi) = (1/σi²)σi² = 1.    (8.18)
The critical observation here is that the final term in equation (8.18) has no subscript i. In other words, all of the transformed disturbances have the same variance! The transformation of equation (8.16) yields a new expression of the population relationship in which the disturbances are homoscedastic. This seems almost miraculous. Equations (8.17) and (8.18) demonstrate that the disturbances in the transformed population relationship of equation (8.16) have the first two properties required by the Gauss-Markov theorem: zero expected values and identical variances. In exercise 8.4, we will demonstrate that they have the third required property, uncorrelated disturbances, as well. Equation (8.17) guarantees that OLS estimation of equation (8.16) will yield unbiased estimators of β and α. Equation (8.18) and exercise 8.4 together guarantee that these estimators will have the least variance among linear unbiased estimators. In other words, if we actually knew σi we could divide it into our original variables, yi and xi, to create the new variables yi/σi and xi/σi. We could then regress the former on the latter. This regression would conform to all of the assumptions of chapter 5. We would therefore have BLU estimates of β and α and valid standard deviations for those estimates. So far, the transformation of equation (8.16) seems to solve all of our problems! Unfortunately, there's one problem left. This transformation is usually impracticable because the values for σi2 are typically unknown. However, many of the advantages that would accrue if the σi2 were known are also available if they can be estimated. If we can estimate σi with si, values that we construct from our sample, we can estimate the regression equation

yi/si = aWLS(1/si) + bWLS(xi/si) + ei.    (8.19)
The estimation procedure represented in equation (8.19) is called weighted least squares (WLS). Each observation is weighted by the reciprocal of its estimated standard deviation. The task of estimation itself presents some of the same challenges we encountered in estimating equation (8.14) for the White test. As there, the right-hand side of equation (8.19) has two observable variables that vary across observations, 1/si and xi/si. In consequence, obtaining actual values
for bWLS and aWLS in equation (8.19) is a job for chapter 11.¹¹ Nevertheless, were we to estimate equation (8.19) as will be prescribed in chapter 11, the estimates would have very attractive properties. The WLS procedure has enormous intuitive appeal. First, it rigorously implements the aspirations that we expressed in the discussion around figures 8.1 and 8.2. Observations with large variances are relatively uninformative. Dividing them by their standard deviations reduces their algebraic contribution to the regression calculation. Conversely, observations with small variances convey a lot of information about β and α. This weighting increases their relative contribution.12 Second, the WLS procedure illustrates two much larger principles. When any of the assumptions, other than that of equation (5.5), that we made in chapter 5 about the disturbances don't apply, OLS applied to the original data won't give best estimators. In chapter 11, we'll see that it can even give estimators that are truly bad. However, the solution is almost always to transform the population relationship so that the new population relationship conforms to the assumptions of chapter 5, and transform the data in the sample analogously. This puts us right back where we want to be, in position to obtain BLU estimators from OLS. In particular, equation (8.19) makes plain that the search for BLU estimators under equation (8.11) must begin with a search for estimates of the σi2. First, however, we must reiterate that we cannot estimate n different values of the σi2 without restrictions. While the conceit that we could replace each value of σi2 with the value of ei2 for the same observation has been helpful to develop intuition, we have to remind ourselves that this replacement is not exactly valid. We are going to want to keep the total number of estimators that we actually draw from our sample well below n, the number of available observations. If we can't estimate the σi2 without restrictions, we'll have to impose some. In other words, we're going to have to provide some explanation as to why heteroscedasticity occurs in our data. This is new. For comparison, the implicit answer that the alternative hypothesis of the White test offers to the question of why heteroscedasticity occurs is simply, "because it does!" This is, as we say, nonconstructive. We're going to need something more insightful in order to make further progress. If we can assert that the variation across observations in σi2 is systematic, we can use this systematization to impute values of σi2 without exhausting our sample. In other words, if we want to get estimates of the n + 2 values—β, α, and the σi2's—we need to contribute more than just the n observations in our sample. We also need to contribute a theory. This is a statement of the behavioral basis for suspecting that different disturbances have different variances.
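In code, the transformation behind equation (8.19) amounts to dividing every variable, including the constant, by si before the least squares step. A minimal sketch, assuming numpy arrays x, y, and s (the estimated disturbance standard deviations, however they were obtained):

```python
import numpy as np

def wls_estimates(x, y, s):
    """Sketch of equation (8.19): divide everything by the estimated
    standard deviations s_i and apply least squares to the transformed
    data.  The regressors are 1/s_i and x_i/s_i, with no separate
    intercept, so the two coefficients are a_WLS and b_WLS."""
    Z = np.column_stack([1.0 / s, x / s])   # transformed "intercept" and slope variables
    w = y / s                               # transformed dependent variable
    a_wls, b_wls = np.linalg.lstsq(Z, w, rcond=None)[0]
    return a_wls, b_wls
```

Note that the transformed regression has no separate intercept; the coefficient on 1/si plays that role, exactly as in equation (8.19).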
Most of the things that we’ve done so far have been, to some degree, mechanical. We don’t exercise any discretion when we fit a line to a sample in chapter 4; we just follow the formulas. In chapter 5, the properties of that line follow without ambiguity from the assumptions that we make there. Inference, as in chapter 7, is largely, again, a matter of replacing values in standard formulas. As a consequence, we might have formed the impression that applied econometrics is a “cookbook” enterprise, generally a question of following standard statistical “recipes.” In practice, it probably is that far more often than is healthy or than we would want to admit. But good applied econometrics is almost always a question of artistry. If an econometric question is interesting, it is probably at least somewhat unique. If it is unique at all, appropriate analysis will probably require modifying the conventional econometric forms to fit the circumstances. We’ll talk about this issue at much greater length in chapters 10 and beyond. However, this is our first encounter with it. Where would we go to find a theory of heteroscedasticity? We have to go to our understanding of the behavior that’s embedded in our particular sample. What limits are there on the behavior that we might observe in various samples? Almost none. What limits does that put on the kinds of theories we might develop for heteroscedasticity? Again, almost none. In other words, the range of theories that might explain variations in the σi2’s is essentially unbounded. This means that, in what follows, we won’t be able to offer general formulas that hold for every possible variation. Instead, we’ll analyze a very simple example of heteroscedasticity. We’ll typically repeat the steps of this analysis when we encounter other potentially heteroscedastic situations, but we’ll have to modify them to conform to the specifics of those situations.
8.9 Example: Two Variances for the Disturbances

Let's imagine that our sample is drawn from a population containing two distinct subpopulations. The deterministic component of yi is the same for each subpopulation:

E(yi) = α + βxi,
as given in equation (5.16). However, the disturbances in the two subpopulations have different variances:
V(εi | subpopulation 1) = V(ε1i) = σ1²    (8.20)
and
V(εi | subpopulation 2) = V(ε2i) = σ2².    (8.21)
This formulation replaces σ 2 in the conventional population relationship of equation (5.6) with σ 12 and σ 22 . Therefore, it adds only a single parameter to the three that we estimated there. This should be easy to estimate in any sample of reasonable size. But why would we imagine this replacement? In our example of the relationship between income and education, our data contain observations on both men and women. Is there any reason to believe that the unobservable components of their incomes might have noticeably different variances? In other words, is there any reason to suspect that the range of incomes observed for men with any particular level of education would be greater or less than that for women with the same level of education? Here’s where theory, our behavioral understanding, comes in. It may be, for example, that women are less likely to be in the labor force than are men, at any level of education. This implies that, at any level of education, we would observe more women than men with incomes of zero. On its own, this effect would lead to a larger variance of incomes among all women than among all men. At the same time, working women may be subject to a “glass ceiling.” This is a metaphor for the suspicion that the career paths of working women are artificially truncated because employers are reluctant to promote them into the highest-paid, most responsible positions, regardless of their ability. This suggests that, among men and women with the same level of education, the former will hold a disproportionate proportion of the highest-income jobs. This effect would cause the variance of incomes among working men to be greater than that among working women. Together, these two arguments are ambiguous. If the first is more powerful, women should have income disturbances with larger variances. If the second is more powerful, it should be the men. It is also possible that they are of equal influence, leaving variances that are similar for the two subpopulations.13 What is certain, however, is that these arguments suggest an empirical question: Are the variances of the income disturbances for men and women different? More generally, the question is whether we can tell if the disturbance variances for two subpopulations differ. The White test described in section 8.6 might do the job. The problem is that, in this context, the White test is inefficient. It’s looking for any kind of heteroscedasticity that would invalidate the OLS variance estimates. We believe that the only possible heteroscedasticity would arise from different disturbance variances in two identifiable subpopulations. The White test doesn’t take advantage of this belief.
The Goldfeld-Quandt test provides a more efficient answer. The null hypothesis specifies that the two subpopulations do not have different disturbance variances:

H0: σ1² = σ2².    (8.22)
The alternative hypothesis is one-tailed. It relies on the theory motivating this inquiry to specify that one variance is larger than the other:

H1: σ1² > σ2².    (8.23)
The test itself, not surprisingly, consists of a formal comparison between estimates of the variances for the two subpopulations. In order to perform the Goldfeld-Quandt test, we first extract all n1 members of the first subpopulation from the sample. We then run the regression, y1i = a1 + b1 x1i + e1i ,
(8.24)
on this subsample alone. Here, y1i represents the dependent variable for observations from the first subpopulation, and x1i represents the independent variable for those observations. We calculate b1, the slope estimate for this subsample, using equation (4.37) and a1, the intercept estimate for this subsample, using equation (4.35). Next, we calculate e1i, the errors in this subsample, from equation (4.5). Finally, we estimate σ1² with

s1² = [Σi=1…n1 e1i²] / (n1 − 2),    (8.25)
where, again, n1 represents the number of sample members from the first subpopulation. Similarly, we run the regression

y2i = a2 + b2x2i + e2i    (8.26)
for the n2 sample members from the second subpopulation. We estimate σ2² with

s2² = [Σi=1…n2 e2i²] / (n2 − 2).    (8.27)
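Putting equations (8.24) through (8.27) together with the F comparison described next, a compact sketch of the Goldfeld-Quandt calculation might look like this in Python (the subsample arrays are whatever identifies the two groups in a particular application):

```python
import numpy as np
from scipy import stats

def goldfeld_quandt(x1, y1, x2, y2):
    """Sketch of the Goldfeld-Quandt test: fit OLS separately in each
    subsample, form s1^2 and s2^2 as in equations (8.25) and (8.27),
    and compare their ratio with the F critical value discussed next."""
    def subsample_variance(x, y):
        n = len(y)
        X = np.column_stack([np.ones(n), x])
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        e = y - X @ b
        return np.sum(e ** 2) / (n - 2), n

    s1_sq, n1 = subsample_variance(x1, y1)
    s2_sq, n2 = subsample_variance(x2, y2)

    ratio = s1_sq / s2_sq
    critical = stats.f.ppf(0.95, n1 - 2, n2 - 2)   # F.05(n1 - 2, n2 - 2)
    return ratio, critical
```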
Under our null hypothesis, the ratio of the two variance estimators is the Goldfeld-Quandt test statistic. It is an F random variable with n1 − 2 degrees of freedom in the numerator and n2 − 2 degrees of freedom in the denominator:

s1²/s2² ~ F(n1 − 2, n2 − 2).    (8.28)
What’s that? We may have learned about this type of random variable in a previous statistics course. But, even if we have, we might have forgotten, because there may not have been many uses for this random variable there. For that matter, this is the first time we’ve seen it here, formally. However, we encountered it briefly in chapter 1. As was implied there, the F random variable turns out to be indispensable for the interpretation of regression results. We’ll need it often from now on, especially in chapter 12. Practically speaking, the consequences of having an F random variable are straightforward. We go to table A.3 for the F-distribution and locate the column for n1 − 2 and the row for n2 − 2. The value of that entry is the critical value for this test, F.05(n1 − 2, n2 − 2). If the ratio of estimated disturbance variances is less than this critical value, s12 s22
(
)
< F.05 n1 − 2, n2 − 2 ,
then this ratio is not large enough to be inconsistent with the null hypothesis. The true ratio could plausibly be one, given the actual value of the estimated ratio. Accordingly, we do not reject the null hypothesis of homoscedasticity from equation (8.22). If, instead,

s1²/s2² ≥ F.05(n1 − 2, n2 − 2),
then the two variance estimators differ too greatly to be consistent with the null hypothesis that the true variances are the same. In this case, we reject the null hypothesis of homoscedasticity and accept the alternative of heteroscedasticity from equation (8.23). If we decide that our data are heteroscedastic, are we done? After all, we now have estimates of the variances for both subpopulations. Moreover, the regressions of equations (8.24) and (8.26) are each applied to a single subpopulation. Under the alternative hypothesis, each subpopulation is homoscedastic within itself. Therefore, a1 and b1 are the BLU estimators for the first subpopulation alone. Similarly, a2 and b2 are the BLU estimators for the
second subpopulation. Maybe we should just stick with these two regressions and declare victory. Maybe not. The problem here is that now we have two estimators of β, b1 and b2, that are almost surely at least a little different, even though we’ve assumed that the true effect of xi on yi is the same for each subpopulation. Similarly, we have two estimators for α, a1 and a2. This is confusing. Moreover, this is also inefficient, which may be further confusing. After all, we’ve just said that a1 and b1 are the most efficient linear unbiased estimators available within the first subpopulation, and a2 and b2 are the most efficient linear unbiased estimators available within the second. The trick is that we believe β and α to be the same for both subpopulations. We should be able to get better, meaning more precise, estimates of them if we combine the information that’s available about them in the two subsamples, rather than using each subsample to generate separate estimates. Ideally, we’d use each subsample separately to estimate the variances of the two subpopulations but pool them to estimate β and α, the parameters that they have in common. This returns us to the WLS procedure of equation (8.19). In the case we’re considering, we believe that there are only two values for the disturbance variances. We can use s1 as our estimate of σ1 for observations from the first subpopulation and s2 as our estimate of σ2 for observations from the second subpopulation. If we then calculate b and a as will be indicated in chapter 11, we’ll have estimators of β and α that are best linear unbiased and consistent. Let’s work through all of this with a simulated example. Table 8.2 presents the results of a regression run on the combined samples of figures 8.1 and 8.2. Each of those figures represents 100 observations. The combined, or pooled, sample therefore has 200 observations. The distribution of xi values within each figure, and therefore in the combined sample of 200 observations, is the same as in the simulations of tables 5.1, 5.2, 5.4, 7.1, and 7.2. In addition, all 200 observations share the same deterministic component that we used in all of these tables, as well as in table 7.5:
E(yi) = −20,000 + 4,000xi.
These parameters are restated in the first two columns of table 8.2. What distinguishes the sample for table 8.2 from the samples for previous simulations is exactly the point of figures 8.1 and 8.2: There are two different disturbance variances. These are stated explicitly in column 2 of table 8.2. The two different standard deviations are 40,000 and 10,000, respectively, 15,000 more and 15,000 less than the standard deviation of 25,000 that we have used in previous simulations.
TABLE 8.2 OLS on a simulated pooled sample with low and high disturbance variances

Population relationship                     Sample relationship
Constant: α = −20,000                       Intercept, a: −18,174
                                            Standard deviation: 8,277.8
                                            t-statistic: 2.20
                                            p-value: .0293
Coefficient: β = 4,000                      Slope, b: 3,935.0
                                            Standard deviation: 621.32
                                            t-statistic: 6.33
                                            p-value
                                            White heteroscedasticity-adjusted standard deviation
Standard deviation of disturbances:         Estimated standard deviation of disturbances, s
σ1 = 40,000                                 R2
σ2 = 10,000                                 Adjusted R2
                                            F-statistic
                                            p-value
                                            χ2 value for the White test of heteroscedasticity
                                            p-value for the White test of heteroscedasticity
                                            Sample size: 200

When xi is less than −β1/2β2, small increases in xi increase E(yi): Δy/Δx > 0. When xi is greater than −β1/2β2, small increases in xi reduce E(yi): Δy/Δx < 0. We can demonstrate this in an example that first occurred in section 1.7. There, we discussed the possibility that workers accumulate valuable experience rapidly at the beginning of their careers, but slowly, if at all, toward the end. In addition, at some point age reduces vigor to the point where productivity declines as well. This suggested that increasing age, serving as a proxy for increasing work experience, should increase earnings quickly at early ages, but slowly at later ages and perhaps negatively at the oldest ages. We can test this proposition in the sample that we've been following since that section. For illustrative purposes, the regression with two explanatory variables that we develop in chapters 11 and 12 is sufficient. Let's set xi equal to age and xi2 equal to the square of age in the population relationship of equation (13.13). The sample analogue is then

yi = a + b1xi + b2xi² + ei.    (13.19)
Applying equation (13.19) to our data, we obtain
earnings = −50,795 + 3,798 (age) − 40.68 (age squared) + error.    (13.20)
           (14,629)  (748.6)       (9.044)
As predicted, there are diminishing returns to age: b1 = 3,798 > 0 and b2 = −40.68 < 0.⁶ Let's figure out what all of this actually means. Equation (13.20) predicts, for example, that earnings at age 20 would be $8,893. The linear term contributes $3,798 × 20 = $75,960 to this prediction. The quadratic term contributes −$40.68 × (20)² = −$16,272. Finally, the intercept contributes −$50,795. Combined, they predict that the typical 20-year-old will earn somewhat less than $9,000 per year. Similar calculations give predicted earnings at age 30 as $26,533, at 40 as $36,037, at 50 as $37,405, and at 60 as $30,637. These predictions confirm that typical earnings first go up with age and then down. Exercise 13.5 replicates the analysis of equations (13.14) through (13.17) for the empirical relationship of equation (13.19). It proves that, in this context,

Δy/Δx ≈ b1 + 2b2xi,    (13.21)
where Δy now represents the estimated change in the predicted value of yi. The value for xi at which its effect on the predicted value of yi, ŷi, reverses direction is

xi = −b1/2b2.    (13.22)
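The arithmetic behind these predictions and the turning point of equation (13.22) is easy to reproduce. A minimal sketch in Python (not part of the text's own apparatus), using the rounded coefficients reported in equation (13.20); the turning point can differ from the text's figure in the second decimal because those coefficients are rounded:

```python
# Predictions from the quadratic specification estimated in equation (13.20).
a, b1, b2 = -50795.0, 3798.0, -40.68

def predicted_earnings(age):
    return a + b1 * age + b2 * age ** 2

for age in (20, 30, 40, 50, 60):
    print(age, round(predicted_earnings(age)))   # 8,893; 26,533; 36,037; 37,405; 30,637

turning_point = -b1 / (2 * b2)   # equation (13.22); roughly 46.7 years of age
print(turning_point)
```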
Applying equation (13.22) to the regression of equation (13.20), we see that maximum predicted earnings of $37,853 occur at about 46.67 years of age. We might expect that there are diminishing returns to schooling as well as to age. If for no other reason, we observe that most people stop going to school at some point relatively early in their lives. This suggests that the returns to this investment, as to most investments, must decline as the amount invested increases. We’re not quite ready to look at a regression with quadratic specifications in both age and schooling, so, again for illustrative purposes, let’s just rerun equation (13.19) on the sample of equation (13.20) with xi equal to years of schooling and x i2 equal to the square of years of schooling. We get
earnings = 19,419 − 4,233 (years of schooling) + 376.2 (years of schooling squared) + error.    (13.23)
           (6,561)  (1,100)                      (49.37)
This is a bit of a surprise. It’s still the case that b 1 and b 2 have opposite signs. However, b 1 < 0 and b 2 > 0! Taken literally, this means that increases in years of schooling reduce earnings by b 1, but increase them by b 2xi. When years of schooling are very low, the first effect is larger than the second, and additional years of schooling appear to actually reduce earnings. When years of schooling are larger, the second effect outweighs the first, and additional years of schooling increase earnings. There are two things about this that may be alarming. First, at least some years of schooling appear to reduce productivity. Second, as years of schooling become larger, the increment to earnings that comes from a small additional increase, b 2xi, gets larger as well. In other words, the apparent returns to investments in education are increasing! This raises an obvious question: If the next year of education is even more valuable than the last one, why would anyone ever stop going to school? Let’s use equation (13.22) to see how worried we should be about the first issue. In a context such as equation (13.23), where b 1 < 0 and b 2 > 0, increases in xi first reduce and then increase yi. This means that equation (13.22) identifies the value for xi that minimizes yi, rather than maximizes it as in the example of equation (13.20).7 This value is 5.63 years of schooling. Additional schooling beyond this level increases earnings. This is a relief. Almost no one has less than six years of schooling, just 39 people out of the 1,000 in our sample. Therefore, the implication that the first five or so years of schooling make people less productive is, essentially, an out-of-sample prediction. We speak about the risks of relying on them in section 7.5. Formally, they have large variances. Informally, there is often reason not to take them too seriously. Here, the best way to understand the predictions for these years is that they’re simply artifacts, unimportant consequences of a procedure whose real purpose is elsewhere. Equation (13.23) is not trying very hard to predict earnings for low levels of schooling, because those levels hardly ever actually appear in the data. Instead, it’s trying to line itself up to fit the observed values in the sample, which almost all start out at higher levels of schooling, to the quadratic form of equation (13.19). Given this form, the best way to do that happens to be to start from a very low value for earnings just below six years of schooling. What about the implication that schooling has increasing returns? Based on equation (13.23), continuing in school for another year after completing the seventh grade increases predicted annual earnings by $1,034. Continuing to the last year of high school after completing the 11th year of schooling increases predicted annual earnings by $3,291. Continuing from the 15th to the 16th year of schooling, typically the last year in college, increases
predicted annual earnings by $7,053. Taking the second year of graduate school, perhaps finishing a master’s degree, increases predicted annual earnings by $8,558. This is a pretty clear pattern of increasing returns. At the same time, it isn’t shocking. In fact, these predictions seem pretty reasonable. This still leaves the question of why we all stop investing in education at some point. Part of the reason is that each additional year of schooling reduces the number of subsequent years in which we can receive increased earnings by roughly one. So the lifetime return to years of schooling does not go up nearly as quickly as the annual return. At some point, the lifetime return actually has to go down because there aren’t enough years left in the working life to make up the loss of another year of current income. Coupled with the possibility that higher levels of education require higher tuition payments and more work, most of us find that, at some point, we’re happy to turn our efforts to something else.8 According to exercise 13.7, the regression of apartment rent on the number of rooms and the square of the number of rooms in the apartment yields the same pattern of signs as in equation (13.23). Exercise 13.8 demonstrates, however, that the regression of child mortality rates on linear and quadratic terms for the proportion of the rural population with access to improved water yields b 1 < 0 and b 2 < 0. Moreover, the regression of GNI per capita on linear and quadratic terms for the Corruption Perceptions Index (CPI), discussed in exercise 13.9, yields b 1 > 0 and b 2 > 0. How are we to understand the quadratic specification of equation (13.13) when both slopes have the same sign? In many circumstances, xi must be positive. This is certainly true in our principal example, where the explanatory variable is education. It also happens to apply to our other examples. Age, the number of rooms in an apartment, and the proportion of rural residents with access to improved drinking water have to be nonnegative, as a matter of physical reality. The CPI is defined so as to have only values between zero and ten. In cases such as these, if b 1 and b 2 have the same signs, the two terms b 1 and b 2xi in equation (13.21) also have the same signs. They reinforce each other. A small change in xi changes yi in the same direction for all valid values of xi. As these values get larger, the impact of a small change in xi on yi gets bigger. In order for the effects of xi on yi to vary in direction, depending on the magnitude of xi, the two terms to the right of the approximation in equation (13.21) must differ in sign. This is still possible if b 1 and b 2 have the same sign, but only if xi can be negative. Another way to make the same point is to realize that, if b 1 and b 2 have the same sign, the value of xi given by equation (13.22) must be negative. If the slopes are both positive, yi is minimized at this value for xi. If they are both negative, this value for xi corresponds to the maximum for yi.
Regardless of the signs on b 1 and b 2, the second term in equation (13.21) obviously becomes relatively more important as the value of xi increases. We’ve already seen this when the slopes have different signs, in our analyses of equations (13.20) and (13.23). It will appear again in this context in exercise 13.7. Moreover, exercises 13.8 and 13.9 will demonstrate that it’s also true when the slopes have the same sign. This illustrates an important point. If xi is big enough, the quadratic term in equation (13.19) dominates the regression predictions. The linear term becomes unimportant, regardless of the signs on b 1 and b 2. Consequently, the quadratic specification always implies the possibility that the effect of xi on yi will accelerate at high enough values of xi. As we’ve already said, this looks like increasing returns to scale when b 2 > 0, which we don’t expect to see very often. Even if b 2 < 0, the uncomfortable implication is still that, at some point, xi will be so big that further increases will cause yi to implode. The question of whether or not we need to take these implications seriously depends on how big xi has to be before predicted values of yi start to get really crazy. If these values are rare, then these implications are, again, mostly out-of-sample possibilities that don’t have much claim on our attention. If these values are not uncommon, as in the example of earnings at high levels of education, then we have to give some careful thought as to whether this apparent behavior makes sense.
13.4 Nonlinear Effects: Logarithms

Another way to introduce nonlinearity into the relationship between yi and its explanatory variables is to represent one or more of them as logarithms. As we recall, the logarithm of a number is the exponent that, when applied to the base of the logarithm, yields that number:

number = base^logarithm.
What makes logarithms interesting, at this point, is that they don't increase at the same rate as do the numbers with which they are associated. They increase much less quickly. For example, imagine that our base is 10. In this case, 100 can be expressed as 10². Therefore, its logarithm is two. Similarly, 10,000 is 10⁴. Consequently, its logarithm is four. While our numbers differ by a factor of 100, the corresponding logarithms differ by only a factor of two. If we double the logarithm again, to eight, the associated number increases by a factor of 10,000 to 10⁸, or 100,000,000. If we wanted the number whose logarithm was 100 times the
logarithm of 100, we would have to multiply 100 by 10¹⁹⁸ to obtain 10 followed by 199 zeros. Imagine that we specify our population relationship as

yi = α + β log xi + εi.    (13.24)
Equation (13.24) is a standard representation, but it embodies a couple of ambiguities. First, the expression “log xi” means “find the value that, when applied as an exponent to our base, yields the value xi.” In other words, this notation tells us to do something to xi. The expression “log” is, therefore, a function, as we discussed in section 3.2. The expression “xi” represents the argument. If we wanted to be absolutely clear, we would write “log(xi).” We’ll actually do this in the text, to help us keep track of what we’re talking about. However, we have to remember that it probably won’t be done anywhere else. Therefore, we’ll adopt the ordinary usage in our formal equations. Second, it should now be clear, if it wasn’t before, that β does not multiply “l” or “lo” or “log.” It multiplies the value that comes out of the function log(xi). If we wanted to be clearer still, we might write the second term to the right of the equality in equation (13.24) as β[log(xi)]. Conventionally, however, we don’t. So we have to be alert to all of the implied parentheses in order to ensure that we understand what is meant. Notational curiosities aside, what does equation (13.24) say about the relationship between xi and yi? The first thing it says is that a given change in log(xi) has the same effect on yi, regardless of the value for xi. A one-unit change in log(xi) alters yi by β, no matter what. This is of only marginal interest, because the logarithmic transformation of xi is just something that we do for analytical convenience. The variable that we observe is xi. That’s also the value that is relevant to the individuals or entities that comprise our sample. Therefore, we don’t care much about the effect of log(xi) on yi. We’re much more interested in what equation (13.24) implies about the effects of xi itself on yi. This implication is that a larger change in xi is necessary at high values of xi than at low values of xi in order to yield the same change in yi. As we see just before equation (13.24), a given change in log(xi) requires larger increments at larger values for xi than at smaller values. As a given change in log(xi) causes the same change in yi regardless of the value of xi, a given change in yi also requires larger increments of xi at larger values for xi than at smaller values. In other words, the relationship between xi and yi in equation (13.24) is nonlinear. If this was all that there was to the logarithmic transformation, we wouldn’t be very interested in it. To its credit, it creates a nonlinear relationship between
xi and yi with only one parameter, so it saves a degree of freedom in comparison to the quadratic specification of the previous section. However, we lose a lot of flexibility. As we saw earlier, the effect of xi on yi can change direction in the quadratic specification. The logarithmic specification forces a single direction on this relationship, given by the sign of β. Moreover, it isn't even defined when xi is zero or negative. Of course, the reason we have a whole section devoted to the logarithmic transformation is that there is something more to it. In order to achieve it, however, we have to be very careful about the base that we choose. Ten is a convenient base for an introductory example because it's so familiar. However, it's not the one that we typically use. Instead, we usually turn to the constant e. Note that we have now begun to run out of Latin as well as Greek letters. This does not represent a regression error. Instead, it is the symbol that is universally used to represent the irrational number whose value is approximately 2.71828.⁹ Logarithms with this base are universally represented by the expression "ln," generally read as "natural logarithm." What makes logarithms with e as their base so special? For our purposes, it's this. Imagine that we add a small increment to xi. We can rewrite this as

xi + Δx = xi(1 + Δx/xi).

The natural logarithm of xi + Δx is

ln(xi + Δx) = ln[xi(1 + Δx/xi)].

Using the rules of logarithms, we can rewrite this as

ln[xi(1 + Δx/xi)] = ln(xi) + ln(1 + Δx/xi).    (13.25)

The last term of equation (13.25) is where the action is. When the base is e and Δx is small,

ln(1 + Δx/xi) ≈ Δx/xi.    (13.26)
In other words, the natural logarithm of 1 + Δx/xi is approximately equal to the percentage change in xi represented by Δx.
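The next paragraph reports how well this works for a few values of Δx/xi; those numbers come from nothing more than evaluating the natural logarithm, for example:

```python
import math

# How close is ln(1 + d) to d itself?  (equation 13.26)
for d in (0.01, 0.10, 0.20):
    print(d, math.log(1 + d))
```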
How good is this approximation? Well, when Δx/xi = .01, ln(1.01) = .00995. So that's pretty close. When Δx/xi = .1, ln(1.1) = .0953, which is only a little less accurate. However, the approximation is less satisfactory when Δx/xi = .2, because ln(1.2) = .182.¹⁰ In general, then, it seems safe to say that the approximation in equation (13.26) is valid for changes of up to about 10% in the underlying value. The importance of equation (13.26) is easy to demonstrate. Rewrite equation (13.24) in terms of the natural logarithm of xi:

yi = α + β ln xi + εi.    (13.27)
With this specification, the expected value of yi is
E(yi) = α + β ln xi.    (13.28)
Now make a small change in xi (10% or less) and write the new value of E(yi) as
E(yi + Δy) = α + β ln(xi + Δx).    (13.29)
With equation (13.26), equation (13.25) can be approximated as

ln(xi + Δx) ≈ ln(xi) + Δx/xi.    (13.30)
If we substitute equation (13.30) into equation (13.29), we get

E(yi + Δy) ≈ α + β ln xi + β(Δx/xi),    (13.31)
where Δy again represents the change in the expected value. Now, if we subtract equation (13.28) from equation (13.31), the result is

Δy ≈ β(Δx/xi).    (13.32)
Finally, we rearrange equation (13.32) to obtain

β ≈ Δy / (Δx/xi).    (13.33)
The numerator of equation (13.33) is the magnitude of the change in the expected value of yi. The denominator is, as we said previously, the percentage change in xi. Therefore, β in equation (13.27) represents the absolute
change in the expected value of yi that arises as a consequence of a given relative change in xi. The specification of equation (13.27) may be appealing when, for example, the explanatory variable does not have natural units. In this case, the interpretation associated with equation (13.12) may not be very attractive. How are we to understand the importance of the change that occurs in yi as a consequence of a one-unit change in xi, if we don’t understand what a one-unit change in xi really consists of ? This is obviously not a problem in our main example, where xi is years of schooling. We all have a pretty good understanding of what a year of schooling is. Similarly, we know what it means to count the rooms in an apartment, so it’s easy to understand the change in rent that occurs when we add one. In contrast, the CPI, which we revisit in exercise 13.9, is an example. As we said in the previous section, the CPI is defined so as to have values ranging from zero to ten. However, these values convey only ordinal information: Countries with higher values are less corrupt than countries with lower values. They do not convey cardinal information: A particular score or the magnitude of the difference between two scores is not associated, at least in our minds, with a particular, concrete set of actions or conditions. In other words, the units for the CPI are arbitrary. An increase of one unit in this index doesn’t correspond to any additional actions or conditions that we could measure in a generally recognized way. The index could just as easily have been defined to vary from 0 to 1, from 0 to 100, or from 8 to 13. In sum, the absolute level of the CPI doesn’t seem to tell us much. Consequently, we may not be too interested in the changes in GNI per capita associated with changes in these levels. It might be informative to consider the effects of relative changes. For this purpose, the sample regression that corresponds to the population relationship of equation (13.27) is yi = a + b ln xi + ei .
If we apply this specification to our data on GNI per capita and the CPI, we get the regression
GNI per capita = −13,421 + 17,257 ln(CPI) + ei.    (13.34)
                 (2,183)   (1,494)
The slope is significant at much better than 5%. It says that a 10% increase in the CPI increases GNI per capita by $1,726. In practice, the specification of equation (13.27) isn’t very common because we don’t often have situations in which we expect a relative change in an
explanatory variable to cause an absolute change in the dependent variable. However, the converse situation, where an absolute change in an explanatory variable causes a relative change in the dependent variable, occurs frequently. This is represented by the population relationship ln yi = α + β xi + ε i .
(13.35)
Equation (13.35) is an example of the log-linear or semi-log specification. Exercise 13.10 demonstrates that, in this specification,

β = E[Δy/yi] / Δx.    (13.36)
The coefficient represents the expected relative change in the dependent variable as a consequence of a given absolute change in the explanatory variable. The most famous example of the semi-log specification is almost surely the Mincerian human capital earnings function.11 This specifies that the natural logarithm of earnings, rather than earnings itself, is what schooling affects. According to the interpretation in equation (13.36), the coefficient for schooling therefore represents the percentage increase in earnings caused by an additional year of schooling, or the rate of return on the schooling investment. Once again, we return to our sample of section 1.7 to see what this looks like. The sample regression that corresponds to the population relationship in equation (13.35) is ln yi = a + bxi + ei.
With yi defined as earnings and xi as years of schooling, the result is
ln(earnings) = 8.764 + .1055 (years of schooling) + error.    (13.37)
               (.1305)  (.00985)
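Because the slope in a semi-log regression is only an approximate percentage effect (it inherits the approximation of equation (13.26)), it can be worth computing the exact implied percentage change as well. A quick check using the slope from equation (13.37); the exact conversion is a standard refinement rather than something the chapter requires:

```python
import math

b = 0.1055                             # slope from equation (13.37)
approx_pct = 100 * b                   # approximate percentage effect of one more year
exact_pct = 100 * (math.exp(b) - 1)    # exact percentage change implied by a log change of b
print(approx_pct, exact_pct)           # roughly 10.6 versus 11.1
```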
Equation (13.37) estimates that each year of schooling increases earnings by approximately 10.6%.12 The last variation on the logarithmic transformation is the case where we believe that relative changes in the dependent variable arise from relative changes in the explanatory variable. This is represented by the population relationship ln yi = α + β ln xi + ε i .
(13.38)
The relationship in equation (13.38) portrays the log-log specification.
Exercise 13.12 demonstrates that, in this specification,

β = E[Δy/yi] / (Δx/xi) = ηyx.    (13.39)
The coefficient represents the expected relative change in the dependent variable as a consequence of a given relative change in the explanatory variable. As we either have learned or will learn in microeconomic theory, the ratio of two relative changes is an elasticity.13 Equation (13.39) presents a common, though not universal, notation for elasticities: the Greek letter η, or eta, with two subscripts indicating, first, the quantity being changed and, second, the quantity causing the change. The log-log specification is popular because the interpretation of its coefficient as an elasticity is frequently convenient. The sample analogue to the population relationship in equation (13.38) is ln yi = a + b ln xi + ei .
(13.40)
Let’s revisit the relationship between child mortality and access to improved water from the regression of equation (11.43). If we respecify this regression in log-log form, the result is percentage of rural population ln child mortality = 9.761 − 1.434 ln + ei . with access to improved water
(
)
(.7216 ) (.1722 )
(13.41)
The slope in equation (13.41) is statistically significant at much better than 5%. It indicates that an increase of 1% in the proportion of the rural population with access to improved drinking water would reduce the rate of child mortality by almost 1.5%. This seems like a pretty good return on a straightforward investment in hygiene.14
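Mechanically, the log-log regression of equation (13.40) is just OLS after transforming both variables. A minimal sketch, assuming numpy arrays x and y of positive values (the function name and arrays are placeholders, not part of the text):

```python
import numpy as np

def log_log_fit(x, y):
    """Sketch of equation (13.40): regress ln(y) on a constant and ln(x).
    The slope is then interpreted as an elasticity, as in equation (13.39)."""
    ln_x, ln_y = np.log(x), np.log(y)
    X = np.column_stack([np.ones_like(ln_x), ln_x])
    a, b = np.linalg.lstsq(X, ln_y, rcond=None)[0]
    return a, b   # b plays the role of the elasticity estimate
```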
13.5 Nonlinear Effects: Interactions

In our analysis of equation (13.12), we observe that the effect of a change in x1i doesn't depend on the level of x1i. We might equally well observe that it doesn't depend on the level of x2i, either. But why should it? Well, we've already had an example of this in table 1.4 of exercise 1.4. There, we find intriguing evidence that the effect of age on earnings depends on sex.15 Similarly, we might wonder if the effect of schooling depends on sex. The effects of age and schooling might also vary with race or ethnicity. For that matter, the effects of age and schooling might depend on each other: If two
people with different ages have the same education, it’s likely that the education of the older individual was completed at an earlier time. If the quality or content of education has changed since then, the earlier education might have a different value, even if it is of the same length. The specification of equation (11.1) doesn’t allow us to investigate these possibilities. Let’s return to our principal example in order to see why. Once again, yi is annual earnings. The first explanatory variable, x1i, is a dummy variable identifying women. The second, x2i, is a dummy variable identifying blacks. For simplicity in illustration, we’ll ignore all of the other variables on which earnings might depend. With the population relationship of equation (11.1), we can distinguish between the expected earnings of four different types of people. For men who are not black, x1i = 0 and x2i = 0. According to equation (11.5),
E(yi | man, not black) = α + β1(0) + β2(0) = α.    (13.42)
For women who are not black, x1i = 1 and x2i = 0. Therefore,
E(yi | woman, not black) = α + β1(1) + β2(0) = α + β1.    (13.43)
For black men, x1i = 0 and x2i = 1. Consequently,
E(yi | man, black) = α + β1(0) + β2(1) = α + β2.    (13.44)
Finally, black women have x1i = 1 and x2i = 1:
E(yi | woman, black) = α + β1(1) + β2(1) = α + β1 + β2.    (13.45)
In this specification, the expected value of earnings is constant for all individuals of a particular type. However, the expected values of earnings for these four types of people are all different. Still, this specification isn’t complete. The effects of being female and being black can’t interact. The difference in expected earnings between a nonblack woman and a nonblack man, from equations (13.43) and (13.42), is
E(yi | woman, not black) − E(yi | man, not black) = (α + β1) − α = β1.    (13.46)
The difference between a black woman and a black man, from equations (13.45) and (13.44), is
E(yi | woman, black) − E(yi | man, black) = (α + β1 + β2) − (α + β2) = β1.    (13.47)
These differences are identical, even though the employment traditions of men and women might differ across the two races. Exercise 13.14
demonstrates that, similarly, the specification of equation (11.1) forces the effect of race to be the same for both sexes. Another way to describe the implications of equation (11.1) in this context is that the effects of race and sex are additive. Expected earnings differ by a fixed amount for two individuals of the same sex but different race. They differ by another fixed amount for two individuals of the same race but different sex. The difference between two individuals who differ in both sex and race is simply the sum of these two differences. Of course, it’s possible that these effects really are additive. But why insist on it, before even looking at the evidence? In other words, we prefer to test whether additivity is really appropriate, rather than to assume it. In order to do so, we need to introduce the possibility that the effects are not additive. This means that we need to allow for the possibility that the effects of race and sex are interdependent. These kinds of interdependencies are called interactions. Interactions are another form of nonlinear effect. The difference is that, instead of changing the representation of a single explanatory variable, as we do in the previous two sections, interactions multiply two or more explanatory variables. What this means is that we create a new variable for each observation, whose value is given by the product of the values of two other variables for that same observation. The simplest representation of an interaction in a population relationship is yi = α + β1 x1i + β2 x2 i + β3 x1i x2 i + ε i .
(13.48)
In this relationship, the expected value of yi is
E(yi) = α + β1x1i + β2x2i + β3x1ix2i.    (13.49)
Following what is now a well-established routine, let’s change x1i by Δx 1 and see what happens. Representing, yet again, the change in the expected value of yi as Δy, equation (13.49) gives us
E(yi) + Δy = α + β1(x1i + Δx1) + β2x2i + β3(x1i + Δx1)x2i
           = α + β1x1i + β2x2i + β3x1ix2i + β1Δx1 + β3x2iΔx1.    (13.50)
When we, predictably, subtract equation (13.49) from equation (13.50), we get Δy = β1Δx1 + β3 x2 i Δx1 .
(13.51)
Finally, we divide both sides of equation (13.51) by Δx1:

Δy/Δx1 = β1 + β3x2i.    (13.52)
Equation (13.52) states that the effect of a change in x1i on the expected value of yi has a fixed component, β1, and a component that depends on the level of x2i, β3x2i. This second term embodies the interdependence.16 To see how this works in practice, let’s return to our example of equations (13.42) through (13.47). In this case, each of the individual variables, x1i and x2i, has only the values zero and one. This means that their product, x1ix2i, can also have only these two values. It can only equal one if both of its individual factors also equal one. In other words, x1i x2i = 1 only if x1i = 1 and x2i = 1. The first condition, x1i = 1, identifies the observation as a woman. The second condition, x2i = 1, identifies the observation as a black. Therefore, the value x1i x2i = 1 identifies the observation as a black woman. Consequently, equation (13.49) gives expected earnings for men who are not black as
E(yi | man, not black) = α + β1(0) + β2(0) + β3(0) = α.    (13.53)
For women who are not black,
E(yi | woman, not black) = α + β1(1) + β2(0) + β3(0) = α + β1.    (13.54)
For black men,
E(yi | man, black) = α + β1(0) + β2(1) + β3(0) = α + β2.    (13.55)
Finally, for black women,
E(yi | woman, black) = α + β1(1) + β2(1) + β3(1) = α + β1 + β2 + β3.    (13.56)
With this specification, expected earnings for nonblack men, nonblack women, and black men in equations (13.53) through (13.55) are the same as in equations (13.42) through (13.44). In other words, the specification of equation (13.48) and that of equation (11.1) have the same implications for these three groups. Consequently, the difference in expected earnings between nonblack women, given in equation (13.54), and nonblack men, given in equation (13.53), is simply β1. It’s identical to this difference under the specification of equation (11.1), given in equation (13.46). How does this compare to the difference that is given by equation (13.52)? That equation requires some careful interpretation in our current context, because we usually think of terms beginning with “Δ” as indicating small changes. Here, all of the explanatory variables are dummies. This means that the only possible changes are from zero to one and back. In other words, here
equation (13.52) must be comparing two different individuals, one with the characteristic indicated by x1i and one without. In addition, it is helpful to recall that these two individuals must be of the same race. How do we know that? Because we allow the terms β2x2i in equations (13.49) and (13.50) to cancel when we subtract the former from the latter to get equation (13.51). This is only valid if both individuals have the same value for x2i, which, here, means the same racial identity. With these preliminaries, equation (13.52) tells us that the difference between the expected earnings for a woman and a man when x2i = 0, that is, when both are nonblack, should be exactly β1. It agrees perfectly with the difference between equations (13.54) and (13.53). The difference between equations (13.48) and (11.1) is in their implications for black women. Expected earnings for black women in equation (13.56) are again a constant. However, it differs from the constant of equation (13.45) by the coefficient on the interaction term, β3. Therefore, the difference between expected earnings for a black woman and for a black man is, from equations (13.56) and (13.55),
E(yi | woman, black) − E(yi | man, black) = (α + β1 + β2 + β3) − (α + β2) = β1 + β3.    (13.57)
This, again, is exactly the difference given by equation (13.52), now that x2i = 1. It is not the same as the difference between expected earnings for a nonblack woman and for a nonblack man unless β3 = 0. If β3 ≠ 0, then the effects of sex and race on earnings are interdependent. As always, we can check this out empirically with our sample of section 1.7. The sample counterpart to the population relationship of equation (13.48) is yi = a + b1 x1i + b2 x2 i + b3 x1i x2 i + ei .
(13.58)
With x1i and x2i representing women and blacks, the estimates are
earnings = 40,060 − 18,612(female) − 13,163(black) + 11,561(black female) + error.    (13.59)
           (1,923)   (2,740)           (7,317)          (10,354)
The slopes for women and blacks are familiar. They are only slightly larger in magnitude than in our very first regression of figure 1.1. However, the slope for black women is positive and nearly as large in magnitude as the slope for all blacks! For black women, the magnitude of the combined effect of these
two slopes is essentially zero. This suggests that, on net, earnings for black women are not affected by race. The regression of equation (13.59) estimates the expected value of earnings for black women, as given in equation (13.56), as approximately equal to that for nonblack women, given in equation (13.54). This conclusion must be drawn with some caution. The t-statistic of the slope for all blacks is 1.80, significant at better than 10%. However, the t-statistic of the slope for black women is only 1.12, with a p-value of .2644. This fails to reject the null hypothesis that there is no interaction between race and sex, H 0 : β3 = 0. At the same time, the regression of equation (13.59) also fails to reject the joint null hypothesis that the effects of being black and being a black woman cancel each other. This hypothesis is equivalent to the null hypothesis that expected earnings for nonblack and black women are the same. Formally, it is H 0 : β2 + β3 = 0. The F-statistic for the test of this hypothesis is only .05, with a p-value of .8270. In sum, the evidence from equation (13.59) is weak. The sample on which it is based does not contain enough information to estimate the interaction effect of being a black female very precisely. Consequently, this equation can’t tell the difference, at least to a statistically satisfactory degree, between the expected earnings of black women and black men or between black women and nonblack women. Fortunately, we have stronger evidence available to us. If we repeat the regression of equation (13.59) using the entire sample of section 7.6, the result is
earnings = 40,784 − 19,581(female) − 14,520(black) + 14,352(black female) + error.    (13.60)
           (153.4)   (218.7)           (620.2)          (870.1)
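Before comparing these estimates, here is a hedged sketch of how an interacted specification like equation (13.58) can be estimated and its hypotheses tested in Python. The DataFrame `df` and its columns `earnings`, `female`, and `black` are hypothetical stand-ins, not the samples used in equations (13.59) and (13.60).

```python
# A sketch of the interacted specification in equation (13.58). The DataFrame
# `df` and its 0/1 columns `female` and `black` are hypothetical stand-ins.
import statsmodels.formula.api as smf

def fit_interacted_earnings(df):
    df = df.assign(black_female=df["female"] * df["black"])  # interaction term
    model = smf.ols("earnings ~ female + black + black_female", data=df).fit()
    # H0: no interaction between sex and race (beta3 = 0)
    test_no_interaction = model.t_test("black_female = 0")
    # H0: the race and interaction effects cancel for women (beta2 + beta3 = 0)
    test_cancellation = model.f_test("black + black_female = 0")
    return model, test_no_interaction, test_cancellation
```

The two tests correspond to the null hypotheses discussed around equations (13.59) and (13.60).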
The intercept and slopes of equation (13.60) do not differ from those of equation (13.59) in any important way. However, the sample for equation (13.60) is more than 100 times as large as that for equation (13.59). Predictably, the standard deviations here are less than one-tenth of those for the smaller sample.17 Exercise 13.16 confirms that equation (13.60) rejects the null hypothesis that H 0 : β3 = 0 but again fails to reject the null hypothesis H 0 : β2 + β3 = 0. This evidence strongly suggests that expected earnings for black and nonblack women are similar. Race appears to affect expected earnings only for black men. When x1i and x2i are both dummy variables, the interaction between them has the effect of altering the “constant” for the case where x1i = 1 and x2i = 1, as we show in our discussion of equation (13.57). When x2i is a
continuous variable, the effect of the interaction term is to assign different slopes to x2i, depending on whether x1i = 0 or x1i = 1. In the first case, equation (13.49) again reduces to equation (5.1):
E(yi) = α + β1(0) + β2x2i + β3(0)x2i = α + β2x2i.
In the second case, equation (13.49) reduces to a more elaborate version of equation (5.1):
E(yi) = α + β1(1) + β2x2i + β3(1)x2i = (α + β1) + (β2 + β3)x2i.
β3 is the difference between the slopes for observations with x1i = 0 and x1i = 1.18 We can illustrate this in the sample of equation (13.59) by redefining x2i as schooling. Equation (13.58) becomes
earnings = −21,541 + 10,207(female) + 4,841(years of schooling) − 2,226(years of schooling for women) + error.    (13.61)
           (6,098)    (8,887)           (466.1)                     (682.0)
This is quite provocative! The slope for the female dummy variable is now positive!19 However, it’s also statistically insignificant. The best interpretation is therefore that there’s no evidence that the constant component of earnings differs between men and women. The reason that the slope for the female dummy variable has changed signs, relative to previous regressions, is that the slope for women’s years of schooling is negative. Moreover, it’s statistically significant at much better than 5%. This indicates that β3 is negative. In fact, the estimated return to schooling for women is b 2 + b 3 = 2,615, only slightly more than half the return to schooling for men! While previous regressions have suggested that women’s earnings are less than men’s by a rather large constant, equation (13.61) indicates, instead, that the gap between them is relatively small at low levels of education but increases substantially with higher levels of schooling! The third possible form of interaction is between two continuous variables. The interpretation of this form is directly embodied in equation (13.52). It specifies that the slope for each variable is different, depending on the level of the other. This is probably the most difficult interpretation to understand. Fortunately, we have the illustration that we suggested at the beginning of this section. Individuals of different ages typically had their educations at different times. Educations of different vintages may have different values for many reasons.
The content of education has certainly changed. The quality may have changed as well. Of course, these changes could either reduce or increase the value of older educations relative to those that are more recent. At the same time, the value of work experience may vary with education. Workers with little education may not have the foundation necessary to learn more difficult work tasks. If so, they will benefit less from experience than will workers who are better educated. This suggests that experience and education may be complements. As we approximate experience here with the variable measuring age, this suggests that the returns to age will increase with higher education. Doesn’t this sound interesting? Let’s find out what the data can tell us. Returning, yet again, to the sample of equation (13.59), we retain the definition of x2i as schooling from equation (13.61). We now define x1i as age. Here’s what we get:
earnings = 7,678 − 552.5(age) + 582.2(years of schooling) + 74.23(age × years of schooling) + error.    (13.62)
           (17,270)  (402.0)      (1,366)                     (31.37)
This is really provocative! According to equation (13.62), b 1 and b 2 are both statistically insignificant! This suggests that β1 = 0 and β2 = 0: Neither age nor education has any reliable effect on earnings, on their own. In contrast, b 3 is significantly greater than zero. In sum, equation (13.62) indicates that the only effect of schooling or age on earnings comes from the second term to the right of the equality in equation (13.52), β3x2i, for x1i, and its analogue from exercise 13.15, β3x1i for x2i. What, quantitatively, does this mean? The best way to illustrate this is with table 13.1. This table presents values of b 3x1i x2i for selected values of x1i and x2i. These values are estimates of the contribution of the interaction between age and schooling to earnings, β3x1i x2i.
TABLE 13.1 Predicted earnings from equation (13.62)

                      Years of schooling
Age        8           12          16          18
30      $17,815     $26,722     $35,630     $40,084
40      $23,753     $35,630     $47,507     $53,445
50      $29,692     $44,538     $59,384     $66,807
60      $35,630     $53,445     $71,260     $80,168
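Because the intercept and the two individual slopes are treated as zero in these predictions, each entry in table 13.1 is simply the interaction contribution b3 × age × years of schooling. A quick Python check, using the coefficient reported in equation (13.62):

```python
# Each entry in table 13.1 is the interaction contribution b3 * age * schooling,
# with b3 = 74.23 from equation (13.62). Small differences from the printed
# table reflect rounding of the reported coefficient.
b3 = 74.23
for age in (30, 40, 50, 60):
    print(age, [round(b3 * age * school) for school in (8, 12, 16, 18)])
```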
Looking down each of the columns in table 13.1, we see that increases in age make bigger contributions to earnings at higher levels of education. Looking across each of the rows in table 13.1, we see that higher levels of education make bigger contributions to earnings at higher ages. The estimates in table 13.1 have some claim to be the best predictions of earnings that can be made on the basis of equation (13.62). First, they actually look pretty reasonable. A 30-year-old with no high school education makes $17,815 per year? A 60-year-old with a master’s degree makes $80,168? Nothing here conflicts substantially with our intuition. More seriously, the values of a, b 1, and b 2 are statistically insignificant. They fail to reject the individual hypotheses H 0 : α = 0, H 0 : β1 = 0, and H 0 : β2 = 0. If we were to impose these null hypotheses on the population relationship of equation (13.49), it would become E(yi) = β3x1i x2i. In this case, yˆi = b3 x1i x2 i . The values in table 13.1 would be the appropriate predictions of expected earnings. The question of whether we are comfortable imposing these null hypotheses turns on two points. First, how certain are we, based on intuition or formal economic reasoning, that education and age should make independent contributions to earnings? Second, how confident are we that the data on which we base equation (13.62) are adequate for the purpose of testing whether these contributions exist? In this case, we have pretty easy answers. It’s hard to accept that age or education doesn’t make any impact on earnings at all, apart from their interaction. Moreover, using equation (12.29), we find that the test of the joint hypothesis H 0 : β1 = β2 = 0 in equation (13.62) yields an F-statistic of 11.02, with a p-value of less than .0001. This is a pretty powerful rejection. So even though this equation can’t pin down the individual values of β1 and β2 very accurately, it’s almost certain that they’re not both zero. Also, no matter how good the sample of equation (13.62) is, we know that we have a much larger sample available. We’ve used it most recently to calculate equation (13.60). Therefore, the right thing to do is to re-examine the question asked by equation (13.62) using this larger sample. The result is
earnings = −7,678 − 201.1(age) + 2,079(years of schooling) + 39.99(age × years of schooling) + error.    (13.63)
           (1,232)   (28.64)       (98.60)                     (2.259)
The signs on the slopes of equation (13.63) are identical to those in equation (13.62).20 However, the standard errors are all, as we would expect, much smaller. All three slopes are statistically significant at much better than 5%.
The estimates here of β1 and β2 indicate that both x1i and x2i make independent contributions to earnings. The estimate of β3 again indicates that age and schooling are complements. Returning to equation (13.52), we find that the total effect of a change in age on earnings is
Δ(earnings)/Δ(age) = −201.1 + 39.99(years of schooling).    (13.64)
Formally, equation (13.64) implies that increases in age reduce earnings if years of schooling are five or fewer. As we observe in our discussion of equation (13.23), this is not particularly interesting because almost everyone has more schooling than this. Regardless, each additional year of schooling increases the annual return to age, and presumably to experience, by approximately $40. For someone with an eighth-grade education, each additional year of experience should increase earnings by almost $120. For someone with a college degree, earnings should go up by nearly $440 with an additional year of experience. Similarly, the total effect of schooling on earnings is
Δ(earnings)/Δ(years of schooling) = 2,079 + 39.99(age).
At age 20, an additional year of schooling increases earnings by $2,879. At age 40, the increase is $3,679. At age 60, it is $4,478.
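These marginal effects follow directly from equation (13.52) and its analogue in exercise 13.15, evaluated at the coefficients of equation (13.63). A minimal Python check of the dollar figures just quoted:

```python
# Marginal effects implied by equation (13.63).
b_age, b_school, b_interaction = -201.1, 2079.0, 39.99

def return_to_age(years_of_schooling):
    # Equation (13.64): the annual return to age rises with schooling.
    return b_age + b_interaction * years_of_schooling

def return_to_schooling(age):
    # The return to a year of schooling rises with age.
    return b_school + b_interaction * age

print(return_to_age(8), return_to_age(16))               # about 119 and 439 dollars
print(return_to_schooling(20), return_to_schooling(60))  # about 2,879 and 4,478 dollars
```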
13.6 Conclusion

We can see that dummy variables, quadratic and logarithmic specifications, and interaction terms allow us to construct regression specifications that are much more flexible than might be apparent from equation (11.1) alone. Moreover, our discussion has by no means exhausted the possibilities. Nonlinear relationships may occasionally be represented by variables expressed as reciprocals or even trigonometric functions.21 We may, on rare occasions, find regression specifications that try to achieve even more flexibility, and complexity, by appending a cubic or even a quartic term in the explanatory variable. The first would contain the quantity xi³, the second the quantity xi⁴. With a cubic term, the relationship between xi and yi can change directions twice. With a quartic term, it can happen three times. If these patterns are appropriate for the behavior under study, these
specifications can be very revealing. If not, of course, they can be thoroughly confusing. The overall message of this chapter is that we don’t always have to force the relationship in which we’re interested to fit the simple specification of equation (11.1). We will need much more training before we can explore nonlinear relationships among the parameters. But with a little ingenuity, nonlinear treatments of the variables can modify equation (11.1) to accommodate a much broader range of behavior than we might have otherwise guessed.
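The cubic and quartic specifications mentioned above are easy to construct. Here is a minimal Python sketch; the data and coefficient values are simulated purely for illustration.

```python
# A sketch of cubic and quartic specifications with simulated, illustrative data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.uniform(0, 10, 200)})
df["y"] = 5 + 2 * df["x"] - 0.5 * df["x"] ** 2 + 0.03 * df["x"] ** 3 + rng.normal(0, 1, 200)

cubic = smf.ols("y ~ x + I(x**2) + I(x**3)", data=df).fit()
quartic = smf.ols("y ~ x + I(x**2) + I(x**3) + I(x**4)", data=df).fit()
# With a cubic term the fitted relationship can change direction twice;
# with a quartic term, three times.
print(cubic.params)
print(quartic.params)
```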
EXERCISES
13.1 Assume that εi has the properties of chapter 5. (a) Apply the rules of expectations from chapter 5 to equation (13.5) in order to derive equation (13.6). (b) Replace equation (13.1) in equation (13.7) and repeat the analysis of part a to derive equation (13.8). (c) Redefine x2i as equal to one when the characteristic in question is not present and zero when it is. Restate equations (13.1) through (13.9) with this definition. How do these restated equations differ from the originals? What, now, is the difference in E(yi) for observations with and without the characteristic? How does this compare to the difference in equation (13.9)? 13.2 Section 13.2 alludes to two circumstances in which the regression calculations are not well defined. (a) Toward the end of section 13.2, we make the claim that the results pertaining to regression slopes in chapters 11 and 12 don’t depend on the values of the associated explanatory variables, “as long as these values vary across observations.” Why did we have to add this qualification? What happens if the value of an explanatory variable does not vary across observations? (b) Demonstrate that the sample CORR(x1i, x2i) = −1 if equation (13.10) is true. Recall the consequences from chapter 12. 13.3 Imagine that we intend x2i to be a dummy variable in the population relationship of equation (11.1). (a) Unfortunately, we make a mistake. We assign the value x2i = 2 to observations that have the characteristic of interest and the value x2i = 1 to observations that do not. Modify the analysis of equations (13.1) through (13.9) to derive the appropriate interpretation of β2 in this case. (b) Based on the answer to part a, what would be the consequence if we assigned the value x2i = 3 to observations that have the characteristic of
interest and the value x2i = 2 to observations that do not? What about the values 5 and 4? What about any two values that differ only by one? (c) Based on the answers to parts a and b, what would be the consequence if we assigned the value x2i = 3 to observations that have the characteristic of interest and the value x2i = 1 to observations that do not? What about any two values that differ by two? What about any two values that differ by any amount? 13.4 Consider the regression in equation (13.20). (a) Demonstrate, either with confidence intervals or two-tailed hypothesis tests, that a, b 1, and b 2 are statistically significant. (b) Verify the predicted earnings at ages 30, 40, 50, and 60. (c) Use equation (13.22) to verify that maximum predicted earnings occur at approximately 46.67 years of age. 13.5 Return to the quadratic regression specification of equation (13.19). The predicted value of yi from this specification is yˆi = a + b1 xi + b2 xi2 . (a) Change xi by Δx. Demonstrate that the new predicted value is
ŷi + Δy = a + b1xi + b1Δx + b2xi² + 2b2xiΔx + b2(Δx)².

(b) Subtract the expression for ŷi from the result of part a to obtain
Δy = b1Δx + 2b2xiΔx + b2(Δx)².

(c) State the assumption that is necessary in order to approximate the relationship in part b with Δy ≈ b1Δx + 2b2xiΔx. (d) Starting with the answer to part c, prove that Δy/Δx ≈ b1 + 2b2xi. 13.6 Consider the regression in equation (13.23). (a) Demonstrate, either with confidence intervals or two-tailed hypothesis tests, that a, b 1, and b 2 are statistically significant. (b) Verify the predicted changes in annual earnings at 7, 11, 15, and 17 years of schooling.
(c) Use equation (13.22) to verify that minimum predicted earnings occur at approximately 5.63 years of schooling. 13.7 The quadratic version of the regression in equation (12.42) is

rent = 668.4 − 43.44(all rooms) + 15.70(all rooms)² + ei.
       (49.91)  (26.97)            (3.329)
(a) Interpret the signs and values of b 1 and b 2 following the analysis of section 13.3. (b) Check, either with confidence intervals or two-tailed hypothesis tests, whether b 1 and b 2 are statistically significant. Does the answer alter or confirm the interpretations of part a? (c) Predict rents for apartments of two, four, six, and eight rooms. Compare these predictions. Does the comparison seem plausible? If yes, why? If no, why not, and what might explain the anomalies? 13.8 The regression of equation (11.43), with standard errors, is percentage of rural population child mortality = 200.5 − 1.764 + ei . with access to improved water
(14.66 ) (.1973)
(a) Demonstrate, either with confidence intervals or two-tailed hypothesis tests, that b is statistically significant. (b) The quadratic specification of this regression yields percentage of rural population child mortality = 152.5 − .05644 with access to improved water percentage of rural population 2 − .01305 + ei . with acccess to improved water Interpret the signs and values of b 1 and b 2 following the analysis of section 13.3. (c) For this regression, SD(b 1) = 1.125 and SD(b 2) = .00846. Are either of these slopes statistically significant? Does the answer to this alter or confirm the interpretations of part b? (d) The R 2 value for this regression is .3865. Test the null hypothesis H 0 : β1 = 0 and β2 = 0 using equation (12.35). What can we conclude about the joint effects of the variable measuring rural access to water and its square on child mortality? Comparing this conclusion to that of part c, what can we conclude about the relationship between these two variables?
Making Regression More Flexible
533
(e) Use equation (12.36) to derive the correlation between b 1 and b 2. (f ) Equation (11.43), with the linear explanatory variable replaced by the quadratic explanatory variable, is percentage of rural population 2 child mortality = 150.8 − .01346 + ei . with access to improved water
(9.390) (.00147)
Is b statistically significant? (g) Compare equation (11.43), the quadratic specification in part b, and the bivariate regression in part f. In addition to what we’ve already learned about these regressions in this exercise, their R 2 values are, respectively, .3754, .3865, and .3864. Which regression seems to be the most compelling way to represent the information in the sample? Why? Predict the child mortality rates for water accessibility rates of 30%, 60%, and 90% using each of the regressions. Does it make much difference which regression we use? Why or why not? 13.9 The regression of equation (11.41), with standard errors, is
(
)
GNI per capita = −7, 399 + 4, 013 CPI + ei .
(1, 373) (277.7)
(a) Demonstrate, either with confidence intervals or two-tailed hypothesis tests, that b is statistically significant. (b) The quadratic specification of this regression yields
(
)
(
)
2
GNI per capita = −1, 618 + 1, 403 CPI + 239.1 CPI + ei . Interpret the signs and values of b 1 and b 2 following the analysis of section 13.3. (c) For this regression, SD(b 1) = 1,398 and SD(b 2) = 125.6. Are either of these slopes statistically significant? Does the answer to this alter or confirm the interpretations of part b? (d) For this regression, SD(a) = 3,323. Is the intercept statistically significant? At this point, what do we think of this regression? (e) The R 2 value for this regression is .7507. Test the null hypothesis H 0 : β1 = 0 and β2 = 0 using equation (12.35). What can we conclude about the joint effects of the CPI and its square on child mortality? Comparing this conclusion to that of part c, what can we conclude about the relationship between these two variables? Comparing this conclusion to that of part d, what can we conclude about this regression as a whole?
534
Chapter 13
(f ) Use equation (12.36) to derive the correlation between b 1 and b 2. (g) Equation (11.41), with the linear explanatory variable replaced by the quadratic explanatory variable, is
( ) (837.9) ( 24.53)
2
GNI per capita = 1, 610 + 362.8 CPI + ei .
Is b statistically significant? (h) Compare equation (11.41), the quadratic specification in part b, and the bivariate regression in part g. In addition to what we’ve already learned about these regressions in this exercise, their R 2 values are, respectively, .7383, .7507, and .7473. Which regression seems to be the most compelling way to represent the information in the sample? Why? Predict the GNI per capita when the CPI is at 3.0, 6.0, and 9.0 using each of the regressions. Does it make much difference which regression we use? Why or why not? 13.10 Consider the log-linear specification of equation (13.35). (a) What is the expression for E(ln yi) in terms of α, β, and xi? (b) Change xi by Δx. Write the new expected value of the dependent variable as
(
)
(
)
E ln yi + Δy = α + β xi + Δx . (c) Use equation (5.13), regarding the expected value of a summation, and the approximation of equation (13.30) to rewrite Δy E ln yi + Δy = E ln yi + E . yi
(
)
( )
(d) Replace the answer to part c in the expression from part b. Subtract the expression in part a and rearrange to obtain
β≈
E[Δy / yi ] Δx
.
13.11 We return to the regression in equation (12.42) and respecify it in log-linear form:
(
)
(
)
ln rent = 6.112 + .1030 all rooms + ei .
(.04139) (.01064 )
Making Regression More Flexible
535
(a) Is the slope for the variable measuring the number of rooms statistically significant? Why or why not? (b) Interpret the value of the slope for the variable measuring the number of rooms, referring to equation (13.36). Is this indicated effect large or small? Why? 13.12 Consider the log-log specification of equation (13.38). (a) What is the expression for E(ln yi) in terms of α, β, and xi? (b) Change xi by Δx. Write the new expected value of the dependent variable as
(
)
(
)
E ln yi + Δy = α + β ln xi + Δx . (c) Use the approximation of equation (13.30) and the result of exercise 13.10c to rewrite Δy Δx E ln ( yi ) + E ≈ α + β ln xi + β . xi yi (d) Subtract the expression in part a from the expression in part c and rearrange to obtain
β≈
E[ Δy / yi ] Δx / xi
.
13.13 If we respecify the regression of equation (13.34) in the log-log form of equation (13.35), we obtain
(
)
(
)
ln GNI per capita = 6.394 +1.733ln CPI + ei .
(.2323) (.1590)
(a) Is the slope for the natural log of the CPI statistically significant? Why or why not? (b) Interpret the value of the slope for the natural log of the CPI, referring to equation (13.39). Is this elasticity big or small? Why? 13.14 Demonstrate that the population relationship of equation (11.1) forces the effects of race to be identical regardless of sex. (a) Subtract equation (13.42) from equation (13.44) to derive the expected earnings difference between a black and a nonblack man. (b) Subtract equation (13.43) from equation (13.45) to derive the expected earnings difference between a black and a nonblack woman. (c) Compare the answers to parts a and b. (d) Make the comparison of part a for the interacted specification of equation (13.48), using equations (13.53) and (13.55). Is the result the same as in part a?
536
Chapter 13
(e) Make the comparison of part b for the interacted specification of equation (13.48), using equations (13.54) and (13.56). Is the result the same as in part b? 13.15 Return to equation (13.49). Change x2i by Δx 2. Follow the derivation in equations (13.50) through (13.52) to prove that Δy = β2 + β3 x1i . Δx2 13.16 Return to equations (13.59) and (13.60). (a) Test whether or not the three slopes in equation (13.59) are statistically significant. What do these tests indicate about the effect of being a black woman on expected earnings? (b) Test whether or not the three slopes in equation (13.60) are statistically significant. What do these tests indicate about the effect of being a black woman on expected earnings? (c) For the regression of equation (13.60), the F-statistic for the test of the joint null hypothesis H 0 : β2 + β3 = 0 is .08. The degrees of freedom are 1 and 179,545. Using table A.3, interpret the results of this test. What does this test indicate about the effect of being a black woman on expected earnings? (d) In equation (12.30), note 10 of chapter 12, and exercise 12.12, we assert that any F-test with a single degree of freedom in the numerator can be reformulated as a t-test. Consider the population relationship yi = α + β1 x1i + β2 x2 i + β3 x3i + ε i , where x1i is a dummy variable identifying women, x2i is a dummy variable identifying black men, and x3i is a dummy variable identifying black women. Compare the expected values of earnings for nonblack men, nonblack women, black men, and black women to those of equations (13.53) through (13.56). Explain why this specification is equivalent to that of equation (13.48), where x1i is a dummy variable identifying women and x2i is a dummy variable identifying blacks. (e) The sample regression that corresponds to the population relationship of part d, calculated with the sample of equation (13.60), is
( ) ( (153.4) ( 218.7) (620.2 ) − 167.2 ( black female) + error. (610.3)
earnings = 40, 784 − 19, 581 female − 14, 520 black male
)
Making Regression More Flexible
537
How do the effects for females, black males, and black females compare in the two regressions? Are they statistically significant? Interpret them. In this equation, what is the statistical test for the null hypothesis that expected earnings are the same for black and nonblack females? What is the outcome of this test? (f ) In both equations (13.59) and (13.60), women and black men have large negative slopes. This suggests the null hypothesis that their expected earnings might differ from those of males by the same amount, H 0 : β1 = β2. The F-statistic for the test of this null hypothesis in equation (13.59) is .55 with 1 and 996 degrees of freedom. For equation (13.60), it is 66.46 with 1 and 179,545 degrees of freedom. What can we conclude regarding this hypothesis? (g) The F-tests of part f have only 1 degree of freedom in the numerator. How would we specify a regression in order to test the null hypothesis with a t-statistic? 13.17 Consider the sample regression of equation (13.58), where x1i is a dummy variable and x2i is a continuous variable: yi = a + b1 x1i + b2 x2 i + b3 x1i x2 i + ei . Imagine that x1i identifies women. Section 13.2 explains why we don’t want to add a dummy variable identifying men, say x3i, to this regression. However, how do we know whether or not the effect of x2i is different for men than for women? Is there a reason why we shouldn’t add an interaction term between x2i and x3i to this regression, so that it looks like yi = a + b1 x1i + b2 x2 i + b3 x1i x2 i + b4 x3i x2 i + ei ? (a) Recall that x1i = 0 when x3i = 1 and x1i = 1 when x3i = 0. Prove that, for each observation, x2 i = x1i x2 i + x3i x2 i . (b) Consider the following auxiliary regression: x2 i = a + bx x x1i x2 i + bx x x3i x2 i + errori . 2 1
2 3
Prove that, if a = 0, bx x = 1, and bx x = 1 , this regression would fit per2 1 2 3 fectly: All errors would be equal to zero. (c) Explain, intuitively, why the regression of chapter 11 would choose the values a = 0, bx x = 1 , and bx x = 1 if we were to actually calculate this 2 3 2 1 regression.
538
Chapter 13
(d) Recall our discussion in section 11.4. There, we demonstrate that the multivariate regression uses only the part of each explanatory variable that is not associated with any of the other explanatory variables. Based on the answer to part c, explain why x2i has no parts that are not related to x1i x2i and x3i x2i. (e) Based on the answer to part d, explain why, intuitively, the regression with which this question begins cannot be calculated. (f ) Return to table 1.4. The regression there contains interaction terms for age and women and for age and men. It can be calculated because it omits a crucial variable. What is this variable? (g) This analysis shows that there are two specifications that can be calculated. Regression can estimate either the general effect of x2i and the difference between the effects of x2i when x1i = 0 and x1i = 1, as in section 13.5, or the absolute effects of x2i when x1i = 0 and x1i = 1, as in table 1.4. However, it cannot estimate a general effect and separate absolute effects for each of the cases x1i = 0 and x1i = 1. Explain, intuitively, why. 13.18 The regression of equation (13.61), omitting the interaction between the dummy variable identifying women and the continuous variable for years of schooling, is earnings = −8, 506 − 17, 644 ( female ) + 3, 801 ( years of schooling ) .
(4, 630) (2, 495)
(341.1)
Explain, with reference to section 11.2, why the addition of the interaction term in equation (13.61) changes these results. 13.19 The intercept and slopes in equation (13.62) appear to differ substantially in magnitude from those in equation (13.63). Is this because the two regressions are contradictory or because some estimates are not very precise? Form the confidence intervals around a, b 1, and b 2 of equation (13.62). Do they include the corresponding values from equation (13.63)? What is the best explanation for the discrepancies between the two equations?
NOTES 1. We prove this in exercise 13.2. 2. We can also fix the dummy variable trap by using dummy variables for each available category but dropping the intercept. With this specification, the effects associated with each dummy variable are absolute, rather than relative
to any missing category. In some contexts, this may be a more useful interpretive context. For more discussion, see Greene (2003, 118) or Johnston and DiNardo (1997, 137). 3. We’ve already seen the square of an explanatory variable in the auxiliary regressions for the White test of heteroscedasticity, equations (8.14) and (12.44). However, the reason for its inclusion in these equations, which we discuss prior to equation (8.14), is different from the purpose here. 4. Those of us who are comfortable with calculus will recognize this as a restatement of the derivative dyi = β1 + 2 β2 xi . dxi 5. Again, from the perspective of calculus, we have simply set the first derivative equal to zero in order to find the extreme values. The second derivative is d 2 yi dxi2
= 2 β2 .
If, as in the example in the text, β2 < 0, this second derivative is negative and the value for xi that sets the first derivative equal to zero is a maximum. 6. In exercise 13.4, we confirm that these slopes are statistically significant. 7. Formally, the second derivative in note 5 is now positive, because b 2 > 0. This proves that, in this case, the solution to equation (13.22) is a minimum. 8. We re-examine the issues in this regression in section 14.4. 9. In this sense, e is just like π, another universal symbol for another irrational number, approximately equal to 3.14159. 10. At larger values for Δx, the approximation gets terrible. For example, ln(e) = 1. But the approximation would give ln(e) = ln(2.71828) = ln(1 + 1.71828) ≈ 1.71828. This is an error of more than 70%! 11. This was introduced in the pioneering work of Jacob Mincer, summarized briefly in Card (1999). 12. The sample for this regression contains only the 788 observations with positive earnings. The 212 individuals with zero earnings must be dropped because the natural logarithm of zero is undefined. Perhaps surprisingly, this estimate is within the range produced by the much more sophisticated studies reviewed by Card (1999). Exercise 13.11 gives us an opportunity to interpret another log-linear specification.
13. Note 6 of chapter 12 declares that the ratio of a marginal change to an average is also an elasticity. Here’s why both claims are true: elasticity =
(marginal change)/(average change) = (Δy/Δx)/(y/x) = (Δy/y)/(Δx/x) = (relative change in y)/(relative change in x).
14. Exercise 13.13 gives us another opportunity to interpret a regression of the form in equation (13.40). 15. We examine this particular specification in exercise 13.17. 16. Exercise 13.15 demonstrates that, similarly, the effect of a change in x2i in equation (13.49) depends on the level of x1i. 17. We build the foundation for this prediction in exercise 7.13. 18. In exercise 13.17, we demonstrate that it would be a mistake to add separate interaction terms for observations with x1i = 0 and x1i = 1 in this specification. 19. We consider the reasons for this change in exercise 13.18. 20. Exercise 13.19 examines the magnitudes of these slopes. 21. We’ve already seen variables expressed as reciprocals in the weighted least squares (WLS) specification of equation (8.16). However, the reciprocal there addresses statistical rather than behavioral concerns.
CHAPTER 14

MORE THAN TWO EXPLANATORY VARIABLES
14.0 What We Need to Know When We Finish This Chapter 541
14.1 Introduction 545
14.2 Can We Have More Than Two Explanatory Variables? 545
14.3 Inference in Multivariate Regression 551
14.4 Let’s See Some Examples 557
14.5 Assumptions about the Disturbances 568
14.6 Panel Data 569
14.7 Conclusion 573
Exercises 573
14.0
What We Need to Know When We Finish This Chapter

The addition of a second explanatory variable in chapter 11 adds only four new things to what there is to know about regression. First, regression uses only the parts of each variable that are unrelated to all of the other variables. Second, omitting a variable from the sample relationship that appears in the population relationship almost surely biases our estimates. Third, including an irrelevant variable does not bias estimates but reduces their precision. Fourth, the number of interesting joint tests increases with the number of slopes. All four remain valid when we add additional explanatory variables.
1. Equations (14.11), (14.1), and (14.2), section 14.2: The general form of the multivariate population relationship is

yi = α + Σ_{l=1}^{k} βl xli + εi.

The corresponding sample relationship is

yi = a + Σ_{l=1}^{k} bl xli + ei.

The predicted value of yi is

ŷi = a + Σ_{l=1}^{k} bl xli.

2. Equations (14.3) and (14.4), section 14.2: When we minimize the sum of squared errors in the multivariate regression, the errors sum to zero,

0 = Σ_{i=1}^{n} ei,

and are uncorrelated in the sample with all explanatory variables,

0 = Σ_{i=1}^{n} ei xpi = COV(ei, xpi)   for all p = 1, . . . , k.
3. Equations (14.5) and (14.8), section 14.2: The intercept in the multivariate regression is

a = ȳ − Σ_{l=1}^{k} bl x̄l.

The slopes are

bp = [Σ_{i=1}^{n} (e(xp, x1…xp−1xp+1…xk)i − ē(xp, x1…xp−1xp+1…xk)) (e(y, x1…xp−1xp+1…xk)i − ē(y, x1…xp−1xp+1…xk))] / [Σ_{i=1}^{n} (e(xp, x1…xp−1xp+1…xk)i − ē(xp, x1…xp−1xp+1…xk))²].
4. Equations (14.9) and (14.17), section 14.2: R² is

R² = COV(yi, ŷi).

The adjusted R² is

adjusted R² = 1 − s²/V(yi).
5. Section 14.2: Regression is not limited to two explanatory variables. However, the number of observations must exceed the number of estimators, and each explanatory variable must have some part that is not related to all of the other explanatory variables in order to calculate meaningful regression estimators. 6. Equations (14.12) and (14.14), section 14.2: If the regression is specified correctly, estimators are unbiased:
( )
()
E a = α and E b p = β p for all p = 1, . . . , k .
If the regression omits explanatory variables, estimators are biased:
( )
E bp
k β m = βp + m= k − q
∑
n
∑(
ex
i =1
, x …x p−1 x p+1…xk −q−1 ) ( m , x1…x p−1 x p+1…xk − q−1 )i p 1 . n 2 e x , x …x x …x i ( p 1 p−1 p+1 k −q−1 ) i =1 e i x
∑
7. Equations (14.16), (14.19), and (14.20), section 14.3: The estimator of σ² is

s² = Σ_{i=1}^{n} ei² / (n − k − 1).

With this estimator, the standardized value of bp is a t random variable with n − k − 1 degrees of freedom:

(bp − βp) / sqrt[ s² / Σ_{i=1}^{n} e²(xp, x1…xp−1xp+1…xk)i ] ~ t(n−k−1).
If n is sufficiently large, this can be approximated as a standard normal random variable. If the sample regression is correctly specified, bp is the best linear unbiased (BLU) estimator of βp.

8. Equations (14.21) and (14.22), section 14.3: The general form of the test between an unrestricted alternative hypothesis and a null hypothesis subject to j restrictions is

{ [ (Σ_{i=1}^{n} ei²)_R − (Σ_{i=1}^{n} ei²)_U ] / j } / { (Σ_{i=1}^{n} ei²)_U / (n − k − 1) } ~ F(j, n − k − 1).

For the null hypothesis that all coefficients are equal to zero, this reduces to

[ R²_U / k ] [ (n − k − 1) / (1 − R²_U) ] ~ F(k, n − k − 1).

9. Equation (14.26), section 14.3: The Chow test for differences in regimes is

{ [ (Σ_{i=1}^{n} ei²)_R − ( (Σ_{i=1}^{n} ei²)_{x1i=0} + (Σ_{i=1}^{n} ei²)_{x1i=1} ) ] / (k − 1) } / { [ (Σ_{i=1}^{n} ei²)_{x1i=0} + (Σ_{i=1}^{n} ei²)_{x1i=1} ] / (n − 2k) } ~ F(k − 1, n − 2k).
10. Section 14.6: If omitted explanatory variables are fixed for each entity in the sample, multiple observations on each entity may allow us to use panel data estimation techniques. These techniques can purge the effects of unobserved heterogeneity and yield unbiased estimators for the effects of included explanatory variables.
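Point 3 and equation (14.8) say that each multivariate slope can be recovered from a bivariate regression of "cleaned" residuals on "cleaned" residuals. A small simulated Python check; all data and coefficients here are hypothetical, chosen only for illustration.

```python
# A simulated check of point 3 / equation (14.8): the slope on x1 in the full
# regression equals the slope from regressing the part of y unrelated to x2 on
# the part of x1 unrelated to x2.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)              # x1 and x2 are correlated
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

def ols(y, X):
    """Least squares coefficients, with an intercept prepended."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_multivariate = ols(y, np.column_stack([x1, x2]))[1]          # slope on x1, full regression
e_x1 = x1 - np.column_stack([np.ones(n), x2]) @ ols(x1, x2)    # part of x1 unrelated to x2
e_y = y - np.column_stack([np.ones(n), x2]) @ ols(y, x2)       # part of y unrelated to x2
b_residuals = ols(e_y, e_x1)[1]                                # residuals-on-residuals slope

print(np.allclose(b_multivariate, b_residuals))                # True
```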
14.1 Introduction We’ve had more than two explanatory variables in a regression since the beginning of chapter 1. However, we haven’t been ready to understand how we can do this until now. This chapter formally closes the exploratory loop upon which we embarked when we began this book. Most of what we need to accomplish this is in the three preceding chapters. Most of what we do here is to assert that the addition of explanatory variables beyond the second changes very little of substance. For the most part, we justify these assertions with appeals to proofs in earlier chapters and to the intuition that we have developed since we began. We don’t provide the kinds of proofs to which we have become accustomed because the framework of ordinary algebra is too clumsy when the number of explanatory variables increases. For those of us who are interested in those proofs, they’re available in Johnston and DiNardo (1997) and Greene (2003). Those of us who are curious about why matrix algebra seems to be such a big deal ought to check these references to see the extraordinary economy of notation and generality of result that is possible with this tool.
14.2 Can We Have More Than Two Explanatory Variables? Apart from illustrating the interactive specification, equation (13.48) is the fifth time in this book that we’ve gotten ahead of ourselves by writing a population equation that we don’t yet know how to estimate. Two times don’t count for the present purpose: In chapters 8 and 9, we needed two explanatory variables in the auxiliary regression for White’s heteroscedasticity test and in the generalized least squares (GLS) estimation. However, we figured out how to calculate these regressions in chapter 11. Three instances are still unresolved. The last, of course, is equation (13.48). The first is the entire first chapter. There, we see regressions with between three and nine explanatory variables. The second instance is equation (12.44), which tells us that the White heteroscedasticity test for the regression with two
explanatory variables requires an auxiliary regression with five explanatory variables. In all of these cases, we find that we are able to extract information from our sample much more effectively if we can add additional explanatory variables beyond the two of chapters 11 and 12. By the time we get to equation (13.48), this idea seems so natural that we simply slip in a third explanatory variable without even remarking on the complications that this might entail. There are two reasons for this. First, the addition of a third explanatory variable, or additional explanatory variables beyond that, does not change anything essential in the understanding that we have acquired of regression with two explanatory variables. The instincts that we have developed in chapters 11 and 12 generally serve us well no matter how many explanatory variables we have to deal with. The second reason is that it would be very difficult to analyze regressions that contain more than two explanatory variables with the mathematical tools to which we have restricted ourselves. Just imagine the sample regression with a third explanatory variable, x3i: yi = a + b1 x1i + b2 x2 i + b3 x3i + ei .
This regression has four estimators: a, b 1, b 2, and b 3. That means four unknowns, four derivatives of the sum of squared errors, and four normal equations. It is certainly possible to solve a system of four equations in four unknowns with ordinary algebra, but the effort probably exceeds even our appetite for it. Instead, we’ll leave the formal analysis for others. Here, we’ll summarize the essential results and intuitions. First, we work through the algebra of multivariate regression. The general multivariate sample regression is k
yi = a +
∑b x
l li
+ ei .
(14.1)
l =1
If we set k = 1, we get equation (4.4). With k = 2, we get equation (11.12). Regardless of the value of k, the predicted value of yi is k
yˆi = a +
∑b x . l li
(14.2)
l =1
Equations (4.1) and (11.13) are the special cases where k = 1 and k = 2, respectively. In the first, the predicted values of yi together constitute a line. In the second, they constitute a plane. When k > 2, the collection of all values of yˆi in the sample is called a hyperplane.1
When we square the errors in equation (14.1) and minimize their sum, we obtain k + 1 normal equations, one for the constant and k for the k slopes. The first tells us, to no one’s surprise, that the sum of the errors from the bestfitting line, and therefore their average, both equal zero: n
0=
∑e .
(14.3)
i
i =1
We prove this for the bivariate regression in equations (4.19) and (4.20). We do the same for the regression with two explanatory variables in equation (11.19). The normal equations for the slopes tell us that the errors from the best-fitting line are uncorrelated, in the sample, with all of the explanatory variables: n
0=
∑e x
i pi
(
= COV ei , x pi
i =1
)
for all p = 1, . . . , k .
(14.4)
We prove this for the regression with one explanatory variable in equations (4.21) through (4.28). We replicate this proof for the regression with two explanatory variables in equations (11.23), (11.24), (11.26), and (11.27). The solutions to the k + 1 normal equations associated with equation (14.1) reiterate two essential results. First, the constant is equal to the difference between the average value of yi and the sum of the average values of all of the explanatory variables, multiplied by their slopes. Extending equations (4.35) and (11.28), we have k
a= y−
∑b x . l l
(14.5)
l =1
Here, n
xp =
∑x i =1
n
pi
for all p = 1, . . . , k .
Once again, this implies that regression “goes through the means”: If we insert the average values for the dependent and all of the independent variables into equation (14.1), the error will be zero. Second, the slope for any explanatory variable xpi is equal to what we would get if we were to take the following three steps. In the first step, we extend the auxiliary regression of equation (11.53), in the form of equation (14.1).
The dependent variable is xpi and the independent variables are all of the other explanatory variables of equation (14.1): p −1
x pi = c +
∑
k
dl xli +
l =1
∑dx
l li
+e x
(
l = p+1
p , x1 … x p−1 x p+1 … xk
)i .
(14.6)
The subscript to the error term extends the notation that we develop after equation (11.2) as far as we can possibly expect it to take us. In the second step, we extend the auxiliary regression of equation (11.59), again in the form of equation (14.1). We use the same independent variables as in equation (14.6), but now yi is the dependent variable: p −1
yi = g +
∑
k
hl xli +
l =1
∑ hx
l li
l = p+1
+e
( y, x1…x p−1 x p+1…xk )i .
(14.7)
In the third and final step, we use the errors from equation (14.7) as the dependent variable and the errors from equation (14.6) as the independent variable in a bivariate regression as in chapter 4. In analogy to equations (11.64) and (11.65), this gives our slope as n
bp =
∑e( i =1
) e( y, x1…x p−1 x p+1…xk )i
x p , x1…x p −1 x p+1…xk i n
∑e( i =1
2 x p , x1…x p −1 x p+1…xk i
)
n
=
∑ e( i =1
) − e( x p , x1…x p−1 x p+1…xk ) e( y, x1…x p−1 x p+1…xk )i − e( y, x1…x p−1 x p+1…xk )
x p , x1…x p −1 x p+1…xk i n
∑ i =1
2 e x ,xx …x x …x i − e x , x …x x …x ( p 1 p−1 p+1 k ) ( p 1 p−1 p+1 k )
.
(14.8)
The result in equation (14.5) is familiar to us from chapter 4. In contrast, the result in equation (14.8) is new to chapter 11. In fact, the idea that the multivariate sample regression uses only the part of each variable that is unrelated to all of the other variables is the only important elaboration that chapter 11 makes on the algebra of chapter 4. It is so natural and intuitive that it is a great satisfaction to see it restated here. The formula for R 2 in equation (12.24) or in the first equality of equation (4.57) remains valid for the multivariate regression of equation (14.1).
Of equal importance, the proof that
(
R 2 = COV yi , yˆi
)
(14.9)
in exercise 12.10 is easily extended to the general multivariate case, as we see in exercise 14.1. However, the adjusted R 2 must be rewritten to recognize the new number of explanatory variables: n ei2 i =1 n − k −1
∑
adjusted R 2 = 1 −
n yi − y i =1 n −1
∑(
)
2
n
∑e
2 i
=1−
i =1
n
∑( y − y )
n −1 . 2 n − k − 1
(14.10)
i
i =1
With this general formulation, it is apparent that the difference between R 2 and the adjusted R 2 increases as the sample size declines or as the number of explanatory variables increases. This expresses the concern that regression analysis is being overworked when it is asked to extract too many estimators from too few pieces of information. In sum, all of the sample relationships that we have come to expect from the sample regression calculation reappear in the general multivariate model with only the most minor modifications, if any. The two principal cautions reappear as well. First, as we discuss in exercise 12.11 for the case in which k = 2, the number of observations in the sample must exceed the number of sample statistics in equation (14.1), n > k + 1. If n = k + 1, the regression fits perfectly, but the standard errors of the intercept and slopes are infinite. If n < k + 1, the regression formulas do not yield unique values for the estimators. Second, the expression to the right of the first equality in equation (14.8) reminds us that, in order to estimate bp, there has to be something in xpi that is not related to the other explanatory variables. Otherwise, the errors from equation (14.6) will all be zero, as will be their squares and the sum of their squares in the denominator of the expression for bp. We’ve already seen two examples of this problem in the previous chapter, the dummy variable trap in section 13.2 and the problem that arises if we enter a continuous variable and its interactions with a complete set of dummy variables in exercise 13.17.
The relationships between our sample statistics and our population parameters with which we have become familiar also persist when we add additional explanatory variables. The general multivariate population relationship is k
∑β x + ε .
yi = α +
l li
(14.11)
i
l =1
Equation (14.11), with k = 1, gives us our first population relationship of equation (5.1). With k = 2, it gives us the population relationship of equation (11.1). The intercept and slopes of equation (14.1) are unbiased estimators of the corresponding constant and coefficients in the population relationship of equation (14.11):
( )
()
E a = α and E b p = β p for all p = 1, . . . , k .
(14.12)
However, if the sample regression of equation (14.1) omits any explanatory variable that is in the population relationship of equation (14.11), all slope estimates are potentially biased. The potential bias is analogous to that of equation (11.6). If xki is the omitted variable, then the expected value of the slope for any other explanatory variable, xpi, is n
( )
E b p = β p + βk
∑ e( i =1
) e( xk , x1…x p−1 x p+1…xk −1 )i
x p , x1… x p−1 x p+1… xk −1 i
.
n
∑( i =1
(14.13)
e 2x , x …x x …x i p 1 p−1 p+1 k −1
)
The term n
∑ e( i =1
) e( xk , x1…x p−1 x p+1…xk −1 )i
x p , x1…x p−1 x p+1…xk −1 i n
∑ e( i =1
2 x p , x1 …x p −1 x p+1 …xk −1 i
)
is, of course, the slope of xpi in the auxiliary regression of xki on all explanatory variables included in the regression for yi. Equation (14.13) tells us that the expected value of bp incorporates the true effect of xpi on yi, βp, and the true effect of xki on yi, βk, to the extent that xpi looks like xki. The bias in equation (14.13) is compounded if equation (14.1) omits additional explanatory variables that appear in the population relationship. For
For example, if both xk−1,i and xki are omitted, the expected value of bp is

\[
E(b_p) = \beta_p
+ \beta_{k-1}\,
\frac{\sum_{i=1}^{n} e_{(x_p,\,x_1 \ldots x_{p-1} x_{p+1} \ldots x_{k-2})i}\, e_{(x_{k-1},\,x_1 \ldots x_{p-1} x_{p+1} \ldots x_{k-2})i}}
{\sum_{i=1}^{n} e^2_{(x_p,\,x_1 \ldots x_{p-1} x_{p+1} \ldots x_{k-2})i}}
+ \beta_{k}\,
\frac{\sum_{i=1}^{n} e_{(x_p,\,x_1 \ldots x_{p-1} x_{p+1} \ldots x_{k-2})i}\, e_{(x_k,\,x_1 \ldots x_{p-1} x_{p+1} \ldots x_{k-2})i}}
{\sum_{i=1}^{n} e^2_{(x_p,\,x_1 \ldots x_{p-1} x_{p+1} \ldots x_{k-2})i}}.
\]
The slope for xpi picks up some of the true effects of both xk−1,i and xki, to the extent that either has parts with which only it is associated. In general, if all explanatory variables from xk−q,i through xki are omitted, each of their effects appears in the expected values of all included variables, to the extent that each of the included variables picks up a unique part of the excluded variable:
\[
E(b_p) = \beta_p + \sum_{m=k-q}^{k} \beta_m\,
\frac{\sum_{i=1}^{n} e_{(x_p,\,x_1 \ldots x_{p-1} x_{p+1} \ldots x_{k-q-1})i}\, e_{(x_m,\,x_1 \ldots x_{p-1} x_{p+1} \ldots x_{k-q-1})i}}
{\sum_{i=1}^{n} e^2_{(x_p,\,x_1 \ldots x_{p-1} x_{p+1} \ldots x_{k-q-1})i}}. \tag{14.14}
\]
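The following small simulation, again on hypothetical data of our own, illustrates the sample counterpart of equation (14.13): the slope from a regression that omits a relevant variable equals the slope from the full regression plus the full regression's slope on the omitted variable times the slope from the auxiliary regression of the omitted variable on the included one.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Hypothetical data: x2 is correlated with x1 and will be omitted from the short regression.
x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)   # true slopes: 2 on x1, 3 on x2

def ols(y, X):
    """Least-squares coefficients from a regression that includes an intercept."""
    Z = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(Z, y, rcond=None)[0]

b_short = ols(y, x1)                          # slope on x1 when x2 is omitted
b_long = ols(y, np.column_stack([x1, x2]))    # slopes on x1 and x2 when both are included

# Auxiliary regression of the omitted variable on the included one; its slope is the
# ratio of residual cross-products that appears in equation (14.13).
aux_slope = ols(x2, x1)[1]

print("short-regression slope on x1:            ", b_short[1])
print("long slope + long x2 slope * aux slope:  ", b_long[1] + b_long[2] * aux_slope)
```

The two printed numbers agree, and both sit well above the true value of 2 because x1 picks up part of the effect of the omitted x2.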
14.3 Inference in Multivariate Regression

The population variances of the slopes in the multivariate regression of equation (14.1) are identical in form to the variances that we derive in equation (12.11):

\[
V(b_p) = \frac{\sigma^2}{\sum_{i=1}^{n} e^2_{(x_p,\,x_1 \ldots x_{p-1} x_{p+1} \ldots x_k)i}}. \tag{14.15}
\]
The numerator is the population variance of the disturbances, the εi’s. The denominator is the sum of the squared part of xpi that is not related to any of the other explanatory variables.
The only difference between equation (12.11) and equation (14.15) is that, in the latter, the number of other explanatory variables is greater, the sum of the squared part of xpi that is not related to any of the other explanatory variables is presumably smaller, and V(bp) is presumably larger. Nevertheless, these other explanatory variables must appear in the sample regression of equation (14.1) in order to avoid left-out-variable error (LOVE). Therefore, the Gauss-Markov theorem proves that these standard deviations are the smallest possible among unbiased linear estimators of α and βp: the estimators a and bp of equations (14.5) and (14.8) are BLU. The estimates of the predicted values from equation (14.2) are also BLU.

Our estimates remain unbiased even if we include one or more irrelevant variables. That is, equation (14.12) holds even if equation (14.1) contains explanatory variables that are not in equation (14.11). However, their precision declines. With unnecessary explanatory variables in the auxiliary regression of equation (14.6), the residuals will almost surely be smaller than they need be. The same will be true of their summed squares, which appear in the denominator of equation (14.15). Consequently, V(bp) in equation (14.15) will be larger than necessary. Not surprisingly, in this situation bp and a are not BLU.

With k explanatory variables, we now extract k slopes as well as an intercept from our sample. Therefore, our degrees of freedom are n − (k + 1) = n − k − 1. Consequently, our estimate of σ2 is now

\[
s^2 = \frac{\sum_{i=1}^{n} e_i^2}{n-k-1}. \tag{14.16}
\]
Parenthetically, we recognize this as the numerator in the second term to the right of the first equality in equation (14.10). The denominator of this same term we’ve known to be the sample variance of yi, as far back as chapter 3. Making these replacements, we can define the adjusted R2 more compactly as

\[
\text{adjusted } R^2 = 1 - \frac{\sum_{i=1}^{n} e_i^2 \,/\, (n-k-1)}{\sum_{i=1}^{n} (y_i - \bar{y})^2 \,/\, (n-1)}
= 1 - \frac{s^2}{V(y_i)}. \tag{14.17}
\]
As in equations (7.5) and (12.16), our sample estimates of the standard deviations of the slopes in equation (14.15) replace the population parameter σ2 with the estimator s2 from equation (14.16):

\[
SD(b_p) = +\sqrt{\frac{s^2}{\sum_{i=1}^{n} e^2_{(x_p,\,x_1 \ldots x_{p-1} x_{p+1} \ldots x_k)i}}}
= +s\sqrt{\frac{1}{\sum_{i=1}^{n} e^2_{(x_p,\,x_1 \ldots x_{p-1} x_{p+1} \ldots x_k)i}}}. \tag{14.18}
\]
As we discuss in sections 6.2 and 7.2, if the sample is small but we assume that the disturbances are normally distributed, then we can treat the standardized form of the slopes from equation (14.8) as t random variables:

\[
\frac{b_p - \beta_p}{\sqrt{\dfrac{s^2}{\sum_{i=1}^{n} e^2_{(x_p,\,x_1 \ldots x_{p-1} x_{p+1} \ldots x_k)i}}}} \sim t_{(n-k-1)}. \tag{14.19}
\]
If n is large, then we can count on a central limit theorem to confirm that the slopes are normal random variables when standardized by their population standard errors. Also, we can be confident that the true t-distribution for the standardized estimators is essentially indistinguishable from the standard normal distribution. Therefore, we can treat the standardized form of the slopes from equation (14.8) as standard normal random variables:2

\[
\frac{b_p - \beta_p}{\sqrt{\dfrac{s^2}{\sum_{i=1}^{n} e^2_{(x_p,\,x_1 \ldots x_{p-1} x_{p+1} \ldots x_k)i}}}} \sim N(0,\,1). \tag{14.20}
\]
Of course, as we discuss in sections 6.2 and 7.2, the only thing that depends on whether we assert equation (14.19) or (14.20) is the table from which we take our critical value. Regardless of whether we assume that the disturbances are normally distributed, that the sample is large, or both, we can use the standard deviations in equation (14.18) to form confidence intervals and hypothesis tests for individual parameters, exactly as we do in chapters 7 and 12. As the number of explanatory variables increases, the number of possible joint hypotheses increases much more rapidly. However, tests of these hypotheses still reduce, almost invariably, to comparisons of unrestricted and restricted regression specifications. We usually make this comparison with the
generalization of equation (12.29) to the case with k explanatory variables:

\[
\frac{\left[\left(\sum_{i=1}^{n} e_i^2\right)_R - \left(\sum_{i=1}^{n} e_i^2\right)_U\right]\Big/\,j}
{\left(\sum_{i=1}^{n} e_i^2\right)_U\Big/\,(n-k-1)}
\sim F(j,\, n-k-1). \tag{14.21}
\]
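As a quick illustration of equation (14.21), the sketch below, with hypothetical data and restrictions of our own choosing, compares an unrestricted regression on all k explanatory variables with a restricted one that drops the last two, which imposes j = 2 restrictions of the form β3 = β4 = 0.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, k = 150, 4
X = rng.normal(size=(n, k))
y = 1.0 + X @ np.array([2.0, -1.0, 0.0, 0.0]) + rng.normal(size=n)

def sse(y, X):
    """Sum of squared errors from a least-squares fit with an intercept."""
    Z = np.column_stack([np.ones(len(y)), X])
    b = np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.sum((y - Z @ b) ** 2)

# Unrestricted regression uses all k columns; the restricted regression drops the
# last j columns, imposing the j restrictions of the null hypothesis.
j = 2
sse_u = sse(y, X)
sse_r = sse(y, X[:, :k - j])

# Equation (14.21).
f_stat = ((sse_r - sse_u) / j) / (sse_u / (n - k - 1))
p_value = stats.f.sf(f_stat, j, n - k - 1)
print(f"F({j}, {n - k - 1}) = {f_stat:.2f}, p-value = {p_value:.4f}")
```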
These hypotheses can take on many specific forms, depending on the context. Nevertheless, two joint null hypotheses appear regularly. The most common is that of equation (12.31), the joint null hypothesis that all coefficients are simultaneously equal to zero: H0: β1 = 0, . . . , βk = 0. With k explanatory variables, the test statistic of equation (12.35) becomes

\[
\frac{R_U^2}{k}\cdot\frac{n-k-1}{1-R_U^2} \sim F(k,\, n-k-1). \tag{14.22}
\]
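Here is a short sketch of the test in equation (14.22), again on simulated data of our own; the F statistic needs only the unrestricted R2, k, and n.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, k = 120, 4
X = rng.normal(size=(n, k))
y = 0.5 + X @ np.array([1.0, 0.0, -0.5, 0.2]) + rng.normal(size=n)

Z = np.column_stack([np.ones(n), X])
b = np.linalg.lstsq(Z, y, rcond=None)[0]
e = y - Z @ b
r2_u = 1 - np.sum(e**2) / np.sum((y - y.mean())**2)

# Equation (14.22): test of H0: beta_1 = ... = beta_k = 0.
f_stat = (r2_u / k) * ((n - k - 1) / (1 - r2_u))
p_value = stats.f.sf(f_stat, k, n - k - 1)

print(f"F({k}, {n - k - 1}) = {f_stat:.2f}, p-value = {p_value:.4f}")
```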
The next most common joint hypothesis test is that of differences in regimes, often referred to as the Chow test.3 The null hypothesis here is actually something that we’ve generally taken to be fundamental, that the sample comes from a single population. In chapter 8, for example, we conclude that there is no point in running a regression if we can’t assume that the disturbances share the same expected value. However, chapter 8 also has some moments where we’ve relaxed this a bit. For example, in section 8.9, we analyze the situation in which the sample comes from two subpopulations that share the same constant and coefficients, but differ in their disturbance variances. Here, we’re interested in a situation that is almost exactly the reverse: the possibility that the sample comes from two subpopulations in which the disturbances have the same expected values and perhaps variances, but the constants and coefficients are different. This situation is an expansion of the example in table 1.4, which we revisit in section 14.4. In that table, we entertain the possibility that the coefficient on age is different for men and women.4 In the structure that we’re developing here, we can consider the possibility that the coefficients on all of the explanatory variables in that table differ by sex. More generally, we are allowing for
the possibility that the effects of all explanatory variables differ, depending on which of two categories is the source of an observation. This is why we often speak of differences in “regimes.”

The null hypothesis is that there is only a single regime, meaning a single equation that determines yi, equation (14.11). This hypothesis asserts that each explanatory variable has the same effect on yi in both of the subpopulations. Typically, the equation for the null hypothesis distinguishes between subpopulations only by including the dummy variable that identifies them among the explanatory variables. This treatment asserts that the difference between values of yi for all pairs of otherwise identical observations from different subpopulations is fixed at the coefficient for this dummy variable.

The alternative hypothesis is that there are two subpopulations, each described by its own regime. This hypothesis specifies that the same explanatory variables affect yi in both of the subpopulations, but the magnitudes of these effects are different. It asserts that values of yi do not differ by a single fixed number depending on the subpopulation to which they belong. Instead, the effects of all of the explanatory variables can differ.

Assume that the dummy variable that distinguishes between subpopulations is x1i. In the example of table 1.4, this dummy variable identifies women. Under the alternative hypothesis, the population relationship for the first subpopulation, where x1i = 0, is

\[
y_i = \delta_0 + \sum_{l=2}^{k} \delta_l x_{li} + \varepsilon_i. \tag{14.23}
\]
Equation (14.23) does not contain x1i. That’s because it has the same value, zero, for everyone in this subpopulation. The population relationship for the second subpopulation, where x1i = 1, is

\[
y_i = \mu_0 + \sum_{l=2}^{k} \mu_l x_{li} + \varepsilon_i. \tag{14.24}
\]
Once again, x1i has the same value for each member of this subpopulation, so it doesn’t appear in equation (14.24), either. However, there is still the possibility that this subpopulation has a different constant than does the other, because µ0 can differ from δ0.

At this point, the outlines of a test in the form of equation (14.21) are beginning to emerge. Clearly, the combination of equations (14.23) and (14.24) allows for a lot more possibilities than the one-regime population relationship of equation (14.11). So equations (14.23) and (14.24) are beginning to look a lot like an unrestricted model. Equation (14.11) is beginning to make sense as a restricted model.
Let’s see if we can fill out the required items in equation (14.21). The restricted sum of squared errors, $(\sum_{i=1}^{n} e_i^2)_R$, is easy. It’s just the sum of squared errors from the sample counterpart of the population relationship in equation (14.11), equation (14.1). In this context, where we’re wondering if we should separate the observations for which x1i = 0 from those for which x1i = 1, the restricted regression in which we include both is called a pooled regression.

The number of restrictions, j, is also not hard to figure. If the null hypothesis is true, then the effects of each explanatory variable in equations (14.23) and (14.24) must be the same: δp = µp for all values of p from 2 to k. The null hypothesis is therefore H0: δ2 = µ2, δ3 = µ3, . . . , δk = µk. Because this hypothesis requires k − 1 pairs of coefficients to have the same values, it imposes k − 1 restrictions.

The remaining two pieces of equation (14.21) are a little harder to see. It seems as though equations (14.23) and (14.24), together, are the unrestricted model. But that appears to be two equations, not one. However, it turns out that, with some clever interactions from section 13.5, we can combine them into a single population relationship:
\[
y_i = \delta_0 (1 - x_{1i}) + \sum_{l=2}^{k} \delta_l (1 - x_{1i}) x_{li} + \mu_0 x_{1i} + \sum_{l=2}^{k} \mu_l x_{1i} x_{li} + \varepsilon_i. \tag{14.25}
\]
In this equation, εi is heteroscedastic if the disturbance variances for the two subpopulations are different. Exercise 14.3 confirms that, for observations with x1i = 0, equation (14.25) reduces to equation (14.23). For observations with x1i = 1, it becomes equation (14.24).

Now, at least, we can count the degrees of freedom in the unrestricted population relationship. The parameters in equation (14.25) consist of k different δ’s and k different µ’s. The total number of parameters is therefore 2k. Consequently, the degrees of freedom are n − 2k.

We could also obtain the unrestricted sum of squared errors, $(\sum_{i=1}^{n} e_i^2)_U$, by calculating the sample counterpart to equation (14.25). It turns out, though, that there’s an easier way. We can, instead, estimate $(\sum_{i=1}^{n} e_i^2)_{x_{1i}=0}$ from the sample counterpart of equation (14.23) and $(\sum_{i=1}^{n} e_i^2)_{x_{1i}=1}$ from the sample counterpart of equation (14.24). This is called stratifying the sample by x1i, or running the stratified regressions. If we add the sum of squared errors from the two stratified regressions, we get the unrestricted sum of squared errors that would have been calculated by the sample counterpart of the unrestricted regression in equation (14.25):

\[
\left(\sum_{i=1}^{n} e_i^2\right)_U = \left(\sum_{i=1}^{n} e_i^2\right)_{x_{1i}=0} + \left(\sum_{i=1}^{n} e_i^2\right)_{x_{1i}=1}.
\]
It may not be exactly obvious why this should be so. In exercise 14.4, we prove that it works in a simplified version of equation (14.25). With these ingredients, equation (14.21) gives the Chow test for the possibility of different regimes as

\[
\frac{\left[\left(\sum_{i=1}^{n} e_i^2\right)_R - \left(\left(\sum_{i=1}^{n} e_i^2\right)_{x_{1i}=0} + \left(\sum_{i=1}^{n} e_i^2\right)_{x_{1i}=1}\right)\right]\Big/\,(k-1)}
{\left[\left(\sum_{i=1}^{n} e_i^2\right)_{x_{1i}=0} + \left(\sum_{i=1}^{n} e_i^2\right)_{x_{1i}=1}\right]\Big/\,(n-2k)}
\sim F(k-1,\, n-2k). \tag{14.26}
\]
If the value of the test statistic exceeds the critical value for an F random variable with k − 1 degrees of freedom in the numerator and n − 2k degrees of freedom in the denominator, we reject the null hypothesis. We conclude, instead, that different coefficients apply to the explanatory variables when x1i = 0 than when x1i = 1.
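Here is a sketch of the Chow test in equation (14.26) on simulated data; the two regimes, their coefficients, and all variable names are hypothetical choices of ours. The restricted regression pools the sample and lets the dummy shift only the intercept, while the stratified regressions supply the unrestricted sum of squared errors.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, k = 400, 3                               # k counts the dummy x1 plus two other x's
x1 = (rng.random(n) < 0.5).astype(float)    # dummy that defines the two regimes
X_other = rng.normal(size=(n, k - 1))

# Hypothetical regimes with different intercepts and slopes.
beta0 = np.array([1.0, 2.0, -1.0])          # regime with x1 = 0: intercept, then slopes
beta1 = np.array([0.5, 3.0, -0.2])          # regime with x1 = 1
y = np.where(x1 == 0,
             beta0[0] + X_other @ beta0[1:],
             beta1[0] + X_other @ beta1[1:]) + rng.normal(size=n)

def sse(y, X):
    """Sum of squared errors from a least-squares fit with an intercept."""
    Z = np.column_stack([np.ones(len(y)), X])
    b = np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.sum((y - Z @ b) ** 2)

# Restricted (pooled) regression: the dummy enters only as an intercept shift.
sse_r = sse(y, np.column_stack([x1, X_other]))

# Stratified regressions give the unrestricted sum of squared errors.
m0, m1 = (x1 == 0), (x1 == 1)
sse_u = sse(y[m0], X_other[m0]) + sse(y[m1], X_other[m1])

# Chow test of equation (14.26): k - 1 restrictions, n - 2k degrees of freedom.
f_stat = ((sse_r - sse_u) / (k - 1)) / (sse_u / (n - 2 * k))
p_value = stats.f.sf(f_stat, k - 1, n - 2 * k)
print(f"Chow F({k - 1}, {n - 2 * k}) = {f_stat:.2f}, p-value = {p_value:.4f}")
```

Because the simulated regimes really do have different slopes, the statistic is large and the null hypothesis of a single regime is rejected; setting beta1 equal to beta0 in the sketch makes rejections occur only at the nominal rate of the test.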
14.4 Let’s See Some Examples

Over the course of this text, we’ve raised a number of empirical issues that we’ve left unresolved because appropriate treatment requires many explanatory variables. We return to some of them here, in roughly the order in which we first encountered them.

Let’s start with the issue of whether the determinants of earnings differ for men and women, which has been with us since table 1.4. Table 14.1 presents our original earnings regression of figure 1.1, stratified by sex. As we observe in our discussion of equations (14.23) through (14.25), the pooled regression of figure 1.1 allows for earnings to differ across sexes by a fixed amount, because it includes the dummy variable for sex. The stratified regressions of table 14.1 allow for this fixed difference because they have different intercepts. However, they also allow for the effects of all explanatory variables on earnings to differ by sex.
TABLE 14.1 Earnings regressions stratified by sex, sample from chapter 1

Explanatory variable                        Regression for men
Intercept                                   −34,233  (2.88)
Years of schooling                            4,515.1 (7.04)
Age                                             521.88 (2.96)
Black                                       −16,332  (1.94)
Hispanic                                     −4,880.2  (.88)
American Indian or Alaskan Native           −18,106   (.94)
Asian                                       −10,565  (1.43)
Native Hawaiian, other Pacific Islander     −36,584  (1.10)
R2                                               .1569
Adjusted R2                                      .1451
F-statistic                                    13.26
p-value for F-statistic
Sum of squared errors
Number of observations

NOTE: