Empirical Development Economics
Understanding why so many people across the world are so poor is one of the central intellectual challenges of our time. This book provides the tools and data that will enable students, researchers and professionals to address that issue. Empirical Development Economics has been designed as a hands-on teaching tool to investigate the causes of poverty. The book begins by introducing the quantitative approach to development economics. Each section uses data to illustrate key policy issues. Part One focuses on the basics of understanding the role of education, technology and institutions in determining why incomes differ so much across individuals and countries. In Part Two, the focus is on techniques to address a number of topics in development, including how firms invest, how households decide how much to spend on their children's education, whether microcredit helps the poor, whether food aid works, who gets private schooling and whether property rights enhance investment. A distinctive feature of the book is its presentation of a range of approaches to studying development questions. Development economics has undergone a major change in focus over the last decade with the rise of experimental methods to address development issues; this book shows how these methods relate to more traditional ones.

Måns Söderbom is Professor of Economics at the Department of Economics, School of Business, Economics and Law, University of Gothenburg, Sweden. Francis Teal is Research Associate, CSAE, University of Oxford, UK and Managing Editor, Oxford Economic Papers. Markus Eberhardt is Assistant Professor in Economics, School of Economics, University of Nottingham, UK. Simon Quinn is Associate Professor in Economics and Deputy Director of the Centre for the Study of African Economies, Department of Economics, University of Oxford, UK. Andrew Zeitlin is Assistant Professor at the McCourt School of Public Policy at Georgetown University, USA.
Routledge Advanced Texts in Economics and Finance
1. Financial Econometrics Peijie Wang 2. Macroeconomics for Developing Countries, second edition Raghbendra Jha 3. Advanced Mathematical Economics Rakesh Vohra 4. Advanced Econometric Theory John S. Chipman 5. Understanding Macroeconomic Theory John M. Barron, Bradley T. Ewing and Gerald J. Lynch 6. Regional Economics Roberta Capello 7. Mathematical Finance Core theory, problems and statistical algorithms Nikolai Dokuchaev 8. Applied Health Economics Andrew M. Jones, Nigel Rice, Teresa Bago d’Uva and Silvia Balia 9. Information Economics Urs Birchler and Monika Bütler 10. Financial Econometrics, second edition Peijie Wang 11. Development Finance Debates, dogmas and new directions Stephen Spratt
12. Culture and Economics On values, economics and international business Eelke de Jong 13. Modern Public Economics, second edition Raghbendra Jha 14. Introduction to Estimating Economic Models Atsushi Maki 15. Advanced Econometric Theory John Chipman 16. Behavioral Economics Edward Cartwright 17. Essentials of Advanced Macroeconomic Theory Ola Olsson 18. Behavioral Economics and Finance Michelle Baddeley 19. Applied Health Economics, second edition Andrew M. Jones, Nigel Rice, Teresa Bago d’Uva and Silvia Balia 20. Real Estate Economics A point to point handbook Nicholas G. Pirounakis 21. Finance in Asia Institutions, regulation and policy Qiao Liu, Paul Lejot and Douglas Arner
22. Behavioral Economics, second edition Edward Cartwright 23. Understanding Financial Risk Management Angelo Corelli
24. Empirical Development Economics Måns Söderbom and Francis Teal with Markus Eberhardt, Simon Quinn and Andrew Zeitlin
Empirical Development Economics Måns Söderbom and Francis Teal with Markus Eberhardt, Simon Quinn and Andrew Zeitlin
First published 2015 by Routledge, 2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN and by Routledge, 711 Third Avenue, New York, NY 10017. Routledge is an imprint of the Taylor & Francis Group, an informa business.
© 2015 selection and editorial material, Måns Söderbom, Francis Teal, Markus Eberhardt, Simon Quinn and Andrew Zeitlin; individual chapters, the contributors.
The right of the editors to be identified as the authors of the editorial material, and of the authors for their individual chapters, has been asserted in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers.
Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.
British Library Cataloguing in Publication Data: A catalogue record for this book is available from the British Library.
Library of Congress Cataloguing in Publication Data: Empirical development economics / Måns Söderbom, Francis Teal, Markus Eberhardt, Simon Quinn and Andrew Zeitlin. pages cm. – (Routledge advanced texts in economics and finance) Includes bibliographical references and index. 1. Development economics. 2. Poverty. 3. Income distribution. 4. Econometrics. I. Title. HD82.T38 2014 338.9–dc23 2014012241
ISBN: 978-0-415-81048-7 (hbk)
ISBN: 978-0-415-81049-4 (pbk)
ISBN: 978-0-203-07092-5 (ebk)
Typeset in Times New Roman by Out of House Publishing
Contents
List of figures xviii
List of tables xx
Notes on authors xxiii
Preface xxv
How to use this book xxvii
PART 1 Linking models to data for development 1

1 An introduction to empirical development economics 3
1.1 The objective of the book 3
1.2 Models and data: the Harris–Todaro model 4
1.3 Production functions and functional form 6
1.3.1 The Cobb–Douglas production function 6
1.3.2 The constant elasticity of substitution (CES) functional form 10
1.4 A model with human capital 11
1.5 Data and models 13
1.5.1 The macro GDP data 13
1.5.2 Interpreting the data 14
References 14
Exercise 15
SECTION I Cross-section data and the determinants of incomes 17

2 The linear regression model and the OLS estimator 19
2.1 Introduction: models and causality 19
2.2 The linear regression model and the OLS estimators 20
2.2.1 The linear regression model as a population model 20
2.2.2 The zero conditional mean assumption 21
2.2.3 The OLS estimator 24
2.3 The Mincerian earnings function for the South African data 26
2.4 Properties of the OLS estimators 28
2.4.1 The assumptions for OLS to be unbiased 28
2.4.2 The assumptions for OLS to be minimum variance 29
2.5 Identifying the causal effect of education 31
References 31
Exercise 32
3 Using and extending the simple regression model 33
3.1 Introduction 33
3.2 Dummy explanatory variables and the return to education 33
3.3 Multiple regression 36
3.3.1 Earnings and production functions 36
3.3.2 The OLS estimators for multiple regression 37
3.3.3 Omitted variables and the bias they may cause 39
3.4 Interpreting multiple regressions 40
3.4.1 How much does investing in education increase earnings? Some micro evidence 40
3.4.2 How much does investing in education increase productivity? Some macro evidence 43
References 45
Exercise 45
4 The distribution of the OLS estimators and hypothesis testing 47
4.1 Introduction 47
4.2 The distribution of the OLS estimators 47
4.2.1 The normality assumption 47
4.2.2 Why normality? 48
4.3 Testing hypotheses about a single population parameter 49
4.3.1 The t distribution 49
4.3.2 The t-test 51
4.3.3 Confidence intervals 53
4.4 Testing for the overall significance of a regression 55
4.5 Testing for heteroskedasticity 57
4.6 Large sample properties of OLS 58
4.6.1 Consistency 58
4.6.2 Asymptotic normality 60
References 60
Exercise 61
5 The determinants of earnings and productivity 62
5.1 Introduction 62
5.2 Testing the normality assumption 62
5.3 The earnings function 65
5.3.1 Bringing the tests together 65
5.3.2 Robust and clustered standard errors 65
5.4 The production function 67
5.4.1 Testing the production function 67
5.4.2 Extending the production function 67
5.5 Interpreting our earnings and production functions 72
5.5.1 Can education be given a causal interpretation? 72
5.5.2 How much does education raise labour productivity? 73
References 74
Exercise 74
SECTION II Time-series data, growth and development 75

6 Modelling growth with time-series data 77
6.1 Introduction: modelling growth 77
6.2 An introduction to the Solow model 78
6.3 A Solow model for Argentina 80
6.4 OLS estimates under the classical assumptions with time-series data 81
6.4.1 Assumptions for OLS to be unbiased 81
6.4.2 The variance of the OLS estimators 83
6.4.3 Testing for autocorrelation 85
6.5 Static and dynamic time-series models 85
6.6 Assumptions to ensure the OLS estimators are consistent 87
6.7 Spurious regression with nonstationary time-series data 89
6.8 A brief summary 91
References 92
Exercise 93
7 The implications of variables having a unit root 95
7.1 Introduction and motivation 95
7.2 Testing for a unit root and the order of integration 96
7.3 Cointegration 100
7.4 How are growth and inflation related in Argentina? 101
7.5 The error-correction model 104
7.6 Causality in time-series models 105
7.7 Cross-section and time-series data 106
References 107
Exercise 107
8 Exogenous and endogenous growth 109
8.1 The Solow model and the history of development 109
8.2 Long-term growth and structural change 109
8.3 The Solow model, structural change and endogenous growth 112
8.4 Human capital and the dynamic Solow model 113
8.5 Exogenous and endogenous growth 116
8.6 A Solow interpretation of development patterns 118
References 118
Exercise 119
Appendix: deriving the dynamic Solow model 119
SECTION III Panel data 121

9 Panel data: an introduction 123
9.1 Introduction 123
9.2 Panel data 123
9.2.1 The structure of the panel 123
9.2.2 Panel data and endogeneity 124
9.3 Panel production functions 127
9.3.1 A panel macro production function 127
9.3.2 A panel micro production function 130
9.4 Interpreting the fixed effect 134
References 135
Exercise 135
Appendix: matrix notation 135
10 Panel estimators: POLS, RE, FE, FD 140
10.1 Introduction 140
10.2 Panel estimators 140
10.2.1 The fixed effects and first difference estimators 140
10.2.2 The random effects estimator 142
10.3 Key assumptions for consistency 143
10.4 Model selection 144
10.4.1 Testing for correlation between the ci and the explanatory variables 145
10.4.2 Testing for the presence of an unobserved effect 146
10.5 The micro panel production function extended 147
10.6 What determines the productivity of Ghanaian firms? 148
References 152
Exercise 152
11 Instrumental variables and endogeneity 153
11.1 Introduction 153
11.2 Sources of bias in the OLS estimates 153
11.2.1 Bias from omitted variables 153
11.2.2 Bias from measurement error 154
11.2.3 Panel data: omitted variables and measurement error 155
11.3 Instrumental variables 156
11.3.1 Valid and informative instruments 157
11.3.2 Interpreting the IV estimator 159
11.4 The properties of the IV estimator 160
11.4.1 The IV and OLS estimators compared 160
11.4.2 Inference with the IV estimator 161
11.5 The causes of differences in world incomes 162
Exercise 167
References 168

SECTION IV An introduction to programme evaluation 169

12 The programme evaluation approach to development policy 171
12.1 Introduction: causal effects and the counterfactual problem 171
12.2 Rubin causal model 172
12.2.1 Potential outcomes 172
12.2.2 Assignment mechanism 173
12.2.3 Defining measures of impact 174
12.2.4 From potential outcomes to regression 174
12.3 Selection on observables 177
12.3.1 Ignorability of treatment 177
12.3.2 Overlap 178
12.4 Unconditional unconfoundedness and the experimental approach 179
References 180
Exercise 180
13 Models, experiments and calibration in development policy analysis 182
13.1 Introduction 182
13.2 Empirical estimators under (conditional) unconfoundedness 182
13.2.1 Multivariate regression 183
13.2.2 Panel data methods 184
13.3 A randomised controlled trial (RCT) for conditional cash transfers 185
13.4 Calibrating technology 188
13.5 Education, technology and poverty 190
References 190
Exercise 191
PART 2 Modelling development 193

14 Measurement, models and methods for understanding poverty 195
14.1 Introduction 195
14.2 The causes of poverty 195
14.2.1 Poverty and GDP data 195
14.2.2 Poverty, consumption and incomes 196
14.2.3 Poverty, inequality and GDP 197
14.3 The Mincerian earnings function, the price of labour and poverty 199
14.4 Modelling impacts 201
14.4.1 A generalised Roy model of selection 201
14.4.2 Implications of the Roy model for estimation of treatment effects 202
14.5 An overview: measurement, models and methods 203
References 204
Exercise 205

SECTION V Modelling choice 207
15 Maximum likelihood estimation 209
15.1 Introduction 209
15.2 The concept of maximum likelihood 209
15.3 The concept of population 211
15.4 Distributional assumptions and the log-likelihood function 211
15.5 Maximising the (log-)likelihood 214
15.6 Maximum likelihood in Stata 215
15.7 Problems and warnings … 218
15.7.1 Maximum likelihood and endogeneity 218
15.7.2 Maximum likelihood and convergence 219
15.8 Properties of maximum likelihood estimates 220
15.8.1 Consistency 221
15.8.2 Efficiency 221
15.8.3 So what? 221
15.9 Hypothesis testing under maximum likelihood 222
15.10 Overview 224
References 224
Exercise 224
16 Modelling choice: the LPM, probit and logit models 226
16.1 Introduction 226
16.2 Binary choices and interpreting the descriptive statistics 227
16.3 Estimation by OLS: the linear probability model 228
16.4 The probit and logit models as latent variable models 231
16.4.1 The probit model 232
16.4.2 The logit model 234
16.5 Maximum likelihood estimation of probit and logit models 234
16.6 Explaining choice 235
References 237
Exercise 237

17 Using logit and probit models for unemployment and school choice 239
17.1 Introduction 239
17.2 Interpreting the probit model and the logit model 240
17.2.1 A model of unemployment 240
17.2.2 Average partial effects and marginal effects at the mean 240
17.2.3 Age and education as determinants of unemployment in South Africa 245
17.3 Goodness of fit 245
17.4 Indian private and state schools 248
17.4.1 How well do private schools perform? 248
17.4.2 Who attends a private school? 249
17.4.3 Mother's education and wealth as determinants of attending private school in India 250
17.5 Models of unemployment and school choice 250
References 252
Exercise 252

18 Corner solutions: modelling investing in children and by firms 254
18.1 Introduction 254
18.2 OLS estimation of corner response models 255
18.2.1 Investment in Ghana's manufacturing sector 255
18.2.2 Gender discrimination in India 258
18.3 The Tobit model 260
18.4 Two-part models 262
18.4.1 Truncated normal hurdle model 264
18.4.2 The log-normal hurdle model 265
18.5 Overview 268
References 268
Exercise 269
Appendix: the Inverse Mills Ratio (IMR) 269
SECTION VI Structural modelling 271

19 An introduction to structural modelling in development economics 273
19.1 Introduction: the challenge of using microeconomic theory in empirical research 273
19.2 Using a structural model to think about risk-sharing 274
19.3 Building and solving a microeconomic model 276
19.4 Thinking about unobservables and choosing an estimator 281
19.4.1 The model to be estimated 281
19.4.2 Identification in the model 282
19.4.3 Testing the model 282
19.5 Estimating the model 283
19.5.1 The data 283
19.5.2 Estimation results 283
19.6 Conclusion 284
References 285
Exercise 285

20 Structural methods and the return to education 286
20.1 Introduction: Belzil and Hansen go to Africa 286
20.2 The question 286
20.3 A model of investment in education 287
20.4 Thinking about unobservables and choosing an estimator 292
20.5 Models and data 296
20.5.1 'Adolescent econometricians'? 296
20.5.2 Possible applications for structural modelling in development 297
20.6 Structural models: hubris or humility? 298
References 298
Exercise 299

SECTION VII Selection, heterogeneity and programme evaluation 301
21 Sample selection: modelling incomes where occupation is chosen 303
21.1 Introduction 303
21.2 Sample selection 303
21.3 A formal exposition 304
21.3.1 The regression with sample selection 304
21.3.2 Modelling the correlation of the unobservables 305
21.4 When is sample selection a problem? 308
21.5 Selection and earnings in South Africa 309
21.6 Corner solution and sample selection models 313
References 314
Exercise 314

22 Programme evaluation: regression discontinuity and matching 316
22.1 Introduction 316
22.2 Regression discontinuity design 316
22.3 Propensity score methods 319
22.3.1 Regression using the propensity score 319
22.3.2 Weighting by the propensity score 320
22.3.3 Matching on the propensity score 321
22.4 Food aid in Ethiopia: propensity-score matching 322
22.5 Assessing the consequences of property rights: pipeline identification strategies 323
22.6 Estimating treatment effects (the plot so far) 326
References 326
Exercise 327

23 Heterogeneity, selection and the marginal treatment effect (MTE) 328
23.1 Introduction 328
23.2 Instrumental variables estimates under homogeneous treatment effects 328
23.3 Instrumental variables estimates under heterogeneous treatment effects 330
23.3.1 IV for noncompliance and heterogeneous effects: the LATE Theorem 330
23.3.2 LATE and the compliant subpopulation 332
23.4 Selection and the marginal treatment effect 333
23.4.1 Interpreting the LATE in the context of the Roy model 333
23.4.2 The marginal treatment effect 336
23.4.3 What does IV identify? 337
23.5 The return to education once again 339
23.6 An overview 341
References 342
Exercise 342

SECTION VIII Dynamic models for micro and macro data 345
24 Estimation of dynamic effects with panel data 347
24.1 Introduction 347
24.2 Instrumental variable estimation of dynamic panel-data models 348
24.3 The Arellano–Bond estimator 349
24.3.1 No serial correlation in the errors 349
24.3.2 Serially correlated errors 350
24.4 The system GMM estimator 351
24.5 Estimation of dynamic panel-data models using Stata 352
24.6 The general case 355
24.6.1 The regressors are strictly exogenous 355
24.6.2 The regressors are predetermined 356
24.6.3 The regressors are contemporaneously endogenous 357
24.6.4 Implications of serial correlation in the error term 357
24.7 Using the estimators 358
References 358
Exercise 359
Appendix: the bias in the fixed effects estimator of a dynamic panel-data model 359
25 Modelling the effects of aid and the determinants of growth 361
25.1 Introduction 361
25.2 Dynamic reduced-form models 361
25.2.1 Aid, policy and growth 361
25.2.2 Dynamics and lags 364
25.2.3 Differenced and system GMM estimators 366
25.3 Growth rate effects: a model of endogenous growth 368
25.3.1 Dynamic and growth rate models 368
25.3.2 Is there evidence for endogenous growth? 370
25.4 Aid, policy and growth revisited with annual data 371
25.4.1 Cross section and time-series uses of macro data 371
25.4.2 Growth and levels effects of aid 371
25.5 A brief overview: aid, policy and growth 372
References 373
Exercise 373

SECTION IX Dynamics and long panels 375
26 Understanding technology using long panels 377
26.1 Introduction 377
26.2 Parameter heterogeneity in long panels 378
26.3 The mean group estimator 379
26.4 Cross-section dependence due to common factors 383
26.5 Conclusion 386
References 386
Exercise 386
27 Cross-section dependence and nonstationary data 388
27.1 Introduction 388
27.2 Alternative approaches to modelling cross-section dependence 388
27.2.1 Country fixed effects and year dummies 389
27.2.2 Estimating unobserved common factors 389
27.2.3 Constructing weight matrices 390
27.3 Modelling cross-section dependence using cross-section averages 390
27.4 Detecting cross-section dependence 393
27.5 Panel unit root testing 394
27.5.1 First-generation panel unit root tests 394
27.5.1.1 The Im, Pesaran and Shin test (IPS) 395
27.5.1.2 The Maddala and Wu test (MW) 395
27.5.2 Second-generation panel unit root tests 395
27.5.2.1 The PANIC approach 395
27.5.2.2 The CIPS and CIPSM tests 396
27.6 Cointegration testing in panels 396
27.6.1 Residual analysis and error-correction models 396
27.6.2 Tests for panel cointegration 397
27.7 Parameter heterogeneity, nonstationary data and cross-section dependence 397
References 399
Exercise 400

28 Macro production functions for manufacturing and agriculture 402
28.1 Introduction 402
28.2 Estimating a production function for manufacturing 403
28.2.1 The homogeneous models 403
28.2.2 The heterogeneous models 405
28.3 Estimating a production function for agriculture 407
28.3.1 Unit roots 408
28.3.2 What determines the productivity of agriculture? 409
28.4 Manufacturing and agriculture and the growth of an economy 412
References 412
Exercise 413
SECTION X An overview 415

29 How can the processes of development best be understood? 417
29.1 Introduction 417
29.2 A range of answers as to the causes of poverty 417
29.3 Macro policy, growth and poverty reduction 419
29.4 Programme evaluation and structural models 419
29.4.1 Programme evaluation and the 'failure' of poverty policies 419
29.4.2 Structural models and understanding the causes of poverty 420
29.5 Skills, technology and the returns on investment 420
29.5.1 The value of skills 420
29.5.2 The role of technology 421
29.5.3 Rates of return on investment 421
29.6 A final word 421
References 422
Bibliography 423
Index 431
List of figures

1.1 A wage curve for South Africa (1993) 5
1.2 A world production function (2000) 8
2.1 An earnings function for South Africa (1993) 23
2.2 Predicted and actual earnings for South Africa (1993) 27
3.1 Predicted log (wages) in South Africa (1993) 42
4.1 The distribution of wages and log wages 48
4.2 The standard normal distribution 50
4.3 Rejection rules for H1: βj > 0 54
4.4 200 df and 5% significance level: rejection rule for H1: βj ≠ 0 54
5.1 Residuals from the simple regression model 63
5.2 The residuals for the log earnings function 66
5.3 Residuals from Table 3.4 Regression (2) 69
6.1 Long-term growth 78
6.2 Incomes and investment in Argentina: 1950–2000 81
6.3a Two random walks (sample size 500) 90
6.3b Two random walks (sample size 10,000) 91
6.4 Regressing the differenced variables 92
7.1 Panel data for growth and inflation 1950–2000 96
7.2 GDP and the price level in Argentina: 1950–2000 97
7.3 GDP and investment for four countries: 1950–2000 107
8.1 Very long-run growth: log GDP per capita (1990 US$) 1500–2006 110
8.2 Changing sectoral shares for China, India and Sub-Saharan Africa: 1980–2008 111
8.3 Growth rates and levels of GDP in China, India and Sub-Saharan Africa: 1980–2009 112
12.1 Schrödinger's cat 172
13.1 The Caselli and Coleman result 189
14.1 Poverty measures from micro data in Sub-Saharan Africa 197
14.2 Poverty measures from macro 198
14.3 An international earnings function 200
15.1 The standard normal: probability density (φ(.)) and cumulative density (Φ(.)) 212
15.2 OLS as a maximum likelihood estimator: convergence 218
15.3 OLS as a maximum likelihood estimator: suspect 'convergence' 220
15.4 The Wald test and the likelihood ratio test 223
16.1 The normal and logistic distributions 233
17.1 Changes in unemployment among black South African men 246
17.2 Predicted probability of being in a private school 251
18.1 Frequency distribution of investment rates by ownership status 256
18.2 The distribution of the OLS residuals from the logarithmic specification 267
18.3 Predicted investment rates in Ghanaian firms 267
18.4 The truncated normal distribution 268
19.1 Power utility with multiplicative preference shocks 277
19.2 Change in household consumption 283
20.1 'School ability' (νis) and 'market ability' (νiw) 288
20.2 Choosing between S = 0, S = 4, S = 7 and S = 11 292
21.1 The IMR function 307
21.2 An illustration of the possible effect of sample selection on the age–earnings relationship 313
22.1 Strict regression discontinuity design 317
22.2 The probability of borrowings and land area 318
22.3 Propensity-score matching using nearest-neighbour matching 322
23.1 The cumulative standard normal density function for v 335
23.2 Local linear regression of log wage on p. White males PDID91, baseline model without interactions 338
28.1 TFP evolution based on FD specification 406
List of tables

1.1 A Cobb–Douglas production function from Hall and Jones data 8
1.2 Stata code to create Hall and Jones measure of human capital 11
2.1 The regression model using data set 'Labour_Force_SA_SALDRU_1993' 27
2.2 The regression model using data in 'Labour_Force_SA_SALDRU_1993' with robust standard errors 30
3.1 Log (earnings) and a dummy variable for primary education 34
3.2 The means of log (earnings) by whether primary education completed 35
3.3 Stata printout from South African wage data 41
3.4 The macro production function data 43
5.1 Testing the normality assumption 64
5.2 Testing normality and homoskedasticity 66
5.3 The earnings function with robust and clustered standard errors 68
5.4 Residual diagnostics from Table 3.4 Regression (2) 69
5.5 The production function with robust standard errors 70
5.6 A pooled cross-section production function 70
5.7 A cross-section differenced production function: 1980–2000 72
6.1 A Solow model for Argentina 82
6.2 Testing for heteroskedasticity and autocorrelation 84
6.3 A general dynamic Solow model for Argentina 87
7.1 Testing for a unit root in the level of GDP in Argentina 98
7.2 Testing for a unit root in the growth rate of GDP 99
7.3 Testing for a unit root in the inflation rate 99
7.4 GDP and inflation for Argentina 101
7.5 Asymptotic critical values for cointegration tests 102
7.6 Growth and the change in inflation 102
7.7 An error-correction specification 103
7.8 Granger causality for GDP and inflation 106
8.1 The growth form of the Solow model on cross-section data 115
8.2 The dynamic model form of the Solow model on cross-section data 116
9.1 The structure of a panel data set 124
9.2 The macro production function: data pooled for 1980 and 2000 128
9.3 The macro production function: fixed effects estimation 129
9.4 The macro production function: first difference estimation 130
9.5 Ghana panel data of firms: descriptive statistics 131
9.6 Ghanaian firm-level data: pooled 132
9.7 Ghanaian firm-level data: fixed effects estimation 133
9.8 Ghanaian firm-level data: first difference estimation 134
10.1 Ghanaian firm-level data: pooled OLS 148
10.2 Ghanaian firm-level data: FE versus RE 149
10.3 Ghanaian firm-level data: the Breusch–Pagan and Hausman tests 150
10.4 Baseline specification with individual means added: OLS results and F test 151
11.1.1 The Hall and Jones model: the reduced form 164
11.1.2 The Hall and Jones model: using instruments 165
11.2 The components of social infrastructure 166
11.3 Determinants of education in Ghana and Tanzania: 2004–2005 167
12.1 Log (earnings) and a dummy variable for primary education 171
12.2 Maize yields 180
13.1 Regression and potential outcomes 185
13.2 Enrolments in school 186
13.3 The results of the RCT from table means 187
13.4 The results of the RCT from regression 187
13.5 Farm size and productivity 191
14.1 Earnings and education in urban Ghana and Tanzania: 2004–2005 205
15.1 Log earnings as a function of education (OLS) 216
15.2 Log earnings as a function of education (MLE) 217
16.1 The probability of unemployment among black and non-black South Africans 228
16.2 Interpreting the LPM 229
16.3 LPM with robust standard errors 231
16.4 Probit maximum likelihood program in Stata 236
17.1 Determinants of unemployment among black South African men: probit results 241
17.2 Determinants of unemployment among black South African men: average partial effects 242
17.3 Probit results based on alternative syntax 243
17.4 Average partial effects with correct treatment of nonlinearities 244
17.5 Marginal effects evaluated at the mean 244
17.6 Determinants of unemployment: logit results 244
17.7 Average partial effects and marginal effects at the mean for logit model 245
17.8 Predicted and actual unemployment outcomes following logit estimation 247
17.9 Descriptive statistics for achievement and schooling: Stata data file spupilmsc.dta 248
17.10 A model of school choice 249
17.11 How well does the school choice model predict? 250
17.12 Average partial effects and marginal effects evaluated at the mean 251
18.1 Investment rates and foreign ownership in Ghana 257
18.2 Investment rates and foreign ownership 257
18.3 Extended investment model: OLS estimates 258
18.4 Summary statistics for educational share 259
18.5 The OLS estimation of the educational share equation 259
18.6 Investment in Ghana: Tobit estimation 263
18.7 Results from log-normal hurdle model 266
19.1 A Mace (1991) test on the Ethiopian Rural Household Survey (rounds 2 and 3) 284
21.1 Earnings function for South African (black and male) 310
21.2 Modelling earnings with selection 310
21.3 Heckman by hand 311
22.1 Regression and potential outcomes 324
23.1 Compliance types 330
23.2 The return to a year of college education in the US and China 339
24.1 Estimates of dynamic effects using panel data 353
25.1 The basic specification using averaged and annual data 363
25.2 A more general dynamic specification using averaged and annual data 365
25.3 Using the panel data estimators using averaged data 367
25.4 Diagnostic tests for estimators reported in Table 25.3 368
25.5 Using annual data to model the effects of aid and policy on growth 372
26.1 Mean group estimates for a macro manufacturing panel 380
26.2 Implementing the mean group estimator in Stata using xtmg 382
27.1 A data set with cross-section averages 393
27.2 (a) Descriptive statistics 400
27.2 (b) Panel unit root tests 400
27.2 (c) Cross-section dependence analysis 401
28.1 A manufacturing production function assuming homogeneous technology 404
28.2 A manufacturing production function assuming heterogeneous technology 406
28.3 (a) Maddala and Wu (1999) unit root tests 409
28.3 (b) Pesaran (2007) unit root test (CIPS) 409
28.3 (c) Pesaran, Smith and Yamagata (2013) panel unit root test (CIPSM) 409
28.4 An agricultural production function: pooled estimators 410
28.5 An agricultural production function: mean group type estimators 411
Notes on authors
Måns Söderbom is Professor of Economics at the Department of Economics, School of Business, Economics and Law, University of Gothenburg. He is also a Research Associate at the Centre for the Study of African Economies (CSAE), Department of Economics, University of Oxford, and a Fellow of the European Development Research Network. Prior to joining the faculty in Gothenburg in 2008, he was a Research Fellow at the CSAE for seven years. Research on industrial development is his main area of interest, but he has also worked on civil conflict, labour markets and schooling.

Francis Teal was Deputy Director of the Centre for the Study of African Economies (CSAE) at the University of Oxford from 1996 to 2012. He is now a Research Associate of the Centre, continuing to work on labour markets and firms in Sub-Saharan Africa, and is a Managing Editor of Oxford Economic Papers. During much of his time at the CSAE, he was responsible for teaching a course on Quantitative Methods for the MSc in Economics. Before joining the Centre in 1991, he taught at the School of Oriental and African Studies at the University of London and the Australian National University.

Markus Eberhardt is Assistant Professor in the School of Economics, University of Nottingham and a Research Associate of the Centre for the Study of African Economies (CSAE), University of Oxford. His research interests centre on the analysis of productivity at the macro level, including questions related to knowledge accumulation and diffusion as well as to structural transformation in developing countries. At Nottingham, he teaches applied econometrics and growth empirics to Masters students; during his doctoral and post-doctoral studies at the CSAE, he taught macro panel econometrics on the MSc in Economics for Development.
Simon Quinn is currently working as an Associate Professor in the Department of Economics, and as a Deputy Director of the Centre for the Study of African Economies (CSAE), at the University of Oxford. His research interests centre on the role of firms and labour markets in developing economies. He has taught applied microeconometrics and microeconomic theory of development on both the MPhil in Economics and the MSc in Economics for Development at Oxford.
Andrew Zeitlin is Assistant Professor at the McCourt School of Public Policy at Georgetown University. He is also a Research Associate at the Centre for the Study of African Economies (CSAE), Department of Economics, University of Oxford; a Non-Resident Fellow at the Center for Global Development; and Lead Academic for the International Growth Centre's Rwanda Program. Prior to joining the faculty at Georgetown in 2012, he was a Research Fellow at the CSAE. His research uses theory-driven field and laboratory experimental methods, together with the collection of observational data, to study a range of microeconomic policy issues in economic development, with a focus on Sub-Saharan Africa. His current research focuses primarily on public-sector motivation and other dimensions of state capacity.
Preface
This book has its origins in a course taught as part of the Economics for Development MSc at the University of Oxford by Måns Söderbom and Francis Teal. It began as a relatively orthodox course in econometric methods but gradually evolved into one focused on how quantitative methods can be applied to development issues. Initially the methods covered were the classical linear regression model and nonlinear choice models but, encouraged by students and aware of the radical changes in methods that have characterised development economics over the last decade, the course gradually expanded its coverage. Måns Söderbom moved to teach at the University of Gothenburg in 2008. After his departure, parts of the course were taken over by post-doctoral researchers at the Centre for the Study of African Economies (CSAE) who generously volunteered to teach parts of the course in which they were particularly interested and which focused on their research interests. Andrew Zeitlin taught a module on impact evaluation, Simon Quinn covered structural modelling and Markus Eberhardt introduced the students to the new methods that were being developed for panel data where the time dimension is relatively long.

This book is not that lecture course, but it has grown out of it and does share its objectives. The first objective is to give some insight into how development economics can be viewed from an empirical perspective. The second is to introduce the tools that will enable the student to carry out empirical work in development. The rationale for this book is that there is much to be gained from teaching development from a quantitative perspective. The problem this poses is that those interested in development often come from intellectual traditions uninterested in, indeed often hostile to, a quantitative approach. The topics typically addressed in a course on development would include: Who are the poor, how is poverty measured and how can it be reduced?
Does globalisation impoverish the poor? What is the role of human capital in growth and poverty reduction? Are neo-liberal policies increasing poverty? How can gender gaps be addressed? These questions can be, indeed usually are, taught without reference to a course on quantitative methods. Our objective in this book is to show that an empirical approach can advance understanding of these questions. We are aware of the challenge involved in convincing students interested in development that it is useful to know the properties of the classical linear model or the methods of maximum likelihood. It is, however, this challenge we have set ourselves. This book attempts to tackle the challenge by covering much of the ground taught in a basic statistics course, but rather than focusing on the statistical issues, we seek to show how they inform our understanding of development questions. There is no point in pretending that data can be analysed without certain basic statistical techniques.
Equally, the issues of concern to many areas of theoretical econometrics are largely irrelevant to understanding much of development, as its problems are very different. We are well aware that this book does not cover more than a small fraction of the enormous amount of empirical work on development that has been undertaken in the last two decades. The book is envisaged as a guide to how that literature can be read, evaluated and understood in context. As well as providing an approach to studying development that we think is distinctive, our objective is also to enable students to undertake empirical work of their own. In the last two decades, the increase in the availability of data and in the power of computers to analyse large data sets, combined with the dramatic growth of interest in development issues, has created a market for those equipped to use data to address policy questions. We hope those engaged in development policy will find some parts of this book useful.

While the book is a collaborative effort, as all aspects of it have been discussed over the years the material has been taught, there was a division of labour which is reflected in the structure of the book. Markus Eberhardt, now at the University of Nottingham, is the principal author of Chapters 26 to 28; Simon Quinn, now at the CSAE at the University of Oxford, is the principal author of Chapters 15, 19 and 20; Andrew Zeitlin, now at Georgetown University, is the principal author of Chapters 12, 22 and 23. The other chapters are our joint responsibility, although we stress again the collaborative nature of the endeavour involved in writing the book. We are very grateful to Yonas Alem, Simona Bejenariu, Arne Bigsten, Oana Borcan, Dick Durevall, Ann-Sofie Isaksson, Annika Lindskog, Andreea Mitrut, Hanna Mühlrad, Anja Tolonen and Michele Valsecchi for detailed, and very helpful, comments on several of the chapters.
Finally we need to acknowledge the role played by the students we have all had the privilege of teaching. Their energy and enthusiasm when they saw how the methods could be used convinced us that writing this book was a worthwhile task. We hope they, and others, find here a rather clearer and fuller account of what we were trying to say. We take full responsibility for any failings and mistakes that you may quite possibly find.

Måns Söderbom
University of Gothenburg

Francis Teal
CSAE, University of Oxford
How to use this book
As we indicated in the Preface, our objective in writing this book has been to enable the student and researcher to understand the links from empirical analysis to issues prominent in development economics. It is intended to be very 'hands-on', in that nearly all the exercises ask the student to apply the techniques that have been covered in the chapter to data drawn from developing countries. In some chapters, this data is that already presented; in others, the exercises require the application of the techniques to other data sets. The intention is to encourage the student to apply the techniques learned and to see how what might look like minor changes in specification can have major changes for the results. The website linked to this book provides additional data and exercises. A website of data sources suitable for use with this book can be found at: https://sites.google.com/site/medevecon/development-economics/devecondata. However, the structure of the book allows instructors to introduce their own data with applications of particular interest to them. It is now common practice with journals that empirical papers make their data publicly available. There is no shortage of data on which you can draw and the exercises should be seen as illustrative of an approach that we believe enables students to obtain a much clearer idea of the links between econometric techniques and their application. Some of the models presented do not work well using the data provided. There is much to learn from such failure.

The objective of understanding the links from models to data in development underlies the structure of the book. Broadly speaking, the first part introduces the basic techniques. This part is intended to be suitable for an undergraduate course in which the instructor wants to show how econometric techniques can inform development questions.
The second part provides a more advanced treatment and is intended to be suitable for a graduate course with a similar objective of showing how a quantitative approach can be used for development issues. We need to stress that this is not an econometrics textbook, nor is it a book on development economics. There are excellent texts in both these areas. The approach we have adopted to the presentation of the introductory econometric material in Part 1 of the book follows closely that of Wooldridge (2013) and, if you are new to this material, then it is essential that you consult that book. The more advanced Wooldridge (2010) text will also need to be consulted for the sections in Part 2 which cover nonlinear models and panel methods. In contrast to introductory econometric texts, there is not an agreed method of presenting development issues. That fact has allowed us the licence to provide our own framework – which, in general terms, is set out in the first chapter. Understanding the process of development requires understanding the processes that drive income (defined very broadly). We have
structured the book so that the income sources are drawn from both micro data – that for individuals, households or firms – and macro data, providing comparative international data for GDP and education. In poor countries, measuring income, either at the level of the household or at the level of the macroeconomy, is a major challenge, which we discuss. This book is intended to be distinctive, not simply in its quantitative approach, but in its presentation of a range of approaches. Development economics has undergone a major change in focus over the last decade with the rise of experimental methods to address development issues. One of our objectives has been to show how such methods relate to more traditional ones. Thus when this new approach is introduced, in Section IV, we go to some lengths to explain its link to the simple and multiple regression models. Until very recently this experimental approach was seen to be diametrically opposite to an approach using structural models. In Section VI we present the basics of building a structural model and direct you, if you are interested, to current work which is seeking to combine both experimental data and structural modelling. Another important aspect of the book is its presentation of the differences that arise in the analysis of cross-section, time-series and panel data. In Part 1 we introduce all three types of data: cross-section data is the basis of Section I, time-series data is covered in Section II and panel data is introduced in Section III. A common theme across all these different types of data is the desire to make causal statements along the lines 'investing in education leads to increases in income'. We hope that by the end of Part 1, you will understand why it is a difficult task to establish when, or if, that is true.
Panel data involves a combination of cross-section and time-series data and as such offers opportunities to address problems that cannot be fully addressed with cross section or time series alone. How panel data can do this, how dynamics can be introduced (and why the form of the dynamics is critical for understanding debates about the sources of growth), whether investment affects growth rates and how aid and policy may impact on growth are all subjects tackled in Part 2. In Section IX we focus on recent work in panel macro data where the time dimension is long. These techniques are used to address issues in modelling manufacturing and agriculture, rather than aggregate GDP, and assessing how productivity varies over time and across sectors. The last broad theme of the book is the links from models of choice (Section V) to problems posed by selection and heterogeneity in response to policy (Section VII). We hope you will see the connection between these topics and the problems posed by understanding how firms invest, how households decide how much to spend on the education of their children, whether microcredit does help the poor, whether food aid works, who gets private schooling and whether property rights enhance investment. Our objective throughout the book is not to answer those questions, but to convince you that studying the relevant quantitative methods can help you to do so yourself.
References

Wooldridge, J. M. (2010) Econometric Analysis of Cross Section and Panel Data, Second Edition, The MIT Press, Cambridge, Massachusetts.
Wooldridge, J. M. (2013) Introductory Econometrics: A Modern Approach, Fifth Edition, South-Western Cengage Learning.
Part 1
Linking models to data for development
1 An introduction to empirical development economics
1.1 The objective of the book

This is a book about quantitative methods which aims to enable you to carry out empirical work in development economics. It is not a book of econometric theory or data description, although both are key elements of the toolbox you need to be able to carry out empirical work. We are concerned with the measurement of economic relationships using techniques of estimation and the methods of the classical theory of statistical inference. The economic relationships we analyse come from economic theory. Our objective in this book is to draw on models from economic theory that allow us to test propositions of central interest in development economics. Theories can be developed without concern for the statistical problems associated with drawing inference from the data: econometrics seeks to bridge the gap between economic theory and the data-based representation of the economic systems or relationships of interest. We typically use econometric methods for three related purposes: policy analysis, testing theory and forecasting. We begin by giving some examples of the questions and issues that can arise for each of these possible uses of quantitative methods in economics.

Examples of policy analysis:

• Does a training programme work?
• How much does investing in education/health/social networks increase productivity and/or wages?
• What effect does raising the minimum wage have on unemployment?
• Is a high rate of inflation costly?
These issues are all clearly matters of policy interest. There is also a direct link from a policy instrument to a desired outcome. We want to know how we can go about investigating whether a particular policy instrument will have the desired impact on the outcome.

Examples of theory for testing:

• The Harris–Todaro model predicts that high wage sectors in poor countries will have high unemployment to generate equilibrium across sectors.
• Endogenous growth theory predicts that the growth rate (growth rate – not the level of income) will be a function of the level of investment in human capital.
• Risk is particularly important in agricultural markets, which dominate the livelihoods of the poor in developing countries. If households cannot adequately insure, market outcomes will be inefficient and limit the ability of the poor to invest.
These three examples have in common that they are predictions from models which are widely used in development economics. In this book you will meet these models (and many others). The objective of this part of the book is to show you how such models are tested. For each of those predictions you can ask: does the data support the theory? If the answer is yes then that is good news for the theory. If the answer is no that is not necessarily bad news for the theory. There may be a problem with your data or there may be a problem with how you have used the data. If there are not then you need to think about what theory, better than the one you have, can explain the data.

Examples of forecasting questions:

• What will be the level of commodity prices for copper over the next 20 years?
• How much will poverty increase in a country subject to a large shock to the terms of trade?
• How educated will the population of India be at the end of the next ten years?
The key element of forecasting is that it requires time-series data. By definition you are asking about how the future will evolve. You can only assess forecasts by comparing the outcomes with the predictions. The common dimension across all these uses for quantitative methods is a model. We cannot do policy analysis, test theory or forecast unless we have a model which tells us how the variables we are interested in are related to the variables we think determine them. You can think of econometrics as combining three elements: economic theory, data and statistical theory. These three elements are used to develop models that are 'good' explanations for policy analysis, testing theory or forecasting. What is 'good' about a model is a complex question to which we will return throughout the book. In economics generally, and in development economics in particular, models are freely available and data is much scarcer. You will find many models which have not been tested and some models which are regarded as rather obviously true but which, when tested, turn out to be inconsistent with the data. One example we use in the first part of the book is the Harris–Todaro model of what determines equilibrium in labour markets in poor countries.
1.2 Models and data: the Harris–Todaro model The key prediction of the Harris–Todaro model of labour markets is very simple. Wages in urban areas exceed those in rural areas by substantial amounts. Why do people not move? Well, they do, and in doing so they create urban unemployment which acts to equilibrate the market by ensuring that the expected wage in the urban area is equal to the wage in the rural sector. The following exposition is taken from Fields (1975: 167–8). Let Wa and Wu denote agricultural and urban wage rates, respectively, Eu the number of urban jobs, and Lu the urban labour force. The expected urban income is
E(Wu) = Wu (Eu/Lu).   (1.1)
[Figure 1.1 A wage curve for South Africa (1993): the log of hourly wages (weighted by prices) plotted against the cluster unemployment rate, with a linear prediction. Source: SALDRU data]
Expected rural income E(Wa) is simply Wa. The amount of rural–urban migration, L̇u, is a function of the urban–rural expected wage differential,

L̇u = ψ(E(Wu) − E(Wa)),   (1.2)

the rural–urban equilibrium condition is

E(Wu) = E(Wa)
Wu (Eu/Lu) = Wa,   (1.3)

and the equilibrium employment rate is

(Eu/Lu) = Wa/Wu.   (1.4)
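The equilibrium condition can be traced through numerically. The following sketch is ours, not the book's, and the wage values in it are hypothetical, chosen only to show how severe the model's unemployment prediction is.

```python
# Harris-Todaro equilibrium: expected urban income equals the rural wage,
# so the equilibrium urban employment rate is Eu/Lu = Wa/Wu (equation (1.4)).

def equilibrium_employment_rate(wa, wu):
    """Equilibrium urban employment rate implied by Wu*(Eu/Lu) = Wa."""
    return wa / wu

# Hypothetical wages: the urban wage is twice the agricultural wage.
wa, wu = 1.0, 2.0
emp_rate = equilibrium_employment_rate(wa, wu)
unemp_rate = 1 - emp_rate

# At this employment rate the expected urban wage equals the rural wage,
# confirming the equilibrium condition (1.3).
expected_urban_wage = wu * emp_rate

print(emp_rate, unemp_rate, expected_urban_wage)  # 0.5 0.5 1.0
```

With an urban wage twice the agricultural wage, equation (1.4) implies an urban employment rate of one half – that is, 50 per cent urban unemployment – far above anything observed in practice.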
We see that this model has some very simple and intuitive predictions. As the urban wage rises relative to the rural, the employment rate must fall – and, by implication, the unemployment rate must rise. So, if this model is correct, we expect high-wage regions to be high-unemployment regions, that is, we expect a positive relationship between wages and unemployment. The paper by Fields (1975) points out that this simple model predicts unemployment rates far higher than those actually observed in poor countries and goes on to develop a model which includes an informal urban sector where job seekers have a higher probability of employment than those in the rural area. While the Fields extension lowers the expected unemployment rate that will be associated with any level of wage differential, it retains the key insight of the Harris–Todaro model that we will observe high-wage areas associated with high unemployment.
In fact, we can observe the opposite (see Figure 1.1). There is a large literature which has regressed wages on local unemployment and, almost without exception, has found this relationship to be negative. This relationship is termed a 'wage curve' and was noticed and documented by Blanchflower and Oswald (1995). It is one of the most striking empirical regularities in labour economics and on the face of it appears to flatly contradict the Harris–Todaro model. One such example is the data for South Africa that forms the basis for Figure 1.1, which plots individual-level wages against cluster-level unemployment rates using cross-section data from Kingdon and Knight (2006).

Many economists think this result must be wrong. Why? The answer is the power of economic logic. Wages and unemployment cannot be negatively correlated in equilibrium. Why? Well, why live in an area where there are low wages and a lower probability of getting a job? It must make economic sense, in the long run, to move to the area with higher wages and a higher probability of getting a job. While you may observe a short-run negative relationship in the data there cannot be such a relationship in the long run. Such logic is powerful. It both suggests aspects of the market we need to look at if we want to push the analysis further, and questions we need to ask. First, how do we measure real wages? Housing and other costs may well vary between low and high unemployment areas – so how you measure 'real' wages becomes an important question. Second, it suggests what type of data we may need to test some propositions. We may need to observe changes in wages and unemployment within regions or districts over time. This we cannot do with the data in Figure 1.1, which just shows a 'snapshot' at a single point in time. There is an important general point to keep in mind when thinking about empirical work in development economics.
You need to link your question with the kind of data that can answer the question you have posed. The key building block in the process of policy analysis, testing theory or forecasting is the model. Models are simply the specification of relationships between the variables of interest. To use the model for an empirical analysis, we need data – and to test the model we need to link the model to the data. How the data is linked to the model is the task of statistics – a task that occupies us from the next chapter. In Section 1.3, we want to show you what types of data you will meet and give an example of how models can be taken to the data.
1.3 Production functions and functional form

A production function is a technical relationship telling us how inputs are linked to outputs. In interpreting any production function, it is necessary to focus on how both outputs and inputs are defined and measured. Is the output measure a gross output or a value-added measure? How has the measure of the capital stock been derived? How comprehensive are the measures of inputs? You will need to understand the implications of using different specifications of the production function.

1.3.1 The Cobb–Douglas production function

Let us write down a Cobb–Douglas version of a production function with two factors of production, capital and labour:

Vit = Kit^α (Ait Lit)^(1−α) e^uit,   (1.5)
where Vit is a value-added measure of output, Kit is a measure of physical capital and Lit is the amount of labour used in production. Ait incorporates factors which augment labour productivity, for example, by more education or training. The error term uit captures all the other unobserved factors that may determine value-added, which include any variables omitted from the equation, some of which may be hard to measure, for example, managerial quality. Note that we have used the subscripts (i, t) in equation (1.5). These refer to unit i at time t. The unit may be a firm, an industry or a country. It will depend on your question and the data you have to address the question. In writing equation (1.5) with such subscripts, we are assuming panel data is available. A panel data set is one in which we can observe the same unit over time. If we had dropped the t subscript so that we can only index the variables by i we would be assuming we had a cross-section data set, that is, we observed a lot of firms or countries but at only one point in time. If we had dropped the i subscript we would be assuming we had a time-series data set, that is, we observed one firm or country over a period of time. The econometric problems which arise in linking data to models vary depending on which types of data we have. The kinds of data that were mainly available to econometricians in the early days of the subject were time series. Budget surveys were available and used but such data was limited. In the last two decades, partly as a result of the revolution in computer technology, the number of cross-section data sets has expanded rapidly. These micro cross-section data now include data on households, firms and individuals (roughly in that order of frequency in terms of availability).
At the macro level, as a result of the International Comparisons Project, which has resulted in the Penn Tables (see section 1.5.1), the panel dimension of macro data has also been greatly extended. So in the exposition in this section we assume that a panel data set is available to investigate the models. The Cobb–Douglas form of the production function in equation (1.5) is linear in logarithms (we always use natural logarithms throughout the book) so we can write:

log Vit = α log Kit + (1 − α) log Ait + (1 − α) log Lit + uit.   (1.6)
Defining kit as the stock of capital per effective unit of labour, kit = Kit/(Ait Lit), and vit as the output per effective unit of labour, vit = Vit/(Ait Lit), we can write this equation as

log vit = α log kit + uit.   (1.7)
This equation appears to be extremely simple. It says that value-added per effective unit of labour depends only on the capital per effective unit of labour. How, you will ask, can we take such a model to the data? How is effective labour to be measured? Surely value-added depends on far more than physical capital? Even if value-added does depend on physical capital, to use such a model begs the important question as to what in turn determines physical capital. All these are very relevant questions to which we will return but, being a little impatient, we want to know how closely value-added is related to capital and labour. So we make the simplest possible assumptions. We look at a cross section of countries in 2000 (so we can drop the t subscript and only have i). We assume that all labour is equally
Table 1.1 A Cobb–Douglas production function from Hall and Jones data

. reg hjlogyl hjlogkl ;

      Source |       SS       df       MS              Number of obs =     127
-------------+------------------------------           F(  1,   125) = 1091.27
       Model |  131.260296     1  131.260296           Prob > F      =  0.0000
    Residual |  15.0352867   125  .120282294           R-squared     =  0.8972
-------------+------------------------------           Adj R-squared =  0.8964
       Total |  146.295582   126  1.16107605           Root MSE      =  .34682

------------------------------------------------------------------------------
     hjlogyl |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     hjlogkl |   .6545332   .0198137    33.03   0.000     .6153194     .693747
       _cons |   2.704834   .1860317    14.54   0.000     2.336654    3.073014
------------------------------------------------------------------------------
[Figure 1.2 A world production function (2000): the log of real GDP per capita (1996 PPP US$) plotted against the log of capital per capita, with a linear prediction. The abbreviations in the figure are the standard World Bank country codes; for details please see http://www.irows.ucr.edu/research/tsmstudy/wbcountrycodes.htm. Source: PENN World Tables with imputed capital stock]
effective, so we set Ai = 1 for all i. As log(1) = 0, we can write our basic production function as:

log(V/L)i = α log(K/L)i + ui.   (1.8)
Notice we write it this way to be explicit that (i) we only need measures of value-added, the physical capital stock and labour in order to be able to take the model to the data, and (ii) that we are going to estimate the model using cross-section data. Such data is available from the Penn World Tables and is used by Hall and Jones (1999) and presented in Figure 1.2. The predicted line shown in Figure 1.2 has been estimated by OLS. For those of you unfamiliar with that concept you can, for the moment, think of it as the ‘best’ way
of fitting a line through the data. Those of you who have met the concept already will know that it comes from minimising the sum of squares of ui. Why, you might ask, the square? Why not the level or its absolute value or indeed why not just draw a straight line through the figure? All of those are possible (but rarely used) 'methods of estimation'. We spend a lot of time in what follows on why you should choose one method rather than another for drawing a line. For the moment, we simply report that the figure comes from an OLS regression which is shown in Table 1.1. We need to focus on what the line might mean. You might well think that capital per capita appears to do a remarkably good job of explaining differences in world incomes. Several important features of data that recur throughout this book can be illustrated with this figure. First, it is in natural logs and you need to be clear on how we move between such logs and the levels of the variables. An equation that is linear in logs is not linear in levels (compare equation (1.6) with equation (1.5)). The coefficient in an equation which is linear in logs has an interpretation as an elasticity:

Elasticity = (% change in v)/(% change in k) = (dv/v)/(dk/k) = d log(v)/d log(k).
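To see the 'slope equals elasticity' point concretely, the following sketch (ours, not the book's) generates artificial log-log data with a known elasticity of 0.65 – roughly the Table 1.1 estimate – and recovers it with the closed-form OLS slope formula.

```python
import math

# Artificial data: log(V/L) = 2.7 + 0.65*log(K/L) exactly (no noise),
# with the elasticity set to roughly the Table 1.1 estimate.
alpha, const = 0.65, 2.7
log_k = [math.log(k) for k in (500, 2000, 10000, 50000, 150000)]
log_v = [const + alpha * x for x in log_k]

# Closed-form OLS slope: sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
xbar = sum(log_k) / len(log_k)
ybar = sum(log_v) / len(log_v)
slope = sum((x - xbar) * (y - ybar) for x, y in zip(log_k, log_v)) \
        / sum((x - xbar) ** 2 for x in log_k)

# A 1 per cent rise in capital per worker raises output per worker by
# 'slope' per cent: the log-log coefficient IS the elasticity.
print(round(slope, 4))  # 0.65
```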
So armed with the result of the regression we know that the elasticity of value-added with respect to capital is approximately 0.65. In fact, this elasticity has an economic interpretation. If factors are paid their marginal product, and denoting the price of labour as w and the price of capital as r, we have

dV/dL = w = (1 − α) K^α L^(−α) = (1 − α)(K/L)^α = (1 − α)(V/L)   (1.9)

dV/dK = r = α K^(α−1) L^(1−α) = α(K/L)^(α−1) = α(V/K).   (1.10)
With a slight rearrangement, we see that our Cobb–Douglas form of the production function implies:

wL/V = (1 − α)   (1.11)

and

rK/V = α.

We see that α is the share of capital in value-added. The Hall and Jones (1999) decomposition of the determinants of differences in income across countries used the fact that the share of capital in value-added was 0.3 to impute the value of α. We now see that our OLS regression result directly contradicts that assumption. It says that the share of capital in value-added is 0.65. Something seems to have gone seriously wrong. We return to what that something might be in later chapters.
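Equations (1.9)–(1.11) can be checked numerically. In the sketch below (ours, not the book's) the input levels are arbitrary, and the marginal products are approximated by finite differences rather than derived analytically; the point is simply that under Cobb–Douglas the implied factor shares are α and 1 − α whatever the input levels.

```python
# Cobb-Douglas with alpha = 0.3 and arbitrary input levels.
alpha, K, L = 0.3, 8000.0, 500.0

def V(K, L, a=alpha):
    return K ** a * L ** (1 - a)

# Finite-difference approximations to the marginal products (1.9)-(1.10).
h = 1e-4
r = (V(K + h, L) - V(K - h, L)) / (2 * h)   # dV/dK, the price of capital
w = (V(K, L + h) - V(K, L - h)) / (2 * h)   # dV/dL, the wage

capital_share = r * K / V(K, L)   # should equal alpha
labour_share = w * L / V(K, L)    # should equal 1 - alpha, as in (1.11)

print(round(capital_share, 6), round(labour_share, 6))  # 0.3 0.7
```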
What the line tells us is that there appears to be a close linear relationship between the natural logs of value-added per capita and capital per capita. As we have already noted, an equation that is linear in the logs is not linear in levels. That matters, as we see in Chapter 2 when we come to interpreting the regression line. Note how large the differences in capital per capita are over the countries. The natural log of this variable ranges from 6 to 12 in Figure 1.2, exp(6) ≈ 400 and exp(12) ≈ 163,000, both of these numbers expressed in 1985 international dollars (also termed purchasing power parity (PPP) US$). So capital per capita increases by a factor of over 400 between the poor countries of the world in the data, concentrated in Sub-Saharan Africa and South Asia, and the rich ones in Europe and North America.

Can this regression tell us how countries grew from being as poor as Ethiopia and Tanzania to being as rich as the US, Luxembourg and Switzerland? Clearly, the figure does not model such a process, which has occurred over time, and a cross-section of data can never model time-series processes. However, if we are willing to interpret the regression as causal then we can make such a statement. In fact, if we interpret the regression as causal, we can say that to increase their income per capita to that of a high-income country, Ethiopia and Tanzania need to invest so that capital per person grows from US(PPP)$ (1985) 400 to 160,000 per capita. If capital per head were to grow by 20 per cent per annum for a little over 30 years this would effect the change needed to get from a Ugandan level of capital per head to a Swiss one. So we know how to solve the problems of poor countries! They just need to invest so that capital per head grows by 20 per cent per annum. How do we know that? Well, we have given our regression equation a causal interpretation. Can we? Definitely not – for reasons that are going to occupy us in much of the rest of the book. For the moment, we can simply note that we have strong evidence that the estimated coefficient on capital is biased. In fact, we think it is at least twice as high as the evidence suggests it should be. Our task of understanding how countries develop is not over yet.
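The growth arithmetic can be verified directly. The start and end values of capital per head are the US(PPP)$400 and $160,000 from the text; the only computation is solving for the number of years of 20 per cent annual growth needed to cover that gap.

```python
import math

start, end = 400.0, 160_000.0
growth = 0.20

# Years needed: solve start * (1 + g)^t = end for t.
years = math.log(end / start) / math.log(1 + growth)
print(round(years, 1))  # 32.9
```

About 33 years of sustained 20 per cent growth in capital per head – a little over the three decades of extraordinary accumulation the text describes.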
Definitely not – for reasons that are going to occupy us in much of the rest of the book. For the moment, we can simply note that we have strong evidence that the estimated coefficient on capital is biased. In fact, we think it is at least twice as high as the evidence suggests it should be. Our task of understanding how countries develop is not over yet.

1.3.2 The constant elasticity of substitution (CES) functional form

What might account for the difference between the model and the data? That question is the focus of much of our efforts over the next chapters, but it is useful to begin by focusing on how strong the assumptions we have been making are. The first of those strong assumptions is one of functional form. The Cobb–Douglas model implies that factor shares are constant for all countries for all time periods. Can we relax that assumption? Yes – and the simplest way to do so is to move from the Cobb–Douglas form to the assumption of the constant elasticity of substitution (CES) which is:

Vit = [(AK Kit)^σ + (AL Lit)^σ]^(1/σ) e^(uit)    (1.12)

Elasticity of substitution = ρ = 1/(1 − σ)    (1.13)

Equating the relative factor price to the ratio of marginal products gives:

w/r = (AL^σ L^(σ−1))/(AK^σ K^(σ−1)) = (AL/AK)^((ρ−1)/ρ) (L/K)^(−1/ρ).    (1.14)
Table 1.2 Stata code to create Hall and Jones measure of human capital

/*The following creates the Hall and Jones imposed measure of the effect of human capital on labour productivity*/
gen e=0.134*hjschool if hjschool<=4
replace e=0.134*4+0.101*(hjschool-4) if hjschool>4 & hjschool<=8
replace e=0.134*4+0.101*4+0.068*(hjschool-8) if hjschool>8
gen hjh=(exp(e))*hjl
gen hjhl=(exp(e))
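The same piecewise schedule can be written outside Stata. A Python sketch (ours), using the rates Hall and Jones (1999) imposed: a 13.4 per cent return for the first four years of schooling, 10.1 per cent for the next four and 6.8 per cent thereafter:

```python
import math

def hj_phi(years):
    """Hall and Jones's piecewise-linear phi(E): returns of 13.4%, 10.1%
    and 6.8% for years 0-4, 4-8 and 8+ of schooling respectively."""
    e = 0.134 * min(years, 4)
    if years > 4:
        e += 0.101 * (min(years, 8) - 4)
    if years > 8:
        e += 0.068 * (years - 8)
    return e

# H/L = exp(phi(E)): efficiency of a worker relative to one with no schooling
print(round(math.exp(hj_phi(4)), 2))   # 1.71
print(round(math.exp(hj_phi(12)), 2))  # 3.36
```

A worker with 12 years of schooling thus counts as roughly three and a third units of raw labour under this imposed schedule.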
We see that we now have additional degrees of freedom. The Cobb–Douglas form assumes that the elasticity of substitution is unity. What if it is not: how might that affect our interpretation of differences in GDP across the world? How might we be able to use the additional flexibility? We return to those questions in Chapter 13, the final chapter of Part 1.
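To see what the extra parameter buys, note from equation (1.14) that ln(w/r) moves one-for-one with (1/ρ)ln(K/L), so the response of the capital–labour ratio to relative factor prices is exactly ρ. A quick numerical check (our illustration; the values of AK, AL and σ are arbitrary):

```python
import math

def rel_factor_price(K, L, sigma, AK=1.2, AL=0.8):
    # w/r implied by the CES first-order condition (equation (1.14))
    return (AL**sigma * L**(sigma - 1)) / (AK**sigma * K**(sigma - 1))

sigma = 0.5                      # so rho = 1/(1 - sigma) = 2
p1 = rel_factor_price(2.0, 1.0, sigma)
p2 = rel_factor_price(4.0, 1.0, sigma)

# Elasticity of substitution: percentage change in K/L per percentage
# change in w/r, holding technology fixed
rho = (math.log(4.0) - math.log(2.0)) / (math.log(p2) - math.log(p1))
print(round(rho, 6))  # 2.0, i.e. 1/(1 - sigma)
```

Under Cobb–Douglas this elasticity is forced to be one; the CES form lets the data speak on it.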
1.4 A model with human capital

So far, we have been concerned with only two factors, capital and labour. The equation can be extended to include human capital, and a form that has been extensively used is:

Vit = Kit^α (Ait Hit)^(1−α) e^(uit).
(1.15)
To link this equation with data we need some means for measuring human capital. The following specification is frequently used:

Hit = e^(φ(Eit)) Lit,
(1.16)
where Eit is the number of years of education of workers in the labour force. You can use the relevant part of the Stata code (given in Table 1.2) to create the variable H/L, which is called hjhl in the Stata program. Equation (1.16) enables us to link empirically the wages paid to labourers with different levels of education. If we define wL as the wage of a labourer and wH as the price of human capital, then we have

wH Hit = wH e^(φ(Eit)) Lit = wL(it) Lit
(1.17)
and we can write

log wL(it) = log wH + φ(Eit),
(1.18)
where log wH is a constant and wL(it) tells us the wage of a labourer with level of education Eit. This is a semi-logarithmic equation and is in structure the basis for estimating Mincerian earnings functions. In empirical work, the function φ is usually written in a nonlinear form as:
φ(Eit ) = δ 0 + δ1Eit + δ 2 Eit2 + vit ,
(1.19)
so we have

log wL(it) = δ0 + δ1 Eit + δ2 Eit² + vit.
(1.20)
This is the Mincerian Earnings Function. The Mincerian return to education is defined as φ′(Eit) and with the specification chosen is given by (δ1 + 2·δ2·Eit). Note that if the function is linear then the Mincerian return to education is a constant and does not depend on the level of education. Armed with this way of mapping education (which we can observe) into human capital (which we cannot), we can derive an equation we can take to the data for modelling production:

log Vit = α log Kit + (1 − α) log Ait + (1 − α)φ(Eit) + (1 − α) log Lit + uit,
(1.21)
we can rearrange this equation into per capita terms to give:

log(Vit/Lit) = α log(Kit/Lit) + (1 − α) log Ait + (1 − α)φ(Eit) + uit.
(1.22)
This is the human capital augmented production function. In this specification, the function φ(E) reflects the efficiency of a unit of labour with E years of schooling relative to one with no schooling (φ(0) = 0). Note that if φ(E) = 0 for all E this is the standard production function with undifferentiated labour. Substituting the quadratic form of φ gives:

log(Vit/Lit) = α log(Kit/Lit) + (1 − α) log Ait + (1 − α)(δ0 + δ1 Eit + δ2 Eit²) + uit.
(1.23)
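With the quadratic specification the Mincerian return φ′(E) = δ1 + 2δ2E changes linearly with schooling. A small sketch with hypothetical coefficients (δ1 = 0.10 and δ2 = −0.002 are invented for illustration, not estimates):

```python
delta1, delta2 = 0.10, -0.002   # hypothetical values, not estimates

def mincer_return(E):
    # phi'(E) = delta1 + 2*delta2*E: the return to one more year of schooling
    return delta1 + 2 * delta2 * E

print(round(mincer_return(5), 3))   # 0.08: roughly 8 per cent at 5 years
print(round(mincer_return(15), 3))  # 0.04: the return falls when delta2 < 0
```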
Both the Mincerian earnings function and the human capital augmented production function can be interpreted as examples of technical relationships. The Mincerian earnings function links wages to skills while the production function is, in principle, simply a description of the technology that shows how inputs determine outputs. How can we link the specification of the production function to the earnings function? Differentiating the production function (equation (1.22)) with respect to Eit we note that:

[∂(Vit/Lit)/∂Eit] / (Vit/Lit) = (1 − α) ∂φ/∂Eit = (1 − α)φ′(Eit) = (1 − α) [∂wit/∂Eit] / wit.
(1.24)
This expression makes explicit that, in the model we have used so far, the only reason why labour is paid more for more education is that it increases the productivity of labour to the firm. Such a model is testable and we will come to data that enables it to be tested in future chapters. However, before that we want to know the answer to the question implicit in Figure 1.2 and explicit in the title of the Hall and Jones (1999) paper – why do some countries produce so much more output per worker than others? You already have the elements you need to replicate their answer to that question. That is the focus of the exercise for this chapter.
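Equation (1.24) says the proportional effect of schooling on output per worker is (1 − α) times the Mincerian return. That can be checked numerically on equation (1.23); all parameter values below (α, the δs, log K/L and log A) are invented for illustration:

```python
alpha = 1 / 3                      # illustrative capital share
d0, d1, d2 = 0.0, 0.10, -0.002     # hypothetical quadratic phi(E)

def phi(E):
    return d0 + d1 * E + d2 * E ** 2

def log_output_per_worker(E, log_k=9.0, log_A=1.0):
    # equation (1.23) with invented values for log(K/L) and log A
    return alpha * log_k + (1 - alpha) * log_A + (1 - alpha) * phi(E)

# Finite-difference slope at E = 10 vs the analytic (1 - alpha) * phi'(E)
E, h = 10.0, 1e-6
numeric = (log_output_per_worker(E + h) - log_output_per_worker(E - h)) / (2 * h)
analytic = (1 - alpha) * (d1 + 2 * d2 * E)
print(round(numeric, 6), round(analytic, 6))  # 0.04 0.04
```

The numerical derivative of log output per worker matches (1 − α)φ′(E), as the algebra requires.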
1.5 Data and models

1.5.1 The macro GDP data

In this chapter we introduced data from the Penn World Tables. The basis of these tables is work carried out by the International Comparison Project (ICP) which began in the late 1960s. Deaton and Heston (2010) provide an overview of the history of this project, give an account of the changes that were made during the 2005 round of price comparisons and discuss both the conceptual and practical issues that arise in making GDP numbers comparable across countries with very different structures of consumption and relative prices. The objective of the ICP was to enable comparisons to be made across countries, as it was widely recognised that using actual exchange rates to make such comparisons was misleading. A dollar buys very different amounts of services in poor countries than it does in rich ones. The project devised a set of prices in ‘International $s’ whose first output was for ten countries for 1970 and for six of them for 1967 as well (see Kravis et al. (1975) for these initial results). Hall and Jones (1999) used version 5.6 of these Penn World Tables, a development of version 5 set out in Summers and Heston (1991). Most of the work we cite in this book, and the data we ourselves present, uses version 6.1 or 6.3 of these tables. Version 6.1 covers the period from 1950 to 2000 and Version 6.3 extends this data to 2007. These tables do not include the data from the latest ICP round for 2005. Versions 7.0, and later, of the Penn World Tables do incorporate this data, and have adjusted the past data to provide a coherent set of numbers as far back as 1950. Deaton and Heston (2010: 14) provide the following introduction to what these tables seek to do:

To illustrate from the final set of global calculations, ‘rice’ is one basic heading in the consumption account. Some country parities for rice from the 2005 round are 4,304 Vietnamese dongs per dollar, 0.65 British pounds per dollar, or 44.6 Kenyan shillings per dollar.
If rice were the only component of consumption (or GDP), these would be the PPP exchange rates for those countries relative to the United States; in fact, the actual consumption (GDP) PPPs for those countries are 5,920 (4,713) Vietnam, 0.66 (0.65) United Kingdom, and 32.7 (29.5) for Kenya. Clearly, knowledge of the price of one good, or at least one group of goods, takes us some way, which is why the Economist’s Big Mac Index is useful. Of course, relative prices differ greatly from one country to another, which is why the Big Mac Index is far from sufficient (or safe), and the ICP tries to do better by covering all the expenditures in GDP.

In the versions of the Tables before 7.0, creating an international dollar measure of GDP meant using a world price for each good so that each item of GDP is re-priced at the world price. As Deaton and Heston (2010) explain, weights need to be applied to create price indices and the price index can be very sensitive to the weights used even though the same price is used for all countries. An important point stressed in the Deaton and Heston (2010) survey is that the ICP project is exclusively concerned with comparisons across countries and the time series come from national accounts data. They argue that any use of the annual data from the Penn Tables to measure changes over time will be unreliable.
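The quoted parities make the point concrete: dividing each country's rice parity by its full consumption PPP shows how far a one-good 'exchange rate' can stray (a sketch using only the numbers quoted above):

```python
# Rice parities vs full consumption PPPs from the 2005 ICP round,
# as quoted from Deaton and Heston (2010): local currency per US dollar
rice_parity = {"Vietnam": 4304, "United Kingdom": 0.65, "Kenya": 44.6}
consumption_ppp = {"Vietnam": 5920, "United Kingdom": 0.66, "Kenya": 32.7}

for country in rice_parity:
    ratio = rice_parity[country] / consumption_ppp[country]
    print(country, round(ratio, 2))
# Vietnam 0.73, United Kingdom 0.98, Kenya 1.36: rice alone is close for
# the UK but badly misstates the Vietnamese and Kenyan price levels
```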
1.5.2 Interpreting the data

In interpreting any economic data, a model is essential. Even if you do not write one down, it is certain that if you seek to interpret data then you are using some implicit model which is predicting some relationships between the variables. In this chapter we have sought to show you that even in the context of a simple bivariate model it seems possible to establish that there is a clear relationship in the data between the aggregate capital stock per person and labour productivity at the level of the macroeconomy. The model that underlies this is a production function which can be seen as a basic technological building block for other models – it simply says that inputs and outputs are related. In this chapter we have shown in outline form what you are going to be asked to do in this book. You are going to be asked to think about how the development question of interest can be formulated in a way that enables it to be tested or assessed. In testing any model you need to think about the type of data you have and whether the question you have posed can be answered by that type of data. In the next chapter we consider versions of bivariate models of the earnings function. The first is a semi-logarithmic bivariate version of equation (1.20):

log wL(it) = δ0 + δ1 Eit + vit.

In the exposition we have given thus far, we have treated education as a continuous variable. Whether or not education can be so treated is an important issue both in terms of how the model can be interpreted and in how it can be estimated. We will compare the continuous version Eit with an educational variable that is binary, which is whether or not the individual has completed some given level of education, say, ten years of school.
We will be comparing both these specifications with a double logarithmic form:

log wL(it) = δ0 + δ1 log Eit + vit.

These two equations look rather similar, but as we will see they are not, and one functional form will be clearly rejected by the data (at least using OLS). To test any of the models discussed in this chapter, we need methods of statistical inference that allow us to infer aspects of the population from information about a sample. We turn to that task in the next chapters.
References

Blanchflower, D. and Oswald, A. (1995) The Wage Curve, MIT Press, Cambridge.
Deaton, A. and Heston, A. (2010) ‘Understanding PPPs and PPP-based national accounts’, American Economic Journal: Macroeconomics, 2(4): 1–35.
Fields, G. S. (1975) ‘Rural-urban migration, urban unemployment and underemployment, and job-search activity in LDCs’, Journal of Development Economics, 2: 165–87.
Hall, R. E. and Jones, C. I. (1999) ‘Why do some countries produce so much more output per worker than others?’ Quarterly Journal of Economics, 114(1): 83–116.
Harris, J. R. and Todaro, M. P. (1970) ‘Migration, unemployment and development: A two-sector analysis’, American Economic Review, 60: 126–42.
Kingdon, G. and Knight, J. (2006) ‘How flexible are wages in response to local unemployment in South Africa?’ Industrial and Labor Relations Review, 59(3): 471–95.
Kravis, I. B., Kenessey, Z., Heston, A. and Summers, R. (1975) A System of International Comparisons of Gross Product and Purchasing Power, published for the World Bank by the Johns Hopkins Press, Baltimore and London. Summers, R. and Heston, A. (1991) ‘The Penn World Table (Mark 5): An expanded set of international comparisons, 1950–1988’, Quarterly Journal of Economics, 106(2): 327–68.
Exercise

1. What is meant by saying that total factor productivity (TFP) differs across countries?
2. What is the basis for the argument in Hall and Jones that to explain differences in income across countries we need to understand differences in their TFP?
3. Discuss briefly whether the Hall and Jones decomposition can tell you the causes of income differences across countries.
4. The Harris and Todaro model directs attention to the existence of different sectors within economies paying different wages. Does this mean that the single-sector model used by Hall and Jones must be misspecified?
5. How does your answer to the last two questions link to understanding the causes of poverty?
The Stata data file ‘hjones’ has the data underlying the Hall and Jones (1999) paper which is used in this chapter. You can use the Stata file ‘Hall and Jones_1.do’ to replicate the finding reported in this chapter. You can also use and adapt this program to answer the questions above.
Section I
Cross-section data and the determinants of incomes
2
The linear regression model and the OLS estimator
2.1 Introduction: models and causality

In this chapter we show how the ordinary least squares (OLS) estimator can be used to provide estimates of the parameters of the models of interest. OLS is the most commonly used statistical method in applied economics, and it can be used for analysing a wide range of questions in development. In Chapter 1 we met several models and presented estimates of the parameters of the models, although we were not explicit as to the source of those parameter estimates. Two of the models we met in Chapter 1 were the basic Mincerian earnings function

log wL(i) = β0 + β1 Ei + ui

and the Cobb–Douglas production function with homogeneous labour

log(Vi/Li) = α log(Ki/Li) + (1 − α) log A + ui.
In this chapter we focus on the Mincerian earnings function. In Chapter 3 we return to the production function. In Chapter 14 we see that the interpretation of the Mincerian earnings function is rather more complex than it appears to be in this chapter. However, even in its simplest form it remains a fundamental building block for understanding the possible determinants of income. In Chapter 1 we implicitly treated these models as telling us that in the first one there was a causal relationship from education to earnings and in the second one a causal relationship from capital, both physical and human, to labour productivity. Chapter 1 produced plenty of evidence, both from the regressions and the graphs, that these variables were positively correlated. But did our empirical results establish any causality from education onto earnings or from capital intensity to labour productivity? In order to answer that question we need to be clear about what we mean by causality, and few concepts in the social sciences have had a wider range of meanings attached to them. By a causal relationship we mean that in changing the amount of education, while holding all other factors constant, the effect will be a change in earnings and, in the case of the production function, that changing the amount of capital in a firm or country will change labour productivity measured as output per worker. With that definition in mind, in the sections that follow we spell out the conditions under which our
empirical results can be interpreted as evidence of a causal effect of education on earnings, and of the capital–labour ratio on labour productivity. We begin Section 2.2 by setting out the linear regression model and showing how the method of OLS can be used to obtain parameter estimates for the simplest version of this model, which has only one explanatory variable. In Section 2.3 we present the earnings function for our South African data. The assumptions that need to be made for the OLS estimator to be unbiased and to have minimum variance are discussed in Section 2.4. The final section returns to the discussion of the possible causal link between education and earnings.
2.2 The linear regression model and the OLS estimators

2.2.1 The linear regression model as a population model

The models of earnings and production presented in the introduction can be written in a general form as:

y = β0 + β1x + u,
(2.1)
where y is the dependent variable, x is the independent variable, or the explanatory variable, and u is a residual, or an error term. This model is known as a simple linear regression model. It is linear in the parameters β0 and β1 in the sense that the right-hand side of equation (2.1) is written as a sum of terms in which β0 and β1 enter separately and linearly: the parameter β1 is multiplied by the explanatory variable x, while the intercept β0 is (implicitly) multiplied by the constant unity (for this reason β0 is often referred to simply as the constant). Our objective is to make inferences from a random sample to the unobserved (indeed unobservable) nature of the data in the population. If our inferences are good ones then we should expect that if we draw other samples from the population we will obtain, in a sense we make more precise below, similar results as to the nature of the population. The micro data we use in this and the following chapters relate to the earnings and education of South Africans for 1993. In this chapter the population of interest can be thought of as that of all individuals who work for pay in South Africa, and our interest is to infer from this 1993 sample how earnings and education are related. In order to do that we need to use that sample to provide us with estimates of the parameters β0 and β1. These are unknown and estimating them is a central objective in econometric analysis. Now our discussion above as to the meaning of the term causality requires that we state clearly why we think x causes y rather than y causing x. Our application is the simplest Mincerian earnings function and it might seem rather obvious that causality is running from education to earnings; after all, for virtually everyone in the sample, education preceded earnings over the lifetime of the individual. How could it possibly be true that wages themselves determine education?
Later in this book, in Chapter 19, we present a model in which future expectations of how education affects earnings lead to the greater accumulation of human capital. If expected wages in the past are realised in actual wages now, then our regression could reflect causality running from wages to education. For the moment we put this issue of reverse causation to one side – a sufficient basis for doing so is to assume that all households cannot respond to any expected future increase in wages. In this chapter we assume that causality runs from education to earnings.
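A causal reading can also fail because of factors omitted from the model, an issue taken up in Section 2.2.2. The following simulation (entirely invented numbers, ours) generates data in which education has no causal effect on log wages, yet an unobserved factor raises both schooling and wages; the OLS slope is nevertheless clearly positive:

```python
import random

random.seed(1)
n = 10_000
data = []
for _ in range(n):
    ability = random.gauss(0, 1)                          # unobserved
    educ = 8 + 2 * ability + random.gauss(0, 1)           # able people school longer
    logwage = 1.0 + 0.5 * ability + random.gauss(0, 0.5)  # no causal role for educ
    data.append((educ, logwage))

# OLS slope = sample covariance / sample variance
mx = sum(e for e, _ in data) / n
my = sum(w for _, w in data) / n
cov = sum((e - mx) * (w - my) for e, w in data) / n
var = sum((e - mx) ** 2 for e, _ in data) / n
print(round(cov / var, 2))  # close to 0.2, despite a true causal effect of 0
```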
One way – and we are going to argue it is a productive way – of thinking about equation (2.1) is to think of the equation as one where there are two elements in the determination of wages. The first is the observable factors, in this application education, and the second is a whole range of unobservables. The u term represents all these unobservable factors which, together with the observables, determine wages. As we see in Chapter 3, it is very easy to add variables capturing aspects that we can observe, although we need to think carefully as to whether the variables are themselves determined by education. The problem for any empirical investigation is that all the factors we cannot observe are in the error term u. In terms of understanding how education affects earnings, the parameter in the simple regression model in which we are interested is β1. This parameter, which is sometimes called the slope parameter, determines the change in y resulting from a given change in x, holding all other determinants of y constant. As all other determinants of y are captured in our model by the residual u, those being constant implies ∆u = 0, and

∆y = β1 ∆x.
(2.2)
This is the causal effect of x on y in our model. Clearly, it follows that

β1 = ∆y / ∆x,
that is, the parameter β1 is interpretable as the relative change in y resulting from a change in x, all other determinants of y held constant. We can always think of β1 as indicating the quantitative change in y resulting from an increase in x by (exactly) one unit. The other parameter in equation (2.1) is β0, the constant term, which tells us the value of y if β1x + u = 0. The constant term is rarely a parameter of interest in applied work. Consider the simple earnings model in which the log of earnings is determined by years of education. Suppose for a moment that β0 and β1 were known. Perhaps β0 = 5 and β1 = 0.05. What would that tell us about how education and earnings are related? Our earnings function would be:

log wL(i) = 5 + 0.05educi + ui.

The 0.05 coefficient on educi thus implies that one additional year of schooling raises log earnings by 0.05 (which is approximately equal to a 5 per cent increase in earnings; the exact increase is exp(0.05) − 1 ≈ 5.1 per cent), ceteris paribus (∆u = 0). More generally, for any given change in the years of schooling (∆educ), ∆log wL(i) = 0.05∆educi, provided all other factors determining earnings remain unchanged (∆u = 0).

2.2.2 The zero conditional mean assumption

You see how, once we have a view as to what a plausible value of β1 might be, we can say something very specific about the size of the effect of changing x on y. Since the model parameters β0 and β1 are unobserved, we need to find a good method for estimating
them. The first thing we need to do is draw a random sample of observations from the population. This is our data set. In the data set there is information on the variables y and x (these are our ‘observables’) but we do not have data on u (which is ‘unobservable’). Intuition thus suggests that, given the data available, a good estimation method – from now on, estimator – would be one that exploits information on how x and y vary and co-vary in the sample, but does not require any information on u. A very important stepping-stone towards such an estimator is the insight that if we make an assumption about the relationship between the unobserved residual u and the observed variable x, it will be possible to express the population parameters β0 and β1 in terms of distributions of the observable variables y and x. The key assumption is that the expected value of u, conditional on x, is equal to the unconditional expected value of u: E(u|x) = E(u). This says that, for any given value of x, the expected value of the unobservable u is the same and therefore must equal the expected value of u in the population. We shall also assume that E(u) = 0, hence it follows that

E(u|x) = 0.

This equation is often referred to as the zero conditional mean assumption for the residual u (for example, Wooldridge, 2013). It is the basis for the definition of the OLS estimator, and it is also a key assumption when we want to show that the OLS estimator is unbiased. Note that the zero conditional mean assumption implies that u and x are uncorrelated in the population. If u and x are in fact correlated in the population, the zero conditional mean assumption fails. We now use the zero conditional mean assumption to do two things. First, we can express the population parameter β1 as a function of population moments in y and x.
We begin by writing the covariance between y and x as follows:

Cov(y, x) = Cov((β0 + β1x + u), x)
Cov(y, x) = 0 + β1 Var(x) + Cov(u, x),    (2.3)

where we have used the fact that Cov(x, x) = Var(x), and the 0 on the right-hand side in the second row is obtained because β0 is a constant (the covariance between a variable and a constant is always zero). Since the zero conditional mean assumption E(u|x) = 0 implies Cov(u, x) = 0, we can write the population parameter β1 as the covariance between y and x divided by the variance of x:

β1 = Cov(y, x) / Var(x).    (2.4)
It is important to understand at this point that Cov(y, x) and Var(x) are population moments which are not themselves observable; so we cannot simply use equation (2.4) to find out the value of β1. However, Cov(y, x) and Var(x) can be estimated using a sample of observations on y and x. As we shall see below, this is precisely the logic underlying the OLS estimator. Our second use of the zero conditional mean assumption is that it enables us to write the expected value of y conditional on x in ways that shed further light on the interpretation of the parameters β0 and β1. Taking the expected value of equation (2.1) conditional on x and using E(u|x) = 0 we obtain
Figure 2.1 An earnings function for South Africa (1993): log of hourly wages (weighted by prices) plotted against years of education, together with the population regression function E(log(wages)|Educ) = β0 + β1Educ. Source: SALDRU data. The PRF is the necessarily fictitious population regression function.
E ( y | x ) = β0 + β1x.
(2.5)
Equation (2.5), which is sometimes referred to as the population regression function (PRF), shows that E(y|x) is a linear function of x. The linearity means that a one-unit increase in x changes the expected value of y by the amount β1. Furthermore, the interpretation of β0 now becomes a little clearer: it is the expected value of y given x = 0. The function E(y|x) is sometimes referred to as a Conditional Expectation Function (CEF); see for example Angrist and Pischke (2009). The reason for the name is clear: the CEF tells you the expected value of the dependent variable y associated with a particular value of the explanatory variable x. Note that the functional form of E(y|x) in equation (2.5), that is, β0 + β1x, follows from how we have specified the model of y. Of course, a different model for y would imply a different CEF; for example, had we for some reason specified the population model as y = β0 + β1(1/x) + u, the associated CEF would be E(y|x) = β0 + β1(1/x). We can sometimes obtain a CEF without making a functional form assumption altogether. The simplest example is when x can only take two values, 0 or 1 (that is, x is a dummy variable), in which case there are only two conditional expectations of y, E(y|x = 0) and E(y|x = 1). In Chapter 12 we introduce the programme evaluation approach to development policy which is based on a CEF of this form. In the context of our earnings model, the zero conditional mean assumption amounts to assuming that E(ui|educi) = E(ui) = 0 which, if true, would mean that E(log wL(i)|educi) = β0 + β1educi. Now think about what would happen if the assumption that E(ui|educi) = E(ui) = 0 were not true. Consider Figure 2.1. The filled circles
in the graph are actual observations of earnings (measured in natural logarithms) and years of education in our South African sample. It is obvious from the graph that there is a clear positive relationship between wages and education in the data. Why might this not tell us that education has a positive causal effect on wages? Think about a possible variable which determines earnings and which our model fails to incorporate. For example, consider innate ability, that is, the skills a person has that were not acquired through education. In this context, the assumption that E(ui|educi) = 0 means E(iability|educ = 5) = E(iability|educ = 15), where iability denotes innate ability. In words, the expected innate ability of an individual is the same at 5 years of education as at 15. However, if individuals with a higher level of innate ability tend to invest more in education, this assumption is not true. In that case, we cannot infer any causal relationship between education and wages from the observed positive association in the data. That possibility is illustrated in Figure 2.2. Suppose education actually has no causal effect on earnings, so that the true (unobserved) β1 is equal to zero. In such a case, the true PRF is represented by a horizontal line in the graph; and the positive relationship between earnings and education in the data reflects not a causal impact of education on wages, but the fact that people with more education tend to have a higher innate ability. How you can estimate the causal effect of some explanatory variable on the outcome of interest whilst allowing for the possibility that the residual is correlated with the explanatory variable is a central theme of this book.

2.2.3 The OLS estimator

We now proceed to show how we can use a sample randomly drawn from a population to obtain estimates of the parameters β0 and β1.
The OLS estimates of β0 and β1 are denoted by β̂0 and β̂1, and the OLS residual for individual i is defined as

ûi = yi − β̂0 − β̂1xi.
(2.6)
By using the i subscript, we are explicit that our model draws on cross-section data. The OLS estimates β̂0 and β̂1 are those that minimise the sum of squared residuals across the sample observations:

∑_{i=1}^{n} ûi² = ∑_{i=1}^{n} (yi − β̂0 − β̂1xi)².
In other words, all alternative estimates of the population parameters β0 and β1 would result in a higher sum of squared residuals than that obtained based on the OLS estimates. Based on the OLS estimates we can obtain the sample regression function (SRF):

ŷ = β̂0 + β̂1x,
(2.7)
which needs to be kept very clearly separate in our minds from the PRF. Because the OLS estimates βˆ 0 and βˆ1 minimise the sum of the squared residuals, the following two linear equations must hold:
∑_{i=1}^{n} (yi − β̂0 − β̂1xi) = 0    (2.8)

∑_{i=1}^{n} xi (yi − β̂0 − β̂1xi) = 0.    (2.9)
These are the OLS first-order conditions which define our estimator. The first property of the OLS estimator is that the residuals from the regression model sum to zero. This follows from equation (2.8):

∑_{i=1}^{n} (yi − β̂0 − β̂1xi) = ∑_{i=1}^{n} ûi = 0.    (2.10)

The second property of the OLS estimator is that the xi are uncorrelated with the regression residuals ûi, again by construction. This follows from equation (2.9):

∑_{i=1}^{n} xi (yi − β̂0 − β̂1xi) = ∑_{i=1}^{n} xi ûi = 0.    (2.11)
We can now derive the formulae for β̂0 and β̂1. Equation (2.10) can be rewritten as:

ȳ = β̂0 + β̂1x̄,
(2.12)
where

ȳ = n⁻¹ ∑_{i=1}^{n} yi

is the sample average for y and likewise for x. So we can write:
β̂0 = ȳ − β̂1x̄.
(2.13)
Substituting equation (2.13) into equation (2.9), we can write:

∑_{i=1}^{n} [xi (yi − (ȳ − β̂1x̄) − β̂1xi)] = 0,    (2.14)

which can be rearranged to read:

∑_{i=1}^{n} xi (yi − ȳ) = β̂1 ∑_{i=1}^{n} xi (xi − x̄).
(2.15)
From the basic properties of the summation operator we have:

∑_{i=1}^{n} xi (xi − x̄) = ∑_{i=1}^{n} (xi − x̄)²    (2.16)

and

∑_{i=1}^{n} xi (yi − ȳ) = ∑_{i=1}^{n} (xi − x̄)(yi − ȳ).
Assuming that our explanatory variable varies across observations in the sample, which implies:

∑_{i=1}^{n} (xi − x̄)² > 0,    (2.17)

it follows that the estimated slope coefficient is

β̂1 = [∑_{i=1}^{n} (xi − x̄)(yi − ȳ)] / [∑_{i=1}^{n} (xi − x̄)²].    (2.18)
It is straightforward to verify that equation (2.18) is equal to the sample covariance between x and y divided by the sample variance of x. In other words, equation (2.18) is the sample analogue of equation (2.4), which expressed the population parameter β1 as equal to the population covariance divided by the population variance.
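Equations (2.13) and (2.18) translate directly into code. A sketch (ours), with made-up data lying exactly on a line so the estimator's output can be verified by eye:

```python
def ols(x, y):
    """Simple-regression OLS: slope from equation (2.18), intercept from (2.13)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
             / sum((xi - xbar) ** 2 for xi in x))
    return ybar - slope * xbar, slope

x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]      # exactly y = 1 + 2x
b0, b1 = ols(x, y)
print(b0, b1)                 # 1.0 2.0

# First-order condition (2.10): the residuals sum to zero
resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
print(round(sum(resid), 10))  # 0.0
```

The slope computed this way is the sample covariance of x and y divided by the sample variance of x, exactly as the text says.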
2.3 The Mincerian earnings function for the South African data

In this section we present the results for a Mincerian earnings function for the South African data. In Table 2.1 we show how the results of the OLS estimation procedure can be obtained from Stata. In Figure 2.2 we show how the actual values of the y_i link to their predicted values. The straight line is the predicted value of the log of earnings for given years of education. This is simply the SRF defined earlier: log wphy_i = β̂0 + β̂1 educ_i, where β̂0 = 0.458 and β̂1 = 0.135. If you were simply looking at the data, you might want to say that between zero and nine years there was no change in earnings but a steep rise after that. However, we have specified a model which imposes a linear relationship between log earnings and years of education. As we will show, you can, within the confines of the model linear in parameters, allow for this nonlinear pattern, but before we do that let us pursue further the interpretation of this linear regression. We thus have a very specific example of equation (2.7):

log wphy_i = 0.46 + 0.14 educ_i.

The coefficient on education in this regression is often described as the Mincerian return on education, as it tells us (approximately) the percentage change in earnings for an increase of one year of education. As we discuss in Chapter 14, it is not possible to interpret this parameter as a guide to the desirability of educational investments, as this requires knowledge of the marginal, not average, rate of return. What the regression
Table 2.1 The regression model using data set ‘Labour_Force_SA_SALDRU_1993’

. reg logwphy educ

      Source |       SS       df       MS              Number of obs =    6968
-------------+------------------------------           F(  1,  6966) = 2680.34
       Model |  2368.42412     1  2368.42412           Prob > F      =  0.0000
    Residual |  6155.34858  6966  .883627416           R-squared     =  0.2779
-------------+------------------------------           Adj R-squared =  0.2778
       Total |  8523.77271  6967  1.22344951           Root MSE      =  .94001

------------------------------------------------------------------------------
     logwphy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1353827    .002615    51.77   0.000     .1302565    .1405088
       _cons |   .4581331   .0238719    19.19   0.000     .4113368    .5049294
------------------------------------------------------------------------------
[Figure 2.2 Predicted and actual earnings for South Africa (1993): log of hourly wages (weighted by prices) plotted against years of education, with the linear prediction overlaid. Source: SALDRU data]
does show is that the impact of education on earnings appears to be very large. Reading off the points on the regression line, the natural log of earnings for those with zero years of education is 0.46 while that for 16 years is 0.46 + 0.14 × 16 = 2.7. That translates into a more than nine-fold increase in earnings, as exp(0.46) = 1.58 and exp(0.46 + 0.14 × 16) = 14.88 rand per hour. The number 1.58 rand per hour converts to about 50 US cents per hour in 1993 prices. It appears that education makes an enormous difference to expected earnings. So far, we have simply interpreted the coefficients from the Stata output. We have said nothing about how good our model is at explaining earnings with education. In fact the estimated R² reported in our Stata output is a measure of how close our actual observations of y tend to be to the predicted values ŷ_i. The estimate of R² is based on
the total sum of squares (SST), the explained sum of squares (SSE) and the residual sum of squares (SSR) as follows:

Total sum of squares: SST = \sum_{i=1}^{n} (y_i - \bar{y})^2

Explained sum of squares: SSE = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2

Residual sum of squares: SSR = \sum_{i=1}^{n} \hat{u}_i^2

It can readily be shown that:

SST = SSE + SSR    (2.19)

and we define the proportion of the variation explained as:

R^2 = SSE/SST = 1 - SSR/SST.    (2.20)
Since R² is the ratio of the explained variation to the total variation, it is interpretable as the fraction of the sample variation in y that is explained by x. The intuition is clear. If our equation can perfectly predict earnings, we will get an R² of unity; if we can explain nothing, so that the best we can do to predict y is its mean, then we will get a value of zero for R².
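The decomposition SST = SSE + SSR, and the two equivalent expressions for R² in equation (2.20), can be checked directly. A sketch (NumPy, simulated data, not the SALDRU sample):

```python
import numpy as np

# Illustrative data (hypothetical, not the SALDRU sample)
rng = np.random.default_rng(1)
x = rng.integers(0, 16, size=500).astype(float)
y = 0.46 + 0.14 * x + rng.normal(0, 0.9, size=500)

# OLS fit, as in equations (2.13) and (2.18)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
u_hat = y - y_hat

SST = np.sum((y - y.mean()) ** 2)      # total sum of squares
SSE = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
SSR = np.sum(u_hat ** 2)               # residual sum of squares

R2 = SSE / SST                         # equation (2.20)
print(R2, SST - (SSE + SSR))           # second number is numerically zero, eq. (2.19)
```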
2.4 Properties of the OLS estimators

2.4.1 The assumptions for OLS to be unbiased

Why should we prefer specifically the OLS estimator to alternative estimators which could be suggested? One reason is that the OLS estimator can be shown to be unbiased if we make four assumptions. By an unbiased estimator we mean one that, averaged over an (infinitely) large number of repeated trials, equals the true value of the parameter:

E(\hat{\beta}_0) = \beta_0
E(\hat{\beta}_1) = \beta_1

The assumptions under which the OLS estimator is unbiased have already been made in setting up the model. They are as follows:

(A1) The model is linear in parameters.
(A2) We have a random sample from the population of interest.
(A3) There is sample variation in the explanatory variable.
(A4) The conditional mean of the residual is zero, E(u | x) = 0.

Given the formula for the OLS estimator, it is readily shown that:

\hat{\beta}_1 = \beta_1 + \frac{\sum_{i=1}^{n} (x_i - \bar{x}) u_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.
Hence,

E(\hat{\beta}_1) = \beta_1 + E\left[ \frac{\sum_{i=1}^{n} (x_i - \bar{x}) u_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right],

which reduces to E(\hat{\beta}_1) = \beta_1 since E(u_i | x_i) = 0 for all i = 1, …, n. We do not provide a formal proof here as one can be found in any introductory econometrics textbook; a particularly clear exposition can be found in Wooldridge (2013: 46–47). We also omit the proof that β̂0 is an unbiased estimator of the constant term β0.

2.4.2 The assumptions for OLS to be minimum variance

Clearly, that an estimator is unbiased is a desirable property. However, it is not the only aspect in which we are interested. We need also to consider its variance, which is a measure of the dispersion of the estimator around the true parameter value. Ceteris paribus, we clearly prefer estimators with smaller variances than any other estimators that may be available; estimators with lower variance are, in a sense, making better use of the information in the data. Let us start by assuming that the error term u has the same variance given any value of the explanatory variable, which we write formally as:

(A5) Var(u | x) = \sigma^2.
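What unbiasedness means in practice can be illustrated with a small Monte Carlo exercise: across repeated random samples that satisfy (A1)–(A4), the OLS slope estimates average out at the true parameter. The sketch below uses hypothetical parameter values (chosen to echo the earnings regression, not taken from the SALDRU data):

```python
import numpy as np

rng = np.random.default_rng(2)
beta0_true, beta1_true = 0.46, 0.14   # hypothetical 'population' parameters

estimates = []
for _ in range(2000):                  # repeated random samples (A2)
    x = rng.integers(0, 16, size=100).astype(float)
    u = rng.normal(0, 0.9, size=100)   # E(u | x) = 0 by construction (A4)
    y = beta0_true + beta1_true * x + u
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    estimates.append(b1)

# Any single estimate differs from 0.14, but the average across trials is very close
print(np.mean(estimates))
```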
This is known as the homoskedasticity assumption: the variance is a constant. When Var(u | x) varies across observations, the error term is said to exhibit heteroskedasticity (non-constant variance). Provided assumptions A1–A5 hold, it can be shown that the OLS estimator is the best linear unbiased estimator (BLUE). OLS is a linear estimator, because it can be expressed as a linear function of the dependent variable y. It is unbiased under assumptions A1–A4, as already discussed. And it is ‘best’, meaning it has the lowest variance, if we also assume homoskedasticity (A5). With the homoskedasticity assumption in place, it is possible to express the variances of the OLS estimators as follows:

Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \sigma^2 / SST_x    (2.21)

Var(\hat{\beta}_0) = \frac{\sigma^2 n^{-1} \sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.

These variances are expressed in terms of the unobservable variance of u. Therefore, the variances of the OLS estimators are themselves unobservable, but they can be estimated if we have an estimate for σ². An unbiased estimator of σ² is:

\hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^{n} \hat{u}_i^2 = SSR/(n-2)    (2.22)
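Equation (2.22) can be checked against the figures reported in Table 2.1: the residual sum of squares is 6155.34858 with n = 6968 observations, so σ̂² should reproduce the residual mean square, and its square root the Root MSE:

```python
import math

# Figures reported in Table 2.1
SSR = 6155.34858   # residual sum of squares
n = 6968           # number of observations

sigma2_hat = SSR / (n - 2)   # equation (2.22): unbiased estimator of sigma^2
ser = math.sqrt(sigma2_hat)  # standard error of the regression

print(sigma2_hat, ser)       # matches the residual MS (.883627416) and Root MSE (.94001)
```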
Table 2.2 The regression model using data in ‘Labour_Force_SA_SALDRU_1993’ with robust standard errors

. reg logwphy educ, robust

Linear regression                                      Number of obs =    6968
                                                       F(  1,  6966) = 2277.27
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.2779
                                                       Root MSE      =  .94001

------------------------------------------------------------------------------
             |               Robust
     logwphy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1353827    .002837    47.72   0.000     .1298214     .140944
       _cons |   .4581331   .0275767    16.61   0.000     .4040744    .5121918
------------------------------------------------------------------------------
and \hat{\sigma} = \sqrt{\hat{\sigma}^2} is called the standard error of the regression (SER). If σ̂² is plugged into the variance formulae for β̂0 and β̂1, we have unbiased estimators of Var(β̂1) and Var(β̂0), and if we take the square root of these estimated variances we obtain the standard errors of β̂1 and β̂0. As we discuss in Chapter 4, the standard errors are central to hypothesis testing.

Suppose the homoskedasticity assumption is not supported by the data. What are we to do? There is a very fundamental difference of view among econometricians as to how best to proceed at this point. One approach – which might be termed the ‘let’s understand what we are doing’ camp – views the failure of the homoskedasticity assumption as information about the model that needs to be understood. An alternative view – which might be termed the ‘let’s get on with evaluating the model’ camp – would argue that we should simply allow for heteroskedasticity and move on. Note in particular that we do not need the homoskedasticity assumption for our estimates to be unbiased.

What is the problem posed by heteroskedasticity? If the errors are heteroskedastic, so that Var(u_i | x_i) = \sigma_i^2, where the i subscript confirms that the variance varies and is not a constant, the variance of the OLS estimator is given by

Var(\hat{\beta}_1) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sigma_i^2}{SST_x^2},    (2.23)

where SST_x^2 = \left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 \right]^2 is the total sum of squares of the x_i, itself squared(!). Note that under homoskedasticity, that is, \sigma_i^2 = \sigma^2 for all i, this formula reduces to the conventional equation \sigma^2 / SST_x.
Since the standard error of β̂1 is based directly on estimating Var(β̂1), we need a way of estimating this variance when heteroskedasticity is present. White (1980) showed that a valid estimator of Var(β̂1) under heteroskedasticity of any form (including homoskedasticity) is

\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2 \hat{u}_i^2}{SST_x^2}.    (2.24)

If we take the square root of this equation, we obtain heteroskedasticity-robust standard errors. Thus, we can easily see if the presence of heteroskedasticity changes our view of the value of the standard errors. This is easily done in Stata, and in Table 2.2 we show the consequences of allowing for heteroskedasticity. A comparison of Tables 2.1 and 2.2 shows a modest rise in the standard errors in Table 2.2 relative to Table 2.1. However, up to this point we have not tested for the presence of heteroskedasticity in the data, although looking at Figure 2.2 it appears rather clear that the variance is not constant across the range of education. We come to a formal test for heteroskedasticity in Chapter 4; before that we need to note the importance of our assumption (A2) above that we have a random sample from the population of interest. If our sample is a random one, we can be sure the errors are not correlated with each other. In fact, our micro data are drawn from a stratified random sample, which means that the sample was drawn by selecting clusters first and then sampling within those clusters. This is the common method for carrying out large-scale sampling, as a truly random sample would be too expensive to collect. It means that we cannot be sure our errors are uncorrelated with each other within a cluster. When we come to formal testing of the more general earnings function that we introduce in Chapter 3, we show this is an important issue for our data (see Section 5.3.2).
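White's formula (2.24) is simple to compute by hand. The sketch below (NumPy, simulated heteroskedastic data, not the SALDRU sample) contrasts the conventional standard error of β̂1 from equation (2.21) with the heteroskedasticity-robust version; because the error variance here grows with x, the robust standard error comes out larger:

```python
import numpy as np

# Simulated data (not the SALDRU sample) with error variance rising in x
rng = np.random.default_rng(3)
n = 5000
x = rng.integers(0, 16, size=n).astype(float)
u = rng.normal(0, 0.1 + 0.2 * x)   # heteroskedastic errors
y = 0.46 + 0.14 * x + u

# OLS estimates and residuals
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
u_hat = y - b0 - b1 * x

SSTx = np.sum((x - x.mean()) ** 2)

# Conventional standard error, equation (2.21), which assumes homoskedasticity
sigma2_hat = np.sum(u_hat ** 2) / (n - 2)
se_conv = np.sqrt(sigma2_hat / SSTx)

# White (1980) heteroskedasticity-robust standard error, equation (2.24)
se_robust = np.sqrt(np.sum((x - x.mean()) ** 2 * u_hat ** 2) / SSTx ** 2)

print(se_conv, se_robust)
```

Replacing û_i² in (2.24) with the constant σ̂² collapses the robust formula back to the conventional one, mirroring the algebraic reduction noted above.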
2.5 Identifying the causal effect of education

In this chapter we have set out the simple linear regression model and the assumptions that ensure the parameter estimates are unbiased estimates of the true population parameters. We began the chapter with a discussion of our desire to make causal statements as to the effects of education on earnings and on labour productivity. If we could argue that the parameters in the earnings function presented in Table 2.2 were unbiased, then we would be able to make causal statements, as we would have shown how changing education changes earnings. While we are not able to make such a statement, we do now understand what is necessary for us to be able to do so. How such a causal effect of education can be identified in our data is a major theme in many of the chapters that follow. In the next chapter we start on that path by showing how we can use and extend the simple regression model.
References

Angrist, J. D. and Pischke, J.-S. (2009) Mostly Harmless Econometrics, Princeton University Press, Princeton, New Jersey.
White, H. (1980) ‘A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity’, Econometrica, 48: 817–38.
Wooldridge, J. M. (2013) Introductory Econometrics: A Modern Approach, Fifth Edition, South-Western Cengage Learning.
Exercise

Consider the model y = α + βx + u, where y = log[value-added per worker], x = log[capital per worker], α and β are scalars, and u is a residual. Data on 22 firms produce the following:

\bar{y} = 6.436,  \sum_i (y_i - \bar{y})^2 = 49.5352,
\bar{x} = 2.036,  \sum_i (x_i - \bar{x})^2 = 57.3152,
\sum_i (x_i - \bar{x})(y_i - \bar{y}) = 47.2752.

      Source |       SS       df       MS         Number of observations =  [a]
-------------+------------------------------      F([u], [v])            =  [x]
       Model |      [b]      [c]      [d]         Prob > F               =  [y]
    Residual |      [e]      [f]      [g]         R-squared              =  [z]
-------------+------------------------------      Adj R-squared          = [aa]
       Total |      [h]      [i]      [j]         Root MSE               = [ab]

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |       [k]        [l]      [m]     [n]            [o]
       _cons |       [p]        [q]      [r]     [s]            [t]
------------------------------------------------------------------------------

1. Above is the template for regression results as reported by Stata. Complete the table by computing the numbers to enter in cells [a]–[z] and [aa]–[ab].
2. Test the hypothesis that β = 0.3. Explain your decision to accept or reject the null hypothesis.
3. Now impose the restriction α = 0 and re-estimate β. How does the resulting estimate compare to that obtained in (1)? Explain intuitively, perhaps using a graph, why this estimate of β is different. Comment on the validity and meaning of imposing the restriction α = 0.
3 Using and extending the simple regression model
3.1 Introduction

Our objective in Chapter 1 was to set out some basic models as to how incomes at the level of the individual and at the level of a country are determined. In the last chapter we showed how, in a simple bivariate model of the determinants of wage income in South Africa, we could use the OLS estimator to provide us with estimates of the impact of education on earnings. In drawing causal inferences, we stressed the need for the zero conditional mean assumption (A4), which must hold for the OLS estimator to be unbiased. In the simple regression model, all determinants of y except x are hidden in the residual u, and if u in fact correlates with the explanatory variable x, assumption A4 will fail. This possible failure is the central issue in our ability to interpret our equation as telling us about the effects of x on y. In this chapter we extend the regression model to allow for several explanatory variables, and show how OLS can be used to estimate the parameters of the more general model. In doing that we take an important step towards controlling for a range of possible factors correlated with our explanatory variable. However, before taking this step, in the next section we consider a simpler model of the impact of education on earnings and use this model to show more fully the assumptions we are making when we seek to draw causal inferences from our OLS regressions. We then turn to extending our simple regression model to one with many variables in Section 3.3. In Section 3.4 we show how extending our earnings and production functions alters our view as to the determinants of earnings and productivity.
3.2 Dummy explanatory variables and the return to education

One of the Millennium Development Goals is providing universal primary education by the end of 2015. Suppose we want to estimate the value for a student of achieving primary education. While we certainly could use the regression results from Chapter 2, here we consider an alternative approach which illustrates how we can use regression analysis with dummy explanatory variables. We begin by modifying our earnings equation and write it as follows:

log wphy = \beta_0 + \beta_1 primary\_complete + u,    (3.1)

where primary_complete is a dummy variable equal to 1 for those who have completed primary education and 0 for those who have not. Under the zero conditional mean assumption for the residual u, conditional expected log earnings are as follows:
Table 3.1 Log (earnings) and a dummy variable for primary education

. reg logwphy primary_complete

      Source |       SS       df       MS              Number of obs =    6968
-------------+------------------------------           F(  1,  6966) = 1304.23
       Model |  1344.21093     1  1344.21093           Prob > F      =  0.0000
    Residual |  7179.56178  6966  1.03065773           R-squared     =  0.1577
-------------+------------------------------           Adj R-squared =  0.1576
       Total |  8523.77271  6967  1.22344951           Root MSE      =  1.0152

------------------------------------------------------------------------------
     logwphy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
primary_co~e |   .9481207   .0262535    36.11   0.000     .8966558    .9995855
       _cons |   .8954331   .0217785    41.12   0.000     .8527406    .9381255
------------------------------------------------------------------------------
E(log wphy | primary_complete = 0) = \beta_0    (3.2)

and

E(log wphy | primary_complete = 1) = \beta_0 + \beta_1.

Clearly, the coefficient on primary_complete (β1) gives us the value from completing primary education. The fact that our explanatory variable is now a dummy variable and not a continuous variable does not change any of the mechanics of OLS estimation. Table 3.1 presents the OLS results using our South African data.

Now, rather than run a regression, let us compute the mean values of log earnings separately for those with and without primary education, which we do in Table 3.2. You can see that this cross tabulation gives you exactly the same information as the simple linear regression. Those with ‘primary education not complete’ earn 0.895 on average. This is identical to the estimated intercept in the OLS regression. Those with ‘primary complete’ earn 1.84 on average, which is identical to the estimated intercept plus the estimated coefficient on primary_complete in our OLS regression (0.895 + 0.948 = 1.84). The connection between the cross-tabulation results and the regression results is clear in this case: these are simply equivalent estimators of the expected value of log earnings conditional on primary_complete = 0 and primary_complete = 1.

These numbers are in natural logs, so to see how large the difference in earnings is for those with primary education complete relative to those without, we need to find the predicted value of earnings. Our equation so far has predicted the natural log of earnings, not actual earnings. The expected value of earnings, denoted wphy, is given by:

E(wphy_i | primary\_complete) = \alpha_0 \exp(\beta_0 + \beta_1 primary\_complete),    (3.3)

where \alpha_0 = E(\exp(u)). One possible estimator for \alpha_0 is:

\hat{\alpha}_0 = n^{-1} \sum_{i=1}^{n} \exp(\hat{u}_i).    (3.4)
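The behaviour of the estimator in equation (3.4) can be illustrated by simulation. The sketch below (NumPy, hypothetical parameters chosen to mimic the magnitudes in the text, not the SALDRU data) fits the dummy regression via group means and shows α̂0 settling near its population value, which for standard normal errors is exp(0.5):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20000
d = rng.integers(0, 2, size=n).astype(float)   # dummy: primary complete or not
u = rng.normal(0, 1.0, size=n)
beta0, beta1 = 0.9, 0.95                        # hypothetical parameters
log_w = beta0 + beta1 * d + u

# OLS with a single dummy regressor = difference in group means
b1 = log_w[d == 1].mean() - log_w[d == 0].mean()
b0 = log_w[d == 0].mean()
u_hat = log_w - b0 - b1 * d

# Equation (3.4): estimator of alpha0 = E(exp(u));
# for u ~ N(0, 1) the true value is exp(0.5), about 1.65
alpha0_hat = np.mean(np.exp(u_hat))
print(alpha0_hat)
```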
Table 3.2 The means of log (earnings) by whether primary education completed

. bysort primary_complete: sum logwphy

-> primary_complete = 0

    Variable |  Obs        Mean    Std. Dev.        Min        Max
-------------+----------------------------------------------------
     logwphy | 2173    .8954331    1.053031   -2.383063   5.047622

-> primary_complete = 1

    Variable |  Obs        Mean    Std. Dev.        Min        Max
-------------+----------------------------------------------------
     logwphy | 4795    1.843554    .9976072   -2.364309    6.64859
While the intuition for this estimator is clear – it is simply the sample analogue of the population quantity E(exp(u)) – it is not in fact an unbiased estimator, but it can be shown to be consistent (a term we define and explain in the next chapter). If you use the data for the regression, you will find that α̂0 = 1.59. So the following calculation translates the logs of earnings into levels:

wphy_pc = 1.59 · exp(log wphy_pc) = 1.59 · exp(1.843) = 10.04 rand per hour
wphy_pnc = 1.59 · exp(log wphy_pnc) = 1.59 · exp(0.895) = 3.89 rand per hour,

where pc and pnc mean primary completed and primary not completed. You now see quite clearly that the increase in earnings associated with primary completion is slightly more than 150 per cent.

In moving between the levels of variables and the logarithmic specification, you need to be aware of when the coefficient on the dummy is a good approximation for the percentage change and when it is not. In general, for a log-dummy specification of the following type

log(y) = \beta_0 + \beta_1 x + u,

where x is a dummy variable, the exact percentage difference in y resulting from changing x from 0 to 1 is given by

\%\Delta y = (\exp(\beta_1) - 1) \cdot 100.

Using a first-order Taylor approximation it can be shown that, if β1 is small, %Δy ≈ 100 · β1. For example, if β1 = 0.05, the exact percentage difference is 100 × (exp(0.05) − 1) = 5.13 per cent, so simply reading off the effect as 5 per cent directly from the coefficient results in only a small (and acceptable) approximation error. However, in our case we obtain
an estimate of β1 equal to 0.95, so the exact percentage change is 158 per cent, which is obviously much larger than 95 per cent.

For these estimates to be unbiased, we need to ensure that the errors are not correlated with the explanatory variable in the population. By construction, the errors are not correlated with the x variables in the OLS regression. How can we know if that is the case in the population? The short answer is that we cannot. However, it may be relatively easy to find out if it is not true. If our model suggests that there are variables that determine earnings, and we can observe these variables, then we can test to see if they do indeed affect earnings once we condition on education; this is the role of multiple regression, which we come to in Section 3.3.

However, in some circumstances we can use another method to ensure that the explanatory variables are not correlated with the errors, and that is if we can conduct an experiment. Suspend disbelief for a moment and assume that we can choose whether a student is given primary education. Now let us choose randomly, so each has an equal chance of being educated. If our random selection is successful then there will be no difference between the primary-educated and the non-educated apart from their education. In this case we will know that the unobservables are not correlated with the x variable, as our experiment has ensured that is the case. In this experimental setting, we will have succeeded in identifying the effect of education in the sense that the averages across the two groups will measure the average difference education has made to their earnings. Now you will object, correctly, that this example is fanciful: not only are students not chosen at random, it would be quite wrong to do so.
However, there may be problems that can be addressed in this way, and where it is possible, randomisation is one way to deal with the potential correlation of the x variable with the unobservables. This method is introduced more formally in Chapter 12.
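The logic of the thought experiment can itself be simulated (all parameters below are hypothetical). When the 'treatment' is assigned at random it is uncorrelated with unobserved ability by construction, so the difference in group means recovers the causal effect; when assignment depends on ability, the same comparison overstates it:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100000
ability = rng.normal(0, 1, size=n)               # unobservable

# Random assignment: education independent of ability by construction
primary = rng.integers(0, 2, size=n).astype(float)

true_effect = 0.9                                 # hypothetical causal effect
log_w = 1.0 + true_effect * primary + 0.5 * ability + rng.normal(0, 0.5, size=n)

diff_means = log_w[primary == 1].mean() - log_w[primary == 0].mean()
print(diff_means)   # close to the true effect of 0.9

# Contrast: assignment that depends on ability (selection)
primary_sel = (ability + rng.normal(0, 1, size=n) > 0).astype(float)
log_w_sel = 1.0 + true_effect * primary_sel + 0.5 * ability + rng.normal(0, 0.5, size=n)
diff_sel = log_w_sel[primary_sel == 1].mean() - log_w_sel[primary_sel == 0].mean()
print(diff_sel)     # overstates the true effect
```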
3.3 Multiple regression

3.3.1 Earnings and production functions

In Chapter 1 we introduced more general forms of both an earnings function and a production function than we considered in the last chapter. In this section we develop the study of multiple regression by extending those functions still further. The most general form of the earnings function that we consider in this and the following chapters is:

log w_{L(i)} = \beta_0 + \beta_1 Exper_i + \beta_2 Exper_i^2 + \beta_3 E_i + \beta_4 E_i^2 + u_i.    (3.5)

In this equation, the log of earnings is a function of work experience (Exper) as well as of education (E). Both work experience and education are allowed to enter nonlinearly into the equation. Our equation remains linear in the parameters, but allowing for nonlinearity in the variables is, as we will see, an important aspect of how work experience and education impact on earnings. If the earnings function is concave then the parameters β2 and β4 will be negative, implying that the rate of increase of earnings with experience and education declines with their level. Recall that in Chapter 1 the Hall and Jones (1999) measure assumed concavity of the earnings function with respect to education. We test this assumption with our micro data. In Chapter 1 we introduced the human capital augmented Cobb–Douglas production function of the form:
log(V_i / L_i) = \alpha \log(K_i / L_i) + (1 - \alpha) \log A_i + (1 - \alpha) \phi(E_i) + u_i.    (3.6)
Our inclusion of experience in the equation for our micro function, and the inclusion of education in our macro production function, show how multiple regression allows us to extend our simple regression model. If experience is correlated with education in the micro function, and human capital is correlated with physical capital in the macro function, then the simple regression model will give us biased point estimates, for the reasons we have set out in Chapter 2. The credibility with which we can argue that OLS estimates a causal effect without bias is generally stronger if our empirical model explicitly contains the factors thought to determine earnings or labour productivity.

3.3.2 The OLS estimators for multiple regression

In this section we set out in general terms how the parameters of the multiple regression model can be estimated. We specify the multiple linear regression model as follows:

y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \cdots + \beta_k x_{ki} + u_i,    (3.7)
where x_1, x_2, …, x_k are independent variables, or explanatory variables, and u is a residual or an error term. We seek to estimate the unknown parameters β0, β1, …, βk in this equation. The OLS estimates of β0, β1, …, βk are denoted β̂0, β̂1, …, β̂k, and the OLS residual for individual i is defined as

\hat{u}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{1i} - \cdots - \hat{\beta}_k x_{ki}.

The OLS estimates β̂0, β̂1, …, β̂k are those that minimise the sum of squared residuals across the sample observations:

\sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{1i} - \cdots - \hat{\beta}_k x_{ki})^2.    (3.8)
Hence, all alternative estimates of the population parameters β0, β1, …, βk would result in a higher sum of squared residuals than that based on OLS. It follows that the OLS estimates β̂0, β̂1, …, β̂k satisfy the following OLS first-order conditions:

\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{1i} - \cdots - \hat{\beta}_k x_{ki}) = 0

\sum_{i=1}^{n} x_{1i} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{1i} - \cdots - \hat{\beta}_k x_{ki}) = 0

\sum_{i=1}^{n} x_{2i} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{1i} - \cdots - \hat{\beta}_k x_{ki}) = 0

⋮

\sum_{i=1}^{n} x_{ki} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{1i} - \cdots - \hat{\beta}_k x_{ki}) = 0.    (3.9)
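The system of first-order conditions in (3.9) is exactly what a least-squares routine solves. A sketch with two hypothetical regressors (experience and education; simulated data, not the book's), checking that the fitted residuals satisfy every condition at once:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
exper = rng.uniform(0, 30, size=n)
educ = rng.integers(0, 16, size=n).astype(float)
y = 0.5 + 0.03 * exper + 0.12 * educ + rng.normal(0, 0.8, size=n)

# Design matrix with a constant column; solve the least-squares problem
X = np.column_stack([np.ones(n), exper, educ])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
u_hat = y - X @ beta_hat

# All first-order conditions in (3.9) at once: X'u_hat = 0
print(X.T @ u_hat)   # every entry is numerically zero
```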
Based on these equations we can solve for β̂0, …, β̂k in the same manner as we solved for β̂0 and β̂1 in Chapter 2. The extension to a general multivariate model involves no new mathematical principles, although the algebra is cumbersome if matrices are not used. For the OLS estimator of the parameters of the multiple regression model to be unbiased, the following four assumptions need to hold:

(A1′) The population model is linear in parameters.
(A2′) The data were obtained through random sampling.
(A3′) There is sample variation in all explanatory variables and no explanatory variable is collinear with the other explanatory variables.
(A4′) Zero conditional mean of the residual: E(u | x_1, x_2, …, x_k) = 0

(see Wooldridge, 2013, Chapter 3). As you can see, while the first and second of these assumptions are identical to those underpinning the simple linear regression model, the third and fourth assumptions are extended versions of their single-regression counterparts. The only new concept is that of collinearity, which means there is linear dependence between variables. Hence, if we were to regress one of our explanatory variables on the other explanatory variables in the model and obtain an R-squared equal to 1 from that regression, we would have collinearity. In that case, it is not possible to estimate all the parameters of the model using OLS. Stata deals with collinearity simply by dropping all collinear variables from the model.

Under assumptions (A1′)–(A4′), the OLS estimators of the multiple linear regression model are unbiased. If, in addition, we assume that the residual is homoskedastic, which for the multiple regression model is expressed as

(A5′) Homoskedasticity: Var(u | x_1, x_2, …, x_k) = \sigma^2,

the OLS estimator is BLUE. Moreover, under assumptions (A1′)–(A5′), obtaining a formula for the variance of the OLS estimator β̂_j is straightforward (we do not derive it here).
An illuminating way of expressing the variance of the OLS estimator β̂_j is as follows:

Var(\hat{\beta}_j) = \frac{\sigma^2}{SST_j (1 - R_j^2)}    (3.10)

(see Chapter 3 in Wooldridge, 2013, for details). It is quite useful to compare this formula, which applies to the multiple regression, to the variance formula for the simple regression model encountered earlier:

Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \sigma^2 / SST_x
(see equation (2.21) of Chapter 2). We observe that two of the terms in the variance formula for multiple regression are the same as for the simple regression formula: with everything else equal, the variance falls as (i) the variance of the residual falls; and (ii) the total sum of squares in the explanatory variable increases. But there is a third factor in the variance formula for multiple regression that we have not encountered thus far, namely (1 − R_j²). It is important not to become confused about the meaning of the R_j² here: this is indeed an R-squared, but it is not the R-squared from the regression of y_i on the explanatory variables. In fact, the R_j² that features in equation (3.10) is the R-squared one would obtain from a regression of x_j on all the other explanatory variables in the original model. That is, if equation (3.7) is our model, R_1² would be the R-squared associated with OLS estimation of the following regression model:

x_{1i} = \alpha_0 + \alpha_2 x_{2i} + \alpha_3 x_{3i} + \cdots + \alpha_k x_{ki} + e_{1i},    (3.11)

where the α_j denote the parameters to be estimated (typically these would not be of great interest to us) and e_{1i} is an error term; and so on for R_2², R_3², …, R_k². That is, R_j² measures how strongly the other explanatory variables in the model correlate with variable x_j. If they correlate with x_j strongly, R_j² will be high, and if they correlate with x_j weakly, R_j² will be low. Equation (3.10) implies that the higher the R_j², the higher the Var(β̂_j), with other factors held constant. Hence, the variance of the OLS estimator depends partly on how strongly the explanatory variables in the model correlate with each other. If some of the explanatory variables in the model are close to being collinear with x_j, so that R_j² becomes close to 1, it follows from equation (3.10) that Var(β̂_j) can become very high indeed. In the extreme case in which x_j is collinear with the other explanatory variables we obtain R_j² = 1, implying that the variance is not defined.

3.3.3 Omitted variables and the bias they may cause

Clearly, the assumption that E(u_i | x_i) = 0 is crucial for the OLS estimator β̂1 to be unbiased. If that assumption fails, the OLS estimator is generally biased. A very common worry in applied work is that important determinants of the dependent variable that have been omitted from the model in fact correlate with the explanatory variables included in the model. This will lead to omitted variable bias. To illustrate, suppose earnings depend on innate ability (iability) as well as education:

logwphy = \beta_0 + \beta_1 educ + \beta_2 iability + u.

Suppose that the OLS estimator applied to this model would be unbiased. Now think about what would happen if we were to run a regression with iability omitted from the model. Would the OLS estimator based on such a specification be unbiased? Unless iability is uncorrelated with educ or β2 = 0, the answer is no. To see why, summarise the association between iability and educ as follows:

iability = \delta_0 + \delta_1 educ + e,

where e is an error term and δ1 = Cov(educ, iability)/Var(educ). Note that this equation should not be interpreted causally: it merely describes the statistical association between
education and innate ability (formally, the equation is known as a linear projection of iability on educ). It follows that the earnings equation can be written as

logwphy = (β0 + β2 δ0) + (β1 + β2 δ1) educ + {β2 e + u},

and the assumptions we have already made imply that the expected value of the equation residual {β2 e + u} conditional on educ is zero. This is an equation with iability omitted. Hence, if we run an OLS regression with iability omitted from the model, the estimator of the slope coefficient on educ will be an unbiased estimator of (β1 + β2 δ1). This is no cause for celebration, however, as (β1 + β2 δ1) happens not to be the quantity of interest! We are interested in the causal effect of education, that is, β1, and unless β2 = 0 or δ1 = 0, OLS will be a biased estimator if iability is omitted from the specification. Note that the sign of the bias depends on the signs of β2 and δ1. If education is positively correlated with innate ability, OLS based on a specification with iability omitted would tend to overestimate the causal effect of education on earnings (provided, of course, that higher innate ability results in higher earnings). Omitted variable bias is discussed further in several of the chapters in this book.
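The result that the short regression estimates (β1 + β2 δ1) rather than β1 can be checked by simulation. The sketch below is in Python rather than the book's Stata, and all parameter values are made up for illustration (this is not the South African data): generate data from the long model, omit iability, and observe that OLS recovers β1 + β2 δ1.

```python
import numpy as np

# Simulation check of omitted variable bias: the short regression of
# logwphy on educ alone estimates beta1 + beta2*delta1, not beta1.
# All parameter values are made up for illustration.
rng = np.random.default_rng(1)
n = 200_000

beta0, beta1, beta2 = 0.5, 0.10, 0.30   # "long" earnings equation
delta0, delta1 = 1.0, 0.20              # linear projection of iability on educ

educ = rng.uniform(0, 12, n)
iability = delta0 + delta1 * educ + rng.standard_normal(n)   # e has mean zero
logwphy = beta0 + beta1 * educ + beta2 * iability + rng.standard_normal(n)

# Short regression: iability omitted
X = np.column_stack([np.ones(n), educ])
b_short, *_ = np.linalg.lstsq(X, logwphy, rcond=None)
slope = b_short[1]   # close to beta1 + beta2*delta1 = 0.16, not beta1 = 0.10
```

With these (positive) values of β2 and δ1, the short-regression slope overstates the causal effect, exactly as the algebra above predicts.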
3.4 Interpreting multiple regressions

In this section we consider the earnings function and the production function in more detail. We use our very specific question – how important is education in determining earnings and productivity – to illustrate the general principles that underlie the interpretation of multiple regression. We begin in the next sub-section with the micro evidence before moving on to the macro evidence.

3.4.1 How much does investing in education increase earnings? Some micro evidence

Now at first you may well think that if we want to understand how education affects earnings we should include as many variables as possible that may be correlated with education, as that would enable us to identify the role that education, by itself, plays in increasing earnings. However, a little thought suggests that proceeding in that way would not necessarily be appropriate. As we will show later in our applications, many aspects of a person's job – their occupation, the size of the enterprise in which they work and their sector – all affect their earnings. Should we introduce these as additional variables in our regression? The answer is that it depends on what our question is. If we want to know the full effect of education on earnings, and we think that education affects where you work, then including these additional variables would not be correct – to do so would hide part of the role that education is playing. However, if our question were: does education increase your earnings only by enabling you to get better types of job in terms of occupation, type of firm and sector, then it would be entirely correct to include these variables and, if education remained an important determinant of earnings, that would tell you that education increased your earnings within these types of job (see Fafchamps et al., 2009, for an analysis of the returns to education across and within jobs in Sub-Saharan Africa).
The question we want to tackle here is the first: what is the 'full effect' of education on earnings? In Table 3.3, which is a printout from Stata regressions, we extend our basic earnings function by including work experience and by allowing for nonlinear
Table 3.3 Stata printout from South African wage data

. reg logwphy educ

      Source |       SS       df       MS              Number of obs =    6968
-------------+------------------------------           F(  1,  6966) = 2680.34
       Model |  2368.42412     1  2368.42412           Prob > F      =  0.0000
    Residual |  6155.34858  6966  .883627416           R-squared     =  0.2779
-------------+------------------------------           Adj R-squared =  0.2778
       Total |  8523.77271  6967  1.22344951           Root MSE      =  .94001

------------------------------------------------------------------------------
     logwphy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1353827    .002615    51.77   0.000     .1302565    .1405088
       _cons |   .4581331   .0238719    19.19   0.000     .4113368    .5049294
------------------------------------------------------------------------------

. reg logwphy educ exper

      Source |       SS       df       MS              Number of obs =    6968
-------------+------------------------------           F(  2,  6965) = 1529.50
       Model |  2601.17543     2  1300.58771           Prob > F      =  0.0000
    Residual |  5922.59728  6965   .85033701           R-squared     =  0.3052
-------------+------------------------------           Adj R-squared =  0.3050
       Total |  8523.77271  6967  1.22344951           Root MSE      =  .92214

------------------------------------------------------------------------------
     logwphy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1616223   .0030159    53.59   0.000     .1557101    .1675344
       exper |   .0175132   .0010586    16.54   0.000     .0154381    .0195883
       _cons |  -.1507435   .0436215    -3.46   0.001    -.2362549   -.0652321
------------------------------------------------------------------------------
. reg logwphy educ educ_sq exper

      Source |       SS       df       MS              Number of obs =    6968
-------------+------------------------------           F(  3,  6964) = 1222.10
       Model |  2939.77545     3  979.925151           Prob > F      =  0.0000
    Residual |  5583.99725  6964   .80183763           R-squared     =  0.3449
-------------+------------------------------           Adj R-squared =  0.3446
       Total |  8523.77271  6967  1.22344951           Root MSE      =  .89545

------------------------------------------------------------------------------
     logwphy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |  -.0008288   .0084304    -0.10   0.922     -.017355    .0156974
     educ_sq |    .011415   .0005555    20.55   0.000     .0103261     .012504
       exper |   .0166749   .0010287    16.21   0.000     .0146583    .0186916
       _cons |   .2246231   .0461299     4.87   0.000     .1341944    .3150518
------------------------------------------------------------------------------

. reg logwphy educ educ_sq exper exper_sq

      Source |       SS       df       MS              Number of obs =    6968
-------------+------------------------------           F(  4,  6963) =  987.89
       Model |  3085.97829     4  771.494572           Prob > F      =  0.0000
    Residual |  5437.79442  6963  .780955682           R-squared     =  0.3620
-------------+------------------------------           Adj R-squared =  0.3617
       Total |  8523.77271  6967  1.22344951           Root MSE      =  .88372

------------------------------------------------------------------------------
     logwphy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |  -.0251407   .0085075    -2.96   0.003    -.0418181   -.0084633
     educ_sq |   .0131101    .000562    23.33   0.000     .0120084    .0142119
       exper |   .0603432   .0033491    18.02   0.000     .0537779    .0669086
    exper_sq |  -.0008603   .0000629   -13.68   0.000    -.0009836   -.0007371
       _cons |  -.1394173   .0527299    -2.64   0.008    -.2427841   -.0360505
------------------------------------------------------------------------------
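As a cross-check on the education figures discussed in the text, the printed point estimates from Regression (4) can be plugged into the education part of the fitted equation, β_educ·educ + β_educ_sq·educ², holding experience fixed. A quick sketch in Python rather than Stata:

```python
import math

# Point estimates on educ and educ_sq from Regression (4) in Table 3.3
b_educ, b_educ_sq = -0.0251407, 0.0131101

def educ_component(years):
    """Education part of predicted log wages, holding experience fixed."""
    return b_educ * years + b_educ_sq * years**2

# Proportional wage gain from zero to seven years of education
gain_0_to_7 = math.exp(educ_component(7) - educ_component(0)) - 1   # ~0.6

# Ratio of predicted wages at twelve years to wages at seven years
ratio_7_to_12 = math.exp(educ_component(12) - educ_component(7))    # ~3
```

This reproduces the figures discussed in the text: roughly a 60 per cent gain from seven years of education, and roughly a three-fold rise in predicted wages between seven and twelve years.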
Figure 3.1 Predicted log (wages) in South Africa (1993)
Left panel: predicted log wages against years of education (0 to 15); wages are predicted for a worker with average levels of work experience. Right panel: predicted log wages against years of work experience (0 to 40); wages are predicted for a worker with average levels of education.
effects. While our model is linear in parameters it does not need to be linear in the variables and, as we will see, nonlinearities are important for both education and work experience. Very few data sets have actual measures of work experience and, as is the case for our data, work experience is proxied by age less years spent in school less the age at which school was started. Now, for most students working and schooling cannot be combined so, ceteris paribus, we expect that students with more education will have less work experience; that is, the two variables will be negatively correlated. In controlling for work experience in the earnings function, we can compare the returns from two types of human capital: that acquired in school and that acquired within the labour market. In the light of what we have just argued about what should, and should not, be included in a regression determining earnings, we are implicitly assuming there is no causal path from education to work experience. It is possible that the returns to work experience will differ by educational level, but we can allow for that, and test for it, by means of interaction terms, which we come to below. Regression (1) in Table 3.3 repeats our simple regression of Chapter 2; we show it here so that the effects of extending the model can be seen in one table. Our first step in extending the model is to include work experience (exper) simply as a linear term in Regression (2). Notice, in comparing Regressions (1) and (2), that the consequence of including this variable is to increase the point estimate on education from 0.14 to 0.16, consistent with the view that education and time in the labour market are negatively correlated. In Regression (3) we first include nonlinear terms in work experience, which is always done when specifying earnings functions, and in Regression (4) we also include nonlinear terms in education, which is often not done when specifying earnings functions.
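One way to see the concavity in work experience implied by Regression (4) is to compute the turning point of the fitted quadratic, −β_exper/(2·β_exper_sq). A quick sketch in Python rather than Stata, using the printed coefficients:

```python
# Coefficients on exper and exper_sq from Regression (4) in Table 3.3
b_exper, b_exper_sq = 0.0603432, -0.0008603

# The fitted quadratic in experience peaks where its derivative is zero:
# d(log wage)/d(exper) = b_exper + 2*b_exper_sq*exper = 0
peak_exper = -b_exper / (2 * b_exper_sq)   # roughly 35 years
```

So the fitted profile rises up to roughly 35 years of work experience, consistent with the concave experience profile shown in Figure 3.1.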
This final regression is the most general form of the earnings function that we consider in this chapter. In Figure 3.1 we present in graphical form the results from Regression (4) of Table 3.3. As is apparent from the point estimates (and is illustrated in Figure 3.1), the profile of earnings with respect to years of education is strongly convex, while that with respect to work experience is strongly concave. It appears that our linear specification is misleading as to the patterns of returns from both education and work experience, as both are strongly nonlinear. If we now ask what the gain is from seven years of education (that is, up to primary level) we see that earnings increase by 60 per cent, a substantial rise,
Table 3.4 The macro production function data (year=2000 only)
Regression (1)

. reg lrgdpch lkp if year==2000

      Source |       SS       df       MS              Number of obs =      82
-------------+------------------------------           F(  1,    80) = 1072.94
       Model |  104.833774     1  104.833774           Prob > F      =  0.0000
    Residual |  7.81654577    80  .097706822           R-squared     =  0.9306
-------------+------------------------------           Adj R-squared =  0.9297
       Total |   112.65032    81  1.39074469           Root MSE      =  .31258

------------------------------------------------------------------------------
     lrgdpch |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lkp |   .7267638   .0221873    32.76   0.000     .6826096     .770918
       _cons |   1.728678   .2131029     8.11   0.000      1.30459    2.152766
------------------------------------------------------------------------------

Regression (2)

. reg lrgdpch lkp tyr15 if year==2000

      Source |       SS       df       MS              Number of obs =      82
-------------+------------------------------           F(  2,    79) =  575.83
       Model |  105.418902     2  52.7094511           Prob > F      =  0.0000
    Residual |  7.23141754    79  .091536931           R-squared     =  0.9358
-------------+------------------------------           Adj R-squared =  0.9342
       Total |   112.65032    81  1.39074469           Root MSE      =  .30255

------------------------------------------------------------------------------
     lrgdpch |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lkp |   .6195039   .0475497    13.03   0.000     .5248585    .7141492
       tyr15 |   .0663426   .0262401     2.53   0.013      .014113    .1185722
       _cons |   2.323293   .3128209     7.43   0.000     1.700639    2.945948
------------------------------------------------------------------------------

Regression (3)
. reg lrgdpch lkp tyr15 tyr15_sq if year==2000

      Source |       SS       df       MS              Number of obs =      82
-------------+------------------------------           F(  3,    78) =  384.83
       Model |  105.521087     3  35.1736956           Prob > F      =  0.0000
    Residual |  7.12923298    78  .091400423           R-squared     =  0.9367
-------------+------------------------------           Adj R-squared =  0.9343
       Total |   112.65032    81  1.39074469           Root MSE      =  .30233

------------------------------------------------------------------------------
     lrgdpch |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lkp |   .6127846   .0479373    12.78   0.000     .5173487    .7082205
       tyr15 |    .120592   .0576187     2.09   0.040     .0058819    .2353021
    tyr15_sq |  -.0039269   .0037139    -1.06   0.294    -.0113207    .0034669
       _cons |   2.232001   .3242926     6.88   0.000     1.586384    2.877618
------------------------------------------------------------------------------
but one dwarfed by the three-fold rise that occurs between seven and twelve years of education.

3.4.2 How much does investing in education increase productivity? Some macro evidence

In Chapter 1 we introduced the human capital augmented production function, set out as equation (3.6) above. In that equation, human capital (φ(E_it)) shifts labour productivity. However, as we showed in Chapter 1, we can translate this specification into a model of the average wage in the economy using the definition of human capital from Hall and Jones (1999).
w_H H_it = w_H e^{φ(E_it)} L_it = w_{L(it)} L_it

and we can write

log w_{L(it)} = log w_H + φ(E_it),

where log w_H is a constant and w_{L(it)} tells us the wage of a labourer with level of education E_it. This is a semi-logarithmic equation and is in structure the basis for estimating micro Mincerian earnings functions, which were our concern in Section 3.4.1. In the empirical work we report below, the function φ is written in a form linear in education:
φ(E_it) = δ0 + δ1 E_it + v_it.

In making this function linear, we enable a direct comparison between our micro earnings function and the macro production function. Our ability to include a measure of education in the macro production function is due to the work of Barro and Lee (2000), who provided estimates at five-year intervals of educational attainment for the years 1960–2000. They also imputed the number of years of schooling achieved by the average person at the various levels, and at all levels of schooling combined. In the production function reported in Table 3.4 we use the variable tyr15, which is the average years of education in the population aged over 15. The Hall and Jones (1999) data of Chapter 1 used the Penn World Tables Version 5.6; in Table 3.4 we use Version 6.1. In this version of the Penn World Tables capital stock figures were not supplied, so the capital per capita data we use are constructed from the investment flow data using the same procedure as in Klenow and Rodríguez-Clare (1997). In Table 3.4 the variable lkp is the natural log of this imputation of capital per capita. The dependent variable in the regression is the natural log of real GDP per capita in constant 1996 US$ using the chain index method from the Penn World Tables. Thus the regression shown in the Stata output in Table 3.4 is the empirical version of equation (3.6), where we have imposed linearity on the impact of education:

log(V_i/L_i) = α log(K_i/L_i) + (1 − α) log A + (1 − α) φ(E_i) + u_i.
We are now in a position to answer the question of how the introduction of education into the production function qualifies our view of this as a model of the determination of income at the macro level presented in Chapter 1. In Table 3.4, Regression (1) provides the same basic specification as used in Chapter 1 (the only difference is the version of the Penn World Tables being used), and in Regression (2) we add the years-of-education term. We can use our algebra linking the impact of education on labour productivity to average wages to obtain the Mincerian return to education from this macro production function:

0.07 = (1 − 0.62) δ1,

so that

δ1 = 0.18.
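The arithmetic behind this calculation can be spelled out, using both the rounded figures above and the full-precision coefficients from Regression (2) of Table 3.4 (the rounding in the text matters for the second decimal place):

```python
# Rounded figures as used in the text: 0.07 = (1 - 0.62) * delta1
delta1_rounded = 0.07 / (1 - 0.62)          # about 0.18

# The same calculation with the full-precision coefficients from
# Regression (2) of Table 3.4 (tyr15 and lkp)
tyr15_coef, lkp_coef = 0.0663426, 0.6195039
delta1_full = tyr15_coef / (1 - lkp_coef)   # about 0.17
```

Either way, the implied macro Mincerian return is in the same range as the micro estimates of Table 3.3.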
This macro data set has produced a Mincerian return to education strikingly similar to that from our micro South African labour force data. Finally, in Regression (3) of Table 3.4 we test whether we find the same nonlinearity in the macro data as that shown in the micro data above, and the answer is no: the data is consistent with the effect of education in the macro data being linear. We see that the coefficient on the capital stock is reduced from 0.73 to 0.62 by including years of education in the equation. Thus it could be argued that, as the two are positively correlated, our initial bivariate regression misled us as to the 'true' effect of capital on output. That indeed is the meaning of treating education as a control in the production function in order to identify the 'true' effect of capital on output. In fact, one way of reading this multiple regression, in contrast to the bivariate regression, is that it shows the part of productivity that can be explained by physical capital which is not explained by human capital. We return to both our micro and our macro equations in Chapter 5; first, however, we need the distribution of the OLS estimators and methods for testing hypotheses about the data. That is the subject of the next chapter.
References

Barro, R. J. and Lee, J-W. (2000) 'International data on educational attainment: Updates and implications', CID Working Paper No. 42.
Barro, R. J. and Lee, J-W. (2010) 'A new data set of educational attainment in the world, 1950–2010', NBER Working Paper No. 15902.
Fafchamps, M., Söderbom, M. and Benhassine, N. (2009) 'Wage gaps and job sorting in African manufacturing', Journal of African Economies, 18(5): 824–68.
Hall, R. E. and Jones, C. I. (1999) 'Why do some countries produce so much more output per worker than others?', Quarterly Journal of Economics, 114(1): 83–116.
Klenow, P. J. and Rodríguez-Clare, A. (1997) 'The neo-classical revival in growth economics: Has it gone too far?', in Bernanke, B. and Rotemberg, J. (eds) NBER Macroeconomics Annual, MIT Press, Cambridge, Massachusetts.
Wooldridge, J. M. (2013) Introductory Econometrics: A Modern Approach, Fifth Edition, South-Western Cengage Learning.
Exercise

The Stata data file 'Labour_Force_SA_SALDRU_1993' has the data used in Chapters 2 and 3.

1. Using this data, run the earnings function as a linear, semi-log, and double-log function and derive the Mincerian return to education for each of the specifications.
2. Which do you prefer and why?

The Stata data file 'Macro_1980_2000_PENN61.dta' has the macro data used in this chapter. Using this data, estimate the following models and answer the questions below:

log V_i = β0 + β1 log K_i + β2 log L_i + u_i
log(V/L)_i = γ0 + γ1 log(K/L)_i + u_i
log(V/L)_i = θ0 + θ1 log(K/L)_i + θ2 log L_i + u_i

3. Is the data consistent with constant returns to scale?
4. Is the share of capital consistent with the national accounts in any of these regressions?
5. If not, suggest reasons for the parameter estimates you observe.
6. Using the data for 1980 and 2000, create a cross section of differenced variables and re-estimate this equation in differences:

Δ log(V/L)_i = β0 + β1 Δ log(K/L)_i + u_i

7. Comment on the new point estimates and why they differ from those used in your answers to questions 3 and 4.
4 The distribution of the OLS estimators and hypothesis testing
4.1 Introduction

In Chapter 2 we discussed how OLS can be used to estimate the model parameters of interest. In this chapter we set out methods for testing hypotheses about parameters in the model. We begin by discussing the distribution of the OLS estimators in Section 4.2 and introduce the normal distribution, which plays a central role in hypothesis testing – the subject of Section 4.3. How we test the overall significance of our regression is considered in Section 4.4. In Section 4.5 we return to the issue raised at the end of Chapter 2 as to how we test for heteroskedasticity. Finally, in Section 4.6 we discuss a theoretical result that says that in the limit, as the sample size tends to infinity, the distribution of our estimators can be derived without the normality assumption. This large-sample result is often appealed to as a rationale for not assuming, or indeed testing for, normality in the error term.
4.2 The distribution of the OLS estimators

4.2.1 The normality assumption

In Chapter 3 we listed the five assumptions (A1′–A5′) under which the OLS estimators of the parameters of the multiple linear regression are BLUE. We proceed by adding a sixth assumption for the multiple regression model, namely that

(A6′) The unobserved error u is normally distributed in the population.

Under the six population assumptions that we have now made, which are collectively known as the classical linear regression model (CLM) assumptions, we can establish the following result conditional on the sample values of the independent variables:
β̂_j ~ Normal[ β_j, Var(β̂_j) ],        (4.1)

where Var(β̂_j) is given by

Var(β̂_j) = σ² / [ SST_j (1 − R_j²) ]        (4.2)
(see equation (3.10) in Chapter 3). The proof of the result in equation (4.1) depends on the fact that the β̂_j are a linear function of the errors, and that a linear function of
Figure 4.1 The distribution of wages and log wages
Left panel: density of hourly wages (weighted by prices), 0 to 80 Rand. Right panel: density of the log of hourly wages (weighted by prices).
normally distributed variables is itself normally distributed. Wooldridge (2013: 112–13) provides a full discussion. Equation (4.1) is an important stepping-stone towards hypothesis testing. Before explaining why that is so, which is done in Section 4.3, we reflect a little on the normality assumption.

4.2.2 Why normality?

Since we cannot look at the distribution of u, and since u is the part of the dependent variable y that remains unexplained by our model, it is natural to begin by inspecting the sample distribution of y. It is often the case that the estimated u inherits features of the distribution of y, so this approach can be informative about whether the normality assumption for u is appropriate. In Chapter 2 we simply wrote down a model in which we explained the log of income as a function of the level of education. We did not explain why we chose that specification rather than the level of income. The left-hand side of Figure 4.1 shows how real wages are distributed in the South African data. In fact we have truncated the sample at 80 Rand; had we not done so, the distribution would be even further from the normal. The right-hand side shows the distribution of the log of real wages, which is clearly a lot closer to normality. We would thus be more comfortable with the normality assumption if we were modelling log income rather than the level of income. Of course, the normality assumption refers specifically to the residual, so non-normality in the dependent variable does not necessarily imply that the error term is non-normal. Methods for testing the normality assumption for the residual are discussed below. Why might normality be a natural assumption? And why log normality, rather than normality in levels? The central limit theorem (CLT), which is one of the most powerful results in statistics, suggests why we may often observe normal distributions.
The CLT states that if we have n random variables {Y_1, Y_2, …, Y_n} which are independently and identically distributed (note that they do not need to be normally distributed) with mean µ and standard deviation σ, then Z_n, defined as

Z_n = (Ȳ_n − µ) / (σ/√n),

will tend to a standard normal distribution as n → ∞, where Ȳ_n = n⁻¹ Σ_{i=1}^{n} Y_i. We can say that
Z_n has an asymptotic normal distribution. The canonical example of a variable which has a normal distribution is height. So what does the CLT suggest might be determining height? If the height of an individual is the result of the sum of lots of different i.i.d. (independently and identically distributed) random variables, then the normal distribution of height can be explained by the CLT. So drawing a sample of individuals from the population will be like sampling from the N random variables determining height and averaging the outcome. These averages will, provided N is large enough, follow a normal distribution. That logic does not, of course, prove that height is so determined, but it suggests a possible mechanism that will tend to produce normal distributions. We have already seen that for our earnings data it is the log that is much closer to normality than the level of income. In fact, incomes are very far from a normal distribution. The logic of log normality is similar to that for normality: instead of assuming that the normality results from the sum of i.i.d. random variables, it results from their product. Such a product will be linear in logs. This has an economic logic. Instead of the factors underlying income being additive, they are multiplicative. In other words, they are differences in logs rather than in levels. The fact that income is much closer to being log normal than normal in levels is consistent with numerous i.i.d. random variables affecting its growth rate.
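The multiplicative story can be illustrated by simulation: a variable constructed as the product of many i.i.d. positive shocks is strongly right-skewed in levels but roughly symmetric, and close to normal, in logs, because its log is a sum to which the CLT applies. A sketch (the uniform shock distribution here is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n_vars, n_obs = 50, 100_000

# Each observation is the product of 50 i.i.d. positive shocks
shocks = rng.uniform(0.8, 1.25, size=(n_obs, n_vars))
income = shocks.prod(axis=1)

def skewness(v):
    """Sample skewness: near 0 for a symmetric (e.g. normal) distribution."""
    z = (v - v.mean()) / v.std()
    return (z**3).mean()

skew_levels = skewness(income)          # strongly right-skewed in levels
skew_logs = skewness(np.log(income))    # roughly symmetric in logs
```

The levels distribution mimics the right-skewed wage density in the left panel of Figure 4.1, while the log distribution mimics the near-normal shape in the right panel.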
4.3 Testing hypotheses about a single population parameter

In this section we show how the assumption of the normal distribution allows us to test hypotheses.

4.3.1 The t distribution

Recall equation (4.1) above:
β̂_j ~ Normal[ β_j, Var(β̂_j) ].

We claimed earlier that this result is central to hypothesis testing. Now is the time to explain why. We begin by standardising β̂_j; that is, we subtract β_j (note that β_j is the expected value of β̂_j, given that β̂_j is unbiased; see Section 2.4 in Chapter 2) and divide by sd(β̂_j) = √Var(β̂_j), the standard deviation of β̂_j. This transformed random variable follows a standard normal distribution:

( β̂_j − β_j ) / sd(β̂_j) ~ Normal(0, 1).        (4.3)
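The sampling distribution of β̂_j can be illustrated by Monte Carlo simulation: with normal errors, the OLS slope estimates from repeated samples are centred on the true β and their spread matches equation (4.2). A sketch with simulated data (all parameter values here are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 100, 5000
beta0, beta1, sigma = 1.0, 0.5, 1.0

x = rng.uniform(0, 10, n)            # regressor held fixed across samples
X = np.column_stack([np.ones(n), x])

slopes = np.empty(reps)
for r in range(reps):
    y = beta0 + beta1 * x + sigma * rng.standard_normal(n)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    slopes[r] = b[1]

# Theoretical sd from equation (4.2): simple regression, so R_j^2 = 0
sst_x = ((x - x.mean())**2).sum()
theoretical_sd = sigma / np.sqrt(sst_x)
```

The simulated slopes average close to the true β1 = 0.5, and their standard deviation is close to the theoretical value, as (4.1)–(4.3) predict.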
Suppose our null hypothesis is that β_j is equal to some specific value. We can then plug that hypothesised value for β_j into equation (4.3), and observe that the resulting ratio
[ β̂_j − (value of β_j under the null hypothesis) ] / sd(β̂_j)        (4.4)

should follow a standard normal distribution if the null hypothesis is true. For any random variable that follows a standard normal distribution, it is quite likely that we will observe realised values quite close to zero. It is quite unlikely that we will observe values larger than, say, 2 or lower than −2. And it is extremely unlikely that we will observe values larger than 3 or lower than −3. These claims are illustrated in Figure 4.2, which shows the probability density function for the standard normal distribution. The probability that a standard normal variable is higher than 2 is 0.023; the probability that it is higher than 3 is 0.0013.

Figure 4.2 The standard normal distribution

We can use these insights to reason as follows: (i) if the null hypothesis is true, it is unlikely that we would observe a ratio as defined in equation (4.4) that is far from zero; (ii) if we were to observe a value of the ratio defined in equation (4.4) that is far from zero, we should thus be inclined to think that the null hypothesis is not true. As we shall see, this type of reasoning is the basis for our formal hypothesis testing. To calculate the ratio in equation (4.4), we need to know the true standard deviation of the OLS estimator, sd(β̂_j), which is given by the square root of equation (4.2). As discussed in Chapters 2 and 3, σ² is an unknown population parameter, which implies that sd(β̂_j) is unknown. Equation (4.4) is hence not directly implementable. However, we can estimate σ², and consequently also sd(β̂_j), using our data. We use this estimate of sd(β̂_j), known as the standard error of β̂_j and written se(β̂_j), instead of sd(β̂_j) in equation (4.4). As a result, the ratio follows a t distribution rather than a standard normal distribution. Hence, under the CLM assumptions, (β̂_j − β_j)/se(β̂_j) follows the t distribution with n − k − 1 degrees of freedom (df):

(β̂_j − β_j)/se(β̂_j) ~ t_{n−k−1},        (4.5)
where k + 1 is the number of unknown parameters in the population model (k slope parameters and the intercept β0). As df grows, the t distribution becomes increasingly similar to the standard normal distribution. For large df, whether one uses the t distribution or the standard normal distribution is unimportant from a practical point of view. However, provided the CLM assumptions hold, to be strictly correct one should always use the t distribution.

4.3.2 The t-test

Suppose the null hypothesis is that the variable x_j has no effect on y, which we state formally as

H0: β_j = 0,

and the alternative hypothesis is that the variable x_j has a positive effect on y,

H1: β_j > 0.

Now, if H0 is true, it follows from equation (4.5) that the ratio of the OLS estimate β̂_j to its standard error follows a t distribution with n − k − 1 degrees of freedom:

t_{β̂_j} ≡ β̂_j / se(β̂_j).

In applied research, this ratio is often referred to as the t value or the t statistic. Based on the t value, we will decide whether or not to reject the null hypothesis in favour of the alternative hypothesis. Of course, we will never know for sure whether H0 or H1 is true, and whenever we decide which of these we prefer, we could get it wrong: if we reject H0, there is a chance that we are doing so even though H0 is actually true, and if we decide not to reject H0, there is a chance that we are doing so even though H0 is actually false. These types of mistake are known as errors of Type I (rejecting a true H0) and Type II (not rejecting a false H0). The Type I error, which is to claim that the null hypothesis – that the effect is zero – is false when it is actually true, is considered a bad mistake, and the convention in the literature is to adopt a rule for selecting between H0 and H1 that implies a low probability of a Type I error. The probability of a Type I error implied by our rule is called the significance level.
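The interpretation of the significance level as the Type I error rate can be checked by simulation: generate many samples in which H0: β_j = 0 is true, run a one-sided 5 per cent t-test each time, and the rejection rate should settle near 5 per cent. A sketch with simulated data, using the standard normal critical value 1.645 as a large-sample stand-in for the t critical value:

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps, c = 200, 10_000, 1.645   # one-sided 5% critical value (normal approx.)

x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)   # regressors fixed across samples

rejections = 0
for r in range(reps):
    y = 1.0 + 0.0 * x + rng.standard_normal(n)   # H0 is true: slope is zero
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    s2 = (resid @ resid) / (n - 2)               # unbiased error variance
    se_slope = np.sqrt(s2 * XtX_inv[1, 1])
    if b[1] / se_slope > c:                      # one-sided test of H1: slope > 0
        rejections += 1

rejection_rate = rejections / reps               # close to 0.05
```

Despite the slope estimate varying from sample to sample, a true null is wrongly rejected only about 5 per cent of the time, which is exactly what the significance level promises.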
In applied work, the most common significance levels are the 1 per cent, 5 per cent and 10 per cent levels. We now return to the task of distinguishing between H0: β_j = 0 and H1: β_j > 0. Suppose we decide to use a significance level of 5 per cent. This means that if we were to do hypothesis testing based on many samples drawn from a population for which H0 is true, we would not reject H0 95 per cent of the time (the correct decision) and
reject it 5 per cent of the time (Type I error). Under the null hypothesis that β_j = 0, the t value, that is, the estimated coefficient divided by the standard error, follows a t distribution. We know exactly what the t distribution looks like, and it is therefore straightforward to come up with a rule that implies a significance level of exactly (say) 5 per cent. Since the alternative hypothesis is that β_j is positive, we will consider t statistics that are sufficiently above zero as evidence that the alternative hypothesis, rather than the null hypothesis, is true. The testing procedure is as follows:

• Find the value c which is such that the probability of observing a t value at least equal to c is 5 per cent if H0 is true.
• Reject H0 if the t value is larger than c, and do not reject H0 if the t value is smaller than c.
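Applying this procedure to the educ coefficient in Regression (1) of Table 3.3 gives a clear-cut decision (with nearly 7,000 degrees of freedom, the one-sided 5 per cent t critical value is very close to the standard normal value 1.645):

```python
# Coefficient and standard error for educ, Regression (1), Table 3.3
coef, se = 0.1353827, 0.002615
t_value = coef / se        # the t statistic, about 51.77 as in the printout

c = 1.645                  # one-sided 5% critical value (normal approximation)
reject = t_value > c       # True: H0 of no education effect is rejected
```

A t value of about 52 is so far into the right tail that H0: β_educ = 0 is rejected at any conventional significance level.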
Critical values c for different significance levels and degrees of freedom can be found in some econometrics textbooks; see, for example, Table G.2, Appendix G, in Wooldridge (2013). Alternatively, they can be computed using the Stata function invttail(n,p), where the user specifies the degrees of freedom (n) and the significance level (p). Figure 4.3 shows t distributions for degrees of freedom equal to 30 (panels a and b) and 200 (panels c and d), and highlights the critical values associated with a 5 per cent significance level (panels a and c) and a 10 per cent significance level (panels b and d). For example, if our sample consists of 203 observations (n = 203) and our model contains two explanatory variables (k = 2), we have 200 degrees of freedom. In this case we would reject H0 in favour of H1 if and only if we obtain a t value larger than 1.653 (panel c). Had we been using a significance level of 10 per cent, it would be enough to obtain a t value larger than 1.286 in order to reject H0 in favour of H1. Because of how the alternative hypothesis was specified above, it is sufficient to consider only the right tail of the t distribution when determining the critical value. If we want to distinguish between H0: β_j = 0 and H1: β_j < 0 (that is, with the sign reversed under the alternative hypothesis) we can use a very similar approach: do not reject the null hypothesis if the t value is higher than −c, and reject it if it is lower than −c. That is, we consider only the left tail of the t distribution. For obvious reasons these types of test are known as one-sided (or one-tailed) tests. Underlying the one-sided tests is the assumption that the sign of β_j is known under the alternative hypothesis. In applied work this is actually quite an unusual approach. It is much more common to specify the alternative hypothesis as follows:

H1: β_j > 0 or β_j < 0,

which can be written simply as

H1: β_j ≠ 0.

That is, we are testing the null hypothesis that x_j has no effect on y (H0: β_j = 0) against the alternative hypothesis that x_j has an effect on y (H1: β_j ≠ 0). This changes the testing procedure compared to the one-sided test in one important respect: we now interpret large positive values and large negative values of the t statistic as evidence that the null hypothesis is false. That is, we need to consider realisations of the t statistic in both tails
OLS estimators and hypothesis testing
53
of the t distribution. This test is therefore known as a two-sided (or two-tailed) test. A convenient feature of the t distribution is that it is symmetrical around zero. This implies that outcomes higher than some constant z are always as likely as outcomes lower than −z. Sticking to a significance level of 5 per cent we thus modify our testing procedure as follows:

• Find the value c such that the probability of obtaining a t value larger than c, or smaller than −c, is 5 per cent if H0 is true.
• Reject H0 if the t value is larger than c or smaller than −c; otherwise do not reject H0.
This rejection rule can alternatively be written:

• Reject H0 if the absolute value of the t statistic is larger than c: |t_β̂j| > c; otherwise do not reject H0.
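The critical values used in these rejection rules can be reproduced without statistical tables. The sketch below is in Python rather than the chapter's Stata (the function names are ours): it integrates the Student-t density numerically and bisects for the value c with P(T > c) = α, which is the quantity Stata's invttail(df, p) returns; for a two-sided test at level α, use α/2.

```python
import math

def t_pdf(t, df):
    # density of the Student-t distribution with df degrees of freedom
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return const * (1 + t * t / df) ** (-(df + 1) / 2)

def t_tail(c, df, upper=40.0, steps=2000):
    # P(T > c) via Simpson's rule on [c, upper]; the tail beyond `upper` is negligible
    h = (upper - c) / steps
    total = t_pdf(c, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += t_pdf(c + i * h, df) * (4 if i % 2 else 2)
    return total * h / 3

def t_critical(df, alpha):
    # solve P(T > c) = alpha by bisection; equivalent to Stata's invttail(df, alpha)
    lo, hi = 0.0, 10.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if t_tail(mid, df) > alpha:
            lo = mid            # tail probability still too big: c must be larger
        else:
            hi = mid
    return (lo + hi) / 2

print(t_critical(200, 0.05))    # ~ 1.653, the one-sided 5% value for 200 df
print(t_critical(200, 0.025))   # ~ 1.972, the two-sided 5% value for 200 df
```

For 200 degrees of freedom this recovers the 1.653 (one-sided, 5 per cent) and 1.286 (one-sided, 10 per cent) values highlighted in Figure 4.3, and the 1.972 two-sided value used with Figure 4.4.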
Figure 4.4 illustrates the two-sided t-test for 200 degrees of freedom with a 5 per cent significance level. The area under the curve to the right of 1.972 combined with the area under the curve to the left of −1.972 corresponds to a probability of 5 per cent. Hence, in this case we would reject H0 in favour of H1 at the 5 per cent significance level if and only if we obtain an absolute value of the t statistic larger than 1.972. It is straightforward to test the null hypothesis that βj is equal to some constant other than zero. Under the null hypothesis that βj = aj, where aj is any constant you like, (β̂j − aj)/se(β̂j) ~ t(n − k − 1). Hence to calculate the t value we simply subtract aj from β̂j, divide by the standard error, and proceed along the lines explained previously.

4.3.3 Confidence intervals

Based on our OLS estimate β̂j and the standard error se(β̂j), we can construct a confidence interval for the population parameter βj. Constructing a 95 per cent confidence interval involves calculating two values, β̲j and β̄j, which are such that if random samples were obtained many times, with the confidence interval (β̲j, β̄j) computed each time, then 95 per cent of these intervals would contain the unknown population value βj. Equivalently, if we were to do a two-tailed t-test of the null hypothesis that βj is equal to a specific value between β̲j and β̄j we would not reject the null hypothesis using a 5 per cent significance level (or lower), whereas if we were to do a t-test of the null hypothesis that βj is equal to a specific value outside the (β̲j, β̄j) interval we would reject the null hypothesis at the 5 per cent significance level. To calculate the values β̲j and β̄j we start from the above result that (β̂j − aj)/se(β̂j) ~ t(n − k − 1) under the null hypothesis that βj = aj, where aj is a constant.
As discussed in the previous section, observed t values that are either larger than the critical value c or smaller than −c imply rejection of the null hypothesis (at the underlying significance level) in favour of the alternative hypothesis that βj ≠ aj. Consequently, t values between −c and c imply that we do not reject the null hypothesis against the alternative. The 95 per cent confidence interval includes all values aj that result in a t statistic between −c and c as implied by a 5 per cent significance level. β̲j will be the lowest, and β̄j the highest, of these values. Hence,
54 Linking models to data for development

Figure 4.3 Rejection rules for H1: βj > 0
Panels: (a) 5% significance level, with 30 df; (b) 10% significance level, with 30 df; (c) 5% significance level, with 200 df; (d) 10% significance level, with 200 df. Horizontal axis: t.
Note: The critical values are as follows: (a) 1.697; (b) 1.310; (c) 1.653; (d) 1.286.

Figure 4.4 200 df and 5% significance level: rejection rule for H1: βj ≠ 0
Critical value for absolute value of t statistic: 1.972.
β̲j = β̂j − c · se(β̂j)

and

β̄j = β̂j + c · se(β̂j).

While the 95 per cent confidence interval is the most common in applied work (and the one reported by default in Stata), we can easily obtain confidence intervals associated with alternative significance levels. For example, if we want a 90 per cent confidence interval we simply use the value of c that corresponds to the 10 per cent significance level, and compute β̲j and β̄j as above.
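As a check on the arithmetic, the interval for the exper coefficient reported in Table 5.3 Regression (1) can be reproduced from these formulae. A minimal Python sketch (the function name is ours; c ≈ 1.9603 is the two-sided 5 per cent critical value for 6963 degrees of freedom):

```python
def conf_interval(beta_hat, se, c):
    # (lower, upper) = beta_hat -/+ c * se, where c is the two-sided critical value
    return beta_hat - c * se, beta_hat + c * se

# exper coefficient and standard error from Table 5.3 Regression (1)
lo, hi = conf_interval(0.0603432, 0.0033491, 1.9603)
print(lo, hi)  # close to Stata's reported interval [.0537779, .0669086]
```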
4.4 Testing for the overall significance of a regression

The t-test is useful when we want to test a hypothesis regarding a single parameter in our model. If we want to test a hypothesis that refers to several of the parameters in the model, we can use an F test. This test involves comparing the sum of squared residuals from our unrestricted model with the sum of squared residuals from a restricted version of the model in which some parameters are restricted according to the null hypothesis. The F statistic is defined as follows:

F ≡ [(SSR_r − SSR_ur)/q] / [SSR_ur/(n − k − 1)],
where SSR_r is the sum of squared residuals of the restricted model; SSR_ur is the sum of squared residuals of the unrestricted model; q is the numerator degrees of freedom and equals the number of restrictions imposed; n − k − 1 is the denominator degrees of freedom, where k is the number of slope parameters in the unrestricted model. Since the OLS estimator minimises the SSR, SSR_r must be bigger than SSR_ur, so the F statistic is positive, and it clearly gets larger as SSR_r increases relative to SSR_ur. In fact the F statistic reported in the Stata regress output is a special case of this general statistic, as it provides a test of the null hypothesis

H0: β1 = β2 = β3 = ... = βk = 0,

where the alternative hypothesis H1 is that one or several of the βj are different from zero (indeed it would be perfectly fine to write the alternative hypothesis as 'H0 not true' in this case). There are clearly k restrictions under the null hypothesis, and if these are imposed we obtain a model without explanatory variables: y = β0 + u. Since none of the variation in y is being explained by a model that has no explanatory variables, the R-squared from estimating this restricted model is precisely zero. Using the fact that SSR_r = SST(1 − R²_r) and SSR_ur = SST(1 − R²_ur) we can now write our F statistic as:
F = [(R²_ur − R²_r)/q] / [(1 − R²_ur)/(n − k − 1)].
Therefore, the F statistic for testing the null that all the parameters of the model are zero can be written as

F = (R²/k) / [(1 − R²)/(n − k − 1)],
where R² is the R-squared from the regression of y on (x1, x2, ..., xk), which in this case is the unrestricted model. Provided the CLM assumptions hold, it can be shown that, if H0 is true, the F statistic follows an F distribution with (q, n − k − 1) degrees of freedom. We write this as F ~ F(q, n − k − 1). Hence, if and only if the observed F statistic exceeds some critical value c, we would reject the null hypothesis in favour of the alternative hypothesis. The critical values for different significance levels and degrees of freedom are tabulated in some econometrics textbooks; see, for example, tables G.3a–c, Appendix G, in Wooldridge (2013). Alternatively they can be computed using the Stata function invFtail(n1,n2,p), where the user specifies the numerator degrees of freedom (n1; this is q in our notation), the denominator degrees of freedom (n2; n − k − 1 in our notation) and the significance level (p). The p-value associated with an F test indicates the lowest significance level at which the null hypothesis can be rejected. Looking at the F value in our Stata output shown in Table 3.1 and the associated p-values, we see that we can reject the null that all the parameter values are zero at any significance level. You can also see in our Stata output that as well as the R² we also have an adjusted R². The formula for the adjusted R² is

R̄² = 1 − (SSR/(n − k − 1)) / (SST/(n − 1)).

What are we doing when we change the formula for R² in this way? In Chapter 2 we defined the R² as:

R² = SSE/SST = 1 − SSR/SST.

Thus the R² is an estimate of how much of the variation in y is explained by the x variables. We can define the population R-squared as
ρ² = 1 − σ²/σ²_y.

If we rewrite the formula for R² as

R² = 1 − (SSR/n)/(SST/n),

we see that what we were implicitly doing is to provide estimates of the population variances of interest that are biased. What the formula for the adjusted R² given above does is to replace these biased estimates with unbiased ones, so we have:
R̄² = 1 − (SSR/(n − k − 1)) / (SST/(n − 1)) = 1 − σ̂²/(SST/(n − 1)).
It is important to remember that while both the numerator and denominator are unbiased estimators of their respective population parameters, it does not follow that R̄² is an unbiased estimator of the population R-squared. Technically, this is because the ratio of two unbiased estimators of moments is not an unbiased estimator of the ratio of those moments. The main rationale for preferring the adjusted R² is that it imposes a penalty for including additional regressors. Indeed, unlike the conventional R², the adjusted R² may actually fall as a result of adding an explanatory variable to the model if that variable has only a modest effect on the SSR. For this reason, the adjusted R² is sometimes used as the basis for model selection.
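Both the overall F statistic and the adjusted R² can be computed directly from R², n and k. A short Python sketch (the function names are ours), fed the values reported for the earnings regression in Table 5.3 (R² = 0.3620, n = 6968, k = 4); small discrepancies from the Stata output arise from rounding R² to four decimal places:

```python
def f_overall(r2, n, k):
    # F statistic for H0: all k slope parameters are zero
    return (r2 / k) / ((1 - r2) / (n - k - 1))

def adj_r2(r2, n, k):
    # adjusted R-squared: 1 - (SSR/(n-k-1))/(SST/(n-1)) = 1 - (1-R^2)(n-1)/(n-k-1)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(f_overall(0.3620, 6968, 4))  # ~ 987.7 (Stata reports 987.89 from the unrounded R^2)
print(adj_r2(0.3620, 6968, 4))     # ~ 0.3616 (Stata reports 0.3617)
```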
4.5 Testing for heteroskedasticity

In Chapter 2 we rather informally pointed out that, based on Figure 2.3, it appears that the residuals in the earnings equation are not homoskedastic. In fact it is straightforward to conduct a formal test for heteroskedasticity. We define the null hypothesis as

H0: Var(u | x1, x2, ..., xk) = σ²,

which implies homoskedasticity, and test it against an alternative hypothesis under which there is heteroskedasticity. Since we assume E(u | x) = 0, it follows that Var(u | x) = E(u² | x), hence the null hypothesis of homoskedasticity can be stated as

H0: E(u² | x1, x2, ..., xk) = E(u²) = σ².

To obtain an empirical framework suitable for testing for heteroskedasticity, we assume that E(u² | x1, ..., xk) is related to our explanatory variables in the following way:

E(u² | x1, ..., xk) = δ0 + δ1x1 + δ2x2 + ... + δkxk,

which suggests the following regression model

u² = δ0 + δ1x1 + δ2x2 + ... + δkxk + v,

where v is assumed independent of the xj. Clearly if any of the δ1, δ2, ..., δk are different from zero, the variance of u depends on the associated x-variable(s), and the residual is heteroskedastic. The null hypothesis of homoskedasticity is therefore stated as

H0: δ1 = δ2 = δ3 = ... = δk = 0,

and we test this null hypothesis against the alternative hypothesis that H0 is not true using an F test. It should be noted that u² cannot be normally distributed (it can never take negative values, for example). Hence, the error term v is not normally distributed, so the assumptions of the classical linear model are not fulfilled. However, as discussed in Section 4.6, results from asymptotic theory suggest that the F test is likely to remain reliable in large samples even if the normality assumption is violated.
There is one final step: as we do not observe u², the dependent variable in the regression has to be based on the estimated residual. We thus write the equation to be estimated as:

û² = δ0 + δ1x1 + δ2x2 + ... + δkxk + error,

and compute the F statistic for the joint significance of x1, x2, ..., xk. The F statistic depends on the R-squared from the regression in a similar way to the test given in the last section, that is,

F = (R²_û²/k) / [(1 − R²_û²)/(n − k − 1)].
This R² also appears in another test statistic called the Lagrange Multiplier (LM) statistic, which is of the form:

LM = n · R²_û².

Under the null hypothesis of homoskedasticity, LM is distributed asymptotically as chi-squared with k degrees of freedom, χ²(k). This LM version of the test is typically called the Breusch–Pagan (B–P) test for heteroskedasticity and is the one we report from Stata. You may wonder whether it is not a problem that the true unobserved residual u is replaced by the OLS residual û in the testing procedure described above. Indeed, to justify this procedure we have to appeal to a theoretical result which is that, in the limit, as n tends to infinity, using the OLS residuals in place of the true unobserved residuals does not affect the distribution of the F statistic. This suggests that we should interpret outcomes of the above tests with some caution if our sample size is relatively small.
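The mechanics of the LM statistic can be sketched with simulated data. This Python illustration (the names and the data-generating process are ours, not the book's data) fits a simple regression, regresses the squared residuals on the regressor, and forms LM = n·R²:

```python
import random

def ols(x, y):
    # simple-regression OLS: returns (intercept, slope)
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum((xi - xbar) ** 2 for xi in x)
    return ybar - b1 * xbar, b1

def r_squared(x, y):
    # R-squared of the simple regression of y on x
    b0, b1 = ols(x, y)
    ybar = sum(y) / len(y)
    ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return 1 - ssr / sst

random.seed(1)
n = 2000
x = [random.uniform(0, 10) for _ in range(n)]
y = [1 + 0.5 * xi + random.gauss(0, 0.2 + 0.3 * xi) for xi in x]  # Var(u|x) rises with x

b0, b1 = ols(x, y)
u2 = [(yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)]  # squared OLS residuals
lm = n * r_squared(x, u2)                                # LM = n * R^2 from u^2 on x
print(lm)  # far above 3.84, the chi-squared(1) 5% critical value
```

With one regressor in the auxiliary equation (k = 1) the statistic is compared to a χ²(1) distribution, whose 5 per cent critical value is 3.84; here the built-in heteroskedasticity is detected decisively.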
4.6 Large sample properties of OLS

4.6.1 Consistency

Up to this point we have been stressing the importance of having unbiased estimators of the population parameters of interest. However, in several of the chapters that follow we use estimators that are biased because no unbiased estimator is available. So we need a criterion that allows us to assess whether a biased estimator may still be useful. Consistency is such a property, and it concerns how the estimator behaves as the sample size increases – conceptually – to infinity. In order to understand what we mean by a consistent estimator, we need the concept of a probability limit. Let Wn be an estimator of θ based on a sample {Y1, Y2, ..., Yn} of size n. Then Wn is a consistent estimator of θ if, for every ε > 0,

P(|Wn − θ| > ε) → 0 as n → ∞.

An example of a consistent estimator is the sample mean of some variable Y for a random sample drawn from a population with mean µ and variance σ². First of all, the sample mean is an unbiased estimator of the population mean:
E(Ȳ) = E[(1/n) Σ Yi] = (1/n) Σ E(Yi) = (1/n) Σ µ = (1/n)·nµ = µ.

So how can we show it is also consistent? It can be shown that Var(Ȳn) = σ²/n, so as n → ∞ then Var(Ȳn) → 0; thus Ȳn is also a consistent estimator of µ. This result is an implication of the Law of Large Numbers, which states: let Y1, Y2, ..., Yn be independent, identically distributed (i.i.d.) random variables with mean µ; then plim(Ȳn) = µ. There is one particular property of probability limits that is very important for enabling us to show that an estimator is consistent. If Wn is a consistent estimator of θ and γ = g(θ), where g is some one-to-one, continuous and continuously differentiable function, then g(Wn) is a consistent estimator of γ. This is often stated as:

plim g(Wn) = γ = g(θ) = g(plim Wn).

More generally, the probability limit of a function of several random variables is equal to the function evaluated at the probability limits of the random variables. Two specific applications of this result that will be useful to us are:

plim(Q1 · Q2) = plim(Q1) × plim(Q2),
plim(Q1/Q2) = plim(Q1)/plim(Q2).

Now let us consider the case of our OLS estimator β̂1. This estimator has, for any sample of size n, a probability distribution. Under the standard assumptions, this estimator is unbiased – so the mean of the distribution is β1. If the estimator is consistent then, in the limit as n goes to infinity, the probability that the estimator differs from the true parameter value β1 by more than some (small) constant approaches zero. Using our definition of consistency given previously, we can write this more formally as:
lim P(|β̂1(n) − β1| > ε) = 0 as n → ∞.
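This definition can be illustrated by simulation. In the Python sketch below (ours, not the book's), Ȳn is the mean of n draws from a skewed distribution with mean µ = 2, and a Monte Carlo estimate of P(|Ȳn − µ| > ε) falls towards zero as n grows:

```python
import random

random.seed(0)
mu = 2.0

def sample_mean(n):
    # mean of n i.i.d. draws from an exponential distribution with mean mu
    return sum(random.expovariate(1 / mu) for _ in range(n)) / n

def prob_far(n, eps=0.25, reps=1000):
    # Monte Carlo estimate of P(|Ybar_n - mu| > eps)
    return sum(abs(sample_mean(n) - mu) > eps for _ in range(reps)) / reps

p10, p100, p1000 = prob_far(10), prob_far(100), prob_far(1000)
print(p10, p100, p1000)  # the probability of a large deviation shrinks with n
```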
To show the consistency of our OLS estimator, recall the formula for the simple regression model:
β̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²,

where the sums run over i = 1, ..., n. If we substitute the equation for yi we get

β̂1 = β1 + [n⁻¹ Σ (xi − x̄)ui] / [n⁻¹ Σ (xi − x̄)²].
We can now use one of our results from probability limits to write:
plim(β̂1) = β1 + Cov(x1, u)/Var(x1),

which reduces to

plim(β̂1) = β1

if x1 and u are uncorrelated. Recall that we had to assume that u is mean-independent of x in order to show that the OLS estimator is unbiased. We have just seen that in order to show that the OLS estimator is consistent it is enough to assume that x and u are uncorrelated, which is a weaker assumption than mean independence. For example, zero covariance between u and x does not rule out the possibility that the conditional expectation of u depends on a nonlinear function of x, but zero conditional mean does. See Chapter 5 in Wooldridge (2013) for details on the distinction between zero conditional mean and zero covariance.

4.6.2 Asymptotic normality

Tests for normality are rarely presented in empirical work. One reason is that the sampling distribution of the OLS estimator in the limit, as the sample size approaches infinity, can be derived without the normality assumption, provided the assumptions under which OLS is BLUE hold (see Chapter 2). Under these assumptions, the central limit theorem and the law of large numbers imply that: (i) the asymptotic distribution of β̂1 is normal; (ii) σ̂² is a consistent estimator of the variance of the population residual; and (iii) (β̂1 − β1)/se(β̂1) is asymptotically standard normal (see Wooldridge, 2013, Chapter 5 for details). This is a remarkable result. It provides a large-sample justification for continuing to test hypotheses based on the conventional F and t statistics regardless of whether the residual u is normally distributed. Note that the result is not saying that as the sample size gets larger, the errors will get closer to a normal distribution. The result says that, provided the assumptions set out above are met, then, in the limit as n tends to infinity, the OLS estimator β̂1 will follow a normal distribution whatever the underlying distribution of the ui.
The two most important assumptions that matter for this result are the zero conditional mean assumption for the ui and that the errors are homoskedastic. The intuition for this result is clear – even if the underlying algebra is complex. Provided the ui are identically and independently distributed, the central limit theorem can be used to show that the distribution of the OLS estimator will tend to the normal distribution. It is the requirement that the ui be identically distributed that makes the homoskedasticity assumption necessary. Clearly, if the ui are in fact dependent on the x, then they cannot be independently distributed.
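The asymptotic-normality claim can be checked by simulation. In the Python sketch below (ours, not from the book), the errors are drawn from a centred exponential distribution – skewed and clearly non-normal, but homoskedastic – yet the standardized OLS slope falls within ±1.96 close to 95 per cent of the time:

```python
import random

random.seed(42)
beta0, beta1, n, reps = 1.0, 0.5, 200, 2000
x = [random.uniform(0, 10) for _ in range(n)]   # regressors held fixed across replications
xbar = sum(x) / n
sst_x = sum((xi - xbar) ** 2 for xi in x)
sd = (1.0 / sst_x) ** 0.5                        # true sd of the slope: sigma/sqrt(SST_x), sigma = 1

def slope_once():
    # one draw of the OLS slope with skewed, mean-zero, variance-one errors
    y = [beta0 + beta1 * xi + (random.expovariate(1.0) - 1.0) for xi in x]
    ybar = sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sst_x

coverage = sum(abs((slope_once() - beta1) / sd) <= 1.96 for _ in range(reps)) / reps
print(coverage)  # close to 0.95 despite the non-normal errors
```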
References

Hendry, D. F. and Nielsen, B. (2007) Econometric Modeling: A Likelihood Approach, Princeton University Press.
Jarque, C. M. and Bera, A. K. (1987) 'A test for normality of observations and regression residuals', International Statistical Review, 55(2): 163–172.
White, H. (1980) 'A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity', Econometrica, 48(4): 817–838.
Wooldridge, J. M. (2013) Introductory Econometrics: A Modern Approach, Fifth Edition, South-Western Cengage Learning.
Exercise

Using 'Labour_Force_SA_SALDRU_1993' answer the following questions:

1. What was the average rate of broad unemployment in South Africa in 1993?
2. How did this unemployment rate differ between white and black South Africans? Present as a table and as a regression of unemployment on race and interpret the coefficients on the race dummy.
3. Estimate a linear probability model for unemployment (the linear probability model is simply OLS with a binary dependent variable) as a function of 'otherinc' only.
4. Test for normality and whether the errors are homoskedastic and report your results.
5. What were the significant (in both the statistical and economic sense of this term) determinants of unemployment for black South Africans in 1993?

Using 'Macro_1980_2000_PENN61.dta':

6. Provide and interpret tests for normality and heteroskedasticity for:

log(V/L)i = β0 + β1·log(K/L)i + λ·time + ui

Δlog(V/L)i = β0 + β1·Δlog(K/L)i + ui

7. Compare these tests and provide reasons why they might differ.
5
The determinants of earnings and productivity
5.1 Introduction

We now propose to use the results we have set out in the previous chapters to test our earnings and production functions. We began the last chapter by setting out how central the normality assumption was for making inferences about the distribution of our parameter estimates. We concluded the chapter with a theoretical result which states that in the limit, as the sample size n approaches infinity, the distribution of the OLS estimators can be derived without the normality assumption. The critical assumption for that result to hold is that the ui are identically and independently distributed, which holds under homoskedasticity. Now, of course, nobody actually has data in which the number of observations strictly approaches infinity, but researchers with large data sets can nevertheless appeal to the above limit argument with some credibility. In the next section we set out how to test for the normality assumption and why an appeal to the limit condition should be used with caution. We then return to our earnings function in Section 5.3 and consider the implications for the standard errors if, as will turn out to be the case, the homoskedasticity assumption fails to hold. In Section 5.4 we both test and extend our production function; in doing so we anticipate some of the analysis of panel data which is the subject of Section IV. Finally, in Section 5.5 we consider the implications of our tests for the effects of education on earnings and labour productivity.
5.2 Testing the normality assumption

It is relatively rare for tests of normality to be presented as part of the statistical diagnostics of a model. The reason, discussed in the introduction above, is the appeal to the asymptotic result. Despite this, there are good reasons to be concerned if the normality assumption appears not to hold. First, the smaller our data set, the less convincing it is to appeal to a theoretical result which characterises the estimator in the limit as the number of observations tends to infinity. It may be nice to know that our method has certain attractive large-sample properties, but if we have a small sample the practical relevance of such results may be limited. Second, and this echoes the point already made concerning the absence of homoskedastic residuals, the deviation from normality may be an important clue as to why our model is misspecified. In Figure 4.1 of the last chapter we showed that the distribution of the natural log of earnings was far closer to the normal distribution than the level. Now that does not necessarily imply that the errors follow the same pattern
Figure 5.1 Residuals from the simple regression model
(Two panels of residual densities; vertical axis: density, horizontal axis: residuals.)
as the dependent variable, so it is sensible for us to check that the residuals from the two specifications display similar characteristics to the dependent variables themselves. Figure 5.1 shows the distribution of residuals from a simple regression of earnings on education, where in the left-hand column the dependent variable in the regression was the level and in the right-hand column it was the natural log. Indeed, we see for the residuals a very similar pattern as for the dependent variables. As with Figure 4.1, the residuals from the semi-logarithmic specification are much closer to normality, but we need to test if the deviations from normality we can see in the figure are statistically significant. To formally test deviations from normality we need to know how far the actual distribution deviates from that which would be expected if the residuals were normally distributed. To do that we investigate the skewness of the distribution (that is, how far it is from being symmetrical) and its degree of kurtosis (essentially how fat the tails are). The skewness depends on the third moment of the distribution,

κ3 = E[(u/σ)³],

which if the distribution is normal will be zero. The degree of kurtosis depends on the fourth moment, which has a value of 3 if the distribution is normal, so in testing for the normality of the residuals we refer to excess kurtosis and define

κ4 = E[(u/σ)⁴] − 3,

which, again, should be zero if the distribution is normal. Tests for normality take as the null that

κ3 = 0 and κ4 = 0,

while the alternative hypothesis is that the null hypothesis is not true. Tests for normality can be constructed using the sample analogues of these population quantities. The test statistics are:
Table 5.1 Testing the normality assumption

. sum res_logwphy if e(sample)==1, detail

                          Residuals
-------------------------------------------------------------
      Percentiles      Smallest
 1%     -2.25563      -4.447034
 5%    -1.562273      -4.111961
10%    -1.147822      -3.819869       Obs                6968
25%     -.558725      -3.770803       Sum of Wgt.        6968
50%     .0140132                      Mean           1.81e-09
                        Largest       Std. Dev.      .9399471
75%     .5311628       4.159716
90%     1.100464       4.419048       Variance       .8835006
95%     1.560809       4.464937       Skewness       .1659429
99%     2.592747       4.589489       Kurtosis       4.246579

. display r(skewness)
.16594286

. display r(kurtosis)
4.2465788

. scalar S=r(skewness)
. scalar KK=r(kurtosis)
. scalar N=r(N)
. scalar JB=N*(S^2/6 + (KK-3)^2/24)
. scalar list JB
        JB =  483.14569
. disp chiprob(2,JB)
1.22e-105

. sktest res_logwphy

                 Skewness/Kurtosis tests for Normality
                                               ------- joint ------
    Variable |  Obs  Pr(Skewness)  Pr(Kurtosis)  adj chi2(2)  Prob>chi2
-------------+---------------------------------------------------------
 res_logwphy | 7.0e+03    0.0000        0.0000          .       0.0000

. swilk res_logwphy

                  Shapiro-Wilk W test for normal data
    Variable |    Obs       W          V         z      Prob>z
-------------+------------------------------------------------
 res_logwphy |   6968    0.98828    42.613     9.944    0.00000
χ²_skew = nκ̂3²/6,

χ²_kurt = nκ̂4²/24,

χ²_normal = χ²_skew + χ²_kurt.

We construct tests by comparing these statistics to χ² distributions: χ²_skew and χ²_kurt each have 1 degree of freedom, and χ²_normal has 2 degrees of freedom; see Hendry and Nielsen (2007: 128–9). In Table 5.1 we provide the tests which can be computed in Stata. The joint test statistic which combines the tests for skewness and kurtosis is often referred to as the Jarque–Bera (1987) test. In Table 5.1 we begin by setting out a summary of the residuals from the simple earnings function in Table 3.3 Regression (1) and showing how the statistics for skewness and kurtosis can be displayed and then stored by Stata so the Jarque–Bera test can be computed. Stata does have preprogrammed commands that can be used to test for normality. One which simply looks at the skewness and kurtosis is called sktest, while a
test for normality due to Shapiro and Wilk is called swilk. The p-values reported in the table have a similar interpretation to that which we have already given. We can thus infer from the very low p-values that the null hypothesis of normality should be rejected at any conventional significance level.
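The Jarque–Bera calculation shown in Table 5.1 is easy to reproduce by hand. A one-function Python version (the function name is ours), fed the skewness, kurtosis and sample size from the table:

```python
def jarque_bera(skew, kurt, n):
    # JB = n * (skew^2/6 + (kurt - 3)^2/24), asymptotically chi-squared(2) under normality
    return n * (skew ** 2 / 6 + (kurt - 3) ** 2 / 24)

# moments reported in Table 5.1 for the earnings-function residuals
print(jarque_bera(0.1659429, 4.246579, 6968))  # ~ 483.15, matching the Stata scalar JB
```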
5.3 The earnings function

5.3.1 Bringing the tests together

We now return to our most general specification of the earnings function – Regression 4 in Table 3.3:

log ŵ_L(i) = −0.14 − 0.025·educ_i + 0.013·educ_i² + 0.060·exper_i − 0.0009·exper_i²
            [0.05]  [0.01]         [0.0006]        [0.003]         [0.00006]
The standard errors reported in the [ ] parentheses are those computed as part of the OLS estimation procedure in Stata. For the formulae used to produce these estimates to be correct we need the assumption of homoskedasticity. For the properties of the sampling distribution of the estimators we need either to assume normality, in which case we know the exact distribution of the estimators, or to appeal to our results from asymptotic theory that the distribution approximates the normal for a sufficiently large sample. Figure 5.2 shows the residuals as a function of years of education. As we can see from Figures 5.1 and 5.2, both normality and homoskedasticity are very unlikely to characterise our data. In Table 5.2 we present some formal tests from Stata which show this is the case. In the left-hand part of Table 5.2 we present the test for normality and in the right-hand panel we present the Stata code and results for the Breusch–Pagan test for homoskedasticity. It should be noted that the tests decisively reject, at any conventional significance level, the null hypotheses of normality and homoskedasticity. In the light of such statistical results, how are we to interpret our model?

5.3.2 Robust and clustered standard errors

One option open to us, which we introduced in Section 2.4.2, is to correct the standard errors to allow for the heteroskedasticity. Recall the formula for robust standard errors due to White (1980), which is valid under heteroskedasticity of any form:

V̂ar(β̂1) = Σ (xi − x̄)² ûi² / SST_x².    (2.24)
In fact, equation (2.24) is valid under homoskedasticity as well as under heteroskedasticity, which rather begs the question: why would we ever use the conventional formula, which is only valid under homoskedasticity? The answer is that the variance estimator in equation (2.24) is consistent but not unbiased. The conventional variance estimator, in contrast, is consistent as well as unbiased under the null of homoskedasticity (but biased and inconsistent under heteroskedasticity). The upshot is that if we apply
Table 5.2 Testing normality and homoskedasticity

. scalar list JB S KK N
        JB =  714.51017
         S =  .22251763
        KK =  4.5043082
         N =       6968
. disp chiprob(2,JB)
7.02e-156

. swilk res_logwphy

                  Shapiro-Wilk W test for normal data
    Variable |    Obs       W          V         z      Prob>z
-------------+------------------------------------------------
 res_logwphy |   6968    0.98492    54.799    10.610    0.00000

. estat hettest, rhs

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: educ educ_sq exper exper_sq

         chi2(4)      =   326.86
         Prob > chi2  =   0.0000
Figure 5.2 The residuals for the log earnings function
(Horizontal axis: years of education; vertical axis: residuals.)
equation (2.24) to a small sample we may obtain standard errors that are in fact very misleading. In Table 5.3 Regression (1) we first re-present the most general form of the earnings function with the standard errors, which are correct only if the errors are homoskedastic, and in Regression (2) we present the same regression with robust standard errors. However, that is not our final result. As we explained at the end of Chapter 2, our data is not a simple random sample but a stratified one. So we must allow for the possible correlation of the errors within the sample. One option is to use the cluster option in Stata. This procedure produces estimates of the standard errors that are robust to both heteroskedasticity and correlation in the error term across observations within strata. The Stata syntax is straightforward, for example: regress yvar xvar, cluster(strata). The formula is given in Section 9.2.2. The residuals of the regression are used to create an estimator of the standard errors, which can be shown to be consistent allowing for the pattern of heteroskedasticity. The formulation assumes the residuals to be uncorrelated across, but not within, strata. Regression (3) in Table 5.3 allows for clustering the standard errors. It is apparent that allowing for the clustering of the data does increase the standard errors, and we see that the linear education term in the earnings function is no longer statistically significant.
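Equation (2.24) can be sketched numerically. This Python simulation (ours – the names and the data-generating process are hypothetical, not the book's data) generates strongly heteroskedastic errors, then compares the conventional and White robust standard errors for the slope:

```python
import random

random.seed(7)
n = 5000
x = [random.uniform(0, 10) for _ in range(n)]
# error sd rises sharply with x, so the data are heteroskedastic by construction
y = [1 + 0.5 * xi + random.gauss(0, 0.05 + 0.1 * xi ** 2) for xi in x]

xbar, ybar = sum(x) / n, sum(y) / n
sst_x = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sst_x
b0 = ybar - b1 * xbar
u = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]

# conventional variance estimator: sigma_hat^2 / SST_x (valid only under homoskedasticity)
sigma2 = sum(ui ** 2 for ui in u) / (n - 2)
se_conv = (sigma2 / sst_x) ** 0.5
# White (1980) estimator, equation (2.24): sum (xi - xbar)^2 u_hat_i^2 / SST_x^2
se_robust = (sum((xi - xbar) ** 2 * ui ** 2 for xi, ui in zip(x, u)) / sst_x ** 2) ** 0.5

print(se_conv, se_robust)  # the robust standard error is noticeably larger here
```

As with the earnings function in Table 5.3, allowing for heteroskedasticity changes the standard errors while leaving the coefficient estimates untouched.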
5.4 The production function

5.4.1 Testing the production function

What of the properties of the residuals and the sampling distribution of our estimators for the production function we presented in Table 3.4 Regression (2)? In Figure 5.3 we present the histogram for the residuals in the left-hand panel, and we plot the residuals against the log of capital per capita in the right-hand panel to illustrate if there is a suggestion of heteroskedasticity. The same set of tests as we performed for our earnings function are shown in Table 5.4. These show that at the 5 per cent significance level we fail to reject the null of normality but we decisively reject the null of homoskedasticity. In the light of these tests, we next present in Table 5.5 our human capital augmented production function with heteroskedasticity-robust standard errors. We see that as a result of allowing for the heteroskedasticity the standard errors rise and the education variable, while significantly different from zero at the 5 per cent level, is not significant at the 1 per cent level. However, we need to be cautious as our sample size is small and the variance estimator that underlies the robust standard errors is consistent but not unbiased. Usually extending our sample is not feasible with micro data, but it may be with macro data. Up to this point we have confined ourselves to one cross section. In Section 5.4.2 we combine two cross sections. In doing so we take a step towards the panel data which we present in Section III.

5.4.2 Extending the production function

In Chapter 1 we introduced the human capital augmented Cobb–Douglas production function of the form given in equation (1.15):

V_it = K_it^α (A_it·H_it)^(1−α) e^(u_it)

and showed how this led to the human capital augmented production function, given in equation (1.22):

log(V_it/L_it) = α·log(K_it/L_it) + (1 − α)·log A_it + (1 − α)·φ(E_it) + u_it.

We now need to be more explicit as to how we model total factor productivity (TFP). In our model of Chapter 1 the model was reduced to a cross section and the productivity term Ai was identified through the residual from the cross-section production function. We now write it out as:

(1 − α)·log A_it = c_i + g·time
68 Linking models to data for development

Table 5.3 The earnings function with robust and clustered standard errors

Regression (1)
. reg logwphy educ educ_sq exper exper_sq

      Source |         SS    df          MS          Number of obs =   6968
       Model | 3085.97829     4  771.494572          F(4, 6963)    = 987.89
    Residual | 5437.79442  6963  .780955682          Prob > F      = 0.0000
       Total | 8523.77271  6967  1.22344951          R-squared     = 0.3620
                                                     Adj R-squared = 0.3617
                                                     Root MSE      = .88372

     logwphy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        educ |  -.0251407   .0085075   -2.96   0.003    -.0418181   -.0084633
     educ_sq |   .0131101    .000562   23.33   0.000     .0120084    .0142119
       exper |   .0603432   .0033491   18.02   0.000     .0537779    .0669086
    exper_sq |  -.0008603   .0000629  -13.68   0.000    -.0009836   -.0007371
       _cons |  -.1394173   .0527299   -2.64   0.008    -.2427841   -.0360505

Regression (2)
. reg logwphy educ educ_sq exper exper_sq, robust

Linear regression                                    Number of obs =    6968
                                                     F(4, 6963)    = 1155.23
                                                     Prob > F      =  0.0000
                                                     R-squared     =  0.3620
                                                     Root MSE      =  .88372

             |    Robust
     logwphy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        educ |  -.0251407   .0095704   -2.63   0.009    -.0439017   -.0063797
     educ_sq |   .0131101   .0005772   22.71   0.000     .0119786    .0142417
       exper |   .0603432   .0032931   18.32   0.000     .0538877    .0667987
    exper_sq |  -.0008603   .0000638  -13.48   0.000    -.0009855   -.0007352
       _cons |  -.1394173    .055031   -2.53   0.011    -.2472947   -.0315399

Regression (3)
. reg logwphy edyrs edyrsq exp expsq, cluster(clustnum)

Linear regression                                    Number of obs =    6968
                                                     F(4, 351)     =  276.79
                                                     Prob > F      =  0.0000
                                                     R-squared     =  0.3620
                                                     Root MSE      =  .88372
                      (Std. Err. adjusted for 352 clusters in clustnum)

             |    Robust
     logwphy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
       edyrs |  -.0251407   .0160156   -1.57   0.117    -.0566393    .0063579
      edyrsq |   .0131101   .0009062   14.47   0.000     .0113279    .0148924
         exp |   .0603432   .0037813   15.96   0.000     .0529064    .0677801
       expsq |  -.0008603   .0000744  -11.57   0.000    -.0010066   -.0007141
       _cons |  -.1394173   .1095486   -1.27   0.204    -.3548716     .076037
$$\log\frac{V_{it}}{L_{it}} = c_i + \alpha\log\frac{K_{it}}{L_{it}} + (1-\alpha)\phi(E_{it}) + g \cdot time + u_{it}.$$

So we are now explicit that TFP can be viewed as consisting of three elements. The first is the time-invariant aspect, which is specific to a country ($c_i$); the second is an effect captured by a time trend or time dummies ($g \cdot time$); and the third is the time-varying
Determinants of earnings and productivity

[Figure: left panel, a histogram of the residuals; right panel, a scatter plot of the residuals against log (capital per capita).]
Figure 5.3 Residuals from Table 3.4 Regression (2)
Table 5.4 Residual diagnostics from Table 3.4 Regression (2) (reg lrgdpch lkp tyr15 if year==2000)

. scalar JB=N*(S^2/6 + (KK-3)^2/24)
. scalar list JB
        JB = 5.377975
. disp chiprob(2,JB)
.0679497

. sktest resprod

Skewness/Kurtosis tests for Normality
                                                    ------- joint ------
    Variable |  Obs   Pr(Skewness)   Pr(Kurtosis)   adj chi2(2)   Prob>chi2
     resprod |   82       0.1141         0.0780         5.43        0.0663

. swilk resprod

Shapiro-Wilk W test for normal data
    Variable |  Obs        W          V          z       Prob>z
     resprod |   82     0.97533     1.728      1.200     0.11509

. estat hettest, rhs

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
    Ho: Constant variance
    Variables: lkp tyr15
    chi2(2)     =    18.70
    Prob > chi2 =   0.0001
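The `scalar JB` line above computes the Jarque–Bera statistic by hand from the skewness and kurtosis of the residuals. A Python sketch of the same formula (the function name and the simulated samples are illustrative, not from the book's data):

```python
import numpy as np
from scipy import stats

def jarque_bera(resid):
    """JB = N*(S^2/6 + (K-3)^2/24), compared against a chi-squared(2)."""
    n = len(resid)
    s = stats.skew(resid)
    k = stats.kurtosis(resid, fisher=False)  # raw kurtosis; equals 3 under normality
    jb = n * (s**2 / 6 + (k - 3)**2 / 24)
    return jb, stats.chi2.sf(jb, df=2)       # statistic and p-value

rng = np.random.default_rng(42)
jb_norm, p_norm = jarque_bera(rng.normal(size=500))       # approximately normal sample
jb_skew, p_skew = jarque_bera(rng.exponential(size=500))  # heavily skewed sample
```

The `chi2.sf` call plays the role of Stata's `chiprob(2, JB)`: both return the upper-tail probability of a chi-squared(2) at the computed statistic.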
unobservables ($v_{it}$). With a panel data set all these dimensions can be identified, although we clearly obtain only estimates for any of them. The form of the production function estimated in Table 5.5 uses only cross-section data, so we cannot identify the first two aspects of TFP:

$$\log\frac{V_{i}}{L_{i}} = c_0 + \alpha\log\frac{K_{i}}{L_{i}} + (1-\alpha)\phi(E_{i}) + u_i.$$

In Table 5.6 we use two cross sections, taken from 1980 and 2000, in a pooled regression which allows us to identify the time dimension of TFP (time in the regression). This variable tells us how much labour productivity increased between 1980 and 2000,
Table 5.5 The production function with robust standard errors

. reg lrgdpch lkp tyr15 if year==2000, robust

Linear regression                                    Number of obs =     82
                                                     F(2, 79)      = 474.71
                                                     Prob > F      = 0.0000
                                                     R-squared     = 0.9358
                                                     Root MSE      = .30255

             |    Robust
     lrgdpch |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
         lkp |   .6195039   .0598026   10.36   0.000     .5004698     .738538
       tyr15 |   .0663426   .0311635    2.13   0.036     .0043131    .1283721
       _cons |   2.323293   .4072022    5.71   0.000     1.512778    3.133809
Table 5.6 A pooled cross-section production function

Regression (1)
. reg lrgdpch lkp time, robust

Linear regression                                    Number of obs =    164
                                                     F(2, 161)     = 594.95
                                                     Prob > F      = 0.0000
                                                     R-squared     = 0.9089
                                                     Root MSE      = .34377

             |    Robust
     lrgdpch |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
         lkp |   .7027657   .0203954   34.46   0.000     .6624888    .7430426
        time |  -.0032893    .053479   -0.06   0.951    -.1089001    .1023214
       _cons |   1.959418   .2081833    9.41   0.000     1.548296     2.37054

Regression (2)
. reg lrgdpch lkp tyr15 time, robust

Linear regression                                    Number of obs =    164
                                                     F(3, 160)     = 535.49
                                                     Prob > F      = 0.0000
                                                     R-squared     = 0.9195
                                                     Root MSE      =  .3242

             |    Robust
     lrgdpch |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
         lkp |   .5737218   .0401488   14.29   0.000     .4944318    .6530118
       tyr15 |   .0814435   .0219312    3.71   0.000     .0381315    .1247555
        time |  -.0645375   .0501323   -1.29   0.200    -.1635439    .0344689
       _cons |   2.725696   .2769911    9.84   0.000     2.178666    3.272727
holding physical and human capital input constant. In Table 5.6 Regression (1) we do not include the human capital measure. Regression (2) then does so. In both regressions we report robust standard errors. There may be a problem of serial correlation (that is, the error term for country i in period 1 may be correlated with the error term for country i in period 2) in which case standard errors (even if they are robust) will be
incorrect. Why this may arise is covered in Section II of the book. In this case clustering on country is a possible way of addressing such a problem. Given the larger sample size, we expect to obtain estimates that are more precise and that is indeed the case. The human-capital term is now highly significant and the point estimate higher than in the single cross section, although the two point estimates are not significantly different. The ‘surprise’ in this estimation is that the term measuring the shift in TFP is not significantly different from zero: indeed, the point estimate is negative. We conclude our analysis of these simple macro production functions by noting another transformation that is possible with the data once we have two cross sections. We can difference our cross sections and produce a cross section in differences in which we model changes in the natural log of labour productivity, which is (approximately) the growth rate, as a function of changes in the natural log of capital per capita and the changes in years of education. To see this, note a difference of Table 5.6 Regression (2) is:

$$\Delta\log\frac{V_{it}}{L_{it}} = \Delta c_i + \alpha\Delta\log\frac{K_{it}}{L_{it}} + \beta_1\Delta E_{it} + g \cdot \Delta time + \Delta u_{it}.$$

We are now explicit that the $c_i$ term was included in our cross-section data but as it is time-invariant it disappears when we take the difference. The importance of this transformation is considered in detail in Section VI. Further, as time changes by 1 then our differenced equation will have the time effect as the constant term in the regression:

$$\Delta\log\frac{V_{it}}{L_{it}} = g + \alpha\Delta\log\frac{K_{it}}{L_{it}} + \beta_1\Delta E_{it} + \Delta u_{it}.$$
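The differencing transformation itself is mechanical. A pandas sketch with made-up two-period country data (all names and numbers are illustrative, not the Penn data) shows the time-invariant $c_i$ dropping out of the differenced series:

```python
import numpy as np
import pandas as pd

# Toy two-period panel: each country has a fixed effect c_i that enters the
# level of log GDP per capita in both years but cancels in the difference.
rng = np.random.default_rng(1)
countries = [f"C{i}" for i in range(5)]
df = pd.DataFrame({
    "country": np.repeat(countries, 2),
    "year": [1980, 2000] * 5,
})
c_i = dict(zip(countries, rng.normal(0, 1, 5)))           # country fixed effects
df["lkp"] = rng.normal(9, 1, len(df))                     # log capital per capita
df["lrgdpch"] = df["country"].map(c_i) + 0.5 * df["lkp"]  # noiseless, alpha = 0.5

# Difference within country: the fixed effect vanishes, so the change in
# lrgdpch is exactly 0.5 times the change in lkp in this noiseless example.
df = df.sort_values(["country", "year"])
diffs = df.groupby("country")[["lrgdpch", "lkp"]].diff().dropna()
```

Regressing the differenced outcome on the differenced regressors is then exactly the estimation reported in Table 5.7.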
The first two regressions reported in Table 5.7 are the differenced versions of the two pooled cross-section regressions reported in Table 5.6. In comparing Regressions (1) in the two tables, we see that the parameter estimate on the natural log of capital per capita has declined from 0.70 to 0.54, still much higher than our prior estimate that it should be 0.3. When we compare the point estimate on the education variable between the regressions in Tables 5.6 and 5.7 we see that it has declined from 0.08 to 0.03 and is no longer significantly different from zero. We also note that in Table 5.7 the point estimate on the natural log of capital per capita is virtually unchanged when the education variable is included. The underlying statistical reason for this finding is shown in Regression (3) of Table 5.7, as the changes in capital per capita and changes in education are only weakly correlated. The results we have reported in Table 5.7 show that the correlation between education and income at the macro level is much weaker than we have found in the micro data. A review of the evidence linking micro and macro data can be found in Pritchett (2006). It is an important puzzle in the data, set out fully by Pritchett (2006), that at the macro level it appears that schooling expansion does not explain the growth of output. We thus have two major research elements in development economics that have reached very different conclusions. In the next section we summarise these two elements and the problems on which they have focused.
Table 5.7 A cross-section differenced production function: 1980–2000

Regression (1)
. reg d20_lrgdpch d20_lkp, robust

Linear regression                                    Number of obs =     82
                                                     F(1, 80)      =  50.22
                                                     Prob > F      = 0.0000
                                                     R-squared     = 0.5474
                                                     Root MSE      = .23644

             |    Robust
 d20_lrgdpch |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
     d20_lkp |   .5443869   .0768165    7.09   0.000     .3915171    .6972566
       _cons |   .0587519   .0349492    1.68   0.097    -.0107991     .128303

Regression (2)
. reg d20_lrgdpch d20_lkp d20_tyr15, robust

Linear regression                                    Number of obs =     82
                                                     F(2, 79)      =  31.53
                                                     Prob > F      = 0.0000
                                                     R-squared     = 0.5531
                                                     Root MSE      = .23643

             |    Robust
 d20_lrgdpch |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
     d20_lkp |   .5317095    .079703    6.67   0.000     .3730647    .6903543
   d20_tyr15 |   .0343736   .0344578    1.00   0.322    -.0342129    .1029602
       _cons |    .016533    .053942    0.31   0.760    -.0908359     .123902

Regression (3)
. reg d20_lrgdpch d20_tyr15, robust

Linear regression                                    Number of obs =     82
                                                     F(1, 80)      =   5.25
                                                     Prob > F      = 0.0246
                                                     R-squared     = 0.0568
                                                     Root MSE      = .34133

             |    Robust
 d20_lrgdpch |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
   d20_tyr15 |   .1059902    .046276    2.29   0.025      .013898    .1980824
       _cons |   .1265092   .0659159    1.92   0.059    -.0046676    .2576861
5.5 Interpreting our earnings and production functions

5.5.1 Can education be given a causal interpretation?

In both this chapter and indeed throughout this book, we are looking at ways that enable us to make causal statements along the lines: if we invest in education, will it increase the earnings of the educated? As we have seen with our micro data, those with more education on average earn more. It is an implication of the results of the last three chapters that it does not follow that investing in education will increase an individual’s earnings. This is often summarised by saying that correlation does not imply causation – which, while correct, does not get to the core of the problem in making causal inferences from data of the form presented in this and previous chapters. We have stressed that the interpretation of $\beta_1$ in the simple Mincerian earnings function is that it shows the effects of changing education. But how can you change education in a cross section? Indeed, what is the basis for the effect we do observe, that is,
what is the variation in the data that is enabling us to identify an effect from education onto earnings? The answer is very clear. We are comparing individuals with different levels of education. We have not been explicit on this point up to now but it is one that will be of importance for many of the issues we want to discuss in this book. What we want to know for an individual is how their income would have changed had they had more or less education. Now that is exactly what we cannot observe. We cannot observe the same individual with and without primary education (at least, not at the same point in time). So how can we seek to make causal inferences? One possible route is to seek to make those with one level of education as similar as possible to those without it. We can then argue that the only difference between the two individuals is their level of education. We have used the South African data to introduce the earnings function. We have not been explicit that this data is for the earnings of black and white South Africans and for both men and women. Clearly our data consists of very different individuals with very different characteristics and we might rather strongly suspect that race and gender influence both earnings and education. As you can see in the data for this chapter, both race and gender do matter a lot for earnings in South Africa – particularly race. An obvious step to making our individuals more comparable – another way of saying that is that it reduces the heterogeneity across individuals on the basis of their observable characteristics – is to confine our sample to those of a specific race and gender. In so doing we take a very large step to making our individuals much more comparable than are the individuals in the data we have presented so far. How does doing so affect our view as to the returns to education, and can we now safely conclude that education has a causal effect on earnings?
That is the question you are asked to investigate in the exercise for this chapter.

5.5.2 How much does education raise labour productivity?

We return now to our macro evidence for how much education increases labour productivity. Our analysis produces two rather striking results. The first is the remarkable concordance between our micro data and the macro data when it comes to estimating the return to education, which you are asked to show in the exercise for this chapter. The second is the very weak correlation in the macro data between educational growth and that of GDP. In his review of the evidence of the models that have sought to link educational investments and GDP growth, Pritchett (2006) argues that however one does the calculations, the growth of education in almost all countries, set against the very low rates of output growth in many developing countries until the last decade, must imply that educational expansion had a limited impact on growth. The data we have presented in this chapter might appear consistent with his conclusions. In Table 5.7 we showed that once we control for the growth of capital per capita there is no significant effect for the growth of education on the growth of output. However, this is a result from a cross-section and, as we see in the following chapters, interpreting trending data in a time series is a complex issue. Two important questions arise from these results. The first is whether they are inconsistent with the idea that there are substantial externalities to education. The second is whether, within the confines of macro data sets, the results here support the argument in Hall and Jones (1999) that education does not play an important part in explaining differences in labour productivity across countries.
Probably the most well-known advocate of the notion that education results in positive externalities is Lucas (2002). Should we then be puzzled that the returns from micro and macro data are very similar? The answer to that is possibly yes, if we think our macro production function is correctly specified. If, on the other hand, we think such a specification misses key aspects of how education impacts on productivity then the problem is with our specification. These are important issues to which we need to return. If we for the moment abstract from the possibility that our specification is missing completely how education impacts on labour productivity, then our results are broadly consistent with the argument advanced by Hall and Jones (1999), although we have arrived at their result by a completely different route. Such a concordance between a method relying on calibration and one relying on estimation might be thought reassuring if we were convinced the specification of our model is correct. Whether that is so is the subject of several of the chapters that follow.
References

Hall, R. E. and Jones, C. I. (1999) ‘Why do some countries produce so much more output per worker than others?’, Quarterly Journal of Economics, 114(1): 83–116.
Hendry, D. and Nielsen, B. (2007) Econometric Modeling: A Likelihood Approach, Princeton University Press, Princeton.
Lucas, R. E., Jr. (2002) Lectures on Economic Growth, Harvard University Press, Cambridge, Massachusetts.
Pritchett, L. (2006) ‘Does learning to add up add up? The returns to schooling in aggregate data’, Chapter 11 in E. Hanushek and F. Welch (eds) Handbook of the Economics of Education, Volume 1, North Holland, Amsterdam.
White, H. (1980) ‘A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity’, Econometrica, 48: 817–38.
Exercise

1. Explain how to calculate the returns to education from a macro production function and use ‘Macro_1980_2000_PENN61’ to derive an estimate from that data for cross sections of 1980 and 2000 and the pooled cross section.
2. Compare the Mincerian rate of return to education from the micro data ‘Labour_Force_SA_SALDRU_1993’ with that which you obtained from the macro data.
3. Do you expect the returns to education to be higher in the macro data set and, if so, why?
4. Does your answer to the previous question provide any support for the view that there are externalities to education?
5. Does the Mincerian return to education obtained from ‘Labour_Force_SA_SALDRU_1993’ differ by gender and by race?
6. Is education or trade a more significant (in both the statistical and economic sense of this term) determinant of productivity in the macro production function?
7. Discuss whether any of your regressions can be given a causal interpretation.
Section II
Time-series data, growth and development
6
Modelling growth with time-series data
6.1 Introduction: modelling growth

In a book on development it might seem a little odd that it has taken us until Chapter 6 before we can discuss growth. The reason is simple. The econometric analysis of time series is very different from that of cross-section data. In Section 6.2 we present the Solow model, which has come to dominate the analysis of growth. We are interested in at least two dimensions of growth. One is how growth rates have differed across countries; another is how they have varied over time within a country. To analyse the first of these questions we need panel data, which we come to in Section III. It was the second of these questions, how growth rates have varied within countries, that was the initial focus of the Solow model, and to analyse that issue we need time-series data, which is the subject of Section II. We use a time series of GDP for Argentina to investigate whether we can obtain reasonable estimates of a Solow model for that country. In doing so, we introduce the problems posed by time-series data which differentiate such data from the cross-section data that we have used up to this point. Three differences between time-series and cross-section data are highlighted in this chapter. The first of these is that, by definition, time series are not obtained through a random sampling procedure. The second is that time-series data can show strong persistence, which means that the variables tend to be highly correlated with each other. The third feature of many time series is that they show sustained changes over time, a feature which has a name – such data is termed nonstationary, a term that we formally define below. Do these features of the data mean that our OLS estimators cannot be used to derive unbiased and minimum variance estimators? We set out in Section 6.4 the assumptions necessary for OLS estimates to be unbiased and minimum variance.
In the case of time-series data, these assumptions are very strong and are unlikely to be valid for many of the models we want to estimate. However, weaker assumptions which our models will meet will allow us to argue that OLS estimates are consistent. So, rather ironically given that large samples of time-series data are hard to find in development, we need to appeal to the large-sample properties of the estimator to justify our use of OLS. Understanding why growth rates differ is a key problem in development. Small differences in growth rates can imply very large long-run differences in incomes. The left-hand panel of Figure 6.1 shows the log of GDP of three countries – Australia, Argentina and Ghana – over the course of the twentieth century. Thus for this left-hand panel the slope of the line is the growth rate. The trend rate of growth for Argentina over the period from 1900 was 1.1 per cent per annum, that for Australia was 1.8 and for Ghana zero. The right-hand panel presents the levels of GDP. While in the early part of
[Figure: left panel, log GDP per capita in 1990 PPP US$, 1900–2006, for Australia, Argentina and Ghana; right panel, GDP per capita in 1990 PPP US$ over the same period for the same three countries.]

Figure 6.1 Long-term growth
Source: Angus Maddison, Historical Statistics of the World Economy, 1–2006 AD
the twentieth century the differences in income between Australia and Argentina were modest, by the first decade of the twenty-first century Australia had income more than three times higher than Argentina. Ghana in the early part of the twentieth century was very much poorer than both Australia and Argentina; since then, though, the gap has steadily widened. Small differences in average growth rates, in the case of Argentina and Australia less than 1 per cent per annum, imply a very large difference in long-run income levels. Understanding why these growth rates have differed is among the first questions that development economists posed. William Easterly (2002) has argued in a stimulating book on exactly this question that while its continued importance is undisputed, the answer remains elusive. In the next section we introduce the Solow model, which has provided the framework to understand the patterns of change for GDP shown in Figure 6.1. In Section 6.3 we provide time-series estimates of a Solow model for the Argentinian economy. We then consider in Section 6.4 the assumptions necessary to prove these estimates are unbiased and have minimum variance, thus paralleling the discussion of these questions in Chapter 4. Sections 6.5 and 6.6 then tackle the potential problems posed for OLS of the fact that many time series are highly persistent. In Chapter 7 we show how these problems can be addressed and come back to when OLS can be used.
6.2 An introduction to the Solow model

The Solow growth model, originally set out in Solow (1956) and exposited in Solow (1970), consists of four equations. The first is a production function; the second is a model for saving by which savings is a constant fraction of income; the third is the assumption that the growth rate of the labour force is constant; and the final one, and the one with the most important implications for growth, is that the rate of technical progress is given exogenously and is also a constant in the model. While the production function of the Solow model is more general than a Cobb–Douglas specification, we will use that as we have already introduced it and almost all the empirical applications of the model have assumed this functional form. The four equations of the Solow model are:
Production function:

$$V_t = K_t^{\alpha}(A_t L_t)^{(1-\alpha)}e^{u_t}.$$

Savings function:

$$S_t = sV_t.$$

The rate of growth of the population is constant ($n$):

$$L_t = L_0 e^{nt}.$$

The rate of growth of underlying technical efficiency is constant ($g$):

$$A_t = A_0 e^{gt}.$$
The rather surprising implication of this model is that in the long run, factor accumulation affects the long-run level of income but not the growth rate. The long-run growth rate is solely a function of the rate of technical progress, $\Delta\log A_t$. We can see this relatively easily. Capital evolves according to the formula $\dot{K}_t = I_t - \delta K_t$, where $\delta$ is the depreciation rate. Write $k = K/AL$. Note that:

$$\frac{\dot{k}_t}{k_t} = \frac{\partial \log k_t}{\partial t} = \frac{d}{dt}\left(\log K_t - \log A_t - \log L_t\right)$$

$$\dot{k}_t = \left(\frac{\partial \log K_t}{\partial t} - g - n\right)k_t = \left(\frac{I_t}{K_t} - g - n - \delta\right)k_t$$

$$\dot{k}_t = \left(s\frac{V_t}{K_t} - g - n - \delta\right)k_t$$

$$\dot{k}_t = sv_t - (g + n + \delta)k_t$$

$$\dot{k}_t = sk_t^{\alpha} - (g + n + \delta)k_t.$$

This is the law of motion of the Solow model. This final equation implies that $k$ converges to a steady-state value of $k^*$ defined by $\dot{k} = 0$, so $k^* = [s/(g + n + \delta)]^{1/(1-\alpha)}$, and we can use the fact that $v_t = k_t^{\alpha}$, where $v_t = (V/AL)_t$, to write $v^* = [s/(n + g + \delta)]^{\alpha/(1-\alpha)}$, which can also be written as:
$$\log v_t^* = \frac{\alpha}{(1-\alpha)}\log s_t - \frac{\alpha}{(1-\alpha)}\log(n + g + \delta).$$

Using the fact that:

$$\log v_t = \log V_t - \log A_t - \log L_t = \log V_t - (\log A_0 + gt) - \log L_t,$$

we have an equation we can estimate:

$$\log\left(\frac{V}{L}\right)_t = \frac{\alpha}{(1-\alpha)}\log s_t - \frac{\alpha}{(1-\alpha)}\log(n + g + \delta)_t + \log A_0 + g \cdot time + u_t. \qquad (6.1)$$
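The steady-state algebra can be verified numerically: iterating the law of motion from any positive starting value should converge to k* = [s/(g+n+δ)]^(1/(1−α)). A small Euler-step sketch in Python, with purely illustrative parameter values:

```python
# Numerical check of the Solow law of motion: iterate
#   dk/dt = s*k**alpha - (g + n + delta)*k
# with a small Euler step and compare the limit with the analytical
# steady state k* = [s / (g + n + delta)]**(1 / (1 - alpha)).
alpha, s, g, n, delta = 0.3, 0.2, 0.02, 0.01, 0.05

k = 1.0       # arbitrary positive starting value for capital per effective worker
dt = 0.1      # Euler step size
for _ in range(20000):
    k += dt * (s * k**alpha - (g + n + delta) * k)

k_star = (s / (g + n + delta)) ** (1 / (1 - alpha))
```

Whatever the (positive) starting value, the iteration settles on `k_star`, illustrating that accumulation pins down the level, while only `g` governs long-run growth.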
6.3 A Solow model for Argentina

There are several important features of this Solow equation which we have presented in equation (6.1). The first is that it makes very specific predictions as to the sign and size of the coefficients. The second is that in this equation the rate of technical progress $gt$ has been modelled as a linear trend $g \cdot time$, which implies that if we were to estimate this equation the parameter on the time trend would be our estimate of the rate of technical progress. When the Solow model is described as an exogenous growth model what is meant is that the rate of technical progress is a parameter of the model, that is, what is determining it is outside the model. Finally, equation (6.1) is a ‘steady state’ result: it will be true when the economy has adjusted to any changes in saving and population growth (the two factors in the model which determine the steady-state level of GDP). Equation (6.1) is what is termed a static model in that only current values of time enter the equation. In Section 6.5 we extend the model to being dynamic – allowing the past to influence the present – but here we consider the simplest version of this static steady-state Solow model. In the left-hand panel of Figure 6.2 we show the data for the natural logs of GDP and the savings rate (assumed to be equal to the investment rate) for Argentina over the period from 1950 to 2000. In the right-hand panel we show the Solow equation that results from estimating equation (6.1). The regressions that underlie the figure can be found in Table 6.1. The Solow model reported in Table 6.1 is:

$$\log\left(\frac{Y}{L}\right) = 0.61\log s_t + 0.24\log(n + g + \delta)_t - 12.9 + 0.01 \cdot time.$$

This is not in accord with our model, which predicts that the coefficient on population growth should be negative. What is the implied coefficient on the capital stock in the production function? To see this we note from equation (6.1) that

$$\frac{\alpha}{(1-\alpha)} = 0.61,$$

which implies that $\alpha = 0.38$, which is much closer to the hypothesised 0.3 than the value we found with our macro cross-section data in earlier chapters. Clearly time-series data can give us very different results from those of a cross section.
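Recovering α is one line of algebra: if β = α/(1−α) then α = β/(1+β). In Python:

```python
# Recover alpha from the estimated coefficient beta = alpha / (1 - alpha)
# on log(s_t) in equation (6.1): alpha = beta / (1 + beta).
beta = 0.61
alpha = beta / (1 + beta)  # approximately 0.379
```

The same rearrangement applied to the cross-section estimates of earlier chapters gives the much larger implied α discussed there.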
[Figure: left panel, log GDP per capita and the log investment share of GDP for Argentina, 1950–2000; right panel, a Solow equation for Argentina – log GDP per capita plotted against the log investment share of GDP, with the linear prediction.]
Figure 6.2 Incomes and investment in Argentina: 1950–2000
Note also that the trend rate of growth of the Argentinian economy shown in Regression (1) of Table 6.1 is virtually identical to the ‘exogenous technical progress’ parameter in Regression (3). Technical progress explains virtually all the growth of the Argentinian economy over this period (remember this assumes our model is correctly specified, which we have not yet tested). We see from Regressions (1) and (2) in Table 6.1 that GDP has a significant trend and the saving (investment) rate does not. We said in the introduction that the trending nature of many macro data series could present problems; we now need to consider the assumptions for the OLS estimates to be unbiased and minimum variance.
6.4 OLS estimates under the classical assumptions with time-series data

So far we have treated our time-series data exactly as we treated our cross-section data. But quite clearly any time-series data differs in important respects from data derived from a cross section and we must now consider how that affects the interpretation of our OLS estimates. If we take a cross section from our labour-force data for South Africa we can, if we want, repeat that process and if we have two random samples from the same data we can expect them to have means and variances that are not significantly different. That our sample was random enabled us to assume that our errors were not correlated with each other. For our results for the classical linear regression model (CLM) from the cross section to carry over to time series we need assumptions that allow us to treat a time series ‘as though’ it was drawn as a random sample. The most obvious characteristic in which our data differs is that GDP rises, falls and then rises again. If we wanted to know the average level of GDP the answer would clearly depend on when we took the sample. Also the fact that the data was not taken as a random sample means that we cannot expect the variables to be independent of each other.

6.4.1 Assumptions for OLS to be unbiased

So our first step in formalising when we can obtain unbiased estimators for our time-series data using OLS is to present a concept that ensures our data does not vary over
Linking models to data for development

Table 6.1 A Solow model for Argentina
Figure 6.2 is the graph of the following regressions:

Regression (1)

. reg lrgdpch year if country=="ARGENTINA";

      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------           F(  1,    49) =   82.90
       Model |  .985393048     1  .985393048           Prob > F      =  0.0000
    Residual |  .582465987    49  .011887061           R-squared     =  0.6285
-------------+------------------------------           Adj R-squared =  0.6209
       Total |  1.56785904    50  .031357181           Root MSE      =  .10903

------------------------------------------------------------------------------
     lrgdpch |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        year |   .0094433   .0010372     6.10   0.000      .007359    .0115276
       _cons |  -6.589278   2.048497    -4.68   0.000    -13.70589   -6.472669
------------------------------------------------------------------------------

Regression (2)

. reg lki year if country=="ARGENTINA";

      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------           F(  1,    49) =    0.16
       Model |  .003846308     1  .003846308           Prob > F      =  0.6921
    Residual |  1.18760705    49  .024236879           R-squared     =  0.0032
-------------+------------------------------           Adj R-squared = -0.0171
       Total |  1.19145336    50  .023829067           Root MSE      =  .15568

------------------------------------------------------------------------------
         lki |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        year |    -.00059    .001481    -0.40   0.692    -.0035662    .0023862
       _cons |   3.988116   2.925071     1.36   0.179    -1.890034    6.866265
------------------------------------------------------------------------------

Regression (3)
. reg lrgdpch lki ln_n_g_d year if country=="ARGENTINA";

      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  3,    46) =  107.52
       Model |  1.29579293     3  .431930976           Prob > F      =  0.0000
    Residual |  .184798648    46  .004017362           R-squared     =  0.8752
-------------+------------------------------           Adj R-squared =  0.8670
       Total |  1.48059158    49  .030216155           Root MSE      =  .06338

------------------------------------------------------------------------------
     lrgdpch |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lki |   .6133598   .0619973     6.89   0.000     .4885657    .7381539
    ln_n_g_d |   .2403555   .3938393     0.61   0.545    -.5524018    1.033113
        year |   .0105882   .0009174     6.54   0.000     .0087415    .0124349
       _cons |  -12.92954   1.303472    -6.92   0.000    -16.55329   -10.30578
------------------------------------------------------------------------------
time. In an informal sense, a time series that is the same at all points in time is called stationary. More formally:

The stochastic process {x_t : t = 1, 2, ...} is stationary if for every collection of time indices 1 ≤ t_1 < t_2 < ... < t_m, the joint distribution of (x_{t_1}, x_{t_2}, ..., x_{t_m}) is the same as the joint distribution of (x_{t_1+h}, x_{t_2+h}, ..., x_{t_m+h}) for all integers h ≥ 1.

Table 6.2 Testing for heteroskedasticity and autocorrelation

Prob > chi2      =   0.1866

. estat dwatson;

Durbin-Watson d-statistic(4, 50) = .6080562

. predict res, resid;
(1 missing value generated)

. reg res l.res ln_ki ln_n_g_d year, robust;

Linear regression                               Number of obs =      49
                                                F(  4,    44) =   10.71
                                                Prob > F      =  0.0000
                                                R-squared     =  0.4931
                                                Root MSE      =  .04612

------------------------------------------------------------------------------
             |               Robust
         res |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         res |
         L1. |   .7288615   .1275947     6.71   0.000     .4717112    .9860118

[Other regressors are included but not reported]
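The Durbin–Watson statistic reported in Table 6.2 (d ≈ 0.61, far below 2) can be computed directly from a residual series as d = Σ(e_t − e_{t−1})² / Σe_t² ≈ 2(1 − ρ̂), so strong positive autocorrelation drives d towards 0. A minimal sketch in Python on simulated residuals (illustrative data, not the Argentine series):

```python
import random

def durbin_watson(resid):
    """d = sum((e_t - e_{t-1})^2) / sum(e_t^2); lies in (0, 4), ~2 if no autocorrelation."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    return num / sum(e * e for e in resid)

random.seed(0)
# Residuals following an AR(1) with rho = 0.7: d should be near 2*(1 - 0.7) = 0.6
rho = 0.7
e = [random.gauss(0, 1)]
for _ in range(499):
    e.append(rho * e[-1] + random.gauss(0, 1))
print(round(durbin_watson(e), 2))      # well below 2: positive autocorrelation

white = [random.gauss(0, 1) for _ in range(500)]
print(round(durbin_watson(white), 2))  # close to 2: no autocorrelation
```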
The variance of β̂_j conditional on X is

    Var(β̂_j) = σ² / [SST_j (1 − R²_j)],   j = 1, 2, ..., k,

where

    SST_j = Σ_{s=1}^{n} (x_sj − x̄_j)²
is the total sample variance in x_j and R²_j is the R-squared from regressing x_j on all the other independent variables (including an intercept term). As in Chapter 3, the estimator σ̂² = SSR/df is an unbiased estimator of σ², where df = n − k − 1. If we combine our assumptions (A1″)–(A3″) with homoskedasticity and no autocorrelation, then the OLS estimators are the best linear unbiased estimators (BLUE) conditional on X. These are the time-series equivalent of the assumptions we
Modelling growth with time-series data

made with the cross-section data. If we do what we did there and assume the errors u_t are independent of X and are independently and identically distributed (i.i.d.) as normal (0, σ²), then the OLS estimators are normally distributed conditional on X. We can then proceed as in Chapter 4 and derive the sampling distribution of the estimators and the same test statistics.

6.4.3 Testing for autocorrelation

For the OLS estimators to be BLUE requires both homoskedasticity and serially uncorrelated errors, so knowing if there is autocorrelation in the errors is clearly very important. The possible methods of testing are covered in Wooldridge (2013, Section 12.2: 402–409). We present here one of those methods, which is very informative as to how to use autocorrelation testing to build a dynamically complete model. The method is valid without strictly exogenous regressors, although we need to make our standard errors robust to heteroskedasticity. The method is as follows:

(i) Run the OLS regression of y_t on x_t1, x_t2, ..., x_tk and obtain the OLS residuals û_t for all t = 1, 2, ..., n.
(ii) Run the regression of û_t on x_t1, x_t2, ..., x_tk, û_{t−1}, for all t = 2, ..., n to obtain the coefficient ρ̂ on û_{t−1} and its t-statistic, t_ρ̂.
(iii) Use t_ρ̂ to test H_0: ρ = 0 against H_1: ρ ≠ 0 in the usual way.

In Table 6.2 we present the Breusch–Pagan test for heteroskedasticity and the test for autocorrelation we have just outlined. While at conventional significance levels we fail to reject the null of homoskedasticity, we have highly significant autocorrelation. It is possible to make standard errors robust to autocorrelation using methods similar to those used for heteroskedasticity, but that is not the correct response here. Autocorrelation may be an indicator of omitted variables, and in particular these missing variables are lagged values of both the explanatory and the dependent variable.
To address that possibility we must extend our model to include lags, thus making it dynamic.
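The three steps above can be sketched in Python on simulated data (stdlib only; all names and parameter values here are illustrative). This is a simplified version that regresses û_t on û_{t−1} alone in step (ii), whereas the Stata implementation in Table 6.2 also includes the regressors, as the method requires when they are not strictly exogenous:

```python
import math
import random

def simple_ols(y, x):
    """Closed-form OLS intercept and slope for a single regressor."""
    n = len(y)
    xb, yb = sum(x) / n, sum(y) / n
    b = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / \
        sum((xi - xb) ** 2 for xi in x)
    return yb - b * xb, b

random.seed(3)
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
u = [random.gauss(0, 1)]
for _ in range(n - 1):
    u.append(0.6 * u[-1] + random.gauss(0, 1))      # AR(1) errors, rho = 0.6
y = [1.0 + 0.5 * xi + ui for xi, ui in zip(x, u)]

# Step (i): OLS of y on x, keep the residuals u_hat
a, b = simple_ols(y, x)
uhat = [yi - a - b * xi for xi, yi in zip(x, y)]

# Step (ii): regress u_hat_t on u_hat_{t-1} and form the t-statistic on rho_hat
c, rho = simple_ols(uhat[1:], uhat[:-1])
res = [uhat[t] - c - rho * uhat[t - 1] for t in range(1, n)]
s2 = sum(e * e for e in res) / (n - 1 - 2)           # df = (n-1) - k - 1
m = sum(uhat[:-1]) / (n - 1)
se_rho = math.sqrt(s2 / sum((v - m) ** 2 for v in uhat[:-1]))
t_rho = rho / se_rho

# Step (iii): |t| > 1.96 rejects H0: rho = 0 at the 5 per cent level
print(round(rho, 2), round(t_rho, 1))
```

With ρ = 0.6 in the simulated errors, the test rejects the null of no autocorrelation decisively, just as it does for the static Solow model in Table 6.2.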
6.5 Static and dynamic time-series models

The version of the Solow model we have been considering so far is an example of a static time-series model, equation (6.1):
log(Y/L)_t = A_0 + (α/(1−α)) log s_t − (α/(1−α)) log(n + g + δ)_t + g·time + u_t.   (6.1)
It is static in the sense that it includes only variables which are dated contemporaneously. We have interpreted this model as one capturing a steady-state relationship between labour productivity and its determinants – savings, population growth and technical progress. In writing down a steady-state relationship we have an empirical model in which savings and the population growth rate in the past do not affect current labour productivity. That is unlikely to be true and we can test for that possibility by including lagged values of the determinants on the right-hand side, giving us a model of the form:
log(Y/L)_t = A_0 + β_1 log s_t + β_2 log s_{t−1} − β_3 log(n + g + δ)_t − β_4 log(n + g + δ)_{t−1} + g·time + u_t.   (6.2)
Equation (6.2) is an example of a finite distributed lag model. As its name suggests, it allows for the fact that past values of savings and the population growth rate may affect labour productivity. We still have our long-run relationship, but we now need to derive it from our estimated equation. We do this by setting log s_t = log s_{t−1} and log(n + g + δ)_t = log(n + g + δ)_{t−1}, where starred values denote the long-run values, giving us:

log(Y/L)*_t = A_0 + (β_1 + β_2) log s*_t − (β_3 + β_4) log(n + g + δ)*_t + g·time + u_t.   (6.3)
In equation (6.3) we have the same equation as (6.1), but rather than estimating the long-run relationship directly as in equation (6.1) we proceed indirectly with the distributed lag model, which we then solve for the long-run condition. We can further extend our model by allowing for the possibility that there is an autoregressive process in the dependent variable. We do this by including the lagged value of the dependent variable as a regressor, giving us the most general model we have considered so far (see Table 6.3):

log(Y/L)_t = A_0 + λ log(Y/L)_{t−1} + β_1 log s_t + β_2 log s_{t−1} − β_3 log(n + g + δ)_t − β_4 log(n + g + δ)_{t−1} + g·time + u_t.   (6.4)
This is a dynamic model where we have restricted the time lags to one period. This equation can be written in a different, but equivalent, way as:

Δlog(Y/L)_t = A_0 + (λ − 1) log(Y/L)_{t−1} + β_1 log s_t + β_2 log s_{t−1} − β_3 log(n + g + δ)_t − β_4 log(n + g + δ)_{t−1} + g·time + u_t.   (6.5)
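The step between the two forms is simply to subtract the lagged dependent variable from both sides of (6.4); writing y_t for log(Y/L)_t and X_tβ for the saving and population-growth terms to keep the line short:

```latex
y_t = A_0 + \lambda\, y_{t-1} + X_t\beta + g\cdot time + u_t
\quad\Longrightarrow\quad
y_t - y_{t-1} = A_0 + (\lambda - 1)\, y_{t-1} + X_t\beta + g\cdot time + u_t ,
```

so the coefficient on y_{t−1} in the growth-rate form is λ − 1, and the two regressions have identical residuals and fitted growth rates.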
It is important to understand that these two equations, (6.4) and (6.5), are simply different ways of writing the same relationship. In the equation (6.5) form, the dependent variable is the growth rate, so the equation can be read as explaining the growth of labour productivity. However, it always needs to be stated that this growth rate is conditioned on the value of labour productivity in the last period, so long-run growth remains a function of the exogenously given rate of technical progress, g in this Solow model. We now check in Table 6.3 to see if our dynamic model has addressed the problems posed by autocorrelation in the residuals. As the results show, the extent of the autocorrelation has greatly declined, although at the 5 per cent significance level we can still reject the null hypothesis of no first-order autocorrelation (p = 0.044). The lagged dependent variable also appears to be highly significant, suggesting that a dynamic model is required to fully specify the Solow model.
Table 6.3 A general dynamic Solow model for Argentina

. reg lrgdpch l.lrgdpch ln_ki l.ln_ki ln_n_g_d l.ln_n_g_d year;

      Source |       SS       df       MS              Number of obs =      49
-------------+------------------------------           F(  6,    42) =  122.26
       Model |  1.34194452     6   .22365742           Prob > F      =  0.0000
    Residual |  .076834208    42  .001829386           R-squared     =  0.9458
-------------+------------------------------           Adj R-squared =  0.9381
       Total |  1.41877873    48   .02955789           Root MSE      =  .04277

------------------------------------------------------------------------------
     lrgdpch |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     lrgdpch |
         L1. |   .7637508   .1039746     7.35   0.000     .5539215    .9735801
       ln_ki |
         --. |   .4114097   .0711918     6.78   0.000     .2677388    .5550806
         L1. |  -.2527254   .0903789    -2.80   0.008    -.4351175   -.0703333
    ln_n_g_d |
         --. |  -.3166049   .2830918    -1.12   0.270    -.8879073    .2546974
         L1. |  -.1337107   .2680749    -0.50   0.621    -.6747077    .4072863
        year |   .0017664    .001416     1.25   0.219    -.0010912     .004624
       _cons |  -3.016845   1.641976    -1.84   0.073    -6.330487    .2967974
------------------------------------------------------------------------------

. reg res1 l.res1 ln_ki ln_n_g_d year, robust;

Linear regression                               Number of obs =      48
                                                F(  4,    43) =    1.15
                                                Prob > F      =  0.3469
                                                R-squared     =  0.0854
                                                Root MSE      =  .04003

------------------------------------------------------------------------------
             |               Robust
        res1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        res1 |
         L1. |   .2833246   .1365104     2.08   0.044     .0080252    .5586241

[Other regressors included but not reported]
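Solving the dynamic model for its long run, as in the derivation of equation (6.3), gives long-run coefficients equal to the sum of the current and lagged coefficients divided by (1 − λ). A back-of-the-envelope calculation in Python using the point estimates in Table 6.3 (ignoring sampling error):

```python
# Point estimates from Table 6.3 (dynamic Solow model for Argentina)
lam = 0.7637508                      # coefficient on lagged lrgdpch
b1, b2 = 0.4114097, -0.2527254       # ln_ki and its lag
b3, b4 = -0.3166049, -0.1337107      # ln_n_g_d and its lag

# Long-run coefficient = (current + lagged coefficient) / (1 - lambda)
lr_saving = (b1 + b2) / (1 - lam)
lr_ngd = (b3 + b4) / (1 - lam)
print(round(lr_saving, 3), round(lr_ngd, 3))
```

The implied long-run saving elasticity, about 0.67, is somewhat larger than the 0.61 estimated directly on lki in the static Regression (3) of Table 6.1.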
6.6 Assumptions to ensure the OLS estimators are consistent

In Section 6.5 we introduced a dynamic model with a lagged dependent variable. This implies that our strict exogeneity assumption cannot hold, as u_{t−1} is a determinant of y_{t−1}, which is a regressor. However, it is possible that contemporaneous exogeneity holds, and it can be shown that with this assumption the OLS estimates are consistent, providing that the correlation between x_t and x_{t+h} tends to zero as the distance h between the observations grows. The
characteristic by which x_t and x_{t+h} become ‘almost independent’ as h increases is termed weak dependence. An example of a process which we can show is weakly dependent is the first-order autoregressive process, written as AR(1), where we need to assume that ρ