Probability Theory and Statistical Inference

Doubt over the trustworthiness of published empirical results is not unwarranted and is often a result of statistical misspecification: invalid probabilistic assumptions imposed on data. Now in its second edition, this bestselling textbook offers a comprehensive course in empirical research methods, teaching the probabilistic and statistical foundations that enable the specification and validation of statistical models, providing the basis for an informed implementation of statistical procedures to secure the trustworthiness of evidence. Each chapter has been thoroughly updated, accounting for developments in the field and the author's own research. The comprehensive scope of the textbook has been expanded by the addition of a new chapter on the linear regression and related statistical models. This new edition is now more accessible to students of disciplines beyond economics and includes more pedagogical features, with an increased number of examples as well as review questions and exercises at the end of each chapter.

ARIS SPANOS is Wilson E. Schmidt Professor of Economics at Virginia Polytechnic Institute and State University. He is the author of Statistical Foundations of Econometric Modelling (Cambridge, 1986) and, with D. G. Mayo, Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science (Cambridge, 2010).
Probability Theory and Statistical Inference Empirical Modeling with Observational Data Second Edition
Aris Spanos Virginia Tech (Virginia Polytechnic Institute & State University)
University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India
79 Anson Road, #06–04/06, Singapore 079906

Cambridge University Press is part of the University of Cambridge. It furthers the University's mission by disseminating knowledge in the pursuit of education, learning, and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/9781107185142
DOI: 10.1017/9781316882825

© Aris Spanos 2019

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 1999
Third printing 2007
Second edition 2019

Printed in the United Kingdom by TJ International Ltd, Padstow, Cornwall

A catalogue record for this publication is available from the British Library.

Library of Congress Cataloging-in-Publication Data
Names: Spanos, Aris, 1952– author.
Title: Probability theory and statistical inference : empirical modelling with observational data / Aris Spanos (Virginia College of Technology).
Description: Cambridge ; New York, NY : Cambridge University Press, 2019. | Includes bibliographical references and index.
Identifiers: LCCN 2019008498 (print) | LCCN 2019016182 (ebook) | ISBN 9781107185142 | ISBN
Subjects: LCSH: Probabilities – Textbooks. | Mathematical statistics – Textbooks.
Classification: LCC QA273 (ebook) | LCC QA273 .S6875 2019 (print) | DDC 519.5–dc23
LC record available at https://lccn.loc.gov/2019008498

ISBN 978-1-107-18514-2 Hardback
ISBN 978-1-316-63637-4 Paperback

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.
To my grandchildren Nicholas, Jason, and Evie, my daughters Stella, Marina, and Alexia, and my wife Evie for their unconditional love and support
Contents

Preface to the Second Edition

1  An Introduction to Empirical Modeling
   1.1  Introduction
   1.2  Stochastic Phenomena: A Preliminary View
        1.2.1  Chance Regularity Patterns
        1.2.2  From Chance Regularities to Probabilities
        1.2.3  Chance Regularity Patterns and Real-World Phenomena
   1.3  Chance Regularities and Statistical Models
   1.4  Observed Data and Empirical Modeling
        1.4.1  Experimental vs. Observational Data
        1.4.2  Observed Data and the Nature of a Statistical Model
        1.4.3  Measurement Scales and Data
        1.4.4  Measurement Scale and Statistical Analysis
        1.4.5  Cross-Section vs. Time Series, is that the Question?
        1.4.6  Limitations of Economic Data
   1.5  Statistical Adequacy
   1.6  Statistical vs. Substantive Information*
   1.7  Looking Ahead
   1.8  Questions and Exercises

2  Probability Theory as a Modeling Framework
   2.1  Introduction
        2.1.1  Primary Objective
        2.1.2  Descriptive vs. Inferential Statistics
   2.2  Simple Statistical Model: A Preliminary View
        2.2.1  The Basic Structure of a Simple Statistical Model
        2.2.2  The Notion of a Random Variable: A Naive View
        2.2.3  Density Functions
        2.2.4  A Random Sample: A Preliminary View
   2.3  Probability Theory: An Introduction
        2.3.1  Outlining the Early Milestones of Probability Theory
        2.3.2  Probability Theory: A Modeling Perspective
   2.4  A Simple Generic Stochastic Mechanism
        2.4.1  The Notion of a Random Experiment
        2.4.2  A Bird's-Eye View of the Unfolding Story
   2.5  Formalizing Condition [a]: The Outcomes Set
        2.5.1  The Concept of a Set in Set Theory
        2.5.2  The Outcomes Set
        2.5.3  Special Types of Sets
   2.6  Formalizing Condition [b]: Events and Probabilities
        2.6.1  Set-Theoretic Operations
        2.6.2  Events vs. Outcomes
        2.6.3  Event Space
        2.6.4  A Digression: What is a Function?
        2.6.5  The Mathematical Notion of Probability
        2.6.6  Probability Space (S, ℑ, P(.))
        2.6.7  Mathematical Deduction
   2.7  Conditional Probability and Independence
        2.7.1  Conditional Probability and its Properties
        2.7.2  The Concept of Independence Among Events
   2.8  Formalizing Condition [c]: Sampling Space
        2.8.1  The Concept of Random Trials
        2.8.2  The Concept of a Statistical Space
        2.8.3  The Unfolding Story Ahead
   2.9  Questions and Exercises

3  The Concept of a Probability Model
   3.1  Introduction
        3.1.1  The Story So Far and What Comes Next
   3.2  The Concept of a Random Variable
        3.2.1  The Case of a Finite Outcomes Set: S = {s1, s2, ..., sn}
        3.2.2  Key Features of a Random Variable
        3.2.3  The Case of a Countable Outcomes Set: S = {s1, s2, ..., sn, ...}
   3.3  The General Concept of a Random Variable
        3.3.1  The Case of an Uncountable Outcomes Set S
   3.4  Cumulative Distribution and Density Functions
        3.4.1  The Concept of a Cumulative Distribution Function
        3.4.2  The Concept of a Density Function
   3.5  From a Probability Space to a Probability Model
        3.5.1  Parameters and Moments
        3.5.2  Functions of a Random Variable
        3.5.3  Numerical Characteristics of Random Variables
        3.5.4  Higher Moments
        3.5.5  The Problem of Moments*
        3.5.6  Other Numerical Characteristics
   3.6  Summary
   3.7  Questions and Exercises
   Appendix 3.A: Univariate Distributions
        3.A.1  Discrete Univariate Distributions
        3.A.2  Continuous Univariate Distributions

4  A Simple Statistical Model
   4.1  Introduction
        4.1.1  The Story So Far, a Summary
        4.1.2  From Random Trials to a Random Sample: A First View
   4.2  Joint Distributions of Random Variables
        4.2.1  Joint Distributions of Discrete Random Variables
        4.2.2  Joint Distributions of Continuous Random Variables
        4.2.3  Joint Moments of Random Variables
        4.2.4  The n Random Variables Joint Distribution
   4.3  Marginal Distributions
   4.4  Conditional Distributions
        4.4.1  Conditional Probability
        4.4.2  Conditional Density Functions
        4.4.3  Continuous/Discrete Random Variables*
        4.4.4  Conditional Moments
        4.4.5  A Digression: Other Forms of Conditioning
        4.4.6  Marginalization vs. Conditioning
        4.4.7  Conditioning on Events vs. Random Variables
   4.5  Independence
        4.5.1  Independence in the Two Random Variable Case
        4.5.2  Independence in the n Random Variable Case
   4.6  Identical Distributions and Random Samples
        4.6.1  Identically Distributed Random Variables
        4.6.2  A Random Sample of Random Variables
   4.7  Functions of Random Variables
        4.7.1  Functions of One Random Variable
        4.7.2  Functions of Several Random Variables
        4.7.3  Ordered Sample and its Distributions*
   4.8  A Simple Statistical Model
        4.8.1  From a Random Experiment to a Simple Statistical Model
   4.9  The Statistical Model in Empirical Modeling
        4.9.1  The Concept of a Statistical Model: A Preliminary View
        4.9.2  Statistical Identification of Parameters
        4.9.3  The Unfolding Story Ahead
   4.10 Questions and Exercises
   Appendix 4.A: Bivariate Distributions
        4.A.1  Discrete Bivariate Distributions
        4.A.2  Continuous Bivariate Distributions

5  Chance Regularities and Probabilistic Concepts
   5.1  Introduction
        5.1.1  Early Developments in Graphical Techniques
        5.1.2  Why Do We Care About Graphical Techniques?
   5.2  The t-Plot and Independence
   5.3  The t-Plot and Homogeneity
   5.4  Assessing Distribution Assumptions
        5.4.1  Data that Exhibit Dependence/Heterogeneity
        5.4.2  Data that Exhibit Normal IID Chance Regularities
        5.4.3  Data that Exhibit Non-Normal IID Regularities
        5.4.4  The Histogram, the Density Function, and Smoothing
        5.4.5  Smoothed Histograms and Non-Random Samples
   5.5  The Empirical CDF and Related Graphs*
        5.5.1  The Concept of the Empirical cdf (ecdf)
        5.5.2  Probability Plots
        5.5.3  Empirical Example: Exchange Rate Data
   5.6  Summary
   5.7  Questions and Exercises
   Appendix 5.A: Data – Log-Returns

6  Statistical Models and Dependence
   6.1  Introduction
        6.1.1  Extending a Simple Statistical Model
   6.2  Non-Random Sample: A Preliminary View
        6.2.1  Sequential Conditioning: Reducing the Dimensionality
        6.2.2  Keeping an Eye on the Forest!
   6.3  Dependence and Joint Distributions
        6.3.1  Dependence Between Two Random Variables
   6.4  Dependence and Moments
        6.4.1  Joint Moments and Dependence
        6.4.2  Conditional Moments and Dependence
   6.5  Joint Distributions and Modeling Dependence
        6.5.1  Dependence and the Normal Distribution
        6.5.2  A Graphical Display: The Scatterplot
        6.5.3  Dependence and the Elliptically Symmetric Family
        6.5.4  Dependence and Skewed Distributions
        6.5.5  Dependence in the Presence of Heterogeneity
   6.6  Modeling Dependence and Copulas*
   6.7  Dependence for Categorical Variables
        6.7.1  Measurement Scales and Dependence
        6.7.2  Dependence and Ordinal Variables
        6.7.3  Dependence and Nominal Variables
   6.8  Conditional Independence
        6.8.1  The Multivariate Normal Distribution
        6.8.2  The Multivariate Bernoulli Distribution
        6.8.3  Dependence in Mixed (Discrete/Continuous) Variables
   6.9  What Comes Next?
   6.10 Questions and Exercises

7  Regression Models
   7.1  Introduction
   7.2  Conditioning and Regression
        7.2.1  Reduction and Conditional Moment Functions
        7.2.2  Regression and Skedastic Functions
        7.2.3  Selecting an Appropriate Regression Model
   7.3  Weak Exogeneity and Stochastic Conditioning
        7.3.1  The Concept of Weak Exogeneity
        7.3.2  Conditioning on a σ-Field
        7.3.3  Stochastic Conditional Expectation and its Properties
   7.4  A Statistical Interpretation of Regression
        7.4.1  The Statistical Generating Mechanism
        7.4.2  Statistical vs. Substantive Models, Once Again
   7.5  Regression Models and Heterogeneity
   7.6  Summary and Conclusions
   7.7  Questions and Exercises

8  Introduction to Stochastic Processes
   8.1  Introduction
        8.1.1  Random Variables and Orderings
   8.2  The Concept of a Stochastic Process
        8.2.1  Defining a Stochastic Process
        8.2.2  Classifying Stochastic Processes; What a Mess!
        8.2.3  Characterizing a Stochastic Process
        8.2.4  Partial Sums and Associated Stochastic Processes
        8.2.5  Gaussian (Normal) Process: A First View
   8.3  Dependence Restrictions (Assumptions)
        8.3.1  Distribution-Based Concepts of Dependence
        8.3.2  Moment-Based Concepts of Dependence
   8.4  Heterogeneity Restrictions (Assumptions)
        8.4.1  Distribution-Based Heterogeneity Assumptions
        8.4.2  Moment-Based Heterogeneity Assumptions
   8.5  Building Block Stochastic Processes
        8.5.1  IID Stochastic Processes
        8.5.2  White-Noise Process
   8.6  Markov and Related Stochastic Processes
        8.6.1  Markov Process
        8.6.2  Random Walk Processes
        8.6.3  Martingale Processes
        8.6.4  Martingale Difference Process
   8.7  Gaussian Processes
        8.7.1  AR(p) Process: Probabilistic Reduction Perspective
        8.7.2  A Wiener Process and a Unit Root [UR(1)] Model
        8.7.3  Moving Average [MA(q)] Process
        8.7.4  Autoregressive vs. Moving Average Processes
        8.7.5  The Brownian Motion Process*
   8.8  Counting Processes*
        8.8.1  The Poisson Process
        8.8.2  Duration (Hazard-Based) Models
   8.9  Summary and Conclusions
   8.10 Questions and Exercises
   Appendix 8.A: Asymptotic Dependence and Heterogeneity Assumptions*
        8.A.1  Mixing Conditions
        8.A.2  Ergodicity

9  Limit Theorems in Probability
   9.1  Introduction
        9.1.1  Why Do We Care About Limit Theorems?
        9.1.2  Terminology and Taxonomy
        9.1.3  Popular Misconceptions About Limit Theorems
   9.2  Tracing the Roots of Limit Theorems
        9.2.1  Bernoulli's Law of Large Numbers: A First View
        9.2.2  Early Steps Toward the Central Limit Theorem
        9.2.3  The First SLLN
        9.2.4  Probabilistic Convergence Modes: A First View
   9.3  The Weak Law of Large Numbers
        9.3.1  Bernoulli's WLLN
        9.3.2  Poisson's WLLN
        9.3.3  Chebyshev's WLLN
        9.3.4  Markov's WLLN
        9.3.5  Bernstein's WLLN
        9.3.6  Khinchin's WLLN
   9.4  The Strong Law of Large Numbers
        9.4.1  Borel's (1909) SLLN
        9.4.2  Kolmogorov's SLLN
        9.4.3  SLLN for a Martingale
        9.4.4  SLLN for a Stationary Process
        9.4.5  The Law of Iterated Logarithm*
   9.5  The Central Limit Theorem
        9.5.1  De Moivre–Laplace CLT
        9.5.2  Lyapunov's CLT
        9.5.3  Lindeberg–Feller's CLT
        9.5.4  Chebyshev's CLT
        9.5.5  Hajek–Sidak CLT
        9.5.6  CLT for a Martingale
        9.5.7  CLT for a Stationary Process
        9.5.8  The Accuracy of the Normal Approximation
        9.5.9  Stable and Other Limit Distributions*
   9.6  Extending the Limit Theorems*
        9.6.1  A Uniform SLLN*
   9.7  Summary and Conclusions
   9.8  Questions and Exercises
   Appendix 9.A: Probabilistic Inequalities
        9.A.1  Probability
        9.A.2  Expectation
   Appendix 9.B: Functional Central Limit Theorem

10 From Probability Theory to Statistical Inference
   10.1 Introduction
   10.2 Mathematical Probability: A Brief Summary
        10.2.1  Kolmogorov's Axiomatic Approach
        10.2.2  Random Variables and Statistical Models
   10.3 Frequentist Interpretation(s) of Probability
        10.3.1  "Randomness" (Stochasticity) is a Feature of the Real World
        10.3.2  Model-Based Frequentist Interpretation of Probability
        10.3.3  Von Mises' Frequentist Interpretation of Probability
        10.3.4  Criticisms Leveled Against the Frequentist Interpretation
        10.3.5  Kolmogorov Complexity: An Algorithmic Perspective
        10.3.6  The Propensity Interpretation of Probability
   10.4 Degree of Belief Interpretation(s) of Probability
        10.4.1  "Randomness" is in the Mind of the Beholder
        10.4.2  Degrees of Subjective Belief
        10.4.3  Degrees of "Objective Belief": Logical Probability
        10.4.4  Which Interpretation of Probability?
   10.5 Frequentist vs. Bayesian Statistical Inference
        10.5.1  The Frequentist Approach to Statistical Inference
        10.5.2  The Bayesian Approach to Statistical Inference
        10.5.3  Cautionary Notes on Misleading Bayesian Claims
   10.6 An Introduction to Frequentist Inference
        10.6.1  Fisher and Neglected Aspects of Frequentist Statistics
        10.6.2  Basic Frequentist Concepts and Distinctions
        10.6.3  Estimation: Point and Interval
        10.6.4  Hypothesis Testing: A First View
        10.6.5  Prediction (Forecasting)
        10.6.6  Probability vs. Frequencies: The Empirical CDF
   10.7 Non-Parametric Inference
        10.7.1  Parametric vs. Non-Parametric Inference
        10.7.2  Are Weaker Assumptions Preferable to Stronger Ones?
        10.7.3  Induction vs. Deduction
        10.7.4  Revisiting Generic Robustness Claims
        10.7.5  Inference Based on Asymptotic Bounds
        10.7.6  Whither Non-Parametric Modeling?
   10.8 The Basic Bootstrap Method
        10.8.1  Bootstrapping and Statistical Adequacy
   10.9 Summary and Conclusions
   10.10 Questions and Exercises

11 Estimation I: Properties of Estimators
   11.1 Introduction
   11.2 What is an Estimator?
   11.3 Sampling Distributions of Estimators
   11.4 Finite Sample Properties of Estimators
        11.4.1  Unbiasedness
        11.4.2  Efficiency: Relative vs. Full Efficiency
        11.4.3  Sufficiency
        11.4.4  Minimum MSE Estimators and Admissibility
   11.5 Asymptotic Properties of Estimators
        11.5.1  Consistency (Weak)
        11.5.2  Consistency (Strong)
        11.5.3  Asymptotic Normality
        11.5.4  Asymptotic Efficiency
        11.5.5  Properties of Estimators Beyond the First Two Moments
   11.6 The Simple Normal Model: Estimation
   11.7 Confidence Intervals (Interval Estimation)
        11.7.1  Long-Run "Interpretation" of CIs
        11.7.2  Constructing a Confidence Interval
        11.7.3  Optimality of Confidence Intervals
   11.8 Bayesian Estimation
        11.8.1  Optimal Bayesian Rules
        11.8.2  Bayesian Credible Intervals
   11.9 Summary and Conclusions
   11.10 Questions and Exercises

12 Estimation II: Methods of Estimation
   12.1 Introduction
   12.2 The Maximum Likelihood Method
        12.2.1  The Likelihood Function
        12.2.2  Maximum Likelihood Estimators
        12.2.3  The Score Function
        12.2.4  Two-Parameter Statistical Model
        12.2.5  Properties of Maximum Likelihood Estimators
        12.2.6  The Maximum Likelihood Method and its Critics
   12.3 The Least-Squares Method
        12.3.1  The Mathematical Principle of Least Squares
        12.3.2  Least Squares as a Statistical Method
   12.4 Moment Matching Principle
        12.4.1  Sample Moments and their Properties
   12.5 The Method of Moments
        12.5.1  Karl Pearson's Method of Moments
        12.5.2  The Parametric Method of Moments
        12.5.3  Properties of PMM Estimators
   12.6 Summary and Conclusions
   12.7 Questions and Exercises
   Appendix 12.A: Karl Pearson's Approach

13 Hypothesis Testing
   13.1 Introduction
        13.1.1  Difficulties in Mastering Statistical Testing
   13.2 Statistical Testing Before R. A. Fisher
        13.2.1  Francis Edgeworth's Testing
        13.2.2  Karl Pearson's Testing
   13.3 Fisher's Significance Testing
        13.3.1  A Closer Look at the p-value
        13.3.2  R. A. Fisher and Experimental Design
        13.3.3  Significance Testing: Empirical Examples
        13.3.4  Summary of Fisher's Significance Testing
   13.4 Neyman–Pearson Testing
        13.4.1  N-P Objective: Improving Fisher's Significance Testing
        13.4.2  Modifying Fisher's Testing Framing: A First View
        13.4.3  A Historical Excursion
        13.4.4  The Archetypal N-P Testing Framing
        13.4.5  Significance Level α vs. the p-value
        13.4.6  Optimality of a Neyman–Pearson Test
        13.4.7  Constructing Optimal Tests: The N-P Lemma
        13.4.8  Extending the Neyman–Pearson Lemma
        13.4.9  Constructing Optimal Tests: Likelihood Ratio
        13.4.10 Bayesian Testing Using the Bayes Factor
   13.5 Error-Statistical Framing of Statistical Testing
        13.5.1  N-P Testing Driven by Substantively Relevant Values
        13.5.2  Foundational Issues Pertaining to Statistical Testing
        13.5.3  Post-Data Severity Evaluation: An Evidential Account
        13.5.4  Revisiting Issues Bedeviling Frequentist Testing
        13.5.5  The Replication Crises and Severity
   13.6 Confidence Intervals and their Optimality
        13.6.1  Mathematical Duality Between Testing and CIs
        13.6.2  Uniformly Most Accurate CIs
        13.6.3  Confidence Intervals vs. Hypothesis Testing
        13.6.4  Observed Confidence Intervals and Severity
        13.6.5  Fallacious Arguments for Using CIs
   13.7 Summary and Conclusions
   13.8 Questions and Exercises
   Appendix 13.A: Testing Differences Between Means
        13.A.1  Testing the Difference Between Two Means
        13.A.2  What Happens when Var(X1t) ≠ Var(X2t)?
        13.A.3  Bivariate Normal Model: Paired Sample Tests
        13.A.4  Testing the Difference Between Two Proportions
        13.A.5  One-Way Analysis of Variance

14 Linear Regression and Related Models
   14.1 Introduction
        14.1.1  What is a Statistical Model?
   14.2 Normal, Linear Regression Model
        14.2.1  Specification
        14.2.2  Estimation
        14.2.3  Fitted Values and Residuals
        14.2.4  Goodness-of-Fit Measures
        14.2.5  Confidence Intervals and Hypothesis Testing
        14.2.6  Normality and the LR Model
        14.2.7  Testing a Substantive Model Against the Data
   14.3 Linear Regression and Least Squares
        14.3.1  Mathematical Approximation and Statistical Curve-Fitting
        14.3.2  Gauss–Markov Theorem
        14.3.3  Asymptotic Properties of OLS Estimators
   14.4 Regression-Like Statistical Models
        14.4.1  Gauss Linear Model
        14.4.2  The Logit and Probit Models
        14.4.3  The Poisson Regression-Like Model
        14.4.4  Generalized Linear Models
        14.4.5  The Gamma Regression-Like Model
   14.5 Multiple Linear Regression Model
        14.5.1  Estimation
        14.5.2  Linear Regression: Matrix Formulation
        14.5.3  Fitted Values and Residuals
        14.5.4  OLS Estimators and their Sampling Distributions
   14.6 The LR Model: Numerical Issues and Problems
        14.6.1  The Problem of Near-Collinearity
        14.6.2  The Hat Matrix and Influential Observations
        14.6.3  Individual Observation Influence Measures
   14.7 Conclusions
   14.8 Questions and Exercises
   Appendix 14.A: Generalized Linear Models
        14.A.1  Exponential Family of Distributions
        14.A.2  Common Features of Generalized Linear Models
        14.A.3  MLE and the Exponential Family
   Appendix 14.B: Data

15 Misspecification (M-S) Testing
   15.1 Introduction
   15.2 Misspecification and Inference: A First View
        15.2.1  Actual vs. Nominal Error Probabilities
        15.2.2  Reluctance to Test the Validity of Model Assumptions
   15.3 Non-Parametric (Omnibus) M-S Tests
        15.3.1  The Runs M-S Test for the IID Assumptions [2]–[4]
        15.3.2  Kolmogorov's M-S Test for Normality ([1])
   15.4 Parametric (Directional) M-S Testing
        15.4.1  A Parametric M-S Test for Independence ([4])
        15.4.2  Testing Independence and Mean Constancy ([2] and [4])
        15.4.3  Testing Independence and Variance Constancy ([2] and [4])
        15.4.4  The Skewness–Kurtosis Test of Normality
        15.4.5  Simple Normal Model: A Summary of M-S Testing
   15.5 Misspecification Testing: A Formalization
        15.5.1  Placing M-S Testing in a Proper Context
        15.5.2  Securing the Effectiveness/Reliability of M-S Testing
        15.5.3  M-S Testing and the Linear Regression Model
        15.5.4  The Multiple Testing (Comparisons) Issue
        15.5.5  Testing for t-Invariance of the Parameters
        15.5.6  Where do Auxiliary Regressions Come From?
        15.5.7  M-S Testing for Logit/Probit Models
        15.5.8  Revisiting Yule's "Nonsense Correlations"
        15.5.9  Respecification
   15.6 An Illustration of Empirical Modeling
        15.6.1  The Traditional Curve-Fitting Perspective
        15.6.2  Traditional ad hoc M-S Testing and Respecification
        15.6.3  The Probabilistic Reduction Approach
   15.7 Summary and Conclusions
   15.8 Questions and Exercises
   Appendix 15.A: Data

References
Index
Preface to the Second Edition
The original book, published 20 years ago, has been thoroughly revised with two objectives in mind. First, to make the discussion more compact and coherent by avoiding repetition and many digressions. Second, to improve the methodological coherence of the proposed empirical modeling framework by including material pertaining to foundational issues that has been published by the author over the last 20 years or so in journals on econometrics, statistics, and philosophy of science. In particular, this revised edition brings out more clearly several crucial distinctions that elucidate empirical modeling, including (a) the statistical vs. the substantive information/model, (b) the modeling vs. the inference facet of statistical analysis, (c) testing within and testing outside the boundary of a statistical model, and (d) pre-data vs. post-data error probabilities. These distinctions shed light on several foundational issues and suggest solutions. In addition, the comprehensiveness of the book has been improved by adding Chapter 14 on the linear regression and related models. The current debates on the “replication crises” render the methodological framework articulated in this book especially relevant for today’s practitioner. A closer look at the debates (Mayo, 2018) reveals that the non-replicability of empirical evidence problem is, first and foremost, a problem of untrustworthy evidence routinely published in prestigious journals. The current focus of that literature on the abuse of significance testing is rather misplaced, because it is only a part of a much broader problem relating to the mechanical application of statistical methods without a real understanding of their assumptions, limitations, proper implementation, and interpretation of their results. The abuse and misinterpretation of the p-value is just symptomatic of the same uninformed implementation that contributes majorly to the problem of untrustworthy evidence. 
Indeed, the same uninformed implementation often ensures that untrustworthy evidence is routinely replicated, when the same mistakes are repeated by equally uninformed practitioners! In contrast to the current conventional wisdom, it is argued that a major contributor to the untrustworthy evidence problem is statistical misspecification: invalid probabilistic assumptions imposed on one's data, another symptom of the same uninformed implementation. The primary objective of this book is to provide the necessary probabilistic foundation and the overarching modeling framework for an informed and thoughtful application of statistical methods, as well as the proper interpretation of their inferential results. The emphasis is placed less on the mechanics of the application of statistical methods, and more on understanding their assumptions, limitations, and proper interpretation.
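The claim that statistical misspecification undermines the trustworthiness of evidence can be made concrete with a small simulation sketch (not from the book; the numbers and the AR(1) dependence structure are illustrative assumptions). A one-sample t-test has a nominal 5% type I error probability when the data are Normal, Independent, and Identically Distributed. When the independence assumption is invalid, the actual error probability can differ markedly from the nominal one:

```python
# Illustrative sketch: how an invalid probabilistic assumption (here,
# independence) distorts the actual error probabilities of a test whose
# nominal level presumes NIID data.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 2000
crit = 1.984  # two-sided 5% critical value of Student's t with 99 df

def t_rejects(x):
    """One-sample t-test of H0: mean = 0, rejecting at nominal level 0.05."""
    t = np.sqrt(len(x)) * x.mean() / x.std(ddof=1)
    return abs(t) > crit

def ar1(n, rho):
    """Zero-mean stationary AR(1) series: dependent, so IID is invalid."""
    x = np.empty(n)
    x[0] = rng.normal() / np.sqrt(1 - rho**2)
    for t in range(1, n):
        x[t] = rho * x[t - 1] + rng.normal()
    return x

# Under H0 in both cases: the true mean is zero.
iid_rate = np.mean([t_rejects(rng.normal(size=n)) for _ in range(reps)])
ar1_rate = np.mean([t_rejects(ar1(n, rho=0.7)) for _ in range(reps)])

print(f"IID data (assumptions valid):    actual size {iid_rate:.3f} (nominal 0.05)")
print(f"AR(1) data (independence false): actual size {ar1_rate:.3f} (nominal 0.05)")
```

With the dependent data the rejection frequency under a true null is typically several times the nominal 5%, which is exactly the sense in which "untrustworthy evidence" arises: the reported error probabilities no longer describe the procedure actually applied.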
Key Features of the Book

● It offers a seamless integration of probability theory and statistical inference with a view to elucidating the interplay between deduction and induction in "learning from data" about observable phenomena of interest using statistical procedures.

● It develops frequentist modeling and inference from first principles by emphasizing the notion of a statistical model and its adequacy (the validity of its probabilistic assumptions vis-à-vis the particular data) as the cornerstone for reliable inductive inference and trustworthy evidence.

● It presents frequentist inference as well-grounded procedures whose optimality is assessed by their capacity to achieve genuine "learning from data."

● It focuses primarily on the skills and the technical knowledge one needs to be able to begin with substantive questions of interest, select the relevant data carefully, and proceed to establish trustworthy evidence for or against hypotheses or claims relating to the questions of interest. These skills include understanding the statistical information conveyed by data plots, selecting appropriate statistical models, as well as validating them using misspecification testing before any inferences are drawn.

● It articulates reasoned responses to several charges leveled against aspects of frequentist inference by addressing the underlying foundational issues, including the use and abuse of p-values and confidence intervals, Neyman–Pearson vs. Fisher testing, and inference results vs. evidence, issues that have bedeviled frequentist inference since the 1930s. The book discusses several such foundational issues/problems and proposes ways to address them using an error statistical perspective grounded in the concept of severity. Methodological issues discussed in this book include rebuttals to widely used, ill-thought-out arguments for ignoring statistical misspecification, as well as principled responses to certain Bayesian criticisms of the frequentist approach.

● Its methodological perspective differs from the traditional textbook perspective by bringing out the perils of curve-fitting and focusing on the key question: How can empirical modeling lead to "learning from data" about phenomena of interest by giving rise to trustworthy evidence?
NOTE: All sections marked with an asterisk (∗) can be skipped at first reading without any serious interruption in the flow of the discussion.
Acknowledgments

More than any other person, Deborah G. Mayo, my colleague and collaborator on many foundational issues in statistical inference, has helped to shape my views on several methodological issues addressed in this book; for that and her constant encouragement, I'm most grateful to her. I'm also thankful to Clark Glymour, the other philosopher of science with whom I had numerous elucidating and creative discussions on many philosophical issues discussed in the book. Thanks are also due to Sir David Cox for many discussions that helped me appreciate the different perspectives on frequentist inference. Special thanks are also due to my longtime collaborator, Anya McGuirk, who contributed majorly in puzzling out several thorny issues discussed in this book. I owe a special thanks to Julio Lopez for
his insightful comments, as well as his unwavering faith in the coherence and value of the proposed approach to empirical modeling. I’m also thankful to Jesse Bledsoe for helpful comments on chapter 13 and Mariusz Kamienski for invaluable help on the front cover design. I owe special thanks to several of my former and current students over the last 20 years, who helped to improve the discussion in this book by commenting on earlier drafts and finding mistakes and typos. They include Elena Andreou, Andros Kourtellos, Carlos Elias, Maria Heracleous, Jason Bergtold, Ebere Akobundu, Andreas Koutris, Alfredo Romero, Niraj Pouydal, Michael Michaelides, Karo Solat, and Mohammad Banasaz.
Symbols

N – set of natural numbers, N := {1, 2, ..., n, ...}
R – the set of real numbers; the real line (−∞, ∞)
Rn – := R × R × · · · × R (n times)
R+ – the set of positive real numbers; the half real line (0, ∞)
f(x; θ) – density function of X with parameters θ
F(x; θ) – cumulative distribution function of X with parameters θ
N(μ, σ²) – Normal distribution with mean μ and variance σ²
E – Random Experiment (RE)
S – outcomes set (sample space)
ℑ – event space (a σ-field)
P(.) – probability set function
σ(X) – minimal sigma-field generated by X

Acronyms

AR(p) – Autoregressive model with p lags
CAN – Consistent, Asymptotically Normal
cdf – cumulative distribution function
CLT – Central Limit Theorem
ecdf – empirical cumulative distribution function
GM – Generating Mechanism
IID – Independent and Identically Distributed
LS – Least-Squares
ML – Maximum Likelihood
M-S – Mis-Specification
N-P – Neyman–Pearson
PMM – Parametric Method of Moments
SLLN – Strong Law of Large Numbers
WLLN – Weak Law of Large Numbers
UMP – Uniformly Most Powerful
1 An Introduction to Empirical Modeling
1.1 Introduction
Empirical modeling, broadly speaking, refers to the process, methods, and strategies grounded in statistical modeling and inference whose primary aim is to give rise to "learning from data" about stochastic observable phenomena, using statistical models. Real-world phenomena of interest are said to be "stochastic," and thus amenable to statistical modeling, when the data they give rise to exhibit chance regularity patterns, irrespective of whether they arise from passive observation or active experimentation. In this sense, empirical modeling has three crucial features: (a) it is based on observed data that exhibit chance regularities; (b) its cornerstone is the concept of a statistical model that describes a probabilistic generating mechanism that could have given rise to the data in question; (c) it provides the framework for combining the statistical and substantive information with a view to elucidating (understanding, predicting, explaining) phenomena of interest.

Statistical vs. substantive information. Empirical modeling across different disciplines involves an intricate blending of substantive subject matter and statistical information. The substantive information stems from a theory or theories pertaining to the phenomenon of interest that could range from simple conjectures to intricate substantive (structural) models. Such information has an important and multifaceted role to play by demarcating the crucial aspects of the phenomenon of interest (suggesting the relevant variables and data), as well as enhancing the learning from data when it meliorates the statistical information without belying it. In contrast, statistical information stems from the chance regularities in the data. Scientific knowledge often begins with substantive conjectures based on subject matter information, but it becomes knowledge only when its veracity is firmly grounded in real-world data.
In this sense, success in "learning from data" stems primarily from a harmonious blending of these two sources of information into an empirical model that is both statistically and substantively "adequate"; see Sections 1.5 and 1.6.
Empirical modeling as curve-fitting. The current traditional perspective on empirical modeling largely ignores the above distinctions by viewing the statistical problem as "quantifying theoretical relationships presumed true." From this perspective, empirical modeling is viewed as a curve-fitting problem, guided primarily by goodness-of-fit. The substantive model is often imposed on the data in an attempt to quantify its unknown parameters. This treats the substantive information as established knowledge, and not as tentative conjectures to be tested against data. The end result of curve-fitting is often an estimated model that is misspecified, both statistically (invalid probabilistic assumptions) and substantively; it does not elucidate sufficiently the phenomenon of interest. This raises a thorny problem in philosophy of science known as Duhem's conundrum (Mayo, 1996), because there is no principled way to distinguish between the two types of misspecification and apportion blame.

It is argued that the best way to address this impasse is (i) to disentangle the statistical from the substantive model by unveiling the probabilistic assumptions (implicitly or explicitly) imposed on the data (the statistical model) and (ii) to separate the modeling from the inference facet of empirical modeling. The modeling facet includes specifying and selecting a statistical model, as well as appraising its adequacy (the validity of its probabilistic assumptions) using misspecification testing. The inference facet uses a statistically adequate model to pose questions of substantive interest to the data. Crudely put, conflating the modeling with the inference facet is analogous to mistaking the process of constructing a boat to preset specifications for sailing it in a competitive race; imagine trying to construct the boat while sailing it in a competitive race.

Early cautionary note.
It is likely that some scholars in empirical modeling will mock and criticize the introduction of new terms and distinctions in this book as "mounds of gratuitous jargon," symptomatic of an ostentatious display of pedantry. As a pre-emptive response to such critics, allow me to quote R. A. Fisher's 1931 reply to the American mathematician/statistician Arne Fisher, who complained about his "introduction in statistical method of some outlandish and barbarous technical terms. They stand out like quills upon the porcupine, ready to impale the sceptical critic. Where, for instance, did you get that atrocity, a statistic?"
His serene response was: I use special words for the best way of expressing special meanings. Thiele and Pearson were quite content to use the same words for what they were estimating and for their estimates of it. Hence the chaos in which they left the problem of estimation. Those of us who wish to distinguish the two ideas prefer to use different words, hence ‘parameter’ and ‘statistic’. No one who does not feel this need is under any obligation to use them. Also, to Hell with pedantry. (Bennett, 1990, pp. 311–313) [emphasis added]
A bird's-eye view of the chapter. The rest of this chapter elaborates on the crucial features of empirical modeling (a)–(c). In Section 1.2 we discuss the meaning of stochastic observable phenomena and why such phenomena are amenable to empirical modeling. Section 1.3 focuses on the relationship between data from stochastic phenomena and statistical models. Section 1.4 discusses several important issues relating to observed data, including their different measurement scales, nature, and accuracy. In Section 1.5 we discuss the important notion of statistical adequacy: whether the postulated statistical model "accounts fully for"
the statistical systematic information in the data. Section 1.6 discusses briefly the connection between a statistical model and the substantive information of interest.
1.2 Stochastic Phenomena: A Preliminary View
This section provides an intuitive explanation for the notion of a stochastic phenomenon as it relates to the concept of a statistical model, discussed in the next section.
1.2.1 Chance Regularity Patterns

The chance regularities denote patterns that are usually revealed using a variety of graphical techniques and careful preliminary data analysis. The essence of chance regularity, as suggested by the term itself, comes in the form of two entwined features:

chance: an inherent uncertainty relating to the occurrence of particular outcomes;
regularity: discernible regularities associated with an aggregate of many outcomes.

TERMINOLOGY: The term "chance regularity" is used in order to avoid possible confusion with the more commonly used term "randomness." At first sight these two attributes might appear to be contradictory, since "chance" is often understood as the absence of order and "regularity" denotes the presence of order. However, there is no contradiction because the "disorder" exists at the level of individual outcomes and the order at the aggregate level. The two attributes should be viewed as inseparable for the notion of chance regularity to make sense.

Example 1.1 To get some idea about "chance regularity" patterns, consider the data given in Table 1.1.
Table 1.1 Observed data 3 10 11 5 6 7 10 8 5 11 2 9 9 6 8 4 7 7 8 5 4 6 11 7 10 5 8 7 5 9 8 10 2 7 11 8 9 5 7 3 4 9 10 4 7 4 6 9 7 6 12 10 3 6 9 7 5 8 6 2 9 6 4 7 8 10 5 8 5 7 7 6 12 9 10 4 8 6 5 4 7 8 6 7 11
6 5 12 3 8 10 8 11 9 7 9 6 7 8 3
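As a quick sanity check, the relative frequencies of the Table 1.1 values can be tallied in a few lines. The following sketch is mine, not part of the text; the data are typed in row by row from the table above:

```python
from collections import Counter

# The 100 observations of Table 1.1, read row by row
data = [3, 10, 11, 5, 6, 7, 10, 8, 5, 11, 2, 9, 9, 6, 8, 4, 7, 7, 8, 5,
        4, 6, 11, 7, 10, 5, 8, 7, 5, 9, 8, 10, 2, 7, 11, 8, 9, 5, 7, 3,
        4, 9, 10, 4, 7, 4, 6, 9, 7, 6, 12, 10, 3, 6, 9, 7, 5, 8, 6, 2,
        9, 6, 4, 7, 8, 10, 5, 8, 5, 7, 7, 6, 12, 9, 10, 4, 8, 6, 5, 4,
        7, 8, 6, 7, 11, 6, 5, 12, 3, 8, 10, 8, 11, 9, 7, 9, 6, 7, 8, 3]

counts = Counter(data)
# Relative frequency of each outcome, e.g. RF(7) = counts[7] / 100
rel_freq = {k: counts[k] / len(data) for k in sorted(counts)}
print(rel_freq)
```

Running the tally reproduces the counts cited in the discussion of the histogram, e.g. the value 3 occurs five times and the value 7 seventeen times.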
A glance at Table 1.1 suggests that the observed data constitute integers between 2 and 12, but no real patterns are apparent, at least at first sight. To bring out any chance regularity patterns we use a graph as shown in Figure 1.1, t-plot: {(t, xt), t = 1, 2, . . . , n}. The first distinction to be drawn is that between chance regularity patterns and deterministic regularities that are easy to detect. Deterministic regularity. When a t-plot exhibits a clear pattern which would enable one to predict (guess) the value of the next observation exactly, the data are said to exhibit deterministic regularity. The easiest way to think about deterministic regularity is to visualize
[Fig. 1.1: t-plot of a sequence of 100 observations]

[Fig. 1.2: graph of x = 1.5 cos((π/3)t + (π/3))]
the graphs of mathematical functions. If a t-plot of data can be depicted by a mathematical function, the numbers exhibit deterministic regularity; see Figure 1.2. In contrast to deterministic regularities, to detect chance patterns one needs to perform a number of thought experiments. Thought experiment 1–Distribution regularity. Associate each observation with identical squares and rotate Figure 1.1 anti-clockwise by 90◦ , letting the squares fall vertically to form a pile on the x-axis. The pile represents the well-known histogram (see Figure 1.3). The histogram exhibits a clear triangular shape, reflecting a form of regularity often associated with stable (unchanging) relative frequencies (RF) expressed as percentages
[Fig. 1.3: Histogram of the data in Figure 1.1]
(%). Each bar of the histogram represents the frequency of each of the integers 2–12. For example, since the value 3 occurs five times in this data set, its relative frequency is RF(3) = 5/100 = .05. The relative frequency of the value 7 is RF(7) = 17/100 = .17, which is the highest among the values 2–12. For reasons that will become apparent shortly, we name this discernible distribution regularity.

[1] Distribution: After a large enough number of trials, the relative frequency of the outcomes forms a seemingly stable distribution shape.

Thought experiment 2. In Figure 1.1, one would hide the observations beyond a certain value of the index, say t = 40, and try to guess the next outcome on the basis of the observations up to t = 40. Repeat this along the x-axis for different index values and if it turns out that it is more or less impossible to use the previous observations to narrow down the potential outcomes, conclude that there is no dependence pattern that would enable the modeler to guess the next observation (within narrow bounds) with any certainty. In this experiment one needs to exclude the extreme values of 2 and 12, because following these values one is almost certain to get a value greater and smaller, respectively. This type of predictability is related to the distribution regularity mentioned above. For reference purposes we name the chance regularity associated with the unpredictability of the next observation given the previous observations.

[2] Independence: In a sequence of trials, the outcome of any one trial does not influence and is not influenced by the outcome of any other.

Thought experiment 3. In Figure 1.1 take a wide enough frame (to cover the spread of the fluctuations) that is also long enough (roughly less than half the length of the horizontal axis) and let it slide from left to right along the horizontal axis, looking at the picture inside the frame as it slides along.
In cases where the picture does not change significantly, the data exhibit the chance regularity we call homogeneity, otherwise heterogeneity is present; see
Chapter 5. Another way to view this pattern is in terms of the arithmetic average and the variation around this average of the observations as we move from left to right. It appears as though this sequential average and its variation are relatively constant around 7. Moreover, the variation around this constant average value appears to be within fixed bands. This chance regularity can be intuitively described by the notion of homogeneity.

[3] Homogeneity: The probabilities associated with all possible outcomes remain the same for all trials.

In summary, the data in Figure 1.1 exhibit the following chance regularity patterns: [1] a triangular distribution; [2] Independence; [3] Homogeneity (ID). It is important to emphasize that these patterns have been discerned directly from the observed data without the use of any substantive subject matter information. Indeed, at this stage it is still unknown what these observations represent or measure, but that does not prevent one from discerning certain chance regularity patterns. The information conveyed by these patterns provides the raw material for constructing statistical models aiming to adequately account for (or model) this (statistical) information. The way this is achieved is to develop probabilistic concepts which aim to formalize these patterns in a mathematical way and provide canonical elements for constructing statistical models.

The formalization begins by representing the data as a set of n ordered numbers denoted generically by x0 := (x1, x2, . . . , xn). These numbers are in turn interpreted as a typical realization of a finite initial segment X := (X1, X2, . . . , Xn) of a (possibly infinite) sequence of random variables {Xt, t = 1, 2, . . . , n, . . .} we call a sample X; note that the random variables are denoted by capital letters and observations by small letters. The chance regularity patterns exhibited by the data are viewed as reflecting the probabilistic structure of {Xt, t = 1, 2, . . . , n, . . .}. For the data in Figure 1.1, the structure one can realistically ascribe to the sample X is that they are independent and identically distributed (IID) random variables, with a triangular distribution. These probabilistic concepts will be formalized in the next three chapters to construct a statistical model that will take the simple form shown in Table 1.2.

Table 1.2 Simple statistical model
[D] Distribution:   Xt ∼ (μ, σ²), xt ∈ NX := {2, . . . , 12}, discrete triangular
[M] Dependence:     (X1, X2, . . . , Xn) are independent (I)
[H] Heterogeneity:  (X1, X2, . . . , Xn) are identically distributed (ID)

Note that μ = E(Xt) and σ² = E(Xt − μ)² denote the mean and variance of Xt, respectively; see Chapter 3. It is worth emphasizing again that the choice of this statistical model, which aims to account for the regularities in Figure 1.1, relied exclusively on the chance regularities, without invoking any substantive subject matter information relating to the actual mechanism that gave rise to the particular data. Indeed, the generating mechanism was deliberately veiled in the discussion so far to make this point.
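The sliding-frame idea of thought experiment 3 can also be mimicked numerically: if the rolling averages and rolling variances stay within a narrow band, the data look t-homogeneous. A minimal sketch (the helper names are mine, not the book's), applied to the first row of Table 1.1:

```python
def rolling_mean(x, w):
    # Average of each length-w window as the frame slides left to right
    return [sum(x[i:i + w]) / w for i in range(len(x) - w + 1)]

def rolling_var(x, w):
    # Variation around each window's average
    out = []
    for i in range(len(x) - w + 1):
        win = x[i:i + w]
        m = sum(win) / w
        out.append(sum((v - m) ** 2 for v in win) / w)
    return out

# First 20 observations of Table 1.1; for homogeneous data both
# sequences hover around constants rather than drifting or trending.
series = [3, 10, 11, 5, 6, 7, 10, 8, 5, 11, 2, 9, 9, 6, 8, 4, 7, 7, 8, 5]
print(rolling_mean(series, 10))
print(rolling_var(series, 10))
```

This is only a heuristic check; formal assessments of homogeneity are developed in Chapter 5.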
1.2.2 From Chance Regularities to Probabilities

The question that naturally arises is whether the available substantive information pertaining to the mechanism that gave rise to the data in Figure 1.1 would affect the choice of a statistical model. Common sense suggests that it should, but it is not clear what its role should be. Let us discuss that issue in more detail.

The actual data-generating mechanism (DGM). It turns out that the data in Table 1.1 were generated by a sequence of n = 100 trials of casting two dice and adding the dots of the two sides facing up. This game of chance was very popular in medieval times and a favorite pastime of soldiers waiting for weeks on end outside the walls of European cities they had under siege, looking for the right opportunity to assail them. After thousands of trials these illiterate soldiers learned empirically (folk knowledge) that the number 7 occurs more often than any other number and that 6 occurs less often than 7 but more often than 5; 2 and 12 would occur the least number of times. One can argue that these soldiers had an instinctive understanding of the empirical relative frequencies summarized by the histogram in Figure 1.3. In this subsection we will attempt to reconstruct how this intuition was developed into something more systematic using mathematization tools that eventually led to probability theory. Historically, the initial step from the observed regularities to their probabilistic formalization was very slow in the making, taking centuries to materialize; see Chapter 2.

The first crucial feature of the generating mechanism is its stochastic nature: at each trial (the casting of two dice), the outcome (the sum of the dots of the sides) cannot be predicted with any certainty. The only thing one can say with certainty is that the result of each trial will be one of the numbers {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}. It is also known that these numbers do not occur equally often in this game of chance.
How does one explain the differences in the empirical relative frequency of occurrence for the different numbers as shown in Figure 1.3? The first systematic account of the underlying mathematics behind Figure 1.3 was given by Gerolamo Cardano (1501–1576), who lived in Milan, Italy. He was an Italian polymath, whose wide interests ranged from being a mathematician, physician, biologist, chemist, astrologer/astronomer, to gambler.

The mathematization of chance regularities. Cardano reasoned that since each die has six faces (1, 2, . . . , 6), if the die is symmetric and homogeneous, the probability of each outcome is equal to 1/6, i.e.

Number of dots:  1    2    3    4    5    6
Probability:    1/6  1/6  1/6  1/6  1/6  1/6
When casting two dice (D1 , D2 ), one has 36 possible outcomes associated with the different pairings of these numbers (i, j), i, j = 1, 2, . . . , 6; see Table 1.3. That is, behind each one of the possible events {2, 3, . . . , 12} there is a combination of elementary outcomes, whose probability of occurrence could be used to explain the differences in their relative frequencies. The second crucial feature of the generating mechanism is that, under certain conditions, all elementary outcomes (x, y) are equally likely to occur; each elementary outcome occurs with probability 1/36. These conditions are of paramount importance in modeling stochastic
Table 1.3 Elementary outcomes: casting two dice

D1\D2    1      2      3      4      5      6
1      (1,1)  (1,2)  (1,3)  (1,4)  (1,5)  (1,6)
2      (2,1)  (2,2)  (2,3)  (2,4)  (2,5)  (2,6)
3      (3,1)  (3,2)  (3,3)  (3,4)  (3,5)  (3,6)
4      (4,1)  (4,2)  (4,3)  (4,4)  (4,5)  (4,6)
5      (5,1)  (5,2)  (5,3)  (5,4)  (5,5)  (5,6)
6      (6,1)  (6,2)  (6,3)  (6,4)  (6,5)  (6,6)
phenomena, because they constitute the premises of inference. In this case they pertain to the physical symmetry of the two dice and the homogeneity (sameness) of the replication process. In the actual experiment giving rise to the data in Table 1.1, the dice were cast in the same wooden box to secure a certain form of nearly identical conditions for each trial.

Going from these elementary outcomes to the recorded result z = x + y, it becomes clear that certain events are more likely to occur than others, because they occur when different combinations of the elementary outcomes arise (see Table 1.4). For instance, we know that the number 2 can arise as the sum of a single combination of faces: {1, 1} – each die comes up 1, hence Pr({1, 1}) = 1/36. The same applies to the number 12: Pr({6, 6}) = 1/36. On the other hand, the number 3 can arise as the sum of two sets of faces: {(1, 2), (2, 1)}, hence Pr({(1, 2), (2, 1)}) = 2/36. The same applies to the number 11: Pr({(6, 5), (5, 6)}) = 2/36. If you do not find the above derivations straightforward do not feel too bad, because a giant of seventeenth-century mathematics, Gottfried Leibniz (1646–1716), who developed differential and integral calculus independently of Isaac Newton, made an elementary mistake when he argued that Pr(z = 11) = Pr(z = 12) = 1/36; see Todhunter (1865, p. 48). The reason? Leibniz did not understand clearly the notion of "the set of all possible distinct outcomes" (Table 1.3)!

Continuing this line of thought, one can construct a probability distribution that relates each event of interest with a certain probability of occurrence (see Figure 1.4). As we can see, the outcome most likely to occur is the number 7. We associate the relative frequency of occurrence with the underlying probabilities defining a probability distribution over all possible results; see Chapter 3.

Table 1.4 Probability distribution: sum of two dice

Outcome:      2     3     4     5     6     7     8     9    10    11    12
Probability: 1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
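The derivations above are easy to verify by brute-force enumeration of the 36 elementary outcomes. The following sketch (mine, not the book's) reproduces Table 1.4, the probability of the event "the sum is bigger than 9" that figures in Cardano's betting example below, and the mean and variance appearing in Table 1.2:

```python
from fractions import Fraction
from itertools import product

# Enumerate the 36 equally likely elementary outcomes of Table 1.3
dist = {}
for i, j in product(range(1, 7), repeat=2):
    z = i + j
    dist[z] = dist.get(z, Fraction(0)) + Fraction(1, 36)

# Table 1.4: e.g. Pr(7) = 6/36, Pr(2) = Pr(12) = 1/36
assert dist[7] == Fraction(6, 36)

# Cardano's event C: the sum of two dice is bigger than 9
pr_C = dist[10] + dist[11] + dist[12]   # = 6/36 = 1/6, so fair odds are 5-to-1

# Mean and variance of the sum (cf. Table 1.2)
mu = sum(z * p for z, p in dist.items())               # = 7
var = sum((z - mu) ** 2 * p for z, p in dist.items())  # = 35/6
print(dist, pr_C, mu, var)
```

Using `Fraction` keeps the arithmetic exact, so the enumerated probabilities match Table 1.4 term by term rather than up to rounding.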
One can imagine Cardano sitting behind a makeshift table at a corner of Piazza del Duomo in Milan inviting passers-by to make quick money by betting on events like C – the sum of two dice being bigger than 9, and offering odds 3-to-1 against; three ways to lose and one to win. He knew that based on Table 1.3, Pr(C) = 6/36. This meant that he would win most of the time, since the relevant odds to be a fair game should have been 5-to-1. Probabilistic knowledge meant easy money for this avid gambler and he was not ready to share
[Fig. 1.4: Probability distribution]

[Fig. 1.5: Probability vs. relative frequency]
it with the rest of the world. Although he published numerous books and pamphlets during his lifetime, including his autobiography in lurid detail, his book about games of chance, Liber de Ludo Aleae, written around 1564, was only published posthumously in 1663; see Schwartz (2006).

The probability distribution in Table 1.4 represents a mathematical concept formulated to model a particular form of chance regularity exhibited by the data in Figure 1.1 and summarized by the histogram in Figure 1.3. A direct comparison between Figures 1.3 and 1.4, by superimposing the latter on the former in Figure 1.5, confirms the soldiers' intuition: the empirical relative frequencies are very close to the theoretical probabilities. Moreover, if we were to repeat the experiment 1000 times, the relative frequencies would have been even closer to the theoretical probabilities; see Chapter 10. In this sense we can think of the histogram in Figure 1.3 as an empirical instantiation of the probability distribution in Figure 1.4. Let us take the above formalization of the two-dice example one step further.

Example 1.2 When playing the two-dice game, the medieval soldiers used to gamble on whether the outcome would be an odd or an even number (the Greeks introduced these concepts around 300 BC), by betting on odd A = {3, 5, 7, 9, 11} or even B = {2, 4, 6, 8, 10, 12} numbers. At first sight it looks as though the soldier betting on B would have had a clear advantage since there are more even than odd numbers. The medieval soldiers, however, had folk knowledge that this was a fair bet! We can confirm that Pr(A) = Pr(B) using the probabilities in Table 1.4 to derive those in Table 1.5:

Pr(A) = Pr(3) + Pr(5) + Pr(7) + Pr(9) + Pr(11) = 2/36 + 4/36 + 6/36 + 4/36 + 2/36 = 1/2;
Pr(B) = Pr(2) + Pr(4) + Pr(6) + Pr(8) + Pr(10) + Pr(12) = 1/36 + 3/36 + 5/36 + 5/36 + 3/36 + 1/36 = 1/2.
Table 1.5 Odd and even sum

Outcome:      A    B
Probability:  .5   .5
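Example 1.2's bookkeeping can be checked mechanically by summing over the elementary outcomes directly; a sketch (not from the text):

```python
from fractions import Fraction
from itertools import product

# Pr(A): the sum is odd; Pr(B): the sum is even,
# summed over the 36 equally likely elementary outcomes of Table 1.3
pr_odd = sum(Fraction(1, 36) for i, j in product(range(1, 7), repeat=2)
             if (i + j) % 2 == 1)
pr_even = 1 - pr_odd
print(pr_odd, pr_even)  # both equal 1/2: the bet is fair
```

The sum i + j is odd exactly when one die is odd and the other even (3 × 3 × 2 = 18 of the 36 outcomes), which is why the bet is fair.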
The historical example credited with being the first successful attempt to go from empirical relative frequencies (real world) to probabilities (mathematical world) is discussed next.
1.2.2.1 Example 1.3: Chevalier de Mere's Paradox∗

Historically, the connection between stable (unchanging) relative frequencies and probabilities can be traced back to the middle of the seventeenth century in an exchange of letters between Pascal and Fermat; see Hacking (2006). Chevalier de Mere's paradox was raised in a letter from Pascal to Fermat on July 29, 1654 as one of the problems posed to him by de Mere (a French nobleman and a studious gambler). De Mere observed the following empirical regularity:

P(at least one 6 in 4 casts of one die) > 1/2 > P(a double 6 in 24 casts with two dice)

on the basis of numerous repetitions of the game. This, however, seemed to contradict his reasoning by analogy; hence the paradox.

De Mere's false reasoning. He reasoned that the two probabilities should be identical because one 6 in four casts of one die should be the same event as a double 6 in 24 casts of two dice, since 4 is to 6 as 24 is to 36. False! Why?

Multiplication counting principle. Consider the sets S1, S2, . . . , Sk with n1, n2, . . . , nk elements, respectively. Then there are n1×n2× · · · ×nk ways to choose one element from S1, then one element from S2, . . . , then one element from Sk. In the case of two dice, the set of all possible outcomes is 6×6 = 6² = 36 (see Table 1.3). To explain the empirical regularity observed by de Mere, one needs to assume equal probability (1/36) for each pair of numbers from 1 to 6 in casting two dice, and argue as in Table 1.6. The two probabilities p = 0.4914039 and q = 0.5177469 confirm that de Mere's empirical frequencies were correct but his reasoning by analogy was erroneous. What rendered the small difference of .026 between the two probabilities empirically discernible was the very large number of repetitions under more or less identical conditions. The mathematical result underlying such stable long-run frequencies is known as the Law of Large Numbers (Chapter 9).
Table 1.6 Explaining away de Mere's paradox

One die (P(i) = 1/6, i = 1, 2, . . . , 6):
  P(one 6) = 1/6
  P(a 6 in each of n casts) = (1/6)^n
  P(no 6 in n casts) = (5/6)^n
  P(at least one 6 in n casts) = 1 − (5/6)^n = q
  For n = 4: q = 1 − (5/6)^4 = 0.5177469

Two dice (P(i, j) = 1/36, i, j = 1, 2, . . . , 6):
  P(one (6,6)) = 1/36
  P((6,6) in each of n casts) = (1/36)^n
  P(no (6,6) in n casts) = (35/36)^n
  P(at least one (6,6) in n casts) = 1 − (35/36)^n = p
  For n = 24: p = 1 − (35/36)^24 = 0.4914039
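Table 1.6's two probabilities can be reproduced in a couple of lines (a sketch, not from the text):

```python
# P(at least one 6 in 4 casts of one die)
q = 1 - (5 / 6) ** 4        # ≈ 0.5177469

# P(at least one double-6 in 24 casts of two dice)
p = 1 - (35 / 36) ** 24     # ≈ 0.4914039

# de Mere's empirical regularity: q > 1/2 > p, with a gap of about .026
print(q, p, q - p)
```

The complement trick (compute the probability of no success first) is what de Mere's proportional reasoning missed.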
1.2.2.2 Statistical Models and Substantive Information

Having revealed that the data in Figure 1.1 have been generated by casting two dice, the question is whether that information will change the statistical model in Table 1.2, built exclusively on the statistical information gleaned from chance regularity patterns. In this case the substantive information simply confirms the appropriateness of assuming that the integers between 2 and 12 constitute all possible values that the generating mechanism can give rise to. In practice, any substantive subject matter information, say that the two dice are perfectly symmetrical and homogeneous, should not be imposed on the statistical model at the outset. Instead, one should allow the data to confirm or deny the validity of such information.
1.2.3 Chance Regularity Patterns and Real-World Phenomena

In the case of the experiment of casting two dice, the chance mechanism is explicit and most people will be willing to accept on faith that if this experiment is actually performed properly, then the chance regularity patterns of IID will be present. The question that naturally arises is whether data generated by real-world stochastic phenomena also exhibit such patterns. It is argued that the overwhelming majority of observable phenomena in many disciplines can be viewed as stochastic, and thus amenable to statistical modeling.

Example 1.4 Consider an example from economics where the t-plot of Xt = Δ ln(ERt), i.e. the log-changes of the Canadian/US dollar exchange rate (ER), for the period 1973–1991 (weekly observations) is shown in Figure 1.6. What is interesting about the data in Figure 1.6 is the fact that they exhibit a number of chance regularity patterns very similar to those exhibited by the dice observations
[Fig. 1.6: Exchange rate returns (weekly observations, 1973–1992)]

[Fig. 1.7: Histogram of exchange rate returns]
in Figure 1.1, but some additional patterns are also discernible. The regularity patterns exhibited by both sets of data are:

(a) the arithmetic average over the ordering (time) appears to be constant;
(b) the band of variation around this average appears to be relatively constant.

In contrast to the data in Figure 1.1, the distributional pattern exhibited by the data in Figure 1.6 is not triangular. Instead:

(c) the graph of the relative frequencies (histogram) in Figure 1.7 exhibits a certain bell-shaped symmetry. The Normal density is inserted in order to show that it does not fit well at the tails, in the mid-section, and the top, which is much higher than the Normal curve. As argued in Chapter 5, Student's t provides a more appropriate distribution for this data; see Figures 3.23 and 3.24.
(d) In addition, the data in Figure 1.6 exhibit another regularity pattern: there is a sequence of clusters of small and big changes in succession.
At this stage the reader might not have been convinced that the features noted above are easily discernible from t-plots. An important dimension of modeling in this book is to discuss how to read systematic information in data plots, which will begin in Chapter 5.
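For concreteness, log-changes like those plotted in Figure 1.6 are computed as xt = Δ ln(ERt) = ln(ERt) − ln(ERt−1). A sketch with made-up exchange-rate levels (the numbers are hypothetical, not the actual Canadian/US data):

```python
import math

# Hypothetical weekly exchange-rate levels (illustrative only)
er = [1.000, 1.004, 0.998, 1.001, 1.010, 1.007]

# Log-changes in percent: x_t = 100 * (ln ER_t - ln ER_{t-1})
x = [100 * (math.log(er[t]) - math.log(er[t - 1])) for t in range(1, len(er))]
print(x)
```

Differencing the logs is what removes the trending level of the series, leaving the roughly mean-constant fluctuations described in (a) and (b) above.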
1.3 Chance Regularities and Statistical Models
Motivated by the desire to account for (model) these chance regularities, we look to probability theory to find ways to formalize them in terms of probabilistic concepts. In particular, the stable relative frequencies regularity pattern (Tables 1.3–1.5) will be formalized using the concept of a probability distribution (see Chapter 5). The unpredictability pattern will be related to the concept of Independence ([2]), and the approximate “sameness” pattern to the Homogeneity (ID) concept ([3]). To render statistical model specification easier, the probabilistic concepts aiming to “model” the chance regularities can be viewed as belonging to three broad categories:
(D) Distribution; (M) Dependence; (H) Heterogeneity.
These broad categories can be seen as defining the basic components of a statistical model in the sense that every statistical model is a blend of components from all three categories. The first recommendation to keep in mind in empirical modeling is:

1. A statistical model is simply a set of (internally) consistent probabilistic assumptions from the three broad categories (D), (M), and (H) defining a stochastic generating mechanism that could have given rise to the particular data.
The statistical model is chosen to represent a description of a chance mechanism that accounts for the systematic information (the chance regularities) in the data. The distinguishing feature of a statistical model is that it specifies a situation, a mechanism, or a process in terms of a certain probabilistic structure. The main objective of Chapters 2–8 is to introduce numerous probabilistic concepts and ideas that render the choice of an appropriate statistical model an educated guess and not a hit-or-miss selection. The examples of casting dice, discussed above, are important not because of their intrinsic interest but because they represent examples of a simple stochastic phenomenon we refer to as a random experiment, which will be used in Chapters 2–4 to motivate the basic structure of a simple statistical model. For the exchange rate data in Figure 1.6, we will need to extend the scope of such models to account for dependence and heterogeneity; this is the subject matter of Chapters 6–8. Hence, the appropriate choice of a statistical model depends on:

(a) detecting the chance regularity patterns as exhibited by the observed data;
(b) accounting for (modeling) these patterns by selecting the appropriate probabilistic assumptions.
The first requires developing the skill to detect such patterns using a variety of graphical techniques. Hence, the second recommendation in empirical modeling is:

2. Graphical techniques constitute an indispensable tool in empirical modeling!

The interplay between chance regularities and probabilistic concepts using a variety of graphical displays is discussed in Chapter 5. Accounting for the statistical systematic information in the data presupposes a mathematical framework rich enough to model the detected chance regularity patterns. Figure 1.8 brings out the interplay between observable chance regularity patterns and formal probabilistic concepts used to construct statistical models. The variety and intended scope of statistical models are constrained only by the scope of probability theory (as a modeling framework) and the training and the imagination of the modeler. Empirical modeling begins by choosing adequate statistical models with a view to accounting for the systematic statistical information in the data. The primary objective of modeling, however, is to learn from the data by posing substantive questions of interest in the context of the selected statistical model. The third recommendation in empirical modeling is:

3. Statistical model specification is guided primarily by the probabilistic structure of the observed data, with a view to posing substantive questions of interest in its context.
[Fig. 1.8: Chance regularity patterns, probabilistic assumptions, and a statistical model — the observed data exhibit chance regularity patterns, which are accounted for by probabilistic concepts from the taxonomy (D) Distribution, (M) Dependence, (H) Heterogeneity, yielding a statistical model]
Some of the issues addressed in the next few chapters are:

(i) How should one construe a statistical model?
(ii) Why is statistical information coded in probabilistic terms?
(iii) What information does one utilize when choosing a statistical model?
(iv) What is the relationship between the statistical model and the data?
(v) How does one detect the statistical systematic information in data?

1.4 Observed Data and Empirical Modeling
In this section we will attempt a preliminary discussion of a crucial constituent element of empirical modeling, the observed data. Certain aspects of the observed data play an important role in the choice of statistical models.
1.4.1 Experimental vs. Observational Data

In most sciences, such as physics, chemistry, geology, and biology, the observed data are often generated by the modelers themselves in well-designed experiments. In econometrics the modeler is often faced with observational as opposed to experimental data. This has two important implications for empirical modeling. First, the modeler needs to develop better skills in validating the model assumptions, because random (IID) sample realizations are rare with observational data. Second, the separation of the data collector and the data analyst requires the modeler to examine thoroughly the nature and structure of the data in question.

In economics, along with the constant accumulation of observational data grew the demand to analyze these data series with a view to a better understanding of economic phenomena such as inflation, unemployment, exchange rate fluctuations, and the business cycle, as well as improving our ability to forecast economic activity. A first step toward attaining these objectives is to study the available data by being able to answer questions such as:

(i) How were the data collected and compiled?
(ii) What is the subject of measurement and what do the numbers measure?
(iii) What are the measurement units and scale?
(iv) What is the measurement period?
(v) What is the link between the data and any corresponding theoretical concepts?

A fourth recommendation to keep in mind in empirical modeling is:

4. One needs to get to know all the important dimensions (i)–(v) of the particular data before any statistical modeling and inference is carried out.
1.4.2 Observed Data and the Nature of a Statistical Model

A data set comprising n observations will be denoted by x0 := (x1, x2, ..., xn).

REMARK: It is crucial to emphasize the value of mathematical symbolism when one is discussing probability theory. The clarity and concision this symbolism introduces to the discussion is indispensable.

It is common to classify economic data according to the observation units:

(i) Cross-section: {xk, k = 1, 2, ..., n}, where k denotes individuals (firms, states, etc.);
(ii) Time series: {xt, t = 1, 2, ..., T}, where t denotes time (weeks, months, years, etc.).

For example, observed data on consumption might refer to consumption of different households at the same point in time or aggregate consumption (consumers' expenditure) over time. The first will constitute cross-section data, the second time-series data. By combining these two (e.g. observing the consumption of the same households over time), we can define a third category:

(iii) Panel (longitudinal): {xk, k := (k, t), k = 1, 2, ..., n, t = 1, 2, ..., T}, where k and t denote the indices for individuals and time, respectively.
NOTE: In this category the index k is two-dimensional but xk is one-dimensional.

At first sight the two primary categories do not seem to differ substantively because the index sets appear identical; the index sets are subsets of the set of natural numbers. A moment's reflection, however, reveals that there is more to an index set than meets the eye. In the case where the index set N := {1, 2, ..., n} refers to particular households, the index might stand for the names of the households, say

{Jones, Brown, Smith, Johnson, ...}.   (1.1)

For time series the index T := {1, 2, ..., T, ...} might refer to particular dates, say

{1972, 1973, ..., 2017}.   (1.2)
Comparing the two index sets, we note immediately that they have very different mathematical structures. The most apparent difference is that set (1.1) does not have a natural ordering: whether we put Brown before Smith is immaterial; in the case of set (1.2), by contrast, the ordering is a crucial property of the set.

In the above example the two index sets appear identical, but they turn out to be very different. This difference renders the two data sets qualitatively dissimilar, to the extent that the statistical analysis of one set of data will be distinctively different from that of the other. The reason for this will become apparent in later chapters. At this stage it is sufficient to note that a number of concepts such as dependence and heterogeneity are inextricably bound up with the ordering of the index set.

The mathematical structure of the index set is not the only criterion for classifying dissimilar data sets. The mathematical structure of the range of values of the observations themselves constitutes another, even more important, criterion. For example, the "number of children" in different households can take values {0, 1, 2, ..., 100}, where 100 is an assumed upper bound. The set of values of the variable consumption would be R+ = (0, ∞). The variable religion (Christian, Muslim, Buddhist, Other) cannot be treated in the same way, because there is no natural way to measure religion. Even if we agree on a measurement scale for religion, say {1, 2, 3, 4}, the ordering is irrelevant and the difference between these numbers is meaningless.

The above discussion raises important issues in relation to the measurement of observed data. The first is whether the numerical values can be thought of as being values from a certain interval on the real line, say [0, 1], or whether they represent a set of discrete values, say {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}. The second is whether these values have a natural ordering or not. Collecting these comments together, we can see that the taxonomy which classifies data into cross-section and time series is inadequate, because several additional classifications are ignored.
These classifications are important from the modeling viewpoint because they make a difference in so far as the applicable statistical techniques are concerned. In its abstract formulation a generic data set is designated by

{xk, k ∈ N, xk ∈ RX},

where N denotes the index set and RX the range of values of x.

NOTE: Both sets N and RX are subsets of the real line, denoted by R := (−∞, ∞).

Depending on the mathematical structure of these two sets, different classifications arise. Indeed, the mathematical structure of the sets N and RX plays a very important role in the choice of the statistical model (see Sections 1.4.3 to 1.4.5). RX can be a discrete (countable) subset of R, such as RX = {0, 1, 2, ...}, or a continuous (uncountable) subset of R, such as RX = [0, ∞). The same discrete–continuous classification can also be applied to the index set N, leading to a four-way classification of variables and the corresponding data. As shown in Chapters 3 and 4, the nature of both sets N (the index set) and RX (the range of values of the data) plays an important role in selecting the statistical model.
1.4.3 Measurement Scales and Data

A very important dimension of any observed data is the measurement scale of the individual data series. The measurement scales are traditionally classified into four broad categories (Table 1.7), together with the mathematical operations that are meaningful (legitimate) for different scales.
Ratio scale. Variables in this category enjoy the richest mathematical structure in their range of values: for any two values along the scale, say x1 and x2, all the mathematical operations (i)–(iv) are meaningful. Length, weight, consumption, investment, and gross domestic product (GDP) all belong to this category.

Table 1.7 Scales and mathematical operations (× indicates an operation that is not meaningful)

Scale      (i) (x1/x2)   (ii) (x2 − x1)   (iii) x2 ≷ x1   (iv) x2 = x1   Transformation
Ratio                                                                    scalar multiplication
Interval        ×                                                        linear function
Ordinal         ×              ×                                         increasing monotonic
Nominal         ×              ×                 ×                       one-to-one replacement
Interval scale. For a variable measured on an interval scale, the operations (ii)–(iv) are meaningful but (i) is not. The index set (1.2) (calendar time) is measured on the interval scale because the difference (1970−1965) is a meaningful magnitude but the ratio (1965/1970) is not. Additional examples of variables measured on the interval scale are temperature (Celsius, Fahrenheit) and systolic blood pressure.

Ordinal scale. For a variable measured on an ordinal scale, the operations (iii)–(iv) are meaningful but (i) and (ii) are not, e.g. grading (excellent, very good, good, failed), income class (upper, middle, lower). For such variables the ordering exists but the distance between categories is not meaningfully quantifiable.

Nominal scale. For a variable measured on a nominal scale, the operation (iv) is meaningful but (i)–(iii) are not. Such a variable denotes categories which do not have a natural ordering, e.g. marital status (married, unmarried, divorced, separated), gender (male, female, other), employment status (employed, unemployed, other). It is important to note that not all statistical concepts and methods apply to variables irrespective of their scale of measurement (see Chapter 6).

TERMINOLOGY: In the statistical literature there is some confusion between the measurement scales and three different categorizations of variables: discrete/continuous, qualitative/quantitative, categorical/non-categorical. Discrete variables can be measured on all four scales, and continuous variables can sometimes be grouped into a small number of categories. Categorical variables are only those variables that can be measured on either the ordinal or the nominal scale, but the qualitative variables category is less clearly defined in several statistics books.

Measurement scales and the index set. The examples of measurement scales used in the above discussion refer exclusively to the set RX: the range of values of a variable X. However, the discussion is also relevant for the index set N. In the case of the variable names of households, (1.1) is measured on a nominal scale. On the other hand, in the case of GDP, (1.2) is measured on the interval scale (time). This is because time does not have a natural origin (zero), and in statistical analysis the index set (1.2) is often replaced by a set of the form T := {1, 2, ..., T, ...}. We note that the time-series/cross-section distinction is often based on the measurement scale of the index set. The index set of time series is of interval scale, but that of cross-section can vary from nominal scale (gender) to ratio scale (age).

In view of the fact that, in addition to the discrete/continuous dichotomy, we have four different measurement scales for the range of values of the observed variable itself (RX) and another four for the index set N := {1, 2, ..., n, ...}, a wide variety of data types can be defined. Our concern is with the kind of statistical methods that can be meaningfully applied to the particular data in light of their nature and features.
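The hierarchy in Table 1.7 lends itself to a simple operational encoding: an operation that is meaningful at a given scale is meaningful at every scale above it. A minimal sketch (illustrative only; the function and dictionary names are ours, not the text's):

```python
# Sketch: admissible operations per measurement scale (cf. Table 1.7).
# Scales are ordered nominal < ordinal < interval < ratio; an operation
# admissible at a given scale is admissible at every scale above it.
SCALES = ["nominal", "ordinal", "interval", "ratio"]

# Minimum scale at which each operation becomes meaningful
MIN_SCALE = {
    "equality": "nominal",     # (iv) x2 = x1
    "ordering": "ordinal",     # (iii) x2 >< x1
    "difference": "interval",  # (ii) x2 - x1
    "ratio": "ratio",          # (i) x1/x2
}

def is_meaningful(operation: str, scale: str) -> bool:
    """True if `operation` is statistically meaningful on `scale`."""
    return SCALES.index(scale) >= SCALES.index(MIN_SCALE[operation])

print(is_meaningful("difference", "interval"))  # True: 1970 - 1965 is meaningful
print(is_meaningful("ratio", "interval"))       # False: 1965/1970 is not
print(is_meaningful("ordering", "nominal"))     # False: no natural ordering
```

The single comparison of positions in the hierarchy captures the general rule stated below: a method appropriate for a certain scale is also appropriate for the scales above it, but not below it.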
1.4.4 Measurement Scale and Statistical Analysis

The measurement scales are of interest in statistical modeling because data measured on different scales need different statistical treatment. To give an idea of what that involves, consider data x0 := (x1, x2, ..., xn) on religious affiliation under the categories Christian (1), Jewish (2), Muslim (3), Other (4), where we decide to attach to these four groups the numbers 1–4. How can one provide a set of summary statistics for such data in the context of descriptive statistics? The data for such a variable will look like

(1, 4, 3, 1, 1, 2, 2, 2, 1, 2, 3, 3, 1, 1, 1).

It is clear that for such data the notion of the arithmetic mean

x̄ = (1/n) Σ_{k=1}^n x_k = (1/15)(1+4+3+1+1+2+2+2+1+2+3+3+1+1+1) = 1.867   (1.3)

makes no substantive or statistical sense, because the numbers we attached to these groups could easily have been 10, 20, 30, 40. How can one provide a measure of location for such data? A more appropriate descriptive measure is the mode: the value in the data that has the highest relative frequency. In this case the mode is x_m = 1, since this value occurs in 7 out of 15 data values; see Table 1.8.

Table 1.8 Scales and location measures (× indicates not meaningful)

          nominal   ordinal   interval   ratio
mean         ×         ×
median       ×
mode
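The contrast between the mean and the mode for nominal data can be checked directly: recoding the categories changes the mean, while the mode keeps picking out the same category. A minimal sketch using the religion data above (the 10–40 recoding mirrors the remark in the text):

```python
from statistics import mean, mode

# Religion data coded Christian=1, Jewish=2, Muslim=3, Other=4
x = [1, 4, 3, 1, 1, 2, 2, 2, 1, 2, 3, 3, 1, 1, 1]

print(round(mean(x), 3))  # 1.867, as in (1.3) -- but this number depends on the coding
print(mode(x))            # 1 (Christian), the modal category

# Recode the same categories as 10, 20, 30, 40: the mean changes,
# while the mode still picks out the same category
y = [10 * v for v in x]
print(round(mean(y), 3))  # 18.667: a different number for the same data
print(mode(y))            # 10, i.e. still Christian
```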
Consider data on an ordinal variable that measures a teacher's performance: Excellent (1), Good (2), Average (3), Poor (4), Very poor (5). The data for a particular teacher will look like

(1, 5, 3, 1, 1, 2, 4, 2, 1, 2, 2, 5, 3, 1, 4).

What measures of location are statistically meaningful for these data in the context of descriptive statistics? The histogram and the mode are clearly meaningful, but so is the median: the middle value when the data are arranged in ascending or descending order of magnitude. For the above data the ordered values are

(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 5, 5),

so the median is the eighth (middle) value, x_med = 2. Again, the arithmetic average (mean) in (1.3) is statistically meaningless because of the arbitrariness of the values we chose; we could equally well use the values 5, 7, 11, 17, 19. For nominal and ordinal data a number of measures of variation, like the variance

s²_x = (1/n) Σ_{k=1}^n (x_k − x̄)²,

or the standard deviation s_x,
are also statistically questionable because of the arbitrariness of the values given to the underlying variables. The same is true for the notion of covariance between two nominal/ordinal variables. For instance, if one suspects that the teacher's performance is related to their academic rank (Y): Assistant (1), Associate (2), Full professor (3), one could collect such data and evaluate the covariance between performance and rank:

c_xy = (1/n) Σ_{k=1}^n (x_k − x̄)(y_k − ȳ).

This statistic, however, will also be statistically spurious, and so will the correlation coefficient:

r_xy = [Σ_{k=1}^n (x_k − x̄)(y_k − ȳ)] / √[Σ_{k=1}^n (x_k − x̄)² · Σ_{k=1}^n (y_k − ȳ)²] = c_xy/(s_x · s_y).

In practice, researchers often abuse such data indirectly when they are used in the context of regression analysis. Estimating the regression line

y_k = β₀ + β₁x_k + u_k,  k = 1, 2, ..., n,

using the least-squares method gives rise to the estimated coefficients

β̂₁ = [Σ_{k=1}^n (x_k − x̄)(y_k − ȳ)] / [Σ_{k=1}^n (x_k − x̄)²],  β̂₀ = ȳ − β̂₁x̄,

which involve the means of X and Y, the variance of X, and their covariance. The only general rule one can state at this stage for the analysis of variables on different measurement scales is that a method appropriate for a certain scale in the hierarchy is also appropriate for the scales above it but not for those below it. There are several books which discuss methods for the analysis of so-called categorical data, i.e. data measured on the nominal or ordinal scale; see Bishop et al. (1975) and Agresti (2013), inter alia. A cursory look at the applied econometrics literature reveals that variables from very different measurement scales are often included in the same regression equation (see Chapter 7), rendering some of these results problematic. Hence:

5. The scale of measurement of each data series should be taken into account in statistical analysis to avoid meaningless and spurious inference results.
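The spuriousness of moment-based statistics for ordinal data can be demonstrated by recoding: in the sketch below, the correlation between performance and a rank variable shifts when the ordinal labels are recoded by a monotone map, even though the ordering of the data is unchanged. The performance data and the alternative coding 5, 7, 11, 17, 19 come from the example above; the rank data are hypothetical, invented purely for illustration:

```python
import math

def corr(x, y):
    """Sample correlation coefficient r_xy = c_xy/(s_x*s_y)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    cxy = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / n
    sx = math.sqrt(sum((xi - xb) ** 2 for xi in x) / n)
    sy = math.sqrt(sum((yi - yb) ** 2 for yi in y) / n)
    return cxy / (sx * sy)

# Teacher-performance data coded 1-5 (from the example above)
perf = [1, 5, 3, 1, 1, 2, 4, 2, 1, 2, 2, 5, 3, 1, 4]
# Hypothetical academic-rank data coded 1-3 (illustrative only)
rank = [1, 3, 2, 1, 2, 1, 3, 2, 1, 2, 1, 3, 2, 1, 3]

# Monotone recoding of the performance labels: 1,2,3,4,5 -> 5,7,11,17,19
recode = {1: 5, 2: 7, 3: 11, 4: 17, 5: 19}
perf2 = [recode[v] for v in perf]

# The ordering of the observations is identical, yet r changes with the coding
print(corr(perf, rank), corr(perf2, rank))
```

Since nothing in the ordinal information distinguishes the two codings, the fact that r_xy depends on which one is used shows that the statistic is an artifact of the arbitrary numbers attached to the categories.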
1.4.5 Cross-Section vs. Time Series: Is That the Question?

In relation to the traditional cross-section/time-series taxonomy, it is important to warn the reader against highly misleading claims. The conventional wisdom in econometrics is that dependence or/and heterogeneity are irrelevant for cross-section data because we know how to select "random samples" from populations of individual units such as people, households, firms, cities, states, countries, etc.; see Wooldridge (2013) inter alia. It turns out that this distinction stems from insufficient appreciation of the notion of "random sampling." When defining a random sample as a set of random variables X1, X2, ..., Xn which are IID, the ordering of X1, X2, ..., Xn, based on the index k = 1, 2, ..., n, provides the key to this definition; see Chapter 6. IID is defined relative to this ordering; without the ordering, this definition, as well as the broader notions of dependence and heterogeneity, makes little sense.

How does this render the distinction between statistical models for cross-section and time-series data superfluous? In time series there is a generic ordering (time) that suggests itself when talking about dependence and heterogeneity. The fact that in cross-section data there is no one generic ordering that suggests itself does not mean that the ordering of such samples is irrelevant. The opposite is true. Because of the diversity of the individual units, in cross-section data there is often more than one ordering of interest. For instance, in a medical study the gender or the age of individuals might be orderings of interest. In a sample of cities, geographical position and population size might be such orderings. For each of these different orderings, one can define dependence and heterogeneity in a statistically meaningful way. Despite claims to the contrary, the notions of dependence and heterogeneity are equally applicable to modeling cross-section and time-series data.

The only differences arise in the measurement scale of the relevant ordering(s). For time-series data, the time ordering is measured on an interval scale, and thus it makes sense to talk about serial correlation (a particular form of temporal dependence) and trending mean and variance (particular forms of heterogeneity). In the case of a sample of individuals used in a medical study, the gender ordering is measured on a nominal scale, and thus it makes sense to talk about heterogeneity (a shift) in the mean or the variance of male vs. female units; see the data plots in Chapter 5. In the case of a cross-section of cities or states, geographical position might be a relevant ordering, in which case one can talk about spatial heterogeneity or/and dependence.

A data set can always be represented in the form x0 := (x1, x2, ..., xn) and viewed as a finite realization of the sample X := (X1, X2, ..., Xn) of a stochastic process {Xk, k ∈ N, xk ∈ RX} (Chapter 8), where N denotes the index set and RX the range of values of x, irrespective of whether the data constitute a cross-section or a time series; their only differences might lie in the mathematical structure of N and RX. Statistical modeling and inference begin with viewing data x0 as a finite realization of an underlying stochastic process {Xk, k ∈ N, xk ∈ RX}, and the statistical model constitutes a particular parameterization of this process. A closer look at the formal notion of a random sample (IID) reveals that it presupposes a built-in ordering. Once the ordering is made explicit, both notions of dependence and heterogeneity become as relevant in cross-section as in time-series data. If anything, cross-section data are often much richer in terms of ordering structures, which is potentially more fruitful in learning from data. Moreover, the ordering of a sample renders
the underlying probabilistic assumptions, such as IID, potentially testable in practice. The claim that we know how to select a random sample from a population, and thus can take the IID assumptions as valid at face value, is misguided.

Example 1.5 (Sleep aid Ambien) A real-life example of this form of misspecification is the case of the sleep aid Ambien (zolpidem), which was approved by the US Food and Drug Administration (FDA) in 1992. After a decade on the market and more than 40 million prescriptions, it was discovered (retrospectively) that women are more susceptible to the risk of "next day impairment" because they metabolize zolpidem more slowly than men. This discovery was the result of thousands of women experiencing sleep-driving and getting involved in numerous accidents in early-morning driving. The potential problem was initially raised by Cubała et al. (2008), who recounted the probing of potential third factors such as age, ethnicity, and prenatal exposure to drugs, but questioned why gender was ignored. After a more careful re-evaluation of the original pre-approval trials data and some additional post-approval trials, the FDA issued a Safety Communication [1-10-2013] recommending lowering the dose of Ambien for women: 10 mg for men and 5 mg for women.

Example 1.6 Consider the data given in Table 1.9, which refer to the test scores (y-axis) in a multiple-choice exam on the principles of economics, reported in alphabetical order using the students' surnames (x-axis).

Table 1.9 Test scores: alphabetical order

98 71 66 62  43 74 72 93  77 83 65 84  51 75 58 68  93 70 45 76
85 76 63 62  76 56 57 65  56 84 87 84  59 80 51 59  62 53 40 60
67 70 70 76  79 66 98 57  67 100 78 65  56 75 92 73  81 69 95 66
80 73 68 77  88 81 59 81  85 87
The data in the t-plot (Figure 1.9) appear to exhibit independence and homogeneity, as seen in Figure 1.1. On the other hand, ordering the observations according to the sitting arrangement during the exam, as shown in Figure 1.10, seems to exhibit very different chance regularity patterns. The ups and downs of the latter graph are a bit more orderly than those of Figure 1.9. In particular, Figure 1.10 exhibits some sort of varying cyclical behavior that renders predicting the next observation easier. As explained in Chapter 5, this pattern of irregular cycles reveals that the data exhibit some form of positive dependence related to the sitting arrangement. In plain English, this means that there was cheating taking place during the exam by glancing at the answers of one's neighbors!

The main lesson from Examples 1.5 and 1.6 is that ordering one's data is a must, because it enables the modeler to test dependence and heterogeneity with respect to each ordering of interest. Hence:

6. Statistical models for cross-section data do admit dependence and heterogeneity assumptions that need to be tested by selecting natural orderings (often more than one) for the particular data. Statistical models should take into consideration a variety of different dimensions and features of the data.
[Fig. 1.9 Exam scores data in alphabetical order: t-plot of the scores (40–100) against alphabetical order (1–70)]

[Fig. 1.10 Exam scores data in sitting order: t-plot of the scores (40–100) against sitting arrangement order (1–70)]
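The point of Example 1.6, that dependence is defined relative to an ordering, can be illustrated numerically: the same observations can display negligible lag-one autocorrelation under one ordering and sizeable autocorrelation under another. A minimal sketch with simulated scores (the data-generating scheme and both orderings are illustrative stand-ins, not the actual exam data):

```python
import random

def lag1_autocorr(x):
    """Sample lag-one autocorrelation of the series in its given order."""
    n = len(x)
    xb = sum(x) / n
    num = sum((x[t] - xb) * (x[t - 1] - xb) for t in range(1, n))
    den = sum((v - xb) ** 2 for v in x)
    return num / den

random.seed(0)

# Simulate 70 "scores" with positive dependence along a sitting arrangement:
# each score is pulled toward its neighbour's (a crude stand-in for copying)
scores = [70.0]
for _ in range(69):
    scores.append(0.7 * scores[-1] + 0.3 * random.gauss(70, 15))

sitting_order = scores[:]   # the ordering in which the dependence was generated
alphabetical = scores[:]    # a different, unrelated ordering of the same values
random.shuffle(alphabetical)

print(lag1_autocorr(sitting_order))  # sizeable positive autocorrelation
print(lag1_autocorr(alphabetical))   # close to zero
```

The numbers are identical in the two lists; only the ordering differs, and with it the detected dependence, which is exactly what Figures 1.9 and 1.10 show graphically.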
1.4.6 Limitations of Economic Data

In relation to the limitations of economic data, we will consider two important issues: (i) what they actually measure and (ii) how accurately. Morgenstern (1963) disputed the
accuracy of published economic data and questioned the appropriateness of such data for inference purposes. In cases where the accuracy and quality of the data raise problems, the modeler should keep in mind that no statistical procedure can extract information from observed data when it is not there in the first place:

7. "Garbage in, garbage out" (GIGO): no statistical procedure or substantive information can salvage bad-quality data that do not contain the information sought.

The accuracy of economic data has improved substantially since the 1960s, and in developed countries data collected by governments and international institutions are sufficiently accurate. The need for different statistical techniques and procedures arises partly because of "what is being measured" by the available data vs. what information is being sought. The primary limitation of the available economic data arises from the fact that there is a sizeable gap between what the theoretical variables denote and what the available data measure. Economic theory, via the ceteris paribus clauses, assumes a nearly isolated system driven by the plans and intentions of optimizing agents, but the observed data are the result of an ongoing multidimensional process with numerous influencing factors beyond the control of particular agents.

In what follows we assume that the modeler has checked the observed data thoroughly and deemed them accurate and reliable enough for posing substantive questions of interest. This includes due consideration of the sample size n being large enough for the testing procedures to have adequate capacity to detect any discrepancies of interest; see Chapter 13. Hence, a crucial recommendation in empirical modeling is:

8.
Familiarize oneself thoroughly with the nature and the accuracy of the data to ensure that they do contain the information sought.
This will inform the modeler about what questions can and cannot be posed to a particular data set.
1.5 Statistical Adequacy

The crucial message from the discussion in the previous sections is that probability theory provides the mathematical foundations and the overarching framework for modeling observable stochastic phenomena of interest. The modus operandi of empirical modeling is the concept of a statistical model Mθ(x) that mediates between the data x0 and the real-world phenomenon of interest at two different levels, [A] and [B] (Figure 1.11).

[A] From a phenomenon of interest to a statistically adequate model. The statistical model Mθ(x) is chosen so that the observed data x0 constitute a truly typical realization of the stochastic process {Xt, t ∈ N} underlying Mθ(x). Validating the model assumptions requires trenchant misspecification (M-S) testing. The validity of these assumptions secures the soundness of the inductive premises of inference (Mθ(x)) and renders inference reliable in learning from data x0 about phenomena of interest. The notion of statistical adequacy is particularly crucial for empirical modeling because it can provide the basis for establishing stylized facts stemming from the data which theory needs to account for.
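The M-S testing referred to in [A] can be as simple as a runs test of the IID assumptions: under IID, the number of runs of values above and below the median has a known approximate sampling distribution. A minimal illustrative sketch (the Normal approximation and the trending example are our choices, not the text's):

```python
import math

def runs_test(x):
    """Runs test for randomness: standardized count of runs above/below
    the median. Under IID, z is approximately N(0,1); |z| > 2 suggests
    departures from the IID assumptions."""
    med = sorted(x)[len(x) // 2]
    signs = [v > med for v in x if v != med]   # drop ties with the median
    n1 = sum(signs)                            # observations above the median
    n2 = len(signs) - n1                       # observations below the median
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    mean = 2 * n1 * n2 / (n1 + n2) + 1
    var = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)) / ((n1 + n2) ** 2 * (n1 + n2 - 1))
    return (runs - mean) / math.sqrt(var)

# A trending series violates IID: far too few runs, strongly negative z
trending = [0.1 * t for t in range(50)]
print(runs_test(trending))
```

Applied to data arranged by a suspected dependence-inducing ordering (e.g. the sitting arrangement of Example 1.6), such a check makes the IID premises empirically testable rather than taken on faith.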
[Fig. 1.11 Model-based frequentist statistical induction]
[B] From the inference results to the substantive questions of interest. This nexus raises issues like statistical vs. substantive significance and how one assesses substantive information. As argued in Chapter 13, most of these issues can be addressed using the post-data severity evaluation of the accept/reject rules of testing, by establishing the discrepancy from the null warranted by data x0 and test Tα. These points of nexus with the real world are often neglected in traditional statistics textbooks, but the discussion that follows will pay special attention to the issues they raise and how they can be addressed.

Statistical inference is often viewed as the quintessential form of inductive inference: learning from a particular set of data x0 about the stochastic phenomenon that gave rise to the data. However, it is often insufficiently recognized that this inductive procedure is embedded in a deductive argument: if Mθ(x), then Q(θ; x), where Q(θ; x) denotes the inference propositions (estimation, testing, prediction, policy simulation). The step from Mθ(x) (the premise) to Q(θ; x) is deductive: estimators and tests are pronounced optimal based on purely deductive reasoning. In this sense, the reliability (soundness) of statistical inference depends crucially on the validity of the premises Mθ(x). The ninth recommendation in empirical modeling is:

9.
Choose a statistical model Mθ (x) with a view to ensuring that data x0 constitute a truly typical realization of the stochastic mechanism defined by Mθ (x).
On the basis of the premise Mθ (x) we proceed to derive statistical inference results Q(θ; x0 ) using a deductively valid argument ensuring that if the premises are valid, then the conclusions are necessarily (statistically) reliable. To secure the soundness of such results, one needs to establish the adequacy of Mθ (x) vis-à-vis the data x0 . By the same token, if Mθ (x) is misspecified then the inference results Q(θ ; x0 ) are generally unreliable. Indeed, the ampliative (going beyond the premises) dimension of statistical induction relies on the statistical adequacy of Mθ (x). The substantive questions of interest are framed in the context of Mϕ (x), which is parametrically nested within Mθ (x) via the restrictions G(θ , ϕ) = 0.
[Fig. 1.12 Statistical adequacy and inference]
When the substantive parameters ϕ are uniquely defined as functions of θ , one can proceed to derive inferential propositions pertaining to ϕ, say Q(ϕ; x). These can be used to test any substantive questions of interest, including the substantive adequacy of Mϕ (x). Hence, the tenth recommendation in empirical modeling is: 10.
No statistical inference result can be presumed trustworthy unless the statistical adequacy of the underlying model has been secured.
The initial and most crucial step in establishing statistical adequacy is a complete list of the probabilistic assumptions comprising Mθ(x). Hence, the next several chapters pay particular attention to the problem of statistical model specification. Departures from the postulated statistical model Mθ(x) are viewed as systematic information in the data that Mθ(x) does not account for, which can be detected using misspecification (M-S) testing. The statistical model needs to be respecified in order to account for such systematic information. Hence, the procedure is supplemented with the respecification stage. Figure 1.12 depicts the proposed procedure with the added stages, indicated in circular and elliptical shapes, supplementing the traditional perspective.

M-S testing raises an important issue that pertains to the sample size n. For an adequate probing of the validity of Mθ(x) one requires a "large enough" n for the M-S tests to have sufficient capacity (power) to detect any departures from these assumptions. As shown in Chapter 15, even the simplest statistical models that assume a random sample, such as the simple Normal and Bernoulli models, call for n > 40. This leads to the following recommendation in empirical modeling:

11.
If the sample size n is not large enough for a comprehensive testing of the model assumptions, then n is not large enough for inference purposes.

1.6 Statistical vs. Substantive Information∗
In an attempt to provide a more balanced view of empirical modeling and avoid any hasty indictments of the type: “the approach adopted in this book ignores the theory,” this section
will bring out briefly the proper role of substantive information in empirical modeling (see also Spanos, 1986, 1995a, 2010a). Despite the fact that the statistical model is specified after the relevant data have been chosen, this does not render either the data or the statistical model "theory-laden." In addition to the fact that the variables envisioned by the theory often differ from the available data, the chance regularities in the particular data exist independently of any substantive information a modeler might have. Indeed, in detecting the chance regularities one does not need to know what substantive variable the data measure. This is analogous to Shannon's (1948) framing of information theory: "Frequently the messages have meaning; that is, they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem" (p. 379). In direct analogy to that, "the semantic aspects of data are irrelevant to the statistical problem."

In addition, the statistical model is grounded in probabilistic assumptions aiming to account for the chance regularities in the particular data, and is related to the relevant substantive model in so far as it facilitates the posing of the substantive questions of interest after they have been reframed in statistical terms. Hence, when a statistical model is viewed as a parsimonious description of the stochastic mechanism that gave rise to the particular data, it has "a life of its own," providing the inductive premises for inferences stemming from the data; see Spanos (2006b, 2010a, c).

Observational data are often compiled by government agencies and private organizations, but such data rarely coincide with the "ideal" data needed when posing specific substantive questions of interest. Hence, a key recommendation is:

12.
Never assume that the available data measure the theoretical concepts one has in mind just because the names are very similar (or even coincide)!
A striking example is the theoretical concept of demand (intention to buy given a range of hypothetical prices) vs. data on actual quantities transacted; see Spanos (1995a). As a result of this gap, empirical modeling in practice often attempts to answer substantive questions of interest by utilizing data which contain no such information. A clear distinction between statistical and substantive information constitutes one of the basic pillars of the empirical modeling methodology advocated in this book; see also Spanos (2006c, 2010c, 2012a).

The theory influences the choice of an appropriate statistical model in two indirect ways. First, it demarcates the observable aspects of the phenomena of interest, and that determines the relevant data. Second, the theory influences the parameterization of the statistical model in so far as the latter enables one to pose substantive questions of interest in its context. Hence, the mis-specification (M-S) testing and respecification facets of empirical modeling are purely statistical procedures guided by statistical information. That is:

13.
No theory, however sophisticated, can salvage a misspecified statistical model, unless it suggests a new statistical model that turns out to be statistically adequate.
As argued in Chapter 7, the statistical and substantive perspectives provide very different but complementary viewing angles for modeling purposes; see Spanos (2007).
A statistically adequate Mθ (x) accounts for the statistical information in the data, but is often not the ultimate objective of empirical modeling. More often than not, the modeler is interested in appraising the validity of particular substantive information, such as “is there a causal connection between inflation and money in circulation?” The statistical reliability of such inferences can only be secured when the question is posed in the context of a statistically adequate model. Hence: 14.
The success of empirical modeling depends crucially on the skillful synthesizing of the statistical and substantive information, without undermining the credibility of either.

1.7 Looking Ahead
The main objective of the next seven chapters (Chapters 2–8) is to introduce the necessary probabilistic framework for relating the chance regularity patterns exhibited by data to the proper probabilistic assumptions, with a view to selecting an appropriate statistical model. The discussion in Chapters 2–4 presents a simple statistical model as a formalization of a stochastic phenomenon known as a random experiment. The interplay between chance regularity patterns and the probabilistic concepts defining a simple statistical model is brought out in Chapter 5 using a variety of graphical techniques. The primary objective of Chapter 6 is to extend the simple statistical model in directions which enable the modeler to capture certain forms of dependence. Chapter 7 continues the theme of Chapter 6, with a view to showing that the key to modeling dependence and certain forms of heterogeneity in data is the notion of conditioning, leading naturally to regression and related models. Extending the simple statistical model in directions which enable the modeler to capture several forms of dependence and heterogeneity is completed in Chapter 8.

Additional references: Spanos (1989a, 1990a, 2006a, 2010b, 2014b, 2015), Granger (1990), Hendry (2000, 2009), Mayo and Spanos (2010).
Important Concepts
Substantive information, statistical information, stochastic phenomena, chance regularity patterns, deterministic regularity, distribution regularity, dependence regularity, heterogeneity regularity, statistical adequacy, measurement scales, time-series data, cross-section data, panel data, ratio scale, interval scale, ordinal scale, nominal scale.

Crucial Distinctions
Statistical vs. substantive subject matter information/model, chance vs. deterministic regularity patterns, statistical modeling vs. statistical inference, curve-fitting vs. statistical modeling, statistical vs. substantive adequacy, chance regularity patterns vs. probabilistic assumptions, relative frequencies vs. probabilities, induction vs. deduction, time-series vs. cross-section data, variables in substantive models vs. observed data, theoretical concepts vs. data.
Essential Ideas
● The primary aim of empirical modeling is to learn from data about phenomena of interest by blending substantive subject matter and statistical information (chance regularity patterns).
● A statistical model comprises a set of internally consistent probabilistic assumptions that defines a stochastic generating mechanism. These assumptions are chosen to account for the chance regularities exhibited in the data.
● The traditional metaphor of viewing data as a "sample from a population" is only appropriate for real-world data that exhibit IID patterns. Hence, the notion of a "population" is replaced with the concept of a stochastic generating mechanism.
● Chance regularities and the probabilistic assumptions aiming to account for such regularities can be classified into three broad categories: distribution, dependence, and heterogeneity.
● Graphical techniques provide indispensable tools for empirical modeling because they can be used to bring out the chance regularities exhibited by data.
● Time-series and cross-section data differ only with respect to their ordering of interest. Time, an interval scale variable, is the natural ordering for the former, but cross-section data often have several such orderings of interest, whose potential orderings span all four categories of scaling.
● Claims that one does not have to worry about dependence and heterogeneity when modeling cross-section data are highly misleading and misguided.
● Establishing the statistical adequacy of an estimated model is the most crucial step in securing the trustworthiness of the evidence stemming from the data.
● If the sample size is not large enough for properly testing the statistical model assumptions, then it is not large enough for inference purposes.
● Assuming that a data series quantifies the variable used in a substantive model just because the names coincide, or are very similar, is not a good strategy.
1.8 Questions and Exercises
1. What determines which phenomena are amenable to empirical modeling?
2. (a) Explain intuitively why statistical information, in the form of chance regularity patterns, is different from substantive subject-matter information.
   (b) Explain how these two types of information can be separated, ab initio, by viewing the statistical model as a probabilistic construct specifying the stochastic mechanism that gave rise to the particular data.
   (c) The perspective in (b) ensures that the data and the statistical model are not "theory-laden." Discuss.
3. Compare and contrast the notions of chance vs. deterministic regularities.
4. Explain why the slogan "All models are wrong, but some are useful" conflates two different types of being wrong, using the distinction between statistical and substantive inadequacy.
5. In relation to the experiment of casting two dice (Table 1.3), evaluate the probability of the events A – the sum of the two dice is greater than 9 – and B – the difference of the two dice is less than 3.
6. Discuss the connection between observed frequencies and the probabilistic reasoning that accounts for those frequencies.
7. In relation to the experiment of casting two dice, explain why focusing on (i) adding up the two faces and (ii) odds and evens constitutes two different probability models stemming from the same experiment.
8. Explain the connection between a histogram and the corresponding probability distribution using de Méré's paradox.
9. Give four examples of variables measured on each of the different scales, beyond those given in the discussion above.
10. (a) Compare the different scales of measurement.
    (b) Why do we care about measurement scales in empirical modeling?
11. Beyond the measurement scales, what features of the observed data are of interest from the empirical modeling viewpoint?
12. (a) In the context of descriptive statistics, explain briefly the following concepts: (i) mean, (ii) median, (iii) mode, (iv) variance, (v) standard deviation, (vi) covariance, (vii) correlation coefficient, (viii) regression coefficient.
    (b) Explain which of the concepts (i)–(viii) make statistical sense when the data in question are measured on different scales: nominal, ordinal, interval, and ratio.
13. Compare and contrast time-series, cross-section, and panel data as they relate to heterogeneity and dependence.
14. Explain how the different features of observed data can be formalized in the context of expressing a data series in the form {x_k, x_k ∈ R_X, k ∈ N}.
15. Explain briefly the connection between chance regularity patterns and probability theory concepts.
16. Explain the connection between chance regularities and statistical models.
17. Explain the notion of statistical adequacy and discuss its importance for statistical inference.
18.
Under what circumstances can the modeler claim that the observed data constitute unprejudiced evidence in assessing the empirical adequacy of a theory?
19. "Statistical inference is a hybrid of a deductive and an inductive procedure." Explain and discuss.
20. Discuss the claim: "If the sample size is not large enough for validating the model assumptions, then it is not large enough for reliable inference."
2 Probability Theory as a Modeling Framework
2.1 Introduction
2.1.1 Primary Objective

The primary objective of Chapters 2–8 is to introduce probability theory as a mathematical framework for modeling observable stochastic phenomena (Chapter 1). Center stage in this modeling framework is occupied by the concept of a statistical model, denoted by Mθ(x), that provides the cornerstone of a model-based inductive process underlying empirical modeling.
2.1.2 Descriptive vs. Inferential Statistics

The first question we need to consider before the long journey to explore the theory of probability is: Why do we need probability theory? The brief answer is that it frames both the foundation and the relevant inference procedures for empirical modeling. What distinguishes statistical inference proper from descriptive statistics is the fact that the former is grounded in probability theory.

In descriptive statistics one aims to summarize and bring out the important features of a particular data set in a readily comprehensible form. This usually involves the presentation of the data in tables, graphs, charts, and histograms, as well as the computation of summary "statistics," such as measures of central tendency and dispersion. Descriptive statistics, however, has one very crucial limitation: conclusions from the data description cannot be extended beyond the data in hand.

A serious problem during the early twentieth century was that statisticians would use descriptive summaries of the data, and then proceed to claim generality for their inferences beyond the data in hand. The conventional wisdom at the time is summarized by Mills (1924), who distinguishes between "statistical description" and "statistical induction," where the former is always valid and "may be used with perfect confidence, as accurate descriptions of the given characteristics"
(p. 549), but the latter is only valid when the inherent assumptions of (a) "uniformity" for the population and (b) the "representativeness" of the sample (pp. 550–552) are appropriate for the particular data.

The fine line between statistical description and statistical induction was blurred until the 1920s, and as a result there was (and, unfortunately, still is) a widespread belief that statistical description does not require any assumptions because "it's just a summary of the data." The reality is that there are appropriate and inappropriate (misleading) summaries.

Example 2.1  Consider a particular data set x0 := (x1, x2, ..., xn) whose descriptive statistics for the mean and variance yield the following values:

x̄ = (1/n)∑_{k=1}^n x_k = 12.1 and s²_x = (1/n)∑_{k=1}^n (x_k − x̄)² = 34.21.    (2.1)
There is no empirical justification to conclude from (2.1) that these numbers are typical of the broader population from which x0 was observed, and thus representative of the "population" mean and variance (E(X), Var(X)); such an inference is unwarranted. This is because such inferences presuppose that x0 satisfies certain probabilistic assumptions that render (x̄, s²_x) appropriate estimators (appraisers) of (E(X), Var(X)), but these assumptions need to be empirically validated before such inference becomes warranted. In the case of the formulae behind x̄ = 12.1 and s²_x = 34.21, the assumptions needed are IID (Chapter 1). Looking at the t-plot of x0 in Figure 2.1, it is clear that the ID assumption is invalid because the arithmetic average of x0 is increasing with t (the index). This renders the formulae in (2.1) completely inappropriate for estimating (E(X), Var(X)), whose true values are

E(X_t) = 2 + 0.2t, Var(X_t) = 1,    (2.2)

where t = 1, 2, ..., n is the index; these values are known because the data were created by simulation. The summary statistics in (2.1) have nothing to do with the true values in (2.2), because the chance regularities exhibited by the data in Figure 2.1 indicate clearly that the mean is changing with t, and the evaluation of the variance using s²_x is erroneous when the deviations are evaluated from a fixed x̄.
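The point of Example 2.1 is easy to reproduce by simulation. The sketch below is an illustration, not the book's actual code: the random seed is arbitrary and n = 100 is an assumption taken from the index range of Figures 2.1 and 2.2. It generates a trending series of the form in (2.2) alongside an NIID series, and applies the descriptive formulae of (2.1) blindly to both:

```python
import numpy as np

rng = np.random.default_rng(2019)
n = 100
t = np.arange(1, n + 1)

# Trending series: the mean changes with the index t, so the ID assumption fails
x = (2 + 0.2 * t) + rng.normal(0, 1, size=n)
# NIID series: constant mean 2 and variance 1
y = 2 + rng.normal(0, 1, size=n)

# The descriptive formulae of (2.1), applied blindly to both series
print(x.mean(), x.var())   # mean near 12.1; variance badly inflated by the trend
print(y.mean(), y.var())   # close to the true values E(Y) = 2, Var(Y) = 1
```

The trend inflates s²_x far above the true Var(X_t) = 1, while the same formulae give reliable summaries for the NIID series, which is exactly the contrast between Figures 2.1 and 2.2.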
[Fig. 2.1: t-plot of data x0. Fig. 2.2: typical realization of NIID data.]
On the other hand, if the data in x0 looked like the data y0 := (y1, y2, ..., yn) shown in Figure 2.2, the formulae in (2.1) would have given reliable summary statistics:

ȳ = (1/n)∑_{k=1}^n y_k = 2.01 and s²_y = (1/n)∑_{k=1}^n (y_k − ȳ)² = 1.02,
since the true values are E(Y) = 2, Var(Y) = 1.

The lesson from this example is that there is no such thing as a summary statistic that invokes no probabilistic assumptions. There are reliable and unreliable descriptive statistics, depending on the validity of the probabilistic assumptions implicitly invoked. Indeed, the crucial change pioneered by Fisher (1922a) in recasting descriptive into modern statistics was to bring out these implicit presuppositions in the form of a statistical model and render them testable.

In this sense, statistical inference proper views data x0 through the prism of a prespecified statistical model Mθ(x). That is, the data x0 are viewed as a typical realization of the stochastic mechanism specified by Mθ(x). The presumption is that Mθ(x) could have generated data x0. This presumption can be validated vis-à-vis x0 by testing the probabilistic assumptions comprising Mθ(x).

In contrast to descriptive statistics, the primary objective of statistical modeling and inference proper is to model (represent in terms of a probabilistic framing) the stochastic mechanism that gave rise to the particular data, and not to describe the particular data. This provides a built-in inductive argument which enables one to draw inferences and establish generalizations and claims about the mechanism itself, including observations beyond the particular data set. This is known as the ampliative dimension of inductive inference: reasoning whose conclusions go beyond what is contained in the premises.

A bird's-eye view of the chapter. In Section 2.2 we introduce the notion of a simple statistical model at an informal and intuitive level. Section 2.3 introduces the reader to probability theory from the viewpoint of statistical modeling.
In Section 2.4 we sidestep the axiomatic approach to probability in an attempt to motivate the required mathematical concepts by formalizing a simple generic stochastic phenomenon we call a random experiment (RE) defined by three conditions in plain English. In Sections 2.5 and 2.6 we proceed to formalize the first two of these conditions in the form of (i) the outcomes set, (ii) the event space, and (iii) the probability set function, together with the associated Kolmogorov axioms. Section 2.7 discusses the notion of conditioning needed to formalize the third condition in Section 2.8.
2.2 Simple Statistical Model: A Preliminary View
As mentioned above, the notion of a statistical model takes center stage in the mathematical framework for modeling stochastic phenomena. In this section we attempt an informal discussion of the concept of a simple statistical model at an intuitive level with a healthy dose of hand-waving. The main objective of this preliminary discussion is twofold. First, for the less mathematically inclined reader, the discussion, although incomplete, will provide an adequate description of the primary concept of statistical modeling. Second, this preliminary discussion will help the reader keep an eye on the forest, and not get distracted by the trees,
as the formal argument unfolds in Sections 2.3–2.8. The formalization of the notion of a generic random experiment will be completed in Chapter 4.
2.2.1 The Basic Structure of a Simple Statistical Model

The simple statistical model, pioneered by Fisher (1922a), has two components:

[i] Probability model = {f(x; θ), θ ∈ Θ, x ∈ R_X};
[ii] Sampling model: X := (X1, X2, ..., Xn) is a random sample.

The probability model specifies a family of densities (f(x; θ), θ ∈ Θ), defined over the range of values (R_X) of the random variable X, one density function for each value of the parameter θ, as the latter varies over its range of values Θ: the parameter space (hence the term parametric statistical model).

Example 2.2  The best way to visualize a probability model is in terms of Figure 2.3. This diagram represents several members of a particular family of densities known as the two-parameter gamma family, which takes the explicit form

f(x; θ) = (β^(−1)/Γ[α]) (x/β)^(α−1) exp(−x/β), θ := (α, β) ∈ R₊², x ∈ R₊,    (2.3)

where Γ[α] denotes the gamma function Γ[α] = ∫₀^∞ exp(−u)·u^(α−1) du.
N O T E : The particular formula is of no intrinsic interest at this stage. What is important for the discussion in this section is to use this example in order to get some idea as to what lies behind the various symbols used in the generic case. For instance, the parameter space Θ and the range of values R_X of the random variable X are both the positive real line R₊ := (0, ∞), i.e. Θ := R₊ and R_X := R₊. Each curve in Figure 2.3 represents the graph of one density function (varying over a subset of the range of values of the random variable X:
[Fig. 2.3: The gamma probability model; density curves for α = 1 and β = 1, 2, 3, 5, 8.]
(0, 14] ⊂ R+ ) for a specific value of the parameter θ . In Figure 2.3 we can see five such curves for α = 1 and β = 1, 2, 3, 5, 8, the latter being a small subset of the parameter space R+ . In other words, the graphs of the density functions shown in Figure 2.3 represent a small subset of the set of densities in (2.3). Figure 2.3 illustrates the notion of a probability model by helping one visualize the family of densities indexed by the parameter θ . Let us now briefly discuss the various concepts invoked in the above illustration.
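The curves of Figure 2.3 can be traced by evaluating (2.3) directly. The sketch below is a minimal illustration (the evaluation grid over (0, 14] and the function name are assumptions, not from the text); it codes the gamma density and evaluates one member of the family for each of the five values of β:

```python
import math
import numpy as np

def gamma_pdf(x, alpha, beta):
    # The two-parameter gamma density of (2.3): (β^-1/Γ[α]) (x/β)^(α-1) exp(-x/β)
    return (x / beta) ** (alpha - 1) * np.exp(-x / beta) / (beta * math.gamma(alpha))

x = np.linspace(0.01, 14, 500)
curves = {beta: gamma_pdf(x, 1.0, beta)   # one density per point θ = (1, β) of the parameter space
          for beta in (1, 2, 3, 5, 8)}    # the five curves of Figure 2.3
```

For α = 1 the density reduces to the exponential (1/β)exp(−x/β), which is why each curve in Figure 2.3 starts near f(0) = 1/β and decays monotonically.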
2.2.2 The Notion of a Random Variable: A Naive View

The notion of a random variable constitutes one of the most important concepts in the theory of probability. For a proper understanding of the concept, the reader is required to read through to Chapter 3. In order to come to grips with the notion at an intuitive level, however, let us consider the naive view first introduced by Chebyshev (1821–1894) in the middle of the nineteenth century, who defined a random variable as: a real variable that assumes different values with different probabilities. This definition comes close to the spirit of the modern concept, but it leaves a lot to be desired from the mathematical viewpoint. As shown in Chapter 3, a random variable is a function from a set of outcomes to the real line: attaching numbers to outcomes! The need to define such a function arises because the outcomes of certain stochastic phenomena do not always come in the form of numbers, but the data often do. The naive view of a random variable suppresses the set of outcomes and identifies the notion of a random variable with its range of values R_X; hence the term variable.

Example 2.3  In the case of the experiment of casting two dice and looking at the uppermost faces, discussed in Chapter 1, the outcomes come in the form of combinations of die faces (not numbers!), all 36 such combinations, denoted by, say, {s1, s2, ..., s36}. Let us assume that we are interested in the sum of dots appearing on the two faces. This amounts to defining a random variable X(.): {s1, s2, ..., s36} → R_X := {2, 3, ..., 12}. However, this is not the only random variable we could have defined. Another one might be Y(.): {s1, s2, ..., s36} → {0, 1}, if we want to define the outcomes even (Y = 0) and odd (Y = 1). This example suggests that ignoring the outcomes set and identifying the random variable with its range of values can be misleading.
Be that as it may, let us take this interpretation at face value and proceed to consider the other important dimension of the naive view of a random variable: its randomness. The simplest way to explain this dimension is to return to the above example.

Example 2.4  In the case of the experiment of casting two dice and adding the dots of the uppermost faces, we defined two random variables, which the naive view identifies with their respective ranges of values: X with {2, 3, ..., 12} and Y with {0, 1}.
In the case of the random variable X, the association of its values with the probabilities (density function) (Chapter 1, Table 1.4) takes the form

x    | 2    3    4    5    6    7    8    9    10   11   12
f(x) | 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36    (2.4)

Similarly, the density function of the random variable Y is

y    | 0    1
f(y) | 1/2  1/2    (2.5)

More generally, the density function is defined by

P(X = x) = f(x), for all x ∈ R_X,    (2.6)

and satisfies the properties (a) f(x) ≥ 0, for all x ∈ R_X; (b) ∑_{x_i ∈ R_X} f(x_i) = 1.
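The densities (2.4) and (2.5) can be reproduced by brute-force enumeration of the outcomes set. The sketch below is an illustration only; coding Y as the parity of the sum is an assumption consistent with the even/odd outcomes of Example 2.3:

```python
from collections import Counter
from fractions import Fraction

# The outcomes set: all 36 equally likely face combinations
outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]
p = Fraction(1, 36)               # each outcome carries probability 1/36

# X(s) = sum of the two faces;  Y(s) = 0 if the sum is even, 1 if odd
f_x = {x: c * p for x, c in sorted(Counter(i + j for i, j in outcomes).items())}
f_y = {y: c * p for y, c in sorted(Counter((i + j) % 2 for i, j in outcomes).items())}

print(f_x)   # reproduces (2.4): 1/36, 2/36, ..., 6/36, ..., 1/36
print(f_y)   # reproduces (2.5): each of the two values gets probability 1/2
```

Note that both densities add up to one over the range of values of the random variable, as property (b) requires.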
The last property just says that adding up the probabilities for all values of the random variable gives one. The density function can be visualized as distributing a unit of mass (probability) over the range of values of X.

2.2.2.1 Continuous Random Variables

The above example involves two random variables which comply perfectly with Chebyshev's naive definition: with each value of the random variable we associate a probability. This is because both random variables are discrete: their range of values is countable. On the other hand, when a random variable takes values over an interval, i.e. its range of values is uncountable, things are not as simple. Attaching probabilities to particular values does not work (see Chapter 3), and instead we associate probabilities with small intervals which belong to this range of values. Instead of (2.6), the density function for continuous random variables is defined over intervals as follows:

P(x ≤ X < x + dx) = f(x)dx, for all x ∈ R_X,

and satisfies the properties (a) f(x) ≥ 0, for all x ∈ R_X; (b) ∫_{x ∈ R_X} f(x)dx = 1.
It is important to note that the density function for continuous random variables takes values in the interval [0, ∞); its values cannot be interpreted as probabilities. In contrast, the density function for discrete random variables takes values in the interval [0, 1].
2.2.3 Density Functions

The densities of the random variables X and Y associated with the casting of the two dice experiment, introduced above, involve no unknown parameters because the probabilities are known. This is the result of implicitly assuming that the dice are symmetric and
each side arises with the same probability. In the case where it is known that the dice are loaded, the above densities will change in the sense that they will now involve some unknown parameters.

Example 2.5  In the case of two outcomes {0, 1}, assuming that P(Y = 1) = θ (an unknown parameter), 0 ≤ θ ≤ 1, the density function of Y now takes the form

y       | 0     1
f(y; θ) | 1−θ   θ    (2.7)

This can be expressed in the more compact form

f(y; θ) = θ^y (1 − θ)^(1−y), θ ∈ [0, 1], y = 0, 1,

known as the Bernoulli density, with Θ := [0, 1] and R_Y := {0, 1}.

Example 2.6  The notion of a parametric distribution (density) goes back to the eighteenth century, with Bernoulli proposing the binomial distribution with density function
f(x; θ) = C(n, x) θ^x (1 − θ)^(n−x), θ ∈ [0, 1], x = 0, 1, ..., n, n = 1, 2, ...,

where C(n, x) = n!/((n − x)! x!) and n! = n·(n − 1)·(n − 2) · · · (3)·(2)·(1).

Example 2.7  In the early nineteenth century, de Moivre and Laplace introduced the Normal distribution, whose density is

f(x; θ) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)), θ := (μ, σ²) ∈ R × R₊, x ∈ R.
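The three parametric densities just introduced are plain transcriptions of their formulas, which the following sketch codes directly (function names are illustrative, not from the text); it is useful for checking that each density sums or integrates to one:

```python
import math

def bernoulli(y, theta):
    # f(y; θ) = θ^y (1 - θ)^(1-y), y ∈ {0, 1}, θ ∈ [0, 1]
    return theta**y * (1 - theta)**(1 - y)

def binomial(x, n, theta):
    # f(x; θ) = C(n, x) θ^x (1 - θ)^(n-x), x = 0, 1, ..., n
    return math.comb(n, x) * theta**x * (1 - theta)**(n - x)

def normal(x, mu, sigma2):
    # f(x; θ) = (1/(σ√(2π))) exp(-(x - μ)²/(2σ²)), θ := (μ, σ²)
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
```

For instance, sum(binomial(x, 10, 0.3) for x in range(11)) equals 1 up to rounding, and normal(0, 0, 1) ≈ 0.3989, the peak of the standard Normal curve.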
The real interest in parametric densities for modeling purposes began with Pearson (1895), who proposed a family of distributions known today as the Pearson family, which includes the Normal, Student's t, Laplace, Pareto, gamma, and beta, as well as discrete distributions such as the binomial, negative binomial, hypergeometric, and Poisson (see Appendix 3.A at the end of Chapter 3).

2.2.3.1 The Parameter(s) θ

As can be seen in Figure 2.3, the parameters θ are related to distinctive features of the density function, such as the shape and the location. The values of the parameters θ change over their range of values Θ, the parameter space. Hence the notion of a parametric family of densities indexed by θ ∈ Θ. The notion of a simple statistical model and its first component, a parametric family of densities, will be discussed at length in Chapter 3, and thus no further discussion will be given in this section; see Appendix 3.A for a more complete list of parametric densities.
2.2.4 A Random Sample: A Preliminary View

2.2.4.1 A Statistical Model with a Random Sample

What makes the generic statistical model specified in Section 2.2 simple is the form of the sampling model: the random sample assumption. This assumption involves two interrelated
notions known as independence and identical distribution. These notions can be explained intuitively as a prelude to the more formal discussion that follows.

Independence. The random variables (X1, X2, ..., Xn) are said to be independent if the occurrence of any one, say Xi, does not influence and is not influenced by the occurrence of any other random variable in the set, say Xj, for i ≠ j, i, j = 1, 2, ..., n.

Identical distribution. The independent random variables (X1, X2, ..., Xn) are said to be identically distributed if their density functions are identical in the sense that f(x1; θ) = f(x2; θ) = · · · = f(xn; θ).

For observational data the validity of the IID assumptions can often be assessed using a battery of graphical techniques, as discussed in Chapters 5 and 6.

2.2.4.2 Experimental Data: Sampling and Counting Techniques

Sampling refers to a procedure to select a number of objects (balls, cards, persons), say r, from a larger set we call the target "population," with n (n ≥ r) such objects. The sampling procedure gives rise to a random sample (IID) when:

(i) the probability of selecting any one of the population objects is the same; and
(ii) the selection of the ith object does not affect and is not affected by the selection of the jth object, for all i ≠ j, i, j = 1, 2, ..., n.
Two features of the selection procedure matter: whether we replace an object after it has been selected or not, and whether the order of the selected objects matters or not. This gives rise to the four-way classification in Table 2.1, for which the assignment of the common probability of an object being selected differs.

Table 2.1  Sampling procedure probabilities

O\R          | replacement (R)  | no replacement (R̄)
order (O)    | 1/n^r            | 1/P(n,r), where P(n,r) = n!/(n−r)!
no order (Ō) | 1/C(n+r−1, r)    | 1/C(n,r), where C(n,r) = n!/(r!(n−r)!)
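The four counts underlying Table 2.1 correspond exactly to four combinatoric iterators in Python's standard library, which makes the table easy to verify by enumeration (the choice n = 5, r = 3 below is arbitrary):

```python
from itertools import (combinations, combinations_with_replacement,
                       permutations, product)
from math import comb, perm

n, r = 5, 3
pop = range(n)

# order, replacement: n^r possible selections
assert len(list(product(pop, repeat=r))) == n**r
# order, no replacement: P(n, r) = n!/(n-r)!
assert len(list(permutations(pop, r))) == perm(n, r)
# no order, replacement: C(n+r-1, r)
assert len(list(combinations_with_replacement(pop, r))) == comb(n + r - 1, r)
# no order, no replacement: C(n, r) = n!/(r!(n-r)!)
assert len(list(combinations(pop, r))) == comb(n, r)
```

Under equal selection probabilities, each selection in a given scheme receives one over the corresponding count, which is exactly the entry in Table 2.1.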
To shed light on the formulae in Table 2.1, let us state a key counting rule.

Multiplication counting rule. Consider the sets S1, S2, ..., Sk with n1, n2, ..., nk elements, respectively. Then the number of ways one can choose k elements, one from each of these sets, is n1 × n2 × · · · × nk.

Example 2.8  The number of ways one can choose r elements, in order and with replacement, from a set of n elements is n^r.
Combinations. An unordered subset of r elements from a set S containing n elements (0 < r ≤ n) [...]

[...] ρ > 0, x0 ∈ R}.
2.6 Formalizing Condition [b]: Events and Probabilities
It can be shown that Bρ(R) = B(R); see Shiryayev (1996, p. 144). Given that the real line R has an uncountably infinite number of elements, the question which naturally arises is: How do we define the Borel field B(R)? As shown above, the most effective way to define a σ-field over an infinite set is to define it via the elements that can generate this set. In the case of the real line, a number of different intervals, such as (a, ∞), (a, b], (a, b), (−∞, b), can be used to generate the Borel field. However, it turns out that the half-infinite interval (−∞, x] is particularly convenient for this purpose. The Borel field generated by B_x = {(−∞, x]: x ∈ R} includes all subsets we encounter in practice, including {a}, (−∞, a), (−∞, a], (a, ∞), [a, ∞), [a, b], (a, b], [a, b), (a, b), for any real numbers a < b. For instance:

(a, b] = (−∞, b] − (−∞, a], b > a ⇒ (a, b] ∈ B(R);
{a} = ∩_{n=1}^∞ (a − 1/n, a] ⇒ {a} ∈ B(R);
(a, b) = ∪_{n=1}^∞ (a, b − 1/n] ⇒ (a, b) ∈ B(R);
[a, b] = ∩_{n=1}^∞ (a − 1/n, b] ⇒ [a, b] ∈ B(R).
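The identity {a} = ∩_{n=1}^∞ (a − 1/n, a] can be illustrated numerically: a point survives every interval (a − 1/n, a] only if it equals a, while any point strictly below a drops out once 1/n shrinks below a − p. A small sketch (truncating the countable intersection at a finite n is an assumption of the illustration):

```python
def in_halfopen(p, a, n):
    # membership of p in the half-open interval (a - 1/n, a]
    return a - 1 / n < p <= a

a = 2.0
# a itself survives every truncation of the countable intersection
assert all(in_halfopen(a, a, n) for n in range(1, 10_000))
# a point strictly below a is eventually excluded (here once 1/n < 10^-3)
assert not all(in_halfopen(a - 1e-3, a, n) for n in range(1, 10_000))
```

The same shrinking-interval reasoning is what places singletons, open intervals, and closed intervals inside the σ-field generated by the half-infinite intervals (−∞, x].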
It is important to note that the Borel field B(R) includes just about all subsets of the real line R, but not quite all! That is, there are subsets of R which belong to the power set but not to B(R), i.e. B(R) ⊂ P(R) but B(R) ≠ P(R). At this stage, it is crucial to collect the terminology introduced so far (Table 2.8), to bring out the connection between the set-theoretic and the probabilistic terms.

Table 2.8  Set-theoretic vs. probabilistic terminology

Set-theoretic                  | Probabilistic
universal set S                | sure event S
empty set ∅                    | impossible event ∅
B is a subset of A: B ⊂ A      | when event B occurs, event A occurs
set A ∩ B                      | events A and B occur at the same time
set A ∪ B                      | events A or B occur
set Ā := S − A                 | event A does not occur
disjoint sets: A ∩ B = ∅       | mutually exclusive events A, B
subset of S                    | event
element of S                   | elementary outcome
field                          | event space
σ-field                        | countably infinite event space
The formalization so far. Summarizing in symbols the argument so far: E := ([a], [b], [c]) → {[a] ⇒ S, [b] ⇒ (ℑ, ?), [c] ⇒ ?}. In the next section we formalize the notion of probability, and proceed to show how we attach probabilities to the elements of an event space ℑ.
2.6.4 A Digression: What is a Function?

Before we proceed to complete the second component in formalizing condition [b] defining a random experiment, we need to take a digression in order to define the concept of a function, because the types of functions we will need in this and the next chapter go beyond the usual point-to-point numerical functions. The naive notion of a function as a formula enabling f(x) to be calculated in terms of x is embarrassingly inadequate for our purposes.

It is no exaggeration to claim that the notion of a function is perhaps the most important concept in mathematics. However, the concept of a function has caused problems in several areas of mathematics since the time of Euclid, because it has changed numerous times. The definitions adopted at different times during the eighteenth and nineteenth centuries ranged from "a closed (finite analytical) expression" to "every quantity whose value depends on one or several others" (see Kline, 1972 for a fascinating discussion). The problems caused by the absence of a precise notion of a function were particularly acute during the nineteenth century, when several attempts were made by famous mathematicians, such as Cauchy, Riemann, and Weierstrass, to provide more rigorous foundations for calculus; see Gray (2015). One can go as far as to claim that the requirements of "analysis" forced mathematicians to invent more and more general categories of functions, which were instrumental in the development of many areas of modern mathematics, such as set theory, the modern theory of integration, and the theory of topological spaces. In turn, the axiomatization of set theory provided the first general and precise definition of a function in the early twentieth century.

Intuitively, a function is a special type of "marriage" between two sets. A function f(.): A → B is a relation between sets A and B satisfying the restriction that for each x ∈ A there exists a unique element y ∈ B such that (x, y) ∈ f.
The sets A and B are said to be the domain and the co-domain of the function f (.). Intuitively, a relation R connects elements of A to elements of B to define pairs (x, y), where x∈A and y∈B, and is denoted by xRy or (x, y)∈R. To distinguish between the elements of the input set A and the output set B, we treat (x, y) as an ordered pair; to indicate the order we use the set-theoretic notation (x, y) := {{x}, {x, y}}. Formally, a relation R is defined to be any subset of the Cartesian product (A×B) – the set of all ordered pairs (x, y), where x∈A and y∈B. That is, a function is a special kind of relation (x, y)∈f such that:
(i) every element x of the domain A has an image y in B;
(ii) for each x∈A, there exists a unique element y∈B, i.e. if y1 = f (x) and y2 = f (x), then y1 = y2.
Looking at Figure 2.6 brings out three important features of a function:
Fig. 2.6 Defining a function f (.): A → B
(i) The underlying intuition is that two arrows from two different elements in A can connect to the same element in B, but no two arrows can emanate from the same element in A.
(ii) The nature of the two sets and their elements is arbitrary; a function does not have to be a formula relating numbers to numbers.
(iii) The uniqueness restriction concerns the elements of the co-domain which are paired with elements of the domain (BA := f (A) ⊂ B, and BA is called the range of f ); see Figure 2.6.
NOTE: In the case where BA := f (A) = B, the function is called surjective (onto). Also, in the case where, for each y∈BA, there corresponds a unique x∈A, the function is said to be injective (one-to-one). If the function is both one-to-one and onto, it is called a bijection. Example 2.38 Let the domain and co-domain of a numerical relation be A = {1, 2, 3, 4} and B = {2, 3, 5, 7, 11, 13}, respectively. The set of ordered pairs f = {(1, 13), (2, 11), (3, 7), (4, 5)} constitutes a function with range Rf = {5, 7, 11, 13}. In contrast, the set of ordered pairs h = {(1, 13), (2, 11), (2, 3), (3, 7), (4, 5)} does not constitute a function, because an element of A is paired with two different numbers in B; the number 2 in A is paired with both 3 and 11 in B.
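The defining restriction can be checked mechanically. A minimal sketch in Python (the helper name `is_function` is ours, not the book's):

```python
def is_function(pairs, domain):
    """Check whether a set of ordered pairs (x, y) defines a function
    on the given domain: every x must be paired with exactly one y."""
    images = {}
    for x, y in pairs:
        if x in images and images[x] != y:
            return False               # x paired with two different images
        images[x] = y
    return set(images) == set(domain)  # every element of the domain has an image

A = {1, 2, 3, 4}
f = {(1, 13), (2, 11), (3, 7), (4, 5)}           # Example 2.38: a function
h = {(1, 13), (2, 11), (2, 3), (3, 7), (4, 5)}   # not a function: 2 has two images

print(is_function(f, A))  # True
print(is_function(h, A))  # False
```

Note that the check says nothing about the co-domain B beyond membership; surjectivity and injectivity would require separate tests.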
2.6.5 The Mathematical Notion of Probability The next step in formalizing condition [b] of a random experiment (E ) is to assign probabilities to the events of interest, as specified by the event space. Example 2.39 In Example 2.34, with S3 and ℱ3 as defined in (2.8) and (2.9), common sense suggests that the following assignment of probabilities seems appropriate: P(A1) = 1/8, P(A2) = 1/8, P(Ā1) = 7/8, P(Ā2) = 7/8, P(A1∪A2) = 1/4, P(Ā1∩Ā2) = 3/4. In calculating the above probabilities we assumed that the coin is fair and used common sense to argue that for an event such as A1 ∪ A2 we find its probability by adding those of A1 and A2 together, since the two are mutually exclusive. In mathematics, however, we cannot
rely exclusively on such things as common sense and intuition when framing a mathematical setup. We need to formalize the common-sense arguments by giving a mathematical definition for P(.). 2.6.5.1 Probability Set Function The major breakthrough that led to the axiomatization of probability theory in 1933 by Kolmogorov was the realization that P(.) is a special type of measure in the newly developed advanced integration theory called measure theory. The theory of probability, as a mathematical discipline, can and should be developed from axioms in exactly the same way as Geometry and Algebra. This means that after we have defined the elements to be studied and their basic relations, and have stated the axioms by which these relations are to be governed, all further exposition must be based exclusively on these axioms, independent of the usual concrete meaning of these elements and their relations. (Kolmogorov, 1933a, p. 1)
The idea behind the axiomatization of any field is to specify the fewest independent (not derivable from the others) axioms that specify a formal system which is complete (every statement that involves probabilities can be shown to be true or false within the formal system) and consistent (no contradictions arise within the system). The main objective is for the axioms to be used in conjunction with deductive logic to derive theorems that unpack the information contained in the axioms. P(.) is defined as a function from an event space to the real numbers between 0 and 1 which satisfies certain axioms; that is, the domain of the function P(.) is a set ℱ of subsets of S. To be more precise, P(.): ℱ → [0, 1] is said to be a probability set function if it satisfies the axioms in Table 2.9. Looking at the axioms in Table 2.9, it is apparent that the concept of an "event" is as fundamental in the axiomatic framing of probability theory as the concept of a straight line is for Euclidean geometry. Moreover, relations among events pertain exclusively to their occurrence. Axioms [A1] and [A2] are self-evident, but [A3] requires some explanation, because it is not self-evident and it largely determines the mathematical structure of the probability set function. The countable additivity axiom provides a way to attach probabilities to events by utilizing mutually exclusive events.
Table 2.9 Kolmogorov axioms of probability
[A1] P(S) = 1, for any outcomes set S;
[A2] P(A) ≥ 0, for any event A∈ℱ;
[A3] Countable additivity: for a countable sequence of mutually exclusive events Ai∈ℱ, i = 1, 2, ..., n, ..., i.e. Ai ∩ Aj = ∅ for all i ≠ j, i, j = 1, 2, ..., n, ..., we have P(∪_{i=1}^∞ Ai) = ∑_{i=1}^∞ P(Ai).
In an attempt to understand the role of axiom [A3], let us consider the question of assigning probabilities to different events in ℱ, moving from the simplest to more complicated examples. (a) Finite outcomes set: S = {s1, s2, ..., sn}
In this case one can assign probabilities to the elementary outcomes s1, s2, ..., sn without worrying about any inconsistencies or technical difficulties, because one can always consider the relevant event space to be P (S), the set of all subsets of S. Moreover, since the elementary events s1, s2, ..., sn constitute a partition of S (mutually exclusive, with ∪_{i=1}^n {si} = S), axiom [A3] implies that (by axiom [A1]) P(∪_{i=1}^n {si}) = ∑_{i=1}^n P(si) = 1, and suggests that assigning probabilities to the outcomes yields the simple probability distribution on S: [p(s1), p(s2), ..., p(sn)], with ∑_{i=1}^n p(si) = 1. The probability of event A in ℱ is then defined as follows. First we express event A in terms of the elementary outcomes, say A = {s1, s2, ..., sk}. Then we derive its probability by adding the probabilities of the outcomes s1, s2, ..., sk, i.e. P(A) = p(s1) + p(s2) + ··· + p(sk) = ∑_{i=1}^k p(si). Example 2.40 (a) Consider the case of the random experiment [iii] "tossing a fair coin three times." The outcomes set is S3 = {(HHH), (HHT), (HTT), (HTH), (TTT), (TTH), (THT), (THH)}. Let A1 = {(HHH)} and A2 = {(TTT)}, and derive the probabilities of the events A3 := (A1 ∪ A2), A4 := Ā1, A5 := Ā2, and A6 := (Ā1 ∩ Ā2):
P(A3) = P(A1) + P(A2) = 1/8 + 1/8 = 1/4,  P(A4) = P(S3) − P(A1) = 1 − 1/8 = 7/8,
P(A5) = P(S3) − P(A2) = 1 − 1/8 = 7/8,  P(A6) = P(Ā1 ∩ Ā2) = 1 − P(A1 ∪ A2) = 3/4.
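The finite-outcomes recipe, express an event as a set of elementary outcomes and sum their probabilities, can be sketched in a few lines of Python (an illustration under the fair-coin assumption, not the book's code):

```python
from itertools import product
from fractions import Fraction

# Outcomes set for "tossing a fair coin three times" (Example 2.40)
S3 = [''.join(t) for t in product('HT', repeat=3)]
p = {s: Fraction(1, 8) for s in S3}        # equiprobable elementary outcomes

def prob(event):
    """P(A) = sum of p(s) over the outcomes s in A (finite case)."""
    return sum(p[s] for s in event)

A1, A2 = {'HHH'}, {'TTT'}
A3 = A1 | A2                               # union
A4 = set(S3) - A1                          # complement of A1
A5 = set(S3) - A2                          # complement of A2
A6 = A4 & A5                               # intersection of the complements

print(prob(A3), prob(A4), prob(A5), prob(A6))  # 1/4 7/8 7/8 3/4
```

Exact fractions avoid floating-point noise and reproduce the common-sense assignment of Example 2.39 exactly.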
If we go back to the previous section, we can see that these are the probabilities we attached using common sense. More often than not, the elementary events s1, s2, ..., sn are equiprobable. (b) Consider the assignment of probability to the event A = {(HH), (HT), (TH)} in the case of the random experiment [ii] "tossing a fair coin twice." The probability distribution in this case takes the form {P(HH) = 1/4, P(HT) = 1/4, P(TH) = 1/4, P(TT) = 1/4}. This suggests that P(A) = P(HH) + P(HT) + P(TH) = 3/4. (b) Countable outcomes set: S = {s1, s2, ..., sn, ...} This case is a simple extension of the finite case, where the elementary outcomes s1, s2, ..., sn, ... are again mutually exclusive and constitute a partition of S, i.e. ∪_{i=1}^∞ {si} = S. Axiom [A3] implies that P(∪_{i=1}^∞ {si}) = ∑_{i=1}^∞ P(si) = 1 (by axiom [A1]), and suggests that assigning probabilities to the outcomes yields the probability distribution on S: [p(s1), p(s2), ..., p(sn), ...], such that ∑_{i=1}^∞ p(si) = 1.
As in case (a), the probability of event A in ℱ (which might coincide with the power set of S) is defined similarly by
P(A) = ∑_{[i: si∈A]} p(si). (2.10)
In contrast to the finite S case, the probabilities {p(s1), p(s2), ..., p(sn), ...} can easily give rise to inconsistencies, such as the case p(sn), n = 1, 2, ..., constant and non-negative. For instance, if we assume that p(sn) = p > 0 for all n = 1, 2, 3, ..., this gives rise to an inconsistency because, however tiny p is, ∑_{n=1}^∞ p = ∞. The only way to render this summation bounded is to make p a decreasing function of n. For example, assuming pn = (1/1.6449)n^{-2} implies that (1/1.6449)∑_{n=1}^∞ n^{-2} = 1, which is consistent with axioms [A1]–[A3]; note that for any k > 1, ∑_{n=1}^∞ n^{-k} < ∞.
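A quick numerical check of this normalization (the constant 1/1.6449 is the reciprocal of ∑n⁻² = π²/6; the truncation point below is our choice):

```python
import math

c = 6 / math.pi**2          # = 1/1.6449...: normalizing constant for p_n = c/n^2
total = sum(c / n**2 for n in range(1, 200_000))
print(round(total, 4))      # the p_n sum to (almost) exactly 1

# By contrast, a constant p_n = p > 0 cannot work: partial sums n*p grow without bound,
# e.g. even p = 1e-9 exceeds any bound once n is large enough.
```

The truncated sum is within about 3·10⁻⁶ of 1; the missing mass is the tail of the series, which [A3] forces to vanish in the limit.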
P(Ai|B) = P(Ai)·P(B|Ai) / ∑_{j=1}^n P(Aj)·P(B|Aj), for P(B) > 0. (2.18)
TERMINOLOGY: It is important to point out that the attribution of the formula in (2.18) to Bayes (1763) is a classic example of Stigler's (1980) "Law of Eponymy," stating that no scientific discovery is named after its original discoverer. The formula in (2.12) pertains to conditional probability between events, which was used as early as the sixteenth century by Cardano, and both formulae (2.12) and (2.13) are clearly stated in de Moivre (1718); see Hald (1998). Moreover, the claim that (2.18) provides the foundation of Bayesian statistics is also misleading, because in Bayesian inference Ai, i = 1, ..., n, are not observable events, as in the above context, but unobservable parameters θ := (θ1, θ2, ..., θn); see Chapter 10. Example 2.50 False positive/negative Consider the case of a medical test to detect a particular disease. It is well known that such tests are almost never 100% accurate. Let us assume that for this particular test it has been established that:
(a) if a patient has the disease, the test will detect it (give a positive result) with probability .95, i.e. its false negative probability is .05;
(b) if a patient does not have the disease, the test will incorrectly give a positive result with probability .05 (false positive).
Let us also assume that a person randomly selected from the relevant population has the disease with probability .03. The question of interest is: when a person from that population tests positive, what is the probability that he/she actually has the disease?
To answer that question we need to define the events of interest in terms of the two primary events, a patient: A – has the disease; B – tests positive. (a) and (b) suggest that the relevant probabilities are P(A) = .03, P(B|A) = .95, P(B|Ā) = .05. Applying Bayes' formula (2.18) yields
P(A|B) = P(A)·P(B|A) / [P(A)·P(B|A) + P(Ā)·P(B|Ā)] = (.03)(.95) / [(.03)(.95) + (.97)(.05)] = .370.
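The base-rate effect in Example 2.50 is easy to reproduce numerically; a minimal sketch (variable names are ours):

```python
# Prior and test characteristics from Example 2.50
p_disease = 0.03            # P(A): incidence of the disease in the population
p_pos_given_disease = 0.95  # P(B|A): 1 minus the false negative probability
p_pos_given_healthy = 0.05  # P(B|A-bar): false positive probability

# Bayes' formula (2.18) with the two-event partition {A, A-bar}
p_pos = (p_disease * p_pos_given_disease
         + (1 - p_disease) * p_pos_given_healthy)  # total probability of testing positive
p_disease_given_pos = p_disease * p_pos_given_disease / p_pos

print(round(p_disease_given_pos, 3))  # 0.37
```

Doubling the incidence to .06 roughly doubles the posterior, which makes the dependence on the base rate explicit.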
At first sight this probability might appear rather small, since the test has 95% accuracy, but that ignores the fact that the incidence of the disease in this particular population is small, 3%. Example 2.51 Monty Hall puzzle¹ A contestant on the TV game show "Let's Make a Deal" is presented with three doors, numbered 1, 2, 3. One of the doors has a car behind it and the other two have goats. The contestant will be asked to choose one of the doors, and then the game host will open one of the other doors and give the contestant a chance to switch. Stage 1. The contestant is asked to pick one door, and he chooses door 1. Stage 2. The game host opens door 3 to reveal a goat. Stage 3. The game host asks the contestant: do you want to switch to door 2? Will switching to door 2 improve the contestant's chances of winning the car? A professor of mathematics claimed in print that the answer is a definite No! "If one door is shown to be a loser, that information changes the probability of either remaining choice, neither of which has any reason to be more likely, to 1/2."
It turns out that this claim is wrong! Why? The probability of 1/2 for doors 1 and 2 hiding the car ignores one important but subtle piece of information: the game host opened door 3 knowing where the car is! This information is relevant for evaluating the pertinent probabilities, not only of the primary event of interest, Ck – door k hides the car, but also of the related event Dk – the host opened door k. Probability theory can frame the relationship between these events in terms of their joint, marginal, and conditional probabilities. Initially, the car could have been behind any one of the doors, and thus the marginal probabilities for the events Ck are P(C1) = P(C2) = P(C3) = 1/3. After the contestant selected door 1, the game host has only two doors to choose from, and thus the marginal probabilities for the events Dk are P(D1) = 0, P(D2) = P(D3) = 1/2. The professor of mathematics was wrong because he attempted to account for the occurrence of the event D3 – the game host opened door 3 – by erroneously changing the original marginal probabilities from P(C1) = P(C2) = P(C3) = 1/3 to P(C1) = P(C2) = 1/2.
Probabilistic reasoning teaches us that the proper way to take into account the information that event D3 has occurred is to condition on it. Using the conditional probability formula
¹ www.facebook.com/virginiatecheconomics/videos/1371481736198880/
(2.12), one can elicit the probabilities the contestant needs:
P(C1|D3) = P(C1∩D3)/P(D3) and P(C2|D3) = P(C2∩D3)/P(D3). (2.19)
To evaluate (2.19), however, one requires the probabilities P(C1 ∩ D3) and P(C2 ∩ D3). The joint probability rule in (2.13) suggests that one can evaluate these joint probabilities via P(C1∩D3) = P(D3|C1)·P(C1), P(C2∩D3) = P(D3|C2)·P(C2). But how can one retrieve P(D3|C1) and P(D3|C2)? The game host's reasoning, based on his information, that led him to open door 3 can be used to elicit these conditional probabilities. Pondering on the game host's reasoning in opening door 3: if the car is behind door 1, the game host is free to pick between doors 2 and 3 at random, hence P(D3|C1) = 1/2; if the car is behind door 2, he has no option but to open door 3, hence P(D3|C2) = 1; the car could not have been behind door 3, and thus P(D3|C3) = 0. The coherence of these conditional probabilities is confirmed by the total probability rule in (2.16):
P(D3) = ∑_{k=1}^3 P(D3|Ck)·P(Ck) = (1/2)(1/3) + 1(1/3) + 0(1/3) = 1/2.
Collecting all the relevant probabilities derived above:
P(C1) = P(C2) = P(C3) = 1/3, P(D3) = P(D2) = 1/2, P(D3|C1) = 1/2, P(D3|C2) = 1,
and evaluating the relevant conditional probabilities yields
P(C1|D3) = P(D3|C1)·P(C1)/P(D3) = (1/2)(1/3)/(1/2) = 1/3 < P(C2|D3) = P(D3|C2)·P(C2)/P(D3) = (1)(1/3)/(1/2) = 2/3.
This shows that switching doors doubles the chances of winning the car from 1/3 to 2/3. The moral of this true story is that being a professor of mathematics does not necessarily mean that you can reason systematically with probabilities. That takes more than just sound common sense and a good mathematical background! It requires mastering the mathematical structure of probability theory and its rules of reasoning.
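The 1/3 vs. 2/3 conclusion can also be checked by simulation; a minimal Monte Carlo sketch (not from the book; the host's tie-break when the contestant has picked the car is deterministic here, which leaves the win rates unchanged):

```python
import random

def play(switch, trials=100_000, seed=0):
    """Simulate the Monty Hall game; the host always opens a goat
    door that the contestant did not pick."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)
        pick = rng.randrange(3)
        # Host opens a door that is neither the pick nor the car
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            # Switch to the one remaining unopened door
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

print(play(switch=False))  # close to 1/3 (staying)
print(play(switch=True))   # close to 2/3 (switching)
```

Staying wins exactly when the first pick was the car (probability 1/3); switching wins in the complementary cases, which the simulated frequencies confirm.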
2.7.2 The Concept of Independence Among Events The notion of conditioning can be used to determine whether two events A and B are related in the sense that information about the occurrence of one, say B, alters the probability of the occurrence of A. If knowledge of the occurrence of B does not alter the probability of event A, it is natural to say that A and B are independent. More formally, A and B are independent if P(A|B) = P(A) ⇔ P(B|A) = P(B).
(2.20)
Using the conditional probability formula (2.12), we can deduce that two events A and B are independent if P(A ∩ B) = P(A)·P(B).
(2.21)
Note that this notion of independence can be traced back to Cardano in the 1550s. Example 2.52 For A = {(HH), (TT)} and B = {(TT), (HT)}, A ∩ B = {(TT)} and thus P(A ∩ B) = 1/4 = P(A)·P(B), implying that A and B are independent.
It is very important to distinguish between independent and mutually exclusive events; the definition of the latter does not involve probability. Indeed, two independent events with positive probability cannot be mutually exclusive. This is because if P(A)>0 and P(B)>0 and they are independent, then P(A ∩ B) = P(A)·P(B)>0, but mutual exclusiveness implies that P(A ∩ B) = 0, since A ∩ B = ∅. The intuition behind this result is that mutually exclusive events are informative about each other because the occurrence of one precludes the occurrence of the other. Example 2.53
For A = {(HH), (TT)} and B = {(HT), (TH)}, A ∩ B = ∅, but P(A ∩ B) = 0 ≠ 1/4 = P(A)·P(B); hence A and B are not independent.
Independence can be generalized to more than two events but in the latter case we need to distinguish between pairwise, joint, and mutual independence. For example, in the case of three events A, B, and C, we say that they are jointly independent if P(A ∩ B ∩ C) = P(A) · P(B) · P(C).
(2.22)
Pairwise independence. The notion of joint independence, however, is not equivalent to pairwise independence, defined by the conditions P(A ∩ B) = P(A)·P(B), P(A ∩ C) = P(A)·P(C), P(B ∩ C) = P(B)·P(C). Example 2.54 Consider the outcomes set S = {(HH), (HT), (TH), (TT)} and the events A = {(TT), (TH)}, B = {(TT), (HT)}, and C = {(TH), (HT)}. Given that A ∩ B = {(TT)}, A ∩ C = {(TH)}, B ∩ C = {(HT)}, and A ∩ B ∩ C = ∅, we can deduce that
P(A ∩ B) = P(A)·P(B) = 1/4, P(A ∩ C) = P(A)·P(C) = 1/4, P(B ∩ C) = P(B)·P(C) = 1/4,
but P(A ∩ B ∩ C) = 0 ≠ P(A)·P(B)·P(C) = 1/8.
Similarly, joint independence does not imply pairwise independence. Moreover, both of these forms of independence are weaker than independence, which involves joint independence for all subcollections of the events in question. Independence. The events A1, A2, ..., An are said to be independent iff P(A1 ∩ A2 ∩ ··· ∩ Ak) = P(A1)·P(A2)···P(Ak), for each k = 2, 3, ..., n; that is, this holds for any subcollection of k of the events A1, A2, ..., An (k ≤ n). In the case of three events A, B, and C, pairwise and joint independence together imply independence and conversely.
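The distinction in Example 2.54 can be verified by direct enumeration; a sketch using exact fractions (helper names are ours):

```python
from fractions import Fraction

S = ['HH', 'HT', 'TH', 'TT']               # equiprobable outcomes of two tosses
P = lambda E: Fraction(len(E), len(S))     # probability of an event over S

A = {'TT', 'TH'}
B = {'TT', 'HT'}
C = {'TH', 'HT'}

# Pairwise independence holds: each product matches the intersection ...
assert P(A & B) == P(A) * P(B) == Fraction(1, 4)
assert P(A & C) == P(A) * P(C) == Fraction(1, 4)
assert P(B & C) == P(B) * P(C) == Fraction(1, 4)

# ... but joint independence fails: the triple intersection is empty
print(P(A & B & C), P(A) * P(B) * P(C))  # 0 1/8
```

Each event fixes one coordinate of the outcome, so any two events determine the third coordinate-wise; that is exactly why the triple intersection carries no probability mass.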
2.8 Formalizing Condition [c]: Sampling Space
2.8.1 The Concept of Random Trials The last condition defining the notion of a random experiment is [c] The experiment can be repeated under identical conditions. This is interpreted to mean that the circumstances and conditions from one trial to the next remain the same. This entails two interrelated but different components:
(i) the probabilistic setup of the experiment remains the same for all trials; and
(ii) the outcome in one trial does not affect that of another.
How do we formalize these conditions? The first notion we need to formalize pertains to a finite sequence of trials. Let us denote the n trials by {A1, A2, A3, ..., An} and associate each trial with a probability space (Si, ℱi, Pi(.)), i = 1, 2, ..., n, respectively. In order to be able to discuss any relationship between trials, we need to encompass them in an overall probability space; without it, we cannot formalize condition (ii) above. The overall probability space that suggests itself is the product probability space (S1, ℱ1, P1(.)) × (S2, ℱ2, P2(.)) × ··· × (Sn, ℱn, Pn(.)), which can be thought of as a triple of the form
([S1×S2×···×Sn], [ℱ1×ℱ2×···×ℱn], [P1×P2×···×Pn]) := (S^(n), ℱ^(n), P^(n)),
in an obvious notation. The technical question that arises is whether (S^(n), ℱ^(n), P^(n)) is a proper probability space. To be more precise, the problem is whether S^(n) is a proper outcomes set, ℱ^(n) has the needed structure of a σ-field, and P^(n) defines a set function which satisfies the three axioms. The answer to the first part of the question is in the affirmative, since the outcomes set can be defined by
S^(n) = {s^(n): s^(n) := (s1, s2, ..., sn), si∈Si, i = 1, 2, ..., n}.
It turns out that indeed ℱ^(n) has the needed structure of a σ-field (for finite n) and P^(n) defines a set function which satisfies the three axioms; the technical arguments needed to prove these claims are beyond the scope of the present book; see Billingsley (1995). Having established that the product probability space is a proper probability space, we can proceed to view the sequence of trials {A1, A2, A3, ..., An} as an event in (S^(n), ℱ^(n), P^(n)): an event to which we can attach probabilities. The first component of condition [c] can easily be formalized by ensuring that the probability space (S, ℱ, P(.)) remains the same from trial to trial, in the sense that
[i] (Si, ℱi, Pi(.)) = (S, ℱ, P(.)), for all i = 1, 2, ..., n. (2.23)
We refer to this as the identical distribution (ID) condition. Example 2.55 Let S = {s1, s2, ..., sk} be a generic outcomes set and P = [p(s1), p(s2), ..., p(sk)], with ∑_{i=1}^k p(si) = 1, the associated probability distribution. Then condition [i] amounts to saying that [i] P is the same for all n trials A1, A2, A3, ..., An.
Formally, the ID condition reduces (S^(n), ℱ^(n), P^(n)) to something simpler:
(S^(n), ℱ^(n), P^(n)) →[ID] (S, ℱ, P(.)) × (S, ℱ, P(.)) × ··· × (S, ℱ, P(.)) := [(S, ℱ, P(.))]^n,
with the same probability space (S, ℱ, P(.)) associated with each trial k = 1, 2, ..., n.
The second component is more difficult to formalize, because it involves ensuring that the outcome in the ith trial does not affect and is not affected by the outcome in the jth trial, for i ≠ j, i, j = 1, 2, ..., n. Viewing the n trials (A1, A2, A3, ..., An) as an event in the context of the product probability space (S^(n), ℱ^(n), P^(n)), we can formalize this in the form of independence among the trials. Intuitively, trial i does not affect and is not affected by the outcome of trial j: given the outcome in trial j, the probabilities associated with the various outcomes in trial i are unchanged, and vice versa. The idea that "given the outcome of trial j the outcome of trial i is unaffected" can be formalized using the notion of conditioning, discussed in the previous section. Let us return to the formalization of the notion of a RE (E ) by proceeding to formalize condition [c](ii): the outcome in one trial does not affect and is not affected by that of another.
Sampling space. A sequence of n trials, denoted by Gn := {A1, A2, A3, ..., An}, where Ai represents the ith trial of the experiment, associated with the product probability space (S^(n), ℱ^(n), P^(n)), is said to be a sampling space.
As argued above, we view the n trials Gn := {A1, A2, A3, ..., An} as an event in the context of the product probability space (S^(n), ℱ^(n), P^(n)). As such, we can attach a probability to this event using the set function P^(n). Hence, we formalize [c](ii) by postulating that the trials are independent:
[ii] P^(n)(A1 ∩ A2 ∩ ··· ∩ Ak) = P1(A1)·P2(A2)···Pk(Ak), for k = 2, 3, ..., n, or
[ii]* P^(n)(Ak | A1, A2, ..., Ak−1, Ak+1, ..., An) = Pk(Ak), for k = 1, 2, ..., n. (2.24)
Note that P^(n)(.) and Pk(.) are different probability set functions, which belong to the probability spaces (S^(n), ℱ^(n), P^(n)) and (Sk, ℱk, Pk(.)), respectively; see Pfeiffer (1978). Taking together the conditions of independence (2.24) and identical distribution (2.23), we define what we call a sequence of random trials.
Random trials. A sequence of trials Gn^IID := {A1, A2, A3, ..., An} which is both independent and identically distributed, i.e.
P^(n)(A1 ∩ A2 ∩ ··· ∩ Ak) = P(A1)·P(A2)···P(Ak), for k = 2, 3, ..., n,
is referred to as a sequence of random trials.
NOTE: Gn^IID is a special case of a sampling space Gn associated with (S^(n), ℱ^(n), P^(n)), defined above, in the sense that Gn^IID is associated with [(S, ℱ, P(.))]^n, a sequence of random trials. In general, the components of (S^(n), ℱ^(n), P^(n)) can be both non-identically distributed and non-independent.
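The IID factorization P^(n)(A1 ∩ ··· ∩ An) = P(A1)···P(An) can be made concrete on the product space of n fair coin tosses (a hedged sketch; the book stays at the abstract set-function level, and the fair-coin distribution is our choice):

```python
from itertools import product
from fractions import Fraction

n = 3
p = {'H': Fraction(1, 2), 'T': Fraction(1, 2)}  # one-trial distribution, identical for every trial (ID)

# Product outcomes set S^(n) and the product set function P^(n)
S_n = list(product(p, repeat=n))
P_n = {s: Fraction(1, 1) for s in S_n}
for s in S_n:
    for outcome in s:
        P_n[s] *= p[outcome]                    # independence: per-trial probabilities multiply

# P^(n) is a proper probability set function on S^(n): the masses sum to 1
print(sum(P_n.values()))       # 1

# e.g. the IID sequence (H, T, H) has probability (1/2)^3
print(P_n[('H', 'T', 'H')])    # 1/8
```

Replacing `p` with a different dictionary per trial would give the general (non-ID) product space; dropping the multiplication rule would require specifying joint probabilities directly, which is exactly what independence lets us avoid.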
2.8.2 The Concept of a Statistical Space Combining a simple product probability space and a sequence of random trials, we define a simple statistical space, denoted by [(S, ℱ, P(.))^n, Gn^IID].
The term simple stems from the fact that this represents a particular case of the more general formulation of a statistical space [(S^(n), ℱ^(n), P^(n)), Gn], where each trial, say Ai, is associated with a different probability space (Si, ℱi, Pi(.)) (i.e., non-ID) and the trials are not necessarily independent. As argued in Chapters 5–8, in many disciplines the IID formulation is inadequate because observational data rarely satisfy such conditions. A simple statistical space [(S, ℱ, P(.))^n, Gn^IID] represents our first formalization of the notion of a random experiment E. This formulation, however, is rather abstract because it involves arbitrary sets and set functions. The main aim of the next chapter is to reduce it to a more appropriate form by mapping this mathematical structure onto the real line where numerical data live.
The story so far in symbols:
E := ([a], [b], [c]) → ([a] ⇒ S, [b] ⇒ (ℱ, P(.)), [c] ⇒ Gn) ⇒ [(S, ℱ, P(.))^n, Gn^IID].
The purpose of this chapter has been to provide an introduction to probability theory using the formalization of a simple chance mechanism we called a random experiment (E ) defined by conditions [a]–[c]. The formalization had a primary objective: to motivate some of the most important concepts of probability theory and define them in a precise mathematical way in the form of a statistical space. The questions addressed along the way include the following:
Why these particular primitive notions (S, ℱ, P(.))? The probability space (S, ℱ, P(.)) provides an idealized mathematical description of the stochastic mechanism that gives rise to the events in ℱ.
Why is the set of events of interest a sigma-field? Mathematically, ℱ has the structure of a σ-field because of the nature of the basic concept of probability we call an event: a subset of S, which is an element of ℱ, that might or might not occur at any particular trial. If A and B are events, so are A ∪ B, A ∩ B, Ā, B̄, etc. The notion of an event (an element of ℱ) in probability plays a role analogous to that of a point in geometry.
ℱ is a set of subsets of S that is closed under the set-theoretic operations ∪, ∩, and complementation. Why choose the particular axioms [A1]–[A3] in Table 2.9? This formalization places probability squarely within the mathematical field of measure theory, concerned more broadly with assigning size, length, content, area, volume, etc. to sets; see Billingsley (1995). The axioms [A1]–[A3] ensure that P(.) assigns probabilities to events in ℱ in a consistent and coherent way. What is the scope of the IID trials in Gn^IID = {A1, A2, A3, ..., An}? The notion of a set of IID trials in Gn^IID formalizes two vague notions often invoked in descriptive statistics: (a) the "uniformity" of the target population (or nature) and (b) the "representativeness" of the sample.
2.8.3 The Unfolding Story Ahead In Chapter 3 the probability space (S, ℱ, P(.)) is mapped onto the real line (R) to define a probability model of the form {f(x; θ), θ∈Θ, x∈R}. In Chapter 4 the sampling space is transformed into a special type of sampling model we call a random sample: a set of random variables X := (X1, X2, ..., Xn) which are independent and identically distributed. The unfolding story in symbols:
(S, ℱ, P(.)) → {f(x; θ), θ∈Θ, x∈R},  Gn^IID → X := (X1, X2, ..., Xn).
Important Concepts
Random experiment, outcomes set (sample space), elementary outcomes, events, sure event, impossible event, set-theoretic union, intersection, complementation, partition of a set, empty set, finite set, infinite set, countable set, uncountable set, Venn diagrams, de Morgan's law, mutually exclusive events, event space, power set, field of events, sigma-field of events, Borel field, function, domain and co-domain of a function, range of a function, probability set function, countable additivity, probability space, mathematical deduction, conditional probability, total probability rule, Bayes' rule, independent events, pairwise independent events, sampling space, independent trials, identically distributed trials, statistical space.
Crucial Distinctions
Descriptive vs. inferential statistics, elementary outcomes vs. events, countable vs. uncountable sets, power set vs. sigma-field, independent events vs. mutually exclusive events, co-domain vs. range of a function, probabilistic vs. set-theoretic terminology, independence vs. joint independence vs. pairwise independence among events.
Essential Ideas
● There is no such thing as "descriptive statistics for particular data" that do not invoke probabilistic assumptions.
● Probability theory, as the foundation and overarching framework for empirical modeling, is crucial for defining the premises of statistical induction as well as calibrating the capacity of the inference procedures stemming from these premises.
● In statistics, one aims to model the stochastic mechanism that gave rise to the data, and not to summarize the particular data. Indeed, the inference pertains to this mechanism, even though it is framed in terms of the parameters of the model.
● The most effective way to transform an uncountable set in probability theory into a countable one is to use partitioning.
● The concept of a σ-field played a crucial role in Kolmogorov's framing of the axiomatic approach to probability theory because it captures the key features of the concept of an event for all outcomes sets, including uncountable ones.
● The axiomatization of probability revolves around the concept of an "event" and its occurrence.
● The concept of a σ-field provides the key to understanding the concept of conditioning in its various forms, e.g. E(Y|X = x) vs. E(Y|σ(X)). Kolmogorov (1933a) was the first to properly formalize conditional probability using the concept of a σ-field.
2.9 Questions and Exercises
1.
(a) Explain the main differences between descriptive and inferential (proper) statistics as they relate to their objectives and the role of probability. (b) “There is no such thing as descriptive statistics that summarizes the statistical information in the data at hand without invoking any probabilistic assumptions.” Discuss. (c) Explain the difference between modeling the particular data and modeling the underlying stochastic mechanism that gave rise to the data.
2.
(a) Compare and contrast Figures 2.1 and 2.2 in terms of the type of chance regularity pattern they exhibit. (b) Explain intuitively why the descriptive statistics for the x0 data (Figure 2.1) based on x = 1n nk=1 xk and s2x = 1n nk=1 (xk − x)2 are unreliable as measures of location and central tendency. (c) Explain intuitively why the values based on s2x = 1n nk=1 (xk − x)2 misleadingly inflate the true variation around the mean. (d) Explain why the descriptive statistics in (b) give rise to reliable measures when applied to the y0 data (Figure 2.2).
3.
In Example 2.11 on casting three dice and adding up the dots, explain the different permutations for the occurrence of the events (11, 12) and evaluate their probabilities using Galileo’s reasoning.
4.
(a) Explain the difference between combinations and permutations. (b) Compare and contrast the following sample survey procedures: (i) simple random sampling; (ii) stratified sampling; (iii) cluster sampling; (iv) quota sampling.
5.
(a) Which of the following observable phenomena can be considered as random experiments, as defined by conditions [a]–[c]. Explain your answer briefly. (i) A die is cast and the number of dots facing up is counted. (ii) For a period of a year, observe the newborns in NYC as male or female. (iii) Observe the daily price of a barrel of crude oil. (b) For each of the experiments (i)–(iii), specify the set S of all distinct outcomes. (c) Contrast the notions of outcome vs. event; use experiment (i) to illustrate.
6.
For the sets A = {2, 4, 6} and B = {4, 8, 12}, derive the following: (a) A ∪ B, (b) A ∩ B, (c) the complement of A ∪ B relative to S = {2, 4, 6, 8, 10, 12}. Illustrate your answers using Venn diagrams.
7.
A die is cast and the number of dots facing up is counted. (a) Specify the set of all possible outcomes. (b) Define the sets A – the outcome is an odd number, B – the outcome is an even number, C – the outcome is less than or equal to 4. (c) Using your answer in (b), derive A∪B, A∪C, B∪C, A∩B, A∩C, B∩C. (d) Derive the probabilities of A, B, C and all the events in (c). (e) Derive the probabilities P(A|B), P(A|C), P(B|C).
8.
(a) “Two mutually exclusive events A and B cannot be independent.” Discuss. (b) Explain the notions of mutually exclusive events and a partition of an outcomes set S. How is the latter useful in generating event spaces?
9.
Define the concept of a σ -field and explain why we need such a concept for the set of all events of interest. Explain why we cannot use the power set as the event space in all cases.
10.
Consider the outcomes set S = {2, 4, 6, 8} and let A = {2, 4} and B = {4, 6} be the events of interest. Show that the field generated by these two events coincides with the power set of S.
11.
Explain how intervals of the form (−∞, x] can be used to define intervals such as {a}, (a, b), [a, b), (a, b], [a, ∞), using set-theoretic operations.
12.
(a) Explain the difference between a relation and a function. (b) Let the domain and co-domain of a numerical relation be A = {1, 2, 3, 4} and B = {2, 3, 5, 7, 11, 13}, respectively. Explain whether the set of ordered pairs (i) f = {(1, 13), (2, 11), (3, 7), (4, 7)}, (ii) h = {(1, 13), (2, 11), (2, 3), (3, 7), (4, 5)} constitute a function or not.
13.
Explain whether the probability functions defined below are proper ones:
(i) P(A) = 2/3, P(Ā) = 1/3, P(S) = 1, P(∅) = 0;
(ii) P(A) = 1/3, P(Ā) = 1/3, P(S) = 1, P(∅) = 0;
(iii) P(A) = 1/4, P(Ā) = 3/4, P(S) = 0, P(∅) = 1;
(iv) P(A) = −1/4, P(Ā) = 5/4, P(S) = 1, P(∅) = 0.
14.
(a) Explain how we can define a simple probability distribution in the case where the outcomes set is finite. (b) Explain how we can define the probability of an event A in the case where the outcomes set has a finite number of elements, i.e. S = {s1 , s2 , . . . , sn }. (c) How do we deal with the assignment of probabilities in the case of an uncountable outcomes set?
15.
Describe briefly, using your own words, the formalization of conditions [a] and [b] of a random experiment into a probability space (S, ℱ, P(.)).
16.
Explain how Theorem 2.4, for events A and B: P(B) = P(A ∩ B) + P(Ā ∩ B), relates to the total probability formula: P(B) = P(A)·P(B|A) + P(Ā)·P(B|Ā).
17.
Draw a ball from an urn containing 6 black, 8 white, and 10 yellow balls. (a) Evaluate the probabilities of drawing a black (B), a white (W), or a yellow (Y) ball, separately. (b) Evaluate the probability of the event "draw three balls, a black, a white, and a yellow, in that sequence" when sampling with or without replacement.
18.
Apply the reasoning in Example 2.50 using the probabilities P(A) = .3, P(B|A) = .95, P(B|Ā) = .05 and contrast your results with those in that example.
19.
In the Monty Hall puzzle (Example 2.51), explain why the reasoning used by the math professor to reach the conclusion that keeping the original door or switching to the other will make no difference to the probability of finding the car is erroneous.
20.
Describe briefly the formalization of condition [c] of a random experiment into a simple sampling space GnIID .
21.
Explain the notions of independent events and identically distributed trials.
22.
Explain how conditioning can be used to define independence; give examples.
23.
Explain the difference between a sampling space in general and the simple sampling space GnIID in particular.
24.
In the context of the random experiment of tossing a coin twice, derive the probability of event A = {(HT), (TH)} given event B = {(HH), (HT)}. Explain why events A and B are independent.
25*. For two events A and B in S, use Venn diagrams to evaluate (a) the smallest possible value of P(A ∪ B) and (b) the greatest possible value of P(A ∩ B), under the following two scenarios: (i) P(A) = .4, P(B) = .6; (ii) P(A) = .3, P(B) = .5. Hint: Consider all possible relationships between A and B within S.
3 The Concept of a Probability Model
3.1
Introduction
3.1.1 The Story So Far and What Comes Next
In chapter 2 we commenced the long journey to explore the theory of probability as a modeling framework. In an effort to motivate the various concepts needed, the discussion began with the formalization of the notion of a random experiment E, defined by the conditions in Table 3.1.

Table 3.1 Random experiment (E)
[a] All possible distinct outcomes are known at the outset
[b] In any particular trial the outcome is not known in advance, but there exist discernible regularities pertaining to the frequency of occurrence associated with different outcomes
[c] The experiment can be repeated under identical conditions
The mathematization of E took the form of a statistical space [(S, ℱ, P(.))n, GnIID], where (S, ℱ, P(.)) is a probability space and GnIID is a simple sampling space. Unfortunately, the statistical space does not lend itself naturally or conveniently to the modeling of stochastic phenomena that give rise to numerical data. The main purpose of this chapter is to map the abstract probability space (S, ℱ, P(.)) onto the real line where observed data live. The end result will be a reformulation of (S, ℱ, P(.)) into a probability model, one of the two pillars of a statistical model, the other being the mapping of GnIID onto the real line to define a sampling model in Chapter 4. A bird's-eye view of the chapter. The key to mapping the statistical space onto the real line (R): [(S, ℱ, P(.)), GnIID] → R is the concept of a random variable, one of the most crucial concepts of probability theory. In Section 3.2 we begin the discussion of this mapping using the simplest case where the
outcomes set S is countable, by transforming (S, ℱ) using the concept of a random variable. In Section 3.3 we consider the concept of a random variable in a more general setting. In Section 3.4 we use the concept of a random variable to transform P(.) into a numerical function that assigns probabilities (or densities). In Section 3.5 we complete the transformation of the probability space into a probability model. In Sections 3.6 and 3.7 we take important digressions in an attempt to relate the unknown parameters (the focus of statistical inference) to the numerical characteristics of the distributions, which are indispensable in the context of both modeling and statistical inference.
A preview summary. In order to help the reader keep an eye on the forest, we summarize the mapping X(.): (S, ℱ, P(.)) → R in three steps. The first step is the mapping defining a random variable, X(.): S → R, that preserves the event structure of interest, ℱ. Armed with the mapping X(.), the second step trades the probability set function P(.): ℱ → [0, 1] for a point-to-point numerical function, the cumulative distribution function (cdf), defined in terms of X: FX(.): R → [0, 1]. The third step simplifies the cdf by transforming it into the density function fx(.): R → [0, ∞). The concept of a probability model is usually defined in terms of the density function.
3.2
The Concept of a Random Variable
From the mathematical viewpoint it is often preferable to define a concept in its full generality and then proceed to discuss the special cases. From the pedagogical viewpoint, however, the reverse is sometimes advisable in order to help the reader understand the concept without undue mathematical complexity. In the case of the concept of a random variable, what renders the definition easy or not so easy, from the mathematical viewpoint, is whether the outcomes set is countable or not. In the case of a countable outcomes set, the random variable is said to be discrete because it takes a countable number of values. To help the reader understand the modern concept of the random variable and how it transforms the abstract statistical space into something much easier to handle, the discussion begins with the simplest case and then moves on to more complicated ones:
(i) Finite: the outcomes set S is finite.
(ii) Infinite/countable: the outcomes set S is infinite but countable.
(iii) Infinite/uncountable: the outcomes set S is infinite and uncountable.
3.2.1 The Case of a Finite Outcomes Set: S = {s1, s2, ..., sn}
A (discrete) random variable with respect to the event space ℱ is defined to be a real-valued function of the form X(.): S → RX,
(3.1)
such that all the sets defined by {s: X(s) = x} for x ∈ R constitute events in ℱ, denoted by Ax := {s: X(s) = x} ∈ ℱ, ∀x ∈ R.
(3.2)
Intuitively, a random variable is a function which attaches real numbers to all the elements of S in a way which preserves the event structure of ℱ. Example 3.1
Consider the random experiment of “tossing a coin twice”:
S = {(HH), (HT), (TH), (TT)}, where the event space of interest is ℱ = {S, ∅, A, B, C, A ∪ B, A ∪ C, B ∪ C},
(3.3)
A = {(HH)}, B = {(TT)}, C = {(HT), (TH)}.
Let us consider the question of whether the following two mappings constitute random variables relative to ℱ in (3.3): X(HH) = 2,
X(TT) = 0,
X(HT) = X(TH) = 1,
Y(HT) = 1,
Y(TH) = 2, Y(HH) = Y(TT) = 0.
For an affirmative answer, the pre-image of each of the values in their range must define an event in ℱ. The pre-image is found by tracing the elements of S associated with each of the values in the mapping's range (Figure 3.1), and determining whether the resulting subset of S belongs to ℱ or not (Figure 3.2). It is important to emphasize that the sets Ax := {s: X(s) = x} represent the pre-image of x, denoted by X⁻¹(.): {s: X(s) = x} := X⁻¹(x), for all x ∈ RX.
Fig. 3.1
Random variable X
Fig. 3.2
Pre-image of random variable X
Note that the pre-image does not often coincide with the inverse of the function! All functions have a pre-image, but only one-to-one functions have an inverse. For X(.), the pre-images associated with its values in RX := {0, 1, 2} yield: X⁻¹(0) = {(TT)} = B ∈ ℱ, X⁻¹(1) = {(HT), (TH)} = C ∈ ℱ, X⁻¹(2) = {(HH)} = A ∈ ℱ, and since all of them belong to ℱ, X is a random variable relative to it. For Y(.), the pre-images for the values in RY := {0, 1, 2} yield: Y⁻¹(0) = {(HH), (TT)} = A ∪ B ∈ ℱ, Y⁻¹(1) = {(HT)} ∉ ℱ, Y⁻¹(2) = {(TH)} ∉ ℱ.
(3.4)
Hence, Y is not a random variable relative to the above ℱ, since two of those pre-images are not events in ℱ!
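The pre-image check of Example 3.1 can be mechanized for any finite outcomes set. The sketch below (Python is used purely for illustration; the helper name is ours) represents events as frozensets and tests condition (3.2) directly:

```python
def is_random_variable(mapping, event_space):
    """Check condition (3.2): every pre-image {s: X(s) = x} must be an event."""
    for x in set(mapping.values()):
        pre_image = frozenset(s for s, v in mapping.items() if v == x)
        if pre_image not in event_space:
            return False
    return True

S = {"HH", "HT", "TH", "TT"}
A, B, C = frozenset({"HH"}), frozenset({"TT"}), frozenset({"HT", "TH"})
F = {frozenset(S), frozenset(), A, B, C, A | B, A | C, B | C}

X = {"HH": 2, "HT": 1, "TH": 1, "TT": 0}  # pre-images A, B, C -- all events in F
Y = {"HH": 0, "HT": 1, "TH": 2, "TT": 0}  # pre-images of 1 and 2 are not in F

print(is_random_variable(X, F))  # True
print(is_random_variable(Y, F))  # False
```

The same check, applied to Y against the event space ℱY of Example 3.2, would return True, illustrating that "random variable" is always relative to a particular event space.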
3.2.2 Key Features of a Random Variable
First, the term "random variable" is a misnomer since, in light of its definition in (3.2), X(.) is just a real-valued function that involves no probabilities, i.e. it is neither random nor a variable. However, the term is so deeply embedded in the literature that it would be hopeless to attempt to replace it with a more pertinent term. Second, the concept of a random variable is always defined relative to an event space ℱ, and whether or not X(.) satisfies the restriction in (3.2) depends on ℱ, not on P(.). The fact that a certain real-valued function is not a random variable with respect to a particular ℱ does not mean that it cannot be a random variable with respect to some other event space. Indeed, in Example 3.1 one can define an event space, say ℱY, with respect to which Y constitutes a random variable. Example 3.2 To see that, let us return to the pre-images Y⁻¹(y), y = 0, 1, 2 in (3.4) and define the events A1 := Y⁻¹(0) = {(HH), (TT)}, A2 := Y⁻¹(1) = {(HT)}, A3 := Y⁻¹(2) = {(TH)}. We use (A1, A2, A3) to generate a σ-field: ℱY := σ(Y) = {S, ∅, A1, A2, A3, A1 ∪ A2, A1 ∪ A3, A2 ∪ A3}. ℱY := σ(Y) is known as the minimal σ-field generated by the random variable Y. This concept gives rise to an alternative but equivalent definition of a random variable that generalizes directly to the case where S is uncountable. Random variable. The real-valued function X(.): S → RX is said to be a random variable with respect to ℱ if the σ-field generated by X is a subset of ℱ, i.e. σ(X) ⊆ ℱ. In Example 3.1, σ(X) = ℱ (verify), but in general it can be a proper subset of ℱ. Example 3.3
The real-valued function
Z(HT) = Z(TH) = 0, Z(HH) = Z(TT) = 1
82
The Concept of a Probability Model
is a random variable relative to ℱ, since σ(Z) = {S, ∅, C, C̄} ⊂ ℱ, where C := {s: Z(s) = 0} = Z⁻¹(0) = {(HT), (TH)}, C̄ := {s: Z(s) = 1} = Z⁻¹(1) = {(HH), (TT)}. In light of that, when would one prefer to use Z instead of X? Z will be the random variable of choice when the event of interest is C, "one of each," and its complement C̄, "two of the same." Third, since RX ⊂ R, one might wonder why the restriction in (3.2) is in terms of x ∈ R and not x ∈ RX. It turns out that all the points in R̄X := (R − RX) have the empty set ∅ as their pre-image, and since ∅ belongs to all event spaces (being a σ-field): X⁻¹(x) := {s: X(s) = x} = ∅ ∈ ℱ, for all x ∈ R̄X := (R − RX). Intuitively, X(.): S → RX preserves the event structure of a particular event space ℱ by ensuring that its pre-image takes the form X⁻¹(.): R → ℱ,
(3.5)
where for each x ∈ RX, X⁻¹(x) ∈ ℱ, and for each x ∉ RX, X⁻¹(x) = ∅ ∈ ℱ. Fourth, from the mathematical perspective, the particular values RX taken by X(.): S → R are not important, and neither is the event structure of interest ℱ, as long as the latter is a σ-field. For instance, the real-valued function X(HH) = 7405926, X(TT) = √2, X(HT) = X(TH) = 3.14159265... is a random variable relative to ℱ = {S, ∅, A, B, C, A ∪ B, A ∪ C, B ∪ C}, A = {(HH)}, B = {(TT)}, C = {(HT), (TH)}. From the modeling perspective, however, both the particular values taken by X and the events in ℱ are crucially important! The observed data are viewed as particular values taken by random variables, and thus it is important to choose these values and the associated event structure of interest very carefully to reflect what one is interested in learning about using data. If ℱ is too small, X(.) has limited scope for modeling purposes. Example 3.4 Consider the case ℱ0 = {S, ∅}, where the only X(.): S → R that is a random variable relative to ℱ0 is X(s) = c ∈ R for all s ∈ S, which defines a constant as a degenerate random variable. Example 3.5 For a slightly more informative case, consider ℱ = {S, ∅, A, Ā}, where the random variable can take two values, say {s: X(s) = 1} := A, {s: X(s) = 0} := Ā. The resulting random variable is the indicator function IA(s) = 1 for s ∈ A and IA(s) = 0 for s ∈ Ā,
that is, Bernoulli distributed.
3.2.2.1 Assigning Probabilities
Using the concept of a random variable, we mapped S (an arbitrary set) to a subset of the real line (a set of numbers) RX. Because we do not want to change the original probability structure of (S, ℱ, P(.)), we imposed condition (3.2) to ensure that all events defined in terms of the random variable X belong to the original event space ℱ. We also want to ensure that the same events in the original probability space (S, ℱ, P(.)) and the new formulation, such as Ax = {s: X(s) = x}, get assigned the same probabilities. In order to ensure that, we define the point function fx(.), which we call a density function, as follows: fx(x) := P(X = x) for all x ∈ RX.
(3.6)
Note that (X = x) is a shorthand notation for {s: X(s) = x}. Clearly, for x ∉ RX, X⁻¹(x) = ∅, and thus fx(x) = 0 for all x ∉ RX. Example 3.6 In the case of the indicator function, if we let X(s) := IA(s) we can define the probability density as follows: fx(1) := P(X = 1) = θ and fx(0) := P(X = 0) = (1 − θ), where 0 ≤ θ ≤ 1. This is known as the Bernoulli density:

x        0          1
fx(x)    (1 − θ)    θ
(3.7)
What have we gained? In the context of the original probability space (S, ℱ, P(.)), where S = {s1, s2, ..., sn}, the probabilistic structure of the experiment was specified in terms of P := {p(s1), p(s2), ..., p(sn)}, such that ∑_{i=1}^{n} p(si) = 1. Armed with this we could assign a probability to any event A ∈ ℱ as follows. We know that all events A ∈ ℱ are just unions of certain outcomes. Given that outcomes are also mutually exclusive elementary events, we proceed to use Axiom [3] (see Chapter 2) to define the probability of A as equal to the sum of the probabilities assigned to each of the outcomes making up the event A, i.e. if A = {s1, s2, ..., sk}, then P(A) = ∑_{i=1}^{k} p(si).
In the case of the random experiment of “tossing a coin twice”:
S = {(HH), (HT), (TH), (TT)}, ℱ = P(S), where P(S) denotes the power set of S – the set of all subsets of S (see Chapter 2). The random variable of interest is defined by X – the number of "heads." This suggests that the events of interest are: A0 = {s: X = 0} = {(TT)}, A1 = {s: X = 1} = {(HT), (TH)}, A2 = {s: X = 2} = {(HH)}.
84
The Concept of a Probability Model
In the case of a fair coin, all four outcomes are given the same probability and thus:
P(A0) = P{s: X = 0} = P{(TT)} = 1/4,
P(A1) = P{s: X = 1} = P{(HT), (TH)} = P(HT) + P(TH) = 1/2,
P(A2) = P{s: X = 2} = P{(HH)} = 1/4.
Returning to the main focus of this chapter, we can claim that using the concept of a random variable we achieved the following mapping:
X(.): (S, ℱ, P(.)) → (RX, fx(.)),
where the original probabilistic structure has been transformed into {fx(x1), fx(x2), ..., fx(xm)}, such that ∑_{i=1}^{m} fx(xi) = 1, m ≤ n;
the last is referred to as the probability distribution of a random variable X. The question which arises at this point is to what extent the latter description of the probabilistic structure is preferable to the former. At first sight it looks as though no mileage has been gained by this transformation. However, it turns out that this is misleading and a lot of mileage has been gained, for two reasons. (a) Instead of having to specify {fx(x1), fx(x2), ..., fx(xm)} explicitly, we can use simple real-valued functions in the form of formulae such as fx(x; θ) = θ^x (1 − θ)^(1−x), x = 0, 1, 0 ≤ θ ≤ 1,
(3.8)
which specify the distribution implicitly. For each value of X, the function fx(x) specifies its probability. This formula constitutes a more compact way of specifying the distribution given above. (b) Using such density functions, there is no need to know the probabilities associated with the events of interest in advance. In the case of the above formula, θ could be unknown, and the set of such density functions is referred to as a family of density functions indexed by θ. This is particularly important for modeling purposes, where such families provide the basis of probability models. In a sense, the uncertainty relating to the outcome of a particular trial (condition [c] defining a random experiment) has become the uncertainty concerning the "true" value of the unknown parameter θ. The distribution defined by (3.8) is known as the Bernoulli distribution. This distribution can be used to describe random experiments with only two outcomes. Example 3.8 In the case of the random experiment of "tossing a coin twice": S = {(HH), (HT), (TH), (TT)}, ℱ = {S, ∅, A, Ā}, where the event of interest is A = {(HH), (HT), (TH)}, with P(A) = θ, P(Ā) = (1 − θ). By defining the random variable X(A) = 1 and X(Ā) = 0, the probabilistic structure of the experiment is described by the Bernoulli density (3.8). This type of random experiment can easily be extended to n repetitions of the same two-outcomes experiment, giving rise to the so-called binomial distribution.
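As a quick check on formula (3.8), a few lines of code (a sketch; the helper name is ours, and Python is used purely for illustration) confirm that it reproduces the Bernoulli table and sums to one for an admissible θ:

```python
def bernoulli_pdf(x, theta):
    """Bernoulli density (3.8): f(x; theta) = theta^x * (1 - theta)^(1 - x)."""
    if x not in (0, 1):
        return 0.0                       # the density is zero outside the support {0, 1}
    return (theta ** x) * ((1 - theta) ** (1 - x))

theta = 0.3
print(bernoulli_pdf(1, theta), bernoulli_pdf(0, theta))            # 0.3 0.7
assert abs(bernoulli_pdf(0, theta) + bernoulli_pdf(1, theta) - 1) < 1e-12
```

The point of the formula is exactly what the text emphasizes: leaving `theta` as a free argument describes the whole family of Bernoulli densities at once.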
3.2 The Concept of a Random Variable
85
Example 3.9 Consider the random experiment of "tossing a coin n times and counting the number of heads." The outcomes set for this experiment is defined by S = {H, T}^n, with P(H) = θ and P(T) = 1 − θ. Define the random variable X: the total number of Hs in n trials. The range of values of X is RX = {0, 1, 2, 3, ..., n}; X is binomially distributed, with density function
fx(x; θ) = C(n, x) θ^x (1 − θ)^(n−x), 0 ≤ x ≤ n, n = 1, 2, ..., 0 ≤ θ ≤ 1, (3.9)
where C(n, x) = n!/[(n − x)! x!] and n! = n·(n − 1)·(n − 2)···(3)·(2)·(1). This formula stems naturally from the combinations rule discussed in Chapter 2. This formula can be graphed for specific values of θ. In Figures 3.3 and 3.4 we can see the graph of the binomial density function (3.9) with n = 10 and two different values of the unknown parameter, θ = .15 and θ = .5, respectively. The horizontal axis depicts the values of the random variable X (RX = {0, 1, 2, 3, ..., n}) and the vertical axis depicts the corresponding probabilities. The gains from this formulation are even more apparent in the case where the outcomes set S is infinite but countable. As shown next, in such a case listing the probabilities for each s ∈ S in a table is impossible. The assignment of probabilities using a density function, however, is trivial.
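The binomial density (3.9) is easy to evaluate directly. The sketch below (illustrative names; a sketch, not part of the text's formal development) verifies that the probabilities for n = 10 sum to one and locates the most probable value for the two θ values plotted in Figures 3.3 and 3.4:

```python
from math import comb

def binomial_pdf(x, n, theta):
    """Binomial density (3.9): C(n, x) * theta^x * (1 - theta)^(n - x)."""
    return comb(n, x) * theta ** x * (1 - theta) ** (n - x)

n = 10
for theta in (0.15, 0.5):
    probs = [binomial_pdf(x, n, theta) for x in range(n + 1)]
    assert abs(sum(probs) - 1) < 1e-12            # a proper density sums to one
    mode = max(range(n + 1), key=lambda x: probs[x])
    print(theta, mode)                            # the peak visible in Figs. 3.3-3.4
```

For θ = .15 the density peaks at x = 1 (the skewed shape of Figure 3.3), while for θ = .5 it peaks at x = 5 (the symmetric shape of Figure 3.4).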
3.2.3 The Case of a Countable Outcomes Set: S = {s1, s2, ..., sn, ...}
Consider the case of the countable outcomes set S = {s1, s2, ..., sn, ...}. This is a simple extension of the finite outcomes set case, where the probabilistic structure of the experiment is specified in terms of {p(s1), p(s2), ..., p(sn), ...}, such that ∑_{i=1}^{∞} p(si) = 1. The probability of an event A ∈ ℱ is equal to the sum of the probabilities assigned to each of the outcomes making up the event A: P(A) = ∑_{i: si ∈ A} p(si).
Fig. 3.3 Binomial n = 10, θ = .15
Fig. 3.4 Binomial n = 10, θ = .5
Fig. 3.5 Geometric n = 20, θ = .2
Fig. 3.6 Geometric n = 20, θ = .35
Example 3.10 Consider the random experiment of "tossing a coin until the first H turns up." The outcomes set is: S = {(H), (TH), (TTH), (TTTH), (TTTTH), (TTTTTH), ...}, and let the event space be the power set of S. If we define the random variable X(.) = the number of trials needed to get one H, i.e. X(H) = 1, X(TH) = 2, X(TTH) = 3, etc., and P(H) = θ, then the density function for this experiment is: fx(x; θ) = (1 − θ)^(x−1) θ, 0 ≤ θ ≤ 1, x ∈ RX = {1, 2, 3, ...}. This is the density function of the geometric distribution. This density function is graphed in Figures 3.5 and 3.6 for n = 20 and two different values of the unknown parameter, θ = .20 and θ = .35, respectively. Looking at these graphs we can see why the name "geometric" was given to this distribution: the probabilities decline geometrically as the values of X increase.
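A quick computation (a sketch; the helper name is ours) makes the geometric decline concrete: each probability is exactly (1 − θ) times the previous one, which is what gives the curves in Figures 3.5 and 3.6 their shape.

```python
def geometric_pdf(x, theta):
    """Geometric density: f(x; theta) = (1 - theta)^(x - 1) * theta, x = 1, 2, ..."""
    return (1 - theta) ** (x - 1) * theta

theta = 0.2
probs = [geometric_pdf(x, theta) for x in range(1, 21)]
# consecutive probabilities have the constant ratio (1 - theta), hence "geometric"
assert all(abs(probs[i + 1] / probs[i] - (1 - theta)) < 1e-12 for i in range(19))
print(sum(probs))   # close to, but below, 1: some mass lies beyond x = 20
```

Unlike the finite case, no table could list all the probabilities here; the one-line formula carries the entire countable distribution.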
3.3
The General Concept of a Random Variable
Having introduced the basic concepts needed for the transformation of the abstract probability space (S, ℱ, P(.)) into something more appropriate (and manageable) for modeling purposes, using the simplest case of a countable outcomes set, we will now proceed to explain these concepts in their full generality.
3.3.1 The Case of an Uncountable Outcomes Set S As a prelude to the discussion that follows, let us see why the previous strategy of assigning probabilities to each and every outcome in the case of an uncountable set, say S = R, will not work. The reason is very simple: the outcomes set has so many elements that it is impossible to arrange them in a sequence and thus count them. Hence, any attempt to
follow the procedure used in the countable outcomes set case will lead to insurmountable difficulties. Intuitively, we know that we cannot cover the real line point by point. The only way to overlay R or any of its uncountable subsets is to use a sequence of intervals of any one of the following forms: (a, b), [a, b], [a, b), (−∞, a], where a < b, a and b real numbers. We will see in the sequel that the most convenient form for such intervals is {(−∞, x]} for each x ∈ R.
(3.10)
3.3.1.1 The Definition of a Random Variable In view of the above discussion, any attempt to define a random variable using the definition of a discrete random variable X(.): S → RX , such that {s: X(s) = x} := X −1 (x) ∈ for all x∈R
(3.11)
is doomed to failure. We have just agreed that the only way we can overlay R is by using intervals, not points. The half-infinite intervals (3.10) suggest the modification of the events {s: X(s) = x} of (3.11) into events of the form {s: X(s) ≤ x}. A random variable relative to ℱ is a function X(.): S → R that satisfies the restriction {s: X(s) ≤ x} := X⁻¹((−∞, x]) ∈ ℱ for all x ∈ R.
(3.12)
Notice that the only difference between this definition and that of a discrete random variable comes in the form of the events used. Moreover, in view of the fact that {s: X(s) = x} ⊂ {s: X(s) ≤ x}, the latter definition includes the former as a special case. From this definition we can see that the pre-image of the random variable X(.) takes us from intervals (−∞, x], x ∈ R, back to the event space ℱ. The set of all such intervals generates a σ-field on the real line known as the Borel field and denoted by B(R): B(R) = σ((−∞, x], x ∈ R).
It is worth noting that we could have generated B(R) using any one of the interval forms mentioned above, (a, b), [a, b], [a, b), (−∞, a], and all such intervals are now elements of the Borel field B(R). Hence, in a formal sense, the pre-image of the random variable X(.) constitutes a mapping from the Borel field B(R) to the event space ℱ and takes the form X⁻¹(.): B(R) → ℱ.
(3.13)
This ensures that the random variable X(.) preserves the event structure of ℱ, because the pre-image preserves the set-theoretic operations (see Karr, 1993), as shown in Table 3.2. Note that the restriction (3.12) defining a random variable includes the restriction (3.2) as a special case. Hence, in principle, we could have begun the discussion with the general definition of a random variable (3.12) and then applied it to the various different types of outcomes sets.
Table 3.2 Pre-image and set-theoretic operations
(i) Union: X⁻¹(∪_{i=1}^{∞} Bi) = ∪_{i=1}^{∞} X⁻¹(Bi)
(ii) Intersection: X⁻¹(∩_{i=1}^{∞} Bi) = ∩_{i=1}^{∞} X⁻¹(Bi)
(iii) Complementation: X⁻¹(Bᶜ) = (X⁻¹(B))ᶜ
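The identities in Table 3.2 can be confirmed on a finite example (a sketch with illustrative names; for finitely many values, the complement in (iii) is taken within the range of X rather than all of R):

```python
def pre_image(mapping, values):
    """Pre-image under X of a set of values: {s in S: X(s) in values}."""
    return frozenset(s for s, v in mapping.items() if v in values)

S = frozenset({"HH", "HT", "TH", "TT"})
X = {"HH": 2, "HT": 1, "TH": 1, "TT": 0}
R_X = {0, 1, 2}
B1, B2 = {0, 1}, {1, 2}

assert pre_image(X, B1 | B2) == pre_image(X, B1) | pre_image(X, B2)   # (i) union
assert pre_image(X, B1 & B2) == pre_image(X, B1) & pre_image(X, B2)   # (ii) intersection
assert pre_image(X, R_X - B1) == S - pre_image(X, B1)                 # (iii) complement
print("Table 3.2 identities hold on this example")
```

This is exactly why condition (3.12) suffices: once the pre-images of the generating intervals are events, the pre-image of every Borel set built from them by these operations is an event too.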
.
x
A
Fig. 3.7
.
. h(g(x)) B
Composite function h(g(.)): A → C
3.3.1.2 The Probability Space Induced by a Random Variable*
Let us take stock of what we have achieved so far. The metamorphosis of the probability space (S, ℱ, P(.)) into something usable for modeling purposes has so far traded the outcomes set S for a subset of the real line RX and the event space ℱ for the Borel field B(R). The modus operandi of this transformation has been the concept of a random variable. The next step should be to transform P(.): ℱ → [0, 1] into a mapping on the real line or, more precisely, on B(R). This transformation of the probability set function takes the form P(X ≤ x) = P(X⁻¹((−∞, x])) = PX((−∞, x]). It is very important to note at this stage that the events in the first and second terms are elements of the event space ℱ, but that of the last equality is an element of B(R). So far we have learned how to assign probabilities to intervals of the form {(−∞, x]: x ∈ R} whose pre-image belongs to ℱ. Using Caratheodory's extension theorem we can extend the probability set function PX(.) to assign probabilities to other intervals of the form (a, b), [a, b], [a, b), (−∞, a) that can be constructed using set-theoretic operations on {(−∞, x]: x ∈ R}. The ultimate aim is to assign probabilities to every element Bx of B(R): P(X⁻¹(Bx)) = PX(Bx) for each Bx ∈ B(R). This means that we can define a new probability set function as a composite function (function of a function) P(X⁻¹(.)) with two components, P(.) and X⁻¹(.), which eliminates ℱ, giving rise to PX(.) := P(X⁻¹(.)): B(R) → [0, 1]. In terms of Figure 3.7, g(.): A → B and h(.): B → C correspond to X⁻¹(.): B(R) → ℱ and P(.): ℱ → [0, 1], respectively, and A, B, C correspond to B(R), ℱ, [0, 1], respectively. Collecting the above elements together, we can see that in effect a random variable X induces a new probability space (R, B(R), PX(.)) with which we can replace the abstract probability space (S, ℱ, P(.)).
The main advantage of the former over the latter is that everything takes place on the real line and not in some abstract space. In direct analogy to
the countable outcomes set case, the concept of a random variable induces the following mapping:
X(.): (S, ℱ, P(.)) → (R, B(R), PX(.)).
That is, using the mapping X(.) we traded S for R, ℱ for B(R), and P(.) for PX(.). For reference purposes we call (R, B(R), PX(.)) the probability space induced by the random variable X; see Galambos (1995). Borel (measurable) functions. In probability theory we are interested not just in random variables, but also in well-behaved functions of such random variables. By well-behaved functions, in calculus, we usually mean continuous or differentiable functions. In probability theory, the term well-behaved functions refers to ones which preserve the event structure of their argument. A function of a random variable defined by h(.): R → R such that {h(X) ≤ x} := h⁻¹((−∞, x]) ∈ B(R), for all x ∈ R, is called a Borel (measurable) function. That is, a Borel function is a function which is a random variable relative to B(R). Note that indicator functions, monotone functions, continuous functions, as well as functions with a finite number of discontinuities are Borel functions; see Khazanie (1976). Equality of random variables. Random variables are unlike mathematical functions insofar as their probabilistic structure is of paramount importance. Hence, the concept of equality for random variables involves this probabilistic structure. Two random variables X and Y, defined on the same probability space (S, ℱ, P(.)), are said to be equal with probability one (or almost surely) if (see Karr, 1993) P(s: X(s) ≠ Y(s)) = 0, i.e. if the set {s: X(s) ≠ Y(s)} is an event with zero probability.
3.4
Cumulative Distribution and Density Functions
3.4.1 The Concept of a Cumulative Distribution Function
Using the concept of a random variable X(.), so far we have transformed the abstract probability space (S, ℱ, P(.)) into a less abstract space (R, B(R), PX(.)). However, we have not reached our target yet, because PX(.) := P(X⁻¹(.)) is still a set function. Admittedly it is a much easier set function, because it is defined on the real line, but a set function all the same. What we prefer is a numerical point-to-point function. The way we transform the set function PX(.) into a numerical point-to-point function is by a clever stratagem. Viewing PX(.) as a function of only the end point of the interval (−∞, x], we define the cumulative distribution function (cdf) FX(.): R → [0, 1] by FX(x) = P{s: X(s) ≤ x} = PX((−∞, x]).
(3.14)
The ploy leading to this trick began a few pages ago when we argued that even though we could use any one of the following intervals (see Galambos, 1995): (a, b), [a, b], [a, b), (−∞, a], where a < b, a ∈ R, and b ∈ R, to generate the Borel field B(R), we chose intervals of the form (−∞, x], x ∈ R. In view of this, we can think of the cdf as being defined via P{s: a < X(s) ≤ b} = FX(b) − FX(a), a < b.
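For a discrete random variable the cdf is just a running sum of the density, which makes its step-function shape and the relation P(a < X ≤ b) = FX(b) − FX(a) easy to verify. A sketch (the helper name is ours), using the fair-coin density of Example 3.7:

```python
def cdf(density, x):
    """F_X(x) = P(X <= x), obtained by summing a discrete density."""
    return sum(p for value, p in density.items() if value <= x)

# X = number of heads in two tosses of a fair coin (Example 3.7)
fx = {0: 0.25, 1: 0.5, 2: 0.25}
print([cdf(fx, x) for x in (-1, 0, 0.5, 1, 2, 3)])  # [0, 0.25, 0.25, 0.75, 1.0, 1.0]
# P(a < X <= b) = F_X(b) - F_X(a); e.g. P(0 < X <= 2) = 1.0 - 0.25 = 0.75
assert abs((cdf(fx, 2) - cdf(fx, 0)) - 0.75) < 1e-12
```

Note how FX is defined for every real x, not just the values in RX: it is flat between jumps, non-decreasing, and moves from 0 to 1, which is precisely what makes it a point-to-point function on R.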
Table 3.4 Density function (continuous) properties
[f1] fx(x) ≥ 0, for all x ∈ RX
[f2] ∫_{−∞}^{∞} fx(x)dx = 1
[f3] FX(b) − FX(a) = ∫_{a}^{b} fx(x)dx, a < b

The density function for continuous random variables (defined by (3.16)) satisfies the properties in Table 3.4. Example 3.12 For what value of c is the function f(x) = cx², 0 < x < 1, a proper density function? A probability model is specified as a family of density functions Φ = {f(x; θ), θ ∈ Θ, x ∈ RX}, indexed by the unknown parameter(s) θ taking values in the parameter space Θ, with support R*X := {x ∈ RX: f(x; θ) > 0}. To illustrate the key concept of a probability model, we consider two examples. Example 3.17 (a) An interesting example of a probability model is the beta distribution:
Φ = { f(x; θ) = x^(α−1)·(1−x)^(β−1)/B[α, β], θ := (α, β) ∈ R²₊, 0 < x < 1 }.
In Figure 3.14 several members of this family of densities (one for each combination of values of θ) are shown. This probability model has two unknown parameters, α > 0 and β > 0; the parameter space is the product of the positive real line with itself: Θ := R²₊. This suggests that the set Θ has an infinity of elements, one for each combination of elements from two infinite sets. Its support is R∗X := (0, 1). As can be seen, this probability model involves density functions with very different shapes, depending on the values of the two unknown parameters.
(b) Another example of a probability model is the gamma distribution:
Φ = { f(x; θ) = (1/(Γ[α]·β))·(x/β)^(α−1)·exp{−x/β}, θ := (α, β) ∈ R²₊, x ∈ R₊ }.
In Figure 3.15 several members of this family of densities (one for each combination of values of θ) are shown. Again, the probability model has two unknown parameters α > 0 and β > 0; the parameter space is Θ := R²₊. Its support is R∗X := (0, ∞). For empirical modeling purposes the selection of a probability model is of paramount importance, because it accounts for the “distribution” chance regularities of the underlying stochastic mechanism that gave rise to the observed data in question. Our task is to choose the most appropriate family for the data in question; see Appendix 3.A for several such models. The question that naturally arises at this stage is:
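As a numerical sanity check, any member of the beta family should integrate to one over (0, 1). The sketch below verifies this for one choice of θ (the `beta_pdf` and `integrate` helper names are ours, not from the text):

```python
import math

def beta_pdf(x, a, b):
    """Beta density x^(a-1) (1-x)^(b-1) / B[a, b] on (0, 1)."""
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / B

def integrate(f, lo, hi, n=10_000):
    """Composite midpoint rule; adequate for smooth densities."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

total = integrate(lambda x: beta_pdf(x, 4.0, 2.0), 0.0, 1.0)
```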
Fig. 3.14  Beta probability model: several members of the family, for various values of (α, β)
Fig. 3.15  Gamma probability model: several members of the family, for various values of (α, β)
3.5 From a Probability Space to a Probability Model
How do we select an appropriate probability model? An oversimplified answer is that the modeler selects the probability model based on the chance regularity patterns exhibited by the particular data. Chapter 5 discusses how the histogram of the particular data can be used to make informed decisions with regard to the appropriate probability model. The best way to distinguish between similar-looking distributional shapes is via measures based on moments, such as the skewness and kurtosis coefficients considered next.
3.5.1 Parameters and Moments
Why do we care? In the previous section we introduced the concept of a probability model Φ = {f(x; θ), θ ∈ Θ, x ∈ RX} as a formalization of conditions [a] and [b] of a random experiment. Before we proceed to formalize condition [c] (see next chapter), we make an important digression to introduce a more convenient way to handle the unknown parameter(s) θ of the probability model. In the context of statistical modeling and inference, the most efficient way to deal with the unknown parameters θ is to relate them to the moments of the distribution. As mentioned in the previous section, one of the most important considerations in choosing a probability model is the shape of the density functions. In selecting such probability models, one can get ideas by looking at the histogram of the data as well as at a number of numerical values, such as arithmetic averages, from descriptive statistics. These numerical values are related to what we call the moments of the distribution and can be used to make educated guesses about the appropriateness of different probability models.
3.5.2 Functions of a Random Variable
In empirical modeling one is often interested in a real-valued function Y = h(X) of a random variable X defined on (S, ℑ, P(·)), as well as in the distribution of Y. Since X(·): S → R and h(·): R → R, Y defines a composite function h(X(·)): S → R, and Y = h(X) is a random variable only if its pre-image defines events in ℑ, i.e. Y⁻¹((−∞, y]) ∈ ℑ, for all y ∈ R. Such a function is said to be a Borel (measurable) function, because it preserves the event structure of interest in ℑ, i.e. Y is a random variable relative to this particular ℑ. The range of values of Y = h(X) is defined by RY = {h(x): x ∈ RX}, and its distribution is determined by the functions h(·) and f(x):
Discrete: fY(y) = Σ_{x: h(x)=y} f(x), y ∈ RY;
Continuous: FY(y) = ∫_{−∞}^{y} fY(u)du = ∫_{x: h(x)≤y} f(x)dx.
Note that in the case of a discrete random variable we can evaluate fY(y) directly, but for continuous random variables we evaluate the cdf first: for h(·) increasing,
FY(y) = P(Y ≤ y) = P(h(X) ≤ y) = P(X ≤ h⁻¹(y)) = FX(h⁻¹(y)).
In cases where h(·) is strictly monotonic (increasing or decreasing), i.e. for x1 ≷ x2, h(x1) ≷ h(x2), whose inverse is X = h⁻¹(Y), and fY(y) is continuous, one can recover the latter by differentiating FX(h⁻¹(y)) to yield
fY(y) = fX(h⁻¹(y))·|dh⁻¹(y)/dy|, y ∈ RY.
Example 3.18  Consider the discrete random variable X with distribution
x      −1   0    1    2
f(x)   .2   .1   .2   .5
The range of values of Y = 2X² is RY = {0, 2, 8} and its distribution takes the form
y      0    2    8
f(y)   .1   .4   .5
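The discrete formula fY(y) = Σ_{x: h(x)=y} f(x) can be applied mechanically; the short sketch below reproduces the table of Example 3.18 (the variable names are ours):

```python
# Distribution of Y = h(X) = 2X^2 for the discrete X of Example 3.18.
f_x = {-1: 0.2, 0: 0.1, 1: 0.2, 2: 0.5}
h = lambda x: 2 * x ** 2

f_y = {}
for x, p in f_x.items():
    f_y[h(x)] = f_y.get(h(x), 0.0) + p   # pool the mass of all x with h(x) = y
```

Note how x = −1 and x = 1 both map to y = 2, so their probabilities are pooled: f_y[2] = .2 + .2 = .4.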
Example 3.19  Let X be a continuous random variable with a uniform distribution, i.e. X ∼ U(−1, 1), f(x) = 1/2, x ∈ [−1, 1]. The range of values of Y = X² is RY = [0, 1], and by definition FY(y) = P(X² ≤ y). Since
X² ≤ y ⇐⇒ −√y ≤ X ≤ √y → FY(y) = P(−√y ≤ X ≤ √y) = √y,
it follows that fY(y) = dFY(y)/dy = 1/(2√y), y ∈ (0, 1].
Example 3.20  Probability integral transformation. Let X be a continuous random variable with a cdf F(x) that has a unique inverse x = F⁻¹(u). The distribution of Y = F(X) is uniform with y ∈ RY := [0, 1], i.e. Y = FX(X) ∼ U(0, 1), since
FY(y) = P(F(X) ≤ y) = P(X ≤ F⁻¹(y)) = F(F⁻¹(y)) = y.
In practice, for a continuous random variable X one can go directly to the density function of Y = h(X) when h(·) is differentiable and strictly monotonic [its first derivative dh(x)/dx is either always positive or always negative], with h⁻¹(·) the inverse, for all x ∈ RX:
fY(y) = fX(h⁻¹(y))·|dh⁻¹(y)/dy|, ∀y ∈ RY,
where RY = (a, b), a = min(h(−∞), h(∞)), and b = max(h(−∞), h(∞)).
Example 3.21  Let X be a continuous random variable with a uniform distribution, i.e. X ∼ U(0, 1), f(x) = 1, x ∈ (0, 1). The range of values of Y = e^X is RY = (1, e), and h⁻¹(·) is x = ln(y), with dh⁻¹(y)/dy = 1/y. Hence
fY(y) = 1·|1/y| = 1/y, for all y ∈ (1, e).
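The probability integral transformation of Example 3.20 can be checked by simulation: draw X from an exponential cdf via its inverse, apply Y = F(X), and confirm that Y behaves like U(0, 1), whose mean is 1/2 and variance 1/12 (a sketch with illustrative parameter values, not from the text):

```python
import math
import random

random.seed(1)

theta = 2.0
F = lambda x: 1.0 - math.exp(-theta * x)       # exponential cdf
F_inv = lambda u: -math.log(1.0 - u) / theta   # its unique inverse

# X ~ Exp(theta) via the inverse transform; then the PIT gives Y = F(X) ~ U(0, 1).
xs = [F_inv(random.random()) for _ in range(100_000)]
ys = [F(x) for x in xs]

mean_y = sum(ys) / len(ys)
var_y = sum((y - mean_y) ** 2 for y in ys) / len(ys)
```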
Moments. The moments of a distribution are defined in terms of the mathematical expectation of certain functions of the random variable X, say h(X). By choosing specific forms of the function h(X), such as
h(X) = X^r, h(X) = |X|^r, r = 1, 2, …, h(X) = e^(tX), h(X) = e^(itX),  (3.27)
the expected value of such functions
E[h(X)] = ∫_{−∞}^{∞} h(x)·f(x; θ)dx = g(θ)  (3.28)
yields functions of the form g(θ ) which involve the unknown parameters θ and certain moments of f (x; θ ). The primary usefulness of such a function g(θ ) is that it offers the best way to understand the role of the unknown parameters θ by relating them to specific moments. This is particularly useful in statistical inference (Chapters 11–14).
3.5.3 Numerical Characteristics of Random Variables
3.5.3.1 The Mean
For h(X) := X, where X takes values in RX, the mean of the distribution is
Discrete: E(X) = Σ_{xi∈RX} xi·fX(xi; θ);
Continuous: E(X) = ∫_{−∞}^{∞} x·fX(x; θ)dx.  (3.29)
Note that the only difference in the definition between continuous and discrete random variables is the replacement of the integral by a summation. The mean is a measure of location in the sense that, knowing the mean of X, we have some idea of where fX(x; θ) is located. Intuitively, the mean represents a weighted average of the values of X, with the corresponding probabilities providing the weights. Denoting the mean by μ := E(X), the above definition suggests that μ is a function of the unknown parameters θ, i.e. μ(θ). This provides the modeler with a direct relationship between the first moment of f(x; θ) and θ.
Example 3.22
(a) For the Bernoulli distribution: μ(θ) := E(X) = 0·(1 − θ) + 1·θ = θ, and thus the mean coincides with the unknown parameter θ.
(b) For the uniform distribution (a continuous distribution):
f(x; θ) = 1/(θ2 − θ1), x ∈ [θ1, θ2], θ := (θ1, θ2), −∞ < θ1 < θ2 < ∞,
μ(θ) := E[X] = ∫_{θ1}^{θ2} x/(θ2 − θ1)dx = (θ1 + θ2)/2.
(c) For the exponential distribution, f(x; θ) = θe^(−θx), θ > 0, x > 0:
μ(θ) := E(X) = ∫_0^∞ x·θe^(−θx)dx = 1/θ.
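The continuous means in Example 3.22 can be confirmed by crude numerical integration (the `integrate` helper and the parameter values are ours, for illustration only):

```python
import math

def integrate(f, lo, hi, n=100_000):
    """Composite midpoint rule (a rough-check helper)."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

t1, t2 = 2.0, 5.0    # uniform on [t1, t2]: mean should be (t1 + t2)/2 = 3.5
mean_unif = integrate(lambda x: x / (t2 - t1), t1, t2)

theta = 3.0          # exponential with rate theta: mean should be 1/theta
mean_exp = integrate(lambda x: x * theta * math.exp(-theta * x), 0.0, 40.0)
```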
Appendix 9.A at the end of Chapter 9 lists several important probabilistic inequalities.
3.5.3.3 Standard Deviation
The square root of the variance of a random variable, named by Pearson (1894) the standard deviation, is also used as a measure of dispersion: SD(X) = [Var(X)]^(1/2). This measure is particularly useful in statistical inference, because it provides one of the best ways to standardize any random variable X whose variance exists. One of the most useful practical rules in statistical inference is the following: a random variable X is as “big” as its standard deviation SD(X).
Higher raw moments. The raw moments of a distribution are defined via the functions h(X) := X^r in (3.28): μ′r(θ) := E(X^r), r = 1, 2, …. For the exponential distribution, f(x; θ) = θe^(−θx), x > 0, θ > 0, the raw moments are μ′r(θ) = ∫_0^∞ x^r·θe^(−θx)dx. Using the change of variables u = θx, dx = (1/θ)du, we deduce that μ′r(θ) = (1/θ^r)·∫_0^∞ u^((r+1)−1)·e^(−u)du = Γ(r+1)/θ^r = r!/θ^r.
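The closed form μ′r = r!/θ^r can be cross-checked against a direct numerical integral (an illustrative sketch; `raw_moment` and the value of θ are ours):

```python
import math

theta = 1.5

def raw_moment(r, n=200_000, hi=60.0):
    """Midpoint approximation of E(X^r) for X ~ Exp(theta)."""
    h = hi / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        total += x ** r * theta * math.exp(-theta * x)
    return h * total

errors = [abs(raw_moment(r) - math.factorial(r) / theta ** r) for r in (1, 2, 3)]
```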
In practice, it is often easier to derive the higher moments indirectly using the moment generating function (mgf) or the characteristic function (chf), defined as the two-sided Laplace and Fourier transforms of the density function, respectively:
Mgf: mX(t) := E[e^(tX)] = ∫_{−∞}^{∞} e^(tx)·f(x)dx, for t ∈ (−h, h), h > 0;  (3.31)
Chf: φX(t) := E[e^(itX)] = ∫_{−∞}^{∞} e^(itx)·f(x)dx, for i = √(−1).
This enables one to derive higher moments using differentiation instead of integration:
(d^r/dt^r)mX(t)|_{t=0} = mX^(r)(0) = E(X^r), (d^r/dt^r)φX(t)|_{t=0} = i^r·E(X^r), r = 1, 2, 3, …
3.5.4.2 Higher Central Moments
The concept of variance can be extended to define the central moments using the sequence of functions h(X) := (X − E(X))^r, r = 3, 4, … in (3.28):
μr(θ) := E(X − μ)^r = ∫_{−∞}^{∞} (x − μ)^r·f(x; θ)dx, r = 2, 3, …
Example 3.27  For the Normal distribution [X ∼ N(μ, σ²)]:
E(X − μ)^r = r!σ^r/(2^(r/2)·(r/2)!), for r = 2, 4, 6, …; E(X − μ)^r = 0, for r = 3, 5, 7, …
Not surprisingly, the central moments are directly related to the raw moments as well as to the cumulants κr, r = 1, 2, … (see Appendix 3.B) via
μ2 = μ′2 − (μ′1)², κ2 = μ2,
μ3 = μ′3 − 3μ′2μ′1 + 2(μ′1)³, κ3 = μ3,
μ4 = μ′4 − 4μ′3μ′1 + 6μ′2(μ′1)² − 3(μ′1)⁴, κ4 = μ4 − 3μ2², …
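The central/raw-moment relations are easy to verify on any discrete distribution (the density below is an arbitrary illustration of ours, not from the text):

```python
# Verify mu_2 = mu'_2 - (mu'_1)^2 and the analogous relations for mu_3, mu_4.
density = {0: 0.2, 1: 0.5, 3: 0.3}   # illustrative discrete density

def raw(r):
    """r-th raw moment E(X^r)."""
    return sum(x ** r * p for x, p in density.items())

def central(r):
    """r-th central moment E[(X - E(X))^r]."""
    m = raw(1)
    return sum((x - m) ** r * p for x, p in density.items())

m1, m2, m3, m4 = raw(1), raw(2), raw(3), raw(4)
```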
For more details, see Stuart et al. (1999).
Example 3.28  The first four cumulants of the Normal distribution [X ∼ N(μ, σ²)] are κ1 = μ, κ2 = σ², κ3 = μ3 = 0, κ4 = μ4 − 3μ2² = 0, and κr = 0 for r > 4.
One of the main uses of the central moments is that they can give us a more complete picture of the distribution's shape. By standardizing the central moments, we define a number of useful measures which enable us to get a more complete idea of the possible shape of a density function. The first important feature of a distribution's shape is symmetry around a given point a; often a = E(X).
Symmetry. A random variable X with density f(x) is said to have a symmetric distribution about a point x0 if f(x0 − x) = f(x0 + x), for all x ∈ R. In terms of the cdf, symmetry (about zero) is defined by FX(−x) = 1 − FX(x), for all x ∈ RX.
Fig. 3.17  N(.5, .0278) density
Fig. 3.18  Beta(4, 4) density
3.5.4.3 The Skewness Coefficient
The first index of shape, designed to give us some idea about the possible asymmetry of a density function around the mean, is the skewness coefficient, defined as the standardized third central moment, introduced by Pearson (1895):
Skewness: α3(X) = μ3/(√μ2)³.
Note that √μ2 = [Var(X)]^(1/2) denotes the standard deviation. If the distribution is symmetric around the mean, then α3 = 0; the converse is not true!
Example 3.29  Figure 3.17 depicts the Normal density (3.30), for values of the mean and variance chosen to be the same as those of the beta density
f(x; θ) = x^(α−1)·(1−x)^(β−1)/B[α, β], θ := (α, β) ∈ R²₊, 0 < x < 1,
with α = 4, β = 4 (Figure 3.18); both densities are symmetric around the mean, with α3 = 0. Figure 3.19 depicts two positively skewed (α3 > 0) beta densities, with values (α = 1, β = 4) and (α = 2, β = 4), and Figure 3.20 two negatively skewed (α3 < 0) beta densities.
3.5.4.4 The Kurtosis Coefficient
The second index of shape is the kurtosis coefficient, defined as the standardized fourth central moment: α4(X) = μ4/(μ2)². The Normal distribution has α4 = 3 and is called mesokurtic.
[a] Leptokurtic. Any distribution whose kurtosis coefficient α4 > 3 is called leptokurtic. Example 3.32
Figure 3.21 compares a Normal [N(1, .5)] with a logistic [Lg(α, β)] density:
f(x; θ) = e^(−(x−α)/β)/(β·[1 + e^(−(x−α)/β)]²), θ := (α, β) ∈ R×R₊, x ∈ R,
with α = E(X) = 1, β = .38985 [Var(X) = β²π²/3 = .5], and α4 = 4.2. Figure 3.22 compares a Normal [N(1, .5)] with a Laplace [Lp(α, β)] density:
f(x; θ) = (1/(2β))·e^(−|x−α|/β), θ := (α, β) ∈ R×R₊, x ∈ R,
with α = E(X) = 1, β = .5 [Var(X) = 2β² = .5], and α4 = 6 (see Appendix 3.A).
Fig. 3.23  St(ν = 5) vs. N(0, 1), both with Var(X) = 1
Fig. 3.24  St(ν = 5) with Var(X) = 5/(5−2) vs. N(0, 1) with Var(X) = 1
Example 3.33  Figure 3.23 compares the standard Normal [N(0, 1)] density (bold line) with the standard Student's t density with ν = 5, denoted St(ν = 5):
f(x) = [Γ((ν+1)/2)/(Γ(ν/2)·(νπ)^(1/2))]·(1 + x²/ν)^(−(ν+1)/2), ν > 2, x ∈ R.  (3.34)
The Normal (mesokurtic) and the Student's t (leptokurtic) differ in two respects: (i) the tails of the St(ν = 5) are thicker; (ii) the peak of the St(ν = 5) is more pointed.
WARNING: In many textbooks the graph of the Normal and Student's t distributions looks like Figure 3.24 instead. The latter picture is misleading, however, because the Normal has SD(X) = 1 while the Student's t has SD(X) = √(ν/(ν−2)). Standardizing the latter to SD(X) = 1 yields Figure 3.23, which is the relevant plot when looking at real data plots (Chapter 5); see Example 4.42 for details of the transformation. In Figure 3.25 we compare the Normal [N(0,1)] (bold, lowest peak) with two even more leptokurtic distributions, the St(ν = 3) and the Cauchy(.4) (highest peak), where .4 is the scale parameter.

Fig. 3.25  St(ν = 3) vs. Cauchy(.4) vs. Normal [N(0,1)] densities
Fig. 3.26  Pearson II (ν = 3), −3.16 ≤ x ≤ 3.16 vs. Normal [N(0,1)]

[b] Platykurtic. Any distribution whose kurtosis coefficient α4 < 3 is called platykurtic.
Example 3.34  In Figure 3.26 we compare the Normal density (bold) with a platykurtic density, the Pearson type II with ν = 3: f(x) =
[Γ(ν+2)/(Γ(.5)·Γ(ν+1.5)·c)]·(1 − x²/c²)^(ν+1/2), −c ≤ x ≤ c, c² = 2(ν+2).  (3.35)
The Normal density differs from the Pearson type II in exactly the opposite way to how it differs from the Student's t:
(a) the tails of the Pearson II are slimmer;
(b) the curvature of the Pearson II is less pointed.
It should not be surprising that the Pearson II distribution is directly related to the symmetric beta distribution (α = β); see Appendix 3.A. The Normal, Student's t, and Pearson type II densities are bell-shaped, but they differ in terms of their kurtosis: meso-, lepto-, and platy-kurtic, respectively. In conclusion, it must be said that the usefulness of the kurtosis coefficient is reduced in the case of non-symmetric distributions, because it does not have the same interpretation as in the symmetric cases above; see Balanda and MacGillivray (1988).
Example 3.35  Consider the discrete random variable X with density function
x      0    1    2
f(x)   .3   .3   .4    (3.36)
E(X) = 0(.3) + 1(.3) + 2(.4) = 1.1, E(X²) = 0²(.3) + 1²(.3) + 2²(.4) = 1.9,
E(X³) = 0³(.3) + 1³(.3) + 2³(.4) = 3.5, E(X⁴) = 0⁴(.3) + 1⁴(.3) + 2⁴(.4) = 6.7.
Var(X) = [0 − 1.1]²(.3) + [1 − 1.1]²(.3) + [2 − 1.1]²(.4) = 0.69,
Var(X) = E(X²) − [E(X)]² = 1.90 − 1.21 = 0.69.
E{(X − E(X))³} = [0 − 1.1]³(.3) + [1 − 1.1]³(.3) + [2 − 1.1]³(.4) = −0.108,
E{(X − E(X))⁴} = [0 − 1.1]⁴(.3) + [1 − 1.1]⁴(.3) + [2 − 1.1]⁴(.4) = 0.7017.
α3 = −0.108/(0.69)^(3/2) = −0.18843, α4 = 0.7017/(0.69)² = 1.4739.
Example 3.36  Consider the continuous random variable X with f(x) = 2x, 0 < x < 1:
E(X) = ∫_0^1 2x²dx = 2/3, E(X²) = ∫_0^1 2x³dx = 1/2, E(X³) = ∫_0^1 2x⁴dx = 2/5,
Var(X) = E(X²) − [E(X)]² = 1/2 − 4/9 = 1/18.
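The computations of Example 3.35 can be reproduced mechanically; in particular, the third central moment of (3.36) is negative, so the distribution is negatively skewed:

```python
# Moments, skewness, and kurtosis of the discrete density in (3.36).
f = {0: 0.3, 1: 0.3, 2: 0.4}

mean = sum(x * p for x, p in f.items())

def mu(r):
    """r-th central moment E[(X - mean)^r]."""
    return sum((x - mean) ** r * p for x, p in f.items())

var = mu(2)
alpha3 = mu(3) / var ** 1.5
alpha4 = mu(4) / var ** 2
```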
If, for some p > 0, x^p·P(|X| > x) → 0 as x → ∞, then all moments of order lower than p exist, i.e. E(X^r) < ∞ for 0 < r < p.
(a) Derive its mean and variance. (b) Derive its mode.
11. Consider the function f(x) = 140x³(1−x)³, 0 < x < 1.

Pareto: Par(θ, x₀)
Φ = { f(x; θ) = θ·x₀^θ·x^(−(θ+1)), θ > 0, x ≥ x₀ }
Numerical characteristics
E(X) = θx₀/(θ−1), median = 2^(1/θ)·x₀, mode = x₀
Var(X) = θx₀²/((θ−2)(θ−1)²), μ′r = θx₀^r/(θ−r), for θ > r
Relationships with other distributions
(a) Pareto–exponential: see exponential
(b) Pareto–chi-square: if X1, X2, …, Xn are IID Pareto random variables, then Y = 2θ·ln{∏ᵢ₌₁ⁿ (Xᵢ/x₀)} ∼ χ²(2n)
Power exponential (or error): PE(μ, β, δ)
Φ = { f(x; θ) = [1/(β·2^(δ/2+1)·Γ(1+δ/2))]·exp{−(1/2)·|(x−μ)/β|^(2/δ)}, θ := (μ, β, δ) ∈ R×R²₊, x ∈ R }
Numerical characteristics
E(X) = μ, mode = median = μ, Var(X) = 2^δ·β²·Γ(3δ/2)/Γ(δ/2), α3 = 0
α4 = Γ(5δ/2)·Γ(δ/2)/[Γ(3δ/2)]², μr = 0 for r odd, μr = 2^(δr/2)·β^r·Γ((r+1)δ/2)/Γ(δ/2) for r even
Relationships with other distributions
(a) Power exponential–Normal: PE(μ, 1, 1) := N(μ, 1)
(b) Power exponential–Laplace: PE(μ, .5, 2) := L(μ, 1)
(c) Power exponential–uniform: as δ → 0, PE(μ, β, δ) ⇒ U(μ − β, μ + β)
Student's t: St(ν)
Φ = { f(x; θ) = [Γ((ν+1)/2)/(Γ(ν/2)·(σ²νπ)^(1/2))]·[1 + (x−μ)²/(νσ²)]^(−(ν+1)/2), θ := (μ, σ²) ∈ R×R₊, x ∈ R }
Numerical characteristics
E(X) = μ, Var(X) = σ²ν/(ν−2), ν > 2, α3 = 0, (α4 − 3) = 6/(ν−4), ν > 4
μr = 0 for r odd, μr = σ^r·ν^(r/2)·[1·3·5·7···(r−1)]/[(ν−2)(ν−4)···(ν−r)] for ν > r = 2, 4, …
Relationships with other distributions
(a) Student's t–Normal: as ν → ∞, St(ν) ⇒ N(0, 1)
(b) Student's t–F: if X ∼ St(ν), then Y = X² ∼ F(1, ν)
Appendix 3.A: Univariate Distributions
Uniform: U(a, b) (continuous)
Φ = { f(x; θ) = 1/(b−a), θ := (a, b), a ≤ x ≤ b }
Numerical characteristics
E(X) = (a+b)/2, no mode, median = (a+b)/2, Var(X) = (b−a)²/12
α3 = 0, α4 = 1.8, μr = 0 for r odd, μr = (b−a)^r/(2^r·(r+1)) for r even, m(t) = (e^(bt) − e^(at))/((b−a)t)
Relationships with other distributions
(a) Uniform–beta: if X ∼ U(0, 1), then X ∼ Beta(1, 1)
(b) Uniform–all other distributions: if X ∼ U(0, 1), then for any random variable Y with cdf F(·), Y = F⁻¹(X) has cdf F(y)
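The U(a, b) entries above can be cross-checked numerically, e.g. the even central moments μr = (b−a)^r/(2^r(r+1)) and α4 = 1.8 (a sketch with our own helper names and arbitrary endpoints):

```python
# Cross-check the U(a, b) table entries mu_2, mu_4, and alpha_4 = 1.8.
a, b = -1.0, 3.0
mean = (a + b) / 2

def mu(r, n=100_000):
    """Midpoint approximation of the r-th central moment of U(a, b)."""
    h = (b - a) / n
    return h * sum(((a + (i + 0.5) * h) - mean) ** r / (b - a) for i in range(n))

mu2, mu4 = mu(2), mu(4)
alpha4 = mu4 / mu2 ** 2
```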
Weibull: W(α, β)
Φ = { f(x; θ) = (β·x^(β−1)/α)·exp{−x^β/α}, θ := (α, β) ∈ R²₊, x > 0 }
Numerical characteristics
E(X) = α^(1/β)·Γ((1+β)/β), mode = [α(β−1)/β]^(1/β) for β ≥ 1
Var(X) = α^(2/β)·{Γ((β+2)/β) − [Γ((β+1)/β)]²}, μ′r = α^(r/β)·Γ((r+β)/β), r = 3, 4, …
Relationships with other distributions
(a) Weibull–exponential: see exponential
(b) Weibull–extreme value: if X ∼ W(α, β), then Y = −ln(αX^β) ∼ EV(α, β)
(c) Weibull–Rayleigh: W(α, 2) is the Rayleigh distribution
4 A Simple Statistical Model
4.1
Introduction
4.1.1 The Story So Far, a Summary
Chapter 2 initiated the formalization of a simple chance mechanism, known as a random experiment E, into a simple statistical space [(S, ℑ, P(·))^n, G_n^IID = {A1, A2, A3, …, An}]. In light of the fact that numerical data live on the real line, the concept of a random variable X(·) was used to transform (S, ℑ, P(·)) into a probability model:
Probability space (S, ℑ, P(·))  →  Probability model Φ = {f(x; θ), θ ∈ Θ, x ∈ RX},
where Φ denotes a family of density functions f(x; θ), indexed by θ in Θ. The primary objective of this chapter is to map the statistical space onto the real line by defining the sampling model:
Sampling space G_n^IID = {A1, A2, A3, …, An}  →  Sampling model X := (X1, X2, X3, …, Xn).
The transformation involves two important concepts in probability theory: independence and identical distribution (IID). The resulting sampling model will, when combined with the probability model, give rise to a simple statistical model.
4.1.2 From Random Trials to a Random Sample: A First View
As argued in Chapter 2, a simple sampling space G_n^IID := {A1, A2, …, An} is a set of random trials which are both
Independent (I): P⁽ⁿ⁾(A1 ∩ A2 ∩ ··· ∩ Ak) = ∏ᵢ₌₁ᵏ Pᵢ(Aᵢ), for k = 2, 3, …, n,  (4.1)
Identically distributed (ID): P1(·) = P2(·) = ··· = Pn(·) = P(·).  (4.2)
Independence is related to the condition that “the outcome of one trial does not affect and is not affected by the outcome of any other trial,” or equivalently: P(n) (Ak |A1 , A2 , . . . , Ak−1 , Ak+1 , . . . , An ) = Pk (Ak ), for k = 1, 2, . . . , n.
(4.3)
The second pertains to “keeping the same probabilistic setup from one trial to the next,” ensuring that the events and probabilities associated with the different outcomes remain the same for all trials. Having introduced the concept of a random variable in Chapter 3, it is natural to map GnIID onto the real line to transform the trials {A1 , A2 , . . . , An } into a set of random variables X:=(X1 , X2 , . . . , Xn ). The set function P(n) (.) will be transformed into the joint distribution function f (x1 , x2 , . . . , xn ). Using these two concepts we can define the concept of a random sample X to be a set of IID random variables. A bird’s-eye view of the chapter. In Section 4.2 we introduce the concept of a joint distribution using the simple bivariate case for expositional purposes. In Section 4.3 we relate the concept of the joint distribution to that of the marginal (univariate) distribution. Section 4.4 introduces the concept of conditioning and conditional distributions as it relates to both the joint and marginal distributions. In Section 4.5 we define the concept of independence using the relationship between the joint, marginal, and conditional distributions. In Section 4.6 we define the concept of identically distributed in terms of the joint and marginal distributions and proceed to define the concept of a random sample. In Section 4.7 we introduce the concept of a function of random variables and its distribution with the emphasis placed on applications to the concept of an ordered random sample. Section 4.8 completes the transformation of a simple statistical space into a simple statistical model.
4.2
Joint Distributions of Random Variables
The concept of a joint distribution is undoubtedly one of the most important concepts in both probability theory and statistical inference. As in the case of a single random variable, the discussion will proceed to introduce the concept from the simple to the more general case. In this context simple refers to the case of countable outcomes sets, which give rise to discrete random variables. After we introduce the basic ideas in this simplified context, we proceed to discuss them in their full generality.
4.2.1 Joint Distributions of Discrete Random Variables
In order to understand the concept of a set of random variables (a random vector), we consider first the two random variable case, since the extension of the ideas to n random variables is simple in principle but complicated in terms of notation.
Random vector. Consider two simple random variables X(·) and Y(·) defined on the same probability space (S, ℑ, P(·)), i.e.
X(·): S → R, such that X⁻¹(x) ∈ ℑ, for all x ∈ R,
Y(·): S → R, such that Y⁻¹(y) ∈ ℑ, for all y ∈ R.
REMARK: Recall that Y⁻¹(y) = {s: Y(s) = y, s ∈ S} denotes the pre-image of the function Y(·) and not its inverse.
Viewing them separately, we can define their individual density functions, as explained in the previous chapter, as follows:
P(s: X(s) = x) = fx(x) > 0, x ∈ RX,  P(s: Y(s) = y) = fy(y) > 0, y ∈ RY,
where RX and RY denote the supports of the density functions of X and Y. Viewing them together, we can think of each pair (x, y) ∈ RX×RY as events of the form
{s: X(s) = x, Y(s) = y} := {s: X(s) = x}∩{s: Y(s) = y}, (x, y) ∈ RX×RY.
In view of the fact that the event space ℑ is a σ-field, and thus closed under intersections, the mapping Z(·, ·) := (X(·), Y(·)): S → R² is a random vector, since the pre-image of Z(·) belongs to the event space ℑ:
Z⁻¹(x, y) = X⁻¹(x) ∩ Y⁻¹(y) ∈ ℑ,
since by definition X⁻¹(x) ∈ ℑ and Y⁻¹(y) ∈ ℑ (ℑ being a σ-field; see Chapter 3).
Joint density. The joint density function is defined by f(·, ·): RX×RY → [0, 1],
f(x, y) = P{s: X(s) = x, Y(s) = y}, (x, y) ∈ RX×RY.
Example 4.1  Consider the case of the random experiment of tossing a fair coin twice, giving rise to the set of outcomes S = {(HH), (HT), (TH), (TT)}. Let us define the random variables X(·) and Y(·) on S as follows:
X(HH) = X(HT) = X(TH) = 1, X(TT) = 0,
Y(HT) = Y(TH) = Y(TT) = 1, Y(HH) = 0.
We can construct the individual density functions as follows:
Z−1 (x, y) = X −1 (x) ∩ Y −1 (y) ∈, since by definition, X −1 (x)∈ and Y −1 (y)∈ (being a σ -field; see Chapter 3). Joint density. The joint density function is defined by f (., .): RX × RY → [0, 1], f (x, y) = P{s: X(s) = x, Y(s) = y}, (x, y)∈RX ×RY . Example 4.1 Consider the case of the random experiment of tossing a fair coin twice, giving rise to the set of outcomes S = {(HH), (HT), (TH), (TT)}. Let us define the random variables X(.) and Y(.) on S as follows: X(HH) = X(HT) = X(TH) = 1, X(TT) = 0, Y(HT) = Y(TH) = Y(TT) = 1, Y(HH) = 0. We can construct the individual density functions as follows: x
0
1
y
0
1
f (x)
.25
.75
f (y)
.25
.75
To define the joint density function, we need to specify all the events (X = x, Y = y), x∈RX , y∈RY
(4.4)
and then attach probabilities to these events:
(X = 0, Y = 0) = { } = ∅  →  f(x = 0, y = 0) = .00,
(X = 0, Y = 1) = {(TT)}  →  f(x = 0, y = 1) = .25,
(X = 1, Y = 0) = {(HH)}  →  f(x = 1, y = 0) = .25,
(X = 1, Y = 1) = {(HT), (TH)}  →  f(x = 1, y = 1) = .50.
That is, the joint density takes the form
y\x    0     1
0      .00   .25
1      .25   .50    (4.5)
If we compare this joint density (4.5) with the univariate densities (4.4), there is no obvious relationship. As argued in the next chapter, however, the difference between the joint probabilities f(x, y), x ∈ RX, y ∈ RY, and the products of the individual probabilities f(x)·f(y), x ∈ RX, y ∈ RY, reflects the dependence between the random variables X and Y. At this stage it is crucial to note that a most important feature of the joint density function f(x, y) is that it provides a general description of the dependence between X and Y. Before we proceed to consider the continuous random variables case, it is instructive to consider a particularly simple case of a bivariate discrete density function.
Example 4.2  The previous example is a particular case of a well-known discrete joint distribution, the Bernoulli distribution, given in Table 4.1.
Table 4.1  Bernoulli density
y\x    0         1
0      p(0, 0)   p(1, 0)
1      p(0, 1)   p(1, 1)    (4.6)
Here, p(i, j) denotes the joint probability for X = i and Y = j, i, j = 0, 1. The Bernoulli joint density takes the form
f(x, y) = p(0, 0)^((1−x)(1−y))·p(0, 1)^((1−x)y)·p(1, 0)^(x(1−y))·p(1, 1)^(xy), x = 0, 1, y = 0, 1.
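The construction behind Examples 4.1–4.2 — attaching probabilities to the events (X = x, Y = y) and reading off the marginals — can be sketched directly (the variable names are ours):

```python
# Example 4.1 in code: from outcomes in S to the joint density and marginals.
S = ["HH", "HT", "TH", "TT"]                 # each outcome has probability 1/4
X = {"HH": 1, "HT": 1, "TH": 1, "TT": 0}
Y = {"HT": 1, "TH": 1, "TT": 1, "HH": 0}

joint = {}
for s in S:
    key = (X[s], Y[s])
    joint[key] = joint.get(key, 0.0) + 0.25  # pool outcomes mapped to the same (x, y)

f_x = {v: sum(p for (x, _), p in joint.items() if x == v) for v in (0, 1)}
f_y = {v: sum(p for (_, y), p in joint.items() if y == v) for v in (0, 1)}
```

Note that the event (X = 0, Y = 0) never occurs, matching the .00 entry of (4.5).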
4.2.2 Joint Distributions of Continuous Random Variables
In the case where the outcomes set S is uncountable, the random variables defined on it are said to be continuous, because their range of values is a piece of the real line R.
Random vector. Consider two continuous random variables X(·) and Y(·) defined on the same probability space (S, ℑ, P(·)), i.e.
X(·): S → R, such that X⁻¹((−∞, x]) ∈ ℑ, for all x ∈ R,
Y(·): S → R, such that Y⁻¹((−∞, y]) ∈ ℑ, for all y ∈ R.
Viewing them separately, we can define their individual cumulative distribution functions (see Chapter 3) as follows:
P(s: X(s) ≤ x) = P(X⁻¹((−∞, x])) = PX((−∞, x]) = FX(x), x ∈ R,
P(s: Y(s) ≤ y) = P(Y⁻¹((−∞, y])) = PY((−∞, y]) = FY(y), y ∈ R.
Viewing them together, we can associate with each pair (x, y) ∈ R×R events of the form
{s: X(s) ≤ x, Y(s) ≤ y} := {s: X(s) ≤ x}∩{s: Y(s) ≤ y}, (x, y) ∈ R×R.
As in the discrete random variable case, since ℑ is a σ-field (closed under intersections), the mapping Z(·, ·) := (X(·), Y(·)): S → R² constitutes a random vector whose pre-image:
Z⁻¹((−∞, x]×(−∞, y]) = X⁻¹((−∞, x]) ∩ Y⁻¹((−∞, y]) ∈ ℑ, since X⁻¹((−∞, x]) ∈ ℑ and Y⁻¹((−∞, y]) ∈ ℑ by definition. The joint cumulative distribution function is defined by FXY(·, ·): R² → [0, 1],
FXY(x, y) = P{s: X(s) ≤ x, Y(s) ≤ y} = PXY((−∞, x]×(−∞, y]), (x, y) ∈ R².
The joint cdf can also be defined on intervals of the form (a, b], taking the form P{s: x1 < X(s) ≤ x2, y1 < Y(s) ≤ y2}. For the bivariate exponential density
f(x, y; θ) = [(1+θx)(1+θy) − θ]·exp{−x − y − θxy}, x > 0, y > 0, θ > 0,
it is obvious that X and Y are independent only when θ = 0, since the factorization can be achieved only in that case:
f(x, y; 0) = [(1+θx)(1+θy) − θ]·exp{−x − y − θxy}|_(θ=0) = (e^(−x))·(e^(−y)).
4.5.2 Independence in the n Random Variable Case The extension of the above definitions of independence from the two to the n variable case is not just a simple matter of notation. As argued in the previous chapter, the events A1 , A2 , . . . , An are independent if the following condition holds: P(A1 ∩ A2 ∩ . . . ∩ Ak ) = P(A1 )·P(A2 ) · · · P(Ak ), for k = 2, 3, . . . , n.
(4.41)
That is, this must hold for all subsets of {A1 , A2 , . . . , An }. For example, in the case n = 3, the following conditions must hold for A1 , A2 , A3 to be independent:
4.5 Independence
(a) P(A1 ∩ A2 ∩ A3) = P(A1)·P(A2)·P(A3);
(b) P(A1 ∩ A2) = P(A1)·P(A2);
(c) P(A1 ∩ A3) = P(A1)·P(A3);
(d) P(A2 ∩ A3) = P(A2)·P(A3).
In the case where only conditions (b)–(d) hold, the events A1 , A2 , A3 are said to be pairwise independent. For (complete) independence, however, we need all four conditions. The same holds for random variables, as can be seen by replacing the arbitrary events A1 , A2 , A3 with the special events Ai = (Xi ≤ xi ), i = 1, 2, 3. Independence. The random variables X1 , X2 , . . . , Xn are said to be independent if the following condition holds: F(x1 , x2 , . . . , xn ) = F1 (x1 )·F2 (x2 ) · · · Fn (xn ), for all x∈Rn ,
(4.42)
where x:=(x1 , . . . , xn ). In terms of the density functions, independence can be written in the form f (x1 , x2 , . . . , xn ) = f1 (x1 )·f2 (x2 ) · · · fn (xn ), for all x∈Rn .
(4.43)
From (4.43) we can see that the qualification for all subsets of {A1, A2, …, An} in the case of events has been replaced with the qualification for all x ∈ Rⁿ. In other words, in the case of random variables we do not need to check (4.43) for any subsets of the set X1, X2, …, Xn, but we do need to check it for all values x ∈ Rⁿ. It is also important to note that when (4.43) holds for all x ∈ Rⁿ, it implies that it holds for any subset of the set X1, X2, …, Xn, but not the reverse.
Example 4.36  Let us return to our favorite example of tossing a fair coin twice and noting the outcome: S = {(HH), (HT), (TH), (TT)}, with ℑ being the power set. Define the following random variables:
X(HT) = X(HH) = 0, X(TH) = X(TT) = 1, Y(TH) = Y(HH) = 0,
Y(TT) = Y(HT) = 1,
Z(TH) = Z(HT) = 0,
Z(TT) = Z(HH) = 1.
PXYZ(1, 1, 1) = 1/4, PXYZ(1, 1, 0) = 0, PXYZ(1, 0, 0) = 1/4, PXYZ(1, 0, 1) = 0,
PXYZ(0, 0, 1) = 1/4, PXYZ(0, 1, 0) = 1/4, PXYZ(0, 1, 1) = 0, PXYZ(0, 0, 0) = 0.
PX(0) = ΣyΣz P(0, y, z) = P(0, 1, 0) + P(0, 0, 1) + P(0, 1, 1) + P(0, 0, 0) = 1/2,
PX(1) = ΣyΣz P(1, y, z) = P(1, 1, 1) + P(1, 0, 0) + P(1, 1, 0) + P(1, 0, 1) = 1/2,
PY(0) = ΣxΣz P(x, 0, z) = P(1, 0, 0) + P(0, 0, 1) + P(1, 0, 1) + P(0, 0, 0) = 1/2,
PY(1) = ΣxΣz P(x, 1, z) = P(1, 1, 1) + P(0, 1, 1) + P(1, 1, 0) + P(0, 1, 0) = 1/2,
PZ(0) = ΣxΣy P(x, y, 0) = P(1, 0, 0) + P(1, 1, 0) + P(0, 1, 0) + P(0, 0, 0) = 1/2,
PZ(1) = ΣxΣy P(x, y, 1) = P(1, 1, 1) + P(0, 0, 1) + P(1, 0, 1) + P(0, 1, 1) = 1/2.
In view of these results, we can deduce that (X, Y), (X, Z), and (Y, Z) are independent in pairs, since
158
A Simple Statistical Model
PXY(0, 0) = PX(0)·PY(0) = 1/4, PXZ(0, 0) = PX(0)·PZ(0) = 1/4,
PXY(1, 0) = PX(1)·PY(0) = 1/4, PXZ(1, 0) = PX(1)·PZ(0) = 1/4,
PXY(0, 1) = PX(0)·PY(1) = 1/4, PXZ(0, 1) = PX(0)·PZ(1) = 1/4,
PYZ(0, 0) = PY(0)·PZ(0) = 1/4, PYZ(1, 0) = PY(1)·PZ(0) = 1/4, PYZ(0, 1) = PY(0)·PZ(1) = 1/4.
On the other hand, jointly (X, Y, Z) are not independent, since
PXYZ(1, 1, 1) = 1/4 ≠ PX(1)·PY(1)·PZ(1) = 1/8.
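Example 4.36 can be verified mechanically: every pairwise factorization holds, while the three-way factorization fails (a sketch; the names are ours):

```python
# Example 4.36 in code: pairwise independence holds, joint independence fails.
P_outcome = {"HH": 0.25, "HT": 0.25, "TH": 0.25, "TT": 0.25}
X = {"HT": 0, "HH": 0, "TH": 1, "TT": 1}
Y = {"TH": 0, "HH": 0, "TT": 1, "HT": 1}
Z = {"TH": 0, "HT": 0, "TT": 1, "HH": 1}

def P(event):
    """Probability of the set of outcomes s for which event(s) is true."""
    return sum(p for s, p in P_outcome.items() if event(s))

pairwise_ok = all(
    abs(P(lambda s: A[s] == i and B[s] == j)
        - P(lambda s: A[s] == i) * P(lambda s: B[s] == j)) < 1e-12
    for A, B in [(X, Y), (X, Z), (Y, Z)]
    for i in (0, 1)
    for j in (0, 1)
)

joint_111 = P(lambda s: X[s] == 1 and Y[s] == 1 and Z[s] == 1)
product_111 = P(lambda s: X[s] == 1) * P(lambda s: Y[s] == 1) * P(lambda s: Z[s] == 1)
```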
The above discussion completes the first stage of our quest to transform the concept of random trials: the independence condition, given in the introduction in terms of trials (4.1), has now been recast in terms of random variables, as in (4.43). We consider the second stage of our quest in the next section.
4.6
Identical Distributions and Random Samples
As mentioned in the introduction, the concept of random trials has two components: independence and identical distributions. Let us consider the recasting of the identically distributed component in terms of random variables.
4.6.1 Identically Distributed Random Variables

Example 4.37 Consider the Bernoulli density function

f(x; θ) = θ^x·(1 − θ)^(1−x), x = 0, 1, where θ = P(X = 1).

Having a sample of n independent trials, say (X1, X2, ..., Xn), amounts to assuming that the random variables X1, X2, ..., Xn are independent, with each Xi having a density function of the form

f(xi; θi) = θi^(xi)·(1 − θi)^(1−xi), xi = 0, 1, i = 1, 2, ..., n, where θi = P(Xi = 1), i = 1, 2, ..., n.

Independence in this case ensures that

f(x1, ..., xn; φ) = ∏_{i=1}^{n} fi(xi; θi) = ∏_{i=1}^{n} θi^(xi)·(1 − θi)^(1−xi), xi = 0, 1, where φ := (θ1, θ2, ..., θn).

Obviously, this does not satisfy the identically distributed component. For that to be the case, we need to impose the restriction that for all trials the probabilistic structure remains the same, i.e. the random variables X1, X2, ..., Xn are also identically distributed in the sense that

f(xi; θi) = θ^(xi)·(1 − θ)^(1−xi), xi = 0, 1, i = 1, 2, ..., n.

Let us formalize the concept of identically distributed random variables in the case of arbitrary but independent random variables, beginning with the two-variable case. In general,
the joint density involves the unknown parameters φ, and the equality in (4.38) takes the form

f(x, y; θ) = fx(x; θ1)·fy(y; θ2), for all (x, y)∈RX×RY,

where the marginal distributions fx(x; θ1) and fy(y; θ2) can be very different. Two independent random variables are said to be identically distributed if fx(x; θ1) and fy(y; θ2) are the same density functions, denoted by fx(x; θ1) ≡ fy(y; θ2), for all (x, y)∈RX×RY, where the equality sign ≡ is used to indicate that all the marginal distributions have the same functional form and the same unknown parameters, i.e. fx(.) = fy(.) and θ1 = θ2.

Example 4.38 Consider the case where f(x, y; θ) = (θ1/x²)·(1/θ2)·e^(−y/θ2), x ≥ 1, y > 0. It is clear that X and Y are independent, with marginal densities

fx(x; θ1) = θ1/x², x ≥ 1;  fy(y; θ2) = (1/θ2)·e^(−y/θ2), y > 0.
However, the random variables X and Y are not identically distributed, because neither of the above conditions for ID is satisfied. In particular, the two marginal densities belong to different families of densities (fx(x; θ1) belongs to the Pareto and fy(y; θ2) to the exponential family); they depend on different parameters, θ1 ≠ θ2; and the two random variables X and Y have different ranges of values.

Example 4.39 Consider the three bivariate distributions (a)–(c) given below:

(a)
x\y     0    2    fx(x)
 1     .18  .12    .3
 2     .42  .28    .7
fy(y)  .6   .4     1

(b)
x\y     0    1    fx(x)
 0     .18  .12    .3
 1     .42  .28    .7
fy(y)  .6   .4     1

(c)
x\y     0    1    fx(x)
 0     .36  .24    .6
 1     .24  .16    .4
fy(y)  .6   .4     1
The random variables (X, Y) are independent in all three cases (verify!). The random variables in (a) are not identically distributed because RX ≠ RY and fx(x) ≠ fy(y) for some (x, y)∈RX×RY. The random variables in (b) are not identically distributed because, even though RX = RY, fx(x) ≠ fy(y) for some (x, y)∈RX×RY. Finally, the random variables in (c) are identically distributed because RX = RY and fx(x) = fy(y) for all (x, y)∈RX×RY.

Example 4.40 When f(x, y; θ) is bivariate Normal, as specified in (4.7), the two marginal density functions have the same functional form, but θ := (μ1, μ2, σ11, σ22), with θ1 := (μ1, σ11) and θ2 := (μ2, σ22) usually different. Hence, for the random variables X and Y to be identically distributed, the two means and the two variances should coincide, μ1 = μ2 and σ11 = σ22, i.e.

f(x; θ1) = (1/√(2πσ11))·exp{−(1/(2σ11))·[x − μ1]²},  f(y; θ2) = (1/√(2πσ11))·exp{−(1/(2σ11))·[y − μ1]²}.
The concept of identically distributed random variables can easily be extended to the n variable case in a straightforward manner.
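The "(verify!)" in Example 4.39 can be automated by checking both conditions directly. A minimal sketch (the dictionary encoding of the three tables is mine):

```python
# Joint distributions of Example 4.39, keyed (x, y) -> probability.
tables = {
    "a": {(1, 0): .18, (1, 2): .12, (2, 0): .42, (2, 2): .28},
    "b": {(0, 0): .18, (0, 1): .12, (1, 0): .42, (1, 1): .28},
    "c": {(0, 0): .36, (0, 1): .24, (1, 0): .24, (1, 1): .16},
}

def marginals(joint):
    fx, fy = {}, {}
    for (x, y), prob in joint.items():
        fx[x] = fx.get(x, 0.0) + prob
        fy[y] = fy.get(y, 0.0) + prob
    return fx, fy

results = {}
for name, joint in tables.items():
    fx, fy = marginals(joint)
    # Independence: f(x, y) = fx(x) * fy(y) for every (x, y).
    indep = all(abs(prob - fx[x] * fy[y]) < 1e-12 for (x, y), prob in joint.items())
    # Identical distribution: same support AND same marginal probabilities.
    ident = set(fx) == set(fy) and all(abs(fx[v] - fy[v]) < 1e-12 for v in fx)
    results[name] = (indep, ident)
    print(name, indep, ident)
# a True False   (R_X = {1,2} differs from R_Y = {0,2})
# b True False   (same support, different marginal probabilities)
# c True True
```

The two booleans separate the two dimensions of the ID condition: matching supports and matching marginal probabilities.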
160
A Simple Statistical Model
Identical distributions. The random variables (X1, X2, ..., Xn) are said to be identically distributed if fk(xk; θk) ≡ f(xk; θ), for all k = 1, 2, ..., n. This has two dimensions:

(i) f1(.) = f2(.) = f3(.) = ··· = fn(.) = f(.), and
(ii) θ1 = θ2 = θ3 = ··· = θn = θ.
Example 4.41 Let X1 and X2 be independent Normal random variables with densities

f(xi) = (1/√(2π))·exp{−(xi − μi)²/2}, xi∈R, i = 1, 2.

X1 and X2 are not identically distributed because they have different means: μ1 ≠ μ2.
4.6.2 A Random Sample of Random Variables

Our first formalization of condition [c] of a random experiment ℰ, where [c] stands for "the experiment can be repeated under identical conditions," took the form of a set of random trials {A1, A2, A3, ..., An} which are both independent and identically distributed (IID):

P(n)(A1 ∩ A2 ∩ ··· ∩ Ak) = P(A1)·P(A2)···P(Ak), for k = 2, 3, ..., n.  (4.44)
Using the concept of a sample X := (X1, X2, ..., Xn), where Xi denotes the ith trial, we can proceed to formalize condition [c] in the form of a set of random variables, X1, X2, ..., Xn, that are both independent (I) and identically distributed (ID).

Random sample. The sample X(n) := (X1, X2, ..., Xn) is called a random sample if the random variables involved are:

(a) I:  f(x1, x2, ..., xn; φ) = ∏_{k=1}^{n} fk(xk; θk), for all x∈Rⁿ_X;
(b) ID: fk(xk; θk) = f(xk; θ), for all k = 1, 2, ..., n, where x := (x1, ..., xn).

That is, the joint density for X(n) := (X1, X2, ..., Xn) is

f(x1, x2, ..., xn; φ) =[I]= ∏_{k=1}^{n} fk(xk; θk) =[IID]= ∏_{k=1}^{n} f(xk; θ), for all x∈Rⁿ_X.  (4.45)
The first equality follows from the independence condition and the second from the ID condition. Note that fk(xk; θk) denotes the marginal distribution of Xk(.), derived via

fk(xk; θk) = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} f(x1, ..., x_{k−1}, xk, x_{k+1}, ..., xn; φ) dx1 ··· dx_{k−1} dx_{k+1} ··· dxn.
As argued in Chapter 2, the formalization of a random experiment was chosen to motivate several concepts because it was simple enough to avoid unnecessary complications. It was also stated, however, that simple stochastic phenomena within the intended scope of a simple statistical model are rarely encountered in economics. One of our first tasks, once the transformation is complete, is to extend it. In preparation for that extension, we note at this
stage that the concept of a random sample is a very special form of what we call a sampling model. Sampling model. A sampling model is a set of random variables (X1 , X2 , . . . , Xn ) (a sample) with a certain probabilistic structure, which relates the observed data to the probability model. In so far as the sampling model is concerned, we note that from the modeling viewpoint the basic components of a random sample X:=(X1 , X2 , . . . , Xn ) are the assumptions of (i) independence and (ii) identical distribution. The validity of these assumptions can often be assessed using a battery of graphical techniques, discussed in Chapters 5 and 6, as well as formal misspecification tests, discussed in Chapter 15.
4.7 Functions of Random Variables
The concept of a function of random variable(s) is crucial in the context of statistical inference since estimators, test statistics, and predictors are such functions.
4.7.1 Functions of One Random Variable

Let X be a random variable defined on the probability space (S, ℱ, P(·)). Consider a real-valued function h(·): RX → R that gives rise to Y = h(X). The first question of interest is whether Y is also a random variable relative to ℱ. The simple answer is yes when h(·) is a Borel (measurable) function, i.e. when

{h(X) ≤ y} = {X ∈ h⁻¹((−∞, y])} ∈ ℱ, ∀y∈R,

which happens when h(·) is Borel, since h⁻¹((−∞, y]) defines events in ℱ. In this case:

F(y) = P(X ∈ h⁻¹((−∞, y])), ∀y∈R.  (4.46)

Although it can be difficult to derive the cdf using (4.46) directly and then proceed to derive the density function, there are cases where one can derive the latter directly using the inverse function g(y) = x:

h(·) monotonically increasing: fY(y) = fX(g(y))·(dg(y)/dy), ∀y∈R,
h(·) monotonically decreasing: fY(y) = fX(g(y))·(−dg(y)/dy), ∀y∈R.  (4.47)

Combining the two results: in cases where X is a continuous random variable and h(·) is a continuous, differentiable Borel function such that h⁻¹(y) = g(y), the density function of Y takes the form

fY(y) = fX(g(y))·|dg(y)/dy|, ∀y∈R.

Example 4.42 Let the distribution of X be Student's t with ν degrees of freedom. The density function of Y = aX for a > 0 is

fY(y) = fX(y/a)·(1/a), ∀y∈R,
since X = Y/a and dg(y)/dy = 1/a. The particular case a = √((ν − 2)/ν) is of great interest because Var(X) = ν/(ν − 2), and thus to ensure that the scaled variable has unit variance one needs to transform X into Y = X/√(ν/(ν − 2)) ⇒ X = Y·√(ν/(ν − 2)), whose density function takes the form

f(y; ν) = [Γ((ν+1)/2) / (Γ(ν/2)·√((ν − 2)π))]·(1 + y²/(ν − 2))^(−(ν+1)/2).

This is the transformation used in Figures 3.21 and 3.23 to ensure that the comparison with X ∼ N(0, 1) makes sense when both random variables have unit variance, since Var(Y) = 1.
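As a sanity check on the rescaled Student's t density of Example 4.42, one can integrate it numerically: for ν = 5 both the total mass and the variance should come out as 1. A sketch (the crude midpoint-rule quadrature is mine, for illustration only):

```python
import math

def f(y, nu):
    # Rescaled Student's t density of Y = X / sqrt(nu/(nu-2)), X ~ t(nu):
    # f(y; nu) = Gamma((nu+1)/2) / (Gamma(nu/2) * sqrt((nu-2)*pi)) * (1 + y^2/(nu-2))^(-(nu+1)/2)
    c = math.gamma((nu + 1) / 2) / (math.gamma(nu / 2) * math.sqrt((nu - 2) * math.pi))
    return c * (1 + y * y / (nu - 2)) ** (-(nu + 1) / 2)

def integrate(g, a, b, n=100_000):
    # simple composite midpoint rule on a wide finite interval
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

nu = 5
total = integrate(lambda y: f(y, nu), -200, 200)
var = integrate(lambda y: y * y * f(y, nu), -200, 200)
print(round(total, 4), round(var, 4))  # 1.0 1.0 — a proper density with unit variance
```

The truncation to [−200, 200] is harmless here because the ν = 5 tails decay like y⁻⁶.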
4.7.2 Functions of Several Random Variables

Functions of two random variables X and Y that are of interest in practice are simple functions such as Z1 = X+Y, Z2 = X−Y, Z3 = X·Y, Z4 = X/Y, Z5 = max(X, Y), Z6 = min(X, Y), whose distributions can be derived directly using the formulae

Discrete:   fZ(z) = Σ_{(x,y): h(x,y)=z} f(x, y),
Continuous: FZ(z) = ∫∫_{(x,y): h(x,y)≤z} f(x, y) dx dy.  (4.48)

Note that in the discrete case one can evaluate the density function directly. In the continuous case, fZ(z) can be evaluated directly for simple functions, such as Z1 = X+Y, Z2 = X−Y, Z3 = X·Y, Z4 = X/Y (Khazanie, 1976):

Functions of two random variables:
Z = X+Y: fX+Y(z) = ∫_{−∞}^{∞} f(z − y, y) dy;
Z = X·Y: fX·Y(z) = ∫_{−∞}^{∞} (1/|x|)·f(x, z/x) dx;
Z = X/Y: fX/Y(z) = ∫_{−∞}^{∞} |y|·f(zy, y) dy.
Example 4.43 Let X and Y be independent Poisson-distributed random variables with densities

f(x; θ1) = e^(−θ1)·θ1^x / x!, x = 0, 1, 2, ...;  f(y; θ2) = e^(−θ2)·θ2^y / y!, y = 0, 1, 2, ...
The distribution of Z = X+Y is Poisson, fZ(z) = e^(−θ1−θ2)·(θ1 + θ2)^z / z!, z = 0, 1, 2, ..., since

fZ(z) = Σ_{(x,y): x+y=z} f(x, y) = Σ_{x=0}^{z} f(X = x, Y = z − x) = Σ_{x=0}^{z} [e^(−θ1)·θ1^x / x!]·[e^(−θ2)·θ2^(z−x) / (z − x)!]
      = e^(−θ1−θ2)·Σ_{x=0}^{z} θ1^x·θ2^(z−x) / [x!(z − x)!] = [e^(−θ1−θ2)/z!]·Σ_{x=0}^{z} (z choose x)·θ1^x·θ2^(z−x) = e^(−θ1−θ2)·(θ1 + θ2)^z / z!.
163
Example 4.44 Let X1 and X1 be independent N(μ, 1), then for Z = X1 +X2 : ∞ f (z; μ) = f1 (z − x2 ; μ)·f2 (x2 ; μ)dx2 =
−∞ ∞ −∞
f2 (z − x1 ; μ)·f1 (x1 ; μ)dx2 , −∞ < z < ∞,
known as the convolution formula. Using Normality, we can deduce that -, ∞ , f (z; μ) = −∞ √1 exp{− 12 (z − x2 − μ)2 } √1 exp{− 12 (x2 − μ)2 } dx2 , 2π 2π
∞ 1 f (z; μ) = −∞ 2π exp − 12 (z − x2 − μ)2 + (x2 − μ)2 dx2 , −∞x) = 1 −
1n
i=1 [1 − Fi (x)]
IID
= 1 − [F(x)]n .
In light of the above results, it should come as no surprise that the ordered sample (X[1], X[2], ..., X[n]) is not a random sample: the random variables X[1], X[2], ..., X[n] are neither independent nor identically distributed. Let us see that in more detail. It turns out that for a random sample (X1, X2, ..., Xn) from a continuous distribution with cdf and density functions F(x; θ), f(x; θ), x∈RX, θ∈Θ, one can use simple combinatorics in conjunction with the IID assumptions to derive the marginal distribution of any order statistic X[k] as functions of F(x; θ), f(x; θ) (David, 1981):
F[k](x) = P(X[k] ≤ x) = Σ_{m=k}^{n} (n choose m)·[F(x; θ)]^m·[1 − F(x; θ)]^(n−m),  (4.52)
f[k](x) = [n! / ((k − 1)!(n − k)!)]·f(x)·[F(x)]^(k−1)·[1 − F(x)]^(n−k).

Example 4.49 Consider the case where (X1, X2, ..., Xn) constitutes a random sample from a uniform distribution:

Xk ∼ U(0, 1), k = 1, 2, ..., n,  (4.53)

with F(x) = x, f(x) = 1, 0 ≤ x ≤ 1. Using the formulae in (4.52), the density function of X[1] for n = 5 takes the form

f[1](x) = [5! / ((1 − 1)!(5 − 1)!)]·(1)·x^(1−1)·(1 − x)^(5−1) = 5(1 − x)⁴, 0 ≤ x ≤ 1.
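Example 4.49 can be cross-checked by simulation: draw many samples of n = 5 uniforms, take the minimum of each, and compare the empirical behavior of X[1] with the formulae in (4.52). The sample sizes and the evaluation point x = 0.2 below are arbitrary choices of mine:

```python
import random

random.seed(1)
n, N = 5, 200_000
mins = [min(random.random() for _ in range(n)) for _ in range(N)]

# From (4.52) with F(x) = x: F_[1](x) = 1 - (1 - x)^5 and f_[1](x) = 5(1 - x)^4,
# so E(X_[1]) = 1/(n + 1) = 1/6.
emp_mean = sum(mins) / N
x = 0.2
emp_cdf = sum(m <= x for m in mins) / N
print(round(emp_mean, 3))                              # close to 1/6 ~ 0.167
print(round(emp_cdf, 3), round(1 - (1 - x) ** 5, 3))   # both close to 0.672
```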
Note that 0! = 1 by convention. In addition, the joint distribution of any two order statistics X[i], X[j], for −∞ < … […]

Bivariate gamma (Kibble)
f(x, y; θ) = [Γ(β)(1 − α)]⁻¹·exp{−(x + y)/(1 − α)}·[xy/α]^((β−1)/2)·I_{β−1}(2√(αxy)/(1 − α)),
θ := (α, β) ∈ [0, 1]×R₊, x ≥ 0, y ≥ 0,
where I_n(z) = Σ_{k=0}^{∞} (z/2)^(n+2k) / [k!·Γ(n + k + 1)] is a modified Bessel function (Abramowitz and Stegun, 1970).
Marginals and conditionals: f(x; β), f(y; β), and f(y|x; β, α) are also gamma.
Numerical characteristics: E(X) = β, Var(X) = β, Corr(X, Y) = α.
Bivariate gamma (Cherian, 1941)
f(x, y; θ) = [e^(−(x+y)) / (Γ(θ0)Γ(θ1)Γ(θ2))]·∫_0^{min(x,y)} e^z·z^(θ0−1)·(x − z)^(θ1−1)·(y − z)^(θ2−1) dz,
θ := (θ0, θ1, θ2)∈R³₊, x ≥ 0, y ≥ 0.
Marginals and conditionals: f(x; θ), f(y; θ), and f(y|x; θ) are also gamma.
Numerical characteristics: E(X) = θ1 + θ0, Var(X) = θ1 + θ0, Corr(X, Y) = θ0/√((θ1 + θ0)(θ2 + θ0)).
Bivariate gamma (McKay)
f(x, y; θ) = [a^(θ1+θ2) / (Γ(θ1)Γ(θ2))]·e^(−ay)·x^(θ1−1)·(y − x)^(θ2−1), y > x ≥ 0, θ := (a, θ1, θ2)∈R³₊.
Marginals and conditionals: f(x; θ), f(y; θ), and f(y|x; θ) are gamma, but f(x|y; θ) is beta.
Numerical characteristics: E(X) = θ1/a, Var(X) = θ1/a², Corr(X, Y) = √(θ1/(θ1 + θ2)).

Bivariate Normal
f(x, y; θ) = [2π√(σ11σ22(1 − ρ²))]⁻¹·exp{−(1/(2(1 − ρ²)))·(Ỹ² − 2ρỸX̃ + X̃²)},
Ỹ := (y − μ1)/√σ11, X̃ := (x − μ2)/√σ22, x∈R, y∈R,
θ := (μ1, μ2, σ11, σ22, ρ)∈R²×R²₊×[−1, 1].
Marginals and conditionals: f(x; θ2), f(y; θ1), and f(y|x; θ) are Normal.
Numerical characteristics: E(Y) = μ1, E(X) = μ2, Var(Y) = σ11, Var(X) = σ22, Corr(X, Y) = ρ.
Appendix 4.A: Bivariate Distributions

Bivariate Pareto
f(x, y; θ) = γ(γ + 1)(αβ)^(γ+1)·[αx + βy − αβ]^(−(γ+2)), θ := (α, β, γ), x > β > 0, y > α > 0, γ > 0.
Marginals and conditionals: f(y; α, γ), f(x; β, γ), and
f(y|x; θ) = β(γ + 1)(αx)^(γ+1)·[αx + βy − αβ]^(−(γ+2)).
All three densities are Pareto.
Numerical characteristics: E(Y) = αγ/(γ − 1), E(X) = βγ/(γ − 1),
Var(Y) = α²γ/[(γ − 1)²(γ − 2)], Var(X) = β²γ/[(γ − 1)²(γ − 2)], Corr(Y, X) = 1/γ, for γ > 2.
Bivariate Pearson type II
f(x, y; θ) = [(1 − ρ²)σ11σ22]^(−1/2)·[(ν + 1)/(2π(ν + 2))]·[1 − (1 − ρ²)⁻¹·(Ỹ² − 2ρỸX̃ + X̃²)/(2(ν + 2))]^ν,
Ỹ := (y − μ1)/√σ11, X̃ := (x − μ2)/√σ22, θ := (μ1, μ2, σ11, σ22, ρ)∈R²×R²₊×[−1, 1],
ν > 0, −c√σ11 < y < c√σ11, −c√σ22 < x < c√σ22. […]

… 3; Chapter 3). To see that, Figure 5.34 (a reproduction of Figure 3.21, shown here for convenience) compares the Normal (bold) with the Student's t with ν = 5, and Figure 5.35 compares the Normal (bold) and the Student's t for ν = 3, 4, 5, 10; ν = 10 is closest to and ν = 3 is farthest from the Normal; note the comment for Figure 3.21 in Chapter 3. Figure 5.33 displays IID data from the Cauchy distribution, a special case of the Student's t with 1 d.f., and exhibits even more leptokurticity than that in Figure 5.32.
5.4.4 The Histogram, the Density Function, and Smoothing

Arguably, the histogram provided the motivation for the probabilistic notion of a density function, and for a long time the dividing line between the two was blurred, until the twentieth
century.

[Fig. 5.35 Normal density (bold) vs. Student's t with ν = 3, 4, 5, 10]

As a result, relative frequencies and probabilities, as well as unknown parameters and their estimates, were conflated, until R. A. Fisher (1922a) pointed out the confusion:

During the rapid development of practical statistics in the past few decades, the theoretical foundations of the subject have been involved in great obscurity. Adequate distinction has seldom been drawn between the sample recorded and the hypothetical population from which it is regarded as drawn. (Fisher, 1922a, p. 333)
Unfortunately, for statistics the problem of terminology diagnosed by Fisher a century ago is still bedeviling the subject, even though the distinction is clearly established. The confusion between unknown parameters and their estimates pervades the statistical literature of the early twentieth century (Galton, Edgeworth, Karl Pearson, Yule) because this literature had one leg in the descriptive statistics tradition of the previous century and the other in the statistical inference tradition which began with Fisher in the 1920s and 1930s. R. A. Fisher, in his book Statistical Methods for Research Workers, published in 1925, begins the second chapter, entitled “Diagrams” (devoted to the usefulness of graphical techniques in statistical inference), as follows: The preliminary examination of most data is facilitated by the use of diagrams. Diagrams prove nothing, but bring outstanding features readily to the eye; they are therefore no substitute for such critical tests as may be applied to the data, but are valuable in suggesting such tests, and in explaining the conclusions founded upon them. (Fisher, 1925a, p. 24)
The histogram constitutes a graphical way to summarize the relative frequencies of occurrence of the values (x1, x2, ..., xn) of the variable X underlying this data. Let us first partition the range of values of the variable, say a_i … […]

Concordance:  c = 2·Σ_i Σ_j π_ij·(Σ_{h>i} Σ_{k>j} π_hk).
Discordance:  d = 2·Σ_i Σ_j π_ij·(Σ_{h>i} Σ_{k<j} π_hk).

The association between X and Y is positive if (c − d) > 0 and negative if (c − d) < 0. A scaled version of the distance (c − d) is the so-called gamma coefficient, introduced by Goodman and Kruskal (1954) and defined by

γ = (c − d)/(c + d), where −1 ≤ γ ≤ 1.
Statistical Models and Dependence
Like the correlation coefficient, if |γ| = 1 the two random variables are perfectly associated. Moreover, like the correlation coefficient, if γ = 0 the two random variables are not necessarily independent. Independence, however, implies that γ = 0.

Example 6.13 Consider the joint density function represented in (6.31), where X denotes age bracket and Y income bracket: X = 1: (18–35), X = 2: (36–55), X = 3: (56–70); Y = 0: poor, Y = 1: middle income, Y = 2: rich.

x\y     0     1     2    fx(x)
 1     .20   .10   .01    .31
 2     .10   .25   .06    .41
 3     .15   .05   .08    .28
fy(y)  .45   .40   .15     1        (6.31)
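The concordance, discordance, and gamma coefficients for table (6.31) can be computed by brute force over all pairs of cells. A sketch (the dictionary encoding is mine):

```python
# Joint probabilities of table (6.31): rows x = 1, 2, 3; columns y = 0, 1, 2.
p = {(1, 0): .20, (1, 1): .10, (1, 2): .01,
     (2, 0): .10, (2, 1): .25, (2, 2): .06,
     (3, 0): .15, (3, 1): .05, (3, 2): .08}

cells = list(p)
# Concordant pairs: both coordinates move in the same direction.
conc = 2 * sum(p[a] * p[b] for a in cells for b in cells
               if b[0] > a[0] and b[1] > a[1])
# Discordant pairs: the coordinates move in opposite directions.
disc = 2 * sum(p[a] * p[b] for a in cells for b in cells
               if b[0] > a[0] and b[1] < a[1])
gamma = (conc - disc) / (conc + disc)
print(round(conc, 3), round(disc, 3), round(gamma, 3))  # 0.27 0.16 0.256
```

The positive γ reflects the (weak) positive association between age and income in this table.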
Consider evaluating the concordance coefficient:

i = 0, j = 1: π01·Σ_{h>0}Σ_{k>1} π_hk = .20(.25 + .05 + .06 + .08) = 0.088,
i = 0, j = 2: π02·Σ_{h>0}Σ_{k>2} π_hk = .10(.05 + .08) = 0.013,
i = 1, j = 1: π11·Σ_{h>1}Σ_{k>1} π_hk = .10(.06 + .08) = 0.014,
i = 1, j = 2: π12·Σ_{h>1}Σ_{k>2} π_hk = .25(.08) = 0.020,
c = 2(0.088 + 0.013 + 0.014 + 0.020) = 0.270.

The discordance coefficient:

i = 0, j = 2: π02·Σ_{h>0}Σ_{k<2} π_hk = .10(.10 + .01) = 0.011,
i = 0, j = 3: π03·Σ_{h>0}Σ_{k<3} π_hk = .15(.10 + .25 + .01 + .06) = 0.063,
i = 1, j = 2: π12·Σ_{h>1}Σ_{k<2} π_hk = .25(.01) = 0.0025,
i = 1, j = 3: π13·Σ_{h>1}Σ_{k<3} π_hk = .05(.01 + .06) = 0.0035,
d = 2(0.011 + 0.063 + 0.0025 + 0.0035) = 0.160,

so that γ = (0.270 − 0.160)/(0.270 + 0.160) ≈ 0.256.

[…] Events A and B are said to be conditionally independent given an event D with P(D) > 0, if P(A ∩ B | D) = P(A | D)·P(B | D). That is, knowing that D has occurred renders the events A and B independent. Going from events to random variables brings into the discussion the key role played by the quantifiers pertaining to their different values. The random variables X and Y are said to be conditionally independent given Z, if

f(x, y | z) = f(x | z)·f(y | z), for all (x, y, z)∈[RX × RY × RZ],
(6.32)
where RZ :={z ∈ R: fz (z)>0} is the support set of fz (z). That is, the joint density of (X, Y, Z) factors into two conditional densities. Intuitively, X and Y are conditionally independent given Z if X and Y are related only through Z. Note that (6.32) can be defined equivalently by (M): f (y|x, z) = f (y|z), for all (x, y, z)∈ [RX ×RY ×RZ ]
(6.33)
since, in general, f(x, y | z) = f(y | x, z)·f(x | z). In (6.33), if we interpret Y as the "future," X as the "past," and Z as the "present," (M) says: "given the present, the future is independent of the past" [Markov dependence]. […]
7 Regression Models
7.1 Introduction
In Chapter 6 we took the first short excursion into non-random sample territory, and argued that the key to dealing with a dependent sample X := (X1, X2, ..., Xn) is the use of sequential conditioning, which simplifies the joint distribution down to a product of univariate distributions:

f(x1, x2, ..., xn; φ) =[non-IID]= f1(x1; ψ1)·∏_{k=2}^{n} fk(xk | x_{k−1}, ..., x1; ψk), ∀x∈Rⁿ_X,  (7.1)
where “∀” denotes “for all.” This addresses the dimensionality problem because the left-hand side (LHS) is n-dimensional and the right-hand side (RHS) is a product of one-dimensional densities. This reduction, however, raises the problem of the increasing conditioning information set, since the past history of X at k is different from that at k − 1. In Chapter 6 it was mentioned that a way to circumvent this is to assume Markov dependence which simplifies (7.1) to f (x1 , x2 , . . . , xn ; φ)
Markov
=
f1 (x1 ; ψ 1 )
n 1 k=2
fk (xk |xk−1 ; ψ k ), ∀x∈RnX .
(7.2)
Even after this simplification, however, the reduction in (7.2) does not give rise to an operational (estimable) model, since we still need to address the heterogeneity problem associated with fk(.|.) and ψk changing with index k, which will be addressed in Chapter 8. The good news is that the RHS of (7.2) suggests that Markov dependence enables us to focus exclusively on two-dimensional distributions, since

fk(xk, x_{k−1}; ϕk) = fk(xk | x_{k−1}; ψ1k)·f_{k−1}(x_{k−1}; ψ2k), (xk, x_{k−1})∈R²_X, k = 2, 3, ..., n.  (7.3)

In light of (7.3), in this chapter we will assume away the heterogeneity problem (see Chapter 8), and instead focus on the simple case of two random variables X and Y:

f(x, y; φ) = f(y|x; ϕ1)·fx(x; ϕ2), ∀(x, y)∈RX × RY.  (7.4)
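To make the Markov reduction (7.2) concrete, the sketch below assembles the joint distribution of a hypothetical two-state Markov chain from an initial density and one-step conditionals only (all numbers are illustrative, not from the text):

```python
import itertools

f1 = {0: 0.6, 1: 0.4}                        # initial density f1(x1)
f_trans = {(0, 0): 0.7, (0, 1): 0.3,
           (1, 0): 0.2, (1, 1): 0.8}         # one-step conditional f(x_k | x_{k-1})

def joint(xs):
    # The Markov reduction (7.2): f(x1,...,xn) = f1(x1) * prod_{k=2}^n f(x_k | x_{k-1})
    prob = f1[xs[0]]
    for prev, cur in zip(xs, xs[1:]):
        prob *= f_trans[(prev, cur)]
    return prob

n = 4
total = sum(joint(xs) for xs in itertools.product((0, 1), repeat=n))
print(round(total, 12))  # 1.0 — the n-dimensional joint distribution is fully
                         # determined by one-step (two-dimensional) conditionals
```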
In this simple case the increasing conditioning set and the heterogeneity problems do not arise, but there is still an important issue to address. A bird’s-eye view of the chapter. Despite appearances, the reduction in (7.4) does not represent a joint distribution on the left and a product of one conditional and one marginal distribution on the right! The concept of a conditional distribution (Chapter 6) is defined with respect to a specific value of the conditioning variable X = x, f (y|X = x; ϕ 1 ). Hence, the quantifier ∀(x, y)∈RX ×RY indicates that for each value of the conditioning variable, x∈RX , there exists a different conditional distribution. This means that f (y|x; ϕ 1 ), ∀(x, y)∈RX × RY
(7.5)
defines a (possibly infinite) family of conditional densities indexed by different values of the random variable X. How do we address this problem? The answer will be given in Section 7.2 in terms of a statistical model known as regression, specified in terms of the conditional moment functions: E(Y r |X = x) = hr (x), ∀x∈RX , r = 1, 2, . . .
(7.6)
Of particular interest are the regression and skedastic functions, E(Y|X = x) = h(x) and Var(Y|X = x) = g(x), ∀x∈RX, respectively. The reduction in (7.4) raises an additional problem that has to do with the circumstances under which one can specify the regression model exclusively in terms of the family of conditional distributions in (7.5), that is, ignore the marginal distribution fx(x; ϕ2), ∀x∈RX. In Section 7.3, it is argued that there are cases where the latter distribution needs to be retained as an integral component of a more general stochastic regression model. This is achieved by conditioning on σ(X), the σ-field generated by X, for which (7.6) takes the new form E(Y^r|σ(X)) = hr(X), X ∼ D(·), r = 1, 2, ... This generalizes the concept of conditional expectation and requires different handling from that in (7.6), as discussed in Section 7.3. The new concept is used in Section 7.4 to define the notion of a statistical generating mechanism (GM) that takes the generic form of an orthogonal decomposition: Y = E(Y|σ(X)) + u, where the systematic [μ(X) = E(Y|σ(X))] and non-systematic [u = Y − E(Y|σ(X))] components are orthogonal: E(μ(X)·u|σ(X)) = 0. As argued in the sequel, this provides the main link between a statistical and a substantive (structural) model. Section 7.5 considers the question of modeling heterogeneity, in addition to dependence, in the context of regression models, by focusing on specific forms of heterogeneity.
7.2 Conditioning and Regression
7.2.1 Reduction and Conditional Moment Functions

In order to shed light on why (7.5) represents as many conditional distributions as the values taken by the random variable X, let us consider this issue using examples.

Example 7.1
Consider the joint and marginal distributions given below:

x\y     0     1     2    fx(x)
 1     .20   .10   .01    .31
 2     .10   .25   .06    .41
 3     .15   .05   .08    .28
fy(y)  .45   .40   .15     1        (7.7)
According to (7.5) this joint distribution gives rise to three different conditional distributions, f(y|X = x) for x = 1, 2, and 3, given by (see Chapter 4)

f(y|x = 1): f(x=1, y=0)/fx(x=1) = .20/.31, f(x=1, y=1)/fx(x=1) = .10/.31, f(x=1, y=2)/fx(x=1) = .01/.31;
f(y|x = 2): f(x=2, y=0)/fx(x=2) = .10/.41, f(x=2, y=1)/fx(x=2) = .25/.41, f(x=2, y=2)/fx(x=2) = .06/.41;
f(y|x = 3): f(x=3, y=0)/fx(x=3) = .15/.28, f(x=3, y=1)/fx(x=3) = .05/.28, f(x=3, y=2)/fx(x=3) = .08/.28;

that is,

y            0     1     2
f(y|x = 1)  .645  .323  .032
f(y|x = 2)  .244  .610  .146
f(y|x = 3)  .536  .179  .285        (7.8)
The above example suggests that any attempt to deal with the modeling of the reduction (7.4) by concentrating on the moments of the distributions involved cannot work, since for each value X = x the moments are different. The conditional densities f(y|x; ϕ1), (x, y)∈RX×RY give rise to a possibly infinite number of conditional moments. To illustrate this, let us return to Example 7.1.

Example 7.1 (continued) For the joint distribution given in (7.7), there correspond three different conditional distributions, as given in (7.8), giving rise to three pairs of conditional means and variances:

f(y|X = 1): E(Y|X = 1) = .387, Var(Y|X = 1) = .301,
f(y|X = 2): E(Y|X = 2) = .902, Var(Y|X = 2) = .380,
f(y|X = 3): E(Y|X = 3) = .749, Var(Y|X = 3) = .758.
(7.9)
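The conditional moments in (7.9) can be recomputed directly from the joint distribution (7.7). A sketch (note that the hand-computed values in the text round the conditional probabilities to three decimals before taking moments, so the last digit can differ):

```python
# Joint distribution (7.7): rows x = 1, 2, 3; columns y = 0, 1, 2.
p = {(1, 0): .20, (1, 1): .10, (1, 2): .01,
     (2, 0): .10, (2, 1): .25, (2, 2): .06,
     (3, 0): .15, (3, 1): .05, (3, 2): .08}

results = {}
for x in (1, 2, 3):
    fx = sum(p[(x, y)] for y in (0, 1, 2))
    cond = {y: p[(x, y)] / fx for y in (0, 1, 2)}        # f(y | X = x)
    h = sum(y * q for y, q in cond.items())              # regression function h(x)
    g = sum((y - h) ** 2 * q for y, q in cond.items())   # skedastic function g(x)
    results[x] = (round(h, 3), round(g, 3))
    print(x, *results[x])
# 1 0.387 0.302
# 2 0.902 0.381
# 3 0.75 0.759
# (the text's .301, .380, .749, .758 round the conditional probabilities first,
#  hence the small last-digit gaps)
```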
Intuition suggests that to be able to retain all the information in f(y|x; ϕ1), (x, y)∈RX×RY, we need to use all its conditional moments, but how do we combine them? The obvious answer is to make the conditional moments functions of the different values X = x, i.e. transform the moments into functions of x∈RX. Formally this amounts to extending the conditional moments (Chapter 6),

E(Y^r|X = x) = ∫_{y∈RY} y^r·f(y|x) dy, r = 1, 2, ...,
E([Y − E(Y|X = x)]^r|X = x) = ∫_{y∈RY} [y − E(y|x)]^r·f(y|x) dy, r = 2, 3, ...,

by rendering them functions of x∈RX to construct the conditional moment functions:

Raw:     E(Y^r|X = x) = hr(x), x∈R, r = 1, 2, ...;
Central: E([Y − E(Y|X = x)]^r|X = x) = gr(x), x∈R, r = 2, 3, ...  (7.10)

Note that for discrete random variables the integral is replaced by a summation. Of particular interest are the first in each category:

(i) Regression function: E(Y|X = x) = h(x), x∈RX;
(ii) Skedastic function: Var(Y|X = x) = g(x), x∈RX.  (7.11)
Example 7.1 (continued) For the joint distribution given in (7.7) and the conditional moments (7.9), we can define the functions below, known as the regression and skedastic functions associated with f(y|x; ϕ1), (x, y)∈RX×RY:

x   h(x) = E(Y|X = x)        x   g(x) = Var(Y|X = x)
1   .387                     1   .301
2   .902                     2   .380
3   .749                     3   .758        (7.12)

Example 7.2 Consider the case where f(x, y; φ) is bivariate Normal of the form

(Y, X)ᵀ ∼ N( (μ1, μ2)ᵀ, [[σ11, σ12], [σ12, σ22]] ),  (7.13)
where φ := (μ1, μ2, σ11, σ22, σ12); μ1 = E(Y), μ2 = E(X), σ11 = Var(Y), σ22 = Var(X), σ12 = Cov(X, Y). f(y|x; ϕ1) and fx(x; ϕ2) in (7.4) take the form

(Y|X = x) ∼ N(β0 + β1x, σ²), x∈R,  X ∼ N(μ2, σ22),
β0 = μ1 − β1μ2 ∈ R, β1 = σ12/σ22 ∈ R, σ² = (σ11 − σ12²/σ22) ∈ R₊,  (7.14)

ϕ1 := (β0, β1, σ²), ϕ2 := (μ2, σ22). In this case, the conditional distribution f(y|x; ϕ1), (x, y)∈RX×RY represents an infinite family of conditional densities, one for each value x∈R. To account for all the information, one could use the conditional moment functions:

E(Y|X = x) = β0 + β1x, Var(Y|X = x) = σ², x∈R.  (7.15)
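The formulas in (7.14)/(7.15) can be verified numerically by integrating f(y|x) = f(x, y)/fx(x) directly and comparing with β0 + β1x and σ². A sketch, using the parameter values that appear in Example 7.6 below (the crude midpoint-rule quadrature is mine):

```python
import math

mu1, mu2, s11, s22, s12 = 1.0, 2.0, 0.8, 1.8, 0.6
rho2 = s12 ** 2 / (s11 * s22)

def f_joint(x, y):
    # bivariate Normal density written out explicitly
    det = s11 * s22 * (1 - rho2)
    q = (s22 * (y - mu1) ** 2 - 2 * s12 * (y - mu1) * (x - mu2)
         + s11 * (x - mu2) ** 2) / det
    return math.exp(-q / 2) / (2 * math.pi * math.sqrt(det))

def cond_moments(x, lo=-20.0, hi=20.0, n=4000):
    # E(Y|X=x) and Var(Y|X=x) by numerical integration of f(y|x) = f(x,y)/fx(x)
    h = (hi - lo) / n
    ys = [lo + (i + 0.5) * h for i in range(n)]
    w = [f_joint(x, y) for y in ys]
    fx = sum(w) * h
    m = sum(y * wi for y, wi in zip(ys, w)) * h / fx
    v = sum((y - m) ** 2 * wi for y, wi in zip(ys, w)) * h / fx
    return m, v

b1 = s12 / s22                 # slope in (7.15)
b0 = mu1 - b1 * mu2            # intercept
sig2 = s11 - s12 ** 2 / s22    # conditional variance

for x in (0.0, 1.0, 3.0):
    m, v = cond_moments(x)
    assert abs(m - (b0 + b1 * x)) < 1e-4 and abs(v - sig2) < 1e-4
print("E(Y|X=x) = %.3f + %.3f x, Var(Y|X=x) = %.3f" % (b0, b1, sig2))
```

Note that the regression function is linear in x and the skedastic function is free of x, exactly as (7.15) asserts for the Normal case.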
The use of the concept of functions deals directly with the problem of many different sets of conditional moments by rendering the moments functions of the values of the conditioning
variable. In cases where these functions can be defined in terms of specific functional forms, they provide easy ways to model dependence.

7.2.1.1 Clitic and Kurtic Functions

The question that naturally arises at this stage is: why consider only the first two conditional moment functions (regression and skedastic) in modeling dependence? We know that in general we need many (often an infinite number) of moments to characterize distributions (see Chapter 3). Focusing exclusively on the first two conditional moment functions is justified only in limited circumstances, such as when the bivariate distribution is Normal, since the latter is characterized by its first two moments. Probability theory suggests that there are good reasons to believe that when dealing with non-symmetric joint distributions some additional conditional moment functions might be needed. The next two central conditional moment functions, first introduced by Pearson (1905), are:
(iii) Clitic function: E([Y − E(Y|X = x)]³|X = x) = g3(x), x∈RX;
(iv) Kurtic function: E([Y − E(Y|X = x)]⁴|X = x) = g4(x), x∈RX.

Example 7.3 In the case of the bivariate beta distribution (Appendix 4.A), (iii) and (iv) are:

E([Y − E(Y|X = x)]³|X = x) = [2θ2θ3(θ3 − θ2) / ((θ2 + θ3)³(1 + θ2 + θ3)(2 + θ2 + θ3))]·(1 − x)³, x∈[0, 1];
E([Y − E(Y|X = x)]⁴|X = x) = [3θ2θ3(2θ2² − 2θ2θ3 + θ2²θ3 + 2θ3² − θ2θ3²) / ((θ2 + θ3)⁴(1 + θ2 + θ3)(2 + θ2 + θ3)(3 + θ2 + θ3))]·(1 − x)⁴.

As we can see, the bivariate beta distribution yields heteroclitic and heterokurtic functions; the terminology was introduced by Pearson (1905).

Example 7.4 For the bivariate Student's t distribution, these functions take the form

E([Y − E(Y|X = x)]³|X = x) = 0, x∈R;
E([Y − E(Y|X = x)]⁴|X = x) = [3(ν − 1)/(ν − 3)]·[Var(Y|X)]², x∈R.

These are homoclitic and heterokurtic functions; the latter is of special form, being a function of the skedastic function. This suggests that for the Student's t distribution the first two conditional moments contain all the relevant information.
7.2.2 Regression and Skedastic Functions

The main objective of regression models is to account for the information f(y|x; ϕ1), (x, y)∈RX×RY, using the first few conditional moment functions as defined in (7.10). The current literature on regression models concentrates, almost exclusively, on the first two such conditional moment functions. It turns out that for most bivariate distributions we can derive these functions explicitly.
The regression function is defined to be the conditional mean of Y given X = x, viewed as a function of x: E(Y|X = x) = h(x), ∀x∈RX .
(7.16)
The term regression was first coined by Galton (1885); see Gorroochurn (2016). The skedastic function is defined to be the conditional variance viewed as a function of x: Var(Y|X = x) = g(x), ∀x∈RX .
(7.17)
The term “skedastic” was first coined by Pearson (1905) and it is based on the Greek words σ κ ε´ δασ η = scattering and σ κεδασ τ o´ ς = scattered. In what follows, we refer to the graphs (h(x), x) and (g(x), x), ∀x∈RX as regression and skedastic curves, respectively. There are several crucial features of the regression and skedastic functions that need to be brought out explicitly. First, these functions depend crucially on the underlying joint distribution in the sense that, for any bivariate distribution f (x, y; φ), ∀(x, y)∈RX ×RY whose first two moments exist, one can derive the functional forms of the regression and skedastic functions using the formulae in (7.10). From (7.15) we can see that for the bivariate Normal, the regression function is a linear function of x and the skedastic function is free of x; see Figures 7.1 and 7.2 with parameter values μ1 = 1.5, μ2 = 1, σ 11 =1, σ 22 = 1 and three different values of σ 12 =−0.8, 0.1, 0.9. Second, from the probabilistic perspective the reduction in (7.4) is symmetric with respect to the random variables X and Y, and thus one can define the reduction to be f (x, y; φ) = f (x|y; ϕ 1 )·fy (y; ϕ 2 ), ∀(x, y)∈RX × RY .
(7.18)
Hence, f(x|y; ϕ1) can be used to define the reverse regression and skedastic functions:

E(X|Y = y) = h(y), Var(X|Y = y) = g(y), ∀y∈RY.   (7.19)
[Fig. 7.1  Normal: regression lines]

[Fig. 7.2  Normal: skedastic lines]
7.2 Conditioning and Regression
Example 7.5  In the case of the bivariate Normal distribution:

E(X|Y = y) = α0 + α1y, Var(X|Y = y) = v², ∀y∈R,
α0 = μ2 − α1μ1, α1 = σ12/σ11, v² = σ22 − (σ12²/σ11).   (7.20)

How do the regression lines in (7.15) and (7.20) relate to the bivariate Normal? The answer is that E(Y|X = x) = h(x), ∀x∈R, links directly to the contours of the distribution in Figure 7.3, as shown in Figure 7.4.
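The two parameterizations in (7.15) and (7.20) are easy to evaluate numerically. A minimal sketch in plain Python, using the moment values that Example 7.6 below works through by hand:

```python
# Forward and reverse regression parameters implied by a bivariate Normal
# with moments (mu1, mu2, sigma11, sigma22, sigma12); the values are the
# illustrative ones of Example 7.6.
mu1, mu2 = 1.0, 2.0            # E(Y), E(X)
s11, s22, s12 = 0.8, 1.8, 0.6  # Var(Y), Var(X), Cov(X, Y)

# E(Y|X = x) = b0 + b1*x, Var(Y|X = x) = sigma2  (regression of Y on X)
b1 = s12 / s22
b0 = mu1 - b1 * mu2
sigma2 = s11 - s12 ** 2 / s22

# E(X|Y = y) = a0 + a1*y, Var(X|Y = y) = v2      (reverse regression)
a1 = s12 / s11
a0 = mu2 - a1 * mu1
v2 = s22 - s12 ** 2 / s11

print(b0, b1, sigma2)  # approx 0.333, 0.333, 0.6
print(a0, a1, v2)      # approx 1.25, 0.75, 1.35
```

Note that both fitted lines pass through the center of the ellipse (x, y) = (2, 1): b0 + b1·2 = 1 and a0 + a1·1 = 2, exactly as the text observes about Figure 7.4.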
[Fig. 7.3  Bivariate Normal density with equal-probability contours]

[Fig. 7.4  Bivariate Normal contours and regression lines: E(Y|X = x), E(X|Y = y)]
Example 7.6  Consider the following values for the moments of the bivariate Normal in Example 7.4: μ1 = 1, μ2 = 2, σ11 = .8, σ22 = 1.8, σ12 = .6. These values can be used to evaluate the parameters of both regression lines:

β1 = σ12/σ22 = .6/1.8 = .333,  β0 = μ1 − β1μ2 = 1 − (.333)2 = .334,  σ² = σ11 − σ12²/σ22 = .8 − (.6)²/1.8 = .6,

α1 = σ12/σ11 = .6/.8 = .75,  α0 = μ2 − α1μ1 = 2 − (.75)1 = 1.25,  v² = σ22 − σ12²/σ11 = 1.8 − (.6)²/.8 = 1.35,

E(Y|X = x) = .334 + .333x,  E(X|Y = y) = 1.25 + .75y.

These two regression lines can be seen in Figure 7.4 together with the equal-probability contours. It is important to point out that they both pass through the center of the ellipse (2, 1), whose principal axis lies between these lines. Moreover, these regression lines cut the contours at the points of tangency with straight lines parallel to the y-axis and x-axis, respectively; see Galton (1886) and Gorroochurn (2016).

In Table 7.1 several regression and skedastic functions associated with different bivariate continuous distributions (Appendix 4.A) are listed, in an attempt to bring out the wide variety of functional forms and the unique place of the bivariate Normal distribution in the set of regression models; see Ord (1972). Not surprisingly, the two functional forms of the Normal regression and skedastic functions are used as yardsticks for comparison with other distributions. The regression function
associated with the bivariate Normal is linear in both x and the parameters β0, β1, and the skedastic function is constant: it does not change with x∈R. These features motivate the following terminology.

Regression Functional Form Terminology. Linearity of the regression function has two dimensions:

(i) Linear in x, when h(x) = E(Y|X = x) is a linear function of x∈RX.
(ii) Linear in the parameters, when the parameters of the function h(x) appear in linear form.
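The practical content of dimension (ii) is that the parameters enter h(x) only through a linear combination of known functions of x, so they obey additivity and scaling. A minimal plain-Python sketch (all numerical values hypothetical):

```python
# h(x) = a0 + a1*x + a2*x^2 is non-linear in x but linear in (a0, a1, a2):
# as a function of the parameter vector it is additive and homogeneous.
def h(params, x):
    a0, a1, a2 = params
    return a0 + a1 * x + a2 * x ** 2

theta = (1.0, -2.0, 0.5)   # hypothetical parameter values
phi = (0.3, 0.1, -0.2)
x = 1.7

# Linear in the parameters: h(theta + phi, x) = h(theta, x) + h(phi, x)
lhs = h(tuple(t + p for t, p in zip(theta, phi)), x)
rhs = h(theta, x) + h(phi, x)

# ...but non-linear in x: doubling x does not double h (beyond the intercept)
print(lhs, rhs, h(theta, 2 * x), 2 * h(theta, x))
```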
Example 7.7  (a) The second-degree polynomial h(x) = a0 + a1x + a2x² is non-linear in x but linear in the parameters (a0, a1, a2).

(b) The second-degree polynomial h(x) = γ1 − γ3(x − γ2)² is non-linear in both the parameters (γ1, γ2, γ3) and x.

Looking at Table 7.1, it is clear that for the Normal, Student's t, and Pearson type II bivariate distributions the regression functions are identical. This is because all three distributions belong to the elliptically symmetric family of distributions. The rest of the bivariate distributions – the Pareto, beta, F, and three gamma distributions – have regression functions that are linear in x, but non-linear in the parameters. The exponential (Figure 7.5), the logistic
Table 7.1  Regression models (bivariate continuous distributions)

Bivariate Normal:
E(Y|X = x) = β0 + β1x, Var(Y|X = x) = σ², ∀x∈R,
β0 = μ1 − β1μ2 ∈ R, β1 = σ12/σ22 ∈ R, σ² = σ11 − σ12²/σ22 ∈ R+

Bivariate Student's t:
E(Y|X = x) = β0 + β1x, Var(Y|X = x) = [νσ²/(ν−1)]{1 + (1/(νσ22))[x−μ2]²}, ∀x∈R,
(β0, β1, σ²) ∈ R²×R+, same as the Normal

Bivariate Pearson II:
E(Y|X = x) = β0 + β1x, Var(Y|X = x) = σ²[1 − (x−μ2)²/((2ν+3)σ22)], ∀x ∈ ±√(2(ν+2)σ22),
(β0, β1, σ²) ∈ R²×R+, same as the Normal

Bivariate exponential:
E(Y|X = x) = (1+θ+θx)/(1+θx)², Var(Y|X = x) = [(1+θ+θx)² − 2θ²]/(1+θx)⁴, x∈R+

Bivariate Pareto:
E(Y|X = x) = θ1 + [θ1/(θ2θ3)]x, Var(Y|X = x) = [θ1²(1+θ3)/(θ2²θ3²(θ3−1))]x², x∈R+

Bivariate logistic:
E(Y|X = x) = 1 − ln[1 + exp(−(x−μ)/σ)], Var(Y|X = x) = 2.29, x∈R

Bivariate beta:
E(Y|X = x) = [θ2/(θ2+θ3)](1−x), Var(Y|X = x) = [θ2θ3/((θ2+θ3)²(1+θ2+θ3))](1−x)², x∈[0, 1]

Bivariate F:
E(Y|X = x) = (θ0+θ1x)/(θ0+θ1−2), Var(Y|X = x) = [2(θ1+θ2+θ0−2)(θ0+θ1x)²]/[θ2(θ1+θ0−4)(θ1+θ0−2)²], ∀x∈R+

Bivariate gamma (Kibble):
E(Y|X = x) = θ2(1−θ1) + θ1x, Var(Y|X = x) = (1−θ1)[θ2(1−θ1) + 2θ1x], ∀x∈R+

Bivariate gamma (Cherian):
E(Y|X = x) = θ2 + [θ0/(θ1+θ0)]x, Var(Y|X = x) = θ2 + [θ1θ0/((θ1+θ0)²(1+θ1+θ0))]x², ∀x∈R+

Bivariate gamma (McKay):
E(Y|X = x) = (a/θ1) + x, Var(Y|X = x) = a/θ1², ∀x∈R+

Bivariate LogNormal:
E(Y|X = x) = (x/μ2)^β e^{μ1+σ²/2}, Var(Y|X = x) = (x/μ2)^{2β} e^{2μ1+σ²}(e^{σ²} − 1), ∀x∈R+
(Figure 7.6), and the LogNormal have regression functions that are non-linear in both x and the parameters.

Skedastic Functional Form Terminology

(i) Homoskedasticity. When the skedastic function does not depend on X = x, that is, Var(Y|X = x) = c > 0, ∀x∈RX, for some constant c, g(x) = c is said to be homoskedastic (see (7.15)).
[Fig. 7.5  Exponential: regression curves]

[Fig. 7.6  Logistic: regression curves]

[Fig. 7.7  Student's t: skedastic curves]

[Fig. 7.8  Pearson type II: skedastic curves]
(ii) Heteroskedasticity. When the skedastic function depends on x∈RX, that is, Var(Y|X = x) = g(x), ∀x∈RX, g(x) is said to be heteroskedastic; both terms were first introduced by Pearson (1905).

Looking at Table 7.1, it is clear that homoskedasticity is the exception and not the rule. The only bivariate distributions that give rise to homoskedasticity are the Normal and the logistic. All other bivariate distributions give rise to heteroskedastic skedastic functions. Moreover, the Normal, Student's t, and Pearson type II have identical regression functions, yet only the Normal is homoskedastic; the other two are heteroskedastic (Figures 7.7 and 7.8). Indeed, homoskedasticity characterizes the Normal within the elliptically symmetric family; see Kelker (1970). The basic difference between these three elliptically symmetric distributions is in terms of their kurtosis: the Normal is mesokurtic (kurtosis = 3), the Student's t distribution is leptokurtic (kurtosis > 3), and the Pearson type II is platykurtic (kurtosis < 3). Figures 7.9 and 7.10 depict skedastic functions associated with the bivariate exponential and Pareto distributions, respectively, both of which are heteroskedastic.

At this point it is important to bring out several important issues that are sometimes misunderstood in the regression literature.
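The contrast is easy to make concrete by coding the Normal and Student's t skedastic functions from Table 7.1. A plain-Python sketch (ν and the moment values are illustrative, not from the text):

```python
# Homoskedastic vs. heteroskedastic skedastic functions (cf. Table 7.1).
nu, sigma2, s22, mu2 = 5.0, 0.6, 1.8, 2.0   # illustrative values

def sked_normal(x):
    # bivariate Normal: free of x -> homoskedastic
    return sigma2

def sked_student_t(x):
    # bivariate Student's t: quadratic in (x - mu2), minimized at x = mu2
    return (nu * sigma2 / (nu - 1)) * (1 + (x - mu2) ** 2 / (nu * s22))

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
print([sked_normal(x) for x in xs])               # constant in x
print([round(sked_student_t(x), 3) for x in xs])  # bowl-shaped in x
```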
[Fig. 7.9  Exponential: skedastic curves]

[Fig. 7.10  Pareto: skedastic curves]
First, the regression and skedastic functions in Table 7.1 are associated with bivariate continuous distributions, but these concepts apply equally to discrete bivariate distributions (see Appendix 4.A). To avoid any false impressions, Table 7.2 presents three discrete bivariate distributions with their corresponding regression models.

Table 7.2  Regression models (bivariate discrete distributions)

Bivariate binomial:
E(Y|X = x) = [θ2/(1−θ1)](n−x), Var(Y|X = x) = [θ2(1−θ1−θ2)/(1−θ1)²](n−x), x = 0, 1, ..., n

Bivariate Poisson:
E(Y|X = x) = (θ2−θ3) + (θ3/θ1)x, Var(Y|X = x) = (θ2−θ3) + [(θ1−θ3)θ3/θ1²]x, x = 0, 1, 2, ...

Bivariate negative binomial:
E(Y|X = x) = [θ2/(1−θ2)](k+x), Var(Y|X = x) = [θ2/(1−θ2)²](k+x), x = 0, 1, 2, ...
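The bivariate Poisson row can be checked by simulation using the common-component (trivariate reduction) construction X = U + W, Y = V + W with independent Poissons; the construction itself is an assumption on our part, chosen because it is consistent with the conditional moments in Table 7.2. A plain-Python sketch (parameter values illustrative):

```python
import math
import random

random.seed(1)
th1, th2, th3 = 2.0, 3.0, 1.0   # illustrative: E(X) = th1, E(Y) = th2, Cov = th3

def poisson(lam):
    # Knuth's multiplication method (adequate for small lam)
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def draw_pair():
    # trivariate reduction: X = U + W, Y = V + W share the component W
    u, v, w = poisson(th1 - th3), poisson(th2 - th3), poisson(th3)
    return u + w, v + w

x0 = 2
ys = [y for x, y in (draw_pair() for _ in range(200_000)) if x == x0]
mc_mean = sum(ys) / len(ys)
mc_var = sum((y - mc_mean) ** 2 for y in ys) / len(ys)
mean_formula = (th2 - th3) + (th3 / th1) * x0                     # Table 7.2: 3.0
var_formula = (th2 - th3) + ((th1 - th3) * th3 / th1 ** 2) * x0   # Table 7.2: 2.5
print(mc_mean, mean_formula, mc_var, var_formula)
```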
Second, it is very important to reiterate that in Tables 7.1 and 7.2 the Normal is the only joint distribution with a linear (in x and the parameters) regression function and a homoskedastic conditional variance. In particular, the overwhelming majority of the above distributions have heteroskedastic conditional variances and several have non-linear regression curves. Hence, linear/homoskedastic regression models are not the rule but the exception. Third, in the above discussion of conditional moment functions in general, and the regression and skedastic functions in particular, the random variables X and Y did not have subscripts to indicate any type of ordering as we have so far assigned to a sample X:=(X1 , X2 , . . . , Xn ). That was no accident, but a deliberate choice of notation to bring out the fact that one can account for two different types of dependence using regression models.
Contemporaneous (synchronic) dependence arises when X and Y have the same index indicator, say (Xt, Yt), t = 1, 2, ..., n. Temporal dependence arises when X and Y denote the same random variable, indexed by two consecutive elements of the index sequence, say (Yt, Yt−1), t = 1, 2, ..., n. Using these subscripts, we can define two types of regression models:

Synchronic: E(Yt|Xt = xt) = h(xt), Var(Yt|Xt = xt) = g(xt), ∀xt∈RX;
Temporal: E(Yt|σ(Yt−1)) = h(Yt−1), Var(Yt|σ(Yt−1)) = g(Yt−1), Yt−1 ∼ D(.).   (7.21)

Here, the notation σ(Yt−1) for the conditioning used in the temporal regressions (autoregressions) will be explained in the next section. Although the names synchronic (contemporaneous) and temporal allude to "time" as the indexing sequence, the above concepts apply equally to cross-section orderings, such as weight, age, geographical position, etc.
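A temporal regression of the kind in (7.21) is easy to simulate: for a Normal AR(1) process the regression of Yt on Yt−1 is linear, and the slope can be recovered as Cov(Yt, Yt−1)/Var(Yt−1). A Monte Carlo sketch in plain Python (α0, α1 illustrative):

```python
import random

random.seed(7)
a0, a1, s0 = 1.0, 0.6, 1.0      # illustrative AR(1): Y_t = a0 + a1*Y_{t-1} + e_t
n = 100_000
y = [a0 / (1 - a1)]             # start at the stationary mean
for _ in range(n):
    y.append(a0 + a1 * y[-1] + random.gauss(0.0, s0))

pairs = list(zip(y[:-1], y[1:]))                      # (Y_{t-1}, Y_t)
mx = sum(p for p, _ in pairs) / n
my = sum(c for _, c in pairs) / n
cov = sum((p - mx) * (c - my) for p, c in pairs) / n  # Cov(Y_t, Y_{t-1})
var = sum((p - mx) ** 2 for p, _ in pairs) / n        # Var(Y_{t-1})
slope, intercept = cov / var, my - (cov / var) * mx
print(slope, intercept)          # approx a1 = 0.6 and a0 = 1.0
```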
7.2.3 Selecting an Appropriate Regression Model

In light of the discussion above, the important question from the empirical modeling perspective is: How does one select an appropriate regression model on the basis of the particular data z0 := {(xt, yt), t = 1, 2, ..., n}? A crucial bridge between f(x, y) and the real-world data z0 is to use t-plots to ensure that the data can be viewed as realizations of IID random variables (Chapter 5), and, when IID is secured, proceed to relate the scatterplot to the contour plots of different distributions f(x, y) (Chapter 6) and choose the one that best describes the data scatterplot. The t-plots and scatterplots in Table 7.3 convey chance regularity patterns pertaining to marginal and joint distributions, where "⇝" stands for "conveys information pertaining to." In light of that, one can make informed conjectures concerning the appropriate regression model for data z0 using the five steps in Table 7.4.

Table 7.3  Relating data plots to assumptions

Data plot ⇝ Distribution
t-plot of yt ⇝ f(yt), f(y1, y2, ..., yn)
t-plot of xt ⇝ f(xt), f(x1, x2, ..., xn)
scatterplot of (xt, yt) ⇝ f(xt, yt)
scatterplot of (yt, ys) ⇝ f(yt, ys), t ≠ s, t, s = 1, ..., n
scatterplot of (xt, xs) ⇝ f(xt, xs), t ≠ s, t, s = 1, ..., n
scatterplot of (yt, xs) ⇝ f(yt, xs), t ≠ s, t, s = 1, ..., n
Example 7.8 The simulated data in the t-plots (Figures 7.11 and 7.12) and the scatterplot (Figure 7.13) provide the ideal graphs for the Normal, linear regression model; compare Figure 7.13 with Figures 7.3 (repeated below) and 7.4.
Table 7.4  Modeling strategy for regression models

1. Scatterplot:     {(xt, yt), t = 1, ..., n}
2. Contour plot:    {f(x, y) = c[i], i = 1, 2, ..., m, c[1] > c[2] > ··· > c[m]}
3. Select f(x, y):  {f(x, y), (x, y)∈RX×RY}
4. Derive f(y|x):   f(y|x) = f(x, y)/∫_{y∈RY} f(x, y)dy
5. Derive moments:  E(Y^r|X = x) = hr(x), x∈RX, r = 1, 2, ...
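Steps 4 and 5 of Table 7.4 can be carried out numerically once a candidate f(x, y) has been selected: divide the joint density by its x-section integral to obtain f(y|x), then integrate to obtain the conditional moments. A crude grid-integration sketch in plain Python, for a bivariate Normal with the illustrative moments of Example 7.6 (X has mean μ2 and variance σ22, Y has mean μ1 and variance σ11):

```python
import math

mu1, mu2, s11, s22, s12 = 1.0, 2.0, 0.8, 1.8, 0.6  # illustrative moments

def f_joint(x, y):
    # bivariate Normal density for (X, Y) with the moments above
    det = s11 * s22 - s12 ** 2
    q = (s11 * (x - mu2) ** 2 - 2 * s12 * (x - mu2) * (y - mu1)
         + s22 * (y - mu1) ** 2) / det
    return math.exp(-q / 2) / (2 * math.pi * math.sqrt(det))

def cond_moment(x, r, lo=-10.0, hi=12.0, n=4401):
    # Step 4: f(y|x) = f(x, y) / integral of f(x, y) dy ; Step 5: E(Y^r|X=x)
    h = (hi - lo) / n
    ys = [lo + (i + 0.5) * h for i in range(n)]        # midpoint rule
    fx = sum(f_joint(x, y) for y in ys) * h
    return sum(y ** r * f_joint(x, y) for y in ys) * h / fx

x = 3.0
m1 = cond_moment(x, 1)            # regression:   beta0 + beta1*x = 1/3 + x/3
v = cond_moment(x, 2) - m1 ** 2   # skedasticity: sigma^2 = 0.6, free of x
print(m1, v)
```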
[Fig. 7.11  t-plot of yt]

[Fig. 7.12  t-plot of xt]
Example 7.9  Consider the t-plots and scatterplot given in Figures 7.15–7.17. The t-plots exhibit no dependence or heterogeneity, and thus one can proceed to "read" the scatterplot for chance regularity patterns suggesting a particular distribution. This scatterplot reminds one of the contour plot associated with the exponential distribution in Figure 7.18. This in turn suggests that the regression model one should entertain is the exponential regression model:

E(Y|X = x) = (1+θ+θx)/(1+θx)²,  Var(Y|X = x) = [(1+θ+θx)² − 2θ²]/(1+θx)⁴,  ∀x∈R+.
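The exponential regression and skedastic functions can be coded directly; a plain-Python sketch (θ illustrative) showing that both curves start at (1+θ) and (1+θ)² − 2θ² at x = 0 and decline as x grows, so the model is non-linear and heteroskedastic:

```python
# Exponential regression and skedastic functions (Table 7.1 / Example 7.9)
theta = 0.5   # illustrative dependence parameter

def reg(x):   # E(Y|X = x)
    return (1 + theta + theta * x) / (1 + theta * x) ** 2

def sked(x):  # Var(Y|X = x)
    return ((1 + theta + theta * x) ** 2 - 2 * theta ** 2) / (1 + theta * x) ** 4

print(reg(0.0), sked(0.0))    # 1 + theta = 1.5 and (1 + theta)^2 - 2*theta^2 = 1.75
print(reg(10.0), sked(10.0))  # both decline toward 0 as x grows
```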
[Fig. 7.13  Scatterplot of (xt, yt)]

[Fig. 7.14  Bivariate Normal density]

[Fig. 7.15  t-plot of yt]

[Fig. 7.16  t-plot of xt]

[Fig. 7.17  Scatterplot of (xt, yt)]

[Fig. 7.18  Bivariate exponential]

[Fig. 7.19  Bivariate exponential contours: regression function]
The regression function is shown in Figure 7.19 as it relates to the contours. The same reasoning can be used to specify different regression models!

MODELING WARNING: At this stage it should be emphasized that the above choices of regression models, based on the chance regularities exhibited by the data, are only tentative. The initial selection of a regression model needs to be subjected to more formal evaluation using misspecification testing (Chapter 15) to confirm or disconfirm its appropriateness. In cases where misspecification testing indicates that the initial choice is inappropriate, a new regression model should be selected.
7.3 Weak Exogeneity and Stochastic Conditioning
7.3.1 The Concept of Weak Exogeneity

In the previous two sections we discussed the question of dealing with the reduction

f(x, y; φ) = f(y|x; ϕ1)·f(x; ϕ2), ∀(x, y)∈RX×RY.   (7.22)

In Section 7.2 we ignored the marginal distribution f(x; ϕ2), and argued that we can deal with the many conditional distributions (one for each value of X) by extending the notion of conditional moments to functions. We then focused exclusively on f(y|x; ϕ1) and its first two moments to specify E(Y|X = x) = h(x), Var(Y|X = x) = g(x), x∈RX, the regression and skedastic functions.

Example 7.10  As shown in Table 7.1, when f(x, y; φ) is bivariate Normal, the conditional and marginal densities are also Normal, of the form
(Y|X = x) ∼ N(β0 + β1x, σ²), x∈R,  X ∼ N(μ2, σ22),
β0 = μ1 − β1μ2, β1 = σ12/σ22, σ² = σ11 − (σ12²/σ22).   (7.23)
The reduction in (7.22) has induced a reparameterization of the form φ → (ϕ1, ϕ2):

φ := (μ1, μ2, σ12, σ11, σ22) ∈ Φ := R³×R²+,
ϕ1 := (β0, β1, σ²) ∈ Φ1 := R²×R+,  ϕ2 := (μ2, σ22) ∈ Φ2 := R×R+,   (7.24)

where φ and (ϕ1, ϕ2) have an equal number (five) of unknown parameters. This suggests that the reduction in (7.22) involves no real simplification of the modeling problem, since going from f(x, y; φ) to f(y|x; ϕ1)·f(x; ϕ2) one is still dealing with the joint distribution, but reparameterized. However, focusing exclusively on the regression and skedastic functions

E(Y|X = x) = β0 + β1x, Var(Y|X = x) = σ², ∀x∈R,

is tantamount to ignoring f(x; ϕ2), x∈R, completely. This strategy was followed in constructing Tables 7.1 and 7.2. Are we justified in ignoring f(x; ϕ2) when specifying regression models? Put more broadly, the question is: Are there any circumstances under which f(x; ϕ2) can be ignored without any loss of information? The answer in a nutshell is that it depends on how the two sets of parameters ϕ2∈Φ2 and ϕ1∈Φ1 constrain each other. The answer is yes in the case where Φ1 (the parameter space of ϕ1) is not affected by any of the values taken by ϕ2∈Φ2 and vice versa, but not otherwise. The concept we need is the so-called variation freeness.

Variation freeness. We say that ϕ1∈Φ1 and ϕ2∈Φ2 are variation free when, for all possible values of ϕ2∈Φ2, ϕ1 remains unconstrained, in the sense that its original parameter space Φ1 has not been reduced to a subset.

Using the notion of variation freeness, we can give a more formal answer to the above question of whether one can ignore f(x; ϕ2) without any loss of information.

Weak exogeneity. X is said to be weakly exogenous with respect to ϕ2 when:

(a) the parameters of interest are a function of ϕ1 only, i.e. θ = H(ϕ1);
(b) ϕ1 and ϕ2 are variation free, denoted by (ϕ1, ϕ2) ∈ [Φ1 × Φ2].

Under conditions (a) and (b), f(x; ϕ2), x∈R, can be ignored without any loss of information; see Engle et al. (1983).

Example 7.10 (continued)  When f(x, y; φ) is bivariate Normal, the marginal ϕ2 := (μ2, σ22) ∈ Φ2 and conditional ϕ1 := (β0, β1, σ²) ∈ Φ1 parameters can be shown to be variation free. To verify this, one needs to assume different possible values for ϕ2 = (μ2, σ22) and see whether the parameters ϕ1 are affected in a way that reduces Φ1 to a subset. For instance, one could choose ϕ2‡ = {μ2 = 0, σ22 = 1}, which, in light of (7.23), will imply that β0 = μ1, β1 = σ12, σ² = σ11 − σ12²; note that one cannot choose σ22 = 0 because zero is outside its range of values (R+). This appears to have affected the parameters in ϕ1, but it has not reduced their original parameter space, since

ϕ2‡ = {0, 1} ∈ Φ2 = R×R+ → ϕ1‡ = (μ1, σ12, σ11 − σ12²) ∈ Φ1 = R²×R+.

Hence, if conditions (a) and (b) hold (i.e. the parameters of interest are θ = H(ϕ1)), X is weakly exogenous with respect to ϕ2 because, no matter what values of ϕ2 in Φ2 one chooses, the
parameters ϕ1 remain unconstrained, in the sense that they can take all their possible values in Φ1.

Example 7.11  When f(x, y; φ) is bivariate Student's t with ν > 2 degrees of freedom,

(Y, X) ∼ St((μ1, μ2), (σ11, σ12; σ12, σ22); ν),   (7.25)

f(y|x; ϕ1) and f(x; ϕ2) are also Student's t, of the form

(Y|X = x) ∼ St(β0 + β1x, [νσ²/(ν−1)]{1 + (1/(νσ22))[x − μ2]²}; ν+1), x∈R,  X ∼ St(μ2, σ22; ν),   (7.26)

where the parameters (β0, β1, σ²) coincide with those of the bivariate Normal (see (7.23)), but ϕ1 is now different because it includes (μ2, σ22) among its parameters:

φ := (μ1, μ2, σ12, σ11, σ22) ∈ Φ := R³×R²+,
ϕ1 := (β0, β1, μ2, σ22, σ²) ∈ Φ1 := R³×R²+,  ϕ2 := (μ2, σ22) ∈ Φ2 := R×R+.

In view of these results, if ϕ1 includes the parameters of interest, X is not weakly exogenous with respect to ϕ2, because ϕ1 can be directly restricted via ϕ2; (μ2, σ22) are in both sets of parameters. For instance, ϕ2‡ = {0, 1} ∈ Φ2 := R×R+ implies that

ϕ1‡ = (μ1, σ12, 0, 1, σ11 − σ12²) ∈ Φ1‡ := R²×{0}×{1}×R+ ⊂ Φ1 := R³×R²+.

Example 7.12
Consider the case of the (temporal) bivariate Normal of the form

(Yt, Yt−1) ∼ N((μ, μ), (σ(0), σ(1); σ(1), σ(0))).   (7.27)

The conditional and marginal distributions are

(Yt|Yt−1) ∼ N(α0 + α1Yt−1, σ0²),  Yt−1 ∼ N(μ, σ(0)),
α0 = μ(1 − α1)∈R, α1 = σ(1)/σ(0) ∈ (0, 1), σ0² = σ(0)(1 − α1²).   (7.28)

The reduction in (7.22) induced a reparameterization φ → (ϕ1, ϕ2), where

φ := (μ, σ(1), σ(0)) ∈ Φ := R²×R+,
ϕ1 := (α0, α1, σ0²) ∈ Φ1 := R²×R+,  ϕ2 := (μ, σ(0)) ∈ Φ2 := R×R+.   (7.29)

Hence, ϕ1 and ϕ2 are not variation free, since ϕ2‡ = {μ = 0, σ(0) = 1} ∈ Φ2 = R×R+ implies that α0 = 0, α1 = σ(1), σ0² = 1 − α1², and thus

ϕ1‡ = (0, σ(1), 1 − α1²) ∈ Φ1‡ = [{0}×R×(0, 1)] ⊂ Φ1 := R²×R+.

This indicates that if ϕ1 are the parameters of interest, Yt−1 is not weakly exogenous with respect to ϕ2. The statistical model based on (7.28) is known as the Normal AutoRegressive [AR(1)] model.

The concept of weak exogeneity has two important features:

(i)
It is inextricably bound up with the joint distribution (f (x, y; φ)) and its parameterization (φ), as it relates to those of f (y|x; ϕ 1 ) and f (x; ϕ 2 ).
(ii)
In view of the discussion in Section 7.2 (see Table 7.1), weak exogeneity is likely to be the exception and not the rule in practice.
In light of Examples 7.11 and 7.12, the obvious question is: What is one supposed to do when weak exogeneity does not hold? The answer is that ignoring the marginal distribution f(x; ϕ2), x∈R, will result in a loss of information; hence, focusing on the regression and skedastic functions exclusively is not recommended. This means that for modeling purposes one needs to retain the marginal distribution after the reduction in (7.22). This calls into question the appropriateness of the conditioning (X = x) used in defining the conditional moment functions in Section 7.2 since, in a certain sense, this amounts to assuming that the different values taken by the random variable X occur with probability one:

hr(x) = E(Y^r|X = x), where P(X = x) = 1, for all x∈RX.   (7.30)

This discards f(x; ϕ2) by replacing its weights for each value x∈RX with unit weights. This raises the question: What does one do when X is not weakly exogenous with respect to ϕ2? The simple answer is that one needs to retain f(x; ϕ2), x∈RX, by replacing the conditioning on X = x with conditioning on the σ-field σ(X) instead. Let us explore how this change of conditioning can be handled.
7.3.2 Conditioning on a σ-Field

The formal way to retain the probabilistic structure of X when defining the conditional moment functions is to extend the concept of conditioning one step further, with a view to accounting for all events associated with the random variable X. This is achieved by replacing the conditioning on X = x with conditioning on the σ-field generated by the random variable X (all possible events associated with X; see Chapter 3) to define the stochastic conditional moment functions

hr(X) = E(Y^r|σ(X)), for X ∼ DX(.),   (7.31)

where DX(.) denotes the marginal distribution of the random variable X. This conditioning is meaningful in the context of the probability space (S, ℱ, P(.)) because σ(X)⊂ℱ. It is obvious that the functions hr(X) = E(Y^r|σ(X)) are different from those in (7.30), because the former are random variables, being functions of the random variable X. Hence, the question is: What meaning do we attach to such stochastic conditioning functions? As argued in Chapter 3, going from events A and B in the context of the probability space (S, ℱ, P(.)) to random variables X and Y defined on this space raises the crucial problem of attaching the appropriate quantifiers, because random variables take more than one value and sometimes involve zero-probability events; hence the notation σ(X): the σ-field generated by X (Chapter 3). Note that the notation E(Y|X) is highly ambiguous because conditioning
on X = x is very different from conditioning on σ(X). In this sense, the stochastic conditional expectation

E(Y|σ(X)) = h(X)   (7.32)

is probabilistically meaningful because σ(X)⊂ℱ. Intuitively, (7.32) represents the expectation of Y given that "an event related to X has occurred," but the conditioning involves more than one such event, with different probabilities of occurrence. For instance, let the Bernoulli random variables Y and X denote dying from heart disease and gender, respectively, in a particular population. In this case E(Y|σ(X)) is a random variable that takes different values for X = 0 and X = 1, with different probabilities. In light of the fact that ℱ always includes as a subset the trivial field D0 = {S, ∅}, common sense suggests that for any random variable Y defined on the probability space (S, ℱ, P(.)):

(i) E(Y|D0) = E(Y) and (ii) E(Y|ℱ) = E(Y|σ(Y)) = Y.   (7.33)

(i) follows from the fact that D0 is non-informative, and (ii) because ℱ contains all relevant information, including σ(Y). Viewed in this light, σ(X) constitutes a restriction on ℱ, in the sense that {S, ∅} ⊂ σ(X) ⊂ ℱ; see Chapter 3. Indeed, conditioning on a σ-field D is meaningful:

E(Y|D) for any D⊂ℱ.   (7.34)
Recall that σ(X) denotes all events associated with the random variable X.

Example 7.13  Consider the example with S = {(HH), (HT), (TH), (TT)}, where ℱ is chosen to be the power set, and define the random variables

X(TT) = 0, X(HT) = X(TH) = 1, X(HH) = 2,  Y(TT) = Y(HH) = 2, Y(HT) = Y(TH) = 1.

Taking the pre-image of the random variable X, we can see that

B1 = X⁻¹(0) = {(TT)}, B2 = X⁻¹(1) = {(HT), (TH)}, B3 = X⁻¹(2) = {(HH)},
P(B1) = .25, P(B2) = .5, P(B3) = .25,
σ(X) = {S, ∅, B1, B2, B3, B1∪B2, B1∪B3, B2∪B3}.

Example 7.14  Let us return to Example 7.1 with the joint distribution in (7.7), giving rise to the regression and skedastic functions in (7.12). How would the latter be modified when the conditioning is relative to σ(X)? The answer is that a new column is added to both tables, giving the (marginal) probabilities associated with the different values of X, shown below:

x  h(x) = E(Y|σ(X))  P(X = x)     x  g(x) = Var(Y|σ(X))  P(X = x)
1  .387              .31          1  .301                .31
2  .902              .41          2  .380                .41        (7.35)
3  .749              .28          3  .758                .28
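The point of the extra P(X = x) column is that the conditional moments can now be averaged back into marginal ones, anticipating the properties developed in Section 7.3.3. A plain-Python sketch using the numbers in (7.35):

```python
# Conditional moments and marginal weights from (7.35)
h = {1: .387, 2: .902, 3: .749}   # E(Y|sigma(X)) at X = x
g = {1: .301, 2: .380, 3: .758}   # Var(Y|sigma(X)) at X = x
p = {1: .31, 2: .41, 3: .28}      # P(X = x): the weights that (7.30) discards

# Weighting E(Y|sigma(X)) by P(X = x) recovers the marginal mean E(Y);
# combining E[Var(Y|.)] with Var[E(Y|.)] recovers the marginal variance.
ey = sum(h[x] * p[x] for x in p)
var_y = (sum(g[x] * p[x] for x in p)
         + sum(h[x] ** 2 * p[x] for x in p) - ey ** 2)
print(round(ey, 4), round(var_y, 4))   # approx 0.6995 and 0.5091
```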
A formal discussion of E[Y(.)|σ(X(.))]: S → R, as a random variable with respect to σ(X), was first given by Kolmogorov (1933a) (chapter V: Conditional Probabilities and Conditional Expectation), where he addressed the technical difficulties associated with defining this function. Here, we will focus on providing a more intuitive understanding in terms of its properties; see Ash (2000) for more technical details.

Stochastic conditional expectation function. The random variable E(Y|σ(X)) satisfies the properties in Table 7.5.

Table 7.5  Properties of E(Y|σ(X))

Sce1  E(Y|σ(X)) is a random variable relative to σ(X)
Sce2  E(Y|σ(X)) = h(X), for some h(.): R → R
Sce3  E[E(Y|σ(X))·IB] = E[Y·IB], ∀B∈σ(X)

Here, IB is the indicator function: IB(s) = 1 if s∈B, and IB(s) = 0 if s∉B. This notion of conditional expectation can be extended to any sub-σ-field D ⊂ ℱ, since we can always find a random variable (or vector) X such that σ(X) = D, in the sense that all events (X ≤ x)∈D, ∀x∈R. This is another way of saying that the information D conveys to the modeler what the random variable X does.
7.3.3 Stochastic Conditional Expectation and its Properties

Having established the existence and the almost sure uniqueness of E(Y^r|σ(X)), we proceed to consider the question of determining the functional form of hr(X) = E(Y^r|σ(X)). Common sense suggests that the similarity between (7.30) and (7.31) will carry over to the functional forms. That is, when the ordinary conditional moment functions take the form E(Y^r|X = x) = hr(x), x∈RX, r = 1, 2, ..., we interpret the stochastic conditional moment functions as

E(Y^r|σ(X)) = hr(X), for X ∼ DX(.), r = 1, 2, ...   (7.36)

Functional form. An obvious implication of the above result is that

E(Y^r|X = xi) = hr(xi), ∀xi∈RX  ⇒  E(Y^r|σ(X)) = hr(X),   (7.37)

ensuring that the functional forms of the ordinary and the corresponding stochastic conditional moment functions coincide. The only difference is that E(Y^r|σ(X)) is a random variable, whereas E(Y^r|X = x) is a deterministic function.

Example 7.15  Returning to Examples 7.11 and 7.12, where weak exogeneity did not hold, we can state that the regression and skedastic functions should take the forms stemming from stochastic conditioning:
Table 7.6  Properties of (stochastic) conditional expectation

CE1  Linearity: E(aX + bY + c|σ(Z)) = aE(X|σ(Z)) + bE(Y|σ(Z)) + c, for constants a, b, c
CE2  The law of iterated expectations (LIE): E(Y) = E[E(Y|σ(X))]
CE3  Taking out what is known: E(h(Y)·g(X)|σ(X)) = g(X)·E(h(Y)|σ(X))
CE4  The best least-squares prediction property: E[Y − E(Y|σ(X))]² ≤ E[Y − g(X)]² for any Borel function g(.), with equality if and only if g(X) = E(Y|σ(X))
CE5  The corset property: E{E(Y|σ(X, Z))|σ(X)} = E{E(Y|σ(X))|σ(X, Z)} = E(Y|σ(X))
CE6  Regression function characterization: for any X and Y, h(X) = E(Y|σ(X)) when E[(Y − h(X))·g(X)] = 0 for any bounded Borel function g(.): RX → R
Student's t:  E(Yt|σ(Xt)) = β0 + β1Xt,  Var(Yt|σ(Xt)) = [νσ²/(ν−1)][1 + (1/(νσ22))(Xt − μ2)²];
Normal AR(1):  E(Yt|σ(Yt−1)) = α0 + α1Yt−1,  Var(Yt|σ(Yt−1)) = σ0².   (7.38)
The question which naturally arises is: how does one determine the function hr(x) in the first place? The answer from the modeling viewpoint is that both the conditional densities and the conditional moment functions are determined by the joint density, as shown in (7.4).

Properties. Recall that when the marginal distribution fx(x; ϕ2) in the reduction f(x, y; φ) = f(y|x; ϕ1)·fx(x; ϕ2), ∀(x, y)∈RX×RY, cannot be ignored, the conditional moment function changes from E(Y^r|X = x) = hr(x) to E(Y^r|σ(X)) = hr(X), r = 1, 2, ..., but the functional form hr(.) remains the same (e.g. if h1(x) = a + bx, then h1(X) = a + bX). In this sense, E(Y|X = x) can be viewed as a special case of E(Y|σ(X)). Table 7.6 lists six useful properties of E(Y|σ(X)), where X, Y, and Z are defined on the probability space (S, ℱ, P(.)), assuming their relevant moments exist as required in each case; see Williams (1991).

The CE1 property can easily be adapted to the special case E(aX + bY|Z = z), which is analogous to the ordinary expectation; see Chapter 3. The CE2 property states that one can derive the (marginal) mean using the conditional mean with respect to the marginal distribution of the random variable X:

E[E(Y|X)] = ∫_{−∞}^{∞} [∫_{−∞}^{∞} y·f(y|x)dy]·f(x)dx.
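The double-integral form of CE2 can be verified by crude numerical integration for any tractable joint density; a plain-Python sketch with the illustrative (non-Normal) density f(x, y) = x + y on the unit square, for which E(Y) = 7/12:

```python
# Verify the LIE numerically for the joint density f(x, y) = x + y on [0,1]^2
# (a valid density: it integrates to 1). Crude midpoint-rule integration.
n = 400
h = 1.0 / n
grid = [(i + 0.5) * h for i in range(n)]

def f(x, y):
    return x + y

# marginal mean E(Y) straight from the joint density
ey = sum(y * f(x, y) for x in grid for y in grid) * h * h

def cond_mean(x):
    # inner integral: E(Y|X = x) = [integral of y*f(x,y) dy] / f(x)
    fx = sum(f(x, y) for y in grid) * h
    return sum(y * f(x, y) for y in grid) * h / fx

# outer integral: weight E(Y|X = x) by the marginal density f(x)
ey_iterated = sum(cond_mean(x) * (sum(f(x, y) for y in grid) * h)
                  for x in grid) * h
print(ey, ey_iterated)   # both approx 7/12
```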
Example 7.16  Consider the following joint and conditional distributions:

x\y    −1   0    1    f(x)
−1     .1   .2   .1   .4
 1     .2   .1   .3   .6       (7.39)
f(y)   .3   .3   .4   1

The conditional distribution(s) of (Y|X = x) for x = −1 and x = 1 are given below:

y            −1   0    1          y           −1   0    1
f(y|x = −1)  1/4  1/2  1/4        f(y|x = 1)  1/3  1/6  1/2     (7.40)
Moreover, the conditional means in these examples are: E(Y|X = −1) = (−1) 14 +0( 12 )+1( 14 ) = 0, E(Y|X = 1) = (−1)( 13 )+0( 16 )+1( 12 ) = 16 . We can verify CE2 by taking expectations of E(Y|X) over X: E(Y) = (0.4)E(Y|X = −1) + (0.6)E(Y|X = 1) = 0.1, and show that it coincides with the mean when evaluated using the marginal distribution: E(Y) = (−1)(0.3) + 0(0.3) + 1(0.4) = 0.1. The CE2 property is very useful in practice, and can extend to the conditional variance and covariance: Var(Y) = E [Var(Y|σ (X))] + Var [E(Y|σ (X))] ,
(7.41)
Cov(X, Y) = E [Cov(X, Y|σ (Z))] + Cov [E(X|σ (Z))·E(Y|σ (Z))] .
(7.42)
Note that (7.41) and (7.42) are often called rules of total variance and covariance, respectively, in direct analogy to the rule of total probability; see Chapter 2. The CE3 property states that any two Borel functions g(X) of X (which is a random variable relative to σ (X)) pass through the conditioning unchanged. Example 7.17 Consider the simple case where g(X) = X and h(Y) = Y, then CE3 implies E (Y·X|σ (X)) = X·E (Y|σ (X)) . This, however, implies that when Y is a random variable relative to σ (X): E(Y|σ (X)) = Y a.s. The CE3 property can easily be adapted to the special case E (h(Y)·g(X)|X = x). √ Example 7.18 Consider the functions h(Y) = Y, g(X) = X 2 : √ √ E(h(Y)·g(X)|X = −1) = (−1)2 E( Y|X = −1) = E( Y|X = −1).
(7.43)
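Property CE2 and the total-variance rule (7.41) can be checked numerically against the joint distribution in (7.39). A small sketch in Python (the probability table is copied from (7.39)):

```python
import numpy as np

# Joint distribution from (7.39): rows index x in {-1, 1}, columns y in {-1, 0, 1}
xs = np.array([-1.0, 1.0])
ys = np.array([-1.0, 0.0, 1.0])
p = np.array([[0.1, 0.2, 0.1],
              [0.2, 0.1, 0.3]])

fx = p.sum(axis=1)          # marginal of X: [.4, .6]
fy = p.sum(axis=0)          # marginal of Y: [.3, .3, .4]
cond = p / fx[:, None]      # conditional distributions f(y|x), one row per x
m = cond @ ys               # E(Y|X=x): [0, 1/6]

# CE2 (law of iterated expectations): E[E(Y|X)] equals the marginal mean
print(fx @ m, fy @ ys)      # both ≈ 0.1

# Rule of total variance (7.41)
v = cond @ ys**2 - m**2                 # Var(Y|X=x)
lhs = fy @ ys**2 - (fy @ ys)**2         # Var(Y) from the marginal
rhs = fx @ v + fx @ (m - fx @ m)**2     # E[Var(Y|X)] + Var[E(Y|X)]
print(lhs, rhs)                         # both ≈ 0.69
```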
The above properties are particularly useful in the context of regression models for numerous reasons which will be discussed in the next few sections. At this point it is instructive to use these properties to derive an important result in relation to linear regressions.

Example 7.19 In the case of the bivariate Normal distribution, the conditional mean takes the form

E(Y|σ(X)) = β0 + β1X,        (7.44)
where the parameters β0 = μ1 − β1μ2, β1 = (σ12/σ22); see Table 7.1. What if we were to reverse the derivation and begin by postulating (7.44), then proceed to derive the parameterization of β0 and β1 from first principles using CE1–CE6 in Table 7.6? Using CE2 we can deduce that E(Y) = β0 + β1E(X), i.e. β0 = E(Y) − β1E(X). Applying the CE2 and CE3 properties yields E(X·Y) = E[E(X·Y|σ(X))] = E[X·E(Y|σ(X))]. Using (7.44) in place of E(Y|σ(X)) we can deduce that

E(X·Y) = E[X(β0 + β1X)] = E[X(E(Y) − β1E(X) + β1X)] = E(X)·E(Y) + β1{E(X²) − [E(X)]²} = E(X)·E(Y) + β1Var(X).

This implies that Cov(X, Y) = E(X·Y) − E(X)·E(Y) = β1Var(X), and thus

β1 = Cov(X, Y)/Var(X).        (7.45)
This result implies that, irrespective of the nature of the joint density f(x, y), if the regression function is linear, the parameters β0 and β1 are related to the moments of f(x, y) via

β0 = μ1 − β1μ2, β1 = (σ12/σ22),

which coincides with the parameterization under the bivariate Normality assumption given in Table 7.1. CE4 has to do with all possible Borel functions g(.) of X, each yielding an error ε(X) = Y − g(X). The function that renders the mean square error (MSE) minimum is g(X) = E(Y|σ(X)):

E(Y|σ(X)) = arg min_{g(.)} E[(Y − g(X))²].

That is, the conditional mean E(Y|σ(X)) provides the best mean squared error predictor of Y. This is a particularly useful property because it renders the conditional expectation the obvious choice for a predictor (forecasting rule). The intuition underlying the CE5 property is that in sequential conditioning the smaller conditioning information set (note that σ(X)⊂σ(X, Z)) dominates the conditioning. Like wearing two corsets, the smaller will dominate irrespective of the order of wearing them! All the stochastic conditional expectation properties CE1–CE6 in Table 7.6 remain valid when σ(Z) and σ(X) are replaced with an arbitrary σ-field D⊂F, appropriately chosen to relate to (X, Y, Z) as needed.
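Both the parameterization in (7.45) and the CE4 optimality property can be illustrated by simulation. The sketch below (parameter values are purely illustrative, not from the text) checks that the sample analogue of Cov(X, Y)/Var(X) recovers σ12/σ22, and that the linear conditional mean beats alternative predictors g(X) on mean squared error:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative bivariate Normal parameters (hypothetical values)
mu1, mu2 = 1.0, 2.0                 # E(Y), E(X)
s11, s22, s12 = 2.0, 1.5, 0.9       # Var(Y), Var(X), Cov(X, Y)
n = 200_000

x = mu2 + np.sqrt(s22) * rng.standard_normal(n)
y = mu1 + (s12 / s22) * (x - mu2) + np.sqrt(s11 - s12**2 / s22) * rng.standard_normal(n)

# (7.45): the slope is Cov(X, Y)/Var(X), here sigma12/sigma22 = 0.6
b1 = np.cov(x, y, ddof=0)[0, 1] / x.var()
b0 = y.mean() - b1 * x.mean()
print(b1)                            # close to 0.6

# CE4: the conditional mean minimizes the MSE over predictors g(X)
mse = lambda g: np.mean((y - g) ** 2)
print(mse(b0 + b1 * x))              # smallest of the three
print(mse(np.full(n, y.mean())))     # ignores X: larger
print(mse(b0 + 0.5 * b1 * x))        # wrong slope: larger
```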
7.4 A Statistical Interpretation of Regression
As argued in Chapter 1, for observed data to provide impartial evidence in assessing the validity of a substantive model, it is imperative that we specify the statistical model (a convenient summary of the systematic information in the data) in terms of purely probabilistic concepts, with a view to accounting for the chance regularities in the data. In Chapters 2–6 we introduced several probabilistic concepts with a view to providing the framework in the context of which such statistical models can be understood and properly employed for empirical modeling purposes. The concept of a statistical model defined so far has just two components, the probability and the sampling models. These two components are sufficient for statistical inference purposes because they can define the joint distribution of the sample. For instance, for the simple statistical models this takes the form

f(x1, x2, . . . , xn; φ) = ∏_{k=1}^{n} f(xk; θ), ∀x∈RnX,

and in the case of regression models it takes the form

f(y1, . . . , yn, x1, . . . , xn; φ) = ∏_{k=1}^{n} f(yk|xk; ϕ1)·f(xk; ϕ2), ∀(x, y)∈RnX×RnY.

As argued by Cox (1990), however, this is often not enough in practice: For empirical and indirect purposes, it may be enough that a model defines the joint distribution of the random variables concerned, but for substantive purposes it is usually desirable that the model can be used fairly directly to simulate data. The essential idea is that if the investigator cannot use the model directly to simulate artificial data, how can “Nature” have used anything like that method to generate real data? (p. 172)
Hence, to ensure a purely probabilistic specification for a statistical model, one needs to specify the stochastic generating mechanism that could have generated the observed data. Indeed, this purely statistical mechanism could potentially provide a bridge between statistical and any substantive models. The ultimate objective of empirical modeling is not just a statistically adequate description of the systematic information in the data, in the form of a statistical model, but also to use such models to shed light on observable phenomena of interest. In this sense, relating such statistical substantive models is of fundamental importance.
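Cox's criterion is easy to operationalize: a fully specified regression model doubles as a simulator of artificial data. A minimal sketch for a Normal, linear regression generating mechanism (all parameter values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parameters of the statistical GM: y_k = b0 + b1*x_k + u_k
b0, b1, sigma = 0.5, 1.2, 0.8
n = 100

# Step 1: draw x_k from the marginal f(x; phi2), taken here to be N(0, 1)
x = rng.standard_normal(n)

# Step 2: draw y_k from the conditional f(y|x_k; phi1) = N(b0 + b1*x_k, sigma^2)
y = b0 + b1 * x + sigma * rng.standard_normal(n)

# (x, y) is artificial data generated exactly as the model asserts
print(np.corrcoef(x, y)[0, 1])      # strongly positive, by construction
```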
7.4.1 The Statistical Generating Mechanism

Conditioning information set. Let the probability space of interest be (S, F, P(.)). In view of the fact that all events of interest are elements of F, we define information in terms of subsets of F, i.e. D constitutes information in the context of the probability space (S, F, P(.)) if D ⊂ F. D can range from the non-informative case D0 = {S, ∅} (we know this a priori) to the fully informative case D* = F (we know everything that can occur in (S, F, P(.))). In view of the fact that we can always define a random variable X such that the minimal σ-field generated by X coincides with D (i.e. σ(X) = DX), we can think of information as a restriction on the event space relative to some observable aspect of the chance mechanism in question. This will enable us to operationalize expressions of the
form E(Y|D), which can be interpreted as the conditional expectation of the random variable Y given the subset D: a set of events known to the modeler. In addition, we know that by transforming information there is no possibility of increasing it, but there is some possibility that the transformation might decrease it. More formally, for any well-behaved (Borel) function g(.) of X, σ(g(X))⊂σ(X), but the converse is true only when the function is one-to-one: σ(g(X)) = σ(X) only if g(.): RX → R is one-to-one. Orthogonal decomposition. Let Y be a random variable (assuming that E(|Y|²) < ∞) [. . .] An issue related to the reverse regression is the following. Regression toward the mean. This arises in the special case where the underlying bivariate distribution is Normal and Var(Y) = Var(X), which in light of the statistical parameterizations in (7.14) and (7.20) implies that

σ11 = σ22 → β1 = α1 = ρ12, σ² = v²,
since β1·(√σ22/√σ11) = α1·(√σ11/√σ22) = σ12/√(σ22·σ11) := ρ12, and the two regression lines have the same slope:

(Y − μ1) = ρ12(x − μ2), (X − μ2) = ρ12(y − μ1),        (7.55)

but different intercepts unless one also assumes that E(Y) = E(X). The term “regression toward the mean” stems from the fact that |ρ12| < 1, and thus the dispersion about the mean of Y is less than that of X. Even in the case μ1 = μ2, however, the reverse regression in (7.55) does not satisfy (7.54), since ρ12 < 1/ρ12. Galton (1886) used this special case to argue that when regressing the height of offspring Y on the height of parent X, very tall parents would produce not-as-tall offspring and short parents would produce taller offspring, closer to the mean. On statistical grounds, however, such a causal interpretation is misplaced, because the reverse regression also implies that very tall offspring have parents that are not as tall. One of the crucial problems associated with the curve-fitting perspective is the blurring of the distinction between the substantive and statistical models. The end result is that theory-driven empirical modeling assigns a subordinate role to the data: that of quantifying substantive models assumed to be valid. This treats the substantive subject matter information as knowledge, instead of as tentative explanations of phenomena of interest that are subject to evaluation using data. The problem is that if the estimated model does not account for all the statistical systematic information in the data (the chance regularities), it will be statistically misspecified, and the inference procedures based on it are likely to be highly unreliable (e.g. one may be applying a .05 significance test for a coefficient when the actual type I error is closer to .9, with dire consequences for the reliability of the inferences based on statistically misspecified models).
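Galton's regression-toward-the-mean effect, and its symmetric reverse, can be reproduced by simulation (standardized heights with a hypothetical correlation ρ = 0.5):

```python
import numpy as np

rng = np.random.default_rng(2)

# Standardized bivariate Normal with Var(Y) = Var(X) = 1 and correlation rho
rho, n = 0.5, 500_000
x = rng.standard_normal(n)                                  # parent height
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)  # offspring height

# Both regression slopes equal rho, as in (7.55)
print(np.cov(x, y, ddof=0)[0, 1] / x.var())   # Y on X: close to 0.5
print(np.cov(x, y, ddof=0)[0, 1] / y.var())   # X on Y: close to 0.5

# Tall parents have offspring closer to the mean -- but tall offspring
# likewise have parents closer to the mean, so no causal story follows
print(x[x > 2].mean(), y[x > 2].mean())
print(y[y > 2].mean(), x[y > 2].mean())
```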
To avoid the perils of theory-driven empirical modeling, one needs a clear distinction between statistical and substantive models that assigns different roles for the two types of potential information without compromising the credibility of either form of information.
7.5 Regression Models and Heterogeneity
In Chapter 2 we raised the issue of data exhibiting heterogeneity and its devastating effects on descriptive statistics, but we left unanswered the question of how to account for such a chance regularity pattern. In Chapter 5 we considered the question of “subtracting out” certain forms of heterogeneity by using deterministic polynomials in t to account for trending means, as well as dummy variables for shifts and sinusoidal polynomials for seasonal patterns. In this section we consider the question of modeling trending data in the context of regression models. Consider the following example.

Example 7.23
Consider the case where f(xt, yt; φ(t)) is bivariate Normal of the form

(Yt, Xt) ~ N( [μ1(t), μ2(t)], [[σ11(t), σ12(t)], [σ12(t), σ22(t)]] ),        (7.56)
where all the parameters in φ(t):= (μ1 (t), μ2 (t), σ 11 (t), σ 22 (t), σ 12 (t)) are unknown functions of t. This gives rise to a non-operational Normal, linear regression model: Yt = β 0 (t) + β 1 (t)xt + ut , t ∈ N:=(1, 2, . . . , n, . . .),
(7.57)
since all its parameters vary with t:

β0(t) = μ1(t) − β1(t)μ2(t), β1(t) = σ12(t)/σ22(t), σ²(t) = σ11(t) − [σ12(t)]²/σ22(t).
The linear regression model in (7.57) can be rendered operational, however, when the parameters in φ(t) are specified in terms of explicit functional forms of t with a small number of unknown parameters. Let us consider two special cases.

Example 7.24 Consider the case where f(xt, yt; φ(t)) is bivariate Normal:

(Yt, Xt) ~ N( [γ10 + γ11t, γ20 + γ21t], [[σ11, σ12], [σ12, σ22]] ),        (7.58)
with unknown parameters ϕ := (γ10, γ11, γ20, γ21, σ11, σ22, σ12). This gives rise to a Normal, linear regression model with a trend whose statistical GM is

Yt = δ0 + δ1t + β1xt + ut, t ∈ N:=(1, 2, . . . , n, . . .),        (7.59)

and its underlying parameterization is ϕ1 := (δ0, δ1, β1, σ²):

δ0 = (γ10 − β1γ20)∈R, δ1 = (γ11 − β1γ21)∈R, β1 = (σ12/σ22)∈R, σ² = (σ11 − σ12²/σ22)∈R+.        (7.60)
Note that the trend in the linear regression model is derived via
β0(t) = μ1(t) − β1μ2(t) = γ10 + γ11t − β1(γ20 + γ21t) = (γ10 − β1γ20) + (γ11 − β1γ21)t = δ0 + δ1t.

It is straightforward to extend this model to higher-order polynomials in t:

μi(t) = ∑_{k=0}^{p} γik·t^k = γi0 + γi1t + γi2t² + · · · + γip·t^p, i = 1, 2, t ∈ N.

Similarly, the above linear regression model can easily include dummy variables that can accommodate shifts or/and seasonal (weekly, monthly, quarterly) heterogeneity:

μi(t) = ∑_{k=1}^{s} γik·Dk, i = 1, 2, where Dk = 1 for season k and 0 otherwise, with ∑_{k=1}^{s} γik = 0,

or sinusoidal polynomials to model seasonal patterns:

μi(t) = ∑_{k=0}^{m} [αk·cos(2πkt/12) + δk·sin(2πkt/12)], i = 1, 2.

What is important to note is that the heterogeneity terms enter the regression function through the marginal means of the random variables (Xt, Yt).

Example 7.25 Consider the case where f(xt, yt; φ(t)) is bivariate Normal:

(Yt, Xt) ~ N( [γ10 + γ11t, γ20 + γ21t], [[σ11·t, σ12·t], [σ12·t, σ22·t]] ).        (7.61)
This differs from Example 7.24 in so far as the variances/covariances are multiples of t (separable heterogeneity; see Chapter 8). This gives rise to a Normal, linear regression model with a trend whose statistical GM is

Yt = δ0 + δ1t + β1xt + ut, (ut|Xt = xt) ~ N(0, σ²·t), t ∈ N.
(7.62)
In light of the regression parameterization in (7.60), the new parameterization is

β1 = (σ12·t)/(σ22·t) = σ12/σ22, σ²(t) = σ11·t − (σ12·t)²/(σ22·t) = (σ11 − σ12²/σ22)·t = σ²·t.

Hence, the Normal, linear regression model in (7.62) has a heterogeneous skedastic function:

Var(Yt|Xt = xt) = σ²·t, t ∈ N.
(7.63)
At this stage it is important to recall that (7.63) is not heteroskedasticity, since

Heteroskedasticity: Var(Yt|Xt = xt) = g(xt) is a function of xt∈RX;
Heterogeneity: Var(Yt|Xt = xt) = g(t) is a function of t∈N.

The above examples of modeling particular forms of heterogeneity illustrate the usefulness of deterministic terms that provide a generic way to account for various forms of heterogeneity with a view to securing a statistically adequate model. It is important to emphasize, however, that while such generic modeling is perfectly acceptable on statistical modeling and inference grounds, such terms represent ignorance from the substantive perspective. Hence, in going from an adequate statistical model to a substantive model one needs to replace the generic trends with substantively meaningful variables. This is important not only for enhancing the substantive scope of the statistically adequate model, but also for improving its forecasting capacity. Deterministic terms, such as trend polynomials, are too rigid to be useful for medium- and long-term forecasting; they might be reasonably accurate for short-term forecasting.
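The distinction between heterogeneity and heteroskedasticity can be made concrete by simulating from the GM in (7.62): the error variance grows with the ordering t, not with the value of xt (all parameter values below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical GM (7.62): Y_t = d0 + d1*t + b1*x_t + u_t with Var(u_t) = s2*t
d0, d1, b1, s2 = 1.0, 0.3, 0.8, 0.5
n = 2_000
t = np.arange(1, n + 1)
x = rng.standard_normal(n)
u = np.sqrt(s2 * t) * rng.standard_normal(n)   # variance is a function of t ...
y = d0 + d1 * t + b1 * x + u                   # ... not of x_t: no heteroskedasticity

# Variance heterogeneity (7.63): later errors are markedly more dispersed
print(u[: n // 2].var(), u[n // 2 :].var())
```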
7.6 Summary and Conclusions
Regression models arise naturally in modeling the dependence (synchronic or temporal) between random variables by reparameterizing the dependence in their joint distribution in the context of the conditional distribution. For two random variables: f (x, y; φ) = f (y|x; ϕ 1 )·f (x; ϕ 2 ), ∀(x, y)∈RX ×RY .
(7.64)
In this reduction { f(y|X = x; ϕ1), ∀x∈RX } represents as many densities as there are possible values of x, which can be modeled by rendering the conditional moments of f(y|X = x; ϕ1) functions of x∈RX, giving rise to regression models. Different joint distributions f(x, y; φ) generate different regression models with a wide variety of functional forms for the regression and skedastic functions. It turns out that one cannot always ignore { f(x; ϕ2), ∀x∈RX } and specify the regression model exclusively in terms of { f(y|X = x; ϕ1), ∀x∈RX }. After delineating the circumstances under which this problem arises, an obvious way to account for f(x; ϕ2) is to extend the ordinary conditional moments E(Y^r|X = x) to allow for
stochastic conditioning E(Y^r|σ(X)). This form of conditioning was also used to introduce the concept of the statistical generating mechanism that provides the link between the statistical model Mθ(x) and the substantive model Mϕ(x), relating to the same data x0. The purely probabilistic construal of a statistical model Mθ(x) aims to keep the two types of models separate ab initio, so as to ensure the validity of Mθ(x) before one poses the substantive questions of interest relating to the substantive model Mϕ(x). All these threads were tied together to specify regression models in terms of their probabilistic assumptions. These regression models were then extended to allow for different forms of dependence and heterogeneity in the mean and variance.

Important Concepts

Probabilistic reduction, conditioning on a random variable, regression function, skedastic function, clitic function, kurtic function, linear regression functions, homoskedasticity, heteroskedasticity, regression models, weak exogeneity, reverse regression, regression toward the mean, synchronic (contemporaneous) dependence, temporal dependence, regression parameterization, the Normal linear regression model, the Student’s t linear regression model, stochastic regression models, law of iterated expectations (LIE), Gauss linear model, systematic component, non-systematic component, a statistical generating mechanism (GM), structural model validation, heterogeneity in regression models.

Crucial Distinctions

Conditioning on a random variable vs. conditioning on an event, linearity in parameters vs. linearity in x, homoskedasticity vs. heteroskedasticity, heteroskedasticity vs. conditional variance heterogeneity, contemporaneous (synchronic) vs. temporal (diachronic) dependence, the Normal vs. the Student’s t linear regression model, the Normal vs. the Gauss linear model, conditioning Y on X = x vs. conditioning Y on σ(X), systematic vs. non-systematic components, statistical vs. substantive models, statistical vs. substantive error terms, a statistical GM vs. a structural relationship, statistical adequacy.

Essential Ideas

● Regression models provide a natural way to parameterize the dependence between the random variables Y and X and render it estimable with data {(xt, yt), t = 1, 2, . . . , n}.
● For each different joint distribution f(x, y; φ), there is a distinct regression model specified in terms of the conditional moments E(Y^r|X = x) = hr(x), r = 1, 2, . . . , ∀x∈RX, whose functional forms hr(.), r = 1, 2, . . . , are often different; see Table 7.1.
● The Normal, linear regression (LR) model unduly dominates the empirical literature in many disciplines, because its probabilistic assumptions are often never validated to question its appropriateness and/or because the model is viewed as a simple tool in curve-fitting.
● Linearity in x of the regression function is not rare among known distributions, but homoskedasticity is uncommon. The combination of linearity in x, as well as in the parameters, for the regression function and homoskedasticity is extremely rare; see Chapter 14.
● Selecting a regression model does not have to be either a curve-fitting exercise or a haphazard and aimless search guided solely by substantive information. A coherent strategy for selecting appropriate regression models begins with the t-plots of the data selected by substantive information, combined with scatterplots in cases where the t-plots do not exhibit any departures from the IID assumptions. The scatterplot is related directly to the joint distribution f(x, y; φ) via the equal-probability contours and can guide one in selecting the appropriate f(x, y; φ). This choice of f(x, y; φ) determines the relevant regression model; see Table 7.1.
● It is a serious modeling error to assume that all regression models can ignore f(x; ϕ2) and be framed exclusively in terms of { f(y|X = x; ϕ1), ∀x∈RX }.
● The concept of a statistical GM constitutes a purely probabilistic construct that provides the link between the statistical model, based exclusively on the systematic information in the selected data, and the substantive (structural) model stemming from a theory or theories. Conflating the two models at the outset will hamstring any effort to test the empirical validity of the substantive model.
● Viewing statistical models as stemming from a reduction of a joint distribution ensures the internal consistency of the model assumptions, as well as their testability using the particular data, and provides the statistical parameterization of the model. The latter plays a very important role in the context of statistical inference.
● To establish the empirical validity of a substantive model one needs to begin with a statistically adequate model that (i) fully accounts for the chance regularity patterns in the data and (ii) nests the substantive model parametrically. Statistical adequacy secures the reliability of the evaluation inference procedures, including testing the parametric restrictions stemming from comparing the statistical and substantive models. Without statistical adequacy, the appraisal of the substantive model using statistical procedures is likely to be unreliable.
7.7 Questions and Exercises
1. Explain how the notion of conditioning enables us to deal with the dimensionality problem raised by joint distributions of samples.
2. Explain why the reduction f(x, y) = f(y|x)·fx(x) raises a problem due to the fact that { f(y|X = x; ϕ1), ∀x∈RX } represents as many conditional distributions as there are possible values of x in RX.
3. Define and explain the following concepts: (a) conditional moment functions; (b) regression function; (c) skedastic function; (d) homoskedasticity; (e) heteroskedasticity.
4. Consider the joint distribution given below:

x\y     1     2     3    fx(x)
−1     .10   .08   .02    .2
 0     .15   .06   .09    .3
 1     .20   .20   .10    .5
fy(y)  .45   .34   .21     1
(a) Derive the conditional distributions of (Y|X = x) for all values of X.
(b) Derive the regression and skedastic functions for the distributions in (a).

5. Let the joint density function of two random variables X and Y be

x\y    0    1    2
0     .1   .2   .2
1     .2   .1   .2
(a) Derive the following conditional moments: E(Y|X = 1), Var(Y|X = 1), E{[Y − E(Y|X = 1)]³|X = 1}.
(b) Verify the equalities: (i) Var(Y|X = 1) = E(Y²|X = 1) − {E[Y|X = 1]}²; (ii) E(Y) = E{E(Y|X)}; and (iii)* Var(Y) = E{Var(Y|X)} + Var{E(Y|X)}.
6. Compare and contrast the concepts E[Y|X = x] and E[Y|σ(X)].
7. From the bivariate distributions in Chapter 7:
(a) Collect the regression functions which are: (i) linear in the conditioning variable, (ii) linear in parameters, and (iii) the intersection of (i) and (ii).
(b) Collect the skedastic functions which are: (iv) homoskedastic, (v) heteroskedastic.
(c) In light of your answers in (a) and (b), explain how typical the Normal, linear regression model is.
8. Explain the notion of linear regression. Explain the difference between linearity in x and linearity in the parameters.
9. Consider the joint Normal distribution denoted by

(Y, X) ~ N( [μ1, μ2], [[σ11, σ12], [σ12, σ22]] ).        (7.65)

(a) For values μ1 = 1, μ2 = 1.5, σ11 = 1, σ12 = −0.8, σ22 = 2, plot E(Y|X = x) and Var(Y|X = x) for x = 0, 1, 2.
(b) Plot E(Y|X = x) and Var(Y|X = x) for x = 0, 1, 2 for a bivariate Student’s t distribution with the parameters in (7.65) taking the same values as in (a), for ν = 3, 5, 7.
(c) State the marginal distributions of Y and X.
10. Explain the concept of stochastic conditional moment functions.
11. Explain the notion of weak exogeneity. Why do we care?
12. Explain what one could do when weak exogeneity does not hold for a particular reduction as given in (7.64).
13. Explain the concept of a statistical generating mechanism and discuss its role in empirical modeling.
14. Let Y be a random variable and define the error term by u = Y − E(Y|σ(X)). Show that, by definition, this random variable satisfies the following properties: [i] E(u|σ(X)) = 0; [ii] E(u·X|σ(X)) = 0; [iii] E(u) = 0; [iv] E{u·[E(Y|σ(X))]|σ(X)} = 0.
15. Explain the difference between temporal and contemporaneous dependence.
16. Compare and contrast the statistical GMs of: (a) the simple Normal model, (b) the linear/Normal regression model, and (c) the linear/Normal autoregressive model.
17. Compare and contrast the simple Normal and Normal/linear regression models in terms of their probability and sampling models.
18. Compare and contrast the Normal/linear and Student’s t regression models in terms of their probability and sampling models.
19. Explain why the statistical and structural (substantive) models are based on very different information.
20. Discuss why a statistically misspecified model provides a poor basis for reliable inference concerning the substantive questions of interest.
21. Explain how the purely probabilistic construal of a statistical model enables one to draw a clear line between the statistical and substantive models.
22. Explain how the discussion about dependence and regression models in relation to the bivariate distribution f(x, y; φ) can be used to represent both synchronous and temporal dependence, by transforming the generic notation (X, Y) to stand for both (Xt, Yt) and (Yt, Yt−1).
23. Discuss the purely probabilistic construal of a regression model proposed in this chapter and explain why it enables one to distinguish between a statistical and a substantive (structural) model. Explain the relationship between the two types of models.
8 Introduction to Stochastic Processes
8.1 Introduction
The concept of a stochastic process is of fundamental importance for the approach to empirical modeling adopted in this book for several reasons, the most important being that every statistical model, specified generically as

Mθ(z) = { f(z; θ), θ∈Θ⊂Rm }, z∈RnZ, m < n,
(8.1)
can be viewed as a particular parameterization of the observable stochastic process {Zt, t∈N:=(1, 2, . . . , n, . . .)} underlying the observed data Z0. This ensures that the parameterization θ∈Θ has a “well-defined meaning” because it is directly related to the probabilistic structure of {Zt, t∈N}; see McCullagh (2002). The probabilistic assumptions comprising the model are chosen with a view to accounting fully for the systematic statistical information (chance regularity patterns) exhibited by Z0. This is tantamount to selecting these probabilistic assumptions so as to render the data Z0 a typical realization of {Zt, t∈N}. Hence, validating a statistical model amounts to testing whether its probabilistic assumptions do, indeed, fully account for the systematic information exhibited by Z0. In addition to selecting the relevant data Z0, the substantive information influences the choice of the parameterization for the statistical model, enabling the modeler to pose the substantive questions of interest to data Z0. The ultimate aim is to harmoniously blend the statistical and substantive information with a view to learning about the phenomena of interest that gave rise to Z0. Example 8.1 In Example 1.1, casting two dice and adding the dots, the generating mechanism is the same whether one is interested in the occurrence of all 36 combinations (Table 1.3), or in adding up the two faces with outcomes 2, 3, . . . , 12 (Table 1.2), or in gambling on the events odds A = (3, 5, 7, 9, 11) and evens B = (2, 4, 6, 8, 10, 12) (Table 1.5). In each case, however, the relevant statistical model and its parameterization are different because one is posing different questions. The probabilistic perspective adopted in this book raises two crucial issues. The first has to do with the conditions required for the joint distribution f(z; θ), z∈RnZ, of the sample
Z:=(Z1, . . . , Zn) to be well defined. This is a non-trivial issue because the sample represents only a finite initial segment of the process {Zt, t∈N}. Kolmogorov (1933a) addressed this crucial issue by demonstrating that one needs to impose certain mild conditions on the (infinite) process {Zt, t∈N} for f(z; θ), z∈RnZ, to provide a general probabilistic description of the process; see Section 2.3. The second crucial issue relates to the demarcation of possible chance regularities that give rise to operational statistical models using probabilistic assumptions to account for such patterns. It is obvious that for a data set to be amenable to statistical modeling it must exhibit sufficient regularity patterns and a certain degree of invariance (stability of relative frequencies). If the process changes with the index t∈N in non-systematic and haphazard ways, the resulting data are unlikely to be amenable to statistical modeling. The current chapter will consider this question in great detail by defining several probabilistic assumptions (restrictions) from the three categories – distribution, dependence, and heterogeneity – that could give rise to operational models. Special emphasis is placed on particular stochastic processes, such as Normal processes, including the Wiener process and Brownian motion, Markov processes, martingale and martingale difference processes, as well as count processes, because of their importance in empirical modeling. For empirical modeling, the development of the theory of stochastic processes in the late 1920s and early 1930s, pioneered by Kolmogorov and Khinchin, is second in importance only to Fisher’s recasting of statistics when he introduced the concept of a prespecified statistical model as a set of probabilistic assumptions imposed on the data. It is fair to say that, historically speaking, the IID assumptions were implicitly imposed on the overwhelming majority of statistical modeling before the 1930s. Indeed, one can argue that the IID assumptions constitute a formalization of the philosopher’s a priori assumptions of the “uniformity of nature” and the “representativeness of the sample.” The concept of a non-IID sequence (based on an ordering) of random variables that can have many different forms of probabilistic structure giving rise to estimable statistical models was a major step forward for empirical modeling. The primary focus of this chapter will be on this empirical modeling aspect of stochastic processes, with the sample represented by {Xt}nt=1 and the data x0:={xt}nt=1 viewed as a finite realization of it.
8.1.1 Random Variables and Orderings

It is often said that for a random (IID) sample X:=(X1, X2, . . . , Xn), the ordering (t = 1, 2, . . . , n) has no real role to play in empirical modeling since the random variables are exact replicas of each other. This is a highly misleading claim since it is only true at the inference stage. However, before one reaches that stage, the ordering is of crucial importance for the specification (initial choice of Mθ(z)), the misspecification testing (model validation), and the respecification (choosing another model) with a view to securing statistical adequacy. The reason is that:

(a) The ordering provides the basis for detecting chance regularities in data x0 pertaining to IID or non-IID patterns, using both graphical techniques as well as formal misspecification (M-S) testing.
(b) The probabilistic assumptions for the (M) Dependence and (H) Heterogeneity categories are always defined relative to a particular ordering of interest.

Example 8.1 (continued) In the case of casting two dice, the observations are entered in terms of the ordered sequence of trials as shown in Table 1.1, and using the t-plot in Figure 1.1 one could assess the IID assumptions using the chance regularities exhibited by the data. In contrast, when the data for exam scores in Example 1.8 were ordered using the sitting arrangement, they exhibited spatial dependence. Having said that, the concept of a random variable X as such (Chapter 3) has no inherent ordering. The ordering was introduced in relation to the concept of a sample (X1, X2, . . . , Xn) in the context of a sampling model. This calls for the concept of a sample, a set of random variables, to be assigned an index that represents the relevant ordering. Such orderings often represent time, age, geographical position, marital status, gender, etc. When we specified statistical models, however, we allowed this sequence of random variables to go beyond n, since a statistical model represents a stochastic mechanism that, in principle, is not confined to the particular data x0:=(x1, x2, . . . , xn). A statistical model is often used to predict beyond the observed data, and thus the stochastic mechanism should be specified more generally to hold for a sequence of random variables {X1, X2, . . . , Xn, . . .} := {Xt, t∈N:=(1, 2, . . . , n, . . .)}. A stochastic process is an indexed sequence of random variables, with the index representing the relevant ordering. In the case of time-series data, the index is invariably time. In modeling cross-section data, where the individual observation units are people, households, firms, cities, states, and countries, there is often more than one ordering of interest with respect to which dependence or/and heterogeneity can be defined and evaluated.
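Point (a) above — that the ordering carries the evidence about IID or non-IID patterns — can be illustrated numerically: the same set of numbers looks trending under one ordering and patternless once the ordering is scrambled. A small sketch (the trend coefficient is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
t = np.arange(n)

trending = 0.05 * t + rng.standard_normal(n)   # mean changes with the index t
shuffled = rng.permutation(trending)           # same numbers, ordering destroyed

# Correlation with the index: a crude numerical stand-in for eyeballing a t-plot
print(np.corrcoef(t, trending)[0, 1])   # large -- the ordering is informative
print(np.corrcoef(t, shuffled)[0, 1])   # near zero after scrambling
```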
The only difference between time-series and cross-section data is that for the former the time ordering is measured on an interval scale, but for the latter the relevant orderings might vary from ratio scale (geographical position, age, size) to nominal scale (gender, religion); see Chapter 1. A bird's-eye view of the chapter. The main objective of this chapter is to define and explain the concept of a stochastic process and the related restrictions of dependence and heterogeneity needed to specify operational statistical models which can be used for modeling non-IID data. In addition, the discussion focuses on several named stochastic processes that play a crucial role in statistical modeling. The discussion of stochastic processes can end up being one of the most involved and confusing parts of probability theory, mainly because of the numerous overlapping types of stochastic processes one encounters. The difficulties of mastering the material are alleviated when the discussion is structured in a way that makes it easier to compare and contrast the various stochastic processes. In an attempt to mitigate the confusion for the uninitiated, we use the following learning aids:
(i) The discussion begins with a brief overview of the early developments in stochastic processes. This is to lessen the problem of introducing too many concepts too quickly, as well as to introduce some basic terminology.
(ii) The probabilistic structure of stochastic processes is discussed in relation to the three basic categories of probabilistic assumptions:
(D) Distribution, (M) Dependence, (H) Homogeneity. (8.2)
This renders the comparison between different processes more transparent.
(iii) Several taxonomies of stochastic processes are presented to bring out the interrelationships between them, including the discrete/continuous distinction.
(iv) In specifying the probabilistic structure of different stochastic processes, the emphasis is placed on whether the structure is given in terms of their joint distribution, or their moments. In relation to that, it is important to bring out the difference between probabilistic assumptions that are useful for proving theorems, and those that are useful for modeling purposes. The latter differ from the former in so far as they are testable vis-à-vis the data.
8.2 The Concept of a Stochastic Process
8.2.1 Defining a Stochastic Process A stochastic process is an indexed sequence {Xt , t∈N} of random variables defined on the same probability space (S, ℱ, P(.)), i.e. Xt is a random variable relative to (S, ℱ, P(.)) for each t in the index set N. Example 8.2 The number of telephone calls arriving at a telephone exchange over the interval [0, t] can be modeled using such an indexed sequence of random variables, where Xt measures the number of calls up to time t, whose index set is N:=(0, 1, 2, 3, . . .). As explained in Chapter 3, a random variable X(.), defined on (S, ℱ, P(.)), is a real-valued function from the outcomes set S to the real line R, such that it preserves the event space by ensuring that {s: X(s) = x}:=X−1 (x)∈ℱ, ∀x∈R. This function can be extended to a stochastic process by adding another argument t: X(., .): S × N → R, where N is the relevant index set. In this sense, the notation {Xt , t∈N} should be viewed as a simplification of {X(s, t), s∈S, t∈N}. The index set can take a number of different forms, including (1, 2, . . . , n, . . .), (0, ±1, ±2, . . .), (−∞, ∞), [0, 1], and (a, b), a < b, (a, b)∈R2. When the index set is countable, we refer to it as a discrete index, and when it is uncountable, we call it a continuous index; the widely used terminology discrete/continuous parameter process should be avoided since the index is not a parameter in the sense used in this book. NOTATIONAL DEVICE: In many cases during the discussion that follows we are going to discuss concepts which are applicable to both discrete- and continuous-index stochastic processes. The notation for discrete-index processes is, of course, more natural and less
complicated than that of continuous-index processes and more often than not the former will be used. However, in cases where we want to emphasize the general applicability of a concept, we use a notational device which in a sense enables us to use the discrete notation to cover both cases. Instead of using the sequence {Xk , k∈N} which is clearly discrete, we use {Xtk , k∈N} such that 0 < t1 < t2 < t3 < · · · < tn < · · · < ∞, tk ∈N:=(1, 2, . . . , n, . . .), to deal with both cases. Example 8.3
Let the stochastic process {X(t), t∈[0, ∞)} be of the form
X(t) = W·t, W∼N(0, 1). This implies that {X(t), t∈[0, ∞)} is a Normal stochastic process with E(X(t)) = 0, Var(X(t)) = t². In view of the fact that {X(s, t), s∈S, t∈N} has two arguments, we can view a stochastic process from two different viewing angles, which can be confusing if they are not clearly distinguished. (i) The random variable viewing angle. For N:={t1 , t2 , . . .}, when t = tk is given, X(s, tk ), s∈S represents a function X(., tk ): S → R, which is just a random variable whose density function f (x(tk )), x(tk )∈RX specifies its probabilistic structure. For a given subset of N, say {t1 , t2 , . . . , tn }⊂N, {X(., t1 ), X(., t2 ), . . . , X(., tn )} is a collection of random variables, such as a sample, whose probabilistic structure is fully described by their joint density function f (x(t1 ), x(t2 ), x(t3 ), . . . , x(tn )). (ii) The sample path viewing angle. For S = {s1 , s2 , s3 , . . . , sn }, when outcome s = sk is given, X(sk , t), t∈N represents a deterministic function X(sk , .): N → R. The graph of this function is often called a sample path (or sample realization), since this is the feature of the stochastic process that we often associate with observed data. In Figures 8.1 and 8.2 we can see the sample paths of a discrete and a continuous stochastic process, respectively.
[Fig. 8.1 A discrete-sample path]  [Fig. 8.2 A continuous-sample path]
When
[Fig. 8.3 An ensemble of five sample paths: Xt plotted against t]
both t = tk and s = sk are given, X(sk , tk ) represents a single observation. Allowing s to change (s∈S = {s1 , s2 , s3 , . . . , sn }), the functions {X(s1 , .), X(s2 , .), X(s3 , .), . . . , X(sn , .)}, t∈N define a collection of different sample paths, called an ensemble; see Figure 8.3. The mathematical structure of the ensemble also plays an important role in what follows. N O T E S : (a) It is important to emphasize at this stage that it is common practice to connect the points of a sample path of a discrete process. To avoid confusion when looking at t-plots of data, the discrete process will be distinguished by the dots or little squares added for the particular observations. (b) We often cannot resist the temptation to interpret t as time for convenience, but it could easily be a cross-section ordering such as those mentioned above. (c) The index t can easily be multidimensional in the sense that the stochastic process {Xt , t∈R3 } could represent the velocity of a particle suspended in liquid with t being its position in the three-dimensional Euclidean space. In such cases the stochastic process is often called a stochastic field. (d) The stochastic process {Xt , t∈N} can easily be extended to an m × 1 vector Xt of random variables: Xt := (X1t , X2t , . . . , Xmt ) .
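The two viewing angles can be made concrete with a small numpy sketch (illustrative, not from the book): in the array below, each row is a sample path (s fixed, t varying) and each column is one random variable of the process (t fixed, s varying), i.e. a cross-section of the ensemble.

```python
import numpy as np

rng = np.random.default_rng(1)

# Ensemble of a Normal random-walk process: 5 sample paths over 50 index points.
# Row i is the sample path X(s_i, .); column t is the random variable X(., t).
n_paths, n_points = 5, 50
ensemble = np.cumsum(rng.normal(size=(n_paths, n_points)), axis=1)

sample_path = ensemble[0, :]   # fix the outcome s: a deterministic function of t
random_var = ensemble[:, 10]   # fix the index t: one draw per outcome s
```

A single observation X(sk, tk) is one cell of this array; the data we observe in practice is usually a single row.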
8.2.2 Classifying Stochastic Processes; What a Mess! The structure of a stochastic process {X(s, t), s∈S, t∈N} depends partly on the nature of two sets, its index set N and its state space, the range of values taken by the process X(., t), which often changes with t, and hence it is defined to be the union of the sets of values of X(., t) for each t, RX = ∪t∈N RXt . What renders stochastic processes mathematically different is whether these sets are countable (discrete) or uncountable (continuous); see Chapter 2.
(a) In the case where N is a countable set, such as N = {0, 1, 2, 3, . . .}, we call {Xt , t∈N} a discrete-index stochastic process. On the other hand, when N is an uncountable set, such as N = [0, ∞), we call {X(t), t∈N} a continuous-index stochastic process. (b) Similarly, the state space RX of the stochastic process {Xt , t∈N} can be countable or uncountable, introducing a four-way index set/state space (N, RX ) classification of stochastic processes.

Table 8.1 Classifying stochastic processes

      Index set N    State space RX   Example
D-D   countable      countable        simple random walk
D-C   countable      uncountable      Normal process
C-D   uncountable    countable        Poisson process
C-C   uncountable    uncountable      Brownian motion process
Depending on whether N and RX are discrete (countable) or continuous (uncountable) sets, stochastic processes can be classified as in Table 8.1. The classification constitutes a schematic grouping that partitions stochastic processes, and its usefulness stems from organizing our thoughts at the initial stages of mastering the material, and avoiding confusion when reading different books or papers. This, however, is not the only, or even the most useful, classification, because it ignores the probabilistic structure of a stochastic process. A number of other classifications of stochastic processes, such as stationary/non-stationary, Markov/non-Markov, Gaussian/non-Gaussian, ergodic/non-ergodic, are based on their probabilistic structure and provide useful groupings
Fig. 8.4 Classifying stochastic processes based on their probabilistic structure
of stochastic processes. A bird’s-eye view of a categorization based on the probabilistic structure of stochastic processes is given in Figure 8.4; see Srinivasan and Mehata (1988).
8.2.3 Characterizing a Stochastic Process In view of the fact that the probabilistic structure of a set of random variables is best described by their joint distribution, it is only natural to use the same device for specifying the probabilistic structure of a stochastic process. This, however, raises the question of specifying infinite-dimensional distributions because a stochastic process {Xt , t∈N} often has an infinite index set N:=(1, 2, . . . , n, . . .). Kolmogorov, in the same 1933 book that founded modern probability, provided the answer in the form of a theorem indicating that one need not worry about infinite-dimensional distributions because a sequence of n-dimensional ones, for n ≥ 1, can completely describe (characterize) the probabilistic structure of a stochastic process. A stochastic process {Xt , t∈N} is said to be fully specified when its finite-dimensional joint cumulative distribution function (cdf) F(xt1 , xt2 , . . . , xtn ) is given for all subsets {t1 , t2 , . . . , tn }⊂N. This result is very useful because its converse is also true (see Kolmogorov, 1933a). Kolmogorov's extension theorem For any integer n ≥ 1, let Fn (xt1 , xt2 , . . . , xtn ) be the n-dimensional cdf that satisfies the following conditions: (i) Permutation invariance. F(xt1 , xt2 , . . . , xtn ) = F(xπ(t1 ) , xπ(t2 ) , . . . , xπ(tn ) ) for all permutations π of {t1 , t2 , . . . , tn }. (ii) Consistency. For all (xt1 , xt2 , . . . , xtn )∈Rn:
lim_{xtn+1 → ∞} Fn+1 (xt1 , xt2 , . . . , xtn , xtn+1 ) = Fn (xt1 , xt2 , . . . , xtn ), for (n+1) > 1.
Then, there exists a probability space (S, ℱ, P(.)), and a stochastic process {Xt , t∈N} defined on it, such that Fn (xt1 , xt2 , . . . , xtn ) is the cdf of (Xt1 , Xt2 , . . . , Xtn ) for each n; see Billingsley (1995). The permutation invariance (symmetry) condition is trivially satisfied by any stochastic process, since reshuffling (permuting) the order of (Xt1 , Xt2 , . . . , Xtn ) should not change the joint cdf. Example 8.4
Consider the Gumbel bivariate cdf for (X1 , X2 ):
F(x1 , x2 ) = 1 − e^(−x1) − e^(−x2) + e^(−(x1 + x2 + θx1 x2)) = F(x2 , x1 ) = 1 − e^(−x2) − e^(−x1) + e^(−(x2 + x1 + θx2 x1)). The consistency condition is more interesting since it enables the modeler to go from the
joint distribution F2 (xt1 , xt2 ) to the marginal lim_{xt2 → ∞} F2 (xt1 , xt2 ) = Fx1 (xt1 ), and then to the conditional Fx2 |x1 (xt2 |xt1 ) = ∫_{−∞}^{xt2} [f (xt1 , u)/f1 (xt1 )] du.
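As a quick numerical check of the consistency condition, the bivariate exponential (Gumbel-type) cdf of Example 8.4 tends to the exponential marginal F1(x1) = 1 − e^(−x1) as x2 grows; a stdlib-only sketch, where the dependence value θ = 0.5 is an arbitrary illustrative choice:

```python
import math

def gumbel_cdf(x1, x2, theta=0.5):
    """Bivariate exponential (Gumbel-type) cdf; theta is an illustrative value."""
    return 1 - math.exp(-x1) - math.exp(-x2) + math.exp(-(x1 + x2 + theta * x1 * x2))

# Consistency: letting x2 -> infinity recovers the exponential marginal 1 - e^(-x1)
x1 = 1.3
marginal = 1 - math.exp(-x1)
gap = abs(gumbel_cdf(x1, 50.0) - marginal)  # effectively zero

# Permutation invariance holds by the symmetry of the formula
symm_gap = abs(gumbel_cdf(1.3, 50.0) - gumbel_cdf(50.0, 1.3))
```

Both conditions of the extension theorem can thus be verified directly for this family.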
Kolmogorov's extension theorem is a remarkable result since it shows that one can construct an infinite-dimensional stochastic process from a consistent family of finite-dimensional probability spaces (S, ℱ, P(.))n :=(S, ℱ, P(.)) × · · · × (S, ℱ, P(.)), as well as the
reverse. This means that for any reasonable (i.e. consistent) family of n-dimensional distributions, their probabilistic structure is fully described by such distributions. Broadly speaking, this suggests that phenomena which exhibit chance regularity patterns can be modeled within the mathematical framework demarcated by the probability space (S, ℱ, P(.))n , endowed with the mathematical structure given in Chapters 2–4, unless they contain inconsistencies of the form stated in the theorem. Kolmogorov's foundation became an instant success by discarding the straitjacket of the IID games of chance view of probability, and greatly expanding its scope and domain of applicability. HISTORICAL NOTE: The mathematical concept of a stochastic process as given above was formulated in the early 1930s. Before that time the notion of a stochastic process existed only in the form of a model for specific stochastic phenomena. These models of stochastic phenomena were primarily in physics, with Maxwell and Einstein. The notable exception to this is the attempt by Bachelier (1900) to put forward a model for the behavior of prices in the Paris stock market. From the probabilistic viewpoint the concepts one needs to define a stochastic process were not developed until the 1920s. Indeed, from the time of Cardano (1501–1576), when the notion of independence between two events was first introduced and then formalized by de Moivre in the 1730s, until the late nineteenth century, dependence was viewed as a nuisance and interpreted negatively as lack of independence. In practice, the perspective of a stochastic process in terms of Fn (xt1 , xt2 , . . . , xtn ), referred to as the joint distribution perspective, is not as common as the functional perspective of a stochastic process {Yk , k∈N}, constructed as a function of a simpler stochastic process (often IID or independent increments, etc.) {Xt , t∈N}: Yk = g(Xt1 , Xt2 , . . . , Xtk ), k∈N.
(8.3)
This means that several stochastic processes are constructed using simpler building blocks. The probabilistic structure of the constructed process {Yk , k∈N} is determined from that of the simpler process {Xt , t∈N} via the mapping (8.3). Example 8.5
Consider the following function:
Yk = A sin(ωk + U), U∼U(0, 2π), k∈N, where A denotes the amplitude, ω denotes the angular frequency (A and ω are constants), and U is a uniformly distributed random variable, giving rise to a stochastic process {Yk , k∈N} with a random phase. Example 8.6 The stochastic process {Yk , k∈N} can be generalized by assuming that the amplitude A is also random with a density function f (a) = (a/σ²) exp(−a²/(2σ²)), a∈[0, ∞), known as the Rayleigh distribution, which is closely related to the chi-square with 2 d.f. (see Appendix 3.A). Assuming that A and U are independent, the stochastic process can be expressed in the form Yk = A sin(ωk + U) = X1 cos ωk + X2 sin ωk, k∈N,
where X1 = A sin U and X2 = A cos U, representing the inverse of the transformations based on A = √(X1² + X2²), U = tan⁻¹(X1 /X2 ), expressed in polar coordinates, yielding Xi ∼ NIID(0, σ²), i = 1, 2. Example 8.7 Partial sums process A very important mapping g(.), which plays a crucial role in what follows, is the function defining the partial sums of an IID process {Zt , t∈N}:
St = Σ_{i=1}^{t} Zi , t∈N. (8.4)
Despite the obvious simplicity of constructing stochastic processes using functions of simple building block processes, it is important to emphasize that to understand the structure of the resulting stochastic process is to derive its own joint distribution. Example 8.7 (continued) To reveal this, let us simplify the problem by assuming that the IID process {Zt , t∈N} has bounded mean and variance: E(Zt ) = μ, Var(Zt ) = σ², t = 1, 2, 3, . . . Using the linearity of the expectation (Chapter 3), we can deduce that
(i) E(St ) = tμ, t = 1, 2, 3, . . .
(ii) Var(St ) = tσ², t = 1, 2, 3, . . .
(iii) Cov(St , Sk ) = σ² min(t, k), t, k = 1, 2, 3, . . .
(8.5)
The results (i) and (ii) are trivial to derive, but (iii) can be demonstrated as follows:
Cov(Sk , Sm ) = E{(Sk − kμ)(Sm − mμ)} = E[(Σ_{i=1}^{k} (Zi − μ))(Σ_{j=1}^{m} (Zj − μ))]
= Σ_{i=1}^{k} Σ_{j=1}^{m} E[(Zi − μ)(Zj − μ)] = Σ_{i=1}^{min(k,m)} E(Zi − μ)² = σ² min(k, m),
since Cov(Zi , Zj ) = 0, i ≠ j. The structure (i)–(iii) suggests that the partial sums process {St , t∈N} is neither independent nor identically distributed. CAUTION: The reader is cautioned that the above structure is only indicative of the more general dependence structure of partial sums because we concentrated exclusively on the first two moments, which in general might not even exist!
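Results (i)–(iii) are easy to corroborate by simulation; the following numpy sketch (illustrative constants, not from the book) estimates the mean, variance, and covariance of a Normal partial sums process across independent replications:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 0.5, 2.0, 30, 100_000

# reps independent realizations of an IID(mu, sigma^2) process, partially summed
Z = rng.normal(mu, sigma, size=(reps, n))
S = np.cumsum(Z, axis=1)

t, k = 10, 25
mean_St = S[:, t - 1].mean()                       # ~ t * mu = 5
var_St = S[:, t - 1].var()                         # ~ t * sigma^2 = 40
cov_StSk = np.cov(S[:, t - 1], S[:, k - 1])[0, 1]  # ~ sigma^2 * min(t, k) = 40
```

The growing mean and variance, and the covariance depending on min(t, k), are exactly the non-IID structure derived above.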
8.2.4 Partial Sums and Associated Stochastic Processes From the historical perspective, the first to venture into the uncharted territory of nonIID stochastic processes from the probabilistic viewpoint was Markov in 1908, who noticed that the derived partial sum process {St , t ∈ N} is no longer IID since it is both dependent and heterogeneous. A number of important stochastic processes, such as Markov, random walk, independent increments, martingales, and the Brownian motion process, as well as their associated dependence and heterogeneity restrictions, can be viewed in the context of the functional perspective as partial sums of independent random variables.
What is so special about a partial sum process? Looking at the functional form of a partial sums process in (8.4), it is clear that the cumulative sum can be defined in a more compact and transparent form by subtracting St−1 = Σ_{i=1}^{t−1} Zi from St = Σ_{i=1}^{t} Zi to yield
St − St−1 = Zt → St = St−1 + Zt , t = 1, 2, . . . , with S0 = 0.
(8.6)
Markov process. This suggests that the conditional distribution of St given its past (St−1 , St−2 , . . . , S1 ) depends only on the most recent past, i.e. ft (st |s(t−1) ; ψ t ) = ft (st |st−1 ; ϕ t ), ∀s(t−1) :=(st−1 , . . . , s1 )∈Rt−1 , t = 2, 3, . . .
(8.7)
That is, the dependence structure between St and its past (St−1 , . . . , S1 ) is fully captured by its conditional distribution given its most recent past St−1 ; we called it Markov dependence, and the stochastic processes that satisfy this are called Markov processes. Markov's early result using discrete processes was formalized in its full generality by Kolmogorov (1928b, 1931); see Section 8.6. Random walk process. Another important stochastic process that arises by partially summing independent random variables is the random walk process. The stochastic process {Sk , k∈N} is said to be a random walk if it can be specified as the partial sum of IID random variables {Ztk , k∈N}:
Sk = Σ_{i=1}^{k} Zti , where Zti ∼ IID(.), i = 1, 2, . . . , k, k∈N.
(8.8)
Note that this notation enables us to define the random walk process for both discrete- and continuous-index processes. For a continuous partial sum process we need to replace the summation with an integral. In light of (8.6), {Sk , k∈N} can also be expressed in the equivalent form
Sk = Sk−1 + Ztk , where Ztk ∼ IID(.), k = 1, 2, . . . , k∈N.
(8.9)
This also stems from the fact that, as shown above, the process {Sk , k∈N} has increments {(Sk − Sk−1 ), k∈N} which are IID random variables. The joint distribution takes the form
f (s1 , . . . , sn ; φ) = f1 (s1 ; θ1 ) ∏_{k=2}^{n} fk (sk − sk−1 ; θk ) = f1 (s1 ; θ1 ) ∏_{k=2}^{n} f (sk − sk−1 ; θ), s∈Rn , (8.10)
where the first and second equalities follow from the IID of {(Sk − Sk−1 ), k∈N}. In terms of our taxonomy of probabilistic assumptions, both a Markov and a random walk process are defined without any direct distribution assumption. The tendency to concentrate on the first two moments of the process can be very misleading because: (a) they might not exist (Zi ∼ Cauchy(0, 1), i = 1, 2, . . .); (b) they capture only limited forms of dependence/heterogeneity. In a certain sense, the notion of a random walk process is an empty box which can be filled in with numerous special cases by imposing some additional probabilistic structure.
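The "empty box" point can be illustrated with a short numpy sketch (illustrative, not from the book): the same partial-sum recursion produces qualitatively different processes depending on the increment distribution chosen to fill it, and for Cauchy increments the sample moments are meaningless because the population moments do not exist.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000

# One recursion, S_t = S_{t-1} + Z_t, with different "fillings" of the box:
normal_walk = np.cumsum(rng.normal(0.0, 1.0, n))  # Normal random walk (uncountable state space)
poisson_walk = np.cumsum(rng.poisson(1.0, n))     # Poisson random walk (countable state space)
cauchy_walk = np.cumsum(rng.standard_cauchy(n))   # increments whose moments do not exist
```

The Poisson filling yields a non-decreasing integer-valued path, the Normal filling a continuous-state path, and the Cauchy filling a path whose empirical mean never settles down.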
By choosing the distribution to be discrete (e.g. Poisson) or continuous (e.g. Normal), we can define several different kinds of stochastic processes which, nevertheless, share a certain common structure. It is instructive to discuss briefly this common structure. The probabilistic structure imposed on the generic random walk by the functional form in (8.4) has been shown to give rise to a Markov and a random walk process, where the former is considerably more general than the latter. This is because Markov dependence does not depend on the partial sum transformation. Example 8.8 Let {Zt , t∈N} be an IID process with zero mean (E(Zt ) = 0, t = 1, 2, . . .). Then the sequence defined by the recursion Yt = h(Yt−1 ) + Zt , t = 2, 3, . . . for any well-behaved (Borel) function h(.) is a Markov process. This suggests that the Markov dependence structure does not depend on the linearity of the transformation but on its recursiveness. Independent increments process. A natural extension of an IID process is to consider an independent, but not ID, process {Ztk , k∈N}:
f (zt1 , zt2 , . . . , ztn ; φ) = ∏_{k=1}^{n} fk (ztk ; θk ), z∈Rn ,
and define the partial sums process {Ytk , k∈N}, where Yt1 = Zt1 . Given that
Ytk = Σ_{i=1}^{k} Zti → (Ytk − Ytk−1 ) = Ztk , k∈N,
we can deduce that the partial sum induces a certain linearity into the original process:
Ztk = Zt1 + Σ_{i=2}^{k} (Zti − Zti−1 ), k∈N.
This partial sum-induced linearity restricts the joint distribution f (zt1 , zt2 , . . . , ztn ; φ), since the distribution of Zt3 − Zt1 must be the same as that of the sum (Zt3 − Zt2 ) + (Zt2 − Zt1 ). Conversely, if {Ytk , k∈N} is an independent process, then for some arbitrary random variable Zt1 , the process {Ztk , k∈N} is defined by
Ztn − Zt1 = Σ_{i=1}^{n} Yti , n > 1.
The independent increments process {Ytk , k∈N} can be written as Ytk = Ytk−1 + Ztk , k∈N, which suggests that it is Markov dependent. Given that Ytk , given Ytk−1 , is a linear (differentiable and strictly monotonic) function of Ztk , one can use the result that for Y = h(Z):
fY (y) = fZ (h−1 (y)) · |dh−1 (y)/dy| , for all y∈RY ,
to show that the conditional distribution of (ytk |ytk−1 ) has a special structure:
fk (ytk |ytk−1 ; ϕk ) = fZtk (ytk − ytk−1 ; θk ), k = 2, 3, . . . ,
(8.11)
where fZtk (ytk − ytk−1 ; θk ) denotes the density of Ztk evaluated at ytk − ytk−1 . Applying this result to the joint distribution yields
f (yt1 , yt2 , . . . , ytn ; φ) = f1 (yt1 ; θ1 ) ∏_{k=2}^{n} fk (ytk − ytk−1 ; θk ), y∈RnY . (8.12)
Martingale process. Another important stochastic process motivated by the partial sums formulation is a martingale process. The importance of this process stems from the fact that it allows for sufficient dependence and heterogeneity for its partial sums process to behave asymptotically like a simple IID process. The notion of a martingale process was introduced in the late 1930s, but its importance was not fully appreciated until Doob (1953). The notion of a martingale process, in contrast to the Markov process, concentrates mostly on the first conditional moment instead of the distribution itself. Consider the partial sums stochastic process {St = Σ_{i=1}^{t} Zi , t∈N}, where the process {Zt , t∈N} is just independent (non-ID) with E(Zt ) = 0, t∈N. Notice, however, that the zero mean is an ID restriction. As shown above, the partial sums process can be written in the form St = St−1 + Zt , S0 = 0, t∈N. We can show that the conditional expectation of St given its past is
E(St |St−1 , . . . , S1 ) = E((St−1 + Zt )|St−1 , . . . , S1 ) = St−1 , t∈N.
(8.13)
This follows from the property CE4 "taking what is known out" (see Chapter 7). Collecting these results together, we say that the stochastic process {St , t∈N} is a martingale if:
(i) E(|St |) < ∞, t∈N;
(ii) E(St |σ (St−1 , St−2 , . . . , S1 )) = St−1 , t∈N.
(8.14)
Martingale difference process. Reversing the order, and viewing the original process {Zt , t∈N} through the prism of the martingale process {St , t∈N}, the fact that σ (St−1 , . . . , S1 ) = σ (Zt−1 , . . . , Z1 ) – the two event spaces coincide – implies that E(Zt |St−1 , . . . , S1 ) = E(Zt |Zt−1 , . . . , Z1 ) = E(Zt ) = 0, t∈N. Hence, viewing the process {Zt , t∈N} through the prism of {St , t∈N}: (i)
E(Zt ) = 0, t∈N;
(ii) E (Zt |Zt−1 , Zt−2 , . . . , Z1 ) = 0, t∈N. The process {Zt , t∈N} satisfying (i) and (ii) is a martingale difference process. All the stochastic processes discussed above are interrelated, but the overlap between them is often unclear because some of these concepts are framed in terms of distributions but others are framed in terms of certain moments of these distributions. The random walk and the independent increments processes are subsets of the Markov process category. On the other hand, martingale processes are not a proper subset of the Markov process category because the former impose the additional restriction of a bounded first moment. To make matters more cloudy, the partial sum process has been illustrated by assuming the existence of its first two moments.
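A quick simulation check of the martingale property (a hedged numpy sketch with an illustrative variance profile, not from the book): with independent, zero-mean but heteroskedastic increments, the last increment averages to zero whichever region of the past the partial-sum process sits in, consistent with E(St |St−1, . . . , S1) = St−1.

```python
import numpy as np

rng = np.random.default_rng(4)
reps, n = 100_000, 20

# Independent but non-ID increments: zero mean, time-varying variance
sigmas = np.linspace(0.5, 3.0, n)
Z = rng.normal(0.0, sigmas, size=(reps, n))
S = np.cumsum(Z, axis=1)

# Martingale check at t = n: split replications by where the past S_{t-1} lies;
# the increment Z_t should have mean ~ 0 in both groups.
past, increment = S[:, -2], Z[:, -1]
median = np.median(past)
mean_inc_low = increment[past < median].mean()    # ~ 0
mean_inc_high = increment[past >= median].mean()  # ~ 0
```

Note this process is not IID (the variances change with t), yet the martingale-difference structure of its increments survives.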
8.2.5 Gaussian (Normal) Process: A First View Let us return to the Normal (Gaussian) process {Xk , k∈N} without any dependence or heterogeneity restrictions (Chapter 6) as a prelude to the discussion of those two sets of probabilistic assumptions. Example 8.9 Consider the stochastic process {Xt , t∈N} whose n-dimensional joint distribution f (x1 , x2 , . . . , xn ; φ) is assumed to be Normal (Gaussian) (N(μ, Σ)) with density function
f (x1 , x2 , . . . , xn ; φ) = ([det Σ])^(−1/2) (2π)^(−n/2) exp{−(1/2)(x − μ)′Σ^(−1)(x − μ)}, x∈Rn ,
where φ:=(μ, Σ), μ = E(X) the n×1 mean vector, and Σ = Cov(X) = [σij ], i, j = 1, 2, . . . , n, the n×n variance–covariance matrix. This can be denoted using the matrix notation
X:=(X1 , X2 , . . . , Xn )′ ∼ N(μ, Σ), with μ:=(μ1 , μ2 , . . . , μn )′ and Σ:=[σij ], i, j = 1, 2, . . . , n. (8.15)
As argued in Chapter 6, this process does not give rise to an operational model because it
involves N = n + [n(n + 1)]/2 unknown parameters φ:=(μi , σij , i, j = 1, 2, . . . , n), which is considerably greater than the number n of observations comprising a realization of a sample X:=(X1 , X2 , . . . , Xn ) based on such a process. As argued in Chapter 7, the only possible reduction of the joint distribution is the one based on sequential conditioning:
f (x1 , x2 , . . . , xn ; φn ) = f1 (x1 ; ψ1 ) ∏_{t=2}^{n} ft (xt |xt−1 , . . . , x1 ; ψt ), ∀x∈Rn (non-IID), (8.16)
giving rise to the autoregressive and autoskedastic functions
E(Xt |σ (Xt−1 , . . . , X1 )) = β0 (t) + Σ_{i=1}^{t−1} βi (t)Xt−i ,  Var(Xt |σ (Xt−1 , . . . , X1 )) = σ0²(t),  t = 2, 3, . . . , n,
that do not yield an operational model because of the incidental parameter problem: ψt :=(β0 (t), β1 (t), . . . , βt−1 (t), σ0²(t)), t = 2, . . . , n. Example 8.9 (continued) Let us see what happens when one imposes independence, σij = 0 for i ≠ j, i, j = 1, 2, . . . , n:

Distribution: Normal | Dependence: Independence | Heterogeneity: Unrestricted

No operational statistical model results, since the marginal distributions
Xt ∼ N(μt , σtt ), t = 1, 2, . . . , n
(8.17)
involve 2n unknown parameters θt :=(μt , σtt ), t = 1, 2, . . . , n, which are also increasing with the sample size; the incidental parameter problem did not go away! To be able to specify operational models, we need to impose some restrictions on the dependence and heterogeneity of the stochastic process {Xt , t∈N}. Table 8.2 lists some of these assumptions.

Table 8.2 Probabilistic assumptions for statistical models

Distribution      Dependence              Heterogeneity
Normal            independence            identically distributed
Beta              correlation             strict stationarity
Gamma             Markov dependence       weak (second-order) stationarity
Bernoulli         Markov of order p       Markov homogeneity
Student's t       m-dependence            stationarity of order m
Pearson type II   martingale dependence   exchangeability
Exponential       martingale difference   separable heterogeneity
Poisson           α-mixing                asymptotic homogeneity
Binomial          mixingale               partial sums heterogeneity
...               ...                     ...
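The incidental parameter problem that motivates these restrictions can be made concrete by counting the unknown parameters of the unrestricted Gaussian process of Example 8.9 (N = n + n(n + 1)/2 against n observations); a tiny sketch:

```python
def unknown_params(n: int) -> int:
    """n means plus n(n+1)/2 distinct variances/covariances of the unrestricted process."""
    return n + n * (n + 1) // 2

counts = {n: unknown_params(n) for n in (5, 50, 500)}
# N grows quadratically in n, so with only n observations no restrictions => no operational model
```

For n = 5, 50, and 500 observations one would need to estimate 20, 1325, and 125750 parameters, respectively.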
8.3 Dependence Restrictions (Assumptions)
Having introduced a number of important notions using particular examples of stochastic processes, we proceed to define a number of dependence and heterogeneity restrictions. We limit ourselves to very few examples because the rest of this chapter will be devoted to the usefulness of the notions introduced in this section in the context of different stochastic processes. For notational convenience we often avoid the discretized index notation 0 < t1 < t2 < t3 < · · · < tn < · · · < ∞, tk ∈N:=(1, 2, . . . , n, . . .) and use the discrete notation t∈N:=(1, 2, . . . , n, . . .).
8.3.1 Distribution-Based Concepts of Dependence The concepts of dependence in this subsection are framed in terms of distributions. Independence. The stochastic process {Yt , t∈N} is said to be independent if
f (y1 , y2 , . . . , yn ; φ) = ∏_{t=1}^{n} ft (yt ; ψt ), ∀y∈RnY , where y:=(y1 , . . . , yn ).
This concept has been discussed extensively in previous chapters and constitutes the absence of dependence. Markov dependence. The stochastic process {Yt , t∈N} is said to be Markov if
ft (yt |yt−1 , yt−2 , . . . , y1 ; ϕt ) = ft (yt |yt−1 ; ψt ), t = 2, 3, . . .
The intuition underlying this concept of dependence is that the conditional distribution of Yt given its "past" (Yt−1 , Yt−2 , . . . , Y1 ) depends only on the most recent past Yt−1 . It is important to emphasize that this does not mean that Yt is independent of (Yt−2 , . . . , Y1 ); it is not! Instead, Yt is conditionally independent of (Yt−2 , . . . , Y1 ) given Yt−1 ; see Chapter 6. This notion of dependence can be trivially extended to higher orders as follows. Markov dependence of order p. The stochastic process {Yt , t∈N} is said to be Markov dependent of order p ≥ 1 if
ft (yt |yt−1 , yt−2 , . . . , y1 ; ϕt ) = ft (yt |yt−1 , . . . , yt−p ; ψt ), t = p + 1, p + 2, . . .
That is, Yt is conditionally independent of (Yt−p−1 , . . . , Y1 ) given (Yt−1 , . . . , Yt−p ). Dependence of order m. The stochastic process {Yt , t∈N} is said to be m-dependent if, for p > m > 0,
f ((y1 , . . . , yn ), yn+p , yn+p+1 , . . . , y2n+p ; φn,p ) = f (y1 , . . . , yn ; ψn )·f (yn+p , . . . , y2n+p ; ψn,p ).
The intuition behind this form of dependence is that when the elements of the stochastic process are more than m periods apart, they become independent. In contrast to Markov dependence of order m, m-dependence means that Yn and Yn+p are independent. Example 8.10 This form of dependence arises naturally when the modeler considers an IID, mean zero sequence {Xt , t∈N} and defines Yt = Xt ·Xt+m , t∈N. The stochastic process {Yt , t∈N} is an m-dependent process. This follows from the fact that for Y = X1 ·X2 : fX1 ·X2 (y) = ∫_{−∞}^{∞} |1/x1 | f (x1 , y/x1 ) dx1 ; see Chapter 4. Asymptotic independence. The stochastic process {Yt , t∈N} is said to be asymptotically independent if f (yt+p |yt , yt−1 , . . . , y1 ; φt,p ) → f (yt+p ; ψt+p ) as p → ∞. The intuition behind this form of dependence is that the elements of the stochastic process become independent as the distance between them increases to infinity.
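Example 8.10 can be checked by simulation (a numpy sketch with illustrative constants, not from the book). Note that for zero-mean Normal X the process Yt = Xt·Xt+m is actually uncorrelated at every lag, so the dependence within m periods shows up in the squares rather than the levels, while elements more than m apart are fully independent:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 3, 200_000

X = rng.normal(size=n + m)
Y = X[:n] * X[m:n + m]   # Y_t = X_t * X_{t+m}: an m-dependent process

def lag_corr(y, lag):
    """Sample autocorrelation of y at a given lag."""
    d = y - y.mean()
    return float(np.sum(d[lag:] * d[:-lag]) / np.sum(d * d))

# Y_t and Y_{t+m} share the common factor X_{t+m}, hence are dependent
dep_at_m = lag_corr(Y**2, m)          # clearly positive
dep_beyond_m = lag_corr(Y**2, m + 1)  # ~ 0: more than m apart => independent
```

This also illustrates why moment-based and distribution-based notions of dependence must be kept apart: a process can be uncorrelated yet dependent.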
8.3.2 Moment-Based Concepts of Dependence Historically, the earliest dependence restriction based on the first two moments was the extreme case of non-correlation. Non-correlation. The stochastic process {Yt, t∈N}, with E(Y_t²) < ∞, t∈N, is said to be non-correlated if Corr(Y_t, Y_s) = 0 for all t ≠ s. m-Correlation. The stochastic process {Yt, t∈N} is said to be m-correlated if Corr(Y_t, Y_{t+p}) = 0 for p > m > 0. That is, when the elements of {Yt, t∈N} are more than m periods apart, the correlation is zero. Martingale dependence. The stochastic process {Yt, t∈N} is said to be martingale dependent if E(Y_t | Y_{t−1}, Y_{t−2}, ..., Y_1) = 0, t∈N.
Thus, in view of the fact that the Lindeberg condition is necessary and sufficient for both (CLT) and (F) (see (9.36)) and the latter does not hold, it means that the Lindeberg condition cannot be satisfied (see Stoyanov, 1987).
9.5.4 Chebyshev’s CLT We are now in a position to view Chebyshev’s theorem in the light of the Lindeberg–Feller theorem. The CLT holds for a stochastic process {Xn, n∈N} that satisfies the assumptions in Table 9.21.

Table 9.21 Chebyshev’s CLT
(D) Bounded second moment: E(X_k²) < ∞, k∈N, and (UB) P(|X_n| < b) = 1, for some b > 0
(M) Independence: f(x1, x2, ..., xn; φ) = ∏_{k=1}^n f_k(x_k; θ_k), x∈R_X^n
(H) Heterogeneity: E(X_k) = μ_k, Var(X_k) = σ_k², k∈N, and (M) v_n² = Var(Σ_{k=1}^n X_k) → ∞ as n → ∞
That is, imposing the uniform boundedness (UB) condition and adding Markov’s condition, the CLT holds. The conditions (UB) and (M) imply the Lindeberg condition (L), since v_n² → ∞ implies that εv_n → ∞ for any ε > 0, and Σ_{k=1}^n E[(X_k − μ_k)²·I{|X_k − μ_k| > εv_n}] grows less fast than v_n²; indeed, once εv_n exceeds the uniform bound on |X_k − μ_k|, every indicator is identically zero.
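The mechanics of Chebyshev’s CLT can be sketched with a small simulation (not part of the text; the Uniform distributions, the bound structure, and the use of Python/NumPy are illustrative assumptions). Each X_k is independent, mean zero, uniformly bounded, and heterogeneous in variance, so (D), (UB), (M), and (H) all hold and the standardized sum should be close to N(0,1):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 500, 5_000

# Independent, uniformly bounded, heterogeneous X_k: Uniform(-b_k, b_k) with
# b_k in [0.5, 1] (so P(|X_k| < 1) = 1), hence mu_k = 0 and sigma_k^2 = b_k^2/3.
b = rng.uniform(0.5, 1.0, size=n)
v_n = np.sqrt(np.sum(b**2 / 3.0))          # v_n^2 = Var(sum of X_k), grows with n

x = rng.uniform(-b, b, size=(reps, n))     # reps independent replications
z = x.sum(axis=1) / v_n                    # standardized sums

# Under (UB) and (M) the Lindeberg condition holds, so z should be near N(0,1);
# a N(0,1) variable falls in [-1, 1] with probability about 0.6827.
frac_within_1 = np.mean(np.abs(z) <= 1.0)
```

The empirical coverage of [−1, 1] matching the Normal value is, of course, only a heuristic check of the limiting distribution, not a proof.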
9.5.5 Hajek–Sidak CLT Define a vector sequence of real numbers {c_n}_{n=1}^∞, where c_n := (c_{n1}, c_{n2}, ..., c_{nn}), known as a triangular array, such that
[max_{1≤i≤n} c_{ni}²] / [Σ_{j=1}^n c_{nj}²] → 0 as n → ∞,   (9.38)
where (9.38) ensures that no single weight c_{ni} dominates the vector c_n. Assume that {Xn, n∈N} satisfies the assumptions in Table 9.22.
Limit Theorems in Probability
Table 9.22 Hajek–Sidak CLT
(D) Bounded moments: E(X_k²) < ∞, k∈N
(M) Independence: {Xn, n∈N} is an IID process

(i) E(X_k) = μ_k and (ii) Var((1/n)Σ_{k=1}^n X_k) = (1/n²)Σ_{k=1}^n E(X_k²) = (1/n²)Σ_{k=1}^n σ_k², and thus the CLT follows from the fact that (L) ⇒ (F): lim_{n→∞}[max_{1≤k≤n}(σ_k²/v_n²)] = 0; see (9.35).
9.5.7 CLT for a Stationary Process Assume that {Xn, n∈N} satisfies the probabilistic assumptions in Table 9.24 or Table 9.25. Then the CLT holds for both stochastic processes:
lim_{n→∞} P((1/v_n) Σ_{k=1}^n (X_k − μ) ≤ z) = (1/√(2π)) ∫_{−∞}^z e^{−u²/2} du, ∀z∈R.   (9.40)
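The CLT for dependent but mixing processes can be illustrated with a stationary Gaussian AR(1), which satisfies both sets of assumptions; the sketch below is not from the text, and the AR(1) form, the parameter values, and the use of Python/NumPy are illustrative assumptions. The key difference from the IID case is the normalization: v_n² must use the long-run variance of the partial sum, n·σ_e²/(1−φ)², not n times the marginal variance:

```python
import numpy as np

rng = np.random.default_rng(2)
phi, n, reps = 0.5, 1_000, 4_000

# Stationary Gaussian AR(1): X_t = phi*X_{t-1} + e_t, sigma_e = 1. The process
# is second-order (indeed strictly) stationary and mixing, so the CLT (9.40)
# applies with v_n^2 ~ n/(1 - phi)^2.
x_t = rng.standard_normal(reps) / np.sqrt(1 - phi**2)   # stationary start
s = x_t.copy()
for _ in range(n - 1):
    x_t = phi * x_t + rng.standard_normal(reps)
    s += x_t

z = s / (np.sqrt(n) / (1 - phi))            # standardize by the long-run v_n
frac_within_1 = np.mean(np.abs(z) <= 1.0)   # about 0.6827 under N(0,1)
```

Standardizing instead by √(n·Var(X_t)) would give a variance of (1+φ)/(1−φ) = 3 here, a reminder that dependence changes the correct scaling even when the CLT itself still holds.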
9.5 The Central Limit Theorem
Table 9.24 CLT for a strictly stationary process
(D) Bounded (2+δ) moments: E(|X_k|^{2+δ}) < ∞, δ > 0, k∈N
(M) α-Mixing: α(k) → 0 as k → ∞
(H) Strict stationarity: f(x1, x2, ..., xn) = f(x_{1+τ}, x_{2+τ}, ..., x_{n+τ}), for any τ
Table 9.25 CLT for a second-order stationary process
(D) Bounded second moment: E(X_k²) < ∞, k∈N
(M) ρ-Mixing: v_n² → ∞ as n → ∞ and Σ_{k=1}^∞ ρ(2^k) < ∞
(H) Second-order stationarity: E(X_k) = μ, Var(X_k) = σ², k∈N

Intuitively, this says that F(x) is stable if the distribution of S_n = Σ_{k=1}^n X_k is of the same type as a_n X + b_n, where (X1, X2, ..., Xn) are IID random variables with the same distribution
as X. It turns out that the members of this family have explicit formulae for their density functions only in special cases, such as the Normal and the Cauchy distributions. For the other members of the stable family we usually work with the characteristic function; see Galambos (1995). Levy’s theorem. Let {Xn, n∈N} be a sequence of IID random variables with S_n = Σ_{k=1}^n X_k. Assume that there exist constants a_n > 0 and b_n such that lim_{n→∞} P((S_n − b_n)/a_n ≤ x) = F(x), where F(x) is non-degenerate.
Then F(x) is a stable distribution. 9.5.9.2 Max-stable Family of Distributions Using the equivalence relation “being of the same type” we can unify another category of limit distributions, associated with the maximum (not the partial sum) of a stochastic process {Xn, n∈N}. Max-stable family. Let {Xn, n∈N} be a sequence of IID random variables with a non-degenerate cdf F(x). The distribution F(x) is said to be max-stable if the distribution of X_max(n) = max(X1, X2, ..., Xn) is of the same type for every positive integer n. Gnedenko’s theorem. For the IID process {Xn, n∈N} with X_max(n) = max(X1, ..., Xn), assume a_n and b_n > 0 are constants such that P((X_max(n) − a_n)/b_n ≤ x) = (F(a_n + b_n x))^n and lim_{n→∞} (F(a_n + b_n x))^n = G(x),
where G(x) is a max-stable distribution. It turns out that the members of the G(x) family for X_max(n) come in three forms, as shown in Table 9.26.

Table 9.26 G(x) family for X_max(n)
(a) Gumbel: G1(x) = exp{−e^{−((x−μ)/σ)}}, ∀x∈R
(b) Frechet: G2(x) = exp{−[1 + (1/α)((x−μ)/σ)]^{−α}} for x > 0, and G2(x) = 0 for x ≤ 0, with α > 0
(c) Weibull: G3(x) = exp{−[1 − (1/α)((x−μ)/σ)]^{α}} for x < 0, and G3(x) = 1 for x ≥ 0, with α < 0
McFadden (1978) showed that all three distributions are special cases of the generalized extreme value (GEV) distribution:
G(x) = exp(−[1 + λz]^{−1/λ}) for λ ≠ 0, and G(x) = exp(−exp(−z)) for λ = 0,
with density
g(x) = (1/σ)(1 + λz)^{−(1/λ)−1} exp(−[1 + λz]^{−1/λ}) for λ ≠ 0, and g(x) = (1/σ) exp(−z) exp(−exp(−z)) for λ = 0,
where z = (x − μ)/σ, μ and σ are location and scale parameters, and λ a shape parameter; for λ > 0, z > −1/λ and for λ < 0, z < −1/λ.
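Gnedenko’s theorem can be sketched numerically for the Gumbel case (not from the text; the Exp(1) parent distribution, the normalizing constants a_n = ln n, b_n = 1, and the use of Python/NumPy are illustrative assumptions). For IID Exp(1) variables, X_max(n) − ln n converges to the Gumbel G1 with μ = 0, σ = 1, whose mean is Euler’s constant 0.5772... and whose cdf at 0 is e^{−1} ≈ 0.368:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 500, 10_000

# Centered maxima of n IID Exp(1) draws, replicated reps times:
# (X_max(n) - ln n) is approximately Gumbel(0, 1) for large n.
maxima = rng.exponential(scale=1.0, size=(reps, n)).max(axis=1) - np.log(n)

euler_gamma = 0.5772156649
sample_mean = maxima.mean()           # near euler_gamma (the Gumbel(0,1) mean)
frac_below_0 = np.mean(maxima <= 0.0) # near exp(-1) ~ 0.368 (the Gumbel cdf at 0)
```

The exponential parent converges to G1 very quickly; heavier-tailed parents (Frechet domain) or bounded-support parents (Weibull domain) would require different normalizations, matching the three-way classification in Table 9.26.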
9.5.9.3 Infinitely Divisible Family of Distributions The infinitely divisible family of distributions constitutes a natural extension of the stable family in the case where the random variables in the sequence {Xn , n∈N} are independent but non-ID. The basic result in relation to this family is that the limit distribution (assumed to be non-degenerate) of the partial sums of independent but not necessarily ID random variables is infinitely divisible; for further details see Moran (1968) and Galambos (1995).
9.6 Extending the Limit Theorems*
When limit theorems are extended from the behavior of (1/n)Σ_{k=1}^n X_k to that of well-behaved functions Y_n = h(X), X:=(X1, X2, ..., Xn), they play an important role in statistical inference in two respects. First, they provide approximate sampling distributions for estimators and test statistics when the finite sample one is unobtainable. Second, they provide minimal (necessary but not sufficient) properties for estimators, tests, and other statistics. For θ* denoting the “true” value of θ (Chapter 13), an estimator θ̂_n(X) = h(X) of θ is said to satisfy the asymptotic properties in Table 9.27.

Table 9.27 Asymptotic properties for an estimator θ̂_n(X)
Consistent (weakly): lim_{n→∞} P(|θ̂_n(X) − θ*| > ε) = 0, for any ε > 0
Consistent (strongly): P(lim_{n→∞} θ̂_n(X) = θ*) = 1
Asymptotically Normal: lim_{n→∞} P((θ̂_n(X) − θ*)/√Var(θ̂_n(X)) ≤ z) = Φ(z), ∀z∈R
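Weak consistency in Table 9.27 can be made concrete with a short simulation (not part of the text; the Bernoulli model, the choice of ε, and the use of Python/NumPy are illustrative assumptions). The tail probability P(|θ̂_n − θ*| > ε) is estimated by Monte Carlo for increasing n and should shrink toward zero:

```python
import numpy as np

rng = np.random.default_rng(4)
theta_star, eps, reps = 0.3, 0.05, 10_000

# Weak consistency of the sample mean of Bernoulli(theta*) draws:
# P(|theta_hat_n - theta*| > eps) should shrink toward 0 as n grows.
tail = {}
for n in (50, 500, 5_000):
    theta_hat = rng.binomial(n, theta_star, size=reps) / n
    tail[n] = np.mean(np.abs(theta_hat - theta_star) > eps)
```

The monotone decline of the estimated tail probabilities traces out the defining limit of weak consistency; no single n, however large, ever certifies the limit itself, which is the point made at the end of this chapter.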
The first obvious extension of the above limit theorems is to the case of a random vector X∈R_X^m. In the case of the LLN this is trivially true, because when the law holds for every element it holds for the random vector. For the CLT, however, it is different, because the asymptotic distribution is defined in terms of the first two moments, which involve the covariances among the elements of the random vector. Multivariate CLT. Let {X_k, k∈N} be a vector stochastic process with E(X_k) = μ (an m×1 vector) and Cov(X_k) = Σ (an m×m matrix), ∀k = 1, 2, ..., n, ...; then, under certain restrictions which ensure that no random vector dominates the summation:
√n(X̄_n − μ) → N(0, Σ) as n → ∞, where X̄_n = (1/n)Σ_{k=1}^n X_k.
The question which naturally arises at this point is to what extent the above limit theorems can help our quest for approximate distribution results for arbitrary functions g(X1, X2, ..., Xn). After all, the above theorems are related to a very specific function:
c_n^{−1} S_n := c_n^{−1} Σ_{k=1}^n [X_k − E(X_k)], as n → ∞.   (9.42)
The gap between asymptotic results for general functions and the results of the above limit theorems is bridged in several ways. The first is trivial in the sense that the scaled sum in
(9.42) includes cases such as Σ_{k=1}^n X_k^r, for r = 1, 2, 3, ..., when the modeler can ensure that the new random variables Z_k = X_k^r, for k = 1, 2, ..., satisfy the conditions of the above limit theorems. As shown in Chapters 11–15, numerous estimators and test statistics (the stuff that statistical inference is built upon) fall into this category of functions. Another way the gap between scaled sums as in (9.42) and arbitrary functions can be bridged is the following theorem, which ensures that any continuous function of X can be accommodated within a general limit theorem framework. Mann and Wald theorem. Assuming that {Xn, n∈N} is a stochastic process, X is another random variable on the same probability space (S, F, P(·)), and g(·): R → R is a continuous function, then
(a) (X_n →a.s. X) ⇒ g(X_n) →a.s. g(X),
(b) (X_n →P X) ⇒ g(X_n) →P g(X),
(c) (X_n →D X) ⇒ g(X_n) →D g(X).
Example 9.6
An interesting example of this theorem is the following: if X_n →D X ~ N(0, 1), then X_n² →D X² ~ χ²(1).
Example 9.7 Let X_n →D X ~ N(0, 1) and consider the function Y_n = 1/X_n. It turns out that Y_n →D Z, where f(z) = (1/(√(2π) z²)) exp(−1/(2z²)), z ≠ 0, despite the fact that g(x) = 1/x is not continuous at x = 0. f(z) is known as the inverse Gaussian distribution.
There are two things worth noting about the Mann–Wald theorem. (i) Mann and Wald (1943) proved a more general result where g(·) is a Borel function with discontinuities on a set of probability zero. (ii) This theorem is more general than we need, in the sense that the SLLN and the WLLN refer to convergence almost surely and in probability to a constant, not a random variable. Moreover, the CLT entails convergence to the Normal distribution. The above theorem refers to any limit distribution.
Another useful result for asymptotic theory is the following theorem. Cramer’s theorem. Let g(·): R → R be such that dg(θ)/dθ ≠ 0 is continuous in a neighborhood of θ∈R:
√n(X̄_n − θ) → N(0, σ²) ⇒ √n(g(X̄_n) − g(θ)) → N(0, [dg(θ)/dθ]² σ²) as n → ∞.
The vector form of this theorem is
√n(X̄_n − θ) → N(0, Σ), g(·): R^m → R^k, with rank[∂g(θ)/∂θ'] = k, implying
√n(g(X̄_n) − g(θ)) → N(0, [∂g(θ)/∂θ'] Σ [∂g(θ)/∂θ']') as n → ∞.
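Cramer’s theorem (the delta method) can be checked by simulation for g(x) = x² (not from the text; the parameter values and the use of Python/NumPy are illustrative assumptions). With dg(θ)/dθ = 2θ, the asymptotic variance of √n(X̄_n² − θ²) should be (2θ)²σ²:

```python
import numpy as np

rng = np.random.default_rng(5)
theta, sigma, n, reps = 2.0, 1.0, 500, 20_000

# Xbar_n ~ N(theta, sigma^2/n) exactly for a Normal sample, so we can draw
# the sample means directly and apply g(x) = x^2.
xbar = theta + sigma * rng.standard_normal(reps) / np.sqrt(n)
t = np.sqrt(n) * (xbar**2 - theta**2)

asym_var = (2 * theta * sigma) ** 2   # delta-method variance, = 16 here
sample_var = t.var()
```

The small positive mean of t (of order σ²/√n) is exactly the second-order term that the refinement in Example 9.9 below accounts for; the first-order delta method ignores it.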
Example 9.8 Consider the case where √n(X̄_n − θ) →D Z ~ N(0, σ²), and let g(x) = x². We know that dg(θ)/dθ = 2θ and thus
√n(X̄_n² − θ²) →D Y ~ N(0, 4θ²σ²).
It is clear from the above approximate result, often known as the delta method approximation, that this is a first-order Taylor approximation which, for linear functions g(·), provides an exact result. For non-linear functions, however, it can provide a poor approximation. In such cases we can proceed to derive a second-order approximation. The second-order Taylor approximation of g(·): R → R at x = θ, such that d²g(θ)/dθ² ≠ 0, takes the form
g(x) − g(θ) = (dg(θ)/dθ)(x−θ) + (1/2)(d²g(θ)/dθ²)(x−θ)²
= (1/2)(d²g(θ)/dθ²){[(x−θ) + (d²g(θ)/dθ²)^{−1}(dg(θ)/dθ)]² − [(d²g(θ)/dθ²)^{−1}(dg(θ)/dθ)]²},
where the second equality follows from completing the square. Hence, for √n(X̄_n − θ) → N(0, σ²) as n → ∞, the second-order approximation takes the form
n(g(X̄_n) − g(θ)) ≈ (σ²/2)(d²g(θ)/dθ²){[√n(X̄_n − θ)/σ + δ_n]² − δ_n²}, δ_n = (√n/σ)(d²g(θ)/dθ²)^{−1}(dg(θ)/dθ).
Since the square of a N(0, 1) random variable is chi-square distributed:
(√n(X̄_n − θ)/σ + δ_n)² → χ²(1; δ_n²) as n → ∞,
where χ²(1; δ_n²) denotes a chi-square with one degree of freedom and non-centrality parameter δ_n². Hence, the second-order approximation takes the form
n(g(X̄_n) − g(θ)) ≈ (σ²/2)(d²g(θ)/dθ²)[χ²(1; δ_n²) − δ_n²], δ_n := √n(dg(θ)/dθ)/(σ(d²g(θ)/dθ²)).
Example 9.9 Let us reconsider the above example, where √n(X̄_n − θ) →D Z ~ N(0, σ²), and take g(x) = x², dg(θ)/dθ = 2θ, d²g(θ)/dθ² = 2. The second-order approximation is an exact result:
n(X̄_n² − θ²) →D Y ~ σ²[χ²(1; δ_n²) − δ_n²], δ_n² = nθ²/σ².
We conclude this section on the positive note that the above results go a long way to help the modeler extend the CLT and derive asymptotic distribution results for arbitrary functions Yn = g(X1 , X2 , . . . , Xn ); see Chapter 11.
9.6.1 A Uniform SLLN* In statistical inference the modeler often finds himself dealing with a function of the random variables (X1, X2, ..., Xn) which includes some unknown parameter(s) θ, say (1/n)Σ_{k=1}^n h(X_k, θ), where E(h(X_k, θ)) = τ(θ). The modeler often assumes that if one replaces the unknown parameter with a good estimator θ̂_n (Chapter 12), then, using the SLLN, one can deduce that
(1/n)Σ_{k=1}^n h(X_k, θ̂_n) →a.s. τ(θ).   (9.43)
This, however, will be an erroneous conclusion, because the SLLN by itself is not strong enough to yield (9.43). It turns out that for (9.43) to hold we need (a) to restrict the parameter space Θ to be closed and bounded, (b) to ensure that h(·) is well behaved with respect to both arguments (such as being continuous), and (c) to strengthen the almost sure convergence to uniform convergence in θ (Billingsley, 1995): sup_{θ∈Θ} |(1/n)Σ_{k=1}^n h(X_k, θ) − τ(θ)| →a.s. 0.
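The uniform convergence in (c) can be sketched numerically (not from the text; the choice h(x, θ) = 1{x ≤ θ} with X ~ U(0,1), the grid over Θ, and the use of Python/NumPy are illustrative assumptions). Here τ(θ) = θ on the closed, bounded parameter space Θ = [0, 1], and the sup of the discrepancy over Θ should shrink as n grows:

```python
import numpy as np

rng = np.random.default_rng(6)
thetas = np.linspace(0.0, 1.0, 101)   # grid approximation of Theta = [0, 1]

# h(x, theta) = 1{x <= theta} with X ~ U(0,1), so tau(theta) = theta, and the
# sup over Theta of |(1/n) sum h(X_k, theta) - tau(theta)| is a uniform
# (Glivenko-Cantelli-type) discrepancy.
def sup_discrepancy(n):
    x = rng.uniform(size=n)
    emp = (x[:, None] <= thetas[None, :]).mean(axis=0)
    return np.max(np.abs(emp - thetas))

d_small, d_large = sup_discrepancy(50), sup_discrepancy(200_000)
```

Note that taking the sup over the whole of Θ is exactly what distinguishes the uniform statement from the pointwise SLLN applied at one fixed θ.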
9.7 Summary and Conclusions
The limit theorems, WLLN, SLLN, and CLT, tell us what happens in the limit (when n = ∞) by providing approximations to the sampling distribution or the moments of X̄_n = (1/n)Σ_{k=1}^n X_k, but provide no quantitative information pertaining to the accuracy of the difference X̄_n − (1/n)Σ_{k=1}^n E(X_k) for a given n. Like most results in probability theory, they have a mathematical and a modeling/inference dimension, which are not always in synchrony. A mathematically general and elegant result often stems from premises that are weak and non-testable, and thus of little value for modeling/inference purposes. Relying on weak but untestable probabilistic assumptions will at best give rise to statistical inferences with unknown reliability. The non-testability of the weaker assumptions often leads to imprecise inferences and renders the reliability of such results questionable at best, undermining any “learning from data.” That is, one should treat limit theorems as promissory notes that are really relevant for the “statistical afterlife” (as n→∞) because, as argued by Le Cam (1986a), “limit theorems ‘as n tends to infinity’ are logically devoid of content about what happens at any particular n.” Limit theorems demarcate the intended scope of empirical modeling by specifying minimal conditions (D, M, and H probabilistic assumptions) pertaining to {Xt, t∈N} under which “learning from data” is possible. That, however, provides no assurance that such learning will take place for particular data x0:=(x1, x2, ..., xn). To ensure the latter we need to validate the statistical model Mθ(x) assumptions vis-à-vis x0. Hence, one has to go the extra mile to establish the validity of the statistical premises to secure the precision and reliability of inference stemming from the particular x0. Additional reference: Barndorff-Nielsen and Cox (1994).
Important Concepts Limit theorems in probability, weak law of large numbers (WLLN), strong law of large numbers (SLLN), distribution of a (Borel) function of the sample, convergence in probability, almost sure convergence, convergence in distribution, Poisson’s law of small numbers, Bernoulli’s LLN, Chebyshev’s LLN, Borel’s SLLN, Kolmogorov’s SLLN, law of iterated logarithm (LIL), central limit theorem (CLT), Lindeberg–Feller CLT, family of stable distributions, Mann and Wald theorem, Cramer’s theorem, delta method, uniform convergence, limit theorems and asymptotic bounds, probabilistic inequalities, Donsker’s functional CLT. Crucial Distinctions Bernoulli’s WLLN vs. Borel’s SLLN, convergence in probability vs. almost sure convergence, convergence in probability vs. convergence in distribution, CLT vs. functional CLT. Essential Ideas ●
The main limit theorems, known as the WLLN, the SLLN, and the CLT, revolve around a particular function of the sample X: the difference between the sample mean X̄_n = (1/n)Σ_{k=1}^n X_k and its expected value (1/n)Σ_{k=1}^n E(X_k).
● The results of the WLLN, the SLLN, and the CLT concern the behavior of X̄_n at the limit n = ∞; they convey nothing about what happens for any n < ∞.
Markov inequality. Let X be a random variable such that E(|X|^p) < ∞ for some p > 0. Then, for any ε > 0:
P(|X| ≥ ε) ≤ E(|X|^p)/ε^p.
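A quick numerical check (not part of the text; the N(0,1) choice, the value of ε, and the use of Python/NumPy are illustrative assumptions) shows both that the Markov bound holds and that postulating a higher moment (p = 4 instead of p = 2) tightens it, anticipating the point made next:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.standard_normal(1_000_000)
eps = 2.0

# Markov's inequality with p = 2 is Chebyshev: P(|X| >= eps) <= E(X^2)/eps^2.
empirical = np.mean(np.abs(x) >= eps)   # about 0.0455 for N(0,1), eps = 2
bound_p2 = np.mean(x**2) / eps**2       # about 1/4
bound_p4 = np.mean(x**4) / eps**4       # about 3/16, sharper here
```

Both bounds are loose relative to the true tail probability, but the p = 4 bound uses the extra moment information and is correspondingly tighter, which is the “no free lunch” trade-off discussed below.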
The well-known saying that “there is no free lunch” can be illustrated by using this inequality to show that by postulating the existence of higher moments we can improve the upper bound.
Appendix 9.A: Probabilistic Inequalities
Example 9.A.1 Let {Xt, t∈N:=(1, 2, ..., n, ...)} be a sequence of IID Bernoulli-distributed random variables. It can be shown that S_n := Σ_{k=1}^n X_k ~ Bin(nθ, nθ(1−θ)). Using Chebyshev’s inequality yields
P(|n^{−1}S_n − θ| > ε) ≤ θ(1−θ)/(nε²).
On the other hand, using Markov’s inequality for the fourth moment:
P(|Y − E(Y)| > ε) ≤ E|Y − E(Y)|⁴/ε⁴,
noting that E|n^{−1}S_n − θ|⁴ = (θ(1−θ)/n³)[1 + 3θ(1−θ)(n−2)], yields
P(|n^{−1}S_n − θ| > ε) ≤ 3/(16n²ε⁴).
As can be seen, the estimate of the upper bound given by Markov’s inequality is less crude because it utilizes more information in relation to the existence of moments. Bernstein inequality. Let X(·): S → R_X := (0, ∞) be a positive random variable such that
E(e^{tX}) < ∞ for 0 ≤ t ≤ c. Then, for any ε > 0:
P(X ≥ ε) ≤ e^{−tε}E(e^{tX}) for each 0 ≤ t ≤ c, and hence P(X ≥ ε) ≤ inf_{0≤t≤c} e^{−tε}E(e^{tX}).
Hoeffding inequality. Let {Xt, t∈N} be an independent process such that E(X_k) = 0 and a_k ≤ X_k ≤ b_k for a_k < b_k. Then, for any t > 0 and any ε > 0:
P(Σ_{k=1}^n X_k ≥ ε) ≤ e^{−εt} ∏_{k=1}^n e^{t²(b_k − a_k)²/8}.
As one can see, the sharper bound comes at the expense of bounding the support of {X_n}_{n=1}^∞ between two known sequences of bounds {a_n, b_n}_{n=1}^∞. It is interesting to note that when {X_n}_{n=1}^∞ is a Bernoulli process, this inequality becomes
P(|X̄_n − θ| > ε) ≤ 2e^{−2nε²}.
Mill inequality. If Z ~ N(0, 1), with φ(z) denoting the density function, then for any t > 0:
(i) P(|Z| > t) ≤ 2φ(t)/t;
(ii) (1/t − 1/t³)φ(t) < P(Z > t) < φ(t)/t;
(iii) P(Z > t) < (1/2)exp(−t²/2).
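The Bernoulli form of Hoeffding’s inequality can be checked by simulation (not from the text; the parameter values and the use of Python/NumPy are illustrative assumptions). Both the Hoeffding and Chebyshev bounds are computed alongside the Monte Carlo estimate of the actual tail probability:

```python
import numpy as np

rng = np.random.default_rng(8)
theta, n, eps, reps = 0.5, 100, 0.1, 100_000

# For a Bernoulli process: P(|Xbar_n - theta| > eps) <= 2*exp(-2*n*eps^2).
xbar = rng.binomial(n, theta, size=reps) / n
empirical = np.mean(np.abs(xbar - theta) > eps)

hoeffding = 2 * np.exp(-2 * n * eps**2)         # ~ 0.271 here
chebyshev = theta * (1 - theta) / (n * eps**2)  # = 0.25 here
```

At this modest n·ε² the two bounds are comparable, but Hoeffding’s decays exponentially in n·ε² while Chebyshev’s decays only linearly in n, so the exponential bound dominates as n grows; both sit far above the true tail probability, illustrating the cost of distribution-free bounds.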
9.A.2 Expectation Cauchy–Schwarz inequality. Assume that the random variables X and Y have bounded second moments, i.e. E(X²) < ∞ and E(Y²) < ∞. Then [E(XY)]² ≤ E(X²)·E(Y²).
12. For the random variable X, where E(X) = 0 and Var(X) = 1, derive an upper bound on the probability of the event {|X − .6| > .1}. How does this probability change if one knows that X ~ U(−1, 1)?
13. For the random variable X, where E(X) = 0 and Var(X) = 1, derive an upper bound on the probability of the event {|X − .6| > .1}. How does this probability change if one knows that X ~ N(0, 1)? How accurate is the following inequality:
P(|X| ≥ ε) ≤ (2/√(2π)) ∫_ε^∞ (x/ε) e^{−x²/2} dx = (2/(√(2π)ε)) e^{−ε²/2}?
14. (a) (i) In Example 10.25 with ε = .1, evaluate the required sample size n to ensure that P(|X̄_n − θ| > ε) ≤ .020.
(ii) Calculate the increase in n needed for the same upper bound (.02) with ε = .05.
(iii) Repeat (i) and (ii) using the Normal approximation to the finite sample distribution and compare the results.
(b) Discuss briefly the lessons to be learned from the Bahadur and Savage (1956) Example 10.21 pertaining to the use of limit theorems for inference purposes.
15. (a) “For modeling purposes specific distribution assumptions are indispensable if we need precise and sharp results. Results based on bounded moment conditions are invariably imprecise and blunt.” Discuss. (b) “For a large enough sample size n, one does not need to worry about distributional assumptions, because one can use the bounds offered by limit theorems to ‘calibrate’ the reliability of any inference.” Discuss.
11 Estimation I: Properties of Estimators
11.1 Introduction Chapter 10 introduced the basic ideas about estimation, focusing on the primary objective of an estimator θ̂_n(X), which is to pinpoint θ*, the “true” value of θ in Θ. How best (optimally) to achieve that objective is the subject matter of this chapter. Before we begin the discussion of what properties determine an optimal estimator, it is important to emphasize that estimation presupposes a statistical model Mθ(x) = {f(x; θ), θ∈Θ⊂R^m}, x∈R_X^n, m < n.
f(x; θ) = θ exp{−θx}, θ > 0, x > 0.
Table 11.3 The simple exponential model
Statistical GM: X_t = (1/θ) + u_t, t∈N:=(1, 2, ..., n, ...)
[1] Exponential: X_t ~ Exp(·,·), x_t∈R_+
[2] Constant mean: E(X_t) = 1/θ, θ∈R_+, ∀t∈N
[3] Constant variance: Var(X_t) = (1/θ)², ∀t∈N
[4] Independence: {X_t, t∈N}, independent process
It can be shown that no unbiased estimator of θ exists; see Schervish (1995, p. 297). This example is directly related to the above comment by Fisher (1956). 2. Unbiasedness is not invariant to transformations of θ. Assuming that there exists an unbiased estimator θ̂_n(X) of θ, i.e. E(θ̂_n(X)) = θ*, for φ = g(θ), where g(·): Θ → Φ and φ̂_n = g(θ̂_n(X)), then in general E(φ̂_n) ≠ φ. Example 11.9 For the simple exponential model (Table 11.3), we have seen in Example 11.8 that no unbiased estimator of θ exists, but when θ is reparameterized into φ = (1/θ), one can show that φ̂ = (1/n)Σ_{i=1}^n X_i is an unbiased estimator of φ, since assumption [2] (Table 11.3) becomes E(X_t) = φ and thus E(φ̂) = (1/n)Σ_{i=1}^n E(X_i) = φ.
This is the reason why the exponential density is often parameterized in terms of φ: f (x; φ) = (1/φ) exp{−(x/φ)}, φ>0, x>0.
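The non-invariance of unbiasedness in Example 11.9 can be demonstrated by simulation (not from the text; the parameter values, the small n, and the use of Python/NumPy are illustrative assumptions). The sample mean is unbiased for φ = 1/θ, but the transformed estimator 1/X̄ is biased for θ, with the known exact bias factor n/(n−1) for exponential samples:

```python
import numpy as np

rng = np.random.default_rng(9)
theta, n, reps = 0.5, 5, 200_000        # phi = 1/theta = 2

# Exponential sample with rate theta: Xbar is unbiased for phi = 1/theta,
# but 1/Xbar is biased for theta, since E[1/Xbar] = n*theta/(n-1).
x = rng.exponential(scale=1.0 / theta, size=(reps, n))
xbar = x.mean(axis=1)

mean_phi_hat = xbar.mean()              # near phi = 2.0 (unbiased)
mean_theta_hat = (1.0 / xbar).mean()    # near n*theta/(n-1) = 0.625, not 0.5
```

The 25% upward bias of 1/X̄ at n = 5 makes concrete why unbiasedness in one parameterization does not carry over to a non-linear reparameterization.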
11.4.2 Efficiency: Relative vs. Full Efficiency Given that a frequentist (point) estimator θ n (X) aims to pinpoint θ ∗ , and its optimality is evaluated by how effectively it achieves that, it is natural to extend the search for optimal properties to the second moment of the sampling distribution of θ n (X). The second moment
provides information about the dispersion of the sampling distribution around θ*, defining the property referred to as efficiency. In practice, one needs to distinguish clearly between two different forms of efficiency, relative and full efficiency, which are sometimes confused – with serious consequences. This is because by itself relative efficiency is of limited value, but full efficiency is priceless. 11.4.2.1 Relative Efficiency For any two unbiased estimators θ̂_1(X) and θ̂_2(X) of θ, θ̂_1(X) is said to be relatively more efficient than θ̂_2(X) if Var(θ̂_1(X)) ≤ Var(θ̂_2(X)). Example 11.10 For the simple Bernoulli model (Table 11.1), the sampling distributions in (11.7) of the estimators (i)–(vi) in (11.2) indicate that θ̂_1, θ̂_2, and θ̂_3 are unbiased estimators of θ, but in terms of relative efficiency we have a clear ordering based on their variances:
Var(θ̂_3) = θ(1−θ)/3 < Var(θ̂_2) = θ(1−θ)/2 < Var(θ̂_1) = θ(1−θ)/1, for n ≥ 3.
σ2 n
< Var( μ2 ) =
σ2 2
< Var( μ1 ) =
σ2 1 ,
for n ≥ 3.
The problem with using relative efficiency to compare different estimators is that the comparison is local in the sense that it all depends on the pool of estimators in hand. This is equivalent to my assertion that “I am the best econometrician in my family.” That is true, but it does not make me a good econometrician because the pool of comparison is much too narrow. In the case of Example 11.10, θ 3 is the best on relative efficiency grounds, but it is a terrible estimator because it is inconsistent and the comparison is narrowed to a group of inconsistent estimators. Hence, the question that immediately comes to mind is:
Is there a global notion of efficiency?
The surprising answer is yes! 11.4.2.2 Full Efficiency: The Cramer–Rao Lower Bound The challenge of devising an absolute lower bound was met successfully in the 1940s by two pioneers of modern frequentist statistics, Harald Cramér (1893–1985) and C. R. Rao (1920), in Cramer (1946b) and Rao (1945). Using different approaches they both reached the conclusion that the global lower bound for unbiased estimators is related to the Fisher information; a concept introduced by Fisher (1922a). Fisher’s information of the sample. Like all results with any generality, the Fisher information of the sample and the associated Cramer–Rao lower bound applies to “regular” statistical models, where regularity is defined in terms of certain restrictions pertaining to the distribution of the sample f (x; θ ), x∈RnX .
11.4 Finite Sample Properties of Estimators
477
A statistical model Mθ (x) is said to be regular if its f (x; θ ), x∈RnX satisfies the regularity conditions in Table 11.4. Table 11.4
Regularity for Mθ (x)={f (x; θ ), θ∈}, x∈RnX
(R1)
The parameter space is an open subset of Rm , m < n
(R2)
The support of X, RnX := {x: f (x; θ )>0} , is the same ∀θ∈ ∂ ln f (x;θ) < ∞ exists and is finite ∀θ ∈, ∀x∈RnX ∂θ
(R3) (R4)
For any Borel function h(X), differentiation and, integration - can be interchanged: d · · · h(x) · f (x; θ )dx = · · · h(x) ∂ f (x; θ) dx < ∞ dθ ∂θ
R1 excludes boundary points to ensure that derivatives (from both sides of a point) exist, and R1 ensures that the support of f (x; θ ) does not depend on θ. For such regular probability models we can proceed to define the Fisher information of the sample, which is designed to provide a measure of the information rendered by the sample for a parameter θ∈. Focusing on the case of a single parameter θ, the Fisher information of the sample X:=(X1 , X2 , . . . , Xn ) is defined by 2 d ln f (x;θ) In (θ ):=E . (11.9) dθ There are several things to note about this concept. (a)
(b)
Under the regularity conditions (R1)–(R4), it can be shown that 2 2 d ln f (x;θ) ) . In (θ ):=E = E − d lndθf (x;θ 2 dθ
(11.10)
This often provides a more convenient way to derive Fisher’s information and thus the Cramer–Rao lower bound. The form of In (θ) depends crucially on the sampling model. For example, in the case of an independent sample: f (x;θ) f (xi ;θ ) = ni=1 E d ln dθ E d lndθ and in the IID sample case Fisher’s information takes the even simpler form 2 d ln f (x;θ ) In (θ ) = nI (θ):=nE , dθ in an obvious notation, where f (x; θ ) denotes the marginal density function and I (θ ) represents the Fisher information of a single observation.
Example 11.12 For the simple Normal model (Table 11.2): f (x; θ ) =
√1 2π
2
) exp{− (x−θ }, 2σ 2
d dθ
ln f (x; θ ) =
(x−θ ) σ2
=⇒ I (θ) = 1 and In (θ ) = n.
This example suggests that as the information increases, the Fisher information of the sample In (θ ) increases and thus more information about θ is gained.
478
Estimation I: Properties of Estimators
Cramer–Rao lower bound. Assuming that the Fisher information of the sample exists and In (θ)>0, ∀θ∈, the variance of any unbiased estimator of a parameter θ, say θ n (X), cannot be smaller than the inverse of In (θ): Var( θ) ≥ C-R(θ ) = In−1 (θ).
(11.11)
In the case where one is interested in some differentiable function of θ, say q(θ ), and q(θ ) is an estimator of q(θ), the Cramer–Rao lower bound takes the form Var( q(θ)) ≥ C-R(q(θ )) =
d dθ E(q(θ ))
2
In−1 (θ ).
(11.12)
General C-R lower bound. Using (11.12) we can extend the Cramer–Rao lower bound to the case of any estimator θ n (X) of θ, including biased ones: Var( θ n (X)) ≥ GC-R(θ) =
2 dE( θ n (X)) dθ
2 −1 d ln f (x;θ ) . E dθ
(11.13)
The following example illustrates the role of condition R2 in deriving the C-R lower bound. Example 11.13 Consider the simple uniform model Mθ (x): Xt UIID(0, θ), 0 < xk < θ , θ > 0, k∈N,
(11.14)
whose density function takes the form f (x; θ ) =
1 θ
, θ >0, 0 θ) = E E(> θ |h(X)) = E( θ ) = θ ∗ , and property > > > (iii) from Var(θ) = Var[E(θ|h(X))] + E[Var(θ |h(X))] = Var(θ ) + E[Var(> θ |h(X))] ≥ Var( θ ).
The discerning reader might be wondering if sufficiency is required for the above results to hold, since no use of sufficiency is made in deriving (ii) and (iii). Sufficiency is needed for (i)! θ is an estimator (it does not depend on unknown parameters) exactly because h(X) is a sufficient statistic. The Rao–Blackwell theorem provides a way to improve upon an unbiased estimator but it does not tell us anything about the full efficiency of the improved estimator θ. 11.4.3.2 Where do Sufficient Statistics Come From? Both the definition of a sufficient statistic and the factorization theorem do not provide easy ways to construct sufficient statistics. The following result, however, provides a relatively easy way to derive minimal sufficient statistics. The idea is due to Lehmann and Scheffé (1950). Lehmann–Scheffé theorem 1 Suppose that there exists a statistic h(X) such that for two
different sample realizations x1∈R_X^n, x2∈R_X^n, the ratio f(x1; θ)/f(x2; θ) is free of θ if and only if h(x1) = h(x2),
(11.20)
then h(X) is a minimal sufficient statistic for θ. Example 11.19 For the simple Bernoulli model (Table 11.1), the ratio in (11.20) is
f(x1; θ)/f(x2; θ) = [θ^{Σ_{k=1}^n x_{1k}}(1−θ)^{Σ_{k=1}^n (1−x_{1k})}]/[θ^{Σ_{k=1}^n x_{2k}}(1−θ)^{Σ_{k=1}^n (1−x_{2k})}] = (θ/(1−θ))^{Σ_{k=1}^n (x_{1k}−x_{2k})},
which is free of θ if and only if Σ_{k=1}^n x_{1k} = Σ_{k=1}^n x_{2k}. Hence, the statistic S(X) = Σ_{k=1}^n X_k is not just sufficient but minimal sufficient.
Example 11.20 For the simple Normal model (Table 11.2), the ratio in (11.20): ⎡ ⎤ ;n − n − 1 (x1k − μ)2 2 2 1 2σ k=1 ⎦ ⎣ e 2π σ 2 f (x1 ; μ, σ 2 ) = ⎡ ⎤ ;n 2 f (x2 ; μ, σ 2 ) − n − 1 − μ) (x 2k 2 2 k=1 ⎣ 12 ⎦ e 2σ 2π σ - ;n ;n n ,;n 1 ,;n 2 2 + 2 = exp − 2 x − x x1k − x2k , k=1 1k k=1 2k k=1 k=1 2σ σ
n n n n 2 2 is free of μ, σ 2 if f k=1 X1k = k=1 X2k , k=1 X1k = k=1 X2k . Hence, S(X) =
n n 2 k=1 Xk , k=1 Xk is not just sufficient but minimal sufficient.
The results in (11.7) and (11.8) suggest that, when S is minimal sufficient for θ, any inference concerning θ could be based exclusively on f (s; θ ) because it’s the only deductively derived component of f (x; θ ) that involves θ. Sufficiency principle (SP). When the statistical model Mθ (x) has a minimal sufficient statistic S(X), any inference concerning θ could be based exclusively on f (s; θ ) – this is
a variant of what is known as the sufficiency principle: see Severini (2000, p. 18). Does the adoption of the SP forfeit any relevant information in f (x; θ)? The answer is yes unless S(X) is also complete.
11.4.3.3 Minimal Sufficiency and Completeness* Intuitively, a minimal sufficient statistic S(X) should be the last word in finding a good estimator for θ, because S(X) is both sufficient as well as minimal; it cannot be reduced any further! Intuition can sometimes lead one astray when there are subtle mathematical issues involved. It turns out that S(X) cannot guarantee the uniqueness of an estimator θ(X) = θ (S(X)) because any one-to-one function g(.) of a minimal sufficient statistic S(X) is also minimal sufficient. To ensure the uniqueness of optimal estimators, we need another property relating to sufficient statistics called completeness. This additional property is needed to ensure that, using S(X) and not X for all inferences pertaining to θ , no loss of sample information will take place. Completeness is a property of Mθ (x) = {f (x; θ ), θ∈⊂Rp }, x∈RnX . The statistical model Mθ (x) is said to be complete if, for every (Borel) function h(X), the following relationship holds: E (h(X)) = 0 =⇒ h(X) = 0 (a.s.) ∀x∈RnX = {x: f (x; θ ) > 0}, where RnX = {x: f (x; θ ) > 0} denotes the support of f (x; θ). Intuitively, this result means that the only unbiased estimator of zero is zero itself; it cannot be a non-zero function of X. Completeness. A sufficient statistic S(X) is said to be complete if Mθ (x) is complete. The usefulness of the property of completeness stems from the fact that if S(X) is a complete sufficient statistic and θ = g(S(X)) is an unbiased estimator of θ , i.e. E (g(S(X))) = θ ∗ , then this estimator is unique. The relationship between a complete sufficient statistic and a minimal sufficient statistic is that a complete sufficient statistic is minimal sufficient (see Lehmann and Scheffé, 1950). This brings us to the end of our search for best unbiased estimators by utilizing sufficient statistics. The main result is given by the following theorem (see Lehmann and Scheffé, 1955). Lehmann–Scheffé theorem 2. 
Let S(X) be a complete sufficient statistic for θ (or better, for a statistical model Mθ(x)). If there exists an unbiased estimator θ̂ of θ which is a function of S(X) (i.e. θ̂ = g(S(X))), then this estimator is both best and unique. Combining this theorem with that of Rao–Blackwell, the modeler can form the following strategy: in the case where a complete sufficient statistic exists, one should begin with an arbitrary unbiased estimator and then proceed to derive the conditional expectation given the sufficient statistic; see Casella and Berger (2002) for further discussion. Sufficiency plays a very important role in both modeling and inference.
11.4 Finite Sample Properties of Estimators
485
11.4.4 Minimum MSE Estimators and Admissibility

The above measures of efficiency enable us to choose between unbiased estimators on the one hand and biased estimators on the other, using (11.11) and (11.13), respectively. These lower bounds, however, offer no guidance on the question of choosing between a biased and an unbiased estimator. Such a comparison is of interest in practice because fully efficient and unbiased estimators do not always exist, and unbiased estimators are not always good estimators. There are circumstances where a biased estimator is preferable to an unbiased one. The question then is: How do we compare biased and unbiased estimators? It makes sense to penalize a biased estimator in order to derive a dispersion measure that makes a biased and an unbiased estimator comparable. The most widely used such measure is the mean square error:

MSE(θ̂_n(X); θ*) = E{(θ̂_n(X) − θ*)²}.  (11.21)

It is very important to emphasize that the MSE for frequentist estimators, like the bias, is defined at the point θ = θ*. It can be shown to represent a penalized variance:

MSE(θ̂_n(X); θ*) = Var(θ̂_n(X)) + [B(θ̂_n(X); θ*)]²,  (11.22)

with the penalty equal to the square of B(θ̂_n; θ*) = E(θ̂_n) − θ*. This can be derived by adding and subtracting E(θ̂_n) = θ_m in the definition of the MSE as follows:

MSE(θ̂_n; θ*) = E{[(θ̂_n − θ_m) + (θ_m − θ*)]²}
  = E[θ̂_n − θ_m]² + 2[θ_m − θ*]E[θ̂_n − θ_m] + [θ_m − θ*]²  (11.23)
  = E[θ̂_n − θ_m]² + [θ_m − θ*]² = Var(θ̂_n) + [B(θ̂_n; θ*)]²,

since E[θ̂_n − θ_m] = 0.

Minimum MSE estimator. An estimator θ̂_n(X) is said to be a minimum MSE estimator of θ if

MSE(θ̂_n(X); θ*) ≤ MSE(θ̃_n(X); θ*) for any other estimator θ̃_n(X).

Example 11.21 For the simple Bernoulli model (Table 11.1), the estimators (θ̂_{n+1}, θ̂_{n+2}) are, for n > 3, preferable in terms of their MSE to θ̂_1, θ̂_2, and θ̂_3, since

MSE(θ̂_{n+1}) = nθ(1−θ)/(n+1)² + (nθ/(n+1) − θ)² = [nθ(1−θ) + θ²]/(n+1)² ≤ MSE(θ̂_k) = θ(1−θ)/k, k = 1, 2, 3;
MSE(θ̂_{n+2}) = nθ(1−θ)/(n+2)² + (nθ/(n+2) − θ)² = [nθ(1−θ) + 4θ²]/(n+2)² ≤ MSE(θ̂_k) = θ(1−θ)/k, k = 1, 2, 3.  (11.24)
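A Monte Carlo check of this MSE comparison (a sketch, not from the text; n = 10, θ = 0.5 and the replication count are arbitrary choices) confirms that the shrunk estimator Σₖ Xₖ/(n+1) trades a small bias for a variance reduction large enough to lower its MSE below that of the unbiased X̄_n:

```python
import random

# Monte Carlo version of the MSE comparison in (11.24) for the simple
# Bernoulli model: theta_hat = S/n (unbiased) vs. the shrunk S/(n+1).
random.seed(0)
n, theta, reps = 10, 0.5, 200000

se_unbiased = se_shrunk = 0.0
for _ in range(reps):
    s = sum(1 for _ in range(n) if random.random() < theta)
    se_unbiased += (s / n - theta) ** 2
    se_shrunk += (s / (n + 1) - theta) ** 2

mse_unbiased = se_unbiased / reps  # theory: theta*(1-theta)/n = 0.025
mse_shrunk = se_shrunk / reps      # theory: (n*theta*(1-theta) + theta^2)/(n+1)^2
print(mse_unbiased, mse_shrunk)
```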
486
Estimation I: Properties of Estimators
CAUTIONARY NOTE: The frequentist definition of the MSE in (11.22) should be contrasted with a non-frequentist definition found in some textbooks, defined for all θ in Θ:

MSE(θ̂) = E(θ̂ − θ)², ∀θ∈Θ.  (11.25)

In light of the fact that both the bias B(θ̂_n; θ*) = E(θ̂_n) − θ* and the variance Var(θ̂_n) = E(θ̂_n − θ_m)² involve only two particular values of θ in Θ, i.e. θ_m = E(θ̂_n) and θ*, defining B(θ̂_n; θ) and Var(θ̂_n) for all θ∈Θ makes no frequentist sense, but it does make sense in both Bayesian and decision-theoretic statistics. The problem is that the non-frequentist definition of the MSE in (11.25) is employed to define another property of estimators known as admissibility.

Admissibility. An estimator θ̃_n(X) is inadmissible if there exists another estimator θ̂_n(X) such that

MSE(θ̂_n, θ) ≤ MSE(θ̃_n, θ), ∀θ∈Θ,  (11.26)

and the strict inequality (<) holds for at least one θ∈Θ. Otherwise, θ̃_n(X) is said to be admissible with respect to the quadratic loss function L₂(θ̂_n, θ) = (θ̂_n − θ)², ∀θ∈Θ.

Example 11.21 (continued) Comparing the estimators in terms of their MSEs,

MSE(θ̂_{n+1}) > MSE(θ̂_{n+2}) for any n > 1,

indicating that θ̂_{n+1} is inadmissible. But does that make θ̂_{n+2} a good estimator? Not necessarily, because this is only a relative comparison: admissibility is a form of relative efficiency with respect to a particular loss function, in this case the quadratic one.

Admissibility can be highly misleading as a finite sample property, because it can be shown that estimators with excellent frequentist properties are not always optimal in terms of admissibility. This is because the quantifier ∀θ∈Θ associated with admissibility in (11.26) has no bearing on the effectiveness of frequentist estimation, which focuses on the capacity of an estimator to pinpoint θ = θ*. Hence, admissibility is not an interesting property for a frequentist estimator, because it concerns values of θ other than the true value θ*. Indeed, a moment's reflection suggests that there is something wrong-headed about the use of the quantifier ∀θ∈Θ in (11.25), because it gives rise to dubious results when viewed from the frequentist perspective. The factual nature of frequentist reasoning in estimation also brings out the inappropriateness of the notion of admissibility as a minimal property (necessary but not sufficient). To bring out this problem, consider the following example.

Example 11.22 In the context of the simple Normal model (Table 11.2), let us compare two estimators of μ in terms of admissibility:

(i) the unbiased and fully efficient estimator X̄_n = (1/n)Σ_{k=1}^n X_k;
(ii) the "crystal ball" estimator μ̂_cb = 7,405,926, ∀x∈R^n_X.
When compared on admissibility grounds, these two estimators are both admissible, since neither dominates the other on MSE grounds:

MSE(X̄_n, μ) ≷ MSE(μ̂_cb, μ), ∀μ∈R,

and thus they are equally acceptable. Common sense, however, suggests that if a particular criterion of optimality cannot distinguish between X̄_n (the best estimator) and μ̂_cb, a number picked out of the air that ignores the data altogether, it is not much of a minimal property. A moment's reflection suggests that its inappropriateness stems from the reliance of admissibility on the quantifier "∀θ∈Θ." The admissibility of μ̂_cb stems from the fact that for certain values of θ close to μ̂_cb, say θ ∈ μ̂_cb ± λ/√n for 0 < λ < 1, μ̂_cb is "better" than X̄_n on MSE grounds (taking σ² = 1):

MSE(X̄_n; θ) = 1/n > λ²/n ≥ MSE(μ̂_cb; θ), for θ ∈ μ̂_cb ± λ/√n.
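This trade-off can be made concrete with a two-line computation (an illustrative sketch; the values of c, σ and n below are arbitrary choices): the crystal-ball estimator beats X̄_n on MSE only inside a shrinking λ/√n band around its guess, and loses catastrophically everywhere else:

```python
import math

# MSE comparison behind Example 11.22: the "crystal ball" estimator
# mu_cb = c ignores the data, so MSE(mu_cb; mu) = (mu - c)^2, while
# MSE(X-bar; mu) = sigma^2/n for every mu. Neither dominates the other
# over all mu in R, so both are admissible.
c, sigma, n = 7405926.0, 1.0, 100

def mse_xbar(mu):
    return sigma ** 2 / n        # constant in mu

def mse_cb(mu):
    return (mu - c) ** 2         # zero at mu = c, unbounded elsewhere

near = c + 0.05 / math.sqrt(n)   # a mu inside the lambda/sqrt(n) band
far = 0.0                        # a mu far away from the guess
print(mse_cb(near), mse_xbar(near))  # crystal ball "wins" here
print(mse_cb(far), mse_xbar(far))    # and loses catastrophically here
```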
This example indicates that admissibility is totally ineffective as a minimal property, because it does not filter out exceptionally "bad" estimators such as μ̂_cb! Instead, it has been used to exclude potentially good estimators like the sample median; see Cox and Hinkley (1974). This highlights the extreme relativism of admissibility to the particular loss function, i.e. L₂(θ̂(X); θ), since, as mentioned above, the sample median would have been the optimal estimator in the case of the absolute loss function L₁(θ̂(X); θ) = |θ̂(X) − θ|.

11.4.4.1 Full Efficiency vs. MSE and Biased Estimators

C-R lower-bound comparison. Let us return to the comparison of estimators in terms of the C-R lower bound that allows for biased estimators.

Example 11.23 In the context of the simple Bernoulli model (Table 11.1), let us compare the estimators (θ̂_{n+1}, θ̂_{n+2}) in terms of their respective Cramer–Rao lower bounds. Given that

E(θ̂_{n+1}) = (n/(n+1))θ, dE(θ̂_{n+1})/dθ = n/(n+1), E(θ̂_{n+2}) = (n/(n+2))θ, dE(θ̂_{n+2})/dθ = n/(n+2),

their respective Cramer–Rao lower bounds based on (11.13) are

C-R(θ̂_{n+1}) = (n/(n+1))² · θ(1−θ)/n = nθ(1−θ)/(n+1)²,
C-R(θ̂_{n+2}) = (n/(n+2))² · θ(1−θ)/n = nθ(1−θ)/(n+2)².

11.5 Asymptotic Properties of Estimators

11.5.1 Consistency (Weak)

An estimator θ̂_n of θ is said to be a (weakly) consistent estimator of θ* if, for any ε > 0,

lim_{n→∞} P(|θ̂_n − θ*| ≤ ε) = 1,  (11.27)

i.e. "the probability that θ̂_n is within a distance ε of θ*, for any ε > 0, goes to one as n goes to infinity"; see Chapter 9.

REMARKS:
(i) To avoid messy notation in what follows, the star (*) in θ* will be dropped unless it is important to highlight its role.
(ii) θ̂_n in this definition stands for a generic estimator; the subscript n is used to emphasize the role of the sample size.
(iii) In general, verifying (11.27) is non-trivial, but in the case where θ̂_n has a bounded variance it can easily be verified using Chebyshev's inequality (see Appendix 9.A):

P(|θ̂_n − θ| ≤ ε) ≥ 1 − E(θ̂_n − θ)²/ε².

This is because E(θ̂_n − θ)² = MSE(θ̂_n), and thus when MSE(θ̂_n) → 0 as n → ∞, E(θ̂_n − θ)²/ε² → 0 and (11.27) holds. From (11.22) we know that MSE(θ̂_n) → 0 if Var(θ̂_n) → 0 and B(θ̂_n; θ) → 0. This suggests two easily verifiable conditions for θ̂_n to be a consistent estimator of θ when the required moments of its sampling distribution exist:

(a) lim_{n→∞} E(θ̂_n) = θ,  (b) lim_{n→∞} Var(θ̂_n) = 0.
This suggests that, in the case where Var(θ̂_n) < ∞, we can verify the consistency of θ̂_n by checking the above (sufficient) conditions. The notion of consistency based on (a) and (b) is sometimes called mean square consistency.

Example 11.24 In the context of the simple Bernoulli model (Table 11.1), the estimators θ̂_1, θ̂_2, and θ̂_3 satisfy (a) but not (b), because their variances do not decrease as n → ∞, and thus they are all inconsistent. In contrast, the estimators (θ̂_{n+1}, θ̂_{n+2}) are consistent because they satisfy both (a) and (b) (verify!). The estimators θ̂_1, θ̂_2, and θ̂_3, being inconsistent, can be eliminated from the list of good estimators of θ, and the choice is now between θ̂_n, θ̂_{n+1}, and θ̂_{n+2}. Given that θ̂_n is both unbiased and fully efficient but (θ̂_{n+1}, θ̂_{n+2}) are biased, there is a slight preference for θ̂_n, at least on intuitive grounds, unless the gain from the MSE is sizeable enough.

Consistency as a minimal property. It is important to emphasize the fact that consistency is a minimal (necessary but not sufficient) property. That is, when an estimator is inconsistent it is not worth serious consideration, but since consistency holds at the limit (n = ∞), it should be accompanied by certain other desirable finite sample properties to define a "good" estimator for inference purposes. There are numerous examples of consistent estimators that are practically useless.

Example 11.25 In the case of the simple Normal model (Table 11.2), X̄_n = (1/n)Σ_{k=1}^n X_k is a consistent estimator of μ, but so are the following two estimators of μ (Rao, 1973):

X̄‡_n = { 0, for n ≤ 10²⁴; X̄_n, for n > 10²⁴ },  X̄'_n = (n/(n−10⁷)) X̄_n (for n > 10⁷).

Although X̄‡_n and X̄'_n are consistent estimators of μ, they are practically useless! Hence, in practice, one needs to supplement consistency with finite sample properties; see Chapter 12.

Example 11.26 For the simple Normal model (Table 11.2), a summary of the results on unbiasedness and consistency is given in Table 11.6. The general conclusion is that unbiased but inconsistent estimators are not good estimators.

Table 11.6  Unbiased and consistent estimators

| Estimator | Mean | Unbiased | Variance | Consistent |
|---|---|---|---|---|
| (i) μ̂_1 = X_1 | E(μ̂_1) = μ | ✓ | Var(μ̂_1) = σ² | × |
| (ii) μ̂_2 = ½(X_1 + X_n) | E(μ̂_2) = μ | ✓ | Var(μ̂_2) = σ²/2 | × |
| (iii) μ̂_3 = ½(X_1 − X_n) | E(μ̂_3) = 0 | × | Var(μ̂_3) = σ²/2 | × |
| (iv) μ̂_n = (1/n)Σ_{i=1}^n X_i | E(μ̂_n) = μ | ✓ | Var(μ̂_n) = σ²/n | ✓ |
| (v) μ̂_{n+1} = (1/(n+1))Σ_{i=1}^n X_i | E(μ̂_{n+1}) = (n/(n+1))μ | × | Var(μ̂_{n+1}) = nσ²/(n+1)² | ✓ |
| (vi) μ̂_{n+2} = (1/(n+2))Σ_{i=1}^n X_i | E(μ̂_{n+2}) = (n/(n+2))μ | × | Var(μ̂_{n+2}) = nσ²/(n+2)² | ✓ |
From the above comparison we can conclude that μ̂_n = (1/n)Σ_{i=1}^n X_i is the best estimator of μ, because it is unbiased, consistent, and fully efficient.

It is important to note that in the above examples (and in most cases in practice) we utilize only the first two moments of the sampling distributions when deciding the optimality of the various estimators. That is, higher moments or other features of the sampling distribution are not explicitly utilized. For statistical inference purposes in general, however, we usually require the sampling distribution itself, not just its first two moments.
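Before moving to strong consistency, the mean square consistency conditions (a) and (b) can be seen at work in a small simulation (a sketch; θ = 0.4, ε = 0.05, the sample sizes, and the replication count are arbitrary choices): since Var(X̄_n) → 0 while E(X̄_n) = θ, Chebyshev's inequality forces the probability of the ε-band to approach one:

```python
import random

# Mean square consistency of X-bar in the simple Bernoulli model:
# E(X-bar) = theta for every n (condition (a)) and Var(X-bar) -> 0
# (condition (b)), so P(|X-bar - theta| <= eps) -> 1 as n grows.
random.seed(7)
theta, reps, eps = 0.4, 5000, 0.05

def draw_xbar(n):
    return sum(1 for _ in range(n) if random.random() < theta) / n

coverage = {}
for n in (10, 100, 1000):
    xbars = [draw_xbar(n) for _ in range(reps)]
    coverage[n] = sum(abs(xb - theta) <= eps for xb in xbars) / reps

print(coverage)  # the eps-band probability increases toward 1 with n
```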
11.5.2 Consistency (Strong)

An estimator θ̂_n is said to be a (strongly) consistent estimator of θ if

P(lim_{n→∞} θ̂_n = θ*) = 1, denoted by θ̂_n → θ* (a.s.).

As shown in Chapter 9, almost sure (a.s.) convergence is stronger than convergence in probability and, not surprisingly, the former implies the latter.

Example 11.27 In light of the fact that the simple Bernoulli (Table 11.1) and Normal (Table 11.2) models assume that their respective underlying processes {X_k, k∈N} are IID, and the estimators (iv)–(vi) in (11.2) and (11.3) are functions of (1/n)Σ_{i=1}^n X_i, we can invoke the SLLN in Chapter 9 directly to show which estimators are strongly consistent (Table 11.7).

Table 11.7  Strong consistency of estimators

| Bernoulli | θ̂_1 ↛ θ (a.s.) | θ̂_2 ↛ θ (a.s.) | θ̂_3 ↛ θ (a.s.) | θ̂_n → θ (a.s.) | θ̂_{n+1} → θ (a.s.) | θ̂_{n+2} → θ (a.s.) |
| Normal | μ̂_1 ↛ μ (a.s.) | μ̂_2 ↛ μ (a.s.) | μ̂_3 ↛ μ (a.s.) | μ̂_n → μ (a.s.) | μ̂_{n+1} → μ (a.s.) | μ̂_{n+2} → μ (a.s.) |
11.5.3 Asymptotic Normality

The next asymptotic property, known as asymptotic Normality, is an extension of the central limit theorem (CLT) discussed in Chapter 9. An estimator θ̂_n of θ is said to be asymptotically Normal if there exists a normalizing sequence {c_n}, n = 1, 2, ..., such that

c_n(θ̂_n − θ) ⇝ N(0, V∞(θ)) as n → ∞, for V∞(θ) ≠ 0.

REMARKS: (i) "⇝" reads "is asymptotically distributed as." (ii) V∞(θ) denotes the asymptotic variance of θ̂_n. (iii) The sequence {c_n} is a function of n; in the case of an IID sample, c_n = √n.

Example 11.28 Given that the simple Bernoulli (Table 11.1) and Normal (Table 11.2) models assume that their respective underlying processes {X_k, k∈N} are IID, and the relevant estimators are functions of (1/n)Σ_{i=1}^n X_i, we can invoke the CLT in Chapter 9 directly to show that the consistent estimators in Table 11.7 are also asymptotically Normal (Table 11.8).

Table 11.8  Asymptotic Normality of estimators

| Bernoulli | √n(θ̂_n − θ) ⇝ N(0, θ(1−θ)), √n(θ̂_{n+k} − θ) ⇝ N(0, θ(1−θ)), k = 1, 2 |
| Normal | √n(μ̂_n − μ) ⇝ N(0, σ²), √n(μ̂_{n+k} − μ) ⇝ N(0, σ²), k = 1, 2 |
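The first row of Table 11.8 can be checked by simulation (a sketch; n = 400, θ = 0.3 and the replication count are arbitrary choices): the scaled deviations √n(θ̂_n − θ) should have variance close to θ(1−θ) = 0.21 and roughly the two-sigma concentration of a Normal distribution:

```python
import random
import statistics

# Asymptotic Normality of the Bernoulli estimator: sqrt(n)*(theta_hat - theta)
# has variance theta*(1-theta) and, for large n, approximately Normal
# concentration within two standard deviations.
random.seed(3)
n, theta, reps = 400, 0.3, 10000
root_n = n ** 0.5
sd = (theta * (1 - theta)) ** 0.5   # sqrt(0.21)

z = []
for _ in range(reps):
    s = sum(1 for _ in range(n) if random.random() < theta)
    z.append(root_n * (s / n - theta))

var_z = statistics.variance(z)                        # close to 0.21
within_2sd = sum(abs(v) <= 2 * sd for v in z) / reps  # close to 0.954
print(var_z, within_2sd)
```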
The above example brings up a very important class of estimators known as consistent and asymptotically Normal (CAN) estimators of θ. The question that naturally arises is whether, within the class of CAN estimators, one can use the asymptotic variance V∞ (θ ) in order to choose between them. Let us discuss this question next.
11.5.4 Asymptotic Efficiency

Is there a way to choose between CAN estimators by comparing their asymptotic variances V∞(θ)? The answer is that there is an asymptotic Cramer–Rao lower bound C-R∞(θ) against which V∞(θ) can be compared. Assuming that the regularity conditions in Table 11.4 are valid, C-R∞(θ) takes the form

C-R∞(θ) = [I∞(θ)]⁻¹, where I∞(θ) = lim_{n→∞} (1/c_n²) I_n(θ),

and I∞(θ) is referred to as the asymptotic Fisher information.

Example 11.29 In the context of the simple Bernoulli model (Table 11.1):

E(−d² ln f(x; θ)/dθ²) = n/(θ(1−θ)) ⟹ I∞(θ) = lim_{n→∞} (1/n)·(n/(θ(1−θ))) = 1/(θ(1−θ)),

and thus the asymptotic lower bound is C-R∞(θ) = θ(1 − θ).

Example 11.30 In the context of the simple Normal model (Table 11.2):

E(−d² ln f(x; μ)/dμ²) = n/σ² ⟹ I∞(μ) = lim_{n→∞} (1/n)·(n/σ²) = 1/σ² ⟹ C-R∞(μ) = σ².

A CAN estimator θ̂_n of θ is said to be asymptotically efficient if

c_n(θ̂_n − θ) ⇝ N(0, [I∞(θ)]⁻¹) as n → ∞, where I∞(θ) > 0.

That is, the asymptotic variance equals the asymptotic Cramer–Rao lower bound, V∞(θ) = C-R∞(θ) = [I∞(θ)]⁻¹.

Example 11.31 For the simple Bernoulli model (Table 11.1), the results in Table 11.8 suggest that all three CAN estimators (θ̂_n, θ̂_{n+1}, θ̂_{n+2}) are asymptotically efficient. Although (θ̂_n, θ̂_{n+1}, θ̂_{n+2}) are asymptotically equivalent, the fact that θ̂_n is fully efficient for any n < ∞ renders it the best among the three. Note that the same comments apply to the estimators (μ̂_n, μ̂_{n+1}, μ̂_{n+2}) in the context of the simple Normal model (Table 11.2).
11.5.5 Properties of Estimators Beyond the First Two Moments

So far the discussion of optimal properties has revolved mostly around the first two moments of the sampling distributions of the estimators. This focus on the first two moments is primarily a matter of convenience. The fact of the matter is that statistical inference is based on the sampling distribution of the estimator, but we often focus on some of its numerical characteristics to define particular properties.

Mode unbiasedness. An estimator θ̂_n of θ is said to be mode unbiased if the sampling distribution of θ̂_n has a mode equal to θ*, the true value of θ in Θ: Mode(θ̂_n) = θ*.

Example 11.32 In the context of the simple uniform model Mθ(x): X_t ∼ UIID(0, θ), 0 < x_t < θ, θ > 0, t∈N,  (11.28)

the estimator θ̂_[n] = max(X_1, X_2, ..., X_n) := Y has a sampling distribution of the form

f(y; θ) = n y^{n−1}/θⁿ, 0 < y < θ.

Given that for any θ > 0 the density function f(y; θ) has a unique maximum at the point y = θ, the estimator θ̂_[n] is a mode-unbiased estimator of θ: Mode(θ̂_[n]) = θ*.

Median unbiasedness. An estimator θ̂_n of θ is said to be median unbiased if the sampling distribution of θ̂_n has a median equal to θ*, the true value of θ in Θ: Median(θ̂_n) = θ*.

Example 11.33 In the context of the simple Normal model (Table 11.2), the estimator μ̂_n = (1/n)Σ_{k=1}^n X_k is mean unbiased with a Normal sampling distribution. The latter implies that μ̂_n is also a mode- and median-unbiased estimator.

Closeness. In addition to using numerical characteristics of the sampling distribution, there are other ways to define the closeness of an estimator to θ*, the true value of the parameter. For example, we can define the closeness of two estimators θ̂ and θ̃ of the unknown parameter θ to the true value θ* using the following concentration measure:

P(|θ̂_n − θ*| ≤ c) ≥ P(|θ̃_n − θ*| ≤ c), for all c > 0.

In the case where the above condition is valid and strict inequality holds for some values of c > 0, θ̂ is said to be more concentrated around θ* than θ̃.

Pitman closeness. Estimator θ̂_n is Pitman closer than θ̃_n (Pitman, 1937) when

P(|θ̂_n − θ*| < |θ̃_n − θ*|) > 1/2.

Such measures will not be pursued any further in this book, but they are noted to emphasize the role of the sampling distribution in assessing the optimality of estimators, as well as closeness in relation to θ*, as opposed to ∀θ∈Θ.
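Example 11.32 can be illustrated numerically (a sketch; n = 10, θ = 2 and the replication count are arbitrary choices): the simulated distribution of Y = max(X_1, ..., X_n) piles its mass just below θ (the mode), even though its mean nθ/(n+1) falls short of θ:

```python
import random
import statistics

# Mode unbiasedness of the uniform maximum: Y = max(X_1,...,X_n) with
# X_t ~ U(0, theta) has density n*y^(n-1)/theta^n, so its mode is theta,
# while its mean is n*theta/(n+1) < theta (mean-biased).
random.seed(11)
n, theta, reps = 10, 2.0, 50000

y = [max(random.uniform(0, theta) for _ in range(n)) for _ in range(reps)]
mean_y = statistics.mean(y)                               # close to n*theta/(n+1)
top_decile_mass = sum(v > 0.9 * theta for v in y) / reps  # 1 - 0.9**n

print(mean_y, top_decile_mass)
```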
11.6 The Simple Normal Model: Estimation

In the previous sections we used two very simple examples in an attempt to keep the technical difficulties to a minimum and concentrate on the ideas and concepts. In this section we use a marginally more complicated model, which happens to be one of the most widely discussed models in statistics. This simple model provides the cornerstone of several widely used statistical models that we discuss in the sequel.

The sampling distribution of μ̂_n = (1/n)Σ_{k=1}^n X_k. Consider the simple Normal model in Table 11.9.

Table 11.9  The simple Normal model

| Statistical GM | X_t = μ + u_t, t∈N := (1, 2, ..., n, ...) |
| [1] Normal | X_t ∼ N(·, ·), x_t∈R |
| [2] Constant mean | E(X_t) = μ, μ∈R, ∀t∈N |
| [3] Constant variance | Var(X_t) = σ², ∀t∈N |
| [4] Independence | {X_t, t∈N}, an independent process |

As argued above, the best estimator of μ in the case of the simple one-parameter Normal model is μ̂_n = (1/n)Σ_{k=1}^n X_k := X̄_n. The obvious argument is that since μ = E(X_t), it makes intuitive sense that the sample mean μ̂_n should be a good estimator. We have shown above that the sampling distribution of μ̂_n is

μ̂_n ∼ N(μ, σ²/n),  (11.29)

and that μ̂_n is unbiased, fully efficient, and strongly consistent. The derivation of these properties, however, was based on assuming that σ² is known. We would like to know whether any of these properties change when σ² is an unknown parameter to be estimated alongside μ. Although matching distribution and sample moments is not always a valid argument, for simplicity let us employ the same intuitive argument to estimate σ² = Var(X_t) using the sample variance

σ̂²_n = (1/n)Σ_{i=1}^n (X_i − μ̂_n)².  (11.30)

Since unbiasedness and consistency are not affected by assuming that σ² is also unknown, the only thing that might change is the Cramer–Rao lower bound. To derive that we need the distribution of the sample:

f(x; μ, σ²) = (1/(σ√(2π)))ⁿ exp{ −(1/(2σ²)) Σ_{i=1}^n (x_i − μ)² },
ln f(x; μ, σ²) = −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) Σ_{i=1}^n (x_i − μ)².
Taking first and second derivatives of ln f(x; μ, σ²) yields

∂ln f/∂μ = (1/σ²) Σ_{i=1}^n (x_i − μ),   ∂²ln f/∂μ² = −n/σ²,
∂ln f/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n (x_i − μ)²,
∂²ln f/∂(σ²)² = n/(2σ⁴) − (1/σ⁶) Σ_{i=1}^n (x_i − μ)²,
∂²ln f/∂σ²∂μ = −(1/σ⁴) Σ_{i=1}^n (x_i − μ).

Since E(−∂²ln f/∂σ²∂μ) = 0, the Fisher information matrix takes the form

\[
I_n(\mu,\sigma^2) := \begin{pmatrix} E\left(-\frac{\partial^2 \ln f}{\partial \mu^2}\right) & E\left(-\frac{\partial^2 \ln f}{\partial \sigma^2 \partial \mu}\right) \\[4pt] E\left(-\frac{\partial^2 \ln f}{\partial \sigma^2 \partial \mu}\right) & E\left(-\frac{\partial^2 \ln f}{\partial (\sigma^2)^2}\right) \end{pmatrix} = \begin{pmatrix} \frac{n}{\sigma^2} & 0 \\[4pt] 0 & \frac{n}{2\sigma^4} \end{pmatrix},
\]

and thus the C-R lower bound for any unbiased estimators of (μ, σ²) is

C-R(μ, σ²) = [I_n(μ, σ²)]⁻¹ ⟹ C-R(μ) = σ²/n and C-R(σ²) = 2σ⁴/n.  (11.31)
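The (1,1) entry of I_n(μ, σ²) can be verified by simulation (a sketch; the values n = 25, μ = 1, σ = 2 and the replication count are arbitrary choices), using the fact that the score ∂ln f/∂μ has mean zero and variance equal to the corresponding Fisher information entry n/σ²:

```python
import random
import statistics

# Numerical check of (11.31): in the Normal model the score for mu,
# d ln f / d mu = (1/sigma^2) * sum(x_i - mu), has E = 0 and
# Var = n/sigma^2, the (1,1) entry of the Fisher information matrix.
random.seed(5)
n, mu, sigma, reps = 25, 1.0, 2.0, 40000

scores = []
for _ in range(reps):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    scores.append(sum(xi - mu for xi in x) / sigma ** 2)

print(statistics.mean(scores), statistics.variance(scores))  # ~0 and ~n/sigma^2 = 6.25
```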
This shows that μ̂_n achieves its C-R bound, and retains its asymptotic properties of consistency and asymptotic Normality. The question is whether the estimator σ̂²_n has similar properties, and for that we need to derive its sampling distribution.

The sampling distribution of σ̂²_n = (1/n)Σ_{k=1}^n (X_k − μ̂_n)². Since σ̂²_n is a quadratic function of Normally distributed random variables, we will use the following lemma to derive its sampling distribution.

Lemma 11.3 If the random variables (Z_1, Z_2, ..., Z_n) are NIID(0, 1), then V_n = Σ_{k=1}^n Z_k² is chi-square distributed with n degrees of freedom: V_n = Σ_{k=1}^n Z_k² ∼ χ²(n).

Since this lemma pertains to N(0, 1) random variables, we need to standardize first and then use it to claim that

Z_k = (X_k − μ)/σ ∼ NIID(0, 1) ⟹ Σ_{k=1}^n Z_k² = Σ_{k=1}^n ((X_k − μ)/σ)² ∼ χ²(n).

Given that σ̂²_n = (1/n)Σ_{i=1}^n (X_i − μ̂_n)² differs from Σ_{k=1}^n ((X_k − μ)/σ)², we need an identity relating the two, in the form of (Spanos, 1986, p. 240)

Σ_{k=1}^n ((X_k − μ)/σ)² = Σ_{k=1}^n ((X_k − μ̂_n)/σ)² + n((μ̂_n − μ)/σ)².  (11.32)

Standardizing μ̂_n in (11.29) we can deduce, via Lemma 11.3, that

√n(μ̂_n − μ)/σ ∼ N(0, 1) ⟹ n((μ̂_n − μ)/σ)² ∼ χ²(1).
In addition, it can be shown that V_1 = n((μ̂_n − μ)/σ)² and V_2 = Σ_{k=1}^n ((X_k − μ̂_n)/σ)² are independent (Casella and Berger, 2002); to relate the left- and right-hand sides of (11.32) we need the following lemma.

Lemma 11.4 Let V_1 ∼ χ²(m_1) and V_2 ∼ χ²(m_2), with V_1 and V_2 independent; then the sum V = V_1 + V_2 is chi-square distributed with m = m_1 + m_2 degrees of freedom: (V_1 + V_2) ∼ χ²(m_1 + m_2).

In light of this lemma, since Σ_{k=1}^n ((X_k − μ)/σ)² is distributed as χ²(n) and the right-hand side of (11.32) is composed of two independent random variables, one of which has a χ²(1) distribution, it follows from Lemma 11.4 that

nσ̂²_n/σ² = Σ_{k=1}^n ((X_k − μ̂_n)/σ)² ∼ χ²(n − 1).  (11.33)

Using the fact that for V ∼ χ²(m), E(V) = m (see Appendix 3.A), we deduce

E(nσ̂²_n/σ²) = n − 1 ⟹ E(σ̂²_n) = ((n−1)/n)σ² ≠ σ²,  (11.34)

and thus σ̂²_n is a biased estimator of σ².

The sampling distribution of s²_n = (1/(n−1))Σ_{i=1}^n (X_i − μ̂_n)². The result in (11.34) also suggests that a rescaled form of σ̂²_n,

s²_n = (n/(n−1))σ̂²_n = (1/(n−1))Σ_{k=1}^n (X_k − μ̂_n)², with E(s²_n) = σ²,

is an unbiased estimator of σ², and

(n−1)s²_n/σ² = Σ_{k=1}^n (X_k − μ̂_n)²/σ² ∼ χ²(n − 1).  (11.35)

The question which arises is whether s²_n, in addition to being unbiased, has any further advantages over σ̂²_n. To derive the variance of s²_n we use the result that for V ∼ χ²(m), Var(V) = 2m (see Appendix 3.A), to deduce that

Var((n−1)s²_n/σ²) = 2(n − 1) ⟹ Var(s²_n) = 2σ⁴/(n−1) > C-R(σ²) = 2σ⁴/n.

Hence, s²_n does not achieve the Cramer–Rao lower bound. Similarly, using (11.33) we can deduce that σ̂²_n does not achieve the lower bound for biased estimators GC-R(σ²), since

Var(nσ̂²_n/σ²) = 2(n − 1) ⟹ Var(σ̂²_n) = 2(n−1)σ⁴/n² > GC-R(σ²) = 2(n−1)²σ⁴/n³,

where

GC-R(σ²) = [dE(σ̂²_n)/dσ²]² · [E(d ln f(x; μ, σ²)/dσ²)²]⁻¹ = ((n−1)/n)² · (2σ⁴/n) = 2(n−1)²σ⁴/n³.

Given that neither s²_n nor σ̂²_n achieves its respective Cramer–Rao lower bound, the question that naturally arises is whether there is another way to choose between them.
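The chi-square results in (11.33)–(11.35) are easy to confirm by simulation (a sketch; n = 12, σ = 3 and the replication count are arbitrary choices): V = (n−1)s²_n/σ² should have mean n−1 and variance 2(n−1), which is what makes s²_n unbiased while denying it the C-R bound:

```python
import random
import statistics

# Check of (11.35): V = (n-1)*s_n^2/sigma^2 behaves as chi-square(n-1),
# so E(V) = n - 1 (hence E(s_n^2) = sigma^2) and Var(V) = 2(n-1).
random.seed(8)
n, mu, sigma, reps = 12, 0.0, 3.0, 40000

v = []
for _ in range(reps):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(x) / n
    s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    v.append((n - 1) * s2 / sigma ** 2)

print(statistics.mean(v), statistics.variance(v))  # ~11 and ~22
```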
First, we can eliminate the possibility of a fully efficient unbiased estimator by using the necessary and sufficient condition in (11.16), since

d ln f(x; μ, σ²)/dσ² = (n/(2σ⁴)) [ (1/n)Σ_{k=1}^n (x_k − μ)² − σ² ]

implies that the relevant linear function takes the form

(1/n)Σ_{k=1}^n (x_k − μ)² − σ² = (2σ⁴/n) [ d ln f(x; μ, σ²)/dσ² ],  (11.36)

indicating that the only unbiased estimator to achieve the C-R bound is (1/n)Σ_{k=1}^n (x_k − μ)², which requires μ to be known.

Second, Bhattacharya (1946) proposed another lower bound which is attainable in certain cases. Viewing the Cramer–Rao inequality as based on the correlation between an estimator h(X) and d ln f(x;θ)/dθ = (1/f(x;θ)) df(x;θ)/dθ, he proposed a sharper inequality based on the multiple correlation between h(X) and the higher-order derivatives of ln f(x; θ):

(1/f(x;θ)) df(x;θ)/dθ, (1/f(x;θ)) d²f(x;θ)/dθ², ..., (1/f(x;θ)) d^m f(x;θ)/dθ^m, m ≥ 1.

Using these higher-order derivatives, Bhattacharya extended (11.16) to

(θ̂_n − θ) = h(θ) Σ_{k=1}^m a_k [ (1/f(x;θ)) d^k f(x;θ)/dθ^k ], m ≥ 1,  (11.37)

for a function h(θ) and constants (a_k, k = 1, 2, ..., m), to define the Bhattacharya lower bound

Var(θ̂_n) ≥ B_LB(θ) = Σ_{i,j=1}^m c_ij(θ)·a_i a_j, for some m ≥ 1,  (11.38)

where c_ij := E[ ((1/f(x;θ)) d^i f(x;θ)/dθ^i) · ((1/f(x;θ)) d^j f(x;θ)/dθ^j) ], subject to Σ_{j=1}^m c_ij(θ)·a_j = 1, i = 1, 2, ..., m. When [c_ij(θ)], i, j = 1, ..., m, is positive definite with inverse [c^{ij}(θ)], the Bhattacharya lower bound is

Var(θ̂_n) ≥ B_LB(θ) = Σ_{i,j=1}^m c^{ij}(θ), for some m ≥ 1.

Using m = 2, we can show that for s²_n the representation (11.37) takes the form

s²_n − σ² = (2σ⁴/(n−1)) [ ∂ln f(x; μ, σ²)/∂σ² ] − (σ⁴/(n(n−1))) [ (1/f(x; μ, σ²)) ∂²f(x; μ, σ²)/∂μ² ],

which implies that s²_n achieves the Bhattacharya lower bound, rendering it a better choice than σ̂²_n; see Srivastava et al. (2014).

Sufficiency. As shown in Example 11.35, in the case of the simple Normal model (Table 11.9) there exists a minimal sufficient statistic S(X) = (Σ_{k=1}^n X_k, Σ_{k=1}^n X_k²), and all three estimators μ̂_n, s²_n, σ̂²_n are minimal sufficient, since they are all functions of S(X).

Asymptotic properties. In terms of their asymptotic properties, both estimators σ̂²_n and s²_n of σ² enjoy all the optimal asymptotic properties: consistency, asymptotic Normality, and asymptotic efficiency, since

√n(σ̂²_n − σ²) ⇝ N(0, 2σ⁴), √n(s²_n − σ²) ⇝ N(0, 2σ⁴) as n → ∞,
in view of the fact that the asymptotic Fisher information matrix is

\[
I_\infty(\mu,\sigma^2) = \lim_{n\to\infty}\left(\tfrac{1}{n}\right) I_n(\mu,\sigma^2) = \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\[4pt] 0 & \frac{1}{2\sigma^4} \end{pmatrix}.
\]

The sampling distribution of √n(μ̂_n − μ)/s_n. For the simple Normal model (Table 11.9), the two best estimators of (μ, σ²), in the form of μ̂_n and s²_n, come together when one is interested in the sampling distribution of the ratio √n(μ̂_n − μ)/s_n, μ̂_n = X̄_n. This is the basis of the well-known Student's t statistic that began modern statistics at the Guinness brewery in Dublin, Ireland. An employee of this brewery, by the name of William Gosset, published a paper in 1908, under the pseudonym "Student," that inspired Fisher to focus attention on finite sampling distributions as a basis for frequentist inference.

At that time it was known that in the case where X := (X_1, X_2, ..., X_n) is a random sample (IID) from a distribution with unknown mean μ and variance σ², the estimator X̄_n = (1/n)Σ_{k=1}^n X_k is asymptotically (n → ∞) Normally distributed:

X̄_n ⇝ N(μ, σ²/n) ⟹ √n(X̄_n − μ)/σ ⇝ N(0, 1).  (11.39)

It was also known that when σ² is replaced by σ̂²_n = (1/n)Σ_{i=1}^n (X_i − X̄_n)², the asymptotic distribution in (11.39) remains the same:

√n(X̄_n − μ)/σ̂_n ⇝ N(0, 1).
Motivated by his experimental work at the Guinness brewery, Gosset was interested in drawing inferences with small n < 15, and the asymptotic approximation in (11.39) was not accurate enough for that: "in such cases [small n] it is sometimes necessary to judge of the certainty of the results from a very small sample, which itself affords the only indication of the variability" (Gosset, 1908, p. 2). Gosset realized that when X is a NIID sample, the result in (11.39) is exact and not asymptotic, i.e. it holds for any n > 1:

X̄_n ∼ N(μ, σ²/n) ⟹ √n(X̄_n − μ)/σ ∼ N(0, 1).  (11.40)

This was the first finite sampling distribution. He then asked what the finite sampling distribution of

√n(X̄_n − μ)/s_n ∼ D(0, 1)?

might be, and used simulated data, in conjunction with Karl Pearson's approach to statistics (Chapter 10), to conclude (guess) that for any n > 1

√n(X̄_n − μ)/s_n ∼ St(n−1),  (11.41)

where St(n−1) denotes a Student's t distribution with (n−1) degrees of freedom. This result was formally derived by Fisher (1915), whose derivation of (11.41) is encapsulated by the following lemma.
Lemma 11.5 Assume that Z and V are two independent random variables with distributions

Z ∼ N(0, 1), V ∼ χ²(m) ⟹ Z/√(V/m) ∼ St(m).

That is, the ratio Z/√(V/m) is Student's t distributed with m degrees of freedom.

Example 11.34 In the context of the simple Normal model (Table 11.9), the finite sampling distributions of X̄_n = (1/n)Σ_{k=1}^n X_k and s²_n = (1/(n−1))Σ_{i=1}^n (X_i − X̄_n)² in (11.40) and (11.35) are

Z = √n(X̄_n − μ)/σ ∼ N(0, 1), V = (n−1)s²_n/σ² ∼ χ²(n − 1).

In addition, X̄_n and s²_n are independent. Using Lemma 11.5 we can deduce that

[ √n(X̄_n − μ)/σ ] / √[ (n−1)s²_n / ((n−1)σ²) ] = √n(X̄_n − μ)/s_n ∼ St(n−1).

After deriving the finite sampling distributions of the Student's t ratio and of the sample correlation coefficient in 1915, Fisher went on to derive almost all the finite sampling distributions associated with the simple Normal and linear regression models.
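Gosset's point is easily reproduced by simulation (a sketch; n = 5 and the replication count are arbitrary choices): for small n the studentized ratio strays outside ±1.96 far more often than the 5% a Normal approximation would suggest, reflecting the heavier tails of St(n−1):

```python
import random

# Student's t ratio for NIID data: sqrt(n)*(X-bar - mu)/s_n is St(n-1),
# whose tails are much heavier than N(0,1) when n is small.
random.seed(2)
n, mu, sigma, reps = 5, 0.0, 1.0, 50000

t_vals = []
for _ in range(reps):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(x) / n
    s = (sum((xi - xbar) ** 2 for xi in x) / (n - 1)) ** 0.5
    t_vals.append(n ** 0.5 * (xbar - mu) / s)

tail = sum(abs(t) > 1.96 for t in t_vals) / reps
print(tail)  # roughly 0.12 for St(4), far above the Normal 0.05
```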
11.7 Confidence Intervals (Interval Estimation)

It is often insufficiently appreciated that although point estimation provides the basis for all other forms of statistical inference (interval estimation, hypothesis testing, prediction, and simulation), it does not, by itself, output an inferential claim, as is often erroneously presumed. What is the inferential claim associated with an "optimal" point estimator θ̂_n(X) of θ in the context of a prespecified statistical model Mθ(x)? An optimal θ̂_n(X) does not justify the obvious inferential claim that θ̂_n(x₀) is "close in some sense" to θ* (the true value of θ), since θ̂_n(x₀) represents only a single value in the infinite parameter space Θ. The optimal properties calibrate the capacity of the estimator θ̂_n(X) to pinpoint θ*, but the relevant probabilities do not extend to the estimate θ̂_n(x₀). This is the reason why reported point estimates are usually accompanied by the estimated standard error (SE) of θ̂_n(X), to convey, in some broad sense, how close that point estimate is likely to be to θ*. In the case of a simple Bernoulli model (Table 11.1), a point estimate θ̂_n(x₀) = .51 with SE(θ̂_n) = √Var(θ̂_n(X)) = .01 in some intuitive sense appears to be closer to θ* than if SE(θ̂_n) = .1. An attempt to formalize this intuition gave rise to confidence interval (CI) estimation.

A two-sided (1−α) confidence interval for θ takes the form

P(θ̂_L(X) ≤ θ ≤ θ̂_U(X); θ = θ*) = 1−α,  (11.42)

where θ̂_L(X) and θ̂_U(X) denote the lower- and upper-bound statistics that define the different CIs for θ. The notation θ = θ* indicates that the evaluation of the probability (1−α) is under θ = θ*, the true value of θ, whatever that happens to be. Here, (1−α) refers to the probability that the random bounds [θ̂_L(X), θ̂_U(X)] "cover" (overlay) the true value θ*. Equivalently, the "coverage error" probability α denotes the probability that the random interval based on the bounds [θ̂_L(X), θ̂_U(X)] does not cover θ*. Note that for discrete probabilistic models f(x; θ) one uses ≥ 1−α, because exact equality might not be attainable for a particular α.

A one-sided (1−α) CI for θ takes the form

P(θ ≥ θ̂_L1(X); θ = θ*) = 1−α, P(θ ≤ θ̂_U1(X); θ = θ*) = 1−α,  (11.43)

where θ̂_L1(X) and θ̂_U1(X) denote the relevant lower and upper statistics.
11.7.1 Long-Run "Interpretation" of CIs

An intuitive way to understand the coverage probability is to invoke the metaphor of repeatability (in principle) and draw a sequence of different sample (X) realizations of size n, say x₁, x₂, ..., x_N, and then evaluate the observed CIs [θ̂_L(x_i), θ̂_U(x_i)], i = 1, ..., N. It is crucial to note that no probability can be attached to an observed CI

θ̂_L(x₀) ≤ θ ≤ θ̂_U(x₀),

because the latter is not stochastic: it either includes or excludes θ*. The coverage probability (1−α) ensures that a proportion of approximately (1−α) of these observed CIs will cover (overlay) the true θ, whatever its value θ* is. As shown in Figure 11.1, a vertical line from μ* cutting through the N = 20 observed CIs x̄_n ± (1/√n)c_{α/2} reveals that each either overlays μ* or it does not, but in practice we do not know which is the case for any one of them. The prespecified α = .05 in the above case suggests that, on average, only 1 in 20 (5%) of the observed CIs will not cover μ*. Despite its intuitive appeal, the long-run metaphor is often misinterpreted by confusing the probability (1−α) with the relative frequency associated with the metaphor.
11.7.2 Constructing a Confidence Interval

Where do CIs come from? The simplest way to construct a confidence interval is to use what Fisher (1935b) called a pivot (or a pivotal quantity) q(X, θ), when it exists. A pivot is defined to be a function of both the sample X and the unknown parameter θ, q(., .): (Rⁿ_X × Θ) → R, whose distribution f(q(x, θ)) is known; it does not involve unknown parameters.

Example 11.35 For the simple (one-parameter) Normal model (Table 11.2) a pivotal quantity exists and takes the form

q(X, μ) = √n (X̄ₙ − μ)/σ ~ N(0, 1).
Estimation I: Properties of Estimators
Fig. 11.1 The long-run metaphor for confidence intervals
To bring out the fact that this result is an example of factual reasoning, it should be written more correctly as

(11.44)  q(X, μ) = √n (X̄ₙ − μ*)/σ ~ N(0, 1), evaluated under μ=μ*,

where the evaluation of the pivot is under μ=μ*, whatever μ* happens to be. Indeed, if one were to choose a particular value for μ, say μ=μ₀ ≠ μ*, then the above result would not hold, since under μ=μ*

q(X, μ) = √n (X̄ₙ − μ₀)/σ ~ N(δ₀, 1), δ₀ = √n (μ* − μ₀)/σ.

One can use the pivot in (11.44) to construct a two-sided confidence interval in two steps.
Step 1. Select α and use the distribution N(0, 1) in (11.44) to define the constants a and b such that

P(a ≤ √n (X̄ₙ − μ)/σ ≤ b; μ = μ*) = 1 − α.

In this case, for α=.05, −a = b = c_{α/2} = 1.96.

Step 2. "Solve" the pivot to isolate μ between the two inequalities, to derive the two-sided CI with (1−α) coverage probability:

P(X̄ₙ − (σ/√n)c_{α/2} ≤ μ ≤ X̄ₙ + (σ/√n)c_{α/2}; μ = μ*) = 1−α,
P(X̄ₙ − 1.96σ/√n ≤ μ ≤ X̄ₙ + 1.96σ/√n; μ = μ*) = .95,

or equivalently in terms of the coverage error probability:

P(μ ∉ [X̄ₙ − (σ/√n)c_{α/2}, X̄ₙ + (σ/√n)c_{α/2}]; μ = μ*) = α,
P(μ ∉ [X̄ₙ − 1.96σ/√n, X̄ₙ + 1.96σ/√n]; μ = μ*) = .05.

By the same token, the probability (1−α) cannot be assigned to the observed CI x̄ₙ − (σ/√n)c_{α/2} ≤ μ ≤ x̄ₙ + (σ/√n)c_{α/2}, where x̄ₙ denotes the observed value of X̄ₙ.
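The two steps above reduce to simple arithmetic once the observed mean is in hand. A minimal sketch (the sample values, σ, and function name are hypothetical, not from the text):

```python
import math

def normal_ci(xbar, sigma, n, c=1.96):
    # Step 2: invert the pivot q = sqrt(n)(xbar - mu)/sigma with |q| <= c
    half = c * sigma / math.sqrt(n)
    return xbar - half, xbar + half

# hypothetical observed mean from n = 25 observations with known sigma = 2
lo, hi = normal_ci(xbar=10.0, sigma=2.0, n=25)
print(round(lo, 3), round(hi, 3))  # 9.216 10.784
```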
11.7.3 Optimality of Confidence Intervals

How does one assess the optimality (effectiveness) of confidence intervals? Intuitively, for a given confidence level (1−α), the shorter the CI, the more informative it is for learning about the true value θ* of the unknown parameter θ. An ideal CI would be one that reduces the interval to a single point θ = θ*, but, as in the case of an ideal estimator, no such interval exists for a finite sample size n. But how can one measure the length of a CI? In the case of the (1−α) CI for μ:

P(X̄ₙ − (σ/√n)c_{α/2} ≤ μ ≤ X̄ₙ + (σ/√n)c_{α/2}; μ = μ*) = 1−α,

one can measure its length by subtracting the lower from the upper bound:

θ̂_U(X) − θ̂_L(X) = [X̄ₙ + (σ/√n)c_{α/2}] − [X̄ₙ − (σ/√n)c_{α/2}] = (2σ/√n)c_{α/2}.

It turns out that this CI is the shortest possible because of the choice of the quantiles of the Normal distribution: −a = b = c_{α/2}. Any choice that does not satisfy this equality gives rise to CIs of greater length. In general, evaluating the length of a CI is not as easy, because the difference θ̂_U(X) − θ̂_L(X) often gives rise to a statistic, which is a random variable, and one has to evaluate the expected length E[θ̂_U(X) − θ̂_L(X)]. It turns out, however, that there is a duality between optimal tests and optimal CIs that renders the optimality of both procedures easier to understand, because for every optimal test there is an analogous optimal CI; see Chapter 13.
11.8 Bayesian Estimation

A key argument used by Bayesians to promote their favorite approach to statistics is its simplicity, in the sense that all forms of inference revolve around a single function, the posterior distribution π(θ|x₀) ∝ π(θ)·f(x₀|θ), ∀θ∈Θ. This, however, is only half the story. The other half is how the posterior distribution is utilized to yield "optimal" inferences. The issue of optimality, however, is intrinsically related to what the primary objective of Bayesian inference is. An outsider looking at the Bayesian approach would surmise that its primary objective is to yield the "probabilistic ranking" (ordering) of all values of θ in Θ. The modeling begins with an a priori probabilistic ranking based on π(θ), ∀θ∈Θ, which is revised after observing x₀ to derive π(θ|x₀), ∀θ∈Θ; hence the key role of the quantifier ∀θ∈Θ. Indeed, O'Hagan (1994), p. 6, argues that the revised probabilistic ranking is the inference:

The most usual inference question is this: After seeing the data x0, what do we now know about the parameter θ? The only answer to this question is to present the entire posterior distribution.
He goes on to argue: Classical inference theory is very concerned with constructing good inference rules. The primary concern of Bayesian inference . . . is entirely different. The objective is to extract information concerning θ from the posterior distribution, and to present it helpfully via effective summaries.
Where do these effective summaries come from? O’Hagan argues that the criteria for “optimal” Bayesian inferences are only parasitical on the Bayesian approach and enter the picture via the decision-theoretic perspective: . . . a study of decision theory . . . helps identify suitable summaries to give Bayesian answers to stylized inference questions which classical theory addresses. (p. 14)
Decision-theoretic framing of inference. Initially proposed by Wald (1939, 1950), the decision-theoretic setup has three basic components:

(i) a prespecified statistical model Mθ(x);
(ii) a decision space D containing all mappings d(.): Rⁿ_X → A, where A denotes the set of all actions available to the statistician;
(iii) a loss function L(., .): [D × Θ] → R, representing the numerical loss if the statistician takes action a∈A when the state of nature is θ∈Θ; see Ferguson (1967), Berger (1985), Wasserman (2004).
The basic idea is that, when the decision maker selects action a, he/she does not know the "true" state of nature, represented by θ*. However, contingent on each action a∈A, the decision maker "knows" the losses (gains, utilities) resulting from different choices (d, θ)∈[D × Θ]. The decision maker observes data x₀, which provide some information about θ*, and then maps each x∈Rⁿ_X to a certain action a∈A, guided solely by L(d, θ). This intuitive argument needs to be tempered somewhat, since θ* is unknown, and thus the loss function will penalize θ* as every other value of θ in Θ.
11.8.1 Optimal Bayesian Rules

The decision-theoretic setup provides optimal Bayesian rules by using the risk function R(θ, θ̂) = E_X[L(θ, θ̂(X))] to define the Bayes risk:

Bayes risk: R_B(θ̂) = ∫_Θ R(θ, θ̂)π(θ)dθ,

whose minimization with respect to all such rules θ̃(x) yields the

Bayes rule: inf over θ̃(x) of R_B(θ̂) = inf over θ̃(x) of ∫_Θ R(θ, θ̂)π(θ)dθ.

R_B(θ̂) can be expressed in the form (Bansal, 2007)

(11.45)  R_B(θ̂) = ∫ over x∈Rⁿ_X ∫ over θ∈Θ of L(θ̂(X), θ)π(θ|x)dθ dx,
where π(θ|x) ∝ f(x|θ)π(θ). In light of (11.45), a Bayesian rule is "optimal" relative to a particular loss function L(θ̂(X), θ) when it minimizes R_B(θ̂). This makes it clear that what constitutes an "optimal" Bayesian rule is primarily determined by L(θ̂(X), θ) (Schervish, 1995):

(i) when L₂(θ̂, θ) = (θ̂ − θ)², the Bayes estimate θ̂ is the mean of π(θ|x₀);
(ii) when L₁(θ̂, θ) = |θ̂ − θ|, the Bayes estimate θ̃ is the median of π(θ|x₀);
(iii) when L₀₋₁(θ̂, θ) = δ(θ̂, θ), equal to 0 for |θ̂−θ| < ε and 1 for |θ̂−θ| ≥ ε, for ε>0, the Bayes estimate θ̂ is the mode of π(θ|x₀).

In practice, the most widely used loss function is the square: L₂(θ̂(X); θ) = (θ̂(X) − θ)², ∀θ∈Θ, whose risk function is the decision-theoretic mean square error:

(11.46)  R(θ, θ̂) = E(θ̂(X)−θ)² = MSE(θ̂(X); θ), ∀θ∈Θ.
It is important to note that (11.46) is the source of confusion between the Bayesian definition of the MSE and the frequentist definition in (11.21).

Example 11.36 As shown in Example 10.10, for the simple Bernoulli model (Table 11.1), with π(θ) ~ Beta(α, β), the posterior distribution is π(θ|x₀) ~ Beta(α*, β*), where α* = nx̄+α, β* = n(1−x̄)+β. Note that for Z ~ Beta(α, β), mode(Z) = (α−1)/(α+β−2), E(Z) = α/(α+β), and Var(Z) = αβ/[(α+β)²(α+β+1)] (Appendix 3.A).

(a) When the relevant loss function is L₀₋₁(θ̂, θ), the optimal Bayesian rule is the mode of π(θ|x₀) ~ Beta(α*, β*), which takes the form

(11.47)  θ̃_B = (α*−1)/(α*+β*−2) = (nX̄+α−1)/(n+α+β−2).

(b) When the relevant loss function is L₂(θ̂, θ) = (θ̂−θ)², the optimal Bayesian rule is the mean of π(θ|x₀) ~ Beta(α*, β*), which takes the form

(11.48)  θ̂_B = α*/(α*+β*) = (nX̄+α)/(n+α+β).
For a Jeffreys prior π(θ) ~ Beta(.5, .5), with nx̄=4 and n=20, α* = nx̄ + α = 4.5 and β* = n(1−x̄) + β = 16.5:

(11.49)  θ̃_B = 3.5/(21−2) = .184,  θ̂_B = 4.5/(4.5+16.5) = .214.
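These Beta-posterior summaries are easy to verify numerically. A minimal sketch (the function name is mine, not the text's):

```python
def beta_posterior(successes, n, a=0.5, b=0.5):
    """Beta(a, b) prior + Bernoulli sample -> Beta(a_star, b_star) posterior."""
    a_star = successes + a
    b_star = (n - successes) + b
    mode = (a_star - 1) / (a_star + b_star - 2)  # 0-1 loss Bayes estimate
    mean = a_star / (a_star + b_star)            # square loss Bayes estimate
    return mode, mean

mode, mean = beta_posterior(successes=4, n=20)   # Jeffreys prior, nx̄ = 4
print(round(mode, 3), round(mean, 3))  # 0.184 0.214
```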
Lehmann (1984) warned statisticians about the perils of arbitrary loss functions: It is argued that the choice of a loss function, while less crucial than that of the model, exerts an important influence on the nature of the solution of a statistical decision problem, and that an arbitrary choice such as squared error may be baldly misleading as to the relative desirability of the competing procedures. (p. 425)
Tukey (1960) went even further, arguing that the decision-theoretic framing distorts frequentist testing by replacing error probabilities with losses and costs: Wald’s decision theory . . . has given up fixed probability of errors of the first kind, and has focused on gains, losses or regrets. (p. 433)
He went on to echo Fisher’s (1955) view by contrasting decisions vs. inferences: Conclusions are established with careful regard to evidence, but without regard to consequences of specific actions in specific circumstances. (p. 425)
Hacking (1965) brought out the key difference between an “inference pertaining to evidence” for or against a hypothesis and a “decision to do something” as a result of an inference: to conclude that an hypothesis is best supported is, apparently, to decide that the hypothesis in question is best supported. Hence it is a decision like any other. But this inference is fallacious. Deciding that something is the case differs from deciding to do something. . . . Hence deciding to do something falls squarely in the province of decision theory, but deciding that something is the case does not. (p. 31)
11.8.2 Bayesian Credible Intervals

A Bayesian (1−α) credible interval for θ is constructed by finding the area between the α/2 and (1−α/2) percentiles of the posterior distribution, say a and b, respectively:

(11.50)  P(a ≤ θ < b) = 1−α, with ∫ from a to 1 of π(θ|x₀)dθ = 1−α/2 and ∫ from b to 1 of π(θ|x₀)dθ = α/2,

where P(.) denotes probabilistic assignments based on the posterior distribution π(θ|x₀), ∀θ∈Θ. In practice, one can define an infinity of (1−α) credible intervals using the same posterior π(θ|x₀). To avoid this indeterminacy one needs to impose additional restrictions, such as requiring the interval with the shortest length or one with equal tails; see Robert (2007).

Example 11.37 In the case of the simple Bernoulli model (Table 11.1), the end points of an equal-tailed credible interval can be evaluated by transforming the Beta distribution into the F-distribution via

(11.51)  Z ~ Beta(α*, β*) ⇒ β*Z/[α*(1−Z)] ~ F(2α*, 2β*).
Denoting the α/2 and (1−α/2) percentiles of the F(2α*, 2β*) distribution by f(α/2) and f(1−α/2), respectively, the Bayesian (1−α) credible interval for θ is

(11.52)  [1 + β*/(α*·f(1−α/2))]⁻¹ ≤ θ ≤ [1 + β*/(α*·f(α/2))]⁻¹.
Example 11.38 For the simple Bernoulli model (Table 11.1), with Jeffreys prior π_J(θ) ~ Beta(.5, .5), nx̄=2, n=20, α=.05: α* = nx̄ + α = 2.5, β* = n(1−x̄) + β = 18.5, f(1−α/2) = .163, f(α/2) = 2.93, yielding

(11.53)  [1 + 18.5/(2.5(.163))]⁻¹ ≤ θ ≤ [1 + 18.5/(2.5(2.93))]⁻¹ ⇔ (.0216 ≤ θ ≤ .284).

It is important to contrast the frequentist confidence interval with the Bayesian credible interval to bring out their key differences. The first important difference is that the basis of a CI is a pivot q(X, θ), whose sampling distribution is evaluated under θ = θ*, with the probability firmly attached to x∈Rⁿ_X. In contrast, a (1−α) credible interval represents the highest posterior density interval; that is, it is simply the shortest interval whose area (integral) under the posterior density function π(θ|x₀) has value (1−α), where the probabilities are firmly attached to θ∈Θ. Hence, any comparison of tail areas amounts to likening apples to oranges. The second key difference is that the CI (θ̂_L(X) ≤ θ ≤ θ̂_U(X); θ = θ*) is random and its primary purpose is to cover θ* with probability (1−α). There is nothing random about a (1−α) credible interval, and nothing connects that interval to θ*. Bayesians would like to think that it does, since it covers a large part of the posterior, but there is nothing in the above derivation or its underlying reasoning that ensures that. Indeed, the very idea of treating θ as a random variable runs foul of any notion of the true value θ* that could have generated data x₀. In the case of a CI, the evaluation of the sampling distribution of q(X, θ) under θ = θ* secures exactly that.
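The equal-tail credible interval of Example 11.38 can be approximated without the F-transformation by sampling directly from the Beta posterior (a Monte Carlo sketch using only the standard library; exact end points would come from Beta quantiles):

```python
import random

def beta_credible_interval(a_star, b_star, alpha=0.05, draws=200_000, seed=7):
    # equal-tail interval from Monte Carlo draws of the Beta(a_star, b_star) posterior
    random.seed(seed)
    sample = sorted(random.betavariate(a_star, b_star) for _ in range(draws))
    lo = sample[int(draws * alpha / 2)]
    hi = sample[int(draws * (1 - alpha / 2))]
    return lo, hi

lo, hi = beta_credible_interval(2.5, 18.5)  # posterior of Example 11.38
print(round(lo, 3), round(hi, 3))  # roughly .022 and .284, matching (11.53)
```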
11.9 Summary and Conclusions

The primary objective in frequentist estimation is to learn about a particular parameter θ of interest using the sampling distribution fₙ(θ̂ₙ; θ*) of an estimator associated with a particular sample size n. The finite sample properties are defined in terms of this distribution, and the asymptotic properties are defined in terms of the asymptotic sampling distribution f∞(θ̂ₙ; θ*), aiming to approximate fₙ(θ̂ₙ; θ*) at the limit n → ∞. The question that arises is: what combination of properties defines an optimal estimator? A minimal property (necessary but not sufficient) for an estimator is consistency (weak or strong). As an extension of the law of large numbers, a consistent estimator of θ indicates potential (as n → ∞) learning from data about the unknown parameter θ. By itself, however, it does not secure learning for a particular n. This suggests that in going from potential learning to actual learning, one needs to supplement consistency with certain finite sample properties to ensure learning for the particular data x₀ of sample size n. Among finite sample properties, full efficiency is clearly the most important, because it secures the highest degree of precision in learning for a given n. Relative efficiency, although desirable,
needs to be investigated further to find out how large the class of estimators being compared is before passing judgement. Unbiasedness, although wanted, is not considered indispensable by itself. Indeed, as shown above, an unbiased but inconsistent estimator is practically useless, and a consistent but biased estimator is always preferable for large enough n. Sufficiency is clearly a desirable property, because it ensures that no information relevant for inference involving θ is forfeited. In summary, a consistent, unbiased, fully efficient, and sufficient estimator sets the gold standard in estimation. When no consistent estimator can achieve this standard, one should be careful about trading a loss in efficiency to secure unbiasedness. Analogously, minimum MSE, when properly defined at θ=θ*, is not a particularly essential property by itself.

What about the case where an estimator is consistent, asymptotically Normal (CAN), and possibly asymptotically efficient? Although in practice statisticians and econometricians consider CAN as being close to the gold standard, the fact of the matter is that relying exclusively on asymptotic properties is a bad strategy in general, as illustrated in Chapter 9. The reason is that the asymptotic sampling distribution f∞(θ̂ₙ; θ*), because it asserts what happens in the limit, might provide a terrible approximation to fₙ(θ̂ₙ; θ*), the relevant distribution for inferences with data x₀. Even worse, one has no way to make an informed appraisal of how bad this approximation might be for a given n. Hence, relying solely on asymptotic properties like CAN is not a good strategy for learning from data, because the reliability of inference is at best non-ascertainable and at worst highly misleading. Point estimation does not, by itself, output an inferential claim.
In particular, point estimation does not output the inferential claim that a point estimate provides a close approximation to the true unknown parameter the underlying estimator aims to pinpoint. A point estimate needs to be supplemented by a certain measure of the precision associated with the estimate θ̂ₙ(x₀). This is the reason why θ̂ₙ(x₀) is often accompanied by its standard error [SE(θ̂ₙ(X))]. Interval estimation rectifies this omission of point estimation by providing the relevant inferential claim based on a prespecified coverage error probability for the true value θ*.

Additional references: Arnold (1990), Azzalini (1996), Davison (2003), Spanos (2008).

Important Concepts
Ideal estimator, sampling distribution of an estimator, finite sample properties of estimators, unbiasedness, relative efficiency, full efficiency, Cramer–Rao lower bound, Fisher's information, irregular statistical models, sufficiency, mean square error (MSE), bias of an estimator, admissibility of an estimator, asymptotic properties of estimators, weak consistency, strong consistency, consistency as a minimal property, asymptotic Normality, asymptotic efficiency, mode and median unbiased estimator, inferential claims and confidence intervals, long-run metaphor, pivotal function, coverage probability, minimum length confidence intervals, optimal Bayes rules, loss and risk functions, factorization theorem, minimal sufficiency, completeness, exponential family of distributions.

Crucial Distinctions
Ideal vs. feasible estimator, finite sample vs. asymptotic properties, relative efficiency vs. full efficiency, frequentist definition of the MSE vs. the Bayesian definition, point vs. interval
estimation, statistical modeling vs. statistical inference, confidence intervals vs. credible intervals.

Essential ideas
● Properties of an estimator aim to gauge its generic capacity to pinpoint θ* for all values X = x, x∈Rⁿ_X.
● The gold standard for an optimal estimator θ̂ₙ(X) comes in the form of a combination of properties: θ̂ₙ(X) needs to satisfy consistency as a minimal property, combined with full efficiency and sufficiency when the latter exists. Reparameterization invariance is a more desirable property than unbiasedness.
● An optimal point estimator θ̂ₙ(X), although fundamental for all forms of statistical inference (confidence intervals, hypothesis testing, prediction), does not output an inferential claim, such as that θ̂ₙ(x₀) is close enough to θ*.
● A consistent and asymptotically Normal (CAN) estimator θ̂ₙ(X) does not guarantee the reliability of any inference procedure based on it.
● The Bayesian definition of the mean square error (MSE), based on the quantifier ∀θ∈Θ, and the related property of admissibility, are at odds with the underlying reasoning and primary aim of frequentist estimation. Consistency, and not admissibility, is the relevant minimal property for frequentist estimators.
● In statistical models for which the minimal sufficient and maximal ancillary statistics co-exist, one can separate the modeling from the inference facet by using the ancillary statistic for the former and the sufficient statistic for the latter. The exponential family of distributions includes several such statistical models that are widely used in empirical modeling, including the simple Normal and the linear regression models; see Spanos (2010b).
11.10 Questions and Exercises

1. Explain briefly what we do when we construct an estimator. Why is an estimator a random variable?
2. "Defining the sampling distribution of an estimator is in theory trivial but technically very difficult." Discuss.
3. Explain what the primary aim of an estimator is, and why its optimality can only be assessed via its sampling distribution.
4. For the Bernoulli statistical model (Table 11.1):
(a) Discuss whether the following functions constitute possible estimators of θ:
(i) θ̂₁ = Xₙ, (ii) θ̂₂ = (1/2)(X₁ − X₂), (iii) θ̂₃ = (1/3)(X₁ − X₂ + Xₙ), (iv) θ̂ₙ = (1/n)Σᵢ₌₁ⁿ Xᵢ, (v) θ̂ₙ₊₁ = (1/(n+1))Σᵢ₌₁ⁿ Xᵢ.
(b) For those that constitute estimators, derive their sampling distributions.
5. Explain briefly the properties of unbiasedness and efficiency of estimators.
6. "In assessing the optimality of an estimator we need to look at the first two moments of its sampling distribution only." Discuss.
7. Explain briefly what a consistent estimator is. What is the easiest way to prove consistency for estimators with bounded second moments?
8. Explain briefly the difference between weak and strong consistency of estimators.
9. "Asymptotic Normality of an estimator is an extension of the central limit theorem for functions of the sample beyond the sample mean." Discuss.
10. (a) Compare and contrast full efficiency with relative efficiency.
(b) Explain the difference between full efficiency and asymptotic efficiency.
11. Discuss the key differences between finite sample and asymptotic properties of estimators, and why these differences matter for the reliability of inference with x₀.
12. Explain the difference between the Cramer–Rao and Bhattacharya lower bounds.
13. (a) Explain the notion of sufficiency.
(b) Explain the notion of a minimal sufficient statistic and how it relates to the best unbiased estimator.
14. (a) Discuss the difference between the following two definitions of the notions of the bias and MSE:
(i) E(θ̂) = θ, MSE(θ̂) = E(θ̂ − θ)², for all θ∈Θ;
(ii) E(θ̂; θ*) = θ*, MSE(θ̂; θ*) = E(θ̂ − θ*)², where θ* is the true θ in Θ.
Explain which one makes sense in the context of frequentist estimation and why the other one does not.
(b) Consider the simple Normal statistical model with μ = 0 and the following estimators of σ²: σ̂² = (1/n)Σₖ₌₁ⁿ Xₖ², σ̃²ₙ₊₁ = (1/(n+1))Σₖ₌₁ⁿ Xₖ², σ̃²ₙ₋₁ = (1/(n−1))Σₖ₌₁ⁿ Xₖ².
(c) Compare the relative efficiency of these estimators in terms of their MSE and evaluate it for σ = 1 and n = 100.
15. Consider the Normal (two-parameter) statistical model.
(a) Derive (not guess!) the sampling distributions of the following estimators:
(i) μ̂₁ = Xₙ, (ii) μ̂₂ = (1/3)(X₁+X₂+X₃), (iii) μ̂₃ = (X₁−Xₙ), (iv) μ̂ₙ = (1/n)Σᵢ₌₁ⁿ Xᵢ.
(HINT: State explicitly any properties of E(.) or any lemmas you use.)
(b) Compare these estimators in terms of the optimal properties: unbiasedness, efficiency, and consistency.
(c) Compare and contrast the estimators σ̂²ₙ = (1/n)Σᵢ₌₁ⁿ(Xᵢ − μ̂ₙ)² and s²ₙ = (1/(n−1))Σᵢ₌₁ⁿ(Xᵢ − μ̂ₙ)² in terms of their properties.
16. Consider the simple Poisson model based on f(x; θ) = θˣe^(−θ)/x!, θ>0, x = 0, 1, 2, ...
(a) Derive the Cramer–Rao lower bound for unbiased estimators of θ.
(b) Explain why X̄ = (1/n)Σᵢ₌₁ⁿ Xᵢ is an unbiased and fully efficient estimator of θ.
(c) Derive the posterior distribution when π(θ) is Gamma(α, β).
17. (a) Explain what the gold standard for the optimality of estimators amounts to in terms of a combination of properties; give examples if it helps.
(b) Explain why the property of admissibility is unduly dependent on the quantifier ∀θ∈Θ, which is at odds with the frequentist definition of the MSE in (11.21).
(c) Explain why using admissibility as a minimal property for frequentist estimators makes no sense, and why consistency can be the relevant minimal property.
12 Estimation II: Methods of Estimation
12.1 Introduction

In Chapter 11 we discussed estimators and their properties. The essential finite sample and asymptotic properties of estimators are listed in Table 12.1 (properties of estimators).

Despite the few pathological cases for which existence and uniqueness of the MLE θ̂ is not guaranteed (Gourieroux and Monfort, 1995), in practice θ̂_ML(X) exists and is unique in the overwhelming number of cases of interest. In order to reduce the pathological cases for which θ̂_ML(X) may not exist, we often restrict our discussion to cases where two additional restrictions to R1–R4 in Table 11.4 are imposed on Mθ(x) (Table 12.5).
Table 12.5 Regularity conditions for Mθ(x) = {f(x; θ), θ∈Θ}, x∈Rⁿ_X
(R5) L(.; x₀): Θ → [0, ∞) is continuous at all points θ∈Θ
(R6) For all values θ₁ ≠ θ₂ in Θ, f(x; θ₁) ≠ f(x; θ₂), x∈Rⁿ_X
Condition (R5) ensures that L(θ; x) is smooth enough to locate its maximum, and condition (R6) ensures that θ is statistically identified and thus unique; see Section 4.9.2. When the likelihood function is also differentiable, one can locate the maximum by solving the first-order conditions

dL(θ; x)/dθ evaluated at θ=θ̂_ML: g(θ̂_ML) = 0, given d²L(θ; x)/dθ² at θ=θ̂_ML < 0.

In practice, it is often easier to maximize the log-likelihood function instead, because the two have the same maximum (the logarithm is a monotonic transformation):

d ln L(θ; x)/dθ at θ=θ̂_ML = (1/L)·(dL(θ; x)/dθ) at θ=θ̂_ML = (1/L)·g(θ̂_ML) = 0.
Example 12.4 For the simple Bernoulli model (Table 12.4), the log-likelihood function is

(12.9)  ln L(x; θ) = Σᵢ₌₁ⁿ xᵢ ln θ + Σᵢ₌₁ⁿ (1 − xᵢ) ln(1 − θ) = Y ln θ + (n − Y) ln(1 − θ), where Y = Σₖ₌₁ⁿ Xₖ.

Solving the first-order condition

d ln L(x; θ)/dθ = (1/θ)Y − (1/(1−θ))(n − Y) = 0 ⇒ Y(1 − θ) = θ(n − Y) ⇒ nθ = Y

for θ yields the MLE θ̂_ML = (1/n)Σₖ₌₁ⁿ Xₖ of θ, which is just the sample mean. To ensure that θ̂_ML is a maximum of ln L(x; θ), we need to check that d² ln L(x; θ)/dθ² at θ=θ̂_ML is negative; if it were positive, θ̂_ML would be a minimum. The second-order conditions confirm that θ̂_ML is a maximum, since

d² ln L(x; θ)/dθ² at θ=θ̂_ML = −[(1/θ²)Y + (1/(1−θ)²)(n − Y)] at θ=θ̂_ML = −n³/(Y(n − Y)) < 0,

because both the numerator (n³) and the denominator Y(n − Y) are positive.

To avoid the misleading impression that the maximum likelihood estimator for simple statistical models can always be derived using differentiation, compare Example 12.4 with the following.
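The closed-form result of Example 12.4 can be checked numerically by evaluating the log-likelihood (12.9) on a grid (an illustrative sketch; the data are made up):

```python
import math

def bernoulli_loglik(theta, y, n):
    # log-likelihood (12.9): Y ln(theta) + (n - Y) ln(1 - theta)
    return y * math.log(theta) + (n - y) * math.log(1 - theta)

y, n = 7, 20  # hypothetical sample: 7 successes in 20 trials
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: bernoulli_loglik(t, y, n))
print(theta_grid, y / n)  # 0.35 0.35
```

The grid maximizer coincides with the sample mean Y/n, as the first-order condition predicts.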
Example 12.5 Consider the simple Laplace model (Table 12.6), whose density function is

f(x; θ) = (1/2) exp{−|x − θ|}, θ∈R, x∈R.

Table 12.6 The simple Laplace model
Statistical GM: Xₜ = θ + uₜ, t∈N := (1, 2, ..., n, ...)
[1] Laplace: Xₜ ~ Lap(., .)
[2] Constant mean: E(Xₜ) = θ, for all t∈N
[3] Constant variance: Var(Xₜ) = 2, for all t∈N
[4] Independence: {Xₜ, t∈N} is an independent process

The distribution of the sample takes the form

f(x; θ) = ∏ₜ₌₁ⁿ (1/2) exp{−|xₜ − θ|} = (1/2)ⁿ exp{−Σₜ₌₁ⁿ |xₜ − θ|}, x∈Rⁿ,

and thus the log-likelihood function is

ln L(θ; x) = const. − n ln(2) − Σₜ₌₁ⁿ |xₜ − θ|, θ∈R.
12.2 The Maximum Likelihood Method
Since ln L(θ; x) is non-differentiable, one needs to use alternative methods to derive the maximum of this function. In this case, maximizing ln L(θ; x) with respect to θ is equivalent to minimizing the function Σₜ₌₁ⁿ |xₜ − θ|, which (in the case of n odd) gives rise to the sample median θ̂_ML = median(X₁, X₂, ..., Xₙ).
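A quick numerical sketch confirms that the sample median minimizes Σₜ₌₁ⁿ |xₜ − θ| (illustrative data; θ is searched over a grid):

```python
import statistics

def abs_loss(theta, xs):
    # the criterion whose minimizer is the Laplace MLE
    return sum(abs(x - theta) for x in xs)

xs = [1.2, -0.4, 0.7, 2.1, 0.3]  # hypothetical sample, n odd
grid = [i / 100 for i in range(-300, 301)]
theta_hat = min(grid, key=lambda t: abs_loss(t, xs))
print(theta_hat, statistics.median(xs))  # 0.7 0.7
```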
12.2.3 The Score Function

The quantity (d/dθ) ln L(θ; x) has been encountered in Chapter 11 in relation to full efficiency, but at that point we used the log of the distribution of the sample, ln f(x; θ), instead of ln L(θ; x) to define the Fisher information:

(12.10)  Iₙ(θ) := E[(∂ ln f(x; θ)/∂θ)²].

In terms of the log-likelihood function, the Cramer–Rao (C-R) lower bound takes the form

(12.11)  Var(θ̂) ≥ {E[(∂ ln L(θ; x)/∂θ)²]}⁻¹,

for any unbiased estimator θ̂ of θ.

SHORT DIGRESSION: From a mathematical perspective,

E[(∂ ln f(x; θ)/∂θ)²] = E[(∂ ln L(θ; x)/∂θ)²],

but the question is which choice between ln f(x; θ) and ln L(θ; x) provides a correct way to express the C-R bound in a probabilistically meaningful way. It turns out that neither of these concepts is entirely correct for that. Using ln L(θ; x) renders taking the derivative with respect to θ meaningful, since it is a function of θ∈Θ, in contrast to f(x; θ), which is a function of x∈Rⁿ_X with θ assumed fixed at a particular value. On the other hand, the expectation E(.) is always with respect to x∈Rⁿ_X, and that makes sense only with respect to f(x; θ). Hence, what is implicitly assumed in the derivation of the C-R bound is a more general real-valued function with two arguments g(., .): (Rⁿ_X × Θ) → R such that (i) for a given x = x₀, g(x₀; θ) ∝ L(θ; x₀), θ∈Θ, and (ii) for a fixed θ, say θ=θ*, g(x; θ) = f(x; θ*), x∈Rⁿ_X.

The first derivative of the log-likelihood function, when interpreted as a function of the sample X, defines the score function

s(θ; X) := (d/dθ) ln L(θ; X), ∀x∈Rⁿ_X,

which satisfies the properties in Table 12.7.
Table 12.7 Score function properties
(Sc1) E[s(θ; X)] = 0
(Sc2) Var[s(θ; X)] = E[s(θ; X)²] = E[−d² ln L(θ; X)/dθ²] := Iₙ(θ)
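Properties (Sc1)–(Sc2) can be checked by simulation for the Bernoulli model, where Iₙ(θ) = n/(θ(1−θ)) (a sketch with hypothetical θ and n; function names are mine):

```python
import random

def score(theta, y, n):
    # Bernoulli score: s(theta; X) = Y/theta - (n - Y)/(1 - theta)
    return y / theta - (n - y) / (1 - theta)

def score_moments(theta=0.3, n=50, reps=40_000, seed=3):
    random.seed(seed)
    vals = []
    for _ in range(reps):
        y = sum(random.random() < theta for _ in range(n))
        vals.append(score(theta, y, n))
    mean = sum(vals) / reps
    var = sum(v * v for v in vals) / reps  # E[s] = 0, so E[s^2] = Var[s]
    return mean, var

m, v = score_moments()
print(round(m, 2), round(v, 1))  # mean near 0; variance near n/(theta(1-theta)) = 238.1
```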
That is, the Fisher information is the variance of the score function. As shown in the previous chapter, an unbiased estimator θ̂ₙ(X) of θ achieves the Cramer–Rao (C-R) lower bound if and only if (θ̂ₙ(X) − θ) can be expressed in the form (θ̂ₙ(X) − θ) = h(θ)·s(θ; X), for some function h(θ).

Example 12.6 In the case of the Bernoulli model, the score function is

s(θ; X) := (d/dθ) ln L(θ; X) = (1/θ)Y − (1/(1−θ))(n − Y) ⇒
s(θ; X) = [n/(θ(1−θ))]·(1/n)(Y − nθ) = [n/(θ(1−θ))]·(θ̂_ML − θ) ⇒ (θ̂_ML − θ) = [θ(1−θ)/n]·s(θ; X),

which implies that θ̂_ML = (1/n)Σᵢ₌₁ⁿ Xᵢ achieves the C-R lower bound

Var(θ̂_ML) = C-R(θ) = θ(1−θ)/n,
confirming the result in Example 11.15.

Example 12.7 Consider the simple exponential model in Table 12.8.

Table 12.8 The simple exponential model
Statistical GM: Xₜ = θ + uₜ, t∈N := (1, 2, ..., n, ...)
[1] Exponential: Xₜ ~ Exp(., .), xₜ∈R₊
[2] Constant mean: E(Xₜ) = θ, θ∈R₊, ∀t∈N
[3] Constant variance: Var(Xₜ) = θ², ∀t∈N
[4] Independence: {Xₜ, t∈N} is an independent process
Assumptions [1]–[4] imply that f(x; θ), x∈Rⁿ_X, takes the form

f(x; θ) = ∏ₖ₌₁ⁿ fₖ(xₖ; θₖ) = ∏ₖ₌₁ⁿ f(xₖ; θ) = ∏ₖ₌₁ⁿ (1/θ) exp{−xₖ/θ} = (1/θ)ⁿ exp{−(1/θ)Σₖ₌₁ⁿ xₖ}, x∈Rⁿ₊,

using [4], [1]–[4], and [2]–[4] respectively, and thus the log-likelihood function is

ln L(θ; x) = −n ln(θ) − (1/θ)Σₖ₌₁ⁿ xₖ,
d ln L(θ; x)/dθ = −n/θ + (1/θ²)Σₖ₌₁ⁿ xₖ = 0 ⇒ θ̂_ML = (1/n)Σₖ₌₁ⁿ Xₖ, with (θ̂_ML − θ) = (θ²/n)·s(θ; X).
The second-order condition confirms this is a maximum:

d² ln L(θ; x)/dθ² at θ=θ̂_ML = [n/θ² − (2/θ³)Σₖ₌₁ⁿ xₖ] at θ=θ̂_ML = −n/θ̂²_ML < 0.
L(θ;x) ∂ ln L(θ ;x) In (θ ) = E ∂ ln ∂θ ∂θ
2 L(θ ;x) = E − ∂ ln L(θ ;x) = Cov ∂ ln ∂θ . ∂θ∂θ
Example 12.8 Consider the simple Normal model in Table 12.9. Table 12.9
The simple Normal model
Statistical GM
+ +
∂ 2 ln L(θ ;x) + ∂ 2 ln L(θ ;x) + + 0: +
+ P θ ML → θ . θ ML − θ ∗ + < ε = 1, denoted by lim P + n→∞
(b) Strong consistency. Under these regularity conditions, MLEs are strongly consistent: a.s.
θ ML → θ . P( lim θ ML = θ ∗ ) = 1, denoted by n→∞
See Chapter 9 for a discussion between these two different modes of convergence. (6) Asymptotic Normality Under the regularity conditions R1–R9, MLEs are asymptotically Normal: √ n( θ ML − θ ∗ ) N(0, V∞ (θ)), n→∞
(12.23)
530
Estimation II: Methods of Estimation
where V∞ (θ) denotes the asymptotic variance of θ ML . (7) Asymptotic unbiasedness The asymptotic Normality for MLEs also implies asymptotic unbiasedness: lim E( θ ML ) = θ ∗ .
n→∞
(8) Asymptotic (full) efficiency Under the same regularity conditions the asymptotic variance of maximum likelihood estimators achieves the asymptotic Cramer–Rao lower bound, which in view of (12.22) is θ ML ) = I −1 (θ). V∞ ( Example 12.15 For the simple Bernoulli model (Table 12.4): √ n( θ ML − θ) N(0, θ(1 − θ )). n→∞
Example 12.16 For the simple exponential model (Table 12.8): √ n( θ ML − θ) N(0, θ 2 ). n→∞
Example 12.17 For the simple logistic model (Table 12.11): √ n( θ ML − θ) N(0, 3). n→∞
Example 12.18 For the simple Normal model (Table 12.9): √ 2 √ n( μML − μ) N(0, σ 2 ), n( σ ML − σ 2 ) N(0, 2σ 4 ). n→∞
n→∞
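A small seeded simulation can make Example 12.16 concrete. The sketch below (parameter values are arbitrary choices, not from the text) draws repeated exponential samples and checks that the scaled estimation error √n(θ̂_ML − θ) has mean near 0 and variance near θ².

```python
import random
import statistics

random.seed(12345)               # fixed seed; all settings illustrative
theta, n, reps = 2.0, 400, 2000

draws = []
for _ in range(reps):
    sample = [random.expovariate(1.0 / theta) for _ in range(n)]  # mean = theta
    theta_hat = sum(sample) / n                                   # MLE: sample mean
    draws.append(n ** 0.5 * (theta_hat - theta))

m = statistics.mean(draws)
v = statistics.pvariance(draws)
# Example 12.16 predicts sqrt(n)(theta_hat - theta) ~ N(0, theta^2) = N(0, 4)
assert abs(m) < 0.2
assert abs(v - theta ** 2) < 0.5
```

Note that `random.expovariate` takes the rate 1/θ, not the mean θ; the tolerances are wide relative to the Monte Carlo standard errors.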
12.2.5.3 Asymptotic Properties (Independent (I) but Non-ID Sample)
The above asymptotic properties need to be modified somewhat in the case where the sample is independent but non-identically distributed. In this case, the relationship between the individual observation Fisher information I(θ) and the sample Fisher information I_n(θ) takes the form

I_n(θ) = Σ_{k=1}^n I_k(θ), I_k(θ) = E[(d ln f(x_k; θ)/dθ)²].   (12.24)

For the above properties to hold, we need to impose certain restrictions on the asymptotic behavior of I_n(θ) (see Spanos, 1986, chapter 10), as given in Table 12.16.

Table 12.16  Regularity conditions for I_n(θ)
(a) lim_{n→∞} I_n(θ) = ∞
(b) There exists a sequence {c_n}_{n=1}^∞ such that lim_{n→∞} (1/c_n²) I_n(θ) = I∞(θ) > 0
The first condition ensures consistency, and the second ensures asymptotic Normality. Asymptotic Normality under these conditions takes the form c_n(θ̂_ML − θ) ∼ N(0, I∞⁻¹(θ)) as n → ∞.
Example 12.19 Consider a Poisson model with separable heterogeneity:

X_k ∼ PI(kθ), E(X_k) = Var(X_k) = kθ, k ∈ N, θ > 0,
f(x_k; θ) = e^{−kθ}(kθ)^{x_k}/x_k!, x_k ∈ {0, 1, 2, . . .}.

The likelihood function is

L(θ; x) = ∏_{k=1}^n (kθ)^{x_k} e^{−kθ}(1/x_k!) = [∏_{k=1}^n k^{x_k}/x_k!] exp[(Σ_{k=1}^n x_k) ln θ − θa_n]
⇒ ln L(θ; x) = const. + (Σ_{k=1}^n x_k) ln θ − θa_n, where a_n = Σ_{k=1}^n k = n(n+1)/2,
⇒ d ln L(θ; x)/dθ = (1/θ) Σ_{k=1}^n x_k − a_n = 0 ⇒ θ̂_ML = (1/a_n) Σ_{k=1}^n X_k,
E(θ̂_ML) = θ, Var(θ̂_ML) = (1/a_n²) Σ_{k=1}^n Var(X_k) = θ/a_n.

The question is whether, in addition to being unbiased, θ̂_ML is fully efficient:

I_n(θ) = E[−d² ln L(x; θ)/dθ²] = E[(1/θ²) Σ_{k=1}^n X_k] = (Σ_{k=1}^n kθ)/θ² = θa_n/θ² = a_n/θ.

Hence, C-R(θ) = θ/a_n = Var(θ̂_ML), and thus θ̂_ML is fully efficient. In terms of asymptotic properties, θ̂_ML is clearly consistent since Var(θ̂_ML) → 0 as n → ∞.

The asymptotic Normality is less obvious, but since (1/a_n) I_n(θ) → 1/θ as n → ∞, the scaling sequence is {√a_n}_{n=1}^∞:

√a_n(θ̂_ML − θ) ∼ N(0, θ) as n → ∞.

This, however, is not a satisfactory result, because the variance involves the unknown θ. A more general result that is often preferable is to use {√I_n(θ)}_{n=1}^∞ as the scaling sequence:

√I_n(θ)(θ̂_ML − θ) = (Y_n − θa_n)/√(θa_n) ∼ N(0, 1) as n → ∞, where Y_n = Σ_{k=1}^n X_k ∼ P(θa_n).
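The closed forms in Example 12.19 are easy to verify on made-up counts. In the sketch below (hypothetical data, chosen only to exercise the formulas), the first-order condition and the identity C-R(θ) = 1/I_n(θ) are checked numerically.

```python
# invented counts for X_k ~ Poisson(k * theta), k = 1..n (Example 12.19's setup)
x = [2, 3, 7, 9, 11]
n = len(x)
a_n = n * (n + 1) // 2                  # sum_{k=1}^n k = 15 for n = 5
theta_hat = sum(x) / a_n                # MLE: (1/a_n) * sum x_k

def score(theta):
    # d lnL/dtheta = (sum x_k)/theta - a_n
    return sum(x) / theta - a_n

assert abs(score(theta_hat)) < 1e-9     # first-order condition holds at the MLE

fisher_info = a_n / theta_hat           # I_n(theta) = a_n/theta at theta = theta_hat
crlb = theta_hat / a_n                  # Cramer-Rao bound theta/a_n = Var(theta_hat)
assert abs(fisher_info * crlb - 1.0) < 1e-12
```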
Example 12.20 Consider an independent Normal model with separable heterogeneity:

X_k ∼ NI(kμ, 1), f(x_k; θ) = (1/√(2π)) exp[−(x_k − kμ)²/2], k ∈ N, μ ∈ R, x ∈ R.

The distribution of the sample is

f(x; θ) = ∏_{k=1}^n (1/√(2π)) exp[−(x_k − kμ)²/2] = (1/√(2π))^n exp[−(1/2) Σ_{k=1}^n (x_k − kμ)²]
        = (1/√(2π))^n exp[−(1/2) Σ_{k=1}^n x_k²] exp[μ Σ_{k=1}^n kx_k − b_nμ²/2],

since (x_k − kμ)² = x_k² + k²μ² − 2kμx_k and b_n = Σ_{k=1}^n k² = n(n+1)(2n+1)/6, and thus ln L(μ; x) is

ln L(μ; x) = const. + μ Σ_{k=1}^n kx_k − b_nμ²/2
⇒ d ln L(μ; x)/dμ = Σ_{k=1}^n kx_k − μb_n = 0 ⇒ μ̂_ML = (1/b_n) Σ_{k=1}^n kX_k
⇒ E(μ̂_ML) = μ, Var(μ̂_ML) = (1/b_n²) Σ_{k=1}^n k² Var(X_k) = b_n/b_n² = 1/b_n
⇒ I_n(μ) = E[−d² ln L(μ; x)/dμ²] = b_n ⇒ C-R(μ) = 1/b_n = Var(μ̂_ML).

These results imply that μ̂_ML is unbiased, fully efficient, and consistent. In addition, since (1/b_n) I_n(μ) = 1 for all n:

√b_n(μ̂_ML − μ) ∼ N(0, 1) as n → ∞.
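The same kind of numerical check works for Example 12.20. The data below are invented purely to exercise the formulas μ̂_ML = (1/b_n) Σ kX_k and Var(μ̂_ML) = 1/b_n.

```python
# invented data for X_k ~ N(k*mu, 1), k = 1..n (Example 12.20's setup)
x = [0.9, 2.2, 2.8, 4.1, 5.2]
n = len(x)
b_n = n * (n + 1) * (2 * n + 1) // 6          # sum of k^2 = 55 for n = 5
s_kx = sum(k * xk for k, xk in enumerate(x, start=1))

mu_hat = s_kx / b_n                            # MLE: (1/b_n) * sum k*x_k

def score(mu):
    # d lnL/dmu = sum k*x_k - mu * b_n
    return s_kx - mu * b_n

assert abs(score(mu_hat)) < 1e-9               # first-order condition holds
assert b_n == 55                               # so Var(mu_hat) = 1/55 = C-R(mu)
```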
Summary of optimal properties of MLEs. The maximum likelihood method yields estimators which, under certain regularity conditions, enjoy all the optimal asymptotic properties (consistency, asymptotic Normality, asymptotic unbiasedness, and asymptotic efficiency) and, in addition, satisfy excellent finite sample properties, such as reparameterization invariance and sufficiency, as well as unbiasedness and full efficiency when these hold simultaneously.
12.2.6 The Maximum Likelihood Method and its Critics
The results relating to MLEs discussed above justify the wide acceptance of maximum likelihood as the method of choice for estimation purposes in frequentist statistics. It turns out that there are good reasons for the ML method to be preferred for testing purposes as well (see Chapter 14). Despite the wide acceptance of the ML method, there are also critics who point to several examples where the method does not yield satisfactory results. Such examples fall into cases where (a) the sample size is inappropriately small, (b) the regularity conditions do not hold, or (c) the postulated statistical model is problematic. The criticism in (a) is completely misplaced because the modeler is looking for the famous "free" lunch. As argued in Chapter 1, if the sample size is too small to enable the modeler to test the model assumptions adequately, it is too small for inference purposes. The criticism of the ML method based on examples which do not satisfy the regularity conditions is also somewhat misplaced, because when the modeler seeks methods with any generality the regularity conditions are inevitable. Without regularity conditions each estimation problem would be viewed as unique; no unifying principles would be possible. Category (c) deserves more discussion because the assumed statistical models are ill specified. From this category let us consider a widely discussed example.
Example 12.21 Neyman and Scott (1948) model. The statistical GM for this N-S model takes the form

Xt = μt + εt, t = 1, 2, . . . , n, . . . ,
where the underlying Normal distribution is of the form

Xt := (X1t, X2t)⊤ ∼ NI((μt, μt)⊤, diag(σ², σ²)), t = 1, 2, . . . , n, . . .   (12.25)
Note that this model is not well defined since it has an incidental parameter problem: the unknown parameters (μ1, μ2, . . . , μn) increase with the sample size n. Neyman and Scott attempted to sidestep this problem by declaring σ² the only parameter of interest and designating (μ1, μ2, . . . , μn) as nuisance parameters, which does not deal with the problem. Let us ignore the incidental parameter problem and proceed to derive the distribution of the sample and the log-likelihood function:

f(x; θ) = ∏_{t=1}^n ∏_{i=1}^2 (1/(σ√(2π))) exp[−(x_it − μt)²/(2σ²)] = ∏_{t=1}^n (1/(2πσ²)) exp[−(1/(2σ²))[(x1t − μt)² + (x2t − μt)²]];

ln L(θ; x) = −n ln σ² − (1/(2σ²)) Σ_{t=1}^n [(x1t − μt)² + (x2t − μt)²].   (12.26)
In light of (12.26), the "MLEs" are then derived by solving the first-order conditions

∂ ln L(θ; x)/∂μt = (1/σ²)[(x1t − μt) + (x2t − μt)] = 0 ⇒ μ̂t = (1/2)(X1t + X2t), t = 1, . . . , n;
∂ ln L(θ; x)/∂σ² = −n/σ² + (1/(2σ⁴)) Σ_{t=1}^n [(x1t − μ̂t)² + (x2t − μ̂t)²] = 0
⇒ σ̂² = (1/(2n)) Σ_{t=1}^n [(X1t − μ̂t)² + (X2t − μ̂t)²] = (1/n) Σ_{t=1}^n (X1t − X2t)²/4.   (12.27)

Critics of the ML method claim that ML yields inconsistent estimators, since

E(μ̂t) = μt, Var(μ̂t) = (1/2)σ² ↛ 0, E(σ̂²) = (1/2)σ², σ̂² →P (1/2)σ² ≠ σ² as n → ∞.
This, however, is a misplaced criticism, since by definition σ² = E(Xit − μt)², and thus any attempt to find a consistent estimator of σ² calls for a consistent estimator of μt, but μ̂t = (X1t + X2t)/2 is inconsistent. In light of that, the real question is not why the ML method does not yield a consistent estimator of σ², but, given that (12.25) is ill specified: Why would the ML method yield a consistent estimator of σ²? Indeed, the fact that the ML method does not yield consistent estimators in such cases is an argument in its favor, not against it! A modeler should be skeptical of any method of estimation that yields consistent estimators in the context of (12.25). The source of the problem is not the ML method but the statistical model in (12.25). Hence, one should focus on respecifying the ill-defined model with a view to finding an optimal estimator of σ², without the incidental parameter problem. This problem can be addressed by respecifying (12.25) using the transformation
Yt = (1/√2)(X1t − X2t) ∼ NIID(0, σ²), t = 1, 2, . . . , n, . . .   (12.28)
For the respecified model in (12.28), the MLE of σ² is σ̂²_ML = (1/n) Σ_{t=1}^n Yt², which is unbiased, fully efficient, and strongly consistent: E(σ̂²_ML) = σ², Var(σ̂²_ML) = 2σ⁴/n, σ̂²_ML →a.s. σ². The criticism in (c) relates to ill-specified models suffering from the incidental parameter problem or contrived constraints that give rise to unnatural reparameterizations imposed on the parameters at the outset; see Spanos (2010b, 2011a, 2012b, 2013a–d).
CAUTIONARY NOTE: When the ML method does not give rise to "optimal" estimators, one should first take a closer look at the assumed statistical model to verify that it is well specified before blaming the ML method. A statistical model Mθ(x) is well defined when all its parameters are relevant for statistical inference purposes, irrespective of whether some of them do not directly relate to the substantive questions of interest.
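The factor of two at the heart of the Neyman–Scott inconsistency is an exact algebraic identity: the estimator based on the respecified Yt is always twice the original σ̂². The sketch below (invented pairs, purely illustrative) makes this explicit.

```python
# invented paired observations (x1t, x2t), t = 1..n, for the N-S model
pairs = [(1.2, 0.7), (2.4, 3.1), (0.3, -0.5), (1.8, 1.1)]
n = len(pairs)

# the Neyman-Scott "MLE" from (12.27): (1/n) * sum (x1 - x2)^2 / 4
sigma2_ns = sum((a - b) ** 2 for a, b in pairs) / (4 * n)

# the respecified model (12.28): y_t = (x1 - x2)/sqrt(2), so estimate by mean of y_t^2
y = [(a - b) / 2 ** 0.5 for a, b in pairs]
sigma2_y = sum(v ** 2 for v in y) / n

# exactly a factor of 2 apart, mirroring E(sigma2_ns) = sigma^2/2 vs E(sigma2_y) = sigma^2
assert abs(sigma2_y - 2.0 * sigma2_ns) < 1e-12
```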
12.3 The Least-Squares Method
12.3.1 The Mathematical Principle of Least Squares
The principle of least squares was originally proposed as a mathematical approximation procedure by Legendre in 1805; see Harter (1974–6). In its simplest form the problem involves approximating an unknown function h(·): R_X → R_Y,

y = h(x), (x, y) ∈ (R_X × R_Y),

by selecting an approximating function, say linear,

g(x) = α0 + α1x, (x, y) ∈ (R_X × R_Y),

and fitting g(x) using data z0 := {(xt, yt), t = 1, 2, . . . , n}. This curve-fitting problem involves the approximation error εt = h(xt) − g(xt), giving rise to the problem of how to use data z0 to get the best approximation by fitting

yt = α0 + α1xt + εt, t = 1, 2, . . . , n.   (12.29)
The earliest attempt to address this problem was made by Boscovitch in 1757, who proposed (see Hald, 1998, 2007) the criterion

min_{α0,α1} Σ_{t=1}^n |εt| subject to Σ_{t=1}^n εt = 0,   (12.30)

using a purely geometric argument about its merits. In 1789 Laplace proposed an analytic solution to the minimization problem in (12.30) that was rather laborious to implement. In 1805 Legendre offered a less laborious solution to the approximation problem by replacing Σ_{t=1}^n |εt| with Σ_{t=1}^n εt², giving rise to the much easier minimization of the sum of squares (least squares) of the errors: min_{α0,α1} Σ_{t=1}^n εt². In the case of (12.29), the principle of least squares amounts to minimizing

ℓ(α0, α1) = Σ_{t=1}^n (yt − α0 − α1xt)².   (12.31)

The first-order conditions for a minimum, called the Normal equations, are

(i) ∂ℓ/∂α0 = (−2) Σ_{t=1}^n (yt − α0 − α1xt) = 0, (ii) ∂ℓ/∂α1 = (−2) Σ_{t=1}^n (yt − α0 − α1xt)xt = 0.
Fig. 12.3  Least-squares line fitting
Solving these two equations for (α0, α1) yields the least-squares estimates

α̂1 = Σ_{t=1}^n (yt − ȳ)(xt − x̄) / Σ_{t=1}^n (xt − x̄)², α̂0 = ȳ − α̂1x̄.   (12.32)
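The closed-form solutions in (12.32) can be sketched in a few lines. The tiny dataset below is invented (the book's Fig. 12.3 fit uses n = 200 real points), and the Normal equations are verified via the residual orthogonality conditions.

```python
# tiny invented dataset (the book's Fig. 12.3 example uses n = 200 real points)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

# (12.32): least-squares slope and intercept
num = sum((y - ybar) * (x - xbar) for x, y in zip(xs, ys))
den = sum((x - xbar) ** 2 for x in xs)
a1 = num / den
a0 = ybar - a1 * xbar

resid = [y - a0 - a1 * x for x, y in zip(xs, ys)]
s2 = sum(e ** 2 for e in resid) / (n - 2)                   # error variance, as in (12.34)
r2 = 1.0 - sum(e ** 2 for e in resid) / sum((y - ybar) ** 2 for y in ys)

# the Normal equations say the residuals are orthogonal to the constant and to x
assert abs(sum(resid)) < 1e-9
assert abs(sum(e * x for e, x in zip(resid, xs))) < 1e-9
```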
Example 12.22 The fitted line ŷt = α̂0 + α̂1xt through a scatterplot of data (n = 200) in Figure 12.3 is

ŷt = 1.105 + .809xt.   (12.33)
In addition to (12.33), one could construct goodness-of-fit measures

s² = (1/(n − 2)) Σ_{t=1}^n ε̂t² = .224, R² = 1 − Σ_{t=1}^n ε̂t² / Σ_{t=1}^n (yt − ȳ)² = .778.   (12.34)
As it stands, however, (12.33) and (12.34) provide no basis for inductive inference. The fitted line in (12.33) cannot be used as a basis for any form of statistical inference, because it has no inductive premises to provide measures of the uncertainty associated with (α̂0, α̂1). The above mathematical approximation perspective on curve fitting does not have any probabilistic premises stating the conditions under which the statistics α̂0, α̂1, s², R² are inferentially meaningful and reliable, as opposed to mathematically meaningful.
12.3.2 Least Squares as a Statistical Method
It is interesting to note that Legendre's initial justification for the least-squares method was that, for the simplest case where g(x) = μ, (x, y) ∈ (R × R),

Yt = μ + εt, t = 1, 2, . . . , n,   (12.35)

minimizing the sum of squared errors ℓ(μ) = Σ_{t=1}^n (Yt − μ)² yields

dℓ(μ)/dμ = (−2) Σ_{t=1}^n (Yt − μ) = 0,
giving rise to the arithmetic mean μ̂ = (1/n) Σ_{t=1}^n Yt. At that time, the arithmetic mean was considered to be the gold standard for summarizing the information contained in the n data points y1, y2, . . . , yn, unaware that this presumes that (Y1, . . . , Yn) are IID. The first probabilistic framing for least squares was given by Gauss (1809). He introduced the Normal distribution by arguing that for a sequence of n independent random variables Y1, Y2, . . . , Yn, whose density function f(yt) satisfies certain regularity conditions, if ȳ is the most probable combination for all values of y1, y2, . . . , yn and each n ≥ 1, then f(yt) is Normal; see Heyde and Seneta (1977, p. 63). This provided the missing probabilistic premises, and Gauss (1821) went on to prove an important result known today as the Gauss–Markov theorem.
Gauss–Markov theorem. Gauss supplemented the statistical GM (12.35) with the probabilistic assumptions

(i) E(εt) = 0, (ii) E(εt²) = σ² > 0, (iii) E(εt εs) = 0, t ≠ s, t, s = 1, 2, . . . , n,

and proved that under assumptions (i)–(iii) the least-squares estimator μ̂ = (1/n) Σ_{t=1}^n Yt is best (smallest variance) within the class of linear and unbiased estimators (BLUE).
Proof. Any linear estimator of μ will be of the form μ̃(w) = Σ_{t=1}^n wtYt, where w := (w1, w2, . . . , wn) denotes constant weights. For μ̃(w) to be unbiased it must be the case that Σ_{t=1}^n wt = 1, since E(μ̃(w)) = Σ_{t=1}^n wtE(Yt) = μ. This implies that the problem of minimizing Var(μ̃(w)) = σ² Σ_{t=1}^n wt² can be transformed into a Lagrange multiplier problem:

min_w L(w) = Σ_{t=1}^n wt² − 2λ(Σ_{t=1}^n wt − 1),

whose first-order conditions for a minimum yield

∂L(w)/∂wt = 2wt − 2λ = 0 ⇒ wt = λ, t = 1, 2, . . . , n; ∂L(w)/∂λ = −2(Σ_{t=1}^n wt − 1) = 0 ⇒ nλ = 1 ⇒ λ = 1/n.

This proves that μ̂ = (1/n) Σ_{t=1}^n Yt is the BLUE of μ.
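The Lagrange argument can be illustrated numerically: among weight vectors satisfying the unbiasedness constraint Σ wt = 1, the equal weights wt = 1/n minimize Σ wt², and hence the variance σ² Σ wt². The alternative weight vectors below are arbitrary examples.

```python
n = 10
equal = [1.0 / n] * n                     # the BLUE weights w_t = 1/n

def var_factor(w):
    # under (i)-(iii), Var(sum w_t * Y_t) = sigma^2 * sum w_t^2
    assert abs(sum(w) - 1.0) < 1e-9       # unbiasedness: weights sum to one
    return sum(wt ** 2 for wt in w)

# a few alternative unbiased weight vectors (arbitrary illustrative choices)
alt1 = [0.2] * 4 + [0.2 / 6] * 6
alt2 = [0.5, 0.5] + [0.0] * 8
alt3 = [2.0 * t / (n * (n + 1)) for t in range(1, n + 1)]   # increasing weights

for w in (alt1, alt2, alt3):
    assert var_factor(w) > var_factor(equal)   # every alternative does worse

assert abs(var_factor(equal) - 1.0 / n) < 1e-12
```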
As argued in Chapter 14, the Gauss–Markov theorem is of very limited value for "learning from data" because a BLUE estimator provides a very poor basis for inference, since the class of linear and unbiased estimators is unnecessarily narrow. For instance, in the case where the distribution of εt is Laplace (Appendix 3.A), the MLE of μ is the sample median m(Y) = Y_[(n+1)/2] for n odd, and its variance is smaller than that of μ̂; see Norton (1984). The Gauss–Markov theorem evades this problem because m(Y) is excluded from consideration for being a non-linear function of Y.
12.4 Moment Matching Principle
The moment matching principle cannot be credited to any one famous statistician because it grew out of a fundamental confusion between distribution and sample moments, a confusion shared by eminent statisticians, including Karl Pearson and Francis Ysidro Edgeworth (1845–1926). In his
classic paper that founded modern statistical inference, Fisher (1922a, p. 311) pointed out in his opening paragraph a confusion that hindered progress in statistics, between an unknown parameter θ, its estimator θ̂_n(X), and the estimate θ̂_n(x0), as follows:
it has happened that in statistics a purely verbal confusion has hindered the distinct formulation of statistical problems; for it is customary to apply the same name, mean, standard deviation, correlation coefficient, etc. both to the true value which we should like to know, but can only estimate, and the particular value at which we happen to arrive by our methods of estimation.
Fisher pointed to a confusion between three different concepts: the moment of a probability distribution, its estimator, and the corresponding estimate based on a specific sample realization. The confusion was brought about by the use of the same term for all three different notions.

Table 12.17  Parameters vs. estimators vs. estimates

Terms           | Probability                        | Sample (X)                              | Data (x0)
Mean            | μ1′ = ∫_{R_X} x f(x)dx              | μ̂1′(X) = (1/n) Σ_{i=1}^n Xi := X̄        | μ̂1′(x0) = (1/n) Σ_{i=1}^n xi
Variance        | μ2 = ∫_{R_X} (x − μ1′)² f(x)dx      | μ̂2(X) = (1/n) Σ_{i=1}^n (Xi − X̄)²       | μ̂2(x0) = (1/n) Σ_{i=1}^n (xi − x̄)²
Raw moments     | μr′ = ∫_{R_X} x^r f(x)dx            | μ̂r′(X) = (1/n) Σ_{i=1}^n Xi^r            | μ̂r′(x0) = (1/n) Σ_{i=1}^n xi^r, r = 1, 2, . . .
Central moments | μr = ∫_{R_X} (x − μ1′)^r f(x)dx     | μ̂r(X) = (1/n) Σ_{i=1}^n (Xi − X̄)^r      | μ̂r(x0) = (1/n) Σ_{i=1}^n (xi − x̄)^r, r = 2, 3, . . .
n n r μr (X) = 1n (Xi − X)r μr (x0 ) = 1n (xi − x)r , r = 2 . . . Central μr = (x − μ1 ) f (x)dx i=1 i=1 moments: x∈RX
Table 12.17 presents three very different groups of concepts assigned the same terms. The confusion between the various uses of the term “moments” is compounded by the fact that in statistical inference we often talk about the moments of the sample moments. In an attempt to
deal with that difficulty we utilize the notation μr (.), μr (.) , which enables us to be specific on whose moments we are referring to when it is not obvious from the context. Hence, the notation μr (X), r = 1, 2, 3, . . . , denotes the raw moments of the sampling distribution of the sample mean. It should be noted that Fisher’s attempt to introduce new terminology to distinguish between these different concepts annoyed the high priest of statistics, who interpreted it as misplaced pedantry; see Section 1.1 (Chapter 1). During the eighteenth and nineteenth centuries the endemic practice of conflating distribution and sample moments gave rise to what we call today the moment matching (MM) principle: construct estimators by equating distribution moments with sample moments. The moment matching principle is implemented in two steps:
538
Estimation II: Methods of Estimation
Step 1. Relate the unknown parameter θ to the moments of the distribution in terms of which the probability model is specified, say θ i = gi (μ1 , μ2 ), i=1, 2. Step 2. Substitute the sample moments in place of the distribution moments μ1 = n 1 n 1 μ2 = n t=1 Xt2 , to construct the moment estimators of (θ 1 , θ 2 ) via t=1 Xt , n θ 1 = g1 ( μ1 , μ2 ), θ 2 = g2 ( μ1 , μ2 ). It is worth noting that this procedure is the reverse of that used for the method of moments, discussed next, where one begins with the relationship between the moments and the unknown parameters, say μ1 = h1 (θ 1 , θ 2 ), μ2 = h2 (θ 1 , θ 2 ); the sample moments are then substituted in place of (μ1 , μ2 ) and solved for (θ 1 , θ 2 ) to define their estimators. Example 12.23 For the simple Bernoulli model (Table 12.4), the unknown parameter θ is directly related to the mean of X: E(X) = θ , and thus the moment matching principle suggests that a natural estimator for θ is θ = 1n nt=1 Xt . Example 12.24 For the simple Normal model (Table 12.6), the unknown parameters θ:=(μ, σ 2 ) are related to the mean and variance of X, respectively: E(X) = μ, Var(X) = σ 2 . The MM principle proposes the obvious estimators of these parameters, i.e. μ)2 . μ = 1n nt=1 Xt , σˆ 2 = 1n nti=1 (Xt − Example 12.25 For the Normal linear regression model (Table 7.7), the unknown parameters θ:=(β 0 , β 1 , σ 2 ) are related to the moments of the bivariate distribution f (x, y; ϕ) via the following parameterizations:
2 t ,Yt ) t ,Yt )] ∈R, σ 2 = Var(Yt ) − [Cov(X β 0 = E(Yt ) − β 1 E(Yt ) ∈R, β 1 = Cov(X Var(Xt ) Var(Xt ) ∈R+ . By substituting the corresponding sample moments in place of the distribution moments, we get the following MM principle estimators: 1 n (Yt − Y)(xt − x¯ ) β 1 = n 1t=1 , β 0 = Y − βˆ 1 x¯ , n ¯ )2 t=1 (xt − x n 2 n 1 (Yt − Y)(xt − x¯ ) ;n t=1 n 1 σ 2MM = (Yt − Y)2 − . (12.36) 1 n t=1 n ¯ )2 t=1 (xt − x n The above examples give an overly optimistic picture of when the MM principle gives rise to optimal estimators, because of the particular properties of the Bernoulli and Normal distributions. It turns out that for many other distributions, the MM principle does not yield such optimal estimators.
12.4 Moment Matching Principle
Recall that in Example 12.3 the first two moments of the sample mean θn = an estimator of E(X) = θ in the case where Xk UIID(θ − .5, θ +.5) are 1 θ n = 1n ni=1 Xi : E( θ n ) = θ and Var( θ n ) = 12n ,
1 n
539
n
i=1 Xi
as
but the same moments for θ ML (X) = (X[n] + X[1]) /2 are θ ML (X) =
X[n] +X[1] : 2
E( θ ML (X)) = θ and Var( θ ML (X)) =
1 2(n+1)(n+2) .
Let us have a closer look at the sample moments and their properties in general.
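The two variance formulas recalled above can be compared directly. The small sketch below checks that the midrange estimator dominates the sample mean for n ≥ 3 (at n = 2 the two variances coincide at 1/24).

```python
def var_mean(n):
    # Var of the sample mean when X_k ~ UIID(theta - .5, theta + .5)
    return 1.0 / (12 * n)

def var_midrange(n):
    # Var of the MLE (X_[1] + X_[n]) / 2 for the same model
    return 1.0 / (2 * (n + 1) * (n + 2))

assert var_mean(2) == var_midrange(2)          # the two coincide at n = 2
for n in (3, 10, 100, 1000):
    assert var_midrange(n) < var_mean(n)       # midrange dominates for n >= 3

ratio = var_mean(10) / var_midrange(10)        # 264/120 = 2.2 at n = 10
assert abs(ratio - 2.2) < 1e-9
```

The ratio grows roughly linearly in n, reflecting the midrange's O(n⁻²) variance against the mean's O(n⁻¹).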
12.4.1 Sample Moments and their Properties
For a random variable X, the raw and central moments are defined by

μr′(θ) = ∫_{R_X} x^r f(x; θ)dx, r = 1, 2, . . . , and μr(θ) = ∫_{R_X} (x − μ)^r f(x; θ)dx, r = 2, 3, . . . ,

with the corresponding sample moments being

μ̂r′ = (1/n) Σ_{t=1}^n Xt^r, r = 1, 2, 3, . . . , and μ̂r = (1/n) Σ_{t=1}^n (Xt − μ̂)^r, r = 2, 3, . . .
For two jointly distributed random variables (X, Y), the raw and central moments are

μr,s′(θ) = ∫_{R_X}∫_{R_Y} x^r y^s f(x, y; θ)dxdy, r, s = 1, 2, 3, . . . ,
μr,s(θ) = ∫_{R_X}∫_{R_Y} (x − μx)^r (y − μy)^s f(x, y; θ)dxdy, r, s = 1, 2, 3, . . . ,

with the corresponding sample joint raw and central moments being

μ̂r,s′ = (1/n) Σ_{t=1}^n Xt^r Yt^s, μ̂r,s = (1/n) Σ_{t=1}^n (Xt − μ̂x)^r (Yt − μ̂y)^s, r, s = 1, 2, 3, . . .

Of particular interest in the present context are the sampling distributions of the above sample moments and their properties. In general, the distribution of any sample moment depends crucially on the probability and sampling models postulated. In practice, however, non-parametric modeling makes indirect (not direct) distribution assumptions pertaining to the existence of the first few moments and other smoothness (differentiability, unimodality, etc.) conditions for the unknown density function; see Chapter 10. What is of particular interest in the present context is to explore the extent to which the absence of a distributional assumption affects the various sampling distributions of interest. To simplify the expressions involved, we focus on the simple case where the sample X := (X1, X2, . . . , Xn) is assumed to be IID.

Table 12.18
Sample raw moments and their first two moments

E(μ̂r′) = μr′, r = 1, 2, 3, . . .   [unbiased]
Var(μ̂r′) = (1/n)[μ2r′ − (μr′)²], r = 1, 2, . . .   [moments up to 2r]
Cov(μ̂r′, μ̂s′) = (1/n)[μr+s′ − μr′μs′], r, s = 1, 2, . . .   [moments up to r + s]
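The first two rows of Table 12.18 can be verified exactly by brute-force enumeration over a small discrete distribution (the distribution and the sample size below are invented for the check).

```python
from itertools import product

# a small discrete distribution; values, probabilities and n invented for the check
vals  = [0.0, 1.0, 2.0]
probs = [0.5, 0.3, 0.2]
n = 3                                   # IID sample size

def raw(r):
    # distribution raw moment mu'_r = E(X^r)
    return sum(p * v ** r for v, p in zip(vals, probs))

def sample_raw_moment_stats(r):
    # exact E and Var of the sample raw moment, by enumerating all 3^n samples
    mean_, second_ = 0.0, 0.0
    for idx in product(range(len(vals)), repeat=n):
        p = 1.0
        for i in idx:
            p *= probs[i]
        m = sum(vals[i] ** r for i in idx) / n   # sample raw moment of this sample
        mean_ += p * m
        second_ += p * m ** 2
    return mean_, second_ - mean_ ** 2

e1, v1 = sample_raw_moment_stats(1)
assert abs(e1 - raw(1)) < 1e-12                        # unbiased: E = mu'_1
assert abs(v1 - (raw(2) - raw(1) ** 2) / n) < 1e-12    # Var = (mu'_2 - mu'_1^2)/n
```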
The results in Table 12.18 suggest that, in the case of a random sample, irrespective of the underlying distribution (assuming the required moments exist), the sample raw moments provide unbiased and consistent estimators for the corresponding distribution raw moments. Consistency follows from the fact that the variance of the sample raw moments Var(μ̂r′) goes to zero as n → ∞. The same table for the central moments seems a lot more tangled, in the sense that one needs to approximate these moments; hence the last terms, which denote the order of approximation using the notation o(n^k) and O(n^k). The notation an = o(n^k), for k ≠ 0, denotes a sequence {an}_{n=1}^∞ of order smaller than n^k, and an = O(n^k) denotes a sequence {an}_{n=1}^∞ at most of order n^k:

an = o(n^k): lim_{n→∞} (an/n^k) = 0; an = O(n^k): lim_{n→∞} |an/n^k| ≤ K, 0 < K < ∞.

μ̃_PMM = (1/n) Σ_{t=1}^n ln Xt. This example brings out the non-invariance of the PMM estimator to reparameterizations, since the PMM estimators θ̃_PMM := (μ̃_PMM, σ̃²_PMM) are very different from θ̂_PMM := (μ̂_PMM, σ̂²_PMM).
Example 12.30 Consider the simple uniform model

Mθ(x): Xt ∼ UIID(0, θ), 0 < xk < θ, θ > 0, k ∈ N
(12.41)

whose density function and first two moments are

f(x; θ) = 1/θ, 0 < x < θ, E(Xt) = θ/2, Var(Xt) = θ²/12.
12.5.3 Properties of PMM Estimators
In general, the only optimal properties that PMM estimators enjoy are asymptotic. As shown above, in the case of a random sample (X1, X2, . . . , Xn), the sample raw moments μ̂r′ = (1/n) Σ_{t=1}^n Xt^r, r = 1, 2, . . . , are consistent estimators of the distribution raw moments (assuming they exist): μ̂r′ →P μr′. In the case where μr′(θ1, θ2, . . . , θk) is a well-behaved (Borel) function of the θi s, we can deduce that for the PMM estimators θ̂_PMM := (θ̂1, θ̂2, . . . , θ̂k), where θ̂i := θ̂i(μ̂1′, μ̂2′, . . . , μ̂k′), i = 1, 2, . . . , k:

θ̂_PMM →P θ and √n(θ̂_PMM − θ) ∼ N(0, V∞(θ)) as n → ∞,
but these estimators are not necessarily asymptotically efficient. The question of the optimal properties of PMM estimators goes back to the 1920s. In his first statistics paper, Fisher (1912) criticized Pearson's method of moments for its arbitrariness in using the first k sample raw moments to estimate k unknown parameters θ := (θ1, θ2, . . . , θk). He continued the criticisms in a series of papers (Fisher, 1922b, 1924a, b, 1937), where he compared the optimality of MM estimators to that of MLEs, making a strong case that the method of moments gives rise to inefficient and parameterization non-invariant estimators except in cases where the assumed distribution is close to the Normal; see Gorroochurn (2016). Fisher used the following example to call into question Pearson's MM.
Example 12.32 Consider the simple Cauchy model with density function

f(x; θ) = 1/(π[1 + (x − θ)²]), θ ∈ R, x ∈ R,
that belongs to the Pearson family. Given that this density is bell-shaped symmetric and θ is a location parameter, the MM would estimate it using the sample mean X = 1n nt=1 Xt , which is an inconsistent estimator of θ; see Fisher (1922a, p. 322). In contrast, the ML method gives rise to an asymptotically optimal estimator. Karl Pearson mounted several spirited replies to Fisher’s criticisms of his MM, but lost the argument. He did not realize that Fisher had changed statistics from descriptive statistics (using the data to choose a descriptive model) to model-based statistical inference (prespecifying a statistical model (12.40), and using the data to estimate the parameters of this model). As argued above, the method of maximum likelihood is tailor-made for model-based statistical inference, but it is practically useless for Pearson’s approach since one needs the model at the outset to define the likelihood function. The PMM often gives rise to less efficient estimators because it does not utilize all the information contained in the prespecified statistical model. It utilizes only the part of the information pertaining to the first few moments and, as argued in Chapter 3, knowing the first few moments is not equivalent to knowing the distribution itself. The latter implies knowing all the moments and more. For instance, assuming that the distribution is LogNormal we know that even an infinite number of moments will not characterize this distribution, let alone the first two; see Chapter 3.
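Fisher's point in Example 12.32 is easy to see in a seeded simulation (settings invented for illustration): the sample median, unlike the sample mean, settles near the Cauchy location parameter.

```python
import math
import random
import statistics

random.seed(7)                 # fixed seed; settings purely illustrative
theta, n = 0.0, 50_000

# Cauchy draws around theta via the probability integral transform
sample = [theta + math.tan(math.pi * (random.random() - 0.5)) for _ in range(n)]

med = statistics.median(sample)    # consistent estimator of the location theta
mean = statistics.fmean(sample)    # the sample mean has no tendency to settle

# asymptotic std of the median is (pi/2)/sqrt(n) ~ 0.007, so 0.1 is a wide margin
assert abs(med - theta) < 0.1
```

No bound is asserted for `mean`: its distribution is itself Cauchy, regardless of n, which is exactly why the MM estimator based on it is inconsistent.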
12.6 Summary and Conclusions
1. Maximum likelihood method. The ML method is tailor-made for frequentist estimation because the likelihood function contains all the probabilistic information comprising the statistical model Mθ(x) = {f(x; θ), θ ∈ Θ}, x ∈ R^n_X, since it is defined as being proportional to the distribution of the sample: L(θ; x0) ∝ f(x0; θ), for all θ ∈ Θ. The property of sufficiency for MLEs often ensures optimal finite sample properties, and under certain regularity conditions MLEs enjoy all the optimal asymptotic properties. These optimal properties justify the wide acceptance of ML as the method of choice for estimation purposes. Critics of the ML method often use problematic examples that range from cases where (a) the sample size n is too small, to (b) the regularity conditions do not hold, and (c) the postulated model is not well defined. The rule of thumb for how large n should be is: if n is too small to test the model assumptions using comprehensive misspecification testing, it is too small for inference purposes! Does the superiority of the ML method imply that the other methods of estimation are redundant? The answer is that the other methods have something to contribute by supplementing and shedding additional light on the ML method.
2. Method of least squares. The LS procedure provides additional insight into the ML estimation of statistical models based on the Normal distribution. The additional insight stems from the geometry of fitting a line to a scatterplot. Beyond that, the LS method can be very misleading in practice. A closer look at the Gauss–Markov theorem reveals that its results are of very limited value for inference purposes! One needs to invoke asymptotics for inference purposes; see Chapters 9 and 14.
3. Moment matching principle. This is not a fully fledged method of estimation, but it can be used to provide additional intuition and insight into other estimation methods, including the ML method. 4. Parametric method of moments. This estimation method is clearly problematic because it does not utilize all the systematic information included in the statistical model. The same comments pertaining to invoking asymptotics for inference purposes apply to this method as well. Its real value is to provide respectable initial estimates in the context of numerical optimization for MLEs. Additional references: Stuart et al. (1999), Pawitan (2001), Severini (2000).
Important Concepts
Method of maximum likelihood, least-squares method, moment matching principle, Pearson's method of moments, parametric method of moments, maximum likelihood estimator, regular statistical models, score function, Fisher information, Kullback–Leibler distance, parameterization invariance, Gauss–Markov theorem, sample moments and their sampling distributions.
Crucial Distinctions
Pearson's vs. parametric method of moments, distribution of the sample vs. likelihood function, nuisance parameters vs. parameters of interest, least squares as mathematical approximation vs. statistical estimation procedure, Gauss–Markov theorem, parameters vs. estimators vs. estimates, sampling distributions of sample moments under IID vs. NIID.
Essential Ideas
● The method of maximum likelihood is custom-made for parametric inference, and delivers the most optimal estimators for regular statistical models.
● The least-squares method is an adaptation of a numerical approximation method that adds geometric intuition to estimation in certain cases, but very little else.
● The moment matching principle is the result of a major confusion in statistics, initially brought out by Fisher (1922a), but in some cases (Normal, Bernoulli) it delivers good estimators.
● The parametric method of moments is an anachronistic interpretation of Karl Pearson's method of moments, which was designed for a very different approach to statistical modeling and inference.
● Reliance on the asymptotic sampling distributions of sample moments without a distribution assumption often gives rise to highly imprecise and potentially unreliable inferences. One is always better off assuming an explicit distribution and testing it rather than being agnostic.
● In model-based inference all the parameters of the prespecified statistical model are relevant for inference purposes, irrespective of whether only a subset of them relate to the substantive questions of interest. There are no nuisance parameters for statistical inference purposes.
● Securing statistical adequacy requires a well-defined statistical model Mθ(x) whose assumptions can be validated vis-à-vis the data. Hence, 'concentrating' the likelihood function with respect to 'nuisance' parameters (Pawitan, 2001) should be performed after the statistical adequacy of Mθ(x) is secured.
12.7 Questions and Exercises
1. (a) Explain the concept of the likelihood function as it relates to the distribution of the sample. (b) Explain why the likelihood function does not assign probabilities to the unknown parameter(s) θ. (c) Explain how one can derive the maximum likelihood estimators in the case where the likelihood function is twice differentiable.
2. (a) Explain how the likelihood function ensures learning from data as n → ∞. (b) Explain why the identity below is mathematically correct but probabilistically questionable:

E[(∂ ln f(x; θ)/∂θ)²] = E[(∂ ln L(θ; x)/∂θ)²];
explain how one can remedy that. (c) Explain the difference between (i) Fisher’s sample and individual observation information and (ii) Fisher’s information and observed information matrix. 3. (a) Define the concept of the score function and explain its connection to Fisher’s information. (b) Explain how the score function relates to both sufficiency and the Cramer–Rao lower bound. 4. (a) State and explain the finite sample properties of MLEs under the usual regularity conditions. (b) State and explain the asymptotic properties of MLEs under the usual regularity conditions. (c) Explain why an asymptotically Normal MLE is also asymptotically unbiased. 5. Consider the simple Normal statistical model (Table 12.9). (a) Derive the MLEs of (μ, σ 2 ) and state their sampling distributions. (b) Derive the least-squares estimators of (μ, σ 2 ) without the Normality assumption [1] and their sampling distributions. (c) Compare these estimators in terms of the optimal properties of (i) unbiasedness, (ii) full efficiency, (iii) asymptotic Normality, and (iv) consistency. 6. (a) The maximum likelihood method is often criticized for the fact that for a very small sample size, say n = 5, MLEs are not very reliable. Discuss.
Estimation II: Methods of Estimation
(b) Explain why the fact that the ML method gives rise to inconsistent estimators in the case of the Neyman–Scott (1948) model constitutes a strength of the method and not a weakness. (c) Use your answer in (b) to explain how the original model can be rendered well specified.
7. (a) Explain why least squares as a mathematical approximation method provides no basis for statistical inference. (b) Explain how Gauss transformed the least-squares method into a statistical estimation procedure.
8. (a) State and explain the Gauss–Markov theorem, emphasizing its scope. (b) Discuss how one can use the Gauss–Markov theorem to test the significance of the slope coefficient. (c) Explain why the least-squares method relates to the Normal distribution, and why in the case where the error term has a Laplace or a uniform distribution the method is questionable.
9. (a) Explain the moment matching principle as an estimation method. (b) Use the MM principle to derive the MM estimator of θ in the case of the simple Laplace statistical model with density function f(x; θ) = (1/(2θ))e^(−|x|/θ), θ > 0, x∈R.
10. (a) Explain how the sample raw moments provide consistent estimators for the distribution moments, but their finite sample properties invoke the existence of much higher moments that (i) are often difficult to justify and (ii) would introduce major imprecision in the resulting inference from their estimation. (b) Compare the first few moments of the sample mean from a generic IID sample (no distributional assumption) with those of the sample variance. (c) Explain how the variances of the sample mean, variance, standard deviation, third and fourth central moments, as well as the skewness and kurtosis coefficients, simplify substantially and are rendered considerably more precise when a distributional assumption such as Normality is invoked.
11. (a) Explain why it is anachronistic to compare the maximum likelihood method to the parametric method of moments.
(b) Compare and contrast Pearson's method of moments with the parametric method of moments. (c) Consider the simple Pareto statistical model with density f(x; θ) = θ x0^θ x^(−(θ+1)), θ > 0, x > x0 > 0, where x0 is a known lower bound for X. Derive the MLE of θ and compare it with the PMM and least-squares estimators.
12. (a) Explain why the notion of nuisance vs. parameters of interest is problematic in empirical modeling.
(b) Using your answer in (a), explain why eliminating nuisance parameters distorts the original statistical model in ways that might be very difficult to test for statistical misspecification.
13*. For the Normal model with separable heterogeneity, Xt ∼ NI(μt, σ²), t∈N:
(a) Derive the MLEs for the parameters (μ, σ²) and compare your μ̂_ML with the results of Example 12.20.
(b) Derive the Cramer–Rao bound and evaluate whether the MLEs of (μ, σ²) are fully efficient.
(c) Derive the asymptotic Normal distribution for μ̂_ML and σ̂²_ML.
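Several of the exercises above (e.g. 5 and 13*) involve Normal MLEs. The closed-form answers can be checked numerically; the sketch below uses simulated data (true values μ = 10, σ = 2 are assumptions for illustration) and verifies that the sample mean and the divide-by-n sample variance do maximize the log-likelihood.

```python
import math
import random

def normal_loglik(mu, sigma2, xs):
    # log-likelihood of an IID N(mu, sigma2) sample
    return (-len(xs)/2)*math.log(2*math.pi*sigma2) \
           - sum((x - mu)**2 for x in xs)/(2*sigma2)

random.seed(1)
xs = [random.gauss(10.0, 2.0) for _ in range(500)]   # hypothetical data

mu_hat = sum(xs)/len(xs)                              # MLE of mu
s2_hat = sum((x - mu_hat)**2 for x in xs)/len(xs)     # MLE of sigma^2 (divide by n)

# nudging either estimate away from the closed form lowers the log-likelihood
base = normal_loglik(mu_hat, s2_hat, xs)
assert base > normal_loglik(mu_hat + 0.05, s2_hat, xs)
assert base > normal_loglik(mu_hat, 1.05*s2_hat, xs)
```

The same pattern (perturb the candidate maximizer and confirm the likelihood drops) is a useful sanity check for any closed-form MLE derivation.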
Appendix 12.A: Karl Pearson's Approach

The Pearson family of frequency curves can be expressed in terms of the following differential equation in four unknown parameters:

d ln f(x)/dx = (x − θ1)/(θ2 + θ3x + θ4x²).   (12.A.1)
In the context of probability theory we have seen that we can relate the unknown parameters, say (θ1, θ2, θ3, θ4), to the moments of a given density function f(x; θ1, θ2, θ3, θ4) (see Chapter 3) via

μr(θ1, θ2, θ3, θ4) = ∫_{x∈R_X} x^r f(x; θ1, θ2, θ3, θ4)dx, for r = 1, 2, 3, 4.

One can relate the Pearson family to the raw moments by integrating both sides:

∫_{x∈R_X} x^r (θ2 + θ3x + θ4x²) df(x) = ∫_{x∈R_X} x^r (x − θ1) f(x)dx, for r = 1, 2, 3, 4,

to get the following recursive relationship among the moments and the parameters:

kμ_{k−1}θ2 + {(k+1)θ3 − θ1}μ_k + {(k+2)θ4 + 1}μ_{k+1} = 0, k = 0, 1, 2, . . . ,

where μ0 = 1. The first four raw data moments μ̂1, μ̂2, μ̂3, μ̂4 are sufficient to determine (θ1, θ2, θ3, θ4), and thus one can select the particular f(x) from the Pearson family by solving the first four equations, for k = 0, 1, 2, 3:

(θ̂3 − θ̂1) + (2θ̂4 + 1)μ̂1 = 0
θ̂2 + (2θ̂3 − θ̂1)μ̂1 + (3θ̂4 + 1)μ̂2 = 0
2μ̂1θ̂2 + (3θ̂3 − θ̂1)μ̂2 + (4θ̂4 + 1)μ̂3 = 0
3μ̂2θ̂2 + (4θ̂3 − θ̂1)μ̂3 + (5θ̂4 + 1)μ̂4 = 0

⇒ θ̂i = hi(μ̂1, μ̂2, μ̂3, μ̂4), i = 1, 2, 3, 4.

The solution θ̂ := (θ̂1, θ̂2, θ̂3, θ̂4) of these equations would deal with two different problems simultaneously:
(a) Specification: the choice of a frequency curve f(x; θ̂) (among the Pearson types I–XII, Table 12.A.1) on the basis of the particular values θ̂.
(b) Estimation of the unknown parameters θ̂ := (θ̂1, θ̂2, θ̂3, θ̂4).
Table 12.A.1 Pearson family of frequency curves

Beta (type I)
Gamma (type III)
Beta prime (type VI)
Inverse-gamma (type V)
Cauchy (type VII)
F-distribution (type VI)
Chi-square (type III)
Inverse-chi-square (type V)
Uniform (limit of type I)
Normal (limit of types I, III, IV, V, VI)
Exponential (type III)
Student's t-distribution (type VII)
The Pearson-type densities are most effectively characterized by two quantities, commonly referred to as β1 and β2:

β1 = α3² = μ3²/σ⁶, where α3 = μ3/σ³ is the skewness, and β2 = α4 = μ4/σ⁴, where α4 is the kurtosis.
Example 12.A.1 The values β1 = 0, β2 = 3 imply that f(x; θ) is the Normal. Other distributions belonging to the Pearson family are listed in Table 12.A.1.
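The selection quantities β1 and β2 are straightforward to estimate from data. The sketch below (simulated Normal data; the seed and sample size are illustrative assumptions) computes them from the sample central moments and confirms that Normal data land near the point (β1, β2) = (0, 3).

```python
import math
import random

def pearson_betas(xs):
    """Return (beta1, beta2) = (skewness^2, kurtosis) from sample central moments."""
    n = len(xs)
    m = sum(xs)/n
    mu2 = sum((x - m)**2 for x in xs)/n   # sigma^2
    mu3 = sum((x - m)**3 for x in xs)/n
    mu4 = sum((x - m)**4 for x in xs)/n
    return mu3**2/mu2**3, mu4/mu2**2      # beta1 = alpha3^2, beta2 = alpha4

random.seed(0)
b1, b2 = pearson_betas([random.gauss(0.0, 1.0) for _ in range(100_000)])
# Normal data: expect b1 close to 0 and b2 close to 3
```

In Karl Pearson's scheme, the estimated pair (b1, b2) would then be located on the (β1, β2) chart to pick the appropriate family member.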
13 Hypothesis Testing
13.1 Introduction

13.1.1 Difficulties in Mastering Statistical Testing

Statistical testing is arguably the most difficult, confusing, and confused chapter in statistical inference, for a variety of reasons, including the following.

Inherent difficulties.
(a) There is a need to introduce numerous new notions, concepts, and procedures before one can articulate a coherent picture of what constitutes a statistical test. This makes it very easy for readers to miss the forest for the trees and focus on the superficial elements of testing, such as the formulae and the probability tables.
(b) Testing is conceptually sophisticated due to the nature and role of the underlying hypothetical reasoning, which is often insufficiently appreciated.
(c) The framing of frequentist testing bequeathed by the three pioneers Fisher, Neyman, and Pearson was incomplete, in the sense that there was no coherent evidential account pertaining to the results beyond p-values and accepting/rejecting the null rules. As a result, the subsequent discussions of frequentist testing have been beleaguered by serious foundational problems. Worse, different applied fields have generated their own secondary literatures attempting to address these foundational problems, but often making matters worse!

Extraneous difficulties.
(d) The Fisherian and Neyman–Pearsonian (N-P) variants of frequentist testing are often misrepresented using oversimplified framings that neglect the key role of the prespecified statistical model. In addition, the two variants are often presented as essentially incompatible, such that any attempt to blend them will result in an inconsistent hybrid which is "burdened with conceptual confusion" (Gigerenzer, 1993, p. 323). Worse, N-P testing is often viewed as decision-theoretic in nature, when in fact that view distorts the underlying reasoning and misrepresents frequentist testing; see Spanos (2017b).
(e) Statistics cuts across numerous disciplines, providing the framework for empirical modeling and inference, and textbook writers in different disciplines often rely on second-hand accounts from past textbooks in their discipline. The resulting narratives often ignore key components of inference, including the underlying statistical model and the reasoning associated with different inference procedures. As a result, their discussion often amounts to an idiot's guide to combining off-the-shelf formulae with statistical tables to yield tabular asterisks ***, **, * indicating statistical significance.
(f) Unfortunately, there is an attempt by some Bayesian authors to distort and cannibalize frequentist testing in a misguided attempt to motivate their own preferred viewpoint on statistical inference. Manifest signs of such distorted preaching include (i) undue emphasis on "long-run" frequencies as probabilities, (ii) loss functions as indispensable tools, (iii) admissibility as a minimal property of estimators, as well as cannibalization of frequentist concepts, such as (iv) using non-regular statistical models, (v) misinterpreting "error probabilities" as being conditional on θ, (vi) distorting the p-value to resemble a posterior probability assigned to the null, (vii) misinterpreting confidence intervals as Bayesian credible intervals, and (viii) assigning prior probabilities to "true" nulls and alternatives (whatever that means); reader beware!
In an attempt to alleviate problems (a)–(c), the discussion that follows uses a sketchy historical development of frequentist testing. To ameliorate problems (d)–(f), the discussion includes pointers that aim to highlight crucial counterarguments shedding light on certain erroneous interpretations or misleading claims. The discussion will pay special attention to (c), with a view to addressing certain key foundational problems that were left unresolved by the protagonists. It is argued that the original Fisher–Neyman–Pearson framing needs to be refined and extended into a broader error-statistical framing that includes an evidential component beyond p-values and accept/reject H0 results. This framing aims to address several crucial foundational questions raised by the testing literature, including:

(i) How did Neyman and Pearson modify the Fisher testing framework to provide an optimal theory of testing?
(ii) When do data x0 provide evidence for or against a hypothesis or claim?
(iii) What are the weaknesses of the p-value as evidence for or against the null?
(iv) Is there a post-data coherent evidential interpretation for the p-value and the accept/reject results in frequentist testing?
(v) Are confidence intervals less vulnerable to misinterpretations than p-values?
(vi) How can one distinguish between statistical and substantive significance?
(vii) Is N-P testing vulnerable to ad hoc choices and unwarranted manipulations?
A bird's-eye view of the chapter. Section 13.2 begins with a brief discussion of the pre-Fisher approach to statistical testing, in an attempt to place the ideas of the main protagonists Fisher, Neyman, and Pearson in a proper historical perspective. Section 13.3 discusses Fisher's recasting of frequentist inference and focuses on his theory of significance testing. Section 13.4 discusses the Neyman–Pearson concerted effort to improve upon Fisher's approach, with a view to bringing out the shared features as well as their differences. In Section 13.5 a case is made for a broader framing of frequentist testing, known as the error-statistical approach, which constitutes a refinement/extension of both approaches with a view to achieving a harmonious blending of the two. The error-statistical framing is used to address several well-known foundational issues, including the above-mentioned questions (i)–(vii). Section 13.6 returns to confidence intervals and their optimality as it relates to hypothesis testing. It is important to emphasize at the outset that the framing of frequentist testing in this chapter is a veritable attempt to propose a coherent account by emphasizing its key elements, such as the statistical model, and the underlying reasoning. In that sense, the overarching framework is novel enough to deserve the name "error-statistical," if for no other reason than to distinguish it from other similar attempts.
13.2 Statistical Testing Before R. A. Fisher

13.2.1 Francis Edgeworth's Testing

Francis Ysidro Edgeworth (Drummond Chair of Political Economy at Oxford) was an economist who made significant contributions to the methods of statistics during the 1880s; see Stigler (1986), Gorroochurn (2016). A typical example of a testing procedure at the end of the nineteenth century is given by Edgeworth (1885). Viewing his testing procedure retrospectively from Fisher's perspective, Edgeworth assumes that the data x0 := (x11, x12, . . . , x1n; x21, x22, . . . , x2n) constitute a realization of the 2n-dimensional random (IID) sample X := (X11, X12, . . . , X1n; X21, X22, . . . , X2n) from the simple (IID) bivariate Normal model

Xt ∼ NIID(μ, Σ), t = 1, 2, . . . , n, . . . ,
Xt := (X1t, X2t)′, μ := (μ1, μ2)′, Σ := [σ², 0; 0, σ²],   (13.1)

E(X1t) = μ1, E(X2t) = μ2, Var(X1t) = Var(X2t) = σ², Cov(X1t, X2t) = 0.   (13.2)
The hypothesis of interest relates to the equality of the two means: (a) μ1 = μ2. Common sense, combined with the statistical knowledge at the time, suggested using the difference between the estimated means, μ̂1 − μ̂2, where

μ̂1 = (1/n)∑_{t=1}^{n} X1t,  μ̂2 = (1/n)∑_{t=1}^{n} X2t,

as a basis for deciding whether μ1 = μ2. To render the difference (μ̂1 − μ̂2) free of the units of measurement, Edgeworth divided it by √Var(μ̂1 − μ̂2) = √(σ̂1²/n + σ̂2²/n), where

σ̂1² = (1/n)∑_{t=1}^{n}(X1t − μ̂1)²,  σ̂2² = (1/n)∑_{t=1}^{n}(X2t − μ̂2)²,

to define (b) a distance function

ξ(X) = |μ̂1 − μ̂2| / √(σ̂1²/n + σ̂2²/n).
To decide whether the observed distance ξ(x0) is "large enough" to infer μ1 ≠ μ2, Edgeworth used (c) a threshold 2√2. His argument is that rejection of (μ1 − μ2) = 0 could not have been "due to pure chance" (accidental) if

ξ(x0) > 2√2.   (13.3)

Where did the threshold 2√2 come from? It came from the tail area of N(0, 1) beyond ±2√2, which is approximately .005. This value was viewed at the time as a "reasonable" lower bound for the probability of a "chance" error, i.e. the probability of erroneously inferring a significant discrepancy. But why Normality? At the time, statistical inference did not have the notion of a statistical model (introduced by Fisher, 1922a), and thus it relied heavily on large sample size n (asymptotic) results by routinely invoking the central limit theorem. For a detailed discussion of testing the difference between two means, see Appendix 13.A. In summary, Edgeworth introduced three generic concepts for statistical testing (Table 13.1) that have been retained, after being modified by the subsequent literature.

Table 13.1 Edgeworth's testing, key concepts
(a) Hypothesis of interest: μ1 = μ2
(b) Notion of a standardized distance: ξ(X)
(c) Threshold value for "significance": ξ(x0) > 2√2
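Edgeworth's three-step procedure in Table 13.1 can be sketched in a few lines. The data below are simulated (the means 15 and 16.2 and common σ = 2 are assumptions chosen for illustration, echoing the fertilizer example later in the chapter).

```python
import math
import random

def edgeworth_xi(x1, x2):
    """Edgeworth's standardized distance between two sample means."""
    n = len(x1)
    m1, m2 = sum(x1)/n, sum(x2)/n
    v1 = sum((x - m1)**2 for x in x1)/n   # divide-by-n variance estimates
    v2 = sum((x - m2)**2 for x in x2)/n
    return abs(m1 - m2)/math.sqrt(v1/n + v2/n)

random.seed(3)
n = 400
x1 = [random.gauss(15.0, 2.0) for _ in range(n)]
x2 = [random.gauss(16.2, 2.0) for _ in range(n)]

xi = edgeworth_xi(x1, x2)
significant = xi > 2*math.sqrt(2)   # Edgeworth's threshold, tail area ~ .005
```

With a genuine mean difference of 1.2 and n = 400, ξ(x0) comfortably exceeds 2√2 ≈ 2.83, so the difference would not be attributed to "pure chance."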
13.2.2 Karl Pearson's Testing

Karl Pearson's approach to statistics was discussed in Chapter 12 in some detail. In summary, one would begin with the data x0 in search of a descriptive model within the Pearson family. The search was data-driven in the sense that one would use the first four raw data moments μ̂1, μ̂2, μ̂3, and μ̂4 to derive the estimates (θ̂1, θ̂2, θ̂3, θ̂4) of the four unknown parameters that characterize the Pearson family, in order to select the particular member f0(x) from it; the family includes well-known distributions, such as the Normal, Student's t, beta, uniform, gamma, inverse-gamma, chi-square, F, and exponential. Having selected a frequency curve f0(x) on the basis of f(x; θ̂1, θ̂2, θ̂3, θ̂4) = f(x; θ̂), Karl Pearson would proceed to assess the appropriateness of f0(x) by testing the hypothesis of interest:

f0(x) = f*(x) ∈ Pearson(θ1, θ2, θ3, θ4),

where f*(x) denotes the "true density" in the sense that it could have generated data x0. To construct a test, Pearson proposed the standardized distance function

η(X) = ∑_{i=1}^{m} (f̂i − fi)²/fi = n ∑_{i=1}^{m} ((f̂i/n) − (fi/n))²/(fi/n) ~ χ²(m) as n → ∞,   (13.4)
where (f̂i, i = 1, 2, . . . , m) and (fi, i = 1, 2, . . . , m) denote the empirical and assumed (as specified by f0(x)) frequencies. This test statistic compares how close the observed are to the expected (under f0(x)) relative frequencies. Instead of relying on a simple threshold for the observed distance, like Edgeworth, he went a step further and introduced a primitive version of the p-value:

P(η(X) > η(x0)) = p(x0),   (13.5)
as a basis for inferring whether the choice of f0(x) was adequate or not. The rationale for the p-value was simply based on goodness-of-fit: the smaller the p-value, the worse the fit.

Example 13.1 Mendel's cross-breeding experiments (1865–6) were based on pea-plants with different shapes (round or wrinkled) and colors (yellow or green), which can be framed in terms of two Bernoulli random variables:

X(round) = 0, X(wrinkled) = 1;  Y(yellow) = 0, Y(green) = 1.
Mendel's theory of heredity was based on two assumptions:

(i) the two random variables X and Y are independent;
(ii) "round" (X=0) and "yellow" (Y=0) are dominant traits, but "wrinkled" (X=1) and "green" (Y=1) are recessive traits, with probabilities

dominant: P(X = 0) = P(Y = 0) = .75,  recessive: P(X = 1) = P(Y = 1) = .25.

Mendel's substantive theory gives rise to the probabilistic model in Table 13.2, based on the bivariate Bernoulli distribution below.

Table 13.2 Mendel's substantive model

x\y      y=0      y=1      fx(x)
x=0      .5625    .1875    .750
x=1      .1875    .0625    .250
fy(y)    .750     .250     1.00
Data. The four gene-pair experiments Mendel carried out with n = 556 gave rise to the observed frequencies and relative frequencies given below.

Observed frequencies:
(R,Y) = (0,0): 315;  (R,G) = (0,1): 108;  (W,Y) = (1,0): 101;  (W,G) = (1,1): 32.

Observed relative frequencies:

x\y      y=0      y=1      fx(x)
x=0      .5666    .1942    .7608
x=1      .1817    .0576    .2393
fy(y)    .7483    .2518    1.001

How adequate is Mendel's theory in light of the data? Using Pearson's goodness-of-fit chi-square test statistic in (13.4) yields

η(x0) = 556[(.5666−.5625)²/.5625 + (.1942−.1875)²/.1875 + (.1817−.1875)²/.1875 + (.0576−.0625)²/.0625] = .463.
Given that the tail area of χ²(3) is P(η(X) > .463) = .927, the p-value indicates excellent goodness-of-fit (no discordance) between the data and Mendel's theory.

Table 13.3 Karl Pearson's testing, key concepts
(a) Introducing the Pearson family of distributions, which extended significantly the scope of statistical modeling
(b) Broadening the scope of the hypothesis of interest, initiating misspecification testing by introducing a goodness-of-fit test in terms of f0(x) and f*(x)
(c) Introducing the notion of a distance function whose distribution is asymptotically known (as n → ∞)
(d) Using the tail probability (p-value) as a basis for evaluating the goodness-of-fit of f0(x) with data x0
Karl Pearson’s main contributions to statistical testing are listed in Table 13.3. As argued next, these features were subsequently reframed and modified by Fisher. It is very important to highlight the fact that, when viewed from a modern perspective, the hypothesis of interest f0 (x)=f ∗ (x) ∈ Pearson(θ 1 , θ 2 , θ 3 , θ 4 ) pertains to the adequacy of the choice of the probability model f0 (x). That is, Pearson’s chi-square was the first misspecification test for a distribution assumption!
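Example 13.1 can be reproduced directly from the raw counts. The sketch below is pure Python, using a closed-form expression for the χ²(3) tail area in terms of the error function. Note that working with unrounded counts gives η(x0) ≈ .470 rather than the .463 obtained from the rounded relative frequencies; the p-value is ≈ .92 either way.

```python
import math

def chi2_sf_df3(x):
    # survival function of chi-square(3): P(X > x) = erfc(sqrt(x/2)) + sqrt(2x/pi)*exp(-x/2)
    return math.erfc(math.sqrt(x/2)) + math.sqrt(2*x/math.pi)*math.exp(-x/2)

observed = [315, 108, 101, 32]               # counts for (R,Y), (R,G), (W,Y), (W,G)
probs = [0.5625, 0.1875, 0.1875, 0.0625]     # Mendel's substantive model (Table 13.2)
n = sum(observed)                            # 556
expected = [n*p for p in probs]

# Pearson's goodness-of-fit statistic, first form in (13.4)
eta = sum((o - e)**2/e for o, e in zip(observed, expected))
p_value = chi2_sf_df3(eta)                   # large p-value: excellent fit
```

The degrees of freedom are the number of cells minus one (4 − 1 = 3), matching the χ²(3) tail area used in the text.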
13.3 Fisher's Significance Testing

R. A. Fisher founded modern statistics while he was working as a statistician at the Rothamsted (Agricultural) Experimental Station in Harpenden (25 miles north of London) from 1919 to 1933. He was appointed Professor of Eugenics at University College, London in 1933, and moved to Cambridge University in 1943 as the Balfour Professor of Genetics. In addition to being the father of modern statistics, he was also a key founder of modern human genetics.

Table 13.4 The simple Normal model
Statistical GM: Xt = μ + ut, t∈N
[1] Normal: Xt ∼ N(·, ·)
[2] Constant mean: E(Xt) = μ, for all t∈N
[3] Constant variance: Var(Xt) = σ², for all t∈N
[4] Independence: {Xt, t∈N} is an independent process.
Fisher's recasting of statistical testing was inspired by a paper by "Student" (Gosset, 1908), encountered in Chapter 11. Gosset introduced the result that for a simple Normal model (Table 13.4):

√n(X̄n − μ)/s ∼ St(n−1),   (13.6)

where St(n−1) denotes a Student's t distribution with (n−1) degrees of freedom. The crucial importance of this paper stems from the fact that (13.6) was the first finite sampling
distribution (valid for any n>1) that inspired Fisher to recast modern statistics by introducing several crucial innovations.

(a) Fisher (1915) introduced the mathematical framework for formally deriving the finite sampling distribution of the correlation coefficient. In subsequent papers he derived several additional sampling distributions, including those for testing the significance of the regression coefficients as well as partial correlations.

(b) He brought out explicitly the probabilistic assumptions invoked in deriving (13.6) in the form of the simple Normal model (Table 13.4):

Xk ∼ NIID(μ, σ²), k = 1, 2, . . . , n, . . .   (13.7)

This move introduced the concept of a statistical model whose generic form is

Mθ(x) = {f(x; θ), θ∈Θ}, x∈R_X^n,   (13.8)

where f(x; θ), x∈R_X^n denotes the (joint) distribution of the sample X := (X1, . . . , Xn) that encapsulates the prespecified probabilistic structure of the underlying stochastic process {Xt, t∈N}. The link to the phenomenon of interest comes from viewing data x0 := (x1, x2, . . . , xn) as a "truly typical" realization of the process {Xk, k∈N}. In addition to the Pearson family (Pearson, 1894), Fisher later enlarged the scope of statistical modeling and inference significantly by introducing the exponential and the transformation families of distributions; see Lehmann and Romano (2006).

(c) Fisher used the result (13.6) to construct a test of significance for the null hypothesis

H0: μ = μ0   (13.9)

in the context of the simple Normal model (Table 13.4) by unlocking the reasoning underlying the estimation result in (13.6) and modifying it for testing purposes. Let us unpack this claim in detail. Looking at Gosset's result in (13.6), it is not obvious how one could interpret it, since τ(X; μ) = √n(X̄n − μ)/s is not a statistic; it involves the unknown parameter μ. Fisher (1943) called τ(X; μ) a pivotal quantity (or pivot), to stand for a function of both X and μ whose distribution involves no unknown parameters. In the case of (13.6) its first two moments are E[τ(X; μ)] = 0 and Var[τ(X; μ)] = (n−1)/(n−3). What is not so obvious is why μ disappears from the sampling distribution in (13.6). A moment's reflection suggests that E[√n(X̄n − μ)/s] = 0 only when E(X̄n) = μ*, i.e. X̄n is an unbiased estimator of μ, where μ* denotes the true value of μ. Given that, Fisher must have realized that the result in (13.6) makes frequentist estimation sense when it is evaluated under θ = θ*, stemming from the factual reasoning:

τ(X; μ) = √n(X̄n − μ)/s, which under μ = μ* is distributed as St(n−1).   (13.10)
Fisher's recasting of frequentist statistical testing was based on transforming (13.10) into a test statistic, by replacing the unknown μ with a hypothesized value μ0 (H0), which yields τ(X) = √n(X̄n − μ0)/s, in conjunction with replacing the factual reasoning (μ = μ*) with hypothetical reasoning, "what if" (under) μ = μ0:

τ(X) = √n(X̄n − μ0)/s, which under μ = μ0 is distributed as St(n−1).   (13.11)
Note that when τ(X) is evaluated under μ = μ*, the result would have been

τ(X) = √n(X̄n − μ0)/s, which under μ = μ* is distributed as St(δ; n−1), δ = √n(μ* − μ0)/σ.   (13.12)
This is non-operational since St(δ; n−1), a non-central Student's t distribution with non-centrality parameter δ, depends on μ*. Fisher's perceptive mind must have realized that by replacing the factual evaluation in (13.12) with a hypothetical one based on μ = μ0, the parameter δ disappears, since δ|μ=μ0 = √n(μ0 − μ0)/σ = 0, and the resulting distribution St(n−1) coincides with that in (13.6). Fisher was very explicit about the nature of the reasoning underlying significance testing: "In general, tests of significance are based on hypothetical probabilities calculated from their null hypotheses" (Fisher, 1956, p. 47).

(d) Fisher went on to modify/extend Pearson's definition of the p-value from a goodness-of-fit indicator into the basic criterion for his finite sample significance testing. The key was the realization that the sampling distribution of τ(X), when evaluated under H0, involves no unknown parameters and yields the tail area

P(τ(X) > τ(x0); H0) = p(x0).   (13.13)
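The hypothetical reasoning behind (13.11)–(13.13) can be mimicked by brute force: simulate the distribution of τ(X) under the scenario μ = μ0 and see where the observed τ(x0) falls. Everything below is simulated (μ0 = 10, n = 30, and the data-generating values are assumptions for illustration); since τ(X) is a pivot under H0, the σ used in the simulation does not matter.

```python
import math
import random

def tau(xs, mu0):
    # test statistic: sqrt(n)*(xbar - mu0)/s, with s the (n-1)-divisor standard deviation
    n = len(xs)
    xbar = sum(xs)/n
    s = math.sqrt(sum((x - xbar)**2 for x in xs)/(n - 1))
    return math.sqrt(n)*(xbar - mu0)/s

random.seed(2)
n, mu0 = 30, 10.0
x0 = [random.gauss(10.8, 2.0) for _ in range(n)]   # "observed" data
t_obs = tau(x0, mu0)

# tail area P(tau(X) > tau(x0); H0), approximated under the scenario mu = mu0
reps = 20_000
exceed = sum(tau([random.gauss(mu0, 2.0) for _ in range(n)], mu0) > t_obs
             for _ in range(reps))
p_value = exceed/reps
```

The simulation is evaluated *under* μ = μ0, not conditional on it; this is exactly the "what if" reasoning the text attributes to Fisher, and the Monte Carlo tail area approximates the St(n−1) tail area in (13.13).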
Fisher reinterpreted the p-value as an indicator of discordance (contradiction) between data x0 and H0.

Table 13.5 Normal [N(0, 1)] thresholds
One-sided: α=.100, cα=1.28; α=.050, cα=1.645; α=.025, cα=1.96; α=.010, cα=2.33
Two-sided: α=.100, cα=1.645; α=.050, cα=1.96; α=.025, cα=2.24; α=.010, cα=2.58

Table 13.6 Student's t [St(60)] thresholds
One-sided: α=.100, cα=1.296; α=.050, cα=1.671; α=.010, cα=2.390; α=.001, cα=3.232
Two-sided: α=.100, cα=1.671; α=.050, cα=2.000; α=.010, cα=2.660; α=.001, cα=3.460
To transform the p-value p(x0 ) into an inference pertaining to the “significance” of H0 one needs to choose a threshold for deciding how small a p-value is “small enough” to falsify H0 . Fisher suggested several such thresholds, .01, .02, .05, but left the choice to be made on a case-by-case basis; see Tables 13.5 and 13.6.
Figures 13.1–13.4 illustrate different tail areas for the N(0, 1) and St(ν = 5) distributions for comparison purposes. It should be noted that as ν increases, St(ν) approaches N(0, 1), but ν > 120 is needed for the two tail areas to be almost identical. As in the case of frequentist estimation, the primary objective of frequentist testing is to learn about the "true" (θ = θ*) statistical data generating mechanism

M*(x) = {f(x; θ*)}, x∈R_X^n,

assumed to have generated the data x0. In light of that, frequentist testing relies on a good estimator of θ to learn from data about θ*. In the case of the simple Normal model, X̄n is the best at pinpointing μ*, and thus the difference X̄n − μ0, defining the test statistic τ(X), aims to appraise the standardized distance (μ* − μ0); recall that μ* has generated the data used in the evaluation of X̄n. In contrast to estimation, where there is a single factual scenario μ = μ*, hypothesis testing relies on hypothetical reasoning, and thus one can pose numerous questions to the data based on such hypothetical scenarios using any values μ∈R.
13.3.1 A Closer Look at the p-value
In light of the above discussion, the p-value P(τ(X) > τ(x0); H0) = p(x0) aims to evaluate the discordance arising from the difference μ* − μ0, or equivalently between

M*(x) = {f(x; μ*, σ²)} and M0(x) = {f(x; μ0, σ²)}, x∈R_X^n,
[Fig. 13.1: N(0,1) density, shaded tail P(X < −1.645) = .05]
[Fig. 13.2: N(0,1) density, shaded tail P(X > 1.96) = .025]
[Fig. 13.3: St(5) density, shaded tail P(X < −2.015) = .025]
[Fig. 13.4: St(5) density, shaded tail P(X > 2.571) = .025]
when the sampling distribution of τ(X) is evaluated under the hypothetical scenario μ = μ0. In this sense, the p-value is firmly attached to the testing procedure's capacity to detect discrepancies from the null and cannot (or should not) legitimately be interpreted as the probability assigned to any value of μ in R. Although it is impossible to pin down Fisher's interpretation of the p-value, because his views changed over time and he held several renderings at any one time, there is one fixed point among his numerous articulations (Fisher, 1925a, 1955): "a small p-value can be interpreted as a simple logical disjunction: either an extremely rare event has occurred or H0 is not true." The focus on "a small p-value" needs to be viewed in conjunction with Fisher's falsificationist stance about testing, in the sense that significance tests can falsify but never verify hypotheses (Fisher, 1955): ". . . tests of significance, when used accurately, are capable of rejecting or invalidating hypotheses, in so far as these are contradicted by the data; but that they are never capable of establishing them as certainly true." Having quoted the above strict falsificationist stance, it is interesting to quote a less stringent one from Fisher (1925a): "If p is between .1 and .9 there is certainly no reason to suspect the hypothesis tested. If it is below .02 it is strongly indicated that the hypothesis failed to account for the whole of the facts" (p. 80). The combination of Fisher's logical disjunction with his falsificationist stance could be interpreted as arguing that a p-value smaller than the designated threshold indicates that H0 is false, in the sense that

M0(x) = {f(x; μ0, σ²)} ⊂ Mθ(x), x∈R_X^n,

could not have generated the data x0. What is less apparent is whether this could be legitimately interpreted as providing evidence against H0. Fisher did not articulate a convincing evidential interpretation of the p-value as "indicating the strength of evidence against the null" (Fisher, 1925a, p. 80).

The traditional definition of the p-value refers to the probability of obtaining a result "equal to or more extreme" than the one actually observed, when H0 is true. Misleadingly, when this definition is used in the context of a Neyman–Pearson (N-P) test, the clause "equal to or more extreme" is invariably interpreted in light of the alternative hypothesis. This has led to the p-value being viewed as the smallest significance level αmin at which H0 would have been rejected when H0 is true, referred to as the observed significance level or size of a N-P test. Unfortunately, this amounts to assigning a pre-data perspective to the p-value that is in conflict with the original post-data perspective proposed by Fisher as a "measure of discordance" between the null hypothesis and data x0. To avoid this confusion, the definition adopted in this book revolves around the observed value of the test statistic d(x0). Post-data perspective: the p-value is the probability of all sample realizations x that accord less well (in terms of d(x)) with H0 than x0 does, when H0 is true. An obvious implication stemming from this definition is that the post-data p-value is always one-sided, because the sign of d(x0) indicates the relevant tail. That is, μ* (the true μ) could not belong to the subset of the parameter space indicated by the other tail. It is also important to note that discussions
pertaining to the distribution of the p-value viewed as a statistic, i.e. a function of the sample X, such as p(X) ∼ U(0, 1) when the null is true (Uniformly distributed over the interval [0,1]), also adopt a pre-data perspective. Unfortunately, the p-value could not provide a cogent evidential interpretation for "rejecting H0" due primarily to a crucial weakness, known as the large n problem, initially raised by Berkson (1938).

The large n problem. Using Pearson's chi-square test based on

η(X) = n ∑_{i=1}^{m} ((f̂i/n) − (fi/n))²/(fi/n) ~ χ²(m) as n → ∞,
Berkson (1938) argued that since η(X) usually increases with n, the p-value decreases as n increases. Hence, there is always a large enough n to reject any null hypothesis. His claim needs to be qualified by attaching the clause: when (f0(x) − f*(x)) ≠ 0, irrespective of how small this difference is. Taken at face value, Berkson's claim means that a rejection of H0 based on p(x0) = .03 and n = 50 does not have the same evidential weight for the falsity of H0 as a rejection with p(x0) = .03 and n = 20,000. This calls into question Fisher's strategy of evaluating "significance" using p(x0) < .05 or .01 while ignoring the sample size n; see Spanos (2014a).

Misinterpreting the p-value. It is a serious error to interpret the p-value in (13.13) as conditional on the null hypothesis H0 (see Cohen, 1994): P(τ(X) > τ(x0)|H0) = p(x0). Conditioning on H0: θ = θ0, or any other value of θ, is meaningless in frequentist inference, since θ is an unknown constant and the evaluation in (13.13) is made under the hypothetical scenario that H0: θ = θ0 is true. Hence, using the vertical line (|) instead of a semi-colon (;) is incorrect and highly misleading! As argued in Chapter 10, Bayesians treat θ as a random variable, and thus in their setup conditioning makes sense. In light of that, the following interpretations of the p-value are clearly erroneous:

(i) assigning probabilities to the null or any other value of θ in Θ;
(ii) assigning a probability that the particular result is due to "chance" (whatever that means).
It is extremely important to keep in mind that frequentist error probabilities, including the p-value, are always attached to the inference procedure (estimator, test, predictor) via f (x; θ), x∈RnX ; they are never attached to θ directly or indirectly! Their aim is to “calibrate” the capacity of the test to detect discrepancies from θ ∗ .
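Berkson’s point is easy to demonstrate by simulation. The sketch below (assuming numpy and scipy are available; the discrepancy .02 and the seed are illustrative choices, not from the text, and a Student’s t-test stands in for Pearson’s chi-square since the phenomenon is the same) shows the p-value collapsing as n grows even though the discrepancy from H0 stays fixed and substantively negligible:

```python
import numpy as np
from scipy import stats

# Berkson's "large n problem" in miniature: keep a fixed, tiny true
# discrepancy from the null (mu* = 0.02 vs. H0: mu = 0) and watch the
# p-value of a one-sample Student's t-test shrink as n increases.
rng = np.random.default_rng(12345)
mu_star, sigma = 0.02, 1.0   # nonzero, but substantively negligible

for n in (100, 10_000, 1_000_000):
    x = rng.normal(mu_star, sigma, size=n)
    tau, p = stats.ttest_1samp(x, popmean=0.0)
    print(f"n = {n:>9,}: tau = {tau:7.3f}, p-value = {p:.2e}")
```

At n = 1,000,000 the null is rejected at any conventional threshold, exactly as Berkson predicted, despite the discrepancy being the same throughout.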
13.3.2 R. A. Fisher and Experimental Design

Fisher’s first statistical appointment was at the Rothamsted (Agricultural) Experimental Station in 1919, established in 1843 with a primary objective to measure the effect on crop yields of inorganic and organic fertilizers using a series of long-term field experiments. By 1919 the station had accumulated tons of data and Fisher’s main task was to help analyze
them. To understand how Fisher (1935a) revolutionized the statistical analysis of experimental data, it is important to consider, using a very simple example, the analysis of such data before he joined the station. An agricultural economist who wants to assess the viability of applying fertilizer in the production of barley would divide a large plot of land into two pieces and apply the fertilizer to one and not to the other. Consider the case where the untreated plot yield was y = 15 (bushels per acre) and the treated plot yield was z = 16.2 (bushels per acre). The questions of interest at the time were: (a) How effective is the fertilizer treatment? (b) Is the treatment economically viable? Since [(16.2 − 15)/15]×100 = 8, the increase in yield is 8%. Should one proceed to a cost–benefit analysis to appraise whether this increase is economically viable? Not so fast. The 8% increase could have been due to the fertilizer or to several other influencing factors, including (i) topsoil quality, (ii) rainfall, (iii) wind, (iv) sunshine, etc.
Confounding. This is a crucial issue concerning potential confounders, omitted but relevant factors. To address the issue, agricultural experimenters began to collect data on an ever-increasing number of potential confounders, but they still could not untangle the different effects! Why? Their statistical analysis was primitive and their efforts to account for influencing factors were focused on “cleaning” the data from the influence of potential confounders. The key problem was that one could not claim that the above (i)–(iv) are the only such potential confounders. In his attempt to address the potential confounding problem, Fisher revolutionized experimental design by proposing several new techniques (Fisher, 1925a, 1935a), including replication, blocking, and randomization, that rendered learning from experimental data a lot more effective!
Replication.
Create as many plots of land as possible to increase the accuracy of the inference. For instance, in the case of the simple Normal model (Table 13.4), X̄n ∼ N(μ, σ²/n), and thus increasing the sample size n decreases Var(X̄n) = σ²/n, increasing the precision of inference about μ. This will also enable one to estimate the non-systematic experimental error relevant for comparing differences among treatments.
Blocking. Pair the plots to ensure that they are similar enough with respect to potential influencing factors to provide local control of the experimental error. Of particular interest in the above example are factors that might potentially influence yield via location, which can be generically quantified by the ordering k = 1, 2, . . ., n. Think of a large plot located on a hill with the paired subplots located side by side from the top to the bottom of the hill, with each pair divided vertically. Note that using location as the relevant ordering one is neutralizing all potential factors, not just (i)–(iv) listed above, that could influence the yield via this dimension of the experimental design.
Randomization. Use a random mechanism (e.g. toss a coin) to decide which one in each pair of plots is treated. If there was any heterogeneity or dependence associated with particular locations it will be neutralized, because each pair (zk, yk) will be affected equally by any
factors, such as (i)–(iv) above, which could potentially influence yield via location. This yields an unbiased estimator of the error variance as well as the treatment differences; see Hinkelmann and Kempthorne (1994).
To answer the substantive questions of interest (a) and (b), one needs to embed this material experiment into a statistical model, representing an idealized description of the stochastic mechanism that could have given rise to data x0:=(x1, x2, . . ., xn), and frame question (a) in its context. The first step is specification. Assume that the data x0:={xk = (zk − yk), k = 1, 2, . . ., n} can be viewed as a typical realization of a simple Normal model (Table 13.4). The next step is to frame the substantive question (a) into a statistical hypothesis H0: μ = 0.

Table 13.7 Barley yield data

z = treated:   14 16 16 12 16 13 18 17 13 15 15 14
y = untreated: 12 13 10 14 13 15 11 13 10 16 14 12
x = (z−y):      2  3  6 −2  3 −2  7  4  3 −1  1  2
Ordering:       1  2  3  4  5  6  7  8  9 10 11 12
Example 13.2 For the data in Table 13.7, n = 12, x̄n = 2.167, s = 2.855. Fisher would then evaluate the Student’s t-test based on

τ(X) = √n(X̄n − μ0)/s ∼ St(n − 1) under H0, τ(x0) = √12(2.167 − 0)/2.855 = 2.629,  (13.14)
giving rise to the p-value P(τ(X)>τ(x0); H0) = .023, indicating discordance with H0. Does this confirm that the fertilizer treatment increases yield by 8% and one can proceed to evaluate its economic viability? Before one can draw such a conclusion one needs to address several issues pertaining to the reliability and precision of the above modeling and inference, including:
(i) Statistical adequacy. Are the model assumptions of NIID valid for data x0? Just because one applied techniques such as randomization and blocking does not mean they were effective in achieving the intended objectives.
(ii) Warranted discrepancy. A point estimator is never the best value for the warranted discrepancy from the null.
(iii) Substantive adequacy. Are there any additional effects that have not been neutralized by the experimental design, but could have influenced the outcome?
Fisher’s work on experimental design introduced several important tests into statistical modeling and inference, including testing the differences between two means or proportions, and the analysis of variance (ANOVA) test; see Appendix 13.A.
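As an aside, Example 13.2 can be reproduced with a few lines of Python (a sketch assuming numpy and scipy are available; the data are exactly those of Table 13.7). Note that the tail area .023 quoted above matches the two-sided probability 2·P(St(11) > 2.629):

```python
import numpy as np
from scipy import stats

# Example 13.2: paired differences x_k = z_k - y_k from Table 13.7
# and the Student's t-test of H0: mu = 0.
z = np.array([14, 16, 16, 12, 16, 13, 18, 17, 13, 15, 15, 14])  # treated
y = np.array([12, 13, 10, 14, 13, 15, 11, 13, 10, 16, 14, 12])  # untreated
x = z - y

n, xbar, s = len(x), x.mean(), x.std(ddof=1)
tau = np.sqrt(n) * (xbar - 0.0) / s         # test statistic for H0: mu = 0
p = 2 * stats.t.sf(abs(tau), df=n - 1)      # two-sided tail area of St(11)
print(f"n = {n}, xbar = {xbar:.3f}, s = {s:.3f}, tau = {tau:.3f}, p = {p:.3f}")
```

This reproduces x̄n = 2.167, s = 2.855, and τ(x0) = 2.629, with a tail area of roughly .023.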
13.3.3 Significance Testing: Empirical Examples

Example 13.3 Consider testing the null hypothesis H0: μ=70 in the context of the simple Normal model (Table 13.4), using the exam score data in Table 1.6 (Chapter 1), where μ̂n=71.686, s²=13.606, and n=70. The test statistic (13.11) yields
τ(x0) = √70(71.686 − 70)/√13.606 = 3.824, P(τ(X)>3.824; μ0 = 70) = .00007,

where p(x0)=.00007 is found from the St(69) tables. The tiny p-value suggests that x0 indicates strong discordance with H0.
Example 13.4 Arbuthnot’s (1710) conjecture The ratio of males to females in newborns might not be “fair.” This can be tested using the statistical null hypothesis H0: θ = θ0,
where θ 0 = .5 denotes “fair”
in the context of the simple Bernoulli model (Table 13.8) based on the random variable X defined by {male}={X=1}, {female}={X=0}.

Table 13.8 The simple Bernoulli model

Statistical GM: Xt = θ + ut, t∈N:=(1, 2, . . ., n, . . .)
[1] Bernoulli: Xt ∼ Ber(·, ·), xt∈{0, 1}
[2] Constant mean: E(Xt)=θ, 0≤θ≤1, for all t∈N
[3] Constant variance: Var(Xt)=θ(1−θ), for all t∈N
[4] Independence: {Xt, t∈N}, independent process
Using the MLE of θ, θ̂ML := X̄n = (1/n)Σ_{i=1}^{n} Xi, where X̄n ∼ Bin(θ, θ(1−θ)/n; n), one can derive the test statistic

d(X) = √n(X̄n − θ0)/√(θ0(1−θ0)) ∼ Bin(0, 1; n) under H0.  (13.15)
Note that Σ_{i=1}^{n} Xi ∼ Bin(nθ, nθ(1−θ); n) can be approximated very accurately using a N(nθ, nθ(1−θ)) distribution for n > 20, as Figure 13.5 attests.
Fig. 13.5 Normal approximation f(y; θ = .5, n = 20)
Data. n=30,762 newborns during the period 1993–5 in Cyprus, out of which 16,029 were boys and 14,733 girls. The test statistic takes the form (13.15) and θ̂n = 16,029/30,762 = .521:

d(x0) = √30762(.521 − .5)/√(.5(.5)) = 7.366, P(d(X)>7.366; θ = .5) = .00000017.
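For readers who want to replicate the arithmetic, here is a minimal sketch (Python standard library only; as in the text, θ̂n is rounded to .521 before forming d(x0)):

```python
import math

# Arbuthnot-type test of H0: theta = .5 for the Cyprus newborn data.
n, boys = 30762, 16029
theta_hat = round(boys / n, 3)     # .521, rounded as in the text
d = math.sqrt(n) * (theta_hat - 0.5) / math.sqrt(0.5 * (1 - 0.5))
print(f"theta_hat = {theta_hat}, d(x0) = {d:.3f}")
# → theta_hat = 0.521, d(x0) = 7.366
# the implied one-sided N(0,1) tail area P(d(X) > d(x0); theta = .5) is vanishingly small
```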
The tiny p-value indicates strong discordance with H0.
Example 13.5 Consider the case of the simple bivariate Bernoulli model (Example 12.12) based on the random variables X = 0 [no cancer], X = 1 [cancer], Y = 0 [no smoker], Y = 1 [smoker]. The data refer to a study of cancer occurrence and smoking in Germany with n = 656 individuals. The observed frequencies are summarized in Table 13.9; see Pagano and Gauvreau (2018).

Table 13.9 Cancer data

x\y      0     1    total
0       268   163    431
1       117   108    225
total   385   271    656

Table 13.10 Independence

x\y      0     1    total
0       253   178    431
1       132    93    225
total   385   271    656
The null hypothesis of interest is that there is no dependence between smoking and cancer, which will be tested using two different tests.
(a) Pearson’s chi-square test. This test compares the observed frequencies with those under (hypothetical) independence using the test statistic in (13.4). Using the notation of Table 12.12, when X and Y are independent:

π_ij = π_+i · π_j+, where π_ij = P(X = i−1, Y = j−1), π_+i = Σ_{j=1}^{2} π_ij, π_j+ = Σ_{i=1}^{2} π_ij, i, j = 1, 2.

Taking the observed marginal probabilities as given, one can evaluate the expected frequencies under independence as given in Table 13.10:

385(431)/656 ≈ 253, 271(431)/656 ≈ 178, 385(225)/656 ≈ 132, 271(225)/656 ≈ 93.
Evaluating the chi-square test statistic yields

η(z0) = (108−93)²/93 + (117−132)²/132 + (163−178)²/178 + (268−253)²/253 = 6.277,  (13.16)
whose p-value, p(z0 )=.012, is evaluated from χ 2 (1) tables. This indicates a clear discordance with the null hypothesis that X and Y are independent. Testing within vs. testing outside Mθ (x). It is important to bring out the fact that the Pearson test in (13.16) is probing within the boundaries of the prespecified model Mθ (x). This is in contrast to Example 13.1, where it is used as a misspecification test – probing the validity of Mθ (x) – and thus probing outside its boundaries.
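A quick cross-check of (13.16) is sketched below (assuming numpy and scipy are available). Note that carrying the unrounded expected counts gives η ≈ 6.32 rather than the 6.277 obtained from the rounded counts 253, 178, 132, 93; the p-value is ≈ .012 either way:

```python
import numpy as np
from scipy import stats

# Pearson chi-square test of independence for the 2x2 data of Table 13.9
# (rows: x = 0, 1 [cancer status]; columns: y = 0, 1 [smoking status]).
observed = np.array([[268, 163],
                     [117, 108]], dtype=float)
row, col, total = observed.sum(axis=1), observed.sum(axis=0), observed.sum()

expected = np.outer(row, col) / total        # expected counts under independence
eta = ((observed - expected) ** 2 / expected).sum()
p = stats.chi2.sf(eta, df=1)                 # (2-1)(2-1) = 1 degree of freedom

print(f"eta = {eta:.3f}, p-value = {p:.3f}")
# → eta = 6.320, p-value = 0.012
```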
(b)
Odds (cross-product) ratio test. A popular measure of dependence between categorical variables is the odds ratio, or its log (Chapter 6):

ψ12 = ln(π11·π22/(π21·π12)), whose ML estimate is ψ̂12(z0) = ln(108(268)/(163(117))) = .417.
As argued in Example 12.12, this follows from the parameterization invariance of the ML estimators. This can be used to construct a significance test based on the asymptotic sampling distribution of the test statistic:
ψ 12 =0 (Z) ψ N 0, v2 , d(Z) = √ 12 where v=
@
Var(ψ 12 (Z)) n→∞
1 1 1 1 108 + 117 + 163 + 268
= .166. This yields a p-value
p(z0 ) = P(d(Z)>2.512; ψ 12 = 0) = .006 that confirms the result of the chi-square test in (a) above.
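The same numbers can be obtained with the Python standard library alone (a sketch; with unrounded intermediate values the statistic comes out as ≈ 2.51 rather than the 2.512 produced by the rounded ratio .417/.166):

```python
import math

# Log odds-ratio significance test for Table 13.9; cell counts
# (x, y): (0,0)=268, (0,1)=163, (1,0)=117, (1,1)=108.
a, b, c, d = 268, 163, 117, 108
psi_hat = math.log((a * d) / (b * c))     # ln(108(268)/(163(117))) = .417
v = math.sqrt(1/a + 1/b + 1/c + 1/d)      # = .166
z = psi_hat / v                           # ~2.51
p = 0.5 * math.erfc(z / math.sqrt(2))     # one-sided N(0,1) tail area
print(f"psi_hat = {psi_hat:.3f}, v = {v:.3f}, d(z0) = {z:.3f}, p = {p:.3f}")
```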
13.3.4 Summary of Fisher’s Significance Testing

The main elements of a Fisher significance test {τ(X), p(x0)} are listed in Table 13.11.

Table 13.11 Fisher’s testing, key elements
(a) A prespecified statistical model: Mθ(x)
(b) A null (H0: θ=θ0) hypothesis
(c) A test statistic (distance function) τ(X)
(d) The distribution of τ(X) under H0 is known
(e) The p-value P(τ(X) > τ(x0); H0)=p(x0)
(f) A threshold value c0 (e.g. .01, .025, .05) such that p(x0) < c0 indicates discordance with H0

(e) The p-value P(κ(X) > κ(x0); H0)=p(x0). (f) A threshold value for discordance, say c0 = .05. Note that this particular example will be used extensively in the discussion that follows because of the simplicity of the test statistic κ(X) and its sampling distributions.
13.4 Neyman–Pearson Testing

Jerzy Neyman was a Polish mathematician and statistician who spent the first part of his professional career (1921–1934) at various institutions in Warsaw, Poland and then (1934–1938) at University College, London, and the second part (1938–1981) at the University of California, Berkeley, where he founded an influential tradition (school) in statistics. Egon Pearson was the son of Karl Pearson, and joined his father’s Department of Applied Statistics at University College London (UCL) as a lecturer in 1923. Neyman went to UCL in 1925 to study with Karl Pearson because he was intrigued by his contributions to statistics and his more philosophical book entitled The Grammar of Science, published in 1892. During his visit to UCL Neyman befriended Egon, with whom he began a decade-long collaboration. At the time, Egon Pearson was going through a soul-searching dilemma: follow his father’s approach to statistics or that of his arch-enemy, R.A. Fisher?

In 1925–6, I was in a state of puzzlement, and realized that, if I was to continue an academic career as a mathematical statistician, I must construct for myself what might be termed statistical philosophy, which would have to combine what I accepted from K.P.’s [his father] large sample tradition with the newer ideas of Fisher. (Pearson et al., 1990)
Neyman visited his friend at UCL several times during the period 1925–1933 and Egon went to Poland to meet Neyman twice. Their collaborative efforts gave rise to several papers, primarily on hypothesis testing; see Pearson (1966). The highlight of Neyman’s and Pearson’s professional careers was their collaboration on shaping an optimal theory for hypothesis testing during the years 1925–1935.
13.4.1 N-P Objective: Improving Fisher’s Significance Testing

Pearson (1962), reflecting upon the Neyman–Pearson collaborative efforts, described their motivation as follows: “What Neyman and I experienced . . . was a dissatisfaction with the logical basis – or lack of it – which seemed to underlie the choice and construction of statistical tests” (p. 395). In particular, their main objective was to ameliorate Fisher’s testing by improving what they considered the weak features of that approach.
[a]
Fisher’s choice of a test statistic d(X) on common-sense grounds. Fisher would justify his choice of the test statistic as a common-sense “distance function” constructed around the “best” estimator of θ. Neyman and Pearson questioned the choice of the distance function as ad hoc and sought objective criteria to define what a “good” test is. That is, their primary objective was an optimal theory of testing analogous to the optimal theory of estimation developed by Fisher in the 1920s.
[b] Fisher’s use of a post-data (x0 is given) threshold in conjunction with the p-value to indicate discordance with H0. Neyman and Pearson questioned the post-data threshold as vulnerable to abuse by practitioners who could evaluate the p-value p(x0) and then select a threshold that would give rise to the inference result they favor.
[c] Fisher’s falsificationist stance that d(x0) and p(x0) can only indicate discordance but never accordance with H0. Their view was that scientific research is also interested in accordance with H0.
The culmination of their efforts was the classic Neyman–Pearson (1933) paper where they put forward an optimal theory of testing aiming to address the issues [a]–[c], but as Pearson (1962, p. 395) mentioned, their being able to see further was the result of standing on the shoulders of giants (primarily Fisher and Student):
(a) The way of thinking which had found acceptance for a number of years among practicing statisticians, which included the use of tail areas of the distributions of the test statistic.
(b) The classical tradition that, somehow, prior probabilities should be introduced numerically into a solution – a tradition which can certainly be traced in the writings of Karl Pearson and of Student, but to which perhaps only lip service was then being paid.
(c) The tremendous impact of R.A. Fisher. His criticism of Bayes’s Theorem and his use of Likelihood.
(d) His geometrical representation in multiple space, out of which readily came the concept of alternative critical regions in the sample space.
(e) His tables of 5 and 1% significance levels, which lend themselves to the idea of choice, in advance of experiment, of the risk of the first kind of error which the experimenter was prepared to take.
(f) His emphasis on the importance of planning an experiment, which led naturally to the examination of the power function, both in choosing the size of the sample so as to enable worthwhile results to be achieved, and in determining the most appropriate test. (g) Then, too, there were a number of common-sense contributions from that great practicing statistician, Student, some in correspondence, some in personal communication.
13.4.2 Modifying Fisher’s Testing Framing: A First View

It is widely accepted that the key modification of Fisher’s framing by Neyman and Pearson was the introduction of the concept of an alternative hypothesis, originally suggested to Egon Pearson by Gosset; see Pearson (1968), Lehmann (2011). In an attempt to find a more formal way to derive a test statistic to replace Fisher’s construction based on intuition, Neyman and Pearson (1928) adopted wholeheartedly Fisher’s likelihood function and his optimal estimation theory and decided to view hypothesis testing as a comparison of the estimated likelihood functions for the null and alternative hypotheses using their ratio; they called it “the criterion of likelihood.” To illustrate this “likelihood criterion,” consider the following example.
Example 13.7 Consider the simple (one-parameter) Normal model (Table 13.10), where the null and alternative hypotheses take the form H0: μ=μ0 vs. H1: μ≠μ0.
Given that the likelihood function takes the form

L(μ; x) = (2πσ²)^(−n/2) exp{−(1/(2σ²)) Σ_{k=1}^{n} (xk − μ)²},  (13.17)

the MLE of μ under H0 is μ0, yielding the estimated likelihood function

L(μ0; x) = (2πσ²)^(−n/2) exp{−(1/(2σ²)) [Σ_{k=1}^{n} (xk − x̄n)² + n(x̄n − μ0)²]},  (13.18)

where (13.18) follows from the equality

Σ_{k=1}^{n} (xk − μ0)² = Σ_{k=1}^{n} (xk − x̄n + x̄n − μ0)² = Σ_{k=1}^{n} (xk − x̄n)² + n(x̄n − μ0)².  (13.19)

The MLE under the alternative is μ̂ML = x̄n, giving rise to the estimated likelihood function

L(μ̂ML; x) = (2πσ²)^(−n/2) exp{−(1/(2σ²)) Σ_{k=1}^{n} (xk − x̄n)²}.
Hence, the ratio of the two estimated likelihood functions yields

λ(x) = L(μ̂ML; x)/L(μ0; x) = exp{−(1/(2σ²)) Σ_{k=1}^{n}(xk − x̄n)²} / exp{−(1/(2σ²)) [Σ_{k=1}^{n}(xk − x̄n)² + n(x̄n − μ0)²]} = exp{(1/2)(n(x̄n − μ0)²/σ²)}.  (13.20)
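The identity in (13.20) can be verified numerically; the sketch below (assuming numpy is available; the sample size, seed, and parameter values are arbitrary illustrations, not from the text) compares the log of the likelihood ratio with the closed form n(x̄n − μ0)²/(2σ²):

```python
import numpy as np

# Numerical check of (13.20): for the one-parameter Normal model the
# likelihood ratio satisfies log lambda(x) = n (xbar - mu0)^2 / (2 sigma^2).
rng = np.random.default_rng(0)     # illustrative seed and parameter values
n, mu0, sigma = 25, 1.0, 2.0
x = rng.normal(1.5, sigma, size=n)

def log_lik(mu):
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - ((x - mu)**2).sum() / (2 * sigma**2)

xbar = x.mean()                                    # MLE of mu under the alternative
log_ratio = log_lik(xbar) - log_lik(mu0)           # log of lambda(x)
closed_form = n * (xbar - mu0)**2 / (2 * sigma**2)
print(np.isclose(log_ratio, closed_form))          # → True
```

The log-ratio is necessarily non-negative, since the unrestricted MLE x̄n maximizes the likelihood.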
Intuition suggests that one would reject H0 when λ(x0) is large enough, i.e. when h(x0) = n(x̄n − μ0)²/σ² > c, which suggests that n(x̄n − μ0)²/σ² provides a natural distance function. A closer look at h(x0) confirms Fisher’s choice of a test statistic κ(X) = √n(X̄n − μ0)/σ as a basis for his significance testing. Indeed, several of the examples used by Neyman and Pearson (1928) appear to suggest that the likelihood ratio affirmed several of Fisher’s choices of test statistics; see Gorroochurn (2016). In this sense, the likelihood ratio seemed to suggest natural test statistics based on maximum likelihood estimators, but one needed a more formal justification for such a choice that is ideally based on some optimal theory of testing analogous to Fisher’s optimal estimation theory.
What was not appreciated enough at the time, and continues to this day, is that the concept of an alternative hypothesis brought into testing the whole of the parameter space associated with the prespecified statistical model

Mθ(x) = {f(x; θ), θ∈Θ}, x∈RnX.  (13.21)

This is because an optimal theory of testing would require one to compare the null value θ0 with all other possible values of θ in Θ to establish a “best” test. Hence, the alternative hypothesis H1 should be defined as the complement to the null value(s) relative to Θ. That is, the archetypal way to specify the null and alternative hypotheses for N-P testing is

H0: θ∈Θ0 vs. H1: θ∈Θ1,  (13.22)

where Θ0 and Θ1 constitute a partition of Θ: Θ0 ∩ Θ1 = ∅, Θ0 ∪ Θ1 = Θ.
Simple vs. composite hypotheses. In the case where the sets Θ0 or Θ1 contain a single point, say Θ0 = {θ0} that determines f(x; θ0), the hypothesis is said to be simple, otherwise it is composite.
Fig. 13.6 N-P testing within Mθ(x)
Example 13.8 For the simple Bernoulli model (Table 13.6), assuming that the null hypothesis of interest is H0: θ=θ0, it is simple, but the archetypal alternative H1: θ≠θ0 is composite, since Θ1 = [0, 1] − {θ0}.
Unfortunately, the archetypal specification in (13.22) has been neglected in a tangle of confusions generated by the subsequent literature, commencing with the misconstrual of the Neyman–Pearson lemma that was based on a largely artificial partition Θ:=(θ0, θ1); see Section 13.4.7. The above archetypal specification makes it abundantly clear that N-P testing takes place within the boundaries of Mθ(x) (Figure 13.6), and all possible values of the parameter space are relevant for statistical purposes, even though only a few might be of substantive interest. This foregrounds the importance of securing the statistical adequacy of Mθ(x) (validating its probabilistic assumptions) to ensure the reliability of testing inferences. That is, “learning from data” about θ∗ using N-P testing presupposes that the modeler has established that θ∗ lies within the boundaries of Mθ(x) by securing its statistical adequacy.
The second modification of Fisher’s framing came in the form of what constitutes a test, which is not just a distance function selected on intuitive grounds and a tail probability! As in the case of an estimator, an N-P test statistic is a mapping d(.): RnX → R that partitions the sample space (RnX) into an acceptance region C0 and a rejection region C1, C0 ∩ C1 = ∅, C0 ∪ C1 = RnX, in a way that corresponds to Θ0 and Θ1, respectively:

RnX = C0 ∪ C1, with C0 ↔ Θ0, C1 ↔ Θ1.

These modifications, in conjunction with the pre-data (before x0 is used for inference) significance level (probability of type I error) α, enabled Neyman and Pearson to replace the post-data (using x0) p-value with the N-P decision rules:

[i] if x0∈C0, accept H0; [ii] if x0∈C1, reject H0.  (13.23)
Table 13.13 N-P type I and II errors

N-P rule      H0 true         H0 false
Accept H0     √               type II error
Reject H0     type I error    √
This N-P reframing gave rise to two types of errors (Table 13.13), whose probabilities (how often a testing procedure errs) are evaluated by:

type I:  P(x0 ∈ C1; H0(θ)) = α(θ), for θ∈Θ0, [x0 ∈ C1 ⇐⇒ reject H0];
type II: P(x0 ∈ C0; H1(θ)) = β(θ), for θ∈Θ1, [x0 ∈ C0 ⇐⇒ accept H0].

These error probabilities enabled Neyman and Pearson to introduce the third key notion of an “optimal” test, by fixing α(θ) to a small value and minimizing β(θ), or maximizing the power π(θ) = (1 − β(θ)), for all θ∈Θ1.
Optimal test. Select the particular d(X) in conjunction with a rejection region based on a prespecified significance level α: C1(α) = {x: d(x) ≷ cα}, with a view that the combination (d(X), C1(α)) defines a test with the highest pre-data capacity to reject H0 when it is false. This capacity is known as the power of the test:

π(θ) = P(x0 ∈ C1; H1(θ)) = 1 − β(θ), ∀θ∈Θ1.

The power is a measure of the generic (whatever the value x∈RnX) capacity of the test to detect discrepancies from H0 for all θ in Θ1. That is, the combination of the power π(θ1), ∀θ1∈Θ1 and the significance level α calibrates the generic capacities of a particular test to detect discrepancies from H0.
In summary, the Neyman–Pearson proposed modifications to the perceived weaknesses of Fisher’s significance testing have changed Fisher’s framework in four important respects.
(i) The N-P testing takes place within a prespecified Mθ(x) by partitioning it into the subsets associated with the null and alternative hypotheses, and any N-P inference involves (directly or indirectly) the whole of the parameter space Θ.
(ii) Fisher’s post-data p-value has been replaced with a pre-data significance level α and the power of the test.
(iii) Combining (i) and (ii), an optimal test is defined as Tα := (d(X), C1(α)), where Tα is chosen so that it maximizes the power π(θ) for all values of θ in Θ1.
(iv) Fisher’s p-value as a measure of discordance with H0 was replaced by the more behavioristic accept/reject H0 rules:
“We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis. But we may look at the purpose of tests from another view-point. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in
the long run of experience, we shall not be too often wrong.” (Neyman and Pearson, 1933, p. 290)
As argued in the sequel, notwithstanding Fisher’s (1955) criticisms pertaining to (i)–(iii), the N-P reframing gave rise to an optimal theory of testing, analogous to that of estimation framed almost single-handedly by Fisher himself. Where the N-P reframing did not make real progress over Fisher’s original formulation was in (iv). As a result, attempts to address the coarseness of the accept/reject rules by providing a sound evidential interpretation for the p-value and the accept/reject H0 results have beleaguered frequentist testing since the 1930s.
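Since the significance level and power defined above are simple tail probabilities under the two hypothetical scenarios, they can be computed directly. The sketch below (assuming numpy and scipy are available; the values of α, n, μ0, σ are illustrative choices) does so for the one-sided Normal test with κ(X) = √n(X̄n − μ0)/σ and C1(α) = {x: κ(x) > cα}:

```python
import numpy as np
from scipy import stats

# Significance level and power for the one-sided Normal test of
# H0: mu <= mu0 vs. H1: mu > mu0 (sigma known), with test statistic
# kappa(X) = sqrt(n)(Xbar - mu0)/sigma and C1(alpha) = {x: kappa(x) > c_alpha}.
alpha, n, mu0, sigma = 0.05, 100, 0.0, 1.0
c_alpha = stats.norm.ppf(1 - alpha)             # ~1.645

def power(mu1):
    # under H1(mu1), kappa(X) ~ N(delta1, 1) with delta1 = sqrt(n)(mu1 - mu0)/sigma
    delta1 = np.sqrt(n) * (mu1 - mu0) / sigma
    return stats.norm.sf(c_alpha - delta1)      # P(kappa(X) > c_alpha; mu1)

for mu1 in (0.0, 0.1, 0.2, 0.3):
    print(f"mu1 = {mu1:.1f}: power = {power(mu1):.3f}")
```

At the boundary μ1 = μ0 the power reduces to the significance level α = .05, illustrating that α is the maximum type I error probability over Θ0, while the power increases monotonically with the discrepancy μ1 − μ0.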
13.4.3 A Historical Excursion

During the period 1928–1934, Fisher was supportive of the collaborative efforts between Neyman and Pearson to improve upon his significance testing, by encouraging them when replying to Neyman’s queries. Neyman was invited several times to the Rothamsted Experimental Station where Fisher had been working as chief statistician since 1919. We know that Fisher commented on an early draft of the classic Neyman and Pearson (1933) paper, because the authors thanked him in a footnote to the published paper. Fisher had reasons to take this positive attitude because the optimality theory in the 1933 paper confirmed that almost all the test statistics he constructed on intuitive grounds turned out to define optimal tests!
The first and last published discussion of N-P testing in a positive light by Fisher is a remarkable (1934) paper entitled “Two new properties of the mathematical likelihood.” In this paper, Fisher formalized the concept of a sufficient statistic and derived the famous factorization theorem. He went further and introduced the one-parameter exponential family for which sufficient statistics exist, and related that to the optimality of N-P tests (UMP) within that family. His discussion includes the concept of an alternative hypothesis, significance levels, and the power of tests, in a constructive tone; see Gorroochurn (2016).
The relationship between these three pioneers began to deteriorate in 1933 when Karl Pearson retired and the administrators at UCL decided to divide his professorship into two, a Readership (Associate Professor) for Egon in the Statistics Department and a Professorship for Fisher in the Biology Department as Professor of Eugenics. What contributed to souring their relationship somewhat was the fact that Fisher’s professorship came with a humiliating clause that he was forbidden to teach statistics at UCL (see Box, 1978, p.
258); the father of modern statistics was explicitly told to keep his views on statistics to himself! Indeed, Fisher never held an academic post in statistics. Egon offered his best friend a Readership in 1934 and Neyman accepted. Hence, all three were now colleagues at UCL for the next four years; Neyman left to join the faculty at Berkeley in 1938. The initial intellectual agreement between Fisher on the one side and Neyman and Pearson on the other was apparent in the two presentations at the Royal Statistical Society (RSS) meetings by Neyman and Fisher in 1934. Parenthetically, it should be noted that such an invitation was extended only to statisticians who distinguished themselves, and it was considered as a clear “coming of age” honor for young statisticians. Between 1915 and 1934 Fisher had recast statistics into its modern form, but was largely ignored by the old guard
13.4 Neyman–Pearson Testing
575
of the RSS. Indeed, when Fisher became a professor at UCL in 1933 and they felt they had to invite him, the invitation came after that of Neyman, a relative newcomer to statistics. In both presentations and the customary discussion that followed, Fisher, Neyman, and Pearson showed that they shared a largely Fisherian perspective on statistics, in contrast to the old guard of the RSS led by Arthur Lyon Bowley (1869–1957), a Professor of Statistics at the London School of Economics, and second only to Karl Pearson in the hierarchy of statistical priesthood. Fisher’s comments on Neyman’s (1934) paper were complimentary, and even on issues where they disagreed, such as the concept of confidence intervals, the discussion was respectful and constructive. Similarly, Neyman’s comments, along with those of Egon Pearson, on Fisher’s (1935b) paper were very supportive. In contrast, most of the comments by the other discussants, including Bowley, Isserlis, Irwin, and Wolf, varied from highly critical to dismissive of Fisher’s ideas.
The irreparable rift between Fisher and Neyman occurred at Neyman’s second presentation to the RSS in March of 1935 on “Statistical Problems in Agricultural Experimentation,” in which he criticized Fisher’s treatment of randomized block vs. Latin square methods. Neyman put forward an explicit linear model to frame his criticisms and called into question Fisher’s implicit model of Latin squares as less efficient than that of randomized block. In his comments, Fisher was acerbic in tone, criticizing Neyman for discussing a topic ‘he knew very little about’, in contrast to his previous presentation; see Gorroochurn (2016), Lehmann (2011). The subsequent reply by Neyman drew the battle lines for an endless war between them on all things statistical, including topics they were in agreement about before this episode.
The first casualty of this war was their common ground on confidence intervals, as specified in Fisher (1930) and Neyman (1934), with both authors going out of their way to exaggerate their differences. Fisher began to overemphasize his fiducial interpretation and Neyman’s (1937) key paper on confidence intervals does not even mention Fisher, ignoring his original 1934 discussion where he gives full credit to Fisher (1930) for the construction of confidence intervals. The second casualty was the interpretation of inference for statistical tests, with both pioneers making great efforts to move away from any common ground. This led Fisher to overemphasize his rendering of inductive inference and Neyman to take his behavioristic interpretation of tests to an extreme position: when accept H0 take action A, when reject H0 take action B; see Mayo (1996). Unfortunately, this war spread more broadly to the statistics profession and had a profound effect on the subsequent development of statistical testing. The last battle of this war between the three pioneers unfolded in Fisher (1955), Pearson (1955), and Neyman (1956), which, unfortunately, generated more heat than light. With that as the relevant historical background, let us return to the main discussion of N-P testing.
13.4.4 The Archetypal N-P Testing Framing

In N-P testing there has been a long debate concerning the proper way to specify the null and alternative hypotheses. It is often thought to be rather arbitrary and vulnerable to abuse. This view stems primarily from an inadequate understanding of the role of statistical vs. substantive information. When properly understood in the context of a statistical model Mθ(x), N-P testing leaves very little leeway in specifying the H0 and H1 hypotheses.
576
Hypothesis Testing
This is because the whole of Θ (all possible values of θ) is relevant on statistical grounds, irrespective of whether only a small subset might be relevant on substantive grounds. Hence, there is nothing arbitrary about specifying the null and alternative hypotheses in N-P testing. The default alternative is always the complement to the null relative to the parameter space of Mθ(x). An equivalent formulation that brings out more clearly the fact that N-P testing poses questions pertaining to the "true" θ*, or more correctly the true statistical data generating mechanism M*(x)={f(x; θ*)}, x∈RnX, is to rewrite (13.22) in an equivalent form:
H0: θ* ∈ Θ0 vs. H1: θ* ∈ Θ1, i.e.
f(x; θ*) ∈ M0(x) = {f(x; θ), θ∈Θ0} vs. f(x; θ*) ∈ M1(x) = {f(x; θ), θ∈Θ1}, x∈RnX.
This constitutes a partition of Mθ(x) (Figure 13.6), where P(x) denotes the set of all possible statistical models that could have given rise to data x0:=(x1, x2, . . ., xn). In cases where (Θ0 ∪ Θ1) is a proper subset of Θ, the values of θ postulated by H0 and H1, respectively, do not exhaust Θ, raising the possibility that θ* lies in Θ − (Θ0 ∪ Θ1).
Example 13.9  Consider the following hypotheses:
H0: μ ≤ μ0 vs. H1: μ > μ0,   (13.24)
in the context of the simple (one-parameter) Normal model (Table 13.10):
Mθ(x): Xt ∼ NIID(μ, σ²), (σ² known), t = 1, 2, . . ., n, . . .
In light of the results in Example 3.7, consider the test Tα:={κ(X), C1(α)}:
test statistic: κ(X) = √n(X̄n − μ0)/σ, X̄n = (1/n) Σ_{k=1}^n Xk,
rejection region: C1(α) = {x: κ(x) > cα},   (13.25)
which can be shown to be optimal. To evaluate the error probabilities one needs the distribution of κ(X) under H0 and H1:
[i] κ(X) = √n(X̄n − μ0)/σ ∼ N(0, 1) under H0(μ0);
[ii] κ(X) = √n(X̄n − μ0)/σ ∼ N(δ1, 1) under H1(μ1), δ1 = √n(μ1 − μ0)/σ > 0 for all μ1 > μ0.
These hypothetical sampling distributions are then used to compare H0 or H1, via κ(x0), to the true value μ=μ* represented by data x0 via X̄n, the best estimator of μ. That is, the distance √n(X̄n − μ0)/σ is the estimated form of √n(μ* − μ0)/σ; recall that, by presumption, M*(x) has given rise to the data x0. Both evaluations in [i] and [ii] involve hypothetical reasoning, in contrast to the factual
[iii] √n(X̄n − μ)/σ ∼ N(0, 1) under μ=μ*
that underlies estimation (point and interval). The evaluation of the type I error probability is based on [i]:
α = max_{μ≤μ0} P(κ(X) > cα; H0(μ)) = P(κ(X) > cα; μ = μ0).
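The error probabilities in [i] can be verified numerically. Below is a minimal simulation sketch (not from the book; it assumes NumPy and SciPy and uses the running values μ0=10, σ=1, n=100):

```python
# Simulation check of the test in Example 13.9: under H0 (mu = mu0) the
# rule "reject when kappa(x0) > c_alpha" should reject with frequency
# close to alpha.
import numpy as np
from scipy import stats

def z_test(x, mu0, sigma, alpha=0.05):
    """One-sided test of H0: mu <= mu0 vs. H1: mu > mu0 (sigma known)."""
    n = len(x)
    kappa = np.sqrt(n) * (np.mean(x) - mu0) / sigma   # test statistic kappa(x0)
    c_alpha = stats.norm.ppf(1 - alpha)               # rejection threshold c_alpha
    return kappa, c_alpha, kappa > c_alpha

rng = np.random.default_rng(0)
rejections = sum(z_test(rng.normal(10, 1, 100), 10, 1)[2] for _ in range(2000))
print(rejections / 2000)   # close to alpha = .05
```

The empirical rejection frequency under μ=μ0 approximates the nominal α, illustrating that α is a property of the procedure, not of any particular sample.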
Fig. 13.7  Type I and II error probabilities and the power of the test
Notice that in cases where the null is of the form H0: μ ≤ μ0, the type I error probability is defined as the maximum over all values in the interval (−∞, μ0], which turns out to be attained at μ=μ0. The evaluation of type II error probabilities is based on [ii]:
β(μ1) = P(κ(X) ≤ cα; H1(μ1)), for all μ1 > μ0.
The power (the probability of rejecting the null when false) is equal to 1 − β(μ1), i.e.
π(μ1) = P(κ(X) > cα; H1(μ1)), for all μ1 > μ0.
Figure 13.7 illustrates the type I and II error probabilities, and the power.
Why power? The power π(μ1) measures the pre-data (generic) capacity (probativeness) of test Tα:={κ(X), C1(α)} to detect a discrepancy, say γ = μ1 − μ0 = .1, when present. Hence, when π(μ1)=.35, this test has very low capacity to detect such a discrepancy. If γ=.1 is the discrepancy of substantive interest, this test is practically useless for that purpose, because we know beforehand that it does not have enough capacity (probativeness) to detect γ even if present! What can one do in such a case? The power of Tα is monotonically increasing with δ1 = √n(μ1 − μ0)/σ, and thus increasing the sample size n or decreasing σ increases the power.
Learning from data. The reasoning underlying this argument is that N-P tests pose hypothetical questions – framed in terms of θ∈Θ – aiming to learn about the true value θ*, which is ultimately about M*(x)={f(x; θ*)}, x∈RnX. The N-P partitioning of Θ and RnX is analogous to zeroing in on an unknown prespecified city on a map using vertical and horizontal sequential partitioning.
Being a "smart alec" with words. Certain statistics textbooks make a big deal out of the distinction "accept H0" vs. "fail to reject H0." At a certain level, this is a commendable attempt to bring out the problem of misinterpreting "accept H0" as tantamount to "there is evidence for H0," known as the fallacy of acceptance.
However, by the same token there is an analogous distinction between “reject H0 ” and “accept H1 ” that highlights the problem
of misinterpreting “reject H0 ” as tantamount to “there is evidence for H1 ,” known as the fallacy of rejection. What is objectionable about this practice is that the verbal distinctions by themselves do nothing but perpetuate these fallacies by just paying lip service instead of addressing them; see Section 13.5. The zero probability “paradox.” The assertion underlying this “paradox” is that point null hypotheses H0 : θ=θ 0 are always false (not exactly true) in the real world, and thus such testing is pointless. This argument has no merit because a hypothesis or an inferential claim being “exactly correct” has no place in statistics; being a statistician means never having to say a hypothesis or an inferential claim is exactly correct. What is assumed is that there is a true θ ∗ in , and hypothesis testing poses questions to the data seeking to find out whether the hypothesized value θ 0 is “close enough” to θ ∗ . Indeed, the whole idea behind frequentist inference is that we can learn from data about θ ∗ without learning its exact value, and such claims are calibrated using error probabilities. This is the reason why the post-data severity evaluation discussed in Section 13.5 outputs the discrepancy γ from θ =θ 0 warranted by data x0 and test Tα .
13.4.5 Significance Level α vs. the p-value
It is important to note that there is a mathematical relationship between the type I error probability (significance level) and the p-value. Placing them side by side in the case of Example 13.7:
P(type I error): P(κ(X) > cα; μ=μ0) = α,
p-value: P(κ(X) > κ(x0); μ=μ0) = p(x0),   (13.26)
it becomes obvious that (a) they share the same test statistic κ(X) and are both evaluated using the tail of the sampling distribution under H0, but (b) they differ in terms of their tail areas of interest, {x: κ(x) > cα} vs. {x: κ(x) > κ(x0)}, x∈RnX, rendering α a pre-data and p(x0) a post-data error probability.
Example 13.10  Consider the test Tα:={κ(X), C1(α)} in Example 13.9 for μ0=10, σ=1, n=100, x̄n=10.175, α=.05 ⇒ cα=1.645:
κ(x0) = √n(x̄n − μ0)/σ = √100(10.175 − 10)/1 = 1.75 > cα, reject H0.
The p-value is P(κ(X) > 1.75; μ = μ0) = .04.
Their common features and differences bring out several issues. First, it is important to distinguish between two different perspectives on the p-value.
Pre-data perspective. The p-value is traditionally defined as the probability of obtaining a result "equal to or more extreme" than the one observed, x0, when H0 is true. Misleadingly, when this definition is used in the context of a Neyman–Pearson (N-P) test, the clause "equal to or more extreme" is invariably interpreted in light of the alternative hypothesis. This has led to the p-value being viewed as the smallest significance level αmin at which H0 would have been rejected when H0 is true, referred to as the observed significance level or the size of an N-P test. The same pre-data perspective is also used when the p-value is interpreted as a statistic, a function of the sample X, asserting that p(X) ∼ U(0, 1) under H0. Hence, it
should come as no surprise to learn that the above N-P decision rules (13.23) can be recast in terms of the p-value: [i]* if p(x0) > α, accept H0; [ii]* if p(x0) ≤ α, reject H0. Indeed, practitioners often prefer to use the modified rules [i]* and [ii]* because the p-value appears to convey additional information. This brings up the next issue. The pre-data interpretation of the p-value gave rise to the concept of a two-sided p-value, P(|κ(X)| > |κ(x0)|; μ=μ0), which is at odds with the original post-data perspective of the p-value proposed by Fisher as a "measure of discordance" between x0 and H0, since no H1 is implicitly or explicitly involved.
Post-data perspective. In this book, the p-value is defined as the probability of all sample realizations x that accord less well (in terms of d(x)) with H0 than x0 does, when H0 is true. An obvious implication stemming from this definition is that the p-value is always one-sided, because the sign of d(x0) indicates the relevant tail; that is, μ* (the true μ) could not lie in the subset of the parameter space represented by the other tail.
Second, there is nothing irreconcilable between the significance level α and the p-value: α is a pre-data error probability defining the generic capacity of the test in question, and p(x0) is a post-data error probability evaluating the discordance between H0 and data x0, based on the same generic capacity (power). Even Fisher, in unguarded moments, would refer to the power as the "sensitiveness" of a test:
By increasing the size of the experiment, we can render it more sensitive, meaning by this that it will allow of the detection of a lower degree of sensory discrimination, or in other words, of a quantitatively smaller departure from the null hypothesis. (Fisher, 1934, pp. 21-22)
It is rather unfortunate that to this day the discussion of the crucial weaknesses of the p-value largely ignores the power of the test. Third, neither the significance level α nor the p-value can be interpreted as probabilities attached to particular values of μ, associated with H0 or H1 , since the probabilities in (13.26) are firmly attached to the sample realizations x∈RnX . Indeed, attaching probabilities to the unknown constant μ, or conditioning on values of μ, makes no sense in frequentist statistics. On that issue Fisher (1921) argued: We may discuss the probability of occurrence of quantities which can be observed or deduced from observations, in relation to any hypotheses which may be suggested to explain these observations. We can know nothing of the probability of hypotheses. (p. 25)
Fourth, both the p-value and the type I and II error probabilities are not conditional on H0 or H1; the sampling distribution of κ(X) is evaluated under different hypothetical scenarios pertaining to the values of θ in Θ.
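The p-value calculation in Example 13.10 can be reproduced directly from the upper tail of N(0, 1). A minimal sketch (not from the book; assumes SciPy):

```python
# p-value of Example 13.10: P(kappa(X) > kappa(x0); mu = mu0) for the
# one-sided Normal-mean test with mu0 = 10, sigma = 1, n = 100, xbar = 10.175.
import numpy as np
from scipy import stats

mu0, sigma, n, xbar = 10.0, 1.0, 100, 10.175
kappa = np.sqrt(n) * (xbar - mu0) / sigma   # observed test statistic, 1.75
p_value = stats.norm.sf(kappa)              # upper-tail area under H0
print(round(kappa, 2), round(p_value, 2))   # 1.75 0.04
```

Since κ(x0)=1.75 > cα=1.645, the decision rule and the rule p(x0) ≤ α agree, as [i]*–[ii]* require.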
Fig. 13.8  The trade-off: type I vs. II
13.4.6 Optimality of a Neyman–Pearson Test
How does the introduction of type I and II error probabilities address the arbitrariness associated with Fisher's choice of a test statistic? The ideal test is one whose error probabilities are zero, but no such test exists for a given n; an ideal test, like the ideal estimator, exists only as n → ∞! For n < ∞, however, there is a trade-off between the two error probabilities: increasing the threshold c > 0 decreases the type I but increases the type II error probability (see Figure 13.8).
Example 13.11  In the case of test Tα:={κ(X), C1(α)} in (13.25), decreasing the probability of type I error from α=.05 to α=.01 increases the threshold from cα=1.645 to cα=2.33, which makes it easier to accept H0, and this in turn increases the probability of type II error.
To address this trade-off, Neyman and Pearson (1933) proposed a twofold strategy.
Defining an optimal N-P test. An optimal N-P test is based on (a) fixing an upper bound α for the type I error probability,
P(x∈C1; H0(θ) true) ≤ α, for all θ∈Θ0,
and then (b) selecting {d(X), C1(α)} so as to minimize the type II error probability, or equivalently, maximize the power:
π(θ) = P(x∈C1; H1(θ) true) = 1 − β(θ), for all θ∈Θ1.
The general rule is that in selecting an optimal N-P test, the whole of the parameter space Θ is relevant. This is why partitioning both Θ and the sample space RnX using a test statistic d(X) provides the key to N-P testing.
N-P rationale. In their attempt to justify their twofold strategy, Neyman and Pearson (1933) urged the reader to consider the analogy with a criminal offense trial, where the jury are instructed by the judge to find the defendant "not guilty" unless they have been convinced "beyond any reasonable doubt" by the evidence:
H0: not guilty vs. H1: guilty.
The clause "beyond any reasonable doubt" amounts to fixing the type I error probability at a very small value, to reduce the risk of sending innocent people to death row. At the same time, one would want the system to minimize the risk of letting guilty people get off scot-free.
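The trade-off that Figure 13.8 depicts can be tabulated directly. A sketch (assumes SciPy; the alternative δ1 = √n(μ1 − μ0)/σ = 2 is an illustrative value, not from the book):

```python
# Type I vs. type II trade-off for the test of Example 13.9: raising the
# threshold c lowers alpha = P(kappa > c; mu0) but raises
# beta = P(kappa <= c; mu1), here with delta_1 = 2.
from scipy import stats

delta1 = 2.0
rows = []
for c in (1.0, 1.645, 2.33):
    alpha = stats.norm.sf(c)           # type I error probability
    beta = stats.norm.cdf(c - delta1)  # type II error probability
    rows.append((c, alpha, beta))
    print(c, round(alpha, 3), round(beta, 3))
```

As c sweeps upward, α falls monotonically while β rises, which is exactly why Neyman and Pearson fix α first and then minimize β.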
13.4.6.1 Optimal Properties of N-P Tests
The property that sets the gold standard for optimal tests is:
[1] Uniformly most powerful (UMP). A test Tα:={d(X), C1(α)} is said to be UMP if it has equal or higher power than any other α-level test T̃α for all θ∈Θ1:
π(θ; Tα) ≥ π(θ; T̃α), for all θ∈Θ1.
That is, an α significance level UMP test Tα has equal or higher power than any other α significance level test T̃α for all values of θ in Θ1. Figure 13.9 illustrates the notion of a UMP test by depicting the power of several tests, whose power curves begin at π(θ0; Tα) = α and increase monotonically as the discrepancy from the null value θ0 increases. The UMP test corresponds to the power curve drawn in bold, since it dominates all other power curves for all θ∈Θ1; when power curves intersect, no UMP test exists.
Example 13.6 (continued)  In the context of the simple Normal model (Table 13.12), the test Tα:={κ(X), C1(α)} in (13.25) is UMP; see Lehmann and Romano (2006).
Additional properties of N-P tests:
[2] Unbiasedness. A test Tα:={d(X), C1(α)} is said to be unbiased if the probability of rejecting H0 when false is always greater than that of rejecting H0 when true, i.e.
max_{θ∈Θ0} P(x∈C1; H0(θ)) ≤ α ≤ π(θ1), for all θ1∈Θ1.
Step 1. To evaluate the power for μ1 > μ0, using the N(0, 1) tables requires one to split the test statistic into
√n(X̄n − μ0)/σ = √n(X̄n − μ1)/σ + δ1, δ1 = √n(μ1 − μ0)/σ,
since under H1(μ1) the distribution of the first component is
√n(X̄n − μ1)/σ = √n(X̄n − μ0)/σ − δ1 ∼ N(0, 1), for μ1 > μ0.
Table 13.14
Evaluating the power of test Tα
γ = μ1 − μ0 | δ1 = √n(μ1 − μ0)/σ | π(μ1) = P(√n(X̄n − μ1)/σ > cα − δ1; μ1)
γ = .1 | δ1 = 1 | π(10.1) = P(Z > 1.645 − 1) = .259
γ = .2 | δ1 = 2 | π(10.2) = P(Z > 1.645 − 2) = .639
γ = .3 | δ1 = 3 | π(10.3) = P(Z > 1.645 − 3) = .913
Step 2. Evaluation of the power of the test Tα:={κ(X), C1(α)}, with σ=1, n=100, cα=1.645, for different discrepancies (μ1 − μ0) yields the results in Table 13.14 (Figure 13.10), where Z denotes a generic standard Normal random variable, i.e. Z ∼ N(0, 1). The tail areas associated with the significance level and the power of this test are illustrated in Figures 13.10 and 13.11. The power of the test Tα:={κ(X), C1(α)} is typical of an optimal test, since π(μ1) increases with the non-zero mean δ1 = √n(μ1 − μ0)/σ, and thus the power (a) increases as the sample size n increases, (b) increases as the discrepancy γ=(μ1 − μ0) increases, and (c) decreases as σ increases.
Example 13.13  For test Tα:={κ(X), C1(α)} in (13.25), with σ=1, α=.05 (cα=1.645), let us allow n to increase from 2 to 1000 and evaluate the power for different discrepancies γ, .075 ≤ γ ≤ .20, in order to see how the generic capacity (power) of test Tα increases with n. As can be seen in Figure 13.12, the power at n = 400 increases with the discrepancy γ, but it also increases rapidly with n: π(.2) = .991, π(.15) = .912, π(.125) = .804, π(.1) = .639, π(.075) = .442.
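The entries of Table 13.14 can be reproduced with a few lines. A sketch (assumes SciPy; μ0=10, σ=1, n=100 as in the running example):

```python
# Power pi(mu1) = P(Z > c_alpha - delta_1) of the test T_alpha for the
# discrepancies in Table 13.14 (sigma = 1, n = 100, alpha = .05).
import numpy as np
from scipy import stats

sigma, n, alpha = 1.0, 100, 0.05
c_alpha = stats.norm.ppf(1 - alpha)          # 1.645
powers = {}
for gamma in (0.1, 0.2, 0.3):                # gamma = mu1 - mu0
    delta1 = np.sqrt(n) * gamma / sigma
    powers[gamma] = stats.norm.sf(c_alpha - delta1)
    print(gamma, round(powers[gamma], 3))
```

The printed values match the .259, .639, .913 (to rounding) reported in the table.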
Fig. 13.10
Power of the test for different discrepancies μ1 > μ0.
A pre-data role for the power. In light of (a)–(c), one can use the power function pre-data to ensure that, before carrying out a study, one will be able to detect certain discrepancies of substantive interest with high probability. In practice it is often difficult to decrease σ, and thus the focus is usually on the selection of the sample size n.
Example 13.13 (continued)  When applying the test Tα:={κ(X), C1(α)}, let us assume that the discrepancy of substantive interest is γ=(μ1 − μ0) = .2. To ensure that this test has high enough generic capacity (power), say π(.2) ≥ .8, to detect γ=.2 if it exists, one needs to perform certain preliminary calculations to ensure that n is large enough for π(.2) ≥ .8, by solving the equation
π(.2) = P(√n(X̄n − .2)/σ > 1.645 − √n(.2)/σ; μ1 = .2) ≥ .8
for n. Given that P(Z > c) = .8 ⇒ c = −.842, solving 1.645 − √n(.2) = −.842 yields n = 155. Hence, the question "how large should the sample size be to learn from data about the phenomena of interest?" is of critical importance in practice. In this case, there is no point in carrying out this study when n = 100; one would not learn whether the discrepancy of interest γ=.2 is present or not.
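The sample-size calculation above amounts to solving for n in closed form. A sketch (assumes SciPy):

```python
# n needed so that the test detects gamma = .2 with power >= .8 at alpha = .05:
# solve c_alpha - sqrt(n)*gamma/sigma = -z_beta,
# i.e. n = ((z_a + z_b) * sigma / gamma)^2, rounded up.
import math
from scipy import stats

gamma, sigma, alpha, target = 0.2, 1.0, 0.05, 0.8
z_a = stats.norm.ppf(1 - alpha)   # 1.645
z_b = stats.norm.ppf(target)      # 0.842
n = math.ceil(((z_a + z_b) * sigma / gamma) ** 2)
print(n)  # 155
```

The same closed form shows immediately why n = 100 is inadequate for γ = .2 at the stated power.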
Revisiting the large n problem. Figure 13.12 illustrates how the notion of power can be used to explain the large n problem. Since the power for detecting any discrepancy γ > 0 increases with n, there is always a large enough n to reject any null hypothesis μ = μ0 , unless μ0 = μ∗ , which can happen only rarely. Hence, even tiny discrepancies from the null,
say μ1 − μ0 = .000001, will give rise to the large n problem. It is important to emphasize that there is nothing paradoxical about the power increasing with n; this is what constitutes a consistent test. What is important to take away from this is that both the p-value and an N-P rejection of H0 are vulnerable to the large n problem. This renders both inappropriate for providing an evidential interpretation of inference results without being adapted to account for the generic capacity of a test (power). In summary, it is very important to emphasize three features of an optimal test. First, an N-P test probes within the boundaries of a particular Mθ(x) whose statistical adequacy is presumed. Any departures from the model assumptions are likely to distort the test's error probabilities, resulting in sizeable discrepancies between the nominal (assumed) and actual error probabilities. Second, an N-P test is not just a formula associated with particular statistical tables. It is a combination of a test statistic and a rejection region; hence the notation Tα:={d(X), C1(α)}. Third, the optimality of an N-P test is inextricably bound up with the optimality of the estimator defining the test statistic. Hence, it is no accident that most optimal N-P tests are based on consistent, fully efficient, and sufficient estimators.
Example 13.14  In the case of the simple (one-parameter) Normal model (Table 13.10), consider replacing X̄n with the second-rate unbiased estimator μ̂2 = (X1 + Xn)/2 ∼ N(μ, σ²/2). The resulting test Ťα:={ζ(X), C1(α)}, for
ζ(X) = √2(μ̂2 − μ0)/σ and C1(α) = {x: ζ(x) > cα},
will not be optimal, because Ťα is inconsistent and its power is much lower than that of Tα! This is because the non-zero mean of the sampling distribution of ζ(X) under μ=μ1 is d = √2(μ1 − μ0)/σ, which does not change as n → ∞.
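The inconsistency of Ťα in Example 13.14 is easy to see numerically: its power is flat in n, while the optimal test's power climbs toward 1. A sketch (assumes SciPy; γ = .2 and σ = 1 are illustrative values):

```python
# Power comparison: optimal test based on the sample mean (non-centrality
# grows like sqrt(n)*gamma/sigma) vs. the test based on mu_hat_2 =
# (X_1 + X_n)/2 (non-centrality fixed at sqrt(2)*gamma/sigma for every n).
import math
from scipy import stats

c = stats.norm.ppf(0.95)
gamma, sigma = 0.2, 1.0
rows = []
for n in (25, 100, 400):
    power_opt = stats.norm.sf(c - math.sqrt(n) * gamma / sigma)
    power_bad = stats.norm.sf(c - math.sqrt(2) * gamma / sigma)
    rows.append((n, power_opt, power_bad))
    print(n, round(power_opt, 3), round(power_bad, 3))
```

The second column grows with n while the third column is the same for every n, which is precisely the sense in which Ťα is inconsistent.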
It is important to note that by changing the rejection region one can render an optimal N-P test useless! For instance, replacing the rejection region of Tα :={κ(X), C1 (α)} with C1 (α) = {x: κ(x) < cα }, the resulting test T˘α :={κ(X), C1 (α)} is practically useless because, in addition to being inconsistent (its power does not increase as n → ∞), it is also biased (it rejects H0 when it is true rather than when it is false) and its power decreases as the discrepancy γ increases. Example 13.15 Student’s t test In the context of the simple Normal model (Table 13.4), consider the one-sided hypotheses
(1-s>) H0: μ = μ0 vs. H1: μ > μ0
or
(1-s≥) H0: μ ≤ μ0 vs. H1: μ > μ0.   (13.27)
In this case, a UMP (consistent) test Tα:={τ(X), C1(α)} is the well-known Student's t test:
test statistic: τ(X) = √n(X̄n − μ0)/s, s² = (1/(n−1)) Σ_{k=1}^n (Xk − X̄n)²,
rejection region: C1(α) = {x: τ(x) > cα},   (13.28)
where cα can be evaluated using the Student's t tables; see Table 13.6. To evaluate the two types of error probabilities, the distribution of τ(X) under both the null and the alternatives is needed:
[I]* τ(X) = √n(X̄n − μ0)/s ∼ St(n−1) under μ0,
[II]* τ(X) = √n(X̄n − μ0)/s ∼ St(δ1; n−1) under μ1, for all μ1 > μ0,   (13.29)
where δ1 = √n(μ1 − μ0)/σ is the non-centrality parameter.
Table 13.15  N-P testing, key elements
(a) A prespecified (parametric) statistical model: Mθ(x)
(b) A null (H0: θ∈Θ0) and an alternative (H1: θ∈Θ1) within Mθ(x)
(c) A test statistic (distance function) τ(X)
(d) The distribution of τ(X) under H0 is known
(e) A prespecified significance level α [.01, .025, .05]
(f) A rejection region C1(α)
(g) The distribution of τ(X) under H1, i.e. for all θ∈Θ1
Table 13.15 summarizes the main components of a Neyman–Pearson test {τ(X), C1(α)}. When this table is put side by side with Fisher's testing Table 13.11, it is clear that elements (a), (c), and (d) are identical, and that the changes stem primarily from bringing the whole of the parameter space into the testing, and from replacing the p-value (post-data) with the significance level (pre-data) error probabilities. As argued in the sequel, the N-P approach to testing can be seen as complementary to Fisher's significance testing, and the two perspectives can be accommodated within the broader error-statistical framework for frequentist testing; see Section 13.5.
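Student's t test of Example 13.15 is a one-liner in most statistical software. A sketch (assumes SciPy ≥ 1.6 for the `alternative` keyword; the data are illustrative, generated here with μ = 10.2):

```python
# One-sided Student's t test of H0: mu = 10 vs. H1: mu > 10, sigma unknown.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(10.2, 1.0, 20)                    # illustrative sample, n = 20

tau, p = stats.ttest_1samp(x, popmean=10.0, alternative='greater')
c_alpha = stats.t.ppf(0.95, df=len(x) - 1)       # c_alpha from Student's t tables
print(bool(tau > c_alpha), bool(p < 0.05))
```

The two printed booleans always agree: rejecting via τ(x0) > cα and rejecting via p(x0) ≤ α are the same decision rule, as in [i]*–[ii]* above.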
13.4.7 Constructing Optimal Tests: The N-P Lemma
The cornerstone of the N-P approach is the Neyman–Pearson lemma. Assume the simple generic statistical model
Mθ(x) = {f(x; θ)}, θ∈Θ:={θ0, θ1}, x∈RnX,   (13.30)
and consider the problem of testing the simple hypotheses
H0: θ = θ0 vs. H1: θ = θ1.   (13.31)
Existence. There exists an α-significance level uniformly most powerful (α-UMP) test, whose generic form is
d(X) = h(f(x; θ1)/f(x; θ0)), C1(α) = {x: d(x) > cα},   (13.32)
where h(.) is a monotone function.
Sufficiency. If an α-level test of the form (13.32) exists, then it is UMP for testing (13.31).
Necessity. If {d(X), C1(α)} is an α-UMP test, then it will be given by (13.32).
Note that to implement this result one would need to find a function h(.) yielding a test statistic d(X) whose sampling distribution is known under both H0 and H1. The idea of using the ratio f(x; θ1)/f(x; θ0) as a basis for constructing a test was credited by Egon Pearson (1966) to Gosset, in an exchange they had in 1926: "It is the simple suggestion [by Gosset] that the only valid reason for rejecting a statistical hypothesis is that some alternative explains the observed events with a greater degree of probability."
At first sight the N-P lemma seems overly simplistic, because it assumes a simple statistical model Mθ(x) whose parameter space has only two points, Θ:={θ0, θ1}, which seems totally contrived. Unfortunately, this particular detail is often ignored in some statistics textbook discussions of this lemma. Note, however, that it fits perfectly into the archetypal formulation, because the two points constitute a partition of Θ. This lemma is often misconstrued as suggesting that for an α-UMP test to exist one needs to confine testing to simple vs. simple hypotheses even when Θ is uncountable. The truth of the matter is that the construction of an α-UMP test in more realistic cases has nothing to do with simple vs. simple hypotheses. Instead, it is invariably connected to the archetypal N-P testing formulation in (13.22), and relies primarily on monotone likelihood ratios (Karlin and Rubin, 1956) and other features of the prespecified statistical model Mθ(x) for the existence of an α-UMP test.
Example 13.16  To illustrate these comments, consider the simple hypotheses
(i) (1-1) H0: μ = μ0
vs. H1: μ = μ1, for μ1 > μ0,   (13.33)
in the context of the simple Normal (one-parameter) model (Table 13.10). This does not satisfy the conditions of the N-P lemma, because the parameter space is the whole of the real line, not just two points. Regardless, let us apply the N-P lemma by constructing the ratio using just the two points in (13.33):
f(x; μ1)/f(x; μ0) = exp{(n/σ²)(μ1 − μ0)X̄n − (n/2σ²)(μ1² − μ0²)}
= exp{(n(μ1 − μ0)/σ²)(X̄n − (μ1 + μ0)/2)}.   (13.34)
This ratio is clearly not a test statistic as it stands, but it can be transformed into one by noticing that, since (μ1 − μ0) > 0, a rejection region C1(α)={x: f(x; μ1)/f(x; μ0) > c} is equivalent to
X̄n > c* = (σ²/(n(μ1 − μ0))) ln(c) + (μ1 + μ0)/2.   (13.35)
Since c is an arbitrary positive constant, it is clear that the relevant test statistic should be a function of X̄n whose sampling distribution is known under both the null and alternative hypotheses. Given that X̄n ∼ N(μ, σ²/n), (13.34) can be transformed into a familiar test statistic:
κ(X) = h(f(x; μ1)/f(x; μ0)) = √n(X̄n − μ0)/σ,
where the relevant sampling distributions are
κ(X) ∼ N(0, 1) under μ=μ0, κ(X) ∼ N(δ1, 1) under μ=μ1, δ1 = √n(μ1 − μ0)/σ.
These provide the basis for evaluating the type I and II error probabilities. In this particular case it turns out that the ratio f(x; μ1)/f(x; μ0) is a monotone function of the test statistic κ(X) = √n(X̄n − μ0)/σ, which can provide the basis for constructing optimal tests even when the null and alternative hypotheses are composite (Lehmann and Romano, 2006). Let us consider certain well-known cases.
[1]
For μ1 > μ0, the test Tα>:={κ(X), C1>(α)}, where C1>(α)={x: κ(x) > cα}, is α-UMP for the hypotheses
(ii) (1-s≥) H0: μ ≤ μ0 vs. H1: μ > μ0,
(iii) (1-s>) H0: μ = μ0 vs. H1: μ > μ0.
Despite the difference between (ii) and (iii), the relevant error probabilities coincide. This is because the type I error probability for (ii) is defined as the maximum over all μ ≤ μ0, which happens to be attained at the end point (μ=μ0):
α = max_{μ≤μ0} P(κ(X) > cα; H0(μ)) = P(κ(X) > cα; μ = μ0).   (13.36)
Similarly, the p-value is the same, because
p>(x0) = max_{μ≤μ0} P(κ(X) > κ(x0); H0(μ)) = P(κ(X) > κ(x0); μ = μ0).   (13.37)
[2] For μ1 < μ0, the test Tα<:={κ(X), C1<(α)}, where C1<(α)={x: κ(x) < cα}, is α-UMP for the hypotheses
(iv) (1-s≤) H0: μ ≥ μ0 vs. H1: μ < μ0,
(v) (1-s<) H0: μ = μ0 vs. H1: μ < μ0.
[A] Monotone likelihood ratio. In such cases the α-UMP test is based on a statistic s(X), with a rejection region of the form s(x) > c for some c. The general result is that Mθ(x) has a monotone likelihood ratio of the form
f(x; θ1)/f(x; θ0) = h(s(x); θ0, θ1),   (13.38)
where, for all θ1 > θ0, h(s(x); θ0, θ1) is a non-decreasing function of the statistic s(X) upon which the test statistic will be based; see Karlin and Rubin (1956). This regularity condition is valid for most statistical models of interest in practice, including the one-parameter exponential family of distributions (Normal, gamma, beta, binomial, negative binomial, Poisson, etc.), the uniform, the exponential, the logistic, the hypergeometric, etc. It is interesting to note that in his first published paper after the Neyman–Pearson 1933 classic, Fisher (1934) proved the existence of UMP tests in the case of this family as stemming from the property of sufficiency. It was the first and last time Fisher discussed the N-P approach to testing in a positive light.
[B] Convexity. The parameter space Θ1 under H1 is convex, i.e. for any two values (μ1, μ2)∈Θ1, their convex combinations λμ1 + (1−λ)μ2 ∈ Θ1, for any 0 ≤ λ ≤ 1. When convexity does not hold, as with the two-sided alternative
(vi) (2-s) H0: μ = μ0 vs. H1: μ ≠ μ0,   (13.39)
[3] the test Tα=:={κ(X), C1=(α)}, C1=(α)={x: |κ(x)| > cα/2}, is α-UMPU (unbiased); the α-level and p-value are
α = P(|κ(X)| > cα/2; μ = μ0), p=(x0) = P(|κ(X)| > |κ(x0)|; μ = μ0).   (13.40)
Example 13.17  Consider the simple Normal model (Table 13.10) and apply the one-sided test
Tα> := {κ(X) = √n(X̄n − μ0)/σ, C1(α) = {x: κ(x) > cα}}
to the two-sided hypotheses in (13.39). Test Tα> is not UMP, because for negative values of the discrepancy its power is below α/2 (Figure 13.13); indeed, test Tα> is biased. When one narrows the group of desirable tests by requiring unbiasedness, test Tα> is excluded from consideration for being biased in testing the two-sided hypotheses (13.39). On the other hand, the two-sided version of Tα>,
Tα= := {κ(X) = √n(X̄n − μ0)/σ, C1(α) = {x: |κ(x)| > cα/2}},
for testing (13.39) has lower power for positive discrepancies, but it is unbiased (UMPU); see Figure 13.14. An alternative way to proceed, with a view to using the most powerful test while sidestepping the bias, is to separate the two-sided alternative into two pairs of one-sided hypotheses (Cox and Hinkley, 1974):
(i) H0: μ ≤ μ0 vs. H1: μ > μ0, (ii) H0: μ ≥ μ0 vs. H1: μ < μ0,
and use the tests Tα> and Tα< to evaluate the relevant p-values, say p>(x0) and p<(x0). The inference result is then based on p(x0) = min(p>(x0), p<(x0)). To avoid any misleading impression that the Normal is the only distribution for which optimal N-P tests exist, consider the following examples.
Fig. 13.14  Power of test Tα> for one-sided (bold) and two-sided.
Example 13.18  In the context of a simple Poisson model,
Xk ∼ PoissonIID(θ), θ > 0, xk∈N0 = {0, 1, 2, . . .}, k∈N = {1, 2, . . .},
whose density function is f(x; θ) = θ^x e^(−θ)/x!, θ > 0, x∈N0, consider the question of constructing a UMP test for the hypotheses
H0: θ ≤ θ0 vs. H1: θ ≥ θ1, θ1 > θ0.   (13.41)
Given that f(x; θ) = θ^(Σ_{k=1}^n xk) e^(−nθ) Π_{k=1}^n (1/xk!), the ratio in (13.38) for yn = Σ_{k=1}^n xk is
f(x; θ1)/f(x; θ0) = (θ1/θ0)^(yn) exp(n(θ0 − θ1)).
Hence, for any θ1 > θ0, f(x; θ1)/f(x; θ0) is a monotonically increasing function of yn, which can be used to construct a UMP test for (13.41) based on the test statistic
Yn = Σ_{k=1}^n Xk ∼ Poisson(nθ), θ > 0,   (13.42)
and the rejection region C1(α) = {x: yn > cα}, since the alternative defines a convex set. The choice of α and the power of this test will be based on a Poisson distribution and take the form
P(Yn > cα; θ = θ0) = Σ_{r=cα+1}^∞ (nθ0)^r e^(−nθ0)/r! ≤ α,
π(θ1) = P(Yn > cα; θ = θ1) = Σ_{r=cα+1}^∞ (nθ1)^r e^(−nθ1)/r!, for all θ1 > θ0.
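Because the null distribution of the Poisson test statistic is discrete, the attainable significance levels jump. A sketch (assumes SciPy; θ0 = 1 and n = 10 are illustrative values, not from the book):

```python
# Threshold for a Poisson test with Y_n = sum(X_k) ~ Poisson(n*theta):
# choose the smallest c with P(Y_n > c; theta_0) <= alpha, and note that
# the attained level generally falls short of alpha (discreteness).
from scipy import stats

n, theta0, alpha = 10, 1.0, 0.05
c = stats.poisson.ppf(1 - alpha, n * theta0)   # c = 15 for Poisson(10)
attained = stats.poisson.sf(c, n * theta0)     # actual type I error at this c
print(int(c), round(attained, 4))
```

The attained level sits strictly below .05, which is exactly the "inexact significance level" issue discussed next.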
This example raises the issue of inexact significance levels. Randomization? In cases where the test statistic has a discrete sampling distribution under H0, as in (13.42), one might not be able to attain a chosen α exactly. The traditional way to deal with this problem is to randomize (Lehmann and Romano, 2006), but this "solution" raises more problems than it solves. Instead, the best way to address the discreteness issue is to select a value of α that is attainable for the given n, or to approximate the discrete distribution
with a continuous one to circumvent the problem. In practice, there are several continuous distributions one can use, depending on whether the original distribution is symmetric or not. In the above case, even for moderately small sample sizes, the Normal distribution provides a good approximation to the distribution of √n(X̄n − θ)/√θ using the central limit theorem; see Chapter 9.
Example 13.19  In the context of a simple exponential model,
Xk ∼ ExpIID(θ), θ > 0, xk∈R+, k∈N = {1, 2, . . .},
whose density function is f(x; θ) = (1/θ) exp(−x/θ), θ > 0, x∈R+, consider the question of constructing a UMP test for the hypotheses
H0: θ ≤ θ0 vs. H1: θ ≥ θ1, θ1 > θ0.   (13.43)
Given that f(x; θ) = Π_{k=1}^n (1/θ) exp(−xk/θ) = (1/θ)^n exp(−(1/θ) Σ_{k=1}^n xk), x∈Rn+, the ratio in (13.38) for yn = Σ_{k=1}^n xk is
f(x; θ1)/f(x; θ0) = (θ0/θ1)^n exp{[(θ1 − θ0)/(θ0θ1)] yn}.
Hence, for any θ1 > θ0, f(x; θ1)/f(x; θ0) is a monotonically increasing function of yn, which can be used to construct a UMP test for (13.43) based on the test statistic
Yn = Σ_{k=1}^n Xk ∼ Gamma(n, θ)
and the rejection region C1(α) = {x: yn > cα}, since the alternative defines a convex set.
1 , π [1+(x−θ )2 ]
giving rise to f (x; θ) =
7n k=1
θ∈R, x∈R,
−1
π[1+(xk − θ)2 ]
, and the ratio in (13.38) is
−1 7n
π [1+(xk − θ 1 )2 ] 7n 1 + (xk − θ 1 )2 f (x; θ 1 ) k=1 . = 7 −1 = n k=1 (1 + (xk − θ 0 )2 ) f (x; θ 0 ) π [1+(xk − θ 0 )2 ] k=1
Since this ratio cannot be expressed as a monotone function of any statistic, it does not satisfy this key property.
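The contrast between Examples 13.19 and 13.20 can be verified numerically. The sketch below (sample values chosen purely for illustration) shows that the exponential likelihood ratio depends on the sample only through yn = ∑xk, whereas the Cauchy ratio distinguishes between samples with identical sums:

```python
import math

def lr_exponential(x, th1, th0):
    # f(x; th1)/f(x; th0) for the exponential model: a function of sum(x) alone
    n, yn = len(x), sum(x)
    return (th0 / th1) ** n * math.exp(yn * (th1 - th0) / (th0 * th1))

def lr_cauchy(x, th1, th0):
    # f(x; th1)/f(x; th0) for the Cauchy model: no scalar sufficient statistic
    ratio = 1.0
    for xk in x:
        ratio *= (1 + (xk - th0) ** 2) / (1 + (xk - th1) ** 2)
    return ratio

xa, xb = [1.0, 3.0], [2.0, 2.0]          # two samples with the same sum
print(lr_exponential(xa, 2.0, 1.0),
      lr_exponential(xb, 2.0, 1.0))      # equal
print(lr_cauchy(xa, 1.0, 0.0),
      lr_cauchy(xb, 1.0, 0.0))           # 4.0 vs. 6.25: not a function of the sum
```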
13.4.9 Constructing Optimal Tests: Likelihood Ratio
The likelihood ratio test procedure can be viewed as a generalization/extension of the Neyman–Pearson lemma to more realistic cases where the null and/or the alternative might be composite hypotheses. Its general formulation in the context of a statistical model Mθ(x) takes the following form.
(a) The hypotheses of interest are specified by H0: θ∈Θ0 vs. H1: θ∈Θ1, where Θ0 and Θ1 constitute a partition of Θ.
592
Hypothesis Testing
(b) The test statistic is a function of the “likelihood” ratio
  λn(X) = maxθ∈Θ L(θ; X) / maxθ∈Θ0 L(θ; X) = L(θ̂; X)/L(θ̃; X).   (13.44)
Note that the max in the numerator is over all θ∈Θ (yielding the MLE θ̂), but that of the denominator is confined to all values under H0: θ∈Θ0 (yielding the constrained MLE θ̃). This is in contrast to the N-P lemma, where the numerator is evaluated under the alternative H1.
(c) The generic rejection region is defined by C1 = {x: λn(x) > c}, but it is almost never the case that the distribution of λn(X) under H0 is known. More often than not, one needs to use a transformation h(.) to ensure that h(λn(X)) has a known sampling distribution under H0. This can then be used to define the rejection region
  C1(α) = {x: τ(X) = h(λn(X)) > cα}.   (13.45)
In terms of constructing optimal tests, the LR procedure can be shown to yield several well-known optimal tests; see Lehmann and Romano (2006).
Example 13.21 In the context of a simple Normal model (Table 13.4), consider the simple vs. composite hypotheses
  H0: μ = μ0 vs. H1: μ ≠ μ0.   (13.46)
The generic likelihood takes the form L(θ; x) = (2πσ²)^(−n/2) exp{−(1/2σ²)∑(i=1 to n)(xi − μ)²}, giving rise to the numerator of (13.44):
  maxθ∈Θ L(θ; X) = L(θ̂; X) = (2πσ̂²)^(−n/2) exp(−n/2),   (13.47)
that denotes the likelihood function evaluated at the unconstrained MLE estimators:
  μ̂ = X̄n, σ̂² = (1/n)∑(i=1 to n)(xi − μ̂)².   (13.48)
Using (13.19), the denominator of (13.44) takes the form
  maxθ∈Θ0 L(θ; X) = L(θ̃; X) = {2π[σ̂² + (X̄n − μ0)²]}^(−n/2) exp(−n/2),   (13.49)
which is evaluated at the constrained (under H0) MLEs
  μ̃ = μ0, σ̃0² = (1/n)∑(i=1 to n)(xi − μ0)² = σ̂² + (X̄n − μ0)².   (13.50)
Substituting (13.47)–(13.49) into (13.44) yields
  λn(X) = (2πσ̂²)^(−n/2) exp(−n/2) / {2π[σ̂² + (X̄n − μ0)²]}^(−n/2) exp(−n/2) = [σ̂²/(σ̂² + (X̄n − μ0)²)]^(−n/2) = [1 + (X̄n − μ0)²/σ̂²]^(n/2)
  → λn^(2/n)(X) = 1 + (X̄n − μ0)²/σ̂² = 1 + n(X̄n − μ0)²/[(n−1)s²] = 1 + τ(X)²/(n−1),
where τ(X) = √n(X̄n − μ0)/s, s² = (1/(n−1))∑(i=1 to n)(xi − μ̂)². Hence, rejecting H0 when λn(X) > c implies that 1 + τ(X)²/(n−1) > b = c^(2/n). This indicates that λn(X) is
a monotonically increasing function of τ(X)², which implies that one can use the known distribution of τ(X) to define the two-sided t-test for (13.46):
  τ(X) = √n(X̄n − μ0)/s, C1(α) = {x: |τ(x)| > cα/2},
which can be shown to be UMP unbiased; see Lehmann and Romano (2006). Note that the sampling distribution of τ(X)² is also known, τ(X)² ~ F(1, n−1) under H0, where F(1, n−1) denotes the F (for Fisher) distribution with 1 and (n−1) degrees of freedom. This is because when v ~ St(m), then v² ~ F(1, m); see Lehmann and Romano (2006). Hence, in principle, one could use the F test
  τ(X)² = n(X̄n − μ0)²/s², C1(α) = {x: τ(x)² > cα},
since they will give rise to the same inference result.
Example 13.22 In the context of a simple Normal model (Table 13.4), consider the composite hypotheses
  H0: μ ≤ μ0 vs. H1: μ > μ0.   (13.51)
Since the likelihood function is L(θ; x) = (2πσ²)^(−n/2) exp{−(1/2σ²)∑(i=1 to n)(xi − μ)²}, the unconstrained MLEs and L(θ̂; X) will coincide with (13.47)–(13.48). Since L(θ; x), as a function of μ, is centered at x̄n (its maximum):
  L(θ; x) = (2πσ²)^(−n/2) exp(−nσ̂²/2σ²) exp{−n(x̄n − μ)²/2σ²},   (13.52)
the relevant component exp{−n(x̄n − μ)²/2σ²} is essentially the Normal density without the normalization constant. Hence, the constrained MLE (under H0) will either be (a) μ̃ = X̄n when μ0 lies to the right of x̄n, or (b) μ̃ = μ0 when μ0 lies to the left of x̄n. Case (a) is uninteresting because the likelihood ratio will be λn(X) = 1, since L(θ̂; x) = L(θ̃; x), and thus λn(X) yields no test statistic. In case (b) the constrained MLEs for μ and σ² will coincide with (13.50), and thus L(θ̃; x) coincides with that in (13.49). Hence, the likelihood ratio gives rise to the one-sided t-test for the hypotheses in (13.51):
  τ(X) = √n(X̄n − μ0)/s, C1(α) = {x: τ(x) > cα},
which can be shown to be UMP; see Lehmann and Romano (2006).
Example 13.23 In the context of a simple Normal model (Table 13.4), consider testing the hypotheses
  H0: σ² = σ0² vs. H1: σ² ≠ σ0².
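The identity λn^(2/n)(X) = 1 + τ(X)²/(n−1) derived in Example 13.21 can be checked on simulated data; a minimal sketch (the simulated sample and its parameters are arbitrary):

```python
import math
import random

random.seed(1)
n, mu0 = 20, 0.0
x = [random.gauss(0.5, 2.0) for _ in range(n)]   # arbitrary simulated sample

xbar = sum(x) / n
s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)        # s^2
sig2_hat = sum((xi - xbar) ** 2 for xi in x) / n        # unconstrained MLE of sigma^2
sig2_tilde = sum((xi - mu0) ** 2 for xi in x) / n       # constrained MLE under H0: mu = mu0

# lambda_n = L(theta-hat)/L(theta-tilde) reduces to (sig2_tilde/sig2_hat)^(n/2)
lam = (sig2_tilde / sig2_hat) ** (n / 2)
tau = math.sqrt(n) * (xbar - mu0) / math.sqrt(s2)

print(lam ** (2 / n), 1 + tau ** 2 / (n - 1))   # the two sides of the identity agree
```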
From Chapter 12 we know that the maximum likelihood method yields
  maxθ∈Θ L(θ; x) = L(θ̂; x) ⇒ μ̂n = (1/n)∑(i=1 to n) Xi and σ̂² = (1/n)∑(i=1 to n)(Xi − μ̂n)²,
  maxθ∈Θ0 L(θ; x) = L(θ̃; x) ⇒ μ̂n = (1/n)∑(i=1 to n) Xi and σ̃² = σ0².
Moreover, the two estimated likelihoods are
  L(θ̂; x) = (2πσ̂²)^(−n/2) exp(−n/2) and L(θ̃; x) = (2πσ0²)^(−n/2) exp(−nσ̂²/2σ0²).
Hence, the likelihood ratio in (13.44) gives rise to
  λn(X) = L(θ̂; X)/L(θ̃; X) = (σ̂²/σ0²)^(−n/2) exp{−(n/2)[1 − σ̂²/σ0²]} = [υ(X)/n]^(−n/2) exp{−(n/2)[1 − υ(X)/n]},
where υ(X) = nσ̂²/σ0² ~ χ²(n−1) under H0. As a function of υ(X), λn(X) is not monotone: it decreases for υ(X) < n, attains its minimum of one at υ(X) = n, and increases thereafter. Hence, rejecting H0 when λn(X) > c corresponds to a two-sided rejection region C1(α) = {x: υ(x) >
c2 or υ(x) < c1}. Given that the chi-square distribution (χ²(n−1)) is non-symmetric, the threshold values c1 < c2 for the significance level α are chosen in such a way as to ensure that the two tails are of equal probability: ∫(0 to c1) ψ(x)dx = ∫(c2 to ∞) ψ(x)dx = α/2, where ψ(x) denotes the chi-square density function. The test defined by {υ(X), C1(α)} turns out to be UMP unbiased (see Lehmann, 1986).
Asymptotic likelihood ratio test. One of the most crucial advantages of the likelihood ratio test in practice is that even when one cannot find a transformation h(.) of λn(X) that will yield a test statistic whose finite sample distribution is known, one can use the asymptotic distribution. Wilks (1938) proved that under certain restrictions:
  2 ln λn(X) = 2[ln L(θ̂; X) − ln L(θ̃; X)] ∼ χ²(r) under H0, as n→∞,
where ∼ reads “under H0 is asymptotically distributed as” and r denotes the number of restrictions involved in defining Θ0. That is, under certain regularity restrictions on the underlying statistical model, when X is an IID sample, the asymptotic distribution (as n → ∞) of 2 ln λn(X) is chi-square with as many degrees of freedom as there are restrictions, irrespective of the distributional assumption. This result can be used to define the asymptotic likelihood ratio test
  {2 ln λn(X), C1(α) = {x: 2 ln λn(x) > cα}}, ∫(cα to ∞) ψ(x)dx = α.
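Returning to Example 13.23, the reason the rejection region has two thresholds can be seen by evaluating λn as a function of υ(X) = nσ̂²/σ0²; a minimal sketch (n chosen arbitrarily) confirms that the ratio attains its minimum of one at υ = n and increases in both directions:

```python
import math

def lam(v, n):
    # Likelihood ratio of Example 13.23 as a function of v = n*sig2_hat/sig0^2
    return (v / n) ** (-n / 2) * math.exp(-(n / 2) * (1 - v / n))

n = 20
for v in (5, 10, 15, 20, 25, 30, 40):
    print(v, lam(v, n))
# lam decreases toward 1 as v rises to n, then increases again, so
# {lam > c} corresponds to {v < c1 or v > c2}: a two-sided chi-square test.
```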
13.4.10 Bayesian Testing Using the Bayes Factor
The Bayesian testing of the hypotheses H0: θ∈Θ0 vs. H1: θ∈Θ1 in the context of a statistical model Mθ(x) = {f(x; θ), θ∈Θ}, x∈R^n_X, is naturally based on the ratio of the posterior distributions under the two hypotheses. Using the notation in
Chapter 10, the posterior is defined by
  π(θ|x0) = f(x0|θ)·π(θ)/m(x0) ∝ L(θ|x0)·π(θ), ∀θ∈Θ,
where π(θ) denotes the prior, L(θ|x0) the likelihood, and m(x0) = ∫Θ f(x0|θ)·π(θ)dθ. Assuming that πi = π(θ∈Θi), i = 0, 1, are the prior probabilities for H0 and H1, the “renormalized” priors are
  πi(θ) = π(θ)/πi, θ∈Θi, ∫Θi πi(θ)dθ = 1, i = 0, 1.
Hence, the posterior probabilities for H0 and H1 are defined by
  π(θ∈Θi|x0) = ∫Θi π(θ)L(θ|x0)dθ = πi ∫Θi πi(θ)L(θ|x0)dθ, i = 0, 1.
The posterior odds ratio is defined by
  π(θ∈Θ0|x0)/π(θ∈Θ1|x0) = [π0 ∫Θ0 π0(θ)L(θ|x0)dθ]/[π1 ∫Θ1 π1(θ)L(θ|x0)dθ].
To avoid the charge that prior probabilities render such testing susceptible to ad hoc manipulation, Bayesians often prefer to partially eliminate them from this ratio by defining the Bayes factor
  BF(x0) = [π(θ∈Θ0|x0)/π(θ∈Θ1|x0)]·(π1/π0) = ∫Θ0 π0(θ)L(θ|x0)dθ / ∫Θ1 π1(θ)L(θ|x0)dθ.   (13.53)
A close look at the BF indicates that this is analogous to the frequentist likelihood ratio in (13.44), but instead of maximizing the values of θ over Θ and Θ0, the parameters are integrated out of the likelihood function over the two subsets using the respective (renormalized) priors as weights. By choosing the priors strategically, this ratio can be simplified enough to involve only the likelihood functions.
Example 13.24 For a simple Bernoulli model (Table 13.8), consider the hypotheses H0: θ = θ0 vs. H1: θ ≠ θ0 and choose the priors πi = .5, i = 0, 1. In this case BF(x0) = L(θ0|x0)/∫(0 to 1) L(θ|x0)dθ. Bayesian testing is based on log10 BF(x0) in conjunction with certain thresholds concerning the strength of the degree of belief against H0 (Robert, 2007, p. 228):
(i) 0 ≤ −log10 BF(x0) ≤ .5, the degree of belief against H0 is poor;
(ii) .5 < −log10 BF(x0) ≤ 1, the degree of belief against H0 is substantial;
(iii) 1 < −log10 BF(x0) ≤ 2, the degree of belief against H0 is strong;
(iv) −log10 BF(x0) > 2, the degree of belief against H0 is decisive.
These “rules of thumb,” however, have been questioned by Kass and Raftery (1995) as largely ad hoc. There is no principled argument based on sampling distributions as in the case of the frequentist likelihood ratio test in (13.44).
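For Example 13.24, the integral ∫(0 to 1) θ^k(1−θ)^(n−k)dθ is the Beta function B(k+1, n−k+1), so the Bayes factor for a point null against a uniform prior on Θ1 has a closed form. A minimal sketch (the counts k and n are hypothetical, chosen only to illustrate the computation):

```python
import math

def log10_bf(k, n, theta0):
    """log10 of BF(x0) = L(theta0|x0) / integral_0^1 L(theta|x0) dtheta
    for Bernoulli data with k successes in n trials."""
    log_num = k * math.log(theta0) + (n - k) * math.log(1 - theta0)
    # integral_0^1 theta^k (1-theta)^(n-k) dtheta = Beta(k+1, n-k+1)
    log_den = math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)
    return (log_num - log_den) / math.log(10)

b = log10_bf(6, 10, 0.5)   # hypothetical data: 6 successes in 10 trials
print(b)                    # about 0.35, so -log10 BF < 0: no belief against H0
```

Using log-gamma keeps the computation stable for large n, where the raw likelihoods underflow.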
596
Hypothesis Testing
13.5 Error-Statistical Framing of Statistical Testing
13.5.1 N-P Testing Driven by Substantively Relevant Values
The primary aim of this section is to raise several issues when N-P testing is driven by one or more substantive values of interest for the parameters, as a prelude to introducing the error-statistical framing of frequentist testing. Let us focus on two values of historical interest pertaining to the ratio of boys (B) to girls (G) in newborns (see Gorroochurn, 2016):
  Arbuthnot: #B = #G; Bernoulli: 18B to 17G.   (13.54)
These values can be embedded into a simple Bernoulli model (Table 13.8), where θ = P(X=1) = P(B):
  Arbuthnot: θA = 1/2; Bernoulli: θB = 18/35.   (13.55)
The obvious formulation suggested by substantive arguments is H0: θ = θA vs. H1: θ = θB which, in light of the above discussion, is illegitimate, because the parameter space of the simple Bernoulli model is not Θ = {θA, θB} but Θ = [0, 1]. Hence, any invoking of the N-P lemma to secure an α-UMP test is based on misinterpreting it. What secures an α-UMP test in this case is the monotonic likelihood ratio of the simple Bernoulli model:
  f(x; θ1)/f(x; θ0) = [(1−θ1)/(1−θ0)]^n [θ1(1−θ0)/(θ0(1−θ1))]^(∑(k=1 to n) xk), where 0 < θ0 < θ1 < 1,
that is a monotonically increasing function of the statistic nX̄n = ∑(k=1 to n) Xk, since θ1(1−θ0)/[θ0(1−θ1)] > 1. This ensures the existence of several α-UMP tests depending on the N-P formulation of the hypotheses of interest, based on the test statistic d(X) and its sampling distributions:
  d(X) = √n(X̄n − θ0)/√(θ0(1−θ0)) ~ Bin(0, 1; n) when θ = θ0,
  d(X) ~ Bin(δ(θ1), V(θ1); n) when θ = θ1, for θ1 > θ0, where δ(θ1) = √n(θ1 − θ0)/√(θ0(1−θ0)) ≥ 0, V(θ1) = θ1(1−θ1)/[θ0(1−θ0)], 0 < V(θ1) ≤ 1.
In an N-P setup one is faced with several different legitimate formulations pertaining to the two substantive values θA = 1/2 vs. θB = 18/35 defining H0 and H1, in conjunction with γ = [θB − θA] being the primary discrepancy of interest, including those in Table 13.16. The one-sided tests for (i), (iii), and (v) can be shown to be α-level UMP tests, and the two-sided tests for (ii) and (iv) can be shown to be α-level UMPU tests; see Lehmann and Romano (2006).
Example 13.25 Arbuthnot’s value Let us test the hypotheses (i) and (ii) of the simple Bernoulli model (Table 13.8), using the data in the form of n = 30,762 newborns during the period 1993–5 in Cyprus, 16,029 boys and 14,733 girls. In view of the huge sample size, it is
Table 13.16 N-P formulations for θA = 1/2 and θB = 18/35

       Null and alternative               Rejection region
 (i)   H0: θ ≤ θA vs. H1: θ > θA         C1>(α) = {x: dA(x) > cα}
 (ii)  H0: θ = θA vs. H1: θ ≠ θA         C1(α) = {x: |dA(x)| > cα/2}
 (iii) H0: θ ≥ θB vs. H1: θ < θB         C1<(α) = {x: dB(x) < cα}
 (iv)  H0: θ = θB vs. H1: θ ≠ θB         C1(α) = {x: |dB(x)| > cα/2}
 (v)   H0: θ ≤ θB vs. H1: θ > θB         C1>(α) = {x: dB(x) > cα}
advisable to choose a smaller significance level, say α = .01 ⇒ cα = 2.326. The test statistic based on θA = θ0 = .5 yields
  dA(x0) = √30762 (16029/30762 − 1/2)/√(.5(.5)) = 7.389,
and thus the p-values take the form
  (i) pA>(x0) = P(dA(X) > 7.389; H0) = 7.4×10⁻¹⁴ < α = .01,
  (ii) pA≠(x0) = P(|dA(X)| > 7.389; H0) = 1.48×10⁻¹³ < α/2 = .005.   (13.56)
Thus, in both cases, H0: θ = 1/2 is strongly rejected with tiny p-values.
Example 13.26 Bernoulli’s value Testing the Bernoulli value θB = θ0 = 18/35 for the hypotheses (iii)–(v), using the same data as in Example 13.25, yields
  dB(x0) = √30762 (16029/30762 − 18/35)/√((18/35)(1 − 18/35)) = 2.379,
and thus the p-values take the form
  (iii) pB<(x0) = P(dB(X) < 2.379; H0) = .991 > α = .01,
  (iv) pB≠(x0) = P(|dB(X)| > 2.379; H0) = .0174 > α/2 = .005,
  (v) pB>(x0) = P(dB(X) > 2.379; H0) = .009 < α = .01.   (13.57)
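The statistics and p-values of Examples 13.25 and 13.26 can be reproduced with the Normal approximation to Bin(0, 1; n); a minimal sketch:

```python
import math

def Phi(z):
    # standard Normal cdf
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, boys = 30762, 16029
theta_hat = boys / n

def d(theta0):
    # test statistic sqrt(n)(theta_hat - theta0)/sqrt(theta0(1-theta0))
    return math.sqrt(n) * (theta_hat - theta0) / math.sqrt(theta0 * (1 - theta0))

dA, dB = d(0.5), d(18 / 35)
print(round(dA, 3), round(dB, 3))                 # 7.389 and 2.379
p = {
    "(i)  A, right tail": 1 - Phi(dA),
    "(ii) A, two-sided":  2 * (1 - Phi(abs(dA))),
    "(iii) B, left tail": Phi(dB),                # about .991
    "(iv) B, two-sided":  2 * (1 - Phi(abs(dB))), # about .0174
    "(v)  B, right tail": 1 - Phi(dB),            # about .009
}
```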
The results in (13.57) seem rather puzzling, because for the hypotheses (iii)–(iv) H0: θ0 = 18/35 is accepted, but for (v) it is rejected. More puzzling are the three p-values, which vary widely! Having used the same data in five different N-P formulations of the null and alternative hypotheses, with each case giving rise to a different p-value, the first question that arises is which one is the relevant p-value. The above discussion suggests that pB≠(x0), pA≠(x0), and pB<(x0) make no post-data sense, but pA>(x0) and pB>(x0) do. The reason is that the p-value is used for two different purposes. The first is as a supplement to the N-P accept/reject rules to evaluate the minimum threshold at which the null would have been rejected with the particular data. The second is as a post-data indicator of the discordance with the null. In the latter case, the observed value of the test statistic eliminates one of the two tails as irrelevant. Recall that the primary aim is to learn about the true θ. When d(x0) = 2.379 > 0, the true θ could not lie to the left of the null value. Having settled that, the second question pertains to whether data x0 provide evidence for or against the
substantive values θ A = 1/2 and θ B = 18/35. A clear answer to this question will be given in the sequel using the post-data severity evaluation.
13.5.2 Foundational Issues Pertaining to Statistical Testing
13.5.2.1 Fisher and N-P Testing
Let us now appraise how successful the N-P framework was in addressing the perceived weaknesses of Fisher’s testing: [a] Fisher’s choice of test statistics d(X) based on intuitive grounds; [b] his use of a post-data (d(x0) is known) threshold in conjunction with the p-value that indicates discordance (reject) with H0; [c] his denial that d(x0) can indicate accordance with H0. Very briefly, the N-P framework partly succeeded in addressing [a] and [c], but did not provide a coherent evidential account that answers the basic question (Mayo, 1996): when do data x0 provide evidence for or against a hypothesis H? This is primarily because one could not interpret the N-P results “accept H0” (“reject H0”) as data x0 providing evidence for (against) H0 (for H1). Why?
(i) A particular test Tα := {d(X), C1(α)} could have led to “accept H0” because the power of that test to detect an existing discrepancy γ was very low. This can easily happen when the sample size n is small.
(ii) A particular test Tα := {d(X), C1(α)} could have led to “reject H0” simply because the power of that test was high enough to detect “trivial” discrepancies from H0. This can easily happen when the sample size n is very large.
In light of the fact that the N-P framework did not provide a bridge between the accept/reject rules and the questions of interest in scientific inquiry, i.e. when data x0 provide evidence for or against a hypothesis H, it should come as no surprise to learn that most practitioners sought such answers by trying unsuccessfully (and misleadingly) to distill an evidential interpretation out of Fisher’s p-value. In their eyes the pre-designated α is equally vulnerable to manipulation (pre-designation to keep practitioners “honest” is unenforceable in practice), but the p-value is more informative and data-specific; see Lehmann and Romano (2006). A crucial problem with the p-value is that a small (large) p-value could not be interpreted as evidence for the presence (absence) of a substantive discrepancy γ, for the same reasons as (i) and (ii). The power of the test affects the p-value. For instance, a very small p-value can easily arise in the case of a very large sample size n.
13.5.2.2 The Large n Problem
Mayo (2006, p. 809) described the problem as follows: “for any discrepancy from the null, however small, one can find a sample size such that there is a high probability (as high as one likes) that the test will yield a statistically significant result (for any p-value one wishes).”
The above comments raise a crucial problem with the current practice in empirical modeling, which ignores the fundamental trade-off between the type I and II error probabilities by
detaching the inference result – accept/reject H0 or/and small or large p-value – from the particular n and Tα . As shown above, for a particular α an optimal N-P test has power that increases with n. Hence, in practice a rejection at α = .05 with n = 25 is very different from a rejection with n = 2500 from the evidential perspective. That is, for the same discrepancy from the null (θ 0 ), say γ 1 = θ 1 −θ 0 , however small, the power of a Tα test is, say π (γ 1 ) = .2, with n = 25; it could easily become π(γ 1 ) = 1 with n = 2500. It turns out that both the Fisher and N-P accounts of testing are vulnerable to this problem, precluding any principled evidential interpretation of their inference results. A .05 statistical significance or a p-value of .04 depends crucially on n, which suggests that such results cannot be detached from the particular context. The dependence of the p-value on the power of the test, and in particular n, calls into question the disputes concerning the relevance of the alternative and the type II error probabilities. When the testing takes place within the boundaries of a prespecified statistical model Mθ (x), the alternative hypothesis is automatically the complement to the null relative to the particular , and any attempts to exorcize the power of the test as irrelevant are misplaced. To counter the decrease in the p-value as n increases, some textbooks advise practitioners to use rules of thumb based on decreasing α for larger and larger sample sizes; see Lehmann and Romano (2006). Good (1988) proposes standardizing the p-value p(x0 ) to the fixed sample size n = 100 using the rule of thumb
  p100(x0) = min[.5, p(x0)·√(n/100)], n > 10.
Example 13.27 p(x0) = .04 for n = 1000 corresponds to p100(x0) = .126. The severity evaluation discussed below provides a more formal way to take into account the change in the sample size as it affects the power.
13.5.2.3 Fallacies of Acceptance and Rejection
The issues with the accept/reject H0 and p-value results raised above can be formalized into two classic fallacies:
(a) The fallacy of acceptance. No evidence against H0 is misinterpreted as evidence for H0. This fallacy can easily arise in cases where the test in question has low power to detect discrepancies of interest.
(b) The fallacy of rejection. Evidence against H0 is misinterpreted as evidence for a particular H1. This fallacy arises in cases where the test in question has high power to detect substantively minor discrepancies. Since the power of a test increases with the sample size n, this renders N-P rejections, as well as tiny p-values, with large n, highly susceptible to this fallacy.
The above fallacies arise in statistics in a variety of different guises, including the distinction between statistical and substantive significance. A few textbooks in statistics warn readers that one should not conflate the two, but there are no principled ways to address the problem head on. In the statistics literature, as well as in the secondary literatures in several applied fields, there have been numerous attempts to circumvent these two fallacies, but none succeeded until recently. These fallacies can be addressed using the post-data
severity evaluation of inference results (accept/reject, p-values) by offering an evidential account in the form of the discrepancy from the null warranted by the data; see Mayo and Spanos (2006, 2011), Mayo (2018).
13.5.3 Post-Data Severity Evaluation: An Evidential Account
On reflection, the above fallacies stem primarily from the fact that there is a problem when the p-value and the accept/reject H0 results are detached from the test itself. That is, the results are viewed as providing the same evidence for a particular hypothesis H (H0 or H1), regardless of the generic capacity (the power) of the test in question to detect discrepancies from H0. The intuition behind this reflection is that a small p-value or a rejection of H0 based on a test with low power (e.g. a small n) for detecting a particular discrepancy γ provides stronger evidence for the presence of a particular discrepancy γ than using a test with much higher power (e.g. a large n). Mayo (1996) proposed a frequentist evidential account based on harnessing this intuition in the form of a post-data severity evaluation of the accept/reject results. This is based on custom-tailoring the generic capacity of the test to establish the discrepancy γ warranted by data x0. This evidential account can be used to circumvent the above fallacies, as well as other charges against frequentist testing.
The severity evaluation is a post-data appraisal of the accept/reject and p-value results that revolves around the discrepancy γ from H0 warranted by data x0. A hypothesis H passes a severe test Tα with data x0 if:
(S-1) x0 accords with H, and
(S-2) with very high probability, test Tα would have produced a result that accords less well with H than x0 does, if H were false.
Severity can be viewed as a feature of a test Tα as it relates to particular data x0 and a specific claim H being considered. Hence, the severity function has three arguments, SEV(Tα, x0, H), denoting the severity with which H passes Tα with x0; see Mayo and Spanos (2006), Mayo (2018). To explain how the above severity evaluation can be applied in practice, let us return to the problem of assessing the two substantive values of interest θA = 1/2 and θB = 18/35. When these values are viewed in the context of the error-statistical perspective, it becomes clear that the way to frame the probing is to choose one of the values as the null hypothesis and let the difference between them (θB − θA) represent the discrepancy of substantive interest.
Example 13.28 Probing the Arbuthnot value θA using the hypotheses
  (i) H0: θ ≤ θA vs. H1: θ > θA,
  (ii) H0: θ = θA vs. H1: θ ≠ θA
in Example 13.25 gave rise to a rejection of H0 at α = .01 (cα = 2.326), since
  dA(x0) = √30762 (16029/30762 − 1/2)/√(.5(.5)) = 7.389,
  (i) pA>(x0) = P(dA(X) > 7.389; H0) = 7.4×10⁻¹⁴ < α = .01,
  (ii) pA≠(x0) = P(|dA(X)| > 7.389; H0) = 1.48×10⁻¹³ < α/2 = .005.
An important feature of the severity evaluation is that it is post-data, and thus the sign of the observed test statistic d(x0) provides information that indicates the directional inferential claims that “passed.” In relation to the above example, the severity “accordance” condition (S-1) implies that the rejection of θ0 = 1/2 with d(x0) = 7.389 > 0 indicates that the inferential claim that “passed” is of the generic form
  θ > θ1 = θ0 + γ, for some γ ≥ 0.   (13.58)
The directional feature of the severity evaluation is very important in addressing several criticisms of N-P testing, including: [a] switching between one-sided and two-sided, or simple vs. simple hypotheses; [b] interchanging the null and alternative hypotheses; [c] manipulating the level of significance in an attempt to get the desired result. To establish the particular discrepancy γ warranted by data x0, the severity post-data “discordance” condition (S-2) calls for evaluating the probability of the event “outcomes x that accord less well with θ > θ1 than x0 does,” i.e. [x: d(x) ≤ d(x0)]:
  SEV(Tα>; θ > θ1) = P(x: dA(x) ≤ dA(x0); θ > θ1 is false).   (13.59)
Note that SEV(Tα>; x0; θ > θ1) is evaluated at θ = θ1:
  SEV(Tα>; θ > θ1) = P(x: dA(x) ≤ dA(x0); θ = θ1), for θ1 = θ0 + γ,   (13.60)
because the probability decreases with γ and in the case of reject one is seeking the “largest” discrepancy warranted by x0. The evaluation of SEV is based on
  d(X) = √n(θ̂n − θ0)/√(θ0(1−θ0)) ~ Bin(δ(θ1), V(θ1); n) when θ = θ1, for θ1 > θ0,
  δ(θ1) = √n(θ1 − θ0)/√(θ0(1−θ0)) ≥ 0, V(θ1) = θ1(1−θ1)/[θ0(1−θ0)], 0 < V(θ1) ≤ 1.   (13.61)
Since the hypothesis that “passed” is of the form θ > θ1 = θ0 + γ, the objective of SEV(Tα>; θ > θ1) is to determine the largest discrepancy γ ≥ 0 warranted by data x0.
Example 13.28 (continued) For the observed test statistic dA(x0) = 7.389, Table 13.17 evaluates SEV(Tα; θ > θ1) for different values of γ, with the evaluations based on:
  [dA(X) − δ(θ1)]/√V(θ1) ~ Bin(0, 1; n) ≈ N(0, 1) when θ = θ1.   (13.62)
Note that for the above data, the scaling √V(θ1) = .99961 can be ignored.
The evaluation of SEV(Tα> ; γ >γ 1 ), like that of the power, is based on the distribution of the test statistic under the alternative, but unlike power SEV uses dA (x0 ) as the threshold instead of cα . Figure 13.15 plots the severity curve evaluated for several alternatives in
Table 13.17. For γ := (θ1 − θ0) = .015, the evaluation of the severity components yields
  dA(x0) = 7.389, δ(θ1) = √30762(.515 − .5)/√(.5(.5)) = 5.262,
  dA(x0) − δ(θ1) = 7.389 − 5.262 = 2.127.
Hence, the evaluation of SEV(Tα>; γ > γ1) using the N(0, 1) tables yields
  SEV(Tα>; γ > γ1) = P(dA(X) ≤ 7.389; θ1 = .515) = .983,
since Φ(z ≤ 2.127) = .983, where Φ(.) is the cdf of N(0, 1). Similarly, for γ = .0142:
  √n(θ̂n − θ0)/√(θ0(1−θ0)) − δ(θ1) = 7.389 − √30762(.5142 − .5)/√(.5(.5)) = 2.408,
  SEV(Tα>; γ > γ1) = P(dA(X) ≤ 7.389; θ1 = .5142) = .992.
Severity and evidential interpretation. Taking a very high probability, say .95, as a threshold, the largest discrepancy from the null warranted by this data is γ ≤ .01637, since SEV(Tα>; θ > .51637) = .950.

Table 13.17 Severity of “Reject H0: θ = .5 vs. H1: θ > .5” with (Tα>; x0), θ > θ0 + γ

 γ =           .01   .013  .0143  .015  .016  .017  .018  .02   .021  .025
 Sev(θ>θ1) =   .999  .997  .991   .983  .962  .923  .859  .645  .500  .084
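The severity curve of Table 13.17 can be reproduced from (13.61)–(13.62); a minimal sketch, which matches the tabulated values to close approximation:

```python
import math

def Phi(z):
    # standard Normal cdf
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, theta0, d_obs = 30762, 0.5, 7.389

def severity(gamma):
    # SEV(theta > theta0 + gamma) = P(d(X) <= d(x0); theta = theta0 + gamma)
    th1 = theta0 + gamma
    delta = math.sqrt(n) * (th1 - theta0) / math.sqrt(theta0 * (1 - theta0))
    v = th1 * (1 - th1) / (theta0 * (1 - theta0))
    return Phi((d_obs - delta) / math.sqrt(v))

for g in (.01, .013, .0143, .015, .016, .017, .018, .02, .025):
    print(g, round(severity(g), 3))
```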
Is this discrepancy substantively significant? In general, to answer this question one needs to appeal to substantive subject matter information to assess the warranted discrepancy on substantive grounds. In human biology it is commonly accepted that the sex ratio at birth is approximately 105 boys to 100 girls; see Hardy (2002). Translating this ratio in terms of θ yields θ + =105/205=.5122, which suggests that the above warranted discrepancy
[Fig. 13.15 The severity curve (Table 13.17): SEV plotted against the alternative values θ1 ∈ [.510, .530]]
13.5 Error-Statistical Framing of Statistical Testing
603
γ ≤ .01637 is substantively significant, since it outputs θ > .51637, which exceeds θ+. In light of that, the substantive discrepancy of interest between θB and θA, γ* = (18/35) − .5 = .0142857, yields SEV(Tα>; θ > .5143) = .991, indicating that there is excellent evidence for the claim θ > θ1 = θ0 + γ*. In terms of the ultimate objective of statistical inference, learning from data, this evidential interpretation seems highly effective because it narrows down an infinite set to a very small subset!
13.5.4 Revisiting Issues Bedeviling Frequentist Testing
The above post-data evidential interpretation based on the severity assessment can also be used to shed light on a number of issues raised in the previous sections.
13.5.4.1 Addressing the Large n Problem
The post-data severity evaluation of the accept/reject H0 result addresses the large n problem by taking into consideration the generic capacity (power) of the test in evaluating the warranted discrepancy γ* from H0.
Example 13.29 In the context of the simple Normal model (Table 13.10), consider the hypotheses H0: μ ≤ μ0 vs. H1: μ > μ0, for μ0 = 0, α = .025, σ = 2. The severity curves shown in Figure 13.16 are associated with test Tα and are based on the same outcome κ(x0) = 1.96 but different sample sizes (n = 50, n = 150, n = 400), indicating that the severity for inferring μ > .2 decreases as n increases:
  (i) n = 50, SEV(μ > 0.2) = .895; (ii) n = 150, SEV(μ > 0.2) = .769; (iii) n = 400, SEV(μ > 0.2) = .49.
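The three severity values in Example 13.29 follow from SEV(μ > μ1) = P(τ(X) ≤ τ(x0); μ = μ1) = Φ(τ(x0) − √n(μ1 − μ0)/σ); a minimal sketch:

```python
import math

def Phi(z):
    # standard Normal cdf
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu0, sigma, tau_obs, mu1 = 0.0, 2.0, 1.96, 0.2

def severity(n):
    # SEV(mu > mu1) given the same observed outcome tau(x0) = 1.96
    return Phi(tau_obs - math.sqrt(n) * (mu1 - mu0) / sigma)

for n in (50, 150, 400):
    print(n, round(severity(n), 3))   # .895, .769, .484 (reported as .49 in the text)
```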
13.5.4.2 The Arbitrariness of the N-P Specification of Hypotheses The question that naturally arises at this stage is whether having good evidence for inferring θ > 18/35 depends on the particular way one has chosen to specify the hypotheses for
[Fig. 13.16 Severity for μ > .2 with different sample sizes (n = 50, 150, 400): SEV plotted against the alternative values μ ∈ [0, 1]]
the N-P test. Intuitively, one would expect that the evidence should not be dependent on the specification of H0 and H1 as such, but on test Tα, data x0, and the statistical model Mθ(x), x∈R^n_X.
Example 13.30 To explore that issue let us focus on probing the Bernoulli value H0: θ0 = 18/35, where the test statistic in Example 13.26 yields
  dB(x0) = √30762 (16029/30762 − 18/35)/√((18/35)(1 − 18/35)) = 2.379.   (13.63)
This value leads to rejecting H0 at α = .01 when one uses the formulation in (v), but to accepting H0 when formulations (iii) and (iv) are used. Which result should one believe? Irrespective of these conflicting results, the severity “accordance” condition (S-1) implies that, in light of the observed test statistic dB(x0) = 2.379 > 0, the directional claim that “passed” on the basis of (13.63) is of the form θ > θ1 = θ0 + γ. To evaluate the particular discrepancy γ warranted by data x0, condition (S-2) calls for the evaluation of the probability of the event “outcomes x that accord less well with θ > θ1 than x0 does,” i.e. [x: dB(x) ≤ dB(x0)], giving rise to (13.59), but with a different θ0. Table 13.18 lists several such evaluations of
  SEV(Tα>; θ > θ1) = P(dB(X) ≤ 2.379; θ1 = .5143 + γ),
for different values of γ. Note that these evaluations are based on (13.61). A number of negative values of γ are included in order to address the original substantive hypotheses of interest, as well as to bring out the fact that the results of Tables 13.17 and 13.18 are identical when viewed as inferences pertaining to θ; they simply have a different null value θ0 and thus different discrepancies, but identical θ1 values.

Table 13.18 Severity of “Accept H0: θ = 18/35 vs. H1: θ > 18/35” with (Tα>; x0), θ > θ0 + γ

 γ =           −.0043  −.0013  .000  .0007  .0017  .0027  .0037  .0057  .0067  .0107
 Sev(θ>θ1) =   .999    .997    .991  .983   .962   .923   .859   .645   .500   .084
The severity evaluations in Tables 13.17 and 13.18 not only render the choice between (i)–(v) irrelevant, they also scotch the widely used argument pertaining to the asymmetry between the null and alternative hypotheses. The N-P convention of selecting a small α is generally viewed as reflecting a strong bias against rejection of the null. On the other hand, Bayesian critics of frequentist testing argue the exact opposite: N-P testing is biased against accepting the null; see Lindley (1965). The evaluations in Tables 13.17 and 13.18 demonstrate that the evidential interpretation of frequentist testing based on severe testing addresses such asymmetries.
13.5 Error-Statistical Framing of Statistical Testing
13.5.4.3 Addressing the Fallacy of Rejection

The potential arbitrariness of the N-P specification of the null and alternative hypotheses and the associated p-values is brought out in probing the Bernoulli value θB=18/35 using the different formulations of the hypotheses in (13.57). Can the severity evaluation explain away these conflicting and confusing results? The choice (iii), where H1: θ < 18/35, was driven solely by substantive information relating to θA=1/2, which is often a bad idea because it ignores the statistical dimension of inference. The choice (iv), where H1: θ ≠ 18/35, was based on lack of information about the direction of departure, which makes sense pre-data, but not post-data. The choice (v), where H1: θ > 18/35, reflects the post-data direction of departure indicated by dB(x0)=2.379 > 0. In light of the discussion in section 13.5.4, the severity evaluation confirms that pB>(x0)=.009 is the only relevant p-value. This perspective brings out the vulnerability of both the p-value and the N-P reject H0 rule to the fallacy of rejection in cases where n is large. Viewing the relevant p-value (pB>(x0)=.009) from the severity vantage point, it is directly related to H1 passing a severe test. This is because the probability that test Tα would have produced a result that accords less well with H1 than x0 does, i.e. [x: dB(x) ≤ dB(x0)], evaluated under H0, is SEV(Tα>; x0; θ > θ0) = P(dB(X) ≤ dB(x0); θ = θ0) = .991. This suggests that the crucial weakness of the p-value is that it establishes the existence of some discrepancy γ ≥ 0, but provides no information concerning the magnitude warranted by x0. The severity evaluation remedies that by relating Sev(Tα>; x0; θ > θ0)=.991 to the discrepancy γ warranted by data x0, which revolves around the inferential claim θ > θ0+γ. Given that the p-value is evaluated at θ = θ0, the implicit discrepancy associated with the p-value is γ=0. This ignores the generic capacity of the test that gave rise to the rejection of H0.
13.5.4.4 Addressing the Fallacy of Acceptance

Example 13.31  Arbuthnot’s 1710 conjecture reparameterized. The equality of males and females can be tested in the context of a simple Bernoulli model (Table 13.8), but now with {X=1}={female}, {X=0}={male}, using the hypotheses:

(i)  H0: ϕ ≤ ϕA vs. H1: ϕ > ϕA, ϕA = .5,
(ii) H0: ϕ ≥ ϕA vs. H1: ϕ < ϕA,

in terms of ϕ = P(X=1)=E(X). Note that ϕ = 1−θ in terms of the notation in Table 13.8. The best (UMP) tests for (i)–(ii) take the form:

(i)  Tα> := {dA(X) = √n(X̄n−ϕ0)/√(ϕ0(1−ϕ0)), C1>(α) = {x: d(x) > cα}},
(ii) Tα< := {dA(X) = √n(X̄n−ϕ0)/√(ϕ0(1−ϕ0)), C1<(α) = {x: d(x) < −cα}},
where ϕ̂n := X̄n = (1/n)∑ⁿᵢ₌₁ Xᵢ is the best estimator of ϕ and:

dA(X) = √n(X̄n−ϕ0)/√(ϕ0(1−ϕ0)) ∼ Bin(0, 1; n) under H0.   (13.64)
Using data on n=30762 newborns during the period 1993–5 in Cyprus, comprising 16029 boys and 14733 girls, ϕ̂n(x0) = 14733/30762 = .4789, and α=.01 ⇒ cα=2.326, yields:

(i)  dA(x0) = √30762(14733/30762 − .5)/√(.5(.5)) = −7.389 < 2.326,
(ii) dA(x0) = √30762(14733/30762 − .5)/√(.5(.5)) = −7.389 < −2.326,

where (i) indicates accepting H0, but (ii) indicates rejecting H0, confirmed by the associated p-values when interpreted as evaluating the smallest N-P threshold at which the null would have been rejected:

(i)  pA>(x0) = P(dA(X) > −7.389; H0) = 1.0,
(ii) pA<(x0) = P(dA(X) < −7.389; H0) = 2.065 × 10⁻¹⁰.

Do these results contradict each other with respect to ϕA? They agree that, with very high probability, the true value of ϕ, ϕ∗ ∈ [0, .5), or equivalently θ∗ ∈ (.5, 1]. The problem is that the coarseness of these results renders them largely uninformative with respect to the main objective of testing, which is to learn from x0 about ϕ∗. This is the issue the post-data severity evaluation aims to address. It is achieved by supplementing the coarse accept/reject rules with effective probing of the null value ϕA=.5, with a view to outputting the discrepancy warranted by data x0 in the direction indicated by the sign of dA(x0) = −7.389. Hence, as a post-data error probability, the severity evaluation is guided by dA(x0) ≷ 0 and not by the accept/reject H0 results as such, because the latter are driven by their pre-data specification; see Spanos (2013). As argued above, the sign of dA(x0) = −7.389 indicates the relevant direction of departure from ϕA = .5, and thus the p-value that makes sense as a post-data error probability is the one for case (ii), since the values of x∈{0, 1}ⁿ that accord less well with ϕA = .5 lie to the left of that value. Similarly, for evaluating the post-data severity the relevant inferential claim is ϕ ≤ ϕ1 = ϕ0+γ, for some γ ≤ 0:

SEV(Tα>; ϕ ≤ ϕ1) = P(x: dA(x) > dA(x0); ϕ1 = ϕ0+γ),   (13.65)

where the evaluation of SEV is based on (13.61); note that V(ϕ1) ≃ 1. In light of ϕ ≤ ϕ1 = ϕ0+γ, γ ≤ 0, the objective of SEV(Tα>; ϕ ≤ ϕ1) is to determine the largest discrepancy γ ≤ 0 warranted by data x0. The post-data severity evaluations are as follows:
for γ = −.010:   −7.389 − √30762(.49−.5)/√(.5(.5)) = −3.881,    SEV(Tα>; ϕ ≤ ϕ1=.49) = .999995,
for γ = −.014:   −7.389 − √30762(.486−.5)/√(.5(.5)) = −2.478,   SEV(Tα>; ϕ ≤ ϕ1=.486) = .993,
for γ = −.015:   −7.389 − √30762(.485−.5)/√(.5(.5)) = −2.127,   SEV(Tα>; ϕ ≤ ϕ1=.485) = .983,
for γ = −.01636: −7.389 − √30762(.48364−.5)/√(.5(.5)) = −1.65,  SEV(Tα>; ϕ ≤ ϕ1=.48364) = .951,
for γ = −.017:   −7.389 − √30762(.483−.5)/√(.5(.5)) = −1.426,   SEV(Tα>; ϕ ≤ ϕ1=.483) = .923,
for γ = −.019:   −7.389 − √30762(.481−.5)/√(.5(.5)) = −.724,    SEV(Tα>; ϕ ≤ ϕ1=.481) = .766,
for γ = −.0211:  −7.389 − √30762(.4789−.5)/√(.5(.5)) = 0,       SEV(Tα>; ϕ ≤ ϕ1=.4789) = .5,
for γ = −.022:   −7.389 − √30762(.478−.5)/√(.5(.5)) = .328,     SEV(Tα>; ϕ ≤ ϕ1=.478) = .371,
for γ = −.024:   −7.389 − √30762(.476−.5)/√(.5(.5)) = 1.03,     SEV(Tα>; ϕ ≤ ϕ1=.476) = .152,
for γ = −.025:   −7.389 − √30762(.475−.5)/√(.5(.5)) = 1.381,    SEV(Tα>; ϕ ≤ ϕ1=.475) = .084.
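A minimal numerical sketch of these evaluations (the values −7.389, n = 30762, and ϕ0 = .5 come from the text; the variance is evaluated at ϕ0, matching the listing above, and small rounding differences are to be expected):

```python
from math import sqrt
from statistics import NormalDist

# Post-data severity for the claim phi <= phi1 (Cyprus newborn data).
n, d0, phi0 = 30762, -7.389, 0.5   # observed test statistic from the text

def sev_leq(phi1):
    """SEV(phi <= phi1) = P(d(X) > d(x0); phi = phi1), using the Normal
    approximation with the variance evaluated at phi0 (as in the listing)."""
    shift = sqrt(n) * (phi1 - phi0) / sqrt(phi0 * (1 - phi0))
    return 1 - NormalDist().cdf(d0 - shift)

for phi1 in (0.49, 0.48364, 0.4789, 0.475):
    print(f"SEV(phi <= {phi1}) = {sev_leq(phi1):.3f}")
```

Each line reproduces the corresponding entry of the listing above to within rounding.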
Table 13.19 reports the severity curve for various discrepancies, which is plotted in Figure 13.17. Using the severity threshold of .95, the warranted discrepancy from ϕA=.5 can be derived from the equation:

−7.389 − √30762(γ)/√(.5(.5)) = −1.645  →  γ∗ = −.01636,

where the value −1.645 is chosen because 1−Φ(−1.645)=.95; Φ(.) denotes the cumulative distribution function (cdf) of N(0, 1). The inferential claim warranted by data x0 is: ϕ ≤ .48364. Similarly, the severity evaluation of the discrepancy associated with the substantive value of ϕ, ϕ=.4878, has SEV(Tα>; ϕ≤.4878)=.999, which exceeds the .95 threshold. This indicates that the test result based on dA(x0)=−7.389 is both statistically and substantively significant with data x0. Finally, it is important to note that the post-data severity evaluations in Table 13.19 are in complete inferential agreement with those in Tables 13.17 and 13.18.

Table 13.19  Severity of N-P “Accept H0”: H0: ϕ ≤ .5 vs. H1: ϕ > .5 with (Tα>; x0)

ϕ1:          .49   .486  .485  .48364  .483  .481  .4789  .478  .476  .475
Sev(ϕ≤ϕ1):   1.0   .993  .983  .951    .923  .766  .500   .371  .152  .084
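The warranted-discrepancy calculation can be checked by inverting the severity condition at the .95 threshold; the numbers (−7.389, n = 30762, −1.645) are those used in the text:

```python
from math import sqrt

# Solve  d0 - sqrt(n)*gamma/sqrt(.5*.5) = -1.645  for the warranted gamma*.
d0, n, c = -7.389, 30762, -1.645   # observed statistic, sample size, z-value
gamma_star = (d0 - c) * sqrt(0.5 * 0.5) / sqrt(n)
print(round(gamma_star, 5))        # close to the text's gamma* = -.01636
```

This recovers the warranted inferential claim ϕ ≤ .5 + γ∗ ≈ .48364.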
A comparison of Figures 13.17 and 13.15 indicates that their severity curves differ with respect to their slope since their evaluations pertain to the two different tails of the sampling distribution of dA (X) under H1 .
[Figure 13.17: The severity curve (Table 13.19); SEV (0 to 1) plotted against parameter values ϕ1 from 0.470 to 0.500.]
13.5.4.5 The Arbitrariness of the Significance Level

It is often argued by critics of frequentist testing that the choice of the significance level in the context of the N-P approach, or the threshold for the p-value in the case of Fisherian significance testing, is totally arbitrary and vulnerable to manipulation. To see how the severity evaluations can address this problem, let us try to “manipulate” the original significance level (α=.01) to a different one that would alter the accept/reject results for H0: θ = 18/35.

Example 13.32  Looking back at the results in Examples 13.25 and 13.26 for Bernoulli’s conjectured value using the data from Cyprus, it is easy to see that choosing a larger significance level, say α=.02 ⇒ cα=2.053, would lead to rejecting H0 for both N-P formulations, (2-s) (H1: θ ≠ 18/35) and (1-s>) (H1: θ > 18/35), since d(x0) = 2.379 > 2.053; the corresponding p-values are:

(2-s)  p(x0) = P(|d(X)| > 2.379; H0) = .0174,
(1-s>) p>(x0) = P(d(X) > 2.379; H0) = .009.

How does this change the original severity evaluations? The short answer is: it doesn’t! The severity evaluations remain invariant to any changes in α because they depend on the direction of departure indicated by the sign of d(x0)=2.379 > 0, and not on the direction of the alternative or the value of cα. From a post-data perspective, the relevant direction of departure in light of d(x0) > 0 is θ > θ1 = θ0+γ, indicating the generic form of the hypothesis that “passed,” and thus the same evaluations given in Table 13.14 apply.

13.5.4.6 The Severity Principle

Central to the error-statistical approach is the notion of a severe test that provides an effective way to learn from data about certain aspects of the phenomena of interest. An adequate
test of a hypothesis H must be a severe test in the sense that the data x0 must provide good evidence for that claim or hypothesis. The error-statistical perspective suggests that a sufficiently severe test should conform to the severity principle, which has a weak and a strong version; see Mayo (1996; 2018).

Weak severity principle. Data x0 (generated by a statistical model Mθ(x), x∈RnX) do not provide good evidence for hypothesis H if d(x0) results from a test procedure Tα with a very low probability or capacity to uncover the falsity of H, even if H is untrue.

The weak severity principle pertains to circumstances where one should deny that data x0 provide evidence for a hypothesis H. It is called weak because this principle does not call for obviating situations where an agreement between data and a hypothesis occurs when the hypothesis is false. This negative conception is not sufficient for a full account of evidence. The latter is articulated with the strong severity principle.

Strong severity principle. Data x0 (generated by a statistical model Mθ(x), x∈RnX) provide good evidence for a hypothesis H (just) to the extent that test Tα has severely passed H with x0.

With a severely tested hypothesis or inferential claim, the probability is low that test Tα would have passed H if H were false. In addition, the probability that the data accord with the alternative hypothesis must be very low. The severity principle provides the key to the error-statistical account of evidence and underscores the rationale for using error-statistical methods of inference. The relevant error probabilities associated with these statistical procedures provide a measure of their “probativeness” (capacity) to distinguish between different hypotheses, and of how reliably they can detect errors.
13.5.5 The Replication Crises and Severity

The recent debates pertaining to the “replication crises” (Mayo, 2018) blame abuses of significance testing as the main culprit for rendering “most published research findings false”; see Ioannidis (2005). He placed the positive predictive value (PPV) at the center of these debates by arguing that it provides a key measure of the trustworthiness of empirical results. It is defined in terms of the “events” F={H0 is false} and R={test rejects H0}, and takes the conditional probability formulation:

PPV = Pr(F|R) = Pr(R|F)·Pr(F) / [Pr(R|F)·Pr(F) + Pr(R|F̄)·Pr(F̄)],   (13.66)

grounded on the prior probability Pr(F) that a certain proportion, say 20% (Pr(F) = .2), of the “null effects” tested in a particular field, say economics, are believed to be “false.” This is a measure adapted from medical screening that aims to evaluate the reliability of medical diagnostic tests in detecting ailments in patients, revolving around the notions of “false positive” and “false negative.” By assigning hypothetical values to Pr(F), Pr(R|F), and Pr(R|F̄), one can evaluate the degree of untrustworthiness of discipline-wide testing results; lower values of PPV indicate less trustworthy evidence. Ioannidis (2005) concluded that the untrustworthiness of evidence can be traced to known abuses of significance testing, including biased results, small n, p-hacking, multiple testing, cherry-picking, etc.
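The PPV arithmetic in (13.66) is easy to reproduce; the particular numbers below (Pr(F) = .2, Pr(R|F) = .8, Pr(R|F̄) = .05) are illustrative assumptions, not values from the text:

```python
# Hypothetical PPV calculation in the spirit of (13.66); all three input
# probabilities below are assumed for illustration only.
def ppv(pr_false, pr_reject_if_false, pr_reject_if_true_null):
    """Pr(F|R) = Pr(R|F)Pr(F) / [Pr(R|F)Pr(F) + Pr(R|not F)Pr(not F)]."""
    num = pr_reject_if_false * pr_false
    return num / (num + pr_reject_if_true_null * (1 - pr_false))

print(ppv(0.2, 0.8, 0.05))   # about .8: 80% of "discoveries" are real effects
```

With these inputs, one rejection in five would be a “false discovery,” which is the kind of field-wide arithmetic Ioannidis relies on.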
On closer examination, the analogical reasoning behind the adaptation of the PPV from medical screening gives the impression that Pr(R|F) and Pr(R|F̄) relate directly to the frequentist concepts of the power and the significance level of a test, respectively. This semblance, however, is highly misleading.

First, frequentist error probabilities are never defined as conditional on “H0 is true” or “H0 is false,” because the latter do not represent legitimate events in the context of a prespecified statistical model Mθ(x) upon which one can condition. As argued above, “H0 is true” and “H0 is false” represent hypothetical scenarios under which the sampling distribution of the test statistic d(X) is evaluated.

Second, there is no such thing as a discipline-wide false positive/negative associated with generic tests and generic null hypotheses analogous to medical screening devices. One can assert that the false positive rate of a screening device is 5%, but the analogy with frequentist error probabilities (pre-data and post-data) is misplaced because the latter (i) are model [Mθ(x)]-specific and always framed in terms of θ∈Θ, (ii) are always assigned to the inference procedure (never to hypotheses), and (iii) invariably depend on n>1 for the particular data x0; see Spanos (2013a–d). Recall that a screening device has only two outcomes (positive, negative), which is never the case with non-artificial statistical models, since the result of a frequentist test pertains to all values θ∈Θ, usually an uncountable subset of the real line. Indeed, this is the primary source of the fallacies of acceptance and rejection.

Third, the post-data severity evaluation can be used to address the well-known misuses and erroneous interpretations of frequentist testing results. In summary, the PPV constitutes an ad hoc posterior measure of blameworthiness by association based on a Bayesian meta-model for field-wide inferences.
The fact that the probabilities in (13.66) are meaningful in the context of Bayesian inference does not render them relevant for evaluating the reliability of frequentist testing results. Indeed, when viewed from a frequentist perspective, the PPV has nothing to do with unveiling the untrustworthiness of published empirical evidence. Worse, it deflects attention away from certain crucial sources of untrustworthy evidence, including (i) statistical misspecification and (ii) false evidential interpretations of the p-values and accept/reject H0 rules. This suggests refocusing the proposed strategies for securing the trustworthiness of published empirical evidence on a case-by-case basis, by appraising whether the study in question has circumvented or dealt with the potential errors and omissions that could have undermined the reliability of the particular inferences drawn.
13.6 Confidence Intervals and their Optimality

13.6.1 Mathematical Duality Between Testing and CIs

From a purely mathematical perspective, a point estimator is a point-to-point mapping h(·) of the form h(·): RnX → Θ.
Similarly, an interval estimator is a point-to-subset mapping g(·): RnX → Θ such that g⁻¹(θ)={x: θ ∈ g(x)}⊂RnX for θ∈Θ. The pre-image of g(·), g⁻¹(θ)={x: θ ∈ g(x)}, defines a subset of RnX, a legitimate event that can be assigned a probability using factual reasoning:

P(x: θ ∈ g(x); θ = θ∗) := P(θ ∈ g(X); θ = θ∗).   (13.67)

This is the probability that the random set g(X)⊂Θ contains θ∗ (the true value of θ). One can define a (1−α) confidence interval CI(X)=g(X) for θ by attaching a lower bound to this probability: P(θ ∈ g(X); θ = θ∗) ≥ (1−α).

Neyman (1937) noted a mathematical duality between N-P hypothesis testing (HT) and CIs and used it to construct an optimal theory for the latter. The α-level test of the hypotheses H0: θ = θ0 vs. H1: θ ≠ θ0 takes the form {d(X;θ0), C1(θ0; α)}, where d(X;θ0) denotes the test statistic and C1(θ0; α) = {x: |d(x;θ0)| > cα/2} ⊂ RnX the rejection region. The complement of C1(θ0; α) with respect to the sample space (RnX) defines the acceptance region C0(θ0; α) = {x: |d(x;θ0)| ≤ cα/2} = RnX − C1(θ0; α). The mathematical duality between HT and CIs stems from the fact that for each x∈RnX one can define a corresponding set CI(x;α) on the parameter space by CI(x;α) = {θ: x∈C0(θ; α)}, θ∈Θ, e.g. CI(x;α)={θ: |d(x;θ)| ≤ cα/2}. Conversely, for each CI(x;α) there is a corresponding acceptance region C0(θ0; α) such that C0(θ0; α) = {x: θ0 ∈ CI(x;α)}, x∈RnX, e.g. C0(θ0; α)={x: |d(x;θ0)| ≤ cα/2}. This equivalence is encapsulated by:

x∈C0(θ0; α) ⇔ θ0 ∈ CI(x; α), for each x∈RnX and each θ0∈Θ.

It is important to note the blurring of the distinction between the prespecified θ0 and the unknown generic θ. When the lower bound θ̂L(X) and upper bound θ̂U(X) of a CI, i.e.

P(θ̂L(X) ≤ θ ≤ θ̂U(X)) = 1−α,

are monotonically increasing functions of X, the equivalence takes the explicit form:

θ ∈ [θ̂L(x), θ̂U(x)] ⇔ x ∈ [θ̂U⁻¹(θ), θ̂L⁻¹(θ)].
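The two-way duality can be verified with a small numerical check for the simple Normal model (σ known); the particular numbers below are illustrative assumptions:

```python
from math import sqrt

# HT/CI duality check: x lies in the acceptance region C0(theta0; alpha)
# iff theta0 lies in the observed CI. All numbers are illustrative.
n, sigma, c = 100, 1.0, 1.96       # c = c_{alpha/2} for alpha = .05
x_bar = 0.15                       # assumed observed sample mean

def in_acceptance_region(theta0):
    """x in C0(theta0): |d(x; theta0)| <= c_{alpha/2}."""
    return abs(sqrt(n) * (x_bar - theta0) / sigma) <= c

def in_observed_ci(theta0):
    """theta0 in [x_bar -/+ c*sigma/sqrt(n)]."""
    half = c * sigma / sqrt(n)
    return x_bar - half <= theta0 <= x_bar + half

for theta0 in (0.0, 0.1, 0.2, 0.35, 0.5):
    assert in_acceptance_region(theta0) == in_observed_ci(theta0)
print("duality holds for all tested theta0 values")
```

The assertion holds for every θ0, which is exactly the mathematical equivalence displayed above.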
Example 13.33  Consider a simple Bernoulli model (Table 13.8), for which it was shown in Example 13.4 that the relevant estimator and test statistic are X̄n = (1/n)∑ⁿᵢ₌₁ Xᵢ and:

d(X) = √n(X̄n−θ0)/√(θ0(1−θ0)) ∼ Bin(0, 1; n) under θ = θ0.   (13.68)

This implies that the relevant acceptance region of a two-sided α significance level test takes the form:

P(θ0 − cα/2·√(θ0(1−θ0))/√n ≤ X̄n < θ0 + cα/2·√(θ0(1−θ0))/√n; θ = θ0) = 1−α.   (13.69)

On the other hand, the relevant pivot for a two-sided CI is:

d(X; θ) = √n(X̄n−θ)/√(θ(1−θ)) ∼ Bin(0, 1; n) under θ = θ∗.

A uniformly most accurate (1−α) CI (θ̂L(X), θ̂U(X)) has the shortest expected length among all (1−α) CIs (θ̃L(X), θ̃U(X)):

Eθ∗(θ̂U(X) − θ̂L(X)) ≤ Eθ∗(θ̃U(X) − θ̃L(X)).

Example 13.34 (continued)  The expected length of the CI in (13.71) is:

E[(X̄n + cα/2·(s/√n)) − (X̄n − cα/2·(s/√n))] = 2cα/2·(σ/√n).
13.6.3 Confidence Intervals vs. Hypothesis Testing

Does this mathematical duality render HT and CIs equivalent in terms of their respective inferential assertions? No! The coverage probability provides a very different inferential assertion from the type I and II error probabilities. The difference stems from the underlying reasoning: hypothetical (HT) vs. factual (CI). The primary source of the confusion on this issue emanates from the fact that they often share a common pivotal function, and their error probabilities are evaluated using the same tail areas, but that is where the analogy ends.

Example 13.35  To bring out how their inferential assertions and respective error probabilities are fundamentally different, consider the simple (one-parameter) Normal model (Table 13.10), where the two relevant sampling distributions, based on hypothetical and factual reasoning, respectively, are:

(HT) d(X) = √n(X̄n−μ0)/σ ∼ N(0, 1) under μ=μ0,  (CI) d(X;μ) = √n(X̄n−μ)/σ ∼ N(0, 1) under μ=μ∗.   (13.72)

Putting the (1−α) CI side by side with the (1−α) acceptance region:

(CI) P(X̄n − cα/2·(σ/√n) ≤ μ < X̄n + cα/2·(σ/√n); μ = μ∗) = 1−α,
(HT) P(μ0 − cα/2·(σ/√n) ≤ X̄n < μ0 + cα/2·(σ/√n); μ = μ0) = 1−α,   (13.73)

it becomes clear that their inferential claims are very different.

(CI) The CI asserts that the random bounds [X̄n ± cα/2(σ/√n)], when evaluated under μ=μ∗, will cover (overlay) the true μ∗ with probability (1−α).

(HT) The acceptance region of the test asserts that, with probability (1−α), test Tα would yield a result that accords equally well or better with H0 (i.e. [x: |d(x;μ0)| ≤ cα/2]), evaluated under the scenario that H0: μ=μ0 is true.

How are these two scenarios related? The evaluation under μ=μ0 can be reinterpreted as a “what if μ=μ∗” assertion. This enables one to blur the factual with the hypothetical reasoning and vice versa. Having said that, there is usually an infinity of hypothetical values for μ0 in R, but only one true value μ∗. This renders the HT hypothetical reasoning meaningful both pre-data and post-data; hence the pre-data type I and II error probabilities and the post-data p-value and severity evaluations. In contrast, the single scenario μ = μ∗ has occurred post-data, rendering the coverage error probability degenerate, i.e. the observed CI

[x̄n − cα/2·(σ/√n), x̄n + cα/2·(σ/√n)]   (13.74)

either includes or excludes the true value μ∗.
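The factual (CI) assertion in (13.73) can be illustrated by simulation: over repeated samples the random bounds cover the true μ∗ roughly (1−α) of the time, yet each individual observed interval either contains μ∗ or it does not. All numbers below are illustrative assumptions:

```python
import random
from math import sqrt

# Coverage of the (1-alpha) CI for mu in the simple Normal model (sigma known).
random.seed(0)
mu_star, sigma, n, c = 0.0, 1.0, 25, 1.96   # illustrative values
reps, covered = 2000, 0
for _ in range(reps):
    x_bar = sum(random.gauss(mu_star, sigma) for _ in range(n)) / n
    half = c * sigma / sqrt(n)
    covered += (x_bar - half <= mu_star <= x_bar + half)
print(covered / reps)   # close to 1 - alpha = .95
```

The relative frequency is a property of the procedure over repeated samples; for any single observed interval the coverage "probability" has degenerated to zero or one.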
13.6.4 Observed Confidence Intervals and Severity

In addition to addressing the fallacies of acceptance and rejection, the post-data severity evaluation can be used to address the issue of degenerate post-data error probabilities and the inability to distinguish between different values of μ within an observed CI; see Mayo and Spanos (2006). This is achieved after two key changes.

(a) Replace the factual reasoning underlying CIs, which becomes degenerate post-data, with the hypothetical reasoning underlying the post-data severity evaluation.

(b) Replace any claims pertaining to overlaying the true μ∗ with severity-based inferential claims of the form:

μ > μ1 = μ0+γ, or μ ≤ μ1 = μ0+γ, for some γ ≥ 0.   (13.75)

This can be achieved by relating the observed bounds

x̄n ± cα/2·(σ/√n)   (13.76)

to particular values of μ1 associated with (13.75), say μ1 = x̄n − cα/2·(σ/√n), and evaluating the post-data severity of the inferential claim μ > μ1 = x̄n − cα/2·(σ/√n). A moment’s reflection, however, suggests that the connection between establishing the warranted discrepancy γ from μ0 and the observed CI (13.76) is more apparent than real. The severity assessment of μ > μ1 is based on post-data hypothetical reasoning and does not pertain directly to μ1 or μ∗, but to the warranted discrepancy γ∗ = μ1 − μ0 in light of x0. The severity evaluation probability relates to the inferential claim (13.75), and has nothing to do with the coverage probability. Without the coverage probability there is no interpretation one can associate with a CI, because μ > μ1 pertains to discrepancies from the null. Equating μ1 to an observed CI bound does nothing to change that. The equality of the tail areas stems from the mathematical duality, but that does not imply inferential duality.
13.6.5 Fallacious Arguments for Using CIs

A. Confidence intervals are more reliable than p-values? It is often argued in the social science statistical literature that CIs are more reliable than p-values because the latter are vulnerable to the large n problem: as the sample size n → ∞, the p-value becomes smaller and smaller. Hence, one would reject any null hypothesis given enough observations. What these critics do not seem to realize is that a CI like

P(X̄n − cα/2·(σ/√n) ≤ μ ≤ X̄n + cα/2·(σ/√n); μ = μ∗) = 1−α

is equally vulnerable to the large n problem, since its length

[X̄n + cα/2·(σ/√n)] − [X̄n − cα/2·(σ/√n)] = 2cα/2·(σ/√n) → 0 as n → ∞.
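The point that CIs are equally vulnerable can be illustrated numerically; the fixed discrepancy x̄n − μ0 = .01 and the sample sizes below are illustrative assumptions:

```python
from math import sqrt
from statistics import NormalDist

# Both the two-sided p-value (for a fixed small discrepancy) and the CI length
# are driven by sqrt(n); sigma and the discrepancy .01 are assumed values.
sigma, c, discrepancy = 1.0, 1.96, 0.01
for n in (100, 10_000, 1_000_000):
    d = sqrt(n) * discrepancy / sigma
    p_value = 2 * (1 - NormalDist().cdf(d))      # shrinks as n grows
    ci_length = 2 * c * sigma / sqrt(n)          # also shrinks as n grows
    print(f"n = {n:>9}: p-value = {p_value:.4f}, CI length = {ci_length:.4f}")
```

As n grows, any fixed discrepancy becomes "statistically significant," while the observed CI shrinks toward a point; neither output, by itself, conveys the substantive magnitude of the discrepancy.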
B. The middle of an observed CI is more probable? Despite the obvious fact that post-data the probability of coverage is either zero or one, there have been numerous attempts to discriminate among the different values of μ within an observed CI: “the difference
between population means is much more likely to be near the middle of the confidence interval than towards the extremes” (Altman et al., 2000, p. 22). This claim is clearly fallacious. A moment’s reflection suggests that viewing all possible observed CIs (13.74) as realizations from the single distribution in (13.72) leads one to the conclusion that, unless the particular x̄n happens (by accident) to coincide with the true value μ∗, values to the right or the left of x̄n will be more likely depending on whether x̄n is to the left or the right of μ∗. Given that μ∗ is unknown, however, such an evaluation cannot be made in practice with a single realization. Equally misplaced is the move to place the likelihood function L(μ, σ²; x0) over the observed CI in a dire attempt to justify assigning “likelihood” to different values of μ, and inferring that the values of μ around the ML estimate are more likely than those closer to the bounds of the observed CI, producing the so-called cat’s-eye diagram; see Cumming (2012). This is a misguided move since L(μ, σ²; x0) has nothing to do with the coverage probability. It’s like putting lipstick on a pig; it’s still a pig! One might as well ignore the observed CI altogether and just use L(μ, σ²; x0) to evaluate the relative likelihood of different values of μ, as proposed by likelihoodists; see Royall (1997). Both such attempts are misplaced since any inference based on L(μ, σ²; x0) is plagued by a serious flaw: the ML estimate is always the maximally likely value; see Barnard (1967). This distorts any likelihood-based comparisons, including likelihood ratios and Bayes factors. This is because such comparisons would only be appropriate if it were warranted to infer that the ML estimate is approximately equal (or very close) to μ∗. As argued in Chapters 11–12, no such inferential claim is warranted in point estimation; see Spanos (2013b–c).
13.7 Summary and Conclusions

In frequentist inference, learning from data x0 about the stochastic phenomenon of interest is accomplished by applying optimal inference procedures with ascertainable error probabilities in the context of a statistical model Mθ(x). Hypothesis testing gives rise to learning from data by partitioning Mθ(x):

M0(x) = {f(x; θ), θ∈Θ0} or M1(x) = {f(x; θ), θ∈Θ1}, x∈RnX,   (13.77)

and framing the hypotheses in terms of θ: H0: θ∈Θ0 vs. H1: θ∈Θ1. A test Tα = {d(X), C1(α)} is defined in terms of a test statistic and a rejection region, and its optimality (effectiveness) is calibrated in terms of the relevant type I and II error probabilities evaluated using hypothetical reasoning. These error probabilities specify how often these procedures lead to erroneous inferences, and thus determine the power of the test: π(θ1) = P(x∈C1; θ = θ1) = 1−β(θ1), ∀θ1∈Θ1. An inference is reached by an inductive procedure which, with high probability, will reach true conclusions from valid premises Mθ(x). Hence, the trustworthiness of frequentist inference depends on two pre-conditions: (a) employing optimal inference procedures, (b) after securing the adequacy of Mθ(x).

Addressing foundational issues. The first such issue concerns the presumption that the true M∗(x) lies within the boundaries of Mθ(x). This can be addressed by securing the statistical
adequacy of Mθ(x) vis-à-vis data x0, using trenchant M-S testing, before applying frequentist testing; see Chapter 15. As shown above, when Mθ(x) is misspecified, the nominal and actual error probabilities can be very different, undermining the reliability of the test in question. The second issue concerns the form and nature of the evidence x0 can provide for θ∈Θ0 or θ∈Θ1. Neither the p-value nor the N-P accept/reject rules provide an evidential interpretation, primarily because they are highly vulnerable to the fallacies of acceptance and rejection. These fallacies, however, can be circumvented using a post-data severity evaluation of the p-value and accept/reject results to output the discrepancy γ from the null warranted by data x0. This warranted inferential claim, for a particular test Tα and data x0, gives rise to learning from data. The post-data severity evaluation was shown to address not only the classic fallacies of acceptance and rejection, but several other foundational problems bedeviling frequentist testing since the 1930s. Additional references: Cox (1977), Silvey (1975), Mayo (2018).
Important Concepts

Pearson’s chi-square test, Fisher’s recasting of statistical induction, Fisher’s test, Student’s t distribution, null hypothesis, factual reasoning, hypothetical reasoning, test statistic d(X), the distribution of d(X) under the null (H0), post-data p-value, rejection threshold, the large n problem, replication, blocking, randomization, confounding, archetypal Neyman–Pearson (N-P) formulation, N-P test, alternative hypothesis (H1), likelihood ratio, H0 and H1 partitioning the parameter space, the acceptance and rejection regions partitioning the sample space (RnX), testing within a statistical model, type I and II error probabilities, power of a test, the distribution of d(X) under the alternative (H1), uniformly most powerful (UMP) test, consistent test, unbiased test, monotone likelihood ratio, convex alternative parameter space, fallacy of acceptance, fallacy of rejection, post-data severity evaluation, warranted discrepancy from the null, severity principle, uniformly most accurate (UMA) confidence intervals, duality between hypothesis testing and confidence intervals, Bayes factor and Bayesian testing, the Behrens–Fisher problem.

Crucial Distinctions

Factual vs. hypothetical reasoning, simple vs. composite hypotheses, testing within vs. testing outside a statistical model, significance level vs. p-value, pre-data vs. post-data error probabilities, fallacy of acceptance vs. fallacy of rejection, statistical vs. substantive significance, hypothesis testing vs. confidence intervals.

Essential Ideas

● Hypothesis testing, based on hypothetical reasoning, constitutes the most powerful and flexible inference procedure in learning from data about the stochastic generating mechanism M∗(x), x∈RnX, that gave rise to data x0. Pre-data error probabilities calibrate the generic capacity of a test to achieve that objective.

● The p-value is the probability of all possible outcomes x∈RnX that accord less well with H0 than x0 does, when H0 is true. Any attempt to assign it an evidential interpretation is doomed, because the concept ignores the generic capacity of the test; the large n problem highlighted that weakness. Since the p-value is a post-data error probability, it cannot be two-sided; the observed d(x0) designates the only relevant side.

● The N-P reframing recasts Fisher’s testing by partitioning the parameter and sample spaces, rendering N-P testing strictly within the boundaries of Mθ(x). In contrast, misspecification (M-S) testing constitutes probing outside Mθ(x).

● In N-P testing the null and alternative hypotheses should constitute a partition of the parameter space of Mθ(x). For statistical purposes all possible values θ∈Θ are relevant for N-P testing, irrespective of whether only one or a few values are of substantive interest.

● There can be no legitimate evidential interpretation of the accept/reject results and the p-value that can address the fallacies of acceptance and rejection without accounting for the generic capacity (power) of the test applied to the particular data x0.

● The post-data severity evaluation provides an evidential interpretation of statistical significance/insignificance by outputting the warranted discrepancy from the null, γ1 = θ1 − θ0, after taking into account the generic capacity of the particular test Tα and data x0.

● Egregious and misleading claims in frequentist hypothesis testing: (a) the type I and II error probabilities and the p-value constitute conditional probabilities; (b) the type I and II error probabilities and the p-value can be assigned to θ; (c) confidence intervals are more reliable than p-values; (d) the p-value provides a standalone measure of evidence against the null; (e) the N-P lemma yields a UMP test for any two simple hypotheses, regardless of Mθ(x) = {f(x; θ), θ∈Θ}, x∈RnX; (f) statistical significance can be detached from Mθ(x) and data x0, hence ***, **, *. Rejecting the null at α = .05 with n = 40 or with n = 10000 are very different on evidential grounds.
13.8 Questions and Exercises 1. What are the similarities and differences between Karl Pearson’s chi-square test and Fisher’s t-test for μ = μ0 . 2. In the case of the simple Normal model, Gosset has showed that for any n > 1: √ 1 n 1 n n(X n −μ) (13.78) St(n−1), X n = n t=1 Xt , s2 = n−1 i=1 (Xi −X n )2 . s (a) Explain why the result in (13.78), as it stands, is meaningless unless it is supplemented with the reasoning underlying it. (b) Explain how interpreting (13.78) using factual reasoning yields a pivotal quantity, and using hypothetical reasoning yields a test statistic. 3. (a) Define and explain the key components introduced by Fisher in specifying his significance testing. (b) Apply a Fisher significance test for H0 : θ = .5, in the context of a simple Bernoulli model, using the new births data for Cyprus below, where θ = P(X = 1), with {X = 1} = {male}.
Hypothesis Testing

Data on newborns (1995):

Year    Males    Females    Total
1995    5152     4717       9869
(c) "The p-value could not provide an evidential account for Fisher's testing because it is vulnerable to the large n problem." Discuss.
4. (a) Define and explain the notion of the p-value. (b) Using your answer in (a), explain why each of the following interpretations of the p-value is erroneous: (i) the p-value is the probability that H0 is true; (ii) 1 − p(x0) is the probability that H1 is true; (iii) the p-value is the conditional probability of obtaining a test statistic d(x) more extreme than d(x0), given H0. (c) Define the large n problem and explain how it affects the p-value and the N-P accept/reject rules.
5. (a) Explain the new features the N-P framing introduced into Fisher's testing and for what purpose. In what respects did it modify Fisher's testing? (b) Compare and contrast the N-P significance level and Fisher's p-value. (c) Explain why the archetypal N-P formulation of the null and alternative hypotheses is ultimately about learning from data by narrowing down the original statistical model in search of the true statistical Data Generating Mechanism. (d) Explain why an N-P test is not just a formula associated with probability tables. (e) Explain the concepts of a simple and a composite hypothesis.
6. (a) Explain the notions of a type I and type II error. Why does one increase when the other decreases? (b) How does the Neyman–Pearson procedure solve the problem of a trade-off between the type I and type II errors? (c) Compare and contrast the power of a test at a point and the power function. (d) What do we mean by a uniformly most powerful test? (e) Explain the concepts of (i) an unbiased test and (ii) a consistent test.
7. (a) In the case of the simple (one parameter) Normal model, explain how the sampling distribution of the test statistic √n(X̄n − μ0)/σ changes when evaluated under the null and under the alternative: H0: μ = μ0 vs. H1: μ > μ0.
(b) In the context of the simple (one parameter) Normal model (σ = 1), test these hypotheses for μ0 = .5, x̄n = .789, n = 100, and α = .025, using the optimal N-P test. (c) Evaluate the power of the optimal N-P test in (b) for μ1 = .51, .7, 1 and plot the power curve. (d) Evaluate how large the sample size n needs to be for one to be able to detect the discrepancy of interest μ1 − μ0 = .2 with probability ≥ .8.
8. (a) State and explain the N-P lemma, paying particular attention to the framing of the null and alternative hypotheses.
(b) State and explain the two conditions needed to extend the N-P lemma to more realistic cases for the existence of these α-UMP tests. (c) In the case of the simple Normal model, the two formulations of the null and alternative hypotheses (i) H0: μ ≤ μ0 vs. H1: μ > μ0 and (ii) H0: μ = μ0 vs. H1: μ > μ0 give rise to the same optimal test. Explain why the t-test is UMP in this case. (d) Explain why, in testing the hypotheses (iii) H0: μ = μ0 vs. H1: μ ≠ μ0, the optimal t-test for (i)–(ii) in (c) is biased when used to test (iii). Suggest an unbiased test for (iii).
9. (a) In the context of the simple Bernoulli model, explain how you would reformulate the null and alternative hypotheses when the substantive hypotheses of interest are H0: θ = .5 vs. H1: θ = 18/35. (b) Using your answer in (a), apply the test for α = .01 using the following data:
Data on newborns (1994):

Year    Males    Females    Total
1994    5335     5044       10379
10. (a) State and explain the fallacies of acceptance and rejection and relate your answer to the distinction between statistical and substantive significance. (b) Explain how the post-data severity assessment of the N-P accept/reject results can be used to address both the fallacies of acceptance and rejection. (c) Use the post-data severity assessment to explain why the p-value is vulnerable to both fallacies.
11. (a) Explain the concept of a post-data severity assessment and use it, in the context of a simple Bernoulli model, to evaluate the severity of the N-P decision for the hypotheses H0: θ ≤ .5 vs. H1: θ > .5, using the data in Table 13.2, by evaluating the different discrepancies γ = .01, .014, .0145, .015 associated with θ1 = θ0 + γ. (b) Compare and contrast the severity curve in (a) with the power curve for the same discrepancies γ = .01, .014, .0145, .015. (c) What is the discrepancy γ from θ0 = .5 warranted by the data in Table 13.1 for a severity threshold of .95?
12. (a) Explain the notions of testing within and testing outside the boundaries of a statistical model in relation to Neyman–Pearson (N-P) and misspecification (M-S) testing. (b) Specify the generic form of the null and alternative hypotheses in M-S testing for a statistical model Mθ(z) and compare it with that of a proper N-P test. (c) Using your answers in (a)–(b), explain why M-S testing is particularly vulnerable to the fallacy of rejection.
13. (a) Explain the likelihood ratio test procedure and comment on its relationship to the Neyman–Pearson lemma. (b) Explain why, when the postulated statistical model is misspecified, all Neyman–Pearson type tests will be invalid.
14. (a) "Confidence intervals are more reliable than p-values and are not vulnerable to the large n problem." Discuss. (b) "The middle of an observed CI is more probable than any other part of the interval." Discuss.
Appendix 13.A: Testing Differences Between Means

13.A.1 Testing the Difference Between Two Means
Let us return to Edgeworth's problem (Section 13.2.1) of testing the difference between two means in the context of a simple bivariate Normal model. Consider the case where the data z0 := (x1, ..., xn1; y1, ..., yn2) constitute a realization of a sample Z := (X1, X2, ..., Xn1; Y1, Y2, ..., Yn2), assumed to be NIID:

[i] Z ∼ NIID(μ, Σ), μ := (μ1, μ2)′, Σ := [σij], i, j = 1, 2, t = 1, 2, ..., n, ...,
[ii] σ11 = σ22 = σ², [iii] σ12 = 0,   (13.A.1)

where μ1 = E(Xt), μ2 = E(Yt), σ11 = Var(Xt), σ22 = Var(Yt), σ12 = Cov(Xt, Yt). For γ = (μ1 − μ2), let the hypotheses of interest be H0: γ = γ0 vs. H1: γ ≠ γ0. Defining the difference dn = (X̄n − Ȳn), it follows that its sampling distribution is

dn ∼ N(γ, σ²(1/n1 + 1/n2)), γ = (μ1 − μ2), X̄n = (1/n1) Σi Xi, Ȳn = (1/n2) Σi Yi.

Standardizing dn yields (dn − γ)/(σ√(1/n1 + 1/n2)) ∼ N(0, 1). Estimate σ² using the pooled unbiased estimator

s² = [(n1−1)s1² + (n2−1)s2²]/(n1 + n2 − 2), s1² = (1/(n1−1)) Σi (Xi − X̄n)², s2² = (1/(n2−1)) Σi (Yi − Ȳn)²,
whose sampling distribution is (n1 + n2 − 2)s²/σ² ∼ χ²(n1 + n2 − 2). Since σ cancels out, we deduce the Student's t distribution for the ratio

[(dn − γ)/(σ√(1/n1 + 1/n2))] / √[(n1 + n2 − 2)s²/((n1 + n2 − 2)σ²)] = (dn − γ)/(s√(1/n1 + 1/n2)) ∼ St(n1 + n2 − 2).

Test 1. An optimal (UMPU) α significance level test in this case is

τ(Z) = [(X̄n − Ȳn) − γ0]/(s√(1/n1 + 1/n2)) ∼ St(n1 + n2 − 2) under γ = γ0, C1 = {z: |τ(z)| > cα/2},
τ(Z) ∼ St(δ1; n1 + n2 − 2), δ1 = (γ1 − γ0)/(σ√(1/n1 + 1/n2)), under γ = γ1, for γ1 ≠ γ0.   (13.A.2)
Test 2. When n1 = n2 = n, which implies that 1/n1 + 1/n2 = 2/n:

τ(Z) = √n[(X̄n − Ȳn) − γ0]/(√2·s) ∼ St(2n − 2) under γ = γ0,
τ(Z) ∼ St(δ1; 2n − 2), δ1 = √n(γ1 − γ0)/(√2·σ), under γ = γ1, for γ1 ≠ γ0.

Test 3. When σ² is known:

d(Z) = [(X̄n − Ȳn) − γ0]/(σ√(1/n1 + 1/n2)) ∼ N(0, 1) under γ = γ0, γ = (μ1 − μ2), C1 = {z: |d(z)| > cα/2},
d(Z) ∼ N(δ1, 1), δ1 = (γ1 − γ0)/(σ√(1/n1 + 1/n2)), under γ = γ1, for γ1 ≠ γ0.

For the one-sided hypotheses H0: γ ≤ γ0 vs. H1: γ > γ0, the above tests define α-level UMP tests with the rejection region C1 = {z: τ(z) > cα}.
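A minimal Python sketch of Test 1's arithmetic (the book contains no code; the two samples below are made up for illustration):

```python
from math import sqrt

def pooled_t(x, y, gamma0=0.0):
    """Test 1: tau(Z) = ((xbar - ybar) - gamma0) / (s*sqrt(1/n1 + 1/n2)),
    with pooled variance s^2 = ((n1-1)s1^2 + (n2-1)s2^2)/(n1+n2-2);
    under H0 the statistic is St(n1+n2-2)."""
    n1, n2 = len(x), len(y)
    xbar, ybar = sum(x) / n1, sum(y) / n2
    s1_sq = sum((v - xbar) ** 2 for v in x) / (n1 - 1)
    s2_sq = sum((v - ybar) ** 2 for v in y) / (n2 - 1)
    s_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
    tau = ((xbar - ybar) - gamma0) / sqrt(s_sq * (1 / n1 + 1 / n2))
    return tau, n1 + n2 - 2  # statistic and degrees of freedom

# hypothetical data
tau, df = pooled_t([5.1, 4.9, 5.3, 5.0], [4.6, 4.8, 4.5, 4.7])
```

Comparing `tau` against the St(df) critical value cα/2 completes the two-sided test.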
13.A.2 What Happens when Var(Xt) ≠ Var(Yt)?
13.A.2.1 The Behrens–Fisher Problem
Consider extending the above bivariate Normal model to the case where

Zt := (Xt; Yt) ∼ NIID((μ1, μ2)′, diag(σ1², σ2²)), t = 1, 2, ..., n, ...,   (13.A.3)
E(Xt) = μ1, E(Yt) = μ2, Var(Xt) = σ1², Var(Yt) = σ2², Cov(Xt, Yt) = 0.

Let the hypotheses of interest be H0: γ = γ0 vs. H1: γ ≠ γ0, where γ = (μ1 − μ2). Defining the difference dn = (X̄n − Ȳn), it follows that

dn ∼ N(γ, σ1²/n1 + σ2²/n2), γ := (μ1 − μ2), X̄n = (1/n1) Σi Xi, Ȳn = (1/n2) Σi Yi.

Standardizing dn yields (dn − γ)/√(σ1²/n1 + σ2²/n2) ∼ N(0, 1). One can then estimate σ1² and σ2² using

s1² = (1/(n1−1)) Σi (Xi − X̄n)², s2² = (1/(n2−1)) Σi (Yi − Ȳn)²,

which give rise to unbiased estimators, i.e. E(s1²) = σ1² and E(s2²) = σ2², with sampling distributions

(n1−1)s1²/σ1² ∼ χ²(n1 − 1), (n2−1)s2²/σ2² ∼ χ²(n2 − 1).

Hence, one can deduce that [(n1−1)s1²/σ1² + (n2−1)s2²/σ2²] ∼ χ²(n1 + n2 − 2). The following ratio is therefore Student's t distributed:

[(dn − γ)/√(σ1²/n1 + σ2²/n2)] / √{[(n1−1)s1²/σ1² + (n2−1)s2²/σ2²]/(n1 + n2 − 2)} ∼ St(n1 + n2 − 2).   (13.A.4)
Unfortunately, the unknown variances σ1² and σ2² do not cancel out, as in the case of test 1 (see (13.A.2)). Hence, the ratio (13.A.4) is not a test statistic, since it depends on unknown parameters! This is a famous puzzle in statistics known as the Behrens–Fisher problem. To transform it into a statistic one needs to know the ratio v = (σ1²/σ2²), which (substituting σ1² = v·σ2² in (13.A.4)) will then yield

√(n1 + n2 − 2)·(dn − γ) / √{(v/n1 + 1/n2)·[(n1−1)s1²/v + (n2−1)s2²]} ∼ St(n1 + n2 − 2),

which is a proper test statistic only when v is known. Several suggestions have been proposed in the statistics literature on how to overcome the above problem; see Welch (1947) and Scheffé (1944). One might be tempted to choose the distance function (ratio)

(dn − γ)/√(s1²/n1 + s2²/n2) ∼ D(·),   (13.A.5)

with a view to avoiding the problem, but unfortunately the distribution D(·) of (13.A.5) is not Student's t. Despite that, a second-best solution might be to use its asymptotic distribution

(dn − γ)/√(s1²/n1 + s2²/n2) ∼ N(0, 1) as n→∞,

in cases where n1 + n2 is large enough, but it is not an optimal test!
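The ratio in (13.A.5) is trivial to compute; a sketch (with made-up data, and keeping the text's caveat in mind that the N(0, 1) reference distribution is only a large-sample approximation, not an optimal test):

```python
from math import sqrt

def unequal_variance_stat(x, y, gamma0=0.0):
    """The ratio (13.A.5): (dn - gamma0)/sqrt(s1^2/n1 + s2^2/n2).
    Its exact distribution is not Student's t; for large n1 + n2
    it is approximately N(0, 1)."""
    n1, n2 = len(x), len(y)
    xbar, ybar = sum(x) / n1, sum(y) / n2
    s1_sq = sum((v - xbar) ** 2 for v in x) / (n1 - 1)
    s2_sq = sum((v - ybar) ** 2 for v in y) / (n2 - 1)
    return (xbar - ybar - gamma0) / sqrt(s1_sq / n1 + s2_sq / n2)

# hypothetical unbalanced samples (n1 = 4, n2 = 3)
stat = unequal_variance_stat([5.1, 4.9, 5.3, 5.0], [4.6, 4.8, 4.5])
```

Welch (1947) refines this by replacing N(0, 1) with a Student's t reference on approximate degrees of freedom.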
13.A.3 Bivariate Normal Model: Paired Sample Tests
Consider a statistical model that assumes that the data z0 := (x1, x2, ..., xn; y1, y2, ..., yn) constitute a realization of a sample Z := (Z1, Z2, ..., Zn), where Zt := (X1t, X2t), t = 1, 2, ..., n, from a bivariate Normal and independent distribution with different means and variances and a non-zero covariance:

Zt := (X1t; X2t) ∼ NIID((μ1, μ2)′, [σ1², ρσ1σ2; ρσ1σ2, σ2²]), t = 1, 2, ..., n, ...,
E(X1t) = μ1, E(X2t) = μ2, Var(X1t) = σ1², Var(X2t) = σ2², Cov(X1t, X2t) = ρσ1σ2.

Test 5. Define a new random variable

Yt = (X1t − X2t) ∼ NIID((μ1 − μ2), σ²), t = 1, 2, ..., n, ...,

where σ² = σ1² + σ2² − 2ρσ1σ2. A good estimator of σ² is

s² = (1/(n−1)) Σt (Yt − Ȳn)² = (1/(n−1)) Σt [X1t − X2t − (X̄1n − X̄2n)]²
   = (1/(n−1)) Σt [(X1t − X̄1n) − (X2t − X̄2n)]²
   = (1/(n−1)) Σt [(X1t − X̄1n)² + (X2t − X̄2n)² − 2(X1t − X̄1n)(X2t − X̄2n)].

The optimal (UMPU) test takes the form:

τ(Y) = √n(Ȳn − γ0)/s ∼ St(n − 1) under γ = γ0, γ = (μ1 − μ2), C1 = {y: |τ(y)| > cα/2},
τ(Y) ∼ St(δ1; n − 1), δ1 = √n(γ1 − γ0)/σ, under γ = γ1, for γ1 ≠ γ0.   (13.A.6)
Test 6. Consider a special case of the above test where ρ = 0. It is interesting to note that ρ = 0 is a special case of the problematic test 4 (Behrens–Fisher problem) with one additional restriction: n1 = n2 = n. This restriction is needed to be able to pair the two subsamples. It turns out that the problem arising in test 4 is easy to address, because test 5 in (13.A.6) applies equally well to the case ρ = 0. This is because s² = (1/(n−1)) Σi (Yi − Ȳn)² is a good estimator of σ² = σ1² + σ2². Note that for the one-sided hypotheses H0: γ ≤ γ0 vs. H1: γ > γ0, the above tests define α-level UMP tests with the rejection region C1 = {y: τ(y) > cα}.
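Test 5's arithmetic reduces to a one-sample t-test on the differences Yt = X1t − X2t; a minimal sketch with hypothetical paired observations:

```python
from math import sqrt

def paired_t(x1, x2, gamma0=0.0):
    """Test 5: form Y_t = X1_t - X2_t and compute
    tau(Y) = sqrt(n)*(Ybar - gamma0)/s, which is St(n-1) under H0."""
    n = len(x1)
    y = [a - b for a, b in zip(x1, x2)]
    ybar = sum(y) / n
    s = sqrt(sum((v - ybar) ** 2 for v in y) / (n - 1))
    return sqrt(n) * (ybar - gamma0) / s, n - 1

# hypothetical paired data (e.g. before/after measurements)
tau, df = paired_t([5.1, 4.9, 5.3, 5.0], [4.6, 4.8, 4.5, 4.7])
```

Because the covariance cancels inside the differences, the same code covers both ρ ≠ 0 (Test 5) and ρ = 0 (Test 6).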
13.A.4 Testing the Difference Between Two Proportions
The basic model for testing the difference between two proportions, H0: p1 − p2 = 0 vs. H1: p1 − p2 ≠ 0, where p1 = E(X), p2 = E(Y), is the bivariate Bernoulli model (Example 12.12) that gives rise to the 2 × 2 Table 13.A.1, with the additional assumption that X and Y are independent: πij = πi+ · π+j, i, j = 1, 2.

Table 13.A.1   2 × 2 table

x\y     0      1      f(x)
0       π11    π12    π1+
1       π21    π22    π2+
f(y)    π+1    π+2    1

Table 13.A.2   Observed frequencies

x\y      0                 1                 total
0        n11               n12               n11 + n12
1        n21               n22               n21 + n22
total    n1 = n11 + n21    n2 = n12 + n22    n = n1 + n2
Since p1 = π21/π+1 and p2 = π22/π+2, one can use parameterization invariance to derive their ML estimators via

π̂ij = nij/n, i, j = 1, 2 ⟹ p̂1 = n21/n1, p̂2 = n22/n2.

The raw data are often not reported; instead they are summarized in terms of the frequencies nij as in Table 13.A.2, ignoring the fact that this implicitly imposes the independence assumption. Given that n2i ∼ Bin(ni pi, ni pi(1 − pi)), i = 1, 2, the distance function that suggests itself is

(p̂1 − p̂2)/√Var(p̂1 − p̂2) = (p̂1 − p̂2)/√(p1(1−p1)/n1 + p2(1−p2)/n2),

which can be transformed into a test statistic by estimating the denominator. Although one can replace p1 and p2 by their ML estimators, under H0, p1 = p2 = p, and thus the pooled estimator p̂ = (n21 + n22)/(n1 + n2) will be a better choice, yielding the test:

d(Z) = (p̂1 − p̂2)/√(p̂(1 − p̂)(1/n1 + 1/n2)) ∼ N(0, 1) under H0 as n→∞, C1 = {z: |d(z)| > cα/2},

where N(0, 1) provides an approximation for a large enough n. Note that when n1 and n2 are small (say, less than 30), one should use Fisher's exact test; see Lehmann and Romano (2006).
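The pooled two-proportion statistic d(Z) can be sketched as follows; the counts below are hypothetical:

```python
from math import sqrt

def two_proportion_z(n21, n1, n22, n2):
    """Pooled two-proportion test: under H0: p1 = p2 the pooled estimate
    p = (n21 + n22)/(n1 + n2) replaces both proportions in the denominator;
    d(Z) is approximately N(0, 1) for large n1, n2."""
    p1_hat, p2_hat = n21 / n1, n22 / n2
    p = (n21 + n22) / (n1 + n2)
    return (p1_hat - p2_hat) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

# hypothetical counts: 40 "successes" out of 100, vs. 55 out of 110
z = two_proportion_z(40, 100, 55, 110)
```

For small samples the exact hypergeometric (Fisher) test should replace this Normal approximation, as the text notes.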
13.A.5 One-Way Analysis of Variance
The one-way ANOVA model in Example 12.13 takes the form

Xij = μ + αi + uij, uij ∼ NIID(0, σ²), j = 1, 2, ..., ni, i = 1, 2, ..., p,

where μ and αi, i = 1, 2, ..., p, are orthogonal. The MLEs take the form

μ̂ = (1/N) Σi Σj Xij = Σi (ni/N) μ̂i, μ̂i = (1/ni) Σj Xij, α̂i = μ̂i − μ̂, i = 1, 2, ..., p.

The hypotheses of interest in this case are

H0: α1 = α2 = ··· = αp = 0 vs. H1: α1 ≠ 0, or α2 ≠ 0, ..., or αp ≠ 0,

and the α-level UMP test is the F test based on Table 12.13:

F(X) = [Σi ni(μ̂i − μ̂)²/(p − 1)] / [Σi Σj (xij − μ̂i)²/(N − p)] ∼ F(p − 1, N − p) under H0, C1 = {x: F(x) > cα}.
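The one-way ANOVA F statistic above can be computed directly from its definition; the three groups below are made up for illustration:

```python
def oneway_F(groups):
    """One-way ANOVA:
    F = [sum_i n_i (mu_i - mu)^2 / (p-1)] / [sum_ij (x_ij - mu_i)^2 / (N-p)],
    distributed F(p-1, N-p) under H0: alpha_1 = ... = alpha_p = 0."""
    p = len(groups)
    N = sum(len(g) for g in groups)
    mu = sum(sum(g) for g in groups) / N          # grand mean
    mus = [sum(g) / len(g) for g in groups]       # group means
    between = sum(len(g) * (m - mu) ** 2 for g, m in zip(groups, mus))
    within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, mus))
    return (between / (p - 1)) / (within / (N - p))

# hypothetical data: three groups of three observations
F = oneway_F([[6.0, 8.0, 4.0], [5.0, 7.0, 9.0], [11.0, 9.0, 10.0]])
```

Here p = 3 and N = 9, so F is compared against the F(2, 6) critical value cα.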
14 Linear Regression and Related Models
14.1 Introduction

The linear regression (LR) model is arguably the most widely used statistical model in empirical modeling across many disciplines. It provides the exemplar for all regression models, as well as several other statistical models referred to as "regression-like" models, some of which will be discussed briefly in this chapter. The primary objective is to discuss the LR model and its associated statistical inference procedures. Special attention is paid to the model assumptions and how they relate to the sampling distributions of the statistics of interest. The main lesson of this chapter is that when any of the probabilistic assumptions of the LR model are invalid for data z0 := {(xt, yt), t = 1, ..., n}, inferences based on it will be unreliable. The unreliability of inference will often stem from inconsistent estimators and sizeable discrepancies between actual and nominal error probabilities induced by statistical misspecification.
14.1.1 What is a Statistical Model?

Despite the increasing mathematical formalization of the theory of optimal statistical inference and the persistent accumulation of more sophisticated techniques for data analysis since the 1920s, the conceptual framework in the context of which statistical modeling and inference take place is less and less clearly appreciated. The prime example of that is the concept of a statistical model, which is often treated as an afterthought, by being viewed as a trivial issue of attaching error terms to any form of equation (static, dynamic, difference, differential, integro-differential, etc.), usually suggested by some theory. "What is a statistical model?" turns out to be a very difficult and complicated question to answer. It is, however, much easier to assert what a statistical model is not. It is not an equation specified in terms of variables and constants with a stochastic term affixed that is assumed to carry the probabilistic structure of the relevant statistical premises of inference. Example 14.1 Consider the case of the stochastic difference equation
Yt = α0 + α1Yt−1 + ut, (ut | σ(Yt−1)) ∼ NIID(0, σ0²), t = 1, 2, ..., n, ...   (14.1)
Does (14.1) constitute a legitimate statistical model? To answer that, one needs to give "a well-defined (probabilistic) meaning" to the parameterization θ := (α0, α1, σ0²) ∈ Θ, where Θ is the assumed parameter space, stemming from the underlying probabilistic structure of the stochastic process {Yt, t∈N}. A traditional answer in the current literature is that

θ := (α0, α1, σ0²) ∈ Θ := R² × R+, with |α1| < 1.

σ̂²ML is not a fully efficient estimator of σ², but the same is true for s², since Var(s²) = 2σ⁴/(n−2) > C-R(σ²) = 2σ⁴/n. Having said that, the differences are tiny (third decimal) for n > 100, and thus of little importance in practical terms. Hence, in practice s² is almost universally preferred as an unbiased estimator of σ². Example 14.2 Consider the data in Table 1 in Appendix 14.B (Mendenhall and Sincich, 1996, p. 184), where z1t = the auction final price and z2t = the age of an antique grandfather clock in a sequence of n = 32 such transactions. The assumed regression model
14.2 Normal, Linear Regression Model
is of the form given in Table 14.1, where Yt = ln(Z1t) and Xt = ln(Z2t). The idea is to account for the final auction price using the age of the antique clock as the explanatory variable. The estimated linear regression using the data in Table 1 (Appendix 14.C) yields

Ŷt = 1.312 + 1.177xt + ût, R² = .551, s = .208, n = 32,   (14.8)
     (.966)  (.195)

where the estimates of the unknown parameters are evaluated via

β̂1 = [(1/n) Σt (Yt − Ȳ)(xt − x̄)] / [(1/n) Σt (xt − x̄)²] = .0423/.0359 = 1.177,
β̂0 = Ȳ − β̂1x̄ = 7.148 − 1.177(4.958) = 1.312,
σ̂²ML = (1/n) Σt û²t = (1/n) Σt (Yt − Ȳ)² − [(1/n) Σt (Yt − Ȳ)(xt − x̄)]²/[(1/n) Σt (xt − x̄)²]
     = .0905 − (.0423)²/.0359 = .04065,
s² = n·σ̂²ML/(n−2) = (32/30)(.04065) = .04336, s = √.04336 = .208,   (14.9)

and the numbers in brackets underneath the estimates denote their estimated standard errors stemming from (14.6):

SE(β̂0) = s√(1/n + x̄²·φ̂x) = .966, SE(β̂1) = s√φ̂x = .195, where φ̂x = [Σt (xt − x̄)²]⁻¹.

Note that all the above estimates and their standard errors are based on five numbers, the first two sample moments of the data z0 := {(xt, yt), t = 1, 2, ..., n}:

Ȳ = 7.148, x̄ = 4.958, V̂ar(Yt) = (1/n) Σt (Yt − Ȳ)² = .0905,
V̂ar(Xt) = (1/n) Σt (xt − x̄)² = .0359, Ĉov(Yt, Xt) = (1/n) Σt (Yt − Ȳ)(xt − x̄) = .0423.   (14.10)
The estimated coefficients seem reasonable on substantive grounds because an economist expects the value of the antique clock to increase with age. Having said that, a practitioner should exercise caution concerning such evaluations, before the validity of the model assumptions [1]–[5] in Table 14.1 is established. One can use informed conjectures based on the link between reduction and model assumptions, shown in Table 14.3, to provide informed guesses about which model assumptions are likely to be invalid when certain potential departures from NIID are gleaned. A glance at Figures 14.1–14.4 does not indicate any serious departures from the NIID assumptions, suggesting that [1]–[5] are likely to be valid for this data. C A U T I O N: It is important to emphasize that to establish statistical adequacy formally, one is required to apply comprehensive mis-specification (M-S) testing to evaluate the validity of the model assumptions thoroughly; see chapter 15. The sample size in this case is rather small to allow thorough M-S testing.
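The arithmetic in (14.9) can be reproduced from the five sample moments in (14.10) alone; a sketch (the small third-decimal differences from the reported values are rounding in the published moments):

```python
from math import sqrt

# the five sample moments from (14.10)
n = 32
ybar, xbar = 7.148, 4.958
var_y, var_x, cov_yx = 0.0905, 0.0359, 0.0423

b1 = cov_yx / var_x                        # slope (reported: 1.177)
b0 = ybar - b1 * xbar                      # intercept (reported: 1.312)
sigma2_ml = var_y - cov_yx ** 2 / var_x    # ML error variance (reported: .04065)
s2 = n * sigma2_ml / (n - 2)               # unbiased s^2 (reported: .04336)
s = sqrt(s2)                               # (reported: .208)
phi_x = 1.0 / (n * var_x)                  # [sum (x_t - xbar)^2]^{-1}
se_b1 = s * sqrt(phi_x)                    # (reported: .195)
se_b0 = s * sqrt(1.0 / n + xbar ** 2 * phi_x)  # (reported: .966)
```

This makes concrete the text's point that the entire fitted model is a function of just five sample moments.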
[Fig. 14.1: t-plot of yt]  [Fig. 14.2: t-plot of xt]
[Fig. 14.3: scatterplot of (xt, yt)]  [Fig. 14.4: t-plot of the residuals from (14.8)]
14.2.2.1 Sample Moments Can Be Highly Misleading: Plot the Data!
As argued above, all the inference results associated with the Linear Regression model (Table 14.1) discussed so far depend on the first two sample moments (means, variance-covariance) of the observable stochastic process {Zt := (Xt, Yt), t∈N}; see (14.10) for the estimated model in (14.8). Going directly to the sample moments, however, is a very bad strategy, because these moments might turn out to be statistically misspecified. To illustrate that, let us consider the following example.
Example 14.3 Anscombe's (1973) data. Table 2 in Appendix 14.C reports the data contrived by Anscombe (1973) to emphasize how important it is to look at data plots before any modeling is attempted. The data comprise four pairs of variables (Yit, xit), i = 1, 2, 3, 4, t = 1, 2, ..., 11. The sample moments of each pair are identical for i = 1, ..., 4:

Ȳi = 7.501, x̄i = 9.00,
(1/(n−1)) Σt (Yit − Ȳi)² = 4.127, (1/(n−1)) Σt (xit − x̄i)² = 11.0,
(1/(n−1)) Σt (xit − x̄i)(Yit − Ȳi) = 5.5,
[Fig. 14.5: scatterplot of (x1t, y1t)]  [Fig. 14.6: scatterplot of (x2t, y2t)]
[Fig. 14.7: scatterplot of (x3t, y3t)]  [Fig. 14.8: scatterplot of (x4t, y4t)]
giving rise to numerically identical estimated regression results:

Ŷit = 3.00 + .500xit + ûit, R²i = .667, si = 1.236, n = 11, i = 1, ..., 4,
      (1.12)  (.118)

and hence to identical inference results pertaining to β0, β1, σ². The scatterplots of the four estimated linear regressions (Figures 14.5–14.8), however, reveal a very different story about the trustworthiness of these inference results. Only one estimated regression, based on the data in Figure 14.5, is seemingly statistically adequate. The scatterplots of the rest indicate clearly that they are statistically misspecified. In light of the small sample size n = 11, no formal M-S testing is possible, but as argued in Chapter 1, when n is too small for comprehensive M-S testing, it should be considered too small for inference purposes.
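Anscombe's point is easy to verify numerically; the sketch below fits the simple OLS line to the first of his four published data sets (Anscombe, 1973) and recovers the common slope and intercept:

```python
def ols_line(x, y):
    """Simple OLS from the first two sample moments: slope = Sxy/Sxx."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((v - xbar) ** 2 for v in x)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1  # intercept, slope

# Anscombe's first data set
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
b0, b1 = ols_line(x1, y1)
```

All four sets return (to three decimals) the same intercept 3.00 and slope .500, even though their scatterplots look nothing alike.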
14.2.3 Fitted Values and Residuals
In Chapter 7 it was argued that the statistical GM is based on an orthogonal decomposition:

Yt = E(Yt|Xt = xt) + ut, t∈N,

where m(xt) = E(Yt|Xt = xt) and ut = Yt − E(Yt|Xt = xt) are the systematic and non-systematic components, such that (a) E(ut|Xt = xt) = 0, (b) E(ut·m(xt)|Xt = xt) = 0, (c) E(u²t|Xt = xt) = Var(Yt|Xt = xt) = σ². An α-significance level t-test for H0: β1 = 0 vs. H1: β1 ≠ 0 takes the form

T1 := (τ1(Y) = β̂1/(s√φ̂x), C1(α) = {y: |τ1(y)| > cα/2}),   (14.21)
where C1(α) denotes the rejection region. Similarly, an α-significance level t-test for H0: β0 = 0 vs. H1: β0 ≠ 0 takes the analogous form:

T0 := (τ0(Y) = β̂0/(s√(1/n + x̄²·φ̂x)), C1(α) = {y: |τ0(y)| > cα/2}).   (14.22)
Example 14.2 (continued) Using the estimated regression in (14.8), testing the hypotheses in (14.21) and (14.19) at α = .05 significance level (type I error probability) yields

τ0(y0) = β̂0/(s√(1/n + x̄²·φ̂x)) = 1.312/.966 = 1.358 [.092],
τ1(y0) = β̂1/(s√φ̂x) = 1.177/.195 = 6.036 [.000].

Given that cα/2 = 2.042, these results suggest that H0: β0 = 0 is accepted, but H0: β1 = 0 is rejected. The numbers in square brackets denote the p-values, defined by P(τi(y) > τi(y0); βi = 0) = pi(y0), i = 0, 1,
confirming the results, since p0(y0) = .092 and p1(y0) = .000001. These p-values provide a bit more information, by indicating the smallest significance level at which H0 would have been rejected. As argued in Chapter 13, there is no such thing as a two-sided p-value, because post-data (after τi(y0), i = 0, 1, are revealed) the only relevant direction of departure from the null is indicated by the sign of the test statistic. Finally, it is important to note that in the case of one regressor the t-test T1 above is equivalent to the F-test in (14.14).
Hypotheses about the variance: χ²-tests. In relation to σ², the following hypotheses might be of interest (σ0² a known constant):

(i) H0: σ² = σ0² vs. H1: σ² ≠ σ0²; (ii) H0: σ² ≤ σ0² vs. H1: σ² > σ0²; (iii) H0: σ² ≥ σ0² vs. H1: σ² < σ0².

The test statistic is v(y) = (n−2)s²/σ0², with rejection regions

(ii) C1(ii) = {y: v(y) > cα}, with ∫[cα, ∞) χ(v)dv = α; (iii) C1(iii) = {y: v(y) < cα}, with ∫[0, cα] χ(v)dv = α.   (14.24)

The relevant sampling distributions of v(y) are

v(y) = (n−2)s²/σ0² ∼ χ²(n−2) under H0, since σ²/σ0² = 1 under H0,
v(y) ∼ (σ1²/σ0²)·χ²(n−2) under H1(σ1²), for any σ1² ≠ σ0².
Example 14.4 Consider an α = .05 significance level test for the hypotheses in (ii), based on the estimated regression

Ŷt = 6.442 + 0.823xt + ût, s² = (.237)² = .056, n = 100,   (14.25)
     (.052)  (.0225)

with H0: σ² ≤ .06 vs. H1: σ² > .06. For cα = 122.1, the estimated regression in (14.25) yields v(y) = (98)(.0562)/.06 = 91.79 [.657], implying that H0 is not rejected.
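The arithmetic of Example 14.4 can be reproduced as follows; the tail probability uses the Wilson–Hilferty Normal approximation to the χ² distribution (a device I am adding for self-containedness, not from the text):

```python
from math import erf, sqrt

def chi2_sf(v, k):
    """Approximate P(chi2(k) > v) via the Wilson-Hilferty cube-root
    Normal approximation (accurate for large k)."""
    z = ((v / k) ** (1 / 3) - (1 - 2 / (9 * k))) / sqrt(2 / (9 * k))
    return 0.5 * (1 - erf(z / sqrt(2)))

n, s2, sigma0_sq = 100, 0.0562, 0.06
v = (n - 2) * s2 / sigma0_sq   # the chi-square test statistic v(y)
p = chi2_sf(v, n - 2)          # p-value; the text reports .657
```

Since v = 91.79 falls well short of cα = 122.1 (equivalently, p = .657 > .05), H0 is not rejected.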
F-tests. Another test which is particularly useful in the context of linear regression is the joint test of linear restrictions among the coefficients. In order to consider this test we need to go beyond the one-regressor case discussed so far. With this in mind, let us consider the two-variable regression

Yt = β0 + β1x1t + β2x2t + ut, t ∈ N.   (14.26)

For the estimation of the parameters β0, β1, and β2 we follow the same methods and procedures discussed above (e.g. β̂0 = Ȳ − β̂1x̄1 − β̂2x̄2), and we postpone the details until Section 14.4, where we consider the general m-regressor case. The F-test of m linear restrictions takes the form

F(y) = [(RRSS − URSS)/URSS]·[(n − p)/m], C1 = {y: F(y) > cα},

where p = m + 1, RRSS and URSS denote the restricted and unrestricted residual sums of squares, and F(m, n − p) denotes the F-distribution with m and n − p degrees of freedom. Intuitively, large values of this test statistic suggest that H0 is not supported by the data. C1 defines the rejection region, where cα is determined from the F tables once the significance level α is chosen. Consider the joint null hypothesis H0: β1 = 0 and β2 = 0 vs. H1: β1 ≠ 0 or β2 ≠ 0. The test for such hypotheses is an F-test of the form

F(y) = {[Σt (Yt − Ȳ)² − Σt û²t]/Σt û²t}·[(n − 3)/2] = [R²/(1 − R²)]·[(n − 3)/2], C1 = {y: F(y) > cα}.

This shows most clearly that the F-test of the significance of the coefficients (β1, β2) is directly related to the R². Most computer packages report the above F-statistic, and it is
important to know how to interpret it. Assuming that the assumptions underlying the linear regression model are valid, the reported F-statistic refers to the joint significance of all the coefficients in the regression apart from that of the constant term.
Example 14.5 Consider the following estimated regression model:

Ŷt = 48.562 + .671x1t − .223x2t + ût, R² = .813, s = 5.274, n = 120.   (14.29)
     (8.896)  (0.041)   (0.043)

We want to test the joint significance of the coefficients (β1, β2), i.e. H0: β1 = 0 and β2 = 0 vs. H1: β1 ≠ 0 or β2 ≠ 0:

F(y) = [(RRSS − URSS)/URSS]·[(n − p)/m] = [(17388.164 − 3253.868)/3253.868]·(117/2) = 254.115,
which, in view of the fact that cα = 3.08 for α = .05, rejects H0 strongly.
Linear coefficient restrictions. The above F-test can easily be extended to test any linear restrictions on the coefficients of a more general linear regression model

Yt = β0 + β1x1t + β2x2t + β3x3t + β4x4t + β5x5t + ut, t ∈ N,

such as (i) β1 = −β2, β3 = 1; (ii) β1 = .9, β2 + β3 = 1, β4 = β5. All we need to do is impose the restrictions, derive the form of the statistical GM under the null, and treat its residual sum of squares as the RRSS. For example, in case (i) the restricted statistical GM takes the form

(Yt − x3t) = α0 + α1(x1t − x2t) + α2x4t + α3x5t + εt, t ∈ N.

In case (ii) the restricted statistical GM takes the form

(Yt − .9x1t − x2t) = α0 + α1(x3t − x2t) + α2(x4t + x5t) + εt, t ∈ N.

Example 14.6 Consider the estimated regression model (14.29), extended to include an additional regressor:

Ŷt = 45.431 + 0.602x1t − .217x2t + .141x3t + ût,   (14.30)
     (8.564)  (0.044)    (0.041)   (0.041)

R² = .830, s = 5.048, n = 120, RSS = 2955.574. The joint null hypothesis to be tested is H0: β1 = .7 and β2 = 0 vs. H1: β1 ≠ .7 or β2 ≠ 0. The estimated restricted statistical GM takes the form

(Yt − .7x1t) = 8.418 + 0.151x3t + ût,
               (3.372)  (0.039)

with R² = .114, s = 5.564, n = 120, RRSS = 3653.038, so that

F(y) = [(3653.038 − 2955.574)/2955.574]·(116/2) = 13.687,

which, in view of the fact that cα = 3.08 for α = .05, rejects H0 most strongly.
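Once the restricted and unrestricted residual sums of squares are available, the F statistic is a one-liner; a sketch using the numbers of Example 14.6:

```python
def f_linear_restrictions(rrss, urss, n, p, m):
    """F(y) = [(RRSS - URSS)/URSS] * [(n - p)/m], distributed F(m, n-p)
    under H0, where p is the number of coefficients in the unrestricted
    model (constant included) and m the number of restrictions."""
    return ((rrss - urss) / urss) * ((n - p) / m)

# Example 14.6: two restrictions (beta1 = .7, beta2 = 0), n = 120, p = 4
F = f_linear_restrictions(3653.038, 2955.574, 120, 4, 2)
```

Comparing F with the F(2, 116) critical value cα = 3.08 reproduces the strong rejection reported in the text.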
14.2.6 Normality and the LR Model
It is well known that, from the statistical misspecification perspective, the least problematic among the model assumptions is [1] Normality, because a number of inferences turn out to be robust to minor departures from this assumption, such as a distribution of (Yt|Xt = xt) that is symmetric but not quite bell-shaped. In contrast, the presence of any mean heterogeneity will invalidate assumption [5] and will have devastating effects on the reliability of almost all inference procedures associated with the LR model, including rendering estimators inconsistent. Viewing the LR model from a purely probabilistic perspective enabled us to specify it in terms of the statistical GM and the probabilistic assumptions [1]–[5] (Table 14.1). In Section 14.2.1 it was shown that relating the model assumptions pertaining to {(Yt|Xt = xt), t ∈ N} to the reduction assumptions of NIID for {Zt := (Xt, Yt), t ∈ N} enhances our understanding of the LR model and provides a broader framework for the modeling facet that includes the model specification, M-S testing, and respecification. Part of this broader framework is the relationship between the reduction and model assumptions in Table 14.3:

{Zt := (Xt, Yt), t ∈ N}: NIID ⟹ {(Yt|Xt = xt), t ∈ N}: [1]–[5],

which can be used to guide these facets of modeling using a number of data plots. In light of that, it is important to take a closer look at the relationship N ⇒ [1]–[3], with a view to considering the question: "under what additional assumptions does the reverse [1]–[3] and [?] ⇒ N also hold, i.e. N ⇔ [1]–[3] and [?]?" Bhattacharya (1946) proved that f(y, x) is bivariate Normal (N) iff:
(i) N ⇔ [1]–[3] & [6] f(x) is Normal;
(ii) N ⇔ [1]–[3] & [7] f(x|y) is Normal;
(iii) N ⇔ [1]–[3] & [8] f(x, y) is elliptically symmetric.
The result in (iii) was modified and extended in Spanos (1995b), showing that
(iv) N ⇔ [2] & [3] & [9] E(Xt|Yt = yt) = γ0 + γ1yt, ∀yt ∈ R.
Note that (iv) does not invoke assumption [1]. What is needed, in addition to [2] and [3], is that E(Xt|Yt = yt) is linear in yt.
Example 14.7 In the context of the autoregressive model [AR(1)] (Chapter 8), this restriction is satisfied, since

E(Yt | σ(Yt−1)) = α0 + α1Yt−1 ⇔ E(Yt−1 | σ(Yt)) = γ0 + γ1Yt, t∈N.

Indeed, when {Yt, t∈N} is a stationary process (Example 8.21):

α1 = Cov(Yt, Yt−1)/Var(Yt−1) = γ1 = Cov(Yt, Yt−1)/Var(Yt), α0 = E(Yt) − α1E(Yt−1) = γ0,

since E(Yt) = E(Yt−1) and Var(Yt) = Var(Yt−1).
The above discussion calls into question the conventional claim that the distribution of Xt is largely irrelevant for inference purposes in the context of the LR model. The question that needs to be answered is whether a non-Normal f (xt ; ϕ 2 ) will affect the reliability of inference based on the estimators ( β, s2 ). In light of the fact that the F-test in (14.28) involves the sampling distributions of both estimators, and includes the t-test as a special case, it provides a good basis to pose the reliability of inference question when f (xt ; ϕ 2 ) is non-Normal. As shown by Ali and Sharma (1996), the power and size of the F-test are negatively influenced, in the sense of inducing sizeable discrepancies between the nominal and actual type I and II error probabilities, by two factors: (i) the non-Normality of D(yt |Xt ; ϕ 1 ), especially any non-symmetry, and (ii) the “non-Normality” of f (xt ; ϕ 2 ). Their main conclusion is: “Besides the sample size and the degrees of freedom of error sum of squares, the major determinant of the sensitivity to non-Normality is the extent of the non-Normality of the regressors”(p. 175). In an earlier paper, Box and Watson (1962) reached the same conclusion: “Our results may be summarized in the simple statement that sensitivity to non-Normality in the yt ’s is determined by the ‘extent of non-Normality’ in the xit ’s” (p. 106). In light of the above discussion, a modeler is advised to evaluate any possible departures of f (xt ; ϕ 2 ) from Normality using, at least, informal graphical techniques for detecting “nontypical” observations; see Section 14.6.3.1.
14.2.7 Testing a Substantive Model Against the Data

As argued in previous chapters, behind every structural (substantive) model
Mϕ(z) = {f(z; ϕ), ϕ∈Φ}, z∈R^n_Z, for ϕ∈Φ⊂R^p, p < n,
being subjected to statistical analysis, there is always a statistical model (often implicit)
Mθ(z) = {f(z; θ), θ∈Θ}, z∈R^n_Z, for θ∈Θ⊂R^m, p < m < n,
and the two are related via certain nesting restrictions G(θ, ϕ) = 0, where θ denotes the statistical parameters and ϕ the substantive parameters of interest.

14.2.7.1 The Capital Asset Pricing Model

Consider the following example:
Mϕ(z): (Yt − x2t) = α1(x1t − x2t) + εt, εt ~ NIID(0, σ²ε), t = 1, ..., n, ...,
Mθ(z): Yt = β0 + β1x1t + β2x2t + ut, ut ~ NIID(0, σ²u), t = 1, 2, ..., n, ...,   (14.31)
where the substantive model represents the CAPM, with yt the returns of a particular asset or portfolio of assets, x1t the market returns, and x2t the returns of a risk-free asset. The structural model has two parameters, ϕ:=(α1, σ²ε), and the statistical model has four parameters, θ:=(β0, β1, β2, σ²u). The important point is that the substantive model is a special case of the statistical model, and in practice part of probing substantive adequacy is to test the restrictions relating the statistical to the substantive parameters. In this case these restrictions are:
Linear Regression and Related Models
β0 = 0, β1 + β2 = 1.   (14.32)
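The nesting restrictions in (14.32) can be assessed with the standard F-test comparing the restricted (CAPM) and unrestricted residual sums of squares. A minimal simulated sketch (all numbers and variable names below are illustrative, not the book's data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(0.01, 0.05, n)      # hypothetical market returns
x2 = rng.normal(0.005, 0.01, n)     # hypothetical risk-free returns
# generate y so the CAPM restrictions beta0 = 0, beta1 + beta2 = 1 hold
y = 1.2 * x1 + (1 - 1.2) * x2 + rng.normal(0, 0.02, n)

# unrestricted statistical model: y = b0 + b1*x1 + b2*x2 + u
X = np.column_stack([np.ones(n), x1, x2])
b_u = np.linalg.lstsq(X, y, rcond=None)[0]
rss_u = np.sum((y - X @ b_u) ** 2)

# restricted (substantive) model: (y - x2) = a1*(x1 - x2) + eps
z, w = y - x2, x1 - x2
a1 = np.sum(z * w) / np.sum(w ** 2)
rss_r = np.sum((z - a1 * w) ** 2)

# F-test of the q = 2 nesting restrictions against the unrestricted model
q, k = 2, X.shape[1]
F = ((rss_r - rss_u) / q) / (rss_u / (n - k))
```

Since the restricted model is nested within the unrestricted one, rss_r ≥ rss_u by construction, and under true restrictions F should be an unremarkable draw from F(2, n−3).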
Before these restrictions can be tested using reliable testing procedures, it is important to secure the statistical adequacy of Mθ(z). Without it, there is no reason to believe the testing results, because they are likely to be unreliable.

14.2.7.2 Statistical Misspecification and Untrustworthy Empirical Evidence

What goes wrong when an estimated model is statistically misspecified? For the estimators of the model parameters, certain departures often imply that the estimates are highly unreliable; the estimators are likely to be inconsistent in the presence of mean t-heterogeneity. For hypothesis testing, the nominal (assumed) error probabilities, like the type I and II and the coverage probability, are very different from the actual ones! Applying a .05 significance level test when the actual type I error is .95 will lead an inference astray: instead of rejecting a true null hypothesis 5% of the time, as assumed, one is actually rejecting it 95% of the time!

WARNING: When looking at published empirical results in prestigious journals, keep in mind that a very high proportion, say 99%, are of questionable trustworthiness!

Example 14.8  Lai and Xing (2008, pp. 71–81) illustrate the CAPM using monthly data for the period Aug. 2000 to Oct. 2005 (n=64); see Appendix 5.A. For simplicity, let us focus on one of their equations, where yt is the excess (log) returns of Intel, xt is the market excess (log) returns based on the S&P 500 index, and the risk-free returns are based on the 3-month treasury bill rate. Estimation of the statistical (LR) model that nests the CAPM when the constant is zero yields
Ŷt = .020 + 1.996xt + ût,  R² = .536, s = .0498, n = 64,   (14.33)
     (.009)  (.237)
where the standard errors are given in parentheses. On the basis of the estimated model in (14.33), the authors proceeded to draw the following inferences, providing strong evidence for the CAPM:

(a) the signs and magnitudes of the estimated (β0, β1) corroborate the CAPM:
    (i) the beta coefficient β1 is statistically significant: τ1(y0) = 1.996/.237 = 8.422[.000];
    (ii) the restriction β0 = 0 is accepted at α = .025, since τ0(y0) = .019/.009 = 2.111[.039];
(b) the goodness-of-fit (R² = .536) is high enough to provide additional evidence for the CAPM.
The problem with these inferences is that their trustworthiness depends crucially on the estimated linear regression in (14.33) being statistically adequate. But is it? The first hint that some of the model assumptions [1]–[5] (Table 14.4) are most probably invalid for the particular data comes from basic data plots. A glance at the t-plots of the data {(yt, xt), t = 1, 2, ..., n} (Figures 14.9 and 14.10) suggests that the data exhibit very distinct time cycles and trends in the mean, and a shift in the variance after observation t=30. Using the link between reduction and model assumptions in Table 14.2, it can be conjectured that assumptions [4] and [5] are likely to be invalid. The residuals from the estimated equation in (14.33), shown in Figure 14.11, corroborate this.
[Fig. 14.9: t-plot of Intel Corp. excess returns, Aug. 2000–Oct. 2005]
[Fig. 14.10: t-plot of market excess returns, Aug. 2000–Oct. 2005]
[Fig. 14.11: t-plot of the residuals from (14.33)]
[Fig. 14.12: t-plot of GM excess returns, Aug. 2000–Oct. 2005]
A formal introduction to M-S testing is given in Chapter 15, but as a prelude to that discussion let us consider two auxiliary regressions aiming to bring out certain forms of departure from the model assumptions [2]–[5], as indicated in (14.34) and (14.35). It suffices at this stage to interpret such auxiliary regressions as an attempt to probe for the presence of statistical systematic information in the residuals (ût) and their squares (ût²), where the ût equation in (14.34) probes for departures pertaining to assumptions about E(Yt|Xt = xt) and the ût² equation in (14.35) for departures from Var(Yt|Xt = xt). The misspecifications conjectured above are confirmed by the auxiliary regressions:

ût = .047 + .761xt + 7.69xt² − .192D2 − 1.11t² + .175t³ − .320ût−1 + v̂1t,   (14.34)
     (.018)  (.474)   (6.24)    (.044)   (.338)   (.054)    (.109)

where D2 is a dummy variable (it takes the value one for t=2 and zero for t≠2), and

ût² = .004 + .025D2 − .006t − .053xt² + v̂2t,  R² = .19, n = 64,   (14.35)
      (.0008) (.0025)  (.002)   (.179)

yielding the following misspecification test results:

Linearity [2]:              τ(56) = 7.69/6.24 = 1.23[.223]
Homoskedasticity [3]:       τ(56) = .053/.179 = .30[.768]
Dependence [4]:             τ(56) = .320/.109 = 2.93[.005]
Mean-heterogeneity [5]:     τD(56) = .192/.044 = 4.32[.0000],  F(2, 57) = (.035671/2)/(.093467/56) = 10.686[.0002]
Variance-heterogeneity [5]: τD(56) = .025/.0025 = 9.75[.0000],  τ(56) = .006/.002 = 3.04[.004]
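The logic of an auxiliary regression like (14.34) can be illustrated with simulated data: generate yt with a trend in its mean, estimate a static LR model that ignores the trend, and then regress the residuals on a trend term; a large t-ratio on the trend flags the departure from t-invariance. All numbers below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
t = np.arange(1, n + 1) / n
x = rng.normal(size=n)
# mean t-heterogeneity: the data-generating mechanism contains a trend
y = 1.0 + 0.5 * x + 2.0 * t + rng.normal(scale=0.3, size=n)

# estimate the (misspecified) static LR model of y on x only
X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
u = y - X @ b

# auxiliary regression of the residuals on a trend
Z = np.column_stack([np.ones(n), t])
g = np.linalg.lstsq(Z, u, rcond=None)[0]
v = u - Z @ g
se = np.sqrt((v @ v) / (n - 2) * np.linalg.inv(Z.T @ Z)[1, 1])
tau = g[1] / se          # a large |tau| flags departure from t-invariance
```

The auxiliary slope recovers (approximately) the neglected trend coefficient, and its t-ratio is far outside any conventional threshold.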
As argued in Chapter 15, applying a Normality test using the original model residuals is a good strategy only when all the other model assumptions [2]–[5] are shown to be valid. This is because the current Normality tests assume that the residuals are IID, i.e. that assumptions [2]–[5] are valid, which is not the case above. One can, however, get some idea of the validity of assumption [1] by applying such tests to the post-whitened residuals from the auxiliary equation (14.34). Hence, the above M-S testing results indicate clearly that no reliable inferences can be drawn on the basis of the estimated model in (14.33), since assumptions [4]–[5] are invalid. No respecification of (14.33) with a view to finding a statistically adequate model is attempted here, because it would involve the Student's t dynamic linear regression model; see Spanos (1995a).

14.2.7.3 Probing for Substantive Adequacy?

To illustrate the perils of a misspecified model, consider posing the question: is zt−1, last period's excess returns of General Motors (see Figure 14.12), an omitted variable in (14.33)? Adding zt−1 to the estimated equation (14.33) gives rise to
Ŷt = .013 + 2.082xt − .296zt−1 + ût,  R² = .577, s = .0483, n = 63.   (14.36)
     (.009)  (.232)    (.129)
Taking (14.36) at face value, the t-statistic τ(60) = .296/.129 = 2.29[.026] suggests that zt−1 is a relevant omitted variable, which is a highly misleading inference. The truth is that any variable which picks up the unmodeled trend will misleadingly appear to be statistically significant. Indeed, a simple respecification of the original model, adding trends and lags to account for the departures detected by the above M-S testing,
Ŷt = .049 − .175D2 + 2.307xt − .755t² + .119t³ − .205Yt−1 + .032zt−1 + ε̂t,
     (.02)   (.045)    (.272)    (.101)  (.057)    (.090)    (.138)
R² = .704, s = .0418, n = 63,
renders zt−1 insignificant, since its t-statistic is τ(56) = .032/.138 = .23[.818]. The lesson from this example is that one should never probe for substantive adequacy when the underlying statistical model is misspecified. In such a case, the inference procedures used to decide whether a new variable is relevant are unreliable!

Example 14.9  Let us return to Example 14.2, where the original structural model coincided with an LR model with Yt = ln(Z1t) and Xt = ln(Z2t), z1t being the auction final price and z2t the age of an antique grandfather clock, in a sequence of n=32 such transactions. Let us assume that during a presentation of the estimated regression model
Ŷt = 1.312 + 1.177xt + ût,  s = .208, n = 32,   (14.37)
     (.966)  (.195)
[Fig. 14.13: t-plot of the residuals of (14.38)]
an economist in the audience raised the possibility that (14.37) is substantively inadequate because a crucial explanatory variable, x2t, the number of bidders, has been omitted. The modeler decides to evaluate this by re-estimating (14.37) with the additional regressor:
Ŷt = −1.316 + 1.418x1t + .649x2t + ût,  s = .0718, R² = .947, n = 32.   (14.38)
      (.382)   (.070)     (.044)
Its residuals are plotted in Figure 14.13. In light of the fact that (14.37) is statistically adequate, one can trust the t-test for the significance of X2t,
τ(y) = .649/.044 = 14.75[.000],
to infer that indeed X2t is a relevant explanatory variable that enhances the substantive adequacy of the original model in (14.37). The t-plots of the residuals from (14.38) confirm that the respecified model has preserved the statistical adequacy.

14.2.7.4 Statistical vs. Substantive Adequacy: The Tale of Two Error Terms

To distinguish between statistical and substantive adequacy, it is important to bring out the different interpretations of the two error terms associated with the statistical and substantive models, and how the probing of potential departures from these error terms differs in crucial respects. The statistical error term ut is assumed to represent the non-systematic statistical information in the data Z0:=(y, X) left behind by the systematic component m(t) = E(Yt|Xt = xt). This is true when assumptions [1]–[5] are valid. Hence, the statistical error term ut is:
[i] Derived (ut = Yt − β0 − β1x1t − β2x2t), in the sense that the probabilistic structure of {(ut|Xt = xt), t∈N} is completely determined by that of the observable process {(Yt|Xt = xt), t∈N}; assumptions [1]–[5]. This implies that when any of the assumptions [1]–[5] are invalid, ut will include the systematic statistical information in Z0 left unaccounted for by m(t).
[ii] Local, in the sense that the validity of its probabilistic structure revolves around how adequately the statistical model accounts for the statistical systematic information in data Z0. That is, when probing to establish statistical adequacy, the potential errors pertain only to statistical systematic information in Z0 that might have been overlooked by the statistical model, e.g. departures from assumptions [1]–[5].

In contrast, the structural error term εt is assumed to represent the non-systematic substantive information left behind by the structural (substantive) model. In this sense, εt is:

[i]* Autonomous, in the sense that, unlike ut, the probabilistic structure of εt is not entirely determined by the substantive information framed by the structural model and its variables. It also depends on other relevant substantive information that might have been overlooked. This includes omitted variables, confounding factors, external shocks, systematic errors of measurement and approximation, etc. This implies that when the structural model does not account adequately (describe, explain, predict) for the phenomenon of interest, εt will include such neglected substantive information.
[ii]* Global, in the sense that the validity of its probabilistic structure revolves around how adequately the structural model accounts for the phenomenon of interest. Hence, when probing to establish substantive adequacy, one needs to consider the different ways the structural model might deviate from the actual data-generating mechanism that gave rise to the phenomenon of interest; not just the part that generated data Z0.
14.3 Linear Regression and Least Squares

14.3.1 Mathematical Approximation and Statistical Curve-Fitting

As argued in Chapter 12, the principle of least squares has its origins in mathematical approximation using linear (in parameters) functions, originally proposed by Legendre in 1805. The basic idea involves the approximation of an unknown function
y = h(x), (x, y)∈R^p_X × R_Y,
where h(.): R^p_X → R_Y, by selecting an approximating function, say
g(x) = α0 + Σ_{i=1}^m αi xi, (x, y)∈R^p_X × R_Y,
and using data z0:={(xt, yt), t = 1, 2, ..., n} to get the curve of best fit:
ŷt = ĝ(xt) = α̂0 + Σ_{i=1}^m α̂i xit.
Gauss in 1809 transformed this mathematical approximation problem into a statistical estimation procedure by adding a probabilistic structure via a generic error term:
Yt = g(xt) + εt, εt ~ NIID(0, σ²), t = 1, 2, ..., n,   (14.39)
that gave rise to the Gauss linear (GL) model in Table 14.5; see Seal (1967) and Plackett (1965) for its historical development. In this book, the GL model is not viewed as a variation on the LR model, because the two are very different as statistical models, even though they share certain apparently common procedures; the matrix notation blurs the differences. As argued in Chapter 12, the GL model is primarily motivated by the curve-fitting perspective, in contrast to the LR model, which is the quintessential statistical model with clear statistical parameterizations. Let us bring out some of the differences between the LR model specified in Table 14.1 and the GL model in Table 14.5. Apart from the fact that {xt}, t = 1, ..., n, is viewed as a sequence of given numbers, and not the observations associated with a random variable Xt, the linearity assumption in Table 14.5 that matters is in terms of the parameters. That is, the statistical GM of the Gauss linear model can be extended to a polynomial (preferably orthogonal) in xt, say (Seber and Lee, 2003)
Yt = α0 + Σ_{k=1}^p αk xt^k + εt, εt ~ NIID(0, σ²), t∈N,
without changing the probabilistic structure of the model. In this sense, the GL model is more appropriately viewed in the context of a curve-fitting framework. Indeed, the Gauss linear model is not usually specified as in Table 14.5, but as in Table 14.6 (Kennedy, 2008), where the parameters θ:=(α, σ²), α:=(α0, α1), are often assigned a substantive interpretation, ignoring the statistical interpretation, and the functional form is selected on goodness-of-fit grounds; see Spanos (2010b).

Table 14.5 Gauss linear model
Statistical GM: Yt = α0 + α1xt + εt, t∈N:=(1, 2, ..., n, ...)
[1] Normality:        Yt ~ N(., .)
[2] Linearity:        E(Yt) = α0 + α1xt
[3] Homoskedasticity: Var(Yt) = σ²
[4] Independence:     {Yt, t∈N} is an independent process
[5] t-invariance:     (α0, α1, σ²) are not changing with t
(all for t∈N)
Table 14.6 Traditional Gauss linear model
{1} {2}
Yt = α 0 + α 1 xt + t , t∈N:=(1, 2, . . . , n, . . .) Zero mean E ( t ) = 0, ∀t ∈ N Constant variance E 2t = σ 2 > 0, ∀t ∈ N
{3}
Zero covariance
{4}
Fixed {xt }nt=1 X:= (x1 , x2 , . . . , xn )
E ( t s ) = 0, t =s, t, s ∈ N
⎫ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎬
⎪ ⎪ ⎪ {xt }nt=1 fixed in repeated samples ⎪ ⎪ ⎪ ⎪ ⎪ ⎭ such that rank(X) = p, p < n
t∈N
The curve-fitting perspective differs from the statistical modeling perspective along several dimensions, the most important of which is their primary objective. In curve-fitting, the functional form g(.) is determined on mathematical approximation grounds, and the primary criterion in selecting the fittest curve Ŷt = ĝ(xt) is the "smallness" of the residuals {ε̂t = Yt − Ŷt, t = 1, 2, ..., n}, measured in terms of the prespecified objective function. Least squares amounts to minimizing
ℓ(α0, α1) = Σ_{t=1}^n (Yt − α0 − α1xt)², with (α̂0, α̂1) = argmin_α Σ_{t=1}^n εt².   (14.40)
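The minimizers in (14.40) have the familiar closed forms α̂1 = Σ(Yt − Ȳ)(xt − x̄)/Σ(xt − x̄)² and α̂0 = Ȳ − α̂1x̄, and the fitted residuals satisfy the first-order conditions Σε̂t = 0 and Σxtε̂t = 0. A tiny numerical check (made-up data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])   # made-up observations

# closed-form least-squares coefficients
a1 = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
a0 = y.mean() - a1 * x.mean()
resid = y - (a0 + a1 * x)
# first-order conditions: residuals sum to zero and are orthogonal to x
```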
It should come as no surprise that this perspective gave rise to several new curve-fitting procedures, widely used in the statistical learning literature (Murphy, 2012), including the following.

(a) Ridge regression, whose objective function is
α̃_ridge = argmin_α {Σ_{t=1}^n (Yt − α0 − α1xt)² + λΣ_{i=1}^p αi²}  ⇒  α̃_ridge = (X'X + λI)⁻¹X'y.

(b) LASSO (least absolute shrinkage and selection operator) regression, whose objective function is (Tibshirani, 1996)
α̃_LASSO = argmin_α {Σ_{t=1}^n (Yt − α0 − α1xt)² + λΣ_{i=1}^p |αi|}.

(c) Elastic net regression, whose objective function is (Zou and Hastie, 2005)
α̃_EN = argmin_α {Σ_{t=1}^n (Yt − α0 − α1xt)² + λ1Σ_{i=1}^p |αi| + λ2Σ_{i=1}^p αi²}.
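For instance, the ridge objective has the closed-form solution (X'X + λI)⁻¹X'y, and increasing λ shrinks the coefficient vector toward zero (trading bias for variance). A short sketch on synthetic data (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

def ridge(X, y, lam):
    # closed-form ridge estimator: (X'X + lam*I)^{-1} X'y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# the norm of the coefficient vector decreases monotonically in lambda
norms = [np.linalg.norm(ridge(X, y, lam)) for lam in (0.0, 10.0, 100.0)]
```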
What all these curve-fitting procedures have in common is that their ultimate objective is to minimize the mean square error of the resulting estimators, E(α̃ − α)², because the constraints give rise to biased estimators; see Chapter 11. The above curve-fitting procedures should be contrasted with the statistical modeling perspective, where g(.) stems from the probabilistic structure of the stochastic process {Zt:=(Xt, Yt), t∈N}, the statistical parameterization θ is of paramount importance, and the statistical model is selected on statistical adequacy grounds; it is the "non-systematicity" of the residuals that matters, and not their "smallness" measured by an arbitrary distance function. Moreover, the statistical parameters need to be related to the substantive parameters stemming from a structural model. Although the probabilistic assumptions {1}–{3} are directly related to assumptions [2]–[4] in Table 14.5, the specification of the GL model in Table 14.6 and that of the LR model in Table 14.1 differ in a number of important respects:

(i) {1}–{3} pertain to the probabilistic structure of {Yt, t∈N}, as opposed to {(Yt|Xt = xt), t∈N};
(ii) the linearity pertains to the parameters (not to linearity in xt);
(iii) [5] is clearly implicit in the specification in Table 14.6; and
(iv) {4} pertains to the numbers {xt}, t = 1, ..., n, and the condition rank(X) = p is primarily a numerical issue; see Section 14.5.
The traditional econometric literature considers the omission of the Normality assumption [1] as a major advantage of the specification in Table 14.6 over that of Table 14.5. The idea is that the assumptions {1}–{4} are weaker than [1]–[5] and thus the inference results are (i) more general as well as (ii) less vulnerable to statistical misspecification, a highly questionable claim (Spanos, 2018).
14.3.2 Gauss–Markov Theorem

Under assumptions {1}–{4} in Table 14.6, the ordinary least squares (OLS) estimators
α̂1 = Σ_{t=1}^n (Yt − Ȳ)(xt − x̄) / Σ_{t=1}^n (xt − x̄)²,  α̂0 = Ȳ − α̂1x̄,
of (α0, α1) are best (minimum variance) among the class of linear and unbiased estimators (BLUE).

Table 14.7 Gauss–Markov theorem conclusions

(i) Best (relative efficiency): Var(α̂i) ≤ Var(α̃i), i = 0, 1, for any α̃i, i = 0, 1, that also satisfies (ii) and (iii) below
(ii) Linear: α̃0 = Σ_{t=1}^n w0tYt, α̃1 = Σ_{t=1}^n w1tYt, where wit, t = 1, ..., n, i = 0, 1, are constants
(iii) Unbiased: E(α̃i) = αi, i = 0, 1
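The unbiasedness claim in Table 14.7 can be checked by simulation: with {xt} held fixed in repeated samples, the average of α̂1 over many replications should be close to the true α1, and its empirical variance close to σ²/Σ(xt − x̄)². A small Monte Carlo sketch (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 30)        # fixed in repeated samples, as in {4} of Table 14.6
a0_true, a1_true, sigma = 1.0, 2.0, 0.5
Sxx = np.sum((x - x.mean()) ** 2)

reps = 5000
e = rng.normal(scale=sigma, size=(reps, x.size))
y = a0_true + a1_true * x + e
# OLS slope for every replication at once
a1_hat = ((y - y.mean(axis=1, keepdims=True)) * (x - x.mean())).sum(axis=1) / Sxx

bias = a1_hat.mean() - a1_true                  # should be near zero (unbiasedness)
var_gap = a1_hat.var() - sigma**2 / Sxx         # should match the G-M variance
```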
That is, the OLS estimators (α̂0, α̂1) are best (relatively more efficient) (Figure 14.14):
Var(α̂i) ≤ Var(α̃i), for any other estimators α̃i, i = 0, 1,
within the class of linear and unbiased (LU) estimators of (α0, α1); see Table 14.7. Note that the OLS estimators are linear functions of y, since
α̂1 = Σ_{t=1}^n ξt(Yt − Ȳ),  α̂0 = Ȳ − x̄ Σ_{t=1}^n ξt(Yt − Ȳ),  where ξt = (xt − x̄)/Σ_{t=1}^n (xt − x̄)², t = 1, 2, ..., n.

[Fig. 14.14: Gauss–Markov theorem: within the set of all estimators of (α0, α1), the nested classes of linear and of unbiased estimators, among which the OLS estimators are best]

Although this is often celebrated as a major result in econometrics, a closer examination reveals that it is not anything to write home about. First, as argued in Chapter 11, relative
efficiency means very little if one restricts unnaturally the class of competing estimators. The class of unbiased estimators, although interesting, is too restrictive, because numerous unbiased estimators are often useless unless they satisfy more potent optimality properties, such as full efficiency. Restricting the class of unbiased estimators further to only linear functions of y is entirely artificial, because there is no intrinsic value in the linearity of an estimator. Linearity is a beguiling irrelevancy, since without it the relative efficiency does not hold. Second, there is a degree of arbitrariness in minimizing
ℓ(α0, α1) = Σ_{t=1}^n (Yt − α0 − α1xt)²   (14.41)
with respect to (α0, α1) that could be at odds with the probabilistic premises for statistical inference. The equivalence of the least-squares minimization to the maximization of the log-likelihood function,
ln L(θ; y) = const − (n/2)ln σ² − (1/(2σ²))Σ_{t=1}^n (Yt − α0 − α1xt)²,   (14.42)
brings out an intrinsic affinity between (14.41) and the Normality assumption, ignored by the Gauss–Markov theorem. As argued by Pearson (1920), Normality is the only real justification for minimizing the squares of the errors:

  Theoretically therefore to have justification for using the method of least squares to fit a line or a plane to a swarm of points we must assume the arrays to follow a Normal distribution. . . . Hence, in disregarding Normal distributions and claiming great generality for our correlation by merely using the principle of least squares, we are really depriving that principle of the basis of its theoretical accuracy, and the apparent generalization has been gained merely at the expense of theoretical validity. (p. 45)
In that sense the least-squares method will be very difficult to justify when the underlying distribution is non-Normal, since the Gauss–Markov theorem holds for all of the error distributions in Table 14.8.

Table 14.8 G-M theorem and admissible error distributions

(i) Normal:   f(εt) = (1/(σ√(2π))) exp(−εt²/(2σ²)), εt∈R,  Var(εt) = σ²
(ii) Laplace: f(εt) = (1/(2σ)) exp(−|εt|/σ), εt∈R,  Var(εt) = 2σ²
(iii) Uniform: f(εt) = 1/(2σ), −σ < εt < σ,  Var(εt) = σ²/3
(iv) Euler:   f(εt) = (3/(4σ³))(σ² − εt²), −σ < εt < σ,  Var(εt) = σ²/5
Without knowledge of the error distribution, at best one can invoke probabilistic inequalities that hold for any error distribution with σ² > 0 to provide upper (or lower) bounds for the relevant error probabilities. As shown in Chapter 9, however, these inequalities provide very crude approximations to the relevant error probabilities, leading to highly imprecise inferences in both interval estimation and testing.

HISTORICAL GOSSIP: Gauss (1821) abandoned the Normality assumption and proved what is today called the Gauss–Markov theorem. It is interesting to note that the credit to Andrei Markov (1856–1922) is misplaced. The paper that initially misnamed the theorem was David and Neyman (1938). As Plackett (1949) argues, however: "Markoff, who refers to Gauss's work, may perhaps have clarified assumptions implicit there, but proved nothing new" (p. 460). Neyman (1952) corrected his mistake of the false attribution: "the theorem that I ascribed to Markoff was discovered by Gauss" (p. 228). Unfortunately, this retraction went unnoticed by textbook writers in statistics and econometrics.
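That the Gauss–Markov conclusions do not rely on Normality can be seen by simulation: the OLS slope remains unbiased under each of the error distributions in Table 14.8 that numpy can sample directly (Normal, Laplace, Uniform), since only assumptions {1}–{4} are used. A sketch with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(0.0, 1.0, 40)
Sxx = np.sum((x - x.mean()) ** 2)
a1_true, reps = 2.0, 4000

draws = {
    "normal":  lambda size: rng.normal(0.0, 1.0, size),
    "laplace": lambda size: rng.laplace(0.0, 1.0, size),
    "uniform": lambda size: rng.uniform(-1.0, 1.0, size),
}
means = {}
for name, draw in draws.items():
    e = draw((reps, x.size))
    y = a1_true * x + e
    a1_hat = ((y - y.mean(axis=1, keepdims=True)) * (x - x.mean())).sum(axis=1) / Sxx
    means[name] = a1_hat.mean()   # close to a1_true under every error distribution
```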
14.3.3 Asymptotic Properties of OLS Estimators

In light of the fact that the Gauss–Markov theorem provides the variances of the OLS estimators,
Var(α̂0) = σ²[(1/n) + φx x̄²],  Var(α̂1) = σ²φx,  where φx = [Σ_{t=1}^n (xt − x̄)²]⁻¹,
one can consider the question of consistency and asymptotic Normality for these estimators. Focusing on α̂1, its consistency depends on lim_{n→∞} Var(α̂1) = 0, which holds only if φx → 0 as n→∞. This is equivalent to:

(i) if Σ_{t=1}^n xt² → ∞ as n → ∞, then α̂1 converges in probability to α1.

For the asymptotic Normality of α̂1 one needs to establish the rate at which Σ_{t=1}^n xt² → ∞. That requires additional restrictions relating to the behavior of Σ_{t=1}^n xt² as n → ∞. In particular:

(ii) if lim_{n→∞} (1/n)Σ_{t=1}^n (xt − x̄)² = qx > 0, then √n(α̂1 − α1) ~ N(0, σ²qx⁻¹) asymptotically.

Assuming that the OLS estimator of σ² is s² = Σ_{t=1}^n ε̂t²/(n − 2), where ε̂t = Yt − α̂0 − α̂1xt, its asymptotic sampling distribution takes the form
√n(s² − σ²) ~ N(0, μ4 − σ⁴) asymptotically,   (14.46)
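Result (i) can be visualized by simulation: as n grows, Σ(xt − x̄)² grows, and the sampling variation of α̂1 collapses around α1. A brief sketch (illustrative values):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, a1_true = 1.0, 2.0

def slope_sd(n, reps=2000):
    # empirical sd of the OLS slope for sample size n (fixed regressor grid)
    x = np.linspace(0.0, 1.0, n)
    Sxx = np.sum((x - x.mean()) ** 2)
    e = rng.normal(scale=sigma, size=(reps, n))
    y = a1_true * x + e
    a1_hat = ((y - y.mean(axis=1, keepdims=True)) * (x - x.mean())).sum(axis=1) / Sxx
    return a1_hat.std()

# Sxx grows roughly like n/12 on this grid, so the sd shrinks like 1/sqrt(n)
sds = [slope_sd(n) for n in (20, 80, 320)]
```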
where μ4 = E(εt⁴), i.e. the fourth central moment of the unknown distribution f(εt), assuming it exists. One cannot go any further in discussing asymptotic efficiency, because that requires an explicit distributional assumption in addition to assumptions {1}–{4} in Table 14.6. In addition, one should not rely exclusively on the above asymptotic distributions for inference purposes, because they are often very crude approximations of the finite sampling distributions; see Chapter 9. It is instructive to compare the above asymptotic properties of the OLS estimators with those of the MLEs.

Asymptotic properties of the MLEs (β̂0, β̂1, σ̂²ML). For consistency we require lim_{n→∞} In(β0, β1, σ²) = ∞. A sufficient (not necessary) condition for that is Σ_{t=1}^n xt² → ∞ as n→∞. For the asymptotic Normality of the estimator β̂1, however, the rate of convergence of Σ_{t=1}^n xt² is also needed to define the Normalizing sequence {cn}. An obvious choice is cn = √(Σ_{t=1}^n xt²), since
√(Σ_{t=1}^n xt²)(β̂1 − β1) ~ N(0, σ²) asymptotically.
For the MLEs (β̂0, σ̂²ML) the asymptotic sampling distributions are
√n(β̂0 − β0) ~ N(0, σ²),  √n(σ̂²ML − σ²) ~ N(0, 2σ⁴) asymptotically.

The comparison between the above asymptotic results for the OLS and ML estimators reveals that the two differ with respect to the asymptotic distributions of s² and σ̂²ML. Without the Normality assumption, Var(s²) = g(μ2, μ4) will involve the fourth central moment (μ4) of the unknown distribution f(yt|xt); see (14.46). As shown in Chapter 12, for an IID sample (Z1, Z2, ..., Zn) with E(Zt) = 0 and E(Zt²) = σ² (Table 12.24):
Var((1/n)Σ_{t=1}^n Zt²) = (μ4 − μ2²)/n − 2(μ4 − 2μ2²)/n² + (μ4 − 3μ2²)/n³.
This brings up a serious problem. Estimating μ4 using, say, μ̂4 = (1/n)Σ_{t=1}^n ût⁴ offers little comfort, because the variance of μ̂4 involves central moments up to order 8 (Table 12.21) and significantly increases the imprecision of inference for a given n. Note that when f(εt) is Normal, Var(μ̂4) = 96σ⁸/n (Table 12.22), and thus none of these difficulties arise.
14.4 Regression-Like Statistical Models

This section discusses briefly several regression-like models that belong to a broad family known as the generalized linear models. What renders these models different from the LR model is the fact that they are viewed primarily from a curve-fitting perspective. As a result, regression-like models differ from well-defined statistical models in several respects:

(i) They are specified in ad hoc and idiosyncratic ways that ignore crucial issues such as the statistical meaningfulness of their unknown parameters.
(ii) They do not enjoy explicit statistical parameterizations stemming from the observable stochastic process giving rise to the data.
(iii) Their appropriateness is usually assessed on goodness-of-fit grounds instead of being appraised on statistical adequacy grounds. As argued in Spanos (2007), excellent goodness-of-fit is neither necessary nor sufficient for statistical adequacy. The intuition is that the former relies on "small" residuals and the latter on "non-systematic" residuals; residuals can be small but systematic, or large and non-systematic.
14.4.1 Gauss Linear Model

An intuitive way to accommodate the Gauss linear model (Table 14.6) within the probabilistic reduction framing is to view it as an extension of the Normal independent model with k-varying means
Yk ~ NI(μk, σ²), yk∈R, μk∈R, σ² > 0, k∈N,   (14.47)
whose statistical GM takes the simple form
Yk = μk + uk, uk ~ NIID(0, σ²), k∈N.   (14.48)
This is clearly a non-estimable model, since it raises the incidental parameter problem. One way to address that problem is to account for the heterogeneity using the variables xk (m×1) via the link function
μk = α0 + α1'xk, k∈N.   (14.49)
This is clearly not a genuine regression model, but a regression-like specification that can be directly related to the LR model, since its implicit parameterization largely coincides with that of the LR model; the ML estimators of the GL model converge in probability to the same parameters. It turns out that several statistical models can be interpreted similarly as regression-like models.
14.4.2 The Logit and Probit Models

The logit and probit models can be viewed as extensions/modifications of an independent but heterogeneous Bernoulli model
Yk ~ BerI(θk, θk(1 − θk)), yk = 0, 1, θk∈[0, 1], k∈N:=(1, 2, ...),   (14.50)
[Fig. 14.15: the Normal cdf Φ(z) (solid) vs. the logistic cdf Λ(z) (dashed)]
where "BerI" denotes Bernoulli independent, whose statistical GM is
Yk = θk + uk, uk ~ BerI(0, θk(1 − θk)), k∈N.   (14.51)
Given that θk = P(Yk = 1) = E(Yk), the heterogeneity can be accounted for by the variables xk via a link function
θk = F(α0 + α1'xk), k∈N,   (14.52)
where F(.) denotes a continuous cdf, such as the logistic or the Normal (Figure 14.15):
Logistic: Λ(zk) = e^{zk}/(1 + e^{zk}),  zk = α0 + α1'xk, ∀xk∈R^m;
Normal:  Φ(zk) = (1/√(2π)) ∫_{−∞}^{zk} exp(−uk²/2) duk,  zk∈R.
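A minimal sketch of the two link cdfs using only the standard library (the evaluation points are purely illustrative):

```python
import math

def logistic_cdf(z):
    # Lambda(z) = exp(z)/(1 + exp(z)), written in a numerically stable form
    return 1.0 / (1.0 + math.exp(-z))

def normal_cdf(z):
    # Phi(z) via the error function: Phi(z) = (1 + erf(z/sqrt(2)))/2
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# both are symmetric cdfs crossing 1/2 at z = 0; the logistic has heavier tails
p_logit = logistic_cdf(1.0)
p_probit = normal_cdf(1.0)
```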
The literature on the logit/probit models favors interpreting (14.50) as conditional:
(Yk|Xk = xk) ~ BerI(θk, θk(1 − θk)), yk = 0, 1, θk∈[0, 1], k∈N:=(1, 2, ...),   (14.53)
but as argued in Arnold et al. (1999) this is not a trivial step, since one needs to justify the choice of the distribution of Xk; see Bergtold et al. (2010). Sidestepping that issue, the Bernoulli regression-like models based on (14.53) include the logit and probit as special cases when F(.) is replaced with the logistic and Normal cdfs, respectively.

Table 14.9 Bernoulli regression-like (generic) model

Statistical GM: Yk = F(α0 + α1'xk) + uk, k∈N
[1] E(Yk|Xk = xk) = F(α0 + α1'xk), 0 < F(α0 + α1'xk) < 1, k∈N

For a gamma regression-like model the mean must satisfy μk > 0, so a linear link function is inappropriate; instead, the link function used is
μk = exp(α0 + α1'xk), k∈N.

M-S testing. In light of the above discussion, it is clear that the only way to address the issue of arbitrary choices in statistical model specification is to evaluate their appropriateness using thorough M-S testing; see Chapter 15.
14.5 Multiple Linear Regression Model

Table 14.10 Normal, linear regression model

Statistical GM: Yt = β0 + β1'xt + ut, t∈N
[1] Normality:        (Yt|Xt = xt) ~ N(., .)
[2] Linearity:        E(Yt|Xt = xt) = β0 + β1'xt
[3] Homoskedasticity: Var(Yt|Xt = xt) = σ²
[4] Independence:     {(Yt|Xt = xt), t∈N} is an independent process
[5] t-invariance:     θ:=(β0, β1, σ²) not changing with t
where β:=(β0, β1), β0 = E(Yt) − β1'E(Xt), β1 = [Cov(Xt)]⁻¹Cov(Xt, Yt),
σ² = Var(Yt) − Cov(Xt, Yt)'[Cov(Xt)]⁻¹Cov(Xt, Yt), for all t∈N.
An obvious extension of the simple (one-regressor Xt) LR model is the m-regressor LR model, specified in Table 14.10 in terms of the assumptions [1]–[5] that mirror Table 14.1. The statistical GM of the m-regressor LR model, Yt = β0 + Σ_{i=1}^m βi xti + ut, can be written compactly in vector notation as
Yt = β0 + β1'x1t + ut, t∈N,
where β1:=(β1, ..., βm)' and x1t:=(xt1, ..., xtm)' are (m×1) vectors, so that β1'x1t = Σ_{i=1}^m βi xti.
Computational specification. For computational purposes the m-regressor LR model is conveniently framed in terms of all n observations as a system of linear equations in matrix notation:
y = 1n β0 + X1 β1 + u,   (14.59)
where y:=(y1, y2, ..., yn)' is the (n×1) vector of observations on Yt, 1n is the (n×1) vector of ones, X1 is the (n×m) data matrix whose t-th row is (xt1, xt2, ..., xtm), β1:=(β1, β2, ..., βm)' is (m×1), and u:=(u1, u2, ..., un)' is the (n×1) error vector. This can be made more compact using the notation β:=(β0 : β1')', X:=(1n : X1):
y = Xβ + u.   (14.60)
Somewhat misleadingly, the matrix formulation of the LR model in (14.60) is often used by the textbook approach (Greene, 2012) to specify the LR model as in Table 14.11. This specification is questionable on several grounds, including: (i) what does (u|X) mean? (ii) it provides an incomplete list of model assumptions, specified in terms of the unobservable error term u; and (iii) rank(X) = p pertains to the particular data X and not to the probabilistic structure of the underlying process {(Yt | Xt = xt), t∈N}.

Table 14.11   Textbook linear regression (t = 1, 2, . . . , n): y = Xβ + u
{1} Normality:       (u|X) ∼ N(., .)
{2} Zero mean:       E(u|X) = 0
{3} Homoskedasticity, {4} No autocorrelation: E(uu'|X) = σ²In = diag(σ², . . . , σ²)
{5} No collinearity: rank(X) = (m+1) = p
Linear Regression and Related Models
14.5.1 Estimation

The likelihood function for the LR model in Table 14.10 is

L(β0, β1, σ²; y) ∝ ∏_{t=1}^n (1/(σ√(2π))) exp(−(1/(2σ²))(Yt − β0 − β1'xt)²)
                 = (2πσ²)^(−n/2) exp(−(1/(2σ²))∑_{t=1}^n (Yt − β0 − β1'xt)²),

giving rise to the log-likelihood

ln L(β0, β1, σ²; y) = const − (n/2)ln σ² − (1/(2σ²))(y − Xβ)'(y − Xβ).   (14.61)

The first-order conditions for a maximum are

∂ln L(θ)/∂β = (1/σ²)X'(y − Xβ) = 0,
∂ln L(θ)/∂σ² = −n/(2σ²) + (1/(2(σ²)²))(y − Xβ)'(y − Xβ) = 0,

whose solution yields the MLEs of (β := (β0, β1')', σ²):

β̂_ML = (X'X)⁻¹X'y,  σ̂²_ML = (1/n)∑_{t=1}^n û²t = (1/n)û'û,

where the vector of residuals is defined by û = (y − Xβ̂). As in the one-regressor case, β̂_ML is an unbiased, fully efficient, and sufficient estimator of β, but σ̂²_ML is a biased estimator of σ²; see Seber and Lee (2003). An unbiased estimator of σ² is given by

s² = (1/(n−p))∑_{t=1}^n û²t = (1/(n−p))û'û.

The least-squares method gives rise to the same estimator of β, since maximizing the log-likelihood in (14.61) is equivalent to minimizing the quadratic form

ℓ(β) = (y − Xβ)'(y − Xβ) ⇒ ∂ℓ(β)/∂β = (−2)X'(y − Xβ) = 0 ⇒ β̂_OLS = (X'X)⁻¹X'y.

Substituting β̂_OLS back into ℓ(β) yields the RSS: ℓ(β̂) = (y − Xβ̂)'(y − Xβ̂) = û'û, providing the basis for s² as the OLS estimator of σ² (see Rao, 1973). In direct analogy to the simple LR model, the sampling distributions of these estimators, when assumptions [1]–[5] in Table 14.10 are valid, satisfy the properties
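The closed-form expressions can be checked numerically; a minimal sketch with simulated data (the sample size and true parameter values are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta_true, sigma = 50, np.array([1.0, 2.0, -0.5]), 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # X := (1n : X1)
y = X @ beta_true + sigma * rng.normal(size=n)
p = X.shape[1]

b_hat = np.linalg.solve(X.T @ X, X.T @ y)   # beta_hat = (X'X)^(-1) X'y
u_hat = y - X @ b_hat                        # residuals u_hat = y - X beta_hat
s2 = (u_hat @ u_hat) / (n - p)               # unbiased estimator of sigma^2
sigma2_ml = (u_hat @ u_hat) / n              # biased MLE of sigma^2

# the normal equations force X'u_hat = 0 (residuals orthogonal to X)
print(np.abs(X.T @ u_hat).max())
```

The first-order conditions guarantee X'û = 0, which the printed value confirms up to rounding.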
(i) β̂ ∼ N(β, σ²(X'X)⁻¹), (ii) (β̂ − β)'(X'X)(β̂ − β)/σ² ∼ χ²(p),
(iii) β̂ is independent of s², (iv) (n−p)s²/σ² = û'û/σ² ∼ χ²(n−p).   (14.62)

The primary reason for the popularity of the matrix formulation in Table 14.11 is that it can be used as a surrogate computational device in computing the estimates β̂, s²(X'X)⁻¹, s² = û'û/(n−p) for several different statistical models, especially when the assumptions {3} and {4} are extended to include E(uu') = Vn > 0. The statistical models that can be accommodated within the formulation in Table 14.11 are numerous; see Seber and Lee (2003). Although convenient from a computational perspective, the specification in Table 14.11 can be highly misleading for inference purposes.
(a) When viewed as a parameterization of the observable process {(Yt | Xt = xt), t∈N}, the probabilistic assumptions of the various linear models that can be accommodated within the formulation in Table 14.11 are usually different for each model, and thus the estimators β̂ and s² = û'û/(n−p) often have very different sampling distributions, despite the superficial similarities.
(b) The system of n linear equations in y = Xβ + u pertains only to the observation period t = 1, 2, . . . , n, but the model itself should represent a generating mechanism relating to what is assumed to persist beyond n in order to be used for explanatory and prediction purposes. Hence, the proper specification is given in Table 14.10.
(c) Specifying model assumptions via the error term process {(ut | Xt = xt), t∈N} offers a distorted picture of the probabilistic assumptions imposed on the data, and as a result, when any of the assumptions {1}–{4} are found wanting, respecification amounts to remodeling (fixing) the error terms; see Spanos and McGuirk (2001), McGuirk and Spanos (2009).
14.5.2 Linear Regression: Matrix Formulation

The form of the estimators of σ² appears to represent nothing new, in the sense that the formulae coincide with those of the simple LR with p = 2. The estimator of β, however, is more involved, since

X'X = ⎡ n              ∑_{t=1}^n xt1      ∑_{t=1}^n xt2      ···  ∑_{t=1}^n xtm    ⎤
      ⎢ ∑_{t=1}^n xt1  ∑_{t=1}^n xt1²     ∑_{t=1}^n xt1xt2   ···  ∑_{t=1}^n xt1xtm ⎥
      ⎢ ∑_{t=1}^n xt2  ∑_{t=1}^n xt1xt2   ∑_{t=1}^n xt2²     ···  ∑_{t=1}^n xt2xtm ⎥
      ⎢ ...                                                                        ⎥
      ⎣ ∑_{t=1}^n xtm  ∑_{t=1}^n xt1xtm   ∑_{t=1}^n xt2xtm   ···  ∑_{t=1}^n xtm²   ⎦,

X'y = (∑_{t=1}^n yt, ∑_{t=1}^n xt1yt, ∑_{t=1}^n xt2yt, . . . , ∑_{t=1}^n xtmyt)'.

In the case of one regressor (p = 2):

X'X = ⎡ n             ∑_{t=1}^n xt  ⎤
      ⎣ ∑_{t=1}^n xt  ∑_{t=1}^n xt² ⎦
(X'X)⁻¹ = (1/(n∑_{t=1}^n (xt − x̄)²)) ⎡ ∑_{t=1}^n xt²   −∑_{t=1}^n xt ⎤
                                      ⎣ −∑_{t=1}^n xt   n            ⎦

        = ⎡ ∑_{t=1}^n xt²/(n∑_{t=1}^n (xt − x̄)²)   −x̄/∑_{t=1}^n (xt − x̄)² ⎤
          ⎣ −x̄/∑_{t=1}^n (xt − x̄)²                 1/∑_{t=1}^n (xt − x̄)²  ⎦,

(X'X)⁻¹X'y = (X'X)⁻¹ (∑_{t=1}^n yt, ∑_{t=1}^n xtyt)' = ( ȳ − β̂1x̄ , ∑_{t=1}^n (xt − x̄)(yt − ȳ)/∑_{t=1}^n (xt − x̄)² )'.
The estimator of β reminds one of the estimators of β1 in the context of the simple LR model in (14.4), in the sense that (1/n)(X'X) and (1/n)(X'y), broadly speaking, can be thought of as the sample equivalents of the moments Cov(Xt) and Cov(Xt, Yt), respectively. However, the structure of β̂ also involves an estimator of β0 which is not as apparent. To shed some light on β̂0 we need to understand the role of the first column of ones in the matrix X. Returning to the matrix formulation of the statistical GM, y = 1nβ0 + X1β1 + u, a substitution of X := (1n : X1) into the estimator β̂ gives rise to the estimators

β̂1 = (X1'X1 − nx̄1x̄1')⁻¹(X1'y − nȳx̄1), β̂0 = ȳ − β̂1'x̄1,   (14.63)

stemming from the partitioned inversion (Rao et al., 2008):

⎡β̂0⎤   ⎡ n     nx̄1'  ⎤⁻¹ ⎡ nȳ  ⎤   ⎡ ȳ − x̄1'(X̃1'X̃1)⁻¹X̃1'y ⎤
⎣β̂1⎦ = ⎣ nx̄1  X1'X1 ⎦   ⎣ X1'y ⎦ = ⎣ (X̃1'X̃1)⁻¹X̃1'y        ⎦,

where X̃1 := X1 − 1nx̄1' and X̃1'y := X1'y − nȳx̄1. Given that the one-regressor estimators are β̂0 = Ȳ − β̂1x̄1 and β̂1 = [∑_{t=1}^n (x1t − x̄1)²]⁻¹ ∑_{t=1}^n (Yt − Ȳ)(x1t − x̄1), the matrix notation preserves, broadly speaking, the simple formulae.
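The partitioned formulae in (14.63) can be verified against the full-matrix estimator; a small simulated check (all data and coefficient values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 40, 3
X1 = rng.normal(loc=5.0, size=(n, m))             # stochastic regressors
y = 2.0 + X1 @ np.array([0.5, -1.0, 0.3]) + rng.normal(size=n)

X = np.column_stack([np.ones(n), X1])             # X := (1n : X1)
b = np.linalg.solve(X.T @ X, X.T @ y)             # (b0_hat, b1_hat')' jointly

xbar = X1.mean(axis=0)
X1c = X1 - xbar                                   # X~1 := X1 - 1n xbar'
# (14.63): b1_hat from mean-deviation cross-products, b0_hat = ybar - b1_hat'xbar
b1 = np.linalg.solve(X1c.T @ X1c, X1c.T @ (y - y.mean()))
b0 = y.mean() - b1 @ xbar
print(b[0] - b0, np.abs(b[1:] - b1).max())
```

Both routes give identical estimates, confirming that the intercept is recovered as ȳ − β̂1'x̄1.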
14.5.3 Fitted Values and Residuals

The original orthogonal decomposition of the statistical GM, Yt = E(Yt | Xt = xt) + ut = μ(xt) + ut, t∈N, where E(ut | Xt = xt) = 0 ⇒ E(μ(xt)·ut | Xt = xt) = 0 ⇒ E(μ(xt)·ut) = 0, is reflected in its sample analogue y = μ̂(X) + û, denoted by μ̂(X) ⊥ û, i.e. μ̂(X)'û = 0,
where "⊥" denotes "orthogonal to". The fitted values ŷ = μ̂(X) and the residuals û constitute the estimated systematic and non-systematic components:

ŷ = Xβ̂ = X(X'X)⁻¹X'y = PXy ⇒ under [1]–[5], ŷ ∼ N(Xβ, σ²PX),
û = y − Xβ̂ = y − X(X'X)⁻¹X'y = MXy ⇒ under [1]–[5], û ∼ N(0, σ²MX).

The (n×n) matrices PX = X(X'X)⁻¹X' and MX = I − X(X'X)⁻¹X' = I − PX are particularly interesting because they represent operators known in linear algebra as orthogonal projectors, which satisfy the following properties:

[i] PX: idempotent (PXPX = PX), symmetric (PX' = PX), PXX = X;
[ii] MX: idempotent (MXMX = MX), symmetric (MX' = MX), MXX = 0.

In particular, PX projects into the space spanned by the columns of X, say R(X), and MX projects into its complement, say N(X) = R(X)⊥. Naturally, the two projectors are orthogonal:

[iii] MXPX = PXMX = 0.

These properties suggest that ŷ ⊥ û as well as ŷ'ŷ ⊥ û'û:

y'y = ŷ'ŷ + û'û = y'PXy + y'MXy.   (14.64)

This shows how the total sum of squares (TSS = y'y) can be decomposed into the explained sum of squares (ESS = ŷ'ŷ) — the variation explained by the variables in X — and the residual sum of squares (RSS = û'û). Under the model assumptions [1]–[5] (Table 14.10), the sampling distributions of these statistics are

y'y/σ² ∼ χ²(δ1; n), δ1 = β'(X'InX)β/(2σ²), n = rank(In),
ŷ'ŷ/σ² = y'PXy/σ² ∼ χ²(δ2; p), δ2 = β'(X'PXX)β/(2σ²), p = rank(PX),
û'û/σ² = y'MXy/σ² ∼ χ²(δ3; n−p), δ3 = β'(X'MXX)β/(2σ²), n−p = rank(MX).

Let us use the above results to consider testing linear restrictions on β, say Rβ = c, where R (r×p), r ≤ p, and c (r×1) are known constants, in the form of the hypotheses

H0: (Rβ − c) = 0 vs. H1: (Rβ − c) ≠ 0.

An obvious distance function should be based on (Rβ̂ − c), with a sampling distribution

(Rβ̂ − c) ∼ N((Rβ − c), σ²R(X'X)⁻¹R'),

which suggests that a natural distance function is

(Rβ̂ − c)'[R(X'X)⁻¹R']⁻¹(Rβ̂ − c)/σ² ∼ χ²(r) under H0.

Given that (n−p)s²/σ² ∼ χ²(n−p) and s² is independent of β̂, one can use Lemma 11.3 to construct the test statistic

F(y) = (Rβ̂ − c)'[R(X'X)⁻¹R']⁻¹(Rβ̂ − c)/(rs²) = ((ũ'ũ − û'û)/û'û)·((n−p)/r),
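The projector properties [i]–[iii] and the decomposition (14.64) can be confirmed numerically (simulated X and y; dimensions arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)    # P_X = X(X'X)^(-1)X'
M = np.eye(n) - P                         # M_X = I - P_X

# [i]-[iii]: idempotent, symmetric, mutually orthogonal projectors
assert np.allclose(P @ P, P) and np.allclose(P, P.T)
assert np.allclose(M @ M, M) and np.allclose(M @ P, 0.0)
assert np.allclose(P @ X, X) and np.allclose(M @ X, 0.0)

# (14.64): y'y = y'P_X y + y'M_X y, i.e. TSS = ESS + RSS in this notation
tss, ess, rss = y @ y, y @ P @ y, y @ M @ y
print(abs(tss - (ess + rss)))
```

The printed discrepancy is at rounding-error level, as the algebra requires.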
where ũ'ũ and û'û denote the restricted (RRSS) and unrestricted (URSS) residual sums of squares. The relevant sampling distributions of F(y) are

F(y) ∼ F(r, (n−p)) under H0, F(y) ∼ F(δ; r, (n−p)) under H1,   (14.65)

with δ = (1/(2σ²))(Rβ − c)'[R(X'X)⁻¹R']⁻¹(Rβ − c) a non-centrality parameter. A UMP F-test is specified using the rejection region

C1(α) = {y: F(y) > cα}, α = ∫_{cα}^∞ f(v)dv,

where f(v) is the density function of F(r, (n−p)).

Case 1. Let us return to the specification of the LR model: y = 1nβ0 + X1β1 + u. The restrictions β1 = 0 can easily be accommodated in the Rβ = c formulation using the (m×p) matrix R = (0 : Im), with 0 an (m×1) column of zeros, Im an (m×m) identity matrix, and c = 0. In this case

ũ'ũ = (y − 1ȳ)'(y − 1ȳ), û'û = (y − 1ȳ)'(y − 1ȳ) − β̂1'(X̃1'X̃1)β̂1,

where β̂1 is given in (14.63) above, X̃1 := X1 − 1nx̄1', and ũ'ũ − û'û = β̂1'(X̃1'X̃1)β̂1 denotes the ESS. This gives rise to the F-test statistic

F(y) = β̂1'(X̃1'X̃1)β̂1/(ms²) ∼ F(m, (n−m−1)) under H0,

for testing the hypotheses

H0: β1 = 0 vs. H1: β1 ≠ 0.   (14.66)
This F-test can be used to generalize the variance decomposition (Table 14.4) as in Table 14.12.

Table 14.12   Variance decomposition
Source | Sum of squares        | d.f.      | Mean square
ESS    | β̂1'(X̃1'X̃1)β̂1      | m         | β̂1'(X̃1'X̃1)β̂1/m
RSS    | û'û                   | (n−m−1)   | û'û/(n−m−1)
TSS    | ũ'ũ                   | (n−1)     |

As in the one-regressor LR model, the above F-statistic can be written as

F(y) = (R²/(1−R²))·((n−m−1)/m),

where R² is the goodness-of-fit measure

R² = 1 − ∑_{t=1}^n û²t/∑_{t=1}^n (yt − ȳ)² = 1 − û'û/((y − 1ȳ)'(y − 1ȳ)).
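The algebraic equivalence between the F-statistic and R² can be verified directly (simulated data; the true coefficients are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 60, 2
X1 = rng.normal(size=(n, m))
y = 1.0 + X1 @ np.array([0.8, -0.4]) + rng.normal(size=n)

X = np.column_stack([np.ones(n), X1])
b = np.linalg.solve(X.T @ X, X.T @ y)
u = y - X @ b
rss = u @ u                                   # unrestricted RSS
tss = (y - y.mean()) @ (y - y.mean())         # restricted RSS under H0: beta1 = 0
s2 = rss / (n - m - 1)

F_direct = ((tss - rss) / m) / s2             # F-test of H0: beta1 = 0
R2 = 1.0 - rss / tss
F_via_R2 = (R2 / (1.0 - R2)) * ((n - m - 1) / m)
print(F_direct, F_via_R2)
```

The two printed values coincide up to rounding, confirming the identity.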
The role of the constant term. The above testing of (14.66) ignores the coefficient β0 = E(Yt) − β1'E(Xt) of the constant term for good reason. The first column of X is 1n, denoting the constant, and should be automatically included when the data for any of the variables involved in (y, X1) have non-zero mean, which is almost always the case in practice, since zero for data in most computer packages is evaluated using more than 15 decimal places. If a modeler chooses to use the mean-deviation data X̃1 := X1 − 1nx̄1', ỹ := (y − 1nȳ), it is important to ensure that the approximations used are accurate up to more than 10 decimal places. In such cases, one can recover the constant term using the formulation

ỹ = X̃1β1 + u

to estimate β0 via β̂0 = ȳ − β̂1'x̄1 after estimating β1 using β̂1 = (X̃1'X̃1)⁻¹X̃1'ỹ.

Case 2. To simplify the algebra, let us assume that Yt and Xt have mean zero, i.e. E(Zt) = 0, and consider X := (X1 : X2), (n×m), m1 + m2 = m, in the context of the LR model

y = X1β1 + X2β2 + u.   (14.67)
The hypotheses of interest can be accommodated in the Rβ = c formulation, and take the form

H0: β2 = 0 vs. H1: β2 ≠ 0.

The optimal test under assumptions [1]–[5] (Table 14.10) is the F-test

F(y) = ((ũ'ũ − û'û)/û'û)·((n−m−1)/m2) = (y'(M1 − M)y/(y'My))·((n−m−1)/m2) ∼ F(m2, n−m−1) under H0,   (14.68)

with rejection region C1(α) = {y: F(y) > cα}, where

ũ = y − X1β̃1 = M1y, M1 = I − X1(X1'X1)⁻¹X1',
û = y − Xβ̂ = My, M = I − X(X'X)⁻¹X'.

This F-test is particularly useful in practice when probing whether x2t is a potential confounder (relevant omitted variable) for the effects of x1t on yt. It should be emphasized that such probing is testing within the boundaries of the prespecified LR model in (14.67), and amounts to evaluating the substantive adequacy of the estimated model, not its statistical adequacy; see Spanos (2006c). Indeed, for the probing to be reliable, one should secure the statistical adequacy of the LR model in (14.67); see Sections 2.6.2 and 2.6.3.
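The two expressions for F(y) in (14.68) — via residual sums of squares and via the projectors M1 and M — can be checked numerically (simulated mean-zero data; dimensions arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m1, m2 = 50, 2, 1
X1 = rng.normal(size=(n, m1))
X2 = rng.normal(size=(n, m2))
y = X1 @ np.array([1.0, -0.5]) + 0.3 * X2[:, 0] + rng.normal(size=n)
X = np.column_stack([X1, X2])
m = m1 + m2

M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)  # restricted projector
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)       # unrestricted projector
rrss = y @ M1 @ y    # restricted RSS (beta2 = 0 imposed)
urss = y @ M @ y     # unrestricted RSS

F_rss = ((rrss - urss) / urss) * ((n - m - 1) / m2)
F_proj = ((y @ (M1 - M) @ y) / (y @ M @ y)) * ((n - m - 1) / m2)
print(F_rss, F_proj)
```

Since ũ'ũ = y'M1y and û'û = y'My, the two computations are algebraically identical; restricting always increases the residual sum of squares, so rrss ≥ urss.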
14.5.4 OLS Estimators and their Sampling Distributions

Gauss–Markov theorem. The OLS estimator β̂ = (X'X)⁻¹X'y of β, under the assumptions {2}–{5} (excluding {1} Normality), can be shown to be BLUE: β̂ has a smaller variance, σ²(X'X)⁻¹, than any other estimator within the class of linear (β̃ = Ly) and unbiased (E(β̃) = E(Ly) = β) estimators of β. But that does not provide a reliable basis for any form of finite sample (n < ∞) inference, because the relevant sampling distributions are unknown:

(a) ([β̂ − β]|X) ∼ D(0, σ²(X'X)⁻¹), (b) ([s² − σ²]|X) ∼ D(0, ?),

where D(., .) denotes an unknown distribution.
As shown in Table 12.21, without the Normality assumption, Var(s²) = g(μ2, μ4) involves the fourth central moment (μ4) of the unknown distribution f(yt|xt). That does not go away asymptotically, even if the residuals are IID, since lim_{n→∞} n·Var(s²) = μ4 − σ⁴. The problem is that the residuals û are not even IID, since

û ∼ D(0, σ²MX), where MX = [I − X(X'X)⁻¹X'].

In summary, relying on the OLS estimators and invoking asymptotic Normality, by assuming that lim_{n→∞}(X'X/n) = QX:

(a)* (√n(β̂ − β)|X) ∼ N(0, σ²QX⁻¹) as n→∞,
(b)* √n(s² − σ²) ∼ N(0, (μ4 − σ⁴)) as n→∞,   (14.69)

brings Normality through the back door, at the price of using approximations of questionable validity; see Chapters 10 and 12. The covariance Cov(û) = σ²MX converges asymptotically to Cov(u) = σ²In when PX = X(X'X)⁻¹X' → 0 as n→∞.

Huber condition. [I − PX] → I iff the largest diagonal element of PX goes to zero as n→∞. Under this condition, one can deduce that

as n→∞, [I − PX] → I, û ≈ D(0, σ²I),

and thus the residuals behave "as if" they were IID random variables for n "large enough." To investigate the finite sample distributions of (β̂, s²), we need to consider how they are affected by the structure of the vector process {Xt, t∈N}, since it enters both estimators via PX.
14.6 The LR Model: Numerical Issues and Problems Numerical problems in the context of the LR model stem from the particular data matrix Z0 :=(y, X), which might or might not concern the underlying generating mechanism represented by the model itself. In discussing these numerical problems one needs to keep this important distinction in mind to avoid confusion.
14.6.1 The Problem of Near-Collinearity

In the context of the n-data formulation of the LR model

y = Xβ + u, y: (n×1), X: (n×p), n > p,   (14.70)

perfect collinearity (multicollinearity) is defined by rank(X) < p, in which case (X'X)⁻¹ does not exist. In practice the problem usually takes the form of near-collinearity, where (X'X) is close to singular. Commonly used diagnostic thresholds include κ(X'X) > 30, det(X'X) ≈ 0, R²[i] > .9, VIF > 10, or VIF > 20, which are often disputed by different practitioners, because there is no principled way to select among these ad hoc thresholds. A typical textbook account of near-collinearity and its consequences is: "when regressors are highly, although not perfectly, correlated, [the consequences] include the following symptoms: Small changes in the data produce wide swings in the parameter estimates. Coefficients may have very high standard errors and low significance levels even though they are jointly highly significant and the R2 in the regression is quite high. Coefficients will have the wrong sign or implausible magnitudes." (Greene, 2012, p. 89)

The "near-singularity" of (X'X) has a numerical and a statistical side that are invariably conflated. In the first sentence Greene uses the statistical perspective, but in the second sentence he pivots to the numerical. The third sentence returns to the statistical perspective, asserting that, for the two-regressor case:

(i) as |ρ̂23| → 1, V̂ar(β̂i) = s²/[∑_{t=1}^n (xit − x̄i)²(1 − ρ̂²23)], i = 1, 2, and R² will increase monotonically, and as a result (see also Wooldridge, 2013)
(ii) as |ρ̂23| → 1, the t-ratios τ1(y) = β̂1/√V̂ar(β̂1), τ2(y) = β̂2/√V̂ar(β̂2) will decrease monotonically.

As argued in Spanos and McGuirk (2002), none of the assertions in (i)–(ii) hold, because they ignore the parameterizations of θ := (β, σ²) (Table 14.10), which indicate that (X'X) appears not just in Ĉov(β̂) but also in β̂ and s², and thus Ĉorr(Xit, Xjt) → 1 affects all three estimates at the same time.

14.6.1.1 Numerical Analysis Perspective

From a numerical analysis vantage point, the near-singularity of (X'X) is best understood in terms of the data matrix X being ill-conditioned, as measured by the matrix norm (‖.‖) condition number, defined by

κ(X'X) = ‖X'X‖·‖(X'X)⁻¹‖ > 0.   (14.72)
The matrix X is considered perfectly conditioned (orthogonal x1, x2, . . . , xp) when κ(X'X) = 1, well-conditioned when κ(X'X) is small, and ill-conditioned when κ(X'X) is large; on how large is "large enough" see Golub and Van Loan (2013). The most widely used matrix norm for a matrix X is the Euclidean, defined by

‖X‖2 = √λmax(X'X),

where λmax is the largest eigenvalue of (X'X); it is directly related to the eigenvalues of the matrix; see Golub and Van Loan (2013, pp. 71–73). Given that (X'X) is a symmetric and positive-definite matrix, the Euclidean condition number κ2(X'X) takes the familiar form

κ2(X'X) = ‖X'X‖2·‖(X'X)⁻¹‖2 = λmax/λmin = κ2²(X).
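The Euclidean condition number κ2(X'X) = λmax/λmin can be computed directly and compared with a library routine (the matrix below is a hypothetical example with nearly collinear columns):

```python
import numpy as np

X = np.array([[1.0, 1.000],
              [1.0, 1.001],
              [1.0, 1.002]])        # nearly collinear columns (hypothetical)
XtX = X.T @ X

eig = np.linalg.eigvalsh(XtX)        # eigenvalues of a symmetric p.d. matrix
kappa2 = eig.max() / eig.min()       # kappa_2(X'X) = lambda_max / lambda_min
print(kappa2, np.linalg.cond(XtX))   # the two computations agree
```

Note that κ2(X'X) = κ2(X)², so `np.linalg.cond(X)**2` gives the same value: ill-conditioning of X is squared when the cross-product matrix is formed.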
The numerical analysis perspective traces the implications of ill-conditioning to the sensitivity of the numerical solution associated with (X'X)β = X'y, with (β̂p) and without (β̂) small perturbations (E, e) in (X, y), using upper bounds (Stewart, 1973):

(i) Xp = X + E: ‖β̂ − β̂p‖/‖β̂‖ ≤ 2κ(X)‖PXE‖/‖X‖ + 4κ²(X)‖(I − PX)E‖‖û‖/(‖X‖‖ŷ‖) + 8κ³(X)‖(I − PX)E‖²/‖X‖²,
(ii) yp = y + e: ‖ŷ − ŷp‖/‖ŷ‖ ≤ κ(X)‖e‖/‖y‖,   (14.73)

where ŷ = PXy, PX = X(X'X)⁻¹X', ŷp = PXyp, û = y − ŷ. The primary purpose of these upper bounds is to provide a measure of the "potential" volatility of β̂ to small changes in the data Z0 := (y, X). The idea is for the practitioner to evaluate these upper bounds for different (E, e) to quantify how "potential" changes in the particular data Z0 might affect β̂, and to decide whether the size of the potential volatility is serious enough to call for additional data information or model respecification; see Spanos and McGuirk (2002), Spanos (2019).

Determinant of (X'X). The widely used numerical measures det(RX) ≈ 0 or det(X'X) ≈ 0 are inappropriate for evaluating the near-singularity of (X'X), because they are at odds with κ(X'X). The simple reason is that, since det(X'X) = λ1·λ2···λp, there are many different ways this product can be close to zero without κ(X'X) being very large, and vice versa.

Example 14.10 For (X'X) = diag(10⁻⁴, . . . , 10⁻⁴), det(X'X) = 10⁻⁴ᵖ, but κ2(X'X) = 1.

Example 14.11 To illustrate how the ill-conditioning of (X'X) will induce serious instability in the estimated coefficients and their standard errors (SEs), consider the following highly artificial example with a tiny sample size (Spanos, 2019):

X := ⎡ 1.00   4.0 ⎤
     ⎢ 2.01   8.0 ⎥
     ⎢ 4.02  16.0 ⎥
     ⎣ 6.04  24.1 ⎦, y := (4.1, 2.3, 4.2, 6.2)',   (14.74)

(X'X) = ⎡ 57.6821  229.964 ⎤
        ⎣ 229.964  916.81  ⎦, κ2(X'X) ≈ 11197862, det(X'X) ≈ .085.

κ2(X'X) indicates that (X'X) is ill-conditioned, but det(X'X) doesn't:

(β̂1, β̂2)' = [57.6821 229.964; 229.964 916.81]⁻¹(63.055, 251.42)' = (−95.446, 24.215)'.
Change 1 Changing x11 = 1.00 to 1.01 induces a change of signs in β̂:

(β̂1, β̂2)' = [57.7022 230.004; 230.004 916.81]⁻¹(63.096, 251.42)' = (170.560, −42.515)'.

Change 2 Changing x41 = 6.04 to 6.05 induces sizeable changes in magnitude in β̂:

(β̂1, β̂2)' = [57.803 230.205; 230.205 916.81]⁻¹(63.117, 251.42)' = (−448.564, 112.906)'.

It also affects the significance of (β1, β2):

Original SEs:  √Var(β̂1) = 103.975σ, √Var(β̂2) = 26.08σ
Change 2 SEs:  √Var(β̂1) = 186.3σ,  √Var(β̂2) = 46.8σ

Change 3 Changing y1 = 4.1 to 4.2 changes the magnitude of β̂ significantly:

(β̂1, β̂2)' = [57.6821 229.964; 229.964 916.81]⁻¹(63.197, 251.82)' = (355.015, −88.774)'.

Despite the artificiality of the above example, it is worth noting that the values for both X1t and X2t exhibit a distinct form of t-heterogeneity in tandem. If one were to change the order of the observations in X to render the values less trending in tandem and more typical of a random (IID) sample, say

X* := ⎡ 1.00  16.0 ⎤
      ⎢ 6.04   8.0 ⎥
      ⎢ 4.02   4.0 ⎥
      ⎣ 2.01  24.1 ⎦,

the instability and near-singularity vanish, since κ2(X*'X*) = 37.06, and the previous data changes 1–3 induce only minor changes in β̂:

reordered: β̂ := (β̂1, β̂2)' = (.323, .169)',
change 1:  β̂ := (β̂1, β̂2)' = (.324, .169)',
change 2:  β̂ := (β̂1, β̂2)' = (.322, .167)',
change 3:  β̂ := (β̂1, β̂2)' = (.316, .172)'.   (14.75)

This brings out the fact that the ordering of the data is of paramount importance in this context. One might question the above line of reasoning by arguing that the changes can also be explained by the fact that for the data in (14.74) Ĉorr(X2t, X3t) = .9999, while for the data in (14.75) Ĉorr(X2t, X3t) = −.693. This argument is clearly misleading, because the statistical validity of Ĉorr(X2t, X3t) depends on probabilistic assumptions, such as constant means, which are clearly invalid for the data in (14.74); both data series are trending.

In summary, the numerical perspective on the near-collinearity of (X'X) has three crucial features: (i) it is data-specific (Z0); (ii) its impact is evaluated by the potential perturbations of (β̂, s²); and (iii) its appraisal is relative to the particular parameterization θ := (β, σ²) (Table 14.10). These features suggest that a large enough κ(X'X) indicates that the particular data Z0 contain insufficient information for learning about the true θ := (β0, β1, σ²) (Table 14.10) "reliably enough."
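The perturbation analysis of Example 14.11 can be reproduced directly from the data in (14.74):

```python
import numpy as np

X = np.array([[1.00, 4.0],
              [2.01, 8.0],
              [4.02, 16.0],
              [6.04, 24.1]])            # data from (14.74)
y = np.array([4.1, 2.3, 4.2, 6.2])

def beta_hat(X, y):
    # beta_hat = (X'X)^(-1) X'y
    return np.linalg.solve(X.T @ X, X.T @ y)

b = beta_hat(X, y)
print(np.linalg.cond(X.T @ X))          # huge: (X'X) is ill-conditioned

# Change 1: x11 = 1.00 -> 1.01 flips the signs of both estimates
Xp = X.copy()
Xp[0, 0] = 1.01
bp = beta_hat(Xp, y)
print(b, bp)                             # sign reversal, as in the text
```

A perturbation in the third decimal place of a single entry reverses both coefficient signs, exactly the "wide swings" that the condition number warns about.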
14.6.1.2 Statistical Perspective

On the other hand, the statistical perspective revolves around the sample correlations among the regressors, and thus it is highly vulnerable to statistical misspecification. For instance, the correlation matrix among regressors

RX := D⁻¹(XM'XM)D⁻¹, D := diag(√∑_{t=1}^n (x1t − x̄1)², . . . , √∑_{t=1}^n (xpt − x̄p)²),   (14.76)

where XM := X − 1nx̄', 1n := (1, 1, . . . , 1)', x̄ := (x̄1, x̄2, . . . , x̄p)' denote the sample means, is likely to be misspecified when the data are mean-trending.

Variance inflation factors. Another widely used measure of near-collinearity relates to Ĉov(β̂) = s²(X'X)⁻¹:

V̂ar(β̂i) = s²[∑_{t=1}^n (xit − x̄i)²(1 − R²[i])]⁻¹, i = 1, 2, . . . , m,   (14.77)

where R²[i] denotes the estimated squared multiple correlation coefficient

R²[i] = 1 − ∑_{t=1}^n v̂²it/∑_{t=1}^n (xit − x̄i)²   (14.78)

associated with the auxiliary regression of xit on all the other regressors x(i)t:

xit = α0 + α(i)'x(i)t + vit, vit ∼ NIID(0, σ²i), i = 1, 2, 3, . . . , p−1.   (14.79)

Marquardt (1970) proposed the VIFs, VIFi = [1/(1 − R²[i])], i = 2, 3, . . . , p, arguing that: "They are the factors by which the respective parameters are increased, due only to the correlation among the x-variables" (p. 606). The VIFs relate directly to attempts to quantify the effects of ill-conditioning using statistical measures based on the correlation matrix RX. Theil (1971, p. 166) showed that

RX⁻¹ = [rⁱʲ], with rⁱⁱ = (1 − R²[i])⁻¹ := VIFi, i = 1, 2, . . . , p.   (14.80)
The intuitive appeal of the VIFs stems from the common-sense view that if R²[i] ≈ 1, then xit is close to being collinear with the other regressors x(i)t. Hence, the simple rule of thumb VIFi = [1 − R²[i]]⁻¹ > 10 is often justified on the basis that R²[i] > .9 is close enough to 1. A closer look at this argument, however, reveals that it can be highly misleading. The most questionable aspect of the VIFs is the fact that they treat all the right-hand-side variables, including deterministic variables such as trend polynomials, the same, as if they were all legitimate regressors. Treating the deterministic polynomials ξm(t) := (1, t, t², . . . , tᵐ) the same way as the stochastic regressors Xt is not just a convenient "abuse" of language, but a serious empirical modeling error, because the concepts of mean deviation and correlation make no sense for such deterministic variables. In particular, the inclusion of trend polynomials ξm(t) in an LR model calls into question the evaluation of the VIFs for such terms, because their R²[i] = 1 − ∑_{t=1}^n v̂²it/∑_{t=1}^n (xit − x̄i)² are rendered meaningless, since x̄i is, by definition, totally inappropriate for trend terms.

How should one handle deterministic terms, such as trend polynomials, in the LR model? A more appropriate way to proceed is to separate the deterministic terms from the stochastic explanatory variables X1t := (Xit, i = 2, . . . , p) using the formulation

y = Δδ + X1β1 + u, (u|X) ∼ NIID(0, σ²In),   (14.81)
where X1 denotes the n observations associated with the proper explanatory variables and Δ the n values of the deterministic terms. For instance, in the case of the trend polynomial terms ξm(t) := (1, t, t², . . . , tᵐ), Δ := [(1, t, t², . . . , tᵐ), t = 1, . . . , n]. Having separated the two, one can estimate the unknown parameters (δ, β1, σ²) using the Frisch–Waugh (1933) theorem, based on the projection matrix

MΔ = I − Δ(Δ'Δ)⁻¹Δ' = I − PΔ, MΔMΔ = MΔ, MΔ' = MΔ, MΔΔ = 0.

Pre-multiplying (14.81) by MΔ yields the transformed specification

MΔy = MΔX1β1 + MΔu ⇒ ỹ = X̃1β1 + ũ,   (14.82)

whose estimation results coincide with those of (14.81):

β̂1 = (X1'MΔX1)⁻¹X1'MΔy, s² = û'û/(n−p−m), û = (MΔy − MΔX1β̂1).

Note that MΔy = ỹ and MΔX1 = X̃1 yield the mean deviations of y and X1 from the trend polynomial (Δ), giving rise to the correct goodness-of-fit formula

R²Δ = 1 − û'û/(y'MΔy) = 1 − û'û/(ỹ'ỹ),

and X1'MΔX1 provides the proper basis for the sample correlations for X1t.
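The Frisch–Waugh result in (14.82) can be verified by comparing the full regression with the detrended one (simulated trending data; all numerical choices below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
t = np.linspace(0.0, 1.0, n)
Delta = np.column_stack([np.ones(n), t, t ** 2])   # deterministic terms
x1 = 2.0 * t + rng.normal(size=n)                  # trending stochastic regressor
X1 = x1[:, None]
y = Delta @ np.array([1.0, 0.5, 0.2]) + 0.7 * x1 + rng.normal(size=n)

# full regression of y on (Delta : X1)
W = np.column_stack([Delta, X1])
b1_full = np.linalg.solve(W.T @ W, W.T @ y)[-1]

# detrended regression: M_Delta y on M_Delta X1, as in (14.82)
M = np.eye(n) - Delta @ np.linalg.solve(Delta.T @ Delta, Delta.T)
b1_fw = np.linalg.solve(X1.T @ M @ X1, X1.T @ M @ y)[0]
print(b1_full, b1_fw)
```

The coefficient on X1 is identical in the two regressions, which is why detrending first with MΔ and then computing correlations from X1'MΔX1 gives the statistically meaningful quantities.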
14.6.1.3 Numerical vs. Statistical Perspective

The above discussion of the numerical and statistical perspectives brings out a serious disconnect between the two, because (i) high correlation among regressors (statistical) is neither necessary nor sufficient for (X'X) to be ill-conditioned (numerical), and (ii) κ(X'X) pertains to the particular numbers in X, irrespective of whether the numbers denote observed data, but RX ultimately depends on the probabilistic structure of the process {Xt, t∈N}. Indeed, many confusions in the literature on near-collinearity stem from erroneously attributing symptoms of statistical misspecification to the presence of near-collinearity, when the latter is misdiagnosed using unreliable statistical measures based on correlations among regressors. To illustrate that, consider the following example of how changes in sign/magnitude can naturally occur when the estimated regression is statistically adequate. Consider the case where one begins with the one-regressor (x1t) LR model

M1: Yt = α0 + α1x1t + εt, t∈N,   (14.83)

and then proceeds to add a second regressor x2t by estimating

M2: Yt = β0 + β1x1t + β2x2t + ut, t∈N.   (14.84)

The "wrong" signs can take the form of sign reversal:

α̂1 ≷ 0 and β̂1 ≶ 0.   (14.85)

It can be shown that such sign reversals can easily arise in this case when the two LR models are (i) statistically adequate (assumptions [1]–[5] are valid for both) and (ii) ρ23 = Corr(X1t, X2t) is not very high. This is explained by pointing out the fact that one is
estimating very different statistical parameters in the two LR models. In particular, using the parameterizations in Table 14.1:

M1: α0 = μ1 − α1μ2, α1 = (σ12/σ22), σ²ε = σ11 − (σ²12/σ22); note that ρ12 = α1√(σ22/σ11);

M2: β1 = (σ12σ33 − σ13σ23)/(σ22σ33 − σ²23), β2 = (σ13σ22 − σ12σ23)/(σ22σ33 − σ²23),
    β0 = μ1 − β1μ2 − β2μ3, σ²u = σ11 − σ12β1 − σ13β2.   (14.86)

Without any loss of generality, one can express β1 and β2 in (14.86) in terms of the correlation coefficients ρ12, ρ13, ρ23 between (yt, X1t, X2t), respectively:

β1 = (ρ12 − ρ13ρ23)/(1 − ρ²23), β2 = (ρ13 − ρ12ρ23)/(1 − ρ²23).   (14.87)
(14.87) indicates that the sign reversals in (14.85) arise naturally when

[a] ρ12 > 0 and ρ13ρ23 > ρ12 → α̂1 > 0, β̂1 < 0,
[b] ρ12 < 0 and ρ13ρ23 < ρ12 → α̂1 < 0, β̂1 > 0.

Example 14.12 For values (ρ12, ρ13, ρ23) = (.5, ±.7, ±.8) that satisfy [a]: α1 = .5, β1 = −.167, β2 = ±.833, σ²u = .5. For (ρ12, ρ13, ρ23) = (−.5, ∓.7, ±.8) that satisfy [b]: α1 = −.5, β1 = .167, β2 = ±.833, σ²u = .5. Since ρ23 = ±.8, the sign reversals have nothing to do with ρ23 being very high. In contrast, when (ρ12, ρ13, ρ23) = (.5, ±.49, ±.99): α1 = .5, β1 = .749, β2 = ∓.251, σ²u = .749. Hence, for ρ23 = ±.99 no sign reversal arises.

This example clearly indicates that when the LR models (14.83)–(14.84) are statistically adequate, the correlation values (ρ12, ρ13, ρ23) determine the correct signs and magnitudes, statistically speaking, calling into question all the above claims by Greene (2012, p. 89). Indeed, Example 15.9 demonstrates how statistical misspecification, not near-collinearity, can give rise to "coefficients with the wrong sign."

14.6.1.4 Summary for the Practitioner

The norm condition number κ(X'X) pertains exclusively to the ill-conditioning of (X'X), irrespective of whether the numbers denote observed data or not. In contrast, the statistical measures view the near-collinearity problem as pertaining to the probabilistic structure of the process {(yt|Xt), t∈N}, since they are based on sample statistics. Contrasting these two perspectives gives rise to several conclusions, which can be summarized as recommendations to practitioners.

First, the traditional scenarios pertaining to how increasing the sample correlations among regressors (R²[i] → 1, i = 1, . . . , p) will increase Ĉov(β̂) and the R² but decrease the t-statistics, or induce sign reversals of coefficients (Greene, 2012, p. 89), are erroneous because they disregard the parameterization of θ := (β0, β1, σ²) (Table 14.10); see Spanos and McGuirk (2002).
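The sign-reversal arithmetic in Example 14.12 can be reproduced directly from (14.87) (the correlation values are those of the example):

```python
def betas(r12, r13, r23):
    # (14.87): partial-regression coefficients from pairwise correlations
    b1 = (r12 - r13 * r23) / (1.0 - r23 ** 2)
    b2 = (r13 - r12 * r23) / (1.0 - r23 ** 2)
    return b1, b2

# case [a]: alpha1 = .5 > 0, yet beta1 < 0 with only moderate rho23 = .8
b1_a, b2_a = betas(0.5, 0.7, 0.8)
print(round(b1_a, 3), round(b2_a, 3))   # -0.167, 0.833: sign reversal

# very high rho23 = .99 need not produce a reversal
b1_c, b2_c = betas(0.5, 0.49, 0.99)
print(round(b1_c, 3), round(b2_c, 3))   # 0.749, -0.251: no reversal
```

The reversal is driven by the configuration of all three correlations, not by ρ23 alone being large.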
Second, statistical measures, such as R²_[i] and VIF(x_it), i = 1, ..., p, are highly vulnerable to statistical misspecification. Particularly pernicious is the presence of mean t-heterogeneity (departure from [5]).

Third, using trend polynomials ξ_m(t) := (t, t², ..., t^m) with a view to accounting for the t-heterogeneity in the particular data Z_0 will also undermine these statistical measures, especially when the trend terms [(t, t², ..., t^m), t = 1, ..., n] are treated as part and parcel of the X data. These trend terms render R²_[i] and VIF(ξ_i(t)), i = 1, ..., m, statistically meaningless. In practice, one needs to separate ξ_m(t) by viewing it as providing generic ways to account for the t-heterogeneity in the original data, as in (14.81). The separation yields appropriately corrected statistical measures based on (properly) detrended data, X_1 and y, as in (14.82).

Fourth, in cases where referees and/or editors of journals insist that one should report VIFs for all right-hand-side variables, including the trend terms, one could minimize the values of these VIFs by rescaling t = 1, ..., n to t_0∈[−1, 1] using the transformation (Seber and Lee, 2003)

t_0 = (2t − n − 1)/(n − 1), t = 1, 2, ..., n,    (14.89)

and replacing the ordinary with orthogonal polynomials, such as the Chebyshev, whose first few terms are

t°_1 = t, t°_2 = 2t² − 1, t°_3 = 4t³ − 3t, t°_4 = 8t⁴ − 8t² + 1, t°_5 = 16t⁵ − 20t³ + 5t, ...
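The effect of the rescaling in (14.89) combined with Chebyshev terms can be illustrated numerically. The following is a minimal sketch (NumPy only; the sample size, polynomial order, and the VIF helper are illustrative assumptions, not part of the text):

```python
import numpy as np

def rescale_t(n):
    """Map t = 1, ..., n onto [-1, 1] via t0 = (2t - n - 1)/(n - 1), as in (14.89)."""
    t = np.arange(1, n + 1)
    return (2 * t - n - 1) / (n - 1)

def vif(X):
    """VIF of each column: 1/(1 - R^2) from regressing it on the remaining columns."""
    X = np.asarray(X, dtype=float)
    out = []
    for i in range(X.shape[1]):
        y = X[:, i]
        Z = np.column_stack([np.ones(len(y)), np.delete(X, i, axis=1)])
        yhat = Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
        r2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
        out.append(1 / (1 - r2))
    return np.array(out)

n = 50
t = np.arange(1, n + 1, dtype=float)
t0 = rescale_t(n)

# Raw trend polynomials t, t^2, t^3 are nearly collinear ...
raw = np.column_stack([t, t**2, t**3])
# ... while the Chebyshev terms in the rescaled t0 are far better conditioned:
cheb = np.column_stack([t0, 2*t0**2 - 1, 4*t0**3 - 3*t0])

print(vif(raw).max(), vif(cheb).max())
```

The maximal VIF of the raw trend terms is several orders of magnitude larger than that of the rescaled Chebyshev terms, even though both span the same cubic trend.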
14.6.2 The Hat Matrix and Influential Observations

The above discussion suggests that the numerical structure of X (n × p) influences the inference in the context of the LR model via the projection matrix P_X = X(X'X)⁻¹X', referred to as the hat matrix and denoted by H = X(X'X)⁻¹X', since ŷ = Hy:

ŷ_i = Σ_{j=1}^n h_ij y_j = h_ii y_i + Σ_{j≠i} h_ij y_j, i = 1, 2, ..., n.

H being symmetric (H = H') and idempotent (HH = H) implies (a)–(c).
(a) h_ii = Σ_{j=1}^n h_ij² = h_ii² + Σ_{j≠i} h_ij², with 1/n ≤ h_ii ≤ 1, where

h_ii = x_i'(X'X)⁻¹x_i, h_ij = x_i'(X'X)⁻¹x_j, for i, j = 1, 2, ..., n,

and x_i := (x_1i, x_2i, ..., x_pi)' (p × 1) denotes the ith row of X.

(b) The "average" size of h_ii is p/n, because Σ_{i=1}^n h_ii = p, since rank(H) = rank((X'X)⁻¹X'X) = trace(I_p) = p. In view of (b), Huber's condition does not seem unreasonable, since for Y_i = β_0 + β_1 x_i + u_i:

h_ij = 1/n + (x_i − x̄)(x_j − x̄)[Σ_{i=1}^n (x_i − x̄)²]⁻¹, i, j = 1, 2, ..., n,
and thus the Huber condition amounts to max_{i=1,...,n} (x_i − x̄)²[Σ_{i=1}^n (x_i − x̄)²]⁻¹ → 0 as n→∞. This is also directly related to the condition for the asymptotic Normality of β̂.
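Properties (a) and (b) of the hat matrix are easy to verify numerically. A small sketch (the design matrix is simulated, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix H = X(X'X)^{-1}X'
h = np.diag(H)

# (a) h_ii = sum_j h_ij^2, and 1/n <= h_ii <= 1 when X contains a constant
print(np.allclose(h, (H**2).sum(axis=1)), h.min() >= 1/n, h.max() <= 1)

# (b) trace(H) = rank(H) = p, so the "average" leverage is p/n
print(np.trace(H), h.mean(), p / n)
```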
For any (p × 1) vector c, the quadratic form c'(X'X)⁻¹c = λ, where λ is a constant, defines p-dimensional elliptical contours centered at x̄ := (x̄_1, x̄_2, ..., x̄_p). Using the notation X := (1_n : X_(1)), one can deduce that

h_ii = x_i'(X'X)⁻¹x_i = 1/n + (x_(1)i − x̄_(1))'(X̃_(1)'X̃_(1))⁻¹(x_(1)i − x̄_(1)), i = 1, 2, ..., n,

where (X̃_(1)'X̃_(1)) = X_(1)'X_(1) − n x̄_(1)x̄_(1)' defines n concentric equal-probability elliptical contours. This implies that c'(X'X)⁻¹c = λ, with λ = max_{i=1,...,n}(h_ii), defines the smallest convex set which contains the scatter of the n points of X. The implication is that a large value of h_ii indicates that the observation x_i is far from the center and thus possibly "non-typical." What is "typical," however, depends crucially on what the underlying distribution D(X_t; φ_2) is. A particularly interesting case is when X_t is Normally distributed.

(c) When X_t ~ N(μ_2, Σ_22), we can deduce that

(n−p)[h_ii − 1/n] / ((p−1)[1−h_ii]) ~ F((p − 1), (n − p)).    (14.90)
As argued by Box and Watson (1962) and Ali and Sharma (1996), the reliability of inference in the context of the LR model is influenced by the non-Normality of f(x_t; φ_2), which can be assessed using the percentage of observed points outside the (1 − α) "confidence" ellipsoid:

(x_(1)i − x̄_(1))'(X̃_(1)'X̃_(1))⁻¹(x_(1)i − x̄_(1)) ≤ c_α, where α = ∫_{c_α}^∞ dχ²(p − 1), i = 1, 2, ..., n.
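The ellipsoid check can be sketched by simulation, using the Mahalanobis distance based on the sample covariance (a rescaled version of X̃'X̃) and SciPy's chi-square quantile. All settings (dimension, correlation, sample size) are illustrative assumptions:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
n, k = 2000, 2                        # k = p - 1 regressors (excluding the constant)
X1 = rng.multivariate_normal([0, 0], [[1, .5], [.5, 1]], size=n)

xbar = X1.mean(axis=0)
S = np.cov(X1, rowvar=False)          # sample covariance of the regressors
d2 = np.einsum('ij,jk,ik->i', X1 - xbar, np.linalg.inv(S), X1 - xbar)

alpha = 0.05
c_alpha = chi2.ppf(1 - alpha, df=k)   # (1 - alpha) chi-square cutoff with p - 1 df
frac_out = np.mean(d2 > c_alpha)
print(f"fraction outside the 95% ellipsoid: {frac_out:.3f} (nominal {alpha})")
```

Under Normality the observed fraction should be close to α; a markedly larger fraction is symptomatic of non-Normality of f(x_t; φ_2).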
14.6.3 Individual Observation Influence Measures

The hat matrix is also very useful in detecting "outliers": values of X_t which are considered non-typical. Using the result in (14.90), one can derive a "rule of thumb" that for

h_ii = x_i'(X'X)⁻¹x_i > 2p/n, i = 1, 2, ..., n,

where x_i denotes the ith row of X, the influence of observation x_i is unduly large! For more formal assessment tools for the influence of x_i we need to estimate the regression model without the observation x_i:

β̂_(i) = (X_(i)'X_(i))⁻¹X_(i)'y_(i), s²_(i) = (y_(i) − X_(i)β̂_(i))'(y_(i) − X_(i)β̂_(i))/(n − p − 1),

and compare them to (β̂, s²):

(β̂ − β̂_(i)) = (X'X)⁻¹x_i û_i/(1 − h_ii), (n − p)s² − (n − p − 1)s²_(i) = û_i²/(1 − h_ii).

That is, by leaving out observation x_i both estimators change proportionally to the size of the residual û_i, but are also influenced greatly when h_ii is close to one. Another way to interpret h_ii is in terms of the fitted values, using the convex combination of y_i and x_i'β̂_(i), with h_ii providing the weight for y_i: ŷ_i = h_ii y_i + (1 − h_ii)x_i'β̂_(i).

Cook's D_i. A popular measure of the overall influence of x_i on β̂, based on the difference (β̂_(i) − β̂), is Cook's D_i(y):

D_i(y) = (β̂_(i) − β̂)'(X'X)(β̂_(i) − β̂)/(ps²) = (ŷ_(i) − ŷ)'(ŷ_(i) − ŷ)/(ps²) = (v_i²/p)(h_ii/(1 − h_ii)),
where v_i = û_i/(s√(1 − h_ii)) denotes the studentized ith residual. It is interesting to note that D_i(y) is large when v_i is large or/and the point x_i is far from the centroid of the X space; as h_ii → 1, w_i = h_ii/(1 − h_ii) → ∞, rendering x_i a high leverage point. Thus, Cook's D_i(y) is the product of two measures of outlyingness: (i) for y_i, based on v_i², and (ii) for the regressors, based on w_i. For the rule of thumb Cook's D_i(y) > 4/n, see Cook and Weisberg (1999).

DFFITS. A statistic closely related to D_i(y) measures the effect of x_i on ŷ_i:

DF_i(y) = v_(i)√(h_ii/(1 − h_ii)), where v_(i) = û_i/(s_(i)√(1 − h_ii)) ~ St(n − p − 1), i = 1, ..., n,

are the externally studentized residuals.

Covariance ratio. This measures the effect of x_i on s²(X'X)⁻¹ and is defined by

CVR_i(y) = det(s²_(i)(X_(i)'X_(i))⁻¹)/det(s²(X'X)⁻¹) = [(n−p−1)/(n−p) + v_(i)²/(n−p)]^(−p) (1/(1 − h_ii)).

Given that CVR_i(y) ≈ 1 − p(v_i² − 1)/n, the rule of thumb is that values of x_i such that |CVR_i(y) − 1| > 3p/n are considered high leverage points.

The above influence measures are widely used in empirical modeling, but their implementation is dominated by rules of thumb and "informal" testing procedures. This state of affairs has arisen primarily because these measures lack firm probabilistic foundations, as well as clear links to hypotheses about the model parameters.

14.6.3.1 Testing for "Non-Typical" Observations

An important issue is the presence of non-typical observations (outliers) in the data Z_0 := (y, X). What constitutes an outlier, of course, depends crucially on the assumed distribution, and one has to treat such assessment procedures with caution. Asking whether a particular observation (y_t, x_t) is typical or non-typical is a question that concerns the adequacy of the assumed statistical model vis-à-vis the observed data. To bring that out, we recast the various influence measures in search of "non-typical" observations in the context of M-S testing. For that we need to parameterize these departures from the assumed LR model formulation given in (14.67), where X_2 is a matrix of dummy variables of the form

d_i := (0, 0, ..., 0, 1, 0, 0, ..., 0), with the 1 in the ith position,

for each possibly "non-typical" ith observation, and test the significance of the coefficients β_2 using the F-test in (14.68).
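A hedged numerical sketch of the influence measures above and the dummy-variable M-S test (the data are simulated; the planted outlier at observation 5, the seed, and the sample size are arbitrary assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
y[5] += 7.0                                   # plant one non-typical observation

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
u = y - X @ beta                              # residuals
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # leverages h_ii (rule of thumb: > 2p/n)
s2 = u @ u / (n - p)

v = u / np.sqrt(s2 * (1 - h))                 # studentized residuals
cook_D = (v**2 / p) * (h / (1 - h))           # Cook's D_i
s2_i = ((n - p) * s2 - u**2 / (1 - h)) / (n - p - 1)   # leave-one-out s^2_(i)
v_ext = u / np.sqrt(s2_i * (1 - h))           # externally studentized residuals
dffits = v_ext * np.sqrt(h / (1 - h))         # DFFITS_i

# M-S test: add the dummy d_5 and test its coefficient with the F-test (q = 1)
d = np.zeros(n); d[5] = 1.0
RRSS = u @ u                                   # restricted: beta_2 = 0
b2 = np.linalg.lstsq(np.column_stack([X, d]), y, rcond=None)[0]
e2 = y - np.column_stack([X, d]) @ b2
URSS = e2 @ e2                                 # unrestricted
F = (RRSS - URSS) / (URSS / (n - p - 1))
p_value = 1 - stats.f.cdf(F, 1, n - p - 1)
print(cook_D[5], F, p_value)
```

Note that the leave-one-out quantities (β̂_(5), s²_(5)) obtained from the closed-form identities agree with those from actually refitting without observation 5.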
14.7 Conclusions

This chapter brings out three important distinctions that can help to elucidate the modeling and inference for the linear regression (LR) model: (a) the statistical vs. the substantive information/model, (b) the modeling vs. the inference stages, and (c) statistical modeling vs. curve-fitting.
A statistical model aims to account for the chance regularities in the data, and a substantive (structural) model aims to explain the phenomenon of interest giving rise to this data. The two models are related via the parameterization chosen for the statistical model. The two models are ontologically distinct and have very different objectives, rendering the criteria for evaluating their adequacy completely different; see Spanos and Mayo (2015). However, probing for substantive adequacy presupposes statistical adequacy to ensure the reliability of the statistical procedures employed. The main reason why the distinction between the two models is often blurred is because a theory-driven empirical modeling considers statistical modeling as curve-fitting: (i) selecting a substantive model (a family of curves) and (ii) choosing the estimated model that best fits the data. As argued above, the curve-fitting mathematical framework provides an inadequate basis for reliable inductive inference. For that one needs to recast the approximation problem into one of modeling the “systematic information” in the data, i.e. embed the substantive model into a statistical model. This enables one to address statistical adequacy first with a view to securing the reliability of inference pertaining to substantive adequacy. The basic conflict between the statistical and curve-fitting perspectives is that the former requires the residuals from an estimated model to be non-systematic (a martingale difference), i.e. the estimated model accounts for all the statistical information in the data (statistical adequacy), but the latter selects a fitted curve based on small residuals: best-fit criterion. 
Using the above distinctions, the discussion compared and contrasted the linear regression (LR) with the Gauss linear (GL) model, emphasizing the fact that what matters for inductive inference purposes are the inductive premises (the probabilistic assumptions) underlying the two models, and not the algebra and the formulae for estimators and tests. A case is made that the scope of the celebrated Gauss–Markov theorem is much too narrow, and its conclusions are of little value for inference purposes. When the primary objective is learning from data, there is no premium for weaker but non-testable premises. Such a strategy sacrifices the reliability and precision of inference at the altar of wishful thinking that relies on non-validated asymptotics (n→∞).

Important Concepts

Linear regression (LR) model, statistical meaningfulness, parameterizations that have a well-defined meaning, Gauss–Markov theorem, fitted values, residuals, misspecification testing, goodness-of-fit, prediction intervals, restricted residual sums of squares (RRSS), unrestricted residual sums of squares (URSS), testing substantive adequacy, substantive vs. statistical misspecification, least absolute deviation (LAD) estimator, t-test of significance, F-test of significance, linear parameter restrictions, near-collinearity, matrix condition number, potential volatility bounds for estimators, correlation among regressors, variance inflation factor (VIF), hat matrix, Huber's condition, Cook's D, leverage points, covariance ratio, non-typical observation (outlier).

Crucial Distinctions

Normal, linear regression (LR) vs. the Gauss linear (GL) model, reduction vs. model assumptions, Gauss–Markov theorem, RRSS vs. URSS, asymptotic properties of OLS vs. ML estimators, t-test vs. F-test of significance, large condition number vs. determinant close to zero, modeling vs. inference facet, statistical vs. substantive model, empirical modeling
vs. curve-fitting, goodness-of-fit (small residuals) vs. statistical adequacy (non-systematic residuals).

Essential Ideas

● A statistical model is not an equation expressed in terms of variables and parameters with a stochastic error term affixed. It requires statistical meaningfulness and a parameterization that has a well-defined meaning.
● The comparison between the linear regression and the Gauss linear model brings out the crucial differences between curve-fitting and proper statistical modeling.
● Preliminary data analysis using graphical techniques should be part of any statistical modeling strategy, because it can be used to guide the specification, misspecification testing, and respecification facets of modeling.
● Establishing the statistical adequacy of an LR model secures the reliability of inference and provides the basis for reliable probing for substantive adequacy: whether the substantive model sheds adequate light on (explains, describes, predicts) the phenomenon of interest.
● Normality is the least problematic of the five LR assumptions [1]–[5] (Table 14.1), but it plays an important role because of its link with the [2] linearity and [3] homoskedasticity assumptions. This interrelationship among statistical model assumptions renders "error fixing" problematic as a respecification strategy.
● What is often insufficiently appreciated when empirical modeling is viewed as curve-fitting is that excellent goodness-of-fit (small residuals) is neither necessary nor sufficient for statistical adequacy (non-systematic residuals).
● The problem of near-collinearity is primarily a numerical issue that can be detected using the Norm condition number κ_2(X'X). A very large value of κ_2(X'X) indicates insufficient data information for learning from data about the particular LR parameters θ := (β, σ²) (Table 14.10). When the numbers in the (X'X) matrix are transformed into statistical measures of near-collinearity, such as the VIFs, they should be treated with suspicion because they are highly vulnerable to statistical misspecification.
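The condition-number diagnostic mentioned in the last point is computed directly from the singular values of X. A minimal sketch with artificial data (note that κ_2(X'X) = κ_2(X)²; the near-collinear construction below is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.001 * rng.normal(size=n)   # nearly collinear with x1

X_good = np.column_stack([np.ones(n), x1, rng.normal(size=n)])
X_bad = np.column_stack([np.ones(n), x1, x2])

# Norm condition number: ratio of largest to smallest singular value;
# np.linalg.cond uses the SVD of X, and kappa_2(X'X) = kappa_2(X)**2
print(np.linalg.cond(X_good), np.linalg.cond(X_bad))
```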
14.8 Questions and Exercises

1. Explain why an equation expressed in terms of variables and parameters with a stochastic error term attached does not necessarily constitute a proper statistical model.
2. Compare and contrast the specification of the LR model in Tables 14.1 and 14.2 from the statistical modeling perspective that includes specification, estimation, misspecification testing, and respecification.
3. Explain how the distinction between reduction and model assumptions (Table 14.3) can be useful for statistical modeling purposes.
4. Compare and contrast the sampling distributions of the ML estimators of the LR model and the ML estimators of the simple Normal model: Mθ(x): X_k ~ NIID(μ, σ²), x_k∈R, μ∈R, σ² > 0, k∈N.
5. (a) Explain why the R² as a goodness-of-fit measure for the LR model is highly vulnerable to any mean-heterogeneity in the data Z_0 := (y, X).
   (b) Explain the relationship between the R² and the F-test for the joint significance of the coefficients β_1 in an LR model.
6. Plot the residuals from the four estimated LR models in Example 14.3, and discuss the modeling strategy to avoid such problems.
7. (a) Test the hypotheses H_0: σ² = σ_0² vs. H_1: σ² ≠ σ_0² for σ_0² = .2 at α = .05 using the estimated LR model in Example 14.2.
   (b) Using the estimated LR model in Example 14.2, derive the .95 two-sided CIs for the regression coefficients.
   (c) Use the estimated regression in Example 14.2 to predict the price for a z_{2,n+1} = 200-year-old antique clock.
8. (a) Using the data in Table 1, Appendix 5.A, repeat Example 14.8 by replacing the Intel log-returns with the CITI log-returns and compare your results with those in the example.
   (b) Explain intuitively the use of auxiliary regressions, based on û_t and û_t², as the basis of misspecification testing for the LR model.
9. (a) Using the data in Table 1, Appendix 5.A, estimate the following LR model for the Intel log-returns (y_t):

   Y_t = β_0 + β_1 x_1t + β_2 x_2t + u_t,    (14.91)

   where x_1t = log-returns of the market (SP500) and x_2t = log-returns of 3-month treasury bills.
   (b) Explain the relationship between the statistical model in (14.91) and the CAPM model:

   (Y_t − x_2t) = α_0 + α_1(x_1t − x_2t) + ε_t    (14.92)

   and discuss how one can test the validity of the restrictions the substantive model in (14.92) imposes on the statistical model in (14.91).
   (c) Explain why the traditional test for the validity of the CAPM, α_0 = 0, ignores another crucial restriction.
10. Explain why adding an additional explanatory variable in the auction estimated LR model in Example 14.9 has nothing to do with statistical misspecification; it is a case of substantive misspecification. Discuss the differences between the two types of misspecification.
11. Compare and contrast the LR with the Gauss linear (GL) model in terms of their specifications in Tables 14.1 and 14.5, respectively.
12. Discuss the assumptions and the conclusions of the Gauss–Markov theorem and explain why it provides a very poor basis for statistical inference purposes.
13. Explain why the error term distributions (ii)–(iv) in Table 14.8 raise questions about the appropriateness of least squares as the relevant estimation method.
14. Explain the connection between the Normality of the joint distribution f(x, y) and the LR assumptions [1]–[5].
15. Compare and contrast the asymptotic properties of the OLS estimators of the LR parameters with those of the ML estimators under the Normality assumption.
16. Explain why the specification of the LR model in Table 14.10 can be misleading for modeling purposes.
17. Explain the relationship between the statistical GM: Y_t = β_0 + β_1 x_t + u_t, t∈N, and the estimated orthogonal decomposition: Y_t = β̂_0 + β̂_1 x_t + û_t.
18. Explain the variance decomposition in Table 14.11 and its importance for testing purposes.
19. Explain why relying exclusively on the asymptotic sampling distributions of the OLS estimators of the LR model parameters can cause problems for the reliability and precision of inference.
20. (a) Explain the problem of near-collinearity of the (X'X) matrix and how it might affect the ML estimators of the LR model parameters.
    (b) Explain why near-collinearity of the (X'X) matrix is neither necessary nor sufficient for sign reversals in the estimated regression coefficients.
21. (a) Explain why the Norm condition number κ_2(X'X) provides a reliable measure of the ill-conditioning of the (X'X) matrix, but det(X'X) ≈ 0 does not.
    (b) Explain why κ_2(X'X) invokes no probabilistic assumptions about the underlying data, but the correlation matrix among the regressors does!
22. (a) Explain why the concept of a variance inflation factor makes no sense for trend terms (1, t, t², ..., t^m).
    (b) How should one handle the terms (1, t, t², ..., t^m) in the context of an LR model?
23. Using the data in Example 14.11, evaluate the volatility upper bounds (i)–(ii) in (14.73), and explain why these bounds indicate that any inference concerning the regression coefficients is likely to be problematic.
24. (a) Explain the notion of a non-typical observation and how one can detect such observations.
    (b) Explain the notion of a high-leverage observation and discuss how it should be treated if it affects the statistical adequacy of an estimated LR model.
25. (a) Explain the role of the distinction between the modeling and inference facets in empirical modeling.
    (b) Using the differences between the LR and GL models, compare the modeling with the curve-fitting perspectives in empirical modeling.
Appendix 14.A: Generalized Linear Models

The exponential family of distributions provides a unifying framework for several regression-like models; see McCullagh and Nelder (1989).

14.A.1 Exponential Family of Distributions

One-parameter exponential family. In its simplest form, the family of generalized linear models is based on assuming a stochastic process {Y_k, k∈N} which is independent but not identically distributed, whose distribution belongs to the one-parameter exponential family:

f(y_k; θ_k) = exp{θ_k y_k − b(θ_k) + c(y_k)},    (14.A.1)
where θ_k is known as the canonical parameter and b(θ_k) is a normalizing term.

Example 14.A.1

(a) The binomial distribution:

f(y_k; π_k, n_k) = C(n_k, y_k) π_k^{y_k}(1 − π_k)^{n_k−y_k} = exp{y_k ln(π_k/(1−π_k)) + n_k ln(1−π_k) + ln C(n_k, y_k)},

where θ_k = ln(π_k/(1−π_k)), b(θ_k) = −n_k ln(1−π_k) = n_k ln(1+e^{θ_k}) and c(y_k) = ln C(n_k, y_k), with C(n, y) denoting the binomial coefficient.

(b) The Bernoulli distribution can be viewed as a special case of the binomial with n_k = 1:

f(y_k; π_k) = π_k^{y_k}(1 − π_k)^{1−y_k} = exp{y_k ln(π_k/(1−π_k)) + ln(1−π_k)},

where θ_k = ln(π_k/(1−π_k)), b(θ_k) = ln(1+e^{θ_k}) and c(y_k) = 0.

(c) The Poisson:

f(y_k; μ_k) = e^{−μ_k}μ_k^{y_k}/y_k! = exp{y_k ln μ_k − μ_k − ln(y_k!)}, μ_k > 0, y_k = 0, 1, 2, ...,

where θ_k = ln μ_k, b(θ_k) = exp(θ_k) = μ_k and c(y_k) = −ln(y_k!).

(d) The negative binomial:

f(y_k; π_k, r) = C(y_k+r−1, r−1) π_k^r (1 − π_k)^{y_k} = exp{y_k ln(1−π_k) + r ln(π_k) + ln C(y_k+r−1, r−1)},

where θ_k = ln(1−π_k), b(θ_k) = −r ln(π_k) = −r ln(1−e^{θ_k}) and c(y_k) = ln C(y_k+r−1, r−1). Note that y_k denotes the number of "failures" (X = 0) in a sequence of Bernoulli trials (X_1, X_2, ...) before the rth success (X = 1).

Two-parameter exponential family. The exponential dispersion (two-parameter) family is

f(y_k; θ_k, φ) = exp{[θ_k y_k − b(θ_k)]/a(φ) + c(y_k, φ)},

where b(.), c(., .), and a(.) are arbitrary functions, and θ_k is the canonical parameter that relates to E(Y_k) = μ_k. Of special interest is the function b(.), which provides the link between E(Y_k) = μ_k and θ_k, as well as the link between Var(Y_k) and a(φ). To unveil how the link function is determined, consider the one-observation log-likelihood

ln L(y_k; θ_k, φ) = [θ_k y_k − b(θ_k)]/a(φ) + c(y_k, φ).

Its first derivative with respect to θ_k is

∂ln L(y_k; θ_k, φ)/∂θ_k = [y_k − b'(θ_k)]/a(φ), where b'(θ_k) := ∂b(θ_k)/∂θ_k,

which implies that for this to have zero mean:

E[∂ln L(y_k; θ_k, φ)/∂θ_k] = 0 ⟹ E(y_k) = b'(θ_k),

which gives rise to the link μ_k = b'(θ_k). The second derivative of ln L(y_k; θ_k, φ) is also of interest, since

E[∂²ln L(y_k; θ_k, φ)/∂θ_k²] + E[(∂ln L(y_k; θ_k, φ)/∂θ_k)²] = 0,

where E[∂²ln L(y_k; θ_k, φ)/∂θ_k²] = −b''(θ_k)/a(φ) and E[(∂ln L(y_k; θ_k, φ)/∂θ_k)²] = Var(Y_k)/a(φ)². This equation yields

Var(Y_k) = a(φ)b''(θ_k), i.e. b''(θ_k) = Var(Y_k)/a(φ).

Note that E[∂²ln L(y_k; θ_k, φ)/∂θ_k²] relates to Fisher's information, which provides the basis for the asymptotic covariance of MLE estimators.

Example 14.A.2

(e) The Normal distribution:

f(y_k; μ_k, σ²) = (1/√(2πσ²)) e^{−(y_k−μ_k)²/(2σ²)} = exp{[μ_k y_k − μ_k²/2]/σ² − y_k²/(2σ²) − ln(2πσ²)/2},

where θ_k = μ_k, b(θ_k) = θ_k²/2, a(φ) = σ², and c(y_k, φ) = −[y_k²/φ + ln(2πφ)]/2.
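The identities μ_k = b'(θ_k) and Var(Y_k) = a(φ)b''(θ_k) can be checked by simulation. A sketch for the Poisson case, where b(θ) = e^θ and a(φ) = 1, so b'(θ) = b''(θ) = e^θ (the value of μ is an arbitrary choice):

```python
import numpy as np

mu = 3.7
theta = np.log(mu)                  # canonical parameter theta = ln(mu)

# mean and variance implied by the exponential-family identities
mean_implied = np.exp(theta)        # b'(theta)            -> E(Y)
var_implied = 1.0 * np.exp(theta)   # a(phi) * b''(theta)  -> Var(Y)

# compare with a large Poisson sample
rng = np.random.default_rng(4)
y = rng.poisson(mu, size=200_000)
print(y.mean(), mean_implied, y.var(), var_implied)
```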
14.A.2 Common Features of Generalized Linear Models

(1) A stochastic process {Y_k, k∈N} which is assumed to be independent but not identically distributed, whose distribution belongs to the exponential family.
(2) The mean of {Y_k, k∈N} changes with k: E(Y_k) = μ_k, giving rise to a statistical GM of the simple form Y_k = μ_k + u_k, k∈N.
(3) A linear predictor of the form α'x_k = α_0 + α_1 x_1k + α_2 x_2k + ··· + α_m x_mk, assumed to explain the heterogeneity in μ_k; hence, generalized linear models.
(4) A link function, a monotonic differentiable function g(μ_k) = α'x_k → μ_k = g⁻¹(α'x_k) = h(α'x_k), that describes how μ_k relates to the linear predictor α'x_k.

Table 14.A.1 shows the mean, variance, and link functions for several distributions.

Table 14.A.1  Two-parameter exponential family: Y ~ ED(μ, σ²V(μ))

Distribution:  N(μ, σ²)   Gamma(μ, δ)   Poisson(μ)   Bi(n, μ)/m    Ber(μ)
μ              μ          μ             μ            μ             μ
σ²             σ²         1/δ           1            1/n           1
V(μ)           1          μ²            μ            μ(1 − μ/m)    μ(1 − μ)
g(μ)           μ          1/μ           ln μ         ln(μ/(1−μ))   ln(μ/(1−μ))
14.A.3 MLE and the Exponential Family

The generic log-likelihood for the above exponential family, using the link function θ_k = g(μ_k) = α'x_k, takes the form

ln L(α; y) = Σ_{k=1}^n {[θ_k y_k − b(θ_k)]/a(φ) + c(y_k, φ)}.

Hence, taking first derivatives yields

∂ln L(α; y)/∂α = Σ_{k=1}^n (∂ln L/∂θ_k)(∂θ_k/∂α) = (1/a(φ)) Σ_{k=1}^n [y_k − db(θ_k)/dθ_k] x_k = (1/a(φ)) Σ_{k=1}^n (y_k − μ_k)x_k = 0.

The first derivative of the log-likelihood function, when viewed as a function of the random variables y, defines the score function

s(y) = (1/a(φ)) Σ_{k=1}^n (y_k − μ_k)x_k = (1/a(φ)) X'(y − μ).

Its importance stems from the fact that the MLE estimator of α arises from E(s(y)) = 0, by solving the first-order conditions (1/a(φ))X'(y − μ) = 0 for α̂_MLE, and the asymptotic covariance of α̂_MLE stems from the covariance of s(y), which gives rise to the Fisher information matrix

Cov(s(y)) = (1/a(φ)²) X'VX, where V = diag(σ_1², σ_2², ..., σ_n²),

and σ_k² = E(y_k − μ_k)² is often a function of μ_k, depending on the particular distribution being chosen.

Example 14.A.3  In the case of the Poisson regression-like model,

(1/√n) ∂ln L(α; y)/∂α |_{α=α_0} ~a N(0, V), V > 0,

where V = −lim_{n→∞} (1/n)E[∂²ln L(α; y)/∂α∂α']|_{α=α_0} = lim_{n→∞} (1/n) Σ_{k=1}^n E(y_k)x_k x_k',

yielding the asymptotic distribution of α̂:

√n(α̂ − α) ~a N(0, V⁻¹).

The asymptotic variance–covariance matrix can be approximated by Cov(α̂) = (X'V̂X)⁻¹, with V̂ = diag(ŷ_1, ŷ_2, ..., ŷ_n), ŷ_k = exp(α̂'x_k), which has a feasible generalized least-squares (GLS) form.

14.A.3.1 Goodness-of-Fit and Residuals for the GL Family of Models

The most widely used goodness-of-fit measure for the regression-like models of the GL family is the deviance (Table 14.A.2), defined as an asymptotic likelihood ratio statistic of the form

D(y) = −2 ln[L(θ̂; y)/L(μ̃; y)],

where L(θ̂; y) is the estimated likelihood with μ̂_k = h(α̂'x_k), and L(μ̃; y) is the likelihood of the saturated model based on estimating the incidental parameters μ_1, μ_2, ..., μ_n with the inconsistent estimators μ̃_k = y_k, k = 1, ..., n. Note that in the commonly used case where the linear predictor α'x_k includes a constant, the Poisson and gamma deviances reduce to D(y) = 2Σ_{k=1}^n y_k ln(y_k/μ̂_k) since, as shown, the response residuals sum to zero. There is also a variety of residuals used in conjunction with different GL models, of which the most widely used are the following:
Table 14.A.2  Deviance: D(y) = −2 ln[L(θ̂; y)/L(μ̃; y)]

Normal:    Σ_{k=1}^n (y_k − μ̂_k)²
Binomial:  2Σ_{k=1}^n [y_k ln(y_k/μ̂_k) + (m − y_k) ln((m − y_k)/(m − μ̂_k))]
Poisson:   2Σ_{k=1}^n [y_k ln(y_k/μ̂_k) − (y_k − μ̂_k)]
Gamma:     2Σ_{k=1}^n [−ln(y_k/μ̂_k) + (y_k − μ̂_k)/μ̂_k]

Response residuals: û_k = (y_k − μ̂_k), k = 1, 2, ..., n.
Pearson residuals: û_pk = (y_k − μ̂_k)/√V(μ̂_k), k = 1, 2, ..., n.
Deviance residuals: û_dk = [sign(y_k − μ̂_k)]·√d_k, k = 1, 2, ..., n, where Σ_{k=1}^n d_k = D(y) and thus Σ_{k=1}^n û_dk² = D(y).
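Putting the score function, Fisher scoring, and the deviance together for the Poisson case, here is a hedged sketch with simulated data (the sample size, the coefficients, and the seed are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.poisson(np.exp(X @ np.array([0.5, 0.3]))).astype(float)

# Fisher scoring: score s(y) = X'(y - mu), information X'VX with V = diag(mu)
alpha = np.zeros(2)
for _ in range(50):
    mu = np.exp(X @ alpha)
    step = np.linalg.solve(X.T @ (X * mu[:, None]), X.T @ (y - mu))
    alpha += step
    if np.max(np.abs(step)) < 1e-10:
        break

mu = np.exp(X @ alpha)

# Poisson deviance and deviance residuals (with y ln(y/mu) := 0 when y = 0)
term = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu), 0.0)
d_k = 2 * (term - (y - mu))
D = d_k.sum()
u_dev = np.sign(y - mu) * np.sqrt(np.maximum(d_k, 0.0))

print(alpha, D)
```

Because the linear predictor includes a constant, the response residuals sum to zero at the MLE, so D collapses to 2Σ y_k ln(y_k/μ̂_k), as noted in the text.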
Appendix 14.B: Data

 t     y    x1   x2
 1  1235   127   13
 2  1080   115   12
 3   845   127    7
 4  1522   150    9
 5  1047   156    6
 6  1979   182   11
 7  1822   156   12
 8  1253   132   10
 9  1297   137    9
10   946   113    9
11  1713   137   15
12  1024   117   11
13  1147   137    8
14  1092   153    6
15  1152   117   13
16  1336   126   10
17  2131   170   14
18  1550   182    8
19  1884   162   11
20  2041   184   10
21   854   143    6
22  1483   159    9
23  1055   108   14
24  1545   175    8
25   729   108    6
26  1792   179    9
27  1175   111   15
28  1593   187    8
29   785   111    7
30   744   115    7
31  1356   194    5
32  1262   168    7

x1    y1     x2    y2     x3    y3     x4    y4
10    8.04   10    9.14   10    7.46    8    6.58
 8    6.95    8    8.14    8    6.77    8    5.76
13    7.58   13    8.74   13   12.74    8    7.71
 9    8.81    9    8.77    9    7.11    8    8.84
11    8.33   11    9.26   11    7.81    8    8.47
14    9.96   14    8.10   14    8.84    8    7.04
 6    7.24    6    6.13    6    6.08    8    5.25
 4    4.26    4    3.10    4    5.39   19   12.50
12   10.84   12    9.13   12    8.15    8    5.56
 7    4.82    7    7.26    7    6.42    8    7.91
 5    5.68    5    4.74    5    5.73    8    6.89
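The second data table appears to be the well-known Anscombe quartet, whose point is easily checked numerically: all four (x, y) pairs yield essentially the same fitted line despite wildly different scatterplots. A quick sketch:

```python
import numpy as np

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
ys = [
    np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]
xs = [x, x, x, x4]

for xi, yi in zip(xs, ys):
    X = np.column_stack([np.ones(11), xi])
    b = np.linalg.lstsq(X, yi, rcond=None)[0]
    print(np.round(b, 3))   # intercept, slope: roughly (3.0, 0.5) in all four cases
```

Near-identical coefficients (and R²) across the four data sets illustrate why goodness-of-fit alone cannot establish statistical adequacy.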
15 Misspecification (M-S) Testing
15.1 Introduction

The problem of statistical misspecification arises from imposing (directly or indirectly) invalid probabilistic assumptions on data x_0 := (x_1, ..., x_n) by selecting a statistical model

Mθ(x) = {f(x; θ), θ∈Θ}, x∈R_X^n, for θ∈Θ⊂R^m, m < n.

Mθ(x) defines the premises for statistical (inductive) inferences to be drawn on the basis of data x_0, and its probabilistic assumptions are selected with a view to rendering x_0 a "typical realization" thereof. This "typicality" is tested using misspecification (M-S) testing to establish the statistical adequacy of Mθ(x), i.e. the validity of its probabilistic assumptions vis-à-vis data x_0. When any of the model assumptions are invalid, both f(x; θ), x∈R_X^n, and the likelihood function L(θ; x_0) ∝ f(x_0; θ), θ∈Θ, are invalidated. This, in turn, invalidates and distorts the sampling distribution f(y_n; θ) of any statistic Y_n = g(X), where X := (X_1, X_2, ..., X_n), used for inference, since f(y_n; θ) is derived via

F(Y_n ≤ y) = ∫···∫_{x: g(x)≤y} f(x; θ)dx, ∀y∈R.    (15.1)

This, in turn, will undermine the reliability of any inference procedure based on Y_n = g(X) by derailing its optimality, e.g. rendering an estimator inconsistent or/and inducing sizeable discrepancies between the actual error probabilities (type I, II, p-values, coverage) and the nominal (assumed) ones – the ones derived by invoking the model assumptions. Applying a .05 significance-level test when the actual type I error is closer to .9 will lead an inference astray. It is important to point out that statistical misspecification also undermines non-parametric inferences that rely on broader statistical models M(x) that include dependence and heterogeneity assumptions. Due to the reliance on the likelihood function, Bayesian inference is equally vulnerable to statistical misspecification, since the posterior distribution is π(θ|x_0) ∝ π(θ)·L(θ; x_0), θ∈Θ. This is also true for Akaike-type model selection procedures relying on L(θ; x_0); see Spanos (2010b).
Misspecification (M-S) testing plays a crucial role in empirical modeling because it evaluates the validity of the model assumptions: the soundness of the premises of inductive inference. Its usefulness is twofold: (i) it can alert a modeler to potential inference unreliability problems and (ii) it can shed light on the nature of departures from the model assumptions that could help with the model respecification: selecting another statistical model with a view to accounting for the chance regularity patterns exhibited by the data. M-S testing is a crucial facet of modeling because it can be used to secure the reliability and precision of inference, giving rise to trustworthy evidence. Since its introduction by Karl Pearson (1900), M-S testing has been one of the least appreciated facets of statistical modeling. As a result, its role and importance in securing the reliability and precision of inference has been seriously undervalued and has led to numerous published empirical studies based on untrustworthy evidence. The current conventional wisdom relies primarily on (a) weak, but often non-testable, probabilistic assumptions, combined with (b) asymptotic inference results, and (c) vague robustness claims, as a substitute for using comprehensive M-S testing to establish statistical adequacy. The first hint that M-S tests are rather different from significance testing comes from the founder of modern statistics, R. A. Fisher (1924b), who referred to Pearson’s goodness-of-fit test as “a kind of generalized test of significance” (p. 805). In his classic 1922a paper he praised Pearson for his contributions to the problem of specification by introducing the Pearson family of distributions, as well as “the introduction of an objective criterion of goodness of fit. 
For empirical as the specification of the hypothetical population may be, this empiricism is cleared of its dangers if we can apply a rigorous and objective test of the adequacy with which the proposed population represents the whole of the available facts” (p. 314).
Fisher went on to assert that “The possibility of developing complete and self-contained tests of goodness of fit [M-S tests] deserves very careful consideration, since therein lies our justification for the free use which is made of empirical frequency formulae” (p. 314).
That is, M-S testing provides the key justification for statistical induction, stemming from the objectivity of testing the adequacy of its premises. Fisher's (1924) initial distinction between tests of significance and "generalized" tests of significance was more clearly drawn by Cramer (1946a) in his classic textbook, where he separated these tests into two groups: (i) tests of goodness of fit and allied tests and (ii) tests of significance for parameters. Kendall (1946) draws a similar grouping of tests: "those which give a direct test of a given value of a parent parameter and those which do not . . . and some of them take no account of possible alternative hypotheses" (pp. 134–135). He went on to give examples of the latter group that include "tests of Normality" and "tests of randomness." As shown below, however, goodness-of-fit tests are not always M-S tests, and not all tests framed in terms of the model parameters are proper tests of significance. The distinguishing characteristic between hypothesis testing proper and M-S testing is that the former probes within Mθ(x) and the latter outside its boundary; see Figures 15.1 and 15.2.
Fig. 15.1  Testing within Mθ(x): N-P
Fig. 15.2  Testing outside Mθ(x): M-S
M-S testing differs from Neyman–Pearson (N-P) testing in several respects, the most important of which is that the latter tests within the boundaries of the assumed statistical model Mθ(x), but M-S testing probes outside those boundaries. N-P testing partitions the assumed model using the parameters as an index. In contrast, M-S testing partitions the set P(x) of all possible statistical models that could have given rise to data x_0 into Mθ(x) and its complement M̄θ(x) = [P(x) − Mθ(x)]. However, M̄θ(x) cannot be explicitly operationalized, and thus M-S testing is more open-ended than N-P testing, depending on how one renders probing [P(x) − Mθ(x)] operational using parametric and non-parametric tests for detecting possible departures from Mθ(x); see Spanos (2018). The current neglect of statistical adequacy is mainly due to the fact that current statistical modeling is beclouded by conceptual unclarities stemming from the absence of a coherent empirical modeling framework that delineates the different facets of modeling and inference beyond Fisher's (1922a) initial categories: specification (the choice of the statistical model), estimation, and distribution (inferences based on sampling distributions). Fisher's grouping of M-S testing under "distribution," however, left a lot of unanswered questions concerning its nature and role. How does M-S testing differ from other forms of testing? How would one establish the validity of the model assumptions in practice? What would one do next when certain assumptions are found wanting? A pioneer of twentieth-century statistics and a student of Fisher attests to that by acknowledging the absence of a systematic way to validate statistical models: "The current statistical methodology is mostly model-based, without any specific rules for model selection or validating a specified model" (Rao, 2004, p. 2).

Bird's-eye view of the chapter.
Section 15.2 highlights the effects of statistical misspecification on the reliability of inference, using two examples to illustrate the gravity of the problem. The question of specifying and validating a statistical model has received comparatively little attention in the literature for a number of reasons, some of which are briefly discussed. Section 15.3 introduces M-S testing in an intuitive way to motivate the procedures that follow. Section 15.4 provides a more formal introduction to M-S testing and compares it to significance testing and N-P testing. The M-S testing procedure proposed is based on auxiliary regressions using the residuals. Section 15.5 provides a detailed empirical example to illustrate how M-S testing can play an important role in all modeling facets, specification and respecification, with a view to establishing statistical adequacy.
Misspecification (M-S) Testing
15.2 Misspecification and Inference: A First View

Statistical adequacy ensures that the optimal properties of estimators and tests are real and not notional, and secures the reliability and precision of inference by guaranteeing that the relevant actual error probabilities approximate closely the nominal ones. In contrast, the presence of statistical misspecification induces a discrepancy between these two error probabilities and undermines the reliability of inference.
15.2.1 Actual vs. Nominal Error Probabilities

Simple Normal model. Consider a simple (one-parameter; σ² is assumed known) Normal model in Table 15.1.

Table 15.1  The simple Normal model

Statistical GM:         Xt = μ + ut, t∈N:=(1, 2, . . . , n, . . .)
[1] Normal:             Xt ~ N(·, ·), xt∈R
[2] Constant mean:      E(Xt) = μ, μ∈R, ∀t∈N
[3] Constant variance:  Var(Xt) = σ², ∀t∈N
[4] Independence:       {Xt, t∈N} an independent process
In Chapter 13, it was shown that for testing the hypotheses

  H0: μ = μ0  vs.  H1: μ > μ0,                                            (15.2)

there is an α-level UMP test Tα := {κ(X), C1(α)}:

  κ(X) = √n(X̄n − μ0)/σ,  C1(α) = {x: κ(x) > cα},                          (15.3)

where X̄n = (1/n)∑_{k=1}^n Xk and cα is the rejection threshold. Given that

  (i) κ(X) = √n(X̄n − μ0)/σ ~ N(0, 1) under μ = μ0,                        (15.4)

the type I error probability (significance level) is P(κ(X) > cα; H0 true) = α. To evaluate the type II error probability and the power of this test,

  β(μ1) = P(κ(X) ≤ cα; μ = μ1),
  π(μ1) = 1 − β(μ1) = P(κ(X) > cα; μ = μ1), ∀(μ1 > μ0),

the relevant sampling distribution is

  (ii) κ(X) = √n(X̄n − μ0)/σ ~ N(δ1, 1) under μ = μ1,
       where δ1 = √n(μ1 − μ0)/σ, ∀μ1 > μ0.                                (15.5)
What is often insufficiently emphasized in statistics textbooks is that when any of the assumptions [1]–[4] is invalid for data x0, the above nominal error probabilities are likely to be significantly different from the actual error probabilities, rendering inferences based on (15.3) unreliable.
Example 15.1  To illustrate how the nominal and actual error probabilities can differ when any of the assumptions [1]–[4] are invalid, let us consider the case where the independence assumption [4] is invalid. Instead, the underlying process {Xt, t∈N} is equicorrelated:

  Corr(Xi, Xj) = ρ, 0 < ρ < 1, for all i ≠ j, i, j = 1, . . . , n.         (15.6)

For a similar example where the dependence is Markov, Corr(Xi, Xj) = ρ^|i−j|, see Spanos (2009). How does ρ ≠ 0 affect the reliability of test Tα? The actual sampling distributions of κ(X) are now

  (i)*  κ(X) = √n(X̄n − μ0)/σ ~ N(0, dn(ρ)) under μ = μ0,
  (ii)* κ(X) = √n(X̄n − μ0)/σ ~ N(√n(μ1 − μ0)/σ, dn(ρ)) under μ = μ1,      (15.7)

where dn(ρ) = (1 + (n − 1)ρ) > 1 for 0 < ρ < 1. Hence, the actual type I error probability should be evaluated using

  α* = P(κ(X) > cα; H0) = P(Z > 1.645/√dn(ρ)),

where Z ~ N(0, 1). The results in Table 15.2 for different values of ρ indicate that test Tα has now become "unreliable" because α* > α. One will apply test Tα thinking that it will reject a true H0 only 5% of the time, when, in fact, the actual type I error probability increases with the value of ρ, as shown in Table 15.2.

Table 15.2  Type I error of Tα when Corr(Xi, Xj) = ρ

ρ    .0    .05    .1    .2    .3    .5    .75   .8    .9
α*   .05   .249   .309  .359  .383  .408  .425  .427  .431
Table 15.3  Power π*(μ1) of Tα when Corr(Xi, Xj) = ρ

ρ      π*(.01)  π*(.02)  π*(.05)  π*(.1)  π*(.2)  π*(.3)  π*(.4)
.0     .061     .074     .121     .258    .637    .911    .991
.05    .262     .276     .318     .395    .557    .710    .832
.1     .319     .330     .364     .422    .542    .659    .762
.3     .390     .397     .418     .453    .525    .596    .664
.5     .414     .419     .436     .464    .520    .575    .630
.8     .431     .436     .449     .471    .515    .560    .603
.9     .435     .439     .452     .473    .514    .556    .598
Similarly, the actual power for ρ ≠ 0 should be evaluated using

  π*(μ1) = P(Z > (1/√dn(ρ))[cα − √n(μ1 − μ0)/σ]; μ = μ1),
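Both formulas are easy to evaluate numerically. The following sketch assumes n = 100, σ = 1, μ0 = 0, and cα = 1.645 (values not stated explicitly in the text, but consistent with the entries of Tables 15.2 and 15.3):

```python
# Sketch: actual type I error and power of the one-sided test when the data
# are equicorrelated, Corr(Xi, Xj) = rho for i != j, as in (15.7).
# Assumed values: n = 100, sigma = 1, mu0 = 0, c_alpha = 1.645.
from scipy.stats import norm

def d(rho, n=100):
    """Variance inflation factor d_n(rho) = 1 + (n - 1)*rho."""
    return 1 + (n - 1) * rho

def actual_alpha(rho, n=100, c_alpha=1.645):
    """alpha* = P(Z > c_alpha / sqrt(d_n(rho)))."""
    return norm.sf(c_alpha / d(rho, n) ** 0.5)

def actual_power(mu1, rho, n=100, sigma=1.0, mu0=0.0, c_alpha=1.645):
    """pi*(mu1) = P(Z > (c_alpha - sqrt(n)(mu1 - mu0)/sigma)/sqrt(d_n(rho)))."""
    delta = n ** 0.5 * (mu1 - mu0) / sigma
    return norm.sf((c_alpha - delta) / d(rho, n) ** 0.5)

print(round(actual_alpha(0.1), 3))       # 0.309, as in Table 15.2
print(round(actual_power(0.2, 0.0), 3))  # ~0.639 (Table 15.3 reports .637)
```

The table entries may reflect additional rounding, so small discrepancies in the third decimal are expected.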
giving rise to the results in Table 15.3. Looking at these results closely, it is clear that for values of μ1 close to the null (.01, .02, .05, .1) the power increases as ρ → 1, but for values of μ1 away from the null (.2, .3, .4) the power decreases. Conventional wisdom holds that an increase in power is always a good thing. It is not when μ1 is very close to the null, because this destroys the "probativeness" of a test by rendering it vulnerable to the fallacy of rejection! The test has become like a defective smoke alarm, which has a tendency to go off when toast is burning but will not be triggered by real smoke until the house is fully ablaze; see Mayo (1996).

The linear regression model. A more realistic example is the case where the prespecified statistical model is the Normal, linear regression (LR) model in Table 15.4. When estimating the LR model, it can happen that the modeler ignores mean heterogeneity issues. To illustrate how that can devastate the reliability of inference, let us compare two scenarios.

Table 15.4  Normal, linear regression model

Statistical GM:         Yt = β0 + β1xt + ut, t∈N:=(1, 2, . . . , n, . . .)
[1] Normality:          (Yt | Xt = xt) ~ N(·, ·)
[2] Linearity:          E(Yt | Xt = xt) = β0 + β1xt
[3] Homoskedasticity:   Var(Yt | Xt = xt) = σ²
[4] Independence:       {(Yt | Xt = xt), t∈N} an independent process
[5] t-invariance:       β0, β1, σ² not changing with t

where β0 = μ1 − β1μ2 ∈ R, β1 = σ12/σ22 ∈ R, σ² = σ11 − σ12²/σ22 ∈ R+.
Scenario 1. The estimated LR model is statistically adequate; assumptions [1]–[5] are valid for the particular data; the true and estimated models coincide: Yt = β 0 +β 1 xt +ut . Scenario 2. The modeler estimates the LR Yt = β 0 +β 1 xt +ut , but the true model is Yt = δ 0 + δ 1 t+β 1 xt +ut . This renders the estimated model statistically misspecified because part of assumption [5] is invalid; β 0 is not t-invariant, instead β 0 (t) = δ 0 + δ 1 t. As shown in Chapter 7, such a case can easily arise in practice when the data exhibit mean heterogeneity; see Example 7.22. Example 15.2 (Spanos and McGuirk, 2001) To illustrate how the misspecification of ignoring the trend in the LR model will seriously undermine the reliability of inference, let us use simulation for the above two scenarios using N = 10,000 replications. As can be seen from the simulation results in Table 15.5, when the estimated LR model is statistically adequate: (i) the point estimates are highly accurate and the empirical type I error probabilities associated with the t-tests are very close to the nominal (α = .05) even for a sample size n = 50; (ii) their accuracy improves as n increases to n = 100. It is worth noting that when the estimated statistical model is statistically adequate, the estimates are close to the true parameter values, and the empirical error probabilities are very close to the nominal ones.
In contrast, when the estimated model is misspecified (assumption [5] is invalid): (iii) the point estimates are highly inaccurate (a symptom of inconsistent estimators) and the empirical type I error probabilities are much larger than the nominal (α = .05); (iv) as n increases, the inaccuracy of the estimates increases (they get further and further away from the true values) and the empirical type I error probabilities approach 1! This brings out the folly of the widely touted assertion that when n is large enough the modeler does not need to worry about statistical misspecification. It is important to emphasize that these examples are only indicative of the actual situation facing a practitioner. More often than not, more than one of the model assumptions is invalid, which renders the reliability of inference a lot more uncertain in practice than these examples might suggest; see Spanos and McGuirk (2001).

Table 15.5  Linear regression and mean heterogeneity (N = 10,000 replications)

                        Adequate LR model                Misspecified LR model
                        True:  Yt = 1.5 + 0.5xt + ut     True:  Yt = 1.5 + .13t + .5xt + ut
                        Estim: Yt = β0 + β1xt + ut       Estim: Yt = β0 + β1xt + ut

                        n=50           n=100             n=50            n=100
Parameters              Mean   Std     Mean   Std        Mean    Std     Mean    Std
β̂0  [β0 = 1.5]          1.502  .122    1.500  .087       0.462   .450    0.228   .315
β̂1  [β1 = .5]           0.499  .015    0.500  .008       1.959   .040    1.989   .015
σ̂²  [σ² = .75]          0.751  .021    0.750  .010       2.945   .384    2.985   .266
R²  [R² = .25]          0.253  .090    0.251  .065       0.979   .003    0.995   .001

t-Statistics            Mean   α=.05   Mean   α=.05      Mean    α=.05   Mean    α=.05
τβ0 = (β̂0 − β0)/σ̂β0     0.004  .049    0.015  .050       −1.968  .774    −3.531  .968
τβ1 = (β̂1 − β1)/σ̂β1     −.013  .047    −.005  .049       35.406  1.000   100.2   1.000
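A short Monte Carlo sketch reproduces the qualitative pattern of Table 15.5. The book's exact simulation design is not reported in this section, so the design below (a regressor that itself drifts with t, which is what makes the omitted trend bias β̂1) is an assumption; its numbers are only qualitatively comparable to the table:

```python
# Monte Carlo sketch in the spirit of Example 15.2. Assumed design (not from
# the text): x_t = 0.1*t + N(0,1), u_t ~ N(0, .75), so that omitting the trend
# biases the OLS estimator of beta_1 = 0.5.
import numpy as np

rng = np.random.default_rng(0)

def fit_lr(n, trend):
    """OLS fit of Y = b0 + b1*x when the true model may contain a trend in t."""
    t = np.arange(1.0, n + 1)
    x = 0.1 * t + rng.normal(0.0, 1.0, n)      # assumed trending regressor
    u = rng.normal(0.0, np.sqrt(0.75), n)
    y = 1.5 + (0.13 * t if trend else 0.0) + 0.5 * x + u
    X = np.column_stack([np.ones(n), x])       # trend omitted by the modeler
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    s2 = resid @ resid / (n - 2)
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return b, se

def rejection_rate(n, trend, N=2000):
    """How often the nominal 5% t-test rejects the TRUE value beta_1 = 0.5."""
    hits = 0
    for _ in range(N):
        b, se = fit_lr(n, trend)
        hits += abs((b[1] - 0.5) / se[1]) > 1.96
    return hits / N

print(rejection_rate(50, trend=False))   # near the nominal .05
print(rejection_rate(50, trend=True))    # far above .05: actual != nominal
```

The adequate case stays close to the nominal significance level, while the misspecified case rejects the true β1 almost every time, mirroring the right-hand panel of Table 15.5.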
15.2.2 Reluctance to Test the Validity of Model Assumptions

The crucial importance of securing statistical adequacy stems from the fact that no trustworthy evidence for or against a substantive claim (or theory) can be secured on the basis of a statistically misspecified model. The question that naturally arises is as follows. In light of the dire consequences of statistical misspecification, why is there such reluctance in most applied fields to secure statistical adequacy (validate the statistical premises) using thorough M-S testing? A crucial reason for this neglect is that the empirical modeling literature appears to seriously underestimate the potentially devastating effects of statistical misspecification on the reliability of inference. This misplaced confidence stems from a number of questionable arguments, including the following.
(i)  Empirical modeling is misleadingly viewed as a curve-fitting exercise, where a substantive model Mϕ(x) is foisted on the data. This stems from presuming that Mϕ(x) is valid on a priori grounds by viewing it as established knowledge, instead of as tentative conjectures to be confronted with the data. From this perspective the statistical model Mθ(x) (comprising the probabilistic assumptions imposed on the data) is implicitly specified via the error term(s) attached to the structural model. As a result, the statistical premises are inadvertently blended with the substantive premises of inference, and the empirical literature conflates two very different forms of misspecification: statistical and substantive. Hence, it should come as no surprise that econometrics textbooks consider "omitted variables" the most serious form of statistical misspecification (omitted-variables bias and inconsistency; see Greene, 2012), when in fact it has nothing to do with the statistical assumptions; it is an issue relating to substantive adequacy (Spanos, 2006c).

(ii)  The confusion between statistical and substantive misspecification also permeates the argument attributed to George Box (1979) that "all models are wrong, but some are useful." A careful reading of Box (1979, p. 202), however, shows that he was talking about substantive inadequacy and not statistical. It is one thing to claim that one's model is not "realistic enough," and quite another to turn a blind eye to the problem of imposing invalid probabilistic assumptions on one's data, and then proceed to use the inference procedures that invoke these assumptions as if the invalid assumptions do not matter. Indeed, in an earlier publication Box (1976) argued for a balanced approach between theory and data:

     One important idea is that science is a means whereby learning is achieved, not by mere theoretical speculation on the one hand, nor by the undirected accumulation of practical facts on the other, but rather by a motivated iteration between theory and practice. (p. 791)
(iii)  The current reliance on asymptotic procedures for learning from data is not warranted. Such reliance ignores the fact that the limit theorems invoked by consistent and asymptotically Normal (CAN) estimators and associated tests also rely on probabilistic assumptions, which are usually non-testable, rendering the reliability of the resulting inferences dubious. Worse, the truth of the matter is that all inference results rely exclusively on the n available data points x0 and nothing more. As argued by Le Cam (1986a, p. xiv): "limit theorems as n tends to infinity are logically devoid of content about what happens at any particular n." Asymptotic theory based on n → ∞ relates to the "capacity" of inference procedures to pinpoint θ*, the "true" θ, as data information accrues, {xk}_{k=1}^∞ := (x1, x2, . . . , xn, . . .), approaching the limit at ∞. In that sense, asymptotic properties are useful for excluding potentially unreliable estimators and tests, but they do not guarantee the reliability of inference procedures for given data x0. The trustworthiness of any inference results invoking a CAN estimator relies solely on the approximate validity of the probabilistic assumptions imposed on x0 for the specific n, and nothing else.
(iv)  There is also undue reliance on vague "robustness" results whose generality and applicability are often greatly overvalued. On closer examination, the adjustments used to secure robustness do nothing to alleviate the problem of sizeable discrepancies between actual and nominal error probabilities; see Spanos and McGuirk (2001) and Spanos and Reade (2015).

(v)  Finally, the neglect of M-S testing to secure the statistical adequacy of an estimated model is often explained away using ill-thought-out methodological charges against M-S testing, including (a) data mining/snooping, (b) double use of data, (c) infinite regress/circularity, (d) pre-test bias, (e) multiple testing issues, and (f) erroneous diagnoses; see Spanos (2000, 2010a, 2018) for a rebuttal of these criticisms.
The probabilistic reduction (PR) perspective. Distinguishing between "statistical" and "substantive" information, ab initio, and viewing a statistical model Mθ(x) in purely probabilistic terms enables one to address a number of conceptual and practical problems in establishing statistical adequacy, and thus to secure the reliability of inference. First, this distinction delineates two different questions, often conflated in practice.

[a] Statistical adequacy. Does Mθ(x) account for the chance regularities in x0? Mθ(x) is built exclusively on the statistical information contained in data x0, and acts as a mediator between Mϕ(x) and x0.

[b] Substantive adequacy. Does the model Mϕ(x) adequately capture (describe, explain, predict) the phenomenon of interest? Substantive inadequacy arises not from invalid probabilistic assumptions, but from highly unrealistic structural models, flawed ceteris paribus clauses, missing confounding factors, systematic approximation error, etc. In this sense, probing for substantive adequacy is a considerably more complicated problem, which, at a minimum, includes testing the validity of the overidentifying restrictions G(θ, ϕ) = 0 after securing the statistical adequacy of Mθ(x). Without the latter, the reliability of such tests is questionable.

The second practical problem addressed by defining the model Mθ(x) as comprising all the probabilistic assumptions imposed on the particular data x0 is the need for a complete list of testable probabilistic assumptions. Usually the list of assumptions is incomplete, specified in terms of the unobservable error term, and some of the assumptions are not testable; see Section 14.2.1. This undermines the effectiveness of any form of M-S testing, rendering it ad hoc and partial at best.
The third important problem addressed by the PR perspective is the confusion between M-S and N-P testing, stemming from using the same test procedures, likelihood ratio, Lagrange multiplier, and Wald, for both types of testing; see Spanos (1986). This has led to a number of misleading claims that call into question the legitimacy of M-S testing, including "vulnerability to multiple testing," "illegitimate double use of data," "pre-test bias," "infinite regress," etc.; see Spanos (2010b) for rebuttals of these charges.
15.3 Non-Parametric (Omnibus) M-S Tests

It is important to emphasize at the outset that M-S testing is not a collection of tests applied willy-nilly to different statistical models. For M-S testing to be effective in detecting departures from the model assumptions, one needs a systematic strategy to avoid several traps awaiting the unaware. The most crucial of those traps is the interdependency of the various model assumptions. Hence, in practice one should rely on joint M-S tests that probe for several assumptions simultaneously. To get some idea of what M-S testing is all about, however, let us focus on a few simple tests to assess assumptions [1]–[4] of the simple Normal model (Table 15.1, with σ² an unknown parameter).
15.3.1 The Runs M-S Test for the IID Assumptions [2]–[4]

The hypothesis of interest concerns any "random" reordering of the sample X:=(X1, X2, . . . , Xn), i.e.

  H0: f(x1, x2, . . . , xn; θ) = f(xi1, xi2, . . . , xin; θ),

for any permutation (i1, i2, . . . , in) of the index (1, 2, . . . , n).

Runs up-and-down test. The runs test, discussed in Chapter 5, compares the actual number of runs R with the expected number of runs, assuming that {Xt, t∈N} is an IID process, to construct the test

  dR(X) = [R − E(R)]/√Var(R),  C1(α) = {x: |dR(x)| > c_(α/2)},
  E(R) = (2n − 1)/3,  Var(R) = (16n − 29)/90,

and it can be shown that, for n ≥ 40, the distribution of dR(X) under IID can be approximated by N(0, 1).

Example 15.3(a)  Consider the exam scores in Figure 15.3, for which dR(x0) = (50 − 46.3)/3.482 = 1.062. Evaluating the p-value (Chapter 13) yields p(x0) = P(dR(X) > dR(x0); H0) = .144, which indicates no departures from the IID assumptions.
Fig. 15.3  Exam scores: alphabetical order
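The runs up-and-down statistic described above can be computed in a few lines; a minimal sketch (dropping tied consecutive observations is an assumed convention, since the text does not specify one):

```python
# Runs up-and-down M-S test of the IID assumptions [2]-[4].
import numpy as np
from scipy.stats import norm

def runs_updown_test(x):
    """Return (R, d_R, two-sided p-value) for the runs up-and-down test."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    signs = np.sign(np.diff(x))
    signs = signs[signs != 0]                      # drop ties (assumed convention)
    R = 1 + int(np.sum(signs[1:] != signs[:-1]))   # observed number of runs
    ER = (2 * n - 1) / 3                           # E(R) under IID
    VR = (16 * n - 29) / 90                        # Var(R) under IID
    dR = (R - ER) / np.sqrt(VR)
    return R, dR, 2 * norm.sf(abs(dR))             # N(0,1) approximation

# A strictly alternating (zigzag) series attains the maximum number of runs:
R, dR, p = runs_updown_test([1, 3, 2, 4, 3, 5, 4, 6, 5, 7])
print(R)   # 9 runs in 9 differences
```

With n = 70, the function's E(R) and Var(R) give the 46.3 and 3.482² used in Example 15.3; the N(0, 1) approximation is only recommended for n ≥ 40.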
Fig. 15.4  Exam scores: sitting arrangement order
Example 15.3(b)  On the other hand, the same data ordered according to the sitting arrangement (Figure 15.4) yield dR(x0) = (21 − 46.3)/3.482 = −7.266, with p-value p(x0) ≈ .000000, which clearly indicates strong departures from the IID assumptions ([2]–[4]).
15.3.2 Kolmogorov's M-S Test for Normality ([1])

The Kolmogorov M-S test assesses the validity of a distributional assumption under two key conditions:

(i)  the data x0:=(x1, x2, . . . , xn) can be viewed as a realization of a random (IID) sample X:=(X1, X2, . . . , Xn);
(ii) the random variables X1, X2, . . . , Xn are continuous (not discrete).

It relies on the empirical cumulative distribution function (ecdf)

  F̂n(x) = [no. of (x1, x2, . . . , xn) that do not exceed x]/n, ∀x∈R.

Under (i) and (ii), the ecdf is a strongly consistent estimator of the cumulative distribution function F(x) = P(X ≤ x), ∀x∈R. The generic hypothesis being tested takes the form

  H0: F*(x) = F0(x), x∈R,                                                 (15.8)

where F*(x) denotes the true cdf and F0(x) the cdf assumed by the statistical model Mθ(x). Kolmogorov (1933a) proposed the distance function

  Δn(X) = sup_{x∈R} |F̂n(x) − F0(x)|

and proved that under (i) and (ii):

  lim_{n→∞} P(√n Δn(X) ≤ x) = FK(x) = 1 − 2∑_{k=1}^∞ (−1)^(k+1) e^(−2k²x²) ≈ 1 − 2exp(−2x²), x > 0.   (15.9)

Since FK(x) is known (approximated), one can define an M-S test based on the test statistic Kn(X) = √n Δn(X), giving rise to the p-value P(Kn(X) > Kn(x0); H0) = p(x0).
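A minimal numerical illustration of the Kolmogorov statistic, using scipy's kstest. The exam-score data are not reproduced here, so the sample below is an artificial one built from the quantiles of an assumed N(70, 10²) null, which makes the sup-distance essentially as small as it can be; note also that when the parameters of F0 are estimated from the same data, the Kolmogorov p-value is no longer valid as-is (a Lilliefors-type correction applies):

```python
# Kolmogorov distance and test via scipy.stats.kstest.
import numpy as np
from scipy.stats import kstest, norm

# Artificial "perfectly Normal" sample: the n plotting-position quantiles
# of the hypothesized N(70, 10^2); parameters here are assumed known.
n = 70
x = norm.ppf((np.arange(1, n + 1) - 0.5) / n, loc=70, scale=10)

# kstest returns Delta_n = sup|F_n - F0| and the p-value of K_n = sqrt(n)*Delta_n
stat, pvalue = kstest(x, cdf=norm(loc=70, scale=10).cdf)
print(round(stat, 3))   # 0.007 = 1/(2n): the ecdf hugs F0
print(pvalue > 0.99)    # True: no detectable departure
```

For a real data set one would pass the observed sample in place of `x`; a large Δn (small p-value) signals discordance with the assumed F0.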
Fig. 15.5  P–P Normality plot
Example 15.4  Applying the Kolmogorov test to the scores data in Figure 1.12 yields P(Kn(X) > .039; H0) = .15, which indicates no significant departure from the Normality assumption. The P–P plot in Figure 15.5 provides a depiction of what this test is measuring in terms of the discrepancies between the line and the observed points (Chapter 5). Note that this particular test might be too sensitive to outliers, because it picks up only the biggest distance! A different distance function, which is less sensitive to outliers, is

  A-D(X) = n ∫_{−∞}^∞ {[F̂n(x) − F0(x)]² / [F0(x)(1 − F0(x))]} f0(x)dx,    (15.10)

proposed by Anderson and Darling (1952), which for the ordered sample X[1] ≤ X[2] ≤ · · · ≤ X[n] takes the computational form

  A-D(X) = −n − (1/n)∑_{k=1}^n (2k − 1)[ln F0(X[k]) + ln(1 − F0(X[n+1−k]))].

In the above example the Anderson–Darling test yielded the p-value P(A-D(X) > .139; H0) = .974, which confirms the result of the Kolmogorov test.

Omnibus M-S tests?  A crucial advantage of the above omnibus tests is that they probe more broadly (locally) around Mθ(x) than directional (parametric) M-S tests, at the expense of lower power. However, tests with low power are useful in M-S testing because when they detect a departure, they provide better evidence for its presence than a test with very high power! Omnibus tests, however, have a crucial weakness: when the null hypothesis is rejected, the test does not provide reliable information as to the direction of departure. Such information is needed for the next stage of modeling, that of respecifying the original model with a view to accounting for the systematic information not accounted for by Mθ(x).
Fig. 15.6  M-S testing: encompassing (P(x): the set of all possible models that could have given rise to data x0)
Fig. 15.7  M-S testing: directions of departure
15.4 Parametric (Directional) M-S Testing

Parametric M-S tests are of two forms. The first particularizes [P(x) − Mθ(x)] by choosing a broader model Mψ(x) ⊂ [P(x) − Mθ(x)] that parametrically encompasses Mθ(x) (Figure 15.6) and tests the nesting restrictions G(θ, ψ) = 0, θ∈Θ, ψ∈Ψ. The second particularizes [P(x) − Mθ(x)] in the form of several directions of departure from specific assumptions using auxiliary regressions (Figure 15.7); see Section 15.5.3.
15.4.1 A Parametric M-S Test for Independence ([4])

In the case of the simple Normal model (Table 15.1), the process {Xt, t∈N} is assumed to be NIID, and thus the distribution of the sample simplifies, under IID, to

  f(x1, x2, . . . , xn; φ) = ∏_{t=1}^n f(xt; φ), ∀x∈Rⁿ.

Relaxing the IID assumptions and replacing them with Markov (M) dependence and stationarity (S), the simplification of the distribution of the sample resulting from sequential conditioning takes the form (Chapter 7)

  f(x1, x2, . . . , xn; φ) = f(x1; φ1)∏_{t=2}^n f(xt | xt−1; φ), ∀x∈Rⁿ,

where the joint distribution f(xt, xt−1; φ) takes the form

  (Xt, Xt−1) ~ N( (μ, μ), [σ(0) σ(1); σ(1) σ(0)] ).                        (15.11)

As shown in Chapter 7, this gives rise to the autoregressive [AR(1)] model Mψ(x) based on f(xt | xt−1; θ), whose statistical GM is

  Xt = α0 + α1Xt−1 + εt, t∈N,
  α0 = μ(1 − α1)∈R,  α1 = σ(1)/σ(0)∈(−1, 1),  σ0² = σ(0)(1 − α1²)∈R+.      (15.12)

The AR(1) model parametrically nests (includes as a special case) the simple Normal model, and the nesting restriction is α1 = 0. Under this restriction the AR(1) model Mψ(x) reduces to the simple Normal model (Table 15.1):

  Xt = α0 + α1Xt−1 + εt  --[α1 = 0]-->  Xt = μ + ut, t∈N.
This suggests that the nesting restriction, framed as a test of the hypotheses

  H0: α1 = 0  vs.  H1: α1 ≠ 0                                              (15.13)

in the context of Mψ(x), can be used to test assumption [4] (Table 15.1). The M-S test for assumption [4] is the t-type test Tα := {τ(X), C1(α)}:

  τ(X) = (α̂1 − 0)/√Var(α̂1) ≈ St(n − 2) under H0,  C1(α) = {x: |τ(x)| > cα},

  α̂1 = ∑_{t=1}^n (Xt − X̄)(Xt−1 − X̄) / ∑_{t=1}^n (Xt−1 − X̄)²,
  s² = [1/(n − 2)]∑_{t=1}^n (Xt − α̂0 − α̂1Xt−1)²,
  Var(α̂1) = s²[∑_{t=1}^n (Xt−1 − X̄)²]⁻¹,  α̂0 = (1 − α̂1)X̄.                  (15.14)
Example 15.5  For the data in Figure 15.4, estimating the AR(1) model yields

  Xt = 39.593 + 0.441Xt−1 + ε̂t,  R² = .2, s² = 143.42, n = 69,
      (7.790)  (0.106)

giving rise to the M-S t-test (15.14) for (15.13): τ(x0) = .441/.106 = 4.160, p(x0) = .000016, indicating a clear departure from assumption [4] and confirming Example 15.3(b).

The nesting of the original statistical model Mθ(x) within a more general statistical model Mψ(x) gives the impression that the M-S test based on (15.13) has been transformed into an N-P test. Although this is technically correct, it is conceptually erroneous, because there is no assumption that the nesting model Mψ(x) is statistically adequate, as should be the case for a reliable N-P test. The only role played by the nesting model Mψ(x) is as a proxy that provides possible directions of departure from the original model Mθ(x). Hence, the only clear inference one can draw from a nesting M-S test pertains exclusively to the original model Mθ(x): whether Mθ(x) is misspecified in the direction(s) of departure indicated by Mψ(x) or not. When Mθ(x) is rejected as misspecified, one cannot infer the validity of Mψ(x); that would be a classic example of the fallacy of rejection; see Chapter 13.
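The t-ratio in (15.14) is just the OLS t-statistic on the lagged term in the auxiliary AR(1) regression. A sketch (the exam-score data are not reproduced, so a dependent AR(1) series is simulated to show a rejection):

```python
# M-S test for independence [4]: OLS t-ratio on the lag in an AR(1) regression.
import numpy as np

def ar1_t_ratio(x):
    """OLS t-ratio of alpha_1 in X_t = a0 + a1*X_{t-1} + e_t, as in (15.14)."""
    x = np.asarray(x, dtype=float)
    y, lag = x[1:], x[:-1]
    n = len(y)
    X = np.column_stack([np.ones(n), lag])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    s2 = resid @ resid / (n - 2)                   # s^2 in (15.14)
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    return b[1] / se                               # compare with St(n - 2)

rng = np.random.default_rng(2)
x = np.zeros(200)
for t in range(1, 200):                            # simulated dependent process
    x[t] = 0.6 * x[t - 1] + rng.normal()
print(abs(ar1_t_ratio(x)) > 2)                     # True: [4] is rejected
```

For genuinely independent data the t-ratio would typically fall inside the St(n − 2) acceptance region.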
15.4.2 Testing Independence and Mean Constancy ([2] and [4])

The above t-type M-S test based on the auxiliary regression (15.12) can be extended to provide a joint test for assumptions [2] and [4]. The nesting AR(1) model is derived, as in Chapter 8, by replacing the stationarity assumption of {Xt, t∈N} with mean non-stationarity, which changes f(xt−1, xt; φ) in (15.11) into

  (Xt, Xt−1) ~ N( (μ + γ1t, μ + γ1(t − 1)), [σ(0) σ(1); σ(1) σ(0)] ).      (15.15)

This new f(xt−1, xt; φ(t)) gives rise to a heterogeneous AR(1) model with statistical GM

  Xt = δ0 + δ1t + α1Xt−1 + εt, t∈N,
  δ0 = μ + α1(γ1 − μ),  δ1 = (1 − α1)γ1,  α1 = σ(1)/σ(0),  σ0² = σ(0)(1 − α1²).   (15.16)
The nesting restrictions α1 = 0 and δ1 = 0 reduce the AR(1) model in (15.16) to the simple Normal model:

  Xt = δ0 + δ1t + α1Xt−1 + εt  --[α1 = 0, δ1 = 0]-->  Xt = μ + ut, t∈N.

Hence, a joint M-S test for assumptions [2] and [4] (Table 15.1), based on the hypotheses

  H0: α1 = 0 and δ1 = 0  vs.  H1: α1 ≠ 0 or δ1 ≠ 0,

is an F-type test Tα := {F(X), C1(α)} that takes the form

  F(X) = [(RRSS − URSS)/URSS][(n − 3)/2] ≈ F(2, n − 3) under H0,  C1(α) = {x: F(x) > cα},
  URSS = ∑_{t=1}^n (Xt − δ̂0 − δ̂1t − α̂1Xt−1)²,  RRSS = ∑_{t=1}^n (Xt − X̄)²,

where URSS and RRSS denote the unrestricted and restricted residual sums of squares, respectively, and F(2, n − 3) denotes the F-distribution with 2 and n − 3 degrees of freedom.

Example 15.6  For the data in Figure 15.4, the restricted and unrestricted models yielded, respectively,

  Xt = 71.69 + ût,  s² = 185.23, n = 69,
      (1.631)

  xt = 38.156 + .055t + 0.434xt−1 + ε̂t,  s² = 144.34, n = 69,               (15.17)
      (8.034)  (.073)   (0.107)
where RRSS = 2.6845 and URSS = 2.1543, yielding

  F(x0) = [(2.6845 − 2.1543)/2.1543](67/2) = 8.245,  p(x0) = .0006,

indicating a clear discordance with the null ([2] and [4]). What is particularly notable about the auxiliary autoregression (15.17) is that a closer look at the t-ratios indicates that the source of the problem is dependence and not t-heterogeneity:

  τ1(x0) = .055/.073 = .753, p(x0) = .226;  τ2(x0) = .434/.107 = 4.056, p(x0) = .0000,

a clear departure from assumption [4], and not from [2]. This information enables one to apportion blame, which is not possible when using a non-parametric test such as the runs test, and it suggests ways to respecify the original model to account for such systematic statistical information.

Using the residuals. An alternative way to specify (15.16) is in terms of the residuals ût = xt − x̄n = (xt − 71.7), t = 1, 2, . . . , n, because then the auxiliary regression

  ût = −33.534 + .055t + 0.434xt−1 + ε̂t,  s² = 144.34, n = 69,              (15.18)
       (8.034)   (.073)  (0.107)

is a mirror image of (15.17), with identical parameter estimates apart from the constant, which is irrelevant for M-S testing purposes.
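The joint F-test compares the residual sums of squares of the constant-only (restricted) model and the trend-plus-lag (unrestricted) auxiliary regression. A sketch under the same OLS setup, applied to a simulated trending series (an assumed stand-in for real data):

```python
# Joint M-S F-test of [2] and [4] via the auxiliary regression
# X_t = d0 + d1*t + a1*X_{t-1} + e_t, restricted model: X_t = mu + u_t.
import numpy as np
from scipy.stats import f as f_dist

def trend_lag_F(x):
    """Return (F, p) for H0: delta_1 = alpha_1 = 0 in the auxiliary regression."""
    x = np.asarray(x, dtype=float)
    y, lag = x[1:], x[:-1]
    n = len(y)
    t = np.arange(1.0, n + 1)
    Xu = np.column_stack([np.ones(n), t, lag])     # unrestricted auxiliary model
    bu, *_ = np.linalg.lstsq(Xu, y, rcond=None)
    URSS = np.sum((y - Xu @ bu) ** 2)
    RRSS = np.sum((y - y.mean()) ** 2)             # restricted: constant only
    F = ((RRSS - URSS) / URSS) * ((n - 3) / 2)
    return F, f_dist.sf(F, 2, n - 3)

rng = np.random.default_rng(3)
trending = 0.1 * np.arange(120.0) + rng.normal(size=120)   # mean-heterogeneous
F, p = trend_lag_F(trending)
print(p < 0.001)   # True: departure from [2] and/or [4] detected
```

As in Example 15.6, the individual t-ratios on the trend and the lag can then be inspected to apportion blame between [2] and [4].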
15.4.3 Testing Independence and Variance Constancy ([3] and [4])

In light of the fact that assumption [3] involves the constancy of σ² = Var(Xt) = E(ut²), we can combine it with the independence assumption [4] to construct a joint test using the squared residuals in the context of the auxiliary regression

  ût² = γ0 + γ1t + γ2(xt−1)² + vt,  which for data x0 yields
  ût² = 295.26 − 1.035t − .016(xt−1)² + v̂t.
        (89.43)   (1.353)   (.014)

The non-significance of the coefficients γ1 and γ2 indicates no departures from assumptions [3] and [4].

15.4.3.1 Extending the Above Auxiliary Regression*

The auxiliary regression (15.16), providing the basis of the joint test for assumptions [2]–[4], can easily be extended to include higher-order trends (up to order m ≥ 1) and additional lags (ℓ ≥ 1):

  Xt = δ0 + ∑_{k=1}^m δk tᵏ + ∑_{i=1}^ℓ αi Xt−i + εt, t∈N.                   (15.19)

In practice, however, going beyond m = 3 can give rise to numerical problems because ordinary trend polynomials (t, t², t³, t⁴, t⁵, . . .) are likely to be near-collinear. An effective way to avoid such near-collinearity problems is to use the scaling given in equation (14.89) in conjunction with orthogonal polynomials; see Section 14.6.1.4.
15.4.4 The Skewness–Kurtosis Test of Normality

An alternative way to test Normality is to use parametric tests probing for the validity of Normality within a broader nesting family of distributions. Such a test can be constructed using the Pearson family (Chapter 10), since the skewness (α3) and kurtosis (α4) coefficients (Chapter 3) can be used to distinguish between its different members. For example, the Normal distribution is characterized within the Pearson family via

  (α3 = 0, α4 = 3) ⇒ f*(x) = φ(x), for all x∈R,                             (15.20)

where f*(x) and φ(x) denote the true density and the Normal density, respectively. Within the Pearson family, we can frame the hypotheses of interest as

  H0: α3 = 0 and α4 = 3  vs.  H1: α3 ≠ 0 or α4 ≠ 3,                        (15.21)

and construct the skewness–kurtosis test based on

  SK(X) = (n/6)α̂3² + (n/24)(α̂4 − 3)² ≈ χ²(2) under H0,  P(SK(X) > SK(x0); H0) = p(x0),   (15.22)

where the estimated coefficients (α̂3, α̂4) are

  α̂3 = [(1/n)∑_{k=1}^n (Xk − X̄)³] / [(1/n)∑_{k=1}^n (Xk − X̄)²]^(3/2),
  α̂4 = [(1/n)∑_{k=1}^n (Xk − X̄)⁴] / [(1/n)∑_{k=1}^n (Xk − X̄)²]².
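Given the centered sample moments, the SK statistic in (15.22) is a one-liner; a minimal sketch:

```python
# Skewness-kurtosis (SK) test of Normality, as in (15.22).
import numpy as np
from scipy.stats import chi2

def sk_test(x):
    """Return (a3_hat, a4_hat, SK, p-value) for the SK test."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = x - x.mean()
    m2, m3, m4 = (np.mean(d ** k) for k in (2, 3, 4))
    a3 = m3 / m2 ** 1.5                      # sample skewness
    a4 = m4 / m2 ** 2                        # sample kurtosis
    SK = (n / 6) * a3 ** 2 + (n / 24) * (a4 - 3) ** 2
    return a3, a4, SK, chi2.sf(SK, df=2)     # chi-square(2) under H0

a3, a4, SK, p_val = sk_test(np.array([1.0, 2.0, 3.0, 4.0, 5.0]))
print(round(a3, 6), round(a4, 2))            # 0.0 1.7: symmetric, thin-tailed
```

Plugging the dice-data moments of Example 15.7 (n = 100, α̂3 = −.035, α̂4 = 2.362) into the same formula reproduces SK(x0) ≈ 1.716.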
It is important to note that the SK(X) M-S test is particularly sensitive to outliers, because it involves higher sample moments, such as ∑_{k=1}^n (Xk − X̄)⁴, which can allow one or two outlier observations to dominate the summation. For an improved version of the SK(X) test, see D'Agostino and Pearson (1973).
15.4.5 Simple Normal Model: A Summary of M-S Testing

The first auxiliary regression specifies how departures from assumptions [2] and [4] might affect the mean:

û_t = δ₀ + δ₁t + δ₂t² + δ₃x_{t−1} + ε_{1t},
H₀: δ₁ = δ₂ = δ₃ = 0 vs. H₁: δ₁ ≠ 0 or δ₂ ≠ 0 or δ₃ ≠ 0.

The second auxiliary regression specifies how departures from assumptions [3] and [4] might affect the variance:

û²_t = γ₀ + γ₁t + γ₂t² + γ₃x²_{t−1} + ε_{2t},
H₀: γ₁ = γ₂ = γ₃ = 0 vs. H₁: γ₁ ≠ 0 or γ₂ ≠ 0 or γ₃ ≠ 0.

Note that the above choices of terms for the auxiliary regressions are only indicative of the direction of departure from the model assumptions!

Intuition. At the intuitive level, the above auxiliary regressions can be viewed as probing the residuals with a view to finding systematic statistical information (chance regularities) that the original model Mθ(x) did not account for. Departures from assumptions [2]–[4] rightfully belong to the systematic component and not to the error term. When no departures from assumptions [2]–[4] are detected, one can proceed to test the Normality assumption using tests such as the skewness–kurtosis and Kolmogorov tests.

Example 15.7 Consider the casting of two dice data in Table 1.1 (Figure 1.1). The estimated first four moments yield

x̄_n = (1/n)Σ_{k=1}ⁿ x_k = 7.08, s² = (1/(n−1))Σ_{k=1}ⁿ (x_k − x̄_n)² = 5.993, α̂₃ = −.035, α̂₄ = 2.362.

(a) Testing assumptions [2]–[4] using the runs test, with n = 100 and R = 72 observed runs:

E(R) = (2(100) − 1)/3 = 66.333, Var(R) = (16(100) − 29)/90 = 17.456,
Z_R(x₀) = (72 − 66.333)/√17.456 = 1.356, P(|Z_R(X)| > 1.356; H₀) = .175.
(b) Testing assumptions [2] and [4] using the auxiliary regression

û_t = δ₀ + δ₁t + δ₂x_{t−1} + ε_{1t} → û_t = 1.02(.877) − .005(.009)t − .103(.101)x_{t−1} + ε̂_{1t}, s = 2.434,

indicates no departures, since the F-test for the joint significance of δ₁ and δ₂,

H₀: δ₁ = δ₂ = 0 vs. H₁: δ₁ ≠ 0 or δ₂ ≠ 0,

yields

F(x₀) = [(RRSS − URSS)/URSS]((n−4)/2) = [(576.6 − 568.540)/568.540](96/2) = .680[.511],

and the t-tests for the individual coefficients yield

τ(x₀) = 1.02/.877 = 1.163[.248], τ(x₀) = .103/.101 = 1.021[.312]. (15.23)
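The F-statistics in (15.23)–(15.24) all share the restricted/unrestricted residual-sum-of-squares form; a small helper (hypothetical name) makes the arithmetic explicit:

```python
def aux_f_stat(rrss, urss, num_restrictions, df_resid):
    """F = [(RRSS - URSS)/r] / [URSS/df], as used in the M-S examples."""
    return ((rrss - urss) / num_restrictions) / (urss / df_resid)
```

Plugging in the numbers from (15.23), `aux_f_stat(576.6, 568.540, 2, 96)` gives approximately .680, matching the reported value.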
(c) Testing assumptions [3] and [4] using the auxiliary regression

û²_t = γ₀ + γ₁t + γ₂x²_{t−1} + ε_{2t} → û²_t = 6.10(1.84) − .021(.024)t + .014(.02)x²_{t−1} + ε̂_{2t}

indicates no departures, since the F-test for the joint significance of γ₁ and γ₂,

H₀: γ₁ = γ₂ = 0 vs. H₁: γ₁ ≠ 0 or γ₂ ≠ 0,

yields

F(x₀) = [(RRSS − URSS)/URSS]((n−4)/2) = [(4678.67 − 4619.24)/4619.24](96/2) = .618[.541],

and the t-tests for the significance of γ₁ and γ₂ yield

τ(x₀) = .021/.024 = .875[.397], τ(x₀) = .014/.02 = .700[.490]. (15.24)
(d) In light of the fact that model assumptions [2]–[4] are valid, one can proceed to test the Normality assumption [1] using the SK test, which yields

SK(x₀) = (100/6)(−0.035)² + (100/24)(2.362 − 3)² = 1.716[.424].

The p-value indicates no discordance with the Normality assumption. CAUTIONARY NOTE: This result raises an interesting question, because the underlying distribution exhibited by the histogram in Figure 1.3 is discrete and triangular, not Normal.
Fig. 15.8 P–P Normality plot for dice data
What went wrong? The SK test is not a very powerful test, since α₃ = 0 and α₄ = 3 characterize Normality only within the Pearson family; the test can be fooled by symmetry. A departure from Normality is revealed by the more powerful Anderson–Darling test: A-D(x₀) = .772[.041]. The P–P plot for Normality in Figure 15.8 brings out the problem in no uncertain terms by pointing out the discrete nature of the underlying distribution for the dice data; notice the vertical columns of observations. This brings out the importance of graphical techniques at all the modeling stages: specification, M-S testing, and respecification.
15.5 Misspecification Testing: A Formalization

Despite the plethora of M-S tests in both the statistics and econometrics literatures, no systematic way to apply these tests has emerged; see Godfrey (1988). The literature has left practitioners perplexed, since numerous issues about M-S testing remain unanswered, including: (a) the choice among many different M-S tests, as well as their applicability in different models; (b) respecification: what to do when any of the model assumptions are invalid; (c) the use of omnibus (non-parametric) vs. parametric tests; and (d) the differences between M-S testing, N-P testing, Fisher's significance testing, and Akaike-type model selection procedures.
15.5.1 Placing M-S Testing in a Proper Context

It is argued that these issues can be addressed by having a coherent framework where the different facets of modeling and inference can be delineated, as articulated in Chapter 10 (Table 10.9), to include M-S testing and respecification (with a view to achieving statistical adequacy). More formally, the key difference between N-P and M-S testing is:

Testing within Mθ(x) — learning from data about M*(x):
H₀: f(x; θ*) ∈ M₀(x) = {f(x; θ), θ ∈ Θ₀} vs. H₁: f(x; θ*) ∈ M₁(x) = {f(x; θ), θ ∈ Θ₁}. (15.25)

Testing outside Mθ(x) — probing for the validity of its assumptions:
H₀: f(x; θ*) ∈ Mθ(x) vs. H̄₀: f(x; θ*) ∈ M̄θ(x) = [P(x) − Mθ(x)]. (15.26)

Note that M*(x) = {f(x; θ*)} denotes the "true" distribution of the sample. It is important to emphasize that M-S testing is a form of significance testing where the null hypothesis is always defined by

H₀: all assumptions of Mθ(x) are valid for x₀, (15.27)

with the particularized alternative H₁ ⊂ [P(x) − Mθ(x)] being

H₁: the stated departures from the specific assumptions being tested hold, assuming the rest of the assumptions of Mθ(x) hold for data x₀. (15.28)
Misspecification vs. N-P testing. (a) The fact that N-P testing probes within Mθ(x) while M-S testing probes outside it, i.e. in [P(x) − Mθ(x)], renders the latter more vulnerable to the fallacy of rejection. Hence, in practice one should never accept the particularized H₁ without further probing. (b) In M-S testing the type II error (accepting the null when false) is often the more serious of the two errors. This is because one will have another chance to correct for the type I error (rejecting the null when true) at the respecification stage, where a new model aims to account for the chance regularities the original model ignored. Hence, M-S testing is also more vulnerable to the fallacy of acceptance. (c) Unlike N-P tests, low-power tests with broad probing capacity have a role to play in M-S testing; hence the use of omnibus tests in conjunction with more powerful directional tests.
15.5.2 Securing the Effectiveness/Reliability of M-S Testing

There are a number of strategies designed to enhance the effectiveness/reliability of M-S probing, thus rendering the diagnosis more reliable:

- Judicious combinations of omnibus (non-parametric), directional (parametric), and simulation-based tests, probing as broadly as possible and upholding dissimilar assumptions. The interdependence of the model assumptions, which stems from the fact that Mθ(x) is a parameterization of the process {Xt, t ∈ N}, plays a crucial role in the self-correction of M-S testing results.
- Astute ordering of M-S tests, so as to exploit the interrelationship among the model assumptions with a view to "correcting" each other's diagnosis. For instance, the probabilistic assumptions [1]–[3] of the Normal, linear regression model (Table 15.6) are interrelated because all three stem from the assumption of Normality for the vector process {Zt, t ∈ N}, where Zt := (Yt, Xt), assumed to be NIID. This information is also useful in narrowing down the possible alternatives. It is important to note that the Normality assumption [1] should be tested last, because most of the M-S tests for it assume that the other assumptions are valid, rendering the results questionable when some of the other model assumptions are invalid.
- Joint M-S tests (testing several assumptions simultaneously), designed to avoid "erroneous" diagnoses as well as to minimize the maintained assumptions.
- Custom tailoring of M-S tests. The most effective way to ensure that M-S tests have maximum capacity to detect any potential departures from the model assumptions is to employ graphical techniques judiciously. By looking closely at various data plots, such as t-plots, scatterplots, and histograms, one should be able to discern potential departures and design auxiliary regressions, by crafting additional terms, that could reveal such potential departures most effectively.

The above strategies enable one to argue with severity that when no departures from the model assumptions are detected, the model provides a reliable basis for inference, including appraising substantive claims (Mayo and Spanos, 2004).
15.5.3 M-S Testing and the Linear Regression Model

The Normal, linear regression model is undoubtedly the quintessential statistical model (Table 15.6) in most applied fields, including econometrics. For this reason we will consider the question of M-S testing for this statistical model in more detail. As shown in Chapter 7, the LR model can be viewed as a parameterization of a vector process {Zt, t ∈ N}, where Zt := (Yt, Xt) is assumed to be NIID. At the specification stage, evaluating whether the model assumptions [1]–[5] are likely to be valid for particular data is non-trivial, since all these assumptions pertain to the conditional process {(Yt|Xt = xt), t ∈ N}, which is not directly observable! As argued in Chapter 7, one can indirectly assess the validity of [1]–[5] via the observable process {Zt := (Yt, Xt), t ∈ N}.

Table 15.6 Normal, linear regression model

Statistical GM: Yt = β₀ + β₁xt + ut, t ∈ N := (1, 2, ..., n, ...)
[1] Normality: (Yt|Xt = xt) ~ N(·, ·)
[2] Linearity: E(Yt|Xt = xt) = β₀ + β₁xt
[3] Homoskedasticity: Var(Yt|Xt = xt) = σ²
[4] Independence: {(Yt|Xt = xt), t ∈ N} is an independent process
[5] t-Invariance: θ := (β₀, β₁, σ²) does not change with t

where [1]–[4] hold for all t ∈ N, and β₀ = μ₁ − β₁μ₂ ∈ R, β₁ = σ₂₂⁻¹σ₂₁ ∈ R, σ² = (σ₁₁ − σ₂₁σ₂₂⁻¹σ₂₁) ∈ R₊.
As argued in Chapter 14, indicative auxiliary regressions for the simple one-regressor case can be used to test jointly the model assumptions [2]–[5], since different misspecifications might affect the first two conditional moments

E(Yt|Xt = xt) = β₀ + β₁xt, Var(Yt|Xt = xt) = σ². (15.29)

The first auxiliary regression specifies how departures from different assumptions might affect the conditional mean:

û_t = δ₀ + δ₁x_t + δ₂t + δ₃x_t² + δ₄x_{t−1} + δ₅Y_{t−1} + v_{1t}, (15.30)
H₀: δ₁ = δ₂ = δ₃ = δ₄ = δ₅ = 0 vs. H₁: δ₁ ≠ 0 or δ₂ ≠ 0 or δ₃ ≠ 0 or δ₄ ≠ 0 or δ₅ ≠ 0.

The second auxiliary regression specifies how departures from different assumptions might affect the constancy of the conditional variance:

û²_t = γ₀ + γ₁x_t + γ₂t + γ₃x_t² + γ₄x²_{t−1} + γ₅Y²_{t−1} + v_{2t}, (15.31)
H₀: γ₁ = γ₂ = γ₃ = γ₄ = γ₅ = 0 vs. H₁: γ₁ ≠ 0 or γ₂ ≠ 0 or γ₃ ≠ 0 or γ₄ ≠ 0 or γ₅ ≠ 0.

Intuitively, the above auxiliary regressions should be viewed as attempts to probe the residuals {û_t, t = 1, 2, ..., n} for any remaining systematic information that has been overlooked by the specification of the regression and skedastic functions in (15.29) in terms of assumptions [1]–[5]. More formally, the coefficients of the extra terms in (15.30) and (15.31) will be zero, since those terms should be orthogonal to û_t and û²_t when assumptions [1]–[5] are valid for data Z₀. As argued in Spanos (2010b), it is no accident that M-S tests are often specified in terms of the residuals; they often constitute a maximal ancillary statistic.
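A sketch of how the mean auxiliary regression (15.30) can be run in practice: regress the residuals on the trend, square, and lagged terms, then F-test the coefficients of interest. Function and variable names are illustrative, not from the text:

```python
import numpy as np

def aux_mean_ftest(u, x, y):
    """F-test of delta_1 = ... = delta_5 = 0 in the mean auxiliary
    regression (15.30); the first observation is dropped for the lags."""
    n = len(u)
    t = np.arange(1.0, n + 1)

    def rss(Z, v):
        coef, *_ = np.linalg.lstsq(Z, v, rcond=None)
        return float(np.sum((v - Z @ coef) ** 2))

    # unrestricted regressors: constant, x_t, t, x_t^2, x_{t-1}, y_{t-1}
    Z1 = np.column_stack([np.ones(n - 1), x[1:], t[1:], x[1:] ** 2, x[:-1], y[:-1]])
    Z0 = np.ones((n - 1, 1))                    # restricted: constant only
    urss, rrss = rss(Z1, u[1:]), rss(Z0, u[1:])
    df = (n - 1) - Z1.shape[1]                  # residual degrees of freedom
    return ((rrss - urss) / 5) / (urss / df)    # 5 restrictions
```

When the residuals contain leftover systematic information (e.g. a trend), the statistic is large; under valid assumptions it is approximately F-distributed.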
15.5.4 The Multiple Testing (Comparisons) Issue

The multiple testing (comparisons) issue arises in the context of joint N-P tests because the overall significance level does not coincide with that of the individual hypotheses. Viewing the auxiliary regressions in (15.30) and (15.31) as providing the basis for two joint N-P tests, choosing a specific significance level, say .025, for testing each individual assumption, such as δ₄ = δ₅ = 0 for departures from [4], implies that the overall significance level α of the F-test for H₀ will be greater than .025. How are these two thresholds related? Let us assume that we have m individual null hypotheses pertaining to different model assumptions, say H₀(i), i = 1, ..., m, such that H₀ = ∪ᵢ₌₁ᵐ H₀(i), and the overall F-test rejects H₀ when the smallest p-value pᵢ(x₀) associated with the H₀(i) is less than α. This can be framed in the form {min(p₁(x₀), ..., pₘ(x₀)) ≤ α}.

In the traditional textbook approach, departures from the error assumptions take the form of autocorrelation, when

{4}-Autocorrelation: E(uu′|X) = V, with vₜₛ ≠ 0 for t ≠ s, t, s = 1, 2, ..., n, (15.56)

with n(n+1)/2 unknown parameters, or heteroskedasticity, when

{3}-Heteroskedasticity: E(uu′|X) = V = diag(σ₁², σ₂², ..., σₙ²), (15.57)
with n unknown parameters, which conflates variance heterogeneity with heteroskedasticity; see Chapter 7. A crucial assertion by econometric textbooks is that under (15.55) the OLS estimator is (i) linear in y and (ii) unbiased:

β̂_OLS = (X′X)⁻¹X′(Xβ + u) = β + (X′X)⁻¹X′u → E(β̂_OLS) = β,

since E(X′u) = 0, but (iii) β̂_OLS is less efficient than the generalized least squares (GLS) estimator, pioneered by Aitken (1935):

β̃_GLS = (X′V⁻¹X)⁻¹X′V⁻¹y.

The relative inefficiency stems from the Gauss–Markov theorem (Greene, 2012):

Cov(β̂_OLS|x) = (X′X)⁻¹X′VX(X′X)⁻¹ ≥ Cov(β̃_GLS|x) = (X′V⁻¹X)⁻¹. (15.58)

Note that β̃_GLS is also (i)* linear in y and (ii)* unbiased since, by invoking assumption {2},

E(β̃_GLS|x) = β + (X′V⁻¹X)⁻¹X′V⁻¹E(u) = β.
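The Gauss–Markov inequality in (15.58) can be checked numerically for any known V: the difference between the OLS sandwich covariance and the GLS covariance is positive semi-definite. A toy illustration; the design matrix and V below are invented for the demo:

```python
import numpy as np

n = 20
X = np.column_stack([np.ones(n), np.arange(1.0, n + 1)])
V = np.diag(np.arange(1.0, n + 1))            # a known heteroskedastic error covariance
V_inv = np.linalg.inv(V)
XtX_inv = np.linalg.inv(X.T @ X)
cov_ols = XtX_inv @ X.T @ V @ X @ XtX_inv     # sandwich form in (15.58)
cov_gls = np.linalg.inv(X.T @ V_inv @ X)      # Aitken (GLS) covariance
# all eigenvalues of the difference are >= 0: Cov(b_OLS) - Cov(b_GLS) is PSD
eigenvalues = np.linalg.eigvalsh(cov_ols - cov_gls)
```

The point of the surrounding discussion, of course, is that this comparison presupposes a known V, which is never available in practice.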
Unfortunately, the above assertions are highly misleading at several levels.

(a) The assertion in (15.58) assumes that the n × n matrix V is known, rendering it irrelevant for practical purposes. Worse, when a consistent estimator V̂ of V exists (addressing the incidental parameter problem) and is used to construct a feasible GLS estimator

β̃_FGLS = (X′V̂⁻¹X)⁻¹X′V̂⁻¹y,

the inequality in (15.58) no longer holds; instead Cov(β̂_OLS|x) ≷ Cov(β̃_FGLS|x).

(b) The FGLS estimator also raises serious issues, because V can be estimated only when additional assumptions are imposed on the nature of the autocorrelation/heteroskedasticity to address the incidental parameter problem. Focusing on (15.56), the n(n+1)/2 unknown parameters cannot be estimated using n observations, which calls for narrowing the nature of temporal dependence, e.g. by assuming the AR(1) model

u_t = ρu_{t−1} + ε_t, t = 1, 2, ..., n. (15.59)
Note that the no-autocorrelation assumption {4} includes an infinity of potential models, and (15.59) is just one of them. The AR(1) model in (15.59) provides the basis for the Durbin–Watson (DW) test (Durbin and Watson, 1950), which transforms the underlying M-S test into a choice between two models, where M₀ is nested within M₁:

M₀: y_t = β₀ + β₁x_t + ε_t,
M₁: y_t = β₀ + β₁x_t + u_t, u_t = ρu_{t−1} + ε_t,

with the hypotheses of interest expressed in terms of the nesting parameter:

H₀: ρ = 0 vs. H₁: ρ ≠ 0.

Example 15.10 (continued) In the case of the estimated LR model in (15.53), the DW statistic rejects the null hypothesis at level .02 (DW = 1.115[.021]). What is particularly problematic is the next (respecification) move of the traditional textbook approach to address the statistical misspecification, adopting M₁:

y_t = 167.209(.939) + 1.898(.037)x_t + û_t, û_t = .431(.152)û_{t−1} + ε̂_t,
R² = .996, DW = 1.831, s = 1.742, n = 35. (15.60)
The traditional approach interprets the adoption of M₁ as a way to "fix" the original LR model! Indeed, DW = 1.831 for M₁ is interpreted as providing confirmation for the validity of M₁; see Greene (2012). This constitutes a classic example of the fallacy of rejection: evidence against H₀ is misinterpreted as evidence for a particular H₁. Rejection of H₀ by a DW test provides evidence against H₀ (the original model M₀) and for the presence of generic temporal dependence, E(ε_tε_s|X_t = x_t) ≠ 0, t > s, t, s = 1, 2, ..., n, but it does not provide evidence for the particular form assumed by H₁ (u_t = ρu_{t−1} + ε_t):

H₁: E(ε_tε_s|X_t = x_t) = (ρ^{|t−s|}/(1 − ρ²))σ², t > s, t, s = 1, 2, ..., n. (15.61)

Evidence for M₁ can only be established by validating its own model assumptions, not by misinterpreting evidence against M₀. Sargan (1964) pointed out that M₁, when written in terms of the observables,

y_t = β₀ + β₁x_t + ρ(y_{t−1} − β₀ − β₁x_{t−1}) + ε_t,

is a special case of the dynamic linear regression (DLR(1)) model

M₂: y_t = α₀ + α₁x_t + α₂x_{t−1} + α₃y_{t−1} + u_t, (15.62)

subject to the (non-linear) common factor restrictions

α₂ + α₁α₃ = 0. (15.63)
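The restriction (15.63) follows by multiplying out M₁: y_t = β₀(1−ρ) + β₁x_t − ρβ₁x_{t−1} + ρy_{t−1} + ε_t, so that α₁ = β₁, α₂ = −ρβ₁, α₃ = ρ, and hence α₂ + α₁α₃ = 0 identically. A two-line numerical check using the estimates reported in (15.60):

```python
# Common factor restriction (15.63) implied by the AR(1)-error model M1.
b1, rho = 1.898, 0.431             # slope and AR(1) estimates from (15.60)
a1, a2, a3 = b1, -rho * b1, rho    # implied DLR(1) coefficients
assert abs(a2 + a1 * a3) < 1e-12   # alpha_2 + alpha_1*alpha_3 = 0
```

The restriction holds by construction for any (β₁, ρ); the empirical question is whether the data satisfy it, which is what the text argues is rarely the case.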
As shown by McGuirk and Spanos (2009), these restrictions impose such a highly unrealistic temporal structure on the observable process {Zt := (x_t, y_t), t ∈ N} as to render them rarely valid in practice. Worse, when these restrictions are not valid for the particular data z₀, the nice properties traditionally associated with the FGLS estimator, including consistency, no longer hold. This calls into question the traditional "error-fixing" strategies, discussed below, that focus on "fixing" the error term assumptions.

15.6.2.2 The Problem with "Error-Fixing" Strategies

H₁ assumes a particular violation of independence: rejection of H₀ is evidence for autocorrelation, i.e. ρ ≠ 0, but this by itself is not good evidence for M₁; there was no chance of uncovering the various other forms of dependence that would also give rise to ρ ≠ 0. The ways M₁ can be incorrect have not been probed adequately; M₁ has certainly not passed a severe test; see Mayo and Spanos (2004). Any improved strategy must be able to tell us what we are warranted in inferring once a violation is detected. This can be achieved by testing the assumptions of the alternative encompassing model; in the above case the latter model M₁ is also statistically misspecified. The list of ad hoc respecifications in the form of "error fixing" in traditional econometric textbooks includes:

(i) "Fixing" residual autocorrelation by adopting the alternative in a DW test.
(ii) "Fixing" non-linearity using arbitrary functions like h(x_t) = α₀ + α₁x_t + α₂x_t², often used as the alternative in a test of the linearity assumption.
(iii) "Fixing" heteroskedasticity using ad hoc functions such as g(x_t) = e^{α₀+α₁x_t}, in conjunction with a GLS estimator for the regression coefficients and their variance–covariance matrix.
(iv) "Fixing" heteroskedasticity by retaining β̂_OLS = (X′X)⁻¹X′y and replacing σ²(X′X)⁻¹ with the heteroskedasticity-consistent standard errors (HCSE) based on

Cov(β̂) = (1/n)((1/n)X′X)⁻¹Ĝ_W((1/n)X′X)⁻¹, where Ĝ_W = (1/n)Σₜ₌₁ⁿ û_t²x_tx_t′ = Γ̂(0), (15.64)

known as the White (1980) consistent estimator; see Hansen (1999).
(v) "Fixing" residual autocorrelation by retaining the OLS estimators of the coefficients but using the so-called autocorrelation-consistent standard errors (Newey and West, 1987) for inference purposes, which involves replacing Ĝ_W with

Ĝ_NW = Γ̂(0) + Σᵢ₌₁ᵖ (1 − i/(p+1))[Γ̂(i) + Γ̂(i)′], p > 0,
Γ̂(i) = (1/n)Σₜ₌ᵢ₊₁ⁿ û_tû_{t−i}x_tx_{t−i}′, i = 1, 2, ..., p.

(vi) Blaming the presence of residual autocorrelation on omitted variables, conflating statistical with substantive adequacy.
(vii) Ignoring any departures from Normality by arguing that such departures are unimportant as n → ∞.
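The Γ̂(i) building blocks in (15.64) and the Newey–West weighting can be sketched directly from the formulas above (toy helpers, illustrative only):

```python
import numpy as np

def gamma_hat(u, X, i):
    """Gamma(i) = (1/n) * sum_{t=i+1}^{n} u_t u_{t-i} x_t x_{t-i}'."""
    n = len(u)
    return sum(u[t] * u[t - i] * np.outer(X[t], X[t - i]) for t in range(i, n)) / n

def g_newey_west(u, X, p):
    """G_NW = Gamma(0) + sum_{i=1}^{p} (1 - i/(p+1)) (Gamma(i) + Gamma(i)'), p > 0."""
    G = gamma_hat(u, X, 0)                   # the White piece, Gamma(0)
    for i in range(1, p + 1):
        w = 1 - i / (p + 1)                  # Bartlett kernel weight
        Gi = gamma_hat(u, X, i)
        G = G + w * (Gi + Gi.T)
    return G
```

Note that this only implements the covariance "fix" itself; the surrounding discussion is precisely that such fixes leave the underlying misspecification unaddressed.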
Reliability of inference. The above “error-fixing” strategies do not address the real effects of statistical misspecification: inconsistent estimators and sizeable discrepancies between actual and nominal error probabilities. These “error-fixing” moves constitute attempts to countenance the fallacies of acceptance and rejection, rendering the misspecification problem even worse by invoking consistent estimators without testing the validity of the assumptions needed for them to hold. This ignores the fact that even in cases where consistency holds, the reliability of inference depends only on the approximate validity of the model assumptions for the particular sample size n.
15.6.3 The Probabilistic Reduction Approach

How does the PR approach address the statistical adequacy problem? By using informed specification, M-S testing, and respecification guided by data plots.

15.6.3.1 Specification

What are the probabilistic assumptions pertaining to the process {Zt, t ∈ N} for the LR model? NIID:

(D) Distribution: Normal  (M) Dependence: Independent  (H) Heterogeneity: Identically distributed

Are they appropriate for the consumption function data?
Fig. 15.21 t-plot of y_t
Fig. 15.22 t-plot of x_t
Fig. 15.23 t-plot of y_t, detrended
Fig. 15.24 t-plot of x_t, detrended
It is clear from the t-plots (Figures 15.21 and 15.22) that both data series are mean trending and exhibit cycles. To get a better view of the latter, let us subtract the trend using the auxiliary regression

z_t = δ₀ + δ₁t + δ₂t² + v_t, t = 1, 2, ..., n, (15.65)

and take the residuals v̂_t = z_t − δ̂₀ − δ̂₁t − δ̂₂t². This exercise corresponds to the philosopher's counterfactual (what if) reasoning! The residuals from (15.65) for the two (detrended) series are plotted in Figures 15.23 and 15.24.

(D) Distribution: Normal?  (M) Dependence: Independent?  (H) Heterogeneity: Mean heterogeneous

It is clear from Figures 15.23 and 15.24 that both series exhibit Markov-type temporal dependence; see Chapter 5. To assess the underlying distribution we need to subtract that dependence as well, which we can do generically using the extended auxiliary regression

z_t = γ₀ + γ₁t + γ₂t² + γ₃z_{t−1} + γ₄z_{t−2} + ε_t, t = 1, 2, ..., n, (15.66)

and plot the residuals, which we call the detrended and dememorized data series (see Figures 15.25 and 15.26).
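The detrending and dememorizing steps in (15.65)–(15.66) amount to taking OLS residuals from the two auxiliary regressions; a minimal sketch (function names are illustrative):

```python
import numpy as np

def detrend(z):
    """Residuals from z_t = d0 + d1*t + d2*t^2 + v_t, as in (15.65)."""
    t = np.arange(1.0, len(z) + 1)
    Z = np.column_stack([np.ones_like(t), t, t ** 2])
    coef, *_ = np.linalg.lstsq(Z, z, rcond=None)
    return z - Z @ coef

def detrend_dememorize(z):
    """Residuals from z_t = g0 + g1*t + g2*t^2 + g3*z_{t-1} + g4*z_{t-2} + e_t,
    as in (15.66); the first two observations are lost to the lags."""
    t = np.arange(1.0, len(z) + 1)[2:]
    Z = np.column_stack([np.ones_like(t), t, t ** 2, z[1:-1], z[:-2]])
    coef, *_ = np.linalg.lstsq(Z, z[2:], rcond=None)
    return z[2:] - Z @ coef
```

Applying these to each series produces the kind of detrended and dememorized plots shown in Figures 15.23–15.26.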
Fig. 15.25 t-plot of y_t, detrended and dememorized
Fig. 15.26 t-plot of x_t, detrended and dememorized
Fig. 15.27 Scatterplot of detrended and dememorized data
(D) Distribution: Normal?  (M) Dependence: Markov  (H) Heterogeneity: Mean heterogeneous
The t-plots in Figures 15.25 and 15.26 indicate a trending variance; the variation around the mean increases with t. In addition, the scatterplot of the two series in Figure 15.27 indicates clear departures from the elliptically shaped plot associated with a bivariate Normal distribution. (D) Distribution
(M) Dependence
(H) Heterogeneity
Normal?
Markov
Mean heterogeneous
Non-symmetric
Variance heterogeneous?
It is important to note that non-Normality leads to drastic respecifications, because both the regression and skedastic functions need to be reconsidered.
15.6.3.2 Misspecification Testing

Joint misspecification tests for model assumptions [1]–[5]. Regression function tests. In view of the chance regularity patterns exhibited by the data in Figures 15.21–15.27, the test that suggests itself would be based on the auxiliary regression

û_t = γ₀ + γ₁x_t + γ₂t + γ₃(x_t²/1000) + γ₄y_{t−1} + γ₅x_{t−1} + v_{1t}, H₀: γ₂ = γ₃ = γ₄ = γ₅ = 0,

estimated as

û_t = 205.5(52.3) − .389(.082)x_t + 5.37(2.61)t − .03(.0085)(x_t²/1000) + .594(.128)y_{t−1} − .384(.117)x_{t−1} + v̂_t,
R² = .86, s = 19.1, n = 51. (15.67)

Note that the scaling of variables, such as x_t², is crucial in practice to avoid large approximation errors. The F-test for the joint significance of the terms t, x_t², y_{t−1}, x_{t−1} yields

F(4, 45) = [(117764 − 16421)/16421](45/5) = 55.544[.0000000],

indicating clearly that this estimated regression is badly misspecified. To get a better idea of the departures from the individual assumptions, let us consider the significance of the relevant coefficients for each assumption separately:

Mean heterogeneity [5] (γ₂ = 0): τ₂(y; 45) = 5.37/2.61 = 2.058[.023],
Non-linearity [2] (γ₃ = 0): τ₃(y; 45) = .03/.0085 = 3.529[.0005],
Dependence [4] (γ₄ = γ₅ = 0): F(y; 2, 45) = [(24778 − 16421)/16421](45/2) = 11.45[.00006],
γ₄ = 0: τ₄(y; 45) = .594/.128 = 4.641[.000015],
γ₅ = 0: τ₅(y; 45) = .384/.117 = 3.282[.0001].

It is important to note that τᵢ(y; 45)² = F(y; 1, 45), i = 2, 3.

Skedastic function tests. The auxiliary regression that suggests itself is
(û_t/s)² = δ₀ + δ₁t + δ₂(x_t²/1000) + δ₃(û_{t−1}/s)² + v_{2t}, H₀: δ₁ = δ₂ = δ₃ = 0,

estimated as

(û_t/s)² = 110.2(83.83) − .057(.043)t − .252(.127)(x_t²/1000) + .874(.174)(û_{t−1}/s)² + v̂_{2t}. (15.68)

The F-test for the joint significance of the terms t, x_t², and û²_{t−1} yields

F(3, 45) = [(134.883 − 66.484)/66.484](45/3) = 15.432[.00000],
indicating clearly that some of the model assumptions pertaining to the conditional variance are misspecified. To shed additional light on which assumptions are to blame for the small p-value, let us consider the significance of the relevant coefficients for each assumption separately:

Variance heterogeneity [5] (δ₁ = 0): τ₁(y; 45) = .057/.043 = 1.326[.194],
Heteroskedasticity [3] (δ₂ = δ₃ = 0): F(2, 45) = [(114.7 − 66.484)/66.484](45/2) = 16.318[.000005],

where the latter indicates the presence of heteroskedasticity! CAUTION: If one were to use the auxiliary regression (û_t/s)² = δ₀ + δ₁t + v*_{2t}, estimated as

(û_t/s)² = −82.2(273.6) + .042(.014)t + v̂*_{2t},

one would have erroneously concluded that [5] is invalid, since τ₁(y; 45) = .042/.014 = 3.01[.004]. This brings out the importance of joint M-S testing to avoid misdiagnosis! In summary, the M-S testing based on the auxiliary regressions (15.67) and (15.68) indicates that there are clear departures from assumptions [2]–[5]. If one were to ignore that and proceed to test the Normality assumption [1], the testing result is likely to be unreliable because, as mentioned above, all current M-S tests for Normality assume the validity of assumptions [2]–[5]. To see this, let us use the skewness–kurtosis test:

SK(x₀) = (52/6)(.031)² + (52/24)(3.91 − 3)² = 1.803[.406],
which indicates no departures from [1]; but is that a reliable diagnosis? No; see below!

15.6.3.3 Traditional Respecification: Embracing the Fallacy of Rejection

At this point it will be interesting to follow the traditional respecification of misspecified models by embracing the fallacy of rejection and simply adding the terms found to be significant in the above M-S testing based on auxiliary regressions. In particular, let us estimate an extended regression equation aiming to maximize R²:

y_t = 163.7(52.0) + 7.75(3.18)t − .159(.122)t² + .462(.105)x_t + .057(.022)x_t² + .556(.130)y_{t−1} − .302(.132)x_{t−1} + ε̂_t, (15.69)
R² = .9997, s = 18.957, n = 51.

Apart from the obvious fact that (15.69) makes no statistical sense, since the specification is both ad hoc and internally inconsistent (Spanos, 1995b), a glance at the residuals from this estimated equation raises serious issues of statistical misspecification stemming from a t-varying conditional variance; see Figure 15.28.

15.6.3.4 Probabilistic Reduction Respecification

The combination of M-S testing and graphical techniques suggests the following probabilistic structure for the process {ln Zt, t ∈ N}:

(D) Distribution: LogNormal  (M) Dependence: Markov  (H) Heterogeneity: Mean heterogeneous
Fig. 15.28 Residuals from (15.69)
Fig. 15.29 t-plot of ln y_t
Fig. 15.30 t-plot of ln x_t
Fig. 15.31 t-plot of ln y_t, detrended and dememorized
Fig. 15.32 t-plot of ln x_t, detrended and dememorized
where the logarithm is used as a variance-stabilizing transformation; see Spanos (1986). Let us take the logs of the original data series (see Figures 15.29–15.32). If we compare Figures 15.25/15.26 with Figures 15.31/15.32, it becomes clear that the logarithm has acted as a variance-stabilizing transformation because the t-varying variances
Fig. 15.33 Scatterplot of (ln x_t, ln y_t), detrended and dememorized
in the latter disappear; see Spanos (1986). In addition, the scatterplot in Figure 15.33, associated with the data in Figures 15.31 and 15.32, indicates no departures from the elliptical shape we associate with the bivariate Normal distribution. Imposing the reduction assumptions

(D) Distribution: LogNormal  (M) Dependence: Markov  (H) Heterogeneity: Mean heterogeneous (separable heterogeneity?)

on the joint distribution f(z₁, z₂, ..., zₙ; φ), where Zt := (y_t, x_t), y_t := ln Y_t, x_t := ln X_t, gives rise to the reduction

f(z₁, z₂, ..., zₙ; φ) =(M) f₁(z₁; ψ₁)·∏ₜ₌₂ⁿ fₜ(zₜ|zₜ₋₁; φₜ) =(SH) f₁(z₁; ψ₁)·∏ₜ₌₂ⁿ f(zₜ|zₜ₋₁; φ),

where (M) invokes the Markov assumption and (SH) the separable heterogeneity assumption.
Reducing f(zₜ|zₜ₋₁; φ) further by conditioning on Xₜ = xₜ yields

f(zₜ|zₜ₋₁; φ) = f(yₜ|xₜ, zₜ₋₁; φ₁)·f(xₜ|zₜ₋₁; φ₂),

with f(yₜ|xₜ, zₜ₋₁; φ₁) the distribution underlying the dynamic linear regression (DLR(1)) model with statistical GM (Table 15.10)

y_t = δ₀ + δ₁t + α₁x_t + α₂y_{t−1} + α₃x_{t−1} + u_t, t ∈ N, (15.70)

which can be thought of as a hybrid of the LR and AR(1) models.

Example 15.10 (continued) Estimating the DLR model in (15.70) yields

ln Y_t = .912(.272) + .005(.001)t + .708(.069)ln x_t + .565(.108)ln Y_{t−1} − .413(.097)ln x_{t−1} + ε̂_t,
R² = .904 [reported R² = .9997], s = .0085, n = 51. (15.71)
Table 15.10 Normal, dynamic linear regression model

Statistical GM: y_t = δ₀ + δ₁t + α₁x_t + α₂y_{t−1} + α₃x_{t−1} + u_t, t ∈ N
[1] Normality: (y_t|x_t, Z_{t−1}) ~ N(·, ·)
[2] Linearity: E(y_t|x_t, Z_{t−1}) = δ₀ + δ₁t + α₁x_t + α₂y_{t−1} + α₃x_{t−1}
[3] Homoskedasticity: Var(y_t|x_t, Z_{t−1}) = σ₀²
[4] Markov: {(y_t|x_t, Z_{t−1}), t ∈ N} is a Markov process
[5] t-Invariance: (δ₀, δ₁, α₁, α₂, α₃, σ₀²) do not change with t

where [1]–[4] hold for all t ∈ N.
Estimating the two auxiliary regressions for the first two conditional moments yields

û_t = .30(2.51) + .028(.056)t − .022(.631)ln x_t − .338(.128)ln Y_{t−1} + .308(.223)ln x_{t−1} + .007(.014)t² − .001(.041)(ln x_t)² + .446(.297)û_{t−1} + v̂_{1t},

(û_t/s)² = −13.9(13.1) − .155(.138)t − .324(.287)(x_t²/1000) + .036(.153)(û_{t−1}/s)² + v̂_{2t},
which indicate no departures from the probabilistic assumptions of the underlying DLR model in Table 15.10. Comparing Keynes' AIH with the statistically adequate model in (15.71), we can infer that the substantive model is clearly false on the basis of this data.

15.6.3.5 What About Substantive Adequacy?

The statistical model (15.71) is statistically adequate, but it does not constitute a substantive model. The trend in the estimated statistical model indicates substantive ignorance! It suggests that certain relevant (substantive) variables are missing.

Example 15.10 (continued) Returning to macroeconomic theory, one can pose the question of omitted relevant variables in the context of the statistically adequate model (15.71) and use the error reliability of the model to probe the significance of potential variables. For the sake of the discussion, let us assume that the following variables might capture the relevant omitted effects: X₂ₜ = price level, X₃ₜ = consumer credit outstanding, X₄ₜ = short-run interest rate. When these additional variables are introduced into the statistically adequate model (15.71), they render the trend insignificant by explaining the trending behavior, and they preserve the statistical adequacy of the respecified DLR model with statistical GM of the generic form

ln y_t = β₀ + Σⱼ₌₁⁴(βⱼ ln x_{jt} + αⱼ ln x_{jt−1}) + γ₁ ln y_{t−1} + u_t, t ∈ N. (15.72)

A detailed description of the procedure leading from (15.72), in the context of which the additional variables (x₂ₜ, x₃ₜ, x₄ₜ) have rendered the trend term t redundant, to the new substantive model

ln y_t = .004(.004) − .199(.056)(ln y_{t−1} − ln x_{1t−1}) + .558(.046)ln x_{1t} − .173(.071)ln x_{2t} + .081(.025)ln x_{3t} − .024(.011)ln x_{4t} + ε̂_t,
R² = .924, s = .0071, n = 52, (15.73)
is beyond the scope of this chapter, but it involves imposing empirically validated restrictions with a view to finding a parsimonious and substantively meaningful model. In particular, the procedure involves estimating the DLR model in (15.72) and securing its statistical adequacy using trenchant M-S testing for probing assumptions [1]–[5] (Table 15.10), before any restrictions are imposed. In light of the fact that (15.72) has 11 statistical parameters and the substantive model in (15.73) has seven structural parameters, going from (15.72) to (15.73) involves imposing four overidentifying restrictions. The overidentifying restrictions were tested using an F-type test (linear restrictions) and not rejected, and the estimated substantive model in (15.73) retained the statistical adequacy of (15.72). The latter ensures that any inferences based on (15.73), including forecasting and policy simulations, will be statistically reliable. One might disagree with the choice of variables (X2t , X3t , X4t ) to replace the generic trend, but the onus will be on the modeler questioning such a choice to provide a better explanation by proposing different variables that might achieve that. The statistically adequate model (15.71) provides a proper basis for such a discussion.
15.7 Summary and Conclusions

The problem of statistical misspecification arises when any of the assumptions invoked by a statistical inference procedure are invalid for the particular data z0. Departures from the invoked assumptions invalidate the sampling distribution of any statistic (estimator, test, predictor), and as a result the reliability of inference is often undermined. Misspecification (M-S) testing aims to assess the validity of the assumptions comprising a statistical model. Its usefulness is twofold: (i) it can alert a modeler to potential problems with unreliable inferences and (ii) it can shed light on the nature of departures from the model assumptions.

The primary objective of empirical modeling is "to learn from data z0" about observable phenomena of interest using a statistical model Mθ(z) as the link between substantive information and systematic statistical information in data z0. Substantive subject-matter information, codified in the form of a structural model Mϕ(z), plays an important role in demarcating and enhancing this learning from data when it does not belie the statistical information in z0. Behind every structural model Mϕ(z) there is a statistical model Mθ(z) which comprises the probabilistic assumptions imposed on one's data, and nests Mϕ(z) via generic restrictions of the form G(θ, ϕ) = 0.

In an attempt to distinguish between the modeling and inference facets of statistical analysis, Fisher's (1922a) original framework is broadened with a view to bringing out the different potential errors in the two facets and using strategies to safeguard against them. The statistical misspecification of Mθ(z) is a crucial error because it undermines the reliability of the inference procedures based on it. Relying on weak assumptions, combined with vague "robustness" claims and heuristics invoking n → ∞, will not circumvent this error in practice.
Establishing the adequacy of Mθ (z) calls for thorough M-S testing, combined with a coherent respecification strategy that relies on changing the assumptions imposed on {Zt , t ∈ N}. Joint M-S tests based on auxiliary regressions provide a most effective procedure to detect departures from the model assumptions. The traditional respecification of adopting the particular alternative H1 used by the M-S test is fallacious.
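The joint M-S tests based on auxiliary regressions advocated above can be sketched as follows: regress the residuals of the estimated model on a set of terms designed to pick up departures, and test their joint significance with an F-test. The particular auxiliary regressors used here (trend, trend squared, lagged residuals) are illustrative choices on simulated data, not the book's exact specification.

```python
# Sketch: a joint M-S test via an auxiliary regression of the residuals on
# trend terms and their own lag, followed by an F-test of joint significance.
# Simulated (correctly specified) data; auxiliary regressors are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n = 150
t = np.arange(1.0, n + 1)
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)          # a correctly specified LR model

def ols_resid(X, yv):
    beta, *_ = np.linalg.lstsq(X, yv, rcond=None)
    return yv - X @ beta

# Step 1: residuals from the estimated statistical model
u = ols_resid(np.column_stack([np.ones(n), x]), y)

# Step 2: auxiliary regression u_t = d0 + d1*t + d2*t^2 + d3*u_{t-1} + v_t
Z0 = np.ones((n - 1, 1))                         # restricted: constant only
Z1 = np.column_stack([np.ones(n - 1), t[1:], t[1:] ** 2, u[:-1]])
e0 = ols_resid(Z0, u[1:])
e1 = ols_resid(Z1, u[1:])
rss0, rss1 = float(e0 @ e0), float(e1 @ e1)

q, k = 3, Z1.shape[1]                            # 3 restrictions tested jointly
F = ((rss0 - rss1) / q) / (rss1 / ((n - 1) - k))
print(round(F, 3))                               # insignificant F: no detected departures
```

Probing several assumptions jointly in one auxiliary regression is what allows the test to respect the interrelationships among the model assumptions.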
Distinguishing between statistical and substantive inadequacy is crucial because a structural model Mϕ(z) will always be a crude approximation of the reality it aims to shed light on, but there is nothing inevitable about imposing invalid probabilistic assumptions on one's data by selecting an erroneous Mθ(z). When modeling with observational data, Mϕ(z) will inevitably come up short in terms of substantive adequacy vis-à-vis the phenomenon of interest, but that does not preclude using a statistically adequate Mθ(z) to answer substantive questions of interest reliably. These questions include (i) whether Mϕ(z) belies the data, (ii) whether its overidentifying restrictions G(ϕ, θ) = 0 are data-acceptable, as well as (iii) probing for omitted variables, confounding factors, systematic approximation errors, and other neglected aspects of the phenomenon of interest.

FINAL THOUGHT: The overwhelming majority of empirical results published in prestigious journals are untrustworthy, primarily because very few researchers test the validity of their statistical models. Invoking vague robustness results, combined with wishful asymptotic (as n → ∞) theorems based on non-testable assumptions, will not secure the trustworthiness of one's empirical evidence. In contrast, securing statistical adequacy using trenchant M-S testing will address the problem of statistical misspecification, encourage a more informed implementation of inference procedures, and ensure the reliability of the inference results and their warranted evidential interpretations. These will go a long way toward securing trustworthy evidence and attaining the primary objective of empirical modeling and inference: learning from data about stochastic phenomena of interest.
Important Concepts
Statistical misspecification, statistical adequacy, misspecification (M-S) testing, default alternative in an M-S test, unreliable inference, untrustworthy evidence, facets of modeling: specification, estimation, M-S testing, respecification, nominal error probabilities, actual error probabilities, misspecification-induced unreliability of inference, probabilistic reduction perspective, runs (up and down) test, circularity charge for M-S testing, infinite regress charge for M-S testing, Kolmogorov's test of Normality, omnibus (non-parametric) M-S tests, directional (parametric) M-S tests, M-S tests based on encompassing models, skewness–kurtosis test of Normality, the multiple hypotheses charge for M-S testing, joint M-S tests based on auxiliary regressions, multiple testing (comparisons), regression function characterization, Yule's nonsense correlations, traditional error-fixing strategies, generalized least squares (GLS) estimator, feasible GLS estimator, common factor restrictions, heteroskedasticity-consistent standard errors (HCSE), autocorrelation-consistent standard errors (ACSE).

Crucial Distinctions
Misspecification (M-S) vs. Neyman–Pearson (N-P) testing, nominal vs. actual error probabilities, testing within vs. testing outside the statistical model, statistical vs. substantive adequacy, model respecification vs. error fixing.

Essential Ideas
● Statistical misspecification undermines the reliability of statistical inference and gives rise to untrustworthy evidence. Non-parametric statistics as well as Bayesian statistics are equally vulnerable to statistical misspecification.
● What are often described as minor departures from model assumptions can have devastating effects on the reliability of inference. In the case of frequentist inference, unreliability takes the form of biased and inconsistent estimators, as well as sizeable discrepancies between nominal (assumed) and actual error probabilities.
● Trenchant M-S testing is the most effective way to secure the reliability of inference. Joint M-S tests based on the residuals offer the most reliable way to probe for potential statistical misspecifications by taking into account the interrelationships among model assumptions.
● Yule's (1926) "nonsense" correlations can easily be explained away on statistical misspecification grounds.
● The slogan "all models are wrong, but some are useful" provides a poor excuse to avoid validating one's statistical model by invoking a confusion between statistical and substantive inadequacy.
● Weak but non-testable probabilistic assumptions provide the most trusted way to untrustworthy evidence.
● Le Cam (1986a, p. xiv): "limit theorems 'as n tends to infinity' are logically devoid of content about what happens at any particular n."
● M-S testing differs from N-P testing in one crucial respect: the latter takes place within the boundaries of the prespecified statistical model Mθ(x), but the former takes place outside those boundaries. The null for an M-S test is that the true distribution of the sample f∗(x) belongs to Mθ(x), and the default alternative is that f∗(x) ∈ [P(x) − Mθ(x)].
● When any of the assumptions of the statistical model Mθ(x) are rejected by the data, the only inference that follows is that Mθ(x) is misspecified. The next step should be to respecify it with a view to accounting for the systematic information in data x0 that Mθ(x) did not. This calls seriously into question the traditional strategies based on piecemeal ad hoc "error fixing," which ignore the observable process {Xt, t ∈ N} and focus on the error term.
● Deterministic trends provide a generic way to account for certain forms of heterogeneity in the context of a statistical model, where they are necessary to secure the reliability of inference. In the context of a substantive model, however, such terms represent ignorance and need to be replaced with proper explanatory variables.
● Be suspicious of any robustness claims made by practitioners who did not test the validity of their model assumptions. It is highly likely that on closer examination such claims fall apart. No practitioner can bluff his/her way out of dependence and heterogeneity-type misspecifications by invoking vague robustness results.
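The skewness–kurtosis (omnibus) test of Normality listed among the important concepts above can be sketched in a few lines: under Normality, n(α̂3²/6 + (α̂4 − 3)²/24) is asymptotically chi-square with 2 degrees of freedom. The sketch below uses the simple asymptotic form on simulated data, not the finite-sample refinements; and, as the chapter stresses, such a test is only informative when the other (IID-type) assumptions are not themselves in doubt.

```python
# Sketch of the skewness-kurtosis test of Normality: the statistic
# n*(skew^2/6 + (kurt - 3)^2/24) is asymptotically chi-square(2) under the null.
# Simulated Normal data for illustration.
import math
import random

random.seed(42)
data = [random.gauss(0.0, 1.0) for _ in range(500)]

n = len(data)
m = sum(data) / n
m2 = sum((z - m) ** 2 for z in data) / n   # central moments
m3 = sum((z - m) ** 3 for z in data) / n
m4 = sum((z - m) ** 4 for z in data) / n

skew = m3 / m2 ** 1.5                      # sample skewness (alpha_3)
kurt = m4 / m2 ** 2                        # sample kurtosis (alpha_4)
sk_stat = n * (skew ** 2 / 6.0 + (kurt - 3.0) ** 2 / 24.0)
print(round(sk_stat, 3))                   # compare with chi-square(2), e.g. 5.99 at alpha = .05
```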
15.8 Questions and Exercises

1. Explain how statistical misspecification might undermine the consistency of an estimator or induce sizeable discrepancies between the nominal and actual type I and II error probabilities. Use the LR model as an example for your answers.
2. (a) Explain how the following confusions contribute to the neglect of misspecification (M-S) testing: empirical modeling vs. curve fitting, statistical vs. substantive misspecification, the modeling vs. the inference facet. (b) Explain the difference between parametric and non-parametric M-S testing. (c) Explain why non-parametric M-S tests with low power have a role to play in establishing statistical adequacy. (d) Explain why sole reliance on asymptotic sampling distributions does not secure the reliability of inference.

3. (a) Explain the notions of testing within and testing without (outside) the boundaries of a statistical model in relation to Neyman–Pearson (N-P) and misspecification (M-S) testing. (b) Specify the generic form of the null and alternative hypotheses in M-S testing for a statistical model Mθ(z) and compare it with that of the archetypal N-P test. (c) Using your answers in (a) and (b), explain why M-S testing is vulnerable to both the fallacy of acceptance and the fallacy of rejection. (d) Explain why the type II error is more serious in M-S testing than in N-P testing.

4. (a) Explain how one can secure statistical adequacy in practice by employing particular strategies in misspecification testing. (b) Explain why in practice one should combine non-parametric (omnibus) with parametric directional M-S tests, with a view to maximizing the breadth and length of the probing away from the statistical model being validated. (c) Explain why rejecting the null using a less powerful M-S test provides better evidence for the presence of the departure than using a more powerful test. (d) Explain intuitively the two auxiliary regressions used to test the statistical adequacy of the simple Normal model. Why do joint M-S tests have a distinct advantage over single-assumption tests?

5.
(a) Describe the Durbin–Watson test as it is applied in traditional textbook econometrics, and explain why it might not be an effective M-S test of assumption [4] in the linear regression model (Table 15.6). (b) Using your answer in (a), discuss the problem with the traditional textbook strategy of adopting the alternative as a way to respecify the original model in order to account for the systematic information in the data. (c) Imagine you are in a seminar where a researcher is presenting an applied econometrics paper using table after table of estimation and testing results relating to his/her favorite theoretical model, on the basis of which he/she makes a good case for the theory in question; no data plots, no model assumptions, and no M-S tests are presented. Discuss the merits of the case for the particular theory model made by the presenter.

6. Using the monthly data for the period August 2000 to October 2005 (n = 64) in Appendix 5.A, for yit = log-returns of a particular firm i = 1, 2, ..., 6, x1t = monthly log-returns of the 3-month Treasury bill rate, x2t = market log-returns (SP500):
(a) Estimate the simple Normal model for each data series separately and test the four model assumptions using auxiliary regressions. Pay particular attention to outliers that might distort M-S testing. (b) For each of the assets estimate the linear regression model (Table 15.6)

yit = β0 + β1 x1t + β2 x2t + uit,  i = 1, 2, ..., 6,  t = 1, 2, ..., n,  (15.74)

and test assumptions [1]–[5] using joint M-S testing based on auxiliary regressions of the generic form

ût = δ0 + δ1 t + δ2 t² + δ3 x1t + δ4 x2t + δ5 ût−1 + v1t,
û²t = γ0 + γ1 t + γ2 t² + γ3 x1t + γ4 x2t + γ5 û²t−1 + v2t.  (15.75)
For each model assumption test, indicate what the restriction(s) are and report the test results for each equation separately. (c) Relate your M-S testing results in (a) and (b). (d) Test the Normality assumption using the skewness–kurtosis test and discuss why, in light of your results in (b), this test is likely to give rise to misleading results. (e) In light of your results in (b)–(d), discuss the trustworthiness of the test of significance of the coefficients (β0, β1) using α = .025, as well as that of R2.

7. Explain where auxiliary regressions for M-S testing come from, and provide an intuitive rationale pertaining to the objective of the joint M-S tests.

8. (a) For the Yule (1926) data in Table 15.A.1 (Appendix 15.A), test whether the two data series {(xt, yt), t = 1, 2, ..., n} can be viewed as typical realizations of a simple Normal model (Table 15.1); test the validity of assumptions [1]–[4] for each of the two data series. (b) Estimate the LR model Yt = β0 + β1 xt + ut, t ∈ N, and apply thorough M-S testing to assess the validity of assumptions [1]–[5] (Table 15.4). (c) Explain the relationship between the regression coefficient β1 and the correlation coefficient ρ = Corr(Xt, Yt). (d) Use your answer in (c) to explain away the significant correlation coefficient derived by Yule (1926) on statistical adequacy grounds.

9. (a) For the spurious correlation data in Table 15.A.2 (Appendix 15.A), xt = honey-producing bee colonies (US, USDA), 1000s of colonies, yt = juvenile arrests for possession of marijuana (US, DEA), for the period 1990–2009, estimate their correlation coefficient and its p-value. (b) Estimate the correlation coefficient between the above two data series after detrending and dememorizing (when needed) and compare the results. (c) Using your answer in (b), discuss how the correlation in (a) is statistically spurious.
Appendix 15.A: Data

Table 15.A.1 Yule (1926) data

 t    y     x       t    y     x
 1   22.0  78.1    24   18.3  68.9
 2   21.8  74.5    25   18.4  71.9
 3   21.5  74.6    26   18.3  73.3
 4   21.3  75.5    27   18.3  70.9
 5   21.1  76.9    28   18.0  70.0
 6   21.2  76.3    29   17.7  65.5
 7   21.0  73.8    30   17.7  70.1
 8   20.8  73.4    31   17.9  66.6
 9   20.5  75.8    32   17.6  67.4
10   20.4  76.9    33   17.7  67.8
11   20.1  73.2    34   17.4  69.4
12   19.9  72.1    35   17.1  69.4
13   19.6  74.3    36   16.8  66.7
14   19.4  73.1    37   16.5  65.0
15   19.2  72.2    38   16.2  63.1
16   19.1  69.4    39   15.8  65.0
17   19.1  70.8    40   15.5  63.9
18   18.9  71.1    41   15.2  63.0
19   18.7  71.2    42   14.9  62.2
20   18.7  70.6    43   14.6  61.4
21   18.7  71.1    44   14.4  61.0
22   18.4  70.3    45   14.5  58.8
23   18.4  68.3    46   14.2  61.0
Table 15.A.2 Spurious correlation data

year    y      x
1990   3220   20940
1991   3211   16490
1992   3045   25004
1993   2875   37915
1994   2783   61003
1995   2655   82015
1996   2581   87712
1997   2631   94046
1998   2637   91467
1999   2652   89523
2000   2622   95962
2001   2550   97088
2002   2574   85769
2003   2599   87909
2004   2554   87717
2005   2409   88909
2006   2394   95120
2007   2443   97671
2008   2342   93042
2009   2498   90927
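As a quick sketch for question 9(a)–(b), the correlation of the two series in Table 15.A.2 can be computed directly. Only the plain Pearson correlation and a simple linear detrending are shown; the p-value and any "dememorizing" the exercise calls for are left out.

```python
# Correlation of the Table 15.A.2 series: raw, and after removing a linear
# trend from each series (a minimal sketch for question 9(a)-(b)).
year = list(range(1990, 2010))
y = [3220, 3211, 3045, 2875, 2783, 2655, 2581, 2631, 2637, 2652,
     2622, 2550, 2574, 2599, 2554, 2409, 2394, 2443, 2342, 2498]
x = [20940, 16490, 25004, 37915, 61003, 82015, 87712, 94046, 91467, 89523,
     95962, 97088, 85769, 87909, 87717, 88909, 95120, 97671, 93042, 90927]

def corr(a, b):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5

def detrend(series, t):
    """OLS residuals from series = a + b*t (simple linear detrending)."""
    n = len(series)
    mt, ms = sum(t) / n, sum(series) / n
    b = sum((ti - mt) * (si - ms) for ti, si in zip(t, series)) / \
        sum((ti - mt) ** 2 for ti in t)
    a = ms - b * mt
    return [si - (a + b * ti) for si, ti in zip(series, t)]

print(round(corr(x, y), 3))                                # strong "spurious" correlation
print(round(corr(detrend(x, year), detrend(y, year)), 3))  # correlation after detrending
```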
References
Abramowitz, M. and I. A. Stegun (1970) Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover, New York. Adams, M. R. and V. Guillemin (1996) Measure Theory and Probability. Birkhäuser, Boston. Agresti, A. (2013) Categorical Data Analysis, 3rd edn. Wiley, New York. Aitken, A. C. (1935) “On least-squares and linear combinations of observations,” Proceedings of the Royal Society of Edinburgh, 55: 42–48. Ali, M. M. and S. C. Sharma (1996) “Robustness to nonnormality of regression F-tests,” Journal of Econometrics, 71: 175–205. Altman, D. G., D. Machin, T. N. Bryant, and M. J. Gardner (2000) Statistics with Confidence. British Medical Journal Books, Bristol. Anderson, T. W. and D. A. Darling (1952) “Asymptotic theory of certain goodness-of-fit criteria based on stochastic processes,” Annals of Mathematical Statistics, 23: 193–212. Andreou, E. and A. Spanos (2003) “Statistical adequacy and the testing of trend versus difference stationarity,” Econometric Reviews, 3: 217–237. Anscombe, F. J. (1973) “Graphs in statistical analysis,” The American Statistician, 27: 17–22. Arbuthnot, J. (1710) “An argument for Divine Providence, taken from the constant regularity observ’d in the birth of both sexes,” Philosophical Transactions, 27: 186–190. [Reprinted in M. G. Kendall and R. I. Plackett (eds) (1977).] Arnold, B. C., E. Castillo, and J.-M. Sarabia (1999) Conditional Specification of Statistical Models. Springer, New York. Arnold, S. F. (1990) Mathematical Statistics. Prentice-Hall, Englewood Cliffs, NJ. Ash, R. B. (2000) Real Analysis and Probability, 2nd edn. Academic Press, London. Azzalini, A. (1996) Statistical Inference: Based on the Likelihood. Chapman & Hall, London. Bachelier, L. (1900) “Theorie de la speculation,” Annales Scientifiques de l’Ecole Normale Superieure, 17: 21–86. Bahadur, R. R. (1960) “On the asymptotic efficiency of tests and estimates,” Sankhya, 22: 229–252. Bahadur, R. R. and L. J.
Savage (1956) “The nonexistence of certain statistical procedures in nonparametric problems,” Annals of Mathematical Statistics, 27: 1115–1122. Balanda, K. P. and H. L. MacGillivray (1988) “Kurtosis: A critical review,” The American Statistician, 42: 111–119. Bansal, A. K. (2007) Bayesian Parametric Inference. Alpha Science, Oxford. Barnard, G.A. (1967) “The use of the likelihood function in statistical practice,” in Proceedings of the Fifth Berkeley Symposium (Vol. 1, pp. 27–40).
Barndorff-Nielsen, O. and D. R. Cox (1994) Inference and Asymptotics. Chapman & Hall, London. Barnett, V. (1999) Comparative Statistical Inference, 3rd edn. Wiley, New York. Bayes, T. (1763) “An essay towards solving a problem in the doctrine of chances,” Philosophical Transactions of the Royal Society of London, 53: 370–418. [Reprinted in E. S. Pearson and M. G. Kendall (eds) (1970).] Beeck, P. V. (1972) “An application of the Fourier method to the problem of sharpening the Berry–Esseen inequality,” Zeitschrift fur Wahrscheinlichkeitstheorie und Vernwadte Gebiete, 23: 187–197. Belsley, D. A. (1991) Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York. Bennett, J. H. (1990) Statistical Inference and Analysis: Selected Correspondence of R. A. Fisher. Clarendon Press, Oxford. Berger, J. O. (1985) Statistical Decision Theory and Bayesian Analysis, 2nd edn. Springer, New York. Berger, J. O. and R. W. Wolpert (1988) The Likelihood Principle. Lecture Notes Monograph Series, 2nd edn, vol. 6. Institute of Mathematical Statistics, Hayward, CA. Bergtold, J. S., A. Spanos, and E. Onukwugha (2010) “Bernoulli regression models: Revisiting the specification of statistical models with binary dependent variables,” Journal of Choice Modelling, 3: 1–28. Berkson, J. (1938) “Some difficulties of interpretation encountered in the application of the chisquare test,” Journal of the American Statistical Association, 33: 526–536. Bernardo, J. M. and A. F. M. Smith (1994) Bayesian Theory. Wiley, New York. Bernoulli, J. (1713) Ars Conjectandi. Thurnisiorum, Basil. Berry, A. C. (1941) “The accuracy of the Gaussian approximation to the sum of independent random variables,” Transactions of the American Mathematical Society, 49: 122–136. Bhattacharya, A. (1946) “On some analogues of the amount of information and their uses in statistical estimation,” Sankhya, 8: 1–14, 201–218, 315–328. Bhattacharya, R. N. and E. C. 
Waymire (1990) Stochastic Processes with Applications. Wiley, New York. Billingsley, P. (1995) Probability and Measure, 4th edn. Wiley, New York. Binmore, K. G. (1980) Foundations of Analysis: A Straightforward Introduction. Book 1: Logic, Sets and Numbers. Cambridge University Press, Cambridge. Binmore, K. G. (1993) Mathematical Analysis: A Straightforward Approach, 2nd edn. Cambridge University Press, Cambridge. Bishop, Y. V., S. E. Fienberg, and P. W. Holland (1975) Discrete Multivariate Analysis. MIT Press, Cambridge, MA. Blackwell, D. (1947) “Conditional expectation and unbiased sequential estimation,” Annals of Mathematical Statistics, 18: 105–110. Borel, E. (1909) “Les probabilities denombrables et leurs applications arithmetiques,” Rendiconti del Circolo Matemtico di Palermo, 27: 247–271. Box, G. E. P. (1953) “Non-normality and tests on variances,” Biometrika, 40: 318–335. Box, G. E. P. (1979) “Robustness in the strategy of scientific model building,” pp. 201– 236 in R. L. Launer and G. N. Wilkinson (eds), Robustness in Statistics. Academic Press, London. Box, G. E. P. and G. M. Jenkins (1976) Time Series Analysis: Forecasting and Control (revised edition). Holden-Day, San Francisco, CA. Box, G. E. P. and D. A. Pierce (1970) “Distribution of residual autocorrelations in autoregressiveintegrated moving average time series models,” Journal of the American Statistical Association, 65: 1509–1526.
Box, G. E. P. and G. S. Watson (1962) “Robustness to non-normality of regression tests,” Biometrika, 49: 93–106. Box, J. F. (1978) R. A. Fisher: The Life of a Scientist, Wiley, New York. Breiman, L. (1968) Probability. Addison-Wesley, London. Cacoullos, T. (1966) “Estimation of a multivariate density,” Annals of the Institute of Statistical Mathematics, 18: 179–189. Carlton, A.G. (1946) “Estimating the Parameters of a Rectangular Distribution,” The Annals of Mathematical Statistics, 17: 355–358. Carnap, R. (1950) Logical Foundations of Probability. University of Chicago Press, Chicago, IL. Casella, G. and R. L. Berger (2002) Statistical Inference, 2nd edn. Duxbury, Pacific Grove, CA. Chaitin, G. J. (2001) Exploring Randomness. Springer, New York. Chambers, J. M., J. W. Cleveland, B. Kleiner, and P. Tukey (1983) Graphical Methods for Data Analysis. Duxbury, Boston, MA. Cherian, K. C. (1941) “A bivariate correlated gamma-type distribution function,” Journal of the Indian Mathematical Society, 5: 133–144. Choi, I. (2015) Almost All About Unit Roots. Cambridge University Press, Cambridge. Cleveland, W. S. (1985) The Elements of Graphing Data. Hobart Press, Summit, NJ. Cleveland, W. S. (1993) Visualizing Data, Hobart Press, Summit, NJ. Cohen, J. (1994) “The earth is round (p < .05),” American Psychologist, 49: 997–1003. Cook, R. D. and S. Weisberg (1999) “Graphs in statistical analysis: Is the medium the message?,” The American Statistician, 53: 29–37. Cox, D. R. (1977) “The role of significance testing,” Scandinavian Journal of Statistics, 4: 49–70. Cox, D. R. (1990) “Role of models in statistical analysis,” Statistical Science, 5: 169–174. Cox, D. R. and D. V. Hinkley (1974) Theoretical Statistics. Chapman & Hall, London. Cramer, H. (1946a) Mathematical Methods of Statistics. Princeton University Press, Princeton, NJ. Cramer, H. (1946b) “A contribution to the theory of statistical estimation,” Skand. Aktuarietidskr., 29: 85–96. Cramer, H. 
(1972) “On the history of certain expansions used in mathematical statistics,” Biometrika, 59: 205–207. Cubała, W. J., J. Landowski, and H. M. Wichowicz (2008) “Zolpidem abuse, dependence and withdrawal syndrome: Sex as susceptibility factor for adverse effects,” British Journal of Clinical Pharmacology, 65(3): 444–445. Cumming, G. (2012) Understanding the New Statistics, Routledge, NY. D’Agostino, R. B. (1986) “Graphical analysis,” ch. 2 in D’Agostino and Stephens (eds) (1986). D’Agostino, R. B. and E. S. Pearson (1973) “Tests for departure from normality. Empirical results for the distributions of b2 and √b1,” Biometrika, 60: 613–622. D’Agostino, R. B. and M. A. Stephens (eds) (1986) Goodness-of-Fit Techniques, Marcel Dekker, New York. Dauben, J. W. (1990) Georg Cantor: His Mathematics and Philosophy of the Infinite. Princeton University Press, Princeton, NJ. David, F. N. and J. Neyman (1938) “Extension of the Markoff theorem on least squares,” Statistical Research Memoirs, 2: 105–116. David, H. A. (1981) Order Statistics, 2nd edn. Wiley, New York. Davidson, J. (1994) Stochastic Limit Theory. Oxford University Press, Oxford. Davidson, R. and J. G. MacKinnon (2004) Econometric Theory and Methods. Oxford University Press, Oxford. Davison, A. C. (2003) Statistical Models. Cambridge University Press, Cambridge. De Finetti, B. (1937) Foresight: Its Logical Laws, its Subjective Sources. [Reprinted in H. E. Kyburg and H. E. Smokler (eds) (1964).]
De Finetti, B. (1974) Theory of Probability 1. Wiley, New York. De Moivre, A. (1718) The Doctrine of Chances: Or a Method of Calculating the Probability of Events in Play. W. Pearson, London. Devroye, L. (1986) Non-Uniform Random Variate Generation. Springer, New York. Dhrymes, P. J. (1971) Distributed Lags: Problems of Estimation and Formulation. Oliver and Boyd, Edinburgh. Dhrymes, P. J. (1998) Time Series, Unit Roots, and Cointegration. Academic Press, New York. Donsker, M. (1951) “An invariance principle for certain probability limit theorems,” Memoirs of the American Mathematical Society, 6: 1–12. Doob, J. L. (1953) Stochastic Processes. Wiley, New York. Durbin, J. and G. S. Watson (1950) “Testing for serial correlation in least squares regression I,” Biometrika, 37: 409–428. Edgeworth, F. Y. (1885) “Methods of statistics,” Journal of the Statistical Society of London, Jubilee Volume, 181–217. Edwards, D. (1995) Introduction to Graphical Modelling. Springer, New York. Efron, B. (1979) “Bootstrap methods: Another look at the jackknife,” Annals of Statistics, 7: 1–26. Efron, B. and D. V. Hinkley (1978) “Assessing the accuracy of the maximum likelihood estimator: Observed versus expected Fisher information,” Biometrika, 65: 457–483. Efron, B. and R. J. Tibshirani (1993) An Introduction to the Bootstrap. Chapman & Hall, London. Eichhorn, W. (1978) Functional Equations in Economics. Addison-Wesley, London. Einstein, A. (1905) “Über die von der molekularkinetischen Theorie der Wärme geforderte Bewegung von in Flüssigkeiten suspendierten Teilchen,” Annalen der Physik, 17: 132–148. Engle, R. F., D. F. Hendry, and J.-F. Richard (1983) “Exogeneity,” Econometrica, 51: 277–304. Erdős, P. and M. Kac (1946) “On certain limit theorems in the theory of probability,” Bulletin of the American Mathematical Society, 52: 292–302. Esseen, C. G. (1945) “Fourier analysis of distribution functions. A mathematical study of the Laplace–Gaussian law,” Acta Mathematica, 77: 1–125.
Fang, K.-T., S. Kotz, and K.-W. Ng (1990) Symmetric Multivariate and Related Distributions. Chapman & Hall, London. Ferguson, T. S. (1967) Mathematical Statistics: A Decision Theoretic Approach, Academic Press, London. Fine, T. L. (1973) Theories of Probability: An Examination of Foundations, Academic Press, London. Fisher, R. A. (1912) “On the absolute criterion for fitting frequency curves,” Messenger of Mathematics, 41: 155–160. Fisher, R. A. (1915) “Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population,” Biometrika, 10: 507–521. Fisher, R. A. (1921) “On the probable error of a coefficient of correlation deduced from a small sample,” Metron, 1: 3–32. Fisher, R. A. (1922a) “On the mathematical foundations of theoretical statistics,” Philosophical Transactions of the Royal Society, Series A, 222: 309–368. Fisher, R. A. (1922b) “On the interpretation of χ 2 from contingency tables, and the calculation of p,” Journal of the Royal Statistical Society, 85: 87–94. Fisher, R. A. (1924a) “The conditions under which χ 2 measures the discrepancy between observation and hypothesis,” Journal of the Royal Statistical Society. 87: 442–450. Fisher, R. A. (1924b) “On a distribution yielding the error functions of several wellknown statistics,” Proceedings of the International Mathematical Congress (Toronto), 2: 805–813. Fisher, R. A. (1925a) Statistical Methods for Research Workers. Oliver & Boyd, Edinburgh.
Fisher, R. A. (1925b) “Theory of statistical estimation,” Proceedings of the Cambridge Philosophical Society, 22: 700–725. Fisher, R. A. (1929) “Moments and product moments of sampling distributions,” Proceedings of the London Mathematical Society, 30 (2): 199–205. Fisher, R. A. (1930) “The moments of the distribution for normal samples of measures of departure from normality,” Proceedings of the Royal Statistical Society, Series A, 130: 16–28. [Reprinted in Fisher (ed.) (1950), pp. 198–238.] Fisher, R. A. (1934) “Two new properties of the mathematical likelihood,” Proceedings of the Royal Statistical Society, Series A, 144: 285–307. Fisher, R. A. (1935a) The Design of Experiments. Oliver & Boyd, Edinburgh. Fisher, R. A. (1935b) “The logic of inductive inference,” Journal of the Royal Statistical Society, 98: 39–54. Fisher, R. A. (1937) “Professor Karl Pearson and the method of moments,” Annals of Eugenics, 7: 303–318. [Reprinted in Fisher (ed.) (1950), pp. 302–318.] Fisher, R. A. (ed.) (1950) Contributions to Mathematical Statistics. Wiley, New York. Fisher, R. (1955) “Statistical methods and scientific induction,” Journal of the Royal Statistical Society, Series B, 17: 69–78. Fisher, R. A. (1956) Statistical Methods and Scientific Inference. Oliver & Boyd, Edinburgh. Forbes, C., M. Evans, N. Hastings, and B. Peacock (2010) Statistical Distributions, 4th edn. Wiley, New York. Frisch, R. and F. V. Waugh (1933) “Partial time regressions as compared with individual trends,” Econometrica, 1: 387–401. Galambos, J. (1995) Advanced Probability Theory, 2nd edn. Marcel Dekker, New York. Galavotti, M. C. (2005) Philosophical Introduction to Probability. CSLI Publications, Stanford, CA. Galton, F. (1885) “Regression towards mediocrity in hereditary stature,” Journal of the Anthropological Institute, 14: 246–263. Galton, F. (1886) “Family likeness in stature,” Proceedings of the Royal Society, 40: 42–72 (with an appendix by J. D. H. Dickson, pp. 63–66). Galton, F. 
(1889) “Co-relations and their measurement, chiefly from anthropometric data,” Proceedings of the Royal Society of London, 35: 135–145. Gan, F. F., K. J. Koehler, and J. C. Thomson (1991) “Probability plots and distribution curves for assessing the fit of probability models,” The American Statistician, 45: 14–21. Gauss, C. F. (1809) Theoria motus corporum celestium. Perthes et Besser (1827), Hamburg. Gauss, C F. (1821) “Theory of the combination of observations which leads to the smallest errors,” Gauss Werke, 4: 1–93. Geyer, C. J. (2013) “Asymptotics of maximum likelihood without the LLN or CLT or sample size going to infinity,” pp. 1–24 in G. Jones and X. Shen (eds), Advances in Modern Statistical Theory and Applications: A Festschrift in Honor of Morris L. Eaton. Institute of Mathematical Statistics, Bethesda, MD. Gigerenzer, G. (1993) “The superego, the ego, and the id in statistical reasoning,” A Handbook for Data Analysis in the Behavioral Sciences: Methodological Issues. Lawrence Erlbaum, Mahwah, NJ, pp. 311–339. Gillies, D. (2000) Philosophical Theories of Probability. Routledge, London. Godfrey, L. G. (1988) Misspecification Tests in Econometrics. Cambridge University Press, Cambridge. Golub, G. H. and C. F. Van Loan (2013) Matrix Computations, 4th edn. Johns Hopkins University Press, Baltimore, MD. Good, I. J. (1971) “46656 varieties of Bayesians,” American Statistician, 25: 62–74. Good, I. J. (1988) “The interface between statistics and philosophy of science,” Statistical Science, 3: 386–397.
Goodman, L. and W. H. Kruskal (1954) “Measures of association for cross classifications,” Journal of the American Statistical Association, 49: 732–764. Gorroochurn, P. (2012) Classic Problems of Probability. Wiley, New York. Gorroochurn, P. (2016) Classic Topics on the History of Modern Mathematical Statistics. Wiley, New York. Gossett (“Student”) (1908) “The probable error of the mean,” Biometrika, 6: 1–25. Gourieroux, C. and A. Monfort (1995) Statistical Analysis and Econometric Models (2 volumes). Cambridge University Press, Cambridge. Granger, C. W. J. (ed.) (1990) Modelling Economic Series. Clarendon Press, Oxford. Graunt, J. (1662) Natural and Political Observations Mentioned in a Following Index and Made Upon the Bills of Mortality. Printed by Arno Press (1975), New York. Gray, J. (2015) The Real and the Complex: A History of Analysis in the 19th Century. Springer, New York. Graybill, F. A. (1976) Theory and Application of the Linear Model, Duxbury Press, Duxbury, MA. Greene, W. H. (2012) Econometric Analysis, 7th edn. Prentice Hall, Englewood Cliffs, NJ. Gumbel, E. J. (1960) “Bivariate exponential distributions,” Journal of the American Statistical Association, 55: 698–707. Hacking, I. (1965) Logic of Statistical Inference. Cambridge University Press, Cambridge. Hacking, I. (2006) The Emergence of Probability, 2nd edn. Cambridge University Press, Cambridge. Hald, A. (1990) A History of Probability and Statistics and their Applications before 1750. Wiley, New York. Hald, A. (1998) A History of Mathematical Statistics from 1750 to 1930. Wiley, New York. Hald, A. (2007) A History of Parametric Statistical Inference from Bernoulli to Fisher, 1713– 1935. Springer, New York. Hall, P. and C. C. Heyde (1980) Martingale Limit Theory and its Application. Academic Press, New York. Hansen, B. E. (1999) “Discussion of ‘Data mining reconsidered’,” The Econometrics Journal, 2: 192–201. Hardy, I. C.W (ed.) (2002) Sex Ratios: Concepts and Research Methods. 
Cambridge University Press, Cambridge. Harter, H. L. (1974–6) “The methods of least-squares and some alternatives, parts I–VI,” International Statistical Review, 42–44. Hartman, P. and A. Wintner (1941) “On the law of the iterated logarithm,” American Journal of Mathematics, 63: 169–176. Hausdorff, F. (1914) Grundzüge der Mengenlehre. De Gruyter, Leipzig. Hendry, D. F. (1995) Dynamic Econometrics. Oxford University Press, Oxford. Hendry, D. F. (2000) Econometrics: Alchemy or Science? Blackwell, Oxford. Hendry, D. F. (2009) “The methodology and philosophy of applied econometrics,” pp. 3–67 in T. C. Mills and K. Patterson (eds) (2009). Hendry, D. F. and M. S. Morgan (eds) (1995) The Foundations of Econometric Analysis. Cambridge University Press, Cambridge. Heron, D. (1911) “The danger of certain formulae suggested as substitutes for the correlation coefficient,” Biometrika, 8: 109–122. Herrndorf, N. (1984) “A functional limit theorem for weakly dependent sequences of random variables,” Annals of Probability, 12: 141–153. Hettmansperger, T. P. (1984) Statistical Inference Based on Ranks. Wiley, New York. Heyde, C. C. (1963) “On a property of the lognormal distribution,” Journal of the Royal Statistical Society, Series B, 25: 392–393.
Heyde, C. C. and E. Seneta (1977) I. J. Bienaymé: Statistical Theory Anticipated. Springer, New York. Hinkelmann, K. and O. Kempthorne (1994) Design and Analysis of Experiments, vol. 1: Introduction to Experimental Design. Wiley, New York. Hoffmann-Jørgensen, J. (1994) Probability with a View Toward Statistics, vol. 1. Chapman & Hall, London. Howson, C. and P. Urbach (2006) Scientific Reasoning: The Bayesian Approach. Open Court Publishing, Chicago, IL. Humphreys, P. (1985) “Why propensities cannot be probabilities,” The Philosophical Review, 94: 557–570. Ioannidis, J. P. A. (2005) “Why most published research findings are false,” PLoS Medicine, 2: e124. Jeffreys, H. (1939) Theory of Probability. Oxford University Press, Oxford (3rd edn, 1961). Johnson, M. E. (1987) Multivariate Statistical Simulation. Wiley, New York. Johnson, N. L., S. Kotz, and N. Balakrishnan (1994) Continuous Univariate Distributions, vol. 1, 2nd edn. Wiley, New York. Johnson, N. L., S. Kotz, and N. Balakrishnan (1995) Continuous Univariate Distributions, vol. 2, 2nd edn. Wiley, New York. Johnson, N. L., S. Kotz, and A. W. Kemp (1992) Univariate Discrete Distributions, 2nd edn. Wiley, New York. Kadane, J. B. (2011) Principles of Uncertainty. Chapman & Hall/CRC, London. Kahneman, D. (2013) Thinking, Fast and Slow. Farrar, Straus and Giroux, New York. Karlin, S. and H. E. Rubin (1956) “Distributions possessing a monotone likelihood ratio,” Journal of the American Statistical Association, 51: 637–643. Karr, A. F. (1993) Probability. Springer, New York. Kass, R. E. and A. E. Raftery (1995) “Bayes factors,” Journal of the American Statistical Association, 90: 773–795. Kelker, D. (1970) “Distribution theory of spherical distributions and a location-scale parameter generalization,” Sankhya, Series A, 32: 419–430. Kendall, M. G. (1946) The Advanced Theory of Statistics. Charles Griffin, London. Kendall, M. G. and R. L. Plackett (eds) (1977) Studies in the History of Statistics and Probability, vol.
2. Charles Griffin, London. Kennedy, P. (2008) A Guide to Econometrics, 6th edn. Blackwell, Oxford. Keynes, J. M. (1921) A Treatise on Probability. Macmillan, London. Khazanie, R. (1976) Basic Probability Theory and Applications. Goodyear Publishing, Los Angeles, CA. Khinchin, A. Y. (1924) “Über einen Satz der Wahrscheinlichkeitsrechnung,” Mathematicheske Sbornik, 2: 79–119. Kline, M. (1972) Mathematical Thought from Ancient to Modern Times. Oxford University Press, New York. Knopp, K. (1947) Theory and Application of Infinite Series. Dover edition (1990), New York. Kolmogorov, A. N. (1927) “On the law of large numbers.” [Reprinted in Shiryayev (ed.) (1992), pp. 11–12.] Kolmogorov, A. N. (1928a) “On a limit formula of A. Khintchine.” [Reprinted in Shiryayev (ed.) (1992), pp. 13–14.] Kolmogorov, A. N. (1928b) “On sums of independent random variables.” [Reprinted in Shiryayev (ed.) (1992), pp. 15–31.] Kolmogorov, A. N. (1929) “On the law of the iterated logarithm.” [Reprinted in Shiryayev (ed.) (1992), pp. 32–42.]
Kolmogorov, A. N. (1930) “On the strong law of large numbers.” [Reprinted in Shiryayev (ed.) (1992), pp. 60–61.] Kolmogorov, A. N. (1931) “On analytical methods in probability theory.” [Reprinted in Shiryayev (ed.) (1992), pp. 62–108.] Kolmogorov, A. N. (1933a) Foundations of the Theory of Probability, 2nd English edn. Chelsea Publishing, New York. Kolmogorov, A. N. (1933b) “On the empirical determination of a distribution law.” [Reprinted in Shiryayev (ed.) (1992), pp. 139–146.] Kolmogorov, A. N. (1941a) “Stationary sequences in Hilbert space.” [English translation reprinted in Shiryayev (ed.) (1992), pp. 228–271.] Kolmogorov, A. N. (1941b) “Interpolation and extrapolation of stationary random sequences.” [English translation reprinted in Shiryayev (ed.) (1992), pp. 272–284.] Kolmogorov, A. N. (1963) “On tables of random numbers,” Sankhya, Series A, 25: 369–376. Kolmogorov, A. N. and S. V. Fomin (1970) Introductory Real Analysis. Dover, New York. Koop, G., D. J. Poirier, and J. L. Tobias (2007) Bayesian Econometric Methods. Cambridge University Press, Cambridge. Koutris, A., M. S. Heracleous, and A. Spanos (2008) “Testing for nonstationarity using maximum entropy resampling: A misspecification testing perspective,” Econometric Reviews, 27: 363–384. Kullback, S. (1959) Information Theory and Statistics. Wiley, New York. Kyburg, H. E. and H. E. Smokler (eds) (1964) Studies in Subjective Probability. Wiley, New York. Lahiri, S. N. (2003) Resampling Methods for Dependent Data. Springer, New York. Lai, T. L. and H. Xing (2008) Statistical Models and Methods for Financial Markets. Springer, New York. Laplace, P. S. (1812) Théorie analytique des probabilités. Courcier, Paris [3rd edn (1820) with supplements]. Laplace, P. S. (1814) Essai philosophique sur les probabilités. Courcier, Paris [English translation A Philosophical Essay on Probabilities, reprinted (1951) by Dover, New York]. Lauritzen, S. L. (1996) Graphical Models. Clarendon Press, Oxford. Lauritzen, S. L.
and N. Wermuth (1989) “Graphical models for associations between variables, some of which are qualitative and some quantitative,” Annals of Statistics, 17: 31–54. Le Cam, L. (1986a) Asymptotic Methods in Statistical Decision Theory. Springer, New York. Le Cam, L. (1986b) “The central limit theorem around 1935,” Statistical Science, 1: 78–96. Lehmann, E. L. (1975) Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, San Francisco, CA. Lehmann, E. L. (1984) “Specification problems in the Neyman–Pearson–Wald theory,” in H. A. David and H. T. David (eds), Statistics: An Appraisal. Iowa State University Press, Ames, IA. Lehmann, E. L. (2011) Fisher, Neyman, and the Creation of Classical Statistics. Springer, New York. Lehmann, E. L. and J. P. Romano (2006) Testing Statistical Hypotheses. Springer, New York. Lehmann, E. L. and H. Scheffé (1950) “Completeness, similar regions and unbiased estimation I,” Sankhya, 10: 305–340. Lehmann, E. L. and H. Scheffé (1955) “Completeness, similar regions and unbiased estimation II,” Sankhya, 15: 219–236. Levene, H. (1952) “On the power function of tests of randomness based on runs up and down,” Annals of Mathematical Statistics, 23: 34–56. Lévy, P. (1937) Théorie de l’addition des variables aléatoires. Gauthier-Villars, Paris. Lewis, P. A. W. and E. J. Orav (1989) Simulation Methodology for Statisticians, Operations Analysts, and Engineers, vol. 1. Wadsworth, Belmont, CA. Li, M. and P. M. B. Vitanyi (2008) An Introduction to Kolmogorov Complexity and its Applications, 3rd edn. Springer, New York.
Lindeberg, J. W. (1922) “Eine neue Herleitung des Exponentialgesetzes in der Wahrscheinlichkeitsrechnung,” Mathematische Zeitschrift, 15: 211–225. Lindley, D. V. (1965) Introduction to Probability and Statistics from a Bayesian Viewpoint, Part 1: Probability. Cambridge University Press, Cambridge. Liptser, R. and A. N. Shiryayev (2012) Theory of Martingales. Springer, New York. Loève, M. (1963) Probability Theory, 3rd edn. Van Nostrand, New York. Luenberger, D. G. (1969) Optimization by Vector Space Methods. Wiley, New York. MacKenzie, D. and T. Spears (2014) “The formula that killed Wall Street: The Gaussian copula and modelling practices in investment banking,” Social Studies of Science, 44: 393–417. Maistrov, L. E. (1974) Probability Theory: A Historical Sketch. Academic Press, London. Mann, H. B. and A. Wald (1943) “On stochastic limit and order relationships,” Annals of Mathematical Statistics, 14: 390–402. Marquardt, D. V. (1970) “Generalized inverses, ridge regression, biased linear estimation, and nonlinear estimation,” Technometrics, 12: 591–612. Martin-Löf, P. (1969) “The literature on von Mises’ Kollektivs revisited,” Theoria, 35: 12–37. Mayo, D. G. (1996) Error and the Growth of Experimental Knowledge. University of Chicago Press, Chicago, IL. Mayo, D. G. (2006) “Philosophy of statistics,” pp. 802–815 in S. Sarkar and J. Pfeifer (eds), The Philosophy of Science: An Encyclopedia. Routledge, London. Mayo, D. G. (2018) Statistical Inference as Severe Testing: How to Get Beyond the Statistical Wars. Cambridge University Press, Cambridge. Mayo, D. G. and A. Spanos (2004) “Methodology in practice: Statistical misspecification testing,” Philosophy of Science, 71: 1007–1025. Mayo, D. G. and A. Spanos (2006) “Severe testing as a basic concept in a Neyman–Pearson philosophy of induction,” British Journal for the Philosophy of Science, 57: 323–357. Mayo, D. G. and A.
Spanos (eds) (2010) Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science. Cambridge University Press, Cambridge. Mayo, D. G. and A. Spanos (2011) “Error statistics,” pp. 151–196 in D. Gabbay, P. Thagard, and J. Woods (eds), The Handbook of Philosophy of Science, vol. 7: Philosophy of Statistics. Elsevier, Amsterdam. McCullagh, P. (1987) Tensor Methods in Statistics. Chapman & Hall, London. McCullagh, P. (2002) “What is a statistical model?,” Annals of Statistics, 30: 1225–1267. McCullagh, P. and J. A. Nelder (1989) Generalized Linear Models, 2nd edn. Chapman & Hall, London. McFadden, D. (1978) “Modeling the choice of residential location,” pp. 75–96 in A. Karlqvist, L. Lundqvist, F. Snickars, and J. Weibull (eds), Spatial Interaction Theory and Planning Models. North-Holland, Amsterdam. McGuirk, A. and A. Spanos (2009) “Revisiting error autocorrelation correction: Common factor restrictions and Granger non-causality,” Oxford Bulletin of Economics and Statistics, 71: 273–294. McLeish, D. L. (1975) “A maximal inequality and dependent strong laws,” Annals of Probability, 3: 829–839. Mendenhall, W. and T. Sincich (1993) A Second Course in Business Statistics: Regression Analysis. Dellen, San Francisco, CA. Mikosch, T. (2005) Copulas: Tales and Facts. Laboratory of Actuarial Mathematics, University of Copenhagen. Mikosch, T. (2006) “Copulas: Tales and facts – rejoinder,” Extremes, 9: 55–62. Mills, F. C. (1924) Statistical Methods. Henry Holt & Co., New York. Mills, T. C. and K. Patterson (eds) (2006/2009) New Palgrave Handbook of Econometrics, vol. 1/vol. 2. Macmillan, London.
Moran, P. A. P. (1968) An Introduction to Probability Theory. Oxford University Press, Oxford. Morgan, M. S. (1990) The History of Econometric Ideas. Cambridge University Press, Cambridge. Morgenstern, O. (1963) On the Accuracy of Economic Observations, 2nd edn. Princeton University Press, Princeton, NJ. Mukhopadhyay, N. (2000) Probability and Statistical Inference. Marcel Dekker, New York. Murphy, K. P. (2012) Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge, MA. Nelsen, R. B. (2006) An Introduction to Copulas, 2nd edn. Springer, New York. Newey, W. K. and K. D. West (1987) “A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix,” Econometrica, 55: 703–707. Neyman, J. (1934) “On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection,” Journal of the Royal Statistical Society, 97(4): 558–625. Neyman, J. (1937) “Outline of a theory of statistical estimation based on the classical theory of probability,” Philosophical Transactions of the Royal Society of London, Series A, 236: 333–380. Neyman, J. (1952) Lectures and Conferences on Mathematical Statistics and Probability, 2nd edn. US Department of Agriculture, Washington, D.C. Neyman, J. (1956) “Note on an article by Sir Ronald Fisher,” Journal of the Royal Statistical Society, Series B, 18: 288–294. Neyman, J. (1977) “Frequentist probability and frequentist statistics,” Synthese, 36: 97–131. Neyman, J. and E. S. Pearson (1928) “On the use and interpretation of certain test criteria for purposes of statistical inference, part I,” Biometrika, 20: 175–240. Neyman, J. and E. S. Pearson (1933) “On the problem of the most efficient tests of statistical hypotheses,” Philosophical Transactions of the Royal Society, Series A, 231: 289–337. Neyman, J. and E. L. Scott (1948) “Consistent estimates based on partially consistent observations,” Econometrica, 16: 1–32. Norton, R. M.
(1984) “The double exponential distribution: Using calculus to find a maximum likelihood estimator,” The American Statistician, 38: 135–136. O’Hagan, A. (1994) Bayesian Inference. Edward Arnold, London. Ord, J. K. (1972) Families of Frequency Distributions. Charles Griffin, London. Pagano, M. and K. Gauvreau (2018) Principles of Biostatistics. CRC Press, London. Parzen, E. (1962) Stochastic Processes. Holden-Day, San Francisco, CA. Pawitan, Y. (2001) In All Likelihood: Statistical Modelling and Inference Using Likelihood. Clarendon Press, Oxford. Pearl, J. (2009) Causality. Cambridge University Press, Cambridge. Pearson, E. S. (1955) “Statistical concepts in their relation to reality,” Journal of the Royal Statistical Society, Series B, 17: 204–207. Pearson, E. S. (1962) “Some thoughts on statistical inference,” Annals of Mathematical Statistics, 33: 394–403. Pearson, E. S. (1966) “The Neyman–Pearson story: 1926–34,” pp. 1–24 in F. N. David (ed.), Research Papers in Statistics: Festschrift for J. Neyman. Wiley, New York. Pearson, E. S. (1968) “Studies in the history of probability and statistics. XX: Some early correspondence between W. S. Gosset, R. A. Fisher and Karl Pearson, with notes and comments,” Biometrika, 55: 445–457. Pearson, E. S., W. S. Gosset, R. L. Plackett, and G. A. Barnard (1990) Student: A Statistical Biography of William Sealy Gosset. Oxford University Press, Oxford. Pearson, E. S. and M. G. Kendall (eds) (1970) Studies in the History of Statistics and Probability. Charles Griffin, London.
Pearson, K. (1892) The Grammar of Science. Scott, London. Pearson, K. (1894) “Contributions to the mathematical theory of evolution I. On the dissection of asymmetrical frequency curves,” Philosophical Transactions of the Royal Society of London, Series A, 185: 71–110. Pearson, K. (1895) “Contributions to the mathematical theory of evolution II. Skew variation in homogeneous material,” Philosophical Transactions of the Royal Society of London, Series A, 186: 343–414. Pearson, K. (1896) “Contributions to the mathematical theory of evolution III. Regression, heredity and panmixia,” Philosophical Transactions of the Royal Society of London, Series A, 187: 253–318. Pearson, K. (1900) “On a criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen in random sampling,” Philosophical Magazine, 50(5): 157–175. Pearson, K. (1905) “Contributions to the mathematical theory of evolution XIV. On the general theory of skew correlation and non-linear regression,” Drapers’ Company Research Memoirs, Biometric Series, II. Pearson, K. (1906) “Skew frequency curves, a rejoinder to Professor Kapteyn,” Biometrika, 5: 168–171. Pearson, K. (1910) “On a new method of determining correlation when one variable is given by alternative and the other by multiple categories,” Biometrika, 7: 248–257. Pearson, K. (1913a) “On the measurement of the influence of ‘broad categories’ on correlation,” Biometrika, 9: 116–139. Pearson, K. (1913b) “Note on the surface of constant association,” Biometrika, 9: 534–537. Pearson, K. (1920) “Notes on the history of correlation,” Biometrika, 13: 25–45. Pearson, K. and D. Heron (1913) “On theories of association,” Biometrika, 9: 159–315. Petrov, V. V. (1995) Limit Theorems of Probability Theory: Sequences of Independent Random Variables. Clarendon Press, Oxford. Petty, W. (1690) Political Arithmetick. Edward Arber, Birmingham. Pfeiffer, P. E. 
(1978) Concepts of Probability Theory, 2nd edn. Dover, New York. Phillips, P. C. B. (1987) “Time series regression with a unit root,” Econometrica, 55: 277–301. Pitman, E. J. G. (1937) “The ‘closest’ estimates of statistical parameters,” Proceedings of the Cambridge Philosophical Society, 33: 212–222. Plackett, R. L. (1949) “A historical note on the method of least squares,” Biometrika, 36: 458–460. Plackett, R. L. (1965) “A class of bivariate distributions,” Journal of the American Statistical Association, 60: 512–522. Playfair, W. (1786) The Commercial and Political Atlas. Corry, London. Playfair, W. (1801) Statistical Breviary. Wallis, London. Poisson, S. D. (1837) Recherches sur la probabilité des jugements en matière criminelle et en matière civile, précédées des règles générales du calcul des probabilités. Bachelier, Paris. Politis, D. N., J. P. Romano, and M. Wolf (1999) Subsampling. Springer, New York. Porter, T. M. (1986) The Rise of Statistical Thinking 1820–1900. Princeton University Press, Princeton, NJ. Press, J. S. (2003) Subjective and Objective Bayesian Statistics, 2nd edn. Wiley, New York. Priestley, M. B. (1983) Spectral Analysis and Time Series. Academic Press, New York. Prokhorov, Y. V. (1956) “Convergence of random processes and limit theorems in probability theory,” Theory of Probability and Applications, 1: 157–214. Quetelet, L. A. J. (1849) Letters on the Theory of Probability. Layton, London. Ramsey, F. P. (1926) “Truth and probability.” [Reprinted in H. E. Kyburg and H. E. Smokler (eds) (1964).] Rao, C. R. (1945) “Information and accuracy attainable in estimation of statistical parameters,” Bulletin of Calcutta Mathematical Society, 37: 81–91.
Rao, C. R. (1949) “Sufficient statistics and minimum variance estimates,” Proceedings of the Cambridge Philosophical Society, 45: 218–231. Rao, C. R. (1973) Linear Statistical Inference and its Applications, 2nd edn. Wiley, New York. Rao, C. R. (2004) “Statistics: Reflections on the past and visions for the future,” Amstat News, 327: 2–3. Rao, C. R., H. Toutenburg, H. C. Shalabh, and M. Schomaker (2008) Linear Models and Generalizations: Least Squares and Alternatives, 3rd edn. Springer, New York. Rényi, A. (1970) Probability Theory. North-Holland, Amsterdam. Riehl, E. (2016) Category Theory in Context. Dover, New York. Robert, C. P. (2007) The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation, 2nd edn. Springer, New York. Romano, J. P. and A. F. Siegel (1986) Counterexamples in Probability and Statistics. Wadsworth & Brooks, Monterey, CA. Rosenblatt, M. (1956) “Remarks on some nonparametric estimates of a density function,” Annals of Mathematical Statistics, 27: 832–835. Royall, R. (1997) Statistical Evidence: A Likelihood Paradigm. Chapman & Hall, London. Ruppert, D., M. P. Wand, and R. J. Carroll (2003) Semiparametric Regression. Cambridge University Press, Cambridge. Salmon, W. C. (1967) The Foundations of Scientific Inference. University of Pittsburgh Press, Pittsburgh, PA. Sargan, J. D. (1964) “Wages and prices in the U. K.: A study in econometric methodology,” pp. 25–54 in P. Hart, G. Mills, and J. K. Whitaker (eds), Econometric Analysis for National Economic Planning, vol. 16 of Colston Papers. Butterworths, London. Sargan, J. D. (1974) “The validity of Nagar’s expansion for the moments of econometric estimators,” Econometrica, 42: 169–176. Savage, L. J. (1954) The Foundations of Statistics. Wiley, New York. Scheffé, H. (1944) “A note on the Behrens–Fisher problem,” The Annals of Mathematical Statistics, 15: 430–432. Schervish, M. J. (1995) Theory of Statistics. Springer, New York. Schwartz, D. G.
(2006) Roll the Bones: The History of Gambling. Gotham, New York. Seal, H. L. (1967) “The historical development of the Gauss linear model,” Biometrika, 54: 1–24. Seber, G. A. F. and A. J. Lee (2003) Linear Regression Analysis, 2nd edn. Wiley, New York. Serfling, R. J. (1980) Approximation Theorems of Mathematical Statistics. Wiley, New York. Severini, T. A. (2000) Likelihood Methods in Statistics. Oxford University Press, Oxford. Shannon, C. E. (1948) “A mathematical theory of communication,” Bell System Technical Journal, 27(3): 379–423. Sharpe, W. F. (1964) “Capital asset prices: A theory of market equilibrium under conditions of risk,” The Journal of Finance, 19: 425–442. Shevtsova, I. G. (2014) “On the absolute constants in the Berry–Esseen-type inequalities,” Doklady Mathematics, 89: 378–381. Shiryayev, A. N. (ed.) (1992) Selected Works of A. N. Kolmogorov, vol. II: Probability Theory and Mathematical Statistics. Kluwer, Dordrecht. Shiryayev, A. N. (1996) Probability. Springer, New York. Silverman, B. W. (1986) Density Estimation for Statistics and Data Analysis. Chapman & Hall, London. Silvey, S. D. (1975) Statistical Inference. Chapman & Hall, London. Simon, B. (1998) “The classical moment problem as a self-adjoint finite difference operator,” Advances in Mathematics, 137: 82–203. Sklar, M. (1959) “Fonctions de répartition à n dimensions et leurs marges,” Publications de l’Institut de Statistique de l’Université de Paris, 8: 229–231.
Skyrms, B. (1999) Choice and Chance: An Introduction to Inductive Logic, 4th edn. Wadsworth, Belmont, CA. Sólnes, J. (1997) Stochastic Processes and Random Vibrations: Theory and Practice. Wiley, New York. Spanos, A. (1986) Statistical Foundations of Econometric Modelling. Cambridge University Press, Cambridge. Spanos, A. (1989a) “On re-reading Haavelmo: A retrospective view of econometric modeling,” Econometric Theory, 5: 405–429. Spanos, A. (1989b) “Early empirical findings on the consumption function, stylized facts or fiction: A retrospective view,” Oxford Economic Papers, 41: 150–169. Spanos, A. (1990a) “The simultaneous equations model revisited: Statistical adequacy and identification,” Journal of Econometrics, 44: 87–108. Spanos, A. (1990b) “Unit roots and their dependence on the implicit conditioning information set,” Advances in Econometrics, 8: 271–292. Spanos, A. (1994) “On modeling heteroskedasticity: The Student’s t and elliptical regression models,” Econometric Theory, 10: 286–315. Spanos, A. (1995a) “On theory testing in econometrics: Modeling with nonexperimental data,” Journal of Econometrics, 67: 189–226. Spanos, A. (1995b) “On normality and the linear regression model,” Econometric Reviews, 14: 195–203. Spanos, A. (2000) “Revisiting data mining: ‘Hunting’ with or without a license,” Journal of Economic Methodology, 7: 231–264. Spanos, A. (2002) “Parametric versus non-parametric inference: Statistical models and simplicity,” pp. 181–206 in A. Zellner, H. A. Keuzenkamp, and M. McAleer (eds), Simplicity, Inference and Modelling: Keeping it Sophisticatedly Simple. Cambridge University Press, Cambridge. Spanos, A. (2006a) “Econometrics in retrospect and prospect,” pp. 3–58 in T. C. Mills and K. Patterson (2006). Spanos, A. (2006b) “Where do statistical models come from? Revisiting the problem of specification,” pp. 98–119 in J. Rojo (ed.), Optimality: The Second Erich L. Lehmann Symposium. Lecture Notes Monograph Series, vol. 49. 
Institute of Mathematical Statistics, Hayward, CA. Spanos, A. (2006c) “Revisiting the omitted variables argument: Substantive vs. statistical adequacy,” Journal of Economic Methodology, 13: 179–218. Spanos, A. (2007) “Curve-fitting, the reliability of inductive inference and the error-statistical approach,” Philosophy of Science, 74: 1046–1066. Spanos, A. (2008) “Statistics and economics,” pp. 1129–1162 in S. N. Durlauf and L. E. Blume (eds), The New Palgrave Dictionary of Economics, 2nd edn. Palgrave Macmillan, London. Spanos, A. (2009) “Statistical misspecification and the reliability of inference: The simple t-test in the presence of Markov dependence,” Korean Economic Review, 25: 165–213. Spanos, A. (2010a) “Statistical adequacy and the trustworthiness of empirical evidence: Statistical vs. substantive information,” Economic Modelling, 27: 1436–1452. Spanos, A. (2010b) “Akaike-type criteria and the reliability of inference: Model selection vs. statistical model specification,” Journal of Econometrics, 158: 204–220. Spanos, A. (2010c) “Theory testing in economics and the error statistical perspective,” pp. 202–246 in D. G. Mayo and A. Spanos (eds) (2010). Spanos, A. (2011a) “Revisiting the Welch uniform model: A case for conditional inference?,” Advances and Applications in Statistical Science, 5: 33–52. Spanos, A. (2011b) “Misplaced criticisms of Neyman–Pearson (N-P) testing in the case of two simple hypotheses,” Advances and Applications in Statistical Science, 6: 229–242. Spanos, A. (2012a) “Philosophy of econometrics,” pp. 329–393 in U. Mäki (ed.), Philosophy of Economics, Handbook of Philosophy of Science Series. Elsevier, Amsterdam.
Spanos, A. (2012b) “Revisiting the Berger location model: Fallacious confidence interval or a rigged example?,” Statistical Methodology, 9: 555–561. Spanos, A. (2013a) “A frequentist interpretation of probability for model-based inductive inference,” Synthese, 190: 1555–1585. Spanos, A. (2013b) “Revisiting the likelihoodist evidential account,” Journal of Statistical Theory and Practice, 7: 187–195. Spanos, A. (2013c) “Who should be afraid of the Jeffreys–Lindley paradox?,” Philosophy of Science, 80: 73–93. Spanos, A. (2013d) “The ‘mixed experiment’ example revisited: Fallacious frequentist inference or an improper statistical model?,” Advances and Applications in Statistical Sciences, 8: 29–47. Spanos, A. (2014a) “Recurring controversies about p values and confidence intervals revisited,” Ecology, 95: 645–651. Spanos, A. (2014b) “Reflections on the LSE tradition in econometrics: A student’s perspective,” Œconomia – History of Econometrics, pp. 343–380. Spanos, A. (2015) “Revisiting Haavelmo’s structural econometrics: Bridging the gap between theory and data,” Journal of Economic Methodology, 22: 171–196. Spanos, A. (2017a) “Frequentist probability,” Wiley StatsRef: Statistics Reference Online. Spanos, A. (2017b) “Why the decision-theoretic perspective misrepresents frequentist inference,” in Advances in Statistical Methodologies and Their Applications to Real Problems, pp. 3–28, http://dx.doi.org/10.5772/65720. Spanos, A. (2018) “Mis-specification testing in retrospect,” Journal of Economic Surveys, 32(2): 541–577. Spanos, A. (2019) “Near-collinearity in linear regression revisited: The numerical vs. the statistical perspective,” in Communications in Statistics – Theory and Methods, forthcoming. Spanos, A. and D. G. Mayo (2015) “Error statistical modeling and inference: Where methodology meets ontology,” Synthese, 192: 3533–3555. Spanos, A. and A. 
McGuirk (2001) “The model specification problem from a probabilistic reduction perspective,” American Journal of Agricultural Economics, 83: 1168–1176. Spanos, A. and A. McGuirk (2002) “The problem of near-multicollinearity revisited: Erratic vs. systematic volatility,” Journal of Econometrics, 108: 365–393. Spanos, A. and J. J. Reade (2015) “Heteroskedasticity/autocorrelation consistent standard errors and the reliability of inference,” Virginia Tech Working Paper. Spirtes, P., C. N. Glymour, and R. Scheines (2000) Causation, Prediction, and Search. MIT Press, Cambridge, MA. Srinivasan, S. K. and K. M. Mehata (1988) Stochastic Processes, 2nd edn. McGraw-Hill, New York. Srivastava, M. K., A. H. Khan, and N. Srivastava (2014) Statistical Inference: Theory of Estimation. PHI Learning, Delhi. Stewart, G. W. (1973) Introduction to Matrix Computations. Academic Press, New York. Stigler, S. M. (1980) “Stigler’s law of eponymy,” Transactions of the New York Academy of Sciences, 39: 147–157. Stigler, S. M. (1986) The History of Statistics: The Measurement of Uncertainty Before 1900. Harvard University Press, Cambridge, MA. Stigler, S. M. (2005) “Fisher in 1921,” Statistical Science, 20: 32–49. Stoyanov, J. M. (1987) Counterexamples in Probability. Wiley, New York. Stuart, A. and M. G. Kendall (eds) (1971) Statistical Papers of George Udny Yule. Hafner, New York. Stuart, A., J. Ord, and S. Arnold (1999) Classical Inference and the Linear Model, 6th edn. Arnold, London.
Theil, H. (1950) “A rank-invariant method of linear and polynomial regression analysis,” Nederl. Akad. Wetensch. Proc. Ser. A, 53: 386–392. Theil, H. (1971) Principles of Econometrics. Wiley, New York. Thompson, J. R. and R. A. Tapia (1990) Nonparametric Function Estimation, Modeling, and Simulation. Society for Industrial and Applied Mathematics, Philadelphia, PA. Tibshirani, R. (1996) “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society, Series B, 58: 267–288. Todhunter, I. (1865) A History of the Mathematical Theory of Probability. Macmillan, London. Tukey, J. W. (1960) “Conclusions vs decisions,” Technometrics, 2: 423–433. Tukey, J. W. (1962) “The future of data analysis,” Annals of Mathematical Statistics, 33: 1–67. Tukey, J. W. (1977) Exploratory Data Analysis. Addison-Wesley, Boston, MA. Von Mises, R. (1928/1957) Probability, Statistics and Truth [original German edition (1928) Wahrscheinlichkeit, Statistik und Wahrheit], 2nd edn. Dover, New York (1981). Von Neumann, J. and O. Morgenstern (1947) Theory of Games and Economic Behavior. Princeton University Press, Princeton, NJ. Von Plato, J. (1994) Creating Modern Probability: Its Mathematics, Physics and Philosophy in Historical Perspective. Cambridge University Press, Cambridge. Wald, A. (1939) “Contributions to the theory of statistical estimation and testing hypotheses,” Annals of Mathematical Statistics, 10: 299–326. Wald, A. (1950) Statistical Decision Functions. Wiley, New York. Wasserman, L. (2004) All of Statistics. Springer, New York. Wasserman, L. (2006) All of Nonparametric Statistics. Springer, New York. Welch, B. L. (1947) “The generalization of Student’s problem when several different population variances are involved,” Biometrika, 34: 28–35. Welsh, A. H. (1996) Aspects of Statistical Inference. Wiley, New York. White, H. (1980) “A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity,” Econometrica, 48: 817–838.
Whittaker, J. (1990) Graphical Models in Applied Multivariate Statistics. Wiley, New York. Wilks, S. S. (1938) “The large-sample distribution of the likelihood ratio for testing composite hypotheses,” Annals of Mathematical Statistics, 9: 60–62. Williams, D. (1991) Probability with Martingales. Cambridge University Press, Cambridge. Williams, D. (2001) Weighing the Odds: A Course in Probability and Statistics. Cambridge University Press, Cambridge. Wold, H. (1938) A Study in the Analysis of Stationary Time Series (Doctoral dissertation). Almqvist & Wiksell, Uppsala. Wooldridge, J. M. (2013) Introductory Econometrics: A Modern Approach. Cengage Learning, Stamford, CT. Yule, G. U. (1897) “On the theory of correlation,” Journal of the Royal Statistical Society, 60: 812–854. Yule, G. U. (1900) “On the association of attributes in statistics: With illustrations from the material of the Childhood Society &c,” Philosophical Transactions, A, 194: 257–319. [Reprinted in Stuart and Kendall (1971), pp. 7–69.] Yule, G. U. (1910) “On the interpretation of correlations between indices or ratios,” Journal of the Royal Statistical Society, 73: 644–647. Yule, G. U. (1912) “On the methods of measuring association between two attributes,” Journal of the Royal Statistical Society, 75: 579–652. [Reprinted in Stuart and Kendall (1971), pp. 107–170.] Yule, G. U. (1926) “Why do we sometimes get nonsense correlations between time series – a study in sampling and the nature of time series,” Journal of the Royal Statistical Society, 89: 1–64. [Reprinted in Stuart and Kendall (1971).]
Yule, G. U. (1927) “On a method of investigating periodicities in disturbed series, with special reference to Wolfer’s sunspot numbers,” Philosophical Transactions of the Royal Society, Series A, 226: 267–298. Zou, H. and T. Hastie (2005) “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society, Series B, 67: 301–320.
Index
actual vs. nominal error probabilities, 308, 458, 585, 616, 625, 644, 685, 688 actual vs. nominal power, 689 actual vs. nominal type I error probability, 689, 691 defective smoke alarm, 690 admissibility, 486, 487, 507, 554 not a minimal property, 486, 487 relative to a specific loss function, 487 algebra and sigma-algebra, see field and sigma-field “all models are wrong” (Box), 28, 439, 691, 692, 731 Ambien sleep aid real-life example, 21 analogical reasoning and its perils, 443, 610 Anscombe’s data, 632 aspect ratio in graphs, 179 association (sign) reversal in regression, 671 and Simpson’s paradox, 715 due to misspecification, 714 asymptotic properties of estimators, 406, 488 asymptotic efficiency, 491, 530 asymptotic Normality, 490, 491, 529 asymptotic sampling distribution, 505 asymptotic unbiasedness, 530 consistency (strong), 490 consistency (weak), 488 consistent and asymptotically Normal (CAN), 457 mean square consistency, 489 autocorrelation, see also temporal dependence, 707 exchangeable, 458 error, 718, 719 autoskedastic function, 328 autoregressive function, 191, 328 autoregressive model [AR(1)], 348 exponential, 348 LogNormal, 348 Normal, see also Normal AR(1), 348
Pareto, 348 Student’s t, 348 auxiliary regressions for dememorizing data, 191, 722 auxiliary regressions for deseasonalizing data, 191 auxiliary regressions for detrending data, 190, 191, 712, 722 auxiliary regressions for M-S testing, see misspecification testing auxiliary regressions for VIFs, 670 axiomatic probability theory, 60, 74, 422 Bahadur and Savage example, 455, 543 and non-parametric inference, 453 Bayes formula, 67 Bayesian approach to statistics, 432, 440, 441, 685 Bayes factor, 595 Bayes risk functions, 503 credible interval, 504, 505 coherence in judgement, 433, 434 conjugate priors, 440, 441 loss functions, 486, 504, 554 posterior distribution, 440, 441, 502 predictive distribution, 441 primary objective, 502 probabilistic ranking by a posterior, 502 rule (estimate), 503 testing, 594 uniform prior, 442 vs. frequentist inference, 436 Behrens–Fisher Problem, 621 Bernoulli, see distributions Bernoulli log-linear form, 272 log-linear form of the joint density, 271 Bernoulli regression-like model, 656 logit model, 655, 656 probit model, 655, 656 Bernoulli trivariate distribution, 271
Berry–Esseen bound, 403 Beta distribution, see distributions Beta function, 121 Bhattacharyya lower bound, 496 Bienaymé’s lemma, 101 binomial distribution, see distributions bivariate distributions Bernoulli, 133, 525 beta distribution, 172, 255, 256, 281, 285 binomial (or trinomial), 135, 141, 171, 287 Cauchy, 172 exponential, 173, 250, 285 F, 173, 253, 285 gamma, 173, 174, 234, 247, 285, 307 logarithmic, 171, 246, 285 LogNormal, 285 Normal, 135, 136, 140, 174, 242, 280, 283, 285, 309 Pareto, 175, 285 Pearson II, 175, 244, 285 Poisson, 172, 287 Student’s t, 175, 241, 242, 281, 285, 294 Bonferroni rule, 706 bootstrap method, 461 and data-specific patterns, 463 and statistical adequacy, 462, 466 bootstrap resample, 461 Borel (measurable) function, 89, 97 Borel sigma-field, 56, 63, 87 boy–girl problem, 66 Brownian bridge, 356 Brownian motion process, 355, 415 as a limit of a random walk, 357 geometric Brownian motion, 357 integrated, 356 standard, 354 Carathéodory’s extension theorem, 63, 88 Cardano’s early probability rules, 7, 40 Cauchy distribution, see distributions central limit theorem (CLT), 357 Chebyshev’s “near” CLT, 397, 401 De Moivre–Laplace CLT, 397 Hájek–Šidák CLT, 402 Lindeberg’s CLT, 400 Lindeberg–Feller’s CLT, 400 Lindeberg–Lévy CLT, 396 Lyapunov’s CLT, 399 martingale difference (second-order), 402 multivariate CLT, 406 stationary process (second-order), 403 chance regularity patterns, 1, 3, 6, 11, 437 assessing distribution assumptions, 5, 189 assessing t-homogeneity, 185 Beta IID data, 198, 199, 214 Cauchy IID data, 200, 201, 213 data plots and assumptions, 288 dememorized data, 191
detrended data, 192, 257, 712 distribution, dependence and heterogeneity, 12 early formalization of, 7 exchange rate data, 11 exponential IID data, 198 histogram of NIID data, 195 homogeneity, 184 independence, 178 irregular cycles, 182, 222 LogNormal IID data, 196 mean and variance heterogeneity, 189 mean heterogeneity, 186, 188 negative dependence, 183 NIID data, 180, 195 NID negatively dependent data, 183 NID positively dependent data, 182 NI mean heterogeneous data, 186 NI mean seasonal and trending data, 190 NI mean seasonal data, 190 NI mean-shift data, 188 NI mean/variance trending data, 189 NI mean/variance-shift data, 189 NI variance-shift data, 188 NI variance-trending data, 187 non-Normal IID, 196 Normal, dependent, and trending data, 191 positive dependence, 21, 182 probabilistic assumptions and statistical models, 12, 14 seasonality patterns, 187 Student’s t IID data, 200 Uniform IID data, 200 variance heterogeneity, 186, 188 Weibull IID data, 199 Chapman–Kolmogorov equation, 337 characteristic function (chf), 103 Chevalier de Méré’s paradox, 10 coefficient of variation, 118 combinations, 37 concordance measure, 265 conditional cross-product ratios, 272 conditional density, 143 properties, 144 conditional distribution, 144, 146, 148, 152, 170, 225, 228, 233, 268, 271, 273, 278, 279, 292, 299, 304, 325, 351, 480, 481 singular, 350 conditional expectation, 278, 296, 297, 300, 302, 327, 345, 450 conditional independence, 268, 269, 330 conditional moment functions, 278, 280 autoregressive function, 328 autoskedastic function, 328 clitic function, 281 kurtic function, 281 regression function, 280, 282 skedastic function, 280, 282 conditional moments, 146, 235, 280
and dependence, 232 conditional covariance, 269 conditional mean, 147 conditional variance, 147, 245 stochastic, 297 conditional probability, 65, 74, 143 conditioning information set, 301, 302 conditioning on a sigma-field, 295–298 conditioning on the sigma-field σ (X), 295, 297 corset property, 298 law of iterated expectations (LIE) property, 298 least-squares prediction property, 298 stochastic moment functions, 295, 297 taking out what is known property, 298 conditioning on values of random variables, 143, 144, 288 the role of the quantifier, 151, 153 vs. on events, 151, 152 conditioning on events, 65, 66 confidence intervals (CIs), 365, 396, 448, 455, 498, 499, 501, 554, 575, 610, 611 and factual reasoning, 611 coverage error, 499, 501, 506 coverage probability, 499, 612–614, 644 expected length, 501, 613 fallacious interpretation, 614, 615 long-run metaphor, 499, 500 vs. Bayesian credible interval, 505 uniformly most accurate, 612, 636 vs. hypothesis testing, 613 vs. p-values, 614 vs. severity, 614 confounding variables/factors, 564, 648, 665, 693, 717, 730 consistent and asymptotically Normal (CAN) estimators, 491, 506, 507, 692 consistent and asymptotically Normal (CAN) estimators, 457 consistent estimators, 488, 529 consistency as a minimal property, 456, 489, 505 consistent but useless estimator, 489 contemporaneous (synchronic) dependence, 223, 288 continuous mapping theorem, 418 convergence, probabilistic almost surely (with probability one), 382, 425 almost surely vs. in probability, 382 convergence in distribution, 382, 396 convergence in probability, 381, 383 uniform convergence, 409 convolution formula, 163 Cook’s D, 674 copulas and dependence, 258, 259 Archimedean, 260 modeling and inference perspective, 261 Normal (Gaussian), 260 Sklar’s theorem, 261 Student’s t, 261
correlation coefficient, 231 exchangeable, 458 linear (first-order) dependence, 231 nonsense, 710 partial, 269 properties, 232 count (point) process, 360, 361 countable additivity, 60, 63, 85 countable set, 47 covariance, 136 partial, 268 properties, 137 covariance ratio, 675 coverage probability, see confidence intervals Cramér’s theorem, 407 Cramér–Rao lower bound, 478 asymptotic, 491 cross-product (odds) ratio, 264, 271, 272, 526 cumulants, 103 cumulative distribution function (cdf), 89 empirical (ecdf), 193, 207, 450 joint, 134 properties, 90 curve-fitting as empirical modeling, 2, 261, 262, 305, 306, 308, 311, 365, 649, 650, 675, 692, 716 conflating statistical and substantive models, 308, 365 driven by goodness-of-fit, 2, 262, 655, 676 theory-driven modeling, 676 vs. statistical modeling, 675 data classification, 15 cross-section data, 15 cross-section vs. time series data, 20 panel (longitudinal) data, 15 time series, 15 de Morgan’s laws, 50 decision making under uncertainty, 443 decision-theoretic framing of inference, 464, 502 delta method, 408 demand schedule, 26, 305 density function, 35, 83, 91 (continuous) properties, 92 (discrete) properties, 93 support, 96 dependence concepts (assumptions) asymptotic average non-correlation, 331 asymptotic independence, 330, 369 asymptotic non-correlation, 331, 370 correlation, 231 ergodicity, 370 linear dependence, 232, 235, 250, 263, 269, 276, 330, 389 Markov dependence, 227, 268 martingale, 331 martingale difference, 331 of order m, 330
orthogonality between random variables, 232 second-order dependence, 245 dependence measures, 274 and joint raw moments, 230 and nominal variables, 266 and ordinal variables, 263 and the elliptically symmetric family, 240 and the Normal distribution, 234 between two random variables, 228 gamma coefficient, 265 Goodman and Kruskal tau, 267 negative correlation, 252 Pearson vs. Yule on discrete/continuous measures, 263 Theil’s uncertainty coefficient, 267 Yule’s Q, 264 descriptive vs. inferential statistics, 30, 32 deterministic regularity, 3 deterministic trends, 731 different approaches to inference, 436 dimensionality problem, 225, 233, 277 and reduction, 224 discordance measure, 265 distribution of the sample, 167, 168, 224, 301, 422, 437, 444, 476, 481, 493, 511, 512, 516, 526, 531 reinterpreted in Bayesian statistics, 440 vs. the likelihood function, 437, 447 distributions (univariate) Bernoulli, 36, 83, 121 beta, 96, 104, 113, 116, 123, 208, 440, 503, 545 binomial, 36, 84, 122 Cauchy, 100, 110, 112, 113, 115–117, 123, 163, 325, 454 chi-square, 124 exponential, 91, 124, 134, 198, 199 extreme value (Gumbel), 124 F, 125 gamma, 96, 125 generalized gamma, 125 geometric, 86, 122 hypergeometric, 122 Laplace (double exponential), 106, 126 logarithmic series distribution, 122 logistic, 116, 126 LogNormal, 126, 196 negative binomial, 122 non-central chi-square, 126 non-central Student’s t, 127 Normal (Gaussian), 36, 93, 94, 127 Pareto, 128 Poisson, 94, 123 power exponential (error), 128 Student’s t, 107, 128, 497 uniform (continuous), 94, 129, 165 uniform (discrete), 123 Weibull, 129 distribution-free, see non-parametric
Doob martingale, 341 Doob–Meyer decomposition, 364 double truncation, 149 Duhem’s conundrum, 2 dummy variable, 191 duration (hazard-based) models, 363 Durbin–Watson (DW) test, 719 Dutch book argument, 433 dynamic linear regression (DLR(1)) model, 720, 728 Edgeworth expansion, 404 Edgeworth’s testing, key concepts, 556 elastic net regression estimation, 650 as curve-fitting, 650 elementary outcomes, 7, 8, 48, 49 elliptically symmetric family of distributions, 233 empirical cumulative distribution function (ecdf), 207 empirical modeling, 1 as curve-fitting guided by goodness-of-fit, 2, 262 crucial features, 1 guided by statistical and substantive information, 1 modelling vs. inference, 446 primary aim of, 28, 301 statistical modeling vs. curve-fitting, 675 empirical modeling examples with data, 1 antique grandfather clock, 630, 635, 636, 638, 646 capital asset pricing model (CAPM), 643, 644 dice data, 701 effect of education on income data, 714 exam scores data, 699 exchange rate data, 215 Keynes’ absolute income hypothesis data, 716 Yule’s nonsense correlations data, 710 equality of random variables, 89 equal-probability contours, 235, 236, 238 error probabilities, 425, 438, 563, 685 are not conditional, 579 actual vs. nominal, 458 pre-data, 578, 579, 616 post-data, 570, 579 post-data severity evaluation, 600 pre-data vs. post-data, 579 type I error probability, 455, 573, 576, 578, 588 type II error probabilities, 455, 573, 577, 588 error-statistical perspective, 585, 596 error term assumptions, 305, 348, 627, 720 error term, statistical vs. structural, 647 errors of measurement, 648 estimation (point) admissibility, 486 bias of an estimator, 474 crystal ball estimator and admissibility, 486, 487 finite sample properties, 474 fixed-window (rolling), 707
inferential claim for, 506 minimal sufficient estimator, 481 primary objective, 471, 512 recursive, 707 sampling distribution, 374, 472, 505 strong consistency, 490 sufficient estimator, 480 vs. estimate vs. parameter, 448 event in probability, 48 events vs. elementary outcomes, 51 event space, 51 existence of moments, 110, 336, 387, 413, 459 experimental data, 37, 306, 564 sampling procedure probabilities, 37 sampling with replacement, 37 sampling without replacement, 37 vs. observational data, 14 exponential distribution, see distributions exponential family of distributions, 681 one-parameter, 574, 589, 680 two-parameter, 680, 681 F distribution, see distributions factor analysis, 351 factorization theorem, 481 factual reasoning, 438, 472, 500, 559, 611, 614 vs. Bayesian reasoning, 441 vs. hypothetical, 560, 576, 613, 614 false negative/positive in probability, 67 in medical detection devices, 609 vs. error probabilities, 610 feasible GLS, 719 field (algebra), 54, 55 filtration of sigma-fields, 341 finite additivity, 434 finite sample properties of estimators, 474 bias, 474 full efficiency, 476 median unbiasedness, 492 minimum MSE, 485 minimal sufficiency and completeness, 484 mode unbiasedness, 492 parameterization invariance, 525, 546 Pitman closeness, 492 relative efficiency, 476 relative vs. full efficiency, 475 sufficiency, 480, 496, 528 sufficiency and unbiasedness, 482 unbiasedness (mean), 474 Fisher and experimental design blocking, 564 material experiment, 565 randomization, 564 replication, 564 Fisher information, 476, 477, 494, 514, 517, 519, 630 asymptotic Fisher information, 491 for a single observation, 477, 530
observed, 521 Fisher–Neyman factorization, 482 frequentist approach to statistical inference, 436, 437, 512 model-based, 24, 30, 424–427, 429, 430, 444, 547 primary objective, 471, 561 reliability/unreliability of inference, 462, 548, 625, 721, 729, 731 untrustworthy evidence, 258, 453, 686, 714, 730 vs. Bayesian inference, 436 Fisher’s recasting of Pearson’s approach to statistics, 444, 511, 543 parametric model-based, 424–426, 444, 511, 512 Fisher’s significance testing, 558, 579 falsificationist stance, 562 hypothetical reasoning, 560 key elements, 568 p-value, 560 null hypothesis, 559 vs. N-P Testing, 598 for Logit/Probit Models, 710 frequentist interpretation of probability, 391, 423, 426, 438, 464 circularity charge, 427 long-run criticism, 429 repeatability in principle, 429 stipulated provisions, 426 vs. the propensity interpretation, 431 vs. Kolmogorov’s algorithmic complexity, 430 frequentist testing, 424 fallacy of acceptance, 577, 599 fallacy of rejection, 578, 599, 605, 704, 720 generic capacity (power) of tests, 577, 579, 582, 603 hypothetical reasoning, 576, 614 primary objective, 561 statistically vs. substantively significant, 607 frequentist vs. Bayesian inference, 436 full efficiency, 476, 479, 529 function, 58 bijection, 59 co-domain, 58 composite, 88 domain, 59 range of values, 59 relation, 58 functional (Donsker) CLT, 418 invariance principle, 414 functions of random variables one random variable, 97, 161 two random variables, 162 several random variables, 162 functions of the sample moments, 542 fundamental theorem of calculus, 91 Galileo’s three-dice puzzle, 38 Galton on concentric ellipses, 240 from a scatterplot to elliptical contours, 238
on reverse regression, 308 regression toward the mean, 307 gamma distribution, see distributions gamma function, 121 gamma regression-like model, 658 and generalized linear models, 658 garbage in garbage out, 23 Gauss linear model, 306, 649, 655, 657 best linear unbiased estimator (BLUE), 536, 651, 665 vs. the Linear Regression model, 650, 677 Gauss–Markov theorem, 536, 651, 665, 719 and different distributions, 652 limited value in inference, 536 generalized least squares (GLS), 719 feasible GLS, 719 generalized linear (GL) models, 655, 657, 682 common features, 681 deviance, 683 deviance residuals, 683 linear predictor, 681 link function, 657, 658, 681 Pearson residuals, 683 response residuals, 683 Glivenko–Cantelli theorem, 451 Gnedenko’s theorem, 405 Goodman and Kruskal tau, 267 goodness-of-fit, 655 graphical analysis, 28, 176 graphical causal modeling, 270, 272 hazard function, 150, 363 cumulative, 363 hazard-based statistical models, 363 Weibull, 365 Heaviside function, 452 heterogeneity concepts (assumptions) first order stationarity, 333 Markov homogeneity, 332 second-order stationarity, 333 separable heterogeneity, 310, 334, 349, 367 spatially homogeneous, 339 stationarity of order m ≥ 1, 334 strict, 332 high leverage points, 675 histogram, 4, 176, 203 and non-IID data, 206 smoothed histogram, 204, 205 hypothetical reasoning, 438, 553, 560, 561, 576, 613–615 vs. Bayesian reasoning, 441 vs. factual reasoning, 438 ideal estimator, 471, 488 idempotent matrix, 663 identical distributions (ID), 37, 71, 158, 331 impossible event, 51 incidental parameter problem, 225, 227, 328, 343, 346, 533, 655, 693, 707, 719
increasing conditioning information set, 227 independence, 37, 70, 303, 329 among events, 69 among random variables, 155, 156 indicator function, 203 individual decision making, 443 induction by enumeration, 427 inductive inference, 24 premises of inference, 23 ampliative dimension, 24, 32 vs. deduction, 121, 457 model-based induction, 24, 427 inequalities (probabilistic), 412 Bernstein inequality, 413 Boole inequality, 64 Bonferroni inequality, 65 Cauchy–Schwarz inequality, 413 Chebyshev inequality, 102, 385, 412, 488 Hoeffding inequality, 229, 413 Hölder inequality, 414 Jensen inequality, 414 Lyapunov inequality, 414 Markov inequality, 412 Mill’s inequality, 413 Minkowski inequality, 414 Schwarz inequality, 232 inference vs. modeling facet, see modeling vs. inference infinitely divisible distributions, 377, 406 initial conditions, 337 innovation process, 344 interpretations of probability, 421 degrees of belief, 432, 441, 464 frequentist (model-based), 424, 426 frequentist (von Mises), 426, 427 Kolmogorov’s algorithmic complexity, 430 logical, 435 Popper’s propensity, 431 interval estimation, see confidence intervals intervals on the real line, 46 closed interval, 46 half-closed, 46 open interval, 46 singleton, 46 Itô’s lemma, 359 Itô process, 358 joint and conditional probability formulae, 152 joint density function, 132–134, 139, 144, 150 properties, 135 joint independence, 70 Karl Pearson approach to statistics, 263, 445, 497, 547, 551, 556 vs. Fisher, 547 Karl Pearson’s chi-square test, 557, 567 the initial misspecification test, 558, 686 Karl Pearson’s method of moments, 543
Karl Pearson’s testing, 556 key concepts, 558 kernel smoothing, 204, 219, 454, 460 bandwidth, 205 biweight kernel, 205 Epanechnikov, 205 Normal, 204 rolling histogram, 203 uniform, 204 Kolmogorov algorithmic complexity, 430 Kolmogorov axioms of probability, 60, 422 Kolmogorov’s distance theorem, 451 Kolmogorov’s extension theorem, 322, 366 Kullback–Leibler distance, 229 and likelihood, 524 kurtosis coefficient, 105, 138 large n problem, 563, 598, 603, 616 decreasing α as the sample size increases, 599 LASSO regression estimation, 650 and curve-fitting, 650 law of iterated logarithm (LIL), 391, 395 Khinchin’s LIL, 395 learning from data, 383, 437, 438, 441, 466, 572, 577 least squares as a statistical method, 535 best linear unbiased estimator (BLUE), 536, 651, 665 least-squares estimator, 535 ordinary least squares, 305 least-squares mathematical approximation, 534 approximating function, 648 Boscovitch, 534 curve-fitting, 534, 648 Gauss, 536 Laplace, 534 Legendre, 534, 535 principle of least squares, 648 theory of errors, 41, 510 Lehmann–Scheffé theorems, 483, 484 leptokurtic distributions, 106, 241, 286 likelihood function, 437, 447, 511, 513 log-likelihood function, 515 likelihood ratio test, 571, 591, 594, 595 and monotonicity, 588 asymptotic, 594 limit theorems in probability, 375, 410 main assertions, 375 with non-testable assumptions, 409, 410 linear regression, traditional formulation, 659, 718 ad hoc M-S testing, 718, 725 asymptotic sampling distributions, 666 autocorrelation-consistent standard errors (ACSE), 721 computational perspective, 660 curve-fitting problem, 716 error-fixing strategies, 720, 721 Gauss–Markov theorem, 665
hat matrix, 673 heteroskedasticity-consistent standard errors (HCSE), 721 Huber condition, 666 matrix formulation, 661, 667 Normality and the reliability of inference, 643 robustness claims, 457, 693, 731 traditional specification, 627 vs. the Gauss linear model, 649 logit model, 655, 656, 710 misspecification testing, 710 logit transformation, 657 long-run frequency, 429 vs. probability, 429 long-run metaphor, 425, 427, 437, 466, 500 conflating probability with relative frequencies, 429 no temporal dimension, 427 repeatability in principle, 429 loss function, 113, 486, 504, 554 Mann and Wald theorem, 407 marginal density function, 140 marginal distributions, 139 marginalization, 139, 140 marginalization vs. conditioning, 150 Markov chains, 337 Markov dependence, 227, 268, 277, 329 of order p, 330 Markov process, 325, 336, 337 martingale difference process, 327, 342, 344, 394 dependence, 331 second-order, 344 martingale process, 327, 340 dependence, 331 sub-martingale, 340 super-martingale, 340 mathematical deduction, 64 mathematical duality between testing and CIs, 610, 612 not an inferential duality, 614 mathematization of chance regularities, 7 matrix norm, 668 maximal ancillary statistic, 705 maximum likelihood method, 547 asymptotic variance, 490 criticisms of, 532 estimator (MLE), 514 inconsistent estimators, 533 Neyman and Scott model, 533 optimal properties of MLEs, 532 max-stable family of distributions, 405 mean of a distribution, 99 properties, 100 mean ergodic process, 371 mean square error (MSE), 300, 485, 503 and biased estimators, 487 frequentist vs. Bayesian definition, 486
measure theory, 60, 73, 422, 428 Lebesgue measure, 63, 428 median of a distribution, 113 Mendel’s cross-breeding experiments, 557 substantive model, 557 method of least squares, 547 methods of estimation, 510 mid-range, 515 minimal sufficient statistic, 480 misinterpreting the p-value, 563 misspecification (M-S) testing, 2, 23, 26, 177, 289, 426, 461, 631, 686 Anderson–Darling test for Normality, 696 custom tailoring, 704 data snooping/mining charge, 693 directions of departure, 697 double use of data charge, 693 encompassing alternative model, 697, 720 for Logit/Probit Models, 710 formalization of, 316 independence and mean constancy, 698 independence and variance constancy, 700 infinite regress/circularity charge, 427, 693 joint tests based on auxiliary regressions, 645, 670, 699–701, 704, 705, 707, 709, 724 Kolmogorov’s test for Normality, 695 large enough sample size (n), 25, 379, 506 legitimate conditioning information set, 708 multiple testing charge, 693 non-parametric (omnibus), 694 Normality test, 646 pre-test bias charge, 693 runs test, 694 skedastic function tests, 724 skewness-kurtosis test of Normality, 700 testing outside the statistical model, 687, 703 testing within vs. testing outside, 567, 687, 703 t-invariance of the parameters, 707 too many tests charge, 706 vs. N-P testing, 704 mixing conditions, 369 non-testable, 403 mixingale, 370 mode of a distribution, 112 model-based frequentist interpretation of probability, 424, 426 single-event probability problem, 429 statistical induction, 24, 74, 427, 444, 457, 466, 686 modeling vs. inference facet, 2, 119, 365, 374, 446, 447, 465, 507, 510, 642, 687, 729 moment generating function (mgf), 103 moment matching principle, 111, 536, 537, 548 moments of distributions, 99 central moments and dependence, 231 higher central moments, 103 higher raw moments, 102 joint central moments, 136, 231
joint product moments, 136 problem of moments, 110 Monty Hall puzzle, 68 moving average [MA(q)] process, 352 multiplication counting rule, 37 multivariate distributions, 134 Bernoulli distribution, 271 density properties, 138 elliptically symmetric, 241 Normal distribution, 139, 269 Student’s t, 344 mutually exclusive events, 51, 70, 422 near-collinearity problem, 666, 667, 677 and ordering of observations, 669 determinant of (X′X), 668 high correlations among regressors, 667
ill-conditioning of X′X, 667, 668 numerical perspective, 667 statistical perspective, 670 negative binomial, 122, 589, 680 bivariate, 171, 287 nesting restrictions, 643 Newton–Raphson algorithm, 522 Neyman and Scott model, 532 Neyman–Pearson (N-P) testing, 449 acceptance region, 449, 572, 611–613 alternative hypothesis, 570 arbitrariness of the significance level, 608 archetypal framing of hypotheses, 572, 575, 603 behavioristic interpretation of accept/reject, 573 coarseness of accept/reject rules, 606 consistent N-P test, 582 convexity of the alternative parameter space, 589 default alternative hypothesis, 576 Egon Pearson’s retrospective, 569, 570 key elements, 586 modifications to Fisher testing, 570, 573 monotone likelihood ratio, 587 multiple testing (comparisons), 609, 706 N-P lemma, 586 non-centrality parameter, 586 null hypothesis, 559 origins of the Fisher vs. Neyman prolonged dispute, 574 pre-designation of error probabilities, 598 randomization, 590 rejection region, 450, 572, 576, 586 simple vs. composite hypotheses, 571 testing within, 572, 687, 703 testing within vs. testing outside, 567, 687, 703 the large n problem, 585 trade-off between the type I and II, 580, 598 trade-off rationale, 580 type I and II errors, see also error probabilities, 573 unbiased N-P test, 581
uniformly most powerful (UMP) test, 580, 581, 586 non-correlation, 330 of order m, 331 non-parametric (distribution free) inference, 453 and its reliance on asymptotics, 453 indirect distribution assumptions, 112, 453 inference based on asymptotic bounds, 458 non-testable smoothness restrictions, 453 perils of ‘as n tends to infinity’, 374, 409, 458, 692, 731 vs. parametric, 453 weak probabilistic assumptions, 454, 456, 466 non-parametric tests and claims of robustness, 454 low power, 696 Normality or bust, 454 t test vs. Wilcoxon test, 454 Wilcoxon-type tests, 453 non-parametric (omnibus), 696 non-stationarity vs. separable heterogeneity, 367 non-typical observation, see outlier norm condition number, 667 Normal autoregressive [AR(1)] model, 294, 347, 348, 456, 457, 642 AR(p) model, 345, 353 AR(p) with mean heterogeneity model, 348 Normal, Markov, stationary process, 346 statistical GM, 347 statistical parameterization, 347 testable, internally consistent model assumptions, 347 underlying joint distribution, 346 vs. the unit root [UR(1)] model, 350, 351 with second-order separable heterogeneity, 351 Normal autoregressive, moving average [ARMA(p, q)] model, 353, 354, 367 Normal distribution, see distributions Normal dynamic linear regression (DLR) model, 728 Normal equations, 534 Normal linear regression (LR) model, 538, 625, 627, 631, 658, 690, 705 and mean heterogeneity, 310, 691 fitted values and residuals, 633 heterogeneous conditional variance, 310, 351 heteroskedasticity, 286, 304, 310, 718, 725 heteroskedasticity vs. heterogeneity, 310, 718 linearity assumption, 284, 298, 303, 649 mean heterogeneity, 716 Normality and the LR model, 642 role of the constant term, 664 t-invariance assumption, 303 variance decomposition, 664 Normal martingale difference process, 708 Normal moving average [MA(q)] process, 352 Normal, unit root [UR(1)] model, 349, 350 AR(1) vs. UR(1) model, 350
statistical GM, 350 statistical parameterization, 350 underlying joint distribution, 349 unit root testing, 350, 626 Wiener process, 349 Normal vs. Pearson type II, 108 Normal vs. Student’s t distribution, 107, 108, 201, 202 Normal white-noise process, 336 nuisance parameters, 533 null hypothesis, 559 number sets, 46 integers, 46 natural numbers, 46 positive integers, 46 rational numbers, 46 real numbers, 46 numerical evaluation of optimization, 522 objectivity in statistical inference, 444, 686 observational data, 14, 26, 37, 40, 304, 305, 443, 730 vs. experimental data, 14, 306 observed confidence intervals and severity, 614 odds ratio, 264, 433 test, 568, 577, 581, 582 omitted variables, 307, 646, 648, 692 vs. discarded variables, 709 one-way analysis of variance (ANOVA), 526, 624, 680 operational statistical model, 343, 386 ordered sample, 208 and its distributions, 165 ordering of interest, 20, 21, 178, 317, 365, 710, 715 cross-section data, 21, 179, 274, 307, 317, 365 of the sample, 20 Ornstein–Uhlenbeck process, 356 orthogonal decomposition, 302 orthogonal polynomials, 673 orthogonal projectors, 663 outcomes set, 45 countable, 61 finite, 60 uncountable, 62 outlier (non-typical observation), testing for, 675 overidentifying restrictions, 693, 729 paired sample tests, 622 parameter space, 95, 446 parameters and moments, 97 parameters vs. estimators vs. estimates, 537 parametric (directional) M-S testing, 697 parametric inference, 548 parametric method of moments (PMM), 544, 548 parametric vs. non-parametric inference, 453 parametrically nested, 24, 350 partial sums process, 324 partition of a set, 51, 62, 67, 74, 449, 571, 576
Pearson family of frequency curves, 445, 551 percentiles of a distribution, 114 permutation invariance, 322 permutations, 38 pivotal quantity, 499, 559, 613 platykurtic distributions, 105, 108 Poisson distribution, see distributions Poisson process, 361, 362 Poisson regression-like model, 657 Poisson’s law of small numbers, 379 popular misconceptions about limit theorems, 376 positive dependence, 21, 182, 188, 193, 222, 253 positive predictive value (PPV), 609 blameworthiness by association, 610 misinterpreting error probabilities, 610 power of the test, 573 evaluation, 582 increases as discrepancies increase, 583 increases as n increases, 583 power curve, 583 pre-data role, 584 power set of an outcomes set, 52, 53 P–P plot, 207, 209 Normal IID data, 210 standardized, 211 standardized Normal, 211 Student’s t, 217 Uniform IID data, 209 vs. Q–Q Plots, 214 prediction (forecasting), 450 prediction intervals, 636 pre-image of a function, 80 and set-theoretic operations, 88 preliminary data analysis, 677 primary objective of frequentist testing, 561 primitive notions, 73 principal component analysis, 351 prior distribution, 440, 441, 464 conjugate, 441 Jeffreys, 435, 443, 504 logistic, 442 objective (default, reference), 442, 443 reparameterization invariant, 442 uniform, 442 vs. prior substantive information, 444 probabilistic assumptions underlying statistical models, 329 three broad categories, 12, 28, 177, 179, 316, 318, 386, 454 probabilistic reduction, 278, 282, 328, 628, 655, 721 reduction and model assumptions, 628, 631, 642, 644 probability integral transformation, 98, 212, 259 converse, 116 probability model, 33, 95 probability set function, 60, 63 probability space, 63, 302 induced by a random variable, 89
probit model, 655, 656, 710 product rule for conditional probability, 66 propensity interpretation of probability, 431 Humphrey’s paradox, 432 vs. the frequentist interpretation, 431 pseudo-random numbers, 178 p-value, 482, 560, 570, 578, 579, 616 a post-data perspective, 579 a pre-data perspective, 578 misinterpretations, 554, 562, 563 vs. type I error probability, 578, 579 Q–Q plot, 207, 213 quantifier, universal, 151, 486, 507 quantile function, 114 properties, 114 quantile transformation, 210 quantiles of a distribution, 114 quartiles of a distribution, 114 interquartile range, 117 quartile deviation, 117 R square, 635 random experiment, 42, 78 random sample, 36, 39, 74, 160 non-random sample, 222 random trials, 70, 72 random variable, 34, 79–81, 87, 169, 422 a naive view, 34 continuous, 35, 87 continuous/discrete, 146 degenerate, 82 discrete, 35, 80 neither random nor a variable, 81, 119 range of values, 116 random vector, 132, 133 random walk process, 325 second-order, 339 Rao–Blackwell theorem, 482 regression function, 280, 282 characterization, 298, 708 regression models, 281, 285, 287, 311 and heterogeneity, 308 and homoskedasticity, 285, 303 beta, 285 Binomial, 287 Exponential, 285, 289 F, 285 gamma (Cherian), 285 gamma (Kibble), 285 logistic, 285 LogNormal, 285 modeling strategy, 289 Negative Binomial, 287 Normal, linear regression, 303, 351, 627, 658, 690, 705 Pareto, 285 Pearson type II linear regression, 285
  Poisson, 287
  selecting a regression model, 312
  statistical GM, 303
  Student's t, linear regression model, 285, 304
regression-like statistical models, 655
  generalized linear models, 655, 657
reluctance to validate model assumptions, 691
reparameterization, 293, 294, 351, 442
reparameterization invariance, 442, 507
replication crises, 609
  and severity, 609
  trustworthiness of the evidence, 28
rescaling variables, 673
residuals, 629, 632, 634, 644, 646, 647, 660
  small vs. non-systematic, 650, 655
residual sum of squares, 527, 634, 663, 707
  explained sum of squares, 634
  restricted, 640, 641, 699
  total sum of squares, 634
  unrestricted, 640, 664, 699
reverse regression, 282, 307, 308
  problem, 307
ridge regression estimation, 650
  and curve-fitting, 650
rules of total variance, 299
runs up and down, 181
  test, 694
sample correlation coefficient, 541
sample moments and their properties, 539
  central moments and their first two moments, 540
  mean and its first four moments, 540
  raw moments and their first two moments, 539
  variance and its first two moments, 541
sample space, 446
sample vs. sample realization, 319, 446
sampling distribution, 110, 374, 409, 429, 439, 445, 447, 448, 455, 460, 461, 472, 473, 476
  actual vs. assumed sampling distribution, 689
sampling model, 33, 130, 161, 169
sampling space, 70, 72
sampling survey methods, 39
  cluster sampling, 40
  quota sampling, 40
  simple random sampling, 39
  stratified sampling, 39
scales of measurement for data, 16, 20, 262, 273
  and mathematical operations, 17
  and ordering of data, 17
  interval scale, 17
  nominal scale, 17
  ordinal scale, 17
  ratio scale, 17
scatterplot (or cross-plot), 237
score function, 517
  properties, 518
sequential conditioning, 225, 226, 277, 328
sinusoidal polynomials, 309
set theoretic vs. probabilistic terminology, 57
set theory and operations, 45, 48
  cardinality of a set, 47
  complementation, 49
  difference of two sets, 50
  empty set, 49, 82
  equality of sets, 50
  intersection, 49
  subset, 48
  union of sets, 49
  universal set, 45, 49
severity (post-data) evaluation, 578, 598, 600, 601
  addressing abuses of N-P testing, 601
  addressing abuses of significance testing, 609
  addressing the fallacy of acceptance, 605
  addressing the fallacy of rejection, 605
  and observed confidence intervals, 614
  and the large n problem, 603
  and the p-value, 605
  error-statistical account of evidence, 602, 603, 609
  interchanging the null and alternative hypotheses, 601
  manipulating the level of significance, 601, 608
  post-data direction of departure, 605
  statistical vs. substantive significance, 602
  warranted discrepancy from null, 565
severity curve, 602, 608
  accept the null, 604
  reject the null, 602
severity principle, 609
  strong, 609
  weak, 609
sigma-field (σ-field), 56, 62
significance level, see type I error probability
significance level vs. the p-value, 578, 579
simple random walk, 338, 339
simple statistical models, 33, 166
  Bernoulli model, 167, 423, 438, 448, 470, 512, 566
  bivariate Bernoulli model, 567
  bivariate Normal model, 711
  Cauchy model, 552, 591
  exponential model, 475, 518, 591
  gamma model, 521
  Laplace model, 516
  logistic model, 523
  Normal model, 167, 177, 225, 303, 423, 438, 493, 519, 558
  Normal (1-parameter) model, 471, 568, 570, 576, 585, 613
  Poisson model, 528, 590
  uniform model, 478, 514
Simpson's paradox, 715
skewed distributions, 104
skewness coefficient, 104, 138
Spearman's rank coefficient, 229
square contingency coefficient, 229
stable family of distributions, 404
standard deviation, 102
stationarity, 331
  first order, 333
  of order m ≥ 1, 334
  second order, 333
  strict, 332
statistical adequacy, 2, 23, 24, 28, 308, 312, 365, 439, 445, 465, 469, 565, 572, 616, 635, 647, 655, 693
  and the reliability/unreliability of inference, 25, 462, 548, 625, 721, 729, 731
  vs. substantive adequacy, 647, 648, 721
statistical curve-fitting, 648
statistical generating mechanism (GM), 278, 301, 311, 312, 366, 429, 728
  hazard-based models, 364
  vs. actual generating mechanism, 7, 431, 648, 649
statistical inference (induction), 32, 74
  different approaches, 436
statistical information set, 709
statistical learning methods, 650
  and curve-fitting, 650
statistical misspecification, 21, 262, 453, 456, 462, 465, 466, 610, 685, 730
  nominal vs. actual error probabilities, 458, 585, 616, 644, 685, 688, 693, 717, 729
  unreliable inferences, 365, 462, 548, 729
statistical models, 23, 28, 167, 315, 366, 425, 437, 559, 643, 716
  and completeness, 484
  and substantive information, 11
  data has 'a life of its own', 26
  identification of parameters, 168
  internally consistent and testable assumptions, 13, 306, 318
  misspecified, 616, 632, 646, 690, 691, 713, 715, 724
  model validation vs. model selection, 685, 687, 703
  non-regular probability models, 444, 479
  non-systematic component, 302, 364, 647, 663
  non-testable model assumptions, 409, 454, 457
  population vs. stochastic generating mechanism, 40
  probabilistic reduction, 628, 693, 721
  probabilistic structure vs. parameterization, 350
  reduction and model assumptions, 628, 642
  regularity conditions, 477, 514
  respecification, 316, 704, 713, 716
  rigged models, 466
  set of all possible probability models, 168
  specification, 13, 177, 316, 721
  statistical meaningfulness of parameters, 626, 655
  statistical vs. substantive models, 304
  systematic component, 364, 647, 663
  testable probabilistic assumptions, 312, 350, 427, 454, 457, 693
  what is a statistical model?, 625
statistical space, 73, 78, 79, 118, 130, 166
statistical systematic information, 13, 315, 436, 648, 699, 701
  vs. substantive information, 1, 25, 575
statistical vs. substantive significance, 554, 599, 603, 607
statistical vs. structural error term, 647, 648
statistically meaningful parameterizations, 300, 304, 307, 310, 312, 315, 366, 626, 627, 648, 650, 655
stereogram, 238
  smoothed, 238
stochastic calculus, 358
stochastic difference equation, 625
stochastic differential equation, 358
stochastic integral, 358
stochastic phenomena of interest, 2, 3, 8, 11, 30, 34, 78, 160, 167, 176, 323, 383, 421, 424, 426, 436, 443, 451
stochastic processes, 318
  Bernoulli IID process, 335
  building block processes, 335
  classification, 320
  dependence assumptions, 329
  ensemble of a stochastic process, 320
  exponential IID process, 335
  functional perspective, 323
  heterogeneity assumptions, 331
  independent increments homogeneity, 332
  IID process, 365
  independent increments process, 326
  index set (ordering), 318
  joint distribution perspective, 323
  Normal IID process, 336
  sample path viewing angle, 319
  taming a wild process, 342
stochastic regression model, 278
strong law of large numbers (SLLN), 375, 393, 395, 425
  Borel's SLLN, 391
  Kolmogorov's first SLLN, 391
  Kolmogorov's second SLLN, 392
  uniform SLLN, 409
structural model, see substantive model
Student's t distribution, see distributions
Student's t martingale difference process, 344
Student's t test, 454, 455, 565, 593, 637, 638, 640, 690, 698
  vs. non-parametric tests, 453
substantive adequacy, 439, 565, 643, 647, 693, 728
  confounding variables/factors, 564, 648, 665
  errors of measurement, 648
  external shocks, 648
  omitted variables, 307
  probing for, 647, 693
substantive (structural) model, 1, 643, 676, 692, 693, 730
  structural error term, 648
  vs. statistical model, 365, 675
substantive information, 1, 7, 11, 24, 26, 301, 308, 312, 315, 432, 437, 444, 605
  non-systematic, 648
sufficiency principle (SP), 483
sure event, 51
symmetric distribution, 103, 302, 663
Taylor's series expansion, 542
temporal dependence, 20, 223, 243, 288, 339, 354, 719, 722
test statistic, 355, 556, 560, 561, 566, 568, 576, 578
testing linear restrictions, 639, 641, 663, 729
  F-test, 640, 643, 664, 665, 675, 701
testing the difference of two means, 620
testing the difference of two proportions, 623
total probability rule, 67
t-plot, 3, 4, 11, 178, 179, 219
transition probabilities, 338
trend polynomials, 219, 701
trigonometric polynomials, 191
trivial event space, 52
trivial field, 296
truncation, 148, 149
two-parameter exponential family, 680, 681
unbiased estimator, 474
  and consistent, 489
  and fully efficient, 527
  vs. parameterization invariant, 475
uniform asymptotic negligibility, 398
untrustworthy evidence, 365, 453, 644, 717
variance, 100
  heterogeneity, 186, 725
  properties, 101
variance inflation factors (VIF), 670
Venn diagrams, 49
von Mises' frequentist interpretation of probability, 426
  von Mises collective, 427
  vs. model-based frequentist interpretation, 427
weak exogeneity, 292–294, 297
  variation freeness, 293
weak law of large numbers (WLLN), 375
  Bernoulli's WLLN, 378, 383
  Bernstein's WLLN, 388
  Chebyshev's WLLN, 387
  Khinchin's WLLN, 390
  Markov's WLLN, 388
  Poisson's WLLN, 386
Weibull hazard-based model, 365
white-noise process, 336, 344, 352, 367
  vs. innovation process, 345
Wiener process, 349
Yule's Q, 264
Yule's reverse engineering, 711
zero probability paradox, 578