English Pages [220] Year 1992
GOODNESS OF FIT IN LINEAR AND QUALITATIVE-CHOICE MODELS
The book is no. 29 of the Tinbergen Institute Research Series. This series is established through cooperation between Thesis Publishers and the Tinbergen Institute. A list of books which already appeared in the series can be found in the back.
Goodness of Fit in Linear and Qualitative-Choice Models
ACADEMIC DISSERTATION for obtaining the degree of doctor at the Universiteit van Amsterdam, on the authority of the Rector Magnificus prof.dr. P.W.M. de Meijer, to be defended in public in the Aula of the University (Oude Lutherse Kerk, entrance Singel 411, corner of Spui), on Tuesday 9 June 1992 at 11.30 hours
by Franciscus Adrianus Gijsbertus Windmeijer, born in Santpoort
THESIS PUBLISHERS AMSTERDAM 1992
Supervisors:
prof.dr. H. Neudecker prof.dr. P. Groeneboom
Members of the committee:
prof.dr. A.P. Barten prof.dr. J.S. Cramer dr. R.D.H. Heijmans prof.dr. J.F. Kiviet prof.dr. G. Ridder
Faculty of Economics and Econometrics
PROPOSITIONS accompanying the thesis
Goodness of Fit in Linear and Qualitative-Choice Models
Frank A.G. Windmeijer
1.
The frequently stated claim that the sum of weighted squared residuals in the binary choice model is asymptotically chi-square distributed is incorrect. This sum is asymptotically normally distributed. Thesis, Chapter 10.
2.
The absolute supremum distance in the binary choice model between the distribution function and the nonparametric Nadaraya-Watson estimator of this distribution function is asymptotically extreme value (Gumbel) distributed. Thesis, Chapter 9.
3.
The nonparametric estimator of the parameter vector $\beta$ in the binary choice model, obtained by maximizing the objective function
is identical to the maximum rank correlation estimator, the estimator obtained by maximizing the objective function
Thesis, Chapter 10.
4.
To obtain a good indication of the fit in a linear regression model without a constant term, both Barten's $R^2$ and the squared sample correlation coefficient should be reported. Thesis, Chapter 3.
5.
The political integration of the E.C. will lead to a stronger Western European cultural chauvinism.
6.
Virtual reality can make an important contribution to the protection of the environment. One of the possibilities is to limit the flow of tourists: instead of trips, travel agencies can offer virtual reality helmets.
7.
In the Dutch language, the use of compound words that consist of more than two words, whether or not joined by a linking s, should be reduced.
8.
The Enkhuizen-Lelystad dike should be demolished. This benefits shipping, both recreational and commercial, and will silence for good the endless discussion about the reclamation of the Markerwaard.
Acknowledgements
I wish to express my appreciation to my supervisors Heinz Neudecker and Piet Groeneboom; in writing this thesis I have benefited from their constant encouragement and enthusiasm. I am indebted to Risto Heijmans for his constructive comments throughout the preparation of this work; Bert van Es for his guidance through the pitfalls of nonparametric statistics; and the members of the examining committee. My thanks are extended to friends and colleagues for their stimulus and encouragement, especially to Frances Wollmer, who also made the invaluable contribution of improving the English of this thesis. Finally, I wish to thank my parents for the support they have always given me.
Preface
Parts of this study have been published before as reports or journal articles. Chapter 5 is published in Statistica Neerlandica as NEUDECKER AND WINDMEIJER (1991). Section 10.2, together with the proof of Theorem 10.1 as presented in Appendix 10.A, has been published in Statistica Neerlandica as WINDMEIJER (1990). Chapter 3 is submitted for publication to Statistica Neerlandica.
Contents

1. Introduction
  1.1. Goodness-of-fit measures in Econometrics
  1.2. $R^2$ in the linear model
  1.3. Goodness of fit in qualitative choice models

Part I: $R^2$ in the linear model

2. The coefficient of determination, a survey
  2.1. Introduction
  2.2. The linear model, definitions and assumptions
  2.3. The coefficient of determination
  2.4. $R^2$ and hypothesis testing
  2.5. Probability limit, distribution and moments
  2.6. Some limitations
  2.7. List of properties

3. $R^2$ in the linear model without a constant term
  3.1. Introduction
  3.2. Barten's $R_B^2$
    3.2.1. Definition and properties
    3.2.2. Moments
  3.3. The squared sample correlation coefficient of $y$ and $\hat{y}$
    3.3.1. Definition and properties
    3.3.2. Moments
  3.4. Comparison

4. $R^2$ in the linear model with nonspherical disturbances
  4.1. Introduction
  4.2. Model and assumptions
  4.3. The transformed model; Buse's $R_{Bu}^2$
  4.4. The original model; Barten's $R_B^2$ and $r^2$

5. $R^2$ in seemingly unrelated regression equations
  5.1. Introduction
  5.2. Model and assumptions
  5.3. Definition and properties of McElroy's $R_z^2$
  5.4. The squared sample correlation coefficient of $(\Sigma^{-1/2} \otimes I)y$ and $(\Sigma^{-1/2} \otimes I)\hat{y}$

6. $R^2$ in the instrumental variable model and simultaneous equations
  6.1. Introduction
  6.2. The instrumental variable model
    6.2.1. $R^2$ in the instrumental variable model
  6.3. The simultaneous equations system
    6.3.1. $R^2$ in a single structural equation
    6.3.2. $R^2$ in the complete system

Part II: Goodness of fit in qualitative choice models

7. The binary choice and multinomial logit model
  7.1. Introduction
  7.2. The binary choice model
  7.3. The multinomial logit model

8. Goodness-of-fit measures in binary choice models
  8.1. Introduction
  8.2. Efron's $R_{Ef}^2$
  8.3. The likelihood ratio index: McFadden's $R_{MF}^2$
  8.4. Miscellaneous measures
  8.5. Examples

9. Tests on the distributional assumption
  9.1. Introduction
  9.2. Tests based on parametric families of distribution functions
    9.2.1. Some classes of distributions
    9.2.2. Score test based on the Perks class of distributions
    9.2.3. Test on the distributional assumption in the multinomial logit model
  9.3. Simulation results
  9.4. A non-parametric approach: the Nadaraya-Watson estimator
    9.4.1. The asymptotic distribution of the supremum absolute deviation
    9.4.2. A quadratic functional
  Appendix 9.A. Calculations for the Perks class of distributions
  Appendix 9.B. Proof of Theorem 9.1

10. Overall goodness-of-fit tests
  10.1. Introduction
  10.2. A goodness-of-fit test for the binary choice model based on weighted squared residuals
    10.2.1. The asymptotic distribution of $T_n$
    10.2.2. Chi-square test and simulation results
  10.3. A goodness-of-fit test for the multinomial logit model based on weighted squared residuals
    10.3.1. The asymptotic distribution of $T_n^m$
    10.3.2. Chi-square test and empirical examples
    10.3.3. The conditional logit model
  10.4. A test based on the maximum rank correlation estimator
    10.4.1. Hausman's test
    10.4.2. Han's maximum rank correlation estimator and the rank estimator
  Appendix 10.A. Proofs of Theorems 10.1 and 10.2
  Appendix 10.B. Proofs of Theorems 10.3 and 10.4

11. Summary and conclusions
  11.1. $R^2$ in the linear model
  11.2. Goodness of fit in qualitative choice models

Appendix
Notation
References
Samenvatting (Summary in Dutch)
Author Index
Subject Index
1. Introduction
1.1. Goodness-of-fit measures in Econometrics

A goodness-of-fit measure is one of the many tools available to a researcher for evaluating the performance of a model. It is a summary statistic that relates a model's estimated outcome to an actual observed outcome; for example, the estimated (fitted) value of the dependent variable is compared with its observed value. In order to be self-contained, a goodness-of-fit measure has fixed lower and upper bounds. In practice these bounds are zero and one, and values in between are translated into a ranking of the performance of the model: from "very poor" (0) to "very good" (1).

Apart from the goodness-of-fit measure, other key factors that play a role in the evaluation of a model's performance include the sign, magnitude and significance of the estimated values of the parameters, and a variety of test results on some of the basic assumptions underlying the model. The judgement of the model is then a 'weighted average' of all these factors, each factor having a different weight based on the preferences of the individual researcher, but also depending on the purpose for which the model was constructed. For example, when a model has been formulated to test certain hypotheses concerning parameter values, a relatively large weight will be given to the outcome of the parameter estimates as compared to the goodness-of-fit statistic. On the other hand, if a model's objective is to forecast or simulate, then its past and current performance, as summarized by the goodness-of-fit measure, may be given a larger weight than the estimated parameter values.

The two model types for which goodness-of-fit measures will be considered in this study are linear models and qualitative choice models. In the latter case, attention will primarily be focused on the binary choice model and, to a lesser extent, the multinomial logit model. There is a substantial difference in defining a goodness-of-fit measure for the linear model as compared to the binary choice model. In the linear model, the estimated value of the dependent variable is directly related to
the actual observed value. For example, if the dependent variable in a linear model is the demand for oil, measured in barrels, then the model's outcome will be an estimate of the demand for oil in the same unit of measurement. Hence, a goodness-of-fit measure in a linear model can be based upon the direct comparison of the observed and estimated (fitted) value of the dependent variable. As a consequence of this, there is a widely accepted and often used goodness-of-fit measure in the univariate classical linear regression model: the coefficient of determination, denoted by $R^2$. In Part I of this study, this goodness-of-fit measure will be considered. Furthermore, as the usefulness of this measure is limited to the standard univariate model, the question arises as to whether we can construct related measures for non-standard univariate linear models and multivariate linear models.

In the binary choice model, the outcomes of the model are estimated individual probabilities on the realization of the binary dependent variable. Because the estimated probabilities lie in the range (0, 1), whilst the dependent variable can only take the value 0 or 1, a direct comparison between the model's outcome and the observed value of the dependent variable is not apparent. As a consequence, the coefficient of determination of the linear model is not applicable as a goodness-of-fit measure in the binary choice model; several pseudo-$R^2$ measures have, however, been proposed and used by different researchers. In Part II of this study, a comparison is made between different pseudo-$R^2$ measures in the binary choice model. Furthermore, because these measures seem unable to play a role similar to that of $R^2$ in the linear model, several goodness-of-fit tests will be developed. Two of these are tests on the distributional assumption of the qualitative choice model; two other tests considered here are not specific, merely testing the overall assumptions of the model. All tests can play a similar role in the model evaluation as goodness-of-fit measures, although their background and interpretation are of course completely different.

The major issues raised in Part I: $R^2$ in the linear model, and Part II: Goodness of fit in qualitative choice models, are summarized below in Sections 1.2 and 1.3 respectively.
1.2. $R^2$ in the linear model

The coefficient of determination and related measures will be considered for the following linear model types (see Scheme 1.1): the univariate classical linear regression model; the univariate model without a constant term; and the univariate model with nonspherical disturbances. Furthermore, for the multivariate case, $R^2$ will be discussed for the Seemingly Unrelated Regression Equations (SURE) and for a system of simultaneous equations.
Scheme 1.1. $R^2$ in the linear model

Model                            Measure                                                 Formula no.

Univariate
  OLS                            Coefficient of determination $R^2$                      (2.4)
  OLS, without a constant        Barten's $R_B^2$                                        (3.2)
                                 Correlation coefficient $r^2$                           (3.7)
  GLS, transformed model         Buse's $R_{Bu}^2$                                       (4.8)
  GLS, original model            $R_B^2$, $r^2$

Multivariate
  SURE                           McElroy's $R_z^2$ (= generalization of $R_{Bu}^2$)      (5.7)
  Simultaneous equations
    Single equation
      Reduced form (2SLS)        Carter-Nagar $R_{CN}^2$                                 (6.17)
      Structural form (IV)       $R_B^2$ ($r^2$)                                         (6.18)
    System
      Reduced form (2SLS)        Carter-Nagar $R_{CN}^2$ (= McElroy's $R_z^2$)           (6.20)
      Structural form (IV)       Generalized $R_z^2$, related to $R_B^2$ and to $r^2$    (6.21), (6.22)
In Chapter 2, a survey of properties of the coefficient of determination in the univariate classical linear regression model is presented. The measure is based on the Ordinary Least Squares (OLS) estimation results. Attention is paid to the role of $R^2$ in the selection of nested models; to its probability limit, (asymptotic) distribution and moments; and to some limitations in its use. The chapter concludes with a list of properties of $R^2$, on the basis of which the measure gained its status, and which serves as a guideline when related measures are considered for nonstandard models.

Because certain properties of $R^2$ are only valid when the model contains a constant term, Chapter 3 investigates the consequences for $R^2$ when the model does not contain a constant term. A key property of the goodness-of-fit measure, that of fixed upper and lower bounds, is violated in this case and several alternatives are considered, viz. Barten's $R_B^2$ (BARTEN, 1987) and $r^2$, the squared correlation coefficient of the dependent variable and its estimated counterpart. Although these measures both have the proper lower and upper limits of zero and one, this is obtained at the expense of a less clear interpretation than that of $R^2$. The performance of the two measures, $R_B^2$ and $r^2$, is compared according to different criteria. One of these is their relative behaviour in small samples. Therefore, the first two moments of both measures are derived, and calculated for a particular model. Furthermore, both measures are calculated for several estimated price equations as used by VAN DIJK et al. (1991) in constructing a simulation model of the Dutch tourist sector.

In Chapter 4, the model with nonspherical disturbances is considered, which is another linear model for which the conventional definition of the coefficient of determination does not serve as a proper goodness-of-fit measure. For this model, the estimation method is Generalized Least Squares (GLS). The measure proposed by BUSE (1973), $R_{Bu}^2$, is shown to be merely a goodness-of-fit measure for a transformed version of the original model, with properties equivalent to those of the OLS-$R^2$. As goodness-of-fit measures for the original model, $R_B^2$ and $r^2$ are considered. In this case, the properties of these two measures derived in Chapter 3 apply.

Chapter 5 presents several properties of the goodness-of-fit measure $R_z^2$, defined by MCELROY (1977), for the SURE-model. As this model is
equivalent to a model with nonspherical disturbances, there is a close link between $R_z^2$ and $R_{Bu}^2$. In particular, attention is paid to the derivation of the probability limit of the measure. A possible alternative to McElroy's measure is the squared correlation coefficient of the dependent variable, premultiplied by the inverse of the square root of its variance matrix, and its estimated counterpart. It will be shown that this measure is not invariant under changes of location or scale of the dependent variable and so is of no use in this model.

In the last chapter of Part I, Chapter 6, the Carter-Nagar $R_{CN}^2$ (CARTER AND NAGAR, 1978) for a single equation in a system of simultaneous equations, and for the system as a whole, is shown to be a goodness-of-fit measure for the reduced form of the model, based on Two Stage Least Squares (2SLS) estimation. Barten's $R_B^2$ is considered as a goodness-of-fit measure for a single structural equation, estimated by the method of Instrumental Variables (IV). Of course, $r^2$ is a candidate as well, but will not be discussed extensively in order to avoid repetition. When the whole system is estimated using the IV estimation procedure, two generalizations of McElroy's measure are proposed, one related to $R_B^2$, the other to $r^2$.
1.3. Goodness of fit in qualitative choice models

In qualitative choice models there are several ways to measure the fit of the model. In Part II of this study three different kinds of measures are distinguished. After introducing the binary choice and multinomial logit model in Chapter 7, Chapter 8 is concerned with pseudo-$R^2$ measures in the binary choice model. As the name suggests, these measures are developed in an attempt to obtain a measure which exhibits the same virtues (and vices!) as the coefficient of determination in the linear model. The following measures are discussed: Efron's $R_{Ef}^2$ (EFRON, 1978), which is based on a comparison of the estimated probabilities with the actual binary outcome; McFadden's likelihood ratio index, $R_{MF}^2$ (McFADDEN, 1974), which compares the log-likelihood of the model with some baseline log-likelihood; the
measure proposed by VEALL AND ZIMMERMANN (1990), also based on the log-likelihood of the model; the measure proposed by MCKELVEY AND ZAVOINA (1975), which is a kind of OLS-$R^2$ of the underlying continuous model (as explained in Chapter 7); and two measures based on prediction, viz. the proportion of correctly predicted observations and the average probability of correct prediction.

After establishing some properties of these measures, their behaviour is studied on generated and real-life data. The experiments on the generated data consist of calculating the values of the measures in a series of five nested (expanding) models, where the last model in the series is the true model. Furthermore, by controlling for the value of the mean of the binary dependent variable, its influence on the behaviour of the pseudo-$R^2$ measures is analysed. The real-life data are data on private car ownership (as used by CRAMER (1990) and CRAMER AND RIDDER (1991)). The experiment based on these data is similar to that performed on the generated data: the values of the measures are calculated for five nested models, created by consecutively adding an explanatory variable.

In Chapter 9, goodness-of-fit tests are introduced that analyse the assumed form of the distribution function. One test is a score test and is obtained by defining a class of distribution functions that depends on one or more parameters in which the logistic distribution ((multinomial) logit model), or normal distribution (probit model), is embedded. The asymptotic chi-square test based on the scores can be performed on the specific value(s) of the parameter(s), thus testing whether the model is consistent with the distributional assumption. The (symmetric) class of distribution functions proposed is that of Perks (see JOHNSON AND KOTZ, 1970, pp. 14-16). The logistic distribution is a member of this class and so the logit assumption can be tested. The performance of the test in the binary choice model, based on this class of distributions, is analysed and compared with the performance of the score test based on a class of distributions as proposed by PRENTICE (1976). Here, use is made of the results obtained by DE JONG (1989), who performed a Monte Carlo study for eight different model (mis-)specifications.
Other tests on the distributional assumption are based on the nonparametric Nadaraya-Watson estimator of the distribution function,
based on kernel smoothing (see HÄRDLE, 1990). The result that the asymptotic distribution of the absolute supremum distance between this estimator and the null distribution is an extreme value distribution, in the case of a continuous response variable, is shown to hold for the binary choice model as well. Furthermore, an asymptotically normally distributed quadratic functional, as proposed by HÄRDLE AND MAMMEN (1990), is considered.

In Chapter 10, some measures are derived that test the overall assumptions of the model. It is demonstrated that the sum of weighted squared residuals in both the binary choice model and the multinomial logit model asymptotically follows a normal distribution, instead of the often believed asymptotic chi-square distribution. Based on this fact a new chi-square distributed test statistic is introduced. The performance of the test is analysed by means of a Monte Carlo study for the binary choice model; for the multinomial logit model the test statistic is calculated for different models based on the data concerning private car ownership used in Chapter 8.

A further asymptotic test that is presented in Chapter 10 is based on Hausman's testing procedure (HAUSMAN, 1978). This test is based on the comparison of two asymptotically normally distributed estimators in the same model, one of which is efficient under the basic model assumptions, whilst the other one is consistent under alternative specifications. The test is adopted to compare the logit or probit estimator in the binary choice model with Han's nonparametric maximum rank correlation estimator (HAN, 1987). Before the test is developed, a nonparametric estimator is introduced, the rank estimator, that appears to be identical to Han's estimator. Consistency of the rank estimator is proven in this chapter; for the asymptotic normality result reference is made to SHERMAN (1991).

Finally, Chapter 11 contains a summary of the main results and some concluding remarks.
Part I: $R^2$ in the linear model
2. The coefficient of determination, a survey
2.1. Introduction

Whenever Ordinary Least Squares (OLS) results of a linear model are presented, they are accompanied by the coefficient of determination, $R^2$. For most researchers, $R^2$ plays an important role in validating the model at hand; a model that generates a high value of the measure is generally considered to be good or even successful. Whilst the cautious researcher may be averse to using the statistic, it is usually demanded by project commissioners. Despite its popularity, $R^2$ should be treated with great care: as a testing device, the large significance level it exhibits is intolerable; as an estimator of a sort of population counterpart it is very unreliable in small samples; and the measure becomes meaningless if the slope of the regression surface is relatively steep, e.g. when there is a time trend.

The linear model and the coefficient of determination are introduced in Sections 2.2 and 2.3 respectively. The role of $R^2$ in hypothesis testing is discussed in Section 2.4, where use is made of the work of THEIL (1971) and DEN BUTTER AND VAN DE GEVEL (1979). Section 2.5 is devoted to the probability limit, the (asymptotic) distribution and the moments of $R^2$, based on the work of KOERTS AND ABRAHAMSE (1972), DE HAAN AND TACONIS-HAANTJES (1978), STEERNEMAN (1985) and CRAMER (1987). In Section 2.7 some (more) limitations on the use of the goodness-of-fit measure become apparent; this follows the discussion in BARRETT (1974) and SMITH (1974).
2.2. The linear model, definitions and assumptions

Consider the linear model

$$y = X\beta + \varepsilon, \tag{2.1}$$

where $y$ is an $n$-vector of observations on the dependent variable; $X$ is an $n \times k$ matrix of $n$ observations on $k$ fixed, non-stochastic explanatory variables; $\beta$ is a $k$-vector of nonrandom, unknown parameters to be estimated; and $\varepsilon$ is an $n$-vector of unobservable, stochastic disturbances. The $\{\varepsilon_i\}$ are i.i.d. with

$$E(\varepsilon_i) = 0, \qquad \operatorname{var}(\varepsilon_i) = \sigma^2.$$

Equation (2.1) contains a constant term and so the first column of $X$ is a vector of units (denoted by $s$). By partitioning the matrix $X$ as $[s \; Z]$ and the parameter vector $\beta$ correspondingly as $(\alpha \; \gamma')'$, $\alpha$ being the constant and $\gamma$ the slope coefficients, model (2.1) can alternatively be written as

$$y = s\alpha + Z\gamma + \varepsilon.$$

For the asymptotic behaviour of the matrix of explanatory variables, it is assumed that

$$\lim_{n \to \infty} n^{-1} X'X = Q_X, \tag{2.2}$$

where $Q_X$ is a finite non-singular matrix.

The ordinary least squares (OLS) estimator $b$ of $\beta$ minimizes the sum of squares $(y - Xb)'(y - Xb)$, and is the best linear unbiased estimator (BLUE). Its formula is given by

$$b = (X'X)^{-1}X'y,$$

with expectation and covariance matrix

$$E(b) = \beta; \qquad \operatorname{var}(b) = \sigma^2 (X'X)^{-1}.$$

The OLS estimators of the elements of $\beta$, $\alpha$ and $\gamma$, are given by

$$a = (s'M_Z s)^{-1} s'M_Z y; \qquad c = (Z'NZ)^{-1} Z'Ny,$$

where $M_Z := I - Z(Z'Z)^{-1}Z'$ and $N := M_s = I - n^{-1}ss'$.

The estimator $b$ implies as an estimate, or theoretical value, for $y$

$$\hat{y} = Xb = X(X'X)^{-1}X'y = P_X y,$$

where $P_X$ is defined implicitly. The residual vector $e$ is defined as

$$e = y - \hat{y} = (I - P_X)y$$

and is perpendicular to the plane spanned by the columns of $X$, hence $X'e = 0$, and therefore

$$\hat{y}'Ne = 0; \qquad e'Ne = e'e. \tag{2.3}$$

Finally, an unbiased estimator of the variance $\sigma^2$ is

$$\hat{\sigma}^2 = \frac{e'e}{n-k}.$$
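The estimators of this section can be checked on a small numerical case. The following is a minimal sketch in plain Python, assuming a model with a constant and a single regressor (so $k = 2$); the function name `ols_simple` and the data are illustrative assumptions and do not come from the text. With one regressor, $c = (Z'NZ)^{-1}Z'Ny$ reduces to a ratio of sums of products of deviations from the mean, because premultiplying by $N$ subtracts the sample mean.

```python
def ols_simple(z, y):
    """OLS for y = alpha + gamma * z + eps with one regressor (k = 2).

    Returns (a, c, residuals, sigma2_hat). The slope c = (Z'NZ)^{-1} Z'Ny
    is computed from deviations from the mean, since N = I - ss'/n demeans.
    """
    n = len(y)
    z_bar = sum(z) / n
    y_bar = sum(y) / n
    zNz = sum((zi - z_bar) ** 2 for zi in z)                        # Z'NZ
    zNy = sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, y))  # Z'Ny
    c = zNy / zNz                      # slope estimate
    a = y_bar - c * z_bar              # estimate of the constant
    e = [yi - a - c * zi for zi, yi in zip(z, y)]                   # residuals
    sigma2_hat = sum(ei ** 2 for ei in e) / (n - 2)                 # e'e / (n - k)
    return a, c, e, sigma2_hat


# Illustrative (made-up) data
z = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, c, e, sigma2_hat = ols_simple(z, y)

# X'e = 0: the residuals are orthogonal to the constant and to z
orthogonal_to_s = abs(sum(e)) < 1e-9
orthogonal_to_z = abs(sum(zi * ei for zi, ei in zip(z, e))) < 1e-9
print(a, c, sigma2_hat, orthogonal_to_s, orthogonal_to_z)
```

The two orthogonality checks correspond to $X'e = 0$ with $X = [s \; Z]$; the first of them is what makes $Ne = e$ hold.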
2.3. The coefficient of determination

The coefficient of determination, $R^2$, is based on the decomposition of the total variation of the dependent variable into an explained part and an unexplained or residual part:

$$y'Ny = \hat{y}'N\hat{y} + 2\hat{y}'Ne + e'Ne = \hat{y}'N\hat{y} + e'e,$$

where the second equality follows from (2.3). $R^2$ is then defined as the fraction of the total variation of $y$ accounted for by the explanatory variables, which equals unity minus the fraction of unexplained variation:

$$R^2 = \frac{\hat{y}'N\hat{y}}{y'Ny} = 1 - \frac{e'e}{y'Ny} = \frac{b'X'NXb}{y'Ny} = \frac{c'Z'NZc}{y'Ny}, \tag{2.4}$$

where the last equality holds because $Ns = 0$. Clearly, $0 \le R^2 \le 1$. $R^2$ is zero if all estimated slope coefficients are zero and $R^2$ is one if $e = 0$. So the interpretation of $R^2$ in terms of goodness of fit is that the closer $R^2$ is to unity, the better the performance of the explanatory variables.

Furthermore, $R^2$ is equal to the squared sample correlation coefficient of $y$ and $\hat{y}$. Denote this coefficient by $r$, then

$$r^2 = \frac{(y'N\hat{y})^2}{y'Ny \; \hat{y}'N\hat{y}} = \frac{\{\hat{y}'N(\hat{y} + e)\}^2}{y'Ny \; \hat{y}'N\hat{y}} = \frac{(\hat{y}'N\hat{y})^2}{y'Ny \; \hat{y}'N\hat{y}} = \frac{\hat{y}'N\hat{y}}{y'Ny} = R^2,$$
and so $R^2$ can be considered to be an estimator of a sort of squared population correlation coefficient. Following BARTEN (1962, 1987), the square of this correlation coefficient in model (2.1) is

$$\rho^2 = \frac{\beta'X'NX\beta}{\beta'X'NX\beta + n\sigma^2}. \tag{2.5}$$

Let $\tilde{c}$ be some estimator of the slope coefficients $\gamma$. Then the OLS estimator $c$ of $\gamma$ maximizes $r^2(\tilde{c})$, defined by

$$r^2(\tilde{c}) = \frac{(\tilde{c}'Z'Ny)^2}{\tilde{c}'Z'NZ\tilde{c} \;\, y'Ny},$$

and $r^2_{\max}$ equals $R^2$. To see this, rewrite $r^2(\tilde{c})$ as

$$r^2(\tilde{c}) = \frac{\psi(\tilde{c})}{y'Ny},$$

with

$$\psi(\tilde{c}) = \frac{(\tilde{c}'Z'Ny)^2}{\tilde{c}'Z'NZ\tilde{c}} = \frac{\tilde{c}'A\tilde{c}}{\tilde{c}'B\tilde{c}},$$

where $A := Z'Nyy'NZ$ and $B := Z'NZ$.

The first-order differential of $\psi(\tilde{c})$ with respect to $\tilde{c}$ is

$$d\psi(\tilde{c}) = \frac{2}{\tilde{c}'B\tilde{c}}\, \tilde{c}'A\, d\tilde{c} - 2\, \frac{\tilde{c}'A\tilde{c}}{(\tilde{c}'B\tilde{c})^2}\, \tilde{c}'B\, d\tilde{c},$$

and so the first-order partial derivative is

$$\frac{\partial \psi(\tilde{c})}{\partial \tilde{c}} = \frac{2}{\tilde{c}'B\tilde{c}}\, A\tilde{c} - 2\, \frac{\tilde{c}'A\tilde{c}}{(\tilde{c}'B\tilde{c})^2}\, B\tilde{c}.$$

The first-order condition, $\partial \psi(\tilde{c}) / \partial \tilde{c} = 0$, yields $A\tilde{c} = \psi(\tilde{c})B\tilde{c}$, which is equivalent to

$$B^{-1/2}AB^{-1/2}\,(B^{1/2}\tilde{c}) = \psi(\tilde{c})\,(B^{1/2}\tilde{c}),$$

and so the maximum and minimum of $\psi(\tilde{c})$ are given by

$$\psi(\tilde{c})_{\max} = \lambda_{\max}(B^{-1/2}AB^{-1/2}); \qquad \psi(\tilde{c})_{\min} = \lambda_{\min}(B^{-1/2}AB^{-1/2}),$$

where $\lambda(W)$ denote the characteristic roots of some matrix $W$. Clearly, $\{B^{1/2}\tilde{c}\}$ are the set of characteristic vectors of $B^{-1/2}AB^{-1/2}$. Substituting back gives

$$\psi(\tilde{c})_{\max} = \lambda_{\max}\{(Z'NZ)^{-1/2}Z'Nyy'NZ(Z'NZ)^{-1/2}\} = y'NZ(Z'NZ)^{-1}Z'Ny,$$

$$\psi(\tilde{c})_{\min} = \lambda_{\min}\{(Z'NZ)^{-1/2}Z'Nyy'NZ(Z'NZ)^{-1/2}\} = 0.$$

The characteristic vector corresponding to $\lambda_{\max}$ is $(Z'NZ)^{-1/2}Z'Ny$, and premultiplying by $(Z'NZ)^{-1/2}$ gives for the vector $\tilde{c}$ that maximizes $r^2$

$$\tilde{c} = (Z'NZ)^{-1}Z'Ny = c,$$

the OLS estimator of the slope coefficients. Combining the results then gives the following expression for the maximum of $r^2(\tilde{c})$:

$$r^2_{\max} = \frac{y'NZ(Z'NZ)^{-1}Z'Ny}{y'Ny} = \frac{c'Z'NZc}{y'Ny}.$$

To say that $R^2$ itself is maximized by the OLS estimator, as this estimator minimizes $e'e$, seems like a circular argument, because in constructing the measure, use is made of OLS properties (i.e. $Ne = e$, $X'e = 0$) to arrive at the expression $R^2 = 1 - e'e / y'Ny$.
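The two results of this section, that $R^2$ equals the squared sample correlation of $y$ and the fitted values and that the OLS slopes maximize $r^2$ over all trial slope vectors, can be illustrated numerically. The sketch below is plain Python under stated assumptions: a constant plus two regressors, made-up data, and hypothetical helper names (`demean`, `dot`, `r2_of`). The grid search is only a coarse illustration of the maximization, not a proof.

```python
def demean(v):
    """Apply N = I - ss'/n, i.e. subtract the sample mean."""
    m = sum(v) / len(v)
    return [vi - m for vi in v]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


# Illustrative (made-up) data: constant plus two regressors
z1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
z2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [3.1, 3.9, 7.2, 7.8, 11.1, 11.9]

d1, d2, dy = demean(z1), demean(z2), demean(y)

# B = Z'NZ (2 x 2, symmetric) and g = Z'Ny (2 x 1)
b11, b12, b22 = dot(d1, d1), dot(d1, d2), dot(d2, d2)
g1, g2 = dot(d1, dy), dot(d2, dy)

# OLS slopes c = B^{-1} g via the explicit 2 x 2 inverse
det = b11 * b22 - b12 ** 2
c1 = (b22 * g1 - b12 * g2) / det
c2 = (b11 * g2 - b12 * g1) / det

yNy = dot(dy, dy)
fitted_dev = [c1 * u + c2 * v for u, v in zip(d1, d2)]   # N y_hat
resid = [t - f for t, f in zip(dy, fitted_dev)]          # e
r_squared = 1.0 - dot(resid, resid) / yNy                # 1 - e'e / y'Ny

# Squared sample correlation coefficient of y and y_hat
corr_sq = dot(dy, fitted_dev) ** 2 / (yNy * dot(fitted_dev, fitted_dev))


def r2_of(t1, t2):
    """r^2 for a trial slope vector (t1, t2): (t'Z'Ny)^2 / (t'Z'NZ t * y'Ny)."""
    num = (t1 * g1 + t2 * g2) ** 2
    den = (b11 * t1 ** 2 + 2.0 * b12 * t1 * t2 + b22 * t2 ** 2) * yNy
    return num / den


# Trial vectors on a coarse grid (origin excluded) never beat the OLS slopes
grid = [(p / 4.0, q / 4.0) for p in range(-8, 9) for q in range(-8, 9)
        if (p, q) != (0, 0)]
best_trial = max(r2_of(t1, t2) for t1, t2 in grid)

print(round(r_squared, 6), round(corr_sq, 6), best_trial <= r_squared + 1e-12)
```

Because $r^2$ is invariant to the scale of the trial vector, only its direction matters, which is why a model with a single regressor would not show the maximization at all.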
2.4. $R^2$ and hypothesis testing

Suppose there are two competing linear model specifications for one dependent variable and one would like to test whether one specification is better than the other. A possible testing procedure based on the squared population correlation coefficients of the two models, $\rho_1^2$ and $\rho_2^2$ as defined in (2.5), is to test the hypothesis $H_0: \rho_1^2 = \rho_2^2$, and if this hypothesis is rejected, favour the model with the highest $\rho^2$. Since $R^2$ is an estimator of $\rho^2$, the question arises as to whether it is possible to construct such a test on the basis of the coefficient of determination. The answer is negative:
as will be described in the next section, the distribution of $R^2$ is a very complicated one and to test the null hypothesis based on it is a very difficult, if not intractable, undertaking. There is, however, one special case for which the test can be performed. Consider as Model 1

$$y = X\beta + \varepsilon = s\alpha + Z\gamma + \varepsilon,$$

and as Model 2 the model where the constant is considered to be the only explanatory variable,

$$y = s\alpha + \varepsilon.$$

Clearly, $\rho_2^2 = 0$, and so the null hypothesis in this case is $H_0: \rho_1^2 = 0$ (versus $H_1: \rho_1^2 > 0$), which by (2.5) is equivalent to the hypothesis $H_0: \gamma = 0$. Now $R^2$ has a one-to-one relation with the F-test statistic for testing this hypothesis. Because the sum of squared residuals under $H_0$ is $y'Ny$, the statistic

$$f = \frac{y'Ny - e'e}{e'e}\, \frac{n-k}{k-1} = \frac{R^2}{1 - R^2}\, \frac{n-k}{k-1} \tag{2.6}$$

is $F_{k-1,\,n-k}$ distributed under $H_0$, whenever $\varepsilon$ is normally distributed.
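The one-to-one relation between $R^2$ and the F statistic in (2.6) can be made concrete with a short sketch. The function names `f_from_r2` and `r2_from_f` and the numbers used are assumptions for illustration only; the inverse mapping is obtained by solving (2.6) for $R^2$.

```python
def f_from_r2(r2, n, k):
    """F statistic (2.6) for H0: all slope coefficients are zero."""
    return r2 / (1.0 - r2) * (n - k) / (k - 1)


def r2_from_f(f, n, k):
    """Inverse of (2.6): R^2 = f(k-1) / (f(k-1) + n - k)."""
    return f * (k - 1) / (f * (k - 1) + (n - k))


# Illustrative numbers (assumed): n = 20 observations, k = 3 parameters
f = f_from_r2(0.5, n=20, k=3)
print(f)                          # 8.5
print(r2_from_f(f, n=20, k=3))    # 0.5
```

Since the mapping is increasing in $R^2$ for fixed $n$ and $k$, rejecting $H_0$ for large $f$ is the same as rejecting it for large $R^2$.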
Because of the aforementioned problems with regard to the distribution of $R^2$ and hence the difficulties in using it as a test device, $R^2$ is often used as a guide on its own: the higher $R^2$, the better the fit of the model, and if two models are compared one chooses the model with the highest $R^2$. This, however, is a very poor method of model selection with a low degree of statistical backing (see THEIL (1971); DEN BUTTER AND VAN DE GEVEL (1979)). To show this, first the coefficient of determination adjusted for degrees of freedom is introduced. $R^2$ does not penalize the number of explanatory variables included, and because the fit of the model in general gets better when the number of regressors increases (always, if the models are nested), it seems a logical step to correct the measure for this. The way
in which this is done is the following. The number of degrees of freedom of $e'e$ is $n-k$, and $e'e/(n-k)$ is an unbiased estimator of $\sigma^2$. The number of degrees of freedom of $y'Ny$ is $n-1$ (and $y'Ny/(n-1)$ is an unbiased estimator of the variance of $y$ when $y_i$ has the same expectation $\mu$ for every $i$, which is not the case here). $R^2$ corrected for degrees of freedom is then defined as

$$\bar{R}^2 = 1 - \frac{e'e/(n-k)}{y'Ny/(n-1)} = R^2 - \frac{k-1}{n-k}\,(1 - R^2). \tag{2.7}$$

$\bar{R}^2$ is smaller than $R^2$ except when $k = 1$ or $R^2 = 1$ (then $\bar{R}^2 = R^2$) and the correction factor increases when $k$ increases, $n$ decreases or $R^2$ decreases. Clearly, $\bar{R}^2$ can become less than zero.

To make clear what happens when model selection is performed on the basis of $\bar{R}^2$, consider two nested models. The first model has $k$ explanatory variables and the coefficient of determination is denoted by $R_k^2$, with corresponding corrected measure $\bar{R}_k^2$. The sum of squared residuals is denoted by $e^{*\prime}e^{*}$. The second model is the first model with $h$ regressors added. The coefficient of determination of this model is denoted by $R_{k+h}^2$ ($\bar{R}_{k+h}^2$) and the sum of squared residuals is $e'e$. The difference between $R_{k+h}^2$ and $R_k^2$ is never negative, but the difference between $\bar{R}_{k+h}^2$ and $\bar{R}_k^2$ can have any sign. Let $f$ be the $F_{h,\,n-k-h}$ test statistic for testing the hypothesis $H_0: \beta_{k+1}, \ldots, \beta_{k+h} = 0$. Then

$$\frac{1 - \bar{R}_k^2}{1 - \bar{R}_{k+h}^2} = \frac{e^{*\prime}e^{*}}{e'e}\, \frac{n-k-h}{n-k} = \frac{n-k-h}{n-k} + \frac{e^{*\prime}e^{*} - e'e}{e'e}\, \frac{n-k-h}{n-k} = 1 - \frac{h}{n-k}\,(1 - f),$$

and so, if $\bar{R}_{k+h}^2 > \bar{R}_k^2$, it follows that $f > 1$ and choosing the model with
The coefficient of determination, a survey
highest $\bar{R}^2$ is equivalent to testing the hypothesis $H_0: \beta_{k+1} = \dots = \beta_{k+h} = 0$ with a relatively high significance level. For example, if $n = 20$, $k = 3$ and $h = 2$, $f$ is $F_{2,15}$ distributed and the significance level of the test is 0.39.
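The significance level quoted in this example can be checked with a short sketch. For an F variable with 2 numerator degrees of freedom the tail probability has the closed form $P(F > x) = (1 + 2x/d)^{-d/2}$, with $d$ the denominator degrees of freedom. The function name and the use of this shortcut are illustrative assumptions, not part of the original text. The result rounds to 0.39 for $d = n-k-h = 15$, and also for $d = 16$.

```python
def f_tail_probability_two_df(x: float, denom_df: int) -> float:
    """Return P(F > x) for an F(2, denom_df) variable.

    Closed form that is valid only when the numerator degrees of freedom equal 2.
    """
    return (1.0 + 2.0 * x / denom_df) ** (-denom_df / 2.0)


n, k, h = 20, 3, 2
denom_df = n - k - h  # denominator degrees of freedom of the F test
# Selecting the model with the highest adjusted R-squared rejects H0 whenever f > 1.
implied_level = f_tail_probability_two_df(1.0, denom_df)
print(round(implied_level, 2))  # prints 0.39
```

The point of the sketch is only that the rule "keep the extra regressors if the adjusted measure rises" behaves like a test with a size of almost 40 percent.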
If $h$ equals one, $f$ is equal to the square of the $t$-statistic corresponding to the extra regressor, so the statement in that case can be alternatively formulated as: $\bar{R}_{k+1}^2 > \bar{R}_k^2$ if $t_{k+1}^2 > 1$.
To conclude this section, consider $h$ linear restrictions on the parameter vector $\beta$, given by

$$H_0: A\beta = q,$$

with $A$ an $h \times k$ matrix and $q$ an $h$-vector. When the restrictions are included in the estimation procedure, the vector of residuals, $e^*$, is given by

$$e^* = y - y^*, \quad \text{with} \quad y^* = Xb^*, \quad b^* = b - (X'X)^{-1}A'\{A(X'X)^{-1}A'\}^{-1}(Ab - q).$$

As long as there are no restrictions involved with the constant, i.e. the first column of $A$ is a zero vector, it is easily seen that $s'e^* = 0$, and so $y'Ny = y^{*\prime}Ny^* + e^{*\prime}e^*$. Therefore, there are no complications in using $R^2$ in the restricted model. Denote this measure by $R^{2*}$, then the $F$-test statistic for testing the $h$ linear restrictions can be expressed in terms of $R^2$ in the following way:

$$f = \frac{e^{*\prime}e^* - e'e}{e'e}\,\frac{n-k}{h} = \frac{R^2 - R^{2*}}{1-R^2}\,\frac{n-k}{h}. \qquad (2.8)$$

Clearly, (2.6) is just a special case of (2.8), because in the model where the restriction is incorporated that all coefficients except the constant are zero, the value of $R^{2*}$ equals zero.
2.5. Probability limit, distribution and moments
KOERTS AND ABRAHAMSE (1972) derived the probability limit of $R^2$. Define $H_X := \lim_{n\to\infty} n^{-1}X'NX$ $^1$ and $H_Z := \lim_{n\to\infty} n^{-1}Z'NZ$, then this limit is

$$\operatorname{plim} R^2 = \frac{\beta'H_X\beta}{\beta'H_X\beta + \sigma^2} = \frac{\gamma'H_Z\gamma}{\gamma'H_Z\gamma + \sigma^2} = \lim_{n\to\infty} P^2, \qquad (2.9)$$

hence,
plim $R^2 = 0$ if $\gamma = 0$,
plim $R^2 = 1$ iff $\sigma^2 = 0$.
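Proof: First, from

because of assumption (2.2), it follows that

(2.10)

For the numerator of $R^2$, the expression then is

$$\operatorname{plim} n^{-1}\hat{y}'N\hat{y} = \operatorname{plim} n^{-1}y'P_XNP_Xy = \operatorname{plim} n^{-1}(\beta'X'NX\beta + 2\beta'X'NP_X\varepsilon + \varepsilon'P_XNP_X\varepsilon) = \beta'H_X\beta.$$

The probability limit for the denominator is

$$\operatorname{plim} n^{-1}y'Ny = \operatorname{plim} n^{-1}(\beta'X'NX\beta + 2\beta'X'N\varepsilon + \varepsilon'N\varepsilon) = \beta'H_X\beta + \sigma^2,$$

because $\operatorname{plim}(n^{-1}\beta'X'N\varepsilon) = 0$ by (2.2) and (2.10), and $\operatorname{plim}(n^{-1}\varepsilon'\varepsilon) = \sigma^2$ by Khinchine's law of large numbers, using the fact that the $\{\varepsilon_i\}$ are i.i.d. and therefore the $\{\varepsilon_i^2\}$ are i.i.d., with mean $\sigma^2$. ■

$^1$ This limit exists, because the limit of $n^{-1}X'X$ exists by assumption (2.2) and $X$ contains $s$. Note that $n^{-1}X'NX = n^{-1}X'X - (n^{-1}X's)(n^{-1}s'X)$.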
Define $\rho^2 := \lim_{n\to\infty} P^2$, then the result of (2.9) shows that $R^2$ is a consistent estimator of $\rho^2$. $R^2$, however, is not an unbiased estimator of $P^2$, and BARTEN (1962) proposed a bias correction. Under the assumption that $\varepsilon$ is normally distributed, an assumption that will be maintained throughout the remainder of this section, the expectation of $R^2$ in terms of $P^2$ is
(2.11)
The bias is clearly positive ($k \geq 2$) and is an increasing linear function of the number of explanatory variables. A corrected $R^2$, corrected for the bias up to order $n^{-1}$, is

STEERNEMAN (1985) shows that the term $o(n^{-1})$ in (2.11) can be replaced by $O(n^{-2})$ and the bias of the corrected $R^2$ is of that order as well.
DE HAAN AND TACONIS-HAANTJES (1978) derived the formula for the asymptotic expectation of $R^2$. Define

$$\theta_n = \frac{\beta'X'NX\beta}{n\sigma^2}$$
and assume that
(2.12)
for some real $w$. The expectation of $R^2$ then is, as $n \to \infty$,
(2.13)
Although De Haan and Taconis-Haantjes, in their own words, ".. were led to this investigation by our failure to understand Barten's (1962) argument leading to a different asymptotic expression for $E(R^2)$." (op. cit. p. 16), Steerneman shows that assumption (2.12) implies
and from this and (2.11), (2.13) follows in a straightforward manner. De Haan and Taconis-Haantjes also obtained the asymptotic distribution of $R^2$. Under the assumption

$R^2$ is asymptotically normally distributed,

As to the exact distribution of $R^2$, after the early work of Koerts and Abrahamse, who showed that the cumulative distribution function of $R^2$ was a function of $\beta$, $\sigma^2$ and $X$, and that this function was very sensitive to changes in $X$, it was established that the measure has a non-central beta distribution (see DE HAAN AND TACONIS-HAANTJES (1978), CRAMER (1987)). Cramer derived the density function of $R^2$ and evaluated the ...

... $\lambda = 1$, then $X's = 0$, that is, all regressors are measured in deviations from their means. If $\lambda = 0$, then $M_Xs = 0$, that is, there is a constant term (or at least the unity vector lies in the column space of $X$). Assume that
$\lambda \neq 1$. Then $0 < \lambda < 1$ (the case $\lambda = 0$ is excluded here). Further, define

$$q := (1-\lambda)^{-1/2}n^{-1/2}P_Xs. \qquad (3.5)$$

Combining definitions (3.4) and (3.5), the matrices $A$ and $B$ can be written as

$$A = P_XNP_X = P_X - (1-\lambda)qq', \qquad B = P_XNP_X + M_X = I_n - (1-\lambda)qq'. \qquad (3.6)$$

The following properties are now immediate:

$$P_Xq = q, \qquad AB = BA = P_X - (1-\lambda^2)qq', \qquad Aq = Bq = \lambda q, \qquad r(B) = n,$$

where $r(B)$ denotes the rank of $B$. Hence, $q$ is a characteristic vector of both $A$ and $B$, corresponding to the same characteristic root $\lambda$.
Chapter 3
Furthermore, because r(B) = n, all moments of R~ exist (MAGNUS, 1986, Theorem 7, p. 105). Since A and B commute, there exists one orthogonal matrix which diagonalizes both matrices. Thus
with W a diagonal matrix. The main results concerning the characteristic roots and vectors of PxNPx+Mx and PxNPx are summarized in Lemma 3.1:
Lemma 3.1
1. $P_XNP_X + M_X$ has $n-1$ characteristic roots equal to 1 and one characteristic root equal to $\lambda := n^{-1}s'M_Xs$.
2. $\lambda$ is also a characteristic root of $P_XNP_X$ corresponding to the same characteristic vector $q := (1-\lambda)^{-1/2}n^{-1/2}P_Xs$.
From the results of Lemma 3.1 it follows that $\Lambda^*$, $\mu^*$, $\Lambda$ and $\Delta$ can be written as
A
=
a2- [ 0). 0In-1 ]
• **') A* [ u2 >. 0 ] ' µ *' = ( I-':,.,µ ' = 0 A** '
with A .. = a'-W. The determinant of
and
f E,
trS, and
t:,.
fSE
is, clearly, equal to
become,
fE
µ! 2/(I +2ta'->.) + µ**'µ•• /(I +2tu2)
trS
a'->./(1 +2tu2 >.) + u2(k- I)/( I +2tu2)
R 2 in the linear model without a constant term
Becauseµ*'µ*= µ'µ/.) 2 + (µ' Aµ- 0.
(5.8)
$R^2$ in seemingly unrelated regression equations
Proof: see LUKACS (1975, p. 38).
Make the following assumptions relating to the SURE-model: Assumption 5.1 : There exist m,M with 0' of Po is asymptotically normally distributed:
(7.5)
where 8 2 1ogL
nu
= - I'1mn_oon -1
---1 8p8p'
R
"'O
(7.6)
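with $f_i^0 = f(x_i'\beta^0)$. Because $\hat\beta_n$ is a solution of the equation

$$\frac{\partial\log L}{\partial\beta}\Big|_{\hat\beta_n} = 0,$$

the following property can be obtained by Taylor expansion:

$$\hat\beta_n - \beta^0 = -\left[\frac{\partial^2\log L}{\partial\beta\,\partial\beta'}\Big|_{\beta_n^*}\right]^{-1}\frac{\partial\log L}{\partial\beta}\Big|_{\beta^0}, \qquad (7.7)$$

where $\beta_n^*$ lies between $\hat\beta_n$ and $\beta^0$. Once the estimator $\hat\beta_n$ has been obtained, the estimated probabilities are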
Chapter 7
specified as $\hat{p}_i = F(x_i'\hat\beta_n)$.
A familiar way of arriving at model (7.1) is to define an underlying (continuous) response variable $Y_i^*$ as

$$Y_i^* = x_i'\beta - \varepsilon_i, \qquad (7.8)$$

with $\{\varepsilon_i\}$ i.i.d., $E(\varepsilon_i) = 0$ and $\operatorname{var}(\varepsilon_i) = \sigma^2$. The binary random variable is then defined as

$Y_i = 1$ if $Y_i^* \geq 0$,
$Y_i = 0$ otherwise.

The model specification (7.1) follows by

$$P(Y_i = 1 \mid x_i) = P(x_i'\beta - \varepsilon_i \geq 0) = P(\varepsilon_i \leq x_i'\beta) = F_\varepsilon(x_i'\beta).$$

Sometimes it is assumed that the $x$'s are stochastic. Assumption 7.3 is then replaced by
Assumption 7.3': the $\{X_i\}$ are i.i.d. vector random variables with a joint density function $g(x)$ such that $g(x) > 0$ for all $x$. The $\{X_i\}$ are independent of the $\{\varepsilon_i\}$.
Because $X_i$ and $\varepsilon_i$ are independent, and the probabilities are specified conditional upon $X$, all previous results obtain.
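The latent-variable construction can be illustrated with a small simulation. The sketch below assumes logistic errors and a single hypothetical index value $x_i'\beta = 0.8$; the function names and the seed are illustrative and not taken from the text. It compares the simulated frequency of $Y_i = 1$ with $F_\varepsilon(x_i'\beta)$.

```python
import math
import random


def logistic_cdf(z: float) -> float:
    """Distribution function of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-z))


def draw_logistic(rng: random.Random) -> float:
    """Draw a standard logistic variate by inverting the distribution function."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() can return exactly 0.0
        u = rng.random()
    return math.log(u / (1.0 - u))


def simulated_frequency(index: float, n_draws: int, rng: random.Random) -> float:
    """Share of draws with Y = 1, where Y = 1 if index - eps >= 0."""
    hits = 0
    for _ in range(n_draws):
        y_star = index - draw_logistic(rng)  # latent response as in (7.8)
        if y_star >= 0.0:
            hits += 1
    return hits / n_draws


rng = random.Random(12345)
index = 0.8  # hypothetical value of x_i' beta
frequency = simulated_frequency(index, 50_000, rng)
print(frequency, logistic_cdf(index))
```

With 50,000 draws the two printed numbers should agree to about two decimals, which is all the sketch is meant to show.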
The binary choice and multinomial logit model
7.3. The multinomial logit model
The multinomial logit model is defined as (see e.g. MADDALA, 1983)

$$P_{ij} = P(Y_i = j \mid x_i) = \frac{\exp(x_i'\beta_j)}{1 + \sum_{l=1}^{m-1}\exp(x_i'\beta_l)}, \qquad i = 1,\dots,n;\ j = 1,\dots,m-1, \qquad (7.9)$$

$$P_{im} = P(Y_i = m \mid x_i) = 1 - \sum_{j=1}^{m-1}P_{ij},$$

where $\{Y_i\}$ is a sequence of independent discrete random variables taking the values $\{1, 2, \dots, m\}$; $x_i$ is a $k$-vector of known constants; and $\beta_j$ is a $k$-vector. A superscript 0 means that $\beta_j^0$ is substituted for $\beta_j$, where $\beta_j^0$ is a $k$-vector of unknown parameters to be estimated: $P_{ij}^0 = \exp(x_i'\beta_j^0)/(1 + \sum_{l=1}^{m-1}\exp(x_i'\beta_l^0))$. Define a set of binary random variables $\{Y_{ij}\}$ as

$Y_{ij} = 1$ if the $i$th individual falls in the $j$th category,
$Y_{ij} = 0$ otherwise.

Clearly, $P(Y_{ij} = 1 \mid x_i) = P(Y_i = j \mid x_i) = P_{ij}$, and

The elements of $\{Y_{i1}, Y_{i2}, \dots, Y_{im}\}$ are stochastically dependent because only one of them equals 1 and all the others equal 0, and so
The loglikelihood function of the multinomial logit model can be written as

$$\log L = \sum_{i=1}^n\sum_{j=1}^m Y_{ij}\log P_{ij}.$$
Define the $k(m-1)$-vector $\beta$ as $\beta = (\beta_1', \beta_2', \dots, \beta_{m-1}')'$. It is postulated that the following hold:
Assumption 7.1M: $\beta^0$ is contained in a parameter space $B$, which is an open bounded subset of the Euclidean $k(m-1)$-space.
Assumption 7.2M: the $\{x_i\}$ are uniformly bounded in $i$ and $\lim_{n\to\infty} n^{-1}\sum_i x_ix_i'$ is a finite nonsingular matrix. The empirical distribution of $\{x_i\}$ converges to a (non-degenerate) distribution function.
Under these assumptions the maximum likelihood estimator $\hat\beta_n$ of $\beta^0$ is asymptotically normally distributed:
where 8 21ogL 1· -1 _ _ _ " = - 1mn--+00n n
apap·
1
(7.10)
pO·
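The elements of the score vector and Hessian matrix are respectively given by

$$\frac{\partial\log L}{\partial\beta_j} = \sum_{i=1}^n (Y_{ij} - P_{ij})x_i, \qquad j = 1,\dots,m-1,$$

$$\frac{\partial^2\log L}{\partial\beta_j\,\partial\beta_j'} = -\sum_{i=1}^n P_{ij}(1-P_{ij})x_ix_i', \qquad j = 1,\dots,m-1, \qquad (7.11)$$

$$\frac{\partial^2\log L}{\partial\beta_j\,\partial\beta_r'} = \sum_{i=1}^n P_{ij}P_{ir}x_ix_i', \qquad j,r = 1,\dots,m-1;\ j \neq r.$$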
Once the estimator $\hat\beta_n$ is obtained, the estimated probabilities $\hat{P}_{ij}$ are given by

$$\hat{P}_{ij} = \frac{\exp(x_i'\hat\beta_j)}{1 + \sum_{l=1}^{m-1}\exp(x_i'\hat\beta_l)}, \qquad j = 1,\dots,m-1. \qquad (7.12)$$
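As an illustration of the multinomial logit probabilities with category $m$ as base category, and of the per-observation score term $(Y_{ij} - P_{ij})x_i$, consider the following minimal sketch. All names and numerical values are hypothetical and chosen only for the example.

```python
import math


def mnl_probabilities(x, betas):
    """Return [P_1, ..., P_m] for one observation.

    x     : list of k regressors
    betas : list of m-1 coefficient lists (category m is the base category)
    """
    scores = [sum(xv * bv for xv, bv in zip(x, beta)) for beta in betas]
    shift = max([0.0] + scores)  # subtract the largest exponent for numerical stability
    weights = [math.exp(s - shift) for s in scores]
    base_weight = math.exp(-shift)  # the base category has index value 0
    total = base_weight + sum(weights)
    return [w / total for w in weights] + [base_weight / total]


def score_contribution(x, category, probs):
    """Score terms (Y_ij - P_ij) * x_i for j = 1, ..., m-1.

    category is the observed outcome, a value in {1, ..., m}.
    """
    m = len(probs)
    return [
        [((1.0 if category == j else 0.0) - probs[j - 1]) * xv for xv in x]
        for j in range(1, m)
    ]


x_i = [1.0, 0.5]                   # hypothetical regressors (constant, one covariate)
betas = [[0.2, -1.0], [0.0, 0.7]]  # hypothetical coefficients, so m = 3
p_i = mnl_probabilities(x_i, betas)
g_i = score_contribution(x_i, 2, p_i)
print(p_i, sum(p_i))
```

Setting all coefficients to zero gives equal probabilities $1/m$ for every category, which is a convenient sanity check.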
8. Goodness-of-fit measures in binary choice models
8.1. Introduction
In this chapter a number of goodness-of-fit measures for the binary choice model are discussed. As mentioned before, in the binary choice model the probability of a certain outcome of the choice process is estimated. Because the 'true' probabilities are unknown, the estimated probabilities cannot be compared with them and so the definition of $R^2$ in the linear model is not applicable. The main focus here is on pseudo-$R^2$ measures. A pseudo-$R^2$ measure is a measure that has the same kind of interpretation as the OLS-$R^2$ in the linear model, and so at least lies in the [0, 1] interval. The measures are based on several possible criteria: for example, on the discrepancy between the binary dependent variable, $y$, and the estimated probabilities, $\hat{p}$; on the (log)likelihood of the estimated model; on the estimation of the underlying continuous latent variable; or on a prediction of the binary dependent variable. The measures as discussed here can be found, amongst others, in AMEMIYA (1981), KAY AND LITTLE (1986) and VEALL AND ZIMMERMANN (1990). As Veall and Zimmermann point out, because there is no pseudo-$R^2$ which has all the properties of the OLS-$R^2$ as mentioned in Chapter 2, one should simply choose a pseudo-$R^2$ to be used consistently by different researchers in order to make comparability possible. In their study they favour the measure that best mimics the OLS-$R^2$ of the underlying continuous data. By means of Monte Carlo studies and an analysis of private car ownership, a contribution to the discussion of which measure is best will be given in Section 8.5. It becomes clear that all measures considered here show the same kind of relative behaviour when different models are compared.
Chapter 8
8.2. Efron's $R_{Ef}^2$
The first measure can be found as early as LAVE (1970) in a study of (probit) modal choice, and has been given a theoretical backup by EFRON (1978). In most literature (e.g. AMEMIYA (1981), MADDALA (1988)) it is therefore referenced as Efron's measure, and will be given the same name here. The measure stems directly from the coefficient of determination in the linear model by decomposing the total variation of the binary dependent variable, $y'Ny$, into explained and unexplained variation, the unexplained variation being equal to the 'residual' sum of squares $(y-\hat{p})'(y-\hat{p})$. The measure is defined as

$$R_{Ef}^2 = 1 - \frac{\sum_i (y_i - \hat{p}_i)^2}{\sum_i (y_i - \bar{Y})^2} = 1 - \frac{(y-\hat{p})'(y-\hat{p})}{y'Ny},$$

where $\bar{Y}$ is the sample mean of $Y$ ($= n^{-1}\sum_i Y_i$). As mentioned before, the measure is regarded as being equal to one minus the ratio of unexplained variation to total variation. It will be shown here that one has to be very cautious in giving this interpretation to the measure.
In defining the measure, Efron used an example in which there are grouped data, a group formed by individuals with the same covariate vector $x$. He then compared two logit models. One model fits an overall constant to the data, resulting in an estimated probability equal to the sample mean $\bar{Y}$ for every individual. The other model fits a separate constant for each group, resulting in an estimated probability for each group $j$ equal to the sample mean of $Y$ in group $j$ ($= \bar{Y}_j$). In Efron's terminology, define $S(y,p)$ as the sum of squared errors for arbitrary $p$, that is $S(y,p) = (y-p)'(y-p)$. Let $\bar{y} = s\bar{Y}$, $s = (1,1,\dots,1)'$, and $y^\circ = (s_1'\bar{Y}_1, \dots, s_J'\bar{Y}_J)'$, where $J$ is the number of groups and $s_j$ is the unit vector whose order is equal to the number of observations in group $j$. Then the reduction in squared error of the model with group specific constants, as compared to the model with an overall constant, equals the variation
between explanations $y^\circ$ and $\bar{y}$:

$$S(y,\bar{y}) = S(y,y^\circ) + S(y^\circ,\bar{y}), \qquad (8.1)$$

which is referred to by Efron as 'the Pythagorean relationship'. This relationship plays a crucial role, or in Efron's words, (8.1) 'is the very least we should expect of any reasonable measure of residual variation' (op. cit. p. 119).
From (8.1) the coefficient of determination is readily obtained,

$$R^2 = \frac{S(y,\bar{y}) - S(y,y^\circ)}{S(y,\bar{y})} = \frac{S(y^\circ,\bar{y})}{S(y,\bar{y})}.$$

$R^2$ is the proportional decrease in residual variation in going from explanation $\bar{y}$ to $y^\circ$, but it is also the variation of $y^\circ$ to $\bar{y}$ compared to the total variation of $y$. However, in the model with individual data and individual estimated probabilities, the Pythagorean relationship does not hold. Instead,

$$S(y,\bar{y}) = S(y,\hat{p}) + S(\hat{p},\bar{y}) + 2\hat{p}'(y-\hat{p}) - 2\bar{y}'(y-\hat{p}),$$

and therefore the decomposition of the total variation into explained and residual variation no longer holds. Because with maximum likelihood estimation $S(y,\hat{p})$ is always smaller than $S(y,\bar{y})$, provided a constant term is part of the model, one need not worry about the lower bound of $R_{Ef}^2$, which is zero.
Properties of $R_{Ef}^2$ are
(i). $R_{Ef}^2 = 0$ if $\hat\gamma_n = 0$,
$R_{Ef}^2 = 1$ iff $\hat{p}_i = y_i\ \forall i$.
(ii). If the assumptions 7.1, 7.2 and 7.3' hold (so $X$ is stochastic), then

$$\operatorname{plim} R_{Ef}^2 = 1 - \frac{\int F(x'\beta^0)\{1-F(x'\beta^0)\}g(x)\,dx}{\{\int F(x'\beta^0)g(x)\,dx\}\{1-\int F(x'\beta^0)g(x)\,dx\}}$$
and so
plim $R_{Ef}^2 = 0$ if $F(x'\beta^0) = c\ \forall x$, $0 < c < 1$,
plim $R_{Ef}^2 = 1$ if $F(x'\beta^0) = 1 \vee F(x'\beta^0) = 0\ \forall x$.
Proof:
In order to obtain the probability limit of $R_{Ef}^2$, it is very convenient to use the theory and tools as developed for stochastic processes. This is also the reason for specifying $X$ as being stochastic. The results obtained here (and later in Section 10.4) rely strongly on the work of POLLARD (1984) and NOLAN AND POLLARD (1987). The main definitions and theorems we use are stated in the Appendix. Reference to this appendix is made by placing an A in front of the number of the definition, lemma or theorem. Write the numerator of $R_{Ef}^2$ as

$$n^{-1}\sum_i (y_i - \hat{p}_i)^2 = P_nh(\cdot,\cdot;\hat\beta_n),$$

where $h(y,x;\beta) = \{y - F(x'\beta)\}^2$, and $P_n$ is the empirical measure of the pairs $(Y_i,X_i)$, i.e. the measure that puts equal mass ($n^{-1}$) at each of the $n$ observations $(Y_1,X_1),\dots,(Y_n,X_n)$. Let $\mathcal{H}$ be the class of functions

$$\mathcal{H} = \{h(y,x;\beta): \beta \in B\}.$$

The subgraphs of functions in $\mathcal{H}$ clearly form a class with polynomial discrimination (often called a Vapnik-Cervonenkis (VC) class, see definition A.1), and $\mathcal{H}$ has envelope $H = 1$ (definition A.2). Hence, using Lemma A.1, it is seen that the conditions of Theorem A.1 are fulfilled, implying that, if $n \to \infty$,
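$$\sup_{\mathcal{H}}\,|P_nh - Ph| \to 0 \quad \text{almost surely},$$

where $P$ is the true fixed distribution and $Ph$ the expected value of $h$. Since plim $\hat\beta_n = \beta^0$, it follows that

$$\operatorname{plim} P_nh(\cdot,\cdot;\hat\beta_n) = Ph(\cdot,\cdot;\beta^0).$$

Now

$$dP(y,x) = F(x'\beta^0)^y\{1-F(x'\beta^0)\}^{1-y}g(x)\,dx,$$

and so

$$\int h(y,x;\beta^0)\,dP(y,x) = \int F(x'\beta^0)\{1-F(x'\beta^0)\}g(x)\,dx.$$

The denominator is

$$n^{-1}y'Ny = \Big(\int y\,dP_n(y,x)\Big)\Big(1 - \int y\,dP_n(y,x)\Big),$$

and because $\int |y|\,dP(y,x) \leq 1$, it follows that, if $n \to \infty$,

$$\int y\,dP_n(y,x) \to \int y\,dP(y,x) = \int F(x'\beta^0)g(x)\,dx \quad \text{almost surely}.$$

The result follows, because

■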
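The difference between grouped and individual probabilities can be made concrete with a small numerical sketch. The data and function names below are hypothetical. With group means as fitted probabilities the Pythagorean relationship holds exactly; with arbitrary individual probabilities a cross term remains.

```python
def squared_distance(a, b):
    """S(a, b) = (a - b)'(a - b)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def efron_r2(y, p):
    """One minus the residual sum of squares over the total variation of y."""
    y_bar = [sum(y) / len(y)] * len(y)
    return 1.0 - squared_distance(y, p) / squared_distance(y, y_bar)


def pythagorean_gap(y, p):
    """S(y, ybar) - S(y, p) - S(p, ybar); zero when the decomposition holds."""
    y_bar = [sum(y) / len(y)] * len(y)
    return (squared_distance(y, y_bar)
            - squared_distance(y, p)
            - squared_distance(p, y_bar))


y = [1, 0, 1, 1, 0, 0, 1, 0]             # two groups of four observations
p_grouped = [0.75] * 4 + [0.25] * 4      # group sample means of y
p_individual = [0.9, 0.4, 0.6, 0.7, 0.3, 0.2, 0.5, 0.4]  # arbitrary probabilities

print(efron_r2(y, p_grouped), pythagorean_gap(y, p_grouped))
print(efron_r2(y, p_individual), pythagorean_gap(y, p_individual))
```

In the grouped case the gap is zero and the measure equals 0.25, so both interpretations of $R^2$ coincide. The individual probabilities here are not maximum likelihood estimates; they only show that the cross term need not vanish.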
There has been some discussion about the upper limit of Efron's measure. MORRISON (1972) argued that this upper limit is smaller than 1, countered by GOLDBERGER (1973). Clearly, the upper limit is obtained in the so-called "saturated" model, when there are as many parameters as observations and so it is a proper limit in a theoretical sense. (To be more precise, the discussion involved the correlation coefficient of $\hat{p}$ and $y$. The difference between this correlation coefficient and $R_{Ef}^2$ is, however, negligible in practice (see e.g. VEALL AND ZIMMERMANN, 1990)).
8.3. The likelihood ratio index: McFadden's $R_{MF}^2$
Several $R^2$ type measures based on the (log)likelihood of the model have been proposed. One of them is the likelihood ratio index, put forward by McFADDEN (1974). It is defined as

$$R_{MF}^2 = 1 - \frac{\log L(\hat\beta_n)}{\log L_0},$$

where $\log L_0$ is the maximum of $\log L$ subject to the constraint that all the regression coefficients except the constant term are zero.
Properties of $R_{MF}^2$ are
(i). $R_{MF}^2 = 0$ if $\log L(\hat\beta_n) = \log L_0$, i.e. if $\hat\gamma_n = 0$,
$R_{MF}^2 = 1$ if $\log L(\hat\beta_n) = 0$.
(ii). $R_{MF}^2$ has a one-to-one relation with the chi-square statistic for testing the hypothesis that all coefficients, except the constant, are zero:

$$-2\log L_0\,R_{MF}^2 \to \chi_{k-1}^2 \quad \text{if } n \to \infty.$$

Because of these properties, which are related to the properties of OLS-$R^2$ in the linear model, this measure has gained considerable popularity (DHRYMES (1986), JUDGE ET AL. (1980), BEN-AKIVA AND LERMAN (1985)).
Furthermore, HAUSER (1977) argued that the measure can be interpreted, in an information theoretic context, as the empirical percentage uncertainty explained by the model. His argument runs as follows (see HAUSER, 1977, Theorems 1-4, pp. 410-413). The measure of information for an individual, i.e. the information about the choice outcome provided by the explanatory variables, $I(Y_i,x_i)$, is

with $\bar{Y}$, the sample mean of $Y$, the prior probability that $Y$ equals one. The expected information provided by the model is given by

Define as the total uncertainty of the system, the prior entropy, $H(y)$, where

$$H(y) = -\{\bar{Y}\log(\bar{Y}) + (1-\bar{Y})\log(1-\bar{Y})\}.$$

$H(y)$ represents the maximum uncertainty that can be explained with perfect information, i.e. $p_i = 1(0)$ if $Y_i = 1(0)$. The expected information attains its maximum value for $p_i = p_i^0$ and it represents the reduction in uncertainty due to the model. The percentage of uncertainty explained by the model is equal to $E\{I(y,X)\}/H(y)$. However, because the expected information is computed independently of the observed $Y_i$'s, Hauser introduces the empirical information

and the empirical percentage of uncertainty explained by the model is then given by $I(y,X)/H(y)$. The final result is that the likelihood ratio index is equal to the empirical percentage uncertainty explained:

$$R_{MF}^2 = \frac{I(y,X)}{H(y)}.$$

The above analysis holds for every prior probability $P(Y = 1)$ that is constant amongst individuals. Instead of $\bar{Y}$, a possible choice for this prior probability is $P(Y=1) = 0.5$. This prior is equivalent to the null model in which the whole parameter vector $\beta$ is constrained to be equal to zero, the so-called equally likely model. Furthermore, this prior generates the largest possible uncertainty, and hence larger values for $R^2$ (with properly redefined $\log L_0$) in general.
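The equality between the likelihood ratio index and the empirical share of uncertainty explained can be verified numerically. The sketch below assumes the usual form of the empirical information, $n^{-1}\sum_i\{y_i\log(\hat{p}_i/\bar{Y}) + (1-y_i)\log((1-\hat{p}_i)/(1-\bar{Y}))\}$, which is an assumption made for this example. The data and probabilities are hypothetical and are not maximum likelihood estimates.

```python
import math


def log_likelihood(y, p):
    """Binary-choice loglikelihood for outcomes y and probabilities p."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
               for yi, pi in zip(y, p))


def mcfadden_r2(y, p):
    """Likelihood ratio index with the constant-only model as null model."""
    y_bar = sum(y) / len(y)
    log_l0 = log_likelihood(y, [y_bar] * len(y))
    return 1.0 - log_likelihood(y, p) / log_l0


def empirical_information_share(y, p):
    """Empirical information divided by the prior entropy (assumed form)."""
    n = len(y)
    y_bar = sum(y) / n
    info = sum(yi * math.log(pi / y_bar)
               + (1 - yi) * math.log((1.0 - pi) / (1.0 - y_bar))
               for yi, pi in zip(y, p)) / n
    entropy = -(y_bar * math.log(y_bar) + (1.0 - y_bar) * math.log(1.0 - y_bar))
    return info / entropy


y = [1, 0, 1, 1, 0, 1, 0, 1]
p = [0.8, 0.3, 0.7, 0.6, 0.4, 0.9, 0.2, 0.5]  # hypothetical fitted probabilities

r2_mf = mcfadden_r2(y, p)
share = empirical_information_share(y, p)
print(r2_mf, share)
```

Both printed numbers coincide up to rounding error, because the empirical information equals $(\log L - \log L_0)/n$ and the prior entropy equals $-\log L_0/n$.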
8.4. Miscellaneous measures
MCKELVEY AND ZAVOINA (1975) (for the probit model) and, more recently, LAITILA (1990) (for the logit and the tobit model), proposed the following pseudo-$R^2$

$$R_{MZ}^2 = \frac{\hat{y}^{*\prime}N\hat{y}^*}{\hat{y}^{*\prime}N\hat{y}^* + n\sigma^2},$$

with $\hat{y}^* = X\hat\beta_n$, the theoretical value of the unobservable latent variable $y^*$ (as in (7.8)); and $\sigma^2$ equals $\pi^2/3$ in the logit model and 1 in the probit model. Properties of $R_{MZ}^2$ are
(i). $0 \leq R_{MZ}^2 < 1$,
$R_{MZ}^2 = 0$ if $\hat\gamma_n = 0$,
$R_{MZ}^2 \to 1$ if $\hat{y}^{*\prime}N\hat{y}^*/n \to \infty$.
(ii). $R_{MZ}^2$ has the same probability limit as the OLS-coefficient of determination $R^2$ for the underlying latent linear model when this model could be estimated by OLS. Under the assumption $\lim_{n\to\infty} n^{-1}X'NX = H_X$, this probability limit is given by

$$\operatorname{plim} R_{MZ}^2 = \frac{\beta^{0\prime}H_X\beta^0}{\beta^{0\prime}H_X\beta^0 + \sigma^2}.$$

Clearly $R_{MZ}^2$ is a kind of estimator of the OLS-$R^2$ for the underlying linear model. Indeed, VEALL AND ZIMMERMANN (1990) found by Monte Carlo studies that from seven measures analysed, this measure was closest to the OLS-$R^2$.
VEALL AND ZIMMERMANN (1990) proposed a measure based on the loglikelihood of the model, more specifically on the (log) likelihood ratio test, which is an adjusted version of a measure proposed by ALDRICH AND
NELSON (1984):

$$R_{VZ}^2 = \frac{2[\log L(\hat\beta_n) - \log L_0]}{2[\log L(\hat\beta_n) - \log L_0] + n}\cdot\frac{2\log L_0 - n}{2\log L_0},$$

where the second term is the adjustment to the Aldrich/Nelson measure (the first term) in order to obtain a proper upper limit of one. Obviously, there is a relationship between $R_{VZ}^2$ and $R_{MF}^2$, given by

$$R_{VZ}^2 = \frac{\delta - 1}{\delta - R_{MF}^2}\,R_{MF}^2, \quad \text{with} \quad \delta = n/(2\log L_0),$$

and therefore
$R_{VZ}^2 = R_{MF}^2$ if $R_{MF}^2 = 1$ or $R_{MF}^2 = 0$;
$R_{VZ}^2 > R_{MF}^2$ if $0 < R_{MF}^2 < 1$.
Furthermore, the difference between $R_{VZ}^2$ and $R_{MF}^2$ becomes smaller if $-\delta$ increases, i.e. if, with constant $n$, $\log L_0$ becomes larger or, in other words, the total uncertainty of the system decreases.
A measure that is often reported is the proportion of correctly predicted observations

$$R_{cp}^2 = 1 - n^{-1}(y-\hat{y})'(y-\hat{y}),$$

where the predicted value $\hat{y}_i = 1$ if $\hat{p}_i \geq 0.5$ and $\hat{y}_i = 0$ if $\hat{p}_i < 0.5$. The problem with this measure is twofold. First of all the loss function is an all or nothing one: whether the estimated probability is close to one/zero or just larger/smaller than a half does not make any difference for the value of $R_{cp}^2$. Secondly, the measure can mask a poor goodness of fit. This happens if, for example, the proportion of ones in the sample is large. If in that case the predicted value is unity for every observation, the proportion of correctly predicted observations is large and so is $R_{cp}^2$.
However, the model misclassifies every observation with $Y_i = 0$. One way of dealing with this last problem is to report both the proportion of correctly predicted ones and the proportion of correctly predicted zeros. Another way (BEN-AKIVA AND LERMAN (1985), KAY AND LITTLE (1986)), is to calculate the average probability of correct prediction

$$R_{cp}^{2*} = n^{-1}\sum_i\{y_i\hat{p}_i + (1-y_i)(1-\hat{p}_i)\},$$

but, as will become clear by the examples in Section 8.5, $R_{cp}^{2*}$ does not behave much better than $R_{cp}^2$, and Ben-Akiva and Lerman conclude with '.. the preferred procedure for evaluating a model is to use the loglikelihood value or transforms of it such as .. $R_{MF}^2$ ... The insensitivity of .. $R_{cp}^2$ .. in addition to its potential for completely misleading indications, argue against it' (op. cit. p. 92).
Another well-known goodness-of-fit measure, which is not a pseudo-$R^2$ measure, is the sum of weighted squared residuals

$$T_n = \sum_i \frac{(y_i - \hat{p}_i)^2}{\hat{p}_i(1-\hat{p}_i)}.$$
This measure is often assumed to have an asymptotic chi-square distribution. However, as will be shown in Chapter 10, the asymptotic distribution of $T_n$ is the normal distribution. A chi-square test based on this property is discussed there as well.
The last measure to be discussed here is the so-called deviance

$$D = -2\log L(\hat\beta_n) = -2\sum_i\{y_i\log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)\}.$$

Clearly, this measure is not a pseudo-$R^2$, but a kind of likelihood ratio test, where the model under consideration is the restricted model and the unrestricted model is the "saturated" model. In the saturated model all predicted probabilities are equal to the observed values of $Y$, in which case the loglikelihood is equal to zero. There is a tendency to believe that $D$ asymptotically follows a chi-square distribution, as a kind of generalization of the asymptotics when dealing with null and alternative models that remain fixed as the number of observations increases. However, in the binary choice model, as the number of observations increases, so too does the number of parameters needed for the saturated model, and so the traditional asymptotic results do not hold. As WILLIAMS (1981) pointed out, in the case of the logit model the deviance is uninformative on the goodness of fit of the model. His argument runs as follows: From (7.3) it follows that in the logit model,

$$\log\{\hat{p}_i/(1-\hat{p}_i)\} = x_i'\hat\beta_n.$$

The first order condition of the maximum likelihood estimator is

$$\frac{\partial\log L}{\partial\beta} = \sum_i (y_i - \hat{p}_i)x_i = 0,$$

and so

$$\sum_i y_ix_i'\hat\beta_n = \sum_i \hat{p}_ix_i'\hat\beta_n.$$

Therefore the deviance can be written as

$$D = -2\sum_i\{\hat{p}_i\log(\hat{p}_i) + (1-\hat{p}_i)\log(1-\hat{p}_i)\},$$

and, clearly, this measure does not give any information about the agreement between the $y_i$ and the $\hat{p}_i$.
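Several measures of this section can be computed directly from the outcomes and the fitted probabilities. The following sketch uses hypothetical data and function names. It checks that the Veall and Zimmermann measure computed from the likelihood ratio agrees with the version expressed through the likelihood ratio index, and it reproduces the masking effect of the proportion of correct predictions in a sample with 90 percent ones. The average probability of correct prediction is written in its usual form, which is an assumption of this example.

```python
import math


def log_likelihood(y, p):
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
               for yi, pi in zip(y, p))


def veall_zimmermann_r2(log_l, log_l0, n):
    """Aldrich/Nelson measure rescaled to have an upper limit of one."""
    lr = 2.0 * (log_l - log_l0)
    return lr / (lr + n) * (2.0 * log_l0 - n) / (2.0 * log_l0)


def veall_zimmermann_from_mcfadden(r2_mf, log_l0, n):
    """Same measure written in terms of the likelihood ratio index."""
    delta = n / (2.0 * log_l0)
    return (delta - 1.0) / (delta - r2_mf) * r2_mf


def proportion_correct(y, p):
    """Share of observations with y_i equal to the 0.5-threshold prediction."""
    return sum(1 for yi, pi in zip(y, p) if (pi >= 0.5) == (yi == 1)) / len(y)


def average_probability_correct(y, p):
    return sum(yi * pi + (1 - yi) * (1.0 - pi) for yi, pi in zip(y, p)) / len(y)


y = [1, 0, 1, 1, 0, 1, 0, 1]
p = [0.8, 0.3, 0.7, 0.6, 0.4, 0.9, 0.2, 0.5]  # hypothetical fitted probabilities
n = len(y)
log_l = log_likelihood(y, p)
log_l0 = log_likelihood(y, [sum(y) / n] * n)
r2_mf = 1.0 - log_l / log_l0
r2_vz = veall_zimmermann_r2(log_l, log_l0, n)
r2_vz_alt = veall_zimmermann_from_mcfadden(r2_mf, log_l0, n)

# A sample with 90 percent ones and a constant fitted probability of 0.9.
y_skewed = [1] * 9 + [0]
p_skewed = [0.9] * 10
cp = proportion_correct(y_skewed, p_skewed)
cp_star = average_probability_correct(y_skewed, p_skewed)
print(r2_mf, r2_vz, r2_vz_alt, cp, cp_star)
```

In the skewed sample the constant-only prediction scores 0.9 on the proportion of correct predictions and about 0.82 on the average probability of correct prediction, although it carries no information about individual outcomes.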
8.5. Examples
In order to gain some more insight into the behaviour of the pseudo-$R^2$ measures, the measures are calculated for both data that are generated by computer and data on private car ownership in the Dutch budget survey of 1980. The generated data were constructed as follows. The underlying continuous latent variable is specified as

$$y^* = a + x_1 + x_2 + x_3 + x_4 + x_5 - \varepsilon, \qquad (8.2)$$

with $X_1 \sim N(0,1)$, $X_2 \sim U[0,2]$, $X_3 \sim N(0,2)$, $X_4 \sim U[0,4]$, $X_5 \sim N(0,3)$ and $\varepsilon$ is logistically distributed with mean 0 and variance $\pi^2/3$. The $x$'s are drawn independently from each other and from $\varepsilon$. The binary variable $y$ is equal to unity whenever $y^*$ is larger than zero, and zero otherwise. By controlling the constant $a$, leaving the slope coefficients unchanged, two cases are distinguished. In the first case, the average response $\bar{Y}$ is 0.5, in the second case 0.9. For both cases, five (nested) logit models are estimated, with explanatory variables $\{s\ x_1\},\dots,\{s\ x_1\ x_2\ x_3\ x_4\ x_5\}$ respectively. The measures $R_{Ef}^2$, $R_{MF}^2$, $R_{MZ}^2$, $R_{VZ}^2$, $R_{cp}^2$ and $R_{cp}^{2*}$ are calculated. Furthermore, because the continuous $y^*$, and so $p^0$, are known, the OLS-$R^2$ of the linear model (8.2) and the correlation coefficient of $\hat{p}$ and $p^0$, denoted by $r^2(p^0,\hat{p})$, can be calculated as well. The sample mean (and standard deviation) of these measures, for 100 replications of sample size $n = 1000$, with $X$ being held constant, are presented in Table 8.1 and Figure 8.1, in case $\bar{Y} = 0.5$.
Treating $r^2(p^0,\hat{p})$ as a separate case, the measures can roughly be divided into three groups, the first group containing $R_{Ef}^2$ and $R_{MF}^2$, the second group formed by $R_{MZ}^2$, $R_{VZ}^2$ and $R_{OLS}^2$, and the third group by $R_{cp}^2$ and $R_{cp}^{2*}$. The values of the measures between groups are related by $\{R_{Ef}^2, R_{MF}^2\} < \{R_{MZ}^2, R_{VZ}^2, R_{OLS}^2\} < \{R_{cp}^2, R_{cp}^{2*}\}$, and within groups $R_{MF}^2 < R_{Ef}^2$, $R_{cp}^{2*} < R_{cp}^2$, whereas for the second group no such ordering can be made. The patterns of the measures are the same, the increase is higher the higher the variance of the extra $x$ variable. The absolute changes are the smallest for $R_{cp}^2$ and $R_{cp}^{2*}$, and the largest for $R_{MZ}^2$ and $R_{VZ}^2$ (and $R_{OLS}^2$).
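The data generating process of (8.2) can be sketched as follows. This is an illustration under stated assumptions, not the original simulation code: the second argument of $N(\cdot,\cdot)$ is read as a variance, the constant $a = -3$ is derived here from the symmetry of the latent variable around $a + 3$ so that the average response is about 0.5, and the seed and function names are arbitrary.

```python
import math
import random


def draw_logistic(rng: random.Random) -> float:
    """Standard logistic variate (mean 0, variance pi^2 / 3) by inversion."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return math.log(u / (1.0 - u))


def generate_sample(n: int, a: float, rng: random.Random):
    """Draw regressors and binary responses following the latent model (8.2)."""
    rows, y = [], []
    for _ in range(n):
        x = [
            rng.gauss(0.0, 1.0),             # X1 ~ N(0, 1)
            rng.uniform(0.0, 2.0),           # X2 ~ U[0, 2]
            rng.gauss(0.0, math.sqrt(2.0)),  # X3 ~ N(0, 2), variance 2 assumed
            rng.uniform(0.0, 4.0),           # X4 ~ U[0, 4]
            rng.gauss(0.0, math.sqrt(3.0)),  # X5 ~ N(0, 3), variance 3 assumed
        ]
        y_star = a + sum(x) - draw_logistic(rng)
        rows.append(x)
        y.append(1 if y_star > 0.0 else 0)
    return rows, y


rng = random.Random(1992)
x_rows, y = generate_sample(1000, -3.0, rng)
average_response = sum(y) / len(y)
print(average_response)
```

Raising the constant $a$ while leaving the slopes unchanged shifts the average response upwards, which is how the second case with an average response of 0.9 would be obtained. In the experiments described above the regressors are held fixed over the replications, whereas this sketch draws a single sample.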
Table 8.1 Results pseudo-R2, generated data, n = 1000, 100 replications, Y = 0.50; sample mean, (standard deviation)

x_i          R2_Ef    R2_MF    R2_MZ    R2_VZ    R2_OLS   R2_cp    R2*_cp   r2(p0,p)
1            0.08     0.06     0.10     0.13     0.11     0.62     0.54     0.15
            (0.01)   (0.01)   (0.02)   (0.02)   (0.01)   (0.01)   (0.01)   (0.00)
1 2          0.10     0.08     0.13     0.17     0.14     0.63     0.55     0.19
            (0.01)   (0.01)   (0.02)   (0.02)   (0.01)   (0.01)   (0.01)   (0.00)
1 2 3        0.22     0.17     0.29     0.33     0.31     0.70     0.61     0.40
            (0.02)   (0.02)   (0.02)   (0.02)   (0.02)   (0.01)   (0.01)   (0.00)
1 2 3 4      0.30     0.25     0.42     0.44     0.43     0.73     0.65     0.55
            (0.02)   (0.02)   (0.03)   (0.03)   (0.02)   (0.01)   (0.01)   (0.00)
1 2 3 4 5    0.54     0.48     0.71     0.68     0.71     0.84     0.77     0.99
            (0.02)   (0.02)   (0.02)   (0.02)   (0.02)   (0.01)   (0.01)   (0.00)

Figure 8.1 Results pseudo-R2, Y = 0.5 (horizontal axis: model)
Table 8.2 Results pseudo-R 2, generated data n = 1000, 100 replications, Y=0.90, sample mean, R~r
xi
I 2 I 2 3 I 2 34 I 2 34 5
R~F
R~z
Riz
Rc5Ls
(standard deviation)
2 • r (P 0,P)
2• Rep
R!P
0.03
0.06
0.12
0.09
0.11
0.90
0.83
0.08
(0.01)
(0.01)
(0.03)
(0.02)
(0.01)
(0.01)
(0.01)
(0.00)
0.04
0.06
0.13
0.10
0.14
(0.02}
(0.03)
0.90
0.83
0.10
(0.01)
(0.03)
(0.01)
(0.01)
(0.01)
(0.00)
0.11
0.15
0.31
0.23
0.31
0.26
(0.02)
(0.04}
0.90
0.85
(0.02)
(0.03}
(0.01}
(0.01}
(0.01)
(0.00}
0.18
0.23
0.44
0.33
0.42
(0.03)
(0.04)
0.91
0.86
0.45
(0.03}
(0.03)
(0.02)
(0.01)
(0.01)
(0.01)
0.40
0.44
0.72
0.56
0.71
(0.04}
(0.03)
(0.04)
(0.04)
0.93
0.89
(0.01)
(0.01)
(0.01)
0.98 (0.01)
Figure 8.2 Results pseudo-R2, Y = 0.9 (horizontal axis: model)
The simulation results, in case $\bar{Y} = 0.9$, are presented in Table 8.2 and Figure 8.2. The values of the likelihood ratio index, McFadden's $R_{MF}^2$, McKelvey and Zavoina's $R_{MZ}^2$ together with the OLS-$R^2$ do not change much in comparison with the results in the case $\bar{Y} = 0.5$. The OLS-$R^2$ (and so $R_{MZ}^2$) does not change, due to the fact that only the constant is changed in generating the variables. Efron's $R_{Ef}^2$, Veall and Zimmermann's $R_{VZ}^2$, the proportion of correct predictions $R_{cp}^2$ and the average probability of correct allocation $R_{cp}^{2*}$ do change. The problems with the last two measures become apparent. There is hardly any distinction between the values of the measures for the different models, and predicting all observations to have a response of unity results in a $R_{cp}^2$ value of 0.9 ($R_{cp}^{2*}$ = 0.83). It still holds that $\{R_{Ef}^2, R_{MF}^2\} < \{R_{MZ}^2, R_{VZ}^2, R_{OLS}^2\} < \{R_{cp}^2, R_{cp}^{2*}\}$, but in this case $R_{Ef}^2$ has become smaller than $R_{MF}^2$, and $R_{VZ}^2$ is smaller than $\{R_{MZ}^2, R_{OLS}^2\}$. Apart from $R_{cp}^2$ the measures all show the same pattern with the largest absolute changes for $R_{MZ}^2$ (and $R_{OLS}^2$).
Because the parameters are consistently estimated, the correlation coefficient of $p^0$ and $\hat{p}$ has a value close to one for the true model in both cases. It seems to be therefore a good policy to use the pseudo-$R^2$ that best mimics this correlation coefficient. If the four measures $R_{Ef}^2$, $R_{MF}^2$, $R_{MZ}^2$ and $R_{VZ}^2$ are considered, then these measures all show more or less the same kind of relative pattern as the correlation coefficient. This means that it would not make much difference which measure is used in evaluating models. The fact that $R_{MF}^2$ and $R_{MZ}^2$ seem less vulnerable to the proportion of unities in the sample is in favour of these two measures. The larger variability of $R_{MZ}^2$ over the different models compared to the low variability (and low values) of $R_{MF}^2$ is in favour of the first. Furthermore, $R_{MZ}^2$ is the measure that comes closest to $r^2(\hat{p},p^0)$ on average. It is also the measure that has been recommended by Veall and Zimmermann because of its closeness to the OLS-$R^2$, a closeness that appears in our experiments as well.
The four measures $R_{Ef}^2$, $R_{MF}^2$, $R_{MZ}^2$ and $R_{VZ}^2$ are also illustrated by an analysis of private car ownership in the Dutch Budget Survey of 1980 (see also CRAMER (1991) and CRAMER AND RIDDER (1991)). The data set consists of 2820 households and the two categories are "no car" versus "at least one car". The proportion of households that owns at least one car ($\bar{Y}$) is 0.63. Five variables were considered as explanatory variables, namely
X1 : log of household income
X2 : household size (weighted number of individuals)
X3 : age of head of household (five year classes)
X4 : level of urbanization (six classes, from country to city)
X5 : (0,1) dummy for the presence of a business car.
Starting with the model that only includes the income variable (together with a constant), variables which show the highest increase in loglikelihood are added consecutively, viz. business car; age of head of household; urbanization level; and household size. The values of the four measures for the five different models are presented in Table 8.3 and Figure 8.3. Furthermore, Table 8.3 contains the values of the chi-square test statistic, testing whether the value of the parameter of the added variable equals zero.
Table 8.3 Results pseudo-R2, car-ownership data

x_i          R2_Ef    R2_MF    R2_MZ    R2_VZ    chi2
1            0.16     0.12     0.21     0.23     431
1 5          0.30     0.23     0.40     0.41     429
1 5 3        0.32     0.26     0.41     0.44      82
1 5 3 4      0.33     0.26     0.41     0.45      24
1 5 3 4 2    0.33     0.26     0.41     0.45       2
For all models, $R_{VZ}^2 > R_{MZ}^2 > R_{Ef}^2 > R_{MF}^2$, a result similar to the result of the generated data in the case $\bar{Y} = 0.5$. Again all measures show the same pattern, a considerable increase when first the income variable and then the business car dummy is added, a slight increase when the age variable is added and (almost) no change when the urbanization level and the household size are included. Only for this last variable the chi-square statistic is not significant. For the income and business car variables, the values of the chi-square statistic are very significant.
Figure 8.3 Results pseudo-$R^2$, car-ownership data (horizontal axis: model 1 to 5; lines: $R^2_{Ef}$, $R^2_{MZ}$, $R^2_{MF}$, $R^2_{VZ}$)
9. Tests on the distributional assumption
9.1. Introduction

In Chapter 8 it became clear that it is difficult to construct a proper goodness-of-fit measure in the binary choice model. A different approach will therefore be followed here. Instead of comparing the estimated probabilities with the binary random variable, or the loglikelihood with some baseline loglikelihood value, in this chapter the distributional (null) assumption, e.g. normal or logistic, is tested against some alternatives. Testing the distributional assumption is important because, if the distribution function is misspecified and the estimation procedure is maximum likelihood, the estimator $\hat{\beta}_n$ of $\beta_0$ may no longer be asymptotically normally distributed at an $n^{1/2}$ rate and can even become inconsistent (see e.g. RUUD, 1983, 1986). Two approaches will be adopted here. The first approach is to specify the hypothesized distribution as being a member of a class of distribution functions described by some parameters. A (score) test is performed to test for the specific values of the parameters at the null distribution. The second approach is based on the nonparametric Nadaraya-Watson estimator of the distribution function. This estimator will be compared with the null distribution by means of the supremum absolute deviation and a quadratic functional.
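9.2. Tests based on parametric families of distribution functions

Define a class of distribution functions $F_\delta$, depending on parameters $\delta$, for which the null distribution is given by $F_{\delta_0}$. The null hypothesis to be tested is given by

$$H_0: \delta = \delta_0.$$

Because the null model is the logit or probit model, a convenient test statistic for testing the null hypothesis is the Lagrange Multiplier or Efficient Score test statistic (see e.g. ENGLE, 1984). Define $\theta = (\beta', \delta')'$, and $\tilde{\theta} = (\hat{\beta}_n', \delta_0')'$, the maximum likelihood estimator of $\theta$ under the null hypothesis. The score test statistic is given by

$$S = q(\tilde{\theta})' H(\tilde{\theta})^{-1} q(\tilde{\theta}), \tag{9.1}$$

where $q(\tilde{\theta})$ is the score vector evaluated at $\tilde{\theta}$,

$$q(\tilde{\theta}) = \frac{\partial \log L}{\partial \theta}\Big|_{\tilde{\theta}} = \begin{pmatrix} q_\beta(\tilde{\theta}) \\ q_\delta(\tilde{\theta}) \end{pmatrix},$$

with $q_\beta = \partial \log L/\partial \beta$, $q_\delta = \partial \log L/\partial \delta$; and $H(\tilde{\theta})$ is the information matrix evaluated at $\tilde{\theta}$, with blocks $H_{\beta\beta}$, $H_{\beta\delta}$ and $H_{\delta\delta}$. Because by definition $q_\beta(\tilde{\theta}) = 0$, the test statistic becomes

$$S = q_\delta(\tilde{\theta})' \{H_{\delta\delta}(\tilde{\theta}) - H_{\delta\beta}(\tilde{\theta}) H_{\beta\beta}(\tilde{\theta})^{-1} H_{\beta\delta}(\tilde{\theta})\}^{-1} q_\delta(\tilde{\theta}), \tag{9.2}$$

which, under Assumptions 7.1-7.3 and $H_0$, is asymptotically chi-square distributed with the number of degrees of freedom equal to the number of parameters in $\delta$.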
9.2.1. Some classes of distributions

Various classes of distributions have been proposed in the literature. PRENTICE (1976) considered the following class of density functions

$$f_\delta(z) = \frac{\exp(\delta_1 z)}{\{1+\exp(z)\}^{(\delta_1+\delta_2)} B(\delta_1,\delta_2)}, \tag{9.3}$$

where $B$ is the beta function and $\delta_1, \delta_2 > 0$. Density (9.3) is the logistic density function if $\delta_1 = \delta_2 = 1$, and converges to the normal density function if $\delta_1 \to \infty$, $\delta_2 \to \infty$. Therefore, this class of density functions can be used to test both the logit and the probit assumption, although a reparametrization is necessary for testing for normality (see PRENTICE, 1976, p. 764). Other, limiting, distributions are the double exponential ($\delta_1 \to 0$, $\delta_2 \to 0$), exponential ($\delta_1 \to \infty$, $\delta_2 \to 0$), extreme minimum value ($\delta_1 = 1$, $\delta_2 \to \infty$) and extreme maximum value ($\delta_1 \to \infty$, $\delta_2 = 1$). The density $f_\delta(z)$ is symmetric if $\delta_1 = \delta_2$, positively skewed if $\delta_1 > \delta_2$, and negatively skewed whenever $\delta_1 < \delta_2$. The original work of Prentice, who considered the one parameter (dosage level) case, has been extended to the multi-parameter case by BROWN (1982).

COPENHAVER AND MIELKE (1977) proposed the symmetric omega distribution for testing the logit model. Its cumulative distribution function and density function are characterized by

$$F_\delta(z(v)) = v, \qquad f_\delta(z(v)) = 1 - |2v-1|^{\delta+1}$$

and

$$z(v) = \int_{1/2}^{v} \{1 - |2u-1|^{\delta+1}\}^{-1}\,du,$$

where $0 < v < 1$ and $\delta > -1$. The omega distribution is a logistic distribution when $\delta = 1$, and it is a double exponential distribution when $\delta = 0$. The uniform distribution is the limiting case as $\delta \to \infty$.

BERA, JARQUE AND LEE (1984) assume the distribution to belong to the Pearson family of truncated distributions with density function

$$f_\delta(z) = \frac{\exp\{w_\delta(z)\}}{\int_{-\infty}^{\infty} \exp\{w_\delta(v)\}\,dv}, \qquad w_\delta(v) = \int \frac{\delta_1 - v}{1 - \delta_1 v + \delta_2 v^2}\,dv.$$

To test for normality is to test $H_0: \delta_1 = \delta_2 = 0$. Bera et al. specifically derive the score test statistic for the binary probit model (pp. 570-571), expressed in terms of the following quantities: $\hat{\phi}_i$, the normal density function evaluated at $x_i'\hat{\beta}_n$; $w_i = \hat{\phi}_i/\{\hat{p}_i(1-\hat{p}_i)\}$; $X_{(i)} = [x_i' ;\ (x_i'\hat{\beta}_n)^2 - 1 ;\ (x_i'\hat{\beta}_n)(3+(x_i'\hat{\beta}_n)^2)]'$; and $J$, a selection matrix consisting of the last two columns of the identity matrix of dimension $k+2$, with $k$ the number of parameters in $\beta$.

Another alternative is to assume that the disturbances of the underlying continuous model are Burr($\delta$) distributed (SMITH (1988), LECHNER (1991)):

$$F_\delta(z) = \frac{\exp(z)^\delta}{\{1+\exp(z)\}^\delta}, \qquad f_\delta(z) = \frac{\delta \exp(z)^\delta}{\{1+\exp(z)\}^{\delta+1}}.$$

The logistic distribution is obtained for $\delta = 1$, and $f_\delta(z)$ is positively skewed if $\delta > 1$ and negatively skewed if $\delta < 1$. It is easily seen, however, that this class is a special case of the Prentice class, obtained by putting $\delta_2 = 1$ in (9.3).
9.2.2. Score test based on the Perks class of distributions

For testing the logit model, a class of symmetric distributions that has not been considered before is a class put forward by Perks (see JOHNSON AND KOTZ, 1970, pp. 14-16) with density function

$$f_\delta(z) = \frac{\delta_1}{\exp(z) + \delta_2 + \exp(-z)}, \tag{9.4}$$

and $\delta_2 > -2$. Clearly, when $\delta_1 = 1$ and $\delta_2 = 2$, $f_\delta(z)$ is the logistic density function. The number of parameters can be reduced, however, when it is taken into account that $\int f_\delta(z)\,dz = 1$. Therefore, $\delta_1$ can be expressed$^1$ as a function of $\delta_2$, and the test can be based on $\delta_2$ solely. For ease of notation denote $\delta := \delta_2$. Then the null hypothesis for testing the logit assumption is $H_0: \delta = 2$. When $\delta$ increases, $f_\delta(z)$ gets fatter tails; and when $\delta = 0$ it is the hyperbolic secant density function. The cumulative distribution function corresponding with (9.4) is given by

$$F_\delta(z) = \begin{cases} 1 - \dfrac{\log\left(\dfrac{2\exp(z)+\delta-\sqrt{\delta^2-4}}{2\exp(z)+\delta+\sqrt{\delta^2-4}}\right)}{\log\left(\dfrac{\delta-\sqrt{\delta^2-4}}{\delta+\sqrt{\delta^2-4}}\right)}, & \delta > 2; \\[3ex] \dfrac{\exp(z)}{1+\exp(z)}, & \delta = 2; \\[2ex] \dfrac{\arctan\left(\dfrac{2\exp(z)+\delta}{\sqrt{4-\delta^2}}\right) - \arctan\left(\dfrac{\delta}{\sqrt{4-\delta^2}}\right)}{\dfrac{\pi}{2} - \arctan\left(\dfrac{\delta}{\sqrt{4-\delta^2}}\right)}, & -2 < \delta < 2. \end{cases}$$

$^1$ For all the results presented in this and the following section, the calculations are given in Appendix 9.A.
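After straightforward, but lengthy, calculations it follows that

$$\lim_{\delta \downarrow 2} F_\delta(z) = \lim_{\delta \uparrow 2} F_\delta(z) = \frac{\exp(z)}{1+\exp(z)}$$

and

$$\lim_{\delta \downarrow 2} \frac{\partial F_\delta(z)}{\partial \delta} = \lim_{\delta \uparrow 2} \frac{\partial F_\delta(z)}{\partial \delta} = \frac{\exp(z)\{1-\exp(z)\}}{6\{1+\exp(z)\}^3}, \tag{9.5}$$

hence $F_\delta(z)$ is continuous and continuously differentiable in $\delta = 2$. Furthermore, from (9.5) it follows that

$$\lim_{\delta \to 2} \frac{\partial F_\delta(x_i'\hat{\beta}_n)}{\partial \delta} = \hat{p}_i(1-\hat{p}_i)(1-2\hat{p}_i)/6,$$

with $\hat{p}_i$ the logit estimated probability. From this, the elements for computing the score test statistic $S$ as defined in (9.1) are given by

$$q_\delta(\tilde{\theta}) = \sum_i (y_i-\hat{p}_i)(1-2\hat{p}_i)/6, \tag{9.6}$$

$$H_{\beta\delta}(\tilde{\theta}) = \sum_i \hat{p}_i(1-\hat{p}_i)(1-2\hat{p}_i)x_i/6,$$

$$H_{\delta\delta}(\tilde{\theta}) = \sum_i \hat{p}_i(1-\hat{p}_i)(1-2\hat{p}_i)^2/36,$$

$$H_{\beta\beta}(\tilde{\theta}) = \sum_i \hat{p}_i(1-\hat{p}_i)x_i x_i'.$$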
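The continuity and differentiability claims around (9.5) can be checked numerically. The following is a minimal sketch, not part of the original calculations: the function name `perks_cdf` and the step size are illustrative assumptions. It evaluates the three branches of $F_\delta(z)$ and compares a central finite difference in $\delta$ across $\delta = 2$ with $p(1-p)(1-2p)/6$.

```python
import math


def perks_cdf(z, delta):
    """Perks-class CDF F_delta(z) with delta_1 eliminated (delta > -2).

    Hypothetical helper written from the three-branch formula in the text.
    """
    u = math.exp(z)
    if abs(delta - 2.0) < 1e-12:
        # Logistic case.
        return u / (1.0 + u)
    if delta > 2.0:
        s = math.sqrt(delta * delta - 4.0)
        num = math.log((2.0 * u + delta - s) / (2.0 * u + delta + s))
        den = math.log((delta - s) / (delta + s))
        return 1.0 - num / den
    s = math.sqrt(4.0 - delta * delta)
    a0 = math.atan(delta / s)
    return (math.atan((2.0 * u + delta) / s) - a0) / (math.pi / 2.0 - a0)


def derivative_at_two(z, eps=1e-3):
    """Central finite difference of F_delta(z) in delta, straddling delta = 2."""
    return (perks_cdf(z, 2.0 + eps) - perks_cdf(z, 2.0 - eps)) / (2.0 * eps)


def limit_formula(z):
    """Right-hand side of (9.5), written as p(1-p)(1-2p)/6 with p logistic."""
    p = math.exp(z) / (1.0 + math.exp(z))
    return p * (1.0 - p) * (1.0 - 2.0 * p) / 6.0
```

The finite difference deliberately uses one point from the logarithmic branch and one from the arctangent branch, so agreement with `limit_formula` also illustrates that the two one-sided derivatives coincide.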
An interesting fact is that the same test statistic is obtained when testing for omitted regressors that are symmetrically distributed (CRAMER AND RIDDER, 1988, pp. 312-313). Let the random variable $\eta$, with $E(\eta) = 0$ and $E(\eta^2) = \sigma_\eta^2$, capture the omitted variables, and let the probabilities be specified as

$$P(y_i = 1 \mid x_i, \eta_i) = \Lambda(x_i'\beta + \eta_i), \qquad i = 1,\dots,n.$$

These probabilities are by Taylor expansion approximately equal to

$$\Lambda(x_i'\beta) + \eta_i \Lambda'(x_i'\beta) + \eta_i^2 \Lambda''(x_i'\beta)/2,$$

and so, because $\Lambda' = \Lambda(1-\Lambda)$ and $\Lambda'' = \Lambda(1-\Lambda)(1-2\Lambda)$,

$$P(y_i = 1 \mid x_i) \approx \Lambda(x_i'\beta) + \sigma_\eta^2 \Lambda(x_i'\beta)\{1-\Lambda(x_i'\beta)\}\{1-2\Lambda(x_i'\beta)\}/2 = p_i + \sigma_\eta^2 p_i(1-p_i)(1-2p_i)/2,$$

with, as before, $p_i = \Lambda(x_i'\beta)$. To test whether the model is correctly specified if only $x_i'\beta$ is taken into account, is to test for $\sigma_\eta^2 = 0$. The score $\partial \log L/\partial \sigma_\eta^2$, evaluated at $\beta = \hat{\beta}_0$, $\sigma_\eta^2 = 0$, is given by

$$\sum_i (y_i-\hat{p}_i)(1-2\hat{p}_i)/2,$$

i.e. equal (up to a scale factor) to (9.6). Cramer and Ridder also noted that the test can be performed in a two-step procedure. The score test of the hypothesis $\lambda = 0$, in the logit model in which the probabilities are specified with an additional parameter $\lambda$ ($i = 1,\dots,n$), is equal to the score test of $\sigma_\eta^2 = 0$. Therefore, the two-step procedure is to first estimate $\beta$ by $\hat{\beta}_0$, compute $\hat{p}_i$ and then test whether $\lambda$ is significantly different from zero.

Another issue is the estimation of $\delta$ itself by maximum likelihood. Obviously, when the test statistic rejects the logistic assumption, estimation of $\delta$ would provide information on the misspecification and at the same time reduce this misspecification, leading to better estimates of $\beta$. However, in trying to estimate the extra parameter (with the help of the maximum likelihood program GRMAX (1982)), it became clear that the parameters in the extended model are not identified: the program did not converge and the value of $\delta$ went either to $-2$ or to $\infty$. Although this phenomenon may raise objections to the use of the Perks class of distributions, it has no implication for the asymptotic distribution of $S$, which is $\chi^2_1$.
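The elements in (9.6) can be assembled into the score statistic as sketched below. This is an illustration under stated assumptions, not the program used in the thesis: the helper names (`solve`, `fit_logit`, `perks_score_test`, `demo`), the Newton-Raphson logit fit and the simulated data are all hypothetical. The statistic is formed as $q_\delta^2 / (H_{\delta\delta} - H_{\delta\beta} H_{\beta\beta}^{-1} H_{\beta\delta})$, the scalar version of the partitioned form of the score test.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def probs(beta, X):
    """Logit probabilities p_i = Lambda(x_i' beta)."""
    return [1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x)))) for x in X]


def fit_logit(y, X, iters=50):
    """Newton-Raphson maximum likelihood for the logit model (illustrative)."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        p = probs(beta, X)
        grad = [sum((yi - pi) * x[j] for yi, pi, x in zip(y, p, X)) for j in range(k)]
        hess = [[sum(pi * (1 - pi) * x[j] * x[l] for pi, x in zip(p, X))
                 for l in range(k)] for j in range(k)]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-10:
            break
    return beta


def perks_score_test(y, X):
    """Score statistic S for H0: delta = 2, built from the elements in (9.6)."""
    k = len(X[0])
    p = probs(fit_logit(y, X), X)
    q = sum((yi - pi) * (1 - 2 * pi) for yi, pi in zip(y, p)) / 6.0
    h_dd = sum(pi * (1 - pi) * (1 - 2 * pi) ** 2 for pi in p) / 36.0
    h_bd = [sum(pi * (1 - pi) * (1 - 2 * pi) * x[j] for pi, x in zip(p, X)) / 6.0
            for j in range(k)]
    h_bb = [[sum(pi * (1 - pi) * x[j] * x[l] for pi, x in zip(p, X))
             for l in range(k)] for j in range(k)]
    w = solve(h_bb, h_bd)
    denom = h_dd - sum(a * b for a, b in zip(h_bd, w))
    return q * q / denom


def demo(n=400, seed=0):
    """Simulated logit data (assumed design: constant plus one uniform regressor)."""
    rng = random.Random(seed)
    X = [[1.0, rng.uniform(-2.0, 2.0)] for _ in range(n)]
    y = [1 if rng.random() < 1.0 / (1.0 + math.exp(-(0.5 + x[1]))) else 0 for x in X]
    return perks_score_test(y, X)
```

Under the null hypothesis the value returned by `demo()` would be compared with a $\chi^2_1$ critical value (3.84 at the 5% level). The factors 1/6 and 1/36 cancel in $S$, which is why the same statistic arises from the omitted-regressor score.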
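9.2.3. Test on the distributional assumption in the multinomial logit model

It is possible to construct a score test, as just described, for the multinomial logit model as well. Because of the normalization needed, the multinomial model can be written as (MADDALA, 1983, pp. 34-35)

$$\frac{F(x_i'\beta_j)}{1 - F(x_i'\beta_j)} = G(x_i'\beta_j),$$

and

$$p_{ij} = \frac{G(x_i'\beta_j)}{1 + \sum_{r=1}^{m-1} G(x_i'\beta_r)}, \qquad j = 1,\dots,m-1.$$

When $F$ is the logistic distribution, $G(x_i'\beta) = \exp(x_i'\beta)$. By defining a class of functions $F_\delta$ as before, the multinomial model can be tested as well. Denote

$$u_{ij} = \frac{1-\exp(x_i'\hat{\beta}_{nj})}{1+\exp(x_i'\hat{\beta}_{nj})} = \frac{\hat{p}_{im}-\hat{p}_{ij}}{\hat{p}_{im}+\hat{p}_{ij}}, \qquad j = 1,\dots,m-1,$$

with $\hat{\beta}_n$ the logit ML estimator of $\beta$ and $\hat{p}_{ij}$ the logit estimated probabilities, as in (7.12). For the Perks class of distributions, the first order derivatives of the probabilities with respect to $\delta$, when $\delta$ approaches 2, are given by

$$\lim_{\delta \to 2} \frac{\partial p_{ij}}{\partial \delta} = \hat{p}_{ij}\Big(u_{ij} - \sum_{r=1}^{m-1} \hat{p}_{ir}u_{ir}\Big)/6, \qquad j = 1,\dots,m-1.$$

The elements for computing the score test statistic (9.1) are

$$q_\delta(\tilde{\theta}) = \sum_i \sum_{j=1}^{m-1} (y_{ij}-\hat{p}_{ij})u_{ij}/6,$$

$$H_{\delta\delta}(\tilde{\theta}) = \sum_i \Big\{\sum_{j=1}^{m-1} \hat{p}_{ij}u_{ij}^2 - \Big(\sum_{j=1}^{m-1} \hat{p}_{ij}u_{ij}\Big)^2\Big\}/36,$$

$$H_{\beta_j\delta}(\tilde{\theta}) = \sum_i \hat{p}_{ij}\Big(u_{ij} - \sum_{r=1}^{m-1} \hat{p}_{ir}u_{ir}\Big)x_i/6, \qquad j = 1,\dots,m-1,$$

$$H_{\beta_j\beta_j}(\tilde{\theta}) = \sum_i \hat{p}_{ij}(1-\hat{p}_{ij})x_i x_i', \qquad j = 1,\dots,m-1,$$

$$H_{\beta_j\beta_r}(\tilde{\theta}) = -\sum_i \hat{p}_{ij}\hat{p}_{ir}x_i x_i', \qquad j,r = 1,\dots,m-1,\ j \neq r.$$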
9.3. Simulation results

In his master's thesis DE JONG (1989) calculated in a Monte Carlo study the score test statistic based on the class of distributions of Prentice (two parameters) and of Perks (one parameter), for the binary choice logit model. The model under consideration is given in (9.7), with $x_{i1} = 4.899\,i/n - 3.474$ (see GOURIEROUX et al., 1987) and $x_2 = \log|x_1|$. The binary random variable $y$ is one whenever $y^* \geq 0$ and zero otherwise. The following models are considered:

Model 1 : $\varepsilon$ is logistically distributed
Model 2 : $\varepsilon$ is normally distributed, $N(0, \pi^2/3)$
Model 3 : $\varepsilon$ follows a distribution of the Perks class with $\delta = 10$
Model 4 : $\varepsilon$ is double exponentially distributed
Model 5 : $\varepsilon$ is uniformly distributed, $U[-\pi, \pi]$
Model 6 : $\varepsilon$ is exponentially distributed.

In all models the variance of $\varepsilon$ has been kept constant ($\pi^2/3$). The mean of $\varepsilon$ is 0 in all models, except for Model 6, where it is equal to $\pi/\sqrt{3}$. The simulation results, for the number of observations equal to 1000 and the number of replications equal to 100, are presented in Table 9.1. Apparently, only when $\varepsilon$ is drawn from the exponential distribution, which is a very skew distribution, do the test statistics have considerable power. The test based on the Prentice class of distributions performs better in that case. Both test statistics seem to have very little power against the alternatives considered, even when these alternatives are proper alternatives, i.e. are a member of the specific class of distributions. (Model 2, Model 3 and Model 6 are (limiting) members of the Prentice class; Model 3 is a member of the Perks class.)
Table 9.1 Simulation results for logit model. Score test for classes of distributions by Prentice and Perks, n = 1000, 100 replications

         Prentice                              Perks
                 proportion $H_0$ rejected             proportion $H_0$ rejected
                 $\alpha$ =                            $\alpha$ =
Model    mean    0.10    0.05    0.01          mean    0.10    0.05    0.01
1        2.05    0.12    0.04    0.01          1.09    0.12    0.07    0.00
2        1.47    0.04    0.02    0.00          0.82    0.06    0.02    0.00
3        2.05    0.10    0.03    0.00          1.15    0.13    0.09    0.00
4        2.98    0.16    0.12    0.05          1.73    0.19    0.13    0.05
5        2.64    0.11    0.07    0.01          2.12    0.30    0.12    0.06
6        8.15    0.81    0.70    0.36          3.03    0.47    0.30    0.11

Source: DE JONG (1989)
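The design of this experiment can be partly reproduced as follows. This is a sketch under stated assumptions: the coefficients of model (9.7) are not available here, so only the regressors and the disturbances are generated; Model 3 (Perks, $\delta = 10$) is omitted because its rescaling to variance $\pi^2/3$ is not specified in the text; the function names are hypothetical. The scale parameters are chosen so that each disturbance has variance $\pi^2/3$, and the exponential disturbance has mean $\pi/\sqrt{3}$.

```python
import math
import random

TARGET_VAR = math.pi ** 2 / 3.0


def regressors(n):
    """x_i1 = 4.899 i/n - 3.474 and x_i2 = log|x_i1| for i = 1, ..., n."""
    x1 = [4.899 * i / n - 3.474 for i in range(1, n + 1)]
    x2 = [math.log(abs(v)) for v in x1]
    return x1, x2


def draw_disturbance(model, rng):
    """Draw one disturbance for Models 1, 2, 4, 5, 6 (Model 3 is not covered)."""
    if model == 1:  # standard logistic, variance pi^2/3
        u = rng.random()
        while u == 0.0:
            u = rng.random()
        return math.log(u / (1.0 - u))
    if model == 2:  # normal with standard deviation pi/sqrt(3)
        return rng.gauss(0.0, math.pi / math.sqrt(3.0))
    if model == 4:  # double exponential with scale b, variance 2 b^2
        b = math.pi / math.sqrt(6.0)
        e = rng.expovariate(1.0 / b)
        return e if rng.random() < 0.5 else -e
    if model == 5:  # uniform on [-pi, pi], variance (2 pi)^2 / 12
        return rng.uniform(-math.pi, math.pi)
    if model == 6:  # exponential with mean pi/sqrt(3), variance pi^2/3
        return rng.expovariate(math.sqrt(3.0) / math.pi)
    raise ValueError("model not covered by this sketch")


def sample_moments(model, n=100000, seed=1):
    """Sample mean and variance of n disturbances drawn for the given model."""
    rng = random.Random(seed)
    draws = [draw_disturbance(model, rng) for _ in range(n)]
    mean = sum(draws) / n
    var = sum((d - mean) ** 2 for d in draws) / (n - 1)
    return mean, var
```

Calling `sample_moments(m)` for each covered model gives a quick check that the variances are comparable across the designs before any test statistic is computed.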
When $\varepsilon \sim N(0, \pi^2/3)$, Model 2, both test statistics have a mean below the expected value of the corresponding chi-square distribution and the power is in both cases even lower than the size. However, by changing the variance of the disturbance $\varepsilon$, different figures are obtained. For example, if the above simulation is performed with $\varepsilon \sim N(0,1)$, the mean of the Perks statistic becomes 1.36, with rejection rates 0.18; 0.08; and 0.01 respectively. This is in line with the findings of BROWN (1982, p. 1094) for the Prentice test statistic. In his simulations, Brown considered a logit model with one regressor variable, taken from the uniform $[-3,3]$ distribution, and $\varepsilon \sim N(0,1)$. For $n = 800$ (with 500 replications), he found for the mean of the test statistic the value 2.57, with rejection rates 0.16; 0.09; 0.01 respectively. For $n = 1600$, these figures read 3.60; 0.31; 0.16; 0.03.

To investigate what happens when the deterministic part of the model is misspecified, the following two models are considered:

Model 7 : $\varepsilon$ logistically distributed, logit estimation on constant and $x_1$ only.
Model 8 : $\varepsilon$ logistically distributed, logit estimation on constant and $x_2$ only.

The results are presented in Table 9.2. Clearly, both test statistics have very high (asymptotic) power against the misspecification considered. Especially the Prentice test statistic performs well in the case of Model 7.
Table 9.2 Simulation results for logit model. Score test for classes of distributions by Prentice and Perks, n = 1000, 100 replications

         Prentice                              Perks
                 proportion $H_0$ rejected             proportion $H_0$ rejected
                 $\alpha$ =                            $\alpha$ =
Model    mean    0.10    0.05    0.01          mean    0.10    0.05    0.01
7        54.3    1.00    1.00    1.00          17.8    0.91    0.85    0.78
8        93.3    0.97    0.96    0.95          29.3    0.99    0.99    0.94

Source: DE JONG (1989)
From the results of the two simulations, the conclusion can be reached that the two test statistics do not seem to detect very well the kind of misspecification they have been designed for. However, if there is something very wrong, they show it, and so it is advisable to implement a test of this nature in the standard computer logit programs.
9.4. A nonparametric approach: the Nadaraya-Watson estimator

A different approach to test the distributional assumption in the binary choice model is to compare the null distribution with a nonparametric estimator of it. The basic ideas presented in this section stem from the theory$^2$ of approximating the mean response curve $m$ in the regression relationship

$$Y_i = m(X_i) + \varepsilon_i, \qquad i = 1,\dots,n,$$

where $\{(Y_i, X_i)\}$ are i.i.d. bivariate random variables with density function $f(y,x)$; and the regression curve $m(\cdot)$ is defined as $m(x) = E(Y \mid X = x)$. The actual approximation procedure is called smoothing. This procedure can be described as forming a weighted average of the response variable $y$ in the neighbourhood of a point $x$:

$$\hat{m}(x) = n^{-1} \sum_{i=1}^{n} W_{ni}(x) Y_i,$$

where $\{W_{ni}(x)\}$ denotes a sequence of weights. A commonly used weight sequence is obtained by kernel smoothing. In this smoothing scheme the shape of the weight function is described by a density function with a scale parameter that adjusts the size and the form of the weights near $x$. The kernel $K$ is a continuous, bounded and symmetric real function which integrates to one, $\int K(u)\,du = 1$.

$^2$ See HÄRDLE (1990)

A kernel based estimator of $m(x)$ is the so-called Nadaraya-Watson estimator defined as

$$\hat{m}(x) = \frac{n^{-1} \sum_{i=1}^{n} K_{h(n)}(x-X_i) Y_i}{\hat{f}_{h(n)}(x)}, \tag{9.8}$$

where

$$K_{h(n)}(u) = h(n)^{-1} K(u/h(n))$$

is the kernel with scale factor $h(n)$, which is called the bandwidth; and

$$\hat{f}_{h(n)}(x) = n^{-1} \sum_{i=1}^{n} K_{h(n)}(x-X_i)$$

is the Rosenblatt-Parzen kernel density estimator$^3$ of the marginal density of $X$. Often, for computational purposes, attention is restricted to kernels with bounded support. An example of a commonly used kernel function is the Epanechnikov kernel:

$$K(u) = 0.75(1-u^2) I\{|u| \leq 1\},$$

where $I\{\cdot\}$ is the indicator function.

$^3$ See SILVERMAN (1986)

For the binary choice model the mean response curve is defined by ($x$ is a vector here): $m(x) = E(Y \mid x) = F_\varepsilon(x'\beta)$, where the notation $F_\varepsilon$, the cumulative distribution function of the random disturbances in the underlying continuous linear model, is introduced for practical purposes. Let $Z_i = X_i'\beta$ and assume for the moment that $\beta$ is known; then the Nadaraya-Watson estimator of the cumulative distribution function $F_\varepsilon(z)$ is given by (to simplify the notation the dependence of $h$ on $n$ will be dropped):

$$\hat{F}_h(z) = \frac{n^{-1} \sum_{i=1}^{n} K_h(z-Z_i) Y_i}{\hat{g}_h(z)}, \tag{9.9}$$

with $\hat{g}_h(z)$ the kernel density estimator of the marginal density of $Z = X'\beta$. Although a large amount of theory has been developed concerning the Nadaraya-Watson estimator, not everything is directly applicable to the case of the binary choice model. This is mainly due to the discrete nature of $Y$, and so to the non-existence of the density $f(x,y)$. The use of the Nadaraya-Watson estimator of the form (9.8) has been considered for testing the logit model with one explanatory variable by COPAS (1983), who compared the estimator with the logit results graphically, and by AZZALINI et al. (1989), who not only used graphical devices, but also considered the pseudo likelihood ratio statistic defined as

$$LR = \sum_{i=1}^{n} \left[ y_i \log\left\{\frac{\hat{F}_h(x_i)}{\Lambda(x_i; \hat{\alpha}_n, \hat{\gamma}_n)}\right\} + (1-y_i) \log\left\{\frac{1-\hat{F}_h(x_i)}{1-\Lambda(x_i; \hat{\alpha}_n, \hat{\gamma}_n)}\right\} \right],$$

where $\hat{\alpha}_n$ and $\hat{\gamma}_n$ are the logit estimators of the constant and slope parameter respectively. Here, attention is focused on two functionals based on the Nadaraya-Watson estimator of the form (9.9), viz. the supremum absolute deviation and a quadratic functional. In particular, it is investigated what happens when the parameter vector $\beta$ is estimated and the kernel estimator is based on these estimates.
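Estimator (9.9) with the Epanechnikov kernel can be sketched in a few lines. The function names below are illustrative assumptions, and the index values $z_i$ are taken as given (known $\beta$). Because the factors $n^{-1}$ and $h^{-1}$ appear in both numerator and denominator, the estimate is simply a kernel-weighted average of the binary responses.

```python
import math


def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 (1 - u^2) on |u| <= 1, zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def nw_cdf(z, z_obs, y_obs, h):
    """Nadaraya-Watson estimate of F(z) = E(Y | Z = z) with bandwidth h.

    Returns NaN when no observation falls inside the kernel window.
    """
    weights = [epanechnikov((z - zi) / h) / h for zi in z_obs]
    denom = sum(weights)
    if denom == 0.0:
        return float("nan")
    return sum(w * yi for w, yi in zip(weights, y_obs)) / denom


# Small worked example: three observations, bandwidth 2, evaluated at z = 0.
# Kernel values are K(0.5) = 0.5625, K(0) = 0.75, K(-0.5) = 0.5625, so the
# estimate is (0.75 + 0.5625) / 1.875 = 0.7.
example = nw_cdf(0.0, [-1.0, 0.0, 1.0], [0, 1, 1], 2.0)
```

Since the weights are nonnegative, the estimate always lies between 0 and 1, although it need not be monotone in $z$.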
9.4.1. The asymptotic distribution of the supremum absolute deviation

First, assume that $\beta$ is known. The supremum absolute deviation is then a properly normalized version of the statistic

$$\sup_z |\hat{F}_h(z) - F_\varepsilon(z)|.$$
This functional was first considered by BICKEL AND ROSENBLATT (1973) in the case of kernel density estimation. They showed that it is asymptotically distributed as an extreme value, or Gumbel, distributed random variable. This result has been extended to the Nadaraya-Watson estimator by JOHNSTON (1982), LIERO (1982) and HÄRDLE (1989). Especially the results of Härdle are useful for our purpose. The asymptotic distribution and the assumptions needed are stated in the next theorem (HÄRDLE, 1989, pp. 165-166):

Theorem 9.1

Assume
1. With probability one, $Z_i$ lies in a compact set. The marginal density $g(z)$ of $Z$ is bounded away from zero and differentiable.
2. The kernel $K(\cdot)$ is nonnegative, has compact support $[-A, A]$ and is continuously differentiable.
3. $h = n^{-\delta}$, $1/5 < \delta < 1/3$.
4. $\sigma_Y^2(z) = \mathrm{var}(Y \mid Z = z)$ is bounded away from zero and infinity and is continuous.
5. $F_\varepsilon(z)$ is twice continuously differentiable.

Let

$$d_n = (2\delta\log(n))^{1/2} + (2\delta\log(n))^{-1/2}\{\log(c_1(K)/\pi^{1/2}) + 0.5[\log(\delta)+\log(\log(n))]\}$$

if $c_1(K) = \{K^2(A)+K^2(-A)\}/\{2\lambda(K)\} > 0$,

$$d_n = (2\delta\log(n))^{1/2} + (2\delta\log(n))^{-1/2}\log(c_2(K)/2\pi)$$

otherwise, with $c_2(K) = \int_{-A}^{A} (K'(u))^2\,du/\{2\lambda(K)\}$. Then, if $n \to \infty$:

$$P\{(2\delta\log(n))^{1/2}[\sup_z (nh)^{1/2} g(z)^{1/2} (\sigma_Y^2(z))^{-1/2} |\hat{F}_h(z)-F_\varepsilon(z)|/\lambda(K)^{1/2} - d_n] < v\} \to \exp(-2\exp(-v)).$$

For $\delta_2 > 2$, define

$$h = \frac{\delta_2 - \sqrt{\delta_2^2-4}}{\delta_2 + \sqrt{\delta_2^2-4}},$$

so $\delta_1 = -\sqrt{\delta_2^2-4}/\log(h)$. The first order derivative of $\log(h)$ with respect to $\delta_2$ is

$$\frac{\partial \log(h)}{\partial \delta_2} = -\frac{2}{\sqrt{\delta_2^2-4}}.$$

The limit of $\delta_1$ if $\delta_2$ approaches 2 from above is then (using l'Hôpital's rule):

$$\lim_{\delta_2 \downarrow 2} \delta_1 = -\lim_{\delta_2 \downarrow 2} \frac{\partial \sqrt{\delta_2^2-4}/\partial \delta_2}{\partial \log(h)/\partial \delta_2} = -\lim_{\delta_2 \downarrow 2} \frac{\delta_2/\sqrt{\delta_2^2-4}}{-2/\sqrt{\delta_2^2-4}} = 1. \tag{9.A.5}$$

For $-2 < \delta_2 < 2$, define

$$g = \frac{\delta_2}{\sqrt{4-\delta_2^2}},$$

so $\delta_1 = \sqrt{4-\delta_2^2}/(2\{\pi/2-\arctan(g)\})$. The first-order derivative of $\arctan(g)$ with respect to $\delta_2$ is

$$\frac{\partial \arctan(g)}{\partial \delta_2} = \frac{1}{\sqrt{4-\delta_2^2}}.$$

The limit of $\delta_1$ if $\delta_2$ approaches 2 from below is then:

$$\lim_{\delta_2 \uparrow 2} \delta_1 = \lim_{\delta_2 \uparrow 2} \frac{\partial \sqrt{4-\delta_2^2}/\partial \delta_2}{\partial (2\{\pi/2-\arctan(g)\})/\partial \delta_2} = \lim_{\delta_2 \uparrow 2} \frac{-\delta_2/\sqrt{4-\delta_2^2}}{-2/\sqrt{4-\delta_2^2}} = 1. \tag{9.A.6}$$

Combination of (9.A.5) and (9.A.6) then results in

$$\lim_{\delta_2 \to 2} F_\delta(z) = \lim_{\delta_2 \downarrow 2} F_\delta(z) = \lim_{\delta_2 \uparrow 2} F_\delta(z) = \int_0^{e^z} (1+u)^{-2}\,du = e^z/(1+e^z).$$
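The first-order derivative of $F_\delta(z)$ with respect to $\delta_2$ is (APOSTOL, 1974, Theorem 10.39, p. 283):

$$\frac{\partial F_\delta(z)}{\partial \delta_2} = -\delta_1 \int_0^{e^z} \frac{u}{(u^2+\delta_2 u+1)^2}\,du + \frac{\partial \delta_1}{\partial \delta_2} \int_0^{e^z} \frac{du}{u^2+\delta_2 u+1}. \tag{9.A.7}$$

As shown before, the first term in (9.A.7) is continuous in $\delta_2 = 2$. The second term is continuous in $\delta_2 = 2$ if $\partial \delta_1/\partial \delta_2$ is. For $\delta_2 > 2$, $\partial \delta_1/\partial \delta_2$ is given by

$$\frac{\partial \delta_1}{\partial \delta_2} = \frac{2\delta_1-\delta_2}{\sqrt{\delta_2^2-4}\,\log(h)},$$

and so (l'Hôpital),

$$\lim_{\delta_2 \downarrow 2} \frac{\partial \delta_1}{\partial \delta_2} = \lim_{\delta_2 \downarrow 2} \frac{2\,\partial \delta_1/\partial \delta_2 - 1}{\{\delta_2\log(h)/\sqrt{\delta_2^2-4}\} - 2}.$$

Now, $\lim_{\delta_2 \downarrow 2} \{\delta_2\log(h)/\sqrt{\delta_2^2-4}\} = -2$ (l'Hôpital again), and so

$$\lim_{\delta_2 \downarrow 2} \frac{\partial \delta_1}{\partial \delta_2} = 1/6. \tag{9.A.8}$$

For $-2 < \delta_2 < 2$,

$$\frac{\partial \delta_1}{\partial \delta_2} = \frac{2\delta_1-\delta_2}{2\sqrt{4-\delta_2^2}\,\{\pi/2-\arctan(g)\}},$$

and so,

$$\lim_{\delta_2 \uparrow 2} \frac{\partial \delta_1}{\partial \delta_2} = \lim_{\delta_2 \uparrow 2} \frac{2\,\partial \delta_1/\partial \delta_2 - 1}{(-2\delta_2\{\pi/2-\arctan(g)\}/\sqrt{4-\delta_2^2}) - 2},$$

and because $\lim_{\delta_2 \uparrow 2} (-2\delta_2\{\pi/2-\arctan(g)\}/\sqrt{4-\delta_2^2}) = -2$,

$$\lim_{\delta_2 \uparrow 2} \frac{\partial \delta_1}{\partial \delta_2} = 1/6. \tag{9.A.9}$$

By combining the results (9.A.5), (9.A.6), (9.A.8) and (9.A.9), the first-order derivative of $F_\delta(z)$ with respect to $\delta_2$, for $\delta_2$ approaching 2, is

$$\frac{e^z(1-e^z)}{6(1+e^z)^3}.$$

Because

$$\frac{\partial \log L}{\partial \theta} = \sum_i \frac{y_i-p_i}{p_i(1-p_i)} \frac{\partial p_i}{\partial \theta},$$

the elements of the test statistic $S$ are given by (here, $\delta := \delta_2$),

$$\frac{\partial \log L}{\partial \delta}\Big|_{\tilde{\theta}} = \sum_i \frac{(y_i-\hat{p}_i)}{\hat{p}_i(1-\hat{p}_i)}\,\hat{p}_i(1-\hat{p}_i)(1-2\hat{p}_i)/6 = \sum_i (y_i-\hat{p}_i)(1-2\hat{p}_i)/6,$$

$$E\Big(\Big[\frac{\partial \log L}{\partial \beta}\frac{\partial \log L}{\partial \delta}\Big]\Big|_{\tilde{\theta}}\Big) = E\Big(\Big[\sum_i (y_i-\hat{p}_i)x_i \sum_i (y_i-\hat{p}_i)(1-2\hat{p}_i)/6\Big] \Big|\ \hat{\beta}_n\Big) = \sum_i \hat{p}_i(1-\hat{p}_i)(1-2\hat{p}_i)x_i/6.$$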
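The multinomial logit model

Again, denote $\delta := \delta_2$. In the multinomial logit model the probabilities are then specified as (suppressing subscript $i$)

$$p_\delta^j = \frac{G_\delta^j}{1+\sum_{r=1}^{m-1} G_\delta^r}, \qquad p_\delta^m = \Big(1+\sum_{r=1}^{m-1} G_\delta^r\Big)^{-1},$$

with $j = 1,\dots,m-1$. The first order derivatives of the probabilities with respect to $\delta$ are

$$\frac{\partial p_\delta^j}{\partial \delta} = \frac{\partial G_\delta^j/\partial \delta - p_\delta^j \sum_{r=1}^{m-1} \partial G_\delta^r/\partial \delta}{1 + \sum_{r=1}^{m-1} G_\delta^r}, \qquad j = 1,\dots,m-1.$$

Because

$$\frac{\partial G_\delta^j}{\partial \delta} = \frac{\partial F_\delta^j/\partial \delta}{(1-F_\delta^j)^2},$$

it follows from the earlier results that

$$\lim_{\delta \to 2} \frac{\partial p_\delta^j}{\partial \delta} = \hat{p}_j\Big(u_j - \sum_{r=1}^{m-1} \hat{p}_r u_r\Big)/6,$$

where

$$u_j = \frac{1-\exp(x'\hat{\beta}_{nj})}{1+\exp(x'\hat{\beta}_{nj})}.$$

The first order derivative of the loglikelihood function with respect to $\theta$ is given by

$$\frac{\partial \log L}{\partial \theta} = \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{y_{ij}}{p_{ij}} \frac{\partial p_{ij}}{\partial \theta},$$

and so

$$\frac{\partial \log L}{\partial \delta}\Big|_{\tilde{\theta}} = \sum_{i=1}^{n} \sum_{j=1}^{m-1} (y_{ij}-\hat{p}_{ij})u_{ij}/6.$$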
Chapter 9
Appendix 9.B Proof of Theorem 9.1 The proof of Theorem 9.1 has been obtained by HARDLE (1989, pp. 168175) in the case of a continuous response variable y, and this proof cannot be applied directly to the model with a discrete y. However, by adding a continuous random variable to y, and letting this variable disappear asymptotically, use can be made of Hardle's proof. Here, I do not replicate Hardle's results fully, but merely highlight some of the crucial steps to be made. Suppose there exist n observations on a random variable U, which is uniformly [-1, I) distributed, independently from Z and Y. Construct the continuous variable
1
= l, ... ,n,
with the sequence {an} constructed as a 0 ~a 1~a 2 ~ ... ~an~o. a 0 some constant, and limn--+ooan = 0. The conditional distribution function of P{Y: ~ q I Z=z}= P{Y=I
v:
is then given by
I Z=z}P{anU~q-1} + P{Y=O I Z=z}P(anU~q}
= F,(z)[(2anr 1(an+q-l)l{(l-an)~q~(l+an)} + l{q>(l+an)}] + (1-F,(z))[(2anr 1(an+q)l{
-an~q~an
} + l{q>an}], (9.8.1)
and the conditional mean and variance are
E(V: I Z=z) = F,(z) var(Y: I Z=z) = a;(z) + a//3
= tp~(z).
By construction, the conditional expectation of
v:
is equal to the
conditional expectation of Y / or every n, and the conditional variance is bounded away from zero and infinity.
Tests on the distributional assumption
139
Consider the Nadaraya-Watson estimator based on Y~:
which is clearly also an estimator for Ft(z). Then does Httrdle's proof apply to sup., I F~(z) - F 6 (z) I ? First, F~(z) - Fg(z) is written as (9.B.2) where, H 0 (z) = gh(z)(F~(z) - F g(z)}, R 0 (z)
= (g(z)-gh(z)}{F~(z)-FsCz)}/g(z)
~•h(z)-Fs( l +an)}
]
$$= [1-F_\varepsilon(z)+F_\varepsilon(z)(2a_n)^{-1}(a_n+a_n u_i)]y_i + (1-F_\varepsilon(z))(2a_n)^{-1}(a_n+a_n u_i)(1-y_i),$$

where $u_i^*$ is a realization of $U^* \sim U[0,1]$. So the conditional distribution of $Y_i^*$ is uniform and independent of $a_n$.

It then follows that the approximation of HÄRDLE (1989, Lemma 2.1, p. 167) applies. Obviously, this approximation cannot be constructed for discrete random variables. By (9.B.5), meaning that $Y_i^*$ can be integrated out, and noting that $Y_i^*$ is bounded, it is easily seen that the sequence of lemmata of HÄRDLE (1989, Lemmas 3.2, 3.3, 3.4, 3.5 and 3.7), needed to prove that the limit distribution of $\sup_z |V_n(z)|$ is the same as that of $\sup_z |V_{1,n}(z)|$, where $V_{1,n}$ is defined in terms of the Wiener process $\{W(\cdot)\}$ on $(-\infty,\infty)$, also hold in our (constructed) case. The limit distribution of $\sup_z |V_{1,n}(z)|$ is given in BICKEL AND ROSENBLATT (1973, Theorem 3.1, p. 1076), giving the result that if $n \to \infty$

$$P\{(2\delta\log(n))^{1/2}[\sup_z (nh)^{1/2} g(z)^{1/2} \psi_n(z)^{-1} |F_h^*(z) - F_\varepsilon(z)|/\lambda(K)^{1/2} - d_n] < v\} \to \exp(-2\exp(-v)).$$

Finally, consider the difference between the two estimators,

$$F_h^*(z) - \hat{F}_h(z) = a_n \frac{n^{-1} \sum_{i=1}^{n} K_h(z-Z_i)U_i}{\hat{g}_h(z)}.$$

Obviously, $\sup_z |n^{-1} \sum_{i=1}^{n} K_h(z-Z_i)U_i/\hat{g}_h(z)| = O_p\{(nh)^{-1/2}(\log(n))^{1/2}\}$, so if $a_n = o_p((\log(n))^{-1})$, then $\sup_z |F_h^*(z) - \hat{F}_h(z)| = o_p\{(nh)^{-1/2}(\log(n))^{-1/2}\}$ and is therefore asymptotically negligible. Furthermore, $\psi_n(z) \to \sigma_Y(z)$ if $n \to \infty$, and so Härdle's theorem also applies to the model with a discrete response variable.
10. Overall goodness-of-fit tests
10.1. Introduction

Overall goodness-of-fit tests analyse the null hypothesis that the specified parametric model is correct, without a specific alternative hypothesis of the kind used in Chapter 9. Tests of this kind have been developed by HOSMER AND LEMESHOW (1980) and ANDREWS (1988). Both these tests are based on grouping according to a partition of $x'\beta$. The test of Hosmer and Lemeshow then is Pearson's chi-square test statistic with random cells and so the asymptotic theory of MOORE AND SPRUILL (1975) obtains. Andrews' test is also asymptotically chi-square distributed and makes use of the asymptotic covariance matrix of the differences between the actual and estimated frequencies. Two tests are presented in this chapter. The first test is based on the sum of weighted squared residuals, which is often believed to be asymptotically chi-square distributed, but will be shown to be asymptotically normally distributed. This test is derived for both the binary choice and the multinomial logit model. The second test is Hausman's test (HAUSMAN, 1978), applied to the difference between the probit or logit estimator and Han's maximum rank correlation estimator (HAN, 1987) in the binary choice model.

10.2. A goodness-of-fit test for the binary choice model based on weighted squared residuals

A goodness-of-fit measure in binary choice models is the sum of weighted squared residuals

$$T_n = \sum_{i=1}^{n} \frac{(y_i-\hat{p}_i)^2}{\hat{p}_i(1-\hat{p}_i)},$$

where, as before, $\hat{p}_i$ is the estimated probability of the binary random variable $y_i$ being one for observation $i$, and $n$ is the sample size. PREGIBON (1981), LANDWEHR ET AL. (1984) and KAY AND LITTLE (1986), among others, refer to $T_n$ as the "chi-square goodness-of-fit measure", assuming an asymptotic $\chi^2_{n-k-1}$ distribution, with $k$ the number of parameters to be estimated. This assumption seems to be based on the fact that in the case of grouped observations the statistic

$$G_n = \sum_{j=1}^{m} \frac{(s_j - n_j\hat{p}_j)^2}{n_j\hat{p}_j(1-\hat{p}_j)}$$

has a limiting $\chi^2_{m-k}$ distribution if cell sizes tend to infinity and the number of cells remains fixed, where $s_j$ is the sum of the $y_i$'s for $i$ in class $j$; $n_j$ is the number of observations in class $j$; and $m$ is the number of different classes. However, the relationship between $T_n$ and $G_n$ is unclear because, if $n$ goes to infinity, $T_n$ contains an infinite number of classes, each class containing only one observation. It will be shown that $n^{-1/2}(T_n-n)$ asymptotically follows a normal distribution and that the $\chi^2_{n-k-1}$ approximation does not hold in general.

$T_n$ is used in measuring goodness of fit in two ways: not only as an overall goodness-of-fit measure, but also through its individual components, $(y_i-\hat{p}_i)/\{\hat{p}_i(1-\hat{p}_i)\}^{1/2}$, which play a role in detecting outliers by being compared with the standard normal distribution (see PREGIBON, 1981). Clearly the normal instead of the chi-square asymptotic distribution of $T_n$ will have an impact on the use of the statistic as a goodness-of-fit measure. After having established the asymptotic normality of $T_n$ it is easy to define a new chi-square test statistic based on $T_n$.

10.2.1. The asymptotic distribution of $T_n$

The asymptotic distribution of $T_n$ is presented here as a theorem. The proof will be given in Appendix 10.A.
Theorem 10.1

If the Assumptions 7.1, 7.2 and 7.3 hold then

$$\frac{n^{-1/2}(T_n-n)}{\sigma_n} \xrightarrow{d} N(0,1),$$

where

$$\sigma_n^2 = n^{-1} \sum_{i=1}^{n} \frac{(1-2p_i^0)^2}{p_i^0(1-p_i^0)} - v_n'\Omega^{-1}v_n,$$

with

$$v_n = n^{-1} \sum_{i=1}^{n} \frac{1-2p_i^0}{p_i^0(1-p_i^0)} f_i^0 x_i, \tag{10.1}$$

and $\Omega$ as defined in (7.6).

Proof: see Appendix 10.A.

From Theorem 10.1 it follows that the asymptotic distribution of $T_n$ is the normal distribution with mean $n$ and variance $n\sigma_n^2$. The variance $\sigma_n^2$ consists of two parts, of which the second part, $v_n'\Omega^{-1}v_n$, is due to the estimation of $p_i^0$ by $\hat{p}_i$. So estimation leads to a shrinkage in the variance. Moreover, estimation of $p_i^0$ by $\hat{p}_i$ asymptotically only influences the variance of $T_n$, not its expectation, another indication that $T_n$ asymptotically does not follow a chi-square distribution. If the asymptotic distribution of $T_n$ could be approximated by a $\chi^2_{n-k-1}$ distribution it would hold that, for $z \in \mathbb{R}$,

$$\sup_z |P\{T_n \leq z\} - P\{\chi^2_{n-k-1} \leq z\}| = \sup_z |P\{n^{-1/2}(T_n-n) \leq z\} - P\{n^{-1/2}(\chi^2_{n-k-1}-(n-k-1)) \leq z+n^{-1/2}(k+1)\}|$$

tends to zero if $n$ goes to infinity. This can only happen if $n^{-1/2}(T_n-n)$ and $n^{-1/2}(\chi^2_{n-k-1}-(n-k-1))$ have the same limiting distribution. Because

$$\frac{\chi^2_n - n}{(2n)^{1/2}} \xrightarrow{d} N(0,1),$$

this can only hold if $\sigma^2 := \lim_{n\to\infty} \sigma_n^2 = 2$, and in general there is clearly no reason for this to be the case. For example, in the logit model with $k = 1$, $\beta_0 = 1$ and convergence of $\{x_i\}$ to the Uniform$[-1,2]$ distribution, $\sigma^2$ equals 0.034. This can be seen as follows. Because in the logit model $f_i = p_i(1-p_i)$, the following expressions for $v := \lim_{n\to\infty} v_n$, $\Omega$ and $\sigma^2$ are obtained:

$$v = \frac{1}{3}\int_{-1}^{2} (1-2F(x))x\,dx = -0.428,$$

$$\Omega = \frac{1}{3}\int_{-1}^{2} x^2 f(x)\,dx = 0.157,$$

$$\sigma^2 = \frac{1}{3}\int_{-1}^{2} (\exp(-x)+\exp(x)-2)\,dx - v'\Omega^{-1}v = 0.034.$$

Approximating the distribution of $T_n$ by a chi-square distribution can therefore lead to misleading conclusions.
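The three numbers in this logit example can be reproduced by numerical integration. The sketch below uses a hand-written composite Simpson rule (the helper names are illustrative assumptions); $F$ and $f$ are the logistic distribution and density functions, and the factor 1/3 is the Uniform$[-1,2]$ density.

```python
import math


def simpson(func, a, b, n=2000):
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    total = func(a) + func(b)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * func(a + i * h)
    return total * h / 3.0


def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))


def logistic_pdf(x):
    p = logistic_cdf(x)
    return p * (1.0 - p)


LOWER, UPPER = -1.0, 2.0
DENSITY = 1.0 / (UPPER - LOWER)  # Uniform[-1, 2] density, equal to 1/3

# v = (1/3) * integral of (1 - 2F(x)) x
v = simpson(lambda x: (1.0 - 2.0 * logistic_cdf(x)) * x * DENSITY, LOWER, UPPER)
# Omega = (1/3) * integral of x^2 f(x)
omega = simpson(lambda x: x * x * logistic_pdf(x) * DENSITY, LOWER, UPPER)
# First variance term: (1 - 2p)^2 / (p (1 - p)) = exp(-x) + exp(x) - 2 in the logit model
first_term = simpson(lambda x: (math.exp(-x) + math.exp(x) - 2.0) * DENSITY, LOWER, UPPER)
sigma2 = first_term - v * v / omega
```

The resulting `sigma2` is far below 2, the value that a chi-square approximation would require.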
10.2.2. Chi-square test and simulation results

With Theorem 10.1 a chi-square test based on the sum of weighted squared residuals can be defined:

$$H_n = \frac{(T_n-n)^2}{n\hat{\sigma}_n^2},$$

which asymptotically follows a chi-square distribution with 1 degree of freedom. An estimator of $\sigma_n^2$ is obtained by substituting the ML-estimators $\hat{p}_i$ and $\hat{\Omega}$ for $p_i^0$ and $\Omega$ respectively. This estimator, $\hat{\sigma}_n^2$, is consistent in the sense that, for $\eta > 0$,

$$\lim_{n\to\infty} P\{|\hat{\sigma}_n^2 - \sigma_n^2| > \eta\} = 0. \tag{10.2}$$

$H_n$ is a test statistic that is easy to compute and which has a wide range of alternatives, just like the chi-square tests of HOSMER AND LEMESHOW (1980) and ANDREWS (1988) that are based on a partition of the $Y \times X$ plane.
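A minimal sketch of the computation of $H_n$ for a logit model with a single regressor is given below. It rests on assumptions that are not confirmed by the text shown here: $\Omega$ from (7.6) is taken to be the average logit information $n^{-1}\sum_i \hat{p}_i(1-\hat{p}_i)x_i^2$, which matches the worked example with the Uniform$[-1,2]$ regressor, and the function name is hypothetical. In practice the probabilities would be the logit ML fitted values; the tiny data set in the example is only an arithmetic check.

```python
def hn_statistic(y, x, p):
    """Return (T_n, sigma2_hat, H_n) for a scalar-regressor logit model.

    y: binary responses, x: regressor values, p: fitted probabilities.
    Assumes Omega = mean of p(1-p) x^2 (logit information, an assumption here).
    """
    n = len(y)
    # Sum of weighted squared residuals T_n.
    t_n = sum((yi - pi) ** 2 / (pi * (1.0 - pi)) for yi, pi in zip(y, p))
    # First variance term: mean of (1 - 2p)^2 / (p (1 - p)).
    first = sum((1.0 - 2.0 * pi) ** 2 / (pi * (1.0 - pi)) for pi in p) / n
    # In the logit model f_i = p_i (1 - p_i), so v_n reduces to the mean of (1 - 2p) x.
    v_n = sum((1.0 - 2.0 * pi) * xi for pi, xi in zip(p, x)) / n
    omega = sum(pi * (1.0 - pi) * xi * xi for pi, xi in zip(p, x)) / n
    sigma2 = first - v_n * v_n / omega
    h_n = (t_n - n) ** 2 / (n * sigma2)
    return t_n, sigma2, h_n


# Arithmetic check with two observations:
# T_n = 0.5, sigma2 = 2.25 - 0.81 / 0.4 = 0.225, H_n = 2.25 / 0.45 = 5.0
t_example, s2_example, h_example = hn_statistic([1, 0], [1.0, -2.0], [0.8, 0.2])
```

The value of `h_n` would be compared with a $\chi^2_1$ critical value; with several regressors the scalar `omega` becomes a matrix and `v_n * v_n / omega` becomes the quadratic form $v_n'\Omega^{-1}v_n$.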
By a simulation study some indication is given of what kind of violations of the basic assumptions in the logit model Hn is able to detect. First a regular logit model and then two violations of the basic assumptions are considered, the first related to the functional form, the second related to the distributional assumption. The models under consideration are : exp(l+zi) Model I : P(Yi= I I zi) = - - - 1+exp() +zi) z ~ N(-1/2,1). exp() +z;+ln( I zi I }) Model 2 : P(Y;= I I zi) = - - - - - - 1+exp() +zi+ln( I zi I }) z ~ N( - I /2, I)
estimation on [I ,zj] only. Model 3 : P(Yi= I I xi) = exp(~(xi-1}) ~
= ✓3/1r
x ~ U[-3,3).

The sample size is 1000, and 100 replications were performed for each simulation. The results are given in Table 10.1.

Table 10.1 Simulation Study of Test Statistic for Logistic Model, Sample Size 1000, 100 Replications

            Test Statistic         Proportion H0 Rejected
            Mean     Variance      α=0.10    α=0.05    α=0.01
Model 1     1.00     2.08          0.12      0.05      0.02
Model 2     9.56     7.84          0.77      0.68      0.58
Model 3     5.28     3.39          0.94      0.77      0.23
$H_n$ has considerable (asymptotic) power against the two kinds of misspecification considered. The mean and variance of $T_n$ in the three models are 999 and 216 for model 1, 997 and 2.92 for model 2, and 830 and 666 for model 3. The asymptotic expectation of $T_n$ in model 1 is 1000, the asymptotic variance 193. Model 1 is also evaluated in the case of 30 observations and 500 replications. The mean and variance of $T_n$ are 29.43 and 38.5 respectively. The asymptotic expectation of $T_n$ now is 30 (the asymptotic variance is 13.62), so the effect on the expectation of $T_n$ of estimating $p_{i0}$ by $\hat p_i$ is negligible even for small sample sizes.
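One replication of the Model 1 experiment can be sketched as follows. This is an illustration under stated assumptions, not the program used for Table 10.1: the logit model is fitted by a hand-written Newton-Raphson routine, and the variance estimate uses the logit simplification $f_i = p_i(1-p_i)$, so that $\hat v_n = n^{-1}\sum_i (1-2\hat p_i)x_i$ and $\hat\Omega = n^{-1}\sum_i \hat p_i(1-\hat p_i)x_ix_i'$ with $x_i = (1, z_i)'$. The function names and the seed are hypothetical.

```python
import math
import random

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logit(xs, ys, iters=25):
    """Newton-Raphson ML for a logit model with a constant and one regressor."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = logistic(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p            # score with respect to the constant
            g1 += (y - p) * x      # score with respect to the slope
            h00 += w               # elements of minus the Hessian
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def h_test(xs, ys):
    """Return (T_n, estimated sigma_n^2, H_n) for the fitted binary logit model."""
    n = len(ys)
    b0, b1 = fit_logit(xs, ys)
    t = s = v0 = v1 = o00 = o01 = o11 = 0.0
    for x, y in zip(xs, ys):
        p = logistic(b0 + b1 * x)
        w = p * (1.0 - p)
        t += (y - p) ** 2 / w              # weighted squared residual
        s += (1.0 - 2.0 * p) ** 2 / w
        v0 += 1.0 - 2.0 * p
        v1 += (1.0 - 2.0 * p) * x
        o00 += w
        o01 += w * x
        o11 += w * x * x
    s, v0, v1, o00, o01, o11 = (a / n for a in (s, v0, v1, o00, o01, o11))
    det = o00 * o11 - o01 * o01
    quad = (o11 * v0 * v0 - 2.0 * o01 * v0 * v1 + o00 * v1 * v1) / det  # v' Omega^{-1} v
    sigma2 = s - quad
    return t, sigma2, (t - n) ** 2 / (n * sigma2)

random.seed(0)
z = [random.gauss(-0.5, 1.0) for _ in range(1000)]
y = [1 if random.random() < logistic(1.0 + zi) else 0 for zi in z]
t_n, sigma2_hat, h_n = h_test(z, y)
print(t_n, sigma2_hat, h_n)
```

Under the null, $H_n$ would be compared with the chi-square critical values for one degree of freedom (3.84 at 5%, 6.63 at 1%). Repeating the draw many times gives a rough counterpart of the first row of Table 10.1.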
10.3. A goodness-of-fit test for the multinomial logit model based on weighted squared residuals

In this section the asymptotic normality of the sum of weighted squared residuals in the multinomial logit model is derived. This sum is defined as

$$T_n^m = \sum_{i=1}^{n}\sum_{j=1}^{m}\frac{(Y_{ij}-\hat p_{ij})^2}{\hat p_{ij}(1-\hat p_{ij})}, \qquad (10.3)$$

where m is the number of different categories, $Y_{ij} = 1$ if the i-th individual falls in category j, and $\hat p_{ij}$ is the estimated probability of the event $Y_{ij} = 1$. In the binary logit model the number of categories, m, equals two and definition (10.3) equals

$$T_n^2 = 2\sum_{i=1}^{n}\frac{(Y_i-\hat p_i)^2}{\hat p_i(1-\hat p_i)},$$

where $\hat p_i$ is the estimated probability of the binary random variable $Y_i$ being one.
10.3.1. The asymptotic distribution of $T_n^m$

The asymptotic distribution of $T_n^m$ is presented in Theorem 10.2.

Theorem 10.2
If the Assumptions 7.1M and 7.2M hold then

$$\frac{n^{-1/2}(T_n^m-nm)}{\sigma_n} \xrightarrow{d} N(0,1),$$

where

with

$$v_{nr} = n^{-1}\sum_{i=1}^{n}\Big(\frac{1-2p^0_{ir}}{p^0_{ir}(1-p^0_{ir})} - \sum_{j=1}^{m}\frac{1-2p^0_{ij}}{1-p^0_{ij}}\Big)p^0_{ir}x_i,$$

and $\Omega$ as defined in (7.10).

Proof: see Appendix 10.A.

In the binary logit model, m = 2, the asymptotic variance becomes

with

(10.4)

which is the same result as in Theorem 10.1.
10.3.2. Chi-square test and empirical examples

With Theorem 10.2 the chi-square test is

$$H_n = \frac{(T_n^m-nm)^2}{n\hat\sigma_n^2}, \qquad (10.5)$$

which asymptotically follows a chi-square distribution with 1 degree of freedom. An estimator of $\sigma_n$, consistent in the sense of (10.2), is obtained by substituting the ML-estimators $\hat p_{ij}$ and $\hat\Omega$ for $p^0_{ij}$ and $\Omega$ respectively.

The test statistic is illustrated by an analysis of private car ownership in the Dutch Budget Survey of 1980 (see CRAMER AND RIDDER (1991); CRAMER (1991); and Chapter 8). The basic data set consists of 2820 households, with four categories of private car ownership, viz.
- no car,
- one used car,
- one new car,
- several cars.
This is denoted as data set A. A simpler set B has been constructed with three categories only, by omitting households with more than one private car (n = 2645), and a set C by pooling all car ownership categories so that the distinction is reduced to the simple dichotomy of "no private car" against "some private car" (n = 2820). In all analyses five variables are used as regressors, in various combinations (together with a constant), namely
X1: log of household income
X2: household size (weighted number of individuals)
X3: age of head of household (five year classes)
X4: level of urbanization (six classes, from country to city)
X5: (0,1) dummy for the presence of a business car.
The results of the test for the three data sets and various combinations of the explanatory variables are set out in Table 10.2.
Table 10.2. Results test statistic on data of car ownership

              data set A             data set B             data set C
              m=4, nm=11280          m=3, nm=7935           m=2, nm=5640
Regressors    T_n^m     H_n          T_n^m     H_n          T_n^m     H_n
1 2 3 4 5     21267     0.18         18544     152.86**     7422      30.32**
1 2 5         28954     0.25         26901     302.16**     9069      82.31**
1 5           24912     0.02         21809     235.16**     10540     76.93**
1 3           11713     0.02         8007      1.39         5869      20.76**
2 3 5         11423     1.77         8077      16.60**      5957      78.85**
3 4 5         11402     10.11**      8042      13.74**      5750      32.01**
3             11226     70.36**      7918      22.15**      5690      107.65**

T_n^m = sum of weighted squared residuals, H_n = χ²(1) test statistic, ** = sign. at 1% level
In each case the value of $T_n^m$, the sum of weighted squared residuals, and the test statistic $H_n$ of (10.5) are shown. At a 5% significance level, the critical value of a chi-square distribution with one degree of freedom is 3.84, and at 1% it is 6.63. By these standards the fitted model must be rejected in almost all analyses of B and C, and in A whenever the regressors X1 and X2 are omitted. But upon closer examination it appears that $T_n^m$ exceeds its expected value in all analyses that use regressors X1 and X5 (income and business car dummy), but far less so in the others. Such large values of $T_n^m$ may indicate the presence of an outlier, and upon searching for this, a severe one was indeed found. This is a household with a business car, a very low income and a new private car. The estimated probability of this event is almost zero if both income and the business car are among the regressors, and the result is a very large contribution to $T_n^m$. The test statistic was recalculated for all data sets after deleting this outlier. The results are presented in Table 10.3.

Table 10.3 Results test statistic on data of car ownership, without one outlier

              data set A             data set B             data set C
              m=4, nm=11276          m=3, nm=7932           m=2, nm=5638
Regressors    T_n^m     H_n          T_n^m     H_n          T_n^m     H_n
1 2 3 4 5     12519     0.00         8341      0.17         6214      2.67
1 2 5         13269     0.00         8558      0.24         6557      5.03*
1 5           12992     0.00         8484      0.30         6344      1.39
1 3           11628     0.02         7906      0.16         5854      11.31**
2 3 5         11694     1.82         8076      11.02**      5957      79.46**
3 4 5         11400     11.13**      8042      14.30**      5749      32.64**
3             11222     70.43**      7915      22.22**      5689      107.62**

T_n^m = SWSR, H_n = χ²(1) test statistic, *(**) = sign. at 5(1)% level

First, attention is focused on the sets A and B. For these sets the fitted model is not rejected whenever income is used as a regressor; however, it is rejected whenever the regressors X1 and X2 are omitted. The only difference between the analyses of the sets A and B is that the model that uses regressors X2, X3 and X5 (household size, age and business car dummy) is rejected for B whilst it is not for A: household size and possessing a business car influence primarily the number of cars owned, and to a lesser extent whether it is a used or a new car.

As for data set C, the results are different. Not all models which include income as a regressor are accepted. Furthermore, the fitted model with X1, X2 and X5 as regressors is rejected, whilst a nested model with only X1 and X5 as regressors cannot be rejected. Upon investigation, this is due to the presence of another outlier. Although not all fitted models are rejected for data set C, the results of the test statistic lead to the conclusion that the regressors that are considered (especially income) are in the dichotomous case of less importance than in the sets A and B, indicating that factors other than those considered here play a role in the decision whether to buy a car or not.

Finally, as an illustration of the behaviour of the test statistic, consider the analyses with only X3, the age of the head of the household, as a regressor. In these analyses all estimated probabilities are close to the sample frequencies and have a very small variance. Therefore, the estimated variance of $T_n^m$ is small and, although the value of $T_n^m$ is close to its expected value, $H_n$ becomes significant.
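The sensitivity to a single outlier can be illustrated with a small calculation. The sketch below assumes that each individual contributes $\sum_j (Y_{ij}-\hat p_{ij})^2/\{\hat p_{ij}(1-\hat p_{ij})\}$ to the sum of weighted squared residuals, so that the expected contribution is m. The function name and the probability vectors are hypothetical and are not taken from the car ownership data.

```python
def swsr_contribution(probs, chosen):
    """Contribution of one individual to the sum of weighted squared residuals.

    probs  : estimated category probabilities (they sum to one)
    chosen : index of the category the individual actually falls in
    """
    total = 0.0
    for j, p in enumerate(probs):
        y = 1.0 if j == chosen else 0.0
        total += (y - p) ** 2 / (p * (1.0 - p))
    return total

# An unremarkable household: the observed category has a moderate probability.
typical = swsr_contribution([0.40, 0.30, 0.20, 0.10], chosen=0)

# An outlier: the observed category has an estimated probability of almost zero.
outlier = swsr_contribution([0.9989, 0.0005, 0.0001, 0.0005], chosen=2)

print(typical, outlier)
```

With four categories the expected contribution is 4. The first household contributes about 2.29, whereas the second contributes more than 10,000, which is of the same order as the whole of nm in data set A. One such observation can therefore dominate $T_n^m$.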
10.3.3. The conditional logit model

The multinomial logit model considered so far has been specified by individual specific factors $x_i$ and choice specific parameters $\beta_j$. Assume now that the choice parameters are constant for the different choices (denoted by $\gamma$), but the factors differ per state j (denoted by $z_{ij}$). The probabilities are then specified as

$$P(Y_i=j \mid z) = \frac{\exp(z_{ij}'\gamma)}{\sum_{r=1}^{m}\exp(z_{ir}'\gamma)}, \qquad i=1,\dots,n;\ j=1,\dots,m.$$

The results of Theorem 10.2 are extended to this model in a straightforward way:

$$\frac{n^{-1/2}(T_n^m-nm)}{\sigma_n} \xrightarrow{d} N(0,1),$$

where

with

and $\Omega$ is given by

$$\Omega = -\lim_{n\to\infty} n^{-1}\,\frac{\partial^2\log L}{\partial\gamma\,\partial\gamma'}.$$

Of course, the asymptotic distribution of $T_n^m$ in the model that is a combination of the two multinomial logit models, specified by

$$P(Y_i=j \mid x,z) = \frac{\exp(x_i'\beta_j+z_{ij}'\gamma)}{\sum_{r=1}^{m}\exp(x_i'\beta_r+z_{ir}'\gamma)},$$

is again the normal distribution, with the asymptotic variance being a combination of the results of Theorem 10.2 and above. The adjustment only concerns $v_n$ and $\Omega$.
10.4. A test based on the maximum rank correlation estimator

10.4.1. Hausman's test

A well-known testing procedure for discriminating between various kinds of models is the Hausman test (HAUSMAN, 1978). This test is based on two estimators in the same model. One estimator, $\hat\beta_1$, is both consistent and asymptotically efficient under the null hypothesis but is inconsistent under the alternative, whereas the other estimator, $\hat\beta_2$, is consistent under both the null and the alternative hypothesis, but is not efficient under the null hypothesis. If $\hat\beta_1$ and $\hat\beta_2$ are both asymptotically normally distributed, with asymptotic covariance matrices $V_1$ and $V_2$, Hausman's test statistic is given by

where $\hat V_1$ and $\hat V_2$ are consistent estimates of $V_1$ and $V_2$ respectively. $HT_n$ asymptotically follows a chi-square distribution with k degrees of freedom, with k the number of estimated parameters.

In the binary choice model the probit and the logit estimator are estimators of the form $\hat\beta_1$. A class of estimators that can perform the role of $\hat\beta_2$ is the class of nonparametric estimators. For this class of estimators Assumption 7.1 on the cumulative distribution function is weakened, and in estimation a functional form need not be specified. Therefore, a nonparametric estimator is consistent under a broader range of circumstances than the range for which the logit or probit estimator is consistent, but a nonparametric estimator is less efficient under the null hypothesis.

Several nonparametric estimators in the binary choice model have been proposed during the last 15 years. Amongst them are Manski's maximum score estimator, MANSKI (1975, 1985); Cosslett's distribution-free maximum likelihood estimator, COSSLETT (1983); Han's maximum rank correlation estimator, HAN (1987); a likelihood based estimator proposed by KLEIN AND SPADY (1988); and Ichimura's estimator for index models, ICHIMURA AND LEE (1991). Of these estimators, the asymptotic distribution of Cosslett's estimator is still unknown; Manski's estimator converges to some Gaussian process at an $n^{1/3}$ rate of convergence (KIM AND POLLARD, 1990); and the estimators of Han, Klein and Spady, and Ichimura converge to a normal distribution at an $n^{1/2}$ rate of convergence. SHERMAN (1991) proved the asymptotic normality of Han's maximum rank correlation estimator. Here, attention is focused on this estimator and an estimator that appears to be identical to it: the rank estimator.
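The Hausman procedure of Section 10.4.1 can be sketched in a few lines. The version below is the textbook quadratic form $(\hat\beta_2-\hat\beta_1)'(\widehat{\mathrm{cov}}_2-\widehat{\mathrm{cov}}_1)^{-1}(\hat\beta_2-\hat\beta_1)$, and it assumes that `cov1` and `cov2` are estimated covariance matrices of the estimators themselves (that is, already divided by n). This scaling convention is an assumption of the sketch and need not match the exact form of $HT_n$ used in the text. The function names and the numbers in the usage example are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= factor * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def hausman(b1, b2, cov1, cov2):
    """Quadratic form q'(cov2 - cov1)^{-1} q with q = b2 - b1.

    b1, cov1 : efficient estimator under the null and its covariance matrix
    b2, cov2 : robust (less efficient) estimator and its covariance matrix
    """
    q = [u - w for u, w in zip(b2, b1)]
    diff = [[c2 - c1 for c1, c2 in zip(r1, r2)] for r1, r2 in zip(cov1, cov2)]
    return sum(qi * si for qi, si in zip(q, solve(diff, q)))

stat = hausman([0.5, 1.0], [0.6, 0.8],
               [[0.01, 0.0], [0.0, 0.01]],
               [[0.03, 0.0], [0.0, 0.05]])
print(stat)
```

In this made-up example the statistic is 1.5, to be compared with a chi-square distribution with 2 degrees of freedom. When one estimator is only identified up to scale, as with Han's estimator, both parameter vectors would first have to be normalized in the same way and the test carried out on k-1 parameters.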
10.4.2. Han's maximum rank correlation estimator and the rank estimator

In the logit and probit model the loglikelihood function (7.4) is the objective function to be maximized with respect to $\beta$. Because of the smoothness of the normal and logistic distributions, first and second order derivatives are easily obtained, global concavity is easily shown, and so e.g. a Newton-Raphson iterative maximization algorithm can be used to obtain the unique maximizer $\hat\beta_n$. For a nonparametric estimator the objective function to be maximized is not as smooth, first order derivatives often do not exist, and the maximizer has to be found by less elegant maximization procedures, such as grid search.

For Han's maximum rank correlation estimator, the objective function is given by

$$Q_n(\beta) = \binom{n}{2}^{-1}\sum_{\rho}\big[1(Y_i>Y_j)\,1(X_i'\beta>X_j'\beta) + 1(Y_i<Y_j)\,1(X_i'\beta<X_j'\beta)\big],$$

where the summation is over the $\binom{n}{2}$ pairs of distinct observations. The objective function is invariant to the scale of $\beta$: for c > 0, $Q_n(c\beta) = Q_n(\beta)$.

Obviously, in order to compare Han's estimator with the logit or probit estimator, these estimators have to be restricted in the same way. So instead of k parameters, the Hausman test is performed on k-1 parameters.

The idea behind Han's estimator is the following. Suppose F is a strictly increasing function, then

Thus, to maximize the number of cases in which $Y_i = 1$, $Y_j = 0$, concurrent with $X_i'\beta > X_j'\beta$, seems a sensible way of defining an estimator for $\beta_0$.

For the rank estimator the objective function is given by

or equivalently

where $H_{in}$ is the empirical distribution function of $X'\beta$ evaluated at $X_i'\beta$. The rank estimator is then defined as

(10.6)
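Han's objective function and the grid search mentioned above can be sketched for k = 2, where the normalization $\beta'\beta = 1$ reduces the search to an angle on the unit circle. This is an illustrative implementation under stated assumptions: the data are simulated from a logit model with a hypothetical $\beta_0 = (0.6, 0.8)'$ and index scaled by 3, and the function names, grid size and seed are arbitrary choices.

```python
import math
import random

def mrc_objective(beta, xs, ys):
    """Fraction of pairs whose ordering in Y agrees with the ordering in x'beta."""
    index = [sum(b * x for b, x in zip(beta, row)) for row in xs]
    n = len(ys)
    hits = 0
    for i in range(n):
        for j in range(i + 1, n):
            if ys[i] > ys[j] and index[i] > index[j]:
                hits += 1
            elif ys[i] < ys[j] and index[i] < index[j]:
                hits += 1
    return hits / (n * (n - 1) / 2)

def mrc_grid_search(xs, ys, steps=72):
    """Maximize the objective over beta = (cos t, sin t) on an equally spaced grid."""
    best_q, best_beta = -1.0, None
    for s in range(steps):
        theta = 2.0 * math.pi * s / steps
        beta = (math.cos(theta), math.sin(theta))
        q = mrc_objective(beta, xs, ys)
        if q > best_q:
            best_q, best_beta = q, beta
    return best_beta, best_q

random.seed(1)
xs = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(200)]
ys = [1 if random.random() < 1.0 / (1.0 + math.exp(-3.0 * (0.6 * a + 0.8 * b))) else 0
      for a, b in xs]

beta_hat, q_hat = mrc_grid_search(xs, ys)
print(beta_hat, q_hat)
```

Only pairs with unequal responses can contribute to the count, and rescaling `beta` by a positive constant leaves the objective unchanged, which is why the search can be restricted to the unit circle. A finer grid, or a search over more than one angle for k > 2, would be needed in practice.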
The difference in the objective functions of the two estimators is that the maximum rank correlation estimator only considers pairs with unequal value (0,1) of the response variable Y, whereas the rank estimator considers all pairs of observations. However, as will be shown in Theorem 10.3, the maximum rank correlation estimator and the rank estimator are equivalent. In Theorem 10.4 the consistency of $\hat\beta_R$ (and so of $\hat\beta_H$) is proven. The proof of asymptotic normality of $\hat\beta_H$ (and so of $\hat\beta_R$) has been obtained by SHERMAN (1991) and his theorem will be presented as Theorem 10.5. Clearly, Theorem 10.4 and Theorem 10.5 give the tools for Hausman's test.

Theorem 10.3
Han's maximum rank correlation estimator and the rank estimator are identical.

Proof: see Appendix 10.B.

By Theorem 10.3 and the fact that Han's estimator is consistent, it follows that the rank estimator is consistent. However, a formal proof of consistency is presented in Appendix 10.B, which seems to be more elegant than the proof given by Han.

Theorem 10.4
If the following assumptions hold:
Assumption 10.1: F is a strictly monotonic function,
Assumption 10.2: $\beta_0$ is contained in a parameter space B, a subset of the Euclidean k-space defined by $B = \{\beta \mid \beta'\beta = 1\}$,
Assumption 10.3: $\{X_i\}$ are i.i.d. vector random variables with a joint density function g(x) such that g(x) > 0 for all x,
then the rank estimator $\hat\beta_R$ as defined in (10.6) converges to $\beta_0$ with probability one.

Proof: see Appendix 10.B.
For the asymptotic normality result, first the restriction $\beta'\beta = 1$ is made implicit. Without loss of generality, assume that the k-th element of $\beta_0$ is nonnegative. Let $\Theta$ denote the unit sphere in $R^{k-1}$. Then there is for every $\beta$ in B whose k-th component is nonnegative a unique $\theta \in \Theta$ for which $\beta$ has the representation:

and, accordingly defined, $\beta_0 = \beta(\theta_0)$. Define

$$\tau(u,\theta) = E\big[1\{y>Y\}\,1\{x'\beta(\theta)>X'\beta(\theta)\} + 1\{Y>y\}\,1\{X'\beta(\theta)>x'\beta(\theta)\}\big]$$

$$= 1\{y=1\}\int_{-\infty}^{x'\beta(\theta)}(1-F(t,\theta))\,g(t,\theta)\,dt + 1\{y=0\}\int_{x'\beta(\theta)}^{\infty}F(t,\theta)\,g(t,\theta)\,dt,$$

with u = (y,x); for each $\theta$ in $\Theta$ and t in R, $F(t,\theta) = E(Y \mid X'\beta(\theta)=t)$; and $g(\cdot,\theta)$ is the density function of $X'\beta(\theta)$. With Assumptions 10.1-10.3, the following assumptions on $\tau$ are needed to prove asymptotic normality (SHERMAN, 1991, p. 62):

Assumption 10.4: Let N denote a neighbourhood of $\theta_0$.
1. Let S(U) denote the support of the random variable U = (Y,X). For each u in S(U), all third mixed partial derivatives of $\tau(u,\cdot)$ exist on N.
2. There is a function M(u) satisfying E{M(U)}

$N(0,\Omega^{-1})$ and plim $\Omega^{-1}A_n = I$, it can finally be concluded that the limiting distribution of $n^{-1/2}(T_n-n)$ is the same as the limiting distribution of

$$n^{-1/2}\sum_{i=1}^{n}\frac{(Y_i-p_{i0})(1-2p_{i0}-v_n'\Omega^{-1}x_if_{i0})}{p_{i0}(1-p_{i0})}. \qquad (10.A.3)$$

The limiting distribution of (10.A.3) will be obtained by applying Liapounov's Central Limit Theorem for "triangular arrays" (see BILLINGSLEY, 1979). Define

$$Z_{ni} = \frac{(Y_i-p_{i0})(1-2p_{i0}-v_n'\Omega^{-1}x_if_{i0})}{p_{i0}(1-p_{i0})}.$$

Then $Z_{n1},\dots,Z_{nn}$ are independent random variables. The moments of $Z_{ni}$ are

$$\mathrm{var}(Z_{ni}) = \frac{(1-2p_{i0}-v_n'\Omega^{-1}x_if_{i0})^2}{p_{i0}(1-p_{i0})} = \sigma_{ni}^2;$$

Clearly, because of Assumptions 7.1-7.3,

so the conditions of the limit theorem are fulfilled and therefore

$\xrightarrow{d} N(0,1)$.

Finally, the result of Theorem 10.1 follows by rewriting

$$n^{-1}\sum_i\frac{(1-2p_{i0}-v_n'\Omega^{-1}x_if_{i0})^2}{p_{i0}(1-p_{i0})} = n^{-1}\sum_i\frac{(1-2p_{i0})^2}{p_{i0}(1-p_{i0})} - 2n^{-1}\sum_i\frac{(1-2p_{i0})\,v_n'\Omega^{-1}x_if_{i0}}{p_{i0}(1-p_{i0})} + n^{-1}v_n'\Omega^{-1}\sum_i\Big(\frac{f_{i0}^2}{p_{i0}(1-p_{i0})}x_ix_i'\Big)\Omega^{-1}v_n,$$

which, by virtue of (7.6) and (10.1), is asymptotically equivalent with

$$n^{-1}\sum_{i=1}^{n}\frac{(1-2p_{i0})^2}{p_{i0}(1-p_{i0})} - v_n'\Omega^{-1}v_n. \qquad ■$$
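The final rewriting step can be checked numerically in the scalar logit case. The sketch below assumes k = 1, $\beta_0 = 1$, $f_{i0} = p_{i0}(1-p_{i0})$, and sample versions $v_n = n^{-1}\sum_i(1-2p_{i0})x_i$ and $\Omega_n = n^{-1}\sum_i p_{i0}(1-p_{i0})x_i^2$. With these sample quantities the two sides coincide exactly, whereas in the proof the equivalence is asymptotic. Variable names and the seed are illustrative.

```python
import math
import random

random.seed(2)
n = 500
x = [random.uniform(-1.0, 2.0) for _ in range(n)]
p = [1.0 / (1.0 + math.exp(-xi)) for xi in x]   # logit probabilities with beta_0 = 1
w = [pi * (1.0 - pi) for pi in p]               # f_i = p_i (1 - p_i) in the logit model

v = sum((1 - 2 * pi) * xi for pi, xi in zip(p, x)) / n
omega = sum(wi * xi * xi for wi, xi in zip(w, x)) / n

# Left-hand side: average variance of Z_ni.
lhs = sum((1 - 2 * pi - v / omega * xi * wi) ** 2 / wi
          for pi, xi, wi in zip(p, x, w)) / n

# Right-hand side: the simplified expression for sigma_n^2.
rhs = sum((1 - 2 * pi) ** 2 / wi for pi, wi in zip(p, w)) / n - v * v / omega

print(lhs, rhs)
```

Both numbers agree up to rounding error, and for large n they approach the limit 0.034 obtained for the Uniform[-1,2] design in Section 10.2.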
Proof of Theorem 10.2

To obtain the asymptotic distribution of $T_n^m$ consider

The first order derivative of $S_n$ with respect to $\beta_r$ is
r = l, ... ,m-1,
with
unr
= -n-1[~=1(
(Yir-P?r)(l-2p?r+2(P?r) 2) ----------
(P?r>20-P?r>2
and
As shown before, under Assumptions 7.1M and 7.2M, plim($u_{nr}$) = 0 and furthermore $\{v_{nr}\}$ remains bounded for every n. Let $v_n$ be defined as

Then it follows that the limiting distribution of $n^{-1/2}(T_n^m-nm)$ is the same as the limiting distribution of
where 1{.} is the indicator function, $e_j$ is the j-th unit vector of order (m-1) and $\otimes$ denotes the Kronecker product. The limiting distribution of (10.A.4) is obtained by applying Liapounov's Central Limit Theorem. Define

$$Z_{mni} = \sum_{j=1}^{m}(Y_{ij}-p^0_{ij})\Big(\frac{1-2p^0_{ij}}{p^0_{ij}(1-p^0_{ij})} - 1\{j\le m-1\}\,v_n'\Omega^{-1}(e_j\otimes x_i)\Big) = \sum_{j=1}^{m}(Y_{ij}-p^0_{ij})\,a^0_{ijn};$$

$$\mathrm{var}(Z_{mni}) = \sum_{j=1}^{m}\Big\{p^0_{ij}(1-p^0_{ij})(a^0_{ijn})^2 - p^0_{ij}a^0_{ijn}\sum_{h\ne j}p^0_{ih}a^0_{ihn}\Big\} = \sigma_{ni}^2;$$

$$E|Z_{mni}|^3 \le E\sum_{j=1}^{m}\Big\{|Y_{ij}-p^0_{ij}|^3|a^0_{ijn}|^3 + 3|Y_{ij}-p^0_{ij}||a^0_{ijn}|\sum_{h\ne j}(Y_{ih}-p^0_{ih})^2(a^0_{ihn})^2 + 6|Y_{ij}-p^0_{ij}||a^0_{ijn}|\sum_{h>j}|Y_{ih}-p^0_{ih}||a^0_{ihn}|\sum_{r>h}|Y_{ir}-p^0_{ir}||a^0_{irn}|\Big\}.$$

Clearly, under Assumptions 7.1M and 7.2M,

so the conditions of the limit theorem are fulfilled and therefore

$$\frac{n^{-1/2}(T_n^m-nm)}{\sigma_n} \xrightarrow{d} N(0,1),$$

where

$$\sigma_n^2 = n^{-1}\sum_{i=1}^{n}\sigma_{ni}^2.$$

The final result is obtained by making use of the definitions of $\Omega$ (7.10); $v_n$ (10.4); and the formulae for the second order derivatives of the loglikelihood function (7.11). $\sigma_n^2$ is then asymptotically equivalent with
Appendix 10.B. Proofs of Theorems 10.3 and 10.4

Proof of Theorem 10.3

With the underlying latent variable $y^*$ of (7.8), the probability model can be specified as

where $(X_1,\varepsilon_1),\dots,(X_n,\varepsilon_n)$ is a sample from a distribution P on $R^{k+1}$, and where the k-dimensional distribution of the $X_i$'s satisfies Assumption 7.3'. $R_n(\beta)$ can now be written in the form

where

and $P_n$ is the empirical measure of the pairs $(X_i,\varepsilon_i)$. Equivalently, $Q_n^*(\beta)$, the objective function of the maximum rank correlation estimator multiplied by $\binom{n}{2}/n^2$, can be written as

where

The difference between $R_n(\beta)$ and $Q_n^*(\beta)$ is given by
Rn(/3)-Qn •(/3)
= f I {e:SX'/3o} I {1/SZ'/3o} I {x'/3~z'/3} + I {£>X '/3o}I {1/>Z'ti 'tl)dP 0 (Z,1/ )dP 0 (x,e). ,-,x'f,0 } 1{x'f,x'/3} )g(z}g(x)dzdx - 2f{F(z'/30 )1{
'R
'R
Zpo>Xpo
}-F(x'/30 )1{
'R
'R
Zpo>Xpo
))g(z)g(x)dzdx
$$= 2\int\{F(z'\beta_0)-F(x'\beta_0)\}\{1\{z'\beta>x'\beta\}-1\{z'\beta_0>x'\beta_0\}\}\,g(z)g(x)\,dz\,dx.$$

Note that $\beta \ne 0$ and $\beta_0 \ne 0$, since $\|\beta\| = \|\beta_0\| = 1$, implying that indicators of the form $1\{z'\beta>x'\beta\}$ can be replaced by indicators of the form $1\{z'\beta\ge x'\beta\}$ in the integrals above, without changing the values of the integrals. By Assumption 10.1, there will be a region

with positive Lebesgue measure, if $\beta \ne \beta_0$. By Assumption 10.3 this implies

$$2\int\{F(z'\beta_0)-F(x'\beta_0)\}\{1\{z'\beta>x'\beta\}-1\{z'\beta_0>x'\beta_0\}\}\,g(z)g(x)\,dz\,dx < 0, \quad \text{if } \beta \ne \beta_0,$$

since

Thus, $\beta = \beta_0$. Concluding, each subsequence of $\{\hat\beta_n(\omega)\}$ has a convergent subsequence, and each subsequence converges to $\beta_0$. This implies

This proves that $\hat\beta_n$ converges to $\beta_0$ with probability one. ■
11. Summary and conclusions

11.1. $R^2$ in the linear model

In Part I of this study, the coefficient of determination, $R^2$, has been discussed for various linear models. First, properties have been stated for the standard OLS-$R^2$ in the univariate model. Related measures that have been proposed for different linear model types were then analysed, with some basic properties which the OLS-$R^2$ exhibits serving as a guideline to their performance as goodness-of-fit measures. These basic properties consist of the fixed lower and upper bound of zero and one; the fact that the measure is the squared sample correlation coefficient of the dependent variable and its theoretical counterpart; the relationship of $R^2$ with the F-test statistic for testing whether all slope parameters are equal to zero; and the fact that the probability limit of $R^2$ is the same as the limit of the model squared population correlation coefficient $\rho^2$.

In the linear model without a constant term, the problem with the conventional $R^2$ is that it does not always fall within the zero-one boundaries. Two alternative measures have been analysed for this model, Barten's $R_B^2$ and $r^2$, the squared sample correlation coefficient of y and $\hat y$. Both these measures are equal to $R^2$ if a constant is added to the model and satisfy the basic properties of taking a value between zero and one, and having the same probability limit as $\rho^2$. (Another measure, proposed by Theil, is rejected because it does not coincide with $R^2$ when a constant is added to the model and its probability limit is not equal to the limit of $\rho^2$.) Whilst $R_B^2$ and $r^2$ seem reasonably well behaved as goodness-of-fit measures in the linear model without a constant, the major drawback of both measures is that they are not related to the F-test statistic. In order to compare the (relative) performance of $R_B^2$ and $r^2$ as goodness-of-fit measures, their moments up to the second order have been derived. Unfortunately, the complicated expressions of these moments do not provide much, if any, insight. In one experiment, the calculated moments had the same pattern for both measures. More insight was obtained, however, by four practical examples. These examples revealed that the two measures can differ quite substantially. However, it is hardly possible to make a clear distinction between the performance of $R_B^2$ and $r^2$, although a lot of information is provided if both are reported.

In the model with nonspherical disturbances, estimated by generalized least squares, $R^2$ may again fail to fall between the zero-one limits. Therefore, $R_B^2$ and $r^2$ are possible goodness-of-fit measures in this model as well, exhibiting the properties just mentioned. An alternative measure is Buse's $R_{Bu}^2$. This measure is obtained by first transforming the model into a model with spherical disturbances. Then, by defining a kind of weighted mean of the transformed dependent variable, $R_{Bu}^2$ is equal to the ratio of the explained variation to the total variation in the transformed model, where the variations are taken in deviations from this weighted mean. $R_{Bu}^2$ satisfies the basic properties of falling within the zero-one bounds and being related to the F-test statistic. However, it is not the squared sample correlation coefficient of the transformed dependent variable and its theoretical counterpart, and its probability limit is equal to the limit of a kind of redefined $\rho^2$ in the transformed model. Therefore, $R_{Bu}^2$ seems to reveal information about the transformed model, not about the model we are actually interested in.

For the SURE model there seems to be only one properly defined $R^2$ type goodness-of-fit measure: McElroy's $R_z^2$. Although this measure is related to Buse's $R_{Bu}^2$ and so is a measure for a transformation of the initial model, it is superior to alternative measures, because these alternative measures all suffer from the fact that they are not invariant to changes of location and/or changes of scale of the dependent variable(s), whereas $R_z^2$ is invariant to changes of this kind. $R_z^2$ lies between zero and one; is the squared sample correlation coefficient of $(\Omega^{-1/2}\otimes N_n)y$ and $(\Omega^{-1/2}\otimes N_n)\hat y$; is related to the F-test statistic; and is a kind of estimator of the squared population correlation coefficient of the transformed model with spherical disturbances.

For a single structural equation within a system of simultaneous equations, and for the system as a whole, two cases with differing estimation procedures have been considered. In the first case, estimation results are obtained by 2 Stage Least Squares, and the Carter-Nagar $R_{CN}^2$ applies. Because of the 2SLS estimation procedure, results are similar to the OLS results in the standard univariate linear model. Therefore, $R_{CN}^2$ exhibits properties close to those of the OLS-$R^2$. However, the measure provides information about the fit of the estimated reduced form of the system or equation. In the second case, estimation results are obtained by the Instrumental Variable estimation method. For the structural equation, Barten's $R_B^2$ and $r^2$ are again possible candidates to measure the fit. For the structural system, two measures are proposed that are closely related to McElroy's $R_z^2$ for the SURE model. One measure is a generalization of $R_B^2$, the other is the squared sample correlation coefficient of $(\Omega^{-1/2}\otimes N_n)y$ and $(\Omega^{-1/2}\otimes N_n)\hat y$. These measures have almost the same properties as McElroy's measure. The main difference is that they are not related to the (asymptotic) F-test statistic.
11.2. Goodness of fit in qualitative choice models

In Part II of this study, goodness-of-fit measures for the binary choice model have been analysed and some goodness-of-fit tests for that model (and some for the multinomial logit model) have been developed. For the binary choice model several pseudo-$R^2$ measures have been proposed. These measures are e.g. based on the discrepancy between y and $\hat p$, Efron's $R_E^2$; on the loglikelihood of the model, McFadden's $R_{MF}^2$ and Veall and Zimmermann's $R_{VZ}^2$; on the (imaginary) fit of the underlying continuous model, McKelvey and Zavoina's $R_{MZ}^2$; and on the proportion of correct predictions, $R_{CP}^2$. None of these measures exhibits all basic properties of $R^2$ in the standard linear model, and making a choice between the measures based on their mathematical properties is not easy. Using some examples, it has been shown that the individual relative behaviour of all these measures is more or less the same when different models are compared. It also became clear that $R_{CP}^2$ is useless in a model that shows a large difference between the observed number of ones and zeros. The values of the measures $R_{MF}^2$ and $R_{MZ}^2$ seem insensitive to the proportion of ones in the model, unlike $R_{VZ}^2$ and $R_E^2$. Because of this, $R_{MF}^2$ and $R_{MZ}^2$ seem to be superior. Furthermore, $R_{MZ}^2$ is close to the OLS-$R^2$ of the underlying continuous model, and, in absolute terms, closest to the squared sample correlation coefficient of the true and estimated probabilities. Therefore, $R_{MZ}^2$ is a good option to use as a pseudo-$R^2$ measure in the binary choice model.

Some goodness-of-fit tests have been developed that analyse the distributional assumption of the probability model. One test is a score-test for both the binary and multinomial logit model, based on the Perks class of distributions. This class is effectively described by one parameter, and the logistic distribution is obtained when this parameter takes the value 2. Using some Monte Carlo examples for the binary logit model, it has been shown that this test, and the score-test based on the (two parameter) Prentice class of distributions, seem to have little power against proper alternatives, i.e. alternative distributions that are a member of the specified class. However, if the model is (very) poorly specified, the tests seem to detect this, and attention should therefore be paid to these tests, e.g. they should be part of the standard output of the logit (and probit) computer programme.

Other tests on the distributional assumption for the binary choice model are based on a comparison between the assumed distribution and a nonparametric estimator of it: the Nadaraya-Watson estimator, based on kernel smoothing. It has been shown that, also in this case of a discrete response variable, the supremum absolute deviation is asymptotically distributed as an extreme value, or Gumbel, distributed random variable. Furthermore, this result also holds when the Nadaraya-Watson estimator is based on the logit or probit estimation results. The same remarks apply to a quadratic functional that is asymptotically normally distributed.

Finally, some overall goodness-of-fit tests have been derived. The asymptotic distribution of the sum of weighted squared residuals has been shown to be the normal distribution in both the binary choice and the multinomial logit model, instead of the often believed chi-square distribution. An asymptotic chi-square test has been proposed, based on this result. By means of a Monte Carlo study for the binary choice model, this chi-square test is shown to be able to detect some (major) violations of the basic assumptions. Further results on the test statistic in a multinomial logit model with data on private car ownership have made clear that it is highly sensitive to outliers. The test is easy to calculate and should also be part of the standard logit or probit results.

A last asymptotic chi-square test for the binary choice model is based on the Hausman testing procedure. This test has been adopted to compare the logit or probit parameter estimator with Han's nonparametric maximum rank correlation estimator, or the rank estimator. These two estimators have been shown to be equivalent.
Appendix

Definition A.1. (POLLARD, 1984, p. 17)
Let $\mathcal{D}$ be a class of subsets of some space S. It is said to have polynomial discrimination (of degree v) if there exists a polynomial p(.) (of degree v) such that, from every set of N points in S, the class picks out at most p(N) distinct subsets. Formally, if $S_0$ consists of N points, then there are at most p(N) distinct sets of the form $S_0 \cap D$ with D in $\mathcal{D}$. Call p(.) the discriminating polynomial for $\mathcal{D}$.

Definition A.2. (POLLARD, 1984, p. 24)
Each measurable F, satisfying $|f| \le F$ for every f in $\mathcal{F}$, is called an envelope for $\mathcal{F}$. The natural envelope is the pointwise supremum of |f| over $\mathcal{F}$.

Definition A.3. (NOLAN AND POLLARD, 1987, p. 783)
Let S be a set equipped with a pseudometric d. The covering number $N(\delta,d,S)$ is defined as the smallest value of N for which there exist N closed balls of radius $\delta$, and centers in S, whose union covers S.

Definition A.4. (NOLAN AND POLLARD, 1987, p. 783)
If the class $\mathcal{F}$ has an envelope F, and Q is a measure on E×E for which $0 < Q(F^p) < \infty$, define the distance $d_{Q,p,F}$ on $\mathcal{F}$ by

Write $N_p(\delta,Q,\mathcal{F},F)$ for the covering number $N(\delta,d_{Q,p,F},\mathcal{F})$. Thus $N_p(\delta,Q,\mathcal{F},F)$ is the smallest cardinality for a subclass $\mathcal{F}^*$ of $\mathcal{F}$, such that

Lemma A.1.
Let $\mathcal{F}$ be a class of functions on a set S with envelope F, and let Q be a probability measure on S with $0 < QF < \infty$. If the graphs of the functions in $\mathcal{F}$ form a polynomial class of sets then

where the constants A and W depend upon only the discriminating polynomial of the class of the graphs.

Proof: see POLLARD (1984, pp. 27-28).

Theorem A.1.
Let $\mathcal{F}$ be a permissible class of functions with envelope F. Suppose $PF < \infty$. If $P_n$ is obtained by independent sampling from the probability measure P and if $\log N_1(\delta,P_n,\mathcal{F}) = o_p(n)$ for each fixed $\delta > 0$, then

$$\sup_{\mathcal{F}}|P_nf - Pf| \to 0, \text{ almost surely.}$$

Proof: see POLLARD (1984, pp. 25-27).

Theorem A.2.
Let $\xi_1,\xi_2,\dots$ be independent observations taken from a distribution P on a set E, and $\mathcal{F}$ be a class of real-valued symmetric functions on E×E, with P×P-integrable envelope F. Define

Furthermore, let $\xi_1,\dots,\xi_{2n}$ be a double sample from P. Define the function $f_{ij}$ as

and write $T_n$ for the measure that places mass one at each of the 4n(n-1) pairs $(\xi_\alpha,\xi_\beta)$ appearing in the definition of the $f_{ij}$. Let $P_n$ be the empirical measure. If for each $\delta > 0$,
i) $\log N_1(\delta,T_n,\mathcal{F},F) = o_p(n)$,
ii) $\log N_1(\delta,P_n\times P,\mathcal{F},F) = o_p(n)$,
iii) $N_1(\delta,P\times P,\mathcal{F},F) < \infty$,
then

$$\|\{S_n/n(n-1)\} - P\times P(f)\| \to 0, \text{ almost surely.}$$

Proof: see NOLAN AND POLLARD (1987, pp. 787-789).
Notation
In Part I, uppercase letters denote matrices, lowercase letters vectors and scalars. No distinction has been made between stochastic and non-stochastic variables in this part. In Part II, uppercase letters denote stochastic variables. (Non-)stochastic vectors are written as lowercase letters, apart from the $x_i$ vector. When it is stochastic, it is written as $X_i$.

Some symbols that have been used throughout the text are listed below:

lim                        limit
:=                         equals by definition
max                        maximum
min                        minimum
⇒                          implies
iff                        if and only if
■                          end of proof
sup                        supremum
e, exp                     exponential
log                        natural logarithm
!                          factorial
|a|                        absolute value of scalar a
∈                          belongs to
{x: x∈A, x satisfies B}    set of all elements of A with property B
1{.}                       indicator function
R                          real numbers
R^n                        n-dimensional Euclidean space
I, I_n                     identity matrix (of order n×n)
s                          unity vector
P_X                        X(X'X)^{-1}X'
M_X                        I - P_X
N                          I - n^{-1}ss' (= M_s)
[A B]; (a' b')'            partitioned matrix; vector
||a||                      norm of vector a
r(A)                       rank
tr(A)                      trace
vec(A)                     vec operator
|A|                        determinant
⊗                          Kronecker product
P                          probability
E                          expectation
var                        variance (matrix)
cov                        covariance
P_n                        empirical measure
F_n                        empirical distribution function
F_{h,k}                    F distribution with (h,k) degrees of freedom
N(μ,σ²)                    normal distribution
U[-a,a)                    uniform distribution
χ²_h                       chi-square distribution with h degrees of freedom
plim                       probability limit
→d                         convergence in distribution
References
ALDRICH, J.H. AND F.D. NELSON (1984), Linear Probability, Logit, and Probit Models, Sage Publications, Beverly Hills.
AMEMIYA, T. (1981), Qualitative response models: a survey, Journal of Economic Literature 19, 1483-1536.
--- (1985), Advanced Econometrics, Basil Blackwell, Oxford.
ANDREWS, D.W.K. (1988), Chi-Square diagnostic tests for econometric models: introduction and applications, Journal of Econometrics 37, 135-156.
--- (1988), Chi-Square diagnostic tests for econometric models: theory, Econometrica 56, 1419-1453.
APOSTOL, T.M. (1974), Mathematical Analysis, Addison-Wesley, Reading, Massachusetts.
AZZALINI, A., A.W. BOWMAN AND W. HÄRDLE (1989), On the use of nonparametric regression for model checking, Biometrika 76, 1-11.
BARRETT, J.P. (1974), The coefficient of determination - some limitations, The American Statistician 28, 19-20.
BARTEN, A.P. (1962), Note on unbiased estimation of the squared multiple correlation coefficient, Statistica Neerlandica 16, 151-163.
--- (1987), The coefficient of determination for regression without a constant term, in: R.D.H. HEIJMANS AND H. NEUDECKER (eds.), The Practice of Econometrics, Kluwer Academic Publishers, Dordrecht, 181-189.
BEN-AKIVA, M. AND S.R. LERMAN (1985), Discrete Choice Analysis, The MIT Press, London.
BERA, A.K., C.M. JARQUE AND L.F. LEE (1984), Testing the normality assumption in limited dependent variable models, International Economic Review 3, 563-578.
BICKEL, P.J. AND M. ROSENBLATT (1973), On some global measures of the deviations of density function estimates, The Annals of Statistics 6, 1071-1095.
BILLINGSLEY, P. (1979), Probability and Measure, Wiley, New York.
BROWN, C.C. (1982), On a goodness-of-fit test for the logistic model based on score statistics, Communications in Statistics - Theory and Methods 11, 1087-1105.
BUSE, A. (1973), Goodness of fit in generalized least squares estimation, The American Statistician 27, 106-108.
--- (1979), Goodness of fit for the seemingly unrelated regressions model: a generalization, Journal of Econometrics 10, 109-113.
BUTTER, DEN, F.A.G. AND F.J.J.S. VAN DE GEVEL (1979), De vrijheidsgraden-correctie van R², VVS Bulletin 12, 23-30.
CARTER, R.A.L. AND A.L. NAGAR (1977), Coefficients of correlation for simultaneous equation systems, Journal of Econometrics 6, 39-50.
COPAS, J.B. (1983), Plotting p against x, Applied Statistics 32, 25-31.
COPENHAVER, T.W. AND P.W. MIELKE (1977), Quantit analysis: a quantal assay refinement, Biometrics 33, 175-186.
COSSLETT, S.R. (1981), Distribution-Free maximum likelihood estimator of the binary choice model, Econometrica 51, 765-782.
CRAMER, J.S. (1987), Mean and variance of R² in small and moderate samples, Journal of Econometrics 35, 253-266.
--- (1991), The Logit Model, Edward Arnold, London.
CRAMER, J.S. AND G. RIDDER (1988), The logit model in economics, Statistica Neerlandica 42, 297-314.
--- (1991), Pooling states in the multinomial logit model, Journal of Econometrics 47, 267-272.
DAVIES, N., M.B. PATE AND J.D. PETRUCCELLI (1985), Exact moments of the sample cross correlations of multivariate autoregressive moving average time series, Sankhya: The Indian Journal of Statistics 47, Series B, 325-337.
DHRYMES, P.J. (1986), Limited dependent variables, in: Z. GRILICHES AND M.D. INTRILIGATOR (eds.), Handbook of Econometrics, Volume III, North-Holland, Amsterdam, 1567-1631.
DIJK, VAN, J.C., J.S. HAGENS, J.W. VELTHUIJSEN AND F.A.G. WINDMEIJER (1991), A Simulation Model of the Dutch Tourist Market, SEO-Report no. 255, Foundation for Economic Research of the University of Amsterdam.
EFRON, B. (1978), Regression and ANOVA with zero-one data: measures of residual variation, Journal of the American Statistical Association 73, 113-121.
ENGLE, R.F. (1984), Wald, Likelihood Ratio, and Lagrange Multiplier tests in econometrics, in: Z. GRILICHES AND M.D. INTRILIGATOR (eds.), Handbook of Econometrics, Volume II, North-Holland, Amsterdam, 775-826.
FOMBY, T.B., R.C. HILL AND S.R. JOHNSON (1984), Advanced Econometric Methods, Springer-Verlag, New York.
GOLDBERGER, A.S. (1973), Correlations between binary outcomes and probabilistic predictions, Journal of the American Statistical Association 68, 84.
GOURIEROUX, C., A. MONFORT, E. RENAULT AND A. TROGNON (1987), Simulated residuals, Journal of Econometrics 34, 201-252.
GROENEBOOM, P. (1991), Nonparametric Maximum Likelihood Estimators for Interval Censoring and Deconvolution, Delft University of Technology.
HAAN, DE, L. AND E. TACONIS-HAANTJES (1978), Asymptotic properties of a correlation coefficient type statistic connected with the general linear model, Journal of Econometrics, 15-21.
HAN, A.K. (1987), Non-Parametric analysis of a generalized regression model - the maximum rank correlation estimator, Journal of Econometrics 35, 303-316.
HÄRDLE, W. (1989), Asymptotic maximal deviation of M-smoothers, Journal of Multivariate Analysis 29, 163-179.
--- (1990), Applied Nonparametric Regression, Cambridge University Press, Cambridge.
HÄRDLE, W. AND T.M. STOKER (1989), Investigating smooth multiple regression by the method of average derivatives, Journal of the American Statistical Association 84, 986-995.
HÄRDLE, W. AND E. MAMMEN (1990), Comparing nonparametric versus parametric regression fits, CORE Discussion Paper 9065, CORE Louvain.
HAUSER, J.R. (1978), Testing the accuracy, usefulness, and significance of probabilistic choice models: an information-theoretic approach, Operations Research 26, 406-421.
HAUSMAN, J.A. (1978), Specification tests in econometrics, Econometrica 46, 1251-1271.
HEIJMANS, R.D.H. AND H. NEUDECKER (1987), The coefficient of determination revisited, in: R.D.H. HEIJMANS AND H. NEUDECKER (eds.), The Practice of Econometrics, Kluwer Academic Publishers, Dordrecht, 191-204.
HOSMER, D.W. AND S. LEMESHOW (1980), Goodness of fit for the multiple logistic regression model, Communications in Statistics - Theory and Methods A9, 1043-1069.
--- (1989), Applied Logistic Regression, John Wiley & Sons, New York.
ICHIMURA, H. AND L.F. LEE (1991), Semiparametric least squares estimation of multiple index models: single equation estimation, in: W.A. BARNETT, J. POWELL AND G.E. TAUCHEN (eds.), Nonparametric and Semiparametric Methods in Econometrics and Statistics, Cambridge University Press, Cambridge, 3-49.
JOHNSON, N.L. AND S. KOTZ (1970), Continuous Univariate Distributions-2, Houghton Mifflin, Boston.
JOHNSTON, G.J. (1982), Probabilities of maximal deviations for nonparametric regression function estimates, Journal of Multivariate Analysis 12, 402-414.
JONG, DE, R.M. (1989), Enige toetsen voor misspecificatie in het binaire keuzemodel, Master thesis, University of Amsterdam.
JUDGE, G.G., W.E. GRIFFITHS, R.C. HILL AND T.C. LEE (1980), The Theory and Practice of Econometrics, John Wiley & Sons, New York.
KAY, R. AND S. LITTLE (1986), Assessing the fit of the logistic model: a case study of children with the Haemolytic Uraemic Syndrome, Applied Statistics 35, 16-30.
KIM, J. AND D. POLLARD (1989), Cube root asymptotics, The Annals of Statistics 18, 191-219.
KLEIN, R.W. AND R.H. SPADY (1988), An efficient semiparametric estimator for discrete choice models, Manuscript, Bell Communications Research Corp.
KNIGHT, J.L. (1980), The coefficient of determination and simultaneous equation systems, Journal of Econometrics 14, 265-270.
KOERTS, J. AND A.P.J. ABRAHAMSE (1970), The correlation coefficient in the general linear model, European Economic Review 1, 401-427.
LAITILA, T. (1990), A pseudo-R² measure for limited and qualitative dependent variable models, Paper presented at the 6th World Congress of the Econometric Society, Barcelona.
LANDWEHR, J.M., D. PREGIBON AND A.C. SHOEMAKER (1984), Graphical methods for assessing logistic regression models, Journal of the American Statistical Association 79, 61-83.
LAVE, C.A. (1970), The demand for urban mass transportation, The Review of Economics and Statistics 37, 320-323.
LECHNER, M. (1991), Testing logit models in practice, Empirical Economics 16, 177-198.
LIERO, H. (1982), On the maximal deviation of the kernel regression function estimate, Mathematische Operationsforschung, Serie Statistics 13, 171-182.
LUKACS, E. (1975), Stochastic Convergence (2nd ed.), Academic Press, New York.
MADDALA, G.S. (1983), Limited Dependent and Qualitative Variables in Econometrics, Cambridge University Press, Cambridge.
--- (1988), Introduction to Econometrics, Macmillan Publishing Company, New York.
MAGNUS, J.R. (1986), The exact moments of a ratio of quadratic forms in normal variables, Annales d'Economie et de Statistique 4, 95-109.
MAGNUS, J.R. AND H. NEUDECKER (1988), Matrix Differential Calculus with Applications in Statistics and Econometrics, John Wiley & Sons, Chichester.
MAMMEN, E. (1991), When does bootstrap work: asymptotic results and simulations, Preprint 623, SFB 123, University Heidelberg.
MANSKI, C.F. (1975), Maximum score estimation of the stochastic utility models of choice, Journal of Econometrics 3, 205-228.
--- (1985), Semiparametric analysis of discrete response: asymptotic properties of the maximum score estimator, Journal of Econometrics 27, 313-334.
MCELROY, M.B. (1977), Goodness of fit for seemingly unrelated regressions, Journal of Econometrics 6, 381-387.
MCFADDEN, D. (1974), Conditional logit analysis of qualitative choice behavior, in: P. ZAREMBKA (ed.), Frontiers in Econometrics, Academic Press, New York, 105-142.
MCKELVEY, R.D. AND W. ZAVOINA (1976), A statistical model for the analysis of ordinal level dependent variables, Journal of Mathematical Sociology 4, 103-120.
MOORE, D.S. AND M.C. SPRUILL (1975), Unified large-sample theory of general chi-squared statistics for tests of fit, The Annals of Statistics 3, 599-616.
MORRISON, D.G. (1972), Upper bounds for correlations between binary outcomes and probabilistic predictions, Journal of the American Statistical Association 67, 68-70.
NEUDECKER, H. AND F.A.G. WINDMEIJER (1991), R² in seemingly unrelated regression equations, Statistica Neerlandica 45, 405-411.
NOLAN, D. AND D. POLLARD (1987), U-Processes: rate of convergence, The Annals of Statistics 15, 780-799.
POLLARD, D. (1984), Convergence of Stochastic Processes, Springer, New York.
PREGIBON, D. (1981), Logistic regression diagnostics, The Annals of Statistics 9, 705-724.
PRENTICE, R.L. (1976), A generalization of the probit and logit methods for dose response curves, Biometrics 32, 761-768.
RIDDER, G. (1982), GRMAX: een algemeen Maximum Likelihood programma, University of Amsterdam.
RUUD, P.A. (1983), Sufficient conditions for the consistency of maximum likelihood estimation despite misspecification of distribution, Econometrica 51, 225-228.
--- (1986), Consistent estimation of limited dependent variable models despite misspecification of distribution, Journal of Econometrics 32, 157-187.
SCHUSTER, E.F. (1972), Joint asymptotic distribution of the estimated regression function at a finite number of distinct points, Annals of Mathematical Statistics 43, 84-88.
SHERMAN, R.P. (1991), U-processes and Semiparametric Estimation, PhD thesis, Yale University.
SILVERMAN, B.W. (1986), Density Estimation for Statistics and Data Analysis, Chapman and Hall, London.
SMITH, G. (1974), Further notes on the misuse of R², Cowles Foundation Discussion Paper no. 381, Yale University.
SMITH, R.J. (1988), On the use of distributional misspecification checks in limited dependent variable models, Discussion Papers in Econometrics and Social Statistics (ES203), University of Manchester.
STEERNEMAN, T. (1985), On the moment of the squared multiple correlation coefficient in multiple linear regression, Report 85-03-SE, University of Groningen.
THEIL, H. (1971), Principles of Econometrics, John Wiley & Sons, New York.
VEALL, M.R. AND K.F. ZIMMERMANN (1990), Evaluating pseudo-R²'s for binary probit models, Discussion Paper No. 9057, CentER for Economic Research, Tilburg University.
WILLIAMS, D.A. (1981), The use of the deviance to test the goodness of fit of a logistic linear model to binary data, GLIM Newsletter 6, 60-62.
WINDMEIJER, F.A.G. (1990), The asymptotic distribution of the sum of weighted squared residuals in binary choice models, Statistica Neerlandica 44, 69-78.
--- (1992), R² in the linear model without a constant term, Statistica Neerlandica, submitted.
Summary
A goodness-of-fit measure is one of the many instruments available to a researcher for evaluating a model. It is a measure that compares the estimated outcomes of a model with the realized values. To allow a degree of absolute interpretation, the value that a goodness-of-fit measure can take lies between fixed upper and lower bounds. In practice these bounds are zero and one, and the values in between are translated into a ranking of the degree to which the outcomes of the model approximate reality: from "very poor" (0) to "very good" (1).
In this study goodness-of-fit measures are analysed for two types of model: linear models and qualitative choice models, where in the latter case attention is mainly restricted to the binary choice model and, to a lesser extent, the multinomial logit model. The main difference between the two model types is that in linear models estimates and observations are of the same order, whereas in qualitative choice models the estimated probabilities are continuous and the observations discrete. Goodness of fit in linear models is treated in Part I of this study, goodness of fit in qualitative choice models in Part II.
Part I, R² in the linear model, consisting of Chapters 2 to 6, studies the well-known coefficient of determination, R², and related measures for various linear model types: the univariate classical linear regression model; the univariate model without a constant term; the univariate model with a general covariance matrix of the disturbances; the multivariate Seemingly Unrelated Regression Equations (SURE) model; and the multivariate system of simultaneous equations.
Chapter 2 presents a survey of the properties of the coefficient of determination in the (standard) univariate classical linear regression model. The measure is based on Ordinary Least Squares estimation results. Topics treated include the role of R² in the selection of nested models; the probability limit, (asymptotic) distribution and moments of R²; and some limitations in the use of the measure. The chapter concludes with a list of properties of R² that is used as a reference in the treatment of goodness-of-fit measures in non-standard models.
Because some properties of R² follow from the fact that a constant term is included in the model, Chapter 3 investigates the consequences of having no constant in the model. Because the coefficient of determination in this case no longer has a fixed upper or lower bound, two alternatives are considered: Barten's R², and r², the squared sample correlation coefficient of the dependent variable and its estimate. Both measures always take values between zero and one, and are equal to R² if a constant is added to the model. Barten's R² and r² are compared with each other on the basis of various criteria. One of these is the relative behaviour of the measures in small samples. To this end the first two moments of both measures are derived and computed for a specific model.
Chapter 4 considers the univariate model with a general covariance structure of the disturbances. The estimation procedure in this model is Generalized Least Squares. It is shown that Buse's R² is a goodness-of-fit measure for a transformed version of the original model, with properties equivalent to those of R² in the standard model. For the untransformed model, Barten's R² and r² are possible measures, with properties as derived in Chapter 3.
Some properties of McElroy's R² for the SURE model are presented in Chapter 5. As a possible alternative to McElroy's R², the squared sample correlation coefficient of the dependent variable, premultiplied by the inverse of the square root of the variance matrix, and its estimate is proposed. However, this measure is not invariant under changes of location and scale of the dependent variable, whereas McElroy's R² is, and it is therefore of no interest for this model.
Chapter 6 shows that the Carter-Nagar R² for a single equation in a simultaneous system, and for the whole system, is a goodness-of-fit measure for the reduced form of the model, based on the Two Stage Least Squares estimation method. For a single structural equation, estimated by Instrumental Variables, Barten's R² and r² are again possible measures. For the complete structural system, estimated by IV, two generalizations of McElroy's R² are proposed, one related to Barten's R², the other to r².
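The Chapter 3 point about the missing constant term can be made concrete with a small numerical sketch. It is only an illustration under assumptions made here: the data are invented, the function name `fit_measures` is hypothetical, and only the usual (centered) R² and the squared sample correlation r² are computed. Barten's R² is not reproduced, because its definition is not given in this summary.

```python
import numpy as np


def fit_measures(y, X):
    """OLS of y on the columns of X (no constant is added automatically).

    Returns (centered R^2, squared sample correlation of y and fitted y).
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta
    resid = y - y_hat
    # Usual definition: 1 - SSR / SST, with SST taken around the mean of y.
    r2_centered = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    # Squared sample correlation between y and its fitted values.
    r_squared_corr = np.corrcoef(y, y_hat)[0, 1] ** 2
    return float(r2_centered), float(r_squared_corr)


# Invented data: y falls with x, but the fitted line is forced through the origin.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([9.0, 8.5, 7.0, 6.5, 5.0])

print(fit_measures(y, x[:, None]))                             # no constant
print(fit_measures(y, np.column_stack([np.ones_like(x), x])))  # with constant
```

Without the constant, the centered R² falls below zero for these data while r² stays between zero and one; once a column of ones is included, the two measures coincide, in line with the statement above.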
In Part II, Goodness of fit in qualitative choice models, consisting of Chapters 7 to 10, goodness of fit is placed in a broader context. Not only are goodness-of-fit measures analysed, goodness-of-fit tests are derived as well.
After the binary choice model and the multinomial logit model have been introduced in Chapter 7, Chapter 8 analyses several pseudo-R² measures for the binary choice model. These measures are based, for example, on a comparison of the estimated probabilities with the binary observations; on the (log)likelihood of the model; on a prediction of the binary dependent variable; or on the underlying continuous linear model. After some properties of these measures have been derived, their behaviour is studied by means of a few experiments.
Chapter 9 approaches goodness of fit in a different way. Instead of comparing the dependent variable with an estimate of it, this chapter compares the assumed distribution with an estimate of that distribution. On this basis two goodness-of-fit tests are derived. The first is a score test and is obtained by defining a class of distributions that is described by a parameter. For one specific value of this parameter the logistic distribution is obtained, and the asymptotic chi-square score test for the logit model is then a test of that specific value of the parameter. The class of distributions proposed is that of Perks.
Another goodness-of-fit test is based on the nonparametric Nadaraya-Watson estimator. This estimator is compared with the assumed distribution via the supremum absolute distance. It is proved that this distance, just as in the model with a continuous dependent variable, follows an asymptotic extreme value, or Gumbel, distribution. This result remains valid when the Nadaraya-Watson estimator is based on the logit or probit estimation results.
Finally, Chapter 10 derives some so-called "overall" goodness-of-fit tests. These tests have no specific alternative hypothesis, but follow a particular (asymptotic) distribution under all assumptions underlying the model. One test is based on the sum of weighted squared residuals. It is proved that this sum is asymptotically normally distributed, and a chi-square test follows directly from this. Another test is based on the Hausman test and compares the nonparametric "maximum rank correlation" estimator of Han with the logit or probit estimator. A nonparametric estimator, the "maximum rank" estimator, is also introduced, but it is proved that this estimator is identical to the estimator of Han.
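Two of the quantities mentioned for Chapters 9 and 10 can be sketched in a few lines. This is a minimal illustration under assumptions made here, not the thesis's own implementation: the weights in the sum of weighted squared residuals are taken to be the inverse Bernoulli variances $1/\{p_i(1-p_i)\}$ (the Pearson form), the kernel is the Epanechnikov kernel, the index is a scalar, and all function names are hypothetical.

```python
import numpy as np


def weighted_squared_residuals(y, p_hat):
    """Sum over the sample of (y_i - p_i)^2 / (p_i (1 - p_i))."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p_hat, dtype=float)
    return float(np.sum((y - p) ** 2 / (p * (1.0 - p))))


def nadaraya_watson(x0, x, y, h):
    """Nadaraya-Watson estimate of E[y | x = x0], Epanechnikov kernel, bandwidth h."""
    x0 = np.atleast_1d(np.asarray(x0, dtype=float))
    u = (x0[:, None] - np.asarray(x, dtype=float)[None, :]) / h
    k = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)
    # Grid points with no observation within one bandwidth give 0/0 -> nan.
    with np.errstate(invalid="ignore", divide="ignore"):
        return (k @ np.asarray(y, dtype=float)) / k.sum(axis=1)


def sup_abs_deviation(grid, x, y, h, cdf):
    """Largest |kernel estimate - assumed cdf| over a grid of index values."""
    grid = np.asarray(grid, dtype=float)
    return float(np.nanmax(np.abs(nadaraya_watson(grid, x, y, h) - cdf(grid))))


def logistic_cdf(t):
    """Distribution function assumed by the logit model."""
    return 1.0 / (1.0 + np.exp(-t))


# Invented inputs, for illustration only.
print(weighted_squared_residuals([1, 0, 1, 0], [0.8, 0.2, 0.5, 0.5]))
print(sup_abs_deviation([-1.0, 0.0, 1.0], [-1.5, -0.5, 0.5, 1.5], [0, 0, 1, 1], 1.0, logistic_cdf))
```

Turning either number into a test requires the asymptotic distributions derived in the thesis (normal for the weighted sum, Gumbel for the supremum distance); the sketch computes the raw statistics only.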
The Tinbergen Institute is the Netherlands Research Institute and Graduate School for General and Business Economics, founded by the Faculties of Economics (and Econometrics) of the Erasmus University in Rotterdam, the University of Amsterdam and the Free University in Amsterdam. The Tinbergen Institute, named after the Nobel prize laureate Professor Jan Tinbergen, is responsible for the PhD program of the three faculties mentioned. Since January 1991 the Economic Institute of the University of Leiden has also participated in the Tinbergen Institute. Copies of the books published in the Tinbergen Institute Research Series can be ordered through Thesis Publishers, P.O. Box 14791, 1001 LG Amsterdam, The Netherlands, phone: +3120 6255429; fax: +3120 6203396. The following books have already appeared in this series:
Subseries A. General Economics
no. 1  Otto H. Swank, "Policy Makers, Voters and Optimal Control, Estimation of the Preferences behind Monetary and Fiscal Policy in the United States".
no. 2  Jan van der Borg, "Tourism and Urban Development. The impact of tourism on urban development: towards a theory of urban tourism, and its application to the case of Venice, Italy".
no. 3  Albert Jolink, "Liberté, Egalité, Rareté. The Evolutionary Economics of Léon Walras".
no. 5  Rudi M. Verburg, "The Two Faces of Interest. The problem of order and the origins of political economy and sociology as distinctive fields of inquiry in the Scottish Enlightenment".
no. 6  Harry P. van Dalen, "Economic Policy in a Demographically Divided World".
no. 8  Marjan Hofkes, "Modelling and Computation of General Equilibrium".
no. 12 Kwame Nimako, "Economic Change and Political Conflict in Ghana 1600-1990".
no. 13 Ans Vollering, "Care Services for the Elderly in the Netherlands. The PACKAGE model".
no. 15 Cees Gorter, "The dynamics of unemployment and vacancies on regional labour markets".
no. 16 Paul Kofman, "Managing primary commodity trade (on the use of futures markets)".
no. 18 Philip Hans Franses, "Model selection and seasonality in time series".
no. 19 Peter van Wijck, "Inkomensverdelingsbeleid in Nederland. Over individuele voorkeuren en distributieve effecten".
no. 20 Angela van Heerwaarden, "Ordering of risks. Theory and actuarial applications".
no. 21 Jeroen C.J.M. van den Bergh, "Dynamic Models for Sustainable Development".
no. 22 Huang Xin, "Statistics of Bivariate Extreme Values".
no. 23 Cees van Beers, "Exports of Developing Countries. Differences between South-South and South-North trade and their implications for economic development".
no. 24 Lourens Broersma, "The Relation Between Unemployment and Interest Rate. Empirical evidence and theoretical justification".
no. 26 Michel de Lange, "Essays on the Theory of Financial Intermediation".
no. 27 Siem-Jan Koopman, "Diagnostic checking and intra-daily effects in time series models".
no. 28 Richard Boucherie, "Product-form in queueing networks".
no. 29 Frank A.G. Windmeijer, "Goodness of Fit in Linear and Qualitative-Choice Models".
Subseries B. Business Economics
no. 4  Rob Buitendijk, "Towards an Effective Use of Relational Database Management Systems".
no. 7  P.J. Verbeek, "Two Case Studies on Manpower Planning in an Airline".
no. 9  T.C.R. van Someren, "Innovatie, emulatie en tijd. De rol van de organisatorische vernieuwingen in het economische proces".
no. 10 M. van Vliet, "Optimization of manufacturing system design".
no. 11 R.M.C. van Waes, "Architectures for Information Management. A pragmatic approach on architectural concepts and their application in dynamic environments".
no. 14 Shuzhong Zhang, "Stochastic Queue Location Problems".
no. 17 Paul Th. van de Laar, "Financieringsgedrag in de Rotterdamse maritieme sector, 1945-1960".
no. 25 E. Smeitink, "Stochastic Models for Repairable Systems".