CONTENTS

LIST OF CONTRIBUTORS ... vii

INTRODUCTION
James P. LeSage and R. Kelley Pace ... 1

PART I: MAXIMUM LIKELIHOOD METHODS

TESTING FOR LINEAR AND LOG-LINEAR MODELS AGAINST BOX-COX ALTERNATIVES WITH SPATIAL LAG DEPENDENCE
Badi H. Baltagi and Dong Li ... 35

SPATIAL LAGS AND SPATIAL ERRORS REVISITED: SOME MONTE CARLO EVIDENCE
Robin Dubin ... 75

PART II: BAYESIAN METHODS

BAYESIAN MODEL CHOICE IN SPATIAL ECONOMETRICS
Leslie W. Hepple ... 101

A BAYESIAN PROBIT MODEL WITH SPATIAL DEPENDENCIES
Tony E. Smith and James P. LeSage ... 127

PART III: ALTERNATIVE ESTIMATION METHODS

INSTRUMENTAL VARIABLE ESTIMATION OF A SPATIAL AUTOREGRESSIVE MODEL WITH AUTOREGRESSIVE DISTURBANCES: LARGE AND SMALL SAMPLE RESULTS
Harry H. Kelejian, Ingmar R. Prucha and Yevgeny Yuzefovich ... 163

GENERALIZED MAXIMUM ENTROPY ESTIMATION OF A FIRST ORDER SPATIAL AUTOREGRESSIVE MODEL
Thomas L. Marsh and Ron C. Mittelhammer ... 199

PART IV: NONPARAMETRIC METHODS

EMPLOYMENT SUBCENTERS AND HOME PRICE APPRECIATION RATES IN METROPOLITAN CHICAGO
Daniel P. McMillen ... 237

SEARCHING FOR HOUSING SUBMARKETS USING MIXTURES OF LINEAR MODELS
M. D. Ugarte, T. Goicoa and A. F. Militino ... 259

PART V: SPATIOTEMPORAL METHODS

SPATIO-TEMPORAL AUTOREGRESSIVE MODELS FOR U.S. UNEMPLOYMENT RATE
Xavier de Luna and Marc G. Genton ... 279

A LEARNING RULE FOR INFERRING LOCAL DISTRIBUTIONS OVER SPACE AND TIME
Stephen M. Stohs and Jeffrey T. LaFrance ... 295
LIST OF CONTRIBUTORS

Badi H. Baltagi, Texas A&M University, College Station, USA
Xavier de Luna, Umeå University, Umeå, Sweden
Robin Dubin, Case Western Reserve University, USA
Marc G. Genton, North Carolina State University, Raleigh, USA
T. Goicoa, Universidad Pública de Navarra, Pamplona, Spain
Leslie W. Hepple, University of Bristol, Bristol, UK
Harry H. Kelejian, University of Maryland, USA
Jeffrey T. LaFrance, University of California, Berkeley, USA
James P. LeSage, University of Toledo, Toledo, USA
Dong Li, Kansas State University, Manhattan, USA
Thomas L. Marsh, Washington State University, Pullman, USA
Daniel P. McMillen, University of Illinois at Chicago, Chicago, USA
A. F. Militino, Universidad Pública de Navarra, Pamplona, Spain
Ron C. Mittelhammer, Washington State University, Pullman, USA
R. Kelley Pace, Louisiana State University, USA
Ingmar R. Prucha, University of Maryland, USA
Tony E. Smith, University of Pennsylvania, Philadelphia, USA
Stephen M. Stohs, University of California, Berkeley, USA
M. D. Ugarte, Universidad Pública de Navarra, Pamplona, Spain
Yevgeny Yuzefovich, Washington, D.C., USA
INTRODUCTION

James P. LeSage and R. Kelley Pace

Decisions and transactions of economic agents may depend upon the current or past behavior of neighboring agents, which can yield spatial or spatiotemporal dependence. For example, prospective house buyers may base their decisions upon recent housing price appreciation for the country, a region, or their city, while paying particular attention to sales in the neighborhood of interest. Other examples where spatial or spatiotemporal dependence might arise include voting behavior, retail store choice, and crime.

Spatial dependence of one observation on nearby observations violates the typical assumption of independence made in regression analysis. This type of dependence can be motivated from both a theoretical and a statistical viewpoint. A theoretical justification for spatial dependence is provided by Brueckner (2003), who considers strategic interaction among local governments. Competition among local governments in the areas of service provision and taxes can lead to a functional relation among observations from neighboring local governments. This theoretical motivation is consistent with empirical evidence found by Case, Rosen and Hines (1993). Other theoretical work by López-Bazo, Vayá, Mora and Suriñach (1999) has extended the neoclassical growth model to incorporate spatial dependence of regional growth rates on those of neighboring regions. Brasington (1999, 2003) models education as a joint product, with a pure private component consumed internally and a pure public component of schooling that spills over to neighboring communities. He also motivates spatial dependence as arising from a compromise between full spatial sorting of individuals and the need to gain economies of scale in areas with small numbers of school-aged children.
Spatial and Spatiotemporal Econometrics, Advances in Econometrics, Volume 18, 1–32. Copyright © 2004 by Elsevier Ltd. All rights of reproduction in any form reserved. ISSN: 0731-9053. doi:10.1016/S0731-9053(04)18013-4
From a statistical viewpoint, unobservable latent characteristics such as locational amenities could create a functional relation among nearby observations on house prices. Consequently, the inability to observe latent characteristics gives rise to a purely statistical motivation for spatial dependence. In this case, modeling spatial or spatiotemporal dependence may mitigate the problems arising from unobservable or difficult-to-quantify factors.

Past studies using spatial or spatiotemporal samples often relied on dichotomous explanatory variables to control for these spatial or temporal effects. In time series analysis, using dichotomous variables to control for temporal dependence implies a linear increase in the number of variables as the number of periods increases. For spatial data, however, the number of dichotomous variables must rise quadratically as the study area grows for fixed-size regions. Spatiotemporal modeling using dichotomous variables must interact both the spatial and temporal dichotomous variables, and this can quickly lead to a very large number of estimated parameters. Like temporal autoregressive processes, spatial and spatiotemporal autoregressive processes often provide more parsimonious and better-fitting models than those that rely on dichotomous variables. For example, Pace et al. (2000) found that a spatiotemporal model of house prices with three spatiotemporal autoregressive variables offered a better fit than a model based on 199 dichotomous variables.

Recent technological advances have increased the availability of sample data that capture these spatial and spatiotemporal aspects of behavior by agents. For example, the Home Mortgage Disclosure Act data in 2002 consisted of 31 million individual observations on loan decisions. The Census reports some data to the block level, and there were nine million of these blocks in the 2000 census. Private point-of-sale databases contain vast numbers of transactions. Many organizations have databases containing address fields. The use of inexpensive address matching and geocoding has made it possible to translate street addresses into map coordinates, allowing creation of large spatial databases. Finally, geographical information systems have made it possible to manage and map spatial data.

In this introduction to Volume 18 of Advances in Econometrics we describe spatial autoregressive processes and outline the literature that has developed methods for estimation and inference based on this parsimonious construct. The contributions to this volume are placed into this broader context of alternative models and estimation methods for dealing with spatial and spatiotemporal dependence. Section 1 describes spatial autoregression and sets forth the consequences of using ordinary least-squares estimation when the sample data contain spatial dependence. A family of spatial regression models that has served as the workhorse of empirical applications involving spatial data is described in Section 2. Section 3 discusses estimation and inference for this family of models based on maximum likelihood, Bayesian, instrumental variables/generalized method of moments, and generalized maximum entropy approaches. All of these estimation approaches are represented in the various contributions to this volume. Alternative approaches to modeling spatially dependent data that are not part of the family of spatial regression models are discussed in Section 4. These include nonparametric methods and Bayesian hierarchical models. Section 5 turns attention to space-time models, which are the subject of two contributions to the volume.
1. SPATIAL AUTOREGRESSION

For this discussion, assume there are n sample observations of the dependent variable y at unique locations. In spatial samples, each observation is often uniquely associated with a particular location or region, so that observations and regions are equivalent. Spatial dependence arises when an observation at one location, say y_i, is dependent on "neighboring" observations y_j, y_j ∈ ϒ_i. We use ϒ_i to denote the set of observations that are "neighboring" to observation i, where some metric is used to define the set of observations that are spatially connected to observation i. For general definitions of the sets ϒ_i, i = 1, ..., n, typically at least one observation exhibits simultaneous dependence, so that an observation y_j also depends on y_i. That is, the set ϒ_j contains the observation y_i, creating simultaneous dependence among observations. This situation constitutes a difference between time series analysis and spatial analysis. In time series, temporal dependence relations could be such that a "one-period-behind relation" exists, ruling out simultaneous dependence among observations. The time series one-observation-behind relation could arise if spatial observations were located along a line and the dependence of each observation were strictly on the observation located to the left. However, this is not in general true of spatial samples, requiring construction of estimation and inference methods that accommodate the more plausible case of simultaneous dependence among observations.

Spatial weight matrices represent a convenient and parsimonious way to define the spatial dependence among observations. These are n by n matrices, where the rows are specified using ϒ_i. If W_ij represents the individual elements of the matrix W, then W_ij > 0 when y_j ∈ ϒ_i, that is, when y_i depends upon y_j. Stated another way, if observation j is a neighbor of observation i, then W_ij > 0. By convention, W_ii = 0, for reasons that will become clear shortly. For weight matrix specifications based on distance or contiguity, the pattern of non-zeros in W is symmetric. For example, assuming a house location can be modeled as a point for practical purposes, if houses within 1,000 feet of each other exhibit dependence in prices and houses i and j are 500 feet apart, then the price of house i is dependent upon that of house j and vice versa. In this example, W_ij = 0 for observations i, j located more than 1,000 feet apart, and W_ij > 0 for those less than 1,000 feet apart. If the sample is over a small area, all elements in W except the diagonal could be strictly positive. Alternatively, a sample spread over a large area could result in a large number of zero elements in W, making this matrix sparse. If the dependence specification is held constant and the sample area expands, the sparsity (number of zero elements) of W increases with n, a phenomenon labelled increasing domain asymptotics by Cressie (1993).¹ When dependence between observations i and j is specified using distance, symmetry of the distance metric requires symmetry in dependence between observations i and j. There are many ways of specifying dependence, such as contiguity among regions, nearest neighbors, and other functions of distance that we enumerate below. Even partial symmetry of the locations of the strictly positive elements in W leads to a situation where the spatial weight matrix W cannot be linearly transformed to triangular matrix form (made similar to a triangular matrix). For the special case of spatial observations on a line noted earlier, the one-observation-behind relation could be transformed to achieve this strictly triangular form. In general, the inability to do this represents an important departure from the case of time series, and it has a number of implications for estimation.

For interpretive and numerical reasons, another common practice is to row-standardize the matrix W so that row sums are unity (see Anselin, 1988). Since W is a non-negative matrix, this type of standardization leads to what is known as a "row stochastic" matrix. Row stochastic matrices have favorable interpretative and numerical implications that we discuss in the sequel. The matrix product Wy, where y is a vector containing the dependent variable, results in an n by 1 vector often called a "spatial lag" of y. When the matrix W is row stochastic, the spatial lag vector contains n arithmetic averages constructed from observations that are spatially related to each observation in the vector y. Intuitively, Wy yields a series of spatially local averages, so in the context of a vector y containing housing prices, Wy represents the average price of houses around each house in the vector y.

We enumerate some of the approaches suggested in the literature for specifying the connectivity structure of the matrix W below. The convention in spatial econometric modeling is to treat specification of the matrix W as based on exogenous information arising from the spatial configuration of the regions or points making up the spatial sample, and to assume that the matrix W is fixed in repeated sampling.²

(1) Spatially contiguous neighbors,
(2) Inverse distances raised to some power,
(3) Lengths of shared borders divided by the perimeter,
(4) Bandwidth as the mth nearest neighbor distance,
(5) Ranked distances,
(6) Constrained weights for an observation equal to some constant,
(7) All centroids within distance d,
(8) m nearest neighbors,
(9) m nearest neighbors with decay.

Fig. 1. An Illustration of Contiguity Relationships.

To illustrate a contiguity specification of W, consider the 5 regions shown in Fig. 1, where regions 2 and 3 are first-order contiguous to region 1 because they have borders that touch. An example of a second-order contiguity relation would be region 4 and region 1, since region 4 has borders that touch a first-order contiguous region, specifically region 3. A similar approach would define region 5 as third-order contiguous to region 1, since it has borders touching region 4.

A row-standardized weight matrix W based on the first-order contiguity relations for the five regions in Fig. 1 would take the form shown in (1). Each row denotes an observation, and the columns associated with first-order contiguous regions to each row are given values of one. The matrix is then row-standardized by dividing each row by its row sum. As an example, region 1 has regions 2 and 3 as first-order contiguous neighbors, so the second and third columns receive a weight of 1 and are normalized to have a value of 1/2.

$$W = \begin{bmatrix} 0 & 1/2 & 1/2 & 0 & 0 \\ 1/2 & 0 & 1/2 & 0 & 0 \\ 1/3 & 1/3 & 0 & 1/3 & 0 \\ 0 & 0 & 1/2 & 0 & 1/2 \\ 0 & 0 & 0 & 1 & 0 \end{bmatrix} \tag{1}$$
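To make this construction concrete, the following sketch (ours, not from the original chapter; Python with NumPy is assumed for all illustrations below) builds the first-order contiguity matrix in (1) from the neighbor pattern implied by Fig. 1, row-standardizes it, and forms the spatial lag Wy:

```python
import numpy as np

# Contiguity pattern implied by (1): region 1 touches 2 and 3, region 2
# touches 3, region 3 touches 4, and region 4 touches 5 (0-based below).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]

n = 5
C = np.zeros((n, n))
for i, j in edges:
    C[i, j] = C[j, i] = 1.0                 # contiguity is symmetric

W = C / C.sum(axis=1, keepdims=True)        # row-standardize: rows sum to one

y = np.array([100.0, 80.0, 90.0, 120.0, 110.0])  # e.g. house prices
print(W)                                    # reproduces the matrix in (1)
print(W @ y)                                # spatial lag: neighbor averages
```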
Alternative approaches to defining spatial weight matrices could be based on distances. We could use the centroid coordinates of each region in Fig. 1 to compute distances between each region and all other regions, forming an n by n matrix, where n reflects the number of sample observations or regions. These distances form the basis for determining a "nearest neighbor" spatial weight matrix. This matrix would have a value of 1 in the i, jth position representing the nearest region to each observation i and zeros elsewhere. Similarly, we might form a matrix based on the two nearest neighbors, those regions identified by the shortest two distances in each row of the n by n distance matrix.

Typically, the strategy for constructing a spatial weight matrix depends on the type of variable being modeled. In a real estate context, it might be reasonable to specify a spatial weight matrix that relies on the nearest three neighboring homes that sold recently. On the other hand, a model of local government interaction might find that spatial contiguity yields a more appropriate basis for the spatial weight matrix. Of course, candidate weight matrices can be compared using model selection criteria for any particular data sample, a subject discussed later.

A first-order spatial autoregressive process can be defined using the spatial weight matrix W and the vector of sample observations y (expressed in deviations from the means form) along with a random normal vector of disturbances ε, as in (2).

$$y = \rho W y + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n)$$
$$y = (I_n - \rho W)^{-1} \varepsilon \tag{2}$$

The second expression in (2) shows the implied data generating process for the first-order spatial autoregression. In (2), the scalar parameter ρ defines the strength of the spatial autoregressive relation between y and the spatial lag Wy. By virtue of the transformation of y to deviation from the means form, we can interpret the parameter ρ as reflecting the spatial correlation between y and Wy. The domain of ρ is defined by the interval (ω_min⁻¹, ω_max⁻¹), where ω_min, ω_max represent the minimum and maximum eigenvalues of the spatial weight matrix (see Sun, Tsutakawa and Speckman, 1999). For the case of a row-standardized weight matrix, −1 ≤ ω_min < 0 and ω_max = 1, so that ρ ranges from negative values to unity. In cases where positive spatial dependence is almost certain (e.g. house price data), restriction of ρ to the [0, 1) interval simplifies computation. Also, since large negative values of spatial dependence seem implausible, restriction of ρ to (−1, 1) seems reasonable in many cases (Smirnov & Anselin, 2001).

Element i of the vector Wy represents Σ_j W_ij y_j, a linear combination of "spatially related" observations based on the non-zero elements in the ith row of W. It should be clear that the matrix W must contain zeros on the main diagonal, to preclude y_i from predicting itself by entering the spatial lag vector for observation i.
One way to view the spatial autoregressive data generating process is to consider the series expansion of the matrix inverse (I_n − ρW)⁻¹ shown in (3).

$$(I_n - \rho W)^{-1} = I_n + \rho W + \rho^2 W^2 + \rho^3 W^3 + \cdots \tag{3}$$

Powers of spatial weight matrices can be interpreted as representing higher-order neighboring relations. For example, since W contains positive entries for neighboring observations, W² contains positive entries for the neighbors of the neighboring relations in W; consequently, W³ denotes the neighbors of the neighbors of the neighbors, and so on. Moreover, since products of row-stochastic matrices are still row-stochastic, W² represents a vector of local averages of local averages, and so on (Horn & Johnson, 1993, p. 529). This can be interpreted to mean that the first-order spatial autoregressive data generating process assigns declining weight to higher-order neighboring relations. This is because the parameter ρ takes on values less than unity for row-stochastic weight matrices, so that higher powers of ρ reflect the smaller influence attributed to higher-order neighboring relations.
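A small simulation (again our illustration, not from the chapter) shows the data generating process in (2) and verifies numerically that the truncated series in (3) reproduces the matrix inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, m = 100, 0.6, 5

# A row-stochastic W: each of n random points weights its m nearest neighbors
pts = rng.uniform(size=(n, 2))
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
np.fill_diagonal(dist, np.inf)
W = np.zeros((n, n))
for i in range(n):
    W[i, np.argsort(dist[i])[:m]] = 1.0 / m

# Data generating process in (2): y = (I - rho W)^{-1} eps
eps = rng.standard_normal(n)
A_inv = np.linalg.inv(np.eye(n) - rho * W)
y = A_inv @ eps

# Truncated series (3): higher-order neighbors get geometrically smaller
# weight, so a modest number of terms recovers the inverse
S, term = np.eye(n), np.eye(n)
for _ in range(50):
    term = rho * (term @ W)
    S = S + term
print(np.abs(S - A_inv).max())   # tiny: the expansion matches the inverse
```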
1.1. Estimation Consequences of Spatial Dependence

The estimation consequences of spatial dependence are that least-squares estimation of the model in (2) will produce biased and inconsistent estimates of the parameter ρ. Anselin (1988) considers the least-squares estimate of ρ, which we label ρ̂ in (4).

$$\hat{\rho} = (y'W'Wy)^{-1} y'W'y = \rho + (y'W'Wy)^{-1} y'W'\varepsilon \tag{4}$$

Considering the probability limit (plim) of expression (4), he notes that Q = plim (1/n)(y′W′Wy) could obtain the status of a finite and nonsingular matrix with appropriate restrictions on the parameter ρ and the structure of the spatial weight matrix W. This leaves us to consider plim (1/n)(y′W′ε), which must equal zero for least-squares to produce a consistent estimate of ρ. Making use of y = (I_n − ρW)⁻¹ε, we find:

$$\text{plim}\;\frac{1}{n}\,\varepsilon'(I_n - \rho W')^{-1} W' \varepsilon \tag{5}$$

The plim of this quadratic form in the error terms will not equal zero except in the trivial case where ρ = 0, or where W is strictly triangular as in the time series case. As noted earlier, the matrix W is not in general similar to a strictly triangular matrix.

The magnitude of the bias arising from use of least-squares to estimate the parameter ρ can be substantial. Using a sample of 3,110 U.S. counties, we illustrate this by defining y to represent the percentage of population in each county that lived in
the same house during 1985 and 1990 (expressed in deviations from the means form). This information was taken from the 1990 U.S. Census, and the matrix W was defined using contiguous counties. The least-squares estimate of ρ was 0.902, whereas the maximum likelihood estimate of ρ was 0.677.
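The upward bias is easy to reproduce on simulated data. The sketch below (ours; the grid-search likelihood anticipates the concentrated log-likelihood discussed in Section 3.1) contrasts the least-squares estimate in (4) with a maximum likelihood estimate of ρ:

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho_true, m = 400, 0.7, 6

# Row-stochastic m-nearest-neighbor weight matrix on random coordinates
pts = rng.uniform(size=(n, 2))
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
np.fill_diagonal(dist, np.inf)
W = np.zeros((n, n))
for i in range(n):
    W[i, np.argsort(dist[i])[:m]] = 1.0 / m

y = np.linalg.solve(np.eye(n) - rho_true * W, rng.standard_normal(n))
Wy = W @ y

# Least-squares estimate (4): regress y on the spatial lag Wy
rho_ols = (Wy @ y) / (Wy @ Wy)

# Maximum likelihood by grid search: ln|I - rho W| - (n/2) ln(e'e),
# with the log-determinant computed from the eigenvalues of W
omega = np.linalg.eigvals(W)
def loglik(r):
    e = y - r * Wy
    return np.log(np.abs(1.0 - r * omega)).sum() - 0.5 * n * np.log(e @ e)
grid = np.arange(-0.99, 1.0, 0.01)
rho_ml = grid[np.argmax([loglik(r) for r in grid])]

print(rho_ols, rho_ml)   # OLS typically overstates rho when rho > 0
```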
2. SPATIAL REGRESSION MODELS

Anselin (1988) popularized a family of spatial regression models that draw on the spatial lag variables introduced above, taking the form in (6), where W and D represent n by n non-negative spatial weight matrices with zeros on the diagonal.

$$y = \rho W y + X\beta + u$$
$$u = \lambda D u + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n) \tag{6}$$

In (6), the n by k matrix X contains k explanatory variables, and the dependent variable is y. The parameters to be estimated are ρ, λ, β, and σ². This model is discussed extensively in Anselin (1988), where he notes that members of the family of models can be derived from this general formulation. Setting the parameter ρ = 0 eliminates the spatially lagged dependent variable Wy, producing a model where the disturbances follow a spatial autoregressive process. The case where λ = 0 eliminates the spatially lagged disturbance term, creating a model where the spatial dependence is modeled as occurring in the dependent variable vector y. Of course, a model based on both ρ and λ non-zero allows for spatial autoregressive dependence in both y as well as the error process. Yet another model arises if we introduce spatial lags of the explanatory variables, taking the form WX, as additional explanatory variables in the model, along with an associated parameter vector.

Extending our discussion of the estimation consequences of spatial dependence to this family of models, we consider special cases of the general model in (6). The model derived from restricting λ = 0, along with the implied data generating process, is shown in (7); it extends the first-order spatial autoregressive model to include an explanatory variables matrix. The matrix inverse (I_n − ρW)⁻¹ shown in the data generating process assigns the same pattern of geometric decay of influence to the explanatory variables of increasingly distant neighbors as well as to the disturbances in this model. The consequences of using least-squares to estimate this model are similar to those for the first-order spatial autoregressive model, and thus estimates of ρ, β, and σ² are biased and inconsistent.

$$y = \rho W y + X\beta + \varepsilon = (I_n - \rho W)^{-1} X\beta + (I_n - \rho W)^{-1} \varepsilon \tag{7}$$
A second model, derived by restricting ρ = 0, is shown along with the implied data generating process in (8). The influence of spatial dependence arises from the infinite series expansion of the matrix inverse (I_n − λD)⁻¹ applied only to the disturbance process in this model. Least-squares estimates are unbiased, but inefficient due to the non-diagonal structure of the implied disturbance covariance matrix. This model takes the form of a linear regression with a parameterized covariance matrix for the error term.

$$y = X\beta + u, \qquad u = \lambda D u + \varepsilon$$
$$y = X\beta + (I_n - \lambda D)^{-1} \varepsilon \tag{8}$$

It would be possible to devise a two-step estimated generalized least-squares (EGLS) procedure as in the case of serial correlation or heteroscedasticity. This would produce consistent estimates for β if the plug-in estimate for the parameter λ were a consistent estimate. However, the consistent estimate for λ required for the two-step EGLS procedure should be based on a first-order spatial autoregression in the disturbances:

$$u = \lambda D u + \varepsilon \tag{9}$$

where u represents the disturbances associated with the least-squares model. As already motivated in the case of the first-order spatial autoregressive model, least-squares cannot be used to consistently estimate the parameter λ. Maximum likelihood estimation of λ requires solving a problem equivalent to maximum likelihood estimation of the model in (8). An alternative would be to rely on the instrumental variables/generalized moments (IV/GM) approach suggested by Kelejian and Prucha (1998) to produce a consistent estimate of λ.

This family of models has served as the workhorse of applied spatial econometric modeling efforts. For example, Le Gallo, Ertur and Baumont (2003) extend traditional growth regressions to the case of a spatial regression. Growth regressions specify the dependent variable y as the growth rate of per capita GDP over some period of time and include the (logged) initial period level of GDP as well as a constant term and explanatory variables that purport to explain long-term economic growth. In an application based on a cross-sectional sample of European Union regions, Le Gallo, Ertur and Baumont (2003) include a spatial lag of the growth rates y in the regression, which allows the growth rate of "neighboring" regions to act as an explanatory variable in the growth regression. Similarly, one can introduce spatial lags of the initial level and of the characteristics matrix X, allowing characteristics of neighboring regions to play a role in long-term economic growth.

Another example is the extension of hedonic price regressions to include a spatial lag of neighboring house values. This models home prices as depending on
neighboring home prices defined by Wy as well as home characteristics contained in the matrix X. This has a great deal of intuitive appeal, since appraisers typically rely on the prices of nearby homes that sold recently when assessing market values. As in the case of the growth regressions, the characteristics of neighboring homes can also be included in the model using spatial lags of the explanatory variables in the matrix X. The contribution to this volume by Ugarte, Goicoa and Militino relies on a traditional hedonic price relation but allows for variation in the parameters of the relation over space. Other applied work in the area of technical innovation, measured by patents granted or patent license citations at various locations, has used these spatial regression models in an attempt to capture spatial spillover effects arising from innovations (Anselin, Varga & Acs, 1997). Anselin (2003) provides a taxonomy of spatial regression models based on distinctions between local and global spatial spillovers and the sources giving rise to these.
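Returning to the error model in (8), a minimal sketch of the second EGLS step (ours, not from the chapter) shows how a consistent estimate λ̂, once obtained, converts the problem into ordinary least squares on spatially filtered data:

```python
import numpy as np

def spatial_egls(y, X, D, lam_hat):
    """Second step of EGLS for the error model (8), taking a consistent
    estimate lam_hat of lambda as given.  GLS here is equivalent to OLS
    on the spatially filtered data (I - lam D) y and (I - lam D) X."""
    n = len(y)
    B = np.eye(n) - lam_hat * D            # spatial filter
    ys, Xs = B @ y, B @ X
    beta = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
    resid = ys - Xs @ beta
    sigma2 = (resid @ resid) / n
    return beta, sigma2
```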
3. ESTIMATION AND INFERENCE

Numerous contributions to this volume rely on the family of regression models in (6) but apply estimation methods other than maximum likelihood. One motivation for alternative estimation approaches is the need to simplify calculation of estimates in large samples. In Section 3.1 we detail computational issues that arise in maximum likelihood estimation of these models. There are other motivations as well for the use of alternative estimation approaches. For example, the contribution by Hepple outlines Bayesian estimation for these models, where the goal is to create simple approaches to comparing alternative model specifications, including models based on different spatial weight matrices. The work in this volume by Kelejian, Prucha and Yuzefovich sets forth an IV/GM approach to estimation, with the goal being simplified calculation of estimates as well as robustness with regard to non-normality of the disturbance process. A contribution by Marsh and Mittelhammer introduces generalized maximum entropy estimation methods for these models, in an attempt to simplify estimation as well as extend the model to the case of truncated dependent variables. After explaining maximum likelihood estimation in Section 3.1, we discuss Bayesian estimation in Section 3.2, IV/GM estimation in Section 3.3, and maximum entropy estimation in Section 3.4.

3.1. Maximum Likelihood

Maximum likelihood methods usually have desirable asymptotic theoretical properties such as consistency, efficiency and asymptotic normality. They are also
thought to be robust to small departures from the normality assumption, and these methods allow practitioners to draw on a well-developed statistical theory regarding parameter inference. We simplify the discussion by focusing on the spatial autoregressive model presented in (10), which represents one member of the family of models that can be derived from (6), obtained by setting the parameter λ = 0.

$$y = \rho W y + X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n) \tag{10}$$

Maximum likelihood estimation of this model requires solving a univariate optimization problem involving the spatial dependence parameter ρ. This is achieved by concentrating the likelihood function with respect to the parameters β and σ², resulting in (11),

$$\ln L = C + \ln|I_n - \rho W| - \frac{n}{2}\ln(e'e)$$
$$e = e_o - \rho e_d, \qquad e_o = y - X\beta_o, \qquad e_d = Wy - X\beta_d$$
$$\beta_o = (X'X)^{-1}X'y, \qquad \beta_d = (X'X)^{-1}X'Wy \tag{11}$$

where C represents a constant not involving the parameters. The computationally troublesome aspect of this is the need to compute the log-determinant of the n × n matrix (I_n − ρW). Operation counts for computing this determinant via eigenvalues grow with the cube of n for a dense matrix W.

In response to the computational challenge posed by the log-determinant, at least two strategies exist. First, the use of alternative estimators can solve this problem. Examples include the instrumental variable approach of Anselin (1988, pp. 81–90), the IV/GM approach of Kelejian and Prucha (1998), and the maximum entropy approach introduced by Marsh and Mittelhammer in their contribution to this volume. Discussion of these is taken up in later sections. A second strategy is to directly attack the computational difficulties confronting likelihood estimation. The Taylor series approach of Martin (1993), the eigenvalue-based approach of Griffith and Sone (1995), the direct sparse matrix approach of Pace and Barry (1997), the Monte Carlo approach of Barry and Pace (1999), the graph theory approach of Pace and Zou (2000), and the characteristic polynomial approach of Smirnov and Anselin (2001) represent examples of this strategy.

Pace and Barry (1997) suggested using direct sparse matrix algorithms such as the Cholesky or LU decompositions to compute the log-determinant. A sparse matrix is one that contains a large proportion of zeros. As noted earlier, increasing domain asymptotics imply that weight matrices will become more sparse with
increases in sample size. As a concrete example, consider the spatial weight matrix for the sample of 3,107 U.S. counties used in Pace and Barry (1997). This matrix is sparse since they examine a fixed number of neighbors (the four nearest neighbors). To understand how sparse matrix algorithms conserve on storage space and computer memory, consider that we need only record the non-zero elements along with an indication of their row and column positions. This requires a 1 by 3 vector for each non-zero element, consisting of a row index, a column index, and the element value. Since non-zero elements represent a small fraction of the total 3,107 × 3,107 = 9,653,449 elements in the weight matrix, sparsity saves computer memory. For our example of four nearest neighbors and 3,107 U.S. counties, there are only 12,428 non-zero elements in the weight matrix, representing a very small fraction (about 0.4 percent) of the total elements. Storing the matrix in sparse form requires only three times 12,428 elements, or more than 250 times less computer memory than would be needed to store all 9,653,449 elements. In addition to storage savings, sparse matrices result in lower operation counts as well, speeding computations. In the case of non-sparse (dense) matrices, matrix multiplication and common matrix decompositions such as the Cholesky require O(n³) operations, whereas for sparse W these operation counts can fall as low as O(n).

Another point made by Pace and Barry (1997) was that a vectorized evaluation of the log-likelihood function in (11) over a grid of q values of ρ could be used to find maximum likelihood estimates. For row-stochastic matrices, the desired interval for values of ρ is often known ex ante (i.e. [0, 1) or (−1, 1)). In other cases, computing solely the minimum and maximum eigenvalues of W to determine the domain of ρ does not incur the computational cost posed by computing all eigenvalues of W. The grid-based approach eliminates the need to rely on optimization algorithms and often has advantages in repeated estimation, as in Monte Carlo studies, resampling, and restricted model estimation.

The computationally intense part of this approach is still calculating the log-determinant, which takes around 491 seconds for a sample of 65,433 U.S. census tracts. This is based on a grid of 0.01 increments from ρ = −1 to 1 using sparse matrix algorithms in MATLAB version 7.0 on a 1.6 GHz Pentium M computer. An improvement based on a Monte Carlo estimator for the log-determinant suggested by Barry and Pace (1999) allows larger problems to be tackled without the memory requirements or sensitivity to orderings associated with the direct sparse matrix approach. This method not only provides an approximation to the log-determinant term but also produces an asymptotic confidence interval around the approximation. As an illustration of these computational advantages, the time required to compute the same grid of log-determinant values for the sample of census tracts was 20.7 seconds, which compares quite favorably to the 491 seconds for the direct sparse matrix computations cited earlier. In addition, Pace and LeSage (2004a) introduced
a quadratic Chebyshev log-determinant approximation with bounds that further reduces the computational time. In our sample of 65,433 census tracts, calculating the grid of log-determinant values requires only 0.4 seconds. Improvements in computing technology as well as possible parallel computing approaches (Smirnov, 2003a) suggest that large problems can be handled using maximum likelihood methods. Other variations are possible. For example, LeSage and Pace (2004) introduce a matrix exponential spatial specification that replaces the conventional geometric decay of influence over space with an exponential pattern of decay. The resulting model, estimates and inferences are similar to the models examined here, but the log-determinant vanishes from the log-likelihood of this model, greatly simplifying maximum likelihood computations. All of these computational advances permit fitting regression relations across different weight matrices in reasonable time, and thus permit inference concerning these (either nested or non-nested).

There are a number of ways of conducting maximum likelihood inference. Inference can proceed using classical approaches based on likelihood ratio (LR), Lagrange multiplier (LM), or Wald principles (Anselin, 1988). The choice among the approaches often depends upon theoretical or computational simplicity. For example, Baltagi and Li in this volume adopt an LM approach to simultaneously test for the presence of spatial dependence and functional form, based on the theoretical simplicity associated with the LM approach. Pace and Barry (1997) advocate use of the LR method, arguing there is a low cost associated with computing likelihoods for all submodels based on parameter restrictions, given the pre-computed log-determinant over the grid of values for ρ.

Inference regarding parameters for regression-based models is frequently based on an estimate of the variance-covariance matrix. In problems where the sample size is small, an asymptotic variance matrix based on the Fisher information matrix for the parameters θ = (ρ, β, σ²) can be used to provide measures of dispersion for these parameters. Anselin (1988) provides the analytical expressions needed to construct this information matrix, but evaluating these expressions is computationally difficult when dealing with large-scale problems involving thousands of observations. The analytical expressions used to calculate the information matrix involve terms such as trace[W(I_n − ρW)⁻¹W(I_n − ρW)⁻¹ + W(I_n − ρW)⁻¹W′(I_n − ρW′)⁻¹]. While the matrix W may be sparse, the inverse matrix (I_n − ρW)⁻¹ is not. Smirnov (2003b) shows that it is possible to rely on the series expansion representation of the matrix inverse (I_n − ρW)⁻¹ to strategically produce a weighted combination of the traces of powers of the sparse spatial weight matrix as an estimate of the information matrix. Since this approach works best for cases involving values of the spatial dependence parameter ρ that are less than 0.7, Smirnov develops an approximate version of the method to deal with cases where ρ > 0.7.
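The Monte Carlo log-determinant idea of Barry and Pace (1999) mentioned above is simple to sketch (our illustration, showing only the basic estimator; the published method also supplies confidence bounds). It combines the trace expansion ln|I_n − ρW| = −Σ_k ρ^k tr(W^k)/k with random quadratic-form estimates of the traces, so that only sparse matrix-vector products are needed:

```python
import numpy as np
import scipy.sparse as sp

def mc_logdet_grid(W, rhos, ndraws=30, K=50, seed=0):
    """Monte Carlo estimate of ln|I - rho W| over a grid of rho values,
    in the spirit of Barry and Pace (1999).  Uses
        ln|I - rho W| = -sum_{k>=1} rho^k tr(W^k) / k
    and estimates tr(W^k) by u'W^k u with u standard normal, since
    E[u'Au] = tr(A).  Truncation at K powers degrades near |rho| -> 1."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    tr = np.zeros(K)
    for _ in range(ndraws):
        u = rng.standard_normal(n)
        v = u
        for k in range(K):
            v = W @ v                     # v = W^{k+1} u: sparse matvec only
            tr[k] += (u @ v) / ndraws     # running average of u'W^{k+1}u
    k = np.arange(1, K + 1)
    return np.array([-np.sum(r**k * tr / k) for r in rhos])

# Usage sketch: W sparse, row-stochastic; grid of rho values in (-1, 1)
# logdets = mc_logdet_grid(sp.csr_matrix(W), np.arange(-0.99, 1.0, 0.01))
```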
An alternative is to rely on a numerical estimate of the Hessian matrix. Given the ability to evaluate the vectorized likelihood function rapidly, numerical methods seem promising as a way to compute the approximations to the gradients required to form the Hessian. However, this can be problematic when the sample data are poorly scaled, and there is an inherent difficulty that plagues this approach. The precision of the estimate for the parameter ρ is typically quite high, making it difficult to produce reasonable numerical Hessian estimates by perturbing the parameter vector θ = (ρ, β, σ²). Small perturbations can lead to large changes in the log-likelihood function, creating an ill-conditioned problem that can result in non-invertible Hessians as well as negative variance estimates. In the next section we discuss a Bayesian Markov Chain Monte Carlo (MCMC) approach to estimating these models that provides a solution to these problems.

Another aspect of maximum likelihood estimation and inference is that of model selection. Given the family of models that can be derived from the spatial regression model in (6), which member of the family is most consistent with any particular data sample? A great deal of literature has focused on this issue from a maximum likelihood perspective. Florax and Folmer (1992) suggest a sequential testing procedure. These sequential tests attempt to discern whether a model based on the restriction ρ = 0, versus λ = 0, versus both ρ and λ different from zero should be selected. Of course, this approach complicates inference regarding the parameters of the final model specification due to the pre-test issue. Florax, Folmer and Rey (2003) consider Hendry's "general to specific" approach to model specification versus a forward stepwise strategy. While the general to specific approach tests sequential restrictions placed on the most general model, the stepwise strategy considers sequential expansion of the model. Beginning with a regression model, expansion proceeds by adding spatial lag terms, conditional upon the results of misspecification tests. They conclude that the Hendry approach is inferior in its ability to detect the true data generating process.

The contribution by Dubin in this volume tackles this issue from a slightly different perspective, devising a strategically nested model. Inference regarding which member of the nested family is appropriate can be carried out during estimation, so that inferences on model specification and parameters can be made simultaneously. Other non-nested approaches include a proposal by Pace and LeSage (2003) that relies on a likelihood dominance approach to compute bounded likelihood ratio tests that do not require computation of the log-determinant term.

The contribution by Baltagi and Li also devises a simultaneous approach to estimation and model selection, but focuses on the question of functional form transformations. Baltagi and Li consider simultaneous tests for specification of functional form as well as the presence of spatial dependence. They make the case that possibly non-linear spatial relations can be successfully modeled using log-linear
or Box-Cox transforms, but that conventional specification tests of these transforms are complicated by the presence of spatial dependence. Consequently, Baltagi and Li devise a test for a linear or log-linear model with no spatial lag dependence against a more general Box-Cox model with spatial lag dependence. Conditional LM tests are also derived which test for zero spatial lag dependence conditional on an unknown Box-Cox functional form, as well as for linear or log-linear functional form given spatial lag dependence.
3.2. Bayesian

Markov Chain Monte Carlo estimation in conjunction with Bayesian hierarchical models represents a widely used approach in the spatial statistics literature (Banerjee, Carlin & Gelfand, 2004). MCMC estimation is achieved by sampling sequentially from the complete sequence of conditional distributions for each parameter in the model. These conditional distributions are often much easier to derive than the full posterior distribution for all parameters. It is also the case that calculations involving conditional distributions are often relatively simple.

To illustrate how this works, let θ = (θ₁, θ₂) represent a parameter vector, let p(θ) denote the prior, and let L(θ|y, X, W) denote the likelihood. This results in a posterior distribution p(θ|D) = c · p(θ)L(θ|y, X, W), with c a normalizing constant and D representing the data. Consider the case where p(θ|D) is difficult to work with, but a partition of the parameters into the two sets θ₁, θ₂ is easier to handle. Given an initial estimate for θ₁, which we label θ̂₁, suppose we could easily estimate θ₂ conditional on θ₁ using p(θ₂|D, θ̂₁). Denote by θ̂₂ the estimate derived by using the posterior mean or mode of p(θ₂|D, θ̂₁). Assume further that we are now able to easily construct a new estimate of θ₁ based on the conditional distribution p(θ₁|D, θ̂₂). This new estimate for θ₁ can be used to construct another value for θ₂, and so on. On each pass through the sequence of sampling from the two conditional distributions for θ₁, θ₂, we collect the parameter draws, which are used to construct a joint posterior distribution for the parameters in our model. Gelfand and Smith (1990) demonstrate that sampling from the sequence of complete conditional distributions for all parameters in the model produces a set of estimates that converge in the limit to the true (joint) posterior distribution of the parameters. That is, despite the use of conditional distributions in our sampling scheme, a large sample of the draws can be used to produce valid posterior inferences regarding the joint posterior mean and moments of the parameters.

LeSage (1997) introduced Bayesian estimation via Markov Chain Monte Carlo (MCMC) sampling for the family of models in (6). This approach can rely on diffuse priors for the parameters, producing estimates equal to those from maximum likelihood.
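A minimal sampler for the spatial lag model y = ρWy + Xβ + ε under diffuse priors might look as follows (our sketch, not LeSage's original code; β and σ² have standard conjugate conditionals, while ρ is updated by a random-walk Metropolis step with the log-determinant computed from eigenvalues, feasible only for modest n):

```python
import numpy as np

def sar_mcmc(y, X, W, ndraw=6000, seed=0):
    """Sketch of MCMC for y = rho W y + X beta + eps, eps ~ N(0, sigma2 I),
    under diffuse priors and a uniform prior for rho on (-1, 1)."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    omega = np.linalg.eigvals(W)               # for ln|I - rho W|
    XtX_inv = np.linalg.inv(X.T @ X)
    Wy = W @ y
    logdet = lambda r: np.log(np.abs(1.0 - r * omega)).sum()
    rho, sigma2 = 0.0, 1.0
    draws = np.zeros((ndraw, k + 2))
    for it in range(ndraw):
        Ay = y - rho * Wy
        # beta | rho, sigma2 ~ N((X'X)^{-1} X' Ay, sigma2 (X'X)^{-1})
        beta = rng.multivariate_normal(XtX_inv @ (X.T @ Ay), sigma2 * XtX_inv)
        # sigma2 | beta, rho ~ inverse gamma(n/2, e'e/2)
        e = Ay - X @ beta
        sigma2 = (e @ e / 2.0) / rng.gamma(n / 2.0)
        # rho | beta, sigma2: random-walk Metropolis step
        cand = rho + 0.05 * rng.standard_normal()
        if abs(cand) < 1.0:
            ec = (y - cand * Wy) - X @ beta
            log_ratio = (logdet(cand) - logdet(rho)
                         - (ec @ ec - e @ e) / (2.0 * sigma2))
            if np.log(rng.uniform()) < log_ratio:
                rho = cand
        draws[it] = np.r_[beta, rho, sigma2]
    return draws  # after burn-in, means/stds summarize the joint posterior
```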
LeSage (1997) also provides a Bayesian extension of the family of models in (6) that accommodates heteroscedasticity and outliers.

Use of MCMC sampling for parameter estimation provides an elegant solution to the computational issues noted regarding the analytical and numerical Hessian calculations used to produce measures of parameter dispersion. The sample of parameter draws produced during sampling can be used to construct a covariance matrix, from which implied t-statistics can be computed. As an illustration of this, we use a data set from Pace and Barry (1997) covering a sample of 3,107 U.S. counties, where the dependent variable is county-level voter participation in the 1980 presidential election and the explanatory variables matrix consists of the percent of the population in each county over age 19 eligible to vote, the percent with college degrees, the percent owning their homes, the log of per capita income, and an intercept term. To enhance scaling for the numerical Hessian calculations, the matrix of explanatory variables was put in deviations from the means form and scaled by the standard deviations. The maximum likelihood and Bayesian MCMC estimates for the spatial lag model y = ρWy + Xβ + ε are presented in Table 1. Standard deviations constructed from numerical Hessian covariance estimates and standard deviations based on the covariance matrix from 6,000 MCMC draws³ were used to compute the implied t-statistics presented in the table.

Table 1. Numerical Hessian vs. MCMC Estimates of Dispersion.

|                   | Maximum Likelihood | t-Statistic | Bayesian MCMC | Implied t-Statistic |
|-------------------|--------------------|-------------|---------------|---------------------|
| Constant term     | −0.2354            | −26.43      | −0.2363       | −25.85              |
| Education         | 0.0481             | 13.92       | 0.0483        | 13.84               |
| Home ownership    | 0.0721             | 33.19       | 0.0721        | 33.14               |
| Per capita income | −0.0182            | −5.95       | −0.0183       | −5.96               |
| ρ                 | 0.5909             | 39.42       | 0.5893        | 38.33               |
| σ²                | 0.0138             |             | 0.0138        |                     |
| R²                | 0.4413             |             | 0.4419        |                     |

From the table we see that Bayesian MCMC estimates based on diffuse priors produce estimates and inferences almost identical to those from maximum likelihood estimation. This suggests that in cases where difficulties arise in computing numerical Hessian estimates of dispersion for the parameter estimates, Bayesian MCMC estimates provide a viable alternative.

As with maximum likelihood, a question arises as to which model generated the data. In this volume, Hepple sets forth a Bayesian approach to model comparison that extends earlier work in this area (Hepple, 1995a, b). A virtue of this approach
is that posterior model probabilities or Bayes factors can be used to compare non-nested models. This allows the method to be used for comparing models based on: (1) different spatial weight matrices; (2) different model specifications (including those that may not be members of the family of models in (6)); and (3) different sets of explanatory variables contained in the matrix X.

An issue that arises with this approach is the need to avoid a paradox pointed out by Lindley (1957). He noted that when comparing models with different numbers of parameters that rely on diffuse priors, the simpler model is always favored over a more complex one, irrespective of the sample data information. An implication of this is that two models with an equal number of parameters can be compared using diffuse priors, but for model comparisons that involve changes in the number of parameters, strategic priors must be developed and used. A strategic prior would be one that recognizes that flat priors are highly informative, because assigning a diffuse prior over a parameter value assigns a large amount of prior weight to values of the parameter that are very large in absolute value terms. A classic example is one given by DeGroot (1982), who notes that a diffuse prior on the parameter β which defines the parameter θ = exp(β)/(1 + exp(β)) in fact places nearly all of the prior weight on values of θ near 0 or 1. An implication is that there is no natural way to encode complete prior ignorance about parameters. A strategic prior in the context of model comparison would be one that explicitly recognizes the role that parameters and priors play in controlling model complexity. Given this, we could explore how prior settings impact posterior model selection. For example, use of a prior mean of zero for the regression parameters β should lead to a situation where tighter imposition of these prior means (smaller prior variances) would perhaps favor posterior selection of more parsimonious models.

An important point to note about all spatial model comparison methods is that their performance will depend on the strength of spatial dependence in the sample data. We illustrate this point using the latitude-longitude coordinates from a sample of 258 European Union regions to produce a data-generated spatial sample. The location coordinates were used to construct a spatial weight matrix based on the five nearest neighboring regions. This weight matrix was used to generate a y-vector based on the spatial autoregressive model y = ρWy + Xβ + ε. The explanatory variables matrix X was generated as a three-column matrix of standard normal random deviates, and the three β parameters were all set to unity. The noise variance parameter σ² was also set to one. The operational characteristics of any specification test to detect the true model structure will usually depend on the signal/noise ratio in the data generating process, determined by the variance of the matrix X relative to the noise variance, which we hold constant in this data-generated illustration.
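The data-generating step of this experiment is easy to reproduce (a sketch under our own stand-in coordinates, since the EU regional centroids are not given here; the posterior model probability computation itself is not shown):

```python
import numpy as np

rng = np.random.default_rng(2)
n, rho, k = 258, 0.5, 3

# Stand-in coordinates; the experiment uses centroids of 258 EU regions
coords = rng.uniform(size=(n, 2))
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
np.fill_diagonal(dist, np.inf)

# Row-stochastic weight matrix on the five nearest neighbors
W = np.zeros((n, n))
for i in range(n):
    W[i, np.argsort(dist[i])[:5]] = 0.2

X = rng.standard_normal((n, k))     # three standard normal regressors
beta = np.ones(k)                   # all beta_j = 1, sigma2 = 1
y = np.linalg.solve(np.eye(n) - rho * W, X @ beta + rng.standard_normal(n))
```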
A proper uniform prior was placed on the parameter ρ, and a diffuse prior was used for the parameter σ. The g-prior of Zellner (1986) was used for the parameters β; it takes the form of a normal prior distribution with mean zero and variance-covariance equal to [g · X′X]⁻¹. The prior parameter g was set equal to 1/n, creating a relatively uninformative prior on β. We note that Zellner's g-prior illustrates the strategic prior discussed earlier, since we can examine how changes in the single prior parameter g impact posterior model selection. Larger values of g should impose the prior mean of zero for the parameters β more tightly, favoring models containing fewer explanatory variables in the matrix X. For this illustration, all models contain the same matrix X, differing only with respect to the spatial weight matrix.

A series of seven models were generated based on values of ρ varying from −0.5 to 0.5, with the parameters described above held constant. It took around 4 seconds to produce posterior probabilities for a set of 10 models based on spatial weight matrices constructed using 1–10 nearest neighbors for this sample of 258 observations. Most of the time (3.3 seconds) was spent computing the log-determinant term for the 10 different weight matrices, using the method of Pace and Barry (1997). The posterior model probabilities are presented in Table 2 for models associated with M = 1, . . . , 10 neighbors and values of ρ ranging from −0.5 to 0.5.

Table 2. Posterior Probabilities for Models with Differing ρ and Weight Matrices.

| M   | ρ = −0.5 | ρ = −0.3 | ρ = −0.1 | ρ = 0 | ρ = 0.1 | ρ = 0.3 | ρ = 0.5 |
|-----|----------|----------|----------|-------|---------|---------|---------|
| 1   | 0.00     | 0.00     | 0.01     | 0.04  | 0.02    | 0.00    | 0.00    |
| 2   | 0.00     | 0.00     | 0.01     | 0.06  | 0.09    | 0.00    | 0.00    |
| 3   | 0.00     | 0.00     | 0.02     | 0.06  | 0.05    | 0.06    | 0.00    |
| 4   | 0.00     | 0.02     | 0.02     | 0.07  | 0.15    | 0.17    | 0.00    |
| 5*  | 0.94     | 0.81     | 0.10     | 0.12  | 0.18    | 0.73    | 0.99    |
| 6   | 0.04     | 0.04     | 0.10     | 0.14  | 0.14    | 0.01    | 0.00    |
| 7   | 0.00     | 0.04     | 0.31     | 0.11  | 0.11    | 0.00    | 0.00    |
| 8   | 0.00     | 0.04     | 0.19     | 0.11  | 0.08    | 0.00    | 0.00    |
| 9   | 0.00     | 0.00     | 0.09     | 0.12  | 0.07    | 0.00    | 0.00    |
| 10  | 0.00     | 0.00     | 0.12     | 0.12  | 0.07    | 0.00    | 0.00    |
| R²  | 0.77     | 0.74     | 0.77     | 0.77  | 0.77    | 0.77    | 0.76    |

From the table, we see that positive or negative values of 0.3 or above for ρ lead to high posterior probabilities associated with the correct model, that based on M = 5
nearest neighbors. Absolute values of 0.1 or less for ρ lead to less accurate identification of the true data generating model, with the posterior probabilities taking on a fairly uniform character for the case of ρ = 0. Intuitively, when ρ is small or zero, it will be difficult to assess the proper spatial weight matrix specification, since the spatial lag term Wy in the model is associated with a zero coefficient. The R-squared statistics based on maximum likelihood estimation are reported in the table, indicating a fairly typical signal/noise ratio, which was held constant across the seven models while the ρ values varied.

A Bayesian solution to model selection is to rely on model averaging. This approach constructs estimates and inferences based on a linear combination of models, where the posterior model probabilities are used as weights to form the model average. While there has been a great deal of work on this topic in the regression literature (Raftery et al., 1997), it has received little attention in the spatial regression literature.
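The averaging step itself is a trivial weighted combination; a sketch (ours), assuming per-model posterior probabilities and coefficient estimates are already in hand:

```python
import numpy as np

def model_average(betas, probs):
    """Posterior-probability-weighted average of per-model estimates.
    betas: (n_models, k) array of estimates; probs: (n_models,) posterior
    model probabilities (e.g. over candidate weight matrices)."""
    probs = probs / probs.sum()     # guard against rounding in the inputs
    return probs @ betas
```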
3.3. Generalized Method of Moments

An alternative approach to estimating the family of models in (6) is the instrumental variables/generalized method of moments (IV/GM) approach set forth by Kelejian and Prucha (1998, 1999). The IV/GM approach has seen wide usage in other econometric problems. The IV procedure introduced in Kelejian and Prucha (1998) is based on a generalized moments (GM) estimator and does not require specific distributional assumptions. The work in this volume by Kelejian, Prucha and Yuzefovich sets forth an extension of this approach, and they provide Monte Carlo evidence regarding the robustness of this approach to non-normality of the disturbance process. They also demonstrate that this approach is asymptotically efficient within the class of IV estimators, and has a lower computational count than an efficient IV estimator introduced by Lee (2003).

One advantage of this approach to estimation of the family of spatial regression models in (6) is that extensions to other settings, such as simultaneous systems of relations involving spatial dependence, appear relatively straightforward and computationally simple (Kelejian & Prucha, 2004). Another extension of this approach, by Flemming (2004), is in the area of limited dependent variable models that exhibit spatial dependence. Kapoor (2003) extends this approach to the case of panel data models involving spatial cross-sections of time series data.

Their contribution to this volume extends the original approach of Kelejian and Prucha (1998) with regard to the selection of instruments and demonstrates asymptotic normality and efficiency. A Monte Carlo study compares the small
sample properties of this approach to those of maximum likelihood, the method of Lee (2003), and the technique of Kelejian and Prucha (1998). They find that all of these approaches share similar operational characteristics when the disturbance process follows a normal distribution. Departures from normality provide a slight edge for the IV/GM based approaches. The classic criticism of IV/GM estimation approaches centers on the relative lack of precision associated with these estimates, yet the Monte Carlo evidence provided by Kelejian, Prucha and Yuzefovich suggests this may not be a problem for the method outlined here.
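The basic spatial two-stage least squares step behind these IV approaches can be sketched as follows (our illustration; the instrument set H = [X, WX, W²X] is a common low-order choice in this literature, and the GM step for the error parameter is not shown):

```python
import numpy as np

def spatial_2sls(y, X, W):
    """Spatial 2SLS for y = rho W y + X beta + u: instrument the endogenous
    spatial lag Wy with low-order spatial lags of the exogenous variables.
    If X contains an intercept, drop the redundant constant columns of H
    (W times a constant column is constant for row-stochastic W)."""
    Wy = W @ y
    Z = np.column_stack([Wy, X])                   # endogenous + exogenous
    H = np.column_stack([X, W @ X, W @ (W @ X)])   # instrument set
    Z_hat = H @ np.linalg.solve(H.T @ H, H.T @ Z)  # project Z on instruments
    delta = np.linalg.solve(Z_hat.T @ Z, Z_hat.T @ y)
    return delta   # first element is rho_hat, the rest is beta_hat
```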
3.4. Maximum Entropy

Entropy can be interpreted as a mathematical measure of the distance or discrepancy between two distributions, where one acts as a reference distribution. It can also be seen as a penalty function on deviations, such as those that arise in least-squares between the sample data observations and fitted values. Replacing the quadratic sum of squares objective function with the entropy objective function provides the basis for entropy-based estimation methods. Golan, Judge and Miller (1996, 1997) draw on these ideas to propose estimation of the general linear model based on the principle of generalized maximum entropy. They argue that this approach has advantages in cases involving small samples or ill-posed problems. Their approach has been labelled GME-GLM estimation.

The contribution by Marsh and Mittelhammer provides a review of GME-GLM estimation and extends this method to the case of sample data with spatial dependence. Specifically, they introduce maximum entropy (GME-GLM) methods for estimation of the family of spatial autoregressive models in (6). They provide Monte Carlo experiments that compare the mean squared error loss of the spatial GME-GLM estimators to that of ordinary least squares and maximum likelihood estimators. These experiments also examine the sensitivity of the spatial GME estimators to the supports that must be placed on the model parameters by the user, and consider the performance of these support intervals across a range of spatial autoregressive coefficient values. Drawing on work of Golan, Judge and Perloff (1997) that extends the GME estimator to the case of a Tobit regression model, they extend their spatial GME estimator to produce a Tobit variant. Monte Carlo experiments explore the operational characteristics of this estimation method. An application of the spatial entropy GLM and Tobit methods is provided that examines agricultural disaster payment allocations across political regions in a simultaneous framework.
4. ALTERNATIVE SPATIAL MODELS

A host of alternatives to the spatial regression model family in (6) have been developed, and we discuss a few of these in the following sections. Specifically, we examine nonparametric and semiparametric models as well as a Bayesian hierarchical spatial autoregressive individual effects model. Notable omissions from this discussion include the spatial statistical models used for geostatistical analysis and image processing, although the contribution by Dubin in this volume employs these.
4.1. Nonparametric Approaches to Modeling Spatial Dependence

McMillen (2003) argues that inadequately modeled spatial heterogeneity can lead to spatial dependence. Semiparametric and nonparametric spatial approaches attempt to model this spatial heterogeneity. A semiparametric approach to modeling cross-sectional data where spatial dependence is thought to arise from this source was introduced by McMillen (1996) as well as McMillen and McDonald (1997). Their locally linear weighted regression (LWR) approach to modeling spatial dependence relies on separate models estimated using a sub-sample of the data based on the observations near each observation. If spatial dependence arises due to inadequately modeled spatial heterogeneity, LWR can potentially eliminate this problem.

Nonparametric methods are highlighted in the contributions to this volume by McMillen and by Ugarte, Goicoa and Militino. McMillen allows for variation over space in the relation between home price appreciation rates and the urban subcenters that are becoming increasingly common in U.S. cities. Ugarte, Goicoa and Militino also deal with variation in the parameters over space, focusing on a traditional hedonic price relation. This type of model is shown in (12), where Φ(i) represents an n × n diagonal matrix containing distance-based weights for observation i that reflect the distance between observation i and all other observations.

$$\Phi(i)^{1/2} y = \Phi(i)^{1/2} X \gamma_i + \Phi(i)^{1/2} \varepsilon_i \tag{12}$$

The subscript i on γ_i indicates that this k × 1 parameter vector is associated with region i. The LWR model produces n such vectors of parameter estimates, one for each region/observation. These estimates are calculated using the expression in (13).

$$\hat{\gamma}_i = (X'\Phi(i)X)^{-1} X'\Phi(i)y \tag{13}$$
A number of alternative approaches have been proposed to construct the distance-based weights for each observation i contained in the vector on the diagonal of Φ(i). As an example, McMillen and McDonald (1998) suggest the tri-cube weighting function in (14).

$$\operatorname{diag}(\Phi(i))_j = \left[1 - \left(\frac{d_i^j}{d_i^m}\right)^3\right]^3 I\!\left(d_i^j < d_i^m\right) \qquad (14)$$
where d_i^j represents the distance between observation j and observation i, d_i^m represents the distance between the mth nearest neighbor and observation i, and I(·) is an indicator function that equals one when the condition is true and zero otherwise. In practice, the number of nearest neighbors used (often referred to as the "bandwidth") is determined with a cross-validation procedure. Pace and LeSage (2004b) point out that LWR methods exhibit a trade-off between increasing the sample size to produce less volatile estimates and increasing spatial dependence. Selecting a smaller sample size reduces the spatial dependence, but at the cost of increased parameter variability that impedes detection of systematic patterns of parameter variation over space. They introduce a spatial autoregressive local estimation scheme (SALE). The SALE method modifies the volatility-spatial dependence trade-off by extending the LWR approach to include a spatial lag of the dependent variable. This can accommodate spatial dependence that is likely to arise as the sub-sample size is increased. In addition to improved prediction and stability of the parameter estimates, inclusion of the spatial autoregressive term in the model decreases sensitivity of the estimates to the choice of bandwidth. There is a computational cost associated with introducing the spatial lag since the SALE model requires maximum likelihood methods, whereas the LWR model relies on least squares. Pace and LeSage (2004b) present an efficient recursive approach for maximum likelihood estimation of the n spatial autoregressive models for problems involving large numbers of observations. They illustrate the method for a sample of 3,107 U.S. counties, whereas most cross-sectional samples used in the LWR literature are considerably smaller. Other nonparametric work that focuses on estimation of the spatial correlation/covariance structure can be found in Conley and Ligon (2002), Chen and Conley (2001), as well as Pinske, Slade and Brett (2002). Here, reliance is on sample data to estimate the spatial structure of connectivity rather than taking the conventional approach that specifies this structure based on the spatial configuration of the observations. It is also typical for these approaches to focus on models derived from the family in (6) by restricting the parameter ρ = 0, producing a spatial variance-covariance structure of dependence between observational units. An implication of this approach is that estimates for the parameters β in the regression relation are conditional on other parameters that quantify the spatial covariance structure of the disturbances. Of course, the conventional spatial regression models from (6) also produce estimates that are conditional on the particular spatial weight matrix used to specify the model. However, conventional reliance on simple contiguity or nearest neighbor spatial weight matrices simplifies inference.
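Before turning to other models, a minimal sketch of the locally weighted estimator in (12)-(14), assuming two-dimensional coordinates, Euclidean distances and the tri-cube kernel; the bandwidth m, the simulated data and all function names are illustrative, not taken from the software discussed in this volume.

```python
import numpy as np

def lwr_estimates(y, X, coords, m=10):
    """LWR: one k x 1 coefficient vector per observation, as in (12)-(13).

    For each observation i, tri-cube weights from (14) fill the diagonal
    of Phi(i), and gamma_i_hat = (X' Phi(i) X)^{-1} X' Phi(i) y is computed.
    """
    n, k = X.shape
    gammas = np.empty((n, k))
    for i in range(n):
        d = np.linalg.norm(coords - coords[i], axis=1)   # d_i^j for all j
        d_m = np.sort(d)[m]                              # m-th nearest neighbor
        w = (1.0 - (d / d_m) ** 3) ** 3 * (d < d_m)      # tri-cube weights (14)
        Phi = np.diag(w)
        gammas[i] = np.linalg.solve(X.T @ Phi @ X, X.T @ Phi @ y)
    return gammas

# Illustrative use on simulated data with spatially constant coefficients
rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 1.0, size=(50, 2))
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=50)
print(lwr_estimates(y, X, coords).mean(axis=0))          # close to [1, 2]
```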
4.2. Spatial Autoregressive Individual Effects

Smith and LeSage in their contribution to this volume introduce an extension to the basic regression model that incorporates a vector of individual effects following a spatial autoregressive process. The model in (15) produces individual effects estimates for spatially aggregated regions of the sample. For example, if the n observations in the vector y and matrix X represent individual counties, the individual effects vector θ might reflect m states, where the matrix Δ carries out the necessary aggregation of n counties to m states. They allow the error terms associated with the spatial autoregressive individual effects vector to exhibit heterogeneous variances, with a different variance scalar parameter for each state.

$$y = X\beta + \Delta\theta + \varepsilon, \qquad \theta = \rho W \theta + u \qquad (15)$$
This specification reflects a more parsimonious approach to modeling individual effects than the typical approach that might rely on the introduction of m dummy variables in the model. Their specification treats the case of a binary dependent variable, creating a hierarchical Bayesian spatial probit model that is estimated using MCMC methods. Economists use unobserved utility differences to provide a formal econometric justification for binary choice models (Amemiya, 1985). Following Albert and Chib (1993), binary or truncated dependent variable models can be estimated by including the unobserved or latent utility differences as parameters during MCMC estimation. This contribution also illustrates Bayesian hierarchical models that have been popular in spatial statistical modeling (Banerjee, Carlin & Gelfand, 2004, as well as Sun, Tsutakawa & Speckman, 1999). Alternative approaches to modeling binary dependent variables in the presence of spatial dependence include: McMillen (1992), who describes an E-M approach; Beron and Vijverberg (2004), who take a maximum likelihood approach; Pinske, Slade and Brett (2002), who tackle this problem using a semiparametric approach; Flemming (2004), who extends the IV/GM methods of Kelejian and Prucha (1998) to this case; and LeSage (2000), who sets forth a Bayesian approach.
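A small simulation sketch of the model in (15) may help fix the notation; the equal county-to-state assignment, the ring-shaped weight matrix over states and all variable names are illustrative assumptions of ours, not part of the Smith and LeSage specification.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 60, 6, 2                       # counties, states, regressors
rho = 0.5

# Delta: n x m indicator matrix assigning each county to its state
state = np.repeat(np.arange(m), n // m)
Delta = np.zeros((n, m))
Delta[np.arange(n), state] = 1.0

# Simple row-standardized weight matrix over the m states (neighbors on a ring)
W = np.zeros((m, m))
for i in range(m):
    W[i, (i - 1) % m] = W[i, (i + 1) % m] = 0.5

# theta = rho * W @ theta + u, so theta = (I - rho W)^{-1} u, with a
# different variance scalar for each state as described in the text
sigma_state = rng.uniform(0.5, 1.5, size=m)
u = rng.normal(scale=sigma_state)
theta = np.linalg.solve(np.eye(m) - rho * W, u)

X = rng.normal(size=(n, k))
beta = np.array([1.0, -1.0])
y = X @ beta + Delta @ theta + rng.normal(size=n)   # model (15)
```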
5. SPATIOTEMPORAL MODELS

Spatiotemporal sample data can take the form of a set of time-series cross-sectional observations on regions over time. It is often the case that either the time-series or cross-sectional dimension of the sample is small. For example, we might have a long series of monthly or quarterly labor market time series on only five to ten states or regions. Another example would be census observations on thousands of census tracts that are limited to three or four time periods. Recognition of dependencies among regional economies has led to a literature on multiregional models. These took a structural approach, employing linkage variables such as relative-cost, adjacent-state demand, and gravity variables that explicitly depend on the spatial configuration of the regions. Examples include Ballard and Glickman (1977), Milne, Glickman and Adams (1980), Ballard, Glickman and Gustely (1980), and Baird (1983). In contrast, Garcia-Ferrer, Highfield, Palm and Zellner (1987) and Zellner and Chanisk (1991) developed a more time-series oriented multi-country forecasting methodology that utilized Bayesian approaches to borrow strength from observations taken from a cross-section of countries. Related work by Zellner, Hong and Gulati (1990) and Zellner, Hong and Min (1991) focused on forecasting turning points in the international growth rates of real output for 18 countries over the period 1974–1986. This work compared the ability of alternative estimation methods to successfully utilize cross-country sample information, but no explicit account was taken of the relative location of the 18 countries. Nearby countries were assigned the same prior importance as those located far away. Posterior estimates allowed the sample data to determine relative importance based on co-movements in the broad business cycle swings across their 18 country sample. LeSage and Pan (1995) took a Bayesian approach to developing a spatial prior variance based on the spatial contiguity structure of the variables that could be applied to the parameters in a vector autoregressive time-series model, while Krivelyova and LeSage (1999) formulated both a prior mean and variance based on spatial contiguity. Recent work by Pesaran, Schuermann and Weiner (2003) relies on exclusionary restrictions based on international trade relations in specifying a vector autoregressive error correction model for a cross-section of countries. The contribution by Stohs and LaFrance to this volume extends this literature by devising an explicit information theoretic learning rule that governs the way in which the space-time structure of the sample data is utilized during estimation. Their approach uses yearly Kansas winter wheat data from different spatial scales (county-level and farm-level) in a maximum entropy procedure to explicitly model the spatiotemporal dependence structure. They note that weather may play an important role in creating spatial dependence in agricultural harvests. To the extent
that weather affects harvests, these local weather effects will be manifested in spatial correlations among crop yields in nearby counties. They exploit this by decomposing county-level yields into a mean trend and a spatial residual component, which can be used to improve the efficiency of yield estimates. The contribution by de Luna and Genton takes an approach that attempts to infer the spatiotemporal dependence structure using the sample data. A systematic model-building strategy is devised in the context of a spatial variant of the vector autoregressive model that relies on a spatiotemporal ordering mechanism. Unlike space-time ARMA models (Pfeifer & Deutsch, 1980; Stoffer, 1986), no assumption of spatial stationarity (isotropy) is made that would require the magnitude of spatial correlation to depend only on the distance between the regions. They rely on spatial contiguity to specify restrictions on the coefficient matrices in the spatial VAR model and automate inference regarding these through the use of model selection criteria such as AIC (Akaike, 1974) or BIC (Schwarz, 1978). This approach has the virtue that geographical proximity can be overturned in favor of time-series predictive power arising from regions that are not nearby during the sequential testing procedure used to construct the model. Other spatiotemporal samples might be approached from a panel data setting if the time or cross-sectional dimension of the data is inadequate to allow use of time-series methods. Elhorst (2003) provides a review of issues arising in maximum likelihood estimation of space-time panel data models. He extends panel data models to members of the family of models in (6) involving spatial error autocorrelation or a spatially lagged dependent variable. The focus is on specification and estimation of spatial extensions to the four panel data models commonly used in applied research: the fixed effects model, the random effects model, the fixed coefficients model, and the random coefficients model. As noted, Kapoor (2003) tackles these models using the IV/GM approach of Kelejian and Prucha (1998). Also, for disaggregated spatiotemporal data where observations occur irregularly over time and space, one can define spatial, temporal, and spatiotemporal lags and estimate conventional autoregressive models as in Pace et al. (2000), or the space-time ARMA models of Pfeifer and Deutsch (1980) and Stoffer (1986).
6. THE CURRENT STATE OF SPATIAL ECONOMETRICS

Research on alternative methods for estimation and inference for spatial regression models has expanded rapidly, as illustrated by contributions to this volume. Each of these has its strengths and weaknesses, offering practitioners a menu of choices.
In addition to the numerous estimation methods, the availability of software that implements these has also expanded rapidly. The Center for Spatially Integrated Social Science (CSISS) provides an internet site for dissemination of information regarding such software (csiss.org), as well as tools for exploratory spatial data analysis and mapping. The internet sites spatial-econometrics.com and spatialstatistics.com contain public domain MATLAB functions for implementing spatial econometric estimation methods, and a free program for exploratory spatial statistics and mapping (GeoDa) is available (see csiss.org for links). Along with software for estimation, databases containing spatial data samples are expanding rapidly, along with the tools to extract this information easily. Introductory econometrics textbooks usually fail to mention the problem of spatial dependence. Further, the textbook devoted to this area (Anselin, 1988) is becoming outdated due to the numerous alternative modeling and estimation methods introduced recently. The scant coverage in econometrics textbooks and the lack of textbooks devoted to spatial econometric methods have hindered recognition of the problem of spatial dependence. Practitioners need to search a wide variety of literature sources to find estimation and inference methods that deal with this type of sample data. Despite these drawbacks, there are signs that the need for and importance of spatial econometric methods is achieving wider recognition. In addition to this volume, a forthcoming issue of the Journal of Econometrics will be devoted to analysis of spatially dependent data, and several journals have devoted special issues to this topic. The Journal of Real Estate Finance and Economics devoted two special issues to spatial econometrics in 1998 and 2004, and a 2004 issue of Geographical Analysis focused on this topic. A book volume that focuses on recent work in the area of spatial econometrics appeared in 2004 (Mur, Zoller & Getis, 2004) and another is forthcoming (Anselin, Florax & Rey, 2004). In addition to texts and journal issues, CSISS and other organizations have sponsored short courses and workshops on the topic of spatial data analysis to introduce social science researchers to these methods. Although spatial and spatiotemporal methods have advanced substantially in recent years, a number of gaps remain. Theoretical work that leads to spatial specifications represents one area where additional work would help balance the recent progress that has taken place in spatial econometrics. These theoretical contributions would aid in the task of interpreting parameters estimated using spatial econometric models. For example, theoretical extensions of the neoclassical growth model have added parameters to reflect the extent or intensity of knowledge spillovers, yet these have not been tied to their obvious counterparts measuring the strength of spatial dependence in regression models.
There is also a need for additional study of the relative strengths and weaknesses of the alternative methods proposed for estimation and inference, in both Monte Carlo and applied settings. As illustrated in our introduction, the operational characteristics of these methods depend on the usual signal/noise ratio inherent in the data generating process as well as the strength of spatial dependence. Despite the proliferation in methods for spatial estimation, applied researchers are confronted with a host of model specification issues. These might involve model selection or comparison of models based on alternative spatial weight matrices, alternative sets of explanatory variables and alternative spatial model specifications. Current practice often requires the practitioner to make a case that estimates and inferences are insensitive to alternative weight matrix specifications or alternative model specifications. A better solution would be the ability to provide statistical evidence of the relative consistency between sample data and alternative spatial weight matrix structures, spatial econometric model specifications and competing explanatory variables. Spatial samples involving limited dependent variables that are truncated or polychotomous in nature arise from surveys or spatial samples representing discrete decision outcomes such as voting. As noted, spatial probit and Tobit models have been developed using maximum likelihood, Bayesian and IV/GM approaches for cases involving binary dependent variables and the normal probability transformation. Methods that deal with cases involving multinomial spatial choice situations, rare events governed by Poisson processes, as well as survival and duration models, would extend the toolkit of methods. Spatial dependence for cases involving simultaneous systems of equations represents another area where more work could be done. Kelejian and Prucha (2004) set forth an IV/GM approach to estimating simultaneous equation relations that exhibit spatial dependence, but generalizations to the case of maximum likelihood and applications of these methods would add to the literature. Spatiotemporal and panel data models that capture spatial dependence may represent the area with the most room for future growth. Spatial panels can include issues related to interaction between observational units collected at different spatial and time scales. Applied work often relies on sample data information collected over different time and spatial scales, raising research issues related to interpolation and prediction across these.
NOTES

1. Cressie (1993) describes the difference between increasing domain and infill asymptotics. The latter corresponds to increasing the number of observations within a
fixed region. In the familiar time series context, increasing domain asymptotics would correspond to increasing the number of periods with frequency fixed. Infill asymptotics would correspond to holding the span of the data constant while increasing the frequency. 2. We will discuss ways of conducting weight matrix selection or comparison later. 3. The first 1,000 draws were excluded to allow the sampler to achieve a steady state.
ACKNOWLEDGMENTS Many of the contributions to this volume were presented at the second annual “Advances in Econometrics” conference held November 7–9, 2003, at the Lod and Carole Cook Conference Center on the campus of Louisiana State University in Baton Rouge, Louisiana. We would like to acknowledge conference support from LSU Real Estate Research Institute, LSU Department of Economics and the University of Toledo, Department of Economics. We would like to acknowledge Darren Hayunga for having helped greatly with the conference and this book, and Keith Richard for his assistance with this volume. In particular, we would like to thank Carter Hill and Tom Fomby for their help with this volume. A conference internet site at spatialstatistics.info contains links to computational software used to implement many of the methods described by the contributors to this volume. Kelley Pace would like to acknowledge support from the National Science Foundation, BCS-0136193 and James LeSage acknowledges support from the National Science Foundation, BCS-0136229.
REFERENCES

Akaike, H. (1974). A new look at statistical model identification. IEEE Transactions on Automatic Control, 19, 716–727.
Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669–679.
Amemiya, T. (1985). Advanced econometrics. Cambridge, MA: Harvard University Press.
Anselin, L. (1988). Spatial econometrics: Methods and models. Dordrecht: Kluwer.
Anselin, L. (2003). Spatial externalities, spatial multipliers and spatial econometrics. International Regional Science Review, 26(2).
Anselin, L., Florax, R. J. G. M., & Rey, S. J. (2004). Advances in spatial econometrics: Methodology, tools and applications. Berlin: Springer-Verlag.
Anselin, L., Varga, A., & Acs, Z. (1997). Local geographic spillovers between university research and high technology innovations. Journal of Urban Economics, 42, 422–448.
Baird, C. A. (1983). A multiregional econometric model of Ohio. Journal of Regional Science, 23, 501–515.
Ballard, K. P., & Glickman, N. J. (1977). A multiregional econometric forecasting system: A model for the Delaware Valley. Journal of Regional Science, 17, 161–177.
Ballard, K. P., Glickman, N. J., & Gustely, R. D. (1980). A bottom-up approach to multiregional modeling: NRIES. In: F. G. Adams & N. J. Glickman (Eds), Modeling the Multiregional Economic System. Lexington, MA: Lexington Books.
Banerjee, S., Carlin, B. P., & Gelfand, A. E. (2004). Hierarchical modeling and analysis for spatial data. New York: Chapman & Hall/CRC.
Barry, R., & Pace, R. K. (1999). A Monte Carlo estimator of the log determinant of large sparse matrices. Linear Algebra and its Applications, 289, 41–54.
Beron, K. J., & Vijverberg, W. P. M. (2004). Probit in a spatial context: A Monte Carlo analysis. In: L. Anselin, R. J. G. M. Florax & S. J. Rey (Eds), Advances in Spatial Econometrics: Methodology, Tools and Applications (forthcoming). Berlin: Springer-Verlag.
Brasington, D. M. (1999). Joint provision of public goods: The consolidation of school districts. Journal of Public Economics, 73(3), 373–393.
Brasington, D. M. (2003). Snobbery, racism, or mutual distaste: What promotes and hinders cooperation in local public good provision? Review of Economics and Statistics, 85, 874–883.
Brueckner, J. K. (2003). Strategic interaction among governments. International Regional Science Review, 26, 175–188.
Case, A., Rosen, H. S., & Hines, J. R. (1993). Budget spillovers and fiscal policy interdependence: Evidence from the states. Journal of Public Economics, 52, 285–307.
Chen, X., & Conley, T. G. (2001). A new semiparametric spatial model for panel time series. Journal of Econometrics, 105(1), 59–83.
Conley, T. G., & Ligon, E. A. (2002). Economic distance, spillovers, and cross country comparisons. Journal of Economic Growth, 7, 157–187.
Cressie, N. (1993). Statistics for spatial data (Rev. ed.). New York: Wiley.
DeGroot, M. H. (1982). Comment on 'Lindley's paradox' by G. Shafer. Journal of the American Statistical Association, 77, 336–339.
Elhorst, J. P. (2003). Specification and estimation of spatial panel data models. International Regional Science Review, 26, 244–268.
Flemming, M. (2004). Techniques for estimating spatially dependent discrete choice models. In: L. Anselin, R. J. G. M. Florax & S. J. Rey (Eds), Advances in Spatial Econometrics: Methodology, Tools and Applications (forthcoming). Berlin: Springer-Verlag.
Florax, R. J. G. M., & Folmer, H. (1992). Specification and estimation of spatial linear regression models: Monte Carlo evaluation of pre-test estimators. Regional Science and Urban Economics, 22, 405–432.
Florax, R. J. G. M., Folmer, H., & Rey, S. J. (2003). Specification searches in spatial econometrics: The relevance of Hendry's methodology. Regional Science and Urban Economics, 33(5), 557–579.
Garcia-Ferrer, A., Highfield, R. A., Palm, F., & Zellner, A. (1987). Macroeconomic forecasting using pooled international data. Journal of Business & Economic Statistics, 5, 53–68.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398–409.
Golan, A., Judge, G. G., & Miller, D. (1996). Maximum entropy econometrics: Robust information with limited data. New York: Wiley.
Golan, A., Judge, G. G., & Miller, D. (1997). The maximum entropy approach to estimation and inference: An overview. In: T. B. Fomby & R. C. Hill (Eds), Advances in Econometrics: Applying Maximum Entropy to Econometric Problems (Vol. 12, pp. 3–24). Greenwich, CT: JAI Press.
Golan, A., Judge, G. G., & Perloff, J. (1997). Estimation and inference with censored and ordered multinomial data. Journal of Econometrics, 79, 23–51.
Griffith, D., & Sone, A. (1995). Trade-offs associated with normalizing constant computational simplifications for estimating spatial statistical models. Journal of Statistical Computation and Simulation, 51, 165–183.
Hepple, L. W. (1995a). Bayesian techniques in spatial and network econometrics: 1. Model comparison and posterior odds. Environment and Planning A, 27, 447–469.
Hepple, L. W. (1995b). Bayesian techniques in spatial and network econometrics: 2. Computational methods and algorithms. Environment and Planning A, 27, 615–644.
Horn, R., & Johnson, C. (1993). Matrix analysis. New York: Cambridge University Press.
Kapoor, M. (2003). Panel data models with spatial correlation: Estimation theory and an empirical investigation of the US wholesale gasoline industry. Ph.D. Thesis, University of Maryland, College Park.
Kelejian, H., & Prucha, I. R. (1998). A generalized spatial two-stage least squares procedure for estimating a spatial autoregressive model with autoregressive disturbances. Journal of Real Estate Finance and Economics, 17(1), 99–121.
Kelejian, H., & Prucha, I. R. (1999). A generalized moments estimator for the autoregressive parameter in a spatial model. International Economic Review, 40, 509–533.
Kelejian, H., & Prucha, I. R. (2004). Estimation of simultaneous systems of spatially interrelated cross sectional equations. Journal of Econometrics, 118, 27–50.
Krivelyova, A., & LeSage, J. P. (1999). A spatial prior for Bayesian vector autoregressive models. Journal of Regional Science, 39(2), 297–317.
Lee, L. F. (2003). Best spatial two-stage least squares estimators for a spatial autoregressive model with autoregressive disturbances. Econometric Reviews, 22, 307–335.
Le Gallo, J., Ertur, C., & Baumont, C. (2003). A spatial econometric analysis of convergence across European regions, 1980–1995. In: B. Fingleton (Ed.), European Regional Growth (pp. 99–130). Springer-Verlag, Advances in Spatial Science.
LeSage, J. P. (1997). Bayesian estimation of spatial autoregressive models. International Regional Science Review, 20(1/2), 113–129.
LeSage, J. P. (2000). Bayesian estimation of limited dependent variable spatial autoregressive models. Geographical Analysis, 32(1), 19–35.
LeSage, J. P., & Pace, R. K. (2004). Using matrix exponentials to estimate spatial Probit/Tobit models. In: J. Mur, H. Zoller & A. Getis (Eds), Recent Advances in Spatial Econometrics (pp. 105–131). Palgrave Publishers.
LeSage, J. P., & Pan, Z. (1995). Using spatial contiguity as Bayesian prior information in regional forecasting models. International Regional Science Review, 18(1), 33–53.
Lindley, D. V. (1957). A statistical paradox. Biometrika, 45, 533–534.
López-Bazo, E., Vayá, E., Mora, A., & Suriñach, J. (1999). Regional economic dynamics and convergence in the European Union. Annals of Regional Science, 33, 343–370.
Martin, R. J. (1993). Approximations to the determinant term in Gaussian maximum likelihood estimation of some spatial models. Communications in Statistics – Theory and Methods, 22, 189–205.
McMillen, D. P. (1992). Probit with spatial autocorrelation. Journal of Regional Science, 32(3), 335–348.
McMillen, D. P. (1996). One hundred fifty years of land values in Chicago: A nonparametric approach. Journal of Urban Economics, 40, 100–124.
McMillen, D. P. (2003). Spatial autocorrelation or model misspecification? International Regional Science Review, 26, 208–217.
McMillen, D. P., & McDonald, J. F. (1997). A nonparametric analysis of employment density in a polycentric city. Journal of Regional Science, 37, 591–612.
McMillen, D. P., & McDonald, J. F. (1998). Locally weighted maximum likelihood estimation: Monte Carlo evidence and an application. Paper presented at the Regional Science Association International Meetings, Santa Fe, NM.
Milne, W. J., Glickman, N. J., & Adams, F. G. (1980). A framework for analyzing regional decline: A multiregional econometric model of the U.S. Journal of Regional Science, 20, 173–190.
Mur, J., Zoller, H., & Getis, A. (2004). Recent advances in spatial econometrics. New York: Palgrave.
Pace, R. K., & Barry, R. (1997). Quick computation of spatial autoregressive estimators. Geographical Analysis, 29, 232–246.
Pace, R. K., Barry, R., Gilley, O. W., & Sirmans, C. F. (2000). Method for spatial-temporal forecasting with an application to real estate prices. International Journal of Forecasting, 16, 229–246.
Pace, R. K., & LeSage, J. P. (2003). Likelihood dominance spatial inference. Geographical Analysis, 35, 133–147.
Pace, R. K., & LeSage, J. P. (2004a). Techniques for improved approximation of the determinant term in the spatial likelihood function. Computational Statistics and Data Analysis, 45, 179–196.
Pace, R. K., & LeSage, J. P. (2004b). Spatial autoregressive local estimation. In: J. Mur, H. Zoller & A. Getis (Eds), Recent Advances in Spatial Econometrics (pp. 105–131). Palgrave.
Pace, R. K., & Zou, Z. (2000). Closed-form maximum likelihood estimates of nearest neighbor spatial dependence. Geographical Analysis, 32, 154–172.
Pesaran, M. H., Schuermann, T., & Weiner, S. (2003). Modeling regional interdependencies using a global error-correcting macroeconometric model. Journal of Business & Economic Statistics (forthcoming).
Pfeifer, P. E., & Deutsch, S. J. (1980). A three-stage iterative procedure for space-time modeling. Technometrics, 22, 35–47.
Pinske, J., Slade, M. E., & Brett, C. (2002). Spatial price competition: A semiparametric approach. Econometrica, 70, 1111–1153.
Raftery, A. E., Madigan, D., & Hoeting, J. A. (1997). Bayesian model averaging for linear regression models. Journal of the American Statistical Association, 92, 179–191.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.
Smirnov, O. (2003a). Parallel method for maximum likelihood estimation of models with spatial dependence. Paper presented at the 50th Annual Meetings of the Regional Science Association International, Philadelphia, PA.
Smirnov, O. (2003b). Computation of the information matrix for spatial interaction models. Paper presented at the 9th International Conference of the Society for Computational Economics, Computing in Economics and Finance, Seattle, WA.
Smirnov, O., & Anselin, L. (2001). Fast maximum likelihood estimation of very large spatial autoregressive models: A characteristic polynomial approach. Computational Statistics and Data Analysis, 35, 301–319.
Stoffer, D. S. (1986). Estimation and identification of space-time ARMAX models in the presence of missing data. Journal of the American Statistical Association, 81, 762–772.
Sun, D., Tsutakawa, R. K., & Speckman, P. L. (1999). Posterior distribution of hierarchical models using CAR(1) distributions. Biometrika, 86, 341–350.
Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In: P. K. Goel & A. Zellner (Eds), Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti (pp. 233–243). Amsterdam: North-Holland/Elsevier.
Zellner, A., & Chanisk, H. (1991). Bayesian methods for forecasting turning points in economic time series: Sensitivity of forecasts to asymmetry of loss structures. In: K. Lahiri & G. Moore (Eds), Leading Economic Indicators: New Approaches and Forecasting Records (pp. 129–140). Cambridge: Cambridge University Press.
Zellner, A., Hong, C., & Gulati, G. M. (1990). Turning points in economic time series, loss structures and Bayesian forecasting. In: S. Geisser, J. Hodges, S. J. Press & A. Zellner (Eds), Bayesian and Likelihood Methods in Statistics and Econometrics: Essays in Honor of George A. Barnard (pp. 371–393). Amsterdam: North-Holland.
Zellner, A., Hong, C., & Min, C. (1991). Forecasting turning points in international output growth rates using Bayesian exponentially weighted autoregression, time-varying parameter and pooling techniques. Journal of Econometrics, 49, 275–304.
TESTING FOR LINEAR AND LOG-LINEAR MODELS AGAINST BOX-COX ALTERNATIVES WITH SPATIAL LAG DEPENDENCE

Badi H. Baltagi and Dong Li

ABSTRACT

Baltagi and Li (2001) derived Lagrangian multiplier tests to jointly test for functional form and spatial error correlation. This companion paper derives Lagrangian multiplier tests to jointly test for functional form and spatial lag dependence. In particular, this paper tests for linear or log-linear models with no spatial lag dependence against a more general Box-Cox model with spatial lag dependence. Conditional LM tests are also derived which test for (i) zero spatial lag dependence conditional on an unknown Box-Cox functional form, as well as (ii) linear or log-linear functional form given spatial lag dependence. In addition, modified Rao-Score tests are also derived that guard against local misspecification. The performance of these tests is investigated using Monte Carlo experiments.
1. INTRODUCTION

Baltagi and Li (2001) derived Lagrangian multiplier tests to jointly test for functional form and spatial error correlation. This companion paper derives
Lagrangian multiplier tests to jointly test for functional form and spatial lag dependence. Testing for spatial lag dependence assuming a specific functional form, usually a linear regression, has been studied extensively by Anselin (1988, 1999), Anselin et al. (1996), and Anselin and Bera (1998). However, none of these tests jointly test for spatial lag dependence and functional form. The LM tests derived in this paper are computationally simple, requiring least squares regressions on linear or log-linear models. They allow the researcher to test for a linear or log-linear model with no spatial lag dependence against a more general Box-Cox model with spatial lag dependence. Conditional LM tests are also derived which test for (i) zero spatial lag dependence conditional on an unknown Box-Cox functional form, as well as (ii) linear or log-linear functional form given spatial lag dependence. In addition, following Bera and Yoon (1993), we derive modified Rao-Score tests that guard against local misspecification. Testing for functional form is important especially when nonlinearity is suspected, as in a model of urbanization with nonlinear migration flows, see Ledent (1986); or when one is concerned with the nonlinear structure of hedonic housing price regressions, see Craig, Kohlhase and Papell (1991); or a nonlinear hedonic model of the price of farmland in Georgia, see Elad, Clifton and Epperson (1994). Similar concerns over nonlinearity in the spatial econometrics literature are evident in Upton and Fingleton (1985), Bailly et al. (1992), Griffith et al. (1998), Fik and Mulligan (1998) and Pace et al. (1999), to mention a few. The Box and Cox (1964) procedure has been used to choose among alternative functional forms, see Savin and White (1978), Seaks and Layson (1983), and Davidson and MacKinnon (1985), to mention a few. However, spatial lag dependence further complicates the estimation and testing of these models. Attempts at dealing with this problem vary from estimating the Box-Cox model by maximum likelihood methods ignoring the spatial lag dependence, see Upton and Fingleton (1985), to linearizing the Box-Cox transformation, see Bailly et al. (1992), and Griffith et al. (1998). Recently, Pace et al. (1999) developed a model which simultaneously performs spatial and functional form transformations. They applied it to housing data from Baton Rouge. Specifically, they use B-splines, which are piecewise polynomials with conditions enforced among the pieces. Relative to the Box-Cox transformation, the B-splines do not require strictly positive untransformed variables and can assume more complicated shapes. However, they require substantially more degrees of freedom. Misspecifying the functional form and/or ignoring the spatial lag dependence can result in misleading inference and raise questions about the reliability and precision of the resulting estimates. Section 2 derives the joint LM, conditional LM and modified Rao-Score tests for the Box-Cox spatial lag dependence model, while Section 3 illustrates these tests
using two empirical examples. Section 4 investigates the performance of these tests using Monte Carlo experiments. Section 5 gives our conclusion.
2. THE MODEL AND THE LM TESTS

Consider the following Box-Cox model with spatial lag dependence

$$y^{(r)} = \rho W y^{(r)} + X^{(r)} \beta + Z\gamma + u \qquad (1)$$

where

$$x^{(r)} = \begin{cases} \dfrac{x^{r} - 1}{r} & \text{if } r \neq 0 \\ \log(x) & \text{if } r = 0 \end{cases}, \qquad x = y \text{ or } X, \qquad (2)$$

is the familiar Box-Cox transformation, ρ is the spatial autoregressive coefficient, W is the matrix of known spatial weights and u ∼ N(0, σ²I); y^(r) is n × 1, X^(r) is n × K, Z is n × S, and β and γ are K × 1 and S × 1, respectively.¹ Both y_i and x_ik are subject to the Box-Cox transformation (2) and are required to take positive values only, while the Z_i variables are not subject to the Box-Cox transformation. The Z_i's may include dummy variables and the intercept. Note that for r = 1, Eq. (1) becomes a linear model, whereas for r = 0 it becomes a log-linear model.² Rewrite (1) as

$$(I - \rho W) y^{(r)} = X^{(r)} \beta + Z\gamma + u. \qquad (3)$$
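For concreteness, a minimal helper implementing the transformation in (2); at r = 1 it returns y − 1 (the linear case) and at r = 0 it returns log y (the log-linear case). The function name is ours.

```python
import numpy as np

def box_cox(x, r):
    """Box-Cox transformation in (2); x must be strictly positive."""
    x = np.asarray(x, dtype=float)
    if r == 0:
        return np.log(x)
    return (x ** r - 1.0) / r

y = np.array([1.0, 2.0, 4.0])
print(box_cox(y, 1))   # y - 1: the linear case
print(box_cox(y, 0))   # log(y): the log-linear case
```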
The loglikelihood function is given by

$$\log L = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) + \log|I - \rho W| + (r - 1)\sum_{i=1}^{n}\log(y_i) - \frac{1}{2\sigma^2}\left[(I - \rho W)y^{(r)} - X^{(r)}\beta - Z\gamma\right]'\left[(I - \rho W)y^{(r)} - X^{(r)}\beta - Z\gamma\right]. \qquad (4)$$

Note that $\log|I - \rho W| = \sum_{i=1}^{n}\log(1 - \rho\lambda_i)$, where the λ_i's are the eigenvalues of W, see Ord (1975) and Anselin (1988). We assume that all diagonal elements of the spatial weight matrix W are zero and that (I − ρW) is nonsingular. The first-order derivatives are given by

$$\frac{\partial \log L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}u'u \qquad (5)$$

$$\frac{\partial \log L}{\partial \beta} = \frac{1}{\sigma^2}\left[X^{(r)}\right]'u \qquad (6)$$

$$\frac{\partial \log L}{\partial \gamma} = \frac{1}{\sigma^2}Z'u \qquad (7)$$

$$\frac{\partial \log L}{\partial \rho} = -\sum_{i=1}^{n}\frac{\lambda_i}{1 - \rho\lambda_i} + \frac{1}{\sigma^2}u'Wy^{(r)} \qquad (8)$$

$$\frac{\partial \log L}{\partial r} = \sum_{i=1}^{n}\log(y_i) - \frac{1}{\sigma^2}u'\left[(I - \rho W)C(y, r) - C(X, r)\beta\right] \qquad (9)$$

where $C(y, r) = \partial y^{(r)}/\partial r = (1/r^2)(r y^r \log y - y^r + 1)$ and C(X, r) is similarly defined. The second-order derivatives of the loglikelihood function are given in Appendix A. Let θ = (σ², β′, γ′, ρ, r)′; then the gradient is given by G = ∂log L/∂θ, and the information matrix is given by I = E(−∂²log L/∂θ∂θ′). The LM test statistic is given by

$$\mathrm{LM} = \tilde{G}'\tilde{I}^{-1}\tilde{G} \qquad (10)$$
where G̃ and Ĩ denote the restricted gradient and information matrix evaluated under the null hypothesis, respectively. Following Efron and Hinkley (1978), we estimate the information matrix by the negative Hessian −H(θ̃). Davidson and MacKinnon (1983) demonstrated using Monte Carlo experiments that it is better to use Ĩ rather than H̃. See also Bera and McKenzie (1985). However, the information matrix is difficult to compute in this case. It is important to note that the large sample distributions of our test statistics are not formally established in the paper but are likely to hold under similar sets of low level assumptions developed recently by Kelejian and Prucha (2001) for the Moran I test statistic as well as its close cousins, the LM tests for spatial correlation.³ For an explicit account of these assumptions and a central limit theorem for linear quadratic forms that allows for heteroskedastic (possibly nonnormal) innovations as well as weights in the linear quadratic form that depend upon the sample size, see Kelejian and Prucha (2001).

2.1. Joint Tests

Under the null hypothesis H_0^a: ρ = 0 and r = 0, the model in (1) becomes a loglinear model with no spatial lag dependence

$$\log y_i = \sum_{k=1}^{K}\beta_k \log x_{ik} + \sum_{s=1}^{S}\gamma_s z_{is} + u_i, \qquad i = 1, \ldots, n \qquad (11)$$
with $y^{(0)} = \log y$, $C(y, 0) = \lim_{r\to 0} C(y, r) = \frac{1}{2}(\log y)^2$ and $\lim_{r\to 0} \partial C(y, r)/\partial r = \frac{1}{3}(\log y)^3$. The restricted OLS residuals from (11) are given by û = log y − (log X)β̂ − Zγ̂ with σ̂² = û′û/n. The gradient becomes

$$\frac{\partial \log L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}u'u \qquad (12)$$

$$\frac{\partial \log L}{\partial \beta} = \frac{1}{\sigma^2}(\log X)'u \qquad (13)$$

$$\frac{\partial \log L}{\partial \gamma} = \frac{1}{\sigma^2}Z'u \qquad (14)$$

$$\frac{\partial \log L}{\partial \rho} = -\sum_{i=1}^{n}\lambda_i + \frac{1}{\sigma^2}u'W\log y \qquad (15)$$

$$\frac{\partial \log L}{\partial r} = \sum_{i=1}^{n}\log(y_i) - \frac{1}{\sigma^2}u'\left[\frac{1}{2}(\log y)^2 - \frac{1}{2}(\log X)^2\beta\right] \qquad (16)$$
where the prime denotes the transpose of a matrix and log X is the logarithm of every element of X. The second order derivatives of the loglikelihood function are given in Appendix B.1.

Under the null hypothesis H_0^b: ρ = 0 and r = 1, the model in (1) becomes a linear model with no spatial lag dependence

$$y_i - 1 = \sum_{k=1}^{K}\beta_k (x_{ik} - 1) + \sum_{s=1}^{S}\gamma_s z_{is} + u_i, \qquad i = 1, \ldots, n \qquad (17)$$
with $y^{(1)} = y - 1$, $C(y, 1) = y\log y - y + 1$ and $\lim_{r\to 1} \partial C(y, r)/\partial r = y(\log y)^2 - 2y\log y + 2y - 2$. The restricted OLS residuals from (17) are given by ũ = (y − ι_n) − (X − J_nK)β̃ − Zγ̃ with σ̃² = ũ′ũ/n, where ι_n and J_nK are an n × 1 vector and an n × K matrix with all elements equal to 1, respectively. The gradient becomes

$$\frac{\partial \log L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}u'u \qquad (18)$$

$$\frac{\partial \log L}{\partial \beta} = \frac{1}{\sigma^2}(X - J_{nK})'u \qquad (19)$$

$$\frac{\partial \log L}{\partial \gamma} = \frac{1}{\sigma^2}Z'u \qquad (20)$$

$$\frac{\partial \log L}{\partial \rho} = -\sum_{i=1}^{n}\lambda_i + \frac{1}{\sigma^2}u'W(y - \iota_n) \qquad (21)$$

$$\frac{\partial \log L}{\partial r} = \sum_{i=1}^{n}\log(y_i) - \frac{1}{\sigma^2}u'\left[y\log y - y + \iota_n - (X\log X - X + J_{nK})\beta\right] \qquad (22)$$

and the second-order derivatives of the loglikelihood function are given in Appendix B.2.
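As a computational illustration, the following sketch evaluates the two nonzero elements of the restricted score, (21) and (22), from the OLS residuals of the linear model (17). It is a literal reading of the formulas above, not the authors' code; it assumes y and X are strictly positive and that W has zero diagonal elements.

```python
import numpy as np

def score_rho_r_linear(y, X, Z, W):
    """Scores (21) and (22) at the restricted OLS estimates under H0^b."""
    n = len(y)
    ones = np.ones(n)
    R = np.column_stack([X - 1.0, Z])            # regressors of model (17)
    coef, *_ = np.linalg.lstsq(R, y - ones, rcond=None)
    u = (y - ones) - R @ coef                    # restricted OLS residuals
    beta = coef[:X.shape[1]]
    sigma2 = u @ u / n
    lam = np.linalg.eigvals(W).real              # eigenvalues lambda_i of W
    d_rho = -lam.sum() + u @ W @ (y - ones) / sigma2             # Eq. (21)
    d_r = (np.log(y).sum()
           - u @ ((y * np.log(y) - y + ones)
                  - (X * np.log(X) - X + 1.0) @ beta) / sigma2)  # Eq. (22)
    return d_rho, d_r
```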
2.2. Conditional Tests

Joint tests are often criticized because they do not point out the "right" model we should adopt when the null hypothesis is rejected. In this section we consider conditional LM tests. These tests account for the possible presence of spatial lag dependence when testing for functional form, or the possible misspecification of the functional form when testing for spatial lag dependence.

2.2.1. LM Tests for Spatial Lag Dependence Conditional on a General Box-Cox Model

Under the null hypothesis H_0^g: ρ = 0 | unknown r, the model in (1) becomes a general Box-Cox model with no spatial lag dependence. The gradient becomes

$$\frac{\partial \log L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}u'u = 0 \qquad (23)$$

$$\frac{\partial \log L}{\partial \beta} = \frac{1}{\sigma^2}\left[X^{(r)}\right]'u = 0 \qquad (24)$$

$$\frac{\partial \log L}{\partial \gamma} = \frac{1}{\sigma^2}Z'u = 0 \qquad (25)$$

$$\frac{\partial \log L}{\partial \rho} = -\sum_{i=1}^{n}\lambda_i + \frac{1}{\sigma^2}u'Wy^{(r)} \qquad (26)$$

$$\frac{\partial \log L}{\partial r} = \sum_{i=1}^{n}\log(y_i) - \frac{1}{\sigma^2}u'\left[C(y, r) - C(X, r)\beta\right] \qquad (27)$$
The second order derivatives of the loglikelihood function are given in Appendix C.1.

2.2.2. LM Tests for Functional Form Conditional on Spatial Lag Dependence

Next we consider tests for functional form, linear or loglinear against a general Box-Cox transformation, conditional on the presence of spatial lag dependence.

2.2.2.1. Loglinear with spatial lag dependence. Under the null hypothesis H_0^h: r = 0 | unknown ρ, the model in (1) becomes a loglinear model with spatial lag dependence. Note that u = (I − ρW) log y − (log X)β − Zγ. The gradient is

$$\frac{\partial \log L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}u'u = 0 \qquad (28)$$

$$\frac{\partial \log L}{\partial \beta} = \frac{1}{\sigma^2}(\log X)'u = 0 \qquad (29)$$

$$\frac{\partial \log L}{\partial \gamma} = \frac{1}{\sigma^2}Z'u = 0 \qquad (30)$$

$$\frac{\partial \log L}{\partial \rho} = -\sum_{i=1}^{n}\frac{\lambda_i}{1 - \rho\lambda_i} + \frac{1}{\sigma^2}u'W\log y \qquad (31)$$

$$\frac{\partial \log L}{\partial r} = \sum_{i=1}^{n}\log(y_i) - \frac{1}{\sigma^2}u'\left[(I - \rho W)\frac{1}{2}(\log y)^2 - \frac{1}{2}(\log X)^2\beta\right] \qquad (32)$$
The second order derivatives of the loglikelihood function are given in Appendix C.2.

2.2.2.2. Linear with spatial lag dependence. Under the null hypothesis H_0^i: r = 1 | unknown ρ, the model in (1) becomes a linear model with spatial lag dependence. Note again that u = (I − ρW)(y − ι_n) − (X − J_nK)β − Zγ. The gradient is

$$\frac{\partial \log L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}u'u \qquad (33)$$

$$\frac{\partial \log L}{\partial \beta} = \frac{1}{\sigma^2}(X - J_{nK})'u \qquad (34)$$

$$\frac{\partial \log L}{\partial \gamma} = \frac{1}{\sigma^2}Z'u \qquad (35)$$

$$\frac{\partial \log L}{\partial \rho} = -\sum_{i=1}^{n}\frac{\lambda_i}{1 - \rho\lambda_i} + \frac{1}{\sigma^2}u'W(y - \iota_n) \qquad (36)$$

$$\frac{\partial \log L}{\partial r} = \sum_{i=1}^{n}\log(y_i) - \frac{1}{\sigma^2}u'\left[(I - \rho W)(y\log y - y + \iota_n) - (X\log X - X + J_{nK})\beta\right] \qquad (37)$$

The second order derivatives of the loglikelihood function are given in Appendix C.3.
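The eigenvalue terms recur throughout: log|I − ρW| = Σ log(1 − ρλ_i) enters the likelihood, and −Σ λ_i/(1 − ρλ_i) is its derivative with respect to ρ, as in (8), (26), (31) and (36). A small numerical check, assuming the eigenvalues of W are real (as for a symmetric weight matrix):

```python
import numpy as np

def logdet_and_rho_score(W, rho):
    """log|I - rho*W| via eigenvalues, and its derivative with respect to rho."""
    lam = np.linalg.eigvals(W).real
    logdet = np.sum(np.log(1.0 - rho * lam))
    d_logdet = -np.sum(lam / (1.0 - rho * lam))
    return logdet, d_logdet

# Compare against a direct determinant computation on a small symmetric W
W = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
ld, _ = logdet_and_rho_score(W, 0.4)
print(np.isclose(ld, np.log(np.linalg.det(np.eye(3) - 0.4 * W))))  # True
```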
2.3. Local Misspecification Robust Tests

Conditional LM tests do not ignore the possibility that r is unknown when testing for ρ = 0, whereas simple LM tests for ρ = 0 implicitly assume that r = 0 or 1, and this may lead to misleading inference. However, conditional LM tests are computationally more involved than the corresponding simple LM tests. The latter are usually based on least squares residuals. Bera and Yoon (1993) showed that, under local misspecification, the simple LM test asymptotically converges to a noncentral chi-square distribution. They suggest a modified Rao-Score (RS) test which is robust to local misspecification. This modified RS test retains the computational simplicity of the simple LM test in that it is based on the same restricted MLE (usually OLS). However, it is more robust than the simple LM test because it guards against local misspecification. The idea is to adjust the one-directional score test by accounting for its non-centrality parameter. Bera and Yoon (1993) and Bera et al. (2001) showed using Monte Carlo experiments that these modified RS tests have good finite sample properties and are capable of detecting the right direction of departure from the null hypothesis. For our purposes, we consider four hypotheses:

H_0^c: ρ = 0 assuming r = 0 (no spatial lag dependence assuming loglinearity).
H_0^d: ρ = 0 assuming r = 1 (no spatial lag dependence assuming linearity).
H_0^e: r = 0 assuming ρ = 0 (loglinearity assuming no spatial lag dependence).
H_0^f: r = 1 assuming ρ = 0 (linearity assuming no spatial lag dependence).
Let φ′ = (σ², β′, γ′) so that θ′ = (σ², β′, γ′, ρ, r) = (φ′, ρ, r), and partition the gradient and the information matrix such that

$$d(\theta) = \frac{\partial \log L(\theta)}{\partial \theta} = \begin{pmatrix} \partial \log L(\theta)/\partial \phi \\ \partial \log L(\theta)/\partial \rho \\ \partial \log L(\theta)/\partial r \end{pmatrix} \qquad (38)$$

and

$$J(\theta) = -\frac{1}{n}\frac{\partial^2 \log L(\theta)}{\partial \theta \partial \theta'} = \begin{pmatrix} J_{\phi\phi} & J_{\phi\rho} & J_{\phi r} \\ J_{\rho\phi} & J_{\rho\rho} & J_{\rho r} \\ J_{r\phi} & J_{r\rho} & J_{rr} \end{pmatrix} \qquad (39)$$

To be more specific, consider the null hypothesis H_0^c: ρ = 0 assuming r = 0. The general model is represented by the loglikelihood function log L(φ′, ρ, r). For the null hypothesis H_0^c, the investigator sets r = 0 and tests ρ = 0 using the loglikelihood function log L₁(φ′, ρ) = log L(φ′, ρ, 0). The standard Rao-Score statistic based on log L₁(φ′, ρ) is denoted by RS_ρ. Let φ̃ be the maximum likelihood estimator of φ when ρ = 0 and r = 0. If log L₁(φ′, ρ) were the true loglikelihood, then under the null hypothesis that ρ = 0 the test statistic RS_ρ is asymptotically distributed as χ²₁(0) and the test will be locally optimal.⁴ Now suppose that the true loglikelihood function is log L₂(φ′, r) = log L(φ′, 0, r), so that the alternative log L₁(φ′, ρ) is misspecified. Using a sequence of local values r = δ/√n, the asymptotic distribution of RS_ρ under log L₂(φ′, r) is χ²₁(c₁), where c₁ is the noncentrality parameter, see Bera and Yoon (1993). Due to the presence of this noncentrality parameter, RS_ρ will over-reject the null hypothesis even when ρ = 0. Therefore, the test will have an incorrect size. In light of this non-centrality parameter, Bera and Yoon (1993) suggested a modification to RS_ρ so that the resulting test statistic is robust to the presence of r. The new test essentially adjusts the asymptotic mean and variance of the standard RS_ρ. Following Eq. (6) in Bera et al. (2001), we derive this modified RS test as follows

$$RS^*_\rho = \frac{1}{n}\left[d_\rho(\tilde{\theta}) - J_{\rho r\cdot\phi}(\tilde{\theta})J_{rr\cdot\phi}^{-1}(\tilde{\theta})d_r(\tilde{\theta})\right]'\left[J_{\rho\rho\cdot\phi}(\tilde{\theta}) - J_{\rho r\cdot\phi}(\tilde{\theta})J_{rr\cdot\phi}^{-1}(\tilde{\theta})J_{r\rho\cdot\phi}(\tilde{\theta})\right]^{-1}\left[d_\rho(\tilde{\theta}) - J_{\rho r\cdot\phi}(\tilde{\theta})J_{rr\cdot\phi}^{-1}(\tilde{\theta})d_r(\tilde{\theta})\right] \qquad (40)$$

where d_ρ(θ̃) is the gradient for ρ evaluated at the restricted MLE, J_ρρ·φ ≡ J_ρρ·φ(θ̃) = J_ρρ − J_ρφ J_φφ⁻¹ J_φρ and J_rr·φ is similarly defined. Also, J_ρr·φ = J_ρr − J_ρφ J_φφ⁻¹ J_φr and J_rρ·φ is similarly defined. All the above quantities are estimated under the null
hypothesis H_0^c: ρ = 0 assuming r = 0. The null hypothesis H_0^d: ρ = 0 assuming r = 1 can be handled similarly with r = 1 rather than 0. For the null hypotheses H_0^e and H_0^f, the modified RS test statistic is given by

$$RS^*_r = \frac{1}{n}\left[d_r(\tilde{\theta}) - J_{r\rho\cdot\phi}(\tilde{\theta})J_{\rho\rho\cdot\phi}^{-1}(\tilde{\theta})d_\rho(\tilde{\theta})\right]'\left[J_{rr\cdot\phi}(\tilde{\theta}) - J_{r\rho\cdot\phi}(\tilde{\theta})J_{\rho\rho\cdot\phi}^{-1}(\tilde{\theta})J_{\rho r\cdot\phi}(\tilde{\theta})\right]^{-1}\left[d_r(\tilde{\theta}) - J_{r\rho\cdot\phi}(\tilde{\theta})J_{\rho\rho\cdot\phi}^{-1}(\tilde{\theta})d_\rho(\tilde{\theta})\right] \qquad (41)$$

where d_r(θ̃), J_rρ·φ(θ̃), and J_ρρ·φ(θ̃) are computed as described below (40) under the respective null hypothesis. In the Monte Carlo experiment in Section 4, we compute the simple LM test and the corresponding Bera-Yoon modified LM test for each hypothesis considered. The simple RS tests for no spatial lag dependence under linearity or loglinearity, i.e. H_0^c and H_0^d, are given in Anselin (1988) and Anselin and Bera (1998), while the simple RS tests for functional form assuming no spatial lag dependence, i.e. H_0^e and H_0^f, are given in Davidson and MacKinnon (1985). Note that it is not possible to robustify tests in the presence of global misspecification (i.e. ρ and r taking values far from their values under the null), see Anselin and Bera (1998). Also, it is important to note that the modified RS tests satisfy the following decomposition

$$RS_{\rho r} = RS^*_\rho + RS_r = RS_\rho + RS^*_r \qquad (42)$$

i.e. the joint test can be decomposed into the sum of the modified RS test of one type of alternative and the simple RS test for the other, see Bera and Yoon (1993).
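Given the score d(θ̃) and the matrix J(θ̃) of (39), the adjustment in (40) is a few lines of linear algebra once the φ, ρ and r blocks are identified. A sketch with ρ and r as the last two parameters; the ordering convention and names are ours:

```python
import numpy as np

def modified_rs_rho(d, J, n):
    """Bera-Yoon adjusted statistic RS*_rho of Eq. (40).

    d : score vector at the restricted MLE, ordered (phi, rho, r)
    J : minus (1/n) times the Hessian, same ordering, as in (39)
    """
    p = len(d) - 2                        # number of nuisance parameters phi
    phi, rho, r = slice(0, p), p, p + 1
    Jphi_inv = np.linalg.inv(J[phi, phi])
    # J_{ab.phi} = J_ab - J_{a,phi} J_{phi,phi}^{-1} J_{phi,b}
    part = lambda a, b: J[a, b] - J[a, phi] @ Jphi_inv @ J[phi, b]
    adj_d = d[rho] - part(rho, r) / part(r, r) * d[r]
    adj_v = part(rho, rho) - part(rho, r) / part(r, r) * part(r, rho)
    return adj_d ** 2 / (n * adj_v)
```

Swapping the roles of ρ and r in the same function gives RS*_r of Eq. (41).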
3. EMPIRICAL EXAMPLES

3.1. Crime Data

Anselin (1988) considered a simple relationship between crime and housing values and income in 1980 for 49 neighborhoods in Columbus, Ohio. The data are listed in Table 12.1, p. 189 of Anselin (1988). Crime is measured as per capita residential burglaries and vehicle thefts, and housing values and income are measured in thousands of dollars. The OLS regression gives

Crime = 68.619 − 1.597 Housing − 0.274 Income
       (4.735)  (0.334)          (0.103)

where the standard errors are given in parentheses.
Table 1. Results for Anselin's Crime Data.

                                                   Statistic   p-Value
Joint LM tests
  H_0^a: ρ = 0 and r = 0                           85.411      0.000
  H_0^b: ρ = 0 and r = 1                           13.952      0.001
Rao score tests and their modified forms
  H_0^c: RS_{ρ=0} assuming r = 0                   1.940       0.164
  H_0^c: RS*_{ρ=0} assuming r = 0                  0.005       0.943
  H_0^d: RS_{ρ=0} assuming r = 1                   13.914      0.000
  H_0^d: RS*_{ρ=0} assuming r = 1                  11.139      0.000
  H_0^e: RS_{r=0} assuming ρ = 0                   85.406      0.000
  H_0^e: RS*_{r=0} assuming ρ = 0                  83.471      0.000
  H_0^f: RS_{r=1} assuming ρ = 0                   2.813       0.093
  H_0^f: RS*_{r=1} assuming ρ = 0                  0.038       0.845
Conditional LM tests
  H_0^g: ρ = 0 | unknown r                         10.164      0.001
  H_0^h: r = 0 | unknown ρ                         74.537      0.000
  H_0^i: r = 1 | unknown ρ                         0.123       0.725
Table 2. Results for the Harrison and Rubinfeld Data.

                                                   Statistic   p-Value
Joint LM tests
  H_0^a: ρ = 0 and r = 0                           414.812     0.000
  H_0^b: ρ = 0 and r = 1                           311.208     0.000
Rao score tests and their modified forms
  H_0^c: RS_{ρ=0} assuming r = 0                   357.979     0.000
  H_0^c: RS*_{ρ=0} assuming r = 0                  325.985     0.000
  H_0^d: RS_{ρ=0} assuming r = 1                   93.063      0.000
  H_0^d: RS*_{ρ=0} assuming r = 1                  191.799     0.000
  H_0^e: RS_{r=0} assuming ρ = 0                   88.827      0.000
  H_0^e: RS*_{r=0} assuming ρ = 0                  56.833      0.000
  H_0^f: RS_{r=1} assuming ρ = 0                   119.409     0.000
  H_0^f: RS*_{r=1} assuming ρ = 0                  218.145     0.000
Conditional LM tests
  H_0^g: ρ = 0 | unknown r                         202.811     0.000
  H_0^h: r = 0 | unknown ρ                         30.716      0.000
  H_0^i: r = 1 | unknown ρ                         196.004     0.000
Fig. 1. Joint Test H_0^a: r = 0 and ρ = 0.
We apply the tests proposed in the last section to the crime data. The dependent and independent variables Crime, Housing and Income are subject to the Box-Cox transformation, while the constant term is not. The results are reported in Table 1. The joint LM test statistic for H_0^a: ρ = 0 and r = 0 is 85.41. This is distributed as χ²₂ under H_0^a and is significant. The LM test statistic for H_0^b: ρ = 0 and r = 1 is 13.95. This is distributed as χ²₂ under H_0^b and has a p-value of 0.001. Both the linear and loglinear models without spatial lag dependence are rejected when the alternative is a general Box-Cox model with spatial lag dependence. Assuming a loglinear model, one does not reject the absence of spatial lag dependence. However, assuming a linear model, one rejects the absence of spatial lag dependence. Note that in a linear model the Rao-Score test statistic based on the negative Hessian (H̃) yields a value of 13.914. This compares to a value of 9.364 reported by Anselin (1988) using the information matrix (Ĩ). In a loglinear model the Rao-Score test statistic based on the negative Hessian (H̃) yields a value of 1.940, compared to a value of 1.784 using the information matrix (Ĩ). This highlights the different numerical values obtained for the score test based on different estimates of the information matrix in a finite sample.
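The joint statistics are referred to the χ²₂ distribution and the one-restriction statistics to χ²₁, so p-values such as those in Table 1 can be recovered directly; a quick check using SciPy:

```python
from scipy.stats import chi2

print(chi2.sf(85.411, df=2))    # joint test H0^a: p close to 0.000
print(chi2.sf(13.952, df=2))    # joint test H0^b: p about 0.001
print(chi2.sf(13.914, df=1))    # RS test under H0^d: p close to 0.000
```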
Fig. 2. Joint Test H_0^b: r = 1 and ρ = 0.
Allowing for local misspecification using the corresponding Bera and Yoon (1993) adjusted LM statistics does not change the outcome of these tests. In addition, if one assumes no spatial lag dependence, one rejects loglinearity but not linearity of the model at the 5% level. Again, neither outcome is changed by allowing for local misspecification using the Bera and Yoon (1993) adjustment. Conditional on a general Box-Cox model, the hypothesis of no spatial lag dependence is rejected with a p-value of 0.001. Conditional on spatial lag dependence, the loglinear model is rejected with a p-value of 0.000 and the linear model is not rejected with a p-value of 0.725. Baltagi and Li (2001) obtained similar test results on the Anselin (1988) crime data example when testing linear or log-linear models with no spatial error correlation against a more general Box-Cox model with spatial error correlation. Combining these results, one cannot reject the linear model with spatial lag dependence or spatial error correlation. It seems that Anselin (1988) was justified
Fig. 3. Simple RS Test H_0^c: ρ = 0 Assuming r = 0.
in assuming a linear regression for the crime data when testing for spatial lag dependence or spatial error correlation.
3.2. Harrison and Rubinfeld Data

Since the previous example on crime data had only 49 observations, we thought it would be useful to see how these tests perform for a larger data set. Harrison and Rubinfeld (1978) examined the demand for clean air using housing data. The data consist of 506 observations, with one observation per census tract, from the Boston Standard Metropolitan Statistical Area. The dependent variable is the median value of owner-occupied homes (Price). The independent variables include crime rate (Crim), proportion of area zoned with large lots (Zone), proportion of non-retail business area (Indus), location contiguous to the Charles River (Chas), levels of nitrogen oxides concentration squared (Nox2), average number of rooms squared (Rm2), proportion of structures built before 1940 (Age), weighted distances
Fig. 4. Bera-Yoon RS* Test H_0^c: ρ = 0 Assuming r = 0.
to the employment centers (Dis), an index of accessibility to radial highways (Rad), property tax rate (Tax), pupil-teacher ratio (Ptratio), black population proportion (Black), and lower status population proportion (Lstat). We use the corrected data set as provided by Gilley and Pace (1996), who pointed out that there were incorrectly coded observations in the original Harrison and Rubinfeld data. For illustrative purposes, and in the interest of saving space, we only report the LM tests performed allowing for Price and Lstat to be subject to the Box-Cox transformation. The results are reported in Table 2. Both the linear and loglinear models without spatial lag dependence are rejected when the alternative is the Box-Cox model with spatial lag dependence. Assuming a loglinear or linear model, one rejects the absence of spatial lag dependence. Note that in a loglinear model the Rao-Score test statistic based on the negative Hessian (H̃) yields a value of 357.98. This compares to a value of 184.05 using the information matrix (Ĩ). In a linear model the Rao-Score test statistic based on the negative Hessian (H̃) yields a value of 93.06, compared to a value of 75.05 using the information matrix (Ĩ).
Fig. 5. Simple RS Test H_0^d: ρ = 0 Assuming r = 1.
Although the sample is now much bigger, 506 rather than 49 observations, the differences in the LM test statistics based on the Hessian versus the information matrix are still quite large and cannot be attributed to small samples. Note that, despite the difference in the magnitudes of these statistics, there was no conflict in the decisions reached using either statistic in both empirical examples. Allowing for local misspecification using the corresponding Bera and Yoon (1993) adjusted LM statistics does not change the outcome of these tests. Conditional on the Box-Cox model, the hypothesis of no spatial lag dependence is rejected. Conditional on spatial lag dependence, both the loglinear and linear models are rejected. For the hedonic housing example, it looks like both the linear and loglinear models are rejected even after accounting for spatial lag dependence. See Pace and Gilley (1997) for a more detailed spatial analysis of this data. For both empirical examples, one does not know the true model. In the next section, Monte Carlo experiments are performed, where we know the true model and we can report the empirical size and power performance of these tests.
Fig. 6. Bera-Yoon RS* Test H_0^d: ρ = 0 Assuming r = 1.
4. MONTE CARLO RESULTS

The experimental design used in the Monte Carlo simulations follows those extensively used in other spatial studies (e.g. Anselin et al., 1996; Anselin & Rey, 1991; Florax & Folmer, 1992). The model considered is given by

$$y^{(r)} = \rho W y^{(r)} + X^{(r)}\beta + Z\gamma + u. \qquad (43)$$
We use the spatial weight matrix from the crime data in Anselin (1988). The number of observations is n = 49. The explanatory variables X, an n × 2 matrix, are generated from a uniform (0, 10) distribution and the coefficients β are set to 1. The Z variable consists of a constant term and γ is set equal to 4. The error term u is generated from a standard normal distribution. In addition to a normal error, a Student-t error term is generated as well, with mean and variance equal to those of the normal variates. The tests are evaluated at their asymptotic critical value for α = 0.05 and the power is reported. The three conditional tests in this paper involve numerical maximum likelihood estimation. These are computationally
52
BADI H. BALTAGI AND DONG LI
Fig. 7. Simple RS Test He0 : r = 0 Assuming = 0.
more expensive compared to the unconditional or Bera-Yoon type LM tests. For each combination of parameter values, 1,000 replications were carried out. Figure 1 plots the frequency of rejections in 1,000 replications using the joint LM statistic for no spatial lag dependence and log-linearity, i.e. Ha0 : = 0 and r = 0. This test tends to over-reject with size equal to 16.1% rather than 5%. The power of the test increases as or r departs from zero. In fact, if the true model is linear, the frequency of rejections is almost a 100%. Figure 2 gives the frequency of rejections in 1,000 replications using the joint LM statistic for no spatial lag dependence and linearity, i.e. Hb0 : = 0 and r = 1. This test tends to over-reject with size equal to 14.5% rather than 5%. The power of the test increases as departs from zero or r departs from 1. In fact, if the true model is log-linear, the frequency of rejections is 100%. Figure 3 gives the frequency of rejections in 1,000 replications using the simple Rao-Score statistic for no spatial lag dependence assuming a log-linear model, i.e. Hc0 : = 0 assuming that r = 0. This is the standard LM test statistic for no spatial lag dependence given by Anselin (1988) and Anselin and Bera (1998). The size of the test is equal to 8.5% rather than 5%. For r = 0, the power of this test increases as departs from zero. However, this test is sensitive
Testing for Linear and Log-Linear Models
53
Fig. 8. Bera-Yoon RS∗ Test He0 : r = 0 Assuming = 0.
to departures of r from 0. In fact, if the true model is linear, we reject that = 0 only 28.8% of the time when true is equal to 0.6. This rejection frequency is only 0.3% when true is equal to −0.6. Figure 4 gives the frequency of rejections in 1,000 replications of the Bera and Yoon (1993) adjusted Rao-Score statistic for Hc0 . This Bera-Yoon adjustment helps increase the power of the test as clear from comparing Fig. 3 to Fig. 4. However, the size of this test is 11.4% and is sensitive to departures from local misspecification. In fact, for r = 1, this test rejects the null when true in 32.3% of the cases. Figure 5 gives the Rao-Score frequency of rejections for no spatial lag dependence assuming a linear model, i.e. Hd0 : = 0 assuming that r = 1. The size of the test is equal to 7.1% rather than 5%. For r = 1, the power of this test increases as departs from zero. However, this test is sensitive to departures of r from 1. In fact, if the true model is log-linear, we reject that = 0 only 19.0% of the time when true is equal to 0.6 and 5.6% of the time when true is equal to −0.6. Figure 6 gives the corresponding Bera and Yoon (1993) adjusted Rao-Score statistic for Hd0 . This adjustment helps increase the power of the test as clear from comparing Fig. 5 to Fig. 6. However, the size of this test is 10.7% and is sensitive to departures from local misspecification. In fact,
54
BADI H. BALTAGI AND DONG LI
Fig. 9. Simple RS Test Hf0 : r = 1 Assuming = 0.
for r = 0, this test rejects the null when true in 45.7% of the cases. Figure 7gives the Rao-Score frequency of rejections for log-linearity assuming no spatial lag dependence, i.e. He0 : r = 0 assuming that = 0. The size of the test is equal to 12.1% rather than 5% and tends to over-reject the null when in fact it is true. For = 0, the power of this test increases as r departs from zero. This test is not very sensitive to departures of from 0. Figure 8 gives the corresponding Bera and Yoon (1993) adjusted Rao-Score statistic for He0 . This adjustment helps increase the power of the test as clear from comparing Fig. 7 to Fig. 8. However, the size of this test is 13.5% and is sensitive to departures from local misspecification. In fact, for = 0.6, this test rejects the null when true in 17.2% of the cases. This rejection frequency is 13.8% when true is equal to −0.6. Figure 9 gives the Rao-Score frequency of rejections for linearity assuming no spatial lag dependence, i.e. Hf0 : r = 1 assuming that = 0. The size of the test is equal to 11.0% rather than 5% and tends to over-reject the null when in fact it is true. For = 0, the power of this test increases as r departs from one. This test is not very sensitive to departures of from 0. Figure 10 gives the corresponding Bera and
Testing for Linear and Log-Linear Models
55
Fig. 10. Bera-Yoon RS∗ Test Hf0 : r = 1 Assuming = 0.
Yoon (1993) adjusted Rao-Score statistic for Hf0 . This adjustment helps increase the power of the test as clear from comparing Fig. 9 to Fig. 10. However, the size of this test is 12.7% and is sensitive to departures from local misspecification. In fact, for = 0.6, this test rejects the null when true in 14.8% of the cases. This rejection frequency is 8.7% when true is equal to −0.6. Figure 11 gives the conditional LM frequency of rejections for no spatial lag dependence assuming g a general Box-Cox model, i.e. H0 : = 0 assuming an unknown r. This test tends to over-reject the null when true. This over-rejection depends on the value of r. The power of this test increases as departs from zero. Figure 12 gives the conditional LM frequency of rejections for log-linearity assuming the presence of spatial lag dependence, i.e. Hh0 : r = 0 assuming an unknown . The size of the test varies between 5.2 and 7.9% depending on the value of . The power of this test increases as r departs from zero. Figure 13 gives the conditional LM frequency of rejections for linearity assuming the presence of spatial lag dependence, i.e. Hi0 : r = 1 assuming an unknown . The size of the test varies between 7.1 and 10.0% depending on the value of . The power of this test increases as r departs from one.
56
BADI H. BALTAGI AND DONG LI
g
Fig. 11. Conditional Test H0 : = 0|unknown r.
In order to check the sensitivity of our results to the normality assumption, we generated the disturbances using the student t distribution with 3 degrees of freedom with mean and variance equal to that of the normal variates. The graphs for the t distribution look much like those for the normal distribution in shape but with varying size and empirical power. These figures are available upon request from the authors. Here we report the effects on size and power. For example, for Ha0 : = 0 and r = 0, the size of the test is 31.1% rather than 5% when the tdistribution is used. This compares to 16.1% for the normal distribution. For Hb0 : = 0 and r = 1, the size of the test is 9% under the t-distribution compared to 14.5% under the normal distribution. For Hc0 : = 0 assuming that r = 0, the size of the test is 7.4% under the t-distribution compared to 8.5% under the normal distribution. However, this test is sensitive to departures of r from 0. In fact, if the true model is linear, we reject that = 0 only 12.8% of the time when true is equal to 0.6. This rejection frequency is only 0.4% when true is equal to −0.6. These rejection frequencies are compared to 28.8 and 0.3% when the disturbances are normal. For Hc0 , the adjusted Rao-Score test has a size of 11.3% under the
Testing for Linear and Log-Linear Models
Fig. 12. Conditional Test Hh0 : r = 0|unknown .
Fig. 13. Conditional Test Hi0 : r = 1|unknown .
57
58
BADI H. BALTAGI AND DONG LI
t-distribution as compared to 11.4% under normality. When r = 1, we reject the null when true in 12.1% of the cases under the t-distribution as compared to 32.3% when the disturbances are normal. For Hd0 : = 0 assuming that r = 1, the size of the test is 6.7% rather than 5%. However, this test is sensitive to departures of r from 1. In fact, if the true model is log-linear, we reject that = 0 only 16.2% of the time when true is equal to 0.6 and 5.2% of the time when true is equal to −0.6. This compares to 19 and 5.6% under normality. The adjusted Rao-Score statistic for Hd0 has a size of 7.9% and compares to 10.7% under normality. For r = 0, this test rejects the null when true in 46% of the cases, which is close to the 45.7% under normality. For He0 : r = 0 assuming that = 0, the size of the test is 32.1% under the t-distribution compared to 12.1% under the normal distribution. The adjusted Rao-Score statistic for He0 has a size of 32.6% which compares to 13.5% under the normal distribution. For Hf0 : r = 1 assuming that = 0, the size of the test is 9% under the t-distribution which compares to 11% under normality. The adjusted Rao-Score statistic for Hf0 has a size of 10.3% under the t-distribution g which compares to 12.7% under normality. For H0 : = 0 assuming an unknown r, the size of the test varies between 7.6 and 51.7% depending on the value of r. For Hh0 : r = 0 assuming an unknown , the size of the test varies between 25.8 and 33.2% depending on the value of . This is a deterioration in the performance of the test whose size varied between 5.2 and 7.9% under normality. For Hi0 : r = 1 assuming an unknown , the size of the test varies between 12.4 and 15.7% depending on the value of . Again this is a deterioration in the performance of the test whose size varied between 7.1 and 10.0% under normality, depending on the value of .
5. CONCLUSION This paper derived joint, conditional and modified Rao-Score tests for functional form and spatial lag dependence and illustrated these tests using two empirical examples. In addition, the power performance of these tests was compared using Monte Carlo experiments. In a companion paper, Baltagi and Li (2001) derived similar tests for functional form and spatial error correlation. Both papers indicate the following: (i) Choosing the wrong functional form could lead to misleading inference regarding the presence or absence of spatial lag dependence (spatial error correlation). (ii) Ignoring spatial lag dependence (or spatial error correlation) when present could also lead to the wrong choice of functional form. Our experiments show that the power was more sensitive to functional form misspecification than misspecification of spatial lag dependence (or spatial error correlation). (iii) Modified Rao-Score tests guard against local
Testing for Linear and Log-Linear Models
59
misspecification, but their power deteriorates for large departures from the null hypothesis. Our hope is that practitioners will use these simple LM tests as guiding diagnostic tools. Using these tests will help the researcher in their search for the proper functional form while simultaneously indicating the presence or absence of spatial lag dependence (spatial error correlation). Rather than assuming a linear or loglinear spatial lag dependence (or spatial error correlation) model, these LM tests allow a formal test against a more general Box-Cox alternative with spatial lag dependence (or spatial error correlation). Some of the usual caveats apply. Our results should be tempered by the fact that our Monte Carlo experiments use a small sample size n = 49 and the critical values used for the test statistics are based on their large sample distribution. Also, we used the estimated negative Hessian instead of the information matrix in the computation of the LM tests. Ideally we would like to use the information matrix, but that is difficult to derive for the general models considered. The asymptotic distribution of our test statistics were not explicitly derived in the paper but they are likely to hold under similar sets of low level assumptions developed by Kelejian and Prucha (2001). Further work is needed including a formal derivation of the asymptotic distribution of these test statistics. Additional Monte Carlo experiments could be set up with larger sample size to see whether the use of the Hessian instead of the information matrix (for the cases where it can be derived) do converge as the sample size gets large.
NOTES 1. Davidson and MacKinnon (1985, p. 500) point out that the normality assumption may be untenable. In fact, except for certain values of r (including 0 and 1), y t(r) cannot take on values less than −1/r, while with normal errors there is always the possibility that the right-hand side of (1) may be less than −1/r. Davidson and MacKinnon (1985) argue that it is reasonable to ignore the problem especially if E(y t ) is very large relative to , and the possibility that u t may be so large and negative as to make the right-hand side of (1) unacceptably small can be safely ignored. We check the sensitivity of departures from the normality assumption in our Monte Carlo experiments. 2. In a companion paper, Baltagi and Li (2001) derived similar LM tests for the Box-Cox model with spatial error correlation. 3. See also Pinkse (1999) for general conditions under which Moran-flavoured tests for spatial correlation have a limiting normal distribution in the presence of nuisance parameters in six frequently encountered spatial models. 4. Again, it is important to note that the large sample distribution of our test statistics are not formally established in the paper but are likely to hold under similar sets of low level assumptions developed recently by Kelejian and Prucha (2001).
60
BADI H. BALTAGI AND DONG LI
ACKNOWLEDGMENTS We thank Kelley Pace and two anonymous referees for their helpful comments and suggestions. Baltagi would like to thank the Private Enterprise Research Center for their support. Li acknowledges financial support from the University Small Research Grant at Kansas State University.
REFERENCES Anselin, L. (1988). Spatial econometrics: Methods and models. Dordrecht: Kluwer. Anselin, L. (1999). Rao’s score test in spatial econometrics. Journal of Statistical Planning and Inference (forthcoming). Anselin, L., & Bera, A. K. (1998). Spatial dependence in linear regression models with an introduction to spatial econometrics. In: A. Ullah & D. E. A. Giles (Eds), Handbook of Applied Economic Statistics. New York: Marcel Dekker. Anselin, L., Bera, A. K., Florax, R., & Yoon, M. J. (1996). Simple diagnostic tests for spatial dependence. Regional Science and Urban Economics, 26, 77–104. Anselin, L., & Rey, S. (1991). Properties of tests for spatial dependence in linear regression models. Geographical Analysis, 23, 112–131. Bailly, A. S., Coffey, W. J., Paelinck, J. H. P., & Polese, M. (1992). The spatial econometrics of services. Aldershot: Avebury. Baltagi, B. H., & Li, D. (2001). LM tests for functional form and spatial error correlation. International Regional Science Review, 24, 194–225. Bera, A. K., & McKenzie, C. R. (1985). Alternative forms and properties of the score test. Journal of Applied Statistics, 13, 13–25. Bera, A. K., Sosa-Escudero, W., & Yoon, M. J. (2001). Tests for the error component model in the presence of local misspecification. Journal of Econometrics, 101, 1–23. Bera, A. K., & Yoon, M. J. (1993). Specification testing with locally misspecified alternatives. Econometric Theory, 9, 649–658. Box, G. E. P., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society Series B, 26, 211–252. Craig, S. G., Kohlhase, J. E., & Papell, D. H. (1991). Chaos theory and microeconomics: An application to model specification and hedonic estimation. Review of Economics and Statistics, 73, 208–215. Davidson, R., & MacKinnon, J. G. (1983). Small sample properties of alternative forms of Lagrange multiplier tests. Economics Letters, 12, 269–275. Davidson, R., & MacKinnon, J. G. (1985). Testing linear and loglinear regressions against Box-Cox alternatives. Canadian Journal of Economics, 18, 499–517. Efron, B., & Hinkley, D. V. (1978). Assessing the accuracy of the maximum likelihood estimator: Observed versus expected Fisher information. Biometrika, 65, 457–482. Elad, R. L., Clifton, I. D., & Epperson, J. E. (1994). Hedonic estimation applied to the farmland market in Georgia. Journal of Agricultural and Applied Economics, 26, 351–366. Fik, T. J., & Mulligan, G. F. (1998). Functional form and spatial interaction models. Environment and Planning A, 30, 1497–1507.
Testing for Linear and Log-Linear Models
61
Florax, R., & Folmer, H. (1992). Specification and estimation of spatial linear regression models: Monte Carlo evaluation of pre-test estimators. Regional Science and Urban Economics, 22, 405–432. Gilley, O. W., & Pace, R. K. (1996). On the Harrison and Rubinfeld data. Journal of Environmental Economics and Management, 31, 403–405. Griffith, D. A., Paelinck, J. H. P., & van Gastel, R. A. (1998). The Box-Cox transformation: Computational and interpretation features of the parameters. In: D. A. Griffith et al. (Eds), Econometric Advances in Spatial Modelling and Methodology: Essays in Honour of Jean Paelinck. Dordrecht: Kluwer Academic. Harrison, D., & Rubinfeld, D. L. (1978). Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5, 81–102. Kelejian, H. H., & Prucha, I. R. (2001). On the asymptotic distribution of the Moran I test statistic with applications. Journal of Econometrics, 104, 219–257. Ledent, J. (1986). A model of urbanization with nonlinear migration flows. International Regional Science Review, 10, 221–242. Ord, J. K. (1975). Estimation methods for models of spatial interaction. Journal of the American Statistical Association, 70, 120–126. Pace, R. K., Barry, R., Slawson, V. C., Jr., & Sirmans, C. F. (1999). Simultaneous spatial and functional form transformations. In: L. Anselin & R. J. G. M. Florax (Eds), New Advances in Spatial Econometrics (forthcoming). Pace, R. K., & Gilley, O. W. (1997). Using the spatial configuration of the data to improve estimation. Journal of Real Estate Finance and Economics, 14, 333–340. Pinkse, J. (1999). Moran-flavoured tests with nuisance parameters: Examples. In: L. Anselin & R. J. G. M. Florax (Eds), New Advances in Spatial Econometrics (forthcoming). Savin, N. E., & White, K. J. (1978). Estimation and testing for functional form and autocorrelation: A simultaneous approach. Journal of Econometrics, 8, 1–12. Seaks, T. G., & Layson, S. K. (1983). Box-Cox estimation with standard econometric problems. Review of Economics and Statistics, 65, 160–164. Upton, G. J. G., & Fingleton, B. (1985). Spatial data analysis by example. Chichester: Wiley.
APPENDIX A: HESSIAN MATRIX FOR THE GENERAL BOX-COX MODEL WITH SPATIAL LAG DEPENDENCE The second-order derivatives of the loglikelihood function given in (4) yield the following results: n 1 ∂2 log L = − 6 u′u 2 2 4 ∂ ∂ 2
(A.1)
∂2 log L 1 = − 4 u ′ X (r) ∂2 ∂′
(A.2)
∂2 log L 1 = − 4 u′Z ∂2 ∂␥′
(A.3)
62
BADI H. BALTAGI AND DONG LI
∂2 log L 1 = − 4 u ′ Wy (r) ∂2 ∂
(A.4)
1 ∂2 log L = 4 u ′ [(I − W)C(y, r) − C(X, r)] 2 ∂ ∂r
(A.5)
∂2 log L 1 = − 4 (X (r) )′ u ∂∂2
(A.6)
∂2 log L 1 (r) ′ (r) ′ = − 2 (X ) X ∂∂
(A.7)
∂2 log L 1 = − 2 (X (r) )′ Z ∂∂␥′
(A.8)
∂2 log L 1 = − 2 (X (r) )′ Wy (r) ∂∂
(A.9)
1 ∂2 log L = 2 (X (r) )′ [(I − W)C(y, r) − C(X, r)] ∂∂r +
1 [C(X, r)]′ u 2
(A.10)
∂2 log L 1 = − 4 Z′u 2 ∂␥∂
(A.11)
∂2 log L 1 = − 2 Z ′ X (r) ∂␥∂′
(A.12)
∂2 log L 1 = − 2 Z′Z ∂␥∂␥′
(A.13)
∂2 log L 1 = − 2 Z ′ Wy (r) ∂␥∂
(A.14)
∂2 log L 1 = 2 Z ′ [(I − W)C(y, r) − C(X, r)] ∂␥∂r
(A.15)
∂2 log L 1 = − 4 u ′ Wy (r) 2 ∂∂
(A.16)
Testing for Linear and Log-Linear Models
63
1 ∂2 log L = − 2 (y (r) )′ W ′ X (r) ∂∂′
(A.17)
∂2 log L 1 = − 2 (y (r) )′ W ′ Z ′ ∂∂␥
(A.18)
n
2i ∂2 log L 1 =− − 2 (Wy (r) )′ (Wy (r) ) ∂∂ (1 − i )2
(A.19)
i=1
∂2 log L 1 = 2 (y (r) )′ W ′ [(I − W)C(y, r) − C(X, r)] ∂∂r +
1 ′ u WC(y, r) 2
(A.20)
1 ∂2 log L = 4 u ′ [(I − W)C(y, r) − C(X, r)] 2 ∂r∂
(A.21)
∂2 log L 1 1 = 2 u ′ C(X, r) + 2 [(I − W)C(y, r) − C(X, r)]′ X (r) ∂r∂′
(A.22)
∂2 log L 1 = 2 [(I − W)C(y, r) − C(X, r)]′ Z ∂r∂␥′
(A.23)
∂2 log L 1 = 2 (Wy (r) )′ [(I − W)C(y, r) − C(X, r)] ∂r∂ +
1 ′ u WC(y, r) 2
(A.24)
∂2 log L 1 = − 2 [(I − W)C(y, r) − C(X, r)]′ [(I − W)C(y, r) − C(X, r)] ∂r∂r 1 − 2 u ′ [(I − W)C ′ (y, r) − C ′ (X, r)] (A.25) where C ′ (y, r) = ∂C(y, r)/∂r = [r 2 y r (log y)2 − 2ry r log y + 2y r − 2]/r 3 C ′ (X, r) = ∂C(X, r)/∂r is similarly defined.
and
64
BADI H. BALTAGI AND DONG LI
APPENDIX B: JOINT TESTS B.1. Hessian Matrix for the Log-linear Model with No Spatial Lag Dependence Under the hull hypothesis Ha0 : = 0 and r = 0, the second order derivatives of the loglikelihood function are given by n 1 ∂2 log L = − 6 u′u 2 2 4 ∂ ∂ 2
(B.1)
∂2 log L 1 = − 4 u ′ log X ∂2 ∂′
(B.2)
1 ∂2 log L = − 4 u′Z ∂2 ∂␥′
(B.3)
∂2 log L 1 = − 4 u ′ W log y ∂2 ∂ ∂2 log L 1 1 ′ 1 2 2 (log y) − (log X)  = u 2 2 ∂2 ∂r 4
(B.4) (B.5)
1 ∂2 log L = − 4 (log X)′ u ∂∂2
(B.6)
∂2 log L 1 = − 2 (log X)′ log X ∂∂′
(B.7)
∂2 log L 1 = − 2 (log X)′ Z ∂∂␥′
(B.8)
∂2 log L 1 = − 2 (log X)′ W log y ∂∂ 1 1 1 ∂2 log L = 2 (log X)′ (log y)2 − (log X)2  ∂∂r 2 2 ′ 1 1 (log X)2 u + 2 2 ∂2 log L 1 = − 4 Z′u ∂␥∂2
(B.9)
(B.10)
(B.11)
Testing for Linear and Log-Linear Models
65
1 ∂2 log L = − 2 Z ′ log X ∂␥∂′
(B.12)
∂2 log L 1 = − 2 Z′Z ∂␥∂␥′
(B.13)
1 ∂2 log L = − 2 Z ′ W log y ∂␥∂ ∂2 log L 1 1 1 = 2 Z′ (log y)2 − (log X)2  ∂␥∂r 2 2
(B.14) (B.15)
∂2 log L 1 = − 4 u ′ W log y 2 ∂∂
(B.16)
1 ∂2 log L ′ ′ ′ = − 2 (log y) W log X ∂∂
(B.17)
∂2 log L 1 = − 2 (log y)′ W ′ Z ′ ∂∂␥
(B.18)
n
∂2 log L 1 2i − 2 (W log y)′ (W log y) =− ∂∂
(B.19)
i=1
∂2 log L
1 1 2 2 ′ ′ 1 = 2 (log y) W (log y) − (log X)  ∂∂r 2 2 1 ′ 1 u W (log y)2 2 2 1 1 ′ 1 2 2 (log y) − (log X)  = 4u 2 2 ′ 1 ′1 1 1 1 2 2 2 = 2 u (log X) + 2 (log y) − (log X)  log X 2 2 2 ′ 1 1 1 = 2 (log y)2 − (log X)2  Z 2 2 1 1 ′ 1 2 2 = 2 (W log y) (log y) − (log X)  2 2 +
∂2 log L ∂r∂2 ∂2 log L ∂r∂′ ∂2 log L ∂r∂␥′ ∂2 log L ∂r∂
+
1 ′ 1 u W (log y)2 2 2
(B.20) (B.21) (B.22) (B.23)
(B.24)
66
BADI H. BALTAGI AND DONG LI
′ 1 1 1 1 (log y)2 − (log X)2  (log y)2 − (log X)2  2 2 2 2 1 1 1 (B.25) − 2 u′ (log y)3 − (log X)3  3 3
∂2 log L 1 =− 2 ∂r∂r
B.2. Hessian Matrix for the Linear Model with No Spatial Lag Dependence Under the null hypothesis Hb0 : = 0 and r = 1, the second order derivatives of the loglikelihood function are given by n 1 ∂2 log L = − 6 u′u ∂2 ∂2 24
(B.26)
1 ′ ∂2 log L ′ = − 4 u (X − J nK ) 2 ∂ ∂
(B.27)
∂2 log L 1 = − 4 u′Z 2 ′ ∂ ∂␥
(B.28)
∂2 log L 1 = − 4 u ′ W(y − n ) 2 ∂ ∂
(B.29)
1 ∂2 log L = 4 u ′ [ylog y − y + n − (X log X − X + J nK )] 2 ∂ ∂r
(B.30)
∂2 log L 1 = − 4 (X − J nK )′ u ∂∂2
(B.31)
∂2 log L 1 ′ ′ = − 2 (X − J nK ) (X − J nK ) ∂∂
(B.32)
∂2 log L 1 = − 2 (X − J nK )′ Z ∂∂␥′
(B.33)
∂2 log L 1 = − 2 (X − J nK )′ W(y − n ) ∂∂
(B.34)
∂2 log L 1 = 2 (X − J nK )′ [ylog y − y + n − (X log X − X + J nK )] ∂∂r +
1 (X log X − X + J nK )′ u 2
(B.35)
Testing for Linear and Log-Linear Models
67
1 ∂2 log L = − 4 Z′u ∂␥∂2
(B.36)
∂2 log L 1 ′ ′ = − 2 Z (X − J nK ) ∂␥∂
(B.37)
∂2 log L 1 = − 2 Z′Z ∂␥∂␥′
(B.38)
∂2 log L 1 = − 2 Z ′ W(y − n ) ∂␥∂
(B.39)
∂2 log L 1 = 2 Z ′ [ylog y − y + n − (X log X − X + J nK )] ∂␥∂r
(B.40)
1 ∂2 log L = − 4 u ′ W(y − n ) ∂∂2
(B.41)
∂2 log L 1 ′ ′ ′ = − 2 (y − n ) W (X − J nK ) ∂∂
(B.42)
∂2 log L 1 = − 2 (y − n )′ W ′ Z ∂∂␥′
(B.43)
n
∂2 log L 1 2i − 2 (W(y − n ))′ (W(y − n )) =− ∂∂
(B.44)
i=1
∂2 log L 1 = 2 (y − n )′ W ′ [ylog y − y + n − (X log X − X + J nK )] ∂∂r +
1 ′ u W(ylog y − y + n ) 2
∂2 log L 1 = 4 u ′ [ylog y − y + n − (X log X − X + J nK )] ∂r∂2
(B.45) (B.46)
1 ∂2 log L 1 ′ ′ = 2 u (X log X − X + J nK ) + 2 [ylog y − y + n ∂r∂ − (X log X − X + J nK )]′ (X − J nK )
∂2 log L 1 = 2 [ylog y − y + n − (X log X − X + J nK )]′ Z ′ ∂r∂␥
(B.47) (B.48)
68
BADI H. BALTAGI AND DONG LI
∂2 log L 1 = 2 [W(y − n )]′ [ylog y − y + n − (X log X − X + J nK )] ∂r∂ +
1 ′ u W(ylog y − y + n ) 2
(B.49)
∂2 log L 1 = − 2 [ylog y − y + n − (X log X − X + J nK )]′ [ylog y − y + n ∂r∂r 1 − (X log X − X + J nK )] − 2 u ′ [y(log y)2 − 2ylog y + 2y − 2n − (X(log X)2 − 2X log X + 2X − 2J nK )]
(B.50)
APPENDIX C: CONDITIONAL TESTS g
C.1. Hessian Matrix for the Null Hypothesis H0 : = 0|unknown r g
Under the null hypothesis H0 : = 0|unknown r, the second order derivatives of the loglikelihood function are given by n 1 ∂2 log L = − 6 u′u 2 2 4 ∂ ∂ 2
(C.1)
∂2 log L 1 ′ (r) ′ = − 4u X 2 ∂ ∂
(C.2)
∂2 log L 1 = − 4 u′Z 2 ′ ∂ ∂␥
(C.3)
∂2 log L 1 = − 4 u ′ Wy (r) ∂2 ∂
(C.4)
∂2 log L 1 = 4 u ′ [C(y, r) − C(X, r)] 2 ∂ ∂r
(C.5)
1 ∂2 log L = − 4 (X (r) )′ u 2 ∂∂
(C.6)
∂2 log L 1 = − 2 (X (r) )′ X (r) ∂∂′
(C.7)
Testing for Linear and Log-Linear Models
69
1 ∂2 log L = − 2 (X (r) )′ Z ∂∂␥′
(C.8)
1 ∂2 log L = − 2 (X (r) )′ Wy (r) ∂∂
(C.9)
∂2 log L 1 1 = 2 (X (r) )′ [C(y, r) − C(X, r)] + 2 [C(X, r)]′ u ∂∂r
(C.10)
∂2 log L 1 = − 4 Z′u 2 ∂␥∂
(C.11)
∂2 log L 1 ′ (r) ′ = − 2Z X ∂␥∂
(C.12)
1 ∂2 log L = − 2 Z′Z ′ ∂␥∂␥
(C.13)
∂2 log L 1 = − 2 Z ′ Wy (r) ∂␥∂
(C.14)
∂2 log L 1 = 2 Z ′ [C(y, r) − C(X, r)] ∂␥∂r
(C.15)
∂2 log L 1 = − 4 u ′ Wy (r) 2 ∂∂
(C.16)
∂2 log L 1 (r) ′ ′ (r) ′ = − 2 (y ) W X ∂∂
(C.17)
∂2 log L 1 = − 2 (y (r) )′ W ′ Z ′ ∂∂␥
(C.18)
n
1 ∂2 log L 2i − 2 (Wy (r) )′ (Wy (r) ) =− ∂∂
(C.19)
∂2 log L 1 1 = 2 (y (r) )′ W ′ [C(y, r) − C(X, r)] + 2 u ′ WC(y, r) ∂∂r
(C.20)
∂2 log L 1 = 4 u ′ [C(y, r) − C(X, r)] ∂r∂2
(C.21)
∂2 log L 1 ′ 1 ′ (r) ′ = 2 u C(X, r) + 2 [C(y, r) − C(X, r)] X ∂r∂
(C.22)
i=1
70
BADI H. BALTAGI AND DONG LI
∂2 log L 1 = 2 [C(y, r) − C(X, r)]′ Z ∂r∂␥′
(C.23)
1 1 ∂2 log L = 2 (Wy (r) )′ [C(y, r) − C(X, r)] + 2 u ′ WC(y, r) ∂r∂
(C.24)
∂2 log L 1 = − 2 [C(y, r) − C(X, r)]′ [C(y, r) − C(X, r)] ∂r∂r 1 − 2 u ′ [C ′ (y, r) − C ′ (X, r)]
(C.25)
C.2. Hessian Matrix for the Null Hypothesis Hh0 : r = 0|unknown Under the null hypothesis H0 : r = 0|unknown , the second order derivatives of the loglikelihood function are given by ∂2 log L n 1 = − 6 u′u 2 2 4 ∂ ∂ 2
(C.26)
∂2 log L 1 = − 4 u ′ log X ∂2 ∂′
(C.27)
1 ∂2 log L = − 4 u′Z ∂2 ∂␥′
(C.28)
∂2 log L 1 = − 4 u ′ W log y ∂2 ∂
(C.29)
∂2 log L 1 = 4 u ′ [(I − W)C(y, 0) − C(X, 0)] ∂2 ∂r
(C.30)
∂2 log L 1 = − 4 (log X)′ u 2 ∂∂
(C.31)
∂2 log L 1 = − 2 (log X)′ log X ∂∂′
(C.32)
1 ∂2 log L = − 2 (log X)′ Z ′ ∂∂␥
(C.33)
∂2 log L 1 = − 2 (log X)′ W log y ∂∂
(C.34)
Testing for Linear and Log-Linear Models
71
∂2 log L 1 = 2 (log X)′ [(I − W)C(y, 0) − C(X, 0)] ∂∂r +
1 [C(X, 0)]′ u 2
(C.35)
1 ∂2 log L = − 4 Z′u 2 ∂␥∂
(C.36)
∂2 log L 1 ′ ′ = − 2 Z log X ∂␥∂
(C.37)
∂2 log L 1 = − 2 Z′Z ∂␥∂␥′
(C.38)
∂2 log L 1 = − 2 Z ′ W log y ∂␥∂
(C.39)
∂2 log L 1 = 2 Z ′ [(I − W)C(y, 0) − C(X, 0)] ∂␥∂r
(C.40)
1 ∂2 log L = − 4 u ′ W log y 2 ∂∂
(C.41)
∂2 log L 1 ′ ′ ′ = − 2 (log y) W log X ∂∂
(C.42)
∂2 log L 1 = − 2 (log y)′ W ′ Z ′ ∂∂␥
(C.43)
n
2i 1 ∂2 log L =− − 2 (W log y)′ (W log y) ∂∂ (1 − i )2
(C.44)
i=1
∂2 log L 1 = 2 (log y)′ W ′ [(I − W)C(y, 0) − C(X, 0)] ∂∂r +
1 ′ u WC(y, 0) 2
(C.45)
∂2 log L 1 = 4 u ′ [(I − W)C(y, 0) − C(X, 0)] ∂r∂2
(C.46)
∂2 log L 1 1 ′ ′ ′ = 2 u C(X, 0) + 2 [(I − W)C(y, 0) − C(X, 0)] log X ∂r∂
(C.47)
72
BADI H. BALTAGI AND DONG LI
∂2 log L 1 = 2 [(I − W)C(y, 0) − C(X, 0)]′ Z ∂r∂␥′
(C.48)
1 ∂2 log L = 2 (W log y)′ [(I − W)C(y, 0) − C(X, 0)] ∂r∂ +
1 ′ u WC(y, 0) 2
(C.49)
∂2 log L 1 = − 2 [(I − W)C(y, 0) − C(X, 0)]′ [(I − W)C(y, 0) − C(X, 0)] ∂r∂r 1 − 2 u ′ [(I − W)C ′ (y, 0) − C ′ (X, 0)] (C.50)
C.3. Hessian Matrix for the Null Hypothesis Hi0 : r = 1|unknown Under the null hypothesis H0 : r = 1|unknown , the second order derivatives of the loglikelihood function are given by ∂2 log L n 1 = − 6 u′u 2 2 4 ∂ ∂ 2
(C.51)
∂2 log L 1 = − 4 u ′ (X − J nK ) ∂2 ∂′
(C.52)
∂2 log L 1 = − 4 u′Z ∂2 ∂␥′
(C.53)
∂2 log L 1 = − 4 u ′ W(y − n ) ∂2 ∂
(C.54)
∂2 log L 1 = 4 u ′ [(I − W)C(y, 1) − C(X, 1)] ∂2 ∂r
(C.55)
1 ∂2 log L = − 4 (X − J nK )′ u ∂∂2
(C.56)
∂2 log L 1 ′ ′ = − 2 (X − J nK ) (X − J nK ) ∂∂
(C.57)
∂2 log L 1 = − 2 (X − J nK )′ Z ∂∂␥′
(C.58)
Testing for Linear and Log-Linear Models
∂2 log L 1 = − 2 (X − J nK )′ W(y − n ) ∂∂
73
(C.59)
1 ∂2 log L = 2 (X − J nK )′ [(I − W)C(y, 1) − C(X, 1)] ∂∂r +
1 [C(X, 1)]′ u 2
(C.60)
∂2 log L 1 = − 4 Z′u 2 ∂␥∂
(C.61)
∂2 log L 1 = − 2 Z ′ (X − J nK ) ∂␥∂′
(C.62)
∂2 log L 1 = − 2 Z′Z ′ ∂␥∂␥
(C.63)
∂2 log L 1 = − 2 Z ′ W(y − n ) ∂␥∂
(C.64)
∂2 log L 1 = 2 Z ′ [(I − W)C(y, 1) − C(X, 1)] ∂␥∂r
(C.65)
∂2 log L 1 = − 4 u ′ W(y − n ) ∂∂2
(C.66)
∂2 log L 1 = − 2 (y − n )′ W ′ (X − J nK ) ∂∂′
(C.67)
∂2 log L 1 = − 2 (y − n )′ W ′ Z ∂∂␥′
(C.68)
n
2i 1 ∂2 log L =− − 2 (W(y − n ))′ (W(y − n )) 2 ∂∂ (1 − i )
(C.69)
i=1
1 ∂2 log L = 2 (y − n )′ W ′ [(I − W)C(y, 1) − C(X, 1)] ∂∂r +
1 ′ u WC(y, 1) 2
∂2 log L 1 = 4 u ′ [(I − W)C(y, 1) − C(X, 1)] 2 ∂r∂
(C.70) (C.71)
74
BADI H. BALTAGI AND DONG LI
∂2 log L 1 1 = 2 u ′ C(X, 1) + 2 [(I − W)C(y, 1) − C(X, 1)]′ ∂r∂′ × (X − J nK )
1 ∂2 log L = 2 [(I − W)C(y, 1) − C(X, 1)]′ Z ′ ∂r∂␥
(C.72) (C.73)
∂2 log L 1 = 2 (W(y − n ))′ [(I − W)C(y, 1) − C(X, 1)] ∂r∂ +
1 ′ u WC(y, 1) 2
(C.74)
1 ∂2 log L = − 2 [(I − W)C(y, 1) − C(X, 1)]′ [(I − W)C(y, 1) − C(X, 1)] ∂r∂r 1 − 2 u ′ [(I − W)C ′ (y, 1) − C ′ (X, 1)] (C.75)
SPATIAL LAGS AND SPATIAL ERRORS REVISITED: SOME MONTE CARLO EVIDENCE Robin Dubin INTRODUCTION From a theoretical point of view, a spatial econometric model can contain both a spatially lagged dependent variable (spatial lag) and a spatially autocorrelated error term (spatial error). However, such models are rarely used in practice. This is because (assuming a lattice model approach is used for both the spatial lag and spatial error) the model is difficult to estimate1 unless the weight matrices are different for the spatial lag and the spatial error. There are two schools of thought for modeling spatial dependence: lattice models and geostatistical models. In the former, a weight matrix is used to model the connectedness of the population, while in the latter, the spatial correlations are directly modeled as a function of separation distance. By combining the two approaches, the above estimation problem can be solved. Furthermore, since the geostatistical approach will be used to model the spatial error, the technique of Kriging can be used to improve the predictive ability of the model. In the remainder of this paper, I discuss the geostatistical and lattice approaches to modeling spatial effects in a regression context. I present a model which combines these two approaches and present some Monte Carlo results which explore the properties of this model.
Spatial and Spatiotemporal Econometrics Advances in Econometrics, Volume 18, 75–98 © 2004 Published by Elsevier Ltd. ISSN: 0731-9053/doi:10.1016/S0731-9053(04)18002-X
75
76
ROBIN DUBIN
LITERATURE REVIEW Lattice Models In his classic text on spatial econometrics, Anselin (1988) presents a taxonomy of models, all of which follow from parameter restrictions on what Anselin refers to as the “general spatial process model.” This model is shown below: y = W 1 y + X + u
(1.a)
u = W 2 u +
(1.b)
where y is the dependent variable, X is a matrix of independent variables, u is the spatially autoregressive error term, ∼ N(0, 2 ),  are the regression parameters, is the parameter which determines the importance of the spatial lag, is the parameter of the spatially autoregressive error, and W1 and W2 are spatial weight matrices.2,3 However, in practice, tests for spatial model specification have taken on a non-nested character. That is, the researcher attempts to determine which model, spatial error or spatial lag, best fits the data. Thus, the researcher tries to determine whether = 0 or = 0 and does not accept the possibility that both parameters might differ from zero. The non-nested treatment of spatial model specification permeates both the theoretical and applied literature. For example, in an early Monte Carlo study, Anselin and Rey (1991) conclude that if the Lagrange Multiplier (LM) tests for spatial error and spatial lag are both significant, the larger of the two statistics probably indicates the correct model. With modifications, this conclusion is echoed by more recent studies (see Anselin & Florax, 1995; Florax & Folmer, 1992). Empirical papers that consider both spatial lag and spatial error follow this advice by using the largest statistic (either a LM test or a robust version of LM developed by Anselin et al. (1996)) to pick the best model (see for example, Bivand & Szymanski, 2000; Brueckner, 1998; Can, 1992; Kim et al., 2003; Saavedra, 2000).
Geostatistical Models Lattice models are general in the sense that the weight matrix can be used to spatially lag the dependent variable, the independent variables and the error term (Eq. (1.a) depicts a model in which includes both a spatially lagged dependent variable and a spatially lagged (autocorrelated) error term). With respect to the spatially autoregressive error discussed above, the error generating process is specified (as shown in Eq. (1.b)), and the variance of the errors is derived
Spatial Lags and Spatial Errors Revisited
77
from the error generating process. The correlation matrix for this system is K = [(I − W 2 )′ (I − W 2 )]−1 . Geostatistical models do not use weight matrices to summarize the spatial relationships and no error generating process is specified. Instead, the correlation matrix for the data is modeled directly as a function of separation distances. The researcher must specify this function. Commonly used functional forms are the negative exponential, the gaussian and the spherical: Negative exponential correlation function: b exp − d for d > 0 1 d2 K(d) = (2) 1 for d = 0 Gaussian correlation function: 2 b exp − d 1 d2 K(d) = 1
for d > 0
(3)
for d = 0
Spherical correlation function: 3 3d d for 0 < d < b2 b1 1 − 2b + 2b 3 2 2 K(d) = 1 for d = 0 0 for d > b2
(4)
where K(d) is the correlation of two observations separated by distance d, and b1 and b2 are parameters to be estimated. Geostatistical models are used only for the error term; there is no analog for a spatially lagged dependent variable. Examples of papers that use a geostatistical approach to modeling a spatially autocorelated error term include Dubin (1998) and Basu and Thibodeau (1998). One advantage of the geostatistical approach is that, once the parameters of the correlation function are estimated, they can be used to make a prediction for the error term for a point that is not in the data. This is known as Kriging (see Dubin (1998) or Ripley (1981) for a discussion of Kriging). This is then added to the ˜ to make an improved prediction. The formula for the prediction at standard X, site s0 is shown below. ˆ 0) yˆ (s 0 ) = x 0 ˜ + u(s
(5)
78
ROBIN DUBIN
where: uˆ (s0 ) is the predicted error at site s0 , yˆ (s 0 ) is the predicted value at site s0 , ˜ −1 X)−1 X ′ K ˜ −1 y x0 is the (row) vector of independent variables at site s0 , ˜ = (X ′ K ˜ = the = the maximum likelihood estimate of the regression coefficients, and K correlation matrix obtained by substituting the ML estimates of b1 and b2 into the correlation function and applying this function to the data locations.
PROPOSED MODEL The model presented in this paper combines the lattice and geostatistical approaches. The lattice approach is used to lag the dependent variable, while the geostatistical approach is used for the spatially autocorrelated error term. Thus, y = Wy + X + u
where u ∼ N(0, 2 K)
(6)
where K is specified as in Eqs (2), (3), or (4). This model has two advantages. First, since there is only one weight matrix, the estimation problems detailed in Note 1 do not occur. Second, the error terms can be Kriged, which should provide good predictive power. Maximum Likelihood can be used to estimate this model. The likelihood function is N N 1 L(y) = − ln(2) − ln 2 − ln |K| + ln |A| 2 2 2 1 − 2 (Ay − X)′ K −1 (Ay − X) (7) 2 where A = (I − W). This can be written in concentrated form as (ignoring constants) 1 N ˜ ′ K −1 (Ay − X ) ˜ L c (y) = − ln |K| + ln |A| − (Ay − X ) (8) 2 2 Equation (8) is maximized with respect to the parameters of the correlogram (b1 and b2 ) and . The estimation is completed by solving for ˜ and ˜ 2 using the following formulas. ˜ = (X′ K−1 X)−1 X′ K−1 Ay ˜ 2 =
(Ay − X)′ K −1 (Ay − X) N
(9)
MONTE CARLO SIMULATIONS The properties of this model are explored in a Monte Carlo framework. The setup is as follows. First, locations are generated in a plane that is 100 units by
Spatial Lags and Spatial Errors Revisited
79
100 units by randomly choosing x and y coordinates from a uniform distribution with range 0–100. The locations remain the same for all experiments. Second, the independent variables are generated. These consist of constant term and two independent variables: X1 is drawn from a uniform distribution with range 0–5, and X2 is drawn from a uniform distribution with range 10–13. The X data also remain the same for all repetitions. Third, y is generated using the following equation. y = (I − W)−1 (X + u)
(10)
W is a row normalized nearest neighbor weight matrix, with 10 neighbors. The true values of the regression coefficients are 0 = 100, and 1 = 2 = 20. The error term u is spatially autocorrelated with a negative exponential correlogram, as shown in Eq. (2). To generate the spatially autocorrelated error, I create K by applying (2) to the generated locations. I then form the Cholesky decomposition of K; call this C. The spatially autcorrelated error term u is then u = Cz where z is drawn from a normal distribution with mean zero and variance 2 (2 is set to 400 in the simulations). The parameter determines the importance of the spatial lag. Since W is row normalized, this parameter should range between −1 and 1. I simulate the data using four values for : 0.8, 0.5, 0.2, and 0. The parameter b1 determines the importance of the spatially autocorrelated error. I also use four values for b1 : 0.8, 0.5, 0.2, and 0. The parameter b2 is set to 3 for all experiments. The combination of four values for and four values for b1 gives me 16 experiments. Each experiment is repeated 500 times. For each repetition, four models are estimated: OLS, spatial lag only, spatial error only, and the general spatial model. These models are shown below. OLS Spatial Lag
y = X + ∼ N(0, 2 I) y = Wy + X +
Spatial Error
y = X + u
u ∼ N(0, 2 K)
(11)
General Spatial y = Wy + X + u In order to determine the correct model, three null hypotheses are of interest. All of the hypotheses are tested by means of a likelihood ratio (LR) test,4 and in all of these, the general spatial model is the unrestricted model. The first, H10 : b1 = = 0, is a test of whether there are any spatial effects at all. If this hypothesis is accepted, OLS is the selected model. This hypothesis is tested by using OLS as the restricted model. The second null hypothesis is H20 : = 0. This is a test of whether a spatial lag is present, and is not conditioned on the value of b1 or b2 . This hypothesis is tested by using the spatial error model as the restricted hypothesis. The third null hypothesis is H30 : b1 = 0. This is a test of whether there is a spatially
80
ROBIN DUBIN
Fig. 1. H0 : b1 = = 0.
Spatial Lags and Spatial Errors Revisited
81
autocorrelated error term. This hypothesis is tested by using the spatial lag model as the restricted model. Note that this test is not conditional on = 0. Model specification can be determined by a combination of these three tests as follows. If H10 is accepted, the testing stops and OLS is taken to be the correct model. If H10 is rejected, then H20 and H30 must be tested. If H20 is accepted and H30 is rejected, the spatial error model is chosen. Conversely, if H20 is rejected and H30 is accepted, the spatial lag model is chosen. Finally, if both H20 and H30 are rejected, the general model is selected. In the Monte Carlo experiments, I use two sided alternatives for all tests, and the probability of a type I error is set to 0.05. Degrees of Freedom Likelihood ratio tests are asymptotically distributed as 2 random variables, with degrees of freedom equal to the number of restricted parameters. Following this
Fig. 2. (a) H0 : = 0, (b) H0 : = 0, (c) H0 : = 0, 0.5 2 (2) + 0.5 2 (1).
82
ROBIN DUBIN
reasoning, one would suspect that the LR test for H10 should be distributed as 2 (2), while those for H20 and H30 should be distributed as 2 (1). However, b1 is restricted to be greater than or equal to zero, and this may affect the distribution of the test (note that there is no restriction on ). I explore this possibility by forming probability plots for the LR statistic when H0 is true. Because H20 and H30 only involve one parameter, the plots must be checked at values of the other parameter. The plots are shown in Figs 1–3 for H10 , H20 , and H30 respectively. Each plot uses 500 observations and 200 repetitions. Examination of Fig. 1 shows that 2 (2) provides the best fit for the LR test of H10 . Figure 2 shows that the LR test for H20 is distributed as neither 2 (1) nor 2 (2), but is instead a mixture of these two distributions. Furthermore, the distribution appears to change with the (true) value of b1 . An equal mixture appears to provide a reasonably good fit, although cutoffs based on this mixture may tend to over-reject for b1 = 0. Figure 3 shows that the best fit for the LR test of H30 is 2 (1). The fit
Fig. 2.
(Continued )
Spatial Lags and Spatial Errors Revisited
Fig. 2.
83
(Continued )
is not very good except for the upper tail of the distribution, however, this is the most important region for our purposes. In summary, based on the probability plots, the cutoffs for the LR test of H10 will be taken from the 2 (2) distribution; 0.5 2 (2) + 0.5 2 (1) for H20 , and 2 (1) for H30 .
Estimation Procedure I use a grid-search procedure to estimate all of the models. First I search a fairly course grid of the parameter space. For example, for the general spatial model,5 the parameters initially take on the following values: b1 = {0, 0.2, 0.4, 0.6, 0.8}, b2 = {3, 7, 11, 15}, ={0, 0.2, 0.4, 0.6, 0.8}. The concentrated log likelihood function (8) is evaluated at each combination of parameter values. The set of values giving ˆ and values around the initial the largest value is identified (call these bˆ 1 , bˆ 2 and )
84
ROBIN DUBIN
Fig. 3. H0 : b1 = 0, 2 (1), H0 : b1 = 0, 2 (2).
point are evaluated. This is done by dividing the initial search interval by half, so that the next search is over bˆ 1 ± 0.1, bˆ 2 ± 2, and ˆ ± 0.1. The search is refined a total of four times to produce the final estimates. In this procedure, b1 is restricted to positive values. If the refinement results in a negative value for b1 , the value is reset to zero. No restrictions are placed on . Short Cuts Because the grid search procedure is computationally intensive, I reduce the number of computations by ending the search on a particular row of the grid (the values of b1 give the rows and the values of b2 give the columns) if the likelihood function begins to decrease. By comparing the results of the simulation program with and without this short cut, I can determine whether or not the likelihood function has a single, global maximum.6 These experiments reveal that, in most cases, the likelihood function does indeed have only a global maximum. However,
Spatial Lags and Spatial Errors Revisited
Fig. 3.
85
(Continued )
local maxima can occur when b2 takes on very large values (greater than ten times the true value). b2 determines the range of the spatial autocorrelation, and usually, this number should be relatively small. Very large values of b2 usually indicate that the model is mispecified (i.e. that a spatial lag is erroneously omitted from the model).
ESTIMATION RESULTS Estimates of Regression Coefficients Table 1 shows the mean (over the repetitions) estimates of the regression coefficients. Table 2 shows the standard deviation of these estimates around their true values. Recall that the true values for 0 , 1 , and 2 respectively are 100, 20 and 20. In general, the correct model produces the best estimates, both in terms of
86
Table 1. Mean Regression Coefficients. Spatial Error
Spatial Lag : 0.8
: 0.5
: 0.2
: 0
1
2
0
1
2
0
1
2
0
1
2
b: 0.8, 3 Ols Gen SE SL
1570.35 131.85 1656.46 21.47
23.69 20.00 18.43 19.78
23.64 19.99 18.46 19.79
469.39 117.58 494.71 31.69
20.76 19.97 19.09 19.83
20.78 19.97 19.10 19.84
193.66 111.63 199.34 38.56
20.07 19.96 19.70 19.92
20.10 19.96 19.70 19.92
99.92 107.17 99.89 43.00
19.99 19.96 20.00 19.98
20.00 19.97 20.00 19.98
b: 0.5, 3 Ols Gen SE SL
1570.46 130.92 1656.40 53.03
23.68 20.01 18.47 19.85
23.64 20.01 18.54 19.87
469.39 115.69 493.88 57.56
20.76 19.98 19.15 19.88
20.78 19.99 19.19 19.90
193.64 110.45 198.45 61.67
20.07 19.96 19.75 19.94
20.10 19.98 19.77 19.95
99.89 106.15 99.80 64.41
19.99 19.97 20.00 19.98
20.00 19.98 20.01 19.98
b: 0.2, 3 Ols Gen SE SL
1570.60 124.70 1656.48 88.95
23.68 20.02 18.53 19.94
23.64 20.03 18.62 19.96
469.40 114.28 492.65 87.10
20.76 19.99 19.25 19.94
20.79 20.01 19.31 19.97
193.62 109.55 197.30 87.43
20.07 19.97 19.82 19.96
20.11 19.99 19.85 19.98
99.86 106.39 99.80 87.95
19.99 19.97 19.99 19.97
20.01 19.99 20.01 19.99
b: 0, 3 Ols Gen SE SL
1570.71 127.92 1656.57 116.16
23.68 20.04 18.59 20.01
23.64 20.06 18.69 20.03
469.42 118.08 491.50 109.00
20.76 20.01 19.35 19.99
20.79 20.03 19.41 20.02
193.61 113.58 196.12 106.49
20.07 19.98 19.90 19.98
20.11 20.01 19.94 20.00
99.85 110.55 99.80 104.97
19.99 19.97 19.99 19.97
20.01 20.00 20.01 20.00
ROBIN DUBIN
0
Spatial Error
Spatial Lag : 0.8
: 0.5
: 0.2
: 0
0
1
2
0
1
2
0
1
2
0
1
2
b: 0.8, 3 Ols Gen SE SL
1470.65 91.91 1556.55 95.38
3.96 0.74 1.72 0.84
4.38 1.26 1.96 1.39
369.89 66.44 394.98 80.99
1.22 0.74 1.16 0.83
1.80 1.27 1.52 1.38
95.18 54.88 100.41 72.92
0.85 0.75 0.79 0.82
1.45 1.28 1.28 1.37
16.40 45.66 14.69 68.24
0.83 0.75 0.75 0.82
1.40 1.28 1.25 1.38
b: 0.5, 3 Ols Gen SE SL
1470.71 90.54 1556.50 72.88
3.91 0.79 1.71 0.83
4.27 1.36 1.97 1.40
369.86 62.96 394.20 61.18
1.19 0.79 1.14 0.82
1.76 1.37 1.55 1.40
95.15 53.13 99.72 54.97
0.84 0.79 0.82 0.82
1.44 1.37 1.37 1.40
16.53 44.61 15.94 51.65
0.82 0.79 0.79 0.82
1.41 1.37 1.36 1.40
b: 0.2, 3 Ols Gen SE SL
1470.81 77.03 1556.59 58.46
3.87 0.82 1.67 0.82
4.17 1.41 1.97 1.42
369.85 56.14 393.01 45.89
1.16 0.81 1.09 0.81
1.72 1.41 1.56 1.42
95.13 47.51 98.71 41.09
0.82 0.81 0.82 0.81
1.44 1.42 1.42 1.42
16.67 41.08 16.53 38.40
0.81 0.81 0.81 0.81
1.42 1.42 1.41 1.42
b: 0, 3 Ols Gen SE SL
1470.89 69.00 1556.67 59.46
3.84 0.82 1.64 0.82
4.11 1.43 1.95 1.43
369.85 50.17 391.87 44.53
1.14 0.81 1.04 0.81
1.70 1.43 1.55 1.43
95.13 43.30 97.60 38.78
0.81 0.81 0.81 0.81
1.44 1.43 1.44 1.43
16.77 37.82 16.75 35.55
0.81 0.81 0.81 0.81
1.43 1.43 1.43 1.43
Spatial Lags and Spatial Errors Revisited
Table 2. Standard Deviation of Regression Coefficient Estimates Around True Values.
87
88
ROBIN DUBIN
the point estimate and the standard deviation. The effects are larger for the constant term than the other regression parameters. The following additional conclusions can be drawn from Tables 1 and 2. (1) Constant term. The models which omit the spatial lag (OLS and spatial error) produce estimates of the constant which are much too large, unless the value of is zero. This is not surprising, since the constant is picking up the effect of the omitted variable. The spatial lag model produces estimates which are too small unless b1 = 0. (2) The OLS estimates of 1 and 2 are biased upward when a spatial lag is present. The presence of a spatially autocorrelated error does not affect the mean estimates. (3) When spatial effects are present, the dispersion of the OLS estimates is higher than for the spatial models. Again, a spatial lag has a bigger impact than a spatial error.
Estimates of Spatial Parameters Table 3 presents the mean spatial parameter estimates. An examination of this table reveals the following conclusions. (1) The general spatial model produces very good estimates of all of the spatial parameters. (2) Spatial Error model. This model is misspecified when a spatial lag is present. In this case the estimates of the spatial error parameters, b1 and b2 , incorporate the omitted spatial lag, and are much larger than their true values. As mentioned earlier, large values of b˜ 2 indicate that the model is misspecified. b˜ 1 and b˜ 2 approach their true values as gets smaller. (3) Spatial Lag model. This model is misspecified when a spatial error is present, however, the estimate of is close to its true value, regardless of the value of b1 .
Rejection Frequencies Table 4 shows the rejection frequencies for the three null hypotheses discussed earlier. Examination of this table reveals the following. (1) In general, all of the tests behave as expected: rejection frequencies are high when spatial effects are large and diminish as the spatial effects decrease.
Spatial Error
Spatial Lag : 0.8
b: 0.8, 3 Gen SE SL b: 0.5, 3 Gen SE SL b: 0.2, 3 Gen SE SL b: 0, 3 Gen SE SL
: 0.5
: 0.2
: 0
b1
b2
b1
b2
b1
b2
b1
b2
0.792 0.953 0
3.24 22.476 0
0.783 0 0.843
0.793 0.818 0
3.201 9.324 0
0.477 0 0.593
0.793 0.759 0
3.168 4.433 0
0.177 0 0.331
0.79 0.726 0
3.139 3.28 0
−0.018 0 0.15
0.504 0.911 0
3.492 24.386 0
0.784 0 0.826
0.505 0.668 0
3.434 11.119 0
0.480 0 0.558
0.504 0.508 0
3.337 5.135 0
0.179 0 0.282
0.502 0.444 0
3.263 3.369 0
−0.016 0 0.094
0.248 0.872 0
3.205 26.918 0
0.787 0 0.806
0.249 0.528 0
3.215 13.21 0
0.481 0 0.518
0.251 0.282 0
2.971 6.305 0
0.180 0 0.227
0.243 0.199 0
2.873 3.223 0
−0.017 0 0.032
0.114 0.847 0
1.978 29.234 0
0.785 0 0.791
0.115 0.438 0
2.023 14.934 0
0.476 0 0.488
0.118 0.147 0
1.847 6.639 0
0.171 0 0.186
0.114 0.071 0
1.603 1.942 0
−0.028 0 −0.013
Spatial Lags and Spatial Errors Revisited
Table 3. Mean Estimates of the Spatial Parameters.
89
90
ROBIN DUBIN
Table 4. Rejection Percentages. Spatial Error
Spatial Lag : 0.8 (%)
: 0.5 (%)
: 0.2 (%)
: 0 (%)
b: 0.8, 3 H10 : b1 = = 0 H20 : = 0 H30 : b1 = 0
100.00 100.00 99.80
100.00 98.40 99.80
100.00 31.00 99.80
99.60 4.80 99.80
b: 0.5, 3 H10 : b1 = = 0 H20 : = 0 H30 : b1 = 0
100.00 100.00 89.00
100.00 98.20 88.40
99.60 32.60 88.80
90.40 4.20 88.80
b: 0.2, 3 H10 : b1 = = 0 H20 : = 0 H30 : b1 = 0
100.00 100.00 33.20
100.00 99.20 31.40
87.40 38.00 31.20
29.00 2.80 31.00
b: 0, 3 H10 : b1 = = 0 H20 : = 0 H30 : b1 = 0
100.00 100.00 4.40
100.00 99.80 4.40
62.20 38.60 4.00
4.40 1.80 3.60
(2) H10 : b1 = = 0. This test is quite sensitive to the presence of spatial effects. It has very high rejection rates (1.00) when high and medium spatial effects are present and has some power when there are small spatial effects present. It also has very close to the right size (4.4%) when no spatial effects are present. (3) H20 : = 0. This test performs well; the rejection rates are high for large values of and fall as gets smaller. The size of the test is correct when b1 ≥ 0.5, but the null hypothesis is not rejected enough when is small. The cutoff for this test should probably be adjusted base on the value of b˜ 1 , however, I have not explored this possibility. (4) H30 : b1 = 0. This test also performs well. It is powerful, has about the right size (4.4–3.6%) and is only slightly affected by the value of .
Model Selection using LR Tests The model selection results using the LR tests are shown in Fig. 4. This figure shows the percentage of the time each model is selected when following the procedure
Spatial Lags and Spatial Errors Revisited
91
described earlier. The notation MS1 indicates that H10 is rejected but H20 and H30 are both accepted. This means that spatial effects are present, but we don’t know the source (spatial error, spatial lag or both). Examination of Fig. 4 shows that the model selection process works very well. The full model is selected most often when the spatial effects are strong (both b1 and ≥ 0.5). The Spatial Error model is accepted most of the time when b1 is large (≥0.5) and is small (≤0.2). Similarly, the Spatial Lag model is selected when is large and b1 is small. When both b1 and are zero, OLS is selected. When spatial effects are present but small, a variety of models are selected, but this is reasonable, considering the weakness of the signal.
Model Selection Using Lagrange Multiplier Tests As a point of comparison, it is useful to compare the above results to the standard model selection technique used in the literature, which is based on Lagrange Multiplier tests. This test is popular because it is easy to use: only OLS residuals are needed. The same weight matrix is used for testing for both a spatial lag and a spatial error process. While this makes the statistics easy to compute, it is a disadvantage with this data, because the error process is very different from the spatial lag. Here the weight matrix (ten nearest neighbors) is accurate for the spatial lag, but not for the spatial error. Although this provides a harsh test of the LM method, it is the procedure that most researchers would follow. To use the LM test, two statistics are calculated: LMerror and LMlag .7 These statistics are asymptotically distributed as 2 (1). The model selection process is as follows. If LMerror is significant, accept the spatial error model. If LMlag is significant, accept the spatial lag model. If both are significant, accept the model with the larger statistic. If neither is significant, accept OLS. The results are shown in Fig. 5. Examination of this figure shows that the LM tests choose the spatial lag model about 80% of the time, regardless of the values of b1 and . The only exception to this is when b1 is 0.2 or less and is zero. In those two cases, OLS is correctly chosen. Thus, in this data, at least, the LM method is able to detect spatial effects, but almost always attributes them to spatial lag, even when the spatial error is the larger effect. Previous simulation results for the LM test have shown it to be relatively accurate, however, in this work, W was the same for both the spatial lag and the spatial error in the simulated data, and the spatial lag and spatial error were never present together. When these conditions are violated, the performance of the LM test can be quite poor.
92
ROBIN DUBIN
Fig. 4. Model Selection.
Fig. 5. Model Selection Using LM Tests.
Spatial Lags and Spatial Errors Revisited
93
PREDICTIONS Procedure In order to test the predictive abilities of the model, I simulated 1,000 observations, using the same values for  and 2 as before. The dependent variable was generated by allowing b1 and to take the values 0.8, 0.5, and 0. I randomly chose 200 observations (without replacement) from the 1,000 to serve as prediction points. The prediction procedure is local in the sense that the models are estimated for each point to be predicted. Thus, for each observation in the prediction sample, I find the closest 300 observations out of the 1,000. These 300 observations constitute the estimation sample. The estimation sample is then used to estimate the four models (OLS, Spatial Error, Spatial Lag, and General). I then make predictions using each model. I also make predictions using the model found by the Likelihood Ratio model selection procedure described above. This process is repeated for each of the 200 prediction points and for all combinations of the values of b1 and . Note that all of the observations except for the prediction point (including other members of the prediction sample) can be included in the estimation sample. ˆ where x0 are the independent variables at For OLS, the prediction is simply x 0 , the prediction site and ˆ is a vector of the least squares estimates of the regression ˜ 0 y, where ˜ parameters. For the spatial lag model, the prediction is x 0 ˜ + W and ˜ are Maximum Likelihood estimates. W0 is the weight matrix applied to the prediction point. In this case, it is a vector with ones indicating the locations of the 10 closest points in the estimation sample to the prediction location. y is the vector of dependent variables in the estimation sample. For the Spatial Error model, the prediction is x 0 ˜ + u(s ˆ 0 ), where u(s ˆ 0 ) is the estimate of the error term at the prediction location, obtained by Kriging the residuals. And finally, for the ˜ 0 y + u(s General Spatial Model the prediction is x 0 ˜ + W ˆ 0 ). Prediction Results Table 5 shows the means of the actual values of y by the values of the spatial parameters. The mean values of y are positively related to because this parameter determines the weight on the spatially lagged dependent variable (which, in this paper, is the average values of y for the ten nearest neighbors). Thus, when = 0.8, the spatially lagged dependent variable adds a lot to the value of y (0.8 times the average value for the neighbors), however, when = 0, the spatially lagged variable contributes nothing. The value of b1 does not affect the means.
94
ROBIN DUBIN
Table 5. Means of the Dependent Variable by Spatial Parameters. b1
Mean
0.8 0.8 0.8 0.5 0.5 0.5 0 0 0
0.8 0.5 0 0.8 0.5 0 0.8 0.5 0
1904 759 377 1903 758 376 1899 756 375
Table 6 shows the Mean Square Error (MSE)⁸ for each model by the values of the spatial parameters. The last row of the table, labeled "Model," is the MSE for the model selected by the LR model selection process.⁹ This table shows that the penalty for ignoring spatial effects, when they are present, is quite large: OLS has very large MSE, compared to the spatial models, except when both b1 and ρ are zero. The table also shows that the General Spatial model is a very good predictor: its predictions are very close to those of the selected model (and sometimes better) for most values of the spatial parameters. The worst performance of the General Model occurs when ρ = 0.8 and b1 = 0 (pure spatial lag), but even here the MSE is reasonably close to that of the selected model. It would appear, then, that for prediction purposes, the model selection procedure is not necessary and prediction with a possibly superfluous parameter is not costly.

Figures 6 and 7 show histograms of the prediction errors, by the values of the spatial parameters, for OLS and the General Model respectively. Examination of Fig. 6 shows that the OLS prediction errors are quite large, particularly when a spatial lag is present. Figure 7 shows that the prediction errors for the General model are much smaller and more clustered around zero for all values of the spatial parameters, except when both b1 and ρ are zero. In this case, the two distributions are the same.

Table 6. Mean Squared Errors for Each Model by Spatial Parameters.

b1       0.8      0.8     0.8     0.5      0.5     0.5     0        0       0
ρ        0.8      0.5     0       0.8      0.5     0       0.8      0.5     0
OLS      3083.05  706.00  358.74  2551.44  625.57  371.12  1762.43  527.97  409.38
SE       289.01   276.78  289.72  368.50   351.37  357.35  479.68   439.83  410.92
SL       318.58   321.71  335.45  349.18   352.60  363.90  406.49   405.87  407.57
Gen      269.44   274.01  290.41  339.96   343.42  358.11  413.17   407.88  408.85
Model    269.44   274.91  289.72  340.48   342.62  357.12  409.09   405.87  409.38

Fig. 6. Prediction Errors for OLS by Spatial Parameters.

Fig. 7. Prediction Errors for General Model by Spatial Parameters.
CONCLUSIONS

This paper has presented a model which combines the geostatistical and the weight matrix approaches in a novel way. The advantage of such a model is that it is flexible: researchers no longer have to treat model specification as a non-nested problem. If, in fact, only a spatial lag or a spatial error is present, this model will detect that structure. If both effects are present, this model can detect that as well. Proper specification is important because it allows the regression coefficients to be estimated more accurately. Additionally, estimating the parameters of the spatial error process (if present) can dramatically improve the predictive ability of the model.
NOTES

1. Using the notation of this paper, the demonstration is as follows. Assuming W_1 = W_2, rewrite the model shown in Eq. (1.a) as

$$y = \rho W y + X\beta + (I - \lambda W)^{-1}\varepsilon$$
$$(I - \rho W)y = X\beta + (I - \lambda W)^{-1}\varepsilon$$
$$y = (I - \rho W)^{-1}X\beta + (I - \rho W)^{-1}(I - \lambda W)^{-1}\varepsilon \tag{11}$$

This model is difficult to estimate because both ρ and λ are involved in the error term. The model is not strictly unidentified, however, because only ρ appears in the first term. Such a model is likely to produce unstable estimates of the spatial parameters.
2. A spatial weight matrix is an N × N (N is the number of observations) matrix which captures the spatial relationships in the data. A commonly used form is nearest neighbors, in which w_ij = 1 if i and j are nearest neighbors (that is, j is the closest observation to i) and 0 otherwise. See Anselin (1988, Chap. 3) for a more detailed discussion of weight matrices.
3. Anselin allows the possibility that u can be heteroskedastic, but I ignore that possibility here.
4. The Likelihood Ratio test is performed by calculating the statistic 2(L_U − L_R), where L_U and L_R are the values of the likelihood function for the unrestricted and restricted models, respectively. This statistic is asymptotically distributed as a χ² random variable, with degrees of freedom equal to the number of restricted parameters.
5. An analogous procedure is used for the other models.
6. I used 100 repetitions for these comparisons and only checked the likelihood function for the full model.
7. See Anselin and Rey (1991) for the formulas for these statistics.
8. $\text{MSE} = \sum_{i=1}^{200}(y_i - \hat{y}_i)^2 / 200$.
9. For the predictions, the MS1 case is treated as if the General model was selected.
ACKNOWLEDGMENT

The author wishes to thank the Lincoln Institute of Land Policy for providing funds to support this research. Tony Smith provided many helpful suggestions.
REFERENCES

Anselin, L. (1988). Spatial econometrics: Methods and models. Dordrecht: Kluwer.
Anselin, L., Bera, A., Florax, R., & Yoon, M. (1996). Simple diagnostic tests for spatial dependence. Regional Science and Urban Economics, 26, 77–104.
Anselin, L., & Florax, R. (1995). Small sample properties of tests for spatial dependence in regression models: Some further results. In: L. Anselin & R. Florax (Eds), New Directions in Spatial Econometrics (pp. 21–74). Berlin: Springer.
Anselin, L., & Rey, S. (1991). Properties of tests for spatial dependence in linear regression models. Geographical Analysis, 23, 112–131.
Basu, S., & Thibodeau, T. (1998). Analysis of spatial autocorrelation in house prices. Journal of Real Estate Finance and Economics, 17, 61–86.
Bivand, R., & Szymanski, S. (2000). Modeling the spatial impact of the introduction of compulsory competitive tendering. Regional Science and Urban Economics, 30, 203–219.
Brueckner, J. (1998). Testing for strategic interaction among local governments: The case of growth controls. Journal of Urban Economics, 44, 438–467.
Can, A. (1992). Specification and estimation of hedonic housing price models. Regional Science and Urban Economics, 22, 453–474.
Dubin, R. (1998). Predicting house prices using multiple listings data. Journal of Real Estate Finance and Economics, 17, 35–60.
Florax, R., & Folmer, H. (1992). Specification and estimation of spatial linear regression models: Monte Carlo evaluation of pre-test estimators. Regional Science and Urban Economics, 22, 405–432.
Kim, C. W., Phipps, T., & Anselin, L. (2003). Measuring the benefits of air quality improvement: A spatial hedonic approach. Journal of Environmental Economics and Management, 45, 24–39.
Ripley, B. (1981). Spatial statistics. New York: Wiley.
Saavedra, L. A. (2000). A model of welfare competition with evidence from AFDC. Journal of Urban Economics, 47, 248–279.
BAYESIAN MODEL CHOICE IN SPATIAL ECONOMETRICS

Leslie W. Hepple

ABSTRACT

Within spatial econometrics a whole family of different spatial specifications has been developed, with associated estimators and tests. This leads to issues of model comparison and model choice: measuring the relative merits of alternative specifications and then using appropriate criteria to choose the "best" model or relative model probabilities. Bayesian theory provides a comprehensive and coherent framework for such model choice, including both nested and non-nested models within the choice set. The paper reviews the potential application of this Bayesian theory to spatial econometric models, examining the conditions and assumptions under which application is possible. Problems of prior distributions are outlined, and Bayes factors and marginal likelihoods are derived for a particular subset of spatial econometric specifications. These are then applied to two well-known spatial data-sets to illustrate the methods. Future possibilities, and comparisons with other approaches to both Bayesian and non-Bayesian model choice, are discussed.
1. INTRODUCTION

The field of spatial econometrics has rapidly matured over the last decade, with many developments in model specification, estimation and testing since
Anselin’s 1988 advanced text (Anselin, 1988b), developments surveyed in Anselin (2001). Amongst those developments have been the construction and application of Bayesian techniques (Hepple, 1995a, b; LeSage, 1997, 2000), reflecting the widespread diffusion of Bayesian techniques throughout statistical modelling in the last decade (e.g. Bernardo & Smith, 1994; Carlin & Louis, 1996; Congdon, 2001; Gilks, Richardson & Spiegelhalter, 1996; Koop, 2003). Within spatial econometrics, issues of model choice and selection – how to decide between competing specifications, both nested and non-nested – were raised quite early, as in Anselin (1988a), and Bayesian techniques identified as a potentially fruitful way forward, but only limited practical advances have been made, with asymptotic measures such as AIC (Akaike’s Information Criterion) being the most popular.

This paper sets out to further examine the Bayesian approach to model choice in spatial econometrics. The Bayesian theory is elegant and coherent, but its actual applicability and implementation are strongly influenced by the detailed structuring of the models. In particular, the assumptions made (by choice or necessity) about the prior distributions substantially affect the development of valid computational expressions. After reviewing these issues, the paper sets out a framework that allows a particular family of alternative spatial specifications to be compared. For this set of models, the marginal likelihoods may be constructed and ratios directly compared. This encompasses a range of both nested and non-nested spatial forms. What it does not encompass are nested models where the regressor matrix X varies, and this is an important limitation.

It is not the argument of this paper that Bayes factors provide the panacea to problems of model choice, criticism and assessment in spatial econometrics. Rather, the argument is that Bayesian model choice provides a consistent framework (and associated probabilities) which may usefully be employed alongside other aspects of model assessment. The Bayesian perspective itself provides many of those other aspects, in terms of examination of joint and marginal distributions of parameters, tests on credible intervals, etc. This paper simply sets out to develop and implement some practical procedures for Bayesian model choice in spatial econometrics.

In the sections below we first set out the Bayesian theory on model choice and the issue of prior distributions, and examine alternative methods of computation. The necessary Bayesian expressions for a particular family of spatial econometric specifications are then constructed, followed by an empirical re-examination of two well-known data sets from the econometric literature, to provide a detailed illustration of the Bayesian methods. The paper concludes with a discussion of likely directions for further work.
2. BAYESIAN ESTIMATION

The review given here is very brief, and fuller accounts may be found in standard texts such as Bernardo and Smith (1994), Carlin and Louis (1996) or the recent Bayesian econometric text of Koop (2003), and a discussion in the context of non-nested spatial models in Hepple (1995a). Let y be a vector of n data observations, assumed to be dependent on a vector of k unknown parameters θ. Before any data are observed, our beliefs and uncertainties about θ are represented by a prior probability density p(θ). The probability model is specified by the likelihood p(y|θ), which is the probability of observing the data y given that θ is the true parameter vector. Having observed the data y, we update our beliefs about θ using Bayes' theorem to obtain the posterior distribution of θ given the data y, which is

$$p(\theta|y) = \frac{p(y|\theta)\,p(\theta)}{p(y)} \tag{1}$$

where $p(y) = \int p(y|\theta)\,p(\theta)\,d\theta$, by the law of total probability. This normalising term, often termed the "marginal likelihood," plays a central role in Bayesian model choice. Note that Eq. (1) shows that the posterior distribution is proportional to the likelihood times the prior.

The posterior distribution p(θ|y) is k-dimensional, but can often be simplified to focus on individual parameters such as θ_1 (i.e. the components of θ) by analytically or numerically integrating out the other components, so that

$$p(\theta_1|y) = \int p(\theta|y)\,d\theta_2\,d\theta_3 \ldots d\theta_k \tag{2}$$

This marginal distribution for θ_1 contains all the information needed to make inferences about θ_1, and may be summarised in measures such as its moments and 95% highest posterior density region, and graphically represented. The normalising integral p(y) may also be computed from this marginal and deployed in subsequent model choice.

This Bayesian analysis depends on the specification of both the likelihood of the model and a set of priors. This allows "strong" prior knowledge, from theory or previous experience, to be built into the estimation, but often the statistician wishes to express weak prior knowledge or ignorance. This may be done by the use of diffuse, reference or uniform priors, and such priors are used in most econometric work (and in the examples below).
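When the parameter of interest is a scalar, as the spatial parameters below will be, Eqs (1) and (2) translate directly into computation on a grid. The following is a minimal Python/NumPy sketch; the function name and the grid-based approach are illustrative rather than taken from the paper:

```python
import numpy as np

def grid_posterior(loglik, logprior, grid):
    # Posterior density of a scalar parameter on a grid via Eq. (1); the
    # normalising constant p(y) is approximated by trapezoidal integration.
    logkernel = loglik + logprior        # log(likelihood * prior)
    m = logkernel.max()                  # subtract max for numerical stability
    dens = np.exp(logkernel - m)
    z = np.trapz(dens, grid)
    return dens / z, m + np.log(z)       # posterior density and log p(y)
```

The second return value, log p(y), is the marginal likelihood that drives the model-choice machinery of the next section.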
3. BAYESIAN MODEL CHOICE

The discussion so far has assumed we have only one model under examination and estimation, but Bayesian theory readily extends to several competing models. Suppose we want to use the data y to compare two models M_1 and M_2, with parameter vectors θ_1 and θ_2. These models may be nested or non-nested. Then, using Bayes' theorem, the posterior probability that M_1 is the correct model (given that either M_1 or M_2 is) is given by

$$p(M_1|y) = \frac{p(y|M_1)\,p(M_1)}{p(y|M_1)\,p(M_1) + p(y|M_2)\,p(M_2)} \tag{3}$$

where p(y|M_k) is the marginal likelihood of the data given M_k and p(M_k) is the prior probability of the model M_k (k = 1, 2). By construction, p(M_1|y) + p(M_2|y) = 1. For example, we might begin with equal prior probabilities p(M_1) = p(M_2) = 0.5, but encounter with the data might move these to the posterior model probabilities p(M_1|y) = 0.8 and p(M_2|y) = 0.2. The marginal likelihood for model M_1 is obtained by integrating over θ_1:

$$p(y|M_1) = \int p(y|\theta_1, M_1)\,p(\theta_1|M_1)\,d\theta_1 = \int (\text{likelihood} \times \text{prior})\,d\theta_1 \tag{4}$$

It is worth emphasising that the marginal likelihood is obtained by an integration across the range of θ_1, in contrast to the maximisation that is deployed in maximum likelihood analysis. The extent to which the data support model M_1 over M_2 can also be measured by the posterior odds for M_1 against M_2, i.e. the ratio of their posterior probabilities. This is:

$$O_{1,2} = \frac{p(M_1|y)}{p(M_2|y)} = \frac{p(y|M_1)}{p(y|M_2)}\,\frac{p(M_1)}{p(M_2)} \tag{5}$$

Note that the first factor of the right-hand side is the ratio of the marginal likelihoods of the two models, and is called the Bayes factor for M_1 against M_2, denoted by B_12. The second factor is the prior odds, and this will often be equal to 1, representing the absence of a prior preference for either model: p(M_1) = p(M_2) = 0.5. The equation is thus:

Posterior odds = Bayes factor × Prior odds

In comparing scientific theories, one needs to consider the magnitude of the Bayes factors. When B_12 > 1 the data favour M_1 over M_2, and when B_12 < 1 the data favour M_2, but the Bayes factors (alternatively expressed as posterior probabilities) show by how much one is favoured: does one model dominate the other or is it a close run thing? If two models are non-nested and based on the same number of parameters (e.g. in choosing the weights matrix of a spatial autoregressive model), then one might always choose the more probable model, but if one wants to accept one theory (model) and reject another (model) then one might demand high probabilities. Where models are nested, so that M_2 is nested within M_1, the more general model will only be "accepted," and the simpler "rejected," if the model probabilities are strongly in favour of M_1. Jeffreys, in his classic development of the Bayesian theory of probability, suggested rules of thumb for interpreting B_21 (Jeffreys, 1961), based on an order-of-magnitude interpretation using a logarithmic scale (Table 1).

Table 1. Jeffreys' "Weight of Evidence" for Bayes Factors.

B_21 > 1                        Evidence supports M_2
1 > B_21 > 10^(−1/2)            Very slight evidence against M_2
10^(−1/2) > B_21 > 10^(−1)      Moderate evidence against M_2
10^(−1) > B_21 > 10^(−2)        Strong to very strong evidence against M_2
10^(−2) > B_21                  Decisive evidence against M_2

Posterior odds may be used to compare any pair of models within a larger set of p models. In addition, one may calculate the posterior probabilities for each model within such a large choice set. Denoting the models M_j, j = 1, 2, ..., p, the posterior probability of M_1 is given by:

$$\pi_1 = \frac{1}{1 + O_{2,1} + \cdots + O_{p,1}} \tag{6}$$

with equivalent expressions for each model M_j, and the odds ratio for any pair of models M_i/M_j is thus

$$O_{i,j} = \frac{\pi_i}{\pi_j} \tag{7}$$

Where the prior probabilities for each model in the set are the same, the priors can be omitted, giving the simpler expression for the posterior probabilities:

$$\pi_i = \frac{p(M_i|y)}{\sum_{j=1}^{p} p(M_j|y)} \tag{8}$$
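In practice it is convenient to hold the marginal likelihoods in logs and convert to the probabilities of Eq. (8) in one step. A small sketch in Python, assuming equal prior model probabilities so that the priors cancel (the function name is illustrative):

```python
import numpy as np

def posterior_model_probs(log_marglik):
    # Posterior model probabilities, Eq. (8), from log marginal likelihoods
    # under equal priors; subtracting the maximum guards against overflow.
    lml = np.asarray(log_marglik, dtype=float)
    w = np.exp(lml - lml.max())
    return w / w.sum()

# Two-model example using the Ohio marginal likelihoods reported later in
# Table 3 (8.0667 for OLS, 132.4122 for AR):
probs = posterior_model_probs(np.log([8.0667, 132.4122]))
# -> approximately [0.0574, 0.9426], posterior odds of about 16.4 for AR
```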
4. PRIORS: PROPER AND IMPROPER

Bayesian theory provides an elegant, coherent and comprehensive framework for model estimation and model choice. However, the elegance and applicability depend on an ability to specify the prior distributions for the parameters, p(θ). The analyst often does not have much, or any, prior information, or does not want to incorporate subjective expectations, and so he/she wants or needs to employ "diffuse" or "non-informative" prior distributions. Such diffuse priors are the norm in scientific work. Thus, in a regression model context, one can assume diffuse normal-distribution priors for β and a gamma prior for the error precision (e.g. Koop, 2003). As one lets the priors become more vague or diffuse, they tend to uniform distributions (Broemeling, 1985; Chaloner & Brant, 1988), and such uniform distributions have long been a basis for much Bayesian econometric work (Zellner, 1971). These uniform distributions are, however, not just non-informative, but improper in that they have no defined limits and so the integral of the prior distribution is not defined. This impropriety is a problem in some contexts and not in others. Much Bayesian model estimation and analysis can be conducted with such priors, and posterior distributions of parameters can usually be constructed. However, problems certainly arise for model comparison and choice, precisely the context here.

As the previous section showed, model choice depends on ratios between model posterior probabilities. If the improper priors for the two models are different, then they do not cancel each other properly. If the two models have different numbers of parameters with improper priors, then the ratio of the two models becomes either zero or infinite depending on which model is numerator and which denominator (Koop, 2003, pp. 40–42). Problems can also arise in situations where the number of parameters with improper priors is the same but the X-matrices are defined in different units between the two models (Koop, 2003, p. 42). The analyst wanting to develop Bayesian model choice therefore has either to work with informative priors – which might be quite diffuse but still need to be defined – or avoid situations where the improper priors fail to cancel. Koop (2003) has neatly summarised this context in a rule of thumb (henceforth "Koop's Rule"): "When comparing models using posterior odds ratios, it is acceptable to use noninformative priors over parameters which are common to all the models. However, informative, proper priors should be used over all other parameters." He adds: "This rule of thumb is relevant not only for the regression model, but for virtually any model you might wish to use" (Koop, 2003, p. 42).

Koop's Rule defines a class of model comparisons where non-informative, improper priors may be used, and this class or group will be a key component in this paper. But it should be noted that, even if one is prepared to employ and define proper priors, the actual derivation and implementation of marginal likelihoods is often difficult. Informative priors for spatial models have been constructed and used in recent work by LeSage and Pace (2003), where a matrix exponential spatial specification (MESS) is developed, based on the matrix-logarithmic formulation
of Chiu, Leonard and Tsui (1996). This specification closely replicates the spatial autoregressive covariances, but MESS circumvents the worst Jacobian problems of more orthodox spatial specifications. This facilitates model derivation and estimation. The specification allows them to develop Bayesian comparison of weights matrices, using normal-gamma priors for β and σ² and a normal prior for their spatial parameter; the derived expressions can then be evaluated for different prior assumptions to test sensitivity. The formulation is then incorporated in models with alternative explanatory variables, as noted below.
5. IMPLEMENTING BAYESIAN MODEL CHOICE

The derivations given above provide the tools for Bayesian model choice. The practical question is how to implement the theory. The marginal likelihoods for each model lie at the heart of the theory: can they be computed directly or must they be approximated or simulated? If they can be computed directly, then the Bayesian methods can be used via analytical and numerical integration. This is the approach used in a range of econometric studies and is the method developed in this paper for spatial econometric applications. However, for many model specifications and large sample-sizes, such methods may be unavailable or impractical, and two main alternatives have been suggested: (a) the use of asymptotic approximations in the form of BIC or Bayes Information Criteria, and (b) the use of Markov chain Monte Carlo (MCMC) simulation for marginal likelihood calculation or model choice. Here we briefly review the two approaches, so that we can later examine their potential for spatial problems.
5.1. BIC

The BIC (Bayesian Information Criterion) has been constructed as an asymptotic approximation to the marginal likelihood. Essentially, it uses the maximum likelihood estimate as the base for the approximation, and the validity of this depends on a number of assumptions. For "regular" statistical models (and the term "regular" will be examined below), it may be shown that, as n increases, the marginal likelihood may be approximated by terms from a Taylor series expansion around the maximum of the likelihood function. The details need not be rehearsed here, and the final expression is that:

$$\log p(y|M_1) = \log p(y|\hat{\theta}_1, M_1) - \frac{k}{2}\log n + O(n^{-1/2}) \tag{9}$$
Table 2. Raftery's "Grades of Evidence" for BIC Differences, Bayes Factors and Posterior Probability of M_1.

BIC Difference    Bayes Factor    p(M_1|y) (%)    Evidence
0–2               1–3             50–75           Weak
2–6               3–20            75–95           Positive
6–10              20–150          95–99           Strong
>10               >150            >99             Very strong
where log p(y|θ̂_1, M_1) is the log-likelihood value at the maximum. The term −(k/2) log n adjusts the log-likelihood value for the number of parameters in the model (in relation to sample size), so that models with more parameters are not automatically favoured. This type of adjustment is also standard in other, similar measures of log-likelihood-based model comparison such as AIC (Akaike's Information Criterion). However, the actual marginal likelihood, because it is based on integration across all the parameter space of the model, automatically takes model complexity into account, and does not need such adjustment. As Raftery (1995) shows, this BIC approximation can be used in place of the Bayes factor B_12. This comparison is most readily expressed on the scale of twice the logarithm (Table 2):

$$2\log B_{12} = 2\left(\log p(y|\hat{\theta}_1, M_1) - \log p(y|\hat{\theta}_2, M_2)\right) - (k_1 - k_2)\log n + O(n^{-1/2}) \tag{10}$$

BIC obviously has the great advantage that posterior odds or Bayes factors can be computed using only the maximum of the likelihood function, and the comparisons often relate closely to classical frequentist tests. Thus, when M_2 is nested within M_1, the log Bayes factor becomes

$$2\log B_{12} \approx \chi^{2}_{12} - df_{12}\log n \tag{11}$$

where χ²_12 is the standard likelihood-ratio test statistic for testing M_2 against M_1 and df_12 = k_1 − k_2 is the number of degrees of freedom associated with the test. It has to be noted, however, that the approximation is asymptotic and assumes the model is "regular." On the first aspect, many likelihood-based measures are asymptotic, but one benefit of the Bayesian perspective is its finite-sample character, a benefit lost if asymptotic approximations are used. On the second aspect, the properties of "regular" statistical models are the standard ones of asymptotic normality etc., used by the Taylor series expansion in the derivation of BIC. Such properties are indeed standard for models with independent and identically distributed observations, and they have been shown to also apply to
many models with non-independent observations in time-series analysis. However, rigorous asymptotic theory for such models has only been developed recently, and its extension to models with spatial dependence is, as yet, limited. See Kelejian and Prucha (1998) and Anselin (2001) for a discussion of this. It cannot, therefore, be taken-as-read that the BIC approximation will work well for spatial models.
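To make the approximation concrete, the criteria reported in the empirical tables below can be computed as in the following Python sketch. The k/2 scaling of the penalty and the parameter counts used, regression coefficients plus σ plus any spatial parameters, are assumptions chosen so as to reproduce the published table values rather than conventions stated in the text:

```python
import numpy as np

def bic(loglik, k, n):
    # BIC on the -2*log-likelihood scale of Tables 3-12; the k/2 penalty
    # factor is an assumption that matches the reported values.
    return -2.0 * loglik + 0.5 * k * np.log(n)

def aic(loglik, k):
    # AIC = -2*loglik + 2k, as reported in the tables.
    return -2.0 * loglik + 2.0 * k

# Ohio data, n = 49: M_OLS with k = 4 (three betas plus sigma) and
# M_AR with k = 5 (adding lambda):
print(bic(-170.3950, 4, 49), aic(-170.3950, 4))  # ~ 348.5736, 348.7900
print(bic(-166.3983, 5, 49), aic(-166.3983, 5))  # ~ 342.5262, 342.7966
```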
5.2. MCMC

A second alternative is the estimation of the marginal likelihood by simulation techniques. These simulations need not be MCMC simulations, but this has become the most popular format, and, given the very widespread development and application of MCMC techniques in the last decade, it is undoubtedly the most fruitful arena for such simulation-based estimation. MCMC is a technique for simulating posterior distributions. Repeatedly sampling from the distributions produces samples that can be used to approximate joint and marginal posterior distributions (Carlin & Louis, 1996; Chib, 2001; Gilks, Richardson & Spiegelhalter, 1996). The potential of these methods for spatial econometrics was briefly noted in Hepple (1995b), but the explicit formulation and implementation was first developed and applied by LeSage (1997, 2000).

Estimation of the marginal likelihood from MCMC output is not straightforward, and has usually been restricted to cases where the conditional distributions are available in closed form and related instances. See Gelfand and Dey (1994) for early work, and more recent surveys in Carlin and Louis (1996) and Chib (2001). Closed forms are not available for the spatial econometric forms, but Chib and Jeliazkov's recent extensions to Metropolis-Hastings estimation of marginal likelihoods suggest possibilities (Chib & Jeliazkov, 2001). Even where methods do exist, they are not necessarily precise estimates, and this is still a developing field where improvements are being made. For nested models, the Savage-Dickey density ratio does allow model comparison (Koop, 2003, pp. 69–71), but this does not encompass non-nested forms. The key point here is that MCMC marginal likelihood estimates are not yet available for the family of spatial econometric models. It should be noted that the calculation of marginal likelihoods in MCMC still requires the specification of proper prior distributions (so that posterior model probabilities will be conditional on what was specified for the various models), but MCMC does allow rapid recomputation with alternative assumptions and thus assessment of how sensitive the model probabilities are to varying prior assumptions.

The marginal likelihood estimation discussed thus far is undertaken for each model separately, and then these results are used to compute odds-ratios and model
probabilities. The perspective is that of choosing the "best" (i.e. most probable) model from the set. The MCMC literature also suggests a rather different route: sampling across both models and parameters within a single MCMC structure to give Bayesian model averaging (Chib, 2001; Fernández, Ley & Steel, 2001), known as the MC3 (Markov chain Monte Carlo model composition) approach. This is potentially very attractive, for it directly generates the model probabilities and the posteriors for model parameters (either within models or averaged across all or selected models). Several different methods have already been developed for this approach, such as Green's reversible jump technique (Green, 1995) and those of Carlin and Chib (1995). Successful application of these model-space MCMC techniques demands fine-tuning to the specific problem and they are not yet available as off-the-shelf MCMC algorithms. LeSage and Pace (2003) have begun the exploration of this framework for spatial econometric models, linking their MESS model with the selection of alternative explanatory X-variables in the manner of Fernández, Ley and Steel. For the priors of β they employ the flexible g-prior suggested by Zellner (1986). This is likely to open up a rich vein for spatial econometric modelling, and the challenge will be to try to develop similar derivations and analyses for the wider set of spatial econometric specifications.
6. A FAMILY OF SPATIAL ECONOMETRIC MODELS

This section defines a set or family of spatial econometric models where the relevant expressions needed for Bayesian model choice can be constructed and applied within the boundaries drawn by Koop's Rule. It focuses on the derivation of the marginal likelihood for each of the models, showing how it can be calculated by integration across a single marginal distribution (or across a two-dimensional distribution for more complicated spatial specifications). The central feature uniting the family of models is a division of the model parameters into two groups, θ_1 and θ_2. θ_1 consists of the regression parameters β and the disturbance variance σ², and θ_2 consists of the spatial parameters in the models. For θ_1 improper uniform priors are assumed, but for θ_2 proper priors are defined. These are in fact uniform priors, but within strictly defined boundaries, so that the integrals are defined and the priors thus proper. In each specification, identical θ_1 are employed, together with the same X-matrices, and only the spatial specifications change. Thus the family falls within the remit of Koop's Rule. The analysis begins with the model with spatial autoregressive disturbances and uses this case to set out the notation and definitions. Briefer analyses are then provided for several other models.
First consider the model

$$y = X\beta + u \tag{12}$$

where y is the n × 1 vector of spatial observations on the dependent variable for n regions, X is the n × k matrix of observations on the k independent variables, β is the k × 1 vector of regression coefficients, and u is the n × 1 vector of disturbances, with covariance matrix σ²V, where V is a positive definite symmetric matrix. Assume that u is generated by a spatial autoregressive model of the form:

$$u = \lambda W u + e \tag{13}$$

where u is an n × 1 vector of autocorrelated disturbances, λ is a spatial autoregressive parameter, e is an n × 1 vector of independent errors, with e ∼ N(0, σ²I), and W is an n × n matrix of (weighted and standardized) linkages between adjacent regions. λ has a feasible range of length D = 1/k_max − 1/k_min, where the limits are functions of the extreme eigenvalues of W. For a row-standardised weights matrix (the usual form) k_max = 1.0. Defining the matrix P = (I − λW), the matrix products P′P = V⁻¹ and |V|^(−1/2) = |P| can be denoted. The likelihood function for this model is given by:

$$L = \frac{|P|}{\sigma^{n}}(2\pi)^{-n/2}\exp\left(-\frac{1}{2\sigma^{2}}(y - X\beta)'V^{-1}(y - X\beta)\right) \tag{14}$$

As noted above, for each model in the family, this paper develops the method under the assumption of uniform, non-informative priors for β and σ². Non-informative, locally-uniform (and improper) priors can be expressed as:

$$p(\beta, \sigma) \propto \frac{1}{\sigma}$$

For the spatial autoregressive parameter λ, a uniform, but proper, distribution over the range D is assumed:

$$p(\lambda) = \frac{1}{D}$$
The joint posterior density distribution is:

$$p(\beta, \sigma, \lambda\,|\,y) = \frac{1}{p(y)}\,\frac{1}{D}\,\frac{|P|}{\sigma^{n+1}}(2\pi)^{-n/2}\exp\left(-\frac{1}{2\sigma^{2}}(y - X\beta)'V^{-1}(y - X\beta)\right) \tag{15}$$

As shown in Hepple (1995b), σ and β can be analytically integrated out to give the marginal for λ, p(λ|y). This marginal is:

$$p(\lambda|y) = \frac{1}{p(y)}\,\frac{1}{D}\,\Gamma\!\left(\frac{n-k}{2}\right)\frac{1}{2}\,\frac{2^{(n-k)/2}}{(2\pi)^{(n-k)/2}}\,\frac{|P|}{|X^{*\prime}X^{*}|^{1/2}\,(s^{2})^{(n-k)/2}} \tag{16}$$

where X* = PX = X − λWX, y* = Py = y − λWy, and s² is the residual sum-of-squares of the regression of y* on X*. The marginal likelihood or normalising integral is the integral of the (unnormalised) marginal pdf:

$$p(y) = \frac{1}{D}\,\Gamma\!\left(\frac{n-k}{2}\right)\frac{1}{2}\,\frac{2^{(n-k)/2}}{(2\pi)^{(n-k)/2}}\int \frac{|P|}{|X^{*\prime}X^{*}|^{1/2}\,(s^{2})^{(n-k)/2}}\,d\lambda \tag{17}$$

This marginal likelihood for the model with autoregressive disturbances will be denoted as p(y|M_AR). For some types of Bayesian analysis it is not necessary to include all the components of this expression in the calculations. Thus if one is focused on graphs and moments of the marginal distribution of λ, all or any of the terms before the integral can be omitted, for they are simply scaling the result. Thus Hepple (1995a, b) omits these terms. However, for model comparison some of them have to be included. The Γ(·) and (2π) terms are significant if two models differ in the value of k, and the "range of integration" term 1/D is important if two models have different limits to the feasible region of the spatial parameter. Hepple (1995a) only developed non-nested comparison where k did not differ, and so the Γ(·) and (2π) terms could be omitted. However, he also omitted the 1/D term. This could be omitted where the same W matrix was being used in different spatial specifications, but its inclusion is important if two different W-matrices are being compared. This point is discussed further below.

Equivalent developments may be made for a range of spatial specifications, set out here as subsections. It should be emphasised that in each case identical assumptions about the priors for β and σ² are made and the X-matrix is always identical.
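The one-dimensional integration in Eq. (17) is straightforward to implement. The sketch below (Python with NumPy/SciPy; assumed names, not the paper's code) evaluates the log of the integrand of Eq. (16) on a grid of λ values and applies the trapezoidal rule, working in logs throughout. The paper's own computations use only a 16- to 24-point evaluation, and for large n the determinant would be handled by the eigenvalue or sparse-matrix methods discussed in Section 8.4; setting |P| = 1 and dropping the 1/D term and the integral recovers the OLS expression (18) below.

```python
import numpy as np
from numpy.linalg import slogdet, lstsq
from scipy.special import gammaln

def log_marglik_ar(y, X, W, lam_grid):
    # log p(y | M_AR), Eq. (17), by trapezoidal quadrature over lambda.
    # A sketch assuming a dense W and a grid spanning the feasible
    # interval, so that D = lam_grid[-1] - lam_grid[0].
    n, k = X.shape
    D = lam_grid[-1] - lam_grid[0]
    logs = []
    for lam in lam_grid:
        P = np.eye(n) - lam * W
        Xs, ys = P @ X, P @ y                       # X* = PX, y* = Py
        beta = lstsq(Xs, ys, rcond=None)[0]
        s2 = float(np.sum((ys - Xs @ beta) ** 2))   # residual sum of squares
        logs.append(slogdet(P)[1] - 0.5 * slogdet(Xs.T @ Xs)[1]
                    - 0.5 * (n - k) * np.log(s2))
    logs = np.array(logs)
    m = logs.max()
    log_int = m + np.log(np.trapz(np.exp(logs - m), lam_grid))
    # Constant terms of Eq. (16).
    const = (gammaln(0.5 * (n - k)) - np.log(2.0) - np.log(D)
             + 0.5 * (n - k) * (np.log(2.0) - np.log(2.0 * np.pi)))
    return const + log_int
```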
6.1. Standard Regression Specification

For the standard regression model, with k independent variables denoted by X and the standard assumptions and uninformative priors as above:

$$p(y|M_{OLS}) = \Gamma\!\left(\frac{n-k}{2}\right)\frac{1}{2}\,\frac{2^{(n-k)/2}}{(2\pi)^{(n-k)/2}}\,\frac{1}{|X'X|^{1/2}\,(s^{2})^{(n-k)/2}} \tag{18}$$

where s² is the residual sum-of-squares of the regression of y against X.
6.2. Spatial Moving Average Model

The first-order spatial moving average disturbance model (Sneek & Rietveld, 1997a) may be expressed as

$$y = X\beta + u \tag{19}$$

with

$$u = \theta W e + e = (I + \theta W)e = Ge \tag{20}$$

The feasible parameter range for θ has length D = 1/k_max − 1/k_min, where the limits are functions of the extreme eigenvalues of W. For this model the relevant expressions can be derived as:

$$p(\theta|y) = \frac{1}{p(y)}\,\frac{1}{D}\,\Gamma\!\left(\frac{n-k}{2}\right)\frac{1}{2}\,\frac{2^{(n-k)/2}}{(2\pi)^{(n-k)/2}}\,\frac{|G^{-1}|}{|X^{*\prime}X^{*}|^{1/2}\,(s^{2})^{(n-k)/2}} \tag{21}$$

where X* = G⁻¹X, y* = G⁻¹y, and s² is the residual sum-of-squares of the regression of y* on X*.

$$p(y|M_{MA}) = p(y) = \frac{1}{D}\,\Gamma\!\left(\frac{n-k}{2}\right)\frac{1}{2}\,\frac{2^{(n-k)/2}}{(2\pi)^{(n-k)/2}}\int \frac{|G^{-1}|}{|X^{*\prime}X^{*}|^{1/2}\,(s^{2})^{(n-k)/2}}\,d\theta \tag{22}$$

6.3. "Spatial Effects" Model

The spatial spillover or "spatial effects" model specifies the spillover in the y-variable rather than the disturbances:

$$y = \rho W y + X\beta + u \tag{23}$$

with u ∼ N(0, σ²I). The parameter range for ρ is given as for the spatial autoregressive case, and defining P = (I − ρW) and y* = Py, the relevant expressions are:

$$p(\rho|y) = \frac{1}{p(y)}\,\frac{1}{D}\,\Gamma\!\left(\frac{n-k}{2}\right)\frac{1}{2}\,\frac{2^{(n-k)/2}}{(2\pi)^{(n-k)/2}}\,\frac{|P|}{|X'X|^{1/2}\,(s^{2})^{(n-k)/2}} \tag{24}$$

where s² is the residual sum-of-squares of the regression of y* on X. Note that X is not transformed at all in this model, and in the expressions below the terms (X′X)⁻¹ and |X′X|^(−1/2) replace the starred (or transformed) forms used in the autocorrelated disturbances models; in the spatial-effects model these terms do not vary with ρ and so can be taken outside the integrations.

$$p(y|M_{LY}) = p(y) = \frac{1}{D}\,\Gamma\!\left(\frac{n-k}{2}\right)\frac{1}{2}\,\frac{2^{(n-k)/2}}{(2\pi)^{(n-k)/2}}\,\frac{1}{|X'X|^{1/2}}\int \frac{|P|}{(s^{2})^{(n-k)/2}}\,d\rho \tag{25}$$
6.4. Higher-order Specifications

Higher-order spatial specifications (for autoregressive, moving-average and spatial effects models) may also be constructed, and there is a limited literature on such models; see Brandsma and Ketellapper (1979), Hepple (1995b), Rietveld and Wintershoven (1998), Sneek and Rietveld (1997b). A range of applications has been suggested, and such models deserve further exploration. For a second-order autoregressive form:

$$u = \lambda_1 W_1 u + \lambda_2 W_2 u + e \tag{26}$$

the forms above may be employed, with V⁻¹ = P′P and P = (I − λ₁W₁ − λ₂W₂). This allows the definition of the transformed variables X* = PX = X − λ₁W₁X − λ₂W₂X, y* = Py = y − λ₁W₁y − λ₂W₂y, and s² is the residual sum-of-squares of the regression of y* on X*. This gives the form:

$$p(y|M_{AR2}) = p(y) = \frac{1}{D}\,\Gamma\!\left(\frac{n-k}{2}\right)\frac{1}{2}\,\frac{2^{(n-k)/2}}{(2\pi)^{(n-k)/2}}\int\!\!\int \frac{|P|}{|X^{*\prime}X^{*}|^{1/2}\,(s^{2})^{(n-k)/2}}\,d\lambda_1\,d\lambda_2 \tag{27}$$

It is, however, well-established that the feasible parameter regions for such models (both autoregressive and moving average) are not rectangular (see Hepple, 1995b; Sneek & Rietveld, 1997b for examples), so that the integration region is also non-rectangular. Both the definition of the region, and the necessary integrations to obtain p(y|M_AR2), are not straightforward, and this specification is not considered in the empirical application here. It may be that this model is a prime candidate for MCMC techniques of model determination, though this is not straightforward either.
6.5. ARMAX or LYMA Specifications

There is one more general specification that may have utility. This is the model where there are spatial effects or spillovers in the y-variable and also possible autocorrelation in the disturbance term. The disturbance component can be modelled as an autoregressive form, and this structure has been examined in a number of studies (e.g. Kelejian & Prucha, 1998); however, it may be argued that the more local process of the moving average may be more appropriate if there is already a global autoregressive process operating in the y-variable. This is the model discussed by, amongst others, Florax, Folmer and Rey (1998). The specification has been termed the ARMAX model, by analogy with the time-series literature, and the LYMA (lagged y with moving average) model. The marginals for this model are given below, but the equivalents for an autoregressive disturbance model are also readily formed. The LYMA model may be expressed:

$$y = \rho_1 W_1 y + X\beta + \theta_2 W_2 e + e \tag{28}$$

where terms are defined as earlier. The weights matrices W_1 and W_2 may be the same or distinct matrices, and the parameter range D_1 for ρ₁ is given by the extreme eigenvalues of W_1, using the autoregressive expression, and D_2 for θ₂ by those of W_2, using the moving average expression. For this model, unlike the higher-order autoregressive or moving-average specifications, the two spatial components are separable in that the Jacobians enter the likelihood and Bayesian functions as two independent components and the region of integration is rectangular. Thus all the expressions for the marginals depend on standard two-dimensional integration across ρ₁ and θ₂. Using P = (I − ρ₁W₁), G = (I + θ₂W₂) and y† = y − ρ₁W₁y, the expressions are:

$$p(\rho_1, \theta_2\,|\,y) = \frac{1}{p(y)}\,\frac{1}{D_1 D_2}\,\Gamma\!\left(\frac{n-k}{2}\right)\frac{1}{2}\,\frac{2^{(n-k)/2}}{(2\pi)^{(n-k)/2}}\,\frac{|P|\,|G^{-1}|}{|X^{*\prime}X^{*}|^{1/2}\,(s^{2})^{(n-k)/2}} \tag{29}$$

where X* = G⁻¹X, y* = G⁻¹y†, and s² is the residual sum-of-squares of the regression of y* on X*.

$$p(y|M_{LYMA}) = p(y) = \frac{1}{D_1 D_2}\,\Gamma\!\left(\frac{n-k}{2}\right)\frac{1}{2}\,\frac{2^{(n-k)/2}}{(2\pi)^{(n-k)/2}}\int\!\!\int \frac{|P|\,|G^{-1}|}{|X^{*\prime}X^{*}|^{1/2}\,(s^{2})^{(n-k)/2}}\,d\rho_1\,d\theta_2 \tag{30}$$
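Because the (ρ₁, θ₂) region in Eq. (30) is rectangular, the two-dimensional integration reduces to nested one-dimensional quadrature. A small sketch (assumed names), given the log-integrand evaluated on a grid as in the one-parameter case:

```python
import numpy as np

def log_marglik_2d(log_kernel, grid1, grid2):
    # log_kernel[i, j]: log of the Eq. (30) integrand at (grid1[i], grid2[j]).
    m = log_kernel.max()
    inner = np.trapz(np.exp(log_kernel - m), grid2, axis=1)  # over theta_2
    return m + np.log(np.trapz(inner, grid1))                # then over rho_1
```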
7. MORE GENERAL FORMS

The set of models outlined in the previous section has all been built around a common framework: all the models make identical assumptions about β and σ² (uninformative and improper priors) and use identical X variables. The differences between the models are entirely in the spatial components, which are all modelled with proper priors. This allows explicit comparison of the marginal likelihoods in the posterior probabilities. However, several basic and interesting spatial specifications are excluded from this framework. In particular, all models that augment X with the spatially-lagged form WX are excluded. This form is of interest in itself, and in terms of the general specification:

$$y = \rho W y + X\beta + W X\gamma + e \tag{31}$$

This includes both the Wy term and also a spatial spillover from the independent variables. Since WX is a transformation of exogenous variables, these additional components can simply be added to X to define an extended set of 2k − 1 columns (the minus one is because the constant term is not also lagged). This model, denoted as M_GEN, is a generalised form of the spatial autoregressive model, as can be seen by rewriting the autoregressive-disturbance model:

$$(I - \lambda W)y = (I - \lambda W)X\beta + (I - \lambda W)u$$
$$y = \lambda W y + X\beta - W X(\lambda\beta) + u - \lambda W u = \lambda W y + X\beta - W X(\lambda\beta) + e \tag{32}$$

This is the general form constrained such that γ = −(λβ). Tests of the common factor restriction, using either Wald tests on the general form or a likelihood ratio test, have been constructed and applied for spatial econometric models (Bivand, 1984; Burridge, 1981), but an appealing alternative approach would be to use Bayesian comparison. However, if one assumes an improper uniform prior on γ (and there are no grounds for making it proper by putting an explicit range on it), this violates "Koop's Rule of Thumb."
8. EMPIRICAL ILLUSTRATIONS

The expressions for the marginal likelihoods may now be employed to calculate posterior probabilities and Bayes factors for different spatial specifications. BIC and AIC measures will also be reported to provide asymptotic and likelihood-based comparisons. This process can be illustrated using two empirical examples: the well-known data-sets of the Ohio crime data (Anselin, 1988b) and the Irish agricultural data (Cliff & Ord, 1969), both of which have been extensively used in the spatial econometric literature.

Table 3. Ohio Data: Marginal and Log-likelihoods, Standardised Contiguity-weights.

Model    Marginal Likelihood    Log-likelihood    π_i       BIC        AIC
M_OLS    8.0667                 −170.3950         0.0574    348.5736   348.7900
M_AR     132.4122               −166.3983         0.9426    342.5262   342.7966
8.1. Ohio Crime Data

The Ohio crime data, set out in Anselin (1988b, pp. 187–191), is a model explaining crime incidence in 49 districts of Columbus, Ohio, in terms of locality income and housing values. The first application employs the standardised contiguity-weights matrix outlined by Anselin. Other forms of W are examined later. The initial "choice set" is a simple comparison between OLS and AR. Table 3 gives the marginal likelihoods and log-likelihoods for the two models. The posterior probabilities are p(OLS) = 0.0574 and p(AR) = 0.9426, giving an odds ratio of 16.41 in favour of the AR model. Using Jeffreys' rule-of-thumb, this falls in the "strong evidence" category, though it does not quite make this cutoff in Raftery's amended table, for p(OLS) is just above 0.05. In Bayesian terms, the evidence is strong for p(AR).

The choice set can now be expanded to include the other specifications outlined earlier: the moving-average disturbance model (MA), spatial-effects (LY), and spatial effects with moving-average disturbances (LYMA). Table 4 gives the marginal likelihoods and log-likelihoods for this full set. The maximum likelihood estimate lies very close to the parameter boundary and is unreliable. The posterior probabilities now show that M_MA is the best-choice model, with a ratio of almost 2:1 against M_AR, whilst M_LY and M_LYMA are substantially inferior. Note that BIC and AIC, based on point estimates, favour the M_LY specification.

Table 4. Ohio Data: Marginal and Log-likelihoods, Standardised Contiguity-weights.

Model     Marginal Likelihood    Log-likelihood    π_i        BIC        AIC
M_OLS     8.0667                 −170.3950         1.3891     348.5736   348.7900
M_AR      132.4122               −166.3983         22.7997    342.5262   342.7966
M_LY      102.2556               −165.4082         17.6070    340.5460   340.8164
M_MA      240.7516               −166.0903         41.4541    341.9102   342.1806
M_LYMA    97.2780                −164.4959         16.7499    340.6673   340.9918

The analysis thus far has been based on models using the standardised contiguity-matrix for W, but clearly choice of W is an issue as well as type of model. Here we examine two other specifications. Labelling the standardised contiguity-matrix as W_1, we also construct the second-order contiguity-matrix W_2 and a distance-based matrix W_3. The latter is constructed using the X-Y coordinates in Anselin (1988a) with a cutoff of 5.0 units distance: below the cutoff, inverse distance provides the weight, with zero weight beyond the cutoff; again, weights were standardised. These two forms should enable some discrimination: the methods should choose W_1 or W_3 against the second-order W_2, but perhaps the distance-weighted version W_3 may prove better than the simple contiguity-weights in W_1. Tables 5 and 6 give the results for these forms. As with the W_1 weights, maximum likelihood estimation of the models with both LY and MA components led to boundary solutions and hence unreliable AIC and BIC estimates.

For specifications using W_2, AIC and BIC show that the basic OLS form is superior to all others, if the M_LYMA specification with the unreliable estimate is omitted. The explicit marginal likelihoods do, however, suggest that models with MA components are strongly superior to OLS. However, comparisons with Table 4, the W_1 results, demonstrate that none of the spatial specifications using W_2 compete with those using W_1. These results are very positive: the methods – both AIC and explicit marginal likelihood – are good at selecting W_1 over W_2.

For specifications using the distance weights W_3, the results show more consistency across the measures. If we omit the unreliable estimates once again, then both AIC and BIC recognise the superiority of MA specifications, and this is confirmed by the explicit marginal likelihoods. This repeats the results for the basic contiguity weights W_1 discussed earlier, but comparison of Tables 4 and 6 shows the dominance of W_3 specifications over W_1 specifications. For the autoregressive and spatial-effects specifications, the differences are smaller, but all in the same direction: the distance-based weights are undoubtedly superior to the first-order contiguity weights, and M_MA with W_3-weights has easily the highest posterior probability across specifications and weights-matrices.

Table 5. Ohio Data: Marginal and Log-likelihoods, Standardised Second-order Contiguity-weights.

Model     Marginal Likelihood    Log-likelihood    π_i        BIC        AIC
M_OLS     8.0667                 −170.3950         18.2572    348.5736   348.7900
M_AR      2.9080                 −170.3931         6.5816     350.5158   350.7862
M_LY      2.5492                 −169.8670         5.7696     349.4635   349.7339
M_MA      29.5019                −170.3925         66.7715    350.5146   350.7850
M_LYMA    1.1576                 −168.5224         2.6201     348.7203   349.0449

Table 6. Ohio Data: Marginal and Log-likelihoods, Standardised Distance-based Weights.

Model     Marginal Likelihood    Log-likelihood    π_i        BIC        AIC
M_OLS     8.0667                 −170.3950         0.3443     348.5736   348.7900
M_AR      195.4920               −165.9701         8.3607     341.6698   341.9403
M_LY      184.0185               −164.5884         7.8705     338.9063   339.1767
M_MA      1453.5847              −164.3115         62.1656    338.3526   338.6230
M_LYMA    497.0939               −163.5616         21.2589    315.4477   315.1232
8.2. The Irish Data

The second application is to the Irish data-set that has been one of the staples of spatial econometrics since Cliff and Ord (1969). Other studies using the data include Bivand (1984), Blommestein (1983), Burridge (1981) and Hepple (1995a). The model is a simple regression explaining the percentage of gross agricultural output of each Irish county consumed by itself (y) with arterial road accessibility as the independent variable. Bivand employed standardised contiguity weights, whilst Burridge and Blommestein used general weights incorporating boundary-length and distance between county centroids; Hepple used both sets of weights.

For standardised contiguity weights (Table 7), the posterior probabilities clearly reject M_OLS and favour the spatial-effects model M_LY. BIC and AIC would also select this specification. For general weights (Table 8), the posterior probabilities strongly favour M_AR against M_LY, but BIC and AIC would again select M_LY. If both sets of weights and all four specifications are pitted against one another, Bayes posterior probabilities would select M_AR with general weights, but BIC and AIC would both select M_LY with standardised weights. The reason for these discrepancies between inferences using the maximum of the likelihood function (i.e. AIC and BIC) and the Bayes integrals lies precisely in what the two perspectives are measuring: AIC and related measures compare the heights of the likelihoods at their maxima, whereas the Bayesian marginal likelihoods compare the specifications in terms of the integrals across the whole marginal posterior: M_LY has a higher peak or mode, but it is a sharper distribution, and M_AR has more mass to it.

Table 7. Irish Data: Marginal and Log-likelihoods, Standardised Contiguity-based Weights.

Model    Marginal Likelihood    Log-likelihood    π_i        BIC        AIC
M_OLS    5.6686                 −60.7547          0.0946     126.3965   127.5094
M_AR     1885.5619              −54.2811          31.4763    115.0783   116.5621
M_LY     2614.5924              −51.6528          43.6464    109.8218   111.3056
M_MA     1484.5877              −56.4452          24.7827    119.4066   120.8904

Table 8. Irish Data: Marginal and Log-likelihoods, Standardised General Weights.

Model    Marginal Likelihood    Log-likelihood    π_i        BIC        AIC
M_OLS    5.6686                 −60.7547          0.1160     126.3965   127.5094
M_AR     2982.2816              −53.6546          61.0258    113.8255   115.3093
M_LY     1643.7532              −52.3201          33.6358    111.1564   112.6402
M_MA     255.2154               −55.7828          5.2224     118.0819   119.5657

Two further aspects of model choice can be examined using the Irish examples. Selection of the weights matrix can be made using Bayesian probabilities. For the Irish data this was examined in Hepple (1995a). However, it should be noted that Hepple did not take account of the differing normalising factors for the two weights matrices, i.e. that 1/D varies with the weights specification. In the event, the rankings of models and weights matrices do not change when the appropriate normalising values are employed (as in Tables 7 and 8), though the detailed probability values do. Thus, for M_LY, Hepple reported probabilities of 0.6543 for contiguity weights and 0.3457 for general weights, whereas the correct values are 0.6140 and 0.3860. Using general weights favours the M_AR model whereas contiguity weights favours M_LY, but, as a comparison of Tables 7 and 8 shows, overall M_AR with general weights is the winner by a small margin.

The second aspect that the tables demonstrate very clearly is the poor performance of the M_MA specification for the Irish example. In contrast to the Ohio example, the MA form is easily rejected, and the two examples together demonstrate the ability of the methods to discriminate well between the two forms.
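As a quick check of the π_i columns, each posterior probability is just a marginal likelihood divided by the sum across the choice set, as in Eq. (8). For example, in Python, with the Table 7 values:

```python
import numpy as np

ml = np.array([5.6686, 1885.5619, 2614.5924, 1484.5877])  # OLS, AR, LY, MA
print(100 * ml / ml.sum())  # ~ [0.0946, 31.4763, 43.6464, 24.7827] (Table 7)
```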
8.3. Extending the Model Choices

The explicit use of posterior probabilities in the model comparisons has been possible because each of the models has been derived under exactly the same assumptions about the improper priors which are integrated out, i.e. all the specifications have the same assumptions about β, X and σ². However, this has meant excluding models that allow the independent variables to be extended to include WX, thus excluding the important M_GEN model and other forms such as M_MAWX. This is quite a limitation, and means one cannot assess common-factor properties (M_AR against M_GEN). Unless one is able to make the priors proper in some way, progress is blocked for the marginal likelihood approach. One alternative is to use BIC as a Bayesian framework for these comparisons, and here the choice set can be expanded. It must, however, be noted that BIC is based on a point-estimate and the results may conflict with those based on the entire marginal likelihood, as has already been seen in the illustrations above.

Table 9 presents BIC and AIC results for the Ohio models, using standardised contiguity weights, and now including M_GEN, M_WX and M_MAWX in the set. The models M_LYMA and M_LYMAWX were both excluded because the estimated maxima were located on boundaries and are unreliable. It has already been observed that BIC and AIC favour M_LY against M_AR or M_MA, both of which have higher posterior probabilities using the marginal likelihood calculations. BIC would marginally reject the common factor model (M_AR) in favour of M_GEN, whereas AIC tips the other way, but both would favour the M_MAWX model.

Table 9. Ohio Data: Log-likelihoods, BIC and AIC for Standardised Contiguity-weights.

Model     Log-likelihood    BIC        AIC
M_OLS     −170.3950         348.5736   348.7900
M_AR      −166.3983         342.5262   342.7966
M_LY      −165.4082         340.5460   340.8164
M_MA      −166.0903         341.9102   342.1806
M_WX      −167.0959         345.8673   346.1918
M_GEN     −164.4113         342.4440   342.8226
M_MAWX    −164.1166         339.9087   340.2332

Using standardised distance-based weights, the results are very similar (Table 10). For the Irish data using general weights, both Blommestein (1983) and Burridge (1981) rejected the common factor model, using Wald and likelihood ratio tests. Table 11 confirms this with AIC and BIC, and again M_LY is the preferred form over both M_AR and M_GEN using these asymptotic criteria. Neither M_WX nor M_MA models find support. The lack of support for a more general form involving WX as a component does, however, still leave BIC and AIC favouring a different specification to the Bayesian posterior probabilities, as noted earlier.
Table 10. Ohio Data: Log-likelihoods, BIC and AIC for Standardised Distance-based Weights.

Model     Log-likelihood    BIC        AIC
M_OLS     −170.3950         348.5736   348.7900
M_AR      −165.9701         341.6698   341.9403
M_LY      −164.5884         338.9063   339.1767
M_MA      −164.3115         338.3526   338.6230
M_WX      −166.9891         345.6537   345.9782
M_GEN     −164.5302         342.6818   343.0604
M_MAWX    −162.1082         337.8377   338.2164
Table 11. Irish Data: Log-likelihoods, BIC and AIC for Standardised General Weights.

Model    Log-likelihood    BIC        AIC
M_OLS    −60.7547          126.3965   127.5094
M_AR     −53.6546          113.8255   115.3093
M_LY     −52.3201          111.1564   112.6402
M_MA     −55.7828          118.0819   119.5657
M_WX     −56.4269          119.3700   120.8538
M_GEN    −52.3015          112.7483   114.6031
For standardised contiguity weights (Table 12), the picture is similar: BIC and AIC would favour the general model M_GEN, more decisively than in the previous case, but AIC would narrowly favour M_LY again. BIC does provide a way of treating a more general set of model specifications, but the price is moving to an asymptotic framework that, in the small or finite-sample context of these illustrations, can and does produce contradictory results to those provided by explicit Bayesian posterior probabilities.

Table 12. Irish Data: Log-likelihoods, BIC and AIC for Standardised Contiguity-based Weights.

Model    Log-likelihood    BIC        AIC
M_OLS    −60.7547          126.3965   127.5094
M_AR     −54.2811          115.0783   116.5621
M_LY     −51.6528          109.8218   111.3056
M_MA     −56.4452          119.4066   120.8904
M_WX     −53.7616          114.0394   115.5233
M_GEN    −50.8031          109.7514   111.6061
8.4. Larger-sample Application of the Methods

The empirical illustrations, using the two classic data-sets for Ohio and Ireland, work with small n (n = 49 and n = 26 respectively), and consideration needs to be given to the computation of the required marginal likelihoods for larger sample-sizes. The central issue is the computation of the determinant |P| = |I − λW| for M_AR or its equivalent in terms of ρ for M_LY. This is normally done by first computing the eigenvalues of W and then using these to give fast calculation of the determinant for any given value of λ or ρ. This is quite feasible for sample-sizes of several hundreds, and n = 50–500 is typical of most regional econometric applications, but once one works with the very large samples of several thousands now increasingly available (such as the 3,100 U.S. counties), the methods become less feasible. Here the advanced sparse-matrix determinant methods developed by Barry and Pace (1999) and related techniques will be important, calculating the determinant for a specified grid or set of values of λ. It is important to note that the marginal likelihood expressions (and hence |P|) usually only need evaluation at about 16 points to achieve high accuracy.

The moving average specification M_MA generates further large-sample-size difficulties, for it requires the inverse |G⁻¹|. One approach is to simplify using both the eigenvalues and eigenvectors of W, but this accelerates the large-sample issues; the other, used in the empirical work here, is to directly calculate the inverse (or rather to solve the linear equations involving the inverse). For very large sample sizes, sparse matrix techniques are available for these numerical computations, but they have not yet been applied to these models.
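The eigenvalue strategy described above amounts to a single costly decomposition followed by cheap evaluations of the log-determinant at each grid point. A minimal Python sketch; for the very large samples mentioned, the Barry and Pace (1999) sparse Monte Carlo estimator would replace the dense eigendecomposition assumed here:

```python
import numpy as np

def logdet_grid(W, rho_grid):
    # Precompute log|I - rho*W| for a grid of rho values from the
    # eigenvalues of W (computed once). A row-standardised W may have
    # complex eigenvalues; their imaginary parts cancel in the sum.
    eig = np.linalg.eigvals(W).astype(complex)
    return {float(r): float(np.real(np.sum(np.log(1.0 - r * eig))))
            for r in rho_grid}
```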
9. CONCLUSIONS

This paper has argued that Bayesian theory provides both a logical and attractive perspective for model choice in spatial econometrics, with some advantages over frequentist perspectives, though one should see the two as complementary rather than competitive. Computation of marginal likelihoods is at the core of any implementation of Bayesian model choice, and this paper has derived expressions for the marginal likelihoods of each of the members of a family of spatial econometric specifications. These marginal likelihoods can then be computed with a simple one-dimensional integration using a 16- or 24-point evaluation, making the methods applicable and also computationally competitive with maximum likelihood methods. Applications to model choice for two well-known spatial data sets have been described. These show the ability of the methods to discriminate between appropriate specifications, and also demonstrate the differences there may be between exact calculation of marginal likelihoods, using analytical and
numerical integration, and asymptotic approximations such as BIC. In finite spatial samples, approximation of integrals from the height of the maximum likelihood may be misleading. Both nested and non-nested model choice may be made systematically within the framework outlined, but the work has also discussed just how circumscribed this framework is: whilst proper priors are used for the spatial parameters in the models, improper priors are used for β and σ², and this means the choice set is restricted to models with identical X and identical β and σ² priors, thus excluding many nested specifications, such as M_AR versus M_GEN. Without working with proper priors – and the way to incorporate such priors into the derivations of the standard spatial econometric specifications is not straightforward – it is difficult to proceed. Hence the appeal of asymptotic approximators such as BIC. But, certainly for small sample sizes, BIC and Bayes model probabilities can produce substantially different rankings of models. Given these limitations, alternative avenues look attractive for further work. MCMC is fast developing new measures, and the flexible way it can be used to respecify variations in prior assumptions, thus allowing sensitivity analysis, suggests it may be the most promising route. The recent work of LeSage and Pace (2003) provides initial confirmation. In this, the analytical approach may be important in providing a baseline against which simulation-based expressions for the marginal likelihoods can be evaluated: for the basic models discussed in the present paper, MCMC methods, when developed, should be able to provide good approximations to the analytical results. In this way, the two Bayesian perspectives can assist each other.
ACKNOWLEDGMENTS The author would like to thank an anonymous referee for helpful comments and corrections on an earlier draft. Peter Green and Paul Plummer of the University of Bristol also provided enjoyable discussion and helpful comments.
REFERENCES

Anselin, L. (1988a). Model validation in spatial econometrics: A review and evaluation of alternative approaches. International Regional Science Review, 11, 279–316.
Anselin, L. (1988b). Spatial econometrics: Methods and models. Dordrecht: Kluwer.
Anselin, L. (2001). Spatial econometrics. In: B. H. Baltagi (Ed.), A Companion to Theoretical Econometrics (pp. 310–330). Oxford: Blackwell.
Barry, R., & Pace, R. K. (1999). A Monte Carlo estimator of the log determinant of large sparse matrices. Linear Algebra and its Applications, 289, 41–54.
Bernardo, J. M., & Smith, A. F. M. (1994). Bayesian theory. Chichester: Wiley.
Bivand, R. (1984). Regression modelling with spatial dependence: An application of some class selection and estimation methods. Geographical Analysis, 16, 25–37.
Blommestein, H. J. (1983). Specification and estimation of spatial econometric models. Regional Science and Urban Economics, 13, 251–270.
Brandsma, A. S., & Ketellapper, R. H. (1979). A biparametric approach to spatial autocorrelation. Environment and Planning A, 11, 51–58.
Broemeling, L. D. (1985). Bayesian analysis of linear models. New York: Marcel Dekker.
Burridge, P. (1981). Testing for a common factor in a spatial autoregressive model. Environment and Planning A, 13, 795–800.
Carlin, B. P., & Chib, S. (1995). Bayesian model choice via Markov chain Monte Carlo methods. Journal of the Royal Statistical Society B, 57, 473–484.
Carlin, B. P., & Louis, T. A. (1996). Bayesian and empirical Bayes methods for data analysis. London: Chapman & Hall.
Chaloner, K., & Brant, R. (1988). A Bayesian approach to outlier detection and residual analysis. Biometrika, 75, 651–659.
Chib, S. (2001). Markov chain Monte Carlo methods: Computation and inference. In: J. J. Heckman & E. Leamer (Eds), Handbook of Econometrics (pp. 3569–3649). New York: Elsevier.
Chib, S., & Jeliazkov, I. (2001). Marginal likelihood from the Metropolis-Hastings output. Journal of the American Statistical Association, 96, 270–281.
Chiu, T. Y. M., Leonard, T., & Tsui, K. W. (1996). The matrix-logarithmic covariance model. Journal of the American Statistical Association, 91, 198–210.
Cliff, A. D., & Ord, J. K. (1969). Spatial autocorrelation. London: Pion.
Congdon, P. (2001). Bayesian statistical modelling. Chichester: Wiley.
Fernández, C., Ley, E., & Steel, M. F. J. (2001). Benchmark priors for Bayesian model averaging. Journal of Econometrics, 100, 381–427.
Florax, R. J. G. M., Folmer, H., & Rey, S. (1998). The relevance of Hendry's econometric methodology. Tinbergen Institute Discussion Paper, 98-125/4.
Gelfand, A. E., & Dey, D. K. (1994). Bayesian model choice: Asymptotics and exact calculations. Journal of the Royal Statistical Society Series B, 56, 501–514.
Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (Eds) (1996). Markov chain Monte Carlo in practice. London: Chapman & Hall.
Green, P. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82, 711–732.
Hepple, L. W. (1995a). Bayesian techniques in spatial and network econometrics: 1. Model comparison and posterior odds. Environment and Planning A, 27, 447–469.
Hepple, L. W. (1995b). Bayesian techniques in spatial and network econometrics: 2. Computational methods and algorithms. Environment and Planning A, 27, 615–644.
Jeffreys, H. (1961). Theory of probability (3rd ed.). Oxford: Clarendon Press.
Kelejian, H. H., & Prucha, I. R. (1998). A generalized spatial two-stage least squares procedure for estimating a spatial autoregressive model with autoregressive disturbances. Journal of Real Estate Finance and Economics, 17, 99–121.
Koop, G. (2003). Bayesian econometrics. Chichester: Wiley.
LeSage, J. P. (1997). Bayesian estimation of spatial autoregressive models. International Regional Science Review, 20, 113–129.
LeSage, J. P. (2000). Bayesian estimation of limited dependent variable spatial autoregressive models. Geographical Analysis, 32, 19–35.
LeSage, J. P., & Pace, R. K. (2003). A matrix exponential spatial specification. Submitted for publication.
Raftery, A. E. (1995). Bayesian model selection in social research (with discussion). Sociological Methodology, 25, 111–163.
Rietveld, P., & Wintershoven, P. (1998). Border effects and spatial autocorrelation in the supply of network infrastructure. Papers in Regional Science, 77, 265–276.
Sneek, J. M., & Rietveld, P. (1997a). On the estimation of the spatial moving average model. Tinbergen Institute Discussion Paper, 97-049/4.
Sneek, J. M., & Rietveld, P. (1997b). Higher order spatial ARMA models. Tinbergen Institute Discussion Paper, 97-043/3.
Zellner, A. (1971). An introduction to Bayesian inference in econometrics. New York: Wiley.
Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In: P. K. Goel & A. Zellner (Eds), Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti (pp. 233–243). Amsterdam: North-Holland.
A BAYESIAN PROBIT MODEL WITH SPATIAL DEPENDENCIES

Tony E. Smith and James P. LeSage

ABSTRACT

A Bayesian probit model with individual effects that exhibit spatial dependencies is set forth. Since probit models are often used to explain variation in individual choices, these models may well exhibit spatial interaction effects due to the varying spatial location of the decision makers. That is, individuals located at similar points in space may tend to exhibit similar choice behavior. The model proposed here allows for a parameter vector of spatial interaction effects that takes the form of a spatial autoregression. This model extends the class of Bayesian spatial logit/probit models presented in LeSage (2000) and relies on a hierarchical construct that we estimate via Markov Chain Monte Carlo methods. We illustrate the model by applying it to the 1996 presidential election results for 3,110 U.S. counties.
1. INTRODUCTION

Probit models with spatial dependencies were first studied by McMillen (1992), where an EM algorithm was developed to produce consistent (maximum likelihood) estimates for these models. As noted by McMillen, such estimation procedures tend to rely on asymptotic properties, and hence require large sample sizes for validity. An alternative hierarchical Bayesian approach to
non-spatial probit models was introduced by Albert and Chib (1993), which is more computationally demanding but provides a flexible framework for modeling with small sample sizes. LeSage (2000) first proposed extending Albert and Chib's approach to models involving spatial dependencies, and this work extends the class of models that can be analyzed in this framework. Our extension relies on an error structure that involves an additive error specification first introduced by Besag et al. (1991) and subsequently employed by many authors (as for example in Gelman et al., 1995). As will be shown, this approach allows both spatial dependencies and general spatial heteroscedasticity to be treated simultaneously.

The paper begins by motivating the basic probit model in terms of an explicit choice-theoretic context involving individual behavioral units. Since probit models are often used to explain variation in individual choices, these models may well exhibit spatial interaction effects due to the varying spatial location of the decision makers. That is, individuals located at similar points in space may tend to exhibit similar choice behavior. A key element in this context is the spatial grouping of individuals by region. Here we assume that individuals within each region are homogeneous, suggesting that all spatial dependencies and heteroscedastic effects occur at the regional level. We also show that the case of spatial dependencies between individuals can be handled by treating individuals as separate "regions."

The model proposed here is set forth in Section 2. This model allows for a parameter vector of spatial interaction effects that takes the form of a spatial autoregression. It extends the class of Bayesian spatial logit/probit models presented in LeSage (2000) and relies on a hierarchical construct also presented in this section. We estimate the model using Markov Chain Monte Carlo (MCMC) methods to derive estimates by simulating draws from the complete set of conditional distributions for parameters in the model. Section 3 sets forth these conditional distributions for our model. Section 4 treats two special cases of the model. Section 5 provides illustrations based on generated data sets as well as an application of the method to the voting decisions from 3,110 U.S. counties in the 1996 presidential election. Conclusions are contained in Section 6.
2. A SPATIAL PROBIT MODEL

For the sake of concreteness, we motivate the model in Section 2.1 in terms of an explicit choice-model formulation detailed in Amemiya (1985, Section 9.2). Section 2.2 sets forth the Bayesian hierarchical structure that we use.
2.1. Choices Involving Spatial Agents

Suppose there exists data on the observed choices for a set of individuals distributed within a system of spatial regions (or zones), $i = 1, \ldots, m$. In particular, suppose that the relevant choice context involves two (mutually exclusive and collectively exhaustive) alternatives, which we label "0" and "1." Examples might be a voting decision or a specific type of purchase decision. The observed choice for each individual $k = 1, \ldots, n_i$ in region $i$ is treated as the realization of a random choice variable, $Y_{ik}$, where $Y_{ik} = 1$ if individual $k$ chooses alternative 1 and $Y_{ik} = 0$ otherwise. In addition, it is postulated that choices are based on utility-maximizing behavior, where $k$'s utility for each of these alternatives is assumed to be of the form:

$$U_{ik0} = \gamma' \omega_{ik0} + \alpha_0' s_{ik} + \theta_{i0} + \varepsilon_{ik0}$$
$$U_{ik1} = \gamma' \omega_{ik1} + \alpha_1' s_{ik} + \theta_{i1} + \varepsilon_{ik1} \qquad (1)$$

Here $\omega_{ika}$ is a vector of observed attributes of alternative $a\,(= 0, 1)$ taken to be relevant for $k$ (possibly differing in value among individuals), and $s_{ik}$ is an $s$-dimensional vector of observed attributes of individual $k$. It is convenient to assume that $k$'s region of occupancy is always included as an observed attribute of $k$. To formalize this, we let $\delta_i(k) = 1$ if $k$ is in region $i$ and $\delta_i(k) = 0$ otherwise, and henceforth assume that $s_{ikj} = \delta_j(k)$ for $j = 1, \ldots, m$ (so that by assumption $s \geq m$). The terms $\theta_{ia} + \varepsilon_{ika}$ represent the contribution to utility of all other relevant unobserved properties of $i$, $k$ and $a$. These are separated into a regional effect, $\theta_{ia}$, representing the unobserved utility components of alternative $a$ common to all individuals in region $i$, and an individualistic effect, $\varepsilon_{ika}$, representing all other unobserved components. In particular, the individualistic components $(\varepsilon_{ika} : k = 1, \ldots, n_i)$ are taken to be conditionally independent given $\theta_{ia}$, so that all unobserved dependencies between individual utilities for $a$ within region $i$ are assumed to be captured by $\theta_{ia}$. If we let the utility difference for individual $k$ be denoted by

$$z_{ik} = U_{ik1} - U_{ik0} = \gamma'(\omega_{ik1} - \omega_{ik0}) + (\alpha_1 - \alpha_0)' s_{ik} + (\theta_{i1} - \theta_{i0}) + (\varepsilon_{ik1} - \varepsilon_{ik0}) = x_{ik}' \beta + \theta_i + \varepsilon_{ik} \qquad (2)$$

with parameter vector $\beta = (\gamma', \alpha_1' - \alpha_0')'$ and attribute vector $x_{ik} = (\omega_{ik1}' - \omega_{ik0}', s_{ik}')'$, and with $\theta_i = \theta_{i1} - \theta_{i0}$ and $\varepsilon_{ik} = \varepsilon_{ik1} - \varepsilon_{ik0}$, then it follows from the utility-maximization hypothesis that

$$\Pr(Y_{ik} = 1) = \Pr(U_{ik1} > U_{ik0}) = \Pr(z_{ik} > 0) \qquad (3)$$
At this point it should be emphasized that model [(2), (3)] has many alternative interpretations. Perhaps the most general interpretation is in terms of linear models with limited information: if the elements $x_{ikj}$, $j = 1, \ldots, q$, are regarded as general explanatory variables, then model [(2), (3)] can be interpreted as a standard linear model with "grouped observations" in which only the events "$z_{ik} > 0$" and "$z_{ik} \leq 0$" are observed (as in Albert & Chib, 1993, for example). However, we shall continue to appeal to the above choice-theoretic interpretation in motivating subsequent details of the model.

Turning next to the unobserved components of the model, it is postulated that all unobserved dependencies between the utility differences for individuals in separate regions are captured by dependencies between the regional effects $(\theta_i : i = 1, \ldots, m)$. In particular, the unobserved utility-difference aspects common to individuals in a given region $i$ may be similar to those for individuals in neighboring regions. This is operationalized by assuming that the interaction-effects vector, $\theta$, exhibits the following spatial autoregressive structure:¹

$$\theta_i = \rho \sum_{j=1}^{m} w_{ij} \theta_j + u_i, \qquad i = 1, \ldots, m \qquad (4)$$

where the nonnegative weights $w_{ij}$ are taken to reflect the degree of "closeness" between regions $i$ and $j$. In addition, it is assumed that $w_{ii} \equiv 0$ and that the row sums $\sum_{j=1}^{m} w_{ij}$ are normalized to one,² so that $\rho$ can be taken to reflect the overall degree of spatial influence (usually nonnegative). Finally, the residuals $u_i$ are assumed to be iid normal variates with zero means and common variance $\sigma^2$. Now, if we let $\theta = (\theta_i : i = 1, \ldots, m)$ denote the regional-effects vector, and similarly let $u = (u_i : i = 1, \ldots, m)$, then these assumptions can be summarized in vector form as

$$\theta = \rho W \theta + u, \qquad u \sim N(0, \sigma^2 I_m) \qquad (5)$$
where $W = (w_{ij} : i, j = 1, \ldots, m)$ and where $I_m$ denotes the $m$-square identity matrix for each integer $m > 0$. For our subsequent analysis, it is convenient to solve for $\theta$ in terms of $u$ as follows. Let

$$B_\rho = I_m - \rho W \qquad (6)$$

and assume that $B_\rho$ is nonsingular, so that by (5):

$$\theta = B_\rho^{-1} u \;\Rightarrow\; \theta \,|\, (\rho, \sigma^2) \sim N[0, \sigma^2 (B_\rho' B_\rho)^{-1}] \qquad (7)$$
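As a small illustration of (7), the following hedged Python sketch (all names assumed) draws $\theta$ given $(\rho, \sigma^2)$ by solving $B_\rho \theta = u$ rather than forming $B_\rho^{-1}$ explicitly:

```python
# A minimal sketch of sampling theta from its SAR prior (7).
import numpy as np

def draw_theta_prior(W, rho, sigma2, rng):
    m = W.shape[0]
    B = np.eye(m) - rho * W                        # B_rho from (6)
    u = rng.normal(scale=np.sqrt(sigma2), size=m)  # u ~ N(0, sigma^2 I_m)
    return np.linalg.solve(B, u)                   # theta = B_rho^{-1} u
```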
Turning next to the individualistic components, $\varepsilon_{ik}$, observe that without further evidence about specific individuals in a given region $i$, it is reasonable to treat these components as exchangeable and hence to model the $\varepsilon_{ik}$ as conditionally iid normal variates with zero means³ and common variance $v_i$, given $\theta_i$. In particular, regional differences in the $v_i$'s allow for possible heteroscedasticity effects in the model.⁴ Hence, if we now denote the vector of individualistic effects of region $i$ by $\varepsilon_i = (\varepsilon_{ik} : k = 1, \ldots, n_i)'$, then our assumptions imply that $\varepsilon_i \,|\, \theta_i \sim N(0, v_i I_{n_i})$. We can express the full individualistic-effects vector $\varepsilon = (\varepsilon_i' : i = 1, \ldots, m)'$ as

$$\varepsilon \,|\, \theta \sim N(0, V) \qquad (8)$$

where the full covariance matrix $V$ is given by:

$$V = \mathrm{diag}(v_1 I_{n_1}, \ldots, v_m I_{n_m}) \qquad (9)$$
We emphasize here that, as motivated earlier, all components of $\varepsilon$ are assumed to be conditionally independent given $\theta$. Expression (2) can also be written in vector form by setting $z_i = (z_{ik} : k = 1, \ldots, n_i)'$ and $X_i = (x_{ik} : k = 1, \ldots, n_i)'$, so that utility differences for each region $i$ take the form:

$$z_i = X_i \beta + \theta_i 1_i + \varepsilon_i, \qquad i = 1, \ldots, m \qquad (10)$$

where $1_i = (1, \ldots, 1)'$ denotes the $n_i$-dimensional unit vector. Then by setting $n = \sum_i n_i$ and defining the $n$-vectors $z = (z_i' : i = 1, \ldots, m)'$ and $X = (X_i' : i = 1, \ldots, m)'$,⁵ we can reduce (10) to the single vector equation

$$z = X\beta + \Delta\theta + \varepsilon \qquad (11)$$

where

$$\Delta = \mathrm{diag}(1_1, \ldots, 1_m) \qquad (12)$$

If the vector of regional variances is denoted by $v = (v_i : i = 1, \ldots, m)'$, then the covariance matrix $V$ in (8) can be written using this notation as

$$V = \mathrm{diag}(v) \qquad (13)$$
Finally, if $\delta(A)$ denotes the indicator function for each event $A$ (in the appropriate underlying probability space), so that $\delta(A) = 1$ for all outcomes in which $A$ occurs and $\delta(A) = 0$ otherwise, then by definition

$$\Pr(Y_{ik} = 1 \,|\, z_{ik}) = \delta(z_{ik} > 0), \qquad \Pr(Y_{ik} = 0 \,|\, z_{ik}) = \delta(z_{ik} \leq 0) \qquad (14)$$

If the outcome value $y_{ik} \in \{0, 1\}$, then (following Albert & Chib, 1993) these relations may be combined as follows:

$$\Pr(Y_{ik} = y_{ik} \,|\, z_{ik}) = \delta(y_{ik} = 1)\delta(z_{ik} > 0) + \delta(y_{ik} = 0)\delta(z_{ik} \leq 0) \qquad (15)$$

Hence, letting $Y = (Y_{ik} : k = 1, \ldots, n_i,\; i = 1, \ldots, m)$, it follows that for each possible observed set of choice outcomes, $y \in \{0, 1\}^n$,

$$\Pr(Y = y \,|\, z) = \prod_{i=1}^{m} \prod_{k=1}^{n_i} \{\delta(y_{ik} = 1)\delta(z_{ik} > 0) + \delta(y_{ik} = 0)\delta(z_{ik} \leq 0)\} \qquad (16)$$
2.2. Hierarchical Bayesian Extension

While this model could in principle be estimated using EM methods similar to McMillen (1992), the following Bayesian approach is more robust with respect to small sample sizes, and allows detailed analysis of parameter distributions obtained by simulating from the posterior distribution of the model. As with all Bayesian models, one begins by postulating suitable prior distributions for all parameters $(\beta, v, \sigma^2, \rho)$, and then derives the corresponding conditional posterior distributions given the observed data. In the analysis to follow it is convenient to represent $v$ equivalently using the covariance matrix $V$ and to write the relevant parameter vector as $(\beta, V, \sigma^2, \rho)$. The prior distributions employed for these parameters are taken to be diffuse priors wherever possible, and conjugate priors elsewhere. As is well known (see for example Gelman et al., 1995), this choice of priors yields simple intuitive interpretations of the posterior means as weighted averages of standard maximum-likelihood estimators and prior mean values (developed in more detail below). The following prior distribution hypotheses are standard for linear models such as (11) (see for example Geweke, 1993; LeSage, 1999):

$$\beta \sim N(c, T) \qquad (17)$$
$$r/v_i \sim \text{iid } \chi^2(r) \qquad (18)$$
$$1/\sigma^2 \sim \Gamma(\alpha, \nu) \qquad (19)$$
$$\rho \sim U[\lambda_{\min}^{-1}, \lambda_{\max}^{-1}] \qquad (20)$$
Here $\beta$ is assigned a normal conjugate prior, which can be made "almost diffuse" by centering at $c = 0$ and setting $T = tI_q$ for some sufficiently large $t$. More generally, the mean vector $c$ and covariance matrix $T$ are used by the investigator to reflect subjective prior information assigned as part of the model specification. The variances $\sigma^2$, together with $(v_i : i = 1, \ldots, m)$, are given (conjugate) inverse gamma priors. A diffuse prior for $\sigma^2$ would involve setting the parameters $(\alpha = \nu = 0)$ in (19). The prior distribution for each $v_i$ is the inverse chi-square distribution, which is a special case of the inverse gamma. This choice has the practical advantage of yielding a simple t-distribution for each component of $\varepsilon$ (as discussed in Geweke, 1993). Here the choice of value for the hyperparameter $r$ is more critical, in that this value plays a key role in the posterior estimates of heteroscedasticity among regions, which we discuss below. We employ a uniform prior on $\rho$ that is diffuse over the relevant range of values for the model in (5). In particular, if $\lambda_{\min}$ and $\lambda_{\max}$ denote the minimum and maximum eigenvalues of $W$, then (under our assumptions on $W$) it is well known that $\lambda_{\min} < 0$, $\lambda_{\max} > 0$, and that $\rho$ must lie in the interval $[\lambda_{\min}^{-1}, \lambda_{\max}^{-1}]$ (see for example Lemma 2 in Sun et al., 1999). The densities corresponding to (17), (19), and (20) are given respectively by:
$$\pi(\beta) \propto \exp\left[-\frac{1}{2}(\beta - c)' T^{-1} (\beta - c)\right] \qquad (21)$$
$$\pi(\sigma^2) \propto (\sigma^2)^{-(\alpha+1)} \exp\left(-\frac{\nu}{\sigma^2}\right) \qquad (22)$$
$$\pi(\rho) \propto 1 \qquad (23)$$

where the inverse gamma density in (22) can be found in standard Bayesian texts such as Gelman et al. (1995, p. 474). Note also that the diffuse density for $\sigma^2$ with $\alpha = \nu = 0$ is of the form $\pi(\sigma^2) \propto 1/\sigma^2$. Finally, the prior density of each $v_i$, $i = 1, \ldots, m$, can be obtained by observing from (18) that the variate $\phi = \phi(v_i) = r/v_i$ has chi-square density

$$f(\phi) \propto \phi^{(r/2)-1} \exp\left(-\frac{\phi}{2}\right) \qquad (24)$$

This together with the Jacobian expression $|d\phi/dv_i| = r/v_i^2$ then implies that

$$\pi(v_i) = f[\phi(v_i)] \cdot \left|\frac{d\phi}{dv_i}\right| \propto \left(\frac{r}{v_i}\right)^{(r/2)-1} \exp\left(-\frac{r}{2v_i}\right) \cdot \frac{r}{v_i^2} \propto v_i^{-((r/2)+1)} \exp\left(-\frac{r}{2v_i}\right) \qquad (25)$$
which is seen from (22) to be an inverse gamma distribution with parameters $\alpha = \nu = r/2$ (as in expression (6) of Geweke, 1993). These prior parameter densities imply corresponding prior conditional densities for $\theta$, $\varepsilon$, and $z$. To begin with, observe from (7) that the prior conditional density of $\theta$ given $(\rho, \sigma^2)$ is of the form

$$\pi(\theta \,|\, \rho, \sigma^2) \propto (\sigma^2)^{-m/2} |B_\rho| \exp\left(-\frac{1}{2\sigma^2} \theta' B_\rho' B_\rho \theta\right) \qquad (26)$$

and similarly (8) implies that the conditional prior density of $\varepsilon$ given $V$ is

$$\pi(\varepsilon \,|\, V) \propto |V|^{-1/2} \exp\left(-\frac{1}{2} \varepsilon' V^{-1} \varepsilon\right) \qquad (27)$$

This in turn implies that the conditional prior density of $z$ given $(\beta, \theta, V)$ has the form

$$\pi(z \,|\, \beta, \theta, V) \propto |V|^{-1/2} \exp\left[-\frac{1}{2}(z - X\beta - \Delta\theta)' V^{-1} (z - X\beta - \Delta\theta)\right] = \prod_{i=1}^{m} \prod_{k=1}^{n_i} v_i^{-1/2} \exp\left[-\frac{(z_{ik} - x_{ik}'\beta - \theta_i)^2}{2v_i}\right] \qquad (28)$$
3. ESTIMATING THE MODEL

Estimation will be achieved via Markov Chain Monte Carlo methods that sample sequentially from the complete set of conditional distributions for the parameters. To implement the MCMC sampling approach we need to derive the complete conditional distributions for all parameters in the model. We then proceed to sample sequential draws from these distributions for the parameter values. Gelfand and Smith (1990) demonstrate that MCMC sampling from the sequence of complete conditional distributions for all parameters in the model produces a set of estimates that converge in the limit to the true (joint) posterior distribution of the parameters.

To derive the conditional posterior distributions, we use the basic Bayesian identity and the prior densities from Section 2,

$$p(\beta, \theta, \rho, \sigma^2, V, z \,|\, y) \cdot p(y) = p(y \,|\, \beta, \theta, \rho, \sigma^2, V, z) \cdot \pi(\beta, \theta, \rho, \sigma^2, V, z) \qquad (29)$$

where $p(\cdot)$ indicates posterior densities (i.e. involving the $y$ observations). This identity together with the assumed prior independence of $\beta$, $\rho$, $\sigma^2$, and $V$ implies that the posterior joint density $p(\beta, \theta, \rho, \sigma^2, V, z \,|\, y)$ is given up to a constant of proportionality by

$$p(\beta, \theta, \rho, \sigma^2, V, z \,|\, y) \propto p(y \,|\, z)\, \pi(z \,|\, \beta, \theta, V)\, \pi(\theta \,|\, \rho, \sigma^2)\, \pi(\beta)\, \pi(\rho)\, \pi(\sigma^2)\, \pi(V) \qquad (30)$$

Using this relation, we establish the appropriate conditional posterior distributions for each parameter in the model in Sections 3.1 through 3.6.
3.1. The Conditional Posterior Distribution of β

From (30) it follows that

$$p(\beta \,|\, *) = \frac{p(\beta, \theta, \rho, \sigma^2, V, z \,|\, y)}{p(\theta, \rho, \sigma^2, V, z \,|\, y)} \propto p(\beta, \theta, \rho, \sigma^2, V, z \,|\, y) \propto \pi(z \,|\, \beta, \theta, V)\, \pi(\beta) \qquad (31)$$

where we use $*$ to denote the conditioning arguments: $\theta, \rho, \sigma^2, V, z, y$. This together with (28) and (21) implies that

$$p(\beta \,|\, *) \propto \exp\left[-\frac{1}{2}(z - X\beta - \Delta\theta)' V^{-1} (z - X\beta - \Delta\theta)\right] \times \exp\left[-\frac{1}{2}(\beta - c)' T^{-1} (\beta - c)\right] \qquad (32)$$

But since

$$-\frac{1}{2}(z - X\beta - \Delta\theta)' V^{-1} (z - X\beta - \Delta\theta) - \frac{1}{2}(\beta - c)' T^{-1} (\beta - c)$$
$$= -\frac{1}{2}\left[\beta' X' V^{-1} X \beta - 2(z - \Delta\theta)' V^{-1} X \beta + \beta' T^{-1} \beta - 2c' T^{-1} \beta\right] + C$$
$$= -\frac{1}{2}\left\{\beta'(X' V^{-1} X + T^{-1})\beta - 2[X' V^{-1}(z - \Delta\theta) + T^{-1} c]'\beta\right\} + C \qquad (33)$$

where $C$ includes all quantities not depending on $\beta$, it follows that if we now set

$$A = X' V^{-1} X + T^{-1} \qquad (34)$$

and

$$b = X' V^{-1} (z - \Delta\theta) + T^{-1} c \qquad (35)$$

and observe that both $A$ and $b$ are independent of $\beta$, then expression (32) can be rewritten as

$$p(\beta \,|\, *) \propto \exp\left[-\frac{1}{2}(\beta' A \beta - 2b'\beta)\right] \propto \exp\left[-\frac{1}{2}(\beta' A \beta - 2b'\beta + b' A^{-1} b)\right] \propto \exp\left[-\frac{1}{2}(\beta - A^{-1} b)' A (\beta - A^{-1} b)\right] \qquad (36)$$

Therefore, the conditional posterior density of $\beta$ is proportional to a multinormal density with mean vector $A^{-1}b$ and covariance matrix $A^{-1}$, which we express as:
(37)
This can be viewed as an instance of the more general posterior in expression (13) of Geweke (1993) where his G, −1 , y and g are here given by I, V −1 , z − , and c respectively. As is well known (see for example the discussion in Gelman et al., 1995, p. 79), this posterior distribution can be viewed as a weighted average of prior and sample data information in the following sense. If one treats z − in (11) as “data” and defines the corresponding maximum-likelihood estimator of  for this linear model by ˆ = (X ′ V −1 X)−1 X ′ V −1 (z − )
(38)
then it follows from (34) and (35) that the posterior mean of  takes the form E(|, , 2 , V, z, y) = (X ′ V −1 X + T −1 )−1 [X ′ V −1 (z − ) + T −1 c] = (X ′ V −1 X + T −1 )−1 [X ′ V −1 ˆ + T −1 c]
(39)
For the case of a single explanatory variable in X where q = 1, the right hand ˆ More generally, side of (39) represents a simple convex combination of c and . this posterior mean represents a matrix-weighted average of the prior mean c, ˆ Note that as the quality of sample data and the maximum-likelihood estimate, . information increases (i.e. the variances vi become smaller) or the quantity of sample information increases (i.e. sample sizes n i become larger) the weight placed on ˆ increases.
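A hedged Python sketch of the draw implied by (37) follows; it is not the authors' code, and the names (`draw_beta`, `v_long` for the $v_i$'s stacked over individuals, `theta_long` for $\Delta\theta$) are assumptions for illustration:

```python
# A minimal sketch of drawing beta ~ N(A^{-1} b, A^{-1}) as in (37).
import numpy as np

def draw_beta(X, z, theta_long, v_long, c, Tinv, rng):
    Vinv_X = X / v_long[:, None]                 # V^{-1} X, since V is diagonal
    A = X.T @ Vinv_X + Tinv                      # A in (34)
    b = Vinv_X.T @ (z - theta_long) + Tinv @ c   # b in (35)
    mean = np.linalg.solve(A, b)                 # posterior mean A^{-1} b
    L = np.linalg.cholesky(np.linalg.inv(A))     # factor of covariance A^{-1}
    return mean + L @ rng.standard_normal(len(b))
```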
3.2. The Conditional Posterior Distribution of θ

As will become clear below, the conditional posterior for $\theta$ is in many ways similar to that for $\beta$. Here we let $*$ represent the conditioning arguments $\beta, \rho, \sigma^2, V, z, y$.
First, note that using the same argument as in (30) and (31), together with (26) and (28), we can write

$$p(\theta \,|\, *) \propto \pi(z \,|\, \beta, \theta, V)\, \pi(\theta \,|\, \rho, \sigma^2)$$
$$\propto \exp\left[-\frac{1}{2}[\Delta\theta - (z - X\beta)]' V^{-1} [\Delta\theta - (z - X\beta)]\right] \exp\left[-\frac{1}{2\sigma^2} \theta' B_\rho' B_\rho \theta\right]$$
$$\propto \exp\left[-\frac{1}{2}\left\{\theta'(\sigma^{-2} B_\rho' B_\rho + \Delta' V^{-1} \Delta)\theta - 2(z - X\beta)' V^{-1} \Delta\theta\right\}\right] \qquad (40)$$

A comparison of (40) with (33) shows that by setting $A_0 = \sigma^{-2} B_\rho' B_\rho + \Delta' V^{-1} \Delta$ and $b_0 = \Delta' V^{-1} (z - X\beta)$, the conditional posterior density for $\theta$ must be proportional to a multinormal distribution

$$\theta \,|\, (\beta, \rho, \sigma^2, V, z, y) \sim N(A_0^{-1} b_0,\; A_0^{-1}) \qquad (41)$$

with mean vector $A_0^{-1} b_0$ and covariance matrix $A_0^{-1}$. Unlike the case of $\beta$, however, the mean vector and covariance matrix of $\theta$ involve the inverse of the $m \times m$ matrix $A_0$, which depends on $\rho$. Thus this matrix inverse must be computed on each MCMC draw during the estimation procedure. Typically thousands of draws will be needed to produce a posterior estimate of the parameter distribution for $\theta$, suggesting that this approach to sampling from the conditional distribution of $\theta$ may be costly in terms of time if $m$ is large. In our illustration in Section 5 we rely on a sample of 3,110 U.S. counties and the 48 contiguous states, so that $m = 48$. In this case, computing the inverse was relatively fast, allowing us to produce 2,500 draws in 37 seconds using a compiled C-language program on an Athlon 1200 MHz processor. In the Appendix we provide an alternative approach that involves only univariate normal distributions for each element $\theta_i$ conditional on all other elements of $\theta$ excluding the $i$th element. This approach is amenable to computation for much larger values of $m$, but suffers from the need to evaluate $m$ univariate conditional distributions to obtain the vector of parameter estimates on each pass through the MCMC sampler. This slows down the computations, but it does not suffer from the need to manipulate or invert large matrices.
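The $m \times m$ computation described above can be sketched as follows (assumed names; `Delta` is the $n \times m$ indicator matrix of (12), so $\Delta' V^{-1} \Delta$ is diagonal):

```python
# A sketch of the theta draw in (41).
import numpy as np

def draw_theta(Delta, z, X, beta, rho, sigma2, v_long, W, rng):
    m = W.shape[0]
    B = np.eye(m) - rho * W
    Vinv_Delta = Delta / v_long[:, None]             # V^{-1} Delta
    A0 = (B.T @ B) / sigma2 + Delta.T @ Vinv_Delta   # A_0, depends on rho
    b0 = Vinv_Delta.T @ (z - X @ beta)               # b_0
    mean = np.linalg.solve(A0, b0)                   # the m x m solve noted above
    L = np.linalg.cholesky(np.linalg.inv(A0))
    return mean + L @ rng.standard_normal(m)
```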
3.3. Conditional Posterior Distribution for ρ

To determine the conditional posterior for $\rho$, observe that using (30) we have:

$$p(\rho \,|\, \beta, \theta, \sigma^2, V, z, y) \propto \frac{p(\beta, \theta, \rho, \sigma^2, V, z \,|\, y)}{p(\beta, \theta, \sigma^2, V, z \,|\, y)} \propto p(\beta, \theta, \rho, \sigma^2, V, z \,|\, y) \propto \pi(\theta \,|\, \rho, \sigma^2)\, \pi(\rho) \qquad (42)$$

which together with (26) and (23) implies that

$$p(\rho \,|\, \beta, \theta, \sigma^2, V, z, y) \propto |B_\rho| \exp\left(-\frac{1}{2\sigma^2} \theta' B_\rho' B_\rho \theta\right) \qquad (43)$$
where $\rho \in [\lambda_{\min}^{-1}, \lambda_{\max}^{-1}]$. As noted in LeSage (2000), this is not reducible to a standard distribution, so we might adopt a Metropolis-Hastings step during the MCMC sampling procedures. LeSage (1999) suggests a normal or t-distribution be used as a transition kernel in the Metropolis-Hastings step. Additionally, the restriction of $\rho$ to the interval $[\lambda_{\min}^{-1}, \lambda_{\max}^{-1}]$ can be implemented using a rejection-sampling step during the MCMC sampling.

Another approach that is feasible for this model is to rely on univariate numerical integration to obtain the conditional posterior density of $\rho$. The size of $B_\rho$ is based on the number of regions, which is typically much smaller than the number of observations, making it computationally simple to carry out univariate numerical integration on each pass through the MCMC sampler. Specifically, we can rely on a vectorized expression for the conditional posterior computed over a grid of $q$ values for $\rho$ in the interval $[\lambda_{\min}^{-1}, \lambda_{\max}^{-1}]$, which facilitates univariate numerical integration.

First consider storing tabled values of $|B_\rho|$ computed over a grid of $q$ values for $\rho$ in the interval $[\lambda_{\min}^{-1}, \lambda_{\max}^{-1}]$ prior to beginning the MCMC sampler. These can be computed using sparse matrix methods from Pace and Barry (1997) or the Monte Carlo estimator suggested by Barry and Pace (1999). The remaining terms in the conditional posterior can also be expressed as a vector over the grid of values for $\rho$ by setting $\alpha = (1/\sigma)\theta$ and writing:

$$\psi(\rho, \theta, \sigma^2) = |B_\rho| \exp\left(-\frac{1}{2\sigma^2}\theta' B_\rho' B_\rho \theta\right), \qquad \psi(\rho, \alpha) = |B_\rho| \exp\left(-\frac{1}{2}\alpha' B_\rho' B_\rho \alpha\right) \qquad (44)$$

Using the expression:

$$\alpha' B_\rho' B_\rho \alpha = \alpha'(I_m - \rho W)'(I_m - \rho W)\alpha = \alpha'\alpha - 2\rho\,\alpha' W \alpha + \rho^2 \alpha' W' W \alpha \qquad (45)$$

we can define the scalars $a = \alpha' W \alpha$ and $b = \alpha' W' W \alpha$, which allow us to express the conditional posterior (up to a constant not depending on $\rho$) as:

$$p(\rho \,|\, \beta, \theta, \sigma^2, V, z, y) \propto |B_\rho| \exp\left[\frac{1}{2}(2\rho a - \rho^2 b)\right] \qquad (46)$$

Expression (46) can be evaluated as a vector over the grid of $q$ values for $\rho$ once the scalar expressions $a$ and $b$ are computed. Since the vector of determinant values for $|B_\rho|$ has been constructed prior to beginning our MCMC sampler, we need only compute the quantities $a$ and $b$ on each pass through the sampler, which can be done rapidly for a small spatial weight matrix $W$ reflecting connectivity relations between the regions in our model. Having expressed the conditional posterior distribution over a grid of values, we use univariate numerical integration to find the normalizing constant. Having achieved a grid approximation to the conditional posterior for $\rho$, we then draw from this using inversion. An advantage of this approach over the Metropolis-Hastings method is that each pass through the sampler produces a draw for $\rho$, whereas acceptance rates in the Metropolis-Hastings method are usually around 50 percent, requiring twice as many passes through the sampler to produce the same number of draws for $\rho$. For our applications presented in Section 5, the number of observations and regions were based on the $n = 3{,}110$ counties and $m = 48$ contiguous states in the U.S. An MCMC sampler implemented in the C language produced around 66 draws per second on a desktop computer. In a typical application 2,500 draws would suffice for convergence of the sampler to produce adequate posterior inferences, requiring around 40 seconds.
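A compact sketch of this grid-plus-inversion draw, under the assumption that the log-determinants have been tabled in advance, might look as follows (illustrative names throughout):

```python
# A sketch of the grid-based rho draw described above; log_dets holds
# log|B_rho| over rho_grid, computed once before sampling begins.
import numpy as np

def draw_rho(rho_grid, log_dets, theta, sigma, W, rng):
    alpha = theta / sigma                  # alpha = (1/sigma) theta as in (44)
    Wa = W @ alpha
    a = alpha @ Wa                         # alpha' W alpha
    b = Wa @ Wa                            # alpha' W'W alpha
    log_post = log_dets + 0.5 * (2.0 * rho_grid * a - rho_grid**2 * b)  # (46)
    weights = np.exp(log_post - log_post.max())   # stabilized, unnormalized
    cdf = np.cumsum(weights) / weights.sum()      # univariate numerical integration
    return rho_grid[np.searchsorted(cdf, rng.uniform())]  # draw by inversion
```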
3.4. The Conditional Posterior Distribution of σ²

To determine the conditional posterior of $\sigma^2$, the same argument as in (42) along with (26) and (22) implies that

$$p(\sigma^2 \,|\, \beta, \theta, \rho, V, z, y) \propto \pi(\theta \,|\, \rho, \sigma^2)\, \pi(\sigma^2) \propto (\sigma^2)^{-m/2} \exp\left(-\frac{1}{2\sigma^2}\theta' B_\rho' B_\rho \theta\right) (\sigma^2)^{-(\alpha+1)} \exp\left(-\frac{\nu}{\sigma^2}\right) \qquad (47)$$

Hence, we have

$$p(\sigma^2 \,|\, \beta, \theta, \rho, V, z, y) \propto (\sigma^2)^{-((m/2)+\alpha+1)} \exp\left(-\frac{\theta' B_\rho' B_\rho \theta + 2\nu}{2\sigma^2}\right) \qquad (48)$$
which is seen from (22) to be proportional to an inverse gamma distribution with parameters $(m/2) + \alpha$ and $\theta' B_\rho' B_\rho \theta + 2\nu$. Following Geweke (1993), we may also express this posterior in terms of the chi-square distribution as follows. Letting $\chi = [\theta' B_\rho' B_\rho \theta + 2\nu]/\sigma^2$, so that $\sigma^2 = [\theta' B_\rho' B_\rho \theta + 2\nu]/\chi$ implies $|d\sigma^2/d\chi| = [\theta' B_\rho' B_\rho \theta + 2\nu]/\chi^2$, it then follows that

$$f(\chi) = \pi[\sigma^2(\chi)] \left|\frac{d\sigma^2}{d\chi}\right| \propto \left(\frac{\theta' B_\rho' B_\rho \theta + 2\nu}{\chi}\right)^{-((m/2)+\alpha+1)} \exp\left(-\frac{\chi}{2}\right) \frac{\theta' B_\rho' B_\rho \theta + 2\nu}{\chi^2} \propto \chi^{((m+2\alpha)/2)-1} \exp\left(-\frac{\chi}{2}\right) \qquad (49)$$

Hence the density of $\chi$ is proportional to a chi-square density with $m + 2\alpha$ degrees of freedom, and we may also express the conditional posterior of $\sigma^2$ as

$$\frac{\theta' B_\rho' B_\rho \theta + 2\nu}{\sigma^2} \,\Big|\, (\beta, \theta, \rho, V, z, y) \sim \chi^2(m + 2\alpha) \qquad (50)$$
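Expressed as code, (50) suggests the following hedged sketch (assumed names; `alpha0` and `nu0` stand in for the prior parameters $\alpha$ and $\nu$ of (19), zero in the diffuse case):

```python
# A minimal sketch of the sigma^2 draw implied by (50).
import numpy as np

def draw_sigma2(theta, rho, W, alpha0=0.0, nu0=0.0, rng=None):
    rng = rng or np.random.default_rng()
    m = len(theta)
    Bt = theta - rho * (W @ theta)               # B_rho theta
    scale = Bt @ Bt + 2.0 * nu0                  # theta' B'B theta + 2 nu
    return scale / rng.chisquare(m + 2 * alpha0) # invert the chi-square relation
```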
3.5. The Conditional Posterior Distribution of v

To determine the conditional posterior distribution of $v = (v_i : i = 1, \ldots, m)$, we observe from the same argument as in (30) and (31), together with (28), (18), and (25), that if we let $v_{-i} = (v_1, \ldots, v_{i-1}, v_{i+1}, \ldots, v_m)$ for each $i$, let $e = z - X\beta - \Delta\theta$, and also let $*$ represent the conditioning arguments (in this case: $\beta, \theta, \rho, \sigma^2, v_{-i}, z, y$), then

$$p(v_i \,|\, *) \propto \pi(z \,|\, \beta, \theta, V) \prod_{j=1}^{m} \pi(v_j) \propto |V|^{-1/2} \exp\left(-\frac{1}{2} e' V^{-1} e\right) \pi(v_i) \propto |V|^{-1/2} \exp\left(-\frac{1}{2} e' V^{-1} e\right) v_i^{-((r/2)+1)} \exp\left(-\frac{r}{2v_i}\right) \qquad (51)$$

But since $V = \mathrm{diag}(v)$ implies that $|V|^{-1/2} = \prod_{i=1}^{m} v_i^{-n_i/2}$, and since $e' V^{-1} e = \sum_{i=1}^{m} \sum_{k=1}^{n_i} e_{ik}^2 / v_i = \sum_{i=1}^{m} e_i' e_i / v_i$, where $e_i = (e_{ik} : k = 1, \ldots, n_i)$, we have

$$p(v_i \,|\, *) \propto \prod_{j=1}^{m} v_j^{-n_j/2} \exp\left(-\frac{e_j' e_j}{2v_j}\right) \pi(v_j) \propto v_i^{-n_i/2} \exp\left(-\frac{e_i' e_i}{2v_i}\right) v_i^{-((r/2)+1)} \exp\left(-\frac{r}{2v_i}\right) = v_i^{-((r+n_i)/2+1)} \exp\left(-\frac{e_i' e_i + r}{2v_i}\right) \qquad (52)$$

and may conclude from (22) that the conditional posterior distribution of each $v_i$ is proportional to an inverse gamma distribution with parameters $(r + n_i)/2$ and $(e_i' e_i + r)/2$. As with $\sigma^2$, this may also be expressed in terms of the chi-square distribution as follows. If we let $\chi_i = (e_i' e_i + r)/v_i$, so that $v_i = (e_i' e_i + r)/\chi_i$ implies $|dv_i/d\chi_i| = (e_i' e_i + r)/\chi_i^2$, then it follows that

$$f(\chi_i) = \pi[v_i(\chi_i)] \left|\frac{dv_i}{d\chi_i}\right| = \left(\frac{e_i' e_i + r}{\chi_i}\right)^{-((r+n_i)/2+1)} \exp\left(-\frac{\chi_i}{2}\right) \frac{e_i' e_i + r}{\chi_i^2} \propto \chi_i^{((r+n_i)/2)-1} \exp\left(-\frac{\chi_i}{2}\right) \qquad (53)$$

which is proportional to a chi-square density with $r + n_i$ degrees of freedom. Hence in a manner similar to (50) we may express the conditional posterior of each $v_i$ as

$$\frac{e_i' e_i + r}{v_i} \,\Big|\, (\beta, \theta, \rho, \sigma^2, v_{-i}, z, y) \sim \chi^2(r + n_i) \qquad (54)$$

In this form, it is instructive to notice that the posterior mean of $v_i$ has a "weighted average" interpretation similar to that of $\beta$ discussed above. To see this, note first that from (18) $v_i/r$ has an inverse chi-square prior distribution with $r$ degrees of freedom. But since the mean of the inverse chi-square with $\nu$ degrees of freedom is given by $1/(\nu - 2)$, it follows that the prior mean of $v_i$ is $\mu_i = E(v_i) = r\,E(v_i/r) = r/(r-2)$ for $r > 2$. Next observe from (54) that the random variable $v_i/(e_i' e_i + r)$ is also conditionally distributed as inverse chi-square with $r + n_i$ degrees of freedom, so that

$$E(v_i \,|\, \beta, \theta, \rho, \sigma^2, v_{-i}, z, y) = (e_i' e_i + r)\, E\left(\frac{v_i}{e_i' e_i + r} \,\Big|\, \beta, \theta, \rho, \sigma^2, v_{-i}, z, y\right) = \frac{e_i' e_i + r}{(n_i + r) - 2} \qquad (55)$$
But if the maximum-likelihood estimator for $v_i$ given the "residual" vector $e_i$ is denoted by $\hat{v}_i = (1/n_i) e_i' e_i$, then it follows from (55) that

$$E(v_i \,|\, \beta, \theta, \rho, \sigma^2, v_{-i}, z, y) = \frac{e_i' e_i + r}{(n_i + r) - 2} = \frac{n_i \hat{v}_i + (r-2)\mu_i}{n_i + (r-2)} \qquad (56)$$

From this we see that the posterior mean of $v_i$ is a weighted average of the maximum-likelihood estimator, $\hat{v}_i$, and the prior mean, $\mu_i$, of $v_i$. Observe also that more weight is given to the sample information embodied in $\hat{v}_i$ as the sample size $n_i$ increases. Even for relatively small sample sizes in each region, these posterior means may be expected to capture possible heteroscedasticity effects between regions. Note finally that the value of the hyperparameter $r$ is critical here. In particular, large values of $r$ would result in the heteroscedasticity effects being overwhelmed. LeSage (1999) suggests that the range of values $2 < r \leq 7$ is appropriate for most purposes and recommends a value $r = 4$ as a rule of thumb.
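The chi-square form (54) translates directly into a draw for each $v_i$; the following Python sketch (assumed names) loops over regions, with `e` the residual $z - X\beta - \Delta\theta$ and `region_index` mapping observations to regions:

```python
# A sketch of the region-by-region v_i draws implied by (54).
import numpy as np

def draw_v(e, region_index, m, r, rng):
    v = np.empty(m)
    for i in range(m):
        e_i = e[region_index == i]                       # residuals for region i
        v[i] = (e_i @ e_i + r) / rng.chisquare(r + e_i.size)
    return v
```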
3.6. The Conditional Posterior Distribution of z

Finally, we construct a key posterior distribution for this model, namely that of the utility-difference vector, $z$. By the same argument as in (30) and (31), now taken together with (16) and (28), we see that

$$p(z \,|\, \beta, \theta, \rho, \sigma^2, V, y) \propto p(y \,|\, z)\, \pi(z \,|\, \beta, \theta, V)$$
$$\propto \prod_{i=1}^{m} \prod_{k=1}^{n_i} \{\delta(y_{ik} = 1)\delta(z_{ik} > 0) + \delta(y_{ik} = 0)\delta(z_{ik} \leq 0)\} \prod_{i=1}^{m} \prod_{k=1}^{n_i} v_i^{-1/2} \exp\left[-\frac{(z_{ik} - x_{ik}'\beta - \theta_i)^2}{2v_i}\right] \qquad (57)$$

Hence by letting $z_{-ik} = (z_{11}, \ldots, z_{i,k-1}, z_{i,k+1}, \ldots, z_{mn_m})$ for each individual $k$ in region $i$, it follows at once from (57) that

$$p(z_{ik} \,|\, *) \propto v_i^{-1/2} \exp\left[-\frac{(z_{ik} - x_{ik}'\beta - \theta_i)^2}{2v_i}\right] \times \{\delta(y_{ik} = 1)\delta(z_{ik} > 0) + \delta(y_{ik} = 0)\delta(z_{ik} \leq 0)\} \qquad (58)$$
Thus we see that for each $ik$, the conditional posterior of $z_{ik}$ is a truncated normal distribution, which can be expressed as follows:

$$z_{ik} \,|\, * \sim \begin{cases} N(x_{ik}'\beta + \theta_i,\; v_i) & \text{left-truncated at } 0, \text{ if } y_{ik} = 1 \\ N(x_{ik}'\beta + \theta_i,\; v_i) & \text{right-truncated at } 0, \text{ if } y_{ik} = 0 \end{cases} \qquad (59)$$

where $*$ denotes the conditioning arguments $(\beta, \theta, \rho, \sigma^2, V, z_{-ik}, y)$.
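One standard way to implement (59) is the inverse-CDF method for truncated normals; the sketch below (assumed names, using scipy) is one possible implementation, not necessarily the authors':

```python
# A sketch of the truncated normal draws in (59).
import numpy as np
from scipy.stats import norm

def draw_z(X, beta, theta_long, v_long, y, rng):
    mu = X @ beta + theta_long
    sd = np.sqrt(v_long)
    p0 = norm.cdf(0.0, loc=mu, scale=sd)       # mass below zero under N(mu, v)
    u = rng.uniform(size=len(mu))
    # y=1: uniform over (p0, 1] gives z > 0; y=0: uniform over (0, p0] gives z <= 0
    q = np.where(y == 1, p0 + u * (1.0 - p0), u * p0)
    return norm.ppf(np.clip(q, 1e-12, 1.0 - 1e-12), loc=mu, scale=sd)
```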
3.7. The MCMC Sampler

By way of summary, the MCMC estimation scheme involves starting with arbitrary initial values for the parameters, which we denote $\beta^0, \theta^0, \rho^0, \sigma^0, V^0$, and the latent vector $z^0$. We then sample sequentially from the following set of conditional distributions for the parameters in our model.

(1) $p(\beta \,|\, \theta^0, \rho^0, \sigma^0, V^0, z^0, y)$, which is a multinormal distribution with mean and variance defined in (37). This updated value for the parameter vector $\beta$ we label $\beta^1$.

(2) $p(\theta \,|\, \beta^1, \rho^0, \sigma^0, V^0, z^0, y)$, which we sample from the multinormal distribution in (41) (or the set of $m$ univariate normal distributions with means and variances presented in (A.13) of the Appendix). These updated parameters we label $\theta^1$. Note that we employ the updated value $\beta^1$ when evaluating this conditional distribution.

(3) $p(\sigma^2 \,|\, \beta^1, \theta^1, \rho^0, V^0, z^0, y)$, which is chi-squared distributed with $m + 2\alpha$ degrees of freedom as shown in (50). Label this updated parameter $\sigma^1$ and note that we will continue to employ updated values of previously sampled parameters when evaluating these conditional densities.

(4) $p(\rho \,|\, \beta^1, \theta^1, \sigma^1, V^0, z^0, y)$, which can be obtained using a Metropolis-Hastings approach described in LeSage (2000) based on a normal candidate density along with rejection sampling to constrain $\rho$ to the desired interval. One can also rely on univariate numerical integration to find the conditional posterior on each pass through the sampler. This was the approach we took to produce the estimates reported in Section 5.

(5) $p(v_i \,|\, \beta^1, \theta^1, \rho^1, \sigma^1, v_{-i}, z^0, y)$, which can be obtained from the chi-squared distribution shown in (54).

(6) $p(z \,|\, \beta^1, \theta^1, \rho^1, \sigma^1, V^1, y)$, which requires draws from left-truncated or right-truncated normal distributions based on (59).

We now return to step (1), employing the updated parameter values in place of the initial values $\beta^0, \theta^0, \rho^0, \sigma^0, V^0$ and the updated latent vector $z^1$ in place of the initial $z^0$. On each pass through the sequence we collect the parameter draws, which are used to construct posterior distributions for the parameters in our model. In the case of $\theta$ and $V$, the parameters take the form of an $m$-vector on each draw, while the draws for the latent vector $z$ form an $n$-vector. Storing these values over a sampling run involving thousands of draws when $m$ (or $n$) is large would require large amounts of computer memory. One option is simply to compute a mean vector, which does not require storage of the draws for these vectors. The posterior mean may often provide an adequate basis for posterior inferences regarding parameters like $v_i$ and the latent vector $z$. Another option is to write these values to a disk file during the sampling process, which might tend to slow down the algorithm slightly.
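Tying the six steps together, a skeletal version of the loop might read as follows; it reuses the hypothetical helper functions sketched in Sections 3.1 through 3.6 and is intended only to show how updated values flow through one pass, not as production code:

```python
# A skeletal sketch of the sampler summarized above (all names assumed).
import numpy as np

def sampler(X, y, W, Delta, region_index, rho_grid, log_dets,
            c, Tinv, r, ndraw=2500, rng=None):
    rng = rng or np.random.default_rng()
    n, q = X.shape
    m = W.shape[0]
    beta, theta = np.zeros(q), np.zeros(m)         # crude initial values
    rho, sigma2, v = 0.0, 1.0, np.ones(m)
    z = np.where(y == 1, 0.5, -0.5)
    draws = []
    for _ in range(ndraw):
        v_long = v[region_index]
        beta = draw_beta(X, z, Delta @ theta, v_long, c, Tinv, rng)         # step 1
        theta = draw_theta(Delta, z, X, beta, rho, sigma2, v_long, W, rng)  # step 2
        sigma2 = draw_sigma2(theta, rho, W, rng=rng)                        # step 3
        rho = draw_rho(rho_grid, log_dets, theta, np.sqrt(sigma2), W, rng)  # step 4
        e = z - X @ beta - Delta @ theta
        v = draw_v(e, region_index, m, r, rng)                              # step 5
        z = draw_z(X, beta, Delta @ theta, v[region_index], y, rng)         # step 6
        draws.append((beta.copy(), theta.copy(), rho, sigma2))
    return draws
```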
4. SOME SPECIAL CASES In this section we set forth distributional results for two cases which might be of special interest. First, we consider the case in which all individuals are interchangeable, i.e. in which homoscedasticity is postulated to hold among regions. We then consider a case where spatial dependencies are presumed to occur among individuals themselves, so that each individual is treated as a region.
4.1. The Homoscedastic Case

This represents a situation where individual variances are assumed equal across all regions, so the regional variance vector, $v$, reduces to a scalar, producing the simple form of covariance matrix shown in (60):

$$V = vI_n \qquad (60)$$

With this version of the model, the conditional posterior densities for $\beta$, $\theta$, $\rho$, and $\sigma^2$ remain the same. The only change worthy of mention occurs in the conditional posterior density for $v$. Here it can be readily verified, by using the same definitions $e = z - X\beta - \Delta\theta$ and $n = \sum_i n_i$, that the conditional posterior density for $v$ given $(\beta, \theta, \rho, \sigma^2, z, y)$ is identical to (54) with all subscripts $i$ removed, i.e.

$$\frac{e'e + r}{v} \,\Big|\, (\beta, \theta, \rho, \sigma^2, z, y) \sim \chi^2(r + n) \qquad (61)$$

In addition, the conditional posterior density for each $z_{ik}$ given the values of $(\beta, \theta, \rho, \sigma^2, v, z_{-ik}, y)$ is identical to (59) with $v_i$ replaced by $v$. Of course, for
large $n$ relative to $r$, this approaches the usual $\chi^2(n)$ distribution for $\sigma^2$ in the homoscedastic Bayesian linear model.
4.2. The Individual Spatial-dependency Case

Another special case is where individuals are treated as "regions," denoted by the index $i$. In this case we are essentially setting $m = n$ and $n_i = 1$ for all $i = 1, \ldots, m$. Note that although one could in principle consider heteroscedastic effects among individuals, the existence of a single observation per individual renders estimation of such variances problematic at best. In this case, one might adopt the homoscedasticity hypothesis described in Section 4.1 and use $v$ to denote the common individual variance.⁶ Here it can be verified that by simply replacing all occurrences of $(ik, X_i, \theta_i 1_i, \Delta, v_i)$ with $(i, x_i', \theta_i, I_n, v)$ respectively, and again using the definition of $V$ in (60), the basic model in (2) through (16), together with the conditional posterior densities for $\beta$, $\theta$, $\rho$, and $\sigma^2$, continue to hold. In this homoscedastic context, the appropriate conditional posterior density for each $\theta_i$, $i = 1, \ldots, n\,(= m)$, again has the univariate form given in the Appendix, where the definitions of $a_i$ and $b_i$ are now modified by setting $v_i = v$ and $n_i = 1$, with $(z_i - x_i'\beta)/v$ replacing its regional counterpart.
5. APPLICATIONS OF THE MODEL

We first illustrate the spatial probit model with interaction effects using a generated data set. The advantage of this approach is that we know the true parameter magnitudes as well as the generated spatial interaction effects. This allows us to examine the ability of the model to accurately estimate the parameters and interaction effects. We provide an applied illustration in Section 5.2 using the 1996 presidential election results, which involves 3,110 U.S. counties and the 48 contiguous states.

5.1. A Generated Data Example

This experiment used the latitude-longitude centroids of $n = 3{,}110$ U.S. counties to generate a set of data. The $m = 48$ contiguous states were used as regions. A continuous dependent variable was generated using the following procedure. First, the spatial interaction effects were generated using:

$$\theta = (I_m - \rho W)^{-1} \eta, \qquad \eta \sim N(0, \sigma^2 I_m) \qquad (62)$$
where $\rho$ was set equal to 0.7 in one experiment and 0.6 in another. In (62), $W$ represents the $48 \times 48$ standardized spatial weight matrix based on contiguity of the states. Six explanatory variables, which we label $X$, were created using county-level census information on: the percentage of population in each county that held high school, college, or graduate degrees, the percentage of non-white population, the median household income (divided by 10,000) and the percent of population living in urban areas. These are the same explanatory variables we use in our application to the 1996 presidential election presented in Section 5.2, which should provide some insight into how the model operates in a generated data setting. The data matrix $X$ formed using these six explanatory variables was then studentized (by subtracting means and dividing by standard deviations). Use of a studentized data matrix $X$ along with negative and positive values for $\beta$ ensures in particular that the generated $y$ values have a mean close to zero. Since we will convert the generated continuous $y$ magnitudes to binary $z$ values using the rule in (63), this should produce a fairly equal sample of 0 and 1 values:

$$z = 0 \text{ if } y \leq 0, \qquad z = 1 \text{ if } y > 0 \qquad (63)$$

The vector $\theta$, along with the given matrix $X$ and beta values $\beta = (3, -1.5, -3, 2, -1, 1)'$, were used to generate a continuous $y$ vector from the model:

$$y = X\beta + \Delta\theta + u, \qquad u \sim N(0, V), \qquad V = I_n \qquad (64)$$
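The generating process in (62) through (64) can be sketched as follows (assumed names; `X` is the studentized $n \times q$ matrix and `Delta` the county-to-state indicator matrix):

```python
# A sketch of the data-generating process (62)-(64).
import numpy as np

def generate_data(X, W, Delta, rho, sigma2, beta, rng):
    m = W.shape[0]
    eta = rng.normal(scale=np.sqrt(sigma2), size=m)
    theta = np.linalg.solve(np.eye(m) - rho * W, eta)               # eq. (62)
    y = X @ beta + Delta @ theta + rng.standard_normal(X.shape[0])  # eq. (64)
    z = (y > 0).astype(int)                                         # eq. (63)
    return y, z, theta
```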
Generating a continuous $y$ allows us to compare the posterior mean of the draws from the truncated normal distribution that would serve as the basis for inference about the unobserved $y$ values in an applied setting. Another focus of inference for this type of model would be the estimated values for the $m$-vector of parameters $\theta$. A set of 100 data samples were generated and used to produce estimates whose means and standard deviations are shown in Table 1, alongside the true parameter values used to generate the sample data. In addition to the spatial probit model estimates, we also estimated a least-squares model, a non-spatial probit model and the spatial individual effects model based on the continuous $y$ values. This was done using the generated continuous $y$ vector and a minor change in the MCMC sampling scheme that ignored the update step for the latent variable $y$ in step 6 of the sampler. Instead, we rely on the actual values of $y$, allowing us to see how inferences regarding the parameters are affected by the presence of binary versus continuous $y$ values. Ideally, we would produce similar estimates from these two models, indicating that the latent draws for $y$ work effectively to replace the unknown values.
Table 1. Generated Data Results.

Experiments using $\sigma^2 = 2$

                           OLS        Probit     Sprobit    Regress
  Estimates
  $\beta_1 = 3$           0.2153     1.5370     2.9766     2.9952
  $\beta_2 = -1.5$       -0.1291    -0.8172    -1.5028    -1.5052
  $\beta_3 = -3$         -0.0501    -1.5476    -2.9924    -2.9976
  $\beta_4 = 2$           0.1466     1.0321     2.0019     2.0013
  $\beta_5 = -1$         -0.0611    -0.5233    -0.9842    -1.0013
  $\beta_6 = 1$           0.0329     0.5231     0.9890     1.0006
  $\rho = 0.7$                                  0.6585     0.6622
  $\sigma^2 = 2$                                2.1074     2.0990
  Standard deviations
  $\beta_1$               0.0286     0.2745     0.1619     0.0313
  $\beta_2$               0.0434     0.3425     0.1463     0.0393
  $\beta_3$               0.0346     0.4550     0.2153     0.0390
  $\beta_4$               0.0256     0.2250     0.1359     0.0252
  $\beta_5$               0.0176     0.1630     0.1001     0.0293
  $\beta_6$               0.0109     0.1349     0.0819     0.0244
  $\rho$                                        0.1299     0.1278
  $\sigma^2$                                    0.5224     0.3971

Experiments using $\sigma^2 = 0.5$

                           OLS        Probit     Sprobit    Regress
  Estimates
  $\beta_1 = 3$           0.2312     2.4285     3.0290     2.9983
  $\beta_2 = -1.5$       -0.1312    -1.1601    -1.5017    -1.4966
  $\beta_3 = -3$         -0.0517    -2.4646    -3.0277    -3.0042
  $\beta_4 = 2$           0.1513     1.5975     2.0137     1.9984
  $\beta_5 = -1$         -0.0645    -0.8140    -1.0121    -1.0002
  $\beta_6 = 1$           0.0348     0.8046     1.0043     0.9994
  $\rho = 0.6$                                  0.5963     0.5886
  $\sigma^2 = 0.5$                              0.4960     0.5071
  Standard deviations
  $\beta_1$               0.0172     0.2058     0.1684     0.0292
  $\beta_2$               0.0219     0.2516     0.1420     0.0392
  $\beta_3$               0.0168     0.3873     0.2215     0.0382
  $\beta_4$               0.0125     0.1652     0.1301     0.0274
  $\beta_5$               0.0103     0.1595     0.1102     0.0329
  $\beta_6$               0.0074     0.1280     0.0899     0.0237
  $\rho$                                        0.1257     0.1181
  $\sigma^2$                                    0.1584     0.1101
For the studentized matrix $X$ the sample variance of each column is unity by construction. Hence we set $\sigma^2 = 2$ to create a situation where the relative importance or signal strength of $X$ in the generation of $y$ was one-half that of the strength of the individual effects $\theta$. A second experiment with another 100 samples was based on $\sigma^2 = 0.5$, which creates a situation where the relative importance or signal strength of $\theta$ in the generation of $y$ is one-half that of $X$. This experiment used a value of $\rho = 0.6$ rather than 0.7 to examine sensitivity to this parameter. Both experiments relied on a homoscedastic prior setting $r = 60$ to reflect the homoscedasticity in the generating process.

In practical applications it is important to note that one can encounter situations where an inference is drawn suggesting that individual effects are not spatially dependent, that is, $\rho = 0$. This might be due to a large amount of noise, represented by $\sigma^2$ in the generating process used here. It might also be due to a very strong signal in $X$ relative to the signal in the individual effects, resulting in the influence of individual effects being masked in the estimated outcomes. This situation would be represented by a small value of $\sigma^2$ (relative to the unit sample variances in $X$) in our generating process. Of course, these two influences also depend on the inherent observation noise reflected in $u \sim N(0, V)$, which we controlled by setting $V = I_n$ in our experiments.

Estimation results for these two illustrations are shown in Table 1, based on 1,500 draws with the first 500 omitted for "burn-in" of the sampler. Turning attention to the experimental results for the case with $\sigma^2 = 2$ shown in the table, we see that least-squares and probit estimates for $\beta$ are inaccurate, as we would expect. However, the spatial probit estimates were on average very close to those from the spatial regression model based on non-binary $y$-values, suggesting that sampling for the latent $y$ works well. It should come as no surprise that estimates for $\beta$ based on the non-binary dependent variable $y$ are more precise than those from the spatial probit model based on binary $z$ values. From the standard deviations of the estimates over the 100 samples, we see that use of binary dependent variables results in a larger standard deviation in the outcomes, reflecting less precision in the $\beta$ and $\theta$ estimates. This seems intuitively correct, since these estimates are constructed during MCMC sampling based on draws for the latent $y$ values. A graphical depiction of these draws from a single estimation run is shown in Fig. 1, where we see that they are centered on the true $y$ but exhibit dispersion. The correlation between the posterior mean of the latent $y$ draws and the actual $y$ (which we know here) was around 0.9 for this single estimation run. In summary, the additional uncertainty arising from the presence of binary dependent variables $z$, which must be sampled to produce latent $y$ values during estimation, results in increased dispersion or uncertainty regarding the $\beta$ estimates for the spatial probit model, relative to the non-binary spatial regression model.
Fig. 1. Actual y vs. Mean of the Latent y-draws.
The experimental results for the case with $\sigma^2 = 0.5$ show roughly the same pattern in outcomes. This suggests that the estimation procedure will work well for cases where the relative signal strength of $X$ versus $\theta$ varies within a reasonable range. An important use for this type of model would be inferences regarding the character of the spatial interaction effects. Since these were generated here, we can compare the mean of the posterior distribution for these values with the actual magnitudes. Figure 2 shows this comparison, where the average $\theta$ estimates from both the spatial probit and spatial regression models are plotted against the average of the true values generated during the experiment. In the figure, the individual effect estimates were sorted by magnitude of the average actual values for presentation purposes. We see from the figure that the estimates were on average close to the true values, and one standard deviation of these estimates was also close to one standard deviation of the actual values. The spatial regression estimates were slightly more accurate, as we would expect, exhibiting a correlation of 0.97 with the actual $\theta$, whereas the spatial probit estimates had a correlation of 0.91. Figure 3 shows the actual $\theta$ values plotted versus the posterior mean of these estimates from a single estimation. Estimates from both the spatial probit model
Fig. 2. Mean of Actual vs. Predicted Estimates Over 100 Samples.
and the spatial regression model are presented and we see that accurate inferences regarding the individual effects could be drawn. The correlation between the actual individual effects used to generate the data and the predictions is over 0.9 for both models. By way of summary, the spatial probit model performed well in this generated data experiment to detect the pattern of spatial interaction effects and to produce accurate parameter estimates. 5.2. An Application to the 1996 Presidential Election To illustrate the model in an applied setting we used data on the 1996 presidential voting decisions in each of 3,110 U.S. counties in the 48 contiguous states. The dependent variable was set to 1 for counties where Clinton won the majority of votes and 0 for those where Dole won the majority.7 To illustrate individual versus regional spatial interaction effects we treat the counties as individuals and the states as regions where the spatial interaction effects occur.
Fig. 3. Actual vs. Predicted from a Single Estimation Run.
As explanatory variables we used: the proportion of county population with high school degrees, college degrees, and graduate or professional degrees, the percent of the county population that was non-white, the median county income (divided by 10,000) and the percentage of the population living in urban areas. These were the same variables used in the generated data experiments, and the data for each variable were again studentized. Of course, our application is illustrative rather than substantive.

We compare estimates from a least-squares and a traditional non-spatial probit model to those from the spatial probit model under both a homogeneity assumption and a heteroscedasticity assumption regarding the disturbances. The spatial probit model estimates are based on 6,000 draws with the first 1,000 omitted to allow the sampler to achieve a steady state.⁸ Diffuse or conjugate priors were employed for all of the parameters $\beta$, $\sigma^2$ and $\rho$ in the Bayesian spatial probit models. A hyperparameter value of $r = 4$ was used for the heteroscedastic spatial probit model, and a value of $r = 40$ was employed for the homoscedastic prior. The heteroscedastic value of $r = 4$ implies a prior mean for $v_i$ equal to $r/(r-2) = 2$ (see the discussion surrounding
Table 2. Posterior Means for $v_i$ Parameters Indicating Heteroscedasticity.

  State              $v_i$ Estimate
  Arizona                4.7210
  Colorado               8.6087
  Florida                3.9645
  Georgia               11.1678
  Kentucky               8.8893
  Missouri               6.5453
  Mississippi            4.3855
  North Carolina         3.8744
  New Mexico             4.7840
  Oregon                 4.3010
  Pennsylvania           3.4929
  Virginia               7.9718
  Washington             4.7888
Fig. 4. Frequency Distribution of the $v_i$ Estimates.
(55)) and a prior standard deviation equal to $\sqrt{2/r} = 0.707$. A two standard deviation interval around this prior mean would range from 0.58 to 3.41, suggesting that posterior estimates for individual states larger than 3.4 would indicate evidence in the sample data against homoscedasticity. The posterior mean for the $v_i$ estimates was greater than this upper level in 13 of the 48 states (shown in Table 2), with a mean over all states equal to 2.86 and a standard deviation equal to 2.36. The frequency distribution of the 48 $v_i$ estimates is shown in Fig. 4, where we see that the mean is not representative for this skewed distribution. We conclude there is evidence in favor of mild heteroscedasticity.

Comparative estimates are presented in Table 3, where we see that different inferences would be drawn from the homoscedastic versus heteroscedastic estimates.

Table 3. 1996 Presidential Election Results.

Homoscedastic spatial probit model with individual spatial effects

  Variable                   Coefficient    Std. Dev.    p-Levelᵃ
  High school                  0.0976        0.0419       0.0094
  College                     -0.0393        0.0609       0.2604
  Graduate/professional        0.1023        0.0551       0.0292
  Non-white                    0.2659        0.0375       0.0000
  Median income               -0.0832        0.0420       0.0242
  Urban population            -0.0261        0.0326       0.2142
  $\rho$                       0.5820        0.0670       0.0000
  $\sigma^2$                   0.6396        0.1765

Heteroscedastic spatial probit model with individual spatial effects

  Variable                   Coefficient    Std. Dev.    p-Levelᵃ
  High school                  0.0898        0.0446       0.0208
  College                     -0.1354        0.0738       0.0330
  Graduate/professional        0.1787        0.0669       0.0010
  Non-white                    0.3366        0.0511       0.0000
  Median income               -0.1684        0.0513       0.0002
  Urban population            -0.0101        0.0362       0.3974
  $\rho$                       0.6176        0.0804       0.0000
  $\sigma^2$                   0.9742        0.3121

Non-spatial probit model

  Variable                   Coefficient    t-Statistic   t-Probability
  High school                  0.1961         6.494         0.0000
  College                     -0.1446        -3.329         0.0008
  Graduate/professional        0.2276         5.568         0.0000
  Non-white                    0.2284         8.203         0.0000
  Median income               -0.0003        -0.011         0.9909
  Urban population            -0.0145        -0.521         0.6017

ᵃ See Gelman, Carlin, Stern and Rubin (1995) for a description of p-levels.
With the exception of high school graduates, the magnitudes of the coefficients on all other variables are quite different. The heteroscedastic estimates are larger (in absolute value terms) than the homoscedastic results, with the exception of population in urban areas, which is not significant. In the case of college graduates, the homoscedastic and heteroscedastic results differ regarding the magnitude and significance of a negative impact on Clinton winning. The heteroscedastic results suggest a larger negative influence significant at conventional levels, while the homoscedastic results indicate a smaller, insignificant influence.

The results also indicate very different inferences would be drawn from the non-spatial probit model versus the spatial probit models. For example, the non-spatial model produced larger coefficient estimates for all three education variables. It is often the case that ignoring spatial dependence leads to larger parameter estimates, since the spatial effects are attributed to the explanatory variables in these non-spatial models. Another difference is that the coefficient on median income is small
Fig. 5. Individual Effects Estimates for the 1996 Presidential Election.
Fig. 6. Individual Effects Estimates from Homoscedastic vs. Heteroscedastic Spatial Probit Models.
and insignificant in the non-spatial model whereas it is larger (in absolute value terms) and significant in both spatial models. The parameter estimates for the spatial interaction effects should exhibit spatial dependence given the estimates for . Figure 5 shows a graph of these estimates along with a ±2 standard deviation confidence interval. In the figure, the states were sorted by 0.1 values reflecting the 18 states where Dole won the majority of votes versus the 30 states where Clinton won. From the figure we see that in the 30 states where Clinton won there is evidence of predominately positive spatial interaction effects, whereas in the states where Dole won there are negative individual effects. A comparison of the individual effect estimates from the homoscedastic and heteroscedastic models is shown in Fig. 6, where we see that these two sets of estimates would lead to the same inferences. Figure 7 shows a map of the significant positive and negative estimated individual effects as well as the insignificant effects, (based on the heteroscedastic
156
TONY E. SMITH AND JAMES P. LESAGE
Fig. 7. Individual Effects Estimates from Homoscedastic and Heteroscedastic Spatial Probit Models.
model). This map exhibits spatial clustering of positive and negative effects, consistent with the positive spatial dependence parameter estimate for . Finally, to assess predictive accuracy of the model we examined the predicted probabilities of Clinton winning. In counties where Dole won, the model should produce a probability prediction less than 0.5 of a Clinton win. On the other hand accurate predictions in counties where Clinton won would be reflected in probability predictions greater than 0.5. We counted these cases and found the heteroscedastic model produced the correct predictions in 71.82 percent of the counties where Dole won and in 71.16 percent of the counties where Clinton won. The homoscedastic model produced correct predictions for Dole in 73.29 percent of the counties and for Clinton in 69.06 percent of the counties.
6. CONCLUSION A hierarchical Bayesian spatial probit model was introduced here that allows for spatial interaction effects as well as heterogeneous individual effects. The model extends the traditional Bayesian spatial probit model by allowing decision-makers
A Bayesian Probit Model with Spatial Dependencies
157
to exhibit spatial similarities. In addition to spatial interaction effects, the model also accommodates heterogeneity over individuals (presumed to be located at distinct points in space) by allowing for non-constant variance across observations. Estimation of the model is via MCMC sampling which allows for the introduction of prior information regarding homogeneity versus heterogeneity as well as prior information for the regression and noise variance parameters. The model is not limited to the case of limited dependent variables and could be applied to traditional regression models where a spatial interaction effect seems plausible. This modification involves eliminating the truncated normal draws used to obtain latent y values in the case of limited dependent variables. MATLAB functions that implement the probit and regression variants of the model presented here are available at: spatial-econometrics.com.
NOTES 1. This simultaneous autoregressive specification of regional dependencies follows the spatial econometrics tradition (as for example in Anselin, 1988; McMillen, 1992]. An alternative specification is the conditional autoregressive scheme employed by Besag et al. (1991). 2. Note that if a given region i is isolated (with no neighbors) then wij = 0 for all j = 1, . . . , m. In this case, the “normalized” weights are also zero. 3. This zero-mean convention allows one to interpret the beta coefficient corresponding to the regional fixed-effect column, ␦i (·) as the implicit mean of each ik . 4. It should be noted that the presence of regional dependencies (i.e. nonzero offdiagonal elements in B ) also generates heteroscedasticity effects, as discussed for example in McMillen (1992). Hence the variances, vi are implicitly taken to reflect regional heteroscedasticity effects other than spatial dependencies. 5. Note again that by assumption X always contains m columns corresponding to the indicator functions, ␦(·), i = 1, . . . , m. 6. An alternative (not examined here) would be to consider regional groupings of individuals with possible heteroscedasticity effects between regions, while allowing spatial dependencies to occur at the individual rather than regional level. 7. The third party candidacy of Perot was ignored and only votes for Clinton and Dole were used to make this classification of 0.1 values. 8. Estimates based on 1,500 draws with the first 500 omitted were nearly identical suggesting that one need not carry out an excessive number of draws in practice.
ACKNOWLEDGMENTS The authors would like to thank Alan Gelfand for comments on an earlier version of this paper. The second author acknowledges support from the National Science Foundation, BCS-0136229.
158
TONY E. SMITH AND JAMES P. LESAGE
REFERENCES Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669–679. Amemiya, T. (1985). Advanced econometrics. Cambridge, MA: Harvard University Press. Anselin, L. (1988). Spatial econometrics: Methods and models. Dordretcht: Kluwer. Barry, R., & Pace, R. K. (1999). A Monte Carlo estimator of the log determinant of large sparse matrices. Linear Algebra and its Applications, 289, 41–54. Besag, J., York, J. C., & Mollie, A. (1991). Bayesian image restoration, with two principle applications in Spatial Statistics. Annals of the Institute of Statistical Mathematics, 43, 1–59. Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398–409. Gelman, A., Carlin, J. B., Stern, H. A., & Rubin, D. R. (1995). Bayesian data analysis. London: Chapman & Hall. Geweke, J. (1993). Bayesian treatment of the independent student-t linear model. Journal of Applied Econometrics, 8, 19–40. LeSage, J. P. (1999). The theory and practice of spatial econometrics. Unpublished manuscript available at: http://www.spatial-econometrics.com. LeSage, J. P. (2000). Bayesian estimation of limited dependent variable spatial autoregressive models. Geographical Analysis, 32(1), 19–35. McMillen, D. P. (1992). Probit with spatial autocorrelation. Journal of Regional Science, 32(3), 335–348. Pace, R. K., & Barry, R. (1997). Quick computation of spatial autoregressive estimators. Geographical Analysis, 29, 232–246. Sun, D., Tsutakawa, R. K., & Speckman, P. L. (1999). Posterior distribution of hierarchical models using car(1) distributions. Biometrika, 86, 341–350.
APPENDIX In this appendix we derive a sequence of univariate conditional posterior distributions for each element of that allows the MCMC sampling scheme proposed here to be applied in larger models. For models with less than m = 100 regions it is probably faster to simply compute the inverse of the m × m matrix A 0 and use the multinormal distribution presented in (41). For larger models this can be computationally burdensome as it requires large amounts of memory. The univariate conditional distributions are based on the observation that the joint density in (40) involves no inversion of A 0 , and hence is easily computed. Since the univariate conditional posteriors of each component, i of must be proportional to this density, it follows that each is univariate normal with a mean and variance that are readily computable. To formalize these observations, observe first that if for each realized value of and each i = 1, . . . , m we let −i = (1 , . . . , i−1 , i+1 , . . . , m ), then by the
A Bayesian Probit Model with Spatial Dependencies
159
same argument as in (31) we see that p(i |∗) =
p(, , , 2 , V, z, y) ∝ p(, , , 2 , V, z|y) p(−i , , , 2 , V, z|y) × ∝ (z|, , V)(|, 2 ) 1 ′ −2 ′ ′ −1 ′ −1 × ∝ exp − [ ( B B + V ) − 2(z − X) V ] 2 (A.1)
This expression can be reduced to terms involving only i as follows. If we let = (i : i = 1, . . . , m)′ = [(z − X)′ V −1 ]′ , then the bracketed expression in (A.1) can be written as, ′ (−2 B ′ B + ′ V −1 ) − 2(z − X)′ V −1 1 ′ (I − W ′ )(I − W) + ′ ′ V −1 − 2′ 2 1 = 2 [′ − 2′ W + 2 ′ W ′ W] + ′ ′ V −1 − 2′ =
(A.2)
But by permuting indices so that ′ = (i , ′−i ), it follows that i ′ ′ W = ( w.i W−i ) = ′ (i w.i + W −i −i ) −i = i (′ w.i ) + ′ W −i −i
(A.3)
where w.i is the ith column of W and W −i is the m × (m − 1) matrix of all other columns of W. But since wii = 0 by construction, it then follows that j wij j = i ′ ′ ′ j wji + ( i −i ) W = C j=i = i
j=i
j (wji + wij ) + C
(A.4)
where C denotes a constant not involving parameters of interest. Similarly, we see from (A.3) that ′ W ′ W = (i w.i + W −i −i )′ (i w.i + W −i −i ) = 2i w′.i w.i + 2i (w′.i W −i −i ) + C
(A.5)
160
TONY E. SMITH AND JAMES P. LESAGE
Hence, observing that ′ = 2i + C ′
′ −1
V
=
(A.6) n i 2i /vi
+C
−2′ V −1 = −2i i + C
(A.7) (A.8)
where the definition of = (i : i = 1, . . . , m)′ implies (using the notation in (10)) that each i has the form i =
1′i (z i − X i ) , vi
i = 1, . . . , m,
(A.9)
it follows by substituting these results into (A.2) that we may rewrite the conditional posterior density of i as 1 p(i |∗) ∝ exp{− [(−2i j (wji + wij )i + 2 2i w′.i w.i + 22 i 2 j=i
1 1 + n i 2i /vi − 2i i ]} = exp{− (a i 2i − 2b i i )} 2 2 1 bi 2 1 2 2 } ∝ exp{− (a i i − 2b i i + b i /a i )} = exp{− i − 2 2(1/a i ) ai (A.10)
× (w′.i W −i −i ))
where a i and b i are given respectively by 2 ′ ni 1 + w.i w.i + 2 2 vi 2 j (wji + wij )j − 2 w′.i W −i −i b i = i + 2
ai =
(A.11) (A.12)
j=i
Thus the density in (A.10) is seen to be proportional to a univariate normal density with mean, b i /a i , and variance, 1/a i , so that for each i = 1, . . . , m the conditional posterior distribution of i given −i must be of the form bi 1 2 i |(−i , , , , V, z, y) ∼ N (A.13) , ai ai
INSTRUMENTAL VARIABLE ESTIMATION OF A SPATIAL AUTOREGRESSIVE MODEL WITH AUTOREGRESSIVE DISTURBANCES: LARGE AND SMALL SAMPLE RESULTS Harry H. Kelejian, Ingmar R. Prucha and Yevgeny Yuzefovich ABSTRACT The purpose of this paper is two-fold. First, on a theoretical level we introduce a series-type instrumental variable (IV) estimator of the parameters of a spatial first order autoregressive model with first order autoregressive disturbances. We demonstrate that our estimator is asymptotically efficient within the class of IV estimators, and has a lower computational count than an efficient IV estimator that was introduced by Lee (2003). Second, via Monte Carlo techniques we give small sample results relating to our suggested estimator, the maximum likelihood (ML) estimator, and other IV estimators suggested in the literature. Among other things we find that the ML estimator, both of the asymptotically efficient IV estimators, as well as an IV estimator introduced in Kelejian and Prucha (1998), have quite similar small sample properties. Our results also suggest the use of iterated versions of the IV estimators. Spatial and Spatiotemporal Econometrics Advances in Econometrics, Volume 18, 163–198 Copyright © 2004 by Elsevier Ltd. All rights of reproduction in any form reserved ISSN: 0731-9053/doi:10.1016/S0731-9053(04)18005-5
163
164
HARRY H. KELEJIAN ET AL.
1. INTRODUCTION Spatial models are important tools in economics, regional science and geography in analyzing a wide range of empirical issues. For example, in recent years these models have been applied to contagion problems relating to bank performance as well as international finance issues, various categories of local public expenditures, vote seeking and tax setting behavior, population and employment growth, and the determinants of welfare expenditures, among others.1 By far the most widely used spatial models are variants of the one suggested by Cliff and Ord (1973, 1981) for modeling a single spatial relationship. One method of estimation of these models is maximum likelihood; another is the instrumental variable (IV) procedure suggested by Kelejian and Prucha (1998). The Kelejian and Prucha (1998) procedure relates to the parameters of a spatial first order autoregressive model with first order autoregressive disturbances, or, for short, a SARAR(1,1) model, and is based on a generalized moments (GM) estimator of a parameter in the disturbance process. The GM estimator was suggested by Kelejian and Prucha (1999) in an earlier paper.2 The Kelejian and Prucha (1998, 1999) procedures do not require specific distributional assumptions. They are also easily extended to a systems framework. However, the IV estimator in Kelejian and Prucha (1998) is based on an approximation to the ideal instruments and, therefore does not fully attain the asymptotic efficiency bound of IV estimators. In a recent paper Lee (2003) extends their approach in terms of the ideal instruments and gives an asymptotically efficient IV estimator. The purpose of this paper is two-fold. First, on a theoretical level we introduce a series-type IV estimator for the SARAR(1,1) model. This estimator is a natural extension of the one proposed in Kelejian and Prucha (1998) concerning the selection of instruments. We show that our series-type IV estimator is asymptotically normal and efficient. It is also computationally simple. Second, via Monte Carlo techniques we give small sample results relating to our suggested estimator, the IV estimators of Lee (2003) and Kelejian and Prucha (1998), iterated versions of those estimators, as well as the maximum likelihood (ML) estimator for purposes of comparison. Among other things we find that the ML estimator, both of the asymptotically efficient IV estimators, as well as the IV estimator of Kelejian and Prucha (1998), have quite similar small sample properties. We also find that iterated versions of the IV estimators typically do not lead to increases in the root mean squared errors, but for certain parameter values, the root mean squared errors are lower. Therefore, the use of such iterated estimators is suggested.
Large and Small Sample Results
165
2. MODEL Consider the following (cross sectional) first order autoregressive spatial model with first order autoregressive disturbances (n ∈ N): yn = Xn  + Wn yn + un , || < 1, un = Mn un + n , || < 1,
(1)
where y n is the n × 1 vector of observations on the dependent variable, X n is the n × k matrix of observations on k exogenous variables, W n and M n are n × n spatial weighting matrices of known constants and zero diagonal elements,  is the k × 1 vector of regression parameters, and are scalar autoregressive parameters, u n is the n × 1 vector of regression disturbances, and n is an n × 1 vector of innovations. As remarked, and consistent with the terminology of Anselin (1988), we refer to this model as an SARAR(1,1,) model. The variables W n y n and M n u n are typically referred to as spatial lags of y n and u n , respectively. For reasons of generality we permit the elements of y n , X n , W n , M n , u n and n to depend on n, i.e. to form triangular arrays. We condition our analysis on the realized values of the exogenous variables and so, henceforth, the matrices X n will be viewed as a matrices of constants. With a minor exception, we make the same assumptions as in Kelejian and Prucha (1998), along with an additional (technical) assumption which was also assumed by Lee (2003). For the convenience of the reader these assumptions, labeled Assumptions 1–8, are given in the Appendix. For a further discussion of these assumptions see Kelejian and Prucha (1998). Given Assumption 2, the roots of aW n and of aM n are less than one in absolute value for all |a| < 1; see, e.g. Horn and Johnson (1985, p. 344). Therefore, for || < 1 and || < 1 the matrices I n − W n and I n − M n are nonsingular and furthermore (In − Wn )−1 =
∞
k Wnk ,
k=0
(In − Mn )
−1
=
∞
(2) k
Mnk .
k=0
It follows from (1) that yn = (In − Wn )−1 Xn  + (In − Wn )−1 un , un = (In − Mn )−1 n .
(3)
166
HARRY H. KELEJIAN ET AL.
In light of Assumption 4 we have E(u n ) = 0 and therefore E(y n ) = (I n − W n )
−1
Xn  =
∞
k W kn X n .
(4)
k=0
The variance-covariance matrix of u n is given by E(u n u ′n ) = 2 (I n − M n )−1 (I n − M ′n )−1 .
(5)
We also note from (3) that E(y n u ′n ) = 2 (I n − W n )−1 (I n − M n )−1 (I n − M ′n )−1 = 0 so that, in general, the elements of the spatially lagged dependent vector, W n y n , are correlated with those of the disturbance vector. One implication of this is that the parameters of (1) can not generally be consistently estimated by ordinary least squares.3 In the following discussion it is helpful to rewrite (1) more compactly as y n = Zn ␦ + u n , un = Mn un + n ,
(6)
where Z n = (X n , W n y n ) and ␦ = (′ , )′ . Applying a Cochrane-Orcutt type transformation to this model yields y n∗ = Z n∗ ␦ + n ,
(7)
where y n∗ = y n − M n y n and Z n∗ = Z n − M n Z n . In the following we may also express y n∗ and Z n∗ as y n∗ () and Z n∗ () to indicate the dependence of the transformed variables on .
3. IV ESTIMATORS 3.1. IV Estimators in the Literature ′ In the following let ˆ n be any consistent estimator for , and let ␦ˆ n = (ˆ n , ˆ n )′ be any n 1/2 -consistent estimator for ␦. As one example, ␦ˆ n could be the two stage least squares estimator of ␦, and ˆ n could be the corresponding GM estimator of , which was suggested in the first and second steps of the estimation procedure introduced in Kelejian and Prucha (1998). Recalling (4), the optimal instruments for estimating ␦ from (7) are
¯ n∗ = E(Z n∗ ) = (I n − M n )E(Z n ) = (I n − M n )(X n , W n E(y n )) Z = (I n − M n )[X n , W n (I n − W n )−1 X n ].
(8)
Large and Small Sample Results
167
¯ n∗ as Z ¯ n∗ (, ␦). In light of (4) we have In the following we will also express Z ∞ k k+1 ¯ n∗ = (I n − M n ) X n , Z Wn Xn  , (9) k=0
which shows that the optimal instruments are linear combinations of the columns of {X n , W n X n , W 2n X n , . . . , M n X n , M n W n X n , M n W 2n X n , . . .}. Motivated by this observation, Kelejian and Prucha (1998) introduced their feasible general spatial two stage least squares estimator (FGS2SLS) estimator in terms of an approximation to these optimal instruments. Specifically, their approximation to ¯ n∗ is in terms of fitted values obtained from regressing Z n∗ (ˆn ) against a set of Z instruments H n , which are taken to be a fixed subset of the linearly independent columns of {X n , W n X n , W 2n X n , . . . , W qn X n , M n X n , M n W n X n , M n W 2n X n , . . . , M n W qn X n } where q is a pre-selected constant, and the subset is required to contain at least the linearly independent columns of {X n , M n X n }. Typically, one would take q ≤ 2, ˆ n∗ see e.g. Rey and Boarnet (1998), and Das, Kelejian and Prucha (2003). Let Z denote those fitted values; then ˆ n∗ = P H n Z n∗ (ˆn ) = ([X n − ˆ n M n X n , P H n (W n y n − ˆ n M n W n y n )]) Z
(10)
where P H n = H n (H n H N )−1 H ′n denotes the projection matrix corresponding to ˆ n∗ as Z ˆ n∗ (ˆn ) to signify its dependence H n . In the following we will also express Z on ˆ n . Given this notation, the FGS2SLS estimator of Kelejian and Prucha (1998), say ␦ˆ F,n , is defined as ˆ n∗ (ˆn )′ yn∗ (ˆn ) ˆ n∗ (ˆn )′ Zn∗ (ˆn )]−1 Z [Z ␦ˆ F,n = (11) ˆ n∗ (ˆn )′ Z ˆ n∗ (ˆn )]−1 Z ˆ n∗ (ˆn )′ yn∗ (ˆn ) [Z In our discussion below, we will also express ␦ˆ F,n as ␦ˆ F,n (ˆn ) in order to signify its dependence on ˆ n . Kelejian and Prucha (1998) showed that d
n 1/2 (␦ˆ F,n − ␦)→N(0, )
(12)
−1 ˆ n∗ ()′ Z ˆ n∗ () = 2 lim n −1 Z .
(13)
where n→∞
The FGS2SLS estimator uses a fixed set of instruments and hence in general ˆ n∗ (ˆn ) will not approximate the optimal instruments arbitrarily close even as the Z
168
HARRY H. KELEJIAN ET AL.
sample size tends to infinity. For future reference we note that the computation of q q−1 W n X n in their procedure could be determined recursively as W n (W n X n ) and so the operational count of their procedure is O(n 2 ). In a recent paper Lee (2003) introduced the following IV estimator ¯ n∗ (ˆn , ␦ˆ n )′ Z n∗ (ˆn )]−1 Z ¯ n (ˆn , ␦ˆ n )′ y n∗ (ˆn ) ␦ˆ B,n = [Z where the optimal instrument is approximated by ¯ n∗ (ˆn , ␦ˆ n ) = (I n − ˆ n M n ) X n , W n (I n − ˆ n W n )−1 X n ˆ n , Z
(14)
(15)
which is obtained by replacing the true parameters values in (8) by the estimators ˆ n and ␦ˆ n . In our discussion below we will also express ␦ˆ B,n as ␦ˆ B,n (ˆn , ␦ˆ n ) in order to signify its dependence on ˆ n and ␦ˆ n . Lee (2003) showed that d n 1/2 (␦ˆ B,n − ␦)→N(0, L )
(16)
with L = 2
¯ n∗ (, ␦)′ Z ¯ n∗ (, ␦) lim n −1 Z
n→∞
−1
.
(17)
Lee also demonstrated, as is expected from the literature on optimal instruments, that is a lower bound for the asymptotic variance-covariance matrix of any IV estimator for ␦. Lee therefore calls his estimator the Best FGS2SLS estimator.
3.2. A Series-Type Efficient IV Estimator ¯ n∗ (ˆn , ␦ˆ n ) involves The computation of Lee’s (2003) optimal instrument Z −1 ˆ ˆ the calculation of W n (I n − n W n ) X n n . Lee (2003) designed a numerical algorithm which simplifies this computation but, never-the-less, the operational count of his procedure is still O(n 3 ) and, furthermore, requires special programming of the algorithm. Therefore it seems of interest to have available an alternative optimal IV estimator that is computationally simpler and can be readily computed in standard packages such as TSP without the need of further programming. Towards this end, let r n be some sequence of natural numbers with r n ↑ ∞, and ¯ n∗ : in light of (9) consider the following series estimator for Z rn k k+1 ˜ n∗ = (I n − ˆ n M n ) X n , Z (18) ˆ n W n X n ˆ n . k=0
Large and Small Sample Results
169
˜ n∗ as Z ˜ n∗ (ˆn , ␦ˆ n ). Using Z ˜ n∗ we define the In the following we also express Z following IV estimator for ␦: −1 ˜ n∗ (ˆn , ␦ˆ n )′ Z n∗ (ˆn ) ˜ n (ˆn , ␦ˆ n )′ y n∗ (ˆn ). ␦ˆ S,n = Z Z (19) We will refer to the estimator in (19) as the Best Series FGS2SLS estimator. In our discussion below we will also express ␦ˆ S,n as ␦ˆ S,n (ˆn , ␦ˆ n , r n ) in order to signify its dependence on ˆ n , ␦ˆ n and r n . In the Appendix we prove the following theorem.
Theorem. Let r n be some sequence of natural numbers with 0 ≤ r n ≤ n, r n ↑ ∞, and r n = o(n 1/2 ). Then under the maintained assumptions d n 1/2 (␦ˆ S,n − ␦)→N(0, L ).
(20)
The theorem demonstrates that ␦ˆ S,n is also an asymptotically efficient estimator within the class of IV estimators. In addition, recalling that W k+1 n X n can be computed recursively as W n (W kn X n ), and noting that r n = o(n 1/2 ), the operational count of the series estimator is O(n 2 r n ), which is o(n 2+1/2 ), and therefore is less than that of Lee’s estimator, but will exceed that of the GS2SLS estimator.
4. SMALL SAMPLE PROPERTIES OF IV ESTIMATORS 4.1. The Monte Carlo Model In this section we study the small sample properties of the IV estimators discussed in Section 3, as well as iterated versions of those estimators using Monte Carlo techniques. For purposes of comparison we also consider the maximum likelihood estimator as well as the least squares estimator. The Monte Carlo model is yn = Xn  + Wn yn + un , || < 1 un = Wn un + n , 0, where −E(u¯ ′n u¯ n ) n 2E(u′n u¯ n ) 1 (A.1) Ŵn = 2E(u¯ ′n u¯ n ) −E(u¯ ′n u¯ n ) tr(Mn′ Mn ) n ′ ′ ′ ¯ ¯ E(un un + u¯ n u¯ n ) −E(u¯ n un ) 0 and u¯ n = M n u n and u¯ n = M n u¯ n = M 2n u n .
¯ n∗ (, ␦) = (I n − M n )[X n , W n (I n − W n )−1 X n ], then Assumption 8. Let Z ¯ n∗ (, ␦)′ Z ¯ n∗ (, ␦) = p lim n −1 Z n→∞
where is finite and nonsingular.
188
HARRY H. KELEJIAN ET AL.
Proof of Theorem 1 The proof of this theorem will be in terms of a sequence of lemmas. For the subsequent discussion observe that ¯ n∗ (, ␦) = (In − Mn )[Xn , E¯yn ], Z ˜ n∗ (ˆn , ␦ˆ n ) = (In − ˆ n Mn )[Xn , y˜¯ n ] Z
(A.2)
where E¯yn = Wn Eyn = Wn (In − Wn )−1 Xn  =
∞ k=0
y˜¯ n =
rn
k Wnk+1 Xn , (A.3)
k ˆ n Wnk+1 Xn ˆ n .
k=0
The proof will also utilize repeatedly the observations summarized in the subsequent remark. Remark. Let A n and B n be two n × n matrices whose row and column sums are uniformly bounded in absolute value by finite constants c A and c B , respectively. Furthermore, let a n and b n be n × 1 vectors whose elements are uniformly bounded in absolute value by finite constants c a and c b , respectively. It is then readily seen that the row and column sums of A n B n are uniformly bounded in absolute value by the finite constant c A c B , see, e.g. Kelejian and Prucha (1999). Similarly, the elements of A n b n are seen to be uniformly bounded in absolute value by c A c b . Since via Assumption 2 the row sums of W n are uniformly bounded in absolute value by one, it follows that the elements of W n b n are uniformly bounded in absolute value by c b . By recursive argumentation it follows further that also the elements of W kn b n = W n (W k−1 n b n ) are uniformly bounded in absolute value by c b for k = 2, 3, . . .. Since by Assumption 3 the elements of X n are uniformly bounded in absolute value it follows further that the k+1 k+1 −1 ′ elements of A n X n , n −1 X ′n A n X n , n −1 a ′n W k+1 n X n , A n W n X n , n a n W n X n k+1 and A n W n X n are uniformly bounded in absolute value by some finite constant. Finally, let C n be some n × p matrix whose elements are uniformly bounded in absolute value; then n −1 C ′n n = o p (1), given Assumption 4 holds for n . Lemma 1. Let plimn→∞ ˆ n = with || < 1, let ˜ n = ˆ n 1(|ˆ n | < 1), and let r n be some sequence of natural numbers with r n ↑ ∞ as n → ∞. p Then plimn→∞ ˜ n = , and for any p ≥ 0 we have plimn→∞ r n |ˆ n |r n = p ˜ rn plimn→∞ r n |n | = 0.
Large and Small Sample Results
189
Proof: For arbitrary > 0 P ˜n − > ≤ P ˜n − ˆ n + ˆ n − > ≤ P ˜n − ˆ n > /2 + P ˆ n − > /2 ≤ P ˆ n ≥ 1 + P ˆ n − > /2 . observing that ˜ n − ˆ n = 0 for ˆ n < 1 and thus ˜n − ˆ n > /2 ⊆ ˆ n ≥ 1 . Since plimn→∞ ˆ n = with || < 1 it follows that both probabilities on the r.h.s. of the last inequality tend to zero, which establishes that plimn→∞ ˜ n = . Next choose some ␦ = (1 − ||)/2 > 0, then for any > 0 r r P r pn ˆ n n > = P r pn ˆ n n > , ˆ n − ≤ ␦ r + P r pn ˆ n n > , ˆ n − > ␦ ≤ P r pn ( || + ␦)r n > + P ˆ n − > ␦ .
Since || + ␦ < 1, and since limx→∞ x p a x = 0 for all 0 ≤ a < 1, it follows that p p limn→∞ r n (|| + ␦)r n = 0, and hence P(r n (|| + ␦)r n > ) → 0 as n → ∞. Since ˆ n is a consistent estimator for we also have P(|ˆ n − | > ␦) → 0 as n → ∞. Hence both terms on the r.h.s. of the last inequality limit to zero as n → p p ∞, which shows that plimn→∞ r n |ˆ n |r n = 0. To show that plimn→∞ r n |ˆ n |r n = 0 we have only used that ˆ n is a consistent estimator for . Since ˜ n is a consistent estimator for it follows that also the last claim holds. Lemma 2. Suppose n 1/2 (ˆ n − ) = O p (1) with || < 1 and define ˜ n = ˆ n 1(|ˆ n | < 1). Then n 1/2 (˜ n − ) = O p (1), n 1/2 (|ˆ n | − ||) = O p (1) and n 1/2 (|˜ n | − ||) = O p (1).
Proof: Observe that for every > 0 P n 1/2 ˜n − ˆ n ≥ ≤ P ˆ n ≥ 1 .
Since ˆ n is a consistent estimator for with || < 1, the probability on the r.h.s. tends to zero as n → ∞ and thus n 1/2 (˜ n − ˆ n ) = o p (1). Hence n 1/2 (˜ n − ) = n 1/2 (˜ n − ˆ n ) + n 1/2 (ˆ n − ) = o p (1) + O p (1) = O p (1). Since ||ˆ n || − || ≤ |ˆ n − | and ||˜ n | − ||| ≤ |˜ n − | the other claims follow trivially.
Lemma 3. Let r n be some sequence of natural numbers with 0 ≤ r n ≤ n, r n ↑ ∞, and r n = o(n 1/2 ). Suppose n 1/2 (ˆ n − ) = O p (1) with || < 1, then
190
HARRY H. KELEJIAN ET AL.
plimn→∞ r n
ˆk k=0 (n
r n
− k )2 = 0.
n k Proof: Define ˜ n = ˆ n 1(|ˆ n | < 1), let n = r n rk=0 (ˆ n − k )2 and n = n k r n rk=0 (˜ n − k )2 . Then for every > 0 P(|n | > ) = P n > , ˆ n < 1 + P n > , ˆ n ≥ 1 ≤ P n > + P ˆ n ≥ 1
observing that for realizations ∈ with |ˆ n ()| < 1 we have n () = n (). Since plimn→∞ ˆ n = with || < 1 it follows immediately that the second probability on the r.h.s. of the last inequality tends to zero. To complete the proof of the claim we next show that the first probability of that r.h.s. tends to zero, i.e., that n = o p (1). Observe that n = 1n + 2n , 1 2 1 1n = rn − + 2 1 − ˜ n 1 − 2 1 − ˜ n 1/2 ˜ n − ˜ n 1 − 2 − 1 − ˜ 2n n rn = 1/2 2 n 1 − ˜ 1 − 2 1 − ˜ n n
2n =
2(r +1) r n ˜ n n − 2 1 − ˜ n
−
r n 2(r n +1) 1 − 2
r +1 r n ˜ n n +2 1 − ˜ n
Since |˜ n | < 1 all terms are well defined. By Lemma 2 n 1/2 ˜ n − = O p (1) – r and thus, of course, plimn→∞ ˜ n = . By Lemma 1 we have plimn→∞ r n ˜ nn = rn 1/2 0 and limn→∞ r n = 0. Observing that r n /n = o(1) it is then readily seen that 1n = o p (1) and 2n = o p (1) and thus n = o p (1). Lemma 4. Suppose n 1/2 ˆ n − = O p (1) with || < 1, then p lim n n→∞
n
k
(ˆ n − k ) = 0
k=0
and p lim n n→∞
for 0 ≤ < 1/2.
n k=0
(|ˆ n |k − ||k ) = 0
Large and Small Sample Results
191
Proof: Define ˜ n = ˆ n 1(|ˆ n | < 1) and n = n decomposition
ˆk k=0 (n
n
− k ). Consider the
n = 1n + 2n , 1n = n
n k (˜ n − k ), k=0
2n = n
n
k k (ˆ n − ˜ n ).
k=0
Observe that 1n = n
n+1 1 − ˜ n 1 − n+1 − 1− 1 − ˜ n
n+1
= n −1/2
n 1/2 (˜ n − ) n ˜ n − (1 − ˜ n )(1 − )
(1 − ) − n n+1 (1 − ˜ n ) . (1 − ˜ n )(1 − )
Since |˜ n | < 1 all expressions on the r.h.s. are well defined. By Lemma 2 we have n 1/2 (˜ n − ) = O p (1) and thus plimn→∞ ˜ n = . Hence 1/[(1 − ˜ n )(1 − )] = O p (1). Observing that n −1/2 = o(1) and that in light of Lemma 1 n+1 n ˜ n = o p (1) and n n+1 = o(1) it follows that 1n = o p (1). Next observe that for every > 0 P(|2n | > ) ≤ P(|ˆ n | ≥ 1). Since plimn→∞ ˆ n = with || < 1 it follows that the probability on the r.h.s. tends to zero, which establishes that also 2n = o p (1), and thus n = o p (1) as claimed. By Lemma 2 we have n 1/2 (|ˆ n | − ||) = O p (1), and thus the second claim follows as a special case of the first claim. Lemma 5. Given the model in (1), suppose Assumptions 1–4 hold and n 1/2 (ˆ n − ) = O p (1) with || < 1 and ˆ n −  = o p (1). Let r n be some sequence of natural numbers with 0 ≤ r n ≤ n, r n ↑ ∞ and r n = o(n 1/2 ), and let a n = (a 1,n , . . . , a n,n )′ be some sequence of n × 1 constant vectors whose elements are uniformly bounded in absolute value. Then n −1 a ′n (y˜¯ n − E y¯ n ) = o p (1).
192
HARRY H. KELEJIAN ET AL.
Proof: Recall the expressions for E y¯ n and y˜¯ n in (A.3). Define n = n −1 a ′n (y˜¯ n − E y¯ n ) and consider the decomposition n = n1 + n2 + n3 , rn rn k k (ˆ n − k )bn(k) , (ˆ n − k )an′ Wnk+1 Xn  = n1 = n−1 k=0
n2 = n−1 n3 = n−1
rn
k ˆ n an′ Wnk+1 Xn (ˆ n − ) =
k=0 ∞
k=0 rn
k ˆ n (cn(k) )′ (ˆ n − ),
(A.4)
k=0
k an′ Wnk+1 Xn  =
∞
k bn(k) ,
k=rn +1
k=rn +1 (k) cn
(k) k+1 ′ −1 ′ = [n −1 a ′n W k+1 where b (k) n X n ] . Observe that b n n = n a n W n X n  and (k) and the elements of c n are uniformly bounded by some finite constant, say K, in light of the remarks above Lemma 1. To prove the claim we now show that ni = o p (1) for i = 1, 2, 3. Applying the Cauchy-Schwartz and triangle inequalities to the expression for n1 in (A.4) yields 1/2 r 1/2 1/2 r r n n n k k k 2 k 2 (k) 2 |n1 | ≤ (ˆ n − ) . (b n ) ≤ K rn (ˆ n − ) k=0
k=0
k=0
That n1 = o p (1) now follows directly from Lemma 3. Applying the triangle inequalities to the expression for n2 in (A.4) yields rn rn k k (k) ′ ˆ n c n2 ≤ ˆ ˆ ˆ n i,n − i . n −  ≤ K n k=0
k=0
i
n k ˆ n = By Lemma 4 with = 0 we have plimn→∞ rk=0 r n k limn→∞ k=0 || = 1/(1 − || ). Since i ˆ i,n − i = o p (1) it follows that n2 = o p (1). Applying the triangle inequality to the expression for n3 in (A.4) yields |n3 | ≤
∞
||k |b (k) n |≤K
k=r n +1
∞
k=r n +1
||k = K
||r n +1 . 1 − ||
1 it follows that ||r n +1
Since || < → 0 as n → ∞ and thus n3 = o(1), which completes the proof of the lemma. Lemma 6. Given the model in (1), suppose Assumptions 1–4 hold and n 1/2 (ˆ n − ) = O p (1) with || < 1 and ˆ n −  = o p (1). Let r n be some
Large and Small Sample Results
193
sequence of natural numbers with 0 ≤ r n ≤ n, r n ↑ ∞ and r n = o(n 1/2 ), and let A n = (a ij,n ) be some sequence of constant n × n matrices whose row and column sums are uniformly bounded in absolute value . Then n −1/2 ′n A n (y˜¯ n − E y¯ n ) = o p (1). Proof: Recall the expressions for E y¯ n and y˜¯ n in (A.3). Define n = n −1/2 ′n A n (y˜¯ n − E y¯ n ) and consider the decomposition n = n1 + n2 + n3 , rn rn k k (ˆ n − k )bn(k) , (ˆ n − k )′n An Wnk+1 Xn  = n1 = n−1/2 k=0
n2 = n−1/2 n3 = n−1/2
rn
k ˆ n ′n An Wnk+1 Xn (ˆ n
k=0 ∞
− ) =
k=0 rn
k ˆ n (cn(k) )′ (ˆ n − ),
(A.5)
k=0
k ′n An Wnk+1 Xn  =
∞
k bn(k) ,
k=rn +1
k=rn +1
(k)
−1/2 ′ A W k+1 X ]′ . Observe −1/2 ′ A W k+1 X  and c where b (k) n n n = [n n =n n n n n n n (k) (k) that the expected value of b n and of the elements of c n is zero. Observe k+1 further that the elements of A n W k+1 n X n  and A n W n X n are uniformly bounded by some finite constant, say K, in light of the remarks above Lemma 1. Since the i,n are distributed i.i.d. (0, 2 ) it follows that the variance of b (k) n and of (k) 2 2 the elements of c n are uniformly bounded by K . To prove the claim we now show that ni = o p (1) for i = 1, 2, 3. Applying the Cauchy-Schwartz and triangle inequalities to the expression for n1 in (A.5) yields
|n1 | ≤
rn
k ˆ n − k
k=0
≤ O p (1) r n
rn k=0
2
1/2 r n
2 (b (k) n )
k=0
k (ˆ n − k )2
1/2
1/2
r n r n (k) 2 (k) 2 −1 2 2 observing that r −1 n k=0 (b n ) = O p (1) since r n k=0 E(b n ) ≤ K . That n1 = o p (1) now follows directly from Lemma 3.
194
HARRY H. KELEJIAN ET AL.
Applying the Cauchy-Schwartz inequality twice to the expression for n2 in (A.5) yields
|n2 | ≤
rn
2 |ˆ n |k
k=0
r n ≤ 1/2 n × [n
1/2 r n
′ ˆ 2 |(c (k) n ) (n − )|
k=0
r n 1/2
2 |ˆ n |k
k=0
1/2
1/2
r −1 n
′
(ˆ n − ) (ˆ n − )]
rn
1/2
′ (k) (c (k) n ) (c n )
k=0
1/2
1/2
.
(k)
Since the variances of the elements of c n are uniformly bounded it follows r n (k) ′ (k) (k) (k) that E(c n )′ (c n ) and hence r −1 n k=0 E(c n ) (c n ) is uniformly bounded by (k) r n ′ (k) some finite constant. Thus r −1 n k=0 (c n ) (c n ) = O p (1). By Lemma 4 with n n 2 |2 |k = 1/(1 − |2 |). |ˆ n |k = limn→∞ rk=0 = 0 we have plimn→∞ rk=0 Since n 1/2 (ˆ n − ) = O p (1) and r n /n 1/2 = o(1) it follows that n2 = o p (1). Next observe that En3 = 0 since Eb (k) n = 0. To show that n3 = o p (1) it hence suffices to show that limn→∞ E2n3 = 0. Now E2n3
≤
∞
∞
(l) ||k+l E|b (k) n ||b n |
k=r n +1 l=r n +1
≤
2 K 2
∞
∞
k+l
||
≤
2 K 2
k=r n +1 l=r n +1
||r n +1 1 − ||
2
.
(l) (k) 2 1/2 since E|b (k) [E|b n(l) |2 ]1/2 ≤ 2 K 2 . Since || < 1 it follows n ||b n | ≤ [E|b n | ] r +1 n that || → 0 as n → ∞ and thus n3 = o p (1), which completes the proof of the lemma.
Proof of Theorem 1: Observe that from (6) and (7) y n∗ (ˆn ) = Z n∗ (ˆn )␦ + u n∗ (ˆn ) with u n∗ (ˆn ) = u n − ˆ n M n u n = n − (ˆn − )M n u n .
Large and Small Sample Results
195
Substitution of this expression into (19) yields after a standard transformation −1 ˜ n∗ (ˆn , ␦ˆ n )′ Z n∗ (ˆn ) ˜ n∗ (ˆn , ␦ˆ n )′ n n 1/2 (␦ˆ S,n − ␦) = n −1 Z n −1/2 Z −1 ˜ n∗ (ˆn , ␦ˆ n )′ Z n∗ (ˆn ) − n −1 Z
˜ n∗ (ˆn , ␦ˆ n )′ M n u n . × n −1/2 (ˆn − )Z
(A.6)
We now prove the result in four steps, utilizing the above decomposition.
Step 1: As our first step we show that ˜ n∗ (ˆn , ␦ˆ n )′ Z n∗ (ˆn ) = p lim n −1 Z ¯ n∗ (, ␦)′ Z ¯ n∗ (, ␦) = . p lim n −1 Z n→∞
n→∞
(A.7)
Observe that E y¯ n = W n Ey n = W n (I n − W n )
−1
Xn  =
∞
k W k+1 n X n ,
(A.8)
k=0
y¯ n − E y¯ n = W n (I n − W n )−1 u n = W n (I n − W n )−1 (I n − M n )−1 n , where y¯ n = W n y n . Recall that ˜ n∗ (ˆn , ␦ˆ n ) = (In − ˆ n Mn ) X n , y˜¯ n , Z Zn∗ (ˆn ) = (In − ˆ n Mn ) X n , y¯ n , ¯ n∗ (, ␦) = (In − Mn ) X n , E y¯ n , Z
where
y˜¯ n =
rn
k ˆ ˆ n W k+1 n X n n .
k=0
It is then readily seen that n
−1 ˜
′
Zn∗ (ˆn , ␦ˆ n ) Z n∗ (ˆn ) =
G11,n G21,n
G12,n G22,n
with G11,n = n−1 Xn′ (In − ˆ n Mn′ )(In − ˆ n Mn )Xn , G12,n = n−1 Xn′ (In − ˆ n Mn′ )(In − ˆ n Mn )¯yn , G21,n = n−1 y˜¯ ′n (In − ˆ n Mn′ )(In − ˆ n Mn )Xn , G22,n = n−1 y˜¯ ′n (In − ˆ n Mn′ )(In − ˆ n Mn )¯yn ,
(A.9)
196
HARRY H. KELEJIAN ET AL.
and n
−1
¯ n∗ (, ␦)′ Z ¯ n∗ (, ␦) = Z
H11,n H21,n
H12,n H22,n
with H11,n = n−1 Xn′ (In − Mn′ )(In − Mn )Xn , H12,n = n−1 Xn′ (In − Mn′ )(In − Mn )E¯yn , H21,n = n−1 E¯yn′ (In − Mn′ )(In − Mn )Xn , H22,n = n−1 E¯yn′ (In − Mn′ )(In − Mn )E¯yn . From the above expressions we see that G11,n − H11,n = n−1 Xn′ [−(ˆn − )(Mn′ + Mn ) + (ˆ2n − 2 )Mn′ Mn ]Xn , G12,n − H12,n = n−1 Xn′ [−(ˆn − )(Mn′ + Mn ) + (ˆ2n − 2 )Mn′ Mn ]E¯yn + n−1 Xn′ [In − ˆ n (Mn′ + Mn ) + ˆ 2n Mn′ Mn ](¯yn − E¯yn ), G21,n − H21,n = n−1 E¯yn′ [−(ˆn − )(Mn′ + Mn ) + (ˆ2n − 2 )Mn′ Mn ]Xn + n−1 (y˜¯ ′n − E¯yn′ )[In − ˆ n (Mn′ + Mn ) + ˆ 2n Mn′ Mn ]Xn , G22,n − H22,n = n−1 E¯yn′ [−(ˆn − )(Mn′ + Mn ) + (ˆ2n − 2 )Mn′ Mn ]E¯yn + n−1 (y˜¯ ′n − E¯yn′ )[In − ˆ n (Mn′ + Mn ) + ˆ 2n Mn′ Mn ]E¯yn + n−1 × (y˜¯ ′n − E¯yn′ )[In − ˆ n (Mn′ + Mn ) + ˆ 2n Mn′ Mn ](¯yn − E¯yn ) + n−1 E¯yn′ [In − ˆ n (Mn′ + Mn ) + ˆ 2n Mn′ Mn ](¯yn − E¯yn ). Upon close inspection, recalling the remarks before Lemma 1, and utilizing (A.8) and (A.9) shows that the terms on the r.h.s. have all either one of the following basic structures, where A n is some matrix whose row and column sums are uniformly bounded in absolute value: P1n = op (1) ∗ [n−1 Xn′ An Xn ], P2n = op (1) ∗ [n−1 Xn′ An Xn ], P3n = op (1) ∗ [n−1 ′ Xn′ An Xn ], P4n = Op (1) ∗ [n−1 Xn′ An n ], P5n = Op (1) ∗ [n−1 ′ Xn′ An n ], P6n = Op (1) ∗ [n−1 Xn′ An (y˜¯ n − E¯yn )], P7n = Op (1) ∗ [n−1 ′ Xn′ An (y˜¯ n − E¯yn )], P8n = Op (1) ∗ [n−1 ′n An (y˜¯ n − E¯yn )].
Large and Small Sample Results
197
Since the elements of X n are uniformly bounded in absolute value, so are the elements of n −1 X ′n A n X n , n −1 X ′n A n X n , n −1 ′ X ′n A n X n , n −1 X ′n A n , and n −1 ′ X ′n A n . Thus clearly P 1n = o p (1), P 2n = o p (1), and P 3n = o p (1). Using Chebychev’s inequality we see that also P 4n = o p (1) and P 5n = o p (1). From Lemmas 5 and 6 it follows further that P 6n = o p (1), P 7n = o p (1) and P 8n = o p (1). ˜ n∗ (ˆn , ␦ˆ n )′ Z n∗ (ˆn ) − n −1 Z ¯ n∗ (, ␦)′ Z ¯ n∗ (, ␦) = o p (1). Observing that Thus n −1 Z −1 ′ ¯ n∗ (, ␦) Z ¯ n∗ (, ␦) = by assumption completes this step of the plimn→∞ n Z proof. Step 2: We next show that ¯ n∗ (, ␦)′ n = o p (1). ˜ n∗ (ˆn , ␦ˆ n )′ n − n −1/2 Z n −1/2 Z
(A.10)
Observe that n
−1/2 ˜
′
Zn∗ (ˆn , ␦ˆ n ) n =
g1n g2n
with g1n = n−1/2 Xn′ (In − ˆ n Mn′ )n , g2n = n−1/2 y˜¯ ′n (In − ˆ n Mn′ )n , and n
−1/2
¯ n∗ (, ␦)′ n = Z
h1n h2n
with h1n = n−1/2 Xn′ (In − Mn′ )n , h2n = n−1/2 (E y¯ n )′ (In − Mn′ )n . Thus g1n − h1n = −(ˆn − )n−1/2 Xn′ Mn′ n , g2n − h2n = −(ˆn − )n−1/2 (E y¯ n )′ Mn′ n + n−1/2 (y˜¯ n − E¯yn )′ (In − ˆ n Mn′ )n . By arguments analogous to those above it is seen that the elements of M n X n and M n E y¯ n are bounded uniformly in absolute value. Because of this we see that the variances of the elements of n −1/2 X ′n M ′n n and n −1/2 (E y¯ n )′ M ′n n are uniformly bounded, and hence n −1/2 X ′n M ′n n = O p (1) and n −1/2 (E y¯ n )′ M ′n n = O p (1). Since ˆ n − = o p (1) it follows that the first two terms on the r.h.s. of the above equations are o p (1). The last term is seen to be o p (1) in light of Lemma 6. Step 3: We show further that ˜ n∗ (ˆn , ␦ˆ n )′ M n u n = o p (1). n −1/2 (ˆn − )Z
(A.11)
198
HARRY H. KELEJIAN ET AL.
Observe that n
−1/2 ˜
′
Zn∗ (ˆn , ␦ˆ n ) M n u n =
f1n f2n
with f1n = n−1/2 Xn′ (In − ˆ n Mn′ )Mn (In − Mn )−1 n , f2n = n−1/2 y˜¯ ′n (In − ˆ n Mn′ )Mn (In − Mn )−1 n = n−1/2 (E¯yn )′ (In − ˆ n Mn′ )Mn × (In − Mn )−1 n + n−1/2 (y˜¯ n − E¯yn )′ (In − ˆ n Mn′ )Mn (In − Mn )−1 n . In light of the remarks above Lemma 1 we see that f 1n is a sum of terms of the form O p (1) ∗ [n −1/2 A n n ] where A n is a matrix whose elements are bounded uniformly in absolute value. Thus the variances of the elements of n −1/2 A n n are uniformly bounded, which implies that n −1/2 A n n and thus f 1n are O p (1). By analogous argument we see that also the first term on the r.h.s. of the last equality for f 2n is O p (1). The second term is composed of expressions of the from O p (1) ∗ [n −1/2 (y˜¯ n − E y¯ n )′ A n n ], where A n is now some n × n matrix whose row and column sums are uniformly bounded in absolute value. By Lemma 6 we have n −1/2 (y˜¯ n − E y¯ n )′ A n n = o p (1), and thus this second term is o p (1), and ˜ n∗ (ˆn , ␦ˆ n )′ M n u n = O p (1), and thus the claim f 2n = O p (1). This shows that n −1/2 Z made at the beginning of this step holds observing that ˆ n − = o p (1). Step 4: Given (A.6), (A.7), (A.10), and (A.11) it follows that ¯ n∗ (, ␦)′ n + o p (1) n 1/2 (␦ˆ S,n − ␦) = −1 n −1/2 Z Observing that the elements of X n are uniformly bounded in absolute value, and that the rows and columns sums of a matrix which is obtained as the product of matrices whose rows and columns sums are uniformly bounded in absolute value ¯ n∗ (, ␦) are uniformly have again the same property, it follow that the elements of Z bounded in absolute value. Given the maintained assumptions on the innovations n it then follows immediately from Theorem A.1 in Kelejian and Prucha (1998) that d
¯ n∗ (, ␦)′ n →N(0, 2 ) n −1/2 Z d and hence n 1/2 (␦ˆ S,n − ␦)→N(0, ) observing that = 2 −1 .
GENERALIZED MAXIMUM ENTROPY ESTIMATION OF A FIRST ORDER SPATIAL AUTOREGRESSIVE MODEL Thomas L. Marsh and Ron C. Mittelhammer ABSTRACT We formulate generalized maximum entropy estimators for the general linear model and the censored regression model when there is first order spatial autoregression in the dependent variable. Monte Carlo experiments are provided to compare the performance of spatial entropy estimators relative to classical estimators. Finally, the estimators are applied to an illustrative model allocating agricultural disaster payments.
1. INTRODUCTION In this paper we examine the use of generalized maximum entropy estimators for linear and censored regression models when the data generating process is afflicted by first order spatial autoregression in the dependent variable. Generalized maximum entropy (GME) estimators of regression models in the presence of spatial autocorrelation are of interest because they: (1) offer a systematic way of incorporating prior information on parameters of the model;1 (2) are straightforwardly applicable to non-normal error distributions;2 and (3) are robust for ill-posed and ill-conditioned problems (Golan et al., 1996).3 Prior information in the form of parameter restrictions arise naturally in the context of spatial models Spatial and Spatiotemporal Econometrics Advances in Econometrics, Volume 18, 199–234 Copyright © 2004 by Elsevier Ltd. All rights of reproduction in any form reserved ISSN: 0731-9053/doi:10.1016/S0731-9053(04)18006-7
199
200
THOMAS L. MARSH AND RON C. MITTELHAMMER
because spatial autoregressive coefficients are themselves inherently bounded. The development of estimators with finite sample justification across a wide range of sampling distributions and an investigation of their performance relative to established asymptotically justified estimators provides important insight and guidance to applied economists regarding model and estimator choice.4 Various econometric approaches have been proposed for accommodating spatial autoregression in linear regression models and in limited dependent variable models. In the case of the linear regression model, Cliff and Ord (1981) provide a useful introduction to spatial statistics. Anselin (1988) provides foundations for spatial effects in econometrics, discussing least squares, maximum likelihood, instrumental variable, and method of moment estimators to account for spatial correlation issues in the linear regression model. More recently, generalized two stage least squares and generalized moments estimators have been examined by Kelejian and Prucha (1998, 1999). Meanwhile, Lee (2002, 2003) examined asymptotic properties of least squares estimation for mixed regressive, spatial autoregressive models and two-stage least squares estimators for a spatial model with autoregressive disturbances. In the case of the limited dependent variable model, most research has focused on the binary regression model and to a lesser extent the censored regression model. Besag (1972) introduced the auto-logistic model and motivated its use on plant diseases. The auto-logistic model incorporated spatial correlation into the logistic model by conditioning the probability of occurrence of disease on its presence in neighboring quadrants (see also Cressie, 1991). Poirier and Ruud (1988) investigated a probit model with dependent observations and proved consistency and asymptotic normality of maximum likelihood estimates. McMillen (1992) illustrated the use of a spatial autoregressive probit model on urban crime data with an Expectation-Maximization (EM) algorithm. At the same time, Case (1992) examined regional influence on the adoption of agricultural technology by applying a variance normalizing transformation in maximum likelihood estimation to correct for spatial autocorrelation in a probit model. Marsh et al. (2000) also applied this approach to correct for spatial autocorrelation in a probit model by geographic region while examining an extensive data set pertaining to disease management in agriculture. Bayesian estimation has also played an important role in spatial econometrics. LeSage (1997) proposed a Bayesian approach using Gibbs sampling to accommodate outliers and nonconstant variance within linear models. LeSage (2000) extended this to limited dependent variable models with spatial dependencies, while Smith and LeSage (2002) applied a Bayesian probit model with spatial dependencies to the 1996 presidential election results. Although Bayesian estimation is well-suited for representing uncertainty with respect to model parameters, it can also require extensive Monte Carlo sampling when
Generalized Maximum Entropy Estimation
201
numerical estimation techniques are required, as is often the case in non-normal, non-conjugate prior model contexts. In comparison, GME estimation also enforces restrictions on parameter values, is arguably no more difficult to specify, and does not require the use of Monte Carlo sampling in the estimation phase of the analysis.5 The principle of maximum entropy has been applied in a variety of modeling contexts, including applications to limited dependent variable models. However, to date, GME or other information theoretic estimators have not been applied to spatial regression models.6 Golan et al. (1996, 1997) proposed estimation of both the general linear model and the censored regression model based on the principle of generalized maximum entropy in order to deal with small samples or ill-posed problems. Adkins (1997) investigated properties of a GME estimator of the binary choice model using Monte Carlo analysis. Golan et al. (1996) applied maximum entropy to recover information from multinomial response data, while Golan et al. (1997) recovered information with censored and ordered multinomial response data using generalized maximum entropy. Golan et al. (2001) proposed entropy estimators for a censored demand system with nonnegativity constraints and provided asymptotic results. These studies provide the basic foundation from which we define spatial entropy estimators for the general linear model and the censored regression model when there is first order spatial autoregression in the dependent variable. The current paper proceeds as follows. First, we motivate GME estimation of the general linear model (GLM) and then investigate generalizations to spatial GMEGLM estimators. Second, Monte Carlo experiments are provided to benchmark the mean squared error loss of the spatial GME-GLM estimators relative to ordinary least squares (OLS) and maximum likelihood (ML) estimators. We also examine the sensitivity of the spatial GME estimators to user-supplied supports and their performance across a range of spatial autoregressive coefficients. Third, Golan et al.’s (1997) GME estimator of the censored regression model (i.e. Tobit) is extended to a spatial GME-Tobit estimator, and additional Monte Carlo experiments are presented to investigate the sampling properties of the method. Finally, the spatial entropy GLM and Tobit approaches are applied empirically to the estimation of a simultaneous Tobit model of agricultural disaster payment allocations across political regions.
2. SPATIAL GME–GLM ESTIMATOR 2.1. Data Constrained GME–GLM Following the maximum entropy of principle, the entropy of a distribution ′ probabilities p = (p 1 , . . . , p M )′ , M m=1 p m = 1, is defined by H(p) = −p lnp =
202
THOMAS L. MARSH AND RON C. MITTELHAMMER
− M m=1 p m lnp m (Shannon, 1948). The value of H(p), which is a measure of the uncertainty in the distribution of probabilities, reaches a maximum when p m = M −1 for m = 1, . . ., M characterizing the uniform distribution. Generalizations of the entropy function that have been examined elsewhere in the econometrics and statistics literature include the Cressie-Read power divergence statistic (Imbens et al., 1998), Kullback-Leibler Information Criterion (Kullback, 1959), and the ␣For example, the well known Kullback-Leibler entropy measure (Pompe, 1994). cross-entropy extension, − M m=1 p m ln(p m /q m ), is a discrepancy measure between distributions p and q where q is a reference distribution. In the event that the reference distribution is uniform, then the maximum entropy and cross-entropy functions coincide. We restrict our analysis to the maximum entropy objective function due to its efficiency and robustness properties (Imbens et al., 1998), and its current universal use within the context of GME estimation applications (Golan et al., 1996). To motivate the maximum entropy estimator, it is informative to revisit the least squares estimator.7 Consider the general linear model: Y = X +
(1)
with Y a N × 1 dependent variable vector, X a N × K matrix of explanatory variables,  a K × 1 vector of parameters, and a N × 1 vector of disturbance terms. The standard least squares optimization problem is to N min ε2i subject to Y i − X i  = εi , ∀i . The objective is to minimize the 
i=1
quadratic sum of squares function for  ∈ R K subject to the data constraint in (1). There are fundamental differences between the least squares and maximum entropy approaches. First, the maximum entropy approach is based on the entropy objective function H(p) instead of the quadratic sum of squares objective function. Instead of minimizing the sum of squared errors, the entropy approach selects the p closest to the uniform distribution given the data constraint in (1).8 Second, the maximum entropy approach provides a means of formalizing more adhoc methods researchers have commonly employed to impose a priori restrictions on regression parameters and ranges of disturbance outcomes. To do so the     unknown parameter vector  is reparameterized as k = Jj=1 s kj p kj ∈ [s k1 , s kJ ] 


onto a user defined discrete support space s k1 ≤ s k2 ≤ · · · ≤ s kJ for J ≥ 2 with    a (J × 1) vector of unknown weights pk = (p k1 , . . . , p kJ )′ ∀k = 1, . . . , K. The  discrete support space includes the lower truncation point s k1 , the upper truncation  point s kJ , and J-2 remaining intermediate support points. For instance, consider a
Generalized Maximum Entropy Estimation
203 

discrete support space with J = 2 support points {s k1 , s kJ } = {−1, 1} for k that has only lower and upper truncation points and allows no intermediate support   points. The reparameterized expression yields k = (1 − p k2 )(−1) + p k2 (1) with  a single unknown p k2 .9 Likewise the unknown error vector is reparameterized M ε ε as εi = m=1 s im p im ∈ [s εi1 , s εiM ] such that s εi1 ≤ s εi2 ≤ · · · ≤ s εiM with a (M × 1) vector of unknown weights pεi = (p εi1 , . . . , p εiM )′ ∀i = 1, . . . , N.10 In practice, discrete support spaces for both the parameters and errors are supplied by the user based on economic or econometric theory or other prior information. Third, unlike least squares, there is a bias-efficiency tradeoff that arises in GME when parameter support spaces are specified in terms of bounded intervals. A disadvantage of bounded intervals is that they will generally introduce bias into the GME estimator for finite samples unless the intervals happen to be centered on the true values of the parameters. An advantage of restricting parameters to finite intervals is that they can lead to increases in efficiency by lowering parameter estimation variability. The underlying idea is that the bias introduced by bounded parameter intervals in the GME estimator can be more-than compensated for by substantial decreases in variability, leading to notable increases in overall estimation efficiency. The data constrained GME estimator of the general linear model (hereafter GME-D) is defined by the following constrained maximum entropy problem (Golan et al., 1996):11 max{−(p)′ ln(p)}
(2a)
Y = X(S p ) + (Sε pε )  1′ p = 1 ∀k, 1′ pε = 1 ∀i
(2b)
p = vec(p , pε ) > [0]
(2d)
p
subject to:
k
i
(2c)
In matrix notation, the unknown parameter vector  and error vector are reparameterized as  = S p and = Sε pε from known matrices of user supplied discrete support points S and Sε and an unknown (KJ + NM) × 1 vector   of weights p = vec(p , pε ). The KJ × 1 vector p = vec(p1 , . . . , pK ) and the  NM × 1 vector pε = vec(p1ε , . . . , pεN ) consist of J × 1 vectors pk and M × 1 ε vectors pi , each having nonnegative elements summing to unity. The matrices S and Sε are K × KJ and N × NM block-diagonal matrices of support points for the unknown  and vectors. For example, consider the support matrix for the 
204
THOMAS L. MARSH AND RON C. MITTELHAMMER
vector,
S = 


(s1 )′

0
0 .. .
(s2 )′ .. .
0
0

···
0
··· .. .
0 .. . 
· · · (sK )′


(3)


Here, sk = (s k1 , . . . , s kJ )′ is a J × 1 vector such that s k1 ≤ s k2 ≤ · · · ≤ s kJ where     k = Jj=1 s kj p kj ∈ [s k1 , s kJ ] ∀k = 1, . . . , K. Given this reparameterization, the ˆ are determined empirical distribution of estimated weights pˆ (and subsequently ) by the entropy objective function subject to constraints of the model and user supplied supports. The choice of support points Sε depends inherently on the properties of the underlying error distribution. In most but not all circumstances, error supports have been specified to be symmetric and centered about the origin. Excessively wide truncation points reflect uncertainty about information in the data constraint and correspond to solutions pˆ that are more uniform, implying the ˆ k approach the average of the support points. Given ignorance regarding the error distribution, Golan et al. (1996) suggest calculating a sample scale parameter and using the three-sigma rule to determine error bounds. The three-sigma rule for random variables states that the probability for a unimodal random variable falling away from its mean by more than three standard deviations is at most 5% (Pukelsheim, 1994; Vysochanskii & Petunin, 1980). The three-sigma rule is a special case of Vysochanskii and Petunin’s bound for unimodal distributions. Letting Y be a real random variable with mean and variance 2 , the bound is given by: Pr(|Y − | ≥ r) ≤
4 2 9 r2
where r > 0 is the half length of an interval centered at . For r = 3 it yields the three-sigma rule and more than halves the Chebyshev bound. In more general terms the above bound can yield a j-sigma rule with r = j for j = {1, 2, 3, . . .}. The generalized maximum entropy formulation in (2) incorporates inherently a dual entropy loss function that balances estimation precision in coefficient estimates and predictive accuracy subject to data, adding up, and nonnegativity constraints.12 The specification of (2) leads to first order optimization conditions that are different from the standard least squares estimator with the notable difference that the first order conditions for GME-D do not require orthogonality between right hand side variables and error terms.13 Mittelhammer and Cardell (1998)
Generalized Maximum Entropy Estimation
205
have provided regularity conditions, derived the first order conditions, asymptotic properties, and asymptotic test statistics for the data-constrained GME-D estimator. They also identified a more computationally efficient approach with which to solve entropy based problems that does not expand with sample size and provides the basis for the optimization algorithms used in the current study.
2.2. Spatial GME Estimators The first order spatial autoregressive model can be expressed as (see Anselin, 1988; Cliff & Ord, 1981): Y = WY + X + u
(4)
where W is an N × N spatial proximity matrix structuring the lagged dependent variable. For instance, the elements of the proximity matrix W = {w∗ij } may be defined as a standardized joins matrix where w∗ij = wij / j wij with wij = 1 if observations i and j are from an adjoining spatial region (for i = j) and wij = 0 otherwise. In (4), u is a (N × 1) vector of iid error terms, while is an unknown scalar spatial autoregressive parameter to be estimated. In general, the ordinary least squares (OLS) estimator applied to (4) will be inconsistent. The maximum likelihood estimator of (4), for the case of normally distributed errors, is discussed in Anselin (1988). Consistent generalized two stage least squares and generalized method of moments estimators of (4) are discussed in Kelejian and Prucha (1998, 1999). Lee (2002, 2003) examined consistency and efficiency of least squares estimation for mixed regressive, spatial autoregressive models and investigated best two-stage least squares estimators for a spatial model with autoregressive disturbances. In the subsections below, we introduce both moment and data constrained GME estimators of the spatial regression model. Doing so enables us to investigate the finite sample performance of the two estimators relative to one another and relative to other competing estimators. 2.2.1. Normalized Moment Constrained Estimator Consider the normalized (by sample size) moment constraint relating to the GLM given by: X′ [(I − W)Y − X] X′ u = (5) N N The spatial GME method for defining the estimator of the unknown parameters  and in the spatial autoregressive model using the normalized moment information
in (5) (hereafter GME-N) is represented by the following constrained maximum entropy problem:

    max_p {−p′ ln(p)}                                                  (6a)

subject to:

    X′[(I − (S_ρ p_ρ)W)Y − X(S_β p_β)]/N = S_u p_u                     (6b)
    1′p_βk = 1 ∀k,   1′p_ρ = 1,   1′p_ui = 1 ∀i                        (6c)
    p = vec(p_β, p_ρ, p_u) > [0]                                       (6d)

where p = vec(p_β, p_ρ, p_u) is a (KJ + J + NM) × 1 vector of unknown parameters. In (6), we define β = S_β p_β as before and X′u/N = S_u p_u. The additional term S_ρ is a 1 × J vector of support points for the spatial autoregressive coefficient ρ. The J × 1 vector p_ρ is an unknown weight vector having nonnegative elements that sum to unity and is used to reparameterize the spatial autoregressive coefficient as ρ = S_ρ p_ρ = Σ_{j=1}^{J} s_ρj p_ρj. The lower and upper truncation points can be selected to bound ρ, and additional support points can be specified to recover higher moment information about it. The GME-N estimator when ρ = 0 is defined in Golan et al. (1996, 1997) along with its asymptotic properties.14

2.2.2. Data Constrained Estimator
As an alternative way of accounting for simultaneity in (4), we follow Theil (1971) and Zellner (1998) by specifying a data constraint for use in the GME method as:

    Y = (ρW)(ZΠ) + Xβ + u*                                             (7)
where u* is an N × 1 vector of appropriately transformed residuals. The above expression is derived using the spatial model in (4) and from substitution of an unrestricted reduced form equation based on:

    Y = (I − ρW)^{−1}Xβ + (I − ρW)^{−1}u

which can be expressed as:

    Y = lim_{t→∞} Σ_{j=0}^{t} (ρW)^j Xβ + (I − ρW)^{−1}u,   −1 < ρ < 1
for eigenvalues of W less than or equal to one in absolute value (Kelejian & Prucha, 1998). The reduced form model can be approximated by:

    Y ≈ Σ_{j=0}^{t} (ρW)^j Xβ + v = ZΠ + v                             (8)

where ZΠ = Σ_{j=0}^{t} (W^j X)(βρ^j) is a partial sum of the infinite series for some upper index value t. Here, Z can be interpreted as an N × L matrix of instruments consisting of {X, WX, . . ., W^t X}, Π is an L × 1 vector of unknown and unrestricted parameters, and v is an N × 1 vector of reduced form residuals. Zellner (1998) refers to Eq. (7) as a nonlinear in parameters (NLP) form of the simultaneous equations model.

The spatial GME method for defining the estimator of the unknown parameters β, ρ, and Π in the combined models (7) and (8) (hereafter GME-NLP) is represented by the following constrained maximum entropy problem:

    max_p {−p′ ln(p)}                                                  (9a)

subject to:

    Y = [(S_ρ p_ρ)W][Z(S_Π p_Π)] + X(S_β p_β) + (S_u* p_u*)            (9b)
    Y = Z(S_Π p_Π) + (S_v p_v)                                         (9c)
    1′p_βk = 1 ∀k,  1′p_Πℓ = 1 ∀ℓ,  1′p_ρ = 1,  1′p_u*i = 1 ∀i,  1′p_vi = 1 ∀i   (9d)
    p = vec(p_β, p_Π, p_ρ, p_u*, p_v) > [0]                            (9e)

where p = vec(p_β, p_Π, p_ρ, p_u*, p_v) is a (KJ + LJ + J + 2NM) × 1 vector of unknown parameters. Including the reduced form model in (9c) is necessary to identify the reduced form parameters. In effect this is a one-step estimator in which the reduced and structural form parameters, as well as the spatial correlation coefficients, are estimated concurrently.15
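The building blocks of the GME-NLP constraints (the row-standardized weight matrix, the truncated instrument set behind Eq. (8), and the residuals in (9b)-(9c)) can be sketched in a few lines of Python. This is our own illustrative code under toy data, not the authors' GAUSS implementation; the function names are assumptions:

    import numpy as np

    def row_standardize(joins):
        # w*_ij = w_ij / sum_j w_ij; rows with no neighbors stay zero.
        W = np.asarray(joins, dtype=float)
        rs = W.sum(axis=1, keepdims=True)
        return np.divide(W, rs, out=np.zeros_like(W), where=rs > 0)

    def build_instruments(X, W, t=1):
        # Z = [X, WX, ..., W^t X], the truncated expansion behind Eq. (8).
        blocks, B = [X], X
        for _ in range(t):
            B = W @ B
            blocks.append(B)
        return np.hstack(blocks)

    def nlp_residuals(y, X, Z, W, beta, pi, rho):
        # Residuals in the data constraints: structural (9b) and
        # reduced form (9c).
        v = y - Z @ pi
        u_star = y - rho * (W @ (Z @ pi)) - X @ beta
        return u_star, v

    # Toy example: 4 regions on a line (rook contiguity), K = 2 regressors
    A = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]])
    W = row_standardize(A)
    rng = np.random.default_rng(1)
    X = rng.normal(size=(4, 2))
    Z = build_instruments(X, W, t=1)      # N x 2K matrix {X, WX}
    y = rng.normal(size=4)
    u_star, v = nlp_residuals(y, X, Z, W,
                              beta=np.zeros(2), pi=np.zeros(4), rho=0.25)

At t = 1 the instrument set reduces to {X, WX}, which is exactly the approximation used in the Monte Carlo experiments below.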
2.3. Monte Carlo Experiments – Spatial GLM

The linear component Xβ is specified as:

    Xβ = [1, x_2, x_3, x_4](2, 1, −1, 3)′                              (10)
where the values of x_i2, i = 1, . . ., N are iid outcomes from Bernoulli(0.5), and the values of the pair of explanatory variables x_i3 and x_i4 are generated as iid outcomes from:

    N((μ_x3, μ_x4)′, [σ²_x3, σ_x3,x4; σ_x3,x4, σ²_x4]) = N((3, 5)′, [1, 0.5; 0.5, 1])   (11)

which are then truncated at ±3 standard deviations. The disturbance terms u_i are drawn iid from a N(0, σ²) distribution that is truncated at ±3 standard deviations, with σ² = 1. Thus, the true support of the disturbance distribution in this Monte Carlo experiment is truncated normal, with lower and upper truncation points located at −3 and +3, respectively.

Additional information in the form of user supplied supports for the structural and reduced form parameters and the error terms must be provided for the GME estimators. Supports for the structural parameters are specified as s_βk = (−20, 0, 20)′, ∀k. Supports of the reduced form model are s_Π1 = (−100, 0, 100)′ for the intercept and s_Πℓ = (−20, 0, 20)′, ∀ℓ > 1. Error supports for the GME estimator are defined using the j-sigma rule. Specifically, in the experiments for the general linear model the supports are s_u*i = s_vi = (−jσ̂_y, 0, jσ̂_y)′ ∀i, where σ̂_y is the sample standard deviation of the dependent variable, with either j = 3 defining a 3-sigma rule or j = 5 defining a 5-sigma rule.

Next we specify the model assumptions for the spatial components (i.e. proximity matrices and spatial autoregressive coefficients with supports) of the regression model. Seven true spatial autoregressive coefficients ρ = {−0.75, −0.50, −0.25, 0, 0.25, 0.50, 0.75}, bounded between −1 and 1, were selected for the experiments. Supports for the spatial autoregressive coefficient are specified using narrower eigenvalue bounds as s_ρ = (1/λ_min, 0, 1/λ_max)′ and wider bounds s_ρ = (−20, 0, 20)′ coinciding with the supports on the structural parameters.16

The spatial weight matrix W is constructed using the Harrison and Rubinfeld (1978) example discussed in Gilley and Pace (1996). This example is based on the Boston SMSA with 506 observations (one observation per census tract). The elements of the weight matrix are defined as w_ij = max{1 − (d_ij/d_max), 0}, where d_ij is the distance in miles (converted from the latitude and longitude for each observation). A nearest neighbor weight matrix was constructed with elements given the value 1 for observations within d_max and 0 otherwise. The weight matrix was then row normalized, bounding its eigenvalues between −1 and 1. The reduced form model in (8) is approximated at t = 1 with a matrix of instrumental variables Z defined by {X, WX}.

Table 1 presents the assumptions underlying twenty-one experiments that compare the performance of GME to the maximum likelihood (ML) estimator under normality of the GLM with a first order spatial autoregressive dependent variable.17
Table 1. Monte Carlo Experiments for Spatial Regression Model.

    Experiment   jσ̂_y   True ρ   s_ρ
    1            3      −0.75    (1/λ_min, 0, 1/λ_max)′
    2            3      −0.50    (1/λ_min, 0, 1/λ_max)′
    3            3      −0.25    (1/λ_min, 0, 1/λ_max)′
    4            3       0.00    (1/λ_min, 0, 1/λ_max)′
    5            3       0.25    (1/λ_min, 0, 1/λ_max)′
    6            3       0.50    (1/λ_min, 0, 1/λ_max)′
    7            3       0.75    (1/λ_min, 0, 1/λ_max)′
    8            3      −0.75    (−20, 0, 20)′
    9            3      −0.50    (−20, 0, 20)′
    10           3      −0.25    (−20, 0, 20)′
    11           3       0.00    (−20, 0, 20)′
    12           3       0.25    (−20, 0, 20)′
    13           3       0.50    (−20, 0, 20)′
    14           3       0.75    (−20, 0, 20)′
    15           5      −0.75    (1/λ_min, 0, 1/λ_max)′
    16           5      −0.50    (1/λ_min, 0, 1/λ_max)′
    17           5      −0.25    (1/λ_min, 0, 1/λ_max)′
    18           5       0.00    (1/λ_min, 0, 1/λ_max)′
    19           5       0.25    (1/λ_min, 0, 1/λ_max)′
    20           5       0.50    (1/λ_min, 0, 1/λ_max)′
    21           5       0.75    (1/λ_min, 0, 1/λ_max)′
Experiments 1–7 maintain support s_ρ = (1/λ_min, 0, 1/λ_max)′ for the autoregressive parameter ρ with the 3-sigma rule bounding the error terms. Experiments 8–14 deviate from the first seven by widening the supports on the spatial autoregressive parameter to s_ρ = (−20, 0, 20)′, while retaining the 3-sigma rule. Experiments 15–21 maintain support s_ρ = (1/λ_min, 0, 1/λ_max)′ on the spatial autoregressive parameter, but expand the error supports to the 5-sigma rule. Experiments 8–14 and experiments 15–21 test the sensitivity of the GME estimators to the support spaces on the spatial autoregressive parameter and the error terms, respectively.

Other pertinent details are that all of the Monte Carlo results are based on 500 replications with N = 506 observations, consistent with the dimension of the weighting matrix W. It is noteworthy that convergence of the GME spatial estimator (using the general objective function optimizer OPTMUM in GAUSS (Aptech Systems Inc., 1996)) occurred within a matter of seconds for each replication across all the Monte Carlo experiments. Appendix A discusses computational issues and provides derivations of the gradient and Hessian for the GME-NLP estimator in (9).
For completeness, we also include performance measures of OLS in the Monte Carlo analysis.

2.3.1. Experiment Results
Table 2 contains the mean squared error loss (MSEL) from experiments 1–21 for: (a) the regression coefficients β from the spatial ML and GME estimators and the OLS estimator; and (b) the spatial autoregressive coefficient ρ estimated by the spatial ML and GME estimators. Tables 3–7 report the estimated average bias (e.g. bias(β̂_k) = E[β̂_k] − β_k) and variance (e.g. var(ρ̂)) of β_1, β_2, β_3, β_4, and ρ, respectively.

Consider the MSEL for the regression coefficients β. In cases with nonzero spatial autoregressive parameters, both GME-NLP and ML tend to exhibit smaller MSEL than either OLS or GME-N (with OLS having the largest MSEL values). Except for GME-NLP, the MSEL values are larger for positive than negative spatial autoregressive parameters. Meanwhile, ML had the smallest MSEL values for negative spatial autoregressive parameters equal to −0.75 and −0.50. For GME-NLP, widening the autoregressive support space in experiments 8–14 decreased (increased) the MSEL relative to experiments 1–7 for negative (positive) spatial values in ρ. Experiments 15–21, with the expanded error supports for GME-NLP, exhibited increased MSEL relative to experiments 1–7 across nonzero spatial autoregressive values in ρ. At ρ = 0, the MSEL is lowest for OLS followed by GME-NLP, ML and GME-N, respectively.

Turning to the MSEL for the spatial autoregressive parameter ρ, both GME-NLP and ML exhibited smaller MSEL than GME-N. ML had the smallest MSEL values for spatial autoregressive parameters −0.75 and −0.50. In experiments 1–7 GME-NLP had smaller MSEL than ML except for ρ = −0.75 and −0.50. Widening the error support space in experiments 15–21 tended to increase the MSEL of the spatial autoregressive parameter for GME-NLP.

Next, we compare the estimated average bias and variance for ML and GME-NLP across β_1, β_2, β_3, β_4, and ρ in Tables 3–7, respectively. The obvious differences between ML and GME-NLP arise for the intercept (Table 3). In most cases, ML exhibited smaller absolute bias and larger variance on the intercept coefficient relative to GME-NLP. In Tables 4–6 there are no obvious systematic differences in the average bias and variance for the remaining parameters. Turning to the spatial autoregressive parameter ρ (Table 7), ML tended to have smaller absolute bias and larger variance relative to GME-NLP for nonzero values of ρ. Narrower support spaces for the spatial autoregressive parameter in experiments 1–7 and 15–21 yielded positive (negative) bias for negative (positive) spatial values in ρ. Comparing across the experiments, the performance of the GME-NLP estimator exhibited sensitivity to the different parameter and error support spaces.
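For reference, the MSEL criterion used throughout these comparisons can be computed as below; a minimal Python sketch assuming MSEL is the replication average of the squared Euclidean distance between estimate and truth (the function name is ours):

    import numpy as np

    def msel(estimates, truth):
        # Mean squared error loss across Monte Carlo replications:
        # average squared Euclidean distance from the true vector.
        d = np.atleast_2d(np.asarray(estimates) - np.asarray(truth))
        return float(np.mean(np.sum(d**2, axis=1)))

    # Example: 500 replications of a 4-vector estimator around (10)
    rng = np.random.default_rng(3)
    beta_true = np.array([2.0, 1.0, -1.0, 3.0])
    draws = beta_true + rng.normal(scale=0.1, size=(500, 4))
    print(msel(draws, beta_true))   # approximately 4 * 0.01 = 0.04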
Table 2. MSEL Results for OLS and Spatial ML and GME Estimators.

MSEL(β̂)
    True ρ   OLS          ML        GME-NLP    GME-N      GME-NLP    GME-N      GME-NLP    GME-N
    −0.75    46.85445     0.73342   3.89496    7.86198    1.27085    12.41126   10.81546   8.15715
    −0.50    27.66546     0.94232   2.62515    7.46430    1.12270    12.62844   6.32048    7.72633
    −0.25    9.94484      1.35587   1.20896    7.13155    1.20733    12.89162   1.94408    7.34293
    0.00     0.06413      1.87715   0.67812    7.12143    1.44093    13.20860   0.12045    7.21240
    0.25     26.64269     2.99540   1.80136    8.20153    2.24008    13.58534   1.88375    7.84772
    0.50     236.98749    4.06597   3.13564    12.89220   4.74866    14.01641   5.25398    10.37058
    0.75     2127.57748   8.97558   3.09602    24.41890   18.19564   14.49347   4.39171    12.68538
    Exp.                            1–7        1–7        8–14       8–14       15–21      15–21

MSEL(ρ̂)
    True ρ   ML        GME-NLP   GME-N     GME-NLP   GME-N     GME-NLP   GME-N
    −0.75    0.00775   0.04582   0.39984   0.01486   2.23989   0.13166   0.39945
    −0.50    0.00816   0.02348   0.14877   0.00989   1.69301   0.05685   0.14845
    −0.25    0.00838   0.00730   0.01995   0.00761   1.21968   0.01162   0.01977
    0.00     0.00744   0.00266   0.00977   0.00579   0.81743   0.00048   0.01003
    0.25     0.00673   0.00416   0.10799   0.00532   0.48493   0.00477   0.11016
    0.50     0.00417   0.00339   0.27891   0.00509   0.22847   0.00608   0.29199
    0.75     0.00236   0.00085   0.37947   0.00488   0.06030   0.00129   0.46872
    Exp.               1–7       1–7       8–14      8–14      15–21     15–21
Table 3. Bias and Variance Estimates for Spatial ML and GME Estimators of β_1.

             Bias                                            Variance
    True ρ   ML        GME-NLP    GME-NLP    GME-NLP        ML        GME-NLP   GME-NLP   GME-NLP
    −0.75    0.08108   −1.78936   0.11754    −1.82650       0.71281   0.67862   1.24219   0.79184
    −0.50    0.00365   −1.45051   −0.29581   −1.36924       0.92817   0.50703   1.02128   0.52741
    −0.25    0.07222   −0.83051   −0.41072   −0.76081       1.33798   0.50662   1.02452   0.51563
    0.00     0.22323   0.05655    −0.59280   0.00105        1.81573   0.66343   1.07579   0.64257
    0.25     0.42844   0.89485    −0.90295   1.18259        2.79831   0.98712   1.41168   0.47163
    0.50     0.41828   1.27590    −1.69444   2.19136        3.87680   1.49351   1.86405   0.43703
    0.75     1.00396   0.30056    −3.94049   1.95238        7.95493   2.99275   2.65450   0.56532
    Exp.               1–7        8–14       15–21                    1–7       8–14      15–21
Table 4. Bias and Variance Estimates for Spatial ML and GME Estimators of β_2.

             Bias                                              Variance
    True ρ   ML         GME-NLP    GME-NLP    GME-NLP         ML        GME-NLP   GME-NLP   GME-NLP
    −0.75    −0.00153   0.00195    −0.00256   −0.00505        0.00853   0.00874   0.00884   0.00790
    −0.50    0.00137    0.00268    −0.00129   −0.00968        0.00858   0.00849   0.00792   0.00800
    −0.25    0.00436    0.00448    0.00148    −0.00028        0.00731   0.00725   0.00895   0.00694
    0.00     −0.00502   −0.00500   −0.00074   0.00118         0.00664   0.00659   0.00849   0.00729
    0.25     −0.00255   −0.00137   −0.00528   −0.00311        0.00819   0.00820   0.00781   0.00782
    0.50     −0.00295   −0.00091   −0.00168   0.00945         0.00818   0.00816   0.00815   0.00861
    0.75     0.00258    0.00376    −0.00328   0.00387         0.00714   0.00720   0.00801   0.00829
    Exp.                1–7        8–14       15–21                     1–7       8–14      15–21
Table 5. Bias and Variance Estimates for Spatial ML and GME Estimators of β_3.

             Bias                                              Variance
    True ρ   ML         GME-NLP    GME-NLP    GME-NLP         ML        GME-NLP   GME-NLP   GME-NLP
    −0.75    0.00067    −0.00293   −0.00252   −0.00486        0.00284   0.00291   0.00290   0.00255
    −0.50    −0.00401   −0.00510   0.00181    0.00226         0.00249   0.00249   0.00315   0.00263
    −0.25    0.00358    0.00397    0.00049    0.00385         0.00281   0.00280   0.00260   0.00262
    0.00     0.00037    0.00100    −0.00120   −0.00227        0.00240   0.00239   0.00272   0.00270
    0.25     0.00205    0.00232    0.00007    −0.00034        0.00263   0.00261   0.00269   0.00276
    0.50     0.00144    0.00140    0.00242    −0.00059        0.00316   0.00315   0.00276   0.00292
    0.75     −0.00057   0.00025    0.00249    0.00056         0.00263   0.00268   0.00266   0.00281
    Exp.                1–7        8–14       15–21                     1–7       8–14      15–21
Table 6. Bias and Variance Estimates for Spatial ML and GME Estimators of β_4.

             Bias                                              Variance
    True ρ   ML         GME-NLP   GME-NLP    GME-NLP          ML        GME-NLP   GME-NLP   GME-NLP
    −0.75    −0.00006   0.01181   0.00815    0.01267          0.00267   0.00271   0.00302   0.00268
    −0.50    0.00032    0.00633   0.00189    0.00306          0.00304   0.00310   0.00285   0.00285
    −0.25    0.00085    0.00341   −0.00063   −0.00166         0.00252   0.00249   0.00257   0.00252
    0.00     −0.00028   0.00229   0.00300    0.00152          0.00252   0.00248   0.00253   0.00272
    0.25     −0.00277   0.00191   −0.00113   0.01507          0.00269   0.00266   0.00255   0.00279
    0.50     0.00069    0.00738   −0.00179   0.02398          0.00285   0.00284   0.00255   0.00270
    0.75     0.00057    0.00133   −0.01178   0.02221          0.00294   0.00303   0.00286   0.00300
    Exp.                1–7       8–14       15–21                      1–7       8–14      15–21
Table 7. Bias and Variance Estimates for Spatial ML and GME Estimators of ρ.

             Bias                                              Variance
    True ρ   ML         GME-NLP    GME-NLP    GME-NLP         ML        GME-NLP   GME-NLP   GME-NLP
    −0.75    −0.00927   0.19577    −0.01730   0.20061         0.00767   0.00750   0.01456   0.00879
    −0.50    0.00010    0.13804    0.02749    0.13101         0.00816   0.00442   0.00914   0.00456
    −0.25    −0.00692   0.06472    0.03321    0.06142         0.00833   0.00312   0.00651   0.00300
    0.00     −0.01404   −0.00426   0.03750    −0.00011        0.00724   0.00264   0.00438   0.00253
    0.25     −0.02020   −0.04403   0.04419    −0.06095        0.00632   0.00222   0.00337   0.00105
    0.50     −0.01375   −0.04254   0.05490    −0.07476        0.00399   0.00158   0.00207   0.00049
    0.75     −0.01627   −0.00504   0.06443    −0.03342        0.00210   0.00082   0.00073   0.00017
    Exp.                1–7        8–14       15–21                     1–7       8–14      15–21
Overall, the Monte Carlo results for the spatial regression model indicate that the data constrained GME-NLP estimator dominated the normalized moment constrained GME-N estimator in MSEL. As a result, we focus on the data constrained GME-NLP and not the moment constrained GME-N estimator when investigating the censored Tobit model.
3. SPATIAL GME-TOBIT ESTIMATOR

3.1. GME–Tobit Model

Consider a Tobit model with censoring of the dependent variable at zero:

    Y_i = Y_i*   if Y_i* = X_i.β + ε_i > 0
    Y_i = 0      if Y_i* = X_i.β + ε_i ≤ 0                             (12)

where Y_i* is an unobserved latent variable. Traditional estimation procedures are discussed in Judge et al. (1988). Golan et al. (1997) reorder the observations and rewrite (12) in matrix form as:

    Y = [Y_1; Y_2] = [Y_1; 0] = [X_1β + ε_1 > 0; X_2β + ε_2 ≤ 0]       (13)

where the subscript 1 indexes the observations associated with the N_1 positive elements of Y, the subscript 2 indexes the observations associated with the N_2 zero elements of Y, and N_1 + N_2 = N. In the GME method, the estimator of the unknown β in the Tobit model formulation is given by β = S_β p_β, where p_β = vec(p_β1, . . ., p_βK) is a KJ × 1 vector, and ε = S_ε p_ε, where p_ε = vec(p_ε1, p_ε2) is an NM × 1 vector, with both p_β and p_ε being derived from the following constrained maximum entropy problem:

    max_p {−p′ ln(p)}                                                  (14a)

subject to:

    Y_1 = X_1(S_β p_β) + S_ε1 p_ε1,   0 ≥ X_2(S_β p_β) + S_ε2 p_ε2     (14b)
    1′p_βk = 1 ∀k,   1′p_ε1i = 1 ∀i,   1′p_ε2i = 1 ∀i                  (14c)
    p = vec(p_β, p_ε1, p_ε2) > [0]                                     (14d)
where p = vec(p_β, p_ε1, p_ε2) is a (KJ + NM) × 1 vector of unknown support weights. Under general regularity conditions, Golan et al. (Proposition 4.1, 1997) demonstrate that the GME–Tobit model is a consistent estimator.

3.2. Spatial GME–Tobit Model

Assume a Tobit model with censoring of the dependent variable at zero and a first order spatial autoregressive process in the dependent variable:

    Y_i = Y_i*   if Y_i* = [(ρW)Y*]_i + X_i.β + u_i > 0
    Y_i = 0      if Y_i* = [(ρW)Y*]_i + X_i.β + u_i ≤ 0                (15)

where X_i. denotes the ith row of the matrix X and [(ρW)Y*]_i denotes the ith row of (ρW)Y*. If ρ ≠ 0 then this is a Tobit model with a first order spatial autoregressive process in the dependent variable; Eq. (15) reduces to the standard Tobit formulation in (14) if ρ = 0. A spatial GME approach for defining the spatial Tobit estimator of the unknown parameters β, Π, and ρ of the spatial autoregressive model in (15) is represented by:
(16a)
p
subject to:
∗
∗
Y1 = [((S p )W)Z]1 (S p ) + X1 (S p ) + (Su1 pu1 ) ∗
∗
0 ≥ [((S p )W)Z]2 (S p ) + X2 (S p ) + (Su2 pu2 ) Y = Z(S p ) + (Sv pv )

1′ pk = 1∀k,
1′ p k = 1∀k,
1′ p = 1,
u∗
1′ pi = 1∀i,
p = vec(p , p , p , pu∗ , pv ) > [0]
(16b)
(16c) 1′ pvi = 1∀i
(16d) (16e)
where subscripts 1 and 2 represent partitioned submatrices or vectors corresponding to the ordering of Y discussed in (13). Apart from the censoring and reordering of the data, the spatial Tobit estimator in (16) has the same structural components as the estimator in (9). 3.3. Iterative Spatial GME–Tobit Model Breiman et al. (1993) discussed iterative least squares estimation of the censored regression model that coincides with ML and the EM algorithm under normality,
but the method does not necessarily coincide with ML or EM with nonnormal errors. Each iteration involves two steps: an expectation step and a re-estimation step.18 Following this approach, Golan et al. (1997) suggested using the initial estimates from optimizing (14), defined by β̂(0) = S_β p̂_β(0), to predict Ŷ_2(1), and then redefining Ŷ(1) = vec(Y_1, Ŷ_2(1)) in re-estimating and updating β̂(1) = S_β p̂_β(1).19 In the empirical exercises below, we follow this approach to obtain the ith iterated estimate β̂(i) = S_β p̂_β(i) of the spatial GME–Tobit model in Eq. (16).

3.4. Monte Carlo Experiments – Spatial Tobit Model

The sampling experiments for the Tobit model follow closely those in Golan et al. (1997) and Paarsch (1984). The explanatory variables and coefficients of the model are defined as:

    Xβ = [x_1, x_2, x_3, x_4](2, 1, −3, 2)′                            (17)

where the x_il, i = 1, . . ., N, l = 1, . . ., 4 are generated iid from N(x̄, 1) for x̄ = 0 and 2. Increasing the mean reduced the percent of censored observations from approximately 50% across the sampling experiments for x̄ = 0 to less than 20% for x̄ = 2.20 The disturbance terms u_i are drawn iid from a N(0, 1) distribution.

The structural and reduced form error supports for the GME estimator are defined using a variation of Pukelsheim's 3-sigma rule. Here, the supports are defined as (−3σ̂_y, 0, 3σ̂_y)′ where σ̂_y = (y_max − y_min)/√12 (Golan et al., 1997). Supports for the spatial autoregressive coefficient are specified as s_ρ = (1/λ_min, 0, 1/λ_max)′. Supports for the structural parameters are specified as s_βk = (−20, 0, 20)′, ∀k. Supports of the reduced form model are s_Π1 = (−100, 0, 100)′ for the intercept and s_Πℓ = (−20, 0, 20)′, ∀ℓ > 1.

The Monte Carlo results are based on 500 replications and the true spatial autoregressive parameters are specified as ρ = −0.5 and 0.5. As before, the spatial weight matrix W is constructed using the Harrison and Rubinfeld (1978) example with N = 506 observations and the reduced form model in (8) is approximated at t = 1.

3.4.1. Experiment Results
Table 8 reports the MSEL for the regression coefficients β and the spatial autoregressive coefficient ρ for the standard ML estimator of the Tobit model (ML-Tobit), as well as the non-iterative (GME-NLP) and iterative (IGME-NLP) spatial entropy estimators.21
Table 8. Monte Carlo Results for GME-NLP Estimators and the ML-Tobit Estimator for the Censored Regression Model.

                               MSEL(β̂)                           MSEL(ρ̂)
    Exp.   True ρ   X_ij     ML-Tobit   GME-NLP   IGME-NLP     GME-NLP   IGME-NLP
    1      −0.5     N(0,1)   0.01956    4.08584   0.02150      3.01889   0.19875
    2      −0.5     N(2,1)   0.10631    1.27839   0.04942      0.23721   0.01880
    3      0.5      N(0,1)   0.04673    4.37441   0.06812      0.83522   0.11481
    4      0.5      N(2,1)   0.90989    0.03844   0.01253      0.00129   0.00076
Tables 9–12 provide the estimated average bias and variance of the regression coefficients β for each estimator. Table 13 reports the estimated average bias and variance for the spatial autoregressive coefficient ρ from the non-iterative and iterative spatial GME estimators.

Table 9. Bias and Variance Estimates for Spatial ML-Tobit and GME Estimators of β_1.

                     Bias                               Variance
    Exp.   True ρ    ML-Tobit   GME-NLP    IGME-NLP    ML-Tobit   GME-NLP   IGME-NLP
    1      −0.5      0.01733    −0.95542   0.01128     0.00463    0.00613   0.00508
    2      −0.5      −0.16694   −0.47565   −0.05550    0.00267    0.00440   0.00308
    3      0.5       0.03134    −0.98231   −0.08957    0.01015    0.00901   0.00712
    4      0.5       0.49925    −0.08200   −0.03823    0.00458    0.00278   0.00218
Table 10. Bias and Variance Estimates for Spatial ML-Tobit and GME Estimators of β_2.

                     Bias                               Variance
    Exp.   True ρ    ML-Tobit   GME-NLP    IGME-NLP    ML-Tobit   GME-NLP   IGME-NLP
    1      −0.5      0.00317    −0.48003   −0.00070    0.00400    0.00560   0.00483
    2      −0.5      −0.16371   −0.21208   −0.01208    0.00227    0.00397   0.00285
    3      0.5       0.01145    −0.49760   −0.04791    0.00600    0.00649   0.00527
    4      0.5       0.48135    −0.04128   −0.02194    0.00405    0.00243   0.00206
Table 11. Bias and Variance Estimates for Spatial ML-Tobit and GME Estimators of β_3.

                     Bias                               Variance
    Exp.   True ρ    ML-Tobit   GME-NLP    IGME-NLP    ML-Tobit   GME-NLP   IGME-NLP
    1      −0.5      −0.02003   1.42290    −0.01343    0.00534    0.00635   0.00580
    2      −0.5      −0.11426   0.87512    0.17769     0.00395    0.00595   0.00346
    3      0.5       −0.04041   1.46910    0.13497     0.01657    0.01200   0.01029
    4      0.5       0.39492    0.11093    0.02858     0.00398    0.00356   0.00223
Table 12. Bias and Variance Estimates for Spatial ML-Tobit and GME Estimators of β_4.

                     Bias                               Variance
    Exp.   True ρ    ML-Tobit   GME-NLP    IGME-NLP    ML-Tobit   GME-NLP   IGME-NLP
    1      −0.5      0.01180    −0.94536   0.00896     0.00475    0.00615   0.00539
    2      −0.5      −0.16558   −0.47223   −0.04889    0.00228    0.00400   0.00283
    3      0.5       0.02742    −0.98393   −0.09566    0.01051    0.00798   0.00776
    4      0.5       0.50640    −0.07952   −0.03470    0.00393    0.00260   0.00210
Evidence from the Monte Carlo simulations indicates that the ML-Tobit and IGME-NLP estimators outperformed the non-iterative GME-NLP estimator in MSEL for the regression coefficients β. When the percent of censored observations increased from less than 20% (experiments 2 and 4) to approximately 50% (experiments 1 and 3), ML-Tobit tends to outperform IGME-NLP in MSEL. Alternatively, when the censoring is less than 20% (experiments 2 and 4), IGME-NLP outperforms ML-Tobit. Regarding the spatial autoregressive parameter, the IGME-NLP estimator outperformed the non-iterative GME-NLP estimator in MSEL.

Finally, in terms of the average bias and variance estimates in Tables 9–12, in most cases the absolute bias of the regression coefficients β is smaller for IGME-NLP than for ML-Tobit when censoring is less than 20% (experiments 2 and 4). Both GME estimators exhibited positive bias in the estimated spatial coefficients across the four experiments (Table 13). Interestingly, the IGME-NLP estimator reduced the absolute bias and variance relative to the non-iterative GME-NLP estimator.

Table 13. Bias and Variance Estimates for Spatial ML-Tobit and GME Estimators of ρ.

                     Bias                   Variance
    Exp.   True ρ    GME-NLP   IGME-NLP    GME-NLP   IGME-NLP
    1      −0.5      1.68403   0.31373     0.18292   0.10032
    2      −0.5      0.48177   0.12239     0.00511   0.00382
    3      0.5       0.86468   0.25800     0.08755   0.04825
    4      0.5       0.02749   0.01678     0.00053   0.00048
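The iterative scheme of Section 3.3 can be summarized in a short Python sketch. This is our own illustration, not the authors' GAUSS code: fit_once is a hypothetical placeholder for one pass of the spatial GME-Tobit optimization in (16), assumed to return the coefficient estimate and fitted values; the convergence rule follows note 21:

    import numpy as np

    def iterate_gme_tobit(y, censored, fit_once, max_iter=50, tol=5e-4):
        # EM-style loop: expectation step replaces the censored
        # observations with their current predictions, then re-estimate.
        y_work = np.array(y, dtype=float)
        beta_prev = None
        for _ in range(max_iter):
            beta_hat, y_hat = fit_once(y_work)
            y_work[censored] = y_hat[censored]
            # stop when the absolute sum of coefficient changes between
            # successive iterations falls below 0.0005 (note 21)
            if beta_prev is not None and np.abs(beta_hat - beta_prev).sum() < tol:
                break
            beta_prev = beta_hat
        return beta_hat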
4. ILLUSTRATIVE APPLICATION: ALLOCATING AGRICULTURAL DISASTER PAYMENTS

Agricultural disaster relief in the U.S. has commonly taken one of three forms – emergency loans, crop insurance, and direct disaster payments (U.S. GAO). Of these, direct disaster payments are considered the least efficient form of disaster relief (Goodwin & Smith, 1995). Direct disaster payments from the government provide cash payments to producers who suffer catastrophic losses, and are managed through the USDA's Farm Service Agency (FSA). The bulk of direct disaster funding is used to reimburse producers for crop and feed losses rather than livestock losses. Direct disaster payments approached $30 billion during the 1990s, by far the largest of the three disaster relief programs.

Unlike the crop insurance program which farmers use to manage their risk, it is usually legislators who decide whether or not a direct payment should be made to individual farmers after a disaster occurs. The amount of disaster relief available through emergency loans and crop insurance is determined by contract, whereas direct disaster relief is determined solely by legislators, and only after a disaster occurs. Politics thus plays a much larger role in determining the amount of direct disaster relief than it does with emergency loans and crop insurance. Legislators from a specific state find it politically harmful not to subsidize farmers who experienced a disaster, given the presence of organized agriculture interest groups within that state (Becker, 1983; Gardner, 1987).
4.1. Modeling Disaster Relief

Several important econometric issues arise in allocating agricultural disaster payments. First, there is potential simultaneity between disaster relief and crop insurance payments. Second, and more importantly for current purposes, regional influences of natural disasters and subsequent political allocations may have persistent spatial correlation effects across states. Ignoring either econometric issue can lead to biased and inconsistent estimates and faulty policy recommendations.
Consider the following simultaneous equations model with spatial components:

    Y_d = ρ(W)Y_d + δY_c + X_dβ + u                                    (18)
    Y_c = ZΠ_c + v_c                                                   (19)
    Y_d = ZΠ_d + v_d                                                   (20)

where the dependent variable Y_d denotes disaster payments, and is censored because some states do not receive any direct agriculture disaster relief in certain years (in the context of the Tobit model, Y_d = 0 if Y_d* ≤ 0 and Y_d = Y_d* if Y_d* > 0). In (18), Y_c denotes crop insurance payments (non-censored) and X_d are exogenous variables including measures of precipitation, political interest group variables, year specific dummy variables, and the number of farms. In the reduced form models (19) and (20), the exogenous variables include per capita personal income, farm income, the number of farm acres, total crop values, geographical census region, income measures, year specific dummy variables, number of farms and political factors.22 The parameters to be estimated are the structural coefficients δ and β, the spatial coefficient ρ, and the reduced form coefficients Π_c and Π_d.

The data were collected from the FSA and other sources. A complete list and description of all direct disaster relief programs are available through the FSA. The FSA data set maintains individual farmer transactions of all agricultural disaster payments in the U.S. For the purposes of the current study, FSA aggregated the transactions across programs and individuals to obtain annual levels of disaster payments for each of the 48 contiguous states from 1992 to 1999. A list of selected variables and definitions is provided in Table 14.

For this application the elements of the proximity matrices for each time period, W = {w*_ij}, are defined as a standardized joins matrix where w*_ij = w_ij/Σ_j w_ij with w_ij = 1 if observations i and j (for i ≠ j) are from adjoining states (e.g. Kansas and Nebraska) and w_ij = 0 otherwise (e.g. Kansas and Washington). To account for the time series cross-sectional nature of the data, the full proximity matrix used in modeling all of the observed data was defined as a block diagonal matrix such that W_1 = (I_T ⊗ W), where W is the joins matrix defined above and I_T is a T × T identity matrix with T = 8 representing the 8 years of available data.

The analysis proceeded in several steps. First, to simplify the estimation process and focus on the spatial components of the model, predicted crop insurance values were obtained from the reduced form model. The predicted crop insurance values were then used in the disaster relief model.23 Second, Eqs (18) and (20) were jointly estimated with the iterative GME-NLP estimator. For further comparison, the model was estimated with ML-Tobit and the spatial ML estimator under normality.
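The block diagonal panel structure W_1 = I_T ⊗ W is a one-line Kronecker product. Below is a minimal Python sketch; the random matrix is a hypothetical stand-in for the actual 48 × 48 row-standardized state joins matrix, which we do not reproduce here:

    import numpy as np

    T, N = 8, 48   # years, contiguous states (T * N = 384 observations)
    rng = np.random.default_rng(4)

    # Stand-in for the binary adjoining-states matrix, row standardized
    W = rng.integers(0, 2, size=(N, N)).astype(float)
    np.fill_diagonal(W, 0.0)
    rs = W.sum(axis=1, keepdims=True)
    W = np.divide(W, rs, out=np.zeros_like(W), where=rs > 0)

    # Pooled proximity matrix: one W block per year, no spatial
    # dependence across time periods
    W1 = np.kron(np.eye(T), W)
    assert W1.shape == (T * N, T * N)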
Table 14. Definitions of Selected Variables for Disaster Relief Model (N = 384).

(+) Percent change in precipitation (a): To capture periods of increased wetness, one variable contains positive percent changes in precipitation; 0 otherwise.

(−) Percent change in precipitation (a): Periods of relatively dryer weather are reflected in another variable containing negative percent changes in precipitation; 0 otherwise.

Percent change in low temperature (a): For extreme or severe freezes, the annual percent change in low temperature.

Crop insurance: These payments include both government and private insurance payments from the Crop Insurance program, and are computed by subtracting total farmer payments (which equal total insurance premiums minus a federal subsidy) from total indemnity payments.

Secretary of agriculture (b): 1 if the secretary of agriculture is from a specific state; 0 otherwise.

House agriculture subcommittee (b): 1 if state represented on the House Agriculture Committee, subcommittee on General Farm Commodities, Resource Conservation, and Credit; 0 otherwise.

Senate agriculture subcommittee (b): 1 if state represented on the Senate Agriculture Committee, subcommittee on Research, Nutrition, and General Legislation; 0 otherwise.

House appropriations subcommittee (b): 1 if state represented on the House Appropriations Committee, subcommittee on Agriculture, Rural Development, Food and Drug Administration, and Related Agencies; 0 otherwise.

Senate appropriations subcommittee (b): 1 if state represented on the Senate Appropriations Committee, subcommittee on Agriculture, Rural Development, and Related Agencies; 0 otherwise.

Income measures, farm acres, number of farms: U.S. Bureau of the Census' Bureau of Economic Analysis.

Crop values: USDA's National Agricultural Statistics Service.

Electoral (c): Represents a measure of electoral importance.

Census regions (d): 1 if state in a specific Census Region; 0 otherwise.

Year dummies: 1 if a specific year from 1992 to 1999; 0 otherwise.

(a) For each state, average annual precipitation data were gathered over the period 1991–1999 from the National Oceanic and Atmospheric Administration's (NOAA) National Climatic Data Center.
(b) From the Almanac of American Politics.
(c) Garrett and Sobel (2003).
(d) New England: Connecticut, Vermont, Massachusetts, Maine, Rhode Island, New Hampshire; Mid Atlantic: New Jersey, New York, Pennsylvania; East North Central: Michigan, Indiana, Illinois, Wisconsin, Ohio; West North Central: North Dakota, Minnesota, Nebraska, South Dakota, Iowa, Missouri, Kansas; South Atlantic: West Virginia, Delaware, South Carolina, North Carolina, Maryland, Florida, Virginia, Georgia; East South Central: Kentucky, Mississippi, Alabama, Tennessee; West South Central: Arkansas, Oklahoma, Texas, Louisiana; Mountain: Montana, Colorado, New Mexico, Arizona, Wyoming, Nevada, Idaho, Utah; Pacific: Oregon, Washington, California.
Table 15. Results for Disaster Payment Model.

                                           ML-Tobit              ML-Spatial            IGME-NLP
    Variable                            Coeff.    t-Value     Coeff.    t-Value     Coeff.    t-Value
    Constant                          −115.382    −6.297     −72.316    −5.629     −90.619    −4.467
    (+) Percent change in precip.        0.702     1.681       0.646     1.942       0.494     1.098
    (−) Percent change in precip.       −1.154    −1.871      −0.965    −1.977      −0.803    −1.205
    Percent change in low temp.          2.012     0.846       1.517     0.785       2.405     0.920
    Crop insurance                       0.525     3.979       0.483     4.515      −0.061    −0.610
    Number of farms                      0.001     4.943       0.001     4.218       0.001     4.837
    Secretary of agriculture           118.065     3.087      99.854     3.245      94.813     2.422
    House agriculture subcomm.          58.767     4.868      34.661     3.487      43.208     3.165
    Senate agriculture subcomm.         43.360     2.677      28.651     2.181      49.152     2.882
    House appropriations subcomm.       37.800     2.836      34.670     3.216      43.568     3.118
    Senate appropriations subcomm.      34.707     2.566      16.080     1.454      16.020     1.099
    1992                               130.860     6.023      55.577     3.409      71.967     2.794
    1993                               119.890     5.544      43.327     2.630      68.487     2.601
    1994                               106.055     4.830      43.453     2.716      47.366     1.882
    1995                                57.935     2.701      18.163     1.162      32.043     1.361
    1997                                 4.283     0.192      −0.751    −0.100      −1.082    −0.045
    1998                                71.128     3.315      26.025     1.662      36.086     1.507
    1999                               135.895     6.036      50.456     2.979      72.546     2.634
    ρ                                        –         –       0.547    13.065       0.488     5.093
Note: Supports were specified as s_βi = s_Πj = (−1000, 0, 1000)′ for the structural and reduced form parameters and s_ρ = (1/λ̂_min, 0, 1/λ̂_max)′ for the spatial parameter, with estimated eigenvalues λ̂_min = −0.7 and λ̂_max = 1. Effectively, structural coefficient supports were selected to allow political coefficients to range between −$1 billion and $1 billion. Because of the inherent unpredictability and political nature of disaster allocations, circumstances arose in the data that yielded relatively large residuals for selected states in specific years. Rather than removing outlying observations, we chose to expand the error supports to a 5-sigma rule.

Table 15 presents structural and spatial coefficient estimates and asymptotic t-values for the estimates of the disaster relief model for ML-Tobit, the ML
estimator under normality, and GME-NLP. Results from the latter two estimators demonstrate that the spatial autoregressive coefficient ρ is positive and statistically significant. The structural parameter estimates of the disaster relief model are consistent with the findings of Garrett et al. (2003), who applied a maximum likelihood estimator of the simultaneous Tobit model and discuss the implications of the results.24
5. CONCLUSIONS

In this paper we extended generalized maximum entropy (GME) estimators for the general linear model (GLM) and the censored Tobit model to the case where there is first order spatial autoregression present in the dependent variable. Monte Carlo experiments were conducted to analyze finite sample behavior, and estimators were compared using mean squared error loss (MSEL) of the regression coefficients for the spatial general linear model and the censored regression model.

For spatial autoregression in the dependent variable of the standard regression model, the data constrained GME estimator outperformed a specific normalized moment constrained GME estimator in MSEL. In most cases, maximum likelihood (ML) exhibited smaller absolute bias and larger variance relative to GME. However, across the experiments, the GME estimators exhibited sensitivity to the different parameter and error support spaces. For the censored regression model, when there is first order spatial autoregression in the dependent variable, the iterative data constrained GME estimator exhibited reduced absolute bias and variance relative to the non-iterative GME estimator. Interestingly, the relative performance of GME was sensitive to the degree of censoring for the given experiments. When censoring was less than 20%, the iterative GME estimator outperformed ML-Tobit in MSEL. Alternatively, when the percent of censored observations increased to approximately 50%, ML-Tobit tended to outperform iterative GME in MSEL. Finally, we provided an illustrative application of the spatial GME estimators in the context of a model allocating agricultural disaster payments using a simultaneous Tobit framework.

The GME estimators defined in this paper provide a conceptually new approach to estimating spatial regression models with parameter restrictions imposed. The method is computationally efficient and robust for given support points. Future research could include examining spatial autoregressive errors, other spatial GME estimators (e.g. alternative moment constraints), further analysis of the impacts of user supplied support points, or further assessments of the relative benefits of iterative versus non-iterative GME estimators. Overall, the results suggest that continued investigation of GME estimators for spatial autoregressive models could yield additional findings and insight useful to applied economists.
NOTES

1. Standard econometric methods of imposing parameter restrictions on coefficients are constrained maximum likelihood or Bayesian regression. For additional discussion of these methods see Mittelhammer et al. (2000) or Judge et al. (1988).

2. Indeed, other spatial estimation procedures do not require normality for large sample properties. Examples include the two stage least squares and generalized moments estimators in Kelejian and Prucha (1998, 1999).

3. A problem may be described as ill-posed because of non-stationarity or because the number of unknown parameters to be estimated exceeds the number of data points. A problem may be described as ill-conditioned if the parameter estimates are highly unstable. An example of an ill-conditioned problem in empirical application is collinear data (Fraser, 2000).

4. Zellner (1998), and others, have discussed limitations of asymptotically justified estimators in finite sample situations and the lack of research on estimators that have small sample justification. See Anselin (1988) for further motivation and discussion regarding finite sample justified estimators.

5. Moreover, and in contrast to constrained maximum likelihood or the typical parametric Bayesian analysis, GME does not require explicit specification of the distributions of the disturbance terms or of the parameter values. However, both the coefficient and the disturbance support spaces are compact in the GME estimation method, which may not apply in some idealized empirical modeling contexts.

6. Mittelhammer, Judge and Miller (2000) provide an introduction to information theoretic estimators for different econometric models and their connection to maximum entropy estimators.

7. Preckel (2001) interpreted entropy as a penalty function over deviations. In the absence of strong prior knowledge, for the general linear model with a symmetric penalty, Preckel argues there is little or no gain from using GME over generalized ordinary least squares.

8. In this manner, one chooses the p that could have been generated in the greatest number of ways consistent with the given information (Golan et al., 1996).

9. Specifying a support with J = 2 is an effective means to impose bounds on a coefficient. The number of support points can be increased to reflect or recover higher moment information about the coefficient. Uncertainty about support points can be incorporated using a cross-entropy extension to GME.

10. For empirical purposes, disturbance support spaces can always be chosen in a manner that provides a reasonable approximation to the true disturbance distribution, because the upper and lower truncation points can be selected sufficiently wide to contain the true disturbances of regression models (Malinvaud, 1980). Additional discussion about specifying supports is provided ahead and is available in Golan et al. (1996). For notational convenience it is assumed that each coefficient has J support points and each error has M support points.

11. In contrast to the pure data constraint in (2b), the GME estimator could have been specified with the moment constraint X′Y = X′Xβ + X′ε. Data and moment constrained GME estimators are further discussed below, and Monte Carlo results are provided for both.
12. A dual loss function combined with the flexibility of parameter restrictions and user-supplied supports can provide an estimator that is robust in small samples or in ill-posed problems. For further discussion of dual loss functions see Zellner (1994).

13. In other words, GME-D relaxes the orthogonality condition between X and the error term. See the appendix materials for further insight into the gradient derivation.

14. An alternative specification of the moment constraint would be to replace Eq. (6b) with X′[(I − (S_ρ p_ρ)W)Y − X(S_β p_β)]/N = X′(S_u p_u)/N.

15. This specification is based on a GME estimator of the simultaneous equations model introduced by Marsh et al. (2003), who demonstrated properties of consistency and asymptotic normality of the estimator. Historically, Theil (1971) used Eq. (7) to motivate two stage least squares and three stage least squares estimators. The first stage is to approximate E[Y] by applying OLS to the unrestricted reduced form model and thereby obtaining predicted values of Y. Then, using the predicted values to replace E[Y], the second stage is to estimate the structural model with OLS.

16. Anselin (1988) suggests that the spatial coefficient is bounded by 1/λ_min < ρ < 1/λ_max, where λ_min and λ_max are the minimum and maximum eigenvalues of the standardized spatial weight matrix, respectively.

17. The log-likelihood function of the ML estimator for the spatial model with A = I − ρW is ln L = −(N/2)ln(2π) − (1/2)ln|Ω| + ln|A| − (1/2)v′Ω^{−1}v (Anselin, 1988).

18. For further information on EM approaches also see McLachlan and Krishnan (1997). Both McMillen (1992) and LeSage (2000) follow similar approaches where the censored estimation problem is reduced to a noncensored estimation problem using ML and Bayesian estimation, respectively.

19. The process of updating can continue iteratively to construct Ŷ_2(i) and then β̂(i). The notation (i) in the superscript of a variable indicates the ith iteration, with i = 0 representing initial values.

20. The different percentages of censoring were incorporated into the Monte Carlo analysis to better sort out the performance capabilities of GME in the presence of both the censoring and spatial processes.

21. For the iterative GME-NLP estimator, convergence was assumed when the absolute value of the sum of the differences between the estimated parameter vectors from the ith iteration and the (i − 1)st iteration was less than 0.0005.

22. The reduced form models are specified as in (8) and approximated at t = 1.

23. Garrett et al. (2003) explicitly investigated the simultaneous effects between disaster relief and crop payments. Alternatively, we choose to proxy crop insurance payments with predicted values and then focus on the spatial effects of the disaster relief model.

24. See Smith and Blundell (1986) regarding inference for maximum likelihood estimators of the simultaneous Tobit model.
ACKNOWLEDGMENTS We are indebted to Kelly Pace, James LeSage, and anonymous reviewers for insightful comments and recommended corrections. Any errors are solely the responsibility of the authors.
REFERENCES

Adkins, L. (1997). A Monte Carlo study of a generalized maximum entropy estimator of the binary choice model. In: T. B. Fomby & R. C. Hill (Eds), Advances in Econometrics: Applying Maximum Entropy to Econometric Problems (Vol. 12, pp. 183–200). Greenwich, CT: JAI Press.
Anselin, L. (1988). Spatial econometrics: Methods and models. Dordrecht: Kluwer.
Aptech Systems, Inc. (1996). GAUSS: Optimization application module. Maple Valley, Washington.
Becker, G. S. (1983). A theory of competition among pressure groups for political influence. Quarterly Journal of Economics, 98, 371–400.
Besag, J. E. (1972). Nearest-neighbor systems and the auto-logistic model for binary data. Journal of the Royal Statistical Society, B(34), 75–83.
Breiman, L., Tsur, Y., & Zemel, A. (1993). On a simple estimation procedure for censored regression models with known error distributions. Annals of Statistics, 21, 1711–1720.
Case, A. C. (1992). Neighborhood influence and technological change. Regional Science and Urban Economics, 22, 491–508.
Cliff, A., & Ord, K. (1981). Spatial processes, models and applications. London: Pion.
Cressie, N. A. C. (1991). Statistics for spatial data. New York: Wiley.
Fraser, I. (2000). An application of maximum entropy estimation: The demand for meat in the United Kingdom. Applied Economics, 32, 45–59.
Gardner, B. L. (1987). Causes of farm commodity programs. Journal of Political Economy, 95, 290–310.
Garrett, T. A., Marsh, T. L., & Marshall, M. I. (2003). Political allocation of agriculture disaster payments in the 1990s. International Review of Law and Economics (forthcoming).
Garrett, T. A., & Sobel, R. S. (2003). The political economy of FEMA disaster payments. Economic Inquiry.
Gilley, O. W., & Pace, R. K. (1996). On the Harrison and Rubinfeld data. Journal of Environmental Economics and Management, 31, 403–405.
Golan, A., Judge, G. G., & Miller, D. (1996). Maximum entropy econometrics: Robust information with limited data. New York: Wiley.
Golan, A., Judge, G. G., & Miller, D. (1997). The maximum entropy approach to estimation and inference: An overview. In: T. B. Fomby & R. C. Hill (Eds), Advances in Econometrics: Applying Maximum Entropy to Econometric Problems (Vol. 12, pp. 3–24). Greenwich, CT: JAI Press.
Golan, A., Judge, G. G., & Perloff, J. (1996). A maximum entropy approach to recovering information from multinomial response data. Journal of the American Statistical Association, 91, 841–853.
Golan, A., Judge, G. G., & Perloff, J. (1997). Estimation and inference with censored and ordered multinomial data. Journal of Econometrics, 79, 23–51.
Golan, A., Judge, G. G., & Zen, E. Z. (2001). Estimating a demand system with nonnegativity constraints: Mexican meat demand. The Review of Economics and Statistics, 83, 541–550.
Goodwin, B. K., & Smith, V. H. (1995). The economics of crop insurance and disaster aid. Washington, DC: AEI Press.
Harrison, D., & Rubinfeld, D. L. (1978). Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5, 81–102.
Imbens, G., Spady, R., & Johnson, P. (1998). Information theoretic approaches to inference in moment condition models. Econometrica, 66, 333–357.
Judge, G. G., Hill, R. C., Griffiths, W. E., Lütkepohl, H., & Lee, T. (1988). Introduction to the theory and practice of econometrics. New York: Wiley.
Kelejian, H. H., & Prucha, I. R. (1998). A generalized spatial two-stage least squares procedure for estimating a spatial autoregressive model with autoregressive disturbances. Journal of Real Estate Finance and Economics, 17, 99–121.
Kelejian, H. H., & Prucha, I. R. (1999). A generalized moments estimator for the autoregressive parameter in a spatial model. International Economic Review, 40, 509–533.
Kullback, S. (1959). Information theory and statistics. New York: Wiley.
Lee, L. (2002). Consistency and efficiency of least squares estimation for mixed regressive, spatial autoregressive models. Econometric Theory, 18, 252–277.
Lee, L. (2003). Best spatial two-stage least squares estimators for a spatial autoregressive model with autoregressive disturbances. Econometric Reviews, 22, 307–335.
LeSage, J. P. (1997). Bayesian estimation of spatial autoregressive models. International Regional Science Review, 20, 113–129.
LeSage, J. P. (2000). Bayesian estimation of limited dependent spatial autoregressive models. Geographical Analysis, 32, 19–35.
Malinvaud, E. (1980). Statistical methods of econometrics (3rd ed.). Amsterdam: North-Holland.
Marsh, T. L., Mittelhammer, R. C., & Cardell, N. S. (2003). A generalized maximum entropy estimator of the simultaneous linear statistical model. Working Paper, Washington State University.
Marsh, T. L., Mittelhammer, R. C., & Huffaker, R. G. (2000). Probit with spatial correlation by plot: PLRV net necrosis. Journal of Agricultural, Biological, and Environmental Statistics, 5, 22–36.
McLachlan, G. J., & Krishnan, T. (1997). The EM algorithm and extensions. New York: Wiley.
McMillen, D. P. (1992). Probit with spatial correlation. Journal of Regional Science, 32, 335–348.
Mittelhammer, R., Judge, G., & Miller, D. (2000). Econometric foundations. New York: Cambridge University Press.
Mittelhammer, R. C., & Cardell, N. S. (1998). The data-constrained GME estimator of the GLM: Asymptotic theory and inference. Mimeo, Washington State University.
Paarsch, H. J. (1984). A Monte Carlo comparison of estimators for censored regression models. Journal of Econometrics, 24, 197–213.
Poirier, D. J., & Ruud, P. A. (1988). Probit with dependent observations. The Review of Economic Studies, 55, 593–614.
Pompe, B. (1994). On some entropy measures in data analysis. Chaos, Solitons, and Fractals, 4, 83–96.
Preckel, P. V. (2001). Least squares and entropy: A penalty perspective. American Journal of Agricultural Economics, 83, 366–367.
Pukelsheim, F. (1994). The three sigma rule. The American Statistician, 48, 88–91.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27.
Smith, R. J., & Blundell, R. W. (1986). An exogeneity test for a simultaneous equation tobit model with an application to labor supply. Econometrica, 54, 679–685.
Smith, T. E., & LeSage, J. P. (2002). A Bayesian probit model with spatial dependencies. Working Paper.
Theil, H. (1971). Principles of econometrics. New York: Wiley.
Vysochanskii, D. F., & Petunin, Y. I. (1980). Justification of the 3σ rule for unimodal distributions. Theory of Probability and Mathematical Statistics, 21, 25–36.
Zellner, A. (1994). Bayesian and non-Bayesian estimation using balanced loss functions. In: S. Gupta & J. Berger (Eds), Statistical Decision Theory and Related Topics. New York: Springer.
Zellner, A. (1998). The finite sample properties of simultaneous equations' estimates and estimators: Bayesian and non-Bayesian approaches. Journal of Econometrics, 83, 185–212.
APPENDIX A

Conditional Maximum Value Function

Define the conditional entropy function by conditioning on the (L + K + 1) × 1 vector θ = vec(Π, β, ρ), yielding:

    F(θ) = max_{p:θ} {−Σ_{k,j} p_βkj ln(p_βkj) − Σ_{ℓ,j} p_Πℓj ln(p_Πℓj) − Σ_j p_ρj ln(p_ρj)
                       − Σ_{i,m} p_u*im ln(p_u*im) − Σ_{i,m} p_vim ln(p_vim)}          (A.1)

The optimal value of p_u*i = (p_u*i1, . . ., p_u*iM)′ in the conditionally-maximized entropy function is given by:

    p_u*i(θ) = arg max {−Σ_{ℓ=1}^{M} p_u*iℓ ln(p_u*iℓ)}                               (A.2)

over p_u*iℓ satisfying Σ_{ℓ=1}^{M} p_u*iℓ = 1 and Σ_{ℓ=1}^{M} s_u*iℓ p_u*iℓ = u*_i(θ), which is the maximizing solution to the Lagrangian:

    L_{p_u*i} = −Σ_{ℓ=1}^{M} p_u*iℓ ln(p_u*iℓ) + η_ui(Σ_{ℓ=1}^{M} p_u*iℓ − 1)
                + γ_ui(Σ_{ℓ=1}^{M} s_u*iℓ p_u*iℓ − u*_i(θ))                           (A.3)

The optimal value of p_u*iℓ is:

    p_u*iℓ(γ_ui(u*_i(θ))) = exp(γ_ui(u*_i(θ))s_u*iℓ) / Σ_{m=1}^{M} exp(γ_ui(u*_i(θ))s_u*im),
        ℓ = 1, . . ., M                                                               (A.4)

where γ_ui(u*_i(θ)) is the optimal value of the Lagrangian multiplier γ_ui. The optimal value of p_βk = (p_βk1, . . ., p_βkJ)′ in the conditionally-maximized entropy function is given by:

    p_βk(β_k) = arg max {−Σ_{ℓ=1}^{J} p_βkℓ ln(p_βkℓ)}                                (A.5)

over p_βkℓ satisfying Σ_{ℓ=1}^{J} p_βkℓ = 1 and Σ_{ℓ=1}^{J} s_βkℓ p_βkℓ = β_k,
which is the maximizing solution to the Lagrangian:

    L_{p_βk} = −Σ_{ℓ=1}^{J} p_βkℓ ln(p_βkℓ) + η_βk(Σ_{ℓ=1}^{J} p_βkℓ − 1)
               + λ_βk(Σ_{ℓ=1}^{J} s_βkℓ p_βkℓ − β_k)                                  (A.6)

The optimal value of p_βkℓ is then:

    p_βkℓ(β_k) = exp(λ_βk(β_k)s_βkℓ) / Σ_{m=1}^{J} exp(λ_βk(β_k)s_βkm),
        ℓ = 1, . . ., J                                                               (A.7)

where λ_βk(β_k) is the optimal value of the Lagrangian multiplier λ_βk. Likewise the optimal values of p_Π, p_v, p_ρ can be derived. Substituting the optimal values for p = vec(p_β, p_Π, p_ρ, p_u, p_v) into the conditional entropy function (A.1) yields:

    F(θ) = −Σ_k [λ_βk(β_k)β_k − ln(Σ_j exp(λ_βk(β_k)s_βkj))]
           − Σ_ℓ [λ_Πℓ(Π_ℓ)Π_ℓ − ln(Σ_j exp(λ_Πℓ(Π_ℓ)s_Πℓj))]
           − [λ_ρ(ρ)ρ − ln(Σ_j exp(λ_ρ(ρ)s_ρj))]
           − Σ_i [γ_ui(u*_i(θ))u*_i(θ) − ln(Σ_m exp(γ_ui(u*_i(θ))s_u*im))]
           − Σ_i [γ_vi(v_i(θ))v_i(θ) − ln(Σ_m exp(γ_vi(v_i(θ))s_vim))]                (A.8)
Computational Issues

Following the computationally efficient approach of Mittelhammer and Cardell (1998), the conditional entropy function (Eq. (A.8)) was maximized. Note that the constrained maximization problem in (9) requires estimation of a (KJ + LJ + J + 2NM) × 1 vector of unknown parameters. Solving (9) for (KJ + LJ + J + 2NM) unknowns is not computationally practical as the sample size, N, grows larger. In contrast, maximizing (A.8) requires estimation of only (L + K + 1) unknown coefficients for any positive value of N.

The GME-NLP estimator uses the reduced and structural form models as data constraints with a dual objective function as part of its information set. To completely specify the GME-NLP model, support points (upper and lower truncation
and intermediate) for the individual parameters, support points for each error term, and (L + K + 1) starting values for the parameter coefficients are supplied by the user. In the Monte Carlo analysis and empirical application, the model was estimated using the unconstrained optimizer OPTMUM in the econometric software GAUSS. We used 3 support points for each parameter and error term. To increase the efficiency of the estimation process, the analytical gradient and Hessian were coded in GAUSS and called in the optimization routine. This also offered an opportunity to empirically validate the derivation of the gradient and Hessian (provided below). Given suitable starting values, the optimization routine generally converged within seconds for the empirical examples discussed above. Moreover, solutions were robust to alternative starting values.
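The efficiency of the concentrated approach rests on the fact that each error term contributes only a scalar Lagrange multiplier, obtainable from a one-dimensional solve of the mean condition implicit in (A.4). A minimal Python sketch of that inner solve (our own illustration, not the authors' GAUSS code; the function name is an assumption):

    import numpy as np

    def solve_multiplier(target, s, tol=1e-10, max_iter=100):
        # Newton solve for gamma such that sum_l s_l * p_l(gamma) = target,
        # with p_l proportional to exp(gamma * s_l) as in (A.4). The
        # Jacobian is the variance of the support distribution, which is
        # positive, so the iteration is well behaved for any target
        # strictly inside (min(s), max(s)).
        gamma = 0.0
        for _ in range(max_iter):
            z = gamma * s
            w = np.exp(z - z.max())          # stabilized exponentials
            p = w / w.sum()
            mean = p @ s
            var = p @ (s * s) - mean * mean
            step = (target - mean) / var
            gamma += step
            if abs(step) < tol:
                break
        return gamma, p

    # Example: three-point supports, residual target 0.7
    gamma, p = solve_multiplier(0.7, np.array([-3.0, 0.0, 3.0]))

Because this solve does not grow with N, evaluating F(θ) and its derivatives scales in the (L + K + 1) conditioning parameters rather than in the full weight vector.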
Gradient

The gradient vector ∇ = vec(∇_Π, ∇_β, ∇_ρ) of F(θ) is:

    ∇_Π = −λ_Π(Π) + Z′γ_v(v(θ)) + [(ρW)Z]′γ_u(u(θ))
    ∇_β = −λ_β(β) + X′γ_u(u(θ))
    ∇_ρ = −λ_ρ(ρ) + [(W)ZΠ]′γ_u(u(θ))                                                 (A.9)

Hessian

The Hessian matrix H(θ) = ∂²F(θ)/∂θ∂θ′ is composed of the submatrices:

    ∂²F(θ)/∂Π∂Π′ = −[(ρW)Z]′[(∂γ_u(u(θ))/∂u(θ)) ⊙ ((ρW)Z)] − ∂λ_Π(Π)/∂Π′
                   − Z′[(∂γ_v(v(θ))/∂v(θ)) ⊙ Z]                                       (A.10.1)

    ∂²F(θ)/∂Π∂β′ = −[(ρW)Z]′[(∂γ_u(u(θ))/∂u(θ)) ⊙ X]                                  (A.10.2)

    ∂²F(θ)/∂Π∂ρ = −[(ρW)Z]′[(∂γ_u(u(θ))/∂u(θ)) ⊙ ((W)ZΠ)] + [(W)Z]′γ_u(u(θ))          (A.10.3)

    ∂²F(θ)/∂β∂β′ = −∂λ_β(β)/∂β′ − X′[(∂γ_u(u(θ))/∂u(θ)) ⊙ X]                          (A.10.4)
    ∂²F(θ)/∂β∂ρ = −X′[(∂γ_u(u(θ))/∂u(θ)) ⊙ ((W)ZΠ)]                                   (A.10.5)

    ∂²F(θ)/∂ρ∂ρ = −∂λ_ρ(ρ)/∂ρ − [(W)ZΠ]′[(∂γ_u(u(θ))/∂u(θ)) ⊙ ((W)ZΠ)]                (A.10.6)

In the above equations, the notation ⊙ denotes a Hadamard product (element by element multiplication) and the derivatives of the Lagrangian multipliers are defined as:

    ∂λ_ℓk(ℓ_k)/∂ℓ_k = [Σ_{j=1}^{J} (s_ℓkj)² p_ℓkj − (ℓ_k)²]^{−1}   for ℓ ∈ {β, Π, ρ}  (A.11.1)

    ∂γ_ui(u(θ))/∂u_i(θ) = [Σ_{ℓ=1}^{M} (s_u*iℓ)² p_u*iℓ(γ_ui(u_i(θ))) − u_i²(θ)]^{−1} (A.11.2)

    ∂γ_vi(v(θ))/∂v_i(θ) = [Σ_{ℓ=1}^{M} (s_viℓ)² p_viℓ(γ_vi(v_i(θ))) − v_i²(θ)]^{−1}   (A.11.3)

In Eqs (A.9)–(A.11) the superscript * was dropped in the notation of the structural equation residuals for convenience. Given the above derivations (A.1)–(A.11), the asymptotic properties of the GME-NLP estimator follow from Theorems 1 and 2 in Marsh et al. (2003).
EMPLOYMENT SUBCENTERS AND HOME PRICE APPRECIATION RATES IN METROPOLITAN CHICAGO

Daniel P. McMillen

1. INTRODUCTION

The size and number of employment subcenters have increased in large metropolitan areas as the spatial distribution of jobs has become increasingly decentralized. Although employment decentralization is not a new phenomenon, only recently have concentrations of employment outside the central city begun to rival the traditional central business district (CBD) in size and scope. Because of this change, neither theoretical nor empirical models in urban economics now rely solely on the traditional monocentric city model of Muth (1969) and Mills (1972). Instead, recent research incorporates some version of a polycentric model, a trend that Anas et al. (1998) document in their excellent review article.

One part of the empirical literature focuses on developing procedures for identifying subcenters. Important contributions to the literature on subcenter identification include Baumont et al. (2004), Craig and Ng (2001), Giuliano and Small (1991), McDonald (1987), and McMillen (2001). A complementary branch of the literature documents the effects of employment subcenters on employment density, population density, and house prices. Representative examples from this large and growing literature include Baumont et al. (2004), Bender and Hwang (1985), Gordon et al. (1986), Heikkila et al. (1989), McDonald and Prather (1994), McMillen and Lester (2003), McMillen and McDonald (1997), Muñiz et al.
In general, these studies find that employment densities tend to fall with distance from the nearest subcenter. The effect of subcenters on house prices and population density is less clear. Gordon et al. (1986), Muñiz et al. (2003), and Small and Song (1994) find that population density is higher near employment subcenters, and Bender and Hwang (1985), Heikkila et al. (1989), and Richardson et al. (1990) find similar results for home prices. In contrast, Baumont et al. (2004) and McMillen and Lester (2003) find that population density rises with distance from the nearest subcenter in Dijon, France, and in the Chicago metropolitan area, respectively. McMillen and McDonald (2000) find that commercial real estate developments in the Chicago suburbs were attracted to sites near subcenters during the 1990s, whereas new homes were more likely to be built in locations that are more distant. These results suggest that subcenters may still be primarily a nonresidential phenomenon. Large suburban subcenters offer firms significant advantages. They typically offer good access to the transportation network, and close proximity to other firms facilitates interpersonal business communications. Thus, it is not surprising that subcenters now have significant effects on employment density patterns. However, large subcenters should also have significant effects on residential location patterns. Suburban traffic congestion often rivals that of the central city, providing workers with an incentive to live near subcenters. Urban theory suggests that the price of a unit of housing should be higher near subcenters, which in turn leads to smaller lot sizes and higher population density. However, this prediction holds only for a given category of homeowner. Home prices and lot sizes may well be higher farther from subcenters if higher-income people tend to live farther from subcenters – a result that is consistent with standard urban theory. The effect of subcenters on home prices and population density is ultimately an empirical issue. This paper presents further evidence that proximity to Chicago's subcenters is not valued highly in the residential market. Following a procedure developed in McMillen (2003), I use a repeat sales estimator to calculate home price appreciation rates for 1990–1998 for every section (a mile square) in the Chicago metropolitan area. Consistent with the results of that paper, I find that house prices grew much more rapidly close to the city center. Appreciation rates also tend to be relatively high in distant parts of the suburbs, with particularly high rates in northern Will County. I then use distance from the nearest subcenter as an explanatory variable for the estimated home price appreciation rates. I control for spatial effects using two approaches. The first approach is a standard parametric model, with fixed effects for the township in which a section is located. The second approach uses a semiparametric estimator in which spatial effects enter as nonparametric functions of the sections' geographic coordinates.
The regression results confirm the evidence shown in a map: home price appreciation rates rise with distance from the nearest subcenter, but are high near the traditional CBD. Rather than attempting to reduce their commutes by bidding for sites near subcenters, Chicago's homeowners are taking advantage of suburban job sites to move still farther from the city center.
2. SUBCENTER IDENTIFICATION

Subcenter locations are not necessarily easy to determine beforehand. The usual definition of a subcenter is a set of contiguous tracts with significantly higher employment densities than surrounding areas. Together, the subcenter tracts should have enough employment to have significant effects on the overall spatial distribution of employment and population. Studies such as Bender and Hwang (1985), Heikkila et al. (1989), and Richardson et al. (1990) use prior knowledge of the urban areas being studied to form a list of sites that they expect to have significant explanatory power. In practice, only some of these sites turn out to have significant effects on variables like employment density, and other possibly influential sites are likely to be omitted. More objective procedures for identifying subcenters have been proposed by Baumont et al. (2004), Craig and Ng (2001), Giuliano and Small (1991), McDonald (1987), and McMillen (2001). Of these studies, all but Giuliano and Small use statistical approaches to identify local peaks in estimated employment density functions. McDonald (1987) estimates a standard negative exponential employment density function in which distance from the CBD is the sole explanatory variable. He defines subcenters as groups of significantly positive residuals. McDonald's procedure is sensitive to the restrictive functional form assumption. If the rate of decline of employment density varies across the urban area, the simple negative exponential function may produce clusters of positive residuals that have no association with subcenter sites. Craig and Ng (2001) and McMillen (2001) address this functional form problem by using alternative flexible estimators to model employment densities. Craig and Ng use quantile splines to let employment density vary smoothly with distance from the CBD. Candidate areas for subcenter status are the rings around the CBD in which estimated densities exhibit local peaks. McMillen's procedure is similar, but he allows for further flexibility in the functional form by using locally weighted regression procedures that allow employment densities to vary within the urban area. Specifically, he estimates a separate weighted least squares regression for each observation in the data set, putting more weight on nearby observations. He defines subcenters as local peaks in the estimated employment density surface.
As an alternative to these regression-based procedures, Baumont et al. (2004) use exploratory spatial data analysis to identify subcenters. They define a subcenter as a cluster of sites with high densities, as indicated by Anselin's (1995) LISA (local indicators of spatial association) statistics. Statistical approaches to subcenter identification are particularly useful when making comparisons across metropolitan areas, as in McMillen and Smith (2003), because they rely on formal definitions of statistical significance to determine how large a local concentration of employment must be to qualify as a subcenter. In contrast, Giuliano and Small define a subcenter as a group of contiguous tracts that together have at least 10,000 employees and in which each tract has at least 10 employees per acre. Their approach is sensitive to the cutoff points, with larger values of either cutoff producing fewer subcenters. Appropriate cutoff points may vary across urban areas, but prior knowledge of the local area must be used to choose the cutoffs rather than formal statistical significance. However, the Giuliano and Small approach is by far the easiest to use, and it produces good results when the researcher has sufficient knowledge of the local area to choose accurate cutoff points. McMillen and Lester (2003) use the Giuliano and Small approach to identify subcenters in the Chicago metropolitan area from 1970 to 2020 (using forecasts of employment for 2020). The Giuliano and Small approach is useful in this context because using the same cutoff points over time provides an accurate picture of the spatial evolution of individual subcenters. After experimenting with cutoffs of 5, 10, 15, and 20 employees per acre and total employment of 5,000, 10,000, and 20,000 workers, McMillen and Lester found that constant cutoff points of 15 employees per acre and 10,000 total workers produced a reasonable number of subcenters in each period. They find that the number of subcenters in the Chicago area rises from 9 in 1970 to 13 in 1980, 15 in 1990, and 32 in 2000. The number falls to 24 in 2020 as several subcenters merge and the average subcenter size increases. McMillen and Lester (2003) provide a more thorough discussion of the data and subcenter identification procedure. My objective in this paper is to determine whether subcenter proximity influences home price appreciation rates for 1990–1998 in the Chicago metropolitan area. Thus, I use McMillen and Lester's subcenter list for 1990 to construct a variable representing distance from the nearest subcenter. Figure 1 shows the subcenter locations. The circles are centered on tracts that are part of the subcenters. In 1990, the six-county Chicago metropolitan area had total employment of 3,631,400, of which 1,058,770 (25.9%) was in subcenters outside of the traditional CBD. The large group of subcenter sites northwest of the city center surrounds O'Hare Airport, which now rivals the CBD in its level of employment.
Fig. 1. Subcenter Sites.
Indeed, it is difficult to discern from Fig. 1 whether the subcenters near the airport are distinct from one another or are simply a single large O'Hare subcenter, and smaller cutoff points cause the separate subcenters in this area to merge into one. Other significant subcenters include Oak Brook and the adjoining corridor along the I-88 tollway due west of the city center. Additional notable subcenters include Deerfield-Northbrook at the boundary between Cook and Lake Counties, Evanston (north of the city along Lake Michigan), the Hyde Park area of Chicago (south of the city center near the lake), and the Bedford Park-Chicago Lawn-West Lawn manufacturing center on the southwest side of the city.
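The Giuliano and Small cutoff rule lends itself to a short sketch. The function below is a minimal illustration rather than anyone's actual code; the tract identifiers, densities (employees per acre), employment counts, and contiguity map are assumed inputs.

```python
def find_subcenters(density, employment, neighbors,
                    min_density=10.0, min_total=10000.0):
    """Cutoff rule in the spirit of Giuliano and Small (1991): grow groups of
    contiguous tracts whose density meets min_density, then keep groups whose
    combined employment meets min_total. All arguments are dicts keyed by
    tract id; `neighbors` maps each tract to the tracts it touches."""
    dense = {t for t, d in density.items() if d >= min_density}
    seen, subcenters = set(), []
    for start in dense:
        if start in seen:
            continue
        group, stack = [], [start]
        seen.add(start)
        while stack:                      # flood-fill over contiguous dense tracts
            t = stack.pop()
            group.append(t)
            for nb in neighbors.get(t, ()):
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        if sum(employment[t] for t in group) >= min_total:
            subcenters.append(sorted(group))
    return subcenters
```

Raising either cutoff shrinks the list of qualifying groups, which is exactly the sensitivity noted above.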
3. THE REPEAT SALES ESTIMATOR

Home price appreciation rates typically vary within an urban area. Some older neighborhoods can even have declining home prices while newer areas have double-digit annual rates of increase. However, appreciation rates may be sensitive to the method used in their calculation. The easiest method is to use average home prices by neighborhood to calculate implied annual appreciation rates. Simple averages fail to control for differences in the types of homes that sell during different parts of the business cycle. For example, the rate of appreciation from trough to peak may be overstated if only inexpensive and comparatively low-quality homes sell during the trough while premium homes sell during the peak. Hedonic price functions can potentially eliminate this bias by controlling for observable housing characteristics. A typical hedonic price function is:

$$Y_{i,t} = \alpha_t + \beta' X_i + u_{i,t} \qquad (1)$$
where $Y_{i,t}$ is the natural logarithm of the sales price of house $i$ at time $t$, $X_i$ is a vector of housing characteristics, and $u_{i,t}$ is an error term. The vector of estimated coefficients $\alpha_1, \ldots, \alpha_T$ shows the time-path of housing prices after controlling for $X$. These coefficients are typically transformed into index form by normalizing such that $\alpha_1 = 0$. Controlling for observable housing characteristics, the estimated price appreciation rate from time $t-1$ to $t$ is simply $100 \times (\alpha_t - \alpha_{t-1})$. Some examples of this approach to calculating price indexes are Kiel and Zabel (1997), Mark and Goldberg (1984), and Thibodeau (1989). Housing is a complex good. The vector $X$ will never include all the important housing characteristics, and ambiguous concepts like quality are nearly impossible to measure accurately. Unobserved housing characteristics are likely to be correlated with measured variables, leading to biased estimates of appreciation rates. The repeat sales estimator of Bailey et al. (1963) can potentially avoid this missing variable bias by using only homes that have sold more than once to estimate the price index.
If observed housing characteristics and the corresponding vector of coefficients, $\beta$, are constant over time, then Eq. (1) can be re-written as:

$$Y_{i,t} - Y_{i,s} = \alpha_t - \alpha_s + u_{i,t} - u_{i,s} \qquad (2)$$
The left hand side of Eq. (2) is the appreciation rate for house $i$ between times $s$ and $t$, where $s < t$. Any housing characteristic that does not change over time disappears from the estimating equation if its corresponding coefficient is also constant. Thus, the repeat sales estimator controls for the effect of both observed and unobserved housing characteristics that do not change over time. Equation (2) forms the basis for the standard repeat sales estimator. It is estimated by regressing $Y_{i,t} - Y_{i,s}$ on a series of indicator variables $D_{i2}, \ldots, D_{iT}$, where $T$ is the number of time periods. The data set consists of a series of repeat sales pairs. For a home selling first at time $s$ and later at time $t$, the indicator variables are defined as $D_{is} = -1$, $D_{it} = 1$, and $D_{ik} = 0$ for $k \neq s, t$. Omitting $D_{i1}$ from the estimating equation imposes the restriction $\alpha_1 = 0$, as required for a price index. Examples of repeat sales estimation include Case and Quigley (1991), Case and Shiller (1987, 1989), Follain and Calhoun (1997), and Kiel and Zabel (1997). Although the repeat sales estimator is a popular approach for estimating price indexes, it is also the source of some controversy. The most serious criticism is the potential for sample selection bias, which may be a serious problem if the sample of repeat sales is not representative of the full sample of sales. The bias can go either way. Case et al. (1997) and Gatzlaff and Haurin (1997) find that appreciation rates are lower in repeat sales samples when compared with the full sample of sales. In contrast, Clapp et al. (1991) find that appreciation rates are higher for a repeat sales sample. It is less frequently noted that the hedonic approach is also prone to selection bias because houses that sell are not necessarily representative of the full housing market, which at a given time is composed primarily of houses that do not sell. The repeat sales estimator is most useful when many important housing characteristics are missing from a data set – an important advantage here since my data set includes very few variables. Time enters Eqs (1) and (2) as a series of discrete intervals. Typically, the intervals are no smaller than years or quarters. Time periods are likely to be larger for the repeat sales estimator than for the hedonic approach because the number of observations is smaller when the sample is restricted to homes that sell at least twice. Estimated price indexes can have sharp discontinuities between periods, and are highly sensitive to unusually high or low sales prices in times with few sales. Theoretically, a smooth price index is much more realistic because prices do not increase sharply from one day to the next, as implied by the standard repeat sales estimator.
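A minimal sketch of the standard repeat sales regression may help fix ideas. The function below builds the indicator matrix exactly as described ($D_{is} = -1$, $D_{it} = 1$, period 1 omitted) and recovers the log price index by least squares; the input format is assumed for illustration.

```python
import numpy as np

def repeat_sales_index(pairs, T):
    """Standard repeat sales regression. `pairs` is a list of tuples
    (log_price_first, log_price_second, s, t) with 1 <= s < t <= T.
    Period 1 is omitted so the index starts at alpha_1 = 0."""
    n = len(pairs)
    y = np.zeros(n)
    D = np.zeros((n, T - 1))              # columns are D_2, ..., D_T
    for i, (ps, pt, s, t) in enumerate(pairs):
        y[i] = pt - ps
        if s > 1:
            D[i, s - 2] = -1.0
        D[i, t - 2] = 1.0
    alpha, *_ = np.linalg.lstsq(D, y, rcond=None)
    return np.concatenate(([0.0], alpha)) # log price index, alpha_1 = 0

# tiny illustrative sample: one house sold in periods 2 and 5, another in 1 and 4
idx = repeat_sales_index([(11.6, 11.9, 2, 5), (11.2, 11.5, 1, 4)], T=6)
```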
In this paper, I follow the approach used in McMillen (2003) to impose the theoretically attractive restriction of a smooth price index. Let $T_i$ represent the sales date for house $i$, and let $T_i^s$ represent the previous date of sale. The complete sample consists of $n$ repeat sales pairs: $i = 1, \ldots, n$. Using the smooth continuous function $g(T)$ to represent the time trend in home prices, Eq. (2) can be re-written as:

$$Y_{i,t} - Y_{i,s} = g(T_i) - g(T_i^s) + u_{i,t} - u_{i,s} \qquad (3)$$
The trick in estimating Eq. (3) is finding a flexible function for $g(T)$ that makes it possible to impose the restriction that $g(T_i)$ and $g(T_i^s)$ are simply two values of the same function. Although flexible functional forms such as the cubic spline are feasible, I have found that the Fourier expansion model of Gallant (1981, 1982) is a useful approach for modeling time trends in home prices. The Fourier approach begins by transforming the time variable to lie between 0 and $2\pi$: $z_i \equiv 2\pi T_i/\max(T)$ and $z_i^s \equiv 2\pi T_i^s/\max(T)$. The Fourier expansions are $g(T_i) = \alpha_0 + \alpha_1 z_i + \alpha_2 z_i^2 + \sum_q (\beta_q \sin(qz_i) + \gamma_q \cos(qz_i))$ and $g(T_i^s) = \alpha_0^s + \alpha_1^s z_i^s + \alpha_2^s (z_i^s)^2 + \sum_q (\beta_q^s \sin(qz_i^s) + \gamma_q^s \cos(qz_i^s))$, where $q = 1, \ldots, Q$. The restriction that $g(T_i)$ and $g(T_i^s)$ are the same underlying function is imposed by setting $\alpha_1 = \alpha_1^s$, $\beta_1 = \beta_1^s$, etc. Imposing these restrictions, the estimating equation becomes:

$$Y_{i,t} - Y_{i,s} = \alpha_1(z_i - z_i^s) + \alpha_2(z_i^2 - (z_i^s)^2) + \sum_q\left[\beta_q(\sin(qz_i) - \sin(qz_i^s)) + \gamma_q(\cos(qz_i) - \cos(qz_i^s))\right] + u_{i,t} - u_{i,s} \qquad (4)$$
After specifying the expansion length, $Q$, Eq. (4) represents a standard linear regression. After imposing the constraint that $g(0) = 0$, the estimated price index can be constructed from regression estimates of Eq. (4) as $\alpha_1 z + \alpha_2 z^2 + \sum_q(\beta_q \sin(qz) + \gamma_q(\cos(qz) - 1))$, where $z$ is a set of target dates. In this application, the data set includes 204 months of sales, so $z$ ranges from 0 to $2\pi$ in increments of $2\pi/204$.
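The following sketch builds the regressors of Eq. (4) and evaluates the implied price index with $g(0) = 0$. It is an illustration of the construction just described, with $Q = 3$ as in the application; the function names are mine.

```python
import numpy as np

def fourier_regressors(T_second, T_first, T_max, Q=3):
    """Differenced Fourier terms of Eq. (4). T_second and T_first are the
    two sale dates (e.g. month numbers); both are mapped to [0, 2*pi]."""
    z, zs = 2 * np.pi * T_second / T_max, 2 * np.pi * T_first / T_max
    cols = [z - zs, z**2 - zs**2]
    for q in range(1, Q + 1):
        cols.append(np.sin(q * z) - np.sin(q * zs))
        cols.append(np.cos(q * z) - np.cos(q * zs))
    return np.column_stack(cols)

def index_at(z, coef, Q=3):
    """Price index implied by the fitted coefficients, normalized so g(0) = 0."""
    out = coef[0] * z + coef[1] * z**2
    for q in range(1, Q + 1):
        out += coef[2 * q] * np.sin(q * z) + coef[2 * q + 1] * (np.cos(q * z) - 1)
    return out
```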
4. LOCALLY WEIGHTED REGRESSION

The main advantage of the Fourier approach is that it uses the theoretically attractive assumption of continuity to make efficient use of data from times with few sales. The approach may make it possible to estimate price indexes for small geographic submarkets. However, submarkets may have few sales, and their boundaries may not be known beforehand.
In this paper, I follow a procedure proposed in McMillen (2003) and use locally weighted regression (LWR) procedures to construct separate house price index estimates for every section in the Chicago metropolitan area. Cleveland and Devlin (1988) proposed the version of LWR used in urban economics by researchers such as Fu and Somerville (2001), McMillen and McDonald (1997), Meese and Wallace (1991), and Pavlov (2000). LWR is a simple weighted least squares estimation procedure that places more weight on nearby observations when constructing an estimate for a target location. The Fourier repeat sales estimator adapts easily to the LWR approach because it is a regression-based procedure. In this application, the target locations are the section centroids. The six-county Chicago area has 3,889 sections. LWR places more weight on repeat sales pairs for homes closer to the center of the target section. Let $d_i$ be the distance between observation $i$ and the target section centroid. Of the many available weighting functions, I use the tri-cube to define the weight received by observation $i$ when constructing the estimate for the target section:

$$w_i = \left(1 - \left(\frac{d_i}{d_{\max}}\right)^3\right)^3 I(d_i \le d_{\max}) \qquad (5)$$

where $I(\cdot)$ is an indicator function that equals one when the condition is true and zero otherwise. While any standard nonparametric kernel could be used in place of Eq. (5), a standard result from nonparametric regression analysis is that the estimates are not sensitive to the weighting function. The weights fall smoothly from a maximum of one at the target section centroid to zero at distance $d_{\max}$. I set $d_{\max}$ such that the nearest $\delta\%$ of the observations receive positive weight in Eq. (5); $\delta$ is referred to as the "window size." After experiments with window sizes of 5%, 10%, 25%, and 50%, I set the window size at 5%. This window size produces maps of appreciation rates with an appropriate level of detail. Defining $y_i \equiv Y_{i,t} - Y_{i,s}$ and letting $x_i$ represent the vector of explanatory variables defined implicitly in Eq. (4), the predicted value for the target section is simply:

$$\hat{y} = x'\left(\sum_{i=1}^{n} w_i x_i x_i'\right)^{-1}\sum_{i=1}^{n} w_i x_i y_i \qquad (6)$$

which is easily estimated by weighted least squares. With 3,889 sections, Eq. (6) produces 3,889 separate estimates of the coefficients in Eq. (4). These estimates, in turn, imply 3,889 separate estimates of the price index. The structure of the estimator implies that sales from outside the target section will receive weight, but the weights will be lower the farther a sale is from the target point.
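A compact version of Eqs (5) and (6) for a single target point might look as follows; the inputs (a design matrix built from Eq. (4) and distances to the target centroid) are assumed.

```python
import numpy as np

def lwr_predict(x_target, X, y, dist, window=0.05):
    """Locally weighted regression at one target point, Eqs (5) and (6).
    `dist` holds distances from each observation to the target; the nearest
    `window` share of observations receives positive tri-cube weight."""
    k = max(int(window * len(y)), 2)
    d_max = np.sort(dist)[k - 1]
    w = np.where(dist <= d_max, (1 - (dist / d_max) ** 3) ** 3, 0.0)
    WX = X * w[:, None]
    beta = np.linalg.solve(X.T @ WX, WX.T @ y)   # weighted least squares
    return x_target @ beta
```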
One of the advantages of local linear regression procedures is that error variances are more likely to be constant within a small geographic region than across the full sample. Moreover, the predicted values (but not the standard errors) produced by estimating Eq. (6) are not affected asymptotically by heteroskedasticity because regression estimates remain consistent when the errors do not have constant variances. Other estimators could also be used to allow for local variation in appreciation rates. One alternative, which has not yet been used in the literature, is the spatial autoregressive model. In matrix form, we can write either the standard repeat sales estimator or the Fourier estimator as $y = X\beta + u$. For the standard repeat sales estimator, $X$ is composed of the series of indicator variables $D$, whereas in the Fourier version $X$ includes the explanatory variables implicitly defined in Eq. (4). In either case, $y$ is a vector with typical element $y_i = Y_{i,t} - Y_{i,s}$. After specifying a spatial weight matrix, $W$, the spatial autoregressive model is written $y = \rho Wy + X\beta + u$, or $y = (I - \rho W)^{-1}(X\beta + u)$, where $\rho$ is a parameter and $u$ is the vector of errors with typical element $u_{i,t} - u_{i,s}$. This model implies that each $y_i$ is a weighted average of all the values of $X$ and $u$, with higher weights given to values from neighboring observations. Predicted appreciation rates will vary smoothly over space. Thus, the spatial autoregressive model and the LWR model are similar in spirit.
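As a hedged illustration of the spatial autoregressive alternative just described, the snippet below simulates $y = (I - \rho W)^{-1}(X\beta + u)$ with a row-standardized nearest-neighbor weight matrix; the values of $\rho$ and $\beta$ and the 5-neighbor $W$ are illustrative choices, not estimates from this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
coords = rng.uniform(0, 10, size=(n, 2))

# row-standardized weight matrix from each point's 5 nearest neighbors
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
np.fill_diagonal(d, np.inf)
W = np.zeros((n, n))
for i in range(n):
    W[i, np.argsort(d[i])[:5]] = 1.0 / 5.0

X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta, rho = np.array([2.0, 0.5]), 0.6          # illustrative values only
u = rng.normal(scale=0.2, size=n)

# y = (I - rho W)^{-1} (X beta + u): each y_i mixes neighbors' X and u,
# so the implied surface varies smoothly over space
y = np.linalg.solve(np.eye(n) - rho * W, X @ beta + u)
```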
After calculating price indexes, the next step is to determine whether appreciation rates are higher near employment subcenters. Existing data sources allow subcenters to be identified decennially. The sales price data set covers January 1983 to December 1999. Thus, 1980 and 1990 subcenter sites can potentially be used to explain home price appreciation rates. I use the subcenter sites for 1990 because the nature of a repeat sales data set implies that the price index is less accurate at the beginning of the time spanned by the sales. Since houses seldom sell more frequently than once per year, nearly all sales from 1983 are initial sales. In contrast, sales from 1990 consist of homes that sold first in earlier years and again in 1990, as well as homes that sold first in 1990 and again in subsequent years. The repeat sales estimator thus has a startup problem that leads to greater accuracy in later periods. Using 1990 as the date for the subcenter sites, the natural choice for calculating appreciation rates also begins at this time. I calculate implied home price appreciation rates for 1990–1998 using the LWR repeat sales price index estimates. I do not use estimates from 1999 because the repeat sales estimator has a terminal problem analogous to the startup problem: the 1999 sales are relatively infrequent because they consist nearly entirely of second sales. The period 1990–1998 is an interesting one in Chicago. In McMillen (2003), I find that prices rose rapidly in the City of Chicago during this period, and that appreciation rates were higher closer to the CBD. The question addressed next is whether an analogous result holds for appreciation rates near employment subcenters outside of the city center.
5. DATA

The Illinois Department of Revenue provided data on every sale of a single-family residential home in the Chicago metropolitan area from 1983 to 1999 (with the exception of 1992, which proved unavailable for unknown reasons). Relevant data include the parcel identification number (PIN), sales price, and month of sale. The PIN identifies a home's location by the legal description – tier, range, section, and quarter section. Tier and range combinations identify townships, which comprise 36 sections. Thus, a quarter section is 1/2 × 1/2 mile, or a quarter square mile. Although it would be possible to estimate price appreciation rates using quarter-section centroids as target points, I chose to use sections as the target points for two reasons. First, small sample sizes at the quarter-section level would lead to little geographic variation in appreciation rates across neighboring quarter sections because the LWR estimation procedure places some weight on nearby observations, including those from other quarter sections. Second, LWR is a computer-intensive estimation procedure, and 3,889 sections represent an ample number of target points for the regressions. I used a GIS program to obtain geographic coordinates for the section centroids. Next, I calculated distances to the traditional Chicago city center (at the intersection of State and Madison Streets), the entrances to O'Hare and Midway Airports, Lake Michigan, the nearest highway entrance, the nearest commuter rail station, the nearest rapid transit stop (known locally as the "el"), and the nearest subcenter. I used discrete measures of distance for rapid transit stops because their effect tends to be highly localized since people usually walk to them, and the discrete variables impose the restriction that the effect of proximity falls to zero beyond the last interval. The Illinois Department of Revenue does not collect data on any housing characteristics, such as lot size or floor area. This omission makes it impossible to estimate hedonic price functions. As discussed in the previous section, the repeat sales estimator has important advantages over the hedonic approach for estimating home price appreciation rates. The repeat sales approach is feasible for this data set because the PIN is a unique identifier for each home, making it possible to track the time path of home sales. There are 223,693 repeat sales pairs in the sample – 39,324 in Chicago, 76,861 in the rest of Cook County, 42,761 in DuPage, 14,626 in Kane, 24,426 in Lake, 9,962 in McHenry, and 15,733 in Will.
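Because the PIN uniquely identifies each home, repeat sales pairs can be assembled by sorting each home's transactions in time and pairing consecutive sales. The helper below is a minimal sketch with an assumed input format.

```python
from collections import defaultdict

def build_pairs(records):
    """Group sales by PIN and pair consecutive transactions of the same home.
    `records` is an iterable of (pin, month, log_price) tuples; the output
    tuples are (log_price_first, log_price_second, month_first, month_second)."""
    by_pin = defaultdict(list)
    for pin, month, logp in records:
        by_pin[pin].append((month, logp))
    pairs = []
    for sales in by_pin.values():
        sales.sort()                       # chronological order within a home
        for (ms, ps), (mt, pt) in zip(sales, sales[1:]):
            pairs.append((ps, pt, ms, mt))
    return pairs
```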
Table 1. Descriptive Statistics for Section Data.

| Variable | Mean | Std. Dev. | Minimum | Maximum |
| --- | --- | --- | --- | --- |
| Distance from Chicago City Center | 33.777 | 14.048 | 0.750 | 68.891 |
| Distance from O'Hare Airport | 27.400 | 12.150 | 0.633 | 55.864 |
| Distance from Midway Airport | 30.791 | 14.817 | 0.571 | 68.884 |
| Distance from Lake Michigan | 23.209 | 12.745 | 0.000 | 51.147 |
| Distance from highway entrance | 5.051 | 4.605 | 0.258 | 23.812 |
| Distance from commuter rail station | 5.232 | 4.738 | 0.241 | 23.473 |
| 0–0.25 miles from rapid transit stop | 0.001 | 0.028 | 0 | 1 |
| 0.25–0.50 miles from rapid transit stop | 0.018 | 0.133 | 0 | 1 |
| 0.50–0.75 miles from rapid transit stop | 0.014 | 0.116 | 0 | 1 |
| 0.75–1.0 miles from rapid transit stop | 0.008 | 0.090 | 0 | 1 |
| Distance from nearest subcenter | 15.284 | 10.540 | 0 | 42.894 |
| Chicago | 0.075 | 0.264 | 0 | 1 |
| Rest of Cook County | 0.192 | 0.394 | 0 | 1 |
| DuPage County | 0.088 | 0.283 | 0 | 1 |
| Kane County | 0.139 | 0.346 | 0 | 1 |
| Lake County | 0.127 | 0.332 | 0 | 1 |
| McHenry County | 0.160 | 0.367 | 0 | 1 |
| Will County | 0.219 | 0.414 | 0 | 1 |
| Estimated annual appreciation rate | 3.361 | 1.045 | −0.350 | 9.141 |

Note: The data set includes 3,889 sections. Distances are measured from the section midpoints.
Table 1 presents descriptive statistics for the explanatory variables used in the estimated models. On average, the sections are 34 miles from the city center, 27 miles from O’Hare Airport, 31 miles from Midway Airport, 23 miles from Lake Michigan, 5 miles from a highway entrance, 5 miles from a commuter rail station, and 15 miles from the nearest subcenter. Four percent of the sections are within a mile of a rapid transit stop; most of these sections are within the Chicago city limits. Chicago has 7.5% of the sections, compared with 19.2% in the rest of Cook County. The remaining sections are in DuPage, Kane, Lake, McHenry, and Will Counties. Although some of the results presented in the next section overlap with results from McMillen (2003), the data sets are entirely different. The data set used here includes an additional year, 1999. More importantly, it also includes suburban data. The McMillen (2003) data set is from a commercial data source, whereas the data set used here comes entirely from the Illinois Department of Revenue. The key advantage of the current data set is that suburban data allow us to analyze subcenters, which are primarily a suburban phenomenon.
6. PRICE INDEXES

The LWR repeat sales estimator produces 3,889 separate price index estimates – far too many to summarize graphically. Figure 2 presents estimates for the City of Chicago and the suburbs. Two separate regressions, one for the city and one for the suburbs, produce these price indexes. The indexes are not adjusted for inflation. The standard repeat sales estimator with monthly indicator variables produces the jagged lines. The Fourier version of the repeat sales estimator leads to the smooth lines. For the Fourier estimates, I set the expansion length to $Q = 3$, which implies that the explanatory variables in Eq. (4) are $z - z^s$, $z^2 - (z^s)^2$, $\sin(z) - \sin(z^s)$, $\sin(2z) - \sin(2z^s)$, $\sin(3z) - \sin(3z^s)$, and the corresponding cosine terms. I chose the expansion length by comparing the Fourier estimates for various expansion lengths to the standard repeat sales estimates shown in Fig. 2. Larger values of $Q$ produce very similar estimates. The City of Chicago price index is consistent with the results presented in McMillen (2003). The price index rises sharply during the 1980s, levels off in the early 1990s, and then rises sharply again toward the end of the decade. The suburban price index provides some visual evidence of the benefits of geographic disaggregation. Prices rose much less rapidly in the suburbs during the 1990s. Based on the Fourier estimates, the Chicago price index rose from 0 in January 1983 to 1.075 in December 1999, i.e. nominal prices rose by 107.5% during this 17-year span. In contrast, the last value of the suburban price index implies that suburban home prices rose by only 84.2% during this time.
Fig. 2. City and Suburban Price Indexes.
Fig. 3. Average Annual Appreciation Rates in House Prices, 1990–1998: Fourier Estimates by Region. Note: The values are the annual appreciation rates implied by the Fourier price index estimates for January 1990 and January 1998.
I use the Fourier results to calculate average annual home price appreciation rates from January 1990 to January 1998. Letting $p_1$ and $p_2$ represent the index values for these two dates, the average annual appreciation rate is $100 \times ((1 + p_2 - p_1)^{1/8} - 1)$. Figure 3 shows the average annual appreciation rates. The suburban price index is disaggregated by estimating separate Fourier repeat sales models for each county, using an expansion length of $Q = 3$. The average annual appreciation rate of 4.5% in the city compares with 3.0% in the suburbs. The highest suburban annual appreciation rate is 3.3% in Will County. The average annual appreciation rate is 3.1% in suburban Cook County, and it is about 2.9% in DuPage, Kane, Lake, and McHenry Counties. In McMillen (2003), I found that annual appreciation rates within Chicago were higher near the city center. The new data set allows me to determine whether estimates vary across suburban locations. With 223,693 repeat sales pairs, estimating separate LWR models for 3,889 sections proved time-consuming. To reduce the computational burden, I restrict the samples to observations within the same county as the target sections. For example, for suburban Cook County, I use only the 76,861 observations within the region to estimate the LWR models, and I estimate a separate regression using each of these observations in turn as the target point. I set the window size to 5% of the available observations within each region, and I use an expansion length of $Q = 3$ for all models.
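As a quick check of the appreciation-rate formula, the snippet below uses illustrative index values (not the estimates themselves) chosen so that the implied rate matches the 4.5% city figure reported above.

```python
# Annual rate implied by index values p1 (Jan 1990) and p2 (Jan 1998).
# Illustrative numbers: a log-index gain of 0.422 over 8 years gives ~4.5%.
p1, p2 = 0.530, 0.952
rate = 100 * ((1 + p2 - p1) ** (1 / 8) - 1)
print(round(rate, 1))   # 4.5
```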
As with other nonparametric estimators, LWR is somewhat sensitive to boundary effects, and restricting the data set in this way leads to discontinuities at county lines. For example, the estimated value for a Cook County observation at the boundary with DuPage County is based on the closest observations from Cook County even though some DuPage sales are likely to be closer. These discontinuities would disappear if the data were pooled across counties because the weights would be based only on distance, not on jurisdiction. However, the discontinuities may not simply be artifacts of the estimation procedure. Tax rates vary across counties, and Cook County systematically under-assesses residential properties whereas assessment rates are fairly accurate in the other counties. Estimating separate models for each county may eliminate a bias that could arise if appreciation rates vary by county. The LWR estimation procedure produces 3,889 estimates of the coefficients in Eq. (4), which in turn imply average annual appreciation rates from 1990 to 1998. The last row of Table 1 summarizes the results. Annual appreciation rates average 3.36% across the 3,889 sections, with a range of −0.35% to 9.14%. Figure 4 is a map of the results.
Fig. 4. Average Annual Appreciation Rates in House Prices, 1990–1998: Locally Weighted Regression Estimates with Section Target Points.
Appreciation rates are highest near the CBD and along the lakefront within the City of Chicago. Another pocket of high appreciation rates is southwest of Chicago, in northern Will County. Appreciation rates are lower elsewhere, although they do exhibit some geographic variation. The estimation procedure leads to moderate discontinuities at county lines, particularly between DuPage and Kane Counties, west of the city. Overall, the estimation procedure appears to provide a reasonable amount of smoothing, and it is clear that appreciation rates are far from uniform across the metropolitan area.
7. DETERMINANTS OF THE SPATIAL VARIATION IN APPRECIATION RATES

Although Fig. 4 shows clearly that home price appreciation rates are highest near the city center, it is not clear whether other areas with high growth rates are near subcenters. In this section, I present the results of three empirical models that summarize patterns in the spatial variation in the appreciation rates. The models differ in the way they control for spatial heterogeneity. The first model simply uses dummy variables for the City of Chicago and each of the counties, with suburban Cook County forming the base. The second method includes township fixed effects. Apart from occasional geographic irregularities, townships are typically 6 × 6 grids of sections, comprising 36 square miles. Although townships in the Chicago area levy some property taxes, the main reason for including township fixed effects is to control for omitted spatial effects. The approach is similar to the random effects approach used by Case (1992), who assumes that each district shares a common error term. The county and township fixed effects approaches can be estimated with ordinary regressions. Although the third modeling approach is the most complicated, it also imposes the least structure. I use a semiparametric approach to control for spatial variation in growth rates not accounted for by the other explanatory variables. Thus, the model becomes $y = X\beta + f(d_1, d_2) + u$, where $d_1$ and $d_2$ are the geographic coordinates of the section centroids and $f(\cdot)$ is a smooth but unspecified function. I use Robinson's (1988) estimation procedure, again using LWR for the nonparametric estimates. Robinson's semiparametric estimator has two steps. The first stage uses a nonparametric estimator for regressions with $d_1$ and $d_2$ as the explanatory variables and $y$ and each variable in the matrix $X$ as the dependent variables. Write the residuals from these regressions as $e_y$ and $e_X$. The second-stage linear regression of $e_y$ on $e_X$ provides consistent estimates of $\beta$. The usual standard errors from the second-stage regression are consistent estimates of the standard errors of $\hat\beta$.
Table 2. Regression Analysis of Estimated Annual Home Price Appreciation Rates.

| Variable | Model 1: OLS | Model 2: OLS with Township Fixed Effects | Model 3: Semiparametric |
| --- | --- | --- | --- |
| Constant | 3.163 (46.657) | | |
| Distance from Chicago City Center | −0.118 (9.583) | −0.175 (8.296) | −0.178 (7.923) |
| Distance from O'Hare Airport | 0.026 (5.893) | 0.039 (3.956) | 0.086 (8.053) |
| Distance from Midway Airport | 0.053 (6.683) | 0.081 (5.081) | 0.059 (3.705) |
| Distance from Lake Michigan | 0.033 (7.447) | 0.031 (3.347) | 0.114 (6.133) |
| Distance from highway entrance | 0.005 (1.353) | 0.005 (0.938) | 0.008 (1.630) |
| Distance from commuter rail station | −0.020 (4.861) | −0.034 (4.903) | −0.044 (7.242) |
| 0–0.25 miles from rapid transit stop | −0.564 (1.386) | −0.618 (2.170) | −0.583 (1.629) |
| 0.25–0.50 miles from rapid transit stop | 0.837 (8.167) | 0.743 (9.464) | 0.770 (8.298) |
| 0.50–0.75 miles from rapid transit stop | 0.512 (4.732) | 0.271 (3.442) | 0.405 (4.238) |
| 0.75–1.0 miles from rapid transit stop | 0.063 (0.485) | 0.309 (3.368) | 0.110 (0.972) |
| Distance from nearest subcenter | 0.087 (18.203) | 0.087 (9.635) | 0.026 (3.237) |
| Chicago | 1.253 (20.067) | 0.671 (11.308) | 0.672 (10.850) |
| DuPage County | 0.031 (18.203) | | |
| Kane County | −0.832 (13.994) | | |
| Lake County | −0.128 (13.994) | | |
| McHenry County | −0.634 (8.306) | | |
| Will County | −0.246 (4.893) | | |
| R² | 0.578 | 0.811 | |

Note: The dependent variable is the estimated annual appreciation rate calculated using the locally weighted Fourier repeat sales estimator with section midpoints as the target points. t-values are in parentheses. The data set includes 3,889 sections. Estimated coefficients for 115 townships are not reported for the specifications with fixed effects. The parametric portion of the semiparametric specification is reported in the table; the nonparametric portion is a function of distances east and north of the Chicago City Center.
A third-stage nonparametric regression of $y - X\hat\beta$ on $d_1$ and $d_2$ would provide consistent estimates of $f(d_1, d_2)$. However, there is no need for the third-stage regression if, as is the case here, the objective is simply to obtain estimates of $\beta$ that control in a general way for omitted spatial effects. I use a 25% window size for all calculations. The results are not sensitive to the choice of window size, and 25% is a reasonable choice for a model with 3,889 observations.
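A rough sketch of the two-step estimator may be useful. For brevity it smooths with a local tri-cube mean rather than the full locally weighted regressions used in the text, so it is a simplified stand-in rather than a replication.

```python
import numpy as np

def local_smooth(d, V, window=0.25):
    """Nonparametric (tri-cube kernel) regression of each column of V on the
    coordinates d, evaluated at every observation (a local mean smoother)."""
    n = len(V)
    fit = np.zeros_like(V, dtype=float)
    k = max(int(window * n), 2)
    for i in range(n):
        dist = np.linalg.norm(d - d[i], axis=1)
        d_max = np.sort(dist)[k - 1]
        w = np.where(dist <= d_max, (1 - (dist / d_max) ** 3) ** 3, 0.0)
        fit[i] = (w @ V) / w.sum()
    return fit

def robinson_beta(y, X, d, window=0.25):
    """Robinson (1988) two-step: remove the spatial smooth from y and from
    each column of X, then regress residuals on residuals by OLS."""
    ey = y - local_smooth(d, y[:, None], window).ravel()
    eX = X - local_smooth(d, X, window)
    beta, *_ = np.linalg.lstsq(eX, ey, rcond=None)
    return beta
```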
Table 2 presents the results for the three models explaining the spatial variation in annual home price appreciation rates. The results are robust across estimation procedures. As in McMillen (2003), Table 2 indicates that annual appreciation rates are higher closer to the city center. The regressions also indicate that appreciation rates are higher within the City of Chicago than elsewhere. Appreciation rates are higher farther from airports, away from Lake Michigan, close to commuter rail stations, and in locations in the interval 0.25–1.0 miles from rapid transit stops. Distance from the nearest highway entrance does not have a significant effect on appreciation rates. Table 2 also shows that home price appreciation rates increase with distance from the nearest subcenter. The models with fixed effects imply that appreciation rates rise by 0.087 percentage points with each mile from the nearest subcenter. This estimate falls to 0.026 percentage points under the semiparametric estimator. However, it is the positive sign that is critical. Chicago's subcenters have been expanding geographically and in the number and percentage of jobs located within them. If subcenter workers attempt to reduce their commutes by living near their jobs, we would expect to find that home prices are appreciating more rapidly near subcenters. Instead, we find that prices are appreciating most rapidly near the traditional city center. These results are consistent with the findings of McMillen (2003) and McMillen and McDonald (1997), as well as the population density studies of Baumont et al. (2004) and McMillen and Lester (2003). Although jobs are increasingly likely to be located in subcenters, workers do not appear to be bidding more for homes closer to their workplaces. Suburban commutes are not as difficult and time-consuming as trips to the traditional city center. When workers take jobs in subcenters, they are willing to endure potentially lengthy commutes. Thus, home price appreciation rates have not increased more rapidly in areas near subcenters. The results are also consistent with the findings in McMillen (2003), which suggest that areas near downtown Chicago are experiencing a housing boom. The traditional city center added new jobs during the 1990s. But commutes to the CBD are both time-consuming and expensive. Demographic changes toward two-income households without children have increased the demand for homes near the city center. The result of these changes is a rapid increase in home prices near the CBD. Although Chicago is still a decentralized urban area, the city center is enjoying a residential rebound.
8. CONCLUSION

Employment subcenters offer firms many of the advantages of traditional city center locations with lower rents and potentially lower wages since workers typically enjoy shorter commutes. Some subcenters rival the CBD in total employment and industry mix. Many recent studies show that subcenters have highly significant effects on employment densities. Subcenters have ambiguous effects on residential markets. On one hand, an increase in subcenter employment may lead to increased home prices and population density in nearby neighborhoods as workers attempt to reduce the time spent commuting to work.
On the other hand, workers may take advantage of subcenter job sites to move to distant locations with inexpensive housing. Although suburban traffic jams are common, the marginal cost of commuting an additional mile is far lower for suburban job locations than for a job in the CBD. In this paper, I find that home price appreciation rates for 1990–1998 were lower in areas near subcenters in the Chicago metropolitan area. Appreciation rates were highest near the traditional city center. The results suggest that Chicago is not truly a polycentric urban area. Polycentric urban areas have multiple centers that resemble a traditional city – a central employment district surrounded by densely populated residential areas. The densely populated area of Chicago is the traditional city center. Although employment densities are high and are increasing over time within Chicago's subcenters, suburban population densities are lower and home prices are rising less rapidly near subcenters than elsewhere. The spatial structure of Chicago is represented best as an urban area that continues to be dominated by its traditional central business district, albeit with ample suburban employment that is concentrated in subcenters.
REFERENCES

Anas, A., Arnott, R., & Small, K. A. (1998). Urban spatial structure. Journal of Economic Literature, 36, 1426–1464.
Anselin, L. (1995). Local indicators of spatial association – LISA. Geographical Analysis, 27, 93–115.
Bailey, M. J., Muth, R. F., & Nourse, H. O. (1963). A regression method for real estate price index construction. Journal of the American Statistical Association, 58, 933–942.
Baumont, C., Ertur, C., & LeGallo, J. (2004). Spatial analysis of employment and population density: The case of the agglomeration of Dijon, 1999. Geographical Analysis, 36.
Bender, B., & Hwang, H. (1985). Hedonic housing price indices and secondary employment centers. Journal of Urban Economics, 17, 90–107.
Case, A. (1992). Neighborhood influence and technological change. Regional Science and Urban Economics, 22, 491–508.
Case, B., Pollakowski, H. O., & Wachter, S. M. (1997). Frequency of transaction and house price modeling. Journal of Real Estate Finance and Economics, 14, 173–187.
Case, B., & Quigley, J. M. (1991). The dynamics of real estate prices. Review of Economics and Statistics, 73, 50–58.
Case, K. E., & Shiller, R. J. (1987). Prices of single-family homes since 1970: New indexes for four cities. New England Economic Review, 45–56.
Case, K. E., & Shiller, R. J. (1989). The efficiency of the market for single-family homes. American Economic Review, 79, 125–137.
Clapp, J. M., Giaccotto, C., & Tirtiroglu, D. (1991). Housing price indices based on all transactions compared to repeat subsamples. AREUEA Journal, 19, 270–285.
Cleveland, W. S., & Devlin, S. J. (1988). Locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association, 83, 596–610.
Craig, S. G., & Ng, P. T. (2001). Using quantile smoothing splines to identify employment subcenters in a multicentric urban area. Journal of Urban Economics, 49, 100–120.
Follain, J. R., & Calhoun, C. A. (1997). Constructing indices of the price of multifamily properties using the 1991 Residential Finance Survey. Journal of Real Estate Finance and Economics, 14, 235–255.
Fu, Y., & Somerville, S. T. (2001). Site density restrictions: Measurement and empirical analysis. Journal of Urban Economics, 49, 404–423.
Gallant, A. R. (1981). On the bias in flexible functional forms and an essentially unbiased form: The Fourier flexible form. Journal of Econometrics, 15, 211–245.
Gallant, A. R. (1982). Unbiased determination of production technologies. Journal of Econometrics, 20, 285–323.
Gatzlaff, D. H., & Haurin, D. R. (1997). Sample selection bias and repeat-sales index estimates. Journal of Real Estate Finance and Economics, 14, 33–50.
Giuliano, G., & Small, K. A. (1991). Subcenters in the Los Angeles region. Regional Science and Urban Economics, 21, 163–182.
Gordon, P., Richardson, H. W., & Wong, H. L. (1986). The distribution of population and employment in a polycentric city: The case of Los Angeles. Environment and Planning, A18, 161–173.
Heikkila, E., Gordon, P., Kim, J. I., Peiser, R. B., & Richardson, H. W. (1989). What happened to the CBD-distance gradient?: Land values in a polycentric city. Environment and Planning, A21, 221–232.
Kiel, K. A., & Zabel, J. E. (1997). Evaluating the usefulness of the American Housing Survey for creating housing price indices. Journal of Real Estate Finance and Economics, 14, 189–202.
Mark, J. H., & Goldberg, M. A. (1984). Alternative housing price indices: An evaluation. AREUEA Journal, 12, 30–49.
McDonald, J. F. (1987). The identification of urban employment subcenters. Journal of Urban Economics, 21, 242–258.
McDonald, J. F., & Prather, P. J. (1994). Suburban employment centers: The case of Chicago. Urban Studies, 31, 201–218.
McMillen, D. P. (2001). Nonparametric employment subcenter identification. Journal of Urban Economics, 50, 448–473.
McMillen, D. P. (2003). Neighborhood house price indexes in Chicago: A Fourier repeat sales approach. Journal of Economic Geography, 3, 57–73.
McMillen, D. P., & Lester, T. W. (2003). Evolving subcenters: Employment and population densities in Chicago, 1970–2020. Journal of Housing Economics, 12, 60–81.
McMillen, D. P., & McDonald, J. F. (1997). A nonparametric analysis of employment density in a polycentric city. Journal of Regional Science, 37, 591–612.
McMillen, D. P., & McDonald, J. F. (2000). Employment subcenters and subsequent real estate development in suburban Chicago. Journal of Urban Economics, 48, 135–157.
McMillen, D. P., & Smith, S. C. (2003). The number of subcenters in large urban areas. Journal of Urban Economics, 53, 321–338.
Meese, R., & Wallace, N. (1991). Nonparametric estimation of dynamic hedonic price models and the construction of residential housing price indices. AREUEA Journal, 19, 308–332.
Mills, E. S. (1972). Studies in the structure of the urban economy. Washington, DC: Resources for the Future.
Muñiz, I., Galindo, A., & García, M. A. (2003). Cubic spline population density functions and satellite city delimitation: The case of Barcelona. Urban Studies, 40, 1303–1321.
Muth, R. F. (1969). Cities and housing. Chicago: University of Chicago Press.
Pavlov, A. D. (2000). Space-varying regression coefficients: A semi-parametric approach applied to real estate markets. Real Estate Economics, 28, 249–283.
Richardson, H. W., Gordon, P., Jun, M., Heikkila, E., Peiser, R., & Dale-Johnson, D. (1990). Residential property values, the CBD, and multiple nodes: Further analysis. Environment and Planning, A22, 829–833.
Robinson, P. M. (1988). Root-N-consistent semiparametric regression. Econometrica, 56, 931–954.
Shearmur, R. G., & Coffey, W. J. (2002). A tale of four cities: Intrametropolitan employment distribution in Toronto, Montreal, Vancouver, and Ottawa-Hull, 1981–1996. Environment and Planning, A34, 575–598.
Small, K. A., & Song, S. (1994). Population and employment densities: Structure and change. Journal of Urban Economics, 36, 292–313.
Thibodeau, T. G. (1989). Housing price indexes from the 1974–83 SMSA Annual Housing Surveys. AREUEA Journal, 17, 110–117.
SEARCHING FOR HOUSING SUBMARKETS USING MIXTURES OF LINEAR MODELS

M. D. Ugarte, T. Goicoa and A. F. Militino

ABSTRACT

This paper presents a mixture of linear models (or hedonic regressions) for defining housing submarkets. Two different mixture models are considered: the first model allows all the regression coefficients to vary among the clusters (random coefficients); the second model allows only the intercept term to change (random intercept). The model with a random intercept can be seen as a linear mixed model where the random effects distribution is estimated via non-parametric maximum likelihood (NPML). The models are illustrated using a real data set of 293 properties in Pamplona, Spain. These mixture models provide a classification of the dwellings into homogeneous groups that determine the structure of the submarkets.

1. INTRODUCTION

Housing submarkets are typically defined, a priori, depending on socio-economic characteristics of geographical areas, local government boundaries, market areas as perceived by real estate agencies, or by the physical characteristics of the dwellings (see, for instance, Adair et al., 1996; Allen et al., 1995; Harsman & Quigley, 1995). The information used to create the submarkets is determined by some prior view of
1. INTRODUCTION Housing submarkets are typically defined, a priori, depending on socio-economic characteristics of geographical areas, local government boundaries, market areas as perceived by real estate agencies, or by the physical characteristics of the dwellings (see, for instance, Adair et al., 1996; Allen et al., 1995; Harsman & Quigley, 1995). The information used to create the submarkets is determined by some prior view of Spatial and Spatiotemporal Econometrics Advances in Econometrics, Volume 18, 259–276 Copyright © 2004 by Elsevier Ltd. All rights of reproduction in any form reserved ISSN: 0731-9053/doi:10.1016/S0731-9053(04)18008-0
what is important. An alternative approach is to let the data determine the structure of the subgroups. There is a small body of research that attempts to use more systematic methods for defining submarkets. However, the statistical techniques used to date have been mainly exploratory. Some authors use principal component analysis, factor analysis, cluster analysis, or a combination of all three (see, for example, Bourassa et al., 1999; Goetzmann & Wachter, 1995; Hoesli & Macgregor, 1997). In this paper, we use model-based techniques to classify dwellings. We highlight the important role of finite mixture models (McLachlan & Peel, 2000) in modelling heterogeneous data, such as dwelling selling prices, with a focus on identifying distinct subgroups of observations which define submarkets. The detection of these subgroups may be of great importance to real estate companies, banks, or national agencies who set values for the prices of dwellings sharing similar characteristics. The problem of modelling dwelling selling prices is still a challenge for housing analysts. Hedonic market models (Griliches, 1971; Lancaster, 1966; Rosen, 1974) have been used for this purpose as a common modelling approach. These models consider the features of an asset to be the determinants of its total value. Consequently, a dwelling's price is estimated by regressing it on the number of rooms, the total living area, the number of bathrooms, and location characteristics. However, it is very difficult to account completely for all the relevant location characteristics, so the residuals are often found to be autocorrelated. To overcome this situation, different spatial linear models have been proposed (Anselin, 1988; Cressie, 1993; Pace & Lesage, 2002), but there are still some unclear aspects of these models, such as the specification of the covariance structure or the definition of the weight matrix (Militino et al., 2004). The alternative model we propose is not only able to model dwelling selling prices, but it also allows for a simultaneous classification of the dwellings. Selecting the optimal number of subgroups, and deciding whether the coefficients vary among the clusters, is addressed in the context of the proposed model. Finally, we provide the subgroups and the parameter estimates of the hedonic regressions jointly. The finite mixture of linear models is estimated by maximum likelihood via the EM algorithm (Dempster et al., 1977), which is accomplished with standard statistical software. The proposed technique is illustrated by analyzing the year-2000 market prices of a set of 293 second-hand dwellings in Pamplona, Spain. The article is organized as follows: Section 2 defines the mixture of linear models and presents the EM algorithm, the calculation of the standard errors, the model selection criteria, and the fitting equations. Section 3 illustrates the results through the analysis of used dwellings' market prices in Pamplona, Spain. The paper ends with a discussion.
2. THE MIXTURE MODEL AND THE EM ALGORITHM

Data heterogeneity may indicate that some kind of grouping exists among the sample observations; as a consequence, an ordinary regression model would not fit the data properly. In this section, we propose the use of a mixture of linear models that provides both a classification of the observations into groups (the groups correspond to the mixture components) and the estimates of the regression parameters in each group. Comprehensive references on mixture models include Titterington et al. (1985) and McLachlan and Peel (2000). Let $y_1, \ldots, y_n$ denote an observed random sample of size $n$. The $G$-component mixture of linear models is defined by

$$y_j \sim f(y_j, \theta) = \sum_{g=1}^{G} \pi_g f_g(y_j; \mu_{gj}, \sigma), \qquad (1)$$

where $f_g(y_j; \mu_{gj}, \sigma)$, $j = 1, \ldots, n$, denotes the probability density function of a univariate normal distribution with mean $\mu_{gj}$ and variance $\sigma^2$. Here $\mu_{gj} = x_j'\beta_g + \alpha_g$, where $\alpha_g$ is the intercept parameter for the $g$-th linear component of the mixture and $x_j'$ is the vector of explanatory variables for the $j$-th observation. Let $\pi_1, \ldots, \pi_G$ denote the proportions in which these $G$ normal component models occur in the mixture; therefore $\sum_{g=1}^{G}\pi_g = 1$. Here $\theta$ is the vector of all unknown parameters, partitioned as $\theta = (\pi', \alpha', \beta', \sigma)'$, where $\pi = (\pi_1, \ldots, \pi_{G-1})'$, $\alpha = (\alpha_1, \ldots, \alpha_G)'$, and $\beta = (\beta_1', \ldots, \beta_G')'$. $\beta$ is the vector of regression coefficients, which are assumed, a priori, to be distinct in each cluster, with $\beta_g = (\beta_{g1}, \ldots, \beta_{gp})'$, $g = 1, \ldots, G$. The likelihood function of the data is given by

$$L(\theta) = \prod_{j=1}^{n}\sum_{g=1}^{G} \pi_g f_g(y_j; \beta, \alpha_g, \sigma).$$

Computation of the maximum likelihood estimator of $\theta$ by direct consideration of the log-likelihood function requires solving the equation

$$\frac{\partial \log L(\theta)}{\partial \theta} = 0. \qquad (2)$$
In this paper, we propose two versions of Model (1). The first one is given by

$$y_j \sim \sum_{g=1}^{G} \pi_g f_g(y_j; x_j'\beta_g + \alpha_g, \sigma) \qquad (3)$$
with all the coefficients varying among the clusters. The second one corresponds to the case where only the intercept term varies among the clusters, and it is given by

$$y_j \sim \sum_{g=1}^{G} \pi_g f_g(y_j; x_j'\beta + \alpha_g, \sigma) \qquad (4)$$
where $\beta = (\beta_1, \ldots, \beta_p)'$ is the vector of regression coefficients common to all clusters. Note that $\sigma$ is assumed to be the same in all of the components. This assumption prevents an unbounded likelihood function (Kiefer & Wolfowitz, 1956). Several authors have considered the fitting of mixture models with equal variances (see, for instance, Basford et al., 1997). Equation (2) can be solved by direct application of the EM algorithm (Dempster et al., 1977). In the framework of the EM algorithm, the observed data vector $y = (y_1, \ldots, y_n)'$ is considered to be incomplete, since we do not know to which component each observation belongs. Thus, we introduce associated component-label vectors $z_1, \ldots, z_n$, where $z_j$ is a $G$-dimensional vector defined as

$$z_{gj} = (z_j)_g = \begin{cases} 1 & \text{if } y_j \text{ arises from the } g\text{th component} \\ 0 & \text{otherwise,} \end{cases}$$

so that $\sum_{g=1}^{G} z_{gj} = 1$. The complete-data likelihood is expressed as

$$L_c(\theta) = \prod_{j=1}^{n}\prod_{g=1}^{G}\left(\pi_g f_g(y_j; \beta, \alpha_g, \sigma)\right)^{z_{gj}}$$

and the expanded expression of the log-likelihood by

$$\log L_c(\theta) = -\frac{n}{2}\log 2\pi - n\log\sigma + \sum_{j=1}^{n}\sum_{g=1}^{G} z_{gj}\log\pi_g - \frac{1}{2\sigma^2}\sum_{j=1}^{n}\sum_{g=1}^{G} z_{gj}(y_j - \mu_{gj})^2, \qquad (5)$$

where $\mu_{gj} = x_j'\beta_g + \alpha_g$ or $\mu_{gj} = x_j'\beta + \alpha_g$ according to Model (3) or (4), respectively. The EM algorithm proceeds iteratively in two steps. In the first step, a "guess" is made regarding the grouping of the clusters; estimation based on that grouping is then conducted. The E-step requires the calculation of the expectation of the complete-data log likelihood given the observed data $y$, using the current fit of $\theta$.
As the complete-data log likelihood is linear in the variables $z_{gj}$, the E-step on the $(k+1)$ iteration only requires the conditional expectation of $Z_{gj}$ given the data $y$, where $Z_{gj}$ is the random variable corresponding to $z_{gj}$, expressed by

$$ \tau_{gj}^{(k+1)} = E[Z_{gj} \mid y] = \frac{\pi_g^{(k)} f_g(y_j \mid \alpha_g^{(k)}; \beta^{(k)}, \sigma^{(k)})}{\sum_{h=1}^{G} \pi_h^{(k)} f_h(y_j \mid \alpha_h^{(k)}; \beta^{(k)}, \sigma^{(k)})}. $$

This quantity represents the posterior probability that the observed value $y_j$ belongs to the $g$th component of the mixture. The M-step in the $(k+1)$ iteration requires the global maximization of

$$ \log L_c^{(k+1)}(\psi) = -\frac{n}{2} \log 2\pi - n \log \sigma + \sum_{j=1}^{n} \sum_{g=1}^{G} \tau_{gj}^{(k+1)} \log \pi_g - \frac{1}{2\sigma^2} \sum_{j=1}^{n} \sum_{g=1}^{G} \tau_{gj}^{(k+1)} (y_j - \mu_{gj})^2, \qquad (6) $$

which results from substituting the expected values of the $z_{gj}$ variables for the variables themselves in Eq. (5). Thus, Eq. (6) is common to both Model (3) and Model (4). Estimating parameters in Model (3) leads to the following equations:

$$ \frac{\partial \log L_c^{(k+1)}(\psi)}{\partial \pi_g} = \frac{\sum_{j=1}^{n} \tau_{gj}^{(k+1)}}{\pi_g} - \frac{\sum_{j=1}^{n} \tau_{Gj}^{(k+1)}}{1 - (\pi_1 + \cdots + \pi_{G-1})} = 0, \qquad g = 1, \ldots, G-1, \qquad (7) $$

$$ \frac{\partial \log L_c^{(k+1)}(\psi)}{\partial \beta_g} = \frac{1}{\sigma^2} \sum_{j=1}^{n} \tau_{gj}^{(k+1)} (y_j - \mu_{gj}) x_j = 0, \qquad g = 1, \ldots, G, \qquad (8) $$

$$ \frac{\partial \log L_c^{(k+1)}(\psi)}{\partial \alpha_g} = \frac{1}{\sigma^2} \sum_{j=1}^{n} \tau_{gj}^{(k+1)} (y_j - \mu_{gj}) = 0, \qquad g = 1, \ldots, G, \qquad (9) $$

$$ \frac{\partial \log L_c^{(k+1)}(\psi)}{\partial \sigma} = \frac{-n}{\sigma} + \frac{1}{\sigma^3} \sum_{j=1}^{n} \sum_{g=1}^{G} \tau_{gj}^{(k+1)} (y_j - \mu_{gj})^2 = 0. \qquad (10) $$

Equations (7) and (10) give $\hat{\pi}_g^{(k+1)} = \sum_{j=1}^{n} \tau_{gj}^{(k+1)}/n$ and $\hat{\sigma}^{(k+1)} = \{ \sum_{j=1}^{n} \sum_{g=1}^{G} \tau_{gj}^{(k+1)} (y_j - \mu_{gj})^2 / n \}^{1/2}$, respectively. It can be seen that, for each $g = 1, \ldots, G$, Eqs (8) and (9) have the same form as the corresponding
equations arising from a single linear model fitted to the responses $y_1, \ldots, y_n$ with prior weights $\tau_g = (\tau_{g1}, \ldots, \tau_{gn})'$. Consequently, each set of parameters $(\alpha_g, \beta_g')$ can be obtained by fitting a single linear model with the corresponding weights $\tau_g$. Estimating parameters in Model (4) leads to the same Eqs (7), (9) and (10), but Eq. (8) becomes

$$ \frac{\partial \log L_c^{(k+1)}(\psi)}{\partial \beta} = \frac{1}{\sigma^2} \sum_{j=1}^{n} \sum_{g=1}^{G} \tau_{gj}^{(k+1)} (y_j - \mu_{gj}) x_j = 0. \qquad (11) $$
We can still solve Eqs (9) and (11) using the iteratively re-weighted least-squares algorithm for a single linear model. The double summation over $j$ and $g$ can be handled by expanding the data vector to length $Gn$ and replicating each original observation $(y_j, x_j')'$ $G$ times. Model fitting is then similar to that of a single sample of length $Gn$ with prior weights $\tau = (\tau_{11}, \ldots, \tau_{1n}, \ldots, \tau_{G1}, \ldots, \tau_{Gn})'$. Although there are more efficient methods for solving Eqs (9) and (11), the main advantage of this approach is that it is easily carried out using standard statistical software with a linear model (LM) fitting procedure. Aitkin (1996, 1999) proposes as initial estimates for $\beta$ those values obtained from the ordinary LM fit, and for $\alpha_g$ the standard normal values $z_k$ from Gaussian quadrature techniques. As an alternative, we propose to run the EM algorithm on the response variable alone, that is, to fit a simple finite mixture of normal distributions without considering explanatory variables, and to use the corresponding $\tau_{gj}$ values as initial prior weights for fitting a weighted LM to the expanded data. The same idea is applied when solving Eqs (8) and (9) for Model (3). However, in this latter situation we do not fit a linear model to the expanded data; instead, we fit a set of weighted linear models. These models provide initial estimates for the whole set of parameters.

The EM algorithm does not provide standard errors for the parameter estimates without additional calculations. Standard errors based on the information matrix could be calculated by the method of Louis (1982). Aitkin and Aitkin (1996) describe this method in the case of a two-component normal mixture. This technique would be appropriate when fitting Model (3). The calculation of standard errors when fitting Model (4) is carried out by the method of Dietz and Böhning (1995). According to these authors, the variance estimator of $\hat{\beta}_j$ is given by

$$ \mathrm{var}(\hat{\beta}_j) = \frac{\hat{\beta}_j^2}{2(\log L_{\beta_j = \hat{\beta}_j} - \log L_{\beta_j = 0})}, $$

where $L_{\beta_j = \hat{\beta}_j}$ is the likelihood function (2) evaluated at $\beta_j = \hat{\beta}_j$ and $L_{\beta_j = 0}$ is the likelihood function (2) evaluated at $\beta_j = 0$. The Dietz and Böhning method requires the fitting of a set of reduced models, where each variable is omitted from
the final model one at a time. These reduced models would often be fitted in any case to assess the significance of each variable.
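To make the two-step iteration concrete, the following is a minimal sketch of the EM algorithm for Model (3) in Python/NumPy. It is not the authors' implementation: the function name, the convergence tolerance, and the random soft-assignment initialization are illustrative assumptions (the text instead suggests initializing from a finite mixture of normals fitted to the response alone).

```python
# Minimal EM sketch for the G-component mixture of linear models (Model (3)),
# with a common sigma across components; illustrative only.
import numpy as np
from scipy.stats import norm

def em_mixture_lm(y, X, G, n_iter=200, tol=1e-8, seed=0):
    """y: (n,) responses; X: (n, q) design matrix including an intercept column.
    Returns mixing proportions pi, coefficient matrix B (G x q), sigma, tau."""
    n, q = X.shape
    rng = np.random.default_rng(seed)
    tau = rng.dirichlet(np.ones(G), size=n)           # (n, G) soft assignments
    B = np.zeros((G, q))
    sigma, old_ll = np.std(y), -np.inf
    for _ in range(n_iter):
        # M-step: Eqs (7)-(10). Each component is a weighted least-squares fit.
        pi = tau.mean(axis=0)                          # Eq. (7)
        for g in range(G):
            XtW = X.T * tau[:, g]
            B[g] = np.linalg.solve(XtW @ X, XtW @ y)   # Eqs (8)-(9)
        mu = X @ B.T                                   # (n, G) component means
        sigma = np.sqrt(np.sum(tau * (y[:, None] - mu) ** 2) / n)  # Eq. (10)
        # E-step: posterior probabilities tau_{gj}.
        dens = pi * norm.pdf(y[:, None], loc=mu, scale=sigma)
        ll = np.sum(np.log(dens.sum(axis=1)))          # observed-data log-likelihood
        tau = dens / dens.sum(axis=1, keepdims=True)
        if abs(ll - old_ll) < tol:
            break
        old_ll = ll
    return pi, B, sigma, tau
```

Each M-step reduces to a set of weighted least-squares fits, mirroring the observation above that Eqs (8) and (9) take the form of a single weighted linear model per component.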
2.1. Number of Components

One of the most difficult tasks when fitting a mixture model is assessing the number of components. In our real example, this is equivalent to determining the correct number of submarkets. A likelihood ratio test is not appropriate since the regularity conditions do not hold; therefore, the test statistic does not follow its usual $\chi^2$ distribution. In this paper, the Integrated Classification Likelihood criterion proposed by Biernacki et al. (1998), denoted by ICL-BIC (McLachlan & Peel, 2000, Chap. 6), is used. It is defined as

$$ -2 \log L(\hat{\psi}) + 2\,\mathrm{EN}(\hat{\tau}) + d \log n, $$

where $\mathrm{EN}(\tau) = -\sum_{j=1}^{n} \sum_{g=1}^{G} \tau_{gj} \log \tau_{gj}$ is the entropy of the fuzzy classification matrix $(\tau_{gj})_{g=1,\ldots,G;\, j=1,\ldots,n}$ and $d$ is the number of parameters. The model with the smallest value of the ICL-BIC is preferred over the others. This criterion is similar to the BIC, but it also penalizes the model for its complexity through the entropy term. Penalizing the log-likelihood with the entropy term favors mixtures leading to a clustering of the data with the greatest evidence. McLachlan and Ng (2000) compare this more recent criterion with classical procedures such as the BIC and the AIC in a simulation study. They conclude that the ICL-BIC criterion selects the correct number of components in all cases, while the BIC and the AIC criteria fail in some cases. Hawkins et al. (2001) also compare several criteria based on the observed log likelihood to criteria based on the complete likelihood.
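Computing the criterion from an EM fit is direct; a small sketch under the assumption that `ll` is the maximized observed-data log-likelihood and `tau` the n × G matrix of posterior probabilities (names are illustrative):

```python
# ICL-BIC as defined above: -2 log L(psi_hat) + 2 EN(tau_hat) + d log n.
import numpy as np

def icl_bic(ll, tau, d):
    n = tau.shape[0]
    # Entropy of the fuzzy classification matrix; clip avoids log(0).
    en = -np.sum(tau * np.log(np.clip(tau, 1e-300, None)))
    return -2.0 * ll + 2.0 * en + d * np.log(n)
```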
2.2. Fitted Values

We present two alternative possibilities for calculating the fitted values of the model. The first one takes into account all the information. That is, for each individual observation $y_j$ we use the whole set of $\tau_{gj}$ values (i.e., for each observation we use the posterior probability of belonging to each cluster). Then

$$ \hat{y}_j = \sum_{g=1}^{G} \tau_{gj} \hat{\mu}_{gj} = \sum_{g=1}^{G} \tau_{gj} (x_j' \hat{\beta}_g + \hat{\alpha}_g). \qquad (12) $$
The second possibility makes use of the final classification. For each individual observation $y_j$, we predict its value as the mean of the component whose posterior probability of belonging is the highest (i.e., the mean of the component with the largest $\tau_{gj}$). Then

$$ \hat{y}_j = x_j' \hat{\beta}_g + \hat{\alpha}_g. \qquad (13) $$
It is worth noting that under certain conditions Method (13) is not necessarily straightforward for calculating fitted values. One such case is when one observation belongs to a cluster with posterior probability lower than 0.5 and very close to the posterior probability of belonging to another cluster. When this occurs, Eq. (12) is more appropriate for calculating fitted values. As a result, we recommend the use of Eq. (13) only if the posterior probability of belonging to a cluster is high for every observation.
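Both rules are simple to evaluate from the EM output. A hedged sketch, assuming `B` is the G × q matrix of component coefficients (intercepts included in the design matrix `X`) and `tau` the matrix of posterior probabilities:

```python
# Fitted values for the mixture: Eq. (12) averages the component means with the
# posterior probabilities, while Eq. (13) uses only the most probable component.
import numpy as np

def fitted_values(X, B, tau, method="posterior"):
    mu = X @ B.T                        # (n, G) component means x_j' b_g + a_g
    if method == "posterior":           # Eq. (12)
        return np.sum(tau * mu, axis=1)
    g = np.argmax(tau, axis=1)          # Eq. (13): hard classification
    return mu[np.arange(len(g)), g]
```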
3. EXAMPLE

This section is devoted to illustrating the performance of Models (3) and (4) on a real data set consisting of the year-2000 selling prices of 293 used apartments in Pamplona, Spain. The original archive contains information about many characteristics of the apartments, including the number of rooms, the number of bathrooms, the total living area, the age, the maintenance state, and the location characteristics. Several spatial models were previously fitted to this data set by Militino et al. (2004). The explanatory variables considered here are the same, except that the spatial location, called zone, is now introduced using the postcode associated with each dwelling. A more detailed description is given in Table 1. Figure 1 plots the location of the dwellings according to the zone. Apart from five quantitative variables, including age, which also enters through a quadratic term, there are four categorical variables (out, dwelling, elev and zone) introduced as dummy variables, resulting in a total of fourteen exogenous variables. The endogenous variable is the total price in thousands of euros, transformed logarithmically. A preliminary exploratory analysis of the data gives a mean total price of 169,460 euros and a standard deviation of 71,514 euros. The mean age of the dwellings in the sample is 28.5 years and the mean total living area is 109.7 square meters. The mean numbers of rooms and bathrooms of the apartments in the sample are 5 and 1.5, respectively. Thirty-nine percent of the apartments are completely exterior, 52% of them are partially exterior with views to the main street, and 9% are without any view to the main street. Sixty-nine percent of the dwellings have at least one elevator. Fifty-seven percent of the households are located in zone 1, 15% are in zone 2, 18% are in zone 3, and 10% are in zone 4.
Table 1. Variables Description.

Variable                   Description
Area                       Total living area in square meters
Age                        Age of the dwelling (in years)
Rooms                      Total number of rooms including bedrooms, dining-room and kitchen
Bath                       Total number of bathrooms
Floor                      Floor on which the dwelling is located
Out (3 categories)         Completely exterior dwelling;
                           Dwelling is partially exterior with views to the main street;
                           Dwelling is without any view to the main street
Dwelling (3 categories)    Normal flat: kitchen is not included in the living area;
                           Small flat: living area includes the kitchen;
                           Penthouse
Elev (2 categories)        Absence of an elevator;
                           Presence of an elevator
Zone (4 categories)        City center;
                           South-West outlying district;
                           North outlying district;
                           North-East outlying district
Fig. 1. x-Coordinate and y-Coordinate of the 293 Households Labelled According to Their Zone.
Different models are fitted. First, an ordinary least squares regression model (OLS), defined by

$$ y = \beta_0 + \beta_1 \mathrm{Age} + \beta_2 \mathrm{Age}^2 + \beta_3 \mathrm{Rooms} + \beta_4 \mathrm{Bath} + \beta_5 \mathrm{Area} + \beta_6 \mathrm{Floor} + \beta_7 \mathrm{O1} + \beta_8 \mathrm{O2} + \beta_9 \mathrm{D1} + \beta_{10} \mathrm{D2} + \beta_{11} \mathrm{E1} + \beta_{12} \mathrm{Z1} + \beta_{13} \mathrm{Z2} + \beta_{14} \mathrm{Z3} + \epsilon, $$

is fitted, where $y$ is the $(293 \times 1)$ vector of the response variable, $\beta_1, \ldots, \beta_{14}$ are the coefficients of the model, and $\epsilon$ is the $(293 \times 1)$ vector of error terms with $\epsilon \sim N(0, \sigma^2 I)$. The dummy variables O1, O2, D1, D2, E1, Z1, Z2, and Z3 define the categories of the variables out, dwelling, elev and zone, respectively, as follows: O1 = 0 and O2 = 0 indicate completely exterior dwellings, O1 = 1 and O2 = 0 indicate dwellings that are partially exterior with views to the main street, and O1 = 0 and O2 = 1 indicate dwellings without any view of the main street. D1 = 0 and D2 = 0 represent a normal flat, D1 = 1 and D2 = 0 represent a small flat, and D1 = 0 and D2 = 1 represent a penthouse. E1 = 1 means that one elevator is present. Z1 = 0, Z2 = 0 and Z3 = 0 indicate that the dwelling is located in zone 1; Z1 = 1, Z2 = 0 and Z3 = 0 indicate zone 2; Z1 = 0, Z2 = 1 and Z3 = 0 indicate zone 3; and Z1 = 0, Z2 = 0 and Z3 = 1 indicate zone 4. A regression analysis of variance reveals that all the variables are significant. The results are displayed in Table 2. The lower plots in Fig. 3 show OLS fitted values versus log-observed values and back-transformed fitted values versus observed prices. From these graphs, we see that a better fit would be possible. To check whether spatial correlation is present, the Moran test for regression residuals is also applied; its test statistic is given below.

Table 2. Analysis of Variance of the OLS Model.

Variable    Degrees of Freedom   Sum of Sq.   Mean Sq.   F-value   p-value
Age         1                    1.991        1.991      65.137    0.000
Age^2       1                    0.203        0.203      6.630     0.011
Rooms       1                    19.745       19.745     646.014   0.000
Bath        1                    5.448        5.448      178.249   0.000
Area        1                    3.478        3.478      113.805   0.000
Floor       1                    0.760        0.760      24.850    0.000
Out         2                    0.912        0.456      14.917    0.000
Dwelling    2                    1.621        0.811      26.520    0.000
Elev        1                    1.059        1.059      34.655    0.000
Zone        3                    7.092        2.364      77.340    0.000
Residuals   278                  8.497        0.031
Table 3. Model Selection Criteria.

Number of Components (G)                ICL-BIC

Random coefficients model: Model (3)
2                                       183.797
3                                       263.604

Random intercept model: Model (4)
2                                       239.645
3                                       33.143
The Moran test statistic is given by (Cliff & Ord, 1981)

$$ M = \frac{n}{S} \, \frac{\epsilon' W \epsilon}{\epsilon' \epsilon}, $$

where $n$ is the number of observations, $\epsilon$ is the $n \times 1$ vector of OLS residuals, $W$ is the spatial weight matrix, and $S$ is the sum of all the elements of $W$. We use as spatial weight matrices "nearest neighbor" matrices $W$ built with distance cutoffs of 1.5, 2, 2.5, and 3 km. The corresponding p-values are 0.411, 0.577, 0.194, and 0.604, respectively. As a consequence, the hypothesis of independent residuals cannot be rejected. The inclusion of the zone as an exogenous variable makes the spatial dependence present in the models analyzed by Militino et al. (2004) disappear. Next, we fit Model (3) with two and three components. Finally, we also fit Model (4) with two, three, and four components; the four-component model proved inappropriate, since it gave duplicate estimates of some of the component means and therefore provided the same solution as the three-component model. We do not consider a four-component version of Model (3) because of the large number of parameters. Table 3 shows the ICL-BIC values for Models (3) and (4) with two and three components. The ICL-BIC criterion selects Model (4) with three components. Because of the large number of parameters in random coefficients models, the criterion tends to favor more parsimonious models with fewer parameters. The coefficient estimates, their standard errors, and the likelihood ratio test (LRT) p-values for Model (4) with three components are shown in Table 4. All the explanatory variables are highly significant except the floor variable. As expected, both a location in an outlying district and not being completely exterior negatively affect the selling price. The selected three-component mixture model defines three potential submarkets. The first one consists of 33 apartments; the second one has 173 dwellings; and the 87 remaining flats are in the third cluster.
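For illustration, the statistic is easily computed once a weight matrix is chosen; in the sketch below the binary distance-band construction of W is our assumption about how the "nearest neighbor" matrices with the quoted distance cutoffs are built, and all names are hypothetical.

```python
# The Moran statistic above, M = (n / S) * (e' W e) / (e' e).
import numpy as np

def moran_statistic(resid, W):
    n = len(resid)
    S = W.sum()                                    # sum of all weights
    return (n / S) * (resid @ W @ resid) / (resid @ resid)

def distance_band_weights(coords, cutoff):
    # Binary spatial weights: w_ij = 1 if 0 < distance(i, j) <= cutoff.
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    return ((d > 0) & (d <= cutoff)).astype(float)
```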
Table 4. Estimates for the Three-Component Model with Random Intercept.

Variable    Coefficient      Std. Error   LRT p-Value
α1 (π1)     3.855 (0.113)
α2 (π2)     4.119 (0.590)
α3 (π3)     4.353 (0.297)
Age         −0.006           0.002        0.000
Age^2       0.5e−4           0.1e−4       0.001
Rooms       0.071            0.018        0.000
Bath        0.063            0.023        0.006
Area        0.006            0.5e−3       0.000
Floor       0.008            0.005        0.076
O1          −0.054           0.019        0.005
O2          −0.106           0.033        0.001
D1          0.227            0.069        0.001
D2          0.263            0.055        0.000
E1          0.138            0.029        0.000
Z1          −0.195           0.039        0.000
Z2          −0.428           0.036        0.000
Z3          −0.385           0.043        0.000
σ̂           0.087

Note: Estimated mixing proportions π̂g are shown in parentheses next to the intercepts.
The mean total price of the dwellings in cluster 1 is 144,051 euros with a standard deviation of 91,993 euros. The mean total price of the dwellings in cluster 2 is 158,837 euros with a standard deviation of 63,968 euros. Cluster 3 has a mean total price of 200,223 euros with a standard deviation of 67,821 euros. To facilitate the interpretation of the mixture, we back-transform the fitted values by calculating $\exp(\hat{Y} + 0.5\sigma^2)$ in each component. The mean total back-transformed fitted price in cluster 1 is then 150,631 euros with a standard deviation of 96,199 euros. The mean total back-transformed fitted price in cluster 2 is 160,918 euros with a standard deviation of 64,415 euros. The mean total back-transformed fitted price in cluster 3 is 192,775 euros with a standard deviation of 63,225 euros. Dwelling prices in cluster 1 tend to be slightly overvalued, prices in cluster 2 are reasonably well fitted, and prices in cluster 3 tend to be slightly undervalued. The mean total living areas are 111.5, 110.6, and 107.3 square meters, respectively. The mean age of cluster 1 is 41.91 years, while the mean age of clusters 2 and 3 is approximately 27 years. The mean numbers of rooms and bathrooms are very similar in the three clusters and are equal to the mean numbers of rooms and bathrooms for the whole sample. In cluster 1, all dwellings except three are located in the city center, and most of them (79%) are partially or totally without views to the main street. The majority of dwellings from the outlying districts are classified in cluster 2 (69% of the total). Cluster 3 groups dwellings from the city center (59%) and from the outlying districts (41%).
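The back-transformation used above is the usual lognormal mean correction; as a one-line illustrative helper (names are assumptions):

```python
# Back-transform log-scale fitted values to the price level via exp(y_hat + 0.5*sigma^2).
import numpy as np

def back_transform(y_hat_log, sigma):
    return np.exp(y_hat_log + 0.5 * sigma ** 2)
```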
In summary, we detect three submarkets. The first one is mainly composed of apartments with low prices located in the city center. These dwellings are old and partially or totally without views to the main street. It is worth remarking that some of the most expensive dwellings are classified in this cluster because they have a large total living area. Three apartments in this cluster are not located in the city center. Two of them share characteristics of this cluster: the first one is old and not exterior, and the second one is rather new but has no view of the main street. The third flat shares no characteristics with this cluster: it is new and exterior. However, its posterior probability of belonging to cluster 2 is 0.43, so it could be allocated to that cluster. The second submarket groups dwellings from the city center and from the neighborhoods in the outskirts. It is the most heterogeneous cluster. The majority of the apartments from the outlying districts are classified in this cluster (69% of the total). The dwellings have intermediate prices and, in general, a large total living area. The third cluster defines a submarket where dwellings are, in general, the most expensive. Apartments in this cluster are mainly located in the city center (59%). They are generally newer, and the proportion of exterior flats is higher than in cluster 1. Apartments from the outskirts classified in this cluster are, in general, new, exterior, and have a large total living area. Figure 2 shows the location of the households for the three clusters. Figure 3 presents a plot of the fitted versus the observed values. The top row displays the fitted values versus the log-observed prices and the back-transformed fitted values versus the observed prices for Model (4) with three components using Eq. (12). The second row displays the corresponding plots for the OLS model. The inspection of these plots reveals that Model (4) with three components fits the data much better than the OLS regression model.
Fig. 2. x-Coordinate and y-Coordinate for the Dwellings in Each Cluster when Fitting Model (4) with Three Components.
Fig. 3. Upper Plots Correspond to Fitted vs. Log-Observed Values and Back-Transformed Fitted Values vs. Observed Prices for Model (4) with Three Components. Note: The fitted values are given by Eq. (12). Lower plots correspond to fitted vs. log-observed values and back-transformed fitted values vs. observed prices for the OLS model.
The residual sum of squares for Model (4) with three components is 0.920 using Eq. (12) and 1.597 using Eq. (13). The residual sum of squares for the OLS model is 8.497. Figure 4 plots the fitted versus the log-observed values for each of the three components of the mixture model in the first row, and the back-transformed fitted values versus the observed prices in the second row. The fitting was carried out with Eq. (12). As mentioned before, the majority of the observations in cluster 1 are dwellings with low prices that tend to be slightly overvalued by the mixture, observations in cluster 2 are dwellings of intermediate value, and observations in cluster 3 are expensive dwellings that tend to be slightly undervalued. The model is easily interpreted in terms of the residuals of the ordinary least squares regression model. The probability of arising from clusters 1, 2, and 3 for each dwelling is plotted against the residuals of the ordinary regression model in Fig. 5. Observations in cluster 1 are those that the ordinary regression model overfits.
Fig. 4. Upper Plots Correspond to Fitted Values vs. Log-Observed Values in Clusters 1, 2 and 3 Obtained from Fitting Model (4). Note: Lower plots correspond to back-transformed fitted values vs. observed prices in clusters 1, 2 and 3. Equation (12) is used in the fitting.
Fig. 5. Posterior Probabilities vs. Residuals of the OLS Regression Model.
Observations in cluster 2 are dwellings whose prices are reasonably well fitted, and observations in cluster 3 are dwellings underfitted by the ordinary regression model. This is clear since observations with the largest negative OLS residuals have probabilities higher than 0.5 of belonging to cluster 1, observations with OLS residuals close to zero present probabilities higher than 0.5 of belonging to cluster 2, and observations with the largest positive OLS residuals have probabilities higher than 0.5 of arising from cluster 3. This interpretation of the model follows Aitkin (1996).
4. DISCUSSION

Data heterogeneity makes the analysis of the used dwelling market very difficult. Our data set was composed of apartments from various districts of the city. These apartments had varied numbers of rooms and bathrooms and different types of views. The approach described in this paper emphasizes classification, accounting for the heterogeneity by forming groups of observations that define potential submarkets. In hedonic studies, submarkets are typically defined a priori; some authors (Bourassa et al., 1999) define submarkets using exploratory statistical techniques, such as cluster analysis or principal component analysis, and then fit hedonic equations to each cluster. In this paper, we applied a mixture model to let the data determine the group structure and to calculate the parameter estimates jointly. The result is that we obtain a model-based classification and the estimates at the same time. According to the model selection criterion recommended in this paper, the most suitable model for our data set is the one with three components and a random intercept. This model can be seen as a linear mixed model where the random effects distribution is estimated via non-parametric maximum likelihood (NPML). A disadvantage of any approach using a parametric form for the "mixing" distribution of the random effects is the possible sensitivity of the conclusions to this specification. Small changes in the mixing distribution may produce substantial variation in the parameter estimates (Heckman & Singer, 1984). The selected model provides three submarkets of low, intermediate, and high prices. The first submarket encompasses dwellings with low prices located in the city center. These apartments are old and not exterior. Fifty-five percent of them are in buildings without an elevator. The second submarket groups dwellings of intermediate prices from the center and the outskirts. It is remarkable that 69% of the flats in the outskirts are classified in this cluster. The third submarket consists of dwellings with high prices. Fifty-nine percent of the apartments are located in the city center and 41% are in the outlying districts, mainly in zones 2 and 3. The main differences with submarket 1 are that in submarket 3 the dwellings from the center are newer and the proportion of exterior flats is higher. The model can also be interpreted in terms of the residuals of the OLS regression model.
The methods proposed in this paper are appealing because they are easy to implement. They require fitting a single LM in which the observations are replicated as many times as the number of components in the model (in the case of the random intercept model), or fitting a set of linear models with prior weights (in the case of random coefficients). This is easily carried out with the LM fitting procedure included in all standard statistical packages.
ACKNOWLEDGMENTS

The authors acknowledge the Servicio de Riqueza Territorial of the Government of Navarra, Spain, for providing the data set used in this work. This research is partially supported by the Ministerio de Ciencia y Tecnología, Spain, Project AGL2000-0978.
REFERENCES

Adair, A. S., Berry, J. N., & McGreal, W. S. (1996). Hedonic modelling, housing submarkets and residential valuation. Journal of Property Research, 13, 67–83.
Aitkin, M. (1996). A general maximum likelihood analysis of overdispersion in generalized linear models. Statistics and Computing, 6, 251–262.
Aitkin, M. (1999). A general maximum likelihood analysis of variance components in generalized linear models. Biometrics, 55, 117–128.
Aitkin, M., & Aitkin, I. (1996). A hybrid EM/Gauss-Newton algorithm for maximum likelihood in mixture distributions. Statistics and Computing, 6, 127–130.
Allen, M. T., Springer, T. M., & Waller, N. G. (1995). Implicit pricing across residential submarkets. Journal of Real Estate, Finance and Economics, 11, 137–151.
Anselin, L. (1988). Spatial econometrics: Methods and models. Dordrecht: Kluwer.
Basford, K. E., McLachlan, G. J., & York, M. G. (1997). Modelling the distribution of stamp paper thickness via finite normal mixtures: The 1872 stamp issue of Mexico revisited. Journal of Applied Statistics, 24, 169–179.
Biernacki, C., Celeux, G., & Govaert, G. (1998). Assessing a mixture model for clustering with the integrated classification likelihood. Technical Report No. 3521, INRIA, Rhône-Alpes.
Bourassa, S. C., Hamelink, F., Hoesli, M., & Macgregor, B. D. (1999). Defining housing submarkets. Journal of Housing Economics, 8, 160–183.
Cliff, A. D., & Ord, J. K. (1981). Spatial processes: Models and applications. London: Pion Limited.
Cressie, N. A. (1993). Statistics for spatial data (Rev. ed.). New York: Wiley.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38.
Dietz, E., & Böhning, D. (1995). Statistical inference based on a general model of unobserved heterogeneity. In: G. U. H. Seeber, B. J. Francis, R. Hatzinger & G. Steckel-Berger (Eds), Statistical Modelling: Lecture Notes in Statistics (Vol. 104, pp. 75–82). New York: Springer-Verlag.
Goetzmann, W. N., & Wachter, S. M. (1995). The global real estate crash: Evidence from an international database. In: Proceedings of the International Congress on Real Estate (Vol. 3). Singapore: School of Building and Estate Management, National University of Singapore.
Griliches, Z. (1971). Price indexes and quality change. Cambridge, MA: Harvard University Press.
Harsman, B., & Quigley, J. M. (1995). The spatial segregation of ethnic and demographic groups: Comparative evidence from Stockholm and San Francisco. Journal of Urban Economics, 28, 223–242.
Hawkins, D., Allen, D. M., & Stromberg, A. J. (2001). Determining the number of components in mixtures of linear models. Computational Statistics and Data Analysis, 38, 15–48.
Heckman, J. J., & Singer, B. (1984). A method for minimizing the impact of distributional assumptions in econometric models of duration. Econometrica, 52, 271–320.
Hoesli, M., & Macgregor, B. D. (1997). The classification of local property markets in the UK using cluster analysis. In: The Cutting Edge: Proceedings of the RICS Property Research Conference 1995 (Vol. 1). London: Royal Institution of Chartered Surveyors.
Kiefer, J., & Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many nuisance parameters. Annals of Mathematical Statistics, 27, 887–906.
Lancaster, K. J. (1966). A new approach to consumer theory. Journal of Political Economy, 74, 132–157.
Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society B, 44, 226–233.
McLachlan, G. J., & Ng, S. K. (2000). A comparison of some information criteria for the number of components in a mixture model. Technical Report. Department of Mathematics, University of Queensland, Brisbane.
McLachlan, G. J., & Peel, D. (2000). Finite mixture models. New York: Wiley.
Militino, A. F., Ugarte, M. D., & García-Reinaldos, L. (2004). Alternative models for describing spatial dependence among dwelling selling prices. Journal of Real Estate, Finance and Economics, 29(2), 193–209.
Pace, R. K., & LeSage, J. P. (2002). Semiparametric maximum likelihood estimates of spatial dependence. Geographical Analysis, 34, 76–90.
Rosen, S. (1974). Hedonic prices and implicit markets: Product differentiation in pure competition. Journal of Political Economy, 82, 34–55.
Titterington, D. M., Smith, A. F. M., & Makov, U. E. (1985). Statistical analysis of finite mixture distributions. New York: Wiley.
SPATIO-TEMPORAL AUTOREGRESSIVE MODELS FOR U.S. UNEMPLOYMENT RATE

Xavier de Luna and Marc G. Genton

ABSTRACT

We analyze spatio-temporal data on U.S. unemployment rates. For this purpose, we present a family of models designed for the analysis and time-forward prediction of spatio-temporal econometric data. Our model is aimed at applications with spatially sparse but temporally rich data, i.e. observations collected at few spatial regions but at many regular time intervals. The family of models utilized makes no spatial stationarity assumptions and consists of a vector autoregressive (VAR) specification, where there are as many time series as spatial regions. A model building strategy is used that takes into account the spatial dependence structure of the data. Model building may be performed either by displaying sample partial correlation functions, or automatically with an information criterion. Monthly data on unemployment rates in the nine census divisions of the U.S. are analyzed. We show with a residual analysis that our autoregressive model captures the dependence structure of the data better than univariate time series models do.
1. INTRODUCTION

In this article, we analyze spatio-temporal data on U.S. unemployment rates. Previous studies have often focused on purely temporal models of a single measure in the study of unemployment: the U.S. civilian unemployment rate, seasonally adjusted; see e.g. Montgomery et al. (1998), Proietti (2003). We, instead, develop a spatio-temporal model for monthly U.S. unemployment rates observed in the nine census divisions of the United States between January 1980 and May 2002. Figure 1 presents a map of the census regions and divisions of the U.S. The nine divisions are New England (NE), Middle Atlantic (MA), South Atlantic (SA), East North Central (ENC), East South Central (ESC), West North Central (WNC), West South Central (WSC), Mountain (M), and Pacific (P). These data have two important characteristics. First, the geographical locations of the nine divisions form a spatial lattice (see, e.g. Haining, 1990). This means that the unemployment rate in a given
Fig. 1. Map of the Census Regions and Divisions of the U.S., Reproduced with Permission from the U.S. Census Bureau.
division might be spatially correlated with the rates in neighboring regions. This spatial information can be used in our model to improve the forecasts. Note that, unlike environmental applications with monitoring stations, spatial interpolation is not of interest in our context. Second, the spatio-temporal data are sparse in space but rich in time, that is, we have only nine spatial regions, but many observations at regular time intervals. This type of data can also be found in environmental studies, see de Luna and Genton (2003) for applications to wind speed data and carbon monoxide atmospheric concentrations. The article is set up as follows. Section 2 describes a family of autoregressive models with spatial structure, first introduced by de Luna and Genton (2003) in the context of environmental applications. We argue that spatial stationarity assumptions are not necessary and a different model can be build for each of the nine divisions. We briefly discuss estimation and inference for these models and also provide two main approaches to deterministic trend modeling. We then describe our model building strategy which is based on a spatio-temporal ordering of the nine divisions. Section 3 presents two analyses of the U.S. unemployment rate data: one is based on univariate time series modeling, whereas the second uses the spatial information across divisions. We perform a residual analysis on the two approaches and show that our autoregressive model with spatial structure captures the dependence structure of the data better. We conclude in Section 4.
2. AUTOREGRESSIVE MODELS WITH SPATIAL STRUCTURE

In this section we present models that are specifically designed for the analysis of spatial lattice data evolving in time. Our purpose is to provide time-forward predictions at given spatial regions of the lattice, based on a minimum of assumptions.
2.1. The Model

The model we consider is a vector autoregressive (VAR) model commonly used in multivariate time series analysis (e.g. Lütkepohl, 1991). It is usually applied in situations where several variables are observed at the same time. We, however, use the VAR model for a single variable, but observed at several spatial regions of a lattice at the same time. Specifically, we consider observations $z(s_i, t)$ collected at $s_i$, $i = 1, \ldots, N$ spatial regions of a lattice and $t = 1, \ldots, T$ times. A time-forward
predictive model for $z_t = (z(s_1, t), \ldots, z(s_N, t))'$ is

$$ z_t - \mu = \sum_{i=1}^{p} R_i (z_{t-i} - \mu) + \varepsilon_t, \qquad (1) $$

a VAR(p) model with spatial lattice structure, where $p$ denotes the order of the autoregression in time. The parameter $\mu = (\mu(s_1), \ldots, \mu(s_N))'$ is a vector of spatial effects representing a spatial trend among the regions of the lattice. The $N$-dimensional process $\varepsilon_t$ is white noise, that is, $E(\varepsilon_t) = 0$, $E(\varepsilon_t \varepsilon_t') = \Sigma$, and $E(\varepsilon_t \varepsilon_u') = 0$ for $u \neq t$. The $N \times N$ parameter matrices $R_i$, $i = 1, \ldots, p$, are unknown and need to be estimated from the data, as well as the unknown $N \times N$ matrix $\Sigma$ of spatial covariances. The estimation of those parameters can be carried out with maximum likelihood (under distributional assumptions, typically Gaussian), with least squares, or with moment estimators of Yule-Walker type; see, e.g. Lütkepohl (1991, Chap. 3). Note that the order $p$ is also unknown and its identification in the space-time context will be discussed in the next section. We assume that iterations of the deterministic dynamic system defined by model (1) converge towards a constant. This stability property implies that the process $z_t$ is time stationary, that is, $E(z_t) = \mu$ for all $t$, and $\mathrm{cov}(z_t, z_{t-\tau}) = \Gamma_z(\tau)$, a function of $\tau$ only, for all $t$ and $\tau = 0, 1, 2, \ldots$. The covariance matrix $\Gamma_z(\tau)$ can then be computed from the parameter matrices $R_1, \ldots, R_p$ and from $\Sigma$; see Lütkepohl (1991). Note that the matrix $\Sigma$ can represent nonstationary spatial covariances. Indeed, we do not assume any spatial stationarity for model (1), such as that spatial correlations depend only on the distance between stations (isotropy). Such assumptions have often been made in the spatio-temporal literature, for instance to develop space-time ARMA models, or STARMA; see Pfeifer and Deutsch (1980), Stoffer (1986). For applications to space-time lattice data, spatial stationarity is an over-restrictive assumption which is, moreover, difficult to check in practice.

Because model (1) is assumed to be stationary in time, we need to remove deterministic temporal trends from the data. This can be achieved by differencing the observations with respect to time, as is commonly done in classical time series analysis. If the result depends only on the spatial regions, then it can be modeled by the spatial trend parameter $\mu$ in (1). This will happen as soon as the spatio-temporal trend is a polynomial function in time with coefficients that may depend on the spatial location. In practice, this is typically the case, at least approximately. Note also that seasonal effects, such as monthly records, can be removed by taking differences in time. Another approach consists of modeling a deterministic spatio-temporal trend by means of a weighted sum of known basis functions, such as polynomials and/or sine/cosine periodic functions. The weights are then estimated
by regression. Further discussions on the topic of spatio-temporal trend modeling can be found in the review article by Kyriakidis and Journel (1999).
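As an illustration of the least-squares route mentioned above, the following sketch estimates the matrices R_1, ..., R_p and Σ of model (1) by stacking lagged, demeaned observations into a single multivariate regression. It is a generic textbook estimator, not the authors' code, and all names are illustrative.

```python
# Least-squares estimation of the VAR(p) model (1) with N series observed T times.
import numpy as np

def fit_var_ls(Z, p):
    """Z: (T, N) array of observations z(s_i, t). Returns mu, [R_1..R_p], Sigma."""
    T, N = Z.shape
    mu = Z.mean(axis=0)
    Y = Z - mu
    # Rows t = p..T-1: response y_t, regressors [y_{t-1}, ..., y_{t-p}].
    X = np.hstack([Y[p - i:T - i] for i in range(1, p + 1)])   # (T-p, N*p)
    Yt = Y[p:]                                                 # (T-p, N)
    A, *_ = np.linalg.lstsq(X, Yt, rcond=None)                 # (N*p, N)
    R = [A[i * N:(i + 1) * N].T for i in range(p)]             # R_1, ..., R_p
    E = Yt - X @ A                                             # residuals
    Sigma = (E.T @ E) / (T - p)                                # noise covariance
    return mu, R, Sigma
```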
2.2. Model Building

In this section, we present a model building strategy for the VAR model (1). Indeed, the spatial lattice data impose a specific structure on the unknown parameter matrices $R_1, \ldots, R_p$. Our strategy consists in identifying zero entries in the $R_i$ matrices as well as the order $p$ of the autoregression. To this aim, we model each spatial region $s$ of the lattice separately, thus avoiding any assumption of spatial stationarity. This can be seen as a complex covariate selection problem, where, for the response $z(s, t)$, the available predictors are the time-lagged values at all spatial regions of the lattice, i.e. $z(s_i, t - j)$, $i = 1, \ldots, N$ and $j = 1, \ldots$. We make use of the spatio-temporal structure to define an ordering of the predictors, which allows us to introduce them sequentially into the model, as is commonly done in univariate time series analysis. A natural ordering of the spatial regions around $s$ is given by the sequence of regions sorted in ascending order with respect to their distance to $s$. In Section 3.2 we define such a distance-based ordering for the U.S. census divisions. For instance, for the region ENC the ordering obtained is ENC, MA, WNC, ESC, SA, NE, WSC, M and P; see Table 1. Other orderings can be considered as well, for instance orderings based on the length of the common border between two spatial regions (see Haining, 1990), or orderings motivated by dynamical/physical knowledge about the underlying process; see de Luna and Genton (2003) for an application to wind speeds in Ireland. From an ordering of the spatial regions one obtains a spatio-temporal ordering by considering first the predictors $z(s_i, t - j)$ at lag one ($j = 1$) in the spatially defined

Table 1. Matrix of "Distance" Used in the Spatio-Temporal Ordering.
      NE   MA   ENC   WNC   SA   ESC   WSC   M    P
NE    0    1    2     5     3    4     6     7    8
MA    1    0    2     5     3    4     6     7    8
ENC   5    1    0     2     4    3     6     7    8
WNC   7    5    1     0     6    4     2     3    8
SA    3    2    4     6     0    1     5     7    8
ESC   6    4    3     5     1    0     2     7    8
WSC   7    6    4     2     5    1     0     3    8
M     7    6    4     1     8    5     2     0    3
P     7    6    4     2     8    5     3     1    0
Fig. 2. A Schematic Representation of the Spatio-Temporal Ordering of the Neighbors to the Spatial Division ENC for Time t − 1, t − 10, t − 12. Note: The centroid of each census region is represented by a black disc. The segments connecting those discs represent the sequence of regions sorted in ascending order with respect to their distance to ENC (Table 1). Only the predictors with predictive power for ENC unemployment at time t are connected.
order, then at lag 2 ($j = 2$), and so on. With a given spatio-temporal ordering, we can enter the predictors sequentially into the model and stop whenever the partial correlation between $z(s, t)$ and a new predictor is zero, conditionally on all other predictors already in the model. This means that we are defining a partial correlation function (PCF) along our ordering in space and time; see de Luna and Genton (2003) for details. This approach can be automated by using model selection criteria such as AIC (Akaike, 1974) or BIC (Schwarz, 1978). Figure 2 shows which neighbors (shaded states) are included at lags t − 1, t − 10 and t − 12 in the model built (see Section 3.2) to predict ENC unemployment at time t. The segments connecting the centroids (black discs) of the divisions highlight the ordering of the regions described above.
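The automated variant of this strategy can be sketched as a greedy scan over the spatio-temporally ordered candidates, keeping a lagged predictor whenever it lowers BIC. This is a simplified stand-in for the PCF-based stopping rule described above; the function names and the greedy keep/discard rule are our assumptions.

```python
# Sequential predictor entry along the spatio-temporal ordering, with BIC.
import numpy as np

def bic(y, X):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    n, k = X.shape
    return n * np.log(rss / n) + k * np.log(n)

def build_model(Z, target, spatial_rank, max_lag):
    """Z: (T, N) data; target: column index of the region modeled;
    spatial_rank: length-N array, the Table 1 row for the target region."""
    T, N = Z.shape
    order = np.argsort(spatial_rank)                 # nearest regions first
    # Candidates sorted by lag, then by spatial proximity within each lag.
    candidates = [(lag, s) for lag in range(1, max_lag + 1) for s in order]
    y = Z[max_lag:, target]
    cols, selected = [np.ones(T - max_lag)], []      # start from an intercept
    best_bic = bic(y, np.column_stack(cols))
    for lag, s in candidates:
        trial = cols + [Z[max_lag - lag:T - lag, s]]
        b = bic(y, np.column_stack(trial))
        if b < best_bic:                             # keep predictor if BIC drops
            cols, best_bic = trial, b
            selected.append((lag, s))
    return selected
```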
3. UNEMPLOYMENT RATES IN THE U.S.

In this section we build predictive models for unemployment rates observed monthly between January 1980 and May 2002 in the nine census divisions of the United States. Each division consists of a collection of states, as shown in Fig. 1. The data were downloaded from the web site of the Bureau of Labor Statistics (http://www.bls.gov). The unemployment rates are plotted for each division in Fig. 3.
Fig. 3. Unemployment Rates Observed Monthly Between January 1980 and May 2002 for the Nine Census Divisions of the U.S.
3.1. Univariate Modeling

We start with a purely time series data analysis where each division is considered separately, with the purpose of fitting a univariate autoregressive model. From Fig. 3 we see that unemployment rates show a long term cyclic (trend-cycle) behavior. We also observe the well known monthly seasonality. We begin by removing the trend-cycle with the difference operator of order one. Note here that it is not customary to take first differences of unemployment series, because the long term cyclic behavior is often of interest to economists. In particular, the existence of a natural rate of unemployment around which the unemployment series fluctuates is often studied and discussed (e.g. Lucas, 1973; Phelps, 1994). In this paper, however, we focus on models geared towards the purpose of providing short term predictions. Thus, it is natural to analyze monthly changes in unemployment rates (first differences) from which the (hypothetical) natural rate has been removed. The detrended series are shown in Fig. 4. From these plots and the autocorrelation functions displayed in Fig. 5 it is clear that we have a twelve-month seasonal component. Thus, a seasonal differencing is carried out. The resulting series and their autocorrelation functions are displayed in Figs 6–8, respectively. The two autocorrelation functions (plain and partial) indicate that it is suitable to fit AR models to the different time series.
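The two differencing steps used here amount to applying first the ordinary difference and then the twelve-month seasonal difference to each series; a minimal sketch (illustrative names):

```python
# Detrending and deseasonalizing by differencing: (1 - B^12)(1 - B) z_t.
import numpy as np

def detrend_deseasonalize(z, season=12):
    d1 = np.diff(z)                      # monthly changes remove the trend-cycle
    return d1[season:] - d1[:-season]    # seasonal difference removes the cycle
```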
Fig. 4. First Differences of the Unemployment Series Showed in Fig. 3.
Fig. 5. Autocorrelation Functions of the Detrended Unemployment Series of Fig. 4.
Fig. 6. Twelve Month Differences of the Detrended Unemployment Series in Order to Eliminate the Seasonal Component.
Fig. 7. Autocorrelation Functions of the Detrended and Deseasonalized Series of Fig. 6.
Fig. 8. Partial Autocorrelation Functions of the Detrended and Deseasonalized Series of Fig. 6.
3.2. Models with Spatial Structure

The analysis here is geared towards building a VAR model with spatial structure as described in Section 2.2. For this purpose the data must be stationary in the time dimension. From the previous univariate analysis we conclude that the nine time series must be detrended and deseasonalized. This is done by differencing as described above. Model building is then performed on a division basis. That is, for each division, a predictive model is built independently of the models considered for the other divisions. As mentioned earlier, this is an essential component of our modeling approach since it allows us to avoid spatial stationarity assumptions. We give the details of model building for the East North Central (ENC) division. A spatio-temporal ordering of the divisions is used to provide a hierarchy in which to introduce covariates into the regression model. Such an ordering is most naturally constructed by considering the geographical locations of the divisions, thereby anticipating that neighboring divisions will have a larger predictive power for ENC than far away regions. However, because divisions are not point locations but regions with an area, it is not clear how to define geographical proximity. On the other hand, a very precise ordering is not necessary, since spatio-temporal neighbors with predictive power will eventually be included in the model. In other words, the ordering is used to help identify zeros in the final VAR model, but without the ambition to identify them all. We, therefore, take an arbitrary approach and define an ordering by considering first neighbors with a direct "frontier" to the ENC division, starting from the East and proceeding anti-clockwise. Neighbors without a direct frontier are then considered in approximately increasing distance; see Table 1, where each row represents a spatial ordering (from 0 to 8). Table 2 displays the partial correlation function (PCF) defined with respect to the ordering described above. The table also indicates which partial correlations are significantly different from zero at the one percent level. This allows us to build a model with the relevant covariates. In this case, we should include at lag one the divisions ENC, MA, WNC, ESC, SA, and NE, because NE is the last significant one along the ordering. Moreover, we should include ENC at lag 12. We consider the remaining significant partial correlations as spurious. A weakness of the above approach is that we are performing many tests simultaneously and do not have control over the overall size of the test. However, as mentioned in Section 2.2, model building may be performed automatically by using an information criterion such as AIC or BIC. BIC (with max lag 15) selects ENC, MA, WNC, ESC, SA, NE at lag one, ENC, MA at lag ten, and ENC at lag twelve. This is in accordance with the above PCF analysis and is illustrated
Table 2. PCF for the ENC Division for Lags 1–25.

Lag   ENC        MA         WNC        ESC       SA         NE        WSC        M          P
1     0.463**    0.046      0.196**    0.143     0.113      0.210**   −0.002     0.158      0.076
2     0.045      0.042      0.085      0.010     0.096      −0.053    −0.060     0.141      −0.077
3     0.033      −0.035     0.090      −0.015    0.000      −0.043    0.031      0.105      0.070
4     −0.094     0.058      0.022      0.005     0.065      0.120     −0.037     −0.069     0.041
5     −0.142     0.037      0.048      −0.044    0.046      0.119     −0.080     0.055      0.065
6     −0.121     0.003      0.036      −0.085    −0.142     −0.015    −0.236**   −0.034     −0.024
7     0.018      0.070      0.166      0.122     −0.082     0.064     −0.140     −0.018     −0.105
8     −0.098     0.053      0.056      0.090     0.078      −0.129    −0.076     −0.040     −0.062
9     −0.001     0.131      0.044      0.022     0.050      0.177     −0.018     −0.074     0.004
10    0.110      −0.261**   −0.077     0.012     0.030      0.091     −0.189     −0.099     0.225**
11    0.007      0.059      0.078      −0.063    −0.011     −0.007    −0.141     0.089      −0.070
12    −0.419**   0.101      −0.094     0.059     −0.022     −0.067    0.095      0.061      0.053
13    0.096      0.064      0.187      0.128     −0.131     0.099     −0.027     −0.089     0.043
14    −0.084     −0.062     0.112      −0.046    0.212      −0.040    0.137      0.020      0.210
15    0.039      0.007      0.023      −0.149    0.182      −0.049    0.016      −0.091     −0.122
16    −0.095     −0.053     0.006      0.064     0.040      0.122     −0.026     0.221      −0.118
17    −0.012     0.149      0.003      −0.128    0.022      0.106     −0.140     −0.004     0.024
18    −0.049     0.034      −0.056     0.147     0.162      0.169     −0.027     0.095      −0.041
19    −0.116     −0.058     −0.025     0.063     −0.063     −0.130    −0.153     0.088      −0.059
20    0.030      0.141      0.061      0.128     −0.111     0.125     −0.130     0.227      0.148
21    −0.167     0.200      −0.326**   0.031     0.072      −0.278    0.157      0.070      0.007
22    0.246      0.110      −0.144     0.347**   −0.085     −0.240    0.140      −0.318     0.082
23    −0.035     0.105      −0.048     −0.033    0.120      −0.194    −0.040     0.081      0.249
24    0.217      −0.236     −0.139     0.290     −0.427**   −0.327    −0.123     −0.416**   0.002
25    0.059      −0.344     −0.365     0.001     −0.291     0.076     0.216      −0.707**   0.247

Note: ** indicates correlations significantly different from zero at the one percent level.
Table 3. Order (Chosen by BIC) of the Autoregressive Models Fitted for Each Division.

Division   NE   MA   ENC   WNC   SA   ESC   WSC   M    P
Order      13   15   13    12    14   13    14    14   12
graphically in Fig. 2. In this automatic procedure we have chosen to include ENC at all lags.
3.3. Residual Analysis

A residual analysis is carried out to investigate the relevance of the fitted models. In particular, we look at the residuals obtained from a univariate modeling of the nine divisions, and the residuals obtained from a VAR modeling based on a spatio-temporal ordering. The univariate modeling of the detrended and deseasonalized time series is performed by fitting autoregressive models separately to the nine divisions. The order of each autoregression is chosen with BIC and the parameters are fitted by least squares. These orders are given in Table 3.
Table 4. Structure of the VAR Model Fitted.

Lag   NE   MA   ENC   WNC   SA   ESC   WSC   M    P
1     3    3    6     4     9    2     4     4    6
2     1    1    1     1     1    1     2     1    1
3     1    1    1     1     1    1     1     1    1
4     1    1    1     1     1    1     1     1    1
5     1    1    1     1     1    1     1     1    1
6     1    1    1     1     1    1     1     1    1
7     1    1    1     1     1    1     1     1    1
8     1    1    1     1     1    1     1     1    1
9     1    1    1     1     1    1     1     1    1
10    1    1    2     1     1    1     1     1    1
11    1    1    1     1     1    1     1     1    1
12    1    1    1     1     1    3     1     1    1
13    1    2    0     2     1    1     1     1    1
14    3    0    0     0     1    0     1     1    0
15    0    0    0     0     1    0     0     1    0

Note: For each division the models are selected with the strategy of Section 2.2 combined with BIC.
Table 5. Lag 1 Correlations of the Residuals from the Univariate Autoregressive Models as Described in Table 3.

       NE       MA       ENC      WNC      SA       ESC      WSC      M        P
NE     −0.027   0.049    0.203    0.107    0.069    0.066    0.070    −0.075   0.226
MA     0.136    0.016    0.115    0.038    0.077    0.108    −0.024   0.022    0.140
ENC    0.220    0.071    0.050    0.073    0.132    0.167    0.009    −0.002   0.047
WNC    0.037    −0.055   0.022    0.030    0.051    0.112    0.011    0.136    0.095
SA     0.134    0.039    0.000    −0.003   −0.017   0.087    0.014    0.083    0.185
ESC    0.080    0.078    0.118    0.033    0.081    0.053    −0.039   −0.019   0.107
WSC    0.109    0.102    0.093    −0.013   0.199    0.148    −0.056   0.089    0.076
M      0.115    0.093    0.083    0.012    0.045    0.039    0.096    0.028    0.140
P      0.119    0.202    0.140    0.098    0.270    0.263    0.144    0.198    0.051
BIC is also used to specify a VAR model with spatial structure by using the model building strategy of Section 2.2. Model building is hence performed separately for each division, and parameters are estimated with least squares. The obtained models are described in Table 4. For each time lag and for each division, Table 4 reports the number of selected neighboring divisions. We, hence, have two sets of residuals obtained from Table 3 and from Table 4 respectively. We can look at their spatio-temporal correlation structure to see whether there is some linear dependence left. Only the correlations at lag one are presented in Tables 5 and 6. Correlations at other lags are low for both models. We see that the residuals from the univariate modeling show spatio-temporal correlations at lag one, while the residuals from the VAR model have very low
Table 6. Lag 1 Correlations of the Residuals from the VAR Model Described in Table 4.

       NE       MA       ENC      WNC      SA       ESC      WSC      M        P
NE     0.014    0.017    −0.027   0.047    0.028    0.006    −0.012   −0.119   0.153
MA     −0.032   −0.051   −0.073   −0.064   −0.052   0.054    −0.060   −0.007   0.133
ENC    0.055    0.028    0.047    0.009    −0.067   0.030    −0.085   −0.039   0.034
WNC    0.036    −0.059   0.026    0.000    −0.054   0.096    −0.047   −0.026   0.044
SA     −0.034   0.003    −0.012   −0.054   −0.047   0.018    −0.048   −0.087   0.021
ESC    0.016    −0.002   0.076    −0.003   −0.062   0.007    −0.045   −0.044   0.091
WSC    −0.021   0.046    0.050    −0.031   0.067    −0.009   −0.034   −0.043   0.036
M      0.036    0.112    0.041    −0.002   −0.002   −0.011   −0.008   0.005    −0.026
P      0.051    0.103    −0.003   0.051    0.049    0.022    0.007    0.057    0.039
correlations. This indicates that the VAR model with spatial structure captures the dependence structure of the data better, as expected.
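The quantities reported in Tables 5 and 6 can be computed as a lag-one cross-correlation matrix of the residual series; a short sketch (names are assumptions), where entry (i, j) correlates series i at time t with series j at time t − 1:

```python
# Lag-1 cross-correlations of residuals, as reported in Tables 5 and 6.
import numpy as np

def lag1_cross_correlation(E):
    """E: (T, N) matrix of residual series, one column per division."""
    A, B = E[1:], E[:-1]                  # series at t and at t-1
    A = (A - A.mean(0)) / A.std(0)
    B = (B - B.mean(0)) / B.std(0)
    return (A.T @ B) / len(A)             # (N, N) lag-1 correlation matrix
```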
4. CONCLUSION

In this article, we have proposed a VAR model with spatial structure to analyze monthly U.S. unemployment rates observed in the nine census divisions of the United States. These spatio-temporal data are sparse in space but rich in time, because we have only nine spatial regions but many observations at regular time intervals. This type of data is fairly common in econometric studies. We have shown that, unlike many other spatio-temporal modeling approaches, we do not have to assume any spatial stationarity or isotropy. We have described a model building strategy based on a spatio-temporal ordering of the nine census divisions. This ordering allows us to enter predictors sequentially into our model and identify interesting ones using partial correlation functions or model selection criteria. We have shown that our VAR model with spatial structure captures the dependence structure of the U.S. unemployment rate data better than univariate autoregressive time series models. An interesting feature of our model is its simplicity of implementation, which results from its linearity and its nestedness with respect to a spatio-temporal hierarchy. We have used the software Splus for our computations, but any software performing linear regressions would be suitable to implement our modeling approach. The latter is obviously applicable to the prediction of other macroeconomic variables observed on different regions/countries. Several possible generalizations of the models introduced can be identified. For instance, it is straightforward to take into account other variables measured at the same regions by including them as explanatory variables within the VAR framework. Moreover, instead of looking for a single optimal model, one may use model averaging procedures (see, e.g. Hoeting et al., 1999). Model averaging may be performed not only over different models based on a given ordering of the spatial regions, but also over models based on different such orderings. The model building strategy is also open to further generalizations; for example, shrinkage techniques such as a ridge or Lasso penalty could be used to select the predictor variables.
ACKNOWLEDGMENTS We thank the Editors and an anonymous referee for helpful comments on this manuscript.
REFERENCES

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.
de Luna, X., & Genton, M. G. (2003). Predictive spatio-temporal models for spatially sparse environmental data. Institute of Statistics Mimeo Series #2546.
Haining, R. (1990). Spatial data analysis in the social and environmental sciences. Cambridge: Cambridge University Press.
Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). Bayesian model averaging: A tutorial (with discussion). Statistical Science, 14, 382–417.
Kyriakidis, P. C., & Journel, A. G. (1999). Geostatistical space-time models: A review. Mathematical Geology, 31, 651–684.
Lucas, R. E., Jr. (1973). Some international evidence on output-inflation trade-offs. American Economic Review, 63, 326–334.
Lütkepohl, H. (1991). Introduction to multiple time series analysis. Berlin: Springer-Verlag.
Montgomery, A. L., Zarnowitz, V., Tsay, R. S., & Tiao, G. C. (1998). Forecasting the US unemployment rate. Journal of the American Statistical Association, 93, 478–493.
Pfeifer, P. E., & Deutsch, S. J. (1980). A three-stage iterative procedure for space-time modeling. Technometrics, 22, 35–47.
Phelps, E. S. (1994). Structural slumps: The modern equilibrium theory of unemployment, interest and assets. Cambridge: Harvard University Press.
Proietti, T. (2003). Forecasting the US unemployment rate. Computational Statistics and Data Analysis, 42, 451–476.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.
Stoffer, D. S. (1986). Estimation and identification of space-time ARMAX models in the presence of missing data. Journal of the American Statistical Association, 81, 762–772.
A LEARNING RULE FOR INFERRING LOCAL DISTRIBUTIONS OVER SPACE AND TIME

Stephen M. Stohs and Jeffrey T. LaFrance

ABSTRACT

A common feature of certain kinds of data is a high level of statistical dependence across space and time. This spatial and temporal dependence contains useful information that can be exploited to significantly reduce the uncertainty surrounding local distributions. This chapter develops a methodology for inferring local distributions that incorporates these dependencies. The approach accommodates active learning over space and time, and from aggregate data and distributions to disaggregate individual data and distributions. We combine data sets on Kansas winter wheat yields: annual county-level yields over the period from 1947 through 2000 for all 105 counties in the state of Kansas, and 20,720 individual farm-level sample moments, based on ten years of the reported actual production histories for the winter wheat yields of farmers participating in the United States Department of Agriculture Federal Crop Insurance Corporation Multiple Peril Crop Insurance Program in each of the years 1991–2000. We derive a learning rule that combines statewide, county, and local farm-level data using Bayes' rule to estimate the moments of individual farm-level crop yield distributions. Information theory and the maximum entropy criterion are used to estimate farm-level crop yield densities from these moments. These posterior densities
Spatial and Spatiotemporal Econometrics Advances in Econometrics, Volume 18, 295–331 Copyright © 2004 by Elsevier Ltd. All rights of reproduction in any form reserved ISSN: 0731-9053/doi:10.1016/S0731-9053(04)18010-9
295
296
STEPHEN M. STOHS AND JEFFREY T. LAFRANCE
are found to substantially reduce the bias and volatility of crop insurance premium rates.
1. INTRODUCTION The U.S. government operates the multiple-peril crop insurance (MPCI) program to provide farmers comprehensive protection against the risk of weather-related causes of income loss and certain other unavoidable perils. Insurance payments under MPCI are a function of individual farm-level realized crop yield, and the cost of MPCI depends on the distribution of farm-level yield for insured farms. Historically, the federal crop insurance program was plagued by low participation rates. Recent legislation has created additional forms of coverage and increased taxpayer-funded subsidies to encourage higher participation levels.1 Current laws and regulations include premium subsidies of 64% at the 75% coverage level to 100% at the 50% coverage level. Additional subsidies are employed to provide incentives for private insurers to market MPCI. These subsidies include payments for delivery expenses (marketing and service cost reimbursements), and reinsurance protection provided by the FCIC. Unlike private insurance, where the private insurance company charges premiums to finance the costs of indemnity payments and administrative expenses, federally-subsidized MPCI relies on the taxpayer to cover more than half the cost of this form of insurance. The need for such large premium subsidies to foster participation in MPCI is an economic puzzle. Theory implies that, with actuarially fair insurance,2 a risk-averse producer would prefer participating to foregoing coverage, risk-neutral producers would be indifferent between participating and not participating, and risk-loving individuals would prefer gambling over the full range of possible yield outcomes to the smoothing effect of insurance on realized income. With unsubsidized actuarially fair premiums, we expect full participation among riskaverse producers who capture increases in expected utility by insuring. Low participation rates without subsidies suggest either that most farmers are risklovers or that the insurance is perceived to be too costly for producers who would otherwise insure at actuarially fair rates. Many agricultural economists argue that a key factor causing low participation is a failure of premium rates to accurately reflect individual producer risk. There are several reasons this may be so, aside from the prima facie evidence of low participation levels: (1) The formula used to set rates is designed and computed by a single actuarial firm under government contract. All providers of insurance charge the same
A Learning Rule for Inferring Local Distributions Over Space and Time
297
centrally-determined rates, eliminating any competitive incentive for more accurate rate calculations. (2) The rates are computed by first pooling producers into discrete risk classes delineated by county, crop, practice, and elected coverage levels. Aggregate loss cost ratios3 (LCRs) are then computed within each risk class using twentyfive years of historical data. The result is an estimate of the average claim as a proportion of the maximum potential claim. This pooled average LCR is then multiplied by an adjustment factor that reflects the production practice, coverage level, and yield experiences for a given crop to arrive at a premium rate for each risk class. There is no theoretical or other a priori reason to assume this ad hoc classification scheme produces risk pools with the average LCR as a good estimate of expected claims for any given producer. Within a given risk class some farms will expect to profit by insuring and others will expect to lose money. (3) Although historical county-level yield data shows strong temporal dependence and spatial correlation in contemporaneous yields, current rating methods make almost no use of these properties. (4) The premium charged to an individual producer is proportional to the minimum of four- and a maximum of ten-year average of the farm-level crop yields. Due to the high variance of farm-level yields,4 a ten-year average is an imprecise measure of the expected farm yield and does not reliably measure the prospective yield risk. We develop a method to infer individual farm-level crop yield distributions by incorporating information from regional and county-level yield data into the farmlevel estimates. This method is based on information theory and optimal learning rules (Zellner, 2002, 2003) and minimum expected loss convex combinations of estimators of unknown parameters (Judge & Mittlehammer, 2004). This approach has several advantages: it makes efficient use of the spatial and temporal dependence inherent in the data generating process; it combines information available at the regional, county, and farm level to adjust for the aggregation bias inherent in estimates that are based only on regional or county level data; the estimated distributions are robust to departures from normality; the premium sensitivity to the volatility of the ten-year farm-level average yield is substantially reduced; and the learning rule is a coherent and efficient mechanism to update all of the needed parameter estimates as new information becomes available.
298
STEPHEN M. STOHS AND JEFFREY T. LAFRANCE
2. RELATED WORK Several authors have discussed the provisions and operations of the U.S. multipleperil crop insurance program at length. The survey by Knight and Coble (1997) provides a broad overview of the literature. Wright and Hewitt (1994) compare federal crop insurance programs in different countries and suggest reasons why insurance schemes fail to thrive without the support of large taxpayer-funded subsidies. Harwood et al. (1999) discuss agricultural risk management in general, and explain the provisions of the U.S. MPCI program. Details of the procedures for determining the premiums charged in the MPCI program are contained in Josephson et al. (2000). An efficient method for estimating seemingly unrelated regression equations (SUR) was originally introduced by Zellner (1962) and is described at length in Davidson and MacKinnon (1993) and Greene (2003). SUR is useful for estimating linear regression models with panel data subject to contemporaneous correlations across cross-sectional units of observation. In the present case, the panel data set is made up of Kansas county-level average winter wheat yields in the years 1947–2000. These data exhibit a high degree of spatial correlation. This spatial correlation is estimated and used to implement the estimated feasible generalized least squares version of SUR. The approach we use to combine information from the statewide-, county-, and farm-level yield data is based on an application of Bayes’ rule to the normal distribution with a known variance. This approach is explained in Gelman et al. (2003). Gelman et al. make an analogy between classical analysis of variance and empirical Bayes estimation, characterizing the estimates produced by the latter as a compromise between the null hypothesis that all units share a common mean and the alternative hypothesis that the mean differs across units. The Bayesian approach produces a convex combination of two estimates that balances the relative precision of the estimates under each of the two competing hypotheses. A general framework for obtaining efficient convex combinations of estimators is presented in Judge and Mittelhammer (2004). We use information theory to estimate farm-level crop yield distributions based on the principle of maximum entropy. This theory is described in Jaynes (1982), Zellner and Highfield (1988), Ormoneit and White (1999), and Tagliani (1984, 1993). Jaynes dispels the commonly held perception that maximum entropy is equivalent to maximum likelihood. In the latter case, the choice of statistical model is ad hoc. In the former, maximum entropy provides a rationale for both selecting and estimating the distribution generating the data.5 Zellner and Highfield apply the maximum entropy criterion to estimate probability density functions on the real line subject to a set of moment conditions – often referred to as the Hamburger
A Learning Rule for Inferring Local Distributions Over Space and Time
299
moment problem (Tagliani, 1984). Ormoneit and White develop an algorithm for computing maximum entropy densities on the real line and tabulate a large number of numerical results. In a pair of papers, Tagliani considers the case investigated by Zellner and Highfield as well as the Stieltjes moment problem, where the support of the probability density function is the positive half line. Crop yields are always nonnegative, so that the Stieltjes case applies to the problem of inferring local crop yield distributions. A number of researchers have focused on the role of ratemaking inaccuracies in fostering low participation in the crop insurance program (e.g. Just et al., 1999). The general theme is that a key factor in determining whether a producer will participate is whether the expected return to insuring is positive. If insurance rates are inaccurate, some farms will face an expected gain to insuring, while others will face an expected loss. The adverse selection problem arises when only those farmers who expect a positive net return to insuring choose to participate. The result is a pattern of persistent aggregate losses to the insurance program, and political pressure for premium subsidies to foster greater participation by producers who will expect financial losses from insuring when farmers are pooled into large risk classes. One factor that leads to insurance rating inaccuracies for individual farmers is the pooling of producers into large risk classes. The pooling procedure currently used is ad hoc and creates groups of individuals with heterogeneous risk profiles. Skees and Reed (1986) suggest farm-level rating as a means of avoiding the problems due to risk pooling. They argue that the coefficient of variation is a key statistic in measuring farm-level risk, and that current rating methods group producers with different coefficients of variation into the same risk pool. They also conjecture that farmers with higher mean yields tend to have lower coefficients of variation in yields, which implies that these producers should be charged lower insurance rates. The data and method developed here are conducive to addressing these questions, including whether the distribution of crop yields is negatively skewed, whether a higher mean crop yield tends to have a lower coefficient of variation, and whether a higher coefficient of variation is tantamount to a higher expected insurance claim.
3. THE MODELING APPROACH The modeling approach is based on the first four moments of the individual farmlevel yield distributions.6 We first decompose the farm-level yield into the sum of the county-level yield and a statistically independent farm-level residual. This decomposition facilitates estimation of the density parameters for the county-level yield and farm-level residual distributions. These can then be combined to provide
300
STEPHEN M. STOHS AND JEFFREY T. LAFRANCE
estimates of the local farm-level yield distributions. This procedure offers at least two advantages: (1) Incorporating information from the long time series of county-level yield data should result in more reliable estimates than those based on at most ten years of farm-level data. (2) Including the third and fourth moments in the estimated crop yield distributions accommodates the widely accepted properties of negative skewness and excess kurtosis in crop yield distributions (Just & Weninger, 1999). The first step in the modeling approach is to identify the relationship between the f farm- and county-level data generating processes. Let Y jt denote the period-t yield on farm j within county i(j). The farm-level yield is modeled by: f
f
(1)
Y jt = Y i(j),t + ␦i(j),j + j ηjt
(1)
The county-level yield in county i, in turn, is modeled as: (1)
(1)
Y it = it + i it
(2) (1)
(1)
where it is the expected county-level yield at time t, ␦i(j),j is the difference between the farm-level and the county-level trend, it is a standardized countyf level yield shock,7 jt is a standardized, idiosyncratic farm-level yield shock, √ (2) (2) i = i is the standard deviation of the county-level yield shock, where i is (3) the county-level yield variance, i = E3it is the county-level yield coefficient of √ (2) (4) skewness, i = E4it is the county-level yield coefficient of kurtosis, j = ␦j (2)
is the farm-level yield residual standard deviation, ␦j variance,
(3) ␦j
is the farm-level residual (4)
is the farm-level yield residual coefficient of skewness, ␦j
is the
f cov(it , jt )
farm-level residual coefficient of kurtosis, and = 0 ∀ i, j, t. 8 The centered moments of the farm-level yield distribution for farm j are given by: (1)
f
jt = E(Y jt ) (2)
j
(3)
j
f
(3)
f
(1)
= Var(Y jt ) = E (Y jt − j )2 f (1) 3 − ) (Y jt j f = Sk(Y jt ) = E (2) (j )3/2
(4)
(5)
A Learning Rule for Inferring Local Distributions Over Space and Time
(4)
j
f
= Ku(Y jt ) = E
f
(1)
(Y jt − j )4 (2)
(j )2
301
(6)
(k)
All higher-order moments j , for k = 2, 3, 4, are assumed to be time-invariant. Given these properties for the county- and farm-level yield distributions, it can be shown that the moments of the individual farm-level yield distribution are given by: (1)
(1)
(2)
(2)
(1)
jt = i(j),t + ␦i(j),j ,
(7)
(2)
jt = i(j) + ␦j , (3)
(3)
jt =
(8) (3)
3i i(j) + 3j ␦j
(2i(j) + 2j )3/2
,
(4)
(4)
jt =
(9) (4)
4i(j) i(j) + 62i(j) 2j + 4j ␦j (2i(j) + 2j )2
.
(10)
These formulas provide the foundation for combining estimates of the moments of the county-level yield distribution with the moments of the farm-level yield residual distribution. The final estimates of the moments and the local individual farm-level yield distributions are based on these combined moment estimates. The second step in the modeling process is to identify the county-level yield distribution. Seemingly unrelated regression (SUR) methods are used to estimate county-level yield trends using National Agricultural Statistics Service (NASS) data for Kansas winter wheat over the period from 1947 to 2000 (T = 54 years, M = 105 counties). We use iterative SUR on a linear trend model of individual county-level yield trends,9 Y it = ␣i + i t + it .
(11)
The linear time trend captures predictable patterns of technological growth, while the error term includes weather-related productivity shocks and other unpredictable and random components of technological change. The first step in estimating (11)
302
STEPHEN M. STOHS AND JEFFREY T. LAFRANCE
applies ordinary least squares (OLS) to each county yield trend, 1 1 y1  y 2 2 2 = (X ⊗ I M ) .. + .. , .. . . . M
M
yM
(12)
where,
1 1 . . X≡ .. .. , 1 T
≡
␣i , i
i1 . and i ≡ .. . iT
The second step estimates the covariance structure across counties as a function of the distance between the geographic centroids of the counties. Let ˆ it be the estimated OLS residual for county i in period t, let: T
ˆ ij =
1 ˆ it ˆ jt T
(13)
t=1
be the estimated sample covariance, and let: ˆ ij ˆ ij = ˆ ii ˆ jj
(14)
be the estimated correlation coefficient between counties i and j. We apply the Singh-Nagar procedure to estimate an exponential quadratic correlation function,10 ˆ ij = exp{␥1 h ij + ␥2 h 2ij } + u ij ,
(15)
where h ij is the distance between the centroids of counties i and j. Robust weighted least squares is used to correct for heteroskedasticity in the correlation errors. The large sample size (5,670 observations) and the consistency of nonlinear least squares in the presence of heteroskedasticity suggests that this correction should have little effect on the estimated correlation function or the structural yield trend model parameters; indeed this is what we found. We iterate on all of the model parameters until the estimated correlation function converges.11 The final prediction of the correlation function (with robust standard errors in parenthesis beneath the estimated coefficients) is:
˜ ij = exp −3.370 × 10−3 h ij − 1.067 × 10−5 h 2ij , (6.088×10−5 )
(3.718×10−7 )
(16)
A Learning Rule for Inferring Local Distributions Over Space and Time
303
Fig. 1. Estimated and Predicted County-Level Yield Correlations.
with an R 2 = 0.835, measured as the squared correlation between the predicted and estimated trend residual correlations. A plot of the estimated and predicted county-level yield correlations as functions of distance between counties (in miles) is presented in Fig. 1. This figure illustrates the high level of spatial correlation between contemporaneous county-level yields. This exponential quadratic also appears to be a reasonable approximation to the spatial dependence of the countylevel yield trend residuals.12 Next, define the M × M predicted correlation matrix by: 1 ˜ 12 · · · ˜ 1M ˜ 1 · · · ˜ 2M 21 R= . (17) .. .. .. . . . . . ˜ M1
˜ M2
···
1
the diagonal matrix of estimated standard errors by S = diag [ˆ ii ], and the ˆ = SRS. We estimated covariance matrix for the county-level trend residuals by apply the Jarque-Bera (J-B) test to the standardized residuals z t = L ˆ t , where (SRS)−1 = LL ′ , L a lower triangular matrix that gives the Cholesky factorization for the inverse covariance matrix. Under the null hypothesis that the ε’s are
304
STEPHEN M. STOHS AND JEFFREY T. LAFRANCE
normally distributed random variables, the J-B test statistic is asymptotically distributed as a Chi-squared random variable with two degrees of freedom. The calculated value of the J-B test statistic is 767.7, while the one percent critical value for a 2 (2) random variable is 9.21. Hence, the J-B test rejects the normal distribution at all reasonable levels of significance. We conclude that the countylevel yield data is not normally distributed. It follows that farm-level densities also are not normally distributed (Feller, 1971). In principle, given our estimated county-level yield trends and estimates of the farm-level residual moments, we can calculate estimates of the individual farmlevel yield distributions. However, the farm-level residual moments are based on at most ten years of data, and the high variation in farm-level crop yields over time makes these very unreliable estimnates. To illustrate this issue, we simulated 20,000 simulated samples of size ten from the standard normal distribution. The population mean, variance, skewness and kurtosis are well-known to be 0, 1, 0 and 3, respectively. Figure 2 is a scatter plot of the 20,000 simulated sample variances against the corresponding sample means. Figure 3 is a scatter plot of the 20,000 sample kurtosis against the sample skewness. For comparison, Fig. 4 is a scatter plot of the 20,720 farm-level residual variances against the corresponding sample means, based on the ten-year APH yields for all winter wheat producers in Kansas who participated in the MPCI program in each year in 1991–2000. These farmlevel yield data are not normalized, and the farm-level statistics are positive-valued and have large ranges for the mean and variance relative to the standard normal. However, the qualitative characteristics are similar in both data sets. Figure 5 is a scatter plot of the farm-level residual kurtosis on the corresponding farm-level residual skewness, again based on the ten-year APH yields. The simulation results from a standard normal distribution compared to the farm-level data shows that a substantial proportion of the variation in the farm-level statistics is purely due to sampling error. The remainder of this chapter focuses on a learning rule to reduce the uncertainty surrounding individual farm-level crop yield distributions. The learning rule we develop efficiently exploits the large cross-sectional but short time-series nature of the farm-level data and the long time series nature of the county-level data to obtain more precise estimates of the local farm-level yield distributions. An outline of this procedure is the following: (1) We begin with a diffuse prior and a normal likelihood for each of the mean, variance, skewness and kurtosis of the farm-level residual distributions.13 We interpret these farm-level residual distribution moments as independent random observations from the likelihood. We then apply Bayes’ rule sequentially in two steps. First, we update the statewide average over all
A Learning Rule for Inferring Local Distributions Over Space and Time
Fig. 2. Mean and Variance for 20,000 Standard Normal Random Samples of Size 10. 305
306
STEPHEN M. STOHS AND JEFFREY T. LAFRANCE
Fig. 3. Skewness and Kurtosis for 20,000 Standard Normal Random Samples of Size 10.
Fig. 4. 10-Year APH Mean and Variance of 20,720 Kansas Winter Wheat Farms.
A Learning Rule for Inferring Local Distributions Over Space and Time
307
Fig. 5. 10-Year APH Skewness and Kurtosis for 20,720 Kansas Winter Wheat Farms.
20,720 farms of the farm-level yield residual moments to a corresponding yield residual moment for each county. Second, we update from these county-level yield residual moments to the individual farm-level yield residual moments. The formula for updating the mean of a normally distributed random variable with new information is: 1 =
(1/20 )0 + (1/2 )y (1/20 ) + (1/2 )
,
(18)
where 0 and 20 are the parameters of the prior distribution, y and 2 are the observation and its variance, and 1 is the posterior mode. The corresponding formula for updating the variance is: 21 =
1 (1/20 ) + (1/2 )
,
(19)
where 21 is the posterior variance. In each step, these formulas are used to update the prior information using the likelihood function to reflect the new information for the case at hand. (2) A similar procedure is used to obtain moment estimates for county-level yield distributions.
308
STEPHEN M. STOHS AND JEFFREY T. LAFRANCE
(3) The county-level yield moments are combined with the farm-level residual moments to obtain estimates of the moments of farm-level yield distributions. These moment estimates provide the input for the maximum entropy estimation of the individual farm-level yield densities. In the more detailed discussion below, w and d denote the sample estimates of county-level yield moments and farm-level residual moments, ˆ and ␦ˆ denote the respective posterior updates, and ˆ denotes posterior updates of the farmlevel yield distribution moments. The corresponding population parameters are , ␦, and , a subscript j indexes farm-number, i(j) indexes the county in which farm j is located, and a superscript (k) denotes the order of the moment under consideration. 3.1. Constructing the Farm-level Moments The assumed relationship between county-level and farm-level yield is: f
(1)
f
Y jt = Y i(j),t + ␦i(j),j + j jt .
(20)
Rearranging and taking the expectation of both sides shows that: f
(1)
␦i(j),j = E(Y jt − Y i(j),t ).
(21)
f
Let Y¯ j represent the ten-year farm-level APH average yield on farm j, and Y¯ i(j) the ten-year average county-level yield for county i(j). Under the assumption that the expected difference between an individual farm-level yield and its corresponding county-level yield is time-invariant, an unbiased estimate of the difference is given by: t f s=t−9 Y js − Y i(j),s f (1) . (22) d j = Y¯ jt − Y¯ i(j),t = 10 The farm-level residuals are then defined by the difference between the farm-level yield deviation from the ten-year average farm-level yield, and the county-level yield deviation from the ten-year average county-level yield, f
f
f
ˆ jt = (Y jt − Y¯ jt ) − (Y i(j),t − Y¯ i(j),t ), The higher order farm-level residual sample moments are: 10 f k jt ) t=1 (ˆ (k) dj = , 10
(23)
(24)
A Learning Rule for Inferring Local Distributions Over Space and Time
309
for k = 2, 3, 4. Under the hypothesis of common residual moments across all 20,720 farms in the sample, pooled moments for the farm-level residuals are: N (k) (k) j=1 d j dp = (25) N with associated variance estimates of: N V
(k) dp
(k) j=1 (d j
=
(k)
− d p )2
N(N − 1)
,
(26)
for k = 1, 2, 3, 4. These sample moments and their associated sample variances were used to construct the pooled prior distribution for each farm-level residual moment: (k)
␦j
(k)
(k) dp
∼ N(d p , V ),
(27)
as the first step in our learning rule. An alternative to the common-moment hypothesis is that the moments vary geographically by county, but farms within each county share common residual moments. This alternative would be supported by the data if the cross-county variation in moment estimates were relatively large compared to within-county variation. On the other hand, if within-county variation were relatively large compared to cross-county variation, the pooled moment hypothesis would be supported. The classical resolution of this dichotomy applies analysis of variance. If the null hypothesis is accepted, the pooled sample moments are used, while if it is rejected, the county-specific moment estimators are used instead. A Bayesian approach resolves the dichotomy through a compromise rather than an absolute choice between the two hypotheses. Under the alternative hypothesis, the county-specific moments and their variances may be estimated using: (k) (k) j∈J i d j di = , (28) Ni and (k)
V
(k) di
=
Vi , Ni
(29)
where J i is the set of farms in county i, N i is the number of farms in county i, and: (k) 2 (k) j∈J i (d j − d i ) (k) (30) Vi = Ni − 1
310
STEPHEN M. STOHS AND JEFFREY T. LAFRANCE
is the sample variance of the corresponding farm-level sample moment for county i. This method computes the posterior mode as the precision-weighted average between the sample pooled moment and the county-level sample moment, (k) ␦¯ i =
(k) (k) (k) (k) (1/V d¯ )d¯ p + (1/V d¯ )d¯ i p
i
(k) (k) (1/V d¯ ) + (1/V d¯ ) p i
,
(31)
with posterior variance given by: 1
(k)
V ␦¯ = i
(k) (k) (1/V d¯ ) + (1/V d¯ ) p i
.
(32)
It is straightforward to show that the posterior variance is less than the smaller of (k) (k) the prior variance, V , and the likelihood variance, V .14 Hence the posterior dp
di
distribution has a mean which is the precision-weighted average of the prior mean and the likelihood mean, and a variance which is smaller than the variances of either the prior or the likelihood. After updating the farm-level residual moments from the pooled statewide level to the county level, the question remains whether there are significant differences in these moments at the individual farm level within each county. In principle, if the variation between farm-level moments is large relative to the sampling variation, then farm-level moment estimates should differ across the farms in each county. Conversely, if the sampling variance in the farm-level moment estimates is large relative to the inter-farm variation in moments for farms within a given county, then there is little basis for separate estimates across farms. An attractive aspect of the updating procedure we have developed and applied to this problem is that, at any stage, new information can be taken into account by treating the posterior from the previous update as the new prior and entering the new information through the likelihood function. We therefore update from county-level to farm-level residual moments with the formulas, (k) ␦ˆ i
=
(k) ¯ (k) (k) (k) (1/V ␦i(j) )␦i(j) + (1/V d j )d j ¯ (k)
(k)
(33)
(1/V ␦i(j) ) + (1/V d j ) ¯
and: (k) ␦j
Vˆ =
1 (k) (k) (1/V ␦i(j) ) + (1/V d j ) ¯
.
(34)
A Learning Rule for Inferring Local Distributions Over Space and Time
311
3.2. Constructing the County-level Sample Moments Initial estimates of the moments of the county-level yield distributions are based on the trend regressions. The estimated county-specific trends provide estimates of the county-level mean yield, and exhibit a small standard error. After we have calculated the iterative feasible GLS estimators for the county-level yield trends, the county-specific means are not updated in our implementation of the learning rule. Each estimated county-level yield trend is: (1)
ˆi
= ␣ˆ i + ˆ i t 0 ,
(35)
with corresponding variance equal to: (1)
V ˆ i = Var(␣ˆ i + βˆ i t 0 ) = Var(␣ˆ i ) + 2t 0 Cov(␣ˆ i , ˆ i ) + t 20 Var(ˆ i )
(36)
The parameter estimates ␣ˆ i and ˆ i are the iterative SUR regression coefficients, t 0 = T + 1 is the next period after the last observation in the data, and the estimated variances and covariance in the variance formula are obtained from the countyspecific elements of the SUR covariance matrix for the structural trend parameter estimates. Higher moments for the county-level yield distribution are initially estimated using the sample moments of county-specific residuals. These consistently estimate the corresponding population moments when the trend regression is correctly specified. The hyperparameters (mean and variance) of the estimated variance are computed as: (2) wi
=
T
ˆ 2it t=1
T−2
,
(37)
and: V (2) wi
(2) 1/(T − 2) Tt=1 ˆ 4it − [wi ]2 . = T
(38)
The county-level sample skewness and kurtosis are calculated with the normalized (2)
residuals, uˆ it = ˆ it / wi .15 The formulas for the residual skewness (k = 3) and kurtosis (k = 4) are: T
(k)
wi
=
1 k uˆ it (T − 2) t=1
(39)
312
STEPHEN M. STOHS AND JEFFREY T. LAFRANCE
and: (k) 1/(T − 2) Tt=1 uˆ kit − [wi ]2 . (40) T The updating procedure for the variance, skewness and kurtosis of the county-level residuals is the same for k = 2, 3, 4, and is described generically for the three cases. Under the hypothesis of a common kth moment across all counties, a pooled prior mean and variance are computed from the county-specific moment estimates: M (k) i=1 wi w(k) , (41) = p M and: M (k) (k) (w − wp )2 (k) V wp = i=1 i , (42) M (M − 1) V (k) wi =
where M = 105 is the number of counties in the data. These pooled priors are combined with the county-specific sample moments, again using Bayes’ rule, to generate posterior estimates of the higher order moments of the county-level yield distributions, (k)
(k)
ˆi
=
(k)
(k)
(k)
(1/V wp )wp + (1/V wi )wi (k)
(k)
,
(43)
(1/V wp ) + (1/V wi )
and: 1
(k)
V ˆ i =
(k) (k) (1/V wp ) + (1/V wi )
.
(44)
The means and variances of the posterior county-level yield moments and the posterior farm-level residual moments can be combined through the above moment decomposition formulas to obtain the final posterior means and variances of the farm-level density parameters. For the mean and variance of farm-level yield, the formulas are additive, (k)
ˆj
(k)
(k)
= ˆ i(j) + ␦ˆ j ,
(45)
and: (k)
(k)
(k)
V ˆ = V ˆ i(j) + V ␦¯ , j
j
(46)
for k = 1, 2. These formulas define the posterior mean and variance hyperparameters for the respective mean (k = 1) and variance (k = 2) of the farm-level yield distribution. For calculating the farm-level maximum entropy
A Learning Rule for Inferring Local Distributions Over Space and Time
313
distributions, it is useful to characterize the farm-level yield distributions in terms of dimensionless statistics – the coefficients of variation, skewness, and kurtosis. The coefficient of variation is the ratio of the standard deviation to the mean of a random variable. The final farm-level posterior for the coefficient of variation is calculated as: (2) ˆj (47) ␥ˆ j = (1) . ˆj The final farm-level posterior coefficient of skewness is: (3) ˆj
=
(3) 3 (3) ˆ i(j) + ˆ 3j ␦ˆ j ␦ˆ i(j)
(ˆ 2i(j) + ˆ 2j )3/2
,
(48)
while the final farm-level posterior coefficient of kurtosis is: (4)
(4) ˆj
=
(4)
ˆ 4i(j) ˆ i(j) + 6ˆ 2i(j) ˆ 2j + ˆ 4j ␦ˆ j (ˆ 2i(j) + ˆ 2j )2
(49)
Equations (45)–(49) produce the required summary statistics for inferring each local farm-level yield distribution.
4. RESULTS OF APPLYING THE LEARNING RULE Figure 6 illustrates the steps in the learning rule as two parallel sequences of calculations. The learning rule at the county level results in posterior moments (the mean, variance, skewness, and kurtosis) of the yield distribution in each county. Essentially the same sequence of steps results in farm-level posterior moments of the farm-level residual deviation from the county-level yield. The last stage in the learning rule uses Eqs (47)–(49) to combine the posterior moments for the county-level yield distribution with the posterior moments for the farm-level residual distribution to obtain the final updates of the farm-level yield distributions. To demonstrate the effects of the updating procedure, we present two collections of four graphs each. The first collection illustrates estimates of the moments of the farm-level residual distribution at each stage of the learning rule. The second collection compares estimates of the moments of the county-level yield distributions on the horizontal scale to estimates of moments of the farm-level yield distributions on the vertical scale. In the first group of four graphs, the cross-hairs in each case represent the mean of the initial, pooled prior distribution. Each point in the plot represents an estimate of the applicable moment from an individual farm at
314
STEPHEN M. STOHS AND JEFFREY T. LAFRANCE
Fig. 6. Flowchart of the Learning Rule Applied to Kansas Winter Wheat Yields.
two different stages in the process. The horizontal position of a point corresponds to the estimate after updating to the county-level; the vertical position of each point represents the estimate after updating to the farm-level. Figure 7 illustrates the results of the learning rule for the farm-level residual mean. The solid horizontal and vertical line segments that intersect in the interior of the graph indicate the position of the pooled mean of –0.82 on the horizontal and vertical scales.16 The horizontal position of each point in the scatter represents the county-level update of the farm-level residual mean; in other words, each vertical cluster of points represents a group of farms from within one particular county. The vertical position of each point represents the farm-level update of the farm-level residual mean. The initial stage of updating from the pooled-level to the county-level results in a range of values on the horizontal scale from slightly below −4 to slightly below 5. There apparently is a significant difference across counties in the deviation of the farm-level mean yield for insured farms and the mean county yield. The second stage of updating is indicated by the vertical spread of points about the diagonal 45◦ line segment. With the exception of one large outlier, the posterior farm-level
A Learning Rule for Inferring Local Distributions Over Space and Time
315
Fig. 7. Posterior Farm-Level Residual Mean.
residual means fall in a range similar to the posterior county-level means. Each vertical cluster of points over the 45◦ line represents the variation in the farm-level posterior parameter estimates for the insured farms in a specific county. Points which lie above the 45◦ line segment have a higher residual mean than average for the county, while points below the 45◦ line have a below-average residual mean for their respective county. Summarizing the results depicted in Fig. 7, the main qualitative features are as follows: (1) For the majority of farms, the posterior county-level residual mean is the most significant determinant of the final farm-level residual mean, as indicated by the fairly tight vertical clusters of points about the diagonal line segment. The indication is that geographical variation across counties is an important explanatory variable for the difference between the farm-level yields of insured farms and the county-level yield. (2) For most farms, the additional variation captured by applying the learning rule a second time to update from county-level to farm-level residuals is relatively small within each county group. But for a handful of farms, the departure of the farm-level yield experience from the county average is significant, producing a small number of relatively large outliers. These outliers illustrate the ability of the learning rule to distinguish atypical cases.
316
STEPHEN M. STOHS AND JEFFREY T. LAFRANCE
Fig. 8. Posterior Farm-Level Residual Variance.
Figure 8 shows the results of applying the same updating process to the farm-level residual variance. The parameter measures the residual variance of farm-level yields from the county-level average, and represents that part of the farm-level yield variance which is missing from the county-level yield variance due to averaging. The construction of this graph is similar to Fig. 7. The pooled variance is represented by the horizontal and vertical solid line segments that cross in the interior of the graph. The large value for the pooled variance (approximately 90) indicates that there is a significant variation in farm-level yields beyond the variation in county-level yields. The qualitative properties of Fig. 8 are similar to those of Fig. 7. The majority of the variation in the posterior farm-level variance is accounted for by updating to the county-level mean of the residual yield variance. A relatively small additional amount of variation captured by updating further to the individual farm-level. For a small number of cases, the posterior farm-level variance again differs substantially from the county-level posterior variance. Figures 9 and 10 show the results of applying the learning rule to the farmlevel residual skewness and kurtosis, respectively. The qualitative features of these graphs are similar to those for the mean and variance, and hence essentially the same descriptive comments about the updating process and its implications apply. It is interesting to note that the average farm-level skewness is below –0.2, while the average farm-level kurtosis is 3.5. Given that these averages represent consistent estimates of the pooled mean residual skewness and kurtosis, and that they were
A Learning Rule for Inferring Local Distributions Over Space and Time
Fig. 9. Posterior Farm-Level Residual Skewness.
Fig. 10. Posterior Farm-Level Residual Kurtosis.
317
318
STEPHEN M. STOHS AND JEFFREY T. LAFRANCE
computed over 20,720 farms, a significant departure from the normal distribution is suggested.17 Figures 11–14 are scatter diagrams of the final posterior estimates of the county yield moments on the horizontal scale and the corresponding final posterior estimates of the farm-level moments on the vertical scale. The diagonal line segment in each graph is a 45◦ line; points on the line have identical county-level and farm-level moments. The county-level moments are the result of applying the learning rule to the residuals of the county-level yield regressions, without taking into consideration the individual farm-level data. The farm-level moments utilize the formulas for combining the county-level moment estimates with the farm-level residual moments to obtain the final posterior estimates of the four moments of each farm-level yield distribution. Figure 11 compares the updated county-level yield means to the final updates of the farm-level yield means. Although both sets of yield means have similar ranges, the farm-level mean yields are on average less than the county-level mean yields. This implies that insured farms tend to have yields that are less than the countylevel average yields, and supports the argument for adverse selection and/or moral hazard in the federal crop insurance program. Figure 12 compares the posterior county-level variances to the final posterior estimates of the farm-level variances. These farm-level variances exhibit a striking
Fig. 11. Final Posterior Farm-Level Mean Yield.
A Learning Rule for Inferring Local Distributions Over Space and Time
Fig. 12. Final Posterior Farm-Level Yield Variance.
Fig. 13. Final Posterior Farm-Level Yield Skewness.
319
320
STEPHEN M. STOHS AND JEFFREY T. LAFRANCE
Fig. 14. Final Posterior Farm-Level Yield Kurtosis.
departure from the county-level yield variances, demonstrating a relatively large contribution of farm-level variability that can be masked by aggregating yields into county-level averages. The county-level variances fall within a narrow range, suggesting only small differences in yield variability across counties. In contrast, the farm-level variances exhibit a wide range of variation, reflecting the differences in the farm-level yield variance even after shrinking to the county-level estimates with our learning rule. This supports arguments that there are substantial differences in yield risk across farms and that farm-level yield distributions are considerably more uncertain than county-level distributions. Figure 13 compares the posterior estimates of county-level skewness to the final estimates of the farm-level skewness. The county-level skewness is on average below and has a narrower range than the farm-level skewness. This again reflects the greater variation in the individual farm-level yields. It also can be partially attributed to the fact that the sum of two independent random variables with negatively skewed distributions tends to be less skewed. Similarly, Fig. 14 compares the posterior estimates of the county-level kurtosis to the final posterior estimates of the farm-level kurtosis. In the great majority of cases, the county-level kurtosis falls in the range 3.0–3.2, while the farm-level kurtosis falls in the range 3.0–3.5. In a few cases, however, the farm-level kurtosis exceeds 3.5, while in some others it is considerably less than 3.0.
A Learning Rule for Inferring Local Distributions Over Space and Time
321
Overall, there are appreciable differences across farms in the four central moments we are using to characterize individual yield distributions. In addition, for most farms the hypothesis that crop yields follow a normal distribution is not supported by these data. In light of these two observations, we calculate individual farm-level yield distributions using the information theoretic method of maximum entropy. The resulting family of estimates nests the normal distribution as a special case, while allowing for departures from normality such as those that appear to characterize these crop yield data. The information theoretic approach to this problem is the topic of the next section.
5. INFERRING THE LOCAL CROP YIELD DISTRIBUTIONS We seek to estimate or infer the probability density function (pdf) for a data generating process (dgp). In addition, we want to be able to use new information as it becomes available to learn about the pdf and to update and revise our inferences. Ex ante we are essentially ignorant of the nature of the distribution generating the data. In particular, we do not have well-formed prior beliefs about the functional form or values of the parameters of the pdf. However, we do possess a set of sample statistics obtained from a sample of observations drawn from the distribution with an unknown pdf, f(x). For a large class of dgp’s, the sample moments are good estimators of their population counterparts, and in many respects they are best. Suppose, therefore, that we calculate or are presented with K sample moments from a sample of data, n
mi =
1 i x j , i = 1, . . . , K. n
(50)
j=1
The principle of maximum entropy (MAXENT) takes these sample moments to be a set of sufficient statistics for the corresponding population moments, and chooses a continuous pdf on the support X ⊂ R that maximizes the entropy of the density function, E ≡ − f(x) ln{f(x)} dx, (51) X
subject to a set of constraints. These constraints are defined by: (a) non-negativity of the pdf; (b) the total probability must equal one; and (c) the moments of the resulting distribution match those of the sample:
322
(a) (b) (c)
STEPHEN M. STOHS AND JEFFREY T. LAFRANCE
f (x) ≥ 0 ∀ x ∈ X; f(x) dx = 1; and X i x X f(x) dx = m i ∀ i = 1, . . . , K.
The entropy criterion developed by Shannon (1948) is formally equivalent to minimizing the information that we extract from the sample data, as represented by the K sample moments. That is, by definition, we agree that the sum of information and entropy is zero, similar to the physical laws of conservation of mass and energy and the dissipation of organized energy into disorganized energy, or entropy. Entropy satisfies several reasonable axioms, carefully detailed by Shannon, that uniquely define the criterion up to an unknown constant of integration. Perhaps the most important property of entropy, and in particular, the Kullback-Leibler crossentropy criterion, is that it is the foundation of optimal information processing as described and elaborated by Zellner (2002). This means, among other things, that the amount of information that is generated as the output of the information processing system is precisely equal to the amount of information used as inputs. In addition, the output at each stage is invariant to the order of the inputs used up to that stage. In other words, if we start with a subset of the original sample, and sequentially update the pdf by adding additional parts of the data, the result at each stage is the same regardless of the order in which the observations used up to that stage have been added, so long as the total amount of data (information) used remains the same. Finally, a remarkable property of the information processing methodology described by Zellner (2002) is that Bayes’ Theorem is a corollary to the information processing rule. That is, Bayes’s rule is a necessary condition for processing information efficiently. A simple, heuristic interpretation of a MAXENT density function is that it minimizes the average logarithmic height of the density function, subject to the constraints that the moments of the sample are equal to the post-data population moments of the pdf. This implies that the density function is as flat as possible, in this sense, subject to the information that is contained in non-negativity, integrability, and moment conditions introduced from the data. Another useful interpretation of entropy is that is minimizes the Kullback-Leibler cross-entropy pseudo-distance between the posterior pdf and the uniform distribution. For a proper prior pdf, say f 0 (x), the Kullback-Leibler cross-entropy criterion is: f(x) C ≡ f(x) ln dx (52) f 0 (x) X When f 0 (x) is the uniform pdf on X, the problem of maximizing (52) subject to the constraints (a)–(c) is equivalent to a continuous time dynamic optimization
A Learning Rule for Inferring Local Distributions Over Space and Time
323
problem with no differential constraints and K + 1 integral, or isoperimetric, constraints. This type of problem is well understood in the theory of optimal control. The unique solution can be obtained using the Lagrangean, K i m i − x i f(x) dx L = − f(x)ln{f(x)} dx + 0 1 − f(x) dx + X
X
=−
X
f(x) ln{f(x)} + 0 +
K
X
i=1
i x
i
i=1
dx + 0 +
K i m i ,
(53)
i=1
and maximizing L pointwise with respect to the choice of f(x) ∀ x ∈ X. By strict concavity of the integrand in f, the first-order necessary and sufficient conditions for the unique optimal solution are:
K i − 1 + ln[f(x)] + 0 + i x = 0 ∀ x ∈ X, (54) i=1
together with the constraints (a)–(c). Solving for f (·) gives:
K i f(x) = exp − 1 + 0 + i x ∀x ∈ X.
(55)
i=1
If we then integrate (55) over x and substitute this into the probability constraint (b), we can solve for e −(1+0 ) in the form: e −(1+0 ) =
1 K i dx exp − x i i=1 X
(56)
which acts as a normalizing factor, and the pdf has the exponential polynomial form: i x exp − K i i=1 ∀ x ∈ X. (57) f(x) = K i dy exp − y i=1 i X
The number of terms in the polynomial is precisely equal to the number of moments that we estimate from the original data set. The Lagrange multipliers, i , i = 1, . . . , K, are obtained as the solution to the system of K moment conditions, i K i dx x exp − x i i=1 X = m i , i = 1, . . . , K. (58) x i f(x) dx = K i dx X exp − x i i=1 X
324
STEPHEN M. STOHS AND JEFFREY T. LAFRANCE
The system of equations in (58) is highly nonlinear and there does not exist any closed form solution for the Lagrange multipliers. Ormoneit and White (1999) developed an efficient algorithm to compute the Lagrange multipliers and implemented and tabulated an exhaustive set of values for K = 4, for a centered and standardized random variable with zero mean and unit variance on the real line, using standardized central third and fourth moments, µ3 = (m 3 − 3m 2 m 1 + 2m 31 )/(m 2 − m 21 )3/2 and µ4 = (m 4 − 4m 3 m 1 + 6m 2 m 21 − 3m 41 )/(m 2 − m 21 )2 . Wu (2003) refined and improved the numerical methods for estimating MAXENT densities with any number of moment conditions for both compact and open supports, including the entire real line. Given the final posterior estimates of the farm-level moments, we calculate their dimensionless counterparts (coefficients of variation, skewness, and kurtosis) for each farm. We then take these dimensionless shape parameters as the information set and apply the principle of maximum entropy to construct individualized farmlevel crop yield distributions. As is explained below, the dimensionless coefficients represent a set of sufficient statistics for the crop insurance premium when it is expressed as a percentage of the expected yield. Because the final estimates of the coefficients of skewness and kurtosis both lie in narrow ranges, while the final estimates of the coefficient of variation lie in a much wider range, we consider three cases: (1) the farm with the minimum coefficient of variation; (2) the farm with the median coefficient of variation; and (3) the farm with the maximum coefficient of variation. In each case, we use the corresponding final estimates of the skewness and kurtosis for the chosen farm. The maximum entropy density parameters in each are computed by identifying the exponential quartic density that minimizes the dual objective function, given the four moments as the information set for each unknown crop yield distribution.18 We next compute the farm-level crop insurance premiums for each of the three representative cases using the maximum entropy density. For illustration, the premiums are computed as a percentage of the coverage level, c, at 65, 75, and 85% of the expected farm-level yield. Let X denote the random variable for yield. Then an actuarially fair insurance premium may be computed using the formula: c E c−X|X < c Pr X < c cPr {X < c} − 0 xf (x) dx P= × 100% = c c c c xf (x) dx ×100% = f (x) dx − 0 × 100%. (59) c 0
A Learning Rule for Inferring Local Distributions Over Space and Time
325
Table 1. Information Theoretic MPCI Premiums for Kansas Winter Wheat. Coverage Level (%) 65 75 85
Minimum CV (%)
Median CV (%)
0.8 1.8 3.6
1.7 3.1 5.2
Maximum CV (%) 2.2 3.7 5.9
Equivalently, changing variables to rescale X to a unit mean, Y = X/ where = E(X), we obtain:19 c c 0 yg (y) dy × 100%. (60) P= g (y) dy − c 0 The fair premium can be computed from this rescaled density. This implies that the coefficients of variation, skewness, and kurtosis constitute a set of complete sufficient statistics for fair bet premiums for the exponential quartic class of distributions. Table 1 displays the results of these premium calculations. As expected, the fair bet premiums increase monotonically with the coverage level and the coefficient of variation. Figure 15 illustrates the advantage of inferring farm-level crop yield distributions with an active learning rule to set MPCI insurance rates. The figure displays three kernel density estimates for three premium distributions for all wheat in Morton County, Kansas: (1) using information theoretic yield distributions; (2) using reported APH mean yields; and (3) using trend-adjusted APH mean yields. The actuarially fair premiums are calculated by applying quadrature to each farm-level maximum entropy distribution, treating these distributions for the farm-level yields as the true dgp. The distribution of the unadjusted APH insurance premium is estimated by first computing an estimate of the average loss cost ratio in the county. For each farm in the Morton County sample, a loss ratio equal to the expected loss as a percent of the APH average yield is computed by: E c y¯ i − X|X < c y¯ i Pr {X < c y¯ i } L(¯yi ) = ; (61) y¯ i that is, by computing the expected premium as a percentage of mean yield, based on the ten-year APH average yield, y¯ i . These farm-level expected loss ratios are averaged to obtain a proxy for the average LCR. This average LCR is then
326
STEPHEN M. STOHS AND JEFFREY T. LAFRANCE
Fig. 15. Kernel Density Estimates of MPCI Premiums for Winter Wheat in Morton County, Kansas.
multiplied by the farm-level APH average yields to calculate the distribution of APH-based premiums. A similar procedure is used to compute adjusted APH-based premiums, except that the APH yields are adjusted for trend growth. Because the APH yields are computed in the current FCIC ratemaking process without a trend adjustment, a ten-year average of APH yields actually estimates the expected yield net of 4.5 years of trend growth. To see this, note that: f
f
(1)
Y j,t−k = ␣i(j) + i(j) (t − k) + ␦i(j),j + i(j),t−k + j,t−k
(62)
f
where i(j),t−k is the county-level yield shock, j,t−k is the local farm-level yield f
shock, and E(i(j),t−k ) = E(j,t−k ) = 0. The APH mean is given by: f Y¯ j =
9
f
Y j,t−k ,
(63)
k=0
which has expected value: f
(1)
f
E(Y¯ j ) = ␣i(j) + i(j) (t − 4.5) + ␦i(j),j = E(Y j,t ) − 4.5i(j),
(64)
A Learning Rule for Inferring Local Distributions Over Space and Time
327
In light of this downward bias, an adjustment for 4.5 years of trend growth is added to the APH average yield for each farm before computing premiums in the adjusted case. The nonparametric kernel density estimates in Fig. 15 illustrate the differences between these three approaches to calculating insurance premiums. The variation in the premiums based on the learning rule and information theory reflects differences in the posterior farm-level moments. The range in the premiums with this method is relatively narrow, from about 3.25% up to 4.2% of the expected yield. In contrast, the two APH premium distributions have a considerably larger range. The unadjusted APH kernel estimate also has a downward bias due to the failure to reflect the growth trend in crop yields when calculating insurance premiums. We can measure the differences between the adjusted APH and information theoretic premium calculations cases with a decomposition of the mean square error that treats the maximum entropy premium as the true premium and the APH premium as an estimator. Let P represent the actual premium and Pˆ the adjusted APH premium. The standard mean square error decomposition is: ˆ = E(Pˆ − P)2 = var(P) ˆ + E(P) ˆ −P 2. MSE(P)
(65)
Calculating the right-hand-side terms for the adjusted APH premiums in Morton County produces a variance of 1.14 and bias of 0.17 percentage points. Though it is impossible to know the true extent that these results represent actual experience, two conclusions are suggested. First, the variance of APH premiums is large compared to the magnitude of the premiums. Because yields exhibit a large degree of variation from one year to the next, a large share of the variance in APH premiums is purely due to sampling variation, rather than variations in yield risk. Second, the current approach using the unadjusted 10-year average APH yield results in a significant negative bias in premiums relative to the actuarially fair values. Finally, the bias in APH premiums can be reduced by adjusting the APH mean for trend, though this will not mitigate the high variance in the 10-year APH yields and the associated MPCI premium rates. We therefore conclude that the current practice of computing premiums based on multiplying pooled average loss cost ratios by 10-year APH average yields is subject to bias and high year-to-year variance. Both are likely to contribute to the historically low participation rates in the MPCI crop insurance program and the mounting political pressure for greater premium subsidies to induce greater farmer participation. The incentive bias from pooling farmers with heterogeneous risk profiles into a single risk class results in a transfer payment through insurance premiums from low-risk farmers to high-risk farmers within each pool. Low-risk farms thus optimally forego participation, unless they are enticed to participate
328
STEPHEN M. STOHS AND JEFFREY T. LAFRANCE
with premium subsidies. The volatility in premiums is due to the fact that the APH premium is effectively proportional to the 10-year average APH yield, and the high variance of yields from one year to the next translates into a high variance in this 10-year average. The result is a distribution of insurance premiums that also have a large variance over time. If farmers are risk averse, highly variable premiums create another disincentive for participation. The method developed in this chapter is a step towards developing crop insurance premiums that more accurately reflect individual farm-level risks and that are more stable over time than those under current approach used by the FCIC. Again, if farmers are risk averse, then actuarially fair insurance rates that more accurately reflect individual farmer’s production risk and are intertemporally stable can result in less political pressure for premium subsidies to induce comparable participation rates. The active learning rule we have developed has been found to substantially reduce both the bias and uncertainty surrounding MPCI insurance premiums. The methodology is compatible with Bayes’ theorem and allows additional information to be incorporated as it becomes available by following the same steps at each stage of the learning process. The procedure we have developed can be applied to any data set and decision problem characterized by general and unknown spatial and temporal statistical dependence.
NOTES

1. In 1991, one-fourth of U.S. cropland (82.4 million acres) was insured, with an effective average premium subsidy rate of more than 25%. In the 2003 crop year, two-thirds of U.S. cropland (217.4 million acres) was insured, with an effective average premium subsidy rate of 60%. Total liabilities in 2003 were a record $40.6 billion, and indemnity payments were $3.2 billion, more than triple the 1991 amount.

2. By actuarially fair, we mean P = E[I], where P is the premium paid by an individual and I is a stochastic indemnity that depends on a producer's yield by a formula stated in the insurance contract. A producer's expected gain from purchasing the insurance is zero if the premium is actuarially fair.

3. The loss cost ratio is defined as the total of indemnities paid divided by the total of liabilities for a given pool of insured producers. The liability for an individual producer is the indemnity payment that would be made in the event of a total loss, taking into account all of the factors that determine the coverage level.

4. The coefficient of variation (cv) of farm-level yields for 20,720 Kansas winter wheat farms participating in the MPCI program over the period 1991–2000 ranges from 20 to 80%, with an average cv of 39.5%.

5. A number of authors, including Zellner, have shown that the maximum entropy solution takes the form of an exponential polynomial, $f(x) = \exp\left(-\sum_{i=0}^{n} \lambda_i x^i\right)$.

6. The structure of the model was largely shaped by data availability. Given that only the moments of individual farm yields were available, our strategy is to first obtain the best
possible producer-level moment estimates, then use these moments to estimate the local distributions.

7. That is, $E(\varepsilon_{it}) = 0$ and $\mathrm{Var}(\varepsilon_{it}) = 1$.

8. We use the term moment somewhat loosely here. The noncentral moments are defined by $E[(Y_{jt}^{f})^k]$, $k = 1, 2, 3, 4$. The characterization of a distribution by its noncentral moments or by its mean, variance, skewness, and kurtosis is equivalent, due to the existence of a bijective map between the two sets of moments.

9. Higher-order polynomial trends were also considered, but the improvement in fit was negligible, as measured by the Bayes Information Criterion (BIC). On the grounds of parsimony, the linear trend model was therefore selected here.

10. A positive coefficient on the second-order term in (15) would imply a correlation function that explodes as the distance between counties increases without bound. As is seen below, both estimated coefficients in the correlation function are negative and significant, so that the correlation between counties decays monotonically with distance and the spatial variance-covariance process is globally positive definite.

11. We iterate over the nonlinear least squares estimates for the restricted covariance matrix. Although the regressors are the same in all equations, the restricted correlation matrix is an exponential function of a two-parameter quadratic in the distance between counties, so that generalized least squares (GLS) is asymptotically more efficient than ordinary least squares. However, because there are more equations (M = 105) than time-series observations (T = 54), the unrestricted covariance matrix is singular and unrestricted GLS is numerically infeasible.

12. The homogeneous Kansas geography may account for a significant part of the apparent high level of spatial correlation. A state with a less homogeneous geography may require adding explanatory variables to the correlation function to capture the effect of the geographic differences. It may also be worth considering the effect of the direction of weather patterns, the spatial variation in wheat farming intensity, and the size distribution of counties on the yield correlations across space in other data sets. However, the simple and parsimonious model used in this study appears to be quite adequate for our main purposes.

13. The normal prior can be justified by the very large samples for the prior estimates (20,720 observations) and an appeal to the asymptotic distribution of moment estimators. The updating procedure treats the pooled variance estimate for a given parameter as if it were the known population variance. A more complete treatment would treat both the mean and the variance as unknown and subjectively random. The procedure employed here is adequate for two reasons. First, we are primarily interested in inferences about the mean in each case. Second, the large sample sizes for the pooled estimates produce precise estimates of the variances of the pooled means.

14. Assume that $V_i > 0$, $i = 1, 2, 3$, and, without loss of generality, that $V_1 \geq V_2$. If $V_3 = 1/((1/V_1) + (1/V_2))$, then since $V_1 > 0$, it follows that $V_3 < 1/(0 + (1/V_2)) = V_2 = \min\{V_1, V_2\}$.

15. Each county-level yield regression includes an intercept, so that the sample average of the $\hat{\varepsilon}_{it}$ vanishes.

16. The farm-level residual mean indicates the extent to which the average farm-level yield for an insured farm falls below the average county-level yield; the negative value for the pooled mean indicates the degree to which the average yield for insured farms falls short of the overall average yield, a possible indication of adverse selection and/or moral hazard in the MPCI crop insurance program.
17. Recall that the normal distribution has skewness equal to zero and kurtosis equal to three.

18. For arbitrary values of the first four sample moments as the information set, a solution to the maximum entropy problem may not exist. Whenever a solution does exist, it is unique, and minimizing the dual objective function identifies this unique exponential quartic solution. One characterizing property of this distribution is that its first four population moments match the moments taken as the information set.

19. It is easy to see that $\partial x/\partial y = \mu$, and $y = c$ when $x = \mu c$. Making the appropriate substitutions leads to the premium formula in terms of the density of Y, $g(y) = \mu f(\mu y)$.
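Note 18 characterizes the exponential quartic maximum entropy density as the minimizer of a dual objective. The sketch below illustrates one way such a density could be fit on a grid and then used to price an actuarially fair premium P = E[I] (cf. Note 2) for a simple shortfall indemnity. It is a minimal illustration, not the chapter's actual implementation: the grid, moments, coverage threshold, and indemnity form are hypothetical placeholders.

```python
# Fit the exponential quartic maximum entropy density of Note 18 by
# minimizing the dual objective, then price a shortfall indemnity.
import numpy as np
from scipy.optimize import minimize
from scipy.integrate import trapezoid

GRID = np.linspace(-6.0, 6.0, 2001)  # truncated standardized-yield support

def dual(lam, moments):
    """Convex dual objective: log Z(lam) + lam'm. Its minimizer gives the
    density f(x) = exp(-sum_k lam_k x^k) / Z matching the moments."""
    poly = sum(l * GRID ** (k + 1) for k, l in enumerate(lam))
    z = trapezoid(np.exp(-np.clip(poly, -700.0, 700.0)), GRID)
    return np.log(z) + np.dot(lam, moments)

def fit_maxent(moments):
    """Return the grid density whose first four moments match `moments`."""
    res = minimize(dual, np.zeros(4), args=(moments,), method="Nelder-Mead",
                   options={"xatol": 1e-9, "fatol": 1e-12, "maxiter": 20000})
    poly = sum(l * GRID ** (k + 1) for k, l in enumerate(res.x))
    f = np.exp(-np.clip(poly, -700.0, 700.0))
    return f / trapezoid(f, GRID)  # normalization absorbs lambda_0

# Hypothetical posterior noncentral moments E[x^k], k = 1..4, for a
# slightly left-skewed, mildly fat-tailed standardized yield.
moments = np.array([0.0, 1.0, -0.3, 3.4])
f = fit_maxent(moments)

# Actuarially fair premium for a shortfall indemnity I = max(c - y, 0),
# with the coverage threshold c on the standardized-yield scale.
c = -0.5
premium = trapezoid(np.maximum(c - GRID, 0.0) * f, GRID)
print(f"premium = {premium:.4f}")
```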
REFERENCES

Davidson, R., & MacKinnon, J. G. (1993). Estimation and inference in econometrics. Oxford: Oxford University Press.

Feller, W. (1971). An introduction to probability theory and its applications (Vol. 2). New York: Wiley and Sons.

Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2003). Bayesian data analysis (2nd ed.). Boca Raton, FL: Chapman and Hall/CRC Press.

Greene, W. H. (2003). Econometric analysis (5th ed.). Upper Saddle River, NJ: Prentice Hall.

Harwood, J., Heifner, R., Coble, K. H., Perry, J., & Somwaru, A. (1999, March). Managing risk in farming: Concepts, research and analysis. Technical Report, Economic Research Service, U.S. Department of Agriculture.

Jaynes, E. T. (1982). On the rationale of maximum-entropy methods. Proceedings of the IEEE, 70(9), 939–952.

Josephson, G. R., Lord, R. B., & Mitchell, C. W. (2000, August). Actuarial documentation of multiple peril crop insurance ratemaking procedures. Technical Report, Risk Management Agency.

Judge, G. G., & Mittelhammer, R. C. (2004). A semi-parametric basis for combining estimation problems under quadratic loss. Journal of the American Statistical Association, 99, 479–487.

Just, R. E., Calvin, L., & Quiggin, J. (1999). Adverse selection in crop insurance: Actuarial and asymmetric information incentives. American Journal of Agricultural Economics, 81(4), 834–849.

Just, R. E., & Weninger, Q. (1999). Are crop yields normally distributed? American Journal of Agricultural Economics, 81(2).

Knight, T. O., & Coble, K. H. (1997). Survey of U.S. multiple peril crop insurance literature since 1980. Review of Agricultural Economics, 19(1), 128–156.

Ormoneit, D., & White, H. (1999). An efficient algorithm to compute maximum entropy densities. Econometric Reviews, 18(2), 127–140.

Shannon, C. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423.

Skees, J. R., & Reed, M. R. (1986). Rate making for farm-level crop insurance: Implications for adverse selection. American Journal of Agricultural Economics, 68, 653–659.

Wu, X. (2003). Calculation of maximum entropy densities with application to income distribution. Journal of Econometrics, 115, 347–354.

Zellner, A. (1962). An efficient method of estimating seemingly unrelated regressions, and tests for aggregation bias. Journal of the American Statistical Association, 57, 500–509.
Zellner, A. (2002). Information processing and Bayesian analysis. Journal of Econometrics, 107, 41–50.

Zellner, A. (2003). Some historical aspects of Bayesian information processing. Invited paper presented at the American Statistical Association Meeting, San Francisco, CA, August.

Zellner, A., & Highfield, R. A. (1988). Calculation of maximum entropy distributions and approximation of marginal posterior distributions. Journal of Econometrics, 37, 195–209.