Biostatistics (2002), 3, 3, pp. 347–360 Printed in Great Britain
A Monte Carlo EM algorithm for generalized linear mixed models with flexible random effects distribution

JUNLIANG CHEN, DAOWEN ZHANG∗, MARIE DAVIDIAN
Department of Statistics, Box 8203, North Carolina State University, Raleigh, NC 27695-8203, USA
[email protected]

SUMMARY
A popular way to represent clustered binary, count, or other data is via the generalized linear mixed model framework, which accommodates correlation through incorporation of random effects. A standard assumption is that the random effects follow a parametric family such as the normal distribution; however, this may be unrealistic or too restrictive to represent the data. We relax this assumption and require only that the distribution of random effects belong to a class of ‘smooth’ densities and approximate the density by the seminonparametric (SNP) approach of Gallant and Nychka (1987). This representation allows the density to be skewed, multi-modal, fat- or thin-tailed relative to the normal and includes the normal as a special case. Because an efficient algorithm to sample from an SNP density is available, we propose a Monte Carlo EM algorithm using a rejection sampling scheme to estimate the fixed parameters of the linear predictor, variance components and the SNP density. The approach is illustrated by application to a data set and via simulation.

Keywords: Correlated data; Rejection sampling; Seminonparametric density; Semiparametric mixed model.
∗ To whom correspondence should be addressed.
© Oxford University Press (2002)

1. INTRODUCTION

Generalized linear mixed models (GLMMs) (Schall, 1991; Zeger and Karim, 1991; Breslow and Clayton, 1993) are a popular way to model clustered (e.g. longitudinal) data arising in clinical trials and epidemiological studies of cancer and other diseases. In these models, correlation is taken into account through incorporation of random effects that are routinely assumed to be normally distributed. Calculation of the marginal likelihood requires integration over the distribution of the random effects. However, unlike that for linear mixed models (see, for example, Laird and Ware, 1982), maximum likelihood inference for GLMMs is complicated by the fact that the random effects enter the model nonlinearly. This feature makes the required integration intractable in general, even if the random effects are normally distributed. Consequently, much work has focused on approximate techniques that seek to avoid the integration. Unfortunately, approaches based on Laplace and other approximations (see, for example, Schall, 1991; Breslow and Clayton, 1993) may yield biased estimates of fixed model parameters (Breslow and Lin, 1995; Lin and Breslow, 1996), particularly in the case of binary response. This difficulty has motivated a number of alternative approaches (see, for example, Lee and Nelder, 1996; Jiang, 1999). There have also been attempts to carry out the necessary integrations via fully Bayesian analyses using Markov chain
Monte Carlo techniques (Zeger and Karim, 1991) or using Monte Carlo EM (MCEM) algorithms to implement ‘exact’ likelihood analysis (McCulloch, 1997; Booth and Hobert, 1999). In the latter case, an approach proposed by Booth and Hobert (1999) is based on using acceptance–rejection methods to generate samples of independent draws from the appropriate conditional distribution. This allows one to assess the Monte Carlo error at each iteration of the algorithm, suggesting a rule for ‘automatically’ increasing the Monte Carlo sample size, making this method attractive as a general strategy for likelihood-based analysis.

An assumption underlying many of these approaches is that the random effects have distribution belonging to some parametric family, almost always the normal. However, this assumption may be unrealistic, raising concern over the validity of inferences both on fixed and random effects if it is violated. Moreover, allowing the random effects distribution to have more complex features than the symmetric, unimodal normal density may provide insight into underlying heterogeneity and even suggest failure to include important covariates in the model. Accordingly, considerable interest has focused on approaches that allow the parametric assumption to be relaxed in mixed models (Davidian and Gallant, 1993; Magder and Zeger, 1996; Verbeke and Lesaffre, 1996; Kleinman and Ibrahim, 1998; Aitken, 1999; Jiang, 1999; Tao et al., 1999; Zhang and Davidian, 2001). Many of these approaches assume only that the random effects distribution has a ‘smooth’ density and represent the density in different ways. In the particular case of the linear mixed model for clustered data, Zhang and Davidian (2001) propose use of the ‘seminonparametric’ (SNP) representation of densities in a ‘smooth’ class given by Gallant and Nychka (1987), which allows the marginal likelihood to be expressed in a closed form, facilitating straightforward implementation.
They show that this approach yields reliable performance in capturing features of the true random effects distribution and the potential for substantial gains in efficiency over normal-based methods. In this paper, we apply this technique to the more general class of clustered-data GLMMs. Although it is no longer possible to write the likelihood in a closed form, the fact that an efficient algorithm is available for sampling from an SNP density makes use of the MCEM algorithm of Booth and Hobert (1999) attractive. This allows fitting in the case of a ‘smooth’ but unspecified random effects density to be carried out with not much greater computational burden than when a parametric family is specified.

In Section 2, we introduce the semiparametric GLMM in which the random effects are assumed only to have a ‘smooth’ density, and we present the form of the SNP representation. Section 3 describes the proposed MCEM algorithm and the strategy to choose the degree of flexibility required for the density. We illustrate the methods by application to a data set in Section 4, and present simulations demonstrating performance of the approach in Section 5. Throughout, we use the terms ‘cluster’ and ‘individual’ interchangeably.

2. MODEL SPECIFICATION

2.1 Semiparametric generalized linear mixed model
Let $y_{ij}$ denote the $j$th response for the $i$th individual, $i = 1, \ldots, m$ and $j = 1, \ldots, n_i$. For each $i$, conditional on random effects $b_i$ ($q \times 1$), the $y_{ij}$, $j = 1, \ldots, n_i$, are assumed to be independent and follow a generalized linear model (McCullagh and Nelder, 1989) with density

$$f(y_{ij} \mid b_i; \beta, \phi) = \exp\left\{ \frac{y_{ij}\theta_{ij} - d(\theta_{ij})}{a(\phi)} + c(y_{ij}, \phi) \right\}, \tag{1}$$

where $\mu_{ij}^{b} = E(y_{ij} \mid b_i) = d'(\theta_{ij})$; $\phi$ is a dispersion parameter whose value may be known; $c(\cdot\,; \cdot)$ is a known function; and, for link function $g(\cdot)$, the linear predictor $\eta_{ij} = x_{ij}^T \beta + s_{ij}^T b_i = g(\mu_{ij}^{b})$ depends on fixed effects $\beta$ ($p \times 1$), the random effects $b_i$, and known vectors of covariates $x_{ij}$ and $s_{ij}$ for the fixed and random effects, respectively. We assume the random effects are mutually independent across $i$ and write

$$b_i = R Z_i + \gamma, \tag{2}$$

where $\gamma$ is ($q \times 1$), $R$ is a ($q \times q$) upper triangular matrix, and $Z_i$ is a random vector, which will prove convenient shortly. Rather than make the usual assumption that $Z_i$ is standard multivariate normal, so that $b_i \sim N_q(\gamma, R R^T)$, where $N_q(\cdot, \cdot)$ represents the $q$-variate normal distribution, we instead assume only that the $Z_i$ have density in a ‘smooth’ class of densities described mathematically by Gallant and Nychka (1987). This class includes densities that may be skewed, multi-modal, or thin- or fat-tailed relative to the normal and in fact includes the normal; however, it excludes densities exhibiting unusual behavior, such as oscillations or jumps, that would likely be implausible representations of inter-individual heterogeneity. Thus, (1) and (2) together describe a semiparametric GLMM in the sense that the random effects distribution is not fully specified.

Note that (2) does not restrict the random effects to have mean zero; thus, for identifiability purposes, $x_{ij}$ should not contain $s_{ij}$ in (1), and $E(b_i) = R\,E(Z_i) + \gamma$ will be the fixed effects corresponding to $s_{ij}$, usually including the ‘intercept’ term. This does not cause any difficulty, as we demonstrate subsequently.

2.2 SNP density representation
The term ‘seminonparametric’ or SNP was used by Gallant and Nychka (1987) to suggest an approach lying halfway between fully parametric and completely nonparametric specifications. Gallant and Nychka (1987) showed that densities satisfying certain smoothness restrictions, including differentiability conditions, could be represented by an infinite Hermite series expansion plus a lower bound on tail behavior. They suggested that a truncated version of the expansion could be used as a general-purpose approximation to a smooth density. Following Davidian and Gallant (1993) and Zhang and Davidian (2001), we propose to represent the density of $b_i$ in the semiparametric GLMM by the truncated expansion, which we now describe.

Suppose $Z$ is a $q$-variate random vector with density proportional to a truncated Hermite expansion so that $Z$ has density

$$h_K(z) \propto P_K^2(z)\,\varphi_q(z) = \left\{ \sum_{|\alpha|=0}^{K} a_\alpha z^\alpha \right\}^2 \varphi_q(z)$$

for some fixed value of $K$, where $\varphi_q(\cdot)$ is the density function of the standard $q$-variate normal distribution $N_q(0, I_q)$. Here, $P_K(z)$ is a multivariate polynomial of order $K$, written above using multi-index $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_q)^T$ (vector with nonnegative integer elements), $|\alpha| = \sum_{k=1}^{q} \alpha_k$, and $z^\alpha = \prod_{k=1}^{q} z_k^{\alpha_k}$. Thus, for example, with $z = (z_1, z_2)^T$ ($q = 2$) and $K = 2$, $P_K(z) = a_{00} + a_{10} z_1 + a_{01} z_2 + a_{20} z_1^2 + a_{11} z_1 z_2 + a_{02} z_2^2$. The proportionality constant is given by $1/\int P_K^2(s)\varphi_q(s)\,ds$, so that $h_K(z)$ integrates to 1. The coefficients in $P_K(z)$ may only be determined to within a scalar multiple, so to achieve a unique representation, it is standard to set the leading term in the polynomial equal to 1; for example, $a_{00} = 1$ in the example (Davidian and Giltinan, 1995, Section 7.2). The resulting representation, a normal density modified by a squared polynomial, is capable of approximating a wide range of behavior for $K$ as small as 1 or 2. We discuss choice of $K$ in Section 3.4.

We have found that, when using the SNP representation as given above for random effects, numerical instability in estimating the polynomial coefficients may result. To demonstrate, for $K = 2$ and $q = 1$, the density is $h_2(z) = (1 + a_1 z + a_2 z^2)^2 \varphi_1(z)/C(a)$, where $C(a) = E(1 + a_1 U + a_2 U^2)^2$, $U \sim N_1(0, 1)$, so that $C(a) = 1 + a_1^2 + 3a_2^2 + 2a_2$. If $\max(a_1, a_2)$ is sufficiently large with $a_2 \geq a_1$, then $1/a_2$ and $1/a_2^2$ may be negligible and

$$h_2(z) = \frac{\{1/a_2 + (a_1/a_2) z + z^2\}^2 \varphi_1(z)}{1/a_2^2 + (a_1/a_2)^2 + 3 + 2/a_2} \approx \frac{\{(a_1/a_2) z + z^2\}^2 \varphi_1(z)}{(a_1/a_2)^2 + 3},$$
leading to practical identifiability problems. In order to circumvent this difficulty, we use a reparametrization by taking the proportionality constant equal to 1, which is an alternative way to provide a unique representation and which is equivalent to requiring $E\{P_K^2(U)\} = 1$ for $U \sim N_q(0, I_q)$. It is straightforward to calculate this expectation by writing the polynomial in an alternative form, which we now demonstrate in the case $q = 2$, so that $U = (U_1, U_2)^T$. For given $q$ and $K$, let $J$ be the number of distinct coefficients in the polynomial $P_K(u)$; when $q = 2$, $J = (K + 1)(K + 2)/2$. Let $p$ be the ($J \times 1$) vector of polynomial coefficients; for example, when $q = 2$ and $K = 2$, $p = (a_{00}, a_{10}, a_{01}, a_{20}, a_{11}, a_{02})^T$. Then the polynomial can be written as

$$P_K(u) = \sum_{0 \leq k + \ell \leq K} a_{k\ell}\, u_1^{k} u_2^{\ell} = \sum_{j=1}^{J} p_j\, u_1^{j_1} u_2^{j_2},$$

where $p_j$ is the $j$th element of $p$, and $j_1$ and $j_2$ are the subscripts of $p_j$ in the original representation. For example, with $q = 2$ and $K = 2$ and $p$ as given above, $p_2 = a_{10}$ and $j_1 = 1$, $j_2 = 0$. Now let $V$ be the ($J \times 1$) random vector whose $j$th element is $U_1^{j_1} U_2^{j_2}$. Then $P_K(U) = p^T V$, so that $E\{P_K^2(U)\} = p^T A p$, where $A = E(V V^T)$. Note that the elements of $V$ are just products of powers of independent standard normal random variables, so that the expectation is straightforward to compute, and $A$ is a positive definite matrix. Thus, we may write $A = B^T B$ for some $B$, and, with $c = Bp$, we require $E\{P_K^2(U)\} = c^T c = 1$, where $c = (c_1, \ldots, c_J)^T$. This leads to a reparametrization using

$$c_1 = \cos(\psi_1), \quad c_2 = \sin(\psi_1)\cos(\psi_2), \quad \ldots, \quad c_{J-1} = \sin(\psi_1)\sin(\psi_2)\cdots\cos(\psi_{J-1}), \quad c_J = \sin(\psi_1)\sin(\psi_2)\cdots\sin(\psi_{J-1}),$$

for $\psi_j \in (-\pi/2, \pi/2]$ for $j = 1, \ldots, J - 1$. This representation may be shown to hold for general $q$.

The relationship between the original and new parametrization may be appreciated by an example. With $q = 1$ and $K = 2$, so $P_K(u) = a_0 + a_1 u + a_2 u^2$, $p = (a_0, a_1, a_2)^T$,

$$A = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 3 \end{pmatrix}, \qquad B = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & \sqrt{2} \end{pmatrix},$$

from which it is straightforward to deduce that $a_0 + a_2 = \cos(\psi_1)$, $a_1 = \sin(\psi_1)\cos(\psi_2)$, and $\sqrt{2}\,a_2 = \sin(\psi_1)\sin(\psi_2)$.

We thus propose to represent the density of $Z_i$ in (2) as

$$h_K(z; \psi) = P_K^2(z; \psi)\,\varphi_q(z), \tag{3}$$

where the notation emphasizes that the polynomial is parametrized in terms of $\psi = (\psi_1, \ldots, \psi_{J-1})^T$, satisfying the constraint $E\{P_K^2(U)\} = 1$ for $U \sim N_q(0, I_q)$. It follows from (2) that the density of $b_i$ is represented as

$$f_K(b; \delta) = P_K^2(z; \psi)\, n_q(b; \gamma, \Sigma), \tag{4}$$

where $z = R^{-1}(b - \gamma)$, $\Sigma = R R^T$, $n_q(b; \gamma, \Sigma)$ is the $q$-variate normal density with mean $\gamma$ and covariance matrix $\Sigma$, and $\delta$ is made up of all elements of $\psi$, $\gamma$, and $R$. Note that the random effects density is thus parametrized in terms of a finite set of parameters $\delta$, where the dimension of $\delta$ (through $\psi$) depends on $K$. When $K = 0$, (3) reduces to a standard $q$-variate normal density and hence (4) is $N_q(\gamma, R R^T)$, so that the usual normal specification for GLMMs is a special case.
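As an illustration only, the reparametrization for the case $q = 1$, $K = 2$ can be sketched in Python. The function names (`snp_coefficients`, `snp_density`, `integrate`) are hypothetical and not part of the authors' Fortran implementation; the sketch maps $(\psi_1, \psi_2)$ to $(a_0, a_1, a_2)$ by solving $c = Bp$ with the matrix $B$ given above, and the integration helper lets one check numerically that the resulting density integrates to 1.

```python
import math


def snp_coefficients(psi1, psi2):
    """Map angles (psi1, psi2) to polynomial coefficients (a0, a1, a2) for q = 1, K = 2."""
    # c is a point on the unit sphere, so c'c = 1 holds by construction.
    c1 = math.cos(psi1)
    c2 = math.sin(psi1) * math.cos(psi2)
    c3 = math.sin(psi1) * math.sin(psi2)
    # Solve c = B p with B = [[1, 0, 1], [0, 1, 0], [0, 0, sqrt(2)]].
    a2 = c3 / math.sqrt(2.0)
    a1 = c2
    a0 = c1 - a2
    return a0, a1, a2


def snp_density(z, psi1, psi2):
    """h_K(z; psi) = P_K(z; psi)^2 * phi(z) for q = 1, K = 2."""
    a0, a1, a2 = snp_coefficients(psi1, psi2)
    poly = a0 + a1 * z + a2 * z * z
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return poly * poly * phi


def integrate(f, lo=-12.0, hi=12.0, n=24000):
    """Composite trapezoidal rule on [lo, hi]; the tails beyond are negligible here."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi)) + sum(f(lo + i * h) for i in range(1, n))
    return total * h
```

For instance, `integrate(lambda z: snp_density(z, 0.6, -0.9))` should be very close to 1, and $\psi = 0$ gives $(a_0, a_1, a_2) = (1, 0, 0)$, i.e. the standard normal density.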
3. MONTE CARLO EM ALGORITHM

3.1 Likelihood

Under the model given in (1), (2) and (4), we may write the marginal likelihood for $\zeta = (\beta^T, \phi, \delta^T)^T$ for fixed $K$ as

$$L(\zeta \mid y) = \int f(y \mid b; \beta, \phi)\, f(b; \delta)\, db, \tag{5}$$

where $f(y \mid b; \beta, \phi) = \prod_{i=1}^{m} f(y_i \mid b_i; \beta, \phi)$, $f(y_i \mid b_i; \beta, \phi) = \prod_{j=1}^{n_i} f(y_{ij} \mid b_i; \beta, \phi)$; $f(b; \delta) = \prod_{i=1}^{m} f_K(b_i; \delta)$; and $b = (b_1^T, \ldots, b_m^T)^T$, $y = (y_1^T, \ldots, y_m^T)^T$, and $y_i = (y_{i1}, \ldots, y_{in_i})^T$, $i = 1, \ldots, m$. Because (5) cannot be evaluated analytically in general, we propose to maximize (5) in $\zeta$ via the MCEM algorithm of Booth and Hobert (1999), incorporating the SNP representation for the random effects density. We treat $K$ as a ‘tuning parameter’ controlling the degree of flexibility for the random effects, using the algorithm to fit the model for several choices of $K$, for example $K = 0$ (normal random effects), 1, and 2, choosing among them as described in Section 3.4.

The EM algorithm applied to random effects models treats $b$ as missing data, $y$ as the observed data, and $(y, b)$ as the ‘complete’ data (see, for example, Searle et al., 1992, Chapter 8). Under this perspective, the complete data have density function $f(y, b; \zeta)$ equal to the integrand in (5). At the $(r + 1)$th iteration, the E-step involves the calculation of

$$Q(\zeta \mid \zeta^{(r)}) = E\{\log f(y, b; \zeta) \mid y, \zeta^{(r)}\} = \int \log f(y, b; \zeta)\, f(b \mid y; \zeta^{(r)})\, db, \tag{6}$$

where $f(b \mid y; \zeta)$ is the conditional distribution of $b$ given $y$, and $\zeta^{(r)}$ denotes the value from the previous ($r$th) iteration. The M-step consists of maximizing $Q(\zeta \mid \zeta^{(r)})$ in $\zeta$ to yield the new update $\zeta^{(r+1)}$. The process is iterated from a starting value $\zeta^{(0)}$ to convergence; under regularity conditions, the value at convergence maximizes (5).

Obtaining a closed form expression for (6) is often not possible, as it requires knowledge of the conditional distribution of $b$ given $y$ evaluated at $\zeta^{(r)}$, which in turn requires knowledge of the marginal likelihood whose direct calculation is to be avoided. The suggestion exploited by Booth and Hobert (1999) for GLMMs with parametric random effects distribution is to use a Monte Carlo approximation to (6) in the E-step.
Specifically, if it is possible to obtain a random sample $(b^{(1)}, b^{(2)}, \ldots, b^{(L)})^T$ from $f(b \mid y; \zeta^{(r)})$, then (6) may be approximated at the $(r + 1)$th iteration by

$$Q_L(\zeta \mid \zeta^{(r)}) = \frac{1}{L} \sum_{l=1}^{L} \log f(y, b^{(l)}; \zeta), \tag{7}$$

yielding a so-called MCEM algorithm. By independence, to obtain a sample from $f(b \mid y; \zeta^{(r)})$, one may sample from the conditional distribution of $b_i$ given $y_i$ evaluated at $\zeta^{(r)}$, $f(b_i \mid y_i; \zeta^{(r)})$, say, for each $i$.

Several algorithms for generating such random samples are available. One approach is to use the Metropolis–Hastings algorithm, as in McCulloch (1997), to sample from the conditional distribution for each $i$ using the random effects density as the candidate distribution. Alternatively, Booth and Hobert (1999) propose using a rejection sampling scheme (Geweke, 1996), arguing that, unlike the Metropolis–Hastings approach, this produces independent and identically distributed samples that may be used to assess Monte Carlo error at each iteration and hence suggest a rule for changing the sample size $L$ to enhance speed. Booth and Hobert (1999) advocate using rejection sampling as long as it is easy to simulate from the assumed random effects density. As we now discuss, the availability of an efficient rejection sampling scheme for the SNP representation makes this an attractive choice.
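The Monte Carlo approximation (7) can be sketched as follows for one simple special case. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a logistic model with a single scalar covariate and a random intercept, takes the normal ($K = 0$) random effects density $N(\gamma, r^2)$, and assumes the draws $b^{(l)}$ have already been obtained from $f(b \mid y; \zeta^{(r)})$. The function names are hypothetical.

```python
import math


def log_complete_density(y, x, b, beta, gamma, r):
    """log f(y, b; zeta) for a random-intercept logistic model with b_i ~ N(gamma, r^2).

    y[i][j] is the binary response, x[i][j] a scalar covariate, b[i] the random intercept.
    """
    total = 0.0
    for yi, xi, bi in zip(y, x, b):
        for yij, xij in zip(yi, xi):
            eta = xij * beta + bi
            # Bernoulli log density under the logit link: y * eta - log(1 + exp(eta)).
            total += yij * eta - math.log1p(math.exp(eta))
        # Normal log density of the random effect (the K = 0 special case).
        zi = (bi - gamma) / r
        total += -0.5 * zi * zi - math.log(r) - 0.5 * math.log(2.0 * math.pi)
    return total


def q_monte_carlo(y, x, b_samples, beta, gamma, r):
    """Q_L in (7): average of log f(y, b^(l); zeta) over the L sampled vectors b^(l)."""
    values = [log_complete_density(y, x, b, beta, gamma, r) for b in b_samples]
    return sum(values) / len(values)
```

In an M-step, `q_monte_carlo` would be maximized over `(beta, gamma, r)` with the sampled `b_samples` held fixed.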
3.2 ‘Double’ rejection sampling

To carry out the E-step for our model, a random sample from the conditional distribution of $b_i$ given $y_i$ evaluated at $\zeta^{(r)}$ for each $i$, which we denote $f_K(b_i \mid y_i; \zeta^{(r)})$ to emphasize dependence on $K$, is required. As discussed by Booth and Hobert (1999), this involves first obtaining a random sample from the marginal distribution $f_K(b_i; \delta^{(r)})$. Gallant and Tauchen (1992) propose an algorithm to generate a random sample from an SNP density, which we now describe.

To generate a sample from $f_K(b_i; \delta^{(r)})$, we first generate a random sample from $h_K(z; \psi^{(r)})$. Following Gallant and Tauchen (1992), this depends on finding an upper envelope for $h_K(z; \psi^{(r)})$; i.e. a positive, integrable function $d_K(z; \psi^{(r)})$, say, that dominates $h_K(z; \psi^{(r)})$, i.e. $0 \leq h_K(z; \psi^{(r)}) \leq d_K(z; \psi^{(r)})$ for all $z$. This leads to the candidate density $g_K(z; \psi^{(r)}) = d_K(z; \psi^{(r)}) / \int d_K(s; \psi^{(r)})\, ds$. Gallant and Tauchen (1992) observe that for the SNP representation, $g_K(z; \psi^{(r)})$ is a weighted sum of Chi density functions, where a Chi random variable $\chi_\nu = (-1)^S \sqrt{\chi_\nu^2}$, $\chi_\nu^2$ is a Chi-square random variable with $\nu$ degrees of freedom, and $S$ is Bernoulli(1/2). Sampling from a Chi density is straightforward (Monahan, 1987). Exploiting these developments, a rejection algorithm to generate a random sample $z$ from $h_K(z; \psi^{(r)})$ involves the following two steps:

(1) Generate independently $u \sim U(0, 1)$, $z \sim g_K(z; \psi^{(r)})$.
(2) If $u \leq h_K(z; \psi^{(r)})/d_K(z; \psi^{(r)})$ then accept $z$; otherwise, go to 1 and repeat until a sample $z$ is accepted.

If $z_i^*$ is a sample generated this way for individual $i$, then a sample $b_i^*$ from $f_K(b_i; \delta^{(r)})$ may be obtained as $b_i^* = \gamma^{(r)} + R^{(r)} z_i^*$. The acceptance rate is close to 50% (Gallant and Tauchen, 1992). This scheme may be incorporated into the Booth and Hobert (1999) MCEM rejection sampling approach for generating a sample from $f_K(b_i \mid y_i; \zeta^{(r)})$ as follows.
(1) Generate a random sample $b_i^*$ from $f_K(b_i; \delta^{(r)})$ using the rejection sampling algorithm for the SNP density above.
(2) Generate a random sample from $f_K(b_i \mid y_i; \zeta^{(r)})$ by rejection sampling: sample $u \sim U(0, 1)$ independently. If $u \leq f(y_i \mid b_i^*; \beta^{(r)}, \phi^{(r)})/\tau_i$, where $\tau_i = \sup_{b_i} f(y_i \mid b_i; \beta^{(r)}, \phi^{(r)})$, accept $b_i^*$; otherwise, return to 1 and repeat until a sample $b_i^*$ is accepted.

An accepted $b_i^*$ is a random sample from the conditional distribution $f_K(b_i \mid y_i; \zeta^{(r)})$. Because both steps involve a rejection sampling scheme, we refer to this as ‘double’ rejection sampling.

3.3 EM algorithm
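The algorithm in this section relies on the ‘double’ rejection scheme of Section 3.2, which can be sketched as follows for $q = 1$ and binary responses with logit link. This is a hedged illustration, not the authors' code: the envelope is built from the bound $P_K^2(z) \leq (\sum_k |a_k| |z|^k)^2$, which gives a weighted sum of Chi densities in the spirit of the description of Gallant and Tauchen (1992), and $\tau_i$ is found by bisection on an assumed bracket $[-50, 50]$. All function names and the `offsets` argument (the fixed-effects part of the linear predictor) are assumptions made for the example.

```python
import math
import random


def sample_snp(a, rng):
    """Draw z from h(z) = P(z)^2 phi(z), P(z) = sum_k a[k] z^k, q = 1 (assumes E P(U)^2 = 1)."""
    K = len(a) - 1
    # |z|^m phi(z) = E|U|^m * (Chi density with m + 1 degrees of freedom and a random sign),
    # so the envelope (sum_k |a_k| |z|^k)^2 phi(z) is a weighted sum of such densities.
    nus, weights = [], []
    for j in range(K + 1):
        for k in range(K + 1):
            m = j + k
            abs_moment = 2.0 ** (m / 2.0) * math.gamma((m + 1) / 2.0) / math.sqrt(math.pi)
            nus.append(m + 1)
            weights.append(abs(a[j]) * abs(a[k]) * abs_moment)
    while True:
        nu = rng.choices(nus, weights=weights)[0]
        # A Chi-square variable with nu degrees of freedom is Gamma(shape nu/2, scale 2).
        z = math.sqrt(rng.gammavariate(nu / 2.0, 2.0))
        if rng.random() < 0.5:
            z = -z
        poly = sum(ak * z ** k for k, ak in enumerate(a))
        bound = sum(abs(ak) * abs(z) ** k for k, ak in enumerate(a))
        if rng.random() * bound * bound <= poly * poly:
            return z


def cluster_likelihood(y, offsets, b):
    """f(y_i | b_i) for binary responses with logit link and linear predictor offset + b."""
    value = 1.0
    for yij, oij in zip(y, offsets):
        p = 1.0 / (1.0 + math.exp(-(oij + b)))
        value *= p if yij == 1 else 1.0 - p
    return value


def sup_likelihood(y, offsets):
    """tau_i = sup_b f(y_i | b); the log-likelihood is concave in b, so bisect on the score."""
    if sum(y) in (0, len(y)):
        return 1.0  # approached as b -> -infinity (all zeros) or +infinity (all ones)
    lo, hi = -50.0, 50.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        score = sum(yij - 1.0 / (1.0 + math.exp(-(oij + mid))) for yij, oij in zip(y, offsets))
        lo, hi = (mid, hi) if score > 0 else (lo, mid)
    return cluster_likelihood(y, offsets, 0.5 * (lo + hi))


def sample_conditional(y, offsets, a, gamma, r, rng):
    """'Double' rejection: draw b* from the SNP density, accept with probability f(y|b*)/tau."""
    tau = sup_likelihood(y, offsets)
    while True:
        b = gamma + r * sample_snp(a, rng)
        if rng.random() * tau <= cluster_likelihood(y, offsets, b):
            return b
```

A call such as `sample_conditional([1, 0, 1], [0.2, 0.0, -0.2], (0.0, 0.5, 0.5), 0.0, 1.0, random.Random(1))` returns one draw of $b_i$; the coefficients `(0.0, 0.5, 0.5)` are an arbitrary choice satisfying $E\{P_K^2(U)\} = 1$.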
We may now describe implementation of the MCEM algorithm for our semiparametric GLMM for a fixed choice of $K$. If $(b^{(1)}, b^{(2)}, \ldots, b^{(L)})$ are generated independently from the density function $f(b \mid y; \zeta^{(r)}) = \prod_{i=1}^{m} f_K(b_i \mid y_i; \zeta^{(r)})$, where $b^{(l)} = (b_1^{(l)}, b_2^{(l)}, \ldots, b_m^{(l)})^T$, the Monte Carlo approximation at the E-step may be written as

$$Q_L(\zeta \mid \zeta^{(r)}) = \frac{1}{L} \sum_{l=1}^{L} \sum_{i=1}^{m} \sum_{j=1}^{n_i} \log f(y_{ij} \mid b_i^{(l)}; \beta, \phi) + \frac{1}{L} \sum_{l=1}^{L} \sum_{i=1}^{m} \log f_K(b_i^{(l)}; \delta). \tag{8}$$

In (8), the parameters $(\beta, \phi)$ and $\delta$ separate; thus, maximization of (8) at the M-step may be carried out in two steps. To obtain $(\beta^{(r+1)}, \phi^{(r+1)})$, the first term in (8) may be maximized in $(\beta, \phi)$, which is equivalent to finding the maximum likelihood estimate of the regression parameters for a generalized linear model with an offset. Maximizing the second term in (8) gives $\delta^{(r+1)}$, which may be carried out using standard optimization software. We have implemented this scheme in Fortran 90 using the IMSL optimization subroutine BCPOL to maximize the second term in (8) involving the SNP density. In our experience, this maximization is quite stable under our parametrization.

Booth and Hobert (1999) propose that a normal approximation of the Monte Carlo error in $\zeta^{(r+1)}$ may be used after iteration $(r + 1)$ to deduce an appropriate choice of the Monte Carlo sample size $L$ for the next iteration. In particular, they show that, after the $(r + 1)$th iteration, $\zeta^{(r+1)}$ is approximately normally distributed with mean $\zeta^{*(r+1)}$, say, maximizing $Q(\zeta \mid \zeta^{(r)})$, and covariance matrix that may be estimated. Hence, they advocate constructing an approximate $100(1 - \alpha)\%$ confidence ellipsoid for $\zeta^{*(r+1)}$. If the previous value $\zeta^{(r)}$ is inside of this confidence region, they recommend increasing $L$ by the integer part of $L/k$, where $k$ is a positive constant. Booth and Hobert (1999) have advocated choosing $\alpha = 0.25$ and $k \in \{3, 4, 5\}$. The authors also discuss several stopping rules for the EM algorithm. We have successfully used $\alpha = 0.25$, $k = 3$, initial $L = 50$, and the stopping criterion that the relative change in all parameter values from successive iterations is no larger than 0.001.

To obtain starting values for the algorithm, one may assume the random effects follow a normal distribution and use existing software (e.g. SAS PROC NLMIXED or the macro GLIMMIX) to obtain starting values $\beta^{(0)}$, $\phi^{(0)}$, $\gamma^{(0)}$ and $R^{(0)}$. We have had success using $\psi^{(0)} = 0$.

Combining the above, the MCEM algorithm for the semiparametric GLMM, incorporating ‘double’ rejection sampling, is as follows.

(1) Choose $K$, starting values $\zeta^{(0)}$, and initial sample size $L$. Set $r = 0$.
(2) At iteration $(r + 1)$, generate $b^{(l)}$ from $f(b \mid y; \zeta^{(r)})$ using ‘double’ rejection methods, $l = 1, \ldots, L$.
(3) Using the approximation (8), obtain $\zeta^{(r+1)}$ by maximizing $Q_L(\zeta \mid \zeta^{(r)})$.
(4) Construct a $100(1 - \alpha)\%$ confidence ellipsoid for $\zeta^{*(r+1)}$. If $\zeta^{(r)}$ is inside of the region, set $L = L + [L/k]$, where $[\ ]$ denotes integer part.
(5) If convergence is achieved, set $\zeta^{(r+1)}$ to be the maximum likelihood estimate $\hat\zeta$; otherwise, set $r = r + 1$ and return to 2.
Booth and Hobert (1999) show that a by-product of such an algorithm is the observed Fisher information, which may be approximated by the difference of two positive definite matrices. However, because of the Monte Carlo error, we have found that this difference may not always be positive definite in practice. Accordingly, we propose instead to approximate the observed Fisher information directly via Monte Carlo after convergence is achieved, which works well as shown in Section 5. In particular, the observed information for $\zeta$ may be written as $\sum_{i=1}^{m} \{\partial/\partial\zeta \log f(y_i; \hat\zeta)\}\{\partial/\partial\zeta^T \log f(y_i; \hat\zeta)\}$. With $b_i = \gamma + R z_i$ and regularity conditions,

$$\frac{\partial}{\partial\zeta} \log f(y_i; \hat\zeta) = \frac{\partial}{\partial\zeta} \log \int f(y_i \mid z_i; \hat\zeta)\, P_K^2(z_i; \hat\psi)\, \varphi_q(z_i)\, dz_i = \frac{\int \lambda'(\hat\zeta; y_i, z_i)\, \varphi_q(z_i)\, dz_i}{\int \lambda(\hat\zeta; y_i, z_i)\, \varphi_q(z_i)\, dz_i},$$

where $\lambda(\hat\zeta; y_i, z_i) = f(y_i \mid z_i; \hat\zeta)\, P_K^2(z_i; \hat\psi)$, and $\lambda'(\hat\zeta; y_i, z_i) = \partial/\partial\zeta\, \lambda(\hat\zeta; y_i, z_i)$, with all derivatives evaluated at $\zeta = \hat\zeta$. Therefore, the Monte Carlo approximation of $\partial/\partial\zeta \log f(y_i; \hat\zeta)$ for cluster $i$ is given by

$$\frac{\sum_{l=1}^{L} \lambda'(\hat\zeta; y_i, z_i^{(l)})}{\sum_{l=1}^{L} \lambda(\hat\zeta; y_i, z_i^{(l)})},$$

where $(z_i^{(1)}, \ldots, z_i^{(L)})$ are samples from the normal distribution $\varphi_q(z_i)$ for large $L$, which can be substituted into the expression for observed information. We use $L = 10\,000$ in the analyses of Sections 4 and 5. Standard errors for functions of the parameters, e.g. for estimators for $E(b_i)$ and $\mathrm{var}(b_i) = R\, \mathrm{var}(Z_i) R^T$, may be obtained via the delta method.
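The control logic in steps (4) and (5) of the algorithm above can be sketched as below. This is only an illustration under assumptions: the inverse of the estimated covariance matrix (`cov_inv`) and the chi-square quantile defining the ellipsoid are taken as inputs rather than computed, the relative change is measured against the previous parameter value (the text does not fix the denominator), and the function names are hypothetical.

```python
import math


def inside_ellipsoid(zeta_old, zeta_new, cov_inv, quantile):
    """True if zeta_old lies in the confidence ellipsoid centered at zeta_new."""
    d = [o - n for o, n in zip(zeta_old, zeta_new)]
    size = len(d)
    # Squared Mahalanobis distance d' cov_inv d, compared with a chi-square quantile.
    dist_sq = sum(d[i] * cov_inv[i][j] * d[j] for i in range(size) for j in range(size))
    return dist_sq <= quantile


def next_sample_size(L, k, inside):
    """Set L = L + [L/k] when the previous value lies inside the ellipsoid."""
    return L + L // k if inside else L


def converged(zeta_old, zeta_new, tol=1e-3, eps=1e-12):
    """Relative change in every parameter no larger than tol."""
    return all(abs(n - o) <= tol * (abs(o) + eps) for o, n in zip(zeta_old, zeta_new))
```

For a two-dimensional parameter, the chi-square quantile at level $1 - \alpha$ has the closed form $-2\log\alpha$, so with $\alpha = 0.25$ one could pass `quantile = -2.0 * math.log(0.25)`; with the settings reported above, `next_sample_size(50, 3, True)` gives 66.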
3.4 Choosing K

The preceding developments assume that $K$, the degree of the SNP polynomial, is fixed. A larger $K$ gives more flexibility for representing the random effects distribution, but choosing a $K$ that is too large will result in an inefficient representation. Thus, it is desirable to choose $K$ according to a criterion that balances the number of parameters and the suitability of the fit. Davidian and Gallant (1993) and Zhang and Davidian (2001) advocate inspection of information criteria of the form $-\log L(\hat\zeta; y) + a_{K,N}$ across fits for different $K$. Here, $a_{K,N}$ is a constant depending on the dimension of $\zeta$ (which depends on $K$), $D_K$, say, and the total number of observations $N = \sum_{i=1}^{m} n_i$. With $a_{K,N} = D_K$, Akaike’s information criterion (AIC) is obtained; $a_{K,N} = (1/2)(\log N) D_K$ leads to Schwarz’s Bayesian information criterion (BIC), and $a_{K,N} = (\log(\log N)) D_K$ gives the Hannan–Quinn criterion (HQ). For all of AIC, BIC and HQ, smaller values are preferred, so for any given criterion, one would choose $K$ yielding the smallest value across several choices. In general, one would fit $K = 0$ (normal random effects), and richer choices $K = 1$, 2 and 3; often $K = 1$ or 2 is sufficient to represent adequately departure from normality.

To compute AIC, BIC and HQ following the MCEM algorithm, we propose approximating $\log L(\hat\zeta; y)$ via Monte Carlo. We may generate samples $(b^{(1)}, \ldots, b^{(L)})^T$, $b^{(l)} = (b_1^{(l)}, b_2^{(l)}, \ldots, b_m^{(l)})^T$, for large $L$ from $f_K(b_i; \hat\zeta)$ via rejection sampling as in Section 3.2, and obtain

$$\log L(\hat\zeta; y) \approx \sum_{i=1}^{m} \log \left\{ \frac{1}{L} \sum_{l=1}^{L} f(y_i \mid b_i^{(l)}; \hat\zeta) \right\},$$

which may be substituted into the expression for the desired criterion. We have used $L = 10\,000$ in the sequel with good results.

4. APPLICATION: SIX CITIES DATA

To illustrate the utility of the methods, we apply them to data from the Harvard Study of Air Pollution and Health (Ware et al., 1984), the so-called ‘Six Cities Study’. The data consist of records for $m = 537$ children from Steubenville, Ohio, each of whom was examined annually from ages 7 to 10. Whether the child had a respiratory infection in the year prior to each exam was reported by the mother; also available was the mother’s baseline smoking status as ascertained at the first interview. We considered the model (1) for the binary response $y_{ij}$ (1 = infection, 0 = no infection) for child $i$ at age $a_{ij}$ (in years minus 9) with logit link and linear predictor $\eta_{ij} = s_i \beta_1 + a_{ij} \beta_2 + b_i$, where $s_i$ is mother’s baseline smoking status (1 = smoker, 0 = non-smoker), $b_i = \gamma + r z_i$, and $z_i$ has the density (3) with $q = 1$ for $K = 0, 1, 2$. We also considered this model with an additional interaction term for smoking and age, but there was no evidence supporting the need for this term, and the results were similar.

The fit of the model is summarized in Table 1. The estimates of the fixed effects $\beta_1$ and $\beta_2$ are similar across choices of $K$. Based on all three criteria in Section 3.4, $K = 0$ is preferred, suggesting that there is no evidence that the random effects distribution is different from the normal. Figure 1 shows the estimated random effects densities for $K = 0, 1, 2$. All estimates are unimodal and symmetric, further confirming that the data show little support for a departure from normality of the underlying random effects.
This demonstrates a useful feature of our approach as a model-checking device; because the SNP representation contains the normal, fitting the model with $K = 0$ and then increasing $K$ provides both visual and empirical information with which to gauge the validity of the usual normality assumption. From Table 1, note that the estimates of all fixed parameters are virtually identical for the $K = 0$ and $K = 1$ fits. This suggests that adopting a strategy of choosing $K = 1$ a priori as protection against departures from the normality assumption may not be unreasonable. The simulation results in Section 5 support this idea, showing that, when the random effects are truly normal, loss of efficiency relative to using $K = 0$ is negligible.

Table 1. Results for the six cities study

                         K = 0                K = 1                K = 2
                         Estimate (SE)        Estimate (SE)        Estimate (SE)
β1 (smoking status)      0.3970 (0.2711)      0.3848 (0.2706)      0.3641 (0.2580)
β2 (age)                 −0.1721 (0.0675)     −0.1719 (0.0676)     −0.1703 (0.0676)
E(bi)                    −3.0870 (0.2204)     −3.0785 (0.2228)     −2.8766 (0.3297)
Var(bi)                  4.6281 (0.7688)      4.5930 (0.8060)      3.8847 (2.2167)
ψ1                       –                    −0.0705 (0.9741)     −0.5206 (0.5808)
ψ2                       –                    –                    −0.4821 (0.1965)
AIC                      0.3730               0.3737               0.3838
BIC                      0.3783               0.3803               0.3817
HQ                       0.3749               0.3761               0.3767

Fig. 1. Estimated random effects densities for fits to the six cities data. (Horizontal axis: b; vertical axis: Densities; curves for K = 0, K = 1 and K = 2.)

Computation times using our implementation on a Sun Ultra 10 workstation were 5, 10 and 20 min for $K = 0$, 1, and 2, respectively, including Monte Carlo calculation of standard errors and evaluation of the information criteria. Given the size of the data set, these computation times demonstrate the feasibility of the approach for practical use.
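The information criteria of Section 3.4 used to compare the fits in this section, together with the Monte Carlo approximation of the log-likelihood, can be sketched as follows. This is an illustrative sketch only; the function names and the layout of `cond_lik` are assumptions, and the draws $b_i^{(l)}$ are assumed to have been generated from the fitted random effects density beforehand.

```python
import math


def mc_log_likelihood(cond_lik):
    """log L ~= sum_i log{ (1/L) sum_l f(y_i | b_i^(l)) }.

    cond_lik[i][l] holds f(y_i | b_i^(l)) for draws b_i^(l) from the fitted
    random effects density (one row per cluster, one column per draw).
    """
    return sum(math.log(sum(row) / len(row)) for row in cond_lik)


def information_criteria(log_lik, n_params, n_obs):
    """AIC, BIC and HQ in the form -log L + a_{K,N} of Section 3.4 (smaller is better)."""
    return {
        "AIC": -log_lik + n_params,
        "BIC": -log_lik + 0.5 * math.log(n_obs) * n_params,
        "HQ": -log_lik + math.log(math.log(n_obs)) * n_params,
    }
```

For any $N$ larger than about 1619 (so that $\log\log N > 1$), the penalties are ordered AIC < HQ < BIC for a given fit, which is consistent with AIC tending to favor larger $K$ and BIC smaller $K$.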
5. SIMULATION RESULTS

We conducted simulation studies to investigate performance of the proposed methods. We report on a case of binary response. In each of 100 Monte Carlo data sets, for each of $i = 1, \ldots, m = 250$ individuals, $y_{ij}$, $j = 1, \ldots, n_i = 5$ observations were generated as conditionally independent given $b_i$ according to $y_{ij} \mid b_i \sim \mathrm{Bernoulli}(\pi_{ij})$, where

$$\eta_{ij} = \log\left( \frac{\pi_{ij}}{1 - \pi_{ij}} \right) = \mathrm{trt}_i\, \beta_1 + t_{ij} \beta_2 + b_i, \tag{9}$$

$\mathrm{trt}_i$ is a treatment indicator with $\mathrm{trt}_i = I(i \leq 125)$, $t_{ij}$ takes the values $(-0.2, -0.1, 0.0, 0.1, 0.2)$ for $j = 1, \ldots, 5$, $\beta_1 = 0.5$, and $\beta_2 = 3.0$. In the first simulation, $b_i$ were drawn from the mixture of normal distributions $0.7 N_1(-1.5, 0.7^2) + 0.3 N_1(2.0, 0.7^2)$ with $E(b_i) = -0.45$ and $\mathrm{var}(b_i) = 3.0625$. For each data set, we fit (9) three times, with $K = 0$, 1, and 2, using the MCEM algorithm of Section 3.3, and calculated AIC, BIC, and HQ for each fit. A preliminary study revealed that $K > 2$ was never chosen by any criterion, so larger $K$ were not considered in the interest of speed. For all simulations considered in this section, fits of each simulated data set for $K = 0$, 1, and 2 took about 3, 5, and 10 min respectively on a Sun Ultra 10 workstation.

For this situation, over all 100 data sets, AIC selected $K = 0$ 4 times, $K = 1$ 81 times, and $K = 2$ 15 times; using BIC and HQ, these numbers were (20, 80, 0) and (6, 90, 4) respectively for $K = (0, 1, 2)$. These results echo behavior observed in other contexts (for example, Zhang and Davidian, 2001), that AIC (BIC) tends to choose larger (smaller) $K$, with HQ intermediate. Estimation results are summarized in Table 2. The first part shows results under the usual normality assumption ($K = 0$), while the remaining parts give summaries for the fits preferred by BIC and HQ (results for AIC are similar and are excluded for brevity). Estimation of $\beta_1$ and $\beta_2$ is unbiased in all three cases; however, under the (inappropriate) normality assumption, estimation of $E(b_i)$ and $\mathrm{var}(b_i)$ is compromised.
The apparent robustness in terms of bias of the estimators for β1 and β2 may seem counterintuitive for nonlinear GLMM models, in which the regression coefficients and variance components are not ‘orthogonal’. This lack of orthogonality would be expected to lead to bias under such a misspecification of the random effects distribution. Interestingly, however, apparent unbiased estimation of regression coefficients under misspecification in nonlinear random effects models has also been observed by other authors (for example, Hu et al., 1998; Tao et al., 1999; Zhang and Davidian, 2001). A similar phenomenon has been discussed in nonlinear mixed models for pharmacokinetics by Lai and Shin (2000).

Efficiency under K = 0 for the treatment effect β1 is poor relative to that when the SNP density is estimated, showing that estimation of a treatment effect, often of primary interest, may be inefficient when the usual normality assumption is adopted but is violated. Estimation of β2, which corresponds to a covariate changing within individuals, suffers no loss of efficiency under misspecification of the random effects distribution. Similar results have been reported by Tao et al. (1999) and Zhang and Davidian (2001). We conjecture that, because a cluster-level covariate such as treatment and the latent random effects both pertain to among-individual variation, misspecification of the random effects distribution would compromise quality of estimation of the corresponding regression coefficient. In contrast, a within-individual covariate such as time is roughly ‘orthogonal’ to among-individual effects. Thus, estimation of the associated regression coefficient may be less affected.

Figure 2a presents the average of estimated densities for the 100 data sets for K = 0 (normal case), along with those preferred by AIC, BIC, HQ, and shows that the SNP density representation is capable of approximating the true, underlying density even for K = 1 or 2.
Figure 2b shows the 100 estimated densities preferred by HQ and further supports the flexibility of the approach. Only a few of the 100 fits do not capture the bimodality of the true density. With binary data and only 250 individuals, the information on the underlying random effects is not great; thus, the performance of the SNP approach for detecting and representing a departure from normality is encouraging.
Table 2. Simulation results for 100 data sets: mixture scenario. Ave. is the average of the estimates over 100 data sets, estimated SE is the average of the estimated standard errors, empirical SE is the standard deviation of the 100 estimates, and RE is empirical mean square error for the indicated fit divided by that for K = 0

Parameter (true)          Ave.      Estimated SE   Empirical SE   RE

K = 0
$\beta_1$ (0.5)           0.527     0.298          0.283          1.00
$\beta_2$ (3.0)           3.006     0.549          0.550          1.00
$E(b_i)$ (−0.45)          −0.629    0.212          0.216          1.00
Var$(b_i)$ (3.0625)       3.923     0.703          0.634          1.00

Preferred by BIC
$\beta_1$ (0.5)           0.510     0.263          0.235          0.68
$\beta_2$ (3.0)           3.014     0.551          0.557          1.03
$E(b_i)$ (−0.45)          −0.515    0.194          0.198          0.55
Var$(b_i)$ (3.0625)       3.278     0.569          0.594          0.35

Preferred by HQ
$\beta_1$ (0.5)           0.505     0.255          0.231          0.66
$\beta_2$ (3.0)           3.013     0.551          0.557          1.03
$E(b_i)$ (−0.45)          −0.495    0.190          0.186          0.47
Var$(b_i)$ (3.0625)       3.125     0.533          0.442          0.17

[Figure 2: panels (a) and (b), each plotting Densities (vertical axis, 0.0 to 0.6) against b (horizontal axis, −4 to 4).]
Fig. 2. (a) Average of estimated densities chosen by AIC (dotted curve), BIC (short dashed curve), and HQ (long dashed curve), with true density (solid curve) and average of K = 0 estimates (thick solid curve). (b) 100 estimated densities chosen by HQ.
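The RE column of Table 2 can be checked approximately from the other columns. The sketch below assumes that the empirical mean square error is well approximated by squared bias plus squared empirical SE (the exact divisor used for the empirical SE is not stated, so small rounding differences are possible); the dictionary and function names are illustrative only.

```python
# True parameter values from the mixture scenario.
TRUE = {"beta1": 0.5, "beta2": 3.0, "E(b)": -0.45, "Var(b)": 3.0625}

# (Ave., empirical SE) pairs copied from Table 2.
K0  = {"beta1": (0.527, 0.283), "beta2": (3.006, 0.550),
       "E(b)": (-0.629, 0.216), "Var(b)": (3.923, 0.634)}
BIC = {"beta1": (0.510, 0.235), "beta2": (3.014, 0.557),
       "E(b)": (-0.515, 0.198), "Var(b)": (3.278, 0.594)}
HQ  = {"beta1": (0.505, 0.231), "beta2": (3.013, 0.557),
       "E(b)": (-0.495, 0.186), "Var(b)": (3.125, 0.442)}

# RE values as reported in Table 2.
REPORTED_BIC = {"beta1": 0.68, "beta2": 1.03, "E(b)": 0.55, "Var(b)": 0.35}
REPORTED_HQ  = {"beta1": 0.66, "beta2": 1.03, "E(b)": 0.47, "Var(b)": 0.17}

def mse(ave, emp_se, truth):
    """Approximate empirical MSE as squared bias plus squared empirical SE."""
    return (ave - truth) ** 2 + emp_se ** 2

def relative_efficiency(fit, baseline):
    """MSE of the indicated fit divided by the MSE of the K = 0 fit."""
    return {p: mse(*fit[p], TRUE[p]) / mse(*baseline[p], TRUE[p]) for p in TRUE}

re_bic = relative_efficiency(BIC, K0)
re_hq = relative_efficiency(HQ, K0)
for p in TRUE:
    print(f"{p:7s} BIC {re_bic[p]:.2f} (reported {REPORTED_BIC[p]:.2f})  "
          f"HQ {re_hq[p]:.2f} (reported {REPORTED_HQ[p]:.2f})")
```

Under this approximation the recomputed ratios agree with the reported RE values to two decimal places, which also confirms how the flattened columns of Table 2 pair up.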
Table 3. Simulation results for 100 data sets: normal scenario. Ave. is the average of the estimates over 100 data sets, estimated SE is the average of the estimated standard errors, empirical SE is the standard deviation of the 100 estimates, and RE is empirical mean square error for the indicated fit divided by that for K = 0

Parameter (true)          Ave.      Estimated SE   Empirical SE   RE

K = 0
$\beta_1$ (0.5)           0.510     0.271          0.229          1.00
$\beta_2$ (3.0)           2.969     0.523          0.480          1.00
$E(b_i)$ (−0.45)          −0.478    0.194          0.207          1.00
Var$(b_i)$ (3.0625)       3.136     0.567          0.554          1.00

K = 1
$\beta_1$ (0.5)           0.509     0.272          0.233          1.03
$\beta_2$ (3.0)           2.960     0.524          0.481          1.01
$E(b_i)$ (−0.45)          −0.484    0.202          0.209          1.03
Var$(b_i)$ (3.0625)       3.134     0.784          0.565          1.04

K = 2
$\beta_1$ (0.5)           0.516     0.272          0.242          1.12
$\beta_2$ (3.0)           2.974     0.524          0.479          0.99
$E(b_i)$ (−0.45)          −0.472    0.220          0.220          1.12
Var$(b_i)$ (3.0625)       3.168     1.280          0.583          1.12
We conducted a second simulation under the same parameter settings as above, but with the true distribution of $b_i$ as $N_1(-0.45, 3.0625)$. All three information criteria correctly selected $K = 0$ for all 100 data sets. The results are summarized in Table 3 for each fixed value of $K$; they therefore correspond to the situation where $K$ is chosen a priori and held fixed. The estimates are almost all unbiased. The results show that, for estimating $\beta$, there is little price to be paid in terms of efficiency for allowing extra flexibility in representing the random effects if one chooses $K$ fixed.

6. DISCUSSION

We have proposed a semiparametric generalized linear mixed model with flexible random effects distribution for clustered data. The method exploits the SNP approach to represent the density, where the degree of flexibility is controlled by the parameter $K$, chosen via inspection of standard information criteria. We use the MCEM algorithm with rejection sampling of Booth and Hobert (1999) for implementation, which, along with the parametrization of the density we advocate, provides stable and unbiased estimation, including reliable estimation of the random effects density. Because the algorithm is straightforward to implement and offers good performance, it is attractive for routine use in situations where departure from the usual normal assumption for the random effects is suspected or requires investigation.

The proposed approach is practically feasible for low-dimensional random effects, i.e. $q = 1$, 2, or 3, as is often the case in the clustered data situation. As with any mixed effects model, high-dimensional random effects can lead to prohibitive computation. Indeed, with $q$ large, the available information may be insufficient to entertain any approach to relaxing parametric assumptions on the random effects. Our experience with the approach proposed here in several examples and simulations suggests that the
computational burden associated with small $q$ is not great, and reliable inferences on the random effects distribution are possible. Computational efficiency is in part due to the performance of the rejection sampling scheme for the SNP density, which enjoys acceptance rates close to 50%.

Despite the efficiency of the SNP part of our ‘double’ rejection sampling procedure, the second rejection sampling component may suffer a low acceptance rate in some applications, particularly for binary data configurations where the proportions of 0 and 1 responses within a cluster are severely imbalanced. Under these circumstances, an alternative approach also proposed by Booth and Hobert (1999) is the use of importance sampling at the E-step. These authors suggest a multivariate $t$ importance density, which requires matching the first two moments to those of $f(b \mid y, \zeta^{(r)})$ at the $(r + 1)$th iteration. With the SNP representation for the random effects density, we suspect that $f(b \mid y, \zeta^{(r)})$ is likely to be multimodal, making identification of the dominant mode difficult. Yet another potential approach for the E-step is adaptive quadrature (Pinheiro and Bates, 1995), which may share the same problem. We are currently investigating the performance of these alternatives for this and other popular models.

Our simulations and those of other authors indicate that, contrary to intuition, estimation of regression coefficients may be subject to only negligible bias under misspecification of the random effects distribution. This phenomenon deserves formal investigation.
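To make the importance sampling alternative for the E-step concrete, the following Python sketch estimates a conditional mean $E(b \mid y)$ for a single cluster with a Student $t$ importance density. It is a simplified illustration under stated assumptions, not the procedure of Booth and Hobert (1999) or of this paper: the random effects density is taken to be standard normal as a stand-in for the SNP density, the location and scale of the $t$ density are fixed rather than matched to the moments of $f(b \mid y, \zeta^{(r)})$, and all function names are hypothetical.

```python
import math
import numpy as np

def log_t_pdf(x, df, loc, scale):
    """Log density of a location-scale Student t distribution."""
    z = (x - loc) / scale
    const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi) - math.log(scale))
    return const - 0.5 * (df + 1) * np.log1p(z**2 / df)

def log_target(b, y, offset):
    """Unnormalised log f(b | y) for one cluster: Bernoulli-logit likelihood
    times a N(0, 1) random effects density (a stand-in assumption)."""
    eta = offset[None, :] + b[:, None]
    loglik = np.sum(y[None, :] * eta - np.logaddexp(0.0, eta), axis=1)
    return loglik - 0.5 * b**2

def posterior_mean_is(y, offset, rng, n_draws=20000, df=4.0, loc=0.0, scale=1.5):
    """Self-normalised importance sampling estimate of E(b | y)."""
    b = loc + scale * rng.standard_t(df, size=n_draws)   # draws from the t proposal
    logw = log_target(b, y, offset) - log_t_pdf(b, df, loc, scale)
    w = np.exp(logw - logw.max())                        # stabilise before normalising
    w /= w.sum()
    return float(np.sum(w * b))

# One treated individual with the time points and coefficients of model (9).
t = np.array([-0.2, -0.1, 0.0, 0.1, 0.2])
offset = 0.5 + 3.0 * t
y = np.array([1.0, 1.0, 1.0, 1.0, 0.0])                  # an imbalanced response pattern
rng = np.random.default_rng(1)                           # arbitrary seed
est = posterior_mean_is(y, offset, rng)
print(est)
```

Because the weights are self-normalised, the normalising constant of $f(b \mid y)$ is never needed. The heavy tails of the $t$ proposal keep the weights bounded in this unimodal example; with a multimodal conditional density, as suspected above for the SNP case, a single $t$ density centred on one mode could perform poorly.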
REFERENCES

Aitken, M. (1999). A general maximum likelihood analysis of variance components in generalized linear models. Biometrics 55, 117–128.

Booth, J. G. and Hobert, J. P. (1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society, Series B 61, 265–285.

Breslow, N. E. and Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 88, 9–25.

Breslow, N. E. and Lin, X. (1995). Bias correction in generalized linear mixed models with a single component of dispersion. Biometrika 82, 81–91.

Davidian, M. and Gallant, A. R. (1993). The nonlinear mixed effects model with a smooth random effects density. Biometrika 80, 475–488.

Davidian, M. and Giltinan, D. M. (1995). Nonlinear Models for Repeated Measurement Data. London: Chapman and Hall.

Gallant, A. R. and Nychka, D. W. (1987). Seminonparametric maximum likelihood estimation. Econometrica 55, 363–390.

Gallant, A. R. and Tauchen, G. E. (1992). A nonparametric approach to nonlinear time series analysis: estimation and simulation. In Brillinger, D., Caines, P., Geweke, J., Parzen, E., Rosenblatt, M. and Taqqu, M. S. (eds), New Directions in Time Series Analysis, Part II, New York: Springer, pp. 71–92.

Geweke, J. (1996). Handbook of Computational Economics. Amsterdam: North-Holland.

Hu, P., Tsiatis, A. A. and Davidian, M. (1998). Estimating the parameters in the Cox model when covariate variables are measured with error. Biometrics 54, 1407–1419.

Jiang, J. (1999). Conditional inference about generalized linear mixed models. Annals of Statistics 27, 1974–2008.

Kleinman, K. P. and Ibrahim, J. G. (1998). A semi-parametric Bayesian approach to generalized linear models. Statistics in Medicine 17, 2579–2696.

Lai, T. L. and Shih, M.-C. (2000). Estimation in nonlinear and generalized linear mixed effects models. Technical Report No. 2000-42. Department of Statistics, Stanford University.

Laird, N. M. and Ware, J. H. (1982). Random effects models for longitudinal data. Biometrics 38, 963–974.

Lee, Y. and Nelder, J. A. (1996). Hierarchical generalized linear models. Journal of the Royal Statistical Society, Series B 58, 619–678.

Lin, X. and Breslow, N. E. (1996). Bias correction in generalized linear mixed models with multiple components of dispersion. Journal of the American Statistical Association 91, 1007–1016.

Magder, L. S. and Zeger, S. L. (1996). A smooth nonparametric estimate of a mixing distribution using mixtures of Gaussians. Journal of the American Statistical Association 91, 1141–1151.

McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, 2nd edn. London: Chapman and Hall.

McCulloch, C. E. (1997). Maximum likelihood algorithms for generalized linear mixed models. Journal of the American Statistical Association 92, 162–170.

Monahan, J. F. (1987). An algorithm for generating Chi random variables, algorithm 651. ACM Transactions on Mathematical Software 13, 168–172.

Pinheiro, J. C. and Bates, D. M. (1995). Approximations to the log-likelihood function in the nonlinear mixed effects model. Journal of Computational and Graphical Statistics 4, 12–35.

Schall, R. (1991). Estimation in generalized linear models with random effects. Biometrika 78, 717–727.

Searle, S. R., Casella, G. and McCulloch, C. E. (1992). Variance Components. New York: Wiley.

Tao, H., Palta, M., Yandell, B. S. and Newton, M. A. (1999). An estimation method for the semiparametric mixed effects model. Biometrics 55, 102–110.

Ware, J. H., Spiro, A., Dockery, D. W., Speizer, F. E. and Ferris, B. G. Jr (1984). Passive smoking, gas cooking, and respiratory health of children living in six cities. American Review of Respiratory Disease 129, 366–374.

Verbeke, G. and Lesaffre, E. (1996). A linear mixed-effects model with heterogeneity in the random-effects population. Journal of the American Statistical Association 91, 217–221.

Zeger, S. L. and Karim, M. R. (1991). Generalized linear models with random effects: a Gibbs sampling approach. Journal of the American Statistical Association 86, 79–86.

Zhang, D. and Davidian, M. (2001). Linear mixed models with flexible distributions of random effects for longitudinal data. Biometrics 57, 795–802.

[Received May 22, 2001; revised August 16, 2001; accepted for publication August 17, 2001]