407 63 82KB
English Pages 12 Year 2006
Biostatistics (2006), 7, 3, pp. 387–398 doi:10.1093/biostatistics/kxj014 Advance Access publication on December 20, 2005
Multivariate survival analysis for case–control family data LI HSU∗ Program in Biostatistics and Biomathematics, Public Health Sciences Division, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, M2-B500, PO Box 19024, Seattle, WA 98109-1024, USA [email protected] MALKA GORFINE Department of Mathematics, Bar Ilan University, Ramat-Gan 52900, Israel and Faculty of Industrial Engineering and Management, Technion—Israel Institute of Technology, Technion City, Haifa 32000, Israel
S UMMARY Multivariate survival data arise from case–control family studies in which the ages at disease onset for family members may be correlated. In this paper, we consider a multivariate survival model with the marginal hazard function following the proportional hazards model. We use a frailty-based approach in the spirit of Glidden and Self (1999) to account for the correlation of ages at onset among family members. Specifically, we first estimate the baseline hazard function nonparametrically by the innovation theorem, and then obtain maximum pseudolikelihood estimators for the regression and correlation parameters plugging in the baseline hazard function estimator. We establish a connection with a previously proposed generalized estimating equation-based approach. Simulation studies and an analysis of case–control family data of breast cancer illustrate the methodology’s practical utility. Keywords: Case–control family study; Cox proportional hazards model; Familial aggregation; Frailty model; Innovation theorem.
1. I NTRODUCTION There is an increasing interest in the application of multivariate survival analysis techniques to populationbased case–control studies for estimating the marginal hazard function and the dependencies of correlated ages at disease onset (Li et al., 1998; Shih and Chatterjee, 2002; Hsu et al., 2004). An interesting example is a breast cancer study conducted at the Fred Hutchinson Cancer Research Center (Malone et al., 1996). In this study, the breast cancer incidence cases were ascertained from the Surveillance, Epidemiology, and End Results registry, which is a set of geographically defined, population-based cancer registries in the United States. The controls were obtained through random digit dialing and were matched ∗ To whom correspondence should be addressed. c The Author 2005. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: [email protected]. The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact [email protected].
388
L. H SU AND M. G ORFINE
with cases on the case diagnosis ages and county of residence. First- and second-degree female relatives of cases and controls were identified, and risk factor and outcome information were subsequently collected on them. Researchers were interested in understanding how various risk factors (including candidate genes) affect the age-dependent breast cancer risk and whether these risk factors partly explain the pattern of familial resemblance in ages at diagnosis for breast cancer. The ages at onset of (K + 1) family members can be denoted by T0 , T1 , . . . , TK with marginal hazard function λk (t) = lim h −1 Pr(t Tk < t + h|Tk t) h→0 t and k (t) = 0 λk (s)ds, for k = 0, 1, . . . , K . The index 0 refers to the proband, and 1, . . . , K refers to the K relatives. We loosely define a proband as the individual because of whom the family is ascertained. The proband in this situation therefore can be either diseased (case) or nondiseased (control). Note that this is different from the traditional definition of probands as diseased. The commonly used Cox proportional hazards model (Cox, 1972) specifies that the hazard function for the age at onset Tk associated with a p × 1 vector of covariates Z k takes the form of λk (t; Z k ) = λ0 (t) exp(β0′ Z k ),
(1.1)
where λ0 (·) is an unspecified baseline hazard function and β0 is a p × 1 vector of unknown regression parameters. One of the most commonly studied joint survival distribution of ages at onset is the Clayton– Oakes model (Clayton, 1978; Oakes, 1989), which can be written as K − θ1 0 , θ0 ∈ (0, ∞), (1.2) Sk (tk ; Z k )−θ0 − K Pr(T0 > t0 , T1 > t1 , . . . , TK > t K ; θ0 ) = k=0
and Sk (tk ; Z k ) = exp{−0 (t) exp(β0′ Z k )}. The parameter θ0 indexes the intrafamily dependence. When θ0 = 0, the joint survival distribution degenerates to the product of marginal survival functions and the ages at onset are independent. Much work has been done for prospectively collected data. Wei et al. (1989) proposed an estimating equation-based approach to estimate 0 (t) and β. Fixing the marginal hazard function at these estimates, Shih and Louis (1995) and Glidden (2000) further estimated θ by maximizing the joint density (1.2) for the multivariate survival times. Shih and Chatterjee (2002) subsequently generalized this approach to 0 , was obtained through case–control family survival data. The cumulative baseline hazard estimate, marginalized conditional hazard function for the relatives on the proband data. Advantages of this approach include relatively simple computation and robustness to the misspecification of high-order dependencies among family members. However, the approach may lose efficiency by assuming independence among relatives in the presence of a large number of relatives. Another approach is to use the frailty model, where the relatives share a common (latent) frailty. Many researchers, for example Nielsen et al. (1992), Klein (1992), Murphy (1995, 1996), and Parner (1998), have studied the estimation and inference for this model. A comprehensive review can be found in the survival analysis books by Andersen et al. (1993) and Hougaard (2000). Hsu et al. (2004) generalized this approach to the case–control family data. All these approaches focus on estimating the parameter conditional on the latent frailty. In some situations, when the population-averaged risk is of interest, the marginalized hazard function following the integration of the latent frailty distribution may be sensitive to the misspecification of the assumed joint model. However, an appealing feature of the frailty model-based approach is that it naturally accommodates the dependency among all the relatives of the same relation by assuming a shared frailty. Intuitively, this will improve the efficiency in estimating the strength of dependencies. To circumvent the dependence of the marginalized hazard function on the dependence parameter, Glidden and Self (1999) formulated the conditional hazard function so that the marginalized hazard function obeys the Cox model (1.1); they proposed a modified expectation maximization (EM) algorithm
Survival analysis
389
for parameter estimation. More recently, Pipper and Martinussen (2004) used a generalized estimating equation (GEE)-based approach and provided a large sample theory to their estimators. They showed that the efficiency of the regression coefficient estimators increases with the strength of the dependence and the cluster size. In this paper, we propose a general framework extending the approach of Pipper and Martinussen to the noncohort case–control study design and to accommodating varying degrees of the dependence strength among different types of relatives. When there is only one relative per proband, the proposed approach is the same as that of Shih and Chatterjee (2002). We will describe the method in Section 2 and the simulation results in Section 3. Section 4 illustrates the method in real data. We conclude with final remarks in Section 5. 2. M ETHOD 2.1
Notation and model
Consider n independent case–control families. Let Tik and Cik be the failure time and the censoring time, respectively, for the kth individual in the ith family, k = 0, 1, . . . , K i , i = 1, . . . , n. When k = 0, the index refers to the proband. All other k values refer to the relatives of the proband. We assume that K i is relatively small with respect to n, and K = maxi K i is bounded. For each individual, one can only observe X ik = Tik ∧ Cik and ik = I (Tik Cik ), where ∧ is minimum and I (·) is an indicator function. Let Z ik be a p × 1 vector of bounded covariates. For simplicity, we assume that each family has the same set of relatives, K . The presence or absence of any particular type of relatives is assumed to be independent of the observed data. The absence of these relatives can be denoted by setting Cik = 0. Naturally, such cases will not contribute to any of the proposed likelihoods. Denote X i′ = (X i1 , . . . , X i K ), i′ = (i1 , . . . , i K ), ′ , . . . , Z ′ ), for i = 1, . . . , n. and Z i = (Z i1 iK The K relatives can be divided into m disjoint sets, where each set is defined by the genealogical relationship between the relatives and the proband. For example, in the breast cancer study, the female relatives ascertained for each proband were mothers, sisters, grandmothers, and paternal and maternal aunts. The sets then can be defined accordingly. Denote v ik the set to which the ikth individual belongs, and it takes value 1, . . . , M, where M is the total number of the disjoint sets. Let v i′ = (v i1 , . . . , v i K ). For the jth set in the ith family, let ωi j be the unobservable random effect with the distribution function Fθ j (·), j = 1, . . . , M. Assume that conditional on ωi j and Z i , the failure times of the relatives in the jth set, are independent and jointly they are independent of the censoring times of these relatives. We also assume that the censoring times are noninformative of ωi j . Technically, the distributions can be different for each set of relatives. For simplicity, we assume F(·) the same for all sets, but the dependence parameter θ j takes different values for each set. Moreover, we leave the joint distribution of ωi′ = (ωi1 , . . . , ωi M ) unspecified. Assuming a multiplicative effect ωi j on the hazard function, the conditional hazard function for the ikth individual takes the form of (2.1) λck (t; Z ik , ωi ) = αik (t)ωivik . The hazard function αik (t) has a one-to-one relationship with the marginal hazard function in (1.1). This relationship can be expressed as ∞ t−
αik (s)ds Fθvik dωivik . exp −ωivik 0 (t−)eβ Z ik = − log 0
0
For the Clayton–Oakes model (1.2), which is equivalent to a gamma frailty model with Fθ j (ω) = ω 1 θ1 θ1 −1
j y j exp − θyj / Ŵ θ1j dy, j = 1, . . . , M, the relationship can be written explicitly as 0 θj
′ αik (t) = λ0 (t) exp β0′ Z ik + θvik eβ0 Z ik 0 (t−) .
390
L. H SU AND M. G ORFINE
Note that an added advantage of formulating the marginal hazard function as in (1.1) is that there will be no ambiguity on how to estimate the marginal distribution of an individual who is related to different types of relatives with different θ values. In the counting process notation, Yik = I (X ik t) and Nik (t) = ik I (X ik t), t ∈ [0, τ ], where τ is the maximum follow-up time. Note that Nik (t) is a counting process with marginal intensity function Yik (t)λ(t; Z ik ), where λ(t; Z ik ) satisfies (1.1). 2.2
An estimation procedure
In this section, we describe a frailty model approach for estimating the parameters {β0 , 0 (t), θ1 , . . . , θ M } from case–control family data. The data consist of iid random elements (X i , i , Z i , v i , ωi ) (i = 1, . . . , n) if ω were observed. We assume that the effect of the covariates on the age at onset is subject specific, i.e. Pr(X i j , i j |Z i ) = Pr(X i j , i j |Z i j ). The likelihood function for case–control family data can be written as L=
n
f (X i , i , Z i , Z i0 |X i0 , i0 ),
i=1
where the conditioning event describes the ascertainment scheme of case–control sampling. Following the conditional probability rule, this likelihood can be further decomposed into n
f (X i , i , Z i |Z i0 , X i0 , i0 ) ×
n
f (Z i0 |X i0 , i0 ).
i=1
i=1
One can see that the first part of the likelihood contains information from the relatives’ data (ages at onset and risk factors X i ) conditional on the proband ages at onset and risk factor data. The second part is the conventional retrospective likelihood for case–control data without families. This suggests that one can obtain consistent parameter estimates of β from either part of the likelihood. However, the information for the dependencies of ages at onset among family members as well as the baseline hazard function is largely from the first part of the likelihood, the data of the relatives. In what follows we describe each part of the likelihood in detail, starting with the second part of the likelihood, the likelihood function of the case–control data. n f (Z i0 |X i0 , i0 ), where the subscript cc stands for case and control. Assuming cases Let Lcc = i=1 and controls are age matched and there are Q matched sets, Lcc can be replaced by a conditional likelihood (Prentice and Breslow, 1978): Q exp(β ′ Sq ) , ′ ℓ∈Rq exp(β Sℓ ) q=1
where Sq is the sum of the covariate vectors of the dq cases in the qth stratum and Rq is the set of all selections of dq individuals chosen without replacement from those in the qth stratum. This is equivalent to a stratified partial likelihood function with each stratum corresponding to a matched set of cases and controls. The standard estimation procedure applies. When cases and controls are not matched on age, one can follow the Bayes rule to estimate the parameter. However, this approach requires an explicit modeling of the covariate distribution in the population and a stronger assumption than conditional independence for the censoring time. That is, it needs to be completely independent of the failure time. Next we show how to model the first part of the likelihood L, which represents contributions from the relatives of the probands. In order to obtain a full likelihood function, one needs to assume a joint distribution for ωi1 , . . . , ωi M . Such an approach may not be appealing because it is more difficult to perform model diagnostic procedure for a joint distribution than for a univariate distribution, and any violations
Survival analysis
391
will potentially invalidate the parameter estimates. Moreover, the computation can be cumbersome and intensive, which limits the general use of the method. We therefore propose a pseudolikelihood approach, a product of the likelihood function for each of the J sets of relatives. The idea is similar to the GEE (Liang and Zeger, 1986), in that the joint distribution of ωi1 , . . . , ωi M is left unspecified. Denote the pseudolikelihood by LR , where R stands for relatives, and we can write
LR =
M n
f ({X i , i , Z i } in the J th set|X i0 , i0 , Z i0 ; θ j )
j=1 i=1
=
M n
∞
f ({X i , i , Z i } in the J th set|ωi j , X i0 , i0 , Z i0 ) f (ωi j |X i0 , i0 , Z i0 ; θ j )dωi j .
j=1 i=1 0
Under the assumptions that the censoring is noninformative of ω values and the conditional independence of censoring and failure times given ωi values and the covariates, the likelihood function
LR ∝
M n K
j=1 i=1 k=1 0
∞
⎡
⎣exp −ωi j
τ
Yik (s)αik (s)ds
0
⎤ I (υik = j)
{ωi j Yik (s)αik (s)}Nik (s) ⎦
s∈[0,τ ]
× f (ωi j |X i0 , i0 , Z i0 ; θ j )dωi j .
If a parametric model for the baseline hazard function λ0 (t) is assumed, one may simply utilize the maximum likelihood machinery to obtain the estimates of the parameters. However, a parametric modeling of λ0 (t) may not be able to fully capture the often arbitrary age-dependent trend of the hazard function. It is desirable to estimate λ0 (t) in a nonparametric fashion. At time t, define the observed history j Gt = {Nik (s), Yik (s), Z ik , X i0 , i0 , Z i0 , υik = j, s ∈ [0, t], i = 1, . . . , n, k = 1, . . . , K }, including the history of events for all relatives in the jth ( j = 1, . . . , M) set up to time t and the history of events v ik for all probands up to time τ . The G(·−) intensity of Nik (t) is given by replacing ωivik by its conditional v ik expectation with respect to the history G(·−) , i.e. v ik
G (t) = Yik (t)αik (t) ωivik (t−), λik
(2.2)
v ik where ωivik (t−) = E ωivik |Gt− . Simple algebraic calculations show that when there is only one relative G vik per proband, the intensity λik has the same expression as the conditional hazard function (5) in Shih and Chatterjee (2002), provided that the copula model, namely, the joint distribution of multivariate failure times, can be induced by frailties. Oakes (1989) gave a comprehensive account of such bivariate survival models and noted exceptions: for example, the bivariate survival distribution introduced by Gumbel (1961) cannot be a frailty distribution even though it is a copula model. 0 . Using the Clayton–Oakes The structure of (2.2) suggests a Nelson–Aalen type of estimator for model function (2.2) can be expressed as Yik (t)λ0 (t)Hik (t), where Hik (t) = (1.2), the intensity ′ exp β ′ Z ik + θvik eβ Z ik 0 (t−) ωivik (t−). Treating Hik (t) as the time-dependent risk score as in Shih and Chatterjee (2002), we can write the Nelson–Aalen type of nonparametric estimator as 0 (t) =
0
t
n
n
i=1
i=1
K
K
k=1
Nik (du)
k=1 Yik (t)Hik (t)
.
(2.3)
392
L. H SU AND M. G ORFINE
Using the Clayton–Oakes model (1.2), ω has an explicit form given by ωi j (t) = K
k=1
1 + θj
K
k=1
I (v ik = j)Nik (t) + θ j Ni0 (τ )
I (Vik = j)[exp{0 (t ∧ X ik )eβ Z ik θ j }−1] + exp{0 (X i0 )eβ Z i0 θ j }
,
(2.4)
for j = 1, . . . , M. Note that ω involves 0 for the proband observation times X i0 that can be greater than time t, i.e. Hik (t) is not predictable in the usual martingale sense. An iterative procedure is needed 0 (t), t ∈ [0, τ ]. Large sample theory for this kind of problem is generally not available. for obtaining 0 (t) However, our experience with simulated data sets shows that the baseline hazard function estimator in (2.3) converges and, moreover, the biases are few, if there are any. 0 . We construct a profile likelihood Estimation of β and θ values is straightforward once we know 0 in (2.3). Standard nonlinear optimization routine can be LR ×Lcc in β and θ values by replacing 0 by used to obtain the maximum profile likelihood estimates for β and θ . We term this approach as the ‘joint’ analysis of cluster members because it utilizes the information for all family members without breaking them down into pairs, as in the following marginal approach. We further extend this approach by applying again the GEE (Liang and Zeger, 1986) to relative– proband pairs. Specifically, the baseline hazard estimate is still in the form of (2.3) except now that −1 1 + θvik Nik (t) + θvik Ni0 (τ ) . When ωivik (t) = exp 0 (t)eβ Z ik θvik + exp 0 (X i0 )eβ Z i0 θvik − 1 there is only one set of relatives, i.e. θvik ≡ θ, plugging ω into (2.3) yields GEE = 0
0
t
K k=1 Nik (du) i=1 , (0) S (β, θ, 0 , u)
n
′ β ′ Z ik (t−) exp (t)eβ Z ik θ where S (0) (β, θ, 0 , u) = v ik + 0 0 i k Yik (u) exp β Z ik + θv ik e −1 GEE is in 1 + θvik Nik (t) + θvik Ni0 (τ ) . The baseline hazard estimate exp 0 (X i0 )eβ Z i0 θvik − 1 0 exactly the same form as (7) in Shih and Chatterjee (2002). Estimates of β and θ values can be similarly obtained by maximizing either the pseudolikelihood that uses the data on relative–proband pairs or the full likelihood that incorporates the joint distribution of all relatives in the same cluster with a plug-in estimate of 0 (t). Advantages of the GEE-based approach, especially for the estimates obtained based on the relative–proband pairs, are easy to compute and robust to the misspecification of higher (>2)-order correlations among family members. However, it may suffer an efficiency loss compared to the pseudolikelihood approach that partitions the relatives into clusters. Besides, it seems natural to assume that relatives with the same relation to the proband share a common frailty. 3. A
SIMULATION STUDY
We conducted a simulation study to evaluate the finite sample properties of the joint and GEE-based approaches. Matched case–control family data were generated, mimicking the case–control family study as seen in practice. Specifically, for each individual we generated a continuous variable Z from the standard normal distribution N (0, 1) as an individual-specific risk factor. We also generated a common frailty value ω from the gamma distribution with scale and shape parameters θ −1 , where θ quantified the strength of the dependence among family members. This common frailty value was shared by all relatives within the same family. Let β be the log hazard ratio of Z after averaging over the frailty distribution. The marginal hazard function under the Cox proportional hazards model can be written as λ(t; Z ) = λ0 (t) exp(β Z ),
Survival analysis
393
where λ0 (t) follows a Weibull distribution with density ba b t b−1 exp{−a b x b }, where a > 0 and b > 0. The failure time for an individual with covariate Z and frailty value ω can be generated by T = a −1 [log{1 − (ωθ)−1 log(1 − u)}/{θ exp(β Z )}]1/b , where u is a random variable from a uniform [0,1]. The censoring distribution is normal with mean 60 years of age and standard deviation 20 years. We choose a = 0.01 and b = 4.6 for the baseline hazard function. Together with the censoring, this yields a censoring of about 80% and the mean age at onset about 65 years old. This is in line with the age at onset of breast cancer in the general population with a slightly lower censoring than observed. We also generate censoring times from normal with mean 90 years of age and standard deviation 20 years, yielding about 60% censoring. This is to examine how the proposed methods perform when the censoring percentage is lower than 80%. We generated 20 000 individuals as the base population. Arbitrarily assigning the first individual of each family as the index subject, we randomly select n = 200 diseased subjects and match each diseased subject with an unaffected subject whose censoring time is within 1 year of the age at onset for the disease subject. This sampling results in a total of 200 case families and 200 control families that are matched on age. We considered simulation situations that were a combination of the following parameter choices. We let θ = 1 or 3 so that the strength of the dependence parameter ranges from moderate to strong. The regression coefficient β takes value log(2), giving the relative risk of covariate 2. We repeated each simulation scenario 100 times. Results are summarized by the sample mean and sample standard deviation , { 0 (t), t = 60, 70, or 80}, and θ over these 100 simulated data sets. of β and 0 and the dependence parameter Table 1 shows a summary of marginal parameter estimates of β estimate θ . Both approaches, the joint and GEE, yielded virtually unbiased estimates of all parameters under most situations considered. When the number of relatives is two, θ = 3, and 80% censoring rate, the dependence estimates θ are slightly biased upward. This is due to a few simulated data sets that have large θ values. When we take the median, the corresponding estimates for the GEE and the joint analyses are 3.02 and 3.01, respectively. For all situations considered, we did not encounter any convergence problem in any of the simulated data sets. are more efficient for the joint analysis than the GEE-based estimators. This We see that θ and β is especially pronounced in situations with a high degree of correlation and relatively large number of 0 from the joint analysis is more moderate compared to that for β relatives. The gain in efficiency for and θ . This suggests that the efficiency gain may be mostly from the likelihood estimation of β and θ , 0 . We subsequently performed a simulation study using GEE-based rather than the joint estimation of approach for 0 estimate, but using the joint distribution for all family members for estimating β and θ. The simulation results show that the new estimators do (almost) equally well as the joint analysis (results not shown). We examined the performance of the bootstrap method for estimating the standard errors of the parameter estimates. Given the time demand for the bootstrap method, we only examined one situation, which is closest to our real data, that is, θ = 3 and the censoring rate of about 80%. We used 50 bootstrap samples for each simulated data set, where the sampling unit was each matched case–control set with their relatives. A very small amount of Gaussian noise, N (0, 0.1), was added to the observational times in the bootstrap samples in order to break the ties. Table 2 shows a summary of sample means of bootstrap standard deviations and corresponding estimated coverage probabilities of nominal 95% confidence intervals, along with sample means and standard deviations of parameter estimates. The bootstrap standard deviations are generally close to sample standard deviations. The coverage probabilities for the joint approach maintain about 95% confidence intervals, whereas the coverage probabilities for the GEE approach are slightly underestimated.
0.095
0.194
0.358
0 (60)
0 (70)
0 (80)
0.693
0.095
0.194
0.358
β
0 (60)
0 (70)
0 (80)
(censoring % = 80%) θ 1.0
0.693
β
(censoring % = 60%) θ 1.0
True value
1.03 (0.295) 1.03 (0.271) 0.667 (0.191) 0.668 (0.188) 0.093 (0.016) 0.093 (0.015) 0.186 (0.030) 0.186 (0.030) 0.345 (0.059) 0.345 (0.057)
1.06 (0.230) 1.05 (0.209) 0.681 (0.172) 0.682 (0.164) 0.092 (0.015) 0.092 (0.014) 0.186 (0.027) 0.186 (0.026) 0.345 (0.048) 0.345 (0.047)
2
1.04 (0.238) 1.03 (0.195) 0.694 (0.171) 0.698 (0.170) 0.090 (0.014) 0.090 (0.014) 0.183 (0.026) 0.183 (0.026) 0.336 (0.047) 0.336 (0.046)
1.06 (0.212) 1.05 (0.175) 0.679 (0.127) 0.682 (0.113) 0.092 (0.013) 0.092 (0.013) 0.188 (0.022) 0.187 (0.021) 0.348 (0.037) 0.348 (0.036)
3
Number of relatives
1.05 (0.236) 1.05 (0.169) 0.687 (0.148) 0.689 (0.142) 0.092 (0.012) 0.091 (0.011) 0.185 (0.023) 0.185 (0.021) 0.341 (0.042) 0.340 (0.039)
1.00 (0.197) 1.02 (0.143) 0.684 (0.107) 0.686 (0.106) 0.094 (0.010) 0.094 (0.010) 0.191 (0.019) 0.199 (0.019) 0.350 (0.033) 0.349 (0.032)
4
0.358
0.194
0.095
0.693
3.0
0.358
0.194
0.095
0.693
3.0
True value
3.21 (0.709) 3.10 (0.685) 0.697 (0.171) 0.692 (0.157) 0.090 (0.017) 0.092 (0.016) 0.180 (0.034) 0.186 (0.031) 0.329 (0.064) 0.343 (0.060)
3.04 (0.556) 3.06 (0.470) 0.674 (0.140) 0.665 (0.122) 0.096 (0.017) 0.096 (0.017) 0.192 (0.031) 0.193 (0.031) 0.353 (0.049) 0.354 (0.049)
2
2.97 (0.605) 2.97 (0.566) 0.677 (0.125) 0.679 (0.114) 0.092 (0.014) 0.092 (0.014) 0.190 (0.029) 0.190 (0.029) 0.352 (0.050) 0.351 (0.047)
3.06 (0.492) 3.09 (0.404) 0.666 (0.115) 0.674 (0.100) 0.095 (0.014) 0.095 (0.013) 0.194 (0.028) 0.194 (0.026) 0.357 (0.047) 0.355 (0.041)
3
Number of relatives
3.00 (0.497) 3.04 (0.242) 0.675 (0.129) 0.674 (0.106) 0.095 (0.015) 0.094 (0.014) 0.193 (0.030) 0.192 (0.028) 0.350 (0.052) 0.350 (0.048)
3.04 (0.477) 3.06 (0.384) 0.663 (0.110) 0.675 (0.084) 0.095 (0.014) 0.094 (0.013) 0.193 (0.026) 0.192 (0.025) 0.356 (0.046) 0.353 (0.044)
4
and 0 and the dependence parameter estimate Table 1. Summary of marginal parameter estimates of β θ . Each cell is the mean (standard error) of the estimate over 100 simulated data sets. The first lines of the entry are estimates from the GEE-based approach and the second lines are estimates from the joint distribution approach
394 L. H SU AND M. G ORFINE
Survival analysis
395
Table 2. Summary of bootstrap standard deviations (SDs) and corresponding estimated coverage probabilities of nominal 95% confidence intervals (CIs) under the situation of two relatives per family with 80% censoring. The first lines are estimates from the GEE approach and the second lines are estimates from the joint distribution approach True value Sample mean Sample SD (Number of relatives = 2, censoring % = 80%) θ 3.0 3.21 0.709 3.10 0.685 β 0.693 0.697 0.171 0.692 0.157 0 (60) 0.095 0.090 0.017 0.092 0.016 0 (70) 0.194 0.180 0.034 0.186 0.031 0 (80) 0.358 0.329 0.064 0.343 0.060
Mean of bootstrap SD
Coverage probability of 95% CI
0.677 0.651 0.177 0.160 0.018 0.018 0.036 0.035 0.065 0.064
0.91 0.93 0.98 0.99 0.88 0.95 0.88 0.93 0.89 0.95
4. I LLUSTRATION The case–control family study of early-onset breast cancer (Malone et al., 1996) recruited a set of population-based cases and controls and their first- and second-degree relatives to study the familial aggregation of age at diagnosis of breast cancer. A primary goal of the study was to determine whether the age at diagnosis of breast cancer in probands was associated with the age at diagnosis of breast cancer in relatives. For illustrative purpose, we only use a subset of the data: probands (cases and controls) and their mothers and aunts. There are a total of 820 cases and 942 controls with 1762 mothers and 5015 aunts. Among these, 904 probands have two aunts or fewer, 731 have three to six aunts, and 127 have more than six aunts. In these relatives, 155 mothers (mean: 53 years old; standard deviation: 12 years old) and 256 aunts (mean: 54 years old; standard deviation: 12 years old) have breast cancer. We take both the joint (Joint) and GEE-based (GEE) analyses with a common baseline hazard function for all individuals. In the joint analysis, the baseline hazard function is estimated by using (2.3) and (2.4), so that all family members are used without being broken down into relative–proband pairs. In the GEE-based analysis, both baseline hazard function and parameters β and θ are estimated under a working assumption that all relative–probands were independent. We also analyze the data by estimating the baseline hazard function with relative–proband pairs but estimating β and θ with the full set of relatives. We call this analysis the GEE2 analysis. Table 3 summarizes the dependency parameter estimates θ1 for 0 at 40, 50, 60, 70, and 80 years old. Fifty bootstrap samples with families as mothers, θ2 for aunts, and sampling units are used to calculate the standard errors of the parameter estimates. In the joint analysis, the dependency of ages at onset between probands and their mothers is strong (θ = 2.524), whereas the dependency between probands and their aunts is much weaker (θ = 0.705). Under the assumed gamma frailty model, (1 + θ) can also be interpreted as the constant relative risk of the hazard rate for the relative to develop breast cancer at age t2 given the proband having the breast cancer at age t1 compared to that given the proband being cancer free at age t1 . In other words, our results for mother–proband relation implied that the mothers of cases had an average 2.524 + 1 = 3.524–fold increased risk of developing breast cancer compared to the mothers of controls. In contrast, the aunts of cases had a much reduced increased risk, 0.705 + 1 = 1.705, than that of controls. Both dependence estimates were significantly greater than zero, i.e. independence, at level 0.05. We also obtained the baseline hazard function estimate and its point-wide standard errors. The cancer-free probability can therefore be
396
L. H SU AND M. G ORFINE
Table 3. Analysis of a case–control family study of breast cancer. Fifty bootstrap samples were used to calculate the standard errors (SEs) of parameter estimates
θ1 (mother) θ2 (aunt) 0 (40) × 102 0 (50) × 102 0 (60) × 102 0 (70) × 102 0 (80) × 102
GEE† Estimate SE 2.320 0.383 0.498 0.175 0.581 0.104 1.626 0.180 3.216 0.316 5.407 0.550 8.399 0.884
GEE2‡ Estimate SE 2.571 0.431 0.726 0.221 0.541 0.099 1.512 0.176 2.993 0.311 5.028 0.535 7.791 0.872
Joint§ Estimate 2.524 0.705 0.546 1.528 3.026 5.096 7.923
SE 0.418 0.210 0.098 0.174 0.306 0.527 0.866
† GEE: both, (t) and {θ , θ }, are estimated from relative–proband pairs. 0 1 2 ‡ GEE2: (t) is estimated from relative–proband pairs, and {θ , θ } are estimated from the joint 0 1 2 0 . distribution of all relatives with a plug-in estimate § Joint: both, (t) and {θ , θ }, are estimated from the joint distribution of all relatives. 0 1 2
0 (t)}. Using this formula, we estimated from our data that the cancer-free estimated by taking exp{− probability by age 80 for a woman was about 0.92 with 95% confidence interval (0.91, 0.94). The GEE2 analysis yields dependence estimates similar to the joint analysis with the joint analysis moderately more efficient. The GEE analysis, however, yields much lower dependence estimates, especially for the aunt, compared to either GEE2 or joint estimates. This is in part due to the fact that in this data set, aunts from larger aunt-ship sizes tend to have better disease-free probability than aunts from small aunt-ship sizes. The GEE, i.e. marginal approach that breaks down the large aunt-ship by aunt–proband pairs, includes many pairs that have fewer breast cancer cases in aunts. We excluded the aunt-ships having larger than six aunts and reran the GEE analysis. The dependence estimates for mother and aunts are 2.287 (standard error 0.459) and 0.648 (standard error 0.191), much closer to the dependence estimates from GEE2 or joint analysis. 5. S OME CONCLUDING REMARKS In this paper, we present a general framework for handling the family size >2 and establish a connection with a previous marginal approach proposed by Shih and Chatterjee (2002). Simulation results show that the efficiency gain by the proposed method is most pronounced with a high degree of correlation and with large family size. The real data analysis also shows a potential efficiency gain for the joint approach. Like any other (pseudo)likelihood-based method, the proposed approach may be sensitive to the misspecification of the joint distribution for the entire family. The lack of robustness is somewhat eased by partitioning relatives into clusters, where each cluster consists of only the relatives with same relation to the proband, allowing for different dependency parameters (or even different distribution) for each cluster. In addition, the computational demand for the joint analysis is greater than the GEE type of marginal approach. The proposed approach involves a hypothesized frailty shared by relatives with the same relation to the proband. We note that the frailty here is a variation between families, not a variation between individuals. It is introduced for the purpose of accommodating the unexplained dependency of ages at onset among family members. In other words, the latent frailty is like a common covariate that is missing for all the relatives in the same cluster. One can potentially introduce an additional individual-level frailty to account for the heterogeneity among individuals or missing covariates at individual level. In this situation,
Survival analysis
397
Heckman and Singer (1984) showed theoretically that both the frailty distribution and the baseline hazard function can be identified in the presence of other observed covariates, but estimating them from the data can be problematic. We also note that the marginal hazard function conditional on the observed covariates, the focus of the paper, can perhaps be considered as the hazard function averaging over all unmeasured individual risk factors. The shared frailty model for each set of relatives may not fully capture the relations among the relatives in the set. For example, in the set of paternal aunts, the dependency of these aunts on the proband can be hypothesized sharing a common frailty; however, the dependency among aunts, who are sisters among themselves, may share a common frailty that is not the same as aunt–proband relation. A possible solution is to consider a hierarchical frailty distribution. This is shown in the following simplified derivation. Let T1 and T2 be the failure times for paternal aunts and T0 be the failure time of the proband. Furthermore, let ω1 and ω2 be the frailties shared among aunts (i.e. sisters) and shared between aunts and the proband, respectively. Then the joint survival distribution of T1 , T2 , and T0 can be written as Pr(T1 t1 , T2 t2 , T0 t0 ) ∞ ∞ = Pr(T1 t1 |ω1 ) Pr(T2 t2 |ω1 ) f (ω1 |ω2 )dω1 Pr(T0 t0 |ω2 ) f (ω2 )dω2 . 0
0
A bivariate distribution of ω1 and ω2 needs to be assumed. However, the corresponding conditional hazard function under such bivariate distributions may be difficult to obtain. Other approaches such as additive gamma frailty model (Parner, 1998) and multiple level multivariate survival model (Bandeen-Roche and Liang, 1996) could also be employed to deal with multivariate frailties. , 0 under a general frailty model with cohort family We provide a large sample theory for β θ , and data, provided that there are finite moments for the frailty distribution (Gorfine et al., 2005). However, the theory developed for the cohort data cannot be directly applied to case–control family data, in part, because ω(t) in (2.4) at time t involves proband observational times that can be greater than t. Currently, we rely on bootstrap methods to draw inferences on parameter estimates. Further work is needed to show the large sample properties of the proposed and other similar estimators (Shih and Chatterjee, 2002). ACKNOWLEDGMENTS This work is supported in part by grants from the National Institutes of Health (AG14358 and CA 53996) and the United States-Israel Binational Science Foundation. R EFERENCES A NDERSEN , P. K., B ORGAN , O., G ILL , R. D. Processes. New York: Springer.
AND
K EIDING , N. (1993). Statistical Models Based on Counting
BANDEEN -ROCHE , K. J. AND L IANG , K.-Y. (1996). Modeling failure-time associations in data with multiple levels of clustering. Biometrika 83, 29–39. C LAYTON , D. G. (1978). A model for association in bivariate life tables and its application in epidemiological studies of familial tendency in chronic disease incidence. Biometrika 65, 141–151. C OX , D. R. (1972). Regression models and life tables (with discussion). Journal of the Royal Statistical Society Series B 34, 187–220. G LIDDEN , D. V. (2000). A two-stage estimator of the dependence parameter for the Clayton-Oakes model. Lifetime Data Analysis 6, 141–156.
398
L. H SU AND M. G ORFINE
G LIDDEN , D. V. AND S ELF, S. G. (1999). Semiparametric likelihood estimation in the Clayton-Oakes failure time model. Scandinavian Journal of Statistics 26, 363–372. G ORFINE , M., Z UCKER , D. M. AND H SU , L. (2006). Prospective survival analysis with a general semiparametric shared frailty model—a pseudo full likelihood approach. Biometrika (in press). G UMBEL , E. J. (1961). Bivariate logistic distribution. Journal of the American Statistical Association 56, 335–349. H ECKMAN , J. AND S INGER , B. (1984). The identifiability of the proportional hazard model. Review of Economic Studies 81, 231–241. H OUGAARD , P. (2000). Analysis of Multivariate Survival Data. New York: Springer. H SU , L., C HEN , L., G ORFINE , M. AND M ALONE , K. (2004). Semiparametric estimation of marginal hazard function from case–control family studies. Biometrics 60, 936–944. K LEIN , J. P. (1992). Semiparametric estimation of random effects using the Cox model based on the EM algorithm. Biometrics 48, 795–806. L I , H., YANG , P. AND S CHWARTZ , A. G. (1998). Analysis of age at onset data from case–control family studies. Biometrics 54, 1030–1039. L IANG , K. Y. 13–22.
AND
Z EGER , S. L. (1986). Longitudinal analysis using generalized linear models. Biometrika 73,
M ALONE , K. E., DALING , J. R., W EISS , N. S., M C K NIGHT, B., W HITE , E. AND VOIGT, L. F. (1996). Family history and survival of young women with invasive breast carcinoma. Cancer 78, 1417–1425. M URPHY, S. A. (1995). Consistency in a proportional hazards model incorporating a random effect. Annals of Statistics 22, 712–731. M URPHY, S. A. (1996). Asymptotic theory for the frailty model. American Statistician 23, 183–214. N IELSEN , G., A NDERSEN , P., G ILL , R. AND S ORENSEN , T. (1992). A counting process approach to maximum likelihood estimation in frailty models. Scandinavian Journal of Statistics 19, 25–43. OAKES , D. (1989). Bivariate survival models induced by frailties. Journal of the American Statistical Association 84, 487–493. PARNER , E. (1998). Asymptotic theory for the correlated gamma-frailty model. The Annals of Statistics 26, 183–214. P IPPER , C. B. AND M ARTINUSSEN , T. (2004). An estimating equation for parametric shared frailty models with marginal additive hazards. Journal of the Royal Statistical Society B 66, 207–220. P RENTICE , R. L. 153–158.
AND
B RESLOW, N. E. (1978). Retrospective studies and failure time models. Biometrika 65,
S HIH , J. H. AND C HATTERJEE , N. (2002). Analysis of survival data from case–control family studies. Biometrics 58, 502–509. S HIH , J. H. AND L OUIS , T. A. (1995). Inferences on the association parameter in copula models for bivariate survival data. Biometrics 51, 1384–1399. W EI , L. J., L IN , D. Y. AND W EISSFELD , L. (1989). Regression analysis of multivariate incomplete failure time data by modeling marginal distributions. Journal of the American Statistical Association 84, 1065–1073. [Received July 27, 2005; revised December 9, 2005; accepted for publication December 16, 2005]