FRONTIERS OF STATISTICS
FRONTIERS OF STATISTICS  The book series provides an overview of the new developments on the frontiers of statistics. It aims at promoting statistical research that has high societal impact and at offering a concise overview of the recent developments on a topic at the frontiers of statistics. The books in the series are intended to give graduate students and new researchers an idea of where the frontiers of statistics are and of the common techniques in use, so that they can advance the fields by developing new techniques and new results. They are also intended to provide extensive references so that researchers can follow the threads to learn more comprehensively what is in the literature, to conduct their own research, and to cruise and lead the tidal waves on the frontiers of statistics.
SERIES EDITORS

Jianqing Fan
Frederick L. Moore '18 Professor of Finance,
Director of Committee of Statistical Studies,
Department of Operations Research and Financial Engineering,
Princeton University, NJ 08544, USA.

Zhiming Ma
Academy of Math and Systems Science,
Institute of Applied Mathematics,
Chinese Academy of Science,
No. 55, Zhong-guan-cun East Road,
Beijing 100190, China.
EDITORIAL BOARD Tony Cai, University of Pennsylvania; Min Chen, Chinese Academy of Science; Zhi Geng, Peking University; Xuming He, University of Illinois at Urbana-Champaign; Xihong Lin, Harvard University; Jun Liu, Harvard University; Xiao-Li Meng, Harvard University; Jeff Wu, Georgia Institute of Technology; Heping Zhang, Yale University
New Developments in Biostatistics and Bioinformatics
Editors
Jianqing Fan Princeton University, USA
Xihong Lin Harvard University, USA
Jun S. Liu Harvard University, USA
Volume 1
Higher Education Press
World Scientific
NEW JERSEY · LONDON · SINGAPORE · BEIJING · SHANGHAI · HONG KONG · TAIPEI · CHENNAI
Jianqing Fan
Department of Operations Research and Financial Engineering
Princeton University

Xihong Lin
Department of Biostatistics, School of Public Health
Harvard University

Jun Liu
Department of Statistics
Harvard University
Copyright © 2009 by
Higher Education Press, 4 Dewai Dajie, 100011, Beijing, P.R. China
and
World Scientific Publishing Co. Pte. Ltd., 5 Toh Tuck Link, Singapore 596224
All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording or by any information storage and retrieval system, without permission in writing from the Publisher.
ISBN 13: 978-981-283-743-1 ISBN 10: 981-283-743-4 ISSN 1793-8155
Printed in P. R. China
Preface
The first eight years of the twenty-first century have witnessed an explosion of data collection, with relatively low costs. Data with curves, images and movies are frequently collected in molecular biology, health science, engineering, geology, climatology, economics, finance, and the humanities. For example, in biomedical research, MRI, fMRI, microarray, and proteomics data are frequently collected for each subject, involving hundreds of subjects; in molecular biology, massive sequencing data are becoming rapidly available; in natural resource discovery and agriculture, thousands of high-resolution images are collected; in business and finance, millions of transactions are recorded every day. Frontiers of science, engineering, and the humanities differ in the problems of their concern, but nevertheless share a common theme: massive or complex data have been collected and new knowledge needs to be discovered. Massive data collection and new scientific research have a strong impact on statistical thinking, methodological development, and theoretical studies. They have also challenged traditional statistical theory, methods, and computation. Many new insights and phenomena need to be discovered and new statistical tools need to be developed.

With this background, the Center for Statistical Research at the Chinese Academy of Science initiated the conference series "International Conference on the Frontiers of Statistics" in 2005. The aim is to provide a focal venue for researchers to gather, interact, and present their new research findings, to discuss and outline emerging problems in their fields, to lay the groundwork for future collaborations, and to engage more statistical scientists in China in research on the frontiers of statistics. After the general conference in 2005, the 2006 International Conference on the Frontiers of Statistics, held in Changchun, focused on the topic "Biostatistics and Bioinformatics". The conference attracted many top researchers in the area and was a great success. However, there were still many Chinese scholars, particularly young researchers and graduate students, who were not able to attend the conference. This hampered one of the purposes of the conference series. From this, an alternative idea was born: inviting active researchers to provide a bird's-eye view of the new developments in the frontiers of statistics, on the theme topics of the conference series. This broadens significantly the benefits of statistical research, both in China and worldwide. The edited books in this series aim at promoting statistical research that has high societal impact and provide not only a concise overview of recent developments in the frontiers of statistics, but also useful references to the literature at large, leading readers truly to the frontiers of statistics.

This book gives an overview of recent developments in biostatistics and bioinformatics. It is written by active researchers in these emerging areas. It is intended
to give graduate students and new researchers an idea of where the frontiers of biostatistics and bioinformatics are, and to present the common techniques in use so that they can advance the fields by developing new techniques and new results. It is also intended to provide extensive references so that researchers can follow the threads to learn the literature more comprehensively and to conduct their own research. It covers three important topics in biostatistics: Analysis of Survival and Longitudinal Data, Statistical Methods for Epidemiology, and Bioinformatics, where statistics is still advancing rapidly today.

Ever since the invention of nonparametric and semiparametric techniques in statistics, they have been widely applied to the analysis of survival data and longitudinal data. In Chapter 1, Jianqing Fan and Jiancheng Jiang give a concise overview of this subject under the framework of the proportional hazards model. Nonparametric and semiparametric modeling and inference are stressed. Donglin Zeng and Jianwen Cai introduce an additive-accelerated rate regression model for analyzing recurrent events in Chapter 2. This is a flexible class of models that includes both additive rate models and accelerated rate models, and it allows simple statistical inference. Longitudinal data arise frequently from biomedical studies, and quadratic inference functions provide important approaches to the analysis of longitudinal data. An overview of this topic is given in Chapter 3 by John Dziak, Runze Li, and Annie Qu. In Chapter 4, Yi Li gives an overview of modeling and analysis of spatially correlated data with emphasis on mixed models.

The next two chapters are on statistical methods for epidemiology. Amy Laird and Xiao-Hua Zhou address the issues of study designs for biomarker-based treatment selection in Chapter 5. Several trial designs are introduced and evaluated. In Chapter 6, Jinbo Chen reviews recent statistical models for analyzing two-phase epidemiology studies, with emphasis on approaches based on estimating equations, pseudo-likelihood, and maximum likelihood.

The last four chapters are devoted to the analysis of genomic data. Chapter 7 features protein interaction predictions using diverse data sources, contributed by Yin Liu, Inyoung Kim, and Hongyu Zhao. The diverse sources of information for protein-protein interactions are elucidated, and computational methods are introduced for aggregating these data sources to better predict protein interactions. Regulatory motif discovery is handled by Qing Zhou and Mayetri Gupta using Bayesian approaches in Chapter 8. The chapter begins with a basic statistical framework for motif finding, extends it to the identification of cis-regulatory modules, and then introduces methods that combine motif finding with phylogenetic footprinting, gene expression or ChIP-chip data, and nucleosome positioning information. Cheng Li and Samir Amin use single nucleotide polymorphism (SNP) microarrays to analyze cancer genome alterations in Chapter 9. Various methods are introduced, including paired and non-paired loss of heterozygosity analysis, copy number analysis, finding significantly altered regions across multiple samples, and hierarchical clustering methods. In Chapter 10, Evan Johnson, Jun Liu and Shirley Liu give a comprehensive overview of the design and analysis of ChIP-chip data on genome tiling microarrays. It spans from the biological background and ChIP-chip experiments to statistical methods and computing.
The frontiers of statistics are always dynamic and vibrant. Young researchers are encouraged to jump onto the research wagons and cruise with the tidal waves of the frontiers. It is never too late to get into the frontiers of scientific research. As long as your mind is evolving with the frontiers, you always have a chance to catch and to lead the next tidal waves. We hope this volume helps you get into the frontiers of statistical endeavors and cruise on them throughout your career.

Jianqing Fan, Princeton
Xihong Lin, Cambridge
Jun Liu, Cambridge
August 8, 2008
Contents
Preface .......................................................... v

Part I  Analysis of Survival and Longitudinal Data

Chapter 1  Non- and Semi-Parametric Modeling in Survival Analysis
    Jianqing Fan, Jiancheng Jiang ................................ 3
    1 Introduction ............................................... 3
    2 Cox's type of models ....................................... 4
    3 Multivariate Cox's type of models .......................... 14
    4 Model selection on Cox's models ............................ 24
    5 Validating Cox's type of models ............................ 27
    6 Transformation models ..................................... 28
    7 Concluding remarks ........................................ 30
    References .................................................. 30

Chapter 2  Additive-Accelerated Rate Model for Recurrent Event
    Donglin Zeng, Jianwen Cai ................................... 35
    1 Introduction ............................................... 35
    2 Inference procedure and asymptotic properties .............. 37
    3 Assessing additive and accelerated covariates .............. 40
    4 Simulation studies ......................................... 41
    5 Application ................................................ 42
    6 Remarks .................................................... 43
    Acknowledgements ............................................. 44
    Appendix ..................................................... 44
    References ................................................... 48

Chapter 3  An Overview on Quadratic Inference Function Approaches for Longitudinal Data
    John J. Dziak, Runze Li, Annie Qu ........................... 49
    1 Introduction ............................................... 49
    2 The quadratic inference function approach .................. 51
    3 Penalized quadratic inference function ..................... 56
    4 Some applications of QIF ................................... 60
    5 Further research and concluding remarks .................... 65
    Acknowledgements ............................................. 68
    References ................................................... 68

Chapter 4  Modeling and Analysis of Spatially Correlated Data
    Yi Li ....................................................... 73
    1 Introduction ............................................... 73
    2 Basic concepts of spatial process .......................... 76
    3 Spatial models for non-normal/discrete data ................ 82
    4 Spatial models for censored outcome data ................... 88
    5 Concluding remarks ......................................... 96
    References ................................................... 96

Part II  Statistical Methods for Epidemiology

Chapter 5  Study Designs for Biomarker-Based Treatment Selection
    Amy Laird, Xiao-Hua Zhou .................................... 103
    1 Introduction ............................................... 103
    2 Definition of study designs ................................ 104
    3 Test of hypotheses and sample size calculation ............. 108
    4 Sample size calculation .................................... 111
    5 Numerical comparisons of efficiency ........................ 116
    6 Conclusions ................................................ 118
    Acknowledgements ............................................. 121
    Appendix ..................................................... 122
    References ................................................... 126

Chapter 6  Statistical Methods for Analyzing Two-Phase Studies
    Jinbo Chen .................................................. 127
    1 Introduction ............................................... 127
    2 Two-phase case-control or cross-sectional studies .......... 130
    3 Two-phase designs in cohort studies ........................ 136
    4 Conclusions ................................................ 149
    References ................................................... 151

Part III  Bioinformatics

Chapter 7  Protein Interaction Predictions from Diverse Sources
    Yin Liu, Inyoung Kim, Hongyu Zhao ........................... 159
    1 Introduction ............................................... 159
    2 Data sources useful for protein interaction predictions .... 161
    3 Domain-based methods ....................................... 163
    4 Classification methods ..................................... 169
    5 Complex detection methods .................................. 172
    6 Conclusions ................................................ 175
    Acknowledgements ............................................. 175
    References ................................................... 175

Chapter 8  Regulatory Motif Discovery: From Decoding to Meta-Analysis
    Qing Zhou, Mayetri Gupta .................................... 179
    1 Introduction ............................................... 179
    2 A Bayesian approach to motif discovery ..................... 181
    3 Discovery of regulatory modules ............................ 184
    4 Motif discovery in multiple species ........................ 189
    5 Motif learning on ChIP-chip data ........................... 195
    6 Using nucleosome positioning information in motif discovery  201
    7 Conclusion ................................................. 204
    References ................................................... 205

Chapter 9  Analysis of Cancer Genome Alterations Using Single Nucleotide Polymorphism (SNP) Microarrays
    Cheng Li, Samir Amin ........................................ 209
    1 Background ................................................. 209
    2 Loss of heterozygosity analysis using SNP arrays ........... 212
    3 Copy number analysis using SNP arrays ...................... 216
    4 High-level analysis using LOH and copy number data ......... 224
    5 Software for cancer alteration analysis using SNP arrays ... 229
    6 Prospects .................................................. 231
    Acknowledgements ............................................. 231
    References ................................................... 231

Chapter 10  Analysis of ChIP-chip Data on Genome Tiling Microarrays
    W. Evan Johnson, Jun S. Liu, X. Shirley Liu ................. 239
    1 Background molecular biology ............................... 239
    2 A ChIP-chip experiment ..................................... 241
    3 Data description and analysis .............................. 244
    4 Follow-up analysis ......................................... 249
    5 Conclusion ................................................. 254
    References ................................................... 254

Subject Index .................................................... 259
Author Index ..................................................... 261
Part I
Analysis of Survival and Longitudinal Data
Chapter 1
Non- and Semi-Parametric Modeling in Survival Analysis*
Jianqing Fan†, Jiancheng Jiang‡
Abstract  In this chapter, we give a selective review of the nonparametric modeling methods using Cox's type of models in survival analysis. We first introduce Cox's model (Cox 1972) and then study its variants in the direction of smoothing. The model fitting, variable selection, and hypothesis testing problems are addressed. A number of topics worthy of further study are given throughout this chapter.
Keywords: Censoring; Cox's model; failure time; likelihood; modeling; nonparametric smoothing.
1 Introduction
Survival analysis is concerned with studying the time between entry to a study and a subsequent event, and it has become one of the most important fields in statistics. The techniques developed in survival analysis are now applied in many fields, such as biology (survival time), engineering (failure time), medicine (treatment effects or the efficacy of drugs), quality control (lifetime of a component), and credit risk modeling in finance (default time of a firm).

An important problem in survival analysis is how to model well the conditional hazard rate of failure times given certain covariates, because it involves frequently asked questions about whether or not certain independent variables are correlated with the survival or failure times. These problems have presented a significant challenge to statisticians in the last five decades, and their importance has motivated many statisticians to work in this area. Among the most important contributions is the proportional hazards model, or Cox's model, and its associated partial likelihood estimation method (Cox, 1972), which stimulated a lot of work in this field. In this chapter we review related work along this direction using Cox's type of models and open an academic research avenue for interested readers. Various estimation methods are considered, a variable selection approach is studied, and a useful inference method, the generalized likelihood ratio (GLR) test, is employed to address hypothesis testing problems for the models. Several topics worthy of further study are laid down in the discussion section.

The remainder of this chapter is organized as follows. We consider univariate Cox's type of models in Section 2 and study multivariate Cox's type of models using the marginal modeling strategy in Section 3. Section 4 focuses on model selection rules, Section 5 is devoted to validating Cox's type of models, and Section 6 discusses transformation models (extensions to Cox's models). Finally, we conclude this chapter in the discussion section.

*The authors are partly supported by NSF grants DMS-0532370, DMS-0704337 and NIH R01-GM072611.
†Department of ORFE, Princeton University, Princeton, NJ 08544, USA. E-mail: [email protected]
‡Department of Mathematics and Statistics, University of North Carolina, Charlotte, NC 28223, USA. E-mail: [email protected]
2 Cox's type of models
Model Specification. The celebrated Cox model has provided a tremendously successful tool for exploring the association of covariates with failure time and survival distributions and for studying the effect of a primary covariate while adjusting for other variables. This model assumes that, given a q-dimensional vector of covariates Z, the underlying conditional hazard rate (rather than expected survival time T),
$$\lambda(t \mid z) = \lim_{\Delta t \to 0^+} \frac{1}{\Delta t}\, P\{t \leqslant T < t + \Delta t \mid T \geqslant t,\ z\}$$
is a function of the independent variables (covariates):
$$\lambda(t \mid z) = \lambda_0(t)\,\Psi(z), \qquad (2.1)$$
where $\Psi(z) = \exp(\psi(z))$ with the form of the function $\psi(z)$ known, such as $\psi(z) = \beta^T z$, and $\lambda_0(t)$ is an unknown baseline hazard function. Once the conditional hazard rate is given, the conditional survival function $S(t \mid z)$ and the conditional density $f(t \mid z)$ are also determined. In general, they have the following relationship:
$$S(t \mid z) = \exp(-\Lambda(t \mid z)), \qquad f(t \mid z) = \lambda(t \mid z)\,S(t \mid z), \qquad (2.2)$$
where $\Lambda(t \mid z) = \int_0^t \lambda(u \mid z)\,du$ is the cumulative hazard function. Since no assumptions are made about the nature or shape of the baseline hazard function, the Cox regression model may be considered to be a semiparametric model.

The Cox model is very useful for tackling censored data, which arise often in practice. For example, due to termination of the study or early withdrawal from a study, not all of the survival times $T_1, \ldots, T_n$ may be fully observable. Instead one observes for the $i$-th subject an event time $X_i = \min(T_i, C_i)$, a censoring indicator $\delta_i = I(T_i \leqslant C_i)$, as well as an associated vector of covariates $Z_i$. Denote the observed data by $\{(Z_i, X_i, \delta_i): i = 1, \cdots, n\}$, which is an i.i.d. sample from the population $(Z, X, \delta)$ with $X = \min(T, C)$ and $\delta = I(T \leqslant C)$. Suppose that
the random variables T and C are positive and continuous. Then by Fan, Gijbels, and King (1997), under the Cox model (2.1),
$$\Psi(z) = \frac{E\{\delta \mid Z = z\}}{E\{\Lambda_0(X) \mid Z = z\}}, \qquad (2.3)$$
where $\Lambda_0(t) = \int_0^t \lambda_0(u)\,du$ is the cumulative baseline hazard function. Equation (2.3) allows one to estimate the function $\Psi$ using regression techniques if $\lambda_0(t)$ is known.

The likelihood function can also be derived. When $\delta_i = 0$, all we know is that the survival time $T_i \geqslant C_i$, and the probability of observing this is
$$P(T_i \geqslant C_i \mid Z_i) = S(C_i \mid Z_i) = S(X_i \mid Z_i),$$
whereas when $\delta_i = 1$, the likelihood of getting $T_i$ is $f(T_i \mid Z_i) = f(X_i \mid Z_i)$. Therefore the conditional (given covariates) likelihood for getting the data is
$$\prod_{\delta_i = 1} f(X_i \mid Z_i)\prod_{\delta_i = 0} S(X_i \mid Z_i), \qquad (2.4)$$
and using (2.2), we have
$$L = \sum_{\delta_i = 1} \log\big(\lambda(X_i \mid Z_i)\big) - \sum_i \Lambda(X_i \mid Z_i). \qquad (2.5)$$
For the proportional hazards model (2.1), we have specifically
$$L = \sum_{\delta_i = 1} \big\{\log \lambda_0(X_i) + \psi(Z_i)\big\} - \sum_i \Lambda_0(X_i)\exp\{\psi(Z_i)\}. \qquad (2.6)$$
Therefore, when both $\psi(\cdot)$ and $\lambda_0(\cdot)$ are parameterized, the parameters can be estimated by maximizing the likelihood (2.6).
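To make the parametric route concrete, the following minimal R sketch maximizes (2.6) when the baseline is taken to be constant, $\lambda_0(t; \beta) \equiv \lambda$, and $\psi(z; \beta) = \beta z$; the simulated data and all variable names are illustrative choices for exposition, not part of the chapter.

## Minimal sketch: maximize (2.6) with an exponential baseline lambda0(t) = lambda
## and linear risk psi(z; beta) = beta * z.  Simulated data, for illustration only.
set.seed(1)
n <- 200
z <- rnorm(n)
lambda.true <- 0.5; beta.true <- 0.8
T <- rexp(n, rate = lambda.true * exp(beta.true * z))   # failure times
C <- rexp(n, rate = 0.3)                                # censoring times
X <- pmin(T, C); delta <- as.numeric(T <= C)

negloglik <- function(par) {                 # negative of (2.6)
  lambda <- exp(par[1])                      # positivity via log-parametrization
  beta   <- par[2]
  -sum(delta * (log(lambda) + beta * z) - lambda * X * exp(beta * z))
}
fit <- optim(c(0, 0), negloglik, hessian = TRUE)
c(lambda.hat = exp(fit$par[1]), beta.hat = fit$par[2])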
Estimation. Likelihood inference can be made about the parameters in model (2.1) if the baseline $\lambda_0(\cdot)$ and the risk function $\psi(\cdot)$ are known up to a vector of unknown parameters $\beta$ (Aitkin and Clayton, 1980), i.e.
$$\lambda_0(\cdot) = \lambda_0(\cdot\,; \beta) \quad \text{and} \quad \psi(\cdot) = \psi(\cdot\,; \beta).$$
When the baseline is completely unknown and the form of the function $\psi(\cdot)$ is given, inference can be based on the partial likelihood (Cox, 1975). Since the full likelihood involves both $\beta$ and $\lambda_0(t)$, Cox decomposed the full likelihood into a product of the term corresponding to the identities of successive failures and the term corresponding to the gap times between any two successive failures. The first term inherits the usual large-sample properties of the full likelihood and is called the partial likelihood.
The partial likelihood can also be derived from counting process theory (see for example Andersen, Borgan, Gill, and Keiding 1993) or from a profile likelihood as in Johansen (1983). In the following we introduce the latter.

Example 1 [The partial likelihood as profile likelihood; Fan, Gijbels, and King (1997)] Consider the case that $\psi(z) = \psi(z; \beta)$. Let $t_1 < \cdots < t_N$ denote the ordered failure times and let $(i)$ denote the label of the item failing at $t_i$. Denote by $R_i$ the risk set at time $t_i-$, that is, $R_i = \{j : X_j \geqslant t_i\}$. Consider the least informative nonparametric modeling for $\Lambda_0(\cdot)$, that is, $\Lambda_0(t)$ puts point mass $\theta_j$ at time $t_j$ in the same way as constructing the empirical distribution:
$$\Lambda_0(t; \theta) = \sum_{j=1}^N \theta_j I(t_j \leqslant t). \qquad (2.7)$$
Then
$$\Lambda_0(X_i; \theta) = \sum_{j=1}^N \theta_j I(i \in R_j). \qquad (2.8)$$
Under the proportional hazards model (2.1), using (2.6), the log likelihood is
$$\log L = \sum_{i=1}^n \big[\delta_i\{\log \lambda_0(X_i; \theta) + \psi(Z_i; \beta)\} - \Lambda_0(X_i; \theta)\exp\{\psi(Z_i; \beta)\}\big]. \qquad (2.9)$$
Substituting (2.7) and (2.8) into (2.9), one establishes that
$$\log L = \sum_{j=1}^N \big[\log \theta_j + \psi(Z_{(j)}; \beta)\big] - \sum_{i=1}^n\sum_{j=1}^N \theta_j I(i \in R_j)\exp\{\psi(Z_i; \beta)\}. \qquad (2.10)$$
Maximizing $\log L$ with respect to $\theta_j$ leads to the following Breslow estimator of the baseline hazard [Breslow (1972, 1974)]
$$\hat{\theta}_j = \Big[\sum_{i \in R_j}\exp\{\psi(Z_i; \beta)\}\Big]^{-1}. \qquad (2.11)$$
Substituting (2.11) into (2.10), we obtain
$$\max_{\theta}\log L = \sum_{i=1}^N \Big(\psi(Z_{(i)}; \beta) - \log\Big[\sum_{j \in R_i}\exp\{\psi(Z_j; \beta)\}\Big]\Big) - N.$$
This leads to the log partial likelihood function (Cox 1975)
$$\ell(\beta) = \sum_{i=1}^N \Big(\psi(Z_{(i)}; \beta) - \log\Big[\sum_{j \in R_i}\exp\{\psi(Z_j; \beta)\}\Big]\Big). \qquad (2.12)$$
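As a small illustration of Example 1, the R functions below evaluate the log partial likelihood (2.12) and the Breslow jumps (2.11) directly from their definitions for the linear risk $\psi(z; \beta) = \beta^T z$; ties are ignored, and the function names and data layout (vectors X, delta and a covariate matrix Z) are our own illustrative assumptions rather than the chapter's code.

## Log partial likelihood (2.12) for psi(z; beta) = t(beta) %*% z, ignoring ties.
log.partial.lik <- function(beta, X, delta, Z) {
  Z <- as.matrix(Z)
  eta <- drop(Z %*% beta)                 # psi(Z_j; beta) for all subjects
  total <- 0
  for (i in which(delta == 1)) {          # sum over observed failures
    in.risk <- X >= X[i]                  # risk set R_i = {j : X_j >= X_i}
    total <- total + eta[i] - log(sum(exp(eta[in.risk])))
  }
  total
}

## Breslow jumps (2.11) at a given beta, one jump per observed failure time.
breslow.jumps <- function(beta, X, delta, Z) {
  Z <- as.matrix(Z)
  eta <- drop(Z %*% beta)
  sapply(which(delta == 1), function(i) 1 / sum(exp(eta[X >= X[i]])))
}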
An alternative expression is
$$\ell(\beta) = \sum_{i=1}^N \Big(\psi(Z_{(i)}; \beta) - \log\Big[\sum_{j=1}^n Y_j(X_i)\exp\{\psi(Z_j; \beta)\}\Big]\Big),$$
where $Y_j(t) = I(X_j \geqslant t)$ is the survival indicator of whether the $j$-th subject survives at time $t$. The above partial likelihood function is a profile likelihood and is derived from the full likelihood using the least informative nonparametric modeling for $\Lambda_0(\cdot)$, that is, $\Lambda_0(t)$ has a jump $\theta_i$ at $t_i$.

Let $\hat{\beta}$ be the partial likelihood estimator of $\beta$ maximizing (2.12) with respect to $\beta$. By standard likelihood theory, it can be shown (see for example Tsiatis 1981) that the asymptotic distribution of $\sqrt{n}(\hat{\beta} - \beta)$ is multivariate normal with mean zero and a covariance matrix which may be estimated consistently by $(n^{-1}I(\hat{\beta}))^{-1}$, where
$$I(\beta) = \int_0^{\tau}\Big[\frac{S_2(\beta, t)}{S_0(\beta, t)} - \Big(\frac{S_1(\beta, t)}{S_0(\beta, t)}\Big)^{\otimes 2}\Big]\,dN(t),$$
and for $k = 0, 1, 2$,
$$S_k(\beta, t) = \sum_{i=1}^n Y_i(t)\,\psi'(Z_i; \beta)^{\otimes k}\exp\{\psi(Z_i; \beta)\},$$
where $N(t) = I(X \leqslant t, \delta = 1)$, and $x^{\otimes k} = 1, x, xx^T$, respectively, for $k = 0, 1$ and $2$. Since the baseline hazard $\lambda_0$ does not appear in the partial likelihood, it is not estimable from the partial likelihood. There are several methods for estimating parameters related to $\lambda_0$. One appealing estimate among them is the Breslow estimator (Breslow 1972, 1974)
$$\hat{\Lambda}_0(t) = \sum_{i:\, X_i \leqslant t}\delta_i\Big[\sum_{j:\, X_j \geqslant X_i}\exp\{\psi(Z_j; \hat{\beta})\}\Big]^{-1}. \qquad (2.13)$$
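In practice the maximization of (2.12) and the Breslow-type estimator (2.13) are available in standard software. The short R sketch below uses the survival package and its lung data purely as an illustration; the data set and options are our choice, not the chapter's example.

library(survival)
fit <- coxph(Surv(time, status) ~ age + sex, data = lung, ties = "breslow")
summary(fit)                             # beta-hat, standard errors, Wald tests
H0 <- basehaz(fit, centered = FALSE)     # cumulative baseline hazard, cf. (2.13)
head(H0)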
Hypothesis testing. After fitting the Cox model, one might be interested in checking whether covariates really contribute to the risk function, for example, checking if the coefficient vector $\beta$ is zero. More generally, one considers the hypothesis testing problem
$$H_0: \beta = \beta_0.$$
From the asymptotic normality of the estimator $\hat{\beta}$, it follows that the asymptotic null distribution of the Wald test statistic
$$W_n = (\hat{\beta} - \beta_0)^T I(\hat{\beta})(\hat{\beta} - \beta_0)$$
is the chi-squared distribution with $q$ degrees of freedom. Standard likelihood theory also suggests that the partial likelihood ratio test statistic
$$2\{\ell(\hat{\beta}) - \ell(\beta_0)\} \qquad (2.14)$$
and the score test statistic
$$T_n = U(\beta_0)^T I^{-1}(\beta_0)\,U(\beta_0)$$
have the same asymptotic null distribution as the Wald statistic, where $U(\beta_0) = \ell'(\beta_0)$ is the score function (see for example, Andersen et al., 1993).

Cox's models with time-varying covariates. The Cox model (2.1) assumes that the hazard function for a subject depends on the values of the covariates and the baseline. Since the covariates are independent of time, the ratio of the hazard rate functions of two subjects is constant over time. Is this assumption reasonable? Consider, for example, the case with age included in the study. Suppose we study survival time after heart transplantation. Then it is possible that age is a more critical risk factor right after transplantation than at a later time. Another example is given in Lawless (1982, page 393) with the amount of voltage as a covariate, which slowly increases over time until the electrical insulation fails. In this case, the impact of the covariate clearly depends on time. Therefore, the above assumption does not hold, and we have to analyze survival data with time-varying covariates. Although the partial likelihood in (2.12) was derived for the setting of the Cox model with non-time-varying covariates, it can also be derived for the Cox model with time-varying covariates if one uses the counting process notation. For details, see marginal modeling of multivariate data using Cox's type of models in Section 3.1.

More about Cox's models. Owing to the computational simplicity of the partial likelihood estimator, Cox's model has already been a useful case study for formal semiparametric estimation theory (Begun, Hall, Huang, and Wellner 1982; Bickel, Klaassen, Ritov, and Wellner 1993; Oakes 2002). Moreover, due to the derivation of the partial likelihood from the profile likelihood (see Example 1), Cox's model has been considered as an approach to statistical science in the sense that "it formulates scientific questions or quantities in terms of parameters $\gamma$ in a model $f(y; \gamma)$ representing the underlying scientific mechanisms (Cox, 1997); partition the parameters $\gamma = (\theta, \eta)$ into a subset of interest $\theta$ and other nuisance parameters $\eta$ necessary to complete the probability distribution (Cox and Hinkley, 1974); develops methods of inference about the scientific quantities that depend as little as possible upon the nuisance parameters (Barndorff-Nielsen and Cox, 1989); and thinks critically about the appropriate conditional distribution on which to base inference" (Zeger, Diggle, and Liang 2004). Although Cox's models have driven a lot of statistical innovations in the past four decades, scientific fruit will continue to be born in the future. This motivates us to explore some recent developments for Cox's models using the nonparametric idea, and we hope to open an avenue of academic research for interested readers.
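Returning to the hypothesis tests above, the three statistics are routinely reported by software; a short R sketch (again using the survival package's lung data purely as a stand-in) is:

library(survival)
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
summary(fit)$waldtest                    # Wald statistic for H0: beta = 0
2 * (fit$loglik[2] - fit$loglik[1])      # partial likelihood ratio statistic (2.14)
summary(fit)$sctest                      # score test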
2.1 Cox's models with unknown nonlinear risk functions
Misspecification of the risk function $\psi$ may happen with the previous parametric form $\psi(\cdot\,; \beta)$, which could create a large modeling bias. To reduce the modeling bias, one considers nonparametric forms of $\psi$. Here we introduce such an attempt from Fan, Gijbels, and King (1997). For easy exposition, we consider only the case with $q = 1$:
$$\lambda(t \mid z) = \lambda_0(t)\exp\{\psi(z)\}, \qquad (2.15)$$
where $z$ is one dimensional. Suppose the form of $\psi(z)$ in model (2.15) is not specified and the $p$-th order derivative of $\psi(z)$ at the point $z$ exists. Then by the Taylor expansion,
$$\psi(Z) \approx \psi(z) + \psi'(z)(Z - z) + \cdots + \frac{\psi^{(p)}(z)}{p!}(Z - z)^p,$$
for $Z$ in a neighborhood of $z$. Put
$$\mathbf{Z} = \{1, Z - z, \cdots, (Z - z)^p\}^T \quad \text{and} \quad \mathbf{Z}_i = \{1, Z_i - z, \cdots, (Z_i - z)^p\}^T,$$
where $T$ denotes the transpose of a vector throughout this chapter. Let $h$ be the bandwidth controlling the size of the neighborhood of $z$ and $K$ be a kernel function with compact support $[-1, 1]$ for weighting down the contribution of remote data points. Then for $|Z - z| \leqslant h$, as $h \to 0$,
$$\psi(Z) \approx \mathbf{Z}^T\mathbf{a},$$
where $\mathbf{a} = (a_0, a_1, \cdots, a_p)^T = \{\psi(z), \psi'(z), \cdots, \psi^{(p)}(z)/p!\}^T$. By using the above approximation and incorporating the localizing weights, the local (log) likelihood is obtained from (2.9) as
$$\ell_n(\mathbf{a}, \boldsymbol{\theta}) = n^{-1}\sum_{i=1}^n\big[\delta_i\{\log\lambda_0(X_i; \boldsymbol{\theta}) + \mathbf{Z}_i^T\mathbf{a}\} - \Lambda_0(X_i; \boldsymbol{\theta})\exp(\mathbf{Z}_i^T\mathbf{a})\big]K_h(Z_i - z), \qquad (2.16)$$
where $K_h(t) = h^{-1}K(t/h)$. Then using the least-informative nonparametric model (2.7) for the baseline hazard and the same argument as for (2.12), we obtain the local log partial likelihood
$$\sum_{i=1}^N K_h(Z_{(i)} - z)\Big(\mathbf{Z}_{(i)}^T\mathbf{a} - \log\Big[\sum_{j \in R_i}\exp\{\mathbf{Z}_j^T\mathbf{a}\}K_h(Z_j - z)\Big]\Big). \qquad (2.17)$$
Maximizing the above function with respect to $\mathbf{a}$ leads to an estimate $\hat{\mathbf{a}}$ of $\mathbf{a}$. Note that the function value $\psi(z)$ is not directly estimable; (2.17) does not involve the intercept $a_0$ since it cancels out. The first component $\hat{a}_1 = \hat{\psi}'(z)$ estimates $\psi'(z)$.
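The local partial likelihood (2.17) is easy to implement for a single covariate. The R sketch below uses a local linear fit ($p = 1$) and an Epanechnikov kernel to estimate $\psi'(z)$ on a grid; the kernel, bandwidth, simulated data and function names are illustrative assumptions rather than the authors' implementation.

## Local linear (p = 1) maximization of (2.17): psi-hat'(z0) = a1-hat.
Kh <- function(u, h) pmax(0.75 * (1 - (u / h)^2), 0) / h   # Epanechnikov kernel

local.psi.prime <- function(z0, X, delta, Z, h) {
  neg.local.lik <- function(a1) {
    val <- 0
    for (i in which(delta == 1)) {
      wi <- Kh(Z[i] - z0, h)
      if (wi == 0) next                   # only failures near z0 contribute
      risk <- X >= X[i]
      wj <- Kh(Z[risk] - z0, h)
      val <- val + wi * (a1 * (Z[i] - z0) -
                         log(sum(exp(a1 * (Z[risk] - z0)) * wj)))
    }
    -val
  }
  optimize(neg.local.lik, interval = c(-10, 10))$minimum
}

## Simulated example with psi(z) = z^2, so that psi'(z) = 2z.
set.seed(2)
n <- 300
Z <- runif(n, -1, 1)
T <- rexp(n, rate = exp(Z^2)); C <- rexp(n, rate = 0.5)
X <- pmin(T, C); delta <- as.numeric(T <= C)
zgrid <- seq(-0.8, 0.8, by = 0.2)
cbind(zgrid, psi.prime.hat = sapply(zgrid, function(z0)
  local.psi.prime(z0, X, delta, Z, h = 0.4)))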
It is evident from model (2.15) that $\psi(z)$ is only identifiable up to a constant. By imposing the condition $\psi(0) = 0$, the function $\psi(z)$ can be estimated by
$$\hat{\psi}(z) = \int_0^z \hat{\psi}'(t)\,dt.$$
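Continuing the sketch given after (2.17), $\hat{\psi}$ itself can be obtained by numerically integrating $\hat{\psi}'$ on a grid containing 0, anchored at $\psi(0) = 0$; this is again an illustration with our own variable names, not the authors' code.

## Trapezoidal integration of psi-hat' on a grid containing 0.
zgrid <- seq(-0.8, 0.8, by = 0.1)
dpsi  <- sapply(zgrid, function(z0) local.psi.prime(z0, X, delta, Z, h = 0.4))
psihat <- cumsum(c(0, diff(zgrid) * (head(dpsi, -1) + tail(dpsi, -1)) / 2))
psihat <- psihat - psihat[which.min(abs(zgrid))]   # enforce psi-hat(0) = 0
cbind(zgrid, psihat)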
According to Fan, Gijbels, and King (1997), under certain conditions, asymptotic normality holds for $\hat{\psi}'(z)$, with asymptotic variance
$$v_{\psi'}(z) = \sigma^2(z)\,f^{-1}(z)\int K_1^*(t)^2\,dt,$$
where $K_1^*(t) = tK(t)\big/\int t^2K(t)\,dt$, $\sigma^2(z) = E\{\delta \mid Z = z\}^{-1}$, and $f(\cdot)$ denotes the density of $Z$. With the estimator of $\psi(\cdot)$, using the same argument as for (2.13), one can estimate the baseline hazard by
$$\hat{\Lambda}_0(t) = \sum_{i:\, X_i \leqslant t}\delta_i\Big[\sum_{j:\, X_j \geqslant X_i}\exp\{\hat{\psi}(Z_j)\}\Big]^{-1}. \qquad (2.18)$$
2.2
Partly linear Cox's models
The partly linear Cox's model is proposed to alleviate the difficulty with a saturated specification of the risk function and takes the form (2.19) where Ao 0 is an unspecified baseline hazard function and
Chapter 1 Non- and Semi- Parametric Modeling in Survival Analysis
11
where the form of the function '1/JI(zl;{3) is known up to an unknown vector of finite parameters (3, and 'l/J2(-) is an unknown function. This model inherents nice interpretation of the finite parameter {3 in model (2.1) while modeling possible nonlinear effects of the d x 1 vector of covariates Z2. In particular, when there is no parametric component, the model reduces to the aforementioned full nonparametric model in Section 2.1. Hence, in practice, the number of components in Z2 is small. The parameters (3 and function 'l/J2(Z2) can be estimated using the profile partial likelihood method. Specifically, as argued in the previous section, the function 'l/J2 admits the linear approximation
'l/J2(Z2) セ@ 'l/J2(Z2)
+ GiOjセHコRヲz@
- Z2)
== a T Z2
when Z2 is close to Z2, where a = {'l/J2(Z2), GiOjセHzRItv@ and Z2 = {l, (Z2 - Z2)TV. Given (3, we can estimate the function 'l/J2(-) by maximizing the local partial likelihood N
in(a)
=
:L KH(Z2(i) - Z2) HGiOjセ@
(Zl(i); (3)
+ Zr(i) a
i=l
-log[:L exp{'l/Jl(Zl(j);(3)
+ Zr(j)a}K H(Z2j
- Z2)]) ,
(2.20)
JERi
where K H(Z2) = \HI- 1 K(H- 1 Z2) with K(·) being a d-variate probability density (the kernel) with unique mode 0 and J uK(u)du = 0, H is a nonsingular d x d matrix called the bandwidth matrix (see for example Jiang and Doksum 2003). For expressing the dependence of the resulting solution on (3, we denote it by &(Z2; (3) = サセR@ (Z2; (3), セ@ (Z2; (3)). Substituting セRHM[@ (3) into the partial likelihood, we obtain the profile partial likelihood of (3 n
i n (!3)
=
:L('l/Jl(Zl(i);f3)
KセRHzゥI[QS@
i=l
-log[:L exp{'I/Jl(Zlj; (3)
+ セRHzェ[@
{3)}]).
(2.21 )
JERi
.a
.a.
Maximizing (2.21) with respect to will lead to an estimate of We denote by 13 the resulting estimate. The estimate of function 'l/J2(-) is simply セRHG[@ 13). By an argument similar to that in Cai, Fan, Jiang, and Zhou (2007), it can be shown that the profile partial likelihood estimation provides a root-n consistent estimator of (see also Section 3). This allows us to estimate the nonparametric component 'l/J2 as well as if the parameter (3 were known.
.a
2.3
Partly linear additive Cox's models
The partly linear model (2.19) is useful for modeling failure time data with multiple covariates, but for high-dimensional covariate Z2, it still suffers from the so-called "curse-of-dimensionality" problem in high-dimensional function estimation. One
12
of the methods for attenuating this difficulty is to use the additive structure for the function 'ljJ2(·) as in Huang (1999), which leads to the partly linear additive Cox model. It specifies the conditional hazard of the failure time T given the covariate value (z, w) as
A{tlz, w} = AO(t) exp{ 'ljJ(z; 13) + ¢(w)},
(2.22)
where ¢(w) = ¢l(wd + ... + ¢J(wJ). The parameters of interest are the finite parameter vector 13 and the unknown functions ¢j's. The former measures the effect of the treatment variable vector z, and the latter may be used to suggest a parametric structure of the risk. This model allows one to explore nonlinearity of certain covariates, avoids the "curse-of-dimensionality" problem inherent in the saturated multivariate semiparametric hazard regression model (2.19), and retains the nice interpretability of the traditional linear structure in Cox's model (Cox 1972). See the discussions in Hastie and Tibshirani (1990). Suppose that observed data for the i-th subject is {Xi, lSi, Wi, Zi}, where Xi is the observed event time for the i-th subject, which is the minimum of the potential failure time Ti and the censoring time Gi , lSi is the indicator of failure, and {Zi' Wi} is the vector of covariates. Then the log partial likelihood function for model (2.22) is n
C(j3, ¢)
=
L lSi { 'ljJ(Zi; 13) + ¢(Wi ) -log L
rj(j3, ¢)},
(2.23)
JERi
i=l
where
rj(j3,¢) = exp{'ljJ(Zj;j3)
+ ¢(Wj)}.
Since the partial likelihood has no finite maximum over all parameters (13, ¢), it is impossible to use the maximium partial likelihood estimation for (13, ¢) without any restrictions on the function ¢. Now let us introduce the polynomial-spline based estimation method in Huang (1999). Assume that W takes values in W = [0, Let
IF.
{= {O =
セッ@
< 6 < ... < セk@
< セkKャ@
= I}
be a partition of [0, IJ into K subintervals IKi = {セゥG@ セゥKャI@ i = 0, ... ,K - 1, and IKK = {セkG@ セkKQjL@ where K == Kn = O(nV) with 0 < v < 0.5 being a positive integer such that h == max iセォ@ - ォMャ@セ = O(n-V). ャセォkKQ@
Let S( C, セI@ be the space of polynomial splines of degree C セ@ 1 consisting of functions s(·) satisfying: (i) the restriction of s(·) to IKi is a polynomial of order C- 1 for 1 セ@ i セ@ K; (ii) for C セ@ 2, sis C- 2 times continuously differentiable on [O,IJ. According to Schumaker (1981, page 124), there exists a local basis B i (·), 1 セ@ i セ@ qn for S(C, {) with qn = Kn + C, such that for any ¢nj (.) E S(C, {), qn
¢nj(Wj) =
L bjiBi(Wj), i=l
1 セ@ j セ@
J.
Chapter 1 Non- and Semi- Parametric Modeling in Survival Analysis
13
Put
B(w) = (Bl(w),· .. ,Bqn(w))T, B(w) = (BT(Wr), ... ,BT(wJ)f,
b j = (bjl , ... ,bjqn)T, b = (bi, ... ,b'})T.
bf
Then cPnj(Wj) = B(wj) and cPn(w) == 2:,1=1 cPnj(Wj) = bTB(w). Under regular smoothness assumptions, cPj's can be well approximated by functions in S(C, セIN@ Therefore, by (2.23), we have the logarithm of an approximated partial likelihood n
C({J, b)
L b"i{ ¢(Zi;}3) + cPn(Wd -log L exp[¢(Zj;}3) + cPn(Wj )]},
=
.=1
(2.24)
JERi
where J
L cPnj(Wji )
cPn(Wi) =
j=l
with Wji being the j-th component of Wi, for i = 1,··· ,n. Let (/3, b) maximize the above partial likelihood (2.24). Then an estimator of cP(-) at point w is simply ,
the cP(w)
J'
.'
= 2:,j=l cPj(Wj)
wIth cPj(Wj)
=
,T
b j B(wj).
As shown in Huang (1999), when ¢(z; f3) = zT f3, the estimator vn-consistency. That is, under certain conditions,
vn(/3 -
f3) = n- l / 2 I- l (f3)
/3 achieves
n
L ャセHxゥG@
b"i, Zi, Wi) + Op(l)
i=l
where I(f3)
= e{ャセHxL@
b., Z, W)]02 is the information bound and
ャセHxL@
8, Z, W)
= IT (Z - a*(t) - h*(W)) dM(t)
is the efficient score for estimation of f3 in model (2.22), where h*(w) = hiCwr) + ... + h j (w J) and (a * , hi, . .. ,h j) is the unique L2 functions that minimize
where
M(t)
=
M{X
セ@
t
t} - I I{X
セ@
u} exp[Z'f3 + cP(W)] dAo(u)
is the usual counting process martingale. achieves the semiparametric information lower bound Since the estimator, and is asymptotically linear, it is asymptotically efficient among all the regular estimators (see Bickel, Klaassen, Ritov, and Wellner 1993). However, the information lower bound cannot be consistently estimated, which makes inference for f3 difficult in practice. Further, the asymptotic distribution of the resulting estimator
/3,
14
¢ is hard to derive.
This makes it difficult to test if ¢ admits a certain parametric
form. The resulting estimates are easy to implement. Computationally, the maximization problem in (2.24) can be solved via the existing Cox regression program, for example coxph and bs in Splus software [for details, see Huang (1999)]. However, the number of parameters is large and numerical stability in implementation arises in computing the partial likelihood function. An alternative approach is to use the profile partial likelihood method as in Cai et al. (2007) (see also Section 3.2). The latter solves many much smaller local maximum likelihood estimation problems. With the estimators of (3 and ¢('), one can estimate the cumulative baseline hazard function Ao(t) = jセ@ Ao(u)du by a Breslow's type of estimators:
1(8 t
Ao(t)
=
n
Yi (u)eX P{¢(Zi;,B)
+ ¢(Wi
where Yi(u) = I(Xi セ@ u) is the at-risk indicator and Ni(u) is the associated counting process.
3
n
nr 8 dNi(u), 1
= I(Xi
°
in probability. It follows from (3.5) that
(3.6)
17
which is asymptotically normal with mean zero. By the consistency of Aj ({3) to a matrix Aj ({3) and by the asymptotic normality of n -1/2 Uj ({3 j' 00), one obtains that
(3.7)
Then by the multivariate martingale central limit theorem, for large n, HサSセL@
サSセ_@
... ,
is approximately normal with mean ({3f, ... ,{3})T and covariance matrix D = (D jz ), j, l = 1,··· ,J, say. The asymptotic covariance matrix between -Jri({3j - (3j) and -Jri({3z - (3z) is given by
D jz ({3j' (3z) = Aj1 ({3j )E{ Wj1 ({3j )Wll ({3Z)T} A 11({3z), where
1
00
Wj1 ({3j)
=
{Zlj (t) - sY) ({3j' t) / 8)°) ({3j' tn dM1j (t).
Wei, Lin and Weissfeld (1989) also gave a consistent empirical estimate of the covariance matrix D. This allows for simultaneous inference about the {3/s. Failure rates differ only in the baseline. Lin (1994) proposed to model the j-th failure time using marginal Cox's model: (3.8) For model (3.1), if the coefficients {3j are all equal to {3, then it reduces to model (3.8), and each {3j is a consistent estimate of {3. Naturally, one can use a linear combination of the estimates, J
{3(w) = LWj{3j
(3.9)
j=l
to estimate {3, where 'LJ=l Wj = 1. Using the above joint asymptotic normality of {3/s, Wei, Lin and Weissfeld (1989) computed the variance of {3(w) and employed the weight w = (W1' ... ,W J ) T minimizing the variance. Specifically, let E be the covariance matrix of ({31' . .. ,{3 J ) T. Then
Using Langrange's multiplication method, one can find the optimal weight:
Jianqing Fan, Jiancheng Jiang
18
If all of the observations for each failure type are independent, the partial likelihood for model (3.8) is (see Cox 1975) J
L(f3)
=
II L (f3) j
j=l
(3.10)
where L j (f3) is given by (3.2) and Yij(t) = I(Xl j セ@ t). Since the observations within a cluster are not necessarily independent, we refer to (3.10) as pseudopartial likelihood. Note that J
log L(f3)
= l:)og L j (f3), j=l
and
8 log L(f3) 8f3
= "
J
L.-
)=1
8 log L j (f3) 8f3 .
Therefore, the pseudo-partial likelihood merely aggregates J consistent estimation equations to yield a more powerful estimation equation without using any dependent structure. Maximizing (3.10) leads to an estimator '/3 of f3. We call this estimation method "pseudo-partial likelihood estimation". Following the argument in Example 3, it is easy to derive the asymptotic normality of fo('/3 - (3). For large nand small J, Lin (1994) gave the covariance matrix estimation formula for '/3. It is interesting to compare the efficiency of '/3 with respect to '/3(w), which is left as an exercise for interested readers.
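In software, maximizing the pseudo-partial likelihood (3.10) amounts to fitting one Cox model to the data stacked over failure types, stratified by failure type so that each type keeps its own baseline, with a cluster-robust variance for the within-subject dependence. The sketch below uses a hypothetical long-format data frame and column names of our own choosing.

library(survival)
## 'long' has one row per subject x failure type: id, ftype, X, delta, Z1, Z2.
fit <- coxph(Surv(X, delta) ~ Z1 + Z2 + strata(ftype) + cluster(id),
             data = long)
summary(fit)    # robust standard errors account for within-subject correlation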
3.2
Marginal modeling using Cox's models with nonlinear risks
The marginal Cox's models with linear risks provide a convenient tool for modeling the effects of covariates on the failure rate, but as we stressed in Section 2.1, they may yield large modeling bias if the underlying risk function is not linear. This motivated Cai, Fan, Zhou, and Zhou (2007) to study the following Cox model with a nonlinear risk: (3.11)
where f3(.) is the regression coefficient vector that may be a function of the covariate Vij , g(.) is an unknown nonlinear effect of Vij. Model (3.11) is useful for modeling the nonlinear effect of Vij and possible interaction between covariates Vij and Zij. A related work has been done in Cai and Sun (2003) using the time-varying coefficient Cox model for univariate data with J = 1.
19
Similar to (3.10), the pseudo partial likelihood for model (3.11) is
L({3(.),g(.)) =
II II{ J
n
j=l i=l
{(
j
T
exp (3 Vij) Zij セ@ g(Vij)} }t;,.i . L1ERj(Xij ) exp{{3(Vij) Zlj + g(Vij)}
(3.12)
The pseudo-partial likelihood (3.10) can be regarded as parametric counterpart of (3.12). The log-pseudo partial likelihood is given by
10gL({30,g(·))
=
lセゥェサSHvヲz@
J
n
j=l
i=l
+ g(Vij) -log
L exp{{3(Vij fZlj lERj(Xij)
+ g(Vij)} }.
(3.13)
Assume that all functions in the components of (3(.) and gO are smooth so that they admit Taylor's expansions: for each given v and u, where u is close to v,
(3(u) :::::: (3(v) + (3'(v)(u - v) == セ@ g(u) :::::: g(v) + g'(v)(u - v) == a
+ ".,(u - v), + 'Y(u - v).
(3.14)
Substituting these local models into (3.12), we obtain a similar local pseudo-partial likelihood to (2.17) in Section 2.1: J
C(e)
=
n
L L Kh(Vij - vIセゥェ@ j=li=l x{eXij-log(
L exp(eX;j)Kh(Vij-v))}, lERj(Xij)
(3.15)
e
and Xij = (Z'I;,Z'I;(Vij - v),(Vij - v))T. The kernel where = HセtLNGyI@ function is introduced to confine the fact that the local model (3.14) is only applied to the data around v. It gives a larger weight to the data closer to the point v. Let e(v) = HセvItLNカゥ@ be the maximizer of (3.15). Then (:J(v) = 8(v) is a local linear estimator for the coefficient function (30 at the point v. Similarly, an estimator of g'(.) at the point v is simply the local slope i(v), that is, the curve gO can be estimated by integration of the function g'(v). Using the counting process theory incorporated with non parametric regression techniques and the argument in Examples 2 and 3, Cai, Fan, Zhou, and Zhou (2007) derived asymptotic normality of the resulting pseudo-likelihood estimates. An alternative estimation approach is to fit a varying coefficient model for each failure type, that is, for event type j, to fit the model (3.16) for estimating ej(v) = HサSjカILセtァェN@ Under model resulting in セIカ@ (3.11), we have = = ... = eJ. Thus, as in (3.9), we can estimate e(v) by a linear combination
e1 e2
J
e(v; w)
=
LWjej(v) j=l
Jianqing Fan, Jiancheng Jiang
20
with L;=l Wj 1. The weights can be chosen in a similar way to (3.10). For details, see the reference above.
3.3
Marginal modeling using partly linear Cox's models
The fully nonparametric modeling of the risk function in the previous section is useful for building nonlinear effects of covariates on the failure rate, but it could lose efficiency if some covariates' effects are linear. To gain efficiency and to retain nice interpretation of the linear Cox models, Cai, Fan, Jiang, and Zhou (2007) studied the following marginal partly linear Cox model: (3.17) where Zij (-) is a main exposure variable of interest whose effect on the logarithm of the hazard might be non-linear; W ij (-) = (Wij1 (·),··· ,Wijq(·)f is a vector of covariates that have linear effects; AOj (.) is an unspecified baseline hazard function; and g(.) is an unspecified smooth function. For d-dimensional variable Zij, one can use an additive version g(Z) = gl(Zl) + ... + g(Zd) to replace the above function g(.) for alleviating the difficulty with curse of dimensionality. Like model (3.8), model (3.17) allows a different set of covariates for different failure types of the subject. It also allows for a different baseline hazard function for different failure types of the subject. It is useful when the failure types in a subject have different susceptibilities to failures. Compared with model (3.8), model (3.17) has an additional nonlinear term in the risk function. A related class of marginal models is given by restricting the baseline hazard functions in (3.17) to be common for all the failure types within a subject, i.e., (3.18) While this model is more restrictive, the common baseline hazard model (3.18) leads to more efficient estimation when the baseline hazards are indeed the same for all the failure types within a subject. Model (3.18) is very useful for modeling clustered failure time data where subjects within clusters are exchangeable. Denote by Rj(t) = {i : Xij セ@ t} the set of subjects at risk just prior to time t for failure type j. If failure times from the same subject were independent, then the logarithm of the pseudo partial likelihood for (3.17) is (see Cox 1975) J
£(/3, g(.» =
n
2:: 2:: Llij {/3TWij (Xij ) + g(Zij (Xij » -
Rij (/3, g)},
(3.19)
j=li=l
where R ij (/3,g) = log (L1ERjCXij) exp[/3TW1j (Xij ) + g(Zlj(Xij»l). The pseudo partial likelihood estimation is robust against the mis-specification of correlations among failure times, since we neither require that the failure times are independent nor specify a dependence structure among failure times. Assume that g(.) is smooth so that it can be approximated locally by a polynomial of order p. For any given point Zo, by Taylor's expansion,
g(z) セ@ g(zo)
f.. gCk)(zo) k! (z -
+6
k=l
zo)k == 0: + "(TZ,
(3.20)
Chapter 1 Non- and Semi- Parametric Modeling in Survival Analysis T
-
21
.
where, = bl,'" ,"(p) and Z = {z - Zo,'" , (z - zo)pV. Usmg the local model (3.20) for the data around Zo and noting that the local intercept a cancels in (3.19), we obtain a similar version of the logarithm of the local pseudo-partial likelihood in (2.17): J
e({3,,)
n
LLKh(Zij(Xij ) - zoIセゥェ@ j=1 i=1
=
(3.21) where
Rtj ({3,,) = log(
L
exp[{3TWlj(Xij)
+ ,TZlj(Xij)]Kh(Zlj(Xij) -
zo)),
lERj(Xij)
and Zij(U) = {Zij(U) - zo,'" , (Zij(U) - zo)pV. Let (,6(zo),--y(zo)) maximize the local pseudo-partial likelihood (3.21). Then, セョ@ estimator of g' (.) at the point Zo is simply the first component of i( zo), namely g'(zo) = ,oil (zo)· The curve ?J can be estimated by integration on the function g'(zo) using the trapezoidal rule by Hastie and Tibshirani (1990). To assure the identifiability of g(.), one can set g(O) :t:: 0 without loss of generality. Since only the local data are used in the estimation of {3, the resulting estimator for (3 cannot be root-n consistent. Cai, Fan, Jiang, and Zhou (2007) referred to (,6(zo), i(zo)) as the naive estimator and proposed a profile likelihood based estimation method to fix the drawbacks of the naive estimator. Now let us introduce this method. For a given (3, we obtain an estimator g(k)(.,{3) of g(k)(.), and hence g(.,{3), by maximizing (3.21) with respect to,. Denote by i(zo,{3) the maximizer. Substituting the estimator g(.,{3) into (3.19), one can obtain the logarithm of the profile pseudo-partial likelihood:
fp({3)
=
lセゥェサLXtw@
J
n
+g(Zij,{3) j=li=1
-lOg(
L
eXP [{3TW1j
+?J(Zlj,{3)])}.
(3.22)
lERj (X ij )
Let ,6 maximize (3.22) and i = i(zo, ,6). Then the proposed estimator for the parametric component is simply ,6 and for the nonparametric component is gO =
g(., ,6). Maximizing (3.22) is challenging since the function form ?J(., (3) is implicit. The objective function ep (') is non-concave. One possible way is to use the backfitting algorithm, which iteratively optimizes (3.21) and (3.22). More precisely, given (3o, optimize (3.21) to obtain ?J(., (3o). Now, given g(., (3o), optimize (3.22) with respect to {3 by fixing the value of (3 in ?J(-' (3) as (3o, and iterate this until convergence. An alternative approach is to optimize (3.22) by using the NewtonRaphson method, but ignore the computation of XセR_jHᄋLサSI@ i.e. setting it to zero in computing the Newton-Raphson updating step.
22
Jianqing Fan, Jiancheng Jiang
As shown in Cai, Fan, Jiang, and Zhou (2007), the resulting estimator /3 is root-n consistent and its asymptotic variance admits a sandwich formula, which leads to a consistent variance estimation for /3. This furnishes a practical inference tool for the parameter {3. Since /3 is root-n consistent, it does not affect the estimator of the nonparametric component g. If the covariates (WG' Zlj)T for different j are identically distributed, then the resulting estimate 9 has the same distribution as the estimate in Section 2.1. That is, even though the failure types within subjects are correlated, the profile likelihood estimator of g(.) performs as well as if they were independent. Similar phenomena were also discovered in nonparametric regression models (see Masry and Fan 1997; Jiang and Mack 2001). With the estimators of (3 and g(.), one can estimate the cumulative baseline hazard function AOj(t) = jセ@ AOj(u)du under mild conditions by a consistent estimator:
where Yij(u) = J(Xij ;?: u) is the at-risk indicator and Nij(u) 1) is the associated counting process.
3.4
=
J(Xij セ@
U,
f}.ij =
Marginal modeling using partly linear Cox's models with varying coefficients
The model (3.17) is useful for modeling nonlinear covariate effects, but it cannot deal with possible interaction between covariates. This motivated Cai, Fan, Jiang, and Zhou (2008) to consider the following partly linear Cox model with varying coefficients: (3.24) where W ij (-) = (Wij1 (·),··· ,Wijq(·))T is a vector of covariates that has linear effects on the logarithm of the hazard, Zij(-) = (Zijl (.), ... ,Zijp(· ))T is a vector of covariates that may interact with some exposure covariate Vij (.); AOj (.) is an unspecified baseline hazard function; and Q{) is a vector of unspecified coefficient functions. Model (3.24) is useful for capturing nonlinear interaction between covariates V and Z. This kind of phenomenon often happens in practice. For example, in the aforementioned FRS study, V would represent the calendar year of birthdate, W would consist of confounding variables such as gender, blood pressure, cholesterol level and smoking status, etc, and Z would contain covariates possibly interacting with V such as the body mass index (BMI). In this example, one needs to model possible complex interaction between the BMI and the birth cohort. As before we use R j (t) = {i : X ij ;?: t} to denote the set of the individuals at risk just prior to time t for failure type j. If failure times from the same subject
Chapter 1 Non- and Semi- Parametric Modeling in Survival Analysis
23
were independent, then the partial likelihood for (3.24) is
For the case with J = 1, if the coefficient functions are constant, the partial likelihood above is just the one in Cox's model (Cox 1972). Since failure times from the same subject are dependent, the above partial likelihood is actually again a pseudo-partial likelihood. Assume that o{) is smooth so that it can be approximated locally by a linear function. Denote by fj (.) the density of V1j . For any given point Vo E U]=ISUPP(fj), where supp(fj) denotes the support of fj(') , by Taylor's expansion,
a(v) セ@ a(vo) + a'(vo)(v - vo) == 0 + ",(v - vo).
(3.26)
Using the local model (3.26) for the data around Vo, we obtain the logarithm of the local pseudo-partial likelihood [see also (2.17)J: J
£((3,,)
=
n
LLKh("Vij(Xij ) - vo)ll ij j=li=1 (3.27)
R:j ({3,,) = log(
exp[{3TWlj(Xij) + ,TUlj(Xij, VO)JKh(Vzj(Xij ) - vo)).
L lERj(Xij)
Let (i3(vo), i(vo)) maximize the local pseudo-partial likelihood (3.27). Then, an estimator of a(·) at the point Vo is simply the local intercept 6(vo), namely &(vo) = 6(vo), When Vo varies over a grid of prescribed points, the estimates of the functions are obtained. Since only the local data are used in the estimation of {3, the resulting estimator for {3 cannot be y'n-consistent. Let us refer to (13 (vo), &(vo)) as a naive estimator. To enhance efficiency of estimation, Cai, Fan, Jiang and Zhou (2008) studied a profile likelihood similar to (3.22). Specifically, for a given {3, they obtained an estimator of &(-, (3) by maximizing (3.27) with respect to,. Substituting the estimator &(-,{3) into (3.25), they obtained the logarithm of the profile pseudopartial likelihood: J
n
£p({3) = L L ll ij {{3TWij j=1 i=1 -log(
L lERj(Xij)
+ &("Vij, (3)TZij
eXP[{3TWlj +&(Vzj,{3f Z ljJ)}.
(3.28)
Jianqing Fan, Jiancheng Jiang
24
Let 13 maximize (3.28). The final estimator for the parametric component is simply 13 and for the coefficient function is o{) = o{,j3). The idea in Section 3.3 can be used to compute the profile pseudo-partial likelihood estimator. The resulting estimator 13 is root-n consistent and its asymptotic variance admits a sandwich formula, which leads to a consistent variance estimation for 13. Since 13 is yin-consistent, it does not affect the estimator of the non parametric component a. If the covariates (W'G, Zlj)Y for different j are identically distributed, then even though the failure types within subjects are correlated, the profile likelihood estimator of a(·) performs as well as if they were independent [see Cai, Fan, Jiang, and Zhou (2008)]. With the estimators of (3 and a(·), one can estimate the cumulative baseline hazard function AOj (t) = jセ@ AOj (u )du by a consistent estimator:
1 t
AOj(t) =
n
[2:)'ij(u) exp{j3TWij(u)
o
+ a(Vij (U))yZij (u)}
i=l
n
r L dNij(U), 1
i=l
where YijO and Nij(u) are the same in Section 3.3.
4
Model selection on Cox's models
For Cox's type of models, different estimation methods have introduced for estimating the unknown parameters/functions. However, when there are many covariates, one has to face up to the variable selection problems. Different variable selection techniques in linear regression models have been extended to the Cox model. Examples include the LASSO variable selector in Tibshirani (1997), the Bayesian variable selection method in Faraggi and Simon (1998), the nonconcave penalised likelihood approach in Fan and Li (2002), the penalised partial likelihood with a quadratic penalty in Huang and Harrington (2002), and the extended BIC-type variable selection criteria in Bunea and McKeague (2005). In the following we introduce a model selection approach from Cai, Fan, Li, and Zhou (2005). It is a penalised pseudo-partial likelihood method for variable selection with multivariate failure time data with a growing number of regression coefficients. Any model selection method should ideally achieve two targets: to efficiently estimate the parameters and to correctly select the variables. The penalised pseudo-partial likelihood method integrates them together. This kind of idea appears in Fan & Li (2001, 2002). Suppose that there are n independent clusters and that each cluster has Ki subjects. For each subject, J types of failure may occur. Let Tijk denote the potential failure time, Cijk the potential censoring time, X ijk = min(Tijk' Cijk) the observed time, and Zijk the covariate vector for the j-th failure type of the k-th subject in the i-th cluster. Let 6. ijk be the indicator which equals 1 if Xijk is a failure time and 0 otherwise. For the failure time in the case of the j-th type of failure on subject k in cluster i, the marginal hazards model is taken as
(4.1)
Chapter 1 Non- and Semi- Parametric Modeling in Survival Analysis
25
where {3 = ((31, ... ,(3dn ) T is a vector of unknown regression coefficients, dn is the dimension of (3, Zijk(t) is a possibly external time-dependent covariate vector, and AOj(t) are unspecified baseline hazard functions. Similar to (3.10), the logarithm of a pseudo-partial likelihood function for model (4.1) is
R({3) =
J
n
Ki
j=1
i=1
k=1
L L L セゥェォ@
((3TZijk(Xijk) - R({3)) ,
(4.2)
where R({3) = log [2:7=1 RZセQ@ Yij9(Xijk)exP{{3TZlj9(Xijk)}] and Yijg(t) = I (X1jg ;;?: t) is the survival indicator on whether the g-th subject in the l-th cluster surviving at time t. To balance modelling bias and estimation variance, many traditional variable selection criteria have resorted to the use of penalised likelihood, including the AIC (Akaike, 1973) and BIC (Schwarz, 1978). The penalised pseudo-partial likelihood for model (4.1) is defined as dn
L({3) = R({3) - n LP>'j (l(3j I), j=1
(4.3)
where P>'j (l(3j I) is a given nonnegative function called a penalty function with Aj as a regularisation or tuning parameter. The tuning parameters can be chosen subjectively by data analysts or objectively by data themselves. In general, large values of Aj result in simpler models with fewer selected variables. When Ki = 1, J = 1, dn = d, and Aj = A, it reduces to the penalized partial likelihood in Fan and Li (2002). Many classical variable selection criteria are special cases of (4.3). An example is the Lo penalty (or entropy penalty)
p>.(IBI) = 0.5A 2 I(IBI =I- 0). In this case, the penalty term in (4.3) is merely 0.5nA 2 k, with k being the number of variables that are selected. Given k, the best fit to (4.3) is the subset of k variables having the largest likelihood R({3) among all subsets of k variables. In other words, the method corresponds to the best subset selection. The number of variables depends on the choice of A. The AIC (Akaike, 1973), BIC (Schwarz, 1978), qr.criterion (Shibata, 1984), and RIC (Foster & George, 1994) correspond to A = (2/n)1/2, {10g(n)/n}1/2, [log{log(n)}] 1/2, and {10g(dn )/n}1/2, respectively. Since the entropy penalty function is discontinuous, one requires to search over all possible subsets to maximise (4.3). Hence it is very expensive computationally. Furthermore, as analysed by Breiman (1996), best-subset variable selection suffers from several drawbacks, including its lack of stability. There are several choices for continuous penalty functions. The L1 penalty, defined by p>.(IBI) = AIBI, results in the LASSO variable selector (Tibshirani, 1996). The smoothly clipped absolute deviation (SCAD) penalty, defined by
ーセHbI@
= Al(IOI ::;; A)
+ (aA -
0)+ l(O > A),
a-I
(4.4)
Jianqing Fan, Jiancheng Jiang
26
for some a > 2 and A > 0, with PA(O) = O. Fan and Li (2001) recommended a = 3.7 based on a risk optimization consideration. This penalty improves the entropy penalty function by saving computational cost and resulting in a continuous solution to avoid unnecessary modelling variation. Furthermore, it improves the L1 penalty by avoiding excessive estimation bias. The penalised pseudo-partial likelihood estimator, denoted by maximises (4.3). For certain penalty functions, such as the L1 penalty and the SCAD penalty, maximising L(f3) will result in some vanishing estimates of coefficients and make their associated variables be deleted. Hence, by maximising L(f3), one selects a model and estimates its parameters simultaneously. Denote by f30 the true value of f3 with the nonzero and zero components f310 and f320. To emphasize the dependence of Aj on the sample size n, Aj is written as Ajn. Let Sn be the dimension of f31O,
/3,
As shown in Cai, Fan, Li, and Zhou (2005), under certain conditions, if 0, bn ----f 0 and 、セェョ@ ----f 0, as n ----f 00, then with probability tending to one, there exists a local maximizer /3 of L(f3), such that an
----f
Furthermore, if Ajn ----f 0, JnjdnAjn ----f 00, and an = O(n- 1 / 2 ), then with probability tending to 1, the above consistent local maximizer /3 = (/3f, /3'f)T must be such that (i) /32 = 0 and (ii) for any nonzero constant Sn x 1 vector Cn with c;cn
r::: Tr-1/2(A 11 11
ynC n
+ セ@
){ f31 - f310 A
+
(
All
+ セ@
= 1,
)-1 b} -----7 D N(O, 1),
where All and r ll consist of the first Sn columns and rows of A(f3lO, 0) and r(f3lO, 0), respectively (see the aforementioned paper for details of notation here). The above result demonstrates that the resulting estimators have the oracle property. For example, with the SCAD penalty, we have an = 0, b = 0 and セ@ = 0 for sufficiently large n. Hence, by the above result,
vョc[ZイセQOR@
All (/31
-
f31O) セ@
N(O, 1).
The estimator /31 shares the same sampling property as the oracle estimator. Furthermore, /32 = 0 is the same as the oracle estimator that knows in advance that f32 = O. In other words, the resulting estimator can correctly identify the true model, as if it were known in advance. Further study in this area includes extending the above model selection method to other Cox's type of models, such as the partly linear models in Sections 2.3, 3.3 and 3.4.
Chapter 1 Non- and Semi- Parametric Modeling in Survival Analysis
5
27
Validating Cox's type of models
Even though different Cox's type of models are useful for exploring the complicate association of covariates with failure rates, there is a risk that misspecification of a working Cox model can create large modeling bias and lead to wrong conclusions and erroneous forecasting. It is important to check whether certain Cox's models fit well a given data set. In parametric hypothesis testing, the most frequently used method is the likelihood ratio inference. It compares the likelihoods under the null and alternative models. See for example the likelihood ratio statistic in (2.14). The likelihood ratio tests are widely used in the theory and practice of statistics. An important fundamental property of the likelihood ratio tests is that their asymptotic null distributions are independent of nuisance parameters in the null model. It is natural to extend the likelihood ratio tests to see if some nonparametric components in Cox's type of models are of certain parametric forms. This allows us to validate some nested Cox's models. In nonparametric regression, a number of authors constructed the generalized likelihood ratio (GLR) tests to test if certain parametric/nonparametric null models hold and showed that the resulting tests share a common phenomenon, the Wilks phenomenon called in Fan, Zhang, and Zhang (2001). For details, see the reviewing paper of Fan and Jiang (2007). In the following, we introduce an idea of the GLR tests for Cox's type of models. Consider, for example, the partly linear additive Cox model in (2.22):
.A{ tlz, w}
=
.Ao(t) exp{ zT,B + (h (WI) + ... + t),
we obtain
E[dN(te- XT (3) _ Y(te- XT (3)dte- XT {3 ZT ,IX, Z, C 1\ T > te- XT {3J
Donglin Zeng, Jianwen Cai
38
Therefore, for fixed (3 and " we can estimate /-l(t) by rescaling N(t) and CAT with a factor exTf3 ; thus, an estimator for /-l(t) is the Breslow-type estimator given by A
/-l(t;{3,,) =
lt 2:;=1 o
x x {dNj (se- XJ f3) - Yj(se- Jf3)dse- Jf3 ZJ,} Y( -XT(3) . uj=1 j se J
"n
To estimate (3 and" we consider the following estimating equations
After substituting il(t;{3,,) into the left-hand side, we have
_ 2:;=1 Yj(te- XJf3)(dNj (te- XJf3) -
ZJ,e- XJ f3 dt )} = O.
(2.1)
2:;=1 Yj(te- XJf3 )
The re-transformation of the numerator on the left-hand side yields
Equivalently,
n
2:jYi(t)
i=1
Z
{ (x.)-
)} "n uJ=1 y. (tex,[ f3-xJ (3) (Zj X· J
XTf3-XTf3 ' 2 : j =1 Yj(te' J) n
J
(dNi(t)-zT,dt) =0.
(2.2) We wish to solve equation (2.2), or equivalently equation (2.1). However, the estimating function in (2.2) is not continuous in {3. Therefore, we suggest minimizing the norm of the left-hand side of (2.2) to obtain the estimators. Particularly, we implement the Neider-Meader simplex method to find the minimum in our numerical studies. At the end of this section, we will show y'n(S - (30,.:y - ,0) converges in distribution to a multivariate normal distribution with mean zeros and covariance in a sandwiched form A -1 B (A -1 f, where
Chapter 2
A
セ@
V."
Additive-Accelerated Rate Model for Recurrent Event
J+,(t){HセI@
En
- j[ョセ@ and B
=
39
Y.(teX'{f3-xJf3) ( Zj )
J
1
ケNHエ・クLサヲSMjセェ@
} (dNi(t) - zT ,dt) J
II
f3=f30 ,'Y=,o
E [SiST] with Si equal to
x (dNi (te- X'{f30) - zT ,e-X'{f3odt) -
J
Yo (te •
(Zi) E [Y(te_XT ,f30 ) Xi
+
XTf30
) {dN(te-
XTf30
) - ZT ,oeXTf3o E [Y(te)]
J
x XTf3O }i(te- '{f3O)E [Y(te)
XTf30
dt}]
(i)]
E [Y(te- XTf30 ) {dN(te- XTf30 ) - ZT ,oe- XTf30 dt}] E [Y(te- XTf30 )] 2
x
(2.3) .
Therefore, to estimate the asymptotic covariance, we can consistently estimate Si by Hi , simply replacing (30, ,0 and f..Lo by their corresponding estimators and replacing expectation with empirical mean in the expression of Si. To estimate A, we let () denote ((3,,) and define
Un ((})
= n- 1
tJ
_ E7=1
X [}i(te- '{f3)
(i:) {
dNi(te- X'{f3) - ZT,dte- x '{f3
[}j(te-XJf3)(dNj(te-XJf3) - Z!,e- XJ f3 dt)]
E7=1
}l.
[}j(te- XJ f3)]
We also let A be the numerical derivative of A with appropriate choice of perturbation size hn, i.e., the k-th column of A is given by (Un(B + hnek) - Un(B))/h n where ek is the k-th canonical vector. In Theorem 2 given later, we will show A is a consistent estimator for A. Thus, the asymptotic covariance is estimated by
40
Donglin Zeng, Jianwen Cai
We start to state the asymptotic properties for (/3/'1). We assume the following conditions. (AI) Assume X and Z are bounded and with positive probability, [1, X, Z) are linearly independent. (A2) Matrix A is assumed to be non-singular. (A3) P(C > TIX, Z) > 0 and C given X, Z has bounded density in [0, T). (A4) !-lo(t) is strictly increasing and has bounded second derivative in [0, T]. Condition (AI) is necessary since otherwise, f3 and "( assumed in the model may not be identifiable. Both conditions (A2) and (A3) are standard. Under these conditions, the following theorems hold. These conditions are similar to conditions QセT@ in Ying (1993). Theorem 1. Under conditions HaャIセTL@ there exists (/3, i) locally minimizing the norm of the estimating function in (2) such that
Theorem 2. Under conditions HaャIセT@ (co, Cl) for two positive constants Co and Cl, for A- 1B(A-1 )T.
and assuming h n ---t 0 and fohn E is a consistent estimator
A-I B(A -1) T
Thus, Theorem 1 gives the asymptotic normality of the proposed estimators and Theorem 2 indicates that the proposed variance estimator is valid. Using the estimators for f3 and ,,(, we then estimate the unknown function p,O with fl(t; /3, i), i.e.,
The following theorem gives the asymptotic properties for fl(t). Theorem 3. Let Dx be the support of X. Then under conditions HaャIセTL@ xT for any to < TSUPXED x e- {3o, fo(fl(t) - P,o(t)) converges weakly to a Gaussian process in loo [0, to].
The upper bound for to in Theorem 3 is determined by the constraint that the study duration time is T. The proofs of all the theorems utilize the empirical process theory and are given in the appendix.
3
Assessing additive and accelerated covariates
One important assumption in the proposed model is that covariates X have accelerated effect while covariates Z have additive effect. However, it is unknown in practice which covariates should be included as part of X or Z. In this section, we propose some ways for assessing these covariates. From the proposed model, if X and Z are determined correctly, we then expect that the mean of Y(t)(N(t)* - p,(te XT {3) - tZT "() is zero. As the result,
Chapter 2
Additive-Accelerated Rate Model for Recurrent Event
41
correct choices of X and Z should minimize the asymptotic limit of n-
1
i
T
o
n
2
L"Yi(t) {Nt(t) - f-t(te XTf3 ) - tz[ 'Y} dt. i=l
The following result also shows that if we use wrong choice of X or Z, this limit cannot be minimized.
Proposition 1. Suppose Xc and Zc are correct covariates with non-zero accelerated effect f3c and non-zero additive effect 'Yc respectively. Further assume (C1) the domain of all the covariates is offull rank; (C2) f-tc(t) is continuous and satisfies f-tc(O) = 0 and ヲMエセHPI@ > 0; (C3) Xc contains more than 2 covariates; or, Xc contains one covariate taking at least 3 different values; or, Xc is a single binary variable but f-tc is not a linear function. Then if Xw and Zw are the wrong choices of the accelerated effect and the additive effect respectively, i.e., Xw #- Xc or Zw #- Zc, then for any non-zero effect f3w and 'Yw and function f-tw,
1T E [Y(t) {N*(t) - f-tw(te X;f3w) - tZ;;''Yw } 2] dt > 1T E[Y(tHN*(t)-f-tc(te X;f3C)-tZ';'Ycf]dt. The proof of Proposition is given in the appendix. Condition (C1) says that the covariates can vary independently; Condition (C2) is trivial; Condition (C3) says that the only case we exclude here is the f-t(t) = At and Xc is binary. The excluded case is unwanted since f-t(te Xcf3c ) = A{{ef3c - l)Xc + l}t resulting that Xc can also be treated as additive covariate. From Proposition 1, it is concluded that whenever we misspecify the accelerated covariates or additive covariates which has non-zero effect, we cannot minimize the limit of the square residuals for all observed t. Hence, if we find X and Z minimizing the criterion, it implies that the differences between X and Xc and Z and Zc are only those covariates without effect.
4
Simulation studies
We conduct two simulation studies to examine the performance of the proposed estimators with moderate sample sizes. In the first simulation study, we generate recurrent event using the following intensity model
E[dN(t)IN(t-), X, z, セj@
= セH、ヲMエ・SクI@
+ 'YZdt),
where Z is a Bernoulli random variable and X = Z + N(O, 1), f-t(t) = O.Se t , and セ@ is a gamma-frailty with mean 1. It is clear this model implies the marginal model as given in Section 2. Furthermore, we generate censoring time from a uniform distribution in [0,2J so that the average number of events per subject is about
Donglin Zeng, Jianwen Cai
42
1.5. In the second simulation study, we consider the same setting except that /-L(t) = 0.8t2 and the censoring distribution is uniform in [0,3]. To estimate (3 and " we minimize the Euclidean norm of the left-hand side in equation (2). Our starting values are chosen to be close to the true values in order to avoid local minima. To estimate the asymptotic covariance of the estimators, we use the numerical derivative method as suggested in Theorem 2. Particularly, we choose h n to be n- 1 / 2 , 3n- 1 / 2 and 5n- 1 / 2 but find the estimates robust to these choices. Therefore, our table only reports the ones associated with h n = 3n- 1 / 2 . Our numerical experience shows that when sample size is as small as 200, the minimization may give some extreme value for (3 in a fraction of 4% in the simulations; the fraction of bad cases decreases to about 1% with n = 600. Hence, our summary reports the summary statistics after excluding outliers which are 1.5 inter-quartile range above the third quartile or below the first quartile. Table 1 gives the summary of the simulation studies from n = 200, 400, and 600 based on 1000 repetitions. Particularly, "Est" is the median of the estimates; "SEE" is the estimate for the standard deviation from all the estimates; "ESE" is the mean of the estimated standard errors; "CP" is the coverage probability of the 95% confidence intervals. From the table, we conclude that the proposed estimators perform reasonably well with the sample size considered in terms of small bias and accurate inference.
Table 1: Summary of Simulation Studies /-L(t) = 0.8e t
n 200
parameter
f3 'Y
400
(3
600
(3
'Y 'Y
true -0.5 1 -0.5 1 -0.5 1
Est -0.501 0.991 -0.496 0.979 -0.490 1.000
SEE
ESE
CP
0.366 0.414 0.246 0.296 0.202 0.240
0.381 0.437 0.252 0.311 0.207 0.255
0.94 0.96 0.93 0.96 0.94 0.96
/-L(t) = 0.8t2
n 200
parameter (3
400
(3
600
(3
'Y 'Y 'Y
5
true -0.5 1 -0.5 1 -0.5 1
Est -0.478 0.970 -0.498 0.992 -0.493 0.998
SEE
ESE
CP
0.331 0.307 0.240 0.227 0.192 0.185
0.344 0.328 0.237 0.235 0.195 0.195
0.94 0.95 0.92 0.95 0.93 0.96
Application
We apply the proposed method to analyze two real data sets. The first data arise from a chronic granulomatous disease (CGD) study which was previously analyzed by Therneau and Hamilton (1997). The study contained a total 128 patients with 65 patients receiving interferon-gamma and 63 receiving placebo. The number of recurrent infections was 20 in the treatment arm and was 55 in the placebo arm.
Chapter 2
Additive-Accelerated Rate Model for Recurrent Event
43
Table 2: Results from Analyzing Vitamin A Trial Data Covariate vitamin A vs placebo age in years
Estimate -0.0006 -0.0040
Std Error 0.0411 0.0008
p-value 0.98 < 0.001
We also included the age of patients as one covariate. The proposed additiveaccelerated rate model was used to analyze this data. To assess whether the treatment variable and the age variable have either additive or accelerated effect, we utilized the criterion in Section 3 and considered the four different models: (a) both treatment and age were additive effects; (b) both treatment and age were accelerated effects; (c) treatment was additive and age was accelerated; (d) treatment was accelerated and age was additive. The result shows that the model with minimal error is model (b), i.e., the accelerated rate model. Such a model was already fitted in Ghosh (2004) where it showed that the treatment had a significant benefit in decreasing the number of infection occurrences. A second example is from the Vitamin A Community Trial conducted in Brazil (Barreto et aI, 1994). The study was a randomized community trial to examine the effect of the supplementation of vitamin A on diarrhea morbidity and diarrhea severity in children living in areas where its intake was inadequate. Some previous information showed that the vitamin A supplement could reduce the child mortality by 23% to 34% in populations where vitamin deficiency was endemic. However, the effect on the morbidity was little known before the study. The study consisted of young children who were assigned to receive either vitamin A or placebo every 4 months for 1 year in a small city in the Northeast of Brazil. Since it was indicated before that treatment effect might be variable over time, for illustration, we restrict our analysis to the first quarter of the follow-up among boys, which consists of 486 boys with an average number of 3 events. Two covariates of interest are the treatment indicator (vitamin A vs placebo) and the age. As before, we use the proposed method to select among the aforementioned models (a)-(d). The final result shows that model (c) yields the smallest prediction error. That is, our finding shows that the treatment effect is accelerative while the age effect is additive. Particularly, Table 2 gives the results from model (c). It shows that the treatment effect is not significant; however, younger boys tended to experience significantly more diarrhea episodes than older boys. Figure 2 gives the estimated function for J.L(t).
6
Remarks
We have proposed a flexible additive-accelerated model for modelling recurrent event. The proposed model includes the accelerated rate model and additive rate model as special cases. The proposed method performs well in simulation studies. We also discuss the method for assessing whether the covariates are additive covariates or accelerated covariates. When the model is particularly used for prediction for future observations, overfitting can be a problem. Instead of using the criterion function in assessing
Donglin Zeng, lianwen Cai
44
o
20
40
80
60
100
120
Days
Figure 2: Estimated function for J-l(t) in the Vitamin A trial
additive covariates or accelerated covariates, one may consider a generalized crossvalidation residual square error to assess the model fit. Theoretical justification of the latter requires more work. Our model can be generalized to incorporate time-dependent covariates by assuming
E[N(t)\X, Zl = {L(te X (t)Tf3)
+ Z(tf 'Y-
The same inference procedure can be used to estimate covariate effects. However, the nice result on assessing additive covariates or accelerated covariates as Proposition 1 may not exist due to the complex relationship among time-dependent covariates. Finally, our model can also be generalized to incorporate time-varying coefficient.
Acknowledgements We thank Professor Mauricio Barreto at the Federal University of Bahia and the Vitamin A Community Trial for providing the data. This work was partially supported by the National Institutes of Health grant ROI-HL57444.
Appendix Proof of Theorems 1-3. We prove Theorem 1 and Theorem 2. For convenience, we let 0 denote ({3, ,) and use the definition of Un(O). For 0 in any compact set, it is easy to see the class of functions {Y(te-
XT
(3)}, {ZT ,e-xT f3}, {N(te- XT (3) }
Chapter 2
Additive-Accelerated Rate Model for Recurrent Event
45
are P-Donsker so P-Glivenko-Cantelli. Therefore, in the neighborhood of 80 , Un (8) uniformly converges to
U(8) == E [I Yi(te- x [i3)
(i:) {dN i(te- X[i3) - Z'[-r dte - x [i3
_ E [Yj (te- XJi3) (dNj (te-xJ.B) - ZJ"{e-XJ.Bdt)] }] E [Yj(te-XJ.B)]
.
Furthermore, we note that uniformly in 8 in a compact set,
v'n(Un(8) - U(8))
セ@
Go
[f yHエ・MxtセI@
(;.) {dN(e-XT't) -
zr7 e- XT 'dt
P n [Y(te-XTi3)(dN(e-XTi3t) - ZT"{e-XT.Bdt )] }] P n [Y(te-XT.B)]
(i) J]
-G n
[IYCte-XT.B)CdN(e-XT.Bt) _ ZT"{e-XT.Bdt) P [Y(te-XT.B) P n [YCtcXT.B)]
+G n [I Y(te-
XTi3
)p [Y Cte-
XTi3
)
(i)]
x P [Y(te-XTi3)CdN(e-XTi3t) - ZT"{e- XTi3 dt)] P n [Y(te-XT.B)] P [Y(te- XTi3 )]
1 ,
where P nand P refer to the empirical measure and the expectation respectively, and G n = y'n(P n - P). Hence, from the Donsker theorem and the definition of 8 i , we obtain n
sup Iv'nUn(8) - v'nU(8) \lJ-Oo\:;;;M/v'n
n- 1 / 2
L 8 l = op(l). i
i=1
On the other hand, conditions (A2)-(A4) implies that U(8) is continuously differentiable around 80 and U(8) = U(80 ) + A(8 - 80 ) + 0(18 - 80 1). Additionally, it is straightforward to verify U(80) = O. As the result, it holds n
sup Iv'nUn (8) - v'nA(8 - 80 ) \O-Oo\:;;;M/v'n
-
1 2
n- /
L 8 l = op(l). i
i=1
We obtain a similar result to Theorem 1 in Ying (1993). Following the same proof as given in Corollary 1 in Ying (1993), we prove Theorem 1.
Donglin Zeng, Jianwen Cai
46
To prove Theorem 2, from the previous argument, we obtain uniformly in a
y'n- neighbor hood of 00 , n
:L S
..;nun(e) = ..;nA(O -eo) +n- 1/ 2
i
+op(1).
i=l
Thus, for any canonical vector e and h n
-+
0,
As the result,
Un(O + hne) - Un(O) _ A (_1_) h - e + op y'nh . n n Since y'nh n is bounded away from zero,
We have shown that the estimator based on the numerical derivatives, i.e., A, converges in probability to A. The consistency of Si to Si is obvious. Therefore,
A-1B(A-l)T
A-1B(A-1)T.
-+
To prove Theorem 3, from fl(t)'s expression, we can obtain
Y(se-xTi3)E[dN(se-xTi3) - Y(se-xTi3)dse-xTi3ZTil E[Y(se- XT ,8)J2 o E[dN(se- xTi3 ) - Y(se-xTi3)dse-xTi3 ZTil +..;n io E[Y(se-XTt3)] - J.L(t) + op(l).
-G n
[
i
t
A
1
r
Clearly, N(te- XT ,8), Y(te- XT ,8), ZT'Y all belong to some Donsker class. Additionally, when t セ@ to, E[Y(te- XTt3 )] > O. Hence, after the Taylor expansion of the third term and applying the Donsker theorem, we obtain A
..;n(J.L(t)
_
r{dN(se-XT,80)-Y(se-XT,80)dSCXT,8oZT'Yo} n J.L(t)) - G io E[Y(se-XT,8o)] _
Chapter 2 Additive-Accelerated Rate Model for Recurrent Event
47
Therefore, using the asymptotic expansions for (J and l' as given in proving Theorem 1, we can show vfn(P,(t) - Ito(t)) converges weakly to a Gaussian process in
lOO[O, to]. Proof of Proposition 1. We prove by contradiction. Assume
faT E
[Y(t) {N*(t) - Itw(te x J;f3w) _ tZ'{;/yw}
セ@ faT E
[Y(t) {N*(t) -
ャエ・HxセヲS」I@
- tZ'[ "Ie}
2] dt 2] dt.
Notice E [Y(t) {N*(t) - gt(X, Z)}2] is minimized for gt(X, Z) = E[N*(t)IY(t), X,
Z] = ャエ・Hxセ@ f3c) + tZ'[ "Ie and such gt(X, Z) is unique almost surely. Therefore, for any t E [0, rJ, we conclude Itw(te XJ;f3 w) + tZJ;"Iw = ャエ・HxセヲS」I@
+ tZ'[ "Ie
with probability one. We differentiate (A.l) at t =
QエセHoI・クj[ヲSキ@
+ ZJ;"Iw = QエセHoI・xヲS」@
(A.l)
°and obtain + Z'[ "Ie.
(A.2)
We show Xc must be included in Xw. Otherwise, there exists some covariate Xl in Xc but in Zw. We will show this is impossible by considering the following two cases. Case I. Suppose Xl has at least three different values and f3el is the coefficient of Xl. Fix the values of all the other covariates. Then equation (A.2) gives
aef3cIXl - bXl
=
d
for some constants a > 0, b, and d. However, since f3el i- 0, the function on the left-hand side of the above equation is strictly convex in Xl so the equation has at most two solutions. Since Xl has at least three different values, we obtain the contradiction. Case II. Suppose Xl has only two different values. Without loss of generality, we assume Xl = or 1. Then we obtain
°
for some function g. Here, Xc,-l means covariates in Xc except Xl and the same is defined for f3e,-l. Consequently, X'[_lf3e,-1 =constant, which implies that Xe,-l has to be empty. That is, Xc only contains a single binary variable. Thus, equation (A.l) gives If ef3c < 1, we replace t by ef3c k t and sum over k = 0,1,2, .... This gives Ite(t) = Act for some constant Ae. If ef3c > 1, we replace t by e-f3c k t and sum over k = 1,2, ... and obtain the same conclusion. Since we assumed that J-le(t) is not a
48
Donglin Zeng, Jianwen Cai
linear function of t when Xc is a binary covariate, P,c(t) = Act introduces the contradiction. As the result, Xc must be included in Xw. Thus, it is easy to see p,:"(0) =1= o. We repeat the same arguments but switch Xw and Xc. We conclude Xw is also included in Xc. Therefore, Xw = Xc. Clearly, f3w = f3c and p,:"(0) = ーLセHPI@ by the fact that Zw and Zc are different from Xw and Xc and Condition (A.1). This further gives z'E,w = z'[ Ic. Since the covariates are linearly independent and IW =1= 0 and IC =1= 0, we obtain Zw = ZC. This contradicts with the assumption Xw =1= Xc or Zw =1= ZC·
References [1] Andersen, P. K. and Gill, R. D. (1982). Cox's regression model for counting processes: a large sample study. Annals of Statistics, 10, 1100-1120. [2] Cox, D. R. and Oakes, D. (1984). Analysis of Survival Data. Chapman and Hall: London. [3] Ghosh, D. (2004). Accelerated rates rgression models for recurrent failure time data. Lifetime Data Analysis, 10, 247-26l. [4] Kalbfleisch, J. D. and Prentice, R. L. (1980). The Statistical Analysis of Failure Time Data. Wiley: New York. [5] Lawless, J. F. and Nadeau, C. (1995). Some simple and robust methods for the analysis of recurrent events. Technometrics, 37, 158-168. [6] Pepe, M. S. and Cai, J. (1993). Some graphical displays and marginal regression analysis for recurrent failure times and time dependent covariates. Journal of the American Statistical Association, 88, 811-820. [7] Schaubel, D. E., Zeng, D., and Cai, J. (2006). A Semiparametric Additive Rates Model for Recurrent Event Data. Lifetime Data Analysis, 12,386-406. [8] Therneau, T. M. and Hamilton, S. A. (1997). RhDNase as an example of recurrent event analysis. Statistics in Medicine, 16, 2029-2047. [9] Ying, Z. (1993). A large sample study of rank estimation for censored regression data. Annals of Statistics, 21, 76-99.
Chapter 3 An Overview on Quadratic Inference Function Approaches for Longitudinal Data John J. Dziak * Runze Li t
Annie Qu
:j:
Abstract Correlated data, mainly including longitudinal data, panel data, functional data and repeated measured data, are common in the fields of biomedical research, environmental studies, econometrics and the social sciences. Various statistical procedures have been proposed for analysis of correlated data in the literature. This chapter intends to provide a an overview of quadratic inference function method, which proposed by Qu, Lindsay and Li (2000). We introduce the motivation of both generalized estimating equations method and the quadratic inference method. We further review the quadratic inference method for time-varying coefficient models with longitudinal data and variable selection via penalized quadratic inference function method. We further outline some applications of the quadratic inference function method on missing data and robust modeling.
Keywords: Mixed linear models; hierarchical linear models; hierarchical generalized linear models; generalized estimating equations; quadratic inference function; dispersion parameter; GEE estimator; QIF estimator; modified Cholesky decomposition; time-varying coefficient model; penalized QIF; smoothly clipped absolute; LASSO.
1
Introduction
Correlated data occurs almost everywhere, and is especially common in the fields of biomedical research, environmental studies, econometrics and the social sciences (Diggle et al. 2002, Davis 2002, Hedeker & Gibbons 2006). For example, to achieve sufficient statistical power using a limited number of experiment units in clinical trials, the outcome measurements are often repeatedly obtained from the *The Methodology Center, The Pennsylvania State University, 204 E. Calder Way, Suite 400 State College, PA 16801, USA. E-mail: [email protected] tDepartment of Statistics and The Methodology Center, The Pennsylvania State University, University Park, PA 16802-2111, USA. E-mail: [email protected] tDepartment of Statistics, Oregon State University, Corvallis, OR 97331-4606, USA. E-mail: [email protected]
49
50
John J. Dziak, Runze Li, Annie Qu
same subject at different time points; in education, students' achievements are more likely to be similar if they are from the same class, and the class, school or community are treated as natural clusters; in spatial environmental studies, researchers often have no control over spatially correlated field samples such as streams or species abundance. There are two major existing approaches to modeling and analyzing correlated data. One is the subject-specific approach including mixed linear models (Laird & Ware, 1982), hierarchical linear models (Bryk & Raudenbush, 1992), and hierarchical generalized linear models (Lee & NeIder, 1996). The other is a population average approach including generalized estimating equations (Liang & Zeger, 1986). The former approach emphasizes modeling heterogeneity among clusters which induces a correlation structure among observations. The latter models the correlation structure directly. These two approaches yield regression estimators with different interpretations, and different values in practice (Zeger, Liang & Albert, 1988; Neuhaus, Kalbfleisch & Hauck, 1991; Lee & NeIder 2004). The major drawback to subject-specific approaches is that random effects are assumed to follow an explicit distribution, and typically a normal random effects distribution is assumed. Lee & NeIder (1996) allow a broader class of parametric models for random effects. However, in general there is not enough information for goodness-of-fit tests for random effects distributions. Neuhaus, Hauck & Kalbfleish (1992) show that when the distributions of the random effects are misspecified, the estimator of fixed effects could be inconsistent for logistic models. On the other hand, the generalized estimating equations approach has an advantage in that it requires only the first two moments of the data, and a misspecified working correlation does not affect the root n consistency of the regression parameter estimation; though the misspecification of working correlation does affect the efficiency of the regression parameter estimation (see Liang & Zeger 1986, Fitzmaurice 1995, Qu, Lindsay & Li, 2000). In general, the generalized estimating equation approach lacks a probabilistic interpretation since the estimating functions are not uniquely determined. Therefore, it is not obvious how to do model checking or goodness-of-fit test based on likelihood functions such as the likelihood ratio test. For these reasons, Heagerty & Zeger (2000) developed marginalized multilevel models, a likelihood based method which is less sensitive to misspecified random effects distributions. However, this might be computational complex and difficult since the marginal approach requires integrations, and, if a high dimension of random effects is involved, the estimates may not be analytically tractable. In order to overcome the limitations of the above approaches, Qu, Lindsay & Li (2000) proposed the quadratic inference function (QIF) method to analyze longitudinal data in a semiparametric framework defined by a set of mean zero estimating functions. This approach has the advantages of the estimating function approach, as it does not require the specification of the likelihood function and does not involve intractable computations. It also overcomes the limitation of estimating function approach such as lacking of inference functions. The difference in QIF between two nested models is analogous to the difference in minus twice the log likelihood, so it may provide a semiparametric analog to the likelihood ratio
Chapter 3
Quadratic Inference Function Approaches
51
test. Specifically, it provides an asymptotic chi-squared distribution for goodnessof-fit tests and hypothesis testing for nested regression parameters. Since the QIF method was proposed, this method has been further developed to cope with various difficulties in the analysis of longitudinal data. Its applications have been published in the various statistical journals. This chapter aims to provide a partial review on this topic. This chapter is organized as follows. Section 2 gives background and motivation of the quadratic inference function approach, and presents some theoretic properties of the quadratic inference function estimator. In Section 3, we introduce penalized quadratic inference functions for time-varying coefficient models and variable selection for longitudinal data. Section 4 presents some main ideas about how to apply the quadratic inference function approach for testing whether missing data is ignorable for an estimating equation approach CQu & Song, 2002). Section 4 also demonstrates that quadratic inference function estimation is potentially more robust to outliers than the ordinary generalized estimating equations estimation (Qu & Song, 2004). Some research topics that need to further study are presented in Section 5.
2
The quadratic inference function approach
Suppose that we collect a covariate vector Xij and a response Yij for individual = 1,···,J and i = 1,··· ,no Denote Yi = CYil,··· ,YiJ)T and Xi = (XiI, ... , XiJ ) T. Start with a simple continuous response case. The following linear regression model is useful to explore the relationship between the covariates and the continuous response.
i at time tj, j
(2.1) where ei is assumed to be an independent and identically random error with mean o. It is well known that when the within subject random errors are correlated, the ordinary least squares estimator for j3 is not efficient. To improve efficiency of the ordinary least squares estimator, consider the weighted least squares estimator (WLSE)
i=1
i=1
If the covariance matrix of ei is of the form HWBRセ@ with a known セL@ but unknown (7"2, then WLSE with Wi = セMQ@ is the best linear unbiased estimator (BLUE). In practice, the covariance matrix of ei is typically unknown, and therefore the WLSE requires us to specify a covariance structure. In practice, the true covariance structure is generally complicated and unknown, except in very simple situations. How to construct a good estimator for f3 when the covariance structure is misspecified? This poses a challenge in the analysis of longitudinal data.
52
2.1
John J. Dziak, Runze Li, Annie Qu
Generalized estimating equations
The outcome variable could be discrete in many longitudinal studies. Naturally, we consider a generalized linear model for a discrete response. For iid data, the generalized linear model assumes that given the covariates, the conditional distribution of the response variable belongs to the exponential family (McMullagh & NeIder, 1989). In some situations, specification of a full likelihood function might be difficulty. It might be more desirable to assume the first two moments instead. The quasi-likelihood approach can be used to develop inference procedures for the generalized linear model (Wedderburn, 1974). Unfortunately, it is still a challenge for correlated discrete responses, such as binary or count response. As done for weighted least squares approach, the generalized estimating equation (GEE) approach (Liang & Zeger, 1986) assumes only the mean structure and variance structure along with the working correlation for the discrete longitudinal/repeated measurement data. In what follows, we briefly introduce the GEE approach. As in the generalized linear models, it is assumed that
E(YijIXij)
g-l(x'f;{3),
=
Var(Yijlxij)
=
1>V(P,ij),
where P,ij = E(Yijlxij), g(.) is a known link function, 1> is called a dispersion parameter, and V (.) is called a variance function. Let us present a few examples. Example 1. For continuous response variable, the normal error linear regression model assumes that the random error in (2.1) follows a normal distribution. In other words, given Xij, the conditional distribution of Yij is N(xI;i3,u 2 ). Then g(p,) = p" the identity link, and 1> = u 2 and V(p,) = 1. Example 2. For binary outcome variable, the logistic regression model assumes that given Xij, Yij follows a Bernoulli distribution with success probability
p(xd = J
Then g(p)
= p/(l - p),
exp(xfi3) J. 1 + exp(xI;{3)
the logit link, 1>
=
1 and V(p)
= p(l - p).
Example 3. For count outcome variable, the log-linear Poisson regression model assumes that given Xij, Yij follows a Poisson distribution with mean
>'(Xij) = exp(x'f;{3). Then g(>.)
= log(>.),
the log link, 1> = 1 and V(>.)
= >..
For these examples, it is easy to specify the first two moments for Yij marginal ly. Thus, we may further construct quasi-likelihood when the conditional distribution of Yij is not available. It is quite difficult to specify the joint distribution of binary or count responses Yij, j = 1, ... ,J here.
Chapter 3 Let Pi
=
Quadratic Inference Function Approaches
(Pil,··· ,PiJ
f,
53
be a J x d matrix with the j-th row being
Di
8g- l (xT;(3)/8(3, Ai be a J x J diagonal matrix with the j-th diagonal element
V(pij). For a given working correlation matrix R i , the GEE estimator is the solution of the following estimation equation: (2.2) i==l
It can be verified that for model (2.1), the WLSE with weight matrix Wi = A;/2RiA;/2 is the solution of (2.2). As demonstrated in Liang & Zeger (1986), the choice of the working correlation matrix Ri does not affect the consistency of the GEE estimator, but could affect its efficiency; and if Ri is correctly specified, then the resulting estimation is the most efficient.
2.2
Quadratic inference functions
The quadratic inference function (QIF) approach (Qu, Lindsay & Li, 2000) shares the same asymptotic efficiency of the estimator as that of GEE when the correlation matrix equals the true one, and is the most efficient in asymptotic sense for a given class of correlation matrices. Thus, the QIF estimator is at least as asymptotic efficient as the corresponding GEE for a given working correlation. Before we describe the ideas of QIF, we first briefly discuss the common choices of the working correlation matrix Ri in the GEE method. There are two commonly used working correlation matrices: equicorrelated matrix and AR working correlation structures. The equicorrelated (also known as exchangeably correlated or compound symmetric) matrix is defined as
1 p ... P] pl··· p
R=
. . .. ..
.. [ ..
'.
p p ... 1
In the implementation of the GEE method, the inverse of R is used here. For equicorrelated working correlation structure, we have
where 1 is the J x 1 vector with all elements equaling 1, and al = 1/(1 - p) and a2 = - p/ { (1 - p) (1 - P + J p)}. This indicates that the inverse of equicorrelated matrix can be represented as a linear combination of nonnegative definite matrices. For the AR correlation structure, assume that Zt, t = 1,··· ,J is an AR sequence with order q. In other words, we can represent Zj as min{t-l,q} Zt
=
L
j==l
CPjZt-j
+ et·
(2.3)
54
John J. Dziak, Runze Li, Annie Qu
where et's are independent white noise with mean zero and variance (J2. Denote z = (Zl,'" ,zJ)Y, e = (el,'" ,eJ)Y and L to be a lower triangular matrix having ones on its diagonal and (i, i - j)-element -CPj, for i = 1"" ,J and j = 1,'" ,min{i - 1, q}. Then (2.3) can be rewritten as Lz = e. Thus, Lcov(z)L T = cov(e) = (J2 I. This indeed is the modified Cholesky decomposition of cov(z) (see (2.7) below). Therefore, cov-l(z)
= (J-2L T L
Denote U j is a J x J matrix with (i, i - j)-element being 1 and all other elements being O. Note that UfUk = 0 for j -1= k. Then
j
j
j
j
Thus, the inverse of the covariance matrix of an AR sequence can be also represented as a linear combination of symmetric matrices. Based on these observations, the QIF approach assumes that the inverse of a within-subject working correlation matrix R can be expressed as a linear combination RZセ]Q@ bkMk, where the bk's are unknown constants, and bkMk'S are known, symmetric matrices. Reexpressing (2.2), the GEE estimate is then the value of {3 which sets the following quasi-score to zero: M blb s = n -1 DiAi-1/2 ( l 1
+ ... + br M r ) A-i l / 2(yi
Notice that s is a linear combination of the "extended score" where mi
=
[Dr
aセQORィmャHyゥ@
Dr aセQOR「イmHyゥ@
-
J-Li ) .
mn =
セ@ RZセ]Q@
(2.4) bimi,
- J-Li)] . (2.5) - J-Li)
Since mn contains more estimating equations than the dimension of {3, it could be impossible to set all equations to be zero. Hansen (1982) proposed the generalized method of moment (GMM) which attempts to combine these estimating equations optimally. The GMM could be traced back from the minimum X2 method introduced by Neyman (1949) which is further developed by Ferguson (1958,1996). The GMM estimator 7J is obtained by minimizing ュセcMャ@ mn, where C is a weight matrix, instead of solving s = O. Hansen (1982) has shown that the best weight matrix C is the covariance matrix of mn . In practice the covariance matrix of mn is often unknown. Qu, Lindsay & Li (2000) suggested taking the weight matrix C to be its sample counterpart, i.e. n- 2 RZセQ@ MiMr, and defined the quadratic inference function (2.6)
Chapter 3 and the QIF estimator
Quadratic Inference Function Approaches
fj is
55
defined to be
13 =
argmin(3Q(f3).
The idea of QIF approach is to use data-driven weights, which gives less weight to the estimating equations with larger variances, rather than setting the weights via ad hoc estimators of the parameters of the working correlation structure as in (2.4). There is no explicit form for fj. A numerical algorithm, such as NewtonRaphson algorithm or Fisher scoring algorithm, should be used to minimization Q(f3). See Qu, Lindsay & Li (2000) for details. One of the main advantage of the QIF approach is that if the working correlation is correctly specified, the QIF estimator has an asymptotic variance as low as the GEE. If the working structure is incorrect, the QIF estimator is still optimal among the same linear class of estimating equations, while the GEE estimator with the same working correlation is not. See Qu, Lindsay & Li (2000) for some numerical comparisons. The asymptotic property of the QIF estimator fj has been studied under the framework of the GMM. It has been shown that no matter whether C is consistent for C or not, fj is root n consistent and asymptotic normal, provided that C is positive definite. As shown in Qu & Lindsay (2003), if the true score function is included in M, then the QrF estimator in the context of a parametric model, is asymptotically equivalent to the MLE and thus shares its first-order asymptotic optimality. Unlike GEE, the QIF estimation minimizes a clearly defined objective function. The quadratic form Qn itself has useful asymptotic properties, and is directly related to the classic quadratic-form test statistics (see Hansen 1982, Greene 2000, Lindsay & Qu 2003), as well as somewhat analogous to the quadratic GEE test statistics of Rotnitzsky & Jewell (1990) and Boos (1992). When p < q, Qn can be used as an asymptotic X2 goodness-of-fit statistic or a test statistic for hypotheses about the parameters 13k, it behaves much like a minus twice of log-likelihood. セ@
L
• Q(f3o) - Q(f3) ---t xセ@ under the null hypothesis Ho : 13 = 130· • More generally, if 13 = ['I,b, セャt@ where 'I,b is an r-dimensional parameter of interest and セ@ is a (p - r)-dimensional nuisance parameter, then the profile test statistic Qn('l,bo,[O) - Qn(;j),[) is asymptotically X; for testing Ho : 'I,b = 'l,bo. This could be used for testing the significance of a block of predictors. Thus, the QIF plays a similar role to the log-likelihood function of the parametric models. For the two primary class of working correlation matrices, the equicorrelated matrices and AR-1 working correlation structure, their inverse can be expressed as a linear combination of several basis matrices. One can construct various other working correlation structure through the linear combination RZセ]Q@ bkMk. For example, we can combine equi-correlation and AR-1 correlation structures together by pooling their bases together. One may find such linear combination of given correlation structure through the Cholesky decomposition. It is known that a positive definite matrix I; has a modified Cholesky decomposition: I;-l
= LTDL,
(2.7)
56
John J. Dziak, Runze Li, Annie Qu
where L is a lower triangular matrix having ones on its diagonal and typical element -¢ij in the (i, j) position for 1 :::; j < i :::; m, and D is a diagonal matrix with positive elements. As demonstrated for the AR working correlation structure, the modified Cholesky decomposition may be useful to find such a linear combination. In the recent literature, various estimation procedures have been suggested for covariance matrices using the Cholesky decomposition. Partial references on this topic are Barnard et al. (2000), Cai & Dunson (2006), Dai & Guo (2004), Daniels & Pourahmadi (2002), Houseman, et al. (2004), Huang, et al (2006), Li & Ryan (2002), Pan & Mackenzie (2003), Pourahmadi (1999, 2000), Roverato (2000), Smith & Kohn (2002), Wang & Carey (2004), Wu & Pourahmadi (2003) and Ye & Pan (2006). We limit ourselves in this section with the setting in which the observation times tj'S are the same for all subjects. In practice, this assumption may not always be valid. For subject-specific observation times tij'S, we may bin the observation times first, and then apply the QIF approach to improve efficiency based on the binned observation times. From our experience, this technique works well for functional longitudinal data as discussed in the next section.
3
Penalized quadratic inference function
We now introduce the penalized QIF method to deal with high-dimensionality of parameter space. We first show how to apply the QIF for time-varying coefficient models.
3.1
Time-varying coefficient models
Varying coefficient models have become popular since the work by Hastie & Tibshirani (1993). For functional longitudinal data, suppose that there are n subjects, and for the i-th subject, data {Yi(t), XiI (t), ... ,Xid(t)} were collected at times t = tij, j = 1, ... ,ni. To explore possible time-dependent effects, it is natural to consider d
Yi(t)
=
!3o(t)
+L
Xik (t)!3k (t)
+ Ei(t).
(3.1)
k=I
This is called a time-varying coefficient model. In these models, the effects of predictors at any fixed time are treated as linear, but the coefficients themselves are smooth functions of time. These models enable researchers to investigate possible time-varying effects of risk factors or covariates, and have been popular in the literature of longitudinal data analysis. (See, for example, Hoover, et al., (1998), Wu, Chiang & Hoover (1998), Fan & Zhang (2000), Martinussen & Scheike (2001), Chiang, Rice & Wu (2001), Huang, Wu & Zhou (2002,2004) and references therein.) For discrete response, a natural extension of model (3.1) is d
E{Yi(t)lxi(t)} = g-I{!30(t)
+L k=l
Xik(t)!3k(t)}.
(3.2)
Chapter 3
Quadratic Inference Function Approaches
57
With slight abuse of terminology, we shall still refer this model to as a timevarying coefficient model. Researchers have studied how to incorporate the correlation structure to improve efficiency of estimator of the functional coefficients. For a nonparametric regression model, which can be viewed as a special case of model (3.2), Lin & Carroll (2000) demonstrated that direct extension of GEE from parametric models to nonparametric models fails to properly incorporate the correlation structure. Under the setting of Lin & Carroll (2000), Wang (2003) proposed marginal kernel GEE method to utilize the correlation structure to improve efficiency. Qu & Li (2006) proposed an estimation procedure using penalized QIF with L2 penalty. The resulting penalized QIF estimator may be applied to improve efficiency of kernel GEE type estimators (Lin & Carroll, 2000; Wang, 2003) when the correlation structure is misspecified. The penalized QIF method originates from the penalized splines (Ruppert, 2002). The main idea of the penalized QIF is to approximate each (3k(t) with a truncated spline basis with a number of knots, thus creating an approximating parametric model. Let /'O,Z, l = 1"" ,K be chosen knots. For instance, if we parametrize (3k with a power spline of degree q = 3 with knot /'O,z's we have (3k(t)
= 'YkO + 'Yk1t + 'Yk2 t2 + 'Yk3 t3 +
2: 'Yk,l+3(t -
/'O,z)t·
z
Thus, we can reexpress the problem parametrically in terms of the new spline regression parameters 'Y. To avoid large model approximation error (higher bias), it might be necessary to take K large. This could lead to an overfitting model higher variance). To achieve a bias-variance tradeoff by reducing overfitting, Qu & Li (2006) proposed a penalized quadratic inference function with the L2 penalty: K
QC'Y) + n)..
L L GyセLャKS@ k
(3.3)
Z=l
where).. is a tuning parameter. In Qu & Li (2006), the tuning parameter).. is chosen by minimizing a modified GCV statistic: GCV
=
(1 _ n-1dfQ)2
(3.4)
with dfQ = tr( (Q+n).. Nf1Q), where N is a diagonal matrix such that Dii = 1 for knot terms and 0 otherwise. The GCV statistic is motivated by replacing RSS in the classic GCV statistic (Craven & Wahba, 1979) with Q. For model (3.2), it is of interest to test whether some coefficients are timevarying or time-invariant. Furthermore, it is also of interest to delete unnecessary knots in the penalized splines because it is always desirable to have parsimonious models. These issues can be formulated as statistical hypothesis tests of whether the corresponding 'Y's are equal to zero or not. Under the QIF approach this can be done by comparing Q for the constrained and unconstrained models. See Qu & Li (2006) for a detailed implementation of penalized spline QIF's.
58
3.2
John J. Dziak, Runze Li, Annie Qu
Variable selection for longitudinal data
Many variables are often measured in longitudinal studies. The number of potential predictor variables may be large, especially when nonlinear terms and interactions terms between covariates are introduced to reduce possible modeling biases. To enhance predictability and model parsimony, we have to select a subset of important variables in the final analysis. Thus, variable selection is an important research topic in the longitudinal data analysis. Dziak & Li (2007) gives an overview on this topic. Traditional variable selection criteria, such as AIC and BIC, for linear regression models and generalized linear models sometimes are used to select significant variables in the analysis of longitudinal data (see Zucchini, 2000, Burnham & Anderson 2004, Kuha 2004, Gurka 2006 for general comparisons of these criteria). These criteria are not immediately relevant to marginal modeling because the likelihood function is not fully specified. In the setting of GEE, some workers have recently begun proposing analogues of AIC (Pan, 2001, Cantoni et al., 2005) and BIC (Jiang & Liu, 2004), and research here is ongoing. Fu (2003) proposed penalized GEE with bridge penalty (Frank & Friedman, 1993) for longitudinal data, Dziak (2006) carefully studied the sampling properties of the penalized GEE with a class of general penalties, including the SCAD penalty (Fan & Li, 2001). Comparisons between penalized GEE and the proposals of Pan (2001) and Cantoni, et al (2005) are given in Dziak & Li (2007). This section is focus on summarizing the recent development of variable selection for longitudinal data by penalized QIF. A natural extension of the AIC and BIC is to replace the corresponding negative twice log-likelihood function by the QIF. For a given subset M of {1, ... ,d}, denote Me to be its complement, and i3M to be the minimizer of Qf((3), the QIF for the full model, viewed as a function of (3M by constraining (3Mc = O. Define a penalized QIF (3.5) where #(M) is the cardinality of M, Qf(i3M) is the Qf((3) evaluated at (3M = i3 and (3Mc = O. The AIC and BIC correspond to A = 2 and log(n), respectively. Wang & Qu (2007) showed that the penalized QIF with BIC penalty enjoys the well known model selection consistency property. That is, suppose that the true model exists, and it is among a fixed set of given candidate models, then with probability approaching one, the QIF with BIC penalty selects the true model as the sample size goes to infinity. The best subset variable selection with the traditional variable selection criteria becomes computationally infeasible for high dimensional data. Thus, instead of the best subset selection, stepwise subset selection procedures are implemented for high dimensional data. Stepwise regression ignores the stochastic errors inherited in the course of selections, and therefore the sampling property of the resulting estimates is difficult to understand. Furthermore, the subset selection approaches lack of stability in the sense that a small changes on data may lead to a very different selected model (Breiman, 1996). To handle issues related with high dimensionality, variable selection procedures have been developed to select significant variables and estimate their coefficients simultaneously (Frank & Friedman,
Chapter 3
Quadratic Inference Function Approaches
59
1993, Breiman, 1995, Tibshirani, 1996, Fan & Li, 2001 and Zou, 2006). These procedures have been extended for longitudinal data via penalized QrF in Dziak (2006). Define penalized QrF d
QJ(3)
+n LPAj(l,6jl),
(3.6)
j=l
where PAj (-) is a penalty function with a regularization parameter Aj. Note that different coefficients are allowed to have different penalties and regularization parameters. For example, we do not penalize coefficients 1'0, ... ,1'3 in (3.3); and in the context of variable selection, data analysts may not want to penalize the coefficients of certain variables because in their professional experience they believe that those variables are especially interesting or important and should be kept in the model. Minimizing the penalized QrF (3.6) yields an estimator of,6. With proper choice of penalty function, the resulting estimate will contain some exact zeros. This achieves the purpose of variable selection. Both (3.3) and (3.5) can be re-expressed in the form of (3.6) by taking the penalty function to be the L2 penalty, namely, PAj (l,6j I) = Aj l,6j 12 and the Lo penalty, namely, PAj (I,6j I) = AjI(I,6j I =I- 0), respectively. Frank & Friedman (2003) suggested using Lq penalty, PAj (l,6j I) = Aj l,6j Iq (0 < q < 2). The L1 penalty was used in Tibshirani (1996), and corresponds to the LASSO for the linear regression models. Fan & Li (2001) provides deep insights into how to select the penalty function and advocated using a nonconvex penalty, such as the smoothly clipped absolute deviation (SCAD) penalty, defined by
AI,6I, 2
(.II) _
PA (I f-/
-
(a -1)A2-(Ii3I_aA)2
{
2(a-1) (a+1)A2 2
'
,
if 0 :::; 1,61 < A; if A :::; 1,61 < aA; if 1,61 ? aA.
Fan & Li (2001) suggested fixing a = 3.7 from a Bayesian argument. Zou (2006) proposed the adaptive LASSO using weighted L1 penalty, PAj (l,6j I) = AjWj l,6j I with adaptive weights Wj. The L 1 , weighted L 1 , L2 and SCAD penalty functions are depicted in Figure 1. The SCAD estimator is similar to the LASSO estimator since it gives a sparse and continuous solution, but the SCAD estimator has lower bias than LASSO. The adaptive LASSO uses the adaptive weights to reduce possible bias of LASSO. Dziak (2006) showed that, under certain regularity conditions, nonconvex penalized QIF can provide a parsimonious fit and enjoys something analogous to the asymptotic "oracle property" described by Fan & Li (2001) in the least-squares context. Comparisons of penalized QIF estimator with various alternatives can be found in Dziak (2006).
John J. Dziak, Runze Li, Annie Qu
60
Penalty Functions 3.5 セMNL@ , - ,L 1 - - Weighted L 1
3
iGセ@
2.5 ...
C
2
]
p...1.5
"
. •セ@
','., '. , , , ... ... ,
, セZ@ ,
"
セ@
'", "
......
.'
,,
... ,
セ@ セ@
",
';""
OL-----________ NMLセ@ -5
'
,
セ@
セ@
"
0.5
, セ@ , "
...
,
,
... ... ,
______________
セ@
o
セ@
5
b
Figure 1: Penalty Functions. The values for>. are 0.5,0.5,0.125 and 1, respectively. The adaptive weight for the weighted L1 penalty is 3
4
Some applications of QIF
It is common for longitudinal data to contain missing data and outlying observations. The QIF method has been proposed for testing whether missing data are ignorable or not in the estimating equation setting (Qu & Song, 2002). Qu & Song (2004) also demonstrated that the QIF estimator is more robust than the ordinary GEE estimator in the presence of outliers. In this section, we outline the main ideas of Qu & Song (2002, 2004).
4.1
Missing data
Qu & Song (2002) define whether missing data is ignorable in the context of the estimating equation setting based on whether the estimating equations satisfy the mean-zero assumption. This definition differs somewhat from Rubin's (1976) definition which is based on the likelihood function. Qu & Song's (2002) approach shares the same basis as Chen & Little (1999) on decomposing data based on missing-data patterns. However, it avoids exhaustive parameter estimation for each missing pattern as in Chen & Little (1999). The key idea of Qu & Song's approach is that if different sets of estimating equations created by data sets with different missing patterns are compatible, then the missing mechanism is ignorable. This is equivalent to testing whether different sets of estimating equations satisfy the zero-mean assumption under common parameters, i.e., whether E(s) = O. This can be carried out by applying an over-identifying
Chapter 3
Quadratic Inference Function Approaches
61
restriction test, which follows a chi-squared distribution asymptotically. For example, suppose each subject has three visits or measurements, then there are four possible patterns of missingness. The first visit is mandatory; then subjects might show up for all appointments, miss the second, miss the third, or miss the second and third. We construct the QIF as
(4.1)
where 8j and 6 j are the estimating functions for the j-th missing pattern group and its empirical variance respectively. If 81, ... ,84 share the same mean structure, then the test statistic based on the QIF above will be relatively small compared to the cut-off chi-squared value under the null. Otherwise the QIF will be relatively larger. Clearly, if 81, ... ,84 do not hold mean-zero conditions under common parameters, the missingness might not be ignorable, since estimating functions formulated by different missing patterns do not lead to similar estimators. It might be recommended to use working independence here, or combine several similar missing patterns together, to keep the dimension to a reasonable size if there are too many different missing patterns. This approach is fairly simple to apply compared to that of Chen & Little (1999), since there is no need to estimate different sets of parameters for different missing patterns. Another advantage of this approach can be seen in the example of Rotnitzky & Wypij (1994), where the dimension of parameters for different missing patterns are different. The dichotomous response variables record asthma status for children at ages 9 and 13. The marginal probability is modeled as a logistic regression (Rotnitzky & Wypij, 1994) with gender and age as covariates: logit{Pr(Yit = I)} = /30
+ /31I(male) + /32I(age =
13),
where Yit = 1 if the i-th child had asthma at time t = 1,2 and I(E) is the indicator function for event E. About 20% of the children had asthma status missing at age 13, although every child had his or her asthma status recorded at age 9. Note that there are three parameters /30, /31 and /32 in the model when subjects have no missing data, but only two identifiable parameters, /30 and /31, for the incomplete case. Since the dimension of parameters is different for different missing patterns, Chen & Little's (1999) approach requires a maximum identifiable parameter transformation in order to perform the Wald test. However, the transformation might not be unique. Qu & Song (2002) do not require such a transformation, but show that the QIF goodness-of-fit test and the Wald test are asymptotically equivalent.
4.2
Outliers and contamination
The QIF estimator is more robust against outlying observations than the ordinary GEE estimator. Both GEE and QIF asymptotically solve equations which lead to
62
John J. Dziak, Runze Li, Annie Qu
an M-estimator. A robust estimator has a bounded influence function (Hampel et al., 1986). The influence function of the GEE is not bounded, while the influence function of the QIF is bounded (Qu & Song, 2004). This could explain why the QIF is more robust than the GEE for contaminated data. Hampel (1974) defines the influence function of an estimator as IF(z, Pj3)
inf
=
セHQM
eZセo@
c)Pj3
+ 」セェSI@
-
セHpェSI@
(4.2)
C
is the probability where Pj3 is the probability measure of the true model and セコ@ measure with mass 1 at contaminated data point z. If (4.2) is not bounded as a function of z, then the asymptotic bias in セ@ introduced by contaminated point z could be unbounded, that is, one could move セ@ to infinite value by allowing the contaminated point z to be infinite. Hampel et al. (1986) show that for an M-estimator, solving the estimating equation lセ]ャ@ Si(Zi, (3), the influence function is proportional to Si(Z, (3). Thus the influence function is bounded if and only if the contribution of an individual observation to the score function is bounded. If the influence function is not bounded, then the asymptotic "breakdown point" is zero and the corresponding estimator could be severely biased even with a singlegross outlier. Consider a simple case of GEE with a linear model using working independence. An individual observation's contribution to the score function is Xit (Yit クセHSIL@ which diverges if Yit is an outlier relative to the linear model. Qu & Song (2004) showed that the QIF does not have this kind of problem. In fact, the QIF has a "redescending" property whereby the contribution of a single anomalous observation to the score function goes to zero as that outlying observation goes to infinity. This is because the weighting matrix C is an empirical variance estimator of the extended score, and the inverse of C plays a major role in the estimation as it assigns smaller weights for dimensions with larger variance. Thus, the QIF automatically downweights grossly unusual observations. This result, however, would not hold for the working-independence structure since in that case the QIF is equivalent to the GEE estimator. Qu & Song (2004) show in their simulation how sufficiently large changes in a few observations can cause drastic effects on the GEE estimator but have only minor effects on the QIF estimator. The ordinary GEE can be made robust by downweighting unusual clusters and/or unusual observations (Preisser & Qaqish, 1996, 1999; He et al. 2002; Cantoni 2004; Cantoni et al. 2005). However, it could be difficult to identify outliers, since if some data do not fit the model well, it is not necessary that they are outliers. In addition, the choice of weighting scheme might not be obvious, and therefore difficult to determine.
4.3
A real data example
In this section we demonstrate how to use the QIF approach in real data analysis. We consider the CD4 data set, described in Kaslow et al. (1987) and a frequentlyused data set in the literature of varying-coefficient modeling (Wu, Chiang & Hoover, 1998; Fan & Zhang 2000; Huang, Wu & Zhou 2002, 2004; Qu & Li
Chapter 3
Quadratic Inference Function Approaches
63
2006). The response of interest is CD4 cell level, a measure of immune system strength, for a sample of HIV-positive men. Covariates include time in years from start of study (TIME), age at baseline (AGE, in years), smoking status at baseline (SMOKE; binary-coded with 1 indicating yes), and CD4 status at baseline (PRE). Measurements were made up to about twice a year for six years, but with some data missing due to skipped appointments or mortality. There were 284 participants, each measured at from 1 to 14 occasions over up to 6 years (the median number of observations was 6, over a median time of about 3.4 years). Some previous work has modeled this dataset with four time-varying linear coefficients:
y(t)
=
(3o(t)
+ (3s(t)SMOKE + (3A(t)AGE + (3p(t)PRE + c:(t).
(4.3)
We chose to extend this model to check for possible second-order terms:
y(t)
=
(3o(t) + (3s(t)SMOKE + (3A(t)AGE + (3p(t)PRE +(3SA(t)SMOKE' AGE + (3sp(t)SMOKE' PRE +(3AP(t)AGE' PRE + (3AA(t)AGE 2 + (3pp(t)PRE2
(4.4)
+ c:(t),
and to center the time variable at 3 to reduce correlation among the resulting terms. In fitting the model, AGE and PRE were also centered at their means to reduce collinearity. We initially modeled each (3 coefficient using a quadratic spline with two evenly spaced knots:
We did not apply a penalty. Combining (4.4) with (4.5) we effectively have a linear model with 45 parameters. Fitting this model under, say, AR-1 working QIF structure, would be very challenging because the nuisance C matrix would be of dimension 90 x 90 and would have to be empirically estimated and inverted, resulting in high sampling instability and perhaps numerical instability. Therefore, we instead start with a working independence structure. The estimated coefficients are depicted in Figure 2. As one might expect, the wobbly curves suggest that this large model overfits, so we will proceed to delete some of the 45 ,),'s to make the fit simpler. If the last four ,),'s for a particular (3 are not significantly different (as a block) from zero we will consider the (3 to be essentially constant in time (a time-invariant effect). If the last three are not significantly different from zero, we will consider the coefficient to be at most a linear function in time (this is equivalent to simply having a linear interaction term between the predictor and time). Otherwise, we will retain the spline form. To test whether a set of ')"s may be zero, we fit the model with and without the added constraint that they be zero, and compare the difference in Q between the models to its null-hypothesis distribution, a chi-squared with degrees of freedom equal to the number of parameters constrained to zero. Because we are using working-independence QIF (for which the number of equations exactly equals the number of coefficients), the full-model QIF will be zero, so the constrained-model QIF serves as the test statistic. It was concluded after some exploratory testing
John J. Dziak, Runze Li, Annie Qu
64
セa@ セs@
セッ@
I' /
Vl
J \ ,. - -
0
co.
M
co.
0
\
/
, Vl-
o
2 4 Time
o
6
-
co.
/,
_
-
"\
\ I
I
I
2
セー@ セ⦅@
,
\
0 N
4
024
6
Time
Time
セsa@
セsp@
ZSMセQ@
I
,"'-"
-\
/ /
/
co.-1(f.1), where iI>(.) is the CDF for a standard normal, gives a probit random effects model. On the other hand, for continuous outcome data, by setting
and g(.) to be an identity function, model (3.4) reduces to a linear mixed model. For count data, putting
h(o:) = e'''',
7
== 1, c(y, 7 2 ) = logy
and choosing g(f.1) = log(f.1) results in a Poisson regression model. Given data (Y(Si), i = 1"" ,n), (3.2) and (3.3) induce the following log likelihood that the inference will be based on:
£ = log
JIT
f(Y(Si)IX(Si), Z(Si); ,B)f(ZIXn; 9)dZ,
i=l
where the integration is over the n-dimensional random effect Z = (Z(sd,··· ,
Z(sn))' . We can further reformulate model (3.4) in a compact vectorial form. With Y n, Xn defined as in the previous section, we write
g{E(Y nlXn, Z)}
=
Xn(3 + AZ,
(3.5)
where A is a non-random design matrix, compatible with the random effects Z. The associated log likelihood function can be rewritten as
£(Y nlXn; (3, 9) = log L(Y nlXn; (3, 9) = log
J
f(Y nlXn, Z; (3)f(ZIX n ; 9)dZ,
(3.6) where f(Y nlXn, Z; (3) is the conditional likelihood for Y nand f(ZIX n ; 9) is the density function for Z, given the observed covariates X n . Model (3.5) is not a simple reformat - it accommodates more complex data structure beyond spatial data. For example, with properly defined A and random effects Z it encompasses non-normal clustered data and crossed factor data (Breslow and Clayton, 1993). When A is defined as matrix indicating membership of spatial regions (e.g. counties or census tracts), (3.5) models areal data as well. Model (3.5) accommodates a low-rank kriging spatial model, where the spatial random effects Z will have a dimension that does not increase with the sample size n and, in practice, is often far less than n. Specifically, consider a subset of
Chapter
4
Modeling and Analysis of Spatially Correlated Data
locations (Sl' ... ,Sn), say, (1\;1,'" ,I\;K), where K Let
85
< < n, as representative knots.
and Z be a K x 1 vector with covariance 0- 1 . Then (3.5) represents a low-kriging model by taking a linear combination of radial basis functions C(s - I\;k; 0)), 1 セ@ k セ@ K, centered at the knots (1\;1,'" ,I\;K), and can be viewed a generalization of Kammann and Wand's (2003) linear geoadditive model to accommodate nonnormal spatial data. Because of the generality of (3.5), the ensuing inferential procedures in Section 3.2 will be based on (3.5) and (3.6), facilitating the prediction of spatial random effects and, hence, each individual's profile. Two routes can be taken. The best predictor of random effects minimizing the conditional-mean-squared-error (2.15) is E(ZIY n), not necessarily linear in Y n' But if we confine our interest to an unbiased linear predictors of the form
for some conformable vector c and matrix Q, minimizing the mean squared error (2.12) subject to constraint (2.13) leads to the best linear unbiased predictor (BLUP)
Z = E(Z) + cov(Z, Y n){var(Y n)} -l{y n -
(3.7)
E(Y n)}.
Equation (3.7) holds true without any normality assumptions (McCulloch and Searle, 2001). For illustration, consider a Dirichlet model for binary spatial outcomes such that Y(s)IZ(s)
rv
Bernoulli(Z(s))
and the random effect Z = (Z(st)"" ,Z(sn)) rv Dir(a1,'" ,an), where Using (3.7), we obtain the best linear predictor for Z(Si),
where ao = ZセQ@l ai, 1£0 = (at/ao,'" ,an/ao)', セッ@ (ao + 1) for i =f j, Cii = ai(ao - ai)/a5(ao + 1) and As a simple example, when n = 2,
=
ei
ai
> O.
[Cij]nxn, Cij = -ai a j/a5 is the i-th column of セッᄋ@
86
3.2
Yi Li
Computing MLEs for SGLMMs
A common theme in fitting a SGLMM has been the difficulty of computation of likelihood-based inference. Computing the likelihood itself actually is often challenging for SGLMMs, largely due to high dimensional intractable integrals. We present below several useful likelihood-based approaches to estimating the coefficients and variance components, including iterative maximization procedures, such as the Expectation and Maximization (EM) algorithm, and approximation procedures, such as the Penalized Quazi-likelihood method and the Laplace method. The EM algorithm (Dempster et al., 1977) was originally designed for likelihood-based inference in the presence of missing observations, and involves an iterative procedure that increases likelihood at each step. The utility of the EM algorithm in a spatial setting lies in treating the unobserved spatial random terms as 'missing' data, and imputing the missing information based on the observed data, with the goal of maximizing the marginal likelihood of the observed data. Specifically, if the random effects Z were observed, we would be able to write the 'complete' data as (Y n, Z) with a joint log likelihood
As Z is unobservable, directly computing (3.8) is not feasible. Rather the EM algorithm adopts a two-step iterative process. The Expectation step ('E' step) computes the expectation of (3.8) conditional on the observed data. That is, calculate i = E{£(Y n, ZIX n ;,l3, 8)IY n, Xn;,l3o, 8 0 }, where ,l30, 8 0 are the current values, followed by a Maximization step (M step), which maximizes i with respect to ,l3 and 8. The E step and M step are iterated until convergence is achieved; however, the former is much costly, as the conditional distribution of ZIX n , Y n involves the distribution f(Y nIXn), a high dimensional intractable integral. A useful remedy is the Metropolis-Hastings algorithm that approximates the conditional distribution of ZIX n , Y n by making random draws from ZIXn, Y n without calculating the density f(Y nlXn) (McCulloch, 1997). Apart from the common EM algorithm that requires a full likelihood analysis, several less costly techniques have proved useful for approximate inference in the SGLMMs and other nonlinear variance component models, among which the Penalized Quasi-likelihood (PQL) method Penalized Quasi-likelihood method is most widely used. The PQL method was initially exploited as an approximate Bayes procedure to estimate regression coefficients for semiparametric models; see Green (1987). Since then, several authors have explored the PQL to draw approximate inferences based on random effects models: Schall (1991) and Breslow and Clayton(1993) developed iterative PQL algorithms, Lee and NeIder (1996) applied the PQL directly to hierarchical models. We consider the application of the PQL for the SGLMM (3.5). For notational simplicity we write the integrand of the likelihood function
f(Y nlXn, Z; J3)f(ZIX n ; 8) = exp{ -K(Yn, Z)},
(3.9)
Chapter
4
Modeling and Analysis of Spatially Correlated Data
87
where, for notational simplicity, we do not list Xn as an argument in function K. Next evaluate the marginal likelihood. Temporarily we assume that (J is known. For any fixed (3, expanding K(Y n, Z) around its mode Z up to the second order term, we have
L(Yn!Xn ; (3, (J)
=
J
exp{ -K(Yn, Z)}dZ
= IJ21T{ K(2)(y n, Z)} -1 W/ 2 exp{ -K(Yn, Z)}, wher: K(2)(y n, Z) denotes the second derivative of K(Y n, Z) with respect to Z, and Z lies in the segment joining 0 and Z. If K(2)(y n, Z) does not vary too much as Z changes (for instance, K(2)(y n, Z) = constant for normal data), maximizing the marginal likelihood (3.6) is equivalent to maximizing
This step is also equal to jointly maximizing fey n !X n , Z; (3)f(Z!X n ; (J) w.r.t (3 and Z with (J being held constant. Finally, only (J is left to be estimated, but it can be estimated by maximizing the approximate profile likelihood of (J,
refer to Breslow and Clayton (1993). As no close-form solution is available, the PQL is often performed through an iterative process. In particular, Schall (1991) derived an iterative algorithm when the random effects follow normal distributions. Specifically, with the current estimated values of (3, (J and Z, a working 'response' Yn is constructed by the first order Taylor expansion of g(Y) around p,z, or explicitly, (3.10) where g(1)(-) denotes the first derivative and g(.) is defined in (3.4). When viewing the last term in (3.10) as a random error, (3.10) suggests fitting a linear mixed model on Y n to obtain the updated values of (3, Z and (J, followed by a recalculation of the working 'responses'. The iteration shall continue until convergence. Computationally, the PQL is easy to implement, only requiring repeatedly invoking existing macros, for example, SAS 'PROC MIXED'. The PQL procedure yields exact MLEs for normally distributed data and for some cases when the conditional distribution of Y n and the distribution of Z are conjugate. Several variations of the PQL are worth mentioning. First, the PQL is actually applicable in a broader context where only the first two conditional moments of Y n given Z are needed, in lieu of a full likelihood specification. Specifically, fey n!X n , Z; (3) in (3.9) can be replaced by the quasi-likehood function exp{ ql(Y n!X n , Z; (3)}, where
ql(Y n!X n , Z; (3)
=
セ@
m
rf)li - t Vet) dt.
}Yi
Here J-li = E(Y(Si)!X n , Z; (3) and V(J-lt) = var(Y(si)!X n , Z; (3).
88
Yi Li
Secondly, the PQL is tightly related to other approximation approaches, such as the Laplace method and the Solomon-Cox method, which have also received much attention. The Laplace method (see, e.g. Liu and Pierce (1993)) differs from the PQL only in that the former obtains Z((3,8) by maximizing the integrand e-K(Y n,Z) with (3 and 8 being held fixed, and subsequently estimates (13,0) by jointly maximizing
On the other hand, with the assumption of E(ZIX n ) = 0, the Solomon-Cox technique approximates the integral J f(Y nlXn, Z)f(ZIXn)dZ by expanding the integrand f(Y nlXn, Z) around Z = 0; see Solomon and Cox (1992). In summary, none of these approximate methods produce consistent estimates, with exception in some special cases, e.g. normal data. Moreover, as these methods are essentially normal approximation-based, they typically do not perform well for sparse data, e.g. for binary data, and when the cluster size is relatively small (Lin and Breslow, 1996). Nevertheless, they provide a much needed alternative, especially given that full likelihood approaches are not always feasible for spatial data.
4
Spatial models for censored outcome data
Biomedical and epidemiological studies have spawned an increasing interest in and practical need for developing statistical methods for modeling time-to-event data that are subject to spatial dependence. Little work has been done in this area. Li and Ryan (2002) proposed a class of spatial frailty survival models. A further extension accommodating time-varying and nonparametric covariate effects, namely geoadditive survival model, was proposed by Hennerfeind et al. (2006). However, the regression coefficients of these frailty models do not have an easy population-level interpretation, less appealing to practitioners. In this section, we focus on a new class of semiparametric likelihood models recently developed by Li and Lin (2006). A key advantage of this model is that observations marginally follow the Cox proportional hazard model and regression coefficients have a population level interpretation and their joint distribution can be specified using a likelihood function that allows for flexible spatial correlation structures. Consider in a geostatistical setting a total of n subjects, who are followed up to event (e.g. death or onset of asthma) or being censored, whichever comes first. For each individual, we observe a q x 1 vector of covariates X, and an observed event time T = min(T, U) and a non-censoring indicator b = I(T :::;; U), where T and U are underlying true survival time and censoring time respectively, and 1(·) is an indicator function. We assume noninformative censoring, i.e., the censoring time U is independent of the survival time T given the observed covariates, and the distribution of U does not involve parameters of the true survival mode1. The covariates X are assumed to be a predictable time-dependent (and spacedependent) process. Also documented is each individual's geographic location Si.
Chapter
4
Modeling and Analysis of Spatially Correlated Data
89
Denote by X(t) = (X(s) : 0 セ@ s セ@ t) the X-covariate path up to time t. We specify that the survival time T marginally follows the Cox model
A{tIX(t)} = Ao(t)'IjJ{X(t),,8}
(4.1)
where 'IjJ{., .} is a positive function, ,8 is a regression coefficient vector and Ao (t) is an unspecified baseline hazard function. A common choice of 'IjJ is the exponential function, in which case, 'IjJ{X(t) , ,8} = exp{,8'X(t)}, corresponding to the Cox proportional hazards model discussed in Li and Lin (2006). This marginal model refers to the assumption that the hazard function (4.1) is with respect to each individual's own filtration, F t = O"{ICT セ@ s,8 = l),I(T セ@ s), Xes), 0 セ@ s セ@ t}, the sigma field generated by the survival and covariate paths up to time t. The regression coefficients ,8 hence have a population-level interpretation. Use subscript i to flag each individual. A spatial joint likelihood model for T I ,··· ,Tn is to be developed, which allows Ti to marginally follow the Cox model (4.1) and allows for a flexible spatial correlation structure among the T/s. Denote by Ai(t) = Ai(sIXi)ds the cumulative hazard and Ao(t) = Ao(s)ds the cumulative baseline hazard. Then Ai(Ti) marginally follows a unit exponential distribution, and its pro bit-type transformation
J;
J;
Tt
=
- 9 0 } is asymptotic normal with mean zero and a covariance matrix that can be easily estimated using a sandwich estimator. The results are formally stated in the following Proposition and can be proved along the line of Li and Lin (2006), which focused on the proportional hazards models.
Proposition Assume the true 9 0 is an interior point of an compact set, say, 13 x A E Rq+k, where q is the dimension of (3 and k is the dimension of 0::. When n is sufficiently large, the estimating equation G n (9) = 0 has a unique solution in a neighborhood of 9 0 with probability tending to 1 and the resulting estimator is consistent for 9 0 . Furthermore, v'np:;(2)}-1/2:E{(,B,a)'- ({3o,O::o)'}!!:.; MV N { 0q+k, I}, where I is an identity matrix whose dimension is equal to that of 9 0 , and
e
It follows that the covariance of
e can be estimated in finite samples by
1 セMQHRI@ I;;: =:E:E where セ@
and セHRI@
サセMャスG@
(4.8)
:E
are estimated by replacing U uv (-) by U uv (-) and evaluated at
80.
Although each E { UUl,Vl HYPIuセRLカ@ (9 0)} could be evaluated numerically, the total number of these calculations would be prohibitive, especially when the セHRI@
sample size m is large. To numerically approximate:E ,one can explore the resampling techniques of Carlstein (1986) and Sherman (1996). Specifically, under the assumption of n x E サgョセス@ -> :Eoo ,
:Eoo can be estimated by averaging K randomly chosen subsets of size 1,··· ,K) from the n subjects as
nj
(j =
where Gnj is obtained by substituting 9 with 8 in G nj • The nj is often chosen to be proportional to n so as to capture the spatial covariance structure. For
Yi Li
94
practical utility, Li and Lin recommended to choose nj to be roughly 1/5 of the total population. Given the estimates iS oo and is, the covariance of can be セMQ@
セ@
セMQ@
e
estimated by セ@ [lin x セッャH@ )'. To estimate the covariance matrix of the estimates arising from the penalized estimator obtained by solving gセH・I@ = 0, is is replaced by is - セョN@ A similar procedure was adopted by Heagerty and Lele (1998) for spatial logistic regression.
4.3
A data example: east boston asthma study
Li and Lin reported the application of the proposed method to analyze the East Boston Asthma study, focusing on assessing how the familial history of asthma may have attributed to disparity in disease burden. In particular, this study was to establish the relationship between the Low Respiratory Index (LRI) in the first year of life, ranging from 0 to 16, with high values indicating worse respiratory functioning, and age at onset of childhood asthma, controlling for maternal asthma status (MEVAST), coded as l=ever had asthma and O=never had asthma, and log-transformed maternal cotinine levels (LOGMCOT). This investigation would help better understand the natural history of asthma and its associated risk factors and to develop future intervention programs. Subjects were enrolled at community health clinics throughout the east Bost on area, with questionnaire data collected during regularly scheduled well-baby visits. The ages at onset of asthma were identified through the questionnaires. Residential addresses were recorded and geocoded, with geographic distance measured in the unit of kilometer. A total of 606 subjects with complete information on latitude and longitude were included in the analysis, with 74 events observed at the end of the study. The median followup was 5 years. East Boston is a residential area of relatively low income working families. Participants in this study were largely white and hispanic children, aging from infancy to 6 years old. Asthma is a disease strongly affected environmental triggers. Since the children living in adjacent locations might have had similar backgrounds and living environments and, therefore, were exposed with similar unmeasured similar physical and social environments, their ages at onset of asthma were likely to be subject to spatial correlation. The age at onset of asthma was assumed to marginally follow a Cox model
A(t) = Ao(t) exp{,BL x LRI +,BM x MEVAST +,Be x LOGMCOT},
(4.9)
while the Matern model (2.1) was assumed for the spatial dependence. Evidently, betaL,,BM and ,Be measured the impact of main covariates and have populationlevel interpretations. The regression coefficients and the correlation parameters were estimated using the spatial semi parametric estimating equation approach, and the associated standard error estimates were computed using (4.8). To check the robustness of the method, Li and Lin varied the smoothness parameter 1I in (2.1) to be 0.5, 1 and 1.5. As the East Boston Asthma Study was conducted in a fixed region, to examine the performance of the variance estimator in (4.8), which was developed
Chapter
4
Modeling and Analysis of Spatially Correlated Data
95
under the increasing-domain-asymptotic, Li and Lin calculated the variance using a 'delete-a-block' jackknife method (see, e.g. Kott (1998)). Specifically, they divided the samples into B nonoverlapping blocks based on their geographic proximity and then formed B jackknife replicates, where each replicate was formed by deleting one of the blocks from the entire sample. For each replicate, the estimates based on the semiparametric estimating equations were computed, and the jackknife variance was formulated as
Varjackknife
=
セ@
B-1
B
A
2)8 j
A
-
A
8)(8 j
A
-
8)'
(4.10)
j=l
where 8 j was the estimate produced from the jackknife replicate with the jth 'group' deleted and 8 was the estimate based on the entire population. In their calculation, B was chosen to be 40, which appeared large enough to render a reasonably good measure of variability. This jackknife scheme, in a similar spirit of Carlstein (1986, 1988), treated each block approximately independent and seemed plausible for this data set, especially in the presence of weak spatial dependence. Loh and Stein (2004) termed this scheme as the splitting method and found it work even better than more complicated block-bootstrapping methods (e.g. Kunsch, 1989; Liu and Singh, 1992; Politis and Romano, 1992; Bulhmann and Kunsch, 1995). Other advanced resampling schemes for spatial data are also available, e.g double-subsampling method (Lahiri et al., 1999; Zhu and Morgan, 2004) and linear estimating equation Jackknifing (Lele, 1991), but are subject to much more computational burden compared with the simple jackknife scheme we used. Their results are summarized in the following table, with the large sample standard errors (SEa) computed using the method described in Section 4.3 and the Jackknife standard errors (SEj) computed using (4.10). 1/ = 0.5 Parameters Estimate SEa 0.3121 0.0440 (h 13M 0.2662 0.3314 f3c 0.0294 0.1394 (Y2 1.68E-3 9.8&3 4.974 2.2977 (
1/=1 1/ = 1.5 SEj Estimate SEa SEj SEJ. Estimate SEa 0.0357 0.3118 0.0430 0.0369 0.3124 0.0432 0.0349 0.3222 0.2644 0.3289 0.3309 0.2676 0.3283 0.3340 0.1235 0.02521 0.1270 0.1063 0.0277 0.1288 0.1083 0.0127 0.74E-3 5.0E-3 7.1&3 0.72E-3 5.5E-3 4.8&3 3.708 2.1917 4.7945 4.1988 1.8886 6.5005 5.01617
The estimates of the regression coefficients and their standard errors were almost constant with various choices of the smoothness parameter 1/ and indicated that the regression coefficient estimates were not sensitive to the choice of 1/ in this data set. The standard errors obtained from the large sample approximation and the Jackknife method were reasonably similar. Low respiratorl index was highly significantly associated with the age at }>nset of asthma, e.g. f3L = 0.3121 (SEa = 0.0440, SEj = 0.0357) when 1/ = 0.5; f3L = 0.3118 (SEa = 0.0430, SEj = 0.0369) when 1/ = 1.0; fJL = 0.3124 (SEa = 0.0432, SEj = 0.0349) when 1/ = 1.5, indicating that a child with a poor respiratory functioning was more likely to
Yi Li
96
develop asthma, after controlling for maternal asthma, maternal cotinine levels and accounting for the spatial variation. No significant association was found between ages at onset of asthma and maternal asthma and cotinine levels. The estimates of the spatial dependence parameters, (J'2 and ( varied slightly with the choices of 1/. The scale parameter (J'2 corresponds to the partial sill and measures the correlation between subjects in close geographic proximity. Thi analysis showed that such a correlation is relatively small. The parameter ( measures global spatial decay of dependence with the spatial distance (measured in kilometers). For example, when 1/ = 0.5, i.e., under the exponential model, ( = 2.2977 means the correlation decays by 1- exp( - 2.2977 xl) セ@ 90% for everyone kilometer increase in distance.
5
Concluding remarks
This chapter has reviewed the methodologies for the analysis of spatial data within the geostatistical framework. We have dealt with data that consist of the measurements at a finite set of locations, where the statistical problem is to draw inference about the spatial process, based on the partial realization over this subset of locations. Specifically, we have considered using linear mixed models and generalized linear models that enable likelihood inference for fully observable spatial data. The fitting of such models by using maximum likelihood continues to be complicated owing to intractable integrals in the likelihood. In addition to the methods discussed in this chapter, there has been much research on the topic since the last decade, including Wolfinger and O'Connell (1993), Zeger and Karim (1993), Diggle et al. (1998), Booth and Hobert (1999). We have also reviewed a new class of semi parametric normal transformation model for spatial survival data that was recently developed by Li and Lin (2006). A key feature of this model is that it provides a rich class of models where regression coefficients have a population-level interpretation and the spatial dependence of survival times is conveniently modeled using flexible normal random fields, which is advantageous given that there are virtually none spatial failure time distributions that are convenient to work with. Several open problems, however, remain to be investigated for this new model, including model diagnostics (e.g. examine the spatial correlation structure for censored data), prediction (e.g. predict survival outcome for new locations) and computation (e.g. develop fast convergent algorithms for inference). Lastly, as this chapter tackles geostatistical data mainly from the frequentist points of view, we have by-passed the Bayesian treatments, which have been, indeed, much active in the past 20 years. Interested readers can refer to the book of Banerjee et al. (2004) for a comprehensive review of Bayesian methods.
References [1] Abramowitz M., and Stegun 1. A. (Editors)(1965), Handbook of Mathematical Functions, Dover Publications, New York.
Chapter
4
Modeling and Analysis of Spatially Correlated Data
97
[2] Banerjee, S. and Carlin, B. P (2003), Semiparametric Spatio-temporal Frailty Modeling, Environmetrics, 14 (5), 523-535. [3] Banerjee, S., Carlin, B. P. and Gelfand, A. E (2004), Hierarchical Modeling and Analysis for Spatial Data, Chapman and Hall/eRC Press, Boca Raton. [4] Besag, J., York, J. and Mollie, A. (1991), Bayesian Image Restoration, With Two Applications in Spatial Statistics, Annals of the Institute of Statistical Mathematics, 43, 1-20. [5] Breslow, N. E. and Clayton, D. G. (1993), Approximate Inference in Generalized Linear Mixed Models, Journal of the American Statistical Association, 88, 9-25. [6] Brook, D. (1964), On the Distinction Between the Conditional Probability and the Joint Probability Approaches in the Specification of NearestNeighbour Systems, Biometrika, 51, 481-483. [7] Bulhmann, P. and Kunsch, H. (1995), The Blockwise Bootstrap for General Parameters of a Stationary Time Series, Scandinavian Journal of Statistics, 22, 35-54. [8] Carlin, B. P. and Louis, T. A. (1996), Bayes and empirical Bayes methods for data analysis, Chapman and Hall Ltd, London. [9] Carlstein, E. (1986), The Use of Subseries Values for Estimating the Variance of a General Statistic from a Stationary Sequence, The Annals of Statistics, 14, 1171-1179. [10] Carlstein, E. (1988), Law of Large Numbers for the Subseries Values of a Statistic from a Stationary sequence, Statistics, 19, 295-299. [11] Clayton, D. and Kaldor, J. (1987), Empirical Bayes Estimates of Agestandardized Relative Risks for Use in Disease Mapping, Biometrics, 43, 671681. [12] Cressie, N. (1993), Statistics for Spatial Data, Wiley, New York. [13] del Pino, G. (1989), The unifying role of iterative generalized least squares in statistical algorithms (C/R: p403-408) Statistical Science, 4, 394-403. [14] A. P. Dempster, N. M. Laird, D. B. Rubin(1977), Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society. B, 39 1-22. [15] Kammann, E. E. and Wand, M. P. (2003), Geoadditive models. Journal of the Royal Statistical Society, Series C (Applied Statistics), 52, 1-18. [16] P. J. Green(1987), Penalized likelihood for general semi-parametric regression models, International Statistical Review, 55 (1987) 245-259. [17] Haining, R, Griffith, D. and Bennett, R (1989), Maximum likelihood estimation with missing spatial data and with an application to remotely sensed data, Communications in Statistics: Theory and Methods, 1875-1894. [18] Harville, D. A. (1974), Bayesian inference for variance components using only error contrasts, Biometrika, 61, 383-385. [19] Heagerty, P. J. and Lele, S. R (1998), A Composite Likelihood Approach to Binary Spatial Data, Journal of the American Statistical Association, 93, 1099-1111. [20] Hennerfeind, A., Brezger, A. and Fahrmeir, L. (2006), Geoadditive Survival Models, Journal of the American Statistical Association, Vol. 101, 1065-1075.
98
Yi Li
[21] Host, G. (1999), Kriging by local polynomials, Computational Statistics & Data Analysis, 29, 295-312. [22] Journel, A. G. (1983), Geostatistics, Encyclopedia of Statistical Sciences (9 vols. plus Supplement), 3,424-431. [23] Kott, P. S. (1998), Using the Delete-a-group Jackknife Variance Estimator in Practice, ASA Proceedings of the Section on Survey Research Methods, 763-768. [24] Lahiri, S. N., Kaiser, M. S., Cressie, N. and Hsu, N. (1999), Prediction of Spatial Cumulative Distribution Functions Using Subsampling (C/R: p97110), Journal of the American Statistical Association, 94, 86-97. [25] N. M. Laird, J. H. Ware(1982), Random-effects models for longitudinal data, Biometrics, 38, 963-974. [26] Y. Lee, J. A. Nelder(1996), Hierarchical Generalized Linear Models, J.R. Statist. Soc. B, 58, 619-678. [27] Le1e, S. (1991), Jackknifing Linear Estimating Equations: Asymptotic Theory and Applications in Stochastic Processes, Journal of the Royal Statistical Society, Series B, 53, 253-267. [28] Li, Y. and Ryan, L. (2002), Modeling spatial survival data using semiparametric frailty models, Biometrics, 58, 287-297. [29] Li, Y. and Lin, X. (2006), Semiparametric Normal Transformation Models for Spatially Correlated Survival Data, Journal of the American Statistical Association, 101, 591-603. [30] X. Lin, N. E. Breslow(1995), Bias Correction in generalized linear mixed models with multiple components of dispersion, Journal of the American Statistical Association, 91, 1007-1016. [31] Q. Liu, D. A. Pierce(1993), Heterogeneity in Mantel-Haenszel-type models Biometrika. 80, 543-556. [32] Liu, R. and Singh, K. (1992), Moving Blocks Jackknife and Bootstrap Capture Weak Dependence, Exploring the Limits of Bootstrap. LePage, Raoul (ed.) and Billard, Lynne (ed.) 225-248. [33] Loh, J. M. and Stein, M. L. (2004), Bootstrapping a Spatial Point Process, Statistica Sinica, 14, 69-101. [34] P. McCullagh, J. A. Nelder(1989), Generalized Linear Models, 2nd deition. Chapman and Hall, London. [35] J. A. NeIder, R. W. Wedderburn(1972), Generalized linear models, J. R. Statist. Soc. A, 135, 370-384. [36] D. Nychka and N. Saltzman (1998), Design of Air-Quality Monitoring Networks. Case Studies in Environmental Statistics, Lecture Notes in Statistics ed. Nychka, D., Cox, L. and Piegorsch, W. , Springer Verlag, New York. [37] Opsomer, J. D., Hossjer, 0., Hoessjer, 0., Ruppert, D., Wand, M. P., Holst, U. and Hvssjer, O. (1999), Kriging with nonparametric variance function estimation, Biometrics, 55, 704-710. [38] Paciorek, C. J. (2007), Computational techniques for spatial logistic regression with large datasets, Computational Statistics and Data Analysis, 51, 36313653. [39] Politis, D. N. and Romano, J. P. (1992), A General Resampling Scheme for
Chapter
[40]
[41] [42] [43] [44] [45]
[46] [47] [48]
4
Modeling and Analysis of Spatially Correlated Data
99
Triangular Arrays of a-mixing Random Variables with Application to the Problem of Spectral Density Estimation, The Annals of Statistics, 20, 19852007. Prentice, R. L. and Cai, J. (1992), Covariance and Survivor Function Estimation Using Censored Multivariate Failure Time Data, Biometrika, 79, 495-512. Sherman, M. (1996), Variance estimation for statistics computed from spatial lattice data, Journal of the Royal statistical Society,Series B, 58, 509-523. Stein, M. L. (1999), Interpolation of Spatial Data: Some Theory of Kriging, Springer, New York. R. Schall(1991), Estimation in generalized linear models with random effects, Biometrika, 78, 719-727. P. J. Solomon, D. R. Cox(1992), Nonlinear component of variance models, Biometrika, 79, 1-1l. Waller, L. A., Carlin, B. P., Xia, H. and Gelfand, A. E. (1997), Hierarchical Spatio-temporal Mapping of Disease Rates, Journal of the American Statistical Association, 92, 607-617. P. J. Diggle, R. A. Moyeed and J. A. Tawn(1998), Model-based Geostatistics. Applied Statistics, 47, 299-350. zhanghao H. Zhang (2002), On Estimation and Prediction for Spatial Generalized Linear Mixed Models Biometrics, 58, 129-136. Zhu, J. and Morgan, G. D. (2004), Comparison of Spatial Variables Over Subregions Using a Block Bootstrap, Journal of Agricultural, Biological, and Environmental Statistics, 9, 91-104.
This page intentionally left blank
Part II
Statistical Methods for Epidemiology
This page intentionally left blank
Chapter 5 Study Designs for Biomarker-Based Treatment Selection Amy Laird * Xiao- H ua Zhou t Abstract Among patients with the same clinical disease diagnosis, response to a treatment is often quite heterogeneous. For many diseases this may be due to molecular heterogeneity of the disease itself, which may be measured via a biomarker. In this paper, we consider the problem of evaluating clinical trial designs for drug response based on an assay of a predictive biomarker. In the planning stages of a clinical trial, one important consideration is the number of patients needed for the study to be able to detect some clinically meaningful difference between two treatments. We outline several trial designs in terms of the scientific questions each one is able to address, and compute the number of patients required for each one. We exhibit efficiency graphs for several special cases to summarize our results.
Keywords: Biomarker; predictive biomarker; treatment selection; clinical trial design; validation; predictive value.
1
Introduction
A group of patients with the same clinical disease and receiving the same therapy may exhibit an enormous variety of response. Genomic technologies are providing evidence of the high degree of molecular heterogeneity of many diseases. A molecularly targeted treatment may be effective for only a subset of patients, and ineffective or even detrimental to another subset, due to this molecular heterogeneity. Potential side effects, as well as time wasted with ineffective drugs, can create a substantial cost to the patient. The ability to predict which treatment would give the best result to a patient would be universally beneficial. We are interested in evaluating clinical trial designs for drug response based on an assay of a biomarker. We consider the design of a phase III randomized clinical trial to compare the performance of a standard treatment, A, and an *Department of Biostatistics, University of Washington, Seattle, WA, USA. E-mail: [email protected] tDepartment of Biostatistics, University of Washington, Seattle, WA, USA. HSR&D Center of Excellence, VA Puget Sound Health Care System, Seattle, WA, USA. E-mail: azhou@u. washington.edu
103
104
Amy Laird, Xiao-Hua Zhou
alternative and possibly new treatment, B. In the case of stage II colorectal cancer, which will be used as an example in our discussion of the study designs [1], treatment A consists of resection surgery alone, while treatment B is resection surgery plus chemotherapy. We assume that patients can be divided into two groups based on an assay of a biomarker. This biomarker could be a composite of hundreds of molecular and genetic factors, for example, but in this case we suppose that a cutoff value has been determined that dichotomizes these values. In our example the biomarker is the expression of guanylyl cyclase C (GCC) in the lymph nodes of patients. We assume that we have an estimate of the sensitivity and specificity of the biomarker assay. The variable of patient response is taken to be continuous-valued; it could represent a measure of toxicity to the patient, quality of life, uncensored survival time, or a composite of several measures. In our example we take the endpoint to be three-year disease recurrence. We consider five study designs, each addressing its own set of scientific questions, to study how patients in each marker group fare with each treatment. Although consideration of which scientific questions are to be addressed by the study should supersede consideration of necessary sample size, we give efficiency comparisons here for those cases in which more than one design would be appropriate. One potential goal is to investigate how treatment assignment and patient marker status affect outcome, both separately and interactively. The marker under consideration is supposedly predictive: it modifies the treatment effect. We may want to verify its predictive value and to assess its prognostic value, that is, how well it divides patients receiving the same treatment into different risk groups. Each study design addresses different aspects of these over arching goals. This paper is organized as follows: 1. Definition of study designs 2. Test of hypotheses 3. Sample size calculation 4. Numerical comparison of efficiency 5. Conclusions
2
Definition of study designs
The individual study designs are as follows.
2.1
Traditional design
To assess the safety and efficacy of the novel treatment, the standard design (Fig.l) is to register patients, then randomize them with ratio K, to receive treatment A or B. We compare the response variable across the two arms of the trial without regard for the marker status of the patients. In our example, we would utilize this design if we wanted only to compare the recurrence rates of colorectal cancer in the two treatment groups independent of each patient's biomarker status. Without the marker information on each patient we can address just one scientific question:
Chapter 5 Study Designs for Biomarker-Based Treatment Selection
105
; / treatment A register - - - - - - - - -.... イ。ョ、ッュゥコ・HエセI@
セ@
treatment B
Figure 1: The traditional design
• How does the outcome variable compare across treatment groups A and B?
2.2
Marker by treatment interaction design
The next three designs presented have the goal of validating the predictive value of the biomarker. In the marker by treatment interaction design (Fig. 2) we register patients and test marker status, then randomize patients with ratio Ii in each marker subgroup to receive treatment A or B. By comparing the outcomes for different marker subgroups in a given treatment group, the prognostic value of the marker in each treatment group can be measured. By examining the treatment effect in a given marker group, we can measure the predictive value of the marker with respect to the treatments.
+_
randomize Hイ。エゥッセI@
/ register ----.test marker
セ@
V
treatment A
1
セ@
treatment B
セ@
treatment A
- _randomize Hイ。エゥッセI@ 1
セエイ・。ュョb@
Figure 2: The marker by treatment interaction design
In our example, suppose we want to be able to look at the treatment effect within each marker group, as well as the effect of marker status on recurrence in each treatment group. In this example this design yields exactly the four outcomes we want to know: patient response in each treatment group in each of the two marker subgroups. In the case of just two marker subgroups, this design gives the clearest and simplest output. In the case of many marker subgroups or many treatments, it may not be feasible or even ethical to administer every treatment to every marker group, and this design may give an unnecessarily large amount of output that is difficult to interpret. In general this design can address the scientific questions: • Does the treatment effect differ across marker groups? • Does marker-specific treatment improve outcome? That is, how does treatment affect outcome within each marker group?
106
Amy Laird, Xiao-Hua Zhou
2.3
Marker-based strategy design
To address the issue of feasibility, we may wish to compare the marker-based and non-marker-based approaches directly. In the marker-based strategy design (Fig. 3) we register patients and randomize them with ratio A to either have a treatment assignment based on marker status, or to receive treatment A regardless of marker status. By comparing the overall results in each arm of the trial, the effectiveness of the marker-based treatment strategy relative to the standard treatment can be assessed. This comparison will yield the predictive value of the marker. Note that this design could not possibly illuminate the case in which treatment B were better for marker-positive and marker-negative patients alike, as marker-negative patients never receive treatment B.
/+ B
Y
\-__
marker-based strategy-test marker
register M⦅セ@
A
randomize (ratio>.. )
セ@
non-marker-based strategy _ _ _ A
Figure 3: The marker-based strategy design
Returning once again to our example, one study has demonstrated a strong tendency for patients who relapsed within three years to be marker-positive, and for those without relapse for at least six years to be marker-negative [2]. Given the extreme toxicity of chemotherapy treatment, it may be unethical in some cases to administer treatment B to marker-negative patients. In this case, and in similar cases, it may be wise to use the marker-based strategy design. This design can address the scientific questions: • How does the marker-based approach compare to the standard of care? (validation) • Secondarily, do marker-positive patients do better with novel treatment or standard of care?
2.4
Modified marker-based strategy design
Although the direct design presents a reasonable way to compare the marker-based approach with the standard of care, it does not allow us to assess the markertreatment interaction. We present a modification to this design that allows us to calculate this interaction. In the modified marker-based strategy design (FigA) we register patients and test marker status, then randomize the patients with ratio A (in the "first" randomization) to either have a treatment assignment based on marker status, or to undergo a second randomization to receive one of the two treatments. By comparing the overall results in each arm of the trial, the
Chapter 5
Study Designs for Biomarker-Based Treatment Selection
107
effectiveness of the targeted (marker-based) treatment strategy relative to the untargeted (non marker-based) strategy can be assessed. If we test marker status in all patients, this comparison can yield both the prognostic and predictive value of the marker. /
7 ,
/
marker-based strategy
treatment A
セ@
treatment B
register_test marker------.randomize (ratio.x)
セ@ 1
セ@
+ -randomizeY
/
non-marker-based strategy
'" "'" - _
(ratio K) 1 セ@
. y randomize (ratio K) ャセ@
treatment A treatment B treatment A treatment B
Figure 4: The modified marker-based strategy design
This design can address the following scientific questions: • How does the marker-based approach compare to the non-marker-based approach? (validation) • What is the prognostic value of the marker? • Secondarily, we can assess the marker-treatment interaction.
2.5
Targeted design
The goal of the traditional design is to assess the safety and efficacy of the novel treatment, but with a marker assay available, there are other possibilities for a trial of this type. For reasons of ethics or interest, we may wish to exclude marker-negative patients from the study. In the targeted design (Fig.5) we register patients and test marker status, then randomize the marker-positive patients to treatments B and A with ratio K,. By comparing the results in each arm of this trial, the performance of treatment B relative to treatment A can be assessed in marker-positive patients. We must therefore screen sufficiently many patients to attain the desired number of marker-positive patients. If the marker prevalence is low, the number of patients we need to screen will be very high. This trial design if very useful when the mechanism of action of treatment B is well-understood. In our example, suppose there were a treatment that had been demonstrated to have a significant positive effect on response for marker-positive patients, and no significant effect for marker-negative patients. We could employ the targeted design as a confirmatory study among marker-positive patients to better characterize the treatment effect in these patients. This design can address the scientific question: • What is the treatment effect among marker-positive patients?
Amy Laird, Xiao-Hua Zhou
108
/+
yB ""A
register セエ・ウ@
セ@ randomize (ratio ")"
I
marker
セ@
_
M⦅セ@
not included in trial
Figure 5: The targeted design
3
Test of hypotheses and sample size calculation
There are various types of hypothesis tests that we can use in a trial, and the choice of hypothesis test depends on the question in which we are interested. Using a hypothesis test we can find the number of patients that must be included in each arm of the study to detect a given clinically meaningful difference 8 in mean outcome between arms 1 and 2 of the trial at the significance level a we require and at a given level of power 1- (3. Calculation of this number, n, depends on the type of hypothesis that we are testing. We present four main types of hypothesis tests here. Let Xj and Wj be the response variables of the jth of n subjects each in part 1 or 2 of a trial, respectively, relative to the hypothesis test under consideration. We assume now that X and Ware independent normal random variables with respective means /-Lx and /-Lw, and respective variances O'r and oGセN@ Suppose that there are nl patients in part 1 of the trial and n2 in part 2. We let X and W denote the sample means of the response variable in each part of the trial:
We let $\varepsilon = \mu_W - \mu_X$ be the true mean difference in response between the two groups. So if a higher response value represents a better outcome, then $\varepsilon > 0$ indicates that arm 2 of the trial has had a better mean outcome than arm 1. For ethical reasons we may have unequal numbers of patients in each group, so we let $\kappa = n_1/n_2$. We consider four hypothesis tests to compare the means of X and W.
3.1 Test for equality
Suppose we want to know whether there is any difference in means between the two arms of the trial. We might have this question if we were interested in proving that there was a significant difference in the outcome variable between the two arms of the trial. We consider the hypotheses
$$H_0: \varepsilon = 0 \quad\text{versus}\quad H_a: \varepsilon \neq 0.$$
If we know or can estimate the values of the variances $\sigma_X^2$ and $\sigma_W^2$ from pilot studies, then we reject $H_0$ at significance level α if
$$\frac{|\bar{W}-\bar{X}|}{\sqrt{\sigma_X^2/n_1+\sigma_W^2/n_2}} > z_{1-\alpha/2},$$
where $z_{1-\alpha/2}$ denotes the $1-\alpha/2$-quantile of the standard normal distribution. If $\varepsilon = \varepsilon_a \neq 0$, then the power of the above test for rejecting the false null hypothesis is given approximately by
$$\Phi\left(\frac{|\varepsilon_a|}{\sqrt{\sigma_X^2/n_1+\sigma_W^2/n_2}} - z_{1-\alpha/2}\right).$$
We set the above expression in parentheses equal to $z_{1-\beta}$ and solve for $n_2$ in terms of $n_1$. Therefore the sample sizes necessary in each part of the trial to reject a false null hypothesis are given asymptotically by
$$n_1 = \kappa n_2 \quad\text{and}\quad n_2 = \frac{(z_{1-\alpha/2}+z_{1-\beta})^2\,(\sigma_X^2/\kappa + \sigma_W^2)}{\varepsilon_a^2}.$$
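As a concrete illustration of this formula (a minimal sketch added here, not part of the original text; the function name and default values are illustrative), the per-arm sample sizes can be computed as follows.

```python
from scipy.stats import norm

def n_equality(sigma_x2, sigma_w2, eps_a, alpha=0.05, power=0.80, kappa=1.0):
    """Per-arm sample sizes for the two-sample test of equality
    (normal approximation), with kappa = n1/n2."""
    z_a = norm.ppf(1 - alpha / 2)   # z_{1-alpha/2}
    z_b = norm.ppf(power)           # z_{1-beta}
    n2 = (z_a + z_b) ** 2 * (sigma_x2 / kappa + sigma_w2) / eps_a ** 2
    return kappa * n2, n2           # (n1, n2), rounded up in practice

# Example: unit variances, true difference 0.5, 1:1 randomization
print(n_equality(1.0, 1.0, 0.5))    # roughly (63, 63)
```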
3.2 Test for non-inferiority or superiority
Now suppose we want to know whether the mean of W falls below the mean of X by no more than a margin δ. We might have this question if we wanted to find out if a new, less toxic treatment were not inferior in efficacy to the current standard. We invoke the test for non-inferiority and consider the hypotheses
$$H_0: \varepsilon \leq \delta \quad\text{versus}\quad H_a: \varepsilon > \delta,$$
where $\delta < 0$. Rejection of the null hypothesis signifies non-inferiority of the mean of W as compared to the mean of X. In the test for superiority, we want to know if the mean of W is greater than the mean of X by a significant amount, so we can use these same hypotheses with $\delta > 0$. In this case, rejection of the null hypothesis indicates superiority of the mean of W as compared to the mean of X. If we know the variances $\sigma_X^2$ and $\sigma_W^2$, then we reject $H_0$ at significance level α if
$$\frac{\bar{W}-\bar{X}-\delta}{\sqrt{\sigma_X^2/n_1+\sigma_W^2/n_2}} > z_{1-\alpha}.$$
If $\varepsilon = \varepsilon_a > \delta$, then the power of the above test for rejecting a false null hypothesis is given by
$$\Phi\left(\frac{\varepsilon_a-\delta}{\sqrt{\sigma_X^2/n_1+\sigma_W^2/n_2}} - z_{1-\alpha}\right).$$
We set the above expression in parentheses equal to $z_{1-\beta}$ and solve for $n_2$ in terms of $n_1$. Therefore the sample sizes necessary in each part of the trial to reject a false null hypothesis are given asymptotically by
$$n_1 = \kappa n_2 \quad\text{and}\quad n_2 = \frac{(z_{1-\alpha}+z_{1-\beta})^2\,(\sigma_X^2/\kappa + \sigma_W^2)}{(\varepsilon_a-\delta)^2}.$$
3.3 Test for equivalence
Suppose we want to know whether the means of X and W differ in either direction by less than some significant amount δ. We might have this question if we were interested in proving that there is no significant difference in the outcome variable between the two arms of the trial. We use the test for equivalence and consider the hypotheses
$$H_0: |\varepsilon| \geq \delta \quad\text{versus}\quad H_a: |\varepsilon| < \delta.$$
Note that rejection of the null hypothesis indicates equivalence of the means of X and W. If we know the variances $\sigma_X^2$ and $\sigma_W^2$, then we reject $H_0$ at significance level α if
$$\frac{\bar{W}-\bar{X}-\delta}{\sqrt{\sigma_X^2/n_1+\sigma_W^2/n_2}} < -z_{1-\alpha} \quad\text{or}\quad \frac{\bar{W}-\bar{X}+\delta}{\sqrt{\sigma_X^2/n_1+\sigma_W^2/n_2}} > z_{1-\alpha}.$$
If $|\varepsilon| = |\varepsilon_a| < \delta$, then the power of the above test for rejecting a false null hypothesis is given by
$$2\,\Phi\left(\frac{\delta-|\varepsilon_a|}{\sqrt{\sigma_X^2/n_1+\sigma_W^2/n_2}} - z_{1-\alpha}\right) - 1.$$
So to find the sample size needed to achieve power 1 − β at $|\varepsilon| = |\varepsilon_a| < \delta$ we must solve
$$\frac{\delta-|\varepsilon_a|}{\sqrt{\sigma_X^2/n_1+\sigma_W^2/n_2}} - z_{1-\alpha} = z_{1-\beta/2},$$
and hence the sample sizes necessary in each part of the trial to reject a false null hypothesis are given asymptotically by
$$n_1 = \kappa n_2 \quad\text{and}\quad n_2 = \frac{(z_{1-\alpha}+z_{1-\beta/2})^2\,(\sigma_X^2/\kappa + \sigma_W^2)}{(\delta-|\varepsilon_a|)^2}.$$
By comparison with the formula for $n_2$ in the test for equality, we see that the roles of α and β have been reversed. Note that in the usual case, in which δ > 0 and 1 − β > α, this hypothesis test requires a greater sample size than do the other three.
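The reversal of the roles of α and β can be seen in a small numerical sketch (illustrative code, not from the original text; names and defaults are assumptions):

```python
from scipy.stats import norm

def n2_equivalence(sigma_x2, sigma_w2, delta, eps_a=0.0,
                   alpha=0.05, power=0.80, kappa=1.0):
    """Arm-2 sample size for the test of equivalence; z_{1-alpha} and
    z_{1-beta/2} appear where the equality test used z_{1-alpha/2} and z_{1-beta}."""
    z_a = norm.ppf(1 - alpha)             # z_{1-alpha}
    z_b = norm.ppf(1 - (1 - power) / 2)   # z_{1-beta/2}
    return (z_a + z_b) ** 2 * (sigma_x2 / kappa + sigma_w2) / (delta - abs(eps_a)) ** 2

# Unit variances, margin delta = 0.5, true difference 0: about 69 per arm,
# compared with about 63 per arm for the test of equality.
print(n2_equivalence(1.0, 1.0, 0.5))
```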
4 Sample size calculation
We use the test for equality to compute the number of patients necessary to include in each arm of the trial to reject a false null hypothesis of zero difference between the means of response in the two arms of the trial, with type I error α and type II error β. Here we derive this total number for each of the five designs under consideration, similar to [5], except that equal variances in each marker-treatment subgroup and 1:1 randomization ratios are not assumed. Similarly to the notation in the last section, let X be the continuous-valued response of a patient in the trial. Let Z be the treatment assignment (A or B) of a patient. Let D be the true (unmeasurable) binary-valued marker status of a patient, and let R be the result of the marker assay. We denote by $X_{ZR}$ the response of a subject with marker assay R who is receiving treatment Z. Let $\mu_{ZR}$ and $\sigma_{ZR}^2$ denote the mean and variance of response, respectively, among patients with marker assay R and receiving treatment Z. For ease of computation we let ρ denote a combination of Z- and R-values, which we use for indexing. Finally we let γ denote the proportion of people in the general population who are truly marker-negative, that is, $\gamma = P(D=0)$. To relate R and D, we let $\lambda_{sens} = P(R=1\mid D=1)$ denote the sensitivity of the assay in diagnosing marker-positive patients and $\lambda_{spec} = P(R=0\mid D=0)$ denote the specificity of the assay in diagnosing marker-negative patients. On the other hand, we let $\omega_+ = P(D=1\mid R=1)$ denote the positive predictive value of the assay, and $\theta_- = P(D=0\mid R=0)$ denote the negative predictive value. By Bayes' rule,
$$\omega_+ = P(D=1\mid R=1) = \frac{P(D=1, R=1)}{P(R=1)} = \frac{P(R=1\mid D=1)P(D=1)}{P(R=1\mid D=1)P(D=1) + P(R=1\mid D=0)P(D=0)} = \frac{\lambda_{sens}(1-\gamma)}{\lambda_{sens}(1-\gamma) + (1-\lambda_{spec})\gamma},$$
and similarly we have
$$\theta_- = P(D=0\mid R=0) = \frac{\lambda_{spec}\,\gamma}{(1-\lambda_{sens})(1-\gamma) + \lambda_{spec}\,\gamma}.$$
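For readers who want to experiment with these quantities, here is a small Python helper (an added sketch, not part of the original text; names are illustrative) that maps sensitivity, specificity, and the marker-negative proportion γ to the predictive values ω₊ and θ₋.

```python
def predictive_values(sens, spec, gamma):
    """Positive predictive value (omega_+) and negative predictive value (theta_-)
    of the marker assay, given sensitivity, specificity, and the proportion gamma
    of truly marker-negative people (marker prevalence is 1 - gamma)."""
    omega_plus = sens * (1 - gamma) / (sens * (1 - gamma) + (1 - spec) * gamma)
    theta_minus = spec * gamma / ((1 - sens) * (1 - gamma) + spec * gamma)
    return omega_plus, theta_minus

# Example: 80% sensitivity and specificity, 25% marker prevalence (gamma = 0.75)
print(predictive_values(0.8, 0.8, 0.75))  # approximately (0.571, 0.923)
```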
General formula for variance. To calculate the variance of patient response in one arm of a trial, we use the general formula
$$\mathrm{Var}(X) = \mathrm{Var}_\rho\big(E(X\mid\rho)\big) + E_\rho\big(\mathrm{Var}(X\mid\rho)\big),$$
where
$$\mathrm{Var}_\rho\big(E(X\mid\rho)\big) = E_\rho\big(E(X\mid\rho) - E(X)\big)^2 = \sum_k P(\rho=k)\big(E(X\mid\rho=k) - E(X)\big)^2$$
and
$$E_\rho\big(\mathrm{Var}(X\mid\rho)\big) = \sum_k P(\rho=k)\,\mathrm{Var}(X\mid\rho=k).$$

4.1 Traditional design
Note that since the traditional design does not involve a marker assay, the sample size calculation is independent of the assay properties. In the test for equality we test the null hypothesis of equality of means of patient response, $H_0: \mu_A = \mu_B$. The expected response in arm 1 of the trial, in which all patients receive treatment A, is given by
$$\mu_A = E(X\mid Z=A) = \mu_{A0}\,\gamma + \mu_{A1}(1-\gamma),$$
and similarly
$$\mu_B = E(X\mid Z=B) = \mu_{B0}\,\gamma + \mu_{B1}(1-\gamma).$$
We calculate the components of the variance of response using the general formula. We obtain
$$\tau_A^2 = \mathrm{Var}(X\mid Z=A) = \gamma(1-\gamma)(\mu_{A0}-\mu_{A1})^2 + \gamma\sigma_{A0}^2 + (1-\gamma)\sigma_{A1}^2$$
and
$$\tau_B^2 = \mathrm{Var}(X\mid Z=B) = \gamma(1-\gamma)(\mu_{B0}-\mu_{B1})^2 + \gamma\sigma_{B0}^2 + (1-\gamma)\sigma_{B1}^2.$$
See the Appendix for details of the calculations throughout this section. Therefore the numbers of patients needed in each arm of the trial to have a test of equality with type I error α and type II error β are given by
$$n_1 = \kappa n_2 \quad\text{and}\quad n_2 = \frac{(z_{1-\alpha/2}+z_{1-\beta})^2\,(\tau_A^2/\kappa + \tau_B^2)}{(\mu_B-\mu_A)^2},$$
and so the total number of patients necessary is $n = n_1 + n_2 = (\kappa+1)n_2$.
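Putting the pieces together for the traditional design (an illustrative sketch added here, not part of the original text; the dictionary-based interface is an assumption), the total sample size can be computed as:

```python
from scipy.stats import norm

def n_traditional(mu, sigma2, gamma, alpha=0.05, power=0.80, kappa=1.0):
    """Total sample size for the traditional design (test for equality).
    mu and sigma2 are dicts keyed by (treatment, marker status), e.g. mu[('A', 1)]."""
    z2 = (norm.ppf(1 - alpha / 2) + norm.ppf(power)) ** 2
    mu_A = mu[('A', 0)] * gamma + mu[('A', 1)] * (1 - gamma)
    mu_B = mu[('B', 0)] * gamma + mu[('B', 1)] * (1 - gamma)
    tau2_A = (gamma * (1 - gamma) * (mu[('A', 0)] - mu[('A', 1)]) ** 2
              + gamma * sigma2[('A', 0)] + (1 - gamma) * sigma2[('A', 1)])
    tau2_B = (gamma * (1 - gamma) * (mu[('B', 0)] - mu[('B', 1)]) ** 2
              + gamma * sigma2[('B', 0)] + (1 - gamma) * sigma2[('B', 1)])
    n2 = z2 * (tau2_A / kappa + tau2_B) / (mu_B - mu_A) ** 2
    return (kappa + 1) * n2

# Scenario 1 of Section 5: mu_B1 = 1, all other means 0, unit variances, prevalence 0.5
mu = {('B', 1): 1, ('B', 0): 0, ('A', 1): 0, ('A', 0): 0}
s2 = {k: 1 for k in mu}
print(n_traditional(mu, s2, gamma=0.5))
```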
4.2 Marker by treatment interaction design
Let $\nu_1$ denote the difference in means of response between treatments B and A within assay-positive patients, and $\nu_0$ denote the difference in means of response between the treatments within assay-negative patients. As the primary question that this design addresses is whether there is a marker-treatment interaction, the null hypothesis under consideration is $H_0: \nu_1 = \nu_0$. Since the quality of the assay determines in which group a given patient will be, the imperfection of the assay will have an impact on the sample size calculation in this design. The expected differences in response among assay-positive and assay-negative patients are given respectively by
$$\nu_1 = E(X\mid Z=B, R=1) - E(X\mid Z=A, R=1) = (\mu_{B1}-\mu_{A1})\,\omega_+ + (\mu_{B0}-\mu_{A0})(1-\omega_+)$$
and
$$\nu_0 = E(X\mid Z=B, R=0) - E(X\mid Z=A, R=0) = (\mu_{B1}-\mu_{A1})(1-\theta_-) + (\mu_{B0}-\mu_{A0})\,\theta_-.$$
Hence we have
$$\nu_1 - \nu_0 = (\omega_+ + \theta_- - 1)\big[(\mu_{B1}-\mu_{A1}) - (\mu_{B0}-\mu_{A0})\big].$$
We calculate the variance of response among assay-positive patients as
$$\tau_1^2 = \mathrm{Var}(X\mid Z=B, R=1) + \mathrm{Var}(X\mid Z=A, R=1),$$
by independence, and using the general formula
$$\mathrm{Var}(X\mid Z=B, R=1) = \mathrm{Var}_D\big(E(X\mid Z=B, R=1, D)\big) + E_D\big(\mathrm{Var}(X\mid Z=B, R=1, D)\big).$$
Hence the variances of patient response in each marker-treatment group are given by
$$\tau_{B1}^2 = (\mu_{B1}-\nu_{B1})^2\,\omega_+ + (\mu_{B0}-\nu_{B1})^2(1-\omega_+) + \sigma_{B1}^2\,\omega_+ + \sigma_{B0}^2(1-\omega_+),$$
$$\tau_{B0}^2 = (\mu_{B1}-\nu_{B0})^2(1-\theta_-) + (\mu_{B0}-\nu_{B0})^2\,\theta_- + \sigma_{B1}^2(1-\theta_-) + \sigma_{B0}^2\,\theta_-,$$
$$\tau_{A1}^2 = (\mu_{A1}-\nu_{A1})^2\,\omega_+ + (\mu_{A0}-\nu_{A1})^2(1-\omega_+) + \sigma_{A1}^2\,\omega_+ + \sigma_{A0}^2(1-\omega_+),$$
$$\tau_{A0}^2 = (\mu_{A1}-\nu_{A0})^2(1-\theta_-) + (\mu_{A0}-\nu_{A0})^2\,\theta_- + \sigma_{A1}^2(1-\theta_-) + \sigma_{A0}^2\,\theta_-,$$
where $\nu_{ZR} = E(X\mid Z, R) = \mu_{Z1}P(D=1\mid R) + \mu_{Z0}P(D=0\mid R)$ is the mean response under treatment Z among patients with assay result R. Therefore we have
$$\tau_1^2 = \tau_{B1}^2 + \tau_{A1}^2 = \big[(\mu_{B1}-\nu_{B1})^2 + (\mu_{A1}-\nu_{A1})^2\big]\omega_+ + \big[(\mu_{B0}-\nu_{B1})^2 + (\mu_{A0}-\nu_{A1})^2\big](1-\omega_+) + \big(\sigma_{B1}^2+\sigma_{A1}^2\big)\omega_+ + \big(\sigma_{B0}^2+\sigma_{A0}^2\big)(1-\omega_+),$$
and in the same way,
$$\tau_0^2 = \tau_{B0}^2 + \tau_{A0}^2 = \big[(\mu_{B1}-\nu_{B0})^2 + (\mu_{A1}-\nu_{A0})^2\big](1-\theta_-) + \big[(\mu_{B0}-\nu_{B0})^2 + (\mu_{A0}-\nu_{A0})^2\big]\theta_- + \big(\sigma_{B1}^2+\sigma_{A1}^2\big)(1-\theta_-) + \big(\sigma_{B0}^2+\sigma_{A0}^2\big)\theta_-.$$
In a large trial, if each patient screened for the trial is randomized, we may expect the proportion of assay-positive patients in the trial to reflect the proportion of assay-positive people in the general population. Alternatively, an investigator may wish to have balanced marker-status groups, in which case the prevalence of the marker in the trial is 0.50. In the second case the number of patients needed to be randomized in the trial is as given in [5], so that the total number needed is $n = n_1 + n_2 = 2n_2$.

4.3 Marker-based strategy design
Let $\nu_m$ denote the mean of response in the marker-based arm (M = 1), and $\nu_n$ denote the mean of response in the non-marker-based arm (M = 2). In the test for equality we test the null hypothesis $H_0: \nu_m = \nu_n$. Note that the imperfection of the assay will affect the assignment of patients to treatments in the marker-based arm, and hence the sample size. The mean of response in the marker-based arm is given by
$$\nu_m = E(X\mid M=1) = \big[\mu_{B1}\omega_+ + \mu_{B0}(1-\omega_+)\big]\big[\lambda_{sens}(1-\gamma) + (1-\lambda_{spec})\gamma\big] + \big[\mu_{A1}(1-\theta_-) + \mu_{A0}\theta_-\big]\big[1 - \lambda_{sens}(1-\gamma) - (1-\lambda_{spec})\gamma\big].$$
We observe that $\nu_n$ is $\mu_A$ from the traditional design: $\nu_n = E(X\mid M=2) = \mu_{A0}\gamma + \mu_{A1}(1-\gamma)$.
We calculate the variance of response in the marker-based arm using the general formula:
$$\tau_m^2 = \mathrm{Var}(X\mid M=1) = \mathrm{Var}_R\big[E(X\mid M=1, R)\big] + E_R\big[\mathrm{Var}(X\mid M=1, R)\big].$$
Then, using properties of the study design (assay-positive patients in this arm receive treatment B and assay-negative patients receive treatment A), we have
$$\mathrm{Var}_R\big[E(X\mid M=1, R)\big] = (\nu_{B1} - \nu_m)^2\,P(R=1) + (\nu_{A0} - \nu_m)^2\,P(R=0)$$
and
$$E_R\big[\mathrm{Var}(X\mid M=1, R)\big] = \tau_{B1}^2\,P(R=1) + \tau_{A0}^2\,P(R=0),$$
where $P(R=1) = \lambda_{sens}(1-\gamma) + (1-\lambda_{spec})\gamma$, $P(R=0) = (1-\lambda_{sens})(1-\gamma) + \lambda_{spec}\gamma$, and $\nu_{B1}$, $\nu_{A0}$, $\tau_{B1}^2$, $\tau_{A0}^2$ are as defined in the marker by treatment interaction design.
Now, we observe that the variance of response in the non-marker-based arm is $\tau_A^2$ from the traditional design:
$$\tau_n^2 = \tau_A^2 = \gamma(1-\gamma)(\mu_{A0}-\mu_{A1})^2 + \gamma\sigma_{A0}^2 + (1-\gamma)\sigma_{A1}^2.$$
Hence, the number of patients needed in each arm of the trial is
$$n_1 = \lambda n_2 \quad\text{and}\quad n_2 = \frac{(z_{1-\alpha/2}+z_{1-\beta})^2\,(\tau_m^2/\lambda + \tau_n^2)}{(\nu_m-\nu_n)^2},$$
and so the total number of patients needed is $n = n_1 + n_2 = (\lambda+1)n_2$.

4.4 Modified marker-based strategy design
We again let $\nu_m$ denote the mean of response in the marker-based arm (M = 1) and we now let $\nu_{nr}$ denote the mean of response in the non-marker-based arm (M = 2). In the test for equality we test the null hypothesis $H_0: \nu_m = \nu_{nr}$. Note that the imperfection of the assay will affect the assignment of patients to treatments in each of the arms, and hence the total sample size. Note that $\nu_m$ is the same as in the marker-based strategy design. We calculate the mean $\nu_{nr}$ of response in the non-marker-based arm to be
$$\nu_{nr} = E(X\mid M=2) = \frac{1}{\kappa+1}\big[(1-\gamma)(\mu_{B1} + \kappa\mu_{A1}) + \gamma(\mu_{B0} + \kappa\mu_{A0})\big],$$
and we calculate the variance of response in this arm using the general formula:
$$\tau_{nr}^2 = \mathrm{Var}(X\mid M=2) = \mathrm{Var}\big[E(X\mid M=2, Z)\big] + E\big[\mathrm{Var}(X\mid M=2, Z)\big],$$
which gives
$$\tau_{nr}^2 = \frac{1}{\kappa+1}\Big[\big[\mu_{B1}(1-\gamma)+\mu_{B0}\gamma\big]^2 + \kappa\big[\mu_{A1}(1-\gamma)+\mu_{A0}\gamma\big]^2 + \tau_B^2 + \kappa\tau_A^2\Big] - \nu_{nr}^2.$$
Hence, the number of patients needed in each arm of the trial is
$$n_1 = \lambda n_2 \quad\text{and}\quad n_2 = \frac{(z_{1-\alpha/2}+z_{1-\beta})^2\,(\tau_m^2/\lambda + \tau_{nr}^2)}{(\nu_m-\nu_{nr})^2},$$
so the total number of patients needed is $n = n_1 + n_2 = (\lambda+1)n_2$.

4.5 Targeted design
The means and variances of response in the targeted design are exactly $\nu_{B1}$, $\nu_{A1}$, $\tau_{B1}^2$, and $\tau_{A1}^2$ from the marker by treatment interaction design:
$$\nu_{B1} = \mu_{B1}\omega_+ + \mu_{B0}(1-\omega_+), \qquad \nu_{A1} = \mu_{A1}\omega_+ + \mu_{A0}(1-\omega_+),$$
$$\tau_{B1}^2 = (\mu_{B1}-\nu_{B1})^2\omega_+ + (\mu_{B0}-\nu_{B1})^2(1-\omega_+) + \sigma_{B1}^2\omega_+ + \sigma_{B0}^2(1-\omega_+),$$
$$\tau_{A1}^2 = (\mu_{A1}-\nu_{A1})^2\omega_+ + (\mu_{A0}-\nu_{A1})^2(1-\omega_+) + \sigma_{A1}^2\omega_+ + \sigma_{A0}^2(1-\omega_+).$$
To reject the null hypothesis $H_0: \nu_{B1} = \nu_{A1}$ using the test for equality, the number of patients needed in each arm is
$$n_1 = \kappa n_2 \quad\text{and}\quad n_2 = \frac{(z_{1-\alpha/2}+z_{1-\beta})^2\,(\tau_{A1}^2/\kappa + \tau_{B1}^2)}{(\nu_{B1}-\nu_{A1})^2},$$
and the total number necessary in the trial is $n = n_1 + n_2 = (\kappa+1)n_2$. Our sample size computations pertain only to the test of equality of means, and we would have obtained different formulae had we used a different type of hypothesis test.
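For comparison with the traditional design, the analogous calculation for the targeted design can be sketched as follows (illustrative code, not from the original text; names and defaults are assumptions). Dividing the traditional-design total by this total gives the relative efficiency examined in the next section.

```python
from scipy.stats import norm

def n_targeted(mu, sigma2, gamma, sens, spec, alpha=0.05, power=0.80, kappa=1.0):
    """Total sample size for the targeted design (test for equality among
    assay-positive patients). mu and sigma2 are dicts keyed by (treatment, marker)."""
    z2 = (norm.ppf(1 - alpha / 2) + norm.ppf(power)) ** 2
    # positive predictive value of the assay
    w_plus = sens * (1 - gamma) / (sens * (1 - gamma) + (1 - spec) * gamma)
    nu, tau2 = {}, {}
    for t in ('A', 'B'):
        nu[t] = mu[(t, 1)] * w_plus + mu[(t, 0)] * (1 - w_plus)
        tau2[t] = ((mu[(t, 1)] - nu[t]) ** 2 * w_plus
                   + (mu[(t, 0)] - nu[t]) ** 2 * (1 - w_plus)
                   + sigma2[(t, 1)] * w_plus + sigma2[(t, 0)] * (1 - w_plus))
    n2 = z2 * (tau2['A'] / kappa + tau2['B']) / (nu['B'] - nu['A']) ** 2
    return (kappa + 1) * n2

# Scenario 1 of Section 5, 25% marker prevalence, 80% sensitivity and specificity
mu = {('B', 1): 1, ('B', 0): 0, ('A', 1): 0, ('A', 0): 0}
s2 = {k: 1 for k in mu}
print(n_targeted(mu, s2, gamma=0.75, sens=0.8, spec=0.8))
```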
5 Numerical comparisons of efficiency
We calculated the efficiency of each "alternative" design relative to the traditional design with regard to the number of patients needed to be randomized for each design to have a test of equality with type I error rate α = 0.05 and power 1 − β = 0.80. We evaluated this quantity, the ratio of the number of patients in the traditional design to the number in the alternative design, as a function of the true prevalence (1 − γ) of the marker among patients in the population of interest (on the x-axis). In this calculation we considered various values of the sensitivity and specificity of the assay, and the size of the treatment effect for marker-negative patients relative to marker-positive patients. Specifically, we evaluated this quantity for each combination of sensitivity and specificity equal to 0.6, 0.8, and 1.0, and for the case in which there is no treatment effect in marker-negative patients and that in which the treatment effect for marker-negative patients is half that of the marker-positive patients. As in [3] and [5], there is no loss of generality in choosing specific values for the means of response. We present results with the following values:
Scenario 1: $\mu_{B1} = 1$, $\mu_{B0} = 0$, $\mu_{A1} = 0$, $\mu_{A0} = 0$, $\sigma_{B1}^2 = \sigma_{B0}^2 = \sigma_{A1}^2 = \sigma_{A0}^2 = 1$;
Scenario 2: $\mu_{B1} = 1$, $\mu_{B0} = 0.5$, $\mu_{A1} = 0$, $\mu_{A0} = 0$, $\sigma_{B1}^2 = \sigma_{B0}^2 = \sigma_{A1}^2 = \sigma_{A0}^2 = 1$.
We assumed here the variance of patient response was constant across all marker and treatment subgroups. Results shown in Figures 6-9 are alterations of those in
[5].
5.1 Marker by treatment interaction design
When there is no treatment effect among marker-negative patients, relative efficiency depends heavily on marker prevalence: for low prevalence, the interaction design is more efficient unless the assay is very poor, while for high prevalence, the traditional design is more efficient. When the treatment effect among marker-negative patients is half that of marker-positive patients, the interaction design
requires a very large number of patients, and the traditional design is much more efficient. Recall that in this calculation we have assumed balanced marker subgroups. Results are very similar if the proportion of marker-positive patients included in the trial reflects the proportion of marker-positive people we would find in the general population, as seen in [5].
5.2 Marker-based strategy design
When there is no treatment effect among marker-negative patients, we see that the traditional design is at least as efficient as the marker-based strategy design, and that the efficiency has very little dependence on the marker prevalence. When the assay is perfectly sensitive, the two designs require the same number of patients, regardless of the specificity. When the assay has imperfect sensitivity however, the traditional design requires fewer patients. On the other hand, when the treatment effect among marker-negative patients is half that of marker-positive patients, the traditional design requires fewer patients regardless of the properties of the assay, and the efficiency depends heavily on the marker prevalence. These results are not surprising since the treatment effect is diluted in the marker-based strategy design.
5.3 Modified marker-based strategy design
The modified marker-based strategy design is much less efficient than the traditional design in each of the situations in the simulation. When there is no treatment effect among marker-negative patients, marker prevalence has almost no bearing on the relative efficiency, while prevalence and efficiency have a more complex relationship in the case where the treatment effect among marker-negative patients is half that of marker-positive patients. As in the marker-based strategy design, the treatment effect is diluted in the modified marker-based strategy design relative to the traditional design.
5.4 Targeted design
The targeted design requires fewer patients to be randomized than the traditional design for every combination of prevalence, sensitivity, and specificity in each of the two scenarios. This result is what we might expect since the targeted design includes only those patients for whom we expect to see a large treatment effect. When there is no treatment effect in marker-negative patients, the relative efficiency gain for the targeted design is especially pronounced, particularly when the sensitivity and specificity of the assay are close to one. The efficiency gain for the targeted design is also greater when the true marker prevalence is low; when the prevalence is 100%, the two designs are identical, and very little efficiency is gained from the targeted design for a marker with a high prevalence in the population. When the treatment effect among marker-negative patients is half that of marker-positive patients, these effects are subdued due to the decreased
ability of the marker to divide patients into groups of sharply-differing treatment effect; the marker has smaller predictive value. Not surprisingly, there is very little efficiency gain for the targeted design when the assay is poor.
$\lambda_0(t)\,dt$ to be the partial likelihood score function (3.3) but with $\bar{z}(\beta)$ replaced by $E(Z\mid Y, D=1)$. In general, let O denote the set of all variables that are predictive of control sampling. It can include only D (Chen and Lo, 1999), or discrete phase I variables S and D (Borgan et al., 2000), stratified time intervals and D (Chen, 2001), or (Y, D) and both discrete and continuous phase I variables (Kulich and Lin, 2004; Qi, Wang, and Prentice, 2005). The covariance
matrix can be written as
$$\Sigma^{-1}\Big\{\Sigma + E\big[(1-D)\big(\pi_0^{-1}-1\big)\,\mathrm{var}(M_Z\mid O)\big]\Big\}\Sigma^{-1}, \qquad (3.7)$$
where Σ is the standard full cohort partial likelihood information matrix based on (3.2), and $\pi_0$ is the large sample limit of $n_0/N_0$. Thus, the variance of $\hat\beta$ depends on the covariance matrix of score influence terms within strata defined by O (Breslow and Wellner, 2007). The smaller $\mathrm{var}(M_Z\mid O)$ is, the more efficient an estimator is. This variance form thus immediately suggests effective approaches for improving the asymptotic efficiency of the estimation of β: one can enrich O in a way that the enriched O is more strongly correlated with $M_Z$. Of course one needs to be careful that the dimensionality of the enriched O is restricted by the sample size when analyzing real study data, because the estimated sampling probability based on the model $p(R=1\mid O)$ needs to be stable in order for the resultant $\hat\beta$ to perform well in a finite sample. When O has continuous components such as Y, Qi, Wang, and Prentice (2005) proposed to use a nonparametric smoothing function for $p(R=1\mid O)$. When O only contains a small number of discrete variables, the actual calculation of the asymptotic variance can be carried out largely by using standard software for fitting Cox proportional hazards models, as described in Samuelsen et al. (2007). Similar to the weighted approach for two-phase case-control studies, all the above weighted estimating equations for β can be seen as members of a class of augmented weighted estimating equations analogous to that for the two-phase case-control design (2.5) proposed by Robins et al. (1994). Define
where $E(Z\mid Y, D=1)$ is a suitable weighted estimator as in the second term in the bracket in equation (3.6). The following subset of this class of estimating equations actually encompasses all weighted estimators considered above:
(3.8) See section 2.1 for motivation to study this sub-class. The most efficient member corresponds to an O that includes all variables observed in phase I, (Y, D, S). When O has continuous components, the actual computation of the β estimate and the variance would require modeling of $p(Z\mid O)$. One can specify parametric models for these nuisance functions. If these models are mis-specified, the estimate of β would still be unbiased due to the "double robustness" feature of this class of estimating functions as discussed in section 2.1, although the efficiency will be penalized. Nonparametric modeling is another option but imposes a relatively heavy workload for data analysts. The complete class of augmented weighted estimating equations that includes all regular and asymptotically linear estimators for β is obtained by replacing $M_Z(Y, Z)$
in (3.8) (Robins et al., 1994; Nan, Emond, and Wellner, 2004) by $M_{(z,y)}(Y, Z)$, which is defined as
$$M_{(z,y)} = \int_0^\tau \big[h(Z, t) - E\{h(Z, t)\mid Y, D=1\}\big]\, I(Y \geq t)\, e^{\beta' Z}\, \lambda_0(t)\, dt.$$
Here h(Z, T) is a general function of Z and T satisfying certain regularity conditions. The estimator with the smallest variance in this class achieves the semiparametric efficiency bound. Nan (2004) studied this optimal estimator when O is discrete with a small number of possible values, the computation of which involved modeling of the censoring time distribution $p(C\mid Z)$. The NPMLE estimator (see section 3.1.3), when applicable, should be semiparametric efficient.
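To make the weighted estimating-equation idea concrete, the following is a minimal numpy sketch (added here, not from the chapter) of an inverse-probability-weighted partial-likelihood score for a single covariate, assuming weight one for cases, weight 1/π₀ for sampled controls, and weight zero otherwise; all names and the toy data are illustrative.

```python
import numpy as np

def weighted_cox_score(beta, y, d, z, r, pi):
    """Inverse-probability-weighted partial-likelihood score for one covariate:
    cases get weight 1, sampled controls weight 1/pi, unsampled subjects weight 0."""
    w = np.where(d == 1, 1.0, r / pi)
    score = 0.0
    for i in np.where(d == 1)[0]:                  # sum over observed failures
        at_risk = (y >= y[i]) & (w > 0)
        num = np.sum(w[at_risk] * z[at_risk] * np.exp(beta * z[at_risk]))
        den = np.sum(w[at_risk] * np.exp(beta * z[at_risk]))
        score += z[i] - num / den
    return score

rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)
y = rng.exponential(scale=np.exp(-0.5 * z))        # true log hazard ratio = 0.5
d = (y < np.quantile(y, 0.3)).astype(int)          # roughly 30% observed failures
pi0 = 0.25                                         # subcohort sampling fraction
r = np.where(d == 1, 1, rng.binomial(1, pi0, size=n))
pi = np.full(n, pi0)

lo, hi = -3.0, 3.0                                 # solve U(beta) = 0 by bisection
for _ in range(60):
    mid = (lo + hi) / 2
    if weighted_cox_score(mid, y, d, z, r, pi) > 0:
        lo = mid
    else:
        hi = mid
print("weighted estimate of beta:", round((lo + hi) / 2, 3))
```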
3.1.2 Pseudo-likelihood estimators
Representative methods in this class include estimators proposed by Prentice (1986), Self and Prentice (1988), Barlow et al. (1994), and Borgan et al. (estimators I and III, 2000). Let V denote the sampled subcohort. The Self-Prentice estimator is obtained from the following pseudo-score function:
$$L(\beta) = \sum_{i=1}^{n} d_i\left\{Z_i - \frac{\sum_{j\in V} I(Y_j \geq Y_i)\, e^{\beta' Z_j} Z_j}{\sum_{j\in V} I(Y_j \geq Y_i)\, e^{\beta' Z_j}}\right\},$$
where the second term in the curly bracket is obviously an unbiased estimate of $E(Z\mid Y_i, d_i=1)$ because V is a random subset of the full cohort. This estimator is asymptotically equivalent to the original Prentice estimator (1986), which supplemented the subcohort risk set at $Y_i$, $\{j: j\in V, Y_j \geq Y_i\}$, with the case that occurs at $Y_i$ outside of the subcohort V. Borgan et al. (estimators I and III, 2000) extended the Self and Prentice estimator to the stratified case-cohort design. The essential difference between these estimators and the weighted estimators is that the latter included the case j in any risk set at time $Y_i$ with $Y_j \geq Y_i$ for the estimation of $E(Z\mid Y_i, d_i=1)$, while the Self-Prentice estimator ignored cases that occur outside of V. In other words, for the estimation of $E(Z\mid X, D=1)$, the Self-Prentice estimator requires that cases that are not in V have a "zero probability" of being sampled. This is an intuitive explanation for why the pseudo-likelihood estimators do not belong to the class of D-estimators. For the stratified case-cohort design, the asymptotic variance of $\hat\beta$ has the same form as (3.7), except that 1 − D is replaced by 1 and $\pi_0$ by the limit of the size of subcohort V divided by the total cohort size n. This variance is strictly larger than that of the weighted estimator (estimator II; Borgan et al., 2000; Chen and Lo, 1999). The asymptotic variance can also be conveniently obtained by utilizing standard statistical software for fitting Cox's proportional hazards models (Therneau and Li, 1999). An estimate of the cumulative baseline hazard function $\Lambda_0(t)$ corresponding to the pseudo-score estimator for β can be obtained as
where $n_V$ is the size of V. This estimator is also expected to be less efficient than that of the weighted estimator.

3.1.3 Nonparametric maximum likelihood estimation
If the censoring variable C does not depend on Z conditional on S, the NPMLE leads to efficient estimators for the estimation of both β and $\Lambda_0(t)$ (Chen and Little, 1999; Chen, 2002; Scheike and Martinussen, 2004). This method is based on maximizing the likelihood for the observed data, (Y, D, R, RZ, S), which is a function of the parameters $\{\Lambda(t), \beta\}$ and the nuisance distribution $p(Z\mid S)$. With a parametric or nonparametric model imposed for $p(Z\mid S)$, an EM-type algorithm can be applied for the joint estimation of all parameters. In the maximization, the cumulative baseline hazard function $\Lambda_0(t)$ is treated as a step function with jumps only at observed failure time points. The asymptotic variance of estimates for the parameter β can be obtained as the inverse of the information matrix calculated as the second derivative of the profile likelihood function of β. The general NPMLE methods have been rigorously and thoroughly studied in Zeng and Lin (2007). We outline the basics of this approach following Scheike and Martinussen (2004), assuming that no variables S are observed in phase I to simplify the notation; the extension to the case where S is observed is straightforward. Let $H(t\mid Z)$ be the survival function $P(T \geq t\mid Z) = \exp\{-\Lambda_0(t)\exp(\beta' Z)\}$, and let g(Z) be the marginal distribution of Z. The informative part of the likelihood function can be written as
$$\prod_{i: r_i=1} \lambda(Y_i\mid z_i)^{d_i}\, H(Y_i\mid z_i)\, g(z_i) \prod_{i: r_i=0} \int_z H(Y_i\mid z)\, g(z)\, dz,$$
where the likelihood for the sampling indicator R is omitted because the case-cohort sampling produces missingness in Z that is at random (Little and Rubin, 1987). When Z is discrete with a small number of levels, one can impose a saturated model for g(z) with the number of parameters equal to the total number of distinct values of Z minus one. In general, when Z includes both discrete and continuous components, one can put a point mass $p_k$ at every distinct observed value $z_k$ of Z, with the $p_k$'s satisfying $\sum_k p_k = 1$. The maximization of the above likelihood, with g(z) replaced by the $p_k$ so that the integration over z becomes summation over the $z_k$, can be carried out using the EM algorithm. The expectation is taken over the conditional probability $p(Z = z_k\mid Y \geq t) = H(t; z_k)p_k / \sum_k H(t; z_k)p_k$. Define $a_{ik}^{(l)}$ as $r_i I(z_i = z_k) + (1 - r_i)\, p^{(l)}(Z = z_k\mid T \geq Y_i)$ at the $l$th step of the iteration. At the $(l+1)$th iteration, $p_k^{(l+1)} = \sum_i a_{ik}^{(l)}/n$, and $\hat\beta$ is the solution to the score equation
$$\sum_{i=1}^{n} d_i\left[Z_i - \frac{\sum_{j: Y_j \geq Y_i}\big\{r_j e^{\beta' Z_j} Z_j + (1-r_j)\sum_k e^{\beta' z_k} z_k\, a_{jk}\big\}}{\sum_{j: Y_j \geq Y_i}\big\{r_j e^{\beta' Z_j} + (1-r_j)\sum_k e^{\beta' z_k}\, a_{jk}\big\}}\right] = 0,$$
and $\hat\Lambda_0^{(l+1)}(Y_i)$ is obtained as
$$\hat\Lambda_0(Y_i) = \sum_{l=1}^{n} \frac{I(Y_l \leq Y_i)\, d_l}{\sum_{j: Y_j \geq Y_l}\big\{r_j e^{\beta' Z_j} + (1-r_j)\sum_k e^{\beta' z_k}\, a_{jk}\big\}}.$$
The NPMLE method simultaneously provides estimates for all parameters including β, $\Lambda_0(t)$, and the $p(z_k)$. The simultaneous estimation of two essentially infinite-dimensional parameters, $\Lambda_0(t)$ and $p(z_k)$, using an iterative algorithm makes the computation much more challenging than for the weighted and pseudo-likelihood estimators. The asymptotic variance of $\hat\beta$ can be obtained from the inverse Hessian matrix of the profile likelihood for β as follows. At the convergence of the EM algorithm, one obtains $\hat\Lambda_0^{\beta}$ and $\hat p^{\beta}$ using the above formulas with β fixed at $\hat\beta$. Then one can plug $\hat\Lambda_0^{\beta}$ and $\hat p^{\beta}$ back into the likelihood function, which is written as $L(\beta) = L(\beta, \Lambda^{\beta}, p^{\beta})$. Then one perturbs $\hat\beta$ by a small quantity ε as $\beta' = \hat\beta + \epsilon$, recalculates the likelihood function $L(\beta') = L(\beta', \Lambda^{\beta'}, p^{\beta'})$, and calculates $L(\hat\beta + 2\epsilon)$ similarly. The variance-covariance matrix of $\hat\beta$ can then be obtained from the second difference $\{L(\hat\beta + 2\epsilon) + L(\hat\beta) - 2L(\hat\beta + \epsilon)\}/\epsilon^2$. Scheike and Martinussen (2004) proposed to perform the computation based on the partial likelihood score function, which can largely take advantage of many quantities computed in the EM algorithm. Of course one could also adopt a parametric model for g(z) or g(z|s) (Chen and Little, 1999). However, mis-specifying these nuisance functions may result in intolerable bias in parameter estimation. In addition, it is not trivial to compute the integration over Z in the likelihood function for the observed data.

3.1.4 Selection of a method for analysis
The decision to choose a method for analysis depends on considerations of the scientific hypotheses of interest, data generating processes, statistical efficiency, the amount of data available from the full cohort, and the availability of computing resources. The NPMLE with the missing covariate distribution modeled nonparametrically is consistent and the most efficient when applicable. However, this method requires a relatively strong assumption that the censoring only depends on the fully observed covariates, the violation of which may result in serious bias (Chen and Little, 1999). While one may subjectively speculate on whether the data indeed satisfy this constraint, the observed data do not allow a formal statistical assessment. Of course one could perform suitable sensitivity analyses in light of this concern. The weighted methods, for the consistency of estimators, only require knowledge of the true subcohort selection model, which is always in the hands of the investigator. These methods guarantee that estimates are consistent for the same quantities as those from the full cohort analysis if complete data were available. To choose one from the various weighted methods, one can first estimate the subcohort selection probability within strata defined by discretized Y, phase I variables S, and D. The analysis may involve a pre-investigation process on what variables to use in this post-stratification so that there are a reasonable number of subjects in each stratum to ensure finite-sample consistency and the efficiency gain. The fully augmented weighted estimator (Qi, Wang, and Prentice, 2005) is relatively computationally intensive since it requires non-parametric smoothing. The pseudo-likelihood estimators may be less efficient than the NPMLE and some weighted estimators, but they require the least effort in data collection: they do not require the record of follow-up time Y for non-selected subjects and require the ascertainment of covariate values for a case outside of subcohort V only at the
case's failure time. Both the weighted and NPMLE estimators require covariate observations at all failure times in the full cohort. The NPMLE method requires the record of Y for the full cohort, which may not be precise or readily available for subjects not in the case-cohort sample. For example, the base cohort may consist of several subcohorts that were assembled at different study sites and involved different study investigators (for example, a cohort consortium). The weighted method may also need to use Y to improve efficiency. While the weighted and pseudo-likelihood estimators can be computed largely by taking advantage of quantities provided by standard software for fitting Cox's proportional hazards models, the programming requirement for the NPMLE is much higher. Sometimes the base cohort may be so large that it may not be feasible to run the EM algorithm. For example, the Breast Cancer Detection and Demonstration Project (BCDDP) at the National Cancer Institute consisted of around 280,000 women (Chen et al., 2008). When the outcome incidence is low and hazard ratio parameters are not very large, the efficiency advantage of the NPMLE is very modest (Scheike and Martinussen, 2004). In particular, if the interest is only in the estimation of hazard ratio parameters for covariates collected in the case-cohort sample but not for phase-I variables, incorporating the data (Y, D) for subjects outside of the case-cohort sample in the NPMLE analysis probably could only improve the efficiency of estimating the parameters of interest marginally. But one could conjecture that the estimation of $\Lambda_0(t)$ would be much improved. Limited data are available in the literature documenting the relative performance of the NPMLE approach and weighted estimators. It is of practical interest to compare their efficiency if phase I variables S are available and at least moderately correlated with phase II covariates Z. In this case, the NPMLE involves the estimation of the nuisance distribution $p(Z\mid S)$, which could be challenging when S involves continuous components or many discrete variables. One solution is to simply ignore components in S that are not used for case-cohort sampling. Since the weighted and pseudo-likelihood approaches are based on modifications of full data partial likelihood functions, they can conveniently incorporate fully-observed time-varying covariates. For the NPMLE method, no details have been given in the literature on how well it can handle time-varying covariates. The likelihood function requires modeling of the covariates Z conditional on the fully observed ones, but such modeling becomes difficult when phase I variables are time-dependent. Additional assumptions may be imposed. For example, one may assume that only the baseline measurements of phase I variables are predictive of the distribution of Z (Chen et al., 2008). The analysis can then proceed largely using the EM algorithm, except that the update for estimates of $\Lambda_0(t)$ is more complex due to the involvement of time-varying covariates in the survivor function. It has also been pointed out that the weighted estimators always converge to the same quantities as the full data analysis when the Cox proportional hazards model is mis-specified (Scott and Wild, 2002; Breslow and Wellner, 2007). Insufficient data exist in the literature documenting the bias in the NPMLE and pseudo-score estimates compared to the full cohort analysis when the model (3.1)
is mis-specified.
3.2 Nested case-control and counter-matching design
The meaning of "nested case-control" design in the epidemiology literature is frequently different from that in the biostatistics literature. In the biostatistics literature, it refers to an individually-matched design where controls are sampled from the risk set at the failure time of the corresponding case. Suppose K out of n subjects in the cohort experienced the event of interest during the study period, and let $t_1 < t_2 < \cdots < t_K$ denote the K observed event times. Let $R_k$ be the risk set at $t_k$, $\{i: Y_i \geq t_k, i = 1, \ldots, n\}$, which includes the kth case and all subjects in the full cohort who are event-free at $t_k$, and denote by $n_k$ the number of subjects in the set $R_k$. Then for the kth case, typically, a small number of event-free subjects are sampled from $R_k$. We assume a fixed number, m − 1, of controls are sampled for each case. Let $V_k$ denote the sampled subset at $t_k$ together with the kth case. Then the size of $V_k$ is equal to m. Covariates Z are available only for subjects in the sampled risk set $V_k$. Some auxiliary variables S may be available for the full cohort. In the epidemiology literature, subjects who experienced events and a group of event-free subjects at the end of study follow-up are sampled. The sampling is frequently stratified on factors such as the length of follow-up, ethnicity group, study center, etc. The resulting case-control data are often analyzed using standard logistic regression as an unmatched case-control sample, with the stratification factors adjusted for as covariates. The same data, with the observation of (Y, D) available for the full cohort, can also be analyzed with the Cox proportional hazards model. The analytical method is closely related to that for the case-cohort study. We focus on methods for analyzing the individually-matched case-control design. The classical method of estimating hazard ratio parameters using nested case-control data is by maximizing the log partial likelihood (Thomas, 1977)
$$L(\beta) = \sum_k d_k\Big\{\beta' Z_k - \log \sum_{l\in V_k} e^{\beta' Z_l}\Big\},$$
where, with slight abuse of notation, $Z_k$ refers to the covariate values of case k at the kth failure time. This likelihood differs from that for the full cohort study only in that the sampled risk set is used instead of the full risk set. This partial likelihood has exactly the same form as the usual conditional likelihood for matched case-control data assuming a logistic regression model for analysis. Consequently, the analysis is frequently referred to as conditional logistic regression analysis. The intuitive explanation for the consistency of the resultant estimate of β, $\hat\beta$, is readily available by examining the score function for β:
$$U(\beta) = \sum_k \bigg\{ Z_k - \frac{\sum_{l\in V_k} Z_l\, e^{\beta' Z_l}}{\sum_{l\in V_k} e^{\beta' Z_l}} \bigg\},$$
where the second term is an estimate of $E(Z\mid Y_k, d_k=1)$ calculated using the sampled risk set only. This estimator is unbiased because the sampling within each risk set is random, implying that the score function is unbiased. It turns out
that the estimate converges to the same value as the full cohort analysis (Langholz and Goldstein, 2001) if Z were measured on the full cohort. An estimate of the baseline hazard function $\Lambda_0(t)$ can be obtained as
When one is only interested in the hazard ratio parameters β, the partial likelihood analysis uses data only from selected subjects, and the only additional information required for the estimation of $\Lambda_0(t)$ is $n_k$, which can be conveniently recorded at the time of sampling the risk set. Similarly to stratified case-cohort designs, an exposure-stratified version of the individually-matched nested case-control design, called the counter-matching design, can lead to improved efficiency (Langholz and Borgan, 1995). It is easily seen that the NPMLE method for the analysis of case-cohort data is readily applicable to that of the nested case-control or counter-matched data, so we will omit the discussion of this approach. However, it is worth pointing out that extra caution needs to be taken when applying the NPMLE to the nested case-control data when it is important to adequately adjust for matching variables. For example, when measuring biologic quantities from blood samples, the investigators often match on the date that the blood sample was collected. When using the NPMLE method, one may need to adjust for the matching on the date, which could be rather difficult if the matching is very fine. In this case, the partial likelihood approach would be the most suitable. Many methods we review for analyzing the nested case-control or the counter-matched data are analogous to those for the case-cohort design. This is not surprising. "Although studies of nested case-control and case-cohort sampling using Cox's model have mostly been conducted through separate efforts, the focus on searching for more efficient estimators is the same - namely how sampled individuals should be properly reused in constructing estimating equations" (Chen, 2001).
3.2.1 Methods for analyzing the nested case-control data
The general class of weighted estimating equations (3.6) is also applicable to the analysis of nested case-control or counter-matched data. While the probability of including cases is obviously one, the appropriate weight to use for controls, however, is not straightforward to obtain due to the matched sampling. In particular, a control for a former case may experience an event at a later time, and a subject could be sampled as a control for multiple cases. Thus, the probability that a subject is ever selected as a control for some case depends on the at-risk history of the full cohort. When cases and controls are only matched on time, the selection probability for a control subject i was derived by Samuelsen (1997):
$$\pi_i = 1 - \prod_{k:\, t_k \leq Y_i} \left\{1 - \frac{m-1}{n_k - 1}\right\}.$$
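As a small illustration of this weight (an added sketch, not from the chapter; the function and toy data are illustrative), Samuelsen's inclusion probabilities can be computed directly from the cohort's follow-up times:

```python
def samuelsen_weights(times, events, m):
    """Inclusion probabilities pi_i for controls in a time-matched nested
    case-control sample with m-1 controls per case (Samuelsen, 1997).
    times: follow-up times Y_i; events: event indicators d_i (1 = case)."""
    n = len(times)
    pi = [1.0] * n                      # cases are included with probability one
    case_times = [t for t, d in zip(times, events) if d == 1]
    for i in range(n):
        if events[i] == 1:
            continue
        prob_never = 1.0
        for tk in case_times:
            if tk <= times[i]:          # subject i is at risk at failure time tk
                n_k = sum(1 for y in times if y >= tk)   # size of risk set R_k
                prob_never *= 1.0 - (m - 1) / (n_k - 1)
        pi[i] = 1.0 - prob_never
    return pi

# Toy cohort: 6 subjects, two cases, 1 control sampled per case (m = 2)
print(samuelsen_weights([5, 3, 8, 2, 7, 6], [0, 1, 0, 1, 0, 0], m=2))
```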
This is the "true probability" of selection. The estimation can then proceed by solving equation (3.6) with this $\pi_i$. The theoretical study of this estimator is slightly more involved than that for the case-cohort design, since the probability $\pi_i$ is not predictable. When the selection of controls involves matching also on factors other than time, such as ethnicity, date of processing the biologic assay, etc., the extension of this weight to incorporate these finer strata does not seem obvious. Similarly to the case-cohort design, one can also use "estimated weights" instead of the "true weights" to improve statistical efficiency. In this regard, the local-covariate averaging method of Chen (2001) and the general weighted method (3.8), which replaces π with smoothed estimates in strata defined by (Y, D) or (Y, D, S), are still applicable. Using estimated weights makes it feasible to incorporate stratified risk-set sampling. All these weighted methods allow the exposure of case k to be compared to all subjects in the nested case-control sample who are at risk at time $t_k$ rather than only to those matched to case k, thereby leading to increased power. The power gain is more important when the number of matched controls is small and/or the hazard ratio parameters are large, and becomes less so otherwise (Samuelsen, 1997; Chen, 2001). More empirical studies are needed to evaluate the performance of these estimators in realistic scenarios. As a tradeoff for the efficiency advantage, the weighted methods require more data than the standard partial likelihood method. The former require observations of time-dependent covariates for a subject at all failure times, but the latter requires them only at the one failure time at which the subject was sampled. New estimators that use only the time-restricted nested case-control data, are reasonably accurate, and are more efficient than the partial likelihood estimator have been proposed (Chen, 2001).

3.2.2 Methods for analyzing counter-matched data
Counter-matching design is a stratified nested case-control design that can lead to improved statistical efficiency for assessing the exposure effect. The intuition is best explained by a one-one matched design. In the standard nested case-control design, when the exposure is rare, many case-control pairs would be unexposed and thus are non-informative for the exposure hazard ratio estimation. By sampling a control who has an exposure value different from that of the case, the countermatching design offers a sampling strategy to increase the number of informative pairs (Langholtz and Borgan, 1995). For example, if a case is exposed, the control should be randomly sampled from unexposed event-free subjects in the risk set. The sampling principle also applies when an auxiliary variable is available in the full cohort, but the exposure of interest can only be measured in the countermatched subset. Similar to the Thomas' partial likelihood approach for analyzing the nested case-control data, the counter-matching studies can be analyzed using a partial likelihood approach (Langholz and Borgan, 1993):
where Wj (tk) is the inverse sampling probability for subject j at failure time k, calculated as the total number of subjects in the full cohort risk set Rk in sampling stratum where case k is located divided by the number in Vk from the same stratum. Interestingly, even with this sampling weights, this partial likelihood still has the same properties as the regular partial likelihood: the expectation of the score is zero and the expected information equals the covariance matrix of the score. Consequently, standard software for performing the conditional logistic regression analysis can be used for fitting the counter-matched sample. The weighted methods for the nested case-control data can be applied to the analysis of counter-matched data with only minor modifications in the construction of weights. While the principles of analyzing counter-matched data is not much different from those for the standard nested case-control data, the actual implementation of sampling could be much less straightforward except for counter-matching one case with one control on a binary exposure. With continuous exposures or auxiliary phase-I variables, one would need to categorize them in order to perform counter matching. Then cases and controls could be sampled from relatively distant intervals. One would need to decide a priori the number and endpoints of intervals, and some general guidelines are given in Langholtz (2003). In general, the number of categories should be balanced with the number of at-risk subjects within each category. It is also important to keep in mind that counter-matching sampling does not necessarily lead to efficiency gain for every parameter estimate (Langholtz and Borgan, 1995). Thus, one needs to be cautious when applying a counter-matching design to study joint effects of multiple covariates or absolute risk. Nevertheless, efficiency improvement may always be expected for the counter-matched variable or its correlates. In particular, counter-matching on both rare genetic and environmental exposures has been shown to be cost-effective for studying interactions between rare genes and rare environmental exposures (Cologne et al., 2004). Empirical studies showed that counter-matching on an auxiliary measure that is even moderately correlated with the exposure of interest could result in noteworthy efficiency gain (Langholtz and Borgan, 1995). Such gain increases with respect to both sensitivity and specificity of the auxiliary variable but more with specificity and decreases with the number of matched controls. 3.2.3
Unmatched case-control studies
Classical case-control studies, which include the failures together with a subset of non-failures at the end of the study period, are often conducted within an assembled study cohort. The sampling of non-failures (controls) could be stratified on factors such as ethnicity, age, follow-up time, and etc, so that cases and controls are "frequency-matched" on these variables. Epidemiologists often refer to this type of case-control studies as nested case-control studies. These studies are often analyzed using an unconditional logistic regression model for the prevalence of the failure, especially when the censoring is administrative. In the presence of competing risks or a complex relationship between the risk of failure and time, it may be desirable to perform a time-to-event analysis (Chen and Lo, 1999; Chen,
2001), which also conveniently allows absolute risk estimation. For the purpose of estimating hazard ratio parameters, it turns out that an unmatched case-control study with $k_1$ cases and $k_0$ controls is equivalent to a case-cohort study if $k_1$ is equal to the total number of cases in the cohort and $k_0 = n\{1 - P(\delta = 1)\}$, where n is the subcohort sample size (Chen and Lo, 1999). The estimation method of Chen and Lo (1999) and Borgan et al. (2000) can then be applied to the analysis of the case-control data, except that the asymptotic variance needs to be slightly modified. We include the unmatched case-control design here mainly to distinguish it from the nested case-control design, but it is more closely related to the case-cohort design.

3.2.4 Comparing study designs and analysis approaches
The relative merits of case-cohort and nested case-control designs have been discussed extensively in the literature (e.g., Langholtz 2001; Samuelsen et al., 2007). A unique advantage of the case-cohort design is that it is cost effective for studying multiple outcome variables in that only a single random sub cohort is needed as a common control sample for different outcomes. A matched design, on the other hand, must select a new set of controls for a new outcome variable. But a case-cohort design does require that exposure assessments, such as molecular measurements from stored biological materials, are not affected by the passage of time. The time-matched case-control design is free of such concern. The implementation of a case-cohort design is somewhat easier than that for a nested case-control design. One should carefully gauge these and other practical considerations when choosing a design. The NPMLE or weighted likelihood methods make it possible to compare the efficiency of different study designs. For example, Langholz and Thomas (1991) showed that the nested case-control design may have greater efficiency than the case-cohort design when there is moderate random censoring or staggered entry into the cohort. Methods of "refreshing" the sub cohort so as to avoid such efficiency loss are available (Prentice, 1986; Lin and Ying, 1993; Barlow, 1994). We have focused on the methods for fitting the Cox's proportional hazards model on case-cohort or nested case-control data. Analytical methods assuming other regression models, such as semiparametric transformation models (Lu and Tsiatis, 2006; Kong, Cai, Sen, 2004) or additive hazards regression model (Kulich and Lin, 2000), have also been studied in the literature. The weighted approaches and NPMLE approach could be adapted to these settings without conceptual difficulty.
4 Conclusions
The chapter is intended to provide a useful highlight of some recent statistical method development for the analysis of two-phase case-control and cohort studies. It is in no way a comprehensive review of the whole field. Weighted likelihood methods, pseudo-likelihood methods, and nonparametric maximum likelihood methods were outlined for two-phase case-control, (stratified) case-cohort, and nested case-control or counter-matching designs. Each method has its own
advantages and limitations, and it is probably difficult, if not impossible, to conclude that one method is uniformly preferable. For a particular study, one should carefully gauge whether the assumptions required by a method are satisfied, whether a potential efficiency gain could justify any additional computational effort, and so on. We have focused on two-phase methods to reduce costs for measuring covariates. Similar sampling strategies and statistical methods are applicable for the cost-effective measurement of regression outcomes (Pepe, Reilly, and Fleming, 1994; Chen and Breslow, 2004). The case-cohort design has been widely adopted for the study of many human disorders related to expensive exposures. A search of "case-cohort" by title and abstract in PubMed shows 455 records at the time when this paper was written. The original Prentice (1986) estimator appears still to be the method of choice, partially due to the convenience of computation using existing software. The time-matched nested case-control design has long been used as a standard epidemiologic study design, but frequency-matching is probably applied more often than individual-matching for control selection, and conditional and unconditional logistic regression analyses are more often used than time-to-event analysis. When individual matching is administered, Thomas' partial likelihood estimator (1977), usually referred to as a conditional logistic regression method, is still the standard method of analysis. A limited literature search shows that many of the reviewed methods, although statistically elegant, have not really reached the epidemiology community. This highlights a great need for work that disseminates the mathematically sophisticated two-phase methods to the scientific community, best accompanied by the development and distribution of user-friendly software. For two-phase case-control studies, some effort has already been devoted and appears to have been fruitful (Breslow and Chatterjee, 1999). We did not review the optimal design of two-phase studies. Within an established cohort where the phase-I sample size is fixed, an optimal design would refer to the sampling strategy to select a pre-specified number of phase-II subjects in a way that the resultant phase I and phase II data would allow the most precise parameter estimates. An optimal sampling strategy depends on sampling proportions within each stratum, the cost of sampling a case or control, and unknown parameters such as the joint distribution of regression variables and auxiliary variables. It is thus usually infeasible to determine the optimal sampling strategy. For the two-phase case-control study with a binary exposure, it was shown that a design that balances the number of subjects in all strata appeared to perform well in many practical situations (Breslow and Cain, 1988). For the unstratified design of case-cohort or nested case-control studies, the design is simply a matter of cost: if all cases are sampled, how many controls does one need to sample to achieve certain power? Cai and Zeng (2004) proposed two approaches for calculating sample size for case-cohort studies based on log-rank type test statistics. For the stratified case-cohort or counter-matching design, one would need to decide what strata to use and how many subjects to sample within each stratum. Many useful extensions of two-phase methods have been proposed in the literature.
For example, for the examination of two rare exposures, Pfeiffer and Chatterjee (2005) considered a supplemented case-control design where they extended
methods for two-phase case-control studies by supplementing a case-control sample with a group of non-diseased subjects who are exposed to one of the exposures. Lee, Scott and Wild (2007) proposed maximum likelihood methods for when a case sample is augmented either with population-level covariate information or with a random sample that consists of both cases and controls. Their case-augmented design allows the estimation not only of odds ratio parameters, but also of the probability of the outcome variable. Reilly et al. (2005) applied two-phase methods to study new outcome variables using existing case-control data, where the new outcome was an exposure in the original study. Their method could largely eradicate possible bias in the analysis of only controls and improve study power by including not only controls but also cases. Chen et al. (2008) extended the NPMLE method of two-phase case-control studies to a situation where phase I is a stratified case-control sample but the sampling of phase II subjects was within strata defined by auxiliary variables other than the stratification variable for the phase I case-control sampling. Lu and Shih (2006) proposed an extended case-cohort design for studies of clustered failure time data. Besides the two-phase design, multiple-phase designs have also been considered in the literature (Whittemore, 1997). In summary, the application and extension of two-phase methods are no doubt still a very promising area of research. In the current exciting era of genetic and molecular epidemiology, the cost of measuring many bioassays is far from trivial. The two-phase and multi-phase methods are thus expected to find important and wider applications.
References [1] W. E. Barlow, Robust variance estimation for the case-cohort design. Biometrics 50 (1994), 1064-1072. [2] J. L. Bernstein, B. Langholz, R. W. Haile, L. Bernstein, D. C. Thomas, M. Stovall, K. E. Malone, C. F. Lynch, J. H. Olsen, H. Anton-Culver, R. E. Shore, J. D. Boice Jr, G. S. Berkowitz, R. A. Gatti, S. L. Teitelbaum, S. A. Smith, B. S. Rosenstein, A. L. Berresen-Dale, P. Concannon, and W. D. Thompson, Study design: evaluating gene-environment interactions in the etiology of breast cancer - the WECARE study. Breast Cancer Research 6 (2004), 199-214. [3] P. J. Bickel, C. A. Klaassen, Y. Ritov, and J. A. Wellner, Efficient and adaptive estimation for semiparametric models. Johns Hopkins Series in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, Md, USA, 1993. [4] O. Borgan, B. Langholz, S. O. Samuelson, L. Goldstein, and J. Pogoda, Exposure stratified case-cohort designs. Lifetime Data Analysis 6 (2000),39-58. [5] N. E. Breslow, Discussion of the paper by D. R. Cox. J. R. Statist. Ser. B 34 (1972), 216-217. [6] N. E. Breslow, Statistics in epidemiology: The case-control Study. Journal of the American Statistical Association 91 (1991),14-28. [7] N. E. Breslow and K. C. Cain, Logistic regression for two-Stage case-Control Data. Biometrika 75 (1988), 11-20.
[8] N. E. Breslow and N. Chatterjee, Design and analysis of two phase studies with binary outcome applied to wilms tumor prognosis. Applied Statistics 48 (1999), 457-468. [9] N. E. Breslow and R. Holubkov, Maximum likelihood estimation of logistic regression parameters under two-phase, outcome-dependent sampling. Journal of the Royal Statistical Society, Ser. B 59 (1997), 447-46l. [10] N. E. Breslow, J. M. Robins, J. A. Wellner, On the semiparametric efficiency oflogistic regression under case-control sampling. Bernoulli 6 (2000),447-455. [11] N. E. Breslow and J. A. Wellner, Weighted likelihood for semiparametric models and two-phase stratified samples, with application to Cox regression. Scandinarian Journal of Statistics (2007), 86-102. [12] N. Chatterjee, Y. Chen, and N. E. Breslow, A pseudoscore estimator for regression problems with two-phase sampling. Journal of the American Statistical Association, 98 (2003), 158-168. [13] J. B. Chen, N. E. Breslow, Semiparametric efficient estimation for the auxiliary outcome problem with the conditional mean model. Canadian Journal of Statistics 32 (2004), 359-372. [14] J. B. Chen, R. Ayyagari, N. Chatterjee, D. Y. Pee, C. Schairer, C. Byrne, J. Benichou, and M.H. Gail, Breast cancer relative hazard estimates from caseCcontrol and cohort designs with missing data on mammographic density. Journal of the American Statistical Association (2008; in press). [15] H. Y. Chen and R. J. A. Little, Proportional hazards regression with missing covariates. Journal of the American Statistical Association 94 (1999), 896-908. [16] H. Y. Chen, Double-semiparametric method for missing covariates in Cox regression models. Journal ofthe American Statistical Association 97 (2002), 565-576. [17] K. Chen, Generalized case-cohort sampling. Journal of the Royal Statistical Society, Ser. B 63 (2001), 791-809. [18] K. Chen, Statistical estimation in the proportional hazards model with risk set sampling. The Annals of Statistics 32 (2004), 1513-1532. [19] K. Chen and S. Lo, Case-cohort and case-control analysis with Cox's model. Biometrika 86 (1999), 755-764. [20] J. B. Cologne, G. B. Sharp, K. Neriishi, P. K. Verkasalo, C. E. Land, and K. N akachi, Improving the efficiency of nested case-control studies of interaction by selecting controls using counter matching on exposure. International Journal of Epidemiology 33 (2004), 485-492. [21] D. R. Cox, Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Ser. B 34 (1972), 187-220. [22] A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Ser. B 39 (1977), 1-38. [23] E. A. Engels, J. Chen, R. P. Viscidi, K. V. Shah, R. W. Daniel, N. Chatterjee, and M. A. Klebanoff, Poliovirus vaccination during pregnancy, maternal seroconversion to simian virus 40, and risk of childhood cancer. American Journal of Epidemiology 160 (2004), 306-316. [24] T. R. Flanders and S. Greenland, Analytic methods for two-stage case-control
Chapter 6
Statistical Methods for Analyzing Two-Phase Studies
153
studies and other stratified designs. Statistics in Medicine 10 (1991), 739-747. [25] D. G. Horvitz and D. J. Thompson, A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47 (1952), 663-685. [26] J. D. Kalbfleisch and J. F. Lawless, Likelihood analysis of multi-state models for disease incidence and mortality. Statistics in Medicine 7 (1988), 146-160. [27] L. Kong, J. W. Cai, and P. K. Sen, Weighted estimating equations for semiparametric transformation models with censored data from a case-cohort design. Biometrika 91 (2004), 305-319. [28] M. Kulich and D. Y. Lin, Improving the efficiency of relative-risk estimation in case-cohort studies. Journal of the American Statistical Association 99 (2004), 832-844. [29] M. Kulich and D. Y. Lin, Additive hazards regression for case-cohort studies. Biometrika 87 (2000), 73-87. [30] J. A. Largent, M. Capanu, L. Bernstein, B. Langholz, L. Mellemkaer, K. E. Malone, C. B. Begg, R. W. Haile, C. F. Lynch, H. Anton-Culver, A. Wolitzer, J. L. Bernstein, Reproductive history and risk of second primary breast cancer: the WECARE study. Cancer Epidemiology, Biomarkers, and Prevention 16 (2007), 906-91l. [31J B. Langholz and O. Borgan, Counter-matching: a stratified nested casecontrol sampling method. Biometrika 82 (1995), 69-79. [32] B. Langholz and L. Goldstein, Conditional logistic analysis of case-control studies with complex sampling. Biostatistics 2 (2001),63-84. [33] B. Langholz, Use of cohort information in the design and analysis of casecontrol studies. Scandivavian Journal of Statistics 34 (2006), 120-136. [34] B. Langholz and D. C. Thomas, Efficiency of cohort sampling designs: some surprising results. Biometrics 47 (1991), 1563-157l. [35] J. F. Lawless, J. D. Kalbfleisch, and C. J. Wild, Semiparametric methods for response-selective and missing data problems in regression. Journal of the Royal Statistical Society, Ser. B 61 (1999), 413-438. [36] A. J. Lee, A. J. Scott, and C. J. Wild, On the Breslow-Holubkov estimator. Lifetime Data Analysis 13 (2007), 545-563. [37J A. J. Lee, A. J. Scott, and C. J. Wild, Fitting binary regression models with case-augmented samples. Biometrika 93 (2007), 385-397. [38J A. J. Lee, On the semi-parametric efficiency of the Scott-Wild estimator under choice-based and two-phase sampling. Journal of Applied Mathematics and Decision Sciences (2007, in press). [39] D. Y. Lin and Z. Ying, A simple nonparametric estimator of the bivariate survival function under univariate censoring. Biometrika 80 (1993), 573-58l. [40] W. B. Lu and A. A. Tsiatis, Semiparametric transformation models for the case-cohort study. Biometrika 93 (2006), 207-214. [41] S. Lu and J. H. Shih, Case-cohort designs and analysis of clustered failure time data. Biometrics 62 (2006), 1138C1148. [42] B. Nan, Efficient estimation for case-cohort studies. Canadian Journal of Statistics 32 (2004), 403-419. [43J B. Nan, M. J. Emond, and J. A. Wellner, Information bounds for Cox regres-
154
Jinbo Chen
sion models with missing data. The Annals of Statistics 32 (2004), 723-753. [44] M. S. Pepe, M. Reilly, and T. R. Fleming, Auxiliary outcome data and the mean score method. Journal of Statistical Planning and Inference 42 (1994), 137-160. [45] R. L. Prentice, A Case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika 73 (1986), 1-11. [46] R. L. Prentice and R. Pyke, Logistic disease incidence models and case-control studies. Biometrika 66 (1979), 403-411. [47] M. Reilly, A. Torrang, and A. Klint, Re-use of case-control data for analysis of new outcome variables. Statistics in Medicine 24 (2005), 4009-4019. [48] J. M. Robins, A. Rotnitzky, and L. P. Zhao, Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89 (1994), 846-866. [49] S. O. Samuelsen, H. Aanestad, and A. Skrondal, Stratified case-cohort analysis of general cohort sampling designs. Scandinavian Journal of Statistics 34 (2007), 103-119. [50] T. H. Scheike and T. Marginussen, Maxiumum likelihood estimation for Cox's regression model under case-cohort sampling. Scandinavian Journal of Statistics 31 (2004), 283-293. [51] W. Schill, K. H. Jockel, K. Drescher, and J. Timm, Logistic analysis in casecontrol studies under validation sampling. Biometrika 80 (1993), 339-352. [52] A. J. Scott and C. J. Wild, Fitting logistic models under case-control or choice-based sampling. Journal of the Royal Statistical Society, Ser. B 48 (1986), 170-182. [53] A. J. Scott and C. J. Wild, Fitting regression models to case-control data by maximum likelihood. Biometrika 84 (1997), 57-71. [54] A. J. Scott and C. J. Wild, The robustness of the weighted methods for fitting models to case-control data. Journal of the Royal Statistical Society, Ser. B 64 (2002), 207-219. [55] S. G. Self and R. L. Prentice, Asymptotic distribution theory and efficiency results for case-cohort studies. The Annals of Statistics 16 (1988), 64-81. [56] T. M. Therneau and H. Li, Computing the Cox model for case-cohort designs. Lifetime Data Analysis 5 (1999), 99-112. [57] D. C. Thomas, In appendix to F. D. K. Liddell, J. C. McDonald and D. C. Thomas, Methods of cohort analysis: appraisal by application to asbesto mining. Journal of the Royal Statistical Society, Series A 140 (1977), 460-491. [58] B. Thorand, J. Baumer, H. Kolb, C. Meisinger, 1. Chambless, W. Koenig, C. Herder, Sex differences in the prediction of type 2 diabetes by inflammatory markers: results from the MONICA/KORA Augsburg case-cohort study, 1984-2002. Diabetes Care 30 (2007), 854-860. [59] J. W. Cai and D. Zeng, Sample size/power calculation for case-cohort studies. Biometrics 60 (2004), 1015-1024. [60] E. White, A two stage design for the study of the relationship between a rare exposure and a rare disease. American Journal of Epidemiology 115 (1982), 119-128. [61] K. Wu, D. Feskanich, C. S. Fuchs, W. C. Willett, B. W. Hollis, and E. L.
Chapter 6
Statistical Methods for Analyzing Two-Phase Studies
155
Giovannucci, A nested caseCcontrol study of plasma 25-Hydroxyvitamin D concentrations and risk of colorectal cancer. Journal of the National Cancer Institute 99 (2007), 1120-1129. [62] D. Zeng and D. Y. Lin, Maximum likelihood estimation in semiparametric regression models with censored data (with discussion). Journal of the Royal Statistical Society, Ser. B, 69 (2007), 507-564.
This page intentionally left blank
Part III
Bioinformatics
Chapter 7 Protein Interaction Predictions from Diverse Sources Yin Liu*
Inyoung Kim†
Hongyu Zhao‡
Abstract Protein-protein interactions play an important role in many cellular processes. Recent advances in high-throughput experimental technologies have generated enormous amounts of data and provided valuable resources for studying protein interactions. However, these technologies suffer from high error rates due to their inherent limitations. Therefore, statistical and computational approaches capable of incorporating multiple data sources are needed to fully take advantage of the rapid accumulation of data. In this chapter, we describe diverse data sources informative for protein interaction inference, as well as the computational methods that integrate these data sources to predict protein interactions. These methods either combine direct measurements on protein interactions from diverse organisms, or integrate different types of direct and indirect information on protein interactions from various genomic and proteomic approaches. Keywords: Protein-protein interactions, protein complex, domain, genomic feature, data integration
1 Introduction
Protein-protein interactions (PPI) play a critical role in the control of most cellular processes, such as signal transduction, gene regulation, cell cycle control and metabolism. Within the past decade, genome-wide data on protein interactions in humans and many model species have become available (Giot et al., 2003; Ito et al., 2001; LaCount et al., 2005; Li et al., 2004; Rual et al., 2005; Uetz et al., 2000). In the meantime, a large amount of indirect biological information on protein interactions, including sequence and functional annotation, protein localization information and gene expression measurements, has also become available.
*Department of Neurobiology and Anatomy, University of Texas Health Science Center at Houston, 6431 Fannin Street, Houston, TX 77030, USA. E-mail: [email protected]
†Department of Statistics, Virginia Tech, 410-A Hutcheson Hall, Blacksburg, VA 24061, USA.
‡Department of Genetics, Department of Epidemiology and Public Health, Yale University School of Medicine, 60 College Street, New Haven, CT 06520, USA. E-mail: hong [email protected]
These data, however, are far from complete and contain many false negatives and false positives. Current protein interaction information obtained from experimental methods covers only a fraction of the complete PPI networks (Hart et al., 2006; Huang et al., 2007; Scholtens et al., 2007); therefore, there is a great need to develop robust statistical methods capable of identifying and verifying interactions between proteins. In the past several years, a number of methods have been proposed to predict protein-protein interactions based on various data types. For example, with genomic information available, the Rosetta stone method predicts interacting proteins based on the observation that some single-domain proteins in one organism can be fused into a multiple-domain protein in other organisms (Marcotte et al., 1999; Enright et al., 1999). The phylogenetic profile method is based on the hypothesis that interacting proteins tend to co-evolve, so that their respective phylogenetic trees are similar (Pazos et al., 2003). The concept of "interolog", which refers to homologous interacting protein pairs among different organisms, has also been used to identify protein interactions (Matthews et al., 2001). The gene neighborhood method is based on the observation that functionally related genes encoding potential interacting proteins are often transcribed as an operon (Overbeek et al., 1999). With genome-wide gene expression measurements available, some methods look for gene co-expression patterns among interacting proteins (Jansen et al., 2002). Based on protein structural information, the protein docking methods use geometric and steric considerations to fit multiple proteins of known structure into a bound protein complex and study interacting proteins at the atomic level (Comeau et al., 2004). Moreover, some methods analyze protein interactions at the domain level, considering protein domains as structural and functional units of proteins (Sprinzak and Margalit, 2001; Deng et al., 2002; Gomez et al., 2003). Each of these methods focuses on a single dataset, either a direct measurement of protein interaction or an indirect genomics dataset that contains information on protein interaction, so it is not surprising that they have certain limitations; detailed reviews of these individual methods can be found elsewhere (Shi et al., 2005; Shoemaker and Panchenko, 2007; Bonvin et al., 2006).

In this chapter, we first introduce the data sources useful for protein interaction predictions, and then survey different approaches that integrate these data sources. We focus on three types of approaches: (1) domain-based methods, which predict protein physical interactions based on domain interactions by integrating direct protein interaction measurements from multiple organisms; (2) classification methods, which predict both protein physical interactions and protein co-complex relationships by integrating different types of data sources; (3) complex-detection methods, which only predict protein co-complex membership by identifying protein complexes from integrated protein interaction data and other genomic data.
2 Data sources useful for protein interaction predictions
Different genomic data, such as DNA sequence, functional annotation and protein localization information, have been used for predicting protein interactions. Each piece of information provides insights into a different aspect of protein interactions and thus covers a different subset of the whole interactome. Depending on the data sources, these genomic data can be divided into four categories. First, high-throughput protein interaction data obtained from yeast two-hybrid (Y2H) and mass spectrometry of affinity purified protein complexes (AP/MS) techniques provide direct protein interaction information and protein co-complex membership information, respectively. Second, functional genomic data such as gene expression and Gene Ontology (GO) annotation provide information on functional relationships between proteins. Third, sequence- and structure-based data reveal the sequence/structure homology and chromosome location of genes. Finally, network topological parameters calculated from Y2H and AP/MS data characterize the topological properties of the currently available protein interaction network. For example, the small-world clustering coefficient of a protein pair, calculated as the p-value from a hypergeometric distribution, measures the similarity between the neighbors of the protein pair in a network (Goldberg et al., 2003). A list of these data sources, along with their proteome coverage, is summarized in Table 1.

The effects of these data sources on the prediction performance depend on how the data are encoded. With many datasets for a single data type collected under different experimental conditions, there are two ways to encode the datasets: the "detailed" encoding strategy treats every experiment separately, while the "summary" encoding strategy groups all experiments belonging to the same feature together and provides a single value (Qi et al., 2006). For example, when all the gene expression datasets are used as one data source, the "summary" encoding strategy generates a single similarity score for each pair of proteins, whereas the "detailed" encoding strategy yields multiple similarity scores for each protein pair, with one score computed from each gene expression dataset. Another challenge of the integrated studies is the quality of the data to be integrated: it is well known that prediction power and accuracy can be decreased by including irrelevant and "noisy" data. Among all the data sources, whole-genome gene expression data are currently the largest source of high-throughput genomic information and are considered the most important data source according to the variable importance criterion in a random forest-based method (Qi et al., 2006). Following gene expression, function annotation from Gene Ontology, which covers about 80% of the yeast proteome, is the second most important data source according to the importance measure obtained from the random forest method. Although some other studies indicate that MIPS and GO annotation are more important than gene expression in prediction (Jansen et al., 2003; Lin et al., 2004; Lu et al., 2005), this may be due to the different numbers of gene expression datasets used and the different encoding styles of the features ("summary" vs. "detailed").
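To make the two encoding strategies concrete, the short sketch below encodes the co-expression evidence for one protein pair both ways. Everything in it is an illustrative assumption: the expression matrices are simulated, and the use of Pearson correlation per study with a simple average as the summary value stands in for whatever similarity measure a given study actually uses.

import numpy as np

# Hypothetical inputs: three independent expression studies, rows = genes,
# columns = conditions (shapes and values are simulated for illustration).
rng = np.random.default_rng(0)
expr_studies = [rng.normal(size=(5, 12)), rng.normal(size=(5, 8)), rng.normal(size=(5, 20))]
gene_i, gene_j = 0, 1  # the protein (gene) pair being encoded

# "Detailed" encoding: one co-expression feature per experiment/study.
detailed = [float(np.corrcoef(X[gene_i], X[gene_j])[0, 1]) for X in expr_studies]

# "Summary" encoding: collapse all studies of the same feature into one value,
# here simply the average correlation across studies.
summary = float(np.mean(detailed))

print("detailed features:", [round(d, 3) for d in detailed])
print("summary feature:", round(summary, 3))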
Table 1. Useful data sources for predicting protein interactions

Category | Data Type Abbreviation | Data Type | Proteome Coverage (%) | Data Source
Protein Interaction Data | Y2H | Yeast two-hybrid data | 60 | Uetz et al. (2000); Ito et al. (2000)
Protein Interaction Data | MS | Protein complex data | 64 | Gavin et al. (2006); Krogan et al. (2006)
Functional Genomics Data | GE | Gene expression | 100 | Demeter et al. (2007)
Functional Genomics Data | FUN | GO molecular function | 62 | GO consortium (2006)
Functional Genomics Data | PRO | GO biological process | 70 | GO consortium (2006)
Functional Genomics Data | COM | GO cellular component | 72 | GO consortium (2006)
Functional Genomics Data | CLA | MIPS protein class | 80 | Mewes et al. (2006)
Functional Genomics Data | PHE | Mutant phenotype | 24 | Mewes et al. (2006)
Functional Genomics Data | PE | Protein expression | 65 | Ghaemmaghami et al. (2003)
Functional Genomics Data | ESS | Co-essentiality | 67 | Mewes et al. (2006)
Functional Genomics Data | MAE | Marginal essentiality | 99 | Lu et al. (2005)
Functional Genomics Data | GI | Genetic interaction | 24 | Tong et al. (2004)
Functional Genomics Data | TR | Transcription regulation | 98 | Harbison et al. (2004)
Sequence/Structure Information | GF | Gene fusion | 19 | Lu et al. (2005)
Sequence/Structure Information | GN | Gene neighborhood | 22 | Lu et al. (2005)
Sequence/Structure Information | PP | Phylogenetic profile | 29 | Lu et al. (2005)
Sequence/Structure Information | SEQ | Sequence similarity | 100 | Qi et al. (2006)
Sequence/Structure Information | INT | Interolog | 100 | Qi et al. (2006)
Sequence/Structure Information | DD | Domain-domain interaction | 65 | Qi et al. (2006)
Sequence/Structure Information | COE | Co-evolution scores | 22 | Goh et al. (2002)
Sequence/Structure Information | THR | Threading scores | 21 | Lu et al. (2005)
Sequence/Structure Information | PF | Protein fold | 26 | Sprinzak et al. (2005)
Network Topological Parameters | CLU | Small-world clustering coefficients* | NA | Bader et al. (2004); Sharan et al. (2005)

Proteome Coverage: percentage of proteins in S. cerevisiae that are annotated by this data type. Data Source: datasets used for each data type; only the references using the datasets with the highest coverage are listed. NA, not applicable. *The clustering coefficients are calculated according to protein interaction data.
Nonetheless, gene expression, interaction data from Co-IP experiments, and functional annotation from MIPS and GO are considered the most important data sources in several studies according to the feature importance analysis (Jansen et al., 2003; Lin et al., 2004; Lu et al., 2005; Qi et al., 2006). This also suggests that a small number of important data sources may be sufficient to predict protein interactions and that the prediction performance may not be improved by adding additional weak data sources.
When the datasets are noisy or have low proteome coverage, they may make little contribution to the prediction and may lead to biased prediction results. For example, the structural information of proteins is potentially useful for protein interaction prediction, as demonstrated by the "protein-protein docking" approach, which assembles protein complexes based on the three-dimensional structural information of individual proteins. However, the availability of protein structural information is very limited, and the structures of individual proteins in an unbound state differ from those in a bound complex; therefore, prediction of protein interactions based on structural information alone is not reliable (Bonvin et al., 2006). Furthermore, some genomic datasets may not have a strong association with protein interactions. For example, the feature "transcription regulation" lists the genes co-regulated by the same set of transcription factors. Although it has been shown that co-regulated genes often function together through protein interactions, the proteins encoded by co-regulated genes do not necessarily interact (Yu et al., 2003).
3 Domain-based methods
Note that only the Y2H experimental technique provides direct evidence of physical protein-protein interactions in high-throughput analyses. As a result, we start with the methods that use data generated from the Y2H technique only. The Y2H experimental approach suffers from high error rates; for example, the false negative rate is estimated to be above 0.5 (Huang et al., 2007; Scholtens et al., 2007). Due to the large number of possible non-interacting protein pairs, the false positive rate, defined as the ratio of the number of incorrect interactions observed over the total number of non-interacting protein pairs, is small and is estimated to be about $1\times10^{-3}$ or less (Chiang et al., 2007). But the false discovery rate, defined as the ratio of the number of incorrect interactions observed over the total number of observed interactions, is much greater and is estimated to be 0.2 to 0.5 (Huang et al., 2007), indicating that a large portion of the observations from the Y2H technique are incorrect. As Y2H data have become available for many model organisms, such as yeast, worm, fruit fly and humans, several computational methods have been developed to borrow information from diverse organisms to improve the accuracy of protein interaction prediction. Noting that domains are structural and functional units of proteins and are conserved during evolution, these methods aim to identify specific domain pairs that mediate protein interactions, utilizing domain information as the evolutionary connection among these organisms.
3.1 Maximum likelihood-based methods (MLE)
The Maximum Likelihood Estimation (MLE) method, coupled with the Expectation-Maximization (EM) algorithm, has been developed to estimate the probabilities of domain-domain interactions given a set of experimental protein interaction data (Deng et al., 2002). It was originally used to analyze the Y2H data from a single organism, S. cerevisiae, only. More recently, the MLE method was extended beyond the scope of a single organism by incorporating protein interaction data
from three different organisms, S. cerevisiae, C. elegans and D. melanogaster, assuming that the probability that two domains interact is the same among all organisms (Liu et al., 2005). It was shown that the integrated analysis provides more reliable inference of protein-protein interactions than the analysis from a single organism (Liu et al., 2005). The details of the method are briefly described as follows.

Let $\lambda_{mn}$ represent the probability that domain $m$ interacts with domain $n$, and let the notation $(D_{mn} \in P_{ijk})$ denote all pairs of domains from protein pair $i$ and $j$ in organism $k$, where $k = 1, 2, \ldots, K$ and $K$ is the number of organisms. Let $P_{ijk}$ represent the protein pair $i$ and $j$ in organism $k$, with $P_{ijk} = 1$ if protein $i$ and protein $j$ in organism $k$ interact with each other, and $P_{ijk} = 0$ otherwise. Further let $O_{ijk} = 1$ if an interaction between protein $i$ and $j$ is observed in organism $k$, and $O_{ijk} = 0$ otherwise. The false negative rate ($fn$) and false positive rate ($fp$) of the protein interaction data are defined as
$$ fn = \Pr(O_{ijk} = 0 \mid P_{ijk} = 1) \quad\text{and}\quad fp = \Pr(O_{ijk} = 1 \mid P_{ijk} = 0). $$
We further define $O = \{O_{ijk} = o_{ijk},\ \forall\, i \leqslant j\}$ and $\Lambda = \{\lambda_{mn};\ D_{mn} \in P_{ijk},\ \forall\, m \leqslant n,\ i \leqslant j\}$. With the above assumptions and notation, we have
$$ \Pr(P_{ijk} = 1) = 1 - \prod_{(D_{mn} \in P_{ijk})} (1 - \lambda_{mn}). $$
The probability that protein $i$ and $j$ in organism $k$ are observed to be interacting is given by
$$ \Pr(O_{ijk} = 1) = \Pr(P_{ijk} = 1)(1 - fn) + \big(1 - \Pr(P_{ijk} = 1)\big)\, fp. \qquad (3.1) $$
The likelihood for the observed PPI data across all $K$ organisms is then
$$ L(fn, fp, \Lambda \mid O) = \prod_{ijk} \Pr(O_{ijk} = 1)^{o_{ijk}} \{1 - \Pr(O_{ijk} = 1)\}^{1 - o_{ijk}}, \qquad (3.2) $$
which is a function of $(\lambda_{mn}, fn, fp)$. Deng et al. (2002) and Liu et al. (2005) specified values for $fn$ and $fp$ based on prior biological knowledge, and the $\lambda_{mn}$ were then estimated using the EM algorithm by treating all DDIs and PPIs as missing data as follows. For a given $\lambda_{mn}^{(t-1)}$ obtained from the $(t-1)$-th EM iteration, the next E-step can be computed as
$$ E\big(D_{mn}^{(ijk)} \mid O_{ijk} = o_{ijk}, \lambda_{mn}^{(t-1)}\big) = \frac{\lambda_{mn}^{(t-1)} (1 - fn)^{o_{ijk}}\, fn^{(1 - o_{ijk})}}{\Pr\big(O_{ijk} = o_{ijk} \mid \lambda_{mn}^{(t-1)}\big)}. \qquad (3.3) $$
With the expectations of the complete data, in the M-step we update $\lambda_{mn}$ by
$$ \lambda_{mn}^{(t)} = \frac{1}{N_{mn}} \sum \frac{\lambda_{mn}^{(t-1)} (1 - fn)^{o_{ijk}}\, fn^{(1 - o_{ijk})}}{\Pr\big(O_{ijk} = o_{ijk} \mid \lambda_{mn}^{(t-1)}\big)}, \qquad (3.4) $$
where $N_{mn}$ is the total number of protein pairs containing domain pair $(m, n)$ across the three organisms, and the summation is over all these protein pairs.
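A compact sketch of this EM recursion is given below. It implements equations (3.1)-(3.4) directly, with $fn$ and $fp$ held fixed as in the text; the data structure (each protein pair carries its candidate domain pairs and an observed indicator), the toy data and all numeric settings are illustrative assumptions rather than the actual implementation of Deng et al. (2002) or Liu et al. (2005).

import numpy as np

def em_domain_interactions(pairs, fn, fp, n_iter=100):
    # pairs: list of (domain_pairs, o) per protein pair (i, j, k), where
    # domain_pairs lists the candidate domain pairs (m, n) contained in that
    # protein pair and o is the observed interaction indicator o_ijk.
    all_dps = sorted({dp for dps, _ in pairs for dp in dps})
    lam = {dp: 0.1 for dp in all_dps}                                  # initial lambda_mn
    n_mn = {dp: sum(dp in dps for dps, _ in pairs) for dp in all_dps}  # N_mn

    for _ in range(n_iter):
        expected = {dp: 0.0 for dp in all_dps}
        for dps, o in pairs:
            pr_p1 = 1.0 - np.prod([1.0 - lam[dp] for dp in dps])
            pr_o1 = pr_p1 * (1.0 - fn) + (1.0 - pr_p1) * fp            # eq. (3.1)
            pr_obs = pr_o1 if o == 1 else 1.0 - pr_o1
            for dp in dps:                                             # E-step, eq. (3.3)
                expected[dp] += lam[dp] * (1 - fn) ** o * fn ** (1 - o) / pr_obs
        lam = {dp: expected[dp] / n_mn[dp] for dp in all_dps}          # M-step, eq. (3.4)
    return lam

# Toy data: nine protein pairs share the candidate domain pair (D1, D2);
# three of them are observed to interact.
toy = [([("D1", "D2")], 1)] * 3 + [([("D1", "D2")], 0)] * 6
print(em_domain_interactions(toy, fn=0.5, fp=1e-3))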
3.2 Bayesian methods (BAY)
Unlike the likelihood-based approaches described above, the false negative and false positive rates of the observed protein interaction data are treated as unknown in the Bayesian methods, so that the domain interaction probabilities, the false positive rates and the false negative rates of the observed data can be estimated simultaneously (Kim et al., 2007). The error rates are treated as organism dependent to allow different organisms to have different false negative and false positive rates. In this case, equation (3.1) is replaced by
$$ \Pr(O_{ijk} = 1) = \Pr(P_{ijk} = 1)(1 - fn_k) + \big(1 - \Pr(P_{ijk} = 1)\big)\, fp_k. \qquad (3.5) $$
It was assumed that $fn_k \sim \mathrm{unif}[u_{nk}, v_{nk}]$ and $fp_k \sim \mathrm{unif}[u_{pk}, v_{pk}]$. Further, $\lambda_{mn}$ was assumed to have a prior distribution $\lambda_{mn} \sim \pi\,\delta_0(\lambda_{mn}) + (1 - \pi)\, B(\alpha, \beta)$, where $\delta_0(\cdot)$ is a point mass at zero. The full conditional distribution of $\Lambda$ is proportional to
$$ [\Lambda \mid \text{rest}] \propto L(O \mid fn_k, fp_k, \Lambda)\, f(\Lambda \mid fn_k, fp_k) \propto \prod_{ijk} \big[h_{ij}(\Lambda)(1 - fn_k) + \{1 - h_{ij}(\Lambda)\}\, fp_k\big]^{o_{ijk}} \big[1 - \{h_{ij}(\Lambda)(1 - fn_k) + (1 - h_{ij}(\Lambda))\, fp_k\}\big]^{1 - o_{ijk}} f(\Lambda \mid fn_k, fp_k). $$
The full conditional distributions of $fn_k$ and $fp_k$ have the same form, with the prior factor replaced by $f(fn_k \mid \Lambda, fp_k)$ and $f(fp_k \mid \Lambda, fn_k)$, respectively.

The function $h_{ij}(\Lambda)$ relates the probability that protein pair $i$ and $j$ interact to the domain interaction probabilities; in the studies of Deng et al. (2002) and Liu et al. (2005), $h_{ij}(\Lambda)$ was taken to be $\Pr(P_{ijk} = 1)$ as defined above. Under this choice, the full conditional distributions of $\lambda_{mn}$, $fn_k$ and $fp_k$ are log-concave functions. The function can also be formulated in an alternative way to incorporate the varying number of domains across different proteins (Kim et al., 2007). The full conditional distributions of $fn_k$ and $fp_k$ are then still log-concave, whereas the full conditional distribution of $\lambda_{mn}$ is no longer log-concave. In this case, adaptive rejection Metropolis sampling can be used to generate the posterior samples for $\lambda_{mn}$.
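The sketch below illustrates the sampling idea in a deliberately simplified form: the point-mass-at-zero mixture prior of Kim et al. (2007) is replaced by a plain Beta prior, $fn$ and $fp$ are held fixed rather than sampled, and a random-walk Metropolis step is used instead of adaptive rejection Metropolis sampling. All data and settings are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)

def h(lam, dps):
    # Probability that a protein pair interacts given its domain pairs (Section 3.1).
    return 1.0 - np.prod([1.0 - lam[dp] for dp in dps])

def log_lik(lam, fn, fp, pairs):
    # Observed-data log likelihood for one organism, following eq. (3.5).
    ll = 0.0
    for dps, o in pairs:
        p = h(lam, dps) * (1.0 - fn) + (1.0 - h(lam, dps)) * fp
        ll += np.log(p) if o == 1 else np.log(1.0 - p)
    return ll

def metropolis_lambda(lam, dp, fn, fp, pairs, a=1.0, b=1.0, step=0.1):
    # One random-walk Metropolis update of lambda_mn under a Beta(a, b) prior.
    def log_post(v):
        trial = dict(lam)
        trial[dp] = v
        return log_lik(trial, fn, fp, pairs) + (a - 1) * np.log(v) + (b - 1) * np.log(1 - v)
    cur = lam[dp]
    prop = float(np.clip(cur + rng.normal(scale=step), 1e-6, 1 - 1e-6))
    if np.log(rng.uniform()) < log_post(prop) - log_post(cur):
        lam[dp] = prop
    return lam

# Toy data with a single candidate domain pair; draw posterior samples of lambda.
pairs = [([("D1", "D2")], 1)] * 3 + [([("D1", "D2")], 0)] * 6
lam, draws = {("D1", "D2"): 0.5}, []
for _ in range(2000):
    lam = metropolis_lambda(lam, ("D1", "D2"), fn=0.5, fp=1e-3, pairs=pairs)
    draws.append(lam[("D1", "D2")])
print("posterior mean of lambda:", round(float(np.mean(draws[500:])), 3))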
The performance of the MLE method and the BAY method can be assessed by comparing their sensitivities and specificities in predicting interacting domain pairs, as illustrated by their Receiver Operator Characteristic (ROC) curves (Figure 1a). The protein interaction dataset consists of 12,849 interactions from 3 organisms (Giot et al., 2003; Ito et al., 2001; Li et al., 2004; Uetz et al., 2000). Compared to the likelihood-based methods, the Bayesian-based methods may be more efficient in dealing with a large number of parameters and more effective in allowing for different error rates across different datasets (Kim et al., 2007).
[Figure 1 appears here: ROC curves, sensitivity (%) versus 1-specificity (%), for panels (a) MLE vs. BAY and (b) DPEA vs. PE.]
Figure 1: ROC curves of domain interaction prediction results from different methods.
Comparison of sensitivities and false positive rates (1 − specificity) obtained by different methods. MLE, maximum likelihood-based method (Liu et al., 2005); BAY, Bayesian method (Kim et al., 2007); DPEA, domain pair exclusion analysis (Riley et al., 2005); PE, parsimony explanation method (Guimaraes et al., 2006). a) MLE vs. BAY methods; b) DPEA vs. PE methods. The set of protein domain pairs in iPfam is used as the gold standard set (http://www.sanger.ac.uk/Software/Pfam/iPfam/). Here, the sensitivity is calculated as the number of predicted interacting domain pairs that are included in the gold standard set divided by the total number of domain pairs in the gold standard set. The false positive rate is calculated as the number of predicted interacting domain pairs that are not included in the gold standard set divided by the total number of possible domain pairs not included in the gold standard set. The area under the ROC curve is a measurement of prediction accuracy for each method.
3.3 Domain pair exclusion analysis (DPEA)
The maximum likelihood methods may preferentially detect domain pairs with a high frequency of co-occurrence in interacting protein pairs. To detect more specific domain interactions, Riley et al. (2005) modified the likelihood-based approach and developed the domain pair exclusion analysis. In this approach, two different measures, the $\theta$ values and the E-score, were estimated for each pair of domains. The $\theta$ value was obtained through the EM method, similarly to the one described above, corresponding to the probability that two domains interact with each other. This value was used as a starting point to compute the E-score for each domain pair $(m, n)$, which was defined as the change in likelihood of the observed protein interactions when the interaction between domain $m$ and $n$ was excluded from the model. The E-score of a domain pair $(m, n)$ is given by
$$ E_{mn} = \sum_{ijk} \log \frac{\Pr(O_{ijk} = 1 \mid \text{domain pair } m, n \text{ can interact})}{\Pr(O_{ijk} = 1 \mid \text{domain pair } m, n \text{ cannot interact})} = \sum_{ijk} \log \frac{1 - \prod_{(D_{kl} \in P_{ijk})} (1 - \theta_{kl})}{1 - \prod_{(D_{kl} \in P_{ijk})} (1 - \theta_{kl}^{mn})}. $$
Here, the notation $(D_{kl} \in P_{ijk})$ denotes all pairs of domains from protein pair $i$ and $j$ in organism $k$, $\theta_{kl}$ is the maximum likelihood estimate of the domain interaction probability for the set of domain pairs, and $\theta_{kl}^{mn}$ represents the same set of domain interaction probabilities, but re-estimated with the probability of domain pair $m$ and $n$ interacting set to zero; the EM algorithm was rerun to get the maximum likelihood estimates of the other domain interactions. In this way, the new values of $\theta_{kl}^{mn}$ were obtained under the condition that domain pair $m$ and $n$ do not interact. The E-score may help to identify specific interacting domain pairs that the MLE and BAY methods may fail to detect. This approach has been applied to all the protein interactions in the Database of Interacting Proteins (DIP) (Salwinski et al., 2004), assuming no false positives and false negatives.
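Once the two EM fits are in hand, the E-score of a domain pair is a simple sum of log-ratios; the sketch below evaluates it with made-up $\theta$ values standing in for the two sets of maximum likelihood estimates.

import numpy as np

def prob_interact(theta, dps):
    # P(O_ijk = 1) in the error-free DPEA setting: 1 - prod over (1 - theta_kl).
    return 1.0 - np.prod([1.0 - theta[dp] for dp in dps])

def e_score(theta_full, theta_excluded, interacting_pairs):
    # theta_full: MLEs of all domain interaction probabilities.
    # theta_excluded: MLEs re-estimated with theta_mn fixed at zero.
    # interacting_pairs: list of domain-pair lists, one per observed interaction.
    return float(sum(
        np.log(prob_interact(theta_full, dps)) - np.log(prob_interact(theta_excluded, dps))
        for dps in interacting_pairs
    ))

# Toy values (not from any real fit): excluding (D1, D2) forces the re-fit to
# inflate (D3, D4) in order to explain the same observed interactions.
theta_full = {("D1", "D2"): 0.4, ("D3", "D4"): 0.2}
theta_excl = {("D1", "D2"): 0.0, ("D3", "D4"): 0.5}
observed = [[("D1", "D2"), ("D3", "D4")]] * 3
print("E-score of (D1, D2):", round(e_score(theta_full, theta_excl, observed), 3))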
3.4 Parsimony explanation method (PE)
Using the same dataset constructed by Riley et al. (2005) from the DIP database, the parsimony approach formulates the problem of domain interaction prediction as a Linear Programming (LP) optimization problem (Guimaraes et al., 2006). In this approach, the LP score $x_{mn}$ associated with each domain pair is inferred by minimizing the objective function $\sum_{mn} x_{mn}$ subject to a set of constraints describing all interacting protein pairs $\{P_i, P_j\}$, which require that $\sum_{m \in P_i,\, n \in P_j} x_{mn} \geqslant 1$. In addition to the LP score, another measure, the pw-score, was defined. Similar to the E value in the DPEA method, the pw-score is used as an indicator to remove promiscuous domain pairs that occur frequently yet have few witnesses. Here, a witness of a domain pair is defined as an interacting protein pair containing only this domain pair. We also compare the performance of the DPEA method and the PE method using their ROC curves (Figure 1b). One reason for the improved performance of the PE method over the DPEA method could be that the DPEA method tends to assign higher probabilities to infrequent domain pairs in multi-domain proteins, which is avoided in the PE method (Guimaraes et al., 2006).
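As an illustration of the LP formulation alone (the pw-score and the removal of promiscuous domain pairs are not included), the sketch below solves a toy instance with scipy.optimize.linprog; the domain pairs and interacting protein pairs are hypothetical.

import numpy as np
from scipy.optimize import linprog

# Candidate domain pairs (the LP variables x_mn) and, for each observed
# interacting protein pair, the candidate domain pairs it contains.
domain_pairs = [("D1", "D2"), ("D3", "D4")]
interacting = [[("D1", "D2")], [("D1", "D2"), ("D3", "D4")], [("D1", "D2")]]

idx = {dp: j for j, dp in enumerate(domain_pairs)}
c = np.ones(len(domain_pairs))                        # minimize sum_mn x_mn
A_ub = np.zeros((len(interacting), len(domain_pairs)))
for i, dps in enumerate(interacting):                 # each interacting pair must be
    for dp in dps:                                    # "explained": sum of its x_mn >= 1,
        A_ub[i, idx[dp]] = -1.0                       # written as -sum <= -1 for linprog
b_ub = -np.ones(len(interacting))

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * len(domain_pairs), method="highs")
print(dict(zip(domain_pairs, np.round(res.x, 3))))    # LP scores x_mn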
All the methods described in this section focus on estimating domain interactions by pooling protein interaction information from diverse organisms. Box 1 illustrates the differences among these methods in identifying interacting domain pairs. The estimated domain interactions can then be used for protein interaction prediction by relating proteins to their associated domains. We expect that the results from these domain interaction prediction methods can be further improved when domain information is more reliably annotated in the future. The current information on domain annotation is incomplete. For example, only about two-thirds of the proteins and 73% of the sequences in the yeast proteome are annotated with domain information in the latest release of the PFAM database (Finn et al., 2006). As a result, prediction based on domain interactions will only be able to cover a portion of the whole interactome. To overcome this limitation, a Support Vector Machine learning method was developed (Martin et al., 2005). Instead of using the incomplete domain annotation information, this method uses signature products, which represent the protein primary sequence only, for protein interaction prediction (Martin et al., 2005). In addition to the incomplete domain annotation, we note another limitation of these domain interaction prediction methods: their accuracy and reliability highly depend on the protein interaction data. Although the prediction accuracy can be improved by integrating data from multiple organisms, the protein interaction data themselves may be far from complete. Therefore, efforts have been made to integrate other types of biological information, and these are briefly reviewed next.

Box 1. Identification of domain interactions under different scenarios

Note: P1 represents protein 1, D1 represents domain 1, and P1 = {D1} indicates that protein 1 contains only domain 1.

Scenario 1: Protein-domain relationship:
P1 = {D1}, P2 = {D2}, P3 = {D1}, P4 = {D2}, P5 = {D1}, P6 = {D2}.
Observed protein interaction data:
P1 ↔ P2, P1 ↔ P4, P1 ↔ P6, P2 ↔ P3, P2 ↔ P5, P3 ↔ P4, P3 ↔ P6, P4 ↔ P5, P5 ↔ P6.
Under this scenario, each protein containing domain 1 interacts with each protein containing domain 2. All four methods, MLE, BAY, DPEA and PE, can identify the interaction between domain 1 and domain 2.

Scenario 2: Protein-domain relationship:
P1 = {D1}, P2 = {D2}, P3 = {D1}, P4 = {D2}, P5 = {D1}, P6 = {D2}.
Observed protein interaction data:
P1 ↔ P2, P3 ↔ P4, P5 ↔ P6.
Under this scenario, only a small fraction (3/9) of the protein pairs containing domain pair (1, 2) interact. Both the MLE and BAY methods may not be able to identify the interaction between domains 1 and 2. But if the interaction of domains 1 and 2 is excluded, the likelihood of the observed protein interaction data is lower than under the condition that domains 1 and 2 interact, which leads to a high E-score for this domain pair in the DPEA method; therefore, DPEA can detect the interaction. For the PE method, the interaction between domains 1 and 2 is the only explanation for the observed data, so it can be detected by the PE method as well.

Scenario 3: Protein-domain relationship:
P1 = {D1, D3}, P2 = {D2, D4}, P3 = {D1}, P4 = {D2}, P5 = {D1}, P6 = {D2}.
Observed protein interaction data:
P1 ↔ P2, P3 ↔ P4, P5 ↔ P6.
Under this scenario, as the only protein pair (P1, P2) containing domains 3 and 4 interacts, both the MLE and BAY methods can detect the interaction between domains 3 and 4. Because only a small fraction (3/9) of the protein pairs containing domain pair (1, 2) interact, the interaction between domains 1 and 2 may not be detected by the MLE and BAY methods. The DPEA method may detect the interaction of domain pair (1, 2), and of domain pair (3, 4) as well, because excluding either domain pair decreases the likelihood of the observed data. For the PE method, because the interaction between domain pair (1, 2) explains the observed data with the smallest number of domain pairs, this interaction is preferred over the interaction between domain pair (3, 4).
4 Classification methods

4.1 Integrating different types of genomic information
Many computational approaches, ranging from a simple union or intersection of features to more sophisticated machine learning methods, have been applied to integrate different types of genomic information for protein interaction prediction. These approaches aim at performing two major tasks: predicting direct physical interactions between two proteins, and predicting whether a pair of proteins are in the same complex, which is a more general definition of protein interaction. Many classification methods have been applied to perform these two tasks by integrating different types of data sources. These methods select some "gold standard" datasets as training data to obtain a predictive model, and then apply the model to test datasets for protein interaction prediction. For instance, Jansen et al. (2003) was among the first to apply a Naive Bayes classifier using genomic information including mRNA expression data, localization, essentiality and functional annotation for predicting protein co-complex relationships. Based on the same set of data, Lin et al. (2004) applied two other classifiers, logistic regression and random forest, and demonstrated that random forest outperforms the other two and that logistic regression performs similarly to the Naive Bayes method. The logistic regression approach, based on functional and network topological properties,
has also been used to evaluate the confidence of the protein-protein interaction data previously obtained from both Y2H and AP/MS experiments (Bader et al., 2004). More recently, Sharan et al. (2005) implemented a similar logistic regression model, incorporating different features, to estimate the probability that a pair of proteins interact. Kernel methods incorporating multiple sources of data, including protein sequences, local properties of the network and homologous interactions in other species, have been developed for predicting direct physical interactions between proteins (Ben-Hur and Noble, 2005). A summary of some published methods along with their specific prediction tasks is listed in Table 2.

Table 2. Summary of previous methods integrating multiple data sources for protein interaction prediction
Task | Methods | Data Sources Used | References
Protein physical interaction | Random Forest | Y2H, MS, GE, FUN, PRO, COM, CLA, PHE, PE, ESS, GI, TR, GF, GN, PP, SEQ, INT, DD | Qi et al. (2006)
Protein physical interaction | k-Nearest Neighbor | Y2H, MS, GE, FUN, PRO, COM, CLA, PHE, PE, ESS, GI, TR, GF, GN, PP, SEQ, INT, DD | Qi et al. (2006)
Protein physical interaction | Logistic Regression | Y2H, MS, GE, CLU | Bader et al. (2004); Sharan et al. (2005)
Co-complex membership (of a protein pair) | Random Forest | Y2H, MS, GE, PRO, CLA, ESS | Lin et al. (2004)
Co-complex membership (of a protein pair) | k-Nearest Neighbor | Y2H, MS, GE, FUN, PRO, COM, CLA, PHE, PE, ESS, GI, TR, GF, GN, PP, SEQ, INT, DD | Qi et al. (2006)
Co-complex membership (of a protein pair) | Support Vector Machine | Y2H, MS, GE, FUN, PRO, COM, CLA, PHE, PE, ESS, GI, TR, GF, GN, PP, SEQ, INT, DD | Qi et al. (2006)
Co-complex membership (of a protein pair) | Logistic Regression | GE, PRO, COM, TR, GF, GN, PP, DD, PF | Sprinzak et al. (2005)
Co-complex membership (of a protein pair) | Naive Bayes | GE, PRO, CLA, ESS | Jansen et al. (2003)
Co-complex membership (of a protein pair) | Decision Tree | Y2H, MS, GE, COM, PHE, TR, SEQ, GF, GN, PP | Zhang et al. (2004)
Protein complex detection | BH-subgraph | Y2H, MS, COM | Scholtens et al. (2004, 2005)
Protein complex detection | SAMBA | Y2H, MS, GE, PHE, TR | Tanay et al. (2004)
Domain interactions^a | Naive Bayes | Y2H, FUN, PRO, GF | Lee et al. (2006)
Domain interactions^a | Evidence Counting | Y2H, FUN, PRO, GF | Lee et al. (2006)

Data Sources Used: the list of abbreviated data sources used by each method. There may be several different implementations using different sets of data sources for each method. When using the same set of data sources, these methods may be sorted according to their performance, as shown in this table, with the best-performing methods listed at the top. For predicting protein physical interactions and protein pairwise co-complex membership, the comparison is based on the precision-recall curves and the Receiver Operator Characteristic (ROC) curves of these methods with the "summary" encoding (Qi et al., 2006). For predicting domain interactions, the comparison is based on the ROC curves of these methods (Lee et al., 2006). There is no systematic comparison of the two protein complex detection methods that integrate heterogeneous datasets. ^a This task focuses on predicting domain-domain interactions instead of protein interactions. The Y2H data here are obtained from multiple organisms, and the data source GF represents the domain fusion event instead of the gene fusion event used for the other tasks.
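Most of the supervised methods in Table 2 share the same skeleton: encode each protein pair as a feature vector, label the pairs with a gold standard set, train a classifier, and inspect the variable importance. The sketch below shows that skeleton with scikit-learn's random forest; the feature matrix, labels and parameter values are simulated placeholders, not any published dataset or model.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# One row per protein pair, one column per encoded data source (e.g. a
# co-expression score, a GO similarity, ...); labels mimic a gold standard.
X = rng.normal(size=(500, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=500) > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Variable importance is the quantity used above to rank data sources.
print("feature importances:", np.round(clf.feature_importances_, 3))
print("test accuracy:", round(clf.score(X_test, y_test), 3))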
4.2 Gold standard datasets
The gold standard datasets defined for training purposes affect the relative performance of different classification methods, as discussed at the first DREAM (Dialogue on Reverse Engineering Assessments and Methods) conference (Stolovitzky et al., 2007). Protein pairs obtained from the DIP database (Salwinski et al., 2004) and the protein complex catalog obtained from the MIPS database (Mewes et al., 2006) are the two most widely used sets of gold standard positives. While DIP focuses on physically interacting protein pairs, the MIPS complex catalog captures protein complex membership: it lists protein pairs that are in the same complex but do not necessarily have physical interactions. Therefore, when predicting physical protein interactions, the set of co-complex protein pairs from MIPS is not a good means of assessing method accuracy. Two groups of protein pairs have been most commonly treated as gold standard negatives in the literature: random/all protein pairs not included in the gold standard positives, and protein pairs annotated to be in different subcellular localizations. Neither of them is perfect: the first strategy may include true interacting protein pairs, leading to increased false positives, while the second may lead to increased false negatives when a multifunctional protein is active in multiple subcellular compartments but has only limited localization annotation. Protein pairs whose shortest path lengths exceed the median shortest path for random protein pairs in a protein network constructed from experimental data have been treated as negative samples as well (Bader et al., 2004). However, this strategy may not be reliable, as the experimental data are in general noisy and incomplete.
4.3 Prediction performance comparison
The availability of such a wide range of classification methods requires a comprehensive comparison among them. One strategy is to validate prediction results according to the similarity of interacting proteins in terms of function, expression, sequence conservation, and so on, as these are all shown to be associated with true protein interactions. The degree of the similarities can be measured to compare the performance of prediction methods (Stolovitzky et al., 2007; Suthram et al., 2006). However, as the similarity measures are usually used as input features in the integrated analysis, another widely applied strategy for comparison uses Receiver Operator Characteristic (ROC) or Precision-Recall (PR) curves. ROC curves plot the true positive rate vs. the false positive rate, while PR curves plot precision
(the fraction of prediction results that are true positives) vs. recall (the true positive rate). When dealing with highly skewed datasets (e.g., when the number of positive examples is much smaller than that of negative examples), PR curves can demonstrate differences between prediction methods that may not be apparent in ROC curves (Davis and Goadrich, 2007). However, because the precision does not necessarily change linearly, interpolating between points in a PR curve is more complicated than in a ROC curve, where a straight line can be used to connect points (Davis and Goadrich, 2007). A recent study evaluated the predictive power and accuracy of different classifiers, including random forest (RF), RF-based k-nearest neighbor (kRF), Naive Bayes (NB), Decision Tree (DT), Logistic Regression (LR) and Support Vector Machines, and demonstrated in both PR and ROC curves that RF performs the best for both the physical interaction prediction and the co-complex protein pair prediction problems (Qi et al., 2006). Due to its randomization strategy, an RF-based approach can maintain prediction accuracy when data are noisy and contain many missing values. Moreover, the variable importance measures obtained from the RF method help determine the most relevant features for the integrated analysis. One point to note, though, is that RF variable importance measures may be biased in situations where potential variables vary in their scale of measurement or their number of categories (Strobl et al., 2007). Although the NB classifier allows the derivation of a probabilistic structure for protein interactions and is flexible for combining heterogeneous features, it was the worst performer among the six classifiers. The relatively poor performance could be due to its assumption of conditional independence between features, which may not hold, especially when many correlated features are used. A boosted version of simple NB, which is resistant to feature dependence, significantly outperforms the simple NB, as demonstrated in a control experiment with highly dependent features, indicating the limitation of simple NB. Although a recent study showed no significant correlations between a subset of features using Pearson correlation coefficients and mutual information as measures of correlation, no statistical significance level was measured in that study (Lu et al., 2005). Therefore, we cannot exclude the possibility that genomic features are often correlated with each other, especially when the "detailed" encoding strategy is used. Logistic regression generally predicts a binary outcome or estimates the probability of a certain event, but its relatively poor performance with "detailed" features could be due to the small size of the gold standard positives currently available for training relative to the number of features, known as the problem of over-fitting (Qi et al., 2006). It was shown that when the training size increases, the logistic regression method becomes more precise and leads to improved prediction. However, the RF method still outperforms LR even when a larger training set is used, indicating the superiority of the RF method compared to other methods (Qi et al., 2006).
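The contrast between ROC and PR curves on skewed data is easy to reproduce; the sketch below computes both with scikit-learn on a deliberately imbalanced simulated score set (the 1% class prior, the score distributions and the trapezoidal PR summary are illustrative choices only).

import numpy as np
from sklearn.metrics import auc, precision_recall_curve, roc_curve

rng = np.random.default_rng(3)

# Highly skewed labels (about 1% positives) with noisy prediction scores.
y = (rng.uniform(size=20000) < 0.01).astype(int)
scores = np.where(y == 1, rng.normal(1.0, 1.0, size=y.size), rng.normal(0.0, 1.0, size=y.size))

fpr, tpr, _ = roc_curve(y, scores)
prec, rec, _ = precision_recall_curve(y, scores)

print("area under ROC curve:", round(auc(fpr, tpr), 3))
# Caveat from the text: linear interpolation between PR points is not valid,
# so this trapezoidal number is only a rough summary of the PR curve.
print("approximate area under PR curve:", round(auc(rec[::-1], prec[::-1]), 3))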
5 Complex detection methods
Unlike the methods described above, which focus on predicting the pairwise relationship between two proteins, the complex detection methods aim to identify multi-protein complexes consisting of a group of proteins. Since the affinity purification coupled with mass spectrometry (AP/MS) technique provides direct information on protein co-complex membership on a large scale, this type of data has been used extensively for protein complex identification. However, this technique is subject to many experimental errors, such as missing weak and transient interactions among proteins. Therefore, many methods have been proposed to integrate other types of data for protein complex identification.
5.1 Graph theoretic based methods
Protein interaction data can be modeled as an interaction graph, with nodes representing proteins and edges representing interactions between proteins. Graph-theoretic algorithms define protein complexes as highly connected sets of proteins that have more interactions within themselves and fewer interactions with the rest of the graph. For example, Scholtens et al. (2004, 2005) developed the BH-complete subgraph identification algorithm coupled with a maximum likelihood estimation based approach to give an initial estimate of protein complex membership. Based on the interaction graph from the AP/MS technique, a BH-complete subgraph is defined as a collection of proteins for which all reciprocated edges between bait proteins exist, and all unreciprocated edges exist between bait proteins and hit-only proteins. In the results from an AP/MS experiment, the probability $p_{ij}$ of detecting protein $j$ as a hit using protein $i$ as a bait is given by a logistic regression model:
$$ \log\left(\frac{p_{ij}}{1 - p_{ij}}\right) = \mu + \alpha y_{ij} + \beta s_{ij}, $$
where $y_{ij} = 1$ if the edge between $i$ and $j$ exists in the true AP/MS graph and 0 otherwise. The similarity measure $s_{ij}$ can be used to integrate data sources other than AP/MS experiments. As done in the study of Scholtens and Gentleman (2004), $s_{ij}$ is calculated according to the GO cellular component annotation. After estimating $y_{ij}$ by maximizing the likelihood of the observed AP/MS data, the initial protein complexes can be estimated by identifying the maximal BH-complete subgraphs. The initially estimated complexes can then be merged to yield more accurate complexes, accounting for the missing edges due to false negative observations.
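A small numerical sketch of the edge-probability model is shown below; the coefficient values are placeholders rather than estimates obtained from real AP/MS data.

import numpy as np

def edge_probability(y_ij, s_ij, mu=-2.0, alpha=3.0, beta=1.5):
    # Logistic model log(p/(1-p)) = mu + alpha * y_ij + beta * s_ij, where
    # y_ij indicates whether the edge exists in the true AP/MS graph and
    # s_ij is a similarity score (e.g. from GO cellular component annotation).
    eta = mu + alpha * y_ij + beta * s_ij
    return 1.0 / (1.0 + np.exp(-eta))

print(round(edge_probability(y_ij=1, s_ij=0.8), 3))   # edge present in the true graph
print(round(edge_probability(y_ij=0, s_ij=0.8), 3))   # edge absent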
5.2 Graph clustering methods
Based on maximum likelihood scoring, Tanay et al. (2004) proposed a Statistical-Algorithmic Method for Bicluster Analysis (SAMBA) to integrate heterogeneous data to detect molecular complexes. SAMBA models the genomic data as a bipartite graph, with nodes on one side representing proteins, and those on the other side representing properties from different types of genomic data, including gene expression information, protein interaction data, transcription factor binding and phenotypic sensitivity information. The edge between a protein node $u$ and a property $v$ indicates that protein $u$ has the property $v$. In this case, a bicluster is a subset of proteins with a set of common properties. A likelihood ratio is used
to score a potential bicluster $(A, B)$ as follows:
$$ L = \sum_{(u,v) \in (A,B)} \log \frac{p_c}{p_{u,v}} + \sum_{(u,v) \in \overline{(A,B)}} \log \frac{1 - p_c}{1 - p_{u,v}}, $$
where the first sum is over protein-property pairs within the bicluster that are connected by an edge and the second is over pairs within the bicluster that are not connected.
5.3
Performance comparison
A recent study evaluated four clustering methods (SPC, RNSC, MCODE and MCL) based on the comparison of the sensitivities, positive predictive values and accuracies of the protein complexes obtained from these methods (Brohee and van HeIden, 2006). A test graph was built on the basis of 220 known complexes in the MIPS database and 41 altered graphs were generated by randomly adding edges to or removing edges from the test graph in various proportions. The four methods were also applied to six graphs obtained from high-throughput experiments and the identified complexes were compared with the known annotated complexes. It was found that the MCL method was most robust to graph alterations and performed best in complex detection on real datasets, while RNSC was more sensitive to edge deletion and relatively less sensitive to suboptimal parameters. SPC and MCODE performed poorly under most conditions.
Chapter 7 Protein Interaction Predictions from Diverse Sources
6
175
Conclusions
The incomplete and noisy PPI data from high-throughput techniques require robust mathematical models and efficient computational methods to have the capability of integrating various types of features for protein interaction prediction problem. Although many distinct types of data are useful for protein interactions predictions, some data types may make little contribution to the prediction or may even decrease the predictive power, depending on the prediction tasks to be performed, the ways the data sets are encoded, the proteome coverage of these data sets and their reliabilities. Therefore, feature (data set) selection is one of the challenges faced in the area of integrating multiple data sources for protein interaction inference. In addition to these challenges, comparison and evaluation of these integration methods are in great need. Each method used for integration has its own limitations and captures different aspects of protein interaction information under different conditions. Moreover, as the gold standard set used for evaluating prediction performance of different methods is incomplete, the prediction results should be validated by small scale experiments as the ultimate test of these methods. With these issues in mind, we expect there is much room for current data integration methods and their evaluation to be improved.
Acknowledgements This work was supported in part by NIH grants R01 GM59507, N01 HV28286, P30 DA018343, and NSF grant DMS 0714817.
References
[1] Bader, G.D. and Hogue, C.W. (2003) An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4, 2
[2] Bader, J.S. et al. (2004) Gaining confidence in high-throughput protein interaction networks. Nat Biotechnol 22, 78-85
[3] Ben-Hur, A. and Noble, W.S. (2005) Kernel methods for predicting protein-protein interactions. Bioinformatics 20, 3346-3352
[4] Blatt, M. et al. (1996) Superparamagnetic clustering of data. Phys Rev Lett. 76, 3251-3254
[5] Bonvin, A.M. (2006) Flexible protein-protein docking. Curr Opin Struct Biol. 16, 194-200
[6] Brohee, S. and van Helden, J. (2006) Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics 7, 488
[7] Chiang, T. et al. (2007) Coverage and error models of protein-protein interaction data by directed graph analysis. Genome Biol. 8, R186
[8] Comeau, S. et al. (2004) ClusPro: an automated docking and discrimination method for the prediction of protein complexes. Bioinformatics 20, 45-50
[9] Davis, J. and Goadrich, M. (2007) The relationship between precision-recall and ROC curves. Proceedings of the 23rd International Conference on Machine Learning, pp. 233-240, Pittsburgh, PA.
[10] Deng, M. et al. (2002) Inferring domain-domain interactions from protein-protein interactions. Genome Res. 12, 1540-1548
[11] Enright, A.J. et al. (1999) Protein interaction maps for complete genomes based on gene fusion events. Nature 402, 86-90
[12] Enright, A.J. et al. (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Research 30, 1575-1584
[13] Finn, R.D. et al. (2006) Pfam: clans, web tools and services. Nucleic Acids Res 34, D247-D251
[14] Gene Ontology Consortium (2006) The Gene Ontology (GO) project in 2006. Nucleic Acids Res. 34, D322-326
[15] Giot, L. et al. (2003) A protein interaction map of Drosophila melanogaster. Science 302, 1727-1736
[16] Gomez, S.M. et al. (2003) Learning to predict protein-protein interactions from protein sequences. Bioinformatics 19, 1875-1881
[17] Goldberg, D.S. and Roth, F.P. (2003) Assessing experimentally derived interactions in a small world. Proc Natl Acad Sci USA 100, 4372-4376
[18] Guimaraes, K.S. et al. (2006) Predicting domain-domain interactions using a parsimony approach. Genome Biol 7, R104
[19] Hart, G.T. et al. (2006) How complete are current yeast and human protein-interaction networks. Genome Biol. 7, 120
[20] Huang, H. et al. (2007) Where have all the interactions gone? Estimating the coverage of two-hybrid protein interaction maps. PLoS Comput Biol. 3, e214
[21] Ito, T. et al. (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA 98, 4569-4574
[22] Jansen, R. et al. (2002) Relating whole-genome expression data with protein-protein interaction. Genome Res. 12, 37-46
[23] Jansen, R. et al. (2003) A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 302, 449-453
[24] Kim, I. et al. (2007) Bayesian methods for predicting interacting protein pairs using domain information. Biometrics 63, 824-833
[25] King, A.D. et al. (2004) Protein complex prediction via cost-based clustering. Bioinformatics 20(17), 3013-3020
[26] LaCount, D.J. et al. (2005) A protein interaction network of the malaria parasite Plasmodium falciparum. Nature 438, 103-107
[27] Lee, H. et al. (2006) An integrated approach to the prediction of domain-domain interactions. BMC Bioinformatics 7, 269
[28] Li, S. et al. (2004) A map of the interactome network of the metazoan C. elegans. Science 303, 540-543
[29] Lin, N. et al. (2004) Information assessment on predicting protein-protein interactions. BMC Bioinformatics 5, 154
[30] Liu, Y. et al. (2005) Inferring protein-protein interactions through high-throughput interaction data from diverse organisms. Bioinformatics 21, 3279-3285
[31] Lu, L.J. et al. (2005) Assessing the limits of genomic data integration for predicting protein networks. Genome Res. 15, 945-953
[32] Marcotte, E.M. et al. (1999) Detecting protein function and protein-protein interactions from genome sequence. Science 285, 751-753
[33] Martin, S. et al. (2005) Predicting protein-protein interactions using signature products. Bioinformatics 21, 218-226
[34] Matthews, L.R. et al. (2001) Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or "interologs". Genome Res. 11, 2120-2126
[35] Overbeek, R. et al. (1999) The use of gene clusters to infer functional coupling. Proc Natl Acad Sci USA 96, 2896-2901
[36] Pazos, F. et al. (2003) Assessing protein co-evolution in the context of the tree of life assists in the prediction of the interactome. J Mol Biol. 352, 1002-1015
[37] Qi, Y. et al. (2006) Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins: Structure, Function and Bioinformatics 63, 490-500
[38] Riley, R. et al. (2005) Inferring protein domain interactions from databases of interacting proteins. Genome Biol. 6, R89
[39] Rual, J.F. et al. (2005) Towards a proteome-scale map of the human protein-protein interaction network. Nature 437, 1173-1178
[40] Salwinski, L. et al. (2004) The database of interacting proteins: 2004 update. Nucleic Acids Res 32, D449-D451
[41] Scholtens, D. et al. (2007) Estimating node degree in bait-prey graphs. Bioinformatics 24, 218-224
[42] Sharan, R. et al. (2005) Conserved patterns of protein interaction in multiple species. Proc Natl Acad Sci USA 102, 1974-1979
[43] Shi, T. et al. (2005) Computational methods for protein-protein interaction and their application. Curr Protein Pept Sci 6, 443-449
[44] Shoemaker, B.A. and Panchenko, A.R. (2007) Deciphering protein-protein interactions. Part II. Computational methods to predict protein and domain interaction partners. PLoS Comput Biol 3(4), e43
[45] Sprinzak, E. and Margalit, H. (2001) Correlated sequence-signatures as markers of protein-protein interaction. J Mol Biol. 311, 681-692
[46] Sprinzak, E. et al. (2005) Characterization and prediction of protein-protein interactions within and between complexes. Proc Natl Acad Sci USA 103, 14718-14723
[47] Stolovitzky, G. et al. (2007) Dialogue on Reverse-Engineering Assessment and Methods: the DREAM of high-throughput pathway inference. Ann N Y Acad Sci. 1115, 1-22
[48] Strobl, C. et al. (2007) Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics 8, 25
[49] Suthram, S. et al. (2006) A direct comparison of protein interaction confidence assignment schemes. BMC Bioinformatics 7, 360
[50] Tanay, A. et al. (2004) Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genome-wide data. Proc Natl Acad Sci USA 101, 2981-2986
[51] Uetz, P. et al. (2000) A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403, 623-627
[52] Yu, H. et al. (2003) Genomic analysis of gene expression relationships in transcriptional regulatory networks. Trends Genet. 19, 422-427
Chapter 8 Regulatory Motif Discovery: From Decoding to Meta-Analysis

Qing Zhou*, Mayetri Gupta†

*Department of Statistics, University of California at Los Angeles, Los Angeles, CA 90095, USA. E-mail: [email protected]
†Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27516-7420, USA. E-mail: [email protected]
Abstract
Gene transcription is regulated by interactions between transcription factors and their target binding sites in the genome. A motif is the sequence pattern recognized by a transcription factor to mediate such interactions. With the availability of high-throughput genomic data, computational identification of transcription factor binding motifs has become a major research problem in computational biology and bioinformatics. In this chapter, we present a series of Bayesian approaches to motif discovery. We start from a basic statistical framework for motif finding, extend it to the identification of cis-regulatory modules, and then discuss methods that combine motif finding with phylogenetic footprinting, gene expression or ChIP-chip data, and nucleosome positioning information. Simulation studies and applications to biological data sets are presented to illustrate the utility of these methods.

Keywords: Transcriptional regulation; motif discovery; cis-regulatory module; gene expression; DNA sequence; ChIP-chip; Bayesian model; Markov chain Monte Carlo.
1 Introduction
The goal of motif discovery is to locate short repetitive patterns ("words") in DNA that are involved in the regulation of genes of interest. In transcriptional regulation, sequence signals upstream of each gene provide a target (the promoter region) for an enzyme complex called RNA polymerase (RNAP) to bind and initiate the transcription of the gene into messenger RNA (mRNA). Certain proteins called transcription factors (TFs) can bind to the promoter regions, either interfering with the action of RNAP and inhibiting gene expression, or enhancing gene expression. TFs recognize sequence sites that give a favorable binding energy, which often translates into a sequence-specific pattern (~8-20 base pairs long). Binding sites thus tend to be relatively well-conserved in composition; such a conserved
pattern is termed a "motif". Experimental detection of TF-binding sites (TFBSs) on a gene-by-gene and site-by-site basis is possible but remains an extremely difficult and expensive task at a genomic level, hence computational methods that assume no prior knowledge of the motif become necessary.

With the availability of complete genome sequences, biologists can now use techniques such as DNA gene expression microarrays to measure the expression level of each gene in an organism under various conditions. A collection of expression measurements of a gene under various conditions is called its gene expression profile. Genes can be divided into clusters according to similarities in their expression profiles: genes in the same cluster respond similarly to environmental and developmental changes and thus may be co-regulated by the same TF or the same group of TFs. Therefore, computational analysis is focused on the search for TFBSs in the upstream regions of genes in a particular cluster. Another powerful experimental procedure, called Chromatin ImmunoPrecipitation followed by microarray (ChIP-chip), can measure where a particular TF binds to DNA in the whole genome under a given experimental condition, at a coarse resolution of 500 to 2000 bases. Again, computational analysis is required to pinpoint the short binding sites of the transcription factor from all potential TF binding regions. With these high-throughput gene expression and ChIP-chip binding data, de novo methods for motif finding have become a major research topic in computational biology.

The main constituents that a statistical motif discovery procedure requires are: (i) a probabilistic structure for generating the observed text (i.e., in what context a word is "significantly enriched") and (ii) an efficient computational strategy to find all enriched words. In the genomic context, the problem is more difficult because the "words" used by nature are never "exact", i.e., certain "mis-spellings" can be tolerated. Thus, one also needs a probabilistic model to describe a fuzzy word.

An early motif-finding approach was CONSENSUS, an information theory-based progressive alignment procedure [42]. Other methods included an EM algorithm [11] based on a missing-data formulation [24], and a Gibbs sampling algorithm [23]. Later generalizations that allowed for a variable number of motif sites per sequence were a Gibbs sampler [28, 33] and an EM algorithm for finite mixture models [2]. Another class of methods approaches the motif discovery problem from a "segmentation" perspective. MobyDick [6] treats the motifs as "words" used by nature to construct the "sentences" of DNA and estimates word frequencies using a Newton-Raphson optimization procedure. The dictionary model was later extended to include "stochastic" words in order to account for variations in the motif sites [16, 36], and a data augmentation (DA) [43] procedure was introduced for finding such words. Recent approaches to motif discovery have improved upon the previous methods in at least two primary ways: (i) improving and sensitizing the basic model to reflect realistic biological phenomena, such as multiple motif types in the same sequence, "gapped" motifs, and clustering of motif sites (cis-regulatory modules) [30, 51, 17], and (ii) using auxiliary data sources, such as gene expression microarrays, ChIP-chip data, phylogenetic information and the physical structure of DNA
[9, 21, 52, 18]. In the following section we will discuss the general framework of de-novo methods for discovering uncharacterized motifs in biological sequences,
focusing especially on the Bayesian approach.
2 A Bayesian approach to motif discovery
In this section, unless otherwise specified, we assume that the data set is a set of N unaligned DNA fragments. Let S = (S_1, ..., S_N) denote the N sequences of the data set, where sequence S_i is of length L_i (i = 1, ..., N). Multiple instances of the same pattern in the data are referred to as motif sites or elements, while different patterns are termed motifs. Motif type k (of, say, width w_k) is characterized by a Position-Specific Weight matrix (PWM) Θ_k = (θ_k1, ..., θ_kw_k), where the J-dimensional (J = 4 for DNA) vector θ_ki = (θ_ki1, ..., θ_kiJ)^T represents the probabilities of occurrence of the J letters in column i (i = 1, ..., w_k). The corresponding letter occurrence probabilities in the background are denoted by θ_0 = (θ_01, ..., θ_0J). Let Θ = {Θ_1, ..., Θ_K}. We assume for now that the motif widths w_k (k = 1, ..., K) are known (this assumption will be relaxed later). The locations of the motif sites are unknown, and are denoted by an array of missing indicator variables A = (A_ijk), where A_ijk = 1 if position j (j = 1, ..., L_i) in sequence i (i = 1, ..., N) is the starting point of a motif of type k (k = 1, ..., K). For motif type k, we let A_k = {A_ijk : i = 1, ..., N; j = 1, ..., L_i}, i.e., the indicator matrix for the site locations corresponding to this motif type, and define the alignment:

S_1^(A_k) = {S_i,j : A_ijk = 1; i = 1, ..., N; j = 1, ..., L_i},
S_2^(A_k) = {S_i,j+1 : A_ijk = 1; i = 1, ..., N; j = 1, ..., L_i},
...
S_w_k^(A_k) = {S_i,j+w_k−1 : A_ijk = 1; i = 1, ..., N; j = 1, ..., L_i}.

In words, S_i^(A_k) is the set of letters occurring at position i of all the instances of the type-k motif. In a similar fashion, we use S^(A^c) to denote the set of all letters occurring in the background, where S^(A^c) = S \ ∪_{k=1}^K ∪_{i=1}^{w_k} S_i^(A_k) (for two sets A ⊂ B, B \ A ≡ B ∩ A^c). Further, let C : S → Z^4 denote a "counting" function that gives the frequencies of the J letters in a specified subset of S. For example, if after taking the set of all instances of motif k we observe, in the first column, a total occurrence of 10 'A's, 50 'T's and no 'C' or 'G's, then C(S_1^(A_k)) = (10, 0, 0, 50). Assuming that the motif columns are independent, we have

[C(S_1^(A_k)), ..., C(S_w_k^(A_k))] ~ Product-Multinomial[Θ_k = (θ_k1, ..., θ_kw_k)],

i.e., the i-th vector of column frequencies for motif k follows a multinomial distribution parametrized by θ_ki. We next introduce some general mathematical notation. For vectors v = (v_1, ..., v_p)^T, let us define |v| = |v_1| + ... + |v_p|, and Γ(v) = Γ(v_1) ··· Γ(v_p).
Then the normalizing constant for a p-dimensional Dirichlet distribution with parameters α = (α_1, ..., α_p)^T can be denoted as Γ(|α|)/Γ(α). For notational convenience, we will denote the inverse of the Dirichlet normalizing constant as ID(α) = Γ(α)/Γ(|α|). Finally, for vectors v and u = (u_1, ..., u_p), we use the shorthand u^v = ∏_{i=1}^p u_i^{v_i}. The probability of observing S conditional on the indicator matrix A can then be written as

P(S | Θ, θ_0, A) = θ_0^{C(S^(A^c))} ∏_{k=1}^K ∏_{i=1}^{w_k} θ_ki^{C(S_i^(A_k))}.
For a Bayesian analysis, we assume a conjugate Dirichlet prior distribution for θ_0, θ_0 ~ Dirichlet(β_0), β_0 = (β_01, ..., β_0J), and a corresponding product-Dirichlet prior (i.e., independent priors over the columns) PD(B_k) for Θ_k (k = 1, ..., K), where B_k = (β_k1, β_k2, ..., β_kw_k) is a J × w_k matrix with β_ki = (β_ki1, ..., β_kiJ)^T. Then the conditional posterior distribution of the parameters given A is:

θ_0 | S, A ~ Dirichlet(β_0 + C(S^(A^c))),   θ_ki | S, A ~ Dirichlet(β_ki + C(S_i^(A_k))),

independently over k = 1, ..., K and i = 1, ..., w_k.
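As a small illustration of the conjugate updating just described, the sketch below (Python with NumPy; the toy counts and the flat pseudocount vector stand in for C(·) and β above and are not from the chapter) draws one motif column's letter frequencies and the background frequencies from their Dirichlet conditional posteriors.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_theta(counts, beta):
    """Draw theta ~ Dirichlet(beta + counts) for one motif column or the background.

    counts : length-4 array of letter counts, e.g. C(S_i^(A_k)) for column i
    beta   : length-4 array of Dirichlet pseudocounts (the prior)
    """
    return rng.dirichlet(counts + beta)

# toy example matching the text: first column with 10 'A's, 50 'T's
counts_col1 = np.array([10, 0, 0, 50])          # counts of A, C, G, T
beta_flat = np.ones(4)                          # a flat Dirichlet(1,1,1,1) prior
theta_col1 = sample_theta(counts_col1, beta_flat)

background_counts = np.array([3000, 2000, 2000, 3000])
theta_0 = sample_theta(background_counts, beta_flat)
```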
For the complete joint posterior of all unknowns (Θ, θ_0, A), we further need to prescribe a prior distribution for A. In the original model [23], a single motif site per sequence with equal probability to occur anywhere was assumed. However, in a later model [28] that can allow multiple sites, a Bernoulli(π) model is proposed for motif site occurrence. More precisely, assuming that a motif site of width w can occur at any of the sequence positions 1, 2, ..., L* − w + 1 in a sequence of length L*, with probability π, the joint posterior distribution is:

P(Θ, θ_0, A | S) ∝ θ_0^{C(S^(A^c)) + β_0 − 1} ∏_{k=1}^K ∏_{i=1}^{w_k} θ_ki^{C(S_i^(A_k)) + β_ki − 1} π^{|A|} (1 − π)^{L − |A|},   (2.1)

where L = Σ_{i=1}^N (L_i − w) is the adjusted total length of all sequences and |A| = Σ_{k=1}^K Σ_{i=1}^N Σ_{j=1}^{L_i} A_ijk. If we have reason to believe that motif occurrences are not independent, but occur as clusters (as in regulatory modules), we can instead adopt a prior Markovian model for motif occurrence [17, 44], which is discussed further in Section 3.

2.1 Markov chain Monte Carlo computation
Under the model described in (2.1), it is straightforward to implement a Gibbs sampling (GS) scheme to iteratively update the parameters, i.e., sampling from [Θ, θ_0 | S, A], and impute the missing data, i.e., sampling from [A | S, Θ, θ_0]. However, drawing Θ from its posterior at every iteration can be computationally inefficient. Liu et al. [28] demonstrated that marginalizing out (Θ, θ_0) from the posterior distribution can lead to much faster convergence of the algorithm [29].
In other words, one can use the Gibbs sampler to draw from the marginal distribution

p(A | S, π) ∝ ∫∫ p(S | Θ, θ_0, A, π) p(A | π) p(Θ, θ_0) dΘ dθ_0,   (2.2)

which can be easily evaluated analytically. If π is unknown, one can assume a beta prior distribution Beta(α_1, α_2) and marginalize out π from the posterior, in which case p(A | S) can be derived from (2.2) by altering the last term in (2.2) to the ratio of normalizing constants for the Beta distribution, B(|A| + α_1, L − |A| + α_2)/B(α_1, α_2). Based on (2.2), Liu et al. [28] derived a predictive updating algorithm for A, which is to iteratively sample each component of A according to the predictive distribution

P(A_ijk = 1 | S) / P(A_ijk = 0 | S) = π/(1 − π) ∏_{l=1}^{w_k} (θ̂_kl / θ̂_0)^{C(S_i,j+l−1)},   (2.3)

where the posterior means are θ̂_kl = (C(S_l^(A_k)) + β_kl) / |C(S_l^(A_k)) + β_kl| and θ̂_0 = (C(S^(A^c)) + β_0) / |C(S^(A^c)) + β_0|.
Under the model specified above, it is also possible to implement a "partition-based" data augmentation (DA) approach [16] that is motivated by the recursive algorithm used in Auger and Lawrence [1]. The DA approach samples A jointly according to the conditional distribution

P(A | Θ, S) = ∏_{i=1}^N [ P(A_iL_i | Θ, S) ∏_{j=1}^{L_i − 1} P(A_ij | A_i,j+1, ..., A_iL_i, Θ, S) ].
At a position j, the current knowledge of motif positions is updated using the conditional probability P(A_ij | A_i,j+1, ..., A_iL_i, S) (backward sampling), with A_i,j−1, ..., A_i1 marginalized out using a forward summation procedure (an example will be given in Section 3.1). In contrast, at each iteration, GS draws from the conditional distribution P(A_ijk | A \ A_ijk, S), visiting each sequence position in turn and updating its motif indicator conditional on the indicators for other positions. The Gibbs approach tends to be "sticky" when the motif sites are abundant. For example, once we have set A_ijk = 1 (for some k), we will not be able to allow the segment S_[i,j+1:j+w_k] to be a motif site. The DA method corresponds to a grouping scheme (with A sampled together), whereas the GS corresponds to a collapsing approach (with Θ integrated out). Both have been shown to improve upon the original scheme [29].
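To make the collapsed Gibbs step concrete, here is a minimal sketch of how the predictive odds in (2.3) could be evaluated and used to sample one indicator. It assumes sequences encoded as integer arrays (A, C, G, T mapped to 0-3) and pre-tabulated counts with the candidate site excluded; the interface is illustrative rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_site_indicator(segment, motif_counts, motif_beta,
                          bg_counts, bg_beta, pi):
    """Sample the indicator for one candidate site via the odds in (2.3).

    segment      : length-w integer array of letters at positions j..j+w-1
    motif_counts : (w, 4) current letter counts per motif column (site excluded)
    motif_beta   : (w, 4) Dirichlet pseudocounts for the motif columns
    bg_counts    : length-4 background letter counts (site excluded)
    bg_beta      : length-4 background pseudocounts
    pi           : prior probability that a position starts a motif site
    """
    theta_hat = motif_counts + motif_beta
    theta_hat = theta_hat / theta_hat.sum(axis=1, keepdims=True)     # posterior mean per column
    theta0_hat = (bg_counts + bg_beta) / (bg_counts + bg_beta).sum() # background posterior mean

    odds = pi / (1.0 - pi)
    for l, letter in enumerate(segment):
        odds *= theta_hat[l, letter] / theta0_hat[letter]
    p_site = odds / (1.0 + odds)
    return rng.random() < p_site
```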
2.2 Some extensions of the product-multinomial model
The product-multinomial model used for Θ is a first approximation to a realistic model for transcription factor binding sites. In empirical observations, it has been reported that certain specific features often characterize functional binding sites. We mention here a few extensions of the primary motif model that have recently been implemented to improve the performance of motif discovery algorithms.
In the previous discussion, the width w of a motif Θ was assumed to be known and fixed; we may instead view w as an additional unknown model parameter. Jointly sampling from the posterior distribution of (A, Θ, w) is difficult as the dimensionality of Θ changes with w. One way to update (w, Θ) jointly would be through a reversible jump procedure [15]. However, note that we can integrate out Θ from the posterior distribution to avoid a dimensionality change during the updating. By placing an appropriate prior distribution p(w) on w (a possible choice is a Poisson(λ)), we can update w using a Metropolis step. Using a Beta(α_1, α_2) prior on π, the marginalized posterior distribution is

P(A, w | S) ∝ ID(C(S^(A^c)) + β_0) ∏_{i=1}^{w} [ID(C(S_i^(A)) + β_i) / ID(β_i)] × [B(|A| + α_1, L − |A| + α_2) / B(α_1, α_2)] p(w).

Another assumption in the product-multinomial model is that all columns of a weight matrix are independent; however, it has been observed that about 25% of experimentally validated motifs show statistically significant positional correlations. Zhou and Liu [49] extend the independent weight matrix model to include one or more correlated column pairs, under the restriction that no two pairs of correlated columns can share a column in common. A Metropolis-Hastings step is added in the Gibbs sampler [28] that deletes or adds a pair of correlated columns at each iteration. Other proposed models are a Bayesian tree-like network modeling the possible correlation structure among all the positions within a motif model [4], and a permuted Markov model in which the assumption is that an unobserved permutation has acted on the positions of all the motif sites and that the original ordered positions can be described by a Markov chain [48]. Mathematically, the model of [49] is a sub-case of [48], which is, in turn, a sub-case of [4].
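The Metropolis update of the motif width described earlier in this subsection can be sketched in a few lines. Here `log_marginal_posterior` is a hypothetical helper, assumed to return log P(A, w | S) for the current alignment with Θ integrated out, and the proposal simply shifts w by one column in either direction.

```python
import math
import random

def metropolis_width_step(w, log_marginal_posterior, w_min=6, w_max=20):
    """One Metropolis move on the motif width w with a symmetric +/-1 proposal.

    log_marginal_posterior(w) is assumed to evaluate log P(A, w | S), including
    the prior term log p(w), for the current set of site start positions A.
    """
    w_new = w + random.choice([-1, 1])
    if w_new < w_min or w_new > w_max:
        return w                                   # proposal outside the allowed range: reject
    log_ratio = log_marginal_posterior(w_new) - log_marginal_posterior(w)
    if math.log(random.random()) < log_ratio:
        return w_new                               # accept the new width
    return w                                       # otherwise keep the current width
```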
3 Discovery of regulatory modules
Motif predictions for higher eukaryotic genomes are more challenging than those for simpler organisms such as bacteria or yeast, for reasons such as (i) large sections of low-complexity regions (repeat sequences), (ii) weak motif signals, and (iii) sparseness of signals compared to the entire region under study (binding sites may occur as far as 2000-3000 bases away from the transcription start site, either upstream or downstream). In addition, in complex eukaryotes, regulatory proteins often work in combination to regulate target genes, and their binding sites have often been observed to occur in spatial clusters, or cis-regulatory modules (Figure 1). One approach to locating cis-regulatory modules (CRMs) is by predicting novel motifs and looking for co-occurrences [41]. However, since individual motifs in the cluster may not be well-conserved, such an approach often leads to a large number of false negatives. Here, we describe a strategy to first use existing de novo motif finding algorithms and motif databases to compose a list of putative binding motifs, 𝒟 = {Θ_1, ..., Θ_D}, where D is in the range of 50 to 100, and then simultaneously update these motifs and estimate the posterior probability for each of them to be included in the CRM [17]. Let S denote the set of n sequences with lengths L_1, L_2, ..., L_n, respectively,
Figure 1: Graphical illustration of a CRM
corresponding to the upstream regions of n co-regulated genes. We assume that the CRM consists of K different kinds of motifs with distinctive PWMs. Both the PWMs and K are unknown and need to be inferred from the data. In addition to the indicator variable A defined in Section 2, we define a new variable a_i,j that denotes the location of the jth site (irrespective of motif type) in the ith sequence, and let a = {a_ij; i = 1, ..., n; j = 1, ..., L_i}. Associated with each site is its type indicator T_i,j, with T_i,j taking one of the K values (let T = (T_ij)). Note that the specification (a, T) is essentially equivalent to A. Next, we model the dependence between T_i,j and T_i,j+1 by a K × K probability transition matrix τ. The distance between neighboring TFBSs in a CRM, d_ij = a_i,j+1 − a_i,j, is assumed to follow Q(·; λ, w), a geometric distribution truncated at w, i.e., Q(d; λ, w) = (1 − λ)^{d−w} λ (d = w, w+1, ...). The distribution of nucleotides in the background sequence is a multinomial distribution with unknown parameter ρ = (ρ_A, ..., ρ_T). Next, we let u be a binary vector indicating which motifs are included in the module, i.e., u = (u_1, ..., u_D)^T, where u_j = 1 (0) if the jth motif type is present (absent) in the module. By construction, |u| = K. Thus, the information regarding K is completely encoded by u. In light of this notation, the set of PWMs for the CRM is defined as Θ_u = {Θ_j : u_j = 1}. Since we now restrict our inference of the CRM to a subset of 𝒟, the probability model for the observed sequence data can be written as:
P(S | 𝒟, τ, u, λ, ρ) = Σ_a Σ_T P(S | a, T, 𝒟, τ, u, λ, ρ) P(a | λ) P(T | a, τ).

From the above likelihood formulation, we need to simultaneously estimate the optimal u and the parameters (𝒟, τ, λ, ρ). To achieve this, we first prescribe a prior distribution on the parameters and missing data:

p(𝒟, τ, ρ, u, λ) = f_1(𝒟 | u) f_2(τ | u) f_3(ρ) g_1(u) g_2(λ).
Here the f_i(·)'s are (product) Dirichlet distributions. Assuming each u_i takes the value 1 with a prior probability of π (i.e., π is the prior probability of including a motif in the module), g_1(u) represents a product of D Bernoulli(π) distributions; and g_2(λ) is a generally flat Beta distribution. More precisely, we assume
a priori that Θ_i ~ ∏_{j=1}^{w_i} Dirichlet(β_ij) (for i = 1, ..., D); ρ ~ Dirichlet(β_0); λ ~ Beta(a, b). Given u (with |u| = K), each row of τ is assumed to follow an independent Dirichlet distribution: for the i-th row, v_i | u ~ Dirichlet(α_i), where i = 1, ..., K. Let Ω = (𝒟, τ, λ, ρ) denote the full parameter set. Then the posterior distribution of Ω has the form

P(Ω, u | S) ∝ P(S | u, Ω) f_1(𝒟 | u) f_2(τ | u) f_3(ρ) g_1(u) g_2(λ).   (3.1)
Gibbs sampling approaches were developed to infer the CRM from a special case of the posterior distribution (3.1) with fixed u [44, 51]. Given the flexibility of the model and the size of the parameter space for an unknown u, it is unlikely that a standard MCMC approach can converge to a good solution in a reasonable amount of time. If we ignore the ordering of sites T and assume components of a to be independent, this model is reduced to the original motif model in Section 2 which can be updated through the previous Gibbs or DA procedure.
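As a small illustration of the spacing model introduced above, the sketch below draws the distance between neighboring sites from the truncated geometric distribution Q(d; λ, w) = (1 − λ)^{d−w} λ, d ≥ w; the parameter values are arbitrary examples, not values from the chapter.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_site_distance(lam, w, size=1):
    """Draw d ~ Q(d; lam, w) = (1 - lam)^(d - w) * lam for d >= w.

    numpy's geometric sampler returns G >= 1 with P(G = g) = (1 - lam)^(g - 1) * lam,
    so d = w + G - 1 has exactly the truncated geometric form above.
    """
    return w + rng.geometric(lam, size=size) - 1

distances = sample_site_distance(lam=0.05, w=8, size=5)   # five example inter-site distances
```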
3.1 A hybrid EMC-DA approach: EMCmodule
With a starting set of putative binding motifs 𝒟, an alternative approach was proposed by Gupta and Liu [17], which involves simultaneously modifying the motifs and estimating the posterior probability for each of them to be included in the CRM. This was achieved through iterations of the following Monte Carlo sampling steps: (i) Given the current collection of motif PWMs (or sites), sample motifs into the CRM by evolutionary Monte Carlo (EMC); (ii) Given the CRM configuration and the PWMs, update the motif site locations through DA; and (iii) Given motif site locations, update all parameters including PWMs.

3.1.1 Evolutionary Monte Carlo for module selection
It has been demonstrated that the EMC method is effective for sampling and optimization with functions of binary variables [26]. Conceptually, we should be able to apply EMC directly to select motifs comprising the CRM, but a complication here is that there are many continuous parameters, such as the Θ_i's, λ, and τ, that vary in dimensionality when a putative motif in 𝒟 is included or excluded from the CRM. We therefore integrate out the continuous parameters analytically and condition on the variables a and T when updating the CRM composition. Let Ω(u) = (Θ, ρ, τ, λ) denote the set of all parameters in the model, for a fixed u. Then, the marginalized conditional posterior probability for a module configuration u is:
where only Θ and τ are dependent on u, and a and T are the sets of locations and types, respectively, of all putative motif sites (for all the D motifs in 𝒟). Thus, only when the indicator u_i for the weight matrix Θ_i is 1 do its site locations and types contribute to the computation of (3.2). When we modify the current u by
excluding a motif type, its site locations and corresponding motif type indicators are removed from the computation of (3.2). For EMC, we need to prescribe a set of temperatures, t_1 > t_2 > ... > t_M = 1, one for each member in the population. Then, we define φ_i(u_i) ∝ exp[log P(u_i | a, T, S)/t_i], and φ(U) ∝ ∏_{i=1}^M φ_i(u_i). The "population" U = (u_1, ..., u_M) is then updated iteratively using two types of moves: mutation and crossover. In the mutation operation, a unit u_k is randomly selected from the current population and mutated to a new vector v_k by changing the values of some of its bits chosen at random. The new member v_k is accepted to replace u_k with probability min(1, r_m), where r_m = φ_k(v_k)/φ_k(u_k). In the crossover step, two individuals, u_j and u_k, are chosen at random from the population. A crossover point x is chosen randomly over the positions 1 to D, and two new units v_j and v_k are formed by switching between the two individuals the segments on the right side of the crossover point. The two "children" are accepted into the population to replace their parents u_j and u_k with probability min(1, r_c), where r_c = [φ_j(v_j) φ_k(v_k)] / [φ_j(u_j) φ_k(u_k)].

[Figure fragment: "... d}, and {x_1 > c}. The mean parameters attached to these regions are μ_1, μ_2, and μ_3, respectively."]
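Returning to the EMC updates just described, a minimal sketch of the mutation and crossover moves on the binary module indicators is given below. Here `log_post` is a hypothetical stand-in for log P(u | a, T, S), `U` is a list of 0/1 NumPy vectors, and `temps` is the temperature ladder; the tempered targets φ_i enter only through the acceptance ratios.

```python
import numpy as np

rng = np.random.default_rng(3)

def emc_mutation(U, temps, log_post):
    """Mutate one population member and accept with probability min(1, r_m)."""
    k = rng.integers(len(U))
    v = U[k].copy()
    flip = rng.choice(len(v), size=max(1, len(v) // 20), replace=False)  # a few random bits
    v[flip] = 1 - v[flip]
    log_rm = (log_post(v) - log_post(U[k])) / temps[k]
    if np.log(rng.random()) < log_rm:
        U[k] = v

def emc_crossover(U, temps, log_post):
    """Cross two members at a random point and accept both children jointly."""
    j, k = rng.choice(len(U), size=2, replace=False)
    x = rng.integers(1, len(U[j]))                                       # crossover point in 1..D-1
    vj = np.concatenate([U[j][:x], U[k][x:]])
    vk = np.concatenate([U[k][:x], U[j][x:]])
    log_rc = ((log_post(vj) - log_post(U[j])) / temps[j]
              + (log_post(vk) - log_post(U[k])) / temps[k])
    if np.log(rng.random()) < log_rc:
        U[j], U[k] = vj, vk
```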
To complete a Bayesian inference based on model (5.2), one needs to prescribe prior distributions for both the tree structures and the associated parameters, M_m and σ². The prior distribution for the tree structure is specified conservatively in [8] so that the size of each tree is kept small, which forces it to be a weak learner. The priors on M_m and σ² also contribute to preventing overfitting. In particular, the prior probability for a tree with 1, 2, 3, 4, and ≥ 5 terminal nodes is 0.05, 0.55, 0.28, 0.09, and 0.03, respectively. Chipman et al. [8] developed a Markov chain Monte Carlo approach (BART
MCMC) to sample from the posterior distribution (5.3). Note that the tree structures are also updated along with the MCMC iterations. Thus, the BART MCMC generates a large number of samples of additive trees, which form an ensemble of models. Now, given a new feature vector x*, instead of predicting its response y* based on the "best" model, BART predicts y* by the average response of all sampled additive trees. More specifically, suppose one runs the BART MCMC for J iterations after the burn-in period, which generates J sets of additive trees. For each of them, BART has one prediction: y*^(j) = Σ_{m=1}^M g(x*, T_m^(j), M_m^(j)) (j = 1, ..., J). These J predicted responses may be used to construct a point estimate of y* by the plain average, as used in the following applications, or an interval estimate by the quantiles. Thus, BART has the nature of Bayesian model averaging.
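The model-averaging step can be written in a few lines; the sketch below assumes a hypothetical J × n array holding the prediction of each posterior draw of additive trees at n test points, and returns the plain average and quantile-based interval described above.

```python
import numpy as np

def bart_posterior_summary(tree_sample_predictions, level=0.95):
    """Point and interval estimates from J posterior samples of additive trees.

    tree_sample_predictions : array of shape (J, n); row j holds
        y*(j) = sum_m g(x*, T_m^(j), M_m^(j)) evaluated at n test points.
    """
    preds = np.asarray(tree_sample_predictions)
    point = preds.mean(axis=0)                              # plain average over the J samples
    lo, hi = np.percentile(preds, [(1 - level) / 2 * 100,
                                   (1 + level) / 2 * 100], axis=0)
    return point, (lo, hi)
```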
5.3 Application to human ChIP-chip data
Zhou and Liu [50] applied BART to two recently published ChIP-chip data sets of the TFs Oct4 and Sox2 in human embryonic stem (ES) cells [5]. The performance of BART was compared with those of linear regressions [9], MARS [14, 10], and neural networks, respectively, based on ten-fold cross validations. The DNA microarray used in [5] covers −8 kb to +2 kb of ~17,000 annotated human genes. A Sox-Oct composite motif (Figure 5) was identified consistently in both sets of positive ChIP-regions using de novo motif discovery tools (e.g., [23]). This motif is known to be recognized by the protein complex of Oct4 and Sox2, the target TFs of the ChIP-chip experiments. Combined with all the 219 known high-quality PWMs from TRANSFAC and the PWMs of 4 TFs with known functions in ES cells from the literature, a final list of 224 PWMs was compiled for motif feature extraction. Here we present their cross-validation results on the Oct4 ChIP-chip data as a comparative study of several competing motif learning approaches.
Figure 5: The Sox-Oct composite motif discovered in the Oct4 positive ChIP-regions
Boyer et al. [5] reported 603 Oct4-ChIP enriched regions (positives) in human ES cells. ZL randomly selected another 603 regions with the same length distribution from the genomic regions targeted by the DNA microarray (negatives). Note that each such region usually contains two or more neighboring probes on the array. A ChIP-intensity measure, which is defined as the average array-intensity
ratio of ChIP samples over control samples, is attached to each of the 1206 ChIP-regions. We treat the logarithm of the ChIP-intensity measure as the response variable, and the features extracted from the genomic sequences as explanatory variables. There are a total of 1206 observations with 224 + 45 = 269 features (explanatory variables) for this Oct4 data set.

ZL used the following methods to perform statistical learning on this data set: (1) LR-SO, linear regression using the Sox-Oct composite motif only; (2) LR-Full, linear regression using all the 269 features; (3) Step-SO, stepwise linear regression starting from LR-SO; (4) Step-Full, stepwise linear regression starting from LR-Full; (5) NN-SO, neural networks with the Sox-Oct composite motif feature as input; (6) NN-Full, neural networks with all the features as input; (7) MARS, multivariate adaptive regression splines using all the features with different tuning parameters; (8) BART with different numbers N of trees ranging from 20 to 200. In Step-SO, one started from the LR-SO model, and used the stepwise method (with both forward and backward steps) to add or delete features in the linear regression model based on the AIC criterion (see the R function "step"). Step-Full was performed similarly, but starting from the LR-Full model. For neural networks, ZL used the R package "nnet" with different combinations of the number of hidden nodes (2, 5, 10, 20, 30) and weight decay (0, 0.5, 1.0, 2.0). For MARS, they used the function "mars" in the R package "mda" by Hastie and Tibshirani, with up to two-way interactions and a wide range of penalty terms. For BART, they ran 20,000 iterations after a burn-in period of 2,000 iterations, and used the default settings in the R package "BayesTree" for all other parameters.

The ten-fold cross validation procedure in [50] was conducted as follows. They first divided the observations into ten subgroups of equal sizes at random. Each time, one subgroup (called "the test sample") was left out and the remaining nine subgroups (called "the training sample") were used to train a model using the stated method. Then, they predicted the responses for the test sample based on the trained model and compared them with the observed responses. This process was continued until every subgroup had served as the test sample once. In [50], the authors used the correlation coefficient between the predicted and observed responses as a measure of the goodness of a model's performance. This measure is invariant under linear transformation, and can be intuitively understood as the fraction of variation in the response variable that can be explained by the features (covariates). We call this measure the CV-correlation (or CV-cor) henceforth.

The cross validation results are given in Table 3. The average CV-correlation (over 10 cross validations) of LR-SO is 0.446, which is the lowest among all the linear regression methods. Since all the other methods use more features, this shows that sequence features other than the target motif itself indeed contribute to the prediction of ChIP-intensity. Among all the linear regression methods, Step-SO achieves the highest CV-cor of 0.535. Only the optimal performance among all the combinations of parameters was reported for the neural networks. However, even these optimal results are not satisfactory. The NN-SO showed a slight improvement in CV-cor over that of LR-SO.
For different parameters (the number of hidden nodes and weight decay), NN-SO showed roughly the same performance except for those with 20 or more hidden nodes and weight decay = 0, which overfitted the training data. The neural network with all the features as input encountered a severe overfitting problem, resulting in CV-cor's < 0.38, even worse than that of LR-SO. In order to relieve the overfitting problem for NNs, ZL reduced the input independent variables to those selected by the stepwise regression (about 45), and employed a weight decay of 1.0 with 2, 5, 10, 20, or 30 hidden nodes. More specifically, for each training data set, they performed Step-SO followed by NNs with the features selected by Step-SO as input. Then they calculated the CV-cor's for the test data. We call this approach Step+NN, and it reached an optimal CV-cor of 0.463 with 2 hidden nodes.

ZL applied MARS to this data set under two settings: one with no interaction terms (d = 1) and one considering two-way interactions (d = 2). For each setting, they chose different values of the penalty λ, which specifies the cost per degree of freedom. In the first setting (d = 1), they set the penalty λ = 1, 2, ..., 10, and observed that the CV-cor reaches its maximum of 0.580 when λ = 6. Although this optimal result is only slightly worse than that of BART (Table 3), we note that the performance of MARS was very sensitive to the choice of λ. With λ = 2 or 1, MARS greatly overfitted the training data, and the CV-cor's dropped to 0.459 and 0.283, respectively, which are almost the same as or even worse than that of LR-SO. MARS with two-way interactions (d = 2) showed unsatisfactory performance for λ ≤ 5 (i.e., CV-cor < 0.360). They then tested λ in the range of [10, 50] and found the optimal CV-cor of 0.561 when λ = 20.

Table 3: Ten-fold cross validations for log-ChIP-intensity prediction on the Oct4 ChIP-chip data

Method       Cor    Imprv      Method        Cor    Imprv
LR-SO        0.446    0%       LR-Full       0.491   10%
Step-SO      0.535   20%       Step-Full     0.513   15%
NN-SO        0.468    5%       Step+NN       0.463    4%
MARS1,6      0.580   30%       MARS1,1       0.283  -37%
MARS2,20     0.561   26%       MARS2,4       0.337  -24%
BART20       0.592   33%       BART40        0.599   34%
BART60       0.596   34%       BART80        0.597   34%
BART100      0.600   35%       BART120       0.599   34%
BART140      0.599   34%       BART160       0.594   33%
BART180      0.595   33%       BART200       0.593   33%
Step-M       0.456    2%       BART-M        0.510   14%
MARS1,6-M    0.511   15%       MARS2,20-M    0.478    7%

Reported here are the average CV-correlations (Cor). LR-SO, LR-Full, Step-SO, Step-Full, NN-SO, and Step+NN are defined in the text. MARSa,b refers to the MARS with d = a and λ = b. BARTm is the BART with m trees. Step-M, MARSa,b-M, and BART-M are Step-SO, the optimal MARS, and BART100 with only motif features as input. The improvement ("Imprv") is calculated by Cor/Cor(LR-SO) − 1.
Notably, BARTs with different numbers of trees outperformed all the other methods uniformly. BARTs reached a CV-cor of about 0.6, indicating a greater than 30% improvement over that of LR-SO and the optimal NN, and more than 10% improvement over the best performance of the stepwise regression
method. In addition, the performance of BART was very robust for different choices of the number of trees included. This is a great advantage over MARS, whose performance depended strongly on the choice of the penalty parameter λ, which is typically difficult for the user to set a priori. Compared to NNs, BART is much less prone to overfitting, which may be attributable to its Bayesian model averaging nature with various conservative prior specifications. To further illustrate the effect of non-motif features, ZL did the following comparison. They excluded non-motif features from the input, and applied BART with 100 trees, MARS (d = 1, λ = 6), MARS (d = 2, λ = 20), and Step-SO to the resulting data set to perform ten-fold cross validations. In other words, the feature vectors contained only the 224 motif features. The CV-correlations of these approaches are given in Table 3, denoted by BART-M, MARS1,6-M, MARS2,20-M and Step-M, respectively. It is observed that the CV-correlations decreased substantially (about 12% to 15%) compared to the corresponding methods with all the features. One obtains almost no improvement (2%) in predictive power by taking more motif features in the linear regression. However, if the background and other generic features are incorporated, the stepwise regression improves dramatically (20%) in its prediction. This does not mean that the motif features are not useful, but their effects need to be considered in conjunction with background frequencies. Step-M is equivalent to MotifRegressor [9] and MARS-M is equivalent to MARS Motif [10] with all the known and discovered (Sox-Oct) motifs as input. Thus, this study implies that BART with all three categories of features outperformed MotifRegressor and MARS Motif by 32% and 17% in CV-correlation, respectively.
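The ten-fold CV-correlation measure used throughout this comparison can be computed with a generic helper like the one below (a sketch, not the authors' code); `fit_predict` stands for any routine that trains on the training folds and returns predictions for the held-out fold, and the ordinary least squares example is only an illustrative stand-in for a method such as LR-Full.

```python
import numpy as np

def cv_correlation(X, y, fit_predict, n_folds=10, seed=0):
    """Average correlation between held-out predictions and observations.

    fit_predict(X_train, y_train, X_test) -> predicted responses for X_test.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    cors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)
        y_hat = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        cors.append(np.corrcoef(y_hat, y[test_idx])[0, 1])
    return float(np.mean(cors))

def ols_fit_predict(X_tr, y_tr, X_te):
    """Ordinary least squares with an intercept, used as an example fit/predict routine."""
    Xd = np.column_stack([np.ones(len(X_tr)), X_tr])
    beta, *_ = np.linalg.lstsq(Xd, y_tr, rcond=None)
    return np.column_stack([np.ones(len(X_te)), X_te]) @ beta
```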
6 Using nucleosome positioning information in motif discovery
Generally TF-DNA binding is represented as a one-dimensional process; however, in reality, binding occurs in three dimensional space. Biological evidence [32] shows that much of DNA consists of repeats of regions of about 147 bp wrapped around nucleosomes, separated by stretches of DNA called linkers. Recent techniques [47] based on high density genome tiling arrays have been used to experimentally measure genomic positions of nucleosomes, in which the measurement "intensities" indicate how likely that locus is to be nucleosome-bound. These studies suggest that nucleosome-free regions highly correlate with the location of functional TFBSs, and hence can lead to significant improvement in motif prediction, if considered. Genome tiling arrays pose considerable challenges for data analysis. These arrays involve short overlapping probes covering the genome, which induces a spatial data structure. Although hidden Markov models or HMMs [35] may be used to accommodate such spatial structure, they induce an exponentially decaying distribution of state lengths, and are not directly appropriate for assessing structural features such as nucleosomes that have restrictions in physical dimension.
For instance, in Yuan et al. [47], the tiling array consisted of overlapping 50-mer oligonucleotide probes tiled every 20 base pairs. The nucleosomal state can thus be assumed to be represented by about 6 to 8 probes, while the linker states had no physical restriction. Since the experiment did not succeed in achieving a perfect synchronization of cells, a third "delocalized nucleosomal" state was additionally modeled, which had intensities more variable in length and measurement magnitude than expected for nucleosomal states. Here, we describe a general framework for determining chromatin features from tiling array data and using this information to improve de novo motif prediction in eukaryotes [18].
6.1 A hierarchical generalized HMM (HGHMM)
Assume that the model consists of K (= 3) states. The possible length duration in state k (k = 1, ..., K) is given by the set D_k = {r_k, ..., s_k} ⊂ ℕ (where ℕ denotes the set of positive integers). The generative model for the data is now described. The initial distribution of states is characterized by the probability vector π = (π_1, ..., π_K). The probability of spending time d in state k is given by the distribution P_k(d | φ), d ∈ D_k (1 ≤ k ≤ K), characterized by the parameter φ = (φ_1, ..., φ_K). For the motivating application, P_k(d) is chosen to be a truncated negative binomial distribution, restricted to the range specified by each D_k. The latent state for probe i is denoted by the variable Z_i (i = 1, ..., N). Logarithms of spot measurement ratios are denoted by y_ij (1 ≤ i ≤ N; 1 ≤ j ≤ r) for N spots and r replicates each. Assume that given the (unobservable) state Z_i, the y_ij's are independent, with y_ij | Z_i = k ~ g_k(· | ξ_ik, σ²_ik). For specifying g_k, a hierarchical model is developed that allows robust estimation of the parameters. Let μ = (μ_1, ..., μ_K) and Σ = {σ²_ik; 1 ≤ i ≤ N; 1 ≤ k ≤ K}. Assume y_ij | Z_i = k, ξ_ik, σ²_ik ~ N(ξ_ik, σ²_ik); ξ_ik | μ_k, σ²_ik ~ N(μ_k, τ_0 σ²_ik); σ²_ik ~ Inv-Gamma(p_k, a_k), where at the top level, μ_k ∝ constant, and p_k, a_k, and τ_0 are hyperparameters. Finally, the transition probabilities between the states are given by the matrix T = (τ_jk) (1 ≤ j, k ≤ K). Assume a Dirichlet prior for the state transition probabilities, i.e., (τ_k1, ..., τ_k,k−1, τ_k,k+1, ..., τ_kK) ~ Dirichlet(η), where η = (η_1, ..., η_{k−1}, η_{k+1}, ..., η_K). Since the duration in a state is being modeled explicitly, no transition back to the same state can occur, i.e., there is a restriction τ_kk = 0 for all 1 ≤ k ≤ K.
6.2 Model fitting and parameter estimation
For notational simplicity, assume Y = {y_1, ..., y_N} is a single long sequence of length N, with r replicate observations for each y_i = (y_i1, ..., y_ir)'. Let the set of all parameters be denoted by θ = (μ, T, φ, π, Σ), and let Z = (Z_1, ..., Z_N) and L = (L_1, ..., L_N) be latent variables denoting the state identities and state lengths. L_i is a non-zero number l denoting the state length if i is a point where a run of states ends, i.e., L_i = l if Z_j = k for (i − l + 1) ≤ j ≤ i and Z_{i+1}, Z_{i−l} ≠ k (1 ≤ k ≤ K), and L_i = 0 otherwise.
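The length variable L is a deterministic function of a state path Z; the following sketch records the run length at each position where a run ends and zero elsewhere, matching the definition above (the toy example in the comment is illustrative).

```python
import numpy as np

def state_run_lengths(Z):
    """Return L where L[i] is the length of the state run ending at i, and 0 otherwise."""
    Z = np.asarray(Z)
    L = np.zeros(len(Z), dtype=int)
    run = 1
    for i in range(1, len(Z)):
        if Z[i] == Z[i - 1]:
            run += 1
        else:
            L[i - 1] = run          # a run of state Z[i-1] ends at position i-1
            run = 1
    L[-1] = run                     # the final run ends at the last position
    return L

# example: Z = [1, 1, 1, 3, 3, 2]  ->  L = [0, 0, 3, 0, 2, 1]
```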
The observed data likelihood then may be written as:
L(θ; Y) = Σ_Z Σ_L P(Y | Z, L, θ) P(L | Z, θ) P(Z | θ).   (6.1)
The likelihood computation (6.1) is analytically intractable, involving a sum over all possible partitions of the sequence Y with different state conformations and different state lengths (under the state restrictions). However, one can formulate a data augmentation algorithm which utilizes a recursive technique to efficiently sample from the posterior distributions of interest, as shown below. The key is to update the states and state length durations in a recursive manner, after calculating the required probability expressions through a forward summation step. Let an indicator variable I_t take the value 1 if a segment boundary is present at position t of the sequence, meaning that a state run ends at t (I_t = 1 ⟺ L_t ≠ 0). In the following, the notation Y_[1:t] is used to denote the vector {y_1, y_2, ..., y_t}. Define the partial likelihood of the first t probes, with the state Z_t = k ending at t after a state run length of L_t = l, by the "forward" probability:
α_t(k, l) = P(Z_t = k, L_t = l, I_t = 1, Y_[1:t]).

Also, let the state probability marginalized over all state lengths be given by β_t(k) = Σ_l α_t(k, l). Let d_(1) = min{D_1, ..., D_K} and d_(K) = max{D_1, ..., D_K}. Then, assuming that the length spent in a state and the transition to that state are independent, i.e., P(l, k | l', k') = P(L_t = l | Z_t = k) τ_{k'k} = P_k(l) τ_{k'k}, it can be shown that

α_t(k, l) = P(Y_[t−l+1:t] | Z_t = k) P_k(l) Σ_{k'≠k} τ_{k'k} β_{t−l}(k'),   (6.2)

for 2 ≤ t ≤ N; 1 ≤ k ≤ K; l ∈ {d_(1), d_(1)+1, ..., min[d_(K), t]}. The boundary conditions are: α_t(k, l) = 0 for t < l < d_(1), and α_l(k, l) = π_k P(Y_[1:l] | Z_l = k) P_k(l) for d_(1) ≤ l ≤ d_(K), k = 1, ..., K. P_k(·) denotes the k-th truncated negative binomial distribution. The states and state duration lengths (Z_t, L_t) (1 ≤ t ≤ N) can now be updated, for current values of the parameters θ = (μ, T, φ, π, Σ), using a backward sampling-based imputation step:
1. Set i = N. Update Z_N | y, θ using P(Z_N = k | y, θ) = β_N(k) / Σ_{k'=1}^K β_N(k').
2. Next, update L_N | Z_N = k, y, θ using

   P(L_N = l | Z_N = k, y, θ) = P(L_N = l, Z_N = k | y, θ) / P(Z_N = k | y, θ) = α_N(k, l) / β_N(k).
3. Next, set i = i − L_N, and let LS(i) = L_N. Let D_(2) be the second smallest value in the set {D_1, ..., D_K}. While i > D_(2), repeat the following steps:
   • Draw Z_i | y, θ, Z_{i+LS(i)}, L_{i+LS(i)} using P(Z_i = k | y, θ, Z_{i+LS(i)}, L_{i+LS(i)}) ∝ β_i(k) τ_{k, Z_{i+LS(i)}}, where k ∈ {1, ..., K} \ Z_{i+LS(i)}.
   • Draw L_i | Z_i, y, θ using P(L_i = l | Z_i, y, θ) = α_i(Z_i, l) / β_i(Z_i).
   • Set LS(i − L_i) = L_i, i = i − L_i.

6.3 Application to a yeast data set
The HGHMM algorithm was applied to the normalized data from the longest contiguous mapped region, corresponding to about 61 Kbp (chromosomal coordinates 12921 to 73970), of yeast chromosome III [47]. The length ranges for the three states were: (1) linker: D_1 = {1, 2, 3, ...}, (2) delocalized nucleosome: D_2 = {9, ..., 30}, and (3) well-positioned nucleosome: D_3 = {6, 7, 8}. It is of interest to examine whether nucleosome-free state predictions correlate with the location of TFBSs. Harbison et al. (2004) used genome-wide location analysis (ChIP-chip) to determine occupancy of DNA-binding transcription regulators under a variety of conditions. The ChIP-chip data give locations of binding sites to only a 1 Kb resolution, making further analysis necessary to determine the location of binding sites at a single nucleotide level. For the HGHMM algorithm, the probabilities of state membership for each probe were estimated from the posterior frequencies of visiting each state in M iterations (excluding burn-in). Each region was assigned to the occupancy state k for which the estimated posterior state probability P(Z_i = k | Y) = Σ_{j=1}^M I(Z_i^(j) = k)/M was maximum. For all probes, this probability ranged from 0.5 to 0.9. Two motif discovery methods, SDDA [16] and BioProspector [30], were used to analyze the sequences for motif lengths of 8 to 10 and a maximum of 20 motifs per set. Motif searches were run separately on linker (L), nucleosomal (N) and delocalized nucleosomal (D) regions predicted by the HGHMM procedure. The highest specificity (proportion of regions containing motif sites corresponding to high binding propensities in the Harbison et al. (2004) data) was for the linker regions predicted by HGHMM: 61% by SDDA and 40% by BP (Table 4). Sensitivity is defined as the proportion of highly TF-bound regions found when regions were classified according to specific state predictions. The highest overall specificity and sensitivity was observed for the linker regions predicted with HGHMM, indicating that nucleosome positioning information may aid significantly in motif discovery when other information is not known.

Table 4: Specificity (Spec) and Sensitivity (Sens) of motif predictions compared to data from Harbison et al.

                      SDDA              BP
                  Spec    Sens      Spec    Sens
Linker            0.61    0.7       0.40    0.87
Deloc Nucl        0.19    0.8       0.15    0.63
Nucleosomal       0.16    0.5       0.09    0.43
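One way the Spec and Sens values in Table 4 could be computed is sketched below, simplifying both the state-specific predictions and the highly bound ChIP-chip regions to lists of (start, end) intervals and using interval overlap as the matching criterion; this representation and criterion are illustrative assumptions, not the authors' exact procedure.

```python
def overlaps(a, b):
    """True if intervals a = (s1, e1) and b = (s2, e2) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

def spec_sens(predicted_regions, bound_regions):
    """Specificity: fraction of predicted regions overlapping a highly bound region.
    Sensitivity: fraction of highly bound regions recovered by some predicted region."""
    spec_hits = sum(any(overlaps(p, b) for b in bound_regions) for p in predicted_regions)
    sens_hits = sum(any(overlaps(b, p) for p in predicted_regions) for b in bound_regions)
    spec = spec_hits / len(predicted_regions) if predicted_regions else 0.0
    sens = sens_hits / len(bound_regions) if bound_regions else 0.0
    return spec, sens
```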
7 Conclusion
In this article we have tried to present an overview of statistical methods related to the computational discovery of transcription factor binding sites in genomic DNA
sequences, ranging from the initial simple probabilistic models to more recently developed tools that attempt to use auxiliary information from experiments, evolutionary conservation, and chromatin structure for more accurate motif prediction. The field of motif discovery is a very active and rapidly expanding area, and our aim was to provide the reader a snapshot of some of the major challenges and possibilities that exist in the field, rather than give an exhaustive listing of work that has been published (which would in any case be almost an impossible task in the available space). With the advent of new genomic technologies and rapid increases in the volume, diversity, and resolution of available data, it seems that in spite of the considerable challenges that lie ahead, there is strong promise that many exciting discoveries in this field will continue to be made in the near future.
References

[1] Auger, I. E. and Lawrence, C. E. (1989). Algorithms for the optimal identification of segment neighborhoods. Bull. Math. Biol., 51(1), 39-54.
[2] Bailey, T. and Elkan, C. (1994). Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, 28-36.
[3] Baldi, P., Chauvin, Y., Hunkapiller, T., and McClure, M.A. (1994). Hidden Markov models of biological primary sequence information. Proc. Natl. Acad. Sci. USA, 91, 1059-1063.
[4] Barash, Y., Elidan, G., Friedman, N., and Kaplan, T. (2003). Modeling dependencies in protein-DNA binding sites. In RECOMB proceedings, 28-37.
[5] Boyer, L.A., Lee, T.I., Cole, M.F., Johnstone, S.E., Levine, S.S., Zucker, J.P., et al. (2005). Core transcriptional regulatory circuitry in human embryonic stem cells. Cell, 122, 947-956.
[6] Bussemaker, H. J., Li, H., and Siggia, E. D. (2000). Building a dictionary for genomes: Identification of presumptive regulatory sites by statistical analysis. Proc. Natl. Acad. Sci. USA, 97(18), 10096-10100.
[7] Bussemaker, H.J., Li, H., and Siggia, E.D. (2001). Regulatory element detection using correlation with expression. Nat. Genet., 27, 167-171.
[8] Chipman, H.A., George, E.I., and McCulloch, R.E. (2006). BART: Bayesian additive regression trees. Technical Report, Univ. of Chicago.
[9] Conlon, E.M., Liu, X.S., Lieb, J.D., and Liu, J.S. (2003). Integrating regulatory motif discovery and genome-wide expression analysis. Proc. Natl. Acad. Sci. USA, 100, 3339-3344.
[10] Das, D., Banerjee, N., and Zhang, M.Q. (2004). Interacting models of cooperative gene regulation. Proc. Natl. Acad. Sci. USA, 101, 16234-16239.
[11] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B, 39(1), 1-38.
[12] Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. (1998). Biological Sequence Analysis. Cambridge University Press.
[13] Felsenstein, J. (1981). Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol., 17, 368-376.
[14] Friedman, J.H. (1991). Multivariate adaptive regression splines. Ann. Statist., 19, 1-67.
[15] Green, P. J. (1995). Reversible jump MCMC and Bayesian model determination. Biometrika, 82, 711-732.
[16] Gupta, M. and Liu, J. S. (2003). Discovery of conserved sequence patterns using a stochastic dictionary model. J. Am. Stat. Assoc., 98(461), 55-66.
[17] Gupta, M. and Liu, J. S. (2005). De-novo cis-regulatory module elicitation for eukaryotic genomes. Proc. Natl. Acad. Sci. USA, 102(20), 7079-7084.
[18] Gupta, M. (2007). Generalized hierarchical Markov models for discovery of length-constrained sequence features from genome tiling arrays. Biometrics, in press.
[19] Jensen, S.T., Shen, L., and Liu, J.S. (2006). Combining phylogenetic motif discovery and motif clustering to predict co-regulated genes. Bioinformatics, 21, 3832-3839.
[20] Keles, S., van der Laan, M., and Eisen, M.B. (2002). Identification of regulatory elements using a feature selection method. Bioinformatics, 18, 1167-1175.
[21] Kellis, M., Patterson, N., Endrizzi, M., Birren, B., and Lander, E. (2003). Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature, 423, 241-254.
[22] Krogh, A., Brown, M., Mian, I.S., Sjölander, K., and Haussler, D. (1994). Hidden Markov models in computational biology: Applications to protein modeling. J. Mol. Biol., 235, 1501-1531.
[23] Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., and Wootton, J.C. (1993). Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262, 208-214.
[24] Lawrence, C. E. and Reilly, A. A. (1990). An expectation-maximization (EM) algorithm for the identification and characterization of common sites in biopolymer sequences. Proteins, 7, 41-51.
[25] Li, X. and Wong, W.H. (2005). Sampling motifs on phylogenetic trees. Proc. Natl. Acad. Sci. USA, 102, 9481-9486.
[26] Liang, F. and Wong, W. H. (2000). Evolutionary Monte Carlo: applications to Cp model sampling and change point problem. Statistica Sinica, 10, 317-342.
[27] Liu, J.S. and Lawrence, C. E. (1999). Bayesian inference on biopolymer models. Bioinformatics, 15, 38-52.
[28] Liu, J. S., Neuwald, A. F., and Lawrence, C. E. (1995). Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. J. Am. Stat. Assoc., 90, 1156-1170.
[29] Liu, J. S., Wong, W. H., and Kong, A. (1994). Covariance structure of the Gibbs sampler with applications to the comparisons of estimators and augmentation schemes. Biometrika, 81, 27-40.
[30] Liu, X., Brutlag, D. L., and Liu, J. S. (2001). BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pacific Symposium on Biocomputing, 127-138.
[31] Liu, Y., Liu, X.S., Wei, L., Altman, R.B., and Batzoglou, S. (2004). Eukaryotic regulatory element conservation analysis and identification using comparative genomics. Genome Res., 14, 451-458.
[32] Luger, K. (2006). Dynamic nucleosomes. Chromosome Res., 14, 5-16.
[33] Neuwald, A. F., Liu, J. S., and Lawrence, C. E. (1995). Gibbs Motif Sampling: detection of bacterial outer membrane protein repeats. Protein Science, 4, 1618-1632.
[34] Moses, A.M., Chiang, D.Y., and Eisen, M.B. (2004). Phylogenetic motif detection by expectation-maximization on evolutionary mixtures. Pac. Symp. Biocomput., 9, 324-335.
[35] Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, 257-286.
[36] Sabatti, C. and Lange, K. (2002). Genomewide motif identification using a dictionary model. IEEE Proceedings, 90, 1803-1810.
[37] Sandelin, A., Alkema, W., Engström, P., Wasserman, W.W., and Lenhard, B. (2004). JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res., 32, D91-D94.
[38] Schneider, T.D. and Stephens, R.M. (1990). Sequence logos: a new way to display consensus sequences. Nucleic Acids Res., 18, 6097-6100.
[39] Siddharthan, R., Siggia, E.D., and van Nimwegen, E. (2005). PhyloGibbs: A Gibbs sampling motif finder that incorporates phylogeny. PLoS Comput. Biol., 1, e67.
[40] Sinha, S., Blanchette, M., and Tompa, M. (2004). PhyME: A probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics, 5, 170.
[41] Sinha, S. and Tompa, M. (2002). Discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Research, 30, 5549-5560.
[42] Stormo, G. D. and Hartzell, G. W. (1989). Identifying protein-binding sites from unaligned DNA fragments. Proc. Natl. Acad. Sci. USA, 86, 1183-1187.
[43] Tanner, M. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. J. Am. Stat. Assoc., 82, 528-550.
[44] Thompson, W., Palumbo, M. J., Wasserman, W. W., Liu, J. S., and Lawrence, C. E. (2004). Decoding human regulatory circuits. Genome Research, 10, 1967-1974.
[45] Wang, T. and Stormo, G.D. (2003). Combining phylogenetic data with co-regulated genes to identify regulatory motifs. Bioinformatics, 19, 2369-2380.
[46] Wingender, E., Chen, X., Hehl, R., Karas, H., Liebich, I., Matys, V., Meinhardt, T., Pruss, M., Reuter, I., and Schacherer, F. (2000). TRANSFAC: An integrated system for gene expression regulation. Nucleic Acids Res., 28, 316-319.
[47] Yuan, G.-C., Liu, Y.-J., Dion, M. F., Slack, M. D., Wu, L. F., Altschuler, S. J., and Rando, O. J. (2005). Genome-scale identification of nucleosome positions in S. cerevisiae. Science, 309, 626-630.
[48] Zhao, X., Huang, H., and Speed, T. P. (2004). Finding short DNA motifs using permuted Markov models. In RECOMB proceedings, 68-75.
[49] Zhou, Q. and Liu, J. S. (2004). Modeling within-motif dependence for transcription factor binding site predictions. Bioinformatics, 20(6), 909-916.
[50] Zhou, Q. and Liu, J.S. (2008). Extracting sequence features to predict protein-DNA interactions: a comparative study. Nucleic Acids Research, in press.
[51] Zhou, Q. and Wong, W.H. (2004). CisModule: De novo discovery of cis-regulatory modules by hierarchical mixture modeling. Proc. Natl. Acad. Sci. USA, 101, 12114-12119.
[52] Zhou, Q. and Wong, W.H. (2007). Coupling hidden Markov models for the discovery of cis-regulatory modules in multiple species. Ann. Appl. Statist., to appear.
Chapter 9
Analysis of Cancer Genome Alterations Using Single Nucleotide Polymorphism (SNP) Microarrays
Cheng Li*, Samir Amin†
* Department of Biostatistics and Computational Biology, Harvard School of Public Health, Dana-Farber Cancer Institute, 44 Binney St., Boston, MA 02115, USA. Email: cli@hsph.harvard.edu
† Department of Medical Oncology, Dana-Farber Cancer Institute, 44 Binney St., Boston, MA 02115, USA.
Abstract  Loss of heterozygosity (LOH) and copy number changes of chromosomal regions bearing tumor suppressor genes or oncogenes are key events in the evolution of cancer cells. Identifying such regions in each sample accurately and summarizing across multiple samples can suggest the locations of cancer-related genes for further confirmatory experiments. Oligonucleotide SNP microarrays now have up to one million SNP markers that can provide genotypes and copy number signals simultaneously. In this chapter we introduce SNP-array based genome alteration analysis methods for cancer samples, including paired and non-paired LOH analysis, copy number analysis, finding significantly altered regions across multiple samples, and hierarchical clustering methods. We also provide references and summaries for additional analysis methods and software packages. Many visualization and analysis functions introduced in this chapter are implemented in the dChip software (www.dchip.org), which is freely available to the research community.
1 Background
1.1 Cancer genomic alterations
A normal human cell has 23 pairs of chromosomes. For the autosomal chromosomes (1 to 22), there are two copies of homologous chromosomes inherited respectively from the father and mother of an individual. Therefore, the copy number of all the autosomal chromosomes is two in a normal cell. However, in a cancer cell the copy number can be smaller or larger than two at some chromosomal regions due to chromosomal deletions, amplifications and rearrangements. These alterations start
to happen randomly in a single cell but are subsequently selected and inherited in a clone of cells if they confer a growth advantage to the cells. Such chromosomal alterations and growth selections have been associated with the conversion of ancestral normal cells into malignant cancer cells. The most common alterations are loss of heterozygosity (LOH), point mutations, chromosomal amplifications, deletions, and translocations. LOH (the loss of one parental allele at a chromosomal segment) and homozygous deletion (the deletion of both parental alleles) can disable tumor suppressor genes (TSGs) [1]. In contrast, chromosomal amplifications may increase the dosage of oncogenes that promote cell proliferation and inhibit apoptosis. The detection of these alterations may help identify TSGs and oncogenes and consequently provide clues about cancer initiation or growth [2, 3]. Analyzing genomic alteration data across multiple tumor samples may distinguish chromosomal regions harboring cancer genes from regions with random alterations due to the instability of the cancer genome, leading to the identification of novel cancer genes [4, 5]. Defining cancer subtypes based on genomic alterations may also provide insights into the new classification and treatment of cancer [6, 7].
1.2 Identifying cancer genomic alterations using oligonucleotide SNP microarrays
Several experimental techniques are currently used to identify genomic alterations at various resolutions and aspects. The cytogenetic methods range from fluorescence in situ hybridization (FISH) to spectral karyotyping (SKY) [8] and provide a global view of chromosomal organization at the single-cell level. At individual loci, low-throughput methods measure allelic polymorphisms or DNA copy numbers using locus-specific primers, while high-throughput methods, such as digital karyotyping [9] and comparative genomic hybridization (CGH) using chromosomes or microarrays [10-12], measure DNA copy number changes at many loci simultaneously. Single nucleotide polymorphisms (SNPs) are single base pair variations that occur in the genome of a species. They are the most common genetic variations in the human genome and occur on average once in several hundred base pairs. The NCBI dbSNP database (http://www.ncbi.nlm.nih.gov/projects/SNP/) stores about 10 million human SNPs identified by comparing the DNA of different individuals. High-density oligonucleotide SNP microarrays have also been developed by Affymetrix for high-throughput genotyping of human SNPs. The SNP marker density has increased rapidly in recent years, from the Mapping 10K array (10,000 markers) and the Mapping 100K array set (116,204 markers spaced at 23.6 Kb) to the latest SNP 6.0 array with nearly one million SNPs [13-16]. The experimental assays require only 250 ng of starting DNA sample and one primer to amplify the portions of the genome containing the SNPs interrogated by the array. This complexity-reduced sample is then fragmented, labeled and hybridized to an oligonucleotide SNP array containing probes complementary to the sequences surrounding the SNP loci. The Affymetrix analysis software is commonly used to analyze the scanned array image and compute genotype calls of all the SNPs in a sample with high accuracy (Figure 1) [17, 79].
Figure 1: (A) The design scheme of a probe set on the HuSNP array, which contains 1400 SNPs. Five quartets (columns) of oligonucleotide probes interrogate the genotype of a SNP. In the central quartet, the perfect match (PM) probes are complementary to the reference DNA sequences surrounding the SNP with two alleles (denoted as allele A and B). The mismatch (MM) probes have a substituted central base pair compared to the corresponding PM probes, and they control for cross-hybridization signals. Shifting the central quartet by -1, 1 and 4 base pairs forms four additional quartets. The newer generations of SNP arrays have selected quartets from both forward and reverse strands (Figure 6). (B) A genotyping algorithm makes genotype calls for this SNP in three different individuals based on the hybridization patterns. For a diploid genome, three genotypes are possible in different individuals: AA, AB and BB.
Several groups have pioneered the use of SNP arrays for LOH analysis of cancer [6, 13, 18]. These studies compared the SNP genotype calls or probe signals of a cancer sample to those of a paired normal sample from the same patient (Figure 2), and identified chromosomal regions with shared LOH across multiple samples of a cancer type. They provided a proof of principle that SNP arrays can identify genomic alterations at comparable or superior resolution to microsatellite markers. Additional studies have applied SNP arrays to identify LOH in various cancer tissue types [7, 19]. In addition, SNP arrays have been utilized for copy number analysis of cancer samples [20, 21]. Probe-level signals on the SNP arrays identified genomic amplifications and homozygous deletions, which were confirmed by Q-PCR and array CGH on the same samples. These analysis methods have been implemented in several software packages to be used by the research community (see Section 5). In this chapter, we will review the analysis methods and software for SNP array applications in cancer genomic studies. We will focus on the methods and software developed by our group since they are representative, but we will also discuss related methods developed by others and future directions.
Figure 2: Comparing the genotypes of a SNP in paired normal and tumor samples gives LOH calls. Only SNP markers that are heterozygous in normal samples are informative and provide LOH or retention calls.
2 Loss of heterozygosity analysis using SNP arrays
2.1 LOH analysis of paired normal and tumor samples
For each SNP, there are two alleles in the population, usually represented by A and B (corresponding to two of the four different nucleotides A/T/G/C). Each person has two copies of every chromosome, one inherited from each parent. Therefore, the genotype of an SNP in one sample can be homozygous (AA or BB) if the SNP alleles on the two chromosomes are the same, or heterozygous (AB) if they are different. Loss of heterozygosity (LOH) refers to the loss of the contribution of one parent in selected chromosome regions, due to hemizygous deletion or mitotic gene conversion [23]. LOH events frequently occur when somatic normal cells undergo transformations to become cancerous cells. By comparing the normal and tumor cells from the same patient, the SNPs can be classified as LOH or retention of heterozygosity (or simply "retention") (Figure 2). Afterwards, the LOH data can be visualized along a chromosome (Figure 3), and non-informative calls may be inferred to have LOH or retention calls through the "Nearest neighbor" or "The same boundary" method [22].
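To make the comparison concrete, the call logic can be written down directly; the following minimal sketch (Python, with a hypothetical helper name and simple genotype codes, not the dChip implementation) maps a normal/tumor genotype pair to the call categories used in Figure 3.

def paired_loh_call(normal, tumor):
    """Classify one SNP from its genotype calls in a paired normal/tumor sample.
    Genotypes are coded as "AA", "AB", "BB", or "NC" (no call)."""
    if normal == "NC" or tumor == "NC":
        return "no call"
    if normal == "AB":                           # heterozygous normal: informative marker
        return "LOH" if tumor in ("AA", "BB") else "retention"
    if tumor == normal:                          # homozygous normal, unchanged in tumor
        return "non-informative"
    return "conflict"                            # e.g. AA in normal but BB or AB in tumor

# Examples matching the legend of Figure 3
print(paired_loh_call("AB", "BB"))   # LOH
print(paired_loh_call("AA", "AA"))   # non-informative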
2.2 Tumor-only LOH inference
Often paired normal samples for primary tumors or cell lines are not available, and for hematologic malignancies, blood-derived normal DNA is difficult to obtain. As higher density SNP arrays become available, it is possible to use only the tumor
Figure 3: Observed LOH calls by comparing normal (N) and tumor (T) genotypes at the same SNP. The sample pairs are displayed on the columns (column names only show tumor samples) and the SNPs are ordered on the rows by their p-arm to q-arm positions in one chromosome. Yellow: retention (AB in both N and T); Blue: LOH (AB in N, AA or BB in T); Red: conflict (AA/BB in N, BB/AA or AB in T); Gray: non-informative (AA/BB in both N and T); White: no call (no call in N or T).
sample to identify LOH regions. For example, when we see a long stretch of homozygous calls in a chromosome region of a tumor sample, it is highly likely that LOH has happened in this region. If we assume the average heterozygosity is 0.3 and the SNPs are independent, the probability of observing N consecutive homozygous SNP calls when no LOH happens is (1 - 0.3)^N, which can be very small when N is large, favoring the hypothesis that the region has undergone LOH. Previously, the homozygosity mapping of deletion (HOMOD) method identified regions of hemizygous deletion in unmatched tumor cell lines by defining a region of LOH as more than five consecutive homozygous microsatellite markers [24]. However, applying this approach to SNP array data is not trivial since SNP markers are less polymorphic, the inter-marker distances are variable, and the haplotype structure may lead to interdependence in the genotype calls of neighboring SNPs. In addition, as with other methods, there is a small rate of genotyping error. These multiple sources of information are not comprehensively accounted for in a
simple probability or counting method. We utilized a Hidden Markov Model (HMM; [25]) to formalize this inference and use these multiple sources of information [26]. HMMs have been used for modeling biological data in diverse applications such as protein and DNA sequence analysis [27-29], linkage studies [30, 31], and array CGH analysis [32]. SNP genotype or signal data along a chromosome are chain-like, have SNP-specific genotype or signal distributions, and have variable inter-marker distances. For tumor-only LOH analysis, we conceptualize the observed SNP calls (homozygous call A or B, heterozygous call AB) as being generated by the underlying unobserved LOH states (either Loss or Retention) of the SNP markers (Figure 4). Under the Retention state, we observe homozygous or heterozygous SNP calls with probabilities determined by the allele frequency of an SNP. Under the Loss state, we will almost surely observe a homozygous SNP call. However, we need to account for possible SNP genotyping errors at a rate of < 0.01 [14] and also for the wrong mapping of a few markers. These distributions give the "emission probabilities" of the HMM.
Figure 4: Graphical depiction of the elements comprising the Hidden Markov Model (HMM) for LOH inference. The underlying unobserved LOH states (Loss or Retention (RET)) of the SNP markers generate the observed SNP calls via emission probabilities. The solid arrows indicate the transition probabilities between the LOH states. The broken arrows indicate the dependencies between the consecutive SNP genotypes within a block of linkage disequilibrium, which are handled in an expanded HMM model (not shown).
In order to determine the HMM initial probabilities, we next estimated the proportion of the genome that is retained by assuming that no heterozygous markers should be observed in a region of Loss except in the case of genotyping error. To this end, the proportion of the retained genome in a tumor sample was estimated by dividing the proportion of heterozygous markers in the tumor sample by the average rate of heterozygosity of the SNPs in the population (0.35 for the 10K arrays and 0.27 for the 100K array [14, 15]). This retention marker proportion was used as the HMM initial probability, specifying the probability of observing the Retention state at the beginning of the p-arm, and was also used as the background LOH state distribution in a sample. Lastly, we specified the transition probabilities that describe the correlation between the LOH states of adjacent markers. In a particular sample, the larger the
distance between two markers is, the more likely it is that genetic alteration breakpoints will happen within the interval and make the LOH states of the two markers independent. We use Haldane's map function θ = (1 - e^(-2d))/2 [31] to convert the physical distance d between two SNP markers (in units of 100 megabases ≈ 1 Morgan) to the probability (2θ) that the LOH state of the second marker returns to the background LOH state distribution in this sample and is thus independent of the LOH state of the first marker. Although Haldane's map function is traditionally used in linkage analysis to describe meiotic crossover events, the motivations for applying it in the context of LOH events are the following: (1) LOH can be caused by mitotic recombination events [33], and mitotic recombination events may share similar initiation mechanisms and hot spots with meiotic crossover events [34]; (2) the empirical transition probabilities estimated from observed LOH calls based on paired normal and tumor samples agree well with Haldane's map function (see Figure 2 in [26]). We used this function to determine the transition probabilities as follows. If one marker is Retention, the only way for the next adjacent marker to be Loss is that the two markers are independent and the second marker has the background LOH state distribution. This happens with probability 2θ · P0(Loss), where P0(Loss) is the background LOH Loss probability described previously. If the first marker is Loss, the second could be Loss either due to the dependence between the two markers (occurring with probability 1 - 2θ), or because the second marker has the background LOH state distribution (occurring with probability 2θ · P0(Loss)). Therefore, the probabilities of the second marker being Loss given the LOH status of the current marker are:

P(Loss | Loss) = 2θ · P0(Loss) + (1 - 2θ),
P(Loss | Retention) = 2θ · P0(Loss).

These equations provide the marker-specific transition probabilities of the HMM, determining how the LOH state of one marker provides information about the LOH state of its adjacent marker. The HMM with these emission, initial and transition probabilities specifies the joint probability of the observed SNP marker calls and the unobserved LOH states in one chromosome of a tumor sample. The forward-backward algorithm [29] was then applied separately to each chromosome of each sample to obtain the probability of the LOH state being Loss for every SNP marker, and the inferred LOH probabilities can be displayed in dChip and analyzed in downstream analyses. To evaluate the performance of the HMM and tune its parameters, we applied it to the 10K array data of 14 pairs of matched cancer (lung and breast) and normal cell lines [21]. Only data from autosomes were used. The average observed sample LOH proportion of these paired samples is 0.52 ± 0.15 (standard deviation). We assessed the performance of the HMM by comparing the LOH events inferred using only the tumor data to the LOH events observed by comparing tumor data to their normal counterparts. Since the tumor-only HMM provides only the probability for each SNP being Loss or Retention, rather than making a specific call, we used the least stringent threshold to make LOH calls, in which a SNP is called Loss if it has a probability of Loss greater than 0.5, and Retention otherwise.
Using this threshold, very similar results were achieved when comparing the inferred LOH from unmatched tumor samples to the observed LOH from tumor/normal pairs (Figure 5). Specifically, we found that 17,105 of 17,595 markers that were observed as Loss in tumor/normal pairs were also called Loss in unmatched tumors (a sensitivity of 97.2%), and 15,845 of 16,069 markers that were observed as Retention in tumor/normal pairs were also called Retention in unmatched tumors (a specificity of 98.6%). However, comparison of the tumor-only inferences (Figure 5B) to the observed LOH calls (Figure 5A) identifies occasional regions that are falsely inferred as LOH in the tumor-only analysis (indicated by arrows). We found that these regions are due to haplotype dependence of SNP genotypes, which can be addressed by extending the HMM model to consider the genotype dependence between neighboring SNPs (Figure 4; see [26] for details).
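To make the pieces of the model concrete, here is a minimal sketch (Python/NumPy; the function name, the simplified emission model, and the toy data are illustrative assumptions, not the dChip implementation) that assembles the emission, initial, and transition probabilities described above and runs a scaled forward-backward pass to obtain the posterior probability of Loss at each SNP.

import numpy as np

def loh_posteriors(calls, het_freq, pos_mb, p0_loss, err=0.01):
    """calls: 1 if the SNP call is heterozygous, 0 if homozygous.
    het_freq: population heterozygosity of each SNP.
    pos_mb: physical positions in megabases (one chromosome, sorted).
    p0_loss: background probability of the Loss state in this sample.
    err: genotyping-error / mis-mapping rate."""
    calls, het_freq, pos_mb = map(np.asarray, (calls, het_freq, pos_mb))
    n = len(calls)
    # Emission probability of a heterozygous call under each state:
    #   Retention: roughly the SNP's heterozygosity (minus call errors)
    #   Loss:      essentially only via genotyping error
    p_het = np.vstack([het_freq * (1 - err),          # state 0: Retention
                       np.full(n, err)])              # state 1: Loss
    emit = np.where(calls == 1, p_het, 1 - p_het)     # shape (2, n)

    init = np.array([1 - p0_loss, p0_loss])
    d = np.diff(pos_mb) / 100.0                       # 100 Mb ~ 1 Morgan
    two_theta = 1 - np.exp(-2 * d)                    # 2*theta, theta = (1 - e^(-2d))/2

    def trans(t):
        # Rows: from (Retention, Loss); columns: to (Retention, Loss).
        return np.array([[t * (1 - p0_loss) + (1 - t), t * p0_loss],
                         [t * (1 - p0_loss),           t * p0_loss + (1 - t)]])

    # Scaled forward pass.
    alpha = np.zeros((2, n))
    alpha[:, 0] = init * emit[:, 0]
    alpha[:, 0] /= alpha[:, 0].sum()
    for i in range(1, n):
        A = trans(two_theta[i - 1])
        alpha[:, i] = (alpha[:, i - 1] @ A) * emit[:, i]
        alpha[:, i] /= alpha[:, i].sum()

    # Scaled backward pass.
    beta = np.ones((2, n))
    for i in range(n - 2, -1, -1):
        A = trans(two_theta[i])
        beta[:, i] = A @ (emit[:, i + 1] * beta[:, i + 1])
        beta[:, i] /= beta[:, i].sum()

    post = alpha * beta
    post /= post.sum(axis=0)
    return post[1]                                    # posterior P(Loss) per SNP

# Toy example: a run of homozygous calls in the middle of the chromosome.
calls = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]
p_loss = loh_posteriors(calls, het_freq=[0.3] * 10,
                        pos_mb=np.arange(10) * 0.5, p0_loss=0.5)
loh_calls = p_loss > 0.5

In this toy example the stretch of homozygous calls in the middle receives a high posterior probability of Loss, consistent with the intuition behind the (1 - 0.3)^N argument above.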
Figure 5: Chromosome 10 LOH calls in samples I-IV: (A) tumor/normal observations and (B) tumor-only inferences.
"' :.·1•.·• .•·• 1 10) normal reference samples to analyze together with tumor samples, we propose to use a trimming method to obtain reference signal distribution. Specifically, we assume that, in a set of reference normal samples (or a set of tumor samples when normal samples are not available or too few), for any SNP, at most a certain percent of the samples have abnormal copy numbers at the SNP locus (the abnormal percentage P is user defined, such as 10%). Then for each SNP, (P/2)% of samples with extreme signals are trimmed from the high and low end, and the rest samples are used to estimate the mean and standard deviation of the signal distribution of normal copy number 2 at this SNP. This trimming method is designed to accommodate the small amount of CNVs in reference normal samples. To address the issue (2) above and also test the feasibility of the method proposed in (1), we used a public lOOK SNP array dataset of 90 normal individuals of European ethnicity (https: / / www.affymetrix.com/support / technical/sample_data / hapmap_trio_data. affx). We set abnormal percentage as 5% to search for CNVs in these samples, and this method is able to identify 39 CNVs in the 90 normal samples with size ranging from 120Kb to 16Mb (4 to 411 SNPs) (Figure 9). Some of the CNVs are recurrent in related family members. These identified CNVs in normal samples can be excluded when analyzing lOOK SNP data of tumor samples of Caucasian origin, so that the same CNVs found in tumor samples are not considered in downstream analysis. If some studies contain paired normal samples for tumors, one can screen the normal samples for CNVs and the CNVs found in both normal and tumor will be excluded from cancer locus analysis. In addition, public databases such as Database of Genomic Variants (http://projects.tcag.ca/variationj) and Human Structural Variation Database [48] contain the known CNVs from various studies. The information of the genome position, size, and individual ethnicity of the CNVs in these databases can be used to exclude the known CNVs in normal individuals when analyzing the cancer samples of the corresponding ethnic groups.
Figure 9: The CNVs of chromosome 22 identified in 90 normal individuals using the 100K Xba SNP array. The rows are ordered SNPs and the columns are samples. The white or light red colors correspond to fewer than 2 copies (homozygous or hemizygous deletions). The CNVs are in the region 20.7 - 22 Mb (22q11.22 - q11.23). The CNVs in this region have been reported in other studies [42, 48]. Samples F4 and F6 are from the same family and share a CNV region.
Furthermore, the SNP-array based methods have also been used to find genomic alterations in other diseases involving chromosomal changes, such as Down syndrome (chromosome 21 has 3 copies), autism spectrum disorder [49, 50] and mental retardation [51]. A similar trimming method as above may be used to identify copy number changes in the samples of these diseases. The DECIPHER database (https://decipher.sanger.ac.uk/) can be queried to check if a CNV is related to a disease phenotype.
3.6 Other copy number analysis methods for SNP array data
Another major research topic in copy number analysis using SNP or CGH arrays is how to infer the underlying real copy numbers in cells based on the raw copy numbers or log ratios observed from arrays. In addition to the HMM method introduced above, median smoothing is a simple but intuitive method. One can set an SNP marker window size (e.g., 10) so that a SNP's smoothed copy number is the median of the raw copy numbers of the SNPs in the surrounding window. Compared to the HMM-inferred copy numbers, this method is faster and gives results closer to the raw copy numbers. It is also more robust to outliers in the raw copy numbers, and does not need parameter specifications as in the HMM method. However, median-smoothed copy numbers are not as smooth as HMM-inferred copy numbers, and copy changes smaller than half of the window size will be smoothed out. The circular binary segmentation method has been introduced to analyze array CGH data and identify copy number change points in a recursive manner [52]. This algorithm utilizes permutation of probe orders to test the significance of candidate change points, based on which the mean and variance of chromosomal segments are estimated piecewise. Furthermore, a regression model has been used to formalize the detection of DNA copy number alterations as a penalized least squares regression problem [53]. One study compared 11 different algorithms for smoothing array CGH data, including methods using segment detection, smoothing, HMM, regression, and wavelets [54]. These authors also provided a website allowing analysis of array CGH data using multiple algorithms (http://compbio.med.harvard.edu/CGHweb/).
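As an illustration, a minimal median-smoothing sketch follows (Python/NumPy; the window size and toy data are illustrative). Consistent with the remark above, a 4-SNP amplification narrower than half of the 10-SNP window is smoothed away while a broad deletion is retained.

import numpy as np

def median_smooth(raw_cn, window=10):
    """Median-smooth a 1-D array of raw copy numbers with a centered window."""
    raw_cn = np.asarray(raw_cn, dtype=float)
    half = window // 2
    smoothed = np.empty_like(raw_cn)
    for i in range(len(raw_cn)):
        lo, hi = max(0, i - half), min(len(raw_cn), i + half + 1)
        smoothed[i] = np.median(raw_cn[lo:hi])
    return smoothed

# Toy chromosome: normal (2 copies), a 4-SNP amplification, normal again,
# then a 20-SNP deletion.
raw = np.r_[np.full(30, 2.0), np.full(4, 5.0), np.full(30, 2.0), np.full(20, 1.0)]
print(median_smooth(raw, window=10)[28:36])   # the short amplification is smoothed to 2.0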
4 High-level analysis using LOH and copy number data
4.1 Finding significantly altered chromosome regions across multiple samples
After inferring LOH and copy-number-altered chromosome regions in individual samples, we are interested in defining the regions of alteration shared by multiple tumors more often than expected by random chance. Under the clonal selection process that generates the tumor cells, the alterations at cancer gene loci are more likely to
be selected in the tumor samples than the alterations elsewhere in the genome. Therefore, the non-randomly shared alteration regions are likely to harbor cancer-causing genes or to be more susceptible to chromosomal damage in cancer samples, while the rest of the genome contains random alterations due to the genetic instability of the tumor. There are existing methods to infer cancer-related chromosome regions from LOH data based on microsatellite markers or array CGH data of multiple samples. The instability-selection modeling of allelic loss and copy number alteration data has been used to locate cancer loci and discover combinations of genetic events associated with cancer sample clusters [57, 58]. This method has been applied to a meta-analysis of 151 published studies of LOH in breast cancer [59]. A tree-based multivariate approach focuses on a set of loci with marginally most frequent alterations [60]. The pairwise frequencies among these loci are then transformed into distances and analyzed by tree-building algorithms. The "Cluster along chromosomes" (CLAC) algorithm constructs hierarchical clustering trees for a chromosome and selects the interesting clusters by controlling the False Discovery Rate (FDR) [61]. It also provides a consensus summary statistic and its FDR across a set of arrays. We have developed a scoring method to identify significantly shared LOH regions unlikely to be due to random chance [22]. For paired normal and tumor samples, the first step is to define an LOH scoring statistic across samples. For a particular chromosomal region containing one or more SNPs, we defined a score in each individual to quantify the region's likelihood of being real LOH: the proportion of LOH markers among all the informative markers in this region, with some penalty given to conflict LOH calls. We used the proportion of LOH markers rather than the actual counts of LOH markers due to different marker densities at different chromosomal regions. The scores of all individuals are then summed up to give a summary LOH score for this chromosomal region. The second step is to use permutation to assess the significance of the score. Under the null hypothesis that there are no real LOH regions for the entire chromosome (all the observed LOH markers come from measurement error), one can generate the null distribution of the LOH statistic by permuting the paired normal and tumor samples and then obtaining LOH scoring statistics based on the permuted datasets. If all the observed LOH events are due to call errors and thus are not cancer-related (the null hypothesis), then the paired normal and tumor samples are conceptually indistinguishable, and the observed differences between them represent the background noise from which we would like to distinguish potentially real LOH events. Therefore, we can create the background noise by permuting the "normal" and "tumor" labels on each pair. We then compare the LOH events in the original dataset with the LOH events in a large number of such permuted datasets to assess the statistical significance of the former. The MaxT procedure [62] is used to adjust the p-values for multiple testing. For each permuted dataset, we obtain the maximal score among all the regions in the genome or a chromosome. The adjusted p-value of a specific region is the proportion of the permuted maximal scores that are greater than the observed LOH score.
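The two steps can be sketched as follows (Python/NumPy; the genotype coding, the form of the conflict penalty, and the region partition are illustrative choices rather than the chapter's exact definitions).

import numpy as np

def region_score(normal, tumor, conflict_penalty=1.0):
    """normal, tumor: genotype matrices of shape (n_snps_in_region, n_pairs),
    coded 0 = AA, 1 = AB, 2 = BB.  Returns the summed per-sample LOH scores."""
    informative = normal == 1                         # heterozygous in normal
    loh = informative & (tumor != 1)                  # AB -> AA or BB
    conflict = (normal != 1) & (tumor != normal)      # e.g. AA -> BB or AB
    score = 0.0
    for j in range(normal.shape[1]):
        n_inf = informative[:, j].sum()
        if n_inf > 0:
            score += (loh[:, j].sum() - conflict_penalty * conflict[:, j].sum()) / n_inf
    return score

def maxT_adjusted_pvalues(normal, tumor, regions, n_perm=1000, seed=0):
    """regions: list of index arrays, each giving the SNP rows of one region."""
    rng = np.random.default_rng(seed)
    observed = np.array([region_score(normal[r], tumor[r]) for r in regions])
    max_perm = np.empty(n_perm)
    for b in range(n_perm):
        swap = rng.random(normal.shape[1]) < 0.5      # swap N/T labels within each pair
        n_b = np.where(swap, tumor, normal)
        t_b = np.where(swap, normal, tumor)
        max_perm[b] = max(region_score(n_b[r], t_b[r]) for r in regions)
    # Adjusted p-value: proportion of permuted maximal scores >= observed score.
    return np.array([(max_perm >= s).mean() for s in observed])

For example, regions could be taken as consecutive 50-SNP windows, regions = [np.arange(i, min(i + 50, n_snps)) for i in range(0, n_snps, 50)], before refining around the significant windows.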
Figure 10: (A) In the dChip LOH view, the tumor-only LOH inference calls LOH (blue color) in multiple prostate samples. The curve and the vertical line represent LOH scores and a significance threshold. (B) The 3 Mb region of the LOH scores in (A) contains the known TSG retinoblastoma 1. The cytoband and gene information are displayed on the left. Data courtesy of Rameen Beroukhim.
The significantly shared LOH regions in an SNP dataset of 52 pairs of normal and prostate cancer samples harbor the known PTEN tumor suppressor gene [22]. We also tested the method on a breast cancer dataset where LOH events occur much more frequently [7]. The p-value curve for all chromosomes is able to capture the regions where the LOH events occur frequently across multiple tumors, and the LOH patterns cluster the samples into two meaningful subtypes. The permutation approach above tests the null hypothesis that the observed shared LOH is due to SNP genotyping or mapping errors and that the normal and tumor labeling is non-informative for producing genomic alterations. This is reflected in the permutation scheme used to randomly switch the normal and tumor samples within a sample pair to generate permutation datasets under the null hypothesis. A more direct null hypothesis is that there is no region in the tumor genome that is selected to have more shared LOH events than the rest of the genome regions. In addition, when the paired normal sample is not available for LOH or copy number analysis, we can still use a simple scoring method such as the proportion of samples having LOH or copy number alterations to quantify the sharing of LOH (Figure 10A), but permuting paired samples is not an option anymore. In such situations, we propose to permute SNP loci in each sample while preserving most of the dependence structure of neighboring SNPs. Specifically, for each sample with SNPs ordered first by chromosome and then by position within chromosome, we randomly partition the whole genome into K (≥ 2) blocks, and randomly switch the order of these blocks while preserving the order of SNPs within each block. In this way the SNPs associated with LOH or copy number alterations in a sample are randomly relocated in blocks to new positions in the genome, while only minimally perturbing the dependence of the LOH or copy number data of neighboring markers. The same permutation applies to all samples using a different random partition for each sample. The LOH or copy number alteration score at
each SNP locus can then be computed for the permuted dataset, and the MaxT method can be used to assess the significance of the original scores. A LOH scoring curve can be computed and displayed next to the LOH data along with a significance threshold to help researchers locate and explore significantly shared LOH regions (Figure 10). Another cancer alteration scoring and permutation method, GISTIC [63], has been successfully applied to SNP array data of glioma and lung cancer samples to identify novel cancer-related genes that were subsequently experimentally confirmed [5].
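A minimal sketch of this block permutation and the resulting genome-wide threshold is given below (Python/NumPy; K, the number of permutations, and the simple per-SNP sharing score are illustrative choices).

import numpy as np

def block_permute(states, K=20, rng=None):
    """states: 0/1 alteration matrix of shape (n_snps, n_samples), SNPs in
    genome order.  Each column is cut into K blocks at random breakpoints and
    the blocks are shuffled, preserving the SNP order inside each block."""
    rng = np.random.default_rng() if rng is None else rng
    n_snps, n_samples = states.shape
    permuted = np.empty_like(states)
    for j in range(n_samples):
        cuts = np.sort(rng.choice(np.arange(1, n_snps), size=K - 1, replace=False))
        blocks = np.split(states[:, j], cuts)
        order = rng.permutation(K)
        permuted[:, j] = np.concatenate([blocks[b] for b in order])
    return permuted

def sharing_threshold(states, K=20, n_perm=500, alpha=0.05, seed=0):
    """Genome-wide significance threshold for the per-SNP sharing score
    (here simply the proportion of altered samples), in the spirit of MaxT."""
    rng = np.random.default_rng(seed)
    max_scores = np.array([block_permute(states, K, rng).mean(axis=1).max()
                           for _ in range(n_perm)])
    return np.quantile(max_scores, 1 - alpha)

# SNPs whose observed sharing score states.mean(axis=1) exceeds the threshold
# fall in candidate regions of significantly shared alteration.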
4.2 Hierarchical clustering analysis
Clustering and classification methods have been very popular and widely used in microarray data analysis since the introduction of gene expression microarrays [64, 65]. Increasingly, SNP and array CGH data have been analyzed by these methods since they offer a global view of the genome alterations in cancer samples. It is often interesting to look for subsets of tumor samples harboring similar alteration events across the genome and to correlate such tumor subclusters with clinical outcomes. Clustering chromosome loci may also reveal genes on different chromosomes that undergo genomic alterations simultaneously and therefore possibly belong to the same oncogenic pathways [66, 67]. We have implemented a sample clustering algorithm using the LOH data of one or all chromosomes [22] (Figure 11A). The sample distances are defined as the proportion of markers that have discordant Loss or Retention status. This method can discover subtypes of breast cancer and lung cancer based on LOH profiles [6, 7]. However, we have noticed that when using the LOH data of all SNPs for clustering, the result tends to be driven by the SNP markers that are in the "Retention" status across all samples. Therefore, a filtering procedure similar to that of gene expression clustering can be applied to use only those SNPs that have significant alteration scores (Section 4.1). These SNPs have excessive alterations and may harbor known or potential cancer genes, so their LOH status may be more informative for sample clustering. Following similar reasoning, we have implemented a function to filter the SNPs that are located within a certain distance of a list of specified genes (such as all known cancer genes [68] or kinase genes [69]) and use their LOH or copy number status for clustering. Similarly, chromosome regions can be clustered based on LOH or copy number data. A chromosome region can be a whole chromosome, a cytoband, or the transcription and promoter region of a gene. For the 500K SNP array, a single gene may contain multiple SNPs. The inferred LOH or copy number data for SNPs in a region can be averaged to obtain the genomic alteration data of the region, and the data for regions in turn will be used for region filtering to obtain interesting regions to cluster regions and samples, similar to the SNP-wise filtering above. The advantage of this region-based clustering is that we will cluster hundreds of regions instead of 500K to one million SNPs, and chromosome regions can be defined by selected or filtered genes or cytobands to interrogate the genome at various resolutions or according to a specific set of genes. For LOH data, we can define the distance between two regions as the average absolute value of
Figure 11: Clustering the same set of breast cancer samples using SNP LOH data (A) and expression data (B). A similar sample cluster emerges in both clustering figures (note the left-most sample branches highlighted in blue). The blue/yellow colors indicate LOH/retention events (A), and the red/blue colors indicate high or low expression levels (B). The labels below the sample names are lymph node status (p: positive, n: negative)