Advanced Medical Statistics
EDITORS
YING LU University of California, San Francisco, USA
JI-QIAN FANG Sun Yat-Sen University, Guangzhou, China
World Scientific • New Jersey • London • Singapore • Hong Kong
Published by World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA office: Suite 202, 1060 Main Street, River Edge, NJ 07661 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
ADVANCED MEDICAL STATISTICS Copyright © 2003 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-02-4799-0 ISBN 981-02-4800-8 (pbk)
Printed in Singapore.
PREFACE
Since early in the last century, many scholars from China have studied statistics in Western countries. Some of the early pioneers, including P.L. Hsu, C.L. Chiang, C.C. Lee, K.L. Chung, and G. Tiao, achieved international recognition for their significant contributions to advanced statistics. Since the 1960s, many students from Taiwan, Hong Kong, and Mainland China have received their advanced degrees from universities in North America and Europe. Some have remained, becoming professors in academia or scientists in government or industry and making significant contributions to the fields of statistics and biostatistics. Many have been elected as fellows of the American Statistical Association and/or senior members of the International Biometric Society. Others have become editors or associate editors for important journals, including the Annals of Statistics, the Annals of Probability, the Journals of the Royal Statistical Society, the Journal of the American Statistical Association, Biometrika, Biometrics, and Statistica Sinica. Several Chinese statisticians have been honored with the COPSS award, among whom Professors T.L. Lai and J. Fan have participated in the creation of this book. Meanwhile, many young statisticians have been trained in Mainland China. They have accumulated a rich store of experience in teaching biostatistics and applying its theory and methods to medical research in their home country. Many overseas Chinese statisticians, as well as statisticians in Mainland China, Taiwan and Hong Kong, participated in writing a book in Chinese about advances in medical statistics, which was published in 2000 by The People's Health Press, Beijing. Now, with the help of World Scientific Publishing Co., we are pleased to present the English version of this book, "Advanced Medical Statistics", to a much larger professional community of English readers. The book consists of four sections and 29 chapters. The first section is about statistical methods in biomedical research, including the history of statistical thinking in medical research, medical diagnoses, dependent
data, quality control and quality assurance in medical measurements, cost-effectiveness analysis and evidence-based medicine, quality of life, meta-analysis, descriptive statistics, medical image processing, and time series. Many of these statistical methods were developed specifically for particular medical issues. The second section covers the most important statistical issues in pharmaceutical research and development, including pharmacology and pre-clinical studies, biopharmaceutical research, toxicological studies, and confirmatory clinical trials. Some of the theory and methods are published here for the first time. The third section is concerned with statistical methods in epidemiology, including statistics in genetic studies, risk assessment, infectious diseases, disease surveys, capture-recapture models for monitoring epidemics, cancer screening, and causal inference. Most of the methods have been newly developed within the past decades. The last section is dedicated to advanced statistical theory and methods, including survival analysis, longitudinal data analysis, non-parametric curve estimation, Bayesian statistics, stochastic processes, tree-structured methods, EM algorithms, and artificial neural networks. These last chapters not only summarize the current status of research, future research topics and applications in medical research, but also provide some necessary theory and background for the statistical methods discussed in the first three sections. All the chapters in the book are independent of each other; each is dedicated to a specific issue. To meet the needs of different readers, all chapters have a similar structure. The first subsection introduces the general concepts and the medical questions discussed in the chapter; examples are usually given in this section. The following sections present more specific details of concepts, methods and algorithms, with the emphasis on application and significance. Derivations and proofs are generally not included, but citations to the literature are provided for interested readers. This book is targeted at a broad readership. We hope that, whether your background is that of a physician, a researcher in bioscience, a professional statistician, or a graduate student, you will find the book appropriate to your needs. As statistical thinking and methods are essential tools in modern medicine and biomedical research, medical researchers, leaving aside the statistical derivations and mathematical arguments, will learn what statistical tools are available to them, how to prepare the necessary information to use these methods, and how to interpret statistical results and their limitations. For professional medical statisticians, this book provides a broad perspective on medical statistics, their possible applications and interactions between special subjects, and suggestions
about future research topics, which will be helpful in their research as well as in consultation work with clients. For theoretical statisticians or applied statisticians working in other areas, the book provides many examples of statistical applications and of the challenges facing medical statistics, which should help them to identify new frontiers and possible application areas for their new methods. Last but not least, this book is a good reference for graduate students, providing a broad overview of medical statistics that will help them to select their research topics and guide them into the heart of the issue. All the authors are experts in their specific areas. Each chapter reflects their own research experience, results and achievements. They have given much under the tremendous pressures of their many other obligations. As editors, we greatly appreciate their support, dedication and friendship. Many thanks to our colleagues in the School of Public Health, Sun Yat-Sen University, who provided assistance in the preparation of the book, especially Dr. Yu Chuanhua, Dr. Yan Jie, Dr. Wang Xianhong, Dr. Ling Li, Dr. Xu Zongli, Mr. Shuming Zhu, Ms. Shaomin Wu and Ms. Fangfang Zeng. We thank The People's Health Press, Beijing, for kindly permitting us to publish this book in languages other than Chinese. We are most appreciative of the editors of World Scientific Publishing Co., Singapore, for their work in bringing this book to publication.

Ying Lu and Ji-Qian Fang, Editors
ABOUT THE EDITORS
Ying Lu is an associate professor in the Department of Radiology, the director of the Biostatistics Core, UCSF Comprehensive Cancer Center, and a faculty member of the Bioengineering Graduate Program, University of California, San Francisco. He received his BS in mathematics from Fudan University (1982), his MS in applied mathematics from Shanghai Jiao Tong University (1984), and his PhD in biostatistics from the University of California, Berkeley (1990). At Berkeley, he received university fellowships (1985–1988) and the Public Health Alumni Association Scholarship (1989). In 1990, he received the Evelyn Fix Memorial Medal for an excellent statistical dissertation on animal carcinogenicity experiments, written under the guidance of Professors Manali and Chiang. He then served as an assistant professor of epidemiology and public health at the University of Miami School of Medicine (1990–1993) before moving to the Department of Radiology at the University of California, San Francisco in 1994. He was the director of the Biostatistical Laboratory in the Osteoporosis Research Group, which specialized in statistical applications in quality control, clinical trials and the diagnosis of osteoporosis; a member of the International Committee for Standards in Bone Measurement (1996–1998); and Vice President (1995–1997) and President (1999) of the San Francisco Bay Area Chapter of the American Statistical Association. Dr. Lu has supervised two post-doctoral fellows in biostatistics and more than 20 fellows in radiology and bioengineering. He has authored or co-authored more than 80 peer-reviewed articles and 4 book chapters on statistical methods for animal carcinogenicity experiments, medical diagnostic tests, and outcome prediction, as well as in the clinical research areas of radiology, osteoporosis, and cancer clinical trials. His papers have been published in various journals, such as Biometrics, Statistics in Medicine, Mathematical Biosciences, Medical Decision Making, Radiology, Journal of Bone and Mineral Research, and Cancer.
Ying Lu
Professor, PhD
1) Department of Radiology, Box 0629, University of California San Francisco, CA 94143-0629, USA [email protected]
2) Chapter 4. Statistics in Quality Control, Quality Assurance, and Quality Improvement in Radiological Studies

Ji-Qian Fang, born in Shanghai in 1939, earned his BS in 1961 from the Department of Mathematics, Fudan University, and his PhD in 1985 from the Program in Biostatistics, University of California at Berkeley. His PhD thesis studied multi-state survival analysis for life phenomena under the guidance of Professor Chin Long Chiang. From 1985 to 1990, Dr. Fang was Professor and Director of the Department of Biostatistics and Biomathematics, Beijing Medical University; since 1991, he has been the Director and Chair Professor of the Department of Medical Statistics, School of Public Health, Sun Yat-Sen University. Professor Fang was a visiting professor at the University of Kent, UK, in 1987 and at the Australian National University in 1990, as well as an adjunct professor at the Chinese University of Hong Kong (since 1993). He is the secretary for the Group China of the International Biometric Society and vice president of the Chinese Association of Health Statistics. Professor Fang has published more than 100 peer-reviewed articles, monographs and textbooks, including "Methods of Mathematical Statistics", "Advanced Mathematics", "Computer and Its Applications in Medical Field" and "Medical Statistics and Computerized Experiment." Professor Fang has supervised 25 master's students, 17 PhD students and 2 post-doctoral fellows in biostatistics. His own research projects and joint projects with his students cover a wide variety of fields, including "Stochastic Models of Life Phenomena", "Gating Dynamics of Ion Channels", "Biostatistical Theory and Methods for Research on Cancer Prevention", "Bootstrap Studies on Multi-state Models", "Statistical Methods for Data on Quality of Life", "Health and Air Pollution", "Analysis of DNA Finger Printing", and "Linkage Analyses between Complex Trait and Multiple Genes". These projects were sponsored either by the national foundations of China or by international organizations, such as the World Health Organization and the European Commission. Several research projects directed by Professor Fang have received awards
from the Beijing Municipal Government or the Ministry of Public Health of China for their significant advances in the field of biostatistics, including the projects on "Sequential Discriminant Analysis", "Multi-state Survival Analysis", "Measurement of Quality of Life in China" and "Biostatistical Theory and Methods for Research on Cancer Prevention."
Ji-Qian Fang
Professor, PhD
1) Department of Medical Statistics, School of Public Health, Sun Yat-Sen University, 74 Zhongshan Road II, Guangzhou 510080, Guangdong, PR China [email protected]
2) Chapter 6. Quality of Life: Issues Concerning Assessment and Analysis
Chapter 26. Stochastic Processes and Their Applications in Medical Science
Contents

Preface  v
About the Editors  ix

Section 1. Statistical Methods in Biomedical Research  1
Chapter 1. History of Statistical Thinking in Medicine (Tar Timothy Chen)  3
Chapter 2. Evaluation of Diagnostic Test's Accuracy in the Presence of Verification Bias (Xiao-Hua Zhou)  21
Chapter 3. Statistical Methods for Dependent Data (Feng Chen)  45
Chapter 4. Statistics used in Quality Control, Quality Assurance, and Quality Improvement in Radiological Studies (Ying Lu and Shoujun Zhao)  101
Chapter 5. Cost-Effectiveness Analysis and Evidence-Based Medicine (Jianli Li)  157
Chapter 6. Quality of Life: Issues Concerning Assessment and Analysis (Ji-Qian Fang and Yuantao Hao)  195
Chapter 7. Meta-Analysis (Xuyu Zhuo, Ji-Qian Fang, Chuanhua Yu, Zongli Xu and Ying Lu)  233
Chapter 8. Describing Data, Variability and Over-Dispersion in Medical Research (Ming Tan)  319
Chapter 9. Time Series Analysis and Its Applications in Medical Sciences (Jinxi Zhang, Yingdong Zheng and Dejian Lai)  333
Chapter 10. Applications of Statistical Methods in Medical Imaging (Jesse S. Jin)  379

Section 2. Statistical Methods in Pharmaceutical Research  407
Chapter 11. Statistics in Pharmacology and Pre-Clinical Studies (Tze Leung Lai, Mei-Chiung Shih and Guangrui Zhu)  409
Chapter 12. Statistics in Biopharmaceutical Research (Shein-Chung Chow and Annpey Pong)  443
Chapter 13. Statistics in Toxicology (James J. Chen)  495
Chapter 14. Some Statistical Issues of Relevance to Confirmatory Trials (George Y. H. Chi, Kun Jin, Gang Chen and Lu Cui)  523

Section 3. Statistical Methods in Epidemiology  581
Chapter 15. Statistics in Genetics (Zhaohai Li and Minyu Xie)  583
Chapter 16. Dose-Response Modeling in Health Risk Assessment (Yiliang Zhu)  617
Chapter 17. Statistical Models and Methods in Infectious Diseases (Hulin Wu and Shoujun Zhao)  645
Chapter 18. Special Models for Sampling Survey (Sujuan Gao)  685
Chapter 19. The Use of Capture-Recapture Methodology in Epidemiological Surveillance (Anne Chao, H-C. Yang and P. S. F. Yip)  711
Chapter 20. Statistical Methods in the Effect Evaluation of Mass Screening for Diseases (Qing Liu)  741
Chapter 21. Causal Inference (Zhi Geng)  777

Section 4. Advanced Statistical Theory and Methods  813
Chapter 22. Survival Analysis (Danyu Lin)  815
Chapter 23. Regression Models for the Analysis of Longitudinal Data (Colin Wu and Kai F. Yu)  837
Chapter 24. Local Modeling: Density Estimation and Nonparametric Regression (Jianqing Fan and Runze Li)  885
Chapter 25. Bayesian Methods (Minghui Chen and Keying Ye)  933
Chapter 26. Stochastic Processes and Their Applications in Medical Science (Caixia Li and Ji-Qian Fang)  991
Chapter 27. Tree-Based Methods (Heping Zhang)  1033
Chapter 28. Maximum Likelihood Estimation From Incomplete Data via EM-Type Algorithms (Chuanhai Liu)  1051
Chapter 29. Introduction to Artificial Neural Networks (Jielai Xia, Jiang Hongwei and Tang Qiyi)  1073

Index  1091
Section 1 Statistical Methods in Biomedical Research
CHAPTER 1
HISTORY OF STATISTICAL THINKING IN MEDICINE

TAR TIMOTHY CHEN
Timothy Statistical Consulting, 2807 Marquis Circle East, Arlington, TX 76016, USA
1. Introduction Biostatistics is a very hot discipline today. Biostatisticians are in demand in the United States. Medical researchers appreciate statistical thinking and applications. In laboratory science, clinical research and epidemiological investigation, statisticians' collaborations are sought after. In many medical journals, statisticians are asked to serve as reviewers. In NIH (National Institutes of Health) grant applications, statisticians are required to be collaborators and statistical considerations have to be incorporated. In pharmaceutical development, drug companies recruit statisticians to guide study design, to analyze data, and to prepare reports for submission to the FDA (Food and Drug Administration). All in all, statistical thinking permeates medical research and health policy. But it was not this way in the beginning. This article describes the history of the application of statistical thinking in medicine.
2. Laplace and His Vision Near the time of American independence and the French Revolution, the French mathematician Pierre-Simon Laplace (1749–1827) worked on probability theory. He published many papers on different aspects of mathematical probability, including theoretical issues and applications to demography and vital statistics. He was convinced that probability theory could be applied to the entire system of human knowledge, because the principal means of finding truth were based on probabilities. Viewing medical therapy as a domain for application of probability, he said that the preferred method of
treatment would manifest itself increasingly as the number of observations increased.1,2 Laplace's view that the summary of therapeutic successes and failures from a group of patients could guide future therapy was hotly debated within the medical community. Many famous physicians like Pierre-Jean-Georges Cabanis (1757–1808) claimed that the specificity of each patient demanded a kind of informed professional judgment rather than guidance from quantitative analysis. According to their view, the proper professional behavior for physicians in diagnosing and treating disease was to match the special characteristics of each patient with the knowledge acquired through the course of medical practice. Physicians were able to judge individual cases in all of their uniqueness, rather than on the basis of quantitative knowledge. Cabanis rejected quantitative reasoning as an intellectual distraction and viewed medicine as an "art" rather than as a "science."3 On the other hand, other prominent physicians like Philippe Pinel (1745–1826) said that physicians could determine the effectiveness of various therapies by counting the number of times a treatment produced a favorable response. He considered a treatment effective if it had a high success rate. He even claimed that medical therapy could achieve the status of a true science if it applied the calculus of probabilities. His understanding of this calculation, however, was restricted to counting; he did not understand the detailed nature of the probability theory being developed by Laplace.4
3. Louis and Numerical Method Later another prominent clinician, Pierre-Charles-Alexandre Louis (1787– 1872), considered that enumeration was synonymous with scientific reasoning. He followed Laplace’s proposal that analytical methods derived from probability theory help to reach a good judgment and to avoid confusing illusions. His method consisted of careful observation, systematic record keeping, rigorous analysis of multiple cases, cautious generalizations, verification through autopsies, and therapy based on the curative power of nature. He said that the introduction of statistics into diagnosis and therapy would ensure that all medical practitioners arrive at identical results.5 In his study of typhoid fever, which collected patient data between 1822 and 1827, Louis observed the age difference between the groups who died (50 patients with mean age 23) and who survived (88 patients with mean
age 21). He also compared the length of residency in Paris and concluded that the group which survived had lived in Paris longer. More importantly, Louis studied the efficacy of bloodletting as a therapy for typhoid fever. Among the 52 fatal cases, 39 patients (75%) had been bled. The mean survival time for the bled cases was 25.5 days, compared with 28 days for those who were not bled. Of the 88 recovery cases, 62 patients (70%) were bled, with the mean duration of disease being 32 days as opposed to only 31 days for those not bled.6 Louis also studied the efficacy of bloodletting in treating pneumonitis and angina tonsillaris, and found it not useful. At that time, the method of venesection was defended by Francois Joseph Victor Broussais (1772–1838), the chief physician at the Parisian military hospital and medical school. Broussais claimed that diseases could be identified by observing the lesions of organs. Then patients could be treated by bleeding the diseased organ and by low fat, since most diseases were the result of inflammation. Louis, in contrast with Broussais, emphasized quantitative results from a population of sick individuals rather than using pathological anatomy to observe disease in a particular patient. He contended that the difference between numerical results and words, such as "more or less" and "rarely or frequently," was "the difference of truth and error; of a thing clear and truly scientific on the one hand, and of something vague and worthless on the other." He also proposed the basic concept of the controlled clinical trial.7 Louis's work created more debates before the Parisian Academies of Sciences and Medicine in the late 1830s. The triggering issue was the question of the proper surgical procedure for removing bladder stones. A new bloodless method for removing bladder stones (lithotrity) was investigated by the surgeon and urologist Jean Civiale (1792–1867). He argued that, given the fallibility of human memory, surgeons tended to remember their successful cases more than their unsuccessful ones; errors result from inexact records. He published the relative rates of death from the traditional surgical procedure and from lithotrity. The death rate of the old procedure was 21.6% (1,237/5,715); the death rate for lithotrity was 2.3% (6/257).3 In response to Civiale's statistical results, the Academy of Sciences established a commission in 1835 including the mathematician Simeon-Denis Poisson (1781–1840) and the physician Francois Double (1776–1842). Rejecting the attempt to turn the clinician into a scientist through the statistical method, Double believed that the physician's proper concern should remain the individual patient. He claimed it was inappropriate to elevate
the human spirit to that mathematical certainty found only in astronomy; the eminently proper method in the progress of medicine was logical not numerical analysis.8 During that time, Lambert Adolphe Jacques Quetelet (1796–1874) proposed a new concept of the “average man,” defined as the average of all human attributes in a country. It would serve as a “type” of the nation similar to the idea of a center of gravity in physics. He formulated this idea by combining his training in astronomy and mathematics with a passion for social statistics. He analyzed the first census of Belgium (1829) and was instrumental in the formation of the Royal Statistical Society. He maintained that the concept of statistical norms could be useful to medical practice as it had been to medical research.9 At the same time, Poisson applied probability theory to the voting patterns of judicial tribunals. He used the “law of large numbers” to devise a 99.5% confidence interval for binomial probability.10 In 1837, in a lecture delivered before the French Academy of Medicine, physician Risueno d’Amador (1802–1849) used the example of maritime insurance to illustrate why the probability was not applicable to medicine. If 100 vessels perish for every 1,000 that set sail, one still could not know which particular ships would be destroyed. It depended on other prognostic variables such as the age of the vessel, the experience of the captain, or the condition of the weather and the seas. Statistics could not predict the outcome of particular patients because of the uniqueness of each individual involved. For d’Amador, the results of observation in medicine were often more variable than in other sciences like astronomy.11 In the ensuing debates, Double commented that a Queteletian average man would reduce the physician to “a shoemaker who after having measured the feet of a thousand persisted in fitting everyone on the basis of the imaginary model.” He also claimed that Poisson’s attempts to mathematize human decision-making were useless because of the pressing and immediate concerns of medical practice. Louis-Denis-Jules Gavarret (1809–1890), trained in both engineering and medicine, addressed the criticism of d’Amador in 1840. He maintained that the probability theory merely expressed the statistical results of inductive reasoning in a more formal and exact manner. He emphasized that statistical results were useful only if certain conditions prevailed — namely, the cases must be similar or comparable, and there must be large enough observations. He followed Poisson’s example in requiring a precision of 99.5% or 212:1. He commented on the insufficient sample size in Louis’ study of typhoid fever.12
In responding to the work of Gavarret, Elisha Bartlett (1804–1855), a professor of medicine at the University of Maryland and a student of Louis, said that the value of the numerical method was exhibited by Louis, and its true principles were developed and demonstrated by Gavarret.13 However, the British statistician William Augustus Guy (1810–1885), in his Croonian lecture before the Royal College of Physicians in 1860, said that Gavarret's confidence interval could be applied only on rare occasions, and that results obtained from averaging a small number of cases could generally be assumed to be accurate.14 In Germany, the ophthalmologist Julius Hirschberg (1843–1925), concerned about the number of observations required by Gavarret's standard of 212:1 odds, modified the formula by using a lower standard of confidence of 11:1, or 91.6%.15
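In modern notation, the standard that Gavarret borrowed from Poisson can be sketched roughly as follows (this is a present-day reconstruction, not Gavarret's own symbols, and the constant is the one usually attributed to him in the secondary literature). For an observed proportion $\hat{p}$ of successes among $n$ comparable cases, the "limits of oscillation" were

$$
\hat{p} \;\pm\; 2\sqrt{\frac{2\,\hat{p}\,(1-\hat{p})}{n}} ,
$$

which cover the true rate with probability about $\operatorname{erf}(2) \approx 0.9953 \approx 212/213$, i.e. odds of 212 to 1; Hirschberg's relaxed standard of 11:1 ($11/12 \approx 0.917$) amounts to limits of only about $1.7$ standard errors instead of $2\sqrt{2} \approx 2.8$. Applied, for illustration, to a typhoid mortality of $52/140 \approx 0.37$ of the size Louis reported, the limits are roughly $0.37 \pm 0.12$, an interval wide enough to explain why Gavarret regarded series of a few hundred cases as inconclusive.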
4. Statistical Analysis Versus Laboratory Investigation In articles published in 1878 and 1881, the German physician Friedrich Martius (1850–1923) commented that the dreams of Louis and Gavarret about a new era of scientific medicine had not been fulfilled due to the general "mathematical unfitness" of the medical profession as a whole. As one trained in laboratory methods, he said that the basis for science lay in laboratory experimentation rather than mere observation and the collection of numerical data.3 The legacy of Louis was in his claim that the clinical physician should aspire to become a scientist. But after Louis's retirement from the medical scene by the mid-1850s, some medical researchers began to argue that the compilation of numerical results might provide some useful insights about therapy; however, these results should not possess the authoritative status of "science." Friedrich Oesterlen (1812–1877) said that "scientific" results should be the discovery of knowledge which determined causal connections, not just the discovery of correlation.16 When Joseph Lister (1827–1912) published his pioneering work with antiseptic surgery in 1870, he noted that the average mortality rate was 45.7% (16/35) for all surgical procedures performed at the University of Edinburgh in the years 1864–1866 (before antiseptic methods were introduced), and that it was 15% (6/40) for all surgical procedures performed in the three-year period 1867–1869 (after the introduction of antiseptic methods). Although he used this statistical result to show the efficacy of the new antiseptic method, he claimed that the science behind it was the germ theory of disease as proposed by Louis Pasteur (1822–1895).17 Pasteur developed the
germ theory and the concept of immunity. He carried out a clinical trial in 1881 to test his new vaccine against anthrax. The founder of 19th century scientific positivism, Auguste Comte (1798– 1857), believed that mere empiricism (as practiced by Louis) was not really useful for medicine.18 Claude Bernard (1813–1878) proposed that the science of medicine resided in experimental physiology, rather than observational statistics. As a result of his laboratory-based orientation, he claimed that the experimental investigation of each individual patient could provide an “objective” scientific result. He agreed with Louis’s vision of medicine as a science but saw the science of medicine as focused on the physiological measurements of individual patients.19 Other prominent clinicians at that time, like German Carl Wunderlich (1815–1877), tried to steer a middle ground between Louis and Bernard and synthesized both approaches. They collected a mass of quantifiable physiological data and tried to analyze it using numerical method. However, this approach was not accepted by the medical community in general, and many still opposed the process of quantification and remained focused on the individual patient.20
5. The Beginning of Modern Statistics The founders of the Statistical Society in London in 1834 chose the motto “Let others thrash it out,” thus set the general aim of statistics as data collection. Near the end of the 19th century, scientists began to collect large amounts of data in the biological world. Now they faced obstacles because their data had so much variation. Biological systems were so complex that a particular outcome had many causal factors. There was already a body of probability theory, but it was only mathematics. Prevailing scientific wisdom said that probability theory and actual data were separate entities and should not be mixed. Due to the work of the British biometrical school associated with Sir Francis Galton (1822–1911) and Karl Pearson (1857– 1936), this attitude was changed, and statistics was transformed from an empirical social science into a mathematical applied science. Galton, a half-cousin of Charles Darwin (1809–1882), studied medicine at Cambridge, explored Africa during the period 1850–1852, and received the gold medal from the Royal Geographical Society in 1853 in recognition of his achievement. After reading Charles Darwin’s 1859 work On the Origin of Species, Galton turned to study heredity and developed a new vision for the role of science in society.21 The late Victorian intellectual movement of
scientific naturalism gave rise to the belief that scientifically trained persons must become leaders of British intellectual culture. Galton accepted the evolutionary doctrine that the condition of the human species could be improved most effectively through a scientifically directed process of controlled breeding. His interest in eugenics led him to the method of correlation. He applied the Gaussian law of error to the intelligence of human beings and, unlike Quetelet, was more interested in the distribution and deviations from the mean than in the average value itself. As a disciple of Galton, Karl Pearson, the founding father of modern statistics, created the statistical methodology and sold it to the world. Pearson changed statistics from a descriptive to an inferential discipline. He majored in mathematics at King’s College, Cambridge. After Cambridge, he studied German literature, read law and was admitted to bar. He became professor of mathematics at King’s College, London in 1881 and at University College, London in 1883. In June 1884 at age 27 he was appointed to Goldsmid Professor of Applied Mathematics at University College, London. Biologists at that time were interested in genetics, inheritance, and eugenics. In 1892 Pearson began to collaborate with zoologist WFR Weldon, Jodrell Chair of biology at University College, and developed a methodology for the exploration of life. Two years later Pearson offered his first advanced course in statistical theory, making University College the sole place for instruction of modern statistical methods before the 1920s.22 Following Galton, Pearson maintained that empirically determined “facts” obtained by the methods of science were the sole arbiters of truth. He argued for the almost universal application of statistical method, that mathematics could be applied to biological problems and that analysis of statistical data could answer many questions about the life of plants, animals, and men.23 After a paper was rejected by the Royal Society, he together with Galton and Weldon founded the journal Biometrika in 1901 to provide an outlet for the works he and his biometrical school generated. Under Galton’s generous financial support, Pearson transformed his relatively informal group of followers into an established research institute. Although he was interested in eugenics, he tried to do objective research using statistical methods and separated his institute from the social concerns of the Eugenics Education Society. Pearson’s emphasis on the statistical relevancy to the problems of biology had very few audiences. Mathematicians despised new endeavor to develop statistical methodology, and biologists thought mathematicians
had no business meddling with such things. In 1903 Pearson wrote Galton that there were only two subscribers of Biometrika in Cambridge, one a personal friend of Pearson and one of Weldon. Even though his major contributions were correlational methods and chi-square goodness-of-fit test, in 1906 the Journal of the Royal Society refused to publish a paper because they failed to see the biological significance of a correlation coefficient. In 1911 after Galton’s death, Pearson became the first Galton Professor of Eugenics at University College, London. Pearson also attempted to build an intellectual bridge to medicine by applying the statistical methods he developed. During his lifetime, the medical profession was divided about their opinion of the usefulness of statistical reasoning. Clinicians who continued to emphasize the “art” of medicine thought that statistics added little information beyond that supplied by experience. Those who argued for the existence of a “clinical science,” basing diagnosis on physiological instruments or bacteriological observation, saw statistics as a way to make observation more objective, but that did not consider that as “scientific” evidence.
6. The Beginning of Medical Statistics Major Greenwood (1880–1949) was first to respond to Pearson’s “crying need” for the medical profession to appreciate the importance of new statistical methods. At the age of 18, he entered medical school and read Pearson’s Grammar of Science. He wrote to Pearson and applied statistical analyses to his research data while a student at London Hospital. During the academic year 1904–1905, after obtaining his license to practice medicine and publishing an article in Biometrika, he chose to study under Pearson. Despite Pearson’s warning about the difficulty of earning a living as a biometrician, Greenwood decided to stake his professional career on the application of mathematical statistical methods to medical problems. In debating with the bacteriologist Sir Almroth Wright (1861–1947) about the efficacy of vaccine therapy and a statistical measure called “opsonic index,” Greenwood invoked the distinction between functional and mathematical error.24 The former concerned errors in techniques of measurement, while the latter concerned inferential errors derived from the fact that data were a sample of population. When he pointed out that Wright had committed mathematical error, he got the attention of the medical community.25 Consequently the Lister Institute for Preventive Medicine in 1903 created the first department of statistics and named him
its head. Greenwood characterized his department as dealing with problems of epidemiology and pathology, in contrast to Pearson’s department at the University College, which dealt with heredity, eugenics and pure mathematical statistics. By training Greenwood, Pearson had helped to create the role of medical statistician, who as a researcher, understood both medical results and statistical methods. Greenwood left the Lister Institute in 1920 for a position at the Ministry of Health and became affiliated with the newly created Medical Research Council (MRC). He saw his position at the medical establishment as instrumental in furthering the impact of statistical methods. Raymond Pearl (1879–1940) was Greenwood’s American counterpart. He went to London to study under Pearson after finishing his PhD in biology at the University of Michigan. In 1918 Pearl began a long-standing relationship with The Johns Hopkins University as professor of biometry and vital statistics in the School of Hygiene and Public Health and as statistician at The Johns Hopkins Hospital. By the early 1920’s, Greenwood was not alone in arguing for application of modern statistics in medicine. One writer said in the Journal of the American Medical Association in 1920 that statistics was of great practical significance and should be required in the premedical curriculum.26 Pearl in a 1921 article in the Johns Hopkins hospital Bulletin said that quantitative data generated by the modern hospital should be analyzed in cooperation with expert statistician. The arguments for using statistics in medicine were framed in terms of ensuring that medical research become “scientifically” grounded.27
7. Randomization in Experimentation Besides Pearson, another founder of modern statistics was Sir Ronald A. Fisher (1890–1962). He also majored in mathematics at Cambridge and studied the theory of errors, statistical mechanics, and quantum theory.28 By the age of 22, he published his first paper in statistics introducing the method of maximum likelihood, and three years later he wrote another paper deriving the exact sampling distribution of the Pearson correlation coefficient. He was also interested in applying mathematics to biological problems. Beginning in 1919, he spent many years at Rothamsted Experimental Station and collaborated with other researchers. He developed statistical methods for design and analysis of experiments, which were collected in his books Statistical Methods for Research Workers 29 and
The Design of Experiments.30 He proposed three main principles — the essentiality of replication and randomization, and the possibility of reducing errors by appropriate organization of the experiment. Fisher's major contribution to science was using randomization to do experiments so that the variation in the data could be accounted for in the statistical analysis, and the bias of treatment assignment could be eliminated. Greenwood characterized Fisher's ideas as "epoch-making" in an article published in 1948, the year before Greenwood's death. For Fisher, statistical analysis and experimental design were only two aspects of the same whole, and they comprised all the logical requirements of the complete process of adding to natural knowledge by experimentation.30 In other words, in order to draw inference, statisticians had to be involved in the design stage of experiments. Fisher, when addressing the Indian Statistical Congress in 1938, said, "To call in the statistician after the experiment is done may be no more than asking him to perform a postmortem examination: he may be able to say what the experiment died of". In addition to the new developments in statistical theory brought about by Fisher's work, changes within the organization of the MRC also facilitated the emergence of the modern clinical trial. Sir Austin Bradford Hill (1897–1991), one of Greenwood's proteges, was the prime motivator behind these Medical Research Council trials. He learned statistical methods from Pearson at University College and in 1933 became Reader in Epidemiology and Vital Statistics at the London School of Hygiene and Tropical Medicine, where Greenwood had become the first professor of Epidemiology and Public Health in 1927. In 1937 the editors of The Lancet, recognizing the necessity of explaining statistical techniques to physicians, asked Hill to write a series of articles on the proper use of statistics in medicine. These articles were later published in book form as Principles of Medical Statistics.31 Upon Greenwood's retirement in 1945, Hill took his place both as honorary director of the MRC's Statistical Research Unit and as professor of medical statistics at the University of London.32
8. First Randomized Controlled Clinical Trial The British Medical Research Council in 1946 began the first clinical trial with a properly randomized control group, on the use of streptomycin in the treatment of pulmonary tuberculosis. This trial was remarkable for the degree of care exercised in its planning, execution and reporting. The trial involved patient accrual from several centers, and patients were randomized
to two treatments — either streptomycin plus bed-rest, or bed-rest alone. Evaluation of patient X-ray films was made independently by two radiologists and a clinician. This blinded and replicated evaluation of a difficult disease end-point added considerably to the final agreed patient evaluation. Both patient survival and radiological improvement were significantly better on streptomycin.33 Hill's work set the trend for future clinical trials, in which both the insight of physicians and the statistical design of professional statisticians were combined. The convergence of these two separate disciplines constituted the sine qua non for the emergence of the probabilistically informed clinical trial. The Laplacian vision of the determination of medical therapy on the basis of the calculus of probability had finally found fulfillment. Hill, a non-physician, acknowledged that the medical profession was responsible for curing the sick and preventing disease, but he emphasized that experimental medicine had a third responsibility of advancing human knowledge, and that the statistically guided therapeutic trial was a useful way to discharge that responsibility. Unlike the arguments of earlier advocates of statistical application in medicine, Hill's work became a rallying cry for supporters of therapeutic reform on both sides of the Atlantic. Among the many factors that contributed to this groundswell of support, one was the proliferation of new and potent industrially produced drugs in the postwar era. Supporters argued that randomized controlled clinical trials would permit doctors to select good treatments and prevent undue enthusiasm for newer treatments. To those critics who believed in the uniqueness of the individual, whether patient or doctor, L.J. Witts, Nuffield Professor of Clinical Medicine at Oxford University, said at a conference in 1959 that neither patients nor doctors were as unique as they might have wanted to believe. Witts conceded that there was a conflict of loyalties between the search for truth and the treatment of the individual. However, he pointed out that a similar conflict existed between the teaching of clinical students and the treatment of the patient.34 At the same conference, Sir George Pickering, Regius Professor of Medicine at Oxford, praised randomized controlled clinical trials and declared that, in contrast, clinical experience was unplanned and haphazard, and physicians were victims of the freaks of chance.35 Americans were not slow in following the British lead in applying statistics to controlled clinical trials. Americans carried out the largest and most expensive medical experiment in human history. The trial was done in 1954 to assess the effectiveness of the Salk vaccine as a protection against paralysis or death from poliomyelitis. Close to two million children
participated, and the immediate direct cost was over 5 million dollars. The reason for such a large trial was that the annual incidence rate of polio was about 1 per 2000. In order to show that the vaccine could improve upon this small incidence, a huge trial was needed. Originally, there was some resistance to the randomization, but finally about one quarter of the participants did get randomized. This randomized, placebo-controlled, double-blind trial finally established the effectiveness of the Salk vaccine.36
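To see why a trial of this size was unavoidable, a rough power calculation helps. The sketch below is only an illustration using a standard two-proportion sample-size formula; the assumed effect size (a halving of risk), the 5% significance level and the 80% power are my own assumptions, not figures from the trial's protocol.

```python
# A rough sample-size sketch (an illustration, not the trial's actual design
# calculation): how many children per arm are needed to detect a halving of a
# 1-in-2000 annual polio incidence with a standard two-proportion comparison?
from statistics import NormalDist

def two_proportion_n(p1, p2, alpha=0.05, power=0.80):
    """Approximate sample size per arm for comparing two proportions."""
    z = NormalDist().inv_cdf
    z_a = z(1 - alpha / 2)          # two-sided significance level
    z_b = z(power)                  # desired power
    p_bar = (p1 + p2) / 2
    num = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
           + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return num / (p1 - p2) ** 2

p_placebo = 1 / 2000           # incidence quoted in the text
p_vaccine = p_placebo / 2      # assume the vaccine halves the risk
n = two_proportion_n(p_placebo, p_vaccine)
print(f"about {n:,.0f} children per arm")   # roughly 94,000 per arm
```

Even under these favorable assumptions the calculation calls for on the order of a hundred thousand children per arm; smaller assumed effects or stricter error criteria push the requirement toward the scale actually used.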
9. Government Regulation and Statistics In the early 1960s, the drug Thalidomide caused an outbreak of infantile deformity. The US FDA subsequently discovered that over two and a half million tablets had been distributed to 1,267 doctors, who had prescribed the drug to 19,822 patients, including 3,760 women of childbearing age. This evidence raised the question of whether the "professional judgement" of the medical community could still be trusted. The outcry from the public led the US Congress to pass the Kefauver–Harris Bill, known as the Drug Amendments of 1962 and signed by President Kennedy on October 10, 1962. This law fundamentally altered the character of research both for the drug industry and for academic medicine. It transformed the FDA into the final arbiter of what constituted successful achievement in the realm of medical therapeutics. The FDA institutionalized clinical trials as the standard method for determining drug efficacy. By the late 1960s the double-blind methodology had become mandatory for FDA approval in the US, and the procedure had become standard in most of the other Western countries by the late 1970s. The application of statistics in medicine has scientific authority and is seen as rising above individual opinions and possessing "objectivity" and "truth." The emergence of the randomized controlled clinical trial could be seen as a special case of a more general trend — the belief that "quantification is science." This also coincided with a change in the definition of statistics as a discipline. In a book written by Stanford professors Chernoff and Moses in 1959, they said, "Years ago a statistician might have claimed that statistics deals with the processing of data. Today's statistician will be more likely to say that statistics is concerned with decision making in the face of uncertainty."37 Through the work of Hill, the father of the modern clinical trial, statistical methods were slowly adopted in medical research. Clinical trials gained legitimacy because the public at large
realized that the decisions of the medical profession had to be regulated. Only when the issue of "medical decision making" was removed from the confines of professional medical expertise into the open arena of political debate could statistical methods gain such wide acceptance. This ascendancy of the clinical trial method reflected the close connection between procedural objectivity and democratic political culture. The above is the evolutionary history of statistical thinking in medicine. Medical research is much more than therapeutic research, but all medical research must lead to the improvement of therapeutics or prevention. From this history one can see how the application of numerical methods in medicine has been debated throughout the past two hundred years. It shows that it took a long time for good concepts and procedures to prevail in science. The debates described are applicable to current problems in therapeutic research on alternative and complementary medicine. Only by learning from past experience can non-orthodox medicine be modernized quickly.
10. Epilogue Early landmarks in clinical investigation anticipated the current methodology.38 For example, James Lind (1716–1794) in 1753 planned a comparative trial of the most promising treatments for scurvy. However, most pre-twentieth century medical experimenters had no appreciation of the scientific method. Trials usually had no concurrent controls, and the claims were totally subjective and extravagant. The publication by Benjamin Rush (1745–1813) in 1794 about the success of treating yellow fever by bleeding was one example. Statistics was very influential in the development of population genetics. Johann Gregor Mendel (1822–1884), a monk in the Augustinian order, studied botany and mathematics at the University of Vienna. He carried out experiments on peas to establish the three laws of genetics — uniformity, segregation and independence. After Darwin advanced the theory of evolution, there was a great debate between the evolutionists (biometricians) and those believing in the fixation of species (Mendelians). Pearson, in his series of papers Contributions to the Mathematical Theory of Evolution, I to XVI, gave mathematical form to the problems of genetics and evolution. However, he held the view of continuous change and never accepted Mendelism.39
After reading Pearson's papers while a student at Cambridge, R.A. Fisher made major contributions to the field of genetics; in particular, he synthesized and reconciled the fixed inheritance theory of Mendel and the gradual evolution theory of Darwin.40 He was considered one of the three founders of population genetics, together with Sewall Wright and J.B.S. Haldane, and he occupied an endowed chair of genetics at Cambridge University. Fisher's major contributions were the theoretical foundation of statistics, including estimation and the testing of hypotheses, exact distributions of various statistics, and statistical models of natural phenomena.41 As mentioned above in the discussion of the debates between the numerical methods school and the physiological school, physiological measurement data were collected using precise instruments during the latter half of the nineteenth century, in conjunction with the creation of research universities. Statistical methods were developed to analyze the data coming from the laboratories. Later, the controversy between the biometrical school and the bacteriologists/immunologists in the laboratory led to the further development of correct statistical methods to analyze laboratory data. Before the development of modern epidemiology, John Graunt (1620–1674) started to collect data on mortality, derived the life table based on survival, and thus created the discipline of demographic statistics. William Farr (1807–1883) further improved the method of the life table and created the best official vital statistics system in the world for Great Britain.38 In 1848, John Snow (1813–1858) carried out the first detailed investigation of the cholera epidemic of London. Development of the discipline of bacteriology was associated with the investigation of epidemics due to infectious agents. Mathematics and statistics were used in the modeling and analysis of infectious epidemic data. Modern statistical methods were developed to investigate the epidemics of non-infectious diseases in the last half of the 20th century. Epidemiological research has become another field of statistical application. It has merged with statistical survey methods to carry out surveillance and disease monitoring, and it is called population science, in contrast to clinical and laboratory sciences. In every field of medical research, statistical thinking and methods are used to provide insight into the data and to verify hypotheses. The generation of new data and new hypotheses also propels the development of new statistical methodology. In the twentieth century, modern statistics as created by Pearson and Fisher has made a huge impact on the advancement of human knowledge, and its application to medicine richly demonstrates the importance of statistics.
Acknowledgment The author would like to thank Dr. James Spivey for his input to this paper. References 1. Laplace, P. S. (1951). A Philosophical Essay on Probabilities, 6th ed., trans. Frederick Wilson Truscott and Frederick Lincoln Emory. Dover, New York. 2. Todhunter, I. (1865). A History of the Mathematical Theory of Probability, Macmillan and Co, London. 3. Matthews, J. R. (1995). Quantification and the Quest for Medical Certainty, Princeton University Press, Princeton, New Jersey. 4. Pinel, P. (1809). Traite medico-philosophique sur lalienation mentale, 2nd ed., Paris. 5. Louis, P. C. A. (1836). Pathological Researches on Phthisis, trans. Charles Cowan. Hilliard, Gray, Boston. 6. Louis, P. C. A. (1836). Anatomical, Pathological and Therapeutic Researches upon the Disease Known under the Name of Gastro-Enterite Putrid, Adynamic, Ataxic, or Typhoid Fever, etc., Compared with the Most Common Acute Diseases, Vols. 1 and 2, trans. Henry I. Bowditch. Issac R. Butts, Boston. 7. Louis, P. C. A. (1836). Researches on the Effects of Bloodletting in Some Inflammatory Diseases, and on the Influence on Tartarized Antimony and Vesication in Pneumonitis, trans. C. G. Putnam. Hilliard, Gray, Boston. 8. Double, F. J. (1835). Statistique appliquee a la medecine. Comptes rendus de lAcademie des Sciences 1: 281. 9. Quetelet, L. A. J. (1962). A Treatise on Man and the Development of His Faculties, trans. R. Knox. Research Works Series #247. Burt Franklin, New York. 10. Poisson, S. D. (1837). Recherches sur la probabilite des jugements en matiere criminelle et en matiere civile, Bachelier, Paris. 11. D’Amador, R. (1837). Memoire sue le calcul des probabilites applique a la medecine, Paris. 12. Gavarret, J. (1840). Principes generaux de statistique medicale. Libraries de la Faculte de Medecine de Paris. 13. Bartlett, E. (1844). An Essay on the Philosophy of Medical Science. Lea and Blanchard, Philadelphia. 14. Guy, W. A. (1860). The numerical method, and its application to the science and art of medicine. British Medical Journal 469: 553. 15. Hirschberg, J. (1874). Die mathematischen Grundlagen der Medicinischen Statistik, elementar Dargestellt, Veit, Leipzig. 16. Oesterlen, F. (1852). Medical Logic, trans. G. Whitley. Sydenham Society, London. 17. Lister, J. (1870). Effects of the antiseptic system of treatment upon the salubrity of a surgical hospital, The Lancet i: 40. 18. Comte, A. (1864). Cours de philosophie positive, 2nd edn., Vol. 3, JB Bailliere, Paris.
19. Bernard, C. (1957). An Introduction to the Study of Experimental Medicine, trans. Henry Copley Greene. Dover, New York. 20. Wunderlich, C. A. (1871). On the Temperature in Diseases: A Manual of Medical Thermometry, trans. W. Bathurst Woodman. New Sydenham Society, London. 21. Stigler, S. M. (1986). The History of Statistics: The Measurement of Uncertainty before 1900. The Belknap Press of Harvard University Press, Cambridge. 22. Pearson, E. S. (1938). Karl Pearson, Cambridge University Press, London. 23. Pearson, K. (1911). The Grammar of Science, 3rd edn., Macmillan, New York. 24. Cope, Z. (1966). Almroth Wright: Founder of Modern Vaccine-Therapy, Thomas Nelson, London. 25. Greenwood, M. (1909). A statistical view of the opsonic index. Proc. Royal Soc. Med. 2: 146. 26. Kilgore, E. S. (1920). Relation of quantitative methods to the advance of medical science. J. Am. Med. Assoc. 88, July 10. 27. Pearl, R. (1921). Modern methods in handling hospital statistics. The Johns Hopkins Hospital Bulletin 32: 185. 28. Box, J. E. (1979). R. A. Fisher : The Life of a Scientist, John Wiley and Sons, New York. 29. Fisher, R. A. (1958). Statistical Methods for Research Workers, 13th edn., Hafner, New York. 30. Fisher, R. A. (1960). The Design of Experiments, 7th edn., Hafner, New York. 31. Hill, A. B. (1991). Principles of Medical Statistics. 12th edn., Lancet Ltd., London. 32. Himsworth, Sir Harold. (1982). “Bradford Hill and Statistics in Medicine,” Statistics in Medicine 1: 301–302. 33. MRC. (1948). Streptomycin treatment of pulmonary tuberculosis: A Medical Research Council Investigation, Br. Med. J. 769. 34. Witts, L. J. (1960). The ethics of controlled clinical trials. In Controlled Clinical Trials, Blackwell Scientific Publications, Oxford. 35. Pickering, Sir George. (1960). Conclusion: The Physician. In Controlled Clinical Trials, Blackwell Scientific Publications, Oxford. 36. Francis, T. Jr. et al. (1955). An evaluation of the 1954 poliomyelitis vaccines trials — Summary Report, American Journal of Public Health 45(5): 1–63. 37. Chernoff, H. and Moses, L. E. (1957). Elementary Decision Theory, John Wiley and Sons, New York. 38. Gehan, E. A. and Lemak, N. A. (1994). Statistics in Medical Research: Developments in Clinical Trials, Plenum Publishing Co, New York. 39. Lancaster, H. O. (1994). Quantitative Methods in Biological and Medical Sciences: A Historical Essay, Springer-Verlag, New York. 40. Fisher, R. A. (1958). The Genetical Theory of Natural Selection, 2nd edn., Dover, New York. 41. Fisher, R. A. (1950). Contributions to Mathematical Statistics, ed. WA Shewhart, John Wiley and Sons, New York.
History of Statistical Thinking in Medicine
19
About the Author Tar Timothy Chen is currently President, Timothy Statistical Consulting. He was Head of Biostatistics Section and Professor of Biostatistics at University of Maryland Greenebaum Cancer Center, 1998–2001; Mathematical Statistician, National Cancer Institute (1989–1998). He received BS in Mathematics (1966) from National Taiwan University; MS (1969), PhD in Statistics (1972) from the University of Chicago. His research interests include categorical data analysis, epidemiological methods, and clinical trial methodology. He has authored or coauthored 102 research papers published in Biometrics, JASA, Statistica Sinica, Statistics in Medicine, Controlled Clinical Trials, New England Journal of Medicine, Journal of Clinical Oncology, Surgery, Ophthalmology, Journal of National Cancer Institute, etc. He is an elected fellow of American Statistical Association and American Scientific Affiliation. He was the president of International Chinese Statistical Association (1999). His biosketch appeared in Who’s Who in America (1999, 2000, 2001, 2002). American Men and Women of Science (1989–1998), and Marquis Who’s Who in Cancer (1985).
This page intentionally left blank
CHAPTER 2
EVALUATION OF DIAGNOSTIC TEST’S ACCURACY IN THE PRESENCE OF VERIFICATION BIAS XIAO-HUA ZHOU Division of Biostatistics, Health Services Research and Development Center of Excellence, University of Washington, Veterans Affairs Puget Sound Health Care System, Building 1, Room 424 (152), 1660 S. Columbian Way, Seattle, WA 98108, USA Tel: 206-277-3588; [email protected]
1. Introduction In a rapidly changing world of advancing technology, it is very important to evaluate the relative accuracies of different diagnostic tests for both quality of care and cost containment. For example, transrectal ultrasound imaging costs $150 to $400 per examination and conventional body coil magnetic resonance imaging (MRI) costs $700 to $1200 per examination. Both MRI and ultrasound could be used to detect advanced stage prostate cancer. Rifkin et al.1 have shown that the accuracy of transrectal ultrasound imaging in detecting advanced stage prostate cancer was not statistically different from that of conventional body coil MRI imaging. Thus, choosing ultrasound over MRI could save $300 to $1050 without compromising quality of care. To evaluate the accuracy of a diagnostic test, we need to determine the disease status for each patient (present or absent) independent of the patient’s test result. The procedure that establishes the patient’s disease status is referred to as a gold standard. The gold standard may be based on surgery, autopsy, or clinical assessments. However, some patients who underwent the test might not have had their condition status verified by the gold standard. Usually the patients who did not have their condition status verified are not a random sample but rather are a selected group. For example, if the gold standard is based on invasive surgery, then patients with negative test results are less likely to receive the gold 21
22
X.-H. Zhou
standard evaluation than patients with positive test results. Although this approach may be sensible and cost-effective in clinical practice, when it occurs in studies designed to evaluate the accuracy of diagnostic tests, the estimated accuracy of the tests may be biased. This type of bias is called verification bias.2,3 For example, in a study of the accuracy of the lactose breath hydrogen test in the diagnosis of enteropathy in children, patients with negative test results had rarely undergone jejunal biopsy, the gold standard.4 Therefore, the estimated sensitivity and specificity based on the verified cases are subject to verification bias. Selective disease verification can lead to serious bias in estimating the accuracy of a diagnostic test. To illustrate how verification bias operates and affects the estimated accuracy of a test, we consider a hypothetical example where we want to estimate the sensitivity of a certain stress radiographic procedure in the diagnosis of coronary artery disease.5 We use angiography as the gold standard for coronary artery disease. Assume the actual sensitivity of the radiographic procedure (which we need to estimate) is 80%. Thus, 20% of all diseased patients will have false-negative test results. Suppose 500 patients with coronary artery disease undergo the stress test; 400 respond positively and 100 respond negatively. Since angiography is a risky and expensive procedure, instead of verifying all tested patients by angiography, only 75% of patients with a positive test undergo angiography, and 10% of patients with a negative test undergo angiography. Thus, among 400 patients who tested positive, 300 have angiography, and among 100 who tested negative, only 10 have angiography. Analysis using only those patients who have angiography would lead to the mistaken conclusion that the sensitivity of the stress test is 97% (300/310), a gross overestimation of the true sensitivity. Similarly, we can show an estimator of specificity using only verified cases can also be biased. The magnitude of verification bias depends on the association between selection for verification and the test result. The stronger the association is, the larger the bias. For example, Drum and Christacopoulos6 studied the accuracy of hepatic scintigraph to detect liver disease. The liver disease verification procedure was either liver biopsy, exploratory laparotomy, or autopsy. In their study, they performed 650 scans (429 positive and 221 negative). Among the 429 patients with positive test results, 61% received the disease verification procedure, and of the 221 patients with negative test results, only 37% received the disease verification procedure. Using only disease verified cases, Drum and Christacopoulos reported that the hepatic scintigraph had a sensitivity of 90% and a specificity of 63%.
Evaluation of Diagnostic Test’s Accuracy in the Presence of Verification Bias
23
However, using Begg and Greenes’s verification bias correction procedure (to be discussed in more detail in Sec. 2.2)2 on all tested patients, the corrected sensitivity and specificity are 84% and 74%, respectively. Thus, the reported sensitivity in Drum and Christacopoulos’s paper is inflated by 6%. A more extreme example is found in a study by Marshall et al.8 Their study assessed the accuracy of diaphanography in detecting breast cancer. A total of 833 patients were tested for breast cancer using diaphanography (67 positive and 766 negative results). The verification procedure on breast cancer was biopsy. The proportion receiving the disease verification procedure was 55% for test positive patients and 7% for test negative patients. Using only verified cases, Marshall et al.8 reported a sensitivity of 79% for diaphanography in detecting breast cancer. Using Begg and Greenes’ correction procedure, the estimated sensitivity becomes 28%. Thus, the reported sensitivity in the paper is grossly inflated. Therefore, ignoring verification bias could grossly overestimate the accuracy of a test, and result in the misuse of a test, leading to possible mismanagement of patient care. Although verification bias can distort the estimated accuracy of a diagnostic test, many published studies on the accuracy of diagnostic tests fail to recognize verification bias. For example, Greenes and Begg 9 reviewed 145 studies published between 1976 and 1980 and found that at least 26% of the articles had verification bias, but failed to recognize it; Bates et al.10 reviewed 54 pediatric studies and found more than one third had verification bias; and Philbrick et al.11 reviewed 33 studies on the accuracy of exercise tests for coronary disease and found that 31 might have had verification bias. Finally Reid et al.41 looked at 112 studies published in NEJM, JAMA, BJM and Lancet between 1978 and 1993 and found 54% had verification bias. Since it is often unethical or impractical to verify all study patients, retrospective adjustments are needed to provide correct inferences about the accuracy of tests. Assuming a gold standard exists, in this chapter, we review available statistical methods that may be used to correct for verification bias in evaluating the accuracy of diagnostic tests. In Sec. 2, we describe bias-correction methods for making inferences about the accuracy of a single diagnostic test when its response is binary, and in Sec. 3, we discuss bias-correction methods for comparing the relative accuracy of two correlated binary tests. In Sec. 4, we discuss bias-correction methods for making inferences about the accuracy of a single diagnostic test when its response is ordinal, and in Sec. 5, we present bias-correction methods for comparing the relative accuracy of two ordinal-scale diagnostic tests. We
24
X.-H. Zhou
start each section by presenting an overview of the methods, then follow this with a more detailed discussion. 2. A Single Binary Test 2.1. An overview When the response of a test is binary, its accuracy is usually measured by sensitivity and specificity or positive and negative predictive values. The sensitivity measures how good the test is at providing a positive result in diseased patients, and the specificity measures how good the test is at ruling out non-diseased patients. While sensitivity and specificity are intrinsic properties of a diagnostic test, positive and negative predictive values represent the accuracy of a diagnostic test when it is applied to a particular patient.12 Several approaches have been developed to make inferences on the accuracy of a single binary test in the presence of verification bias.2,7,13 Begg and Greenes2 developed a bias-correction procedure for estimating sensitivity and specificity under the conditional independence assumption, which requires that selection for verification does not depend on the true disease status directly. Zhou7 extended their method to allow a general model for verification process and derived the maximum likelihood estimators for sensitivity and specificity of a diagnostic test and their corresponding variances. Even though the estimated sensitivity and specificity may be biased using only verified cases, Zhou13 showed that under the conditional independence assumption, the naive estimators of predictive values, based on only verified cases, are unbiased. However, If the conditional independence assumption does not hold, Zhou showed that the naive estimators are still biased and derived the ML estimators under a model for the verification process. 2.2. Estimation of a single test To develop a bias-correction procedure for estimating sensitivity and specificity, we define the random variables, V , T , and D, to describe the verification indicator, the value of the diagnostic test result and the true disease status of a patient, respectively. Let V = 1 indicate a verified patient and V = 0 a non-verified patient; let T = 1 indicate a positive test result and T = 0 a negative test result; and let D = 1 indicate a diseased patient, and D = 0 non-diseased. Furthermore, we assume that the probability of
Evaluation of Diagnostic Test’s Accuracy in the Presence of Verification Bias Table 1. X = xi .
25
Cross-classification of test results by disease status and verification status
Diagnostic results T =1
T =0
s1i r1i
s0i r0i
Unverified
u1i
u0i
Total
n1i
n0i
Verified
D=1 D=0
verifying a patient may be influenced by not only the test results but also the discrete covariates X, which have I different covariate patterns. Let xi denote the ith covariate pattern of observed covariates, where i = 1, . . . , I. Also, assume that X is a random sample from a discrete space (x1 , . . . , xI ) with probabilities ξ = (ξ1 , . . . , ξI ). The observed data with verification bias may be displayed as in Table 1. Under the assumption that k1i =
P (V = 1 | D = 1, T = 1, X = xi ) P (V = 1 | D = 0, T = 1, X = xi )
k0i =
P (V = 1 | D = 1, T = 0, X = xi ) P (V = 1 | D = 0, T = 0, X = xi )
and
are known, Zhou7 showed that the maximum likelihood (ML) estimators for sensitivity and specificity are PI (sens ˆ i )ˆ pi ni /n sens ˆ = i=1 PI i=1 pˆi ni /n
and
spec ˆ = respectively, where sens ˆ i= spec ˆ i=
PI
ˆ i )(1 − pˆi )ni /n i=1 (spec PI ˆi )ni /n i=1 (1 − p
,
s1i n1i /(s1i + k1i r1i ) , s1i n1i /(s1i + k1i r1i ) + s0i n0i /(s0i + k0i r0i )
k0i r0i n0i /(s0i + k0i r0i ) , k1i r1i n1i /(s1i + k1i r1i ) + k0i r0i n0i /(s0i + k0i r0i )
which are the ML estimators for sensitivity and specificity of the test in the subpopulation with X = xi , respectively, s1i n0i s0i n1i + , pˆi = ni s1i + k1i r1i ni s0i + k0i r0i
26
X.-H. Zhou Table 2.
Hepatic scintigraph data. Diagnostic results T =1
T =0
231 32
27 54
V =0
166
140
Total
429
221
V =1
D=1 D=0
P ni = n1i + n0i , and n = Ii=1 ni .7 The corresponding variances may be computed from the inverse of the Fisher information matrix. If k1i = k0i = 1 for i = 1, . . . , I, the conditional independence assumption holds, and our ML estimators reduce to the ones given by Begg and Greenes.2 2.3. An hepatic scintigraph example Hepatic scintigraph is an imaging scan used in detecting liver disease. Drum and Christacopoulos6 conducted an experiment to determine the sensitivity and specificity of the hepatic scintigraph in detecting liver disease. There were 650 patients who participated in the study. Of the 429 patients who had positive hepatic scintigraph results, 263 (61%) were referred to undergo a disease verification procedure, which is liver pathology. Of the 221 patients with negative hepatic scintigraph results, only 81 (37%) were referred to undergo the disease verification procedure. The data are presented in Table 2. If only patients with verified condition statuses are used in the calculation, the biased estimate of sensitivity is 0.90 with a 95% confidence interval of (0.86, 0.93); and the biased estimate of specificity is 0.63 with the 95% confidence interval of (0.53, 0.73). If the probability of verifying a patient depends only on the test results of the hepatic imaging scan, the verification process is MAR. Using the correction method described in Proposition 1, the estimated sensitivity is 0.84 with a 95% confidence interval of (0.79, 0.88), and the estimated specificity is 0.74 with a 95% confidence interval of (0.66, 0.81). 3. Comparison of Two Correlated Binary Tests 3.1. An overview To compare the relative accuracies of two binary tests, several approaches have been developed to correct for verification bias.14–16 Schartzkin et al.14 considered the comparison of sensitivities and specificities of two tests in
Evaluation of Diagnostic Test’s Accuracy in the Presence of Verification Bias
27
an extreme case of verification bias where only those patients who tested positive on either test proceeded to have their true disease status verified, and they found that McNemar’s test could still be used for such a comparison. Baker15 proposed a parametric maximum likelihood procedure for estimating sensitivities and specificities of multiple tests, and Zhou16 provided a nonparametric ML approach for comparing the relative accuracies of two correlated binary tests. While both Baker’s method and Zhou’s method treat the problem of verification bias as a special type of missing-data and use likelihood-based approaches for missing-data to correct for verification bias, Baker’s approach primarily focused on estimation of sensitivities and specificities of multiple tests, and Zhou’s approach focused on hypothesis testing for the equality of sensitivities or specificities of two diagnostic tests. Although Baker’s approach allows the disease verification process to depend on the disease status, its validity still depends on the assumption that one can correctly model the disease verification process by logistic regression using the test results and the disease status. While the validity of Zhou’s approach relies on the assumption that the disease verification process depends on only the test results and other observed covariates, but not on the disease status, its validity does not require modeling the diseased verification process. However, his approach assumes that the effects of covariates on disease follow a logistic regression model. Baker implemented his approach by first starting with an EM algorithm and then switching to the Newton–Raphson algorithm after a few iterations,15 and Zhou implemented his method using the Newton–Raphson algorithm. The advantage of Zhou’s approach is that the computation may be done in an existing software, such as SAS;17 and the advantage of Baker’s approach is that it may be less sensitive to starting values, but its disadvantage is that it needs a special program to carry out its computation. 3.2. The ML approach In this subsection, we discuss Zhou’s approach18 for comparing the relative accuracies of two correlated binary tests. Let T1 and T2 be binary test results of two diagnostic tests. Let the definitions of random variables D, V , X be the same as those in Sec. 2.2. Then, the observed data may be summarized as in Table 3. To derive the bias-correction procedure, we need additional notation. Define θijl = P (D = 1 | T1 = j, T2 = l, X = xi ) , ηijl = P (T1 = j, T2 = l | X = xi ) .
and
28
X.-H. Zhou
Table 3. X = xi .
Cross-classification of test results by disease status and verification status with
T1 = 1
T1 = 0
T2 = 1
T2 = 0
T2 = 1
T2 = 0
si11 ri11
si10 ri10
si01 ri01
si00 ri00
V =0
ui11
ui10
ui01
ui00
Total
ni11
ni10
ni01
ni00
V =1
D=1 D=0
Because the number of free parameters could grow uncontrollably as the number of covariates grows, we need to model the joint probability P (T1 , T2 , D | X). We model P (D | T1 , T2 , X) by a logistic regression model and P (T1 , T2 | X) by a multinomial logit model.19 Specifically, these models are defined by the following equations: P (D = 1 | T1 , T2 , X = xi ) =
exp(β0 + β1 T1 + β2 T2 + β3′ xi ) 1 + exp(β0 + β1 T1 + β2 T2 + β3′ xi )
and P (T1 = j, T2 = l | X = xi ) = P1
exp(α0jl + α′1jl xi )
h1 ,h2 =0
exp(α0h1 h2 + α′1h1 h2 xi )
,
for j, l = 0, 1, where α011 = 0 and α111 = 0. Let β = (β0 , β1 , β2 , β3′ )′ ,
α = (α000 , α001 , α010 , α′100 , α′101 , α′110 )′ ,
ξi = P (X = xi ), ξ = (ξ1 , . . . , ξI−1 ). Let sijl , rijl and uijl be the numbers of subjects with (V = 1, D = 1, T1 = j, T2 = l), (V = 1, D = 0, T1 = j, T2 = l), and (V = 0, T1 = j, T2 = l), respectively. Then, the log-likelihood function based on the observed data is l(α, β, ξ) =
I X 1 X
i=1 j,l=0
+
I X
{nijl log ηijl + sijl log θijl + rijl log(1 − θijl )}
ni log ξi ,
(1)
i=1
P1 where ni = j,l=0 nijl , and ξI = 1 − ξ1 − · · · − ξI−1 . After obtaining ML estimates of α, β, and ξ by maximizing Eq. (1) with respect to these
Evaluation of Diagnostic Test’s Accuracy in the Presence of Verification Bias
29
parameters, we can estimate the sensitivities of the two tests, π1 and π2 , by the following formulas: !, !, I X 1 I X 1 X X pˆ , π ˆ1 = pˆ and π ˆ2 = θˆij1 ηˆij1 ξˆi θˆi1l ηˆi1l ξˆi i=1 l=0
i=1 j=0
respectively, and their specificities, ν1 and ν2 , by !, I X 1 X (1 − pˆ) and νˆ1 = (1 − θˆi0l )ˆ ηi0l ξˆi i=1 l=0
νˆ2 =
I X 1 X i=1 j=0
(1 − θˆij0 )ˆ ηij0 ξˆi
!,
(1 − pˆ) ,
where pˆ =
I X 1 X
θˆijl ηˆijl ξˆi .
i=1 j,l=0
We may use the delta method to estimate the corresponding covariance matrix of π ˆ1 and π ˆ2 and that of νˆ1 and νˆ2 . 4. A Single Ordinal-Scale Test 4.1. An overview When the response of a diagnostic test is ordinal, there is more than one way to define a positive test result. Hence, the use of one pair of sensitivity and specificity values confounded with the chosen confidence threshold for a positive result. To overcome this limitation, a receiver operating characteristic (ROC) curve was proposed to present the accuracy of an ordinal-scale test.20,21 An ROC curve is a plot of 1-specificity versus sensitivity as one varies the confidence threshold from the most liberal to the most conservative views on the presence of disease, and it shows the trade-off between sensitivity and specificity of a diagnostic test that can arise when one uses different confidence thresholds.22 For estimating a single ROC curve, several bias-correction methods have been proposed.18,23–26 Gray et al.23 proposed a parametric maximum likelihood (ML) approach for estimating an ROC curve, adjusting for verification bias, under the conditional independence assumption that the selection probability of verification depends only on the test results. Their approach has three limitations: (1) it cannot be applied to the situation where some observed covariates (e.g. sex or age) may
30
X.-H. Zhou
influence the decision to verify a patient and/or affect the ROC curve itself; (2) the validity of the approach relies on the normality assumption of the latent decision variables; and (3) computation of the ML estimates requires a modified iterative scoring algorithm. Hunink et al.24 proposed an ad-hoc method for estimating an ROC curve adjusting for the effects of covariates on the decision to verify and on the ROC curve. However, their approach does not necessarily provide maximum likelihood estimates nor consistent variance estimates, as shown by Rodenberg.26 Rodenberg and Zhou25,26 proposed a likelihood based approach for estimating an ROC curve when some observed covariates affect both the verification process and the test’s accuracy. Their approach first modeled effects of covariates on the accuracy of a test by an ordinal regression model, then treated the verification bias problem as a missing-data problem, and finally used the EM algorithm27 to compute the ML estimates under the MAR assumption for the verification process. To overcome the second and third limitations of the Gray et al.’s approach, Zhou18 proposed a non-parametric maximum likelihood approach to correct for verification bias in estimating the area under an ROC curve. The main idea behind this approach was to treat the verification bias problem as a missing data problem. Under the missing data framework, he first derived an explicit expression for a ML estimator of the ROC curve area without the normality assumption. Then, he presented two approaches for estimating the corresponding variance. The first approach was based on the observed Fisher information,28 called the information method, and the second approach was based on the jackknife method.29 A simulation study suggests that the estimator obtained using the jackknife method outperforms the estimator obtained by the information method. The proposed approach does not require an iterative algorithm to compute the ML estimates, nor the normality assumption of the latent decision variable. The proposed approach can also apply to the setting whether some observed discrete covariates of a patient might influence the decision to verify the patient. However, this approach can only apply to the area under the ROC curve, not the ROC curve itself.
4.2. Estimation of a single ROC curve without covariates Let T be the ordinal-scale test results, and the definitions of random variables D and V are the same as those in Sec. 2. Then, the observed data may be summarized as in Table 4. The rating data above may be considered as a categorization of an unobserved latent random variable T ∗ , representing
Evaluation of Diagnostic Test’s Accuracy in the Presence of Verification Bias
31
Table 4. Cross-classification of an ordinal-scale test by disease status and verification indicator. Diagnostic results T =1
···
T =K
D=1
z11
···
zK1
D=0
z10
···
zK0
Unverified
u1
···
uK
Total
n1
···
nK
Verified
degree of suspicion on the presence of disease for a patient. By postulating a relationship between the observed T and the unobserved T ∗ and a parametric distribution for T ∗ , one can build a parametric model for the ROC curve of a diagnostic test. Several ROC models have been proposed in the literature.30–33 The most commonly used binormal model is proposed by Dorfman and Alf,30 and this model can be summarized in the following result. Result 1: Assume that K − 1 cut-off points, θ1 , . . . , θK−1 , exist such that for each patient, if θk−1 < T ∗ ≤ θk , T = k, where k = 1, . . . , K, θ0 = −∞, and θK = ∞. Further assume that given that a patient is diseased, T ∗ is normally distributed with mean µ1 and variance σ12 , and that given that a patient is non-diseased, T ∗ is normally distributed with mean µ0 and variance σ02 . Under these assumptions, the ROC curve of the test is a plot of 1 − Φ(t) versus 1 − Φ(bt − a), where Φ(.) is the cumulative distribution function of the standard normal random variable, a = (µ1 − µ0 )/σ1 , and b = σ0 /σ1 . Hence, under the binormal model, an ROC curve is determined by two parameters, a and b, which may be estimated using the maximum likelihood method. To write down the likelihood function, one first defines the parameters for the observed data: pd = P (D = d) and πkd = P (T = k | D = d), and then one may write πkd as functions of the parameters of an ROC curve: ′ πk0 = Φ(θk−1 ) − Φ(θk′ )
′ and πk1 = Φ(bθk−1 − a) − Φ(bθk′ − a) ,
where θk′ = (θk −µ0 )/σ0 , k = 1, . . . , K. Notice that the probability of having T = k and D = d for a verified patient is pd πkd and that the probability P1 of having T=k for an unverified patient is d=0 pd πkd , a mixture of two
32
X.-H. Zhou
distributions. Hence, under the MAR assumption that P (V = 0 | T, D) = P (V = 0 | T ), the log-likelihood for the observed data may be written as K X 1 X
zkd log pd πkd +
k=1 d=0
K X
uk log(p1 πk1 + p0 πk0 ) .
k=1
Gray et al.23 employed a modified scoring algorithm to maximize this loglikelihood function with respect to a, b, and θk′ to obtain their ML estimates and the corresponding variance estimates.
4.3. Estimation of ROC curves with covariates Let X be the vector of observed covariates that may affect the verification process and the accuracy of the test. Assume that X can be cross-classified into I distinct combinations, and xi represents the values of the covariates for the ith combination. The observed data with X = xi form a contingency table, displayed in Table 5. Using the Rodenberg and Zhou’s approach,25,26 we model the effects of the covariates X = xi on the distribution of the response of a diagnostic test by ordinal regression with a probit link34,35 : X
πjdi = Φ
j≤k
θk − (αD d + α′X xi ) ′ x ) exp(βD d + βX i
,
for k = 1, . . . , K − 1 ,
where πjdi = P (T = j | D = d, X = xi ), and θk ’s are cut-off points of a latent continuous variable T ∗ , defined in Proposition 1. Denote α = (αD , αX ), β = (βD , βX ), and θ = (θ1 , . . . , θK−1 ). To emphasize the dependence of πjdi on α, β, and θ, we write πjdi = πjdi (α, β, θ). Hence, under the MAR assumption, that P (V = 0 | D, T, X) = P (V = 0 | T, X), the Table 5.
Observed data for the verification bias problem when X = xi . Verification Status V :
Disease Status D: 1 0 missing
Diagnostic Test Result T : 1 2 ··· K z11i z10i u1i
z21i z20i u2i
··· ··· ···
zK1i zK0i uKi
n1i
n2i
···
nKi
Evaluation of Diagnostic Test’s Accuracy in the Presence of Verification Bias
33
log-likelihood, based on the observed data, may be written as l=
K X 1 X I X
zkdi log(pdi πkdi (γ, α, θ))
k=1 d=0 i=1
+
K X I X
uki log(p1i πk1i (γ, α, θ) + p0i πk0i (γ, α, θ)) ,
(2)
k=1 i=1
where pdi = P (D = d | X = xi ), the prevalence rate of disease specific to the subgroup with X = xi . The log-likelihood, based on the observed data, has a complicated form, involving mixture distributions. Let wkdi be the number of unverified patients with T = k and X = xi whose disease status is d(D = d). Because of selective verification, one does not observe wkdi , but instead one observes uki = wk0i + wk1i . If all subjects had been verified, we would have observed wkdi , and a much simpler complete-data log-likelihood could be written as I X K X 1 X
{zkdi + wkdi } log(pdi )
i=1 k=1 d=0
+
I X K X 1 X {zkdi + wkdi } log(πkdi (α, β, θ)) . i=1 k=1 d=0
These two separate sums suggest that pdi and πkdi can be maximized separately. They also suggest the use of the EM algorithm with a maximization step for an ordinal regression model of πkdi (α, β, θ) with wkdi assumed known, which can be done fusing an existing computer program PLUM developed by McCullagh.34 Here, the expectation step finds new estimates of wkdi given the current values of α, β, θ, and p, α(m) , β (m) , θ(m) , and p(m) , using (m)
p πkdi (α(m) , β (m) , θ(m) ) . uki P1 di (m) (m) , β (m) , θ (m) ) d=0 pdi πkdi (α
This iterative process is continued until the relative change in successive ML estimates is small. The convergent values are the ML estimates of the parameters. Their asymptotic variance-covariance matrix is given by the inverse of the expected information matrix, defined by Eq. (2).
34
X.-H. Zhou
4.4. Estimation of the area under an ROC curve If one is interested in the area under the ROC curve, a simple bias-correction procedure is available.18 Define φ1ki = P (T = k | X = xi ) ,
and φ2ki = P (D = 1 | T = k, X = xi ) ,
The log-likelihood, defined in (2), may be re-written as K I X X
i=1 k=1
ni log(φ1ki ) +
K I X X
i=1 k=1
(zk1i log(φ2ki ) + zk0i log(1 − φ2ki )) .
Maximizing the above log-likelihood yields the following ML estimators for φ1 and φ2 : nki zk1i φˆ1ki = and φˆ2ki = , ni zk1i + zk0i PK where ni = k=1 nki . Notice that the area under an ROC curve A is a function of φ and ξ and can be written as PI PI PK−1 PK i=1 φ2ji φ1ji ξi i=1 (1 − φ2ki )φ1ki ξi k=1 j=k+1 PK PI PI 1 + (1 − φ2ki )φ1ki ξi i=1 φ2ki φ1ki ξi A = PK 2PIk=1 i=1 PK PI k=1 j=1 i=1 (1 − φ2ki )φ1ki ξi i=1 φ2ji φ1ji ξi
Substituting unknown parameters in the equation above by their ML estimates gives the following ML estimator for A: PI PK−1 PK ˆ ˆ ˆ PI φˆ2ji φˆ1ji ξˆi i=1 i=1 (1 − φ2ki )φ1ki ξi j=k+1 k=1 PI ˆ ˆ ˆ 1 PK PI ˆ ˆ ˆ + (1 − φ2ki )φ1ki ξi i=1 φ2ki φ1ki ξi A = PK 2PIk=1 i=1 , ˆ ˆ ˆ PK PI φˆ2ji φˆ1ji ξˆi k=1 j=1 i=1 (1 − φ2ki )φ1ki ξi i=1
where ξˆi = ni /n. The corresponding variance estimator can be obtained by either the jackknife method or the information method.18 4.5. A real example with fever of uncertain origin
Gray et al.23 reported data from a study on the accuracy of computed tomography in differentiating focal from nonfocal sources of sepsis among patients with fever of uncertain origin. In this study only some patients were verified, depending on their CT results. Hence, this study had verification bias. Table 6 displays the data.
Evaluation of Diagnostic Test’s Accuracy in the Presence of Verification Bias Table 6.
V =1
D=1 D=0
35
Observed Data.
T =1
T =2
T =3
T =4
T =5
7 8
7 0
2 1
3 1
37 4
40
11
3
5
12
Total
55
18
6
9
53
0.6 0.4
Indianapolis subjects age 65-74 Indianapolis subjects age 75 and older Ibadan subjects age 65-74 Ibadan subjects age 75 and older
0.0
0.2
True-Positive-Fraction
0.8
1.0
V =0
0.0
0.2
0.4
0.6
0.8
1.0
False-Positive-Fraction Fig. 1. ROC curves and empirical (FPF, TPF) estimates for dementia screening test by site and age group under the best model.
If we use only the veri ed cases,the estimated empirical and smooth ROC curves are displayed in Figure 2. The area under the smooth ROC curve is 0.75 with the standard deviation of 0.108. If we assumethat the probabilit y of veri cation depends only on the result of CT, using all cases,the ML estimatesof a and b are 1.80 and 1.75, respectively. We display the corrected empirical and smooth ROC curves in Fig. 2. The area under the corrected smooth ROC curve is 0.81 with standard deviation of 0.07.
36
0.6 0.4
Corrected smooth curve Uncorrected smooth curve Corrected empirical curve Uncorrected empirical curve
0.0
0.2
True Positive Fraction
0.8
1.0
X.-H. Zhou
0.0
0.2
0.4
0.6
0.8
1.0
False Positive Fraction
Fig. 2.
Corrected and uncorrected ROC curves.
5. Comparison of Two Correlated Ordinal-Scale Tests 5.1. An overview To estimate ROC curves of multiple tests, Toledano36 adapted the idea of the weighted generalized estimation equations (GEE)37 to correct for verification bias under the MAR assumption for the verification process. The proposed approach first modeled the verification process and then estimated the probability of verifying a patient given the patient’s observed covariates, and finally, weighted verified data inversely to this estimated probability of verification. One advantage of this approach is that it permits estimation of ROC curves when some observed covariates affect the accuracy of a test and the verification process without modeling the joint probability of diagnostic test results. Two disadvantages of the approach may be: (1) its validity relies on correct modeling of the verification process; and (2) it may not be as efficient as a likelihood-based approach because it discards the unverified cases and weights the verified cases inversely to the probability of having verification.38 For comparing two correlated ROC curve areas,
Evaluation of Diagnostic Test’s Accuracy in the Presence of Verification Bias
37
Zhou39 extended his previous approach18 to two correlated tests. The proposed approach first derived explicit nonparametric ML estimators for the areas under the ROC curves of two correlated tests and their corresponding variance-covariance matrix when the verification process depends on only the test results. If some categorical covariates affect the verification process, the proposed approach incorporated these covariates into the estimates of the areas by using both a logistic regression and a multinomial regression models. One strength of the proposed approach is that it does not require one to model the verification process under the MAR assumption, and its weakness is that it can only be used to estimate the areas under ROC curves, but not ROC curves themselves. 5.2. A weighted GEE approach for ROC curves Let T1 and T2 be the responses of two diagnostic tests of a patient, ranging from 1 to K1 and from 1 to K2 , respectively. Let the definitions of random variables D, V , and X be the same as those in Sec. 2.2. Then, the observed data with X = xi form a contingency table and are displayed in Table 7. Let Tlj be the result of the lth test on jth patient and Yljk be a cumulative indicator of Tlj . That is, Yljk = 1 if Tlj ≤ k and 0 otherwise. Let µjlkdi be the conditional expected value of Yljk given Dj = d and X = xi (µjlkdi = E(Yljk | Dj = d, X = xi )). One may model effects of covariates X = xi on µjlkdi by ordinal regression with a probit link: Φ−1 (µljkdi ) =
θlk − αDl d − α′Xl xi ′ x ) . exp(βDl d + βXl i
Let B be the vector of all unknown parameters, including αDl , αXl , βDl , and βXl . Using Toledano’s approach36 , one estimates B using a two-stage procedure. First, given the test results and the observed covariates Xj , the Table 7. Cross-classification of ordinal-scale tests by disease and verification indicators when X = xi . T1 = 1
...
T2 = 1
...
T2 = K2
...
D=1
z111i
...
z1K2 1i
D=0
z110i
...
z1K2 0i
V =0
u11
...
Total
n11
...
V =1
T1 = K1 T2 = 1
...
T2 = K2
...
zK1 1i
...
zK1 K2 1i
...
zK1 10i
...
zK1 K2 0i
u1K2
...
uK1 1
...
uK1 K2
n1K2
...
nK1 1
...
nK1 K2
38
X.-H. Zhou
probability of verifying a patient is modeled by a logistic regression model: log
P (Vj = 1 | T1j , T2j , Xj = xi ) = ω ′ (T1j , T2j , x′i )′ . P (Vj = 0 | T1j , T2j , Xj = xi )
The unknown parameters ω may be estimated by the method of generalized estimating equation (GEE)40 . Denote ω ˆ to be the resulting estimates of ω, and denote the probability of verifying the jth patient by νj = P (Vj = 1 | T1j , T2j , Xj = xi ) . Then, the following weighted generalized estimating equation is used to estimate B: ! n X Vj ∂µj −1 Σj (η) (Yj − µj ) ω = ω ˆ , η = ηˆ = 0 , (3) νj ∂B j=1
where the notation (. | ω = ω ˆ , η = ηˆ) denotes a function of ω and η evaluated at ω ˆ and ηˆ; Yj = (Y1j1 , . . . , Y1j(K1 −1) , Y2j1 , . . . , Y2j(K2 −1) )′ ; µj = (µ1j1 , . . . , µ1j(K1 −1) , µ2j1 , . . . , µ2j(K2 −1) )′ ; Σj (η) = cov(Yj ) is an assumed covariance matrix of Yj ; and ηˆ is a consistent estimator of η. Toledano and Gatsonis42 have discussed several ways of choosing the coˆ to the variance matrix Σj (η), and Toledano36 has shown that the solution B weighted GEE (3) is a consistent estimator of B and has an asymptotically normal distribution. 5.3. A likelihood-based approach for ROC areas If one is interested in comparing the areas under the ROC curves, a simpler approach than the weighted GEE approach is available.39 To derive this approach, one needs a different notation. For the observed data given as in Table 7, the following parameters are defined: φ2ijl = P (D = 1 | T1 = j, T2 = l, X = xi ) , φ1ijl = P (T1 = j, T2 = l | X = xi ) ,
and ξi = P (X = xi ) .
Since the problem of verification bias may be considered as a missing-data problem, the likelihood approach for missing data is used to estimate φ1ijl , φ2ijl , and ξi . Assume that the missing-data mechanism is MAR, that is, P (V = 0 | T1 , T2 , D, X = xi ) = P (V = 0 | T1 , T2 , X = xi ) .
Evaluation of Diagnostic Test’s Accuracy in the Presence of Verification Bias
39
Then, the contribution of a verified patient to the likelihood is P (T1 , T2 , D, X) = P (D | T1 , T2 , X)P (T1 , T2 | X)P (X), and the likelihood contribution of an unverified case is P (T1 , T2 , X) = P (T1 , T2 | X)P (X). Furthermore, one may model effects of T1 , T2 , and X on D by a logistic regression model: φ2ijl =
exp(β1j + β2l + β3′ xi ) , 1 + exp(β1j + β2l + β3′ xi )
(4)
where β1K1 = 0 and β2K2 = 0, and effects of X on T1 and T2 by a multinomial logit model, exp(α′jl xi ) , P K2 ′ h1 =1 h2 =1 exp(αh1 h2 xi )
φ1ijl = P (T1 = j, T2 = l | X = xi ) = PK1
(5)
where αK1 K2 = 0. Denote β = (β11 , . . . , β1(K1 −1) , β21 , . . . , β2(K2 −1) , β3′ ), α = (α11 , . . . , αK1 (K2 −1) )′ , ξi = P (X = xi ), ξ = (ξ1 , . . . , ξI−1 ), and n = PI i=1 ni . To emphasize the dependence of φ1ijl on α and the dependence of φ2ijl on β, one may write φ1ijl = φ1ijl (α) and φ2ijl = φ2ijl (β). Under the MAR assumption, a valid log-likelihood function is l(α, β, ξ) =
K2 K1 X I X X i=1 j=1 l=1
+
nijl log PK
K2 K1 X I X X
h1 ,h2 =1
sijl log
i=1 j=1 l=1
+ rijl log
exp(α′jl xi ) exp(α′h1 h2 xi )
+
I X
exp(β1j + β2l + β3′ xi ) 1 + exp(β1j + β2l + β3′ xi )
1 , 1 + exp(β1j + β2l + β3′ xi )
(6)
where nijl = sijl + rijl + uijl and ξI = 1 − ξ1 − · · · − ξI−1 . Let l1 (α) =
K2 K1 X I X X i=1 j=1 l=1
l2 (β) =
K2 K1 X I X X i=1 j=1 l=1
+ rijl log
ni log ξi
i=1
nijl log PK
exp(α′jl xi )
h1 ,h2 =1
sijl log
exp(α′h1 h2 xi )
,
exp(β1j + β2l + β3′ xi ) 1 + exp(β1j + β2l + β3′ xi )
1 , 1 + exp(β1j + β2l + β3′ xi )
40
X.-H. Zhou
and l3 (ξ) =
I X
ni log ξi .
i=1
Then, l(α, β, ξ) may be written as the sum of l1 (α), l2 (β), and l3 (ξ). Here, l1 (α), l2 (β), and l3 (ξ) may be considered as the log likelihood function of all cases modeled by the multinomial logit model defined by Eq. (5), the log likelihood function of verified cases modeled by the logistic regression model defined by Eq. (4), and the log likelihood function for a multinomial distribution based on all cases, respectively. Since the parameters α, β, ˆ and ξ, ˆ may be obtained and ξ are distinct, their ML estimators, α ˆ , β, by maximizing l1 , l2 , and l3 with respect to α, β, and ξ, separately. The observed Fisher information for (α, β, ξ) is diag(I1 (α), I2 (β), I3 (ξ))
(7)
where I1 , I2 , and I3 are the observed Fisher information matrices on the log-likelihood functions l1 (α), l2 (β), and l3 (ξ), respectively. Maximizing l1 (α) with respect to α and l2 (β) with respect to β yields ˆ respectively. Since ξI = 1 − · · · − ξI−1 , maximizing ML estimators α ˆ and β, l3 (ξ) with respect to ξi yields ML estimators of ξi : ni , ξˆi = n i = 1, . . . , I − 1. PI PK1 PK2 Note that γ = P (D = 1) = i=1 j=1 l=1 φ2ijl φ1ijl ξi and that one may write the area under the ROC curve of a diagnostic test Ai as K Ki Ki i −1 X X X 1 1 φ∗ (j) φ∗ (j)φ∗2i (j) , φ∗i2 (l) + Ai = γ(1 − γ) j=1 i1 2 j=1 1i l=j+1
where
φ∗11 (j) =
K2 I X X
(1 − φ2ijk )φ1ijk ξi ,
φ∗12 (j) =
K1 I X X
(1 − φ2ikj )φ1ikj ξi ,
φ∗22 (j) =
i=1 k=1
φ2ijk φ1ijk ξi ,
K1 I X X
φ2ikj φ1ikj ξi .
i=1 k=1
i=1 k=1
φ∗21 (j) =
K2 I X X
i=1 k=1
The delta method may be used to obtain an estimate of the covariance matrix for Aˆ1 and Aˆ2 . Assuming the normality of Aˆ1 − Aˆ2 , one can then perform the hypothesis tests and construct confidence intervals about A1 − A2 .
Evaluation of Diagnostic Test’s Accuracy in the Presence of Verification Bias
41
5.4. Availability of computer software For analysis of ROC data, several computer programs have been developed for carrying out computations of the bias-correction methods discussed in Secs. 4 and 5. Gray et al.23 developed a program called ROCBIAS to estimate the ROC curve of a single test when the probability of verifying a patient depends on only the test results. Rodenberg and Zhou26 developed a program called EMPLUM to estimate ROC curves when the probability of verifying a patient depends not only on the test results but also on other observed covariates. Toledano36 developed special software written in Fortran to implement the weighted GEE approach for analyzing correlated ROC curves in the presence of verification bias. Zhou and Higgs43 implemented the likelihood-based approach for comparing the areas under the ROC curves in SAS,17 which can be down-loaded from http://www.biostat.iupui.edu/~zhou. 6. Discussion Statistical methods in diagnostic medicine have recently received a lot of attention.45 In this chapter we have discussed the problem of verification bias in evaluating the accuracy of diagnostic tests and some available biascorrection methods. The problem of verification bias is just one of many problems encountered in diagnostic medicine. For a completely treatment on statistical methods in diagnostic medicine, we refer readers to textbook by Zhou et al.44 References 1. Rifkin, M. D., Zerhouni, E. A., Gatsonis, C. A., Quint, L. E., Paushter, D. M., Epstein, J. I., Walsh, P. C. and McNeil, B. J. (1990). Comparison of magnetic resonance imaging and ultrasonography in staging early prostate cancer. New England Journal of Medicine 323: 621–626. 2. Begg, C. B. and Greenes, R. A. (1983). Assessment of diagnostic tests when disease is subject to selection bias. Biometrics 39: 207–216. 3. Ransohoff, D. F. and Feinstein, A. R. (1978). Problems of spectrum and bias in evaluating the efficacy of diagnostic Tests. New England Journal of Medicine 299: 926–930. 4. Levine, J. J., Seidman, E. and Walker W. A. (1987). Screening tests for enteropathy in children. American Journal of Diseases of Children 141: 435–438. 5. Tavel, M. E., Enas, N. H. and Woods, J. R. (1987). Screening tests for enteropathy in children. American Journal of Cardiology 60: 1167–1169.
42
X.-H. Zhou
6. Drum, D. E. and Christacopoulos, J. S. (1972). Hepatic scintigraphy in clinical decision making. Journal of Nuclear Medicine 13: 908–915. 7. Zhou, X. H. (1993). Maximum likelihood estimators of sensitivity and specificity corrected for verification bias. Communication in Statistics — Theory and Methods 22: 3177–3198. 8. Marshall, V., Williams, D. C. and Smith, K. D. (1984). Diaphanography as a means of detecting breast cancer. Radiology 150: 339–343. 9. Greenes, R. A. and Begg, C. B. (1985). Assessment of diagnostic technologies: Methodology for unbiased estimation from samples of selective verified patients. Investigative Radiology 20: 751–756. 10. Bates, A. S., Margolis, P. A. and Evans, A. T. (1993). Verification bias in pediatric studies evaluating diagnostic tests Journal of Pediatrics 122: 585–590. 11. Philbrick, J. T., Horwitz, R. I. and Feinstein, A. R. (1980). Methodologic problems of excerise testing for coronary artery disease: Groups, analysis and bias. American Journal of Cardiology 46: 807–812. 12. Feinstein, A. R. (1975). On the sensitivity, specificity and discrimination of diagnostic tests. Clinical Pharmacology and Therapeutics 17: 104–116. 13. Zhou, X. H. (1994). Effect of verification bias on positive and negative predictive values. Statistics in Medicine 13: 1737–1745. 14. Schatzkin, A., Connor, R. J., Taylor, P. R. and Bunnag, B. (1987). Comparing new and old screening tests when a reference procedure cannot be performed on all screenees. American Journal of Epidemiology 125: 672–678. 15. Baker, S. G. (1995). Evaluating multiple diagnostic tests with partial verification. Biometrics 51: 330–337. 16. Zhou, X. H. (1998). Comparing accuracies of two screening tests in a twophase study for dementia. Journal of Royal Statistical Society, Series C, Applied Statistics 47: 135–147. 17. SAS Institute Inc. (1990). SAS/STAT User’s Guide, Version 6, SAS Institute, Inc., Cary, NC. 18. Zhou, X. H. (1996). Nonparametric ML estimate of an ROC area corrected for verification bias. Biometrics 52: 310–316. 19. Agresti, A. (1990). Categorical Data Analysis. Wiley and Sons, New York, USA. 20. Metz, C. E. (1986). ROC Methodology in radiologic imaging. Investigative Radiology 21: 720–733. 21. Swets, J. A. (1979). ROC analysis applied to the evaluation of medical imaging techniques. Investigative Radiology 14: 109–121. 22. Hanley, J. A. (1989). Receiver operating characteristic (ROC) curve methodology. Critical Reviews in Diagnostic Imaging 29: 307–335. 23. Gray, R., Begg, C. B. and Greenes, R. A. (1984). Construction of receiver operating characteristic curves when disease verification is subject to selection bias. Medical Decision Making 4: 151–164. 24. Hunink, M. G. M., Richardson, D. K., Doubilet, P. M. and Begg, C. B. (1990). Testing for fetal pulmonary maturity: ROC analysis involving covariates, verification bias, and combination testing. Medical Decision Making 10: 201–211.
Evaluation of Diagnostic Test’s Accuracy in the Presence of Verification Bias
43
25. Rodenberg, C. A. and Zhou, X. H. (2000). ROC curve estimation when covariates affect the verification. Biometrics. 26. Rodenberg, C. A. (1996). Correcting for verification bias in ROC estimation with covariates. PhD thesis, Department of Statistics, Purdue University, West Layafette, Indiana. 27. Little, R. J. A. and Rubin, D. B. (1987). Statistical Analysis with Missing Data. John Wiley and Sons, New York, NY. 28. Lehmann, E. L. (1983). Theory of Point Estimation. John Wiley and Sons, New York, NY. 29. Efron, B. and Tibshirani, R. J. (1993). An introduction to the bootstrap. Chapman and Hall, New York, NY. 30. Dorfman, D. D. and Alf, E. (1969). Maximum likelihood estimation of parameters of signal detection theory and determination of confidence intervals: Rating data. Journal of Mathematical Psychology 6: 487–496. 31. England, W. L. (1988). An exponential model used for optimal threshold selection on ROC curves. Medical Decision Making 8: 120–131. 32. Grey, D. R. and Morgan, B. J. T. (1972). Some aspects of ROC curve-fitting: Normal and logistic models. Journal of Mathematical Psychology 9: 128–129. 33. Ogilvie, J. C. and Creelman, C. D. (1968). Maximum-likelihood estimation of receiver operating characteristic curve parameters. Journal of Mathematical Psychology 5: 377–391. 34. McCullagh, P. (1980). Regression models for ordinal data. Journal of Royal Statistical Society, Series B 42: 109–142. 35. Tosteson, A. A. N. and Begg, C. B. (1985). A general regression methodology for ROC curve estimation. Medical Decision Making 8: 204–215. 36. Toledano, A. Y. (1993). Generalized estimating equations for repeated ordinal categorical data, with applications to diagnostic medicine. PhD thesis, Department of Biostatistics, Harvard School of Public Health, Boston, MA. 37. Robbins, J. M., Rotnitzky, A. and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89: 846–866. 38. Paik, M. C. (1997). The generalized estimating equation approach when data are not missing completely at random. Journal of the American Statistical Association 92: 1320–1329. 39. Zhou, X. H. (1998). Comparing the correlated areas under the ROC curves of two diagnostic tests in the presence of verification bias. Biometrics 54: 349–366. 40. Liang, K. Y. and Zeger, S. L. (1988). Longitudinal data analysis using generalized linear models. Biometrika 73: 13–22. 41. Reid, M. C., Lachs, M. S. and Feinstein, A. R. (1995). Use of methodologic standards in diagnostic test research. Getting better but still not good. Journal of American Medical Association 274: 645–651. 42. Toledano, A. Y. and Gatsonis, C. A. (1996). Ordinal regression methodology for ROC curves derived from correlated data. Statistics in Medicine 15: 1807–1826. 43. Zhou, X. H. and Higgs, R. (1998). COMPROC and CHECKNORM: Computer programs for comparing accuracies of diagnostic tests using ROC curves
44
X.-H. Zhou
in the presence of verification bias. Computer Methods and Programs in Biomedicine 57: 179–186. 44. Zhou, X. H., Obuchowski, N. A. and McClish, D. K. (2002). Statistical Methods in Diagnostic Medicine. John Wiley and Sons, New York, NY. 45. Pepe, M. S. (2000). Receiver Operating Characteristic Methodology. Journal of American Statistical Association 95: 308–311.
About the Author Xiao-Hua Zhou is Director of Biostatistics Division in Puget Sound Health Care System and with a faculty appointment in the Department of Biostatistics at University of Washington. He received his BSc in Mathematics (1984) from Sichuan University, MSc in Statistics (1987) from University of Calgary, and PhD in Biostatistics (1991) from Ohio State University. He was a post-doctoral fellow in Biostatistics at Harvard University from 1991 to 1993. He joined Indiana University as Assistant Professor in 1993 and was promoted to Associate Professor in 1997. He was a visiting associate professor at Harvard University in 2000. He moved to Seattle in 2002. He was elected to the International Statistical Institute in 1998. He is the Program Chair of Section of Statistics in Epidemiology of the American Statistical Association (ASA) in 2001 and Chair-elect in 2002. In 2001, he shared Mitchell Prize from the International Society for Bayesian Analysis and Section on Bayesian Statistical Sciences of ASA with Prof. Hirano, Imbens, and Rubin. He is also an Associate Editor for Biometrics and Editorial Board Member for Statistics in Medicine. His research interests include statistical methods in diagnostic medicine, categorical data analysis, health services research, analysis of skewed data, causal inferences, analysis of observational studies, and analysis of missing data. He has published over 79 referred papers in both statistical and medical journals.
CHAPTER 3
STATISTICAL METHODS FOR DEPENDENT DATA FENG CHEN Department of Medical Statistics, Nantong Medical College, Nantong, Jiangsu 226001, PR China Tel: 0086-513-5517191-2012; [email protected]
1. Introduction Most classical statistical methods require independent observations. The issue here is not independence of multiple variables, rather of the samples. There are many cases of dependent in medical research when requirement of independence cannot be hold, i.e. observations are correlated. The existence of such correlation is not a coincident, but due to the design of the experiments. In some cases, this type of correlation can be eliminated by suitable procedure without losing any information. The simplest case is paired design where the observations within the same paired is correlated. For example, to investigate a new drug’s effects on hypertension, a 2-by-2 crossover design can be used to measure the diastolic pressure before and after treatment for each subject. Although the pressures across subjects are independent, the observations of the same subject are correlated. Unfortunately, we could not eliminate intra-unit correlations in most cases by traditional statistical methods. For example, in a toxicological study, 32 pregnant rats were randomly allocated into test and control groups. Rats in control group were fed with regular food, while rats in test group were fed with combinations of regular food and suspected teratogen. The proportion of malformation of pups of two groups was compared after rat delivery. In this study, the pregnant rats are independent with each other, but genetic factors, antepartum internal womb environments and
45
46
F. Chen
metabolism conditions of teratogen have effects on the rat pups. Thus, the rat pups cannot be treated as independent observations because siblings are more likely to encounter the similar proportion of malformation than pups from different litters. The litter effect must be taken into account. Special procedure must be used to deal with this type of data. The intra-unit correlation or intra-class correlation is a measure of similarity (or non-independence) among individuals that share some characteristics. The intra-unit correlation means that observations in the same unit are not dependent. There are overlaps between the information they present. It is inappropriate to ignore the intra-unit correlation. For example, in a clinical trail, many variables, i.e. vital signs, physiological index, effects and side effects, should be observed successively in different time for each subject during the trial period to show the efficacy and safety of the tested drug. Each subject should be observed several times. We refer to this type of study as repeated measurement study. There are two classical ways to deal with this sort of data. One is, to test the significance of the difference between the test group and the control group on each occasion respectively including test for the homogeneity of two groups before treatment and to compare the difference of changes (such as absolutely increase or decrease, relatively increase or decrease, etc.) between the two groups at each time. The alternative is to take k observations of each one of n subjects as one response variable, (the sample size will be nk) to fit a model (or generalized linear model) in which time is an explanatory variable. The former one will have low statistical power because it treats the observations of each occasion independently. The latter considers the correlations between the treatment effect and time. However, it ignores the intra-subject correlation of the observations and takes the data as independent data. Thus, it will increase the type I error which may result in the approval of the inefficiency drug to the market. The set of observations taken from the same subject tend to be correlated. They provide rather less information than the same number independent observations taken from different subjects. The larger the intra-unit correlation is, the less information will be provided. Therefore, it will increase the type I error if we use nk observations to fit general linear model. The statistical methods for dependent data are described and illustrated in this chapter. The methods cover estimation of intra-correlation coefficient, hypotheses test, estimation of sample size, etc.
Statistical Methods for Dependent Data
Table 1.
47
Clotting time (min) of serum from 8 volunteers, treated by 4 methods. Treatment
subject 1 2 3 4 5 6 7 8
A
B
C
D
8.4 12.8 9.6 9.8 8.4 8.6 8.9 7.9
9.4 15.2 9.1 8.8 8.2 9.9 9.0 8.1
9.8 12.9 11.2 9.9 8.5 9.8 9.2 8.2
12.2 14.4 9.8 12.0 8.5 10.9 10.4 10.0
2. Examples of Dependent Data Dependent data is omnipresent in medical researches, such as, repeated measurement data, longitudinal data, data of cross-over design, data of multicenter clinical trial, cluster sampling survey data, and infective disease, inherited disease, etc. They share the same property, which is the dependence or intra-unit correlation of observations. We refer to this type of data as dependent data. In this section, we will illustrate some types of examples for dependent data, and discuss their common and distinguishing features. 2.1. Example 1. Randomized block design To compare the effects on the clotting time of serum of four treatments, 8 volunteers were recruited. Four samples of serum from each subject were assigned to the four treatments in a random order. The results of the experiment were presented in Table 1. The property of the data shown in Table 1 is that the observations of the same block are correlated, while the observations from different blocks are independent. Thus, the effects of 4 treatments of 4 serums from one person are correlated. That is to say the data from block design are dependent. Observations in the same block in split-plot design and in split-split-plot design have the same property. 2.2. Example 2. Cluster sampling1 A simple random sample of 30 households was drawn from a census taken in 1947. The question here is whether they had consulted a doctor in the last 12 months. Data are shown below. The denominator is the number of
48
F. Chen
persons in a household, and the nominator is the number of persons who saw a doctor. 5/5, 0/5, 2/3, 3/3, 0/2, 0/3, 0/3, 0/3, 0/4, 0/4, 0/3, 0/2, 0/7, 4/4, 1/3 , 2/5, 0/4, 0/4, 1/3, 3/3, 2/4, 0/3, 0/3, 0/1, 2/2, 2/4, 0/3, 2/4, 0/2, 1/4 The property of this data is that the members of the same family tend to be similar, while persons from different families are assumed to be independent. Our purpose is to estimate the proportion of people who consulted a doctor, and to measure the similarity of the members in the same family. Similar results would be obtained for any characteristic in which the members of the same family trend to act in the same way. 2.3. Example 3. Toxicological study2 In a toxicological study, 32 pregnant rats were randomly allocated into 2 groups: test group and control group. Rats in control group were fed with regular food, while rats in test group were fed with combination of regular food with suspected teratogen. The proportions of malformation of pups of two groups were compared after delivery. The results are shown as follows: Control group
13/13 11/13
12/12 4/5
9/9 5/7
8/8 7/10
8/8 7/10
12/13 9/9
11/12
9/10
9/10
8/9
Test group
12/12 7/9
11/11 4/7
10/10 5/10
9/9 3/6
10/11 3/10
9/10 0/7
9/10
8/9
8/9
4/5
The denominator is the number of offspring in a litter, and the nominator is the number of offsprings that are malformation in the litter. In this study, the pregnant rats are independent to each other, but genetic factors, antepartum internal womb environments and metabolism conditions of teratogen have effects on the rat pups. The rat pups cannot be treated as independent observations because siblings are more alike than pups from different litters. Data in Example 2 have similar property. Similar property would be obtained from genetics studies in which the members of the same family tend to be similar. 2.4. Example 4. Crossover design For studying the bioequivalence of domestic and imported rosiglitazone maleate tablets (RMT), 24 volunteers were recruited in a 4 × 4 crossover study. Four sequence groups are formed by the randomized
Table 2. Id
Stage 1
sequence DCAB CBDA ADBC BACD CBDA ADBC DCAB BACD DCAB BACD ADBC BACD ADBC DCAB CBDA CBDA DCAB BACD CBDA ADBC ADBC CBDA BACD DCAB
Stage 2
Stage 3
Stage 4
AU C
Cmax
T50
AU C
Cmax
T50
AU C
Cmax
T50
AU C
Cmax
T50
884.27 919.50 1738.12 2000.29 823.39 2102.11 907.86 2139.72 787.80 1785.35 2031.55 1524.61 2013.54 990.04 839.94 1159.85 1032.22 1782.62 852.84 2178.77 2529.23 989.89 1579.55 889.20
204.63 178.27 326.72 382.25 158.72 360.38 170.88 366.84 163.07 347.46 320.43 381.50 314.76 163.73 136.99 167.43 182.99 376.58 150.20 436.64 449.49 167.33 328.00 186.69
3.47 3.92 3.95 3.51 3.86 4.47 3.95 3.80 2.73 3.93 3.39 3.12 4.99 4.63 4.02 4.57 4.01 3.64 3.87 4.09 4.58 3.85 3.72 3.95
905.09 2201.98 901.70 2350.58 1864.97 946.33 991.65 2012.09 905.52 1934.66 975.70 2525.23 1005.49 1118.63 1956.45 2760.90 1039.21 1917.01 2256.02 1273.04 1365.57 1936.16 1756.96 757.79
222.94 346.89 130.61 479.88 329.66 155.29 197.54 411.87 172.22 373.82 165.38 439.42 168.16 177.61 374.10 349.34 173.00 426.42 284.50 186.33 190.35 334.66 284.65 196.20
3.86 4.58 4.56 3.83 3.65 4.68 4.06 3.99 3.23 4.38 3.56 4.35 4.39 4.82 3.97 5.59 3.96 3.44 4.04 4.58 5.21 4.03 3.69 3.10
2330.77 855.89 1889.72 952.86 710.06 2005.84 2139.65 1134.23 1966.42 892.89 1893.99 952.05 2322.68 2300.18 611.43 1007.45 2440.50 1048.27 982.67 2074.44 1868.99 904.53 949.38 1813.93
455.14 205.31 375.37 187.72 133.87 339.67 369.21 200.43 362.11 163.78 313.81 177.75 406.54 334.58 132.20 178.00 380.79 179.42 157.35 296.02 412.40 175.76 188.22 441.25
5.50 3.79 4.08 3.68 3.22 3.53 3.98 3.94 3.77 3.78 3.48 4.07 4.98 4.51 2.63 4.59 4.53 4.04 4.19 4.08 4.15 4.10 4.11 3.57
1936.98 1939.36 870.93 955.46 1372.77 934.45 2408.84 924.12 1640.15 826.27 788.06 940.57 946.92 2197.79 1707.48 2477.37 1860.15 882.46 1924.09 1009.09 1064.39 2029.20 951.75 1523.54
395.55 327.12 158.99 202.02 309.07 176.23 368.93 223.98 331.95 151.87 128.56 187.22 152.94 293.69 273.61 327.36 353.41 149.33 360.50 190.86 208.41 420.12 201.15 327.47
4.35 4.01 4.09 3.69 2.99 3.72 4.62 3.70 3.15 3.78 3.23 3.58 4.73 4.72 4.03 5.88 3.77 3.23 3.96 4.57 4.95 4.24 4.12 3.33
Statistical Methods for Dependent Data
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Results of 4 × 4 cross-over trial for testing bioequivalence of domestic and imported rosiglitazone maleate tablets.
49
50
F. Chen
Latin square below ADBC BACD CBDA DCAB Where A, B, C, D are imported RMT 2 mg, domestic RMT 2 mg, imported RMT 4 mg and domestic 4 mg, respectively. Twenty-four volunteers were randomly allocated in 4 treatment groups with 6 in each sequence. Each subject received different treatment on different cycles. To minimize carryover effects, a 7-day wash-out period between the two treatment occasions was made. Plasma concentration of rosiglitazone maleate was detected within 24 hours after orally taking RMT. Data in Table 2 is the area under curve (AU C), maximum concentration (Cmax ) and time to half maximum concentration (T50 ). The aim is to test whether there is difference between domestic and imported RMT. This is a four-by-four crossover design with 3 variables. In this data set, the observations in 4 periods and the variables (AU C, Cmax and T50 ) are correlated. 2.5. Example 5. Repeated measurement, linear regression In a multicenter, randomized, double-blind, three doses (high, middle, and low = placebo) controlled clinical study, the researchers evaluated the efficacy and safety of urokinase (UK) in the treatment of acute cerebral infarctions within 6 hours from the onset of stroke. One interesting variable is the European stroke scale (ESS). Data are shown in Table 3. Repeated measurement design, also known as within-subject design, is a quite common design in medical researches. The feature of this type of data set is that individuals are measured repeatedly through time. We are interested in both treatment and temporal effects. Figure 1 displays the data graphically. Each line connect the repeated observations at different times of a subject. This simple graph reveals apparent and important patterns. First, all of 30 subjects are getting better within 8 weeks as ESS is becoming larger. Second, patients with larger ESS at the beginning of the period tend to remain larger throughout. This phenomenon is called “tracking.” There are two ways to deal with this type of data by classical methods. First, we estimate the average of ESS for each week and fit a regression model of the means of ESS over time. In fact, we aggregate the data.
51
Statistical Methods for Dependent Data Table 3. Id 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
treat 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
ESS of 30 acute cerebral infarctions. weeks
Age 27 21 21 36 17 22 29 15 21 27 34 37 31 28 32 18 15 31 39 34 36 45 40 44 22 25 32 38 22 19
0
1
2
3
4
5
6
7
8
107 107 100 107 110 105 102 97 108 108 98 100 104 108 106 103 101 91 94 104 107 109 103 110 95 92 98 106 102 109
106 106 100 106 111 108 101 97 108 108 98 98 123 120 108 102 103 90 94 104 111 114 103 114 103 102 106 121 112 109
106 106 100 106 112 108 104 97 108 108 102 114 127 115 108 102 104 92 96 105 112 120 108 120 115 110 112 127 110 124
108 106 106 107 112 106 100 99 110 114 121 118 129 119 108 104 108 93 99 105 127 130 112 124 113 108 112 126 119 127
108 106 109 106 113 108 94 99 116 116 120 126 130 134 112 114 113 89 116 122 127 131 116 133 119 116 120 128 119 128
112 112 108 111 113 108 106 99 116 118 124 126 130 126 112 114 113 95 124 128 128 132 118 135 122 116 124 130 123 132
112 112 114 112 113 108 106 101 120 118 124 134 136 126 112 116 118 102 135 131 138 139 123 142 126 116 126 132 125 133
112 114 116 117 116 109 105 101 128 124 132 138 140 127 114 128 122 105 138 129 141 142 125 144 134 122 136 138 133 144
112 116 116 109 116 110 106 103 120 128 140 138 140 140 116 143 126 108 145 138 141 143 135 144 136 127 141 140 142 147
As a result, it increases the correlation and causes a spurious association between ESS and time. Second, we fit a regression model for all the data on time. These two models give the same regression coefficients but different standard errors. Both of them ignore the intra-subject correlation. 2.6. Example 6. Pharmacokinetics study, repeated measurements, nonlinear regression A single oral dose Ciclosporin A Capsule was given to 10 healthy volunteers. Plasma concentration (ng/ml) was detected after medication. The results are shown in Table 4.
52
F. Chen
150 140 130 120 110 100 90 80 0
1
2
3
4 5 Time(Weeks)
6
7
8
0
1
2
3
4 5 Time(Weeks)
6
7
8
0
1
2
3
4 5 Time(Weeks)
6
7
8
150 140 130 120 110 100 90 80
150 140 130 120 110 100 90 80
Fig. 1.
ESS over time of 30 acute cerebral infarctions for three groups.
53
Statistical Methods for Dependent Data Table 4. Palsma concentrations of Ciclosporin A Capsule after medication of 10 volunteers. Time (hour )
Subject 1 2 3 4 5 6 7 8 9 10
0.5
1
2
343.3 86.6 256.1 300.2 344.6 230.0 116.5 66.7 67.7 216.2
783.6 501.1 534.8 849.7 826.4 780.7 943.4 239.2 789.1 599.9
443.1 817.9 486.8 846.0 631.0 912.3 848.2 814.6 551.6 1099.5
3
4
5
426.8 267.0 155.5 542.7 273.9 226.4 420.1 370.6 316.7 521.1 373.2 269.4 485.0 389.7 257.7 551.2 299.8 219.3 747.3 410.4 345.5 526.9 426.6 213.5 520.2 463.0 295.7 562.9 413.9 297.5
6
8
125.0 98.3 195.7 114.0 250.6 192.6 258.1 182.7 204.7 172.4 148.7 75.1 171.4 129.5 152.5 118.5 191.8 154.4 233.2 146.6
12
16
75.2 23.8 79.9 26.6 124.5 75.9 93.0 68.0 124.5 44.2 55.9 27.6 63.0 17.5 73.1 38.1 108.4 32.5 94.8 38.7
1200 1000 800 600 400 200 0 .51
2
3
4
5
6
8 time
12
16
Fig. 2. Plasma concentration-time curve of Ciclosporin A capsule after a single oral dose in 10 volunteers.
This is an example of repeated measurement with nonlinear trend, which are distinct from Example 5. In experimental or pharmacokinetical study, the sample size is relatively small and the period is usually short. Dropout seldom occurs. Furthermore both times of repeated measure and time intervals are similar to each other. However, it is not the case in clinical trial. The observed period is usually long. Compliance varies among patients and dropouts are routine. Last, but not least, the times and time intervals are different among patients.
54
F. Chen
2.7. Example 7. Mmulti-center clinical study, ranked data To investigate the effect of nerve growth factor (NGF) for subjects of extraneuritis caused by chemical products, an random, double blind, placebo controlled clinical trial was developed. One hundred two subjects were random allocated into treatment and placebo groups. The effectiveness was observed in 8 consecutive weeks for 102 subjects. The results are shown in Table 5, in which Id represents Table 5.
Effects of 102 subjects of extra-neuritis caused by chemical products.
Id
cnt
Trt
com
sex
age
base
x1
x2
x3
x4
x5
x6
x7
x8
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 3 4 5 5 6 6 6 6 6 7 7
T P T P T T T P T T T P P T T T P T P T T T T T T P P T T T T P P T P
A A A B B B B A A B A B A A B A B B A B A A A A B A B B A A B B A B B
0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0
21 27 27 21 34 45 37 21 31 23 22 28 24 28 29 21 31 25 29 20 32 18 31 39 34 15 36 37 15 16 15 16 17 29 19
13 12 13 13 7 13 13 13 14 14 13 14 15 13 14 10 15 10 8 10 13 14 6 10 9 12 7 5 1 1 11 8 13 13 15
1 1 1 1 0 1 1 1 1 2 1 2 2 2 2 1 2 1 0 1 1 2 0 1 1 1 1 1 0 0 1 1 1 1 2
2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 2 2 0 1 1 1 1 1 0 0 2 1 2 2 2
2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 0 1 1 1 1 1 0 0 2 1 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 1 2 1 1 1 1 0 0 2 1 2 2 2
2 2 2 2 2 2 2 2 3 3 2 2 3 3 3 3 m 2 2 2 2 3 1 2 1 2 1 1 0 1 2 1 2 2 3
2 2 2 2 2 2 2 2 3 3 2 2 3 3 3 3 m 2 2 3 2 3 1 2 2 2 1 1 0 1 2 1 2 2 3
2 2 2 2 2 2 2 2 3 3 2 2 m 3 m 3 m 3 2 3 2 3 1 2 2 2 1 1 1 1 2 1 2 2 3
3 2 2 2 2 3 2 3 3 2 2 m m m 3 3 3 1 3 2 3 1 3 2 2 1 2 1 1 3 1 2 3 3
55
Statistical Methods for Dependent Data Table 5.
Continued.
Id
cnt
Trt
com
sex
age
base
x1
x2
x3
x4
x5
x6
x7
x8
36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81
7 7 8 8 8 8 0 9 9 9 9 9 9 1 1 10 10 10 10 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
T P T P T T P T P T T P T P T P T T T T P T P T T P T T T P T T T T P T P T T T P P T P T P
B A B A B A B A A A B B A A A B B A B A B B A A B B B A A A B A A B B B A A A B B A B B A A
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0
29 29 17 18 17 18 45 30 43 36 45 32 40 36 44 35 38 33 41 22 22 28 31 37 21 24 25 32 21 20 30 38 19 21 25 22 23 21 26 20 18 23 24 26 20 18
13 13 13 13 12 11 7 9 12 14 14 14 11 13 13 13 13 13 13 13 14 8 14 15 14 12 7 12 15 13 14 13 16 14 11 15 10 14 14 15 13 8 15 10 15 13
1 2 1 1 2 1 0 0 1 2 2 2 1 1 1 1 1 1 1 2 1 1 2 2 2 1 1 1 2 2 2 1 2 2 1 2 1 2 2 2 2 1 m 1 2 2
2 2 2 2 2 2 1 0 2 2 2 2 1 2 2 2 2 2 2 2 1 1 2 2 2 2 1 1 2 2 2 2 2 2 m 2 1 2 m 2 2 1 2 1 2 2
2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 m 1 2 m
2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 m 2 m 1 2 1 2 m
2 m 2 2 2 2 2 1 2 3 3 3 2 2 2 2 2 2 2 2 2 1 3 3 2 2 2 2 3 3 3 2 3 2 m m 1 3 3 3 m 1 m 1 3 m
2 m 1 2 2 2 2 1 2 2 3 3 2 2 3 2 2 3 2 3 2 2 3 3 3 2 2 2 3 3 m 3 3 3 m 3 1 3 m 3 m 1 3 2 3 m
2 m 2 2 3 2 2 2 2 2 3 3 2 2 3 2 2 3 2 m 2 2 3 3 3 2 2 2 3 3 m 3 3 3 m m m 3 3 3 m 1 m 2 3 m
2 m 1 2 3 2 2 2 2 2 3 3 2 2 3 2 2 3 3 m 2 2 3 3 3 2 2 2 3 3 m 3 3 3 m 3 m 3 m 3 m 2 3 m 3 m
56
F. Chen Table 5.
Continued.
Id
cnt
Trt
com
sex
age
base
x1
x2
x3
x4
x5
x6
x7
x8
82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102
11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 12 12 12
T T T T T P T P T T T T P T P P T T P T T
B B A B A A A B B A B A A B B B B B A B A
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
22 25 19 20 29 25 24 21 19 21 22 20 29 19 28 21 30 27 25 35 33
14 15 15 15 15 8 15 11 15 15 14 15 4 14 12 8 15 15 7 7 8
2 2 2 2 2 m 2 m 2 2 2 2 0 2 1 1 2 2 0 1 1
2 2 2 2 2 m 2 1 2 2 2 2 0 2 2 1 2 2 1 1 1
2 2 2 2 2 1 2 1 2 2 2 2 0 2 m 1 2 2 1 1 1
2 2 2 2 2 m m m 2 2 2 2 0 2 2 1 2 2 1 1 1
m 3 3 3 3 m m m 3 3 3 3 1 3 3 1 3 3 1 1 2
m 3 3 3 3 1 3 m 3 3 3 3 1 3 2 1 3 3 2 2 2
3 3 3 m m m m m 3 3 3 3 1 3 2 m 3 3 1 2 2
m 3 3 m m m m m m m m 3 1 3 m m 3 3 2 2 2
identification of subjects; Cnt represents the center; Trt represents groups (Trt = P for placebo, Trt = T for NGF); Com represents companies; Base represents the MDNS base level before treatment; and x1–x8 are effects at time from 1 to 8 weeks after treatment. Here, x = 0 stands for invalid or worse effect of treatment on the subject, 1 for improved, 2 for notable improved, 4 for recovery, and m for missing. Except for the intra-subject correlation, the intra-center correlation of the subjects in the same hospital should be considered for this type of data. 2.8. Example 8. Repeated measurement, count data3 In order to understand whether the progabide reduces the rate of epileptic seizures, 59 patients of epileptics were recruited in a clinical trial. For each patient, the number of epileptic seizures was recorded during a baseline period of 8 weeks. Patients were then randomized to treatment with the anti-epileptic drug progabide, or placebo. In addition, all of the patients were treated with standard chemotherapy. The number of seizures was then recoded in 4 consecutive two-weeks for each epileptic. Where, treatment variable is group (0 = placebo, 1 = progabide). What is different from Examples 5 and 7 is that the response variable is the seizure
Statistical Methods for Dependent Data
57
counts in unit time (two-week). Poisson regression would be used here for count data. In this study, only recurrence episodes were included, not first episode. The reason of not including the first episode was that the factors associated with recurrence of a disease are usually different from those with that disease. For example, in a model of development of breast cancer, we should not include women who already had breast cancer, because family history and late childbearing have the strongest association with development of breast cancer, whereas stage of disease, hormone receptors, and histological grade are the strongest risk factors for recurrence of breast cancer. For those diseases for which it is sensible to speak of a second distinct episode, the risk factors for a second episode may be similar to the risk factors for a first episode. Hooton and colleagues were interested in studying urinary track infections in young women.4 With urinary track infections, patients can have a second (or third, etc.) episode after a “crude” first episode. Repeated episodes in the same person are not independent observations because the causes of urinary track infections are likely to be more similar in repeated episodes in the same person than in separate episodes in different people. Therefore, Hooton and colleagues included repeat episodes in their analysis, which increased the power of their study. 2.9. Other examples Clinical researchers in the fields of ophthalmology, orthopedics, and dentistry have a distinct advantage over cardiologists, neurologists, and hepatologists. That is while humans have only one heart, one brain, and one liver, we have two eyes, thirty-two teeth or so, and most of our joints in duplicates. In those fields with duplicate organs, it is possible to follow (or assess) a single subject and have multiple observations. For the cases with outcomes that are observed more than once in a single subject, you must use special methods to deal with outcomes that can occur in more than one body part in the same person. In a study of complications after breast implantation most women had bilateral implants.5 Some had multiple implants in the same breast. The investigators therefore performed follow-up of each breast implant until a complication occurred, the implant was removed, or the end of follow-up occurred. The survival times of the implants for the same woman are dependent. In a study of the relationship of vitamin D to development of osteoarthritis of knees, the investigators used the fact that their participants had
58
F. Chen
Table 6.
Four successive two-week seizure counts for each 59 patients of epileptics.
ID y1
y2
y3 Y4 treat baseline Age ID
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
3 5 4 4 18 2 4 20 6 13 12 6 4 9 24 0 0 29 5 0 4 4 3 12 24 1 1 15 14 17
3 3 0 1 9 8 0 23 6 6 6 8 6 12 10 0 3 28 2 6 3 3 3 2 76 2 4 13 9 9
5 3 2 4 7 5 6 40 5 14 26 12 4 7 16 11 0 37 3 3 3 3 2 8 18 2 3 13 11 8
3 3 5 4 21 7 2 12 5 0 22 5 2 14 9 5 3 29 5 7 4 4 5 8 25 1 2 12 8 4
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
11 11 6 8 66 27 12 52 23 10 52 33 18 42 87 50 18 111 18 20 12 9 17 28 55 9 10 47 76 38
31 30 25 36 22 29 31 42 37 28 36 24 23 36 26 26 28 31 32 21 29 21 32 25 30 40 19 22 18 32
y1
y2
y3
y4 treat baseline Age
31 0 32 3 33 2 34 4 35 22 36 5 37 2 38 3 39 4 40 2 41 0 42 5 43 11 44 10 45 19 46 1 47 6 48 2 49 102 50 4 51 8 52 1 53 18 54 6 55 3 56 1 57 2 58 0 59 1
4 6 6 3 17 4 4 7 18 1 2 4 14 5 7 1 10 1 65 3 6 3 11 3 5 23 3 0 4
3 1 7 1 19 7 0 7 72 1 4 0 25 3 6 2 8 0 72 2 5 1 28 4 4 19 0 0 3
0 3 4 3 16 4 4 7 5 0 0 3 15 8 7 4 8 0 63 4 7 5 13 0 3 8 1 0 2
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
19 10 19 24 31 14 11 67 41 7 22 13 46 36 38 7 36 11 151 22 42 32 56 24 16 22 25 13 12
20 20 18 24 30 35 57 20 22 28 23 40 43 21 35 25 26 25 22 32 25 35 21 41 32 26 21 36 37
two knees to their advantage.6 Although the Framingham’s study consists of over 5000 subjects, only 556 participants had X-rays of their knees and assessments of their vitamin D intake and serum levels. Therefore, they did this by looking at both knees to maximize their statistical power. 3. Common Structures of Intra-unit Correlation for Dependent Data The feature of dependent data is that the variance-covariance matrix of response variable is not diagonal but block diagonal. Because the dependent data do not meet the independent requirement that is essential in classical statistical methods, special methods are needed
Statistical Methods for Dependent Data
59
to deal with it. For example, random effects models and/or mixed effects models are used for repeated measurement or longitudinal data and metaanalysis is used for multicenter clinical trial.3 Many systematic researches have been achieved in the field. In this section, we try to demonstrate the connotations of dependent data, how to judge the type of the data set, to construct reasonable covariance structure or intra-unit correlations structure, draw valid scientific inferences for the data set. In this section, we focus on the common structures of intra-unit correlation of dependent data.7 3.1. A simple case We first consider the simplest case of a paired design. In this paired design, the subjects are independent, while two observations on the same subjects are correlated. If we assume the correlations of two observations of subjects are equal, say ρ, then the correlation matrix of 2m observations from m subjects could be R 0 ··· 0 0 R ··· 0 (1) RY = .. . .. .. .. . . . 0
where,
RY =
0
"
1 ρ
··· ρ 1
#
R
(2)
where 0 is 0 matrix with all elements being 0, RY is block diagonal matrix with R in diagonal. For random block trial, we have a treatments and b blocks. While individuals from different blocks are independent, those from the same block tend to be similar and correlated. Because the individuals in the same block are in the same status, so we can assume that there is a positive correlation, ρ, between any two individuals from the same block. The intra-block correlation matrix is defined as 1 ρ ··· ρ ρ 1 ··· ρ .. . . .. R2 = .. (3) . . . . .. ρ ρ . 1
60
F. Chen
a × b observations form a correlation matrix which has the same structure as RY in (1). Matrices in diagonal block of RY have the same structure as R2 . It is obviously that R1 is a special case of R2 . Now let’s consider some types of correlation structure of longitudinal studies. The defining characteristic of a longitudinal study is that individuals are measured repeatedly through time in a follow-up study. Correlation structures vary from data set. The commonly used correlation matrices are equal correlation, neighbor correlation, autocorrelation and unstructured correlation, etc. 3.1.1. Equal correlation It is similar to R2 , We also refer to equal correlation as exchangeable or compound symmetry. 3.1.2. Neighbor correlation Neighbor correlation is that only two closed observations are correlated, others are independent. For 5 times repeated measurement, the correlation matrix is given by 1 ρ1 0 0 0 1 ρ2 0 0 ρ1 R3 = (4) ρ2 1 ρ3 0 0 . 0 ρ3 1 ρ4 0 0 0 0 ρ4 0
When the correlations of two closed observations are equal, the correlation is refered to as stationary 1-dependence), otherwise, nonstationary 1-dependence. Stationary 2-dependence has the structure as follows 1 ρ ρ 0 0 ρ 1 ρ ρ 0 R4 = ρ ρ 1 ρ ρ (5) . 0 ρ ρ 1 ρ 0
0
ρ
ρ
1
It is not difficult to extend to stationary k-dependence. Obviously, stationary correlation is a special case of nonstationary, and exchangeable correlation is a special case of stationary.
61
Statistical Methods for Dependent Data
3.1.3. Autocorrelation Autocorrelation means that correlation depends on the spacing of two measurements. The correlation between a pair of measurements on the same subject decays towards 0 as the time separation between the measurements increases. If the correlation of two observations next to each other is ρ, the correlation of two separated observations is ρ’s power of number of the observations separated. For 5 times repeated measurements, the correlation matrix is given by 1 ρ ρ 2 ρ3 ρ4 1 ρ ρ 2 ρ3 ρ 2 R5 = (6) ρ 1 ρ ρ2 ρ . 3 2 ρ ρ 1 ρ ρ 4 3 2 ρ ρ ρ ρ 1 We refer to (6) as the first order autocorrelation or the first order autoregressive process. A natural extension of (6) is given by (7), R 6 , the correlation is inversed to the time interval or spacing of two measurements.
1
ρt2 −t1 t −t 3 1 R6 = ρ t4 −t1 ρ ρ
t5 −t1
ρt2 −t1
ρt3 −t1
ρt4 −t1
ρt5 −t1
1
ρt3 −t2
ρt4 −t2
ρt3 −t2
1
ρt4 −t3
ρt5 −t2 ρt5 −t3 . t5 −t4 ρ
ρ
t4 −t2
ρ
t5 −t2
ρ
t4 −t3
1
ρ
t5 −t3
t5 −t4
ρ
1
3.1.4. Unstructured or general structure In this case elements on nondiagonal of block matrix R are unequal. 3.1.5. Independent, zero correlation Elements on nondiagonal of block matrix R are 0. The relationships of the matrices mentioned above are as follow independent ⊂ exchangeable ⊂ autocorrelation ⊂ stationary ⊂ nonstationary ⊂ unstructured where A ⊂ B means A is a special case of B.
(7)
62
F. Chen
3.2. Complicated cases In random cluster sample study, individuals in the same cluster (household, class in school, group in enterprise, etc.) tend to act in a similar way on healthy attitude, eating habit, and so on, and share the same environment, etc. If family is the unit in cluster sampling, genetic factor should be considered because the observations measured on the members from the same family are correlated. For example, in a cluster sampling, a simple random sample of 54 households was drawn.8 The blood pressure observations of 209 subjects were detected. Let Yij represent a response variable, systolic pressure, for member j (j = 1, 2, . . . , ni ) in household i (i = 1, 2, . . . , 54). Where j = 1 stands for father, 2 for mother, 3 and more for children. Generally speaking, if the interesting variable is affected by genetic factor or other family factors, the correlation between parents is lower than the correlations between father and children, mother and children, and children themselves. In this case, a special but common correlation structure could be defined as (for example, 4 persons in a family with parents and two children) Yi1 Yi2 Yi3 Yi4
Yi1 1 r 1 r2 r2
Yi2
Yi3
r1 1
r2 r3
r3 r3
1 r4
Yi4 r2 father r3 mother . r4 child 1 child 2 1
(8)
In fact, the correlation structure matrix of 4 members (parents and two children) in one family in the example mentioned above is 1.0000 0.2056 0.4212 0.4212 0.2056 1.0000 0.4292 0.4292 . 0.4212 0.4292 1.0000 0.5622 0.4212
0.4292
0.5622
1.0000
For stratified cluster sampling and other data with hierarchical structure, the same strategy could be used to construct the intra-cluster correlation matrices. In the crossover design, each subject is randomized to a sequence of two or more treatments and hence acts as his own control for treatment comparisons. In the simplest paired 2 × 2 crossover design, two subjects are paired, the first subject in the same paired receives either of two treatments in randomized order in two successive treatment periods which often
63
Statistical Methods for Dependent Data
separated by a washout period, while the other received two treatments in adverse order to the first one in two successive treatment periods. There are 3 possible correlations in this type of data: (1) correlation between two observations of the same subject in two periods; (2) correlation between two subjects in the same paired in the same period; and (3) correlation between two subjects in the same paired in different periods. The correlation structure, therefore, could be defined as: Subject 1
Subject 1
Subject 2
Subjects 2
Period 1
Period 2
Period 1
Period 2
Period 1
1
r1
r2
r3
Period 2
r1
1
r3
r2
Period 1
r2
r3
1
r1
Period 2
r3
r2
r1
1
.
In multicenter clinical trial, although the protocol and standard operating procedures are implemented similarly at all centers, the level and opinions of doctors and nurses, equipments, and medical conditions, etc., vary from the centers. This is so-called center-effects. Subjects in the same center are correlated. The repeated observations through time from the same subjects are also correlated. This is hierarchical structure data. If subjects from different centers are independent, the intra-center correlation structure could be defined as (3 visits for each subject): Subject 1
Subject 2
···
t2
t3
t1
t2
t3
···
t1
t2
··· ···
r2 r2 r2
r2 r2 r2
r2 r2 r2
r2 r2 r2
r2 r2 r2
r2 r2 r2
Subject 1
t1 t2 t3
1 r1 r1
r1 1 r1
r1 r1 1
r2 r2 r2
r2 r2 r2
r2 r2 r2
Subject 2
t1 t2 t3
r2 r2 r2
r1 r2 r2
r2 r2 r2
r2 1 r1
1 r1 r1
r1 r1 1
··· Subject n
t1 t2 t3
Subject n
t1
··· r2 r2 r2
r2 r2 r2
···
··· r2 r2 r2
r2 r2 r2
r2 r2 r2
t3
.
··· r2 r2 r2
···
1 r1 r1
r1 1 r1
r1 r1 1
Although, for a real data set, the correlation structure could be defined and selected by statistical methods, the author suggests that the biological
64
F. Chen
and medical backgrounds should be considered to get a reasonable and acceptable correlation matrix structure. 4. ANOVA Methods and Its Limitation 4.1. Parameter estimations for dependent data 4.1.1. Estimation of means If there are n observations of variable X, denoted by x1 , x2 , x3 , . . . , xn with ¯ and variance σ 2 . We assume the data are dependent. mean X (1) xi is correlated with xj with a correlate coefficient ρ If xi is correlated with xj with a correlation coefficient ρ (ρ is assumed to ¯ was be larger than 0 without losing general), thus the variance of X ¯ = var(X) =
1 cov(x1 + x2 + · · · + xn , x1 + x2 + · · · + xn ) n2 1 [nσ 2 + n(n − 1)ρσ 2 ] n2
σ2 [1 + (n − 1)ρ] . (9) n Formula (9) shows that standard error of mean is larger when the data are dependent than the case when the data are independent. Moreover, it is in proportion to correlation. In this case, the confidence interval of population mean is as follows p ¯ ± tn−1,ν √σ 1 + (n − 1)ρ . (10) X n =
It is wider than that when the data are independent. When intra-unit correlation is 0, the confidence interval given by (10) is similar to the confidence interval when data are independent. (2) xi is correlated with xj with autocorrelation If xi is correlated with xj with autocorrelation cov(xi , xj ) = σ 2 ρ|i−j| .
(11)
The variance of ¯ = var(X)
1 [n + 2(n − 1)ρ + 2(n − 2)ρ2 + · · · + 2ρn−1 ]σ 2 . n2
(12)
Statistical Methods for Dependent Data
65
(3) Correlation between xi and xj is unstructured If the correlation between xi and xj is unstructured cov(xi , xj ) = ⌊ρij σ 2 ⌋n×n .
(13)
¯ is The variance of X
X ¯ = 1 n + 2 var(X) ρij σ 2 . n2
(14)
i6=j
Thus, the standard error of mean in this case when the data are dependent is larger than the one when the data are independent from each other. The standard error is in proportion to the correlation as well. The confidence interval is wider than that of independent data. 4.1.2. Estimation of rate The independent binary data should generally be handled by the methods based on binominal distribution. Let incidence rate be π and its variance p be π(1 − π), then the standard error is π(1 − π)/n. If the data are correlated with each other, the variance and the standard error of rate increase. For example, in Example 2, the total incident rate is π = 30/104 = 0.2885, the variance is 0.00197 and 95% CI is 0.2038–0.3855 if we apply the methods based on binominal distribution. And its 95% CI is 0.2014–0.3756 if we apply the methods based on normal approximation. However, the actual variance of the incident rate in each family is 0.00520, much larger than that given by the pure binominal distribution. This is because that the incidence, “visiting doctors in the last year”, has a family aggregation. As a result, we underestimated the variance of dependent incidences by applying methods based on binomial distribution. The classic way of handling dichotomous data is firstly coding the incidence that happens as 1, otherwise, as 0, and then applying Eq. (9) to the data. When it comes to a dichotomous data with equal correlation, the standard error of rate is r π(1 − π) [1 + (n − 1)ρ] . (15) σπ = n Others can be handled in similar ways.
66
F. Chen
4.2. ANOVA with random effect We begin with the repeated one-way testing designs, of which the block design is the simplest case. We may assume that there are a treatments and b blocks. The model of ANOVA can be represented as yij = µ + τj + eij .
(16)
In the equation below, µ is the population’s mean, τj is the effect of the jth treatment (j = 1, 2, . . . , a), eij is the total residual error of observations receiving the jth treatment in the ith block. When we apply a randomized block design, the units in the same block may have good homogeneity, while units in different blocks may have many differences. This is the characteristic of block design that makes the observations in every block to be homoplasy, which is called intra-block correlation. For this moment, the error term eij may be denoted as eij = νi + uij .
(17)
νi is the residual error of the ith block (i = 1, . . . , b), uij is the residual error of observations receiving the jth treatment in the ith block. Therefore, the ANOVA model of block design should be yij = µ + τj + νi + uij .
(18)
In most cases, the treatment factors of a block design are fixed effects, while blocks are random effects. Namely, τj is fixed effect, µj is the mean of observations in the jth level, and νi is random effects with τj = µj − µ ,
Στj = 0
νi = µi − µ ,
Σνi = 0 ,
var(νi ) = σ22 ,
(19)
and cov(νi , νi′ ) = 0, i 6= i′ ,
where µi is the mean of the ith block. uij is the random effect, and uij = yij − µ − τj − νi , var(uij ) = σ12 ,
Σuij = 0 ,
and cov(uij , ui′ j ′ ) = 0, j 6= j ′
cov(uij , νk ) = 0
(20)
for all i, j, k .
In this way, the variance of yij is var(yij ) = σ22 + σ12 ,
(21)
Statistical Methods for Dependent Data
67
the covariance is cov(yij , yij ′ ) = cov(νi + uij , νi + uij ′ ) = σ22 , j 6= j ′ , and others are 0. Expressed by matrix, the variance and covariance of yij is R 0 ··· 0 0 R ··· 0 . cov(eij ) = σ 2 .. .. .. .. . . . . 0
2
Here, σ =
σ12
+
···
0
R
(22)
ab×ab
σ22 ,
1
ρ
ρ R= .. . ρ
1 .. . ρ
···
··· .. . ···
The intra-unit correlation coefficient is ρ=
ρ
ρ .. . 1
.
(23)
a×a
σ22 . σ12 + σ22
(24)
Based on the idea of ANOVA, it is obvious that E(M Streatment) = b
a X i=1
τ12 /(a − 1) + σ12 ,
E(M Sblock ) = bσ22 + σ12 , E(M Sresidual ) = σ12 .
(25)
And the variance component σ12 and σ22 are σ12 = E(M Sresidual ) , σ22 =
M Sblock − M Sresidual . b
(26)
If we substitute σ12 and σ22 in Eq. (24) by equations above, the intracorrelation coefficient is ρ=
M Sblock − M Sresidual . M Sblock + (b − 1)M Sresidual
(27)
68
F. Chen Table 7.
ANOVA of the serum coagulation time of four methods.
Source Total Between Groups In Groups Between Block Residual
SS
DF
MS
105.7787 13.0163 92.7624 78.9888 13.7738
31 3 28 7 21
3.4122 4.3388 3.3129 11.2841 0.6559
F
P
6.62
0.0025
17.20
0.0000
If every block has a different size (e.g. missing values), the intra-unit correlation of randomized block design data can be denoted as ρ= where
M Sblock − M Sresidual , M Sblock + (m0 − 1)M Sresidual m0 = m ¯ −
P
(mi − m) ¯ 2 P . (a − 1) mi
(28)
(29)
4.3. Example 9. Analysis of randomized block design data The analysis of Example 1. We begin with the ANOVA Table 7.9 The variance component σ12 and σ22 are σ02 = M Sresidual = 0.6559 , 11.2841 − 0.6559 M Sblock − M Sresidual = = 2.6571 . b 4 And the intra-correlation coefficient is 2.6571 σ2 r= 2 1 2 = = 0.8020 . σ0 + σ1 0.6599 + 2.6571 σ12 =
Though ANOVA of correlated data is similar to that of traditional randomized block design in process and result, ANOVA of correlated data not only answers the question, “whether there is a difference between treatment groups”, on which that of traditional randomized block design emphasizes, but also puts more emphasis on the further decomposition of variance and affords the intra-unit correlation. Thus, its model is more precise with richer information. 4.4. Example 10. 4 × 4 cross-over design The analysis of log AU C data in Example 4. For this moment, the fixed effects that we should take into consideration is 4 treatments, A, B, C, D,
69
Statistical Methods for Dependent Data Table 8. Source Total ID(sequence) Sequence Period Treat Residual
ANOVA of log AU C.
SS
DF
MS
F
P
15.67277346 0.96107428 0.06927165 0.13519601 13.83396718 0.67326434
95 20 3 3 3 66
0.04805371 0.02309055 0.04506534 4.61132239 0.01020097
4.71 2.26 4.42 452.05
< 0.0001 0.0892 0.0068 < 0.0001
4 different periods and 4 different sequences. The 4 observations of the same subject are correlated. And, σ02 = M SResidual = 0.01020097 , σ12 =
M SID(Sequence) − M SResidual 0.04805371 − 0.01020097 = b 4
= 0.009463185 . Accordingly, ρ= =
M SID(Sequence) − M SResidual M SID(Sequence) + (b − 1)M SResidual 0.04805371 − 0.01020097 = 0.4812 . 0.04805371 + (4 − 1) × 0.01020097
4.5. The condition of using ANOVA The ANOVA is limited to fairly balanced designs where there are tidy partitions of the total sum of squares. The model should be fairly simple so that a suitable covariance structure (symmetry) for the observations can be produced. For example, if t = 4 in repeated measurement data, the covariance matrix should be σ11 σ12 σ13 σ14 σ 21 σ22 σ23 σ24 . σ31 σ32 σ33 σ34 σ41
So the symmetry means (1) σii = σjj = σ 2 , (2) σij = ρσ 2 , i 6= j.
σ42
σ43
σ44
70
F. Chen
In other wards, symmetry means equal variance and equal intra-unit correlation. When the data are not symmetry, the ANOVA would increase type I error. In 1958, Greenhouse and Geisse suggested a correction coefficient ε=
t2 (¯ σ −σ ¯ )2 P 2 P 2 ii 2 ..2 ¯.. − 2t σ ¯i ) (t − 1)( σij + t σ
(30)
where t represents the times of repeated measurement, σ ¯ ii is the average of variances in diagonal of covariance matrix, σ ¯.. is the average of all elements in covariance matrix, and σ ¯ i is the average of elements in ith row of covariance matrix. Greenhouse and Geisse have shown that, 1/(t − 1) ≤ ε ≤ 1. If ε is not equal to 1, a modified F = M STreatment/M SResidual would not follow F distribution with degree of freedom νTreatment and νResidual but follow F distribution with degree of freedom ενTreatment and ενResidual . Because of the cutting down of degree of freedom, the modified F test is conservative. For the data in Example 1, the variance-covariance matrix is 2.40286 3.23143 5.26411 . 2.13857 3.00518 2.29125 2.13143
3.41536
2.02036
3.29357
Greenhouse–Geisser’s ε = 0.7996. Thus, the degree of freedoms νTreatment = 0.7996 × 3 = 2.4 , νResidual = 0.7996 × 21 = 16.8 ,
then F = 6.62, P = 0.0056, larger than P = 0.0025. In 1970, Huynh and Feldt have proved that when ε = 1, the F test is valid. If covariance matrix is symmetry, then ε = 1 or otherwise ε < 1. On the other hand, ε = 1 does not necessary implies the covariance being symmetry. The exception is for 2 × 2 covariance matrix for twice repeated measurements, ε always equal to 1 even if the variances are unequal. We should select a suitable method for dependent data according to the feature of the data set. Unfortunately, the suitable systematic methods for all types of dependent data have not been developed. Only several methods for special data set can be used now. For instance, the mixed models are employed for repeated measurements or data from randomized block design, crossover design, and some special procedures for longitudinal
Statistical Methods for Dependent Data
71
data, etc. The multilevel models analysis7 would be used if the structure of variance-covariance matrix is block diagonal. For general structure of variance-covariance matrix, which is not block diagonal, generalized least square procedure with Newton–Raphson iterations may be useful. Further research is needed. 5. GEE for Dependent Data Generalized estimating equations (GEE) was put forward by Liang Zeger 10 which is an extension of generalized linear models that provides a unified and flexible approach to analysis of data from a longitudinal study. Of particular relevance when the repeated measurements are binary variables or counts, and a number of time dependent covariates are also measured (Qiguang Chen,11 Lingping Xiong et al.12 ). GEE plays an important rule in modeling the possible correlations among the repeated observations for a given subject.13 5.1. Introduction of GEE The key ideas are presented in terms of repeated measurements with the simplest dependent structure. Let yij be the observation of jth measurement of the ith unit, where i = 1, 2, . . . , n and j = 1, 2, . . . , mi . Xij = (x1ij , x2ij , . . . , xpij ) represents the explanatory variables. The observations from the same unit are likely to be correlated, but the observations from different units are assumed in general to be independent. If the marginal distribution of response variable yij is one of exponential family, then, by the theory of generalized linear models, the density functions would be f (yij ) = exp[{yij µij − a(µij ) + b(yij )}φ] ,
(31)
where φ is known as dispersion parameter or additional scale, µij = h(ηij ), ηij = X ij β. It can be proved that E(yij ) = a′ (µij ), var(yij = a′′ (µij )/φ. For random effects model, we have ( yˆij = µij , (32) g(µij ) = β0 + β1 x1ij + · · · + βp xpij where g(·) = h−1 (·) as a link function. If there is correlation between the repeated observations, the correlation between ni observations in unit i can be described by working correlation matrix Ri (α). The times of repeated
72
F. Chen
measurement on subjects are different from each other, so the ranks of correlation matrices are also different from each other. Ri (α) depends on unknown parameter α, to which we refer as correlate parameter. For instance, for R2 in (3) ( 1 if s = t , ρst = (33) α if s 6= t . for R4 in (5) ρst for R5 in (6)
1 = α 0
ρst =
(
if s = t , if 0 < |s − t| ≤ 2 ,
(34)
if |s − t| > 2 . 1 |s−t|
α
if s = t , if s 6= t .
(35)
then the variance-covariance matrix of y i = (yi1 , yi2 , . . . , yimi )′ has the form 1/2
1/2
V i = Ai R(α)Ai /φ ,
(36)
where Ai is diagonal matrix with the elements h(µij ) = νij φ in diagonal, which are the function of the variance ν and the mean µ of y. Liang and Zeger10 defined the GEE as n X
D ′i V −1 i Ei = 0 ,
(37)
i=1
where D i =
∂µi ∂β ,
E i = y i − µi , and µi = (µi1 , µi2 , . . . , µimi )′ .
5.2. Parameters estimations of GEE There are three types of parameters in GEE, covariate coefficients β, the scale parameter φ, the correlation parameter α. But φ and α are functions of β. We can get the estimation of β only if φ and α are known. Consequently, the estimation procedure of GEE is iterative. The initial value of β will be the estimations from generalized linear model under the assumption that the observations are independent of one another, say β i . The crude residuals of the model is eij = yij − µij = yij − g −1 (β0 + β1 x1ij + · · · + βp xpij ) .
(38)
Statistical Methods for Dependent Data
73
The Pearson residuals are rij =
yˆij − µij . √ νij
(39)
Thus φˆ =
mi n X X i=1 j=1
rij /(N − p) .
(40)
The intra-unit correlation can be estimated from the current Pearson residuals. For exchangeable correlation, we have " P mi P mi Pmi 2 # , " n Pmi 2 # n X X j=1 rij j=1 rij l=1 rij ril − j=1 α ˆ= . (41) mi (mi − 1) mi i=1 i=1
For first order autocorrelation , " n P mi # n Pmi −1 2 X X j=1 rij j=1 rij rij+1 α ˆ= . mi − 1 mi i=1 i=1
(42)
For stationary k-dependence #," n Pmi " # Pmi −1 Pmi −k n Pmi 2 2 X j=1 rij X j=1 rij j=1 rij rij+1 j=1 rij ri,j+k , , ,. . . , α ˆ= mi mi − 1 mi − k mi i=1 i=1
(43)
where the first element of α is 1 and the elements after k-order are 0. At a given iteration, the scale parameter φ and correlation parameters α can be estimated from the current Pearson residuals. Given the estimated of φ and α, we can calculate an updated estimate of β by iteratively reweighed least squares (IRLS). These two steps are iterated until the procedure convergence. 5.3. Analysis of examples 5.3.1. Example 11. Analyses of the data in Example 1 The random effect model is yij = β0 + β2 g2ij + β3 g3ij + β4 g4ij + eij , where g1 , g2 , g3 and g4 are dummy variables of treatment groups. The correlations between the observations of different treatments for the same subjects are assumed equal. Results are shown in Table 9. Intra-subject correlation ρ = 0.8020. We obtain the same results as in Example 9.
74
F. Chen Table 9.
Table 10.
Estimated results of data in Example 1 by GEE.
Variables
Coefficient
SE
Z
P
g2 g3 g4 Constant
0.4125 0.6375 1.7250 9.3000
0.378783 0.378783 0.378783 0.601958
1.09 1.68 4.55 15.45
0.276 0.092 0.000 0.000
The results of fitting two GEE models for data in Example 3. Logistic
Group Constant
Coefficient
Std.
−1.0144 2.1484
0.4985 0.4039
Probit Z
P
−2.03 0.042 5.32 0.000
Coefficient
Std.
−0.5611 1.2564
0.2702 0.2086
Z
P
−2.08 0.038 6.02 0.000
5.3.2. Example 12. Analysis of data in Example 3 We fit both logistic regression model and probit model as follows yij =
eα+β treat + eij , 1 + eα+β treat
yij = Φ−1 (α + β treat) + eij , where the subscripts of treat are omitted. Table 10 shows the results. Two intra-litter correlation coefficients are estimated based on logistic model and probit model and they are all equal to 0.1556. 5.3.3. Example 13. Analysis of data in Example 8 Example 8 has count data, with successive two-week seizure counts for each of 59 epileptics. Poisson regression model will be used. In contrast to the examples mentioned above, beside treatment effects, the covariables, such as age, ln(base), and time effects, should also be considered. The mixed effect Poisson regression model for the data is ln(λ) = α + β1 treat + β2 time + β3 age + β4 ln(base) . For repeated measurement data, the intra-subject correlation structure may be exchangeable or first order auto-correlation. For exchangeable structure, intra-subject correlation estimated from GEEs is 0.7690, and deviance = 3551.0. For autocorrelate structure the intra-subject correlation is 0.7990t, where t is time interval between two observations (1 unit of t is 2 weeks) and deviance = 3554.79.
75
Statistical Methods for Dependent Data Table 11. Parameter Constant Treat Time Age ln(base)
GEE estimators for data in Example 8. Coefficient
SE
Z
P
−1.7760 −0.2938 −0.0443 0.0231 0.9817
0.3692 0.1445 0.0353 0.0067 0.0796
−4.81 −2.03 −1.26 3.46 12.33
< 0.0001 0.0420 0.2092 0.0005 < 0.0001
The working matrix of autocorrelation structure is 1.0000 0.4533 0.2055 0.0931 0.4533 1.0000 0.4533 0.2055 0.2055 0.4533 1.0000 0.4533 0.0931 0.2055 0.4533 1.0000 The numbers of parameter of two models are equal. Therefore, the smaller the deviance is, the better the model will be. According to this, we conclude that exchangeable structure is suitable for the data. Estimated results are shown in Table 11. The results show that the two-week seizure counts for those in test group are significantly smaller than those in placebo group. The counts are related to age and the baseline. No evidence shows that the counts change over time. GEE can cope with data with missing values. For the numerical data in a paired design or randomized block design, the paired t-test and ANOVA require the data are balanced without missing, while the GEE does not. Furthermore, when the times of measurement are not common to all the experimental units, or when the numbers of the unit in clusters are not the same, the use of GEE will still be applicable. Liang9 has proved that if there are not too many missing values and missing is random, the GEE estimation is robust. GEE obtains the estimation of covariance matrix V or working correlation matrix R by using simple regression or “moment” procedures based upon functions of the actual calculated raw residuals. Theoretically, the structure of working correlation matrix can be specified arbitrarily. However, GEE focuses on modeling the fixed effects rather than exploring the structure of the random component of the model. It does not consider the case where the explanatory variables have an influence on covariance of response variable.
76
F. Chen
6. Multilevel Models for Dependent Data Many kinds of dependent data collected in medical and biological sciences have a hierarchical or clustered structure. We refer to a hierarchy as consisting of units grouped at different levels. For example, in a clustered sampling survey where the sampling units are families, offsprings may be the level 1 units in a 2-level structure where the level 2 units are the families. Repeated measurements are the level 1 units in a 2-level structure where the level 2 units are the individuals. Repeated measurements are the level 1 unit in a 3-level structure where the level 3 units are the hospitals and level 2 units are the patients. The existence of such data hierarchies is created by experimental design. Low levels are nested in the high levels. 6.1. Introduction of multilevel model Multilevel model was put forward by Harver Goldstein14 for the data with hierarchical or clustered structure. The key ideals are to estimate variances on each level and to address how the explanatory variables affect the variances. The multilevel model, therefore, enables data analysis to obtain statistically efficient estimations of regression coefficients, and provides correct standard errors, confidence intervals and significance tests by using the clustering information. We discuss a simple 2-level model, without lose of generalizibility, of one explanatory variable x1 . yij = β0j + βij x1 + εij
(44)
i stands for level 1 units, j for level 2 units. i = 1, . . . , nj ; j = 1, . . . , m, where, β0j and β1j are random variables with β0j = β0 + u0j ,
β1j = β1 + uij ,
where β0 and β1 are fixed parameters, u0j , u1j are random variables in level 2 with parameters E(u0j ) = E(u1j ) = 0 2 var(u0j ) = σu0 ,
2 var(u1j ) = σu1 ,
cov(u0j , u1j ) = σu01 .
εij are random variables in level 1 with parameter E(ε1j ) = 0 ,
var(εij ) = σ02 .
We also assume that cov(εij , u0j ) = cov(εij , u1j ) = 0.
77
Statistical Methods for Dependent Data
We can now write the level 2 model in the form yij = β0 + β1 x + (u0j + u1j x + εij ) .
(45)
The model consists of a fixed part and a random part. In contrast to a general mixed effect model (for example, variance component model, mixed linear model, GEE), explanatory variables can be included in random part of multilevel model with random coefficients u1j . The multilevel model, therefore, is also refered to as random coefficient model. The covariance matrices is block diagonal V n1 V n2 (46) V = . .. . V nm
2 If no covariate is included in the random part of the model, σu1 = 0 and the model reduces to a general mixed effects model with 2 2 2 σu0 + σ02 σu0 ··· σu0 2 2 2 σu0 σu0 + σ02 · · · σu0 V ni = cov(yij |Xβ) = . . . . .. .. . ··· . 2 2 2 σu0 σu0 · · · σu0 + σ02 n ×n i
i
(47)
2 J (ni ) +σ02 I (ni ) . Where, J (n) is n×1 Equation (47) can be denoted as σu0 vector with all elements 1, I (n) is n dimension unit matrix with all elements in diagonal 1, others 0. Then the intra-unit correlation can be estimated by
σ2 cov(u0j + εi1 j + u0j + εi2 j ) = 2 u0 2 . ρ= p σu0 + σ0 var(u0j + εi1 j ) · var(u0j + εi2 j )
(48)
2 If covariate was considered, σu1 6= 0 and
2 2 + 2σu01 x + σu1 V ni = (σu0 x2 )J (ni ) + σ02 I (ni ) .
(49)
The intra-unit correlation can be estimated by ρ=
2 2 σu0 + 2σu01 x + σu1 x2 . 2 + 2σu01 x + σu1 x2 + σ02
2 σu0
(50)
It is thus clear that intra-unit correlation has relation to the explanatory variables.
78
F. Chen
6.2. Estimation of parameters of multilevel model Parameters in multilevel model can be estimated by iterative generalized least squares (IGLS)14 or Restricted Iterative Generalized Least Squares (RIGLS).15 Let cov(Y |Xβ) = V , if V is known, then according to the generalized least square estimation ˆ = (X T V −1 X)−1 X T V −1 Y , β
ˆ = (X T V −1 X)−1 . cov(β)
(51)
But in fact, V is usually unknown and expressed by random coefficients. For known β we form the residuals of yij Y˜ = {˜ yij } = {yij − X ij β} .
(52)
T
If we form the cross-product matrix Y˜ Y˜ ) we see that the expected value of this is simply V . From the equation T vec(Y˜ Y˜ ) = vec(V ) + R ,
(53)
2 2 we estimate parameters σu0 , σu1 , σu01 and σ02 by means of generalized least squares where vec(·) is the vector operator. The estimation procedure is iterative. We would usually start from “reasonable” estimates of the fixed parameters β. Typically these will be those from an initial OLS estimation. From these we form the “raw” residuals (52), estimate random coefficients; and obtain an improved estimator of V ; then return to (51) to obtain new estimates of the fixed effects β; and so on. Alternate between the random and fixed parameters estimation until the procedure convergence. The IGLS procedure produces biased estimates in general and this can be important in small samples. Goldstein15 shows how a simple modification leads to restricted iterative generalized least squares (RIGLS) by substituting V − X(X T V −1 X)X T for its corresponding term V in (53) to produce an unbiased estimate. For multilevel generalized linear model, in order to work with a linearized model, we will use Taylor expansion. There are two produces to treat high-level residuals when forming Taylor expansion. One is to add current residuals to the linear component of the nonlinear function and the another does not add. The former is predictive quasi-likelihood (PQL), while the latter is marginal quasi-likelihood (MQL). In many applications, MQL procedure tends to underestimate the values of both the fixed and random parameters, especially where nij is small. So Goldstein14 suggested that PQL be used in fitting generalized model rather than MQL. In addition, he
79
Statistical Methods for Dependent Data Table 12.
M Ln estimation for data in Example 1.
Variable
Coefficient
SE
Z
P
g2 g3 g4 Constant
0.4125 0.6375 1.7250 9.3000
0.4049 0.4049 0.4049 0.6435
1.0188 1.5745 4.2603
0.3083 0.1154 0.0000
also pointed out that greater accuracy is to be expected if the second-order approximation is used rather than first-order based upon the first term in the Taylor expansion.16 6.3. Example 14. Analysis of data in Example 1 This is the simplest case with 4 units in level 1 in a 2-level structure where the level 2 units are the subjects. The model has the form as yij = β0 + β2 g2ij + β3 g3ij + β4 g4ij + u0j + eij . To obtain IGLS estimation of the parameters, we use software M Ln.17 The results are shown in Table 12. The estimation of variance in level 1 is σ02 = 0.6559, in level 2 σu2 0 = 2.6571, with standard error SE[σ02 ] = 0.1893, SE[σu2 0 ] = 1.411, respectively. Then 2.6571 σ2 = 0.8020 . ρ = 2 u0 2 = σu0 + σ0 2.6571 + 0.6559 This results are similar to those from ANOVA and GEE. 6.4. Example 15. Analysis of data in Example 7 This is a 3-level model. Subjects are level 2 units clustered within centers that are level 3 units. Repeated measurements from the same subject are level 1 units nested within level 2 unit. The results are shown in Table 13. The multilevel model decomposes the variance into 3 levels. 0.1156 for level 1, 0.3151 for level 2 and 0.1190 for level 3. Thus, intra-subject correlation can be estimated as 2 2 0.3151 + 0.1190 σu0 + σν0 2 2 2 = 0.1156 + 0.3151 + 0.1190 = 0.7897 . σ0 + σu0 + σν0 And, intra-center correlation can be estimated as σ02
2 0.1190 σν0 2 = 0.1156 + 0.3151 + 0.1190 = 0.2165 . 2 + σu0 + σν0
80
F. Chen Table 13.
M Ln estimation of data in Example 7. Coefficient
Fixed effect
CONS TREAT COMP AGE SEX TIME
Random effect
Level 3 Level 2 Level 1
2 σν0 2 σu0 σ02
SE
1.7230 0.3273 0.0384 −0.0090 −0.2285 0.1616
0.2545 0.1135 0.0900 0.0076 0.1483 0.0056
0.1190 0.3151 0.1156
0.0657 0.0859 0.0065
Theoretically, multilevel model can fit for arbitrary levels. The most powerful software MLwin could fit models up to 7 levels. It is sufficient in practice. 6.5. Multilevel logistic regression Multilevel model can be expanded to the case where the error term in the model is non-normal distribution.18 In the rest of this section we will focus on the multilevel models with binomial distribution, or Poisson distribution. To make matters concrete, consider the data in Example 3. Let yij be an observation of ith pup from jth pregnant rat. If the pup is normal then yij = 0, else yij = 1. Let fij be fixed part of the model, and rj be the random part, and πij be the expected value of the response for the ijth level 1 unit. A 2-level logistic regression model would have the form exp(fij + rj ) + εij (54) yij = πij + εij = 1 + exp(fij + rj ) fij = α + β1 x1ij + · · · + βp xpij , p εij = eij πij (1 − πij ) .
In general, εij follows a binomial distribution, but sometimes it is extrabinomial. The variance of εij can be written in the form of σ02 πij (1 − πij ). Here, σ02 is refered to as extra-binomial variance (or over dispersion). When σ02 = 1, it is purely binomial. We will assume σ02 = 1, and rj ∼ N (0, σ02 ) in this section. Let rij be the Pearson residual of the model yij − πrij rij = p . (55) πij (1 − πij )/nij
81
Statistical Methods for Dependent Data Table 14.
Results of fitting 3 2-level models for data in Example 3. Models
Parameter α β σ12 σ02 intra-unit correlation ρ −2 ln(L)
Logit
Probit
C log-log
1.127(0.3380) 1.028(0.5099) 1.212(0.5061)
0.6933(0.1929) 0.5655(0.2824) 0.3822(0.1578)
0.3463(0.1686) 0.4692(0.2383) 0.2779(0.1131)
1 0.1731 210.686
1 0.1734 211.610
1 0.1739 213.084
Then the intra-unit correlation can be defined as " Pnj Pnj Pnj 2 # , m Pnj 2 m X i=1 rij X r r − i=1 i=1 rij k=1 ij kj . ρ= nj (nj − 1) nj j=1 j=1
(56)
In contrast to GEE, 2-level logistic model decompose the residuals into each level. The residuals in the level 1 are linear to the response, while the residues in level 2 are nonlinear. The 2-level logistic model for Example 3 can be written as q exp(α + βGroup + rj ) + eij πij (1 − πij ) yij = 1 + exp(α + βGroup + rj ) where the subscripts of Group are omitted. The results are shown in Table 14. 6.6. Multilevel Probit model and complementary log-log model The expected proportion πij in (55) is modeled using a logit link function. If we use probit link function, a 2-level probit model would have the form yij = πij + εij = Φ(fij + rj ) + εij .
(57)
If we use complementary log-log link function, then a 2-level complementary log-log model would have the form yij = πij + εij = 1 − exp{− exp(fij + rj )} + εij . Other notations are similar to a level 2 logistic model in (54).
(58)
82
F. Chen
6.7. Example 16 We fitted 2-level logistic, probit, and complementary log-log models for data in Example 3 respectively. The results are shown in Table 14. The results show that the intra-unit correlations estimated from 3 models are quite similar. 6.8. Multilevel Poisson regression model For count data multilevel Poisson regression model would be fitted. For a 2-level model, it can be written as yij = mij + εij = exp(fij + rj ) + εij , fij = α + β1 x1ij + · · · + βp xpij .
(59)
We usually assume that εij follows a Poisson distribution with var(yij |mij ) = mij . But sometimes it is extra-Poisson with conditional variance of var(yij |mij ) = mij + km2ij . When k > 0, it is negative binomial distribution. When k = 0, it is purely Poisson. Here we keep k = 0, rj ∼ N (0, σ12 ). Let rij be Pearson residuals of the model as follow: rij =
yij − µij . √ µij
(60)
The definition of intra-unit correlation is similar to (55). 6.9. Example 17. Fitting a 2-level Poisson regression model for data in Example 8 The response is two-week seizure counts for epileptics and is a count data. The Poisson model is sufficed here. The results are shown in Table 15. The results show that the counts are correlated with age and time. No significance can be detected in test group and placebo group. But it is Table 15.
Estimated results of random effect model for Example 8.
Parameter Treat Trial Time Age Constant
Coefficient
SE
Z
P
−0.07606 0.19900 −0.05743 −0.01685 1.88100
0.27020 0.05859 0.02026 0.01788 0.55440
−0.28150 3.39648 −2.83465 −0.94239
0.7783 0.0007 0.0046 0.3460
83
Statistical Methods for Dependent Data
significantly different between the counts before and after medication. The intra-subject correlation estimated from the model is 0.7776. 6.10. Multilevel logistic models for multiple response categories In this section we extend the multilevel logistic model for binomial response to the cases of multiple categories and ordinal categories. When the response is multiple categories without order, a multilevel polytomous logistic model will be fitted. And when the response is ordinal, a multilevel ordinal logistic model will be fitted. For example, let’s consider a 2-level model with one explanatory variable. The response is now multiple with k categories. A multilevel polytomous logistic model can be defined as (s)
(s)
(s) πij
=
exp(β0 + β1 x1ij + u0j )
+ εij ,
(s)
(s)
1 + exp(β0 + β1 x1ij + u0j )
(61)
s = 1, 2, . . . , k. Under the standard assumption that the observed response proportions follow a multinomial distribution, the level 2 covariance matrix has the form (1) (1) ··· πij (1 − πij ) (2) (2) −π (1) π (2) πij (1 − πij ) ij ij −1 . nij (62) .. .. . . . . . (1) (k)
−πij πij
(2) (k)
−πij πij
(k)
(k)
πij (1 − πij )
···
If k categories are ordered, we should base our model upon the cumulative response probabilities rather than the responses probabilities for each category. The multilevel ordinal logistic model can be defined as (s)
(s)
γij = (s)
(s)
exp(β0 + β1 x1ij + u0j ) (s)
(s)
1 + exp(β0 + β1 x1ij + u0j )
,
(63)
where γij is cumulative probability for s = 1, 2, . . . , k. If we assume an underlying multinomial distribution for the category probabilities, the (s) (r) cumulative proportions have a covariance matrix given by πij (1−πij )/nij , (r < s).
84 Table 16.
F. Chen The results of a 3-level cumulative logistic model for data in Example 7. Parameter Fixed parameters Y1 Y2 Y3 Comp Treat Sex Age Time Random parameters Level 3 Level 2 Level 1
Estimation
SE
−1.482 0.9309 4.299 −0.03902 −1.187 0.5575 0.03687 −0.5268
0.8162 0.8089 0.8284 0.2977 0.3245 0.4685 0.02515 0.03924
1.001 1.586 1
0.5818 0.3197 0
6.11. Example 18. Analysis of data in Example 7 The effectiveness variable in Example 7 is an ordinal response, with 0 stands for invalid or worse effects of treatment on the subject, 1 for improved, 2 for notable improved, and 3 for recovery. A multilevel ordinal logistic model with cumulative odds was fitted. The results are shown in Table 16. Where, the centers are level 3 units, the subjects are level 2 units and repeated observations are level 1 units. 6.12. Relationship between intra-unit correlation and explanatory variable The key idea of multilevel model is to express the variance in each level by explanatory variables. In many applications, mean squared error is related to some explanatory variables. As a result, the intra-unit correlations are related to them, too. This issue would be resolved by adding the explanatory variables to the random part of multilevel models. 6.13. Example 19. Analysis of the data in Example 5 The data show that the variance of ESS is changing over time. Let G1 and G2 be the dummy variables of groups. We fit a 2-level model for the data in which the time variable is added into random part at level 1 of the model. The results are shown in Table 17. The results show that in the middle dose group (treat = 1) and the high dose group (treat = 2), the intra-subject correlation is 0.7956, which
85
Statistical Methods for Dependent Data Table 17.
The results of fitting a 2-level model for data in Example 5.
Fixed
Random Level 2 Level 1 Level 1 Level 1
Parameter
Estimate
SE
Cons Time Age G1 G2 G1∗ Time G2∗ Time
98.19 1.283 0.18 −8.356 −2.972 2.778 2.977
4.175 0.1285 0.1531 3.23 3.255 0.2771 0.1902
38.59 9.914 3.195 0.195
10.48 1.44 0.7093 0.2165
Cons/Cons Cons/Cons G1∗ Time/Cons G2∗ Time/Cons
is independent on time. But in the low dose (placebo) group (treat = 0), the intra-subject correlation coefficient depends upon time. The correlation of observations at T ime1 and T ime2 would be estimated by 38.59 p . (38.59 + 9.914 + 3.195 × T ime1 )(38.59 + 9.914 + 3.195 × T ime2 ) 6.14. Multivariate multilevel models So far, we have only considered a single response variable. In many applications, we wish simultaneously to model several responses functions of explanatory variables. In Example 4, AU C, Cmax and T50 will be considered together as responses to test the bioequivalence of domestic and imported rosiglitazone maleate tablets (RMT). This goal could be achieved by fitting a multivariate multilevel model. For the sake of convenience, we consider the multivariate multilevel model with two response, the logarithmic values of AU C (also denoted AU C) and Cmax , and treat the subject as a subject-level unit and 4 treatment effects (observations repeated measured on subjects) as period-level units which are clustered in subject-level. Besides intra-subject correlation, other properties of this model should be considered: observations of AU C are correlated between different periods of trial, and so do Cmax ; and AU C and Cmax are correlated either in the same period or in different periods. The results shown that: the intra-subject correlation of AU C is 0.008301/(0.008301 + 0.011595) = 0.4172 .
86
F. Chen
Table 18.
The results fitting multivariate multilevel model for data in Example 4.
Fixed effects
Random effects Subject level
Period level
Parameters
Coefficient
Cons AU C Cons Cmax Treat AU C Treat Cmax Drug AU C Drug Cmax
6.82577 5.13547 0.03252 0.03563 0.75844 0.73093
0.02661 0.02550 0.02198 0.02099 0.02198 0.02099
0.008301 0.003883 0.007678 0.011595 0.003444 0.010572
0.003274 0.002426 0.003017 0.001936 0.001369 0.001765
Cons Cons Cons Cons Cons Cons
AU C/Cons Cmax /Cons Cmax /Cons AU C/Cons Cmax /Cons Cmax /Cons
AU C AU C Cmax AU C AU C Cmax
SE
Intra-subject correlation
1 0.486 1 1 0.311 1
The intra-subject correlations of Cmax is 0.007678/(0.007678 + 0.010572) = 0.4207 . The Pearson correlation of AU C and Cmax is 0.311 in level 1, and 0.486 in level 2. 7. Sampling Distribution and Confidence Interval of Intra-unit Correlation 7.1. Confidence interval of intra-unit correlation The intra-unit correlation coefficient estimated by a generalized estimation equation or a multilevel model is a point estimation. But we did not estimate its estimation errors and had little ideas of its sampling distribution. The bootstrap may be applied to estimate the CI of intra-unit correlation.19 Bootstrap is a data-based simulation method for statistical inference, which can be used to study the variability of estimated characteristics of the probability distribution of a set of observations, and provide confidence intervals for parameters and hypothesis test in situations where these are difficult or impossible to derive closed form formulas. The basic idea of the procedure involves sampling with replacement to produce random samples of size n from original data, each of these is known as a bootstrap sample and each provides an estimate θ(b) of the interesting parameter, θ. Repeating the process a large number of times, say B = 500 or more,
Statistical Methods for Dependent Data
87
provides the required information on the variability of the estimator. For example, the type of distribution, standard error of the bootstrap estimates. An approximate 95% confidence interval can be derived from mean ±1.96 SD if the bootstrap estimates are normally distributed, and from the 2.5% and 97.5% quartiles of the replicate values if the bootstrap estimates is not normally distributed. The confidence interval derived from bootstrap sampling is known as bootstrap confidence interval. If the population distribution is known, bootstrap samples can be randomly samped not from the original data, but from the population distribution. The former produce is known as non-parametric bootstrap, and the latter as parametric bootstrap. Research shows that there are two particularities in applying the bootstrap estimation to data of dependent design.20 First, it is not proper to adopt a parametric estimation because of the difficulty in making a judgment to the distribution of the data. So we suggest adopting a nonparametric estimation. Second, it is not proper to apply a random sample directly to observations because of the non-independence of observations. So we suggest that we sample high level units. And if some high level unit is sampled, all the observations in this unit will be sampled. As examples, data of Example 2 should be sampled by family; data of Example 3 should be sampled by litter; and data of Examples 1, 4 and 5 should be sampled by patient. The estimator of intra-family correlation of Example 2 is 0.5674. If we sample the data on families, make 500 resamplings, and estimated by GEE, then the non-parametric 95% CI of intra-family correlation is 0.2875–0.8874. If we sample the data of Example 8 on patients and make the same analysis, we estimated the non-parametric 95% CI of correlation 0.5219–0.8874. Both of the two bootstrap sampling distributions of intraunit correlation are skew. Because both of the two CIs do not include 0, we may accept that the intra-subject correlations exist.
7.2. The sampling distribution of intra-unit correlation To estimate the type and characteristic of distribution of intra-unit correlation, we use the Monte Carlo method. One thousand simulations were generated from specific population with known ρ and corresponding assumed parameters. For each set of simulated data, we fit a 2-level model, estimate σ02 , σe2 , and then the intra-unit correlation. We then investigate the distribution of intra-unit correlation based on 1000 estimators of ρ.
88
F. Chen
We may assume that there are m individuals (two-level unit) and each individual have k repeatedly measured values (one-level unit), then we have n = m × k observations. In order to investigate the effect of units of level 1 and units of level 2 on the intra-unit correlation when overall sample size of observations are the same, we design the grids are (k, m) = (4, 10), (4, 20), (4, 30), . . . , (4, 100) and (k, m) = (8, 5), (8, 10), (8, 15), . . . , (8, 50) respectively, and the intra-unit correlation coefficients is 0.1–0.9 respectively. Without losing generality, in analog investigation, we do not take fixed but random effect into account, because the intra-unit correlation is related only to random effect. Furthermore, we assume that the intra-unit correlation structure is exchangeable. Now we may consider two situations. One is the simplest situation yij = µj + eij .
(64)
Only one random effect is considered both in levels 1 and level 2. Then the variance of y is σ02 + σe2 ; The variance-covariance matrix of y is V = diag(R, R, . . . , R). If k is 4, 2 σ0 + σe2 σ02 σ02 + σe2 , (65) R= 2 2 2 2 σ0 + σe σ0 σ0 σ02 σ02 σ02 σ02 + σe2 the intra-unit correlation can be calculated by Eq. (48). Another situation is more complex yij = µj + νj xij + eij .
(66)
Level 1 has two random effect terms: one is random error. Another random effect term is related to independent variable, namely the variance of y, σ02 + σ12 (xij )2 + σe2 , and is affected by explanatory variable. Let xij = j − 1, the variance-covariance matrix of y is V = diag(R, R, . . . , R), when k = 4, 2 σ0 + σe2 σ02 σ02 + σ12 + σe2 . R= 2 2 2 2 2 σ0 σ0 + 4σ1 + σe σ0 2 2 2 2 2 2 σ0 σ0 σ0 σ0 + 9σ1 + σe (67)
Statistical Methods for Dependent Data
89
We may calculate intra-unit correlation by Eq. (48) after deducting the effect of explanatory variable to y. The parameters of the model can be estimated by the Restricted Iterative Generalized Least Square (RIGLS) method. The simulation study is made by using specialized multilevel model software M Ln.17 Table 19 lists the simulated results of six models. Models A–D are generated based on Eq. (64), and their amounts of units of levels 1 and 2 are model A with (k, m) = (4, 10), model B with (k, m) = (8, 5), model C with (k, m) = (4, 100), model D with (k, m) = (8, 50), respectively. The overall sample size of model A is equal to that of model B, while the overall sample size of model C is equal to that of model D. Models E and F are generated based on Eq. (66), and their amounts of units are respectively model E with (k, m) = (4, 100) and model F with (k, m) = (8, 50). The result shows that the type of distribution is related to the value of intra-unit correlation. And the distribution of intra-unit correlation of these models indicates that when ρ = 0.5, its distribution is symmetrical and resembles the normal distribution; when ρ > 0.5, its distribution is positively skew; and when ρ < 0.5, negatively skew, just as Fig. 3 shows. In one model, the estimated error is larger when ρ approaches 0.5, and becomes smaller gradually as ρ approaches 0.1 or 0.9. The mean intra-unit correlation coefficients of model C is closer to the theoretical value than that of model A, and its standard error is smaller. Similarly, the mean intra-unit correlation coefficient of model D is closer to theoretical value than that of model B, and its standard error is also smaller. Therefore, the larger the sample size, the better the effect of estimation. In comparisons of model A with B, model C with D and model E with F respectively, for which each pair has the same sample size, the estimation of model B is not as good as A, model D not as good as C and model F not as good as E except that ρ = 0.1 or 0.2. This is because the amount of two-level units is small while that of one-level units is large. In fact, because of the presence of intra-unit correlation, the amount of information is overlapped. As an example, the amount of information obtained by measuring k times repeatedly to the same individual is smaller than that obtained by measuring once to k individuals. If we compare model E with C and model F with D respectively, their overall sample sizes, the amount of one-level units and two-level units are equal respectively. But models E and F have larger estimated error because the variance terms of models E and F are more complex.
90
Table 19. Theoretical values
0.1235 0.2017 0.2855 0.3787 0.4595 0.5635 0.6663 0.7647 0.8786
± ± ± ± ± ± ± ± ±
0.1279 0.1485 0.1633 0.1725 0.1749 0.1608 0.1379 0.1146 0.0734
Model B k = 8, m = 5 0.1057 0.1786 0.2666 0.3428 0.4288 0.5207 0.6232 0.7278 0.8492
± ± ± ± ± ± ± ± ±
0.1146 0.1467 0.1722 0.1884 0.2054 0.2042 0.1928 0.1702 0.1216
Model C k = 4, m = 100 0.0980 0.1987 0.3026 0.3973 0.4971 0.5964 0.6967 0.7974 0.8985
± ± ± ± ± ± ± ± ±
0.0471 0.0540 0.0534 0.0560 0.0534 0.0463 0.0389 0.0287 0.0153
Model D k = 8, m = 50 0.0965 0.1981 0.2988 0.3936 0.4942 0.5931 0.6947 0.7935 0.8965
± ± ± ± ± ± ± ± ±
0.0411 0.0522 0.0572 0.0616 0.0609 0.0570 0.0481 0.0375 0.0213
Model E k = 4, m = 100 0.1150 0.2019 0.2978 0.3995 0.4995 0.5965 0.6974 0.7986 0.8912
± ± ± ± ± ± ± ± ±
0.0933 0.1067 0.1081 0.1123 0.1078 0.1018 0.0957 0.0865 0.0727
Model F k = 8, m = 50 0.1121 0.2075 0.2963 0.3934 0.4981 0.5920 0.6891 0.8042 0.8891
± ± ± ± ± ± ± ± ±
0.1178 0.1450 0.1635 0.1599 0.1563 0.1507 0.1435 0.1211 0.0957
F. Chen
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Model A k = 4, m = 10
500 simulated results of intra-unit correlation of 6 populations.
91
Statistical Methods for Dependent Data
ρ = 0.1
ρ = 0.2
ρ = 0.3
ρ = 0.4
ρ = 0.5
ρ = 0.6
ρ = 0.7
ρ = 0.8
ρ = 0.9
0.5 0.4 0.3 0.2 0.1 0
0.5 0.4 0.3 0.2 0.1 0
0.5 0.4 0.3 0.2 0.1 0 0
0.2
Fig. 3.
0.4
0.6
0.8
1
0
0.2
0.4
0.6
0.8
1
0
0.2
0.4
0.6
0.8
1
Sampling distibution of the intra-unit correlations base on model C.
The statistical simulation shows that the estimator of ρ is little smaller than the theoretical value. But the larger the sample size is, the closer the estimator is to its theoretical value. And when given the same overall sample size, the larger the amount of level 2 units is (the smaller the amount of corresponding level 1 unit is), the closer the estimated value is to the theoretical value. If ρ is close to 0.5, the sampling distribution of intra-unit correlation is approximately normal distribution. As ρ approaches 0, or approaches 1, the sampling error is becoming smaller and smaller. When ρ is close to 0, the distribution is positively skew. And when ρ is close to 1, the distribution is negatively skew. Theoretically, as Goldstein (1998) pointed out, the estimators obtained by IGLS is biased, while that obtained by RIGLS is unbiased. But the simulated results show that estimators of ρ obtained by RIGLS are somewhat smaller than the theoretical values. And the smaller the sample size is, the further the estimated value is away from its theoretical value. When given the same overall sample size, the smaller the amount of level 2 unit is (the larger the amount of corresponding level 1 unit is, of course), the further the estimated value is biased. The sampling error of intra-unit correlation is also related to the amount of units of every level. When given the same overall sample size, the larger
92
F. Chen
the amount of level 2 units (the smaller the amount of corresponding level 1 unit is), the smaller the estimated error. The sampling error is also related to how complex the variance of responding variable is: the larger the variance, the larger the sampling error. This section focuses only on the situation when responding variable is numeric, that is to say that data should be distributed normally. Further investigation is needed, especially with regard to skew distributions, such as binominal and Poisson distributions. 8. Sample Size and the Cost-effect of Dependent Test This section will take the repeated measurement (sampling) as an example to discuss sample size and power of hypothesis testing and cost-effect for dependent data. Because of the overlap of information, the dependent data tells us less than independent data given the same sample size, which leads to a low power. And the larger the relationship in groups is, the less information the data offers and the lower the power shows. 8.1. Sample size and power of test Let Yijg represent the jth observation of the ith subject in the gth group (i = 1, . . . , m, j = 1, . . . , k; g = 0, 1). We also assume the individuals are independent to each other, and the intra-subject correlations are equal. If the type I error is α and the power is 1 − β, the sample size of each group can be estimated by the equation below: σ 2 (Zα + Zβ )2 . (68) kδ 2 Where δ is the difference of effects of the two groups (g = 0 and g = 1). It is oblivious that the number of observations m needed in this design is smaller, while the overall number of observations n = mk is larger than those of the independent design. And when ρ = 0, it is equal to the sample size of independent design. The table below is the result of a simulated experiment on the power of a group of repeated measurements. The intra-unit correlation is 0, 0.1, 0.2, . . . , 0.9, respectively; the sample size m and the times of repeated measures k are (50, 4), (100, 2), (20, 4), (40, 2), respectively; And δ are 0, 0.2, 0.4, 0.6, 0.8, 1.0, respectively. All designs were balanced. Based on each grid, 1000 simulations were generated by using the M Ln package.17 For each set of simulated data, we fit a multilevel model. The power is then m = [1 + (k − 1)ρ]
93
Statistical Methods for Dependent Data
estimated by the proportion of times of the rejections of null hypothesis in 1000 simulations. The results corresponding to δ = 0 is type I error, while that to ρ = 0 is the power of the independent data. From the Table 20, we can conclude that the power decreases as the intra-unit correlation within group increases. And unless δ is large enough, the extent of the decrease is large. For example, when the repeated times are the same, the power of the design with m = 50, k = 4 and ρ = 0.9 is only half of that with n = 200 and ρ = 0. When the repeated times are equal, the power has a tendency to increase as m increases; And when n = mk are equal, the power of the design with twice measured is larger than that with 4 repeated times. Table 20. Power of repeated measurement (times of the rejection to null hypothesis in 1000 simulations). m = 50, k = 4
m = 100, k = 2
δ
δ
ρ
0
0.2
0.4
0.6
0.8
1.0
0
0.2
0.4
0.6
0.8
1.0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
44 55 68 57 60 56 66 56 62 58
234 232 220 169 181 157 143 138 151 117
789 689 598 557 474 447 393 357 331 342
984 961 915 854 820 756 718 692 647 595
1000 998 995 975 961 936 931 889 860 829
1000 1000 999 999 998 990 983 971 970 952
54 43 55 52 57 43 62 54 62 58
272 268 279 226 233 205 119 208 190 168
781 757 709 692 661 645 600 593 564 519
989 986 979 963 957 942 916 902 883 875
1000 999 999 999 996 999 989 992 981 988
1000 1000 1000 1000 1000 999 1000 1000 999 1000
m = 20, k = 4
m = 40, k = 2
δ
δ
ρ
0
0.2
0.4
0.6
0.8
1.0
0
0.2
0.4
0.6
0.8
1.0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
59 56 63 80 58 65 69 71 59 69
133 125 119 108 106 110 109 94 102 103
384 336 302 279 258 228 175 194 193 179
746 631 573 488 464 440 372 361 333 313
927 870 812 744 682 626 535 546 492 490
992 967 939 891 849 805 767 731 696 692
36 44 62 69 61 53 57 48 61 69
139 143 141 140 129 113 130 103 109 101
412 380 375 356 344 296 318 301 288 263
759 708 678 634 592 612 577 524 528 490
932 916 890 887 859 825 787 780 759 729
997 993 928 974 967 959 950 927 908 898
94
F. Chen
8.2. Cost-effect analysis The estimations of design efficiency and sample size are important considerations during the experiment design. Researchers always have to balance among design efficiency, sample size and cost-benefit before making a decision. For example, a physiological experiment uses several rats’ liver cells. Researchers may sample only once (single sample test, independent) or several times (repeated sample test, dependent) on each rat. The latter needs fewer rats than the former, which means the latter costs less. But data from the latter are dependent while those of the former are independent. So, the problem researchers confront is that the test should not only cost litter, but also achieve enough power of test. When we discussed the estimation of sample size in the last section, we did not consider the cost. But the funds are limited in practice. So it is related to cost-benefit problems. On one hand, given restricted funds (the cost is constant), we should consider whether to select single sample or repeated sample to make the effect as large as possible (the variance is minimum). On the other hand, when the benefit is constant (the variance is restricted), we should consider whether to sample independently or repeatedly to make the cost the least. 8.2.1. When the cost is constant, how to evaluate the benefits of independently sampling design and repeated sampling design? In a repeated sampling design, individual is independent with each other. We can assume the average elemental cost of each individual is C1 , the average direct cost of sampling once to each individual is C2 , the variation among individualities and repeated sample measures is σ2 , and the intrasubject correlation of samples from the same subject is ρ. Let the overall cost be C, the individual numbers (or pairs) needed for repeated sampling design is m, and the times of repeated sample to each individual is k, then C = m(C1 + kC2 ) .
(69)
It is not difficult to find: σ2 [1 + (k + 1)ρ] . (70) mk When the restricted overall cost is C, to make var(Y¯ ) as little as possible (equivalent to making the power as large as possible), the optimal number var(Y¯ ) =
Statistical Methods for Dependent Data
95
of individuals m and the times of repeated sample k to each subject are the solutions of conditional minimum of function (70) restricted by Eq. (69). Let f (m, k) = σ 2 [1 + (k − 1)ρ]/(mk) + λ(mC1 + kmC2 − C), then −σ 2 (1 − ρ) ∂f (m, k) + λmC2 = 0 = ∂k mk 2 (71) 2 ∂f (m, k) = −σ [1 + (k − 1)ρ] + λ(C1 + kC2 ) = 0 . ∂m mk 2 That is s σ2 ρ m = λC1 (72) r (1 − ρ)C 1 k = . ρC2
Substituting m and k in (72) for their corresponding terms in (69), because C is a specific value, the optimal number of individuals is p √ √ C C1 ρ[ C1 ρ − C2 (1 − ρ)] . (73) m= C1 [C1 ρ − C2 (1 − ρ)]
So the minimum variance is
p p var(Y¯ ) = σ 2 ( ρC1 + (1 − ρ)C2 )2 /C .
(74)
N = C/C1 .
(75)
If the sample size of independent sampling design is N , the overall cost and sample error are C = N C1 and var(Y¯ ) = σ 2 /N respectively. And if the restricted overall cost is C, to make var(Y¯ ) as little as possible (make the power as large as possible), the optimal number of subjects m is the solution of conditional minimum of function var(Y¯ ) = σ 2 /N restricted by equation C = N C1 . Let g(N ) = σ 2 /N + λN C1 , then
And the minimum variance of independent sample is var(Y¯ ) = σ 2 C1 /C .
(76)
So given restricted overall funds C, whether to sample independently or repeatedly depends on the value of sample error, which means to work out when Eqs. (76) and (74) will have minimum values, when 2 C1 − C2 ρ< . (77) C1 + C2
96
F. Chen
The repeated sample design can result in a minimum sample error (the power is the largest and the effect is better). Otherwise, it would be better to choose independent sampling design. 8.2.2. When the benefit is constant, how to compare the cost of independent sample with that of repeated sample? We should follow the method in the last section to make the overall cost C as little as possible when var(Y¯ ) = V is constant. The optimal individual number of repeated sampling design m and the optimal sample times of each subject k can be worked out by the equations below, respectively: p √ √ σ 2 C1 ρ[ C1 ρ + C2 (1 − ρ)] , m = C1 V (78) r C1 (1 − ρ) k = . C2 ρ The minimum overall cost of repeated sampling design is p p C = m(C1 + kC2 ) = σ 2 [ C1 ρ + C2 (1 − ρ)]2 /V ,
(79)
The optimal individual number of independent sampling design is N = σ 2 /V .
(80)
And the minimum overall cost of independent sample is C = N C1 = σ 2 C1 /V .
(81)
So under the condition of restricted sample error (the same benefit), should we select the independent sampling design or the repeated sampling design? This depends on the overall cost of the sample. We should compare when Eqs. (79) and (82) will have their minimum values. And only if the intra-subject correlation ρ meets the need of Eq. (77) can repeated sampling design make the sample cost as little as possible. Otherwise we’d better use an independent sampling design. 8.3. Example 20. The cost benefit problems of rat’s test data Physiology Laboratory, Nantong Medical College, had finished a test that needed four rats. Four sets of single spleen T cells turbid liquid were prepared for each rat by normal methods. Then researcher mixed ConA with
Statistical Methods for Dependent Data
97
each liquid and measured OD. From pre-experiments or experiences, they estimated that σ 2 = 0.00054844 and ρ = 0.52895 . Then if we restricted sampling overall cost C or sample error var(Y¯ ), should we select an independently sampling design or a repeated sampling design? This is related to the average elemental cost of each rat C1 , the average direct cost of repeated sampling once to each rat C2 and the intrasubject coefficient of repeated measurement. We assume that each rat costs C1 = 20 yuan, each portion (1 ml) of medium and 0.1 ml calf serum costs C2 = 0.12 yuan. Because 2 2 20 − 0.12 C1 − C2 = 0.97629 , = ρ = 0.52895 < C1 + C2 20 + 0.12 this case meets the need of Eq. (77). Thus, it is wise to do repeated sampling instead of an independent sampling. From Eq. (72), we known that the repeated sampling times of each rat k is 13. If the restricted overall cost C = 110 yuan, the repeated sampling design needs 5 rats and the minimum sampling error is var(Y¯ ) = 0.000062 from Eqs. (73) and (74). If the restricted sampling error is var(Y¯ ) = 0.000052, the repeated sampling design needs m = 6 rats and the minimum sampling cost is C = 130 yuan from Eqs. (78) and (82). In this section we focus on the power of the repeated sample design in one group, the estimation of sample size and some problems about cost-effect. The principles of analysis can also be applied to repeated measurement data of grouped design and longitudinal data, etc. When estimating the power and sample size of the repeated sampling, we should take full advantages of the prior information to specify the values of variation among individuals and repeated samples, the value of intrasubject correlation coefficient and the values of acceptable error because of the affection these values have on the estimation of sample size. If there is not enough prior information, it is better to obtain it through pilot studies. And the importance of types I and II errors should be determined according to damages caused by the respectively wrong decisions. There are two other design methods similar to repeated sampling design. One of them is the multiple repeated measures, which can improve the precision of measurements, and reflect whether the measured results have stability, namely reliablity. And the degree of reliability can be represented
98
F. Chen
by constructed validity. The intra-subject correlation among repeated measures of these data is always low and always has nothing to do with the covariates. The another one is the regular or irregular follow-up in a longitudinal study, such as the follow up studies of kid’s growth and development and the metabolism of some kind of drug, etc., in which we are interested in the occurring, developing, or law of variation of an event. The intrasubject correlation of these data is always related with the interval of the follow up. However, repeated sampling is sampling from the same subject. These samples always have a low intra-subject correlation and are related to some covariates. Though in several literatures they are all refered to as repeated measurement and have similar methods of processing and analyzing, they have their own particular emphases. So the structures of covariances matrix of response variables are different. But to applied researchers, more emphases should be laid on the distinctions of different designs.
References 1. Cochran, W. G. (1977). Sampling Techniques. John Wiley and Son. Inc. New York. 2. Willams, D. A. (1975). The analysis of binary responses from toxicological experiments involving reproduction and teralogenicity. Biometrics 31: 941–952. 3. Diggle, P. J., Liang, K. Y. and Zeger, S. T. (1995). Analysis of longitudinal data. Clarendon Press. Oxford. 4. Hooton, T. M., Scholes, D., Hughes, J. P. et al. (1996). A prospective study of risk factors for symptomatic urinary tract infection in young women. NEJM 335: 468–474. 5. Gabriel, S. E., Woods, J. E. and O’Fallon, W. M. et al. (1997). Complications leads to surgery after breast implantation. NEJM 336: 677–682. 6. McAlindon, T. E., Felson, D. T. and Zhang, Y. et al. (1996). Relation of dietary intake and serum levels of vitamin D to progression of osteoarthritis of the knee among participants in the Framingham study. Ann. Int. Med. 125: 353–359. 7. Chen, F., Ren, S.-Q. and Lu, S. Z. (1998). On intra-correlations of dependent data. Modern Prevent. Med. 25: 269–271. 8. Ren, S.-Q. and Chenfeng (1998). Expression for covariance structure for dependent data. Chinese Health Statist. 15(4): 4–8. 9. Prescott, R. and Brown, H. (1995). Mixed Models Analysis of Clinical Trials Using SAS : Proc Mixed and Beyond. Course Presenters. 10. Liang, K. Y. and Zeger, S. T. (1986). Longitudinal data analysis using generalized linear models. Biometrika 73(1): 13. 11. Chen, Q.-G. (1995). GEE analysis for repeated measurement in longitudinal study. Chinese Health Statist. 12(1): 22.
Statistical Methods for Dependent Data
99
12. Xiong, L.-P., Cao, X.-T. and Xu, Y.-Y. et al. (1999). Log linear model for longitudinal data. Chinese Health Statist. 16(2): 68. 13. Chen, F., Ren, S.-Q. and Lu, S.-Z. (1999). Intra-unit correlation of dependent data and GEE. Acta Academia Med. Nantong 19(6): 359–362. 14. Goldstein, H. (1986). Multilevel mixed linear model analysis using iterative generalized least square. Biometrika 73: 43–56. 15. Goldstein, H. (1989). Restricted unbiased iterative generalised least squares estimation. Biometrika 76: 622–623. 16. Goldstein, H. (1995) Multilevel Statistics Models, 2nd edn., Chapman, London. 17. Rasbash, J. and Woodhouse, G. (1995). MLn Command Reference, Institute of Education. London. 18. Goldstein, H. (1991). Nonlinear multilevel models with an application to discrete response data. Biometrika 78: 45–51. 19. Chen, F., Lu, S.-Z. and Yang, M. (1997). Bootstrap estimation and its apolication. Chinese Health Statist. 14(5): 5–8. 20. Ren, S.-Q. (1999). The statistics methods for non-independent data. PhD dissertation. West China University of Medical Sciences.
About the Author Feng Chen obtained his BS in Mathematics (1983) from Sun Yat-Sen University; MS in Biostatistics (1989) from Shanghai Second Medical University, and PhD in Medical Statistics (1994) from West Chinese Medical University. Dr. Chen was promoted to full professor in Nantong Medical University in 2000. He is the chairman of the project “Study on statistics methods for non-independent data” which is supported by the National Funds of Natural Sciences of China. His study visit in London University in 1996 was supported by the Royal Society. He is a member of review committee for medicines by the State Drug Administration (SDA) of China, Associate Editor of Chinese Health Statistics, and Vice Chairman of Statistical Theory and Methods Subcommittee of the Chinese Health Statistics Association. His main research interests are biostatistical models and statistical methods for dependent data.
This page intentionally left blank
CHAPTER 4
STATISTICS USED IN QUALITY CONTROL, QUALITY ASSURANCE, AND QUALITY IMPROVEMENT IN RADIOLOGICAL STUDIES YING LU and SHOUJUN ZHAO Department of Radiology, University of California, San Francisco, 3333 California Street, Suite 375, San Francisco, CA 94118, USA Tel: 415-502-4596; [email protected]
1. Introduction Quality control, quality assurance, and quality improvement in medical studies are active and large topics. From 1995–2000, there were more than 40,000 articles in MEDLINE database that had key words of at least one of these three terms. Quality has many connotations. The term “total quality management” (TQM) is given to an approach that related to the daily functioning of medical practices or medical research processes. All participated personnel and operational aspects are involved. Quality control is a very limited function that “controls” the product, primarily by testing, while quality assurance regulates the systems and methods for “assuring” the quality of the product.1 Every aspect of medical practice and research requires quality control and quality assurance. Although the statistical principles presented here can apply to other fields such as laboratory medicine, etc. this chapter is limited to quality control and quality assurance specifically in radiology. There are several reasons for this focus. First, this is the field in which the authors have the most experience. Second, radiology evaluation relies on radiological equipment, whether X-ray, ultrasound, CT, or MRI machines. As with all machinery, products of different manufacturers vary in quality. Over time, machine may draft and age can affect performance. Furthermore, precision errors are always to be expected in any radiological equipments or technique; even when the same patient is scanned under identical conditions the results will be different. Last but not least, many 101
102
Y. Lu & S. Zhao
radiological assessments are based on the experience of reader and are relatively subjective. It is common to have different readers to give different interpretations of the same image. Therefore, many factors will affect the results of radiological assessments. The statistical principles discussed here can resolve the conflicts among results from different devices and improve interpretation of the results. Radiology has been used to help decisions in disease diagnosis and management of patients. Its use as tools for population screening and for drug development is increasing. The newly developed response evaluation criteria in solid tumors (RECIST) uses changes in unidimensional CT measurement of tumor lesions to define the treatment response rates.2 Osteoporosis is defined by bone mineral density (BMD) measured by dual X-ray absorptiometry (DXA) scans3 and osteoporosis prevention drugs are assessed according to their effect on BMD.4 In fact, medical imaging has been used as surrogate endpoint or biomarkers in many therapeutic and diagnostic clinical trials, and radiologists are increasingly involved in these clinical trials. Good Clinical Practice (GCP) is an international quality standard for the design, conduct, recording, and reporting of clinical trials with human subjects. GCP guidelines not only provide a framework for protecting the rights of participating patients or volunteers, they also set standards to safeguard the integrity of data that are used to evaluate treatment efficacy and submitted to regulatory agencies.5 In radiology, GCP includes training documents and standard operating procedures, imaging device quality control, image acquisition protocols, software validation, record keeping, and reporting, etc.6 Obviously, this is not only a statistical process. Successful quality control and quality assurance require good leadership from department chairs or principal investigators and, importantly, a team of multi-disciplinary experts. The expert team should always include a statistician. Statisticians are important in planning quality control, including determining appropriate sampling to avoid bias in selecting test samples, calculating the sample sizes, analyzing results to identify deficiencies, planning the processing control charts for monitoring machine performance, reassessing the results of quality improvement, and in reporting data and study results. There are many aspects of quality control and quality assurance that are not directly related to statistics.7–9 This chapter presents some statistical tools used in radiological or osteoporosis research based on the experience of the authors. It is beyond the scope of this book to
Statistics Used in Radiological Studies
103
present a complete picture of quality control and quality assurance for all radiological studies. This chapter is organized into 5 sections. In the next section, we introduce definitions of different measurement errors for continuous radiological results and different ways to evaluate these errors. In Sec. 3, we present applications of process control-charts to monitoring measurement errors over time. In Sec. 4, we review the statistics of measurement agreement. In Sec. 5, we discuss the calibration problem. 2. Measurement Errors Radiological techniques are used to measure physical or mechanical properties that relate to disease status or progression. We use statistical techniques or procedures to transform our observations of a variable of interest into a particular category or number. This is the measurement process. For a categorical variable, we try to assign a subject into a particular, unambiguous category, as in the assessment of treatment response of solid tumors2 or evaluation of spine fracture severity.10 In other cases, we derive a numerical value that reflects the underlying physical quantity, such as tumor volume, bone mineral content or density, etc. Measurement errors describe the limits of a quantitative or qualitative assessment of a disease using a particular technique or procedure. Measurement errors have many sources. This section focuses on 2 types of measurement errors — precision and accuracy — and their applications to the diagnosis of osteoporosis and monitoring changes in bone status. The implications of precision on monitoring changes are emphasized, including the concepts of standardized precision, longitudinal sensitivity, and their applications to patient measurements and quality assurance, i.e. the monitoring of machine performance. 2.1. Measurement errors in radiological instruments Many sources of errors can affect the measurement and cause varying results, even when they are from the same region of interest in the same subject. Some of these variations can be controlled to minimize their impact. Some of the error sources are — in part — uncontrollable. Controllable variations are called fixed factors. Our interests, however, are usually on the uncontrollable random variations. Errors of measurement are the differences between observed values recorded under identical conditions and a fixed true value. In osteoporosis
104
Y. Lu & S. Zhao
studies, we always assume there are true quantities for densitometry parameters for each measured subject, even though we don’t always know their values. Measurement errors should be random in nature and can be attributed to two different sources: accuracy errors and precision errors.11 2.1.1. Accuracy errors Accuracy errors here are used as equivalent to the term bias. They reflect the degree to which the measured results deviate from the true values. To evaluate accuracy errors, we need to know the true values of the measured parameters. It is not always possible, however, to measure the accuracy errors because sometimes the true values of the measured parameters cannot be verified. For example, quantitative ultrasound (QUS) bone measurements are affected by a number of quantitative and qualitative factors, and there is no single correlate for any QUS measurement. Therefore, we cannot define a single accuracy error for QUS.12 For clinical applications only the part of the accuracy error that varies from patient to patient in an unknown fashion is relevant. The other part, i.e. the one that is constant, can be averaged across subjects e.g. the average underestimation of bone density due to the average fat content of bone marrow in Quantitative Computed Tomography (QCT), can be ignored. There are two reasons: First, for diagnostic uses, the reference data will be affected by the same error so the difference between healthy and diseased subjects is constant. Second, the error is present at both baseline and follow-up measurements, and does not contribute to measured changes. Therefore, when discussing the impact of accuracy errors only that part of the error that changes from patient to patient in an unknown and uncontrollable fashion is of interest.13 For this reason, small accuracy errors are of little clinical significance provided they remain constant.14 In general they are more relevant to diagnosis and risk assessment than to monitoring. 2.1.2. Precision errors They reflect the reproducibility of the technique. They measure the ability of a method to reproducibly measure a parameter for the purpose of reliably monitoring changes in bone status over time. Precision errors can be further separated into short-term and long-term precision errors. Short-term precision errors characterize the reproducibility of a technique and are useful for describing the limitations of measuring changes in skeletal status. If they are large they may affect the diagnostic sensitivity of a
Statistics Used in Radiological Studies
105
technique. Long-term precision errors are used to evaluate instrument stability. Because long-term precision errors include additional sources of random variation attributable to small drifts in instrumental calibration, variations in patient characteristics, and other technical changes related to time, they provide a better measure of a technique’s ability to monitor parameter changes than the short-term precision errors do. For patient measurements, estimates of long-term precision usually also include true longitudinal variability of skeletal status. For both of these reasons longterm precision errors normally are larger than short-term errors. While precision errors are easy to define, there are many ways to describe them depending on the purpose at hand, and there is no universal consensus on which definition is most appropriate. Mathematically, let θ be the theoretical true value in which we are interested, and let X be the observed value. The difference of ξ = X − θ is the measurement error. Furthermore, if X follows a normal distribution
Good Accuracy and Precision Good for Diagnosis and Monitoring
Good Accuracy and Poor Precision Unacceptable for Monitoring
Fig. 1.
Poor Accuracy and Good Precision Acceptable for Monitoring
Poor Accuracy and Precision Unacceptable for Diagnosis and Monitoring
Precision and accuracy.
106
Y. Lu & S. Zhao
N (µ, σ 2 ), the accuracy error is µ − θ and precision error is σ. Here, θ is considered a gold standard. Figure 1 illustrates the differences between precision and accuracy errors. If an archer consistently hits the target board close to the bull’seye, but with the arrows spread out around it, it is good accuracy but poor precision. If the archer consistently hits the board far off the bull’s-eye, but with all of the arrows in approximately the same location, it is poor accuracy but good precision. 2.2. Absolute precision errors Although there are many different ways to describe precision errors, they can be classified as absolute or relative. For the following descriptions of precision errors, we introduce some notations. Let Xi,j be the quantitative results (such as BMD) of the jth measurement for the ith individual, i = 1, . . . , m and j = 1, . . . , ni . Because individual subjects have different underlying true values due to biological variation, it is necessary to measure individual subjects repeatedly to evaluate precision errors. We use ni to denote the total number of measurements for the ith individual. The standard deviation (SD) of bone densitometry parameters from an individual subject i as a measure of short-term reproducibility is defined as the ¯i. average distance of individual Xi,j to the mean value for that subject, X Mathematically, it is the sample standard deviation: v uX u ni ¯ i )2 /(ni − 1) . (1) SDi = t (Xi,j − X j=1
Individual precision may vary. To estimate the reproducibility of a parameter in clinical use, we need to measure a representative set of individuals and combine their individual precision errors using the rootmean-square average of individual SD values (RMS SD) or in other words, within the mean squared errors in Analysis of Variance terms. Mathematically, sP sP m Pni m 2 ¯ i )2 (X − X i,j i=1 i=1 (ni − 1)SDj Pmj=1 P = , (2) RMS SD = m i=1 (ni − 1) i=1 (ni − 1)
where m is the number of subjects measured for precision evaluation. When q each subject has the same number of measurements, the RMS Pm 2 SD = i=1 SDi /m.
Statistics Used in Radiological Studies
107
With long-term precision, the underlying parameter can change for individual subjects over time. Therefore, instead of measuring the distance from the observed individual values to the mean of the individual subject, we use the distances from the observed individual values to the expected value of the parameter at the time of measurement. In many situations we assume that the change of the parameter over time is linear for mathematical convenience. Thus, we can fit a regression line for observed ˆ i,j = a individual measurements over time, i.e. X ˆi + ˆbi ti,j with ti,j as the time of the jth measurement for the ith subject. The variation around the regression line is the standard error of the estimate (SEE): sP ni ˆ 2 j=1 (Xi,j − Xi,j ) . (3) SEEi = ni − 2 In this case, SEE rather than SD should be taken as the estimate of the long-term precision error for an individual subject. For precision errors of a group of subjects, we use the root-mean-square SEE (RMS SEE) to evaluate the long-term precision error for clinical use. sP m (n − 2)SEEi i=1 Pm i . (4) RMS SEEi = i=1 (ni − 2) The confidence intervals of RMS SD and RMS SEE can be derived using transformation of a Chi-squared distribution. The generic formula of (1 − α) • 100% confidence interval is ! s s df df · Absolute Precision, · Absolute Precision . (5) χ21− α ,df χ2α ,df 2
2
Pm the absolute Thus, for short-term precision, df = i=1 (ni − 1) and P m precision error is RMS SD. For long-term precision, df = i=1 (ni − 2) and the absolute precision error is RMS SEE. The values of χ21− α ,df 2 and χ2α ,df can be obtained from most software and tables from statistics 2 text books. The absolute precision error depends on the unit of measurement. While it gives important information on measurement errors, it is inadequate for comparing precision errors across several techniques or measurements. For diagnosis or for monitoring longitudinal changes, we are usually more interested in the relative precision of a technique than in the absolute minimum measurement errors.
108
Y. Lu & S. Zhao
2.3. Relative short-term precision errors 2.3.1. Short-term coefficient of variation The most commonly used measure of relative precision error is the coefficient of variation (CV), defined as the ratio of the standard deviation to the mean measurement. It is usually given on a percentage basis. CV is unit free and therefore can be used with different techniques and instruments. CV has a long history as a measure of reproducibility. It was first proposed by Karl Pearson in 1895 to measure the variability of a distribution. The distribution of CV is complicated. The simplest case is one individual with repeated measurements. Assuming that Xi,j obtained from the ith individual are independent identical samples from a normal distribution N (µi , σi2 ), the density functions for CVi is15 : fCVi (x; ni , λi ) 2 ∞ √ xni −2 ni + k e−ni λi /2 X ( 2ni λi )k Γ √ ni +k , x ≥ 0 , k! 2 π Γ( ni2−1 ) (1 + x2 ) 2 k=0 (6) = √ ∞ −n λ2 /2 X |x|ni −2 ni + k (− 2ni λi )k √e i i Γ ni +k , x < 0 , π Γ( ni −1 ) k! 2 (1 + x2 ) 2 2 k=0
q with λi = σi /µi . Asymptotically, the variance of CVi is λi n1 ( 12 + λ2i ).16 This individual CV is only meaningful if the subject has multiple measurements. When all individuals in a study have only one measurement, a population CV can be defined similarly to the ratio of population standard deviation and population mean. Such a CV is no longer related solely to measurement errors but to a combination of measurement errors and population variations. Feltz and Miller17 gave an asymptotic χ2 -test (DAD test) to compare the CV from k-populations. Fung and Tsang18 compared the DAD test with the likelihood ratio test (LRT), and the squared ranks test (SRT) in a simulation study. They concluded that the DAD test is a very good test for CVs from k-populations of normal distributions, although it is not robust, for a symmetric distribution with heavy tails. The LRT does not control type I errors correctly, although it is very powerful. The SRT is slightly liberal, but rather robust. In radiological studies, the population CV is rarely of interest, and it will not be discussed in detail here. An alternative CV for non-normal distributions is the non-parametric CV, defined as the ratio of inter-quartile range over the median of the population.19 The confidence interval and hypothesis testing for the
Statistics Used in Radiological Studies
109
non-parametric CV can be derived using bootstrap or jackknife resampling techniques.20,21 In radiology, we are more interested measurement errors in a random effects model. Here, we assume that Xi,j = θi + ei,j ,
(7)
where θi is the unobserved true (expected) value for the ith subject that follows a N (µ, τ 2 ), and ei,j are independent measurement errors that follow N (0, σ 2 ). As in Sec. 2.2, the RMS SD in (2) is the best estimate of σ. Thus, for short-term precision, CV is defined as CV = 100 ×
RMS SD %, ¯ X
(8)
¯ is the mean of Xi,j . This is also called within-batch CV in Where X laboratory medicine.22 The distribution of this short-term precision is much more complicated because means of subjects θi ’s also follow a normal distribution. Quan and Shih23 derived the asymptotic sample variances for short-term CVs. The derivation requires two assumptions: (1) the number of repeated measurements of a patient ni will not be more than a positive number C; (2) the proportion of subjects with ni = l converges to a constant 0 ≤ pl ≤ 1, as m → ∞. Under these two concditions, the asymptotic standard deviation of moment estimator of short term CV defined in formula (8) is s Pm Pm σ 2 ( i=1 ni )σ 2 + ( i=1 n2i )τ 2 σ2 P P + , (9) m m µ4 ( i=1 ni )2 2µ2 i=1 (ni − 1)
for m → ∞. The sample variation when Xi,j ’s follow log-normal distribution can also be found in Quan and Shih.23 It is often useful to compare the CVs of different techniques, or of the same techniques at different research centers. When comparing the same technique at different centers, the measured subjects in different centers are independent so it is appropriate to use the DAD test similar to Feltz and Miller.17 When comparing the CVs of different techniques, however, it is preferable to apply the techniques to the same set of subjects to control for confounding factors. This resulted correlated estimated CV and testing can be complicated. A two-step bootstrap algorithm can be used to compare two or more CVs: Step 1. Draw m random samples with replacement from the study subjects.
110
Y. Lu & S. Zhao
Step 2. For each selected subject (possibly selected multiple times but treating each measurement as an independent sample) in Step 1, draw ni random samples with replacement from his/her corresponding measurements. Step 3. Calculate the difference of the two CV’s based on data in Step 2. Step 4. Repeat Steps 1 to 3 many times (1,000–2,000 times). Step 5. Calculate the 95% bootstrap confidence intervals of the differences. If the 95% bootstrap confidence interval excludes 0, the null hypothesis that the two CVs are equal is rejected.
2.3.2. Alternative forms of short-term coefficient of variation Intuitively, the larger the CV, the larger the precision errors and the poorer the technique’s ability to monitor changes. However, this is not always true. To use CV, the value 0 of a measurement should have some physical meaning. For example, 0 bone mineral content and density have clear physical meanings. On the other hand, 0 value in speed of sound (SOS) in quantitative ultrasound has no physical meaning — the lower limit for speed of sound in water is around 1500 m/s. When the value 0 has no physical meaning, the origin of the parameters can be moved up or down so that CV has no physical meaning. Secondly, using CV to characterize the precision error of a technique implies that the precision error is proportional to the quantity of measurements. This is not true for many bone densitometry measurements. Normally, we see that the lower the bone density, the higher the relative precision errors (actually, even the absolute precision error increases with decreasing BMD). Thus, CV is not always a robust parameter for evaluating precision, at least for bone densitometry in osteoporosis research. Third, the mean value of the measured quantity, in many cases, is not the primary interest. We are more interested in discriminating between patients and normal controls, monitoring changes in bone status, or evaluating treatment responses, and CV is inadequate for these purposes. A major limitation of CV is that it does not take into account the impact of the technique’s responsiveness to changes caused by disease or disease progression. When a technique has a very low precision error (i.e. a very “good” precision) but an even lower responsiveness (e.g. differences between healthy and diseased subjects or changes as a result of disease progression or treatment) it will not have a good longitudinal sensitivity to detect changes caused by disease over short time periods. Therefore, several approaches to adjust for differences in responsiveness have been proposed.
Miller et al.24 proposed a standardized coefficient of variation (SCV) as the ratio of the absolute precision over the range (5th to 95th percentiles) of the parameter. The range can be obtained from the manufacturer's normative data or from the observed study subjects when the sample size is large enough and the sampling procedures are appropriate. Mathematically,

  SCV = (Absolute Precision / Range) × 100% = [Absolute Precision / (95th percentile − 5th percentile)] × 100% .   (10)
Alternatively, Blake et al.25 proposed using the population standard deviation as the measure of the range. The precision error is then measured by the ratio of the standard deviation of measurement errors over the measured population standard deviation (which includes both measurement errors and population variation); we call this SCV2. SCV2 is related to the attenuation parameter in measurement error models,26 which measures the bias caused by measurement errors in linear and non-linear regression analysis. Because the width of the 5th-95th percentile range used in the SCV is about 3.3 times the population standard deviation, the SCV is approximately a third of SCV2. Machado et al.27 proposed a similar standardized precision measure obtained by replacing the range in the above formula with the difference in mean values of the parameter between diseased and normal subjects, which we call SCV3. It is important to note that all these standardized CVs are also unit free. In osteoporosis research, the population range and standard deviation of BMD change across age groups. To adjust for the age effects on precision errors, Langton28 proposed a precision parameter, the ZSD. A ZSD is the standard deviation of an individual's Z-scores zi,j, a transformation of the observed measurements Xi,j. This Z-score is different from the Z-statistic in the statistical literature. Here, the Z-score is defined as zi,j = (Xi,j − µ(agei))/σ(agei), where µ(agei) and σ(agei) are the BMD mean and standard deviation of the age group of the ith subject. Therefore, a Z-score is the number of population standard deviations by which a subject's value differs from the age-matched population mean. It is unit free. The standard deviation of the zi,j is ZSDi = SDi(Xi,j)/σ(agei) × 100%. Thus, ZSDi for the ith subject is actually an age-matched SCV2, and the RMS ZSD, the RMS average of the individual SCV2's, is the corresponding measure for a technique.
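The following illustrative helper functions compute SCV, SCV2 and ZSD from quantities that are assumed to be available (the RMS precision error, normative percentiles, and age-matched normative mean and SD); they are a sketch, not the formulas' original implementation.

    from statistics import NormalDist

    def scv(rms_sd, p5, p95):
        # SCV of Eq. (10): absolute precision over the 5th-95th percentile range.
        return 100.0 * rms_sd / (p95 - p5)

    def scv2(rms_sd, population_sd):
        # SCV2: absolute precision over the population standard deviation.
        return 100.0 * rms_sd / population_sd

    def zsd(measurements, age_mean, age_sd):
        # ZSD: standard deviation of an individual's Z-scores, using the
        # age-matched normative mean and SD.
        z = [(x - age_mean) / age_sd for x in measurements]
        mu = sum(z) / len(z)
        return (sum((v - mu) ** 2 for v in z) / (len(z) - 1)) ** 0.5

    # For a normal population the 5th-95th percentile range is about 3.3 SDs,
    # which is why SCV is roughly a third of SCV2:
    print(NormalDist().inv_cdf(0.95) - NormalDist().inv_cdf(0.05))   # ~3.29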
The SCV proposed by Miller et al.24 is an important step in recognizing the limitations of a traditional CV. SCV often provides different information than CV. For example, PA spine BMD measured by a DXA scanner such as the Hologic QDR-1000 has a higher short term CV (1%) than speed of sound (SOS) (0.3%) measured by quantitative ultrasound machines like the Hologic Sahara. However, defining the SCV as the ratio of RMS SD over the young adult population SD gives the opposite result: the SCV of PA spine BMD is 8% and SOS is 20%.25 Rather than using the population standard deviation, ZSD uses the age specific population standard deviation. ZSD has advantages when the population variance varies for different age groups, and the purpose of the technique is to determine the differences of individual subjects from their corresponding age group means. An important limitation of SCV and SCV2 is their dependence on the normative data. In most cases, normative data from different equipment manufacturers are not comparable. Different manufacturers have different normative data based on different selection criteria. The procedures for collecting data may not always follow appropriate statistical sampling procedures and thus may not represent the true population distribution of the parameters. Comparing two SCVs based on two different normative data sets can be like comparing apples to oranges. Many precision studies have small sample sizes and subjects are recruited from convenient samples. The study sample may not be compatible with normative populations. These logistic difficulties severely limit the scientific validity of SCVs. Statistical properties and hypothesis testing procedures for all the SCVs are complicated and have not been fully studied. In all these cases, the bootstrap method can be applied to resolve the real application needs. 2.3.3. Sample size for short-term precision studies When planning for a short-term precision study, there are always trade-offs between the number of study subjects and the number of measurements. In most cases, we plan to have the same number of measurements n for all the m study subjects. Sample size calculations can be based on the width of the confidence intervals or on the null hypothesis. In both cases, one should have some idea of the ratio between population standard deviation τ and population mean µ. For a given n, the asymptotic (1 − α) • 100% confidence width for estimated CV λ is
  2 z_{1−α/2} (λ/√m) √{ 1/n + τ²/µ² + λ²/[2(n − 1)] } .   (11)
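As a quick planning aid, the confidence width in formula (11) can be evaluated numerically; the sketch below assumes the CV λ and the ratio τ/µ are supplied as proportions, and the example values are hypothetical.

    from statistics import NormalDist

    def cv_confidence_width(cv, n, m, tau_over_mu, alpha=0.05):
        # Width of the asymptotic (1 - alpha) CI for an estimated CV, formula (11).
        # cv and tau_over_mu are proportions (e.g. 0.005 for 0.5%); n is the number
        # of measurements per subject and m the number of subjects.
        z = NormalDist().inv_cdf(1 - alpha / 2)
        inner = 1.0 / n + tau_over_mu ** 2 + cv ** 2 / (2 * (n - 1))
        return 2 * z * cv / m ** 0.5 * inner ** 0.5

    # Example: CV = 1%, two measurements on each of 25 subjects, tau/mu = 15%.
    print(cv_confidence_width(0.01, 2, 25, 0.15))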
Formula (11) is obtained by rearranging formula (9). A similar argument for the sample size to test the hypothesis H0: λ = λ0 versus H1: λ ≠ λ0 gives

  m = { z_{1−α/2} λ0 √[1/n + τ0²/µ0² + λ0²/(2(n−1))] + z_{1−β} λ1 √[1/n + τ1²/µ1² + λ1²/(2(n−1))] }² / (λ1 − λ0)² .   (12)

Here, α and β are the type I and type II errors; λ1 is the alternative CV; and τi and µi are the population standard deviations and means under the null (i = 0) and alternative (i = 1) hypotheses. Equation (12) shows that the sample size m decreases as the number of measurements n increases. In practice, recruiting subjects is more difficult and costly than repeating measurements. However, many factors can influence precision errors, and selecting a small number of patients can either over- or under-state the true precision of the technique in clinical use. For example, measuring only healthy young women to evaluate DXA scanner precision will give smaller precision errors and will overstate the precision of the scanner. Measuring only elderly osteoporotic women will give larger precision errors and will understate the precision of the scanner. Some balance of confounding factors for precision errors must be achieved to represent the clinical population to which the machine or technique will be applied.11 Within the given cost constraints, one should try to recruit as many subjects as possible.

2.4. Relative long-term precision errors and sensitivity of monitoring changes

Short-term precision is useful for evaluating the utility of a diagnostic technique. The smaller the precision error, the easier it is to separate diseased and normal subjects. This is particularly true for standardized precision errors. They cannot, however, describe the ability of a technique to monitor changes.

2.4.1. Longitudinal CV

As with short-term absolute precision errors, the RMS SEE depends on the measurement unit and is not appropriate to compare across
techniques. Correspondingly, we can define a longitudinal CV as

  CV = 100 × RMS SEE / X̄ % .   (13)
If we assume that the changes of the measurements for individual subjects over time follow a linear model, that is,

  Xi,j = ai + bi ti,j + ei,j ,   (14)

with ti,j the time of the jth measurement of the ith subject, the longitudinal CV is

  CV = √{ Σ_{i=1}^{m} Σ_{j=1}^{ni} (Xi,j − âi − b̂i ti,j)² / Σ_{i=1}^{m} (ni − 2) } / ( Σ_{i=1}^{m} X̄i / m ) .   (15)
Here, âi and b̂i are the estimated intercept and slope, and X̄i is the average for the ith subject. The derivation of the asymptotic standard deviation of Eq. (15) has not yet been reported in the literature. Although it is only implicit, the longitudinal CV depends on the length of time over which the measurements are performed. If the length of time and the frequency of measurements differ for the same technique and the same subjects, the CVs may differ. This is because X̄i ≈ a + b t̄i, which is not the case for the absolute precision. Therefore, to compare the same technique on different machines, the absolute longitudinal precision RMS SEE is more appropriate. When comparing different techniques, the measurement times should be identical; the best plan is to measure the same subjects at the same times. Otherwise, the longitudinal CVs will not be comparable.

2.4.2. The least significant change

For clinical decision making it is important to know the minimum magnitude of measured change that is not caused by measurement errors. The least significant change (LSC) is defined as 2.8 times the longitudinal absolute precision,29 i.e.,

  LSC = 2.8 × RMS SEE .   (16)
More specifically, if we observe a change of a subject more than LSC, we will have 95% confidence that the change is beyond measurement errors. The derivation of the LSC is based on the following argument. Let X1 and X2 be two successive measurements of a subject. If there is no change in the two measurements, the difference between them is the result
of longitudinal measurement errors. If we assume the longitudinal measurement variation is σ, as estimated by RMS SEE in Eq. (4), then Pr(|X1 − X2| > z_{1−α/2} √2 σ) = α. The least significant change is also called the "biologically significant change" in laboratory medicine.22 The longitudinal precision error must be used to evaluate the LSC rather than the short-term precision error, which is normally smaller than the longitudinal precision error. The significance level of 5% has no clinical meaning. Therefore, there is no need to insist on 95% confidence when evaluating the LSC. To treat patients early, before the disease progresses, lower confidence levels can be chosen. Another parameter, the trend assessment margin (TAM), was proposed as 1.8 × RMS SEE, which corresponds to an 80% confidence level.30 The LSC and TAM can also be approximately calculated in percentages based on longitudinal CVs.

2.4.3. Follow-up time interval

Radiological variables are often used as monitoring tools for individual patients. To assess the sensitivity of a technique for monitoring patients, Gluer30 introduced the concept of the "monitoring time interval" (MTI). The MTI for assessment of disease progression or treatment response is an estimate of the time period after which a patient will have a 50% chance of showing changes that exceed the LSC. Thus,

  MTI = LSC / Median Change per Annum .   (17)
The changes here can be caused by age, disease progression, or treatment efficacy, depending on the purpose of the study. The change should also be consistent with the units of the LSC: if the LSC is expressed as an absolute precision, the change should be expressed as an absolute change; if the LSC is expressed as a percentage, a percentage change should be used. It is important to note that the unit of the MTI is a year. Similarly to the TAM, Gluer30 also suggested the "trend assessment interval" (TAI), an estimate of the follow-up time after which a subject will have a 50% chance of changes exceeding the TAM. The determination of appropriate monitoring time intervals always represents a trade-off between frequent visits, with patient discomfort and additional costs, and fewer visits, with the risk of substantial disease progression in the interval. The MTI requires the usual 95% confidence level, which means the corresponding monitoring time interval is almost double the TAI. This shows that MTI and TAI, as applications of longitudinal precision, have a direct and very intuitive meaning closely related to recommended monitoring time intervals. However, one should note that there is no single MTI (or TAI) for each technique; they will differ substantially depending on the expected response of the patients. For their purpose, this is not a disadvantage, since it directly reflects the fact that the frequency of follow-up measurements depends on the type of patient examined. In osteoporosis clinics, for example, fast bone losers should have MTIs shorter than average postmenopausal women.

2.5. Examples of applications of precision errors

In this subsection, we give some examples of calculating the absolute and relative precision errors described in the previous subsections.

2.5.1. Example 1

The short-term precision errors of two quantitative ultrasound scanners for osteoporosis from two different manufacturers were compared. Twenty
Table 1.  SOS (m/sec) at the calcaneus of 20 volunteers.

             Manufacturer 1             Manufacturer 2
Subject ID   Measure 1   Measure 2      Measure 1   Measure 2
 1           1499        1505           1579        1586
 2           1487        1488           1594        1590
 3           1471        1465           1543        1556
 4           1468        1467           1536        1545
 5           1501        1504           1587        1588
 6           1516        1517           1618        1605
 7           1490        1491           1580        1587
 8           1569        1565           1670        1683
 9           1534        1543           1641        1641
10           1464        1468           1547        1558
11           1509        1510           1591        1593
12           1567        1541           1621        1647
13           1514        1509           1605        1625
14           1539        1540           1619        1614
15           1540        1537           1632        1648
16           1532        1535           1616        1617
17           1544        1531           1629        1636
18           1578        1574           1637        1644
19           1484        1482           1574        1576
20           1518        1522           1606        1610
Table 2.  Short-term precisions and related parameters.

               Statistics (and equation number)
Manufacturer   RMS SD (2)   CV (8)   SD for CV (9)   SCV (10)   SCV2
1              5.30         0.35%    0.06%           5.33%      16.28%
2              7.59         0.47%    0.07%           8.29%      21.62%
healthy elderly volunteers participated in the study. Speed of sound (SOS) at the calcaneus was measured twice on the same day for each subject. The data are given in Table 1. Therefore, we have m = 20 and n1 = · · · = n20 = 2. The results are summarized in Table 2. SCV2 was defined in Sec. 2.3.2, immediately after Eq. (10). We did not calculate SCV3 and ZSD here because SCV3 requires information on individual disease status and ZSD requires the manufacturer's normative data, and neither was available. It is worth noting that the classical CV for SOS is very low compared with BMD measured by DXA (CVs range from 1% to 6%). However, this does not mean that SOS is more precise in clinical use. The clinically useful range of SOS does not begin with zero and, in fact, zero is not defined here. That is why SCV and SCV2 are more meaningful in this example. The reported SCV2 for BMD measured by DXA ranged from 8% to 11%,25 far less than SOS on a quantitative ultrasound scanner.

2.5.2. Example 2

Five normal volunteers participated in a longitudinal quality evaluation study for two new quantitative ultrasound (QUS) devices from different manufacturers within one year. Table 3 lists their SOS measurements. Table 4 displays the longitudinal precision. Although not all subjects demonstrated linear changes over time (Subject 3 in particular had some non-linear changes on Machine 1), we applied only linear trends to all individuals. Also, as pointed out in Example 1, CV is not an appropriate measurement for SOS in QUS; CV is included in Table 4 only for demonstration. Thus, although Machine 2 has higher precision errors, it is more sensitive to changes with age and may be a better choice for longitudinal follow-up. Of course, the sample size in this study is too small to reliably determine the monitoring time intervals.
Table 3.  Longitudinal QC data for 5 normal volunteers.

              SOS (m/sec)                            SOS (m/sec)
Subject  Date      Machine 1  Machine 2   Subject  Date      Machine 1  Machine 2
1        09/21/97  1554       1636        3        05/07/98  1588       1698
1        10/04/97  1563       1642        3        06/01/98  1586       1717
1        11/05/97  1546       1634        3        07/24/98  1587       1708
1        11/18/97  1554       1635        3        09/23/98  1588       1708
1        12/29/97  1560       1656        4        09/22/97  1598       1709
1        01/09/98  1551       1626        4        10/04/97  1595       1694
1        02/04/98  1556       1648        4        10/29/97  1601       1698
1        02/24/98  1548       1642        4        11/17/97  1585       1677
1        03/22/98  1552       1658        4        12/12/97  1590       1696
1        04/11/98  1562       1665        4        12/28/97  1608       1720
1        05/07/98  1544       1637        4        01/25/98  1593       1691
1        07/11/98  1548       1653        4        02/20/98  1595       1692
1        08/13/98  1567       1672        4        03/11/98  1586       1688
1        08/26/98  1563       1658        4        03/23/98  1593       1718
1        09/21/98  1554       1646        4        04/24/98  1594       1722
2        09/22/97  1560       1654        4        06/13/98  1602       1727
2        10/05/97  1565       1660        4        06/26/98  1600       1733
2        11/06/97  1563       1643        4        07/27/98  1598       1708
2        11/19/97  1562       1652        4        08/09/98  1591       1719
2        12/22/97  1558       1663        4        09/21/98  1594       1714
2        01/04/98  1572       1680        5        09/22/97  1591       1664
2        02/05/98  1567       1674        5        11/14/97  1586       1678
2        02/19/98  1566       1667        5        12/17/97  1587       1677
2        03/23/98  1568       1677        5        12/29/97  1605       1703
2        04/12/98  1572       1663        5        02/01/98  1587       1682
2        05/08/98  1569       1661        5        02/15/98  1586       1681
2        07/12/98  1573       1650        5        03/20/98  1588       1688
2        09/21/98  1576       1695        5        04/02/98  1594       1693
3        09/22/97  1579       1654        5        05/05/98  1594       1682
3        11/17/97  1575       1666        5        05/19/98  1593       1684
3        12/15/97  1576       1667        5        06/27/98  1594       1695
3        01/13/98  1571       1669        5        07/11/98  1596       1677
3        01/29/98  1573       1670        5        08/13/98  1594       1675
3        03/03/98  1579       1676        5        08/26/98  1596       1701
3        03/27/98  1580       1690        5        09/21/98  1594       1689
Table 4.  Longitudinal precision for 2 QUS machines.

                     Machine 1                  Machine 2
Subject   d.f.   SEE     Mean    CV         SEE     Mean    CV
1         13     7.11    1555    0.46%      10.87   1647    0.66%
2         11     3.25    1567    0.21%      12.74   1665    0.77%
3          9     4.07    1580    0.26%       8.17   1684    0.49%
4         14     6.14    1595    0.39%      13.67   1707    0.80%
5         13     4.91    1592    0.31%      10.00   1685    0.59%
Total     60     5.43    1578    0.34%      11.49   1678    0.69%

LSC (m/sec)      15.21                      32.18
MTI (yr)         2.5                        1.3
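A sketch of the longitudinal calculations behind Table 4 is given below, using per-subject least-squares fits (Eqs. (14)-(17)); the data layout, the choice of the standard library's linear_regression (Python 3.10+), and the median annual change supplied by the user are all illustrative assumptions.

    from statistics import linear_regression, mean

    def rms_see(subjects):
        # subjects: list of (times, values) pairs, times in years.
        ss, dof = 0.0, 0
        for t, y in subjects:
            fit = linear_regression(t, y)                  # X = a + b*t for one subject
            resid = [yi - (fit.intercept + fit.slope * ti) for ti, yi in zip(t, y)]
            ss += sum(r * r for r in resid)
            dof += len(y) - 2
        return (ss / dof) ** 0.5

    def longitudinal_summary(subjects, median_annual_change):
        see = rms_see(subjects)
        cv = 100.0 * see / mean(mean(y) for _, y in subjects)   # Eqs. (13)/(15)
        lsc = 2.8 * see                                          # Eq. (16)
        mti = lsc / median_annual_change                         # Eq. (17), in years
        return see, cv, lsc, mti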
3. Statistical Process Control Charts

In Sec. 2, we introduced the concept of measurement errors and the statistics used to evaluate them. Precision errors are usually evaluated whenever new techniques or new devices are developed. Precision errors are also evaluated immediately after a device is installed at a clinical site, to assure that the equipment is performing according to the manufacturer's specifications at baseline. Precision errors are also always assessed before the beginning of clinical trials or longitudinal studies.7,31 Although the manufacturer's service personnel can set up the device so that precision errors are within appropriate limits at baseline, it is very important to monitor the equipment to assure that imprecision remains within acceptable limits. Despite the remarkable accuracy and reproducibility of radiological equipment, measurements can still vary because of changes in equipment, software upgrades, machine recalibration, X-ray source decay, hardware aging and/or failure, or operator errors. In an ideal setting, well-maintained equipment produces values that are randomly spread around a reference value. A change point is defined as the point in time at which the measured values start to deviate from the reference value. To evaluate measurement stability and identify change points, radiologists have developed phantoms that simulate human measurements but, unlike humans, do not change over time.7,32,33 Variations in phantom measurements should reflect variations in human measurements. Phantoms are measured regularly to detect one or more of the following events: (1) the mean values before and after the change point are statistically significantly different; (2) the standard deviations of measurements before and after the
Table 5.  AP spine BMD of a Hologic phantom in a QC study (13 March 1989 to 15 May 1989).

i    Date      BMD (Xi)   µ0      σ (= µ0 × 0.5%)
41   03/13/89  1.039      1.033   0.00517
42   03/15/89  1.039      1.033   0.00517
43   03/21/89  1.029      1.033   0.00517
44   03/22/89  1.036      1.033   0.00517
45   03/23/89  1.030      1.033   0.00517
46   03/27/89  1.033      1.033   0.00517
47   03/28/89  1.036      1.033   0.00517
48   03/29/89  1.038      1.033   0.00517
49   03/30/89  1.036      1.033   0.00517
50   04/03/89  1.033      1.033   0.00517
51   04/04/89  1.036      1.033   0.00517
52   04/05/89  1.034      1.033   0.00517
53   04/06/89  1.029      1.033   0.00517
54   04/07/89  1.033      1.033   0.00517
55   04/10/89  1.037      1.033   0.00517
56   04/14/89  1.042      1.033   0.00517
57   04/17/89  1.044      1.033   0.00517
58   04/18/89  1.041      1.033   0.00517
59   04/19/89  1.040      1.033   0.00517
60   04/20/89  1.036      1.033   0.00517
61   04/28/89  1.039      1.033   0.00517
62   05/01/89  1.035      1.033   0.00517
63   05/02/89  1.047      1.033   0.00517
64   05/03/89  1.028      1.033   0.00517
65   05/04/89  1.035      1.033   0.00517
66   05/05/89  1.038      1.033   0.00517
67   05/09/89  1.031      1.033   0.00517
68   05/10/89  1.041      1.033   0.00517
69   05/12/89  1.043      1.033   0.00517
70   05/15/89  1.034      1.033   0.00517
change point are statistically significantly different; (3) The measurements after the change point show a gradual but significant departure from the reference value. In Table 5, we introduce our third example, which is roughly two months of quality control data from a DXA scanner. In this example, a Hologic spine phantom was scanned about three times a week. The purpose of the study was to monitor the stability of the DXA scanner. If the scanner is functioning acceptably, the coefficient of variation should be less than 0.5%
in the total AP spine BMD values. (Information on this data set can be found in Lu et al.34 ). In Table 5, i is an indicator of the observation number; date is the date the scan was performed; BMD is the ith measurement; µ0 is the reference value based on historical QC data; and σ is the standard deviation based on 0.5% CV. We will use this data to illustrate statistical process control charts. Statistical process control (SPC) is a powerful collection of problem solving tools for achieving process stability and improving capacity through reduction of variability.35 There are several statistical methods for identifying change points. One is to visually check the retrospective data to determine the change points and then to verify these changes by a t-test for means and an F -test for variances. An alternative is to use statistical process control charts.34,36 In this section, we introduce these methods and provide examples of their application in monitoring BMD measured by DXA scanners in osteoporosis studies. 3.1. Visual inspection Potential change points in the data can be determined after careful visual inspection. This can be done by plotting longitudinal phantom data over time and using visual judgment to identify the potential change points created by drifts or sudden jumps. Statistical tests, such as the t-test, can be used to confirm the significance of the changes. It is important to note that there can be multiple potential change points observed for a given period of time. Careful control for type one errors for repeated tests is recommended for the t-tests. Only experienced medical physicists or radiologists should perform periodic visual inspections. The role of primary evaluator should always be taken by the same individual to avoid subjective variations. The selection of the change points is based on the scatter plot in the most recent data. Once a change point has been identified, its cause should be investigated to determine if the change is machine related. Visual inspection is not recommended because its efficiency depends on the experience of the reviewer and may not be reproducible. 3.2. Shewhart control chart A Shewhart chart is a graphic display of a quality that has been measured over time. The chart contains a central horizontal line that represents the mean reference value. Three horizontal lines above and three below the
central line indicate 1, 2, and 3 standard deviations from the reference value. By plotting the observed quality control measurements on the chart, we can determine whether the machine is operating within acceptable limits. The reference value can be derived from theoretical values for the phantom, or from the first 25 observations measured at baseline. The reference value changes whenever the Shewhart chart indicates an out-of-control signal and the machine is recalibrated; the new reference value is then the mean of the first 25 observations after recalibration. The number of observations needed to calculate the reference value may vary; the number 25 was chosen based on practical experience to balance the stability of the reference value against the length of time needed to establish it. The standard deviation varies among individual devices and manufacturers, and should be chosen accordingly. For example, in one osteoporosis study, we sometimes use the BMD of a Hologic phantom to monitor DXA scanner performance. We usually assume the coefficient of variation to be 0.5% for Hologic machines and 0.6% for Lunar machines, based on reported data on long-term phantom precision.37 Therefore, the standard deviation for the scanner was calculated as 0.005 and 0.006 times the reference value for Hologic and Lunar machines, respectively. The original Shewhart chart signals a problem if an observed measurement is more than 3 standard deviations from the reference value. Although intuitive and easy to apply, the chart is not very sensitive to small but significant changes.35 Therefore, a set of sensitizing tests for assignable causes has been developed to improve the sensitivity of Shewhart charts. Eight of the tests are available in the statistical software package SAS.38 The tests are listed in Table 6.
Table 6.  Definition of tests for assignable causes for Shewhart charts.

Test 1. One point is more than 3 standard deviations from the central line.
Test 2. Nine points in a row on one side of the central line.
Test 3. Six points in a row steadily increasing or steadily decreasing.
Test 4. Fourteen points in a row alternating up and down.
Test 5. Two out of 3 points in a row more than 2 standard deviations from the central line.
Test 6. Four out of 5 points in a row more than 1 standard deviation from the central line.
Test 7. Fifteen points in a row all within 1 standard deviation from the central line, on either or both sides of the line.
Test 8. Eight points in a row all beyond 1 standard deviation from the central line, on either or both sides of the line.
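As a minimal illustration, two of the sensitizing tests in Table 6 (Tests 1 and 2) can be checked with a few lines of code; the reference value and standard deviation are assumed known, and this sketch is not the SAS implementation referred to above.

    def shewhart_signals(values, mu0, sigma):
        signals = []
        for i, x in enumerate(values):
            # Test 1: one point more than 3 standard deviations from the central line.
            if abs(x - mu0) > 3 * sigma:
                signals.append((i, "Test 1"))
            # Test 2: nine points in a row on one side of the central line.
            if i >= 8:
                window = values[i - 8:i + 1]
                if all(v > mu0 for v in window) or all(v < mu0 for v in window):
                    signals.append((i, "Test 2"))
        return signals

    # For the phantom data of Table 5 one would use mu0 = 1.033 and sigma = 0.005 * mu0.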
Fig. 2.  Shewhart chart for QC data in Example 3. (Observed BMD plotted by scan date against the central line µ = 1.033 and control limits at ±1σ, ±2σ and ±3σ; a Test 2 signal appears in April 1989.)
The sensitizing rules can be used in toto or in part depending on the underlying processes of interest. For example, for quality control of DXA machines, we used four tests — 1, 2, 5 and 6.34 Once a change point has been identified by any one of the tests, the manufacturer’s repair service should be called to examine the causes and to recalibrate the machine. We then use the next 25 observations to generate new reference values and apply the tests to the subsequent data according to the new reference value. Figure 2 shows the application of a Shewhart chart for Example 3. In this chart, the dots are the observed BMD. The six lines are the control limits 1, 2 and 3 standard deviations away from the central reference line. There is a problem with Test 2 from April 10, 1989. The sensitizing rules increase the sensitivity of the Shewhart chart, but also increase the number of clinically insignificant alarms, which is not desirable. To overcome this problem, a threshold based on the magnitude of the mean shift can also be implemented. For example, we can select ten consecutive scans from after the possible change point identified on the Shewhart chart, and then calculate their mean values. If the mean differs by more than one standard deviation (which equals 0.5% times the reference value, in our example) from the reference value, the change point is confirmed as a true change point. Otherwise, the signal from the Shewhart
chart is ignored and the reference value is unchanged. This approach filters out small and clinically insignificant changes. However, the true difference must be more than one standard deviation for this approach to be effective, and it can delay the recognition of true change points.

3.3. Moving average chart

An alternative method is to determine the means and standard deviations of 25 consecutive measurements and then plot them over time. Control limits can be based on the assumption of a constant coefficient of variation during the process (0.5% times the reference mean) and a type I error rate comparable to the original Shewhart method (0.27%).35 More specifically, let Xi, for i = 1, 2, . . . , n, be the measured QC values of n longitudinal phantom scans from a machine. We define the moving average based on 25 scans as

  Mi = ( Σ_{j=i−24}^{i} Xj ) / 25 ,   i = 25, 26, . . . , n ,   (18)
the moving average of the 25 scans up to the date when the ith scan was collected, and

  Si = √{ Σ_{j=i−24}^{i} (Xj − Mi)² / 24 } ,   i = 25, 26, . . . , n ,   (19)
the moving standard deviation of the 25 scans up to the date when the ith scan was collected. Note that the first moving average can only be calculated after the first 25 scans have been collected. Now, if we assume that the Xi independently follow a normal distribution N(µ, σ²), it can be shown that the Mi follow a normal distribution N(µ, σ²/25) and that 24 Si²/σ² follows a chi-square distribution with 24 degrees of freedom, denoted χ²_24. However, note that neither the Mi nor the 24 Si²/σ² are independent samples from the normal and chi-square distributions, respectively, for different i. Let µ0 be the reference mean. If the machine is operating correctly, we should accept the null hypothesis H0: µ = µ0; if it is not, we will accept the alternative hypothesis H1: µ ≠ µ0. We select a type I error level of 0.0027 to be comparable to the original Shewhart method. We reject the null hypothesis if |Mi − µ0| > z_{1−α/2} σ/5 = 0.5991σ. Thus, the control limits for the moving average are ±59.91% of the standard deviation from the reference mean.
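A short sketch of the moving average and moving standard deviation charts of Eqs. (18) and (19) follows; the mean limit is the one just derived, and the upper limit of 1.41σ0 for the standard deviation is the one derived in the next paragraph. The data layout is an illustrative assumption.

    from statistics import mean, stdev

    def moving_charts(values, mu0, sigma0, window=25):
        alarms = []
        for i in range(window - 1, len(values)):
            chunk = values[i - window + 1:i + 1]
            m_i, s_i = mean(chunk), stdev(chunk)          # Eqs. (18) and (19)
            if abs(m_i - mu0) > 0.5991 * sigma0:          # mean chart out of control
                alarms.append((i, "mean", m_i))
            if s_i > 1.41 * sigma0:                       # SD chart (upper limit only)
                alarms.append((i, "sd", s_i))
        return alarms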
We assumed that the CV for the machine is constant. Therefore, if it is functioning correctly, we can derive the standard deviation as the reference mean times the CV. To check whether the precision of the machine is acceptable, we test the null hypothesis H0: σ = σ0 against the alternative H1: σ > σ0. With the same type I error rate as for the mean difference, we reject the null hypothesis if 24 Si²/σ0² > χ²_{24,1−α}, or equivalently, if Si > 1.41σ0. Thus, the control limit of the moving standard deviation is 1.41 times the standard deviation. Note that there is only an upper limit for the moving standard deviation chart, as we are interested only in increases in the standard deviation; in other words, we are performing quality control, not quality improvement. Once the moving average moves outside the control limits, the value of the moving average at that point is used as the new reference value for scans performed after that date. The number of scans used to calculate the moving average affects the performance of the method. Twenty-five scans were selected based on a power analysis, so that the moving average chart has less than a 0.27% chance of a false alarm and a 98% chance of detecting an increase in the mean of one standard deviation. Also, the moving standard deviation chart has a 98% chance of picking up a 100% increase in the standard deviation.34 Twenty-five scans is also a typical month's worth of quality control measurements.

3.4. CUSUM chart

CUSUM is short for cumulative sum chart. In applications, we recommend a version of CUSUM known as the tabular CUSUM35 because it can be presented with or without graphs. Mathematically, we define an upper one-sided tabular CUSUM SH(i) and a lower one-sided tabular CUSUM SL(i) for the ith QC measurement as follows:

  SH(i) = max[ 0, (Xi − µ0)/σ − k + SH(i − 1) ] ,   (20)

  SL(i) = max[ 0, (µ0 − Xi)/σ − k + SL(i − 1) ] .   (21)

Here, µ0 is the reference mean, σ is the standard deviation, and k is a parameter to filter out insignificant variations, usually set at 0.5. The initial values SH(0) and SL(0) are 0. The chart sends an alarm message if SL(i) or SH(i) is greater than 5. In other words, when the standardized BMD value deviates more than k from zero, the cumulative upper-bounded
sum increases by the amount of the deviation above k. On the other hand, if the deviation is less than k, the cumulative sum is reduced accordingly. When the cumulative sum falls below zero, we ignore the past data and set the cumulative sum to zero. However, a cumulative sum greater than 5 is a strong indication of a deviation from the reference mean in the data. CUSUM also estimates when the change occurred and the magnitude of the change, and we use the estimated magnitude of change to establish the new reference value. Table 7 demonstrates the application of the CUSUM chart to Example 3. In this table, SH(i) and SL(i) are defined in Eqs. (20) and (21), and we selected k = 0.5 to detect a mean change of one standard deviation.35 Along with the sequences SH(i) and SL(i), the sequences NH(i) and NL(i) denote the number of scans since the last zero value of SH(i) and SL(i), respectively. For example, for records 41 to 44 the SH(i)'s were positive, so NH(i) goes from 1 to 4; however, SH(45) was zero, so the corresponding NH(45) = 0. A similar rule applies for NL(i). As explained, the initial reference value was obtained from the mean of the first 25 observations. However, once SH(i) or SL(i) exceeded 5, we concluded that the scanner was malfunctioning. For example, on April 20, 1989, SH(60) > 5, suggesting that the BMD values were too high. We estimate that this event could have started on April 10, 1989, by noting the last date when NH(i) = 1. Therefore, the investigation of assignable causes should focus around that time. The magnitude of change from the reference value can be estimated as σ[k + SH(i)/NH(i)], which equals the average difference.35 Once we know the machine is malfunctioning, we will establish new reference values. If the manufacturer was involved in correcting the machine, the new mean should be established from the first 25 observations after the correction. However, if there is no intervention by the manufacturer or, as in our case, when performing retrospective data analysis, the new reference value can be estimated by µ0 + σ[k + SH(i)/NH(i)] if the new BMD values are greater than the reference value, or by µ0 − σ[k + SL(i)/NL(i)] if the new BMD values are smaller than the reference value. This results in a new µ0 after the 60th scan of 1.040 g/cm². A graphical presentation of the CUSUM chart is shown in Fig. 3. In some ways, it is easier to review Table 7 than the chart for identifying change points. A separate CUSUM chart can be constructed for a one-sided change in variance. The one-sided variance chart was constructed according
Table 7.  CUSUM table (from 13 March 1989 to 15 May 1989).

i    Date      Xi     µ0     σ (0.5%µ0)   (Xi−µ0)/σ − 0.5   SH(i)   NH(i)   (µ0−Xi)/σ − 0.5   SL(i)   NL(i)
41   03/13/89  1.039  1.033  0.00517       0.65             0.65    1       −1.65             0.00    0
42   03/15/89  1.039  1.033  0.00517       0.65             1.29    2       −1.65             0.00    0
43   03/21/89  1.029  1.033  0.00517      −1.29             0.00    3        0.29             0.29    1
44   03/22/89  1.036  1.033  0.00517       0.07             0.07    4       −1.07             0.00    0
45   03/23/89  1.030  1.033  0.00517      −1.10             0.00    0        0.10             0.10    1
46   03/27/89  1.033  1.033  0.00517      −0.52             0.00    0       −0.48             0.00    0
47   03/28/89  1.036  1.033  0.00517       0.07             0.07    1       −1.07             0.00    0
48   03/29/89  1.038  1.033  0.00517       0.45             0.52    2       −1.45             0.00    0
49   03/30/89  1.036  1.033  0.00517       0.07             0.58    3       −1.07             0.00    0
50   04/03/89  1.033  1.033  0.00517      −0.52             0.07    4       −0.48             0.00    0
51   04/04/89  1.036  1.033  0.00517       0.07             0.13    5       −1.07             0.00    0
52   04/05/89  1.034  1.033  0.00517      −0.32             0.00    0       −0.68             0.00    0
53   04/06/89  1.029  1.033  0.00517      −1.29             0.00    0        0.29             0.29    1
54   04/07/89  1.033  1.033  0.00517      −0.52             0.00    0       −0.48             0.00    0
55   04/10/89  1.037  1.033  0.00517       0.26             0.26    1       −1.26             0.00    0
56   04/14/89  1.042  1.033  0.00517       1.23             1.49    2       −2.23             0.00    0
57   04/17/89  1.044  1.033  0.00517       1.61             3.10    3       −2.61             0.00    0
58   04/18/89  1.041  1.033  0.00517       1.03             4.13    4       −2.03             0.00    0
59   04/19/89  1.040  1.033  0.00517       0.84             4.97    5       −1.84             0.00    0
60   04/20/89  1.036  1.033  0.00517       0.07             5.04    6       −1.08             0.00    0
61   04/28/89  1.039  1.040  0.00520      −0.69             0.00    0       −0.31             0.00    0
62   05/01/89  1.035  1.040  0.00520      −1.46             0.00    0        0.46             0.46    1
63   05/02/89  1.047  1.040  0.00520       0.85             0.85    1       −1.85             0.00    0
64   05/03/89  1.028  1.040  0.00520      −2.81             0.00    0        1.81             1.81    1
65   05/04/89  1.035  1.040  0.00520      −1.46             0.00    0        0.46             2.27    2
66   05/05/89  1.038  1.040  0.00520      −0.88             0.00    0       −0.12             2.15    3
67   05/09/89  1.031  1.040  0.00520      −2.23             0.00    0        1.23             3.38    4
68   05/10/89  1.041  1.040  0.00520      −0.31             0.00    0       −0.69             2.69    5
69   05/12/89  1.043  1.040  0.00520       0.08             0.08    1       −1.08             1.62    6
70   05/15/89  1.034  1.040  0.00520      −1.65             0.00    0        0.65             2.27    7
Fig. 3.  CUSUM chart for QC data in Example 3. (Upper and lower one-sided CUSUM statistics SH(i) and SL(i) and the QC BMD values, plotted by scan date, with reference lines at µ and µ ± 0.5σ.)
to Ryan.39 In this approach, the observed difference between two successive scans, Xi − Xi−1, is transformed to

  Zi = { |(Xi − Xi−1)/√(2σ²)|^{1/2} − 0.82218 } / 0.34914 ,

which approximately follows a standard normal distribution N(0, 1). For the variance chart, we selected k = 0.75 to reduce the number of alarms due to single outliers. When an alarm for a change in variance is identified, we investigate the causes of the alarm and may need to recalibrate the machine. Table 8 is a variance chart for Example 3. The table lists the calculated values of Zi. Since Zi follows a standard normal distribution, the upper-sided CUSUM for variance is SH(i) = max[0, Zi − 0.75 + SH(i − 1)], which is given in the eighth column. As before, NH(i) indicates when a positive cumulative sum occurs and is useful for finding assignable causes. The graphical presentation is similar to Fig. 3 and is not shown here. The general procedure for deriving the algebraic boundaries of the CUSUM chart is given in Montgomery35 and Ryan,39 and theoretical comparisons of the Shewhart and CUSUM charts can be found in both books. The V-mask chart is another form of CUSUM chart and is essentially the same as the tabular CUSUM.35,40
Table 8.  CUSUM table for change of variance for Example 3.

i    Date      Xi     Xi − Xi−1   σ (0.5%µ0)   Zi      Zi − 0.75   SH(i)   NH(i)
55   04/10/89  1.037   0.004      0.00517      −0.24   −0.99       0.000   0
56   04/14/89  1.042   0.005      0.00517       0.01   −0.74       0.000   0
57   04/17/89  1.044   0.002      0.00517      −0.86   −1.61       0.000   0
58   04/18/89  1.041  −0.003      0.00517      −0.52   −1.27       0.000   0
59   04/19/89  1.040  −0.001      0.00517      −1.30   −2.05       0.000   0
60   04/20/89  1.036  −0.004      0.00517      −0.24   −0.99       0.000   0
61   04/28/89  1.039   0.003      0.00520      −0.52   −1.27       0.000   0
62   05/01/89  1.035  −0.004      0.00520      −0.24   −0.99       0.000   0
63   05/02/89  1.047   0.012      0.00520       1.30    0.55       0.554   1
64   05/03/89  1.028  −0.019      0.00520       2.25    1.50       2.053   2
65   05/04/89  1.035   0.007      0.00520       0.44   −0.31       1.742   3
66   05/05/89  1.038   0.003      0.00520      −0.53   −1.28       0.467   4
67   05/09/89  1.031  −0.007      0.00520       0.44   −0.31       0.156   5
68   05/10/89  1.041   0.010      0.00520       0.99    0.24       0.391   6
69   05/12/89  1.043   0.002      0.00520      −0.86   −1.61       0.000   0
70   05/15/89  1.034  −0.009      0.00520       0.81    0.06       0.060   1
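A minimal sketch of the tabular CUSUM of Eqs. (20) and (21) is given below, with k = 0.5 and a decision threshold of 5; the bookkeeping is similar in spirit to Table 7, although the run-counter convention used here (reset when the statistic returns to zero) may differ slightly from the published table, and the suggested new reference value follows the formulas in the text.

    def tabular_cusum(values, mu0, sigma, k=0.5, h=5.0):
        sh = sl = 0.0
        nh = nl = 0
        out = []
        for i, x in enumerate(values):
            sh = max(0.0, (x - mu0) / sigma - k + sh)     # Eq. (20)
            sl = max(0.0, (mu0 - x) / sigma - k + sl)     # Eq. (21)
            nh = nh + 1 if sh > 0 else 0
            nl = nl + 1 if sl > 0 else 0
            signal = None
            if sh > h:
                signal = mu0 + sigma * (k + sh / nh)      # suggested new reference value
            elif sl > h:
                signal = mu0 - sigma * (k + sl / nl)
            out.append((i, round(sh, 2), nh, round(sl, 2), nl, signal))
        return out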
3.5. Comparison of statistical process control charts in osteoporosis studies

Lu et al.34 compared several statistical process control procedures and their applications to monitoring DXA scanners based on daily scans of a Hologic spine phantom. The comparisons were based on longitudinal quality control data from 5 clinical trial sites as well as simulation studies. They concluded that visual inspection is relatively subjective and depends on the operator's experience and alertness. The regular Shewhart chart with sensitizing rules has a high false alarm rate. The Shewhart chart with sensitizing rules and an additional filter of clinically insignificant mean changes has the lowest false alarm rate but relatively low sensitivity. This method does not require a lot of statistics and can be easily applied at clinical study sites. The CUSUM approach has the best combination of sensitivity, specificity, and identification of the time and magnitude of change. It is recommended for use in quality control centers in clinical trials, especially if patient data must be recalculated to adjust for change points.41 Combining a moving average chart and a moving standard deviation chart comes closest to the performance of the CUSUM method as a quality control procedure for monitoring DXA scanner performance.
3.6. Other charts In all the above procedures, we assumed that there is no autocorrelation between consecutive measurements. This is rarely true for longitudinal quality control for radiological equipment. The effects of such an assumption on the use of statistical process control charts and their decision structures are rather debatable. At one extreme, Wheeler42 argues that the usual control limits are contaminated “only when the autocorrelation becomes excessive (say 0.80 or larger).” He concludes that “one need not be overly concerned about the effects of autocorrelation upon the control chart.” Our personal experience with Shewhart or CUSUM charts and DXA quality control has been positive. This does not preclude autocorrelation from being a problem for other applications. Johnson and Bagshaw43 concluded that the problem is potentially quite serious. Strike suggested “clever use” of CUSUMs in laboratory medicine, such as process control for assays.22 Statistical approaches for dealing with autocorrelation are to construct process charts based on residuals after removing the autocorrelation or the use of an exponentially weighted moving-average (EWMA) control chart.35 EWMA is a flexible approach to statistical process control applications. When applied to uncorrelated data, it is a good alternative to the CUSUM chart. Applied to autocorrelated data, it can be adapted to form a control chart that eliminates the excessive false alarm problem associated with traditional control charts. Details of EWMA can be found in most books on quality control.35,39 While all the statistical process control charts presented here are for univariate continuous measurements, there are other types of charts for proportions and rates,44,45 and other quality control and improvement techniques from multivariate approaches.35,46 4. Assessment of Agreement In quality control for clinical trials, we must always assess the agreement of measurements. For example, during a longitudinal osteoporosis trial, a study site might upgrade its DXA machine. Because the change of BMD from baseline is the key measurement, we must be certain that the BMD values measured by the old and new machines are equivalent or in agreement. Also in clinical trials that require a radiologist’s assessment of outcomes, we must be certain that readings from different radiologists are the same, and that readings at the beginning and the end of the study are similar. All these require assessment of agreement.
After a DXA scanner upgrade, multiple phantom scans should be performed and, if possible, a group of volunteers should be scanned on both the old and new devices. If human data are available, they should be used to assess the agreement rather than phantom data. We hope the volunteers present a range of BMD wide enough to cover the spectrum of clinical uses. Before upgrading a machine that is being used in a clinical trial, the site must first inform the trial sponsors and quality assurance centers for their approval, and must rely on the manufacturer to assure proper installation and calibration. The site must maintain proper documentation of machine upgrades. Assessment of inter-reader agreement among radiologists in a clinical trial, and of intra-reader longitudinal consistency during a trial, normally requires group training before the trial starts. A set of representative images is assembled into a database. Potential readers for the study read the images together and discuss the grading criteria. Only trained radiologists can be readers, and the group training should be documented. After training, inter-reader agreement should be assessed. If the agreement does not satisfy the requirements of the sponsors or protocols, the readers are re-trained and a new set of test cases is used to test for agreement. The trial cannot start until reader agreement reaches the pre-specified requirements. During the trial, the radiologists are required to re-read the test sets periodically to assess the agreement of their current readings with their baseline readings. This is necessary to assure longitudinal consistency. All tests for reader agreement should be documented and archived for auditing purposes. Evaluation of agreement is also important for other purposes, such as validation of diagnostic methods or radiological devices. In these cases, a gold standard is selected and validation is performed to assure that the new measurements agree with the gold standard.

4.1. Association versus agreement

The concepts of agreement and association are related but different. Agreement means interchangeability of two measurements. In other words, a patient's BMD should be the same whether measured on an old DXA scanner or a new one, and the spine fracture grade of a vertebra should be the same regardless of by whom or when it is read. An association, on the other hand, suggests only that two machines or two readers tend to agree in direction. In other words, for two patients with different BMD values, both DXA machines will identify the same lower- and higher-BMD subjects, but their BMD measurements can be different.
The best example of the difference between agreement and association is the correlation coefficient of two continuous variables.47,48 A correlation coefficient can apply to any two continuous variables regardless of their scales, such as height and weight. Even if there is a high association between height and weight, they are not interchangeable because they measure completely different things. Even when X and Y are two continuous variables that measure the same physical property in the same units, an association still does not indicate agreement. In fact, cor(X, Y) = cor(a + bX, Y); thus, the correlation is invariant under a shift of mean or a change of scale. Further, the estimation of the correlation depends on the range of the true quantity in the sample: the wider the range, the higher the correlation coefficient. Also, the null hypothesis in testing a correlation coefficient is the independence of the two variables, which is not relevant to agreement. Therefore, the use of correlation to assess agreement is inappropriate. On the other hand, a high correlation of two continuous variables on the same scale suggests that it is possible to calibrate the variables so that they agree with each other.

4.2. Assessment of agreement of two continuous variables

As discussed above, only when two variables measure the same physical property in the same units can they be assessed for agreement. Let Y1 and Y2 be such continuous variables, following normal distributions N(µ_{Y1}, σ²_{Y1}) and N(µ_{Y2}, σ²_{Y2}), measured on the same subjects. The correlation coefficient between Y1 and Y2 is ρ. Let D = Y1 − Y2 and A = (Y1 + Y2)/2. We want to perform a regression analysis of D = α + βA + ε, and we are interested in whether α = β = 0.47 It is easy to verify that

  β = cov(D, A)/σ²_A = 2(σ²_{Y1} − σ²_{Y2}) / (σ²_{Y1} + σ²_{Y2} + 2ρσ_{Y1}σ_{Y2}) ,   (22)

and

  α = (µ_{Y1} − µ_{Y2}) − β(µ_{Y1} + µ_{Y2})/2 .   (23)
Therefore, α = β = 0 implies that µY1 = µY2 and σY1 = σY2 , i.e., the two measurements have the same distribution parameters. Bland and Altman47 further suggested plotting the difference D against average A and calculating the standard deviation of D (σD ). With 95% confidence, the differences between paired data are between ±2σD . If this σD is less than or equal to the precision errors of Y1 and Y2 , then these two
measurements are exchangeable and, therefore, equivalent. Also, if σ_D/Ā × 100% is less than the CVs of Y1 and Y2, they should be equivalent. Here we use a bar to denote sample means. Noting that both D and A are random variables, Bartko49 proposed a bivariate confidence ellipse for the Bland-Altman plot. The equation of the 95% ellipse is

  (A − Ā)²/σ²_A − 2r(A − Ā)(D − D̄)/(σ_A σ_D) + (D − D̄)²/σ²_D = qχ²(0.95, 2)(1 − r²) .   (24)
Here, qχ²(0.95, 2) = 5.991 is the 95% quantile of the χ²-distribution with 2 degrees of freedom, and r is the sample correlation coefficient of D and A. The hypothesis α = β = 0 can be tested using the Bradley-Blackwood procedure.50 The test statistic is

  F = (n − 2) [ Σ_{i=1}^{n} Di² − Σ_{i=1}^{n} (Di − α̂ − β̂Ai)² ] / [ 2 Σ_{i=1}^{n} (Di − α̂ − β̂Ai)² ] ∼ F(2, n − 2) ,   (25)
which simultaneously tests for a zero intercept and slope. Table 9 shows a dataset of AP spine BMD (g/cm²) from 10 normal volunteers measured on three different DXA scanners. We are interested in the equivalence of Scanner 1 and the other two scanners. As shown in Table 9, by the Bradley-Blackwood test we can accept the null hypothesis that there is no difference in means and standard deviations between Scanners 1 and 2. There is, however, a significant difference between Scanners 1 and 3; further examination of the data shows that Scanners 1 and 3 have different standard deviations. Using Bland and Altman's method, we can plot the comparison of Scanners 1 versus 2 and Scanners 1 versus 3 (Fig. 4). The dashed lines show the 95% confidence intervals for the differences, which are the most important element of these figures. Even when there is a significant non-zero intercept or slope in the Bland-Altman regression, we may still be able to treat the two measurements as interchangeable if the variation of the differences is less than the in vivo short-term precision error. The 95% confidence ellipse of a Bland-Altman plot is useful for indicating differences between the sample variances. A bivariate normal distribution has 5 parameters: two means, two standard deviations, and a correlation coefficient. The Bland-Altman regression compares four of the five parameters. We can have two normal random variables with the same mean and standard deviation but a negative correlation coefficient, such as Y and −Y when the mean of Y is 0. Thus, the
Table 9.  AP spine BMD of 10 patients by three DXA scanners.

          Observed BMD Data                    Comparison Scanners 1 and 2   Comparison Scanners 1 and 3
Subject   Scanner 1   Scanner 2   Scanner 3    D1        A1                  D2        A2
 1        1.342       1.328       1.352         0.014    1.335               −0.010    1.347
 2        1.303       1.312       1.317        −0.009    1.308               −0.014    1.310
 3        1.093       1.100       1.078        −0.007    1.096                0.015    1.085
 4        1.092       1.116       1.087        −0.024    1.104                0.005    1.089
 5        1.215       1.215       1.216         0.000    1.215               −0.001    1.216
 6        1.155       1.157       1.137        −0.002    1.156                0.018    1.146
 7        1.125       1.117       1.097         0.008    1.121                0.028    1.111
 8        1.434       1.437       1.447        −0.003    1.436               −0.013    1.441
 9        1.230       1.225       1.231         0.005    1.228               −0.001    1.231
10        1.326       1.324       1.313         0.002    1.325                0.013    1.320

σD                                              0.0104                        0.0141
Bradley-Blackwood test:  F                      0.6733                        5.3645
                         p-value                0.5367                        0.0333
Fig. 4.  Examples of Bland-Altman plots for equivalence of 3 scanners. The dashed lines are the 95% confidence intervals for the differences between two scanners. The ellipses are the 95% bivariate confidence ellipses. (Upper panel: BMD by Scanner 1 minus BMD by Scanner 2 plotted against the average BMD for Scanners 1 and 2; lower panel: BMD by Scanner 1 minus BMD by Scanner 3 plotted against the average BMD for Scanners 1 and 3.)
Bland-Altman regression alone is inadequate for evaluating agreement. We still need to examine the correlation coefficient between the two measurements, in addition to the Bland-Altman regression. Only a high correlation with a zero intercept and slope in the Bland-Altman regression can suggest that the two measurements are equivalent.
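An illustrative implementation of the Bland-Altman regression and the Bradley-Blackwood test of Eq. (25) is sketched below; it assumes paired measurements as two lists, uses the standard library's linear_regression (Python 3.10+), and uses scipy only for the F-distribution tail probability.

    from statistics import linear_regression, mean, stdev
    from scipy.stats import f as f_dist

    def bradley_blackwood(y1, y2):
        d = [a - b for a, b in zip(y1, y2)]            # differences D
        a = [(x + y) / 2 for x, y in zip(y1, y2)]      # averages A
        fit = linear_regression(a, d)                  # D = alpha + beta * A
        resid = [di - (fit.intercept + fit.slope * ai) for ai, di in zip(a, d)]
        sse = sum(r * r for r in resid)
        n = len(d)
        f_stat = (n - 2) * (sum(di * di for di in d) - sse) / (2 * sse)   # Eq. (25)
        p_value = f_dist.sf(f_stat, 2, n - 2)
        return f_stat, p_value, mean(d), stdev(d)

Applied to the Scanner 1 versus Scanner 2 columns of Table 9, this should approximately reproduce the reported F = 0.6733 and p = 0.5367.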
4.3. Intraclass correlation coefficient

An alternative measure of agreement is the intraclass correlation coefficient (ICC),51 which is the proportion of the between-subject variance in the total variance, i.e. the sum of the between-subject, between-reader/technique, and error variances. More specifically, we assume that Yij = µ + pi + rj + (pr)ij + εij, with i indexing the individuals (i = 1, . . . , N) and j the readers/devices (j = 1, . . . , K). Here, Yij is the observation of the ith individual measured by the jth reader/scanner/machine; µ is the overall effect common to all observations; pi is the random patient effect; rj is the random reader/device effect; (pr)ij is the interaction between patient and reader/device; and εij is the measurement error. We assume that pi and rj are independent and follow normal distributions N(0, σ²_P) and N(0, σ²_R), respectively, and that εij is independent of pi and rj and follows N(0, σ²_e). Without duplicate observations, the interaction term (pr)ij cannot be separated from the measurement error and can be dropped. The intraclass correlation coefficient is defined as

  ICC = σ²_P / (σ²_P + σ²_R + σ²_e) .   (26)
Thus, a high ICC means less difference between two readers as well as less measurement error. Lee et al. suggested a cut-off value of 0.75, beyond which the readers or measurement devices are considered to be in agreement.51 The ICC can be estimated from the output of the ANOVA table of the two-way mixed model as follows:

  ρ_ICC = N(MSB − MSE) / [ N·MSB + K·MSR + (KN − K − N)·MSE ] .   (27)
Here, MSB, MSR, and MSE are the mean squares for between-subject, between-reader/device, and error, respectively. Fleiss and Shrout52 derived an approximate formula for the confidence interval of ρ_ICC. Let FU and FL be the upper and lower 100(1 − α/2)% percentiles, respectively, of the F distribution with degrees of freedom (N − 1) and v, where

  v = (K − 1)(N − 1){ Kρ_ICC·MSR/MSE + N[1 + (K − 1)ρ_ICC] − Kρ_ICC }² / { (N − 1)K²ρ²_ICC·MSR²/MSE² + { N[1 + (K − 1)ρ_ICC] − Kρ_ICC }² } .   (28)
Table 10.  ANOVA tables and ICC for the data in Example 4.

                      Agreement for
                      Scanners 1 and 2           Scanners 1 and 3           All 3 Scanners
Source                d.f.   MS                  d.f.   MS                  d.f.   MS
Between-subject       9      MSB = 0.02660       9      MSB = 0.02992       9      MSB = 0.04278
Between-scanner       1      MSR = 0.00001       1      MSR = 0.00008       2      MSR = 0.00008
Residual              9      MSE = 0.00005       9      MSE = 0.00010       18     MSE = 0.00010
Total                 19                         19                         29
v                     9.3622                     9.9615                     19.9237
ICC and 95% C.I.      0.9963 (0.9865, 0.9991)    0.9935 (0.9755, 0.9983)    0.9931 (0.9807, 0.9981)

Note: the expected mean squares are MSB = σ²_e + Kσ²_P, MSR = σ²_e + Nσ²_R, and MSE = σ²_e.
The approximate upper and lower bounds, ρU and ρL, of the 100(1 − α)% confidence interval for ρ_ICC are given by

  ρU = N(MSB − FL·MSE) / { FL[K·MSR + (NK − K − N)·MSE] + N·MSB }   (29)

and

  ρL = N(MSB − FU·MSE) / { FU[K·MSR + (NK − K − N)·MSE] + N·MSB } .   (30)
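A sketch of Eqs. (27)-(30) given the ANOVA mean squares is shown below; the F quantiles come from scipy, and the function name and arguments are illustrative.

    from scipy.stats import f as f_dist

    def icc_with_ci(msb, msr, mse, n_subjects, k_readers, alpha=0.05):
        N, K = n_subjects, k_readers
        rho = N * (msb - mse) / (N * msb + K * msr + (K * N - K - N) * mse)      # Eq. (27)
        num = (K - 1) * (N - 1) * (K * rho * msr / mse
                                   + N * (1 + (K - 1) * rho) - K * rho) ** 2
        den = ((N - 1) * K ** 2 * rho ** 2 * (msr / mse) ** 2
               + (N * (1 + (K - 1) * rho) - K * rho) ** 2)
        v = num / den                                                            # Eq. (28)
        fl = f_dist.ppf(alpha / 2, N - 1, v)
        fu = f_dist.ppf(1 - alpha / 2, N - 1, v)
        rho_u = N * (msb - fl * mse) / (fl * (K * msr + (N * K - K - N) * mse) + N * msb)
        rho_l = N * (msb - fu * mse) / (fu * (K * msr + (N * K - K - N) * mse) + N * msb)
        return rho, rho_l, rho_u

    # For Scanners 1 and 2 in Table 10: icc_with_ci(0.02660, 0.00001, 0.00005, 10, 2)
    # should give an ICC near 0.996 with an interval close to the one reported.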
Table 10 shows ANOVA tables for comparison of scanners in Example 4, and the corresponding intraclass correlation coefficients. It is clear from this example that ICC is less sensitive to agreement between two scanners. The ICC for Scanners 1 and 3 is much higher but the Bland-Altman regression shows significant disagreement. Bland and Altman53 list other deficiencies of ICC for evaluation of agreement, including its dependence on sample variations. On the other hand, it is easier to use ICC to evaluate agreement among three or more readers or devices. Bartko49 developed an altered version of ICC, which is simplified and has an exact formula for confidence intervals. 4.4. Kappa statistics for agreement of categorical variables Like continuous measurements, agreement between two categorical variables is only meaningful when the two categorical variables have the same biological or physical meanings. Agreement of categorical variables is most commonly applied to qualitative evaluations of health or disease status by two readers or by the same reader at two different sessions, which are referred as inter-reader and intra-reader agreement respectively. In clinical studies using qualitative assessments by multiple readers, we hope that all readers will produce consistent readings, and that their assessments will remain consistent during the study period. Thus, periodic review of inter- and intra-reader agreement should be a part of quality control of clinical trials. If the readers do disagree with each other, re-training is necessary. The simplest way to display categorical variables of two readers is a 2 × 2 table, displayed in Table 11. Here, X1 and X2 are results from two readers, with 0 indicating healthy and 1 indicating diseased, and Pij representing the probability of the event. There are many ways to measure the agreement of two readers. The probability of agreement, i.e., P (X1 = X2 ) = P00 + P11 is the most direct measurement. Analysis of the
Table 11.  Joint distribution of outcomes of two binary variables.

                                       X2
X1                     Healthy (X2 = 0)   Diseased (X2 = 1)   Total
Healthy (X1 = 0)       P00                P01                 P0+
Diseased (X1 = 1)      P10                P11                 P1+
Total                  P+0                P+1                 1
probability of agreement is just like the analysis of a binomial probability. Sample size calculations for reader agreement based on duplicated readings were presented by Freedman, Parmar, and Baker.54 The drawback of the probability of agreement is that there is a positive chance of agreement even when the two readers are independent. As a result, Cohen proposed the use of the Kappa statistic,55 which corrects the measurement of agreement for chance and is defined as follows:

  κ = (P00 + P11 − P0+P+0 − P1+P+1) / (1 − P0+P+0 − P1+P+1) = (PO − PE) / (1 − PE) .   (31)
2(n00 n11 − n01 n10 ) . n0+ n+1 + n+0 n1+
(32)
Here, PO = P00 + P11 is the observed probability of agreement and PE = P0+ P+0 + P1+ P+1 is the probability of agreement due to changes when X1 and X2 are independent. κ can reach 100% if there is perfect agreement and can be as low as −PE /(1 − PE ), when X1 and X2 are completely different. If we use nij to denote the observed number of subjects in each category of Table 11, the maximum likelihood estimates for Pij , Pi+ and P+j are pˆij = nij /n, pˆ+j = ni+ /n and pˆ+j = n+j /n, respectively, with n as the total number of subjects. Through algebra operations, we can estimate κ by substituting the maximum likelihood estimates of the probabilities into Eq. (31). κ ˆ=
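To make Eqs. (31) and (32) concrete, the short Python sketch below (the cell counts are hypothetical, chosen only for illustration) computes κ̂ both from the estimated probabilities and from the closed form in Eq. (32); the two values agree.

```python
import numpy as np

def cohen_kappa(table):
    """Cohen's kappa for a 2x2 table of counts, Eq. (31)."""
    n = table.sum()
    p = table / n                       # cell probabilities p_ij
    po = p[0, 0] + p[1, 1]              # observed agreement
    pe = p[0, :].sum() * p[:, 0].sum() + p[1, :].sum() * p[:, 1].sum()  # chance agreement
    return (po - pe) / (1 - pe)

# Hypothetical counts: rows = reader 1, columns = reader 2
counts = np.array([[40, 5],
                   [3, 52]])
kappa_prob = cohen_kappa(counts)

# Closed form, Eq. (32)
n00, n01, n10, n11 = counts[0, 0], counts[0, 1], counts[1, 0], counts[1, 1]
kappa_closed = 2 * (n00 * n11 - n01 * n10) / (
    (n00 + n01) * (n01 + n11) + (n00 + n10) * (n10 + n11))
print(kappa_prob, kappa_closed)   # identical values (about 0.84 for these counts)
```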
There are several methods for calculating the sampling variance of the MLE estimate in Eq. (32). Using the delta method, Fleiss et al.56 derived a large-sample variance of the estimator:

$$\operatorname{var}(\hat{\kappa}) = \frac{1}{n(1 - P_{0+}P_{+0} - P_{1+}P_{+1})^2}\left\{\sum_{i=0}^{1} P_{ii}\bigl[1 - (P_{i+} + P_{+i})(1-\kappa)\bigr]^2 + (1-\kappa)^2\sum_{i \neq j} P_{ij}(P_{+i} + P_{j+})^2 - \bigl[\kappa - (P_{0+}P_{+0} + P_{1+}P_{+1})(1-\kappa)\bigr]^2\right\}. \qquad (33)$$
Alternatively, Kraemer57 and Fleiss and Davies58 proposed the use of the jackknife technique to calculate the variance of the estimated κ. Let κ̂ij be the MLE of κ when one observation in the (i, j)th cell is excluded, and let Jij(κ̂) = nκ̂ − (n − 1)κ̂ij. The jackknife estimator of κ is given by

$$\hat{\kappa}_J = \frac{1}{n}\sum_{i=0}^{1}\sum_{j=0}^{1} n_{ij} J_{ij}(\hat{\kappa})\,, \qquad (34)$$

which should be a less biased estimator than κ̂. The jackknife variance can be estimated by

$$\operatorname{var}(\hat{\kappa}_J) = \sum_{i=0}^{1}\sum_{j=0}^{1} n_{ij}\bigl[J_{ij}(\hat{\kappa}) - \hat{\kappa}_J\bigr]^2 \big/ [n(n-1)]\,. \qquad (35)$$
Conditioned on the marginal distributions of the 2 × 2 table in Table 11, Garner59 proposed the following simpler formula:

$$\operatorname{var}(\hat{\kappa}) = \frac{4}{n^2(1 - \hat{p}_{0+}\hat{p}_{+0} - \hat{p}_{1+}\hat{p}_{+1})^2 \sum_{i=0}^{1}\sum_{j=0}^{1} 1/(n_{ij}+1)}\,. \qquad (36)$$
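The jackknife recipe of Eqs. (34) and (35) is easy to implement directly; a minimal Python sketch is given below, using the same hypothetical 2 × 2 counts as in the previous example. The leave-one-out estimate κ̂ij is obtained simply by subtracting one observation from cell (i, j).

```python
import numpy as np

def _kappa(t):
    # Cohen's kappa from a 2x2 table of counts (Eq. (31))
    n = t.sum(); p = t / n
    po = p[0, 0] + p[1, 1]
    pe = p[0].sum() * p[:, 0].sum() + p[1].sum() * p[:, 1].sum()
    return (po - pe) / (1 - pe)

def jackknife_kappa(table):
    """Jackknife estimate and variance of Cohen's kappa, Eqs. (34)-(35)."""
    n = table.sum()
    kappa_hat = _kappa(table)
    pseudo = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            reduced = table.astype(float).copy()
            reduced[i, j] -= 1.0                    # exclude one observation from cell (i, j)
            pseudo[i, j] = n * kappa_hat - (n - 1) * _kappa(reduced)   # J_ij(kappa_hat)
    kappa_j = (table * pseudo).sum() / n                               # Eq. (34)
    var_j = (table * (pseudo - kappa_j) ** 2).sum() / (n * (n - 1))    # Eq. (35)
    return kappa_j, var_j

counts = np.array([[40, 5], [3, 52]])   # same hypothetical counts as before
print(jackknife_kappa(counts))
```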
Although all these formulas are asymptotically equivalent, there are still differences when using them in small samples. A simulation study60 compared the different estimates of κ̂ and gave guidance on methods to estimate and construct confidence intervals for Cohen's κ̂ in small samples, as summarized in Table 12. In this table, "(" and ")" indicate the open ends of an interval and "[" and "]" the closed ends. Landis and Koch61 provided guidelines for interpreting kappa values as the level of agreement among readers. Prevalence was defined as (2n11 + n10 + n01)/(2n).62 The last column indicates the preferred equations for estimating the sample variance. The use of Kappa statistics in quality control and quality assurance is mainly for estimation rather than hypothesis testing. We want to ensure that the inter-reader agreement is above an acceptable pre-specified level before we start the study. We also want to be certain that the longitudinal intra-reader Kappa statistics remain above that level. However, the subject of Kappa applications is very broad and goes far beyond quality assurance.
Table 12. Guidance in selecting a method for constructing confidence intervals for Cohen's κ̂.60

Kappa (κ̂)    Agreement61                      Prevalence62            Sample Size     Equations
[0, 0.2)      Slight                           (0.1, 0.9)              n ≥ 20          (33)
[0.2, 0.4)    Fair                             [0.1, 0.9]              n ≥ 20          (33) or (35)
[0.4, 0.6)    Moderate                         (0.2, 0.8)              20 ≤ n < 40     (36)
                                               (0, 0.2] or [0.8, 1)    n ≥ 40          (35)
[0.6, 1)      Substantial to almost perfect    (0.1, 0.9)              n ≥ 20          (36)
The extensive literature on Kappa statistics includes agreement for ordinal or multinomial data,63–66 agreement in case-control studies,67 agreement for multiple readers or correlated samples,68–70 and the use of logistic regression models to adjust for the effects of covariates on Kappa statistics.71 These topics are far beyond the scope of this chapter; interested readers should consult the literature.

4.5. Log-linear models for agreement of categorical variables

Log-linear models can express agreement in terms of components, such as chance agreement and beyond-chance agreement. They can also display patterns of agreement among several observers, or compare patterns of agreement when subjects are stratified by values of a covariate.72 The latter is particularly useful in quality improvement for identifying factors that have an effect on reader agreement. Let {mij = nPij} denote the expected frequencies for ratings (i, j) of n subjects by two observers A and B. Chance agreement, or statistical independence of the ratings, has the log-linear model representation

$$\log m_{ij} = \mu + \lambda_i^A + \lambda_j^B\,. \qquad (37)$$
An extension of this independence model is the quasi-independence model73

$$\log m_{ij} = \mu + \lambda_i^A + \lambda_j^B + \delta_i I_{(i=j)}\,, \qquad (38)$$

where the indicator I(i=j) equals 1 when i = j and 0 otherwise. The constraints on the model parameters are Σi λi^A = Σj λj^B = 0. Conditional on disagreement by the observers, the rating by A is statistically independent of the rating by B. When δi > 0, more agreements on outcome i occur than would be expected by chance. The model is easy to fit with most statistical software; a sketch is given below.
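For instance, the quasi-independence model (38) can be fitted as a Poisson log-linear model for the cell counts. The Python sketch below uses statsmodels on a hypothetical 3 × 3 rating table (the model is most useful when there are more than two rating categories); the choice of factor coding only affects the parameterization, not the fitted expected frequencies.

```python
# Sketch: fitting the quasi-independence model (38) as a Poisson log-linear model.
# The 3x3 counts are hypothetical; "diag" carries the category-specific agreement
# terms delta_i * I(i=j), with the off-diagonal cells as the reference level.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

counts = [[30, 5, 1],
          [4, 25, 6],
          [2, 5, 22]]
rows = [{"count": counts[i][j], "a": i, "b": j,
         "diag": f"agree{i}" if i == j else "off"}
        for i in range(3) for j in range(3)]
cells = pd.DataFrame(rows)

model = smf.glm("count ~ C(a) + C(b) + C(diag, Treatment(reference='off'))",
                data=cells, family=sm.families.Poisson()).fit()
print(model.params)        # positive agree_i coefficients: beyond-chance agreement
print(model.fittedvalues)  # fitted expected frequencies m_ij
```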
When we assume a constant δi = δ, a Kappa-like index of chance-corrected agreement74 is

$$\kappa_A = (P_{00} + P_{11})(1 - e^{-\delta}) = (P_{00} + P_{11})\bigl(1 - 1/\sqrt{\mathrm{OR}}\bigr)\,. \qquad (39)$$

Graham extended the above model to allow binary covariates.75 Let X be a binary covariate with values 0 and 1, and let {mijk = nPij(X = k)} be the expected frequencies of ratings (i, j) by readers A and B when covariate X equals k. The extended model is

$$\log m_{ijk} = \mu + \lambda_i^A + \lambda_j^B + \lambda_k^X + \lambda_{ik}^{AX} + \lambda_{jk}^{BX} + \delta^{AB} I_{(i=j)} + \delta_k^{ABX} I_{(i=j)}\,. \qquad (40)$$
Here, terms with a single superscript and subscript correspond to main effects. Terms with double superscripts and subscripts represent partial associations between the superscripted variables, controlling for the variable omitted from the superscript. As with other log-linear models, we impose zero-sum constraints on the main effects and partial associations, respectively. In this model, δAB represents the overall agreement between the two readers and δkABX represents the additional chance-corrected agreement associated with covariate X when X = k. A model constraint is that the δkABX sum to zero. This model readily extends to situations with multiple covariates, and estimates can be obtained using the SAS CATMOD procedure. In model (40), the estimate of δkABX can be interpreted through the average of the two conditional agreement log odds ratios, log[(miik/mjik)/(mii0/mji0)] and log[(mjjk/mijk)/(mjj0/mij0)], for any pair of distinct categories i and j. In his paper,75 Graham applied this model to a study of the effects of age, sex, and proxy type on agreement between primary and proxy respondents regarding the primary respondent's participation in vigorous leisure-time activity.

4.6. Latent class models

In a latent class analysis of observer agreement, it is assumed that the ratings of the observers appear related because they are, in fact, related to some latent classification of items that explains all associations in the observed agreement table. For example, we can assume that there are three types of subjects in the study population: those that all readers classify as positive, those that all readers classify as negative, and the inconclusive subjects that are rated as positive or negative by chance by each reader.76 Let K be the prevalence of those with "agreement beyond chance," let p be the probability that a conclusive item belongs to the positive category, and let π be the probability of a positive rating for an inconclusive subject. With the assumption that the two readers rate the inconclusive subjects independently, Table 13 gives the probability distribution of the 2 × 2 table.
Table 13. Probability in a 2 × 2 table with the latent classification model.

                                     Rater B
Rater A      Positive                     Negative                            Total
Positive     Kp + (1 − K)π²               (1 − K)π(1 − π)                     Kp + (1 − K)π
Negative     (1 − K)(1 − π)π              K(1 − p) + (1 − K)(1 − π)²          K(1 − p) + (1 − K)(1 − π)
Total        Kp + (1 − K)π                K(1 − p) + (1 − K)(1 − π)           1
Thus, if p = π, K is Cohen's Kappa statistic. If p/(1 − p) = π²/(1 − π)², K equals Aickin's Kappa in Eq. (39). Latent classification models have many uses.76 Baker, Freedman, and Parmar77 proposed a model with duplicate observations that allows a simultaneous separation of intra- and inter-reader agreement for binary measures.

5. Calibration and Standardization

The most important mission of quality assurance is to prevent measurement errors from exceeding a pre-specified level. For this purpose, we evaluate the performance of instruments to ensure that their precision and accuracy are acceptable for clinical diagnosis or clinical monitoring. Once we have chosen the particular devices or methods to measure the study parameters, we want to be sure that they are equivalent to each other. During the study, we use process control charts to monitor whether the instruments still provide the required precision and/or whether the readers are giving consistent readings. At each step, we may still find disagreements between instruments or readers. Once we have chosen one of them as the reference standard, the process of assigning values to the other instruments or readers to correct their differences from the reference standard is called calibration. In multi-center studies, we normally choose the coordinating center as the reference standard. Thus, any site or machine that produces readings or measurements different from the reference standard will be calibrated. This is called cross-calibration in the literature on the quality control of clinical trials.41 Although mathematically any site can be chosen as the reference standard, in practice, selection of a reference
standard should take into consideration the qualifications and quality control history of the selected site. Sometimes, multiple reference standards are needed. For example, in a clinical trial of osteoporosis that uses DXA scanners from different manufacturers, one option is to select a reference standard for each manufacturer and then calibrate the devices at the other study sites to the corresponding reference standards. The next step of calibration is then to standardize among the reference standards. Calibration can also occur for a single radiological machine. In the longitudinal quality control process mentioned in Sec. 3, a radiological machine was compared to a standard defined by a phantom. We normally look for changes in mean and variance relative to the baseline value. One may also be interested in scale differences, i.e. changes in the measurement unit. For DXA scanners, phantoms with different linearly scaled densities can be used to serve as the reference standard, and calibration of a scanner may be needed if there are clinically significant deviations from that standard.
5.1. Calibration of measurements to a standard

To calibrate radiological equipment to the chosen standard, we need to measure the standard. One method is to measure phantoms with known theoretical measurement values.32 Another method is to measure a set of phantoms or a group of sampled subjects to examine the differences between the reference standard device and all other study instruments, referred to as cross-calibration in multi-center clinical trials.7 In all cases, we observe pairs of data (Xi, Yi), with Xi representing the reference standard and Yi representing the measurement of the instrument to be calibrated. The practical question is how to assign a correct X (standard value) based on a measurement Y. A naïve solution is to perform a (linear or nonlinear) regression of Xi on Yi and use that regression model to correct future readings of Y. This solution may be adequate, but it has statistical flaws. When we choose the standard, we assume that the standard is accurate, that is, its measurement error can be ignored. Thus, the measurement error should be associated only with Y, not X. A proper linear relationship should be Y = α + βX + ε, with α and β as regression parameters and ε as the random measurement error for Y. These regression parameters α and β are also referred to as the constant bias and the relative (scale) bias. Maximum likelihood estimates of the regression parameters, denoted as α̂ and β̂, and their covariance matrix, as well as the model RMSE, are easily
available from many statistical software packages. Based on these estimates, for a given observation y, we can calibrate it to the standard by x̂ = (y − α̂)/β̂. The predicted value x̂ is a biased estimate of the true value x except when x = X̄:

$$E(\hat{x}\mid y) = x + \frac{S_e^2 (x - \bar{X})}{S_{XX}\hat{\beta}^2}\,. \qquad (41)$$

Here, Se is the RMSE of the regression line, and X̄ and SXX are the sample mean and sample variance of the Xi's used to derive the calibration. The bias arises because x̂ is estimated by a ratio of correlated normal variables. In most cases, such bias can be ignored for large β̂; more specifically, when

$$g = \frac{t_{n-2,0.05}^2\, S_e^2}{S_{XX}\hat{\beta}^2} < 0.05\,. \qquad (42)$$
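The classical calibration just described is straightforward to code. The sketch below, on simulated data, fits Y = α + βX + ε, calibrates a new reading y back to the standard scale via x̂ = (y − α̂)/β̂, and reports the quantity g of Eq. (42) as a rough check on whether the calibration bias can be ignored. The simulated numbers, the use of the two-sided 5% t critical value, and the sum-of-squares convention for SXX are all assumptions made for illustration.

```python
# Sketch: classical calibration of an instrument Y against a standard X.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(0.6, 1.4, 20)                        # standard values, assumed error-free
y = 0.05 + 1.02 * x + rng.normal(0, 0.01, x.size)    # instrument readings with error

n = x.size
slope, intercept, r, p, se_slope = stats.linregress(x, y)
resid = y - (intercept + slope * x)
s_e = np.sqrt(np.sum(resid**2) / (n - 2))            # RMSE of the regression
s_xx = np.sum((x - x.mean())**2)                     # S_XX, taken here as the corrected sum of squares

y_new = 1.10                                         # a new instrument reading to be calibrated
x_hat = (y_new - intercept) / slope                  # calibrated value on the standard scale

t_crit = stats.t.ppf(0.975, n - 2)                   # assumed two-sided 5% critical value
g = (t_crit**2 * s_e**2) / (s_xx * slope**2)         # Eq. (42); small g => bias negligible
print(x_hat, g)
```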
When g > 0.2, we are not able to calibrate Y to the standard X with acceptable accuracy.22 Details of the 95% confidence interval for the calibrated x̂, as well as a simultaneous tolerance interval for it, can be found in the same reference. When we allow measurement errors for the standard X, we are dealing with the calibration problem as a regression with measurement errors, and the regression and calibration problems are equivalent mathematically. Rearrangement of the linear regression gives the following relationship between X and Y:

$$X = \gamma_0 + \gamma_1 Y + \delta\,. \qquad (43)$$
The difference between this calibration model and the regular regression model is that Y is a random variable, with Y = U + ε. This regression is not identifiable except under certain conditions.78 When we assume that the measurement error ε and the underlying true value U are independent, and that ε has mean zero and a known variance σε² (for example, estimated through repeated measurements), the calibration formula is

$$\hat{x} = \mu_X + \hat{\gamma}_1\frac{\sigma_Y^2}{\sigma_Y^2 - \sigma_\varepsilon^2}(y - \mu_Y) = \mu_X + \hat{\gamma}_1\frac{\sigma_U^2 + \sigma_\varepsilon^2}{\sigma_U^2}(y - \mu_Y)\,. \qquad (44)$$
Here, γ̂1 is the least-squares estimate of the slope based on the observed Y with measurement errors.

5.2. Comparative calibrations and latent structure models

Barnett79 first considered a model to assess "the relative calibration and relative accuracies of a set of p instruments, each designed to measure the same characteristic, on a common group of individuals." It is common for
several manufacturers to produce similar machines that measure the same physical properties. For various reasons, these machines will not produce identical measurements for the same subjects. Converting measurements between different manufacturers is important for clinical studies: it reduces machine-introduced variation, improving study efficiency and facilitating comparisons among different studies. For the ith subject, let the vector Yi = (Y1i, Y2i, ..., Ypi)T denote the measurements by the p instruments for that subject. Here, the superscript T represents "transpose." Statistically, we assume that Yi measures an underlying unobservable quantity Xi drawn from an unknown normal distribution N(μ0, σ0²). The relationship between Yi and Xi is

$$\mathbf{Y}_i = \mathbf{a} + \mathbf{b} X_i + \boldsymbol{\varepsilon}_i \qquad (45)$$
with unknown regression parameter vectors a and b, and εi a p-dimensional vector of random measurement errors following N(0, Σ). The difference between this model and the regular calibration model is that Xi can be observed in a regular calibration problem, while Xi is unknown in comparative calibration problems.80 The number of sufficient statistics based on the observations Yi is p means plus the p(p + 1)/2 elements of the covariance matrix. The number of unknown parameters is 2 for the distribution of X, 2p for the regression coefficients, and p(p + 1)/2 for the covariance matrix of the measurement errors. Thus, for p < 3, comparative calibration is unidentifiable. Even for p ≥ 3, we still need additional assumptions to make the model identifiable. Barnett79 assumed a1 = 0 and b1 = 1, and took the covariance matrix of measurement errors Σ to be a diagonal matrix. He used moment estimates to obtain the MLEs of the model parameters. Other authors have studied similar problems.81–84 The following EM algorithm is a shorter form of a more extended model by Lu et al.85 Like Barnett, we assume that Σ is a diagonal matrix. When we do not observe Xi, the log-likelihood of the model is rather complicated. The log-likelihood function of the observations Yi is

$$C - \frac{1}{2}\sum_{i=1}^{n}\log\bigl|\mathbf{b}\mathbf{b}^T\sigma_0^2 + \Sigma\bigr| - \frac{1}{2}\sum_{i=1}^{n}(\mathbf{Y}_i - \mathbf{a} - \mathbf{b}\mu_0)^T(\mathbf{b}\mathbf{b}^T\sigma_0^2 + \Sigma)^{-1}(\mathbf{Y}_i - \mathbf{a} - \mathbf{b}\mu_0)\,. \qquad (46)$$
To make the model identifiable, we also impose linear constraints on the regression parameters: lTa = c1 and lTb = c2. When l = (1, 0, ..., 0)T, c1 = 0 and c2 = 1, the model is similar to Barnett.79 When l = (1, 1, 1)T, c1 = 0 and c2 = 2.912, the model is similar to Lu et al.84 While the log-likelihood function is complicated, the log-likelihood for known Xi is rather simple:

$$C - \frac{p}{2}\log|\Sigma| - \frac{1}{2}\log\sigma_0^2 - \frac{1}{2}(\mathbf{Y}_i - \mathbf{a} - \mathbf{b}X_i)^T\Sigma^{-1}(\mathbf{Y}_i - \mathbf{a} - \mathbf{b}X_i) - \frac{(X_i - \mu_0)^2}{2\sigma_0^2}\,. \qquad (47)$$
Thus, we can treat Xi as missing data and use the EM algorithm to derive the MLEs of the model parameters. The EM algorithm has the following steps:

Step 0. Set initial values of the model parameters a, b, Σ, μ0 and σ0².

Step 1. E-step: Calculate the conditional expectations of the sufficient statistics for the complete likelihood function. They are

$$V = \operatorname{var}(X_i \mid \mathbf{Y}_i, \mathbf{a}, \mathbf{b}, \Sigma, \mu_0, \sigma_0^2) = (\mathbf{b}^T\Sigma^{-1}\mathbf{b} + 1/\sigma_0^2)^{-1}\,, \qquad (48)$$

$$E(X_i \mid \mathbf{Y}_i, \mathbf{a}, \mathbf{b}, \Sigma, \mu_0, \sigma_0^2) = \mu_0 + V\mathbf{b}^T\Sigma^{-1}(\mathbf{Y}_i - \mathbf{a} - \mathbf{b}\mu_0)\,. \qquad (49)$$

Step 2. M-step: Calculate the MLEs by substituting the conditional sufficient statistics into the following formulas:
$$\hat{\mathbf{b}} = \frac{\mathbf{S}_{Y,X} - (\lambda_1\bar{X} + \lambda_2)\Sigma^{-1}\mathbf{l}}{S_{XX}}\,, \qquad (50)$$

$$\hat{\mathbf{a}} = \bar{\mathbf{Y}} - \hat{\mathbf{b}}\bar{X} - \lambda_1\Sigma\mathbf{l}\,, \qquad (51)$$

$$\hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n}\operatorname{diag}\bigl[(\mathbf{Y}_i - \hat{\mathbf{a}} - \hat{\mathbf{b}}X_i)(\mathbf{Y}_i - \hat{\mathbf{a}} - \hat{\mathbf{b}}X_i)^T\bigr]\,, \qquad (52)$$

$$\hat{\mu}_0 = \bar{X}\,, \qquad (53)$$

$$\hat{\sigma}_0^2 = \sum_{i=1}^{n}(X_i - \hat{\mu}_0)^2/n\,. \qquad (54)$$
Here, Ȳ and X̄ are the sample means of the Yi and Xi, respectively; S_{Y,X} = (1/n) Σᵢ₌₁ⁿ (Xi − X̄)(Yi − Ȳ); and λ1 and λ2 are the Lagrange coefficients for the constrained maximization, with λ1 = (lTȲ − c1 − c2X̄)/(lTΣl) and λ2 = [lTS_{Y,X} + lTȲX̄ − (c1 + c2)X̄ − c2S_{X,X}]/(lTΣl).

Step 3. Check the convergence of the unconditional log-likelihood function and decide whether to stop or return to Step 1.
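The EM iteration above is compact enough to prototype directly. The Python sketch below is a simplified illustration, not the authors' implementation: instead of the Lagrange-multiplier M-step of Eqs. (50)–(51), it imposes identifiability in Barnett's form by rescaling so that a1 = 0 and b1 = 1 after each M-step, and it updates the remaining parameters from the usual complete-data sufficient statistics with the E-step moments E(Xi | Yi) and E(Xi² | Yi) = V + E(Xi | Yi)² plugged in. The data are simulated, not from the chapter.

```python
# Sketch: EM for the comparative-calibration model Y_i = a + b*X_i + eps_i,
# with diagonal Sigma and latent X_i ~ N(mu0, sigma0^2).
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
x_true = rng.normal(1.0, 0.12, n)                        # latent "true" values
a_true = np.array([0.0, 0.05, -0.03])
b_true = np.array([1.0, 0.92, 1.08])
sd_true = np.array([0.02, 0.03, 0.025])
Y = a_true + b_true * x_true[:, None] + rng.normal(0, sd_true, (n, p))

# Step 0: initial values
a, b = Y.mean(0) - Y.mean(), np.ones(p)
sig2 = Y.var(0)
mu0, s0 = Y.mean(), Y.var()

for _ in range(200):                                     # Step 3 would check the log-likelihood instead
    # Step 1, E-step: Eqs. (48)-(49)
    V = 1.0 / (np.sum(b**2 / sig2) + 1.0 / s0)
    Ex = mu0 + V * ((Y - a - b * mu0) @ (b / sig2))      # E(X_i | Y_i), length n
    Ex2 = V + Ex**2                                      # E(X_i^2 | Y_i)
    # Step 2, M-step: per-instrument regression on the imputed sufficient statistics
    sxx = Ex2.sum() - n * Ex.mean()**2
    sxy = Y.T @ Ex - n * Y.mean(0) * Ex.mean()
    b = sxy / sxx
    a = Y.mean(0) - b * Ex.mean()
    resid2 = (Y**2).sum(0) - 2 * a * Y.sum(0) - 2 * b * (Y.T @ Ex) \
             + n * a**2 + 2 * a * b * Ex.sum() + b**2 * Ex2.sum()
    sig2 = resid2 / n                                    # diagonal of Sigma, cf. Eq. (52)
    mu0, s0 = Ex.mean(), Ex2.sum() / n - Ex.mean()**2    # cf. Eqs. (53)-(54)
    # re-normalize so instrument 1 defines the scale: a[0] = 0, b[0] = 1
    mu0, s0 = a[0] + b[0] * mu0, b[0]**2 * s0
    a, b = a - b * a[0] / b[0], b / b[0]

print(np.round(a, 3), np.round(b, 3))                    # should be close to a_true and b_true

# Cross-calibration from instrument k to instrument l, Eq. (55)
k, l = 1, 2
y_l_hat = a[l] + b[l] * (Y[:, k] - a[k]) / b[k]
```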
Based on the MLEs, we can calibrate the unobserved underlying X from the measurements of any one instrument by inverse linear calibration. Moreover, the model allows us to convert measurements from instrument k to instrument l by the following formula:

$$\hat{Y}_{i,l} = a_l + b_l (Y_{i,k} - a_k)/b_k\,. \qquad (55)$$

Here, the subscript i indicates the ith subject and k, l index the instruments; ak, al, bk, and bl are the kth and lth components of the vectors a and b, respectively. A much simpler model arises for p = 3, where closed forms of the MLEs can be derived and the asymptotic covariance of the MLEs can be obtained explicitly.84 This model has been used for the standardization of bone mineral densities measured by three different manufacturers.84,86,87
⇀
⇀
0
0
⇀
⇀′
Alternatively, we define Y i = Y i − Y and Xi = G Y i + k. Here, k is a real ⇀ number and G is a p × p diagonal matrix, G = diag(gj ) with gj ≥ 0. If Xi is the standard references for instruments, there should be no differences between any pairs of its components. Let H be a p × p matrix 1 −1 0 ··· 0 0 0 1 −1 · · · 0 0 . H= · · · · · · · · · · · · · · · · · · 0 0 0 ··· 1 −1 −1
···
0
1
Hui et al.88 proposed to find gj ’s that minimize the differences between ⇀ components in vector Xi 88 : min
n X i=1
⇀
⇀
X Ti H T HXi = min
n X i=1
⇀
⇀
⇀
⇀
(Y 1 − Y )T GT H T HG(Y i − Y )
(56)
Pp 2 under constrains j=1 gj = p. Because of the quadratic constrains, the solution for minimization Eq. (56) is not in a closed form. Symbolic programming languages, such as Maple, can be used to calculate the numeric solutions. Like the latent structure models in the previous subsection, this model needs two constraints in order to make the model identifiable. The constant parameter k can be determined by a linear constraint as demonstrated in Hui et al.88
After we derive the solutions for the gj's, we can use the following formula to calibrate values between instruments:

$$Y_{i,j} = \bar{Y}_{\cdot,j} + \frac{g_k}{g_j}(Y_{i,k} - \bar{Y}_{\cdot,k})\,. \qquad (57)$$
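A numerical version of the constrained minimization (56) and the conversion formula (57) might look like the sketch below, which simulates three-instrument data and solves for g = (g1, ..., gp) with scipy under the constraint Σ gj² = p. This is an illustration of the approach under assumed data, not the symbolic solution used by Hui et al.

```python
# Sketch: least-squares comparative calibration, Eqs. (56)-(57), on simulated data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.normal(1.0, 0.12, 200)                       # latent true values (simulation only)
Y = np.array([0.00, 0.05, -0.03]) + np.array([1.0, 0.92, 1.08]) * x[:, None] \
    + rng.normal(0, 0.025, (200, 3))

p = Y.shape[1]
H = np.eye(p) - np.roll(np.eye(p), 1, axis=1)        # circulant difference matrix
Yc = Y - Y.mean(axis=0)                              # centered measurements Y'_i

def objective(g):
    X = Yc * g                                       # rows of X_i = G Y'_i (G diagonal)
    return np.sum((X @ H.T) ** 2)                    # sum_i X_i^T H^T H X_i

res = minimize(objective, x0=np.ones(p), method="SLSQP",
               bounds=[(0, None)] * p,
               constraints=[{"type": "eq", "fun": lambda g: np.sum(g**2) - p}])
g = res.x
print(np.round(g, 3))                                # roughly proportional to 1/slope of each instrument

# Conversion of readings from instrument k to the scale of instrument j, Eq. (57)
k, j = 0, 1
y_on_j = Y[:, j].mean() + (g[k] / g[j]) * (Y[:, k] - Y[:, k].mean())
```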
For p = 3, the calibration conversion formulas between instruments are the same for the least squares approach [Eq. (57)] and the latent structure model [Eq. (55)] if and only if the measurement error variances σj² of the instruments in the latent structure model are equal.84

6. Conclusions

Radiological instrument quality is important both for the clinical diagnosis of disease and for the clinical monitoring of patient changes. Quality assurance and quality improvement need the efforts of the people involved in manufacturing, maintaining, and operating the equipment, as well as of the statisticians involved in assessing quality, monitoring changes in quality and identifying areas for quality improvement. In this chapter, we have introduced some statistical concepts and methods that are commonly used in the quality assurance of radiological studies. There are many other materials and considerations that could not be covered because of the limitations of space. The methods discussed in this chapter have applications beyond radiological studies and are relevant to most clinical studies. Quality assurance and quality control are a practice rather than a theoretical discussion. Successful quality assurance can have visible and immediate effects. Statisticians should actively participate in quality assurance. While it is important for clinicians and biomedical researchers to realize the importance of statistics in their quality control and quality assurance practice, it is also important for biostatisticians to understand the subject-matter issues and to communicate statistical principles effectively to scientists from different backgrounds. Collaborations between statisticians and biomedical researchers in other fields will not only benefit clinical research but also lead to new challenges for the research and development of new statistical methods.

References

1. Huxsoll, J. F. (1994). Organization of quality assurance. In Quality Assurance for Biopharmaceuticals, ed. J.F. Huxsoll, John Wiley and Sons, Inc., New York: 2–13.
2. Therasse, P. A., Arbuck, S. G., Eisenhauer, E. A., Wanders, J., Kaplan, R. S., Rubinstein, L., Verweij, J., Van Glabbeke, M., van Oosterom, A. T., Christian,
M. C. and Gwyther, S. G. (2000). New guidelines to evaluate the response to treatment in solid tumors. Journal of the National Cancer Institute 92(3): 205–216.
3. WHO. (1994). Assessment of fracture risk and its application to screening for postmenopausal osteoporosis. Report of a WHO Study Group. World Health Organization, Geneva.
4. Siris, E. (2000). Alendronate in the treatment of osteoporosis: A review of the clinical trials. Journal of Womens Health and Gender-Based Medicine 9(6): 599–606.
5. Switula, D. (2000). Principles of good clinical practice (GCP) in clinical research. Sciences Ethics and Engineering 6(1): 71–77.
6. van Kuijk, C. (1998). Good clinical practice in clinical trials: What does it mean for a radiology department? Radiology 209(3): 625–627.
7. Fuerst, T., Lu, Y., Hans, D. and Genant, H. K. (1998). Quality assurance in bone densitometry. In Bone Densitometry and Osteoporosis, eds. H.K. Genant, G. Guglielmi and M. Jergas, Springer-Verlag, New York: 461–476.
8. Fraass, B. D. K., Hunt, M., Kutcher, G., Starkschall, G., Stern, R. and Van Dyke, J. (1998). American Association of Physicists in Medicine Radiation Therapy Committee Task Group 53: Quality assurance for clinical radiotherapy treatment planning. Medical Physics 25(10): 1773–1829.
9. Laurila, JS-N. C. G., Suramo, I., Tolppanen, E. M., Tervonen, O., Korhola, O. and Brommels, M. (2001). The efficacy of a continuous quality improvement (CQI) method in a radiological department. Comparison with non-CQI control material. Acta Radiologica 42(1): 96–100.
10. Genant, H., Wu, C., van Kuijk, C. and Nevitt, M. (1993). Vertebral fracture assessment using a semiquantitative technique. Journal of Bone and Mineral Research 8(9): 1137–1148.
11. Gluer, C., Blake, G., Lu, Y., Blunt, B., Jergas, M. and Genant, H. (1995). Accurate assessment of precision errors: How to measure the reproducibility of bone densitometry techniques. Osteoporosis International 5: 262–270.
12. Njeh, C. F., Nicholson, P. H. F. and Langton, C. M. (1999). The physics of ultrasound applied to bone. In Quantitative Ultrasound: Assessment of Osteoporosis and Bone Status, eds. C.F. Njeh, D. Hans, T. Fuerst, C.C. Gluer and H.K. Genant, Martin Dunitz Ltd., London: 420.
13. Gluer, C. and Genant, H. (1989). Impact of marrow fat on accuracy of quantitative CT. Journal of Computer Assisted Tomography 13(6): 1023–1035.
14. Jergas, M. and Uffmann, M. (1998). Basic considerations and definitions in bone densitometry. In Bone Densitometry and Osteoporosis, eds. H. Genant, G. Guglielmi and M. Jergas, Springer, New York: 269–290.
15. Liu, C.-Y. and Zheng, Z.-Y. (1989). Stabilization coefficient to random variable. Biometrical Journal 31(4): 431–441.
16. Miller, G. E. (1991). Asymptotic test statistics for coefficients of variation. Communication in Statistics — Theory and Methods 20(10): 3351–3363.
17. Feltz, C. J. and Miller, G. E. (1996). An asymptotic test for the equality of coefficient of variation from k populations. Statistics in Medicine 15: 647–658.
18. Fung, W. K. and Tsang, T. S. (1998). A simulation study comparing tests for the equality of coefficients of variation. Statistics in Medicine 17: 2003–2014.
19. Arenson, R., Lu, Y., Elliott, S., Jovais, C. and Avrin, D. (2001). Measuring the academic radiologist's clinical productivity. Academic Radiology 8: 524–532.
20. Efron, B. and Tibshirani, R. J. (1993). An Introduction to the Bootstrap, Chapman and Hall, San Francisco.
21. Shao, J. and Tu, D. (1995). The Jackknife and Bootstrap, Springer-Verlag, New York.
22. Strike, P. W. (1991). Statistical Methods in Laboratory Medicine, Butterworth-Heinemann Ltd., Oxford.
23. Quan, H. and Shih, W. J. (1996). Assessing reproducibility by the within-subject coefficient of variation with random effects models. Biometrics 52(4): 1195–1203.
24. Miller, C. G., Herd, R. J., Ramalingam, T., Fogelman, I. and Blake, G. M. (1993). Ultrasonic velocity measurements through the calcaneus: Which velocity should be measured? Osteoporosis International 3(1): 31–35.
25. Blake, G. M. and Fogelman, I. (1997). Technical principles of dual X-ray absorptiometry. Seminar in Nuclear Medicine 27(3): 210–228.
26. Carroll, R. J., Ruppert, D. and Stefanski, L. A. (1995). Measurement Error in Nonlinear Models, Chapman and Hall, London.
27. Machado, A., Hannon, R., Henry, Y. and Estell, R. (1997). Standardized coefficient of variation for dual X-ray absorptiometry (DXA), quantitative ultrasound (QUS) and markers of bone turnover (Abstract). Journal of Bone and Mineral Research 12(Suppl. 1): S258.
28. Langton, C. M. (1997). ZSD: A universal parameter for precision in the ultrasonic assessment of osteoporosis. Physiological Measurement 18: 67–72.
29. Cummings, S. R. and Black, D. (1986). Should perimenopausal women be screened for osteoporosis? Annals of Internal Medicine 104: 817–823.
30. Glüer, C. (1999). Monitoring skeletal changes by radiological techniques. Journal of Bone and Mineral Research 14(11): 1952–1962.
31. Faulkner, K. M., MR. (1995). Quality control of DXA instruments in multicenter trials. Osteoporosis International 5(4): 218–227.
32. Kalender, W., Felsenberg, D., Genant, H. K., Fischer, M., Dequeker, J. and Reeve, J. (1995). The European spine phantom — A tool for standardization and quality control in spine bone mineral measurements by DXA and QCT. European Journal of Radiology 20: 83–92.
33. Anderson, J. W. and Clarke, G. D. (2000). Choice of phantom material and test protocols to determine radiation exposure rates for fluoroscopy. Radiographics 20(4): 1033–1042.
34. Lu, Y., Mathur, A. K., Blunt, B. A. et al. (1996). Dual X-ray absorptiometry quality control: Comparison of visual examination and process-control charts. Journal of Bone and Mineral Research 11(5): 626–637.
35. Montgomery, D. C. (1992). Introduction to Statistical Quality Control, 2nd edn., Wiley, New York.
36. Orwoll, E. S., Oviatt, S. K. and Biddle, J. A. (1993). Precision of dual-energy X-ray absorptiometry: Development of quality control rules and their application in longitudinal studies. Journal of Bone and Mineral Research 8(6): 693–699.
37. Jergas, M. and Genant, H. K. (1993). Current methods and recent advances in the diagnosis of osteoporosis. Arthritis and Rheumatism 36(12): 1649–1662.
38. SAS/QC (2000). User's Guide (for SAS V8), SAS Research Institute, Cary, North Carolina.
39. Ryan, T. P. (1989). Statistical Methods for Quality Improvement, Wiley, New York.
40. Pearson, D. C. and Gawte, S. A. (1997). Long-term quality control of DXA: A comparison of Shewhart rules and Cusum charts. Osteoporosis International 7(4): 338–343.
41. Lu, Y., Mathur, A. K., Gluer, C. C. et al. (1995). Application of statistical quality control method in multicenter osteoporosis clinical trials. International Conference on Statistical Methods and Statistical Computation for Quality and Productivity Improvement, Seoul, Korea, 474–480.
42. Wheeler, D. J. (1991). Shewhart's Chart: Myths, Facts, and Competitors. 45th Annual Quality Congress Transactions: American Society for Quality Control, 533–538.
43. Johnson, R. A. and Bagshaw, M. (1974). The effect of serial correlation on the performance of CUSUM tests. Technometrics 16: 103–112.
44. Kaminsky, F. C., Maleyeff, J., Providence, S., Purinton, E. and Waryasz, M. (1997). Using SPC (statistical process control) to analyze quality indicators in a healthcare organization. Journal of Healthcare Risk Management 17(4): 14–22.
45. Quesenberry, C. P. (2000). Statistical process control geometric Q-chart for nosocomial infection surveillance. American Journal of Infection Control 28(4): 314–320.
46. Thompson, J. R. and Koronacki, J. (1993). Statistical Process Control for Quality Improvement, Chapman and Hall, New York.
47. Bland, J. M. and Altman, D. G. (1986). Statistical methods for assessing agreement between two methods of clinical measurement. The Lancet i: 307–310.
48. Bland, J. M. and Altman, D. G. (1995). Comparing two methods of clinical measurement: A personal history. International Journal of Epidemiology 24(Suppl. 1): S7–S14.
49. Bartko, J. J. (1994). General methodology II measures of agreement: A single procedure. Statistics in Medicine 13: 737–745.
50. Bradley, E. L. and Blackwood, L. G. (1989). Comparing paired data: A simultaneous test of means and variances. The American Statistician 43: 234–235.
51. Lee, J., Koh, D. and Ong, C. N. (1989). Statistical evaluation of agreement between two methods for measuring a quantitative variable. Computers in Biology and Medicine 19: 61–70.
52. Fleiss, J. L. and Shrout, P. E. (1978). Approximate interval estimation for a certain intraclass correlation coefficient. Psychometrika 43: 259–262.
53. Bland, J. M. and Altman, D. G. A note on the use of the intraclass correlation coefficient in the evaluation of agreement between two methods of measurements. Computers in Biology and Medicine 20(5): 337–340.
54. Freedman, L. S., Parmar, M. K. B. and Baker, S. G. (1993). The design of observer agreement studies with binary assessments. Statistics in Medicine 12: 165–179.
55. Cohen, J. A. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20: 37–46.
56. Fleiss, J. L., Cohen, J. and Everitt, B. S. (1969). Large-sample standard errors of kappa and weighted kappa. Psychological Bulletin 72: 323–327.
57. Kraemer, H. C. (1980). Extension of the kappa coefficient. Biometrics 36: 207–216.
58. Fleiss, J. L. and Davies, M. (1982). Jackknifing functions of multinomial frequencies, with an application to a measure of concordance. American Journal of Epidemiology 115: 841–845.
59. Garner, J. B. (1991). The standard error of Cohen's Kappa. Statistics in Medicine 10: 767–775.
60. Blackman, N. J.-M. and Koval, J. J. (2000). Interval estimation for Cohen's kappa as a measure of agreement. Statistics in Medicine 19: 723–741.
61. Landis, J. R. and Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics 33: 159–174.
62. Bloch, D. A. and Kraemer, H. C. (1989). 2 × 2 kappa coefficients: Measures of agreement or association. Biometrics 45: 269–287.
63. Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin 70(4): 213–219.
64. Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions, 2nd edn., Wiley, New York.
65. Barlow, W., Lai, M.-Y. and Azen, S. P. (1991). A comparison of methods for calculating a stratified kappa. Statistics in Medicine 10: 1465–1472.
66. Donner, A. and Eliasziw, M. (1997). A hierarchical approach to inferences concerning interobserver agreement for multinomial data. Statistics in Medicine 16: 1097–1106.
67. Kraemer, H. C. and Bloch, D. A. (1990). A note on case-control sampling to estimate kappa coefficients. Biometrics 46(1): 49–59.
68. Posner, K. L., Sampson, P. D., Caplan, R. A., Ward, R. J. and Cheney, F. W. (1990). Measuring interrater reliability among multiple raters: An example of methods for nominal data. Statistics in Medicine 9: 1103–1115.
69. Oden, N. L. (1991). Estimating kappa from binocular data. Statistics in Medicine 10: 1303–1311.
70. Shoukri, M. M. and Martin, S. W. (1995). Maximum likelihood estimation of the kappa coefficient from models of matched binary responses. Statistics in Medicine 14: 83–99.
71. Shoukri, M. M. and Mian, I. U. H. (1996). Maximum likelihood estimation of the kappa coefficient from bivariate logistic regression. Statistics in Medicine 15: 1409–1419.
72. Agresti, A. (1992). Modelling patterns of agreement and disagreement. Statistical Methods in Medical Research 1: 201–218.
73. Tanner, M. A. and Young, M. A. (1985). Modelling agreement among raters. Journal of American Statistical Association 80: 175–180.
74. Aickin, M. (1990). Maximum likelihood estimation of agreement in the constant predictive probability model, and its relation to Cohen's kappa. Biometrics 46: 293–302.
75. Graham, P. (1995). Modelling covariate effects in observer agreement studies: The case of nominal scale agreement. Statistics in Medicine 14: 299–310.
76. Guggenmoos-Holzmann, I. and Vonk, R. (1998). Kappa-like indices of observer agreement viewed from a latent class perspective. Statistics in Medicine 17: 797–812.
77. Baker, S. G., Freedman, L. S. and Parmar, M. K. B. (1991). Using replicate observations in observer agreement studies with binary assessments. Biometrics 47(4): 1327–1338.
78. Cheng, C.-L. and Van Ness, J. W. (1999). Statistical Regression with Measurement Error, Arnold, London.
79. Barnett, D. V. (1969). Simultaneous pairwise linear structural relationships. Biometrics 28: 129–142.
80. Theobald, C. M. and Mallinson, J. R. (1978). Comparative calibration, linear structural relationships and congeneric measurements. Biometrics 34: 39–45.
81. Fuller, W. A. (1987). Measurement Error Models, Wiley, New York.
82. Dunn, G. (1989). Design and Analysis of Reliability Studies, Oxford University Press, New York.
83. Kimura, D. K. (1992). Functional comparative calibration using EM algorithm. Biometrics 48: 1263–1271.
84. Lu, Y., Ye, K., Mathur, A., Hui, S., Fuerst, T. P. and Genant, H. K. (1997). Comparative calibration without a gold standard. Statistics in Medicine 16: 1889–1905.
85. Lu, Y., Ye, K., Mathur, A. K., Srivastav, S. K., Yang, S. and Genant, H. K. (1997). Application of random effects models in comparative calibration. Proceedings of the Biometrics Section of American Statistical Association, 170–176.
86. Hanson, J. (1997). Standardization of femur BMD [letter]. Journal of Bone and Mineral Research 12(8): 1316–1317.
87. Lu, Y., Fuerst, T., Hui, S. and Genant, H. K. (2001). Standardization of bone mineral density at femoral neck, trochanter and Ward's triangle. Osteoporosis International 12: 438–444.
88. Hui, S. L., Gao, S., Zhou, X.-H. et al. (1997). Universal standardization of bone density measurements: A method with optimal properties for calibration among several instruments. Journal of Bone and Mineral Research 12(9): 1463–1470.
About the Author

Ying Lu is an associate professor of Radiology in the Department of Radiology and the director of the Biostatistics Core, UCSF Comprehensive Cancer Center, University of California, San Francisco. He was an assistant professor in the same department (1994–1998) and an assistant professor of epidemiology and public health at the University of Miami School of Medicine (1990–1993). Dr. Lu received his BS in mathematics from Fudan University (1982), his MS in applied mathematics from Shanghai Jiao Tong University (1984), and his PhD in Biostatistics from the University of California, Berkeley (1990). At Berkeley, he received university fellowships (1985–1988), the Evelyn Fix Memorial Medal (1990), and the Public Health Alumni Association Scholarship (1989). Dr. Lu has authored or co-authored more than 80 peer-reviewed articles and 4 book chapters on statistical methods for animal carcinogenicity experiments, medical diagnostic tests, and outcome prediction, as well as in the clinical research areas of radiology, osteoporosis, and cancer clinical trials. His papers have been published in Biometrics, Statistics in Medicine, Mathematical Biosciences, Medical Decision Making, Radiology, Journal of Bone and Mineral Research, Cancer, etc.
CHAPTER 5
COST-EFFECTIVENESS ANALYSIS AND EVIDENCE-BASED MEDICINE

JIANLI LI

Department of Corporate Performance, St Michael's Hospital, 30 Bond Street, Toronto, Ontario, M5B 1W8, Canada
Tel: 416-864-6060 ext 6152; [email protected]
1. Introduction

Over the past decades, as pressures to control health care spending have accelerated, the term "cost-effectiveness" has come increasingly into common parlance. It is widely used by groups as disparate as the government, Congress, the business community, managed-care organizations, the pharmaceutical industry and the press. The central purpose of cost-effectiveness analysis (CEA) is to compare the relative value of different interventions in creating better health and/or longer life. The results of such evaluations are typically summarized in a cost-effectiveness ratio, where the denominator reflects the gain in health from a candidate intervention (measured, for example, in terms of years of life gained, premature births averted, sight-years gained, or symptom-free days gained) and the numerator reflects the cost of obtaining the health gain. A cost-effectiveness analysis provides information that can help decision makers sort through alternatives and decide which one best serves their programmatic and financial needs. Decision makers may be federal, state or local. They may be in the private sector or the public sector. They may control dollars or they may run programs. CEA provides a framework within which decision makers may pose a range of questions. Cost-effectiveness analyses furnish information that can be useful in a variety of settings. For example, a managed-care organization might wish to know the cost per low-birthweight birth averted as a consequence of a prenatal outreach program. Or it might wish to take the question further
and ask the cost of this program per year of life saved for its enrolled population. Or, recognizing that programs that avert premature births may not primarily save lives but rather avert disability over the lifetime of an individual, it might want to know the cost of this intervention for each quality-adjusted life year (QALY) gained. This latter question is addressed by a particular type of CEA, sometimes termed "cost-utility analysis," where adjustments for the value assigned to health-related quality of life are built into the calculation. As another example, a pharmaceutical manufacturer might wish to use CEA in pricing and marketing a new cholesterol-lowering drug. It might ask: how much does our medication cost per year of life gained compared with a similar product manufactured by a different company? Or, if the clinical trials show clinically insignificant changes in cholesterol level between the two products but significantly decreased side effects associated with the new drug, a drug purchaser or payer might wish to calculate the cost per quality-adjusted life year (QALY) gained in using the new drug. An industry investigator might decide to extend the considerations of the analysis and explore the cost per year of life or QALY gained when comparing pharmaceutical treatment with surgical treatment for coronary disease. Or, a state health department might wish to explore different strategies for the control of blood lead levels in the population. It might choose to assess the cost-effectiveness of screening all children, compared to screening only those thought to be at particular risk for elevated lead levels by reason of housing or environmental surroundings.

1.1. Worked examples

1.1.1. Bypass angioplasty revascularization investigation

Percutaneous transluminal coronary angioplasty was introduced in 1977 as a less invasive alternative to coronary-artery bypass surgery. Several randomized clinical trials of angioplasty and bypass surgery have compared the clinical outcomes of these procedures. The Bypass Angioplasty Revascularization Investigation (BARI) was a large trial of angioplasty and bypass surgery in the US, which collected five years of follow-up data. Mark A. Hlatky et al.8 conducted a study on a total of 934 of the 1829 patients enrolled in the randomized BARI. Detailed data on quality of life were collected annually, and economic data were collected quarterly. They compared quality of life, employment, and medical care costs during
five years of follow-up among patients treated with angioplasty or bypass surgery. They found that, on average, functional status, as assessed by scores on the Duke Activity Status Index, was improved more with bypass surgery than with angioplasty in the first three years (p < 0.05), whereas in other respects quality of life was equivalent with either method of revascularization. Patients in the angioplasty group returned to work five weeks sooner than patients in the surgery group (p < 0.001). The cost of angioplasty was initially $11,234 lower than that of bypass surgery (a 35% saving, p < 0.001), but higher subsequent costs for hospitalization and medication reduced the saving to $2,644 at five years (a 5% saving, p = 0.047). The five-year cost of angioplasty was significantly lower than that of surgery among patients with two-vessel disease ($52,930 versus $58,498, p < 0.05), but not among patients with three-vessel disease. After five years of follow-up, surgery had an overall cost-effectiveness ratio of $26,177 per year of life added, but unacceptable ratios of $100,000 or more per year of life added could not be excluded (p = 0.13). Surgery appeared particularly cost-effective in treating patients with diabetes because of their significantly improved survival.
1.1.2. Treatment of high blood cholesterol

In 1985, in response to the first evidence from a randomized controlled trial that reducing cholesterol reduces the risk of death from heart disease,10 the US National Institutes of Health created the National Cholesterol Education Program (NCEP). Three years later the NCEP published guidelines for the management of high blood cholesterol which recommended that all adults have their cholesterol checked at least every 5 years and that those with high levels (240 mg/dl or higher), or borderline-high levels (200–239 mg/dl) plus other risk factors, be tested further. It was suggested that those whose low-density lipoprotein (LDL) levels were also high should be treated by changes in diet or with cholesterol-lowering drugs.11 It has been estimated that more than one-third of the adult population requires dietary change and/or drugs when judged by these criteria.15 Cost-effectiveness analyses done in the wake of the 1988 guideline focused on the management of high blood cholesterol once detected. Both lovastatin, a frequently prescribed drug, and dietary counseling were shown to vary widely in cost-effectiveness depending on age and other risk factors for heart disease.
One study examined the use of lovastatin for people initially free of heart disease and for those who had already suffered a heart attack.7 The authors found that, for healthy people, saving a year of life is much more costly among those with cholesterol as their only risk factor than it is for those with several risk factors, even when cholesterol is very high; the cost ranged up to $330,000 for men aged 35–44 with no other risk factor and up to $1.5 million for women in the same category. The cost was considerably lower for people with other risk factors, reflecting the widely accepted assumption that risk factors interact to make the adverse effects of any one greater when others are present. Lovastatin treatment was still more costly per life year gained for people with levels in the range 250–299 mg/dl. By contrast, the study found that it is potentially very cost-effective to treat people with elevated cholesterol who have had heart attacks. Costs per life year gained are relatively low, and for some, such as men aged 35–44, drug treatment might save money as well as extend life. Another study found similar results for a program of intensive diet therapy modeled after the one in the Multiple Risk Factor Intervention Trial (MRFIT).18 For example, diet therapy costs more than $500,000 per life year for 20-year-old men with initial cholesterol of 240 mg/dl and no other risk factors. For men with several risk factors, the cost per life year gained is much lower. These results suggest that management of high cholesterol in people without heart disease is often very costly per life year saved. Since they show that treatment of people whose blood cholesterol levels are not far above 240 mg/dl can be extremely costly, they suggest that the same would be true for people with levels in the borderline-high range, although the studies did not analyze this group. Taken together, cost-effectiveness results suggest that resources might better be concentrated on those with very high cholesterol levels and/or other risk factors for heart disease (and on those in whom heart disease is already present). Revised guidelines, published by the NCEP in 1993,12 were somewhat more modest in their aims, in response to studies like these as well as to ongoing debate over whether reducing cholesterol lengthens life in those without heart disease. If the NCEP's 1988 guidelines were followed to the letter, it would cost, depending on the effectiveness of diet in reducing blood cholesterol levels, $20 billion to $27 billion to provide lovastatin at a dose of 20 mg per day, and $47 billion to $67 billion to provide a higher, more effective dose of 80 mg per day.6 The savings from a more selective strategy would be substantial, freeing resources to be applied elsewhere. The CEA results suggest that
more selective treatment strategies could be designed that would lose little in health benefits.

2. Foundations of Cost-Effectiveness Analysis

2.1. What is cost-effectiveness analysis?

Cost-effectiveness analysis is a method designed to assess the comparative impacts of expenditures on different health interventions. As Weinstein and Stason19 state, it is based on the premise that "for any given level of resources available, society ... wishes to maximize the total aggregate health benefits conferred." For example, we might wish to know whether spending a certain amount of money on a public campaign to stop smoking will have a greater or lesser effect on health than spending the same amount on colorectal screening. Cost-effectiveness analysis can be used in decision making at different levels, such as the societal level and the organizational level.

2.2. The cost-effectiveness ratio

The central measure used in CEA is the cost-effectiveness ratio. Implicit in the cost-effectiveness ratio is a comparison between alternatives. One alternative is the intervention under study, while the other is a suitably chosen alternative ("usual care," another intervention, or no intervention). The cost-effectiveness ratio for comparing the two alternatives at the population level can be the ratio of expected cost to expected effect (CER), E(c)/E(e), or the ratio of incremental expected costs to incremental expected effects (ICER), (E(ci) − E(cj))/(E(ei) − E(ej)), written ∆E(c)/∆E(e). The ratio ∆E(c)/∆E(e) is essentially the incremental price of obtaining a unit health effect (such as dollars per year, or per quality-adjusted year, of life expectancy) from a given health intervention when compared with an alternative. The following situations can arise. If ∆E(c) < 0 and ∆E(e) > 0, the given intervention is dominant and should be accepted; if ∆E(c) > 0 and ∆E(e) < 0, it is dominated and should be rejected; if ∆E(c) > 0 and ∆E(e) > 0, or ∆E(c) < 0 and ∆E(e) < 0, there is a trade-off, and one must consider the magnitude of the ratio of the difference in costs to the difference in effectiveness. The expected ratio of cost to effect, E(c/e), can be investigated at the patient level.
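The decision rules just described are trivial to encode; the following Python sketch (with made-up cost and effect values) computes the ICER ∆E(c)/∆E(e) for an intervention versus its comparator and applies the classification above.

```python
def classify_icer(delta_cost, delta_effect):
    """Classify an intervention by the sign pattern of incremental cost and effect."""
    if delta_cost < 0 and delta_effect > 0:
        return "dominant: accept", None
    if delta_cost > 0 and delta_effect < 0:
        return "dominated: reject", None
    # both positive or both negative: trade-off, judge the ICER magnitude
    return "trade-off", delta_cost / delta_effect

# Hypothetical example: the intervention costs $12,000 more and adds 0.4 QALYs
decision, icer = classify_icer(12000, 0.4)
print(decision, icer)   # trade-off, $30,000 per QALY gained
```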
2.3. The effectiveness

Effectiveness is the extent to which medical interventions achieve health improvements in real practice settings.

2.3.1. Individual and social well-being

By describing CEA as a tool for improving general welfare, it can be placed squarely within the context of welfare economics. Effectiveness measures can be quantified in terms of utility, such as quality-adjusted life years (QALYs), or in terms of health status measures, such as the number of symptom-free days.

2.3.2. A metric of health effect: Quality-adjusted life years

It may appear that CEA cannot even be used to compare interventions whose effects on health are qualitatively different, such as prevention of coronary artery disease and treatment of arthritis. However, such a comparison is possible if the measure of effectiveness is general enough to capture all of the important health dimensions of the effects of the interventions. Using the quality-adjusted life year (QALY) as the unit of effectiveness approaches this ideal within the framework of CEA, thus expanding considerably the range of application of CEA. The QALY is a measure of health outcome which assigns to each period of time a weight, ranging from 0 to 1, corresponding to the quality of life during that period, where a weight of 1 corresponds to perfect health and a weight of 0 corresponds to a health state judged equivalent to death. The number of quality-adjusted life years, then, represents the number of healthy years of life that are valued equivalently to the actual health outcome.

2.3.3. How to obtain evidence on effectiveness?

The foundation for economic evaluation is valid data on the effectiveness of the intervention being evaluated relative to some alternative. The true cost and effectiveness of an intervention are usually not known but estimated. The source of the estimates may be direct measurement (sampling) or indirect (non-sampling) methods such as expert opinion and the published literature. There can be two types of data: sampled data, where the sampling variance may or may not be known, and non-sampled data, such as the discount rate, which do not have sampling variation, although
the true value of the parameter may be uncertain. These data can be used in various combinations in two modes of analysis: stochastic analysis, where inferences are drawn using standard statistical methods based on sampling variation, and deterministic analysis, where inferences are drawn from point estimates of variables but interpretation is conditional upon the range of uncertainty from sensitivity analysis. The appropriateness of methods for analyzing uncertainty in costs or effects will depend upon the mix of sampled and non-sampled data. Cost-effectiveness analysis can be wholly deterministic, partially stochastic or wholly stochastic.

2.3.4. Deterministic cost-effectiveness analysis

This is used where cost and effect variables are analyzed as point estimates. Sampling variation may not be available because of the source of the data (e.g. secondary data) or because the variable was not sampled (e.g. choice of discount rate, expert opinion). Deterministic CEA models arise frequently in the early assessment of a new medical technology, where only limited data are available but some analysis is required for policy setting. For example, in their analysis of the implantable defibrillator, Kupperman et al.9 constructed a cost-effectiveness model where effect data were taken from reports of patient series in the literature as point estimates of survival probabilities, and cost data were derived from a Medicare claims database and expert opinion. Given these data it was not possible to present cost and effect differences with 95% confidence intervals; therefore a deterministic point estimate of cost-effectiveness was subjected to detailed sensitivity analysis to explore the impact of uncertainty. A point estimate based on expert opinion of resource use was used as a proxy for variables that could be sampled in the future as part of a prospective study.

2.3.5. Partially stochastic cost-effectiveness analysis

This is used where effectiveness has been estimated from clinical trial(s) and can be expressed as a mean effect size with an associated variance, but the analysis of costs is deterministic because the cost data are non-sampled. This combination is common in decision-analytic models of economic appraisal. Some studies with such data report confidence intervals for cost-effectiveness where only the variation in effects has been analyzed. For example, a study of ulcer maintenance therapy presented 95% confidence intervals around expected one-year therapy costs including relapse management. But no primary data had been collected to determine variation between patients in
costs of managing relapse. The source of variation for the confidence interval was only that surrounding the estimated incidence of relapse on treatment and control.

2.3.6. Wholly stochastic cost-effectiveness analysis

This is used where both costs and effects are determined from data sampled from the same patients in a study. Although our discussion focuses on randomized controlled trials (RCTs), these data might also be obtained from non-experimental designs. If cost and effect data are sampled and variances are available, then formal statistical tests can be performed on observed differences in costs (treatment − control) or effects. Randomized controlled trials are one valuable source of evidence on effectiveness, used either as single studies or combined in a meta-analysis. There are two general ways in which RCT data can be incorporated into economic evaluation: (i) combining RCT effectiveness data retrospectively with cost data from secondary non-trial sources in a decision analysis model; or (ii) collecting effectiveness and cost data on the same patients prospectively as part of an RCT. The growing interest in trial-based prospective cost-effectiveness studies has raised some interesting statistical questions of study design and analysis. Given the traditional use of non-sampled secondary data (e.g. published literature, insurance claims databases, expert opinion) in cost-effectiveness models, the convention for analyzing uncertainty in results has been to use sensitivity analysis, where the robustness of results is explored over a range of "what if" alternative values for uncertain variables. This analytical approach is in marked contrast to the conventional analysis of RCT effectiveness data, where standard principles of statistical inference are used to construct tests of hypotheses and to estimate intervention effect sizes, and where uncertainty is quantified by a confidence interval which has a precise meaning in terms of probability.

2.4. Sensitivity analysis and beyond

Before considering the adaptation of stochastic methods for economic evaluation, it is necessary to review the limitations of sensitivity analysis. This method is widely recommended for assessing problems of data uncertainty in economic appraisals of health care programs and allied evaluative techniques such as clinical decision analysis. The purpose is to examine the robustness of an estimated result over a range of alternative values for
uncertain parameters. Weinstein and Stason (1977) describe the method in the following way: "The most uncertain features and assumptions... are varied one at a time over a wide range of possible values. If the basic conclusions do not change when a particular feature or assumption is varied, confidence in the conclusions is increased." Whereas traditional CEA models rely on sensitivity analysis, the mean-variance data on costs and effects from a prospective trial present the opportunity to analyze cost-effectiveness using conventional inferential statistical methods.13 Statistical approaches to CEA have been discussed widely in the literature.

3. Statistical Approach

3.1. Costs and effects as point estimates

The deterministic analysis of effectiveness is a comparison of point estimates. If we consider a treatment that is both more costly and more effective than control, then a useful way to represent incremental cost-effectiveness is illustrated in Fig. 1. In this diagram, the x axis represents the difference in effects between the experimental and control therapy (∆e) and the y axis the difference in cost between experimental and control (∆c). The slope of the line extending from the origin (the control) through our study point estimate, (∆e, ∆c), represents the incremental cost-effectiveness of the treatment relative to control. Clearly, the steeper the slope of the line ∆c/∆e, the
Fig. 1. Cost-effectiveness quasi-confidence interval: Deterministic analysis of cost differences and stochastic analysis of effectiveness differences.
greater is the additional cost at which additional units of effectiveness are gained by treatment relative to control, and the less attractive the treatment becomes. In the absence of any data on sampling variation for costs or effects (point a), some form of sensitivity analysis would be useful to determine plausible ranges that may contain the true cost-effectiveness ratio.

3.2. Sampled effectiveness and non-sampled costs

In the analysis of sampled effect data (with sampling variation) the null hypothesis is usually that there is no difference between experimental and control therapy. This is tested against either a one-tailed alternative (usually that the experimental treatment is more effective) or a two-tailed alternative (that the experimental treatment is more or less effective than control). For a continuous clinical variable such as blood pressure we assume, by convention, that the ratio of the difference in sample means (ē_T − ē_C), where the subscript T stands for the treatment group and subscript C for the control group, to the pooled standard error of the difference follows some known probability distribution such as Z or t. Critical values of the test statistic are determined by the analyst's judgement about the acceptable risk of making a Type 1 (false-positive) error about a difference existing, this level conventionally being set to 5%. A problem with hypothesis testing as a form of stochastic analysis is that an overemphasis tends to be placed on statistical significance. The advantages of the confidence interval are two-fold. First, it permits hypothesis testing as described above, because if a 95% confidence interval for a difference includes zero, then the treatment groups are not significantly different at the 5% level. Second, in addition to statistical significance, the confidence interval yields information on the magnitude of the observed difference (quantitative significance or clinical importance). The relationship between these two is important because a difference can be highly statistically significant but of no clinical importance, for example, a small difference (say, 0.25 mmHg) with p < 0.0001. Furthermore, the concept of a minimum clinically important difference δ to be detected is central to the design of a clinical experiment and the determination of sample size. A familiar two-tailed confidence interval for the treatment-effect size would be

(\bar{e}_T - \bar{e}_C) \pm t_{(n_T+n_C-2,\,1-\alpha/2)} \sqrt{\frac{S_{eT}^2}{n_T} + \frac{S_{eC}^2}{n_C}}     (1)

where S_{eT}^2 and S_{eC}^2 are the sample estimates of the variances.
A confidence interval around ∆ē (the mean effect size) has been drawn in Fig. 1. Given the confidence interval around ∆ē, one approach to translating this into variation around the cost-effectiveness ratio is to create an interval bounded by the ratio of the cost difference to the lower bound of the effect interval (∆c/∆ē_L) and the ratio of the cost difference to the upper bound of the effect interval (∆c/∆ē_U). These upper and lower bounds for the cost-effectiveness ratio might be termed a quasi-confidence interval, because they are based only upon knowledge of the sampling variation associated with the measurement of the denominator (effects). This reasoning can be applied analogously to a situation where we had stochastic costs but deterministic effects.

3.3. Sampled effectiveness and sampled costs

As in the previous sections, we assume that effects were measured in a trial and can be expressed as a confidence interval. Now we also assume that resource use was measured so that patient-specific costs can be estimated. If resource j (j = 1, ..., J) is used in quantity Q_j at unit price P_j, then the cost for individual i can be expressed as c_i = \sum_{j=1}^{J} P_j Q_j. Summing over the i patients (i = 1, ..., n_T) in the treatment group, the mean cost per patient is \bar{c}_T = \frac{1}{n_T}\sum_{i=1}^{n_T} c_i with estimated variance

s_{cT}^2 = \frac{1}{n_T(n_T-1)} \sum_{i=1}^{n_T} (c_i - \bar{c}_T)^2 .     (2)

Therefore the difference between the mean cost associated with treatment and control can be expressed as a confidence interval:

(\bar{c}_T - \bar{c}_C) \pm t_{(n_T+n_C-2,\,1-\alpha/2)} \sqrt{\frac{S_{cT}^2}{n_T} + \frac{S_{cC}^2}{n_C}} .     (3)
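As a concrete illustration, the patient-level costing and the t-based interval in (2) and (3) can be computed directly from resource-use records. The sketch below is a minimal example under invented inputs (the price vector, resource counts and group sizes are hypothetical, not data from this chapter).

```python
# Minimal sketch of Eqs. (2)-(3): patient-level costs c_i = sum_j P_j * Q_ij and a
# t-based confidence interval for the mean cost difference between two groups.
# All inputs below are hypothetical illustrations.
import numpy as np
from scipy import stats

def patient_costs(unit_prices, quantities):
    """quantities: (n_patients, J) resource counts; unit_prices: length-J price vector."""
    return quantities @ unit_prices                      # c_i = sum_j P_j Q_ij

def mean_cost_ci(costs_T, costs_C, alpha=0.05):
    nT, nC = len(costs_T), len(costs_C)
    diff = costs_T.mean() - costs_C.mean()
    se = np.sqrt(costs_T.var(ddof=1) / nT + costs_C.var(ddof=1) / nC)   # as in Eq. (3)
    t_crit = stats.t.ppf(1 - alpha / 2, nT + nC - 2)
    return diff, (diff - t_crit * se, diff + t_crit * se)

# Hypothetical example: 3 resource types, two small treatment groups.
prices = np.array([120.0, 45.0, 300.0])
rng = np.random.default_rng(0)
q_T = rng.poisson(lam=[4, 10, 1], size=(30, 3))
q_C = rng.poisson(lam=[3, 8, 1], size=(30, 3))
c_T, c_C = patient_costs(prices, q_T), patient_costs(prices, q_C)
print(mean_cost_ci(c_T, c_C))
```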
In this situation the incremental cost-effectiveness ratio is a ratio of two random variables, both of which can be expressed as a confidence interval (around a difference in means). If we initially assume zero covariance between costs and effects, then one can conceptualize this ratio in the form of a two-dimensional confidence plane.

3.4. Joint distribution of cost and effects

It is assumed that in an RCT (or an observational study in which valid inference can be made) there are J interventions where n_j patients receive
intervention j, j = 1, 2, ..., J. Costs and effects are viewed as vector random variables c_j and e_j, with c_ij representing the costs incurred and e_ij the effects achieved by patient i on intervention j, i = 1, 2, ..., n_j, during a specified period. The joint probability distribution function of costs and effects at the patient level is modeled by the function F_j(c, e; z). A vector of patient covariates, z, such as diagnosis, gender and age, is introduced to cover the situation in which the cost-effect relationship of an intervention is expected to vary for different subgroups. It is assumed that (c_ij(z), e_ij(z)) are independently and identically distributed over the patients with covariates z receiving intervention j. The marginal distributions of F, which are the univariate distribution of cost and the distribution of effect, are each associated with parameters such as the expected cost, E(c), and expected effect, E(e). The expected cost and effect, (E(c), E(e)), could be estimated by the sample means of c and e, that is, (c̄, ē), and the covariance matrix of (c̄, ē) could be presented as

\begin{pmatrix} \hat{\sigma}_c^2/n & \hat{\rho}\hat{\sigma}_c\hat{\sigma}_e/n \\ \hat{\rho}\hat{\sigma}_c\hat{\sigma}_e/n & \hat{\sigma}_e^2/n \end{pmatrix}     (4)

where \hat{\sigma}_c^2 and \hat{\sigma}_e^2 are the estimated variances for cost and effect respectively and \hat{\rho} is the estimated correlation coefficient between cost and effect. The difference in expected cost and effect between two treatments/interventions, (∆E(c), ∆E(e)), could be estimated by the corresponding difference in sample means, (∆c̄, ∆ē), and the covariance matrix of (∆c̄, ∆ē) could be expressed as

\begin{pmatrix} \dfrac{\hat{\sigma}_{ci}^2}{n_i} + \dfrac{\hat{\sigma}_{cj}^2}{n_j} & \dfrac{\hat{\rho}_i\hat{\sigma}_{ci}\hat{\sigma}_{ei}}{n_i} + \dfrac{\hat{\rho}_j\hat{\sigma}_{cj}\hat{\sigma}_{ej}}{n_j} \\ \dfrac{\hat{\rho}_i\hat{\sigma}_{ci}\hat{\sigma}_{ei}}{n_i} + \dfrac{\hat{\rho}_j\hat{\sigma}_{cj}\hat{\sigma}_{ej}}{n_j} & \dfrac{\hat{\sigma}_{ei}^2}{n_i} + \dfrac{\hat{\sigma}_{ej}^2}{n_j} \end{pmatrix} .     (5)

4. Statistical Inferences on Cost-Effectiveness Measures

4.1. Parametric approaches to estimating the C-E ratio confidence interval

4.1.1. The confidence box approach

A number of commentators have advocated the cost-effectiveness plane (CE plane) for presenting the results of economic evaluation and for aiding
policy decision. O'Brien and colleagues13 showed how the CE plane could be used to present the confidence limits for the estimate of incremental cost-effectiveness under the assumption of zero covariance between costs and effects. The difference in effect between two interventions is shown on the horizontal axis with mean effect difference ∆ē and upper and lower confidence limits for the effect difference (∆ē^U, ∆ē^L). Similarly, the difference in cost between two interventions is shown on the vertical axis with mean cost difference ∆c̄ and upper and lower confidence limits for the cost difference (∆c̄^U, ∆c̄^L). These "I" bars intersect at the point (∆ē, ∆c̄); hence the ray that connects this point of intersection to the origin has a slope equal to the value of the ICER. Under the assumption described above, the centre of the two confidence intervals can intuitively be thought of as the point of maximum likelihood of the two-dimensional probability density function. O'Brien and colleagues argue that combining the limits of the confidence intervals for costs and effects separately gives natural best- and worst-case limits on the ratio; that is, the upper limit of the cost difference over the lower limit of the effect difference (∆c̄^U/∆ē^L) gives the highest value of the ratio (worst case), and the lower limit of costs divided by the upper limit of effects (∆c̄^L/∆ē^U) gives the lowest (best) value of the ratio. Thus, in Fig. 2, the slope of the line from the origin through point a is a worst-case scenario for the incremental cost-effectiveness ratio based upon the upper 95% confidence limit of the cost estimate and the lower 95% confidence limit of the effect estimate. By similar reasoning, the line through point c is the best-case scenario. In contrast to Fig. 1, the slice of "pie" bounded from the origin by the best- and worst-case scenarios has increased in size, reflecting increased uncertainty about where the true cost-effectiveness ratio lies in this region. There are two problems with this line of reasoning. The first is that the depiction of the two-dimensional confidence plane as being box-shaped is misleading. If costs and effects varied independently, then the probability of being at the lower 95% confidence limit of both simultaneously would be less than 0.05. In principle we might expect such a bivariate probability density function to be elliptical in shape, with lines of equi-probability around the central point-estimate (the maximum likelihood), much like an ordnance survey map of a mountain with height contours. Figure 3 illustrates how this general concept applies to the current problem. The second problem is the implicit assumption that costs and effects vary independently (i.e. have zero covariance). In principle we would expect covariance between costs and effects, and therefore we cannot assume that the numerator and denominator of the ratio are independent. This means that the bounds for the cost-effectiveness ratio depicted in Fig. 2 are still only a quasi-confidence interval
Fig. 2. Confidence limits on the cost-effectiveness plane and the “confidence box” approach to estimating confidence limits for the ICER.
Fig. 3. Hypothetical probability density function around the maximum likelihood point-estimate for cost-effectiveness.
because we have not taken account of all sampling variation. The challenge is whether a method exists for estimating the sampling distribution for the ratio of two random variables which may have nonzero covariance.
4.1.2. The Taylor series approximation

The Taylor approximation shows that where y is a function of two random variables x_1 and x_2, the variance of y can be expressed in terms of the partial derivatives of y with respect to x_1 and x_2, weighted by the variances and
covariance of x_1 and x_2. The Taylor series formula is

\mathrm{var}(y) \approx \left(\frac{\partial y}{\partial x_1}\right)^2 \mathrm{var}(x_1) + \left(\frac{\partial y}{\partial x_2}\right)^2 \mathrm{var}(x_2) + 2\left(\frac{\partial y}{\partial x_1}\right)\left(\frac{\partial y}{\partial x_2}\right)\mathrm{cov}(x_1, x_2) .     (6)
For the ICER ∆E(c)/∆E(e), using the sample estimates of the means and variances, the variance of the ratio estimator can be given as follows:

\mathrm{var}(\hat{R}) \approx \frac{1}{\Delta\bar{e}^2}\mathrm{var}(\Delta\bar{c}) + \frac{\Delta\bar{c}^2}{\Delta\bar{e}^4}\mathrm{var}(\Delta\bar{e}) - 2\frac{\Delta\bar{c}}{\Delta\bar{e}^3}\mathrm{cov}(\Delta\bar{c}, \Delta\bar{e}) .     (7)
Since the variance of a difference in means is equal to the sum of the two sampling variances for those means, we can simplify

\mathrm{var}(\Delta\bar{c}) = \frac{\hat{\sigma}_{c1}^2}{n_1} + \frac{\hat{\sigma}_{c2}^2}{n_2} , \qquad \mathrm{var}(\Delta\bar{e}) = \frac{\hat{\sigma}_{e1}^2}{n_1} + \frac{\hat{\sigma}_{e2}^2}{n_2} ,     (8)
and the covariance term can also be simplified:

\mathrm{cov}(\Delta\bar{c}, \Delta\bar{e}) = \frac{\mathrm{cov}(c_1, e_1)}{n_1} + \frac{\mathrm{cov}(c_2, e_2)}{n_2} = \frac{\hat{\rho}_1\hat{\sigma}_{c1}\hat{\sigma}_{e1}}{n_1} + \frac{\hat{\rho}_2\hat{\sigma}_{c2}\hat{\sigma}_{e2}}{n_2} .     (9)
Combining these elements gives our expression for the variance of the ratio:

\mathrm{var}(\hat{R}) \approx \frac{1}{\Delta\bar{e}^2}\left(\frac{\hat{\sigma}_{c1}^2}{n_1} + \frac{\hat{\sigma}_{c2}^2}{n_2}\right) + \frac{\Delta\bar{c}^2}{\Delta\bar{e}^4}\left(\frac{\hat{\sigma}_{e1}^2}{n_1} + \frac{\hat{\sigma}_{e2}^2}{n_2}\right) - \frac{2\Delta\bar{c}}{\Delta\bar{e}^3}\left(\frac{\hat{\rho}_1\hat{\sigma}_{c1}\hat{\sigma}_{e1}}{n_1} + \frac{\hat{\rho}_2\hat{\sigma}_{c2}\hat{\sigma}_{e2}}{n_2}\right) .     (10)
Factoring \hat{R}^2 = \Delta\bar{c}^2/\Delta\bar{e}^2 from the right-hand side simplifies (7) to

\mathrm{var}(\hat{R}) \approx \hat{R}^2\left[(\mathrm{cv}(\Delta\bar{c}))^2 + (\mathrm{cv}(\Delta\bar{e}))^2 - 2\hat{\rho}\,\mathrm{cv}(\Delta\bar{c})\,\mathrm{cv}(\Delta\bar{e})\right] ,     (11)
where cv(x) is the coefficient of variation for the random variable x, defined as cv(x) = \sqrt{\mathrm{var}(x)}/\bar{x}, and \rho_{xy} is the correlation coefficient between two random variables x and y, defined as \rho_{xy} = \mathrm{cov}(x, y)/\sqrt{\mathrm{var}(x)\mathrm{var}(y)}. The properties of this variance are intuitively appealing: the cost-effectiveness variance will increase with a greater difference in costs or effects, with greater population mean costs between groups and with greater negative correlation between costs and effects. Conversely, the ratio variance will decrease with greater sample size, with a greater
difference in population mean effects between groups and a greater positive correlation between costs and effects. The accuracy of the approximation in the equation above depends upon the random variables ∆c̄ and ∆ē having small coefficients of variation. The coefficient of variation for each random variable is (Z_{α/2} + Z_β)^{-1}, where the two-sided level-α test of significance has 1 − β power against the true difference. For even a 50% power against the true difference the coefficient of variation would be (1.96)^{-1} = 0.51, small enough to ensure reasonable accuracy. The accuracy of the approximation begins to fail as the difference between treatments, with respect to cost or effect, approaches zero so that the power falls well below 50%. Similarly, for the ratio E(c)/E(e), we have

\mathrm{var}(\hat{R}) \approx \frac{1}{\bar{e}^2}\mathrm{var}(\bar{c}) + \frac{\bar{c}^2}{\bar{e}^4}\mathrm{var}(\bar{e}) - 2\frac{\bar{c}}{\bar{e}^3}\mathrm{cov}(\bar{c}, \bar{e}) ,     (12)

\mathrm{var}(\hat{R}) \approx \left(\frac{\hat{\sigma}_c^2}{n}\right)\Big/\bar{e}^2 + \bar{c}^2\left(\frac{\hat{\sigma}_e^2}{n}\right)\Big/\bar{e}^4 - 2\bar{c}\left(\frac{\hat{\rho}\hat{\sigma}_c\hat{\sigma}_e}{n}\right)\Big/\bar{e}^3 ,     (13)

\mathrm{var}(\hat{R}) \approx \hat{R}^2\left[(\mathrm{cv}(c))^2 + (\mathrm{cv}(e))^2 - 2\hat{\rho}\,\mathrm{cv}(c)\,\mathrm{cv}(e)\right] .     (14)

Employing standard parametric assumptions gives the confidence interval as

\left(\hat{R} - z_{\alpha/2}\sqrt{\mathrm{var}(\hat{R})},\ \hat{R} + z_{\alpha/2}\sqrt{\mathrm{var}(\hat{R})}\right) .     (15)
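The delta-method variance (7)–(11) and the interval (15) can be evaluated from summary statistics alone. The sketch below is a minimal illustration under invented summary inputs (group sizes, standard deviations and correlations are hypothetical).

```python
# Minimal sketch of the Taylor-series (delta-method) variance in Eqs. (7)-(10)
# and the normal-theory interval (15) for the ICER. Inputs are summary
# statistics; the numbers at the bottom are hypothetical.
import numpy as np
from scipy import stats

def icer_taylor_ci(dc, de, var_dc, var_de, cov_dce, alpha=0.05):
    """dc, de: differences in mean cost and effect; var/cov: of those differences."""
    R = dc / de
    var_R = (var_dc / de**2 + dc**2 * var_de / de**4
             - 2 * dc * cov_dce / de**3)                      # Eq. (7)/(10)
    z = stats.norm.ppf(1 - alpha / 2)
    half = z * np.sqrt(var_R)
    return R, (R - half, R + half)                            # Eq. (15)

# Hypothetical two-group summaries (treatment minus control).
n1 = n2 = 120
sc1, sc2, se1, se2 = 5_000.0, 4_000.0, 0.8, 0.8
rho1 = rho2 = 0.4
dc, de = 2_500.0, 0.35
var_dc = sc1**2 / n1 + sc2**2 / n2                            # Eq. (8)
var_de = se1**2 / n1 + se2**2 / n2
cov_dce = rho1 * sc1 * se1 / n1 + rho2 * sc2 * se2 / n2       # Eq. (9)
print(icer_taylor_ci(dc, de, var_dc, var_de, cov_dce))
```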
Knowledge of the variance of R would also enable some tests of hypotheses. For example, suppose we specified some a priori upper threshold for the cost-effectiveness ratio, R_max, which is the maximum cost per unit effect that we would be willing to pay for this new treatment. Hence R_max would be the maximum acceptable slope of the cost-effectiveness line through the origin in Fig. 2. We might set up a one-tailed test of the hypothesis that the true ratio, R, is less than this maximum. Thus, we have a null hypothesis H_0: R = R_max which is to be tested against an alternative H_A: R < R_max, and using our variance we might construct a test statistic of the general form

Z = (\hat{R} - R_{\max})\big/\sqrt{\mathrm{var}(\hat{R})} .

In illustrating the possible use of var(R̂) in estimation and hypothesis testing we have assumed that the distribution of R̂ will be statistically well-behaved, such that some parametric distribution (e.g. normal) might
be used in the large sample case. Although this is ultimately an empirical issue, it seems a questionable assumption. For example, the distribution of a ratio of two differences may not be unimodal. While a non-parametric analogue of the approach might be developed using rank-order statistics, a more practical alternative might be to generate an empirical distribution for R̂ by non-parametric bootstrapping.

4.1.3. Fieller's method

An alternative method of calculating confidence intervals around ratios has been described by Fieller.5 The advantage of Fieller's method over the Taylor series expansion is that it takes into account the skew of the ratio estimator. The method assumes that the numerator and denominator of the ratio follow a joint normal distribution, such that (in the case of the ICER) ∆c̄ − R∆ē is normally distributed. Hence, dividing through by its standard deviation gives a quantity that follows the standard normal distribution:

\frac{\Delta\bar{c} - R\Delta\bar{e}}{\sqrt{\mathrm{var}(\Delta\bar{c}) + R^2\mathrm{var}(\Delta\bar{e}) - 2R\,\mathrm{cov}(\Delta\bar{c}, \Delta\bar{e})}} \sim N(0, 1) .     (16)
Setting this expression equal to z_{α/2} and rearranging gives the following quadratic equation in R:

R^2\left[1 - z_{\alpha/2}^2(\mathrm{cv}(\Delta\bar{e}))^2\right] - 2R\hat{R}\left[1 - z_{\alpha/2}^2\,\rho\,\mathrm{cv}(\Delta\bar{e})\,\mathrm{cv}(\Delta\bar{c})\right] + \hat{R}^2\left[1 - z_{\alpha/2}^2(\mathrm{cv}(\Delta\bar{c}))^2\right] = 0 ,     (17)

whose roots give the confidence limits

\hat{R}\,\frac{\left[1 - z_{\alpha/2}^2\,\rho\,\mathrm{cv}(\Delta\bar{c})\,\mathrm{cv}(\Delta\bar{e})\right] \pm z_{\alpha/2}\sqrt{(\mathrm{cv}(\Delta\bar{c}))^2 + (\mathrm{cv}(\Delta\bar{e}))^2 - 2\rho\,\mathrm{cv}(\Delta\bar{c})\,\mathrm{cv}(\Delta\bar{e}) - z_{\alpha/2}^2\left[(\mathrm{cv}(\Delta\bar{c}))^2(\mathrm{cv}(\Delta\bar{e}))^2 - \rho^2(\mathrm{cv}(\Delta\bar{c}))^2(\mathrm{cv}(\Delta\bar{e}))^2\right]}}{1 - z_{\alpha/2}^2(\mathrm{cv}(\Delta\bar{e}))^2} .     (18)
Similarly, for the ratio E(c)/E(e), we have

\frac{\bar{c} - R\bar{e}}{\sqrt{\mathrm{var}(\bar{c}) + R^2\mathrm{var}(\bar{e}) - 2R\,\mathrm{cov}(\bar{c}, \bar{e})}} \sim N(0, 1) ,     (19)
R^2\left[1 - z_{\alpha/2}^2(\mathrm{cv}(\bar{e}))^2\right] - 2R\hat{R}\left[1 - z_{\alpha/2}^2\,\rho\,\mathrm{cv}(\bar{e})\,\mathrm{cv}(\bar{c})\right] + \hat{R}^2\left[1 - z_{\alpha/2}^2(\mathrm{cv}(\bar{c}))^2\right] = 0 ,     (20)

\hat{R}\,\frac{\left[1 - z_{\alpha/2}^2\,\rho\,\mathrm{cv}(\bar{c})\,\mathrm{cv}(\bar{e})\right] \pm z_{\alpha/2}\sqrt{(\mathrm{cv}(\bar{c}))^2 + (\mathrm{cv}(\bar{e}))^2 - 2\rho\,\mathrm{cv}(\bar{c})\,\mathrm{cv}(\bar{e}) - z_{\alpha/2}^2\left[(\mathrm{cv}(\bar{c}))^2(\mathrm{cv}(\bar{e}))^2 - \rho^2(\mathrm{cv}(\bar{c}))^2(\mathrm{cv}(\bar{e}))^2\right]}}{1 - z_{\alpha/2}^2(\mathrm{cv}(\bar{e}))^2} .     (21)
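The Fieller limits can equivalently be obtained as the roots of the quadratic implied by (16): a minimal sketch under hypothetical incremental summaries is shown below (the numerical inputs are invented for illustration).

```python
# Minimal sketch of Fieller's confidence limits for the ICER: the limits are the
# roots of the quadratic in R implied by Eq. (16) (the cv form of Eq. (17)/(18)).
# All numerical inputs are hypothetical.
import numpy as np
from scipy import stats

def fieller_ci(dc, de, var_dc, var_de, cov_dce, alpha=0.05):
    z2 = stats.norm.ppf(1 - alpha / 2) ** 2
    # Coefficients of a*R^2 + b*R + c = 0 obtained by squaring Eq. (16).
    a = de**2 - z2 * var_de
    b = -2 * (dc * de - z2 * cov_dce)
    c = dc**2 - z2 * var_dc
    disc = b**2 - 4 * a * c
    if a <= 0 or disc < 0:
        raise ValueError("Fieller interval is unbounded for these inputs")
    roots = np.sort((-b + np.array([-1.0, 1.0]) * np.sqrt(disc)) / (2 * a))
    return tuple(roots)

# Hypothetical incremental summaries (cost SD 500, effect SD 0.1, correlation 0.4).
print(fieller_ci(dc=2500.0, de=0.35, var_dc=250_000.0, var_de=0.01, cov_dce=20.0))
```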
Siegel et al.16 proposed that τ = c̄ − Rē is normally distributed with mean Eτ = 0 and var(τ) = (var(c) − 2R cov(c, e) + R² var(e))/n. Let F_{1,n−1} denote the 95th percentile of an F distribution with 1 and (n − 1) degrees of freedom. The probability that

\tau^2\big/\left(\widehat{\mathrm{var}}(c) - 2R\,\widehat{\mathrm{cov}}(c, e) + R^2\,\widehat{\mathrm{var}}(e)\right) < F_{1,n-1}(n-1)^{-1}

is 0.95, since the random variable on the left side of the inequality, scaled by (n − 1), is distributed as an F distribution with 1 and (n − 1) degrees of freedom. Multiplying both sides by the denominator and subtracting the right-hand side from both sides of the inequality yields

\left(\bar{c}^2 - F_{1,n-1}(n-1)^{-1}\widehat{\mathrm{var}}(c)\right) - 2R\left(\bar{c}\bar{e} - F_{1,n-1}(n-1)^{-1}\widehat{\mathrm{cov}}(c, e)\right) + R^2\left(\bar{e}^2 - F_{1,n-1}(n-1)^{-1}\widehat{\mathrm{var}}(e)\right) \le 0 .     (22)
The set of values of R satisfying this inequality is a 95% confidence interval for the ratio E(c)/E(e).

4.1.4. Confidence interval for the expected cost to effect ratio E(c/e)

Under an assumption of asymptotic normality, the expected value of the ratio E(c/e) does not exist, because ratios of normal random variables follow the Cauchy distribution. Therefore, in this case neither an estimator nor a confidence interval makes sense. The approximate distribution function of the random variable c/e, F(y) = P(c/e < y), is given by

\Phi\!\left(\left(\frac{wE(e)}{\sigma_e} - \frac{E(c)}{\sigma_c}\right)\left(w^2 - 2\rho w + 1\right)^{-1/2}\right) ,     (23)

where Φ(·) is the cumulative normal distribution with mean 0 and variance 1. Here w = yσ_e/σ_c, where σ_e and σ_c are the population standard deviations of e and c respectively, and ρ is the correlation between them. The median of
this distribution is E(c)/E(e). Thus, the ratio of expected costs to expected effects is the median of the distribution of the patient-level ratio of costs to effects. A 95% confidence interval for the median of this distribution may be obtained by applying the method based on Fieller's theorem. For some data, rather than assuming that the distribution F is multivariate normal, it may be more appropriate to assume that the distribution has a form for which E(c/e) does exist. For example, under an assumption of asymptotic normality of the ratio, the sample mean of c/e and the sample variance of the ratio, \hat{\sigma}_{c/e}^2, can be used to form a 95% confidence interval for the mean cost-weight ratio as follows:

\overline{c/e} \pm t_{n-1}\,\hat{\sigma}_{c/e}\big/\sqrt{n} ,     (24)

where n is the number of patients.

4.2. Bootstrap approaches to estimating the C-E ratio

The bootstrap approach for the simple one-sample case is straightforward. Suppose a particular population has a real but unobserved probability distribution F from which a random sample x of n observations is taken, and the statistic of interest s(x) is calculated. The concern of inferential statistics is to make statements about the population parameter θ based on the sample drawn from that population. In the "bootstrap world," the observed random sample x is treated as the empirical estimate of F by weighting each observation in x by the probability 1/n. Successive random samples of size n are then drawn from x with replacement to give the bootstrap samples (re-samples from the original sample). The statistic of interest is calculated for each of these samples, and these bootstrap replicates of the original statistic make up the empirical estimate of the sampling distribution of that statistic. This estimated sampling distribution can be used in a variety of ways to construct confidence intervals. In principle, the bootstrap estimate of the ICER sampling distribution can be obtained in a very similar way to that of the simple one-sample case. However, since the ICER is estimated on the basis of four estimators from two samples, care must be taken to bootstrap each sample appropriately. For data structures which are more complicated than a one-sample structure, Efron and Tibshirani4 advocate that the bootstrap mechanism for the observed data mirror the mechanism by which those original data were obtained. In the case of the ICER, where data on resource use and
outcome exist for two groups of patients of size n_i and n_j receiving treatments/interventions T_i and T_j respectively, this will involve a three-stage process:

(1) Sample with replacement n_i cost/effect pairs from the sample of patients who received treatment T_i and calculate the bootstrap estimates \bar{c}_i^* and \bar{e}_i^* for the bootstrap sample.
(2) Sample with replacement n_j cost/effect pairs from the sample of patients who received treatment T_j and calculate the bootstrap estimates \bar{c}_j^* and \bar{e}_j^* for the bootstrap sample.
(3) Calculate the bootstrap replicate of the ICER given by the equation

R^* = \frac{\bar{c}_i^* - \bar{c}_j^*}{\bar{e}_i^* - \bar{e}_j^*} = \frac{\Delta\bar{c}^*}{\Delta\bar{e}^*} .     (25)
Repeating this three-stage process many times gives a vector of bootstrap estimates, which is an empirical estimate of the sampling distribution of the ICER statistic. Once the sampling distribution of the ICER has been estimated in this way, several approaches exist to estimate confidence limits using the bootstrap estimate of the sampling distribution.

4.2.1. Normal approximation

One method for confidence interval estimation is to take the bootstrap estimate of the standard error, given by

\hat{\delta}^* = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\left(\hat{R}^{*b} - \bar{R}^*\right)^2}     (26)

(where B is the total number of bootstrap replications), and assume that the sampling distribution is normal. The resulting 100(1 − α) per cent confidence interval is

\left(\hat{R} - z_{\alpha/2}\hat{\delta}^*,\ \hat{R} + z_{\alpha/2}\hat{\delta}^*\right) .     (27)
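The three-stage resampling scheme and the normal-approximation interval (26)–(27) can be sketched as follows; the patient-level cost/effect arrays are simulated placeholders rather than trial data.

```python
# Minimal sketch of the non-parametric bootstrap of the ICER (three-stage
# resampling of cost/effect pairs) with the normal-approximation interval of
# Eqs. (26)-(27). The patient-level data are simulated placeholders.
import numpy as np
from scipy import stats

def bootstrap_icer(c_i, e_i, c_j, e_j, B=2000, alpha=0.05, seed=1):
    rng = np.random.default_rng(seed)
    ni, nj = len(c_i), len(c_j)
    reps = np.empty(B)
    for b in range(B):
        si = rng.integers(0, ni, ni)     # stage 1: resample pairs from group i
        sj = rng.integers(0, nj, nj)     # stage 2: resample pairs from group j
        reps[b] = ((c_i[si].mean() - c_j[sj].mean())
                   / (e_i[si].mean() - e_j[sj].mean()))       # stage 3: Eq. (25)
    R_hat = (c_i.mean() - c_j.mean()) / (e_i.mean() - e_j.mean())
    se_star = reps.std(ddof=1)                                # Eq. (26)
    z = stats.norm.ppf(1 - alpha / 2)
    return R_hat, (R_hat - z * se_star, R_hat + z * se_star)  # Eq. (27)

rng = np.random.default_rng(0)
c_i = rng.normal(12_000, 3_000, 100); e_i = rng.normal(4.0, 1.0, 100)
c_j = rng.normal(9_000, 2_500, 100);  e_j = rng.normal(3.4, 1.0, 100)
print(bootstrap_icer(c_i, e_i, c_j, e_j))
```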
4.2.2. Percentile The percentile method avoids the problem by making direct use of the empirical sampling distribution. The 100(α/2) and 100(1 − α/2) percentile values of the bootstrap sampling distribution estimate are used as the upper and lower confidence limits for the ICER. The attraction of this method
is its simplicity and its avoidance of the assumption of normality for the ICER. However, skewed estimation can cause trouble for the percentile method. In particular, in this context, the percentile method assumes that the bootstrap replicates of the ICER are unbiased, whereas it is known that ratio estimators are biased and that bootstrap replicates will magnify the bias of the sample estimate.17

4.2.3. Bias-corrected and accelerated

Efron3 suggests a modification of the percentile method, which seeks to adjust for the bias and skew of the sampling distribution. This is the bias-corrected and accelerated (BCa) percentile method, which involves algebraic adjustments to the percentiles selected to serve as the confidence interval end points. The adjusted percentiles are given by

\alpha_1 = \Phi\!\left(\hat{z} + \frac{\hat{z} + z_{\alpha/2}}{1 - \hat{a}(\hat{z} + z_{\alpha/2})}\right), \qquad
\alpha_2 = \Phi\!\left(\hat{z} + \frac{\hat{z} + z_{1-\alpha/2}}{1 - \hat{a}(\hat{z} + z_{1-\alpha/2})}\right),     (28)

where Φ(·) is the standard normal cumulative distribution function and z_α is the 100α percentile point of the standard normal distribution. Two adjustments to the percentiles are incorporated into Eq. (28): \hat{z} adjusts the sampling distribution for the bias of the estimator, while \hat{a} adjusts for the skew of the sampling distribution. Setting \hat{a} = 0 yields the adjustment for bias only on the percentiles chosen to serve as end points, and is equivalent to the bias-corrected method advocated by Chaudhary and Stearns1:

\alpha_1 = \Phi(2\hat{z} + z_{\alpha/2}) , \qquad \alpha_2 = \Phi(2\hat{z} + z_{1-\alpha/2}) .     (29)
The bias correction, \hat{z}, is given by \hat{z} = \Phi^{-1}(Q), where Q is the proportion of bootstrap replicates which are less than the sample estimate, \hat{R}. Therefore, if the bootstrap sampling distribution has median \hat{R}, then Q = 0.5, which gives \hat{z} = 0, and (in the absence of a skew adjustment) the percentiles from Eq. (29) correspond to those from the straightforward percentile method. However, where the sampling distribution is not centered on \hat{R}, a correction is made for this bias. Notice that the nonlinear relationship between the z-score and its probability results in the percentile end points being shifted at unequal rates. It is also worth noting that the bias correction adjustment of the BCa method, while not employing distributional assumptions
concerning the distribution of the ICER itself, does make use of parametric assumptions concerning the distribution of the observed bias. This reliance on parametric assumptions has been cited as a potential weakness of the BCa method. The acceleration constant adjusts for the skew of the sampling distribution. Efron and Tibshirani4 suggest using a jack-knife estimate for \hat{a}:

\hat{a} = \frac{\sum_{i=1}^{n}\left(\bar{R}^{**} - \hat{R}_i^{**}\right)^3}{6\left[\sum_{i=1}^{n}\left(\bar{R}^{**} - \hat{R}_i^{**}\right)^2\right]^{3/2}} ,     (30)

where \hat{R}_i^{**} is the jack-knife replicate of the ICER with the ith observation removed, \bar{R}^{**} = \sum_i \hat{R}_i^{**}/n for i = 1 to n, and n = n_t + n_c. In terms of the adjustments to the percentiles given in Eq. (28), in the absence of a bias correction adjustment the skew adjustment is given by

\alpha_1 = \Phi\!\left(\frac{z_{\alpha/2}}{1 - \hat{a}z_{\alpha/2}}\right), \qquad
\alpha_2 = \Phi\!\left(\frac{z_{1-\alpha/2}}{1 - \hat{a}z_{1-\alpha/2}}\right) .     (31)

Equation (30) shows that if the sampling distribution is symmetric, \hat{a} = 0, and Eq. (31) shows that no adjustment to the percentile interval end points is then made.

4.2.4. Parametric bootstrap

Efron and Tibshirani4 outline a simulation-based method of confidence interval estimation that they refer to as a parametric bootstrap approach. Notice that, from the definition of the ICER, the difference in cost in the numerator and the difference in effects in the denominator are each simply the difference between two (approximately normally distributed) sample means. The parametric bootstrap approach involves using this property of the distribution of the numerator and denominator, in combination with the observed means, variances and covariance, to estimate the parameters of the sampling distribution of the cost and effect differences. Sampling from each of these two distributions, while allowing for the estimated covariance between them, gives an estimate of the ICER. Repeating this process many times generates an empirical estimate of the sampling distribution of the ICER. The 100(α/2) and 100(1 − α/2) percentiles of this estimated distribution are used as estimates of the upper and lower limits of the confidence interval, as with the percentile method.
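A minimal sketch of this parametric bootstrap is given below: the cost and effect differences are drawn from a bivariate normal whose parameters are the observed means, variances and covariance. All summary inputs are hypothetical.

```python
# Minimal sketch of the parametric bootstrap for the ICER: draw (delta_c, delta_e)
# from a bivariate normal with the observed means, variances and covariance, and
# take percentiles of the implied ratio. Summary inputs are hypothetical.
import numpy as np

def parametric_bootstrap_icer(dc, de, var_dc, var_de, cov_dce,
                              B=10_000, alpha=0.05, seed=2):
    rng = np.random.default_rng(seed)
    mean = [dc, de]
    cov = [[var_dc, cov_dce], [cov_dce, var_de]]
    draws = rng.multivariate_normal(mean, cov, size=B)
    ratios = draws[:, 0] / draws[:, 1]
    lo, hi = np.percentile(ratios, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return dc / de, (lo, hi)

print(parametric_bootstrap_icer(dc=2500.0, de=0.35,
                                var_dc=250_000.0, var_de=0.01, cov_dce=20.0))
```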
5. Testing Difference Among the Populations

5.1. Under assumption of normality of distribution

5.1.1. Testing on ICER

Let R_0 be a specified value of the incremental cost-effectiveness ratio (ICER) R. It may be viewed as the maximum amount society is willing to pay to gain one unit of effectiveness by adopting the test intervention over the reference. We consider three tests of hypotheses on R: (a) H_0: R = R_0 versus H_A: R ≠ R_0; (b) H_0: R ≥ R_0 versus H_A: R < R_0; (c) H_0: ∆E(e) ≤ 0 or R ≥ R_0 versus H_A: ∆E(e) > 0 and R < R_0. In (b), rejection of the null hypothesis might be interpreted to mean that the test intervention is cost-effective, in the sense that the data support a CER below the stipulated maximum R_0. Its two-tailed version, (a), tests whether the data are consistent with a specified value R_0 of the ICER. In (c), we test the joint hypothesis on effectiveness and cost-effectiveness. If the null hypothesis is tenable, the test intervention is either not effective or not cost-effective. If the alternative is true, then the test intervention is both effective and cost-effective, relative to the referent intervention. The covariance matrix Σ of (∆c̄, ∆ē)′ could be represented as follows:

\Sigma = \begin{pmatrix} \hat{\sigma}_c^2 & \hat{\rho}\hat{\sigma}_c\hat{\sigma}_e \\ \hat{\rho}\hat{\sigma}_c\hat{\sigma}_e & \hat{\sigma}_e^2 \end{pmatrix}
= \begin{pmatrix} \dfrac{\hat{\sigma}_{c0}^2}{n_0} + \dfrac{\hat{\sigma}_{c1}^2}{n_1} & \dfrac{\hat{\rho}_0\hat{\sigma}_{c0}\hat{\sigma}_{e0}}{n_0} + \dfrac{\hat{\rho}_1\hat{\sigma}_{c1}\hat{\sigma}_{e1}}{n_1} \\ \dfrac{\hat{\rho}_0\hat{\sigma}_{c0}\hat{\sigma}_{e0}}{n_0} + \dfrac{\hat{\rho}_1\hat{\sigma}_{c1}\hat{\sigma}_{e1}}{n_1} & \dfrac{\hat{\sigma}_{e0}^2}{n_0} + \dfrac{\hat{\sigma}_{e1}^2}{n_1} \end{pmatrix} .     (32)

Test of H_0: R = R_0 versus H_A: R ≠ R_0. We formulate our test in terms of the estimated net cost ∆c̄ − R_0∆ē. Under H_0, the statistic

T = \frac{\Delta\bar{c} - R_0\Delta\bar{e}}{\{\mathrm{var}(\Delta\bar{c} - R_0\Delta\bar{e})\}^{1/2}}
= \frac{\Delta\bar{c} - R_0\Delta\bar{e}}{\{\mathrm{var}(\Delta\bar{c}) + R_0^2\mathrm{var}(\Delta\bar{e}) - 2R_0\,\mathrm{cov}(\Delta\bar{c}, \Delta\bar{e})\}^{1/2}}

has an approximate standard normal distribution. The test rejects H_0 if |T| > z_{1-\alpha/2}, where z_{1-\alpha/2} is the 100(1 − α/2) percentile of the standard normal distribution.
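This two-sided test can be evaluated directly from the incremental summaries; a minimal sketch (with hypothetical inputs, not the example data used later in this chapter) returns T together with its normal-approximation p-value.

```python
# Minimal sketch of the two-sided test of H0: R = R0 based on the net-cost
# statistic T = (delta_c - R0*delta_e) / sd(delta_c - R0*delta_e).
# Inputs are hypothetical incremental summaries.
from math import sqrt
from scipy import stats

def icer_test(dc, de, var_dc, var_de, cov_dce, R0):
    sd = sqrt(var_dc + R0**2 * var_de - 2 * R0 * cov_dce)
    T = (dc - R0 * de) / sd
    p = 2 * (1 - stats.norm.cdf(abs(T)))
    return T, p

print(icer_test(dc=2500.0, de=0.35, var_dc=250_000.0, var_de=0.01,
                cov_dce=20.0, R0=5_000.0))
```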
Fig. 4. Regions for one-sided test of effectiveness and cost-effectiveness. Region 1: test intervention both effective and cost-effective. Region 2: referent intervention both effective and cost-effective.
Test of H_0: R ≥ R_0 versus H_A: R < R_0. We would reject H_0: R ≥ R_0 if

\Delta\bar{c} - R_0\Delta\bar{e} < -z_{1-\alpha/2}\{\mathrm{var}(\Delta\bar{c}) + R_0^2\mathrm{var}(\Delta\bar{e}) - 2R_0\,\mathrm{cov}(\Delta\bar{c}, \Delta\bar{e})\}^{1/2} .

Test of H_0: ∆E(e) ≤ 0 or R ≥ R_0 versus H_A: ∆E(e) > 0 and R < R_0. In Fig. 4, the lower shaded region (region 1) in the CE plane is where H_A holds. The complementary shaded region (region 2) in the second and third quadrants is where the referent intervention is both effective and cost-effective. Our one-sided test imposes asymmetry between the test and the referent interventions, and region 1 is the appropriate rejection region for our test. Based on our previous discussion, an appropriate test would reject H_0 if ∆ē > c_1 and (∆c̄ − R_0∆ē) < c_2, where the constants c_1 > 0 and c_2 < 0 need to be specified. The size of the test is

\alpha = \sup P[\Delta\bar{e} > c_1,\ (\Delta\bar{c} - R_0\Delta\bar{e}) < c_2] ,

where the supremum is taken over all (∆ē, ∆c̄) consistent with H_0. By normalization, we may express this in terms of a bivariate normal (Z_1, Z_2) with zero means, unit variances and correlation −ρ*. Then

\alpha = \sup P\!\left[Z_1 < \frac{-(c_1 - \Delta\bar{e})}{\sqrt{\mathrm{var}(\Delta\bar{e})}},\ Z_2 < \frac{c_2 - (\Delta\bar{c} - R_0\Delta\bar{e})}{\{\mathrm{var}(\Delta\bar{c}) + R_0^2\mathrm{var}(\Delta\bar{e}) - 2R_0\,\mathrm{cov}(\Delta\bar{c}, \Delta\bar{e})\}^{1/2}}\right]
= \max\left\{P\!\left[Z_1 < \frac{-c_1}{\sqrt{\mathrm{var}(\Delta\bar{e})}}\right],\ P\!\left[Z_2 < \frac{c_2}{\{\mathrm{var}(\Delta\bar{c}) + R_0^2\mathrm{var}(\Delta\bar{e}) - 2R_0\,\mathrm{cov}(\Delta\bar{c}, \Delta\bar{e})\}^{1/2}}\right]\right\} .     (33)

One solution to (33) is

c_1 = \sigma_1 z_{1-\alpha} , \qquad c_2 = -\{\mathrm{var}(\Delta\bar{c}) + R_0^2\mathrm{var}(\Delta\bar{e}) - 2R_0\,\mathrm{cov}(\Delta\bar{c}, \Delta\bar{e})\}^{1/2} z_{1-\alpha} ,     (34)

where \sigma_1 = \{\mathrm{var}(\Delta\bar{e})\}^{1/2}.
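The joint decision rule, with c_1 and c_2 chosen as in (34), can be sketched as follows; the summary inputs are again hypothetical.

```python
# Minimal sketch of the joint one-sided test (c): reject H0 when delta_e > c1 and
# delta_c - R0*delta_e < c2, with c1 and c2 as in Eq. (34). Inputs are hypothetical.
from math import sqrt
from scipy import stats

def joint_test(dc, de, var_dc, var_de, cov_dce, R0, alpha=0.05):
    z = stats.norm.ppf(1 - alpha)
    c1 = z * sqrt(var_de)                                           # c1 in Eq. (34)
    c2 = -z * sqrt(var_dc + R0**2 * var_de - 2 * R0 * cov_dce)      # c2 in Eq. (34)
    reject = (de > c1) and (dc - R0 * de < c2)
    return reject, c1, c2

print(joint_test(dc=2500.0, de=0.35, var_dc=250_000.0, var_de=0.01,
                 cov_dce=20.0, R0=12_000.0))
```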
5.1.2. Testing on CER (cost-effectiveness ratio)

If the cost-weight bivariate distributions are normal with mean vectors (E(c_i), E(e_i)) and a common covariance matrix, multivariate analysis of variance (MANOVA) can be used to test the hypothesis that the vectors of cost-effectiveness measures are identical. If the MANOVA finds the means of the distributions of the populations to be equal and the c-e measure is a function of the means, e.g. E(c_i)/E(e_i), then it may be concluded that the c-e measures do not differ. A likelihood ratio test could be employed to test the hypothesis H_0: E(c_i)/E(e_i) = R_0 for all i, that is, E(c_i) − R_0E(e_i) = 0. An asymptotic α-level two-sided test of H_0 may be obtained by first using likelihood theory for normal variables to test the linear hypothesis that all ratios are equal to a specific value, say R_0. The desired likelihood ratio test is found by maximizing the previous likelihood over all possible values of R_0. Let n_i denote the number of bivariate observations of cost and effect for treatment i and let n = Σn_i. The available data consist of the bivariate observations (c_ij, e_ij), i = 1, 2, ..., I, j = 1, 2, ..., n_i. Let s_11 = Σ_iΣ_j(c_ij − c̄_i)²/n, s_22 = Σ_iΣ_j(e_ij − ē_i)²/n, and s_12 = Σ_iΣ_j(c_ij − c̄_i)(e_ij − ē_i)/n. Here the s_ij are the elements of the pooled covariance matrix, S. The hypothesis E(c_i)/E(e_i) = R_0 is equivalent to the hypothesis E(c_i) − R_0E(e_i) = 0 for all i. For a specific R_0 the classical test of the latter linear hypothesis is based on the Wilks' statistic, W(R_0). The likelihood ratio statistic for the same linear hypothesis is given by Λ(R_0) = W(R_0)^{n/2}. Maximizing Λ(R_0) over all possible values of R_0 yields the desired likelihood ratio test. The test rejects H_0 at the α level if

-n \ln \chi_{\max} < \chi^2_{1-\alpha}(I - 1) .     (35)
Here χ²_{1−α}(I − 1) is the upper 1 − α percentage point of the chi-square distribution with I − 1 degrees of freedom and χ_max is the larger of the two solutions of the quadratic equation ax² + bx + c = 0, where

a = \Sigma_i\Sigma_j c_{ij}^2 \cdot \Sigma_i\Sigma_j e_{ij}^2 - (\Sigma_i\Sigma_j c_{ij}e_{ij})^2 ,

b = \Sigma_i\Sigma_j c_{ij}^2 \cdot \Sigma_i(\Sigma_j e_{ij}^2 - n_i\bar{e}_i^2) + \Sigma_i\Sigma_j e_{ij}^2 \cdot \Sigma_i(\Sigma_j c_{ij}^2 - n_i\bar{c}_i^2) - 2(\Sigma_i\Sigma_j c_{ij}e_{ij}) \cdot \Sigma_i(\Sigma_j c_{ij}e_{ij} - n_i\bar{c}_i\bar{e}_i) ,

c = \Sigma_i(\Sigma_j c_{ij}^2 - n_i\bar{c}_i^2) \cdot \Sigma_i(\Sigma_j e_{ij}^2 - n_i\bar{e}_i^2) - \left(\Sigma_i(\Sigma_j c_{ij}e_{ij} - n_i\bar{c}_i\bar{e}_i)\right)^2 .

5.2. Without assumption of normality of distribution

The distribution of c/e is often skewed. Lifetime models can be widely applied to investigate the distributions of c/e and the difference in c/e between populations. A cost-effectiveness distribution function, or c-e distribution function, could be defined as

S(c_e) = \Pr(c/e > c_e) .     (36)
Parametric, semi-parametric and non-parametric methods are able to deal with data whose distributions do not meet the assumption of normality, and with censored data. The Weibull, gamma and log-normal distributions could be applied to estimate the c-e distribution function and the difference in c/e between different populations. A non-parametric approach such as the Kaplan–Meier method could be applied to estimate the c-e function, and non-parametric tests such as the Wilcoxon and logrank tests can be used to test the equality of the different groups.

6. Power and Sample Size Assessment for Tests of Hypotheses on Cost-Effectiveness Ratios

6.1. Test of H_0: R = R_0 versus H_A: R ≠ R_0

The power (= 1 − β) of this test at the alternative H_A: R = R_A (≠ R_0) is given by

P\left[|\Delta\bar{c} - R_0\Delta\bar{e}| < z_{1-\alpha/2}\{\mathrm{var}(\Delta\bar{c}) + R_0^2\mathrm{var}(\Delta\bar{e}) - 2R_0\,\mathrm{cov}(\Delta\bar{c}, \Delta\bar{e})\}^{1/2} \,\middle|\, H_A\right] = \beta .     (37)
Under H_A, E(∆c̄ − R_0∆ē) = δ(R_A − R_0), with δ (≠ 0) denoting the true incremental effectiveness. Assuming the covariance matrix Σ of (∆c̄, ∆ē)′ is known, Eq. (37) yields

P\left[-z_{1-\alpha/2} - \delta(R_A - R_0)\{\mathrm{var}(\Delta\bar{c}) + R_0^2\mathrm{var}(\Delta\bar{e}) - 2R_0\,\mathrm{cov}(\Delta\bar{c}, \Delta\bar{e})\}^{-1/2} < Z < z_{1-\alpha/2} - \delta(R_A - R_0)\{\mathrm{var}(\Delta\bar{c}) + R_0^2\mathrm{var}(\Delta\bar{e}) - 2R_0\,\mathrm{cov}(\Delta\bar{c}, \Delta\bar{e})\}^{-1/2}\right] = \beta ,     (38)
where Z is standard normal and n_1 = kn_0. Depending on the sign of δ(R_A − R_0), the absolute magnitude of one of the limits on Z is usually large. In either case, we obtain, approximately,

|\delta(R_A - R_0)| \doteq (z_{1-\alpha/2} + z_{1-\beta})\{\mathrm{var}(\Delta\bar{c}) + R_0^2\mathrm{var}(\Delta\bar{e}) - 2R_0\,\mathrm{cov}(\Delta\bar{c}, \Delta\bar{e})\}^{1/2} .     (39)
Routine algebraic steps give

\mathrm{var}(\Delta\bar{c}) + R_0^2\mathrm{var}(\Delta\bar{e}) - 2R_0\,\mathrm{cov}(\Delta\bar{c}, \Delta\bar{e}) = \hat{\sigma}_c^2(1 - \hat{\rho}^2)(1 + v_0) ,

where v_0 = \{R_0(\hat{\sigma}_e/\hat{\sigma}_c) - \hat{\rho}\}^2/(1 - \hat{\rho}^2). Supposing n_1 = kn_0, where n_0 and n_1 are the numbers of patients in the test intervention and referent intervention respectively, we have

n_0 = \frac{(\hat{\sigma}_{c0}^2 + k^{-1}\hat{\sigma}_{c1}^2)(z_{1-\alpha/2} + z_{1-\beta})^2(1 - \hat{\rho}^2)(1 + v_0)}{\delta^2(R_A - R_0)^2} .     (40)
Under the same design set-up, the sample size n_{0b} in the referent intervention needed to guarantee power of 1 − β to detect a difference δ in the test of H_{01}: ∆ē = 0 is given by

n_{0b} = \delta^{-2}(\hat{\sigma}_{e0}^2 + k^{-1}\hat{\sigma}_{e1}^2)(z_{1-\alpha/2} + z_{1-\beta})^2 .

Therefore,

\frac{n_0}{n_{0b}} = \frac{(\hat{\sigma}_{c0}^2 + k^{-1}\hat{\sigma}_{c1}^2)(1 - \hat{\rho}^2)(1 + v_0)}{(\hat{\sigma}_{e0}^2 + k^{-1}\hat{\sigma}_{e1}^2)(R_A - R_0)^2} .     (41)
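The sample-size expression (40) and the ratio (41) can be computed directly from planning inputs. The sketch below is a minimal illustration; the standard deviations, correlations and design values are hypothetical.

```python
# Minimal sketch of the sample-size formula (40) and the ratio n0/n0b in (41),
# with n1 = k*n0. All planning inputs (SDs, correlations, R0, RA, delta, k) are
# hypothetical.
from math import sqrt
from scipy import stats

def n0_for_icer(sc0, sc1, se0, se1, rho0, rho1, R0, RA, delta,
                k=1.0, alpha=0.05, beta=0.20, two_sided=True):
    z_a = stats.norm.ppf(1 - alpha / 2) if two_sided else stats.norm.ppf(1 - alpha)
    z_b = stats.norm.ppf(1 - beta)
    var_c = sc0**2 + sc1**2 / k              # n0 * var(delta_c_bar)
    var_e = se0**2 + se1**2 / k              # n0 * var(delta_e_bar)
    cov_ce = rho0 * sc0 * se0 + rho1 * sc1 * se1 / k
    rho = cov_ce / sqrt(var_c * var_e)
    v0 = (R0 * sqrt(var_e / var_c) - rho) ** 2 / (1 - rho**2)
    n0 = (var_c * (z_a + z_b) ** 2 * (1 - rho**2) * (1 + v0)
          / (delta**2 * (RA - R0) ** 2))                       # Eq. (40)
    n0b = var_e * (z_a + z_b) ** 2 / delta**2                  # effectiveness-only test
    return n0, n0 / n0b                                        # ratio as in Eq. (41)

print(n0_for_icer(sc0=4_000, sc1=4_000, se0=0.8, se1=0.8, rho0=0.5, rho1=0.5,
                  R0=6_000, RA=4_000, delta=0.4))
```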
The parameter v_0 is a function of the correlation ρ* between ∆ē and (∆c̄ − R_0∆ē). In fact,

\rho^* = -\{R_0(\hat{\sigma}_e/\hat{\sigma}_c) - \hat{\rho}\}\big/\sqrt{(1 - \hat{\rho}^2)(1 + v_0)} .
Therefore, |ρ*| = \{v_0/(1 + v_0)\}^{1/2}. The quantity v_0 will be very large if |ρ*| is close to one and, consequently, the sample sizes in (40) and (41) will also be large. The correlation \hat{\rho} between the incremental cost and the incremental effectiveness is related through (32) to the individual correlations \hat{\rho}_0, \hat{\rho}_1 between cost and benefit in the two interventions. As is usually the case, R_0 ≥ 0 and both (40) and (41) are monotonically decreasing in \hat{\rho}, leading to a smaller sample size n_0 and relative size n_0/n_{0b} with increasing value of \hat{\rho}. Finally, these sample size formulae are dependent on both the hypothesized CER R_0 and the difference R_A − R_0.

6.2. Test of H_0: R ≥ R_0 versus H_A: R < R_0

Analogous sample size calculations yield the following formula, which replaces (40):

n_0 = \frac{(\hat{\sigma}_{c0}^2 + k^{-1}\hat{\sigma}_{c1}^2)(z_{1-\alpha} + z_{1-\beta})^2(1 - \hat{\rho}^2)(1 + v_0)}{\delta^2(R_A - R_0)^2} .     (42)
For the one-sided test H_{01}: ∆E(e) = 0, failure to reject implies that the interventions do not differ with regard to their effectiveness and, therefore, should be compared on their costs. The ratio n_0/n_{0b} compares the sample size requirement of the test H_0: R = R_0 with that for H_{01}: ∆ē = 0, with the latter powered to detect the difference δ.

6.3. Test of H_0: ∆E(e) ≤ 0 or R ≥ R_0 versus H_A: ∆E(e) > 0 and R < R_0

With the solution (34), the power (1 − β) of the test can be computed from the bivariate normal distribution of (Z_1, Z_2) and is given by

1 - \beta = P\left[Z_1 < -z_{1-\alpha} + \frac{\delta}{\sigma_1},\ Z_2 < -z_{1-\alpha} + \frac{\delta|R - R_0|}{\{\mathrm{var}(\Delta\bar{c}) + R_0^2\mathrm{var}(\Delta\bar{e}) - 2R_0\,\mathrm{cov}(\Delta\bar{c}, \Delta\bar{e})\}^{1/2}}\right] ,     (43)

where δ > 0 is the incremental effectiveness and R (< R_0) is the true cost-effectiveness ratio. This parallels the power considerations leading to (39) for the test of cost-effectiveness only. The choice of c_1 and c_2 in (33) is optimal in order to gain maximal power for a given sample size and alternative. An implicit expression for the sample size n_0 corresponding to (43) can be derived by making the substitutions \sigma_e^2 = (\hat{\sigma}_{e0}^2 + k^{-1}\hat{\sigma}_{e1}^2)/n_0, \sigma_c^2 = (\hat{\sigma}_{c0}^2 + k^{-1}\hat{\sigma}_{c1}^2)/n_0 and \mathrm{var}(\Delta\bar{c}) + R_0^2\mathrm{var}(\Delta\bar{e}) - 2R_0\,\mathrm{cov}(\Delta\bar{c}, \Delta\bar{e}) = \sigma_c^2(1 - \hat{\rho}^2)(1 + v_0).
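The power in (43) can be evaluated numerically with a bivariate-normal distribution function. The sketch below follows the chapter's expressions for σ_1 and ρ* (correlation of (Z_1, Z_2) taken as −ρ*); all numerical inputs are hypothetical.

```python
# Minimal sketch of the power calculation in Eq. (43):
# P[Z1 < -z_{1-a} + delta/sigma1, Z2 < -z_{1-a} + delta*|R - R0|/sd2],
# where (Z1, Z2) is bivariate normal with unit variances and correlation -rho*.
# All numerical inputs are hypothetical.
from math import sqrt
from scipy import stats

def joint_power(var_dc, var_de, cov_dce, R0, R, delta, alpha=0.05):
    sigma1 = sqrt(var_de)
    sd2 = sqrt(var_dc + R0**2 * var_de - 2 * R0 * cov_dce)
    rho = cov_dce / sqrt(var_dc * var_de)
    ratio = R0 * sqrt(var_de / var_dc) - rho
    v0 = ratio**2 / (1 - rho**2)
    rho_star = -ratio / sqrt((1 - rho**2) * (1 + v0))
    z = stats.norm.ppf(1 - alpha)
    upper = [-z + delta / sigma1, -z + delta * abs(R - R0) / sd2]
    cov = [[1.0, -rho_star], [-rho_star, 1.0]]        # corr(Z1, Z2) = -rho*
    return stats.multivariate_normal(mean=[0.0, 0.0], cov=cov).cdf(upper)

# Hypothetical incremental variances (already divided by the sample sizes).
print(joint_power(var_dc=250_000.0, var_de=0.01, cov_dce=20.0,
                  R0=12_000.0, R=6_000.0, delta=0.35))
```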
Note that the previous expression for n_0 in (42) is a lower bound for the sample size requirements for testing H_0: ∆E(e) ≤ 0 or R ≥ R_0.

6.4. Numerical computations

In some special cases, simplifications of (40)–(43) are possible. Suppose the costs (c_0, c_1) and benefit measures (e_0, e_1) in the two interventions have the same variances, \hat{\sigma}_{c0}^2 = \hat{\sigma}_{c1}^2 (= \hat{\sigma}_c^2) and \hat{\sigma}_{e0}^2 = \hat{\sigma}_{e1}^2 (= \hat{\sigma}_e^2), respectively. Then, assuming equal allocation to the two interventions (k = 1), we have \hat{\rho} = (\hat{\rho}_0 + \hat{\rho}_1)/2 and the sample sizes n_0, n_{0b} in (40) and (41) reduce to

n_0 = \frac{2\hat{\sigma}_e^2(z_{1-\alpha/2} + z_{1-\beta})^2\left(R_0^2 + (\hat{\sigma}_c/\hat{\sigma}_e)^2 - 2\hat{\rho}R_0(\hat{\sigma}_c/\hat{\sigma}_e)\right)}{\delta^2(R_A - R_0)^2} ,
\qquad
n_{0b} = \delta^{-2}\,2\hat{\sigma}_e^2(z_{1-\alpha/2} + z_{1-\beta})^2 .     (44)
From (42), for one-sided testing, z_{1-\alpha/2} must be replaced by z_{1-\alpha}. The effect size δ/\hat{\sigma}_e is the difference in effectiveness in units of standard deviation (SD). For the joint hypothesis test of H_0: ∆E(e) ≤ 0 or R ≥ R_0, the power and sample size expression (43) becomes

1 - \beta = P\left[Z_1 < -z_{1-\alpha} + \sqrt{\frac{n_0}{2}}\,\frac{\delta}{\hat{\sigma}_e},\ Z_2 < -z_{1-\alpha} + \sqrt{\frac{n_0}{2}}\,\frac{\delta|R - R_0|}{\hat{\sigma}_e\{R_0^2 + (\hat{\sigma}_c/\hat{\sigma}_e)^2 - 2\hat{\rho}R_0(\hat{\sigma}_c/\hat{\sigma}_e)\}^{1/2}}\right] .     (45)

The sample size requirement for this joint test to ensure power (1 − β) would be greater than the sample size needed for the one-sided test for effectiveness alone. For fixed n_0, the factor

\zeta = \frac{\{R_0^2 + (\hat{\sigma}_c/\hat{\sigma}_e)^2 - 2\hat{\rho}R_0(\hat{\sigma}_c/\hat{\sigma}_e)\}^{1/2}}{|R - R_0|}
186
J. Li
with and then increase after the value ⌢ ρ = (R0 σe /σc )−1 , therefore, in this circumstance, a strong positive correlation between cost and effectiveness would suggest a smaller sample size requirement for (45) to hold given β. 7. Examples 7.1. Example 1 We use summary data from Sacristan et al.14 on a trial comparing two pharmacological agents in this example. Data on 150 patients using the test drug yield a mean cost of $200,000 (SD = $78,400). Health benefit measured in QALYs is 8 (SD = 2.1) corresponding values on 150 patients using the standard drug are $80,000 (SD = $27,343) for mean cost, and 5 QALYs (SD = 2.0) for mean health benefit. These values yield the following estimates: ∆¯ e = 3, ∆¯ c = $120, 000 and from (32) ⌢ σe = 0.237 and ⌢ σc = 6779. In the absence of a reported value for the correlation between cost and effectiveness, we consider values ⌢ ρ0 = ⌢ ρ 1 = 0.7. From (32), we see that with zero correlations, the incremental cost and incremental effectiveness are uncorrelated (ρ = 0). For ⌢ ρ0 = ⌢ ρ 1 = 0.7, we get ⌢ ρ = 0.638 approximately. 7.1.1. Hypothesis testing for the CER Suppose the hypothesized CER was R0 = $50, 000/QALY. From the test of H0 : R = R0 verse HA : R 6= R0 section, the two-sided test of H0 : R = R0 based on the statistic T = (∆¯ c − R0 ∆¯ e)/{var(∆¯ c − R0 ∆¯ e)}1/2 has a p-value ⌢ ⌢ of 0.03 if ρ 0 = 0 and approximately 0.001 if ρ 0 = 0.7. It can be shown that the p-values decrease with increasing values of ⌢ ρ 0. 7.1.2. Determining statistical power What power does this test have to detect an alternative CER, R0 = $40, 000/QALY? We compute the power from (39) assuming an incremental effectiveness of 3 QALYs. If ⌢ ρ 0 = 0 the power is about 59% and increases to 94% if ⌢ ρ 0 = 0.7. A lower power may be acceptable in studies of costeffectiveness. 7.1.3. Testing the joint hypothesis on effectiveness and cost-effectiveness The power function of this one-sided test is given in (43). To test for significance of the difference in effectiveness (i.e. H01 : ∆E(e) = 0), we would
Cost-Effectiveness Analysis and Evidence-Based Medicine
187
⌢ reject if |∆¯ e/σ e | > z1−α/2 . In this example, the difference δ being highly significant makes the right-hand side of (43) essentially δ|R − R0 | P Z2 < −z1−α + . {var(∆¯ c) + R02 var(∆¯ e) − 2R02 cov(∆¯ c, ∆¯ e)}−1/2
The power at δ = 3 and R = $40, 000 is about 0.71 for these data. 7.2. Example 2 Consider the simplifications leading to (44) and (45). To ensure a power ⌢ of 80% to detect an effect size δ/σ e = 0.5 with a two-sided test of H01 : ∆E(e) = 0 with α = 0.05, we get n0b = 63. Suppose the hypothe⌢ sized ICER is R0 = $80, 000/QALY and the relative SD ⌢ σc /σ e = 5, 000 ($/QALY). Correlation between the cost and effectiveness measures is likely ⌢ to be positive. Let ⌢ ρ = 0.7 and assume a known effect size δ/σ e = 0.5. The sample size n0 needed to detect an ICER of $50,000/QALY or less with 80% power requires n0 ≥ 6.51n0b . For two-sided testing, this yields n0 ≥ 410. The sensitivity of ρ(> 0) to this relative sample size is small. A zero correlation increase this ratio to 7.1. Now consider testing the joint hypothesis H0 : ∆E(e) ≤ 0 or R ≥ R0 under the same constraints. Suppose we want 80% power to detect an effect size 0.5 and an ICER of $50,000/QALY. Using (45), we will get n0 = 323 when ⌢ ρ = 0.7. Note that the joint hypothesis is formulated as one-sided. In comparison, a one-sided test for effectiveness would need approximately 50 subjects per arm to detect an effect size of 0.5 with 80% power. As noted after (45), the power is driven by the probability involving Z2 because ζ > 1 in this case. 7.3. Example 3 Sample size requirements for testing H0 : ∆E(e) ≤ 0 or R ≥ R are given in ⌢ Table 1 for some values of R0 and effective sizes δ/σ e . The test is designed with α = 0.05 and 80% power at R = $30, 000/QALY. We use (45) with ⌢ ⌢ ρ = 0.7 and ⌢ σc /σ e = 5, 000 ($/QALY). The last column of Table 1 gives the sample size requirement to ensure 80% power in the two-sided test of effectiveness alone. For example, to detect an effect size of 0.4 and a ICER of $30,000/QALY, when the maximum acceptable level is $50,000/QALY, we require a sample size of 421 for the test and referent groups. In comparison, for testing H01 : ∆E(e) = 0, only 99 subjects are required to detect an effect size of 0.4.
Table 1. Sample size requirements for testing effectiveness and cost-effectiveness.

                        Maximum ICER R0 ($1000/QALY)
Effect size      40      45      50      55      60     Effectiveness alone
   0.3         1848    1060     748     586     490            175
   0.4         1040     596     421     330     276             99
   0.5          666     382     269     211     177             63
   0.6          462     265     187     147     123             44
Because of the relatively large sample size needed to test the joint hypothesis of cost-effectiveness and effectiveness, in practice power could be calculated from (45) using the sample size that is needed to demonstrate a difference in effectiveness between two treatments. For example, with 175 subjects per arm we have 80% power to detect an effect size of 0.3. With this sample size, \hat{\rho} = 0.7 and R_0 = $50,000/QALY, we will have 64% power to detect an ICER of $30,000/QALY at an effect size of 0.5; at an effect size of 0.3, the power is only 33%.

8. Modeling for Cost-Effectiveness Analysis

Cost-effectiveness analysis requires estimation of the health effects and resource costs associated with an intervention and with the alternatives to which it will be compared. Modeling is frequently necessary, since few studies provide information over sufficiently long periods or for all relevant costs, effects and population groups. Cost-effectiveness analysis helps inform different types of decisions about health interventions. To begin, it can inform the decision to use an intervention at all by showing whether it is cost-effective enough compared to alternatives. More often, decisions concern how to use the intervention. Should screening for hypertension be done every year, every two years, or every five years? If hypertension is diagnosed, and non-drug therapies are unsuccessful, which drugs should be used? Should folic acid supplementation be accomplished through diet, vitamin supplements, or fortification of cereal grains? If fortification, how many mg of folic acid per 100 grams of cereal grain product? Should every patient who presents at the emergency. A model creates the framework for cost-effectiveness analysis. To serve its purpose, and to enable decision makers to explore the implications of variation in the intervention, the condition, and the population, it must allow for substantial variation in those factors.
8.1. Validating effectiveness estimates

Accuracy is essential for a model. Eddy2 described four levels of validation. First, the structure of the model should make sense to experts. Second, the model should reproduce the outcomes observed in the studies used to estimate its parameters. Third, the model's predictions could be compared with results from studies not used in its construction. Fourth, the model could be used to predict outcomes for a new program and the predictions compared with the outcomes when the program is implemented. The first and second steps are essential. For the third step, randomized clinical trials (RCTs) offer a challenging, but potentially persuasive, test of a model's accuracy. While trials are usually the benchmark, the model may be accurate on specific points. It is reasonable to expect a good model to match the results of trials available at the time of its construction, but not to expect it to predict the results of future trials. Models can and should accurately reflect the state of knowledge at the time they are created. When is a model going too far beyond the data? Medical and public health practice are the best guides. Models can appropriately be used to analyze any circumstances in which the intervention is already being applied, or in which it is being seriously considered for application. If it is appropriate to use the intervention in the real world, on real people, it is appropriate to analyze the implications of that use with a model.

8.2. Modeling costs

Eddy's suggestions described above should be considered for cost estimates as well. Modelers need to pay attention to ensuring that the pathway of events described by a model represents costs as well as it does effects. In part, the failure to validate cost estimates reflects a failure to take cost data as seriously as effectiveness data. A basic requirement for accurate predictions, often overlooked, is that both costs and effects should apply to the same population and the same circumstances. Further, data on resource use and cost need to be associated with the same care and subjected to the same sorts of consistency checks as effectiveness data — comparing one source with another, relating differences in costs to characteristics thought to be associated with those differences, and so on. In addition, the range of variation that could usefully be modeled is as wide for costs as for effects. An intervention's effectiveness differs across the
country because populations differ in incidence of the condition, risk factors and co-morbidities. Costs differ across the country because of differences in wages and other costs, in practice patterns and in suitable production technologies. While one purpose of sensitivity analyses is to determine which parameters have a major influence on cost-effectiveness, it would also be useful to explore sets of assumptions that describe, as accurately as the data allow, circumstances in another part of the country or another delivery system. The US Panel on Cost-Effectiveness in Health and Medicine has urged the use of micro-costing for costing events important to an analysis. Micro-costing could yield a better understanding of the factors that underlie resource use and costs for various conditions, analogous to the understanding of effectiveness built up from epidemiological and clinical research. That understanding might reveal alternatives for making interventions more cost-effective by changing the way they are delivered, not just by targeting them to population subgroups. Models should be flexible enough to permit exploration of a range of production possibilities and cost levels for an intervention. Analysts could then examine plausible differences in costs and production technologies. It would be useful to evaluate combinations of values that occur in the real world: conditions in Michigan versus those in San Francisco, conditions in an inner city, a suburb, or a rural area.
8.3. Modeling form Models are built from estimates of risk — the probability that a condition will progress to the next stage, that a test is accurate, that a treatment will be effective. In medical research, the familiar and convenient mathematical forms for fitting risk relationships are the logistic and, more recently hazard models. Both forms incorporate an assumption that the risk relationship is multiplicative, and thus that the size of the risk reduction caused by changing one risk factor differs for different levels of the other risk factors. This assumption implies, for example, that the reduction in risk caused by lowering systolic blood pressure from 160 mmHg to 140 mmHg will be larger in people who also smoke, even though they continue to smoke, than in people whose only risk factor is high blood pressure. Similarly, the reduction in risk from smoking cessation will be greater in people who are hypertensive, even if their blood pressure is unchanged, than in non-smokers.
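The multiplicative point can be illustrated numerically: under a logistic model, the same reduction in one risk factor yields a larger absolute risk reduction for a person who carries an additional risk factor. The coefficients in the sketch below are invented purely for illustration and are not taken from the Framingham or MRFIT analyses.

```python
# Illustration of the multiplicative-risk implication of a logistic model: the
# same 20 mmHg drop in systolic blood pressure produces a larger absolute risk
# reduction in a smoker than in a non-smoker. Coefficients are invented.
import math

def risk(sbp, smoker, b0=-9.0, b_sbp=0.04, b_smoke=0.7):
    eta = b0 + b_sbp * sbp + b_smoke * smoker
    return 1.0 / (1.0 + math.exp(-eta))

for smoker in (0, 1):
    drop = risk(160, smoker) - risk(140, smoker)
    print(f"smoker={smoker}: absolute risk reduction = {drop:.4f}")
```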
In turn, this implies that it will be more cost-effective to apply an intervention to people with several risk factors, not because the programme achieves economies by treating several risk factors, but because intervention against a single risk factor is more effective in these people. The point is clear in an analysis by Taylor et al.18 of a dietary programme to lower serum cholesterol modeled after the one employed in MRFIT. Effectiveness was estimated using logistic coefficients reported from the Framingham study. Results were presented separately for low-risk men, whose only risk factors for heart disease were their gender and cholesterol level, and for high-risk men, who also smoked and had high blood pressure and low HDL levels. Although the cost of the intervention was the same, cost per life-year was approximately ten times higher for low-risk men because of the multiplicative assumption incorporated in the logistic form. Logistic and hazard models play an important role in some of the situations for which models are particularly useful — examining differences in effectiveness and cost-effectiveness among subgroups. When analysts model the implications of targeting an intervention to subgroups, or extrapolate to explore its application to less-studied groups, they need to be aware of the implications of the conventional forms. Modelers cannot supply the data to resolve this issue, but they can draw attention to it by showing how estimates change when additive and multiplicative forms are used. The ultimate goal is to ensure that estimated differences among subgroups are not an artifact of a convenient statistical model.

References

1. Chaudhary, M. A. and Stearns, S. C. (1996). Estimating confidence intervals for cost-effectiveness ratios: An example from a randomized trial. Statistics in Medicine 15: 1447–1458.
2. Eddy, D. M. (1985). Technology assessment: The role of mathematical modeling. In Assessing Medical Technologies, Committee for Evaluating Medical Technologies in Clinical Use, Institute of Medicine, National Academy Press, Washington, D.C., 144–154.
3. Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association 82: 171–200.
4. Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap, Chapman and Hall, New York.
5. Fieller, E. C. (1954). Some problems in interval estimation. Journal of the Royal Statistical Society, Series B 16: 175–183.
6. Garber, A. M. and Leichter, H. M. (1991). Practice guidelines and cholesterol policy. Health Affairs 10(2): 52–66.
7. Goldman, L., Weinstein, M. C., Goldman, P. A. and Williams, I. W. (1991). Cost-effectiveness of HMG-CoA reductase inhibition for primary and secondary prevention of coronary heart disease. Journal of American Medical Association 265: 1145–1151.
8. Hlatky, M. A., Rogers, W. J., Johnstone, I. et al. (1997). Medical care costs and quality of life after randomization to coronary angioplasty or coronary bypass surgery. New England Journal of Medicine 336: 92–99.
9. Kupperman, M., Luce, B., McGovern, B. et al. (1990). An analysis of the cost-effectiveness of the implantable defibrillator. Circulation 81: 91.
10. Lipid Research Clinic Program (1984). Lipid research clinics coronary primary trial results, II: The relationship of reduction in incidence of coronary heart disease to cholesterol lowering. Journal of American Medical Association 251: 365–374.
11. National Cholesterol Education Program (NCEP) (1988). High blood cholesterol in adults: Report of the expert panel on detection, evaluation, and treatment. National Institutes of Health, Department of Health and Human Services, Bethesda, MD.
12. National Cholesterol Education Program (NCEP) (1994). The second report of the expert panel on detection, evaluation, and treatment of high blood cholesterol in adults. Circulation 89: 1329–1445.
13. O'Brien, B. J., Drummond, M. F., Lebelle, R. J. and Willan, A. (1994). In search of power and significance: Issues in the design and analysis of stochastic cost-effectiveness studies in health care. Medical Care 32: 150–163.
14. Sacristan, J. A., Day, S. J., Navarro, O., Ramos, J. and Hernandez, J. M. (1995). Use of confidence intervals and sample size calculations in health economic studies. Annals of Pharmacotherapy 29: 719–725.
15. Sempos, C., Fulwood, R., Haines, C., Carroll, M. et al. (1989). Prevalence of high blood cholesterol levels among adults in the United States. Journal of American Medical Association 262: 45–52.
16. Siegel, C., Laska, E. and Meisner, M. (1996). Statistical methods for cost-effectiveness analyses. Controlled Clinical Trials 17: 387–406.
17. Stinnett, A. (1996). Adjusting for bias in C/E ratio estimates. Health Economics 5: 469–472.
18. Taylor, W. C., Pass, T. M., Shepard, D. S. and Komaroff, A. F. (1990). Cost effectiveness of cholesterol reduction for the primary prevention of coronary heart disease in men. In Preventing Disease: Beyond the Rhetoric, eds. R. B. Goldbloom and R. S. Lawrence, Springer-Verlag, New York, 437–441.
19. Weinstein, M. C. and Stason, W. B. (1977). Foundations of cost-effectiveness analysis for health and medical practices. New England Journal of Medicine 296: 716–721.
About the Author

Jianli Li is working in the Department of Corporate Performance, St. Michael's Hospital, University of Toronto, Canada, as a biostatistician. He
worked in the Ontario Joint Policy and Planning Committee of the Ontario Ministry of Health and the Ontario Hospital Association as a statistical consultant, working on hospital funding models and performance assessment. He has been working in healthcare management for decades, making efforts to apply new developments in statistics, medical information science, computer science and applied mathematics to healthcare management. He has published and presented several papers and chapters, such as "An Application of Life Time Model in Estimation of Expected Length of Stay of Patient in Hospital", "Impact of the Complexity Methodology on an Ontario Teaching Hospital", "A System for Evaluating Inpatient Care Cost-Efficiency in Hospital" and "Data Mining in Health Care Organisations: A Source for Continuous Health Care Information". He was a visiting professor at the Department of Preventive Medicine and Biostatistics, University of Toronto, and was an associate professor and vice chairman of the Department of Healthcare Management, Shanghai Second Medical University.
CHAPTER 6
QUALITY OF LIFE: ISSUES CONCERNING ASSESSMENT AND ANALYSIS

JIQIAN FANG∗ and YUANTAO HAO

Department of Medical Statistics, School of Public Health, Sun Yat-Sen University, 74 Zhongshan Road II, Guangzhou 510080, PR China
Tel: 86-20-87330671; ∗[email protected]
1. The Concept of QOL and its Components

What is quality of life? There is no universally agreed definition. Quality of life (QOL) not only means different things to different people, but it also varies according to a person's current situation. When a person falls sick he thinks QOL is good health; when he is poor, QOL is wealth. To a town planner, for example, QOL might represent access to green space and other facilities. In the context of clinical trials we are rarely interested in QOL in such a broad sense, but are concerned only with evaluating those aspects that are affected by disease or treatment of disease. This may sometimes be extended to include indirect consequences of disease such as unemployment or financial difficulties. To distinguish between QOL in its more general sense and the requirements of clinical medicine and clinical trials, the term "health-related quality of life" (HRQOL) is frequently used in order to remove ambiguity.

There are a number of reasons for developing a quality of life assessment tool. The main reason is undoubtedly that in recent years there has been a broadening of focus of the measurement of health beyond traditional health indicators such as mortality and morbidity.1 Indeed, the measurement of health now includes assessment of the impact of disease and impairment on daily activities and behaviour,2 perceived health measures3 and disability/functional status measures.4 These measures, whilst beginning to provide an indication of the impact of disease, do not access quality of
life per se, which has been aptly described as "the missing measurement in health".5 The increasingly mechanistic model of medicine, concerned only with the eradication of disease and symptoms, reinforces the need for the introduction of a humanistic element into health care. Health care is essentially a humanistic transaction in which the patient's well-being is the primary aim. By calling for QOL assessment in health care, attention is focused on this aspect of health, and resulting interventions will pay increased attention to the problem.

There still has not been a single, clear, universally accepted definition of HRQOL. What domains should be included in QOL? There are five major domains of QOL which are generally referred to by most authors. These domains are physical status and functional abilities, psychological status and well being, social interactions, economic and/or vocational status and factors, and religious and/or spiritual status.

The World Health Organization (WHO) has developed an international quality of life assessment instrument (WHOQOL) which allows an enquiry into an individual's perception of their own position in life in the context of the culture and value systems in which they live, and in relation to their goals, expectations, standards and concerns. The WHOQOL measures quality of life related to health and health care. It has been developed in the framework of a collaborative project involving numerous centres in different cultural settings.6 QOL is defined by WHO as "individuals' perceptions of their position in life in the context of the culture and value systems in which they live and in relation to their goals, expectations, standards and concerns". It is a broad-ranging concept incorporating in a complex way the person's physical health, psychological state, level of independence, social relationships, personal beliefs and their relationships to salient features of the environment. This definition reflects the view that quality of life refers to a subjective evaluation, which is embedded in a cultural, social and environmental context. As such, quality of life cannot be equated simply with the terms "health status", "life style", "life satisfaction", "mental state" or "wellbeing". Because the WHOQOL focuses upon respondents' "perceived" quality of life, it is not expected to provide a means of measuring in any detailed fashion symptoms, diseases or conditions, nor disability as objectively judged, but rather the perceived effects of disease and health interventions on the individual's quality of life. The WHOQOL is, therefore, an assessment of a multi-dimensional concept incorporating the individual's perception of health status, psycho-social status and other aspects of life.
It is anticipated that the WHOQOL assessment will be used in broad-ranging ways. It will be of considerable use in clinical trials, in establishing baseline scores in a range of areas, and looking at changes in quality of life over the course of interventions. It is expected that the WHOQOL assessment will also be of value where disease prognosis is likely to involve only partial recovery or remission, and where treatment may be more palliative than curative.

For epidemiological research, the WHOQOL assessments will allow detailed quality of life data to be gathered on a particular population, facilitating the understanding of diseases, and the development of treatment methods. The international epidemiological studies that would be enabled by instruments such as the WHOQOL-100 and the WHOQOL-BREF will make it possible to carry out multi-center quality of life research, and to compare results obtained in different centers. Such research has important benefits, permitting questions to be addressed which would not be possible in single-site studies. For example, a comparative study in two or more countries on the relationship between health care delivery and quality of life requires an assessment yielding cross-culturally comparable scores. Sometimes accumulation of cases in quality of life studies, particularly when studying less frequent disorders, is helped by gathering data in several settings. Multi-center collaborative studies can also provide simultaneous multiple replications of a finding, adding considerably to the confidence with which findings can be accepted.

In clinical practice the WHOQOL assessments will assist clinicians in making judgements about the areas in which a patient is most affected by disease, and in making treatment decisions. In some developing countries, where resources for health care may be limited, treatments aimed at improving quality of life through palliation, for example, can be both effective and inexpensive. Together with other measures, the WHOQOL-BREF will enable health professionals to assess changes in quality of life over the course of treatment. It is anticipated that in the future the WHOQOL will prove useful in health policy research and will make up an important aspect of the routine auditing of health and social services. Because the instrument was developed cross-culturally, health care providers, administrators and legislators who require a valid QOL instrument for use can be confident that data yielded by work involving the WHOQOL assessment will be genuinely sensitive to their setting.

A large number of instruments have been developed for QOL assessment and we can divide them into two categories: generic instruments and
disease-specific instruments.7 Generic instruments are intended for general use, irrespective of the illness or condition of the patient. These generic questionnaires may often be applicable to healthy people too. Some of the earliest ones were developed initially with population surveys in mind, although they were later applied in clinical trial settings. There are many instruments that measure physical impairment, disability or handicap. Although commonly described as QOL scales, these instruments are better called measures of health status because they focus on physical symptoms. They emphasize the measurement of general health, and make the implicit assumption that poorer health indicates poorer QOL. One weakness of this form of assessment is that different patients may react differently to similar levels of impairment. Many of the earlier questionnaires such as the Sickness Impact Profile (SIP)2 and the Nottingham Health Profile (NHP)8 to some degree adopt this approach. Few of the earlier instruments had scales that examined the subjective non-physical aspects of QOL, such as emotional, social and existential issues. Newer instruments such as the Medical Outcomes Study 36-Item Short Form (SF-36),9 however, emphasize these subjective aspects strongly, and also commonly include one or more questions that explicitly enquire about overall QOL. More recently, some brief instruments that place even less emphasis upon physical functioning have been developed. Two such instruments are the EuroQol,10 which is intended to be suitable for use with cost-utility analysis, and the SEIQol,11 which allows patients to choose those aspects of QOL that they consider most important to themselves.

Generic instruments, intended to cover a wide range of conditions, have the advantage that scores from patients with various diseases may be compared against each other and against the general population. On the other hand, these instruments fail to focus on the issues of particular concern to patients with a particular disease, and may often lack the sensitivity required to detect differences that arise as a consequence of treatment policies that are compared in clinical trials. This has led to the development of disease-specific questionnaires, for example, the EORTC QLQ-C30 (European Organization for Research and Treatment of Cancer QLQ-C30).12

2. Methods of Developing QOL Measurements

The development of a new QOL instrument requires a considerable amount of detailed work, demanding patience, time and resources. Some evidence of this can be seen from the series of publications that are associated with
such QOL instruments as the SF-36, the FACT and the EORTC QLQ-C30. These and similar instruments have initial publications detailing aspects of their general design issues, followed by reports of numerous validation and field-testing studies. Many aspects of psychometric validation depend upon collecting and analysing data from samples of patients or others. However, the statistical and psychometric techniques can only confirm that the scale is valid in so far as it performs in the manner that is expected. These quantitative techniques rely on the assumption that the scale has been carefully and sensibly designed in the first place. To that end, the scale development process should follow a specific sequence of stages, and details of the methods and the results of each stage should be documented thoroughly. Reference to this documentation will, in due course, provide much of the justification for content validity. It will also provide the foundation for the hypothetical models concerning the relationships between the items on the questionnaire and the postulated domains of QOL, which are then explored as construct validity. Next, we will discuss the steps in instrument development in detail.

2.1. Specifying measurement goals

Before embarking on the development of any new instrument, the investigator should define exactly what the instrument is to measure. This initial definition will help the investigator design appropriate development and testing protocols and will enable other users of the instrument to identify its applicability to their own patients and studies. This process will include specification of the objectives in measuring QOL, a working definition of what is meant by "quality of life", identification of the intended groups of respondents, and proposals as to the aspects or main dimensions of QOL that are to be assessed. The investigator should consider at least the following criteria.

2.1.1. Patient population

As in a clinical trial, there should be clear inclusion and exclusion criteria that identify the precise clinical diagnosis and basic patient characteristics. A detailed definition might include age, literacy level, language ability, and presence of other illnesses that might have an impact on QOL. An investigator may be thinking of a particular study in which the instrument is to be used, but constructing an instrument for too specific a population or function may
limit its subsequent use. One can usually choose a patient population that is narrow enough to allow focus on important impairments in that disease or function but broad enough to be valid for use in other studies.

2.1.2. Primary purpose

The investigator needs to decide whether the primary purpose of the instrument is going to be evaluative, discriminative, or predictive. Although some instruments may be capable of all three functions, it is difficult to achieve maximum efficiency in all three.

2.1.3. Patient function

In most disease-specific instruments, investigators want to include all areas of dysfunction associated with that disease (physical, emotional, social, occupational). However, there are some instruments that are designed to focus on a particular function (e.g. emotional function, pain, sexual function) within a broader patient population. The investigator should decide whether all or only specific functions are to be included.

2.1.4. Other considerations

The investigator should also decide on the format of the instrument. Will it be interviewer and/or self-administered? Does it need to be suitable for telephone/postal interviews? Approximately how many items will the instrument contain?

Once a working definition of quality of life and a study protocol are developed, a further phase of work involves operationalizing the broad domains and individual facets of quality of life. Consultants and principal investigators should draft a provisional list of domains and constituent facets of quality of life. Each facet definition should consist of a conceptual definition, a description of various dimensions along which a rating can be made for that facet, and a listing of some example situations or conditions that might significantly affect that facet at various levels of intensity. Once facets of QOL are drafted, a series of focus groups should be held with patients, well persons and health professionals to consider the facet definitions drafted by health professionals and QOL researchers. On the basis of the focus group data, a revised set of facet or domain definitions is compiled to guide subsequent item generation.
2.2. Item generation

The first task in instrument development is to generate a pool of all potentially relevant items. From this pool, the investigator will later select items for inclusion in the final questionnaire. The most frequently used methods of item generation include unstructured interviews with patients who have insight into their condition, patient focus group discussions, a review of the disease-specific literature, discussions with health care professionals who work closely with the patients, and a review of generic QOL instruments. A question-writing panel should be assembled. The question-writing panel should consist of the principal investigator, the main focus group moderator, at least one person with good interviewing skills and experience, and a lay person, preferably someone who participated in one of the lay focus groups, to ensure that questions are framed in a way that is easy to understand.

2.3. Item reduction: Reducing items on the basis of their frequency and importance

Having generated a large item pool, the investigator must select the items that will be most suitable for the final instrument. QOL instruments usually measure health status from the patients' perspective and so it is appropriate that patients themselves identify the items that are most important to them. Investigators should ensure that the patients selected represent the full spectrum of those identified in the patient population. It is important to ensure that all of the subgroups are adequately represented. One approach to item reduction is to ask patients to identify those items that they have experienced as a result of their illness. For each positively identified item, they rate the importance using a 5-point Likert-type scale ("extremely important" to "not important"). Results are expressed as frequency (the proportion of patients experiencing a particular item), importance (the mean importance score attached to each item), and the impact, which is the product of frequency and importance.

Very occasionally, there are items that have absolutely no potential of changing over time either as a result of an intervention or through the natural course of the disease. If one is developing an evaluative instrument, one may consider excluding such unresponsive items because they will only add to the measurement noise and the time taken to complete the questionnaire. However, if such an item is considered very important by patients
and therefore potentially a future target for therapy, exclusion because of apparent unresponsiveness to current therapies may be unwise.

A comprehensive set of items will inevitably include some redundancies. How does one decide whether to include them? One approach is to test whether the items are highly correlated. If Spearman rank order correlations are high, one could consider omitting one of the items. This strategy is particularly appropriate for a discriminative instrument, for highly correlated items will, when taken together, give little information in terms of distinguishing between those with mild and severe quality of life impairment. It is somewhat riskier for evaluative instruments; just because items correlate with one another at the item reduction phase does not guarantee that they will change in parallel when measured serially over time.

Investigators can select the sample size for the item reduction process by deciding how precise they want their estimates of the impact of an item on the population. The widest confidence interval around a proportion (the frequency with which patients identify items) occurs when the proportion is 50%; any other value will yield a narrower confidence interval. If one recruits 25 subjects, and an item is identified by 50% of the population, the true prevalence of that item is somewhere between approximately 30% and 70%. If one recruits 50 subjects, the 95% CI around a proportion of 0.5 will be approximately from 0.36 to 0.64. For 100 subjects, the confidence interval will be from 0.4 to 0.6. It is recommended that researchers recruit at least 100 subjects for this part of the questionnaire development process. There are some statistical methods we can use to determine which items should be included in the instrument. Factor analysis, cluster analysis, multiple regression, and discriminant analysis are methods often used.
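A minimal sketch of the item-reduction bookkeeping just described, using invented endorsement counts and importance ratings rather than data from any real instrument. It ranks candidate items by impact (frequency times importance) and reproduces the confidence-interval widths quoted above for pilot samples of 25, 50 and 100 subjects, using a simple normal approximation.

    import math

    # Hypothetical item-reduction data: for each candidate item, the number of
    # patients (out of n) who reported experiencing it, and the mean importance
    # rating (1 = not important ... 5 = extremely important) among those patients.
    n = 100
    items = {
        "fatigue":         {"endorsed": 62, "mean_importance": 4.1},
        "poor sleep":      {"endorsed": 48, "mean_importance": 3.6},
        "trouble walking": {"endorsed": 20, "mean_importance": 4.5},
    }

    def freq_ci(k, n, z=1.96):
        """Normal-approximation 95% CI for the endorsement proportion k/n."""
        p = k / n
        half = z * math.sqrt(p * (1 - p) / n)
        return p, max(0.0, p - half), min(1.0, p + half)

    ranked = []
    for name, d in items.items():
        p, lo, hi = freq_ci(d["endorsed"], n)
        impact = p * d["mean_importance"]          # impact = frequency x importance
        ranked.append((impact, name, p, lo, hi))

    for impact, name, p, lo, hi in sorted(ranked, reverse=True):
        print(f"{name:15s} freq {p:.2f} (95% CI {lo:.2f}-{hi:.2f}) impact {impact:.2f}")

    # Width of the CI for a 50% endorsement rate at different pilot sample sizes,
    # matching the figures quoted in the text (about 30-70%, 36-64% and 40-60%).
    for n_pilot in (25, 50, 100):
        _, lo, hi = freq_ci(n_pilot // 2, n_pilot)
        print(f"n = {n_pilot:3d}: 95% CI for p = 0.5 is {lo:.2f} to {hi:.2f}")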
2.4. Questionnaire formatting

2.4.1. Selection of response options

Response options refer to the categories or scales that are available for responding to the questionnaire items. For example, one can ask whether the subject has difficulty climbing stairs; two response options, yes and no, are available. If the questionnaire asks about the degree of difficulty, a wide variety of response options is available.

An evaluative instrument must be responsive to important changes even if they are small. To ensure and enhance this measurement property, investigators usually choose scales with a number of options, such as a 7-point scale where responses may range from 1 = no impairment to 7 = total impairment, or a continuous scale such as a 10-cm Visual Analogue Scale (VAS). The 7-point Likert scale is often preferred: although both yield similar data, the Likert scale has practical advantages over the VAS, being both easier to administer and easier to interpret. Likert scales and VAS can be used as discriminative and predictive instruments, and are likely to yield optimal measurement properties. However, Likert scales and VAS are more complex than a simple yes/no response and they are very difficult to use for telephone interviews. In health surveys, investigators requiring only satisfactory discriminative or predictive measurement properties of their instrument may choose a simple response option format.

2.4.2. Time specification

A second feature of presentation is time specification: patients should be asked how they have been feeling over a well-defined period of time. Two weeks is the time frame used by most instruments, on the basis of the intuitive impression that patients can accurately recall this period. Time specification can be modified according to the study, and other investigators may have different impressions of the limits of their population's memory.

When a new questionnaire is developed, it is necessary to test its psychometric properties, including validity, reliability, responsiveness and sensitivity. Validation of instruments is the process of determining whether there are grounds for believing that the instrument measures what it intends to measure, and that it is useful for its intended purpose. Reliability concerns the random variability associated with measurements. Ideally, patients whose QOL status has not changed should make very similar, or repeatable, responses each time they are assessed. If there is considerable random variability over time, the measurements are unreliable. Sensitivity is the ability of a measurement to detect differences between patients or groups of patients. Sensitivity is important in clinical trials since a measurement is of little use if it cannot detect the differences in QOL that may exist between the randomised groups. We will discuss these properties in detail in Sec. 6.

3. Linguistic Validation of QOL Instrument

3.1. Introduction

Most health status measures and psychological tests are used only in the setting in which they were originally developed. Some are translated into other languages and used without making any adaptations, and yet this is
necessary to ensure their usefulness in another culture or language. A very small number of instruments are produced in equivalent versions in different languages, before assessing the validity and reliability that are prerequisites for the use of an instrument in a new culture. WHO has accrued considerable experience in translating health measurements. This has facilitated the development of a translation methodology which has significant advantages over the forward-translation and the translation-back-translation methodologies. We call this procedure "linguistic validation". The steps outlined below describe a sequence which has been used successfully in a number of studies. It is clear that variations of the method may well be necessary, and indeed desirable, in certain situations.

The aim of linguistic validation of a QOL questionnaire is to maintain, as far as possible, conceptual, semantic and technical equivalence between the target language and source language versions of the instrument. Conceptual equivalence refers to the same concepts underlying the questions in an instrument in both source and target languages. Semantic equivalence refers to the same denotative and connotative elements of words; that is, denotation refers to what the words indicate or are a sign for, and connotation refers to what is implied by the words in addition, including their emotional meaning. Technical equivalence refers to two separate but overlapping issues: first, the equivalence of technical features of language and their relationship to the socio-cultural context; and secondly, the feasibility of the nature and mode of questioning of the instrument in both source and target culture.

The linguistic validation of a QOL questionnaire is a complex process which requires the recruitment of professional teams who are familiar with this type of work. The linguistic validation of a questionnaire is not a literal translation of the original questionnaire, but the production of a translation which is conceptually equivalent to the original, and culturally acceptable in the country in which the translation will be used. In order to work towards an acceptable translation of an instrument in a given language the following points should be adhered to:
– The translation methodology should be adhered to and the different phases of the process should be summarised in a report
– The translated version of a questionnaire — obtained if possible in collaboration with its developer — should be recognised as the official version
in the country concerned. This will avoid the proliferation of "pirate" versions and will help to facilitate access to translations
– Ideally, a linguistic validation of a QOL questionnaire should be complemented by a psychometric validation of the questionnaire.
3.2. Methodology

The original language in which the questionnaire was developed is called the source language. The language into which the questionnaire is translated is called the target language. After the recruitment of a QOL specialist in each country concerned, and having explained the concepts of linguistic validation in detail, a QOL instrument is then ideally translated according to Table 1. Thus, in summary, the linguistic validation of a QOL questionnaire comprises the seven steps shown in the first column of Table 1. The questionnaire should always be considered as a whole (i.e. the response choices could influence the translation of the items and vice versa).

Table 1. Methodology for linguistic validation of a QOL questionnaire.

  Source questionnaire
  1. "Forward" translation by two independent translators → forward version A1 and forward version A2
  2. Reconciliation meeting between the two "forward" translators and the local project manager → forward version B
  3. "Backward" translation by one independent translator → backward translation
  4. Comparison of the source questionnaire with the "backward" translation by the local team → forward version C
  5. Cognitive debriefing → forward version D
  6. International harmonisation (if the original is translated into more than one language) → final version
  7. Report

It cannot be assumed that a questionnaire, however extensively tested in the originating country, will be valid and reliable once it has been translated. No instrument for the assessment of psychological states or subjective
perceptions is culture-free. In each instance the validity and other metric characteristics of the instrument must be assessed in the country of application. Important components of psychometric testing in cross-cultural quality of life studies include reliability, validity, responsiveness, and effect size interpretation.

4. Design Issues Relating to QOL Study

4.1. Study objectives

Clear study goals are prerequisites to developing appropriate design and analysis strategies that answer clinically relevant questions. Overly general objectives, such as "describe the QOL of . . .", do not adequately address aspects of the study such as the comparison of the two treatment arms, or whether the comparisons are limited to the period of therapy or extend across time within a treatment group. Without a focused objective, unnecessary assessments are often included in protocol designs. This increases problems of multiple comparisons and missing data, and increases the possibility that critical assessments will be omitted.

4.2. QOL instruments

QOL assessments should ideally be brief, using the least complicated instrument or combination of instruments that adequately addresses the primary research questions. Adding scales/instruments in order to obtain less relevant data will increase both the multiple comparisons problem and the likelihood that data will be incomplete. This will in turn potentially compromise the ability of the trial to achieve the primary objectives of the study.

4.3. Timing of assessments

The timing of QOL assessments must also be specified to achieve the goals of the study. Baseline measures that precede therapy allow for assessment of treatment-related changes within an individual. Depending on the goals of the study, it is also important to have a sufficiently long period of follow-up after therapy to allow for assessment of the long-term treatment effect and potential late sequelae. In the phase 3 treatment comparison setting, it is critical that QOL should be assessed regardless of treatment and disease status. Patients who have changes in status or who have discontinued
treatment should still take part in QOL assessment, as the biggest differences in QOL may be in these patients. Without these measurements it will be difficult to derive summary measures and impossible to make unbiased comparisons of the effects of different therapeutic regimens on QOL. Procedures for obtaining assessments for patients who have changed status or discontinued therapy should be explicitly stated in protocols. The timing of assessments should be chosen to minimize missing data. It is generally recommended that the frequency of assessments be minimized to ease the burden on patients and staff. However, in some cases more frequent administration linked to the clinical routine (e.g. at the beginning of every treatment cycle) may result in more complete data because the pattern of assessment is established as part of the clinical routine.
4.4. Sample size and power

The sample size and power to detect meaningful differences for primary QOL hypotheses are critical to any study in which QOL is an important end point. In addition to the usual estimates of variation and correlations, the sensitivity of the QOL instrument to detect clinically significant changes is the most useful information that can be provided during the validation of a QOL instrument. Specific estimates of the changes in subscales and global scales related to clinical status give the statistician and the clinician a clear and familiar reference point for defining differences that are clinically relevant. This is critical for ensuring an adequate sample size for the study. It should be noted that because end points may involve repeated measurements at different times and/or combinations of subscales, both test-retest correlations and among-subscale correlations are useful and should be reported for validated instruments.

If the sample size requirements for the QOL component are substantially less than for the entire study, an unbiased strategy for selection of a subset of patients in which QOL will be assessed should be identified. For example, the first 500 patients enrolled in the study might be included in the QOL substudy. This may have an additional advantage in studies with a long duration of QOL follow-up. This strategy is being used in the design of an Eastern Cooperative Oncology Group (ECOG) study, in which patient entry is expected to take 5 years, an additional follow-up of 2.5 years is planned for the survival end point, and the desired duration of QOL assessment is 5 years. By limiting the patients in which QOL is assessed to those enrolled in the first 2.5 years, the QOL study is expected to be complete at the same time as the final analysis of the primary survival end points.
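A short sketch of how an instrument's sensitivity and variability translate into an accrual target. The planning values are hypothetical (a 7-point minimally important difference on a 0–100 scale, a between-patient standard deviation of 20, and 20% unusable assessments); the formula is the usual two-sample normal approximation and ignores multiplicity and repeated measurements.

    import math

    def n_per_arm(delta, sd, power=0.80):
        """Approximate sample size per arm for a two-sided, alpha = 0.05 comparison
        of two means (normal approximation)."""
        z_alpha = 1.959964                                  # two-sided 5%
        z_beta = {0.80: 0.841621, 0.90: 1.281552}[power]
        return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

    delta, sd = 7.0, 20.0          # hypothetical minimally important difference and SD
    n = n_per_arm(delta, sd)
    print(f"about {n} patients per arm for 80% power")       # roughly 129 per arm

    # Inflate for anticipated missing assessments at the primary time point
    # before fixing the accrual target (20% assumed here).
    print(f"inflated for 20% attrition: {math.ceil(n / 0.8)} per arm")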
5. Characteristics of QOL Data and Statistical Issues

5.1. Primary statistical issues14

5.1.1. Multiple comparisons

Analysis of QOL data differs from the analysis of other clinical end point data. There are often a large number of measures resulting from both multiple dimensions of QOL (multiple instruments and/or subscales) and repeated assessments over time. Univariate tests for each subscale and time point can seriously inflate the type I error rate (false positives) for the overall trial such that the investigator is unable to distinguish between the true and false positive differences. Furthermore, it is often impossible to determine the number of tests performed at the end of analysis and adjust post hoc. Methods that allow summarization of multiple outcomes both simplify the interpretation of the results and often improve the statistical power to detect clinically relevant differences, especially when small but consistent differences in QOL occur over time or across multiple domains. On the other hand, significant differences at a particular time or within a particular domain may be blurred by aggregation.

5.1.2. Missing data

Missing data refers to missing items in scales and missed and/or mistimed assessments. If the assessment was not completed for reasons that are unrelated to the patient's QOL, the data are classified as "missing at random". Examples might be staff forgetting to administer the assessment, a missed appointment due to inclement weather, or the patient having moved out of the area. Data that are missing because the patient had not been on-study long enough to reach the assessment time point (i.e. the data are censored or incomplete) are also considered missing at random. Assessments may be mistimed if they are actually given but the exact timing does not correspond to the planned schedule of assessments for reasons unrelated to the patients' QOL. While these types of missing/mistimed data make analyses more complex and may reduce the power to detect differences, the estimates of QOL are unbiased even if they are based only on the observed QOL assessments.
Non-randomly missing or informatively censored data present researchers with a much more difficult problem. One example of this type of missing data is that due to death, disease progression, or toxicity, where the QOL would generally be poorer in the patients who were not observed than in those who were observed. In the chronic disease setting, this relationship between QOL and missing data might manifest itself as study dropout due to lack of relief, presence of side effects, or, conversely, improvement in the condition. The difficulty occurs because analyses that inappropriately assume the data are randomly missing will result in biased estimates of QOL reflecting only the more limited population of patients who were assessed rather than the entire sample of the population under study. One possibility is to limit the analysis, and thereby the inference, to patients with complete data. In most cases, however, this strategy is not acceptable for achieving the goal of comparing QOL across all patients. Unless careful prospective documentation of the reasons for missing assessments is available in a clinical trial, it is generally impossible to know definitively whether the reason for the missing assessment is related to the patient's condition and/or to their QOL.

In scales based on multiple items, missing information results in a serious missing data problem. If only 0.1% of items are randomly missing for a 50-item instrument, 18% of the subjects will have one or more items missing over four assessments. If the rate is 0.5%, then only 37% of subjects will have complete data. Deletion of the entire case when there are missing items results in loss of power and potential bias if subjects with poorer QOL are more or less likely to skip an item. Individuals with a high level of non-response (> 50%) should be dealt with on a case-by-case basis. Imputing missing items for an individual who has answered most questions would, in general, be preferable to deletion of the entire case or observation, although the method used for such imputation must be carefully considered. A simple method based solely on the patient's own data would use the mean of all non-missing items for the entire scale or the specific subscale. Methods based on other patients would include the mean of that item in individuals who had responded. Another method utilizing data from other participants is based on the high correlation of items within a scale or subscale and utilizes information about the individual's tendency for particular items to be scored higher or lower relative to other items. The procedure here is to regress the missing item on the non-missing items using data from individuals with complete data, and then to predict the value of the missing items using the information gained from the items that the individual has completed.
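A small sketch of two points made above, assuming items go missing independently: it checks the quoted completeness figures for a 50-item instrument over four assessments, and scores a hypothetical 10-item subscale by replacing each skipped item with the mean of the subject's own non-missing items. The regression-based alternative described above would replace that person-mean with a prediction from the completed items.

    # Chance that a subject has complete data when each of 50 items on each of
    # 4 assessments is independently missing with a small probability,
    # matching the 0.1% and 0.5% figures quoted in the text.
    for rate in (0.001, 0.005):
        p_complete = (1 - rate) ** (50 * 4)
        print(f"missing rate {rate:.1%}: {p_complete:.0%} of subjects fully complete")

    def person_mean_impute(responses, max_missing_frac=0.5):
        """Score a multi-item scale, replacing missing items (None) by the mean of
        the subject's own non-missing items; return None if too much is missing."""
        observed = [r for r in responses if r is not None]
        if len(observed) < len(responses) * (1 - max_missing_frac):
            return None                  # handle case-by-case, as suggested in the text
        mean = sum(observed) / len(observed)
        return sum(r if r is not None else mean for r in responses)

    # Hypothetical 10-item subscale (items scored 1-5) with two skipped items.
    subject = [4, 5, None, 3, 4, None, 5, 4, 3, 4]
    print("imputed subscale score:", person_mean_impute(subject))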
5.1.3. Integration of QOL and survival data

In clinical trials with significant disease-related mortality there is a need to integrate survival with QOL. This was identified by the participants in the 1990 NCI QOL workshop who "acknowledged that the use of QOL data in clinical decision-making will not routinely occur until a larger body of QOL data is available and models for integrating medical and QOL information are available". In studies where both QOL (or toxicity) and clinical end points indicate the superiority of one treatment over another, the choice of the best treatment is clear. Similarly, if either QOL or the efficacy outcome demonstrates a benefit and there is no significant difference in the other, the choice of treatment is straightforward. The dilemma occurs when there is a conflict between the QOL and efficacy outcomes. This is often the case when there is significant toxicity associated with the more effective treatment.

5.2. Statistical methods used to analyse QOL data

5.2.1. Univariate methods

One approach to the reporting of QOL data has been descriptive univariate statistics such as means and proportions at each specific point in time. These descriptive statistics may be accompanied by simple parametric or nonparametric tests such as t-tests or Wilcoxon tests. While these methods are easy to implement and often used, they do not address any of the three previously identified issues.

One recommended solution to the multiple comparisons problem is to limit the number of a priori end points in the design of the trial to three or fewer. The analyses of the remaining scales and/or time points can be presented descriptively or graphically. While theoretically improving the overall type I error rate for the study, in practice investigators are reluctant to ignore the remaining data and may receive requests from reviewers to provide results from secondary analyses with the corresponding significance level. An alternative method of addressing the multiple comparisons problem is to apply a Bonferroni correction, which adjusts the test statistics on k end points so that the overall type I error is preserved for the smallest p value. The procedure is to accept as statistically significant only those tests with p values that are less than α/k, where α is the overall type I error, usually set equal to 0.05.
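A minimal illustration of the Bonferroni rule just described, applied to hypothetical p values for three prespecified QOL end points (the end point names and p values are invented).

    def bonferroni(p_values, alpha=0.05):
        """Flag each of k end points as significant only if its p value is below
        alpha / k, preserving the overall (family-wise) type I error rate."""
        k = len(p_values)
        return {name: p < alpha / k for name, p in p_values.items()}

    # Hypothetical p values for three prespecified QOL end points.
    p_values = {"physical function": 0.004, "fatigue": 0.030, "global QOL": 0.012}
    print("per-test threshold:", 0.05 / len(p_values))     # 0.0167
    print(bonferroni(p_values))
    # {'physical function': True, 'fatigue': False, 'global QOL': True}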
5.2.2. Multivariate methods

Multivariate analysis techniques include approaches such as repeated measures analysis of variance (ANOVA) or multivariate ANOVA (MANOVA). These techniques require complete data, which limits their use to settings where there is a low risk of mortality and very high compliance with QOL assessment. If the data are not complete, the inferences are restricted to a very select and generally non-representative group of patients. Multivariate statistics such as Hotelling's T are frequently used to control for type I error. These statistics, however, answer global questions such as "are any of the dimensions of QOL different?" or "are there differences in QOL at any point in time?" without considering whether the differences are in consistent directions. In general, the multivariate test statistics are not sensitive to differences in the same direction across the multiple end points.

The requirement for complete data can be relaxed by using repeated measures or mixed-effects models with structured covariance. These methods assume that the data are missing for reasons unrelated to the patients' QOL, such as staff forgetting to administer the assessment. If the missing assessments can reasonably be assumed to be missing at random, a likelihood-based analysis approach, such as mixed-effects models or the EM (Expectation-Maximization) algorithm for repeated measures models, incorporates all patients with at least one assessment in the analysis. This approach has the additional advantages of estimation of within- and between-subject variation, inclusion of time-varying variables, and of being able to test for significant changes over time. Other methods often used to determine the risk factors related to QOL include multiple regression, stepwise discriminant analysis, canonical correlation, and logistic regression.

5.2.3. Other methods

5.2.3.1. Quality-Adjusted Life Years (QALY)

An intuitive method of incorporating QOL and time would be to adjust life years by down-weighting time spent in periods of poor QOL. However, what would seem to be a simple idea has many methodological challenges. The first of these challenges is the determination of weights. Torrance14
describes several techniques for eliciting weights for states of health, including direct ratings, time trade-offs, and standard gambles. In addition to the difficulties of administering some of these techniques in clinical trials, weights elicited by the different techniques or from different respondents may not result in equivalent measures. The choice of anchor points and content validity may mean that weights that are appropriate in one setting may be inappropriate in another. The other methodological difficulty occurs in trials with censored data. Although it might seem appropriate to undertake a standard survival analysis of individual quality-adjusted survival times, the usual product limit estimator of the survival function is biased because censoring is related to the future outcome. For example, if two groups have the same censoring time due to death, the group with the poorer QOL will be censored earlier on the QALY scale. This latter problem can be addressed by estimating the average time spent in each health state and then computing a weighted average of the time, as is done in the Q-TwiST approach.

5.2.3.2. Q-TwiST

The objective of the Q-TwiST method is to evaluate therapies based on both quantity and quality of life. Q-TwiST stands for Quality-adjusted Time Without Symptoms of disease and Toxicity of treatment. It is based on the concept of quality-adjusted life years (QALYs) and represents a utility-based approach to QOL assessment in clinical trials. The starting point is to define QOL-oriented clinical health states, one of which represents relatively good health with minimal symptoms of disease or treatment-associated toxicity (TWiST). Patients will progress through or skip these clinical health states, but will not back-track. The next step is to partition the area under the overall Kaplan–Meier survival curve and calculate the average time a patient spends in each clinical health state. The final step is to compare the treatment regimens using weighted sums of the mean duration of each health state, where the weights are utility based. If these utility weights are unknown, as is generally the case, treatment comparisons can be made using sensitivity analyses, also called threshold utility analyses.

5.2.3.3. Markov and Semi-Markov Models

Markov and semi-Markov models have been used to compare treatments based on estimates of the time spent in different health states and the probabilities of transitions between these states. The relevant health states
must be identified and then each is weighted to reflect the relative value of a health state compared to perfect health. The treatments are then compared in terms of the total quality-adjusted time, the weighted sum of the health state durations. In general, to calculate the transition probabilities an underlying model must be assumed. The most commonly used model is the Markov chain, which assumes that the transitions from one QOL state to another are independent of the earlier history and depend only on the previous state. This requires that the assessments are made at time points independent of the patients' treatment schedule or health state. Discrete-time transient semi-Markov processes are used to model the health state transition probabilities corresponding to prolonged life, while a simple recurrent Markov process is used to derive the QOL state transition probabilities. In a semi-Markov process, the state changes form an embedded Markov chain and the times spent in different health states are mutually independent, and depend only on the adjoining states.
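A sketch of the quality-adjusted time calculation that underlies both Q-TwiST and the Markov approach, using a small discrete-time Markov chain over invented health states (toxicity, TWiST, relapse, death). The transition probabilities and utility weights are hypothetical and would in practice be estimated from trial data and patient preferences; comparing two treatments amounts to running the same calculation with each treatment's transition matrix, and a threshold utility analysis repeats it over a grid of utility weights.

    # States: toxicity (TOX), good health (TWiST), relapse (REL), death (DEAD).
    STATES = ["TOX", "TWiST", "REL", "DEAD"]

    def expected_qal_cycles(transition, utilities, start="TOX", n_cycles=60):
        """Expected quality-adjusted time (in cycles) for a discrete-time Markov
        chain: propagate the state-occupancy distribution and weight each cycle
        spent in a state by that state's utility."""
        occupancy = {s: 1.0 if s == start else 0.0 for s in STATES}
        total = 0.0
        for _ in range(n_cycles):
            total += sum(occupancy[s] * utilities[s] for s in STATES)
            occupancy = {
                s: sum(occupancy[r] * transition[r][s] for r in STATES)
                for s in STATES
            }
        return total

    # Hypothetical monthly transition probabilities and utility weights
    # (TWiST has utility 1 by construction; the others reflect patient preferences).
    transition = {
        "TOX":   {"TOX": 0.70, "TWiST": 0.25, "REL": 0.03, "DEAD": 0.02},
        "TWiST": {"TOX": 0.00, "TWiST": 0.93, "REL": 0.05, "DEAD": 0.02},
        "REL":   {"TOX": 0.00, "TWiST": 0.00, "REL": 0.90, "DEAD": 0.10},
        "DEAD":  {"TOX": 0.00, "TWiST": 0.00, "REL": 0.00, "DEAD": 1.00},
    }
    utilities = {"TOX": 0.6, "TWiST": 1.0, "REL": 0.5, "DEAD": 0.0}

    print(f"expected quality-adjusted months: "
          f"{expected_qal_cycles(transition, utilities):.1f}")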
5.3. Conclusions

We have identified three characteristics of QOL studies that present challenges for analysis and interpretation. The first is the occurrence of random and non-random missing data. The analysis of random missing data is generally well documented, with sufficient advice and guidelines for both practical and theoretical issues. In contrast, development of methods for analysis of non-random missing data is in its infancy, and we now require an enhanced knowledge and understanding to determine which methods are most practical and appropriate.

The second issue addressed is the multivariate nature of QOL studies. Not only is QOL a multi-dimensional concept measured by multiple scales, but most studies are longitudinal. Separate analyses of each domain at multiple time points may make it difficult to communicate the results in a manner that is meaningful for clinicians and patients. Summary measures may reduce the multi-dimensionality of the problem but may not make the interpretation much easier. The issue of weights that vary by technique and study also adds to the complexity of interpretation. In general, it would be advisable to perform the analyses using various assumptions to verify that the results are not sensitive to small changes in the assumptions.

The third issue addressed is the integration of survival data with QOL measures. This can be addressed from either the perspective of QOL or from the perspective of time. From a research perspective both approaches
can be informative; however, currently time is the dimension that both clinicians and statisticians are most familiar with. Finally, interpretation of clinical trials may not always be helpful in guiding individual patient decisions. In theory, individual patients could utilize the threshold utility analysis of Q-TwiST, but this may require extensive patient education.

There are a number of statistical methodologies that can be employed in the analysis of QOL data, each of which is based on specific assumptions, yields a different summary measure, and thus emphasizes different aspects of QOL. When there is more than one possible analysis, the strategy that best anticipates the above issues should be considered. Analyses should be clearly and concisely reportable so that the relevant differences can be readily understood by those who will use the results.

6. The Validation Process: Psychometric Testing

The question of most concern relating to psychometrics is whether a measure is both reliable and valid. Measurement is the process by which a concept is linked to one or more latent variables, and these are linked to observed variables. The concept can vary from one that is highly abstract, such as QOL or intelligence, to one that is more concrete, such as age, sex, or race. One or more latent variables may be needed to represent the concept. The observed variables can be responses to questionnaire items, census figures, or any other observable characteristics.

The first step of the measurement process is to give the concept a theoretical definition. A theoretical definition explains in as simple and precise terms as possible the meaning of a concept. The second step is to identify the dimensions and latent variables that will represent it. The next step, of forming measures, depends on the theoretical definition. This is sometimes referred to as the operational definition. The operational definition describes the procedures to follow to form measures of the latent variables that represent a concept. In some situations the latent variables are operationalized as the responses to questionnaire items. The fourth step is to construct the measurement model. A measurement model specifies a structural model connecting latent variables to one or more measures or observed variables. A simple measurement model for a latent variable's influence on two measures is

    x1 = λ11 ξ + δ1 ,
    x2 = λ21 ξ + δ2 ,        (1)

where ξ represents the latent variable and x1 and x2 are its indicators; δ1 and δ2 are errors of measurement with expected values of zero and are uncorrelated
with ξ and with each other. All variables are in deviation form so that intercept terms do not enter the equations.

In sum, the four steps in measurement are to give meaning, to identify dimensions and latent variables, to form measures, and to specify a model. The theoretical definition assigns meaning to a term and the concept associated with it. On the basis of this definition, we can know a concept's dimensions. Each dimension is represented by one latent variable. Guided by theoretical definitions, we form measures, and hopefully two or more measures will be formed per latent variable. Finally, we formulate the structural relation between indicators and latent variables in the measurement model. Two important properties of measures are their validity and reliability.

6.1. Validity

Validity15 is concerned with whether a variable measures what it is supposed to measure. For instance, does an IQ test measure intelligence? Does the WHOQOL-100 measure people's quality of life? These are questions of validity. They can never be answered with absolute certainty. Although we can never prove validity, we can develop strong support for it. Traditionally, psychologists have distinguished four types of validity: content validity, criterion validity, construct validity, and convergent and discriminant validity. Each attempts to show whether a measure corresponds to a concept, though their means of doing so differ. Content validity is largely a "conceptual test", whereas the other three types are empirically rooted. If a measure truly corresponds to a concept, we would expect that all four types of validity would be satisfied. Unfortunately, it is possible that a valid measure will fail one or more of these tests or that an invalid measure will pass some of them.

6.1.1. Content validity

Content validity is a qualitative type of validity where the domain of a concept is made clear and the analyst judges whether the measures fully represent the domain. To the extent that they do, content validity is met. A key question is, how do we know a concept's domain? For the answer we must return to the first step in the measurement process. That is, to know the domain of a concept, we need a theoretical definition that explains the meaning of a concept. Ideally, the theoretical definition should reflect the meanings associated with a term in prior research so that a general rather than an idiosyncratic domain results. In addition, the theoretical definition should make clear the dimensions of a concept.
Does it matter if our measures lack content validity? In general, the answer is yes. Just as a nonrepresentative sample of people can lead to mistaken inferences about the population, a nonrepresentative sample of measures can distort our understanding of a concept. The major limitation of content validity stems from its dependence on the theoretical definition. For most concepts in the social sciences, no consensus exists on theoretical definitions. The domain of content is ambiguous. In this situation the burden falls on researchers not only to provide a theoretical definition accepted by their peers but also to select indicators that fully cover its domain and dimensions. In sum, content validity is a qualitative means of ensuring that indicators tap the meaning of a concept as defined by the analyst.

6.1.2. Criterion validity

Criterion validity is the degree of correspondence between a measure and a criterion variable, usually measured by their correlation. To assess criterion validity, we need an objective, reliable standard measure with which to compare our measure. Suppose that in a survey we ask each employee in a corporation to report his or her salary. If we had access to the actual salary records, we could assess the validity of the survey measure by correlating the two. In this case employee records represent an ideal, or nearly ideal, standard of comparison. The absolute value of the correlation between a measure and a criterion sometimes is referred to as the validity coefficient.

Does this correlation of a measure and a criterion reveal the validity of a measure? If we represent the measure as x1 and the criterion as c1, the validity coefficient may be represented as ρx1c1. A simple model of the relation between x1 and c1, and the latent variable ξ1 that they measure, appears in the following equations:

    x1 = λ11 ξ1 + δ1 ,
    c1 = λ21 ξ1 + δ2 ,        (2)

where δ1 and δ2 are uncorrelated with each other and with ξ1, and E(δ1) = E(δ2) = 0. The validity coefficient is then

    ρx1c1 = λ11 λ21 φ11 / [var(x1) var(c1)]^(1/2) ,        (3)

where φ11 denotes the variance of ξ1.
As Eq. (3) reveals, the magnitude of ρx1c1 depends on factors other than the "closeness" of x1 and ξ1. This is made clearer if we standardize x1, c1, and ξ1 to variances of one. In this case:

    ρx1c1 = λ11 λ21 ,
    Corr(x1, ξ1) = λ11 ,
    Corr(c1, ξ1) = λ21 .        (4)

The validity coefficient, ρx1c1, is affected not only by ρx1ξ1 (= λ11) but also by ρc1ξ1 (= λ21). Even if the correlation of x1 with ξ1 stays at 0.5, the validity coefficient would be 0.45, 0.35, or 0.25 if the correlation of c1 and ξ1 is 0.9, 0.7, or 0.5. Thus, even with no change in x1's association with ξ1, we obtain different values of validity, depending on the criterion's relation to ξ1.

In sum, criterion validity as measured by ρx1c1, the validity coefficient, has several undesirable characteristics as a means of assessing validity. It is not only influenced by the degree of random measurement error variance in x1 but also by the error in the criterion. Furthermore, different criteria lead to different "validity coefficients" for the same measure, leaving uncertainty as to which is an accurate reading of a measure's validity. Finally, for many measures no criterion is available.

6.1.3. Construct validity

Construct validity is a third type of validity. Many concepts within the social sciences are difficult to define and formulate, and so content validity is difficult to apply. As mentioned earlier, appropriate criteria for some measures often do not exist. This prevents the computation of criterion validity coefficients. In these common situations construct validity is used instead. Construct validity assesses whether a measure relates to other observed variables in a way that is consistent with theoretically derived predictions. Hypotheses may suggest positive, negative, or no significant associations between constructs. If we examine the relation between a measure of one construct and other observed variables indicating other constructs, we expect their empirical associations to parallel the theoretically specified associations. To the extent that they do, construct validity exists. The major steps in the process begin with postulating theoretical relations between constructs. Then the associations between measures of the constructs or concepts are estimated. Based on these associations, the measures, the constructs, and the postulated associations are re-examined.
[Path diagram: ξ1 → x1 with loading λ11 and error δ1; ξ2 → x2 with loading λ22 and error δ2.]
Fig. 1. Two constructs with one measure each.
Some of the difficulties with construct validity can be illustrated with a structural equation approach. As a simple example, consider Fig. 1. Assume two constructs, ξ1 and ξ2. Each has one measure, represented as x1 and x2 respectively. As usual, δ1 and δ2 are random errors of measurement with expected values of 0, uncorrelated with each other and with ξ1 and ξ2. Suppose that the construct validity of x1 is of interest. We hypothesize that the two constructs (ξ1 and ξ2) are positively correlated (φ12 > 0). To test construct validity, we would compute the correlation between x1 and x2:

ρx1x2 = (ρx1x1 ρx2x2)^(1/2) ρξ1ξ2 ,    (5)

where ρxixi is the reliability of xi, that is, the squared correlation between xi and ξi. The correlation of the two observed variables thus depends not only on the correlation of x1 and ξ1 but also on the correlation between the constructs ξ1 and ξ2 and the correlation of x2 and ξ2. Because of this, the interpretation of construct validity based on ρx1x2 is seriously complicated. For instance, suppose that the correlation between ξ1 and ξ2 is relatively large and that x1 has very high reliability but x2 has low reliability. The low reliability of x2 would reduce ρx1x2, raising doubts about the construct validity of x1 even though x1 itself is a good measure. In practical work, people usually use exploratory factor analysis or confirmatory factor analysis to test for construct validity. Confirmatory factor analysis is preferable to exploratory factor analysis, because its principle is similar to the definition of construct validity.
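A small Monte Carlo check of Eq. (5) can make this attenuation concrete. The sketch below is illustrative only (Python with NumPy; the reliabilities and the construct correlation are invented values, not taken from any study in the chapter):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
phi_12 = 0.8                      # correlation between the constructs xi_1 and xi_2
rel_x1, rel_x2 = 0.9, 0.4         # reliabilities of the two measures

# Draw standardized latent constructs with correlation phi_12.
cov = np.array([[1.0, phi_12], [phi_12, 1.0]])
xi = rng.multivariate_normal([0.0, 0.0], cov, size=n)

# Each measure is lambda * xi + error, scaled so var(x) = 1 and
# reliability = lambda**2 (the squared correlation with its construct).
lam1, lam2 = np.sqrt(rel_x1), np.sqrt(rel_x2)
x1 = lam1 * xi[:, 0] + rng.normal(0.0, np.sqrt(1 - rel_x1), n)
x2 = lam2 * xi[:, 1] + rng.normal(0.0, np.sqrt(1 - rel_x2), n)

observed = np.corrcoef(x1, x2)[0, 1]
predicted = np.sqrt(rel_x1 * rel_x2) * phi_12       # Eq. (5)
print(f"observed corr(x1, x2) = {observed:.3f}, Eq. (5) predicts {predicted:.3f}")
# Roughly 0.48: well below phi_12 = 0.8, purely because x2 is unreliable.
```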
6.1.4. Convergent and discriminant validity

Convergent validity is another important aspect of construct validity, which is intended to show that, for example, a postulated dimension of QOL correlates appreciably with all other dimensions that theory suggests should be related to it. That is, we may believe that some dimensions of QOL are related, and we therefore expect the observed measurements to be correlated. For example, one might anticipate that patients with severe pain are likely to be depressed, and that there should be a correlation between pain scores and depression ratings within a patient group. Many of the dimensions of QOL are interrelated. Very ill patients tend to suffer from a variety of symptoms, and have high scores on a wide range of psychological dimensions. As many dimensions of QOL are correlated with each other, assessment of convergent validity consists of predicting the strongest and weakest correlations, and confirming that subsequent observed values conform to the predictions. Analysis involves calculating all pairwise correlation coefficients between scores for different QOL scales. Discriminant validity, or divergent validity, recognises that some dimensions of QOL are anticipated to be relatively unrelated, and that their correlations should be low. Convergent and discriminant validity represent the two extremes in a continuum of associations between dimensions of QOL. One problem when assessing discriminant validity (and to a lesser extent, convergent validity) is that two dimensions may correlate spuriously because of some third, possibly unrecognised, construct that links the two together. For example, if two dimensions are both affected by age, an apparent correlation can be introduced solely through the differing ages of the respondents. Another extraneous source of correlation could be that of social desirability, where patients may report a higher QOL on many dimensions simply to please staff or relatives. When specific independent variables are suspected of introducing spurious correlations, the statistical technique of "partial correlation" should be used. This is a method of estimating the correlation between two variables, or dimensions of QOL, whilst holding other "nuisance" variables constant. In practice, there are usually many extraneous variables that contribute a little to the spurious correlations obtained. Convergent validity and discriminant validity are commonly assessed across instruments. For convergent validity to exist, those scales from each instrument that are intended to measure similar constructs should have higher correlations with each other than with scales that measure unrelated constructs.
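As a rough illustration of how partial correlation removes such a nuisance variable, one might proceed as follows (a Python/NumPy sketch with simulated data; the variable names and numbers are hypothetical, not from the chapter):

```python
import numpy as np

def partial_corr(a, b, nuisance):
    """Correlation of a and b after removing the linear effect of the nuisance variable."""
    X = np.column_stack([np.ones_like(nuisance), nuisance])
    resid_a = a - X @ np.linalg.lstsq(X, a, rcond=None)[0]
    resid_b = b - X @ np.linalg.lstsq(X, b, rcond=None)[0]
    return np.corrcoef(resid_a, resid_b)[0, 1]

rng = np.random.default_rng(1)
n = 2_000
age = rng.normal(60, 10, n)
# Two QOL dimensions that both decline with age but are otherwise unrelated.
pain = 0.05 * age + rng.normal(0, 1, n)
mobility = 0.05 * age + rng.normal(0, 1, n)

print("raw correlation:     %.3f" % np.corrcoef(pain, mobility)[0, 1])
print("partial correlation: %.3f" % partial_corr(pain, mobility, age))
# The raw correlation is spuriously positive; the partial correlation is near zero.
```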
Table 2. Template for the multitrait-multimethod (MTMM) correlation matrix.

                                Emotional function   Social function   Role function
                   Instrument       1        2          1       2        1       2
  Emotional function    1           R
                        2           C        R
  Social function       1           D                   R
                        2                    D          C       R
  Role function         1           D                   D                R
                        2                    D                  D        C       R
The multitrait-multimethod (MTMM) correlation matrix is a method for examining convergent and discriminant validity. The general principle of this technique is that two or more methods, such as different instruments, are each used to assess the same traits — for example QOL aspects, items or subscales — so that each trait is estimated by each of the methods. Various layouts are used for MTMM matrices, the most common being shown in Table 2. In Table 2, the two instruments are methods, while the functioning scales are traits. Cells marked C show the correlations of the scores when different instruments are used to assess the same trait. Convergent validity is determined by the C cells. If the correlations in these cells are high, say above 0.7, this suggests that both instruments may be measuring the same thing. If the two instruments were developed independently of each other, this would support the inference that the traits are defined in a consistent and presumably meaningful manner. Similarly, the D cells show the scale-to-scale correlations for each instrument, and these assess discriminant validity. Lower correlations are usually expected in these cells, because otherwise scales purporting to measure different aspects of QOL would in fact be more strongly related than supposedly similar scales from different instruments. The main diagonal cells, marked R, can be used to show reliability coefficients, as described later. These can be either Cronbach's α for internal reliability or, if repeated QOL assessments are conducted on patients whose condition is stable, test-retest correlations. Since repeated values of the same trait measured twice by the same method will usually be more similar than values of the same trait measured by different instruments, the R cells containing test-retest repeatability scores should usually contain the highest correlations. One common variation on the theme of MTMM matrices is to carry out the patient assessments on two different occasions. The upper-right
triangle of Table 2 can be used to display the correlations at time 1, and the correlations at time 2 can be shown in the lower-left triangle that we have been describing above. The diagonal cells dividing the two triangles, marked R, should then show the test-retest repeatability correlations.
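Computationally, an MTMM matrix is simply the correlation matrix of all instrument-scale score columns, arranged so that the R, C and D cells can be read off. A minimal sketch (Python with pandas; the scores and column names are invented for illustration and do not come from any real instrument):

```python
import pandas as pd

# Scores for three traits measured by two instruments; in a real analysis
# these columns would come from the scored questionnaires.
scores = pd.DataFrame({
    "emotional_1": [55, 60, 42, 70, 65],   # instrument 1
    "emotional_2": [52, 63, 45, 68, 61],   # instrument 2
    "social_1":    [40, 50, 38, 62, 58],
    "social_2":    [43, 48, 35, 65, 55],
    "role_1":      [70, 72, 50, 80, 77],
    "role_2":      [68, 75, 53, 78, 74],
})

mtmm = scores.corr()                       # the multitrait-multimethod matrix
print(mtmm.round(2))

# Convergent validity: same trait, different instruments (the C cells).
print("C cell, emotional:", round(mtmm.loc["emotional_1", "emotional_2"], 2))
# Discriminant validity: different traits, same instrument (the D cells).
print("D cell, instrument 1:", round(mtmm.loc["emotional_1", "social_1"], 2))
```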
6.1.5. Alternatives to classical validity measures

Thus far we have reviewed four common types of validity: content, criterion, construct, and convergent and discriminant validity. Content validity is largely a theoretical approach to validation. Criterion validity is largely an empirical means of validating. Construct validity and convergent-discriminant validity are both theoretical and empirical. They are theoretical in the sense that theory suggests which constructs should correlate and which should not. The empirical aspect concerns the correlations between observed measures, although there are a number of limitations associated with this approach. One problem is that they rely on correlations rather than structural coefficients to test validity. Criterion validity examines the correlation between the criterion and the observed measure. Construct validity and convergent-discriminant validity are based on the correlation between measures of the same and different constructs. These correlations may have little to do with the validity of a measure. A second problem with these empirical tests is that they use only observed measures, rather than incorporating the latent variables into the analysis. The implicit assumption is that the correlation between two observed variables mirrors the corresponding association involving latent variables: it is implicitly assumed that the correlation of the criterion and the measure adequately approximates the correlation between the latent variable and the measure, and in construct and convergent-discriminant validity the correlation of observed measures is a proxy for the correlation of the latent constructs. But in fact it can be a poor proxy under a number of conditions. To overcome these limitations, Bollen15 proposed an alternative definition that is based on a structural equation approach. In his definition, the validity of a measure xi of ξj is the magnitude of the direct structural relation between ξj and xi. Therefore, for a measure to be valid, the latent and observed variables must have a direct link. Given this definition, a natural question is how validity should be measured. There is probably no one ideal measure of validity, but several correspond to this theoretical definition.
6.1.5.1. Unstandardized Validity Coefficient (λ)

One important gauge of validity, the direct structural relation between xi and ξj, is λij, the unstandardized coefficient linking them. For instance,

x1 = λ11 ξ1 + δ1 ,    (6)

where λ11 is the unstandardized coefficient; it provides the expected change in x1 for a one-unit change in ξ1. The λij coefficients are in the Λx and Λy matrices. As in multiple regression, xi may have a number of explanatory variables. Consider the following measurement model:

x1 = λ11 ξ1 + λ12 ξ2 + λ13 ξ3 + δ1 .    (7)

The validity of x1 with respect to ξ1 is indicated by λ11. The λ11 coefficient is interpreted as the expected change in x1 for a one-unit change in ξ1, holding constant ξ2 and ξ3. In addition, the validity of x1 with respect to ξ2 and ξ3 can be gauged by λ12 and λ13 respectively. Thus the unstandardized validity coefficient λij is appropriate for measures that depend on one or more latent variables. The unstandardized validity coefficient λij is also useful for comparing samples from different populations. For example, the same observed variable may be measured in samples of males and females, samples from two different countries, or samples of some other groups. A comparison of validity could be made by comparing the corresponding estimated λij coefficients in the separate samples. Unstandardized coefficients represent a better measure of the structural relation of the variables, and are less influenced by differences in population variances. One disadvantage in comparing the unstandardized validity coefficients of measures that depend on the same latent variable is that the observed variables may be measured on very different scales. Direct comparison of the magnitudes of the λ's to determine the relative validity of measures is then generally not appropriate.

6.1.5.2. The Standardized Validity Coefficient, λs

The standardized validity coefficient λs is defined as

λsij = λij [φjj / var(xi)]^(1/2) ,    (8)

where φjj is the variance of the latent variable ξj.
Unlike λij, λsij is one means to compare the relative influence of ξj on several xi variables. For example, if x1 and x2 depend on ξj and λs1j is 0.8 while λs2j is 0.1, this would indicate that x1 is more responsive to ξj than is x2 in standard deviation units. In addition, if xi depends on two or more latent variables, the relative influence of the latent variables can be compared. The standardized λsij is less useful than λij in comparing different populations because it is greatly influenced by the varying standard deviations of the variables in different populations.

6.1.5.3. Unique Validity Variance, Uxiξj

The unique validity variance measures that part of the explained variance in xi that is uniquely attributable to ξj. The formula for Uxiξj is

Uxiξj = R²xi − R²xi(ξj) ,    (9)

where R²xi is the squared multiple correlation coefficient, or the proportion of variance in xi explained by all variables in a model that have a direct effect on xi (excluding error terms), and R²xi(ξj) is the proportion of variance in xi explained by all variables with a direct effect on xi excluding ξj. Uxiξj always varies between zero and one. If only ξj has a direct effect on xi, Uxiξj equals the squared correlation between ξj and xi. Uxiξj is more general than ρ²xiξj since it allows the observed variable to depend on more than one latent variable and it is zero if ξj has no direct effect on xi. If multiple correlated latent variables underlie xi, Uxiξj will generally not equal ρ²xiξj unless the latent variables are uncorrelated.

6.2. Reliability

Reliability is the consistency of measurement. It is not the same as validity, since we can have consistent but invalid measures. To illustrate reliability, suppose that I wish to measure your level of education. I narrowly define education as completed years of formal schooling. I operationalize it by asking: "How many completed years of formal schooling have you had?" Next, I record your answer. If I had the ability to erase your memory of the question and the response you gave, I could repeat the same question and again record your answer. Repeating this process an infinite number of times, I could determine the consistency of your response to the same question. The reliability of this education measure is the consistency in your response over the infinite trials. The greater the fluctuation across your answers, the lower the reliability of the measure.
It is possible to have a very reliable measure that is not valid. For example, repeatedly weighing yourself on a bathroom scale may provide a reliable measure of your weight, but the scale is not valid if it always gives a weight that is 5 kg too light. A more extreme example would be obtaining a measure of intelligence by asking individuals their shoe size. This may provide a very reliable measure, but it lacks validity as an intelligence measure. Thus the distinction between reliability and validity is a very important one. Much of the social science literature on reliability originates in classical measurement theory from psychology. A fundamental equation of the theory is

xi = τi + ei ,    (10)
where xi is the ith observed variable (or "test" score), ei is the error term and τi is the true score that underlies xi. It is assumed that cov(τi, ei) is zero and that E(ei) = 0. According to classical test theory, the errors of measurement for different items are uncorrelated. The correlation between two measures results from the association of their true scores. Thus the true scores are the systematic components that lead to the association of observed variables. Parallel, τ-equivalent, and congeneric measures are the three major types of observed variables in test theory. They can be defined using two measures xi and xj as shown below:

xi = αi τi + ei ,    xj = αj τj + ej .    (11)
The ei and ej are uncorrelated. Assume that the true scores are the same. If αi = αj = 1 and var(ei) = var(ej), then xi and xj are parallel measures. If αi = αj = 1 but var(ei) ≠ var(ej), the measures are τ-equivalent. Finally, if αi ≠ αj and var(ei) ≠ var(ej), then the measures are congeneric. Congeneric measures are the most general of the three types. The reliability of a measure, ρxixi, is defined as

ρxixi = α²i var(τi) / var(xi) .    (12)

For τ-equivalent or parallel measures, this simplifies to

ρxixi = var(τi) / var(xi) .    (13)
Reliability is the ratio of the true score's variance to the observed variable's variance. It equals the squared correlation of the observed variable and the true score:

ρ²xiτi = [cov(xi, τi)]² / [var(xi) var(τi)]
       = α²i [var(τi)]² / [var(xi) var(τi)]
       = α²i var(τi) / var(xi)
       = ρxixi .    (14)
Thus, ρxixi can be interpreted as the proportion of the variance of xi that is explained by τi, with the remaining variance due to error. A number of methods have been proposed for estimating the reliability of measures. Here we will review the four most common: test-retest, alternative forms, split-halves, and Cronbach's α.

6.2.1. Test-retest method

The test-retest method is based on administering the same measure to the same observations at two points in time. The equations for the two measures are

xt = αt τt + et ,    xt+1 = αt+1 τt+1 + et+1 ,    (15)
where t and t + 1 are subscripts representing the first and second time periods for the x, α, τ and e. Here it is assumed that E(et) = E(et+1) = 0, that the true scores (τt, τt+1) are uncorrelated with the errors (et, et+1), and that the errors are uncorrelated. In addition, this method assumes that xt and xt+1 are parallel measures and that the true scores are equal. The reliability estimate is the correlation of xt and xt+1. Using the definition of the correlation between two variables and covariance algebra leads to

ρxtxt+1 = cov(xt, xt+1) / [var(xt) var(xt+1)]^(1/2) = var(τt) / var(xt) = ρxtxt .    (16)
In fact, the correlation of any two parallel measures equals their reliability since all parallel measures have identical reliability.
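A quick simulation under the parallel-measures assumptions of Eqs. (15)-(16) (a Python sketch, not part of the original text; the variances are arbitrary illustrative values) shows the test-retest correlation recovering the reliability var(τ)/var(x):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
var_tau, var_e = 4.0, 1.0            # true-score and error variances

tau = rng.normal(0.0, np.sqrt(var_tau), n)       # perfectly stable true scores
x_t  = tau + rng.normal(0.0, np.sqrt(var_e), n)  # first administration
x_t1 = tau + rng.normal(0.0, np.sqrt(var_e), n)  # second administration

test_retest = np.corrcoef(x_t, x_t1)[0, 1]
true_reliability = var_tau / (var_tau + var_e)   # var(tau) / var(x)
print(f"test-retest correlation {test_retest:.3f} vs reliability {true_reliability:.3f}")
# Both are about 0.80, as Eq. (16) implies when the true score is perfectly stable.
```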
Despite the intuitive appeal of the test-retest reliability technique, it has several limitations. First, it assumes perfect stability of the true score. In many cases the true score may change over time, so this assumption is not reasonable. If lack of equivalence of the true scores is the only violated assumption, then ρxtxt+1 is less than the reliability. Secondly, memory effects are sometimes present. People's memories of their responses during the first interview can influence their responses in a second interview; they may tend to give the same answers. In short, the test-retest method of estimating reliability has the advantage of simplicity, but it is dependent on assumptions that are unrealistic in practice.
6.2.2. Alternative forms

Another method for estimating reliability is that of alternative forms. This is similar to the test-retest method, except that different measures instead of the same measure are collected at t and t + 1. The equations for the two measures are

x1 = τt + et ,    x2 = τt+1 + et+1 .    (17)
The x1 variable is a measure of τ at time t, x2 is a different measure at t + 1, and x1 and x2 are parallel measures. As in the test-retest method, it is assumed that τt equals τt+1, that the expected values of et and et+1 are zero, and that the errors are uncorrelated with each other and with τt and τt+1. With these assumptions the correlation between x1 and x2 (ρx1x2) equals the reliability of both measures. The alternative forms method does have two advantages. One is that, compared to test-retest, the alternative form measures are less susceptible to memory effects, since different scales are administered at times t and t + 1. Second, the errors of measurement for one indicator are less likely to correlate with a new measure at the second time period; compared to test-retest, correlated errors of measurement are less likely to occur. Although the alternative forms estimate of reliability overcomes some of the limitations of the test-retest approach, several unrealistic assumptions remain. For example, it is still assumed that τt is equal to τt+1. The assumption that the error variances are equal is also less plausible, since x1 and x2 are different measures administered at different times.
6.2.3. Split-halves

A third means to estimate reliability is with split-halves. The split-halves method assumes that a number of items are available to measure τ. Half of these items are combined to form a new measure, say, x1, and the other half to form x2. Note that in contrast to the test-retest and alternative forms methods, x1 and x2 are measures of τ in the same time period. It is still assumed that E(e1) = E(e2) = 0, cov(e1, e2) = 0, cov(τ1, e1) = cov(τ1, e2) = 0, and that x1 and x2 are parallel measures. The equations for x1 and x2 are

x1 = τ1 + e1 ,    x2 = τ1 + e2 .    (18)

The correlation of x1 and x2 equals

ρx1x2 = cov(x1, x2) / [var(x1) var(x2)]^(1/2) = var(τ1) / var(x1) = ρx1x1 = ρx2x2 .    (19)
In many cases the unweighted sum of the two halves forms a composite to measure τ1, so the reliability of x1 + x2 may be determined. As demonstrated earlier, in general the squared correlation of τ1 with the observed score represents the reliability of a measure. Employing this notion, the squared correlation of τ1 with x1 + x2 is

ρ²τ1(x1+x2) = [cov(τ1, x1 + x2)]² / [var(τ1) var(x1 + x2)]
            = 4[var(τ1)]² / {var(τ1)[var(x1) + var(x2) + 2 cov(x1, x2)]}
            = [2 var(τ1)/var(x1)] / [var(τ1)/var(x1) + var(x1)/var(x1)]
            = 2ρx1x1 / (1 + ρx1x1) .    (20)
This formula is well known as the Spearman-Brown Prophecy formula for gauging the reliability of a full test based on split-halves. The split-halves method has several aspects more desirable than the test-retest and alternative forms methods. For one, the split-halves method does not assume perfect stability of τ, since τ is only gauged in one time period. Secondly, the memory effects that can occur if the same item is asked at two points in time do not operate with this approach. Third, the correlated errors of measurement that are likely in test-retest approaches are less likely for split-halves. A practical advantage is that split-halves are often cheaper and more easily obtained than data collected over time.
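As a worked illustration of Eq. (20) (a short Python sketch; the split-half correlations below are made-up values, not results from any study):

```python
def spearman_brown(r_halves):
    """Reliability of the full test from the correlation of its two halves, as in Eq. (20)."""
    return 2 * r_halves / (1 + r_halves)

for r in (0.5, 0.7, 0.9):
    print(f"split-half r = {r:.1f}  ->  full-test reliability = {spearman_brown(r):.2f}")
# 0.67, 0.82, 0.95: doubling the test length raises reliability, with diminishing returns.
```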
One disadvantage is that the split-halves must be parallel measures. Often we cannot know whether the variances of the measurement errors are equal, or whether α1 and α2 are equal to one. Another drawback is that the way the halves are allocated is somewhat arbitrary. There are many possible ways of dividing a set of items in half, and each split could lead to a different reliability estimate.

6.2.4. Cronbach's α coefficient

Cronbach's α coefficient overcomes some of the disadvantages of the split-halves method. The coefficient α is the most popular reliability coefficient in social science research. It measures the reliability of a simple sum of τ-equivalent or parallel measures. For α, the observed variables x1, x2, . . . , xq are summed. The xi's should be scored so that they are all positively or all negatively related to τ1. I will call this index H, so that H = x1 + x2 + ··· + xq. The squared correlation of τ1 and H, or the reliability of H, is

ρ²τ1H = [cov(τ1, H)]² / [var(τ1) var(H)]
      = [cov(τ1, x1 + x2 + ··· + xq)]² / [var(τ1) var(H)]
      = [cov(τ1, qτ1 + e1 + ··· + eq)]² / [var(τ1) var(H)]
      = [q var(τ1)]² / [var(τ1) var(H)]
      = q² var(τ1) / var(H)
      = ρHH .    (21)
This equation provides a general formula for the reliability of the unweighted sum of q τ-equivalent or parallel measures. As the next equation shows, it can be manipulated so that it appears as the typical formula for Cronbach's α:

ρHH = q² var(τ1) / var(H)
    = [q/(q − 1)] · q(q − 1) var(τ1) / var(H)
    = [q/(q − 1)] · [q² var(τ1) − q var(τ1)] / var(H)
    = [q/(q − 1)] · [q² var(τ1) + Σi var(ei) − q var(τ1) − Σi var(ei)] / var(H)
    = [q/(q − 1)] · {var(H) − [q var(τ1) + Σi var(ei)]} / var(H)
    = [q/(q − 1)] · [1 − Σi var(xi) / var(H)] ,    (22)

where the sums run over i = 1, . . . , q; the last two steps use the facts that, for τ-equivalent measures, var(H) = q² var(τ1) + Σi var(ei) and Σi var(xi) = q var(τ1) + Σi var(ei).
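Equation (22) translates directly into a few lines of code. The following is a minimal sketch (Python with NumPy; the item scores are invented purely for illustration):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n subjects x q items) array, following Eq. (22)."""
    items = np.asarray(items, dtype=float)
    q = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)          # var(x_i) for each item
    total_var = items.sum(axis=1).var(ddof=1)      # var(H), with H the sum of the items
    return (q / (q - 1)) * (1 - item_vars.sum() / total_var)

# Five subjects answering four items scored 1-5 (hypothetical data).
data = [[4, 5, 4, 4],
        [2, 3, 2, 3],
        [5, 5, 4, 5],
        [3, 3, 3, 2],
        [1, 2, 2, 1]]
print(f"Cronbach's alpha = {cronbach_alpha(data):.2f}")
```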
With these features the advantages of α over the other reliability measures should be evident. There are no assumptions needed about the stability of τ1. The measures need not be parallel. The possibility of memory effects is remote, since measures for only one time period are used. There is no problem in selecting splits of items for testing, since all measures can be treated individually. In addition, computation of α is relatively easy. However, two drawbacks to α are that it underestimates reliability for congeneric measures, and that it is not suited to work with single indicators. Measurement is a broad topic in social science research. This section emphasized the issues of measurement most relevant to a structural equations approach to measurement models. Most basic is the need to begin with a clear definition of the concepts to be measured. Without such a definition, we have little hope of identifying dimensions and latent variables. Validity and reliability are two basic characteristics of measures. Validity refers to the direct correspondence between a measure and a concept. Reliability refers to the consistency of a measure, regardless of whether it is valid. Many researchers have proposed empirical techniques to estimate validity and reliability. These often are based on correlation coefficients and restrictive assumptions about the properties of measures. Several alternative measures have been presented here that are more general than the traditional procedures, and they also fit well into a structural equations approach to measurement.

References

1. World Health Organization (1991). World Health Statistics Annual. WHO, Geneva.
2. Bergner, M., Bobbitt, R. A., Carter, W. B. et al. (1981). The sickness impact profile: Development and final revision of a health status measure. Medical Care 19: 787–805.
3. Hunt, S. M., McKenna, S. P. and McEwan, J. (1989). The Nottingham Health Profile: Users Manual, revised edition.
4. Ware, J. E., Snow, K. K., Kosinski, M. and Gandek, B. (1993). SF-36 Health Survey: Manual and Interpretation Guide. New England Medical Center, MA, USA.
5. Fallowfield, L. (1990). The Quality of Life: The Missing Measurement in Health Care. Souvenir Press.
6. The WHOQOL Group (1998). The World Health Organization Quality of Life Assessment (WHOQOL): Development and general psychometric properties. Social Science and Medicine 12: 1569–1585.
7. Fayers, P. M. and Machin, D. (2000). Quality of Life: Assessment, Analysis and Interpretation. John Wiley and Sons, New York.
8. Hunt, S. M., McKenna, S. P., McEwen, J. et al. (1981). The Nottingham Health Profile: Subjective health status and medical consultations. Social Science and Medicine 15A: 221–229.
9. Ware, J. E. Jr, Snow, K. K., Kosinski, M. et al. (1993). SF-36 Health Survey Manual and Interpretation Guide. New England Medical Centre, Boston, MA.
10. Brooks, R., with the EuroQol Group (1996). EuroQol: The current state of play. Health Policy 37: 53–72.
11. Hickey, A. M., Bury, G., O'Boyle, C. A. et al. (1996). A new short-form individual quality of life measure (SEIQoL-DW): Application in a cohort of individuals with HIV/AIDS. British Medical Journal 313: 29–33.
12. Bjordal, K., Hammerlid, E., Ahlner-Elmqvist, M. et al. (1999). Quality of life in head and neck cancer patients: Validation of the European Organization for Research and Treatment of Cancer Quality of Life Questionnaire-H&N35. Journal of Clinical Oncology 17: 1008–1019.
13. Spilker, B. (ed.) (1996). Quality of Life and Pharmacoeconomics in Clinical Trials. Lippincott Williams and Wilkins, Philadelphia.
14. Torrance, G. W. (1986). Measurement of health state utilities for economic appraisal: A review. Journal of Health Economics 5: 1–30.
15. Bollen, K. A. (1989). Structural Equations with Latent Variables. John Wiley and Sons, New York.
About the Author

Jiqian Fang, born in Shanghai in 1939, obtained a BS in Mathematics (1961) from Fudan University and a PhD in Biostatistics (1985) from the University of California at Berkeley. From 1985 to 1990, he was Professor and Director of the Department of Biostatistics and Biomathematics, Beijing Medical University; since 1991, he has been Director and Chair Professor,
Department of Medical Statistics, School of Public Health, Sun Yat-Sen University. His research projects cover a wide range of fields, including "Stochastic Models of Life Phenomena", "Gating Dynamics of Ion Channels", "Biostatistical Theory and Methods for Research on Cancer Prevention", "Bootstrap Studies on Multi-state Models", "Statistical Methods for Data on Quality of Life", "Health and Air Pollution", "Analysis of DNA Finger Printing", and "Linkage Analyses between Complex Traits and Multiple Genes", among others. Some of these projects have received awards from the Beijing Municipal Government or the Ministry of Public Health of China for their significant advances in the field of biostatistics.
CHAPTER 7
META-ANALYSIS
XUYU ZHOU Medical Information Institute, Sun Yat-Sen University, 74 Zhongshan Road II, Guangzhou 510080, PR China JIQIAN FANG,∗ CHUANHUA YU and ZONGLI XU Department of Medical Statistics, School of Public Health, Sun Yat-Sen University, 74 Zhongshan Road II, Guangzhou 510080, PR China Tel: 86-20-87330671; ∗[email protected] YING LU Department of Radiology, University of California, 3333 California Street, Suite 375, San Francisco, CA 94118, USA Tel: 415-502-4596; [email protected] The best possible synthesis of available information is essential for medical researchers, health policy-makers, clinicians and other decision makers. With the explosion of information in the literature, literally hundreds of studies may exist on the same topics, and the designs, participants, outcomes, sample sizes, and interventions among these studies may differ. How can information derived from those studies be combined to arrive at a general conclusion? During the past 20 years, meta-analysis, a statistical procedure for systematically combining and analyzing the results of previous research, has been applied with increasing frequency to health-related contexts, especially in the fields of clinical trials.
1. Introduction

1.1. Definition

The term "meta-analysis" was coined by psychologist Glass in 1976.1 The prefix "meta" has several related meanings, including the ideas of occurring after something else, of transcending, or of being more comprehensive than the precursor. Glass' first definition of meta-analysis is the statistical analysis of a large collection of analysis results from individual studies for the purpose of integrating the findings. A useful definition was given by Huque: ". . . the term 'meta-analysis' refers to a statistical analysis which combines or integrates the results of several independent clinical trials, considered by the analyst to be 'combinable'."2 Similar synonyms of meta-analysis include "overview", "quantitative review", "quantitative synthesis", and "pooling". But these alternative terms may be less specific or less poignant, and were not broadly accepted. More recently, Evidence-Based Medicine (EBM) has developed greatly, and EBM, systematic review, and meta-analysis have become widely used terms in medical journals. Systematic review denotes any type of review that has been prepared using strategies to avoid bias and that includes a materials and methods section. A systematic review may or may not include a formal meta-analysis. The Cochrane Collaboration aims to prepare, maintain, and disseminate comprehensive and systematic reviews of the effects of health care. Systematic reviews provided by the Cochrane Collaboration are regarded as the best evidence for practicing EBM.3 Nowadays, meta-analysis is not limited to a statistical approach, and is defined as a systematic approach to identifying, appraising, synthesizing, and (if appropriate) combining the results of relevant studies to arrive at conclusions about the body of research.4

1.2. Historical notes

The origins of pooling results may be traced to the statistician Karl Pearson in 1904, who was the first researcher to report the use of formal techniques to combine data from different samples. The first article to quantitatively synthesize previous research in medicine, The Powerful Placebo, written by Beecher, was published in 1955.5 As a formal statistical technique for combining data from studies on the same topic, meta-analysis began to be applied to the social sciences in the mid-1970s, particularly in educational and psychological research.
Widespread use of meta-analysis in medicine quickly followed its popularization in the social sciences, mainly focusing on randomized clinical trials. In the late 1980s there was rapid growth in interest in and use of the method. At that time, descriptions of the method of meta-analysis and guidelines for its application appeared almost simultaneously in many influential general medical journals, such as the New England Journal of Medicine, Lancet, and Annals of Internal Medicine. Meta-analysis was adopted by MEDLINE as a Medical Subject Heading (MeSH) term in 1989 and as a Publication Type (PT) in 1993. Meta-analysis of observational studies has also been advocated. Meta-analysis is now commonplace in a wide range of medical research contexts. Concurrent with the increased number of articles using meta-analysis in the last decade, there have been numerous articles relating to statistical issues or concerns. Many methods have been proposed and used, from crude "vote counting" of studies showing significant or non-significant results, through methods for combining effect size estimates based on fixed- or random-effects models, to general linear mixed models and Bayesian methods. Meta-analysis has established itself as an influential branch of biostatistics. Despite the sharply increasing use of meta-analysis, several unresolved issues remain: incomplete or unstandardized reporting of results, and the combining of "apples and oranges and the occasional lemon" — a failure to make allowance for the varying nature and quality of the studies reviewed.6 Therefore, both the uncritical synthesis of data from observational studies and the unconsidered synthesis of disparate results from randomized controlled trials can threaten to damage the validity and reliability of the conclusions of a meta-analysis. Other stubborn problems involved in meta-analysis are biases, especially publication bias, and heterogeneity across studies.
1.3. Objectives of meta-analysis

Traditionally, research synthesis was done in a fairly simple way. Classic narrative reviews have several disadvantages that meta-analysis appears to overcome. The traditional review is a subjective method of summarizing research data and is therefore prone to bias and error. Without guidance by formal rules, a narrative review expresses the personal opinions of its
authors and depends heavily on the perspicacity and personal experience of the reviewer. Selective inclusion of studies that support the reviewer's view is common. Moreover, a narrative review tends, in most situations, to present a series of effect measures in the narrative, and reviewers tend to ignore the factors that greatly influence the results of primary studies, such as research design, sample size, and effect size. Meta-analysis provides a logical framework for reviewing research: similar measures from comparable studies are listed systematically and the available effect measures are combined where possible. For example, in 1982, the use of thrombolytic agents after acute myocardial infarction was controversial. Table 1 presents the data of eight randomized clinical trials available at that time, which examined the effect on mortality of a loading dose of at least 250,000 international units of intravenous streptokinase given a short time after an acute myocardial infarction had occurred.

Table 1. Results of randomized trials of the effect on mortality of intravenous streptokinase following acute myocardial infarction, published before 1982.

                                           Deaths/Total            Mortality (%)     Estimated relative risk
  Included study                         Treated    Control       Treated  Control   and its 95% CI
  Avery (1969)                            20/83      15/84          24.1     17.9     1.35 (0.74–2.45)
  European Working Party (1971)           69/373     94/357         18.5     26.3     0.70 (0.53–0.92)
  Heikinheimo (1971)                      22/219     17/207         10.0      8.2     1.22 (0.67–2.24)
  Dioguardia (1973)                       19/164     18/157         11.6     11.5     1.01 (0.55–1.85)
  Breddin (1973)                          13/102     29/104         12.7     27.9     0.46 (0.26–0.81)
  Bett (1973)                             21/264     23/253          8.0      9.1     0.88 (0.50–1.54)
  Aber (1979)                             43/302     44/293         14.2     15.0     0.95 (0.64–1.40)
  UCSG for Streptokinase in AMI (1979)∗   18/156     30/159         11.5     18.9     0.61 (0.36–1.04)
  Summary relative risk                                                                0.80 (0.68–0.95)

  ∗ European Cooperative Study Group for Streptokinase in acute myocardial infarction.

As shown in Table 1, two trials showed a higher risk of mortality in treated patients, with both 95% confidence intervals covering one, which means no statistical significance; five showed a lower risk, with four of those 95% confidence intervals covering one; and one showed the same mortality rate in the treated and the control patients. The trials were all fairly small, and the difference in mortality between treated and control patients was
statistically significant in only one trial. These studies were interpreted as inconclusive about the benefit of early treatment with intravenous streptokinase. In a meta-analysis based on these trials, Stampfer estimated the relative risk of mortality in patients treated with intravenous streptokinase to be 0.80 with 95% confidence limits of 0.68 and 0.95, and drew the conclusion that streptokinase reduces mortality following acute myocardial infarction.7 The findings were published in the famous medical journal, the New England Journal of Medicine, but were not accepted by clinicians, owing to poor understanding of meta-analysis in the early 1980s. Not until 1986 did a large clinical trial of intravenous streptokinase after acute myocardial infarction involving thousands of patients (GISSI 1985) confirm the conclusion based on the meta-analysis, after which streptokinase came to be widely used in clinical practice.

The objectives of meta-analysis are:

1.3.1. To increase statistical power

Meta-analysis effectively provides a gain in statistical power for average estimates. In clinical trials, meta-analysis offers an opportunity to observe more events of interest in the groups followed when incidence or mortality is rare, and combined estimates are likely to be more precise. In some cases, a single study cannot detect or exclude, with great confidence, a modest albeit clinically relevant difference in the effects of two treatments. For example, suppose a drug could reduce the risk of death from myocardial infarction by 10%; to detect such an effect with 90% power (that is, with a type II error of no more than 10%), over 10,000 patients in each treatment group would be needed. However, such large samples are difficult to recruit in a single study. Clearly, if data from more than one study are available and can be combined, the "sample size" and, thus, the power increase, and relatively small effects can be detected or excluded with confidence.

1.3.2. To improve the estimate of effect size

Meta-analysis has historically been useful in summarizing prior research based on randomized trials when individual studies are too small to yield a valid conclusion. Results from studies may disagree as to the magnitude of effects or, of more concern, as to the direction of effects. By integrating the actual evidence, meta-analysis allows a more objective appraisal, which can help to resolve uncertainties when the original studies, classic
reviews, and editorial comments disagree. As an effective tool for quantitative synthesis, meta-analysis may resolve issues relating to inconsistent or conflicting results from studies, provide a pooled estimate of effect size with a more precise confidence interval, and draw an explicit conclusion.

1.3.3. To assess the disagreement and generalizability of study results

Studies on the same topic may use different eligibility criteria for participants, different definitions of disease, different methods of measuring or defining exposure, or different variations of treatment. That is, there is heterogeneity between studies. When heterogeneity is large enough to be detected by a statistical test, it is important to explore its source. Meta-analysis also systematically assesses the biases and confounding in primary studies. On the other hand, meta-analysis can contribute to considerations about the generalizability of study results. The findings of a particular study may be valid only for a specific population of patients with the same characteristics as those investigated in the trial. If many trials are available for different groups of patients, and show similar results, it can be concluded that the effect of the intervention under study has some generality. Furthermore, meta-analysis is also superior to individual trials when answering questions about whether an overall study result varies among subgroups — for example, among men and women, older and younger patients, or subjects with different degrees of severity of disease. These questions can be addressed in the analysis and often lead to insights beyond what is provided by the calculation of a single combined effect estimate.

1.3.4. To answer new questions that were not previously posed in the individual studies

Meta-analysis includes the epidemiological exploration and evaluation of results; new ideas (hypotheses) that were not posed in the individual studies can thus be developed and tested, guiding further research and further original studies.

1.4. The main steps involved in a meta-analysis

Meta-analysis should be viewed as an observational study of the evidence. The steps involved are similar to any other research undertaking:
Formulation of the problem to be addressed, collection and analysis of the data, and reporting of the results.
1.4.1. Formulating the problem

It is as important to carefully plan a meta-analysis as it is to carefully plan a clinical trial, a cross-sectional survey, or a case-control or cohort study. Documentation of all aspects of study design and conduct is a crucial and often overlooked step in carrying out the meta-analysis. As with any research, a meta-analysis begins with a well-formulated question and design. A meta-analysis can, in general, be motivated by a number of factors. It can be conducted in an effort to resolve conflicting evidence, to answer questions whose answer is uncertain, or to explain variations in practice. A well-formulated question is essential for determining the structure of a meta-analysis. Specifically, it will guide much of the meta-analysis process, including strategies for locating and selecting studies or data, for critically appraising their relevance and validity, and for analyzing variation among their results. There are several key components to a well-formulated question. A clearly defined question should specify the types of people (participants), types of interventions or exposures, types of outcomes that are of interest, and types of study design. In general, the more precise one is in defining components, the more focused the meta-analysis. The first step in planning the study is to define the problem. The problem definition is a general statement of the main questions that the study addresses. For example: does thrombolytic therapy lower the risk of death for patients with acute myocardial infarction (a meta-analysis of randomized clinical trials)? Does passive smoking increase the risk of lung cancer in women (a meta-analysis of case-control studies)? These two topics are well-formulated questions that contain the main elements for a meta-analysis. Once the problem is defined, developing a detailed study protocol is essential. A protocol is the blueprint for the conduct of the meta-analysis. The protocol should clearly state the objectives, the background, the hypotheses to be tested, the subgroups of interest, the proposed methods, and the criteria for identifying and selecting relevant studies, and extracting
and analyzing information. The statement of objectives should be concise and specific.

1.4.2. Searching the relevant information

A comprehensive, unbiased information search is one of the critical differences between a meta-analysis and a traditional review. Systematic procedures for literature searching should be described in the protocol in detail. Ideally, all of the relevant information, including the published literature, unpublished literature, uncompleted research reports, and work in progress, would be searched and identified in a meta-analysis. In practice, the meta-analyst begins with searches of regular medical databases of published literature. Developing a search strategy is very important; this means specifying the exact search terms and the search algorithm for each computer database. Sometimes restrictions are necessary, such as language, study subjects, publication year, or publication type, and these are easy to apply in a computer database search. Skipping over important documents available in databases during the searching process may affect the validity and reliability of the results of the meta-analysis. The ability of a search algorithm to identify all of the pertinent literature can be improved by consultation with a professional librarian or an expert searcher. Two useful concepts in information retrieval can be used to describe the success of the search process: sensitivity and precision. Sensitivity of a search is its ability to identify all of the relevant material. Precision (which is the positive predictive value of the search) is the proportion of relevant material among the materials retrieved by the search. The overall strategy for searching is to maximize sensitivity and precision. But as sensitivity (recall) increases, precision may be reduced. For meta-analysis, high sensitivity may be more important than precision. MEDLINE is the most powerful bibliographic database and is the primary source of information on publications in the biomedical literature. It contains information on publications in over 3,500 journals and covers the period from 1966 to the present. MEDLINE provides more than 10 search entries and is very user-friendly. The use of MeSH (Medical Subject Headings) terms allows searches of MEDLINE to be focused and specific, which gives higher sensitivity and precision. Free access to MEDLINE through the Internet (www.ncbi.nlm.nih.gov/PubMed) greatly enhances the ability to conduct searches. Other broadly used biomedical databases include EMBASE, SCI (Web of Science), the Cochrane Library, and specific databases such as CANCERLINE, TOXLINE, etc.
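As a concrete, purely hypothetical illustration of these two retrieval measures (the counts below are invented, not from any actual search):

```python
# Hypothetical counts for one search strategy.
retrieved = 400            # records returned by the search
relevant_retrieved = 120   # of those, the ones actually relevant
relevant_total = 150       # all relevant records that exist in the database

sensitivity = relevant_retrieved / relevant_total    # recall: 0.80
precision = relevant_retrieved / retrieved           # positive predictive value: 0.30
print(f"sensitivity = {sensitivity:.2f}, precision = {precision:.2f}")
# Broadening the search raises sensitivity but typically lowers precision.
```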
The citations or abstracts in databases are browsed during the search process, and those obviously unrelated to the topic are eliminated. The full text of the remaining articles is then collected. These articles are read quickly, and clearly irrelevant ones are excluded. The remaining publications are then systematically reviewed to determine whether they are eligible for the meta-analysis based on predetermined criteria for eligibility. The reference lists of the articles that contain useful information are searched for more references, the new publications are retrieved, and the process is repeated until all potentially relevant articles on the topic are identified. Medical information is also presented on professional websites, especially medical journals' websites, some of which also provide free full text. Handsearching is often used: scanning new information in key journals in the area of interest is an important supplement. Furthermore, "fugitive" literature, such as proceedings of conferences, dissertations and master's theses, book chapters, and government reports, is not included in MEDLINE and most other databases. Ignoring these materials has the potential to cause bias in the meta-analysis. One of the most effective ways to obtain information about publications in the fugitive literature is to consult experts. Unpublished studies are the ultimate example of fugitive literature. The existence of large numbers of unpublished studies may cause publication bias, which will be discussed in detail in the final section of this chapter.

1.4.3. Selecting the studies eligible for inclusion

Studies are chosen for meta-analysis on the basis of inclusion and exclusion criteria. Inclusion criteria are ideally delineated at the stage of the development of the meta-analysis protocol, and should depend on the specific objectives of the analysis. The process of determining whether studies are eligible for inclusion in the meta-analysis should be systematic and rigorous. Each article must be assessed to see whether the inclusion criteria for the meta-analysis are met. To ensure reproducibility and minimize bias in selecting studies, the following six aspects should be addressed in almost all meta-analyses.

1.4.3.1. Study Population

What types of people should be included in the meta-analysis? This involves deciding whether one is interested in a specific population group determined on the basis of factors such as age, sex, educational status, or
presence of a particular condition, such as the severity or type of disease. For example, in a meta-analysis of the effects of estrogen replacement therapy on the risk of breast cancer, the inclusion criteria for the study population were limited to women who experienced natural menopause or who underwent premenopausal hysterectomy, with or without bilateral oophorectomy. Studies that included subjects with a previous history of breast cancer were excluded.

1.4.3.2. Study Design

In clinical trials, the effect in a non-randomized controlled study is often overestimated compared with that in a randomized study. The treatment effect under a single-blind design may differ from that under a double-blind design, even though other aspects of the studies are the same. When both randomized and nonrandomized studies are available for a topic, estimates of effect size should be made separately for the randomized and the nonrandomized studies. In observational studies, the results of case-control and cohort studies may be discrepant for an identical problem, owing to the effects of confounding factors, the influence of biases, or both. The results of a meta-analysis then need to be reported separately according to study design.

1.4.3.3. Intervention or Exposures

One of the key components of eligibility for a meta-analysis is to specify the intervention or exposure that is of interest, and the types of control groups that are acceptable also need to be defined. In other words, one must decide how similar interventions (exposures) should be in order to use them in the same analysis — for example, studies with different doses of the same drug in clinical trials, or studies with different intensities of exposure in observational data. For example, a meta-analysis of low-dose aspirin for the prevention of pregnancy-induced hypertensive disease included studies in which the intervention was aspirin in doses of less than 325 mg/day.

1.4.3.4. Outcomes

Researchers in primary studies often report more than one outcome, and may report the same outcome using different measures. When defining eligibility criteria for the meta-analysis, basing eligibility on the similarity
of the outcome will enhance the homogeneity of the studies. Generally, end-points that are comparable, quantitative, and reflect the final outcomes are appropriate choices for meta-analysis. For example, the chief endpoints included in the meta-analysis of randomized trials of angiotensin-converting enzyme (ACE) inhibitors on mortality and morbidity in patients with congestive heart failure (CHF) are total and cause-specific mortality (i.e. progressive heart failure, myocardial infarction, and sudden or presumed arrhythmic death) and hospitalization for CHF.

1.4.3.5. Inclusive dates of publication and English-language publication

A meta-analysis should be as up-to-date as possible, and the cutoff date for identification of eligible studies should be specified in the report of the meta-analysis. The inclusive dates of publication should be chosen based on the likelihood of finding important and useful information during the period chosen, and not simply on convenience, such as the availability of MEDLINE. A meta-analysis based solely on English-language publications has been shown to have the potential to cause bias. It is not valid to conduct a meta-analysis relying only on the publications and reports that are easily found and understood.

1.4.3.6. Restriction on sample size or length of follow-up

Most of the classical statistical methods for meta-analysis are based on asymptotic normality under moderately large samples. The precision of small studies may tend to be overestimated. To avoid the problem of weighting small studies inappropriately in the meta-analysis, it is reasonable to make sample size an eligibility criterion for the meta-analysis, so that small studies are excluded. Sometimes the length of follow-up may influence the likelihood of observing a true association in clinical trials. For observational studies, there are many situations where exposure would not affect the risk of disease until after a latent period. To avoid these problems, the length of follow-up could be a criterion for eligibility for the meta-analysis. An alternative to making study size or length of follow-up an eligibility criterion is to estimate the effect with and without small studies, or with and without studies with short follow-up or low-dose exposure.
For example, in a meta-analysis of the efficacy of screening mammography, one of the inclusion criteria was a length of follow-up of at least 5 years, with a minimum of 10 breast cancer deaths in each eligible study. Generally, highly restrictive eligibility criteria tend to give a meta-analysis greater validity. But the criteria may be so restrictive, and require so much homogeneity, as to limit the eligible studies to only one or two, which conflicts with one of the goals of meta-analysis as a method to increase statistical power. On the other hand, less restrictive criteria may lead to the accusation that the meta-analysis "mixes apples and oranges".

1.4.4. Abstracting the data

The process of abstracting information for meta-analysis from eligible studies should be reliable, valid, and free of bias. In order to enhance the reliability of data collection, a standardized form should be developed to record the information. The key components of a data collection form generally include study characteristics covering methods, participants, interventions, outcome measures and results. To avoid selection bias, the abstraction of information should be done by two abstractors separately, and experts should be consulted in case of disagreement. Furthermore, the abstractors should be blinded to the identity of the authors, the journals, and the funding sources, since these factors may influence the judgment of the abstractor.

1.4.5. Assessing study quality

It is important to systematically complete a critical appraisal of all included studies, which primarily focuses on the validity of the studies. If the quality of the original studies is poor, the results of the meta-analysis will be less reliable and valid. The validity of a study is the extent to which its design and conduct are likely to prevent systematic errors, or bias. Generally, there are four sources of systematic error in clinical trials: selection bias, performance bias, attrition bias and detection bias. The randomization process, the measurement of patient compliance, the blinding of patients and observers, the statistical analyses, and the handling of withdrawals in each primary study should be examined. For non-experimental studies, control for confounding, measurement of exposure and completeness of follow-up are the main factors that need to be carefully considered in the process of study quality assessment.
Because quality assessment is a subjective process, it may potentially cause error and bias. There is as yet no "gold standard" for appraising study quality, so the reliability of the quality rating scales used in published meta-analyses is often not formally evaluated.

1.4.6. Statistical analysis

The process of quantitatively combining the data is the key step of a meta-analysis, and is what distinguishes it from the traditional narrative review. The main procedures involved in the statistical analysis are: defining the outcome; testing homogeneity of the effect sizes; choosing the model (fixed-effects or random-effects); computing the pooled estimate of effect size (point estimate and confidence interval); testing the overall effect size; and displaying the results graphically.

1.4.7. Sensitivity analysis

The goal of sensitivity analysis in meta-analysis is to assess the robustness of the conclusion when different assumptions are made in conducting the analysis. Sensitivity analysis is usually conducted to examine the change in the pooled estimate of effect size when both fixed- and random-effects models are used. Sensitivity analysis is also often done by including and excluding certain studies — those that are controversial, that have large effects and thus dominate the analysis, or that cannot be determined to meet the eligibility criteria but whose exclusion may be problematic. When more than one estimate of effect size is available from a study, sensitivity analysis can be performed using one estimate and then the other. For example, Egger did a sensitivity analysis in the meta-analysis of β-blockade in secondary prevention after myocardial infarction.8 Firstly, the overall effect was calculated under different statistical models; the results showed that the overall effect estimates are virtually identical and that confidence intervals are only slightly wider with the random-effects model. Secondly, methodological quality was assessed in terms of how patients were allocated to treatment or control groups, how outcome was assessed, and how the data were analyzed. The results showed that the three low-quality studies presented more benefit than the high-quality trials. Exclusion of these three studies, however, leaves the overall effect and the confidence intervals practically unchanged. Third, when the analysis was stratified by study size, the results showed that the trials with the smallest sample sizes have the largest effect. However, exclusion of such studies has little effect on the
246
X. Zhou et al.
overall estimate. Thus, sensitivity analysis showed that the results from this meta-analysis were robust. 1.4.8. Discussion of results As with any medical article, the last step in meta-analysis is discussion. • Investigating and explaining the source of heterogeneity are critically important component of meta-analysis, when there is “statistically significant” heterogeneity across studies. Heterogeneity is easier to be observed in observational studies due to the diversity in their designs, the methods for collecting data, definitions of endpoints, and the degree of control for bias and confounding. Indeed, there are no statistical methods that can deal with the bias and confounding in the original studies. Meta-regression model and mixed model may adjust somewhat of heterogeneity by controlling the confounding, but it still cannot explain the source of heterogeneity. Sensitivity analysis and subgroup are useful for exploring the heterogeneity. It may not be appropriate with great difference. • Subgroup analysis is necessary when treatment effect vary according to patient-level covariance or trial-level characteristics. For example, the effect of a given treatment is unlikely to be identical across different group of participant — for example, young people versus elderly people, those with mild disease versus with severe disease. A relationship between the underlying risk of patient and treatment effect may crucially affects decisions about which patients should be treated from a cost-effectiveness perspective: Patient at high risk with a small proportionate treatment benefit may be preferentially treated compared to low risk patients with a larger proportionate treatment benefit. Sometimes the treatment effect may be in the opposite direction for patients at low and high risk. Metaanalysis thus offers a sounder basis for subgroup analysis. But metaanalytic subgroup analyses are prone to bias and need to be interpreted with caution. Ideally, if individual patient data in each eligible study can be obtained, a standardized subgroup analysis can be performed. • Meta-analysis is essentially viewed as an observational study. Bias can occur at multiple steps in the process of meta-analysis. Bias may seriously influence the validity and reliability of meta-analysis, and more attention needs to be paid to detect and assess of the bias. • When reporting the conclusion, we should summarized the key finding, interpret the results in light of the total of available evidence, and suggest
Meta-Analysis
247
a future research agenda. But for meta-analysis of observational studies, generalization of the conclusions must be explained in caution, because bias and confounding may distort the findings as we have shown above. For example, the hypothesis from ecological analyses that higher intake of saturated fat could increase the risk of breast cancer generated much observational research often with contradictory results. A comprehensive meta-analysis showed an association from case-control but not from cohort studies (odds ratio was 1.36 from case-control studies versus relative rate 0.95 from cohort study), and this discrepancy was also shown in two separate large collaborative meta-analyses of case-control and cohort studies. The most likely explanation for this situation is that biases in the recall of dietary items and in the selection of study participants have produced a spurious association in the case-control comparisons.9 2. Statistical Methods in Meta-Analysis 2.1. Definition of the study outcome The primary studies included in the meta-analysis may report several different end points. Often the meta-analyst has little control over the choice of the study outcome, and it is very important to select pooled statistic that is comparable across all studies. In some situations this task will be impossible. Here, three classes of outcome measures are discussed: Measures based on discrete outcome data, that may generally be thought of as odds ratios, relative risks, or risk differences; those based on continuous data, such as mean difference, and standardized mean difference; and a miscellaneous set of outcome measures that may be based on test statistics.10 2.1.1. Odds ratios, relative risks and risk differences Suppose there are K studies for binary discrete measurements included in the meta-analysis, whose data are in the form of 2 × 2 tables (see Table 2). Let i index study, in a typical one, clinical trials, let 1 denote treatment group, and 2 control group. We denote ai , bi , ci , and di as the number of observations in each of the cells defined by the treatment and outcome table, with n1i subjects in the treatment group and n2i in the control group. p1i and p2i , are the proportions of having the characteristic under study, such as death, relapse or some other kind of failure. In an epidemiological case-control study, the two groups would be the cases and controls and
Table 2.
Treated (Exposed)
Not Treated (Not Exposed)
Total
ai ci
bi di
n1i n2i
m1i
m2i
Ti
Death (Case) Survival (Control) Total
Table 3.
Parameter estimation for three binary measurements.
Parameter Risk Difference Relative Risk
Odds Ratio
Arrangement of data for 2 × 2 table.
Estimator
D = P 1 − P1
di = pˆ1i − pˆ2i
R = P1 /P2
r = pˆ1i /ˆ p2i
Ω=
P1 /(1 − P1 ) P2 /(1 − P2 )
ωI =
pˆ1i /(1 − pˆ1i ) pˆ2i /(1 − pˆ2i )
Standard Error 1 p1i (1 − p1i ) p2i (1 − p2i ) 2 + n1i n2i 1 (1 − p1i ) (1 − p2i ) 2 + SLog (ri) = n1i p1i n2i p2i
sdi =
sLog (ωi) =
1 1 1 1 + + + a b c d
1 2
Meta-Analysis
249
the characteristic under study would be exposed to the hypothesized risk factor. Table 3 gives the formula of parameter inferences in three potential study summary statistics: The ratio of the odds for the treated group to the odds for the control group (odds ratio, OR), the ratio of two probabilities (relative risk, RR), and the difference between two probabilities (risk difference, RD). OR and RR are typically analyzed on logarithmic scale with normal distribution approximation, and the confidence intervals for OR and RR are also computed on the logarithmic scale, then transformed back to the original scale. In practice, OR is widely used as an outcome measure for its convenient mathematical properties, which allow for easily combining data and testing the significance of the overall effect. The OR will be close to the RR, if the end point occurs relatively infrequently, such as less than 20%. RD or absolute risk reduction is easy to interpret and defined for boundary values (proportions of 0 or 1), and is approximately normally distributed for the modest sample sizes. RD reflects both the underlying risk without treatment and the risk reduction associated with treatment. Taking the reciprocal of the RD gives the “number needed to treat” (the number of patients needed to be treated to prevent one event), which is very useful in making a decision in clinical practice. 2.1.2. Means differences and standardized means differences When the primary studies report means as outcome measure on a continuous scale, there are two situations to be considered. First, all of the eligible studies use the same measure of effect, and mean difference may be used as summary measure to estimate pooled effect in the meta-analysis. Suppose the n1i and n2i are the sample sizes, x1i and x2i are the means, for ¯ 1i − X ¯ 2i , with standard treatment and control group, respectively. Yi = X error, si , calculated as with (n1i − 1)s21i + (n2i − 1)s22i 1 1 2 2 with s2pi = + , si = spi n1i n2i n1i + n2i − 2
where s21i and s22i are the treatment and control group variance, respectively, of the ith study. Second, all of the eligible studies address the same question, but the measure of effect is made using different instruments and thus different scales. When there is no direct measure common to all the studies, it may be feasible to transform the study-specific summary to a standardized (scalefree) statistic denoted as effect size. One common estimator of effect size is
250
X. Zhou et al.
the standardized mean difference, which is calculated as the difference of means divided by the variability of the measures. If Yij1 ∼ N (µ1 , σ 2 ) ,
j = 1, 2, . . . , n1i ,
Yij2 ∼ N (µ2 , σ 2 ) ,
j = 1, 2, . . . , n2i ,
then the standardized means difference is defined as µ1 − µ2 , δ= σ which denotes the gain (or loss) as the fraction of the measurements. The estimator of δ, Hedge’s g, is defined as Y¯ 1 − Y¯i2 . hi = i sp Such standardization leads to a unitless effect measure. The results from the original studies, where “success” is measured in different ways, can be standardized to unitless measures and then pooled. The estimated variance of hi is h2i 1 1 + + . var(hi ) = n1i n2i 2(n1i + n2i ) 2.1.3. Other measures When the summary data from the primary studies consist of test statistics, then it is sometimes possible to recover the estimated effect size if the appropriate pieces of information are also reported. For example, if the zstatistics is reported, the estimated standardized mean difference may be calculated as s 1 1 ˆ δ=z . + n1i n2i 2.2. Model choice In meta-analysis, pooled effects and confidence intervals are usually obtained by using appropriate parametric statistical models. Just like ANOVA, analysis the sources of variation may be critical for the model used in meta-analysis.11,12 There are at least two sources of variation to consider before combining summary statistics across studies. One is the inner- or within-study variation, which is derived from sampling error. Sampling error may vary with
Meta-Analysis
251
studies. In general, the sampling error may be relatively small for studies with large sample sizes, which means high degrees of precision and large weight would be given. The other is the inter- or between-study variation. The fixed-effects (FE) model assumes each study is measuring the same underlying parameter and there is no inter-study variation, in other words, the population from which the given studies were drawn comprises studies exactly like those in the sample, the only source of variation in the observations is due to within-study sampling. By contrast, the random effects (RE) model assumes each study is associated with a different but related parameter, which means the population believed to produce the sampled set of studies is a population of studies not exactly alike. For the RE model, each study’s observed effect results from sampling variation about a random effect measure, which itself is “drawn” from a distribution of effect measures. 2.2.1. Fixed-effects model A fixed-effects model assumes that each observed study effect, Yi (i = 1, 2, . . . , K), is a realization of a population of independent studies with common parameters. Let θ be the parameter of interest, which quantifies the average treatment effect. Assume that Yi is such that E(Yi ) = θ and let s2i = var(Yi ) be the estimate of variance of the effect in the ith study. For moderately large study sizes, each Yi should be asymptotically normal distributed (by the central limit theorem) and approximately unbiased. Thus, indep
Yi ∼ N (θ, s2i )
(1)
and s2i is assumed known. 2.2.2. Random-effects model The random-effects model framework postulates that each observed study effect, Yi , is a draw from a normal distribution with a study-specific mean, θi , and variance, s2i . θi is interpreted as the “true effect” in study i. Furthermore, θi is assumed to be a draw from some hyper-distributions of effects with mean θ and variance τ 2 . θ is the true underlying effect of interest, represent the average treatment effect, and τ 2 is the inter-study variance, or heterogeneity parameter. Thus, indep
Yi |θi , s2i ∼ N (θi , s2i ) ,
(2)
252
X. Zhou et al.
indep
θi |θ, τ 2 ∼ N (θ, τ 2 ) .
(3)
Random-effects model “borrow strength” across studies when estimating study-specific effects, θi , as well as the population effect θ. RE model of (2) and (3) is refer to “hierarchic” model. This structure will be particularly useful in the development of the Bayesian paradigm. 2.3. Statistical inference A test of homogeneity should be done before any further analysis. If no significant inter-study variation is found, a fixed-effects approach is adopted. Otherwise, the meta-analyst either adopts a random-effects approach or identifies study characteristics that stratify the studies into subsets with homogeneous effects. The test of heterogeneity is described next and followed by a description of inference for fixed-effects and randomeffects models. Maximum likelihood, and restricted maximum likelihood methods are given for both types of models. 2.3.1. Test of homogeneity The investigation of homogeneity is a crucial part of the meta-analysis. The fixed effects model assumes that the K study-specific summary statistics share a common mean θ. A statistical test for the homogeneity of study means is equivalent to testing H0 : θ = θ1 = θ2 = · · · = θK , H1 : At least two θi s different . The test statistic Qw =
k X i
ˆ2 Wi (Yi − θ)
(4)
will asymptotically follow χ2k−1 under H0 for large sample sizes. The overall treatment effect θ, is estimated as a weighted average, that is .X X Wi and Wi = 1/s2i . θˆ = Wi Yi If Qw is greater than the 100(1−α) percentile of the χ2 distribution, the hypothesis of equal means, H0 , would be rejected at the 100(1 − α) level. If H0 is rejected, the meta-analyst may conclude that the study means arose from two or more distinct populations and proceed by either attempting to identify covariates that stratify studies into the homogeneous populations
253
Meta-Analysis
or adopting a random-effects model. If H0 cannot be rejected, it would be concluded that the K studies share a common mean, θ. Tests of homogeneity have low power against the alternative var(θ i ) > 0. Note that not rejecting H0 is equivalent to asserting that the betweenstudy variation is small. The results of simulation by Hardy show that the power of homogeneity test depends on the number of included studies, the total information (i.e. total weight or inverse variance) available and the distributions among the different studies.13 In practice, if the studies are homogeneous, then the choice between the fixed- and randomeffects model is not important, as the models will yield similar results. The use of the random-effects model is not considered to be a defensible solution to the problem of heterogeneity. The random-effects model is generally “conservative”. That is, in most situations, use of the random-effects model will lead to wider confidence inference and a low chance to call a difference “statistically significant”. 2.3.2. Parameter estimation For fixed-effects model, when s2i is assumed known, log(L(θ|y, s2 )) ∝ P (Yi −θ)2 , which leads to the maximum likelihood estimator (MLE) i s2 i
Pk Wi Yi ˆ θMLE = Pi=1 k i=1 Wi
with
Wi =
1 . s2i
(5)
Standard inferences about θ are available using the fact that !−1 X . θˆMLE ∼ N θ wi i
2
For random-effects model, if τ is known, the MLE of θ is given by Pk wi (τ )Yi 1 ˆ . (6) θ(τ )MLE = Pi=1 with Wi (τ ) = 2 k + τ2 s i i=1 wi (τ )
However, in the more realistic case of unknown τ 2 , restricted maximum likelihood (RMLE) can be employed as a method for estimating variance components in a general linear model. Using the marginal distribution for y, the log-likelihood to be maximized is ( ) X (Yi − θˆR )2 2 2 2 2 log(si + τ ) + 2 log(L(θ, τ |s y) ∝ si + τ 2 i + log
X (s2i + τ 2 )−1 .
254
X. Zhou et al.
The REML of τ 2 is the solution of P 2 k 2 2 ˆ w (ˆ τ ) (Y − θ ) − s i R i i i k−1 P 2 τR2 = . (ˆ τ ) w i i
The estimator for the population mean is then calculated as Pk wi (ˆ τR )Yi θˆR = Pi k , w τR ) i (ˆ i
wi (ˆ τR ) =
1 , s2i + τˆR2
P τR ))−1 ). and inferences are made using θˆR ∼ N (θ, ( i wi (ˆ By equating the homogeneity test, Qw , to its corresponding expected value, DerSimonian and Laird proposed a non-iterative (method of moments) estimator of τ 2 as Q − (k − 1) P 2 τ 2 = max 0, P . w wi − P i wi
This leads to
P τDL )Yi i wi (ˆ θˆDL = P w (ˆ τDl ) i i
with wi (ˆ τDl ) =
1 . 2 s2i + τˆDL
θˆDL is also denoted Cochran’s semi-weighted estimator of θ and can be easily programmed using most software packages. A third estimator of τ 2 and θ is to adopt a fully Bayesian approach, which reflect the uncertainty in the estimates of hyperparameters. 2.4. Classical approaches for meta-analysis Many methods of meta-analysis have been proposed. Here we focus on the classic approaches based on two kinds of measures, discrete outcome and continuous outcome. 2.4.1. Measures based on a discrete outcome For measures based on discrete outcome, we primary discuss the methods involve the data in the form of 2× 2 table, which is widely used in the metaanalysis of clinical trials, cohort studies and case-control studies. Suppose the arrangement of data and table notation is still as shown in Table 2.
Meta-Analysis
255
2.4.2. Mantel-Haenszel method The Mantel-Haenzel method is a well-known approach for pooling data across strata. Since each study included in meta-analysis could be regarded as a stratum, Mantel-Haenzel method is appropriate for analyzing data for a meta-analysis. The method is based on the assumption of fixed-effects model, and the pooled measure is expressed as a combination of stratumspecific measures. Mantel-Haenzel method can be used when the measure of effect is a ratio measure, typically an odds ratio.14 In meta-analysis, the pooled estimate using Mantel-Haenzel method is the weighted average of the maximum-likelihood estimate of the odds ratios in each study, using the inverse of study level variances as weights. The odds ratio for the ith study ORi = abii cdii . The weight for the ith study wi = bTi cii . The pooled estimate of odds ratio is P P (wi ORi ) (ai di /Ti ) P ORMH = = P . (7) wi (bi ci /Ti ) The variance of the ORMH is equal to P P P F G H P P P var(ORMH ) = + + P 2, 2 R2 2 R S 2 S
with
F = G=
ai di (ai + di ) , Ti2
ai di (bi + ci ) + bi ci (ai + di ) , Ti2 H=
bi ci (bi + ci ) , Ti2
b i ci ai di , S= . Ti Ti The 95% confidence interval for pooled odds ratio is equal to p exp ln ORMH ± 1.96 var(ORMH ) . R=
(8)
The Q statistics for homogeneity test is given by X Q= wi (ln ORMH − ln ORi )2 =
X
wi [ln(ORi )]2 −
P [ wi ln(ORi )]2 P . wi
(9)
256
X. Zhou et al.
Under the null hypothesis of homogeneity, Q has an approximate χ2k−1 distribution. The test based on Mantel-Haenszel χ2 has optimal statistical properties, being the uniformly most powerful test. But application of the method requires that data to complete a 2×2 table of outcome by treatment groups for each study are available. 2.4.1.2. Peto method The Peto method is a modification of Mantel-Haenszel method. It is based on the fixed-effects model and the effect measure of interest is odds ratio.15 Peto method uses a score statistics and Fisher information statistics from conditional likelihood for study-specific effects to estimate pooled effects. The computation involved in Peto method is relatively simple compared to Mantel-Haenszel method. Peto method has been extensively used, especially in clinical trials. Let Oi and Ei be the observed and expected number of events in the 1i . treatment group for ith study, respectively, where Ei = n1iTm i The pooled estimate of odds ratio is equal to P (Oi − Ei ) P , (10) ORp = exp Vi
1i n2i m2i is the variance of the difference Oi − Ei . where Vi = n1iTm2 (T i −1) i The 95% confidence interval for pooled odds ratio is ! pP ! P Vi (Oi − Ei ) ± 1.96 1.96 P . exp ln ORp ± pP = exp Vi Vi
The homogeneity test, Q, is given by P X (Oi − Ei )2 ( (Oi − Ei ))2 P Q= . − Vi Vi
(11)
(12)
Under the null hypothesis of homogeneity, Q has an approximate χ2k−1 distribution. Although Peto method is widely used, it has been demonstrated to be potentially biased when the true common odds ratio is far from unity or when there are large unbalances between the numbers of death and survival or exposed and non-exposed. In this situation, Mantel-Haenszel may be preferred. Example 1. Table 4 shows data from seven randomized clinical trials of the effect of aspirin in preventing death after myocardial infarction16 The Peto
257
Meta-Analysis
method is used to estimate a summary odds ratio and its 95% confidence interval for these data is as follows: Table 4. Data form seven randomized trials of the effectiveness of aspirin after myocardial infarction and the results of meta-analysis (Peto method). Aspirin Study 1 2 3 4 5 6 7
Placebo
No. No. No. No. Deaths patient death Patient 49 44 102 32 85 246 1570
615 758 832 317 810 2267 8587
67 64 126 38 52 219 1720
Total
624 771 850 309 406 2257 8600
Vi
ORi
(Oi − Ei )2 /vi
5i.6 −8.6 53.5 −9.5 112.8 −10.4 35.4 −3.4 91.3 −6.3 233.0 13.0 1643.8 −73.8
26.3 25.1 49.3 15.5 27.1 104.3 665.1
0.720 0.681 0.803 0.801 0.798 1.133 0.895
2.8 3.6 2.4 0.7 1.5 1.6 8.2
−99.4
912.7
Ei
O i − Ei
20.8
Source: Fleiss and Gross.16
2.4.1.2.1. Homogeneity test Calculate Ei , Vi , Oi − Ei , and (Oi − Ei )2 /Vi , and the results are show in Table 4. P X (Oi − Ei )2 (−99.4)2 ( (Oi − Ei ))2 P = 20.8 − − = 10.1 . Q= Vi Vi 912.7
Here, df = 6, χ2(0.05,6) = 12.6 > 10.1, P > 0.05, the null hypothesis of homogeneous odds ratio would not be rejected at 5 percent level, so that the fixed-effects model may be appropriate to be adopted for pooling the odds ratio. 2.4.1.2.2. Calculate the pooled estimate of odds ratio and its 95% confidence interval P −99.4 (Oi − Ei ) P = exp ORp = exp = 0.09 Vi 912.7 ! pP ! √ P (Oi − Ei ) ± 1.96 Vi −99.4 ± 1.96 912.7 P exp = exp 912.7 Vi = (0.84, 0.96) .
258
X. Zhou et al.
2.4.1.2.3. Graphical presentation of the results
Odds ratio (95%CI)
1.6 1.4 1.2 1.0 0.8 0.6 0.4 1
2
3
4 5 study
6
7 pooled
Fig. 1. The odds ratios of seven studies and their 95% confidence interval, and pooled odds ratio and its 95% confidence interval.
2.4.1.3. Fleiss method When data to complete a 2×2 table is not available, the Peto method could not be adopted unless those studies are excluded. Sometimes the individual study may report the proportions having the characteristic under study, the Fleiss method can be used as alternative the Peto method based on fixed-effect model. For a clinical trial or cohort study, let p1i and p2i be the mortality rate or incidence rate for treated (exposed) and control group, respectively. For a case-control study, let p1i and p2i be the exposure rate for case and control group, respectively. Fleiss draws the formula of pooling the log odds ratio when p1i and p2i are given in the included study in meta-analysis.16 The effect for ith study, denoted by yi , is the logarithm of the odds ratio: yi = ln(ORi ) = ln(p1i (1 − p2i )/p2i (1 − p1i )) . The variance and weight of yi are given by var(yi ) =
1 1 + , n1i p1i (1 − p1i ) n2i p2i (1 − p2i )
wi =
The pooled estimate of odds ratio is equal to .X X ORF = exp(¯ y ) = exp wi . wi yi
1 . var(yi )
(13)
259
Meta-Analysis
The 95% confidence interval for summary odds ratio is given by qX exp y¯ ± 1.96 wi .
(14)
The Q statistic for homogeneity test is Q=
X
2
wi (yi − y¯) =
X
wi yi2
P ( wi yi )2 − P . wi
(15)
Example 1 (continued). The Fleiss method is used to estimate a pooled odds ratio and its 95% confidence interval for data in Example 1 in Table 5. First, calculate the observed effect yi = ln(ORi ), variance vi weight wi and wi yi , wi yi2 for each individual study, results shown in Table 5. The Q statistic for homogeneity test is P X X ( wi yi )2 2 2 Q= wi (yi − y¯) = wi yi − P wi = 20.7849 −
(−99.1391)2 = 10.8 . 910.559
df = 6, χ2(0.05,6) = 12.6 > 10.1, P > 0.05, H0 would not be rejected, so the fixed-effects model may be appropriate. Then the pooled estimate of odds ratio is equal to X X ORF = exp wi wi yi = exp(−99.1391/910.559) = exp(−0.1089) = 0.90 .
Table 5. Results of meta-analysis for the effectiveness of aspirin after myocardial infarction (Fleiss method). Study
yi = ln(ORi )
wi = 1/vi
w i yi
wi yi2
1 2 3 4 5 6 7
−0.3285 −0.3842 −0.2194 −0.2194 −0.2332 0.1249 −0.1109
25.710 24.291 48.801 15.440 28.409 103.985 663.923
−8.4457 −9.3326 −10.7069 −3.3875 −6.6250 12.9877 −73.6291
2.7744 3.5856 2.3491 0.7432 1.5449 1.6222 8.1655
910.559
−99.1391
20.7849
Total
260
X. Zhou et al.
The 95% confidence interval for pooled odds ratio is given by qX √ exp y¯ ± 1.96 wi = exp(−0.1089 ± 1.96/ 910.559) = (0.84, 0.96) .
Note that, results of Fleiss method are the same as those of Peto method. If complete 2 × 2 tables are available for all included studies, Peto method is simpler than Fleiss method, but the latter can be used for those only proportions reported. 2.4.1.4. General variance-based method When the effect size is measured as a rate difference, the general variancebased method would be applied to estimation of the pooled rate difference. The general variance-based method also used to estimate the pooled risk ratio, rate ratio and odds ratio.12 The general variance-based method is also based on fixed-effect model. 2.4.1.4.1. Effect size is measured as a rate difference The rate different for ith study is RDi = na1ii − nc2ii . The variance and weight of rate difference are var(RDi ) = wi = 1/var(RDi ). The pooled estimate of rate difference is P (wi RDi ) P . RDGV = wi
n1i n2i m1i m2i Ti ,
(16)
The 95% confidence interval of pooled estimate of rate difference is equal to qX RDGV ± 1.96 wi . (17) 2.4.1.4.2. Effect size is measured as an incidence density ratio or as a risk ratio The relative risk for ith study is RRi = na1ii / nc2ii . The variance and weight of relative risk are var(RRi ) = 1/var(RRi ). The pooled estimate of relative risk is P (wi ln(RRi )) P . RRGV = exp wi
n2i Ti m1i m2i n1i ,
wi =
(18)
261
Meta-Analysis
The 95% confidence interval of pooled estimate of relative risk is equal to qX exp ln(RRGV ) ± 1.96 (19) wi . When each study in meta-analysis just presents the relative risk and its 95% confidence interval, whereas a complete 2 × 2 table is unavailable, general variance-based method also could be applied to estimate the pooled effect using Eqs. (18) and (19). The formula for estimating variance from the 95% confidence interval is 2 2 ln(RRu /RRi ) ln(RRi /RRl ) = , (20) var(RRi ) = 1.96 1.96
where RRu and RRl are the upper and lower bound of the 95% confidence interval for ith study. 2.4.1.4.3. Effect size is measured as odds ratio The pooled estimate of odds ratio is P (wi ln(ORi )) P . ORGV = exp wi
(21)
The 95% confidence interval of pooled estimate of odds ratio is equal to qX wi , (22) exp ln(ORGV ) ± 1.96
where
−1
wi = [var(ln(OR)i )]
=
1 1 1 1 + + + ai b i ci di
−1
.
(23)
Note that when pooling the effect, relative risk and odds ratio should be transform to logarithmic scale in order to be approximately normally distributed, whereas the rate difference could be computed directly. 2.4.1.5. DerSimonian-Laird method The approaches we previously described are all based on the fixed-effect model. When the studies included in meta-analysis lack of homogeneity, the random-effects model may be appropriate to combine the effect size. Formulas of applying the DerSimonian-Laird method summarizing studies in the case where effects are measured as odds ratios are given as follows 17 : P ∗ wi ln(ORi ) P . (24) ORDL = exp wi∗
262
X. Zhou et al.
The 95% confidence interval of pooled estimate of odds ratio is equal to qX exp ln ORD L ± 1.96 (25) wi∗ ,
where wi∗ is the weighting factor for the ith study, is estimated as wi∗ =
1 . D + (1/wi )
(26)
D is derived from the homogeneity test statistic, Q, in Eq. (4). As described previously about the moment estimate of inter-study variance τ 2 in model choice and homogeneity test, we have P Q − (K − 1) wi P and D = 0 if Q < k − 1 , (27) D= P ( wi )2 − wi2 where k is the number of included studies.
Example 1 (continued). In the example of meta-analysis of seven clinical trials in which aspirin was used to prevent the death after myocardial infarction, we have calculated the pooled effect sized using the approaches based on fixed-effects model. The results of homogeneity test is, Q = 10.8, and df = 6, χ2(0.05,6) = 12.6 > 10.8, P > 0.05, the null hypothesis was not rejected. In order to evaluate the dependence of the conclusions of the analysis on the model assumption, now we calculate the pooled effect using random-effects model. P 10.1 − (7 − 1) × 910.559 Q − (k − 1) wi P 2 = = 0.00977 . D= P 2 ( wi ) − wi 910.559 − 456284.69 Each wi∗ for individual study is calculated using Eq. (26), results shown in Table 6. Table 6. Results of meta-analysis for the effectiveness of aspirin after myocardial infarction (DerSimonian-Laird method). Study
yi = ln(ORi )
wi
wi2
wi∗
wi∗ yi
1 2 3 4 5 6 7
−0.3285 −0.3842 −0.2194 −0.2194 −0.2332 0.1249 −0.1109
25.710 24.291 48.801 15.440 28.409 103.985 663.923
661.004 590.053 2381.538 238.394 807.071 10812.880 440793.750
20.54 19.63 33.04 13.42 22.24 51.58 88.68
−6.747 −7.542 −7.219 −2.944 −5.186 −6.442 −9.835
910.559
456284.69
249.13
−33.061
Total
263
Meta-Analysis
The pooled estimate of odds ratio and its 95% confidence interval are P ∗ wi ln(ORi ) −33.061 P ∗ ORDL = exp = exp wi 249.13 = exp(−0.1327) = 0.88 ,
qX wi∗ exp ln ORDL ± 1.96
√ = exp(−0.1327 ± 1.96/ 249.13) = (0.77, 0.99) .
Now, if we compare the results of fixed-effects and random-effects model, the pooled point estimate of odds ratio, 0.88 and 0.90, respectively, is almost the same. The length of the 95% confidence interval based on random-effects model is 0.22 (0.99–0.77), which is greater than that based on fixed-effects model, 0.12 (0.96–0.84). So the result of random-effects model is potentially more conservative. But the two methods yield the same conclusion, that is, in general, aspirin make the risk of death after myocardial infarction decrease by nearly 10%. 2.4.3. Measures based on a continuous scale When the effect size in the studies included in a meta-analysis is measured on a continuous scale, we primarily focus on the estimates of pooled mean difference and standardized mean difference.10,12 Suppose the n1i and n2i are the sample sizes, x1i and x2i are the means, of treatment and control group, respectively. The mean difference yi = x ¯1i − x ¯2i , with standard error, si , calculated as (n1i − 1)s21i + (n2i − 1)s22i 1 1 , where s2pi = + . s2i = s2pi n1i n2i n1i + n2i − 2 2.4.2.1. Fixed-Effect model 2.4.2.1.1. Effect size is measured on the same scale The pooled measure of size effect (mean difference) is ys = wi = s12 . i The Q statistic for homogeneity test is given by P X X ( wi yi )2 . Q= wi (ys − yi )2 = wi yi2 − P wi
P Pwi yi , wi
where
264
X. Zhou et al.
The 95% confidence interval of summary measure of effect size is ys ± pP 1.96/ wi .
Example 2. Table 7 presents data about the change in Kurtzke Disability Status Scale at two years in four randomized trials of the effect of azathioprine treatment in multiple sclerosis. The summary estimate of mean difference is given as follows: Table 7. A meta-analysis for Change in Kurzke Disability Status Scale at two years in four randomized trials of the effect of azathioprine treatment in multiple sclerosis. Treated
Study 1 2 3 4 Total
n1i
Control
x1i
s1i
x2i
s2i
0.30 0.17 0.20 0.17
1.26 162 0.42 1.28 175 0.90 15 0.83 0.98 20 1.10 30 0.45 1.12 32 1.38 27 0.42 1.36 25
yi
yi2
s2i
−0.12 −0.66 −0.25 −0.25
0.0144 0.4356 0.0625 0.0289
0.019 52.632 −6.316 0.758 0.105 9.524 −6.286 4.149 0.080 12.500 −3.125 0.781 0.145 6.897 −1.724 0.431
0.5414
81.553 −7.451 6.119
wI
w i yi
wi yi2
n2i
Source: Yudkin et al. (1991). Lancet 338: 1051–1055 and Petitti. 12
2.4.2.1.1.1. Homogeneity test P X (−17.451)2 ( wi yi )2 ) P = 6.119 − Q= wi yi2 − = 2.385 . wi 81.553
Here, df = 3, χ2(0.05,3) = 7.28 > 2.385, p > 0.05 therefore, the null hypothesis that the studies are homogeneous is not rejected, and it is appropriate to use fixed-effects model to estimate the pooled weight mean. 2.4.2.1.1.2. Calculating the pooled effect size and its 95% confidence interval P wi yi −17.451 = ys = P = −0.197 , wi 81.553 qX √ ys ± 1.96 wi = −0.197 ± (1.96/ 81.553) = (−0.414, 0.02) .
The results of meta-analysis suggest, the pooled mean difference of the Kutzke Disability Scale for the effect of azathioprine treatment in multiple sclerosis is −0.197, but the results are statistically non-significant (95% confidence interval covers zero). Based on these results, we still cannot draw the conclusion that azathioprine is beneficial for multiple sclerosis.
Meta-Analysis
265
2.4.2.1.2. Effect size is measured on different scale When studies used different scales to measure effect, the standardized mean difference is calculated as the estimate of effect size. Let x¯1i − x ¯2i , di = spi then the pooled estimate of effect size is P wi di , ds = P wi
(28)
where wi is the weight assigned to each study. This weighted estimator of the effect size was shown by Hedges to be asymptotically efficient when sample sizes in the two groups are both greater than 10 and the effect sizes are less than 1.5.18 When the sample sizes are about equal in the two groups and both greater than 10, the weight of each study can be estimated as follows: 2Ni wi = . (29) 8 + d2i The 95% confidence interval for the pooled estimate of effect size is √ (30) ds ± 1.96/ wi . The Q statistic for homogeneity test is given by P X X ( wi di )2 2 2 . Q= wi (ds − di ) = wi di − P ( wi )
(31)
2.4.2.2. Random-effects model
If H0 of homogeneity is rejected, which means that the between-study variance is relatively large, a random-effect model should be used. x2i . The calculation of effect size is the same, that is, di = x¯1is−¯ pi The pooled estimate of effect size and variance are P wi di ¯ , d= P wi P P ¯2 wi (di − d) wi d2 2 P sd = = P i − d¯2 , (32) wi wi where, wi = Ni = n1i + n2i . The random-effect model assumes di = δi + ei , with d¯2 4k 1+ . δ¯ = d¯, e¯ = 0 and s2i = P wi 8
(33)
266
X. Zhou et al.
(1) If s2d > s2e , s2δ = s2d −s2e , and the 95% confidence interval of pooled effect size is d¯ ± 1.96sδ .
(34)
(2) If s2d ≤ s2e , s2δ = 0, and random-effects model is actually fixed-effect model, that is di = δ + ei . The standard error for d¯ is se sd¯ = √ . k
(35)
Then the 95% confidence interval of pooled effect size is d¯ ± 1.96sd¯ .
(36)
In random-effect model, the statistic for homogeneity test is given by x2 =
ks2d . s2e
(37)
Under the null hypothesis of homogeneity, the statistic follows an approximate x2k−1 distribution. Table 8. Data from meta-analysis of the effect of aminophylline treatment in severe acute asthma. Study
Ni (wi )
spi
di
wi di
wi d2i
1 2 3 4 5 6 7 8 9 10 11 12 13 ··
20 50 48 24 29 20 23 13 23 51 61 66 40 468
0.76 320.00 0.65 0.42 0.22 17.00 0.62 110.00 2.10 6.30 0.50 0.67 0.58
−0.43 −0.04 −0.84 −1.67 −1.03 −2.41 −0.08 0.26 2.93 0.51 0.72 0.03 −0.02
−8.6 −2.00 −40.32 −40.08 −29.87 −48.2 −1.84 3.38 67.39 26.01 43.92 1.98 −0.8 −29.03
3.698 0.08 33.869 66.934 30.766 116.162 0.147 0.879 197.453 13.265 31.622 0.059 0.016 494.95
Source: Littenberg (1988). JAMA. 259: 1678–1684 Petitti. 12
Meta-Analysis
267
Example 3. Table 8 presents data from a meta-analysis of the effect of aminophylline in severe acute asthma. The 13 studies included in the metaanalysis reported different measures on pulmonary function. A standardized mean difference should be used as common metric. The pooled estimate of effect size and 95% confidence interval are calculated as follows: 2.4.2.2.1. Homogeneity test s2d
P 2 P ¯2 wi d2i wi (di − d) wi di P P = P − wi wi wi
=
P
=
494.95 (−29.03)2 = 1.054 , − 468 4682
P wi di −29.03 ¯ − 0.062 , d= P = 468 wi 4 × 13 (−0.062)2 d¯2 4k s2e = P 1+ = 1+ = 0.111 , wi 8 468 8 x2 =
13 × (1.054)2 ks2d = = 130.107 . 2 se 0.111
df = 12, x2(0.05,12) = 21.03, p < 0.05, the null hypothesis of homogeneity is rejected, which means between-study variance is relatively large, and random-effects model should be adopted. 2.4.2.2.2. Calculating the summary effect size and its 95% confidence interval s2δ = s2d − s2e = 1.054 − 0.111 = 0.94 , P wi di −29.03 ¯ = −0.062 , d= P = 468 wi √ d¯ ± 1.96sδ = −0.062 ± (1.96 × 0.94) = (−1.962, 1.838) . The results of meta-analysis suggest that the effect of aminophylline treatment in severe acute asthma is statistically non-significant (95% confidence interval covers zero). In fact, the heterogeneity between studies is greatly large in the example, the smallest effect size is −0.02, whereas the
268
X. Zhou et al.
largest is 2.93. It is necessary to explore the source of heterogeneity before meta-analysis, and assess the source of biases and confounding. If the combinability of studies is poor, meta-analysis should be abandoned. The process above is just a typical example for computation.
3. Bayesian Methods in Random-Effects Models for Meta-analysis The methods discussed above are basically frequentist procedures. There have been considerable discussions in the literature on the relative merits of fixed- and random-effects model. In practice, when combining the effect, the choice between fixed- and random-effects models is determined by the results of statistical tests of homogeneity (Q statistic). But the power of statistical tests of homogeneity is low. The results of random-effects model may be more “conservative”, which leads to somewhat wider confidence intervals than the fixed-effects model. Little is known about the approach describing the random effects quantitatively. The appropriate treatment for small studies and extreme results included in meta-analysis is still unresolved in classic methods. Furthermore, the uncertainty of the parameters, such as the pooled effect size and variance, is not taken into account to use current approaches for meta-analysis. Bayesian methods for meta-analysis give several options to deal with these problems and have been well-developed in the past decades. Under the Bayesian framework for random-effect model in meta-analysis, the parameter is an unknown random variable that has a specific distribution. The posterior distribution of parameter is derived from prior distribution and sample information available. DuMouchel gave a fully Bayesian analysis of the hierarchical model with a complete conjugate prior structure.19 Carlin developed and implements a fully Bayesian approach to meta-analysis for 2 × 2 tables, in which uncertainty about effects in comparable studies is represented by an exchangeable prior distribution.20 A Bayesian analysis requires integration of each of the conditional posterior distributions. Unfortunately, such integration cannot be performed in closed form in most situations. Approximate solution can be obtained through asymptotic or numerical techniques. With the great progress in Bayesian computational tools, especially the rapid development of Markov Chain Monte Carlo (MCMC) method, it is effective to deal with the problems that could not be resolved by classical meta-analysis method. Gibbs
Meta-Analysis
269
sampling is a recently developed simulation tool for Bayesian inferences, obtaining the simulated joint posterior distribution from the full conditional distributions of parameters.21,22 In this section, the Bayesian approaches are introduced, especially the hierarchical model under a full Bayesian framework and the Gibbs sampling in random-effects model for meta-analysis. 3.1. Bayesian meta-analysis for DuMouchel’s model Supporse there are K individual studies included in the meta-analysis, and the effect for each study is Y1 , Y2 , . . . , YK . The random-effects model is Yi = µi + εi , µi = µ + ei ,
εi ∼ N (0, σi2 ) , ei ∼ N (0, τ 2 ) ,
with {εi , i = 1, . . . , K} and {ei , i = 1, . . . , K} are independent. Let Y = (Y1 , . . . , YK )′ , 1 = (1, . . . , 1)′ , m = (µ1 , . . . , µk )′ , ε = (ε1 , . . . , εk )′ , e = (e1 , . . . , ek )′ , Σ = diag(σ12 , . . . , σk2 ) and I the K ×K identity matrix, then the P random-effect model in matrix form as Y |m ∼ N (m, ), m ∼ N (1µ, τ 2 I). Under the full Bayesian framework, we have the model Y |m , σ 2 ∼ N (m, σ 2 C) , σ 2 ∼ x2 (dfσ ) , m|µ , τ 2 ∼ N (1µ, τ 2 H) , µ|τ 2 ∼ N (0, D → ∞) , τ −2 ∼ x2 (dfτ ) .
Here, σ 2 , τ 2 and µ are hyperparameters, and C and H are assumed as known K × K covariance matrices with unknown scale factor σ 2 and τ 2 , respectively. The degrees of freedom dfσ and dfτ for inverse-χ2 prior distributions allow incorporation of how incorporation of how well known C and H are, respectively. The prior distribution for µ is the standard diffused and independent of τ 2 . In fact, as noted by DuMouchel, these particular prior distributions are chosen for convenience, so that the posterior distribution of m given Y is a mixture of multivariate student-t distribution, each with degrees of freedom dfσ + dfτ + K − 1. For computational convenience, however, he suggests using a multivariate normal approximation to the posterior, which can then be completely described through the posterior mean and covariance matrices.
270
X. Zhou et al.
Reparameterize the variance parameters as φ, let φ = τ 2 /σ 2 , W (φ) = (φH + C)−1 , β(Y, φ) = [1′ W (φ)1]−1 1′ W (φ)Y , S(Y, φ) = [Y − 1β(Y, φ)]′ W (φ)[Y − 1β(Y, φ)] , γ(Y, φ) =
dfτ + dfσ /φ + S(Y, φ) . dfσ + dfτ + K − 3
The posterior estimate E(µ|Y ) of µ is then given by integrating what is essentially the weighted least squares estimator of µ over the posterior density of φ, f (φ|Y ) Z E(µ|Y ) = E(µ|φ, Y )f (φ|Y )dφ . Similarly,
var(µ|Y ) =
Z
{γ(Y, φ)[1′ W (φ)1]−1 + [β(Y, φ) − E(µ|Y )]
× [β(Y, φ) − E(µ|Y )]′ }f (φ|Y )dφ . The approximate 95% credible interval for µ using E(µ|Y ) and var(µ|Y ) and the normal distribution, will be p E(µ|Y ) ± 1.96 var(µ|Y ) . Posterior mean of σ 2 are obtained using Z E(σ 2 |Y ) = γ(Y, φ)f (φ|Y )dφ .
3.2. Bayesian meta-analysis for Carlin’s model Carlin adopts a Bayesian approach to meta-analysis for 2 × 2 tables, in which an exchangeable prior distribution is used. A hierarchical normal model assumes that Yi |µi , σi2 ∼ N (µi , σi2 ) ,
(38)
µi |µ , τ 2 ∼ N (µ, τ 2 ) ,
(39)
where σi represents the corresponding estimated standard error, which is assumed known without error. µi is interpreted as the “true effect” in ith study, which has an exchangeable normal prior, and also it means effects are independently and identically distributed conditional on the values of
271
Meta-Analysis
unknown hyperparameters µ and τ 2 . Here, τ 2 is between-study variance. Assume the prior distributions of µ and τ 2 are non-informative or locally uniform prior. Under the framework of Bayesian, the posterior distributions of quantities of interest, conditional on the variance hyperparameter, have closed form solutions. Let Bi = τ 2 /(τ 2 + σi2 ), then we have P Bi Yi , (40) µ ˆ = E(µ|Y, τ 2 ) = P Bi τ2 var(µ|Y, τ 2 ) = P . Bi
(41)
The posterior mean and variance for the individuals µi , conditional on both µ and τ 2 , for each i, are E(µi |Y, µ, τ 2 ) = Bi Yi + (1 − Bi )µ ,
(42)
var(µi |Y, µ, τ 2 ) = Bi σi2 .
(43)
Note that, Bi is usually referred to as the shrinkage factor for the ith study. The larger the inter-study variation, τ 2 , is the smaller the shrinkage Bi of the observed study effects. Because 0 ≤ Bi ≤ 1, the mean is compromised between the average treatment effect µ and the observed study summary statistics, Yi . When σi2 = 0, shrinkage is maximized to Bi = 1 so that µ1 = µ2 = · · · = µk = µ and the random-effects model reduces to the fixed-effects model. Integrating Eqs. (42) and (43) over the posterior distribution of µ conditional on τ 2 we have Z 2 E(µi |Y, τ ) = E(µi |Y, µ, τ 2 )f (µ|Y, τ 2 )dµ = Bi Yi + (1 − Bi )ˆ µ2τ ,
τ var(µi |Y, τ 2 ) = Bi σi2 + (1 − Bi )2 P
(44) 2
Bi
.
(45)
The marginal likelihood function P 1/2 ( Bi Yi )2 1 X IIBi 2 2 P Bi Yi − P exp − 2 f (Y |τ ) = (τ 2 )k−1 Bi 2τ Bi
can be obtained by integrating µ out of the full likelihood. The posterior density for τ 2 is then f (τ 2 |Y ) = f (Y |τ 2 )f (τ 2 ) ,
where f (τ 2 ) is the prior density of τ 2 . Carlin used Monte Carlo procedure to compute posterior density of estimates of interest, τ 2 , τ and µi .
272
X. Zhou et al.
3.3. Gibbs sampling in random-effects model for meta-analysis Gibbs sampling is a procedure for numerical integration of complex functions that has come from its origins in statistical mechanics, through image processing into modern statistics. It is based on a simple, although computationally demanding, idea. All unknown quantities are given some initial values. The technique then involves successively sampling from the conditional distribution of each variable in turn, given the current value of all the other variables. These “full conditional” distributions are often of fairly standard form. It can be shown that under broad conditions eventually one will be sampling from the correct posterior distributions of the unknown parameters. Recently, there are many literatures on this topic, both on methodology and applications. The key feature of Gibbs sampling is, given a joint posterior density P(θ|X), K univariate full conditional densities (the distribution of each individual component of θ conditional on known values of the data X and all other components) can be written down in close form. Now we derive the full conditional distributions for parameters in Gibbs sampling based on random-effects model for meta-analysis. Consider the typical Bayesian hierarchical model as previouly described in Eqs. (38) and (39), that is Level I : Yi |µi , σi2 ∼ N (µi , σi2 ) , Level II : µi |µ, τ 2 ∼ N (µ, τ 2 ) , Level III : µ|(a, b) ∼ N (a, b) ,
τ 2 |(c, d) ∼ IG(c, d) .
For computationally convenient, the prior distributions for hyperparameters µ and τ 2 are generally normal distribution and inverse Gamma distribution, respectively. Under the full Bayesian framework, all full conditional distributions are easily estimated using Gibbs sampling. Samples from the marginal posterior distributions of interest are simulated using the following full conditional distributions: µi |Y1 , . . . Yk , µj 6= i, µ , σi2 σi2 τ 2 τ2 + µ , . τµ2 ∼ N Yi σi2 + τ 2 σi2 + τ 2 σi2 + τ 2 µ|Y1 , . . . , Yk , . . . , µk , K X 2 µi τ ∼N i=1
Kb 2 τ + Kb
+a
τ2 τ 2 + Kb
τ2 , 2 τ + Kb
!
,
(46)
(47)
Meta-Analysis
273
τ 2 |Y1 , . . . , Yk , µ1 , . . . , µk , µ ∼ IG
! K 1 1X (µi − µ)2 + d . K + c, 2 2 i=1
(48)
The processes involved in Gibbs sampling are: (i) µi , µ, and τ 2 are given some initial values; (ii) Gibbs sampling values are obtained in turn, from the conditional distributions in Eqs. (46), (47) and (48); (iii) update the Gibbs sampling values successively for t iterations. For each run a “burnin” of m iterations is followed by a further t-m iterations during which the posterior marginal density of parameters, µi , µ, and τ 2 are computed; (iv) check the convergence of Gibbs sampling. Note that, given the prior and conditional distribution, Gibbs sampling is easy to be carried out. When deriving the full conditional distributions, the prior and likelihood are conjugate in the model we discussed above. In some situations, it may be reasonable to assume that the prior of µ and τ 2 are non-informative prior, and the form of full conditional distribution is simpler. WinBUGS is a program that carries out Bayesian inference for complex statistical analysis via MCMC simulation technique.23 Using WinBUGS software, Gibbs sampling is easily implemented for many common models and distributions. The WinBUGS language allows the model to be specified by way of construction of a directed graphical model. The summary statistics for the variable, which calculate from the posterior distributions of parameters of interest, are given in the output. The software also produce the plots of the kernel density estimate, dynamic trace for sampling, and autocorrelation function for parameters. 3.4. An example of Gibbs sampling for meta-analysis Table 9 gives the results of 16 case-control studies about the role of hepatitis B virus (HBV) infection, hepatitis C virus (HCV) infection, and dual infection in the patients with primary hepatocellular carcinoma (PHC) in Chinese. The classic approaches for meta-analysis are not suitable for estimating quantitatively the risk of HBV, HCV and dual infection for PHC. As shown in Table 9, extreme values (zero) are observed for dual infection in the control groups in several studies, due to the quite low population-based dual infection rate. Classic approaches could not deal with the extreme values unless 0.5 is used to substitute zero or the studies containing zero
274
Table 9.
The data of 16 case-control studies for HBV, HCV and dual infection in PHC. Statue of HBV, HCV Infection
NonInfection ca/co
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
42/198 33/101 34/81 20/70 21/36 20/62 9/75 35/122 9/123 22/278 5/57 7/109 11/179 13/105 15/120 10/138
HBV Infection ca/co OR10 var(Y10 ) 77/40 102/10 43/8 49/16 28/24 64/31 50/21 53/20 51/14 232/73 87/45 45/16 80/26 79/105 100/27 23/10
9.08 31.22 12.81 10.72 2.00 6.40 19.84 9.24 49.79 40.16 22.04 43.79 50.07 6.08 29.63 31.74
2.21 3.44 2.55 2.37 0.69 1.86 2.99 2.22 3.91 3.69 3.09 3.78 3.91 1.80 3.39 3.46
Y10
ca/co
HCV Infection OR01 Y 01 var(Y01 )
0.07 0.15 0.19 0.15 0.15 0.11 0.19 0.11 0.21 0.07 0.25 0.24 0.15 0.11 0.12 0.25
6/8 3/3 4/3 0/1∗ 8/10 7/7 11/3 3/1 4/2 49/8 6/3 3/1 3/1 3/4 23/2 4/4
3.54 3.06 3.18 1.75 1.37 3.10 30.56 10.46 27.33 77.40 22.80 46.71 48.82 6.06 92.00 13.80
1.26 1.12 1.16 0.56 0.32 1.13 3.42 2.35 3.31 4.35 3.13 3.84 3.89 1.80 4.52 2.62
0.32 0.71 0.63 3.06 0.30 0.35 0.55 1.37 0.87 0.19 0.72 1.49 1.43 0.67 0.62 0.61
ca/co var(Y11 ) 15/1 14/1 11/0 ∗ 8/0∗ 14/1 9/0 ∗ 30/1 5/1 6/1 58/2 11/4 9/2 10/2 15/6 12/1 1/0 ∗
Source: Zhou Xuyu (1999). Postgraduate Dissertation of Sun Yat-Sen University of Medical Science. ca: Case; co: Control. ∗ : The data of the included study contain extreme value, zero.
Dual Infection OR11 70.71 42.85 52.41 56.00 24.00 55.80 250.00 17.43 82.00 366.45 31.35 70.07 81.36 20.19 96.00 27.60
4.26 3.76 3.96 4.03 3.18 4.02 5.52 2.86 4.41 5.90 3.45 4.25 4.40 3.01 4.56 3.32
Y11 ca/co
Total
1.10 1.11 2.13 2.19 1.15 2.18 1.16 1.24 1.29 0.57 0.56 0.76 0.70 0.32 1.16 3.11
140/247 152/115 92/92 77/87 71/71 100/100 100/100 96/144 70/140 361/361 109/109 61/128 104/208 110/220 150/150 38/152
X. Zhou et al.
Study No.
275
Meta-Analysis
are excluded, and it may potentially lead to biasness. Furthermore, classic approaches neglect the uncertainly of the parameter of interest, especially the parameter for inter-study variance. The Bayesian approach based on random-effects model is flexible. Here, Gibbs sampling is adopted via the WinBUGS software to obtain pooled estimate of parameters by directly fitting three logistic models using the data available in 16 studies. Arrangement of data and table notation for each individual study is shown in Table 10. Table 10.
Arrangement of data and table notation for 16 case-control studies.
Non Infection
HBV Infection
HCV Infection
Dual Infection
Total
ai bi
ci di
ei fi
gi hi
mi ni
Case Control
For each individual study, the odds ratio (OR), logarithm of OR, and variance for logarithm of OR, are given using following formula (here 00 denote non-infection, 10 denote HBV infection, 01 denote HCV infection, and 11 denote dual infection). The results are also shown in Table 9. 1 1 1 1 ci × b i , Yi10 = ln ORi10 , var(Yi10 ) = + + + , ORi10 = di × ai ai bi ci di ORi10 =
ei × b i , f i × ai
Yi01 = ln ORi01 ,
var(Yi01 ) =
1 1 1 1 + + + , ai bi ei fi
1 1 1 1 gi × bi , Yi11 = ln ORi11 , var(Yi11 ) = + + + . hi × ai ai bi g i hi Three logistic models are introduced for HBV, HCV, and dual infection. Take HBV infection for example. Let ri00 and ri10 denote the number of infection in the control and case group in ith study, arising from n00 i and 00 10 n10 subjects which are assumed to have probability of p and p of HBV i i i infection, respectively. βi10 is defined as 00 10 pi pi 10 10 00 − ln . βi = logit(pi ) − logit(pi ) = ln 10 1 − pi 1 − p00 i ORi11 =
βi10 is the true effect for ith individual study, that is, posterior mean of Yi10 . The prior distribution of βi10 is N (µ10 , (σµ10 )2 ). µ10 is the pooled effect size of interest, and (τ 10 )2 is the variance of inter-study. Thus, the full model can be written as HBV infection HCV infection Dual infection 00 00 00 00 00 00 00 ri ∼ B(pi , ni ) ri ∼ B(pi , ni ) ri00 ∼ B(p00 i , ni ) 10 ri10 ∼ B(p10 i , ni )
01 ri01 ∼ B(p01 i , ni )
11 ri11 ∼ B(p11 i , ni )
276
X. Zhou et al.
OR10
α[i]
n00[i]
p00[i]
µ10
τ10
β10[i]
n10[i]
Σµ
p10[i]
r10[i]
r00[i] for (i IN 1:Num)
Fig. 2.
Directed graphic model for HBV infection in WinBUGS.
10 logit(p00 i ) = αi
01 logit(p00 i ) = αi
11 logit(p00 i ) = αi
10 10 logit(p10 i ) = αi + βi
01 01 logit(p01 i ) = αi + βi
11 11 logit(p11 i ) = αi + βi
βi10 ∼ N (µ10 , (τ 10 )2 )
βi01 ∼ N (µ01 , (τ 01 )2 )
βi11 ∼ N (µ11 , (τ 11 )2 ) .
Still take HBV infection for example. The prior distribution of hyperparameters µ10 and (τ 10 )2 are “non-informative”, µ10 ∼ N (0.0, 106 ), are also “non(σµ10 )2 ∼ IG(10−3 , 10−3 ). The prior of parameter α10 i −5 informative”, α10 ∼ N (0, 10 ). i In the WinBUGS, we can describe above models intuitively by the way of construction of directed graphical models, in which nodes in the graph represent the data and parameters of the model (See Fig. 2). From the conditional independence conditions expressed in the graph, 10 the joint distribution takes the form (ignoring n00 i , ni , and using the fact 10 10 that p00 i , pi can be expressed in terms of αi , βi ) p(r00 , r10 , µ10 , τ 10 , β, α) ∝ Πi [p(ri00 |αi , βi10 )p(ri01 |αi , βi10 ) × p(αi )p(βi00 |µ10 , τ 10 )p(µ10 )p(τ 10 ) . First 5000 iterations were used as a “burn in” in order to reduce the effect of initial value of parameters. Then running another 20,000 iterations and the summary statistics of posterior distribution for parameters were estimated. The main results of Gibbs sampling were seen in Table 11, which contains the means, standard deviations, and 95% confidence intervals from
277
Meta-Analysis Table 11.
The results of Gibbs sampling for HBV, HCV and dual infection.
Status of Infection
Parameter
Mean
SD
95%CI
µ10 OR10 τ 10
2.862 18.050 0.892
0.250 4.668 0.209
2.371–3.360 10.710–28.800 0.565–1.380
HCV Infection
µ01 OR01 τ 01
2.489 13.110 1.344
0.410 5.712 0.359
1.663–3.297 5.276–27.020 0.785–2.191
Dual Infection
µ11 OR11 µ11
4.489 93.540 0.487
0.308 31.530 0.366
3.901–5.120 49.440–167.300 0.033–1.320
HBV Infection
Fig. 3.
Fig. 4.
Fig. 5.
The trace of Gibbs sampling for parameters µ10 , µ01 , µ11 .
The kernel density of Gibbs sampling for parameters µ10 , µ01 , µ11 .
The autocorrelation of Gibbs sampling for parameters µ10 , µ01 , µ11 .
the posterior distribution of parameters, µ10 , τ 2 and OR. WinBUGS also gives the posterior distributions of “true effect” for each study, βi10 , βi01 and βi11 . The trace, kernel density and autocorrelation plots for summary effects, µ10 , µ01 , µ11 , in WinBUGS were presented in Figs. 3–5. The dynamic traces
278
X. Zhou et al. Table 12.
The results of meta-analysis using DerSimonian-Laird method.
Parameter
HBV Infection
HCV Infection
Dual Infection
µ 95%CI τ2
2.815 2.359 ∼ 3.270 0.711
2.423 1.639 ∼ 3.208 1.816
4.026 3.553 ∼ 4.500 0
showed that the Gibbs sampling tends to balance, the plots of kernel density estimate are smooth, and the autocorrelation of sampling is low. For comparison, the classical DerSimonian-Laird random-effects model is used to estimate the pooled effect. For those studies in which the number of dual infection in control group is zero, 0.5 is substituted in order to calculate the OR. Results are shown in Table 12. The results in Table 11 and 12 show that, for HBV, HCV infection, point estimations and 95% confidence intervals of summary effects for parameters µ10 and µ01 from Gibbs sampling and classical method are similar. But for dual infection, the number of dual infection in control group is quite small in most of 16 case-control studies, and four of them even contain zero. The pooled estimation of µ11 is 4.489 (95%CI is 3.901–5.120) via Gibbs sampling, and µ11 is 4.026 (95%CI is 3.553–4.500) using classical method, so the difference is relatively large. Moreover, the pooled estimation of between-study variance, (σµ11 )2 , is zero, when using DerSimonian-Larid method, which means the between-study variance could not be identified for dual infection and result in bias obviously. In fact, when the data in meta-analysis contain many extreme values, the pooled estimation of true effect and variance is unreliable using classic methods, which are basically based on approximately normalization with large samples. Gibbs sampling, almost the standard tool for Bayesian method, can be flexibly deal with a large of complex models that the classical approaches may difficult handle. The key of Gibbs sampling is to obtain the joint posterior distribution from the full conditional distributions of parameters using MCMC method, given the prior distribution and likelihood function. When the full conditional distribution is not given in a close form, MetropolisHastings method may be adopted. Gibbs sampling can be effectively implemented using WinBUGS software, as demonstrated in the example. Furthermore, one can quite easily adjust for specific covariance that may influence the treatment effect by
For the choice of prior distribution, a Student-t population prior may be reasonable and proper in some situations.

4. Meta-analysis of Diagnostic Tests

Studies of the diagnostic accuracy of a test conducted at different centers often produce estimates of the sensitivity and specificity that vary greatly. These differences may be due to random sampling variation and to differences in the cutoff points of the diagnostic test. In order to obtain summary results of a diagnostic test across centers, a meta-analysis of diagnostic tests is necessary. The steps in conducting a meta-analysis of diagnostic tests are as follows:

(i) Determine the objective and scope of the meta-analysis. To assess diagnostic accuracy, we must specify the test of interest, the disease of interest and the reference standard by which it is measured, and the clinical question and context.

(ii) Retrieve the relevant literature and judge its validity. Extract and sort the data of the primary studies, and have two or more readers assess the eligibility and quality of the retrieved studies for inclusion in the analysis. Examine the features of the primary studies that may explain differences in diagnostic accuracy, including: whether the reference standard is acceptable as a good representation of the true presence or absence of the disease of interest; whether the test and the reference standard are read independently of each other; whether verification by the reference standard is done for all patients who had the test or only for a stratified random sample of them; whether the design of the primary study is correct; what the cutoff point is; and whether the prevalence in the populations receiving the test is similar, etc.24–26 The first author should consider the results from all readers overall.

(iii) Estimate a summary diagnostic accuracy of the test. There are several statistical methods to calculate a summary diagnostic accuracy of a test. In this section, we introduce the summary receiver operating characteristic (SROC for short) approach. In the last part of this section, we briefly introduce the other methods for calculating a summary diagnostic accuracy of a test.
Fig. 6. Mean sensitivity and specificity cannot summarize the results of a diagnostic test in meta-analysis.
While the goal of a meta-analysis of diagnostic tests, and the corresponding protocol development, are similar in principle to the meta-analysis of clinical trials discussed earlier, there are some specific issues. First, the performance of a diagnostic test is determined by both the sensitivity and the specificity, so a meta-analysis of diagnostic tests has two simultaneous endpoints. Secondly, because of the need to balance sensitivity and specificity, the usual meta-analysis of rates, such as taking weighted averages of sensitivity and specificity separately, will miss the essential non-linear relationship between sensitivity and specificity. Figure 6 illustrates why averaging sensitivity and specificity does not work for the meta-analysis of a diagnostic test. The six points are the observed mean sensitivity and specificity from six studies, and the solid line is the corresponding ROC curve. When we average sensitivity and specificity without considering their inter-relationship, we obtain the point marked "+", which does not lie on the ROC curve.27–30 This demonstrates that applying traditional meta-analysis to sensitivity and specificity separately yields summary characteristics that do not belong to the test.
The mathematical reason for this difficulty is the non-linear relationship between sensitivity and specificity. Any transformation that relates sensitivity to 1−specificity in an approximately linear form will help to simplify the meta-analysis of diagnostic tests. One such approach is the SROC.

4.1. SROC analysis

In order to evaluate the diagnostic accuracy of a test, we must first know the true presence or absence of the disease of interest. The standard that identifies an individual as diseased (case) or non-diseased (control) is the reference standard, or gold standard. Gold standards used in medical research include biopsy, autopsy, surgical exploration, follow-up and so on. Although a gold standard need not be perfect, it should be more credible than the diagnostic test of interest and it should be independent of that test. For the individuals classified as cases or controls by the gold standard, the results of the diagnostic test are labeled positive or negative, respectively, and the data can be presented in the form of a fourfold table. There are two kinds of correct results: a case diagnosed as positive (true positive, TP) and a control diagnosed as negative (true negative, TN); and two kinds of false results: a case diagnosed as negative (false negative, FN) and a control diagnosed as positive (false positive, FP) (see Table 13). The true positive rate (TPR), i.e. the sensitivity, is the probability that the test result is positive in patients with the disease of interest, namely

TPR = a/(a + c),   (49)
Table 13. Results of a diagnostic test in a 2 × 2 table.

Test Results   Gold Standard                Total
               Case          Control
Positive       a (TP)        b (FP)         a + b
Negative       c (FN)        d (TN)         c + d
Total          a + c         b + d          a + b + c + d = N

and (1 − TPR) = c/(a + c) is called the false negative rate. The false positive rate (FPR), which equals (1 − specificity), is the probability that the test result is positive in patients without the disease of
interest, namely

FPR = b/(b + d),   (50)

and (1 − FPR) = d/(b + d) is the true negative rate, or specificity.

4.1.1. SROC linear regression model

For TPR and FPR we use the logit transformation, namely

logit(TPR) = ln[TPR/(1 − TPR)],   (51)
logit(FPR) = ln[FPR/(1 − FPR)],   (52)

and let

D = logit(TPR) − logit(FPR),   (53)
S = logit(TPR) + logit(FPR).   (54)

From formula (53) we get

D = ln[(TPR/(1 − TPR)) / (FPR/(1 − FPR))] = ln[(true positive rate × true negative rate)/(false negative rate × false positive rate)] = ln OR.   (55)

From formula (54) we get

S = ln[(TPR × FPR) / ((1 − TPR)(1 − FPR))] = ln[(true positive rate × false positive rate)/(false negative rate × true negative rate)].   (56)
Let D be the dependent variable and S the independent variable. In order to turn the SROC curve into a straight line in the (S, D) plane, we establish the SROC linear regression model

D̂ = A + B × S,   (57)
where D is the log odds ratio [see formula (55)], representing the odds of a positive test result among people with the disease relative to the odds of a positive test result among people without the disease. The value of D reflects the discriminating ability of the diagnostic test. S is a measure of the threshold for classifying a test as positive; it has the value 0 when sensitivity equals specificity [see formula (56)]. It becomes positive, i.e. S > 0, when a threshold is used that increases sensitivity (and decreases specificity), and becomes negative, i.e. S < 0, when a threshold is used that decreases sensitivity (and increases specificity). A is the intercept of the linear model and is the log odds ratio when sensitivity equals specificity (S = 0). B is the regression coefficient and measures the extent to which the odds ratio (D) depends on the threshold (S) used. If the regression coefficient B is near zero and not statistically significant, test accuracy across the primary studies can be summarized by a common odds ratio given by the intercept A.

4.1.2. Solving the parameters of the SROC linear regression model

Unweighted least squares linear regression, weighted least squares linear regression, and a robust method can be used to solve for the parameters of the SROC linear regression model (57).

4.1.2.1. Conventional least squares method

This method is covered in general statistical textbooks. The parameters A and B are obtained by minimizing the sum of squared differences between the observed and fitted values (i.e. the residuals). The disadvantage of the method is that it does not give more weight to larger studies; it ignores the sample sizes of the primary studies.

4.1.2.2. Weighted least squares method

In order to give more weight to studies of larger sample size, the weighted least squares method can be used, weighting each observation by the reciprocal of the variance of the log odds ratio (ln OR). The parameters A and B are obtained by minimizing the weighted sum of squared residuals. Let a, b, c, and d be the numbers of true positives, false positives, false negatives, and true negatives, respectively (see Table 13). The weight is calculated as

W = [var(D)]^{-1} = (1/a + 1/b + 1/c + 1/d)^{-1}.   (58)
To deal with zeros in the denominator: if any of the cells a, b, c, d of the cross-classification of test and gold standard is 0, we add 0.5 to each cell of that primary study, so that the observed values become (a + 0.5), (b + 0.5), (c + 0.5), and (d + 0.5). The weighted method is appropriate only if one assumes that the individual primary studies are all measuring the same underlying test accuracy. For this reason, Moses, Shapiro and Littenberg suggested a robust modeling technique for the SROC in 1993.25
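Before turning to the robust method, here is a rough Python sketch (not from the original text) of the quantities in formulas (49)–(58) and of the unweighted and weighted fits of model (57); all function and variable names are invented for this illustration.

import numpy as np

def sroc_points(a, b, c, d, add_half_if_zero=True):
    """Compute TPR, FPR, D, S and the weight W for each primary study.
    a = TP, b = FP, c = FN, d = TN (arrays over studies)."""
    a, b, c, d = (np.asarray(x, float) for x in (a, b, c, d))
    if add_half_if_zero:
        zero = (a == 0) | (b == 0) | (c == 0) | (d == 0)
        a, b, c, d = a + 0.5 * zero, b + 0.5 * zero, c + 0.5 * zero, d + 0.5 * zero
    tpr, fpr = a / (a + c), b / (b + d)
    D = np.log(tpr / (1 - tpr)) - np.log(fpr / (1 - fpr))   # ln OR, formula (55)
    S = np.log(tpr / (1 - tpr)) + np.log(fpr / (1 - fpr))   # formula (56)
    W = 1.0 / (1 / a + 1 / b + 1 / c + 1 / d)                # formula (58)
    return tpr, fpr, D, S, W

def fit_sroc(D, S, W=None):
    """(Weighted) least squares fit of D = A + B*S; returns (A, B)."""
    D, S = np.asarray(D, float), np.asarray(S, float)
    X = np.column_stack([np.ones_like(S), S])
    Wm = np.diag(W) if W is not None else np.eye(len(S))
    beta = np.linalg.solve(X.T @ Wm @ X, X.T @ Wm @ D)
    return beta[0], beta[1]   # intercept A, slope B

# tpr, fpr, D, S, W = sroc_points(tp, fp, fn, tn)
# A_w, B_w = fit_sroc(D, S, W)    # weighted
# A_u, B_u = fit_sroc(D, S)       # unweighted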
4.1.2.3. Robust method

The coordinate points (S, D) of the primary studies are plotted, with D against S. The points are ordered by the value of S and divided into three approximately equal groups: the total number of studies is divided by 3 and rounded, giving the number of scatter points in the left and right groups. For example, with 10 scatter points, the left and right groups each contain round(10/3) = 3 points. Find the medians of S and of D within the left group and within the right group, and mark the two resulting median points. The line joining the two marked points has slope equal to the regression coefficient B. The intercept A is then derived by positioning the line so that half of the points lie above it and half below it. Let (S1, D1) and (S2, D2) be two points on the line that are far apart (for example, the median points of the left and right groups). Then the regression parameters A and B are

A = (D1 S2 − D2 S1)/(S2 − S1),   B = (D2 − D1)/(S2 − S1).   (59)
The solution of the SROC curve parameters by the robust method is shown in Fig. 7. The figure is plotted using the S and D values in Table 14. The regression coefficient of the fitted line is 0.0011, so the line is approximately parallel to the abscissa.

Fig. 7. Solving the parameters of the SROC curve using the robust method ((S, D) scatter points; left, robust and right lines; median points).
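A minimal Python sketch of this median-based procedure follows (illustrative only; here the intercept is positioned by taking the median of D − B·S, which is one way of making half the points lie above and half below the line).

import numpy as np

def robust_sroc(D, S):
    """Robust (median-based) estimate of the SROC line D = A + B*S."""
    order = np.argsort(S)
    D, S = np.asarray(D, float)[order], np.asarray(S, float)[order]
    n_side = int(round(len(S) / 3.0))                        # size of left and right groups
    S1, D1 = np.median(S[:n_side]), np.median(D[:n_side])    # left-group medians
    S2, D2 = np.median(S[-n_side:]), np.median(D[-n_side:])  # right-group medians
    B = (D2 - D1) / (S2 - S1)                                 # slope, formula (59)
    A = np.median(D - B * S)                                  # position line through the middle
    return A, B

# A_r, B_r = robust_sroc(D, S)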
Table 14. The data of Pap test from 59 primary studies (studies 1–40).
Studies i: 1–40, in order.
TP a: 8 31 70 65 20 35 39 567 25 38 45 71 4.5 2 5 38 4 87 15 41 76 10 28 3.5 79 61 62 284 66 40 11 23 65 1269 223 154 6 7 12 348
FP b: 3 3 12 10 3 92 8 117 37 28 35 87 0.5 2 9 21 2 13 3 1 12 4 11 0.5 26 20 20 31 25 43 1 50 13 928 22 30 2 4 5 41
FN c: 23 43 121 6 19 20 111 140 11 17 15 10 36.5 3 3 7 16 12 65 61 11 48 28 5.5 13 27 16 68 20 12 1 10 42 264 74 20 12 3 11 212
TN d: 84 14 25 6 4 156 270 157 18 37 48 306 5.5 21 182 62 31 9 15 29 12 174 77 1.5 182 35 49 68 44 47 2 44 13 1084 83 237 81 4 60 103
Sensitivity TPR: 0.258 0.419 0.66 0.915 0.513 0.636 0.260 0.802 0.694 0.691 0.750 0.877 0.110 0.400 0.625 0.844 0.200 0.879 0.188 0.402 0.874 0.172 0.500 0.389 0.859 0.693 0.795 0.807 0.767 0.769 0.917 0.697 0.607 0.828 0.751 0.885 0.333 0.700 0.522 0.621
1-specificity FPR: 0.034 0.176 0.324 0.625 0.429 0.371 0.029 0.427 0.673 0.431 0.422 0.221 0.083 0.087 0.047 0.253 0.061 0.591 0.167 0.033 0.500 0.022 0.125 0.250 0.125 0.364 0.290 0.313 0.362 0.478 0.333 0.532 0.500 0.461 0.210 0.112 0.024 0.500 0.077 0.285
Weight W: 1.947 2.173 6.855 2.229 1.458 10.433 6.122 41.976 4.684 6.762 7.231 7.761 0.411 0.724 1.539 4.293 1.184 3.535 2.074 0.930 3.694 2.655 5.704 0.319 7.489 7.576 6.710 15.340 7.820 6.542 0.386 5.370 5.180 152.068 13.245 10.633 1.312 1.024 2.558 23.987
D: 2.276 1.213 0.187 1.872 0.339 1.088 2.473 1.693 0.100 1.083 1.414 3.218 0.305 1.946 3.518 2.774 1.355 1.613 0.143 2.970 1.933 2.204 1.946 0.647 3.750 1.375 2.251 2.215 1.759 1.293 3.091 0.705 0.437 1.725 2.431 4.108 3.008 0.847 2.572 1.417
S: −4.388 −1.868 −1.281 2.893 −0.236 0.032 −4.565 1.105 1.542 0.526 0.783 0.702 −4.491 −2.757 −2.496 0.609 −4.127 2.349 −3.076 −3.765 1.933 −5.341 −1.946 −1.551 −0.141 0.255 0.458 0.644 0.629 1.115 1.705 0.961 0.437 1.415 −0.225 −0.026 −4.394 0.847 −2.398 −0.426
Table 14. (Continued; studies 41–59.)
Studies i: 41–59, in order.
TP a: 8 12.5 95 40 71 1204 6 35 10 3 118 13 38 14 12 238 111 491 48
FP b: 4 2.5 9 18 13 186 20 9 31 5 40 3 14 25 14 52 44 165 16
FN c: 11 6.5 2 20 20 455 51 12 5 3 44 82 13 67 6 2 20 250 38
TN d: 34 0.5 1 19 18 241 27 12 32 15 183 17 62 291 12 16 39 701 31
Sensitivity TPR: 0.421 0.658 0.979 0.667 0.780 0.726 0.105 0.745 0.667 0.500 0.728 0.137 0.745 0.173 0.667 0.992 0.847 0.663 0.558
1-specificity FPR: 0.105 0.833 0.900 0.486 0.419 0.436 0.426 0.429 0.492 0.250 0.179 0.150 0.184 0.079 0.538 0.765 0.530 0.191 0.340
Weight W: 2.019 0.380 0.617 5.459 5.087 79.655 3.659 3.264 2.751 1.071 16.216 2.078 5.241 7.705 2.471 1.707 9.313 73.944 7.047
D: 1.822 −0.956 1.664 0.747 1.592 1.232 −1.840 1.358 0.725 1.099 2.507 −0.107 2.561 0.889 0.539 3.600 1.593 2.122 0.895
S: −2.459 2.263 6.058 0.639 0.942 0.714 −2.440 0.783 0.661 −1.099 −0.534 −3.576 −0.415 −4.020 0.847 5.958 1.834 −0.772 −0.428
4.1.3. Establishing the SROC curve regression model

With the regression parameters A and B obtained by any of the above methods, the SROC curve regression model can be established as

TPR = \left[1 + e^{-A/(1-B)}\left(\frac{1-FPR}{FPR}\right)^{(1+B)/(1-B)}\right]^{-1},   (60)

where TPR is the true positive rate and FPR is the false positive rate. For a general ROC analysis, the area under the ROC curve is taken as the diagnostic accuracy of the test. For an SROC analysis, we can take TPR* as the diagnostic accuracy of the test, where TPR* is the sensitivity at the intersection of the SROC curve of Eq. (60) with the line

TPR + FPR = 1.   (61)
It reflects the extent to which the SROC curve approaches the top left corner: the larger the value of TPR*, the higher the diagnostic accuracy of the test. TPR + FPR = 1 is the line through the top left corner (0, 1) and
the bottom right corner (1, 0). On this line, sensitivity equals specificity, namely S = 0. Using S = 0 and formula (54), we have

S = logit(TPR) + logit(FPR) = 0, or logit(FPR) = −logit(TPR).   (62)

Substituting formula (62) into formula (53), we have

D = logit(TPR) − logit(FPR) = 2 logit(TPR) = A + B · S = A,

and

logit(TPR) = A/2,   (63)
TPR = (1 + e^{−A/2})^{−1}.   (64)

In order not to confuse it with the general TPR, we take the diagnostic accuracy of the test given by the SROC curve as TPR* = (1 + e^{−A/2})^{−1}. Its standard error can be calculated as

SE(TPR*) = \frac{SE(\hat{A})}{8[\cosh(A/4)]^2},   (65)
where SE(Â) is the standard error of the intercept A of the linear regression model and cosh(·) is the hyperbolic cosine function. To compare the diagnostic accuracy between two independent groups, if the numbers of primary studies are large enough (more than 10), we can use the Z statistic

Z = \frac{TPR_1^* - TPR_2^*}{\sqrt{SE^2(TPR_1^*) + SE^2(TPR_2^*)}},   (66)
where Z is referred to the standard normal distribution. TPR1* and TPR2* are the diagnostic accuracies of the two SROC curves being compared, and SE(TPR1*) and SE(TPR2*) are their respective standard errors.
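A small Python sketch of formulas (64)–(66), assuming A and SE(A) have already been estimated from the SROC regression (illustrative only):

import math
import numpy as np

def sroc_accuracy(A, se_A):
    """Diagnostic accuracy TPR* and its standard error, formulas (64)-(65)."""
    tpr_star = 1.0 / (1.0 + np.exp(-A / 2.0))
    se_tpr_star = se_A / (8.0 * np.cosh(A / 4.0) ** 2)
    return tpr_star, se_tpr_star

def compare_sroc(tpr1, se1, tpr2, se2):
    """Z test for two independent SROC accuracies, formula (66)."""
    z = (tpr1 - tpr2) / math.sqrt(se1 ** 2 + se2 ** 2)
    # two-sided p-value from the standard normal, via the error function
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return z, p

# tpr_w, se_w = sroc_accuracy(1.720, 0.100)   # weighted fit of the example
# tpr_u, se_u = sroc_accuracy(1.590, 0.151)   # unweighted fit
# z, p = compare_sroc(tpr_w, se_w, tpr_u, se_u)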
If the regression coefficient of the SROC curve is B = 0, then for the FPR of each primary study the confidence interval of the TPR can be taken as

\left(\left[1 + e^{-A_L}\,\frac{1-FPR}{FPR}\right]^{-1},\; \left[1 + e^{-A_U}\,\frac{1-FPR}{FPR}\right]^{-1}\right),   (67)

where A_L and A_U are the lower and upper confidence limits of the intercept A, respectively.

4.1.4. Analysis using an example

The Pap test involves the collection, preparation, and examination of exfoliated cervical cells. It is quick, noninvasive, and relatively inexpensive. These properties make the test appealing for the detection of cervical precancer. Currently some doctors use it as a screening test and as a follow-up test for women. Because the accuracy of the test is affected by the doctor's understanding of the natural history of cervical cancer, the morbidity of cervical cancer and the number of cells sampled, the reported diagnostic accuracy of the test varies widely: the sensitivity ranges from 11% to 99% and the specificity from 14% to 97%. The method of SROC analysis is illustrated using the data of 59 primary studies reported by Fahey, Irwig and Macaskill.26

Example 5. In the data of Fahey, Irwig and Macaskill, the numbers of true positives (TP, a), false positives (FP, b), false negatives (FN, c) and true negatives (TN, d) are not presented directly; instead, the numbers with and without disease, the sensitivity and (1 − specificity) are given. Since the method requires the cell counts, we calculate a, b, c, d from the known data (see Table 14). Because b was 0 in the 13th and 24th primary studies and d was 0 in the 42nd, 0.5 was added to a, b, c, d of these three studies to avoid zero denominators (see Table 14). For the 1st study, the weight was calculated using formula (58):

W_1 = (1/8 + 1/3 + 1/23 + 1/84)^{-1} = 1.947.

The true positive rate is calculated by formula (49), i.e. TPR = 8/(8 + 23) = 0.2581, and the false positive rate by formula (50), i.e. FPR = 3/(3 + 84) = 0.0345. D and S are calculated by formulas (55) and (56), respectively, i.e.

D = ln[0.2581 × (1 − 0.0345) / (0.0345 × (1 − 0.2581))] = 2.276,
S = ln[0.2581 × 0.0345 / ((1 − 0.2581)(1 − 0.0345))] = −4.388,

and so on.
Using the above weights, the weighted least squares linear regression model is fitted with D as the dependent variable and S as the independent variable. The residual standard deviation of the weighted model is 2.430. The intercept is A = 1.720 with standard error SE(A) = 0.100; the t test gives t = 17.227, P = 0.001. Using A ± t_{0.05,58} SE(A), the 95% confidence interval of A is 1.520–1.920. These results suggest that the difference between A and 0 is statistically significant at the 0.05 level. The regression coefficient is B = −0.015 with standard error SE(B) = 0.070; the t test gives t = −0.215, P = 0.830, so the difference between B and 0 is not statistically significant at the 0.05 level. The odds ratio is exp(A) = exp(1.720) = 5.585, indicating that the odds of a positive test in the abnormal group are larger than in the normal group. From formulas (64) and (65), the diagnostic accuracy of the test is TPR* = 0.703 with standard error SE(TPR*) = 0.010.

The general (unweighted) least squares linear regression model is also fitted with D as the dependent variable and S as the independent variable. The residual standard deviation of this model is 1.1144. The intercept is A = 1.590 with standard error SE(A) = 0.151; the t test gives t = 10.522, P = 0.001. Using A ± t_{0.05,58} SE(A), the 95% confidence interval of A is 1.288–1.892, so the difference between A and 0 is statistically significant at the 0.05 level. The regression coefficient is B = −0.020 with standard error SE(B) = 0.063; the t test gives t = 0.319, P = 0.751, so the difference between B and 0 is not statistically significant at the 0.05 level. The odds ratio is exp(A) = exp(1.590) = 4.904, again indicating that the odds of a positive test in the abnormal group are larger than in the normal group. From formulas (64) and (65), the diagnostic accuracy of the test is TPR* = 0.689 with standard error SE(TPR*) = 0.016.

With D plotted against S, the coordinate points (S, D) of the 59 primary studies are plotted. Ordering the (S, D) pairs by the value of S, the points are divided into three approximately equal groups; the number of scatter points in the left and right groups is round(59/3) = 20. The medians of S and D in the left group are (S1, D1) = (−2.916, 1.588) and in the right group (S2, D2) = (1.265, 1.593). The intercept and regression coefficient obtained from formula (59) are A = 1.5914 and B = 0.0011, respectively. So the linear regression model is

D̂ = 1.5914 + 0.0011 S.
To make half the points lie above and half below the line, the line is moved up or down: the regression coefficient is held constant at 0.0011 and the intercept A is derived by repositioning the line. In effect, this makes the numbers of positive and negative signs of the residuals (the differences between observed and predicted values) equal. After adjusting the value of A several times, we obtain a line with approximately equal numbers of scatter points above and below it; the intercept is A = 1.5914, the odds ratio is exp(A) = exp(1.5914) = 4.9106, and the diagnostic accuracy is TPR* = 0.6891.

The results obtained from the weighted linear regression, the general (unweighted) linear regression and the robust regression are presented in Table 15.

Table 15. The diagnostic accuracy and related results from the 3 methods.

Method       A       SE(A)   95% CL            B        SE(B)   95% CL             TPR*    SE(TPR*)   Odds Ratio
Weighted     1.720   0.100   1.520 ∼ 1.920     −0.015   0.070   −0.155 ∼ 0.125     0.703   0.010      5.585
Unweighted   1.590   0.151   1.288 ∼ 1.892     0.020    0.063   −1.241 ∼ 1.281     0.689   0.016      4.904
Robust       1.591   –       –                 0.001    –       –                  0.689   –          4.911
Substituting the A and B of the 3 methods into formula (60), we obtain the SROC curves of the weighted, unweighted and robust methods, respectively. They are as follows:

TPR_weighted = \left[1 + e^{-1.694}\left(\frac{1-FPR}{FPR}\right)^{0.970}\right]^{-1},

TPR_unweighted = \left[1 + e^{-1.623}\left(\frac{1-FPR}{FPR}\right)^{1.041}\right]^{-1},

TPR_robust = \left[1 + e^{-1.593}\left(\frac{1-FPR}{FPR}\right)^{1.002}\right]^{-1}.
To obtain a smooth SROC curve, let FPR run from 0.002 to 0.998 (other values can also be used) in arithmetic steps of 0.002 and calculate TPR from the SROC curve equations above. This gives 499 SROC coordinate points. Using these coordinate points together with the points (0, 0) and (1, 1), we can plot the smooth SROC curve. Figure 8 presents the smooth SROC curves and the SROC coordinate points of the 59 primary studies from Table 14.
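A brief Python sketch of this curve construction (illustrative only; the A and B values in the usage lines are the fitted values from Table 15):

import numpy as np

def sroc_curve(A, B, step=0.002):
    """Evaluate the SROC curve of formula (60) on a grid of FPR values."""
    fpr = np.arange(step, 1.0, step)                       # 0.002, 0.004, ..., 0.998
    tpr = 1.0 / (1.0 + np.exp(-A / (1.0 - B)) *
                 ((1.0 - fpr) / fpr) ** ((1.0 + B) / (1.0 - B)))
    # prepend (0, 0) and append (1, 1) so the plotted curve spans the unit square
    fpr = np.concatenate(([0.0], fpr, [1.0]))
    tpr = np.concatenate(([0.0], tpr, [1.0]))
    return fpr, tpr

# fpr_w, tpr_w = sroc_curve(1.720, -0.015)    # weighted fit
# fpr_u, tpr_u = sroc_curve(1.590, 0.020)     # unweighted fit
# fpr_r, tpr_r = sroc_curve(1.5914, 0.0011)   # robust fit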
Fig. 8. SROC curves of the 3 methods and the scatter points of the 59 primary studies.
From Fig. 8, it appears that the area under the curve of the weighted method is larger, while those of the unweighted and robust methods are similar. These results are consistent with the diagnostic accuracies TPR* and the odds ratios in Table 15. If the association between the weighted and unweighted methods is ignored and TPR* obtained by the two methods is assumed to be approximately normally distributed, formula (66) can be used to test the difference between the two TPR* values. The result is Z = 0.7132, P = 0.4757 (two-sided), suggesting that the difference in TPR* between the weighted and unweighted methods is not statistically significant.

4.1.5. The SAS code for solving the SROC curve parameters

SAS code 1. SROC analysis by the weighted, unweighted and robust methods.31
OPTIONS LS=76 PS=MAX NODATE;
%LET N=59; /*the number of primary studies N= ***********/
%LET A_ROB=1.5914; /* changed robust intercept A_ROB= ***********/
DATA SROC; RETAIN I 0;
SAS code 1. (Continued).
SAS Code INPUT TP FN FP TN@@; I+1; N RL=ROUND(&N/3); W=1/ (1/TP+1/FN+1/FP+1/TN); TPR=TP/(TP+FN);FPR=FP/(FP+TN); D=LOG(TPR/(1-TPR))-LOG(FPR/(1-FPR)); S=LOG(TPR/(1-TPR))+LOG(FPR/(1-FPR)); CARDS; 8 23 3 84 76 11 12 12 8 11 4 34 31 43 3 14 10 48 4 174 12.5 6.5 2.5 0.5 70 121 12 25 28 28 11 77 95 2 9 1 65 6 10 6 3.5 5.5 0.5 1.5 40 20 18 19 20 19 3 4 79 13 26 182 71 20 13 18 35 20 92 156 61 27 20 35 1204 455 186 241 39 111 8 270 62 16 20 49 6 51 20 27 567 140 117 157 284 68 31 68 35 12 9 12 25 11 37 18 66 20 25 44 10 5 31 32 38 17 28 37 40 12 43 47 3 3 5 15 45 15 35 48 11 1 1 2 118 44 40 183 71 10 87 306 23 10 50 44 13 82 3 17 4.5 36.5 0.5 5.5 65 42 13 13 38 13 14 62 2 3 2 21 1269 264 928 1084 14 67 25 291 5 3 9 182 223 74 22 83 12 6 14 12 38 7 21 62 154 20 30 237 238 2 52 16 4 16 2 31 6 12 2 81 111 20 44 39 87 12 13 9 7 3 4 4 491 250 165 701 15 65 3 15 12 11 5 60 48 38 16 31 41 61 1 29 348 212 41 103 ; TITLE ′ to calculate sensitivity, 1-specificity, weight, D, S using TP,FN,FP,TN ′ ; PROC PRINT;RUN; TITLE ′ weighted regression model?W=1/(VAR(LN(OR))) ′ ; PROC REG DATA=SROC OUTEST=W OUTSEB SIMPLE; MODEL D=S; WEIGHT W; DATA W1; SET W; PROC TRANSPOSE DATA=W PREFIX=AW OUT=WW; DATA XX1; SET WW; OR SROC=EXP(AW1); A L=AW1-AW2*TINV(1-0.05/2,&N-1); A U=AW1+AW2*TINV(1-0.05/2,&N-1); TPR S W=1/(1+EXP(-AW1/2)); SE TPR W=AW2/(8*(COSH(AW1/4))**2); IF NAME ^=′ INTERCEP′ THEN DO; A L=.; A U=.; OR SROC=.; TPR S W=.; SE TPR W=.; END; DATA XXX1; SET XX1; IF NAME ^=′ INTERCEP′ THEN DELETE; PROC PRINT; PROC REG DATA=SROC OUTEST=NW OUTSEB SIMPLE; MODEL D=S; TITLE ′ ******unweighted general linear regression model ********* ′ ; DATA NW1; SET NW; PROC TRANSPOSE DATA=NW PREFIX=A OUT=WW;
SAS code 1. (Continued).
DATA XX2; SET WW;
OR_SROC=EXP(A1);
A_L=A1-A2*TINV(1-0.05/2,&N-1); A_U=A1+A2*TINV(1-0.05/2,&N-1);
TPR_STAR=1/(1+EXP(-A1/2));
SE_TPR=A2/(8*(COSH(A1/4))**2);
IF _NAME_ ^='INTERCEP' THEN DO; A_L=.; A_U=.; OR_SROC=.; TPR_STAR=.; SE_TPR=.; END;
DATA XXX2; SET XX2; IF _NAME_ ^='INTERCEP' THEN DELETE;
PROC PRINT;
DATA XXX; MERGE XXX1 XXX2;
KEEP TPR_S_W TPR_STAR SE_TPR_W SE_TPR Z_SROC P_SROC;
Z_SROC=(TPR_S_W-TPR_STAR)/(SE_TPR_W**2+SE_TPR**2)**0.5;
P_SROC=2*(1-PROBNORM(Z_SROC));
PROC PRINT;
TITLE 'compare the TPR_STAR between unweighted and weighted regression model';
DATA SROCS; SET SROC; PROC SORT; BY S;
DATA BS1; KEEP II S D; SET SROCS; II+1; IF II>N_RL THEN DELETE;
PROC UNIVARIATE DATA=BS1 NOPRINT; VAR S; OUTPUT OUT=A1 MEDIAN=S1;
PROC UNIVARIATE DATA=BS1 NOPRINT; VAR D; OUTPUT OUT=A2 MEDIAN=D1;
DATA BS2; KEEP II S D; SET SROCS; II+1; IF II0 THEN COUNT1+1; IF SIGN=0 THEN COUNT0+1; IF SIGN

P[g_1 > Z] = s(1-Z)^{s-1} - \frac{s(s-1)}{2}(1-2Z)^{s-1} + \cdots + (-1)^{a+1}\frac{s!}{a!(s-a)!}(1-aZ)^{s-1}.   (19)
Here, a is the maximum integer less than 1/Z. Fisher's test deals only with the highest peak in the periodogram. Whittle24 extended it to the second highest peak. Let Ij1 be the highest peak and Ij2 the second highest peak; the corresponding statistic is

g_2 = \frac{I_{j_2}}{\sum_{j=1}^{s} I_j - I_{j_1}}.

The distribution of g2 is similar to that in (19). If the hypothesis test for Ij2 is significant, we continue to test the next highest peak, until all the significant peaks have been detected.
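A short Python sketch of the Fisher test (not from the text): x is assumed to be a mean-corrected series, the periodogram is taken at the Fourier frequencies j/n, j = 1, ..., s with s = ⌊(n − 1)/2⌋, and the p-value follows formula (19).

import numpy as np
from math import comb

def fisher_g_test(x):
    """Periodogram ordinates and Fisher's test for the highest peak."""
    x = np.asarray(x, float) - np.mean(x)
    n = len(x)
    s = (n - 1) // 2                        # number of Fourier frequencies used
    fft = np.fft.fft(x)
    I = (np.abs(fft[1:s + 1]) ** 2) / n     # periodogram ordinates I_1, ..., I_s
    g1 = I.max() / I.sum()
    a = int(np.floor(1.0 / g1))             # boundary case (1/g1 an exact integer) ignored here
    p = sum((-1) ** (j + 1) * comb(s, j) * (1 - j * g1) ** (s - 1)
            for j in range(1, a + 1))       # formula (19)
    return I, g1, float(min(max(p, 0.0), 1.0))

# I, g1, p_value = fisher_g_test(x)         # reject the white-noise hypothesis if p_value is small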
3.5.3. The cleavage of a peak

A single peak may split into several nearby peaks when the Fourier frequencies 2πfi do not contain the non-Fourier frequency of the cyclic mechanism of the time series. It can be proved that the periodogram is an asymptotically unbiased estimator of the power spectrum; however, it is not a consistent estimator of the spectral density function.

4. Nonlinear Model

4.1. Threshold autoregressive model

This model was put forward by Tong64 in 1978. Its general form is

X_t = \varphi_0^{(j)} + \sum_{i=1}^{p_j} \varphi_i^{(j)} X_{t-i} + \varepsilon_t^{(j)}, \quad \text{while } r_{j-1} \le X_{t-d} \le r_j, \quad j = 1, 2, \ldots, k.   (20)
The values rj (−∞ = r0 < r1 < · · · < rk = ∞; rj, j = 1, 2, . . . , k − 1) are called the threshold values, and d is called the delay parameter. {εt(j)} is a white noise series with variance σj², and {εt(j)} and {εt(j′)} are independent of each other when j ≠ j′. Equation (20) shows that the threshold values divide the axis (−∞, ∞) into k intervals; in each interval, Xt is expressed by an autoregressive model of order pj. The model is thus composed of k autoregressive models with different orders, and is denoted SETAR(d, k, p1, . . . , pk). The most commonly used threshold autoregressive model is the two-regime SETAR(d, 2, p1, p2), which can be written as

X_t = \begin{cases} \varphi_0^{(1)} + \varphi_1^{(1)} X_{t-1} + \cdots + \varphi_{p_1}^{(1)} X_{t-p_1} + \varepsilon_t^{(1)}, & X_{t-d} \le r_1, \\ \varphi_0^{(2)} + \varphi_1^{(2)} X_{t-1} + \cdots + \varphi_{p_2}^{(2)} X_{t-p_2} + \varepsilon_t^{(2)}, & X_{t-d} > r_1. \end{cases}   (21)
In medical research, the thresholds can be determined from professional knowledge. For example, in a relatively fixed population, the prevalence of tuberculosis is correlated with the average antibody level r1; when the prevalence is above the critical value and when it is below it, the dynamics of the prevalence are quite different. In this situation, a threshold autoregressive model is applicable, and the average antibody level r1 is used as the threshold.
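As an illustration only (all parameter values below are invented), a two-regime SETAR model of the form (21) can be simulated with the following Python sketch:

import numpy as np

def simulate_setar(n, phi1, phi2, r1, d=1, sigma=(1.0, 1.0), seed=0):
    """Simulate a two-regime SETAR series: regime 1 (coefficients phi1) when
    X_{t-d} <= r1, regime 2 (coefficients phi2) otherwise.
    phi1, phi2 = (intercept, AR coefficients...)."""
    rng = np.random.default_rng(seed)
    p = max(len(phi1), len(phi2)) - 1
    burn = 100
    x = np.zeros(n + burn + max(p, d))
    for t in range(max(p, d), len(x)):
        coef, s = (phi1, sigma[0]) if x[t - d] <= r1 else (phi2, sigma[1])
        ar = sum(coef[i] * x[t - i] for i in range(1, len(coef)))
        x[t] = coef[0] + ar + rng.normal(0.0, s)
    return x[-n:]

# x = simulate_setar(500, phi1=(0.5, 0.6), phi2=(-0.5, 0.3), r1=0.0)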
4.2. Bilinear model

This model arose in the economic field.65 For example, let the output in the tth year be xt, which is used as the input of the next year, and suppose the growth rate yt follows an MA(1) model,

y_t = \frac{x_t - x_{t-1}}{x_{t-1}} = \varepsilon_t + \theta\varepsilon_{t-1}.

The output of this year is then

x_t = x_{t-1} + \varepsilon_t x_{t-1} + \theta\varepsilon_{t-1} x_{t-1},

which is called a bilinear model. A generalized form is

x_t = \sum_{i=1}^{p} \varphi_i x_{t-i} + \sum_{i=0}^{q} \theta_i \varepsilon_{t-i} + \sum_{k=0}^{Q}\sum_{l=1}^{P} \beta_{kl}\, \varepsilon_{t-k} x_{t-l}.
Here, {εt} is a white noise series: εt and εs (t ≠ s) are independent random variables with mean zero and variance σ². (p, q, P, Q) are the orders of the model. The model is a linear function of the xt's when the εt's are given, and a linear function of the εt's when the xt's are given; that is why it is called a bilinear model.

4.3. Exponential autoregressive model

This model was put forward by Ozaki.65 Its generalized form is

x_t = \sum_{i=1}^{p} \left(\varphi_i + \psi_i e^{-r x_{t-1}^2}\right) x_{t-i} + \varepsilon_t.
Here, φi, ψi and r > 0 are all constants and {εt} is a white noise series. The model is used to describe medical time series in which the amplitudes are closely correlated.

4.4. State dependent model

This model was proposed by Priestley65 in 1980. Its generalized form is

x_t = \mu(x_{t-1}) + \sum_{j=1}^{p} \varphi_j(x_{t-1}) x_{t-j} + \sum_{i=1}^{q} \theta_i(x_{t-1}) \varepsilon_{t-i} + \varepsilon_t,

where x_{t-1} = (\varepsilon_{t-q}, \ldots, \varepsilon_{t-1}, x_{t-p}, \ldots, x_{t-1})^{\tau}, and the model is denoted SDM(p, q). SDM(p, q) is a very general model.

(1) It becomes an ARMA(p, q) model when µ(x_{t−1}), φj(x_{t−1}) and θi(x_{t−1}) do not depend on x_{t−1}, i.e. are constants.

(2) When θi(x_{t−1}) ≡ 0 (i = 1, 2, . . . , q) and, for x_{t−d} ∈ R^{(i)}, φj(x_{t−1}) = φj^{(i)} and µ(x_{t−1}) = µ^{(i)}, j = 1, 2, . . . , p, where R^{(i)} = (r_{i−1}, r_i], i = 1, 2, . . . , l, and −∞ = r_0 < r_1 < · · · < r_{l−1} < r_l = ∞, the SDM(p, q) becomes a threshold autoregressive model:

x_t = \mu^{(i)} + \sum_{j=1}^{p} \varphi_j^{(i)} x_{t-j} + \varepsilon_t^{(i)}, \quad \text{while } x_{t-d} \in R^{(i)}, \quad i = 1, 2, \ldots, l.
(3) When θi(x_{t−1}) = 0 (i = 1, 2, . . . , q), µ(x_{t−1}) = 0 and φj(x_{t−1}) = φj + ψj e^{−r x_{t−1}^2} (j = 1, 2, . . . , p), the SDM(p, q) becomes an exponential autoregressive model.

(4) When µ(x_{t−1}) and φj(x_{t−1}) (j = 1, 2, . . . , p) are constants and θj(x_{t−1}) = ψj + \sum_{k=1}^{p} β_{jk} x_{t−k}, j = 1, 2, . . . , max(q, Q), where p and Q are both positive integers, with θj = 0 when q < Q and βjk = 0 when q > Q (j = 1, 2, . . . , max(q, Q)), the SDM(p, q) becomes a bilinear model.

5. Multivariable ARMA Model

For complicated medical phenomena, multivariable time series analysis is useful.65

5.1. The concept

Let Xt = (X1t, X2t, . . . , Xkt)^τ be a k-dimensional stationary time series with mean zero (for every Xit, i = 1, 2, . . . , k). Xt satisfies

X_t - \varphi_1 X_{t-1} - \cdots - \varphi_p X_{t-p} = \varepsilon_t - \theta_1\varepsilon_{t-1} - \cdots - \theta_q\varepsilon_{t-q},   (22)

where φi (i = 1, 2, . . . , p) and θj (j = 1, 2, . . . , q) are all k × k matrices, φp ≠ 0, θq ≠ 0, and εt = (ε1t, ε2t, . . . , εkt)^τ is a k-dimensional white noise series; that is,

E\varepsilon_t = 0, \qquad E\varepsilon_t\varepsilon_s^{\tau} = \begin{cases} S, & t = s, \\ 0, & t \ne s, \end{cases}

where S is a k × k positive definite matrix, and Eε_t X_{t-j}^{\tau} = 0, j = 1, 2, . . . .
The polynomials in the operator B are also k × k matrices:

\varphi(B) = I - \varphi_1 B - \varphi_2 B^2 - \cdots - \varphi_p B^p, \qquad \theta(B) = I - \theta_1 B - \theta_2 B^2 - \cdots - \theta_q B^q,

where I is the k × k identity matrix. When all the roots of det(φ(B)) = 0 lie outside the unit circle, the series is stationary; when all the roots of det(θ(B)) = 0 lie outside the unit circle, the series is invertible. Such a series is called an ARMA(p, q) series. It is well known that any invertible ARMA(p, q) series can be approximated by an AR model of sufficiently high order.
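As a quick numerical illustration of the stationarity condition (a sketch, not from the text): for a VAR(1) model X_t = φ1 X_{t−1} + ε_t, the roots of det(I − φ1 B) = 0 lie outside the unit circle exactly when all eigenvalues of φ1 lie inside it, which is easy to check; the parameter values below are invented.

import numpy as np

def is_stationary_var1(phi1):
    """Stationarity check for X_t = phi1 X_{t-1} + eps_t:
    all eigenvalues of phi1 must lie strictly inside the unit circle."""
    return bool(np.all(np.abs(np.linalg.eigvals(phi1)) < 1.0))

def simulate_var1(phi1, S, n, burn=200, seed=0):
    """Simulate a stationary VAR(1) series with noise covariance S."""
    rng = np.random.default_rng(seed)
    k = phi1.shape[0]
    x, out = np.zeros(k), []
    for t in range(n + burn):
        x = phi1 @ x + rng.multivariate_normal(np.zeros(k), S)
        if t >= burn:
            out.append(x)
    return np.array(out)

# phi1 = np.array([[0.5, 0.1], [0.0, 0.3]]); S = np.eye(2)
# assert is_stationary_var1(phi1)
# X = simulate_var1(phi1, S, n=500)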
t = 1, 2, . . . , n .
(23)
τ Xt−h is multiplied at the both sides of the equation. After the expectation is performed, we have P X
γ0 =
ϕj γi−j ϕτi + S ,
i,j=1
γh =
P X
ϕj γh−j ,
h = 1, 2, . . . .
(24)
j=1
Here γh , h = 1, 2, . . . are all positive definite matrixes. Let h = 1, 2, . . . , p. Because of γ−h = γhτ , we have the linear equations as the following:
γ1τ
γ0
τ τ γ2 γ1 = ··· ··· τ γpτ γp−1
γ1 γ0 ···
τ γp−2
···
··· ···
γp−1
ϕτ1
γp−2 ϕτ2 . ··· γ0 ϕτp
(25)
The estimates are obtained by substituting γ̂h for γh above, where the γ̂h are computed from

\hat{\gamma}_h = \frac{1}{n}\sum_{t=h+1}^{n} x_t x_{t-h}^{\tau}.   (26)
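A rough Python sketch of formulas (25)–(26) follows (illustrative only; X is assumed to be an n × k array of observations with the mean already removed, and all names are invented):

import numpy as np

def sample_autocov(X, h):
    """gamma_hat_h = (1/n) * sum_{t=h+1}^{n} x_t x_{t-h}^T, formula (26)."""
    n = X.shape[0]
    return sum(np.outer(X[t], X[t - h]) for t in range(h, n)) / n

def yule_walker_var(X, p):
    """Solve the multivariate Yule-Walker equations (25) for an AR(p) model.
    Returns the k x k coefficient matrices phi_1, ..., phi_p."""
    k = X.shape[1]
    gam = [sample_autocov(X, h) for h in range(p + 1)]
    # block Toeplitz matrix with (i, j) block gamma_{j-i}, where gamma_{-m} = gamma_m^T
    G = np.block([[gam[j - i] if j >= i else gam[i - j].T
                   for j in range(p)] for i in range(p)])
    rhs = np.vstack([gam[h].T for h in range(1, p + 1)])   # stacked gamma_h^T blocks
    Phi_T = np.linalg.solve(G, rhs)                        # stacked phi_h^T blocks
    return [Phi_T[h * k:(h + 1) * k].T for h in range(p)]

# phi_hats = yule_walker_var(X_centered, p=2)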
Let

\hat{\Gamma}_p = \begin{pmatrix} \hat{\gamma}_0 & \hat{\gamma}_1 & \cdots & \hat{\gamma}_{p-1} \\ \hat{\gamma}_1^{\tau} & \hat{\gamma}_0 & \cdots & \hat{\gamma}_{p-2} \\ \vdots & \vdots & & \vdots \\ \hat{\gamma}_{p-1}^{\tau} & \hat{\gamma}_{p-2}^{\tau} & \cdots & \hat{\gamma}_0 \end{pmatrix}, \qquad \hat{\xi}_p = \begin{pmatrix} \hat{\gamma}_1^{\tau} \\ \hat{\gamma}_2^{\tau} \\ \vdots \\ \hat{\gamma}_p^{\tau} \end{pmatrix}, \qquad \hat{\Phi}_p = \begin{pmatrix} \hat{\varphi}_1^{\tau} \\ \hat{\varphi}_2^{\tau} \\ \vdots \\ \hat{\varphi}_p^{\tau} \end{pmatrix}.   (27)

Then the estimate based on (25) is obtained as

\hat{\Phi}_p = \hat{\Gamma}_p^{-1}\hat{\xi}_p,   (28)
which is called the Yule-Walker estimate (or moment estimate). In practice, least squares estimation or recursive algorithms can also be used to estimate the parameters.65

5.3. Predictions and errors

Write the AR(p) model as X_t − φ_{p1} X_{t−1} − · · · − φ_{pp} X_{t−p} = ε_t. The one-step-ahead prediction is X̂_{t−1}(1), that is,

\hat{X}_{t-1}(1) = \varphi_{p1} X_{t-1} + \varphi_{p2} X_{t-2} + \cdots + \varphi_{pp} X_{t-p},

and the prediction error is \tilde{X}_t = X_t - \hat{X}_{t-1}(1) = e_t. The variance matrix of the prediction error is E\tilde{X}_t\tilde{X}_t^{\tau} = E e_t e_t^{\tau} = S_p. If the parameters are estimated by φ̂_{p1}, φ̂_{p2}, . . . , φ̂_{pp}, the one-step-ahead prediction is \hat{x}_{t-1}(1) = \hat{\varphi}_{p1} X_{t-1} + \hat{\varphi}_{p2} X_{t-2} + \cdots + \hat{\varphi}_{pp} X_{t-p}, and the variance of the corresponding prediction error, D_p, can be obtained from

\hat{D}_p = \left(1 + \frac{kp}{n}\right)\left(1 - \frac{kp}{n}\right)^{-1}\left(\hat{\gamma}_0 - \sum_{j=1}^{p}\hat{\varphi}_{pj}\hat{\gamma}_j^{\tau}\right).   (29)

6. Some Supplementary Topics

Research on time series has been driven by applications. There are many new developments: (1) from linearity to nonlinearity; (2) applying time series theory to unbalanced models and establishing dynamic unbalanced models for prediction; (3) combining Bayesian theory with time series analysis to detect change points in dynamic data.
6.1. Nonlinear time series model

6.1.1. The smooth state transition model65

X_{t+1} = \alpha_1^{\tau} W_t + \varphi(r^{\tau} W_t)(\alpha_2^{\tau} W_t) + \varepsilon_t.

Here φ(·) is a monotone bounded function taking values in (0, 1), and r^τ W_t is the transition variable, which effects the transition between the two states. The model can be expressed as follows:

state 1: X_{t+1} = \alpha_1^{\tau} W_t + \varepsilon_t, when r^{\tau} W_t = 0 and \varphi(0) = 0;
state 2: X_{t+1} = (\alpha_1^{\tau} + \alpha_2^{\tau}) W_t + \varepsilon_t, when r^{\tau} W_t = 1 and \varphi(1) = 1.
6.1.2. The model with time-varying parameters

X_{t+1} = \alpha_t W_t + \varepsilon_t.   (30)

Here αt is a parameter that changes over time and is not a function of W_{t−j} (j ≠ 0). For example, αt = φ α_{t−1} + a_t, where a_t is white noise and a unit root may be present; αt can be regarded as a marginal cost parameter. Time-varying parameters can be estimated by the Kalman filter algorithm.65
q X
rj ϕj (βjτ Wj + θj ) + εt+1 .
(31)
j=1
Here, ϕj (·) is a smooth function and the estimation is performed first to those given values. 6.1.4. The system of neural network τ
Xt+1 = α Wt +
q X
rj ϕ(Bjτ Wt + θj ) + εt+1 .
(32)
j=1
Here, ϕ(·) is a monotonic limited function e.g. ϕ(z) = (1 + exp(−z))−1 . The models (31) and (32) are both a kind of weighted projecting processes, which can be used in complicated medical time series. Even to outliers from the regular scatters, these models work fine.
372
J. Zhang, Y. Zheng & D. Lai
6.1.5. Product model 65 Xt+1 = ατ Wt + Xt−j εt−k + Yi,t−j εt−k . Here Wt contains the lagged X and lagged Y . The product item will lead to nonlinearity. 6.1.6. Flexible Fourier model 65 q X rj cos(βjτ Wt + θj ) + εt+1 . Xt+1 = ατ Wt + j=1
The estimation to the parameters is relatively complicated. Some particular transformations are needed to produce a linear or nonlinear model. Then, the estimation will be meliorated. 6.2. Vector autoregression model (VAR) Sims65 put forward this model, as he doesn’t deem the assumption is helpful that some variables be treated as external variables, nor the assumption that some parameters equal to zero reasonable. Sims suggested that the same numbers of variables are needed in the structural equations in order to find all the possible interaction. The main difference of the classical viewpoints and Sims’s idea is that whether a variable can be defined as a internal or external one. Generally speaking, a VAR model has the following shortcomings: The model will depend on the transformation of data and the series is assumed to be stationary; there is a gap between the theory and practice; the simplified multinomial should be orthogonalized when it is used to prediction. As no feedback relation is included in the model, the influence with a delay cannot be described. 6.3. Bayesian theory in medical time series prediction To the ordinary linear model Y = XB + U , all frequentist’s methods are based on an assumption that all the estimated parameters are fixed. Unfortunately, it seems that this assumption does not hold all the time. In the view of Bayesian theory, the parameters B’s are random. The posterior distribution is determined by prior knowledge. The change of model structure is detected and the theoretical hypothesis is tested. The most special point in Bayesian methods is that prior knowledge is used in prediction.
Time Series Analysis and its Applications in Medical Sciences
373
In recent years, a new idea is to apply time series theories into Bayesian models to predict the change points in the dynamic data with prior information.65 6.4. Unbalanced time series model This model was used to predict the economy situation in Poland by Bowditch65 in 1987. Because of the confinement from the unbalanced theory and the modeling skills, the application of this method keeps at a logjam. However, it is always a possible topic in economic prediction. Some researchers begin to combine time series theory with unbalanced model and a new direction is formed.65 This direction is paid much attention as it can explain the complex medical time series very well under some particular situation. References 1. Jiang, Q. L. (1978). (translated by Fang, J. Q. into Chinese) The Principle of Stochastic Process and the Related Models in Life Sciences, Shanghai Publishing Company for Translated Books, Shanghai 2. Box, G. E. P., Jenkins, G. M. and Reinsel, G. C. (1997) (translated into Chinese by Gu, L. et al.) Time Series Analysis: Prediction and Control, Chinese Publishing Hall of Statistics, Beijing. 3. Pandit, S. M. and Wu, X. M. (1998). The Analysis and Some Applications of Time Series and the Related System(in Chinese), The Publishing Hall for Mechanical Engineering, Beijing. 4. An, H. Z and Chen, M. (1998). Nonlinear Time Series Analysis (in Chinese), Shanghai Publishing Hall of Science and Technology, Shanghai. 5. Bell, W. R. and Hillmer, S. C. (1983). Modeling time series with calendar variation. Journal American Statistical Association 78: 526–534. 6. Quenouille, M. H. (1957). The Analysis of Multiple Time-Series, Charles Griffin and Company Limited. 7. Tiao, G. C. and Box, G. E. P. (1981). Modeling multiple time series with applications. Journal American Statistical Association 76: 802–816. 8. Chan, W.-Y. T. and Wallis, K. F. (1978). Multiple time series modeling: Another look at the mink-muskrat interaction. Applied Statistics 27(2): 168–175. 9. Sung, K. A. (1997). Inference of vector autoregressive models with cointegration and scalar components. Journal American Statistical Association 92: 350–356. 10. Stock, J. H. and Watson, M. W. (1998). Median unbiased estimation of coefficient variance in a time-varying parameter model. Journal American Statistical Association 93: 349–358.
374
J. Zhang, Y. Zheng & D. Lai
11. Ling, Shiqing and Li, W. K. (1997). On fractionally integrated autoregressive moving-average time series models with conditional heteroscedasticity. Journal American Statistical Association 92: 1184–1194. 12. Kr¨ amer, W. and Michels, S. (1997). Autocorrelation- and heteroskedasticityconsistent t-values with trending data. Journal of Econometrics 76: 141–147. 13. West, K. D. (1997). Another heteroskedasticity- and autocorrelationconsistent covariance matrix estimator. Journal of Econometrics 76: 171–191. 14. Gallant, A. R., Hsieh, D. and Tauchen, G. (1997). Estimation of stochastic volatility models with diagnostics. Journal of Econometrics 81: 159–192. 15. Drost, F. C. and Klaassen, C. A. J. (1997). Efficient estimation in semiparametric GARCH models. Journal of Econometrics 81: 193–221. 16. H¨ ardle, W. and Tsybakov, A. (1997). Local polynomial estimators of the volatility function in nonparametric autoregression. Journal of Econometrics 81: 223–242. 17. Koop, G. and Ley, E. (1997). Bayesian analysis of long memory and persistence using ARFIMA models. Journal of Econometrics 76: 149–169. 18. Breidt, F. J., Crato, N. and Lima, P. D. (1988). The detection and estimation of long memory in stochastic volatility. Journal of Econometrics 83: 325–348. 19. Box, G. E. P. and Pierce, D. A. (1970). Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. Journal American Statistical Association 65: 1509–1526. 20. McLeod, A. I. and Li, W. K. (1983). Diagnostic checking ARIMA time series models using squared-residual autocorrelations. Journal of Time Series Analysis 4(4): 269–273. 21. Ljung, G. M. and Box, G. E. P. (1978). On a measure of lack of fit in time series models. Biometrika 65(2): 297–303. 22. Poskitt, D. S. and Tremayne, A. R. (1980). Testing the specification of a fitted autoregressive-moving average model. Biometrika 67(2): 359–363. 23. Szpiro, G. G. (1997). Noise in unspecified non-linear time series. Journal of Econometrics 78: 229–255. 24. Chen, Z. G. (1988). Time Series and its Spectral Analysis (in Chinese), Chinese Publishing Hall of Science, Beijing. 25. Wang, Z. S. (2000). Time Series Analysis (in Chinese), Chinese Publishing Hall of Statistics, Beijing. 26. Feng, W. Q. (1994). The Techniques for Economic Predictions and DecisionMaking (in Chinese), The Publishing Hall of Wuhan University, Wuhan. 27. Gao, H. X. (1998). The User’s Manual of SAS/ETS Software (in Chinese), Chinese Publishing Hall of Statistics, Beijing. 28. Box, G. E. P. and Tiao, G. C. (1976). Comparison of forecast and actuality. Applied Statistics 25(3): 195–200. 29. Clemens, M. P. and Hendry, D. F. (1998). Forecasting economic process. International Journal of Forecasting 14: 111–131. 30. Granger, C. W. J., Drmerod, P. and Smith, R. (1998). Comments on forecasting economic process. International Journal of Forecasting 14: 133–137.
Time Series Analysis and its Applications in Medical Sciences
375
31. Clements, M. P. and Hendry, D. F.(1998). Forecasting economic process — A reply. International Journal of Forecasting 14: 139–143. 32. Webby, R. and O. Connor, M. (1996). Judgmental and statistical time series forecasting: A review of the literature. International Journal of Forecasting 12: 91–118. 33. Bewley, R. (1997). The forecast process and academic research. International Journal of Forecasting 13: 433–437. 34. Herwartz, H. (1997). Performance of periodic error correction models in forecasting consumption data. International Journal of Forecasting 13: 421–431. 35. Gooijer, J. G. D. and Franses, P. H. (1997). Forecasting and seasonality. International Journal of Forecasting 13: 303–305. 36. Kulendran, N. and King, M. L. (1997). Forecasting international quarterly tourist flows using error-correction and time-series models. International Journal of Forecasting 13: 319–327. 37. Shen, C.-H. (1996). Forecasting macroeconomic variables using data different periodicities. International Journal of Forecasting 12: 269–282. 38. Wallis, K. F. and Whitley, J. D. (1991). Sources of error in forecasting and expectations. UK economic models 1984–1988. Journal of Forecasting 10: 231–253. 39. Saligari, G. R. and Snyder, R. D. (1997). Trends Lead times and forecasting. International Journal of Forecasting 13: 477–488. 40. Welch, E., Bretschneider, S. and Rohrbaugh, J. (1998). Accuracy of judgmental extrapolation of time series data characteristics: Causes and remediation strategies for forecasting. International Journal of Forecasting 14: 95–110. 41. Veloce, W. (1996). An evaluation of the leading indicators for the Canadian economy using time series analysis. International Journal of Forecasting 12: 403–416. 42. Burridge, P. and Wallis, K. F. (1998). Prediction theory for autoregressivemoving average processes. Econometric Reviews 7(1): 65–95. 43. Swanson, N. R. and White, H. (1997). Forecasting economic time series using flexible versus fixed specification and linear versus nonlinear econometric models. International Journal of Forecasting 13: 439–461. 44. Shah, C. (1997). Model selection in univariate time series forecasting using discriminant analysis. International Journal of Forecasting 13: 489–500. 45. Lim, J. S. and O’Connor, M. (1996). Judgmental forecasting with time series and causal information. International Journal of Forecasting 12: 139–153. 46. Harvey, N. and Bolger, F. (1996). Graphs versus tables: Effects of data presentation format on judgmental forecasting. International Journal of Forecasting 12: 119–137. 47. Pindyck, R. S. and Rubinfeld, D. L. (1998). Econometric Models and Economic Forecasts, The McGraw-Hill Companies, Inc. 48. Reinsel, G. C. (1997). Elements of Multivariate Time Series Analysis, Springer-Verlag New York, Inc. 49. Gu, L. (1994). Time Series Analysis and its Applications in Economic Sciences (in Chinese), Chinese Publishing Hall of Statistics, Beijing.
376
J. Zhang, Y. Zheng & D. Lai
50. Tang, X. W, Cao, C. X. and Jin, D. Y. (1994). The further research on the vector of the optimal weights. Prediction (in Chinese) 13(2): 48–49. 51. Wallis, K. F. (1974). Seasonal adjustment and relations between variables. Journal American Statistical Association 69: 18–31. 52. Albert, P. S. (1993). A model for seasonal changes in time series regression relationships with an application in psychiatry. Statistics in Medicine 12: 1555–1568. 53. Albertson, K. and Aylen, J. (1996). Modeling the Great Lakes freeze: Forecasting and seasonality in the market for ferrous scrap. International Journal of Forecasting 12: 345–359. 54. Wallis, K. F. (1995). Models for X-11 and X-11 forecast procedures for preliminary and revised seasonal adjustment. Time Series Analysis Macroeconometric Modeling. 55. Wallis, K. F. (1982). Seasonal adjustment and revision of current data: Linear filters for the X-11 method. Journal of Royal Statistical Society A 145(1): 74–85. 56. Burridge, P. and Wallis, K. F. (1984). Unobserved-components models for seasonal adjustment filters. Journal of Bussiness and Economic Statistics 2(4): 350–359. 57. Hillmer, S. C. and Tiao, G. C. (1982). An ARIMA-model-based approach to seasonal adjustment. Journal American Statistical Association 77: 63–70. 58. Burridge, P. and Wallis, K. F. (1990). Seasonal adjustment and Kalman ltering: Extension to periodic variances. Journal of Forecasting 9: 109–118. 59. Burridge, P. and Wallis, K. F. (1985). Calculating the variance of seasonally adjusted series. Journal American Statistical Association 80: 541–552. ¨ and Pollock, A. C. (1997). Currency forecast60. Mary, E. W.-T., Atay, D. O. ing: An investigation of extrapolative judgement. International Journal of Forecasting 13: 509–526. 61. Kitagawa, G. and Gersch, W. (1984). A smoothness priors-state space modeling of time series with trend and seasonality. Journal American Statistical Association 79: 378–389. 62. Gottlieb, D. and Orszag, S. A. (1977). Numerical Analysis of Spectral Methods: Theory and Applications, Society for Industrial and Applied Mathematics. 63. Stoica, P. and Moses, R. L. (1997). Introduction to Spectral Analysis, Prentice Hall, Inc. Simon and Schuster/A. Viacom Company. 64. Tong H. (1990). Non-linear Time Series, Oxford University Press. 65. Ma, S. C. and Gao, B. Q. (1997). Economic Time Series Analysis (in Chinese), The Publishing Hall of Liaoning University, Shenyang.
About the Author Jinxin Zhang was born in Yuci Shanxi, the People’s Republic of China in 1966. He obtained his BS in Biomedical Engineering (1989) from Tianjin
Time Series Analysis and its Applications in Medical Sciences
377
University, MS in Biostatistics (1997) from Shanxi Medical University, and PhD in Biostatistics (2000) from the Fourth Military Medical University. He did postdoctoral research at Sun Yat-sen University of Sciences from January 2001 to January 2003. More than 20 theses relating to his research on time series and multivariate analysis have been published.
This page intentionally left blank
CHAPTER 10
APPLICATIONS OF STATISTICAL METHODS IN MEDICAL IMAGING JESSE S. JIN School of Information Technologies, The University of Sydney, NSW 2006, Australia Tel: +61-2-9351-3766; [email protected] Medical images are two-dimensional stochastic signals. There are many common issues of stochastic signals such as noise removal, signal restoration, signal sampling, etc. There are also many special issues which are relevant to high dimensional signals only, such as segmentation, clustering, etc. This chapter discusses issues of medical imaging. In particular, we will discuss the application of statistical methods in this area.
1. Introduction Medical imaging is a fast growing area with the richest source of information and variety of modalities such as Magnetic Resonance Imaging (MRI), X-ray Transmission Imaging (X-ray), Computerised Tomography (CT), ultrasound images (both 2D and 3D), Positron Emission Tomography (PET), Single-Photon Computed Tomography (SPECT), Magnetic Source Imaging (MSI), Electrical Source Imaging (ESI), X-ray Mammography (MG), Orthopantomograms (OPG), and many others. MRI is one of the most powerful non-invasive techniques in diagnostic clinical medicine and biomedical research. The technique is an application of nuclear magnetic resonance (NMR), a well-known analytical method of chemistry, physics and molecular structural biology. MRI is primarily used as a technique for producing anatomical images, but MRI also gives information on the physical-chemical state of tissues, flow diffusion and motion information. Magnetic Resonance Spectroscopy (MRS) gives chemical/ composition information. MRI has revolutionised imaging of the brain, spine and the musculoskeletal system. Superb soft tissue contrast and 379
380
J. S. Jin
spatial resolution have made MRI the investigation of choice in many neurologic and orthopaedic diseases. X-rays are generated by the interaction of accelerated electrons with a target material (usually tungsten). X-rays are deflected and absorbed to different degrees by the various tissues and bones in the patient’s body. The amount of absorption depends on the tissue composition. For example, dense bone matter will absorb many more X-rays than soft tissues, such as muscle, fat and blood. The amount of deflection depends on the density of electrons in the tissues. Tissues with high electron densities cause more X-ray scattering than those of lower density. Thus, since less photons reach the X-ray film after encountering bone or metal rather than tissue, the X-ray will look brighter for bone or metal. CT became generally available in the mid 1970s and is considered one of the major technological advances of medical science. X-ray CT gives anatomical information on the positions of air, soft tissues, and bone. Three-dimensional imaging is achieved by rotating an X-ray emitter around the patient, and measuring the intensity of transmitted rays from different angles. Ultrasound, as currently practiced in medicine, is a real-time tomographic imaging modality. Not only does it produce real-time tomograms of the position of reflecting surfaces (internal organs and structures), but it can be used to produce real-time images of tissue and blood motion. The history of PET can be traced to the early 1950s, when workers in Boston first realized the medical imaging possibilities of a particular class of radioactive isotopes. Whereas most radioactive isotopes decay by release of a gamma ray and electrons, some decay by the release of a positron. A positron can be thought of as a positive electron. Widespread interest and an acceleration in PET technology was stimulated by development of reconstruction algorithms associated with X-ray CT and improvements in nuclear detector technologies. By the mid-1980s, PET had become a tool for medical diagnosis, for dynamic studies of human metabolism and for studies of brain activation. PET has a million fold sensitivity advantage over other techniques used to study regional metabolism and neuroreceptor activity in the brain and other body tissues. In contrast, magnetic resonance has exquisite resolution for anatomic studies and for flow or angiographic studies. In addition, magnetic resonance spectroscopy has the unique attribute of evaluating chemical composition of tissue but in the millimolar range rather than the nanomolar range. Since the nanomolar range is the concentration range of
Applications of Statistical Methods in Medical Imaging
381
most receptor proteins in the body, positron emission tomography is ideal for this type of imaging. The major clinical applications of PET have been in cancer detection of the brain, breast, heart, lung and colorectal tumors. Another application is the evaluation of coronary artery disease by imaging the metabolism of heart muscle. SPECT, like PET, acquires information on the concentration of radionuclides introduced to the patients body. SPECT dates from the early 1960s, when the idea of emission traverse section tomography was introduced by D. E. Kuhl and R. Q. Edwards prior to either PET, X-ray CT, or MRI. Iron currents arising in the neurons of the heart and the brain produce magnetic fields outside the body. These fields can be measured by arrays of SQUID (Superconducting QUantum Interference Device), detectors that are placed on or near the head or chest. The recording of magnetic fields of the head is known as MagnetoEncephaloGraphy (MEG) while that of the heart is called MagnetoCardioGraphy (MCG). Magnetic Source Imaging (MSI) is the general term for the reconstruction of current sources in the heart or brain from the measurements of external magnetic fields. Electrical source imaging (ESI) is an emerging technique for reconstructing electrical activity in the brain or heart from electric potentials measured on the scalp or torso. Standard ElectroEncephaloGraphic (EEG), ElectroCardioGraphic (ECG) and VectorCardioGraphic (VCG) techniques are limited in their ability to provide information on regional electrical activity or localize bioelectrical events within the brain and heart. Noninvasive ESI of the brain requires simultaneous electric potential recordings from 20 or more electrodes for the brain and 100 to 250 torso electrode sites to map the body surface potential from the heart. X-ray mammography (MG) is an effective method to diagnose the breast cancer. A low dose X-ray screening mammograms are performed on a woman’s breasts with no symptoms to detect breast cancer at an early stage. The practice can perform diagnostic mammography. Breast needle localisation prior to surgery can be performed to provide location information and fine tissue information. Orthopantomograms (OPG) and lateral cephalograms are the latest techniques for dental or orthodontic assessment. Medical images are 2D stochastic signals. There are many common issues of stochastic signals such as noise removal, signal restoration, signal sampling, etc. There are also many special issues which are relevant to high dimensional signals only, such as segmentation, clustering, etc. We will discuss issues of medical imaging. In particular, we will discuss the
382
J. S. Jin
image sampling and compression in Sec .2, filtering in Sec .3, segmentation in Sec .4 and registration in Sec .5. Finally, there is a conclusion. 2. Sampling and Compression Using Statistical Features of Images Computer-based advanced medical imaging techniques such as Positron Emission Tomography (PET) have been playing a crucial and expanding role in modern medical research and diagnosis. However, these powerful techniques have being accompanied by the growing size of image data sets as well. For example, a routine dynamic PET study using the CTI 951 scanner usually acquires 31 cross-sectional image planes of 128 × 128 pixels each, at 20 to 30 time points. It results a 4D data set containing up to 11 million data points with approximately 22 Mbytes storage space. As the resolution of current PET imaging improves, the large volume of related data will further increase. It has therefore, prompted significant recent interest in developing efficient image compression techniques which can contribute to the current expansion in medical digitalization, image database management and telemedicine. Taking advantage of domain specific physiological kinetic knowledge related to dynamic PET images and physiological tracer kinetic modeling, this paper presents a novel knowledge-based near-lossless data compression algorithm for dynamic PET images. The proposed compression algorithm consists of three stages: (a) compression in the temporal domain using optimal image sampling schedule design; (b) compression in the spatial domain through cluster analysis; and (c) index image compression using standard still image compression techniques. In this section, clinical human brain PET studies using the [18 F ] 2-fluoro-deoxy-glucose (FDG) tracer are presented to illustrate the proposed compression algorithm. The technique can be easily applied to other PET studies with different tracers. The conventional22 and proposed techniques are implemented on clinical dynamic PET images. Empirical results are given to illustrate the compression performance and the image quality. 2.1. Tracer kinetic modeling and functional imaging Tracer kinetic techniques with PET are widely applied to extract valuable information from dynamic processes in the body. This information is usually defined in terms of a mathematical model u(t|p), where t = 1, 2, . . . , T and p are the model parameters. The parameters describe the delivery,
Applications of Statistical Methods in Medical Imaging
383
transport and biochemical transformation of the tracer. The driving function for the model is the plasma blood input function, which is often obtained from blood sampling.22 Measurements acquired by PET define the tissue time activity curve (TAC), or output function, denoted zi (t), where t = 1, 2, . . . , T are discrete sampling times of the measurements, and i = 1, 2, . . . , I corresponds to the ith pixel in the imaging region. The purpose of dynamic PET image analysis is to obtain tracer TACs and parameter estimates for each pixel in the imaging region. These parameters can then be used to define physiological parameters, such as the local cerebral metabolic rate of glucose (LCMRGlc). The conventional method uses the complete set of acquired PET projection data. Through the parameter estimation on a pixel-by-pixel basis using certain rapid estimation algorithms,16,22,36 functional images can be generated. In this section, the Patlak method35,36 was used to generate the LCMRGlc functional images for the purpose of comparing the estimation accuracy of the original and compressed data. 2.2. Sampling and compression in temporal and spatial domains The sampling and compression scheme using statistical features of tracer kinetics consists of three stages.6 2.2.1. Stage 1 : Compression in the temporal domain using optimal image sampling schedule In dynamic PET studies, the reliability of temporal frames is directly influenced by the sampling schedules and duration used to acquire the data. The longer the duration and greater the radio-activity counts, the more reliable the temporal frames. However, in order to obtain quantitative information from the dynamic processes, a certain number of temporal frames are required. Recently, it has been shown that the minimum number of temporal frames required is equal to the number of model parameters to be estimated.26 Based on this, an algorithm that automatically determines optimal image sampling schedule (OISS) and maximizes the information content of the acquired PET data was developed.26 The algorithm utilizes the accumulated/integral PET measurements. In the design of OISS, a new objective function based on the Fisher Information Matrix,10 was proposed to limit the loss of dynamic information. This objective function was used to discriminate between different
384
J. S. Jin
experimental protocols and sampling schedules. OISS can be directly applied to acquisition of PET projection data. This reduces the number of temporal frames obtained and therefore, reduces data storage. Furthermore, as fewer temporal frames are reconstructed the computational burden posed by image reconstruction is reduced. Details of this algorithm can be found in Li et al.26 2.2.2. Stage 2 : Compression in the spatial domain through cluster analysis The prior knowledge has the form of tracer kinetic model to a time series of PET tracer uptake measurements. From the model, using cluster analysis, the image-wide TACs can be extracted and further classified into a certain numbers of TAC groups which corresponding to different tissue regions, according to the similarity of their kinetics. Cluster analysis aims at grouping and classifying image-wide TACs, zi (t) (where i = 1, 2, . . . , I), into Cj cluster groups (where j = 1, 2, . . . , J and J ≪ I) by measuring the magnitude of natural association (similarity characteristics). It is expected that TACs with high degrees of natural association will belong to different groups.8 It should be noted that each TAC must be assigned uniquely to a cluster group. In this paper, a hierarchical-agglomerative clustering algorithm based on the Euclidean distance measurement was used to classify the clinical dynamic PET image data. Using the results of cluster analysis, an index table containing the mean TAC within each cluster and an indexed image can be formed. The indexed image represents a mapping from the cluster to its respective pixel TAC locations. This image together with the index table forms the basis of the compressed temporal/spatial data. With PET, the number of distinguishable clustering groups may generally not exceed 64. This means that an 8-bit indexed image is sufficient to represent the cluster mapping. 2.2.3. Stage 3 : Index image compression A lossless compression scheme is considered in this paper for further reduction of the indexed image. The PNG (Portable Network Graphics)11 format was used to compress and store the indexed image obtained from cluster analysis. The coding technique presently defined and implemented for PNG is based on deflate/inflate compression with a 32-Kb sliding window. The PNG format was chosen over other lossless image compression file formats
Applications of Statistical Methods in Medical Imaging
385
(a) (a)
(b) (b) Fig. 1. Compression results. (a) A set of 22 temporal-frame images (scaled) for the 15th plane from one patient study. (b) Results of the proposed compression method in temporal domain: 5 temporal-frame images (scaled), obtained from 1(a).
386
J. S. Jin
due to its portability, flexibility and being legally unencumbered. Details on the PNG format can be found in Crocker.11 Human dynamic FDG-PET brain studies were performed using an eight-ring, fifteen-slice PET scanner (GE/Scanditronix PC4096-15WB). This scanner contains 4096 detectors and achieves axial and trans-axial resolutions of 6.5-mm full width at half maximum (FWHM) at the center of the field of view. Between 200 and 400 mBq (approximately 0.5 mg) of FDG was injected intravenously and arterial blood sampling commenced immediately thereafter. The blood samples (each 2–3 ml) were taken at 8 × 0.25 minute intervals for the first 2 minutes, then at 2.5, 3, 3.5, 7, 10, 15, 20, 30, 60, 90 and 120 minutes. These samples were immediately placed on ice and the plasma was subsequently separated for the determination of plasma FDG and “cold” glucose concentration. Figure 1(a) shows a set of temporal frames for the 15th plane from one patient study. Due to the lower tracer concentration in the first few frames, these images were scaled to be visible. 3. Noise Reduction Using Statistical Anisotropic Diffusion Diffusion processes have been widely used in quantum physics, material science, fluid dynamics, nuclear science, medicine and chemical physics. Perona and Malik38,39 introduced it to image processing and proposed a multi-scale smoothing and edge detection scheme. It has the good property of eliminating noise while preserving high frequency components, namely edges.2 Diffusion is an iterative process. The degree of diffusion depends on the threshold of diffusion, i.e. the contrast cut-off. A contrast above the threshold will be enhanced during the diffusion process and that below the threshold will be smoothed out. The selection of the threshold is vital to the filtering process. However, the threshold varies from image to image. The problem compounds with the contrast variation from region to region and with intensity distortion of the same region in an image. It is thus desirable to have an adaptive criterion for selecting a threshold. The threshold in a diffusion process is closely correlated with the contrast of the edges in an image. Selecting the threshold is a process of analysing local contrast. In low contrast images, especially when noise is present and the signal-noise ratio (SNR) is low, the contrast between regions is not significant and will be very difficult to pick up. The difficulty lies in the noise presence, unknown distribution of a stochastic signal,
Applications of Statistical Methods in Medical Imaging
387
and unknown combination of multiple interferences. In most of these cases, the histogram of the region shows a single peak. Many automatic threshold selection mechanisms require a bi-peak histogram such as Tsai and Chen46 and Bhandari et al.5 A bi-peak or multi-peak histogram may not exist in many cases. Luijendijk29 proposed an automatic threshold selection using two histograms based on the count of 4-connected regions. Tseng and Huang47 proposed to select the threshold using edge information, i.e. the intensity along edge intervals. Nagawa and Rosenfeld33 fitted the histogram with two Gaussian functions, and Cho et al.7 applied bias correction factors. Glaseby18 combined them with an amendment using iteration. The assumption of Gaussian distribution is weak and correction does not make up this vital defect. Furthermore, iteration makes the computation very expensive. Another difficulty is due to intensity distortion. The applicability of histogram analysis is based on the assumption that all image pixels which have a similar grey level correspond to one object or region of interest in the image. However, this assumption is not always true for most images. Rodriguez and Mitchell41 used an adaptive thresholding method that extracts the background in two phases. The first step uses a global threshold to extract the structure of the regions and the second step refines the segmentation. Parker34 used a local threshold to grow a region after finding a seed pixel in an object. Spann and Horne44 grow regions from low resolution to high resolution in a quadtree structure. The adaptive scheme is a proper way to combat the distortion of intensity. However, the above mentioned methods have a try-and-error nature and do not have a solid theoretical foundation. This section describes an adaptive diffusion scheme by applying the Central Limit Theorem. Regression is used to separate the distribution of the major object in a local window from other objects in a single-peak histogram. The separation will help to automatically determine the threshold. We have applied the algorithm to X-ray angiogram (XRA) images to extract brain arteries. The algorithm works well for single-peak distributions where there are no valleys in the histograms. It has also been used for filtering microscope images of kidneys where there are multiple visual objects and the contrast between objects is very low. The scheme shows that a fully automatic filtering process can be achieved. It works well with images which have texture patterns and are contaminated with noise while the distribution of noise is unknown. These kinds of images have posed a significant problem for traditional filtering schemes such as wavelet based de-noising.13
388
J. S. Jin
3.1. Non-linear anisotropic diffusion Low-pass filters have been used to remove noise. Most filters are isotropic. Isotropic filtering tends to smear the corners and loses the accuracy of edges. To examine the problem carefully, we notice that the gradient along an edge is not isotropic. It has the highest value perpendicular to the edge and is dilated along the edge. It is therefore proper to increase the smoothing function parallel to the edge and stop the smoothing perpendicular to the edge. Non-linear anisotropic diffusion provides such a function. It takes the form ∂ I(x, y, t) = div(g(∇I)∇I) , (1) ∂t where I(x, y, t) is the signal and g(∇I) is a dilation function of gradients. There are two frequently used dilation functions: 1 , (2) g1 (x, y, t) = ∇I(x,y,t) 1+ k ( 2 ) ∇I(x, y, t) g2 (x, y, t) = exp − . (3) k Calculation of diffusive filtering can be performed by a difference operation ∂ I(x, y, t) = div[g(x, y, t) ∗ ∇I(x, y, t)] ∂t ∂ ∂ ∂ ∂ g(x, y, t) ∗ g(x, y, t) ∗ I(x, y, t) + I(x, y, t) = ∂t ∂x ∂y ∂y = g(x + 1, y, t)[I(x + 1, y, t) − I(x, y, t)] + g(x, y, t)[I(x − 1, y, t) − I(x, y, t)] + g(x, y + 1, t)[I(x, y, +1, t) − I(x, y, t)] + g(x, y, t)[I(x, y, −1, t) − I(x, y, t)] = Φ′e + Φ′w + Φ′s + Φ′n .
(4)
Diffusion encourages intra-region smoothing in preference to smoothing across boundaries. The basis of this method is to suppress smoothing at boundaries by selecting locally adaptive diffusion strengths. The parameter κ plays an important role in diffusion. If the κ value is set to too high the filter will act as a smoothing filter, diffusing across the edge boundary; while if κ is too low, small dilation will result in many iterations. At some κ
389
Applications of Statistical Methods in Medical Imaging
values, an extra edge will be introduced between the region of high intensity and region of low intensity. Therefore, the vital question in our design is the selection of κ. 3.2. Selection of the cut-off contrast Images requiring processing often have very low contrast with many intensity layers. Determining an appropriate threshold for such images is difficult. Figure 2 shows an XRA image of the brain artery (a) and its histogram (b) which is a single peak histogram. The selection of a threshold value from such a histogram is ambiguous and not viable by trial and error. We have developed a region-based method to dynamically select a threshold using regression.
(a) (a)
(a)
(b) (b)
Fig. 2.
Histogram analysis on background. (a) An XRA image; (b) Its histogram.
390
J. S. Jin
Fig. 3.
Segmenting histogram using Gaussian regression.
3.2.1. Selecting the threshold by regression and likelihood classification Our scheme is based on the Central Limit Theorem. It is difficult to segment brain arteries from the background because of the low contrast and an overwhelming proportion of the background. We do not know the histogram distribution of the background. However, from the Central Limit Theorem we know that if x1 , x2 , . . . , xn are independent, identically distributed random Pn variables with expectation µ and finite variance σ 2 , then y = n1 i=1 xi is asymptotically normal (µ, σ 2 ) when n is large enough.42 Regression using a Gaussian distribution can separate the background histogram from the foreground histogram, as shown in Fig. 3, where shaded area shows the background histogram and the darker area is the foreground histogram. After separating the histogram, it is easy to select a threshold for image segmentation and to analyse foreground objects. The sampling data for regression is obtained from partial histogram. We calculate the mean value of the histogram and take the half with less variance. Then we find the modal of that half histogram. The sampling data, hi , i ∈ S, is on the same side with the modal against the mean value. The regression is obtained by µ = max(hi ) i∈s s X (hi − µ)2 . σ= 2 i∈s
However, when the number of background pixels is not large enough, it is improper to use the Gaussian distribution in regression. Figure 4 shows another XRA image (a) and its histogram (b). Figure 4(c) is the histogram
391
Applications of Statistical Methods in Medical Imaging
(a)
(a)
(b)
(a)
(b)
(c) (c) (c)
(d) (d)
Fig. 4. Segmentation using regressions. (a) The original image; (b) Its histogram; (c) Gaussian regression does not show clear separation; (d) Rayleigh regression shows a clear separation.
after regression using Gaussian distribution over the background. It does not show a valley between two peaks as we expect, which means there is no clear separation. In this situation, we apply the Rayleigh distribution in regression. Probability theory states that when n is not large enough, x=x ¯n satisfies Rayleigh distribution: x2 x e 2µ2 x≥0 2 f (x) = µ 0 x < 0.
The Rayleigh regression is obtained by r π max(hi ) µ = 2 i∈s r 4−π 2 σ = µ . 2 3.2.2. Extracting a cut-off contrast
The diffusion process is critically depended on the κ value in functions (2) and (3). The parameter κ can be associated with the contrast. The following
392
J. S. Jin
discussion on extracting κ will be based on diffusion function (2) but it can √ be easily converted to function (3) by dividing κ by 2. Although we can obtain a proper estimation of the distribution of one visual object in the image, e.g. background in XRA images, we do not have information on other objects, e.g. vessels in XRA images. It is very difficult to estimate the average contrast between two objects. We use likelihood classification45 to separate pixels from two objects after we separate the background histogram from the foreground histogram. These two histograms are used as the probability distributions of two clusters in likelihood classification. We calculate κ value from the following Nl X ∇p I(x, y)/Nl , κ = l max l∈{0..255} P ∈p
where Nl is the pixel number with gray level l, and P is a set of neighboring pixel pairs whose two pixels belong to different clusters. This calculation can be restricted to a local region. If two neighbor pixels belong to two clusters, we accumulate their difference into a difference histogram. The contrast can be extracted from the modal of differences within a local region. 4. Medical Imaging Segmentation
Segmentation is the process in which an image is divided into constituent objects or parts. It is often the first and most vital step in an image analysis task. Effective segmentation can usually dictate eventual success of the analysis. For this reason, many segmentation techniques have been developed by researchers worldwide.19 Segmentation of intensity images usually involves four main approaches, namely thresholding, boundary detection, region-based and hybrid methods. Thresholding techniques43 are based on the postulate that all pixel whose value lie within a certain range belongs to one class. Such methods neglect all of the spatial information of the image and do not cope well with noise or blurring at boundaries. Boundary-based methods are sometimes called edge-detection,12 because they assume that pixel values change rapidly at the boundary between two regions. The basic method is to apply a gradient filter to the image. High values of this filter provide candidates for region boundaries, which must then be modified to produce closed curves representing the boundaries between regions. Region-based segmentation algorithms postulate that neighbouring pixels within the same region have similar intensity values, of which the
Applications of Statistical Methods in Medical Imaging
393
split-and-merge21 technique based on homogeneity criterion is probably the most well know. It includes seeded region growing32 and unseeded region growing. Hybrid methods combine one or more of the above-mentioned criteria. This class includes the morphological watershed32 segmentation, variableorder surface fitting4 and active contour24 methods. This section presents two methods among which statistical features are used in segmentation. 4.1. Probilistical segmentation using expectation-maximization Intensity-based classification of MR images has proven problematic, even when advanced techniques are used. Intra-scan and inter-scan intensity inhomogeneities are a common source of difficulty. While reported methods have had some success in correcting intra-scan inhomogeneities, such methods require supervision for the individual scan. This section describes a new method called adaptive segmentation that uses knowledge of tissue intensity properties and intensity inhomogeneities to correct and segment MR images. Use of the EM algorithm leads to a method that allows for more accurate segmentation of tissue types as well as better visualization of MRI data, that has proven to be effective in a study that includes more than 1000 brain scans. Implementation and results are described for segmenting the brain in the following types of images: axial (dual-echo spin-echo), coronal (3DFT gradient-echo T1-weighted) all using a conventional head coil; and a sagittal section acquired using a surface coil. The accuracy of adaptive segmentation was found to be comparable with manual segmentation, and closer to manual segmentation than supervised multi-variate classification while segmenting gray and white matter. Advanced applications that use the morphologic contents of MRI frequently require segmentation of the imaged volume into tissue types. Such tissue segmentation is often achieved by applying statistical classification methods to the signal intensities25,49 in conjunction with morphological image processing operations.9,17 Conventional intensity-based classification of MR images has proven problematic, however, even when advanced techniques such as nonparametric, multi-channel methods are used. Intra-scan intensity inhomogeneities due to RF coils or acquisition sequences (e.g. susceptibility artifacts in gradient echo images) are a common source of difficulty. Although MRI images may appear visually uniform, such intra-scan
394
J. S. Jin
inhomogeneities often disturb intensity-based segmentation methods. In the ideal case, differentiation between white and gray matter in the brain should be easy since these tissue types exhibit distinct signal intensities. In practice, spatial intensity inhomogeneities are often of sufficient magnitude to cause the distributions of signal intensities associated with these tissue classes to overlap significantly. In addition, the operating conditions and status of the MR equipment frequently affect the observed intensities, causing significant inter-scan intensity inhomogeneities that often necessitate manual training on a per-scan basis. Intra- and inter-scan MRI intensity inhomogeneities is modeled with a spatially-varying factor called the gain field that multiplies the intensity data. The application of a logarithmic transformation to the intensities allows the artifact to be modeled as an additive bias field. If the gain field is known, then it is relatively easy to estimate tissue class by applying a conventional intensity-based segmenter to the corrected data. Similarly, if the tissue classes are known, then it is straightforward to estimate the gain field by comparing predicted intensities and observed intensities. It may be problematic, however, to determine either the gain or the tissue type without knowledge of the other. It will be shown that it is possible to estimate both using an iterative algorithm (that converges in five to ten iterations, typically). A Bayesian approach is used to estimating the bias field that represents the gain artifact in log-transformed MR intensity data. First, a logarithmic transformation of the intensity data is computed as follows: Yi = g(Xi ) = (ln([Xi ]1 ), ln([Xi ]2 ), . . . , ln([Xi ]m ))T ,
(5)
where Xi is the observed MRI signal intensity at the ith voxel, and m is the dimension of the MRI signal. Similar to other statistical approaches to intensity-based segmentation of MRI,9,17 the distribution for observed values is modeled as a normal distribution (with the incorporation of an explicit bias field): p(Yi |Γi , βi ) = GψΓi (Yi − µ(Γi ) − (βi ) , where −m 2
GψΓi (x) = (2π)
− 12
|ψΓi |
(6)
1 T −1 exp − x ψΓi x 2
is the m-dimensional Gaussian distribution with variance ψΓi and where Yi is the observed log-transformed intensities at the ith voxel;
Applications of Statistical Methods in Medical Imaging
395
Γi is the tissue class at the ith voxel; µ(x) is the mean intensity for tissue class x; ψx is the covariance matrix for tissue class x; βi is bias field at the ith voxel. Here, Yi , µ(x), and βi are represented by m-dimensional column vectors, while ψx is represented by an m × m matrix. Note that the bias field has a separate value for each component of the log-intensity signal at each voxel. In words, (6) states that the probability of observing a particular image intensity, given knowledge of the tissue class and the bias field is given by a Gaussian distribution centered at the biased mean intensity for the class. A stationary prior (before the image data is seen) probability distribution on tissue class is used, it is denoted as p(Γi ). If this probability is uniform over tissue classes, our method devolves to a maximum-likelihood approach to the tissue classification component. A spatially-varying prior probability density on brain tissue class has been studies.23 Such a model might profitably be used within this framework. The entire bias field is denoted by β = (β0 , β1 , . . . , βn−1 )T , where n is the number of voxels of data. The bias field is modeled by a n-dimensional zero mean Gaussian prior probability density. This model allows us to capture the smoothness that is apparent in these inhomogeneities: p(β) = Gψβ (β) ,
(7)
where
1 T −1 Gψβ (β) = (2π) |ψβi | exp − x ψβi x 2 is the n-dimensional Gaussian distribution. The n × n covariance matrix for the entire bias field is denoted ψβ . Although ψβ will be too large to manipulate directly in practice, tractable estimators can result when ψβ is chosen so that it is banded. It is assumed that the bias field and the tissue classes are statistically independent, this follows if the intensity inhomogeneities originate in the equipment. Using the definition of conditional probability the joint probability on intensity and tissue class can be obtained as follows: −n 2
− 12
p(Yi , ΓI |βi ) = p(Yi |Γi , βi )p(Γi ) ,
(8)
and we may obtain the conditional probability of intensity alone by computing a marginal over tissue class: X X p(Yi |βi ) = p(Yi , Γi |βi ) = p(Yi |Γi , βi )p(Γi ) . (9) Γi
Γi
396
J. S. Jin
This expression may be written more compactly as ∂ X p(β) ∂[β i ]k −1 Wij [ψj (Yi − µj − βi )]k + p(β) j
β=βˆ
=0
∀ i, κ
(10)
with the following definition of Wij , (which are called the weights), Wij ≡
⌊p(Γi )GψΓi (Yi − µ(Γi ) − βi ⌋Γi =tissuee-class-j P . Γi p(Γi )GψΓi (Yi − µ(Γi ) − βi )
(11)
where subscripts i and j refer to voxel index and tissue class respectively, and defining µj ≡ µ(tissue-class-j) as the mean intensity of tissue class j. The mean residual is defined as X ¯i ≡ Wij ψj−1 (Yi − µj ) , (12) R j
and the mean inverse covariance is P j Wij ψj−1 ψ −1 ik ≡ 0
if j = κ (13) otherwise .
The result of the statistical modeling in this section has been to formulate the problem of estimating the bias field as a non-linear optimization problem embodied in ¯ − ψ −1 βˆ − ψ −1 βˆ = 0 R β or ¯. βˆ ≡ (ψ −1 + ψβ−1 )−1 R
(14)
This optimization depends on the mean residual of observed intensities and the mean intensity of each tissue class, and on the mean covariance of the tissue class intensities and the covariance of the bias field. The expectation-maximization (EM) algorithm is used to obtain bias field estimates from the non-linear estimator of (10). The EM algorithm iteratively alternates evaluations of the expressions appearing in models (11) and (14), Wij ←
⌊p(Γi )GψΓi (Yi − µ(Γi ) − βi )⌋Γi =tissue-class-j P , Γi P (Γi )GψΓi (Yi − µ(Γi ) − βi )
¯. βˆ ← (ψ −1 + ψβ−1 )−1 R
(15) (16)
397
Applications of Statistical Methods in Medical Imaging
(a)
(b)
(c)
Fig. 5. Segmentation using expectation-maximization. (a) Original MRI brain slide; (b) Bias fied estimation; (c) Segmentation result.
In other words, model (15) is used to estimate the weights given an estimated bias field, then model (16) is used to estimate the bias, given estimates of the weights. The adaptive segmentation can be applied to spin-echo and gradientecho images. Examples are shown for the coronal (3DFT gradient-echo T1-weighted) images. All of the MR images shown in this section were obtained using a General Electric Signa 1.5 Tesla clinical MR imager [General Electric Medical Systems, Milwaukee, WI]. An anisotropic diffusion filter described in Sec. 3 was used as a pre-processing step to reduce noise. Figure 5(a) shows the input image, a slice from a coronal 3DFT gradient-echo T1-weighted acquisition. The brain tissue ROI was generated manually. Figure 5(b) shows the final bias field estimate. The largest value of the input data was 85, while the difference between the largest and smallest values of the bias correction was about 10. Figure 5(c) shows the segmentation resulting from adaptive segmentation. Note the significant improvement in the right temporal area. In the initial segmentation the white matter is completely absent in the binarization. 4.2. Unseeded region growing Unseeded region growing is similar to seeded region growing except that no explicit seed selection is necessary: the seeds can be generated by the segmentation procedure automatically. Therefore, this method can achieve fully automatic segmentation with the added benefit of robustness from being a region-based segmentation.
398
J. S. Jin
Formally, the segmentation process initializes with region A1 containing a single image pixel, and the running state of the segmentation process consist of a set of identified regions, A1 , A2 , . . . , An . Let T be the set of all unallocated pixels which borders at least one of these regions ) ( n [ Ai ∧ ∃k : N (x ∩ Ak ) 6= 6 , T = x∈ / i=1
(a)
(a)
(b)
(b)
(c)
(c) Fig. 6. Segmenation using unseeded region growing. (a) Noisy image (σ = 10.0); (b) X-ray angiogram; (c) Ultrasound heart image.
Applications of Statistical Methods in Medical Imaging
399
where N (x) are immediate neighboring pixels of point x. Further, we define a difference measure δ(x, Ai ) = |g(x)-meany∈Ai [g(y)]| , where g(x) denotes the image value at point x, and i is an index of the region such that N (x) intersect Ai . The growing process involves selecting a point z ∈ T and region Aj where j ∈ [1, n] such that δ(x, Ai ) =
min
{δ(x, Ai )} .
x∈T,κ∈[1,n]
If δ(z, Aj ) is less than the predefined threshold t, then the pixel is added to Aj . Otherwise, we must choose the most substantially similar region A such that A = arg min{δ(x, Ak )} . Ak
If δ(z, A) < t, we can assign the pixel to A. If neither of these two conditions above apply, then it is apparent that the pixel is significantly different from all the regions found so far, so a new region, An+1 would be identified and initialized with pint z. In all three cases, the statistic of the assigned region must be updated once the pixel has been added to the region. The URG segmentation procedure is inherently iterative, and the above process is repeated until all pixels have been allocated to a region. To ensure correct behavior with respect to the homogeneity criterion, the region growing operation requires the determination of the “best” pixel each time a region statistic is changed. The details of implementation can be found in Lin et al.28 The segmentation results can be seen in Fig .6. 5. Improving Confidence Intervals of Image Registration Using 3-D Monte Carlo Simulations Clinical diagnosis and treatment usually require registration of images with multiple modalities. Most of the medical image registration methods30,31,48 minimize or maximize values of certain cost functions to achieve the global optimized match. These functions are usually the sum of squares of the distances between certain homogenous features in the two image sets to be registered. The sum of distances between homogenous point pairs of the two image sets,15 distances between skin surfaces of CT, MR and PET images of the head in the “head-hat” method,37 the absolute difference between pixel values of PET image and pixel values of image simulated by MR
400
J. S. Jin
image,28 and the ratio between pixel values and their means in the same tissue class3,51 are examples of these cost functions. However, most of these cost functions do not directly reflect the distance between the actual and estimated positions of targets, i.e. the target registration error (TRE). Most medical applications demand accuracy and precision assessment methods to justify their results. Internal consistency measures were used by Woods et al.51 to place limits on registration accuracy for MRI data. Almost all other registration accuracy assessment methods fall into two broad categories: qualitative evaluations by visual inspection and quantitative evaluation by reference to results from a gold standard registration method. The former methods require special expertise and extensive experience, while the latter methods require an extremely accurate gold standard that cannot be easily achieved. Different methods may not always be comparable to each other under identical criteria. Using the terminology of nonlinear regression analysis,14 we can refer the problem of image registration as a nonlinear least sum of squares estimation of the transformation parameters that result in the optimal fitting of one set of image (function) to the other set of image (data). For least square estimation methods, the cost function could be assumed to be linear around the neighborhood of the current parameter values. So that we can calculate the confidence intervals or regions using the following equation14 : X (θ − θ0 ) (f ′ ) ≤ (σ 2 (n − 1))F (p, n − p, 1 − α) , (17)
where F is a chosen F -test value of the corresponding confidence level, σ 2 is the residual sum of squares (registration cost function) value at the P location of the estimated parameters, and (f ′ ) represents the sum of the derivatives of the reference model image to the transformation parameters. θ and θ0 are the parameters corresponding to the confidence level and the optimal parameters found by the registration procedure, respectively. Since all the data points involved in the calculation of (17) should be statistical independent to each other, and the data points in the images are correlated, the number of points in the image could not be used directly as n and the effective number of independent data points needs to be estimated. To determine the effective number of independent data points involved in the estimation of confidence intervals, we first used one Monte Carlo simulation study based on normal conditions. The same number n selected according to this simulation results was found to be consistent for both the 95% and 90% confidence levels. We have further investigated the validity of the selected number n in various simulated conditions in other parts of the study.
Applications of Statistical Methods in Medical Imaging
401
Monte Carlo studies to simulate 2D PET images and subsequent registrations of the simulated images were conducted. The resulted distributions of the estimated transformation parameters were used to assess the consistency of 90%, 95% and 99% confidence intervals with the distributions in the parameter space. 2D grey matter and white matter sinograms of the segmented 2D Hoffman brain phantom20 were combined with the grey-towhite ratios of 2:1, 3:1 and 4:1 before reconstruction to see whether the discrepancies of the ratios in two images can affect the confidence intervals. Then, filtered back-projection reconstruction programs with various filters (i.e. Hanning, Ramp, Butter-worth, Ham, Parzen and Shepp-Logan filters) were employed to reconstruct images of size 128 × 128. Various amounts of spatial displacements (i.e. rotations of 0.3, 0.8, 1.2 and 3.3 degrees, and translations of 0.16, 0.8, 1.6 and 2.4 mm) were introduced. Various levels of Poisson noise (i.e. total counts of 5 × 105 , 1 × 106 and 2 × 106 ) were simulated. A Gaussian smoothing filter with a FWHM of 5 mm is applied to both sets of images before registration. The Powell’s algorithm40 was selected as the optimization procedure. In the cases of extreme noise conditions and large contrast discrepancies, the residual sum of squares (RSS ) consists of two parts: the systematic error and the error due to statistical noise: RSS = RSSsystem + RSSnoise .
(18)
The systematic error is contributed by the innate difference between the two images, inappropriate registration method, precision error of the program, etc. Such errors are independent of the initial displacements and noise. The second part of the residual sum of squares is due to statistical noise. If the systematic error is relatively large compared to the noise term, i.e. for cases with very low noise levels and high grey-to-white ratio discrepancies, the estimated residual sum of squares needs to be adjusted for systematic error. Since the systematic component in RSS is much less sensitive to spatial smoothing than the other component in Eq. (18), it can be estimated by applying smoothing filters to both sets of images with relatively large FWHMs when the parameters are found. By removing the systematic component, the result RSS provides an estimation of the noise component in Eq. (18). The calculated confidence intervals based on statistical regression are consistent with the simulation results for sample distributions of the transformation parameters of image co-registration. Varying the amount of displacement, reconstruction processes, noise levels, or tracer distributions have little impacts on the validity of the calculated confidence intervals.
402
J. S. Jin
After adjusted for systematic errors in the estimated residual sum of squares, confidence intervals can be calculated accurately even for very noisy conditions and with large distribution discrepancies between the two sets of images. Since multi-modality registration can be viewed as mono-modality registration of one image set with another simulated from the other image modality, this method is also expected to be applicable to multi-modality registration. Hence, visual inspection and validations by experts are not necessary for assessing the precision of the registration results. The results indicate the use of statistical confidence intervals has a potential to provide an automatic and objective assessment of individual image registration. 6. Conclusion We have attempted a brief summary of the applications of statistical methods in image processing in general and medical imaging in particular. The issues cover image sampling, compression, filtering, segmentation and registration. Methods have been discussed in theory and illustrated in empirical results. Statistical methods are powerful tools in many signal processing applications. We hope this summary will provide an insight for the further use of statistical methods in image processing. References 1. Adam, R. and Bischof, L. (1994). Seeded region growing, IEEE Transactions on Pattern Analysis and Machine Intelligence 16(6): 641–647. 2. Alvarze, L. and Mazorra, L. (1994). Signal and image restoration using shock filters and anisotropic diffusion. SIAM Journal on Numerical Analysis 31(2): 590–594. 3. Ardekani, B. A., Braun, M. et al. (1995). A fully automatic multimodality image registration algorithm, Journal Computer Assistant Tomography 19(4): 615–623. 4. Besl, P. J., and Jain, R. C. (1988). Segmentation through variable-order surface fitting, IEEE Transaction on Pattern Analysis and Machine Intelligence 10(2): 167–192. 5. Bhandari, D., Pal, N. R. and Majumder, D. D. (1992). Fuzzy divergence, probability measure of fuzzy events and image thresholding. Pattern Recognition Letter 13: 857–867. 6. Cai, W., Feng, D. and Fulton, R. (1998). Clinical investigation of a knowledgebased data compression algorithm for dynamic neurologic FDG-PET images, Proceedings of the 20th Annual International Conference of the IEEE
Applications of Statistical Methods in Medical Imaging
7. 8.
9.
10.
11. 12. 13. 14. 15.
16.
17.
18. 19. 20.
21.
403
Engineering in Medicine and Biology Society (EMBS’98), Vol. 20, part 3, 1270–1273, Hong Kong, October 29–November 1. Cho, S., Haralick, M. R. and Yi, S. (1989). Improvement of Kittle and Illingworth’s minimum error thresholding. Pattern Recognition 22: 609–617. Ciaccio, E. J., Dunn, S. M. and Akay, M. (1994). Biosignal pattern recognition and interpretation systems: Methods of classification, IEEE Engineering in Medicine and Biology 13: 129–135. Cline, H. E., Lorensen, W. E., Kikinis, R. and Jolesz, F. (1990). Threedimensional segmentation of MR images of the head using probability and connectivity. JCAT 14(6): 1037–1045. Cobelli, C., Ruggeri, A., DiStefano, III, J. J. and Landaw, E. M. (1985). Optimal design of multioutput sampling schedules — software and applications to endocrine–metabolic and pharmacokinetic models, IEEE Transactions on Biomedical Engineering 32(4): 249–256. Crocker, L. D. (1995). PGN: The portable network graphic format. Dr. Dobb’s Journal, 36-49, July. Davis, S. L. (1975). A survey of edge detection techniques. Computer Graphics Image Processing 4: 248–270. Donoho, D. L. (1995). De-noising by soft-thresholding. IEEE Transaction Information Theory IT-41: 613–627. Draper, N. R. (1981). Applied regression Analysis, 2nd edn., Wiley, New York. Evans, A. C., Marrett, S., Collins, L. and Peters, T. M. (1989). Anatomicalfunctional correlative analysis of the human brain using three dimensional imaging systems. In Medical Imaging: Image Processing, eds. R.H. Schneider, S.J. Dwyer III AND R.G. Jost, SPIE Press, Bellingham, WA., 1092: 264–274 vol. 1092, pp. 264-274. Feng, D., Ho, D., Chen, K., Wu, L., Wang, J., Liu, R. and Yeh, S. (1995). An evaluation of the algorithms for constructing local cerebral metabolic rates of glucose tomographical maps using positron emission tomography dynamic date. IEEE Transaction on Medical Imaging 14(4): 697–710. Gerig, G., Kuoni, W., Kikinis, R. and Kubler, O. (1993). Medical imaging and computer vision: An integrated approach for diagnosis and planning. Proceedings of the 11th DAGM Symposium, 425–443. Glaseby, C. A. (1985). An analysis of histogram based thresholding algorithms. CVGIP : Graphical Models and Image Processing 55: 532–533. Haralick, R. M. and Shapiro, L. G. (1985). Image segmentation techniques. Computer Graphics Image Processing 29: 100–132. Hoffman, E. J., Cutler, P. D., Guerrero, T. M., Digdy, W. M. and Mazziotta, J. C. (1991). Assessment of accuracy of PET utilizing a 3-D phantom to simulate the activity distribution of [18F] fluorodeoxyglucose uptake in the human brain. Journal of Cerebral Blood Flow and Metabolism 11: 17–25. Horowitz, S. L. and Pavlidis, T. (1974). Picture segmentation by a directed split-and-merge procedure. Proceedings 2nd International Joint Conference On Pattern Recognition, 424–433.
404
J. S. Jin
22. Huang, S. C., Phelps, M. E., Hoffman, E. J., Sideris, K., Selin, C. and Kuhl, D. E. (1980). Non-invasive determination of local cerebral metabolic rate of glucose in man. American Journal of Physiology 238: E69–E82. 23. Kamber, M., Collins, D., Shinghal, R., Francis, G. and Evans, A. (1992). Model-based 3D segmentation of multiple sclerosis lesions in dual-echo MRI data. SPIE Vol. 1808, Visualization in Biomedical Computing. 24. Kass, M., Witkin, A. and Terzonpoulos, D. (1987). Snakes: Active contour models. Proceedings International Conference On Computer Vision, London. 25. Kohn, M., Tanna, N., Herman, G. et al. (1991). Analysis of brain and cerebrospinal fluid volumes with MR imaging. Radiology 178: 115–122. 26. Li, X., Feng, D. and Chen, K. (1996). Optimal image sampling schedule: A new effective way to reduce dynamic image storage space and functional image processing time. IEEE Transactions 15: 710–718. 27. Lin, J. Z., Jin, J. S. and Hugo, T. (2001). Unseeded region growing. Proceedings Workshop on Visual Information Processing, 2000, Sydney. 28. Lin, K. P., Huang, S. C. et al. (1994). A general technique for interstudy registration of multifunction and multimodality images. IEEE Tran Nuclear Science 41(6): 2850–2855. 29. Luijendijk, H. (1991). Automatic threshold selection using histograms based on the count of 4-connected regions. Pattern Recognition Letter 12: 219–228. 30. Maintz, J. B. A. and Viergever, M. A. (1998). A survey of medical image registration. Medical Image Analysis 2(1): 1–36. 31. Maurer, C. R. and Fitzpatrick, J. M. (1993). A review of medical image registration, In Interactive Imageguided Neurosurgery, American Association of Neurological Surgeons, ed. R.J. Maciunas, Parkridge, IL, 17–44. 32. Meyer, F. and Beucher, S. (1979). Morphological segmentation. Journal of Visual Communication And Image Representation 1: 21–46. 33. Nagawa, Y. and Rosenfeld, A. (1979). Some experiments on variable thresholding. Pattern Recognition 11: 191–204. 34. Parker, J. R. (1991). Gray level thresholding in badly illuminated images. IEEE Transaction PAMI 13: 813–819. 35. Patlak, C. S. and Blasberg, R. G. (1985). Graphical evaluation of blood to brain transfer constaints from multiple-time uptake data generalizations. Journal of Cerebral Blood Flow and Metabolism 5: 584–590. 36. Patlak, C. S., Blasberg, R. G. and Fenstermacher, J. (1983). Graphical evaluation of blood to brain transfer constants from multiple-time uptake data. Journal of Cerebral Blood Flow and Metabolism 3: 1–7. 37. Pelizzari, C. A., Chen, G. T. Y. et al. (1989). Accurate three-dimensional registration of CT, PET and/or MR images of the brain. Journal Computer Assistant Tomography 13: 20–26. 38. Perona, P. and Malik, J. (1987). Scale-space and edge detection using anisotropic diffusion. Proceedings IEEE Workshop Computer Vision, Miami, FL, 16–22. 39. Perona, P. and Malik, J. (1990). Scale-space and edge detection using anisotropic diffusion. IEEE Transaction PAMI 12: 629–639.
Applications of Statistical Methods in Medical Imaging
405
40. Powell, M. J. D. (1964). An efficient method for finding the minimum of a function of several variables without calculating derivatives. Computer Juornal 7: 155–163. 41. Rodriguez, A. A. and Mitchell, O. R. (1991). Image segmentation by successive background extraction. Pattern Recognition 24: 409–420. 42. Ross, S. M. (1987). Introduction to Probability and Statistics for Engineers and Scientists, John Wiley and Sons, NY. 43. Sahoo, P. K., Soltani, S. and Wong, A. K. C. (1988). A survey of threhsolding techniques. Computer Graphics Image Processing 41: 230–260. 44. Spann, M. and Horne, C. (1989). Image segmentation using a dynamic thresholding pyramid. Pattern Recognition 22: 719–732. 45. Tou, J. T. and Gonzalez, R. C. (1972). Pattern Recognition Principle, Addison-Wesley. 46. Tsai, D. M. and Chen, Y. (1992). A fast histogram-clustering approach for multi-level thresholding. Pattern Recognition Letter 13: 245–252. 47. Tseng, D. C. and Huang, M. Y. (1993). Automatic thresholding based on human visual perception. Image and Vision Computing 11: 539–548. 48. Van den Elsen, P. A., Pol, E. J. D. and Viergever, M. A. (1993). Medical image matching — A review with classification. IEEE Engineering in Medicine and Biology 12: 26–39 49. Vannier, M., Butterfield, R., Jordan, D., Murphy, W. et al. (1985). Multispectral analysis of magnetic resonance images. Radiology 154: 221–224. 50. Wells III., W. M., Grimson, W. E. L., Kikinis, R. and Jolesz, F. A. (1996). Adaptive segmentation of MRI data. IEEE Transactions on Medical Imaging 15(4): 429–443 51. Woods, R. P., Grafton, S. T., Holmes, C. J., Cherry, S. R. and Mazziotta, J. C. (1998). Automated image registration: I. General methods and intrasubject, intramodality validation. Journal Computer Assistant Tomography 22(1): 139–152.
About the Author Jesse Jin graduated with a BEng from Shanghai Jiao Tong University and a PhD University of Otago, New Zealand. He is an Associate Professor and postgraduate coordinator in the Department of Computer Science, University of Sydney, and an Adjunct Associate Professor and Director of the Visual Information Processing Laboratiry in the School of Computer Science and Engineering, University of South Wales, Australia. Dr. Jin is an international renowed expert on multimedia technology and visual information retrival and processing. He has published 135 articles and 6 books. He also has one patent and is in the process of filing 3 more patents. He established a spin-off company and the company won the 1999 ATP Vice-Chancellor New Business Creation Award. He is a consultant of
406
J. S. Jin
many companies such as Motorola, Silicon Graphics, Computer Associates, ScanWorld, Proteome Syestems, HyperSoff, CyberView, etc. He was a visiting professor in MIT, UCLA, HKPU and Tsinghua University. He is also a Vice-President of Ausinan Science & Technology Society. His research interests include image processing and multimedia.
Section 2 Statistical Methods in Pharmaceutical Research
This page intentionally left blank
CHAPTER 11
STATISTICS IN PHARMACOLOGY AND PRE-CLINICAL STUDIES TZE LEUNG LAI Department of Statistics, Stanford University, Room 127, Sequoia Hall, 390 Serra Mall, Stanford, CA 94305-4065, USA Tel: 650-7232622; [email protected] MEI-CHIUNG SHIH Department of Biostatistics, Harvard School of Public Health, Boston, MA 02215, USA GUANGRUI ZHU Aventis Pharmaceuticals, P O Box 6800, Bridgewater, NJ 08807-0800, USA
1. Introduction Pharmacology1 as the science dealing with interactions between living systems and molecules, especially chemicals introduced from outside the system. This broad definition includes clinical pharmacology, whose objective is to prevent, diagnose and treat diseases with drugs, and the pathogenesis of diseases due to chemicals in the environment. A drug is defined in1 as a small molecule that, when introduced into the body, alters the body’s function. The component of a cell or organism that interacts with a drug and initiates the chain of biochemical events leading to the drug’s therapeutic and toxic effects is called a receptor. The receptor concept has become the central focus of investigation of pharmacodynamics — the study of drug effects and their mechanisms of action. The relation between the dose of a drug and its clinically observed effects can be quite complex. In carefully controlled in vitro systems, however, the relation between the concentration of a drug at the site(s) of action and its effects can often be described by relatively simple mathematical models. 409
410
T. L. Lai, M.-C. Shih & G. Zhu
How a drug dose produces its effects involves not only pharmacodynamics but also pharmacokinetics. The latter is concerned with the concentration-time curve that is associated with the following “history” of a single adminstration of a drug: (i) absorption phase of the drug into the body — transfer of the drug from its site of administration (via oral, or inhalational, or intravenous, or other route) into the bloodstream, (ii) distribution phase — distribution of the drug to different compartments of the body, including receptor binding sites in the target tissue, and resulting in rapid decline in plasma concentration, (iii) elimination phase — excretion of chemically unchanged drug or elimination via metabolism that converts the drug into one or more metabolites (e.g. at the liver). Section 2 presents an overview of the basic principles, models and statistical methods in pharmacokinetics and pharmacodynamics. An active area of research in the field is pharmacometrics and Sec. 2 also gives some recent trends in this area. Particular attention will be directed to population pharmacokinetics and its interactions with several branches of modern statistics, including nonlinear mixed effects models, hierarchical and empirical Bayes methods, and generalized linear mixed effects models. Section 2 also discusses the role of pharmacokinetic and pharmacodynamic studies in drug development. Specifically they are used to determine the dosage regimen of the drug (i.e. how much and how often it should be taken). These studies are initially performed in vitro and then on animals to come up with rough guesses of a region of dosage regimens in which clinical studies on human subjects are to be performed. The in vitro and animal studies are called pre-clinical and precede the clinical studies that are classified as Phase I studies (on healthy volunteers) and Phases II and III clinical trials (on patients). Other statistical applications in pharmacology and pre-clinical studies include bioequivalence and bioavailability (treated in Sec. 3), assay development and validation (summarized in Sec. 4), drug discovery (reviewed in Sec. 5) and toxicology (treated in Chapter 13). 2. Pharmacokinetics and Pharmacodynamics Drug administration can be divided into two phases, a pharmacokinetic (PK) phase in which the kinetics of drug absorption, distribution, and
Statistics in Pharmacology and Pre-Clinical Studies
411
elimination translate into drug concentration-time relationships in the body, and a pharmacodynamic (PD) phase in which the drug concentration at the site(s) of action leads to the response/effects produced. Knowledge of both phases is important for the design of a dosage regimen to achieve the therapeutic objective. Since both the desired response and toxicity of the drug are functions of the drug concentration at the site(s) of action, the therapeutic objective can be achieved only when the drug concentration lies within a “therapeutic window,” outside which the therapy is either ineffective or has unacceptable toxicity. Drug concentrations, however, can rarely be measured directly at the sites of action and are typically measured at the plasma, which is a more accessible site. An optimal dosage regimen can therefore be defined as one that maintains the plasma concentration of a drug within the therapeutic window. This can be achieved for many drugs by giving an initial dose to yield a plasma concentration within the therapeutic window and then maintaining the concentration within this window by periodic doses to replace the drug lost over time. 2.1. PK/PD models Many PK and PD models have been developed in clinical pharmacology. The monographs1–5 give a comprehensive introduction to these models and their applications. The PK models can be roughly classified as “mechanistic” or “empirical,” while mechanistic models can be classified as “physiologic” or “compartmental.” In physiologic models, the body is viewed in physiologic terms, making use of a priori knowledge of physiology, anatomy and biochemistry. Although the tissues or organs differ from one another, they share many qualitative features. As an illustrative example, consider how anatomy affects elimination. First “clearance” CL is defined as the rate of elimination divided by the concentration of the drug. If the organs of elimination are in parallel, then CL is the sum of the CLi over the elimination organs i. On the other hand, if the organs of elimination are in series (working sequentially one after another), then CL is proportional to 1 − Π(1 − Ei ), where Ei is the extraction ratio of the drug at organ i. In particular, since the gut-liver system is in series for portal circulation whereas the portal and arterial systems into the liver are in parallel, it follows that CL = QH {fHP [1 − (1 − Egut )(1 − Eliver )] + (1 − fHP )Egut } ,
(1)
where fHP is the fraction of total hepatic blood flow QH that enters the liver via the hepatic portal vein.
412
T. L. Lai, M.-C. Shih & G. Zhu
In compartmental models, the body is viewed in terms of kinetic compartments between which the drug distributes and from which elimination occurs. The kinetics is often described by a linear system of ordinary differential equations, which have explicit solutions involving exponential functions. On the other hand, the rate constants of a compartmental model may be functions of the concentration of the drug itself or another metabolite/interacting drug, leading to a system of nonlinear differential equations that have to be solved numerically. Empirical PK models are typically poly-exponential models of the form Σαi e−λi t . It is well known that different compartmental models may imply the same poly-exponential models, leading to identifiability difficulties with compartmental models in empirical work.7,8 A basic goal of PD models is to describe and quantify the steady-state relationship of drug concentration (C) at an effector site to the drug effect (E). The simplest PD model for one drug is the so-called “Emax model” defined by E = emax C/(C + c50 ) ,
(2)
where emax is the maximum effect that the drug can produce and c50 is the concentration that yields 50% of emax. Note that this equation is the same as the Langmuir model in thermodynamics or the Michaelis–Menten model in enzyme kinetics, in which the equilibrium state of ligand binding reactions is given by B = νF/(α + F ) ,
(3)
where B and F are the concentrations of the bound and free ligand, respectively, ν is the capacity of the binding site and 1/α is the affinity constant. In fact, assuming that E is proportional to B, Eq. (2) follows from Eq. (3). A variant of Eq. (2) to incorporate the baseline effect e0 is E = e0 + emax C/(C + c50 ) .
(4)
When the drug decreases the response, one takes emax = e0 and replaces the plus sign in Eq. (4) by a minus sign, so that E = e0 − e0 C/(C + c50) = e0 c50/(C + c50). A convenient surrogate for the drug concentration at an effector site, which is difficult to measure directly, is dose (D). In empirical work, the Emax model is often reformulated as E = e0 + emax D/(D + ED50 ) .
(5)
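To make the use of Eq. (5) concrete, here is a minimal Python sketch (with invented dose-effect numbers) that fits the empirical Emax model by ordinary nonlinear least squares; it is only an illustration of the kind of fit discussed further in Sec. 2.3, not a prescribed procedure.

import numpy as np
from scipy.optimize import curve_fit

def emax_model(dose, e0, emax, ed50):
    # Eq. (5): baseline effect plus a saturating dose term.
    return e0 + emax * dose / (dose + ed50)

# Hypothetical dose-effect data (arbitrary units).
dose = np.array([0.0, 1.0, 2.5, 5.0, 10.0, 25.0, 50.0, 100.0])
effect = np.array([2.1, 3.0, 4.2, 5.6, 7.1, 8.9, 9.6, 10.2])

popt, pcov = curve_fit(emax_model, dose, effect, p0=[2.0, 10.0, 10.0])
for name, est, se in zip(["e0", "emax", "ED50"], popt, np.sqrt(np.diag(pcov))):
    print(f"{name}: {est:.2f} (SE {se:.2f})")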
A more general form of Eq. (2) is E = bX/(X + a) .
(6)
For b = emax, a = 1 and X = (C/c50)^γ with γ > 0, Eq. (6) is called the "Sigmoid-Emax model." While the special case γ = 1 of such models reduces to Eq. (2), the inclusion of γ gives an additional adjustable parameter in fitting the model from data. A general Emax model for two drugs, with concentrations C and C∗, incorporating both competitive and noncompetitive interactions is of the form

E = emax {(C/c50) + α(C∗/c∗50) + β(C/c50)(C∗/c∗50)} / {1 + (C/c50) + (C∗/c∗50) + δ(C/c50)(C∗/c∗50)} ,   (7)
where 0 ≤ α ≤ 1, 0 ≤ δ ≤ 1, and β ≥ 0 with β = 0 if δ = 0. In particular, for δ = 1 and β = 1 + α, the right hand side of Eq. (7) can be written as a sum of (C/c50)/{1 + (C/c50)} and α(C∗/c∗50)/{1 + (C∗/c∗50)}, yielding additive effects of the two drugs. The case β = δ = 0 gives a "competitive interaction model," which can be written as a linear combination of two terms of the form in Eq. (6) with b = emax and (X, a) = (C/c50, 1 + C∗/c∗50) or (C∗/c∗50, 1 + C/c50). The case β > δ > 0 shows synergism between the two drugs, while δ > max(β, 0) shows antagonism. In particular, the case β = 0 and δ = 1 gives a "non-competitive antagonism model," which can be written as a linear combination of two terms of the form in Eq. (6) with b = emax and (X, a) = (C/c50, 1 + C∗/c∗50 + CC∗/c50c∗50) or (C∗/c∗50, 1 + C/c50 + CC∗/c50c∗50). Non-competitive antagonism can be explained by using receptor theory as follows. A drug interacts with two sites, one of which activates a receptor which may still interact with a second drug to form another non-activated receptor.

2.2. PK parameters and their nonparametric estimates

Several physiologic (e.g. maturation of organs in infants) and pathologic (e.g. kidney failure, heart failure) processes require dosage adjustments in individual patients to modify specific PK parameters. Two basic parameters in this connection are clearance (a measure of the ability of the body to eliminate the drug) and volume of distribution (a measure of the apparent space in the body available to contain the drug).
Drug clearance principles are similar to clearance concepts in renal physiology, in which creatinine or urea clearance is defined as the rate of elimination of the compound in the urine relative to the plasma concentration. Thus clearance CL of a drug is the rate of elimination by all routes relative to the concentration C of the drug in a biologic fluid: CL = Rate of elimination/C .
(8)
The commonly used biologic fluid in Eq. (8) is plasma, for which CL is, strictly speaking, “plasma clearance.” When C is Cb (blood concentration) or Cu (unbound or free drug concentration), then Eq. (8) gives “blood clearance” or “clearance based on unbound drug concentration,” respectively. In healthy subjects, the clearance of amikacin is 91 ml/min, with 98% of the drug excreted in the urine unchanged. This means that the kidney is able to remove this drug from approximately 89 ml of plasma per minute. Propranolol is cleared at the rate of 840 ml/min, almost exclusively by the liver. This means that the liver is able to remove this drug from 840 ml of plasma per minute. For most drugs, clearance is constant over the plasma or blood concentration range in clinical settings, so the rate of elimination of the drug is proportional to its concentration C, in view of Eq. (8). Clearance is perhaps the most important PK parameter to be considered in defining a rational drug dosage regimen. In most cases, the clinician would like to maintain steady-state drug concentrations Css within a known therapeutic window. Steady state will be achieved when the dosing rate (rate of active drug entering the systemic circulation) equals the rate of drug elimination. Therefore, Dosing rate = CL × Css .
(9)
The two major sites of drug elimination are the kidneys and the liver. Clearance of unchanged drug in the urine represents renal clearance. Within the liver, drug elimination occurs via biotransformation of the drug to one or more metabolites, or excretion of unchanged drug into the bile, or both. When no other organs are involved in elimination of the drug, CL = CLrenal + CLliver since the liver and kidneys work in parallel. The rate of elimination of a drug by a single organ can be defined in terms of the blood flow entering and exiting from the organ and the concentration of drug in the blood. The rate of presentation of the drug to the organ is the product of blood flow (Q) and entering drug concentration (Ci ), while the rate of exit of drug from the organ is the product of blood flow and exiting
drug concentration (Co ). The difference between these rates at steady state is the rate of drug elimination: Rate of elimination = Q × Ci − Q × Co .
(10)
Dividing Eq. (10) by the concentration Ci of the drug entering the organ yields

CLorgan = (Q × Ci − Q × Co)/Ci = Q × (Ci − Co)/Ci .   (11)
The expression (Ci − Co )/Ci is called the extraction ratio (ER) of the drug. Bioavailability is the fraction of unchanged drug reaching the systemic circulation after its administration by any route. For an intravenous dose of the drug, bioavailability is 1. For a drug administered orally, bioavailability may be less than 1 since the drug may be incompletely absorbed, or metabolized in the gut, the portal blood or the liver prior to entry into the systemic circulation. If a drug is metabolized in the liver or excreted in bile, some of the active drug absorbed from the gastrointestinal tract will be inactivated by hepatic processes before the drug can reach the general circulation and be distributed to its sites of action. If the metabolizing or biliary excreting capacity of the liver is great, the so-called “first-pass effect” on the extent of availability will be substantial. The systemic bioavailability (F ) of a drug that is completely absorbed and eliminated only by metabolism in the liver is given by F = 1 − ER ,
(12)
where ER = CLliver/Qliver is the hepatic extraction ratio. The AUC (area under the plasma or blood concentration-time curve) is a commonly used measure of the extent of absorption or availability of the drug absorbed in the body. It is usually calculated using the trapezoidal rule based on the blood or plasma concentrations obtained at various blood sampling times. Yeh and Kwan8 considered spline and Lagrange interpolation schemes in lieu of the linear interpolation implied by the trapezoidal rule and compared these methods. Let C0, C1, . . . , Ck be the plasma or blood concentrations obtained at times 0, t1, . . . , tk, respectively. The AUC from time 0 to tk, denoted by AUC0,tk, can be obtained via the trapezoidal rule as

AUC_{0,t_k} = \sum_{i=1}^{k} (t_i - t_{i-1})(C_i + C_{i-1})/2 .   (13)
Typically tk should be chosen so that Ck does not fall below the so-called "limit of quantitation" (LOQ) that will be defined in Sec. 4. In principle, the AUC should be calculated from 0 to ∞ (not just to the time of the last blood sample), and the portion of the remaining area from tk to ∞ can be large. An estimate of AUC (= AUC0,∞) is AUC = AUC0,tk + Ck/λ ,
(14)
where λ, called the elimination rate constant, is estimated from the elimination phase of the graph of log-concentration versus time by linear regression, assuming that this relationship is linear so that λ corresponds to the negative of the slope of the fitted regression line of the natural log-concentration on time (see Ref. 2, Chapter 3 and Appendix A). The United States Food and Drug Administration (FDA) regulations require that sampling be continued through at least 3 half-lives of the active drug ingredient, measured in blood or urine, so that the remaining area beyond time tk is only a small proportion of AUC0,tk. The AUC also provides a simple relationship between the volume of distribution and dose. The volume of distribution (V) is defined as V = Amount of drug in body/C ,
(15)
where C is the concentration of the drug in blood or plasma, depending on the fluid measured. It reflects the apparent space available in both the general circulation and the tissue of distribution. It does not represent a real volume but should be regarded as the size of the pool of blood fluids that would be required if the drug were distributed equally throughout all parts of the body. From mass balance and steady state considerations, V is related to clearance via CL = λV, where λ is the elimination rate constant in Eq. (14). Moreover, F × Dose = CL × AUC (= total amount eliminated), where F is the systemic bioavailability in Eq. (12).2 Hence, V = CL/λ = (F × Dose)/(λ × AUC) .
(16)
Besides CL, V , and AUC (measuring bioavailability), another PK variable, called the elimination half-life and denoted by t1/2 , has to be considered when designing drug dosage regimens. It is given by t1/2 = (ℓn2)/λ = 0.693 V /CL
(17)
and corresponds to the time taken for the concentration to drop to half of its initial level, assuming a one-compartment model for the drug's elimination phase in the body, as is usually done in designing drug dosage regimens. In view of Eq. (17), t1/2 can be estimated by (ln 2)/λ̂, where λ̂ is an estimate of the elimination rate constant described after Eq. (14). When F = 1, we can estimate CL by ĈL = Dose/AUC. Without assuming F to be 1, we have to replace Dose above by F̂ × Dose, where F̂ is an estimate of F. Once CL has been estimated, we can estimate V by ĈL/λ̂. To estimate F, we need additional data following an intravenous dose D∗, yielding AUC∗ and whose F∗ can be assumed to be 1. Then F can be estimated from the original (extravascular) dose D and AUC by

F̂ = min{ (AUC/D)/(AUC∗/D∗) , 1 } .
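The noncompartmental calculations in Eqs. (13)–(17) can be put together in a few lines of Python. In the sketch below the concentration-time data, the assignment of the last three sampling times to the terminal phase, and all other numbers are hypothetical; since the dose is extravascular and F is unknown, the sketch reports CL/F and V/F rather than CL and V.

import numpy as np

# Hypothetical single oral dose (mg) and plasma concentrations (mg/L) at times t (h).
dose = 100.0
t = np.array([0.0, 0.5, 1, 2, 3, 4, 6, 8, 12, 24], dtype=float)
c = np.array([0.0, 1.2, 2.1, 2.8, 2.6, 2.2, 1.5, 1.0, 0.45, 0.04])

# Eq. (13): trapezoidal rule from time 0 to the last sampling time.
auc_0_tk = np.sum(np.diff(t) * (c[1:] + c[:-1]) / 2.0)

# Elimination rate constant from the terminal phase (assumed: last three samples);
# lambda is the negative of the slope of ln(concentration) versus time.
slope, _ = np.polyfit(t[-3:], np.log(c[-3:]), 1)
lam = -slope

auc = auc_0_tk + c[-1] / lam            # Eq. (14): extrapolated tail area
cl_over_f = dose / auc                  # CL/F, since F is unknown for an oral dose
v_over_f = cl_over_f / lam              # Eq. (16) with F absorbed into the ratio
t_half = np.log(2) / lam                # Eq. (17)

print(f"AUC = {auc:.1f} mg*h/L, CL/F = {cl_over_f:.2f} L/h, "
      f"V/F = {v_over_f:.1f} L, t1/2 = {t_half:.1f} h")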
The above PK parameters are considered in a single dose trial. In practice, drugs are most commonly prescribed to be taken at fixed and equal time intervals, each of width τ. The maximum, minimum, and average concentration of the drug in steady state, denoted by Css,max, Css,min and Css,av, respectively, are considered in conjunction with the steady-state volume of distribution and AUC during a dosing interval in steady state. See Chapter 7 of Rowland and Tozer,2 which also shows how to develop a dosage regimen from knowledge of these PK parameters and the therapeutic window of a drug. Data obtained on multiple dosing can be used to estimate the PK parameters of a drug as follows. The most useful information derived from a multiple dosing study is the ratio of clearance to availability. It is obtained from

CL/F = (Dose/τ)/Css,av ,   (18)

where Css,av is determined from the area under the plasma concentration-time curve within a dosing interval at steady state divided by τ. Occasionally, the drug is given as a multiple intravenous regimen, in which case the ratio (Dose/τ)/Css,av is simply clearance, since F = 1. The accuracy of the clearance estimate depends on the number of plasma concentrations measured in the dosing interval and on the ratio τ/t1/2. The estimate can be improved by using several dosing intervals in steady state. Equation (18) is also useful for determining the relative availability of a drug administered extravascularly, between two treatments (e.g. dosage forms) A and B. Assuming that clearance remains unchanged, we have

Relative availability = [(Css,av)B/(Css,av)A] · [(Dose/τ)A/(Dose/τ)B] .
(19)
2.3. Parametric and population PK/PD models

The nonparametric estimates of PK parameters described above assume that the blood (or urine) samples are collected frequently through at least 3 half-lives of the active ingredient, so that the curve between successive times tk and tk+1 is well approximated by the line joining its values at these two points. When the experiment does not meet such conditions, the nonparametric estimates of AUC, CL and t1/2 become unreliable and there are no satisfactory ways to evaluate the bias and standard error of such estimates. In this case it is preferable to use a parametric approach, based on the commonly used one-compartment model

y_j = \frac{D k_a}{V(k_a - k_e)} \left( e^{-k_e t_j} - e^{-k_a t_j} \right) + \epsilon_j , \quad 1 \le j \le n ,   (20)
in which yj is the concentration at time tj after the administration of a single oral dose D. Here V, ka, ke are the volume of distribution, absorption rate constant and elimination rate constant, respectively. Note that model (20) has the form of a bi-exponential model α1 e^{−λ1 t} + α2 e^{−λ2 t} with α1 = −α2. Lai7 gives a review of the literature on fitting the poly-exponential regression model yj = β + Σ_{l=1}^{k} αl e^{−λl tj} + ǫj, in which the errors ǫj are assumed to be independent with zero means and (i) var(ǫj) = σ² (constant variance error models), or (ii) var(ǫj) = fθ²(tj)σ² (constant coefficient of variation error models), or (iii) var(ǫj) = fθ(tj)σ² (Poisson-type error models), where θ = (λ1, . . . , λk; α1, . . . , αk, β) and fθ(t) = β + Σ_{l=1}^{k} αl e^{−λl t}. We can estimate θ by weighted least squares, i.e. by minimizing

S(\theta) = \sum_{j=1}^{n} w_j [y_j - f_\theta(t_j)]^2 .   (21)

For fixed λ1, . . . , λk, fθ(t) is linear in the parameters β, α1, . . . , αk and standard formulas in multiple linear regression can be used to find least squares estimates of the linear parameters β, α1, . . . , αk. This reduces the problem of minimizing S(θ) to that of minimizing

S^*(\lambda_1, \ldots, \lambda_k) = \min_{\beta, \alpha_1, \ldots, \alpha_k} S(\theta) .
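The reduction of S(θ) to S∗(λ1, . . . , λk) suggests a simple "separable" implementation: for each trial value of the exponents the linear coefficients are obtained by linear least squares, and only the exponents are searched over numerically. The Python sketch below illustrates this for a bi-exponential model with constant-variance errors and simulated data (the constant term β is omitted for brevity); it is a minimal sketch of the idea, not the algorithm studied in Ref. 7.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
t = np.linspace(0.25, 24.0, 20)
y = 5.0 * np.exp(-1.2 * t) + 2.0 * np.exp(-0.15 * t) + rng.normal(0, 0.05, t.size)

def s_star(log_lams):
    # S*(lambda): profile out the linear coefficients by ordinary least squares.
    lams = np.exp(log_lams)                       # keep the exponents positive
    X = np.exp(-np.outer(t, lams))                # columns exp(-lambda_l * t)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ coef) ** 2)

res = minimize(s_star, x0=np.log([1.0, 0.1]), method="Nelder-Mead")
lams = np.exp(res.x)
alphas, *_ = np.linalg.lstsq(np.exp(-np.outer(t, lams)), y, rcond=None)
print("estimated exponents:", lams, "estimated coefficients:", alphas)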
In the case of the Poisson-type or constant coefficient of variation error model, the weights wj also involve the unknown parameter θ and can be determined at each iteration from the previous iterate. It is shown in Lai7 that S ∗ not only provides a relatively stable numerical algorithm for finding
the least squares estimates but also sheds light on the range of models that are compatible with the data. Depending on the experimental design, S∗ can be very flat over a broad region containing the minimum or can decrease steeply to the minimum. It is also shown in Lai7 that although the parameter vector θ may be poorly estimated because S∗ is relatively flat, the function fθ(·) is typically well estimated by weighted least squares. Therefore derived parameters like AUC can still be well estimated from the estimated fθ̂ even though θ̂ does not estimate θ well because of the experimental design. Parametric modeling also facilitates the evaluation of standard errors and construction of confidence intervals. For the Emax model (2), which can be rewritten as E/C = aE + b with a = −1/c50 and b = emax/c50, Scatchard9 proposed to estimate a and b by linear regression of the observed E/C on E. This simple method is usually adequate for point estimation because of the large signal-to-noise ratio in the measurements. It is, however, unsatisfactory for constructing confidence intervals of the unknown parameters, as has been noted in the ligand-binding literature related to the mathematically equivalent model (3). Lai and Zhang10 give a review of the literature and propose a new approach using nonlinear least squares and bootstrap methods to construct confidence regions for the parameters. The numerical studies reported in Lai and Zhang10 show that these confidence regions are markedly different from the elliptical confidence regions based on asymptotic normal approximations. So far we have considered estimation of the PK/PD parameters of a subject from the data in a study on the subject. In many PK/PD studies, however, data are collected from a number of subjects, some of whom may have intensive blood sampling while others only have sparse data. A primary objective of these studies is to study the PK/PD characteristics of the entire population, such as how they vary with certain covariates. This requires embedding the individual parametric PK/PD models in a population model. For example, the yj in model (20) are now replaced by yij, where i denotes the subject number. Since the dose, volume of distribution, absorption and elimination rate constants may vary from subject to subject, we also have to replace D, V, ka, ke, n by Di, Vi, kai, kei and ni in model (20). Let θi be the vector consisting of the logarithms of the PK parameters Vi, kai, kei. The unknown θi may vary with certain covariates, such as the subject's age and body weight. How can the individual subjects' data be used to analyze such relationships for the target population, of which the subjects can be regarded as a sample? We shall show that
nonlinear mixed effects modeling provides a valuable tool to address this problem. Returning to the PD model (2), the variable C refers to concentration at an effector (tissue) site. It is usually impossible to measure C directly, so some surrogate for C has to be used, as in model (5). On the other hand, if one has a kinetic model for C, then it can be used to impute the value of C from the blood/urine measurements. Chapter 9 of Davidian and Giltinan11 illustrates how population PK/PD models can be synthesized for such tasks.

2.4. Nonlinear mixed effects models

The preceding population PK/PD models are special cases of nonlinear mixed effects models (NONMEM) of the form

yij = fi(tij, θi) + εij ,   θi = g(xi, β) + bi   (1 ≤ j ≤ ni, 1 ≤ i ≤ K) ,   (22)
in which θi is a 1 × r vector of the ith subject’s parameters whose regression function on the subject’s observed covariate xi is given by g(xi , β) with 1 × s parameter vector β, which is the “fixed effect” to be estimated. The “random effects” bi in model (22) are assumed to be independent and identically distributed, having common distribution G with mean 0. The ith subject’s response yij at tij has mean fi (tij , θi ), in which fi is a known function. Given θi , the random errors εij are assumed to be normal with mean 0 and standard deviation σw(θi ), in which w is a given function and σ is an unknown parameter. The regression function g relates θi to the ith subject’s physiologic characteristics that constitute the covariate vector xi in model (22). The first equation of (22) is often called the individual measurement model and the second equation the population structure model. The population distribution G is usually assumed to be normal with mean 0 and covariance matrix Σ so that β, σ, Σ can be estimated by maximum likelihood. However, unlike linear mixed effects models in which the normal assumption on G yields closed-form expressions of the likelihood, the normality of G in nonlinear mixed effects models leads to computationally intensive likelihoods that involve K integrals. A commonly used approach, as adopted in the software package NONMEM12 or the nlme procedure in S-Plus, is to develop iterative schemes based on first-order approximations of fi (tij , g(xi , β) + bi ) in model (22) so that the normal assumption on G can be used to reduce the problem to that of a linear Gaussian mixed effects model at each iterative step.
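It may help to see model (22) written out for the one-compartment model (20). The Python sketch below simulates such a population data set: each subject's log-parameters are a linear function of a covariate (body weight) plus a normal random effect, and sparse concentration measurements are generated with additive error. The covariate model, variances and sampling scheme are all hypothetical and serve only to make the individual measurement model and the population structure model explicit.

import numpy as np

rng = np.random.default_rng(1)
K, dose = 12, 100.0                         # number of subjects and common oral dose

def conc(t, V, ka, ke):
    # Individual measurement model: the mean of model (20).
    return dose * ka / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

beta = np.array([[3.0, 0.01],               # log V  = 3.0 + 0.01 * weight + b_1
                 [0.2, 0.00],               # log ka = 0.2                + b_2
                 [-1.5, 0.00]])             # log ke = -1.5               + b_3
Sigma = np.diag([0.05, 0.10, 0.05])         # covariance of the random effects b_i

data = []
for i in range(K):
    weight = rng.uniform(50, 90)            # subject covariate x_i
    x = np.array([1.0, weight])
    b = rng.multivariate_normal(np.zeros(3), Sigma)
    V, ka, ke = np.exp(beta @ x + b)        # population structure model: theta_i = g(x_i, beta) + b_i
    t = np.sort(rng.choice(np.arange(1.0, 25.0), size=4, replace=False))
    y = conc(t, V, ka, ke) + rng.normal(0, 0.05, t.size)   # additive measurement error
    data.append({"weight": weight, "times": t, "conc": y})

print(data[0])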
Unless otherwise stated, we shall assume throughout the sequel that the random errors εij in model (22) have common variance σ² (so w(θ) ≡ 1). The likelihood function L(β, σ, Σ) is proportional to

|\Sigma|^{-K/2} \prod_{i=1}^{K} \int_{R^r} \sigma^{-n_i} \exp\Bigg\{ -\frac{1}{2\sigma^2} \sum_{j=1}^{n_i} [y_{ij} - f_i(t_{ij}, g(x_i, \beta) + b_i)]^2 - \frac{1}{2} b_i \Sigma^{-1} b_i^T \Bigg\} \, db_i ,   (23)
where |Σ| denotes the determinant of Σ. For the case of more general w(θi), simply replace σ in model (23) by σw(g(xi, β) + bi). Computing the maximum likelihood estimate of (β, σ, Σ) via numerical integration and nonlinear optimization becomes difficult for large K. Letting η = (σ, Σ), Lindstrom and Bates13 proposed the following iterative procedure that involves successive linear approximations to fi(tij, g(xi, β) + b). At the mth iteration, the Lindstrom–Bates procedure consists of a pseudo-data step and a linear mixed effects (LME) step.

(a) The pseudo-data step: Given the current estimate η̂(m) of η, compute β̂(m) = β̂(η̂(m)) and b̂i(m) = b̂i(η̂(m)), 1 ≤ i ≤ K, that jointly minimize

\sum_{i=1}^{K} \Big\{ (\hat\sigma^{(m)})^{-2} S_i(b_i, \beta) + b_i (\hat\Sigma^{(m)})^{-1} b_i^T / 2 \Big\} , \quad \text{where} \quad S_i(b, \beta) = \sum_{j=1}^{n_i} [y_{ij} - f_i(t_{ij}, g(x_i, \beta) + b)]^2 \Big/ 2 .   (24)
This can be carried out by modifying a standard nonlinear least squares routine; see Sec. 6.1 of Lindstrom and Bates.13 Define the s × ni, r × ni and 1 × ni matrices

X_i^{(m)} = \left( \frac{\partial f_i}{\partial \beta}\big(t_{ij}, g(x_i, \beta) + \hat b_i^{(m)}\big)\Big|_{\beta=\hat\beta^{(m)}} \right)_{1 \le j \le n_i} ,

Z_i^{(m)} = \left( \frac{\partial f_i}{\partial b_i}\big(t_{ij}, g(x_i, \hat\beta^{(m)}) + b_i\big)\Big|_{b_i=\hat b_i^{(m)}} \right)_{1 \le j \le n_i} ,

Y_i^{(m)} = \big( y_{ij} - f_i(t_{ij}, g(x_i, \hat\beta^{(m)}) + \hat b_i^{(m)}) \big)_{1 \le j \le n_i} + \hat\beta^{(m)} X_i^{(m)} + \hat b_i^{(m)} Z_i^{(m)} .
(b) The LME step: Linear approximation to fi(tij, g(xi, β) + bi) around (β̂(m), b̂i(m)) leads to the linear mixed effects model

Y_i^{(m)} = \beta X_i^{(m)} + b_i Z_i^{(m)} + (\varepsilon_{i1}, \ldots, \varepsilon_{i n_i}) .   (25)
The integrals in expression (23) for the likelihood function of the linear mixed effects model (25), instead of model (22), have closed-form expressions, yielding maximum likelihood estimates of the form

\hat\beta = \left( \sum_{i=1}^{K} X_i^{(m)} V_{i,m}^{-1} X_i^{(m)T} \right)^{-1} \left( \sum_{i=1}^{K} X_i^{(m)} V_{i,m}^{-1} Y_i^{(m)T} \right) ,   (26)

where V_{i,m} = Z_i^{(m)T} \hat\Sigma Z_i^{(m)} + \hat\sigma^2 I_{n_i} and η̂ = (σ̂, Σ̂) is computed via the Newton–Raphson algorithm to maximize the likelihood; see Sec. 6.2 of Lindstrom and Bates13 where a restricted maximum likelihood (REML) variant of the procedure is also given.

Several alternatives to the linearization approach have been proposed in the literature. One is Monte Carlo integration, whose accuracy and computational complexity depend critically on how and how many samples are drawn. Importance sampling and periodic updating of the importance weights during iterative maximization of the likelihood have been proposed.14–16 Another alternative, proposed by Pinheiro and Bates,17 is to use an adaptive version of Gaussian quadrature based on ideas similar to importance sampling in Monte Carlo integration. A third approach is to use MCEM (Monte Carlo EM) in which the E-step of the usual EM algorithm is replaced by an empirical estimate based on a random sample generated from the conditional distribution.15 Instead of applying Monte Carlo methods to compute the integrals in the likelihood function to be maximized in the maximum likelihood approach, it seems more direct to apply Markov Chain Monte Carlo (MCMC) to evaluate the posterior distribution of (β, σ, Σ) when a prior distribution on these parameters is assumed. MCMC enables one to generate a sequence of random samples whose limiting distribution is the target distribution (in this case the posterior distribution of (β, σ, Σ)) and thereby avoids the calculation of normalizing constants and the numerical integration associated with any probability statements of interest. The most popular MCMC method used in the mixed effects model framework is the Gibbs sampler. This is because the (hierarchical) Bayes model allows a natural grouping of the vector of all unknown or unobserved parameters into
subvectors β, σ, Σ and (θi, i = 1, . . . , n), where drawing samples for each component is much easier than drawing samples for the whole vector. Successful usage of the Gibbs sampler for NONMEM in population PK studies has been reported in Refs. 11, 18–22. The relative efficiencies of different MCMC procedures have been investigated by Bennett et al.23 and Shih.24 In addition to considerations in choosing transition functions, there are other practical issues one has to deal with when implementing MCMC, such as the number of chains to run, the length of burn-in sequences, and how to monitor convergence. There are no general answers to these questions and they often need to be addressed empirically by numerical experiments; see Chapter 26. The normality assumption on the population distribution G has been weakened by Davidian and Gallant,25 who assume that G has a density function of the form of a product of a multivariate normal N(0, Σ) density function and the square of a polynomial of degree p, which was introduced in another context and called the "smooth nonparametric" (SNP) model by Gallant and Nychka.26 The coefficients of the polynomial and the components of the matrix Σ can be estimated by maximum likelihood, while the degree p of the polynomial can be chosen via standard model selection criteria like BIC, AIC or the Hannan-Quinn criterion. Magder and Zeger27 proposed an alternative method that uses mixtures of normals, while Fattinger et al.28 modeled each component of bi as a data-dependent monotone spline transformation of the corresponding component of a multivariate normal vector. All these methods require considerably more intensive computation to maximize the likelihood function than the case of normal G assumed before. Since the normality assumption on G only provides numerically tractable maximum likelihood estimates after various approximations and since attempts to relax that assumption have led to even more computationally intensive procedures, a natural alternative is to try estimating G nonparametrically (by a distribution with finite support, with the number of support points depending on the sample size). However, even for the simple case ni ≡ n and fi(tij, θi) = θi with known β and σ, it is difficult to estimate G well since the optimal rate of convergence of the estimate to G is very slow when G has a smooth density function, as pointed out by Carroll and Hall29 and Fan.30 When G has fixed support, Chen31 showed that the optimal convergence rate is K^{−1/2} if the number of support points is known but decreases to K^{−1/4} otherwise as K → ∞. Lindsay32 showed
that the nonparametric maximum likelihood estimate Ĝ of G is unique and discrete, with no more than K support points, and Mallet33 made use of this and other properties of Ĝ to develop an algorithm to compute Ĝ. The situation becomes considerably worse when σ and β are unknown and fi(tij, θi) is nonlinear in θi, for which little is known about the performance of nonparametric estimates and it is also difficult to compute Ĝ. One way to ensure that β and σ can be well estimated is to require the dataset to contain a subset from subjects whose θi can be well estimated. This idea was introduced in the work of Ibragimov and Has'minskii34 who consider estimation of (α, G) from independent random variables y1, . . . , yJ such that the conditional density function of yi given θi has the parametric form fα(·|θi), in the presence of another "direct" sample θ1, . . . , θI from G. Let K = I + J. They show that under certain regularity conditions, a variant of the nonparametric maximum likelihood estimate that is initialized at a √n-consistent estimate of (α, G) is asymptotically efficient. Their model of the data {θ1, . . . , θI; y1, . . . , yJ} is commonly called the Ibragimov-Has'minskii (IH) model. We shall relax the model assumptions and extend them to our setting, providing what will be called an "Ibragimov-Has'minskii (IH) environment." In an IH environment, there are I (≤ K) subjects whose θi can be well estimated by the nonlinear least squares estimate θ̃i based on (yij, tij), 1 ≤ j ≤ ni. Without loss of generality we can assume that these are the first I subjects. We can determine from the data the standard error of each component of θ̃i using the asymptotic formulas in nonlinear regression.35 The ith study is deemed "good" if all components of θ̃i have reasonably small standard errors relative to their absolute values. A consistent estimate of σ² is given by

\tilde\sigma^2 = \sum_{i=1}^{I} \sum_{j=1}^{n_i} (w(\tilde\theta_i))^{-2} (y_{ij} - f_i(t_{ij}, \tilde\theta_i))^2 \Bigg/ \sum_{i=1}^{I} (n_i - r) .   (27)
Such IH environments arise in most population PK studies, which use combined data from several Phases I, II and III trials. The subjects in Phase I trials are usually healthy volunteers or patients with the intent-to-treat disease, from whom intensive blood sampling is conducted, and thus provide natural candidates for good studies. Lai and Shih36 developed the following iterative scheme to compute the MLE of (β, σ, G) in an IH environment. First note that in the case w(θ) ≡ 1 the likelihood function is proportional to
L(\beta, \sigma, G) = \prod_{i=1}^{K} \sigma^{-n_i} \sum_{m=1}^{M} \alpha_m \exp\Bigg\{ -\frac{1}{2\sigma^2} \sum_{j=1}^{n_i} [y_{ij} - f_i(t_{ij}, g(x_i, \beta) + \zeta_m)]^2 \Bigg\}   (28)
when G has a finite support {ζ1, . . . , ζM} and puts mass αm at ζm. For the case of more general w(θi), simply replace σ in model (28) by σw(g(xi, β) + ζm). The initial estimate (β̂(0), σ̂(0), Ĝ(0)) is obtained as follows: Let σ̂(0) = σ̃ and let β̂(0) be the least squares estimate which minimizes Σ_{i=1}^{I} (θ̃i − g(xi, β))^T (θ̃i − g(xi, β)). Let b̃i = θ̃i − g(xi, β̂(0)), 1 ≤ i ≤ I, denote the residuals, and let b̂i = b̃i − (Σ_{j=1}^{I} b̃j)/I be the centered residuals. Let Ĝ(0) be the distribution putting weight 1/I at each centered residual. (β̂, σ̂, Ĝ) is computed via an iterative procedure in which the following two steps are used to compute (β̂(k), σ̂(k), Ĝ(k)) from (β̂(k−1), σ̂(k−1), Ĝ(k−1)); see Ref. 36 where a termination criterion and numerical examples are given.

Step 1. Suppose Ĝ(k−1) puts mass αj at ζj (j = 1, . . . , Mk−1). Find the maximizer (β̂(k), σ̂(k)) of L(β, σ, Ĝ(k−1)).

Step 2. Use Mallet's algorithm33 to maximize L(β̂(k), σ̂(k), G) over the set of distributions G with no more than K support points.

2.5. Empirical Bayes methods for individualization and diagnostics

We now consider the prediction problem of estimating a function h(θ) of the unobservable parameter θ for a new subject with covariate x and from whom some data have been collected. For example, in population PK studies, it is believed that efficacy and toxicity of a drug are directly related to the drug concentrations at the target site, which are generally not available but for which blood concentrations are often good surrogates; therefore the criteria for designing the dosing regimen for a specific subject often involve functions of individual concentrations, or equivalently, functions of the individual parameter θ. The subject's data are often too sparse to provide an adequate estimate θ̂ of θ so that h(θ̂) can be used to estimate h(θ). If β, σ and G are known, then a natural estimate of h(θ) in the mixed effects model is the posterior mean Eβ,σ²,G[h(θ) | subject's data]. Without assuming β, σ² and G to be known, the empirical Bayes approach in Ref. 36
replaces them by their estimates β̂, σ̂², Ĝ from the K studies so that h(θ) is estimated by

\widehat{h(\theta)} = E_{\hat\beta, \hat\sigma^2, \hat G}[h(\theta) \mid \text{subject's data}] .   (29)
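When Ĝ is a discrete distribution of the kind produced in Sec. 2.4 (support points ζm with weights αm), the posterior mean in Eq. (29) is a finite weighted average and is straightforward to compute. The Python sketch below carries out this computation for a new subject under the one-compartment model (20), with h(θ) taken to be the subject's clearance; the fitted quantities β̂, σ̂, the support points and the subject's sparse data are all hypothetical stand-ins invented for the illustration.

import numpy as np

dose = 100.0

def conc(t, theta):
    # One-compartment model (20); theta = (log V, log ka, log ke).
    V, ka, ke = np.exp(theta)
    return dose * ka / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

def clearance(theta):
    # h(theta): clearance CL = ke * V for the one-compartment model.
    V, _, ke = np.exp(theta)
    return ke * V

# Hypothetical fitted population quantities (stand-ins for beta-hat, sigma-hat, G-hat).
beta_hat = np.array([[3.0, 0.01], [0.2, 0.0], [-1.5, 0.0]])
sigma_hat = 0.05
zetas = np.array([[-0.2, -0.1, 0.1], [0.0, 0.0, 0.0], [0.3, 0.1, -0.1]])  # support points of G-hat
alphas = np.array([0.3, 0.5, 0.2])                                        # their weights

# New subject: covariate and sparse concentration data (hypothetical).
x = np.array([1.0, 72.0])
t_obs = np.array([2.0, 8.0, 24.0])
y_obs = np.array([1.62, 0.48, 0.020])

thetas = beta_hat @ x + zetas          # candidate theta = g(x, beta-hat) + zeta_m (one per row)
rss = np.array([np.sum((y_obs - conc(t_obs, th)) ** 2) for th in thetas])
log_w = np.log(alphas) - rss / (2.0 * sigma_hat ** 2)
w = np.exp(log_w - log_w.max())
w /= w.sum()                           # posterior probabilities of the support points
cl_hat = np.sum(w * np.array([clearance(th) for th in thetas]))
print(f"Empirical Bayes estimate of the subject's clearance: {cl_hat:.2f} L/h")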
This idea of borrowing information from other subjects is in fact one of the main motivations for building population structure models. In particular, because of ethical and practical reasons, intensive blood sampling is often not feasible for clinical patients, for whom this individualization of dosing regimen can be obtained by combining the patient's sparse data and characteristics (as measured by x) with the large database for the population model. See also Berzuini38 for an example of medical monitoring. Empirical Bayes ideas can also be used to derive diagnostics for the regression model (22). If the individual parameters θi were observed, the residuals ri = θi − g(xi, β̂) would provide approximations for the unobservable i.i.d. random variables bi. Therefore substantial deviation of these residuals from i.i.d. patterns would suggest inadequacies and possible improvements of the assumed regression model. Since the θi are not observed, we propose to replace them by the empirical Bayes estimate E_{β̂,σ̂²,Ĝ}(θi | yi1, . . . , yini, ti, xi), leading to the following generalized residuals in the sense of Cox and Snell39:

r̂i = E_{(β̂, σ̂², Ĝ)}(θi | yi1, . . . , yini, ti, xi) − g(xi, β̂) ,   i = 1, . . . , K .   (30)
The r̂i can be interpreted as estimates of the independent zero-mean random variables ri = E_{(β, σ², G)}(θi | yi1, . . . , yini, ti, xi) − g(xi, β). Instead of using the posterior mean in Eq. (30), it is popular in population PK studies to use the posterior mode

ŝi = arg max_{bi} p_{(β̂, σ̂², Ĝ)}(bi | yi1, . . . , yini, ti, xi)   (31)

to form the residuals ŝi − g(xi, β̂), where p_{(β, σ², G)} denotes the posterior density in the Bayesian model with given β, σ² and G. This was first suggested by Maitre et al.40 in connection with linearization methods under the assumption of normality for the population distribution, but is also used as a general strategy in the semiparametric models of Davidian and Gallant25 and the hierarchical Bayesian models of Wakefield and Racine-Poon.44 For linear Gaussian mixed effects models, the mean and the mode of the conditional distribution of θi given yi1, . . . , yini coincide since the conditional distribution is Gaussian, so the theoretical justification for
the r̂i via an empirical Bayes point of view applies also to the ŝi. In the case of nonlinear mixed effects models, the posterior mean and mode no longer coincide, and r̂i is usually easier to compute and more robust. In the above empirical Bayes approach, we have replaced (β, σ², G) in the posterior mean Eβ,σ²,G[h(θ) | subject's data] by an estimate (β̂, σ̂², Ĝ). This estimate (β̂, σ̂², Ĝ) can be either parametric, as in Lindstrom and Bates,13 or nonparametric, as given above. We next list some examples of using empirical Bayes/hierarchical Bayes/posterior mode estimates in NONMEM to quantify covariate effects on PK parameters in the literature:

(a) Population PK analysis of felbamate in epileptic patients42: Apparent clearance of felbamate was found to decrease with age for children (age ≤ 12) and to stay relatively constant beyond 13 years of age. There were 1–17 blood samples per subject. This study, undertaken by Zhu and his collaborators at Schering-Plough Research Institute and Wallace Laboratories, led to the FDA approval of the labeling of felbamate for its prescription to children.

(b) Population PK analysis of quinidine in hospitalized patients treated for atrial fibrillation or ventricular arrhythmias19,25,43: The effects of dichotomized creatinine clearance, body weight and α1-acid glycoprotein concentration on clearance were analyzed from a study consisting of 1–11 blood samples per subject.

(c) Population PK analysis of phenobarbital in neonates11,25,44: The effects of birth weight and 5-minute Apgar score on clearance and volume were analyzed from a study with sparse PK data in each subject (having only 1–6 concentration measurements).

Model validation methodology for population PK analysis is still in its infancy. One commonly used approach is to use m-fold cross-validation or bootstrap to estimate the prediction errors based on a fitted model. Here the prediction error may be associated with prediction of concentrations or prediction of PK parameters (that can be estimated nonparametrically only from subjects with intensive measurements). Given the computational complexity associated with fitting nonlinear mixed effects models, m-fold cross-validation (with m ≤ 20) appears to be more feasible than the bootstrap (for which the FDA recommends using at least 200 bootstrap samples).
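A subject-wise m-fold cross-validation of the kind just described can be sketched as follows. For brevity the "fitted model" used here is a naive pooled nonlinear least squares fit of the one-compartment model, standing in for the far more elaborate nonlinear mixed effects fit, and the data are simulated; the point of the sketch is only the subject-level splitting and the accumulation of concentration prediction errors.

import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(2)
dose, K, m = 100.0, 12, 4                   # dose, number of subjects, number of folds

def conc(t, V, ka, ke):
    return dose * ka / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

# Simulate sparse data for K subjects (hypothetical population values).
subjects = []
for _ in range(K):
    V, ka, ke = np.exp(np.array([3.6, 0.2, -1.5]) + rng.normal(0, 0.2, 3))
    t = np.sort(rng.choice(np.arange(1.0, 25.0), size=5, replace=False))
    y = conc(t, V, ka, ke) + rng.normal(0, 0.05, t.size)
    subjects.append((t, y))

folds = np.array_split(rng.permutation(K), m)
sq_err, n_pred = 0.0, 0
for fold in folds:
    train = [subjects[i] for i in range(K) if i not in fold]
    t_tr = np.concatenate([t for t, _ in train])
    y_tr = np.concatenate([y for _, y in train])
    # Naive pooled fit on the training subjects (placeholder for the population model fit).
    popt, _ = curve_fit(conc, t_tr, y_tr, p0=[40.0, 1.2, 0.2],
                        bounds=([1.0, 0.5, 0.01], [500.0, 5.0, 0.45]))
    for i in fold:
        t_te, y_te = subjects[i]
        sq_err += np.sum((y_te - conc(t_te, *popt)) ** 2)
        n_pred += y_te.size
print("cross-validated RMSE of predicted concentrations:", np.sqrt(sq_err / n_pred))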
2.6. The Lindstrom–Bates algorithm and related statistical methods

Vonesh45 proposed an alternative to the Lindstrom–Bates algorithm (consisting of the pseudo-data and LME steps described in Sec. 2.4) by applying, for fixed β and η, Laplace's asymptotic formula

\int_{R^r} e^{l(b)} \, db \sim (2\pi)^{r/2} \, |{-\ddot l(\hat b)}|^{-1/2} \, e^{l(\hat b)}   (32)

to each integral in expression (23), where l̈ denotes the Hessian matrix of second partial derivatives of l and b̂ maximizes l(b). Earlier, Wolfinger46 derived the pseudo-data step of the Lindstrom–Bates algorithm by applying for fixed η Laplace's asymptotic formula to the multiple integral

\int \cdots \int \exp\Bigg\{ \sum_{i=1}^{K} l_i(b_i; \beta) \Bigg\} \, d\beta \, db_1 \, db_2 \cdots db_K ,   (33)
and then used a Gauss–Newton approximation of −l̈ to derive the REML version of the LME step. Laplace's asymptotic formula has also been used by Breslow and Clayton47 and Lee and Nelder48 to derive their estimators in generalized linear mixed models (GLMM) and hierarchical generalized linear models (HGLM), respectively. The HGLM involves independent random vectors (yi, x_i^T, z_i^T) such that the conditional density function of yi given a 1 × K vector of random effects b has the GLM (generalized linear model) form

f(y \mid b, z_i, x_i) = c(y, \phi) \exp\{(\theta_i y - \psi(\theta_i))/a(\phi)\} ,   (34)

in which φ is a dispersion parameter, θi is the canonical parameter such that E(y|b, zi, xi) = g(βxi + bzi) and g is the inverse of a monotone link function. Letting fα be the density function of b with unknown parameter α, Lee and Nelder48 define the hierarchical likelihood (h-likelihood) by

h(b, \beta, \phi, \alpha) = \log f_\alpha(b) + \sum_{i=1}^{n} \log f(y_i \mid b, z_i, x_i) .   (35)
They propose to estimate β, φ, α by an iterative procedure whose mth iteration consists of the following two steps:

(i) Given the current estimate (φ̂(m), α̂(m)) of (φ, α), compute the maximizer (b̂(m), β̂(m)) of h(b, β, φ̂(m), α̂(m)) by solving the score equations ∂h/∂β = 0 and ∂h/∂b = 0.

(ii) Given the current estimate (b̂(m), β̂(m)) of (b, β), maximize the adjusted profile h-likelihood hA(φ, α) = h(b̂(m), β̂(m), φ, α) + (log |2πφH^{−1}|)/2 with

H = \begin{pmatrix} \partial^2 h/\partial\beta^2 & \partial^2 h/\partial\beta\,\partial b \\ \partial^2 h/\partial b\,\partial\beta & \partial^2 h/\partial b^2 \end{pmatrix}_{(b,\beta)=(\hat b^{(m)}, \hat\beta^{(m)})} ,

by solving the score equations ∂hA/∂φ = 0 and ∂hA/∂α = 0.
For the special case of normal fα with mean 0 and covariance matrix Σ(α), the HGLM reduces to the GLMM considered by Breslow and Clayton47 who make use of the normality assumption to come up with an explicit expression for Laplace's approximation to the likelihood function ∫ e^{h(b,β,φ,α)} db, yielding an algorithm similar to that of Lindstrom and Bates for NONMEM. The Lee-Nelder procedure above is somewhat different and is motivated by generalizing Henderson's49 joint likelihood for linear models with normal random effects. It can be derived by applying Laplace's approximation to ∫∫ e^{h(b,β,φ,α)} db dβ, analogous to integral (33). Let β0 and σ0 denote the true values of β and σ. A sufficient condition for the validity of Laplace's asymptotic formula (32) is that l(b) = Nλ(b), where N → ∞ and λ is a fixed smooth function with a unique maximum. The integral for the ith subject in model (23) has the form

\int_{R^r} \exp\{l_i(b \mid \beta, \sigma, \Sigma)\} \, db ,  where  l_i(b \mid \beta, \sigma, \Sigma) = -S_i(b, \beta)/\sigma^2 - b\Sigma^{-1}b^T/2 - n_i \log \sigma ,   (36)
in which Si is computed via Eq. (24) from ni observations (yij , tij ), 1 ≤ j ≤ ni . If these observations are sufficiently informative about the ith subject’s parameter vector θi = g(xi , β0 ) + bi , then for (β, σ) near (β0 , σ0 ), Si (b, β) becomes peaked around bi and can be approximated by a quadratic function in a neighborhood of the maximizer ˆbi = ˆbi (β, σ, Σ) of li (b|β, σ, Σ). Laplace’s asymptotic formula basically replaces li in integral (36) by the approximating quadratic function of b as λmin (−¨li (ˆbi |β, σ, Σ)) → ∞, where λmin (·) denotes the minimum eigenvalue of a symmetric matrix. When the ith subject has sparse data (yij , tij ), Si (b, β) is no longer peaked around ˆbi and Laplace’s asymptotic formula may be a poor approximation to integral (36). A better way to compute integral (36) in this case is to use Monte Carlo, expressing integral (36) as the expectation EΣ {exp(−Si (b, β)/σ 2 )} ,
(37)
where EΣ denotes expectation under the probability measure for which b is a normal random vector with mean 0 and covariance matrix Σ. Lai and Shih37 proposed the following hybrid method for evaluating integral (36). Take c > 10 and let Vi = −l̈(b̂i | β, σ, Σ).

(i) If λmin(Vi) < c, evaluate integral (36) by the Monte Carlo approximation to expression (37):

B^{-1} \sum_{j=1}^{B} \exp\{-S_i(b_{ij}, \beta)/\sigma^2\} ,

where bij, j = 1, . . . , B, are independent samples from the N(0, Σ) distribution.

(ii) If λmin(Vi) ≥ c, evaluate integral (36) by its Laplace approximation

(2\pi)^{r/2} |V_i|^{-1/2} \exp\{l_i(\hat b_i \mid \beta, \sigma, \Sigma)\} .

By performing simple diagnostics on the appropriateness of using Laplace's asymptotic formula to evaluate the integral in expression (23) for the ith subject, the hybrid approach preserves the computational simplicity of Laplace's method when it can be used and switches to the Monte Carlo method when Laplace's method fails. In practice, the actual population distribution G of the random effects bi may differ substantially from the assumed normal distribution with unknown covariance matrix, which at best can only be regarded as an approximation to G. If the ith subject has only sparse data so that Si(b, β) is relatively flat in b, then applying the Monte Carlo approach to the subject is tantamount to choosing a certain random distribution Gi, which is the empirical distribution of a sample of size B from a normal distribution, to approximate G. Since the assumed normal distribution is itself also an approximation to G, there is no need for a "high resolution" in the random distribution used to approximate the normal distribution, so using 50 ≤ B ≤ 200 samples in the Monte Carlo method should be able to provide enough statistical detail so that the resultant estimator of (β, σ, Σ) still has a low computational cost comparable to that of the Lindstrom–Bates estimator. On the other hand, if the ith subject has enough data so that Si(b, β) is peaked around b̂i for β near β0, the Monte Carlo approach becomes unreliable unless B is sufficiently large and importance sampling is needed to generate the B samples from a distribution that is peaked around b̂i, so Laplace's method gives a much better approximation to integral (36) in this case. Thus the Monte Carlo and Laplace's methods complement each other in the hybrid approach, which uses either
N(0, Σ) or the empirical distribution of a sample of size B from N(0, Σ) as the approximation Gi(·|Σ) to the unknown (and possibly non-normal) mixing distribution G. Using this hybrid approach to compute expression (23) approximately, Lai and Shih37 make use of numerical differentiation and iterative optimization schemes such as conjugate gradient and quasi-Newton methods50 to maximize this approximation to expression (23), providing the estimator (β̂, σ̂, Σ̂) of (β, σ, Σ). Good starting values in this iterative scheme to compute (β̂, σ̂, Σ̂) can be obtained by running several steps of the Lindstrom–Bates nlme procedure.

Lai and Shih37 also develop an asymptotic theory of the hybrid estimator (β̂, σ̂) as the number K of subjects becomes infinite. This theory does not require all subjects to have sufficient data to estimate their θi consistently, nor does it require the actual G to be normal. Under the assumption that a sufficiently large subset of the subjects have good studies in the sense that their λmin(Vi) exceeds the threshold c for applicability of Laplace's approximation to evaluate integral (36), and some additional regularity conditions, (β̂, σ̂) is shown to converge with probability 1 to (β0, σ0) as K → ∞. Let n = n1 + · · · + nK. It is also shown in Lai and Shih37 that √n(β̂n − β0, σ̂n − σ0) has a limiting normal distribution as K → ∞ under these and some other conditions. Moreover, this hybrid estimator and its asymptotic theory have been extended in Lai and Shih37 to the HGLM of Lee and Nelder48 and the GLMM of Breslow and Clayton.47
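The hybrid rule can be illustrated for a single subject as in the Python sketch below, which uses the one-compartment model of Sec. 2.3 with hypothetical data and hypothetical values of (β, σ, Σ): it locates b̂i, forms Vi = −l̈i(b̂i) by a central-difference approximation, and evaluates the subject's integral by the Laplace formula when λmin(Vi) ≥ c and by the Monte Carlo average (37) otherwise (rescaled by the N(0, Σ) normalizing constant and σ^{-ni} so that both branches target integral (36)). It is only a sketch of the idea, not the implementation of Lai and Shih.37

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
dose, c_threshold, B = 100.0, 10.0, 200

def conc(t, theta):
    V, ka, ke = np.exp(theta)
    return dose * ka / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

# Hypothetical subject data and hypothetical current values of (beta, sigma, Sigma).
t_obs = np.array([1.0, 2.0, 4.0, 8.0, 12.0, 24.0])
y_obs = np.array([1.45, 1.52, 1.21, 0.50, 0.26, 0.05])
g_xb = np.array([3.7, 0.2, -1.5])            # g(x_i, beta) at the current beta
sigma = 0.1
Sigma = np.diag([0.05, 0.10, 0.05])
Sigma_inv = np.linalg.inv(Sigma)
r = Sigma.shape[0]

def S_i(b):
    # S_i(b, beta) of Eq. (24).
    return 0.5 * np.sum((y_obs - conc(t_obs, g_xb + b)) ** 2)

def l_i(b):
    # l_i(b | beta, sigma, Sigma) of Eq. (36).
    return -S_i(b) / sigma**2 - 0.5 * b @ Sigma_inv @ b - t_obs.size * np.log(sigma)

def neg_hessian(f, x, h=1e-3):
    # Central-difference approximation to -f''(x).
    H = np.zeros((x.size, x.size))
    for a in range(x.size):
        for c in range(x.size):
            ea, ec = np.eye(x.size)[a] * h, np.eye(x.size)[c] * h
            H[a, c] = (f(x + ea + ec) - f(x + ea - ec)
                       - f(x - ea + ec) + f(x - ea - ec)) / (4.0 * h * h)
    return -H

b_hat = minimize(lambda b: -l_i(b), x0=np.zeros(r), method="Nelder-Mead").x
V_i = neg_hessian(l_i, b_hat)

if np.linalg.eigvalsh(V_i).min() >= c_threshold:
    # Laplace approximation to integral (36).
    value = (2 * np.pi) ** (r / 2) * np.linalg.det(V_i) ** (-0.5) * np.exp(l_i(b_hat))
    method = "Laplace"
else:
    # Monte Carlo average (37), rescaled so that it approximates integral (36).
    draws = rng.multivariate_normal(np.zeros(r), Sigma, size=B)
    const = sigma ** (-t_obs.size) * (2 * np.pi) ** (r / 2) * np.sqrt(np.linalg.det(Sigma))
    value = const * np.mean([np.exp(-S_i(b) / sigma**2) for b in draws])
    method = "Monte Carlo"
print(method, "evaluation of the subject's integral:", value)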
3. Bioavailability and Bioequivalence Generic drug products (manufactured by other companies that are not the innovator) have become increasingly popular since the 1960s. For the approval of a generic drug product, the FDA usually does not require a regular new drug application (NDA) submission to demonstrate the efficacy and safety of the product. Instead, it requires the generic drug company to submit bioavailability (BA) information on the generic drug and to provide evidence of its bioequivalence (BE) to the standard (or reference) drug in an “abbreviated new drug application” (ANDA), following certain regulations that became effective in 1977 and are codified in 21 CFR 320, in which BA is defined as “the rate and extent to which the active ingredient or active moiety is absorbed from a drug product and becomes available at the site of action.” In Sec. 2.2 we have discussed how the PK data of a drug can be used to measure its BA. This section focuses on BE and the statistical methods in BE studies.
Two drug products are said to be bioequivalent if they contain either identical amounts of the same active ingredient (i.e. are “pharmaceutical equivalents”) or an identical therapeutic moiety and if their rates and extents of absorption are not significantly different when administered at the same dose under similar experimental conditions. BE studies are conducted not only for ANDAs of generic drugs but also for formulation change of an approved drug. For example, clinical trials for the NDA of a drug usually use the drug produced in a laboratory setting. After approval, commercial batches produced from manufacturing plants have to be demonstrated to be bioequivalent to the clinical trial batches. Moreover, there may also be changes from tablet to capsule formulations so that BE studies are needed. BE studies typically use healthy normal subjects and do not involve Phases II and III trials. A pilot study using a small number (e.g. 6) of subjects can be carried out in advance to assess inter-subject and intra-subject variabilities, sample size, time intervals to collect blood or urine samples and to provide other information. Instead of the commonly used randomized designs in Phases II and III studies, in which each subject is randomly assigned to one and only one formulation of a drug (parallel designs), BE studies typically use the crossover design, which is a modified randomized block design in which each block (consisting of a subject or a group of subjects) receives more than one formulation of a drug at different time periods. Crossover designs have the following advantages in BE studies: (a) Each subject serves as his/her own control, allowing a within-subject comparison between formulations. (b) Inter-subject variability is removed from the comparison between formulations. (c) With proper randomization of subjects to the sequence of formulation administrations, a crossover design can provide the best unbiased estimates of the differences (or ratios) between formulations. On the other hand, care must be taken to address the “carry-over” effects in crossover designs. In BE studies, the “washout” period, which is defined as the rest period between two treatment periods for the effect of the preceding treatment period to taper off, must be long enough so that the carryover effect from one treatment period to the next is negligible. There is an extensive literature on crossover designs for clinical trials,51–57 and for BE studies.58
Although parallel designs are infrequently used in BE studies since crossover designs usually provide much better ways of identifying and removing the inter-subject variability from the comparison between formulations based on a sample of typically 18–24 subjects, there are situations in which a parallel design is preferable to a crossover design, e.g. when (i) the inter-subject variability is relatively small compared to the intra-subject variability, or (ii) the drug has long elimination half-life so that the long washout period in a crossover design prolongs the study and increases the chance of drop-out of the subjects, or (iii) the cost of increasing the number of subjects is smaller than that of adding an additional treatment period, or (iv) extensive blood collection is not feasible from the subjects. Suppose there are two formulations, one of which is a test formulation (T) and the other a reference (or standard) formulation (R) of a drug. For a standard 2 × 2 crossover design, each subject is randomly assigned to either the first sequence RT or the second sequence TR at two dosing periods. A subject assigned RT receives R at the first dosing period and T at the second period. The dosing periods are separated by a washout period of sufficient length to rule out carry-over effects. More generally, an m × n crossover design involves m sequences of formulations that are administered at n time periods. Examples are the 2 × 4 crossover design consisting of the two sequences TRTR and RTRT, and Balaam’s 4 × 2 crossover design51 consisting of the four sequences TT, RR, RT and TR. A widely used statistical model to perform inference in these designs is the linear mixed effects model yijk = µ + aj + ηik + bjk + cj−1,k + ǫijk ,
(38)
where i refers to the subject number, j the period number and k the sequence number. Here µ is the overall mean, aj is the fixed effect of the jth period (with Σaj = 0), ηik is the random effect (assumed to be normal with mean 0) of the ith subject in the kth sequence, bjk is the fixed effect of the formulation in the jth period of the kth sequence, and ǫijk is the within-subject random error which is assumed to be normal with mean 0. In particular, for a standard 2 × 2 crossover design, bjk is the fixed effect of R (resp. T) if j = k (resp. j ≠ k). Note that model (38) assumes first-order (i.e. one-period) carry-over effects: cj−1,k represents the (fixed) residual effect carried over from period j − 1 to period j in the kth sequence. For two-period designs, carry-over effects can only occur in the second period. It is also assumed that the ηik and ǫijk are independent with var(ηik) = ση² and var(ǫijk) = σǫ². Standard ANOVA techniques can be used to construct
unbiased estimates and confidence intervals of linear contrasts of the fixed effects, while the variance parameters ση2 and σǫ2 can be estimated by the method of moments or restricted maximum likelihood.38,39 The yijk in model (38) is typically some transformation of the observed response (e.g. logarithm of the AUC) to make it approximately normal. Note that the logarithmic transformation converts multiplicative effects into additive effects, as assumed in model (38). Although model (38) leads to standard F -tests of equality between the formulations T and R, it has been recognized since the 1970s that testing the usual hypothesis of equality is inappropriate for BE, whose purpose is to verify that the two formulations have no “biologically significant” differences.60,61 One way to address this difficulty is to change the null hypothesis of equality (versus the alternative hypothesis of inequality) into a null hypothesis of the form H0 : θ ≤ θ1 or θ ≥ θ2 , with an interval alternative hypothesis H1 : θ1 < θ < θ2 , where θ is the parameter of interest and the interval (θ1 , θ2 ) is a biological indifference zone. Schuirmann62,63 and Anderson and Hauck64 have developed test procedures for what is now called average bioequivalence. Instead of relying on hypothesis testing, Westlake61 proposed the following confidence interval procedure to assess average bioequivalence. Let µT (µR ) denote the mean response of a subject receiving treatment T(R) in model (38). If a (1 − 2α) × 100% confidence interval for µT − µR is within the acceptance limits as recommended by the regulatory agency, then accept the test formulation T as bioequivalent to the reference formulation R. Average bioequivalence only compares the means of the marginal distributions of the PK parameters of interest, such as AUC or Cmax , associated with the two formulations. Under normality assumptions, the equivalence between distributions is characterized by the equivalence of their means and variances. Population bioequivalence therefore also compares the variances of the two formulations. The intra-subject variability, particularly associated with switching from one formulation to another, leads to another criterion in assessing BE, called individual bioequivalence. To explain the underlying motivation, suppose a patient switches from R to T that has a much higher intra-subject variability than R. This may push the AUC of the patient outside the established therapeutic window of R. Consequently, population BE does not guarantee that the two formulations are exchangeable and therapeutically equivalent and individual BE is needed.
To see what is involved in assessing these three criteria of BE, assume for simplicity that there are no carry-over effects and no period-sequence interactions so that (38) can be reduced to a form that involves T and R more directly as yiδν = µδ + αiδ + ǫiδν ,
(39)
where i is the subject number, δ = T or R, ν denotes the number of times that δ appears in a sequence and therefore 1 ≤ ν ≤ nδ (= largest number of times that δ appears in the available sequences). The ǫiδν are independent normal with mean 0, variance σǫ² and are independent of the random effects (αiT, αiR) that are independent across subjects and have a bivariate normal distribution with var(αiT) = vT, var(αiR) = vR and var(αiT − αiR) = σD². According to the 1999 FDA Guidance on Bioequivalence, average BE is established if the 90% confidence limits for e^µT/e^µR lie between 4/5 and 5/4, or equivalently, if the 90% confidence limits for µT − µR lie within ±ln(1.25). Note that the total variance of the T formulation is σT² = σǫ² + vT, while that of the R formulation is σR² = σǫ² + vR. Population BE is established if the 95% upper confidence bound for {(µT − µR)² + (σT² − σR²)}/σR² falls below the FDA specified limit of {(ln 1.25)² + 0.02}/(0.2)² = 1.745. Individual BE is established if the 95% upper confidence bound for {(µT − µR)² + σD²}/σǫ² falls below another FDA specified limit. These upper confidence bounds can be obtained by appealing to the central limit theorem and using the delta method to compute the asymptotic standard errors. Alternatively, bootstrap methods can be used to compute the confidence bounds and confidence intervals; see in particular Chapter 25 of Efron and Tibshirani65 and Sec. 4.5.3 of Chow and Liu.58 The inclusion of population BE and individual BE besides average BE by the FDA in its guidelines for the pharmaceutical industry reflects its concerns about prescribability and switchability of generic drug products. Prescribability means that when a physician prescribes a generic drug product to a patient for the first time, they should both be assured that the drug product yields safety and efficacy results comparable to that of the reference product in the patient population. Switchability means that when a physician switches a reference product to a generic product for a patient, they should both be assured that the generic product will yield comparable safety and efficacy results for the same individual. Nonparametric and Bayesian approaches to BE have also been developed in the literature.58 There are intriguing theoretical problems concerning BE in statistical decision theory.66 Crossover designs and average BE for more than two formulations have also been studied.58
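The average BE criterion can be checked with a short calculation once the log-transformed PK responses from a standard 2 × 2 crossover are available. The Python sketch below uses the usual period-difference analysis (with hypothetical log AUC values invented for the example) to form the 90% confidence interval for µT − µR and compare it with ±ln(1.25); it ignores carry-over checks and is only a minimal sketch of the confidence interval approach described above.

import numpy as np
from scipy.stats import t as t_dist

# Hypothetical log(AUC) values from a 2 x 2 crossover; columns are period 1 and period 2.
rt = np.array([[4.32, 4.41], [4.10, 4.05], [4.55, 4.60], [4.20, 4.28],
               [4.47, 4.50], [4.05, 4.12], [4.38, 4.35], [4.25, 4.33]])   # sequence R then T
tr = np.array([[4.40, 4.35], [4.18, 4.10], [4.52, 4.49], [4.30, 4.22],
               [4.41, 4.44], [4.15, 4.08], [4.29, 4.27], [4.33, 4.30]])   # sequence T then R

d_rt = (rt[:, 1] - rt[:, 0]) / 2.0          # per-subject (period 2 - period 1)/2
d_tr = (tr[:, 1] - tr[:, 0]) / 2.0
n1, n2 = len(d_rt), len(d_tr)

diff = d_rt.mean() - d_tr.mean()            # estimate of mu_T - mu_R on the log scale
pooled_var = (((n1 - 1) * d_rt.var(ddof=1) + (n2 - 1) * d_tr.var(ddof=1))
              / (n1 + n2 - 2))
se = np.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
tcrit = t_dist.ppf(0.95, n1 + n2 - 2)       # two-sided 90% confidence interval
lower, upper = diff - tcrit * se, diff + tcrit * se

limit = np.log(1.25)
print(f"90% CI for mu_T - mu_R: ({lower:.3f}, {upper:.3f}); limits: +/-{limit:.3f}")
print("average bioequivalence concluded:", (-limit < lower) and (upper < limit))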
4. Assay Development and Validation The availability of reliable assays is central to determining the drug concentrations in blood, urine, etc., in PK studies. When a pharmaceutical compound is discovered, it is necessary to develop an assay method to measure the substance levels in plasma, serum, etc. The substance that is being measured is called an analyte, and the objective is to determine the analyte’s potency, which refers to its content or activity (e.g. number of particles, gravitometric mass, percent of impurity). There are three types of assays that are commonly used in the pharmaceutical industry: (i) chemical assays such as HPLC (high performance liquid chromatographs), (ii) immunoassays (e.g. radioimmunoassays, enzyme-linked immunosorbent assays), (iii) biological assays (measuring the analyte’s potency relative to some standard drug in terms of the magnitudes of their effects on responses from living subjects). For the development of an assay method of a pharmaceutical compound, the FDA requires that the assay method meet the established specifications, for which instrument calibration is essential. A common approach to calibration is to have a number of known standard concentration preparations put through the instrument to obtain the corresponding responses. Fitting an appropriate statistical model to the data yields an estimated calibration curve, called the standard curve. Simple linear regression of the response on the standard is perhaps the most widely used statistical model. The standard curve is used to determine the unknown potency.67 Validation of an assay method is the process by which it is established, in laboratory studies, that the performance characteristics of the method indeed meets the specified criteria. As specified in Chow and Liu,67 these criteria include (i) accuracy (no systematic error in the assay method), (ii) precision (measurement error of the method), (iii) limit of detection/quantitation (LOD/LOQ, which is the lowest concentration of analyte in a sample that can be detected/determined with acceptable precision under the specified experimental conditions), (iv) range (reliable range of the method), (v) linearity (whether the assay generates results that are directly proportional to the concentration of analyte within a given range), (vi) specificity (whether the assay measures the analyte and no other substance in the specimen), (vii) ruggedness (degree of reproducibility of assay results under a variety of normal test conditions, such as different laboratories, assay temperatures, days). Commonly used statistical methods for assay validation include:
(a) regression analysis (particularly with respect to accuracy, linearity and LOD/LOQ), (b) analysis of variance (particularly with respect to ruggedness); see Chapter 3 of Chow and Liu.67 Lin68 introduces a concordance correlation coefficient to evaluate reproducibility and ruggedness, while Chapter 10 of Davidian and Giltinan11 applies nonlinear mixed effects models to the analysis of assay data. 5. Drug Discovery As pointed out in the preceding section, assay development is an important facet in the drug discovery process. Another important facet is of a biological nature and involves the identification of a biological target or pathway. In recent years, advances in bioinformatics and genomics have provided new tools and opportunities in this direction. Besides applications to assay development and bioinformatics, statistical methods are also useful in screening compounds for clinically active drugs, and in searching for novel, active compounds. A pharmaceutical company typically has a large inventory of compounds, of which an unknown small proportion is truly active. Dunnett69 developed a model that takes into account the costs and benefits of any screening procedure to derive an optimal procedure; see also the subsequent work of Bergman and Gittins70 in this direction. Colton71 and King72 considered multistage screening procedures, while Redman and King73 proposed group screening that uses balanced and partially balanced incomplete block designs to increase the rate of compound screening without reducing necessary replication. Numerical topology is the assignment of numerical values to topologically invariant features of molecules. There is an isomorphism between two-dimensional molecular diagrams and connected graphs; the edges and vertices of the graphs correspond to bonds and atoms of molecules, yielding numerical representation of compounds or parts of compounds. With this representation, search for active compounds involves a very large set of graphs. Moreover, there may also be a large number of potential chemical modifications at different sites that one may want to experiment with. Experimental design techniques are particularly useful for such problems.74,75 References 1. Katzung, B. G. (1992). Basic and Clinical Pharmacology. Prentice Hall, Englewood Cliffs, NJ.
2. Rowland, M. and Tozer, T. N. (1989). Clinical Pharmacokinetics, 2nd edn., Lea and Febiger, Philadelphia. 3. Evans, W. E., Schentag, J. J. and Jusko, W. J. (eds). (1986). Applied Pharmacokinetics, 2nd edn., Applied Therapeutics Inc., SF. 4. Welling, P. G. (1986). Pharmacokinetics: Processes and Mathematics, American Chemical Society, Washington, DC. 5. Goldstein, A., Aronow, L. and Kalman, S. M. (1974). Principles of Drug Action, 2nd edn., Wiley, NY. 6. Wagner, J. G. (1976). Linear pharmacokinetic equations allowing direct calculation of many needed pharmacokinetic parameters from the coefficients and exponents of polyexponential equations which have been fitted to data. Journal of Pharmacokinetics and Biopharmaceutics 4: 443. 7. Lai, T. L. (1985). Regression analysis of compartmental models. Journal of Research of the National Bureau Standard 90: 525. 8. Yeh, K. C. and Kwan, K. C. (1978). A comparison of numerical integration algorithms by trapezoidal, Lagrange and spline approximations. Journal of Pharmacokinetics and Biopharmaceutics 6: 79. 9. Scatchard, G. (1949). The attractions of protein for small molecules and ions. Annals of New York Academy of Science 51: 660. 10. Lai, T. L. and Zhang, L. (1994). Statistical analysis of ligand-binding experiments. Biometrics 50: 782. 11. Davidian, M. and Giltinan, D. M. (1995). Nonlinear Models for Repeated Measurement Data, Chapman and Hall, NY. 12. Beal, S. M. and Sheiner, L. B. (1992). NONMEM User’s Guide. NONMEM Project Group, UCSF. 13. Lindstrom, M. J. and Bates, B. M. (1990). Nonlinear mixed effects models for repeated measures data. Biometrics 46: 673. 14. Geyer, C. J. and Thompson, E. A. (1992). Constrained Monte Carlo maximum likelihood for dependent data. Journal of the Royal Statistical Society Series B54: 657. 15. Quintana, F. A., Liu, J. S. and del Pino, G. E. (1999). Monte Carlo EM with importance reweighting and its applications in random effects models. Computational Statistics and Data Analysis 29: 429. 16. Gelman, A. (1995). Method of moments using Monte Carlo simulation. Journal of Computational and Graphical Statistics 4: 36. 17. Pinhero, J. C. and Bates, D. M. (1995). Approximations to the loglikelihood function in the nonlinear mixed effects model. Journal of Computational and Graphical Statistics 4: 12. 18. Wakefield, J. C., Smith, A. F. M., Racine-Poon, A. and Gelfand, A. E. (1994). Bayesian analysis of linear and nonlinear population models by using the Gibbs sampler. Applied Statistics 43: 201. 19. Wakefield, J. (1996). The Bayesian analysis of population pharmacokinetic models. Journal of the American Statistics Association 91: 62. 20. Wakefield, J. and Benett, J. (1996). The Bayesian modeling of covariates for population pharmacokinetic models. Journal of the American Statistical Association 91: 917.
21. Gelman, A., Bois, F. and Jiang, J. (1996). Physiological pharmacokinetic analysis using population modeling and informative prior distributions. Journal of the American Statistical Association 91: 1400. 22. M¨ uller, P. and Rosner, G. L. (1997). A Bayesian population model with hierarchical mixture priors applied to blood count data. Journal of the American Statistical Association 92: 1279. 23. Bennett, J. E., Racine-Poon, A. and Wakefield, J. C. (1996). MCMC for nonlinear hierarchical models. In Markov Chain Monte Carlo in Practice, eds. W. R. Gills, S. Richardson and D. J. Spiegelhalter, Chapman and Hall, NY, 339–357. 24. Shih, M. (1999). Estimation in nonlinear mixed effects models: Parametric and nonparametric approaches. PhD. dissertation, Dept. Statistics, Stanford University. 25. Davidian, M. and Gallant, A. R. (1992). Smooth nonparametric maximum likelihood estimation for population pharmacokinetics, with applications to quinidine. Journal of Pharmacokinetics and Biopharmaceutics 20: 529. 26. Gallant, A. R. and Nychka, D. W. (1987). Semi-nonparametric maximum likelihood estimation. Econometrica 55: 363. 27. Magder, L. S. and Zeger, S. L. (1996). A smooth nonparametric estimate of a mixing distribution using mixture of gaussians. Journal of the American Statistical Association 91: 1141. 28. Fattinger, K. E., Sheiner, L. B. and Verotta, D. (1995). A new method to explore the distribution of interindividual random effects in nonlinear mixed effects models. Biometrics 51: 1236. 29. Carroll, R. and Hall, P. (1988). Optimal rates of convergence for deconvolving a density. Journal of the American Statistical Association 83: 1184. 30. Fan, J. Q. (1991). On the optimal rates of convergence for nonparametric deconvolution problems. The Annals Statistics 19: 1257. 31. Chen, J. (1994). Optimal rate of convergence for finite mixture models. The Canadian Journal of Statistics 22: 387. 32. Lindsay, B. G. (1983). The geometry of mixture likelihoods: A general theory. Annals of Statistics 11: 86. 33. Mallet, A. (1986). A maximum likelihood estimation method for random coefficient regression models. Biometrika 73: 645. 34. Ibragimov, I. A. and Has’minskii, R. Z. (1983). On asymptotic efficiency in the presence of an inifinite-dimensional nuisance parameter. In Lecture Notes in Mathematics 1021, Springer-Verlag, NY, 195–220. 35. Bates, D. M. and Watts, D. G. (1988). Nonlinear Regression Analysis and Its Applications. Wiley, NY. 36. Lai, T. L. and Shih, M.-C. (2003). Nonparametric estimation in nonlinear mixed effects models. Biometrika 90, in press. 37. Lai, T. L. and Shih, M.-C. (2003). A hybrid estimator in nonlinear and generalised linear mixed effects models. Biometrika 90, in press. 38. Berzuini, C. (1996). Medical monitoring. In Markov Chain Monte Carlo in Practice, eds. W. R. Gilks and S. Richardson, Spiegelhalter, Chapman and Hall, 1996, 321–337.
39. Cox, D. R. and Snell, E. J. (1968). A general definition of residuals. Journal of the Royal Statistical Society Series B30: 248. 40. Maitre, P. O., Buhrer, M., Thomson, D. and Stanski, D. R. (1991). A three-step approach combining Bayes regression and NONMEM population analysis: Application to midazolam. Journal of Pharmacokinetics and Biopharmaceutics 19: 377. 41. Wakefield, J. and Racine-Poon, A. (1995). An application of Bayesian pharmacokinetic/pharmacodynamic models to dose recommendation. Statistics in Medicine 14: 971. 42. Banfield, C. R., Zhu, G. R., Jen, J. F., Jensen, P. K., Schumaker, R. C., Perhach, J. L., Affrima, M. B. and Glue, P. (1996). The effect of age on the apparent clearance of felbamate: A retrospective analysis using nonlinear mixed effects modeling. Therapeutic Drug Monitoring 18: 19. 43. Verme, C. N., Ludden, T. M., Clementi, W. A. and Harris, S. C. (1992). Pharmacokinetics of quinidine in male patients: A population analysis. Clinical Pharmacokinetics 22: 468. 44. Boeckmann, A. J., Sheiner, L. B. and Beal, S. L. (1992). Part V (Introductory Guide) of NONMEM User’s Guide. NONMEM Project Group, UCSF. 45. Vonesh, E. F. (1996). A note on the use of Laplace’s approximation for nonlinear mixed effects models. Biometrika 83: 447. 46. Wolfinger, R. (1993). Laplace’s approximation for nonlinear mixed effects models. Biometrics 80: 791. 47. Breslow, N. E. and Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 88: 9. 48. Lee, Y. and Nelder, J. A. (1996). Hierarchical generalized linear models. Journal of the Royal Statistical Society Series B58: 619. 49. Henderson, C. R. (1975). Best linear unbiased estimation and prediction under a selection model. Biometrics 31: 423. 50. Press, W. H., Teukolsky, S. A., Vetterling, W. T. and Flannery, B. P. (1992). Numerical Recipes in C : The Art of Scientific Computing, Cambridge Univ. Press. 51. Balaam, L. N. (1968). A two-period design with t2 experimental units. Biometrics 24: 61. 52. Brown, B. W. (1980). The crossover experiment for clinical trials. Biometrics 36: 69. 53. Cheng, C. S. and Wu, C. F. (1980). Balanced repeated measurements designs. The Annals of Statistics 8: 1272. 54. Laska, E. M., Meisner, M. and Kushner, H. B. (1983). Optimal crossover designs in the presence of carryover effects. Biometrics 39: 1089. 55. Laska, E. M. and Meisner, M. (1985). A variational approach to optimal two-treatment crossover designs: Applications to carryover effect methods. Journal of the American Statistical Association 80: 704. 56. Jones, B. and Kenward, M. G. (1989). Design and Analysis of Crossover Trials, Chapman and Hall, NY.
57. Fleiss, J. L. (1989). A critique of recent research on the two-treatment crossover design. Controlled Clinical Trials 10: 237. 58. Chow, S. C. and Liu, J. P. (1992). Design and Analysis of Bioavailability and Bioequivalence Studies. Marcel Dekker, NY. 59. Chinchilli, V. M. and Esinhart, J. D. (1996). Design and analysis of intrasubject variability in cross-over experiments. Statistics in Medicine 15: 1619. 60. Metzler, C. M. (1974). Bioavailability: A problem in bioequivalence. Biometrics 30: 309. 61. Westlake, W. J. (1972). Use of confidence intervals in analysis of comparative bioavailability trials. Journal of Pharmaceutical Sciences 61: 1340. 62. Schuirmann, D. J. (1981). On hypothesis testing to determine if the mean of a normal distribution is continued in a known interval. Biometrics 37: 617. 63. Schuirmann, D. J. (1987). A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics 15: 657. 64. Anderson, S. and Hauck, W. W. (1983). A new procedure for testing equivalence in comparative bioavailability and other clinical trials. Communications in Statistics Series A12: 2663. 65. Efron, B. and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman and Hall, NY. 66. Brown, L. D., Hwang, J. T. and Munk, A. (1997). An unbiased test for the bioequivalence problem. The Annals of Statistics 25: 2345. 67. Chow, S. C. and Liu, J. P. (1995). Statistical Design and Analysis in Pharmaceutical Science. Marcel Dekker, NY. 68. Lin, L. I. (1989). A concordance correlation coefficient to evaluate reproducibility. Biometrics 45: 255. 69. Dunnett, C. W. (1961). Statistical theory of drug screening. In Quantitative Pharmacology, ed. H. DeJonge, North Holland, Amsterdam. 70. Bergman, S. W. and Gittins, J. C. (1985). Statistical Methods for Planning Pharmaceutical Research. Marcel Dekker, NY. 71. Colton, T. (1963). Optimal drug screening plans. Biometrika 50: 31. 72. King, E. P. (1964). Optimal replication in sequential drug screening. Biometrika 51: 110. 73. Redman, C. E. and King, E. P. (1965). Group screening utilizing balanced and partially balanced incomplete block designs. Biometrics 21: 865. 74. Thornber, C. W. (1979). Isosterism and molecular modification in drug design. Chemical Society Reviews 8: 563. 75. Martin, E. J., Blaney, J. M., Siani, M. A., Spellmeyer, D. C., Wong, A. K. and Moose, W. H. (1995). Measuring diversity: Experimental design of combinatorial libraries for drug discovery. Journal of Medicinal Chemistry 38: 1431.
About the Author

Tze Leung Lai, born in Hong Kong in 1945, received his BA degree in Mathematics (1967) from the University of Hong Kong and PhD in
Statistics (1971) from Columbia University. He then remained on the faculty of Columbia University as Assistant Professor (1971–1974), Associate Professor (1974–1977), Professor (1977–1986) and Higgins Professor (1986– 1987) of Mathematical Statistics. Since 1987, he has been Professor of Statistics at Stanford University, where he is currently Chairman of the Department of Statistics. His research interests include adaptive control and learning systems, sequential experimentation, fault detection and quality control, stochastic optimization, biostatistics, probability theory and mathematical finance. He received the Committee of Presidents of Statistical Societies (COPSS) Award and the Guggenheim Fellowship in 1983. He is an elected member of Academia Sinica (Taipei) and International Statistical Institute.
CHAPTER 12
STATISTICS IN BIOPHARMACEUTICAL RESEARCH

SHEIN-CHUNG CHOW
StatPlus, Inc., Heston Hall, Suite 206, 1790 Yardley-Langhorne Road, Yardley, PA 19067, USA
Tel: 215-321-1082; [email protected]

ANNPEY PONG
Biostatistics, Forest Laboratories, Inc., 18th Floor, Harborside Financial Center, Plaza V, Jersey City, NJ 07311, USA
Tel: 201-4278318; [email protected]
1. Pharmaceutical Research and Development

In the process of research and development of a pharmaceutical entity, statistics are necessarily applied at various critical stages of the process to meet regulatory requirements for the effectiveness, safety, identity, strength, quality, purity, stability, and reproducibility of the pharmaceutical entity under investigation. A pharmaceutical entity could be a drug product, a biological product, a medical device, or a combination of a drug product, a biological product and a medical device. The critical stages of the process of pharmaceutical research and development include pre-IND (Investigational New Drug Application), IND, NDA (New Drug Application) and post-NDA. The role of statistics at these critical stages is briefly described below. At the very early stage of pre-IND, pharmaceutical scientists may have to screen thousands of potential compounds in order to identify a few promising compounds. An appropriate use of statistics with efficient screening and/or optimal designs will assist pharmaceutical scientists to cost-effectively identify the promising compounds within a relatively short period of time. As indicated by the United States Food and Drug Administration (FDA), an IND should contain information regarding chemistry,
manufacturing, and controls (CMC) of the drug substance and drug product to ensure the identity, strength, quality, and purity of the investigational drug. In addition, the sponsors are required to provide adequate information regarding pharmacological studies for absorption, distribution, metabolism, and excretion (ADME) and acute, subacute, and chronic toxicological studies and reproductive tests in various animal species to support that the investigational drug is reasonably safe to be evaluated in clinical trials in humans. At this stage, statistics are usually applied to (i) validate a developed analytical method, (ii) establish drug expiration dating period through stability studies, and (iii) assess toxicity through animal studies. Statistics are required to meet standards of accuracy and reliability. Before the drug can be approved, the FDA requires that substantial evidence of the effectiveness and safety of the drug be provided in the Technical Section of Statistics of an NDA submission. Since the validity of statistical inference regarding the effectiveness and safety of the drug is always a concern, it is suggested that a careful review be performed to ensure an accurate and reliable assessment of the drug product. In addition, in order to have a fair assessment of the efficacy and safety of the investigational drug, the FDA also establishes advisory committees, each consisting of clinical experts, pharmacological experts, statistical experts, and one advocate (not employed by the FDA) in designated drug classes and specialties, to provide a second but independent review of the submission. The responsibility of the statistical expert is not only to ensure that a valid design is used but also to evaluate whether statistical methods used are appropriate for addressing the scientific and medical questions regarding the effectiveness and safety of the drug. After the drug is approved, the FDA also requires that the drug product be tested for its identity, strength, quality, purity, and stability before it can be released for use. For this purpose, the current Good Manufacturing Practice (cGMP) is necessarily implemented to (i) validate the manufacturing process, (ii) monitor the performance of the manufacturing process, and (iii) provide quality assurance of the final product. At each stage of the manufacturing process, the FDA requires that sampling plans, acceptance criteria, and valid statistical analyses be performed for the intended tests such as potency, content uniformity, and dissolution.69 For each test, sampling plan, acceptance criteria, and valid statistical analysis are crucial for determining whether the drug product pass the test based on the results from a representative sample.
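Whether a batch passes a multiple-stage compendial test depends on both its mean and its variability, and this can be explored by simulation. The sketch below uses a generic two-stage dissolution-type rule with purely illustrative thresholds (the official USP/NF criteria and the value of Q are product-specific and should be taken from the compendium); it estimates the probability that a batch with a given mean and standard deviation passes.

```python
import numpy as np

def prob_pass_two_stage(mean, sd, q=80.0, n_sim=20_000, seed=11):
    """Monte Carlo probability of passing an illustrative two-stage test:
    stage 1: 6 units, pass if every unit >= Q + 5;
    stage 2: 12 more units, pass if the mean of all 18 >= Q and no unit < Q - 15.
    (Not the official USP acceptance criteria -- an illustrative stand-in.)
    """
    rng = np.random.default_rng(seed)
    passes = 0
    for _ in range(n_sim):
        s1 = rng.normal(mean, sd, 6)
        if s1.min() >= q + 5:
            passes += 1
            continue
        s2 = np.concatenate([s1, rng.normal(mean, sd, 12)])
        if s2.mean() >= q and s2.min() >= q - 15:
            passes += 1
    return passes / n_sim

for sd in (2.0, 5.0, 8.0):
    print(f"batch mean 85, SD {sd}: P(pass) = {prob_pass_two_stage(85.0, sd):.3f}")
```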
In this chapter, we will not only introduce some key statistical concepts commonly encountered in pharmaceutical research and development, but also provide a comprehensive review of some important topics such as assay validation, stability design and analysis, individual bioequivalence, statistical principles for good clinical practice, and statistics in diagnostic imaging. Detailed information regarding the application of statistics at various critical stages during the process of pharmaceutical research and development can be found in Chow.9
2. Key Statistical Concepts

Key statistical concepts in the design and analysis of studies that are commonly conducted at various stages of pharmaceutical research and development are described below.
2.1. Bias and variability

For approval of a drug product, regulatory agencies usually require that the results of the studies conducted at various stages of drug research and development be accurate and reliable in order to provide a valid and fair assessment of the treatment effect. Accuracy refers to the closeness of the results to the true value (i.e. the true treatment effect), and reliability to the degree to which that closeness can be reproduced. Any deviation from the true value is considered a bias, which may be due to selection, observation, or statistical procedures. Pharmaceutical scientists should make every attempt to avoid bias whenever possible to ensure that the collected data are accurate. The reliability of a study is an assessment of the precision of the study, and reflects the ability to repeat or reproduce similar outcomes in the targeted population. The higher the precision of a study, the more likely its results are to be reproducible. The precision of a study can be characterized by the variability incurred during the conduct of the study. In practice, since studies are usually planned, designed, executed, analyzed, and reported by a team consisting of pharmaceutical scientists from different disciplines, bias and variability inevitably occur. It is therefore suggested that possible sources of bias and variability be identified at the planning stage of the study, not only to reduce the bias but also to minimize the variability.
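As a small numerical illustration of the distinction (with hypothetical numbers, not taken from any study), the simulation below contrasts a measurement process with a systematic bias against one that is unbiased but highly variable; the mean error reflects bias, while the standard deviation reflects precision.

```python
import numpy as np

rng = np.random.default_rng(2023)
true_effect = 10.0                        # hypothetical true treatment effect

# biased but precise: systematic offset of +1 with small random error
biased_precise = true_effect + 1.0 + rng.normal(0, 0.2, size=1000)
# unbiased but imprecise: no offset, large random error
unbiased_noisy = true_effect + rng.normal(0, 2.0, size=1000)

for name, x in [("biased but precise", biased_precise),
                ("unbiased but noisy", unbiased_noisy)]:
    print(f"{name}: mean error = {x.mean() - true_effect:+.2f}, SD = {x.std(ddof=1):.2f}")
```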
2.2. Type I error, significance level, and power

In statistical analysis, two different kinds of mistakes are commonly encountered when performing hypothesis testing. As an example, consider the following pharmaceutical application. Suppose that a pharmaceutical company is interested in demonstrating that a newly developed drug is efficacious. The null hypothesis is often that the drug is inefficacious, versus the alternative hypothesis that the drug is efficacious. The objective is to reject the null hypothesis and conclude the alternative hypothesis that the drug is efficacious. Under the null hypothesis, a type I error is made if we conclude that the drug is efficacious when in fact it is not. This error is also known as the consumer's risk. The acceptable level of the probability of committing a type I error is known as the significance level. If the probability of observing a type I error based on the data, usually referred to as the p-value of the test, is less than the significance level, we conclude that a statistically significant result is observed. Similarly, a type II error is committed if we conclude that the drug is inefficacious when in fact it is efficacious. This error is referred to as the producer's risk. The power is defined as the probability of correctly concluding that the drug is efficacious when in fact it is. For assessment of drug effectiveness and safety, a sufficient sample size is often selected to achieve a desired power at a pre-specified significance level. The purpose is to control both the type I error (significance level) and the type II error (power).

2.3. Confounding and interaction

In pharmaceutical research and development, there are many sources of variation that have an impact on the evaluation of the treatment. If these variations are not identified and properly controlled, they may be mixed up with the treatment effect that the studies are intended to demonstrate. In this case, the treatment effect is confounded with the effects due to these variations. Statistical interaction concerns whether the joint contribution of two or more factors is the same as the sum of the contributions from each factor considered alone. If an interaction between factors exists, an overall assessment cannot be made. In practice, it is suggested that possible confounding factors be identified and properly controlled at the planning stage of the studies. When significant interactions among factors are observed, subgroup analyses may be necessary for a careful evaluation of the treatment effect.
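Returning to the type I error and power of Sec. 2.2, both can be estimated directly by simulation. The sketch below (with a hypothetical effect size, standard deviation, and per-arm sample size) does so for a two-sample t-test at the 5% significance level.

```python
import numpy as np
from scipy import stats

def error_rates(n_per_arm=25, effect=0.6, sd=1.0, alpha=0.05, n_sim=5_000):
    """Monte Carlo estimates of the type I error and power of a two-sample t-test."""
    rng = np.random.default_rng(7)
    reject_null, reject_alt = 0, 0
    for _ in range(n_sim):
        placebo = rng.normal(0.0, sd, n_per_arm)
        drug_null = rng.normal(0.0, sd, n_per_arm)    # drug truly inefficacious
        drug_alt = rng.normal(effect, sd, n_per_arm)  # drug truly efficacious
        reject_null += stats.ttest_ind(drug_null, placebo).pvalue < alpha
        reject_alt += stats.ttest_ind(drug_alt, placebo).pvalue < alpha
    return reject_null / n_sim, reject_alt / n_sim

type1, power = error_rates()
print(f"estimated type I error = {type1:.3f}, estimated power = {power:.3f}")
```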
2.4. Randomization

Statistical inference on a parameter of interest of a population under study is usually derived under the probability structure of the parameter. The probability structure depends upon the randomization method employed in sampling. A failure of the randomization will have a negative impact on the validity of the probability structure. Consequently, the validity, accuracy, and reliability of the resulting statistical inference on the parameter are questionable. Therefore, it is suggested that randomization be performed using an appropriate randomization method under a valid randomization model according to the study design, to ensure the validity, accuracy, and reliability of the derived statistical inference. Details regarding various randomization models and methods that are commonly employed in clinical research can be found in Chow and Liu.12

2.5. Sample size determination/justification

One of the major objectives of most studies during drug research and development is to determine whether the drug is effective and safe. During the planning stage of a study, the following questions are of particular interest to the pharmaceutical scientists: (i) how many subjects are needed in order to have a desired power for detecting a meaningful difference, and (ii) what is the trade-off if only a small number of subjects are available for the study due to limited budget and/or some scientific considerations. To address these questions, a statistical evaluation for sample size determination/justification is often employed. Sample size determination usually refers to the calculation of the sample size needed to achieve some desired statistical property such as power or precision, while sample size justification provides statistical justification for a selected sample size, which is often a small number. For a given study, the sample size can be determined/justified based on a criterion on the type I error (a desired precision) or the type II error (a desired power). The disadvantage of sample size determination/justification based on the criterion of precision is that it may have a small chance of detecting a true difference. As a result, sample size determination/justification based on the criterion of power has become the most commonly used method. The sample size is selected to have a desired power for detection of a meaningful difference at a pre-specified level of significance. In practice, however, it is not uncommon to observe discrepancies among the study objective (hypotheses), study design, statistical analysis (test
statistic) and sample size calculation. These inconsistencies often result in (i) the wrong test for the right hypotheses, (ii) the right test for the wrong hypotheses, (iii) the wrong test for the wrong hypotheses, or (iv) the right test for the right hypotheses with insufficient power. Therefore, before the sample size can be determined, it is suggested that the following be carefully considered: (i) the study objective or the hypotheses of interest be clearly stated, (ii) a valid design with appropriate statistical tests be used, and (iii) the sample size be determined based on the test for the hypotheses of interest. Note that procedures for sample size calculation based on a pre-study power analysis for comparing means, proportions, time-to-event data, and variabilities can be found in Chow, Shao and Wang.21

2.6. Statistical difference and scientific difference

A statistical difference is defined as a difference that is unlikely to occur by chance alone, while a scientific difference refers to a difference that is considered to be of scientific importance. A statistical difference is also referred to as a statistically significant difference. The difference between the concepts of statistical difference and scientific difference is that a statistical difference involves chance (probability), while a scientific difference does not. When we claim there is a statistical difference, the difference is reproducible with a high probability. When conducting a study, there are basically four possible outcomes. The result may show that (i) the difference is both statistically and scientifically significant, (ii) there is a statistically significant difference yet the difference is not scientifically significant, (iii) the difference is scientifically significant yet not statistically significant, or (iv) the difference is neither statistically significant nor scientifically significant. If the difference is both statistically and scientifically significant, or if it is neither statistically nor scientifically significant, then there is no confusion. However, in many cases, a statistically significant difference does not agree with the scientifically significant difference. This inconsistency has created confusion and arguments among pharmaceutical scientists and biostatisticians. The inconsistency may be due to large variability and/or insufficient sample size.

2.7. One-sided test versus two-sided test

For the evaluation of a drug product, the null hypothesis of interest is often the one of no difference. The alternative hypothesis is usually the one that there
is a difference. The statistical test for this setting is called a two-sided test. In some cases, the pharmaceutical scientist may test the null hypothesis of no difference against the alternative hypothesis that the drug is superior to the placebo. The statistical test for this setting is known as a one-sided test. For a given study, if a two-sided test is employed at the 5% significance level, then the level of proof required is one out of 40. In other words, at the 5% level of significance, there is a 2.5% chance (or one out of 40) that we may reject the null hypothesis of no difference in the positive direction and conclude that the drug is effective on that side. On the other hand, if a one-sided test is used, the level of proof required is one out of 20. It turns out that a one-sided test allows more ineffective drugs to be approved by chance as compared to the two-sided test. It should be noted that when testing at the 5% level of significance with 80% power, the sample size required increases by about 27% for a two-sided test as compared to a one-sided test. As a result, there is a substantial cost saving if a one-sided test is used. However, there is no universal agreement among regulators, academia, and the pharmaceutical industry as to whether a one-sided test or a two-sided test should be used. The FDA tends to oppose the use of a one-sided test, though several pharmaceutical companies challenged this position at the Administrative Hearing on the Drug Efficacy Study Implementation (DESI) drugs. Dubey26 pointed out that several viewpoints favoring the use of one-sided tests were discussed in an administrative hearing. These points indicated that a one-sided test is appropriate in the following situations: (i) where there is truly only concern with outcomes in one tail, and (ii) where it is completely inconceivable that the results could go in the opposite direction.
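The 27% figure quoted above follows from the usual normal-approximation sample-size formula, in which the required n is proportional to (z_{1−α per side} + z_{power})²; the short calculation below verifies it.

```python
from scipy.stats import norm

z_power = norm.ppf(0.80)                                  # 80% power
n_two_sided = (norm.ppf(1 - 0.05 / 2) + z_power) ** 2     # alpha split over two tails
n_one_sided = (norm.ppf(1 - 0.05) + z_power) ** 2         # alpha in one tail
print(f"two-sided / one-sided sample size ratio: {n_two_sided / n_one_sided:.2f}")  # about 1.27
```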
2.8. Good Statistics Practice

Good Statistics Practice (GSP) is defined as a set of statistical principles for the best pharmaceutical practices in the design and analysis of studies conducted at various stages of drug research and development.8 The purpose of GSP is not only to minimize bias but also to minimize variability that may occur before, during, and after the conduct of the studies. More importantly, GSP provides a valid and fair assessment of the drug product under study. The concept of GSP can be seen in many guidelines and guidances issued by the FDA and the International Conference on Harmonization (ICH) at various stages of drug research and development. These guidelines and guidances include Good Laboratory Practice
(GLP), Good Clinical Practice (GCP), current Good Manufacturing Practice (cGMP), and Good Regulatory Practice (GRP). Another example of GSP is the guideline on Statistical Principles in Clinical Trials recently issued by the ICH.42 As a result, GSP can not only ensure the accuracy and reliability of the results derived from the studies but also assure the validity and integrity of the studies. The implementation of GSP in pharmaceutical research and development is a team effort, which requires mutual communication, confidence, respect, and cooperation among statisticians, pharmaceutical scientists in the related areas, and regulatory agents. The implementation of GSP involves some key factors that have an impact on its success. These factors include (i) regulatory requirements for statistics, (ii) the dissemination of the concept of statistics, (iii) an appropriate use of statistics, (iv) effective communication and flexibility, and (v) statistical training. These factors are briefly described below. In the pharmaceutical development and approval process, regulatory requirements for statistics are the key to the implementation of GSP. They not only enforce the use of statistics but also establish standards for the statistical evaluation of the drug products under investigation. An unbiased statistical evaluation helps pharmaceutical scientists and regulatory agents in determining (i) whether the drug product has the claimed effectiveness and safety for the intended disease, and (ii) whether the drug product possesses good drug characteristics such as the proper identity, strength, quality, purity, and stability. In addition to regulatory requirements, it is always helpful to disseminate the concept of the statistical principles described above whenever possible. It is important for pharmaceutical scientists and regulatory agents to recognize that (i) a valid statistical inference is necessary to provide a fair assessment, with a certain assurance, regarding the uncertainty of the drug product under investigation, (ii) an invalid design and analysis may result in a misleading or wrong conclusion about the drug product, and (iii) a larger sample size is often required to increase the statistical power and precision of the studies. The dissemination of the concept of statistics is critical to establishing pharmaceutical scientists' and regulatory agents' belief in statistics for scientific excellence. One of the commonly encountered problems in drug research and development is the misuse or sometimes the abuse of statistics in some studies. The misuse or abuse of statistics is a critical issue that may result in either having the right question with the wrong answer or having the right
answer for the wrong question. For example, for a given study, suppose that a right set of hypotheses (the right question) is established to reflect the study objective. A misused statistical test may provide a misleading or wrong answer to the right question. On the other hand, in many clinical trials, point hypotheses for equality (the wrong question) are often wrongly used for establishment of equivalency. In this case, we have right answer (for equality) for the wrong question. As a result, it is recommended that appropriate statistical methods be chosen to reflect the design, which should be able to address the scientific or medical questions regarding the intended study objectives for implementation of GSP. Communication and flexibility are important factors to the success of GSP. Inefficient communication between statisticians and pharmaceutical scientists or regulatory agents may result in a misunderstanding of the intended study objectives and consequently an invalid design and/or inappropriate statistical methods. Thus, effective communications among statisticians, pharmaceutical scientists and regulatory agents is essential for the implementation of GSP. In addition, in many studies, the assumption of a statistical design or model may not be met due to the nature of drug product under investigation, experimental environment, and/or other causes related/unrelated to the studies. In this case, the traditional approach of doing everything by the book does not help. In practice, since the concerns from a pharmaceutical scientist or the regulatory agent may translate into a constraint for a valid statistical design and appropriate statistical analysis, it is suggested that a flexible and yet innovative solution be developed under the constraints for the implementation of GSP. Since regulatory requirements for the drug development and approval process vary from drug to drug and country to country, various designs and/or statistical methods are often required for a valid assessment of a drug product. Therefore, it is suggested that statistical continued/advanced education and training programs be routinely held for both statisticians and non-statisticians including pharmaceutical scientists and regulatory agents. The purpose of such continued/advanced education and/or training program is threefold. First, it enhances communications within the statistical community. Statisticians can certainly benefit from such a training and/or educational program by acquiring more practical experience and knowledge. In addition, it provides the opportunity to share/exchange information, ideas and/or concepts regarding drug development between professional societies. Finally, it identifies critical practical and/or regulatory issues that are commonly encountered in drug development and regulatory approval
process. A panel discussion from different disciplines may result in some consensus to resolve the issues, which helps in establishing standards of statistical principles for implementation of GSP. 3. Pharmaceutical Validation 3.1. Assay validation When a new pharmaceutical compound is discovered, the FDA requires that an analytical method or test procedure for determination of the active ingredients of the compound be developed and validated before it can be applied to animal and/or human subjects. The cGMP requires that test methods, which are used for assessing compliance of pharmaceutical products with established specifications, must meet proper standards of accuracy and reliability. The USP/NF defines the validation of analytical methods as the process by which it is established, in laboratory studies, that performance characteristics of the methods meet the requirement for the intended analytical application. The analytical application may be referred to as a drug potency which is usually based on gas chromatography (GC) or high performance liquid chromatography (HPLC) for potency and stability studies, immunoassays such as radioimmunoassay (RIA) for the in vitro activity of an antibody or antigen, or a biological assay for the in vivo activity such as median effective dose (ED50 ). The performance characteristics include accuracy, precision, limit of detection (LOD), limit of quantitation (LOQ), selectivity (or specificity), linearity, range, and ruggedness, which are useful measures for assessment of accuracy and reliability of the assay results. Among these performance characteristics, accuracy, precision, and ruggedness are considered the primary parameters for the validation of an analytical method. For the validation of an analytical method, whether the analytical method can generate true values is often of great concern. To address this question, one may measure how close the assay result obtained by the analytical method is to the true value. This performance characteristic is referred to as the accuracy of the assay result. In practice, one may consider the analytical method to be validated in terms of accuracy if the mean value is within ±15% of the actual value, except at LOQ, where it should not deviate by more than 20%.60 In addition, the precision, which is defined as the degree of agreement among individual assay results when the assay method is applied repeatedly to multiple sampling of a homogenous sample can be measured based on measurement error of the assay. Similarly,
Shah et al.60 indicated that one may claim that the analytical method is validated if the precision around the mean value does not exceed a 15% coefficient of variation (CV), except for LOQ, where it should not exceed 20% CV. In many cases, different analysts and different laboratories under different operating circumstances such as different instruments, different lots of reagents, different elapse time, or different assay temperatures may perform a specific analytical method. Assay ruggedness is often used to assess the influence of uncontrollable factors or the degree of reproducibility on assay performance. One may conclude that the analytical method is validated in terms of reproducibility if its assay ruggedness is within 15% of the mean value. Accuracy is typically assessed using multiple testing by linear regression. Precision can be assessed by testing the null hypothesis that the variability is less than an acceptable limit. Typical approaches for assessing assay ruggedness include the one-way nested random effects model and the two-way crossed-classification mixed model. For the assessment of assay ruggedness, it should be noted, however, that the classical analysis of variance method may produce negative estimates for the variance components and that the sum of best estimates of variance components may not be the best estimate of the total variability. In these situations, methods proposed by Chow and Shao14 and Chow and Tse23 are useful. In practice, the validation of an analytical method can be carried out by the following steps: First, it is important to develop a prospective protocol which clearly states the validation design, sampling procedure, acceptance criteria for the performance characteristics to be evaluated, and how the validation is to be carried out. Second, collect the data and document the experiment, including any violations from the protocol that may occur. The data should be audited to assure their quality. The collected data are then analyzed based on appropriate statistical methods. Appropriate statistical methods are referred to as those methods, which can reflect the validation design and meet the study objective. Finally, draw a conclusion regarding whether the analytical method is validated based on the statistical inference drawn about the accuracy, precision, and ruggedness of the assay results. 3.2. Process validation The objective of the validation of a manufacturing process is to ensure that the manufacturing process does what it purports to do. A validated
process assures that the final product has a high probability of meeting the standards for identity, strength, quality, purity, and stability of the drug product. A manufacturing process is a continuous process, which usually involves a number of critical stages. For example, for the manufacturing of tablets, the process may include initial blending, mill, primary blending, final blending, compression, and coating stages. At each critical stage, some problems may occur. For example, the ingredients may not be uniformly mixed at the primary blending stage; the segregation may occur at the final blending stage, and the weight of tablets may not be suitably controlled during the compression stage. In practice, therefore, it is important to evaluate the performance of the manufacturing at each critical stage by testing in process and/or processed materials for potency, dosage uniformity, dissolution, and disintegration according to sampling plans and acceptance criteria stated in the USP/NF. These tests are usually referred to as the USP tests. For sampling plans of USP tests, the USP/NF requires that representative samples be drawn from the container. A manufacturing process is considered to pass the USP/NF tests if each critical stage of the manufacturing process and the final product meet the required USP/NF specifications for the identity, strength, quality, and purity of the drug product. A manufacturing process is considered validated if at least three validation batches (or lots) pass all required USP/NF tests. Since manufacturing procedures vary from drug product to drug product and/or from site to site during the development of a validation protocol of manufacturing process, it is important to discuss the issues such as (i) critical stages, (ii) equipment to be used at each critical stage, (iii) possible problems, (iv) USP tests to be performed, (v) sampling plans, (vi) testing plans, (vii) acceptance criteria, (viii) pertinent information, (ix) test or specification to be used as reference, and (x) validation summary with project scientists to acquire a good understanding of the manufacturing process. Process validation usually refers to as the establishment of documented evidence that a process does what it purports to do. Basically, there are four different types of manufacturing process validations in the pharmaceutical industry: prospective, concurrent, retrospective, and re-validation. Prospective validation establishes documented evidence that a process does what it purports to do based on a preplanned protocol. Prospective validation is usually performed in the situations where (i) historical data are not available or sufficient and in-process and end-product testing data are not adequate, (ii) new equipment or components are used, (iii) a new product
is reformulated from an existing product or there are significant modifications or changes in the manufacturing process, and (iv) the manufacturing process is transferred from the development laboratory to full-scale production. Retrospective validation provides documented evidence based on a review and analysis of historical information, which is useful when there is a stable process with a larger historical database. One of the objectives of retrospective validation is to support the confidence in the process. Concurrent validation evaluates the process based on information generated during the actual implementation of the process. In situations where (i) a step of the process is modified, (ii) the product is made infrequently, or (iii) a new raw material must be introduced, a concurrent validation is recommended. In practice, a well-established manufacturing process may need to be re-validated when there are changes in critical components (e.g. raw materials), changes/replacement of equipment, changes in facility/plant (e.g. location or size), or a significant increase and/or decrease in batch size. Even for a validated process, there is no guarantee that, if the test is performed again, it will have a high probability of meeting the specification. Thus, it is of interest to construct some in-house acceptance limits (specifications) which guarantee that future batches produced by the process will pass the USP test with a high probability. A common approach to process validation is to obtain a single sample and test the attributes of interest to see whether the USP/NF specifications are met. Bergum4 proposed constructing acceptance limits that guarantee that future samples from a batch will meet a given product specification a given percentage of times. The idea is to consider a multiple stage test. If the criteria for the first stage are met, the test is passed. If the criteria for the first stage are not met, then additional stages of testing are done. If the criteria at any stage are met, the test is passed. Acceptance limits for a validation sample are then constructed based on the sample mean and standard deviation of the test results to assure that a future sample will have at least a certain chance of passing a multiple stage test. More details can be found in Chow and Liu.10

4. Stability Studies

4.1. Drug shelf-life

For every drug product in the marketplace, the FDA requires that an expiration dating period (or shelf-life) be indicated on the immediate container label. The shelf-life is defined as the time interval during which the characteristics of a drug product (e.g. strength) will remain within
the approved specifications after manufacture. Along this line, Shao and Chow63 studied several statistical procedures for estimation of drug shelf-life. Before a shelf-life of a drug product can be granted by the FDA, the manufacturers (drug companies) need to demonstrate, through a stability study, that the average drug characteristics meet the approved specifications during the claimed shelf-life period. For determination of the shelf-life of a drug product, both the FDA stability guideline and the stability guideline issued by the ICH require that a long term stability study be conducted to characterize the degradation of the drug product over a time period under appropriate storage conditions. Both the FDA and ICH stability guidelines suggest that stability testing be performed at 3-month intervals during the first year, 6-month intervals during the second year, and annually thereafter. The degradation curve can then be used to establish an expiration dating period or shelf-life applicable to all future batches of the drug product. For a single batch, the FDA stability guideline indicates that an acceptable approach for drug products whose characteristic is expected to decrease with time is to determine the time at which the 95% one-sided lower confidence bound for the mean degradation curve intersects the acceptable lower product specification limit, e.g. as specified in the USP/NF.29

4.2. Statistical model

Consider the case where the drug characteristic is expected to decrease with time. The other case can be treated similarly. Assume that the drug characteristic decreases linearly over time (i.e. the degradation curve is a straight line). In this case, the slope of the straight line is considered as the rate of stability loss of the product. Let X_j be the jth sampling (testing) time point (i.e. 0 months, 3 months, etc.) and Y_ij be the corresponding testing result of the ith batch (j = 1, . . . , n; i = 1, . . . , k). Then

Y_ij = α_i + β_i X_j + e_ij ,   (1)

where the e_ij are assumed to be independent and identically distributed (i.i.d.) random errors with mean 0 and variance σ_e². The total number of observations is N = kn. The α_i (intercepts) and β_i (slopes) vary randomly from batch to batch. It is assumed that the α_i (i = 1, . . . , k) are i.i.d. with mean a and variance σ_a², that the β_i (i = 1, . . . , k) are i.i.d. with mean b and variance σ_b², and that the e_ij, α_i, and β_i are mutually independent. If σ_a² = 0 (i.e. the α_i are equal), then the above model has a common intercept. Similarly, if σ_b² = 0 (i.e. the β_i are equal), then the above model has
a common slope. If both σ_a² = 0 and σ_b² = 0, then there is no batch-to-batch variation and the above model reduces to a simple linear regression. Under the above model, Chow and Shao15 proposed several statistical tests for batch-to-batch variation.

4.3. Statistical methods

4.3.1. Fixed batches approach

If there is no batch-to-batch variation, a commonly used method for fitting the above model is ordinary least squares (OLS), and a 95% lower confidence bound for E(Y) = a + bξ, the expected drug characteristic at time ξ, can be obtained as

â + b̂ξ − t_0.95 S(ξ) ,

where â and b̂ are the OLS estimators of a and b, respectively, t_0.95 is the one-sided 95th percentile of the t distribution with N − 2 degrees of freedom, and

S²(ξ) = MSE { 1/N + (ξ − X̄)² / [k Σ_{j=1}^n (X_j − X̄)²] } ,

where

X̄ = (1/n) Σ_{j=1}^n X_j   and   MSE = [1/(N − 2)] Σ_{i=1}^k Σ_{j=1}^n (Y_ij − â − b̂X_j)² .

The estimated shelf-life can be obtained by solving the equation

η = â + b̂ξ − t_0.95 S(ξ) ,

where η is a given approved lower specification limit. When there is batch-to-batch variation (i.e. there are different intercepts and different slopes), the FDA recommends that the minimum approach be used for estimation of the shelf-life of a drug product. The minimum approach takes the minimum of the estimated shelf-lives of the individual batches. The minimum approach, however, has received considerable criticism because it lacks statistical justification. As an alternative, Ruberg and Hsu58 proposed an approach using the concept of multiple comparisons
to derive some criteria for pooling batches with the worst batch. The idea is to pool the batches that have slopes similar to the worst degradation rate with respect to a pre-determined similarity (equivalence) limit.

4.3.2. Random batches approach

As indicated in the FDA guideline, the batches used in long-term stability studies for establishment of drug shelf-life should constitute a random sample from the population of future production batches. In addition, all estimated shelf-lives should be applicable to all future batches. As a result, statistical methods based on a random effects model seem more appropriate. In recent years, several methods for determination of drug shelf-life with random batches have been considered.7,15,16,49,61 Under the assumption that batch is a random variable, stability data can be described by a linear regression model with random coefficients. Consider the following model

Y_ij = X_ij′ β_i + e_ij ,

where Y_ij is the jth assay result (percent of label claim) for the ith batch, X_ij is a p×1 vector of the jth value of the regressor for the ith batch and X_ij′ is its transpose, β_i is a p×1 vector of random effects for the ith batch, and e_ij is the random error in observing Y_ij. Note that X_ij′ β_i is the mean drug characteristic for the ith batch at X_ij (conditional on β_i). The primary assumptions for the model are similar to those for model (1). X_ij is usually chosen to be x_j for all i, where x_j is a p×1 vector of nonrandom covariates which could be of the form (1, t_j, t_j w_j)′ or (1, t_j, w_j, t_j w_j)′, t_j being the jth time point and w_j the jth value of a q×1 vector of nonrandom covariates (e.g. package type and dosage strength). Denote x_j = x(t_j, w_j), where x(t, w) is a known function of t and w. If there is no batch-to-batch variation, the average drug characteristic at time t is x(t)′b and the true shelf-life is equal to

t̄_true = inf{t : x(t)′b ≤ η} ,

which is an unknown but nonrandom quantity. The shelf-life is then given by

t̄ = inf{t : L(t) ≤ η} ,   where   L(t) = x(t)′b̂ − t_{α,nk−p} [SSR · x(t)′(X′X)^{−1} x(t) / (k(nk − p))]^{1/2} ,
where SSR is the usual sum of squared residuals from the ordinary least squares regression. When there is batch-to-batch variation, t_true is random since β_i is random. Chow and Shao16 and Shao and Chow61 proposed considering a (1 − α) × 100% lower confidence bound of the εth quantile of t_true as the labeled shelf-life, where ε is a given small positive constant. That is,

P{t_label ≤ t_ε} ≥ 1 − α ,   where t_ε satisfies P{t_true ≤ t_ε} = ε .

It follows that

t_ε = inf{t : x(t)′b − η = z_ε σ(t)} ,

where z_ε = Φ^{−1}(1 − ε) and σ(t) is the standard deviation of x(t)′β_i. As a result, the shelf-life is given by

t̄ = inf{t : x(t)′b̄ ≤ η̄(t)} ,

where

η̄(t) = η + c_κ(ε, α) z_ε √v(t) ,
c_κ(ε, α) = t_{α,K−1,√k z_ε} / (√k z_ε) ,
v(t) = [1/(k − 1)] x(t)′(X′X)^{−1} X′SX(X′X)^{−1} x(t) .

Note that t_{α,K−1,√k z_ε} is the αth upper quantile of the noncentral t distribution with (k − 1) degrees of freedom and noncentrality parameter √k z_ε.
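For the simplest setting of Sec. 4.3.1 (no batch-to-batch variation), the shelf-life estimate amounts to finding where the 95% one-sided lower confidence bound of the pooled OLS degradation line crosses the specification limit. The sketch below does this with hypothetical assay data (percent of label claim) and a hypothetical limit η = 90; it does not implement the FDA minimum approach or the random-batch procedure above.

```python
import numpy as np
from scipy import stats

def fixed_batch_shelf_life(times, assays, eta=90.0, grid_max=60.0, step=0.1):
    """Shelf-life estimate assuming no batch-to-batch variation (Sec. 4.3.1).

    times, assays: stability data pooled over all batches.
    Returns the earliest time at which the one-sided 95% lower confidence
    bound of the fitted mean degradation line reaches the limit eta.
    """
    x = np.asarray(times, dtype=float)
    y = np.asarray(assays, dtype=float)
    n = len(x)
    b, a = np.polyfit(x, y, 1)                 # slope b, intercept a
    resid = y - (a + b * x)
    mse = resid @ resid / (n - 2)
    sxx = np.sum((x - x.mean()) ** 2)
    t95 = stats.t.ppf(0.95, df=n - 2)
    for xi in np.arange(0.0, grid_max, step):
        se = np.sqrt(mse * (1.0 / n + (xi - x.mean()) ** 2 / sxx))
        if a + b * xi - t95 * se <= eta:       # lower bound hits the spec limit
            return xi
    return grid_max

# hypothetical data: 3 batches tested at 0, 3, 6, 9, 12 and 18 months
t = np.tile([0, 3, 6, 9, 12, 18], 3)
rng = np.random.default_rng(1)
y = 100.5 - 0.35 * t + rng.normal(0, 0.5, size=t.size)
print(round(fixed_batch_shelf_life(t, y), 1), "months")
```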
4.4. Two-phase shelf-life estimation

Unlike most drug products, some drug products are required to be stored at several temperatures, such as −20°C, 5°C and 25°C (room temperature), in order to maintain stability until use.47 Drug products of this kind are usually referred to as frozen drug products. Unlike the usual drug products, a typical shelf-life statement for frozen drug products usually consists of multiple phases with different storage temperatures. For example, a commonly adopted shelf-life statement for frozen products could be either (i) 24 months at −20°C followed by 2 weeks at 5°C or two days at 25°C, or (ii) 24 months at −20°C followed by 2 weeks at 5°C and one day at 25°C.
As a result, the drug shelf-life is determined based on a two-phase stability study. The first phase stability study determines the drug shelf-life under a frozen storage condition such as −20°C, while the second phase stability study estimates the drug shelf-life under refrigerated or ambient conditions. A first phase stability study is usually referred to as a frozen study and a second phase stability study is known as a thawed study. Since the stability study of a frozen drug product consists of frozen and thawed studies, the determination of the shelf-life involves a two-phase linear regression. The frozen study is usually conducted similarly to a regular long term stability study except that the drug is stored under frozen conditions. In other words, stability testing is normally conducted at 3-month intervals during the first year, 6-month intervals during the second year, and annually thereafter. Stability testing for the thawed study is conducted following the stability testing for the frozen study and may be performed at 2-day intervals for up to two weeks. It should be noted that the stability at the second phase (i.e. the thawed study) might depend upon the stability at the first phase (i.e. the frozen study). In other words, an estimated shelf-life from a thawed study that follows the 3-month stability testing of the frozen study may be longer than that obtained from a thawed study that follows the frozen study at 6 months. For simplicity, Mellon47 suggested that stability data from the frozen study and the thawed study be analyzed separately to obtain a combined shelf-life for the drug product. As an alternative, Shao and Chow62 considered the following method for determination of drug shelf-lives for the two phases, based on a similar concept proposed before.16,61 For the first phase shelf-life, we have stability data

Y_ik = α + βt_i + ε_ik ,

where i = 1, . . . , I ≥ 2 (typically t_i = 0, 3, 6, 9, 12, 18 months), k = 1, . . . , K_i ≥ 1, α and β are unknown parameters, and the ε_ik's are i.i.d. random errors with mean 0 and variance σ_1² > 0. The total number of data for the first phase is n_1 = Σ_i K_i (= IK if K_i = K for all i). At time t_i, K_ij ≥ 1 second phase stability data are collected at time intervals t_ij, j = 1, . . . , J ≥ 2. The total number of data for the second phase is n_2 = Σ_i Σ_j K_ij (= IJK if K_ij = K for all i and j). Data from the two phases are independent. Typically, t_ij = t_i + s_j, where s_j = 1, 2, 3 days, etc. Let α(t) and β(t) be the intercept and slope of the second phase degradation line at time t. Since the degradation lines for the two phases intersect,

α(t) = α + βt .
Then, at time ti, i = 1, . . . , I, we have stability data
Yijk = α + β ti + β(ti) sj + eijk ,
where β(t) is an unknown function of t and the eijk's are i.i.d. random errors with mean 0 and variance σ2² > 0. We assume that β(t) is a polynomial in t. Typically,
β(t) = β0 (common slope model),
β(t) = β0 + β1 t (linear trend model), or
β(t) = β0 + β1 t + β2 t² (quadratic trend model).
In general,
\[
\beta(t) = \sum_{h=0}^{H} \beta_h t^h ,
\]
where the βh's are unknown parameters and H + 1 < Σj Kij for all i, and H < I.
4.4.1. First phase shelf-life
The first phase shelf-life can be determined from the first phase data {Yik} as the time point at which the lower product specification limit intersects the 95% lower confidence bound of the mean degradation curve.29,40 Let α̂ and β̂ be the least squares estimators of α and β based on the first phase data, and let
\[
L(t) = \hat\alpha + \hat\beta t - t_{.05;\, n_1 - 2}\sqrt{v(t)}
\]
be the 95% lower confidence bound for α + βt, where t.05;n1−2 is the upper 0.05 quantile of the t-distribution with (n1 − 2) degrees of freedom,
\[
v(t) = \hat\sigma_1^2\;\frac{\sum_{i,k} t_i^2 - 2\big(\sum_{i,k} t_i\big)t + n_1 t^2}{n_1 \sum_{i,k} t_i^2 - \big(\sum_{i,k} t_i\big)^2},
\]
and
\[
\hat\sigma_1^2 = \frac{1}{n_1 - 2}\sum_{i,k}\big(Y_{ik} - \hat\alpha - \hat\beta t_i\big)^2
\]
is the usual error variance estimator based on the residuals. Suppose that the lower specification limit for the drug characteristic is η (we assume that α + βt decreases as t increases). Then the first phase shelf-life is the first solution of L(t) = η, i.e.
t̂ = inf{t : L(t) ≤ η} .
Note that the first phase shelf-life is constructed so that
P{t̂ ≤ the true first phase shelf-life} = 95% ,
assuming that the εik's are normally distributed. Without the normality assumption, the result holds approximately for large n1.

4.4.2. The case of equal second phase slopes
To introduce the idea, we first consider the simple case where the slopes of the second phase degradation lines are the same. When β(t) ≡ β0, the common slope β0 can be estimated by the least squares estimator based on the second phase data,
\[
\hat\beta_0 = \frac{\sum_{i,j,k}(s_j - \bar s)\,Y_{ijk}}{\sum_{i,j,k}(s_j - \bar s)^2},
\]
where the sj are the second phase time intervals and s̄ is the average of the sj's. The variance of β̂0 is
\[
V(\hat\beta_0) = \frac{\sigma_2^2}{\sum_{i,j,k}(s_j - \bar s)^2},
\]
which can be estimated by
\[
\hat V(\hat\beta_0) = \frac{\hat\sigma_2^2}{\sum_{i,j,k}(s_j - \bar s)^2},
\]
where
\[
\hat\sigma_2^2 = \frac{1}{n_2 - I(H + 2)}\sum_{i,j,k}\big(Y_{ijk} - (\hat\alpha + \hat\beta t_i) - \hat\beta_0 s_j\big)^2 .
\]
For fixed t and s, let v(t, s) = v(t) + V̂(β̂0)s² and
\[
L(t, s) = \hat\alpha + \hat\beta t + \hat\beta_0 s - t_{.05;\, n_1 + n_2 - 2 - I(H+2)}\sqrt{v(t, s)}\, .
\]
For any fixed t less than the true first phase shelf-life, i.e. t satisfying α + βt > η, the second phase shelf-life can be estimated as
ŝ(t) = inf{s ≥ 0 : L(t, s) ≤ η}
(if L(t, s) < η for all s, then ŝ(t) = 0). That is, if the drug product is taken out of the first phase storage condition at time t, then the estimated second phase shelf-life is ŝ(t). The justification for ŝ(t) is that, for any t satisfying α + βt > η,
P{ŝ(t) ≤ the true second phase shelf-life} = 95% ,
assuming that the εik's and eijk's are normally distributed. Without the normality assumption, the above result holds approximately for large n1 and n2. In practice, the time at which the drug product is taken out of the first phase storage condition is unknown. In such a case we may apply the following method to assess the second phase shelf-life. Select a set of time intervals tl < t̂, l = 1, . . . , L, and construct a table (or a figure) of (tl, ŝ(tl)), l = 1, . . . , L. If a drug product is taken out of the first phase storage condition at a time t0 between tl and tl+1, then its second phase shelf-life is ŝ(tl+1). However, a single shelf-life label may be required; we propose the following method.

4.4.3. Determination of a single two-phase shelf-life label
In most cases, L(t̂, s) is less than η for all s, i.e. ŝ(t̂) = 0. Hence, we propose to select a t̂1 < t̂ such that ŝ(t̂1) > 0 and to use t̂1 + ŝ(t̂1) as the two-phase shelf-life label. The justification for this two-phase shelf-life label is as follows.
1. If the drug product is stored under the first phase storage condition until time t̂1, then
P{t̂1 ≤ the true first phase shelf-life} ≥ 95% ,
since t̂1 < t̂.
2. If the drug product is taken out of the first phase storage condition at a time t0 ≤ t̂1 < t̂, then the labeled second phase shelf-life is ŝ(t̂1), and
P{ŝ(t̂1) ≤ the true second phase shelf-life at time t0} ≥ P{ŝ(t0) ≤ the true second phase shelf-life at time t0} ≥ 95% .
However, this two-phase shelf-life label is very conservative if t0 is much less than t̂1. A general rule for choosing t̂1 is that t̂1 should be close to t̂ while ŝ(t̂1) is reasonably large. For example, if the units of the first and second phase shelf lives are months and days, respectively, and if t̂ = 24.5, then we can choose t̂1 = 24; if t̂ = 24, then we may choose t̂1 = 23. A table of (tl, ŝ(tl)), l = 1, . . . , L, will be useful for the selection of t̂1.

4.4.4. The general case of unequal second phase slopes
In general, the slope of the second phase degradation line varies with time. Let Ȳi be the average of the Yijk's with a fixed i, Zijk = Yijk − Ȳi, and Xhij = (sj − s̄)t_i^h. Then the least squares estimator of (β0, . . . , βH), denoted by (β̂0, . . . , β̂H), is obtained by fitting the linear regression model
\[
Z_{ijk} = \sum_{h=0}^{H} \beta_h X_{hij} + \text{error} .
\]
Let
\[
\hat\beta(t) = \sum_{h=0}^{H} \hat\beta_h t^h
\]
and
\[
\hat V(\hat\beta(t)) = \hat\sigma_2^2\, \mathbf{1}'(X'X)^{-1}\mathbf{1} ,
\]
where 1′ = (1, t, t², . . . , t^H), X is the design matrix, and
\[
\hat\sigma_2^2 = \frac{1}{n_2 - (H + 2)}\sum_{i,j,k}\Big(Z_{ijk} - \sum_{h=0}^{H}\hat\beta_h X_{hij}\Big)^2 .
\]
The second phase shelf-life and the two-phase shelf-life label can be determined in the same way as described in the previous section, with
\[
L(t, s) = \hat\alpha + \hat\beta t + \hat\beta(t)s - t_{.05;\, n_1 + n_2 - (H+4)}\sqrt{v(t, s)}
\]
and
\[
v(t, s) = v(t) + \hat V(\hat\beta(t))\, s^2 .
\]
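To make the two-phase procedure concrete, the following is a minimal sketch in Python of the common-slope case of Sec. 4.4.2 (the general case only changes how the second-phase slope and its variance are estimated). The data, the specification limit η, and the time grids are hypothetical, and a real analysis would follow the cited references rather than this simplified illustration.

```python
# Sketch of two-phase shelf-life estimation (common-slope case).
# All data and the specification limit eta are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# ---- first-phase (frozen) data: months t_i with K_i = 2 assays each ----
t1 = np.repeat([0, 3, 6, 9, 12, 18], 2).astype(float)
y1 = 100 - 0.45 * t1 + rng.normal(0, 0.4, t1.size)
eta = 90.0                                   # lower specification limit

n1 = y1.size
X1 = np.column_stack([np.ones(n1), t1])
(ahat, bhat), *_ = np.linalg.lstsq(X1, y1, rcond=None)
sig1_sq = np.sum((y1 - ahat - bhat * t1) ** 2) / (n1 - 2)

def v1(t):
    st, st2 = t1.sum(), (t1 ** 2).sum()
    return sig1_sq * (st2 - 2 * t * st + n1 * t ** 2) / (n1 * st2 - st ** 2)

def L1(t):
    return ahat + bhat * t - stats.t.ppf(0.95, n1 - 2) * np.sqrt(v1(t))

grid = np.arange(0.0, 60.0, 0.01)            # assumes the bound crosses eta here
t_hat = grid[L1(grid) <= eta][0]             # first t with L(t) <= eta
print(f"first-phase shelf-life t_hat = {t_hat:.2f} months")

# ---- second-phase (thawed) data: s_j days after thawing at each t_i ----
s_days = np.arange(2.0, 15.0, 2.0)
ti_u = np.unique(t1)
ti2, s2 = map(np.ravel, np.meshgrid(ti_u, s_days, indexing="ij"))
y2 = 100 - 0.45 * ti2 - 0.3 * s2 + rng.normal(0, 0.4, ti2.size)
n2, I, H = y2.size, ti_u.size, 0             # H = 0: common slope model

b0 = np.sum((s2 - s2.mean()) * y2) / np.sum((s2 - s2.mean()) ** 2)
sig2_sq = np.sum((y2 - (ahat + bhat * ti2) - b0 * s2) ** 2) / (n2 - I * (H + 2))
Vb0 = sig2_sq / np.sum((s2 - s2.mean()) ** 2)

def L2(t, s):
    df = n1 + n2 - 2 - I * (H + 2)
    return (ahat + bhat * t + b0 * s
            - stats.t.ppf(0.95, df) * np.sqrt(v1(t) + Vb0 * s ** 2))

def s_hat(t, s_grid=np.arange(0.0, 30.0, 0.1)):
    below = L2(t, s_grid) <= eta             # s_hat(t) = first s where bound <= eta
    return s_grid[below][0] if below.any() else s_grid[-1]

t_hat1 = np.floor(t_hat) - 1                 # a simple choice of t_hat1 < t_hat
print(f"label: {t_hat1:.0f} months frozen followed by {s_hat(t_hat1):.1f} days thawed")
```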
The proposed method for two-phase shelf-life estimation assumes that the assay variabilities are the same across the two phases. Detailed information regarding two-phase shelf-life estimation can be found in the literature.17,62 In practice, the assay variability may vary from phase to phase. In this case, the proposed method must be modified for determination of the expiration dating period of the drug product. In practice, it is also of interest to determine the allocation of sample sizes to the two phases. For a fixed total sample size, it is of interest to examine the relative efficiency of shelf-life estimation when more sampling time points are placed in the first phase and fewer in the second phase, or fewer in the first phase and more in the second phase. The allocation of sampling time points to each phase is therefore an interesting research topic for two-phase shelf-life estimation. In addition, since the degradation in the second phase is highly correlated with the degradation in the first phase, it may be of interest to examine this correlation for future design planning.

4.5. Practical issues

4.5.1. Matrixing and bracketing designs
For a new drug product, stability studies are conducted not only to characterize the degradation of the compound over time but also to determine the expiration dating period (shelf-life). The estimated shelf-life should be applicable to all strengths and packages of the drug product. However, accelerated stability testing is required for 6 months and long-term stability testing is required for the length of the shelf-life, so the cost of the stability studies can be substantial. As a result, it is of interest to adopt a design in which only a fraction of the total number of samples is tested while still maintaining the validity, accuracy, and precision of the estimated shelf-life. For this reason, matrixing and bracketing designs have become increasingly popular in stability studies for drug research and development. As indicated in the ICH stability guideline, a bracketing design is defined as the design of a stability schedule such that, at any time point, only the samples on the extremes, for example of container size and/or dosage strength, are tested.41 A matrixing design is a design in which only a fraction of the total number of samples is tested at any specified sampling point.41,51 Matrixing and bracketing designs were evaluated by Pong and Raghavarao.54
Lin43 indicated that a matrixing design might be applicable to strengths if there is no change in the proportion of active ingredients, to container sizes, and to intermediate sampling time points. The application of a matrixing design to situations such as closure systems, orientation of the container during storage, packaging form, manufacturing process, and batch size should be evaluated carefully. It is discouraged to apply a matrixing design to the sampling times at the two endpoints (i.e. the initial and the last) and at any time points beyond the desired expiration date. If the drug product is sensitive to temperature, humidity, or light, a matrixing design should be avoided.

4.5.2. Bias and interval estimation of shelf-life
As indicated in the FDA stability guideline, the estimated shelf-life of a drug product is obtained as the time point at which the 95% one-sided lower confidence limit for the mean degradation curve intersects the acceptable lower specification limit. In practice, it is of interest to study the bias of the estimated shelf-life. If the bias is positive, the estimated shelf-life overestimates the true shelf-life. On the other hand, if there is a downward bias, the estimated shelf-life is said to underestimate the true shelf-life. In the interest of the safety of the drug product, the FDA might prefer a conservative approach, that is, to underestimate rather than overestimate the true shelf-life. Sun et al.66 studied the distributional properties of the estimated shelf-life16,61 both with and without batch-to-batch variation. The results indicate that when there is no batch-to-batch variation (i.e. σa² = σb² = 0), there is a downward bias given by
\[
\frac{t_\alpha \sigma_e}{b}\left[\frac{1}{n} + \frac{(b\bar X + a - \eta)^2}{b^2 \sum_{j=1}^{n}(X_j - \bar X)^2}\right]^{1/2},
\]
where tα is the (1 − α)th quantile of the t distribution with (k − 1) degrees of freedom.

4.5.3. Shelf-life estimation with multiple active components
For the study of drug stability, the FDA guideline requires that all drug characteristics be evaluated. For most drug products, an estimated drug shelf-life is obtained based primarily on the study of the stability of the strength of the active ingredient. However, some drug products may contain more than one active ingredient. For example, Premarin (conjugated estrogens, USP) contains three active ingredients: estrone, equilin, and 17α-dihydroequilin.
The specification limits for each component are different. To ensure identity, strength, quality, and purity, it is suggested that each component be evaluated separately for determination of the drug shelf-life. In this case, although a similar concept can be applied, the method suggested in the FDA stability guideline must be modified. It should be noted that the assay values observed for the components might not add up to a fixed total, owing to the possible assay variability of each component. The modified model should be able to account for these sources of variation. Pong and Raghavarao55 proposed a statistical method for estimation of drug shelf-life for drug products with two components. The distributions of shelf-life for two components were evaluated by Pong and Raghavarao56 under different designs.

4.5.4. Stability analysis with discrete responses
For solid oral dosage forms such as tablets and capsules, the FDA stability guideline indicates that the following characteristics should be studied in stability studies: (i) tablets — appearance, friability, hardness, color, odor, moisture, strength, and dissolution; and (ii) capsules — strength, moisture, color, appearance, shape, brittleness, and dissolution. Some of these characteristics are measured on a discrete rating scale. As a result, the usual methods for stability analysis may not be appropriate. Chow and Shao18 proposed statistical methods for estimation of drug shelf-life based on discrete responses, following the concept described in the FDA stability guideline. However, it may be of interest to consider a mixture of a continuous response variable (e.g. strength) and a discrete response variable (e.g. color or hardness) for estimation of drug shelf-life. This requires further research.

5. Bioequivalence and Bioavailability
In pharmaceutical research and development, in vivo bioequivalence testing is usually considered a surrogate for the assessment of clinical efficacy and safety. This is based on the so-called Fundamental Bioequivalence Assumption that when two formulations of the same drug product or two drug products (e.g. a brand-name drug and its generic copy) are equivalent in the rate and extent of drug absorption, it is assumed that they will reach the same therapeutic effect or that they are therapeutically equivalent.13 Pharmacokinetic (PK) responses such as the area under the blood or plasma concentration-time curve (AUC) and the maximum concentration (Cmax) are
usually considered to assess the rate and extent of drug absorption. The current FDA regulation requires that evidence of bioequivalence in average bioavailabilities, in terms of some primary PK responses such as AUC and Cmax, between the two formulations of the same drug product or the two drug products be provided.28,31 This type of bioequivalence is usually referred to as average bioequivalence (ABE). Under the current ABE criterion, however, it is not clear whether we are able to demonstrate that the absorption profiles of a brand-name drug and its generic copies are similar; consequently, it is not clear whether the brand-name drug and its generic copies will have the same therapeutic effect in terms of efficacy and safety and hence can be used interchangeably. In the medical community, as more generic drug products become available in the marketplace, it is of great concern whether a number of generic drug products of the same brand-name drug can be used safely and interchangeably. Basically, drug interchangeability can be classified as drug prescribability or drug switchability. Drug prescribability refers to the physician's choice, when prescribing an appropriate drug product for a new patient, between a brand-name drug product and a number of its generic copies which have been shown to be bioequivalent to the brand-name drug product. The underlying assumption of drug prescribability is that the brand-name drug product and its generic copies can be used interchangeably in terms of the efficacy and safety of the drug product. Under current practice, the FDA only requires that evidence of equivalence in average bioavailabilities be provided; the bioequivalence assessment does not take into account equivalence in the variability of bioavailability. A relatively large intrasubject variability of a test drug product (e.g. a generic drug product) as compared to that of the reference drug product (e.g. its brand-name drug product) may present a safety concern. To overcome this disadvantage, in addition to providing evidence of ABE, it is recommended that bioequivalence in the variability of bioavailabilities between drug products be established. This type of bioequivalence is called population bioequivalence (PBE). In practice, although PBE is often considered for assessment of drug prescribability, it does not fully address drug switchability due to the possible existence of a subject-by-formulation interaction. Drug switchability is related to the switch from a drug product (e.g. a brand-name drug product) to an alternative drug product (e.g. a generic copy of the brand-name drug product) within the same subject, whose concentration of the drug product has been titrated to a steady, efficacious, and safe level. As a result, drug switchability is considered more critical than
drug prescribability in the study of drug interchangeability for patients who have been on medication for a while. To assure drug switchability, it is recommended that bioequivalence be assessed within individual subjects. This type of bioequivalence is known as individual bioequivalence (IBE). The concept of IBE has attracted the FDA's attention since it was introduced by Anderson and Hauck,1 and it has led to a significant change in regulatory considerations for the assessment of bioequivalence.32 In what follows, we focus on a review of the guidance on Statistical Approaches to Establishing Bioequivalence, which was recently issued by the FDA.32
5.1. Limitations of average bioequivalence
Under current FDA regulation, two formulations of the same drug or two drug products are said to be bioequivalent if the ratio of the means of the primary PK responses, such as AUC and Cmax, between the two formulations of the same drug or the two drug products is within (80%, 125%) with 90% assurance.28,31 A generic drug product can serve as a substitute for its brand-name drug product if it has been shown to be bioequivalent to the brand-name drug. The FDA, however, does not indicate that a generic drug can be substituted by another generic drug, even though both of the generic drugs have been shown to be bioequivalent to the same brand-name drug. Bioequivalence among generic copies of the same brand-name drug is not required. As more generic drugs become available in the marketplace, it is very likely that a patient may switch from one generic drug to another. Therefore, an interesting question for physicians and patients is whether the brand-name drug and its generic copies can be used safely and interchangeably. Chen6 pointed out that the current ABE approach for bioequivalence assessment has limitations for addressing drug interchangeability, especially drug switchability. These limitations include: (i) ABE focuses only on the comparison of population averages between the test and reference drug products; (ii) ABE does not provide independent estimates of the intrasubject variances of the drug products under study; and (iii) ABE ignores the subject-by-formulation interaction, which may have an impact on drug switchability. As a result, Chen6 suggested that the current regulation of ABE be switched to the approach of PBE and IBE to overcome these disadvantages. Chow and Liu11 proposed to perform a meta-analysis for an overview of ABE. The proposed meta-analysis provides an assessment of bioequivalence among generic copies of a brand-name drug that can be used as a tool to
monitor the performance of the approved generic copies of the brand-name drug. In addition, it provides more accurate estimates of the intersubject and intrasubject variabilities of the drug product.

5.2. Drug interchangeability
As indicated earlier, drug interchangeability can be classified as drug prescribability or drug switchability. It is recommended that PBE and IBE be used to assess drug prescribability and drug switchability, respectively. More specifically, the FDA guidance recommends that PBE be applied to new formulations, additional strengths, or new dosage forms in NDAs, while IBE should be considered for an ANDA (abbreviated new drug application) or an AADA (abbreviated antibiotic drug application) for generic drugs. In what follows, we will only focus on the concept, decision rule, and statistical method of IBE for assessment of drug interchangeability.

5.2.1. Individual bioequivalence
Individual bioequivalence is motivated by the 75/75 rule, which claims bioequivalence if at least 75% of individual subject ratios (i.e. the relative individual bioavailability of the generic drug product to the innovator drug product) are within (75%, 125%) limits. Along this line, Anderson and Hauck1 first proposed the concept of testing for individual equivalence ratios (TIER). The idea is to test individual bioequivalence based on the dichotomization of continuous PK metrics, by calculating the p value for observing at least the observed number of subjects who fall within the bioequivalence limits, given the minimum proportion of the population in which the two drug products must be equivalent in order to claim individual bioequivalence. It should be noted that no universal definition of IBE exists which is uniformly accepted by researchers from the regulatory agency, academia, and the pharmaceutical industry. For example, IBE may be established based on a comparison between distributions within each subject, or it could be based on the distribution of the difference or ratio within each subject.45 In addition to average bioavailability and variability of bioavailability, we may also consider assessment of the variability due to the subject-by-formulation interaction. In this case, IBE can be assessed by means of a union-intersection test approach, which concludes IBE if and only if all of the hypotheses are rejected at a pre-specified level of significance. Most current methods for assessment of IBE, however, are derived from the distribution of either the difference or the ratio within each subject. Under
this setting, IBE can be classified as probability-based or moment-based according to the different criteria for bioequivalence.1,27,39,59,64 To address drug switchability, the FDA proposed the following aggregated, scaled, moment-based one-sided criterion:
\[
\mathrm{IBC} = \frac{(\mu_T - \mu_R)^2 + \sigma_D^2 + (\sigma_{WT}^2 - \sigma_{WR}^2)}{\max(\sigma_{WR}^2, \sigma_{W0}^2)} \le \theta_I ,
\]
where σWT² and σWR² are the within-subject variances for the test drug product and the reference drug product, respectively, σD² is the variance due to the subject-by-formulation interaction, σW0² is a constant which can be adjusted to control the probability of passing IBE, and θI is the bioequivalence limit. The FDA 2001 guidance suggests that θI be chosen as
\[
\theta_I = \frac{(\ln 1.25)^2 + \varepsilon_I}{\sigma_{W0}^2},
\]
where εI is a variance allowance factor which can be adjusted to control the sample size. As indicated in the FDA 2001 guidance, εI may be fixed between 0.04 and 0.05. For the determination of σW0², the FDA 2001 guidance recommends the use of the individual difference ratio (IDR), which is defined as
\[
\mathrm{IDR} = \left[\frac{E(T - R)^2}{E(R - R')^2}\right]^{1/2}
= \left[\frac{(\mu_T - \mu_R)^2 + \sigma_D^2 + (\sigma_{WT}^2 + \sigma_{WR}^2)}{2\sigma_{WR}^2}\right]^{1/2}
= \left[\frac{\mathrm{IBC} + 2}{2}\right]^{1/2}.
\]
Therefore, assuming that the maximum allowable IDR is 1.25, substituting (ln 1.25)²/σW0² for IBC, without adjustment of the variance term, approximately yields σW0 = 0.2. The FDA 2001 guidance suggests that a mixed effects model, in conjunction with the restricted maximum likelihood (REML) method, be used to estimate the variance components σD², σWT², and σWR². An intuitive statistical test can then be obtained by simply replacing the unknown parameters with their corresponding estimates. However, the exact statistical properties of the resulting test are unknown. The FDA 2001 guidance recommends that the small sample method proposed by Hyslop et al.38 be used to obtain the confidence interval or confidence bound for the test. If the upper 95% confidence bound is less than θI, IBE is concluded.
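As a simple illustration of how the pieces of the criterion fit together, the sketch below evaluates θI, the point value of the aggregated criterion, and the implied IDR from hypothetical variance-component estimates. It is not the FDA procedure itself: in practice the variance components would come from a REML fit of the replicated crossover data, and the decision would be based on an upper confidence bound (e.g. the method of Hyslop et al.38), not on the point estimate.

```python
# Sketch: evaluating the aggregated IBE criterion from (hypothetical) point
# estimates of the variance components on the log scale.
import numpy as np

mu_T, mu_R = 0.00, 0.05          # means of log(AUC) or log(Cmax), hypothetical
s2_D = 0.01                      # subject-by-formulation interaction variance
s2_WT, s2_WR = 0.04, 0.03        # within-subject variances, test and reference
s2_W0 = 0.04                     # regulatory constant (sigma_W0 = 0.2)
eps_I = 0.05                     # variance allowance factor

theta_I = (np.log(1.25) ** 2 + eps_I) / s2_W0
denom = max(s2_WR, s2_W0)        # reference-scaled if s2_WR >= s2_W0, else constant-scaled
IBC = ((mu_T - mu_R) ** 2 + s2_D + (s2_WT - s2_WR)) / denom

# individual difference ratio from its definition; when s2_WR >= s2_W0
# (reference-scaled case), IDR equals sqrt((IBC + 2) / 2)
IDR = np.sqrt(((mu_T - mu_R) ** 2 + s2_D + s2_WT + s2_WR) / (2 * s2_WR))

print(f"theta_I = {theta_I:.3f}, IBC = {IBC:.3f}, IDR = {IDR:.3f}")
print("point estimate passes IBE" if IBC < theta_I else "point estimate fails IBE")
```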
5.3. A review of the FDA guidance on population/individual bioequivalence
As indicated earlier, the FDA 2001 guidance on Statistical Approaches to Establishing Bioequivalence is intended to address drug interchangeability. As a result, the guidance for assessment of PBE and IBE has a significant impact on pharmaceutical research and development. In what follows, we provide a comprehensive review of the FDA 2001 guidance on population and individual bioequivalence from both scientific/statistical and practical points of view. Without loss of generality, we will only focus on IBE.

5.3.1. Aggregated criteria vs. disaggregated criteria
The FDA 2001 guidance recommends the aggregated criterion described earlier for assessment of IBE. The IBE criterion takes into account the average bioavailability, the variability of bioavailability, and the variability due to the subject-by-formulation interaction. Under the proposed aggregated criterion, however, it is not clear whether the IBE criterion is superior to the ABE criterion for assessment of drug interchangeability. In other words, it is not clear whether or not IBE implies ABE under the aggregated criterion. Hence, the question of particular interest to pharmaceutical scientists is whether the proposed aggregated criterion can really address drug interchangeability. Liu and Chow45 suggested that disaggregated criteria be implemented for assessment of drug interchangeability. The concept of disaggregated criteria for assessment of IBE is described below. In addition to ABE, we may consider the following hypotheses for testing equivalence in the variability of bioavailabilities and in the variability due to the subject-by-formulation interaction:
\[
H_0: \sigma_{WT}^2/\sigma_{WR}^2 \ge \Delta_v \quad \text{vs.} \quad H_a: \sigma_{WT}^2/\sigma_{WR}^2 < \Delta_v
\]
and
\[
H_0: \sigma_D^2 \ge \Delta_s \quad \text{vs.} \quad H_a: \sigma_D^2 < \Delta_s ,
\]
where ∆v is the bioequivalence limit for the ratio of intrasubject variabilities and ∆s is an acceptable limit for the variability due to the subject-by-formulation interaction. We conclude IBE if both the 100(1 − α)% upper confidence limit for σWT²/σWR² is less than ∆v and the 100(1 − α)% upper confidence limit for
σD² is less than ∆s. Under the above disaggregated criteria, it is clear that IBE implies ABE. In practice, it is of interest to examine the relative merits and disadvantages of the FDA-recommended aggregated criterion and the disaggregated criteria described above for assessment of drug interchangeability. In addition, it is also of interest to compare the aggregated and disaggregated criteria of IBE with the current ABE criterion in terms of the consistencies and inconsistencies in concluding bioequivalence for regulatory approval.
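For the variance-ratio part of the disaggregated criteria, a simple upper confidence limit can be sketched as follows. The sketch assumes the two within-subject variance estimates are independent and (scaled) chi-square distributed, which holds approximately when they are computed from within-subject T-T and R-R contrasts in a replicated crossover design; the numbers and the limit ∆v are hypothetical, and the σD² component would require a separate procedure not covered here.

```python
# Sketch: 95% upper confidence limit for sigma_WT^2 / sigma_WR^2, assuming
# independent chi-square distributed within-subject variance estimates.
from scipy import stats

s2_WT, df_T = 0.045, 22          # estimate and degrees of freedom, test product
s2_WR, df_R = 0.030, 22          # estimate and degrees of freedom, reference product
alpha = 0.05
delta_v = 2.5                    # hypothetical equivalence limit for the ratio

ratio_hat = s2_WT / s2_WR
# (s2_WT/sigma_WT^2) / (s2_WR/sigma_WR^2) ~ F(df_T, df_R) under the assumptions,
# so a 100(1-alpha)% upper limit divides the ratio by the lower alpha F quantile.
upper_cl = ratio_hat / stats.f.ppf(alpha, df_T, df_R)
print(f"ratio = {ratio_hat:.3f}, 95% upper limit = {upper_cl:.3f}")
print("variance-ratio criterion met" if upper_cl < delta_v else "criterion not met")
```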
5.3.2. Masking effect
The goal of the evaluation of bioequivalence is to assess the similarity of the distributions of the PK metrics obtained either from the population or from individuals in the population. However, under the aggregated criterion, different combinations of values of its components can yield the same value. In other words, bioequivalence can be reached by two totally different distributions of PK metrics. This is another artifact of the aggregated criterion. For example, at the 1996 Advisory Committee meeting, it was reported that data sets from the FDA's files showed that a 14% increase in the average (ABE only allows 80% to 125%) could be offset by a 48% decrease in the variability, so that the test passes IBE but fails ABE.

5.3.3. Power and sample size determination
For the proposed aggregated criterion, it is desirable to have sufficient statistical power to declare IBE if the value of the criterion is small. On the other hand, we would not want to declare IBE if the value is large. In other words, a desirable property of a statistical procedure for assessment of bioequivalence is that its power function be monotone decreasing. However, since different combinations of values of the components in the aggregated criterion may yield the same value, the power function of any statistical procedure based on the proposed aggregated criterion is not a monotone decreasing function. Experience in implementing the aggregated criteria in the regulatory approval of generic drugs is also lacking. Another major concern is how the proposed criteria for IBE will affect sample size determination based on power analysis. Unlike ABE, there exists no closed form for the power function of the proposed statistical procedure for IBE. As a result, the sample size may be determined through a Monte Carlo simulation study. Chow and Shao17 provided formulas (based
on normal approximation) for sample size calculation for assessment of PBE and IBE under a 2 × 4 replicated crossover design. Sample sizes calculated from the formulas were shown to be consistent with those obtained from simulation studies.

5.3.4. Two-stage test procedure
To apply the proposed criteria for assessment of IBE, the FDA 2001 guidance suggests that the constant scale be used if the observed estimator of σTR or σWR is smaller than σT0 or σW0. However, statistically, the observed estimator of σTR or σWR being smaller than σT0 or σW0 does not mean that σTR or σWR is smaller than σT0 or σW0. A test of the null hypothesis that σTR or σWR is smaller than σT0 or σW0 must therefore be performed. As a result, the proposed statistical procedure for assessment of IBE becomes a two-stage test procedure. It is then recommended that the overall type I error rate and the calculation of power be adjusted accordingly.

5.3.5. Study design
The FDA 2001 guidance recommends that a 2 × 4 replicated crossover design, i.e. (TRTR, RTRT), be used for assessment of IBE, without any scientific and/or statistical justification. As an alternative to the 2 × 4 replicated design, the FDA 2001 guidance indicates that a 2 × 3 replicated crossover design, i.e. (TRT, RTR), may be considered. Several questions arise. First, it is not clear whether the two recommended replicated crossover designs are optimal (in terms of power) among all 2 × 4 and 2 × 3 replicated crossover designs with respect to the aggregated criterion. Second, it is not clear what the relative efficiency of the two designs is if the total number of observations is fixed. Third, it is not clear how these two designs compare to other 2 × 4 and 2 × 3 replicated designs such as the (TRRT, RTTR) and (TTRR, RRTT) designs and the (TRR, RTT) and (TTR, RRT) designs. Finally, it may be of interest to study the relative merits and disadvantages of these two designs as compared to other designs, such as Latin square designs and four-sequence, four-period designs. Other issues regarding the proposed replicated designs include: (i) they take longer to complete; (ii) subject compliance may be a concern; (iii) higher dropout rates and more missing values are likely, especially in 2 × 4 designs; and (iv) there is little literature on statistical methods dealing with dropouts and missing values in a replicated crossover design setting.
Note that the FDA 2001 guidance provides detailed statistical procedures for assessment of PBE and IBE under the recommended 2 × 4 replicated design. However, no details regarding statistical procedures for assessment of PBE and IBE under the alternative 2 × 3 replicated design are given. Detailed statistical procedures for assessment of PBE and IBE are available in the literature.19,20 In addition, Chow and Shao17 pointed out that the statistical procedure for assessment of PBE under the recommended 2 × 4 replicated design, as described in the FDA 2001 guidance, was inappropriate due to the violation of the primary assumption of independence.

5.4. Outlier detection
The procedure suggested for detection of outliers is not appropriate for the standard 2 × 2, the 2 × 3, or the 2 × 4 replicated crossover designs, because the observed PK metrics from the same subject are correlated. For a valid statistical assessment, the procedures proposed by Chow and Tse22 and Liu and Weng46 should be used. These statistical procedures for outlier detection in bioequivalence studies were derived under crossover designs and incorporate the correlations within the same subject. The FDA 2001 guidance provides little or no discussion regarding the treatment of identified outliers.

6. Statistical Principles for Good Clinical Practice
For approval of a drug product, the FDA requires that substantial evidence of the effectiveness and safety of the drug product be provided through the conduct of two adequate and well-controlled clinical studies. To assist sponsors in the preparation of final clinical reports for regulatory submission and review, the FDA developed guidelines for the format and content of a clinical report in 1988. In addition, in 1994, the Committee for Proprietary Medicinal Products (CPMP) Working Party on Efficacy of Medicinal Products of the European Community issued a similar guideline entitled A Note for Guidance on Biostatistical Methodology in Clinical Trials in Applications for Marketing Authorizations for Medicinal Products. At the same time, the ICH also signed off on the step 4 final draft of the Structure and Content of Clinical Study Reports and recommended its adoption by the three regulatory authorities of the United States, the European Community, and Japan. The ICH guidelines require that some critical statistical issues be addressed in the final clinical report. These critical issues include baseline comparability, adjustments for covariates, dropouts or missing values,
interim analyses and data monitoring, multicenter studies, multiplicity, efficacy subsets, active control trials, and subgroup analyses, which are briefly described below (see also Pong and Chow53).

6.1. Baseline comparability
Baseline measurements are those collected during the baseline periods as defined in the protocol. Baseline usually refers to measurements taken at randomization and prior to treatment. Sometimes, measurements obtained at screening are used as baselines. Basically, the objectives of the analysis of baseline data are threefold. First, the analysis of baseline data provides a description of the patient characteristics of the targeted population to which statistical inference is made. In addition, the analysis of baseline data provides useful information regarding whether the patients enrolled in the study are a representative sample of the targeted population according to the inclusion and exclusion criteria of the trial. Second, since baseline data measure the initial patient disease status, they can serve as reference values for the assessment of the primary efficacy and safety clinical endpoints evaluated after administration of the treatment. Finally, the comparability between treatment groups can be assessed based on baseline data to determine potential covariates for the statistical evaluation of treatment effects. The ICH guideline requires that baseline data on demographic variables, such as age, gender, or race, and on disease factors, such as specific entry criteria, duration, stage and severity of disease, and other clinical classifications and subgroups in common usage or of known prognostic significance, be collected and presented. The statistical tests commonly employed for baseline comparability are the Cochran-Mantel-Haenszel test for categorical data and the analysis of variance for continuous variables. A preliminary investigation of baseline comparability helps identify possible confounding and interaction effects between treatment and baseline characteristics.

6.2. Adjustments for covariates
For the assessment of the efficacy and safety of a drug product, it is not uncommon that the primary clinical endpoints are affected by some factors (or covariates) such as demographic variables, patient characteristics, concomitant medications, and medical history. If these covariates are known to have an impact on the clinical outcomes, one may consider stratified randomization. In practice, however, one may collect information on some covariates which may be influential and yet unknown at the planning stage
of the trial. In this case, if patients are randomly assigned to receive treatments, the estimated treatment effect is asymptotically free of the accidental bias induced by these covariates. If the covariates are balanced, then the difference in simple treatment averages is an unbiased estimate of the treatment effect. On the other hand, if a covariate is not balanced, then the difference in simple averages between treatment groups is biased for estimation of the treatment effect. In this case, it is suggested that the covariates be included in a statistical model, such as an analysis of variance (or covariance) model, to obtain an unbiased estimate of the treatment effect. In the case where covariates are balanced between the treatment groups, it is still necessary to adjust for covariates that are statistically significantly correlated with the clinical endpoints in order to obtain valid inference on the treatment effect. The ICH guidelines require that the selection of, and adjustments for, demographic or baseline measurements, concomitant therapy, or any other covariate or prognostic factor be explained. In addition, methods of adjustment, results of analyses, and supportive information should be included in the detailed documentation of statistical methods.

6.3. Dropouts or missing values
In clinical research, there are many possible causes for the occurrence of dropouts and missing values. These possible causes include the duration of the study, the nature of the disease, the efficacy and adverse effects of the drug under study, intercurrent illness, accidents, patient refusal or relocation, and other administrative reasons. The ICH guidelines suggest that the reasons for the dropouts, the time to dropout, and the proportion of dropouts among treatment groups be analyzed to examine the effects of dropouts on the evaluation of the efficacy and safety of the study drug. Little and Rubin44 classified missing values into three different types based on their possible causes. If the causes of missing values are independent of the observed responses, then the missing values are said to be completely random. On the other hand, if the causes of missing values are dependent on the observed responses but are independent of the scheduled but unobserved responses, then the missing values are said to be random. The missing values are said to be informative if the causes of missing values are dependent upon the scheduled but unobserved measurements. If the missing mechanism is either completely random or random, then statistical inference derived from likelihood approaches based on patients who complete the study is still valid. However, the inference is
not as efficient as it is supposed to be. If the missing values are informative, then the inference based on the completers will be biased. As a result, it is suggested that, despite the difficulty, the possible effects of dropouts and missing values on the magnitude and direction of bias be expressed as fully as possible.

6.4. Interim analysis and data monitoring
Interim analysis and data monitoring are commonly employed in clinical trials for the treatment of life-threatening diseases or severely debilitating illnesses with long-term follow-up and endpoints such as mortality or irreversible morbidity. Interim analyses based on data monitoring can be classified into formal interim analyses and administrative analyses. The aim of a formal interim analysis is to determine whether a decision for early termination can be reached before the planned study completion due to compelling evidence of beneficial effectiveness or harmful side effects. An administrative interim analysis is usually carried out without any intention of early termination based on its results. Since interim analyses, either formal or informal, can introduce bias and/or increase the type I error, the ICH guidelines require that all interim analyses, formal or informal, pre-planned or ad hoc, by any study participant, sponsor staff member, or data monitoring group, be described in full, even if the treatment groups were not identified. Data monitoring without code-breaking should also be described, even if this kind of monitoring is considered to cause no increase in type I error.

6.5. Multicenter studies
A multicenter trial is often conducted to expedite the patient recruitment process. The objective of the analysis of clinical data from a multicenter trial is two-fold: to investigate whether a consistent treatment effect can be observed across centers, and to provide an estimate of the overall treatment effect. A set of four conditions under which evidence from a single multicenter trial would provide sufficient statistical evidence of efficacy has been proposed.50 Although all of the centers in a multicenter trial follow the same protocol, many practical issues are likely to occur. For example, some centers may be too small for a reliable interpretation of the results, while other centers may be so big that they dominate the results. In addition, there may be a significant treatment-by-center interaction. As a result, a statistical test
for homogeneity across centers is necessarily performed for the detection of possible quantitative or qualitative treatment-by-center interactions. Gail and Simon34 indicated that the existence of a quantitative interaction between treatment and center does not invalidate an analysis that pools data across centers. However, if a qualitative interaction between treatment and center is observed, an overall or average summary statistic may be misleading and hence considered inadequate. In this case, the treatment effect should be carefully evaluated by center.

6.6. Multiplicity
In clinical trials, multiplicity may occur depending upon the objective of the intended trial, the nature of the design, and the statistical analysis. The causes of multiplicity are mainly the formulation of multiple statistical hypotheses and the experiment-wise false positive rates that arise in subsequent analyses of the data. The ICH guidelines require that the overall type I error rate be adjusted to reflect multiplicity. Basically, multiplicity in clinical trials can be classified as repeated interim analyses, multiple comparisons, multiple endpoints, and subgroup analyses. To control the overall type I error rate, the most commonly employed approach is probably the Bonferroni technique. The concept of Bonferroni's technique is to adjust the p values to control the experiment-wise type I error rate for pairwise comparisons. Bonferroni's method does not require that the correlation structure among comparisons be specified. In addition, it allows an unequal number of patients in each treatment group. Bonferroni's method works well when the number of treatment groups is small. When the number of treatment groups increases, however, Bonferroni's adjustment of p values becomes very conservative and may lack adequate power for alternatives in which most or all efficacy endpoints are improved. In this situation, as an alternative, one may consider the modified procedure proposed by Hochberg.37 Hochberg's procedure is more powerful because it only requires one p value smaller than α to declare one statistically significant comparison.

6.7. Efficacy subsets
In clinical trials, despite the existence of a thoughtful study protocol, deviations from the protocol may be encountered during the course of the trial. In addition, it is very likely that patients will withdraw from the
study prematurely, before the completion of the trial, for various reasons. Patients who complete the study might miss some scheduled visits. As a result, which patients should be included in the analysis for a valid and unbiased assessment of the efficacy and safety of the treatment is a legitimate question. To provide a fair and unbiased assessment of the treatment effect, the ICH guideline suggests that the primary analysis for the demonstration of the efficacy and safety of the drug product be conducted based on the intention-to-treat sample. In addition to the intention-to-treat sample, some subsets of the intention-to-treat sample may be constructed for efficacy analysis. These subsets are usually referred to as efficacy subsets. They include (i) patients with any efficacy observations or with a certain minimum number of observations, (ii) patients who complete the study, (iii) patients with an observation during a particular time window, and (iv) patients with a specified degree of compliance. The ICH guidelines require that efficacy subsets be analyzed to examine the effects of dropping patients with available data from the analyses because of poor compliance, missed visits, ineligibility, or any other reasons. Any substantial differences resulting from the analyses of the intention-to-treat sample and the efficacy subsets should be the subject of explicit discussion.
6.8. Active control trials
An active control trial is often considered an alternative to a placebo-controlled study for the evaluation of the effectiveness and safety of a test drug in very ill patients or patients with severe or life-threatening diseases, based on ethical considerations. The primary objective of an active control trial could be to establish the efficacy of the test drug, to show that the test drug is equivalent to an active control agent, or to demonstrate that the test drug is superior to the active control agent. Pledger and Hall52 pointed out that active control trials offer no direct evidence of the effectiveness of the test drug. The only trial that yields direct evidence of the effectiveness of the test drug is a placebo-controlled trial, which compares the test drug with a placebo. Temple68 indicated that if we cannot be very certain that the active control agent in a study would have beaten a placebo group, the fundamental assumption of the active control study cannot be made and the design must be considered inappropriate. The ICH guidelines indicate that if an active control study is intended to show equivalence between the test drug and an active control, the analysis
should show the confidence interval for the comparison between the two agents for the critical endpoints and the relation of that interval to the pre-specified degree of inferiority that would be considered unacceptable.

7. Statistics in Diagnostic Imaging
The techniques for evaluating the performance of diagnostic medical products are very different from those for therapeutic pharmaceuticals and non-diagnostic devices. Medical imaging drugs are generally governed by the same regulations as other drug and biological products; however, they have special characteristics that are not shared by other drug and biological products. This section focuses on the design considerations that are particular to diagnostic studies.

7.1. Introduction
Medical imaging drug products are drugs used with medical imaging methods (such as radiography, computed tomography [CT], ultrasonography [US], and magnetic resonance imaging [MRI]) to provide information on anatomy, physiology, and pathology. The term "image" can refer to films, likenesses, or other renderings of the body, body parts, organ systems, body functions, or tissues. For example, an image of the heart obtained with a diagnostic radiopharmaceutical or ultrasound contrast agent may in some cases refer to a set of images acquired from different views of the heart. Similarly, an image obtained with an MRI contrast agent may refer to a set of images acquired with different pulse sequences and interpulse delay times. In other words, medical imaging uses advanced technology to "see" the structure and function of the living body. The intended uses of a medical imaging drug are two-fold: (i) to delineate nonanatomic structures such as tumors or abscesses, and (ii) to detect disease or pathology within an anatomic structure.

Table 1. Most commonly used contrast drug products in combination with medical imaging devices.

  Modality        Contrast Drug Products
  X-Ray and CT    Iodine agents (photon scattering)
  MRI             Gadolinium, dysprosium, helium
  Ultrasound      Liposomes, microbubble suspensions
  Nuclear         Tc-99m, Tl-201, indium, samarium

Therefore, the indications for medical imaging drugs
may fall within the following general categories; however, they need not be mutually exclusive:

a. Structure delineation — normal or abnormal;
b. Functional, physiological, or biochemical assessment;
c. Disease or pathology detection or assessment;
d. Diagnostic or therapeutic patient management.
The details of the drug regulations are given in the FDA draft guidance for industry entitled Developing Medical Imaging Drugs and Biologics, which applies to INDs, NDAs, biologics license applications (BLAs), ANDAs, and supplements to NDAs or BLAs for medical imaging drug and biological products.30 Usually, images are created from the computerized acquisition of digital signals. Medical imaging drugs can be classified into contrast drug products and diagnostic radiopharmaceuticals.

7.1.1. Contrast drug product
Contrast drug products are used to increase the relative difference of signal intensities and to provide, in combination with an imaging device, additional information beyond that obtained with the device alone. In other words, imaging with the contrast drug product should add value when compared to imaging without the contrast drug product.

7.1.2. Diagnostic radiopharmaceuticals
Radiopharmaceuticals are used for a wide variety of diagnostic, monitoring, and therapeutic purposes. Diagnostic radiopharmaceuticals are used to image or otherwise identify an internal structure or disease process. In other words, diagnostic radiopharmaceuticals are radioactive drugs that contain a radioactive nuclide that may be linked to a ligand and a carrier. These products are used in planar imaging, single photon emission computed tomography (SPECT), positron emission tomography (PET), or with other radiation detection probes.

7.2. Design of blinded-reader studies
In order to demonstrate the efficacy of a medical imaging drug, readers who are both independent and blinded should perform the evaluation of the images. These independent, blinded image evaluations are intended to limit possible bias that could be introduced into the image evaluation by non-independent or unblinded readers. The evaluation is conducted in a controlled setting
with minimal clinical information provided to the reader. The terms "independent" and "blinded" are defined next. Independent readers are those who have not participated in the studies and who are not affiliated with the sponsor or with the institutions at which the studies were conducted. The meaning of blinding differs from the common way the term is used in therapeutic clinical trials. Blinding in this sense is a critical aspect of clinical trials of medical imaging agents. "Blinded readers" are those who are unaware (1) of the treatment identity used to obtain a given image and (2) of patient-specific clinical information or the study protocol. For example, blinded readers should not know which images were obtained prior to drug administration and which were obtained after drug administration, although this may be apparent upon viewing the images. In addition, blinded readers should not know the patients' final diagnoses and may have limited or no knowledge of the results of other diagnostic tests that were performed on the patients. In some cases, blinded readers should not be familiar with the inclusion and exclusion criteria for patient selection that were specified in the protocol.

7.2.1. Assessing reader agreement
As indicated in the draft guidance,30 at least two independent, blinded readers (and preferably three or more) are recommended for each study that is intended to demonstrate efficacy. The purpose is to provide a better basis for the findings of the studies. Therefore, the determination of inter-reader agreement and variability is a typical design issue in blinded-reader studies. According to the guidance, the consistency among readers should be measured quantitatively. The statistic most commonly used to assess inter-reader agreement is the κ (kappa) statistic. Cohen's kappa coefficient24 is a measure of inter-reader agreement for count data. For a 2 × 2 table,
\[
\kappa = \frac{P_0 - P_e}{1 - P_e},
\]
where
\[
P_0 = \sum_i p_{ii} = \text{proportion of observed agreement}, \qquad
P_e = \sum_i p_{i\cdot}\, p_{\cdot i} = \text{proportion of expected (chance) agreement} .
\]
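A minimal sketch of the kappa computation for a hypothetical 2 × 2 table of reader-by-reader counts is given below; the same code works for a general K × K table.

```python
# Sketch: Cohen's kappa for agreement between two blinded readers.
# The table of counts is hypothetical (rows: reader 1, columns: reader 2).
import numpy as np

counts = np.array([[40,  6],      # reader 1 positive: reader 2 positive / negative
                   [ 4, 50]])     # reader 1 negative: reader 2 positive / negative
n = counts.sum()
p = counts / n                    # cell proportions p_ij

P0 = np.trace(p)                              # observed agreement
Pe = np.sum(p.sum(axis=1) * p.sum(axis=0))    # chance agreement: sum_i p_i. * p_.i
kappa = (P0 - Pe) / (1 - Pe)
print(f"P0 = {P0:.3f}, Pe = {Pe:.3f}, kappa = {kappa:.3f}")
```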
The kappa statistic assumes that the two response variables are two independent ratings of the n subjects. It should be noted that the kappa coefficient equals +1 when there is complete agreement between the readers. When the observed agreement exceeds chance agreement, kappa is positive, and the magnitude of kappa reflects the strength of agreement. In the unusual situation where the observed agreement is less than chance agreement, kappa is negative. The total range of kappa is from −1 to 1. According to Fleiss et al.,33 the asymptotic variance of the simple kappa coefficient can be estimated by
\[
\widehat{\operatorname{var}}(\hat\kappa) = \frac{A + B - C}{(1 - P_e)^2\, n},
\]
where
\[
A = \sum_i p_{ii}\big[1 - (p_{i\cdot} + p_{\cdot i})(1 - \hat\kappa)\big]^2 ,
\qquad
B = (1 - \hat\kappa)^2 \sum_{i \ne j} p_{ij}\,(p_{i\cdot} + p_{\cdot j})^2 ,
\qquad
C = \big[\hat\kappa - P_e (1 - \hat\kappa)\big]^2 .
\]
For measuring inter-reader agreement in continuous data, Snedecor and Cochran proposed the intra-class correlation.65

7.3. Diagnostic accuracy
To determine how well a diagnostic imaging agent can distinguish diseased subjects from non-diseased subjects, the outcome is often classified into one of four groups depending on (i) whether the disease is present and (ii) the result of the diagnostic test of interest (positive or negative). The terms "positive" and "negative" refer to some particular disease status, which must be specified clearly. The categories can be defined in any way that is meaningful for the problem. For example, patients could be classified as having one or more tumors (positive) or no tumor (negative), or as malignant (positive) or benign/no tumor (negative). It should be noted that the disease status is often determined with a "truth" standard or "gold" standard, that is, an independent method of measuring the same variable being measured by the investigational drug that is known or believed to give the true state of a patient or the true value of a measurement. In other words, "truth" standards are used to demonstrate that the results obtained with the medical imaging drug are valid and reliable.
Table 2. The typical outcome table (2 × 2) in the evaluation of a diagnostic test.

                              Disease Status
                          Present        Absent
  Diagnostic   Positive   a (TP)         b (FP)
  Test         Negative   c (FN)         d (TN)
For example, for an MRI contrast agent intended to visualize the number of lesions in the liver or to determine whether a mass is malignant, the truth standard might include results from pathology or from long-term clinical outcomes. In diagnostic imaging studies, the "truth" or "gold" standard is usually called the standard of reference (SOR). Possible choices of SOR in an imaging trial are:

a. Histopathology;
b. Therapeutic response;
c. Clinical outcome;
d. Another valid imaging procedure (validated against a valid gold standard);
e. Autopsy.
TP, FP, FN, and TN represent the true positive, false positive, false negative, and true negative counts, respectively. After completing a well-defined classification based on the disease status and the diagnostic test of interest, the efficacy of an imaging agent can be expressed in terms of the diagnostic performance of the agent. The simplest measure of diagnostic decision is the fraction of cases for which the physician is correct, which is often called "accuracy". In other words, accuracy is defined as the proportion of cases, considering both positive and negative test results, for which the test results are correct. It can be expressed mathematically as
\[
\text{Accuracy} = \frac{a + d}{a + b + c + d}.
\]
However, accuracy is of limited usefulness as an index of diagnostic performance, because two diagnostic modalities can yield equal accuracies but perform differently with respect to the types of decisions made. Accuracy is also strongly affected by the disease prevalence. Due to these limitations of the accuracy index, sensitivity and specificity are used in the evaluation scheme.
\[
\text{Sensitivity} = \frac{\text{Number of TP decisions}}{\text{Number of actually positive cases}} = \frac{a}{a + c},
\]
\[
\text{Specificity} = \frac{\text{Number of TN decisions}}{\text{Number of actually negative cases}} = \frac{d}{b + d}.
\]
In effect, sensitivity and specificity represent two kinds of accuracy: the first for actually positive cases and the second for actually negative cases. However, a single pair of sensitivity and specificity measurements often provides a possibly misleading and even hazardous oversimplification of accuracy.70 This is where the ROC (Receiver Operating Characteristic) curve comes into the picture; it is introduced in Sec. 7.4.1. It should be noted that the standard methods for evaluating and comparing sensitivity and specificity for diagnostic tests rest on two assumptions:
Assumption 1: Diagnostic tests are independent given the disease status; Assumption 2: The gold standard is error free. These two assumptions are not always valid. Several statistical methods have been considered.2,3,57 7.4. Statistical analysis Most of the imaging trials are designed to provide dichotomous or ordered categorical outcomes. Therefore, the statistical tests for proportions and rates are commonly used, and the methods based on ranks are often applied to ordinal data. The analyses based on odds ratios and the MantelHaenszel procedures are useful for data analysis. In addition, the use of model-based techniques, such as logistic regression models for binomial data, proportional odds models for ordinal data, and log-linear models for normal outcome variables are usually applied. The diagnostic validity can be assessed in many ways. For example, the pre- and post-images can be compared to the gold standard, and the sensitivity and specificity of the pre-image compared to the post-image. Similarly, the same approaches can be used for two different active agents. The common methods used to test for differences in diagnosis are the McNemar test and Stuart-Maxwell test. The confidence intervals for sensitivity and specificity, and other measures can be also provided in the analysis. Recently, the Receiver Operating Characteristic (ROC) analyses are becoming increasing important. Not only because it is recommended in the FDA draft guidance,30 but also its advantage over more traditional measures of diagnostic performance.48
Statistics in Biopharmaceutical Research
487
7.4.1. Receiver operating characteristic (ROC) analyses In the use of most diagnostic test, test data do not necessarily fall into one of two obviously defined categories. Imaging studies usually require some confidence threshold be established in the mind of the decision maker. For example, if an image suggests the possibility of disease, how strong the suspicion is in order for the image to be called positive? Therefore, the decision maker chooses between positive and negative diagnosis by comparing his/her confidence concerning with an arbitrary confidence threshold. Figure 1 is an example of the model that underlies ROC analysis. The bellshaped curves represent the probability density distributions of a decision maker’s confidence in a positive diagnosis that arise from actually positive patients and actually negative patients. The true positive fraction (TPF) is represented by the area under the left-hand distribution to the threshold. Similarly, the false positive fraction (FPF) is represented by the area under the left-hand distribution to the threshold. These imply that the sensitivity and specificity vary inversely as the confidence threshold is changed. In other words, TPF and FPF will increase or decrease together as the confidence threshold is changed. If we change the decision threshold several times, we will obtain several different pairs of TPF and FPF. These pairs can be plotted as points on a graph, such as that in Fig. 2. This curve is called the ROC curve for
Fig. 1.
Model of ROC Analysis.
488
S. C. Chow & A. Pong
Fig. 2.
Typical ROC Curve.
diagnostic test. Then, we may conclude that better performance is indicated by an ROC curve that is higher to the left in the ROC space. A practical technique for generating response data that can be used to plot a ROC curve is called the rating method. This method requires the decision maker select a value from a continuous scale, such as definitely negative, probably negative, questionable, probabyly positive or definitely positive. The advantages of the ROC curves are it is simple and graphical. Also, it is independent of prevalence and it provides a direct visual comparison between tests on a common scale. However, the drawbacks of the ROC curves are the decision thresholds and the numbers of subjects are usually not displayed on the graph. In addition, the appropriate software may not be widely available. The ROC curve provides more information than just a single sensitivity and specificity pair to describe the accuracy of a diagnostic test. The curve depicts sensitivity and specificity levels over the entire range of decision thresholds. However, it would be helpful if the performance of a diagnostic test could be assessed by a single number. One such measurement that can be derived from the ROC curve is the area under the curve (AUC). If a diagnostic test that discriminates almost perfect, then its ROC curve passes near the upper left corner. This makes an AUC approaching 1. On the other hand, if the curve of a test that discriminates almost randomly, then the curve would lie near the 45 degree diagonal line. This would turn an AUC close to 0.5. The AUC range is between 0.5 and 1.
Statistics in Biopharmaceutical Research
489
The AUC is calculated by summing the area of the trapezoids formed between the graph and the horizontal axis. This nonparametric method of calculation makes no assumptions regarding the underlying distributions of the diseased and non-diseased status. The meaning of AUC has been proved mathematically to be the probability that a random pair of positive/diseased and negative/non-diseased individuals would be identified correctly by the diagnostic test.35 Also, it had been shown that the statistical properties of the Mann-Whitney-Wilcoxon statistics could be used to predict the statistical properties of AUC.36 For comparing corrected ROC curves, Delong et al.25 suggested a nonparametric approach for comparing the AUCs. For the parametric approach, Swets and Pickett67 proposed a more exact method using the maximum likelihood estimation to estimate the AUC and its standard error. A comparison of nonparametric and binomial parametric areas can be found in Center and Schwartz.5
References 1. Anderson, S. and Hauck, W. W. (1990). Consideration of individual bioequivalence. Journal of Pharmacokinetics and Biopharmaceutics 18: 259–273. 2. Baker, S. G., Cannor, R. J. and Kessler, L. G. (1998). The partial testing design: A less costly way to test equivalence for sensitivity and specificity. Statistics in Medicine 17: 2219–2232. 3. Baker, S. G. (1990). A simple EM algorithm for capture-recapture data with categorical covariates. Biometrics 46: 1193–1200. 4. Bergum, J. S. (1990). Constructing acceptance limits for multiple stage tests. Drug Development Industrial Pharmaceutical 16: 2153–2166. 5. Center, R. M. and Schwartz, J. S. (1985). An evaluation of methods for estimating the area under the receiver operating characteristics (ROC) curve. Medical Decision Making 5: 149–156. 6. Chen, M. L. (1997). Individual bioequivalence — A regulatory update. Journal of Biopharmaceutical Statistics 7: 5–11. 7. Chow, S. C. (1992). Statistical design and analysis of stability studies. Presented at the 48th Conferences of Applied Statistics, Atlantic City, N.J., December. 8. Chow, S. C. (1997). Good statistics practice in the drug development and regulatory approval process. Drug Information Journal 31: 1157–1166. 9. Chow, S. C. (2000). Encyclopedia of Biopharmaceutical Statistics. Marcel Dekker, Inc., New York. 10. Chow, S. C. and Liu, J. P. (1995). Statistical Design and Analysis in Pharmaceutical Science, Marcel Dekker, Inc., New York. 11. Chow, S. C. and Liu, J. P. (1997). Meta-analysis for bioequivalence review. Journal of Biopharmaceutical Statistics 7: 97–111.
490
S. C. Chow & A. Pong
12. Chow, S. C. and Liu, J. P. (1998). Design and Analysis of Clinical Trials, Wiley, New York. 13. Chow, S. C. and Liu, J. P. (2000). Design and Analysis of Bioavailability and Bioequivalence, Marcel Dekker, Inc., New York. 14. Chow, S. C. and Shao, J. (1988). A new procedure for estimation of variance components. Statistics and Probability Letters 6: 349–355. 15. Chow, S. C. and Shao, J. (1989). Test for batch-to-batch variation in stability analysis. Statistics in Medicine 8: 883–890. 16. Chow, S. C. and Shao, J. (1991). Estimating drug shelf-life with random batches. Biometrics 47: 1071–1079. 17. Chow, S. C. and Shao, J. (2002). Statistics in Drug Research — Methodologies and Recent Development. Marcel Dekker, Inc., New York. 18. Chow, S. C. and Shao, J. (2003). Stability analysis with discrete response. Journal of Biopharmaceutical Statistics. To appear. 19. Chow, S. C., Shao, J. and Wang, H. (2002a). Statistical tests for population bioequivalence. Statistica Sinica, In press. 20. Chow, S. C., Shao, J. and Wang, H. (2002b). Individual bioequivalence testing under 2 × 3 designs. Statistics in Medicine 21: 629–648. 21. Chow, S. C., Shao, J. and Wang, H. (2003). Sample Size Calculation in Clinical Research. Marcel Dekker, Inc., New York, In press. 22. Chow, S. C. and Tse, S. K. (1990). Outliers detection in bioavailability/ bioequivalence, Statistics in Medicine 9: 549–558. 23. Chow, S. C. and Tse, S. K. (1991). On variance estimation in assay validation. Statistics in Medicine 10: 1543–1553. 24. Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20: 37–46. 25. DeLong, E. R. and DeLong, D. M. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics 44: 837–845. 26. Dubey, S. D. (1991). Some thoughts on the one-sided and two-sided test. Journal of Biopharmaceutical Statistics 1: 139–150. 27. Esinhart, J. D. and Chinchilli, V. M. (1994). Extension to the use of tolerance intervals for assessment of individual bioequivalence. Journal of Biopharmaceutical Statistics 4: 39–52. 28. FDA (1992). Guidance on Statistical Procedures for Bioequivalence Studies Using a Standard Two-Treatment Crossover Design. Division of Bioequivalence, Office of Generic Drugs, Center for Drug Evaluation and Research, Food and Drug Administration, Rockville, Maryland. 29. FDA (1993). Guideline for Submitting Documentation for the Stability of Human Drugs and Biologics. Center for Drugs and Biologics, Office of Drug Research and Review, Food and Drug Administration, Rockville, Maryland. 30. FDA (2000a). Draft Guidance on Developing Medical Imaging Drugs and Biologics. Division of Medical Imaging and Radiopharmaceutical Drug Product. Center for Drug Evaluation and Research and Center for Biologics Evaluation and Research, Food and Drug Administration, Rockville, Maryland.
Statistics in Biopharmaceutical Research
491
31. FDA (2000b). Guidance for industry: Bioavailability and Bioequivalence Studies for Orally Administrated Drug Products — General Considerations. Office of Generic Drugs, Center for Drug Evaluation and Research, Food and Drug Administration, Rockville, Maryland. 32. FDA (2001). Guidance for industry: Statistical Approaches to Establishing Bioequivalence. Office of Generic Drugs, Center for Drug Evaluation and Research, Food and Drug Administration, Rockville, Maryland. 33. Fleiss, J. L., Cohen, J. and Everitt, B. S. (1969). Large-sample standard errors of kappa and weighted kappa. Psychological Bulletin 72: 323–327. 34. Gail, M. and Simon, R. (1985). Testing for qualitative interactions between treatments and patient subsets. Biometrics 41: 361–372. 35. Green, D. M. and Swets, J. A. (1966). Signal Detection Theory and Psychophysics. John Wiley, New York. 36. Hanley, J. A. and McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Diagnostic Radiology 143: 29–36. 37. Hochberg, Y. (1988). A sharper Bonferronis procedure for multiple test of significance. Biometrika 75: 800–803. 38. Hyslop, T., Hsuan, F. and Holder, D. J. (2000). A small-sample confidence interval approach to assess individual bioequivalence. Statistics in Medicine 19: 2885–2897. 39. Holder, D. J. and Hsuan, F. (1993). Moment-based criteria for determining bioequivalence. Biometrika 80: 835–846. 40. ICH (1993). Stability Testing of New Drug Substances and Products. Tripartite International Conference on Harmonization Guideline. 41. ICH (1994). ICH QIA. Guideline for industry. Stability Testing Guideline of New Drug Substances and Products. September. 42. ICH (1997). Statistical Principles in Clinical Trials. International Conference on Harmonization. 43. Lin, T. Y. D. (1994). Applicability of matrixing and bracketing approach to stability study design. Presented at the 4th ICSA Applied Statistics Symp., Rockville, Maryland. 44. Little, R. J. A. and Rubin, D. B. (1987). Statistical Analysis with Missing Values. John Wiley and Sons, New York. 45. Liu, J. P. and Chow, S. C. (1997). Some thoughts on individual bioequivalence, Journal of Biopharmaceutical Statistics 7: 41–48. 46. Liu, J. P. and Weng, C. S. (1992). Detection of outlying data in bioavailability/bioequivalence studies. Statistics in Medicine 10: 1375–1389. 47. Mellon, J. I. (1991). Design and analysis aspects of drug stability studies when the product is stored at several temperatures. Presented at the 12th Annual Midwest Statistical Workshop, Muncie, Indiana. 48. Metz, C. E. (1986). ROC methodology in radiologic imaging. Investigative Radiology 21: 720–733. 49. Murphy, J. R. and Weisman, D. (1990). Using random slopes for estimating shelf-life. Proceedings of the Biopharmaceutical Section of the American Statistical Association, 196–203.
492
S. C. Chow & A. Pong
50. Nevins, S. E. (1988). Assessment of evidence from a single multicenter trial. Proceedings of the Biopharmaceutical Section of the American Statistical Association, 43–45. 51. Nordbrock, E. (2000). Stability matrix design. In Encyclopedia of Biopharmaceutical Statistics. Ed. S. Chow, Marcel Dekker, Inc., New York. 52. Pledger, G. W. and Hall, D. (1986). Active control trials: Do they address the efficacy issue? Proceedings of the Biopharmaceutical Section of the American Statistical Association, 1–7. 53. Pong, A. and Chow, S. C. (1997). Statistical/practical issues in clinical trials. Drug Information Journal 31: 1167–1174. 54. Pong, A. and Raghavarao, D. (2000). Comparison of bracketing and matrixing designs for a two-year stability study. Journal of Biopharmaceutical Statistics 10: 217–228. 55. Pong, A. and Raghavarao, D. (2001). Shelf life estimation for drug products with two components. Proceedings Biopharmaceutical Section of the American Statistical Association. 56. Pong, A. and Raghavarao, D. (2002). Comparing distributions of drug shelf lives for two components under different designs. Journal of Biopharmaceutical Statistics 12(3): 277–293. 57. Qu, Y. and Hadgu A. (1998). A model for evaluating sensitivity and specificity for correlated diagnostic tests in efficacy studies with an imperfect test. Journal of the American Statistical Association 93: 920–928. 58. Ruberg, S. and Hsu, J. (1992). Multiple comparison procedures for pooling batches in stability studies. Technometrics 34: 465–472. 59. Schall, R. and Luus, R. E. (1993). On population and individual bioequivalence. Statistics in Medicine 12: 1109–1124. 60. Shah, V. P., Midha, K. K., Dighe, S., McGilveray, I. J., Skelly, J. P., Yacobi, A., Layoff, T., Viswanathan, C. T., Cook, C. E., McDowall, R. D., Pittman, K. A. and Spector, S. (1992). Analytical methods validation: bioavailability, bioequivalence and pharmaceutical studies. Pharmaceutical Research 9: 588–592. 61. Shao, J. and Chow, S. C. (1994). Statistical inference in stability analysis. Biometrics 50: 753–763. 62. Shao, J. and Chow, S. C. (2001a). Two-phase shelf-life estimation. Statistics in Medicine 20: 1239–1248. 63. Shao, J. and Chow, S. C. (2001b). Drug shelf life estimation. Statistica Sinica 11: 737–745. 64. Sheiner, L. B. (1992). Bioequivalence revisited. Statistics in Medicine 11: 1777–1788. 65. Snedecor, G. W. and Cochran, W. G. (1967). Statistical Methods, 6th edn. Ames, The Iowa State University Press. 66. Sun, Y., Chow, S. C., Li, G. and Chen, K. W. (1999). Assessing distributions of estimated drug shelf lives in stability analysis. Biometrics 55: 896–899. 67. Swets, J. A. and Pickett, R. M. (1982). Evaluation of Diagnostic System. Academic Press, NY.
Statistics in Biopharmaceutical Research
493
68. Temple, R. (1983). Difficulties in evaluating positive control trials. Proceedings of the Biopharmacentical Section of the American Statistical Association, 1–7. 69. USP/NF (2000). The United States Pharmacopeia 24 and the National Formulary 19. United States Pharmacopeia Convention, Inc., Rockville, Maryland, USA. 70. Zweig, M. I. and Campbell, G. (1993). Receiver-operating characteristic (ROC) plots: A fundamental evaluation tool in clinical medicine. Clinical Chemistry 39(4): 561–577.
About of Author Shein-Chung Chow is President of the US Operations of StatPlus, Inc. directing the fuction/activity of technical consulting and biostatistics and data management for drug research and development services. Dr. Chow also provides technical supervision and guidance to project teams on statistical issues and presentations before clients, regulatory agencies or scientific bodies, defending the appropriateness of statistical methods used in clinical trial design or data analyses or the validity of reported statistical inferences. He identifies the best statistical and data management practices, organizes and leads working parties in the development of statistical design, analyses and presentation applications, and participates on Data Safety Monitoring Boards. He is also an Adjunct Professor at Temple University, Philadelphia, PA and an Adjunct Professor at National Cheng-Kung University, Taiwan. His professional activities include playing key roles in many professional organizations such as officer, Board of Directors member, Advisory Committee member, and Executive Committee member. He has served as Program-chair, session-chair/moderator, panelist and instructor/faculty at many professional conferences, symposia, workshops, tutorials and short courses. He is currently on the Editorial Boards of the Journal of Biopharmaceutical Statistics, Statistica Sinica, the Journal of Food and Drug Analysis. Dr. Chow is the Editor of Biostatistics Book series at Marcel Dekker, Inc., New York, NY. He was elected Fellow of the American Statistical Association in 1995 and an elected member of the ISI (International Statistical Institute) in 1999. He was the recipient of the DIA Outstanding Service Award (1996), ICSA Extraordinary Achievement Award (1996), and Chapter Service Recognition Award of the American Statistical Association (1998). Dr. Chow was appointed Pharmaceutical Scientific Advisor to the Bureau of Pharmaceutical Affairs, Department of Health,
494
S. C. Chow & A. Pong
Republic of China in 1999. Dr. Chow is past President of the International Chinese Statistical Association, Chair of the Advisory Committee on Chinese Pharmaceutical Affairs and a member of the Advisory Committee on Statistics of the DIA. Dr. Chow is the author or co-author of over 120 professional papers, and co-author of the following books: Advanced Linear Models, Design and Analysis of Bioavailability and Bioequivalence Studies (1st and 2nd editions), Statistical Design and Analysis in Pharmaceutical Science, Design and Analysis of Clinical Trials, Design and Analysis of Animal Studies in Pharmaceutical Development, and Encyclopedia of Biopharmaceutical Statistics. Dr. Chow received his BS in Mathematics from National Taiwan University, Taiwan, and PhD in Statistics from the University of Wisconsin, Madison, Wisconsin.
CHAPTER 13
STATISTICS IN TOXICOLOGY
JAMES J. CHEN Division of Biometry and Risk Assessment, National Center for Toxological Research, US Food and Drug Administration, Jefferson, AR 72079, USA Tel: 870-543-7007; [email protected] Toxicology is the study of the adverse effects of chemical substances on biological systems. Toxicological research is typically directed toward providing scientific information for the hazard potential of drugs and chemicals used by humans. Human epidemiology and animal toxicology are two major sources of scientific information for evaluation of toxic chemicals or drugs. Epidemiological studies, which attempt to associate disease or other adverse outcomes with an exposure, have the advantage of directly measuring an effect in humans at exposure conditions. Main limitations on the epidemiological studies are the lack of comprehensive data associated with unintentional or complex exposures, such as quantifying the actual dose concentration and no safety data for new drug or chemical products. Safety evaluation of the use of drug and chemicals are primarily based on animal studies in which animals are considered as surrogates for humans. In Vitro mutagenicity studies and structureactivity relationships may be used to support the interpretation of the information from the animal or human studies. In this chapter, we focus on two major toxicological studies: long-term carcinogenicity testing and reproductive testing. Statistical analyses of various endpoints have been of two aspects: qualitative testing and quantitative estimation for risk assessment. The qualitative testing is to determine if the chemical cause an adverse health effect (if there is a statistically significant difference between treated and control groups. Statistical analysis discussed in this section focuses on the qualitative testing with respect to carcinogenic and reproductive endpoints. Statistical modeling for quantitative risk estimation is given in Chapter 11.
495
496
J. J. Chen
1. Animal Carcinogenicity Experiments Long-term rodent bioassays have been the government’s primary means of screening chemicals to assess carcinogenic potential to human risk. The United States Food and Drug Administration (FDA) and other countries require that new drugs and certain medical devices must be approved for safety and effectiveness for their intended use before being marketed. As a part of the drug approval process, the FDA requires that the sponsor submit the results of a rodent tumorigenicity bioassay to assess the carcinogenic potential of a drug for chronic use of humans. In the last 25 years the National Toxicology Program (NTP) has conducted about 500 long-term animal carcinogenesis bioassays for safety assessment of environmental compounds, and Food and Drug Administration (FDA) has reviewed hundreds of such studies of pharmaceuticals conducted by drug companies. Data from these studies have been a major database for safety assessment of compounds in the environment and industry. A standard carcinogenic study is conducted in both sexes of two rodent species, typically rats and mice. A carcinogenicity experiment consists of a control and several dose groups. The maximum tolerated dose (MTD) has been used as the high-dose level. The MTD is defined as the dose that causes no more than a 10% body weight decrement, as compared to the appropriate control groups. The MTD is often estimated from the results of subchronic studies (generally three months of duration). Typically, dosage is measured in mg/kg body weight per day. The number of dose groups and allocation of animals among the dose groups depend on the objective of the study. A typical NTP carcinogenicity experiment consists of a control and three dose levels (0, 1/4 NTD, 1/2 MTD, MTD) with 50 animals per group. Animals are assigned randomly to dose groups or cages. As an example, consider a situation of 200 animals to be assigned to four groups of 50 with four animals from the same group caged together. Thus, 52 cages are used for the 200 animals. Each animal, first, is given to a number according to their order of presentation. A random number sequence of 52 cages numbers each with 4 replicates is, then, generated for placing animals in cages. For example, a sequence may be Animal number Random cage number
1 42
2 7
3 8
4 13
5 9
6 11
7 18
8 7
9 22
10 38
.. ..
The animal #1 would be placed in cage #42, and animal #2 in cage #7, and so on. After randomization of the animals to cages (and into experimental
Statistics in Toxicology
497
dose groups), the cage position may need to be rotated during the course of the experiment in order to balance the environmental effects. The animals are given the test substance for a major portion of their lifespan. The test substance may be given in the diet or administered by other routes, such as inhalation, skin paints, or oral gavage. The experiment is terminated according to a predetermined stopping time, for example, 78–104 weeks for mice and 104 weeks for rats. Animal body weights and food consumption are measured weekly, the weeks of death of animals are recorded. Animals which die or are sacrificed are necropsied. Tissues taken from different organs and sites are examined microscopically for the presence of tumors for an evidence of carcinogenic effects. One main objective of a long-term carcinogenicity experiment is to compare control and dose groups of animals with respect to tumor development. Statistical analysis of tumor responses includes the comparisons between dosed and control groups as well as a test for dose-related trend for each tumor site/organ. A typical experiment investigates approximate 20–50 tumor sites routinely. Because a large number of statistical tests are performed, the chance of false positive findings could increase. For example, the false positive rate is about 0.64 (≈ 1 − (1 − 0.05)20 ) for tests of 20 independent tumor types (sites/organs) all at the 0.05 significance level. For a particular tumor type, the primary response variable (endpoint) for comparison is the incidence of first tumors. One factor that affects the performance of methods is the animal survival time. A high degree of animal mortality will cause a significant censoring of the tumor response. Comparisons should be adjusted for the survival time because the crude incidence rate can be biased by the differential mortality (across groups). Another complication is that most tumor types are occult and therefore detectable only after the animal has died; that is, the time to the (first) tumor onset is not directly observable. This section will describe the commonly used statistical procedures for the analysis of animal tumor response data. 1.1. Time-to-tumor model Kodell and Nelson1 presented a tumor-death model which uses survival/ sacrifice data to describe the sequence of events comprised by histological appearance of a tumor followed by death from that tumor. Three random variables can be used to describe the model: X: The potential time to tumor onset, transition time from the normal state (N) to the tumor-bearing state (T).
498
J. J. Chen
T : The potential time from tumor onset to death, transition time from tumor state (T) to the death from the tumor state (DT ). Z: The potential time until death from a competing cause, transition time from the normal state (N) or the tumor state (DT ) to the death from competing risk (DN T ). Sacrificed animals are considered to be dead from a competing risk. The three random variables X, T, Z completely determined the fate of each animal. The two random variables Y and Z are the survival time of an animal, where Y = X + T is the potential time until death from tumor. Note that X is not observable for the occult tumors. A survival-adjusted method, that has been widely accepted, is to require that pathologists assign a “context of observation” (cause-of-death) to each tumor.2 Tumors can be classified as “incidental”, “fatal”, and “mortalityindependent (or observable)”. Tumors that do not alter an animal’s risk of death and are observed only as the result of a death from an unrelated cause are classified as an incidental context. Tumors that affect mortality by either directly causing death or indirectly increasing the risk of death are classified as a fatal context. Tumors, such as skin tumors, whose detection occurs at times other than when the animal dies are classified as a mortality-independent (or observable) context. It should be noted that the validity of context of observation is under the assumption: tumor-bearing and tumor-free animals of the same age have identical hazard functions for death unrelated to tumor. In the context of observation, one of the four events will be observed on each animal: A. Appearance of a visible tumor (mortality-independent context, X is observable). B. Animal died from the tumor of interest (fatal context, Y < Z). C. Animal had a tumor and died from competing cause (incidental context, X ≤ Z < Y ). D. Animal did not have a tumor and died from a competing cause (Z ≤ X). Let t1 , t2 , . . . , tm be the distinct times at which the above events are observed, and ak , bk , ck , and dk , k = 1, . . . , m, are the number of events of A, B, C, and D at time tk , respectively. Define the tumor resistance (survival) functions for X and Y as SX (t) = Pr(X ≥ t), and SY (t) = Pr(Y ≥ t). Let fX (t) and fY (t) be the density function of X and Y , respectively. For the tumors observed in a mortality-independent context, the likelihood function
Statistics in Toxicology
is given as La =
Y
499
fX (tk )ak SX (tk )dk .
The likelihood function for the tumors observed in a fatal context is Y Lb = fY (tk )bk SY (tk )dk .
The likelihood functions La and Lb are essentially the same. The likelihood function for the tumors observed in an incidental context is Y Lc = [1 − SX (tk )]ck SX (tk )dk . In the general case, when a tumor is observed in a fatal cases for some animals and is also observed in an incidental context for other animals, the likelihood function is Y [fY (tk )]bk [SY (tk ) − SX (tk )]ck [SX (tk )]dk . Ld = Kodell et al.3 showed that
SY (t) − SX (t) = [1 − Q(t)]SY (t) , where Q(t) = SX (t)/SY (t) is the conditional probability of tumor onset after time t, given tumor-free survival through time t. It follows that Y Ld = [fY (tk )]bk [SY (tk )]ck +dk [1 − Q(tk )]ck Q(tk )dk .
That is, Ld can be expressed as the product of the two likelihood functions Y Lbd = [fY (tk )]bk [SY (tk )]ck +dk , and
Lcd =
Y [1 − Q(tk )]ck Q(tk )dk .
The Lbd and Lcd represent the contributions of the fatal and incidental tumors, respectively. 1.2. Estimation An important first step in the evaluation of animal carcinogenicity data is to estimate the animal survival curve for the assessment of any effects of exposure to the test compound on mortality. The survival curve for each dose group is calculated by the Kaplan-Meier method.4 In this calculation, the weeks of death for animals killed accidentally or sacrificed are considered as censored observations.5 For a given group, suppose that the death time of the animals are observed at tk , k = 1, . . . , m. Let nk denote the number
500
J. J. Chen
of animals that died at or after tk (the number of animals at risk), and xk denote the number of deaths (out of nk ). The Kaplan-Meier estimate of the conditional probability of survival beyond tk given survival beyond t(k−1) is (nk − xk )/nk . The estimated survival function is Y nk − xk ˆ = S(t) , tk ≤ t < t(k+1) . nk tk ≤t
ˆ is calculated by Greenwood’s formula The variance of the S(t) X xk ˆ V [S(t)] = Sˆ2 (t) tk ≤ t ≤ t(k+1) . nk (nk − xk ) tk ≤t
The estimation of the tumor survival functions SX (t) and SY (t) depends on the context of observation. When all tumors are observed in a mortalityindependent context or fatal context, the Kaplan-Meier method can be used to estimate the tumor survival function. The calculation is the same as estimating animal survival function. But, the nk represents the number of animals at risk (have not developed a tumor), and xk is the number of tumor observed (or death caused by the tumor). For the tumors observed in an incidental context, these tumors are only discovered at necropsy, either after sacrifice or after has died from the cause unrelated to the presence of tumor. Let the experiment period be partitioned into J sub-interval such that (tj−1 , tj ], j = 1, . . . , J, where t0 = 0 and tJ denotes the time at which the terminal sacrifice is scheduled. Let cj and dj denote the number of animals that died in the jth time interval for which the tumor is present or absent, respectively. The total number of deaths in the jth time interval is (cj + dj ). Hoel and Walburg6 proposed the tumor prevalence estimate as ˆ = cj , t(j−1) < t ≤ tj . R(t) cj + dj The prevalence method requires to partition the experimental period into several time intervals. The following three partitions have been used in practice: (1) (0, 50], (51, 80], (81, 104], interim sacrifice (if any), and terminal sacrifice; (2) (0, 52], (53, 78], (79, 92], (93, 104], interim sacrifice (if any), and terminal sacrifice; and (3) the Peto ad hoc interval determined by the tumor prevalence data based on the assumption of non-decreasing prevalence function.2 The maximum likelihood estimate of R(t) is estimated by the “pooling adjacent violators” method. For the tumors observed in both incidental and fatal contexts, the maximum likelihood estimate of SX (t) can be obtained by estimating SY (t) and
501
Statistics in Toxicology
Q(t) separately, provided that Q(t) is monotonically nonincreasing. Consequently, the SY (t) and Q(t) are estimated by the Kaplan-Meier estimator and Hoel-Walburg estimator described above. The estimate of tumor onset distribution3 is Y n k − b k cj SˆX (t) = , tj − 1 ≤ t < t j . nk cj + dj tk ≤t
The variance and the variance of SˆX (t) is obtained using a first-order Taylor series, X cj xk . + var[SˆX (t)] ≃ [SˆX (t)]2 nk (nk − xk ) (cj + dj )dj tk ≤t
1.3. Testing First, consider testing for difference in animal survival functions, or difference in the incidence of tumors observed in a mortality-independent context or fatal context. The logrank test or the death-rate method2 is the most widely used procedure for testing the age-specific differences among groups. Consider a carcinogenicity experiment with g groups (d1 , . . . , dg ). Taking all dose groups together as one, suppose that the death times are observed at tk , a time point at which tumors are found in any group, k = 1, . . . , m. Let nik denote the number of animals in the ith group that died at or after tik (at risk), and xik denote the number of deaths (out of nik ). Animal death-tumor data at time tk can be summarized in a 2 × g table as Summary of animal tumor-death data at tk . Dose # With Tumors # At Risk
d1
d2
···
dg
Total
x1k n1k
x2k n2k
··· ···
xgk ngk
x.k n.k
The expected number of tumors in the ith group at time tk is eik = x.k fik , where fik = nik /n.k , i = 1, . . . , g. Thus, the observed and expected numbers P of tumors in the ith group over the entire experiment are Oi = m k=1 xik Pm and Ei = k=1 eik , respectively. Define Di = Oi − Ei =
m X
k=1
(xik − eik )
502
J. J. Chen
and Vrs =
m X x.k (n.k − x.k )frk (δrs − fsk )
n.k − 1
k=1
where δrs is defined as 1 if r = s and 0 otherwise. Let D a = (D1 , . . . , Dg )′ and V a be the g × g matrix with the (r, s) entry Vrs . Then Xa = D ′a V − a Da can serve as a test for heterogeneity among the g groups, where V − a is a generalized inverse of V a . Under the null hypothesis, XH is asymptotically distributed as χ2 distribution with g − 1 degrees of freedom. Also, a dose-related trend test can be considered by using p Za = l ′ D a / l ′ V a l ,
where l = (d1 , . . . , dg )′ . For the tumors observed in an incidental context, the Mantel and Haenszel7 test or the prevalence method2 can be used for comparing the prevalence rates among groups. The prevalence method used is very similar to the death-rate method, except that each interval defines the tumor-death time as described in the estimation. The vector of the differences of observed and expected values D b is calculated the same way as described for the fatal tumors, and the corresponding covariance matrix Vb is computed. The χ2 test statistics for heterogeneity and trend can be calculated similarly. For the tumors observed in both fatal and incidental and contexts, the data for the fatal tumors and for the incidental tumors are analyzed separately by the death-rate and prevalence methods, respectively. The test for the difference in the time of tumor onset is based on the pooled vector D = D a + D b , with covariance matrix V = V a + V b .2 The test statistic for heterogeneity is given by X = (D a + D b )′ (V a + V − b (D a + D b )) . The trend test is given by Z = l′ (D a + D b )/
q l′ (V a + V b )l .
The choice of time intervals for calculating the incidental tumor component of the test of Peto et al.2 is an important consideration. The use of the ad hoc time intervals can be problematic.8 Moreover, the procedures described above for the analysis of occult tumor all require the information of tumor lethality or cause of death. Some argue that the determination
Statistics in Toxicology
503
whether a tumor causes an animal’s death is a rather complicate and subjective process. It is often difficult for a pathologist to classify accurately and objectively a tumor type as straight causing or not causing animal’s death. Furthermore, the validity of context of observation relies on the assumption that tumor-bearing and tumor-free animals of the same age have identical hazard functions for death unrelated to tumor. Chen and Moore9 showed the Peto test performs poorly when there is a large reduction in survival times in the dosed groups. 1.4. Other methods Dinse and Lagakos10 proposed a logistic regression model as an alternative prevalence test for the incidental tumors, exp(µ + τ t + θdi )/[1 + exp(µ + τ t + θdi )] .
(1)
They derived the likelihood score test of θ = 0. The logistic regression analysis assumes that tumor prevalence is a smooth function of ages and it does not require the choice of time intervals. In addition, the logistic regression model can easily incorporate other covariates, and the software is ready available. The logistic regression has been adopted by the NTP as a standard analysis for dose-related trend test. Bailer and Portier11 proposed an alternative survival-adjusted approach that do not require the cause of death information. The approach modifies the Cochran-Armitage test to account for the survival times of those animals that die prior to study termination without tumor presence. The Bailer and Portier approach, has been referred to as the Poly-κ test, can be used to replace the Peto’s procedure when the cause of death information is not available. Let yij be a binary response indicating presence or absence of a tumor type of the jth animal in the ith group who dies at time tij , i = 1, . . . , g, j = 1, . . . , ni . If all animals survive during the whole experiment period, the probability of developing a tumor for a animal in the ith group, say, µi can be modelled by the linear-logistic model logit(µi ) = α + βdi . Let T denote the length of the experiment and tij denote the death time of the ijth animal. Bailer and Portier11 defined a weight equal to 1 if a tumor is present at death, and a weight equal to δij = (tij /T )κ if the animal dies without tumor presence. The parameter δij reflects a less-than-whole
504
J. J. Chen
animal contribution, and κ depends on the tumor type/site. The Poly-κ trend test for the null hypothesis H0 : β = 0 against H1 : β > 0, is given as P P yi. di − p′.. i n′i. di i P P P z= ′ p.. (1 − p′.. )[ i n′i. d2i − ( i n′i. di )2 / i n′i. ] P P P P where yj. = j yi. , n′ik = j δij , and p′.. = ij yij / ij δij . Under the null hypothesis, z is asymptotically standard normally distributed. Bieler and Williams12 proposed a modification to account for the random variation due to δijk . P P P P ai p′ di − ( i ai di )( i ai p′i. )/ i ai P P , z = i Pi. {C[ i ai d2i − ( i ai di )2 / i ai ]}1/2 P where C = ij (rij − r¯i )2 /(N − g), ai = (n′i )2 /ni , p′i. = yi. /n′i , rij = yij − P P P P P p′.. δij , r¯i = j rij /ni , n′i = j δij , p′.. = i yi. / i n′i , yi. = j yij , and P ni is the total number of animals in the ith group, and Nk = i ni . Under the null hypothesis, zk is asymptotically standard normally distributed. The values for κ are between 1 to 6 from the examination of the NTP historical data.11 Recently, the NTP has adopted the modified Poly-3 test (κ = 3) as a standard test for trend and compares the results against the Poly-1.5 and Poly-6 tests. 1.5. Example Stallard and Whitehead13 presented the results of a carcinogenicity experiment with four dose groups in male mice. The control, low, medium groups contained 60 animals and the high dose group contained 59 animals. The experiment lasted for 105 weeks. The tumor data are shown in Table 1. Since the data contained mixture of fatal and incidental tumors, the Peto cause-of-death test was used. The fatal tumors and incidental tumors were analyzed separately by the logrank test and the prevalence method, respectively. The vector for the difference between the observed and expected numbers of tumors is Da = (1.4976, −6.7991, 7.1938, 12.4952) for the fatal tumor and is Db = (−0.1489, −8.0581, −8.8191, 17.0532) for the incidental tumor with the variance-covariance matrices 6.4575 −2.7921 −2.2296 −1.4358 −2.7921 6.2768 −2.1184 −1.3663 Va = 5.4485 −1.1005 −2.2296 −2.1184 −1.4358 −1.3663 −1.1005 3.9026
505
Statistics in Toxicology
Table 1. Tumor data from a carcinogencity experiment presented by Stallard and Whitehead.13 Dose
Deaths without Tumors (frequency in parentheses)
Deaths with Tumors (frequency in parentheses)
Control
15, 62, 90, 92, 96, 97, 101, 105(22)
56, 65, 66, 76, 77, 80, 81, 86 ∗ , 87, 89, 93, 95, 97, 98(2), 103, 104, 105∗ (14)
Low
24, 27, 53, 64, 68, 47, 82, 83, 94, 96, 97, 99, 102, 105∗ (6) 102(2), 103, 104, 105(27)
63, 75, 78, 84, 85, 95, 96, 97, 98, 101,
Medium
5, 7, 39, 65, 70, 75, 76, 80, 82, 83, 87, 91(2), 92, 96(2), 97, 98(2), 99, 100(2), 102, 105(23)
47, 52, 65, 69, 70, 88, 91, 95, 99, 100, 104, 105∗ (3)
High
16, 18, 49, 55, 59, 77, 85(2), 105
57∗ , 60, 66, 70(2), 74(2), 76, 78, 83(2), 84(3), 85, 88, 89, 92, 93(2), 94, 95∗ , 95, 96, 97, 98∗ , 98(2), 99, 100, 101, 102∗ , 102, 103, 104, 105∗ (15)
and
7.8714
−2.8425 Vb = −2.9518 −2.0772
−2.8425 8.7858
−3.4885
−2.4549
−2.9518
−3.4885 8.9895
−2.5493
−2.0772
−2.4549 , −2.5493 7.0813
respectively. Hence, the pooled vector for the fatal and incidental tumors combined is D = (1.3487, −14.8842, −16.0130, 29.5484) with the variancecovariance matrix 14.3289 −5.6346 −5.1814 −3.5130 −5.6346 15.0626 −5.6069 −3.8212 . V = −5.1814 −5.6069 14.4380 −3.6498 −3.5130 −3.8212 −3.6498 10.9839
The χ23 test statistic for heterogeneity among the 4 groups is XH = 88.42 and the Z statistic for dose-response trend is Z = 4.599, where the dose metric is l = (0, 1, 2, 3). Both the heterogeneity test and trend test show statistically significant. The z-score from the Poly-3 test is 4.4328. 1.6. Exact trend tests for incidental tumors In animal carcinogenicity bioassay experiments, the number of animals developing certain tumor types of interest is often small. The methods
506
J. J. Chen
described previously use the asymptotic normal approximation. In general, the mortality patterns, number of intervals used in the partition, and numbers and patterns of tumor occurrence in each interval can have effects on the accuracy of an asymptotic test. When the total number of tumor occurrences is small, the normal approximation may not be reliable.14 The exact permutation test is recommended. The following will describe an exact permutation trend test for tumors observed in an incidental context. The test is a generalization of the Fisher’s exact test to the 2 × g table. The tumors observed in a mortality-independent or fatal context can be tested in a similar way. However, Fairweather et al.15 discussed limitations of applying exact methods to fatal tumors. The data for the tumors observed in the jth interval can be summarized as Summary of animal prevalence data at interval (t(j−1) , tj ]. Dose # with Tumors # deaths
d1
d2
···
dg
Total
r1j n1j
r2j n2j
··· ···
rgj ngj
r.j n.j
where nij , here, is the total number of animals from the ith group that died in the jth time interval, rij is the number of animals (out of nij ) found to have the tumor of interest, and r.j and n.j are the row marginal totals which are fixed for all j = 1, . . . , J. Conditional on r.j and n.j , under the null hypothesis of no difference among groups, the conditional distribution of (r1j , r2j , . . . , rgj ) is the multivariate hypergeometric distribution n n n gj 2j 1j ··· rgj r2j r1j . P (r1j , r2j , . . . , rgj ) = n .j r.j Assuming independence among the J tables, the joint probability of (r11 , . . . , r21 , . . . , rgJ ) for a random permutation outcome is P (r11 , . . . , r21 , . . . , rgJ ) =
J Y
P (r1j , r2j , . . . , rgj ) .
j=1
The trend score associated with (r1j , r2j , . . . , rgj ) in the j-interval is defined by Sj =
g X i=1
di rij .
Statistics in Toxicology
507
The probability distribution for the trend score statistic Sj is X P (Sj = sj ) = P (r1j , r2j , . . . , rgj ) , Ωj
Pg where Ωi consists of all possible permutations of rij such that i=1 rij = Pg r.j and i=1 di rij = sj . The trend score associated with a random permutation (r11 , . . . , r21 , . . . , rgJ ) is S=
J X j=1
Sj =
g J X X
di rij .
j=1 i=1
The probability distribution for the trend score S is X P (S = s) = P (S1 = s1 ) · · · P (S2 = s2 ) · · · P (SJ = sJ ) , Ω
where Ω consists of all possible permutations (r11 , . . . , r21 , . . . , rgJ ) such PJ ∗ that j=1 sj = s. Let s denote the trend score associated with the observed outcome. The exact one-tailed p-value for a positive trend is X p-value = P (S = s) . s≥s∗
Note that when k = 1 and g = 2, this procedure becomes the Fisher exact test. Traditionally, the definition of the p-value of an exact test is the cumulative sum of the probability of the observed outcome and the probabilities of all more extreme outcomes. The trend p-value described above is the sum of the probabilities of all permutations whose trend scores are greater than or equal to the trend score of the observed outcome s∗ . Thus, every permutation with a trend score equal to s∗ is included in the p-value computation irrespective of the magnitude of its probability of occurrence. Chen et al.16 argued that those permutations whose probabilities are greater (less extreme) than the probability of the observed outcome should not be included. A less conservative exact p-value is X X p-value = P (S = s) + P (S1 = s1 ) · P (S2 = s2 ) · · · P (SJ = sJ ) , s>s∗
ω
Pk ∗ and where ω consists of all permutations such that j=1 sj = s ∗ ∗ P (r11 , . . . , r21 , . . . , rgJ ) ≤ P , and P denotes the probability of the observed outcome. The lung adenoma data observed in female mice from a two-year feeding study of phenylephrine hydrochloride conducted by the National Toxicology
508
J. J. Chen
Table 2. The incidence of lung adenoma in female mice in the two-year feeding study of phenylephrine. Weeks
0 ppm
1250 ppm
2500 ppm
Total
0–52
r1j n1j
0 1
0 0
0 3
0 4
53–78
r2j n2j
0 1
0 3
0 3
0 7
79–92
r3j n3j
1 9
0 5
0 2
1 16
93–104
r4j n4j
0 39
3 42
5 42
8 123
Table 3.
Computations of the p-value for the exact trend test.
79–92 Pattern
S3
93–104 P3
Pattern
S4
Combined P4
S3 + S4
Prob.
> 16, 250
0.01987
1, 0, 0
0 0
0.5625 0.5625
1, 1, 6 0, 3, 5
16,250 16,250
0.008343 0.009482
16,250 16,250
0.00469 0.00533
0, 1, 0
1250 1250 1250
0.3125 0.3125 0.3125
2, 0, 6 0, 4, 4 2, 0, 6
15,000 15,000 15,000
0.003774 0.012165 0.027736
16,250 16,250 16,250
0.00118 0.00380 0.00867
0, 0, 1
2500 2500 2500
0.1250 0.1250 0.1250
0, 5, 3 2, 1, 5 1, 3, 4
13,750 13,750 13,750
0.009482 0.025707 0.048660
16,250 16,250 16,250
0.00118 0.00321 0.00608
Program (NTP, 1987) is analyzed for illustration. This experiment contained a control and two dose groups, 1250 and 2500 ppm. In the analysis of tumor incidence data, NTP generally groups the animals into the following four time intervals to adjust for intercurrent mortality: 0–52, 53–78, 79–92, 93–104 weeks. Table 2 shows the number of lung adenomas and the number of deaths occurring in the four time intervals. The computations of the p-value for the exact trend test are shown in Table 3. Each row in the table corresponds to a permutation for which the score is 16,250 from the sum of Columns S3 and S4 . The exact p-value (= 0.0540) is the sum of the probabilities of the right most column, and the exact p-value (= 0.0393) by the Chen et al.16 is obtained by excluding two probability values 0.00867 and 0.00608 as they are greater than 0.00533, the probability
Statistics in Toxicology
509
of the data observed. The MH asymptotic trend test gives a p-value of 0.0347. For a significance level of 5%, the Chen et al.16 adjustment would indicate a statistically significant result, in agreement with the MH trend test. 1.7. Discussion A typical carcinogenicity experiment examines approximately 20–50 tumor types/sites, statistical tests often perform for 10–30 types/sites routinely. Performing several tests without appropriately accounting for the multiplicity effect can inflate the overall Type I error rate or familywise error rate (FWE). Statistical Methods for the analysis of multiple tumor types/sites have been proposed by several authors.17–20 Haseman21 presented a rule rejecting a hypothesis for a rare tumor (spontaneous rate at most 0.01) when p ≤ 0.05 and for a common tumor (spontaneous rate greater than 0.01) with p ≤ 0.01. The Center for Drug Evaluation and Research (CDER) of FDA has adopted the “Haseman rule” in comparing tumor incidence rates between the control and dose groups in its evaluation of tumorigenicity studies of new drugs, and has recently recommended a new rejection rule for a positive dose-related trend test with p ≤ 0.025 for rare rumors and p ≤ 0.005 for common tumors.22 Interpreting results of carcinogenicity experiments is a complex process, and there are risks of both false negative and false positive results. The relatively small number of animals used, and the low tumor incidence rates can cause carcinogenicity of a compound not to be detected (i.e. a false negative error is committed). Because of the large number of comparisons involved, there is also a great potential or finding statistically significant positive trends or differences in some tumor types that are due to chance alone (i.e. a false positive error is committed). The inflated false-positive rate can invalidate the use of animal carcinogenicity data. Controlling both false positive and false negative rates should be the central issue in the statistical analysis of animal carcinogenicity experiments from the safety assessment viewpoint. 2. Reproductive Studies Reproductive studies are conducted to assess reproductive risk to mature adults and to the developing individual from the exposure to drugs and environmental compounds. Adverse reproductive and developmental effects include effects on male and female fecundity, spontaneous abortion, infant
510
J. J. Chen
and child death, congenital malformations, growth retardation, and mental retardation. Three segments of study are required in preclinical animal testing for each new drug depending on how women might be exposed to the drug.23 These are referred to as Segment I (fertility and general reproductive performance), Segment II (developmental effects), and Segment III (prenatal and postnatal evaluations). The Segment I study is aimed at providing an overall evaluation of the effects of drugs on fertility in both sexes, the course of gestation, early and late stages of the development of the embryo and fetus, and postnatal development. The studies may be conducted by treating animals of only one sex and mating with untreated animals of the opposite sex, or by treating both male and female animals. Segment II is primarily aimed at detecting teratogenic effects. The drug is given to the pregnant females during the period of organogenesis, e.g. days 6–15 for rats and mice, and days 6–18 for rabbits. The offspring are removed one or two days before term, and corpora lutea, resorption sites, and live and dead fetuses are examined. Fetuses are weighed and examined for anomalies. Segment III is aimed at the evaluation the effects of drugs on the late stages of gestation and on parturition and lactation. The drug is given to pregnant females in the final one-third of gestation and continued throughout lactation to weaning, e.g. gestation day 15 to postnatal day 21 for rats or mice. The effects on duration of gestation is determined. Pup birth and developmental data including litter size, weight, and postnatal growth and mortality, along with impaired maternal behavior are recorded and measured. Mice, rats, and rabbits are the most commonly used species for reproductive and developmental studies. The experimental design is very similar to the carcinogenicity experiment consisting an untreated control and three dose groups. The US regulatory guidelines generally recommend about 20 pregnant rodents and 15 nonrodent animals per dosage group. The ICH guideline recommends 16 to 20 pregnant animals per group. All adult animals are necropsied at terminal sacrifice.24,25 Regulatory requirements specify that a wide range of endpoints must be measured, recorded, and analyzed. The endpoints can be divided into two categories: parental and embryonic/fetal endpoints. Since the test compound is administered to the adult animal, the effect of the test compound occurs in the female that receives the compound, or that is mated to a male that receives the compound, the treatment affects the fetuses indirectly via the dam. The fetal responses from the same dam are expected to be more alike than responses from different dams. This phenomenon is referred
Statistics in Toxicology
511
to as the “litter effect”. In the analysis of embryonic/fetal endpoints, the experimental unit should be the entire litter rather than an individual fetus. Failure to account for the intra-litter correlations by using each fetus as the experimental unit will inflate the Type I error and will reduce the validity of the test. The classical approach to the analysis of reproductive data is based on the litter mean. However, this approach does account for differences in litter sizes. An alternative approach is to model the fetal endpoints in a litter as correlated outcomes (clustered data). These two approaches will described in this section. Statistical methods described will be according to three measures, continuous, binary, and count. 2.1. Per-litter based analysis Consider a typical experiment of g groups, a control and g − 1 dose groups. Assume that the ith group contains mi female animals. Let yijk be the response from a fetus out of nij examined or tested for a particular developmental outcome, 1 ≤ i ≤ g, 1 ≤ j ≤ mi , and 1 ≤ k ≤ nij . Note that yijk may be an indicator variable representing the presence or absence of a particular malformation type or a continuous variable representing a fetal weight or postnatal performance measurement. Depending on the endpoint of interest, nij may represent the number of viable fetuses, number of implants, or number of measurements. The litter-based analysis is based P on the per-litter response yij = k yijk /nij . For a continuous response, yij will represent the mean fetus response; for a discrete variable, it will represent the sum of the fetal responses. The yij can be viewed and analyzed as a maternal endpoint. 2.1.1. Continuous data Continuous data such as body weights, organ weights, or behavioral measurements conducted on offspring following birth are measured on a continuous scale. The continuous endpoints are measured either at the litter level in an adult animal (e.g. maternal body weights) yij , or at the individual fetus level (fetal body weights) yijk . Analysis of variance (ANOVA) is the most commonly used procedure for the analysis of continuous data.25 The ANOVA method assumes that data are independently and normally distributed with homogeneous variance. Transformations such as the logarithmic, square-root and arc-sine are often applied to satisfy the normality assumption and stabilize the variance. A simple one-way ANOVA analysis is the comparison of maternal endpoints among groups. Developmental
512
J. J. Chen
endpoints are analyzed similarly but in terms of the average within each litter as described.26 Nonparametric methods are used when the assumption of normality fails. The nonparametric analysis is initiated by ranking all observations of the combined groups. The repeated measures ANOVA is often used for the analysis of postnatal behavioral data.27 2.1.2. Binary data Binary endpoints can also be measured either at the parent level, such as success or failure of pregnancy, or at the individual fetal level, such as presence or absence of a particular malformation type. Statistical methods for the analysis of the prenatal and fetal responses are different. Asymptotic chi-square test is often used for the analysis of prenatal binary endpoints for comparing the incidence rates among several groups.28 The CochranArmitage test is used to test for trend.29,30 The Fisher exact test is the best known permutation test for comparing two groups. General computational algorithms and software to perform all possible permutations are given by Mehta et al.31 The general approach to the analysis of fetal binary endpoints is to consider the proportion per-litter such as the proportion of live fetuses with a certain type of malformation. Typically, the proportions are transformed by an arc-sine transformation, and then the parametric ANOVA methods are used. The litter-based approach does not use the data effectively since it does not account for the litter size. For example, one out of two is treated as the same as five out of ten. 2.1.3. Count Data A number of primary reproductive endpoints are measured in counts. In a dominant lethal assay, male mice are treated with a suspect mutagen, and then are mated with females. The numbers of corpora lutea, implantations, lives, and dead conceptuses are counted to assess reproductive effects of the test compound. Count data are often normalized by the square root transformation, the transformed data are then analyzed as continuous data using the parametric ANOVA methods. Count data can also be analyzed by nonparemetric methods. 2.1.4. Example A study of the effects of maternal exposure to diethylhexyl phthalate (DEHP) in rats is presented as an example. Table 4 contains a summary of
2.1.4. Example

A study of the effects of maternal exposure to diethylhexyl phthalate (DEHP) in rats is presented as an example. Table 4 contains a summary of the analysis for selected endpoints.

Table 4. Reproductive parameters after exposure of pregnant Fischer 344 rats to diethylhexyl phthalate in the feed on gestational days 0-20.

                         Diethylhexyl phthalate % in feed
Endpoint                 0.0               0.5               1.0               1.5               2.0
No. pregnant dams        24                23                22                24                25
Maternal wgt gd 0        173.60 (3.25)     175.37 (3.37)     172.42 (3.12)     171.77 (2.80)     173.33 (2.95)
Maternal wgt gd 20       248.30 (3.64)a    246.00 (2.90)     232.73 (3.25)*    217.76 (3.25)*    184.15 (4.28)*
Maternal wgt gain        74.69 (1.71)a     70.63 (2.48)      60.31 (1.44)*     45.99 (1.78)*     10.82 (3.00)*
Gravid uterine wgt       49.79 (1.39)a     44.83 (2.49)      46.81 (1.31)      43.86 (0.96)      19.08 (3.71)*
Maternal liver wgt       9.75 (0.12)a      11.72 (0.17)*     12.06 (0.17)*     12.21 (0.19)*     11.11 (0.15)*
Corpora lutea            10.91 (0.33)      10.96 (0.22)      11.18 (0.30)      11.17 (0.18)      10.52 (0.61)
Implantation sites       10.92 (0.31)      9.83 (0.54)       10.59 (0.24)      10.58 (0.22)      10.40 (0.41)
% viable                 10.46 (0.32)a     9.39 (0.53)       10.05 (0.26)      10.08 (0.25)      8.00 (0.70)*
% male                   53.56 (3.65)      45.77 (4.51)      46.39 (3.31)      54.44 (3.00)      50.66 (6.10)
% malformation           1.27 (0.71)a      0.00 (0.00)       1.92 (1.11)       3.13 (1.06)       2.87 (1.64)
Fetal weight             3.022 (0.029)a    3.143 (0.035)*    2.852 (0.053)*    2.557 (0.034)*    2.266 (0.041)*

a Significant linear trend.  * Significantly different from control.
For the detailed design and analysis, the reader is referred to Tyl et al.32 As described above, ANOVA was used for the analysis of continuous endpoints and per-litter proportion data. Prior to analysis, the arc-sine square-root transformation was applied to all maternal or per-litter proportion data. When a significant (p < 0.05) dose effect occurred, Duncan’s multiple range test was used for pairwise comparisons between the control and each dose group. A test of linear dose-response trend was performed using contrast tests. A one-sided test was used for pairwise comparisons, except for the maternal and fetal body weights and the percentage of males per litter.

2.2. Likelihood and quasi-likelihood/generalized-estimating-equations approaches

As discussed, the fetal responses within a litter are not independent. The proper experimental unit in the analysis should be the litter, with the fetal responses representing multiple observations from a single experimental unit. The likelihood-based and the generalized estimating equations (or quasi-likelihood) approaches are the two commonly used approaches to modeling such correlated data.

2.2.1. Modeling continuous data

A general approach to modeling fetal data can be carried out in terms of a mixed-effects model. Dempster et al.33 proposed a normal mixed-effects model with two levels of variance, in which the litter effect is modeled by a nested random factor and dose by a fixed factor. The response y_ijk in a litter is expressed as a mixed-effects model with two sources of variation, the between-litter component γ_ij and the within-litter component e_ijk:

$$y_{ijk} = \mu_{ij} + \gamma_{ij} + e_{ijk}.$$

The random components γ_ij and e_ijk are independently normally distributed with E(γ_ij) = 0, var(γ_ij) = σ_a^2, and E(e_ijk) = 0, var(e_ijk) = σ^2. Thus, the mean and variance of y_ijk are E(y_ijk) = µ_ij and var(y_ijk) = σ_a^2 + σ^2. The intra-litter correlation between y_ijk and y_ijk′, for k ≠ k′, is φ = σ_a^2/(σ^2 + σ_a^2). The mean parameter µ_ij is often modeled as a linear function of dose, µ_ij = β_0 + β_1 d_i, for the trend test (H_0: β_1 = 0). Computational techniques for the maximum likelihood estimation of the parameters of linear mixed-effects models have been proposed by many authors.34–36 The estimates can be obtained using the PROC MIXED procedure of SAS.37 Alternatively, the generalized estimating equations (GEE) approach38 can be used to estimate the fixed-effects parameters β. Under the normal model, the likelihood-based and GEE approaches have the same estimating equations for the mean parameters, but unlike the likelihood approach, the GEE uses the method of moments to estimate the variance component parameters.
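A rough sketch of both analyses in a general-purpose statistical language is shown below; it is not the SAS PROC MIXED analysis itself, and the column names and simulated data are hypothetical.

```python
# Random-intercept (litter) mixed model and a GEE fit with exchangeable
# intra-litter correlation for a continuous fetal endpoint.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
rows = []
for dose in [0.0, 0.5, 1.0, 1.5]:
    for dam in range(20):
        gamma_ij = rng.normal(0, 0.15)               # between-litter term
        for k in range(rng.integers(8, 14)):         # n_ij fetuses
            rows.append({"dose": dose, "dam": f"{dose}-{dam}",
                         "weight": 3.0 - 0.3*dose + gamma_ij + rng.normal(0, 0.2)})
fetal = pd.DataFrame(rows)

# Likelihood-based mixed model: y_ijk = b0 + b1*dose + gamma_ij + e_ijk
mixed = smf.mixedlm("weight ~ dose", data=fetal, groups=fetal["dam"]).fit()
print(mixed.summary())

# GEE with an exchangeable working correlation for the same mean model
gee = smf.gee("weight ~ dose", groups="dam", data=fetal,
              cov_struct=sm.cov_struct.Exchangeable(),
              family=sm.families.Gaussian()).fit()
print(gee.summary())
```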
2.2.2. Modeling binary data

Let y_ijk denote the presence or absence of a response. Assume that the mean and variance of y_ijk are E(y_ijk) = µ_i and var(y_ijk) = µ_i(1 − µ_i), and that the correlation is corr(y_ijk, y_ijk′) = φ_i, where k, k′ = 1, ..., n_ij and k ≠ k′. The parameter µ_i is the probability of a developmental effect in the ith group, and φ_i is the intra-litter correlation coefficient. The binary responses y_ij1, ..., y_ijn_ij within each litter are assumed exchangeable, that is, if {k_1, ..., k_l} is a subset of {1, ..., n_ij}, then

$$\Pr(y_{ij1}=1,\ldots,y_{ijl}=1) = \Pr(y_{ijk_1}=1,\ldots,y_{ijk_l}=1), \qquad \text{for all } l = 1,\ldots,n_{ij}.$$

Let y_ij = y_ij1 + ··· + y_ijn_ij; then the mean and variance of y_ij are E(y_ij | n_ij) = n_ij µ_i and var(y_ij | n_ij) = n_ij µ_i(1 − µ_i)[φ_i(n_ij − 1) + 1]. The intra-litter correlation coefficient generally is positive (φ_i > 0); thus the variance n_ij µ_i(1 − µ_i)[φ_i(n_ij − 1) + 1] is greater than the nominal binomial variance n_ij µ_i(1 − µ_i). The distribution of y_ij is known as an extra-binomial variate. Note that if φ_i = 0, then all the y_ijk's are independent binary random variables and y_ij is binomial,

$$P(y_{ij}) = \binom{n_{ij}}{y_{ij}} \mu_i^{y_{ij}} (1-\mu_i)^{n_{ij}-y_{ij}}.$$

In a binomial or extra-binomial model, the mean function is often modeled by a logit function, logit(µ_i | z_ij) = β′z_ij. The dose-response model for the trend test is given by

$$\mu_i = \frac{\exp(\beta_0 + \beta_1 d_i)}{1 + \exp(\beta_0 + \beta_1 d_i)}.$$
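As one practical way of fitting this kind of clustered binary model, the sketch below uses a GEE fit of fetus-level malformation indicators with an exchangeable working correlation, anticipating the quasi-likelihood estimation described later in this subsection; the data layout, names, and simulated values are hypothetical.

```python
# Marginal (GEE) logit dose-response model for fetus-level binary outcomes
# with an exchangeable intra-litter correlation.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
rows = []
for dose in [0.0, 0.5, 1.0, 1.5]:
    for dam in range(25):
        p_litter = rng.beta(1 + 4*dose, 30)          # litter-specific risk
        for k in range(rng.integers(8, 14)):
            rows.append({"dose": dose, "dam": f"{dose}-{dam}",
                         "malformed": rng.binomial(1, p_litter)})
fetal = pd.DataFrame(rows)

ex = sm.cov_struct.Exchangeable()
gee = smf.gee("malformed ~ dose", groups="dam", data=fetal,
              family=sm.families.Binomial(), cov_struct=ex).fit()
print(gee.summary())
print("Estimated intra-litter correlation:", ex.dep_params)
```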
The beta-binomial model is the distribution most commonly used for modeling extra-binomial variation in such data.39 The beta-binomial model assumes that responses within the same litter occur according to a binomial distribution, and that the probability of response varies among litters according to a beta distribution:
$$P(y_{ij}) = \binom{n_{ij}}{y_{ij}} \frac{B(a_i + y_{ij},\, b_i + n_{ij} - y_{ij})}{B(a_i, b_i)},$$

where B(a_i, b_i) = Γ(a_i)Γ(b_i)/Γ(a_i + b_i), Γ(·) is the gamma function, a_i > 0, and b_i > 0. Under the reparameterization µ_i = a_i/(a_i + b_i) and φ_i = (a_i + b_i + 1)^{-1}, the parameters µ_i and φ_i are, respectively, the mean and the intra-litter correlation parameters in the ith group. That is, the mean of y_ij is E(y_ij) = n_ij µ_i, and the intra-litter correlation is corr(y_ijk, y_ijk′) = φ_i. The variance of y_ij is var(y_ij) = [φ_i(n_ij − 1) + 1][n_ij µ_i(1 − µ_i)]. When φ_i = 0, y_ij becomes a binomial variable. The parameters can be estimated by the maximum likelihood method, and the likelihood ratio or Wald test is often used to test for the significance of parameters. An advantage of the likelihood-based beta-binomial model is that the parameters β as well as the intra-litter correlations φ can be tested directly. One problem with the beta-binomial model is the bias and instability of the MLE's of β, as discussed by Williams.40

Williams41 proposed using the quasi-likelihood method as an alternative approach to the beta-binomial model. In the quasi-likelihood approach, only assumptions on the mean and variance are required: E(y_ij | n_ij) = n_ij µ_i and var(y_ij | n_ij) = n_ij µ_i(1 − µ_i)[φ(n_ij − 1) + 1]. Note that in the quasi-likelihood estimation, the intra-litter correlation is typically modeled as constant across groups. The coefficients β can be obtained by solving the quasi-likelihood estimating (score) equations

$$S(\beta_l) = \sum_{i=1}^{g}\sum_{j=1}^{m_i} z_{ijl}\,\frac{y_{ij} - n_{ij}\mu_i}{1 + (n_{ij}-1)\phi} = 0, \qquad l = 1, 2.$$
The intra-litter correlation coefficient is calculated by equating it with the mean of the Pearson chi-square statistics, i.e.

$$\phi = \frac{1}{N-2}\sum_{i=1}^{g}\sum_{j=1}^{m_i} \frac{(y_{ij} - n_{ij}\mu_i)^2}{n_{ij}\mu_i(1-\mu_i)\,[\phi^{-1} + (n_{ij}-1)]},$$

where $N = \sum_{i=1}^{g}\sum_{j=1}^{m_i} n_{ij}$. The parameters β and φ are estimated by alternately solving S(β_l) = 0 and updating φ until convergence. The Wald test is often used to test for the significance of the β's.
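The likelihood-based beta-binomial fit can be sketched as follows for a single dose group, coding the log-likelihood directly from the probability function and reparameterization above; the litter counts are hypothetical and the optimizer settings are only one reasonable choice.

```python
# Maximum likelihood estimation of (mu, phi) in the beta-binomial model
# for litter-level malformation counts from one dose group.
import numpy as np
from scipy.special import betaln, gammaln
from scipy.optimize import minimize

y = np.array([0, 1, 0, 2, 0, 0, 3, 1, 0, 1])     # malformed fetuses per litter
n = np.array([11, 12, 9, 13, 10, 12, 11, 10, 12, 9])   # fetuses per litter

def comb_ln(n, y):
    return gammaln(n + 1) - gammaln(y + 1) - gammaln(n - y + 1)

def negloglik(theta):
    mu, phi = theta
    a = mu * (1 - phi) / phi          # from mu = a/(a+b), phi = 1/(a+b+1)
    b = (1 - mu) * (1 - phi) / phi
    ll = comb_ln(n, y) + betaln(a + y, b + n - y) - betaln(a, b)
    return -ll.sum()

res = minimize(negloglik, x0=[0.1, 0.1], method="L-BFGS-B",
               bounds=[(1e-6, 1 - 1e-6), (1e-6, 1 - 1e-6)])
mu_hat, phi_hat = res.x
print(f"mu = {mu_hat:.3f}, intra-litter correlation phi = {phi_hat:.3f}")
```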
2.2.3. Modeling count data

Count data are generally modeled by a Poisson distribution. Let n_ij be an observed count from the jth animal in the ith group, 1 ≤ i ≤ g and 1 ≤ j ≤ m_i. If n_ij has the Poisson distribution

$$p(n_{ij}) = \frac{\mu_i^{n_{ij}} e^{-\mu_i}}{n_{ij}!}, \qquad n_{ij} = 0, 1, 2, \ldots,$$
the mean and variance of n_ij are E(n_ij) = var(n_ij) = µ_i. In the Poisson model, the mean function is often modeled by a log-linear function, log(µ_i | z_ij) = β′z_ij. The dose-response model for the trend test is µ_i = exp(β_0 + β_1 d_i).

A common complication in the analysis of count data is that the observed variation often exceeds, or falls short of, the variation predicted by a Poisson model. The classical approach is to assume that the Poisson mean has a gamma distribution, which leads to a negative binomial (gamma-Poisson) distribution for the observed data,

$$p(n_{ij}) = \frac{\Gamma(n_{ij} + \phi_i^{-1})}{\Gamma(n_{ij}+1)\,\Gamma(\phi_i^{-1})} \left(\frac{\phi_i\mu_i}{1+\phi_i\mu_i}\right)^{n_{ij}} \left(\frac{1}{1+\phi_i\mu_i}\right)^{1/\phi_i},$$

where φ_i > 0. The maximum likelihood estimation of the negative binomial model was described in detail by Lawless.42 The significance of the parameters can be tested using either the likelihood ratio test or the Wald test. Like the beta-binomial model, the negative binomial model can be applied to testing for extra-Poisson variation. A limitation of the parametric approach is its restriction to φ ≥ 0; in applications, for example, the number of implants or the number of corpora lutea per litter may exhibit sub-Poisson variation. The quasi-likelihood approach43 provides a method for modeling sub-Poisson variation. The quasi-likelihood approach assumes that the mean and variance of the count data are of the negative binomial form, E(n_ij) = µ_i and var(n_ij) = µ_i(1 + φµ_i). The coefficients β can be obtained by solving the score equations

$$S(\beta_l) = \sum_{i=1}^{g}\sum_{j=1}^{m_i} z_{ijl}\,\frac{n_{ij} - \mu_i}{\mu_i + \phi\mu_i^2} = 0, \qquad l = 1, 2.$$
The parameter φ is calculated by equating it with the mean of the Pearson chi-square statistics,

$$\phi = \frac{1}{N-2}\sum_{i=1}^{g}\sum_{j=1}^{m_i} \frac{(n_{ij} - \mu_i)^2}{\mu_i\phi^{-1} + \mu_i^2},$$

where $N = \sum_{i=1}^{g}\sum_{j=1}^{m_i} n_{ij}$. The parameters β and φ are estimated by alternately solving S(β_l) = 0 and updating φ until convergence.
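A sketch of corresponding model fits for a count endpoint is given below; it uses a Poisson and a negative binomial generalized linear model with the dispersion held fixed (a full maximum likelihood fit would also estimate the dispersion), and the simulated counts are hypothetical.

```python
# Log-linear dose-response model for a per-dam count endpoint:
# Poisson GLM, then a negative binomial GLM to allow extra-Poisson variation.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
rows = []
for dose in [0.0, 0.5, 1.0, 1.5]:
    mu = np.exp(np.log(11.0) - 0.15 * dose)          # mu_i = exp(b0 + b1*d_i)
    for dam in range(24):
        # gamma-Poisson mixture produces overdispersed counts
        rows.append({"dose": dose,
                     "count": rng.poisson(mu * rng.gamma(5.0, 1/5.0))})
dams = pd.DataFrame(rows)

pois = smf.glm("count ~ dose", data=dams, family=sm.families.Poisson()).fit()
negb = smf.glm("count ~ dose", data=dams,
               family=sm.families.NegativeBinomial(alpha=0.2)).fit()
print(pois.summary())
print(negb.summary())
# Pearson chi-square / df as a quick check for over- or under-dispersion
print("Poisson Pearson chi2 / df:", pois.pearson_chi2 / pois.df_resid)
```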
Quasi-likelihood approaches to estimating and testing the means of various mixed Poisson models are given by Chen and Ahn.44 Examples of analyses of fetal response toxicity data using the likelihood and the quasi-likelihood/GEE approaches are given in Chen.45

2.3. Multiple developmental outcomes

The standard approach for assessment of the developmental risks of a compound has been based on the analysis of each developmental endpoint separately. It has been suggested that the developmental toxicity outcomes (i.e. death/resorption, malformation, growth retardation, etc.) may represent different degrees of response to a toxic insult and occur in a dose-related manner.45,46 These developmental outcomes are likely to be correlated. Therefore, a joint analysis of multiple developmental outcomes can have some advantages: it can increase the power to detect effects if the multiple outcomes are manifestations of some common biological effect, and it allows investigation of associations among the multiple outcomes if they are the results of different biological mechanisms. Various multivariate models have been developed for the simultaneous analysis of multiple endpoints.47–50

References

1. Kodell, R. L. and Nelson, C. J. (1980). An illness-death model for the study of the carcinogenic process using survival/sacrifice data. Biometrics 36: 267.
2. Peto, R., Pike, M. C., Day, N. E. et al. (1980). Guidelines for simple sensitive significance tests for carcinogenic effects in long-term animal experiments. In Long-term and Short-term Screening Assays for Carcinogens: A Critical Appraisal, Annex to Supplement 2, International Agency for Research on Cancer, Lyon, 311.
3. Kodell, R. L., Shaw, G. W. and Johnson, A. M. (1982). Nonparametric joint estimators for disease resistance and survival functions in survival/sacrifice experiments. Biometrics 38: 43.
4. Kaplan, E. L. and Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association 53: 457.
5. Gart, J. J., Krewski, D., Lee, P. N. et al. (1986). Statistical Methods in Cancer Research, Vol. 3: The Design and Analysis of Long-Term Animal Experiments. International Agency for Research on Cancer, Lyon.
6. Hoel, D. and Walburg, H. (1972). Statistical analysis of survival experiments. Journal of the National Cancer Institute 49: 361.
7. Mantel, N. and Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute 22: 719.
8. Kodell, R. L., Chen, J. J. and Moore, G. E. (1994). Comparing distributions for time to onset of disease in animal tumorigenicity experiments. Communications in Statistics — Theory and Methods 23: 959.
9. Chen, J. J. and Moore, G. E. (1994). Impact of surviving time on tests for carcinogenicity. Communications in Statistics — Theory and Methods 23: 1375.
10. Dinse, G. E. and Lagakos, S. W. (1983). Regression analysis of tumor prevalence data. Applied Statistics 32: 236.
11. Bailer, A. J. and Portier, C. J. (1988). Effects of treatment-induced mortality and tumor-induced mortality on tests for carcinogenicity in small samples. Biometrics 44: 417.
12. Bieler, G. S. and Williams, R. L. (1993). Ratio estimates, the delta method, and quantal response tests for increased carcinogenicity. Biometrics 49: 793.
13. Stallard, N. and Whitehead, A. (1999). Modified Weibull multi-state models for the analysis of animal carcinogenicity data. Environmental and Ecological Statistics 6: xx.
14. Chen, J. J. and Gaylor, D. W. (1986). The upper percentiles of the distribution of the logrank statistics for small numbers of tumors. Communications in Statistics — Theory and Methods 15: 991.
15. Fairweather, W. R., Bhattacharyya, P. P., Ceuppens, G. et al. (1998). Biostatistical methodology in carcinogenicity studies. Drug Information Journal 32: 401.
16. Chen, J. J., Kodell, R. L. and Pearce, B. A. (1997). Significance levels of randomization trend tests in the event of rare occurrences. Biometrical Journal 39: 327.
17. Heyse, J. F. and Rom, D. (1988). Adjusting for multiplicity of statistical tests in the analysis of carcinogenicity studies. Biometrical Journal 30: 883.
18. Westfall, P. H. and Young, S. S. (1989). p-value adjustment for multiple tests in multivariate binomial models. Journal of the American Statistical Association 84: 780.
19. Chen, J. J. (1996). Global tests for analysis of multiple tumor data from animal carcinogenicity experiments. Statistics in Medicine 15: 1217.
20. Chen, J. J., Lin, K. K., Huque, M. et al. (2000). Weighted p-value adjustments for animal carcinogenicity trend test. Biometrics 56.
21. Haseman, J. K., Huff, J. and Boorman, G. A. (1984). Use of historical control data in carcinogenicity studies in rodents. Toxicology and Pathology 12: 126.
22. Lin, K. K. and Rahman, M. A. (1998). Overall false positive rates in tests for linear trend in tumor incidence in animal carcinogenicity studies of new drugs. Journal of Biopharmaceutical Statistics 8: 1.
23. US Food and Drug Administration (1996). International Conference on Harmonisation: Guideline on Detection of Toxicity to Reproduction for Medicinal Products. FDA, Rockville, MD.
24. Christian, M. S. and Hoberman, A. M. (1996). Perspectives on the US, EEC, and Japanese developmental toxicity testing guidelines. In Handbook of Developmental Toxicology, ed. R. D. Hood, CRC Press, NY, 551.
25. Tyl, R. W. and Marr, M. C. (1996). Developmental toxicity testing — methodology. In Handbook of Developmental Toxicology, ed. R. D. Hood, CRC Press, NY, 175.
26. Healy, M. J. R. (1972). Animal litters as experimental units. Applied Statistics 21: 155.
27. Karpinski, K. F. (1991). In Statistics in Toxicology, eds. D. Krewski and C. Franklin, Gordon and Breach Science, NY, 393.
28. Agresti, A. (1990). Categorical Data Analysis, John Wiley and Sons, NY.
29. Cochran, W. G. (1954). Some methods for strengthening the common χ2 tests. Biometrics 10: 417.
30. Armitage, P. (1955). Tests for linear trends in proportions and frequencies. Biometrics 11: 375.
31. Mehta, C. R., Patel, N. R. and Senchaudhuri, P. (1992). Journal of Computational and Graphical Statistics 1: 21.
32. Tyl, R. W., Price, C. J., Marr, M. C. et al. (1983). Teratologic Evaluation of Diethylhexyl Phthalate in Fischer 344 Rats. Research Triangle Park, NC.
33. Dempster, A. P., Selwyn, M. R., Patel, C. M. et al. (1984). Statistical and computational aspects of mixed model analysis. Applied Statistics 33: 203.
34. Harville, D. A. (1977). Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association 72: 320.
35. Laird, N. M. and Ware, J. H. (1982). Random-effects models for longitudinal data. Biometrics 38: 963.
36. Lindstrom, M. J. and Bates, D. M. (1988). Newton–Raphson and EM algorithms for linear mixed-effects models for repeated-measures data. Journal of the American Statistical Association 83: 1014.
37. SAS Institute Inc. (1994). Getting Started with PROC MIXED. SAS Institute Inc., Cary, NC.
38. Liang, K. Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika 73: 13.
39. Williams, D. A. (1975). The analysis of binary responses from toxicological experiments involving reproduction and teratogenicity. Biometrics 31: 949.
40. Williams, D. A. (1988). Estimation bias using the beta-binomial distribution in teratology. Biometrics 44: 305.
41. Williams, D. A. (1982). Extra-binomial variation in logistic linear models. Applied Statistics 31: 144.
42. Lawless, J. F. (1987). Negative binomial and mixed Poisson regression. The Canadian Journal of Statistics 15: 209.
43. Breslow, N. E. (1984). Extra-Poisson variation in log-linear models. Applied Statistics 33: 38.
44. Chen, J. J. and Ahn, H. (1996). Fitting mixed Poisson regression models using quasi-likelihood methods. Biometrical Journal 38: 81.
45. Chen, J. J. (1998). Analysis of reproductive and developmental studies. In Design and Analysis of Animal Studies in Pharmaceutical Development, eds. S. C. Chow and J. P. Liu, Marcel Dekker, NY, 309.
46. Kimmel, C. A. and Gaylor, D. W. (1988). Issues in qualitative and quantitative risk analysis for developmental toxicology. Risk Analysis 8: 15.
47. Chen, J. J. and Gaylor, D. W. (1992). Correlations of developmental endpoints observed after 2,4,5-trichlorophenoxyacetic acid exposure in mice. Teratology 45: 241.
48. Lefkopoulou, M., Moore, D. and Ryan, L. M. (1989). The analysis of multiple correlated binary outcomes: Application to rodent teratology experiments. Journal of the American Statistical Association 84: 810.
49. Chen, J. J., Kodell, R. L., Howe, R. B. et al. (1991). Analysis of trinomial responses from reproductive and developmental toxicity experiments. Biometrics 47: 1049.
50. Catalano, P. J. and Ryan, L. M. (1992). Bivariate latent variable models for clustered discrete and continuous outcomes. Journal of the American Statistical Association 87: 651.
About the Author

James Chen is Senior Mathematical Statistician in the Division of Biometry and Risk Assessment at the National Center for Toxicological Research, US Food and Drug Administration. He received his BS degree from Taiwan Tsing-Hua University, his MA degree from the University of Pittsburgh, and his PhD from Iowa State University. Dr. Chen is a Fellow of the American Statistical Association. He has over 100 scientific publications in peer-reviewed journals and numerous invited subject review articles. Dr. Chen has served on FDA, EPA, and interagency committees and workshops directed at developing scientific and regulatory issues and guidelines, and has provided consultation to FDA and EPA scientists on the statistical analysis of toxicological data and on risk assessment procedures.
CHAPTER 14
SOME STATISTICAL ISSUES OF RELEVANCE TO CONFIRMATORY TRIALS

GEORGE Y. H. CHI∗, KUN JIN and GANG CHEN†
Division of Biometrics I, US Food and Drug Administration, HFD-710 W0C2, Room 2033, 1451 Rockville Pike, Rockville, MD 20852, USA
Tel: 301-827-1515; ∗[email protected]

LU CUI‡
Division of Biometrics and Data Management, Aventis Pharmaceuticals, Inc.

†The views expressed in this chapter are those of the authors and not necessarily those of the Food and Drug Administration.
‡The contribution by this author was made while he was still with the Division.
1. Introduction

1.1. Overview of drug development

From its initial discovery to final marketing, a new drug in the United States has to go through various stages of development. Typically, preclinical animal studies are needed to determine the toxicity and carcinogenicity potential of a new compound. If these studies offer no suggestion of toxicity and no evidence of carcinogenicity, then small studies on human volunteers are conducted in Phase 1 to determine the metabolism and pharmacologic actions of the drug in humans, dose-dependent side effects, and, if possible, evidence of effectiveness. The information collected will permit the design of well-controlled, scientifically valid studies in Phase 2. The clinical studies in Phase 2 are conducted to evaluate the effectiveness of the drug for a particular indication or indications in patients with the disease or condition under study, and to determine the common short-term side effects and risks associated with the drug. In addition, for serious diseases such as cancer, a drug may be approved conditionally based on acceptable surrogate efficacy
endpoints in Phase 2. The information gathered in these early phases is used to guide the drug sponsor in the planning of Phase 3 clinical trials. Phase 3 studies are expanded controlled and uncontrolled studies carried out to obtain additional effectiveness and safety information that is needed to evaluate the benefit-risk relationship of the drug and to provide an adequate basis for physician labeling [21 CFR Sec. 312.20].

One of the critical requirements for pre-marketing approval of a new drug in the United States is the demonstration of the effectiveness of the drug through Phase 3 clinical trials. The United States Code of Federal Regulations [21 CFR Sec. 314.126(a)] requires that, to establish the efficacy of a new drug, the sponsor must produce reports of adequate and well-controlled clinical trials that demonstrate its effectiveness. Generally, the evidence of effectiveness is based on at least two adequate and well-controlled studies. Some of the characteristics of an adequate and well-controlled study are described in the regulation [21 CFR Sec. 314.126(b)].

Another critical requirement for pre-marketing approval of a new drug is safety. The safety of the drug is evaluated on the basis of the entire database obtained from all three phases of drug development prior to approval, and continues to be evaluated through Phase 4 clinical trials, if required as a condition for approval. To be approved, the drug sponsor must demonstrate that there is sufficient information to show that the drug is safe for use under the conditions prescribed, recommended, or suggested in its proposed labeling. Once approved, the safety of the new drug continues to be monitored through post-marketing adverse reaction reports.

A drug that is not yet approved for marketing may be used for treatment in patients with a serious or immediately life-threatening disease condition and for whom no comparable or satisfactory alternative drug or other therapy is available. The regulations allow such treatment to be carried out under a special protocol or treatment IND [21 CFR Secs. 312.34, 312.83]. For treating serious or life-threatening illnesses, certain new drug products may qualify for accelerated approval [21 CFR Sec. 314.500]. For treating a serious or life-threatening illness, a new drug that has demonstrated, based on adequate and well-controlled trials, an effect on a surrogate endpoint, or on some clinical benefit other than survival or irreversible morbidity, may be approved for marketing. The drug should provide meaningful therapeutic benefit to patients over existing treatments. For example, it may be effective in treating patients who are unresponsive to available therapy, or it may improve patient response over available therapy. The surrogate endpoint should have been reasonably validated
through epidemiologic, therapeutic, patho-physiologic, or other evidence as predictive of clinical benefit. Accelerated approval is granted under the condition that the drug sponsor will conduct further study to verify and describe its clinical benefit in relation to the surrogate endpoint, or the ultimate clinical outcome of primary interest.

Some publications,14,61,71 including the US Code of Federal Regulations [21 CFR Secs. 321–314 (2001)] and the US Food and Drug Administration International Conference on Harmonization of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH) E1–E10 documents (1998–1999), provide a good overview and source of reference for the entire drug development process and the United States drug regulatory requirements.

This chapter focuses primarily on some concepts, principles and issues related to establishing the evidence of drug efficacy in a confirmatory trial. The discussion of selected topics on certain aspects of the design, conduct and analysis of a clinical trial generally revolves around two fundamental statistical principles that are particularly critical to a confirmatory trial.
1.2. Confirmatory trial In drug development, a confirmatory trial is a clinical trial that is prospectively designed to provide the primary source of evidence necessary to support the efficacy claim of the drug under investigation. The evidence expected of a confirmatory trial must be compelling, especially for mortality or serious morbidity trials where ethical consideration often precludes the possibility of conducting a second trial. This will also be critical for future active control trials where the current drug may be used as the control. The strength of evidence is measured among other things by the overall quality of the trial, internal and external consistency of the trial results. The internal and external consistencies of the trial results are the outcomes of the trial, whereas the overall quality of the trial is to a large extent under the control of the experimenter. The overall quality of a trial should be evaluated with respect to the following aspects of the study. These aspects should include the appropriateness of the design, acceptability of the study conduct, quality of the data collected, adequacy of the power, maintenance of the probability of the overall type I error, control of bias and confounding, proper method of analysis, and correct interpretation of trial results.
The accepted standard for the design of a confirmatory trial includes proper randomization, the desired level of blinding, absence of confounding and the choice of an appropriate control. In addition, other fundamental statistical principles should be closely adhered to in a confirmatory trial. These principles include minimization of the potential for bias and control of the probability of the overall type I error at the desired α-level. These two principles are essential to a Phase 2 or 3 confirmatory trial. A confirmatory trial should have assay sensitivity, that is, the capability of showing a difference when the study treatment is effective, a concept defined in the ICH-E10 (1999) document on the choice of control.

1.3. Two fundamental statistical principles of clinical trials

Minimization of the potential for bias and control of the probability of the overall type I error, or the probability of making a false positive conclusion, are two fundamental statistical principles that are central to the quality of a trial. Both principles are intimately related to the design, conduct and analysis of the trial, and to the final interpretation of the trial results. Deviation from these two principles has the potential of weakening the strength of evidence, rendering the trial results uninterpretable, or sometimes even invalidating the entire trial results. Proper attention to these two principles throughout the trial, from design and conduct to the final analysis and interpretation, will strengthen the evidence and improve the likelihood of success of a trial.

Minimizing the potential for bias will be discussed first in the next section. In Sec. 3, the control of the probability of the overall type I error will be discussed through issues and problems related to multiple testing. The clinical decision rule will be introduced as an intuitive and natural way of handling multiple testing problems. The concept of a decision structure for a clinical trial incorporates the clinical, statistical and regulatory perspectives on the issue of multiplicity. In Sec. 4, interim analysis and design modification will be discussed within the framework of the decision structure. The problems and issues related to the important topic of active control non-inferiority trials, of current interest, will be discussed in the last section from the perspective of validity of inference and interpretation.

2. Minimizing the Potential for Bias

In a clinical trial, bias refers to the consequence of any design, property of the study treatment or characteristics of the disease, intentional or
unintentional conduct, and decision that results in a systematic exaggeration of the treatment difference either in favor or against the study treatment in a show-difference trial. It also refers to the consequence of a systematic dampening of treatment difference in favor of the study treatment in an active control non-superiority (non-inferiority or equivalence) trial. Bias affects the estimate of the true treatment effect and may lead to drawing incorrect conclusion regarding the overall effect of the study treatment. This is important for instance in an active control non-superiority trial where it is crucial to obtain an unbiased estimate of the effect of the active control. 2.1. Potential sources for bias in a clinical trial There are many potential sources for bias in a randomized controlled clinical trial. These sources include confounding, operational bias during trial execution, evaluation bias in outcome measurement, structural bias in the trial design, and statistical bias in the method of analysis. In any given situation, bias could come from one or more or even a combination of these sources. Some of these are discussed below. 2.1.1. Confounding The primary objective of a clinical trial is to demonstrate that the study treatment is effective as expected. Therefore, the trial must allow the experimenter to attribute any observed effect to the study treatment, and to the study treatment alone by ruling out all other potential explanations. Confounding occurs when one cannot attribute the entire observed treatment difference to the study treatment. The consequence of confounding here is bias, a bias that may be for or against the study treatment. A common source of confounding is an imbalance between the treatment arms in some important baseline covariates or prognostic factors that may have direct influence on the clinical outcomes of interest. A standard technique for avoiding confounding is randomization. Randomization may provide some assurance that the treatment arms are relatively balanced with respect to all known or unknown factors that may affect a patient’s response to the treatment. Randomization, if carried out properly, can minimize the chances for such imbalances to occur at baseline. Confounding may occur simply due to failure of the randomization to achieve necessary balance at baseline between the treatment and the control relative to some important and relevant known or unknown
demographic or prognostic factors. When there are imbalances in known baseline covariates or prognostic factors, then one will be less assured about the absence of imbalances in important but unknown factors. This type of imbalance can often occur when the sample size is modest or when the randomization scheme is flawed. It is important for a confirmatory trial to have proper randomization and to stratify relative to some of the most important factors to avoid such problems and concerns. However, even proper randomization may not be able to protect against bias resulting from differential treatment of the experimental and the control arms. Such differential treatment of the two arms may occur through operational bias introduced as a result of unblinding, through inherent structural bias as illustrated by the example of a duodenal ulcer prevention trial in Sec. 2.1.3, or through statistical bias in the method of analysis.

2.1.2. Operational bias

If a randomized controlled trial is open label, that is, if there is no blinding, then operational bias can easily be introduced. For instance, in a study of a new treatment for headache, the investigator may consciously or unconsciously allow subjects on study treatment to have concurrent use of aspirin. This type of conduct can introduce bias into the trial favoring the study treatment. Evaluation bias can also easily find its way into such a trial. For example, if the evaluation involves adjudication of a certain event, then the adjudicator may assess the event differently depending upon whether the subject is on study treatment or control.

For certain clinical trials, it is simply impossible to maintain blinding at all. In trials comparing different surgical procedures, it is frequently not possible to have blinding. In some oncology trials, it may not be possible to maintain blinding because the trials may involve comparing treatments with different toxicity profiles and delivery systems, e.g. pill vs. intravenous injection. In some randomized trials, even though the trials may be blinded, they can still potentially become partially unblinded. For example, it is known that β-blockers, a class of drugs for treating certain heart diseases, can lower heart rate significantly. The mean heart rate of patients treated with β-blockers can be about 12 beats/minute lower than that of patients given placebo. Thus, knowing the specific properties of the study treatment, one can correctly guess the patient treatment assignment with a high likelihood of success, resulting in some degree of unblinding.
2.1.2. Operational bias If a randomized controlled trial is open label, that is, if there is no blinding, then operational bias can easily be introduced. For instance, in a study of a new treatment for headache, the investigator may consciously or unconsciously allow subjects on study treatment to have concurrent use of aspirins. This type of conduct can introduce bias into the trial favoring the study treatment. Evaluation bias can also easily find its way into such trial. For example, if the evaluation involves adjudication of certain event, then the adjudicator may assess the event differently depending upon whether the subject is on study treatment or control. For certain clinical trials, it is simply impossible to maintain blinding at all. In trials comparing different surgical procedures, it is frequently not possible to have blinding. In some oncology trials, it may not be possible to maintain blinding because the trials may involve comparing treatments with different toxicity profiles and delivery systems, e.g. pill vs. intravenous injection. In some randomized trials, even though the trials may be blinded, they can still potentially become partially unblinded. For example, it is known that β-blocks, a class of drugs for treating certain heart diseases, can lower heart rate significantly. The mean heart rate of patients treated with βblockers can be about 12 beats/minute lower than that of patients given placebo. Thus, knowing the specific properties of the study treatment, one can correctly guess the patient treatment assignment with a high likelihood of success, resulting in some degree of unblinding.
Some Statistical Issues of Relevance to Confirmatory Trials
529
2.1.2.1. Levels of blinding Blinding is one of the basic techniques used to control the potential introduction of bias in a randomized controlled trial. It is generally recommended to maintain a level of blinding permissible by the study. When the individual patients are blinded to their own treatment assignments, the trial is called single blind. The blinding is generally achieved by giving patients medications that are identical in appearance but may contain either the study treatment, the active comparator, or an inactive ingredient (placebo) depending upon the type of design. When the investigators, evaluators, raters, or anyone who can influence the course of the trial, are also blinded to the patient treatment assignments during the entire course of the experiment, the trial is called double blind. In a double-blind trial, provisions are made so that the investigators will be able to break the blind in individual patient during an emergency, or when it is determined that the risk to the patient requires specific cares to be taken by the clinician to protect the patient’s safety. In a double blind trial, all study personnel are blinded to the patient treatment codes and should have no knowledge of or access to the results of any interim treatment comparative analysis. 2.1.2.2. Minimizing the likelihood of unblinding In a blinded trial, how easy is it for the individual patient to unblind his/her own treatment? Can the study treatment and the control be easily distinguished through appearance, taste, shape, size, route of administration, regimen, frequency of administration, and side effects? Special blinding techniques may be needed in a trial in order to minimize the likelihood that the individual patient will be able to unblind his/her own treatment codes. Of course, the investigator and other trial personnel should be instructed not to provide any information to the patient that may help the patient to unblind his/her own treatment code. How easy is it for the investigator, evaluator, or rater to have access to the patient treatment codes or to unblind the patient treatment assignments? Can the investigator unblind or partially unblind the patient treatment assignments through patient’s reported side effects, laboratory, physiological, and clinical parameters? For example, for the congestive heart failure trial involving exercise tolerance test, measures need to be taken to ensure that the administrator who conducts the treadmill test has no access to patient treatment experiences and baseline information so that the
530
G. Y. H. Chi et al.
likelihood of unblinding the patient treatment assignments can be minimized. Or in a trial that requires event adjudication, the adjudicator should not have access to information that may help reveal the patient treatment assignments. However, for the investigator, available blinding techniques may be more limited because the investigator has access to all of the patient data. The attention here should be more on reducing the impact of potential bias in the event of unblinding by the investigator. How easy is it for the trial personnel to have access to treatment codes, to perform treatment comparative analysis, or to receive treatment comparative analysis results? How easy is it for the trial personnel to make changes to the trial, e.g. patient enrollment criteria, addition or deletion of centers, increasing sample size, changing endpoints, etc. For example, if there is an independent Data Monitoring Committee involved, then how easy is it for the trial personnel to have access to treatment comparative analysis results through members of the Data Monitoring Committee? If the treatment comparative analysis is actually done by the trial personnel, then what safeguard is there to protect the integrity of the trial from changes done to the trial based on knowledge of the interim analysis results? Here, the only available tool to minimize unblinding is to establish a sound clinical trial infrastructure, clear designation of roles and responsibilities, and strict standard operating procedures governing the interaction, communication and dissemination of results among various operating units. These operating units include units within the trial organization such as the safety monitoring group, the data management group, the data analysis group, and the custodian of patient treatment codes. They also include units outside the sponsor’s organization such as outside consultants, Contract Research Organization, and Data Monitoring Committee, if there is one. When interim analysis is planned, the standard operating procedures should address some issues including the following. Who is doing the interim analysis? How the interim analysis is being carried out? To whom the results should be reported? Who can have access to the results of the interim analysis? How confidentiality of the results can be maintained? For guidance on these issues, one may refer to the Guidance on the Establishment and Operation of Clinical Trial Data Monitoring Committees (2001). These questions and issues regarding blinding and how to avoid or minimize the potential opportunity for unblinding should be discussed and addressed at the design stage of a confirmatory trial. Appropriate blinding techniques tailored to each trial by taking into consideration the trial design, special properties of the study treatment, and characteristics of the disease
Some Statistical Issues of Relevance to Confirmatory Trials
531
should be described in the study protocol. Trial infrastructure and standard operating procedures should be designed in relation to these considerations.
2.1.2.3. Assessing and reducing the impact of unblinding Of course, the mere occurrence of unblinding will not necessarily result in bias. It is the subsequent conduct on the part of some study personnel intentionally trying to influence or change the outcome either in favor or against the study treatment that results in bias. Thus, in addition to minimize the likelihood of unblinding, one should also consider reducing the severity and impact of potential bias introduced in the event of unblinding. Generally, the severity and impact of bias introduced subsequent to unblinding depends on the level at which the unblinding occurs, the ease with which the patient response can be altered, the quanitfiability of the bias, and the magnitude of the bias relative to the overall study treatment effect size. When some individual patients are unblinded to their own treatment assignments, then these patients may introduce biases into their own individual responses. These biases introduced by the individual patients may not be consistently in the same direction, nor systematic. So the net effect of all the biases may be much less severe. The impact of the bias depends upon the number of individual patients involved and the magnitude of the net bias relative to the overall study population size and the overall treatment effect size. Occasionally, biases introduced by one or two patients may change the overall conclusion. However, in most cases, these individual biases may not be easily quantified. Various strategies may be taken at the analysis stage to examine the impact of these biases when known. Such strategies may range from excluding some or all of these affected patients from the analysis to imputing conservative values for these patients’ responses. When unblinding occurs at the investigator level, then the potential bias introduced will affect the outcomes from the particular investigator site or center. Such bias tends to be more severe because within each center, the bias introduced tends to be more systematic and consistent, and affects the entire outcome from that particular center. In a trial with only a few centers, or when the centers involved account for a significant proportion of the total patients, the result can be devastating. If such biases are known and quantifiable, then appropriate measures may be taken to account for them at the analysis stage. Otherwise, available analysis strategies for examining
532
G. Y. H. Chi et al.
the impact of such biases may be limited only to dropping the affected centers from the analysis. In such event, randomization stratified within center may become an important issue. Finally, when unblinding occurs at the study level, treatment comparisons may be made based on the entire available data. Knowledge of such treatment comparative analysis can lead to subtle changes in patient demographics, addition or deletion of sites, and sometimes even to early trial termination. This type of changes made to the trial may result in bias favoring the study treatment. The impact of this kind of bias may be quite significant because such changes are based on the interim treatment comparative data. The consequences of such bias may include declaring an ineffective study treatment to be effective, prescribing the study treatment for a more general patient population than warranted, and incorrect labeling of the study treatment. It may be difficult to detect this kind of bias as a result of unblinding at the study level, and even when detected, it may be difficult to assess its impact. When unblinding occurs at the study level, the entire study results may be voided. The potential impact of unblinding may also depend upon the type of efficacy endpoints involved. There are generally two types of endpoints, the objective and the subjective. In general, for subjective endpoints such as scores based on investigator or patient evaluation, concerns for potential unblinding are understandable. In such cases, an investigator or patient can easily assign certain scores, or components of the scores, in a manner that is favoring the study treatment. Subjective endpoints arise frequently in clinical trials. For objective endpoints, the outcomes are less vulnerable to such alterations. However, some forms of operational bias can still be introduced. For example, in many clinical trials, besides the study treatment and the control, concomitant use of other effective drugs are allowed for all patients. In such cases, differential usage of concomitant medications between the study treatment and the control may potentially lead to bias. Objective endpoints also may vary in degree of objectivity. For objective endpoints such as mortality or serious morbidity, it is difficult for anyone to alter their outcomes. For some other objective endpoints, operational bias can still be introduced. For example, in congestive heart failure trials, exercise tolerance test is often used to measure a patient’s symptomatic improvement. In such trials, patients are asked to walk on treadmill. The administrator of the treadmill test would ask patients to walk until exhaustion. The total walking time will be recorded. Here the exercise tolerance test is an objective endpoint, but it may be affected by how hard the administrator
Some Statistical Issues of Relevance to Confirmatory Trials
533
pushes the patient to exhaustion. If the administrator is unblinded or partially unblinded to the patient treatment assignments, then differential handling of patients depending upon their treatment assignments will introduce operational bias. So in such cases, the administrator should not have access to the patient data and the investigator should not communicate to the patient his/her treatment assignment and treatment information. It is clear that for a confirmatory trial, one should require double blinding. In addition, measures for minimizing the likelihood of unblinding and steps for reducing the impact of bias in the event of unblinding should be described clearly in the protocol. These efforts should include a sound infrastructure for the trial, clear designation of roles and responsibilities, and strict standard operating procedures that are tailored for the trial at hand and that take into account the type of design and specific trial requirements. 2.1.3. Structural bias In a controlled clinical trial, even randomization and blinding may not fully protect against structural bias resulting from flaws in the design. Design flaws can occur, and not infrequently. Thus, one should be on guard against such structural bias. When it occurs, it has the potential to invalidate the results of the entire trial. There is usually no satisfactory remedy for structural bias as illustrated in Examples 1 and 2. Example 1. In the late 1970s and early 1980s, it is customary for duodenal ulcer prevention trials to consider a double-blind placebo-controlled design with patients whose duodenal ulcers were recently healed on acute treatment randomized to either the same treatment at half the dose or placebo. Patients were scheduled at regular intervals for endoscopic examination to determine the presence or absence of duodenal ulcers. The intervals are usually of 3-, 4- or 6-months duration, and the entire trial usually lasts 12 months. At the request of the physician, patients could also be endoscoped upon presentation of symptoms such as pain. Patients found to have recurrent duodenal ulcers were discontinued from the study; the remaining patients continued in the trial until symptomatic recurrence, the next scheduled endoscopy, or the end of the trial. So these trials met the basic requirements of being randomized, double blind and placebo-controlled. What is the problem? For H2-receptor antagonists such as ranitidine and cimetidine, it was generally recognized that they provide symptom relief and promote healing of duodenal ulcers in short term acute treatment trials.
534
G. Y. H. Chi et al.
In the prevention trial, patients with symptomatic recurrent ulcers are discontinued from the trials. Therefore, if the test drug merely relieves symptoms without actually preventing recurrences, then simply by the present design, more symptomatic recurrences would be observed among patients on placebo. To further compound the problem, regular endoscopies are scheduled 3, 4 or 6 months apart. It is also known that relatively short time, 4–8 weeks, is required for duodenal ulcers to heal either with or without treatment in an acute trial. Thus it is entirely possible that during the intervals between successive endoscopies, symptomatic ulcers may have recurred among patients who remain in the trial, healed before the next scheduled endoscopy, and hence escaped detection. It is clear that there is a bias favoring the H2-receptor antagonist as a result of the properties of the drug and the design. These and related issues were extensively discussed at a FDA Gastrointestinal Advisory Committee meeting.52 A renowned gastroenterologist at the time questioned whether duodenal ulcers could recur under maintenance treatment. However, it was demonstrated in a subsequent trial that duodenal ulcers could recur under maintenance treatment.8,46,62 A more detailed discussion of this example and the related design issues can be found in Chi,10 and Elashoff et al.18 This interesting example illustrates a combination of design flaws, evaluation bias due to the analgesic property of the H2-blockers, and the spontaneous healing of duodenal ulcers over time.
2.1.4. Statistical bias A randomized controlled trial that is blinded and has no design flaw can still have statistical bias introduced at the final analysis stage. Statistical bias can arise as a result of the method of analysis, the manner in which patients or data are excluded from the data set, or the manner in which missing data are being handled. This kind of statistical bias can sometimes be fairly subtle. An important principle in the analysis of clinical trial data is the socalled intent-to-treat principle,20,28,47 see ICH-E9 (1999). The intent-totreat principle simply espouses the view that the primary analysis should be performed on the outcome measures from all of the randomized patients. When there is no other source of bias, such as design flaw, then the intentto-treat analysis should provide an unbiased estimate of the treatment effect when there are no patient exclusions and no missing data. When the outcome measures are not available from all of the randomized patients, then
Some Statistical Issues of Relevance to Confirmatory Trials
535
efficacy subset analysis is likely to provide biased estimate of the treatment effect. Little and Rubin53 and Little54 defined three types of missing data mechanisms, missing completely at random (MCAR), missing at random (MAR) and informative missingness. There are various statistical models proposed to handle clinical trial data with these types of missing mechanisms. However, missing data mechanisms in clinical trials are difficult to verify whether they are MCAR, MAR or informative missingness. Generally, clinicians and biostatisticians in the field believe that most missing data mechanisms in clinical trials are likely to be informative missingness for which statistical approaches are less well developed. One imputation method that has often been used is the so-called Last Observation Carried Forward (LOCF) analysis where the last available observation is substituted for all the subsequent missing observations. The LOCF analysis implicitly assumes that the last observation is an unbiased representation of what the missing observation would have been had the patient been followed, which is also an unverifiable assumption. Various issues related to the LOCF analysis are discussed.6,28,47,59,66,103 The handling of missing data is a difficult problem, and the best strategy is to adhere to the intent-to-treat principle by minimizing the likelihood of missing data. The ICH E-8 Guidance on General Considerations for Clinical Trials (1997) recommends that the study protocol should specify procedures for the follow-up of patients who stop treatment prematurely. Furthermore, the ICH E-9 Guidance on Statistical Principles for Clinical Trials (1998) states that compliance with the intent-to-treat principle requires complete follow-up of all randomized patients for the primary outcome measures. For a confirmatory trial, procedures for such complete follow-up of all randomized patients should be carefully spelled out in the protocol and diligently carried out during the trial. The intent-to-treat principle should be followed with robustness or sensitivity analyses performed if needed. The following example illustrates an interesting case of structural bias combined with analysis bias in a study for a serious progressive disease. Example 2. In this example, the investigational new drug is manufactured only in limited quantity; hence only a small percent of the patients can be given the new drug. The study protocol has an unusual design. After a patient satisfies some eligibility criteria, he/she enters a pool and becomes eligible for random selection for the study treatment according to the fixed percentage. If a patient is selected, then the study treatment will be given. On the other hand, if a patient is not selected at a given pool of eligible,
536
G. Y. H. Chi et al.
then that patient remains unselected until the next selection. If that patient dies before the next selection, then he/she will be counted as an event in the unselected group. Otherwise, at the next selection, this patient enters the pool along with the newly eligible. Again patients in this second pool will have the same probability of being selected for the study treatment. This process continues with selections conducted about every month for over a year. Generally, for the selected patients, there is a delay of about three months between the time of selection and the actual time of administration of the study treatment. There is a one-year follow-up. The primary endpoint of interest is mortality. The length of survival of a patient is measured from the time of first eligibility to subsequent death, lost-to-follow-up, or to the end of the study. The comparison of the survival experience between the selected group and the unselected group from all ten selections is evaluated by the log-rank test statistic. There are two kinds of bias in this example. The first kind is a structural bias, and the second is an analysis bias. The structural bias can be seen as follows. The definition of selected and unselected groups are outcome (survival) dependent. This is because a patient who was not selected in any given selection could still be selected in a subsequent selection provided that that patient was able to survive till that selection. A patient who was not selected and died before a selection would automatically remain in the unselected group. This design would enrich the selected group with patients that have better survival experience up to the time of selection, and enrich the unselected group with patients who died before a selection. There is also a bias in the survival analysis. If a patient who is selected after a given selection, then his/her survival time is measured from the time of first eligibility to either death, lost-to-follow-up, or end of the study. But for such patient, the time from first eligibility to the time of selection is actually not under the study treatment. In fact, in general, there is an additional delay of about three months before the study treatment is actually given to a selected patient. Thus a large proportion of his/her survival experience would not be under the study treatment, but would be attributed to the study treatment. Such bias favors the study treatment. However, it is not advisable to consider the survival time as measured from the time of selection or the actual time the selected patient is given the study treatment, because the selected group and the unselected group are outcome dependent and hence there is no valid randomization. As it turns out, in this example, a majority of the patients were considered for eligibility in the second selection. To maintain randomization
Some Statistical Issues of Relevance to Confirmatory Trials
537
and to minimize bias, it was recommended that patients from the second selection be considered for analysis. Thus, the selected patients consist of those who were selected in the second selection, and the unselected patients consist of those who were not selected at the second selection. Now, for these unselected patients, some of them were actually selected at a subsequent selection. So, in the survival analysis, the survival time for the unselected patients are measured from the time of first eligibility to either death, lostto-follow-up, or censored at time of subsequent selection, actual time of study treatment administration, or the end of the study.
2.2. Some measures to minimize potential bias

In view of the various potential sources of bias in a clinical trial, it is important for a confirmatory trial to consider adopting appropriate measures at the design stage to minimize the impact of potential biases. Randomization is the standard technique used in clinical trials to achieve balance in both known and unknown important baseline covariates, prognostic and demographic factors between the treatment arm and the control. It is important to use a proper randomization scheme that achieves the necessary balance without leading to unblinding.

Operational bias is difficult to control. Blinding is the most important technique for controlling operational bias. A confirmatory trial should be blinded at the study level. If necessary, special blinding techniques should be considered at the individual patient level and the investigator level. The aim is to minimize the likelihood of unblinding by the individual patients, the investigators, evaluators or raters, or the study personnel, and to minimize its impact in the event of actual unblinding. In order to minimize the impact of bias due to unblinding, randomization should be centralized and stratified within each center, provided that the randomization procedure itself does not lead to potential unblinding. All blinding techniques should be clearly documented, described, standardized and operationalized. At the study level, the only blinding technique that can be implemented is through proper infrastructure, clear delineation of roles and responsibilities, and strict standard operating procedures. For instance, it should be made clear who has custody of the patient treatment codes, the circumstances under which the patient treatment codes can be accessed, the parties that may have the authority to access them, and the proper procedure for such access. These considerations should also include outside consultants, contract research organizations, and data monitoring committees. This is especially
critical when the trial involves planned interim analyses and a data monitoring committee. When there is a data monitoring committee, there should be standard operating procedures that define the responsibility and the proper span of authority of this committee, and govern the conduct and communication between the trial sponsor and the committee, whether the committee is independent or not. For example, when a data monitoring committee requests the interim results from the trial sponsor, who in the sponsor's organization performs the interim analysis, and how can the interim analysis results be kept confidential from the sponsor's study personnel? When the independent Data Monitoring Committee (DMC) recommends to the trial sponsor that the study be terminated early, certainly the trial sponsor should be provided with the interim analysis results and the reason for the recommendation. The sponsor's study personnel who receive such a recommendation and the interim analysis results would then have access to the comparative interim results. If the sponsor decides to continue the trial, then there is a potential opportunity for operational bias to be introduced by the study personnel at this point. How can such a potential opportunity for operational bias be prevented? The recent FDA draft document on the Establishment and Operation of Clinical Trial Data Monitoring Committees (2001) provides some relevant guidance on these issues.

To avoid structural bias, the properties of the study treatment, the type of treatment administration, the nature of the disease, the objective of the trial and other pertinent information should be well understood to ensure that the design of the trial is not flawed. To reduce the problem of missing data, it is recommended that a confirmatory trial attempt complete follow-up of the primary response data for dropouts and others with missing data, together with better documentation of the reasons for dropping out and for missing primary response data. This will minimize the impact of bias and may provide the basis for determining the proper method of handling the missing data.

To properly assess the strength of evidence in a trial, one needs to know whether the blind has been broken. If the blind has been broken, what is the extent of unblinding? Was such information used to bias the trial outcome, and if so, how? Can the bias be quantified? What analysis strategies can be applied to examine the impact of the bias, and can the trial still provide the strength of evidence sought, if at all? These are important issues to be addressed in a confirmatory trial.
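As an illustration of the centralized, stratified randomization recommended above, the following is a minimal sketch of a permuted-block scheme stratified by center; the arm names, block size and seed are illustrative assumptions, not a prescription.

```python
# Minimal sketch of centralized randomization stratified by center, using
# permuted blocks. Block size 4 and the seed are illustrative choices.
import random

class StratifiedBlockRandomizer:
    def __init__(self, arms=("treatment", "control"), block_size=4, seed=2003):
        assert block_size % len(arms) == 0
        self.arms = arms
        self.block_size = block_size
        self.rng = random.Random(seed)
        self.blocks = {}   # one running block per stratum (center)

    def assign(self, center):
        block = self.blocks.get(center, [])
        if not block:
            # build a fresh permuted block with equal allocation to each arm
            block = list(self.arms) * (self.block_size // len(self.arms))
            self.rng.shuffle(block)
        assignment = block.pop()
        self.blocks[center] = block
        return assignment

randomizer = StratifiedBlockRandomizer()
for center, n in [("center_01", 5), ("center_02", 3)]:
    print(center, [randomizer.assign(center) for _ in range(n)])
```

Keeping the block size unknown to the investigational sites is one common precaution against the randomization scheme itself becoming a source of unblinding at the end of a block.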
3. Inflation of the Overall Type I Error Rate

The second fundamental statistical principle in clinical trials is to control the overall probability of type I error, i.e. the probability of declaring the study treatment to be effective when it is not, at a desired significance level of α. In a given experiment, this probability can frequently be inflated in various ways. For example, when the test statistic for the null hypothesis involves a nuisance parameter that is estimated from the data, the estimation may inflate this probability. This section discusses some frequently encountered situations in which this probability can be inflated.

3.1. Multiple testing

Multiple testing surfaces frequently in clinical trials. It manifests itself in various forms. It often appears in multiple comparisons, repeated testing, multiple endpoints, multiple indications, and subgroup analyses, or combinations thereof. The basic problem with multiple testing in a clinical drug trial is that it increases the probability of the overall type I error, i.e. the probability of declaring the drug or a certain dose of the drug to be effective for the desired treatment indication when in fact it is not. This inflation can be illustrated as follows in the context of multiple comparisons.

3.2. Multiple comparisons

In clinical drug trials, it is often of interest to design a parallel placebo-controlled study with multiple treatments, or multiple doses of a test drug. The primary objective of such a trial is to demonstrate that one or more of the treatments, or doses of the drug, works, and a secondary objective is to compare different treatments, or to characterize the drug's dose response relationship. For such a study, how should one most efficiently evaluate the effect of the treatments or the effect of the drug? How should the primary hypothesis or hypotheses be defined? And how should the test or tests be carried out?

For purposes of discussion, consider N treatments or doses of a drug and a placebo. Let µi, i = 0, 1, 2, . . . , N, represent the mean changes from baseline for placebo and the N treatments or doses of the drug respectively, and let ∆µi = µi − µ0 denote the difference between the ith treatment or dose and placebo. Let Ho123···N : ∆µ1 = ∆µ2 = · · · = ∆µN = 0 represent the global null hypothesis. Let Hoi : ∆µi = 0, i = 1, 2, . . . , N, denote the individual null hypotheses of no difference between the ith treatment or the ith dose of the drug and placebo. Let us suppose that we have suitable test statistics
to test each one of these individual null hypotheses, Hoi : ∆µi = 0, i = 1, 2, . . . , N, and let poi, i = 1, 2, . . . , N, denote the p-values associated with the corresponding tests. Now, can we reject an individual null hypothesis, Hoi : ∆µi = 0, if the p-value poi < α? The answer is not necessarily so, because testing each individual null hypothesis Hoi at the same α level does not control, at the α level, the probability of the overall type I error, that is, the probability of rejecting at least one individual null hypothesis when the global null hypothesis is true. This probability will be inflated as a result of the multiple testing.

The inflation of the overall type I error rate can be seen as follows. If there are N individual and independent comparisons, with each individual comparison done at a nominal significance level of αi, then

The probability of the overall type I error
= P(Reject Ho123···N | Ho123···N)
= P(Reject at least one Hoi : ∆µi = 0 | Ho123···N)
= 1 − P(Fail to reject all Hoi : ∆µi = 0 | Ho123···N)
= 1 − Π P(Fail to reject Hoi : ∆µi = 0 | Hoi), assuming independence
≥ 1 − Π(1 − αi).   (1)
If the individual comparisons are made at the same significance level αi = α, then from Eq. (1) it follows that, under the independence assumption,

The probability of the overall type I error ≥ 1 − (1 − α)^N > α (for N ≥ 2).   (2)
The following values illustrate the increase in the overall type I error rate as the number of comparisons, each performed at the α = 0.05 level, increases: the probability of the overall type I error is 0.05, 0.098, 0.143, 0.226 and 0.401 for N = 1, 2, 3, 5 and 10 comparisons, respectively. From these values, Eq. (1) and inequality (2), it is clear that in order for the probability of the overall type I error to be controlled at a given α level, it is necessary to test the individual null hypotheses Hoi at significance levels αi < α, i = 1, 2, . . . , N, in such a way that the probability of the overall type I error does not exceed α. Various multiple comparison procedures have been proposed to accomplish this.
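A minimal sketch of both points follows: it reproduces the inflation values quoted above, and then applies two of the standard p-value based corrections (the Bonferroni and Holm procedures discussed in the next subsection) to a set of made-up p-values.

```python
# Reproduce the inflation of the overall type I error for N independent
# comparisons, each at alpha = 0.05, then apply two standard corrections
# (Bonferroni and Holm) to a set of illustrative p-values.
alpha = 0.05
for N in (1, 2, 3, 5, 10):
    print(N, round(1 - (1 - alpha) ** N, 3))   # 0.05, 0.098, 0.143, 0.226, 0.401

def bonferroni(pvals, alpha=0.05):
    """Reject H_i iff p_i <= alpha / N."""
    N = len(pvals)
    return [p <= alpha / N for p in pvals]

def holm(pvals, alpha=0.05):
    """Holm's step-down procedure: also controls the familywise error rate,
    but is uniformly more powerful than the simple Bonferroni correction."""
    N = len(pvals)
    order = sorted(range(N), key=lambda i: pvals[i])
    reject = [False] * N
    for rank, i in enumerate(order):            # rank = 0, 1, ..., N-1
        if pvals[i] <= alpha / (N - rank):
            reject[i] = True
        else:
            break                                # stop at the first failure
    return reject

p = [0.012, 0.030, 0.004, 0.210]                 # illustrative p-values only
print("Bonferroni:", bonferroni(p))
print("Holm:      ", holm(p))
```

Holm rejects at least as many hypotheses as Bonferroni for any set of p-values, while still keeping the familywise error rate at α.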
3.2.1. Some p-value based multiple comparisons procedures

The most commonly used multiple comparison procedures are those that can be described as p-value based procedures, so named because these procedures are all defined in terms of the individual p-values, poi. These general procedures include the simple Bonferroni procedure, the Holm35 sequentially rejective procedure, the Hochberg33 procedure, the Hommel37 procedure, and the more general Simes procedure. Except for the Simes procedure, most of these procedures lose power as the number of comparisons increases. However, they control the probability of the overall type I error at the desired α level, and they have enjoyed great popularity due to their relative simplicity. For a more detailed discussion of these and related procedures, the reader is referred to Hochberg and Tamhane,34 Samuel-Cahn,81 Chi,11 and Sarkar.83

These p-value based procedures base their inferences on the individual p-values, po(i), i = 1, 2, . . . , N, associated with the individual null hypotheses, Ho(i), i = 1, 2, . . . , N. They adjust the individual significance levels αi in such a manner that the probability of the overall type I error is controlled at the desired α level. If these p-value based procedures reject any one of the individual null hypotheses at the αi level, then the global null hypothesis, Ho123···N : ∆µ1 = ∆µ2 = ∆µ3 = · · · = ∆µN = 0, will be rejected for sure, where ∆µi is the difference between the ith treatment or the ith dose of the test drug and placebo. These p-value based procedures are sometimes also referred to as step-up procedures. They generally do not make use of the correlation that may exist between the various test statistics and tend to be conservative. Furthermore, it should be noted that most of the p-value based procedures implicitly treat all comparisons as of equal importance.

3.2.2. The closure principle and the step-down procedures

In contrast, a test that rejects the global null hypothesis, Ho123···N, at a significance level of α permits one to conclude that overall the drug is superior to placebo. However, the rejection of this global null hypothesis does not shed any light as to which specific treatment(s) or dose(s) actually works. Certainly, if the test fails to reject the global null hypothesis, then one can only conclude that the study fails to detect a difference among the various treatments and placebo, or among the different doses of the drug and placebo.
The US regulations [21 CFR Sec. 314.50(d)(5)(v)] also require that evidence be provided to support the dosage and administration section of the labeling, including support for the dosage and dose interval recommended. Thus, when a test rejects the global null hypothesis, it is also of interest to know which one or more of these treatments or doses of the drug is actually superior to placebo. Can one then simply use the p-values, po(i), obtained from the individual comparisons of Ho(i) : ∆µi = 0, i = 1, 2, . . . , N, compare each at the same nominal significance level, say α, and then conclude that that treatment or dose is or is not superior to placebo? The answer is no, not in general. For suppose that the true state of nature is as follows:

∆µ1 = ∆µ2 = · · · = ∆µN−1 = 0 and ∆µN ≠ 0.   (3)
Now suppose the global null is rejected, and one then tests each of the individual null hypotheses, Ho(i), i = 1, 2, . . . , N, at the α level. It follows from the series of equations in (1) that the probability of rejecting at least one of the individual null hypotheses, Ho(i), i = 1, 2, . . . , N − 1, assuming independence, is ≥ 1 − (1 − α)^(N−1) > α, given the true state of nature according to Eq. (3). Thus, the probability is greater than α that one will erroneously reject one of the individual null hypotheses, Ho(i), i = 1, 2, . . . , N − 1, and conclude that one of the first (N − 1) treatments or doses is effective.

Equation (3) is only one possible true state of nature. The following lists all the possible true states of nature, organized by the number of nonzero differences:

Level 0: ∆µ1 = ∆µ2 = · · · = ∆µN = 0.

Level 1: ∆µ1 = · · · = ∆µi−1 = ∆µi+1 = · · · = ∆µN = 0 and ∆µi ≠ 0, for some i = 1, 2, . . . , N.

Level 2: ∆µ1 = · · · = ∆µi−1 = ∆µi+1 = · · · = ∆µj−1 = ∆µj+1 = · · · = ∆µN = 0 and ∆µi ≠ 0, ∆µj ≠ 0, for some i = 1, 2, . . . , N − 1 and j > i.

· · ·

Level N − 1: ∆µi = 0 for some i, and ∆µj ≠ 0 for all j ≠ i.
Since the above argument is also applicable to all the other possible true states of nature, it follows that controlling the probability of the type I error associated with the global null hypothesis, Ho123···N , at the α level, and the probability of the type I error associated with the individual null hypotheses Ho(i) , i = 1, 2, . . . , N each at the α level, is not sufficient to
protect the probability of the overall type I error at the α level. That is, in this case, the probability of erroneously rejecting one of the individual null hypotheses when it is true will be greater than α. In other words, the mere fact that an individual p-value po(i) < α does not imply that one can reject the corresponding individual null hypothesis Ho(i), even when the global null hypothesis has been rejected. However, there are some classical multiple comparisons procedures within the framework of analysis of variance that permit such conclusions, for example, the Scheffé test, the Dunnett procedure, and the Tukey procedure. An interesting example of a global test within the context of a multi-factorial combination drug trial is discussed in Hung, Chi and Lipicky.38

This suggests that perhaps one should consider controlling the probability of the type I error associated with all the possible true states of nature. The closure principle60 leads to a general class of closed testing procedures. A closed testing procedure is a step-down procedure that begins with testing the global null hypothesis (corresponding to the state of nature in Level 0) at the α level. If this test fails to reject the global null hypothesis, then the procedure stops. If this test rejects the global null hypothesis, then the procedure steps down to test each of the partial null hypotheses at the next lower level (corresponding to the states of nature in Level 1) at the same α level. The procedure stops when none of the partial null hypotheses is rejected. When one or more of the partial null hypotheses are rejected, the procedure steps down to the next lower level (corresponding to the states of nature in Level 2) and tests, at the α level, each of the partial null hypotheses that imply the previously rejected partial null hypotheses (a hypothesis H1 implies another hypothesis H2 if the rejection of H1 implies the rejection of H2). The process either stops at some level for failing to reject any of the implying partial null hypotheses at that level, or continues to the last level (corresponding to the true states of nature in Level N − 1) and tests each of the implying individual null hypotheses at the α level.

The closed testing procedure guarantees that the probability of the overall type I error will be maintained at the α level. However, it does not guarantee that in a given application it will necessarily be able to continue to the last level of testing all the implying individual null hypotheses. Even when it does reach the last level, it may fail to reject any of the implying individual null hypotheses. In other words, the closed testing procedure may fail to identify any treatment or dose that is superior to the control. A closed testing procedure may be able to take advantage of the correlation between the various test statistics. This feature makes the closed testing procedure somewhat more attractive when one is confronted with multiple endpoints, which is the next multiplicity issue to be discussed. Again, it should be noted that the closed testing procedure also implicitly treats all comparisons as of equal importance.
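To make the closure principle concrete, here is a minimal sketch for N individual hypotheses in which each intersection (partial null) hypothesis is tested by a Bonferroni local test. The choice of local test and the p-values are illustrative assumptions; with this particular local test the procedure reduces to the Holm procedure, whereas a local test that exploits the correlation between the test statistics could be used instead.

```python
# Minimal sketch of a closed testing procedure for N individual null
# hypotheses. Each intersection (partial null) hypothesis H_S is tested with a
# Bonferroni local test: reject H_S if min p_i (i in S) <= alpha / |S|.
from itertools import combinations

def local_test(pvals, subset, alpha):
    """Bonferroni alpha-level test of the intersection hypothesis H_S."""
    return min(pvals[i] for i in subset) <= alpha / len(subset)

def closed_testing(pvals, alpha=0.05):
    N = len(pvals)
    indices = range(N)
    # H_i is rejected iff every intersection hypothesis containing i is
    # rejected by its own alpha-level local test.
    reject = []
    for i in indices:
        ok = True
        for size in range(1, N + 1):
            for subset in combinations(indices, size):
                if i in subset and not local_test(pvals, subset, alpha):
                    ok = False
                    break
            if not ok:
                break
        reject.append(ok)
    return reject

p = [0.012, 0.030, 0.004, 0.210]     # illustrative p-values only
print(closed_testing(p))             # familywise error rate controlled at 0.05
```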
3.3. Multiple endpoints

In medicine, a disease is characterized by multiple clinical endpoints that may be correlated. These clinical endpoints may describe the stage and severity of the disease, the signs and symptoms associated with the disease, etc. Some clinical endpoints provide direct information regarding the state of the disease; they are capable of revealing whether the disease has been modified, such as the healing of an ulcer as determined by endoscopic examination. Some clinical endpoints provide indirect information regarding the state of the disease; these clinical endpoints include signs and symptoms known to be associated with the disease. In a disease where death is a potential outcome, mortality or survival is considered a very special and unique clinical endpoint; it is an objective endpoint, and due to its seriousness, it supersedes all other endpoints in importance. At the other end of the spectrum are the so-called surrogate endpoints, surrogate markers, or biomarkers. They are endpoints, usually not clinical in nature, that may be associated or correlated with the clinical endpoints of interest.

The effect of a drug may manifest itself in a number of endpoints, and each endpoint may provide a measure of drug effect. From a regulatory perspective (the 1962 Amendment to the 1938 Food, Drug and Cosmetic Act requires a drug to show clinical benefit in adequate and well-controlled studies), only endpoints that can provide a measure of relevant clinical benefits that may lead to potential efficacy claims are of interest. This requirement essentially restricts the endpoints to clinical endpoints. But clinical endpoints have varying importance and reliability, and not every clinical endpoint has the potential of leading to an efficacy claim. In a few special instances, surrogate endpoints have been used directly to support drug efficacy claims, as with blood pressure in hypertension trials and CD4 counts in AIDS trials. In some disease areas, due to the lack of good or available treatments, a surrogate endpoint may be used to support a drug efficacy claim conditional on subsequent demonstration of meaningful and real clinical benefit under the accelerated approval program (see Mathieu61 and 21 CFR Sec. 314.500 Subpart H, 2001).

Therefore, in the design of a clinical drug trial, the investigator should first determine what constitutes a clinical benefit and how best to measure
it. The ICH-E9 Guideline (1998) calls for the designation of a single clinical endpoint as the “primary endpoint”. The effect of the drug will then be measured relative to this “primary endpoint”. This recommendation may be reasonable when there is only one clinically relevant endpoint for the disease under study, or when there is only one clinical endpoint that is of primary interest. Often in practice, this recommendation is followed because the investigator is not willing to make the necessary adjustment for multiple testing or to consider applying a closed testing procedure that would preserve the probability of the overall type I error at α. In either case, the investigator usually designates one among several endpoints as the “primary endpoint” and the remaining endpoints as “secondary endpoints”. Such practice often leaves ambiguity as to what to do with those “secondary endpoints” that could lead to an efficacy claim themselves. This often leads to difficult situations and undesirable consequences, as illustrated by the following scenario. After a trial fails to reject the null hypothesis defined by the designated “primary endpoint” at the significance level α, the investigator then turns to the “secondary endpoints”. Often, some of the observed p-values for these endpoints show “apparent significance”, i.e. p-values less than α. The question is whether one can claim that the trial has demonstrated that the drug is effective based on those “secondary endpoints” that show “apparent significance”. This type of situation often results in controversy, particularly when the “secondary endpoint” turns out to be mortality or some serious irreversible morbidity endpoint. For an interesting recent discussion of such an example, one may refer to the articles by Fisher21,22 and Moyé.63–65 See also Chi10 for the SOLVD prevention trial.

From a statistical and regulatory standpoint, when the primary hypothesis fails to be rejected at the significance level α, the trial has failed to provide the expected strength of evidence for the efficacy of the drug. Additional testing of hypotheses can only result in an inflation of the probability of the overall type I error over and beyond α. The probability of the overall type I error can be controlled only if all desired hypotheses to be tested have been pre-specified with a proper allocation of α. A recent proposal by Moyé64,65 and related commentaries16,70 consider testing the secondary hypotheses at some significance level α∗ between 0.05 and 0.10 to accommodate the so-called “surprise finding”. This proposal clearly inflates the overall type I error rate beyond the desired level of α = 0.05 and is not desirable from this perspective. In our view, for a confirmatory trial, one should maintain the fundamental principle of controlling the probability of the overall type I error at
the desired level α. Thus, designating only one of several clinical endpoints as “primary” may not necessarily be the best option.
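A back-of-the-envelope calculation makes the inflation explicit. Assuming, for simplicity, that the primary and secondary test statistics are independent and that both null hypotheses are true, testing the secondary hypothesis at α∗ whenever the primary test fails at 0.05 gives an overall false positive probability of 0.05 + 0.95α∗; the α∗ values below are purely illustrative.

```python
# Under the simplifying assumption of independent primary and secondary test
# statistics with both null hypotheses true, the chance of at least one false
# efficacy claim when the secondary endpoint is tested at alpha_star only
# after the primary test fails at 0.05 is 0.05 + 0.95 * alpha_star.
for alpha_star in (0.05, 0.075, 0.10):
    overall = 0.05 + 0.95 * alpha_star
    print(alpha_star, round(overall, 4))   # 0.0975, 0.1213, 0.145
```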
3.3.1. Current clinical trial practice

When there are multiple clinical endpoints, each of which can lead to the same efficacy claim, current practice usually considers one of two general approaches. The first approach is to designate all clinical endpoints that can lead to the same claim as “co-primary endpoints”. Appropriate hypotheses are then defined in terms of these endpoints. A method for testing these hypotheses that accounts for multiple testing can then be applied. The second approach is a conditional or sequential approach. One designates only one “primary endpoint” to be tested at the significance level of α, as recommended in the ICH-E9 Guidance document. The remaining endpoints are then designated as “secondary endpoints”. If the primary null hypothesis fails to be rejected, then no efficacy claim can be made. When the primary null hypothesis is rejected, then the efficacy claim is made. Additionally, those “secondary endpoints” that are closely correlated with the “primary endpoint” can be tested at the nominal significance level of α; these tests will not lead to additional efficacy claims, but are meant to provide additional information to describe the efficacy finding based on the “primary endpoint”. Those “secondary endpoints” that are not closely related to the “primary endpoint” may be tested at an overall nominal significance level of α, appropriately adjusted for multiplicity. Those “secondary endpoints” that reach nominal significance after adjustment may be described in the labeling of the drug.

In the above two approaches, to adjust for multiple “co-primary endpoints”, or to adjust for multiple “secondary endpoints” after rejection of the primary null hypothesis, one may consider various multiplicity adjustment procedures such as the p-value based procedures and the closed testing procedures that were discussed in Secs. 3.2.1 and 3.2.2. These procedures may be applied here if one assumes that the “co-primary endpoints” are of equal importance and have equal likelihood of demonstrating the proposed efficacy claim, or that the “secondary endpoints” are of equal clinical relevance and importance in terms of allowing their appearance in the labeling of the drug. Examples of multiple testing have been discussed by many authors.2,17,27,51,67,69,76,80,82,87,88,105,106

The p-value based procedures and the closed testing procedures are only formal statistical procedures, and they lack the proper clinical and
regulatory perspectives. They are applied as if all the endpoints are of equal importance and have equal likelihood of achieving the desired efficacy claim. However, in most practical situations, such an assumption is not appropriate. The fact is that, most of the time, these clinical endpoints have varying or differential clinical relevance and importance, and different likelihoods of demonstrating the efficacy claim. Furthermore, from a clinical and regulatory perspective, some of these endpoints may be used together, rather than individually, in a more complex way in assessing the drug effect.

The proper approach, in our view, is to define a priori a decision set and a clinical decision rule for assessing the drug efficacy claim. A decision set is simply a set of clinical endpoints, and a clinical decision rule is simply a decision tree consisting of several decision paths or branches. Each decision path is defined by one or more of the endpoints from the decision set, and it may lead to the desired efficacy claim. The null hypothesis along each decision path is tested by a test statistic that reflects this decision path. If the null hypothesis along a decision path is rejected at its allocated significance level αi, then the trial would have demonstrated efficacy. The significance levels αi are allocated across the various decision paths in a manner that maintains the overall probability of type I error at α. A clinical decision rule should have a proper statistical support structure to provide valid statistical inference. The concept of a clinical decision rule is more formally discussed and illustrated by examples in the next section.13,42
3.4. Clinical decision rules

It is important to understand that in current clinical trial practice, the term “primary endpoint” is used to identify a clinical endpoint that is being used in a trial for demonstrating the efficacy of the study treatment. It is an endpoint that has the burden of providing the primary evidence for the desired efficacy claim. When there is more than one “primary endpoint”, they are “co-primary” relative to one another. The “secondary endpoints” and “tertiary endpoints” do not have such a role. But the mere fact that a clinical endpoint is placed in the category of “secondary endpoint” does not imply that it has lesser clinical importance or lacks the ability to independently provide the evidence for an efficacy claim. This is one reason why controversies frequently arise in clinical trials, as in the case of carvedilol discussed earlier.

A clinical decision rule is a natural and meaningful way of assessing drug efficacy. It reflects the clinical and regulatory perspectives in handling the
multiple testing issue. How a clinical decision rule is defined actually varies from disease to disease. It depends on the current state of clinical knowledge about the disease and the degree of acceptability of the endpoints considered. In order to define a clinical decision rule in a clear and clinically meaningful way, some general definitions of clinical endpoints are needed.

3.4.1. A hierarchical definition of clinical endpoints

The following hierarchical definitions of clinical endpoints are proposed with a regulatory perspective.

Definition 1. A clinical endpoint is a clinical variable which either directly or indirectly reflects the condition of an underlying disease. For example, death, disease progression, tumor size, pain intensity, signs/symptoms, and hospitalization are clinical endpoints.

Definition 2. A clinical endpoint is a primary endpoint if it satisfies the following conditions:
• It provides a measure of clinical benefit realized in the patient that is acceptable to the clinicians in the field as a meaningful measure of the drug effect for the disease under treatment.
• It is an endpoint such that a positive finding in this endpoint may result in an efficacy claim.
Examples of primary endpoints are time to death, time to disease progression, tumor response, diastolic blood pressure, and presence or absence of ulcer.

Definition 3. A primary endpoint is a principal primary endpoint if a positive finding in this endpoint alone is sufficient to result in the efficacy claim.

Principal primary endpoints are usually endpoints that are objective and can directly demonstrate that the underlying disease has been modified, for example, mortality, absence of ulcers, and objective measures of disease progression. The regulatory perspective here is that a positive finding on a principal primary endpoint alone is sufficient for proof of efficacy. For many diseases, there may not be a principal primary endpoint. In general, the choice of principal primary endpoints may be limited.

Definition 4. A primary endpoint is a co-primary endpoint if a positive finding in this endpoint is sufficient to result in an efficacy claim, provided the other primary endpoints considered do not show an inconsistent effect.
Co-primary endpoints are endpoints that may be more subjective in nature, less well defined, or may provide only an indirect measure of the condition of the underlying disease, such as signs and symptoms. The regulatory perspective here is that a positive finding on a co-primary endpoint alone is sufficient for proof of efficacy only if the other primary endpoints considered do not show an inconsistent effect. Some examples of co-primary endpoints include all-cause hospitalization, cause-specific hospitalization, time to disease progression, tumor response, disease-related signs/symptoms, etc. From a clinical perspective, co-primary endpoints are not as important as principal primary endpoints because they may not be as objective, or as well defined, and may not provide direct measures of disease modification. On the other hand, improvement as measured by a co-primary endpoint may support a claim provided the other primary endpoints considered do not show contradictory findings. It is important to point out that, by our definitions, both principal primary endpoints and co-primary endpoints are primary endpoints. Their nature cannot be changed simply because they are placed in the “secondary endpoint” category, a common term used in current practice.

Definition 5. A clinical endpoint is secondary if it satisfies the following conditions:
• It is a clinical endpoint that provides a measure of clinical benefit realized in the patient that is acceptable to the clinicians in the field as a meaningful measure of the drug effect for the disease under treatment.
• It is an endpoint such that a positive finding in this endpoint alone is not sufficient to result in the drug's efficacy claim.

It should be pointed out that a secondary endpoint as we define it here has a different meaning than that used in current practice. In current practice, “secondary endpoints” may include both primary and secondary endpoints as we define them above. It should be noted that, according to our definition, a positive finding in a secondary endpoint alone cannot result in an efficacy claim. However, positive findings in several secondary endpoints may together provide sufficient evidence to support a claim. This is illustrated in Example 7. Evidence from secondary endpoints may or may not be needed to support the evidence provided by a primary endpoint.

There are several reasons for the above hierarchical definition of clinical endpoints. The first reason is that in each disease there are usually multiple relevant clinical endpoints of varying importance. Secondly, efficacy
assessment depends on the nature of the clinical endpoints and the relevance of the clinical benefits measured by these clinical endpoints to the indication sought. Thirdly, the hierarchical nature of these endpoints becomes important in defining the clinical decision rule, to be defined below, which essentially operationalizes these endpoints in the assessment of efficacy.

Definition 6. A decision set relative to a disease is a set of relevant endpoints, determined by the clinicians, that will be used to assess the effectiveness of the treatment for the disease in a clinical trial.

A decision set may contain principal primary endpoints, co-primary endpoints, and secondary endpoints. Ideally, the decision set for a disease should be determined by the consensus of the medical experts in the field. However, for diseases that are not well understood, consensus may be hard to come by. But it is exactly in this kind of disease that multiple endpoints become a difficult issue. This is because the experts themselves may not know exactly which clinical endpoints are most relevant and important, and they often may not even agree. A decision set may consist of principal primary, co-primary, and secondary endpoints, and it may even contain surrogate endpoints. A decision set may change over time, as better understanding of a disease may modify the way efficacy should be assessed. It usually should not contain too many endpoints, but it should contain a sufficient number of endpoints to allow all potentially acceptable ways of assessing the efficacy of the drug. The hierarchical nature of the endpoints is operationalized in a clinical decision rule.

Definition 7. A clinical decision rule is simply a decision tree defined relative to a decision set for the purpose of assessing the effectiveness of a drug in treating a given disease. Each decision path or branch is defined in terms of one or more endpoints from the decision set. In each path, a decision is made based on the outcomes of all endpoints involved.

It is important to realize that the clinical decision rule as defined here simply represents the clinical and regulatory way of expressing how evidence of efficacy of the drug can be assessed based on the endpoints in a decision set. It is independent of any statistical considerations. It simply reflects the various alternative hypotheses of interest. For instance, no consideration is given to how the hypotheses are defined, control of the probability of the overall type I error, etc.
Therefore, a clinical decision rule should have a proper statistical support structure. A proper statistical support structure should include appropriate null and alternative hypotheses reflecting the entire clinical decision rule. It should have a proper test statistic at each decision point, an appropriate allocation of α to each decision point so as to maintain the probability of the overall type I error for the clinical decision rule at α, and sufficient sample size or power for the entire clinical decision rule.

Definition 8. A decision structure is a clinical decision rule with a proper statistical support structure.

A decision structure requires a proper allocation of α among the various decision paths of a clinical decision rule. If one wishes to allocate all of the α to one particular decision path, then this can easily be accommodated by defining all the other decision paths as sub-paths of this particular decision path, or by simply defining the clinical decision rule with only this path. The sub-paths will be tested only when the primary decision path has demonstrated the efficacy of the drug. The decision structure generalizes the concepts of “primary endpoints” and “secondary endpoints” used in current practice through decision paths, and through sub-paths in a conditional decision path of a clinical decision rule. The earlier discussion of labeling considerations for “secondary endpoints” can also be transferred to the present decision structure framework by considering clinical decision rules with sequential or conditional paths that allow secondary outcomes based on sub-paths to enter the labeling of the drug.

Ideally, for a given disease, the clinical decision rule should be based on a consensus of the clinical experts in the field. For example, the Food and Drug Administration has advisory committees for various disease areas. The members of the advisory committees include medical experts in various diseases. These advisory committees often discuss issues regarding the proper choices of endpoints. The consensus regarding the decision set and clinical decision rule in a given disease may be reached among the corresponding committee members. For some diseases, there may be general consensus; but for many other diseases, there may be no agreement. This is especially true in disease areas where the state of knowledge is still evolving. For such diseases, the clinical trial sponsor and the regulatory agency to which the application will be submitted should reach agreement on the decision set and the clinical decision rule that are acceptable to both parties prior to commencement of the trial. It is the responsibility of the statistician to provide a proper statistical support structure for the clinical decision rule.
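The following is a minimal sketch of how a decision structure might be represented in practice: a decision set, and decision paths each carrying the endpoints it uses and its allocated significance level. The endpoints, paths and α split shown are hypothetical, and the simple Bonferroni bound is only the crudest way to guarantee the overall level.

```python
# Minimal sketch of a decision structure: a clinical decision rule represented
# as decision paths, each carrying the endpoints it uses and its allocated
# significance level. Endpoints, paths and the alpha allocation are
# hypothetical; the allocation must preserve the overall type I error at alpha
# (here checked with a simple Bonferroni bound).
from dataclasses import dataclass
from typing import List

@dataclass
class DecisionPath:
    name: str
    endpoints: List[str]        # endpoints from the decision set used by this path
    alpha: float                # significance level allocated to this path

@dataclass
class DecisionStructure:
    decision_set: List[str]
    paths: List[DecisionPath]

    def overall_alpha_bound(self) -> float:
        # crude Bonferroni bound on the overall type I error
        return sum(p.alpha for p in self.paths)

    def efficacy_demonstrated(self, path_results: dict) -> bool:
        # path_results maps a path name to True if that path's null hypothesis
        # was rejected at its allocated level (with no inconsistent trend seen)
        return any(path_results.get(p.name, False) for p in self.paths)

structure = DecisionStructure(
    decision_set=["mortality", "hospitalization", "symptom score"],
    paths=[
        DecisionPath("principal primary", ["mortality"], alpha=0.04),
        DecisionPath("co-primary", ["hospitalization", "symptom score"], alpha=0.01),
    ],
)
print(structure.overall_alpha_bound())                          # 0.05
print(structure.efficacy_demonstrated({"co-primary": True}))    # True
```

In an actual protocol, the α allocation, any conditional sub-paths and the test statistic for each path would all be pre-specified as part of the statistical support structure.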
In a clinical trial with only a single primary endpoint, T1, the decision structure is simple. Its statistical support structure includes the simple null hypothesis, its alternative, and an appropriate test statistic, all defined in terms of this primary endpoint. The sample size and power are calculated accordingly. It can be expressed by the symbolic notation (T1 > 0), which is to be interpreted as follows. The new treatment will be declared superior to the control if the test statistic rejects the null hypothesis of no difference between the new treatment and control at the significance level of α, and if the difference favors the new treatment.

Many trials lack a well-defined clinical decision rule. Among those that have one, many do not have a proper statistical support structure, as illustrated by the following two examples.

Example 3. In a recent study of the drug carvedilol for the treatment of congestive heart failure, there was an “unexpected” mortality finding (observed p-value < 0.001).16 However, in this study, mortality was not even stated in the protocol as an endpoint. There was a lengthy debate as to whether carvedilol had demonstrated a mortality benefit in this trial. If it were accepted as a positive demonstration of a mortality benefit, then this would imply that the final decision rule had been altered from the one originally proposed in the protocol. This would certainly lead to an increase in the probability of the overall type I error. This example illustrates a trial with a decision rule that is not exhaustive, in the sense that the decision set is not complete, since mortality is not in the decision set. In addition, the post-hoc attempt to alter the clinical decision rule is without proper statistical support.

Example 4. According to its label, VASOTEC is indicated “for stable asymptomatic left ventricular dysfunction: it decreases the rate of development of overt heart failure and decreases the incidence of hospitalization for heart failure”.73 This claim is essentially based on the SOLVD Prevention Trial. The SOLVD Prevention Trial has all-cause mortality as the only primary endpoint, and hospitalization for CHF, development of CHF, and incidence of MI as three, among several, secondary endpoints. The trial results show that mortality is not significant (p = 0.60), but hospitalization for CHF (p < 0.002), development of CHF (p < 0.002), and incidence of MI (p < 0.024) show nominally significant p-values. So VASOTEC was approved for the above indication based on the “apparent” significance of the “secondary endpoints”, hospitalization for CHF and development of CHF. This time, the decision rule used to assess the efficacy of the drug is
again altered from the one declared in the protocol. It leaves unanswered the potential inflation in the probability of the overall type I error.

These two examples describe clinical trials with multiple primary endpoints where the decision rules are not complete and lack proper statistical support structures. They led to post-hoc attempts to alter the decision rules by attaching “statistical significance” to findings in endpoints that were selected in a retrospective manner. A clinical decision rule should in principle and in practice exhaust all possible ways of assessing the efficacy of the drug. The following example illustrates a clinical decision rule that is exhaustive.

Example 5. In a clinical trial, the decision set consists of three co-primary endpoints, {T1, T2, T3}, without a principal primary endpoint. The clinical decision rule is (T1 > 0, or T2 > 0, or T3 > 0). That is, the new drug would have demonstrated efficacy if it can be shown to be superior to placebo in any one of the three co-primary endpoints, with no inconsistent trend in the other co-primary endpoints. This clinical decision rule considers all three co-primary endpoints to be of equal importance and to have equal likelihood of demonstrating drug efficacy, and hence the application of either a p-value based procedure or a closed testing procedure would be appropriate. The following example illustrates a different situation, where the primary endpoints have varying importance.

Example 6. In a clinical trial, the decision set consists of three primary endpoints, {T1, T2, T3}, where T1 is a principal primary endpoint, and T2 and T3 are two co-primary endpoints. The clinical decision rule is defined as (T1 > 0 at α1 or T2 > 0 at α2 or T3 > 0 at α3), where α1, α2 and α3 are chosen to preserve the probability of the overall type I error at α. This clinical decision rule is different from the preceding example in that one specifies a different allocation of α among the three separate hypotheses, so a closed testing procedure or a p-value based procedure may not be directly applicable. The trial sponsor usually determines the α allocation. In addition, drug efficacy is established if the principal primary endpoint T1 can be shown to be positive, regardless of the outcomes of the other two co-primary endpoints. For example, one may think of T1 as mortality. The following example shows that even if an individual secondary endpoint may not support an efficacy claim, several secondary endpoints together may be sufficient to support an efficacy claim.
Example 7. In a study of Alzheimer patients, T1 = Alzheimer's Disease Assessment Scale-Cognitive Subscale and T2 = Clinician's Interview Based Impression of Change are two secondary endpoints. The clinical decision rule is (T1 > 0 at α = 0.05 and T2 > 0 at α = 0.05). The decision to use a combination of two different types of outcome measurements to evaluate the efficacy of anti-dementia drugs was made with the support of a number of experts working in the field of dementia at FDA's Anti-dementia Assessment Symposium in 1989. This decision rule was recently reaffirmed by the expert panel of the Peripheral and Central Nervous System Drugs Advisory Committee in meetings on issues concerning mild cognitive impairment and vascular dementia. The Alzheimer's Disease Assessment Scale-Cognitive Subscale, a performance-based assessment instrument, ensures that the effect of the drug is on the “core” phenomena of dementia. The Clinician's Interview Based Impression of Change, a global assessment, ensures that the effect of the drug is clinically meaningful.

It has been argued that the clinical decision rule in the above example is too conservative because the probability of the overall type I error is less than 0.05. This argument would be valid only if both T1 and T2 were primary endpoints. The reason this should not be considered conservative is that both T1 and T2 are secondary endpoints, and by our definition either one alone is not sufficient to support a claim. This is one important reason why a hierarchical definition for the various clinical endpoints is proposed. The next example illustrates the importance of hierarchy in a clinical decision rule.

Example 8. In a clinical trial, the decision set {T1, T2} consists of a principal primary endpoint T1 and a co-primary endpoint T2. The clinical decision rule is (T1 > 0 at α1) or (T2 > 0 at α2 | T1 ≮ 0 at α1), where α1 and α2 are chosen to preserve the probability of the overall type I error at α. This example illustrates the difference between a principal primary endpoint and a co-primary endpoint as follows. The new drug can claim efficacy if it is significantly better than placebo relative to the principal primary endpoint T1 at a significance level of α1. On the other hand, the new drug can also claim efficacy if it is superior to placebo relative to the co-primary endpoint T2 at a significance level of α2, provided it is not worse than placebo relative to the principal primary endpoint T1 at the significance level of α1. The condition (T1 ≮ 0 at α1) may need further elaboration in each case. This same condition is implicit in the preceding examples; however, in current practice, this condition is usually not clearly mentioned.
The next example illustrates a clinical decision rule without a proper statistical support structure.

Example 9. In an epilepsy trial, the decision set consists of three clinical endpoints, {A, B, C}, where A is a primary endpoint and B and C are secondary endpoints. The clinical decision rule adopted is that either (A > 0 at α) or (B > 0 at α and C > 0 at α). Let ∆A, ∆B and ∆C denote the parameters corresponding to the endpoints A, B and C respectively. This decision rule reflects the following complex null and alternative hypotheses:

Hoc : ∆A = 0 and (∆B = 0 or ∆C = 0),
Hac : ∆A ≠ 0 or (∆B ≠ 0 and ∆C ≠ 0).

The sponsor proposes to test instead the restricted null hypothesis of no treatment effect,

Hor : ∆A = 0 and ∆B = 0 and ∆C = 0.

Note that Hoc consists of two axes, (∆A = 0 and ∆B = 0) and (∆A = 0 and ∆C = 0), while Hor consists of only the origin. It is shown by Jin and Chi42 that the probability of the overall type I error for the proposed clinical decision rule under Hoc should be

αC = sup over (∆A, ∆B, ∆C) ∈ Hoc of P(∆A,∆B,∆C){|ZA| > cA or (|ZB| > cB and |ZC| > cC)}
   = max{P(∆A=0,∆B=0)(|ZA| > cA or |ZB| > cB), P(∆A=0,∆C=0)(|ZA| > cA or |ZC| > cC)},

where ZA, ZB and ZC are the corresponding test statistics. The probability of the overall type I error corresponding to the restricted null hypothesis is

αr = sup over (∆A, ∆B, ∆C) ∈ Hor of P(∆A,∆B,∆C){|ZA| > cA or (|ZB| > cB and |ZC| > cC)}.

It is easy to see that αC ≥ αr, with strict inequality holding in general. Therefore, if the rejection region, or the critical values cA, cB and cC, are defined to preserve αr at the desired level, say α = 0.05, then the probability of the overall type I error, αC, will be inflated beyond α = 0.05. Consequently, the p-value calculated under the restricted null hypothesis is smaller than that calculated under the complete null hypothesis.
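The gap between αr and αC can be seen in a small simulation. The sketch below assumes independent standard normal test statistics and common critical values of 1.96; these are purely illustrative choices, meant only to show that the size on one of the axes of Hoc (approximated by letting ∆C grow large) exceeds the size at the origin.

```python
# Monte Carlo sketch of Example 9: the rejection region is
# |Z_A| > c_A or (|Z_B| > c_B and |Z_C| > c_C). Independent standard normal
# test statistics and common critical values of 1.96 are illustrative
# assumptions; the point is only that the size on an axis of H_oc
# (Delta_A = Delta_B = 0 with Delta_C large) exceeds the size at the origin.
import numpy as np

rng = np.random.default_rng(0)
n_sim = 1_000_000
c = 1.96

def rejection_rate(delta_c):
    za = rng.standard_normal(n_sim)
    zb = rng.standard_normal(n_sim)
    zc = rng.standard_normal(n_sim) + delta_c
    reject = (np.abs(za) > c) | ((np.abs(zb) > c) & (np.abs(zc) > c))
    return reject.mean()

alpha_r = rejection_rate(0.0)    # restricted null (origin): about 0.052
alpha_c = rejection_rate(10.0)   # on the axis, Delta_C large: about 0.098
print(alpha_r, alpha_c)
```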
This example illustrates that it is necessary to clearly specify the statistical support structure for a complex clinical decision rule. Otherwise, one can easily weaken the strength of evidence or compromise the validity of the statistical inference, for example, through an improper calculation of the p-value.

As noted earlier, a clinical decision rule is defined relative to multiple endpoints. However, one can easily extend the concept to include other efficacy assessment schemes involving multiple comparisons, repeated testing, multiple indications, etc. Furthermore, it is critical that a clinical decision rule have a proper statistical support structure. A confirmatory trial should have a well-defined decision structure.

The p-value based procedures and the closed testing procedures previously discussed are formal statistical procedures developed for testing hypotheses in a multiple testing situation. As pointed out earlier, these procedures do not make reference to any clinical or regulatory considerations. For example, no distinction is made with respect to the nature, meaning and relative importance of the clinical endpoints, and no attempt is made to take into consideration any regulatory perspective on how these endpoints should be used in assessing efficacy. They essentially assume that all multiple endpoints or multiple comparisons are of equal importance for purposes of efficacy assessment. Therefore, in any given clinical situation, a direct application of these procedures to assess the efficacy of a new treatment without proper clinical and regulatory considerations may be quite inappropriate. The following two examples illustrate this problem.

Example 10. In a trial seeking an indication for the treatment of transient insomnia, data on latency to persistent sleep (LPS) are collected at baseline and on four post-baseline nights. The trial sponsor proposes that the drug would have demonstrated efficacy if it can show superiority to placebo on any one of the four nights relative to LPS. Since there is multiple testing, the sponsor proposes to use a p-value based procedure. From a purely statistical perspective, this proposal is acceptable. But this kind of decision rule lacks clinical perspective and is not appropriate for this indication. The reason is that according to this decision rule, the drug could be approved for the treatment of transient insomnia if the trial shows that the drug is superior to placebo on any one of the four nights. Thus, a possible winning scenario is for the drug to show superiority on any one of the nights except the first night. Unfortunately, this kind of outcome cannot support the desired indication, because it is expected that a drug
for treating transient insomnia should show an effect on the first night. The treatment effects on the subsequent nights are used mainly for assessing its tolerance profile, i.e. whether the drug effect diminishes over time. Clinically, showing an effect on the first night is a necessary condition for approval. Therefore, the application of any p-value based procedure or closed testing procedure would not be appropriate. One possible approach is to apply a conditional testing procedure: starting with the first night, test LPS at a significance level of α, and continue testing the subsequent nights in order as long as the current night shows a treatment effect significant at the same α level.

In developing statistical methodology for handling multiple endpoints in clinical trials, one should also be mindful of the interpretability of the results. Clinical interpretation of the results of statistical analyses of multiple endpoints often presents a formidable challenge to both statisticians and clinicians. An interesting way of reducing the problem with multiple endpoints is to define, in some manner, a composite endpoint, an index or a global statistic. The following examples show different ways a composite endpoint can arise and the problems of interpretation that may accompany such composite endpoints. The handling of mortality data in some central nervous system drug trials provides a good illustration.

Example 11. In trials treating acute stroke or ALS, some patients in either the drug or placebo group die during the trials. The primary endpoints in these trials are often neurological scores, such as the NIH Stroke Scale, Modified Rankin Scale, Barthel Index and Glasgow Outcome Scale in stroke trials, or the Appel score and vital capacity measurement in ALS trials. All these scores, except for the Modified Rankin Scale, do not include death as part of the measurement. To analyze the complex data consisting of neurological scores and death, a composite endpoint is sometimes proposed as a way to compare the treatment effect. To illustrate a difficulty in interpreting the results of such a composite endpoint, consider the following proposal. The proposed method ranks the deaths from the earliest to the last, then ranks the patients who are alive according to their neurological scores, with the worst ranked next to the rank of the last death. The new ranking scores are then used in an ordinary ANOVA analysis. The results from such an analysis will be difficult to use in a regulatory setting; in particular, the meaning of the assigned rank scores for death and for neurological functioning will be hard to interpret. For example, the ranks for the last death and the worst neurological score differ only by one. These two very different clinical outcomes are essentially treated as the same outcome in the analysis. This comment is not meant to criticize the statistical merits of the composite endpoint analysis, but merely to point out a problem in interpreting results based on this kind of composite endpoint. Any other global test statistic will likely face a similar difficulty, since it attempts to evaluate the drug effect by using a single measure to summarize two very different outcomes.
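The following is a minimal sketch of the ranking scheme just described: deaths ranked from earliest to last, then survivors ranked by neurological score with the worst surviving score adjacent to the last death. It assumes that a higher score means a worse outcome (as on scales such as the NIH Stroke Scale), and the patient records are hypothetical.

```python
# Sketch of the composite ranking described above. A lower rank means a worse
# outcome; a higher neurological score is assumed to be worse.

def composite_ranks(patients):
    """patients: list of dicts with keys 'id', 'died' (bool),
    'death_day' (if died) and 'score' (if alive)."""
    deaths = sorted((p for p in patients if p["died"]), key=lambda p: p["death_day"])
    survivors = sorted((p for p in patients if not p["died"]),
                       key=lambda p: p["score"], reverse=True)  # worst score first
    ranked = deaths + survivors
    return {p["id"]: rank for rank, p in enumerate(ranked, start=1)}

patients = [
    {"id": "A", "died": True,  "death_day": 5},
    {"id": "B", "died": True,  "death_day": 40},
    {"id": "C", "died": False, "score": 22},   # worst surviving score
    {"id": "D", "died": False, "score": 3},
]
print(composite_ranks(patients))
# {'A': 1, 'B': 2, 'C': 3, 'D': 4} -- the last death (B) and the worst surviving
# score (C) differ by a single rank, which is exactly the interpretive issue
# raised above.
```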
Example 12. The objective of this stroke trial is to demonstrate whether a 24-hour continuous infusion of a new drug combined with t-PA, when t-PA treatment is initiated within three hours of stroke onset, offers a superior outcome at day 90 compared with t-PA treatment alone. The primary endpoints consist of the NIH Stroke Scale, the Barthel Index, the Modified Rankin Scale and the Glasgow Outcome Scale. The proposed analysis is a modification of the Wald-like global test statistic used in Tilley et al.93 If the global test statistic demonstrates a “significant” difference, then the individual component scores will be examined. The problem with this global test statistic is that it can show a “significant” difference even when the individual component scores show inconsistent results. For example, one possible scenario could be that the Barthel Index shows “significance”, but all the other three scores show a negative trend. Moreover, no adjustment is made for multiple testing after the global test shows “significance”; a closed testing procedure should be applied in order to maintain the probability of the overall type I error. This same issue existed in Tilley et al.93 However, that t-PA trial was fortunate to have a highly effective treatment, so that all individual component scores showed significant and consistent findings and the application of a closed testing procedure would have produced a similar conclusion. But in the currently proposed trial, such consistent and significant outcomes in the individual component scores may be unlikely, in view of all the recent failed stroke trials.

In clinical drug trials, the resolution of subtle issues like the ones discussed above requires a close interaction between the clinicians and the statisticians. Any attempt to develop a statistical methodology without careful consideration of the interpretive issues is hazardous.

From all of the preceding examples, it becomes very clear that one should try to avoid the following types of problems. The first type of problem is that of asserting a drug efficacy claim based on a clinical decision rule that is not pre-specified, but defined retrospectively. This practice
inflates the overall type I error rate. The second type of problem refers to clinical decision rules that do not provide the kind of evidence appropriate for the indication sought and the strength of evidence needed. The third type of problem is a clinical decision rule that lacks a proper statistical support structure. The fourth type of problem is the use of a statistical procedure inappropriate for a given multiple testing situation. Finally, one should avoid defining composite endpoints or global tests that would render the results of the analysis difficult to interpret or clinically unacceptable. The following general procedure for defining a decision structure should be prospectively described in the study protocol for each trial, especially a confirmatory trial. For the disease under study, define clearly the indication desired. Understand clearly the efficacy criteria that the new treatment needs to satisfy in order to obtain this indication. Based on these criteria, identify a decision set of clinically relevant endpoints. These endpoints may be called principal primary, co-primary and secondary as defined earlier. These definitions include a hierarchical order based on their clinical importance, objectivity, etc., and are defined from both clinical and regulatory perspectives. From this decision set, one should define the clinical decision rule for assessing the efficacy of the drug. The clinical decision rule is a decision tree consisting of most if not all decision branches or paths that can lead to a drug’s efficacy claim. Each decision branch or path is defined in terms of one or more endpoints from the decision set. Some decision branches or paths may be sequential or conditional, thus forming sub-branches or sub-paths. At each decision point, a decision will be made based on the outcome of the endpoints used in that particular decision path or branch. Then, one should provide the necessary and appropriate statistical support structure for this clinical decision rule. In defining a decision structure, one should also consider the level and strength of the evidence required of the study. The statistical support structure should include: • The appropriate hypotheses to be tested; the hypotheses include the proper null and alternative hypotheses reflecting the clinical decision rule. Unlike a simple null hypothesis, which usually specifies a single parameter equal to zero, the null hypothesis for a complex clinical decision rule is a region in a multidimensional parameter space. Failure to clarify this null hypothesis region could jeopardize the final statistical analysis, and weaken the strength of evidence or even invalidate the trial results.
• The appropriate test statistic to be used at each decision branch or path, and the associated statistical analysis plan. • The desired allocation of α across the decision branches or paths so that the probability of the overall type I error is preserved at the desired α level. • Sufficient sample size and an optimal overall power relative to the entire clinical decision rule. So far, the discussion of the clinical decision rule has focused on a given time point in the trial, usually the end of the trial. In the next section, we shall discuss clinical decision rules applied over some index, typically time or information fraction. 4. Interim Analysis and Sequential Clinical Decision Rule In the 1970s, clinical trial investigators began to question on ethical grounds whether a trial, with mortality or serious irreversible morbidity as the primary outcomes of interest, should be continued to its intended end when interim analyses based on accumulating data show that the treatment is effective. In those early days, interim analyses were routinely done by the investigators without any regard to issues of multiple testing and the consequent inflation of the overall type I error rate. In this kind of trial, the decision rule, which is usually defined in terms of one primary endpoint such as mortality, is repeatedly tested at various times in the course of the trial. The problem at that time was that there was no appropriate statistical support structure for the decision process. The desire to stop a trial early for efficacy has been the driving force behind the development of group sequential procedures in the subsequent decades. A group sequential procedure is a statistical procedure that provides for a series of test statistics based on the accumulating data. The simple null hypothesis is tested by these statistics. An early termination rule is implemented through a stopping boundary defined by the critical values, or in terms of an increasing sequence of nominal significance levels for each test. This boundary is defined so that the probability of the overall type I error is maintained at a desired α-level for a two-sided test. Group sequential procedures that allow for interim analyses and early termination have been proposed by many authors.23,68,74,75,94,108 These group sequential procedures are used for a type of analysis customarily referred to as formal interim analysis, and they are usually implemented for mortality and irreversible morbidity trials.
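As a concrete illustration of how a stopping boundary preserves the overall type I error, the short sketch below computes an O'Brien–Fleming-type boundary for a one-sided test with two equally spaced analyses. The number of looks, the one-sided formulation at α = 0.025, the boundary shape, and the use of Python with SciPy are simplifying assumptions of ours; the sketch is not the procedure of any particular author cited above.

```python
# A hedged sketch, not any specific regulatory procedure: choose the constant c
# of an O'Brien-Fleming-type boundary b_k = c*sqrt(K/k) so that a one-sided
# group sequential test with K equally spaced analyses has overall level alpha.
import numpy as np
from scipy.stats import multivariate_normal, norm
from scipy.optimize import brentq

alpha, K = 0.025, 2
t = np.arange(1, K + 1) / K                      # information fractions
# Under H0 the interim z-statistics are multivariate normal with
# corr(Z_i, Z_j) = sqrt(t_i / t_j) for t_i <= t_j.
corr = np.sqrt(np.minimum.outer(t, t) / np.maximum.outer(t, t))

def overall_level(c):
    b = c * np.sqrt(K / np.arange(1, K + 1))     # boundary at each analysis
    # 1 - P(Z_1 < b_1, ..., Z_K < b_K) = probability of crossing at least once
    return 1.0 - multivariate_normal(mean=np.zeros(K), cov=corr).cdf(b)

c = brentq(lambda x: overall_level(x) - alpha, 1.0, 4.0)
b = c * np.sqrt(K / np.arange(1, K + 1))
print("boundary z-values:", np.round(b, 3))      # roughly 2.80 at the interim, 1.98 at the end
print("nominal one-sided levels:", np.round(norm.sf(b), 4))
```

The interim boundary is far stricter than the final one, which is why a trial analyzed this way can stop early only for a very strong effect while still spending essentially the full α at the end.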
A group sequential procedure with formal interim analyses is an example of a sequential clinical decision rule. Definition 8. A sequential clinical decision rule is a clinical decision rule repeatedly applied over some index, typically time or information fraction. A sequential decision structure is a sequential clinical decision rule with a proper statistical support structure. Note: alternatively, one can define each decision (path) that can result in an efficacy assessment as a clinical decision rule, and then consider a family of such clinical decision rules along with its statistical support structure. The group sequential procedures referred to above that allow for interim analyses and early termination are simple sequential decision structures. They are simple because the clinical decision rule at each time point involves one and the same primary endpoint, such as mortality or a composite endpoint consisting of several types of event endpoints. In a trial with a sequential clinical decision rule, the decision rule itself can be more complex when multiple endpoints and/or multiple doses are involved. A proper statistical support structure is needed for these more complex sequential clinical decision rules. 4.1. Interim analysis and design modifications The principal components of a confirmatory trial include the target patient population, the study design, the decision structure, which includes a decision set of key clinical endpoints, a clinical decision rule with its statistical support structure for assessing efficacy, and possibly an interim analysis plan, etc. There is always an interest on the part of the trial sponsor to make changes to the trial based on an interim treatment comparative analysis of accumulating data prior to the intended end of the trial. These changes may involve changes in the patient population, in the decision structure including the clinical decision rule, the test statistics, the interim analysis plan, and the sample size. For instance, after an interim analysis, it may be of interest to drop one or more treatment arms, to change or drop one or more endpoints, or, in a group sequential trial, to change the interim analysis schedule, the stopping boundary, the α-spending function, or the sample size. Since such proposed changes are based on the interim treatment comparative analysis, they may introduce serious bias into the study and inflate the probability of the overall type I error.
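Before turning to specific methodological proposals, a small simulation may make the inflation mechanism concrete. The adaptation rule, sample sizes, and critical value below are hypothetical choices of ours, a minimal sketch in Python rather than a reproduction of any cited analysis: it simply shows that re-estimating the second-stage sample size from the interim treatment difference and then applying the usual pooled test at its nominal level no longer controls the type I error.

```python
# Illustrative simulation under H0: a data-driven increase of the second-stage
# sample size plus a naive pooled z-test inflates the one-sided type I error.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
crit = norm.ppf(1 - 0.025)        # one-sided 0.025 critical value
n1, n2_plan = 100, 100            # planned per-stage sample sizes per arm (hypothetical)
reps = 200_000

# Under H0 the stage-wise standardized statistics are independent N(0,1),
# whatever second-stage size is chosen after looking at stage 1.
z1 = rng.standard_normal(reps)
z2 = rng.standard_normal(reps)

# Hypothetical adaptation rule: if the interim trend is positive but weak
# (0 < z1 < 1), quadruple the second-stage sample size; otherwise keep the plan.
n2 = np.where((z1 > 0) & (z1 < 1), 4 * n2_plan, n2_plan)

# Naive analysis: pool all the data and compare with the usual critical value.
z_naive = (np.sqrt(n1) * z1 + np.sqrt(n2) * z2) / np.sqrt(n1 + n2)
print("naive adaptive test:", (z_naive > crit).mean())   # about 0.029 instead of 0.025

# Reference: the same statistic with the planned (fixed) second-stage size.
z_fixed = (np.sqrt(n1) * z1 + np.sqrt(n2_plan) * z2) / np.sqrt(n1 + n2_plan)
print("fixed design:", (z_fixed > crit).mean())          # close to 0.025
```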
Cui et al.15 showed that in a group sequential trial, sample size re-estimation based on the observed interim treatment difference can inflate the probability of the overall type I error, essentially because the sample size estimate is interim outcome dependent. A methodology was developed based on weighted test statistics that would permit sample size re-estimation at any one of the pre-scheduled interim analysis times without inflating the probability of the overall type I error. The methodology is quite flexible and allows one to retain the original stopping boundary. Other design modification schemes with or without sample size re-estimation are currently being investigated. Except for changes in the characteristics of the patients yet to be enrolled, one may view most of these design modification strategies as modifications of certain aspects of the decision structure. For example, these modifications may involve deleting one or more decision branches or paths, changing the test statistic at a decision point, re-allocating the α’s and re-estimating the sample size. How such modifications affect the validity of the statistical inference needs to be investigated. For instance, in a group sequential trial with an interim analysis plan, if such a modification affects the underlying Brownian motion process, then one needs to be able to develop a valid statistical test procedure in the absence of, say, the property of independent increments. Generally, one should pre-specify the desired modifications, and describe how decisions to modify or not to modify the decision structure are made conditional on the interim outcome data. Finally, one should ascertain that the modified decision structure can still permit a valid and meaningful interpretation of the study results. 4.2. Adaptive two-stage designs Design modification of a clinical trial can also be prospectively built into a two- or multi-stage randomized trial. In a two-stage design, modifications may be considered at the end of the first stage. Bauer and Köhne3 considered the problem of sample size re-estimation at the end of the first stage. Their two-stage combination test is based on Fisher’s product test and the assumption that the samples from the two stages are independent. Proschan and Hunsberger77 generalized the method of Bauer and Köhne through the concept of a conditional error function. Liu and Chi56 further considered allowance for stopping at the end of the first stage for futility, and the problem of defining a unique overall p-value, by proposing the use of a class of generalized conditional error functions. All
three papers assumed independence of the samples from the two stages.5,105 A general theory of adaptive two- or multi-stage designs that would permit modifications other than sample size re-estimation, and allow for dependent samples, has recently been considered.4,5,57 Again, as in the case of group sequential trials, it would be of interest to develop a general theory of adaptive two- or multi-stage designs that would permit modification of the decision structure for the general case with dependent samples. Such a theory should be developed within the framework of pre-specification of the desired modifications and of how the decision to modify or not to modify the decision structure is based on the interim outcome data. In addition, the distributions of the adapted statistics should be derived, and the overall probability of type I error should be maintained. Furthermore, the uniqueness of the overall p-value should be demonstrated to confirm the validity of the statistical inference. Such a theory would be extremely important as it forms the basis for various practical applications in clinical drug trials as discussed.49 For example, the adaptive two-stage design would be a natural framework for combining a traditional Phase 2 study with a Phase 3 study, or for accelerated approval of potentially life-saving drugs in diseases that do not have available treatment.4 4.3. Interim analysis and data safety monitoring committee Most mortality or serious irreversible morbidity trials have formal planned interim analyses and stopping rules. There is also an independent Data Safety and Monitoring Committee (DMC). The primary responsibility of a DMC is to recommend to the trial sponsor early termination of the trial for either efficacy or safety reasons. The requirement of being independent is to maintain the integrity of the trial. It also frequently happens that the DMC makes recommendations for changes to the design and conduct of the trial. Such recommendations often raise concern. The following are a few important points to consider for a DMC. For general guidance, one may refer to the FDA’s pending Guidance on the Establishment and Operation of Clinical Trial Data Monitoring Committees101 and Data Monitoring Committees in Clinical Trials.112 • The trial protocol should have the prior approval of the DMC. The DMC cannot arbitrarily modify the protocol design. Even though the DMC may be independent, its independence refers to the lack of direct interest in the outcome of the trial. It does not imply that the DMC can recommend
design modifications arbitrarily. Any design modification recommended by the DMC also has the potential for inflation in the probability of the overall type I error. This is because the proposed modification by the DMC is likely to be based on the interim treatment comparative data. The guideline should clearly define the span of authority of the DMC. • The DMC should have standard operating procedures governing its conduct. It has an obligation and responsibility to maintain confidentiality of the interim outcome of the trial. It should not communicate the interim results to anyone outside the DMC, except for the purpose of early termination according to the pre-specified termination rule or for safety reasons. • There should be clear standard operating procedures governing the communication between the trial sponsor and the DMC. There should be clear guidelines regarding the documentation of minutes of meetings, decisions reached, and written communications. The ability to make design modifications is obviously very attractive. However, one should be mindful of the real potential for bias to be introduced as a result of access to the interim unblinded treatment comparative data. As previously noted, if the characteristics of the patients enrolled subsequent to an interim analysis have changed, then this may seriously bias the outcome, unless the drug label can accurately describe the patient population for whom the drug may be prescribed. It is desirable to have an independent third party that is responsible for conducting the interim analysis, either in a group sequential trial setting or in a two- or multi-stage design trial. There should be clear and sound guidelines and standard operating procedures governing the role, responsibility and conduct of the independent third party. In general, any desired design modification or change should be pre-specified at the design stage. Furthermore, one should describe how decisions on modifications are made conditional on the interim outcome data, the distribution of the adapted test statistic, and the overall p-value for assessing the significance of the trial finding. The proposal should also include methods for addressing any adverse impacts such a modification or change may introduce, such as bias, changes in the patient population, and inflation in the probability of the overall type I error, as well as proof of the validity of the statistical inference based on the adapted test statistic. Generally, the clinical decision rule should be as complete as possible. Design modifications, other than sample size adjustment, should be restricted to the deletion
of decision branches or paths. The addition of new decision branches or paths should be discouraged.
4.4. Two-stage design in a randomized trial and accelerated approval Accelerated approval of potentially life saving drugs in diseases that do not have available treatment may be done through an accepted surrogate (or surrogates) whose validity has been established. Validity of the surrogate means it is likely to predict clinical benefit of primary interest. The idea of accelerated approval is to first conditionally approve the use of the drug based on a positive outcome on the surrogate endpoint, and then subsequently confirm its effectiveness through the clinical endpoints of primary interests. FDA’s Oncology Initiatives96 recognize that the predictive value of partial responses may still be a matter of discussion for all types of cancer. But for refractory malignant disease or for diseases that have no adequate alternative, clear evidence of anti-tumor activity is a reasonable basis for approving the drug. In these cases, studies confirming a clinical benefit may appropriately be completed after the conditional approval. So in essence, the Oncology Initiative has gone a step further in permitting a surrogate to be used in certain situations even though the surrogate has not been fully validated. One may refer to the example of gemtuzumab ozogamicin in relapsed acute myeloid leukemia as discussed in Bross et al.9 It is for this reason that a two-stage design may be well suited for accelerated approval. The drug may be tested at the end of the first stage for conditional approval based on the surrogate, and then confirmed at the end of the second stage by the primary clinical endpoints. A two-stage design in a randomized trial offers the following advantages. First, it places the evaluation of the surrogate and the clinical outcomes of primary interest in the same trial, and hence would be able to provide some checks on the validity of the surrogate in relation to the clinical benefit of interest. Secondly, a more appropriate way for conditional approval is to first identify a decision set of clinical endpoints of primary interest that will be used at the end of the second stage to evaluate the efficacy of the drug. Appropriate allocation of α should be considered for multiple endpoints, including the surrogate as well as interim analysis at the end of the first stage, in order to maintain the probability of the overall type I error at
α. At the end of the first stage, efficacy will be evaluated relative to the surrogate endpoint as well as to all the clinical endpoints in the decision set. If the trial fails to demonstrate efficacy relative to the surrogate and to all the clinical endpoints, then the trial stops and the drug fails to gain accelerated approval. If any of the clinical endpoints in the decision set demonstrates efficacy, then the trial can be terminated at the end of the first stage. Otherwise, if the test demonstrates efficacy relative to the surrogate only, then one proceeds to calculate the conditional power of showing a positive outcome at the end of the second stage, given the current interim outcomes, for each of the clinical endpoints in the decision set. If the conditional powers show that the likelihood of such an outcome is very low for all the clinical endpoints, and a sample size increase would be unacceptably large, then the accelerated approval probably should be withheld. Thus, accelerated approval should be granted if at least one such conditional power is sufficiently high, or if the sample size can be increased so that at least one clinical endpoint will have sufficient conditional power to suggest a positive outcome at the end of the second stage. If the trial continues to the second stage and the final results show that there is no clinical benefit, then the drug may need to be withdrawn from the market.
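The conditional power calculation referred to above can be sketched with the usual Brownian motion (B-value) approximation for a normally distributed test statistic. The interim values and the drift choices below are hypothetical assumptions for illustration only; in practice the drift might be set to the originally hypothesized effect, the current trend, or the null value, and the convention used should be pre-specified.

```python
# A minimal sketch of conditional power under a normal approximation; the
# inputs are hypothetical and the "current trend" drift is only one convention.
import numpy as np
from scipy.stats import norm

def conditional_power(z_t, t, theta, alpha=0.025):
    """P(final Z >= z_{1-alpha} | interim Z = z_t at information fraction t),
    where theta is the assumed drift, i.e. the expected Z at full information."""
    z_crit = norm.ppf(1 - alpha)
    b_t = z_t * np.sqrt(t)                  # B-value: B(t) = Z(t) * sqrt(t)
    mean_remaining = theta * (1 - t)        # expected remaining increment of B
    sd_remaining = np.sqrt(1 - t)
    return norm.sf((z_crit - b_t - mean_remaining) / sd_remaining)

z_interim, frac = 1.2, 0.5                  # hypothetical interim z and information fraction
cp_trend = conditional_power(z_interim, frac, theta=z_interim / np.sqrt(frac))
cp_null = conditional_power(z_interim, frac, theta=0.0)
print("under the current trend:", round(cp_trend, 3))
print("under the null drift:", round(cp_null, 3))
```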
5. Active Control Trials The recent fifth revision of the World Medical Association Helsinki Declaration (2000) has generated a great deal of discussion19,92,102 and a renewed interest in active control trials. Historically, for trials with mortality or serious morbidity outcomes, where delaying or withholding available treatments would increase mortality or irreversible morbidity, the use of a placebo is considered unethical. Thus, active control comparative trials have been proposed.24,25 An active control superiority trial allows for a direct comparison of a new study treatment against a standard therapy or standard of care. It establishes the efficacy of a new study treatment by demonstrating that the new treatment is superior to the active control.49,50,90 For example, in oncology, for cancers that have standard therapies, a new study treatment must be compared to and show superiority over a standard therapy in two randomized controlled clinical trials. Unless the new treatment represents a new advance in the treatment of the disease, it would generally be more difficult to show that the new treatment is better than an effective active control. Thus, it is not surprising to find that when a new treatment fails to show superiority to the active control, the sponsor or investigator
would attempt to assert that the new treatment is either equivalent to, or no worse than the active control. But it is well known that failing to reject the null hypothesis of equality between the new treatment and the active control does not imply that they are equivalent.25,91 As White107 aptly puts it, “the lack of evidence of a difference can not be interpreted as evidence of a lack of difference”. It has been suggested by various authors7,39 that if the intention is to show that the new intervention is equivalent or non-inferior to the active control, then one needs to define the null hypothesis and the corresponding alternative hypothesis appropriately. More specifically, the alternative hypothesis should reflect the hypothesis of interest, namely that of equivalence, or non-inferiority. To do that, these authors suggested that a certain equivalence or non-inferiority margin should be pre-specified in the hypothesis. For example, in bioequivalence studies, an equivalence margin is generally set at ±20% of the control effect. In many active control non-inferiority trials, an arbitrary fixed threshold is pre-specified — for example, a threshold of 1.25 for the hazard ratio of the study treatment relative to the active-control. If the upper limit of the 95% CI for the study treatment effect relative to the active control lies beneath this threshold, then non-inferiority is inferred. The major concern with using an arbitrary fixed threshold is that it is unrelated to the active control effect defined as the difference between the control response and the non-existing placebo response. When the active control effect is relatively small, this may lead to a loss of all of the active control effect plus more (or a loss of too great a percent of the active control effect). In other words, if the demonstration of non-inferiority is based on an arbitrary fixed margin, the new treatment may be approved even though it may be less efficacious than a placebo. Having recognized this problem, it has been proposed that the margin should be set at half the lower limit of the 95% confidence interval for the estimate of the active control effect to ensure that the new drug is better than placebo. A similar approach was used by the FDA Center for Biologics Evaluation and Research in a thrombolytic trial where instead of 95%, they used 90% confidence interval. The null hypothesis of sufficient inferiority will be rejected if the upper limit of the 95% confidence interval for the hazard ratio of the new treatment relative to the active control is below this cutoff. This method has also been called the “two-95% confidence intervals approach”. While this fixed cutoff is not arbitrary and is linked to an estimate of the control effect, it has been criticized as being too
conservative because it compares two “statistically worst” cases. Success is when the new treatment demonstrates a retention of at least 50% of the statistically worst active control effect. Various authors30,31,36,44,79,86 have proposed not to directly pre-specify a fixed margin, but rather define simply the percent of the control effect one wishes to retain. Non-inferiority is then demonstrated by the new treatment, if it can be shown that the new treatment retains at least the desired percent of the active control effect. The active control effect may be estimated from non-concurrent placebo or standard control studies using mixed effects model. 5.1. Retention of certain percent of active control effect hypothesis for a non-inferiority trial Let T , C and P denote the treatment, the control and the placebo respectively. The hypothesis for a non-inferiority trial can be formulated in the following two different ways. 1) If one specifies an arbitrary fixed margin, say δ = 20%, then the hypothesis can be formulated as follows: Ho : T − C ≤ −δ
vs. Ha : T − C > −δ .
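As a numerical illustration of this fixed-margin formulation (with invented numbers and a normal approximation), rejecting Ho at one-sided level 0.025 is equivalent to the lower 97.5% confidence limit for T − C lying above −δ; the estimate, standard error, and margin in the sketch below are hypothetical.

```python
# A small sketch of the fixed-margin non-inferiority test just displayed.
from scipy.stats import norm

delta = 0.20                      # pre-specified fixed margin (hypothetical)
diff_hat, se = -0.05, 0.06        # estimated T - C and its standard error (hypothetical)

z = (diff_hat + delta) / se       # test statistic for H0: T - C <= -delta
lower_limit = diff_hat - norm.ppf(0.975) * se

print("one-sided p-value:", round(norm.sf(z), 4))
print("non-inferior at the margin?", lower_limit > -delta)   # same decision as p < 0.025
```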
A fixed margin may be appropriate if one believes the effect can only be attributed to the treatment. For example, in cancer trials, tumor shrinkage may be attributed only to the treatment. Otherwise, the use of an arbitrary fixed margin is highly questionable, since the active control effect is not accounted for in the margin. In particular, if the active control effect is relatively small, using an arbitrary fixed margin may lead to the approval of a study treatment that is actually inferior to the placebo. 2) If the objective of a non-inferiority trial is to demonstrate that the new treatment retains a certain percent of the active control effect, then the null and alternative hypotheses can be written as follows: Ho : (T − P1)/(C1 − P1) ≤ π
vs. Ha : (T − P1)/(C1 − P1) > π
(4)
or Ho : ((T − P1) − (C1 − P1))/(C1 − P1) ≤ π − 1 vs. Ha : ((T − P1) − (C1 − P1))/(C1 − P1) > π − 1
(5)
where π is the percent retention desired, C1 represents the current active control, P1 the current placebo if a placebo were present, and (C1 − P1) is the current active control effect, assumed to be positive.
The hypotheses in (5) can be written as Ho : (T − C1 ) ≤ −(1 − π)(C1 − P1 )
vs.
Ha : (T − C1 ) > −(1 − π)(C1 − P1 ) .
(6)
Now, since there is no placebo in the current trial, the active control effect on the right side of the hypotheses cannot be estimated from the current active control trial. The assessment of the assumption C1 − P1 > 0 is based on non-concurrent, randomized, placebo-controlled trials. If there are adequate and well-controlled non-concurrent placebo or standard control studies that can provide consistent estimates of the active control effect, and if the current active control study is as similar as possible to these non-concurrent studies in terms of design, patient populations and therapeutic settings, then perhaps the assumption that the current control effect (C1 − P1) is a certain fraction, θ, of the historical active control effect, (C0 − P0), may be considered reasonable. That is, (C1 − P1) = θ(C0 − P0), with 0 ≤ θ ≤ 1. Under this assumption, (6) can be written as Ho : (T − C1) ≤ −(1 − π)θ(C0 − P0) vs. Ha : (T − C1) > −(1 − π)θ(C0 − P0) .
(7)
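To make (7) concrete, the following sketch shows one way such a hypothesis might be tested in a survival setting, combining a normal approximation for the log hazard ratio from the current trial with a historical estimate of the control effect. It is only an illustrative "synthesis"-type calculation under our own assumptions (independence of the two estimates, a known θ, and hypothetical numbers); it is not the exact procedure of any author cited in this section.

```python
# Hedged sketch: testing the retention hypothesis (7) on the log hazard ratio
# scale with a normal approximation. All numbers and names are hypothetical.
import numpy as np
from scipy.stats import norm

pi_retain = 0.5              # fraction of the control effect to be retained
theta = 1.0                  # assumed ratio of current to historical control effect
b_hat, se_b = -0.08, 0.09    # log HR of new treatment vs active control (current trial)
e_hat, se_e = 0.25, 0.07     # historical control effect: log HR of placebo vs control

# H0 in (7), rewritten for log hazard ratios (a lower hazard is a benefit):
#   log HR(T vs C1) >= (1 - pi) * theta * log HR(P0 vs C0).
# Treating the two estimates as independent:
margin_coef = (1 - pi_retain) * theta
z = (b_hat - margin_coef * e_hat) / np.sqrt(se_b**2 + (margin_coef * se_e)**2)
p_one_sided = norm.cdf(z)    # small when the observed log HR is well below the margin
print("z =", round(z, 3), " one-sided p =", round(p_one_sided, 4))
```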
The right side in the hypotheses can be viewed as the margin. For example, by setting π = 1/2, the margin is defined as a 50% preservation of the active control effect, θ(C0 − P0). By setting π = 0, the non-inferiority trial becomes a superiority over “placebo” trial, where “placebo” is represented by C1 − θ(C0 − P0). By setting π = 1, the non-inferiority trial becomes an active control superiority trial. A test statistic defined in terms of the estimate of (T − C1) from the current active control trial and the estimate of (C0 − P0) from non-concurrent placebo or standard control trials can be used to test the null hypothesis in (7). Holmgren36 derived a test statistic involving relative risks. Rothmann et al.79 derived a test statistic involving hazard ratios from mortality trials, and established some important properties of the corresponding test statistic. In each case, the hypotheses in (4)–(7) need to be rewritten to reflect the corresponding efficacy measure. Rothmann et al.79 showed that the two-95% confidence intervals approach is conservative in the sense that it controls the probability of the type I error associated with hypotheses (7) at the 0.003 level when using a survival endpoint. It was further shown that if one controls the probability of the type I error for the one-sided test at the 0.025 level, then a unique γ% confidence interval for the active control estimate can be derived. The use of a test statistic for a retention of a certain
percent of the active control effect is conditionally equivalent to an asymmetric two-confidence interval approach. The asymmetric two-confidence interval approach is analogous to the two-95% confidence intervals approach, except that the 95% confidence interval for the active control effect estimate is replaced by the shorter γ% confidence interval. It follows from this that the use of a point estimate for the active control effect always inflates the probability of the type I error. The shorter confidence interval would lead to a higher threshold or margin. A detailed discussion of the general methodology and issues related to the design, analysis and interpretation of such active control non-inferiority trials is given in Rothmann et al.79 Implicit in the formulation of the hypotheses in (4), and its modification in (7), are the following fundamental assumptions. First, there are adequate non-concurrent placebo or standard control trials that consistently demonstrated the active control effect over the various patient populations studied. Such consistency provides the necessary confidence regarding the existence of the active control effect in the current patient population and clinical trial setting. The false positive rate of the assessment of the control effect should be taken into consideration in the current non-inferiority trial. In many cases, an active control non-inferiority trial would probably not be possible to do. Second, it is assumed that the active control effect exists in the current active control trial. Even though there are non-concurrent placebo or standard control trials on the basis of which the active control effect is estimated, it does not follow that the active control effect necessarily exists in the current active control trial. Many causes may lead to such a discrepancy. For example, the patient samples in the non-concurrent trials and the patient sample in the current trial may not be representative of the same patient population. Take an extreme case, and suppose that the active control works mainly in female patients. If, in the non-concurrent trials, the patient samples are mostly female, while the patient sample in the current trial is mainly male, then in the current trial the active control will not show much of an effect. Third, it is assumed that the active control effect in the current active control trial is a certain fraction, θ, of the active control effect determined in the non-concurrent placebo or standard control trials. This assumption requires information, which may or may not be available, that would permit its determination. Rothmann et al.79 show that the method is applicable if one knows what this fraction, θ, is. In fact, because the patient samples in the non-concurrent studies and the patient sample in the current study may not be representative of the same patient population, one generally
cannot assume that the active control effect is constant between the non-concurrent studies and the current study. Hence, some kind of adjustment factor θ may be needed. In other circumstances, such an adjustment may also be necessary. For example, if the standard of care has improved over time, then the active control effect will be somewhat reduced. Similarly, in the anti-infective area, if the bacteria have become resistant to the current antibiotics, then the active control effect will be reduced. Allowing the use of new and effective concomitant medications also reduces the active control effect. A more subtle situation is when the non-concurrent placebo or standard control studies were stopped early based on interim analysis results. It is known that estimates of the active control effects may be biased upward based on interim results.109,110 Proper adjustment is needed for such active control effect estimates.55 It should be noted that the hypothesis in (4), which is appropriate only if the assumption C1 − P1 > 0 holds, reflects the following clinical philosophy. If a new treatment can offer some additional clinical benefits, e.g. a better toxicity profile or ease of administration, then this new treatment may still be beneficial even when some efficacy is lost as compared to the active control. From this perspective, one objective of an active control non-inferiority trial is to rule out all differences of “clinical importance” between the new treatment and the active control. This would permit one to conclude that the new treatment is effective even though one has not established that it is non-inferior to the active control. In such trials, the term “active control non-inferiority trial” is a misnomer, because the trial objective is not to show non-inferiority of the new treatment to the active control, but simply that the new treatment is effective. To show that the new treatment is non-inferior or equivalent to the active control, one will need to specify a much more stringent criterion than that in (4); that is, one should demand retention of much more than 50% of the active control effect, perhaps at least 85% based on preliminary research results. For an active control non-inferiority trial, the problem of bias towards no difference is a crucial issue, because the null hypothesis would be easier to reject when there is bias towards no difference. This kind of bias can be introduced without having to unblind the treatment codes. When the primary efficacy endpoint is mortality or an irreversible morbidity event, such bias is perhaps of less concern. Example 13. In clinical trials involving colorectal cancer, it is considered unethical to use a placebo. In a recent study of xeloda for the treatment of colorectal cancer, the objective is to demonstrate that the new treatment
is effective in an active control trial. Since xeloda has a better side effect profile and ease of administration, the clinicians felt that retention by xeloda of at least 50% of the active control effect represents an acceptable level of efficacy. There are a number of non-concurrent standard control studies involving the active control. These studies demonstrated a fairly consistent active control effect. Based on the data from these non-concurrent studies, the estimates for the log hazard ratio of the active control effect and its standard error are 0.234 and 0.075 respectively. This is equivalent to a hazard ratio estimate of 1.264 with a 95% confidence interval of (1.091, 1.464). For a 50% retention of the active control effect, the largest estimate of the active control effect that is allowed with the type I error controlled at 0.025 is 1.228, which corresponds to the lower limit of a 30% confidence interval. Thus, 50% retention of this active control effect produces a cutoff of 1.114. The clinicians believe that the active control effect should not have diminished over time. Thus, in this analysis, θ = 1. The current active control trial produces a hazard ratio estimate of 0.92 with a 97.5% confidence interval upper limit of 1.09, which is below the cutoff of 1.114 (for comparison, the original protocol proposed a fixed cutoff of 1.20). Thus, this trial demonstrates that it rules out a loss of more than 50% of the active control effect. In fact, the new treatment is likely to retain at least 61% of the active control effect. The conclusion that one may draw from this trial is that the new treatment is effective. It retains at least 61% of the active control effect. However, one cannot claim that this new treatment is non-inferior to the active control. For such a claim, one would require the new treatment to retain at least a certain percent of the active control effect that is much greater than the 50% specified here. But for such a trial, the sample size required to maintain reasonable power would become prohibitively large. This example illustrates that the clinical decision rule is sometimes fairly subtle. It is not simply a matter of defining a set of primary and secondary endpoints, desired comparisons among several treatment arms, and an allocation of α. It involves a deeper understanding of what kind of evidence will be presented when the trial is finished, whether the evidence is appropriate for the indication sought, and whether the evidence would be sufficient for approval. In summary, if for ethical reasons one has to use an active control, then an active control superiority trial can always be done. The problem is that
it may not be easy to demonstrate that the new treatment is superior to the active control, unless the active control happens to be ineffective in the current study. But in the latter case, if one were to claim that the new drug is superior to the active control, then it would be misleading. Thus, even in an active control superiority trial, perhaps one can only conclude that the new treatment is effective, unless one is certain that the active control is effective in the current trial. On the other hand, due to all the critical assumptions needed in doing an active control non-inferiority trial, most of the time such trials may not be possible and are not recommended, because these assumptions are not verifiable. One of the basic concerns is that the active control may not work in the current patient population or trial setting. When this is the case, the assumption that the active control is effective would inflate the type I error rate. Another concern is that one may end up demonstrating, through such a trial, a drug that is actually inferior to a placebo. Therefore, such trials may be contemplated if there are other studies or information that can help to alleviate these concerns, which cannot be verified within the active control trial itself. Even then, the issue of bias towards no difference should be properly addressed. In the actual design of an active control non-inferiority trial, the active control effect, the proportion of the control effect to be preserved, and the control of the probability of the type I error should be properly determined in light of the objective. One should also pay attention to issues such as multiplicity testing, interim analysis and design modification. These issues may take on a different complexity due to the special nature of an active control non-inferiority trial. Proper standard operating procedures should be designed to minimize the introduction of bias towards no difference. References 1. Armitage, P. and Berry, G. (1994). Statistical Methods in Medical Research. 2. Bauer, P. (1991). Multiple testings in clinical trials. Statist. Med. 10: 871–890. 3. Bauer, P. and Köhne, K. (1994). Evaluations of experiments with adaptive interim analyses. Biometrics 50: 1029–1041. 4. Bauer, P. and Kieser, M. (1999). Combining different phases in the development of medical treatments within a single trial. Statist. Med. 18: 1833–1848. 5. Bauer, P., Brannath, W. and Posch, M. (2001). Flexible two-stage designs: An overview. Meth. Infor. Med. 40: 117–121.
6. Begg, C. B. (2000). COMMENTARY: Ruminations on the intent-to-treat. Controlled Clin. Trials 21: 241–243. 7. Blackwelder, W. C. (1982). Proving the null hypothesis in clinical trials. Controlled Clin. Trials 3: 345–353. 8. Boyd, E. J. S., Penston, J. G., Johnston, D. A. and Wormsley, K. G. (1988). Does maintenance therapy keep duodenal ulcer healed? Lancet, 1324–1327. 9. Bross, P. F., Beitz, J., Chen, G., Chen, X. H., Duffy, E., Kieffer, L., Roy, S., Sridhara, R., Rahman, A., Williams, G. and Pazdur, R. (2001). Approval Summary: Gemtuzumab ozogamicin in relapsed acute myeloid leukemia. Report from FDA. Clin. Cancer Res. 7: 1490–1496. 10. Chi, G. Y. H. (1985). A design problem in ulcer prevention trials. Proceedings of the Biopharmaceutical Subsection of the Am. Statist. Assoc., 100–105. 11. Chi, G. Y. H. (1998). Multiple testings: Multiple comparisons and multiple endpoints. Drug Infor. J. 32 (Suppl.): 1347s–1362s. 12. Chi, G. Y. H. and Liu, Q. (1999). The attractiveness of the concept of a prospectively designed two-stage clinical trial. J. Biopharmaceut. Statist. 9: 537–547. 13. Chi, G. Y. H. (2000). Clinical decision rules and multiple endpoints: A regulatory perspective. Presented at the 2nd Inter. Conf. Multiple Comparisons held at the Humboldt University/Charite, Berlin, Germany, June 25–28. 14. Chow, S. C. and Liu, J. P. (1998). Design and Analysis of Clinical Trials: Concept and Methodologies, John Wiley and Sons, Inc. 15. Cui, L., Hung, H. M. J. and Wang, S. J. (1999). Modification of sample size in group sequential clinical trials. Biometrics 55: 853–857. 16. D’Agostino, R. B., Sr. (2000). Controlling alpha in a clinical trial: The case for secondary endpoints. Statist. Med. 19: 763–766. 17. Dunnett, C. W. and Tamhane, A. C. (1995). Step-up multiple testing of parameters with unequally correlated estimates. Biometrics 51: 217–227. 18. Elashoff, J. D., Koch, G. G. and Chi, G. Y. H. (1988). Designing a clinical trial to demonstrate prevention of ulcer recurrence: Modelling simulation approaches. Statist. Med. 7: 877–888. 19. Ellenberg, S. S. and Temple, R. (2000). Placebo-controlled trials and active control trials in the evaluation of new treatments, Part II: Practical issues and specific cases. Ann. Int. Med. 133(6): 464–470. 20. Fisher, L. D., Dixon, D. O., Herson, J., Frankowski, R. F., Hearron, M. S. and Peace, K. E. (1990). Intention-to-Treat in Clinical Trials, Chapter 7 in Statistical Issues in Drug Research and Development, ed. K. E. Peace, Marcel Dekker, Inc. 21. Fisher, L. D. and Moyé, L. A. (1999a). Carvedilol and the Food and Drug Administration approval process: An introduction. Controlled Clin. Trials 20: 1–15. 22. Fisher, L. D. (1999b). Carvedilol and the Food and Drug Administration (FDA) approval process: The FDA paradigm and reflections on hypothesis testing. Controlled Clin. Trials 20: 16–39. 23. Fleming, T. R., Harrington, D. P. and O’Brien, P. C. (1984). Designs for group sequential tests. Controlled Clin. Trials 5: 348–361.
24. Fleming, T. R. (1987). Treatment evaluation in active control studies. Cancer Treatment Reports 71(11): 1061–1065. 25. Fleming, T. R. (1990). Evaluation of active control trials in AIDS. J. AIDS 3 (Suppl. 2): S82–S87. 26. Fleming, T. R. (2000). Design and interpretation of equivalence trials. Am. Heart J. 139: s171–s176. 27. Follman, D. (1995). Multivariate tests for multiple endpoints in clinical trials. Statist. Med. 14: 1163–1175. 28. Gillings, D. and Koch, G. (1991). The application of the principle of intention-to-treat to the analysis of clinical trials. Drug Infor. J. 25: 411–424. 29. Goodman, S. (1999). Toward evidence-based medical statistics 1: The p value fallacy. Ann. Intern. Med. 130: 995–1004. 30. Hassalblad, V. and Kong, D. F. (2001). Statistical methods for comparison to placebo in active-control trials. Drug Infor. J. 32(2): 435–450. 31. Hauck, W. W. and Anderson, S. (1999). Some issues in the design and analysis of equivalence trials. Drug Infor. J. 33: 109–118. 32. Hauschke, D., Schafer, J. and Pigeot, I. (2001). Statistical approaches for the choice of delta. Presented at the 37th Ann. Meet. of the Drug Infor. Assoc., July 8–12, Denver, Colorado. 33. Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75: 800–802. 34. Hochberg, Y. and Tamhane, A. C. (1987). Multiple Comparison Procedures, Wiley, New York. 35. Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian. J. Statist. 6: 65–70. 36. Holmgren, E. B. (2001). Establishing equivalence by showing that a specified percentage of the effect of the active control over placebo is maintained. J. Biopharmaceut. Statist. 9(4): 651–659. 37. Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika 75: 383–386. 38. Hung, H. M. J., Chi, G. Y. H and Lipicky, R. J. (1993). Testing for the existence of a desirable dose combination. Biometrics 49: 85–94. 39. Huque, M., Dubey, S. and Fredd, S. (1989). Establishing therapeutic equivalence with clinical endpoints. Proc. Biopharmaceut. Sec., 46–52. 40. Huque, M. F. and Sankoh, A. J. (1997). A reviewer’s perspective on multiple endpoint issues in clinical trials. J. Biopharmaceut. Statist. 7: 545–564. 41. Jin, K. and Chi, G. Y. H. (1997). Application of bootstrap in handling multiple endpoints. Proc. Biopharmaceut. Sec. Am. Statist. Assoc., 150–155. 42. Jin, K. and Chi, G. Y. H. (1998). Clinical decision rules and statistical support structures — a novel approach to handling the multiple endpoints problem. Proc. Biopharmaceut. Sec. Am. Statist. Assoc., 56–62. 43. Kieser, M., Bauer, P. and Lehmacher, W. (1999). Inference on multiple endpoints in clinical trials with adaptive interim analyses. Biometric. J. 41(3): 261–277.
44. Koch, G. G. and Tangen, C. M. (1999). Nonparametric analysis of covariance and its role in noninferiority clinical trials. Drug Infor. J. 33: 1145–1159. 45. Koch, G. G. (2000). Discussion for Alpha calculus in clinical trials: Considerations and commentary for the new millennium. Statist. Med. 19: 781–784. 46. Kurata, J. H. and Koch, G. G. (1988). Response to H2-receptor antagonists and duodenal ulcer recurrence. Am. J. Gastroenterol. 83(12): 1427–1428. 47. Lachin, J. M. (2000). Statistical considerations in the intent-to-treat principle. Controlled Clin. Trials 21: 167–189. 48. Lan, K. K. G. and DeMets, D. L. (1983). Discrete sequential boundaries for clinical trials. Biometrika 70(3): 659–663. 49. Leber, P. (1986). The placebo control in clinical trials — A view from the FDA. Psychopharmacol. Bull. 22(1): 30–32. 50. Leber, P. (1989). Hazards of inference: The active control investigation. Epilepsia 30 (Suppl. 1): s57–s63. 51. Lehmacher, W., Wassmer, G. and Reitmeir, P. (1991). Procedures for two sample comparisons with multiple endpoints controlling the experimentwise error rate. Biometrics 47: 511–521. 52. Lewis, J. H. (1985). Summary of the Gastrointestinal Drug Advisory Committee Meeting, March 21 and 22. Am. J. Gastroenterol. 80: 581–583. 53. Little, R. J. A. and Rubin, D. B. (1987). Statistical Analysis with Missing Data, John Wiley and Sons, Inc. 54. Little, R. J. A. (1995). Modeling the drop-out mechanism in repeated-measures studies. J. Am. Statist. Assoc. 90(431): 1112–1121. 55. Liu, A. and Hall, W. J. (1999). Unbiased estimation following a group sequential test. Biometrika 86: 71–78. 56. Liu, Q. and Chi, G. Y. H. (2001). On sample size and inference for two-stage adaptive designs. Biometrics 57: 172–177. 57. Liu, Q. (2001). On general two-stage adaptive designs with dependent data. Personal communication, 1–22. 58. Lavori, P. W., Dawson, R. and Shera, D. (1995). A multiple imputation strategy for clinical trials with truncation of patient data. Statist. Med. 14: 1913–1925. 59. Mallinckrodt, C., Clark, W. and David, S. (2001). Accounting for dropout bias using mixed-effects models. J. Biopharmaceut. Statist. 11(1&2): 9–12. 60. Marcus, R., Peritz, E. and Gabriel, K. R. (1976). On closed testing procedure with special reference to ordered analysis of variance. Biometrika 63: 655–660. 61. Mathieu, M. (1997). New Drug Development: A Regulatory Overview, Parexel International Corporation, Waltham, Massachusetts. 62. McManus, J. and Wormsley, K. G. (1989). A new perspective on what maintenance therapy of duodenal ulcer achieves — Selected Summaries. Gastroenterology 96(4): 1218–1220. 63. Moyé, L. A. (1999). End-point interpretation in clinical trials: The case for discipline. Controlled Clin. Trials 20: 40–49.
64. Moyé, L. A. (2000a). Alpha calculus in clinical trials: Considerations and commentary for the new millennium. Statist. Med. 19: 767–779. 65. Moyé, L. A. (2000b). Response to commentaries on Alpha calculus in clinical trials: Considerations and commentary for the new millennium. Statist. Med. 19: 795–799. 66. Myers, W. R. (2000). Handling missing data in clinical trials: An overview. Drug Infor. J. 34: 525–533. 67. Neuhäuser, M., Steinijans, V. W. and Bretz, F. (1999). The evaluation of multiple clinical endpoints, with application to asthma. Drug Infor. J. 33: 471–477. 68. O’Brien, P. C. and Fleming, T. R. (1979). A multiple testing procedure for clinical trials. Biometrics 35: 549–556. 69. O’Brien, P. C. (1984). Procedures for comparing samples with multiple endpoints. Biometrics 40: 1079–1087. 70. O’Neill, R. T. (2000). Commentary on Alpha calculus in clinical trials: Considerations and commentary for the new millennium. Statist. Med. 19: 785–793. 71. Peace, K. E. (1988). Biopharmaceutical Statistics for Drug Development, Marcel Dekker, Inc. 72. Peace, K. E. (1990). Statistical Issues in Drug Research and Development, Marcel Dekker, Inc. 73. Physicians’ Desk Reference (1999). Physicians’ Desk Reference, Medical Economics Company, Inc., Montvale, NJ 07645-1742. 74. Pocock, S. J. (1977). Group sequential methods in the design and analysis of clinical trials. Biometrika 64(2): 191–199. 75. Pocock, S. J. (1982). Interim analyses for randomized clinical trials: The group sequential approach. Biometrics 38: 153–162. 76. Pocock, S. J., Geller, N. L. and Tsiatis, A. A. (1987). The analysis of multiple endpoints in clinical trials. Biometrics 43: 487–498. 77. Proschan, M. A. and Hunsberger, S. A. (1995). Designed extension of studies based on conditional power. Biometrics 51: 1315–1324. 78. Riis, P. (2000). Perspectives on the 5th revision of the Declaration of Helsinki. J. Am. Med. Assoc. 284(23): 3045–3046. 79. Rothmann, M., Chen, G., Li, N. and Chi, G. Y. H. (2001). Design and analysis of non-inferiority mortality trials in oncology. DIA 37th Annual Meeting, July 8–12, Denver, Colorado. To appear in Stat. in Med. 2003. 80. Rüger, B. (1978). Das maximale Signifikanzniveau des Tests “Lehne H0 ab, wenn k unter n gegebenen Tests zur Ablehnung führen”. Metrika 25: 171–178. 81. Samuel-Cahn, E. (1996). Is the Simes improved Bonferroni procedure conservative? Biometrika 83(4): 928–933. 82. Sankoh, A. J., Huque, M. F. and Dubey, S. D. (1997). Some comments on frequently used multiple endpoint adjustment methods in clinical trials. Statist. Med. 16: 2529–2542. 83. Sarkar, S. K. (1998). Some probability inequalities for ordered MTP2 random variables: A proof of the Simes Conjecture. Ann. Statist. 26(2): 494–504.
84. Shen, Y. and Fisher, L. (1999). Statistical inference for self-designing clinical trials with a one-sided hypothesis. Biometrics 55: 190–197. 85. Siegel, J. P. (2000). Equivalence and non-inferiority trials. Am. Heart J. 139: s166–s170. 86. Simon, R. (1999). Bayesian design and analysis of active control clinical trials. Biometrics 55: 484–487. 87. Tamhane, A. J., Hochberg, Y. and Dunnett, C. W. (1996). Multiple test procedures for dose finding. Biometrics 28: 519–531. 88. Tang, D. I., Geller, N. L. and Pocock, S. J. (1993). On the design and analysis of randomized clinical trials with multiple endpoints. Biometrics 49: 23–30. 89. Tang, D. I., Gnecco, C. and Geller, N. L. (1989). An approximate likelihood ratio test for a normal mean vector with non-negative components with application to clinical trials. Biometrika 76: 577–583. 90. Temple, R. (1983). Difficulties in evaluating positive control trials. Proc. Am. Statist. Assoc. (Biopharmaceut. Sec.), 1–7. 91. Temple, R. (1996). Problems in interpreting active control equivalence trials. Account. Res. 4: 267–275. 92. Temple, R. and Ellenberg, S. S. (2000). Placebo-controlled trials and active control trials in the evaluation of new treatments, Part I: Ethical and scientific issues. Ann. Int. Med. 133(6): 455–463. 93. Tilley, B. C., Maler, J., Geller, N. L., Lu, M., Legler, J., Brott, T., Lyden, P. and Grotta, J. (1996). Use of a global test for multiple outcomes in stroke trials with application to the National Institute of Neurological Disorders and Stroke t-PA Stroke Trial. Stroke 27(11): 2136–2142. 94. Tsiatis, A. A. (1982). Repeated significance testing for a general class of statistics used in censored survival analysis. J. Am. Statist. Assoc. 77(380): 855–861. 95. US Code of Federal Regulations 21: Parts 300–499 (2001). US Government Printing Office. 96. US Food and Drug Administration (1996). Oncology Initiatives. http://www.fda.gov/opacom/backgrounders/cancerbg/html. 97. US Food and Drug Administration (1998–2000). International Conference on Harmonization Requirements for Registration of Pharmaceuticals for Human Use (ICH) E-1-E-11. http://www.FDA.gov/CDER/guidance/ index.html. 98. US Food and Drug Administration (1997). International Conference on Harmonization of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH). E-8: Guidance on General Considerations for Clinical Trials. Federal Register 62, 242, December 17, 1997/Notices, 66113–66119. 99. US Food and Drug Administration (1998). International Conference on Harmonization of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH). E-9: Guidance on Statistical Principles for Clinical Trials. Federal Register 63, 179, September 16, 1998/Notices, 49583–49598.
100. US Food and Drug Administration (1999). International Conference on Harmonization of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH). E-10: Guidance on Choice of Control Group in Clinical Trials. Federal Register 64, 185, September 24, 1999/Notices, 51767–51780. 101. US Food and Drug Administration (2001). On the Establishment and Operation of Clinical Trial Data Monitoring Committees — Draft Guidance. Federal Register: Pending. 102. Vastag, B. (2000). Helsinki discord? A controversial declaration. J. Am. Med. Assoc. 284(23): 2983–2985. 103. Verbeke, G., Molenberghs, G., Bijnens, L. and Shaw, D. (1997). Linear Mixed Models in Practice. New York: Springer. 104. Wassmer, G. (1998). A comparison of two methods for adaptive interim analyses in clinical trials. Biometrics54: 696–705. 105. Wassmer, G., Reitmeir, P., Kieser, M. and Lehmacher, W. (1999). Procedures for testing multiple endpoints in clinical trials: An overview. J. Statist. Plan. Inference 82: 69–81. 106. Westfall, P. H. and Young, S. S. (1992). Resampling-Based Multiple Testing, Wiley, New York. 107. White, H. D. (1998). Thrombolytic therapy and equivalence trials — Editorial Comment. J. Am. Coll. Cardiol. 31(3): 494–496. 108. Whitehead, J. (1983). The Design and Analysis of Sequential Clinical Trials. New York: Ellis Horwood. 109. Whitehead, J. (1986a). On the bias of maximum likelihood estimation following a sequential test. Biometrika 73: 573–581. 110. Whitehead, J. (1986b). Supplementary analysis at the conclusion of a sequential clinical trial. Biometrics 42: 461–471. 111. World Medical Association Declaration of Helsinki — Ethical principles for medical research involving human subjects. J. Am. Med. Assoc. 284(23): 3043–3045. 112. Ellenberg, S. S., Fleming, T. R. and DeMets, D. L. (2002). Data Monitoring Committees in Clinical Trials, Wiley, England.
About the Author Dr. Chi is currently the Director, Division of Biometrics I, Center for Drug Evaluation and Research, Food and Drug Administration (FDA), Rockville, Maryland. He received his doctorate in Mathematics (1970) from Carnegie-Mellon University. He has taught at the University of Pittsburgh and the University of Florida, and visited the University of Bucharest as a Senior Fulbright Scholar. Prior to joining the FDA in 1983, Dr. Chi was associated with the LIPID project at the University of North Carolina as an NIH research fellow.
Dr. Chi has an active interest in regulatory research. Recent research publications on factorial trials (Biometrics, March 1993, December 1995 and Communications in Statistics 23, 1994) dealt with the development of new test statistics for evaluating the effectiveness of antihypertensive combination drugs. Current activities include the analysis of ambulatory blood pressure data, analysis of multiple endpoints, prospectively designed two-stage trials (Biometrics 2001), design modifications, clinical decision rules and multiplicity, the design and analysis of active control non-inferiority trials, the preparation of interim analysis and multiple endpoints guidelines, and guidance documents on how to review NDA/INDs. He is currently a member of the Statistical Policy Coordinating Committee.
Section 3 Statistical Methods in Epidemiology
CHAPTER 15
STATISTICS IN GENETICS
ZHAOHAI LI George Washington University, Department of Statistics, 315 Funger Hall, Washington, DC 20052, USA Tel: 202-994-6357; [email protected] MINYU XIE Central China Normal University, Wuhan, China and George Washington University, Washington DC, USA
1. Introduction

Recent advances in molecular genetics have provided opportunities for genetic studies of complex human traits. Many human diseases such as cystic fibrosis, insulin dependent diabetes mellitus, hypertension, and schizophrenia, are considered as having some genetic component. Locating the genes that affect susceptibilities to these diseases is important in understanding the etiology of the diseases and may result in better treatment. In this chapter, we give an overview of segregation and linkage analysis of genetic data. We start with an introduction of basic genetic concepts and relevant terminology.

1.1. Genetic terminology

Each individual has 23 pairs of chromosomes. One of the 23 pairs is formed by two sex chromosomes, X and Y. Every woman carries XX chromosomes and every man carries XY. The other 22 pairs are called autosomal chromosomes. We focus on autosomal chromosomes in this chapter. The human genome can be imagined as two parallel straight lines. A given location in the genome (like a segment or point on the straight line) is called a locus.
The different forms of genetic variants are termed as alleles. In molecular genetics or biology, the term, gene, is used to refer to both an allele or a locus. Alleles are often denoted by letters or numbers, such as A, a, B, b, 1, 2, 3, etc. The proportion of a specific allele at a given locus in the population is called the allele frequency or allele probability. For example, p = P (A) = 0.3, implies that 30% of alleles at a given locus are “A.” The allele frequency can be estimated from genetic data. At any given locus, each individual has two alleles, such as AA, Aa, or aa. The pair of alleles at a locus is referred to as the genotype. The order of the two alleles in the genotype is not relevant. Thus, “Aa” and “aA” are considered to be the same genotype. If an individual has two identical alleles (e.g. AA or aa) at a given locus, then the individual is said to be homozygous at that locus. If the two alleles are different (e.g. Aa), then the individual is said to be heterozygous at that locus. Usually, the determination of the genotype of an individual at a given locus requires laboratory work. However, some of characteristics are readily observable, such as an individual’s eye color and height. An observable characteristic is called the phenotype. The relationship between the genotype and the phenotype is not necessarily one to one. It is possible that several different genotypes correspond to one phenotype. If genotype AA and genotype Aa both have the same phenotype (characteristic), but different from that of genotype aa, then allele A is said to be dominant to allele a, or allele a is said to be recessive to allele A. In this case, the phenotype or characteristic corresponding to AA is called dominant, and the phenotype corresponding to aa is called recessive. If the phenotype associated with the genotype Aa is different from those of both AA and aa, then alleles A and a are said to be codominant. For example, there are three alleles, A, B, and O, at the blood group locus. Genotypes AA and AO have the same phenotype, blood type A, and genotypes BB and BO have the same phenotype, blood type B. Hence, allele A is dominant to allele O, and allele B is also dominant to allele O. The genotype OO has a recessive phenotype, blood type O, and the genotype AB has a codominant phenotype, blood type AB. When we consider more than one locus simultaneously, the alleles (at different loci) received by an individual from one parent are called a haplotype. A pair of haplotypes is a multilocus genotype. Suppose that there are two loci, locus one with two alleles A and a, and locus two with two alleles B and b. Figure 1 presents a hypothetical family with two parents and a son and a daughter in which the son received haplotype ab from the mother and haplotype Ab from the father, and the daughter received
haplotypes AB from the father and ab from the mother.

Fig. 1. A hypothetical two-generation family: father AB/ab, mother ab/ab, son Ab/ab, and daughter AB/ab.

The squares and circles in Fig. 1 indicate males and females, respectively. Notice that there is a vertical bar between the two haplotypes for some individuals in Fig. 1. The bar notation represents a phase known genotype. That is, alleles on the same side of the bar are from the same parent, or equivalently the maternal and paternal origin of each allele is known. The haplotype that the son received from the father is Ab, which is different from the two original haplotypes of the father, AB and ab. Therefore, the alleles at the two loci have been recombined during the process of transmission from the father to the son. This phenomenon is referred to as crossing-over. Only an odd number of cross-overs between two loci is observable. The probability of an odd number of cross-overs between two loci is called the recombination fraction, denoted by θ. An individual receives one of the two alleles from the genotype of each parent with equal probability. Suppose the genotype of a parent is Aa; then the above principle implies that

P{→ A|Aa} = P{→ a|Aa} = 1/2 ,

where {→ A|Aa} denotes the event that a parent transmits allele A to the offspring given that the parent has genotype Aa. This is frequently referred to as Mendel's first law or the principle of independent segregation. If the loci of a two-locus genotype are on different chromosomes, then the transmission of the alleles at one locus is independent of the transmission of the alleles at the other locus. For example, suppose a parent has genotype AaBb
(two-locus genotype) and the two loci are on different chromosomes, then

P{→ AB|AaBb} = P{→ A|Aa} × P{→ B|Bb} = 1/4 ,
P{→ Ab|AaBb} = P{→ A|Aa} × P{→ b|Bb} = 1/4 ,
P{→ aB|AaBb} = P{→ a|Aa} × P{→ B|Bb} = 1/4 ,
P{→ ab|AaBb} = P{→ a|Aa} × P{→ b|Bb} = 1/4 .

This principle of independent assortment is often called Mendel's second law.

1.2. The Hardy–Weinberg equilibrium

Random mating is defined as follows: any female is equally likely to mate with any male. That is, the probability of a mating type is the product of the probabilities of the genotypes of the female and male mates, for example, P{AA × Aa} = P(AA)P(Aa). The notation AA × Aa indicates the mating type resulting from an AA individual mating with an Aa individual.

Next, we consider a locus with two alleles, A and a. Assume that the allele frequencies (or probabilities) for the two alleles are

P(A) = p ,    P(a) = q ,

where p + q = 1. A population is said to be in equilibrium if the proportions (or probabilities) of the three genotypes in the current generation are

P(AA) = p² ,    P(Aa) = 2pq ,    P(aa) = q² .    (1)

Under random mating, the allele and genotype probabilities for the next generation are the same as those of the current generation, i.e.

P(A) = p ,    P(a) = q ,

and

P(AA) = p² ,    P(Aa) = 2pq ,    P(aa) = q² .

The genotypic and allelic probabilities stay the same from generation to generation for an equilibrium population. This is known as the Hardy–Weinberg law.19,67
If the genotypic probabilities of the current generation do not satisfy condition (1), then equilibrium will be reached after one generation of random mating.68 For details of the derivation of the Hardy–Weinberg law, readers are referred to an excellent book on population genetics by C. C. Li.29

1.3. Linkage and linkage equilibrium

From Mendel's second law, if two genetic loci are on different chromosomes, then the transmission of alleles (segregation) at one locus is independent of that at the other locus; in that case, the recombination fraction is θ = 1/2. If the two genetic loci are close together, the alleles that are paternal (or maternal) in origin tend to be transmitted together (cosegregation) to an offspring. This phenomenon is known as linkage. The closer the two loci are, the smaller the probability of crossing over; thus, the recombination fraction for two linked loci is smaller than 1/2.

Next, we consider two genetic loci in more detail. Suppose that the first locus has two alleles, A and a, and the second locus has two alleles, B and b. The respective allele probabilities are

P(A) = p ,    P(a) = q ,    P(B) = u ,    P(b) = v ,

where p + q = 1 and u + v = 1. There are nine two-locus joint genotypes. Under the assumptions of random mating and no linkage (θ = 1/2), if the genotypic probabilities are:

Genotype:     AABB    AABb    AAbb    AaBB    AaBb    Aabb    aaBB    aaBb    aabb
Probability:  p²u²    2p²uv   p²v²    2pqu²   4pquv   2pqv²   q²u²    2q²uv   q²v²

then the genotypic probabilities of the next generation will be the same as those of the current generation. Hence, if each locus is separately in equilibrium in a population and the two loci are not linked, then the two loci are jointly in equilibrium. For a population not jointly in equilibrium, the joint equilibrium will not be reached after a single generation of random mating. The joint equilibrium is approached as the number of generations n → ∞. The speed of approach to joint equilibrium depends on the recombination fraction θ (see Eq. (3) below). If alleles at two loci are in random association (independent), the two loci are said to be in a state of linkage equilibrium. To illustrate this concept, consider two diallelic loci with alleles A and a for locus one, and alleles B
and b for locus two. If these loci are in a state of linkage equilibrium, then the haplotype probabilities satisfy:

P(AB) = P(A)P(B) ,    P(Ab) = P(A)P(b) ,
P(aB) = P(a)P(B) ,    P(ab) = P(a)P(b) .    (2)

Under linkage equilibrium, the joint probabilities of each two-locus haplotype are equal to the products of the corresponding single-locus allele probabilities. If alleles at two loci are not in random association (dependent), this situation is referred to as allelic association or linkage disequilibrium. For two diallelic loci, under linkage disequilibrium, we have

δ = P(AB) − P(A)P(B) ≠ 0 ,

where δ, the departure from equilibrium, is the linkage disequilibrium parameter. It can be shown that

P(AB) = P(A)P(B) + δ ,    P(Ab) = P(A)P(b) − δ ,
P(aB) = P(a)P(B) − δ ,    P(ab) = P(a)P(b) + δ .

If linkage disequilibrium exists (δ ≠ 0) initially for a population, then under random mating the linkage disequilibrium parameter will approach zero as the number of generations n → ∞. Specifically, let δ0 be the initial disequilibrium parameter and θ be the recombination fraction between the two loci; then after n generations,

δn = (1 − θ)^n δ0 ,    (3)

where δn is the linkage disequilibrium parameter of the nth generation. The linkage disequilibrium will decrease quickly over generations when the linkage is weak, i.e. when the recombination fraction θ is large (close to 1/2). If the two loci are unlinked, θ = 1/2 and equilibrium is reached very quickly; if the two loci are tightly linked, θ ≃ 0, and disequilibrium will continue for many generations. This is the basis for fine mapping using disequilibrium. Therefore, a large linkage disequilibrium is often considered to be evidence of linkage (small θ). Readers who would like to know more about linkage and population genetics in general are referred to the books by Li29 and Hartl and Clark.20 Most books on statistical methods in genetics provide an introductory chapter on basic genetic concepts and terminology.13,27,40,54 The book by Watson et al.65 gives a comprehensive description of the molecular biology of genes. Olson et al.38 wrote a tutorial on genetic mapping of complex traits. For fundamentals of genetic epidemiology, readers are referred to Khoury et al.23 and Thompson.64
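As a small numerical illustration of the decay in Eq. (3), the Python sketch below uses a hypothetical initial disequilibrium δ0 = 0.10 and a few hypothetical recombination fractions; it simply evaluates δn = (1 − θ)^n δ0 and shows how fast disequilibrium disappears for loose linkage and how slowly for tight linkage.

# Decay of linkage disequilibrium, Eq. (3): delta_n = (1 - theta)**n * delta_0.
# delta0 and the recombination fractions are hypothetical illustration values.
def ld_after_n_generations(delta0, theta, n):
    return (1.0 - theta) ** n * delta0

delta0 = 0.10
for theta in (0.5, 0.1, 0.01):
    values = [round(ld_after_n_generations(delta0, theta, n), 4) for n in (1, 10, 50)]
    print("theta =", theta, "-> delta after 1, 10, 50 generations:", values)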
2. Segregation Analysis

Mendel proposed that two factors (alleles) segregate from one another in the process of forming gametes (sperm or ovum) that constitute the genetic makeup of the next generation. The Mendelian segregation ratio is the conditional probability of the genotype in the offspring given the mating type, for example,

P{aa|Aa × Aa} = 1/4 .

These ratios are fixed under the Mendelian inheritance model. Hence, one task of segregation analysis is to test whether or not the observed phenotype data among the offspring are consistent with Mendelian inheritance. In general, segregation analysis tests models of inheritance with family data.

2.1. Estimating allele probability (gene frequency)

Suppose that a random sample of individuals is selected to observe the phenotypes at a given autosomal locus. If all alleles are codominant, one can estimate the frequency of an allele at the locus by counting the number of copies of that allele in the sample and then dividing by the total number of genes (alleles) in the sample. There are twice as many alleles as individuals because each individual has two alleles at a given locus.

Consider a given locus which has two codominant alleles A and a. Suppose that a random sample of n individuals is ascertained to study the locus, with nAA, nAa, naa the counts of the three genotypes AA, Aa, aa, respectively (nAA + nAa + naa = n). Then the frequency pA of allele A is estimated by

p̂A = (2nAA + nAa)/(2n) .

Likewise

p̂a = (2naa + nAa)/(2n) ,

with p̂A + p̂a = 1. The variances of the estimated allele frequencies are

var(p̂A) = 2npA(1 − pA)/(2n)² = pA(1 − pA)/(2n) ,    var(p̂a) = pa(1 − pa)/(2n) .

Example: The human MN blood type locus is a codominant locus with two alleles M and N. Li29 cites the results of L. Ride (1935; cf. Haldane, 1938) on MN blood type data. More than one thousand Chinese residents in Hong Kong were tested for the MN blood type. The following results were obtained:
Blood type:    MM     MN     NN     Total
Number:        342    500    187    1029
p̂M = (2 × 342 + 500)/(2 × 1029) = 1184/2058 = 0.5753 .

Suppose that a given locus has k codominant alleles, with ni alleles of type i in a random sample of n individuals, where n1 + n2 + ··· + nk = 2n. Then the allele frequency pi of allele type i is estimated by

p̂i = ni/(2n) .

It is easy to see that the allele counts (n1, n2, . . . , nk) have a multinomial distribution with parameters (p1, p2, . . . , pk). Thus,

E(p̂i) = pi ,    var(p̂i) = pi(1 − pi)/(2n) ,    cov(p̂i, p̂j) = −pipj/(2n) for i ≠ j .

In fact, p̂i is a maximum likelihood estimate of pi. Both maximum likelihood estimates (MLE) and likelihood ratio tests play very important roles in genetic analysis.

Consider a locus with two alleles A and a. If allele A is dominant, then the genotypes AA and Aa have the same phenotype. Let nA be the number of individuals with either genotype AA or Aa, and let na be the number of individuals with genotype aa, where nA + na = n. Under the random mating assumption and by Hardy–Weinberg equilibrium, the probability of the recessive genotype equals the squared probability of allele a, i.e. P(aa) = pa². Therefore

p̂a² = na/n , and p̂a = √(na/n) .

Note that na has a binomial distribution with parameter pa². Hence, p̂a² is the MLE for pa². By the invariance principle of MLE, p̂a is the MLE for pa. By using the δ-method47 and the fact that

var(p̂a²) = pa²(1 − pa²)/n ,

we can derive

var(p̂a) = (1 − pa²)/(4n) .

For a dominant locus with more than two alleles, such as the ABO blood locus, estimating allele frequencies is a little more complex. The EM algorithm can be used to obtain the MLEs of the allele frequencies.9,27,32,39,56
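To make the EM (gene counting) idea concrete, here is a minimal Python sketch for a three-allele ABO-type locus. The phenotype counts and starting frequencies are hypothetical; in each iteration the E-step splits the A and B phenotype counts into their possible genotypes under the current allele frequencies, and the M-step re-estimates the allele frequencies by gene counting.

# EM (gene counting) for ABO allele frequencies; counts are hypothetical.
p, q, r = 1/3, 1/3, 1/3                     # frequencies of alleles A, B, O
n_A, n_B, n_AB, n_O = 186, 38, 13, 284      # phenotype counts A, B, AB, O
n = n_A + n_B + n_AB + n_O
for _ in range(50):
    # E-step: expected genotype counts given the current allele frequencies
    nAA = n_A * p * p / (p * p + 2 * p * r)
    nAO = n_A - nAA
    nBB = n_B * q * q / (q * q + 2 * q * r)
    nBO = n_B - nBB
    # M-step: gene counting
    p = (2 * nAA + nAO + n_AB) / (2 * n)
    q = (2 * nBB + nBO + n_AB) / (2 * n)
    r = (2 * n_O + nAO + nBO) / (2 * n)
print(round(p, 4), round(q, 4), round(r, 4))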
2.2. Testing Hardy–Weinberg equilibrium

Hardy–Weinberg equilibrium is commonly assumed in genetic analyses. The validity of this assumption can be tested by the Pearson χ² test:

χ² = Σ (O − E)²/E .

For a two-allele codominant locus, under Hardy–Weinberg equilibrium (H0), the expected numbers of individuals with the various genotypes in a random sample of n individuals are

Genotype:    AA      Aa         aa
Expected:    npA²    2npA pa    npa²

Let nAA, nAa, naa be the observed numbers of individuals with genotypes AA, Aa, aa, respectively. Then,

p̂A = (2nAA + nAa)/(2n) ,    p̂a = (2naa + nAa)/(2n) .

The expected numbers of individuals with the various genotypes can be estimated by

ÊAA = np̂A² ,    ÊAa = 2np̂A p̂a ,    Êaa = np̂a² .

The one degree of freedom Pearson χ² statistic is

χ² = (nAA − ÊAA)²/ÊAA + (nAa − ÊAa)²/ÊAa + (naa − Êaa)²/Êaa .

Reject the null hypothesis (H0) that the population is in Hardy–Weinberg equilibrium when the χ² value is large, i.e. larger than χ²1−α(1), where α is the level of significance.

Example: The following data were obtained from a hypertension genetic study. A random sample of 197 individuals was genotyped at the angiotensin-converting enzyme (ACE) locus, with

Genotype:    AA    Aa    aa
Number:      26    93    78

From this data set, we have

p̂A = 0.3680 ,    p̂a = 0.6320 ,
ÊAA = 197 × (0.3680)² = 26.68 ,
ÊAa = 2 × 197 × 0.3680 × 0.6320 = 91.63 ,
Êaa = 197 × (0.6320)² = 78.69 ,
and

χ² = (26 − 26.68)²/26.68 + (93 − 91.63)²/91.63 + (78 − 78.69)²/78.69 = 0.0439 .

This test fails to reject the null hypothesis that the population is in Hardy–Weinberg equilibrium at the significance level α = 0.05, since χ²0.95(1) = 3.841.
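The calculation in this example is easy to reproduce; a minimal Python sketch using only the genotype counts quoted above is given here.

# Hardy-Weinberg chi-square test for the ACE genotype counts in the example.
nAA, nAa, naa = 26, 93, 78
n = nAA + nAa + naa
pA = (2 * nAA + nAa) / (2 * n)
pa = 1.0 - pA
expected = {"AA": n * pA ** 2, "Aa": 2 * n * pA * pa, "aa": n * pa ** 2}
observed = {"AA": nAA, "Aa": nAa, "aa": naa}
chi2 = sum((observed[g] - expected[g]) ** 2 / expected[g] for g in observed)
print(round(chi2, 4))    # about 0.044; compare with chi-square_{0.95}(1) = 3.841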
2.3. Segregation analysis of dominant loci

One method to show the genetic basis of single-locus inheritance of a given disease is to demonstrate the Mendelian segregation ratio. Consider a rare dominant two-allele disease locus with alleles A and a. Assume that the allele A causes the disease, with frequency P(A) = p ≃ 0. Individuals with genotypes AA and Aa will have the disease, while individuals with genotype aa will be unaffected. The observable phenotypes are affected or unaffected. The Mendelian segregation ratios for the six mating types under random mating are presented in Table 1.

Table 1. Mendelian segregation ratios for 6 mating types under random mating.

                            Genotype                  Phenotype
Mating     P(Mating)        AA     Aa     aa          Affected   Unaffected
AA × AA    p⁴               1      0      0           1          0
AA × Aa    4p³q             1/2    1/2    0           1          0
AA × aa    2p²q²            0      1      0           1          0
Aa × Aa    4p²q²            1/4    1/2    1/4         3/4        1/4
Aa × aa    4pq³             0      1/2    1/2         1/2        1/2
aa × aa    q⁴               0      0      1           0          1
Among the five possible mating types which produce affected offspring, the mating type Aa × aa is the most likely to occur according to the above probabilities of mating types and the fact that p = P (A) ≃ 0. The most informative ascertainment procedure or sampling scheme is to select families with one affected parent and another unaffected parent. The mating of such families is commonly assumed to be Aa × aa. Let τ = P {Aa|Aa × aa} ≃ P {Affected|Aa × aa} be the Mendelian ratio parameter, and let X be the random variable for the number of affected offspring for
an ascertained family. Then X has a binomial distribution with probability function

L(r|τ) = P{X = r} = C(n, r) τ^r (1 − τ)^(n−r) ,    (4)

where n is the number of offspring and C(n, r) denotes the binomial coefficient. It is worth noting that the genotypes of the offspring are conditionally independent given the parental mating type. To demonstrate the Mendelian segregation ratio, we test H0: τ = 1/2.

2.4. Approximate χ² test

Suppose that k families with parental mating type Aa × aa were ascertained, and each family has ri affected offspring among a total of ni offspring (n1 + n2 + ··· + nk = n). Let X = X1 + X2 + ··· + Xk; then

E(Xi) = ni τ ,    var(Xi) = ni τ(1 − τ) ,    E(X) = nτ ,    var(X) = nτ(1 − τ) .

By the central limit theorem,

(X − nτ)/√(nτ(1 − τ)) → N(0, 1) in distribution,

and hence

(X − nτ)²/(nτ(1 − τ)) → χ²1 in distribution.

Under H0: τ = 1/2, we calculate the test statistic

χ² = (Σi ri − n/2)²/(n/4) .

Reject H0 when χ² is large.

2.5. Likelihood ratio test

With the above family data structure and formula (4), the likelihood function of τ is

L(τ|r1, . . . , rk) = [Πi C(ni, ri)] τ^r (1 − τ)^(n−r) ,

where r = Σi ri. The MLE of τ is τ̂ = r/n. The likelihood ratio statistic is

χ² = 2 ln[L(τ̂)/L(1/2)] ,

which is asymptotically distributed as χ²1 under H0.
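For illustration, the following Python sketch computes both the approximate χ² statistic of Section 2.4 and the likelihood ratio statistic of Section 2.5 for a small set of hypothetical sibships ascertained through an Aa × aa mating; each tuple gives (number of offspring, number affected).

import math
# Hypothetical sibship data: (n_i, r_i) = (offspring, affected offspring).
families = [(4, 3), (3, 2), (5, 3), (2, 1), (6, 4)]
n = sum(ni for ni, _ in families)
r = sum(ri for _, ri in families)
# Approximate chi-square test of H0: tau = 1/2
chi2_approx = (r - n / 2) ** 2 / (n / 4)
# Likelihood ratio test (the binomial coefficients cancel in the ratio)
tau_hat = r / n
def loglik(tau):
    return r * math.log(tau) + (n - r) * math.log(1 - tau)
lrt = 2 * (loglik(tau_hat) - loglik(0.5))
print(round(tau_hat, 3), round(chi2_approx, 3), round(lrt, 3))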
2.6. Segregation analysis of recessive loci

Consider a rare recessive two-allele disease locus. For an individual to be affected, the genotype of the individual has to be aa. There are three mating types, Aa × Aa, Aa × aa, aa × aa, which could produce affected offspring. Let P(A) = p and P(a) = q = 1 − p. Then it can be shown that the conditional probabilities of the mating types, given that one offspring is affected, are:

P{Aa × Aa|Affected} = p² ,    P{Aa × aa|Affected} = 2pq ,    P{aa × aa|Affected} = q² .

Since the disease is rare recessive, i.e. P(a) = q ≃ 0, Aa × Aa is the most likely mating type given that one child is affected. Therefore, the ascertainment procedure for a rare recessive disease is to select families with at least one affected child and then assume the mating type is Aa × Aa for the analysis. The Mendelian segregation ratio for mating type Aa × Aa with a recessive disease is

τ = P{aa|Aa × Aa} = 1/4 .

We can estimate and test the segregation ratio τ. It is possible that some families with at least one affected child are not included in the study sample due to chance. Fisher15 in his classical paper recognized the need for correcting for the incomplete selection in segregation analysis and proposed methods to take the ascertainment procedure into account in the analysis. When families are selected on the basis of having at least one affected offspring, the affected individuals initially identified are called probands. It is possible that one ascertained family has more than one proband. The probability that an affected individual is a proband is called the ascertainment probability, and it is denoted by

π = P{Proband|Affected} .

The probability that a family with r affected offspring is not ascertained is

P{Not Ascertained|r Affected} = (1 − π)^r .

Therefore, the probability that a family with r affected offspring is ascertained is

P{Ascertained|r Affected} = 1 − (1 − π)^r .
If the ascertainment probability is one, i.e. π = 1, then all families with at least one affected offspring are ascertained. This situation is referred to as complete ascertainment. The situation where π < 1 is referred to as incomplete ascertainment. When the ascertainment probability is very small, that is, π ≃ 0, then 1 − (1 − π)^r ≃ rπ, and the phrase single ascertainment is used. Thus, for the single ascertainment procedure, the probability that an affected family is ascertained is proportional to the number of affected offspring. Under the single ascertainment procedure, almost all ascertained families have a single proband.

2.7. Segregation analysis with complete ascertainment

Under the complete ascertainment procedure, the number of affected offspring for each ascertained family has a truncated binomial distribution15 with likelihood function

L(τ|ri, si) = C(si, ri) τ^ri (1 − τ)^(si−ri) / [1 − (1 − τ)^si] ,    ri = 1, . . . , si , i = 1, . . . , n ,    (5)

where τ = P{aa|Aa × Aa} is the Mendelian segregation ratio, si is the number of offspring in the ith family, ri is the number of affected offspring, and n is the number of families. The goal of segregation analysis is to estimate and test the Mendelian segregation ratio τ.

Assume that all ascertained families have the same number of offspring, si = s for all i. Let ar be the number of families with r affected offspring (r = 1, 2, . . . , s) and ns be the total number of ascertained families; then Σr ar = ns, and Σr r·ar = A is the total number of affected offspring. From (5), the likelihood function based on the ns ascertained families is

L(τ) = Π(i=1 to ns) L(τ|ri, si) = Π(r=1 to s) { C(s, r) τ^r (1 − τ)^(s−r) / [1 − (1 − τ)^s] }^ar .

The maximum likelihood estimate of τ is the solution of the score equation ∂L(τ)/∂τ = 0, which is equivalent to

sτ / [1 − (1 − τ)^s] = A/ns = r̄ .    (6)

There is no closed form solution to Eq. (6). The solution can be obtained by iterative algorithms such as the Newton–Raphson, Fisher scoring, and
EM algorithms. The Fisher information for τ is

I(τ) = E[−∂² ln L(τ)/∂τ²] = s ns [1 − (1 − τ)^s − sτ(1 − τ)^(s−1)] / {τ(1 − τ)[1 − (1 − τ)^s]²} .

The variance of the MLE τ̂ can be estimated by 1/I(τ̂).
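As an illustration of how Eq. (6) can be solved numerically, the Python sketch below applies Newton–Raphson iteration to a hypothetical data set with common sibship size s = 4 and observed mean number of affected offspring r̄ = 1.6; the starting value is arbitrary.

# Newton-Raphson solution of Eq. (6) for the truncated-binomial MLE of tau.
s, rbar = 4, 1.6                       # hypothetical sibship size and mean affected
def g(tau):
    return s * tau / (1.0 - (1.0 - tau) ** s) - rbar
def dg(tau):
    denom = 1.0 - (1.0 - tau) ** s
    return s / denom - s * tau * s * (1.0 - tau) ** (s - 1) / denom ** 2
tau = 0.3                              # starting value
for _ in range(20):
    tau -= g(tau) / dg(tau)
print(round(tau, 4))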
2.8. Segregation analysis with incomplete ascertainment In order for an individual selected at random, from families with parental mating type Aa×Aa, to become a proband, the individual has to be affected and ascertained. Thus, P {Proband} = P {Affected and Selected} = P {Selected|Affected}P {Affected} = πτ , where π is the ascertainment probability, and τ is the Mendelian segregation ratio. A family is considered to be segregating if the family has at least one affected offspring. The probability that a family with s offspring is a segregating family and not ascertained is (1 − πτ )s . Thus, the probability that a family with s offspring is ascertained is 1 − (1 − πτ )s . Let B be the random variable denoting the number of probands in a family. Then, the fact that a family is ascertained is equivalent to B > 0 with P (B > 0) = 1 − (1 − πτ )s . The likelihood function for an ascertained family with r affected offspring with sibship size s is36 L(π, τ ) = P {X = r|B > 0; s, π, τ } =
[1 − (1 − π)^r] C(s, r) τ^r (1 − τ)^(s−r) / [1 − (1 − πτ)^s] .    (7)
This likelihood function can be used to estimate and test ascertainment probability, π, and the Mendelian segregation ratio, τ . Since the analysis takes the ascertainment procedure into account, it is often referred to as an ascertainment bias corrected analysis. There are two special cases to the likelihood function (7). When the ascertainment probability π = 1, that is, complete ascertainment, the likelihood function (7) reduces to (5). If the ascertainment probability is very small, i.e. single ascertainment (π ≃ 0), then (1 − π)r ≃ 1 − rπ ,
(1 − πτ )s ≃ 1 − sπτ ,
and the likelihood function (7) is approximately

L(π, τ) = P{X = r|B > 0; s, π, τ} = C(s − 1, r − 1) τ^(r−1) (1 − τ)^(s−r) .

In the case when the exact number of probands for each ascertained family is known, the likelihood function is given by

L(π, τ) = P{X = r, B = b|B > 0; s, π, τ} = C(s, r) τ^r (1 − τ)^(s−r) C(r, b) π^b (1 − π)^(r−b) / [1 − (1 − πτ)^s] .

In addition to the likelihood-based inference for the ascertainment probability and the Mendelian segregation ratio of a recessive locus under incomplete ascertainment, there are two other simple methods: the proband method and the singles method. The proband method was initiated by Weinberg and described in detail by Fisher.15 Suppose that n segregating families are ascertained and si, ri, bi are the sibship size, the number of affected offspring, and the number of probands of the ith family, respectively. The segregation ratio and the ascertainment probability are estimated by the proband method as

τ̂ = Σi bi(ri − 1) / Σi bi(si − 1)    and    π̂ = Σi bi(bi − 1) / Σi bi(ri − 1) .

If a proband is the only proband within an ascertained family, such a proband is referred to as a single. Let d be the number of singles in a sample of ascertained families. The segregation ratio and the ascertainment probability are estimated by the singles method as

τ̂ = (Σi ri − d) / (Σi si − d)    and    π̂ = (Σi bi − d) / (Σi ri − d) .
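The proband and singles estimators are simple enough to compute by hand; the short Python sketch below evaluates both for a hypothetical set of ascertained families, where each tuple gives (sibship size s_i, number affected r_i, number of probands b_i).

# Proband-method and singles-method estimates from hypothetical families.
families = [(4, 2, 1), (3, 1, 1), (5, 2, 2), (4, 3, 1), (6, 2, 1)]
tau_proband = sum(b * (r - 1) for s, r, b in families) / sum(b * (s - 1) for s, r, b in families)
pi_proband = sum(b * (b - 1) for s, r, b in families) / sum(b * (r - 1) for s, r, b in families)
d = sum(1 for s, r, b in families if b == 1)      # number of singles
tau_singles = (sum(r for s, r, b in families) - d) / (sum(s for s, r, b in families) - d)
pi_singles = (sum(b for s, r, b in families) - d) / (sum(r for s, r, b in families) - d)
print(round(tau_proband, 3), round(pi_proband, 3), round(tau_singles, 3), round(pi_singles, 3))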
The segregation analysis methods described above are for qualitative traits (i.e. affected or unaffected). Methods for segregation analysis of quantitative traits for single-locus and polygenic control were developed by Morton and MacLean.37 This “mixed model” method is based on a likelihood approach. Several programs implemented the “mixed model,” such as Pedigree Analysis Package (PAP),22 SEGPATH,44 and POINTER.26 Bonney5 developed a family of regressive models for segregation analysis of quantitative traits that allowed simultaneous adjustment for covariates and estimation of the parameters of Mendelian models. These models were implemented in the Statistical Analysis for Genetic Epidemiology (S.A.G.E.)
computer program.11 Terwilliger and Ott63 provided a list of computer programs for genetic analysis.

3. Linkage Analysis

Linkage analysis investigates whether or not two loci are physically located near one another on the same chromosome. Alleles from two linked loci (physically close) tend to segregate together, that is, they are passed from parent to child as a single unit. This phenomenon deviates from Mendel's second law of independent assortment. Linkage analysis is one of the methods used to localize disease traits in the human genome. Evidence of linkage between a known marker system and a putative gene for a disease is considered to be the highest level of statistical evidence that the disease is due to a genetic mechanism. Linkage analysis localizes a gene solely on the basis of its location, without regard to its biochemical function. This approach is called "positional cloning." Alleles cosegregating due to linkage between two loci in one family may be different from the alleles in another family. The cosegregation phenomenon due to linkage is only observable within families. Therefore, family data or data from biologically related subjects are necessary for detecting linkage. However, cosegregation caused by allelic association (linkage disequilibrium) can be detected by general population studies. Allelic association is a property of alleles, while linkage is a property of loci. They are two different but related concepts. The measure of linkage between two loci is the recombination fraction, θ. The closer the two loci, the less likely that a cross-over will occur between them and the smaller the recombination fraction θ. Two extreme cases are: (1) θ = 1/2, the two loci are far apart and segregate independently (Mendel's second law of independent assortment); (2) θ = 0, the two loci are identical and actually are one locus. The range of the recombination fraction is 0 ≤ θ ≤ 1/2. Through map functions,17,25,34,47 the recombination fraction θ between two loci can be translated into the genetic distance between the two loci. Genetic distance is correlated with physical distance, but they are different. Linkage analysis estimates and tests the recombination fraction θ. There are two types of statistical methods for linkage analysis: model-based and model-free.

3.1. The LOD score method

The LOD (log-odds) score method is based on the maximum likelihood ratio test. Haldane and Smith18 used the maximum likelihood approach to
linkage analysis. The LOD score method was widely used after Morton35 published tables of log-odds (or LOD) scores that could be used in the analysis of family data. The LOD score method is considered to be a model-based procedure. Usually, the mode of inheritance, the number of alleles, and the penetrances of each genotype are assumed to be known for the LOD score method. The only unknown parameter is the recombination fraction θ. The conditional probability of observing the corresponding phenotype, say affection status, given the specified genotype, is referred to as the penetrance. Ott41 described the LOD score method in detail. Assume that the likelihood function for a given family is L(θ), and θ̂ is the maximum likelihood estimate of θ. The LOD score for testing H0: θ = 1/2 vs. θ < 1/2 is

Z(θ̂) = log10 [L(θ̂)/L(θ = 1/2)] .

Traditionally,35 one rejects H0 and claims linkage if Z(θ̂) > 3; the p-value corresponding to Z(θ̂) > 3 is less than 10⁻⁴.

3.1.1. Example of phase known data

Figure 2 depicts a three-generation hypothetical family with a binary phenotype (affected or unaffected) and marker genotype data for each individual. The dark symbols indicate that a subject is affected with an autosomal rare dominant disorder. We further assume that the disease locus has two alleles D and d with full penetrance. Hence, the penetrances are

P{Affected|DD} = P{Affected|Dd} = 1 ,    P{Affected|dd} = 0 .
From the phenotype and marker locus genotype data in Fig. 2, the joint genotype (i.e. two locus genotype) of marker and trait loci and phase information can be inferred and are given in Fig. 3. Both grandmother and mother are homozygous with genotype dd at the disease locus because they are normal. Since the grandfather is affected, his genotype at the disease locus is either DD or Dd. It is reasonable to assume that his genotype is Dd since the disease is rare. This assumption has no impact on the linkage analysis because the grandparental genotypes are only used to determine the phase of the father. There are two possible phases for the grandfather and either one of them will give rise to the same phase for the father. The father is affected, so he must have at least one D allele. He must also receive one d allele from his mother. Therefore, the genotype of the father is Dd at the disease locus. The fact that the father received a haplotype
Fig. 2. A three-generation pedigree showing affection status (dark symbols) and marker genotypes: grandfather Mm, grandmother mm, father Mm, mother mm; the five children have marker genotypes Mm, Mm, Mm, mm, and mm.

Fig. 3. The same pedigree with the inferred, phase-known joint genotypes at the disease and marker loci (the father is dm/DM; every child receives the haplotype dm from the mother).
dm from his mother determines his phase. Hence, his genotype is dm/DM, as shown in Fig. 3. Since the mother is doubly homozygous, each of the five children must receive a haplotype dm from her. The two-locus genotypes and phases of the five children are inferred from the parental information and shown in Fig. 3. The mother is not informative for recombination. The father produced one recombinant gamete and four non-recombinant gametes. The likelihood function is

L(θ) = θ^r (1 − θ)^(N−r) = θ(1 − θ)⁴ ,

where r is the number of recombinants and N is the number of gametes. The maximum likelihood estimate of θ is θ̂ = r/N = 0.2, and the LOD score
is

Z(θ̂) = log10 [L(θ̂)/L(θ = 0.5)] = log10 [(0.2 × 0.8⁴)/0.5⁵] = 0.4185 .

The evidence is not strong enough to support linkage.

3.1.2. Example of phase unknown data

Suppose that the grandparents are missing in Figs. 2 and 3. Then the phase of the father cannot be determined with certainty. With the given genotype, the father could have two possible phases, dm/DM (phase I) (Fig. 4) and dM/Dm (phase II) (Fig. 5). Under the phase dm/DM there are one recombinant and four non-recombinants; under the phase dM/Dm there are one non-recombinant and four recombinants. Each of these two phases is equally likely (probability 1/2) to be correct. The likelihood function becomes:

L(θ) = P{Data} = P{Data|Phase I}P{Phase I} + P{Data|Phase II}P{Phase II}
     = (1/2) θ(1 − θ)⁴ + (1/2) θ⁴(1 − θ) .

The MLE for θ no longer has a closed form solution in this case. Many numerical procedures can be employed to obtain the MLE θ̂. Using the LINKAGE program,28 we find θ̂ = 0.21 (approximately), with LOD score Z(θ̂) = 0.1249. Terwilliger and Ott63 provided detailed descriptions and guidance about the LINKAGE program.
Fig. 4. The family of Fig. 3 without the grandparents, with the father's genotype in phase I (dm/DM).

Fig. 5. The family of Fig. 3 without the grandparents, with the father's genotype in phase II (dM/Dm).

If the data consist of more than one family, then the joint likelihood function is the product of the likelihood functions of the individual families, under the assumption that the families are independent of each other. The LOD score linkage analysis can then be carried out similarly.
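Because the phase-unknown likelihood above is a simple mixture, its maximum is easy to locate numerically. The following Python sketch uses a crude grid search (the grid resolution is an arbitrary choice) and reproduces approximately the values θ̂ ≈ 0.21 and Z(θ̂) ≈ 0.12 quoted in the text.

import math
# Phase-unknown likelihood L(theta) = 0.5*theta*(1-theta)^4 + 0.5*theta^4*(1-theta)
def lik(theta):
    return 0.5 * theta * (1 - theta) ** 4 + 0.5 * theta ** 4 * (1 - theta)
grid = [i / 1000 for i in range(1, 501)]      # theta values in (0, 0.5]
theta_hat = max(grid, key=lik)
lod = math.log10(lik(theta_hat) / lik(0.5))
print(theta_hat, round(lod, 4))               # roughly 0.21 and 0.12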
3.2. The affected sib pair (ASP) method

Sib pair linkage studies are commonly used for the investigation of genetic components involved in complex traits because a sib pair is the simplest family unit and is easy to ascertain. Penrose42,43 initiated the method, based on the idea that sib pairs with similar phenotypes should have an excess of allele sharing while sib pairs with dissimilar phenotypes should have a deficit of allele sharing. The method was further developed to give rise to the affected sib pair (ASP) method. The ASP method ascertains sib pairs in which both sibs are affected. Hence, they should have an excess of allele sharing. One measure of allele sharing is the number of alleles shared identical-by-descent (IBD). Two alleles that were transmitted from a common ancestor are referred to as being IBD. For example, suppose that the parental mating type is Aa × aa and the genotypes of both sibs are Aa; then allele "A" of the sib pair is IBD, but allele "a" of the sib pair may or may not be IBD. Let I be a random variable denoting the number of alleles shared IBD by a sib pair. It can be shown that the distribution of I for a sib pair is:

P{I = 0} = 1/4 ,    P{I = 1} = 1/2 ,    P{I = 2} = 1/4 .

Hence,

E(I) = 1 ,    var(I) = 1/2 .

The conditional distribution of I, given the disease status of the members of a pair, provides the theoretical basis for the ASP method. It
was derived by Suarez et al.61 under the assumption of one locus with two alleles. Risch48 generalized these results to other relatives and to multilocus models based on the recurrence risks of the disease. We introduce some notation and concepts before we describe the ASP methods. Assume that there are two alleles, T and t, at the trait locus, with probabilities P{T} = p and P{t} = q. Let Y be a binary random variable indicating whether or not an individual is affected by a given disorder, i.e. Y = 1 if affected and Y = 0 if unaffected. The three penetrances are denoted as

f1 = P{Y = 1|TT} = P{Affected|TT} ,    f2 = P{Y = 1|Tt} ,    f3 = P{Y = 1|tt} .

The prevalence rate of the disorder in the population is derived under Hardy–Weinberg equilibrium as

KP = P{Y = 1} = P{Y = 1|TT}P{TT} + P{Y = 1|Tt}P{Tt} + P{Y = 1|tt}P{tt} = p²f1 + 2pqf2 + q²f3 .

The genetic variance for the binary trait is VG = var(E(Y|G)) = VA + VD, where G is a random variable representing the three genotypes, VA = 2pq[p(f2 − f1) + q(f3 − f2)]² is the additive variance, and VD = p²q²(f1 − 2f2 + f3)² is the dominance variance. The additive model corresponds to f2 = (f1 + f3)/2; the dominance model corresponds to f2 = f1; and the recessive model corresponds to f2 = f3. Let IM be the number of alleles shared IBD at the marker locus, X be the random variable denoting the number of affected sibs in a sib pair, and θ be the recombination fraction between the trait and marker loci. Suarez et al.61 derived the distribution P{IM = j|X = k}, j = 0, 1, 2, k = 0, 1, 2, which is given in Table 2. Given a set of parameters (KP, VA, VD, θ), the deviation of the conditional distribution of IM given X = 2, under the alternative hypothesis of linkage, from the expected distribution (1/4, 1/2, 1/4) under the null hypothesis of no linkage, is the largest among the three cases (X = 2, X = 1, X = 0).
Table 2. Distribution P{IM = j|X = k} from Suarez et al.61

k = 2:  P{IM = 2|X = 2} = 1/4 + [(Ψ − 1/2)VA + (Ψ² − 1/4)VD]/d2
        P{IM = 1|X = 2} = 1/2 − 2(Ψ² − Ψ + 1/4)VD/d2
        P{IM = 0|X = 2} = 1/4 − [(Ψ − 1/2)VA + (2Ψ − Ψ² − 3/4)VD]/d2

k = 1:  P{IM = 2|X = 1} = 1/4 − [(2Ψ − 1)VA + (2Ψ² − 1/2)VD]/d1
        P{IM = 1|X = 1} = 1/2 + 2(2Ψ² − 2Ψ + 1/2)VD/d1
        P{IM = 0|X = 1} = 1/4 + [(2Ψ − 1)VA + (4Ψ − 2Ψ² − 3/2)VD]/d1

k = 0:  P{IM = 2|X = 0} = 1/4 + [(Ψ − 1/2)VA + (Ψ² − 1/4)VD]/d0
        P{IM = 1|X = 0} = 1/2 − 2(Ψ² − Ψ + 1/4)VD/d0
        P{IM = 0|X = 0} = 1/4 − [(Ψ − 1/2)VA + (2Ψ − Ψ² − 3/4)VD]/d0

where d2 = 4(KP² + VA/2 + VD/4), d1 = 4(2KP − VA − VD/2 − 2KP²), d0 = 4(1 − 2KP + KP² + VA/2 + VD/4), and Ψ = θ² + (1 − θ)².
Hence, the ASP design provides more information for detecting linkage. There are many test statistics for detecting linkage based on the ASP design. The most popular one is the "mean test." It is defined as

T = [(n2 + n1/2) − n/2] / √(n/8) ,

where n is the total number of affected sib pairs, n2 is the number of affected sib pairs with IM = 2, and n1 is the number of affected sib pairs with IM = 1. Under the null hypothesis of no linkage, T has an asymptotically standard normal distribution. Power and sample size calculations can be carried out for given values of (KP, VA, VD, θ) by using the formulas in the above table. The power of the mean test has been investigated by Knapp et al.,24 Suarez and van Eerdewegh,62 and Blackwelder and Elston.3 It performs adequately under various conditions.

Risch48 formulated the ASP method based on recurrence risk. The recurrence risk of a sib pair is defined as the conditional probability that one sib is affected given that the other sib is affected, i.e. KR = P{Y2 = 1|Y1 = 1}, where Y1 and Y2 are random variables indicating the affection status of the two sibs. The recurrence risk ratio of a sib pair is the ratio of the sibling recurrence risk relative to the population prevalence rate, i.e. λS = KR/KP. Similarly, the recurrence risk ratio between parent and offspring can be defined by replacing the sibling recurrence risk with the parent-offspring recurrence risk, and is denoted by λO. The conditional distribution of IM given that both sibs are affected is trinomial with probabilities:

z0 = 1/4 − (2Ψ − 1)[(λS − 1) + 2(1 − Ψ)(λS − λO)] / (4λS) ,
z1 = 1/2 − (2Ψ − 1)²(λS − λO) / (2λS) ,
z2 = 1/4 + (2Ψ − 1)[(λS − 1) + 2Ψ(λS − λO)] / (4λS) ,

where zi = P{IM = i|Two sibs affected} and Ψ = θ² + (1 − θ)². This parameterization is different from that of Suarez et al.61 The mean test statistic for H0: θ = 1/2 vs. H1: θ < 1/2 is the same as before. The only difference is that the power and sample size calculations are in terms of the parameters (λS, λO, θ). A likelihood ratio test can be performed based on the trinomial distribution with parameters (λS, λO, θ). The affected sib pair method has been generalized to affected relative pairs48 and to affected relative-sets or affected-pedigree-member (APM) methods.66
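The mean test described above is straightforward to apply once the IBD counts are available. Here is a small Python sketch with hypothetical counts (n0, n1, n2) of affected sib pairs sharing 0, 1, and 2 alleles IBD, including the one-sided normal p-value.

import math
# ASP mean test with hypothetical IBD sharing counts.
n0, n1, n2 = 18, 52, 30
n = n0 + n1 + n2
T = ((n2 + 0.5 * n1) - 0.5 * n) / math.sqrt(n / 8.0)
p_value = 0.5 * math.erfc(T / math.sqrt(2.0))   # one-sided normal p-value
print(round(T, 3), round(p_value, 4))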
3.3. The Haseman–Elston procedure

The LOD score and ASP methods described above handle the linkage analysis of qualitative traits. However, many complex traits are quantitative, i.e. measured on a continuous rather than a discrete scale. Having a complex disease is often determined by applying a threshold to a quantitative measurement, such as hypertension defined by blood pressure or obesity defined by body mass index. It is therefore important to develop statistical methods to study linkage between a marker locus and a locus underlying a quantitative trait.

Let x1j and x2j denote the continuous trait values of the two sibs in the jth sib pair. We assume the trait values have the following structure:

x1j = µ + g1j + e1j ,    x2j = µ + g2j + e2j ,

where µ is the overall mean, g1j and g2j represent genetic contributions to the trait values, and e1j and e2j are residuals. We consider a single locus with two alleles A1 and A2. The allele frequencies of A1 and A2 are p and q = 1 − p, respectively. The mean trait values of individuals with the three possible genotypes are defined as follows:

Genotype:    A2A2    A2A1    A1A1
Mean:        −a      d       a

Then the additive genetic variance is σa² = 2pq[a + (q − p)d]², and the dominance variance is σd² = (2pqd)². The total genetic variance is the sum of the additive and dominance genetic variances, that is, σg² = σa² + σd². Let σe² = E(e1j − e2j)² be the residual variance for each genotype and ρ be the residual correlation coefficient between the two sibs. Throughout this section, we assume that there is no dominance, i.e. σd² = 0. Haseman and Elston21 showed that the expectation of the squared difference in the observed trait values of a sib pair, conditional on the proportion of alleles shared IBD at the trait locus, satisfies the regression equation

E(yj|πj) = σe² + 2σg² − 2σg²πj = β0 + β1πj ,    (8)

where yj = (x1j − x2j)², β0 = σe² + 2σg², and β1 = −2σg². With the candidate gene approach, we collect sib pair data as (y1, π1), . . . , (yn, πn) and perform regression analysis to obtain estimates σ̂e²
and σ̂g². Then the estimate of heritability is obtained from

H = σg² / (σe² + σg²)

by substituting σ̂e² and σ̂g². Such an approach has been extended to multiple trait loci with multiple alleles.60 When the proportion of alleles shared IBD at the marker locus, πjm, rather than at the trait locus, is available, the regression equation becomes21

E(yj|πjm) = σe² + 2σg²Ψ − 2(1 − 2θ)²σg²πjm = γ0 + γ1πjm ,    (9)

where yj = (x1j − x2j)², γ0 = σe² + 2σg²Ψ, Ψ = θ² + (1 − θ)², γ1 = −2(1 − 2θ)²σg², and θ is the recombination fraction between the trait and marker loci. Recall that θ ∈ [0, 1/2] and that πjm, the proportion of marker alleles shared IBD by the jth sib pair, assumes the values 0, 1/2, 1. The proportion of marker alleles shared IBD is often estimated. The regression equation in terms of the estimated proportion π̂jm becomes

E(yj|π̂jm) = γ0 + γ1π̂jm ,    (10)

where γ0 and γ1 are the same as those in (9), and π̂jm takes the values i/4, i = 0, 1, 2, 3, 4. Notice that γ1 = 0 implies θ = 1/2 when σg² > 0. The least squares estimate γ̂1 is obtained from sib pair genetic data (y1, π̂1m), . . . , (yn, π̂nm) by performing regression analysis based on formula (10). The least squares estimate γ̂1 can be used to test H0: θ = 1/2 (no linkage) vs. H1: θ < 1/2 (linkage). A significantly negative γ̂1 indicates θ < 1/2. Hence, reject H0 if γ̂1 is less than an appropriate constant C < 0 for a level α test.
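A least squares fit of Eq. (10) needs nothing beyond the sib-pair squared trait differences and the estimated IBD proportions. The Python sketch below fits the regression to a small hypothetical data set; a clearly negative slope estimate γ̂1 is the signal of interest.

# Haseman-Elston regression of squared sib-pair differences on IBD proportion.
# Each tuple is (y_j, pihat_jm); the data are hypothetical.
pairs = [(3.1, 0.00), (2.4, 0.25), (1.9, 0.50), (1.2, 0.75), (0.8, 1.00),
         (2.8, 0.25), (1.5, 0.50), (0.9, 1.00)]
y = [p[0] for p in pairs]
x = [p[1] for p in pairs]
n = len(pairs)
xbar, ybar = sum(x) / n, sum(y) / n
gamma1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum((xi - xbar) ** 2 for xi in x)
gamma0 = ybar - gamma1 * xbar
print(round(gamma0, 3), round(gamma1, 3))     # negative gamma1 suggests linkage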
in humans. Eaves and Meyer10 also obtained the power of ED sibpairs by simulation. Suppose that N ED sib pairs are selected for genotyping in a linkage study. Let n0 , n1 , n2 be the number of sib pairs with IBD = 0, 1, 2, respectively. If the marker locus is linked to the trait locus, more ED sib pairs should have IBD = 0 due to the selection process. Hence, if n0 is significantly larger than n2 , then linkage is indicated. Under H0 (no linkage) (n0 , n1 , n2 ) has a trinomial distribution with parameters ( 14 , 21 , 14 ). A test statistic can be based on n0 − n2 . We have EH0 (n0 − n2 ) = 0 ,
varH0 (n0 − n2 ) =
N , 2
where N = n0 + n1 + n2 . Define a test statistic TED =
n0 − n 2 q . N 2
This statistic has an asymptotically standard normal distribution. Reject H0 and declare linkage when TED is large. Sample size and power formulas are given in Risch and Zhang.49 For the EC sib pair design, the test statistic is n2 − n 0 . TEC = q N 2
Several existing procedures that combine ED and EC sib pairs into one test give each ED (EC) pair the same weight.16,30 Rao46 noticed that these methods can be improved by exploiting the quantitative variability in the tail distribution of the trait. Li and Gastwirth31 developed a test giving greater weight to the more discordant (concordant) ED (EC) pairs. 3.5. The transmission/disequilibrium test (TDT) Assume that there are two alleles, D1 and D2 , at a disease locus, and two marker alleles, M1 and M2 , at a marker locus. Suppose that n affected children are ascertained. From these families, there will be 4n parental marker alleles, 2n of which are transmitted and 2n of which are not transmitted. If the marker locus is in the neighborhood of the disease locus and the disease allele is due to a recent mutation, then a specific marker allele associated with the disease allele will have higher frequency among diseased individuals compared to normal individuals. The imbalanced transmission of one allele relative to the other suggests that linkage exists between the marker and disease loci.
Table 3. Numbers a, b, c, and d of the transmitted and non-transmitted marker alleles M1 and M2 among the 2n parents of n affected children.59

Transmitted      Non-transmitted allele
allele           M1        M2        Total
M1               a         b         a + b
M2               c         d         c + d
Total            a + c     b + d     2n
Spielman et al.59 summarized the numbers of alleles transmitted and not transmitted to the n affected children by their 2n parents, as presented in Table 3. Notice that entry b in the 2 × 2 table represents the number of parents who are M1M2 at the marker locus with transmitted allele M1 and non-transmitted allele M2. Since each parent of an affected offspring contributes exactly one transmitted and one non-transmitted allele to the table, the transmission/disequilibrium test (TDT) proposed by Spielman et al.57,59 is the McNemar test resulting from a matched case-control design. The one degree of freedom χ² test statistic is

χ²TD = (b − c)²/(b + c) .

The McNemar test is based on the standard normal approximation to a binomial distribution. The theoretical background for the TDT is given in Table 4, from Curnow et al.7 In Table 4, m = P(M1) and p = P(D1) are the allele frequencies, θ is the recombination fraction between the marker and trait loci, δ = P(M1D1) − P(M1)P(D1) is the linkage disequilibrium (association) parameter, and

B = p[p(f11 − f12) + (1 − p)(f12 − f22)] / [p²f11 + 2p(1 − p)f12 + (1 − p)²f22] ,

where

f11 = P(Affected|D1D1) ,    f12 = P(Affected|D1D2) ,    f22 = P(Affected|D2D2)

are the three penetrances, i.e. the probabilities that individuals with disease genotypes D1D1, D1D2, and D2D2 have the disease.
Table 4. Probabilities of combinations of transmitted and non-transmitted marker alleles M1 and M2 among the 2n parents of n affected children.7

Transmitted      Non-transmitted allele
allele           M1                          M2
M1               m² + Bmδ/p                  m(1 − m) + B(1 − θ − m)δ/p
M2               m(1 − m) + B(θ − m)δ/p      (1 − m)² − B(1 − m)δ/p
The entry in the upper right of Table 4, m(1 − m) + B(1 − θ − m)δ/p, is the conditional probability that a parent with marker genotype M1M2 transmits allele M1 given that the child is affected, that is,

p12 = P{Parent = M1M2 → M1|Child Affected} = m(1 − m) + B(1 − θ − m)δ/p .

Similarly, for the lower left entry of Table 4,

p21 = P{Parent = M1M2 → M2|Child Affected} = m(1 − m) + B(θ − m)δ/p .
If δ ≠ 0 (association), then H0: θ = 1/2 is equivalent to H0: p12 = p21. Hence, the TDT is a joint test for linkage and association. The diagonal terms in Table 4 are independent of the recombination fraction θ. That is, as expected, homozygous parents carry no information about linkage, which is the reason why the TDT statistic χ²TD only involves b and c and not a or d in Table 3. The TDT is related to the concept of haplotype relative risk (HRR).14,40 The TDT has the advantage that it only requires parent and child data from families with one affected child; it does not require multiple sibs as the ASP method does. The disadvantage of the TDT is that it can detect linkage only if association is present. The TDT has been generalized to multiallelic markers,2,51,55,57 to families without parental genotype information,4,8,33,52,58 and to quantitative traits.1,45,53
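Computationally the TDT reduces to the McNemar statistic above. A minimal Python sketch with hypothetical counts of heterozygous parents transmitting M1 (b) and M2 (c) is shown here.

# TDT (McNemar) statistic from the off-diagonal counts of Table 3.
b, c = 62, 38                      # hypothetical counts of M1M2 parents
chi2_td = (b - c) ** 2 / (b + c)
print(round(chi2_td, 3))           # compare with chi-square_{0.95}(1) = 3.841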
4. Discussion

The statistical methods discussed in this chapter are only a small selection from the problems arising from the molecular data that are becoming available in the search for disease genes. The properties of currently used statistical methods still need to be investigated, and new statistical methods for genome-wide inference need to be developed. For an excellent review of the major contributions of statistics to genetics over the last century, and of current and future research problems, readers are referred to Elston and Thompson.12

Acknowledgments

This work was partly supported by NIH grants CA 64363 and EY 14478 and a China Natural Science grant. We would like to thank Professor Chin Long Chiang and Mr. Dennis Buckman for their excellent comments.

References

1. Allison, D. B. (1997). Transmission-disequilibrium tests for quantitative traits. American Journal of Human Genetics 60: 676–690.
2. Bickeböller, H. and Clerget-Darpoux, F. (1995). Statistical properties of the allelic and genotypic transmission/disequilibrium test for multi-allelic markers. Genetic Epidemiology 12: 865–870.
3. Blackwelder, W. C. and Elston, R. C. (1982). Power and robustness of sibpair linkage test and extension to larger sibships. Communication Statistical Theory Methods 11: 449–484.
4. Boehnke, M. and Langefeld, C. D. (1998). Genetic association mapping based on discordant sib pairs: The discordant-alleles test. American Journal of Human Genetics 62: 950–961.
5. Bonney, G. E. (1984). On the statistical determination of major gene mechanisms in continuous human traits: Regressive models. American Journal of Medical Genetics 18: 731–749.
6. Carey, G. and Williamson, J. (1991). Linkage analysis of quantitative traits: Increased power by using selected samples. American Journal of Human Genetics 49: 786–796.
7. Curnow, R. N., Morris, A. P. and Whittaker, J. C. (1998). Locating genes involved in human disease. Applied Statistics 47: 63–76.
8. Curtis, D. (1997). Use of siblings as controls in case-control association studies. Annals of Human Genetics 61: 319–333.
9. Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society B39: 1–38.
10. Eaves, L. and Meyer, J. (1994). Locating human quantitative trait loci: Guidelines for the selection of sibling pairs for genotyping. Behaviour Genetics 24: 443–455.
11. Elston, R. C., Bailey-Watson, J. E., Bonney, G. E. et al. (1986). A package of computer programs to perform statistical analysis for genetic epidemiology. Presented at the 7th International Congress of Human Genetics, Berlin. 12. Elston, R. C. and Thompson, E. A. (2000). A century of biometrical genetics. Biometrics 56: 659–666. 13. Falconer, D. S. (1989). Introduction to Quantitative Genetics, Longman Scientific and Technical with John Wiley and Sons, Inc. New York. 14. Falk, C. T. and Rubinstein, P. (1987) Haplotype relative risks: An easy reliable way to construct a proper control sample for risk calculations. Annals of Human Genetics 51: 227–233. 15. Fisher, R. A. (1952). The effect of methods of ascertainment upon the estimation of frequencies. Annals of Eugenics 6: 13–25. 16. Gu, C., Todorov, A. and Rao, D. C. (1996). Combining extremely concordant sibpairs with extremely discordant sibpairs provides a cost effective way to linkage analysis of QTLs. Genetic Epidemiology 13: 513–533. 17. Haldane, J. B. S. (1919). The combination of linkage values and the calculation of distance between the loci of linked factors. Journal of Genetics 8: 299–309. 18. Haldane, J. B. S. and Smith, C. A. B. (1947). A new estimate of the linkage between the genes for colour-blindness and haemophilia in man. Annals of Eugenics 14: 10–31. 19. Hardy, G. H. (1908). Mendelian proportions in a mixed population. Science 28: 49–50. 20. Hartl, D. L. and Clark, A. G. (1997). Principles of Population Genetics, Sinauer Associates, Inc. Sunderland, Massachusetts. 21. Haseman, J. K. and Elston, R. C. (1972). The investigation of linkage between a quantitative trait and a marker locus. Behaviour Genetics 2: 3–19. 22. Hasstedt, S. J. and Carwright, P. E. (1981). PAP-pedigree analysis package, University of Utah, Department of Medical Biophysics and Computing, Technical Report No. 13. Salt Lake City, Utah. 23. Khoury, M. J., Beaty, T. H. and Cohen, B. H. (1993). Fundamentals of Genetic Epidemiology, Oxford University Press, New York and Oxford. 24. Knapp, M., Wassmer, G. and Baur, M. P. (1995). Linkage analysis in nuclear families, I. Optimality criteria for affected sib-pair tests. Human Heredity 44: 37–43. 25. Kosambi, D. D. (1994). The estimation of map distances from recombination values. Annals Eugenics 12: 172–175. 26. Lalouel, J. M. and Yee, S. (1980). POINTER: A computer program for complex segregation analysis with pointers. Technical Report, Population Genetics Laboratory, University of Hawaii, Honolulu. 27. Lange, K. (1997). Mathematical and Statistical Methods for Genetic Analysis, Springer-Verlag, New York. 28. Lathrop, G. M., Lalouel, J. M., Juier, C. et al. (1984). Strategies for multilocus linkage analysis in humans. Proceedings of National Academic Sciences USA 81: 3443–3446.
29. Li, C. C. (1988). First Course in Population Genetics, The Boxwood Press, Pacific Grove, California. 30. Li, Z. and Zhang, H. (2000). Mapping quantitative trait loci in humans using both extreme discordant and concordant sib pairs: A unified approach for meta-analysis. Communication Statistical Theory Methods 29: 1115–1127. 31. Li, Z. and Gastwirth, J. L. (2001). A weighted test using both extreme discordant and concordant sibpairs for detecting linkage. Genetic Epidemiology 20: 34–43. 32. Little, R. J. A. and Rubin, D. B. (1987). Statistical Analysis with Missing Data, Wiley, New York. 33. Monks, S. A., Kaplan, N. L. and Weir, B. S. (1998). A comparative study of sibship tests of linkage and/or association. American Journal of Human Genetics 63: 1507–1516. 34. Morgan, T. H. (1928). The Theory of Genesics, Yale University Press, New Haven, Conn. 35. Morton, N. E. (1955). Sequential tests for the detection of linkage. American Journal of Human Genetics 7: 277–318. 36. Morton, N. E. (1959). Genetic tests under incomplete ascertainment. American Journal of Human Genetics 11: 1–16. 37. Morton, N. E. and MacLean, C. J. (1974). Analysis of family resemblance. III. Complex segregation analysis of quantitative traits. American Journal of Human Genetics 26: 489–503. 38. Olson, J. M., Witte, J. S. and Elston, R. C. (1999). Tutorial in biostatistics: Genetic mapping of complex traits. Statistics in Medicine 18: 2961–2981. 39. Ott, J. (1977). Counting methods (EM algorithm) in human pedigree analysis: Linkage and segregation analysis. Annals of Human Genetics 40: 443–454. 40. Ott, J. (1989). Statistical properties of the haplotype relative risk. Genetic Epidemiology 6: 127–130. 41. Ott, J. (1999). Analysis of Human Genetic Linkage, The Johns Hopkins University Press, Baltimore and London. 42. Penrose, L. S. (1935). The detection of autosomal linkage in data which consist of pairs brothers and sisters of unspecified parentage. Annals of Eugenics 6: 133–138. 43. Penrose, L. S. (1953). The general purpose sib-pair linkage test. Annals of Eugenics 18: 120–124. 44. Province, M. A. and Rao, D. C. (1995). General purpose model and a computer program for combined segregation and path analysis (SEGPATH): Automatically creating computer programs from symbolic language model specifications. Genetic Epidemiology 12: 203–219. 45. Rabinowitz, D. (1997). A transmission disequilibrium test for quantitative trait loci. Human Heredity 47: 342–350. 46. Rao, D. C. (1998). CAT scans, PET scans, and genomic scans. Genetic Epidemiology 15: 1–18.
614
Z. Li & M. Xie
47. Rao, D. C., Keats, B. J. B., Lalouel, J. M., Morton, N. E. and Yee, S. (1979). A maximum likelihood map of chromosome 1. American Journal of Human Genetics 31: 680–696. 48. Risch, N. (1990). Linkage strategies for genetically complex traits II. The power of affected relative pairs. American Journal of Human Genetics 46: 229–241. 49. Risch, N. and Zhang, H. (1995). Extreme discordant sib pairs for mapping quantitative trait loci in humans. Science 268: 1584–1589. 50. Risch, N. and Zhang, H. (1996). Mapping quantitative trait loci with extreme discordant sib pairs: Sample size considerations. American Journal of Human Genetics 58: 836–843. 51. Schaid, D. J. (1996). General score tests for associations of genetic markers with disease using cases and parents. Genetic Epidemiology 13: 423–449. 52. Schaid, D. J. and Rowland, C. (1998). The use of parents, sibs, and unrelated controls to detection of association between genetic markers and diseases. American Journal of Human Genetics 63: 1492–1506. 53. Schaid, D. J. and Rowland, C. M. (1999). Quantitative trait transmission disequilibrium test: Allowance for missing parents. Genetic Epidemiology 17: S307–S312. 54. Sham, P. (1998). Statistics in Human Genetics, Arnold, London and New York. 55. Sham, P. C. and Curtis, D. (1995). An extended transmission/disequilibrium test (TDT) for multi-allele marker loci. Annals of Human Genetics 59: 323–336. 56. Smith, C. A. B. (1957). Counting methods in genetical statistics. Annals of Human Genetics 21: 254–276. 57. Spielman, R. S. and Ewens, W. J. (1996). The TDT and other family-based tests for linkage disequilibrium and association. American Journal of Human Genetics 59: 983–989. 58. Spielman, R. S. and Ewens, W. J. (1998). A sibship test for linkage in the presence of association: The sib transmission/disequilibrium test. American Journal of Human Genetics 62: 450–458. 59. Spielman, R. S., McGinnis, R. E. and Ewens, W. J. (1993). Transmission test for linkage disequilibrium: The insulin gene region and insulin-dependent diabetes mellitus (IDDM). American Journal of Human Genetics 52: 506–516. 60. Stoesz, M. R., Cohen, J. C., Mooser, V., Marcovina, S. and Guerra, R. (1997). Extension of the Haseman–Elston method to multiple alleles and multiple loci: Theory and practice for candidate genes. Annals of Human Genetics 61: 263–274. 61. Suarez, B. K., Rice, J. and Reich, T. (1978). The generalized sib pair IBD distribution: Its use in the detection of linkage. Annals of Human Genetics 42: 87–94. 62. Suarez, B. K. and van Eerdewegh, P. (1984). A comparison of three affectedsib-pair scoring methods to detect HLA-linked disease susceptibility genes. American Journal of Medical Genetics 18: 135–146.
Statistics in Genetics
615
63. Terwilliger, J. D. and Ott, J. (1994). Handbook of Human Genetic Linkage, The John Hopkins University Press, Baltimore and London. 64. Thompson, E. A. (1986). Genetic epidemiology: A review of the statistical basis. Statistics in Medicine 5: 291–302. 65. Watson, J. D., Hopkins, N. H., Roberts, J. W., Steitz, J. A. and Weiner, A. M. (1987). Molecular biology of the Gene, The Benjamin/Cummings Publishing Company, Inc., Menlo Park, California. 66. Weeks, D. E. and Lange, K. (1988). The affected-pedigree-member method of linkage analysis. American Journal of Human Genetics 42: 315–326. ¨ 67. Weinberg, W. (1908). Uber den Nachweis der Vererbung beim Menschen. Jahresh. Verein f. vaterl. Naturk. in W¨ urttemberg 64: 368–382. 68. Wentworth, E. N. and Remick, B. L. (1916). Some breeding properties of the generalized Mendelian population. Genetics 1: 608–616. 69. Zhang, H. and Risch, N. (1996). Mapping quantitative trait loci in humans using extreme concordant sib pair: Selected sampling by parental phenotypes. American Journal of Human Genetics 59: 951–957 70. Zhao, H., Zhang, H. and Rotter, J. I. (1997). Cost-effective sib-pair designs in the mapping of quantitative-trait loci. American Journal of Human Genetics 60: 1211–1221.
About the Author Zhaohai Li is an Associate Professor of Statistics and Biostatistics at the Biostatistics Center, Statistics Department, George Washington University, Washington, DC. He obtained, his MS degree in Mathematical Statistics from Center China Normal University and PhD in Statistics from Columbia University. His research interests include statistical methods for genetic epidemiology and empirical Bayes methods for meta-analysis.
This page intentionally left blank
CHAPTER 16
DOSE-RESPONSE MODELING IN HEALTH RISK ASSESSMENT YILIANG ZHU Department of Epidemiology and Biostatistics, College of Public Health, University of South Florida, 13201 Bruce B Downs Blvd, Tampa, FL 33612-2805, USA Tel: (813)974-6674; [email protected] Human health protection against environmental exposure to chemical hazards is of fundamental public health importance, and thus of intensive research interests among industries, governments, and academics. As the scientific basis of health protection, risk assessment employs a wide range of statistical tools. Among them, dose-response modeling plays a central role in assessing exposure-related health risk and deriving safety exposure levels for environmental regulation. This article illustrates the use of dose-response modeling in risk assessment through examples of carcinogenicity, developmental toxicity, and neurotoxicity. Data from these examples include binary, clustered categorical, and longitudinal measurements, and require careful consideration for effective and innovative use of statistical methods such as generalized estimating equations and nonlinear mixed effects models. We also discuss the problem of benchmark dose estimation and some open statistical issues encountered in risk assessment. Whereas the examples and applications are directly related to environmental health, the methods illustrated in this article are widely applicable to many problems in medicine and biological studies.
1. Introduction There are sufficient scientific evidences that link chemical exposure to various adverse health effects, including carcinogenicity, developmental toxicity, mutangenicity, immunotoxicity and neurotoxicity.33 The United States Environmental Protection Agency’s (US EPA) TSCA inventory currently registers more than 65,000 chemical substances that are in active use in the USA, and this number is increasing. However, only about 10% of 617
618
Y. Zhu
these substances have been tested for carcinogenicity, and an even smaller number have been tested for non-cancer effects. For example, it is estimated that about 3% to 28% of all chemicals are neurotoxicants.33 A large number of the more than 500 registered pesticide ingredients are estimated to affect the nervous system of the targeted species to varying degrees. Of the 588 chemicals listed by the American Conference of Governmental Industrial Hygienists, 167 affected the nervous system or behavior at some exposure level.1 It is further estimated that of the approximately 200 chemicals to which one million or more workers are exposed in the USA, more than one third may have adverse effects on the nervous system if sufficient exposure occurs.2 Thus, there are increasing scientific and regulatory interests in estimating the risk of exposure to chemicals of various toxic potentials with regard to their overall impact on human health.
1.1. Health risk assessment Health risk assessment is composed of some or all of the following components: hazard identification; dose-response assessment; exposure assessment; and risk characterization. Hazard identification involves the detection of exposure-induced adverse health effects, with respect to dose, route, timing and duration of exposure, through either case report or designed human/animal studies. Dose-response analysis is typically done through designed animal experiments because controlled exposure to humans is generally not feasible. The purpose of dose-response assessment is to determine an exposure range in which exposure-induced risk is of a controllable magnitude. This estimated range of exposure in turn provides a quantitative base for risk characterization. The third component, exposure assessment, is to identify exposed or potentially exposed population, describe its size and composition, and present the types, magnitudes, frequencies, and duration of the exposure. As an integration of hazard identification, dose-response assessment and exposure assessment, a statement of the consequence of exposure then comes as a result of risk characterization. In the past decade or so, dose-response assessment in the context of non-cancer risk assessment has largely focused on determining a noobserved-adverse-effect-level (NOAEL) or lowest-observed-adverse-effectlevel (LOAEL), largely ignoring the shape of the underlying dose-response relationship. A NOAEL is the highest experimental dose at which the increase in adverse effects relative to the control group is not significant. A LOAEL, on the other hand, is the lowest experimental dose at which
Dose-Response Modeling in Health Risk Assessment
619
there is a significant increase in risk. A reference dose (RfD) or reference concentration (RfC) is then calculated by dividing a NOAEL, or a LOAEL when a NOAEL is not determined, by uncertainty factors to account for interspecies difference in response, different exposure routes and other study variations.3 The RfD is an estimate, with uncertainty of an order of magnitude, of a daily exposure to the human population that is likely to be without appreciable risks of deleterious health effects during a lifetime.3 It provides a quantitative basis for setting up regulatory levels of the chemical under testing. Since a NOAEL/LOAEL is limited to only the experimental doses, its determination ignores the shape of the dose-response relationship, and the risk associated with the NOAEL/LOAEL varies substantially between experiments.12,21,24 Recognizing the inconsistency in the NOAEL/LOAEL approach, the US EPA and several other international agencies have recommended the benchmark dose method (BMD)12 as an alternative or supplementary approach in risk assessment of developmental toxicity as well as other non-cancer effects.4,39–41 It is expected that, through sound statistical methods, the BMD approach can overcome the weaknesses of the NOAEL/LOAEL approach and provide a more consistent quantification of risk.
1.2. Benchmark Doses Befaore we describe the benchmark dose method, it helps to first discuss measures of risk. Consider an outcome measure Y of health effects. Assume that in the absence of exposure, Y has a cumulative distribution function F0 (y) = P r(Y < y). Chemical exposure at level d may cause adverse effects to the exposed population, resulting in an altered distribution Fd (y) = P r(Y < y|d) relative to that of the control population, F0 (y) = P r(Y < y|d = 0). The alteration may occur in many ways, including such common phenomenon as mean shifting and change in variance. Characterization of a dose-response relationship in risk is essentially to characterize any alterations in the distribution as a function of exposure. Statistically, the problem often reduces to identifying a quantity that reflects the changes to F0 (y). We can consider, for example, a set A of particular value of Y , and use the probability π(d = 0) = P r(Y ∈ A|d = 0) as a reference level for the changes in π(d) = P r(Y ∈ A|d). Significant changes from π(0) to π(d) signify a risk as well as a dose-response relationship. However, the probability π(d) may not be a direct measure of risk
620
Y. Zhu
since A is not necessarily a region of adverse values. Since this approach assesses overall risk in terms of changes to the population, not necessarily identifying adverse effects in individuals, we call it population-level risk characterization. Although the value of Y may not always lead to a definitive diagnosis of adverse effects especially when Y is a continuous measure, special cases do arise where A is a region of adversity on grounds of toxicological criteria. These include the case of Y being a binary or dichotomized measure, so that Y = 1 indicates the presence of adverse effects. As a result A = {1}, and π(d) is a direct measure of risk. Under these circumstances, it is feasible to assess adverse effects on an individual basis. We call this approach individual-level risk characterization. While the principle of population-level risk assessment is clear, the possibility of distributional changes to the population is almost infinite. Let us consider the special case of the location-scale family of distributions where Fd (y) = F ((y − µd )/σd ) with mean µd and scale σd . Obviously, normal distributions are members of this family. For convenience, we make µ0 = 0, and σ0 = 1 for the control distribution F (y). If toxic effects are manifested as a mean shifting, and a shifting of c units is toxicologically or clinical meaningful, we can then choose A = {Y > c}. It follows that π(0) = 1 − F (c), and π(d) = 1 − F ((c − µd )/σd ). The case of A = {Y < c} can be implemented analogously. Here, c can be a percentage of the control mean, variance, or data range. If toxic effects are manifested as c units change of variance relative to the control variance, we can choose A = {|Y | > c}. It follows that π(0) = 1 − F (c) + F (−c), and π(d) = 1 − F (c/σd ) + F (−c/σd ). In the case of both a mean shifting and variance change, A = {|Y | > c} remains applicable. If changes to F0 (y) are such that Fd (y) is no longer in the same family, it is still possible to quantify the changes using preferably non-parametric methods.5 Given the dose-response model π(d) = P r(Y ∈ A|d), we can choose a benchmark response (BMR) level γ, say 10%, and identify the corresponding level of exposure, i.e. benchmark dose (BMD) that induces the specified BMR. Specifically, the BMD can be defined by the multiplicative excess risk π(BMDγ ) − π(0) =γ 1 − π(0)
(1)
π(BMDγ ) − π(0) = γ .
(2)
or alternatively the additive excess risk,
We then divide the BMDγ by certain safety factors to obtain a RfD.
Dose-Response Modeling in Health Risk Assessment
621
The estimation of BMDs involves a few steps. First, we determine a reference “risk” π(0) = P r(A|d = 0) by characterizing the control distribution with regard to certain aspects that are sensitive to potential changes caused by exposure. Second, we use the reference probability π(d) = P r(A|d) to establish a dose-response relationship. Third, we estimate BMDs based on the fitted dose-response models using Eqs. (1) or (2). Since BMD is a point estimate, a lower confidence limit is often computed as a more conservative measure. Gaylor et al.16 give a detailed account of procedures for computing BMDs with different types of data. While the procedures are clear in principle, many technical challenges remain. Because of the complex nature of data arising from environmental studies, it is important to employ effective, and is challenging to develop innovative methods to handle statistical issues in risk assessment. This article mainly illustrates dose-response modeling using advanced statistical methods such as generalized estimating equations (GEEs) and mixed effects models. We discuss binary data for cancer risk assessment in Sec. 2, clustered multinomial data of developmental toxicity in Sec. 3, and longitudinal data from neurobehavioral toxicity screening studies in Sec. 4. We conclude with some open issues in Sec. 5.
2. Binary Data: Carcinogenicity Binary outcome data indicate the presence or absence of adverse effects in each individual. In cancer studies this means that subjects can be classified as having or not having a tumor in a target organ at some specified time following exposure to a test substance. It is often assumed that the effects observed in a subject is independent of that observed in others. This assumption needs to be checked, and is not the case for example in reproductive and developmental experiments where littermates may respond more similarly than others from different litters. This situation is addressed in Sec. 3. Table 1 is a summary of adenoma/carcinoma incidence in liver from an experiment on Japanese Medaka (Oryzias latipes) exposed to Nnitrosodiethylamine (DEN), a known carcinogen. This dataset is the 4th replication in a study reported by Brown-Patterson et al.6 The primary purpose of this study is to test for non-linear dose-response relationships below the 1% BMR level. This is an issue of important implications in regulatory policy. Currently, EPA’s testing protocol requires studies to include a lowest dose level near the 1% BMR level. A linear dose-response relationship is assumed from below the 1% BMR level, and linear extrapolation
622
Y. Zhu
Table 1.
Frequency of adenoma/carcinoma in the liver of medaka exposed to DEN. Dose (ppm)
0
0.075
0.15
0.3
0.6
1.5
3.0
Cases % Sample Size
5 0.36 1387
6 0.43 1385
11 0.77 1427
12 0.86 1400
19 1.36 1393
27 4.20 643
41 6.48 633
Table 2.
Fitted probit model for the incidence of adenoma/carcinoma. Coefficient
Estimate
Std. Err.
t-value
Intercept DEN DEN2
−2.632 0.831 −0.153
0.072 0.148 0.046
−36.752 5.625 −3.333
towards 0 is used to derive a RfD at the 0.1% BMR level. (The terminology extrapolation in risk assessment context excludes the control as an exposure level.) As a result, extrapolation would lead to an over-estimation (or under-estimation) of risk if the true dose-response is curved upwards (or downwards). From Table 1 we can see that the incidence of either adenoma or carcinoma or both is elevated by exposure to DEN and the 1% BMR is somewhere between 0.3 and 0.6 ppm. Since Medaka were housed in water tanks in group during both exposure and growing-out periods, similar living condition shared by Medaka in the same tank may result in clustering effects, i.e. data dependence. We checked the full study data by comparing the sample variance with the binomial variance, and did not find any indication of clustering. Therefore, we used a probit link function in conjunction with a binomial distribution, within the framework of generalized linear models,28 to model the mean response probability of an adenoma or carcinoma. This led to a probit dose-response model π(Xβ) = Φ(Xβ), where Φ is the cumulative distribution function of the standard normal, X is a set of covariates including dose, and possibly other covariates or interaction terms, and β are a vector of the regression coefficients. A simple quadratic model of dose was fit to the data, yielding the results given in Table 2. The residual deviance for the model was 0.984 on 4 degrees of freedom, indicating a reasonable fit to the data. From this model, the 1% BMR level was estimated to be about 0.396 ppm. While the quadratic term is statistically significant, signaling some degree of non-linearity below the 1% BMR level, it is the impact of the non-linearity on risk quantification that we were
Dose-Response Modeling in Health Risk Assessment
623
really interested in. To this end, we decided to estimate the curvature |Φ′′ |/(1 + (Φ′ )2 )3/2 of the dose-response curve, and then explored the relationship between the curvature and risk. To see the potential bias in risk estimation resulted from linear extrapolation, we compared the risk based on the fitted model and that based on linear extrapolation below the 1% BMR level. Assuming that the spontaneous risk of the control population is 0.0042 as predicted from the fitted model, the extrapolated risk was πlin (d) = 0.0145d˙ + 0.0042 .
4 2 0
Bias as a Percentge of Risk
6
We plotted the relative bias (πlin − Φ)/Φ against the estimates of curvature of the fitted model (Fig. 1). We can see that the largest bias is about 6.6% of the estimated risk, occurring at d = 0.17 ppm with curvature = 0.0212. At the 0.1% excessive risk level, the relative bias is about 5.5%, and the curvature there is 0.0202. These are clear evidences of a significant non-linearity in the true dose-response curve. However, the impact of nonlinearity on risk estimation is of a small magnitude. On the other hand,
0.019
0.020
0.021
0.022
Curvature
Fig. 1.
Risk bias due to extrapolation and curvature.
0.023
624
Y. Zhu
even a small over- or under-estimation of risk may carry a high level of sensitivity in public health policy. The consequence of a linear dose-response assumption in policy-making remains the subject of further studies.
3. Clustered Multinomial Data: Developmental Toxicity One of the most challenging and interesting aspects of reproductive and developmental toxicological data is the complex multivariate outcomes often encountered. This is the case because exposure to dangerous toxicants can affect many different stages in the reproductive process, including viability (sperm count, ovulation, etc.), fertilization and implantation. Once implantation has occurred, exposure can result in early pregnancy loss, malformations, lowered fetal weight or functional deficiency. Figure 2 shows some aspects of the developmental process, and illustrates how exposure may increase death rate, cause growth alterations such as lower fetal weight or malformations. In the figure, π1 (d) denotes the risk of death including resorption and still birth, π2 (d) the conditional risk of malformation among live births, all through maternal exposure at dose level d. We will focus on the multinomial outcomes of death and malformation only. A number of authors have discussed this issue in the context of dose response modeling.11,22,23,37,44 A second challenge is the clustering, i.e. the so-called litter effects or the tendency for littermates (offspring) of the same mother to respond similarly. In studies utilizing exposure to timed-pregnant animals over the period of organogenesis, the exposure effects on in-utero development are evaluated just before parturition would normally occur. Since the implants ✂✁☎✄✝✆ ✞✝✟✡✠☛✞☞✠✍✌ ✎✏✟ ✒✔✓✕✄✖✎✝✗☞✘✝✙✛✚
✜ ✵ ✵ ✵ ★✵ ✶ ✬✭✚✤✞✡✠✫✪✦✮✰✯✲✱✴✳ ✢✏✌ ✣✤✚✦✑ ✥✧✌★✙✩✠✫✪
✷ ✻
✙✛✎✹✸✺✠✫✪ ✆ ✠✼✚✖✙✽✞☞✠✍✌ ✎✏✟
✾ ✾✿ ✾ ✾
✾ ❀❁✞✕✆ ❂✂✎❃✙❄✁❅✞✡✠☛✌ ✎✲✟❆✮✰✯✕❇✤✳✾✿ ✾ ✾
Fig. 2.
✑✵
✵ ✵ ✵★✶❉❈ ✎❃✙✰✁❊✞✝✆❋✥●✌❍✙✩✠✴✪
Multivariate outcomes of developmental toxicity.
Dose-Response Modeling in Health Risk Assessment
625
(offspring) of the same mother are genetically related and share a similar developmental environment, they are likely to respond to the toxic insult similarly. To fit dose-response models for clustered multinomial data, we first adopt some notations. Consider an experiment in which pregnant dams are randomized to a control or one of D dose groups at exposure levels d0 = 0, d1 , . . . , dD , respectively. Suppose the ith dose group has Mi dams, and the jth, (j = 1, . . . , Mi ) dam has nij implants. Let yij = (y1ij , y2ij )T denote the number of dead implants and malformed fetuses in the (ij)th dam. 3.1. Generalized estimating equations Following Zhu et al.44 and Krewski and Zhu,23 we use Weibull models π1 (d; α) = 1 − exp(−(α0 + α1 dα2 )) π2 (d; β) = 1 − exp(−(β0 + β1 dβ2 )) to describe the incidences of death and malformation, respectively. With various values of the parameters θ = (αT , β T )T to be estimated from the data, the Weibull models are flexible in describing the various dose-response shapes observed in developmental toxicity studies, particularly a reversed L-shape or S-shape.22 The models can mimic continuous threshold models without explicit use of a threshold.38 The power parameters may be useful in other models such as the logistic and probit models. Maximum likelihood estimation can be used to simultaneously fit dose-response models π1 (d) and π2 (d) in conjunction with a mixture of multinomial distributions. The infinite mixture of Dirichlet-multinomial distribution9,44 and a finite mixture of multinomial30 are perhaps the most common examples, because mixture is a simple mechanism to capture extra-multinomial variation due to clustering. However, computational complexity, uncertainty, and limitations about the distribution are disadvantages associated with ML estimation. Many favor alternative analyses based on quasi-likelihood, or more generally, generalized estimating equations (GEEs).25 Ryan,37 Zhu et al.44 and Chen and Li11 used variations of GEEs specifically in the modeling of death and malformation. To estimate parameters θ using GEE, we need specification of only the mean and variance of the data. A general expression of the mean and variance functions for clustered multinomial data is µij = E(Yij |nij ) = nij (π1 (di ), (1 − π1 (di ))π2 (di ))T ,
626
Y. Zhu
and Vij = (1 + (nij − 1)ρi )nij (diag(µij ) − µij µTij ) . In the variance ρi is the coefficient of intra-cluster correlation, and (nij −1)ρi is the variation component in excess of the multinomial variation. This variance function appears to cover all exchangeable multinomial data.45 Now GEEs for θ take the following simple form: X ∂(nij µij ) Vij−1 (yij − nij µij ) = 0 . (3) T ∂θ i,j To estimate the dispersion parameters ρi , a separate set of equations is required. The simplest example is the moment estimation given by X ∂E(q(Yij )) Wij−1 (q(yij ) − Eq(Yij )) = 0 , (4) T ∂ρ i,j
where q can be a quadratic function of yij as well as θ, and W is chosen to approximate the variance of q. We obtain estimates for θ and ρ by iteratively solving the two sets of equations until convergence. An important addition in the GEE method is that the mean parameters and their variances will be estimated correctly even if the variance (Vij ) is incorrectly specified. This is achieved by the inclusion of an empirical variance “fix-up”. There are still incentives to correctly specify the variance to improve statistical efficiency.25 One approach to improving the efficiency of estimation is to consider joint GEEs for θ and ρ. This can be done by forming an enlarged “data” T T T vector (yij , qij ) , and use its covariance matrix to construct GEEs in the form of Eq. (3). This approach is sometimes called GEE2. However, it requires the specification of the 3rd and 4th moments for the data Yij as well as iterative algorithms between estimates of θ and ρ because qij are likely dependent upon θ. Still other variations of GEEs are available. The extended GEEs18 rely on the idea of constructing an extended quasilikelihood by integrating the GEEs (3) and then differentiating the “likelihood” with respect to ρ to obtain the equations for ρ. If the extended quasi-likelihood is somewhat similar to the true likelihood, the second set of equations would be reasonably satisfactory. In the current application we adopt the first approach, with q(yij ) = 2 yij − E(Yij2 ) in Eq. (4). This approach is computationally simple, and numerically equivalent to the GEE2 approach. We further impose a distinct intra-litter correlation ρi for each dose group since it is common that ρi increases with dose level.22
Dose-Response Modeling in Health Risk Assessment
627
A simple alternative approach is proposed by Krewski and Zhu23 to avoid direct estimation of the intra-litter correlation ρi . The idea is to transform the data by dividing the death and malformation frequencies yij as well as the number of implants nij of each dam by the so-called design effects. The transformation effectively removes the over-dispersion component (nij − 1)ρi from the data. As a result, we can use the mean and variance of a multinomial distribution to approximate that of the transformed data. The design effects associated with each dose group is in essence the “ratio” of the true variance to the variance of independent data. Denote the outcomes of malformation, death, and normal birth by an index k = 1, 2, 3. The design effect of each dose group can be estimated by vˆi2 vˆi3 1 vˆi1 ˆ (5) + + Di = 2 µ ˆi1 µ ˆi2 µ ˆi3
where vˆik (k = 1, 2, 3) are the sample variances divided by the average number of implants and µ ˆ ik are the mean response rates in each dose group. An added advantage to this transformation procedure is that the usual Chi-squared goodness-of-fit test is applicable. Simulation studies reveals that this transformation procedure is valid even for moderate number of dams per dose group.17 3.2. Illustration Data in Table 3 come from a study35 that investigated the developmental effects of diethylene glycol dimenthyl ether (TGDM). The study included a vehicle control group and 3 dosed groups, each with from 20 to 30 pregnant rats. Measured outcomes included the number of implantation sites in each dam and the incidence of resorptions and/or fetal deaths. Fetuses surviving to sacrifice were weighed and evaluated for the presence of a variety of different types of malformation. It is clear from Table 2 that both the death and malformation incidences are elevated by increased exposure to TGDM. The fitted death-malformation models are summarized in Table 4, where the transformation method is labeled by TR. While the incidence of malformation is strongly dose-related, the dose-response relationship of prenatal death is only marginally significant. Litter effects are apparent at the two highest dose levels, as indicated by the estimated intra-litter correlations as well as the design effects of 1.10, 0.69, 2.78, and 3.91 for the four dose groups respectively. The transformation method yielded a model comparable to that of GEEs, and a Pearson’s goodness-of-fit statistic with χ2 = 2.15 and p-value = 0.34.
628
Y. Zhu Table 3. Dose (g/kg) 0.00 0.25 0.50 1.00
Summary data of TGDM study. Deatha
Malf’sb
Dams
Total Impl’s
No.
(%)
No.
(%)
27 26 26 28
340 296 296 327
22 21 34 41
6.5 7.1 11.49 12.54
1 0 2 33
0.3 0.0 0.8 11.5
a Including b Number
dead or resorbed animals. of live animals exhibiting any malformation.
Table 4. Parameter estimates (S.E.)a of joint dose-response models for death and malformation in rats exposed to TGDM. Model
Coefficient
Estimate
Estimates (TR)
Prenatal Death
Intercept (α0 )
0.0617 (0.0180) 0.0956 (0.0563) 1.2650 (1.0584)
0.0642 (0.0154) 0.0793 (0.0481) 1.2140 (1.0932)
0.0006 (0.0009) 0.1209 (0.0346) 4.8157 (2.0224)
0.0009 (0.0016) 0.1222 (0.0356) 2.0224 (1.9803)
Dose (α1 ) Power (α2 ) Fetal Malformation
Intercept (β0 ) Dose (β1 ) Power (β2 ) Group 1 (ρ1 )
Intralitter
Group 2 (ρ2 )
Correlation
Group 3 (ρ3 ) Group 4 (ρ4 )
a S.E.
0.1546 (0.0688) −0.0369 (0.0315) 0.3288 (0.2829) 0.2846 (0.1021)
= standard error.
We computed benchmark doses based on the risks of death (π1 ), malformation (π2 ), and the overall risk π(d) = P r(death or malformation|d) = 1 − (1 − π1 (d))(1 − π2 (d)) . We used both the multiplicative risk (1) and additive excess risk (2). The results are given in Table 5 along with their standard error (in parentheses)
Dose-Response Modeling in Health Risk Assessment
629
Table 5. Estimates (S.E.) of BMD05 (mg/kg/d) based on the joint Weibull models for death and malformation.
Additive Multiplicative a S.E.
Prenatal Death
Malformation
Multivariate
0.643 (0.343) 0.611 (0.333)
0.837 (0.072) 0.837 (0.072)
0.568 (0.212) 0.548 (0.216)
= standard error.
derived by the delta method. The 95% lower confidence limit (LBD) of the BMD estimates can be approximated by LBD = BMD05 − 1.645 × SE. The results in Table 5 confirm that a BMD based on the multiplicative risk is always smaller than that of the additive risk, although the difference becomes negligible when the spontaneous risk is small. Further, using the multiplicative risk formula, the BMD is always below the smallest BMD based separately on the univariate risk π1 or π2 . The same needs not be true for the additive risk.16 4. Longitudinal Data: Neurobehavioral Toxicity Screening In addition to its primary role in psychological functions, the nervous system controls most, if not all, other bodily processes. It is sensitive to perturbation from various sources and has limited ability to regenerate. There are evidences that even small anatomical, biochemical, or physiological insults to the nervous system may result in adverse effects on human health, transient or persistent.41 Many chemicals in active commercial use may have, but are not tested for, neurotoxic potential. The US EPA has strongly recommended that neurotoxicity be used as an endpoint in regulating environmental toxicants. Neurotoxicity includes adverse effects in behavior, neurochemistry, neurophysiology, and neuropathology. It can also be evaluated based on neurological development and function in infants and children following prenatal and perinatal exposure. Our focus here is on neurobehavioral testing. 4.1. Neurobehavioral screening test In a recent international study jointly sponsored by the International Program on Chemical Safety (IPCS) and US EPA,32 the use of Functional Observational Battery (FOB)29 was validated and tested for rapidly measuring neurobehavioral changes potentially caused by
630
Y. Zhu
chemical exposure. The FOB consists of a number of measurements that are grouped into several domains according to their neurological function: Autonomic, Neuromuscular, Activity, Sensorimotor, Excitability, and Physiology. Except for several measures in the domain of Neuromuscular as well as physiological measures, most measures are of observational nature, and criteria for adversities in individuals are not easy to develop. Thus, we assess the collective behavioral changes of the exposed group relative to that of a control group. If rare behaviors in the control become more common in the exposed group, there are indicators of adverse exposure effects. To unify all measures within a domain, every single measure, regardless of being binary, continuous or ordinal, is converted to a severity score. Under a 4-level Likert scale, “1” is most common or least severe in the control group, and “4” the least common or most severe. For instance, the measure of “ease of removal from cage” comes with six categories: “very easy”, “easy”, “moderately difficult”, “rat flinches”, “difficult”, and “very difficult”. If more than 50% of the control group are either “very easy” or “easy” to remove, then “very easy” and “easy” are converted to a “1”; Categories whose frequency is at least one rank away from the mode of the control distribution is assigned a “4”. Further details on converting continuous measures can be found in MacDaniel and Moser.29 Variation of this type of schemes can also be developed and tested. The average of individual severity scores within the same domain is a composite severity score for the domain, and is treated as a continuous measure in analysis. Table 6 gives a summary of composite severity score of the “excitability” domain. The raw data are derived from an EPA study31 in which 8 rats of the same strain and age, each from a different litter, were exposed at one of the four dose levels (150, 500, 1500, 5000 mg/kg) to tetrachloroethylenl(PER), a common chlorinated solvent. In addition, there was Table 6.
Mean (S.E.) excitability scores of rats exposed to PER.
Dose (mg/kg)
0
150
500
1500
5000
0H
1.7938 (0.0603) 1.6238 (0.1414) 1.2064 (0.0292)
2.0413 (0.1070) 2.1650 (0.1594) 1.5000 (0.2876)
1.9188 (0.2442) 1.9163 (0.4352) 1.7088 (0.6183)
1.7525 (0.0545) 2.1650 (0.3168) 1.6250 (0.6187)
1.8350 (0.0622) 2.5025 (0.1273) 2.3300 (0.0000)
4H 24 H
631
Dose-Response Modeling in Health Risk Assessment
a control group. FOB screening was conducted on each subject at just before dosing (time = 0), approximately 4, and 24 hours post exposure. There were four deaths in the 5000 mg/kg dose group by 24 hour. The composite severity score is the average of three individual measurements: handling reactivity, arousal, and ease of removal. Repeated measurements are common in neurotoxicity studies in order to understand the often-transient effects of neurotoxicity. Traditionally, 0 dose
5
10
15
20
dose
3.0
2.5
Severity Score
2.0
1.5
1.0 dose
dose
dose
3.0
2.5
2.0
1.5
1.0 0
5
10
15
20
0
5
10
Test Time (hour) Fig. 3.
Change of excitability in rats exposed to PER.
15
20
632
Y. Zhu
ANOVA with repeated measures is employed. This is also the case in the analysis of the composite severity scores of the FOB data.31 However, ANOVA is designed for testing dose effects, not for dose-response modeling. Further, dose-response modeling is a prerequisite to the BMD approach. Data in Table 6 also showcase a typical situation in neurotoxicity study: small sample size per dose group, several repeated measures over time, and possible missing observations that often occur at higher dose levels or later times due to mortality. In the presence of missing data, ANOVA fails to utilize all available data because subjects with any missing data at some time points would have been removed to satisfy the requirement of balanced designs. This can be costly given an already-small sample per dose group in most neurotoxicity studies. Although linear or non-linear regression models may be used to model dose-response, they cannot differentiate distinct behavior trajectory among individual subjects. Between-subject variation often reflects important biological variation. Not only is it important to control to achieve better statistical power, it also sheds lights on identifying special risk groups. In Fig. 3 the excitability score is plotted against time for each subject, and the trajectories are grouped into separate panels by dose level. The bar on the top of each panel indicates an order of increasing dose, from left to right. The plot reveals some degree of between-subject
Fig. 4.
Excitability of rats exposed to PER.
Dose-Response Modeling in Health Risk Assessment
633
variation. To see the average trajectory of behavioral change, we have a 3-dimensional plot in Fig. 4 using a spline smoother. The control group’s excitability score remained stable, only decreasing slightly over time, due perhaps to self-adjustment. As dose increased, however, there were slight increases in excitability, peaking at around 5 hour after exposure and then dropped somewhat by 24 hour. The trajectory of the 5000 mg/kg group was quite different: the excitability score increased over time, and did not drop. This indicates that there were sustained dose effects at excessively high dose levels. It was not clear though whether the effects could be permanent in nature. In summary, the fact that there were only a small number of subjects, potentially sizable between-subject variation, and missing data associated with some subjects points to the benefits of using mixed effects models for data analysis. 4.2. Linear mixed effects models Examination of Fig. 4 reveals some interesting patterns of the dose-timeresponse of the excitability score. We found the following hybrid pharmocokinetic model f (φ(dose), t) =
φ1 (dose) + φ2 (dose)t 1 + eφ3 (dose) tφ4 (dose)
(6)
quite flexible in describing the various dose-response shapes observed in the FOB data. In this model, the dose effects are incorporated into the parameters φi as a function of dose. The simplest one is linear functions. Further, individual subject is allowed to have distinct behavior trajectory that is characterized by random coefficients associated with population parameters. After testing a number of options based on the actual data, we choose the following parametrization: φ1 (d) = β10 φ2 (d) = β20 + (β21 + b21i )d φ3 (d) = β31 d φ4 (d) = β40 . Note that φ1 determines the initial (time = 0) excitability level that is not influenced by dose because the first test was done before dosing. Therefore we make φ1 (d) = β10 a constant. Although we can jitter the intercept β10
634
Y. Zhu
by random effects to characterize unexplained variation in initial excitability among different subjects, the effects seem to be minimal in our actual data analysis. The time-slope φ2 measures how excitability changes over time, either naturally (β20 ), or under the influence of dosing (β21 + b21i ) with individual variation b21i . The downturn time-slope β31 is simply to control for the non-monotone trend within some dose range. A constant term is unnecessary because it amounts to a scale in conjunction with the intercept term in φ1 . Finally, a power parameter β40 is employed to increase the flexibility of the dose-time-response shape. Although in principle every population parameter can be jittered with random effects, the use of random effects must be justified by actual data variation. Guidance for model selection as well as use of random effects can be found in Pinheiro and Bates.34 Before fitting the model (6), we briefly outline the methods of nonlinear mixed effects models. The notation here follows largely that of Chapter 7 of Pinheiro and Bates,34 which emphasizes on the computational aspects and applications of mixed effects models. For theory and method, see for example Davidian and Glicknan.14 Consider the following general model, Yij = f (φi , Xij ) + ǫij ,
i = 1, . . . , M ; j = 1, . . . , ni
(7)
where ǫi = (ǫi1 , ǫij , . . . , ǫini )T is the error vector for the ith subject in the sample of M ; Xij is the covariate vector for the ith subject at the jth sampling time. The parameter vector φi = Ai β + Bi bi has two components: the first component β is population parameters and the second component bi (q × 1 vector) represents random effects associated with the ith subject. The matrices Ai and Bi are chosen with appropriate dimensions to determine on an individual level how to incorporate the population parameters and random effects into the model. In the present case, the matrices would be the same for all individuals. More general formulations can be developed to allow for time-varying coefficients.34 It is common to assume for continuous data that ǫi follow the Nni (0, σ 2 Λi (θ1 )) distribution, and are independent between different subjects. Various correlation options may be incorporated into Λi (θ1 ) through the parameter θ1 . The random effects bi are also conveniently assumed to follow the Nq (0, σ 2 Ω(θ2 )) distribution. Although other distributions can be considered in principle, computer software is mostly limited to normal distributions. To fit a non-linear mixed effects model, maximum likelihood estimation is often used via either the EM-algorithm15 or other numerical algorithms
Dose-Response Modeling in Health Risk Assessment
635
such as Newton-Raphson.27 Here we use Newton-Raphson algorithm implemented in S-PLUS (Mathsoft, Seattle, Washington). Under the assumption of normality, the marginal distribution of Yi = (Yi1 , . . . , Yini )T is Z p(yi |β, σ 2 , θ1 , θ2 ) = p(yi |bi ; β, σ 2 , θ1 )p(bi ; θ2 )dbi =
1 ni +q 2
1
1
(2πσ 2 ) |Λi | 2 |Ω| 2 Z T −1 (yi − f (φi ))T Λ−1 bi i (yi − f (φi )) + bi Ω dbi , × exp − 2σ 2
where f (φi ) = (f (φi , Xi1 )), . . . , f (φi , Xini ))T . Using a first order Taylor (t) series expansion about some initial estimates βˆ(t) and ˆbi to approximate the exponent of the integrand, we can obtain the following approximate marginal log likelihood function ni + q log(σ 2 ) 2 1 1 (t) (t) (t) (t) − log |Σi | + 2 (wi − Z1i β)T Σ−1 β) − Z (w i 1i i 2 σ
l(β, σ 2 , θ1 , θ2 ; yi ) = −
(t)
where wi
(t) (t) (t) (t) = yi − f (βˆ(t) , ˆbi ) + Z1i βˆ(t) + Z2i ˆbi , and (t)
Z1i =
∂f ∂β T
(t)
and Z2i =
∂f ∂bTi
(t) are the derivative matrices evaluated at βˆ(t) and ˆbi , and (t)
(t)
Σi (θ1 , θ2 ) = Λi (θ1 ) + Z2i Ω(θ2 )(Z2i )T . (t) (t) Maximum likelihood estimates (θˆ1 , θˆ2 , σ ˆ 2(t) ) are obtained from maximizing this approximate log-likelihood
max 2
θ1 ,θ2 ,σ
M X
l(βˆ(t) , σ 2 , θ1 , θ2 ; yi ) .
(8)
i
The estimate of the regression parameters β is then updated by a least squares estimator (M )−1 ( M ) X (t) X (t) (t) (t) (t) (t) βˆ(t+1) = . (9) (Z1i )T (Σi )−1 Z1i (Z1i )T (Σi )−1 wi 1
1
636
Y. Zhu
The estimate of the random effects bi is updated also by a least squares estimator based on M n X i
o ˆ(t) )(yi − f (βˆ(t+1) , bi )) + bT Ω−1 (θˆ(t) )bi . (yi − f (βˆ(t+1) , bi ))T Λ−1 ( θ i 2 1 i
Using a first order Taylor series approximation (t) (t) (t) f (βˆ(t+1) , bi , Xij ) = f (βˆ(t+1) , ˆbi , Xij ) + Z2i (bi − ˆbi ) ,
we have the update n o ˆb(t+1) = {Ω−1 + Z T Λ−1 Z2i }−1 Z T Λ−1 yi − f (βˆ(t+1) , ˆb(t) ) + Z2iˆb(t) , 2i i 2i i i i i
(10)
with all terms evaluated at the most recent parameter estimates. We iteratively solve the three sets of Eqs. (8), (9) and (10) until convergence. The maximization procedure outlined above relies heavily on the normality assumption for both the error terms and random effects. If the random effects follow a non-normal distribution, explicit marginal likelihood is often not available, and the EM-algorithm would be used to numerically approximate the marginal likelihood. 4.3. Illustration The fitted mixed effects model (6) is summarized in Table 7, and the predicted dose-time-response surface is plotted in Fig. 5. The fitted model fits the data well. A comparison of Fig. 5 with Fig. 4 reveals that the model captures most of the trends observed in the raw data. The dose effects were Table 7. Parameter β10 β20 β21 β31 β40
Estimates of parameters in the excitability model.
Value
S.E.a
DF
t-Value
p-value
1.8693 2.2611 −0.0003 −0.0003 1.1597
0.0763 0.1723 0.0001 0.0001 0.0320
72 72 72 72 72
24.51 13.12 −4.29 −3.40 36.30
< 0.0001 < 0.0001 0.0001 0.0011 < 0.0001
(1, 3): 0.14
(2, 3): 0.30
Ω(θ2 )
S.E.
7.72e-008
Λ(θ1 )
σ corr(i, j)b :
0.4737 (1, 2): 0.63
a S.E.
= standard error. k) = corr(ǫij , ǫik ).
b corr(j,
Dose-Response Modeling in Health Risk Assessment
Fig. 5.
637
Fitted model for the excitability score.
highly significant, and varied with testing time (Table 7) with a peak effect occurring at around 4.5 hour. However, it was difficult to accurately determine the true peak effect or peak effect time. The observed of peak effects are most likely taken to give the peak effect time. Several options were considered to include random effects in the model. Since using random effects for both the intercept β10 and the time slope β20 , or both β10 and the time-dose slope β21 led to an almost perfect correlation between the two random effects, both seemed an over-parameterization. Thus one random effect appeared sufficient for this dataset. Under several criteria, including likelihood ratio, information criterion and residual plots, the random effect associated with the time-dose slope β21 was found to be most effective. The standard error for b21i was 7.72e−8 (Table 7), a small number compared with the estimate βˆ21 = −3e−4, indicating a small between-subjects variation. We had also considered several correlation structures for the error terms, including that of complete independence, auto-regression with log 1, compound symmetric, and un-restricted. The unrestricted correlation matrix Λ (Table 7) seemed most flexible. 4.4. Benchmark doses In computing BMDs, we must take into consideration the fact that risk depends not only on exposure level, but also on the time the screening test
638
Y. Zhu
took place. It is crucial that risk is assessed before, not after, the peak effect time. Mathematically, we can measure risk at any designated time, estimate BMDs as a function of time to create a timed-profile, and then search for the minimum BMD. Since the excitability score is treated as a continuous measure, abnormality cannot be determined solely by a cutoff value, thus we adopt the population-level assessment approach by comparing the proportions of subjects whose score exceeds a given level between the exposed and controlled groups. This cutoff level can be chosen to include α × 100% of the subjects with the highest severity score in the control group. A significant increase in the proportion of subjects exceeding this level in the exposed group represents a risk because of the significant changes in the distribution of excitability. Conceptually, let π(0, t) = P r(Y > ct |d = 0, t) = α be the proportion of subjects in the control group whose severity score exceeds ct at testing time t, and π(d, t) = P r(Y > ct |d, t) be that for the exposed group. Note again that π is not necessarily a risk, but merely a reference level for risk. We compute BMD corresponding to a γ × 100% increase in multiplicative excess risk (1). Under a normal distribution for the severity score Y , ct − f (d, t) , π(d, t) = 1 − Φ σ0 where Φ is the standard normal cumulative distribution and with a constant standard deviation σ0 = 0.4737 pooled across all dose groups (Table 7). Equation (1) simplifies to ct − f (d, t) ct − f (0, t) Φ =Φ (1 − γ) = (1 − α)(1 − γ) , σ0 σ0 and ct = f (0, t) + z1−α σ0 with z1−α being the 1 − α percentile of the standard normal distribution. The preceding equation further simplifies to f (BMDt , t) = f (0, t) − σ0 (Φ−1 ((1 − α)(1 − γ)) − z1−α ) .
(11)
Substituting for the parameters with their estimates and solving Eq. (11) lead to the estimate of BMDt at chosen time t. In Fig. 6, we plotted the estimated BMDs over time to illustrate the influences of the reference and risk levels on BMDs with various choices of (α, γ): (0.05, 0.05), (0.05, 0.10), (0.10, 0.05) and (0.10, 0.10). We can see from Fig. 6 that the higher the reference level α is, the lower the risk detection limit is, hence the smaller the BMD would be. On the other hand, the lower the tolerance risk level γ is, the smaller the BMD would be because of increased sensitivity level
639
1500
2000
5% Cut-off and 5% Risk 5% Cut-off and 10% Risk 10% Cut-off and 5% Risk 10% Cut-off and 10% Risk
500
1000
BMD (mg/kg)
2500
3000
Dose-Response Modeling in Health Risk Assessment
0
5
10
15
20
Time (Hour)
Fig. 6.
Time-profiled benchmark doses.
to risk. A careful examination of the timed-profile of BMD suggests that, depending on the reference and risk levels, the minimum BMD is between 6.25 and 7 hour post exposure. This finding in turn suggests a sensitive time window for screening. 5. Discussion We have illustrated in this article several dose-response modeling techniques with applications to health risk assessment, particularly BMD estimation. The examples used include binary, clustered multinomial and repeated measurements. Although the basic concepts of modeling and risk assessment are unambiguous, a number of technical issues warrant further discussion. Defining adversity based on continuous measures is often arbitrary. We can circumvent this problem by assessing distributional changes to exposed populations relative to a control group. The baseline probability is only a reference, the changes to the baseline can be used as measures of risk. Bosch, Wypij and Ryan5 proposed a non-parametric approach to establish the baseline reference for weight loss. Suppose Yij denotes the observed weight for the jth animal in the ith dose group (i = 0, . . . , D; j = 1, . . . , ni ),
640
Y. Zhu
with d0 = 0 being the control. If there is no dose effects, π(di ) = P r(Y0h > Yij ) = 1/2 for all i, including in particular the case i = 0, i.e. π(0) = 1/2. PD We construct a “new” response vector of length mn0 , m = i=1 ni , representing the comparisons between each control subject and each exposed subject. That is, let W = (W0111 , W0112 , . . . , W0hij , . . . , W0n0 DnD )′ ,
W0hij = I(Y0h > Yij ) ,
where the indicator function W0hij takes the value 1 if Y0h > Yij , and 0 otherwise. Since E(W0hij ) = π(di ), it is natural to estimate π(d) from a binary regression model for W0hij . Developing generalized linear mixed effects models for ordinal data19,20 would be useful for the analysis of individual measure of FOB data. There is one major technical challenge with regard to zero frequencies in categories of less common behavior particularly with a small sample size. Developing multivariate dose-response models is also interesting. For example, Catalano and Ryan7 and Regan and Catalano36 studied joint dose-response models with both binary (malformation) and continuous outcomes (body weight) in developmental toxicity studies, using a bivariate latent variable model. Attempts have also been made to further include death in the multivariate dose-response model.8,10 Empirical evidences argue that multivariate models should result in improved precision for the purpose of risk assessment, as compared with a univariate approach.22,37 This improved precision could arise in two ways: (1) a richer class of models and better fit, and (2) more efficient estimators. It would be very useful to explore good quality statistical techniques for evaluating goodness-of-fit of dose response models when they are fit using methods such as GEEs. Better techniques for calculating lower confidence limits based on either likelihood or GEEs are needed. Given small to moderate sample size, the conventional methods13,16 that use normal approximation to the estimated dose-response or BMD may not be accurate. Statistical calibration methods in conjunction with GEEs can be adopted and modified to estimate the benchmark dose. Nonparametric techniques such as bootstrap43 should be further explored in this context to have robust estimate of confidence limits. Acknowledgment This work is in part supported by the US National Science Foundation grant DMS9978370.
Dose-Response Modeling in Health Risk Assessment
641
About the Author

Zhu Yiliang is an Associate Professor at the College of Public Health, University of South Florida. He obtained his PhD in Statistics from the University of Toronto, an MS from Queen's University in Kingston, and a BS from Shanghai University of Science and Technology. He was a visiting scientist at the Environmental Health Centre, Health Canada, and at the US Environmental Protection Agency. He was also the principal biostatistician at the headquarters of the Shriners Hospitals for Children. His current research focuses on statistical methodology and applications in environmental health risk assessment, including semiparametric methods for dose-response modeling of longitudinal data, effective designs for dose-response, benchmark
methods in environmental regulation, toxicokinetic-toxicodynamic models in toxicology, and software development. His research has been funded through grants from the US National Science Foundation and the National Institutes of Health. In addition to serving on many national and statewide committees, he has also done extensive research and consultation in clinical trials and management, health care outcome evaluation, epidemiology, and litigation.
CHAPTER 17
STATISTICAL MODELS AND METHODS IN INFECTIOUS DISEASES

HULIN WU
Frontier Science and Technology Research Foundation, 1244 Boylston Street, Suite 303, Chestnut Hill, Massachusetts 02467, USA
Tel: 617-632-5738; [email protected]

SHOUJUN ZHAO
Department of Neurology, University of California, San Francisco, CA 94143, USA
1. Introduction

1.1. Infectious diseases

Infectious diseases are illnesses caused by a specific infectious agent or its toxic products. Most of these agents are microorganisms, such as bacteria, viruses, and parasites. Transmission of the agent from an infected person, animal, or reservoir to a susceptible host, either directly or indirectly through an intermediate plant or animal host, vector, or the inanimate environment, results in infectious disease in humans. For example, influenza, hepatitis, and AIDS are caused by viruses; dysentery and typhoid by bacteria; and schistosomiasis and filariasis by parasites.1 Infectious diseases are also called communicable diseases.
Infectious diseases have been a great threat to human beings over the past centuries. In this new century, understanding and controlling the spread of infections is still vitally important to public health. The challenging problems in studies of infectious diseases include: (i) how to evaluate the epidemics of an infectious disease in a population; (ii) how to understand the pathogenesis of infection and transmission; and (iii) once intervention measures such as drugs and therapies, and prevention measures such as education programs and vaccines, are developed, how to evaluate their effectiveness. Mathematics and
statistics have played a central role in all three of these aspects over the past decades. Accurate evaluation and projection of the epidemics of an infectious disease help determine health care needs, which is useful for public health departments or governments in preparing and allocating resources to fight the disease. It also signals how serious a particular infectious disease is and draws public attention to its dangers.

1.2. Mathematical and statistical challenges from infectious diseases

The application of mathematical methods to infectious diseases dates back to Daniel Bernoulli's paper in 1760,2 in which he used a mathematical model to evaluate the impact of smallpox on life expectancy. More analytical work was done by Hamer3 and Ross,4 who tried to understand the mechanisms of disease transmission in the early part of the last century. Kermack and McKendrick5 studied the mass action principle and the threshold theorem originally proposed by Hamer and Ross, respectively. The well-known chain binomial models of disease spread may be traced back to En'ko,6 and a stochastic counterpart of chain binomial models was introduced by Greenwood.7 Throughout the last century, theory and quantitative techniques have been developed to study both the dynamics of disease within individuals and the transmission of infections through populations. Mathematics and statistics have played an important role in studies of infectious diseases; in particular, over the last two decades there has been a great deal of work on HIV/AIDS. Both mathematics and statistics have played, and will continue to play, an important role in epidemic studies as well as in intervention and prevention studies of infectious diseases. Recent brief reviews of these methods can be found in Farrington,8 Heesterbeek and Roberts,9 and Gani.10

1.3. Outline

Many mathematicians and statisticians have responded to the challenges posed by infectious diseases over the past two centuries, and they continue to make their best efforts to meet the new challenges of this new millennium. In this chapter, we first introduce the back-calculation method, proposed recently for estimating and projecting the AIDS epidemic, followed by an introduction to models for the natural history of
infectious diseases. Deterministic and stochastic models, as well as their recent developments, are presented in Sec. 2. In Sec. 3, we introduce viral dynamic models, which have been studied intensively over the last decade for understanding the pathogenesis of HIV, HBV, and HCV infection. We briefly review the mathematical and statistical methods for evaluating intervention and prevention measures for infectious diseases in Sec. 4. We conclude the chapter with a brief summary.

2. Epidemic Models

2.1. Estimation and projection of epidemics

Back-calculation

Estimation and projection of epidemics, such as disease incidence and prevalence, are very important for the intervention and prevention of infectious diseases. They are also critical for a government in making decisions and preparing for public health needs. The back-calculation or back-projection method has received tremendous attention for estimating and projecting the AIDS epidemic over the past 15 years. In this section, we briefly introduce the back-calculation method and its applications.
Back-calculation is a method for estimating past infection rates of an infectious disease by working backward from observed disease incidence, using knowledge of the incubation period between infection and disease. Although in theory it can be applied to any infectious disease, it was first proposed to study the AIDS epidemic by Brookmeyer and Gail.11,12 It has been widely used in AIDS epidemiology. The basic idea is to use the convolution equation relating the expected cumulative number of disease cases diagnosed by time t, A(t), the infection rate g(s) at time s, and the incubation period distribution F(t), i.e.

A(t) = \int_0^t g(s) F(t - s) \, ds .   (1)
If the disease cases A(t) are known (they may be obtained from case reports) and the incubation period distribution F(t) can be estimated from epidemiological studies, then the infection rate g(s) can be estimated by deconvolution of Eq. (1). Conversely, if the infection rate g(s) and the incubation period distribution F(t) are known, the disease cases can be estimated or projected using the convolution Eq. (1). We introduce back-calculation using a discrete-time formulation since it is more realistic. We assume that we have n non-overlapping time
intervals, (Tj−1, Tj), j = 1, . . . , n; let Yj be the number of disease cases diagnosed in the jth interval; denote by fij the probability of developing disease in time interval j given infection in interval i, i.e. fij = F(Tj−i+1) − F(Tj−i), where F(T0) = 0; and let gi denote the expected number of new infections in time interval i (the infection rate). The discrete-time statistical convolution equation can be written as

E(Y_j) = \sum_{i=1}^{j} g_i f_{ij} , \quad j = 1, \ldots, n .   (2)
Usually Yj is assumed to follow a Poisson distribution. A Poisson regression analysis may be used to estimate the parameters gi, with the fij regarded as known covariates. The generalized linear model routines in standard statistical packages such as SAS or S-Plus can be used to fit the model. However, a difficulty with this model is that the number of parameters equals the number of data points, which may result in unstable estimates of gi. To resolve this problem, one may model g(s) parametrically or nonparametrically. Parametric models include the damped exponential model, the log-logistic model, the logistic (prevalence) model, and the piecewise constant step function model. For nonparametric modeling, smoothing splines, kernel methods, and series-based splines can be used. More details can be found in the book (Chapter 8) by Brookmeyer and Gail.13
If we model gi as a parametric function, say gi = g(i, β), where β is a vector of parameters, the maximum likelihood method may be used to estimate the parameter vector β, and hence the infection rate function gi = g(i, β). Assuming that Yj follows a nonhomogeneous Poisson process, the log-likelihood function of Yj can be written as

L(Y|\beta) = \sum_{j=1}^{n} \left[ Y_j \log\left( \sum_{i=1}^{j} g(i,\beta) f_{ij} \right) - \sum_{i=1}^{j} g(i,\beta) f_{ij} - \log Y_j ! \right] .   (3)
The maximum likelihood estimate of β is obtained by maximizing L(Y|β) with respect to β using general numerical approaches such as the Newton-Raphson method or the EM algorithm.16 The variance of the estimate can be obtained from the Fisher information or by the bootstrap method. Once the infection rate function g(i, β) is estimated, the number of infections and future disease cases can be estimated and projected. The cumulative number of infections from time T0 to time Tk can be estimated by

\hat{G}(T_k) = \int_{T_0}^{T_k} g(s, \hat{\beta}) \, ds \quad \text{(continuous time)}   (4)
            = \sum_{i=0}^{k} g(i, \hat{\beta}) \quad \text{(discrete time)}   (5)
The variance of this estimate can be obtained by the delta method or by the bootstrap method. Also note that the infection prevalence is defined as the number of infected individuals who are still alive. Thus, the estimate of the infection prevalence is Ĝ(Tk) − D(Tk), where D(Tk) is the cumulative number of deaths during the same time interval.
The projection of disease incidence in a future time interval [Tl−1, Tl) is obtained by projecting forward the number of individuals infected prior to the current time Tn, i.e.

\hat{A}(T_l) - \hat{A}(T_{l-1}) = \int_{T_0}^{T_n} g(s, \hat{\beta}) [F(T_l - s) - F(T_{l-1} - s)] \, ds \quad \text{(continuous time)}   (6)
                              = \sum_{i=0}^{n} g(i, \hat{\beta}) f_{il} \quad \text{(discrete time)} .   (7)
However, this estimate is a lower bound, since it only considers individuals infected prior to time Tn. To make an adjustment, infections between times Tn and Tl need to be considered; that is, the following term needs to be added to the above projection:

\int_{T_n}^{T_l} g(s, \beta) [F(T_l - s) - F(T_{l-1} - s)] \, ds

in continuous time, or

\sum_{i=n}^{l} g(i, \beta) f_{il}
in discrete time. However, the future infection rate g(s, β) or g(i, β) is unknown; a guess or an extrapolation of the current infection rate is usually used. A brief introduction to back-calculation can be found in Bacchetti16 and a detailed description in Brookmeyer and Gail.13 Note that the above methods for projecting disease incidence or infection prevalence should be used with caution. There are many sources of uncertainty
in back-calculation methods. The first is uncertainty in the incubation period distribution: its estimate may be subject to errors and to the uncertainties of the underlying epidemiological study designs. Sensitivity analysis is usually used to evaluate these uncertainties; more details on the incubation period distribution can be found in the book (Chapter 4) by Brookmeyer and Gail.13 The projection is also sensitive to the assumed infection rate model, especially to the unknown future infection rates; thus the model for g(s) needs to be chosen with care. Another problem is the reported disease incidence data. Different countries have different reporting systems for infectious diseases, and some of them may not be reliable; reporting delays and underreporting occur frequently. Some formal methods have been developed to account for reporting uncertainty; see Harris14 and Lawless and Sun.15 Also note that the effect of immigration and emigration from one community (country) to another is not considered in the above projection models. In summary, the back-calculation method only provides a rough (lower bound) estimate or prediction of disease incidence or infection prevalence.
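To make the discrete-time back-calculation above concrete, the following Python sketch simulates case counts from an assumed incubation distribution and a hypothetical parametric infection curve g(i, β), and then recovers β by maximizing the Poisson log-likelihood (3). The Weibull incubation distribution, the logistic form of g, and all numerical values are illustrative assumptions, not quantities taken from the references cited above.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import weibull_min

# Hypothetical setup: n quarterly intervals and an assumed Weibull incubation CDF
n = 40
F = weibull_min(c=2.0, scale=30.0).cdf          # assumed incubation distribution (quarters)
T = np.arange(n + 1)                             # interval endpoints T_0, ..., T_n

# f_ij = F(T_{j-i+1}) - F(T_{j-i}) for i <= j, as in Eq. (2)
f = np.zeros((n, n))
for i in range(1, n + 1):
    for j in range(i, n + 1):
        f[i - 1, j - 1] = F(T[j - i + 1]) - F(T[j - i])

def g(i, beta):
    # assumed parametric infection-rate curve (logistic growth), purely illustrative
    a, b, c = beta
    return a / (1.0 + np.exp(-(i - b) / c))

def neg_loglik(beta, Y):
    # Poisson log-likelihood of Eq. (3), up to the constant log(Y_j!) term
    mu = np.array([np.sum(g(np.arange(1, j + 1), beta) * f[:j, j - 1])
                   for j in range(1, n + 1)])
    mu = np.maximum(mu, 1e-10)                   # guard against non-physical values
    return -np.sum(Y * np.log(mu) - mu)

# Simulate case counts from "true" parameters, then recover them by MLE
rng = np.random.default_rng(0)
true_beta = np.array([200.0, 15.0, 3.0])
mu_true = np.array([np.sum(g(np.arange(1, j + 1), true_beta) * f[:j, j - 1])
                    for j in range(1, n + 1)])
Y = rng.poisson(mu_true)

fit = minimize(neg_loglik, x0=np.array([100.0, 10.0, 2.0]), args=(Y,),
               method="Nelder-Mead")
beta_hat = fit.x
# Estimated cumulative infections through interval k (cf. Eq. (5))
G_hat = np.cumsum(g(np.arange(1, n + 1), beta_hat))
print(beta_hat, G_hat[-1])
```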
2.2. Model the natural history

The natural history of a disease is the evolution of the disease in the absence of medical intervention. Today, however, most diseases are treated after they are diagnosed. The term "clinical course" is usually used to describe the natural history of a disease that has been affected by intervention; a broader definition of the term "natural history" may also include the clinical course. The endpoints of a natural history study may be dichotomous outcomes (such as death, relapse of a tumor, or acquisition of AIDS following HIV infection), time-to-event outcomes (such as the time until a clinical outcome occurs), or a repeatedly measured biomarker (such as CD4+ cell counts or HIV RNA copies in AIDS patients). To study the relationship between these endpoints and prognostic factors, standard statistical methods may be used. For example, logistic regression or tree-structured regression methods (see the related chapter of this book) may be used to study dichotomous endpoints. These methods are quite standard, so we omit the details here.
To study survival endpoints, the Kaplan-Meier curve or product-limit estimator (life table) is widely used to describe the natural history. The popular proportional hazards model or Cox regression model can be used to study the relationship
between the survival endpoints and prognostic factors. Since neither the time of HIV infection nor the onset of AIDS can be observed exactly, doubly censored or interval censored data need to be considered in this case. This problem motivated the development of new methods; see De Gruttola and Lagakos,17 Kim et al.,18 Jewell et al.,19 Jewell20 and Sun.21 Detailed survival analysis methods can be found in the related chapter of this book or in other textbooks. For repeated measurement endpoints, statistical methods for longitudinal data have been developed over the past two decades; a good survey of these methods can be found in the book by Diggle et al.22 and others.
Modeling biomarkers of HIV/AIDS such as CD4+ cell counts and HIV RNA copies (viral load) has received special attention in the last decade, and many new models and methods have been developed. In the following, we briefly introduce several new models and methods, but refer the reader to the original papers for details. Standard longitudinal data analysis methods can also be found in the related chapter of this book. In the early stage of HIV/AIDS research, the CD4+ T cell count was the most important biomarker for studying the natural history of HIV infection and evaluating treatment effects. Recently HIV RNA copies (viral load) became the new focus in HIV/AIDS research, but the methodology for modeling CD4+ T cell counts can be adapted to model viral load with minor modifications. De Gruttola, Lange and Dafni23 proposed a linear mixed-effects model with errors-in-variables to model CD4+ T cell trajectories, i.e.

y_i = X_i a + Z_i \beta_i + \varepsilon_i , \quad i = 1, \ldots, n ,   (8)
where the design matrices Xi and Zi are subject to measurement error since they depend on observed time measurements, a is a population parameter vector, and βi is a vector of subject-specific random effects with an i.i.d. normal distribution, independent of εi, which also follows an i.i.d. normal distribution. Taylor et al.24 considered a linear mixed-effects model with the within-subject covariance specified as an Ornstein-Uhlenbeck (OU) stochastic process, i.e.

y_i = X_i a + Z_i \beta_i + W_i + \varepsilon_i , \quad i = 1, \ldots, n ,   (9)
where Wi is an OU process. They claimed that this model tracked CD4+ T cell data better than standard linear mixed-effects models. To better track the nonlinearity of CD4+ T cell trajectories, some nonparametric and semiparametric models have been proposed. For example, Zeger and Diggle25 introduced a semiparametric model,

Y_{ij} = x_{ij} \beta + \mu(t_{ij}) + \varepsilon_i(t_{ij}) ,   (10)
where xij is a covariate matrix (prognostic factors) and µ(tij) is an unknown smooth function of t. They proposed a back-fitting algorithm to fit the model, i.e. estimating β and fitting µ(tij) (using a kernel or other nonparametric regression method) iteratively; see Zeger and Diggle25 for details.
Nonparametric mixed-effects models have also been proposed to model CD4+ T cell courses.26,27 The basic idea is to decompose a population (cohort) of CD4+ T cell curves into two parts, a population effect and a subject-specific random effect, yi(t) = f(t) + hi(t) + εi(t), where f(t) and hi(t) denote the population curve and the subject-specific random-effect curve respectively, and both are assumed to be smooth functions of t. A cubic B-spline method was proposed to fit f(t) and hi(t). Let B(t) be a vector of cubic B-spline basis functions. Assuming f(t) = αB(t) and hi(t) = γiB(t), the CD4+ T cell model can be written as

y_i(t) = \alpha B(t) + \gamma_i B(t) + \varepsilon_i(t) .   (11)
This is a standard linear mixed-effects model, with B(t) treated as covariates and α and γi as fixed and random effects respectively. Existing statistical software, such as the SAS procedure MIXED or the S-Plus lme function, can be used to fit this model, and standard inference procedures for linear mixed-effects models are available. Similarly, Wang and Taylor28 proposed a piecewise cubic polynomial model for CD4+ T cell changes and used the model to conduct inferences such as treatment comparisons. Recently, more flexible models such as functional linear models or varying-coefficient models29–32 have been proposed. The model can be written as

Y_i(t_{ij}) = X_i^T(t_{ij}) \beta(t_{ij}) + e_i(t_{ij}) ,   (12)
where β(tij) is a time-varying coefficient vector which is assumed to be a smooth function of t. Fan and Zhang29,30 proposed a two-step procedure to fit the model: for fixed time t, fit a standard linear regression model to obtain raw estimates of β(tij), and then smooth the raw estimates using one of the existing smoothing techniques. Hoover et al. and Wu et al.31,32 proposed smoothing spline and local polynomial methods. Hierarchical Bayes models have been introduced by Lange et al.33 to model CD4+ T cell counts; this model is similar to a mixed-effects model, but in a Bayesian framework. De Gruttola and Tu34 and Tsiatis et al.35 also proposed a method for jointly modeling survival endpoints and longitudinal biomarkers. They modeled the longitudinal biomarkers (CD4+ T cell
counts) with a linear mixed-effects model and the survival data with a standard Cox model, and then constructed a joint log-likelihood function from these two models, so that likelihood-based methods can be applied. Although the above models were developed to model CD4+ T cell counts in AIDS research, the methodology is generally applicable to similar repeatedly measured biomarker data for other infectious diseases. However, in most countries, especially developed countries such as the United States, patients with infectious diseases such as HIV/AIDS are mostly under active treatment. How to model the natural history or clinical course of infectious diseases under effective treatment is a great challenge, since treatment may dramatically affect the changes of biomarkers and disease progression. Also note that there are many sources of bias and uncertainty in natural history studies: for example, sampling or selection bias in the subject selection process, follow-up length bias due to study length limitations and the long latent period of some infectious diseases such as AIDS, and drop-out or missing-data bias when the drop-out or missing pattern is not random. Another problem is that the time zero of a natural history may not be well defined or exactly observed in a study; for instance, the exact time of HIV infection is difficult to obtain for some cohorts. See more discussion in Cnaan.36 In summary, a careful design of a natural history study is necessary to eliminate or reduce these biases.

2.3. Deterministic models for epidemic transmission

A standard deterministic model for epidemic transmission is a compartmental model. For example, a general susceptible-infective-removed (SIR) compartment model can be written as

\dot{S} = \mu - \beta S I - d_S S ,   (13)
\dot{I} = \beta S I - r I - d_I I ,   (14)
where S and I represent the proportions of susceptible and infectious subjects in the population, and Ṡ and İ denote their time derivatives respectively. Parameter µ denotes the birth rate of susceptible subjects per time unit, and β represents the infection rate when S and I interact (mix) randomly. Parameters dS and dI denote the death rates of susceptible and infectious subjects, and r denotes the recovery (removal) rate of infectious subjects. The basic reproduction ratio is defined as

R_0 = \mu \beta / [d_S (r + d_I)] .
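As a numerical illustration of the SIR system (13)–(14), the following Python sketch integrates the equations with scipy and evaluates the basic reproduction ratio defined above. All parameter values and initial conditions are illustrative assumptions, not estimates from any study cited in this chapter.

```python
import numpy as np
from scipy.integrate import odeint

def sir(y, t, mu, beta, r, dS, dI):
    # Right-hand side of Eqs. (13)-(14)
    S, I = y
    dSdt = mu - beta * S * I - dS * S
    dIdt = beta * S * I - r * I - dI * I
    return [dSdt, dIdt]

# Assumed parameter values (per unit time) and initial proportions
mu, beta, r, dS, dI = 0.02, 0.5, 0.1, 0.02, 0.05
y0 = [0.99, 0.01]                       # start with 1% infectious
t = np.linspace(0.0, 200.0, 2001)

sol = odeint(sir, y0, t, args=(mu, beta, r, dS, dI))
S, I = sol[:, 0], sol[:, 1]

# Basic reproduction ratio R0 = mu*beta / [dS*(r + dI)]
R0 = mu * beta / (dS * (r + dI))
print("R0 =", R0, " final infectious proportion =", I[-1])
```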
R0 is an important summary measure of the infectiousness of a disease: it is the mean number of secondary cases generated by a single infective in a totally susceptible population. The higher the value of R0, the more infectious the disease. If R0 ≤ 1, transmission of the infection cannot be sustained and the epidemic will eventually die out. If the infection has a latent stage, with a proportion E of subjects in that stage (during which they are not infectious), a standard SEIR model is

\dot{S} = \mu - \beta S I - d_S S ,   (15)
\dot{E} = \beta S I - \alpha E - d_E E ,   (16)
\dot{I} = \alpha E - r I - d_I I ,   (17)
where α is the rate of transition from the latent to the infectious stage, and dE is the death rate of latently infected subjects. For some infectious diseases we may assume that dS = dE = dI = d, but this may not be true in general. These compartment models are derived from the principle of mass action and a homogeneous mixing pattern.
The age of subjects is another important factor in the epidemics of infectious diseases, and an age-structured compartment model may be used to account for age effects. Here is a simple example:

\frac{\partial S}{\partial t} + \frac{\partial S}{\partial a} = \mu - \lambda S - d_S S ,   (18)
\frac{\partial I}{\partial t} + \frac{\partial I}{\partial a} = \lambda S - r I - d_I I ,   (19)
\frac{\partial R}{\partial t} + \frac{\partial R}{\partial a} = r I - d_R R ,   (20)
where a is the age of subjects and R is the proportion of recovered (removed) subjects in the population. Parameter λ is the so-called force of infection (the age-specific hazard rate of infection), which is a function of time t and can be defined by

\lambda(t) = \int_0^\infty k(a') I(t, a') \, da' ,

where k(a') is a kernel function. The partial differential equation system (18)–(20) can be solved numerically with appropriate initial conditions.
Note that the above models are very general. For a particular disease, they may need to be modified to accommodate its special features or characteristics. For example, HIV-infected patients cannot be cured of or recover from infection with
current treatments; instead, they may progress to AIDS or death. Thus, in the differential equation for the infectious population (I), r is the rate of progression to AIDS, and an additional equation for AIDS cases, Ȧ = rI − dA A, may be added.
Hepatitis B virus (HBV) infection is another example of an infectious disease, whose transmission can be characterized by a model with five compartments. A community population can be divided into five compartments: (1) susceptible, S(a, t); (2) latent period (the time interval from infection to the development of infectiousness), L(a, t); (3) temporary HBV carriers, T(a, t); (4) chronic HBV carriers, C(a, t); and (5) immune, I(a, t).37,38 Here "a" represents age and "t" represents the length of follow-up. Among the five stages, compartments 3 and 4 are infectious. In this model, the birth rate is treated as a constant, and age-specific death rates are collected from death notification systems. The immune status is assumed to be life-long, and newborns are assumed susceptible. For simplicity of modeling, the rare intrauterine HBV infection,39,40 the short period of newborn maternal antibody, and sex differences are ignored. The model parameters are defined as follows: λ(a, t) is the force of infection; α is the rate of transition from the latent period to temporary HBV viremia; β(a) is the risk of transient viremia progressing to chronic HBV carriage; ε is the rate of transition from temporary HBV viremia to immunity per time unit; ν(a) is the rate of HBV clearance in chronic HBV carriers; τ(a) is the mortality rate of HBV-related diseases; µ(a) is the age-specific mortality rate of non-HBV-related diseases; and Vc(a, t) is the effectiveness of hepatitis B vaccine immunization. Then the age-structured compartment model for HBV can be written as
\frac{\partial S(a,t)}{\partial a} + \frac{\partial S(a,t)}{\partial t} = -[\lambda(a,t) + V_c(a,t) + \mu(a)] S(a,t) ,   (21)
\frac{\partial L(a,t)}{\partial a} + \frac{\partial L(a,t)}{\partial t} = \lambda(a,t) S(a,t) - [\alpha + \mu(a)] L(a,t) ,   (22)
\frac{\partial T(a,t)}{\partial a} + \frac{\partial T(a,t)}{\partial t} = \alpha L(a,t) - [\beta(a) + \varepsilon + \mu(a)] T(a,t) ,   (23)
\frac{\partial C(a,t)}{\partial a} + \frac{\partial C(a,t)}{\partial t} = \beta(a) T(a,t) - [\nu(a) + \tau(a) + \mu(a)] C(a,t) ,   (24)
\frac{\partial I(a,t)}{\partial a} + \frac{\partial I(a,t)}{\partial t} = V_c(a,t) S(a,t) + \varepsilon T(a,t) + \nu(a) C(a,t) - \mu(a) I(a,t) .   (25)
Once all the parameters have been estimated from the data of epidemiological studies,41–43 the probabilities or variables S(a, t), L(a, t), T(a, t), C(a, t) and I(a, t) at age a and time t can be calculated by integrating the partial differential equations. These estimates can describe the dynamics of HBV transmission in the population in the pre-vaccination period, or predict the trend under different vaccination coverages Vc(a, t) in the population. Detailed information about HBV modeling and parameter estimation can be found in references 44 and 45. Deterministic models for other diseases such as malaria and helminth infections can be found in Heesterbeek and Roberts.9 More details on deterministic compartment models can be found in the books by Bailey,46 Becker,47 Anderson and May,48 and Daley and Gani.49

2.4. Stochastic models for epidemic transmission

2.4.1. Branching processes

In cases in which it is reasonable to assume an unlimited pool of susceptibles, for instance during the initial stage of an epidemic, a branching process can be used to model the spread of infection. Let Y0 denote the initial number of infectives at generation 0. These Y0 individuals infect Y1 individuals as the next generation; in turn, these Y1 individuals infect Y2 individuals as the following generation, and so on. Let Z denote the number of infections directly caused by one individual, a random variable with mean µ, variance σ2 and probability density function g(z). Thus, for each i, Yi = Z1 + · · · + Z_{Y_{i−1}}, where the Zj are independent variables with density g(z). Harris50 proposed a nonparametric maximum likelihood estimator for µ:

\hat{\mu} = \sum_{i=1}^{k} Y_i \Big/ \sum_{i=1}^{k} Y_{i-1} .
The properties of this estimator are discussed by Keiding.51 Becker52 suggested an alternative estimator:

\hat{\mu} = \begin{cases} (Y_k / Y_0)^{1/k} & \text{if } Y_k > 0 , \\ 1 & \text{if } Y_k = 0 . \end{cases}
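A minimal Python sketch of these two estimators, applied to a hypothetical chain of generation sizes (the numbers are invented purely for illustration):

```python
import numpy as np

# Hypothetical generation sizes Y_0, Y_1, ..., Y_k of an epidemic chain
Y = np.array([2, 5, 9, 14, 30])
k = len(Y) - 1

# Harris estimator: total offspring divided by total parents
mu_harris = Y[1:].sum() / Y[:-1].sum()

# Becker estimator: geometric growth rate over k generations (1 if extinct)
mu_becker = (Y[k] / Y[0]) ** (1.0 / k) if Y[k] > 0 else 1.0

print(mu_harris, mu_becker)
```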
Note that the expected number of infections caused by one individual, µ, plays the same role as the basic reproduction ratio (R0 ) in deterministic models. It can be shown that if µ ≤ 1, the process will become extinct with
probability one.50 Inferences for branching processes are usually conditional on extinction or non-extinction. Heyde53 suggested a Bayesian approach which allows the extinction (µ ≤ 1) and non-extinction (µ > 1) cases to be treated without distinction. Becker47 gave several applications of branching processes to smallpox epidemics.

2.4.2. Chain binomial models

The branching process is unsuitable for epidemics within small communities such as households. In this context, chain binomial models are more appropriate. The chain binomial model, also referred to as the Reed-Frost epidemic model, was introduced by the biostatistician Lowell J. Reed and the epidemiologist Wade Hampton Frost around 1930 as a teaching tool at Johns Hopkins University. Although they did not publish their results formally, their model was introduced in later publications.54,55
Consider a fixed community (such as a household, a group of sexual partners, or a needle-sharing group) with n individuals. At generation k there are Xk susceptibles exposed to Yk infectives. The distribution of the number of infectives in the next generation, Yk+1, conditional on Xk and Yk, is binomial:

\Pr(Y_{k+1} = z \mid X_k = x, Y_k = y) = \frac{x!}{z!(x-z)!} \, p_k^z (1 - p_k)^{x-z} ,

where pk is the probability that a susceptible of generation k will acquire infection from one of the yk infectives. The parameter pk can be modeled under different assumptions. One assumption, due to Reed and Frost,54 is that contacts with infectives occur independently, so that pk = 1 − (1 − π)^{yk}, where π is the probability of infection from contact with an infective. An alternative assumption, due to Greenwood,7 is that the probability of infection does not depend on the number of infectives to which the susceptible is exposed, so that

p_k = \begin{cases} \pi & \text{if } y_k \ge 1 , \\ 0 & \text{otherwise} . \end{cases}

Under this assumption, the model is usually called the Greenwood chain binomial model. More complicated models for pk can be developed for complicated transmission mechanisms such as HIV infection. Some other extensions of the Reed-Frost model can be found in Longini and Koopman.56
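The Reed-Frost mechanism is easy to simulate; the following Python sketch generates one epidemic chain in a small community under an assumed per-contact infection probability (all numbers are illustrative, not taken from any study cited here):

```python
import numpy as np

rng = np.random.default_rng(1)

def reed_frost_chain(x0, y0, pi, rng):
    """Simulate one Reed-Frost epidemic chain.

    x0: initial susceptibles, y0: initial infectives,
    pi: assumed probability of infection per contact with an infective.
    Returns the list of infective counts Y_0, Y_1, ... until the chain dies out.
    """
    x, y = x0, y0
    chain = [y]
    while y > 0 and x > 0:
        p = 1.0 - (1.0 - pi) ** y          # Reed-Frost infection probability
        y_next = rng.binomial(x, p)        # new infectives in the next generation
        x -= y_next
        y = y_next
        chain.append(y)
    return chain

# A household of 5 with one introductory case and an assumed pi of 0.3
print(reed_frost_chain(x0=4, y0=1, pi=0.3, rng=rng))
```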
Inference for chain binomial models is usually based on likelihood methods; see Bailey,46 Becker,47 Longini and Koopman,56 Longini et al.57 and Saunders.58 A brief introduction can be found in Longini,59 and for an updated review see Becker and Britton.60

2.4.3. Stochastic compartment models

The stochastic version of the SIR model (13) is useful for capturing stochastic features of epidemics in a small population or in the early stage of an epidemic. Consider S(t) and I(t) as the numbers (rather than the proportions) of susceptibles and infectives respectively. Then the transition probabilities in a short time interval (t, t + δt) are

Pr[S(t + δt) = S(t) − 1; I(t + δt) = I(t) + 1] = βS(t)I(t)δt ,   (26)
Pr[S(t + δt) = S(t); I(t + δt) = I(t) − 1] = (r + dI)I(t)δt ,   (27)
Pr[S(t + δt) = S(t) + 1; I(t + δt) = I(t)] = µδt ,   (28)
Pr[S(t + δt) = S(t) − 1; I(t + δt) = I(t)] = dS S(t)δt .   (29)
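These rates define a continuous-time Markov chain that can be simulated directly by the Monte Carlo methods discussed next; a minimal Gillespie-type Python sketch, with all rate values chosen purely for illustration, is:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed rates for the stochastic SIR model (26)-(29)
beta, r, dI, mu, dS = 0.002, 0.10, 0.02, 0.5, 0.01
S, I, t, t_end = 200, 5, 0.0, 200.0

while t < t_end and I > 0:
    # Event rates: infection, removal/death of an infective, birth, death of a susceptible
    rates = np.array([beta * S * I, (r + dI) * I, mu, dS * S])
    total = rates.sum()
    t += rng.exponential(1.0 / total)          # waiting time to the next event
    event = rng.choice(4, p=rates / total)     # which event occurs
    if event == 0:
        S, I = S - 1, I + 1
    elif event == 1:
        I -= 1
    elif event == 2:
        S += 1
    else:
        S -= 1

print("time:", round(t, 1), "susceptibles:", S, "infectives:", I)
```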
The solution of this stochastic system is not straightforward; Monte Carlo methods may be used to solve it.61–63 Tan and Hsu64 proposed a stochastic SEIR model (including a latent stage of infection) for the AIDS epidemic. Recently, Wu and Tan65 suggested a multi-stage stochastic model (the chain multinomial model) for the AIDS epidemic in a homosexual population. Here we briefly introduce this model.
Let S(t), Ir(t), and A(t) denote the numbers of susceptible people, people in the rth infection stage (r = 1, 2, . . . , k), and people with onset of AIDS at time t, respectively. Then we are entertaining a (k + 2)-dimensional discrete stochastic process X(t) = [S(t), I1(t), I2(t), . . . , Ik(t), A(t)]^T, where [·]^T denotes the transpose of a vector or matrix. To formulate the dynamic model (the chain multinomial model) for this process, let αS(t) denote the conditional probability of S → I1 given X(t) during [t, t + 1); the notation for the other transition probabilities and for the numbers of the various transitions of the HIV epidemic is given in Table 1. By using the chain multinomial model, we obtain the following stochastic difference equations:

S(t + 1) = R_S(t) + S(t) - F_S(t) - D_S(t) ,   (30)
Table 1. Notation for transitions of the HIV epidemic in homosexual populations during [t, t + 1).

Transition                                   Transition probability    Transition numbers
Immigration → S                              µS(t)                     RS(t)
Immigration → Ir                             µr(t)                     RIr(t)
S → I1                                       αS(t)                     FS(t)
Ir → Ir+1, r = 1, 2, . . . , k − 1           αr(t)                     FIr(t)
I1 → S                                       β1(t) = 0                 BI1(t) = 0
Ir → Ir−1, r = 2, 3, . . . , k               βr(t)                     BIr(t)
A → Ik                                       βk+1(t) = 0               BA(t) = 0
Ir → A, r = 1, 2, . . . , k                  ωr(t)                     AIr(t)
S → Death                                    dS(t)                     DS(t)
Ir → Death                                   dr(t)                     DIr(t)
A → Death                                    dA(t)                     DA(t)
I_r(t + 1) = R_{I_r}(t) + F_{I_{r-1}}(t) + B_{I_{r+1}}(t) + I_r(t) - [F_{I_r}(t) + B_{I_r}(t) + A_{I_r}(t) + D_{I_r}(t)] ,   (31)

A(t + 1) = \sum_{r=1}^{k} A_{I_r}(t) + A(t) - D_A(t) ,   (32)
where r = 1, 2, . . . , k, and FI0(t) = FS(t), FIk(t) = 0. The distributional properties of the quantities in the equations are as follows:
• RS(t) ∼ Binomial[S(t), µS(t)], independent of FS(t) and DS(t).
• [FS(t), DS(t)]|X(t) ∼ Multinomial[S(t); αS(t), dS(t)].
• RIr(t)|Ir(t) ∼ Binomial[Ir(t); µr(t)], independent of FIr(t), BIr(t), AIr(t) and DIr(t).
• [FI1(t), AI1(t), DI1(t)]|X(t) ∼ Multinomial[I1(t); α1(t), ω1(t), d1(t)].
• [FIr(t), BIr(t), AIr(t), DIr(t)]|X(t) ∼ Multinomial[Ir(t); αr(t), βr(t), ωr(t), dr(t)], for r = 2, . . . , k − 1.
• [BIk(t), AIk(t), DIk(t)]|X(t) ∼ Multinomial[Ik(t); βk(t), ωk(t), dk(t)].
• DA(t)|A(t) ∼ Binomial[A(t), dA(t)].
Equations (30)–(32) provide an avenue for computing the probability distributions of X(t). Although the exact probability distributions of X(t) are quite complicated, one may use these equations to derive equations for the means, variances and higher cumulants of X(t), as well as other results. Wu and Tan65 also proposed using a state-space model to
approximate the above stochastic model, so that the Kalman filter can be used for estimation and projection; see Wu and Tan65 for details.

3. Viral Dynamic Models

Recently, much attention has been paid to modeling the interaction and dynamics of virus and the immune system at the cellular level within a host. This has brought about a breakthrough in studying the pathogenesis of HIV, HBV and HCV infections. In this section, we briefly introduce the basic models and their extensions, and summarize the important results obtained by applying these models to clinical data. Some statistical methods for parameter estimation are also briefly introduced.

3.1. HIV dynamics

Modeling HIV dynamics within a host can be traced back to the 1980s.66–68 In the early stage of HIV modeling, the focus was on understanding the mechanism and pathogenesis of HIV infection and antiviral drug action using computer simulations based on the developed models. When simplified versions of these complicated simulation models were successfully applied to clinical data in the last several years,69–72 they led to a new understanding of the pathogenesis of HIV infection. Mathematical models and statistical methods played an important role in this breakthrough. Here we briefly introduce the models and the results.
In the seminal papers, Ho et al.69 and Wei et al.70 proposed simple compartment models (one or two compartments) for their clinical data on plasma HIV viral load (the number of RNA copies) in HIV-1-infected patients treated with potent antiviral agents. In Ho et al.,69 a simple one-compartment model, dV/dt = P − cV, was proposed, where V denotes the concentration of virus (measured by the number of HIV RNA copies per ml of plasma), P denotes the production rate of virus, and c denotes the clearance rate of virus. If we assume that the antiviral treatment is perfect, i.e. P = 0 after initiation of a potent antiviral treatment, the solution to the above ordinary differential equation is V = V0 e^{−ct}, where V0 is the initial viral concentration. When we have repeated measurements of V on individual patients, we can fit a nonlinear model, Y(t) = V0 e^{−ct} + ε, or a linear model on the log scale, Y(t) = log(V0) − ct + ε, to obtain the parameter estimate of c. The mean life-span and half-life of HIV can then be estimated by 1/c and ln 2/c respectively. Ho et al.69 applied this simple method to 20 HIV-infected patients, and they found that the half-life of HIV (in fact, it is the
half-life of productively infected cells) is 2.1 ± 0.4 days, with a range of 1.3 to 3.3 days. Wei et al.70 obtained similar results. This estimated rapid turnover rate of HIV virus (or of infected cells) has important implications for HIV therapy and pathogenesis. One of the implications is that the rapid turnover of HIV may generate viral diversity and increase the opportunities for viral escape from antiviral agents. This motivated the idea of therapy with a combination of several antiviral agents (the so-called "cocktail" therapy).
To refine the estimate of viral replication, Perelson et al.71 considered a more complicated compartment model for HIV-infected patients treated with more potent protease inhibitor (PI) antiviral agents. The mechanism of PI antiviral action is to block the replication (generation) of infectious virus. Under the assumption of perfect PI treatment, the model can be written as

\frac{dT^*}{dt} = k V_I T - \delta T^* ,
\frac{dV_I}{dt} = -c V_I ,
\frac{dV_{NI}}{dt} = N \delta T^* - c V_{NI} ,

where T represents the concentration of uninfected CD4+ T cells; T* denotes the concentration of productively infected T cells; VI denotes the concentration of infectious virions; VNI denotes the concentration of noninfectious virions; c denotes the clearance rate of virus; δ denotes the clearance rate of infected cells; and k is the infection rate. A closed-form solution of the above differential equations under the assumption of constant T (which is reasonable in the initial stage of infection) can be obtained:

V(t) = V_I(t) + V_{NI}(t) = V_0 \exp(-ct) + \frac{c V_0}{c - \delta} \left\{ \frac{c}{c - \delta} [\exp(-\delta t) - \exp(-ct)] - \delta t \exp(-ct) \right\} .
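As an illustration of how such a closed-form solution can be confronted with viral load data, the following Python sketch fits it to simulated measurements by nonlinear least squares on the log scale. The noise level, parameter values and sampling times are invented purely for illustration and do not correspond to any data set cited in this chapter.

```python
import numpy as np
from scipy.optimize import curve_fit

def log_viral_load(t, V0, c, delta):
    # Closed-form total-virus solution under perfect PI treatment (constant T)
    term = (c / (c - delta)) * (np.exp(-delta * t) - np.exp(-c * t)) - delta * t * np.exp(-c * t)
    V = V0 * np.exp(-c * t) + (c * V0 / (c - delta)) * term
    V = np.maximum(V, 1e-10)          # guard against non-physical values during optimization
    return np.log10(V)

# Simulate noisy log10 viral load at assumed sampling times (days)
rng = np.random.default_rng(3)
t_obs = np.array([0.25, 0.5, 1, 2, 3, 5, 7, 10, 14, 21, 28], dtype=float)
true_V0, true_c, true_delta = 1e5, 3.0, 0.5
y_obs = log_viral_load(t_obs, true_V0, true_c, true_delta) + rng.normal(0, 0.1, t_obs.size)

# Fit by nonlinear least squares; p0 is a rough starting guess
popt, pcov = curve_fit(log_viral_load, t_obs, y_obs, p0=[5e4, 2.0, 0.3])
V0_hat, c_hat, delta_hat = popt
print("half-life of free virus (days):", np.log(2) / c_hat)
print("half-life of infected cells (days):", np.log(2) / delta_hat)
```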
When frequent repeated measurements of V(t) are available, a nonlinear model, Y(t) = V(t) + ε, can be fitted to obtain estimates of important parameters such as the clearance rate of virus (c) and of infected cells (δ). Perelson et al.71 applied this method to data from 5 HIV-infected individuals and obtained a refined estimate of the half-life of free HIV of 0.24 ± 0.06 days (about 6 hours on average), which is much shorter than the
previous estimates of Ho et al.69 and Wei et al.70 The estimated half-life of infected cells is 1.55 ± 0.57 days. Furthermore, Perelson et al.72 developed a compartment model for the observed biphasic viral load data. They attributed the first decay phase to viral replication from productively infected cells such as CD4+ T cells, and the second phase to latently infected or long-lived infected cells such as macrophages or dendritic cells. Based on clinical data, Perelson et al.72 estimated that the half-life of short-lived productively infected cells is about 1.1 ± 0.4 days, that of long-lived infected cells is 14.1 ± 7.5 days, and that of latently infected cells is 8.5 ± 4.0 days. Using their model and the estimated results, they predicted that it might take 2.3–3.1 years to eliminate HIV with potent antiviral therapies, although this estimate was later shown to be too optimistic.
Wu and Ding73 recently proposed a unified approach for modeling observed HIV dynamic data. First, they proposed a comprehensive mathematical model for HIV dynamics considering all the potential cell and virus compartments: (1) uninfected target cells, such as T cells, macrophages, lymphoid mononuclear cells (MNCs), and tissue Langerhans cells, which are possible targets of HIV-1 infection; (2) mysterious infected cells, i.e. infected cells other than T cells, such as tissue Langerhans cells and microglial cells, whose behavior is not yet completely understood; (3) long-lived infected cells, such as macrophages, that are chronically infected and long-lived; (4) latently infected cells, infected cells that contain the provirus but are not producing virus immediately, and only start to produce virus when activated; (5) productively infected cells, infected cells which are actively producing virus; (6) infectious virus, virions that are functional and capable of infecting target cells; and (7) noninfectious virus, virions that are dysfunctional and cannot infect target cells. We denote the concentrations of these cells and viruses by T, Tm, Ts, Tl, Tp, VI, and VNI respectively.
Without the intervention of antiviral treatment, the uninfected target cells may either decrease due to HIV infection or remain in an equilibrium state due to the balance between the regeneration and proliferation of uninfected target cells and HIV infection. Some uninfected target cells (T) are infected by infectious virus (VI) and may become mysterious infected cells (Tm), long-lived infected cells (Ts), latently infected cells (Tl) or productively infected cells (Tp), with proportions αm kVI, αs kVI, αl kVI, and αp kVI respectively, where αm + αs + αl + αp = 1. The latently infected T cells may be stimulated to become productively infected cells at a rate of
δl. The infected cells Tm, Ts and Tp are killed by HIV at rates δm, δs and δp respectively, after producing an average of N virions per cell during their lifetimes. The infected cells Tm, Ts and Tl may also die, at rates µm, µs and µl respectively, without producing virus. We assume that the proportion of noninfectious virus produced by infected cells is η without the intervention of protease inhibitor (PI) antiviral drugs. The elimination rates for infectious and noninfectious virus are assumed to be the same, say c.
We assume that the antiviral therapy consists of one or more protease inhibitor (PI) drugs and reverse transcriptase inhibitor (RTI) drugs. We model the effect of the RTI drugs by reducing the infection rate from k0 to (1 − γ)k0, where 0 ≤ γ ≤ 1. Parameter γ reflects the RTI drug efficacy: if γ = 0, the RTI drugs have no effect; if γ = 1, the RTI drugs are perfect and completely block HIV infection. The PI drugs are assumed to be so potent that the production of infectious virions is almost completely blocked, except for a small fraction. To account for compartments that the PI drugs cannot reach and for persistent virus whose production the PI drugs cannot completely block, we include an additional virus production term with a constant (average) rate, P, in the model. If only a small fraction of persistent virus can escape the attack of the PI drugs, it may be considered a Poisson process, and thus can also be modeled by a constant production term in a deterministic model. Thus, after initiation of combination treatment with PI and RTI drugs, the HIV dynamic model can be written as

\frac{d T_m}{dt} = (1 - \gamma) \alpha_m k_0 T V_I - \delta_m T_m - \mu_m T_m ,
\frac{d T_s}{dt} = (1 - \gamma) \alpha_s k_0 T V_I - \delta_s T_s - \mu_s T_s ,
\frac{d T_l}{dt} = (1 - \gamma) \alpha_l k_0 T V_I - \delta_l T_l - \mu_l T_l ,
\frac{d T_p}{dt} = (1 - \gamma) \alpha_p k_0 T V_I + \delta_l T_l - \delta_p T_p ,   (33)
\frac{d V_I}{dt} = (1 - \eta) P - c V_I ,
\frac{d V_{NI}}{dt} = \eta P + N \delta_m T_m + N \delta_s T_s + N \delta_p T_p - c V_{NI} ,
where αm + αs + αl + αp = 1. Under the assumption of constant T and perfect treatment, the system of Eq. (33) can be solved analytically, and the solution for the total virus V = VI + VNI has the form (see the
Appendix in Wu and Ding73),

V(t) = P_0 + P_1 e^{-\lambda_1 t} + P_2 e^{-\lambda_2 t} + P_3 e^{-\lambda_3 t} + P_4 e^{-\lambda_4 t} + (P_5 + P_6 t) e^{-\lambda_5 t} + P_7 e^{-\lambda_6 t} + P_8 e^{-\lambda_7 t} ,   (34)
where Pi, i = 0, . . . , 8, are functions of the model parameters, and λ1 = δp, λ2 = δm + µm, λ3 = δs + µs, λ4 = δl + µl, λ5 = c, λ6 = r, and λ7 = c + r. At time t = 0, V(0) = \sum_{i \ne 6} P_i. Parameter Pi represents the initial viral production rate, and parameter λi represents the exponential decay rate of virus due to the corresponding compartment.
This model is too complicated (it has too many parameters) to be used in practice, and Wu and Ding73 suggested using simplified versions of the model based on the available data. For example, if only biphasic data are available, a bi-exponential model, V(t) = P1 e^{−δp t} + P2 e^{−λl t}, or a one-exponential-plus-constant model, V(t) = P0 + P1 e^{−δp t}, may be used.73 To fit the model with sparse individual data, Wu et al.74 and Wu and Ding73 also proposed a nonlinear mixed-effects modeling approach.75,76 The two-stage nonlinear mixed-effects model is briefly introduced as follows.
Stage 1. Intra-patient variation in viral load measurements:

y_{ij} = \log(V(t_{ij}, \beta_i)) + e_{ij} , \quad e_i | \beta_i \sim (0, R_i(\beta_i, \xi)) ,   (35)
where yij is the log-transformed total viral load measurement for the ith patient at the jth time point tij, i = 1, . . . , m; j = 1, . . . , ni. The log-transformation of the raw data is used to stabilize the variance (the transformed data are also closer to normally distributed). The function V(tij, βi) is a nonlinear function of treatment time t, which may be selected based on the available data and model assumptions; see Wu and Ding73 for details.
Stage 2. Inter-patient variation:

\beta_i = \beta + b_i .   (36)
The population parameters are β, and the random effects are bi ∼ (0, D). More detailed inference procedures for nonlinear mixed-effects models can be found in the books by Davidian and Giltinan75 and Vonesh and Chinchilli.76 A comparison of viral dynamic model-fitting procedures can be found in Ding and Wu.77 Applications to more adult HIV-1-infected patients and to pediatric patients can be found in Wu et al.78 and Luzuriaga et al.79 Recently, Ding and Wu80 and others have suggested using viral dynamics (decay rates) to evaluate the potency of antiviral therapies. Statistical
methods to implement this idea have been proposed by Ding and Wu.81 Modeling of drug resistance can be found in Nowak et al.82 and others. For a good review of viral dynamic modeling and its extensions, see Perelson and Nelson83 and Nowak and May.84

3.2. Hepatitis virus dynamics

Following the success of HIV dynamic modeling, similar studies have been done for the hepatitis B and C viruses (HBV and HCV). For example, Nowak et al.85 proposed a simple compartment model for HBV dynamics. Let X, Y and V be uninfected cells, infected cells and free HBV virus respectively; then a mathematical model for HBV dynamics is

\dot{X} = \lambda - \beta X V - d_x X ,   (37)
\dot{Y} = \beta X V - d_y Y ,   (38)
\dot{V} = \alpha Y - d_v V ,   (39)
where λ is the production rate of susceptible cells. Uninfected cells die at rate dx X and become infected at rate βXV. Infected cells are produced at rate βXV and die at rate dy Y. Free virions are produced from infected cells at rate αY and are removed at rate dv V. Nowak et al.85 assumed that the potent treatment (the reverse transcriptase inhibitor lamivudine) is perfect, i.e. α = β = 0. Thus, V(t) = V0 exp(−dv t) and Y(t) = Y0 exp(−dy t). If the treatment is not perfect (as is more likely in reality), a model for free virions is V(t) = V0[1 − r + r exp(−dv t)], where r is an efficacy parameter that can be estimated from the viral load data. Nowak et al.85 fitted clinical data to these models and found that the half-life of free HBV virions is about 1 day, while the half-life of infected cells ranges from 10 to 100 days across patients. Many researchers have studied HCV dynamics using models similar to the HBV model, under antiviral treatment with interferon-α.86–90 The recent report by Neumann90 showed that the half-life of free HCV virions was, on average, 2.7 hours, while the half-life of infected cells was 1.7 to 70 days. All the modeling techniques and statistical methods for HIV dynamics are applicable to both HBV and HCV with minor modifications.

4. Intervention and Prevention

Intervention and prevention measures are critical for stopping the epidemics of infectious diseases. To evaluate the effectiveness of intervention
and prevention methods, clinical trials are usually conducted. Since other chapters have addressed the general methods of clinical trials, here we only emphasize some special features of clinical trials for infectious diseases, in particular, AIDS clinical trials.
4.1. Medical intervention

The general design of clinical trials can be found in clinical trial textbooks.91–94 One of the most important issues in clinical trials is the selection of an endpoint which can be used to measure the effectiveness of interventions such as medical treatments. In general, an endpoint of a clinical trial should possess the following properties: (i) it is relevant to the treatment effectiveness and easy to interpret; (ii) it is clinically apparent and easy to diagnose (or measure); and (iii) it is sensitive to treatment differences. An earlier discussion of the choice of an endpoint for AIDS clinical trials can be found in Amato and Lagakos.95
In the early stage of AIDS clinical trials (before 1994), the time to progression to AIDS or survival (time to death) was used. Since many different types of events or symptoms were defined as AIDS, the endpoints of progression to AIDS are often referred to as "combined endpoints". After reviewing and comparing the clinical data, Neaton et al.96 recommended that survival, rather than combined endpoints, be the preferred primary endpoint of antiviral trials. However, due to the long incubation period of AIDS, a trial requires long-term follow-up if survival is used as the endpoint. Recently, surrogate markers such as viral load (HIV RNA copies) or CD4+ cell counts have been proposed and used as endpoints after validation of these markers (it is beyond the scope of this chapter to discuss how to validate a surrogate marker; see Prentice97 for details).
The commonly used primary endpoints in recent AIDS clinical trials are viral-load-based endpoints, which include: (i) the magnitude of reduction in viral load (HIV-1 RNA level) from baseline to a prespecified primary follow-up time (e.g. Week 24 or Week 48); (ii) the proportion of patients having viral load below the limit of quantification of the assay being used at the primary follow-up time; and (iii) the durability based on the time-to-virologic-failure (the time until plasma viral load becomes detectable again). Marschner et al.98 studied the first endpoint, the magnitude of reduction in viral load, and proposed statistical methods to deal with censored (below the detection limit) viral load measurements. They argued that the dichotomous endpoint (the proportion of patients having viral load below the limit
of detection) was more straightforward and less subject to bias than the analysis of the magnitude of viral load reduction, and that standard methods of analysis for binary data would be appropriate. However, the dichotomous endpoint may have lower power to detect a treatment difference than the endpoint based on the actual magnitude of viral load change: by classifying virologic responses as either successes or failures, information is lost regarding the degree of virologic response. On the other hand, the endpoint based on the magnitude of viral load change involves a complicated censored-data problem due to the detection limit of viral load assays, which may introduce bias; see Marschner et al.98 and Hughes.99
Gilbert et al.100 studied the time-to-virologic-failure endpoints. They recommended the endpoint of time-to-virologic-failure from randomization due to its advantages in flexibility and sample size. They argued that the time-to-failure endpoint is generally more powerful than the binary endpoint, and that it is flexible for evaluating covariate effects and for extending the study by prolonging the follow-up period. Also, the interpretation of a time-to-failure endpoint is closer to clinical practice than that of a binary endpoint, since physicians monitor viral load levels in patients over time for treatment management. Other endpoints, such as the area under the curve (AUC) of viral load change, the time-to-below-detection in viral load, and viral dynamic parameters (viral decay rates), have also been suggested, but are not widely used in large AIDS clinical trials. For a comparison of some of these endpoints, see Weinberg and Lagakos.101 To evaluate the short-term potency of antiviral therapies using viral dynamic parameters, see Ding and Wu.80,81
Although the general clinical trial design methods can be used in most AIDS clinical studies, some new issues have arisen from the complicated treatments for AIDS patients. See De Gruttola et al.102 and Hughes99 for some discussion of the design issues in AIDS clinical trials. Also note that computer-assisted design techniques or clinical trial simulations (CTS)103 may be useful for designing complicated clinical trials. Successful medical interventions will result in changes of epidemic patterns.104–106 How to evaluate epidemic trends under medical interventions is challenging; in this regard, computer simulations based on epidemic models that incorporate treatment effects will be helpful.
Clinical studies on HBV and HCV are currently very active. HBV patients are treated with antiviral agents such as lamivudine and famciclovir,85 and HCV patients are treated with interferon (IFN) and
ribavirin therapy.90 New anti-HBV and anti-HCV agents are under development, and some of them are already in clinical trials. The methods for studying anti-HIV medical interventions are generally applicable to HBV and HCV.

4.2. Prevention

Preventive interventions are widely used to stop or reduce the epidemics of infectious diseases. Prevention measures include "lifestyle" maneuvers such as changes in social behavior to reduce the risk of exposure to infectives, and, for sexually transmitted infectious diseases, modifications of sexual behavior via public education and advertising. Another effective prevention measure is vaccination. Evaluating the effectiveness of these prevention measures is very challenging in terms of designing and implementing a prevention study, due to the high cost and long-term follow-up required.

4.2.1. Prevention trials

A prevention trial is different from a standard randomized clinical trial. The goal of prevention trials for infectious diseases is to evaluate the effectiveness of prevention measures in protecting individuals from infection. If the infection rate for a particular infectious disease is low in a community or population, a prevention trial generally needs a large sample size, with tens of thousands of subjects. If it takes time for the prevention measures to start to work, or for an individual to acquire the infection via exposure to infectives, the trial may require a long duration, such as several years of follow-up, to evaluate the effectiveness of the prevention programs. Thus, prevention studies are logistically challenging and costly.
Traditionally, an observational study, for example a cohort or community study, is employed to evaluate prevention programs (a historical control may be used in this case). When prevention measures are taken in a cohort or a community, the infection rate may be evaluated within a prespecified time period and then compared with a historical infection rate in this cohort or community. This kind of observational study is subject to several problems, such as within-subject variation, measurement errors, confounding factors, and adherence to the prevention measures. For example, in evaluating the promotion of condom use and education about safe sexual practices in a community of homosexual men to prevent HIV infection, not all subjects in the study adhere to condom use during the study, and their
sexual behavior may change during the study period (within-subject variation). Also, the epidemic course of an infectious disease under the prevention program may be confounded with other factors. Causal inferences about epidemiologic associations with the corresponding preventive strategies are also subject to measurement error. Ideally, a randomized controlled prevention trial is the most effective and informative way to evaluate prevention programs, and it avoids the difficulties of observational studies. In a randomized prevention trial, study subjects may be selected from a community at high risk for the infectious disease to reduce the sample size and cost. For example, homosexual men, IV drug users, and sex workers are high-risk communities targeted for HIV prevention. The design, conduct, monitoring, and analysis of randomized prevention trials are similar to those of standard therapeutic clinical trials. However, in some cases it may be neither ethical nor practical to conduct a randomized prevention trial. For example, needle sharing is a confirmed cause of HIV infection among IV drug users, and safe sex practices such as condom use may reduce the risk of sexually transmitted diseases, so public education or advertisement about these facts is a good prevention measure; however, it is neither ethical nor practical to randomize high-risk subjects to an arm without access to education or advertisement about needle-sharing risk and safe sex. For more discussion on prevention studies, see Prentice107 and Jacobs.108
4.2.2. Vaccine studies

Vaccine studies are designed to evaluate different effects of vaccination during different stages of vaccine development. The major purpose of a vaccine is to protect the vaccinated person against infection, or to reduce the severity or risk of disease progression after infection. Successful vaccination can reduce person-to-person transmission and change the pattern of epidemics of an infectious disease within a population, by reducing the infectiousness of a vaccinated person or by preventing individuals from becoming infected. Thus, vaccination is an important intervention tool against the epidemics of infectious diseases and has contributed greatly to public health. It is important to understand the biological background of a vaccine in order to evaluate its efficacy. A vaccine is usually composed of an antigen and an adjuvant. The antigen contains either a piece of or the whole infectious agent in question and is the component of the vaccine that induces
the immune response which is specific to the infectious agent. An adjuvant may increase the immunogenicity of the antigen. Thus, active immunization by vaccination does not itself prevent infection or disease; rather, the immune responses induced by vaccination interfere with infection or disease. For this reason, the efficacy of a vaccine also depends on the condition of the host's immune system. An important part of vaccine studies is the evaluation of the immunogenicity of the vaccine, which is the ability of the vaccine to produce a measurable immune response in a host. Three different types of population-level effects of vaccination are identified. The indirect effects are the effects or benefits for those people in the targeted population who do not receive the vaccine. The total effects in vaccinated individuals are the combination of the indirect effects with the individual-level effects of vaccination. The overall public health effect of vaccination in the entire population of interest is a weighted average of the indirect effects on the unvaccinated people and the total effects on the vaccinated people. Vaccine studies can be used to evaluate the indirect, total, or overall effects of vaccination in a population. Note that safety is also an important aspect of the evaluation of vaccines, since vaccination can cause side effects through stimulation of the immune system. A general definition of vaccine efficacy is the percentage reduction in the attack rate attributable to the vaccine, or
$$VE = \frac{p_u - p_v}{p_u} = 1 - \frac{p_v}{p_u} = 1 - \rho\,,$$
where $p_u$ and $p_v$ denote the risk of infection in unvaccinated and vaccinated individuals, respectively, and $\rho = p_v/p_u$ is the relative risk of infection. Alternatively, the vaccine efficacy can be defined in terms of the relative hazard of infection:
$$VE = 1 - \frac{\lambda_v}{\lambda_u}\,,$$
where $p_u = 1 - \exp(-\lambda_u t)$ and $p_v = 1 - \exp(-\lambda_v t)$. Vaccine efficacy can also be defined based on the cumulative incidence rates (attack rates) at the end of a study,
$$VE = 1 - \frac{C_v}{C_u}\,,$$
where $C_v$ and $C_u$ denote the cumulative incidence (infection) rates for vaccinated and unvaccinated individuals. To define the vaccine efficacy for infectiousness, we need the infection rate in exposures to vaccinated individuals ($r_v$) and to unvaccinated individuals ($r_u$); then
$$VE = 1 - \frac{r_v}{r_u}\,.$$
These definitions are used to evaluate vaccine efficacy in vaccine clinical trials and in the field. Vaccine efficacy can be estimated in a cohort study or a clinical trial involving $n_v$ and $n_u$ individuals in the vaccinated and unvaccinated cohorts or groups, respectively. Assume that $r_v$ and $r_u$ are the numbers of infection cases in the vaccinated and unvaccinated cohorts, respectively, during a prespecified follow-up period; then the vaccine efficacy is estimated as
$$\widehat{VE} = 1 - \frac{r_v/n_v}{r_u/n_u}\,.$$
Vaccine efficacy can also be estimated from a case-control study using standard case-control methodology. The screening method may also be used to estimate vaccine efficacy in a population. Let $\theta$ denote the proportion of infection cases arising from vaccinated individuals, and let $\pi$ denote the (known) proportion of the population vaccinated. The vaccine efficacy is estimated by
$$\widehat{VE} = 1 - \frac{\theta}{1-\theta}\cdot\frac{1-\pi}{\pi}\,.$$
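As a small illustration of the two estimators just given, the arithmetic might be coded as follows; all counts and proportions below are hypothetical, chosen only to show the calculation.

```python
# Hypothetical illustration of the cohort and screening-method estimators
# of vaccine efficacy defined above.

def ve_cohort(r_v, n_v, r_u, n_u):
    """VE-hat = 1 - (r_v/n_v)/(r_u/n_u): ratio of attack rates in a cohort or trial."""
    return 1.0 - (r_v / n_v) / (r_u / n_u)

def ve_screening(theta, pi):
    """Screening method: theta = proportion of cases that were vaccinated,
    pi = known vaccination coverage in the population."""
    return 1.0 - (theta / (1.0 - theta)) * ((1.0 - pi) / pi)

# Hypothetical trial: 12 infections among 4000 vaccinees,
# 48 infections among 4000 unvaccinated controls.
print(ve_cohort(12, 4000, 48, 4000))   # 0.75

# Hypothetical surveillance data: 30% of cases vaccinated, 70% coverage.
print(ve_screening(0.30, 0.70))        # about 0.82
```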
In the screening method the vaccination rate $\pi$ is fixed and known, while $\theta$ is estimated. To investigate covariate effects, generalized linear models, such as logistic regression, can be used. Vaccine development and studies can be divided into several phases. Phase 0 is candidate vaccine development. In this early phase of vaccine development the focus is the search for antigen candidates, and a broad range of vaccine candidates may be investigated. Phase I is safety and immunogenicity testing in animals. In this phase, a study is designed to demonstrate the safety and immunogenicity of the vaccine candidate in animals. Whether the vaccine candidate is safe and effective in animals is the primary interest of the study. The sample size of such studies is usually small (only several animals are involved), so exact methods such as Fisher's exact test are usually used for inference. Phase II is safety and immunogenicity testing in humans. Since infectious agents tend to be host-specific, and the immune responses and adverse reactions to a vaccine candidate may differ between animals and humans, safety and immunogenicity studies in humans are required before large-scale trials. Phase II trials may also try to determine dose levels and vaccination schedules. The sample size of Phase II vaccine trials can be very small or as large as several hundred. Experimental challenges with infectious agents are usually not ethical in humans. The use of the immune response as a surrogate for protective immunity is still questionable, since the correlation
between the measurable immune response and the actual protection against infection or disease by the vaccine cannot be confirmed without investigation. Thus, it is a very difficult decision to move a vaccine from Phase II to large-scale Phase III field efficacy testing. In fact, only a very small proportion of vaccine candidates move to Phase III studies, although sometimes a rare Phase IIb study, a small field study, may be conducted. The primary objective of Phase III field trials is to estimate the protective efficacy of vaccination, rather than to test whether there is an effect. A randomized, double-blinded trial is ideal for a Phase III vaccine trial, but may be limited by ethical or implementation problems. Since the number of infection cases depends on the exposure to infection and the transmission rate (which are difficult to estimate), the sample size of Phase III trials is usually difficult to determine, and a liberal strategy may be taken; many vaccine trials have been inconclusive because of unpredicted transmission or exposure rates. If the efficacy and safety of a vaccine are demonstrated in Phase III trials, the vaccine may be licensed by the responsible agency. However, postlicensure Phase IV studies are still needed to evaluate: (i) protective efficacy under normal usage; (ii) safety under normal usage; (iii) duration of protection; and (iv) indirect and overall effects. Postlicensure studies are usually nonrandomized observational studies (subject to potential biases), and case-control studies are commonly used. Since Phase III efficacy trials are often too small to detect rare adverse events of vaccination, much larger postlicensure studies are particularly useful for this purpose. Thus, the design of a vaccine study depends on the scientific questions of interest and the phase of vaccine development, and the corresponding statistical methods need to be selected for different studies. The design of vaccine studies can be traced to the early 19th century.109 More detailed discussions can be found in Smith and Morrow110 and Farrington and Miller.111 Most of the material in this section is taken from Halloran112 and Farrington.8
4.2.3. Mathematical modeling and simulations

Prevention strategies or measures can be included in the deterministic or stochastic epidemic models of infectious diseases introduced in Sec. 2 of this chapter. Computer simulations may be used to evaluate or project how the pattern of epidemics will be changed by effective prevention strategies. Here we introduce an example of HBV infection with vaccine interventions.
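Before turning to the HBV example, a generic sketch may help fix ideas: the snippet below adds a vaccination-coverage parameter to a simple deterministic SIR model and projects the epidemic by Euler integration. The model and all parameter values are hypothetical illustrations and are not the HBV model (21–25) used below.

```python
import numpy as np

def simulate_sir(p_vacc, beta=0.4, gamma=0.1, mu=0.02, years=70, dt=0.01):
    """Deterministic SIR model in which a fraction p_vacc of newborns is
    vaccinated (moved directly to the removed class). Euler integration."""
    S, I, R = 0.9, 0.01, 0.09          # initial proportions (hypothetical)
    for _ in range(int(years / dt)):
        births = mu                     # birth rate balancing the death rate
        dS = births * (1 - p_vacc) - beta * S * I - mu * S
        dI = beta * S * I - gamma * I - mu * I
        dR = births * p_vacc + gamma * I - mu * R
        S, I, R = S + dS * dt, I + dI * dt, R + dR * dt
    return I                            # prevalence of infectives after `years`

# Project the long-run prevalence of infectives under different coverages.
for coverage in (0.0, 0.5, 0.9):
    print(coverage, simulate_sir(coverage))
```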
Fig. 1.
The HBV model (21–25) introduced in Sec. 2 can be used to simulate hepatitis B transmission dynamics before and after vaccination, so we use the model to predict the long-term effectiveness of hepatitis B immunization and to describe the transmission dynamics of HBV in the population. The number of HBV carriers in a vaccinated cohort will decrease sharply. If all newborn babies can be immunized, the proportion of HBV carriers among immunized children will decrease to a very low level (< 1%). Under an immunization program with 100% coverage, the transmission dynamics of HBV carriers can be described by the model (21–25). The majority of HBV carriers will shift gradually from children to the elderly. After the vaccination program has been implemented for 70 years or more, the average HBV carrier rate will decrease to a lower level (Fig. 1). In order to evaluate the impact of the vaccination program on the future incidence rate of hepatitis B, we can also define two incidence ratios: an acute hepatitis B and a chronic hepatitis B incidence ratio. Both of them can be calculated from the dynamics of HBV carriers, that is, as a linear function of the carrier proportion since the implementation of vaccination. The incidence ratio of acute hepatitis B, $R_a(a_1, a_2 : t)$, is the number of acute cases in the age range from $a_1$ to $a_2$ at time t divided by the corresponding number of acute cases at t = 0, the baseline before vaccination. Mathematically,
$$R_a(a_1, a_2 : t) = \int_{a_1}^{a_2} T(a, t)\,da \Big/ \int_{a_1}^{a_2} T(a, 0)\,da\,.$$
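In practice these ratios are computed from model output on a discrete age grid. A minimal numerical version might look like the following, where the array T of acute cases by age and time is a hypothetical placeholder for the output of the HBV model.

```python
import numpy as np

def incidence_ratio(T, ages, a1, a2, t_index):
    """R_a(a1, a2 : t): the integral of T(a, t) over [a1, a2] divided by the
    same integral at baseline t = 0, both evaluated with the trapezoid rule.
    T is a 2-d array indexed by (age grid, time grid)."""
    mask = (ages >= a1) & (ages <= a2)
    numerator = np.trapz(T[mask, t_index], ages[mask])
    denominator = np.trapz(T[mask, 0], ages[mask])
    return numerator / denominator

# Hypothetical model output: acute cases by age (0-90) and year (0-70).
ages = np.arange(0, 91)
T = (np.exp(-((ages[:, None] - 25.0) / 15.0) ** 2)
     * np.exp(-0.05 * np.arange(71)[None, :]))
print(incidence_ratio(T, ages, a1=10, a2=45, t_index=30))
```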
Fig. 2. Ratios (incidence ratio of acute hepatitis B) versus year, for vaccination coverages of 20%, 40%, 60%, 80% and 100%.
Fig. 3. Ratios (incidence ratio of chronic hepatitis B) versus year, for vaccination coverages of 20%, 40%, 60%, 80% and 100%.
The age range in the equation is taken as $a_1 = 10$ and $a_2 = 45$, because the peak of the incidence curve for acute hepatitis B was observed in the age interval 10 to 45 years; the incidence in other age groups is at very low levels. $R_a(a_1, a_2 : t)$ at time t, for different vaccination coverages, is shown in Fig. 2. It decreases steeply at the beginning of the hepatitis B vaccination program: the higher the vaccination coverage, the steeper the decrease in the ratio. The decrease slows down a few years after the start of the vaccination program. The incidence ratio of chronic hepatitis B, $R_c(a_1, a_2 : t)$, is defined as the number of chronic HBV carriers in the age range from $a_1$ to $a_2$ at time t divided by the corresponding number of chronic HBV carriers at t = 0,
the baseline before vaccination. Mathematically,
$$R_c(a_1, a_2 : t) = \int_{a_1}^{a_2} C(a, t)\,da \Big/ \int_{a_1}^{a_2} C(a, 0)\,da\,.$$
In this equation, $a_1 = 25$ and $a_2 = 70$: most chronic liver disease was observed in adults 25 years and older, and the disease incidence in younger age groups was negligible. $R_c(a_1, a_2 : t)$ at time t, for different vaccination coverages, is shown in Fig. 3. It remains almost unchanged at the beginning of the vaccination program and drops rapidly after 25 years of immunization. Again, the decrease in the ratio is closely related to the vaccination coverage.

5. Summary

Infectious diseases are dangerous and threaten the public health as a whole. To evaluate and project the epidemics of an infectious disease is a great challenge to biomathematicians and statisticians, and enormous efforts have been made over the past century. In this chapter we have briefly reviewed epidemic models and methods, especially for HIV and the hepatitis viruses, the two most active research areas. Mathematics and statistics have contributed to understanding the pathogenesis of infectious diseases, particularly the mechanisms of HIV, HBV and HCV infection, in recent years via viral dynamics modeling, and we have introduced these models and the associated statistical methods. Statistics always plays an important role in evaluating interventions and prevention of any disease via clinical trials, and many advanced statistical methods have been developed with such studies as motivation. For infectious diseases, statistics plays an even more critical role in evaluating medical interventions, prevention measures and vaccine efficacy. Clearly, the study of HIV/AIDS has spurred statistical research in infectious diseases over the past two decades. Brief reviews of statistical issues in HIV research can be found in a special issue of the Journal of the Royal Statistical Society A (Vol. 161, Part 2, 1998). A good review of the literature can be found in Foulkes.113

Acknowledgment

The first author was supported by NIAID/NIH grants No. R29 AI43220, RO1 AI45356 and U01 AI38855.

References

1. Last, J. M. (1988). A Dictionary of Epidemiology, Oxford University Press, New York, 27.
2. Bernoulli, D. (1760). Essai d’une nouvelle analyse de la mortalite causee par la petite verole et des avantages de l’inoculation pour la prevenir. Memoires de Mathematiques et de Physique, in Histoire de l’Academie Royale des Sciences, Paris, 1–45. 3. Hamer, W. H. (1906). Epidemic disease in England: The evidence of variability and persistency. Lancet ii: 733–739. 4. Ross, R. (1911). The Prevention of Malaria, 2nd edn. John Murray, London. 5. Kermack, W. O. and McKendrick, A. G. (1927). A contribution to the mathematical theory of epidemics. Proceedings of the Royal Society, Series A115: 700–721. 6. En’ko, P. D. (1889). The Epidemic Course of Some Infectious Diseases. Vrach, St Petersburg, 10: 1008–1010, 1039–1042, 1061–1063 (in Russian). 7. Greenwood, M. (1931). On the statistical measure of infectiousness. Journal of Hygiene, Cambridge, 31: 336–351. 8. Farrington, C. P. (1998). Communicable diseases. In Encyclopedia of Biostatistics, Vol. 1, eds. P. Armitage and T. Colton, John Wiley and Sons, New York, 795–815. 9. Heesterbeek, J. A. P. and Roberts, M. G. (1998). Epidemic models, deterministic. In Encyclopedia of Biostatistics, Vol. 2, eds. P. Armitage and T. Colton, John Wiley and Sons, New York, 1332-1340. 10. Gani, J. (1998). Epidemic models, stochastic. In Encyclopedia of Biostatistics, Vol. 2, eds. P. Armitage and T. Colton, John Wiley and Sons, New York, 1345–1351. 11. Brookmeyer, R. and Gail, M. H. (1986). Minimum size of the acquired immunodeficiency syndrome (AIDS) epidemic in the United States. Lancet 2: 1320–1322. 12. Brookmeyer, R and Gail, M. H. (1988). A method for obtaining short-term projections and lower bounds on the size of the AIDS epidemic. Journal of the American Statistical Association 83: 301–308. 13. Brookmeyer, R. and Gail, M. H. (1994). Journal of Acquired Immune Deficiency Syndromes Epidemiology: A Quantitative Approach. Oxford University Press, New York. 14. Harris, J. E. (1990). Reporting delays and the incidence of AIDS. Journal of the American Statistical Association 85: 915–924. 15. Lawless, J. and Sun, J. (1992). A comprehensive back-calculation framework for the estimation and prediction of AIDS cases. In Journal of Acquired Immune Deficiency syndromes Epidemiology: Methodological Issues, eds. N. Jewell, K. Dietz and C. Farewell, Birkhauser, Boston, 81–104. 16. Bacchetti, P. (1998). Back-calculation. In Encyclopedia of Biostatistics, Vol. 1, eds. P. Armitage and T. Colton, John Wiley and Sons, New York, 235–242. 17. De Gruttola, V. and Lagakos, S. W. (1989). Analysis of doubly-censored survival data, with application to AIDS. Biometrics 45: 1–11. 18. Kim, M. Y., De Gruttola, V. and Lagakos, S. W. (1993). Analysis of doubly censored data with covariates; with application to AIDS. Biometrics 49: 13–22.
19. Jewell, N. P., Malani, H. M. and Vittinghoff, E. (1994). Nonparametric estimation for a form of doubly censored data, with application to two problems in AIDS. Journal of the American Statistical Association 89: 7–18. 20. Jewell, N. P. (1994). Nonparametric estimation and doubly-censored data: General ideas and applications to AIDS. Statistics in Medicine 13: 2081–2096. 21. Sun, J. (1995). Empirical estimation of a distribution function with truncated and doubly interval-censored data and its application to AIDS studies. Biometrics 51: 290–295. 22. Diggle, P. J, Liang, Ky. and Zeger, S. L. (1994). Analysis of Longitudinal Data, Oxford University Press, New York. 23. De Gruttola, V., Lange, N. and Dafni, U. (1991). Modeling the progression of HIV infection. Journal of the American Statistical Association 86: 569–577. 24. Taylor, J. M. G., Cumberland, W. G. and Sy, J. P. (1994). A stochastic model for analysis of longitudinal AIDS data. Journal of the American Statistical Association 89: 727–736. 25. Zeger, S. L. and Diggle, P. J. (1994). Semi-parametric models for longitudinal data with application to CD4 cell numbers in HIV seroconverters. Biometrics 50: 689–699. 26. Shi, M., Weiss, R. E. and Taylor, J. M. G. (1996). An analysis of paediatric CD4 counts for Acquired Immune Deficiency Syndrome using flexible random curves. Applied Statistics 45: 151–163. 27. Rice, J. A. and Wu, C. O. (2001). Nonparametric mixed effects models for unequally sampled noisy curves. Biometrics 57: 253–259. 28. Wang, Y. and Taylor, J. M. G. (1995). Inference for smooth curves in longitudinal data with application to an AIDS clinical trial. Statistics in Medicine 14: 1205–1218. 29. Fan, J. and Zhang, J. T. (1998). Comments on “smoothing spline models for the analysis of nested and crossed samples of curves” (by B. Brumback and J. A. Rice). Journal of the American Statistical Association 93: 980–983. 30. Fan, J. and Zhang, J. T. (1999). Two-step estimation of functional linear models with application to longitudinal data. Journal of the Royal Statistical Society, Series B62: 303–322. 31. Hoover, D. R., Rice, J. A., Wu, C. O. and Yang, L. P. (1998). Nonparametric smoothing estimates of time-varying coefficient models with longitudinal data. Biometrika 85: 809–822. 32. Wu, C. O., Chiang, C. T. and Hoover, D. R. (1998). Asymptotic confidence regions for kernel smoothing of a varying-coefficient model with longitudinal data. Journal of the American Statistical Association 93: 1388–1402. 33. Lange, N., Carlin, B. P. and Gelfand, A. E. (1992). Hierarchical Bayes models for the progression of HIV infection using longitudinal CD4 T-cell numbers (with discussion). Journal of the American Statistical Association 87: 615–632.
34. De Gruttola, V. and Tu, X. M. (1994). Modelling progression of CD4lymphocyte count and its relationship to survival time. Biometrics 50: 1003–1014. 35. Tsiatis, A. A., De Gruttola, V. and Wulfsohn, M. S. (1995). Modeling the relationship of survival to longitudinal data measured with error. Applications to survival and CD4 counts in patients with AIDS. Journal of the American Statistical Association 90: 27–37. 36. Cnaan, A. (1998). Natural history study of prognosis. In Encyclopedia of Biostatistics, Vol. 4, eds. P. Armitage and T. Colton, John Wiley and Sons, New York, 2956–2962. 37. Xu, Z. Y. (1991). Impact and control of viral hepatitis in China. In Viral Hepatitis and Liver Diseases, eds. F. Blaine Hollinger, Williams and Wilkins, 700–706. 38. Szmuness, W. (1978). Sociodemograghic aspect of the epidemiology of HB. In Viral Hepatitis., eds. G. N. Vyas, S. N. Cohen and R. Schmidt, Franklin Institute Press, Philadelphia, Pennsylvania, 296–320. 39. Tang, S. X. (1991). Study of the mechanisms and influential factors of intrauterine infection of hepatitis B virus. Chinese Journal of Epidemiology 12(6): 325–328. 40. Wang, S. S. (1991). Transplacental transmission of hepatitis B virus. Chinese Journal of Epidemiology 12(1): 33–35. 41. Zhao, S. J., Xu, Z. Y., Ma, J. C. et al. (1994). A follow up study of spontaneous clearance rates on hepatitis B surface agent persistent carriers. Chinese Journal of Preventive Medicine 29(6): 378–379. 42. Zhao, S. J. and Xu, Z. Y. (1991). A study of hepatitis B infection force in China. Chinese Journal of Epidemiology 14(suppl.): 70–74. 43. Zhao, S. J., Xu, Z. Y., Cao, H. L. et al. (1995). An estimating model of relationship between the infection age of hepatitis B virus and the agespecific chronic carrier rate. Chinese Journal of Experimental and Clinical Virology 9(Suppl.): 101–104. 44. Zhao, S. J. and Xu, Z. Y. (1995). Mathematical simulation of hepatitis B transmission and application in immunization police. In Proceeding of Epidemiology, ed. X. W. Zhen, Beijing, 8: 162–181. 45. Zhao, S., Xu, Z. Y. and Lu, Y. (2000). A mathematical model of hepatitis B virus transmission and its application for vaccination strategy in China. International Journal of Epidemiology 29: 744–752. 46. Bailey, N. T. J. (1975). The Mathematical Theory of Infectious Diseases and Its Applications, 2nd edn., Griffin, London. 47. Becker, N. J. (1989). Analysis of Infectious Disease Data, Chapman and Hall, London. 48. Anderson, R. M. and May, R. M. (1991). Infectious Diseases of Humans: Dynamics and Control, Oxford University Press, Oxford. 49. Daley, D. J. and Gani, J. (1999). Epidemic Modelling, Cambridge University Press, Cambridge. 50. Harris, T. E. (1948). Branching processes. Annals of Mathematical Statistics 19: 474–494.
51. Keiding, N. (1975). Estimation theory for branching processes. Bulletin of the International Statistical Institute 46(4): 12–19. 52. Becker, N. (1977). Estimation for discrete time branching processes with applications to epidemics. Biometrics 33: 515–522. 53. Heyde, C. C. (1979). On assessing the potential severity of an outbreak of a rare infectious disease: A Bayesian approach. Australian Journal of Statistics 21: 282–292. 54. Frost, W. H. (1976). Some conceptions of epidemics in general. American Journal of Epidemiology 103: 141—151. 55. Fine, P. (1977). A commentary on the mechanical analogue to the ReedFrost epidemic model. American Journal of Epidemiology 106: 87–100. 56. Longini, I. and Koopman, J. (1982). Household and community transmission parameters from final distributions of infections in households. Biometrics 38: 115–126. 57. Longini, I., Koopman, J., Haber, M. and Otsonis, G. (1988). Statistical inference for infectious diseases: Risk-specified household and community transmission parameters. American Journal of Epidemiology 128: 845–859. 58. Saunders, I. (1980). An approximate maximum likelihood estimator for chain binomial models. Australian Journal of Statistics 22: 307–316. 59. Longini, I. (1998). Chain binomial model. In Encyclopedia of Biostatistics, Vol. 1, eds. P. Armitage and T. Colton, John Wiley and Sons, New York, 593–597. 60. Becker, N. G. and Britton, T. (1999). Statistical studies of infectious disease incidence. Journal of the Royal Statistical Society, Series B61: 287–307. 61. Bartlett, M. S. (1957). Measles periodicity and community size. Journal of the Royal Statistical Society, Series A120: 48–70. 62. Bartlett, M. S. (1960). The critical community size for measles in the United States. Journal of the Royal Statistical Society, Series A123: 37–44. 63. Tan, W. Y. and Wu, H. (1998). Stochastic modeling the dynamics of CD4+ T cell infection by HIV with Monte Carlo studies, Mathematical Biosciences 147: 173–205. 64. Tan, W. Y. and Hsu, H. (1989). Some stochastic models of AIDS spread. Statistics in Medicine 8: 121–136. 65. Wu, H. and Tan, W. Y. (2000). Modeling HIV epidemic: A state-space approach. Mathematical and Computer Modelling 32: 197–215. 66. Merrill, S. (1987). AIDS: Background and the dynamics of the decline of immunocompetence. In Theoretical Immunology, Part 2, ed. A. S. Perelson, Addison-Wesley, Redwood City, Calif. 67. Anderson, R. M. and May, R. M. (1989). Complex dynamical behavior in the interaction between HIV and the immune system. In Cell to Cell Signaling: From Experiments to Theoretical Models, ed. A. Goldbeter, Academic, New York. 68. Perelson, A. S. (1989). Modeling the interaction of the immune system with HIV. In Mathematical and Statistical Approaches to Journal of Acquired Immune Deficincy Syndromes Epidemiology (Lect. Notes Biomath., Vol. 83), ed. C. Castillo-Chavez, Springer-Verlag, New York.
69. Ho, D. D., Neumann, A. U., Perelson, A. S., Chen, W., Leonard, J. M. and Markowitz, M. (1995). Rapid turnover of plasma virions and CD4 lymphocytes in HIV-1 infection. Nature 373: 123–126. 70. Wei, X., Ghosh, S. K., Taylor, M. E., Johnson, V. A., Emini, E. A., Deutsch, P., Lifson, J. D., Bonhoeffer, S., Nowak, M. A., Hahn, B. H., Saag, M. S. and Shaw, G. M. (1995). Viral dynamics in human immunodeficiency virus type 1 infection. Nature 373: 117–122. 71. Perelson, A. S., Neumann, A. U., Markowitz, M., Leonard, J. M. and Ho, D. D. (1996). HIV-1 dynamics in vivo: Virion clearance rate, infected cell life-span, and viral generation time. Science 271: 1582–1586. 72. Perelson, A. S., Essunger, P., Cao, Y., Vesanen, M., Hurley, A., Saksela, K., Markowitz, M. and Ho, D. D. (1997). Decay characteristics of HIV-1-infected compartments during combination therapy. Nature 387: 188–191. 73. Wu, H. and Ding, A. (1999). Population HIV-1 dynamics in vivo: Applicable models and inferential tools for virological data from AIDS clinical trials. Biometrics 55: 410–418. 74. Wu, H., Ding, A. and De Gruttola, V. (1998). Estimation of HIV dynamic parameters, Statistics in Medicine 17: 2463–2485. 75. Davidian, M. and Giltinan, D. M. (1995). Nonlinear Models for Repeated Measurement Data. Chapman and Hall, New York. 76. Vonesh, E. F. and Chinchilli, V. M. (1996). Linear and Nonlinear Models for the Analysis of Repeated Measurements, Marcel Dekker, New York. 77. Ding, A. A. and Wu, H. (2000). A comparison study of models and fitting procedures for biphasic viral dynamics in HIV-1 infected patients treated with antiviral therapies. Biometrics 56: 16–23. 78. Wu, H., Kuritzkes, D. R., McClernon, D. R. et al. (1999). Characterization of viral dynamics in Human Immunodeficiency Virus Type 1-infected patients treated with combination antiretroviral therapy: Relationships to host factors, cellular restoration and virological endpoints. Journal of Infectious Diseases 179(4): 799–807. 79. Luzuriaga, K., Wu, H., McManus, M. et al. (1999). Dynamics of HIV-1 replication in vertically-infected infants. Journal of Virology 73: 362–367. 80. Ding, A. A. and Wu, H. (1999). Relationships between antiviral treatment effects and biphasic viral decay rates in modeling HIV dynamics. Mathematical Biosciences 160: 63–82. 81. Ding, A. A. and Wu, H. (2001). Assessing antiviral potency of anti-HIV therapies in vivo by comparing viral decay rates in viral dynamic models. Biostatistics 2: 13–29. 82. Nowak, M. A., Bonhoeffer, S., Shaw, G. M. and May, R. M. (1997). Anti-viral drug treatment: Dynamics of resistance in free virus and infected cell populations. Journal of Theoretical Biology 184: 203–217. 83. Perelson, A. S. and Nelson, P. W. (1999). Mathematical Analysis of HIV-1 dynamics in vivo. SIAM Review 41(1): 3–44. 84. Nowak, M. A. and May, R. M. (2000). Virus Dynamics: Mathematical Principles of Immunology and Virology, Oxford University Press.
85. Nowak, M. A., Bonhoeffer, S., Hill, A. M. et al. (1996). Viral dynamics in hepatitis B virus infection. Proceedings of National Academic Sciences 93: 4398–4402. 86. Fukumoto, T., Berg, T., Ku, Y. et al. (1996). Viral dynamics of hepatitis C early after orthotopic liver transplantation: Evidence for rapid turnover of serum virions. Hepatology 24: 1351–1354. 87. Zeuzem, S., Schmidt, J. M., Lee, J. H. et al. (1996). Effect of interferon alfa on the dynamics of hepatitis C virus turn over in vivo. Hepatology 23: 366–371. 88. Lam, N. P., Neumann, A. U., Gretch, D. R. et al. (1997). Dose-dependent acute clearance of hepatitis C genotype 1 virus with interferon alfa. Hepatology 26: 226–231. 89. Yasui, K., Okanoue, T., Murakami, Y. et al. (1998). Dynamics of hepatitis C viremia following interferon-α administration. The Journal of Infectious Diseases 177: 1475–1479. 90. Neumann, A. U., Lam, N. P., Dahari, H. et al. (1998). Hepatitis C viral dynamics in vivo and the antiviral efficacy of interferon-α therapy. Science 282: 103–107. 91. Friedman, L. M., Furberg, C. D. and DeMets, D. L. (1981). Fundamentals of Clinical Trials, John Wright, PSG Inc. 92. Pocock, S. J. (1983). Clinical Trials on Protocol Approach, New York, Wiley. 93. Meinert, C. L. (1986). Clinical Trials, Design, Conduct and Analysis, Oxford, Oxford University Press. 94. Piantadosi, S. (1997). Clinical Trials, A Methodologic Perspective, New York, Wiley. 95. Amato, D. A. and Lagakos, S. W. (1990). Considerations in the selection of end points for AIDS clinical trials. Journal of Acquired Immune Deficiency Syndromes 3(Suppl): S64–S68. 96. Neaton, J. D., Wentworth, D. N., Rhame, F. et al. (1994). Methods of studying interventions, Considerations in choice of a clinical endpoint for AIDS clinical trials. Statistics in Medicine 13: 2107–2125. 97. Prentice, R. L. (1989). Surrogate endpoints in clinical trials: Definition and operational criteria. Statistics in Medicine 8: 431–440. 98. Marschner, I. C., Betensky, R. A., De Gruttola, V. et al. (1999). Clinical trials using HIV-1 RNA-based primary endpoints: Statistical analysis and potential biases. Journal of Acquired Immune Deficiency Syndromes and Human Retrovirology 20: 220–227. 99. Hughes, M. D. (2000). Analysis and design issues for studies using censored biomarker measurements, with an example of viral load measurements in HIV clinical trials. Statistics in Medicine 19: 3171–3191. 100. Gilbert, P., Ribaudo, H. J., Greenberg, L. et al. (2000). Considerations in choosing a primary endpoint that measures durability of virologic suppression in an antiretroviral trial. Journal of Acquired Immune Deficiency Syndromes 14: 1961–1972. 101. Weinberg, J. and Lagakos, S. W. (2001). Efficiency comparisons of rank and permutation tests based on summary statistics computed from repeated measures data. Statistics in Medicine 20: 705–731.
102. De Gruttola, V., Hughes, M. D., Gilbert, P. and Phillips, A. (1998). Trial design in the era of highly effective antiviral drug combinations for HIV infection. Journal of Acquired Immune Deficiency Syndromes 12: S149–S156. 103. Krall, R. L., Engleman, K. H., Ko, H. C. and Peck, C. C. (1998). Clinical trial modeling and simulation–work in progress. Drug Information Journal 32: 971–976. 104. Mocroft, A., Vella, S., Benfield, T. L. et al. (1998). Changing patterns of mortality across Europe in patients infected with HIV-1. Lancet 352: 1725–1730. 105. Palella, F. J., Jr, Delaney, K. M., Moorman, A. C. et al. (1998). Declining morbidity and mortality among patients with advanced human immunodeficiency virus infection. New England Journal of Medicine 338: 853–860. 106. Vittinghoff, E., Scheer, S., O’Malley, P. et al. (1999). Combination antiretroviral therapy and recent declines in AIDS incidence and mortality, Journal of Infectious Diseases 179: 717–720. 107. Prentice, R. L. (1998). Prevention trials. In Encyclopedia of Biostatistics, Vol. 4, eds. P. Armitage and T. Colton, John Wiley and Sons, New York, 3494–3499. 108. Jacobs, Jr., D. R. (1998). Preventive medicine. In Encyclopedia of Biostatistics, Vol. 4, eds. P. Armitage and T. Colton, John Wiley and Sons, New York, 3500–3502. 109. Greenwood, M. and Yule, U. G. (1915). The statistics of anti-typhoid and anti-cholera inoculations, and the interpretation of such statistics in general. Proceedings of the Royal Society of Medicine 8: 113–194. 110. Smith, P. G. and Morrow, R. N. (1991). Methods for Field Trials of Interventions Against Tropical Diseases: A Toolbox, Oxford, Oxford University Press. 111. Farrington, P. and Miller, E. (1996). Clinical trials. In Methods in Molecular Medicine: Vaccine Protocols, eds. A. Robinson, G. Farrar and C. Wiblin, Totowa, Humana Press. 112. Halloran, M. E. (1998). Vaccine studies. In Encyclopedia of Biostatistics, Vol. 6, eds. P. Armitage and T. Colton, John Wiley and Sons, New York, 4687–4694. 113. Foulkes, M. A. (1998). Advances in HIV/AIDS statistical methodology over the past decade. Statistics in Medicine 17: 1–25.
About the Author

Hulin Wu is currently a senior statistician and project leader at the Frontier Science and Technology Research Foundation. He also holds an adjunct faculty position in the Department of Biostatistics, Harvard University, and is a Section Head at the Center for Biostatistics in AIDS Research, Harvard School of Public Health. He obtained his BS and MS degrees in Automatic Control from the National University of Defense Technology of China, and his PhD
in Statistics from Florida State University, USA. He was a visiting assistant professor at the University of Memphis, Tennessee, USA, before joining his current position. His current research interests include nonparametric regression for longitudinal data, linear and nonlinear mixed-effects models, state-space models and dynamic prediction, modeling HIV/AIDS epidemics and pathogenesis, AIDS clinical trials, and clinical trial simulations.
CHAPTER 18
SPECIAL MODELS FOR SAMPLING SURVEY
SUJUAN GAO Division of Biostatistics, Indiana University School of Medicine, 1050 Wishard Blvd., RG 4101, Indianapolis, IN 46202-2872, USA Tel: (317)274-0820; [email protected]
1. Introduction

Sample surveys have been widely used in everyday life and scientific research, from attitudes towards a specific television program to major economic indices. Sample surveys are broadly classified into two types — descriptive and analytical. In a descriptive survey the objective is simply to obtain information about large groups. For example, the total number of men, women and children living in a certain geographic area may be the objective of a descriptive survey. In an analytical survey, comparisons are made between different subgroups of the population in order to ascertain whether true differences exist among them and to verify hypotheses about the reasons for these differences. Sample surveys in medical research are mostly taken for analytical purposes. Early well-known surveys include surveys of the teeth of school children before and after fluoridation of water, of the death rates and causes of death of people who smoke different amounts, and the huge study of the effectiveness of the Salk polio vaccine. Recent surveys in the medical field, still analytical in nature, can be further divided into two categories. The first category consists of large-scale surveys intended to support inference about a large population. These surveys are designed to derive reliable estimates of health, nutrition, medical expenditure and other measures at the national level. Examples of surveys of this nature are the National Health and Nutrition Examination Survey (NHANES), the Survey of Asset and Health Dynamics of the Oldest Old (AHEAD), the Health and Retirement Survey (HRS), and the National Long Term Care Survey, to name
just a few. These surveys have elaborate sampling frames with oversampling of certain subgroups and extensive questionnaires containing a vast amount of information. They tend not to be disease specific, focusing instead on national trends in the general health of the entire population. The second category of medical surveys consists of community-based surveys, in which a geographic catchment area is first defined and the survey is administered to individuals selected according to a sampling plan within that community only. Community-based surveys usually focus on a specific disease, which is particularly advantageous for diseases that require extensive diagnostic evaluation, such as dementia and Alzheimer's disease. Community-based surveys offer the advantage of studying a specific disease in depth, but may be limited in the scope of the populations they represent.

The basic steps and components of a survey are described in detail in classical sample survey books such as Cochran (1977) and Kish (1965). We review here some terminology central to the development in this chapter.

Population: The aggregate from which the sample is chosen. The population to be sampled (the sampled population) should coincide with the population about which information is desired (the target population). Sometimes, due to practical constraints, the sampled population is more restricted than the target population. It should be noted that conclusions drawn from the samples in such situations apply only to the sampled population; the extent to which they carry over to the target population depends on additional information.

Sampling plan: The rule by which the samples are selected.

Sampling units: The parts into which the population is divided and upon which the selection rules are based. They must cover the entire population without any overlap. Sometimes the choice of sampling units is obvious, as in a survey of hospitals where each hospital constitutes a sampling unit. In other situations there may be many possible choices of sampling units. For example, sampling of individuals in a city can be done by selecting whole families, or by selecting individuals living in a whole city block. The decision on which sampling unit to use is often based on an array of logistic, economic and convenience factors.

Sampling frame: The list containing all sampling units, so that random selection from the population is accomplished by random sampling from
the list. An ideal sampling frame should be complete, containing every unit in the population without any duplicates. The legitimacy of a survey often rests on how well the sampling frame is constructed to represent the population.

Simple random sampling: A method of selecting n units out of N such that every distinct sample has an equal chance of being drawn. Conventional statistical inference and software apply to simple random sampling.

Stratified random sampling: In stratified random sampling, the population of N units is first divided into non-overlapping subpopulations of N1, . . . , NS units, respectively. These subpopulations are called strata. If a simple random sample is taken within each stratum, the whole procedure is described as stratified random sampling. Stratification may produce a gain in precision in the estimates of characteristics of the whole population by creating relative homogeneity within each stratum.

Sample survey data differ from data collected in conventional observational studies in two fundamental ways. The first is that samples from surveys are usually selected with unequal probabilities, which, if ignored, may create a distorted picture of the target population. For example, if a health survey oversamples elderly individuals, the simple frequencies of diseases associated with increasing age are likely to overestimate disease rates in the population. The second difference is that survey data often present a natural clustering inherited from the sampling design. For example, if all family members in a selected household are interviewed in a survey, the outcomes on socio-economic scales, behavioral measures and attitudes are likely to be correlated. A vast body of literature on methods of analyzing survey data exists; see, for example, Skinner, Holt and Smith28 for a review of earlier work in the area. The most widely used approach for analyzing survey data is the so-called pseudo-likelihood method, in which the score equation in a general likelihood framework is modified by including appropriate sampling weights. Variance estimation is achieved by taking the sample design into consideration and by using Taylor series linearization. Software packages implementing the pseudo-likelihood approach are also available, e.g. SUDAAN from the Research Triangle Institute25 and WesVarPC from Westat.22

This chapter focuses on several special models used in analyzing survey data from epidemiological studies. First, we discuss the use of special models for estimating the prevalence rate of a rare disease from community-based surveys, followed by a discussion of the use of random effect models
for small area estimation with both continuous and discrete outcomes. The final section is devoted to capture–recapture models in epidemiology.
2. Models for the Estimation of Disease Prevalence

Disease prevalence is the percentage of individuals with a disease, at the time of the study, in a certain population; it describes the disease's effect on the population. Multi-phase sampling is often used in epidemiological studies where a disease is rare and diagnosis of the disease is expensive or difficult. The term has been used interchangeably with "multi-stage" sampling by medical researchers, without distinguishing it from multi-phase sampling. However, multi-stage sampling is standard terminology in sampling theory which usually implies that different sampling units are used at the various stages of sampling (for example, city blocks in the first stage, households in the second stage, and individuals in the third stage). We therefore prefer the term multi-phase sampling for the type of study where individuals are the sampling units in all sampling selections.

Two-phase sampling is by far the most commonly used design of all multi-phase studies. In the first phase of the study a large random sample from the target population is screened with less intensive and less expensive screening tests for the disease. Based on the results of the screening tests, subjects are stratified and randomly selected within each stratum for extensive clinical evaluation at the second phase to determine disease status. The sampling plan is usually designed to identify as many diseased subjects as possible for risk factor studies and, at the same time, to allow efficient estimation of disease prevalence for the population. The two-phase sampling design has been used to estimate the prevalence rates of Alzheimer's disease,1 heart disease4 and sexually transmitted disease.31

Data for our first example come from the Indianapolis Study of Health and Aging, an on-going longitudinal study of dementia in elderly African Americans aged 65 and over living in Indianapolis, USA. Population-based two-phase surveys were conducted to estimate the prevalence rate of dementia in this population. At the first phase, 2212 individuals were randomly selected from the community and administered screening tests aimed at measuring their cognitive function. Each individual received a cognitive score ranging from 0 to 33. Based on the screening scores the subjects were grouped into three performance groups: good, intermediate and poor. The initial sampling plan was to invite 100% of the subjects from the poor performance group, 50% from the intermediate group, 5% from the
good performance group, of which 75% should come from those older than 75 years of age. However, due to refusal, death, severe sickness and other reasons, the study had to sample more than the prespecified percentages in all groups except the poor performance group to achieve the targeted number of total clinical evaluations. Table 1 gives the number of demented subjects diagnosed from each sampling stratum by age group.

A weighting type estimator, also referred to as the direct standardization approach, is often used to estimate disease prevalence from a multi-phase sampling study. It assumes that subjects within each stratum are homogeneous and that random sampling is used within each stratum. Suppose that in the first phase of the study N individual subjects are sampled by simple random sampling from the target population and information is collected from all N subjects on a set of characteristics X. X can be a vector containing several predictors, such as age, gender, screening scores, etc., that relate to the disease of interest. The N subjects are then divided into S strata, labeled $I_1, \ldots, I_S$, based on the values of X. The total numbers of subjects in the respective strata are denoted by $N_1, \ldots, N_S$. In the second phase, $n_s$ subjects are sampled from the $N_s$ subjects in the sth stratum using stratified random sampling. Disease status is ascertained on the selected $n_s$ subjects only. Let $y_{si}$ represent the disease status of the ith subject in the sth stratum, with $y_{si} = 1$ denoting disease and $y_{si} = 0$ non-disease. The probabilities of second-phase sampling can differ for subjects from different strata; however, subjects from the same stratum are assumed to have equal sampling probability. The weighting type estimator of the prevalence rate for stratified random sampling is
$$\hat{p}_{wt} = \frac{1}{N}\sum_{s=1}^{S}\sum_{i=1}^{n_s}\frac{N_s}{n_s}\,y_{si} = \sum_{s=1}^{S}\frac{N_s}{N}\,\hat{p}_s\,, \qquad (1)$$
where $\hat{p}_s = \sum_{i=1}^{n_s} y_{si}/n_s$ is the average disease rate from the sth stratum. The variance of the estimator is estimated by
$$\widehat{\mathrm{var}}(\hat{p}_{wt}) = \sum_{s=1}^{S}\hat{p}_s(1-\hat{p}_s)\,\frac{N_s^2}{N^2 n_s}\,. \qquad (2)$$
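The computation in Eqs. (1) and (2) is straightforward; the sketch below applies it to the Indianapolis data of Table 1 and reproduces the weighting-type column of Table 2.

```python
import numpy as np

def weighting_estimate(N_s, n_s, cases):
    """Weighting-type (direct standardization) estimator, Eqs. (1)-(2).
    N_s: stratum sizes in phase one; n_s: numbers sampled in phase two;
    cases: numbers found diseased among those sampled."""
    N_s, n_s, cases = map(np.asarray, (N_s, n_s, cases))
    N = N_s.sum()
    p_s = cases / n_s                                        # stratum-specific rates
    p_wt = np.sum(N_s / N * p_s)                             # Eq. (1)
    var = np.sum(p_s * (1 - p_s) * N_s**2 / (N**2 * n_s))    # Eq. (2)
    return p_wt, np.sqrt(var)

# Indianapolis data (Table 1); strata are the screening performance groups 1-3.
for age, N_s, n_s, cases in [("65-74", [1133, 77, 97], [27, 34, 63], [0, 0, 10]),
                             ("75-84", [523, 69, 112], [58, 35, 75], [1, 5, 31]),
                             ("85+",   [127, 21, 53],  [14, 10, 37], [0, 0, 18])]:
    rate, se = weighting_estimate(N_s, n_s, cases)
    print(age, round(100 * rate, 2), round(100 * se, 2))
# Prints 1.18 (0.34), 9.26 (1.66), 12.83 (2.17), matching the weighting column of Table 2.
```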
Problems with the weighting type estimator can occur in strata with no affected individuals or no unaffected individuals ($\hat{p}_s = 0$ or $\hat{p}_s = 1$), or in strata where no individual is sampled ($n_s = 0$). The prevalence estimate can then be very unstable, with an estimated variance of zero or infinity.
An alternative method of estimating disease prevalence from two-phase surveys is the modeling type estimator. A model is assumed for the population from which the finite population is sampled, and smoothed estimates from the model are used to estimate disease prevalence. The modeling type estimator for binary data was first proposed by Roberts et al.26 and used by Beckett et al.1 to estimate the prevalence of Alzheimer's disease from two-phase surveys. The modeling type estimator is preferred in situations where the disease is rare and estimates from strata containing few or zero events are desired. Let $X_{si}$ be a set of covariates collected at the first phase, so that $X_{si}$ is available for all N subjects, and let $\mathrm{Prob}(y_{si} = 1) = p_{si}$. A logistic regression model is assumed for the disease:
$$\log\left(\frac{p_{si}}{1-p_{si}}\right) = X_{si}\beta\,, \qquad (3)$$
where β is a p × 1 vector of parameters. If β were known, the average of the predicted probabilities of disease from the model would be an unbiased estimator of disease prevalence. In practice, one has to estimate β from the sample. A pseudo-maximum likelihood estimate $\hat{\beta}$ is obtained using data from the second phase, and the estimate of disease prevalence is then obtained by averaging the predicted probabilities of disease over every subject in the population:
$$\hat{p}_{model} = \frac{1}{N}\sum_{s=1}^{S}\sum_{i=1}^{N_s}\frac{1}{1+e^{-X_{si}\hat{\beta}}}\,. \qquad (4)$$
The estimated variance of the prevalence estimate is
$$\widehat{\mathrm{var}}(\hat{p}_{model}) = W'QXVX'Q'W\,, \qquad (5)$$
where W is an N × 1 vector with elements equal to 1/N, Q is an N × N diagonal matrix with elements $\hat{p}_{si}(1-\hat{p}_{si})$, X is the covariate matrix and V is the estimated variance–covariance matrix of the logistic regression parameter β. Note that the above variance estimator assumes that X is fixed; variance estimators accounting for the variability in X for survey data are given by Graubard and Korn.9 Prevalence estimates from the modeling approach can be more efficient than the weighting type estimates if the logistic model fits the data well. The weighting type estimates may also mask trends in prevalence rates due to small sample sizes in some groups. Lastly, the weighting type estimator requires the assumption of missing completely at random, while the modeling type estimator is unaffected under the missing at random mechanism (using
Table 1. Number of demented subjects diagnosed from the three sampling groups in each age group. Data from the Indianapolis Study of Health and Aging.

Age Group   Performance Group   Number in Population   Number Sampled   Number with Dementia
65–74       1                   1133                   27               0
            2                   77                     34               0
            3                   97                     63               10
75–84       1                   523                    58               1
            2                   69                     35               5
            3                   112                    75               31
85+         1                   127                    14               0
            2                   21                     10               0
            3                   53                     37               18
Table 2. Estimated age-specific prevalence rates of dementia using the weighting method and the modeling method. Data from the Indianapolis Study of Health and Aging.

                 Weighting              Modeling
Age Group    Rate      Std Err      Rate      Std Err
65–74        1.18      0.34         1.83      0.37
75–84        9.26      1.66         6.73      0.85
85+          12.83     2.17         17.07     2.31
the definitions of Little and Rubin21), provided that covariates related to missing data are incorporated in the model.7 We return now to the example data in Table 1. Prevalence estimates for the three age groups are desired. Note that in age groups 65–74 and 85+ the first two sampling strata produced zero disease cases, so the weighting type estimator is expected to underestimate the true prevalence in these two age groups. Prevalence estimates for the three age groups using both the weighting type approach and the modeling estimator, along with standard error estimates, are included in Table 2. It can be seen from Table 2 that the modeling type estimator yields larger prevalence estimates for two age groups, while the estimate for age group 75–84 is smaller than the weighting type estimate. A simulation study conducted by Beckett et al.1 demonstrated that the modeling type estimator can increase the accuracy and efficiency of the rate estimates substantially if the model fits the data well.
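The calculation behind Eqs. (4) and (5) can be sketched as follows. The design matrix, fitted coefficients and their covariance matrix below are hypothetical placeholders, since the individual-level screening scores used in the actual analysis are not reproduced here.

```python
import numpy as np

def modeling_estimate(X_pop, beta_hat, V_beta):
    """Modeling-type estimator, Eqs. (4)-(5): average the predicted disease
    probabilities over every phase-one subject and propagate the uncertainty
    of the fitted logistic coefficients (X_pop is treated as fixed)."""
    N = X_pop.shape[0]
    p_hat = 1.0 / (1.0 + np.exp(-X_pop @ beta_hat))      # Eq. (4) summands
    p_model = p_hat.mean()
    W = np.full(N, 1.0 / N)
    Q = np.diag(p_hat * (1.0 - p_hat))
    var = W @ Q @ X_pop @ V_beta @ X_pop.T @ Q @ W       # Eq. (5)
    return p_model, np.sqrt(var)

# Hypothetical setup: intercept plus a standardized screening score for a
# phase-one population of 1307 subjects (the size of the 65-74 age group).
rng = np.random.default_rng(0)
X_pop = np.column_stack([np.ones(1307), rng.normal(size=1307)])
beta_hat = np.array([-4.0, -1.2])                        # hypothetical fitted coefficients
V_beta = np.array([[0.09, 0.01], [0.01, 0.04]])          # hypothetical covariance of beta_hat
print(modeling_estimate(X_pop, beta_hat, V_beta))
```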
In this section we have mainly been concerned with estimating disease prevalence, which is the mean of a binary variable. It should be noted that similar approaches exist in survey theory for estimating the means of continuous variables; for further details see the section on the regression estimator in Cochran.5
3. Random Effect Models for Small Area Estimation

The term "small area" was initially used to denote a small geographical area, such as a county or a census division. It may also describe a "small domain", that is, a small subpopulation such as a specific age-sex-race group within a large geographical area. Small area estimation usually arises in secondary analysis of large survey data, where the survey is designed to estimate the characteristics of a large domain. For example, in a national hospital cost survey, the sample selection is designed to estimate mean hospital cost with a desired precision at the national level. However, it may also be of interest to use the survey to derive hospital cost estimates by census region, or by state or county, possibly for the apportionment of government funds or for regional and city planning. Direct estimates by region, state or county in this case, based only on data from the small area, are likely to yield unacceptably large standard errors due to the unduly small size of the sample in the area.

Suppose that the population is divided into k small areas, the ith of which contains Ni units, and that ni of the Ni units are sampled from the ith area. Let Yij denote the value of the jth unit in the ith small area. For convenience we let the first ni units in Yij be sampled and the remaining Ni − ni not sampled. In addition, we assume that auxiliary information Xij is available on every unit in the population. Alternative models can also be formed when Xij is available only at the area-specific level; see Ghosh and Rao8 for further details. The focus of inference is to estimate the small area mean $\bar{Y}_{i.}$.

A conceptual example is described here without direct reference to a specific survey or real data set. The example is in the context of a study by Taylor et al.29 published in the New England Journal of Medicine. We have modified the setting so that the sampling units are hospitals instead of patients to make the inference straightforward. Taylor et al.29 pointed out that it is important to study hospital cost and outcome and to investigate any differences among various hospitals. Suppose a national survey on hospital cost is conducted. The primary interest of the survey is to estimate the average hospital cost for various primary diagnoses such as hip fracture, stroke, coronary heart disease or congestive heart failure on the national
level. Suppose we are also interested in using the survey data to provide average hospital cost estimates for the counties within a region. Suppose, in addition, that we have Medicare claims data for all hospitals in the region for the same time period as the survey. Medicare is a federal program that purchases inpatient services, primarily for persons 65 years and older, from various types of hospital. It is reasonable to assume that as the amount of Medicare claims increases, hospital cost is also likely to increase. For this hypothetical example, we will assume that there is no major systematic difference between the hospitals. In a real world situation, the type (teaching or non-teaching), size (number of beds) and location (urban or rural) of the hospital are all likely to affect hospital cost, as concluded in the article by Taylor et al.29

A simulated data set for 114 hospitals in 16 small areas is generated using the equation
$$y_{ij} = -2.0 + 0.2\,x_{ij} + \mu_i + e_{ij}\,,$$
where $\mu_i \sim N(0, 20)$ and $e_{ij} \sim N(0, 40)$. Thirty-eight hospitals are sampled using simple random sampling. The sampled data are presented in Table 3 (a sketch of the generating process is given after the table).

Table 3. Data from a simple random sample drawn from a synthetic population (n = 38, N = 114).
Area No.   Ni   ni    xij       yij
1          1    0     –         –
2          6    1     169.99    38.19
3          4    2     6.16      7.43
                      30.09     10.42
4          1    0     –         –
5          8    2     94.60     20.52
                      86.34     25.61
6          6    2     90.53     22.01
                      86.34     25.61
7          6    2     95.10     22.52
                      130.45    35.24
8          6    1     46.09     8.38
9          27   10    89.88     12.44
                      145.68    26.23
                      113.54    8.77
                      114.54    17.98
                      96.51     23.22
                      84.58     9.41
                      139.38    15.61
                      117.24    34.42
                      86.54     12.51
                      101.55    15.27
10         5    2     61.34     11.56
                      102.77    17.62
11         12   4     61.49     7.02
                      99.77     27.09
                      52.08     10.39
                      136.46    39.48
12         7    3     92.28     9.53
                      93.73     9.43
                      62.29     10.89
13         4    1     164.14    35.79
14         6    2     134.88    35.30
                      164.94    40.81
15         13   5     64.44     14.92
                      63.51     9.07
                      73.10     8.34
                      109.66    10.35
                      123.09    29.39
16         2    1     186.16    28.08
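A sketch of how such a synthetic population and sample can be generated is given below. The distribution assumed for the Medicare claims x is a hypothetical choice (the chapter does not state it), and a different random seed will not reproduce the particular draw shown in Table 3.

```python
import numpy as np

rng = np.random.default_rng(2003)

# Area (county) sizes N_i as in Table 3, summing to N = 114 hospitals.
N_i = np.array([1, 6, 4, 1, 8, 6, 6, 6, 27, 5, 12, 7, 4, 6, 13, 2])
area = np.repeat(np.arange(1, 17), N_i)          # area label of each hospital

# Auxiliary variable (Medicare claims, hypothetical range) and outcome from
# y_ij = -2.0 + 0.2 x_ij + mu_i + e_ij, mu_i ~ N(0, 20), e_ij ~ N(0, 40).
x = rng.uniform(5, 190, size=N_i.sum())
mu = rng.normal(0.0, np.sqrt(20.0), size=16)
y = -2.0 + 0.2 * x + mu[area - 1] + rng.normal(0.0, np.sqrt(40.0), size=N_i.sum())

# Simple random sample of n = 38 hospitals from the population of 114.
sampled = rng.choice(N_i.sum(), size=38, replace=False)
print(np.bincount(area[sampled], minlength=17)[1:])   # sampled count per county
```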
For each county (small area), $N_i$ is the total number of hospitals in that county and $n_i$ is the number of hospitals sampled in the survey; $x_{ij}$ is the Medicare claim amount and $y_{ij}$ is the true hospital cost. A problem immediately seen with this data set, and present in many small area estimation settings, is that two counties have no sampled hospitals, so direct estimates from samples in those counties are not possible. Another problem is that some counties have a very small number of hospitals sampled, so direct estimates for these counties may be unstable. We will describe three approaches commonly used for estimation in small areas.

A synthetic estimator is similar in spirit to the ratio estimator in sample survey.5 The estimator applies the ratio $\bar{X}_i/\bar{X}_.$ to the estimate $\hat{\bar{Y}}_.$ of the overall mean to obtain an estimate for the ith area, where $\bar{X}_i = \frac{1}{N_i}\sum_{j=1}^{N_i} x_{ij}$, $\bar{X}_. = \frac{1}{N}\sum_{i=1}^{k} N_i\bar{X}_i$, and $\hat{\bar{Y}}_. = \frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_i} y_{ij}$. The synthetic estimator of the small area mean is given by
$$\hat{\bar{Y}}_i(S) = \frac{\bar{X}_i}{\bar{X}_.}\,\hat{\bar{Y}}_.\,. \qquad (6)$$
The bias of the synthetic estimator is given by: ¯ ¯ ¯i (S)) − Y¯i = X ¯ i Y. − Yi . E(Yˆ ¯. ¯i X X ¯
¯
Y. Yi It can be seen that the bias is not zero unless X ¯. = X ¯ i . Under the ¯ i equals to the population average assumption that the sample average X ¯ . , the synthetic estimator will only be unbiased if each small area mean X ¯ Yi equals to the overall mean Y¯. . Such an assumption can be very strong and can produce biased estimates in situations when the assumption does not hold. In an effort to reduce or balance the bias of the synthetic estimator a weighted estimator is proposed in the form of
¯i (SD) = wi Yˆ¯1i + (1 − wi )Yˆ¯2i , Yˆ
(7)
¯1i is the direct estimator from the selected samples, Yˆ¯2i is an indirect where Yˆ estimator and wi is a pre-determined weight (0 ≤ wi ≤ 1). An optimal weight may be obtained to minimize the mean squared error of Yˆ¯i (SD). See Ghosh and Rao8 for further details on obtaining the optimal weight. iN In practice a sample size dependent weight is chosen as wi = nnN , where i N and Ni are the total number of units and number of units in each small area in the population, respectively. n and ni are the total selected sample
695
Special Models for Sampling Survey
size and the selected sample size in each small area, respectively. Therefore, the sample size dependent estimator can be written as: ( ¯i , Yˆ if wi ≥ 1 , ¯i (SD) = (8) Yˆ ˆ ˆ wi Y¯i + (1 − wi )Y¯i (S) , if wi < 1 , ¯i (S) is the synthetic estimator. where Yˆ In a random effect model approach the finite population containing N units is itself assumed to be random samples from an infinite population, the so-called superpopulation. The finite population is further assumed to have the following distribution: yij = xij β + νi + eij ,
i = 1, . . . , k ,
j = 1, . . . , Ni ,
(9)
where xij is the value of the auxiliary variable, β is the parameter for the auxiliary effect. νi and eij are two independent random variables with E(νi ) = 0 ,
V (νi ) = σν2 ,
E(eij ) = 0 ,
V (eij ) = σ 2 .
In addition, normality of the two random variables is assumed. Using matrix notations, and an asterisk for the nonsampled elements, the above random effect model can be written as: " # " " # " # # yi 1i Xi ei = , (10) ∗ ∗ + ∗ β + νi yi 1i Xi e∗i where yi = (yi1 , . . . , yini )′ represents the sampled units, and yi∗ = (yini +1 , . . . , yiNi )′ the non-sampled ones. Other vectors in the equation are similarly defined. The difference between the random effect model for sampling survey and the random effect models in classical statistics textbook is demonstrated by Eq. (10). A component of the outcome variable is unobserved because it is not sampled. Therefore, there are two steps in estimating the means of small areas. The first step involves estimating the parameters in the model, i.e. β, σν2 and σ 2 , using the sampled data only. The second step uses the parameter estimates to predict yi∗ . Three estimation approaches exist for parameter estimation from the random effect model, namely, the best linear unbiased predictor approach (BLUP), the empirical Bayes approach (EB) and the hierarchical Bayes approach (HB). We will focus on the discussion of the BLUP and be brief on the EB and HB approaches. Interested readers can find a rather thorough review on all three approaches from Ghosh and Rao.8
696
S. Gao
For the estimation of parameters from the random effect model using sampled data only, we start with a more general mixed effect model: y = Xβ + Zν + e ,
(11)
where E(ν) = 0 ,
V (ν) = G ,
E(e) = 0 ,
V (e) = R ,
and ν and e are assumed to be independent of each other. Parameter estimates for β and ν are obtained by solving the following equations simultanenously: X′ R−1 Xβ + X′ R−1 Zν = X′ R−1 y Z′ R−1 Xβ + (Z′ R−1 Z + G−1 )ν = Z′ R−1 y ,
(12)
which can be expressed alternatively as: ˆ = (X′ V−1 X)−1 X′ V−1 y β ˆ , ˆ = GZV−1 (y − Xβ) ν
(13)
where V = ZGZ′ + R. With the variance-covariance matrices G and R known, it is proved that βˆ derived above is the best linear unbiased estimator and νˆ is the best linear unbiased predictor in the measures of mean squared error. In practice the variance-covariance matrices are unknown and have to be estimated from the data. Estimation of the variance-covariance matrix can be accomplished by using the restricted maximum likelihood (REML) approach proposed by Patterson and Thompson.23 REML yields unbiased estimates of the variance covariance parameters for balanced designs. Technically, the optimality of the BLUP is lost when one uses estimated variance covariance matrices. However, such an approach coincides with the empirical Bayes approach with normal error assumption. Therefore, the empirical BLUP (EBLUP) and EB lead to identical estimates. In the HB approach, a prior distribution on the model parameters is specified and the posterior distribution of the parameters of interest is then obtained. Inferences are based on the posterior distribution; in particular, a parameter of interest is estimated by its posterior mean and its precision is measured by its posterior variance. The EBLUP approach is implemented by various statistical software packages. For example, SAS Proc MIXED derives parameter estimates and prediction using the EBLUP method. To illustrate the various approaches
697
Special Models for Sampling Survey
Table 4. Small area estimates by the synthetic estimator (YˆSYN ), the sample size dependent estimator (YˆSD ) and the empirical best linear unbiased predictor estimator (YˆEBLUP ). Area No.
Ni
ni
¯i X
Y¯i
YˆSYN
YˆSD
YˆEBLUP
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1 6 4 1 8 6 6 6 27 5 12 7 4 6 13 2
0 1 2 0 2 2 2 1 10 2 4 3 1 2 5 1
137.70 106.18 23.38 45.64 102.08 77.96 125.81 79.80 104.06 84.57 93.84 85.55 120.90 177.58 94.96 123.30
32.68 24.12 4.64 4.12 21.08 22.53 28.36 13.50 17.33 13.41 19.43 15.12 23.79 38.00 12.92 21.01
26.93 20.76 4.57 8.93 19.96 15.25 24.60 15.61 20.35 16.54 18.35 16.73 23.64 34.73 18.57 24.11
26.93 29.48 8.92 8.93 22.05 23.81 28.88 11.99 17.59 14.59 20.99 9.95 32.75 38.06 14.41 28.08
28.41 22.53 6.67 8.97 20.77 18.21 28.28 16.00 17.60 16.33 21.03 13.85 25.33 39.77 17.31 22.59
Average relative error %
21.31
25.49
19.59
Average squared error:
13.46
16.98
7.10
on small area estimation discussed in this section we use the synthetic population in Table 3. In Table 4 we compare the small area estimates of means derived by the synthetic estimator (YˆSYN ), the sample size dependent estimator (YˆSD ) and the empirical BLUP estimator (YˆEBLUP ) to the true small area mean (Y¯i ). Note that the true means are available to us because the data is simulated. We also define two criteria for comparing the overall performances of the estimators across all small areas. 16
Average relative error =
1 X |Yˆi − Y¯i | , 16 i=1 Y¯i
Average squared error =
1 X ˆ (Yi − Y¯i )2 . 16 i=1
16
These two criteria provide measures on the overall bias and efficiency of the various estimators. The EBLUP estimator is shown to perform the best in having the smallest average relative error and the smallest average squared error. The sample size dependent estimator failed to improve on
698
S. Gao
the performance of the synthetic estimator, perhaps due to the use of a non-optimal weight. So far we have discussed methods for small area estimation appropriate for a continuous outcome. Small area estimation for discrete outcomes such counts and proportions are often desired as well. Random-effect models can also be applied in these situations. In the generalized linear model framework, the discrete outcome has its first two moments specified as: E(yij |νi ) = h(xij β + νi ) = µij ,
(14)
V (yij |νi ) = ai ν(µij ) ,
(15)
where νi is assumed to be normally distributed with mean 0 and variance covariance matrix D. There are two steps involved in small area estimation from discrete outcome, similar to the continuous variable case. In the first step we estimate the model parameters using the sampled data only. The second step involves prediction using the estimated model parameters and the auxiliary values for the non-sampled units in the population. Several estimation approaches exist for parameter estimation from the generalized linear mixed model. The first approach is the full likelihood approach which requires the specification of the distributions of the random effects and often requires numerical integration of the likelihood function over the distribution of the random effect variables. To overcome these problem, Breslow and Clayton3 proposed the penalized quasi-likelihood approach (PQL) where only the first two moments are specified and parameter estimation can be achieved using iterative weighted least square estimation. Raghunathan24 proposed a quasi-empirical Bayes method for small area estimation on discreate outcomes. We concentrate on a description of the PQL method, simply because it is implemented in some statistical software packages. For a detailed derivation of the PQL equations see Breslow and Clayton.3 The PQL method requires iteratively solving the following equations: βˆ = (X′ V−1 X)−1 X′ V−1 Y , ˆ , νˆ = DV−1 (Y − Xβ)
where V = W−1 + D, and W is a diagonal matrix with the elements: wij =
var(yij |νi ) g ′ (µij )2
(16)
699
Special Models for Sampling Survey
and g(µij ) = h−1 (xij β + νi ), and Yij = xij β + νi + (yij − h(xij β + νi ))g ′ (µij ) . In the context of the previous example on a hospital survey, suppose we wish to estimate the average cancer-specific remission rate for each county. Such rates can be used in part to assess the quality of a hospital. We use the same setting as in the previous example and assume that the same 38 hospitals are randomly selected. A complete data set on all 114 hospitals are generated using a two-stage model: µij = 0.2xij + νi , E(yij ) = mij µij , log 1 − µij
where νi ∼ N (0, 0.9). The synthetic samples are presented in Table 5. We are interested in estimating the average cancer remission rate for each county. In Table 6 we present the estimates of proportions obtained using the PQL method as implemented in the SAS Glimmix macro.20 These estimates are compared
Table 5. Data from a simple random sample drawn from a synthetic population (n = 38, N = 114). Area No.
Ni
ni
xij × 100
mij
yij
1 2 3
1 6 4
0 1 2
4 5
1 8
0 2
6
6
2
7
6
2
8 9
6 27
1 10
– 169.99 6.16 30.09 – 94.60 129.93 90.53 86.34 95.10 130.45 46.09 89.88 145.68 113.54 114.54 96.51 84.58 139.38 117.24
– 37 24 3 – 52 47 24 22 9 72 37 3 80 246 8 25 340 390 24
– 1 9 1 – 26 16 15 13 4 37 7 1 22 58 1 7 73 87 7
Area No.
Ni
ni
xij × 100
mij
yij
9
27
10
10
5
2
11
12
4
12
7
3
13 14
4 6
1 2
15
13
5
16
2
1
86.54 101.55 61.34 102.77 61.49 99.77 52.08 136.46 92.28 93.73 62.29 164.14 134.88 164.94 64.44 63.51 73.10 109.66 123.09 186.16
64 9 126 10 3 57 17 5 1067 246 8 25 64 9 94 40 3 57 3 45
19 1 26 2 1 24 9 3 455 114 2 6 32 4 12 5 0 1 1 19
700
S. Gao
Table 6. Small area estimates of proportions (in %) by raw proportion and by the PQL method. Y¯i is the true population rate for each small area. Area No.
Ni
ni
¯ i × 100 X
Y¯i
Yˆraw
YˆPQL
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1 6 4 1 8 6 6 6 27 5 12 7 4 6 13 2
0 1 2 0 2 2 2 1 10 2 4 3 1 2 5 1
137.70 106.18 23.38 45.64 102.08 77.96 125.81 79.80 104.06 84.57 93.84 85.55 120.90 177.58 94.96 123.30
30.00 44.30 37.66 40.38 36.93 68.47 47.39 18.16 23.15 21.70 45.78 42.98 20.60 43.70 11.45 42.29
33.55 51.35 37.04 33.55 42.42 60.87 50.62 18.92 23.21 20.59 45.12 43.22 24.00 49.32 9.64 42.22
35.43 48.88 36.70 35.28 41.97 57.74 49.50 21.80 23.28 21.41 44.41 43.17 26.54 48.26 10.85 41.26
Average relative error %
6.89
6.71
Average squared error:
0.0011
0.0013
to the direct estimates using data from each small area only. For counties without sampled hospitals, the overall mean is used as the estimate. The average relative error and the average squared error are defined in the same way as in the previous example and are included in the table. Although the PQL estimates show smaller overall bias than the direct estimates, it has a slightly larger average squared error than the direct estimates. Raghunathan24 demonstrated the quasi-empirical Bayes method on a data set to estimate county-specific mean number of hospital admission for cancer chemotherapy. A Poisson model for count data was assumed. We want to conclude this section by pointing out that the random effect model approach we described here is a general approach in modeling sampling data in that each small area mean is itself assumed to be random variables following certain distributions. Parameter estimation is always a concern because the estimates directly contribute to the prediction of the non-sampled units. It may be noted that we might appear biased in choosing to present in more details the techniques that are implemented in computer software packages. The emphasis simply reflects the convenience in deriving estimates for our example data. It does not, however, reflect
Special Models for Sampling Survey
701
the superity of performances of the estimation approaches we presented. In fact many approaches for the discrete outcomes have yet to be compared in a well designed simulation study. Therefore, the readers are adviced to keep an open mind on estimation techniques when applying random effect models themselves and when more results on comparing various estimation methods are available.
4. Capture Recapture Models The use of capture recapture models in epidemiology is generally different from the sampling surveys we discussed so far in the previous two sections. Capture recapture setting usually works with several sampling frames, instead of just one in conventional surveys. Capture recapture model concentrates on matching individuals identified by different sources rather than sampling selection from one sampling frame. However, capture recapture model do share a common goal with some sample survey in that it also focus on the estimation of the size of a population. As an alternative to the community survey we introduced in Sec. 2, capture recapture systems can be thought as multiple surveys on the same population trying to estimate the same quantity. This is especially useful when there does not exist one complete sampling frame from which a conventional sample survey can be established to reliably estimate population characteristics. Capture recapture methods have a long history. They were first introduced in the study of fish and wildlife populations before being adopted for other populations. In animal studies, the animals being captured by the first attempt will be marked and returned to wildlife. This allows cross-classify the animals captured by various attempts. Hence the name capture recapture. Various authors have argued against the use of capture recapture model for human population on the bases that the various capture attempts of humans are usually not random.2,11 In epidemiological studies hospital records, doctors’ medical files, medical prescription list are examples of various sources to locate individuals with certain disease. Each of these sources is incomplete by their true nature, and the problem is to estimate those missing from all sources. The simplest capture recapture model is the so called two-source model used to estimate the unknown size of the population. The first sample provides the individuals for marking and the second sample provide the recapture.
702
S. Gao Table 7.
Layout of a two source capture recapture setting. Source B Yes Source A
No
Yes
n11
n10
No
n01
n00
We begin with two source model having sources A and B. Let n11 , n10 , n01 and n00 be the numbers of individuals captured by both sources (n11 ), by source A only (n10 ), by source B only (n01 ), and by neither source (n00 ). Note that n00 is unobservable. By estimating n00 , the number of cases missing by both sources, we provide an estimate of the number of cases in the population. The layout of a two source capture recapture setting is illustrated in Table 7. Four assumptions are implicitly made on capture recapture analysis: (1) There is no change to the population during the investigation. Such a population is usually called a closed population. (2) Individuals can be matched from sources A to source B. (3) In each source each individual has the same chance of being included in the sample. (4) The two sources are independent meaning a “Yes” from source A does not affect the chance of a “Yes” from source B. In epidemiological studies assumptions 1 and 2 can generally be assumed true. However, assumptions 3 and 4 present the biggest problem and has been the subject of debate since the application of capture recapture model in epidemiological fields. Human subjects are known to be heterogeneous with regard to being “caught” by a specific source. Methods to incorporate covariate in the method is becoming available. Tilling and Sterne30 gave the latest development including a review of previous work. Assumption 4 is invariably false and is perceived as the biggest weakness on the use of capture recapture models in epidemiology. Humans are not fish where the chance of being recaptured is truly independent of whether they have been marked. For example, if certain doctors refer their patients to certain hospitals, then hospital records and doctors’ records will not be two independent sources. Fienberg6 approached the interdependence among sources of capture using log-linear model framework. We will focus our presentation using this approach.
Special Models for Sampling Survey
703
Depends on the ways of parametrization, a 2 × 2 contingency table can be represented by the following log-linear models: log E(n11 ) = µ log E(n01 ) = µ + µA log E(n10 ) = µ + µB log E(n00 ) = µ + µA + µB + µAB .
(17)
The parameters on the right hand side of the above equations represent the logarithm of the number expected for each cell. Notice that there are four parameters in the log-linear models but only three known quantities to use for the estimation. One parameter is unestimatable. The customary solution is to assume µAB = 0 which is equivalent of assumption 4. Hence n00 can be estimated by its expected value under the log-linear models: n01 n10 n10 n01 . (18) n ˆ 00 = eµˆ+ˆµA +ˆµB = elog n11 +log n11 +log n11 = n11 The estimate of the total sample size is: n10 n01 ˆ = n11 + n10 + n01 + n . N ˆ 00 = n11 + n10 + n01 + n11
(19)
An example was taken from Bruno2 and modified for presentation here. Four sources were used to identify known cases of diabetes among the residents in the area of Casale Nonferrato in Northern Italy. Data are presented in Table 8. Here, we illustrate the example with the first two sources only. Source A: A list of all patients with a previous diagnosis of insulindependent diabetes mellitus (IDDM) or non-insulin dependent mellitus (NIDDM), via diabetes clinic and/or family physicians. Source B: A list of all patients discharged with a primary or secondary diagnosis of diabetes in all public and private hospitals in the region. Table 8.
Capture recapture models with two sources. 2 Source B
Source A
Yes No
Yes
No
377 115
1417 n00
704
S. Gao
Table 9. Parameter estimates from a two source log-linear model with the interaction term set to zero. Parameter
Estimate
Standard Error
µ µA µB
5.8201 −1.0752 1.4362
0.0545 0.1082 0.0606
Using the log-linear approach described above, we fit a log-linear model with the PROC GENMOD procedure in SAS with the log-link function. Parameter estimates were displayed in Table 9. The estimated number of cases missed by both cases is: n ˆ 00 = eµˆ+ˆµA +ˆµB = 483.54 . The total number of diabetes estimated from using both sources is 2353. Notice in this example, source A identified 1754 cases of diabetes and source B identified 452 cases, corresponding to 75% and 19% of the estimated total cases by using both sources, respectively. Both sources are seen to be relatively incomplete. Note in the two source setting, dependency between the two sources cannot be estimated without additional information. If external data is used to estimate the dependency, µAB in the above log-linear models, estimation of n00 may be possible. If we define P(A), P(B) and P(A∩B) be the probability of captured by source A only, by source B only and by both sources, respectively, we can define the dependence between the two sources to be positive if P(A ∩ B) > P(A)∗P(B), negative if P(A ∩ B) < P(A)∗P(B). It has been shown that positive dependence of sources tends to underestimate the true population size and negative dependence tends to overestimate. Extensions to modeling more than two sources are straightforward. In a capture recapture model with k sources, it is customary to set the highest interaction parameter of order k to be zero. We illustrate the cases of more than two sources with a three-sources analysis. Table 10.
The layout of a three source (A, B and C) capture recapture model. Source B
Source A
Yes No
Yes
No
Source C Yes No
Source C Yes No
n111 n011
n101 n001
n110 n010
n100 n000
Special Models for Sampling Survey
705
The layout of a three-source capture recapture model is presented in Table 10. We use the saturated model to construct the log-linear models for the three-source analysis. log E(n111 ) = µ log E(n110 ) = µ + µC log E(n101 ) = µ + µB log E(n011 ) = µ + µA log E(n100 ) = µ + µB + µC + µBC log E(n010 ) = µ + µA + µC + µAC log E(n001 ) = µ + µA + µB + µAB log E(n000 ) = µ + µA + µB + µC + µAB + µBC + µAC + µABC .
(20)
Recall that there are 8 models, 8 parameters and 7 observations (counts). Therefore, one parameter is unestimable. An untestable assumption is made so that inference is possible: µABC = 0. With more than two sources there is the possibility that a model with fewer parameters will fit the model equally well as the saturated model with 7 parameters (µABC is set to 0). Statistically, the model with the fewest number of parameters fitting the data is to be chosen to represent the data. This leads us to the discussion of model selection criteria. Three methods are commonly used for log-linear model selection: the G2 deviance statistic which is a likelihood ratio statistic comparing a current model to the saturated model, Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). For the three source model, the G2 statistic can be expressed as: X nijk , G2 = −2 nijk log E(nijk ) i,j,k
where E(nijk ) is the expected cell counts under the assumed alternative model other than the saturated one. If a model represents the data, then the difference in deviance between the considered model and the saturated model for which G2 = 0 has an approximate χ2 distribution. The Akaike’s Information Criterion is defined to be: AIC = G2 − 2 (d.f. of the model) .
706
S. Gao
The Bayesian Information Criterion is defined as: log N (d.f. of the model) , 2π where N is the total number of observed cases. Both AIC and BIC take the number of parameters in the model into consideration. The BIC also considers sample size. Both criteria select the model with the lowest value on the respective criterion. A three-source example data from LaPorte et al.17 is modified here to illustrate the issues in model selection. Capture recapture method was used to identify the most accurate and efficient approaches to monitor adolescent injuries. For our example, we consider the issue of accuracy only. We take three sources from the article: 127 identified by student monthly recalls at either 1 month or 4 months, 58 by medical excuses and 33 by daily attendance records. Data is presented in Table 11. A series of seven possible models were fit by using the SAS system for log-linear model. Table 12 includes the values of three model selection criteria: the G2 , AIC and BIC, and the estimated number of cases missed by all three sources and the estimated total number of injuries from all three sources. BIC = G2 −
Table 11. Injuries captured by three sources17 : A: student recall at either 1 or 4 month; B: medical excuses; C: daily attendance records. Medical Excuses
Student Recall
Table 12.
No
Attendance Record Yes No
Attendance Record Yes No
16 0
13 4
39 3
69 n000
The values of G2 , AIC and BIC for all seven models on the LaPorte data. 17
Number 1 2 3 4 5 6 7
Yes No
Yes
Model A, A, A, A, A, A, A,
B, B, B, B, B, B, B,
C C, C, C, C, C, C,
AB AC BC AB, AC AB, BC BC, AC
G2
d.f.
p-value
AIC
BIC
n ˆ 000
ˆ N
8.1222 4.8084 7.4284 6.3624 3.3987 1.9983 5.8258
3 2 2 2 1 1 1
0.0436 0.0903 0.0244 0.0415 0.0652 0.1575 0.0158
2.1222 0.8084 3.4284 2.3624 1.3987 −0.0017 3.8258
2.8509 1.2942 3.9142 2.8482 1.6416 0.2412 4.0687
6 15 5 7 * 21 5
150 159 149 151 * 165 149
Special Models for Sampling Survey
707
Using the G2 statistic, three models (models 2, 5 and 6) are not rejected at the 0.05 significance level which means that these models are not significantly different from the saturated model. Model 5 failed to converge in parameter estimates. The likelihood based statistic would have favored model 2 because it is the simplest model not rejected by G2 . In fact, this model was chosen by LaPorte et al.17 in the original paper. This model predicts 15 cases missed by all three sources and total number of injuries is estimated to be 159. However, both AIC and BIC identified model 6 as the optimal model. The number of cases missed by all sources is estimated to be 21 using model 6 and the total number of cases is estimated to be 165. Notice that the G2 method is based on large sample theory which assumes that each cell count is reasonably large. When there are small or zero cell counts as in this example, the validity of the test is questionable. In the above example, we would favor the use of model 6 over that of model 2 based on this observation. Notice also that the conclusion on the validity of each source is very much dependent on which model one has chosen to represent the data. For example, LaPorte et al.17 stated that student recall is the most accurate source of identifying injury with an estimated accuracy of 86% (137/159) using estimates derived from model 2. If the alternative model 6 is chosen, the accuracy rate for student recall will be estimated to be 83% (137/165), although it remains the most accurate of the three sources. AIC and BIC base their decision on minimization. Therefore, uniqueness of the selected model is generally satisfied. Simulation studies have been conducted to compare AIC, BIC and several modified forms of the two criteria.12 In general, the two criteria are quite comparable. To conclude this chapter, we would like to reiterate the need for proper statistical methods in analyzing complex survey data. Many large national survey data are now accessible to the public for secondary data analysis providing medical researchers unique opportunities to study relationship and trend on the national level. However, great care must be exercised in analytical methods if one is to draw proper conclusion. The intention of this chapter is not on a exclusive coverage of general techniques on analyzing sampling data. Instead our focus of this chapter is on “special models” used for sampling data in the field of epidemiology. Readers are referred to Cochran5 and Kish15 for the background knowledge on sampling theory, to Skinner, Holt and Smith28 for more theoretical development on statistical inference on sampling data. Examples of analysis of health survey data can
708
S. Gao
be found in Leclerc et al.,19 Korn and Graubard,16 Graubard and Korn9 and LaVange et al.18 Acknowledgment The research os supported in part by NIH grant R01 15813. References 1. Beckett, L. A., Scherr, P. A. and Evans, D. A. (1992). Population prevalence estimates from complex samples. Journal of Clinical Epidemiology 45: 393–402. 2. Brenner, H. (1994). Use and limitations of the capture-recapture method in disease monitoring with two dependent sources. Epidemiology 6: 42–48. 3. Breslow, N. E. and Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 88: 9–25. 4. Chambless, L. E. and Boyle, K. E. (1985). Maximum likelihood methods for complex sample data: Logistic regression and discrete proportional hazard models. Communications in Statistics-Theory and Methods 14: 1377–1392. 5. Cochran, W. G. (1977). Sampling Techniques, John Wiley and Sons, New York. 6. Fienberg, S. E. (1972). The multiple-recapture census for closed populations and incomplete 2k contingency tables. Biometrika 59: 591–603. 7. Gao, S., Hui, S. L., Hall, K. S. and Hendrie, H. C. Estimating disease prevalence from two-phase surveys with nonresponse at the second phase. Statistics in Medicine, in press. 8. Ghosh, M. and Rao, J. N. K. (1994). Small area estimation: An appraisal. Statistical Science 9: 55–93. 9. Graubard, B. I. and Korn, E. L. (1996). Modelling the sampling design in the analysis of health surveys. Statistical Methods in Medical Research 5: 263–281. 10. Graubard, B. I. and Korn, E. L. (1999). Predictive margins with survey data. Biometrics 55: 652–659. 11. Hook, E. B. and Regal, R. R. (1995). Capture-recapture methods in epidemiology: Methods and limitations. Epidemiologic Reviews 17: 243–264. 12. Hook, E. B. and Regal, R. R. (1997). Validity of methods for model selection, weighting for model uncertainty, and small sample adjustment in capturerecapture estimation. American Journal of Epidemiology 145: 1138–1144. 13. Hurvich, C. M and Tsai, C. L. (1995). Model selection for extended Quasilikelihood models in small samples. Biometrics 51: 1077–1084. 14. International working group for disease monitoring and forecasting (1995). Capture-recapture and multiple-record systems estimation I: History and theoretical development. American Journal of Epidemiology 142: 1047–1058. 15. Kish, L. (1965). Survey Sampling, John Wiley and Sons, New York.
Special Models for Sampling Survey
709
16. Korn, E. L. and Graubard, B. I. (1995). Analysis of large health surveys: Accounting for the sampling design. Journal of Royal Statistical Society A158: 263–295. 17. LaPorte, R. E., Dearwater, S. R., Chang, Y. F. et al. (1995). Efficiency and accuracy of disease monitoring systems: Application of capture-recapture methods to injury monitoring. American Journal of Epidemiology 142: 1069–1077. 18. LaVange, L. M., Stearns, S. C., Lafata, J. E., Koch, G. G. and Shah, B. V. (1996). Innovative strategies using SUDAAN for analysis of health surveys with complex samples. Statistical Methods in Medical Research 5: 311–29. 19. Leclerc, A., Luce, D., Lert, F., Chastang, J. F. and Logeay, P. (1988). Correspondence analysis and logistic modelling: Complementary use in the analysis of a health survey among nurses. Statistics in Medicine 7: 983–995. 20. Littell, R. C., Milliken, G. A., Stroup, W. W. and Wolfinger, R. D. (1996). SAS System for Mixed Models, SAS Institute Inc., Cary, NC. 21. Little, R. J. A. and Rubin, D. B. (1987). Statistical Analysis with Missing Data, John Wiley and Sons, New York. 22. Morganstein, D. and Brick, J. M. (1996). WesVarPC: Software for computing variance estimates from complex designs. Proceedings of the 1996 Annual Research Conference, Bureau of Census, 861–866. 23. Patterson, H. D. and Thompson, R. (1971). Recovery of interblock information when block sizes are unequal. Biometrika 58: 545–554. 24. Raghunathan, T. E. (1993). A quasi-empirical Bayes method for small area estimation. Journal of the American Statistical Association 88: 1444–1448. 25. Research Triangle Institute (1993). Statistical Methods and Mathematical Algorithms Used in SUDAAN, Research Triangle Park, NC. 26. Roberts, G., Rao, J. N. K. and Kumar, S. (1987). Logistic regression analysis of sample survey data. Biometrika 74: 1–12. 27. Robinson, G. K. (1991). That BLUP is a good thing: The estimation of random effects. Statistical Science 6: 15–51. 28. Skinner, C. J., Holt, D. and Smith, T. M. F. (1989). Analysis of Complex Surveys, John Wiley and Sons, New York. 29. Taylor, D. H., Whellan, D. J. and Sloan, F. A. (1999). Effects of admission to a teaching hospital on the cost and quality of care for medicare beneficiaries. The New England Journal of Medicine 340(4): 293–299. 30. Tilling, K. and Sterne, A. C. (1999). Capture-recapture models including covariate effects. American Journal of Epidemiology 149: 392–400. 31. Warszawski, J., Messiah, A., Lellouch, J., Meyer, L. and Deville, J. C. (1997). Estimating means and percentages in a complex sampling surveys: Application to a French national survey on sexual behaviour. Statistics in Medicine 16: 397–423.
710
S. Gao
About the Author Sujuan Gao is currently an assistant professor in the Division of Biostatistics, Indiana University School of Medicine, Indianapolis, USA. She received her BS in Mathematics (1985) from Beijing Normal University, Beijing, China, and a PhD in Statistics (1991) from the University of Southampton, UK. She joined the Division of Biostatistics at Indiana University School of Medicine as a post-doctoral fellow in 1994 and became a faculty member shortly after. Sujuan Gaos research interests include the analysis of longitudinal data with missing values, the analysis of data collected using complex sampling scheme, the development and application of statistical methods in medical and epidemiological studies.
CHAPTER 19
THE USE OF CAPTURE-RECAPTURE METHODOLOGY IN EPIDEMIOLOGICAL SURVEILLANCE ANNE CHAO Institute of Statistics, National Tsing Hua University, Hsin-Chu, Taiwan 30043 Tel: 886-3-571-5131 ext 3161; [email protected] HSIN-CHOU YANG Institute of Biomedical Science, Academia Sinica, Taipei, Taiwan PAUL S. F. YIP Department of Statistics and Actuarial Science, The University of Hong Kong, Pokfulam Road, Hong Kong
1. Introduction Accurate and timely estimates of disease occurrence over time or across geographic area play an important role in disease monitoring and health care planning. Traditional simple random sampling and other probabilistic sampling schemes are not easily applicable to such situations or are prohibitively expensive. Multiple surveillance systems are usually employed to ascertain cases using different resources or efforts. Although some studies manage to locate almost all patients, most epidemiological approaches merging different lists and eliminating duplicate cases are likely to significantly underestimate true occurrence rates.23,25 That is, the final merged list misses those who are in the target population but were missed by all lists. This chapter discusses the use of capture-recapture models to estimate the number of missing cases under proper assumptions. We use three real data sets to illustrate the use of the capture-recapture methodology to correct for under-ascertainment of cases in epidemiological surveillance. 711
712
A. Chao, H.-C. Yang & P. S. F. Yip
1.1. Example 1. Data on hepatitis A Virus (HAV) Chao et al.12 documented the details on a large outbreak of the HAV that occurred in and around a college in northern Taiwan from April to July 1995. Cases of students in that college were ascertained by three sources: (1) P-list: records based on a serum test taken by the Institute of Preventive Medicine, Department of Health of Taiwan; 135 cases were identified. (2) Q-list: local hospital records reported by the National Quarantine Service; 122 cases were found. (3) E-list: records collected by epidemiologists; of which there were 126 cases. Merging the three lists by eliminating duplicate records resulted in 271 ascertained cases. The categorical data are shown in Table 1 where all ascertained cases are classified according to their presence/absence in the three lists. Presence or absence on any list is denoted by 1 and 0, respectively. For three lists, we can use a sequence of three numbers (each is either 0 or 1) to denote the record of each individual. For example, a record (001) describes an individual on the third list only and a record (011) describes an individual on the second and third lists but not on the first list. The three lists are displayed in an order of P, Q and E; this ordering is arbitrary and any legitimate inference procedure should be independent of the ordering of the lists. Those patients who were missed by all lists have the record (000). There are seven observable records and their counts over all ascertained cases are denoted as Z001 , Z010 , Z011 , Z100 , Z101 , Z110 and Z111 . From Table 1, there were 63 people listed in the E-list only, 55 people listed in the Q-list only, and 18 people listed in both lists Q and E but not in the P-list. Similarly, we can interpret the other records. In the P-list, there were 135 cases, which means Z1++ = Z100 + Z101 + Z110 + Z111 = 135. Table 1.
Categorical data on hepatitis A virus. Hepatitis A list P
Q
E
Data
0 0 0 0 1 1 1 1
0 0 1 1 0 0 1 1
0 1 0 1 0 1 0 1
Z000 =?? Z001 = 63 Z010 = 55 Z011 = 18 Z100 = 69 Z101 = 17 Z110 = 21 Z111 = 28
The Use of Capture-Recapture Methodology
713
Here, when we add over a sample, the subscript corresponding to that sample is replaced by a “+” sign. Similar relationship holds for the other two lists. The number of different cases ascertained in at least one of the lists, 271 in this case, is the sum of all observable cell counts. Epidemiologists suspected that the observed number of cases considerably undercounted the true number of infected and an evaluation of the degree of undercount was needed.12 The purpose was then to estimate the number of missed cases, Z000 , or equivalently, to estimate the number of total individuals who were infected in the outbreak. This data set was analyzed in Chao et al.11 As opposed to many real data sets, this one has the advantage of a known true number of infected because a screen serological check for all students was conducted after the three surveys. In this chapter, we therefore select the HAV data set as an illustrative example to assess the relative merit of various estimation methods.
1.2. Example 2. Data on Neurologic Illness (Stratified by Diagnostic Group) Bobo et al.4 reported a comprehensive surveillance system for acute neurologic illness in children from August 1987 to July 1988 in two States of USA. Three surveillance strategies were employed: (1) Hospital surveillance system (H-list): Cases were identified based on hospitals discharge records. (2) Provider surveillance system (P-list): Cases were reported by pediatricians and neurologists. (3) Study staff surveillance system (S-list): Cases collected by the staff members by visiting all participating facilities and checking clinical records of potential cases. For this data set, relevant covariates (auxiliary or explanatory variables) include geographic location (Oregon or Washington), gender and primary diagnostic groups (encephalopathies, infantile spasms, afebrile seizure and complex febrile seizure). These four groups are referred to as stratum A, B, C and D for convenience. Bobo et al.4 found that substantial difference exists in case ascertainment rates by diagnostic groups. The post-stratified data by diagnostic groups are shown in Table 2. This covariate (group) is also referred to as a post-stratifying or stratifying variable. The data structure for each group is similar to that in Example 1. The collapsed data over the four groups are shown in the last column. In the original data, there were 626 ascertained cases. In Table 2 and our analysis in Sec. 5, we only consider 619 cases with
714
A. Chao, H.-C. Yang & P. S. F. Yip Table 2. List
Categorical data on neurologic illness. Diagnostic Group (Stratum)
H
P
S
A
B
C
D
Total
0 0 0 0 1 1 1 1
0 0 1 1 0 0 1 1
0 1 0 1 0 1 0 1
? 11 2 6 6 7 1 2
? 7 1 5 1 2 1 4
? 131 38 31 37 62 14 31
? 103 4 20 26 44 11 11
? 252 45 62 70 115 27 48
known diagnostic groups. There were 260, 182 and 477 cases, respectively, in H-list, P-list and S-list. Despite the comprehensive surveillance systems, Bobo et al.4 concluded that there were still some people who could not be identified. They performed capture-recapture adjustment for the data within each stratum and the collapsed data in order to obtain an accurate occurrence rate for various sub-populations defined by the available covariates. Their results showed that the ascertainment rate for the four groups were 82%, 94%, 69% and 91%, respectively. The rate was substantially low for the afebrile seizures. 1.3. Example 3. Drug Data (Stratified by the Length of Time on Drug) Wittes50 presented an ascertainment data set on patients receiving synthetic penicillin called methicillin. Cases were identified by the following four systems: (1) intravenous nurses (100 cases); (2) hospital floor nurses (21 cases); (3) hospitals pharmacists (156 cases) and (4) medication sheets (348 cases). We refer to these four lists as list 1, 2, 3 and 4, respectively. A total of 428 cases were found. Wittes50 indicated that the length of time a patient was given the drug was related to his/her probability of being recorded. The original data consist of four strata for the time length (1–3 days, 4–6 days, 7–10 days and 11+ days). We combine the last two strata and the data are shown in Table 3. For each stratum, there are 15 observable presence/absence records and each can be expressed by a sequence of four numbers. Wittes50 found that an independent model (see Sec. 3 for explanation of the model) in each stratum fitted well and obtained an estimate of 544
715
The Use of Capture-Recapture Methodology Table 3. List
Categorical data on drug. Usage on Drug (stratum)
1
2
3
4
1–3 days
4–6 days
7+ days
Total
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
? 48 18 14 1 1 1 0 8 8 1 1 0 2 0 0
? 83 13 33 4 1 0 3 6 16 2 6 0 0 0 0
? 66 12 27 1 1 0 1 6 17 5 17 1 2 1 1
? 197 43 74 6 3 1 4 20 41 8 24 1 4 1 1
(s.e. 22.4) for the total number of patient receiving the drug. Dependence was suspected between lists 3 and 4 because the records from the pharmacy were duplicates of the medication sheets. To eliminate this possible dependence, the lists 3 and 4 were combined to form only one list. Then based on this combined list, list 1 and list 2, an estimate of 536 was obtained under independence. Both models provide evidences that a non-negligible number of patients were missed by all four identification sources. One common goal for the above three examples is to find out under what assumptions or models we can estimate the number of missing cases and adjust for under-ascertainment considering the relevant covariate in the analysis. This has analogues in the biological sciences: Estimating the number of unseen animals in a closed population considering environmental factors or individual covariates. Here, a closed population means that there is no addition and loss so that the population size is a constant during the study period. The estimation of population size is a classical problem and has been extensively discussed in the literature. Biologists have long realized that it is almost impossible to determine the size of a population by counting every animal. Most animals cannot be drawn like balls in an urn or numbers on a list, thus special types of sampling schemes have been developed. Capture-recapture sampling has
716
A. Chao, H.-C. Yang & P. S. F. Yip
been widely used to adjust for undercount in the biological sciences. The recapture information (i.e. source-overlap information or source intersection) collected by marking or tagging can be used to estimate the number of missing under proper assumptions. Therefore, it is not necessary to count every animal in order to obtain an accurate estimate of population size. In contrast, epidemiologists have attempted to enumerate all relevant cases to obtain the prevalence rates for various diseases. Cases in various lists are usually merged and any duplicate cases are eliminated. The overlap information is thus ignored. This typical approach assumes complete ascertainment and does not correct or adjust for under-ascertainment. As Hook and Regal23,25 indicated, most prevalence surveys merging several records of lists are likely to miss some cases and thus be negatively biased. There is relatively little literature in the health sciences on the assessment of the completeness of these types of surveys or on the adjustment for underascertainment. Therefore, as commented by LaPorte et al.,35 people know more about the number of animals than the count of diseases. In a similar way that ecologists and biologists count animals, we introduce with proper modification in this chapter the use of capture-recapture models to count human populations. In Sec. 2, the background and motivation of the capture-recapture technique and its adaptation for use in human populations are reviewed. Section 3 summarizes the capture-recapture models when no covariates are available. Section 4 presents stratified capture-recapture models including covariates (or stratifying variable) effects. The analyses of the above three data sets are shown in Sec. 5. Concluding remarks are discussed in Sec. 6.
2. Background and Motivation of Capture-Recapture Capture-recapture sampling was originally developed for estimating demographic parameters of animal populations. In a typical animal capturerecapture experiment, traps or nets are placed in the study area at several times, often called capture occasions (or trapping samples). At the first occasion, a number of animals are captured. A tag or mark with a unique number or record is attached to each captured animal. These animals are then released back into the population. At each subsequent occasion, animals that are first-captures are similarly marked and the tag numbers of re-captures are recorded. At the end of the experiment, a sequence of samples is obtained and the complete capture history for each captured animal is known.
The Use of Capture-Recapture Methodology
717
Why is marking or tagging necessary in animal studies? Clearly, many animals look the same to humans and individuals cannot be identified by sight. Marking or tagging is used to distinguish captured individuals so that the recapture information (overlap information) by marking or tagging can be used to estimate the number of missing animals in the experiment. Marks include banding and tagging, paint brushing, toe clipping, ear clipping and in some species individual animals can be identified. Intuitively, if the proportion of newly captured animals in the later capture occasions is high, we know that the population size is much higher than the number of distinct captures. On the other hand, if there is a low proportion of newly captured animals, then we are likely to have caught most of the animals and the population size is close to the number of captured animals. According to Seber,44 the original idea of two-sample capture-recapture technique can be traced back to Laplace, who implemented it to estimate the population size of France. The interesting history of Laplaces survey conducted in 1802 was described in Cochran.13 The earliest applications to ecology include Petersens and Dahls work on fish populations and Lincolns use of band returns to estimate waterfowl in 1930. More sophisticated statistical theory and inference procedures have been proposed since Darrochs17 paper, in which the mathematical framework of this topic was founded. Seber,44–46 , Pollock39 and Schwarz and Seber43 provided comprehensive reviews on the methodologies and applications. The capture-recapture technique has been applied to human populations under the term “multiple-record system”.20,32,33,47,50,52 The special two-sample cases are often referred to as the “dual-system” or “dual-record system”. For ascertainment data, if each list is regarded as a trapping sample and identification numbers, names and other characteristics are used as tags or marks, then this framework is similar to a capture-recapture setup for wildlife estimation. Comparisons of the applications to human and animal populations are listed in Table 4. The earliest references to the application of the capture-recapture techniques to health science included the pioneering paper by Sekar and Deming47 for two samples, Wittes and Sidel52 for three samples, Wittes50 for four samples, Wittes et al.51 and Fienberg20 for five samples. Hook and Regal23 also suggested the use of capture-recapture models even for apparently exhaustive surveys. In recent years, there has been growing interest in the use of this technique in various disciplines. For example, another important application area is software reliability.5 Hook and Regal,25
718 Table 4.
A. Chao, H.-C. Yang & P. S. F. Yip Comparison of capture-recapture applied to human and animal populations. Human Populations (Multiple-List System)
Animal Populations (Capture-Recapture Sampling)
Similarities: Lists, sources, records Identification numbers and/or names Ascertainment probability
Trapping samples or occasions Marks or tags Capture probability
Differences: Usually only 2 to 5 lists No natural time ordering among lists Different ascertainment methods
Any t number of samples or occasions (t ≥ 2) Natural time ordering in samples or occasions Identical trapping methods
Some Shared Models Considered in this Chapter: Log-linear models14,20,32,33 Sample coverage method10,11 Logistic regression models2,30,31,53
IWGDMF32,33 and Chao8 provided overviews of the applications of the capture-recapture models specifically to epidemiological data. However, some critical comments and concerns about the use of capture-recapture models in analyzing ascertainment data have been expressed by several authorsd.15,19,34,38,42 For some of the concerns, Chao et al.11 provided relevant discussion from a statistical point of view. As shown in Table 4, there are some principal differences between wildlife and human applications. Researchers in wildlife and human populations have developed models and methodologies along separate lines. In Table 4, we list the approaches that are applicable to both populations. We will address those approaches in the next two sections after the introduction of notational conventions. Throughout this paper, we use the following notational conventions: • The true unknown population size (i.e. the number of individuals in the target population) is N and all individuals can be conceptually indexed by 1, 2, . . . , N . • There are t lists (samples, records, or sources) and they are indexed by 1, 2, . . . , t. • There are M identified individuals, i.e. M equals to the sum of all observable cell counts. • Denote Zs1 ,s2 ,...,st as the number of individuals with record s1 , s2 , . . . , st , where sj = 0 denotes absence in list j and sj = 1 denotes presence in list j.
The Use of Capture-Recapture Methodology
719
• Denote nj , j = 1, 2, . . . , t as the number of individuals identified in the jth list. • Denote Pij as the capture or ascertainment probability of the ith individual in the jth list. Basic assumptions are: • All individuals act independently. • Interpretation or definition for the characteristic of the target population should be consistent for all data sources. • Closure assumption: The size of the population is approximately a constant during the study period. • Ascertainable assumption: Any individual must have a positive probability to be ascertained by any source; any un-ascertainment is purely due to small chance rather than impossibility. Remark: When a random sample is feasible in a dual-system, some special types of structure zeros are permitted; see Sec. 6.1 of Chao it et al.11 • For all sources, identification marks are correctly recorded and matched. Traditional statistical approach further assumes that the samples are independent. In animal studies, this traditional assumption is in terms of an even more restrictive “equal-catchability” assumption, i.e. in each fixed trapping sample all animals have the same capture probability. (Equal catchability assumption implies independence among samples but the reverse is not true; see Sec. 3.) Dependence or unequal catchabilities may be caused by the following two sources: (1) Local dependence (also called list dependence) within each individual/stratum: Conditional on any individual, the presence/absence in one source has a direct causal effect on this individual’s probability of inclusion in other sources. In animal populations, local dependence arises mainly from a behavioral response to capture due to identical trapping method. Animals may become trap-happy, and have an increased probability of subsequent capture, if baited traps are used whereas they may become trap-shy, and have a decreased probability of subsequent capture, if mist nets or ear clipping are used. Local dependence within each individual/stratum may also arises for human populations. For example, the probability of going to a hospital for treatment for any individual depends on his/her result on the serum test of the HAV, leading to dependence between the ascertainment of the serum sample and that of the hospital sample.
720
A. Chao, H.-C. Yang & P. S. F. Yip
(2) Heterogeneity among individuals: Even if the two lists are independent within an individual/stratum, the ascertainment of the two sources may become dependent if the capture probabilities are heterogeneous among individuals/strata. Hook and Regal24 presented an interesting epidemiological example. For many populations, capture or ascertainment probability may vary with age, gender, location, activity, diagnostic symptom, severity of illness or other individual characteristics. For example, in animal populations, some females tend to be less likely to be captured in all trapping occasions, leading to dependence among samples. In human populations, severe cases are more ascertainable in all lists than less severe cases, also leading to dependence. These two types of dependencies are usually confounded and cannot be easily disentangled in a data analysis. Lack of independence leads to a bias (called “correlation bias”) for the usual estimator which assumes independence. For example, in two-list cases, the widely used Petersen estimator underestimates the true size if both samples are positively dependent. Conversely, it overestimates for negatively dependent samples (see Sec. 3). A similar conclusion holds for a general number of samples. When only two lists are available, the data are insufficient for estimating dependence unless additional covariates are available. All existing methods unavoidably encounter this problem and adopt the independence assumption. Therefore, when there are no available covariates, at least three lists are required to model dependence between samples. 3. Models Without Covariates For closed populations, the most commonly used models were proposed by Pollock.37,39,49 This class of models considers time (or occasional) effect, behavioral response to capture and heterogeneity among individuals. For models incorporating behavioral response, which induces local dependence among samples, the capture probability in a sample depends on whether the animal was captured in the “previous” samples. Hence, the ordering of the trapping samples is involved and estimators do depend on the ordering of samples. Since there is usually no ordering among human lists or the ordering may vary with individuals, such models with behavioral response are rarely adopted in modeling local dependence for humans. Heterogeneous model which allows different capture probabilities among animals are potentially useful in health science. A commonly used estimator for such a model is the jackknife estimator proposed by Burnham and
The Use of Capture-Recapture Methodology
721
Overton.6 To assure the jackknife work well, the number of trapping samples should be at least five.37 Similar condition is also required for other estimation procedure.11 However, only two to five human lists are usually available. Therefore, most ecological models have limited use in epidemiological applications. Consequently, we only focus on another two approaches: loglinear models and sample coverage approach. 3.1. Log-linear models The log-linear models14,20 have been extensively used to handle dependence among samples. Part of this approach is discussed in Chapter 13. The methodology applied to human diseases was well covered in IWGDMF.32,33 In this approach, various log-linear models are fitted to the observed cells. How well a model fits the data is assessed using the deviance statistic and a model is usually selected based on the Akaikes information criterion (AIC). The chosen model is then projected onto the unobserved cell by assuming that there is no highest order interaction. The two types of dependencies can be modeled by including some specific interactions or common interaction in the models. We use three lists for illustration. The log-linear approach models the logarithm of the expected value of each observable category. Let I(A) denote the usual indicator function, i.e. I(A) = 1 if A is true, 0 otherwise. The most general model is log E(Zs1 ,s2 ,s3 ) = u + u1 I(s1 = 1) + u2 I(s2 = 1) + u3 I(s3 = 1) + u12 I(s1 = s2 = 1) + u13 I(s1 = s3 = 1) + u23 I(s2 = s3 = 1) + u123 I(s1 = s2 = s3 = 1) .
(1)
That is, log E(Z000 ) = u, log E(Z100 ) = u + u1 , log E(Z010 ) = u + u2 , log E(Z001 ) = u + u3 , log E(Z110 ) = u + u1 + u2 + u12 , log E(Z101 ) = u + u1 + u3 + u13 , log E(Z011 ) = u + u2 + u3 + u23 and log E(Z111 ) = u + u1 + u2 + u3 + u12 + u13 + u23 + u123 . This is a reparametrization of the eight expected values. For three-list data, we have seven observed categories, whereas there are eight parameters. Therefore, a natural assumption is that there is no threelist interaction term, i.e. u123 = 0. Intuitively, this means the complete 2 × 2 table formed with respect to lists 2 and 3 for individuals in list 1 and the incomplete 2 × 2 table for individuals not in list 1 have the same odds ratio. The sample odds ratio for the former table is Z111 Z100 /(Z110 Z101 )
722
A. Chao, H.-C. Yang & P. S. F. Yip
whereas the odds ratio for the latter table is Z011 Z000 /(Z010 Z001 ). The assumption of u123 = 0 allows the following extrapolation formula Zˆ000 = Zˆ001 Zˆ010 Zˆ100 Zˆ111 /(Zˆ110 Zˆ011 Zˆ101 ), which expresses the estimated missing cases as a function of the fitted values of other categories. The independent model includes only main effects as given by log E(Zs1 ,s2 ,s3 ) = u + u1 I(s1 = 1) + u2 I(s2 = 1) + u3 I(s3 = 1), which is denoted by model (1, 2, 3) as used in categorical data analysis. The interaction terms are used to model dependence. If local list dependence arises in samples 1 and 2, then the interaction term u12 is included, and the model is denoted as model (12, 3) or 12/3. If local dependence also appears in samples 1 and 3, then the two interactions u12 and u13 are needed. The model is denoted as model (12, 13) or 12/13 and similarly for models (13, 2), (23, 1), (13, 23) and others. The log-linear model can also be motivated by the Rasch41 model and its generalizations which incorporate heterogeneity among individuals. The Rasch model assumes logit(Pij ) = αi + τj , where Pij denotes the capture probability of the ith individual on the jth list, {α1 , α2 , . . . , αN } represents heterogeneity effects among individuals and {τ1 , τ2 , . . . , τt } denotes the list effects. Since only dependence due to heterogeneity is considered in the Rasch model, the capture probability for the ith individual in any category can be determined by the product of {Pij , j = 1, 2, . . . , t}. A generalized Rasch model allows the heterogeneity effects {α1 , α2 , . . . , αN } to be different for two or more separate groups of samples. It has been verified18,21 that the Rasch (generalized Rasch) model is equivalent to a quasi-symmetric (partial quasi-symmetric) model with some moment constraints. Except for the constraints, a quasi-symmetric model for the three-list case with no second-order interaction, i.e. u123 = 0, is equivalent to the model with first-order interactions identical; this is denoted by (12 = 13 = 23) or simply H1 (which is called the first-order heterogeneity by IWGDMF).32,33 Only one degree of freedom is used to model heterogeneity. A partial quasi-symmetric model which assumes the heterogeneity effects are identical only for the first and second lists, is equivalent to the model with u13 = u23 . This model is denoted as (13 = 23, 12). Similarly, we have models (12 = 13, 23) and (12 = 23, 13) corresponding to other two partial quasi-symmetric models. Therefore, the dependence due to heterogeneity can be modeled by either a quasi-symmetric or a partial quasi-symmetric model. When both types of dependencies occur, they are inevitably confounded in the interaction or common interaction terms and cannot be separated.
The Use of Capture-Recapture Methodology
723
The log-linear model can be similarly formulated when there are more than three lists. The basic assumption for 4 lists is the third-order interaction vanishes (i.e. u1234 = 0); that is, the 3-list interaction for individuals in list 1 is the same as that for individuals not in list 1. Local list dependence can be modeled by including the first-order interaction term (u12 , u13 , u14 , u23 , u24 , u34 ) and/or the second-order interaction (u123 , u134 , u124 , u234 ). The Rasch model is equivalent to a model with the first-order heterogeneity H1 (i.e. 12 = 13 = 14 = 23 = 24 = 34) and the second-order heterogeneity H2 (i.e. 123 = 124 = 134 = 234), and thus denoted by (H1, H2). Two degrees of freedom are used to model heterogeneity. If additional local dependencies also occur between lists 1 and 2, lists 1 and 3, and lists 2 and 4, then we add three more parameters u12 , u13 and u24 to the model and the resulting model is denoted as (12/13/24, H1, H2). Parallel formulations can be obtained for the general case of t lists; see Lloyd36 for details. The reader is referred to Agresti,1 Coull and Agresti16 and Fienberg et al.21 for other related and useful models. 3.2. Sample coverage approach This approach was proposed by Chao and Tsay10 for the three-list case. The extension to a general case is presented in Tsay and Chao.48 Details and relevant software are reviewed in Chao et al.11 The approach aims to provide a measure to quantify the overlap information and also to propose parameters to quantify source dependence. Dependence is modeled by parameters called the “coefficient of covariation” (CCV). To better understand the CCV parameters, we discuss the dependence measure only for the heterogeneous case. Let the two sets of probabilities, {Pij ; i = 1, 2, . . . , N } and {Pik ; i = 1, 2, . . . , N }, denote the capture probabilities for N individuals in samples j and k, respectively. The CCV of samples j and k for a fixed-effect approach is defined as PN N 1 1 X (Pij − µj )(Pik − µk ) i=1 Pij Pik = = −1 , (2) γjk = N i=1 µj µk N µj µk PN where µj = i=1 Pij /N = E(nj )/N denotes the average probability of being listed in the jth sample. The magnitude of γij measures the degree of dependence between samples j and k. The two heterogeneous samples PN are independent if and only if γij = 0, i.e. N −1 i=1 Pij Pik = µj µk which means that the covariance between the two sets of probabilities is zero. Thus if only one set of probabilities is homogeneous, then it suffices to assure independence provided no local dependence exists.
724
A. Chao, H.-C. Yang & P. S. F. Yip
Two samples are positively (negatively) dependent if γjk > 0 (γjk < 0), PN PN which is equivalent to N −1 i=1 Pij Pik > µj µk (N −1 i=1 Pij Pik < µj µk ), i.e. the average probability of jointly being listed in the two samples is greater (less) than that in the independent case. The CCV can be similarly defined for more than two sets of heterogeneous probabilities. When there are only two lists, say, lists 1 and 2, the relative bias of Petersens estimator (bias divided by the estimate) is approximately −γ12 .10,47 This explains the direction of the correlation bias for Petersens estimator, as stated in Sec. 2. Thus, the value of CCV also quantifies the correlation bias. The CCV for the general cases with two types of dependencies has been developed,10 but it will not be addressed here. We remark that all CCVs in the general cases measure the mixed overall effect of the two types of dependencies. The sample coverage is used as a measure of overlap fraction of the available lists. While the mathematical formula for the sample coverage is complicated, its estimator is intuitively understandable. The estimated sample coverage can be written as10 Z010 Z001 1 Z100 ˆ + + C = 1− 3 n1 n2 n3 1 Z100 Z010 Z001 = 1− + 1− + 1− , 3 n1 n2 n3 which is the average (over three lists) of the fraction of cases found more than once. Note that Z100 , Z010 and Z001 are the numbers of individuals listed only in one sample. Hence, this estimator is the complement of the averaged fraction of singletons. Obviously, singletons cannot contain any overlapping information. Define D=
1 [(M − Z100 ) + (M − Z010 ) + (M − Z001 )] 3
1 = M − (Z100 + Z010 + Z001 ) . 3 Here (Z100 + Z010 + Z001 )/3 represents the average of the non-overlapped cases and recall that M denotes the total number of identified cases. Thus, D can be interpreted as the average of the overlapped cases. The sample coverage estimation procedures for the three-list case are summarized in the following:
The Use of Capture-Recapture Methodology
725
(1) When the three sources are independent, a simple population size estimator is derived as: ˆ0 = D/Cˆ . N
(3)
It can also be intuitively thought of as ratio of overlapped cases to overlap fraction. (2) When dependence exists and the overlap information is large enough (how large it should be will be discussed further below), we take into ˆ0 account the dependence by adjusting the above simple estimator N based on a function of two-sample CCVs. The resulting estimator has the following explicit form: " ( (Z1+0 Z+10 )Z11+ Z + Z + Z 1 +11 1+1 11+ ˆ = ÷ 1− N ˆ ˆ n1 n2 3C 3C #) (z0+1 + Z01+ )Z+11 (Z10+ + Z+01 )Z1+1 . (4) + + n1 n3 n2 n3 (3) For relatively low sample coverage data, we feel the data do not contain sufficient information to accurately estimate the population size. In this ˆ1 is suggested: (The estimator case, the following “one-step” estimator N is called “one-step” because it is obtained by one iterative step from the above-mentioned adjustment formula.) ˆ1 = D + 1 [(Z1+0 + Z+10 )ˆ γ12 + (Z10+ + Z+01 )ˆ γ13 N Cˆ 3Cˆ + (Z01+ + Z0+1 )ˆ γ23 ] , where CCV estimates are ˆ ′ Z11+ − 1 , γˆ13 = N ˆ ′ Z1+1 − 1 , γˆ23 = N ˆ ′ Z+11 − 1 , γˆ12 = N n1 n2 n1 n3 n2 n3 and " D 1 D Z11+ ′ ˆ = −1 N + · (Z1+0 + Z+10 ) Cˆ Cˆ n1 n2 3Cˆ D Z1+1 −1 + (Z10+ + Z+01 ) · Cˆ n1 n3 # D Z+11 + (Z01+ + Z0+1 ) · −1 . Cˆ n2 n3
(5)
(6)
726
A. Chao, H.-C. Yang & P. S. F. Yip
This one-step estimator can be regarded as a lower (upper) bound for positively (negatively) dependent samples. Hook and Regal27 noted that most data sets used in epidemiological applications tend to have a net positive dependence. Thus, the one-step estimator is often used as a lower bound. ˆ0 , N ˆ, N ˆ1 ) will be simply referred to as The above three estimators (N ˆ A bootstrap sample coverage estimators if there is no confusion with C. 10 resampling method was proposed to obtain estimated standard errors for the above three estimators and to construct confidence intervals using a log-transformation.7 A relatively low overlap fraction means that there are many singletons. In this case, the undercount cannot be measured accurately due to insufficient overlap. Consequently, a large standard error is ˆ in Eq. (4). How large should the usually associated with the estimator N 11 overlap information be? Chao et al. suggested that the estimated sample coverage should be at least 55%. A practical data-dependent guideline can be determined from the estimated bootstrap s.e. associated with the estiˆ . If the estimated bootstrap standard error becomes unacceptable mator N (say, it exceeds one-third of the population size estimate), then only the lower or upper bound in Eq. (5) is recommended. The estimation procedure for the general t-sample case11 is parallel to that for the 3-sample case as discussed above. 4. Models With Covariates In animal populations, individuals covariates include age, gender, body weight, wing length and others; environmental covariates include temperature, rainfall, number of traps and others. For human populations, relevant covariates are age, gender, race, geographic area, marital groups, diagnostic group, time of onset, severity of diseases and many other explanatory variables. The covariate variables are also classified as either discrete (categorical type) or continuous (numerical type). As discussed earlier, traditional approach depends on a crucial assumption of “equal-catchability”. Heterogeneity in capture probabilities induces dependence among samples, which causes correlation bias in the usual estimator. One approach to assessing heterogeneity is based on the assumption that heterogeneity can be largely explained by some relevant observable covariates. If covariate is discrete, Sekar and Deming 47 were the first to suggest post-stratification to reduce the bias due to heterogeneity. That is, if the population can be divided into several homogeneous
The Use of Capture-Recapture Methodology
727
sub-populations defined by relevant covariates, then a stratified analysis can be performed. That is, log-linear model or any other proper model is fitted to the data for each stratum, then all estimates are combined to obtain a final estimate.32,33,50 Pollock et al.40 were the first to use a logistic model to include continuous covariates in the analysis. In this approach, covariates are used to model heterogeneous capture probabilities by a logistic regression. They developed an estimation procedure based on the full likelihood. However, the covariates for the un-captured animals are not observable. Therefore, they had to make some assumptions about the covariates for the uncaptured animals. Huggins30,31 and Alho2 avoided this difficulty by using a likelihood conditional on the captured animals so that the covariates of the un-captured are not needed. After the coefficients of the logistic regression are obtained, a Horvitz–Thompson type of estimator29 is then employed to obtain an estimate of population size. Alho et al.3 applied this logistic regression approach to the 1990 census and the Post-Enumeration Survey of the United States. Yip et al.53 extended this logistic regression to allow random removals in the experiments. We now apply the logistic regression model to the data sets in Tables 2 and 3, in which there is only one stratifying variable. A unified model proposed in Huggins30,31 and Yip et al.53 can incorporate effects for covariates, capture occasions and behavioral response. If the times for the individuals being recorded on the respective lists are known, then the behavioral response effect for humans could be explored. Since such information for both examples is not available, we only consider a model with stratum effect and occasional effects. Assume that there are k strata. We need to construct k − 1 dummy indicators to specify the effect of each stratum. That is, for the ith individual define the dummy variable Wis = I (the ith individual is in the sth stratum), s = 1, 2, . . . , k − 1, a logistic regression model can be expressed as Pij login(Pij ) = log 1 − Pij = a + cj + β1 Wi1 + β2 Wi2 + · · · + βk−1 Wi,k−1 .
(7)
In this model, the parameters are a: baseline intercept, (c1 , c2 , . . . , ct−1 ): occasional or list effect, (ct = 0), (β1 , β2 , . . . , βk−1 ): stratum effect, β2 denotes the effect of the sth stratum,
728
A. Chao, H.-C. Yang & P. S. F. Yip
s = 1, 2, . . . , k − 1. (The effect of the kth stratum is assumed to be 0.) Under this model, the capture probability for each capture record and thus the likelihood can be formulated. Then the maximum likelihood estimates of the parameters a, (c1 , c2 , . . . , ct−1 ) and (β1 , β2 , . . . , βk−1 ) are searched by numerical iteration. The population size is estimated by the Horvitz– Thompson estimator. Huggins30,31 derived the variance of the resulting estimator by an asymptotic variance formula. Note that the logistic type model has an advantage that it can be used to assess the effect of any type of covariate (both discrete and continuous). However, it does not take account of any local dependence and heterogeneity is entirely explained by covariates. To incorporate possible local dependence, a multivariate logistic model22 might be needed and will be a worthwhile future research topic. 5. Analysis of Three Examples 5.1. Hepatitis A virus data (Three lists) Table 5 shows the results for analyzing the HAV data. The first part of the table presents Petersens estimate based on any pair of lists. Although Petersens estimator is valid only under the restrictive independence assumption, they are practically useful as a preliminary analysis. It has been suggested21,23 that estimates based on any two lists can be used to detect possible dependence. A substantially higher (lower) estimate signifies possible negative (positive) dependence for those two samples. For the HAV data, Petersens estimates are in the range of 330 to 380. The narrow range of these estimates would not indicate the possible direction of dependence at this stage. The second part of Table 5 includes the results for all possible loglinear models fitted to the three-list data. The corresponding deviances and estimates of the total number of infected are also shown. The independent model produces an estimate of 388, which is close to the results for any two samples. Except for the saturated model, all the log-linear models, which consider local independence only and do not take into account heterogeneity, i.e., models (PE, Q), (QE, P), (PQ, E), (PQ, QE), (PQ, PE), and (QE, PE), do not fit the data well. All other models, which take heterogeneity only into account (quasi-symmetric and partial quasisymmetric models) fit well. Those adequate models produce approximately the same estimates of 1300 with an approximate estimated s.e. of 520. In
729
The Use of Capture-Recapture Methodology Table 5.
Analysis results for the HAV data.
Model/Method
Deviance
d.f.
AIC
Estimate of total infected (s.e.)
Petersens estimates for pair of lists: (P, Q) (P, E) (Q, E)
0 0 0
0 0 0
336 (29) 378 (36) 334 (30)
(P, Q, E) independent (PE, Q) (QE, P) (PQ, E) (PQ, QE) (PQ, PE) (QE, PE) (PQ, QE, PE)
24.36 24.25 21.33 21.14 13.20 19.42 19.90 0
3 2 2 2 1 1 1 0
69.8 71.6 68.7 68.5 62.6 68.8 69.3 51.4
388 393 413 416 527 464 452 1313
Quasi-symmetric (PQ = QE = PE)
0.96∗
2
48.4
1313 (521)
Partial quasi-symmetric (PQ = QE, PE)
0.03∗
1
49.4
1309 (519)
Partial quasi-symmetric (PQ = PE, QE)
0.86∗
1
50.3
1306 (517)
Partial quasi-symmetric (QE = PE, PQ)
0.55∗
1
49.9
1325 (528)
Log-linear models for three lists: (23) (27) (31) (33) (79) (61) (54) (521)
Sample coverage approach
ˆ0 : Eq. (3) N ˆ : Eq. (4) N ˆ N1 : Eq. (5) ∗ Deviance
D
ˆ C
γ ˆ12
γ ˆ13
γ ˆ23
Estimate (s.e.)
208.7 208.7 208.7
0.513 0.513 0.513
0.21 1.89 0.51
0.08 1.57 0.34
0.22 1.91 0.52
407 (28) 971 (925) 508 (40)
is not significant at the 5% level, which means a proper fit.
terms of AIC, the quasi-symmetric model is selected, but the relatively large estimated s.e. shows that the data are actually insufficient to fit a heterogeneous model. The third part of Table 5 contains the sample coverage approach. The sample coverage is estimated to be Cˆ = 51.3%, and the average of the overlapped cases is equal to D = 208.67. If we ignore the possible ˆ0 = D/Cˆ = dependence between samples, an estimate based on (3) is N 208.67/0.513 = 407, which is slightly higher than the estimate of 388 based
730
A. Chao, H.-C. Yang & P. S. F. Yip
on the independent log-linear model. The estimator given in Eq. (4) is ˆ = 971, but a large estimated bootstrap s.e. (925) renders the estimate N useless. The estimated s.e. was calculated by using a bootstrap method based on 1000 replications. We feel these data with a relatively low sample coverage estimate of 51% do not contain enough information to correct for ˆ1 = 508 with undercount. The proposed one-step estimator in Eq. (5) is N an estimated s.e. of 40 using 1000 bootstrap replications. The same bootstrap replications produce a 95% confidence interval of (442, 600) using a log-transformation.7 We remark that the estimated s.e. and confidence intervals might vary from trial to trial because replications vary in the bootstrap procedures. It follows from Eq. (6) that the CCV measures depend on the value of N . The CCV estimates in Table 5 based on the three estimates of N show that ˆ = 508 any two samples are positively dependent. As a result, the estimate N can only serve as a lower bound. Also, the estimates assuming independence based on two samples should have a negative bias. However, we cannot distinguish which type of dependence (local dependence or heterogeneity) is the main cause of the bias. In December 1995, the National Quarantine Service of Taiwan conducted a screen serum test for the HAV antibody for all students of the college at which the outbreak of the HAV occurred.12 After suitable adjustments, they have concluded that the final figure of the number infected was about 545. Thus this example presents a very valuable data set with ˆ1 does provide a the advantage of a known true parameter. Our estimator N satisfactory lower bound. This example shows the need for undercount correction and also the usefulness of the capture-recapture method in estimating the number of missing cases.
5.2. Stratified neurologic illness data (Three lists) Various models have been fitted to the stratified neurologic illness data and the results are shown in Table 6. Except for the first stratum, the Petersen estimate based on the H-list and P-list is much lower than the other two Petersens estimates. Thus positive dependence exists between the two lists. This finding is further confirmed by the CCV estimates (not reported) in the sample coverage approach. If dependence is ignored, the pooled stratified estimate gives an estimate of 762 (s.e. 21), which is identical to that obtained from an un-stratified analysis. Also, the sample coverage estimate under independence for
731
The Use of Capture-Recapture Methodology
Table 6. Various estimates of the population size for neurologic illness data (Standard error is in the parenthesis). Diagnostic Group (Stratum) Model/Method
A
B
C
D
Stratified
Un-stratified
59 (16) 46 (7) 36 (5)
18 (3) 24 (3) 22 (2)
365 (34) 192 (24) 634 (45) 395 (19) 298 (21) 763 (29) 469 (34) 264 (23) 791 (41)
631 (46) 761 (29) 789 (41)
(H, P, S) Independent (HP, S) (HP, HS) Quasi-symmetric (HS = PS, HP)
43 42 39 40 43
22 23 22 26 23
429 438 505 543 533
762 778 865 802 808
Sample coverage: ˆ0 : Eq. (3) N ˆ : Eq. (4) N ˆ1 : Eq. (5) N
43 (5) 42 (20) 42 (7)
Petersens estimate: (H, P) (H, S) (P, S) Log-linear models: (5) (4) (4) (6) (13)
(2) (2) (2) (8) (5)
(16) (19) (44) (82) (88)
268 275 240 273 242
(13) (15) (12) (30) (15)
762 778 806 882 841
(21) (25) (46) (88) (90)
23 (2) 436 (17) 255 (10) 757 (21) 24 (15) 524 (68) 218 (15) 808 (74) 23 (3) 468 (29) 239 (16) 772 (34)
(21) (24) (76) (41) (67)
762 (21) 812 (65) 782 (32)
Logistic regression model: Eq. (7) with covariate and list effects, Horvitz–Thompson estimate: 765 (s.e. 22)
ˆ0 , in Eq. (3)) yields exactly the same result. An un-stratified data (N analogous estimate of 765 (s.e. 22) is also obtained by a logistic regression model incorporating both covariate (stratifying variable) and the list effects. However, the deviance statistic of the logistic regression model is 38.8 with 22 degrees of freedom (P -value = 0.015), indicating an inadequate fit. Since significantly positive dependence exists for the H-list and P-list, Bobo et al.4 suggested fitting a log-linear model with the interaction term HP, model (HP, S), to the total data, and obtained an estimate of 787. In Table 6, both stratified and un-stratified estimates for model (HP, S) are 778. Note that we have excluded seven patients whose diagnostic groups are unknown. This explains the slight difference between our result and that in Bobo et al.4 Adding up these seven to our estimate, we then obtain a very close result. We also list the results for model (HP, HS), quasi-symmetric and a partial-quasi-symmetric model in Table 6. For these three log-linear models, there are substantial differences between the stratified and un-stratified
732
A. Chao, H.-C. Yang & P. S. F. Yip
results. In contrast, such differences for the three sample coverage estimators are limited. Recall that the purpose of stratification is mainly to reduce the dependence due to heterogeneity. The overall dependencies are consiˆ and N ˆ1 . Therefore, dered and adjusted in the sample coverage estimators N the closeness of the stratified and un-stratified results is expected. (As will be seen in Sec. 5.3, post-stratification is not warranted because insufficient overlap may arise in some strata, leading to an unstable stratified result.) ˆ that can take account of two types of For this data set, the estimator N dependencies in each stratum has acceptable precision. Both stratified and un-stratified estimates can be recommended since they result in similar variation using 1000 bootstrap replications. The latter estimate is 812 with an estimated bootstrap s.e. of 65, which yields a 95% confidence interval of (720, 988). The former yields a slightly lower estimate of 808 with a higher s.e. of 74, which implies a 95% confidence interval of (709, 1015).
5.3. Stratified drug data (Four lists) Table 7 shows the analysis results for the drug data. Estimates based on various models are presented for each stratum and for the collapsed data. Except for the stratum of 4–6 days, the Petersen estimate based on the lists 1 and 2 are lower than the others and thus positive dependence is expected. The CCV estimate (unreported) for the total data reveals that positive dependence also exists between the lists 1 and 3. Wittes50 suspected that positive dependence may arise between list-3 and list-4, but the CCV estimate only shows very weak dependence. Wittes50 fitted an independent model to the data in each stratum and obtained a pooled estimate of 544 (s.e. 22.4), which is slightly different from our result under independence in Table 7 probably due to numerical rounding errors. The un-stratified estimate under an independent model is 524 (s.e. 18). Wittes50 thus concluded that failure to account for stratification in the analysis would have afforded the investigators a false sense of precision. Table 7 also presents the results for some other selected log-linear models that fit well, i.e. models (1, 2, 34), (12, 13, 4), (H1, 12, 13, 4), (12, 13, 14), and the quasi-symmetric model. For the quasi-symmetric model, the iterations failed to converge for two strata. Therefore, the stratified result for the quasi-symmetric model is not obtainable. The iteration steps for the un-stratified estimate did converge, but the s.e. is extremely large. It implies
733
The Use of Capture-Recapture Methodology
Table 7. Various estimates of the total number of patients for drug data (Standard error is in the parenthesis). Usage on Drugs (Stratum) Model/Method
1–3 days
4–6 days
7+ days Stratified
50 350 135 175 123 173
278# (183) 80 214 (49) 133 194 (18) 178 152 (42) 171 284 (69) 211 193 (12) 184
(16) (15) (12) (47) (43) (11)
(7) 542 (23) (8) 539 (30) (9) 541 (23) (17) 608 (95) (12) 556 (31) (322)(diverge)
Un-stratified
Petersens estimate: (1, (1, (1, (2, (2, (3,
2) 3) 4) 3) 4) 4)
(14) (112) (22) (49) (28) (27)
408 697 507 498 618 550
(184) (123) (31) (80) (86) (32)
300 459 497 468 609 527
(71) (54) (28) (112) (99) (25)
524 521 531 586 547 1027
(18) (24) (20) (62) (26) (815)
Log-linear models: (1, 2, 3, 4) Independent (1, 2, 34) (12, 13, 4) (H1, 12, 13, 4) (12, 13, 14) Quasi-symmetric
160 (19) 149 (21) 156 (18) 181 (73) 165 (25) (diverge)
203 (11) 215 (20) 201 (11) 242 (58) 204 (13) (diverge)
179 175 184 185 187 353
Sample coverage: ˆ0 : Eq. (3) N ˆ : Eq. (4) N ˆ1 : Eq. (5) N
151 (22) 170 (575) 157 (32)
226 (23) 286 (63) 247 (28)
178 (10) 555 (33) 185 (26) 641 (579) 182 (17) 586 (46)
541 (26) 635 (93) 579 (44)
Logistic regression model: Eq. (7) with covariate and list effects, Horvitz–Thompson estimate: 539 (s.e. 21) # Petersens
estimate does not exist due to no overlapped cases; Chapmans estimator is calculated instead (see Seber44 ; 59–60).
that a Rasch model which can reflect heterogeneity among individual cannot be adopted. The sample coverage estimator under independence gives an estimate of 541 (s.e. 26), which is very close to the result obtained from a logistic analysis. The logistic regression model in Eq. (7) provides a proper fit to the data because the deviance is 39.81 with 39 degrees of freedom (P -value = 0.43). In the first stratum, the relatively large bootstrap s.e. of the estimator ˆ indicates that data information cannot provide a reliable estimate and N thus only a reasonable lower bound can be obtained. Consequently, the ˆ is not recommended for use. For the collapsed stratified estimate based on N data, the precision is acceptable, showing the pooled data are sufficient
734
A. Chao, H.-C. Yang & P. S. F. Yip
to incorporate both types of dependencies. We obtain an estimate of 635 with an estimated s.e. of 93 based on 1000 bootstrap replications. A 95% confidence interval of the size can be constructed as (520, 895). 6. Remarks and Discussion Capture-recapture models provide a potentially useful method for assessing the extent of incomplete ascertainment in epidemiological studies but there are assumptions and limitations to this approach. We have reviewed three methods (log-linear models, sample coverage approach and logistic regression analysis) and applied them to three data sets with/without covariates. The three data analyses have demonstrated the usefulness of the capture-recapture analysis. Basic assumptions must be checked to validate the implementation of the capture-recapture method. Hook and Regal26,28 presented 17 recommendations for the use of the capture-recapture method in epidemiology. We also urge the readers to check the assumptions listed in Sec. 2 before capture-recapture analysis. We have shown that for some data sets (e.g. the HAV data and the first stratum of the drug data), insufficient overlap information usually results in an imprecise estimate. This implies that a serious limitation of the capture-recapture methods is that sufficiently high overlapping information is required to produce reliable population size estimates and to model dependence among samples. Coull and Agresti16 also indicated that the likelihood functions under some models for sparse information might become flat and the resulting estimates are likely to become unstable. In such cases, we feel that a precise lower bound is of more practical use than an imprecise point estimate. Almost all methods discussed in this chapter require extensive numerical iterations or calculations to obtain estimators and standard errors. Therefore, user-friendly software is essential for applications. We have developed a program CARE (for capture-recapture) containing two parts: CARE-1 and CARE-2. CARE-1 is an S-PLUS program for analyzing epidemiological data; CARE-2, written in C language, calculates various estimates for ecological models. All estimates and standard errors given in Tables 5–7 were obtained by using CARE-1. A tutorial article11 demonstrated the use of CARE-1. The reader is referred to Chao and Huggins9 for the use of CARE-2 if ecological models are needed. The program CARE is available and can be downloaded from the first authors website at http://chao.stat.nthu.edu.tw/.
The Use of Capture-Recapture Methodology
735
Acknowledgments Research was supported (for Chao) by the National Science Council of Taiwan under Contract/grant number NSC89-2118-M007-006 and NSC-90-2119-M-007-003.
References 1. Agresti, A. (1994). Simple capture-recapture models permitting unequal catchability and variable sampling effort. Biometrics 50: 494–500. 2. Alho, J. M. (1990). Logistic regression in capture-recapture models. Biometrics 46: 623–635. 3. Alho, J. M., Murly, M. H., Wurdeman, K. and Kim, J. (1993). Estimating heterogeneity in the probabilities of enumeration for dual-system. Journal of the American Statistical Association 88: 1130–1136. 4. Bobo, J. K., Thapa, P. B., Anderson, J. R. and Gale, J. L. (1994). Acute encephalopathy and seizure rates in children under age two years in Oregon and Washington State. American Journal of Epidemiology 140: 27–38. 5. Briand, L. C., El Eman, K., Freimut, B. G. and Laitenberger, O. (2000). A comprehensive evaluation of capture-recapture models for estimating software defect content. IEEE Transactions on Software Engineering 26: 518–538. 6. Burnham, K. P. and Overton, W. S. (1978). Estimation of the size of a closed population when capture probabilities vary among animals. Biometrika 65: 625–633.)( 7. Chao, A. (1987). Estimating the population size for capture-recapture data with unequal catchability. Biometrics 43: 783–791. 8. Chao, A. (1998). Capture-recapture. In Encyclopedia of Biostatistics, eds. P. Armitage and T. Colton, Wiley, New York, 482–486. 9. Chao, A. and Huggins, R. M. (2003). Closed capture-recapture models. In The Handbook of Capture-Recapture Methods, eds. B. Manly, T. L. McDonald and S. C. Amstrup, Princeton University Press. 10. Chao, A. and Tsay, P. K. (1998). A sample coverage approach to multiplesystem estimation with application to census undercount. Journal of the American Statistical Association 93: 283–293. 11. Chao, A., Tsay, P. K., Lin, S. H., Shau, W. Y. and Chao, D. Y. (2001). Tutorial in Biostatistics: The applications of capture-recapture models to epidemiological data. Statistics in Medicine 20: 3123–3157. 12. Chao, D. Y., Shau, W. Y., Lu, C. W. K., Chen, K. T., Chu, C. L., Shu, H. M. and Horng, C. B. (1997). A large outbreak of hepatitis A in a college school in Taiwan: Associated with contaminated food and water dissemination. Epidemiology Bulletin, Department of Health, Executive Yuan, Taiwan Government. 13. Cochran, W. G. (1978). Laplaces ratio estimators. In Contributions to Survey Sampling and Applied Statistics, ed. A. David, Academic Press, New York, 3–10.
736
A. Chao, H.-C. Yang & P. S. F. Yip
14. Cormack, R. M. (1989). Loglinear models for capture-recapture. Biometrics 45: 395–413. 15. Cormack, R. M. (1999). Problems with using capture-recapture in epidemiology: An example of a measles epidemic. Journal of Clinical Epidemiology 52: 909–914. 16. Coull, B. A. and Agresti, A. (1999). The use of mixed logit models to reflect heterogeneity in capture-recapture studies. Biometrics 55: 294–301. 17. Darroch, J. N. (1958). The multiple-recapture census I. Estimation of a closed population. Biometrika 45: 343–359. 18. Darroch, J. N., Fienberg, S. E., Glonek, G. F. V. and Junker, B. W. (1993). A three-sample multiple-recapture approach to census population estimation with heterogeneous catchability. Journal of the American Statistical Association 88: 1137–1148. 19. Desenclos, J. C. and Hubert, B. (1994). Limitations to the universal use of capture-recapture methods. International Journal of Epidemiology 23: 1322–1323. 20. Fienberg, S. E. (1972). The multiple recapture census for closed populations and incomplete 2k contingency tables. Biometrika 59: 591–603. 21. Fienberg, S. E., Johnson, M. S. and Junker, B. W. (1999). Classical multilevel and Bayesian approaches to population size estimation using multiple lists. Journal of Royal Statistical Society A162: 383–405. 22. Glonek, G. F. V. and McCullagh, P. (1995). Multivariate logistic models. Journal of Royal Statistical Society B572: 533–546. 23. Hook, E. B. and Regal, R. R. (1992). The value of capture-recapture methods even for apparently exhaustive surveys: The need for adjustment for source of ascertainment intersection in attempted complete prevalence studies. American Journal of Epidemiology 135: 1060–1067. 24. Hook, E. B. and Regal, R. R. (1993). Effects of variation in probability of ascertainment by sources (“variable catchability”) upon “capturerecapture” estimates of prevalence. American Journal of Epidemiology 137: 1148–1166. 25. Hook, E. B. and Regal, R. R. (1995). Capture-recapture methods in epidemiology: Methods and limitations. Epidemiological Reviews 17: 243–264. 26. Hook, E. B. and Regal, R. R. (1999). Recommendations for presentation and evaluation of capture-recapture estimates in epidemiology. Journal of Clinical Epidemiology 52: 917–926. 27. Hook, E. B. and Regal, R. R. (2000). Accuracy of alternative approaches to capture-recapture estimates of disease frequency: Internal validity analysis of data from five sources. American Journal of Epidemiology 152: 771–779. 28. Hook, E. B. and Regal, R. R. (2000). On the need for a 16th and 17th recommendation for capture-recapture analysis. Journal of Clinical Epidemiology 53: 1275–1277. 29. Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47: 663–685.
The Use of Capture-Recapture Methodology
737
30. Huggins, R. M. (1989). On the statistical analysis of capture experiments. Biometrika 76: 133–140. 31. Huggins, R. M. (1991). Some practical aspects of a conditional likelihood approach to capture experiments. Biometrics 47: 725–732. 32. International Working Group for Disease Monitoring and Forecasting (IWGDMF). (1995a). Capture-recapture and multiple-record systems estimation I: History and theoretical development. American Journal of Epidemiology 142: 1047–1058. 33. International Working Group for Disease Monitoring and Forecasting (IWGDMF). (1995b). Capture-recapture and multiple-record systems estimation II: Application in human diseases. American Journal of Epidemiology 142: 1059–1068. 34. Kiemeney, L. A. L. M., Schouten, L. J. and Straatman, H. (1994). Ascertainment corrected rates (Letter to Editor). International Journal of Epidemiology 23: 203–204. 35. LaPorte, R. E., McCarty, D. J., Tull, E. S. and Tajima, N. (1992). Counting birds, bees and NCDs. Lancet 339: 494. 36. Lloyd, C. J. (1999). Statistical Analysis of Categorical Data. Wiley, New York. 37. Otis, D. L., Burnham, K. P., White, G. C. and Anderson, D. R. (1978). Statistical inference from capture data on closed animal populations. Wildlife Monographs 62: 1–135. 38. Papoz, L., Balkau, B. and Lellouch, J. (1996). Case counting in epidemiology: Limitation of methods based on multiple data sources. International Journal of Epidemiology 25: 474–477. 39. Pollock, K. H. (1991). Modeling capture, recapture, and removal statistics for estimation of demographic parameters for fish and wildlife populations: Past, present, and future. Journal of the American Statistical Association 86: 225–238. 40. Pollock, K. H., Hines, J. E. and Nichols, J. D. (1984). The use of auxiliary variables in capture-recapture and removal experiments. Biometrics 40: 329–340. 41. Rasch, G. (1961). On general laws and the meaning of measurement in psychology. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, ed. J. Neyman, University of California Press, 321–333. 42. Schouten, L. J., Straatman, H., Kiemeney, L. A. L. M., Gimbrere, C. H. F. and Verbeek, A. L. M. (1994). The capture-recapture method for estimation of cancer registry completeness: A useful tool? International Journal of Epidemiology 23: 1111–1116. 43. Schwarz, C. J. and Seber, G. A. F. (1999). Estimating animal abundance: Review III. Statistical Science 14: 427–456. 44. Seber, G. A. F. (1982). The Estimation of Animal Abundance, 2nd edn., Griffin, London. 45. Seber, G. A. F. (1986). A review of estimating animal abundance. Biometrics 42: 267–292.
738
A. Chao, H.-C. Yang & P. S. F. Yip
46. Seber, G. A. F. (1992). A review of estimating animal abundance II. International Statistical Review 60: 129–166. 47. Sekar, C. and Deming, W. E. (1949). On a method of estimating birth and death rates and the extent of registration. Journal of the American Statistical Association 44: 101–115. 48. Tsay, P. K. and Chao, A. (2001). Population size estimation for capturerecapture models with applications to epidemiological data. Journal of Applied Statistics 28: 25–36. 49. White, G. C., Anderson, D. R., Burnham, K. P. and Otis, D. L. (1982). Capture-Recapture and Removal Methods for Sampling Closed Populations, Los Alamos National Lab, LA-8787-NERP, Los Alamos, New Mexico, USA. 50. Wittes, J. T. (1974). Applications of a multinomial capture-recapture method to epidemiological data. Journal of the American Statistical Association 69: 93–97. 51. Wittes, J. T., Colton, T. and Sidel, V. W. (1974). Capture-recapture methods for assessing the completeness of cases ascertainment when using multiple information sources. Journal of Chronic Diseases 27: 25–36. 52. Wittes, J. T. and Sidel, V. W. (1968). A generalization of the simple capturerecapture model with applications to epidemiological research. Journal of Chronic Diseases 21: 287–301. 53. Yip, P. S. F., Wan, E. C. and Chan, K. S. (2001). A unified approach for estimating population size in capture-recapture studies with arbitrary removals. Journal of Agricultural Biological and Environmental Statistics 6: 183–194.
About the Authors Anne Chao received her PhD in Statistics (1977) from the University of Wisconsin. Since 1978, Professor Chao has been teaching at the National Tsing Hua University, Taiwan, and is a professor of the Institute of Statistics. Her main research areas are biostatistics, including estimation of class numbers, estimation of population size, sampling methods and multiple system estimation. Hsin-Chou Yang received his PhD in Statistics (2002) from National Tsing Hua University under the supervision of Anne Chao. He is now working as a post-doctor at the Institute of Biomedical Science, Academia Sinica, Taiwan, in fulfillment of his compulsory military service.
Paul Yip is a Senior Lecturer in the Department of Statistics and Actuarial Science and Director of the Hong Kong Jockey Club Centre for Suicide
The Use of Capture-Recapture Methodology
739
Research and Prevention, the University of Hong Kong. He received his PhD in Statistics from La Trobe University, Australia. His research areas cover a wide spectrum of methodological and applied topics, with a major interest in developing and applying statistical techniques to medical, social and biological sciences.
This page intentionally left blank
CHAPTER 20
STATISTICAL METHODS IN THE EFFECT EVALUATION OF MASS SCREENING FOR DISEASES QING LIU Department of Medical Statistics and Epidemiology, School of Public Health, Sun Yat-Sen University, Guangzhou, Guangdong, 510080, PR China Tel: 86-20-87330664; [email protected] It is an important way in chronic disease control to detect disease earlier by mass screening. The purpose of screening is to detect disease in an early stage, in an expectation of the better treatment effect, the improvement of patients prognosis and the reduction of disability or death. Mass screening may achieve its objective of early detection in two approaches: one is to encourage patient to visit doctor when the early sign and symptom of disease appears; second is to supply a regular physical test and to detect disease in an asymptomatic stage.
1. Basic Concept of Mass Screening for Disease American Commission on Chronic Illness gave a definition of screening in 19571: “The presumptive identification of unrecognized disease or defect by the application of test, examine, or other procedures which can be applied rapidly to sort out apparently well persons who probably have a disease from those who probably do not. A screening test is not intention to be diagnostic. Persons with positive or suspicious finding must be referred to their physicians for diagnosis and necessary treatment.” This definition emphasized on two points. Firstly, the potential disease states identified by mass screening include two subgroups: one is the state in the high risk of disease and another is the state of disease unrecognized by patients himself. Person in first group may not be ill but he may have disease in a high probability, such as the people with multiple intestinal polyps, oesophageal epithelial dysplasia and mataplasia. People in second group have suffered from the disease but have not recognized it, such as patients with small liver cancer found by α-fetoprotein (AFP). Secondly, screening procedure 741
742
Q. Liu
itself does not diagnose illness. Those who test positive are sent on for further evaluation by a subsequent diagnostic test or procedure to determine whether they do have the disease. In 1968, World Health Organization (WHO)2 suggest some governing principles and pre-requisites of mass screening: (1) Disease is a serous health problem. It means a high morbidity, mortality and social burden. (2) Early detection of disease may improve the prognosis, reduce proportion of disability and death of patients. It means that there are the effective treatments of early stage disease. (3) The natural history of disease is well known and the screening test is able to detect disease in preclinical phase. (4) There is an effective screening test. It means that the test is sensitive, specific, high predictive value and safe. (5) The screening program is acceptable to population, simple and inexpensive, high compliance rate and low complication and pain. (6) There is proper procedure of further diagnosis and follow-up. Simply, the success of a screening program depends on how many deaths saved. By the exact statistical statement, screening is effective if the mortality rate of the disease in screened population is lower than that of in unscreened population. It may exist by the difference of the mortality rates or by the relative rates. The methods of statistical analysis for screening data is similar to that for general epidemiological study. The life-year savings of a disease is also a common index for screening assessment and especially useful in the cost-effectiveness evaluation. Mass screening does not only reduce the mortality rate of disease but sometimes also reduces the incidence rate and the medication cost of disease. However, the quantitative assessment on this aspect of screening is much more difficult than death reduction. The cost must be considered in the assessment of screening. The direct cost includes charges of screening test, further diagnostic procedure and follow-up for positive result. The indirect cost includes expense of time and work, management and organization of program, etc. The evaluation of the cost on the psychological and biological impact of screening, such as the anxiety for positive results, risk of complication, harm and pain brought by screening test is much more complicate but must be taken into consideration. The assessment of screening is very complicate and difficult. The current method of assessment for screening still needs to be improved. In the
Statistical Methods in the Effect Evaluation
743
beginning, ones hope to compare the survival time of the cases detected by screening and the cases diagnosed in clinic. If the survival time of the cases detected by screening is longer than that of the cases diagnosed in the clinical, the screening is said to be effective. However, this comparison suffers may have many biases. Firstly, screening population is not a random sample of general population. They may be different in some important demographic characteristics from general population where the cases come, such as occupation, life status and education, etc. Secondly, longer survival time in the cases detected by screening may be caused by earlier diagnostic time rather than the prolonged life. That is called as lead time bias.3 As shown in Fig. 1, we suppose the survival time of patients are not changed no matter it is detected by screening or diagnosed in clinic. The average longer survival time may be due to the screening advances the diagnostic time. The lead time in an uncontrolled clinical trial appears to increase survival time although the natural history of the disease and the time of death are unchanged, whereas, patients stay longer in disease phase and suffer more from pain and anxiety. Finally, the probability that a disease will be detected by screening is directly proportional to the length of its preclinical detectable phase, which is inversely related to its rate of disease progression. Individuals with rapidly progressive disease — those with short preclinical phases — are more likely to die than the average longer and are less likely to be identified by screening. Therefore, long survival time may not be the effect of screening nor the selective effects of screening procedure on cases. That is called as length bias.4 The screening tends to detect disease subsets with long preclinical phase, less aggressive progression and perhaps better inherent prognosis. As shown in Fig. 2, for Cases 1 and 2, the disease is less aggressively progressive, detected by screening and predictable a better prognosis. For Cases 3 and 4, disease progresses rapidly, missing the opportunity of screening detection and a poor prognosis. In the comparison of survival time of patients detected by screening and detected clinically, these biases
Fig. 1.
Illustration of lead time bias.
744
Q. Liu
1 2 3 4
diagnosis
screen1
screen2
Fig. 2.
die
screen3
Illustration of length bias.
must be adjusted. This adjustment is very difficult, depends on the fully perception of natural history of the disease and needs a very complex counting process. Early diagnostic rate is one of indices for assessment in the pilot period of a screening program. However, this index only suggests that the screening may be effective but has not been proven. If there is no effective treatment for detected patients, early detection of disease will not improve the prognosis of patients and reduce the death or disability caused by disease. Only when is there the effective treatment for disease in early stage and not for later stage, the early detection of disease is of special meaning. Obviously, the comparison of survival rates of patients detected by screening and detected in clinic is not an ideal method of screening assessment. Early diagnostic rate of disease is also not a good index and only an indirect index. An ideal assessment of screening is to compare the mortality rates of the disease and death rates in screening population and control population in a randomized controlled population-based design. The mortality rates of the disease and death rates are crucial in the assessment of screening effects. The comparison of mortality rates is not interfered by lead time bias and length bias. The results of the comparison reflect the true effectiveness of screening. For example, in the Health Insurance Plan (HIP)5 in New York, the women aged 40 to 64 who participated of the plan are randomly separated into two groups, one group accept yearly physical check and mammography, another group receive a routine medical care service. Four repeat screening tests were given to first group in total. In the first 5 years, the mortality rates of breast cancer in screening group reduced about 40%. After 14 years follow-up, there is still a 20% of mortality reduction of breast cancer (Table 1). The results from a randomized control clinical trial of breast cancer screening in Sweden,6 also proved that the screening of breast cancer may effectively reduces the deaths caused by the disease (Table 2).
745
Statistical Methods in the Effect Evaluation Table 1. HIP.
Cumulative deaths of breast cancer in screening population and control in
Years from first screens to diagnosis
Cases of breast cancer
Deaths from breast cancer according to follow-up years 5 years
7 years
10 years
14 years
5 years Screening group Control group Difference of rates (%)
306 300
39 63 38.1
71 106 33.0
95 133 28.6
118 153 22.9
7 years Screening group Control group Difference of rates (%)
425 443
39 63 38.1
81 124 34.7
123 174 29.3
165 212 22.2
10 years Screening group Control group Difference of rates (%)
600 604
39 63 38.1
81 124 34.7
146 192 24.0
218 262 16.8
Table 2. control.
The mortality analysis of breast cancer between screening population and Age
Groups
Deaths
Screening population
RR (95%CI)
40–49
Screen Control Screen Control Screen Control Screen Control Screen Control
28 24 45 54 52 58 35 31 160 171
19844 15604 23485 16805 23412 16269 10339 7307 77080 55985
0.92(0.52–1.60)
50–59 60–69 70–74 Total
∗ Adjusting
0.60(0.40–0.90) 0.65(0.44–0.95) 0.77(0.47–1.27) 0.69(0.55–0.88)∗
for age.
It is shown in Tables 1 and 2 that the methods of statistical analysis are similar to those in the treatment of traditional epidemiological data. Mantel-Haenszel stratified analysis was used to estimate the relative risks and confidence interval of disease mortality. The methods of hypothesis testing are also same. If a randomized control clinical trial is not feasible, a population-based cohort study is the next choice. For example, a study in England is to
746
Q. Liu
compare the mortality rates of the disease between the screening area and the non-screening area. This kind of design requires a relative high participation rate in screening population. It means a good compliance. For comparison, the demographic characteristics of populations in different areas need to be adjusted. The randomized control clinical trial or other observational design only evaluates a single screening scheme. The usage of them is limited because it cannot estimate the extra effects of a screening scheme applied in different population with different age distribution and different prevalence rates of disease, or the extra effects of different screening schemes, such as different frequency, different test. In practice, people cannot carry out a randomized control clinical trial for every screening scheme to evaluate its effects. In this situation, the mathematical model of natural history of disease based on the current data from a RCT or other observational studies may complement this limitation. The prevalence rate of disease on every screening and the incidence rates in the screening interval are estimated based on a stochastic model and the effects of different screening policies are evaluated. The screening programs for a variety of diseases have been implemented in many countries in the world. The practice proves that the mortality rates of some diseases, such as breast cancer, cervical cancer, hypertension and diabetes may be reduced by screening. The question is which one in different screening test and different schemes detect disease earlier, with higher efficacy and higher cost-effectiveness. Two important parameters decide the effect of screening. One is the sensitivity of screening test. Another is the distribution of sojourn time in preclinical detectable phase (PCDP). A high sensitivity, or low false negative rate of screening test means a strong power to detect disease. A long sojourn time of preclinical detectable phase means more chance to be detected by screening and in an early stage of disease. A short sojourn time of preclincial detectable phase gives little chance to be detected by screening. That means that the proportion of the cases detected by screening is low, the effect of screening is poor and the screening may be not feasible. If the sojourn time of preclinical detectable phase is long, the interval between screens may be designed longer. When the distribution of preclinical detectable phase is known, the lead time bias may be estimated for the assessment of screening. Therefore, the core of analysis of screening data and optimization of screening schemes is to estimate these two parameters. We assume that the disease progresses in the manner shown in Fig. 3. An individual enters the preclinical detectable phase of the disease, detectable
747
Statistical Methods in the Effect Evaluation
sojour tme no detectable disease
symptomatic invasive disease
asymptomatic detectable disease T0
T1
T2 Lead time delay time
screening test detects preclinical disease
Fig. 3. Schema for the progression of a disease with the intervention of an early detection.
by the screening modality in question, at time T0 , and would begin to manifest symptoms, i.e. the disease would become clinically apparent, at time T1 , if no intervention were to take place. For this individual, the “sojourn time” is defined as T1 − T0 . Suppose now that the individual is screened at time T2 (T0 < T2 < T1 ) and is diagnosed in the preclinical state. For this individual, the “lead time”, the interval by which diagnosis is brought forward, is defined as T1 − T2 . The probability that the screening test correctly identifies an individual as being in the preclinical detectable phase is termed the “sensitivity” of the test; the “false-negative rate” is one minus the sensitivity. 2. One-stage Models of the Natural History of Disease for Screening The data from screening process consist of: (1) prevalent cases diagnosed in preclinical detectable state during the screening; (2) incidence cases diagnosed clinically in the interval of two screenings or both before and after screening. The purpose of natural history model of screening for a disease is to express these prevalence and incidence rates in terms of the falsenegative rate and of the sojourn-time distribution. General model have been developed7,8 to describe the effect of screening on the disease process in order to identify those parameters which determine the expected benefit. Estimation of parameters of interest is difficult with these general models since the number of unknowns is large. Here firstly the simplified model (NE Day and SD Walter)9 is introduced.
748
Q. Liu
In the absence of screening, the incidence of clinical disease at age t will be denoted by I(t). For the screening modality in question, f (y) will denote the probability density function of the length of the interval y during which the disease is preclinical but detectable, i.e. the sojourn time. For simplicity, we assume f (y) to be independent of t. The function J(t) will denote the incidence of the preclinical state, i.e. the rate at which individuals enter it. The “false-negative rate”, i.e. the probability that an individual in the preclinical state is screened negative, will be denoted by β; thus 1 − β is the “sensitivity”. We assume that β is independent of both the lead time and the sojourn time. The functions I(t) and J(t) are related through f (y) by the equation Z t I(t) = J(s)f (t − s)ds . (1) 0
Suppose that the population is screened at t1 . Then for t > t1 the incidence is made up of two components, individuals with short sojourn time who entered the preclinical state after t1 , and individuals with a longer sojourn time falsely screened negative at t1 . Thus, after one screen, the incidence, I1 (t) is given by Z t1 Z t J(s)f (t − s)ds . (2) J(s)f (t − s)ds + I1 (t) = β 0
t1
Similarly, if screens occur at times t1 , t2 , . . . , tn , then the incidence In (t) after the nth screen is given by Z t1+1 n X J(s)f (t − s)ds , (3) β n−i In (t) = t1
i=0
where t0 = 0 and tn + 1 = t. We make an assumption that J(t) is uniform for an individual over the duration of the study. For cancer, the screening interval usually is 1 or 2 years, in this interval the assumption of uniform incidence rate of preclinical state may hold approximately. Z t−t1 n X f (y)dy . (4) β n−i In (t) = J i=0
t−ti+1
The prevalence, P1 observed at a first screen at time t1 , is given by Z ∞ Z t1 J(s) f (y)dyds . P1 = (1 − β) 0
(5)
t1 −s
That is, for each time s < t1 , those individuals entering the preclinical state at time s will be prevalent cases at time t1 if their sojourn time is greater
Statistical Methods in the Effect Evaluation
749
than t1 − s. On reversing the order of integration and setting J(s) constant, this expression becomes Z ∞ Z t1 f (y)dy (6) yf (y)dy + P1 = (1 − β)J t1
0
If there are n previous screens, at times t1 , . . . , tn . Summation over the n intervals, gives the following expression for the total prevalence Pn at the nth screen at time tn : Z ∞ n X β n−i min{y − (tn − ti ), ti − ti−1 }f (y)dy . (7) Pn = (1 − β)J tn −ti
i=1
If the interval between screens is constant, i.e. if ti+1 − ti = ∆, i = 1, 2, . . . , n − 1. Above expressions may be further simplified. For incidence rates, Z t−tn +(n−i)∆ Z t−tn n−1 X β n−i f (y)dy . (8) f (y)dy + Jβ n In (t) = J t−tn +(n−i−1)∆
i=0
0
For the prevalence rates, Pn = (1 − β)J
n X i=1
β
n−i
Z
∞
(n−i)∆
min{y − (n − i)∆, ∆}f (y)dy .
(9)
We consider first the idealized situation where a total population is screened at regular intervals, each individual being screened with the same inter-screening interval ∆. The constant incidence rates are assumed known from a pre-existing disease registry. At the ith screen (i = 1, . . . , n) one knows ri , the number of cases of preclinical disease found, and ni , the number screened; and between screen i and screen i + 1 one knows the total ci of cases diagnosed outside screening from a total of yi person-years at risk. The probability qi of a case developing between screen i and screen i + 1 outside screening is given by Z ti+1 Ii (y)dy , (10) qi = 1 − exp − ti
Which can be well approximated by Z ti+1 Ii (t)dt . qi = ti
The cases at screen i can be taken to have a Poisson distribution with parameter ni Pi and the cases emerging outside screening between screens i
750
Q. Liu
and i + 1 can be taken to have a Poisson distribution with parameter yi qi . Then the likelihood function is L=
n−1 Y i=0
n ni Pi −ni Pi Y yi qi −yi qi . e e ri ! c! i=1 i
(11)
Based on likelihood function, the maximum likelihood values of parameters may be estimated. Three different forms for f (y) may be considered: a step function with arbitrary probabilities defined over short time intervals, a lognormal and an exponential. Here, we only discuss the exponential distribution as an example for the application since it not only gives simpler expressions for the quantities of interest but also fits the data better. 2.1. Example Breast cancer screening by the Health Insurance Plan of Greater New York (the HIP study).5 The data we have used are summarized in Tables 3 and 4. There were 4 screens at yearly interval and the cases arising between screens were identified. We use the data from the first 5 years of follow-up after the start of screening. We assume that the distribution of sojourn time of preclinical phase is an exponential. f (y) = λ exp(−λy) ,
y ≥ 0.
With this assumption, for t > tn , the incidence rate of screening interval is In (t) = J − J exp(−λt){exp(λtn ) − Table 3.
n−1 X i=1
β n−i [exp(λti+1 ) − exp(λti )] − β n exp(λti )} .
(12)
Prevalence rates of breast cancer in the first 5 years of the HIP study.
Years since start of study
No. of women screened
0 1 2 3
20166 15936 13679 11971
No of prevalent breast cancer cases Observed 55 32 17 23
Expected 59.5 25.8 20.4 17.7
Prevalence (h) 2.73 2.01 1.24 1.92
751
Statistical Methods in the Effect Evaluation Table 4.
Incidence rates of breast cancer in the first 5 years of the HIP study.
Years since start of study
No of previous negative screens
Women-months of follow-up
0–1 1–2 2–3 3–4 4–5 1–2 2–3 3–4 4–5 2–3 3–4 4–5 3–4 4–5
1 1 1 1 1 2 2 2 2 3 3 3 4 4
240277 81337 38370 30942 26701 190474 52934 19036 12626 163642 45964 13151 145118 89371
No of prevalent breast cancer cases Observed
Expected
13 7 1 3 5 8 5 2 4 10 5 2 10 10
14.9 9.0 5.3 4.7 4.3 9.6 5.5 2.6 1.9 8.1 4.7 1.8 7.0 9.2
Annual incidence (h) 0.65 1.03 0.31 1.16 2.25 0.50 1.13 1.26 3.80 0.73 1.31 1.82 0.84 1.34
The qj are then given, for j = 1, . . . , n by the integrals of (12) from tj to tj + 1, so qj = J(tj+1 − tj ) − Jλ−1 [exp(−λtj ) − exp(λtj+1 )] ) ( j−1 X β j−i [exp(λti+1 ) − exp(λti )] − β j exp(λt1 ) . (13) × exp(λtj ) − i=1
If the screens are equally spaced, qj reduces to
qj = J∆ − Jλ−1 [1 − exp(−λ∆)] ) ( j−1 X i j β exp(−iλ∆) + β exp[−(j − 1)λ∆] . (14) × 1 − [exp(λ∆) − 1] i=1
The expression for Pj , from (9) reduces to Pj = (1 − β)Jλ−1
j X i=1
β j−1 {exp[−λ(tj − ti )] − exp[−λ(tj − ti−1 )]} ,
(15)
which for equal spaced intervals becomes −1
Pj = (1 − β)Jλ
[1 − exp(−λ∆)]
j−1 X i=0
β i exp(−iλ∆) .
(16)
752
Q. Liu
Since the example is a screening at an equal spaced interval, the expression (14) and (16) are used to develop the maximum likelihood function. Then it is relatively straightforward to compute the log likelihood as a function of λ and β. The results are: β = 0.18, λ = 0.585. According to the exponential distribution, the average sojourn time equals 1.71. Also shown in Table 4 are the expected numbers of cases based on the best-fitting exponential distribution, with an overall χ2 goodness-of-fit test. The fit is clearly good.
3. Two-Stage Models of the Natural History of Disease for Screening The models of natural history of the disease introduced before are based on a progressive disease model. The progressive disease model assumes individuals are in a healthy state until they enter the preclinical disease state and all individuals in this state eventually emerge with clinical symptoms if untreated. The key assumption of this model is that preclinical disease, if left untreated, would ultimately surface clinically. This assumption is true for a part of invasive diseases, such as breast cancer. When the mammography may detect the malignant tumor in breast, the tumor must progress until patient feels the symptom or sign and goes to visit physician. Similar cases are the chest radiography for lung cancer and α-fetoprotein (AFP) test for liver cancer. However, it is not always true for some of other cancers or diseases. For example, the Pap smear for screening of cervical cancer may detect the heavy epithelial dysplasia and mataplasia, and carcinoma in situ; the gastroscopy for screening of stomach cancer may detect the gastric mucous dysplasia and mataplasia; the enteroscopy for screening of colon cancer may detect multiple intestinal polyps; etc. These non-invasive lesions may progress to invasive disease but also may persist or revert to normal automatically. However, once a lesion becomes invasive, it almost never regresses without treatment and it is assumed all invasive lesions arise from a preinvasive lesion. According to this situation, R. Brookmeyer and NE Day10 suggested a two-stage model for the analysis of cancer screening data. The two-stage model is illustrated schematically in Fig. 4. We define the random variable X to be the duration of time a progressive lesion spends in the preclinical stage 1 and Y the duration of time a progressive lesion spends in the preclinical stage 2. The cumulative distribution function of the joint sojourn times (X, Y ) is called F (x, y) with density
753
Statistical Methods in the Effect Evaluation
detectable preclinical phase of disease regress progress
progress clinical disease
healthy stage2
stage1
Fig. 4.
Schematic illustration of two stage model for preclinical period of disease.
f (x, y). Then the cumulative distribution function of the total preclinical duration (X + Y ) of progressive lesions is given by Z tZ s FT (t) = f (s − y, y)dyds . (17) 0
0
FT is the distribution function of the total preclinical duration for progressive lesions only. 3.1. The likelihood for the interval (clinical incident) cases Suppose the hazard function of clinical disease in the absence of screening is given by i and a steady state is assumed, before time t, the screening history of the jth individual in this risk set had a history of nij previous negative screens at time Hij = {tij1 , tij2 , . . . , tijni }. These times are given in reverse chronological order so that tij1 refers to the time since the most recent screen and tijni refers to the time of first screens. By convention, j = 1, . . . , M1i refers to the cases. Then the hazard of clinical disease at age t is given a screening history Hij is approximately Iρ (Hij ; FT , β), where ρ(Hij ; FT , β) is the probability that an individual who is destined to be clinically incident at age t in the absence of screening intervention, would have tested negative at prior times Hij if the individual was in fact in the screening program. Then, the conditional likelihood of that the screening history Hi0 corresponds to the interval case and the other screening histories among Ri screening subjects is ρ(Hi0 ; FT , β) . L1i = PRi j=1 ρ(Hij ; FT , β)
(18)
It is shown that the constant I cancels out in (18). Suppose M1i incident cases are clinically diagnosed at age ai and there are an additional N1i individuals at risk at ai ; that is, in addition to the cases there are N1i individual in the cohort still at risk of being diagnosed with clinical disease
754
Q. Liu
at age ai . Suppose the jth individual in this risk set had a history of nij previous negative screens at time Hij , where R1i = M1i + N1i is the size of the risk set. Then the partial likelihood contribution of incident cases at age ai is QM1i j=1 ρ(Hij ; FT , β) , (19) L1i = P Q l j∈sl ρ(Hij ; FT , β)
where Sl is the subsets consisting of M2i individuals from the R2i screened at age ai , R1i . l = 1, 2, . . . , M1i Suppose an interval case could have entered between the kth and (k − 1)th most recent screen, and then the probability of falsely screened negative on all k − 1 subsequent screens is nij +1
ρ(Hij ; FT , β) =
X
k=1
β k−1 [FT (tijk ) − FT (tijk−1 )] ,
(20)
with the conventions FT (tijnij +1 ) = 1, FT (tij0 ) = 0. 3.2. Likelihood for screen-detected stage 2 prevalent cases The prevalence (probability) of stage 2 screen-detected disease at age t conditional on a screening history Hij is R
[(1 − β)Iµ2 ]ρ(Hij ; FB2 , β) ,
(21)
where µ2 = yf (x, y)dxdy is the mean duration in stage 2. The first factor in brackets is the prevalence of screen-detected stage 2 disease unconditional on any screening history. The second factor ρ(Hij ; FB2 , β) is the probability that an individual who is destined to be in stage 2 at age t in the absence of screening intervention, would have tested negative at prior times Hij if the individual was in a screening program. FB2 is the backward recurrence distribution function. The backward recurrence time is the amount of time that a screen-detected lesion spent in the preclinical stage (stage 1 plus the time spent in stage 2 up to detection. The backward recurrence density is RtR∞ f (x, y)dydx . (22) fB2 (t) = 0 t−x µ2
Statistical Methods in the Effect Evaluation
755
This expression is derived by first noting the probability of being in stage 2 is Iµ2 and second, in order to have been in the preclinical phase for duration t and currently in stage 2 one must be in stage 1 for duration x and stage 2 for at least t − x, 0 < x < t. It is defined FB2 (tijnij +1 ) = 1 and FB2 (tijo ) = 0 . ρ(Hi0 ; FB2 , β) . L2i = PRi j=1 ρ(Hij ; FB2 , β)
(23)
Similarly, suppose M2i screen-detected stage 2 cases are detected at age ai and an additional N2i individuals also are screened at age ai and are negative. Then the partial likelihood contribution of the screen-detected cases at ai is QM1i j=1 ρ(Hij ; FB2 , β) L1i = P Q , (24) l j∈sl ρ(Hij ; FB2 , β)
where meaning of sl and l as same as before. If the screening times are randomly assigned, that is, the R2i individuals who are screened at age ai are a random sample of all individuals R1i at risk at age ai . Suppose the incident cases have Cl strata and prevalent cases screen-detected have C2 strata, the partial likelihood is then the product of factors for the contributions from incident cases and contributions from screen-detected cases. L=
c1 Y
i=1
L1i
c2 Y
L2i .
(25)
i=1
3.3. The joint sojourn distribution of two stage model 3.3.1. The independent model The simplest model assumes that the sojourn times for the two stages X and Y are independent with cumulative distribution functions F1 (x) and F2 (y) and densities f1 (x) and f2 (y), respectively. The distribution function for the total sojourn time is Z t FT (t) = f1 (x)F2 (1 − x)dx . (26) 0
The backward recurrence cumulative distribution function is Z t 1 FB2 (t) = [1 − F2 (y)]F1 (t − y)dy . µ2 0
(27)
756
Q. Liu
For the two-stage independent model, with exponential sojourn distributions [F1 (t) = 1 − e−λ1 t , and F2 (t) = 1 − e−λ2 t ], both the total sojourn (FT ) and backward recurrence (FB2 ) cumulative distribution functions are identical and given by −λ1 t − λ1 e−λ2 t 1 + λ2 e , λ1 6= λ2 λ1 − λ2 FT (t) = FB2 (t) = (28) −λt 1 − e (1 + λt) , λ1 = λ2 = λ . 3.3.2. Limiting dependent models
For many diseases, the second stage (the preclinical invasive stage) is short relative to the first (the noninvasive stage). It is useful to consider the limiting behavior of FT (t) and FB2 (t) as µ2 → 0 with F1 fixed. These limiting expressions could then be substituted into expression of likelihood. For example, consider the complete positive dependent exponential model, the relationship of X and Y may express as Y = λ1 X/λ2 , FT (t) = 1 − e−ut , λ2 FB2 (t) = 1 − e−ut + (e−ut − e−λ1 t ) , λ1
(29)
where u = λ1 λ2 /(λ1 + λ2 ). Under this model the limiting backward recurrence distribution is lim FB2 (t) = 1 − e−λ1 t (1 + λ1 t) .
µ2 →0
(30)
For this limiting situation, F1 (t) = 1 − e−λ1 t is substituted into expression (19) for total sojourn distribution while the cumulative distribution function in expression (30) is substituted into expression (24) for the backward recurrence distribution. Brookmeyer and Day applied the two-stage model to the analysis of data from a case-control study. Data is from the case-control study of the Northeast Scotland Cervical Cancer Screening Program.11 The program was started in 1960 when women were asked to come for an initial Pap smear. Records on all subsequent Pap tests were kept. The definition of a positive Pap test is given in MacGregor et al.11 When a woman had a positive Pap test she was biopsied and/or treated. Thus, the natural history of the disease was interrupted at the time of the first positive Pap test. A case-control study was conducted and consisted of 85 women who
757
Statistical Methods in the Effect Evaluation
Table 5. Conditional likelihood analysis for independent (model 1) and limiting dependent (model 2) two-stage exponential models. Maximum likelihood estimates β λ Proportion with < 5 years total sojourn Proportion with < 10 years total sojourn Maximum log-likelihood ∗ In
Model 1 0.025 0.003, 0.247∗ 0.18 0.33 −120.54
Model 2 0.001 0.013∗ 0.55 0.70 −119.35
time units of months.
were diagnosed with invasive squanmous carcinoma of the cervix between 1968 and 1982. Of these 85 cases, 35 were clinically incident (interval cases) and 50 were screen-detected with preclinical invasive disease (stage 2). Each interval cases was matched by year of birth to five controls who were healthy at the time of the cases diagnosis. Each stage 2 screen-detected case was matched by year of birth to a control who screened negative within 6 months of the date at which the case was screen-detected. The screening histories of all cases and controls were ascertained; these histories consisted of the number and timing of previous negative screens (prior to diagnosis date of the case). The two-stage model was fitted to the data and results showed in Table 5. The independent model gave two estimated values of λ, one is big and one is small. The author thought the development of preclincial invasive disease is very rapid so that he chose the big one as λ2. The maximized log-likelihood was slightly higher with the dependent model and it suggest the limiting dependent model is better. Both the independent and limiting dependent analysis suggested a small false negative rate. However, there was some discrepancy in the estimates of the sojourn distribution. As expected, the value of λ is big in the limiting dependent model suggesting the shorter sojourn duration than the independent model. 4. Multiple Stages Markov model for the Natural History of Disease Screening The two-stage model suggested by Brookmeyer and Day presented the concept of regression of disease development in preclinical phase. It describes the disease progression better and is of an important significance in evaluation and prediction of effect of screening for the disease in the
758
Q. Liu
Fig. 5.
Schematic illustration of state transition of disease screened.
different stages. But the two-stage model of Brookmeyer and Day does not fully describe the transition of disease in different stages. The parameter of the model may only use to estimate the total sojourn time in preclinical phase. The structure of model and the parameter estimation are relative complex. The history of a disease may look as a transition process of ones discrete healthy status. For example, in a certain period, the healthy status of an individual may transfer from healthy to potential illness, later may transfer further to clinical disease. The transition of status may be single direction or may also be double direction. Therefore, a stochastic process model is very convenient and reasonable to describe the transition of disease status and the sojourn time in each stage. If the future status only depends on the current status and is independent to all status before, that is called as the Markov property. We may use Markov process or Markov chain to describe the disease progression when it is of this property. We assume that the single stage model and two-stage model of natural history of disease are illustrated in Fig. 5. 4.1. Time homogeneous Markov chain model Duffy and Chen12 suggested to describe the natural history of the disease for screening by the Markov process model. A Markov process with the following instantaneous transition matrix: λ1 0 0 −λ1 −λ2 λ2 . 1 0 2
0
0
0
Here 0 is the “no disease state”, 1 is “preclinical but detectable disease” and 2 is “clinical disease”. Implicit in this model is the assumption that diseases
Statistical Methods in the Effect Evaluation
759
are “born” into the preclinical state with an exponential distribution of time to birth with Z t P (Time to birth ≤ t) = λ1 e−λ1 x dx = 1 − e−λ1 t . (31) 0
Time remaining in the preclinical phase conditional on being in the phase at time t = 0, is also assumed exponentially distributed with Z t λ2 e−λ2 x dx P (Time to transition to clinical state ≤ t) = 0
= 1 − e−λ2 t .
(32)
Based on the solution of (dI-Q) to obtain the eigenvalues and eigenvectors, the transition probabilities for time t may be obtained: λ2 e−λ1 t − λ1 e−λ2 t λ1 (e−λ2 t − e−λ1 t ) −λ1 t 1+ e (λ1 − λ2 ) (λ1 − λ2 ) . (33) P (t) = −λ2 t −λ2 t 0 e 1−e 0
0
1
This can also readily obtained from the exponential distribution properties. The transition probabilities in Eq. (33) are unconditional probabilities. There are two complications, however, which necessitate the replacement of some with conditional or compound probabilities. First, those found to be free of disease or to have preclinical disease at first screen are not from an entire cohort followed from birth; women with a previous and clinically confirmed disease were excluded from the trial. Thus the probabilities of being free of disease and of having preclinical disease at the first screen should be conditional on having no clinical disease between birth and first screen. Also, the time of entering the clincial phase is known exactly. Their probabilities should therefore be of becoming clinical at the time ti rather than at some time between 0 and ti . For one individual, suppose we know that the exact time the person becomees clinical is 5 years, for example. The probability of clinical disease at exactly five years is P (clinical at 5 years) = P (not clinical up to 5 − ∆t years) × P (become clinical in the interval (5 − ∆t, 5)). Since the model does not allow the possibility of instantaneous transition from no disease to clinical state, and since we wish to explicitly allow for the probability of both rapid and slow progression through preclinical phase, we use our limit of accuracy, in this case one month, and further approximate the correct probability as P (clinical at 5 years) = P (not clinical up to 5 − ∆t years) P (become clinical in the interval (5 − ∆t, 5)).
760
Q. Liu
As 1 month = 0.08 years, approximately, P = P00 (ti − 0.08)P02 (0.08) + P01 (ti − 0.08)P12 (0.08) . Therefore, the probabilities of being free of disease at first and second screen, P1 and P2 , were calculated as P1 =
eλ1 t e−λ1 t +
λ1 (e−λ2 t −e−λ1 t ) λ1 −λ2
P2 = e−λ1 t .
(34) (35)
The probabilities of having preclinical disease at first and second sceeen, P3 and P4 and the probability of clinical disease at time ti (I = 1, 2, . . . , t) are P3 = 1 − P1 , λ1 (eλ2 t − eλ1 t ) λ1 − λ2 λ2 e−λ1 0.08 − λ1 e−λ2 0.08 −λ1 (ti −0.08) 1+ =e λ1 − λ2
P4 = P5i
λ1 (e−λ2 (ti −0.08) − e−λ1 (ti −0.08) )(1 − e−λ2 0.08 ) . λ1 − λ2 The total likelihood function is n Y L= (P11−δ P3δ )(P21−δ P4δ )P5i . +
(36)
(37)
(38)
l−1
δ is the index variable of screening results. If the result of screening test is negative, δ = 0, otherwise, δ = 1. The solution of the likelihood function is complicate, it must be iteratively maximized by a non-standard program. For simplicity, Duffy and Chen equate observed numbers of different types of observation to expected numbers and estimate the parameters by nonlinear least squares. Thus, the least squares approximations to maximum likelihood estimates were obtained by a procedure similar to the method of moments. Then the non-linear procedure (NLIN) in SAS13 may be used to estimate the parameters. 4.1.1. Example A randomized trial was conducted in women aged 40–74 in two counties, Kopparberg and Ostergotland, in Sweden14 to assess the effect on breast
761
Statistical Methods in the Effect Evaluation Table 6.
The estimated sojourn time in PCDP of breast cancer.
λ1
λ2
SE(λ2 )
Mean sojourn time
95%CI
0.0052
0.43
0.014
2.3
2.1 ∼ 2.5
cancer mortality of screening by single-view mammography. The data from two screens were used in the example. The number of invited women at first and second screens was 5410 and 4823. Among those subjects, only 4383 and 3494, respectively, were actually screened. Of those who attended the second screen, 3347 had attended the first screen and 147 had not. Therefore following are the transition histories: (1) 4383 women were screened at the first screen and 52 cancers were detected. There are 4331 (4383-52) women with transition history (72, 0-0), that is transition from no disease to no disease in 72 years (the average age at baseline was 72). (2) The 52 cases detected at first screen have transition history (72, 0-1). (3) 3494 women attended the second screen and 35 cancers were detected, all among the 3347 women who had attended the first screen. Thus 3312 (3347-35) have subsequent transition history (74.72, 0-0). (4) The 35 cases detected at the second screen have subsequent history (2.75, 0-1). (5) The 147 women who missed the first screen but attended the second screen have transition history (74.75, 0-0). (6) There are 10 interval cancers between the first and second screens. Thus, of the above 4331, there are 10 with subsequent transition history (time to interval cancer, 0-2). (7) There are 68 cases diagnosed clinically after the last screen, which is either the first screen (10 cases) or second (58 cases), depending on whether the subject attended the second screen. These have subsequent history (time to surface to clinical stage, 0-2). The results of analysis were showed in Table 6. 4.1.2. Estimation of sensitivity Day shows that under the constant incidence assumption, in a time interval T after a negative screen one would expect K new cases, where Z T Z T K = J(1 − s) (1 − F (t)dt) + J F (t)dt . (39) 0
0
762
Q. Liu
J is the annual (constant) incidence rate, S is the sensitivity and F is the comulative distribution function of the sojourn time. The first component in the formula for K is the number of cases missed at the screen and the second is the number of new cases “born” since the screen. This suggests as a formula for an estimate of sensitivity Sˆ ≈ 1 −
ˆ 1 − K/J , R T 1 − T1 0 f (t)dt
(40)
where K is the observed number of new cases. Using the proposed threestate Markov model, the corresponding expected number K of cases in time T after a negative screen at time ti is Z t1 +T −t Z t1 λ1 e−λ1 t λ2 e−λ2 u dudt K = N (1 − S) 0
+N
Z
t1 +T
t1
λ1 e−λ1 t
Z
t1 −t
t1 +T −t
λ2 e−λ2 u dudt ,
(41)
0
where N is the number screened at time t1 . After some integration and algebra, this given an estimate of sensitivity ˆ 2 − λ1 )/N − a K(λ , Sˆ = 1 − b
(42)
where a = (λ2 − λ1 )e−λ1 t1 (1 − e−λ1 T ) − λ1 e−λ2 T e−λ1 t1 (e−(λ2 −λ1 )T − 1) and b = λ1 (e−λ1 t1 − e−λ2 t1 )(1 − e−λ2 T ) . Thus, the same data as in the Markov model are used to estimate sensitivity, but in a second stage of estimation. 4.2. Non homogeneous Markov model with covariables In the description of disease progression, the transition of disease states may be interfered by a lot of important factors. For example, some risk factors in living environment may decide the probability of transition from healthy state to ill state. A stochastic model of transition of disease states in the consideration of these factors will be benefit to identify the population with high risk of disease. This population is more appropriate to implement of screening program and is expected to get higher effectiveness. Another case
Statistical Methods in the Effect Evaluation
763
is that the clinical disease characters decide the probability of transition from preclinical state to clinical state. The inclusion of these variables may strengthen the identification power of stochastic model for different types of diseases and the precise of estimation for the sojourn time in preclinical detectable phase. Therefore, the purpose of developing a stochastic model describes the transition probability of healthy state of an individual not only by populational transition but also by the consideration of individual characters, such as age, gender, disease history, etc. To consider the individual variability, a regression combination of the variables of individual characters may be used to describe the transition probability and it is called as the non-homogeneous Markov model with covariables. 4.2.1. Non-homogeneous time discrete Markov chain model J. Q. Fang and W. Q. Zhou15,16 suggested a parameterized nonhomogeneous Markov chain model in the analysis of screening data for disease. They assumed that the disease process is like in Fig. 5. The state space S = {0, 1, 2, 3}, The transition probability from state I to state j is defined as Pij (τ, t) = P {X(t) = j|X(τ ) = i} .
(43)
In general screening practice, the screens were given in a fix interval and the time intervals may only differ in several days. This error may be neglected. Then the time t may be assumed as a fix unit, such as year or month. Among the Eq. (46), τ and t belong to screening time set T ≡ {ttt}. Based on the professional knowledge, it is assumed that the state of individual may just transfer one step in one interval. Suppose the one step transition matrix during the time from t to t + I is p00 (t) p01 (t) 0 0 p10 (t) p11 (t) p12 (t) 0 i, j = 0, 1, 2, 3 . P (t) = (44) 0 p22 (t) p23 (t) o o 0 0 1
Among expression (44)
` , P01 (t) = α01 · /E(t)
P00 (t) = 1 − P01 (t) ,
P10 (t) = α10 · (1 − θ(t)) ,
` , P12 (t) = α12 · /E(t)
P22 (t) = 1 − P10 (t) − P12 (t) ,
(45)
764
Q. Liu
` , P23 (t) = α23 · /E(t)
P22 (t) = 1 − P23 (t) ,
θ(t) = 1 − exp(−β ′ Z(t)) ,
β = (β1 , β2 , . . . , βp )′ .
(46)
Here the proportional factor α and vector β are the estimation parameters. The transition matrix of m steps during the time from t to t + m is Am (t) =
m−1 Y
P (t + k) .
(47)
k=0
t+k∈T
Suppose there is only one step discriminant error in the state 0, 1, 2, 3, s ∈ S, P {s + i|s} = 1 − γ, here γ is the false negative rate. Therefore, the discriminant vectors are B(0) = (1 − γ, γ, 0, 0)′ ,
B(1) = (0, 1 − γ, γ, 0)′
B(2) = (0, 0, 1 − γ, γ)′ ,
B(3) = (0, 0, 0, 1 − γ)′ .
(48)
The maximum likelihood function is L=
N qY k −1 Y
B ′ (skj )Amkj (tkj )B(sk,j+1 ) .
(49)
k=1 j=1
The hypothesis test of parameters may use the likelihood ratio test. The statistic is G = 2(ln L − ln L) .
(50)
When the sample size is big enough and H0 is true, the statistic G follows the chi-square distribution and the degree of freedom is the number of estimating parameters. 4.2.2. Non-homogeneous time continuous Markov process model J. Q. Fang and J. H. Mao15,16 suggested a time continuous Markov process model. They assumed that the transition power from state i to state j is λij (t) · dt = P {X(t + dt) = j|X(t) = i} = Pij (t, t + dt) , t ∈ [0, ∞) ,
i, j ∈ S .
(51)
They also assumed that the transition power of two stage model for disease screening was related with p covariables. The model is
765
Statistical Methods in the Effect Evaluation
λ01 (t) = A0 + A1 Z1 (t) + · · · + Ap Zp (t) , λ10 (t) = B0 + B1 Z1 (t) + · · · + Bp Zp (t) ,
(52)
λ12 (t) = C0 + C1 Z1 (t) + · · · + Cp Zp (t) , λ23 (t) = D0 + D1 Z1 (t) + · · · + Dp Zp (t) .
Among the expression (52) Ai , Ci , Di , (I = 1, 2, . . . , p) are the estimating parameters. The one step transition probability and stay probability are P00 (τ, t) =
1,2 X ρi + λ10 + λ12 i6=j
P01 (τ, t) =
1,2 X i6=j
P12 (τ, t) =
ρi − ρ j
ρi − ρ j
·
P11 (τ, t) =
1,2 X ρi + λ01 i6=j
λ01 ρi (t−τ ) e , ρi − ρ j
1,2 X ρi + λ01 i6=j
eρi (t−τ ) ,
P10 (τ, t) =
1,2 X t6=j
ρi − ρ j
eρi (t−τ ) ,
λ10 ρi (t−τ ) e , ρi − ρ j
λ12 (eρi (t−τ ) − e−λ23 (t−τ ) ) , ρi + λ23
P23 (τ, t) = 1 − e−λ23 (t−τ ) ,
P22 (τ, t) = e−λ23 (t−τ ) .
(53)
Among them, λij ≡ λij (t), i → j ∈ S and p λ01 − λ10 − λ12 + (λ10 + λ12 − λ01 )2 + 4λ01 λ10 ρ1 = , 2 p λ01 − λ10 − λ12 + (λ10 + λ12 − λ01 )2 + 4λ01 λ10 . ρ2 = 2 The multiple step transition probability is Z t P02 (τ, t) = P01 (τ, ξ) · P12 (ξ, t)dξ , τ
P03 (τ, t) = P13 (τ, t) =
Z
Z
t
P02 (τ, ξ) · P23 (ξ, t)dξ ,
τ t τ
P12 (τ, ξ) · P23 (ξ, t)dξ .
(54)
During the screening process, the total sample of screening population is N and individual i is observed staying in state s ∈ S and with covariables Zj (t) at screening time ti . Therefore, the likelihood model for parameter
766
Q. Liu
estimation is L=
N mY i −1 Y
Ps(tik )s(tik+1 ) (tik , tik+1 ) .
(55)
i=1 k=1
The estimation of parameters may use the maximum likelihood method and the hypothesis testing may use likelihood ratio test. 4.2.3. Example The stochastic model of natural history of disease for nasopharyngeal carcinoma (NPC) screening is used as the example to introduce the development of non-homogeneous Markov model with covariables. The natural history of NPC is showed in Fig. 6 Three states were assumed in the progress of NPC: health, PCDP of NPC and clinical phase of NPC. When the individual transfers from health to PCDP of NPC, gender, age, antibody level and variability characters of Epstein Barr virus (EBV) are the covariables deciding the transition power. According to the natural history of NPC, the transition probability matrix of Markov chain is p11 P12 0 P (i, j) = 0 p22 p23 i, j = 0, 1, 2, 3 , 0
0
1
where 1 is state of health, 2 is state of PCDP of NPC and 3 is state of clinical phase of NPC. Since there is no reverse transition between the states, the p21 = 0, p32 = 0 and p32 = 0. This assumption is reasonable for the progressive diseases, such as malignant tumor. The state 3 is the absorbable state. It is also assumed that the transition between states is no jump. That means Age, gender, VCA/IgA and its variation
Progress
Health
Progress
PDD Fig. 6.
Natural history model of NPC.
Clinical
Statistical Methods in the Effect Evaluation
767
p13 = 0. Suppose θ is the parameter of transition intension. The transition probabilities are P12 (t) = α12 θ(t) ,
P11 (t) = 1 − P12 (t) ,
P23 (t) = α23 θ(t) ,
P22 (t) = 1 − P23 (t) ,
θ(t) = 1 − e−βX(t) , β = {β0 , β1 , β2 , β3 , β4 }
(56) (57)
0 ≤ β ≤ ∞,
X(t) = {x1 , x2 (t), x3 (t), x4 (t)} . Among them, X1 is gender (1 = woman, 2 = man) and does not change with time. X2 is age at the time of screen. X3 is the titer of antibody of EBV(VCA/IgA), 1 = negative, 2 = 1:5 1:10, 3 = 1:20 1:40, 4 = 1:80+. X 4 is the characteristics of variation of VCA/IgA, 1 = negative, 2 = low level of positive antibody, 3 = persistent high level, 4 = increasing level, 5 = both positive of VCA/IgA and EA/IgA. The age and level of antibody of EBV are covariables with time. It is assumed that the maximum times of transition during a fixed screening interval is m. An individual stay in state i at the time t and transfer to state j during a fixed interval (for example 1 year) m. The matrix of m step transition probability is Am (t) =
m−1 Y
P (t + k) .
(58)
k=0
t+k∈T
In order to estimate the false negative rate, it is assumed that the one step missing discriminant may happen only and suppose s ∈ S = 1, 2, 3. P (state = s + 1|diagnosis = s) = γ · γ is the false negative rate and the diagnostic vector is B(1) = (1 − γ, γ, 0) , B(2) = (0, 1 − γ, γ) , B(3) = (0, 0, 1) . Supposed N individuals are screened, individual k = 1, 2, . . . , N participates qk screens and then the likelihood function is L=
N qY k −1 Y
k=1 j=1
B ′ (skj )Amkj (tkj )B(sk,j+1 ) .
(59)
768
Q. Liu
Table 7. Estimated values of parameters in natural history model of NPC and hypothesis test. Parameters
Values
α12 0.001908 α23 0.2051 β0 −0.4163 β1 (gender) 0.1651 β2 (age) 0.00002217 β3 (VCA/IgA) 0.1553 β4 (variation of VCA/IgA) 0.1879 γ (false negative rate) 0.0002 M (Maxi. transitions) 2
Likelihood test
P value
10.7338 0.0194 34.2804 25.1728
< 0.00005 > 0.9 < 0.0001 < 0.0001
A mass screening for nasopharyngeal carcinoma was carried out in Guangzhou,17 where 2970 cases with positive results of test for VCA/IgA and 3 cases of NPC were found. All the cases with positive VCA/IgA and 214 controls with negative VCA/IgA were followed up and 35 NPC were found during a 7-year period. And 2988 cases with positive VCA/IgA and 34 cases of NPC were found in first screen in Zhongshan. All cases with positive VCA/IgA and 2068 controls with negative VCA/IgA were followed up and 40 cases of NPC were found during a 7-year period. Also, 1297 cases with positive VCA/IgA and 13 cases of NPC were found in first screen in Sihui; 19 cases of NPC were found during a 7-year follow-up for the cases with positive VCA/IgA. A Markov model with time dependent covariables was developed and the results showed in Table 7. 5. The Simulation and Optimization of Screening Policy Based on the parameters of natural history of disease and screening implement, the disease process and the effects of screening intervention may be simulated and the cost-effectiveness may be evaluated. There are two purposes of analysis and assessment of screening data. One is to estimate the parameters of screening, including the attendance rate, cost, the characteristics of screening test (sensitivity and specificity) natural history of disease (the sojourn time in preclinical detectable phase PCDP), the impact of screening on the mortality and prevalence of disease, etc. Based on these parameters, one may make a conclusion if the mass screening is efficacy. Another thing is to choose an optimized screening policy. It means the choice of population in eligible age group, frequency of screening and the
Statistical Methods in the Effect Evaluation
769
interval between subsequent screening test, the combination of screening test and sequential diagnostic procedures. An optimized screening policy is expected to obtain the maximum health effects in the limited resource. The choice of a screening policy should preferably be based on the balance between expected health effects and costs. The development of natural history model of disease serves for first purpose and the simulation of disease process and screening serves for second one. The fundament of simulation for disease process is technique of Monte Carlo. Based on the known and hypothetic parameters, a screening process for disease in a large population is simulated by computer and the simulated effects of screening policy are evaluated. A variety of screening policies are simulated repeatedly and the optimized policy of screening are identified for the consideration of policy decision-maker. Knox18 first used the macro-simulation method to evaluate the health effects of screening for cervical cancer in England. He assumed that the duration of the interval between the point at which the disease first becomes detectable and the point at which it becomes incurable is a constant (A). The duration of the interval between incurability and death is B. The sensitivity of screening test is S. The interval between subsequent screens is I. The disease incidence rates in different age groups are P . Now a mass screening program was carried out starting in age B in a 100,000 population. In the population, the disease onsets in a speed of P . It is assumed that the disease is curable if it is detected by screening and the life may be saved. The disease is incurable if it is diagnosed in the hospital and the life lost is the mortality rate (D) of disease. If the main concern were to save life year rather than lives, the life year lost (Y = De0x taken from the current life-table) and an another compromise is the weighted index √ (Id = D ex Y −1 ), that is long survivals are not weighted in proportion to their length. The health effects of different screening policy are demonstrated in Fig. 7. The dash line is the original Id caused by the disease. Each test, after an interval of B, produces a deep cut in the mortality, proportional in depth to the sensitivity of the test. The cut persists for a period A, and ends. Closely set tests involve some waste because of overlap, but later tests cut by the same proportion into the cases missed by earlier tests. The interval of first screen and second screen is wider than A so that the Id returns to original line when the health effects of early detection of screening on the disease mortality disappear. The interval of second screen and third one is shorter than A, so that Id goes down again before it return to original value. The sensitivity of screening test decides the depth of cut.
770
Q. Liu
100 90 80 70 60
Id
50 40 30 20 10
B
0 25
30
35
40
45
50
55
60
65
70
75
80
Age Fig. 7.
Illustration of simulated effects of screening for disease.
A high sensitivity means the cut is close to the horizontal axis and a good health effect of screening. However, the expense of high sensitivity is a low specificity and a high false positive rate when the power of screening test is fixed. That means a increase of amount of following diagnostic work and the total cost of screening. Knox simulated the cost-effectiveness of mass screening for cervical cancer in England by a simplified macro-simulation model. It is showed in Fig. 7 that the maximum health effects may obtain when screening is started just before the steep increase of disease incidence rate and more frequently screening in the age with high mortality rate of disease. It can be seen by simulation that a very wide range of results can be obtained from different deployments of the same resource, the range itself depending upon the natural history. For example, Knox assumed that the natural history distribution centered upon a mean interval is 6-year for cervical cancer, a 5-year spacing of tests beginning at age 35 gives something like 30 times the benefit of a one-year spacing beginning at age 20 and ending at age 29. The health effects of different screening policies can be roughly compared by macro-simulation model with relative simple calculation and the optimized scheme can be suggested. However, the parameters in macro-simulation are only assumed as the constants and the average disease process and
Statistical Methods in the Effect Evaluation
771
the health effects of screening in an whole population are simulated the variability of individuals is ignored. It is known that the sojourn time of preclinical detectable phase is a variable with a certain distribution. The sensitivity and specificity of screening test depends on the individual characteristics of disease. And the disease process may change in different individuals. Therefore the macro-simulation can not consider the variations between individuals and evaluate the cost-effectiveness of different screening policies precisely. Habbema19 developed a micro-simulation model of screening process by the assistance of computer. The disease process and the impact of screening intervention of every individual in 100,000 population were simulated by the method of Monte Carlo. This simulation model divided into two parts: the disease part and the screening part. The disease part generates a large number of life histories. Together, the life histories constitute the target population that will be screened in the screening part. The stochastic model underlying the simulation of the population is specified by the input of the program. The input related to the population (e.g. the life table), the epidemiology of the disease (e.g. age-specific incidence) and the disease process. Important aspects of the disease process include disease states into which preclinical and clinical disease is subdivided, the duration of preclinical disease, the probability that preclinical disease will regress spontaneously, etc. The output of disease part consists of the simulated life histories. All types of epidemiological data are computed from the aggregation of life histories: incidence of clinical disease, the prevalence of the disease states, the mortality, and survival figures. The input of the screening part consists of assumptions on the screening process (properties of the screening test, prognosis after early detection) and of a specification of the screening policy. The output of screening part consists of the simulated screening results (e.g. the number of cases detected, number of cases missed, mortality among screen-detected cases) and of the simulated effects of screening (e.g. the number of lives/life years saved, and the number of unnecessarily treated persons). Habbema applied this model to the evaluation of screening for breast cancer and colorectal cancer. Here the structure of micro-simulation model is introduced by an example of screening for nasopharyngeal carcinoma.20 The basic structure of model for disease process and screening process of NPC is illustrated in Fig. 8. The main biomarker of NPC risk is the antibody level of Epstein Barr virus (VCA/IgA). In order to simplify, the positive rate of VCA/IgA is assumed as a constant, e.g. instantaneous transition rate λ. It is supposed
772
Q. Liu
Fig. 8.
Structure of the disease model for NPC and stages used in model.
that the distribution of λ depended upon the time is an exponential distribution and then the accumulated distribution probability at certain time can be estimated, e.g. transition probability (P12 ). It is assumed that a part of population wit positive VCA/IgA may become negative and the transition probability is P21 . It is reasonable that the progression of nasopharyngeal carcinoma is irreversible. Therefore, the preclinical cases should progress to clinical without the intervention of medication. It means the incidence of preclinical NPC is same as the incidence of clinical NPC. The transition probabilities from normal population with negative VCA/IgA to preclinical NPC are the incidence rates (I0i ) of NPC in different age groups with negative VCA/IgA. The incidence rates among the population with positive VCA/IgA are I1i = I0i × RREB (the relative risk of positive VCA/IgA). These are the instantaneous transition probabilities. The transition probabilities (P13 and P23 ) can be estimated if the distribution functions are assumed. If there is no intervention of screening, the preclinical NPC cases enter the clinical phase according to the distribution probability (P34 ) of sojourn time in preclinical detectable phase. And then these cases will die according to survival rate (P45 ) of clinical cases of NPC. All individuals may die from other cause in every stages depended upon the age-specific mortality rates (P6 ). Finally, the death age of every individual are simulated. That is the disease part of model without the intervention of screening. The output of disease part includes the simulated life history of
773
Statistical Methods in the Effect Evaluation
every individual and other disease index such as incidence rate, prevalence rates in different disease stages, mortality rate, etc. The simulated results are compared to the actual figures. Some adjustment of parameters should be done to make simulation model close to realistic population. In the screening part, the simulated population goes through the screening intervention according to the different policies of screening. For example, the population ise divided into the low and the high risk groups according the level of VCA/IgA and are screened in different frequencies. The early detected cases will be checked by further diagnostic procedure. Therefore, the simulated population will enter these three states according to the sensitivity and specificity of screening tests. The transition probability (S4 ) from screening detected NPC to death is estimated depended upon the survival rate of preclinical NPC cases. In theory, S4 < P45 . It means that the early detection may save the life of NPC cases. The simulated results of screening part are also the death age (d) of every individual. P The total effect of screening is the life year saving (Y = (D−d)). The total cost (C) of screening can be estimated according to the simulated screening process including physical check, test of VCA/IgA, further diagnostic procedure and the organization of screening program. The cost-effectiveness index of screening is the average cost per life year saving = C/Y . The lower the cost-effectiveness index, the more the total life year saving, the more the health effect of screening is. The simulated results of different screening schemes for NPC are listed in Table 8. It is showed that the cost-effectiveness is the best when the positive value of VCA/IgA sets on 1:20. A lower value increases the false Table 8. VCA/IgA
Simulated effects of screening in different policies.
VCA/IgA ≥ 1:5
VCA/IgA ≥ 1:20
VCA/IgA ≥ 1:80
(−)
(+)
Life year saving
Cost per life year
Life year saving
Cost per life year
Life year saving
Cost per life year
1 3 3 3 5 5 5 5 5
1 1 2 3 1 2 3 4 5
5848 5716 4606 4025 4921 4376 4100 3804 3591
7210.30 3243.40 3788.55 3738.33 2605.28 2772.54 2871.37 3053.20 2679.13
6212 5227 4562 4171 4542 4051 3758 3697 3282
6787.53 3268.65 3697.57 3607.44 2523.95 2809.80 3005.80 3046.84 2930.68
5620 4603 4547 3696 3759 3340 3208 3343 3239
7503.04 3617.20 3653.35 4070.09 2967.23 3336.29 3466.84 3326.62 2969.72
774
Q. Liu
positive rate and the cost for further diagnosis. A higher value increases the missing rate and the cost-effectiveness index since the less life saving obtains. The cost is the lowest and the total life year saving is median when the interval between screens is every year for population with positive VCA/IgA and every 5 years for population with negative VCA/IgA. That may be considered as the optimized screening scheme for NPC. From the results of simulation, it is showed that the cost-effectiveness index of the poorest scheme is four times higher than that of the best one. It cannot be overemphasized the importance of simulation study for the assessment of screening.
References 1. United States Commission on Chronic Illness (1957). Chronic illness in the United States, Vol. 1, Cambridge, MA:harvard University Press, MA, 267. 2. Wilson, J. and Jungner, G. (1968). Principles and practices of screening for disease. Public Health Paper 34, World Health Organization, Geneva, 26–39. 3. Shapiro, S., Goldberg, J. D. and Hutchinson, G. B. (1974). Lead-time in breast cancer screening detection and implication for periodicity of screening. The American Journal of Epidemiology 100: 357–360. 4. Zelen, M. and Feinleib, M. (1969). On the theory of screening for chronic diseases. Biometrika 56: 601–614. 5. Shapiro, S., Venet, W., Strax, P. et al. (1982). Ten- to fourteen-year effect of screening on breast cancer mortality. Journal of National Cancer Institute 69: 249–255. 6. Tabar, L., Fagerberg, G., Duffy, S. W. et al. (1989). The Swedish two county trial of mamographic screening for breast cancer: Recent results and calculation of benefit. Journal of Epidemiology Community Health 43: 107–114. 7. Albert, A., Gertman, P. M. and Liu, S. (1978). Screening for the early dectection of cancer: II. The impact of screening on the natural history of the disease. Mathematical Biosciences 40: 61–109. 8. Eddy, D. (1980). Screening for Cancer : Theory, Analysis and Design, Englewood Cliffs, Prentice-Hall, New Jersey. 9. Day, N. E. and Walter, S. D. (1984). Simplified models of screening for chronic disease: Estimation procedures from mass screening programmes. Biometrics 40: 1–14. 10. Brookmeyer, R. and Day, N. E. (1987). Two-stage models for the analysis of cacner screening data. Biometrics 43: 657–669. 11. MacGregor, J. E., Moss, S., Parkin, D. M. et al. (1985). A case-control study of cervical cancer screening in Northeaset Scotland. British Medical Journal 290: 1543–1546. 12. Duffy, S. W., Chen, H. H., Tabar, L. et al. (1995). Estimation of mean sojourn time in breast cancer screening using a markov chain model of both entry to and exit from the preclinical detectable phase. Statistics in Medicine 14: 1531–1543.
Statistical Methods in the Effect Evaluation
775
13. SAS Institute Inc. (1989). SAS/STAT Users Guide, Version6, Volume 2, Cary NC: SAS Institute Inc. 14. Tabar, L., Fagerberg, G., Duffy, S. W. et al. (1992). Update of the Swedish two-county program of mammographic screening for breast cancer. Radiologic Clinics of North America 30: 187–210. 15. Fang, J. Q., Wu, C. B., Mao, J. H. et al. (1995). Two stage model of tumor sojourn time(I)-non-homogeneous Markov Model. Applied Probability and Statistics 11(2): 205–212. 16. Fang, J. Q. and Wu, C. B. (1995). Two stage model of tumor sojourn time(II) — Counting process and Bootstrap method. Applied Probability and Statistics 11(2): 213–222. 17. Liu, Q., Fang, J. Q., Hu, M. X. et al. (1997). Stochastic model of natural history of nasopharyngeal carcinoma. Chinese Journal of Health Statistics 14(4): 12. 18. Knox, E. G. (1976). Ages and frequences for cervical cancer screening. British Journal of Cancer 34: 444–452. 19. Habbema, J. D. F., Van Oortmarssen, G. J., Lubbe, J. T. N. et al. (1984). The MISCAN simulation program for the evaluation of screening for disease. Computation Methods and Programs Biomedicine 20: 79–93. 20. Liu, Q. (1995). Simulation study of effect evaluation of screening for tumor. Chinese Journal of Tumor 17: 182.
About the Author Qing Liu got his BS in Public Health (1982) from Guangdong Pharmacological and Medical College. MS (1985) and PhD (1996) in Medical Statistics from Sun Yat-Sen University of Medical Sciences. Dr. Liu has worked in Sun Yat-Sen University as assistant professor, associate professor and professor since 1985. Attending to an international seminar of clinical epidemiology held in West Chinese Medical University in 1985 and being a fellow in International Agency of Research on Cancer during 1990–1991. He has been teaching medical statistics, epidemiology and clinical epidemiology for years. His main research interests are the epidemiology of chronic diseases and related statistical methods, the model of natural history of diseases and evaluation of cancer screening, genetic epidemiology of cancer.
This page intentionally left blank
CHAPTER 21
CAUSAL INFERENCE
ZHI GENG Department of Probability and Statistics, School of Mathematical Sciences, Peking University, Beijing 100871, PR China Tel: 86-10-62751837; [email protected]
1. Introduction More than 2000 years ago, Aristotle pointed out that the real scientific knowledge was about causation. Since ancient times, exploration of causation has been the ultimate destination of almost all the scientific studies including philosophy, social sciences and medical sciences. Causation and association are two different and important concepts. Although causation is more important than association in many scientific studies, most of the statistical methods at present can only be suitable for associational studies. Even if there is not causation between two factors, there may be spurious association; on the other hand, causation may also appear spurious independence. A lot of examples may explain spurious association. For example, the watch times of Mr. John Doe 1 and Mr. John Doe 2 have very strong association, but changing the watch time of Mr. John Doe 1 would not affect that of Mr. John Doe 2. Freedman7 gave an example that the reading ability of primary school students was related to sizes of their shoes, but there was not causation between them apparently because changing the shoes sizes would not change their reading ability. A few of statisticians and medical researchers wondered whether causation should present association. In fact, there are many examples of spurious independence. We may imagine that exercising Taijiquan (shadow boxing) can build up health, that is, doing Taijiquan has causal effects on health. However, people who exercise Taijiquan may show little difference in health from, or even worse than, those who do not exercise Taijiquan. That may be because 777
778
Z. Geng
people exercising Taijiquan were all in poor health and would be worse if they had not exercised Tajiquan, and thus spurious independence appears. Another example is that the life of uranium miners was as long as, or even longer than, that of others who were not exposed to uranium mines. It cannot be explain that exposure to uranium mines would not affect one’s life. Perhaps uranium miners were selected from the stronger. If they had not been exposed to uranium mines, they would have longer life. This is called healthy worker effect. In the history of statistical science, research on causal inference has not received deserved emphasis and development. The early statistical theories and methods of causal inference include contingency tables, path analysis and structural equation models. Although association and causation are the well-known different concepts, association obtained by statistical inference is not unusually misused to explain the relation between cause and effect. In application studies, one often overlooks assumptions on causal mechanisms and interprets parameters of association as those of causation. Rubin35 proposed a causal effect model, which was similar to the counterfactual philosophy of Lewis21 and was called the counterfactual model. Pearl26 proposed concepts of causal diagram and external intervention and established the method of diagram for causal inference that combined knowledge about causal mechanisms with data from observational studies. These causal models need various assumptions that cannot be tested by using data from observational studies. The statistical theories and methods are lack in causal inference and identification of confounders. Causal inference is a complex problem involving statistics, philosophy and related application areas and has been discussed by many authors in recent years.7,14,33
2. Experimental Studies and Observational Studies
The difference between observation and experiment is that information from observation seems to emerge by itself, while that from experiments is knowledge about the truth.1 Observation is a method for collecting facts, while experiment is a means for obtaining knowledge.1 The information from an experiment is essentially distinct from that from observation. Experiments try to uncover information about causation, whereas observational studies can provide information only about association. Only under certain conditions or assumptions can the two be equivalent. Holland17 pointed out that it is impossible to make causal inferences based on observational studies without untestable assumptions. From the philosophical views of the
Popperians, an affirmation is not scientific if it is not empirically testable. The randomized experiment is the best scientific method for assessing causal effects. However, randomized experiments, or even experimental methods in general, are prohibited in many studies, and only observational studies can be conducted. A well-known example is the epidemiological study of cancer and smoking. When randomized experiments cannot be carried out, a case-reference study attempts to find a control group that is comparable to the treatment group. Grace et al.13 summarized the results of 51 studies on the portacaval shunt, some of which were randomized experimental studies, some controlled studies and some uncontrolled studies, as shown in Table 1: 24 of the 32 uncontrolled studies and 10 of the 15 nonrandomized controlled studies were markedly enthusiastic about the shunt, while all 4 of the randomized controlled studies indicated that the surgery had no effect. This shows that different study methods may lead to completely different conclusions.8 In 1948 and 1949, Doll and Hill carried out a case-control study on lung cancer and smoking in 20 hospitals in London and found a significant association between smoking and lung cancer (see Table 2). Fisher pointed out in 1957 that the association from case-control studies could not be explained simply as a causal relation between smoking and lung cancer, and that this association might be explained by two alternative theories: (1) cancer causes smoking; (2) genes cause both cancer and smoking. He illustrated the association between genes and smoking habits using data on twins. Up to the present, epidemiologists have argued from various methods and data that smoking is a risk factor for lung cancer:
Table 1. 51 studies on the portacaval shunt.8,13

                                        Degree of Enthusiasm
Design                            Marked    Moderate    None
No Controls                         24          7         1
Controls, but not Randomized        10          3         2
Randomized Controlled                0          1         3

Table 2. The case-control study on lung cancer and smoking.

                              Lung Cancer    Control
Male      Smoking                  647          622
          No smoking                 2           27
                                  P = 0.64 × 10−6
Female    Smoking                   41           28
          No smoking                29           32
                                  P = 0.025
Total                              719          709

Table 3. Water source and death rate from cholera.8

               No. of Houses    Deaths from Cholera    Rate per 10,000
SV Company         40,046              1,263                 315
L Company          26,107                 98                  37
(1) Dose-response relation: the more one smokes, the bigger the risk; (2) the risk increases as the years of smoking increase; (3) the risk decreases as the years since giving up smoking increase. Freedman7 reviewed Snow's study on cholera. In 1855, Snow, a physician in London, concluded that cholera was an infectious disease transmitted through drinking water. He developed a series of arguments to support his germ theory of cholera. For example, cholera spread along the tracks of human commerce; when a ship stopped at a port where cholera was prevailing, sailors who had contact with local residents would contract the disease. Such facts showed that cholera was an infectious disease and could not be explained by miasma or bad air. In August and September 1854, cholera broke out in London, and the patients were concentrated mainly near the water pump in Broad Street in the Soho district. A number of groups in the district, fortunately, escaped cholera. One was a brewery, where the workers preferred ale to water and which had a private pump. Another was a poor-house, which owned its own water pump. Snow could also explain why most patients in other areas had drunk water from the Broad Street pump. For example, a female patient in Hampstead had lived in Soho before; she liked the taste of the Broad Street water and routinely drew water from there. There were several water supply companies in London in the 19th century. There was no essential difference between the customers of the Southwark and Vauxhall (SV) Company and those of the Lambeth (L) Company, which imitated an experiment of nature. The water source of the SV Company contained polluted water, while the L Company drew relatively pure water. It was found that the mortality rate for the SV Company was about 9 times the death rate for the L Company, as shown in Table 3. Therefore, it was concluded that the water source was the cause of cholera. Snow's study shows that non-experimental studies can also be used for causal inference.
3. Simpson's Paradox and Standardization
We first illustrate the paradox presented by Simpson36 in 1951, which is usually called Simpson's Paradox although many statisticians had noted it
Table 4. A numerical example of smoking and lung cancer.

                Case    Control    Total
Smoking          80       120       200
No Smoking      100       100       200

RR = 0.80, OR = 0.67, RD = −0.10

Table 5. Stratification with sex.

                        Male                        Female
                Case      Control           Case      Control
Smoking          35          15               45        105
No Smoking       90          60               10         40

         RR1 = 1.17, OR1 = 1.56, RD1 = 0.10      RR2 = 1.50, OR2 = 1.71, RD2 = 0.10
before 1951. Suppose that we obtain the observed data on smoking and lung cancer given in Table 4, which may be regarded as the distribution of the population. Then the relative risk is RR = (80/200)/(100/200) = 0.80, the odds ratio is OR = (80 × 100)/(100 × 120) = 0.67 and the risk difference is RD = (80/200) − (100/200) = −0.10. From these measures, we can see that the prevalence of cancer in the smoking population is lower than that in the non-smoking population, and thus smoking seems to be a preventive factor. Suppose that sex is also included in the observation, and that the data stratified by sex are as shown in Table 5. For males, the relative risk is RR1 = 1.17, the odds ratio OR1 = 1.56 and the risk difference RD1 = 0.10. For females, the relative risk is RR2 = 1.50, the odds ratio OR2 = 1.71 and the risk difference RD2 = 0.10. Smoking now appears to be a risk factor. It follows from Tables 4 and 5 that smoking is harmful to both males and females, and yet appears beneficial for the population as a whole. This phenomenon is called Simpson's Paradox. Suppose furthermore that the observation includes age, classified into two levels: age 40 and under, and over 40. The 2 × 2 × 2 × 2 contingency table is shown in Table 6, where smoking again seems to be a preventive factor across the strata of sex and age. This numerical example illustrates that covariates should be carefully considered in survey design and data analysis; otherwise, a spurious association may be obtained.
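As a quick numerical check, the reversal just described can be reproduced from the cell counts of Tables 4 and 5 with a few lines of Python. The sketch below is illustrative only; the function name measures and the layout of its arguments are not part of the chapter.

```python
def measures(case_exposed, control_exposed, case_unexposed, control_unexposed):
    """Return (RR, OR, RD) for a 2x2 table with exposure defining the rows."""
    p1 = case_exposed / (case_exposed + control_exposed)        # P(case | exposed)
    p0 = case_unexposed / (case_unexposed + control_unexposed)  # P(case | unexposed)
    rr = p1 / p0
    odds_ratio = (case_exposed * control_unexposed) / (case_unexposed * control_exposed)
    rd = p1 - p0
    return rr, odds_ratio, rd

# Crude table (Table 4): smoking vs lung cancer, ignoring sex
print(measures(80, 120, 100, 100))    # approx (0.80, 0.67, -0.10)

# Stratified by sex (Table 5)
print(measures(35, 15, 90, 60))       # male:   approx (1.17, 1.56, 0.10)
print(measures(45, 105, 10, 40))      # female: approx (1.50, 1.71, 0.10)
```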
Table 6. Stratification with sex and age.

                          Age ≤ 40                              Age > 40
                  Male              Female              Male              Female
              Case  Control     Case  Control       Case  Control     Case  Control
Smoking         5      5          40     50           30     10          5     55
No smoking     60     55           5      5           30      5          5     35

              RR11 = 0.96       RR12 = 0.89         RR21 = 0.88       RR22 = 0.67
              OR11 = 0.92       OR12 = 0.80         OR21 = 0.50       OR22 = 0.64
              RD11 = −0.02      RD12 = −0.06        RD21 = −0.11      RD22 = −0.04
Table 7. Admission of graduate students at the University of California, Berkeley.

           Admission    Rejection     Sum     Percentage of Admission
Men          3738          4704       8442              44
Women        1494          2827       4321              35

Table 8. Stratification with major.

Major                                 A      B      C      D      E      F
No. of Male Applicants               825    560    325    417    191    373
Percentage of Admission (Male)        62     63     37     33     28      6
No. of Female Applicants             108     25    593    375    393    341
Percentage of Admission (Female)      82     68     34     35     24      7
Sum of Applicants                    933    585    918    792    584    714
Bickel et al.2 reanalyzed the data of the observational study on sex discrimination in graduate admissions at the University of California, Berkeley. The data on sex and admission are shown in Table 7. The admission rate for the men was about 44% and that for the women about 35%, which appeared to indicate sex discrimination. By looking at each major separately, however, it was found that there was no discrimination against women in any major and that, where any difference existed, it was against the men. Data for the six largest majors are shown in Table 8; the candidates in these six majors covered over one-third of all candidates in the more than 100 majors of the university. It can be seen that the reason why the admission rate for the women was lower than that for the men was that most of the women applied to the hard majors with lower admission rates. In order to eliminate the influence of the different distributions
of the men and women in their applications to majors, we may select, for example, the sums of each major's applicants in the last row of Table 8 as the standard distribution of majors, and then adjust the distributions of the men and women over majors. The standardized admission rate for the men after adjustment is
(0.62 × 933 + 0.63 × 585 + 0.37 × 918 + 0.33 × 792 + 0.28 × 584 + 0.06 × 714)/4526 = 0.39,
and the standardized admission rate for the women after adjustment is
(0.82 × 933 + 0.68 × 585 + 0.34 × 918 + 0.35 × 792 + 0.24 × 584 + 0.07 × 714)/4526 = 0.43.
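The two standardized rates can be reproduced with the short sketch below; the variable names are illustrative and the admission percentages from Table 8 are used as proportions.

```python
totals       = [933, 585, 918, 792, 584, 714]          # applicants per major A-F
male_rates   = [0.62, 0.63, 0.37, 0.33, 0.28, 0.06]    # male admission proportions
female_rates = [0.82, 0.68, 0.34, 0.35, 0.24, 0.07]    # female admission proportions

n = sum(totals)                                         # 4526
std_male   = sum(r * w for r, w in zip(male_rates, totals)) / n
std_female = sum(r * w for r, w in zip(female_rates, totals)) / n
print(round(std_male, 2), round(std_female, 2))         # 0.39 0.43
```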
It can be seen that the standardized admission rate for the women is slightly higher than that for the men after eliminating the influence of the different distributions over majors.
4. Counterfactual Model for Causal Inference
Rubin35 proposed the counterfactual model for causal inference, which can be traced back to Neyman.23 The basic idea is that of potential outcomes. If we could observe the responses of a unit both when exposed and when unexposed to the treatment, we would assess the causal effect of the treatment on the unit by the difference between the two responses. In epidemiological and medical studies, however, every unit can be under only one exposure status, exposed or unexposed. As in the motto of Heraclitus, "You can't step into the same river twice," only one response can be observed and the other cannot. Some untestable assumptions are therefore necessary for causal inference from observational studies. Thus, there are arguments about the counterfactual model,5 but this property also reflects a strength of the counterfactual model.16 The counterfactual model is widely applied to causal inference, and it gives the most precise definition and description of causal effects.
4.1. Causal effects
Definitions of causal effects at three different levels are introduced in this section.18,35 Let E be a binary variable denoting treatment (or exposure), where E = e denotes treated (or exposed) and E = e¯ denotes untreated (or unexposed). Let De(u) denote the response of unit u under treatment E = e, and De¯(u) denote the response under no treatment E = e¯. Definition 4.1. The individual causal effect (ICE) of the unit u. For a particular unit u and an interval of time from the exposure time t1 to the
response time t2, the causal effect of the e versus e¯ treatment on the unit u is defined as: ICE(u) = De(u) − De¯(u). Since any unit u can be under only one exposure status, it is impossible to observe both De and De¯ on the same unit, and thus it is impossible to observe ICE(u). For this reason the model is called the counterfactual model or the potential outcome model. If there exist two units u1 and u2 such that De(u1) = De(u2) and De¯(u1) = De¯(u2), and if the units u1 and u2 receive different treatments, then the individual causal effect ICE(u1) of the unit u1 can be observed. Definition 4.2. The average causal effect (ACE) over U. Suppose that there are N units in the population U. The average causal effect (ACE) over U is defined as the mean of the individual causal effects of all units:
ACE = (1/N) Σu∈U ICE(u) = E(De − De¯).
The average causal effect is the difference between the average response if all units had been exposed and the average response if all units had not been exposed. Since every unit can be under only one exposure status, ACE is also a potential quantity. We define a covariate as a variable unaffected by treatment. For example, a variable that occurs before treatment is a covariate because it is not affected by the treatment. Definition 4.3. Let X be a discrete covariate. The population U is stratified into K subpopulations by the covariate X = 1, . . . , K. The average causal effect in the subpopulation X = k is defined as: ACEk = E(De − De¯|X = k) = P(De = 1|X = k) − P(De¯ = 1|X = k). ACEk denotes the difference between the average response if all units in the subpopulation X = k had been exposed and that if all of these units had not been exposed.
4.2. Randomized experiments
The randomized experiment is the most powerful method for assessing average causal effects. Randomization balances the joint distribution of all known confounders and all unknown confounders in the treated group and that in
the control group. That is, randomization can eliminate not only the known confounders but also the unknown confounders. In a randomized experiment, the treatment E is independent of all other covariates and, in particular, E is independent of both De and De¯, denoted by E ⊥ (De, De¯). Thus, we have E(Di|E = i) = E(Di) for i = e or e¯, where E(Di|E = i) denotes the expectation of the response Di in the group E = i, which can be estimated by the mean of the responses of all units in the group E = i. If units are randomly sampled from the population, the sample mean in group E = i is asymptotically equal to the average response that would be obtained if all units in the whole population received the treatment E = i. For a randomized experiment, we have E(De|E = e) − E(De¯|E = e¯) = E(De) − E(De¯) = E(De − De¯) = ACE. Thus, for a randomized experiment, we can obtain the following unbiased estimate of ACE from the sample means:
ÂCE = (the average of all responses in the treated group) − (the average of all responses in the control group)
    = [1/(number of units in the treated group)] Σu in the treated group De(u) − [1/(number of units in the control group)] Σu in the control group De¯(u).
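The following minimal Python sketch computes this difference-of-means estimate for a made-up randomized data set; the arrays and their values are purely illustrative.

```python
import numpy as np

treated_responses = np.array([1, 0, 1, 1, 0, 1])   # observed D_e(u) in the treated group
control_responses = np.array([0, 0, 1, 0, 1, 0])   # observed D_ebar(u) in the control group

# Unbiased estimate of ACE in a randomized experiment: difference of group means
ace_hat = treated_responses.mean() - control_responses.mean()
print(ace_hat)   # 2/3 - 1/3 = 0.333...
```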
4.3. Ignorability and the propensity score
Rosenbaum and Rubin34 proposed the following assumption, called strong ignorability, which is widely applied in observational studies. Definition 4.4. We say that a treatment assignment E is strongly ignorable given covariates X if E ⊥ (De, De¯)|X
and 0 < P (E = e|X = x) < 1 .
If a treatment assignment is strongly ignorable, then we have, for i = e and i = e¯, E(Di |E = i, X = x) = E(Di |X = x) .
Therefore, the average causal effect in subpopulation X = x is ACEx = E(De|X = x) − E(De¯|X = x) = E(De|E = e, X = x) − E(De¯|E = e¯, X = x), where E(De|E = e, X = x) and E(De¯|E = e¯, X = x) can be estimated from the observed data. Theorem 4.1. Under the assumption of strong ignorability, the average causal effect (ACE) in the whole population equals the expectation of the average causal effects in the subpopulations. When X is discrete, this can be written as:
ACE = E(ACEk) = Σk ACEk P(X = k) = Σk [P(De = 1|E = e, X = k) − P(De¯ = 1|E = e¯, X = k)] P(X = k).
Under the assumption of strong ignorability, the average causal effect ACE is therefore identifiable and can be estimated from observed data. In an observational study, if we can observe enough covariates X so that the assumption of strong ignorability given X holds, then we first stratify the population by X, next compute the average causal effects in the subpopulations, and finally obtain the average causal effect ACE as the above weighted average. The assumption of strong ignorability is one of the key assumptions for causal inference in observational studies. Note that this assumption cannot be tested empirically. Thus we must depend on the knowledge and experience of experts in the related disciplines for judgment, or design experiments (for example, stratified randomization) to ensure that the assumption of strong ignorability holds. Stone39 discussed various assumptions needed for causal inference. When X is a continuous covariate or a discrete covariate with many levels, even if stratification could ensure strong ignorability, it makes each stratum sparse and thus makes statistical inference inefficient. One solution to this problem is to use the propensity score to stratify the population as coarsely as possible, so that each stratum contains as many units as possible.
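To make the weighted average in Theorem 4.1 concrete, the sketch below evaluates it for hypothetical stratum-specific risks and stratum weights; none of these numbers come from the chapter.

```python
strata = {
    # k: (P(D=1 | E=e, X=k), P(D=1 | E=ebar, X=k), P(X=k))
    1: (0.30, 0.20, 0.40),
    2: (0.60, 0.45, 0.60),
}

# ACE = sum over strata of (risk difference in stratum k) * P(X = k)
ace = sum((p_e - p_ebar) * weight for (p_e, p_ebar, weight) in strata.values())
print(ace)   # 0.10*0.40 + 0.15*0.60 = 0.13
```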
Definition 4.5. Let X be a continuous or discrete covariate. The propensity score is defined as the conditional probability f(X) = P(E = e|X), which is a function of X. Rosenbaum and Rubin34 presented the following result. Theorem 4.2. If treatment assignment E is strongly ignorable given covariates X (i.e. E ⊥ (De, De¯)|X and 0 < P(E = e|X = x) < 1), then treatment assignment E is also strongly ignorable given the propensity score f(X) (i.e. E ⊥ (De, De¯)|f(X) and 0 < P(E = e|f(x)) < 1). Since the propensity score f(X) is a function of X, stratification by the propensity score f(X) is coarser than stratification by X. When X is a continuous variable, we may estimate the propensity score with a logistic regression model.
5. Confounding and Confounders
Confounding is one of the most important concepts in observational studies. Greenland et al.16 provided an overview of confounding. In epidemiological studies, the selection of a control (unexposed) group should ensure comparability of the exposed group with the control group. The difference between the response distributions in the comparison groups is called confounding. When confounding exists, methods of control and adjustment for confounders are usually used to reduce or remove it. There is, however, inconsistency in the definitions of confounding biases and confounders in the epidemiological literature. There are two main approaches for assessing confounding and a confounder: (1) Collapsibility-based criterion: a covariate is a confounder if a measure of association across strata of the covariate is not equal to the marginal measure.3,9 (2) Comparability-based criterion: a covariate is a confounder if adjusting for the covariate reduces confounding.22 The formalized definitions of confounding and of a confounder, and the relation between these two criteria, have been discussed by Geng et al.11,12
5.1. Collapsibility-based criterion
We say that a covariate is collapsible (or ignorable) if the measure of association of interest remains unchanged after ignoring the covariate.
When a covariate is not collapsible, ignoring it will confound causal effects and give biased inference; such a covariate is therefore called a confounder. Suppose that there exists a sufficient set of covariates such that the measure of association is constant across the strata defined by all covariates in the set and this constant measure also equals the marginal measure. Then ignoring the covariates in this set does not produce confounding, and thus they are not confounders. For discussions of collapsibility, see Kleinbaum et al.,19 Whittemore42 and Geng.9 Consider two binary random variables A and B, which take the value 0 or 1, and a K-valued random variable C, which may consist of many covariates. Let pijk = P(A = i, B = j, C = k) denote the probability for the cell (i, j, k) in a 2 × 2 × K contingency table. For the stratum C = k, the relative risk is defined as: RRk = P(A = 1|B = 1, C = k)/P(A = 1|B = 0, C = k) = p1|1k/p1|0k. The risk difference is defined as: RDk = P(A = 1|B = 1, C = k) − P(A = 1|B = 0, C = k) = p1|1k − p1|0k. The odds ratio is defined as: ORk = [P(A = 1, B = 1|C = k)P(A = 0, B = 0|C = k)]/[P(A = 1, B = 0|C = k)P(A = 0, B = 1|C = k)] = p11|k p00|k/(p10|k p01|k). If a measure of association (e.g. the relative risk) is constant across the strata, for example RR1 = · · · = RRK = CRR, then we say that the relative risk is homogenous, and CRR is called the common relative risk. The common risk difference CRD and the common odds ratio COR are defined similarly. If the common measure of association equals the marginal measure of association in the 2 × 2 marginal table obtained by pooling the K strata, then the measure is said to be simply collapsible, or simply called collapsible. For example, let OR+ = p11+ p00+/(p10+ p01+) denote the odds ratio in the 2 × 2 marginal table obtained by pooling the K tables, where pij+ = Σk pijk. If OR1 = · · · = ORK = COR = OR+, then the odds ratio is collapsible. When a measure is collapsible, Simpson's paradox cannot occur for that measure and statistical inference on the measure can be carried out in the marginal table. If the common measure of association equals the measure of association in the partially marginal table obtained by pooling any several strata, then the measure is said to be strongly collapsible. Let ω be a subset of {1, . . . , K} and pool the strata in ω to obtain a 2 × 2 partially marginal
table, where the probability is the sum of the probabilities in the corresponding cells of these strata: pijω = Σk∈ω pijk.
In survey design and data analysis, we may consider, on the basis of collapsibility, which covariates should be included and which may be ignored, and we may determine whether the data from different survey studies may be combined for analysis. Collapsibility also provides a method for categorizing covariates. Assume that the probability pijk > 0 for every cell (i, j, k). We shall now give conditions for simple collapsibility and strong collapsibility of the measures of association.
5.1.1. Conditions of collapsibility for relative risk
Geng9 discussed the collapsibility of relative risks. If the relative risks in all the strata are the same (i.e. RR1 = · · · = RRK), we say the relative risks are homogenous. The marginal relative risk obtained by pooling all the strata of C is defined as: RR+ = P(A = 1|B = 1)/P(A = 1|B = 0) = p1|1+/p1|0+.
If the relative risk is homogenous across the strata and also equals the marginal relative risk (i.e. RR1 = · · · = RRK = RR+), then we say that the relative risks are simply collapsible. Let ω ⊆ {1, . . . , K} be a subset of levels of C. We define the relative risk of the partially marginal table obtained by pooling all the strata in ω as: RRω = P(A = 1|B = 1, C ∈ ω)/P(A = 1|B = 0, C ∈ ω) = p1|1ω/p1|0ω.
If the relative risk keeps constant for any partially marginal table obtained by pooling several arbitrary strata (i.e. RRω = RR+ for any set ω ⊆ {1, . . . , K}), then we say that the relative risks are strongly collapsible. Theorem 5.1. The relative risks are simply collapsible if one of the following conditions holds: (1) A is conditionally independent of C given B (written as A ⊥ C|B); (2) B is marginally independent of C (written as B ⊥ C), and the relative risks are homogenous (i.e. RR1 = · · · = RRK ); and (3) B is marginally independent of C (written as B ⊥ C), and B is conditionally independent of C given A = 1 (written as B ⊥ C|A = 1).
It can be proven that the above condition (2) is equivalent to (3). Theorem 5.2. A necessary and sufficient condition for the relative risks to be strongly collapsible (i.e. RRω = RR+ for any ω ⊆ {1, . . . , K}) is: (1) A is conditionally independent of C given B (written as A ⊥ C|B); or (2) B is marginally independent of C (written as B ⊥ C), and the relative risks are homogenous (i.e. RR1 = · · · = RRK). We may explain the theorem as follows: if one of the conditions (1) and (2) holds, then the relative risks are strongly collapsible; on the other hand, if the relative risks are strongly collapsible, then one of the conditions (1) and (2) must hold.
5.1.2. Conditions of collapsibility for risk differences
Similarly to the definitions of collapsibility for relative risks, we may define homogeneity, simple collapsibility and strong collapsibility for risk differences. The risk differences are homogenous if RD1 = · · · = RDK. The risk difference in the marginal table obtained by pooling all the strata of C is defined as: RD+ = P(A = 1|B = 1) − P(A = 1|B = 0) = p1|1+ − p1|0+. We say that the risk differences are simply collapsible if RD1 = · · · = RDK = RD+. Let ω ⊆ {1, . . . , K}. We define the risk difference of the partially marginal table obtained by pooling all the strata in ω as: RDω = P(A = 1|B = 1, C ∈ ω) − P(A = 1|B = 0, C ∈ ω) = p1|1ω − p1|0ω. If RDω = RD+ for any set ω ⊆ {1, . . . , K}, then the risk differences are strongly collapsible. Theorem 5.3. The risk differences are simply collapsible if one of the following conditions holds: (1) A is conditionally independent of C given B (written as A ⊥ C|B); (2) B is marginally independent of C (written as B ⊥ C), and the risk differences are homogenous (i.e. RD1 = · · · = RDK). Theorem 5.4. A necessary and sufficient condition for the risk differences to be strongly collapsible (i.e. RDω = RD+ for any set ω ⊆ {1, . . . , K}) is:
(1) A is conditionally independent of C given B (written as A ⊥ C|B), or (2) B is marginally independent of C (written as B ⊥ C), and the risk differences are homogenous (i.e. RD1 = · · · = RDK).
5.1.3. Conditions of collapsibility for odds ratios
The odds ratios are homogenous if OR1 = · · · = ORK. The odds ratio of the marginal table obtained by pooling all the strata of C is defined as: OR+ = [P(A = 1, B = 1)P(A = 0, B = 0)]/[P(A = 1, B = 0)P(A = 0, B = 1)] = p11+ p00+/(p10+ p01+). We say that the odds ratios are simply collapsible if OR1 = · · · = ORK = OR+. Let ω ⊆ {1, . . . , K}. We define the odds ratio of the partially marginal table obtained by pooling all the strata in ω as: ORω = [P(A = 1, B = 1|ω)P(A = 0, B = 0|ω)]/[P(A = 1, B = 0|ω)P(A = 0, B = 1|ω)] = p11|ω p00|ω/(p10|ω p01|ω). If ORω = OR+ for any set ω ⊆ {1, . . . , K}, then the odds ratios are strongly collapsible. Theorem 5.5. The odds ratios are simply collapsible if one of the following conditions holds: (1) A is conditionally independent of C given B (written as A ⊥ C|B); (2) B is conditionally independent of C given A (written as B ⊥ C|A). Theorem 5.6. A necessary and sufficient condition for the odds ratios to be strongly collapsible (i.e. ORω = OR+ for any set ω ⊆ {1, . . . , K}) is: (1) A is conditionally independent of C given B (written as A ⊥ C|B); or (2) B is conditionally independent of C given A (written as B ⊥ C|A). If C contains sufficient covariates such that the relative risks, risk differences or odds ratios are measures of causal effects in each stratum C = k, then ignoring C will not affect the value of the association measure, and C is not a confounder, provided the measure of association is collapsible over C. We usually say that the covariate C is a confounder when a measure of association is not collapsible over C. Note that we may well reach different conclusions when we use different measures of association: for example, the risk differences may be collapsible while the relative risks are not. Therefore, whether or not a covariate is a confounder may depend on which measure of association is used.
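The sketch below shows how such a collapsibility check might be carried out numerically for a 2 × 2 × K table; the array layout and the toy probabilities are invented for illustration.

```python
import numpy as np

# p[a, b, k] = P(A=a, B=b, C=k); a toy distribution with K = 2 strata
p = np.array([[[0.10, 0.05], [0.15, 0.10]],
              [[0.20, 0.15], [0.15, 0.10]]])
p = p / p.sum()

def rr_rd_or(q):
    """Return (RR, RD, OR) from a 2x2 table q[a, b] over A and B."""
    p1 = q[1, 1] / q[:, 1].sum()     # P(A=1 | B=1)
    p0 = q[1, 0] / q[:, 0].sum()     # P(A=1 | B=0)
    odds = (q[1, 1] * q[0, 0]) / (q[1, 0] * q[0, 1])
    return p1 / p0, p1 - p0, odds

marginal = rr_rd_or(p.sum(axis=2))                  # pooled over C
strata   = [rr_rd_or(p[:, :, k]) for k in range(p.shape[2])]
print("strata:  ", strata)
print("marginal:", marginal)
# A measure is simply collapsible only if every stratum value equals the marginal value.
```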
5.2. Comparability-based criterion
If the distribution of the response (e.g. the probability of disease) in the unexposed population (e.g. the population of non-smokers) would be the same as that in the exposed population (e.g. the population of smokers) had all individuals in the exposed population not been exposed (i.e. non-smoking), and if the distribution of the response in the exposed population would be the same as that in the unexposed population had all individuals been exposed, then the exposed population is said to be exchangeable with the unexposed population. In this case, the difference between the response distribution in the exposed population and that in the unexposed population equals the population-averaged causal effect. Therefore, we can estimate the population-averaged causal effect with observed data from the exposed and unexposed populations. If the exposed population is not exchangeable with the unexposed population, but the exposed subpopulations are exchangeable with the unexposed subpopulations defined by a covariate, then the covariate is called a confounder, i.e. confounding can be eliminated by stratifying the population by this covariate. In this case, the subpopulation-averaged causal effects may be identified and the population-averaged causal effect can be obtained by adjusting the subpopulation-averaged causal effects with the distribution of the confounder as weights. The comparability-based criterion for confounders was presented by Miettinen and Cook22 (written as M and C hereafter), and it was discussed further by many authors.11,12,15,43 It has been noted above that the collapsibility-based criterion for confounders depends on the selected measure of association. It is possible that some measures are collapsible while others are not, and thus the assessment of a confounder depends on which measure is selected. The collapsibility-based criterion is widely used in practical data analysis because it can be tested simply. However, some epidemiologists think that the basic criteria for confounders should not depend on the selection of measures. M and C22 used many examples to induce the following comparability-based criteria for confounders, which do not depend on the selection of measures. A confounder C must satisfy the following two conditions: (1) C must be predictive of risk in the unexposed population, and (2) C must have different distributions in the exposed and unexposed populations. Now, we use the counterfactual model to describe the comparability-based criterion for detecting confounders. Epidemiological studies usually
focus on causal effects in the exposed population rather than those in the whole population. The average causal effect in the exposed population is defined as: ACEe = E[De(u) − De¯(u)|E = e] = P(De = 1|E = e) − P(De¯ = 1|E = e), where P(De¯ = 1|E = e) denotes the hypothetical probability of disease if the individuals in the exposed population had not been exposed. In epidemiological studies, one usually selects an unexposed reference population E = e¯ and uses the probability of disease P(De¯ = 1|E = e¯) in the unexposed population to estimate the hypothetical probability P(De¯ = 1|E = e). Therefore, if the probability P(De¯ = 1|E = e¯) in the unexposed population equals the hypothetical probability P(De¯ = 1|E = e), then there is no confounding, and we say that the exposed population is comparable with the unexposed population. Rosenbaum and Rubin34 proposed two important assumptions for causal inference in observational studies, one of which is strong ignorability, (De, De¯) ⊥ E|C, and the other weak ignorability, De ⊥ E|C and De¯ ⊥ E|C. Wickramaratne and Holford43 gave a weaker assumption for causal inference in epidemiological studies, De¯ ⊥ E|C, i.e. there is no confounding in any subpopulation: for every k, P(De¯ = 1|E = e, C = k) = P(De¯ = 1|E = e¯, C = k). All of these assumptions are untestable with observed data. Wickramaratne and Holford43 proved the comparability-based criteria of M and C under their assumption; see the following theorem.
verifies M and C's criteria. This theorem for judging confounding depends on the categorization of the covariate C. For example, one may reach different conclusions when a covariate, say age C, is categorized by every 10 years or by every 20 years. Geng et al.12 proposed the concept of uniform non-confounding. If, for any set ω ⊆ {1, . . . , K}, P(De¯ = 1|E = e, C ∈ ω) = P(De¯ = 1|E = e¯, C ∈ ω), i.e. there is no confounding in any coarse subpopulation, then we call this uniform non-confounding. It means that no confounding occurs no matter how the covariate C is categorized. For example, when C denotes age, the exposed population is always comparable with the unexposed population no matter how the population is stratified by age, whether every 10 years or every 20 years. Theorem 5.8. Assume that De¯ ⊥ E|C. A necessary and sufficient condition for uniform non-confounding is: (1) De¯ ⊥ C|E = e¯, or (2) E ⊥ C. This theorem shows that M and C's criteria are a necessary and sufficient condition for uniform non-confounding. But both Theorems 5.7 and 5.8 need the untestable ignorability assumption De¯ ⊥ E|C.
5.3. Formal definitions and criteria for confounders
M and C22 derived their criteria for confounders inductively, without the assumption of ignorability. Greenland and Robins15 showed by example that the comparability-based criteria of M and C are not a sufficient condition for a confounder, but only a necessary one. Thus, M and C's criteria cannot be used as a definition of confounders. The collapsibility-based criteria are not sufficient for confounders either, and they also depend on which measure of association is used and how a covariate is categorized. The common qualitative definition of a confounder is that it is a risk covariate for disease, controlling for which can reduce the bias in estimating causal effects.15,19 In this section, we discuss formal definitions of confounders without the assumption of ignorability.11 Let C be a discrete covariate that is not affected by the exposure. The following common standardization in epidemiology is used to estimate the hypothetical probability P∆(De¯ = 1|E = e):
P∆(De¯ = 1|E = e) = Σk P(De¯ = 1|E = e¯, C = k) P(C = k|E = e), where the sum is over k = 1, . . . , K.
Definition 5.1. C is said to be a confounder11 if |P(De¯ = 1|E = e) − P∆(De¯ = 1|E = e)| < |P(De¯ = 1|E = e) − P(De¯ = 1|E = e¯)|. This definition of a confounder means that the standardized probability P∆(De¯ = 1|E = e) obtained by adjusting for the confounder C is closer to the hypothetical probability P(De¯ = 1|E = e) than is the crude probability P(De¯ = 1|E = e¯). Thus, we can adjust for the confounder C to reduce confounding bias. The inequality in Definition 5.1, however, is untestable with observed data since the hypothetical probability is usually unknown. Thus, we need to introduce the concept of an irrelevant factor for empirically detecting a confounder. Definition 5.2. C is said to be an irrelevant factor if P∆(De¯ = 1|E = e) = P(De¯ = 1|E = e¯). From this definition, we can see that an irrelevant factor is not a confounder: adjustment for an irrelevant factor cannot reduce confounding bias. According to the above definition of a confounder, however, it is possible that C1 is a confounder, C2 is a confounder, but {C1, C2} as a composite covariate is not a confounder. To avoid this counter-intuitive property, we present the concept of an occasional confounder. The above definitions also depend on the choice of categorization for the covariate C under consideration. For example, age may be a confounder or an irrelevant factor when it is categorized by every 10 years of age, but it may not be a confounder when categorized by every 20 years. We consider that the definition of a confounder should not depend on the categorization of a covariate, and we give the following definition of an occasional confounder. Definition 5.3. If there exists a partition p of the categories of C (i.e. p = {ω1, . . . , ωM}, where ωk ≠ ∅, ωi ∩ ωj = ∅ for i ≠ j, ∪k ωk = {1, . . . , K}, M ≤ K) such that |P(De¯ = 1|E = e) − Pp(De¯ = 1|E = e)| < |P(De¯ = 1|E = e) − P(De¯ = 1|E = e¯)|, then C is said to be an occasional confounder, where Pp(De¯ = 1|E = e) is the standardized probability based on the partition p:
Pp(De¯ = 1|E = e) = Σm P(De¯ = 1|E = e¯, C ∈ ωm) P(C ∈ ωm|E = e), with the sum over m = 1, . . . , M.
Similarly, this inequality is also untestable, and the concept of a uniform irrelevant factor is introduced below. Definition 5.4. C is said to be a uniform irrelevant factor if, for any possible partition p, Pp(De¯ = 1|E = e) = P(De¯ = 1|E = e¯). From these definitions, we can see that any confounder is also an occasional confounder, but the reverse is not true. Definition 5.3 implies that C is an occasional confounder if there exists a partition of C such that adjustment for C with respect to the partition p can reduce confounding bias, although such a partition cannot be recognized from observed data. If C is not an occasional confounder, then confounding bias cannot be reduced by controlling for C, no matter what categorization is chosen for C and no matter how the subpopulations are pooled together. If C is an occasional confounder, then any covariate set containing C must also be an occasional confounder. Definition 5.4 implies that C is a uniform irrelevant factor if there does not exist any partition of C such that adjustment for it can reduce confounding bias. It can be shown from the formal definition of a confounder that the collapsibility-based and comparability-based criteria for assessing a confounder are not contradictory, but mutually complementary. Combining the two criteria may eliminate more non-confounders from the set of potential confounders. Three necessary conditions for a covariate C to be a confounder are as follows: (A1) C is not independent of E, and De¯ is not independent of C given E = e¯; (A2) the risk difference is not collapsible over C; and (A3) the relative risk is not collapsible over C. Condition (A1) is the comparability-based criterion of M and C, and (A2) and (A3) are the collapsibility-based criteria. If C is a confounder, then it must satisfy all three of the conditions (A1), (A2) and (A3). Three necessary conditions for a covariate C to be an occasional confounder are: (B1) C is not independent of E, and De¯ is not independent of C given E = e¯; (B2) the risk difference is not strongly collapsible over C; and (B3) the relative risk is not strongly collapsible over C.
If C is an occasional confounder, then it must satisfy all three of the conditions (B1), (B2) and (B3). It can be shown that condition (B1) implies both conditions (B2) and (B3). Therefore, condition (B1) is the essential necessary condition for an occasional confounder. Condition (B1) is just the comparability-based criterion of M and C, while (B2) and (B3) are the strong collapsibility criteria, not the simple collapsibility criteria.
5.4. Numerical examples of confounders
As mentioned above, the comparability-based criteria of M and C are only a necessary condition and cannot be used as a definition, as Greenland and Robins15 illustrated. The collapsibility-based criteria are not sufficient either. In this section, we use the individual effect model presented by Greenland and Robins15 to illustrate the necessity of these conditions and why the criteria cannot define confounders. Suppose that the response of every individual is determined under any exposure status. Then all individuals in the population may be classified into the following four types: Type 1. No effect (individual "doomed"): De = De¯ = 1. Type 2. Exposure causative (individual susceptible): De = 1, De¯ = 0. Type 3. Exposure preventive (individual susceptible): De = 0, De¯ = 1. Type 4. No effect (individual immune to disease): De = De¯ = 0. Example 5.1. Suppose there is no exposure effect, i.e. there are only individuals of Types 1 and 4, and that the joint distribution of response, exposure and a covariate is given in Table 9. We cannot know whether an individual is of Type 1, 2, 3 or 4, but only whether he developed the disease. It can be seen from Table 9 that neither RD nor RR is collapsible, and thus we cannot assess from the collapsibility-based criteria whether C is a confounder. On the other hand, it follows from Table 9 that C ⊥ E, and thus we can judge from the comparability-based criteria that C is not a confounder. Without stratifying or adjusting for C, we obtain directly from the crude marginal table that RD = 0 and RR = 1, which correctly indicates no causal effect. Example 5.2. Suppose that exposure has no causal effect, i.e. there are only Type 1 and Type 4 individuals, and that the joint distribution is shown in Table 10. It can be seen from Table 10 that both RD and RR are collapsible,
Table 9. Hypothetical joint distribution: Non-confounder C that is not collapsible.

                       C = 1                 C = 2            Crude (C ∈ {1, 2})
Type              E = e    E = e¯       E = e    E = e¯         E = e    E = e¯
1 ("doomed")        15        5           40        50            55        55
4 ("immune")        85       95           60        50           145       145
Total              100      100          100       100           200       200
RD                     0.10                 −0.10                    0.00
RR                     3.00                  0.80                    1.00
Table 10. Hypothetical joint distribution: Non-confounder C that is collapsible.

                     C = 1             C = 2             C = 3        Crude (C ∈ {1, 2, 3})
Type            E = e   E = e¯     E = e   E = e¯    E = e   E = e¯       E = e    E = e¯
1 ("doomed")      95       95         5        5       75       25          175       125
4 ("immune")       5        5        95       95       75       25          175       125
Total            100      100       100      100      150       50          350       250
RD                   0.00               0.00              0.00                  0.00
RR                   1.00               1.00              1.00                  1.00
and thus we can judge from the collapsibility-based criteria that C is not a confounder. On the other hand, it follows from Table 10 that C is not independent of E and De¯ is not independent of C given E = e¯, and thus we cannot decide from the comparability-based criteria whether C is a confounder. Without stratifying or adjusting for C, we obtain directly from the crude marginal table that RD = 0 and RR = 1, which correctly assesses no causal effect. Example 5.3. Suppose exposure has no causal effect, i.e. there are only Type 1 and Type 4 individuals, and that the joint distribution is given in Table 11. It can be seen from Table 11 that RD is collapsible but RR is not. Therefore, we can judge from the collapsibility-based criterion for the risk difference that C is not a confounder. On the other hand, it follows from Table 11 that C is not independent of E and De¯ is not independent of C given E = e¯, and thus we cannot decide from the comparability-based criteria whether C is a confounder. Stratifying and adjusting for C, we obtain
P∆(De¯ = 1|E = e) = (46/100)·(400/700) + (26/100)·(100/700) + (66/100)·(100/700) + (184/400)·(100/700) = 0.46,
which equals P(De¯ = 1|E = e¯) = 322/700 = 0.46, the crude value obtained without stratifying or adjusting for C. This shows that adjustment for C neither decreases nor increases the confounding bias, and C is an irrelevant factor.
Table 11. Hypothetical joint distribution: Non-confounder C that is simply collapsible.

                   C = 1           C = 2           C = 3           C = 4        Crude (C ∈ ∆)
Type          E = e  E = e¯   E = e  E = e¯   E = e  E = e¯   E = e  E = e¯    E = e   E = e¯
1 ("doomed")    60      66      40     184     160      46      20      26      280      322
4 ("immune")    40      34      60     216     240      54      80      74      420      378
Total          100     100     100     400     400     100     100     100      700      700
RD                −0.06           −0.06           −0.06           −0.06            −0.06
RR                 0.91            0.87            0.87            0.77             0.87
Table 12. p = {{1, 2}, {3, 4}}: an occasional confounder C.

                   C ∈ {1, 2}              C ∈ {3, 4}
Type            E = e    E = e¯         E = e    E = e¯
1 ("doomed")      100       250           180        72
4 ("immune")      100       250           320       128
Total             200       500           500       200
RD                    0.00                    0.00
RR                    1.00                    1.00
Let p = {{1, 2}, {3, 4}}. Table 12 represents the distribution over the coarse subpopulations obtained by pooling levels 1 and 2 and pooling levels 3 and 4 according to the partition p. It can be seen that the risk difference RD is not strongly collapsible. The hypothetical probability P(De¯ = 1|E = e) is usually unknown, so we cannot decide whether C is an occasional confounder. Under the assumption of no causal effect, however, we have P(De¯ = 1|E = e) = P(De = 1|E = e) = 280/700 = 0.40. We can compute the standardized probability Pp(De¯ = 1|E = e) based on the partition p as follows:
Pp(De¯ = 1|E = e) = (250/500)·(200/700) + (72/200)·(500/700) = 280/700 = 0.40.
Since |P(De¯ = 1|E = e) − Pp(De¯ = 1|E = e)| = 0 is less than |P(De¯ = 1|E = e) − P(De¯ = 1|E = e¯)| = 0.06, C is an occasional confounder. Confounding bias can be completely eliminated by controlling for C with respect to the partition p. Furthermore, we obtain the adjusted risk difference P(De = 1|E = e) − Pp(De¯ = 1|E = e) = 0 and the adjusted relative risk P(De = 1|E = e)/Pp(De¯ = 1|E = e) = 1, which correctly assess no causal effect.
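The adjusted probabilities in Example 5.3 and in Table 12 can be reproduced with the short sketch below; the dictionary names are illustrative and the counts are read off Tables 11 and 12.

```python
# P(D_ebar = 1 | E = ebar, C = k), the unexposed risk in each stratum of Table 11
risk_unexposed = {1: 66/100, 2: 184/400, 3: 46/100, 4: 26/100}
# Sizes of the exposed group in each stratum (gives P(C = k | E = e))
exposed_sizes  = {1: 100, 2: 100, 3: 400, 4: 100}
n_exposed = sum(exposed_sizes.values())                                   # 700

p_delta = sum(risk_unexposed[k] * exposed_sizes[k] / n_exposed for k in risk_unexposed)
print(round(p_delta, 2))    # 0.46, the same as the crude P(D_ebar=1 | E=ebar) = 322/700

# The same adjustment over the coarse partition p = {{1, 2}, {3, 4}} of Table 12
risk_unexposed_p = {"12": 250/500, "34": 72/200}
exposed_sizes_p  = {"12": 200, "34": 500}
p_p = sum(risk_unexposed_p[w] * exposed_sizes_p[w] / n_exposed for w in risk_unexposed_p)
print(round(p_p, 2))        # 0.40, equal to P(D_e=1 | E=e) = 280/700
```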
However, this example by no means suggests that we should try to merge the levels of a covariate in order to correct for confounding, since it is impossible in practice to identify an occasional confounder from observed data.
6. Causal Diagrams
Graphical models have been widely applied to large complex systems.4,6,20,42 Spirtes, Glymour and Scheines38 and Pearl25–38 proposed causal diagrams for causal inference in observational studies. They first suppose a completely constructed causal diagram. Pearl26 gave several sufficient conditions for no confounding, together with several rules for identifying causal effects. Greenland et al. applied causal diagrams to epidemiological research and provided a criterion for detecting multiple confounders, which is more efficient than the traditional criterion. In many practical cases, however, it is difficult to construct such a complete causal diagram. Geng and Li10 discussed conditions of non-confounding without a completely constructed causal diagram.
6.1. Definitions and notations
Let Γ denote a directed acyclic graph and X = {X1, . . . , Xn} be a set of nodes, where each node represents a discrete random variable. Xj is called a parent node of Xi, or Xi a son of Xj, if there is a directed arrow from Xj to Xi. Let pai denote the set of parents of Xi. In a diagram Γ, a path between Xi and Xj is a succession of arcs from Xi to Xj, regardless of their directions. All nodes that can be reached from Xi along a directed path are called descendants of Xi.
Fig. 1. A directed acyclic graph (X1 → X2, X1 → X3, X2 → X4, X3 → X4, X4 → X5).
There is a one-to-one correspondence between d-separation d(A, B, C) and the conditional independence A ⊥ B|C under the stability condition of the distribution.25 Thus, conditional independence between variables can be read off from a diagram Γ by using the d-separation criterion. Example 6.1. In Fig. 1, C = {X1} blocks the path X2 ← X1 → X3; neither X4 nor X5 is in C, and thus C blocks the path X2 → X4 ← X3 too. Therefore, C = {X1} d-separates A = {X2} from B = {X3}, and thus X2 ⊥ X3|X1. If we let C′ = {X1, X4}, then X4 is in C′ and has converging arrows along the path X2 → X4 ← X3. So C′ does not d-separate A = {X2} from B = {X3}, and X2 is not independent of X3 given (X1, X4). A graphical model defined by Γ represents that X1, . . . , Xn have a joint probability distribution of the form
P(x1, . . . , xn) = P(x1|pa1) P(x2|pa2) · · · P(xn|pan),
where P(xi|pai) stands for the conditional probability of Xi = xi given Pai = pai. We assume that P(x1, . . . , xn) > 0 for every x1, . . . , xn. Example 6.2. The directed graph in Fig. 1 describes the following joint probability distribution: P(x1, x2, x3, x4, x5) = P(x1)P(x2|x1)P(x3|x1)P(x4|x2, x3)P(x5|x4). Assume that a causal system can be represented by a directed acyclic graph, where an arrow indicates the direction from cause to effect.
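As a small numerical illustration of how such a factorization encodes conditional independence, the sketch below builds a joint distribution over X1, X2 and X3 only (the relevant fragment of Fig. 1) from made-up conditional probability tables and verifies that X2 ⊥ X3 | X1, as stated in Example 6.1.

```python
import itertools
import numpy as np

p1        = {0: 0.6, 1: 0.4}                               # P(x1)
p2_given1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}     # P(x2 | x1)
p3_given1 = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}     # P(x3 | x1)

joint = np.zeros((2, 2, 2))                                # joint over (x1, x2, x3)
for x1, x2, x3 in itertools.product([0, 1], repeat=3):
    joint[x1, x2, x3] = p1[x1] * p2_given1[x1][x2] * p3_given1[x1][x3]

for x1 in (0, 1):
    cond = joint[x1] / joint[x1].sum()                     # P(x2, x3 | x1)
    # Conditional independence: the table equals the product of its margins
    assert np.allclose(cond, np.outer(cond.sum(axis=1), cond.sum(axis=0)))
print("X2 and X3 are conditionally independent given X1")
```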
The causal effect of Xi on the whole system can be evaluated by an external intervention on Xi. Consider Fig. 2.
Fig. 2. External intervention Fi: the original diagram Γ (pai → Xi) and the augmented diagram Γ′ (pai → Xi ← Fi).
An arrow Fi → Xi in Γ′ represents the external intervention on Xi. Fi is a new variable with value set(x′i) or idle, where set(x′i) means that Fi forces Xi to take on the value x′i and idle means that no external intervention influences Xi. Γ′ represents the diagram after the external intervention. In the diagram Γ′ after the external intervention, the set of parent nodes of Xi is augmented to Pa′i = Pai ∪ {Fi}. The conditional distribution of Xi given Pa′i is:
P′(xi|pa′i) = P(xi|pai) if Fi = idle; = 0 if Fi = set(x′i) and xi ≠ x′i; = 1 if Fi = set(x′i) and xi = x′i.
The joint distribution after the intervention set(x′i) is given by Px′i(x1, . . . , xn) = P′(x1, . . . , xn|Fi = set(x′i)). Px′i(x1, . . . , xn) represents the joint distribution of X1, . . . , Xn if Xi had been forced to x′i for all units in the target population. The transformation formula between the pre-intervention and the post-intervention distributions is given by
Px′i(x1, . . . , xn) = P(x1, . . . , xn)/P(x′i|pai) if xi = x′i, and Px′i(x1, . . . , xn) = 0 if xi ≠ x′i.    (1)
Let Xj and Xi denote the response and treatment variables respectively, and Ωj and Ωi their corresponding domains. The causal effect of Xi on Xj is defined as Px′i (xj ), which means the post-intervention distribution of Xj under the intervention to Xi . Given a causal diagram, we say that there is no confounding for the causal effect of Xi on Xj if Px′i (xj ), the post-intervention distribution of
Xj under the intervention Xi = x′i in the whole target population, is equal to P(xj|x′i), the distribution of Xj in the subpopulation Xi = x′i without external intervention. The formal definition is given below. Definition 6.3. No confounding. There is no confounding for the causal effect of Xi on Xj if, for every xj ∈ Ωj and x′i ∈ Ωi, Px′i(xj) = P(xj|x′i). No confounding implies that the causal effect Px′i(xj) can be evaluated with the observed association measure P(xj|x′i). Therefore, we may define the confounding bias as the difference between Px′i(xj) and P(xj|x′i). If there is confounding in the population, but there is no confounding in all the subpopulations stratified by some other variables, we call these variables confounders. Let C = {Xt1, . . . , Xtk} be a set of variables, Ωtk be the domain of Xtk, and let
Px′i(xj|xt1, . . . , xtk) = Px′i(xj, xt1, . . . , xtk)/Px′i(xt1, . . . , xtk)
denote the conditional distribution under the external intervention Xi = x′i; that is, the conditioning is carried out after the intervention. Given a causal diagram together with the corresponding joint distribution, we can infer the post-intervention distribution from the transformation formula (1). Thus, we can estimate the causal effects of interventions. The joint distribution, however, is usually unknown when some variables are unobservable. Let X = XO ∪ XU, where XO denotes the set of observed variables and XU the unobserved variables. The basic problem of causal inference is to identify the causal effect of Xi on Xj, Px′i(xj), from the distribution of the observed variables, P(XO). The causal effect of Xi on Xj is said to be identifiable if Px′i(xj) can be represented in terms of P(XO). Pearl26 proposed a set of inference rules for identifying causal effects. No confounding implies that the causal effect Px′i(xj) is identifiable. The following properties can be obtained from the transformation defined in formula (1): (1) An intervention set(x′i) can affect only the descendants of Xi in Γ. (2) Px′i(S|pai) = P(S|x′i, pai) holds for any set S of variables. (3) Xj ⊥ pai|Xi is a sufficient condition for no confounding. Property (1) implies that Px′i(S) = P(S) if S contains no descendants of Xi. Property (2) implies no confounding given the parents of Xi, which means that the causal effect of Xi on any set S of variables is equal to the conditional distribution when the parents of Xi are given. Spiegelhalter et al.37 gave an example to show that the conditional independence in property (3) is not necessary for
Fig. 3. An example of the back-door criterion (a directed acyclic graph over X1, . . . , X5, E and D containing the back-door path E ← X3 ← X1 → X4 ← X2 → X5 → D).
no confounding. Geng and Li10 proved that the conditional independence (3) is a necessary and sufficient condition for uniform non-confounding.
6.2. The back-door and front-door criteria
In this section, we introduce two criteria, the back-door criterion and the front-door criterion, proposed by Pearl.26 Definition 6.4. Back-door path. A path between Xi and Xj with an arrow into Xi is said to be a back-door path between Xi and Xj. Definition 6.5. Back-door criterion. A set S of variables is said to satisfy the back-door criterion if: (i) no node in S is a descendant of Xi, and (ii) S blocks every back-door path between Xi and Xj. Example 6.3. Consider the causal effect of E on D in Fig. 3. The sets S1 = {X3, X4} and S2 = {X4, X5} meet the back-door criterion, but S3 = {X4} does not, because S3 does not block the back-door path between E and D (E ← X3 ← X1 → X4 ← X2 → X5 → D). Theorem 6.1. The two conditions of the back-door criterion are equivalent to S ⊥ Fi and Xj ⊥ Fi|(Xi, S). Theorem 6.2. If a set S of variables satisfies the back-door criterion, then
Px′i(xj) = Σs P(xj|s, x′i) P(s)
for every xj ∈ Ωj and x′i ∈ Ωi. It follows from Theorem 6.2 that the causal effect of Xi on Xj, Px′i(xj), can be estimated from the joint distribution of the observed variables if we can find a subset of variables S ⊆ XO satisfying the back-door criterion.
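A minimal sketch of this adjustment formula for a single binary covariate S and binary treatment X and response Y; the probability tables are hypothetical and serve only to show the computation.

```python
p_s = {0: 0.3, 1: 0.7}                      # P(S = s)
p_y_given = {(0, 0): 0.20, (0, 1): 0.50,    # (x, s) -> P(Y = 1 | X = x, S = s)
             (1, 0): 0.35, (1, 1): 0.70}

def backdoor(x):
    """P_x(Y = 1) = sum_s P(Y = 1 | x, s) P(s), as in Theorem 6.2 with S admissible."""
    return sum(p_y_given[(x, s)] * p_s[s] for s in p_s)

# Post-intervention probabilities and their contrast
print(backdoor(1), backdoor(0), backdoor(1) - backdoor(0))
```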
Fig. 4. A diagram for the front-door criterion: U → E, U → D, E → Z and Z → D, where U is unobserved.
We cannot derive the causal effect Px′i(xj) from Theorem 6.2 if the set S of observed variables does not satisfy the back-door conditions. Consider the causal diagram in Fig. 4, where U denotes an unobserved variable, and E, Z and D are three observed variables. Z is on a directed path from E to D and blocks all directed paths from E to D (there is only one directed path from E to D in Fig. 4). According to traditional epidemiological theory, we cannot control or adjust for intermediate variables on the causal path in order to evaluate the causal effect of E on D. We now introduce the front-door criterion proposed by Pearl26 through the causal diagram in Fig. 4. It follows from Fig. 4 that the joint probability distribution of (E, D, Z, U) is given by P(e, d, z, u) = P(d|z, u)P(z|e)P(e|u)P(u), and from the back-door criterion,
Pe′(d) = P(d|set(e′)) = Σu P(d|e′, u)P(u),
which is not directly identifiable because U is unobserved. We can obtain two conditional independencies from Fig. 4: (1) D ⊥ E|(Z, U) and (2) Z ⊥ U|E, and we can show that
Pe′(d) = Σz P(z|e′) [Σe P(d|e, z)P(e)].    (2)
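Before stating the general result, here is a small numerical check of Eq. (2). The sketch constructs a toy joint distribution with the structure of Fig. 4 and verifies that the front-door expression, computed from the observed (E, Z, D) distribution alone, agrees with the true interventional probability that uses the unobserved U; all probability tables are invented.

```python
import itertools

p_u    = {0: 0.5, 1: 0.5}                    # P(u)
p_e_u  = {0: {1: 0.3}, 1: {1: 0.8}}          # P(E = 1 | u)
p_z_e  = {0: {1: 0.2}, 1: {1: 0.9}}          # P(Z = 1 | e)
p_d_zu = {(0, 0): 0.1, (0, 1): 0.5,          # (z, u) -> P(D = 1 | z, u)
          (1, 0): 0.4, (1, 1): 0.8}

def pr(d, z, e, u):                           # joint P(e, d, z, u) per Fig. 4
    pe = p_e_u[u][1] if e == 1 else 1 - p_e_u[u][1]
    pz = p_z_e[e][1] if z == 1 else 1 - p_z_e[e][1]
    pd = p_d_zu[(z, u)] if d == 1 else 1 - p_d_zu[(z, u)]
    return p_u[u] * pe * pz * pd

def p(*, d=None, z=None, e=None):             # observed marginals over (D, Z, E)
    total = 0.0
    for dd, zz, ee, uu in itertools.product([0, 1], repeat=4):
        if (d is None or dd == d) and (z is None or zz == z) and (e is None or ee == e):
            total += pr(dd, zz, ee, uu)
    return total

def truth(e1):                                # P_e'(D = 1) using the unobserved U
    total = 0.0
    for u in (0, 1):
        for z in (0, 1):
            pz = p_z_e[e1][1] if z == 1 else 1 - p_z_e[e1][1]
            total += p_d_zu[(z, u)] * pz * p_u[u]
    return total

def frontdoor(e1):                            # Eq. (2), observed distribution only
    total = 0.0
    for z in (0, 1):
        inner = sum((p(d=1, z=z, e=e) / p(z=z, e=e)) * p(e=e) for e in (0, 1))
        total += (p(z=z, e=e1) / p(e=e1)) * inner
    return total

print(truth(1), frontdoor(1))                 # the two values coincide (0.57)
```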
Because the probabilities on the right side of Eq. (2) are all conditional distributions, the causal effect Pe′ (d) is identifiable when D, Z and E are observed. This result is summarized in the following theorem, which is called the front-door criterion.26 Theorem 6.3. Suppose that a set Z of variables satisfies the following conditions relative to an ordered pair of variables (E, D) : (1) Z intercepts all directed paths from E to D. (2) There is no unblocked back-door path
between E and Z. (3) Every back-door path between Z and D is blocked by E. Then the causal effect of E on D is identifiable and is given by Eq. (2). Example 6.4. Pearl26 presented an example to illustrate the front-door criterion. For the observational research of smoking and lung cancer, Fisher once proposed a conjecture: There exists an unknown carcinogenic genotype related to smoking. If this is true, the genotype is an unobservable confounder and the confounding biases could not be eliminated using traditional standardization methods of epidemiology. If we observe an intermediate variable, tar deposit, on the causal path from smoking to lung cancer, and the unobserved genotype does not have a causal path to the tar deposit, then we can obtain a causal diagram as shown in Fig. 4, where E, Z, U , and D represent, respectively, smoking, tar deposits, the carcinogenic genotype and lung cancer. It follows from Theorem 6.3 that the causal effect of smoking on lung cancer can be identified by the distribution of observed variables. 6.3. Criterion for multiple confounders If there is an association between exposure E and response D after removing all causal effects of E on any factor, then the association cannot be a causal effect of E on D and thus there must be confounders. We now introduce the algorithm presented by Greenland et al.14 for checking confounders based on the back-door criterion. Given a set of covariates, S = {S1 , . . . , Sn }, which does not contain descendants of E and D, we perform the following steps: (1) Delete all arrows emanating from E (i.e. remove all causal effects of E). (2) For every node Si , draw undirected edges to connect every pair of nodes that share a common child which is either in S or has a descendant in S. (3) If the nodes in S block all paths from E to D in the new graph, then the set S is sufficient for controlling confounding bias. Otherwise, S is not sufficient. Example 6.5. Consider the causal diagram in Fig. 5. Delete the arrow E → D, and thus E has no effect on D. It follows from the diagram that there is still an association between E and D. Therefore, there exists confounding bias. When controlling for C, the unique back-door path from E to D is blocked by C, and thus the confounding is completely eliminated. So, C is a confounder.
Fig. 5. C is a confounder.

Fig. 6. C is not a confounder.
Example 6.6. Consider the causal diagram in Fig. 6. Delete the arrow E → D, so that E has no effect on D. The node C blocks the unique back-door path E ← A → C ← B → D. It follows from the new diagram obtained by deleting E → D that E is independent of D, and the association between E and D in the original diagram is attributed to causation. Neither A, B nor C is a confounder. However, if we control for C, then A and B must be connected by an undirected edge in step 2, and thus the path E ← A − B → D is not blocked. In the diagram obtained by deleting E → D, E is not independent of D conditional on C, and the association between E and D is not attributed to causation; that is, confounding bias is introduced by controlling for C.

7. Causal Inference for Longitudinal Studies

Longitudinal studies are among the most important methods in epidemiological and medical research. Causal inference and criteria for confounders in longitudinal studies have been discussed.30–32 In a longitudinal study, the same unit is repeatedly observed during the follow-up period. Longitudinal observation includes treatment variables, covariates, intermediate variables and response variables of interest at every time point, and it can provide more information on causation. In this section, we introduce the notation and concepts of causal inference in longitudinal studies proposed by Robins et al.32 Let m = 1, 2, . . . , M denote the time points at which data are observed; let Y_{m,i} and A_{m,i} denote the response and exposure of unit i at time m, respectively. Let X_i denote a vector of time-independent covariates of unit i, and L_{m,i} denote a vector
of time-dependent covariates of unit i at time m. Suppose that Y_{m,i} and L_{m,i} occur before A_{m,i} at each time m.

Based on the counterfactual model, let Y_{m,i}^{(0)} denote the response of unit i at time m if the unit were unexposed at all times during the follow-up (i.e. continuously unexposed). Similarly, let Y_{m,i}^{(1)} denote the response of unit i at time m if the unit had been exposed continuously during the follow-up (i.e. continuously exposed). If a unit i changes its exposure status during the follow-up, its response at m is neither Y_{m,i}^{(1)} nor Y_{m,i}^{(0)}. Therefore, neither Y_{m,i}^{(1)} nor Y_{m,i}^{(0)} can be observed for such units. Let Z̄_m = (Z_1, . . . , Z_m) for m ≥ 1 and Z̄_m = 0 for m ≤ 0. Assume that the N series of data, (Ȳ_{M,i}^{(1)}, Ȳ_{M,i}^{(0)}, Ȳ_{M,i}, Ā_{M,i}, L̄_{M,i}, X_i) for i = 1, . . . , N, are independently and identically distributed. For the subpopulation X = x at time m, define the expectations of the responses under continuous non-exposure and under continuous exposure, respectively, as

θ_m^{(0)}(x) = E[Y_m^{(0)} | X = x]   and   θ_m^{(1)}(x) = E[Y_m^{(1)} | X = x].

The average causal effect of continuous exposure versus continuous non-exposure is defined as

δ_m(x) = θ_m^{(1)}(x) − θ_m^{(0)}(x).

The average causal effect cannot be identified without any assumptions. An assumption was proposed32 for sufficiently controlling confounding at every time k, which is similar to Rosenbaum and Rubin's34 strongly ignorable assumption for cross-sectional studies:

{(Y_m^{(0)}, Y_m^{(1)}); m = k + 1, k + 2, . . . , M} ⊥ A_k | (Ā_{k−1}, Ȳ_k, L̄_k, X)    (3)

for all k ≥ 1. This assumption means that strong ignorability holds for all k, and it cannot be tested empirically.

If there are no time-dependent confounders, that is, the set L̄_k in model (3) is empty, then the conditional probabilities of the potential responses can be identified as follows:

P(Y_m^{(j)} = 1 | Ȳ_{m−1}^{(j)} = ȳ_{m−1}, X = x) = P(Y_m = 1 | Ȳ_{m−1} = ȳ_{m−1}, Ā_{m−1} = j^{[m−1]}, X = x),

where j^{[m−1]} is a vector with m − 1 elements all of which are equal to j (i.e. j^{[m−1]} = (j, . . . , j)). Substituting this expression into the following g-algorithm formula presented by Robins,28,29 the causal parameter θ_m^{(j)}(x)
can be identified by

θ_m^{(j)}(x) = P(Y_m^{(j)} = 1 | X = x)
             = Σ_{ȳ_{m−1}} [ P(Y_m^{(j)} = 1 | Ȳ_{m−1}^{(j)} = ȳ_{m−1}, X = x) × ∏_{k=1}^{m−1} P(Y_k^{(j)} = y_k | Ȳ_{k−1}^{(j)} = ȳ_{k−1}, X = x) ].

Similarly, if there are time-dependent confounders, that is, the set L̄_k in model (3) is not empty, then the causal parameter θ_m^{(j)}(x) can be identified by the following expression:

θ_m^{(j)}(x) = P(Y_m^{(j)} = 1 | X = x)
             = Σ_{ȳ_{m−1}, l̄_{m−1}} { P(Y_m = 1 | Ȳ_{m−1} = ȳ_{m−1}, L̄_{m−1} = l̄_{m−1}, Ā_{m−1} = j^{[m−1]}, X = x)
               × ∏_{k=1}^{m−1} [ P(Y_k = y_k | Ȳ_{k−1} = ȳ_{k−1}, L̄_{k−1} = l̄_{k−1}, Ā_{k−1} = j^{[k−1]}, X = x)
               × P(L_k = l_k | Ȳ_k = ȳ_k, L̄_{k−1} = l̄_{k−1}, Ā_{k−1} = j^{[k−1]}, X = x) ] },

where l_k and l̄_{k−1} consist of the corresponding elements of l̄_{m−1}. From the causal parameter θ_m^{(j)}(x), we can identify the average causal effect δ_m(x) of continuous exposure versus continuous non-exposure for the subpopulation X = x at any time m.
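The g-formula with no time-dependent confounders can be evaluated by brute-force summation over the binary outcome histories. The sketch below is only an illustration; p_y is a hypothetical user-supplied function returning P(Y_k = 1 | past outcomes, past exposures, X = x), for example estimated from the data by logistic regression.

    from itertools import product

    def g_formula_theta(m, j, x, p_y):
        """Brute-force g-formula for theta_m^(j)(x) when L-bar is empty.
        p_y(k, y_hist, a_hist, x) must return P(Y_k = 1 | past outcomes y_hist,
        past exposures a_hist, X = x); exposures are held fixed at j throughout."""
        a_full = (j,) * (m - 1)                       # j^[m-1] = (j, ..., j)
        theta = 0.0
        for y_hist in product((0, 1), repeat=m - 1):  # sum over all y-bar_{m-1}
            weight = 1.0
            for k in range(1, m):                     # product over k = 1, ..., m-1
                pk = p_y(k, y_hist[:k - 1], a_full[:k - 1], x)
                weight *= pk if y_hist[k - 1] == 1 else 1.0 - pk
            theta += weight * p_y(m, y_hist, a_full, x)
        return theta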
References

1. Bernard, C. (1920). Introduction to Research on Experimental Medicine (Chinese version).
2. Bickel, P. J., Hammel, E. A. and O'Connell, J. W. (1975). Sex bias in graduate admissions: Data from Berkeley. Science 187: 398–404.
3. Boivin, J. F. and Wacholder, S. (1985). Conditions for confounding of the risk ratio and of the odds ratio. American Journal of Epidemiology 121: 152–158.
4. Cox, D. R. and Wermuth, N. (1996). Multivariate Dependencies: Models, Analysis and Interpretation, Chapman and Hall, London.
5. Dawid, A. P. (2000). Causal inference without counterfactuals. Journal of the American Statistical Association 95: 407–448.
6. Edwards, D. (1995). Introduction to Graphical Modelling, Springer, New York.
7. Freedman, D. (1999). Association to causation: Some remarks on the history of statistics. Statistical Science 14: 243–258.
8. Freedman, D., Pisani, R., Purves, R. and Adhikari, A. (1991). Statistics, 2nd edn., W. W. Norton and Company, Inc., New York.
9. Geng, Z. (1992). Collapsibility of relative risks in contingency tables with a response variable. Journal of the Royal Statistical Society B54: 585–593.
10. Geng, Z. and Li, G. W. (2002). Conditions for non-confounding and collapsibility without knowledge of completely constructed causal diagrams. Scandinavian Journal of Statistics 29: 169–181.
11. Geng, Z., Guo, J. and Fung, W. (2002). Criteria for confounders in epidemiological studies. Journal of the Royal Statistical Society B64: 3–15.
12. Geng, Z., Guo, J., Lau, T. and Fung, W. (2001). Confounding, homogeneity and collapsibility for causal effects in epidemiologic studies. Statistica Sinica 11: 63–75.
13. Grace, N. D., Muench, H. and Chalmers, T. C. (1966). The present status of shunts for portal hypertension in cirrhosis. Gastroenterology 50: 684–691.
14. Greenland, S., Pearl, J. and Robins, J. M. (1999). Causal diagrams for epidemiologic research. Epidemiology 10: 37–48.
15. Greenland, S. and Robins, J. M. (1986). Identifiability, exchangeability, and epidemiologic confounding. International Journal of Epidemiology 15: 413–419.
16. Greenland, S., Robins, J. and Pearl, J. (1999). Confounding and collapsibility in causal inference. Statistical Science 14: 29–46.
17. Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association 81: 945–970.
18. Holland, P. W. and Rubin, D. B. (1988). Causal inference in retrospective studies. Evaluation Review 12: 203–231.
19. Kleinbaum, D. G., Kupper, L. L. and Morgenstern, H. (1982). Epidemiologic Research: Principles and Quantitative Methods. Van Nostrand Reinhold, New York.
20. Lauritzen, S. L. (1996). Graphical Models. Oxford University Press, Oxford.
21. Lewis, D. (1973). Counterfactuals. Harvard University Press, Cambridge.
22. Miettinen, O. S. and Cook, E. F. (1981). Confounding: Essence and detection. American Journal of Epidemiology 114: 593–603.
23. Neyman, J. (1935). Statistical problems in agricultural experimentation. Journal of the Royal Statistical Society 2(Suppl.): 107–180.
24. Neufeld, E. (1995). Simpson's paradox in artificial intelligence and in real life. Computational Intelligence 11: 1–10.
25. Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge.
26. Pearl, J. (1995a). Causal diagrams for empirical research (with discussion). Biometrika 83: 669–710.
27. Pearl, J. (1995b). From Bayesian networks to causal networks. In Probabilistic Reasoning and Bayesian Belief Networks, ed. A. Gammerman, Henley-on-Thames, 1–31.
28. Robins, J. M. (1986). A new approach to causal inference in mortality studies with sustained exposure periods — application to control of the healthy worker survivor effect. Mathematical Modelling 7: 1393–1512.
29. Robins, J. M. (1987). A graphical approach to the identification and estimation of causal parameters in mortality studies with sustained exposure periods. Journal of Chronic Disease 40(Suppl.): 139s–161s.
30. Robins, J. M. (1989). The control of confounding by intermediate variables. Statistics in Medicine 8: 679–701.
31. Robins, J. M. (1997). Causal inference from complex longitudinal data. In Latent Variable Modeling and Applications to Causality, ed. M. Berkane, Springer-Verlag, New York, 69–117.
32. Robins, J. M., Greenland, S. and Hu, F. C. (1999). Estimation of the causal effect of a time-varying exposure on the marginal mean of a repeated binary outcome (with discussion). Journal of the American Statistical Association 94: 687–712.
33. Rosenbaum, P. R. (1999). Choice as an alternative to control in observational studies. Statistical Science 14: 259–304.
34. Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70: 41–55.
35. Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66: 688–701.
36. Simpson, E. H. (1951). The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society B13: 238–241.
37. Spiegelhalter, D. J., Dawid, A. P., Lauritzen, S. L. and Cowell, R. C. (1993). Bayesian analysis in expert systems (with discussion). Statistical Science 8: 219–283.
38. Spirtes, P., Glymour, C. and Scheines, R. (1993). Causation, Prediction and Search, Springer, New York.
39. Stone, R. (1993). The assumptions on which causal inference rest. Journal of the Royal Statistical Society B55: 455–466.
40. Wagner, C. H. (1982). Simpson's paradox in real life. The American Statistician 36: 46–48.
41. Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics, Wiley, Chichester.
42. Whittemore, A. S. (1978). Collapsibility of multidimensional contingency tables. Journal of the Royal Statistical Society B40: 328–340.
43. Wickramaratne, P. J. and Holford, T. R. (1987). Confounding in epidemiologic studies: The adequacy of the control groups as a measure of confounding. Biometrics 43: 751–765.
About the Author

Zhi Geng is Professor in the School of Mathematical Sciences, Peking University. He received his PhD (1989) from Kyushu University, Japan. He has been an elected member of the ISI since 1996. He is an associate editor of Computational Statistics and Data Analysis and of Statistica Sinica. His research interests include categorical data, incomplete data, causal inference, graphical models, uncertainty inference and probabilistic expert systems, and biostatistics. His research has been supported by NSFC, EYNSFC of NSF and TCTP of SECC.
Section 4 Advanced Statistical Theory and Methods
CHAPTER 22
SURVIVAL ANALYSIS
D. Y. LIN
Department of Biostatistics, University of North Carolina, 3101E McGavran-Greenberg Hall, CB#7420, Chapel Hill, NC 27599-7420, USA
Tel: 919-843-5134; [email protected]
1. Introduction

The primary response variables in many medical studies pertain to the time to occurrence of a clinically important event, such as death, development or progression of a disease, or occurrence of a clinically significant morbid event such as a serious infection, stroke or major organ failure. A complexity that frequently arises in studies having time-to-event outcome measures is that a substantial fraction of the study subjects have not experienced the event of interest at the time of data analysis. These subjects who provide this incomplete information are referred to as being censored, or more precisely right-censored, since it is only known that the true time-to-event for that subject exceeds the duration of follow-up. The complexities created by the presence of censored observations have led to the development of a special field of statistical methodology. Because the analysis of clinical trials data with time-to-death outcomes provided the original motivation for this development, the field has become known as survival analysis. This article provides an overview of the key ideas and methods in survival analysis. After introducing some basic terminology, we will discuss the estimation of the survival distribution, the comparison of two survival distributions as well as regression models. We will focus on non- and semi-parametric methods, which do not impose parametric assumptions on the survival distribution.
2. Basic Concepts

Let T denote the true time-to-event or failure time for a study subject in a medical study. Primary interest usually lies in estimation and testing regarding the distribution of T. This distribution can be characterized by the survival function S(t) ≡ Pr(T > t). Because of censoring, it is more convenient to deal with the hazard function, which is the instantaneous probability of dying at time t given that the subject is alive just prior to t. If T is continuous with density function f, then the hazard function is defined by

λ(t) = lim_{∆t↓0} Pr(t ≤ T < t + ∆t | T ≥ t)/∆t = f(t)/S(t).

The function Λ(t) ≡ ∫_0^t λ(u) du is called the cumulative hazard function for T, and it is easily shown that S(t) = exp{−Λ(t)} for a continuous survival time T.

Let U denote the censoring time, that is, the time beyond which the study subject cannot be observed. Then (T, U) are referred to as latent data, while the observed data are denoted by (X, δ), where X = min(T, U), δ = I(T ≤ U), and I(·) is the indicator function. The study subjects having δ = 0 are referred to as having censored observations. While the distribution function S(t) can be consistently estimated when data are uncensored, neither λ(t) nor S(t) is identifiable or consistently estimable if one only observes (X, δ).18,78 Observing (X, δ) rather than T for all subjects only allows one to consistently estimate S#(t) ≡ exp{−∫_0^t λ#(u) du} for all t such that Pr(X > t) > 0, where

λ#(t) = lim_{∆t↓0} Pr(t ≤ T < t + ∆t | T ≥ t, U ≥ t)/∆t.    (1)

We refer to λ#(t) as the crude hazard and λ(t) as the net hazard.14 In most survival analysis applications, a key assumption is made regarding the equality of the crude hazard (that is estimable) and the net hazard (that is of interest), i.e.

λ#(t) = λ(t) for all t such that Pr(X > t) > 0.    (2)

A sufficient condition for the validity of assumption (2) is the independence of T and U.

3. Estimation of the Survival Distribution

A fundamental problem in survival analysis is the estimation of the hazard function λ(t) and the survival function S(t). Several parametric models are
available, and the maximum likelihood approach can be used for estimation of the parameters under the assumption of independent censoring. For example, assuming a constant hazard function, i.e. λ(t) = λ for all t > 0, one obtains the exponential distribution, where the maximum likelihood estimator for λ is the number of observed events divided by the summation of duration of follow-up over all subjects. Kalbfleisch and Prentice37 and Lawless42 provided detailed discussion of parametric methods.
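For instance, under the constant-hazard (exponential) model just described, the maximum likelihood estimate can be computed in one line; the data below are hypothetical and serve only as an illustration.

    import numpy as np

    X = np.array([2.1, 0.7, 5.3, 1.8, 4.0, 3.3])   # hypothetical follow-up times
    delta = np.array([1, 0, 1, 1, 0, 1])           # 1 = event observed, 0 = censored

    # MLE of a constant hazard: observed events divided by total follow-up time
    lambda_hat = delta.sum() / X.sum()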
Fig. 1. Kaplan–Meier estimates for the survival probabilities in the colon cancer study (follow-up time in years): solid line, observation group; dashed line, therapy group.
It is more desirable to estimate the failure time distribution without parametric modelling. Nelson60 introduced a nonparametric estimator of the cumulative hazard function Λ(t). This estimator is given by a step function, with steps occurring at times of observed events and having size D/Y, where D events occur among Y subjects at risk. Recognizing the relationship between S(t) and Λ(t) through the differential equation −{dS(t)/dt}/S(t−) = λ(t), one motivates the relationship

−{∆Ŝ(t)}/Ŝ(t−) = ∆Λ̂(t),

where one estimates Λ(t) using Nelson's estimator and then recursively solves for the estimator of S(t). The resulting estimator is that proposed by Kaplan and Meier.38 It is a step function, with value reduced by the multiplicative factor {1 − (D/Y)} at times of observed events. The asymptotic properties of the Kaplan–Meier estimator have been studied by Breslow and Crowley,9 Gill30 and Ying84 among others.

Figure 1 displays the Kaplan–Meier estimates for the survival probabilities in a randomized clinical trial, which was designed to assess whether a new therapy, levamisole plus fluorouracil, prolongs the survival time for patients with Duke's Stage C colon cancer.47 There are 315 and 304 patients in the observation and therapy groups. By the end of the study, 155 patients in the observation group and 108 patients in the therapy group had died. The Kaplan–Meier estimates given in Fig. 1 show the average survival experiences of the patients in the two groups over the entire follow-up period. Clearly, the patients on the therapy tend to have higher survival probabilities than the patients in the observation group.
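A minimal sketch of the Nelson and Kaplan–Meier estimators described above, written directly from the definitions (it assumes right-censored data and is not meant to replace standard software):

    import numpy as np

    def nelson_and_kaplan_meier(X, delta):
        """Step-function estimates of the cumulative hazard (Nelson) and the
        survival function (Kaplan-Meier) at the distinct observed event times."""
        X = np.asarray(X, dtype=float)
        delta = np.asarray(delta, dtype=int)
        times = np.unique(X[delta == 1])
        cum_haz, surv = 0.0, 1.0
        Lambda_hat, S_hat = [], []
        for t in times:
            Y = np.sum(X >= t)                    # number at risk just before t
            D = np.sum((X == t) & (delta == 1))   # number of events at t
            cum_haz += D / Y                      # Nelson increment D/Y
            surv *= 1.0 - D / Y                   # Kaplan-Meier factor 1 - D/Y
            Lambda_hat.append(cum_haz)
            S_hat.append(surv)
        return times, np.array(Lambda_hat), np.array(S_hat)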
4. Counting Process Theory

It is difficult to study the properties of the statistics used in survival analysis, such as the Nelson and Kaplan–Meier estimators, by using standard statistical techniques because they are not sums of independent random variables. Aalen1 introduced an elegant martingale-based approach to survival analysis, where statistical methods can be cast within a unifying counting process framework. This approach uses an integral representation for censored data statistics that provides a simple unified form for estimators, test statistics and regression methods. These martingale methods allow one to obtain simple expressions for moments of complicated statistics and asymptotic distributions for test statistics and estimators, and to examine the operating characteristics of censored data regression methods. Detailed presentation of this approach has been provided in textbooks by Fleming and Harrington27 and Andersen et al.4

In the counting process approach for analyzing data on time-to-a-single-event, the data for the ith subject, (X_i, δ_i), are represented as {N_i(t), Y_i(t)} (t > 0), where

N_i(t) = I(X_i ≤ t, δ_i = 1)   and   Y_i(t) = I(X_i ≥ t).    (3)

The right-continuous process N(t) is referred to as the counting process, since it essentially counts the number of events observed up to and including time t, while the left-continuous process Y(t) is referred to as the at-risk process, indicating whether the subject is at risk at time t.

A simple yet important illustration of the counting process approach is provided by examining the properties of the Nelson estimator Λ̂(t) of Λ(t). The hazard integrated over the region in which one has data is

Λ*(t) ≡ ∫_0^t I{Ȳ(u) > 0} λ(u) du,

where Ȳ(t) = Σ_{i=1}^n Y_i(t), and n is the sample size. One can write

Λ̂(t) − Λ*(t) = Σ_{i=1}^n ∫_0^t H_i(u) dM_i(u),    (4)

where H_i(t) ≡ I{Ȳ(t) > 0}/Ȳ(t) is a left-continuous process, and

M_i(t) ≡ N_i(t) − ∫_0^t Y_i(u) λ(u) du    (5)

is the subject-specific martingale. The martingale M_i in Eq. (5) represents the difference over the interval (0, t] between the observed number and the model-predicted number of events for the ith subject. The left-continuity of the process H_i and the martingale property for M_i render the entire expression in Eq. (4) a martingale transform. This structure directly yields moments and large-sample properties. For example, since the martingale M_i has expectation zero, it follows that the Nelson estimator Λ̂(t) has expectation ∫_0^t Pr{Ȳ(u) > 0} λ(u) du. This martingale-based approach enables an elegant development of the small- and large-sample properties of the Nelson and Kaplan–Meier estimators, as shown by Gill.29

5. Two-Sample Statistics
The primary objective of many clinical trials is to provide a reliable comparison of the efficacy and safety of two treatments, where efficacy often
is assessed in terms of a time-to-event outcome measure. Similarly, many epidemiologic studies are concerned with the comparisons of exposed and unexposed groups in the time to disease occurrence. A variety of parametric and nonparametric two-sample statistics have been proposed to compare two survival distributions based on censored data. Parametric methods are described by Kalbfleisch and Prentice37 and Lawless.42 The most popular nonparametric two-sample statistic is the so-called logrank statistic, which was originally proposed by Mantel.55 The subjects at risk at the time of an event are classified into a 2 × 2 table, according to event status (yes versus no) and group membership. The numerator of the logrank statistic is obtained by computing the observed and the expected (conditioning on the margins of the 2 × 2 table) events in the first group, and by summing the differences of these over all distinct event times. Within each 2 × 2 table, the variance of the number of events in the first group is obtained using the hypergeometric distribution. These are then summed over all distinct event times to provide the variance estimator for the logrank statistic. For the colon cancer study mentioned in Sec. 3, the observed chi-squared value of the logrank statistic is 11.2, providing strong evidence for the benefit of the therapy. The logrank statistic has been extended to a broad class of weighted logrank statistics. Any member of this class can be written as a weighted sum of the “observed minus expected” events in Mantel’s 2 × 2 tables. These statistics can be formulated as in Eq. (4). Using this structure, Gill29 derived small- and large-sample properties for statistics of this wide class. He developed criteria for consistency of these tests against ordered hazards and stochastic ordering alternatives. He also provided asymptotic distribution results, not only under the null hypothesis of equality of survival distributions, but also under contiguous alternatives, allowing him to provide a characterization of the alternatives against which the tests are efficient. Among these results was a proof that the logrank statistic provides an efficient test under proportional hazards alternatives.
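The logrank construction just described translates directly into code. The following is a rough sketch for two groups (coded 0/1); it is not the implementation used for the colon cancer analysis, only an illustration of the 2 × 2-table computation.

    import numpy as np

    def logrank_chisq(X, delta, group):
        """Unweighted logrank chi-squared statistic built from the 2 x 2 tables
        formed at each distinct event time (group is coded 0/1)."""
        X, delta, group = map(np.asarray, (X, delta, group))
        o_minus_e, var = 0.0, 0.0
        for t in np.unique(X[delta == 1]):
            at_risk = X >= t
            n = at_risk.sum()                           # total number at risk
            n1 = (at_risk & (group == 1)).sum()         # at risk in group 1
            D = ((X == t) & (delta == 1)).sum()         # total events at t
            d1 = ((X == t) & (delta == 1) & (group == 1)).sum()
            o_minus_e += d1 - D * n1 / n                # observed minus expected
            if n > 1:                                   # hypergeometric variance
                var += D * (n1 / n) * (1 - n1 / n) * (n - D) / (n - 1)
        return o_minus_e ** 2 / var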
6. Regression Models

In medical studies designed to assess the effect of a treatment or exposure on a time-to-event outcome, it is important to be able to explore or adjust for the effect of an array of other covariates that may be associated with that outcome. Hence, the information collected on each study subject (X, δ) is expanded to be (X, δ, Z), where Z represents a p-vector of covariates.
The covariates can be treatment indicators or exposure levels; demographic variables, such as age, gender or race; laboratory measurements, such as levels of bilirubin, blood pressure or viral load; histologic assessments based on biopsy; or other descriptive measurements such as time from diagnosis of disease, type of disease, prior therapeutic exposures, or functional status of the subject. In regression models, these covariates can take a variety of functional forms, being dichotomous, ordered or continuous. The continuous variables may be transformations of original measures, such as the logarithm of bilirubin.

The linear regression model for survival time data takes the form

log T = β′Z + ǫ,    (6)

where β is a set of unknown regression parameters, and ǫ is an error variable independent of Z. The logarithmic transformation is employed because T is positive; other appropriate transformations of T may also be selected. Exponentiation of Eq. (6) yields T = exp(β′Z)T_0, where T_0 = exp(ǫ). This expression shows that the role of Z is to accelerate (or decelerate) the time to failure. Thus, Eq. (6) is referred to as the accelerated failure time model.

Because of censoring, it is more convenient to model the survival data through the hazard function. Let λ(t|Z) denote the hazard function associated with Z, i.e.

λ(t|Z) = lim_{∆t↓0} Pr(t ≤ T < t + ∆t | T ≥ t, Z)/∆t.

The proportional hazards model specifies that

λ(t|Z) = λ_0(t) exp(β′Z),    (7)

where λ_0(t) is the so-called baseline hazard function, i.e. the hazard function under Z = 0, and β is a set of unknown regression parameters. Under this model, the covariates have multiplicative effects on the hazard function, and the regression parameters are interpreted as the logarithms of the hazard ratios or relative risks. Equation (6) can be rewritten as

λ(t|Z) = λ_0(t exp(−β′Z)) exp(−β′Z),    (8)

where λ_0(t) is the hazard function of T_0. A comparison of Eq. (8) with Eq. (7) reveals that the only overlap in the accelerated failure time and proportional hazards models arises when λ_0(t) is Weibull.37

In the regression setting, the independent censoring assumption given by Eq. (2) is extended so that, conditional on Z, the crude and net hazard
functions are equal. Survival models, such as Eqs. (6) and (7), are referred to as parametric models if the distributional form of the failure time, i.e. λ_0(t), is specified, and as semiparametric models otherwise. Analysis of parametric survival models has been discussed by Kalbfleisch and Prentice,37 Lawless,42 Cox and Oakes,21 and Andersen et al.4 Due to the complex nature of human diseases, it is difficult to specify the parametric form. Thus, semiparametric models are preferable to parametric models in most medical applications.

7. Cox Proportional Hazards Model

Cox19,20 introduced an ingenious semiparametric approach to inference based on the proportional hazards model. These methodologic results are among the developments in the field of survival analysis that have had the most profound impact on medical research. By fitting the proportional hazards model in Eq. (7) with an unspecified baseline hazard function λ_0(t), Cox obtained a robust approach for studying the influence of covariates on outcome. However, with an infinite-dimensional nuisance function λ_0(t), modifications to the classical likelihood approach would be needed. Thus, Cox20 introduced the partial likelihood, which is based on the data that do not carry information about λ_0(t). Specifically, one discards the times of observed events, and the number of events at those times. Assuming that censoring is independent and is uninformative for β (see Definition 4.3.1 of Fleming and Harrington27), one also discards the censoring times and the identity of subjects associated with the censored times. The partial likelihood is then based on, for all event times, the identity of the subject(s) failing at each event time, given the number failing and the identity of the subjects at risk at that time. It takes the form

L(β) = ∏_{i∈D} [ exp(β′Z_{(i)}) / Σ_{j∈R_i} exp(β′Z_j) ],    (9)

where D is the set of indices of observed event times, Z_{(i)} is the covariate vector for the subject failing at the ith observed event time T_i^0, and R_i is the set of subjects at risk at T_i^0. The maximum partial likelihood estimator β̂ is the value of β that maximizes L(β). Given β̂, the cumulative baseline hazard function Λ_0(t) ≡ ∫_0^t λ_0(u) du is estimated by the Breslow estimator8:

Λ̂_0(t) ≡ Σ_{i∈D; T_i^0 ≤ t} [ 1 / Σ_{j∈R_i} exp(β̂′Z_j) ].    (10)
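As an illustration only (and assuming no tied event times), the partial likelihood (9) and the Breslow estimator (10) can be coded directly; in practice one would rely on standard statistical software.

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_partial_likelihood(beta, X, delta, Z):
        """Negative log partial likelihood (9); X = follow-up times, delta = event
        indicators, Z = n x p covariate matrix; no tied event times assumed."""
        eta = Z @ beta
        loglik = sum(eta[i] - np.log(np.exp(eta[X >= X[i]]).sum())
                     for i in np.where(delta == 1)[0])
        return -loglik

    def breslow_cumhaz(t, beta_hat, X, delta, Z):
        """Breslow estimator (10) of the cumulative baseline hazard at time t."""
        eta = Z @ beta_hat
        return sum(1.0 / np.exp(eta[X >= X[i]]).sum()
                   for i in np.where((delta == 1) & (X <= t))[0])

    # Example usage with hypothetical arrays X, delta, Z:
    #   beta_hat = minimize(neg_log_partial_likelihood, np.zeros(Z.shape[1]),
    #                       args=(X, delta, Z)).x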
Table 1. Regression analysis of the Mayo primary biliary cirrhosis data.

                       Proportional Hazards                 Accelerated Failure Time
Parameter          Est      SE    95% conf int           Est      SE    95% conf int
Age              0.039   0.008   (0.024, 0.054)       −0.026   0.004   (−0.035, −0.018)
log(Albumin)    −2.533   0.648   (−3.803, −1.263)      1.656   0.368   (0.934, 2.378)
log(Bilirubin)   0.871   0.083   (0.709, 1.033)       −0.585   0.046   (−0.674, −0.496)
Oedema           0.859   0.271   (0.328, 1.391)       −0.734   0.178   (−1.083, −0.385)
log(Protime)     2.380   0.767   (0.877, 3.883)       −1.944   0.462   (−2.850, −1.038)
Cox19,20 conjectured that L(β) shares the asymptotic properties of a full likelihood. This conjecture was confirmed by a number of authors. The first published proof was provided by Tsiatis.78 Andersen and Gill5 provided an elegant asymptotic theory for β̂ and Λ̂_0(t) by observing that the partial likelihood score function can be formulated as a martingale transform, of the form given in Eq. (4).

The left panel of Table 1 summarizes the results of the Cox regression analysis for the Mayo primary biliary cirrhosis data.27 The database contains 418 patients who were referred to the Mayo Clinic. As of the date of data listings, 161 patients had died. The Cox regression analysis not only quantifies the effects of the five covariates on the risk of death but also allows one to estimate the survival probabilities for patients associated with specific covariate values.48

8. Multiplicative Intensity Model

In many medical studies, the outcome of primary interest extends beyond the time of the first event to exploration of the rate of recurrent events over time. These recurrent events, for example, may be repeated otitis media infections in an infant, or repeated hospitalizations in an adult with a serious disease. To analyze such data, Aalen2 introduced the multiplicative intensity model as a generalization of the proportional hazards model. In this model, the subject-specific martingale is

M(t) = N(t) − ∫_0^t Y(u) λ_0(u) exp(β′Z(u)) du,    (11)
where N and Y are of more general forms than given in Eq. (3). Specifically, the counting process N (t) still reflects the number of events that have occurred by time t, but now has range over all non-negative integers. The at-risk process Y (t) can be any left-continuous process indicating, by 1
versus 0, whether or not the subject is at risk at time t. In addition, the covariate vector is allowed to be a stochastic process. In the semiparametric setting where λ_0(t) in Eq. (11) is unspecified, one can use the partial likelihood principle to make inference about β and the Breslow estimator to estimate Λ_0(t), although now the set D in Eqs. (9) and (10) may involve multiple event times from the same subject. The corresponding large-sample theory was again provided by Andersen and Gill.5

9. Regression Model Diagnostics

Extensive development of residuals has provided a wide variety of model diagnostics that are useful for the Cox proportional hazards model as well as for the broader multiplicative intensity model. For simplicity, we consider the subject-specific martingale in Eq. (5) for the special case of the Cox model given by Eq. (7). The corresponding martingale residual is

M̂_i(t) ≡ N_i(t) − Λ̂_0(t ∧ X_i) exp(β̂′Z_i),

where a ∧ b = min(a, b). This residual, introduced by Barlow and Prentice6 and explored by Therneau et al.,76 can be interpreted as the "observed" minus "estimated model predicted" events for subject i over the interval (0, t]. As t → ∞, the martingale residual reduces to

M̂_i ≡ δ_i − Λ̂_0(X_i) exp(β̂′Z_i).

These residuals, symmetrized using the deviance transformation,56 can be used to detect outliers. The partial residuals, defined by

M̂_i / {Λ̂_0(X_i) exp(β̂′Z_i)} + β̂_j Z_{ij},    i = 1, . . . , n; j = 1, . . . , p,

where Z_{ij} and β̂_j are the jth components of Z_i and β̂, can be used to suggest the proper functional form for covariates in the model.

A class of martingale-transform residuals can be obtained by replacing M_i(u) with M̂_i(u) for each i in Eq. (4). Important members of this class are the p "score residuals" for each subject. These residuals are defined by

L_{ij} ≡ ∫_0^∞ H_{ij}(t) dM̂_i(t),    i = 1, . . . , n; j = 1, . . . , p,

where H_{ij}(t) is chosen such that Σ_i L_{ij} reduces to the jth component of the partial likelihood score statistic. These p score residuals can be used to assess the influence of each subject on the parameter estimates β̂_j (j = 1, . . . , p).
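For instance, the martingale residuals M̂_i can be computed directly from the fitted quantities; the sketch below assumes a function breslow_cumhaz(t) returning Λ̂_0(t), such as the Breslow estimator of Sec. 7.

    import numpy as np

    def martingale_residuals(X, delta, Z, beta_hat, breslow_cumhaz):
        """Residuals M_hat_i = delta_i - Lambda0_hat(X_i) * exp(beta_hat' Z_i)."""
        eta = Z @ beta_hat
        Lambda0 = np.array([breslow_cumhaz(t) for t in X])
        return delta - Lambda0 * np.exp(eta)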
They are also related to a class of residuals, proposed by Schoenfeld,74 that are useful for detecting departures from the proportional hazards assumption.

Lin et al.52 studied the cumulative sums of martingale-based residuals over covariates or event times. The distributions of these stochastic processes under the assumed model can be approximated by zero-mean Gaussian processes. Each observed process can then be compared, both graphically and numerically, with a number of realizations from the approximate null distribution by computer simulation. These comparisons enable one to determine objectively whether a seemingly abnormal residual pattern reflects model misspecification or natural random variation. This methodology can be used to assess the functional forms of covariates, the proportional hazards assumption, as well as the overall fit of the model.

10. Alternatives to the Cox Model

Despite the great popularity and versatility of the Cox regression model, there are reasons to explore alternative models. First, the proportional hazards assumption may not be satisfied in some applications. Second, alternative models characterize different aspects of the associations between covariates and survival time. In this section, we describe briefly some alternative semiparametric models.

In contrast to the proportional hazards model, the additive hazards model specifies that covariates have additive rather than multiplicative effects on the hazard function, i.e.

λ(t|Z) = λ_0(t) + β′Z(t).    (12)

This model was discussed by Cox and Oakes,21 Thomas77 and Breslow and Day.10 Using the counting-process martingale approach, Lin and Ying54 obtained closed-form estimators for the regression parameters β and the cumulative baseline hazard function Λ_0(t).

Semiparametric transformation models take the form

h(T) = β′Z + ǫ,    (13)
where ǫ is a random error with a given distribution function F , and h is a completely unspecified function. If F is the extreme value distribution, then Eq. (13) is the proportional hazards model. If F is the standard logistic function, then Eq. (13) is the proportional odds model, under which the hazard ratio approaches unity as time increases. This class of models was studied by Clayton and Cuzick17 and Dabrowska and Doksum,24
and the proportional odds model was studied by Pettitt,68 Bennett7 and Murphy et al.59 A significant breakthrough was made by Cheng et al.,13 who provided simple and relatively efficient estimators of β for all members of Eq. (13).

The semiparametric accelerated failure time model takes the same form as Eq. (13), but with h specified, usually as h(T) = log T, and ǫ unspecified. Various methods of estimation for this model were proposed around 1980. Specifically, Koul et al.39 suggested including in the least-squares estimator only the uncensored survival times, but weighting them by the inverse probabilities of being uncensored. The resulting estimator is highly inefficient, especially in the presence of heavy censoring. However, the underlying idea of weighting uncensored observations by their inverse probabilities of being uncensored, to be referred to as the inverse probability of censoring weighting (IPCW) technique, turns out to be extremely useful in many other contexts. In fact, the Cheng et al.13 estimators were based on this idea. A more efficient modification of the least-squares estimator was provided by Buckley and James,11 which replaces the conditional expectations for the censored survival times by their estimates based on the Kaplan–Meier estimator of the residual lifetime distribution and which involves an iterative estimation scheme analogous to the EM algorithm.25 Prentice,69 on the other hand, showed how to adapt the rank estimation method for noncensored data to the censored data setting. The asymptotic properties of the Buckley–James and rank estimators were established in the early 1990s by Tsiatis,80 Ritov,72 Wei et al.,83 Lai and Ying40,41 and Ying.85

Despite the theoretical advances, semiparametric methods for the accelerated failure time model have not been widely used in medical applications due to the lack of simple and reliable numerical algorithms. Recently, Jin et al.36 provided a practical method for implementing the rank estimators. Using their method, we obtain the results for the accelerated failure time regression of the primary biliary cirrhosis data shown in the right panel of Table 1. These results are based on the log-rank estimating function. Although the conclusions are not qualitatively different, the analysis under the accelerated failure time model provides an alternative and more direct interpretation of the effects of the covariates on the survival time, as compared to the Cox regression analysis.
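To illustrate the IPCW idea in its simplest form, the sketch below weights the uncensored observations of log T by the inverse of a Kaplan–Meier estimate of the censoring distribution and then performs weighted least squares. This is only a caricature of the inverse-weighting proposal, not the rank-based estimator of Jin et al. used for Table 1.

    import numpy as np

    def ipcw_least_squares(X, delta, Z):
        """Weighted least squares of log(T) on Z over the uncensored subjects,
        with weights 1/G_hat(X_i-), where G_hat is the Kaplan-Meier estimate of
        the censoring survival function (censoring treated as the 'event')."""
        X, delta, Z = map(np.asarray, (X, delta, Z))
        def G_hat(t):                       # estimated Pr(censoring time >= t)
            s = 1.0
            for u in np.unique(X[delta == 0]):
                if u < t:
                    s *= 1.0 - np.sum((X == u) & (delta == 0)) / np.sum(X >= u)
            return s
        keep = delta == 1
        w = np.array([1.0 / G_hat(t) for t in X[keep]])
        D = np.column_stack([np.ones(keep.sum()), Z[keep]])   # design with intercept
        y = np.log(X[keep])
        A = D.T @ (w[:, None] * D)
        b = D.T @ (w * y)
        return np.linalg.solve(A, b)        # intercept and slope estimates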
11. Multivariate Failure Time Data

Under the multiplicative intensity model described in Sec. 8, the risk of a recurrent event for a subject is unaffected by earlier events that occurred to the subject unless time-dependent covariates that capture such dependence are included explicitly in the model. In medical applications, the dependence structures are complex and the forms of time-dependent covariates are unknown. Furthermore, the inclusion of such time-dependent covariates which are part of the response results in biased estimation of the overall treatment effect in a randomized clinical trial. Thus, it would be desirable to model the marginal distribution of the recurrent event times while leaving the dependence structures unspecified. It is particularly appealing to consider the cumulative mean function µ(t) ≡ E{N*(t)}, where N*(t) is the number of events that the subject has actually experienced by time t (in the absence of censoring). This function was first considered by Nelson61 and further studied by Lawless and Nadeau.43 A number of authors44,51,65 studied the following regression model for the cumulative mean function:

E{N*(t)|Z} = µ_0(t) exp(β′Z),    (14)
where µ_0(t) is an arbitrary baseline mean function, and β is a set of regression parameters. If N*(t) is a (non-homogeneous) Poisson process, then Eq. (14) is equivalent to the intensity model determined by Eq. (11). Although in general N* is not a Poisson process, the maximum partial likelihood estimator for β of Eq. (11) remains consistent and asymptotically normal under Eq. (14). The covariance matrix, however, can no longer be estimated by the inverse of the information matrix. A sandwich variance estimator has to be used instead. In the one-sample case, µ(t) can be consistently estimated by the Nelson estimator. Under model (14), the baseline mean function µ_0(t) can be consistently estimated by the Breslow estimator, and the covariate-specific mean function can be estimated in a similar fashion.51 It is particularly informative to display the estimated mean functions for different treatment arms and for specific covariate patterns.

In some medical studies, each subject can potentially experience more than one type of event. Examples include the developments of physical symptoms or diseases in several organ systems (e.g. stroke and cancer) or in several members of the same organ system (e.g. eyes or teeth). Models such as Eqs. (11) and (14) are not applicable since the multiple events on the same subject are of different natures and in fact may not even be ordered. It is convenient to formulate the marginal distributions of the multiple event times through the proportional hazards models while leaving the dependence structures completely unspecified. Let K denote the number
of potential events per subject. The hazard function for the kth event of the ith subject is postulated to take the form

λ(t|Z_{ki}) = λ_{k0}(t) exp(β′Z_{ki}(t)),    k = 1, . . . , K; i = 1, . . . , n,    (15)
where Z_{ki} is the covariate vector for the ith subject with respect to the kth event, λ_{k0} (k = 1, . . . , K) are arbitrary baseline hazard functions, and β is a set of regression parameters. In some applications (e.g. an ophthalmologic study involving the left and right eyes), it is natural to impose the restriction that λ_{10} = · · · = λ_{K0}, whereas in others (e.g. the setting of multiple diseases) it is necessary to allow the λ_{k0}'s to be different. If the event times were independent, then the partial likelihood could be easily constructed for β of model (15). The resulting estimator turns out to be consistent and asymptotically normal even if the event times are correlated. However, a sandwich variance estimator is again needed to account for the intra-class dependence. This approach was pioneered by Wei et al.82 and further developed by Lee et al.,45 Liang et al.46 and Cai and Prentice12 among others.

The marginal approach discussed above treats the dependence of related event times as a nuisance. An alternative approach is to explicitly formulate the nature of the dependence through the so-called frailty. The term frailty was first introduced by Vaupel et al.81 to illustrate the consequences of a lifetime being generated from several sources of variation. The use of frailty in bivariate survival time data was considered by Clayton.15 Frailty models were studied extensively in the 1980s by Clayton and Cuzick,16 Hougaard34 and Oakes63 among others. The frailty-model analog of Eq. (15) specifies that the hazard function for the kth event of the ith subject, given the frailty ν_i, takes the form

λ_{ki}(t|Z_{ki}; ν_i) = ν_i λ_{k0}(t) exp(β′Z_{ki}(t)),    (16)
where νi (i = 1, . . . , n) are independent random variables. Conditional on νi , the event times on the ith subject are assumed to be independent. The parameter vector β has a population-average interpretation under model (15) and a subject-specific interpretation under model (16). Models (15) and (16) cannot hold simultaneously unless ν is a positive-stable variable. It is very challenging, both theoretically and computationally, to deal with frailty models such as (16). Major progress was made in the 1990s. In the special case of gamma frailty models, maximum likelihood estimation via the EM algorithm was studied by Nielsen et al.,62 Murphy,57,58 Andersen et al.3 and Parner64 among others.
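To convey how model (16) induces dependence within a subject, the following sketch simulates correlated event times from a gamma frailty model; a constant baseline hazard lam0 is assumed purely for this illustration, and the frailties have mean 1 and variance theta.

    import numpy as np

    def simulate_gamma_frailty(n, K, beta, Z, theta, lam0=1.0, seed=0):
        """Simulate K event times per subject from model (16): given the frailty
        nu_i, the k-th event time has hazard nu_i * lam0 * exp(beta' Z_ki)."""
        rng = np.random.default_rng(seed)
        nu = rng.gamma(shape=1.0 / theta, scale=theta, size=n)  # mean 1, var theta
        T = np.empty((n, K))
        for i in range(n):
            for k in range(K):
                rate = nu[i] * lam0 * np.exp(Z[i, k] @ beta)
                T[i, k] = rng.exponential(1.0 / rate)
        return T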
Nonparametric estimation for the multivariate survival function is a fundamental problem in the analysis of multivariate failure time data. Using the IPCW technique, Lin and Ying53 developed a simple estimator for the special case where there is a common censoring time for all event times of the same subject. Estimation in the general setting has been studied by Dabrowska,23 Prentice and Cai,70 and van der Laan81 among others.

The occurrence of one event (e.g. death) may preclude the development of another (e.g. relapse of cancer). In some applications, such as cause-specific mortality studies, the subject can only experience one of several potential events. This type of data is referred to as competing risks. The simplest solution to this problem is to censor the event time of interest at the time of the competing events, and then apply the standard survival analysis methods such as the logrank test and Cox regression. The results pertain to the so-called cause-specific hazard function, which is given by Eq. (1) with U representing the time to the competing events. An important limitation of the cause-specific hazard function is that the associated S#(t) is not a survival function unless the cause of interest is independent of other risks and the other risks could be eliminated without altering the distribution of the cause of interest. Thus, in general the Kaplan–Meier estimator does not pertain to the survival function or disease incidence. Special methods have been developed to estimate disease incidence functions.26,31,66
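One such method, sketched below, is the standard nonparametric cumulative incidence estimator, which combines the cause-specific event counts with the all-cause Kaplan–Meier survival; this is only an illustration and is not tied to any particular reference above.

    import numpy as np

    def cumulative_incidence(X, delta, cause, k, t):
        """Nonparametric cumulative incidence of cause k by time t under competing
        risks; delta = 1 if any event observed, cause = type of the observed event."""
        X, delta, cause = map(np.asarray, (X, delta, cause))
        F_k, surv = 0.0, 1.0            # surv = all-cause survival just before u
        for u in np.unique(X[delta == 1]):
            if u > t:
                break
            Y = np.sum(X >= u)
            d_all = np.sum((X == u) & (delta == 1))
            d_k = np.sum((X == u) & (delta == 1) & (cause == k))
            F_k += surv * d_k / Y       # increment of the cause-k incidence
            surv *= 1.0 - d_all / Y     # update the all-cause Kaplan-Meier
        return F_k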
12. Concluding Remarks

We have reviewed many areas of survival analysis in the previous sections. All these methods require the assumption of independent censoring. As discussed in Sec. 2, the survival distribution is not identifiable in the presence of dependent censoring. If one is willing to model certain aspects of the dependent censoring mechanism, then it is possible to make inference about the survival distribution under dependent censoring.50,73

When the event of interest is asymptomatic, as is the case with cancer progression or HIV infection, the event time cannot be measured exactly, but is rather known to lie in an interval determined by two successive examinations. The resulting data are said to be interval censored. Non- and semi-parametric analysis of such data has been studied by Groeneboom and Wellner,32 Huang,35 Lin et al.49 and Rabinowitz et al.71 among others.

The applications of survival analysis methods to medical studies have been greatly facilitated by the developments of software packages. Standard
methods such as the Kaplan–Meier estimator, weighted logrank tests, Cox regression and parametric regression with (univariate) right-censored data are now available in virtually all software packages. The multiplicative intensity model and the sandwich variance estimators for models (14) and (15) have been implemented in major packages, such as SAS, S-Plus and STATA. However, most of the newer methods, such as those for the semiparametric analysis of models (6), (12) and (13), and those mentioned in this section, are not available in software packages.

Further developments are anticipated in many areas of survival analysis. For example, the Cheng et al.13 estimators for Eq. (13) require modelling the censoring distribution, and it would be worthwhile to explore estimation procedures which do not involve such modelling. In the area of multivariate failure time data, efficient estimators for model (15) have yet to be identified, and further theoretical and numerical advances are warranted for model (16). Considerable activities are also expected in the areas of dependent censoring, interval censored data, causal inference, and joint modelling of longitudinal and failure time data. Finally, further expansion of software is anticipated.

References

1. Aalen, O. O. (1975). Statistical Inference for a Family of Counting Processes. PhD Dissertation, University of California, Berkeley.
2. Aalen, O. O. (1978). Nonparametric inference for a family of counting processes. The Annals of Statistics 6: 701–726.
3. Andersen, P. K., Klein, J. P., Knudsen, K. M. and Palacios, R. T. (1997). Estimation of variance in Cox's regression model with shared gamma frailties. Biometrics 53: 1475–1484.
4. Andersen, P. K., Borgan, Ø., Gill, R. D. and Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer-Verlag, London.
5. Andersen, P. K. and Gill, R. D. (1982). Cox's regression model for counting processes: A large sample study. The Annals of Statistics 10: 1100–1120.
6. Barlow, W. E. and Prentice, R. L. (1988). Residuals for relative risk regression. Biometrika 75: 65–74.
7. Bennett, S. (1983). Analysis of survival data by the proportional odds model. Statistics in Medicine 2: 273–277.
8. Breslow, N. E. (1972). Contribution to the discussion on the paper by D. R. Cox cited below. Journal of the Royal Statistical Society B34: 216–217.
9. Breslow, N. E. and Crowley, J. J. (1974). A large sample study of the life table and product limit estimates under random censorship. The Annals of Statistics 2: 437–453.
10. Breslow, N. E. and Day, N. E. (1987). Statistical Methods in Cancer Research. The Design and Analysis of Cohort Studies 2, IARC, Lyon.
11. Buckley, J. and James, I. (1979). Linear regression with censored data. Biometrika 66: 429–436.
12. Cai, J. and Prentice, R. L. (1995). Estimating equations for hazard ratio parameters based on correlated failure time data. Biometrika 82: 151–164.
13. Cheng, S. C., Wei, L. J. and Ying, Z. (1995). Analysis of transformation models with censored data. Biometrika 82: 835–845.
14. Chiang, C. L. (1968). Introduction to Stochastic Processes in Biostatistics. John Wiley, New York.
15. Clayton, D. G. (1978). A model for association in bivariate life tables and its application in epidemiological studies of familial tendency in chronic disease incidence. Biometrika 65: 141–151.
16. Clayton, D. G. and Cuzick, J. (1985). Multivariate generalizations of the proportional hazards model (with discussion). Journal of the Royal Statistical Society A148: 82–117.
17. Clayton, D. G. and Cuzick, J. (1986). The semiparametric Pareto model for regression analysis of survival times. In Proceedings of the International Statistical Institute, International Statistical Institute, Amsterdam, 23.3-1-18.
18. Cox, D. R. (1959). The analysis of exponentially distributed life-times with two types of failure. Journal of the Royal Statistical Society B21: 411–421.
19. Cox, D. R. (1972). Regression models and life tables (with discussion). Journal of the Royal Statistical Society B34: 187–220.
20. Cox, D. R. (1975). Partial likelihood. Biometrika 62: 269–276.
21. Cox, D. R. and Oakes, D. (1984). Analysis of Survival Data. Chapman and Hall, London.
22. Cuzick, J. (1985). Asymptotic properties of censored linear rank tests. The Annals of Statistics 13: 133–141.
23. Dabrowska, D. M. (1988). Kaplan–Meier estimate on the plane. The Annals of Statistics 16: 1475–1489.
24. Dabrowska, D. M. and Doksum, K. A. (1988). Partial likelihood in transformation models with censored data. Scandinavian Journal of Statistics 15: 1–23.
25. Dempster, A. P., Laird, N. M. and Rubin, D. (1977). Maximum likelihood estimation for incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society B39: 1–38.
26. Fine, J. P. and Gray, R. J. (1999). A proportional hazards model for the subdistribution of a competing risk. Journal of the American Statistical Association 94: 496–509.
27. Fleming, T. R. and Harrington, D. (1991). Counting Processes and Survival Analysis. Wiley, New York.
28. Gehan, E. A. (1965). A generalized Wilcoxon test for comparing arbitrarily single-censored samples. Biometrika 52: 203–223.
29. Gill, R. D. (1980). Censoring and Stochastic Integrals. Mathematical Centre Tracts 124, Mathematisch Centrum, Amsterdam.
30. Gill, R. D. (1983). Large sample behavior of the product-limit estimator on the whole line. The Annals of Statistics 11: 49–58.
31. Gray, R. J. (1988). A class of K-sample tests for comparing the cumulative incidence of a competing risk. The Annals of Statistics 16: 1141–1154.
32. Groeneboom, P. and Wellner, J. A. (1992). Information Bounds and Nonparametric Maximum Likelihood Estimation. Birkhäuser, Basel.
33. Harrington, D. P. and Fleming, T. R. (1982). A class of rank test procedures for censored survival data. Biometrika 69: 553–566.
34. Hougaard, P. (1987). Modelling multivariate survival. Scandinavian Journal of Statistics 14: 291–304.
35. Huang, J. (1996). Efficient estimation for the proportional hazards model with interval censoring. The Annals of Statistics 24: 540–568.
36. Jin, Z., Lin, D. Y., Wei, L. J. and Ying, Z. (2003). Rank-based inference for the semiparametric accelerated failure time model. Biometrika, in press.
37. Kalbfleisch, J. D. and Prentice, R. L. (1980). The Statistical Analysis of Failure Time Data. Wiley, New York.
38. Kaplan, E. L. and Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association 53: 457–481.
39. Koul, H., Susarla, V. and Van Ryzin, J. (1981). Regression analysis with randomly right censored data. The Annals of Statistics 9: 1276–1288.
40. Lai, T. L. and Ying, Z. (1991a). Rank regression methods for left-truncated and right-censored data. The Annals of Statistics 19: 531–556.
41. Lai, T. L. and Ying, Z. (1991b). Large sample theory of a modified Buckley–James estimator for regression analysis with censored data. The Annals of Statistics 19: 1370–1402.
42. Lawless, J. F. (1982). Statistical Models and Methods for Lifetime Data. Wiley, New York.
43. Lawless, J. F. and Nadeau, C. (1995). Some simple robust methods for the analysis of recurrent events. Technometrics 37: 158–168.
44. Lawless, J. F., Nadeau, C. and Cook, R. J. (1997). Analysis of mean and rate functions for recurrent events. In Proceedings of the First Seattle Symposium in Biostatistics: Survival Analysis, eds. D. Y. Lin and T. R. Fleming, Springer-Verlag, New York, 37–49.
45. Lee, E. W., Wei, L. J. and Amato, D. A. (1992). Cox-type regression analysis for large numbers of small groups of correlated failure time observations. In Survival Analysis: State of the Art, eds. J. P. Klein and P. K. Goel, Kluwer Academic Publishers, Dordrecht, 237–247.
46. Liang, K. Y., Self, S. G. and Chang, Y. C. (1993). Modelling marginal hazards in multivariate failure time data. Journal of the Royal Statistical Society B55: 441–453.
47. Lin, D. Y. (1994). Cox regression analysis of multivariate failure time data: The marginal approach. Statistics in Medicine 13: 2233–2247.
48. Lin, D. Y., Fleming, T. R. and Wei, L. J. (1994). Confidence bands for survival curves under the proportional hazards model. Biometrika 81: 73–81.
49. Lin, D. Y., Oakes, D. and Ying, Z. (1998). Additive hazards regression with current status data. Biometrika 85: 289–298.
50. Lin, D. Y., Robins, J. M. and Wei, L. J. (1996). Comparing two failure time distributions in the presence of dependent censoring. Biometrika 83: 381–393.
51. Lin, D. Y., Wei, L. J., Yang, I. and Ying, Z. (2000). Semiparametric regression for the mean and rate functions of recurrent events. Journal of the Royal Statistical Society B62: 711–730.
52. Lin, D. Y., Wei, L. J. and Ying, Z. (1993). Checking the Cox model with cumulative sums of martingale-based residuals. Biometrika 80: 557–572.
53. Lin, D. Y. and Ying, Z. (1993). A simple nonparametric estimator of the bivariate survival function under univariate censoring. Biometrika 80: 573–581.
54. Lin, D. Y. and Ying, Z. (1994). Semiparametric analysis of the additive risk model. Biometrika 81: 61–71.
55. Mantel, N. (1966). Evaluation of survival data and two new rank order statistics arising in its consideration. Cancer Chemotherapy Report 50: 163–170.
56. McCullagh, P. and Nelder, J. (1989). Generalized Linear Models, 2nd ed. Chapman and Hall, London.
57. Murphy, S. A. (1994). Consistency in a proportional hazards model incorporating a random effect. The Annals of Statistics 22: 712–731.
58. Murphy, S. A. (1995). Asymptotic theory for the frailty model. The Annals of Statistics 23: 182–198.
59. Murphy, S. A., Rossini, A. J. and van der Vaart, A. W. (1997). Maximum likelihood estimation in the proportional odds model. Journal of the American Statistical Association 92: 968–976.
60. Nelson, W. B. (1969). Hazard plotting for incomplete failure data. Journal of Quality Technology 1: 27–52.
61. Nelson, W. B. (1988). Graphical analysis of system repair data. Journal of Quality Technology 20: 24–35.
62. Nielsen, G. G., Gill, R. D., Andersen, P. K. and Sorensen, T. I. A. (1992). A counting process approach to maximum likelihood estimation in frailty models. Scandinavian Journal of Statistics 19: 25–43.
63. Oakes, D. (1989). Bivariate survival models induced by frailties. Journal of the American Statistical Association 84: 487–493.
64. Parner, E. (1998). Asymptotic theory for the correlated gamma-frailty model. The Annals of Statistics 26: 183–214.
65. Pepe, M. S. and Cai, J. (1993). Some graphical displays and marginal regression analyses for recurrent failure times and time dependent covariates. Journal of the American Statistical Association 88: 811–820.
66. Pepe, M. S. and Mori, M. (1993). Kaplan–Meier, marginal or conditional probability curves in summarizing competing risks failure time data? Statistics in Medicine 12: 737–751.
67. Peto, R. and Peto, J. (1972). Asymptotically efficient rank invariant test procedures (with discussion). Journal of the Royal Statistical Society A135: 185–206.
68. Pettitt, A. N. (1982). Inference for the linear model using a likelihood based on ranks. Journal of the Royal Statistical Society B44: 234–243.
69. Prentice, R. L. (1978). Linear rank tests with right censored data. Biometrika 65: 167–179.
70. Prentice, R. L. and Cai, J. (1992). Covariance and survival function estimation using censored multivariate failure time data. Biometrika 79: 495–512.
71. Rabinowitz, D., Tsiatis, A. and Aragon, J. (1995). Regression with interval censored data. Biometrika 82: 501–513.
72. Ritov, Y. (1990). Estimation in a linear regression model with censored data. The Annals of Statistics 18: 303–328.
73. Robins, J. M. and Rotnitzky, A. (1992). Recovery of information and adjustment for dependent censoring using surrogate markers. In AIDS Epidemiology: Methodological Issues, eds. N. P. Jewell, K. Dietz and V. T. Farewell, Birkhäuser, Boston, 297–331.
74. Schoenfeld, D. (1982). Partial residuals for the proportional hazards regression model. Biometrika 69: 239–241.
75. Tarone, R. E. and Ware, J. (1977). On distribution-free tests for equality of survival distributions. Biometrika 64: 156–160.
76. Therneau, T., Grambsch, P. and Fleming, T. R. (1990). Martingale-based residuals for survival models. Biometrika 77: 147–160.
77. Thomas, D. C. (1986). Use of auxiliary information in fitting nonproportional hazards models. In Modern Statistical Methods in Chronic Disease Epidemiology, eds. S. H. Moolgavkar and R. L. Prentice, Wiley, New York, 197–210.
78. Tsiatis, A. A. (1975). A nonidentifiability aspect of the problem of competing risks. Proceedings of the National Academy of Sciences USA 72: 20–22.
79. Tsiatis, A. A. (1981). A large sample study of Cox's regression model. The Annals of Statistics 9: 93–108.
80. Tsiatis, A. A. (1990). Estimating regression parameters using linear rank tests for censored data. The Annals of Statistics 18: 354–372.
81. van der Laan, M. J. (1996). Efficient estimation in the bivariate censoring model and repairing NPMLE. The Annals of Statistics 24: 596–627.
82. Vaupel, J. W., Manton, K. G. and Stallard, E. (1979). The impact of heterogeneity in individual frailty on the dynamics of mortality. Demography 16: 439–454.
83. Wei, L. J., Lin, D. Y. and Weissfeld, L. (1989). Regression analysis of multivariate incomplete failure time data by modeling marginal distributions. Journal of the American Statistical Association 84: 1065–1073.
84. Wei, L. J., Ying, Z. and Lin, D. Y. (1990). Linear regression analysis of censored survival data based on rank tests. Biometrika 77: 845–851.
85. Ying, Z. (1989). A note on the asymptotic properties of the product-limit estimator on the whole line. Statistics and Probability Letters 7: 311–314.
86. Ying, Z. (1993). A large sample study of rank estimation for censored regression data. The Annals of Statistics 21: 76–99.
About the Author

D. Y. Lin is the Dennis Gillings Distinguished Professor of Biostatistics at the University of North Carolina at Chapel Hill. Professor Lin has made significant contributions to medical statistics, especially in survival analysis and clinical trials. As a result, he was honored with the Mortimer Spiegelman Award from the American Public Health Association in 1999. Professor Lin is a Fellow of the American Statistical Association and a Fellow of the Institute of Mathematical Statistics. He currently serves as an associate editor for Biometrika and Statistica Sinica.
CHAPTER 23
REGRESSION MODELS FOR THE ANALYSIS OF LONGITUDINAL DATA

COLIN O. WU
Office of Biostatistics Research, DECA, National Heart, Lung and Blood Institute, II Rockledge Center, Rm 8218, 6701 Rockledge Drive, MSC 7938, Bethesda, MD 20892, USA
Tel: 301-435-0440; [email protected]

KAI F. YU
Division of Epidemiology, Statistics and Prevention Research, National Institute of Child Health and Human Development, 6100 Executive Blvd, Room 7B05, National Institutes of Health, Bethesda, MD 20892-7510, USA
1. Introduction

1.1. Structures of longitudinal data

In biomedical and epidemiological studies, interest is often focused on evaluating the effects of treatments, dosage, risk factors or other covariates on outcomes of interest, such as disease progression and change of health status of a population, over time. Because the changes of outcome and covariates within each subject usually provide important information of scientific relevance, longitudinal samples that contain repeated measurements of the chosen subjects over time are often preferable to classical cross-sectional samples. In fact, by combining the characteristics of random sampling and time series observations, the usefulness of longitudinal samples goes far beyond biomedicine and epidemiology, and their trace is often found in economics, psychology, sociology and many other fields of natural and social sciences.

For a typical framework of longitudinal data, we define $t$ to be a real-valued variable of time, $Y(t)$ a real-valued outcome and $X(t) = (X^{(0)}(t), \ldots, X^{(k)}(t))^T$, $k \geq 1$, an $\mathbb{R}^{k+1}$-valued covariate vector at time $t$. Depending on the choice of origin, the time variable $t$ is not necessarily nonnegative. As part of the general methodology, statistical analysis with regression models is often focused on modeling and determining the effects of $(t, X(t))$ on the population mean of $Y(t)$. For $n$ randomly selected subjects, each repeatedly measured over time, the longitudinal sample of $(Y(t), t, X(t))$ is denoted by $\{(Y_{ij}, t_{ij}, X_{ij}) : i = 1, \ldots, n,\ j = 1, \ldots, n_i\}$, where $t_{ij}$ is the $j$th measurement time of the $i$th subject, $Y_{ij}$ and $X_{ij} = (X_{ij}^{(0)}, \ldots, X_{ij}^{(k)})^T$ are the observed outcome and covariate vector, respectively, of the $i$th subject at $t_{ij}$, and $n_i$ is the $i$th subject's number of repeated measurements. The total number of measurements is $N = \sum_{i=1}^{n} n_i$. In contrast to the classical independent identically distributed (i.i.d.) samples, the measurements within each subject are possibly correlated, although the inter-subject measurements are independent. A longitudinal sample is said to have a balanced design if all the subjects have their measurements made at a common set of time points, i.e. $n_i = m$ for some $m \geq 1$ and all $i = 1, \ldots, n$, and $t_{1j} = \cdots = t_{nj}$ for all $j = 1, \ldots, m$. An unbalanced design arises if the design times $t_{ij}$ differ across subjects. In practice, unbalanced designs may be caused by the presence of missing values in an otherwise balanced design or by random variations of the time design points.

In general, two routes can be used to obtain biomedical observations: clinical trials and observational cohort studies. The main difference between a clinical trial and an observational cohort study is their design. In a clinical trial, the investigator has the power to determine, at least partially, the selection of the participants and the design of the trial, such as the treatment offered, the length of the trial and the time and methods of the measurement process. An observational cohort study, on the other hand, is more complicated, because the risk factors, the treatments and the measurement process depend on the participants of the study and are usually not controlled by the investigator. Consequently, it is possible to observe balanced longitudinal data from clinical trials. But, for various reasons that are beyond the investigators' control, most observational cohort studies have unbalanced longitudinal designs.

1.2. Examples of longitudinal studies

The following two epidemiological examples illustrate some typical features of longitudinal samples. Although these examples share some similarities, such as both being cohort studies with unbalanced designs, they differ in the numbers of repeated measurements and the design of their covariates.
1.2.1. Alabama small-for-gestational-age (ASGA) study

This is a prospective study of risk factors and intrauterine growth retardation involving 1475 women who had their fetal anthropometry measurements made by ultrasound repeatedly during pregnancy. All the women were scheduled to have their measurements made at approximately 17, 25, 31 and 36 weeks of gestation. Their actual visits, however, which numbered between 1 and 7 per woman, did not follow this schedule and were scattered throughout 12 to 43 weeks of gestation. This results in an unbalanced design in the sense that not all the subjects are measured at the same design points. Associated covariates that may affect fetal development include maternal behavioral factors, such as cigarette smoking, alcohol consumption and drug abuse; maternal anthropometric measurements, such as pre-pregnancy weight, height and body mass index; and placental development measured by placental thickness at different stages of gestation. Some of these covariates, such as the maternal anthropometric measurements, are time-dependent. But the others, such as maternal behavioral factors, may be either time-dependent or time-invariant, depending on how these variables are defined. The outcome variables of fetal development, which, of course, are all time-dependent, include the fetal abdominal circumference, biparietal diameter, femur length and other ultrasound measurements.

Figure 1 shows the observed fetal abdominal circumferences (in cm) at their corresponding gestational ages in weeks. To see the trend of the individual repeated measurements, line segments indicate the measurement sequences for a number of randomly selected subjects. Heuristically, we observe a linear upward trend in the growth of fetal size. But there has been no prior study which justifies the goodness of a linear growth model for this population, or any other statistical model for the covariate effects on fetal growth. A statistical analysis should then focus on two objectives: establishing an appropriate statistical model so that the effects of these covariates on the outcome of interest can be clearly interpreted, and developing estimation and inference procedures to adequately quantify the covariate effects based on the chosen statistical models.

Fig. 1. Relationship between fetal abdominal circumference (in cm) and fetal gestation (in weeks). Left panel: individual measurements. Right panel: sequences of measurements for some randomly selected subjects.

1.2.2. HIV/CD4 depletion data

The data set is from the Multicenter AIDS Cohort Study (MACS) 1984–1991, which includes 400 homosexual men who were infected by the human immunodeficiency virus (HIV) between 1984 and 1991. Because CD4 cells (T-helper lymphocytes) are vital for immune function, an important component of the study is to evaluate the effects of risk factors, such as
cigarette smoking and drug use, and health status, such as CD4 cell levels before the infection, on the post-infection depletion of CD4 percent (the CD4 percentage of lymphocyte cells). Although all the individuals were scheduled to have their measurements made at semi-annual visits, the study has an unbalanced design because the subjects' actual visiting times did not exactly follow the schedule and the HIV infections happened randomly during the study. The numbers of repeated measurements range from 1 to 14 with a median of approximately 6. Compared with the previous example of the ASGA Study, this data set has a smaller number of subjects and a wider range of repeated measurements. The covariates of interest in these data also include both time-dependent and time-invariant variables. However, as will be seen later in this chapter, these covariates have some important differences from those considered in the ASGA Study and, hence, should be treated differently. Further details of the design and medical importance of the MACS data can be found in Kaslow et al.23
1.3. Overview of regression models

1.3.1. Main objectives

Generally speaking, a proper longitudinal analysis should achieve at least three objectives:

• The model under consideration must give an adequate description of the scientific relevance of the data and be sufficiently simple and flexible to be practically implemented. In biomedical and epidemiological studies, we would prefer a model that gives a clear and meaningful biological interpretation and also has a simple mathematical structure.
• The methodology must contain proper model diagnostic tools to evaluate the validity of a statistical model for a given data set. Two important diagnostic methods are confidence regions and tests of statistical hypotheses.
• The methodology must have appropriate theoretical and practical properties, and must adequately handle the possible intra-subject correlations of the data. In practice, the intra-subject correlations are often completely unknown and difficult to estimate adequately, so it is generally preferred to use estimation and inference procedures that do not depend on modeling the specific correlation structures.

1.3.2. Parametric models

Naturally, the most commonly used modeling approach is to use parametric regression, such as the random and mixed effects linear models, the generalized linear models and nonlinear models. The simplest case of these models is the marginal linear model of the form

\[ Y_{ij} = \sum_{l=0}^{k} \beta_l X_{ij}^{(l)} + \epsilon_i(t_{ij}), \tag{1} \]

where $\beta_0, \ldots, \beta_k$ are constant linear coefficients describing the effects of the corresponding covariates, $\epsilon_i(t)$ are realizations of a mean zero stochastic process $\epsilon(t)$ at $t$, and $X_{ij}$ and $\epsilon_i(t_{ij})$ are independent. Similar to all regression models where a constant intercept term is desired, we set $X^{(0)} \equiv 1$, which produces a baseline coefficient $\beta_0$ representing the mean value of $Y(t)$ when all the covariates $X^{(l)}(t)$ are set to zero. A popular special case of the error process is to take $\epsilon(t)$ to be a mean zero Gaussian stationary process. Although (1) appears to be overly simplified for many practical
situations, its generalizations lead to many useful models which form the bulk of longitudinal analysis. Estimation and inference methods based on parametric models, including the weighted least squares, the quasi-likelihoods and the generalized estimating equations, have been extensively investigated in the literature. The details can be found in some references.5,7,8,20,21,24,25,28,34,35,37,43 The main advantage of parametric models is that they generally have simple and intuitive interpretations. User friendly computer programs are already available in most popular statistical software packages, such as SAS and S-Plus. However, these models suffer the potential shortfall of model misspecification, which may lead to erroneous conclusions. At least in exploratory studies, it is necessary to relax some of the parametric restrictions.

1.3.3. Semiparametric models

A useful semiparametric model, investigated by Zeger and Diggle42 and Moyeed and Diggle,26 is the partially linear model of the form

\[ Y_{ij} = \beta_0(t_{ij}) + \sum_{l=1}^{k} \beta_l X_{ij}^{(l)} + \epsilon_i(t_{ij}), \tag{2} \]

where $\beta_0(t)$ is an unknown smooth function of $t$, $\beta_l$ are unknown constants, and $\epsilon_i(t)$ and $X_{ij}$ are defined in model (1). This model is more general than the marginal linear model (1), because $\beta_0(t)$ is allowed to change with $t$, rather than being set to a constant over time. On the other hand, by including the linear terms of $X_{ij}^{(l)}$, Eq. (2) is also more general than the nonparametric regression studied elsewhere,2,14,15,30 which involves only $(t_{ij}, Y_{ij})$. However, because model (2) describes the effects of $X_{ij}^{(l)}$ on $Y_{ij}$ through constant linear coefficients, this model is still, to some degree, based on mathematical convenience rather than scientific relevance. For example, there is no reason to expect that the influences of maternal risk factors on fetal development in the ASGA Study (Sec. 1.2.1) or the effects of cigarette smoking and pre-infection CD4 level on the post-infection CD4 cell percent in the HIV/CD4 Depletion Data (Sec. 1.2.2) are linear and constant throughout the study period. Thus, further generalization of model (2) is needed in many situations.

1.3.4. Nonparametric models

Although it is possible in principle to model $(Y_{ij}, t_{ij}, X_{ij})$ through a completely nonparametric high dimensional function, such an approach is often
impractical due to the well-known problem of the "curse of dimensionality". Moreover, numerical results obtained from high dimensional nonparametric fittings are often difficult to interpret. These problems motivate the consideration of nonparametric models that have certain meaningful structures. An important class of structural nonparametric regression models is the varying-coefficient model of the form

\[ Y_{ij} = X_{ij}^T \beta(t_{ij}) + \epsilon_i(t_{ij}), \tag{3} \]

where $\beta(t) = (\beta_0(t), \ldots, \beta_k(t))^T$ is a $(k+1)$-vector of smooth functions of $t$, and $\epsilon_i(t)$ and $X_{ij}$ are defined as in Eq. (1). Because Eq. (3) gives a linear model between $Y(t)$ and $X(t)$ at each fixed $t$, the linear coefficients $\beta_l(t)$, $l = 0, \ldots, k$, can be interpreted the same way as in Eq. (1). Taking $X_{ij}^{(0)} \equiv 1$, $\beta_0(t)$ represents the intercept at time $t$. On the other hand, because all the linear coefficients may change with $t$, we may obtain different linear models at different time points.

Model (3) is a special case of the general varying-coefficient models discussed by Hastie and Tibshirani.17 Methods of estimation and inference based on this class of models have been subjected to intense investigation recently in the literature. A number of different smoothing methods for the estimation of $\beta(t)$ have been proposed. These include the ordinary least squares local polynomials, the penalized least squares, the two-step and componentwise methods and the basis approximation approaches. Targeted to specific types of longitudinal designs, each of these methods has its own advantages and disadvantages in practice. We will present in Secs. 4 and 5 an overview of the above estimation and inference methods and demonstrate in Sec. 6 the applications of these methods.

2. Linear Mixed Effects Models

2.1. Models for covariate effects and correlations

Statistical models for longitudinal observations generally serve two purposes: (i) describing the effects of the treatments and other factors on the mean response profile; (ii) describing the differences in response profiles between individuals. A model serving the first purpose is generally classified as a marginal model or a population average model. A model serving the second purpose is a random effects model or a subject specific model.43 A mixed effects model then combines both the marginal and random effects. In particular, a linear mixed effects model is obtained when the marginal and random effects are additive and follow a linear relationship.
It is convenient to describe the model through a matrix representation. Let $Y_i = (Y_{i1}, \ldots, Y_{in_i})^T$ be the $[n_i \times 1]$ vector of responses for the $i$th subject, $t_i = (t_{i1}, \ldots, t_{in_i})^T$ be the subject's time design points and $X_i$ be the corresponding $[n_i \times (k+1)]$ covariate matrix whose $j$th row, for $j = 1, \ldots, n_i$, is $(1, X_{ij}^{(1)}, \ldots, X_{ij}^{(k)})$. Assuming that the error term $\epsilon_i(t)$ of Eq. (1) is a mean zero Gaussian process with covariance matrix $V_i(t_i)$, the responses $Y_i$ are then independent Gaussian random vectors such that

\[ Y_i \sim N(X_i\beta,\ V_i(t_i)), \tag{4} \]

where $\beta = (\beta_0, \ldots, \beta_k)^T$ with $\beta_j$ defined in Eq. (1), and $N(a, b)$ denotes a multivariate normal distribution with mean vector $a$ and covariance matrix $b$. Note that, because model (4) represents the conditional mean of $Y_i$ at $X_i$ through $X_i\beta$, it is a marginal model.

The covariance structure of model (4) is usually influenced by three factors: random effects, serial correlation and measurement error. The random effects characterize the stochastic variations between subjects within the population. In particular, we may view that, when the covariates affect the response linearly, some of the linear coefficients may vary from subject to subject. The serial correlations are the results of time-varying associations between different measurements of the same subject. Such correlations are typically positive in biomedical studies, and become weaker as the time interval between the measurements increases. Finally, the measurement errors, which are normally assumed to be independent both between and within the subjects, are induced by the measurement process or random variations within the subjects.

Suppose that, for each subject $i$, there is an $[r \times 1]$ vector of explanatory variables $U_{ij}$ measured at time $t_{ij}$, which may or may not overlap with the original covariate vector $X_{ij}$. Using the additive decomposition of random effects, serial correlations and measurement errors, $\epsilon_i(t_{ij})$ can be expressed as

\[ \epsilon_i(t_{ij}) = U_{ij}^T b_i + W_i(t_{ij}) + Z_{ij}, \tag{5} \]

where $b_i$ is the $[r \times 1]$ random vector with multivariate normal distribution $N(0, D)$, $D$ is an $[r \times r]$ covariance matrix with $(p, q)$th element $d_{pq} = d_{qp}$, $W_i(t_{ij})$ for $i = 1, \ldots, n$ are independent copies of a mean zero Gaussian process whose covariance at time points $t_{ij_1}$ and $t_{ij_2}$ is $\rho_W(t_{ij_1}, t_{ij_2})$, and $Z_{ij}$ for $i = 1, \ldots, n$ and $j = 1, \ldots, n_i$ are independent identically distributed random variables with $N(0, \tau^2)$ distribution. Writing $\delta_i(t_{ij}) = W_i(t_{ij}) + Z_{ij}$, $\delta_i = (\delta_i(t_{i1}), \ldots, \delta_i(t_{in_i}))^T$ and $U_i$ to be the $[n_i \times r]$ matrix whose $j$th row is $U_{ij}^T$, (4) and (5) reduce to the linear mixed effects model of Laird and Ware24

\[ Y_i = X_i\beta + U_i b_i + \delta_i. \tag{6} \]
The marginal effect $\beta$ represents the influence of $X_i$ on the population average of the response profile, while $b_i$ describes the variation of the $i$th subject from the population conditioning on the given explanatory variable $U_i$. Thus, conditioning on $X_i$ and $U_i$, model (6) implies that $Y_i$ for $i = 1, \ldots, n$ are independent Gaussian vectors such that

\[ Y_i \sim N(X_i\beta,\ U_i D U_i^T + P_i + \tau^2 I_i), \tag{7} \]

where $P_i$ is the $[n_i \times n_i]$ covariance matrix whose $(j_1, j_2)$th element is $\rho_W(t_{ij_1}, t_{ij_2})$ and $I_i$ is the $[n_i \times n_i]$ identity matrix.

A number of special cases can be derived for the variance-covariance structure of model (5). The classical linear model for independent cross-sectional data (or independent identically distributed data) is a special case of model (7) where $\epsilon_i(t_{ij})$ are only affected by the measurement errors $Z_{ij}$. When neither random effects nor measurement errors are present, the error term is of pure serial correlation, $\epsilon_i(t_{ij}) = W_i(t_{ij})$. Moreover, if $W_i(t_{ij})$ are from a mean zero stationary Gaussian process, the covariance of $\epsilon_i(t_{ij_1})$ and $\epsilon_i(t_{ij_2})$, hence of $Y_{ij_1}$ and $Y_{ij_2}$, can be specified by

\[ \mathrm{Cov}(\epsilon_i(t_{ij_1}), \epsilon_i(t_{ij_2})) = \sigma^2\rho(|t_{ij_1} - t_{ij_2}|), \tag{8} \]
where $\sigma$ is a positive constant and $\rho(\cdot)$ is a continuous function. Useful choices of $\rho(\cdot)$ include the exponential correlation $\rho(s) = \exp(-as)$ for some constant $a > 0$ and the Gaussian correlation $\rho(s) = \exp(-as^2)$, among others. When $\epsilon_i(t_{ij})$ are affected by a mean zero stationary Gaussian process and a mean zero Gaussian white noise (measurement error), the variance of $Y_{ij}$ is $\sigma^2\rho(0) + \tau^2$, while the covariance of $Y_{ij_1}$ and $Y_{ij_2}$, for $j_1 \neq j_2$, is $\sigma^2\rho(|t_{ij_1} - t_{ij_2}|)$, for some $\sigma > 0$, $\tau > 0$ and continuous correlation function $\rho(\cdot)$. When serial correlations are not present, the intra-subject correlations are only induced by the random effects, so that $P_i$ is not present in model (7).
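To make the covariance decomposition in (7) and (8) concrete, the following minimal sketch (Python with numpy; the function name, variable names and numerical values are illustrative assumptions, not part of the original text) builds $V_i = U_i D U_i^T + P_i + \tau^2 I_i$ for one subject, using the exponential serial correlation $\rho(s) = \exp(-as)$ and a random intercept.

```python
import numpy as np

def subject_covariance(t_i, U_i, D, sigma2, a, tau2):
    """Covariance of Y_i in model (7): U_i D U_i^T + P_i + tau^2 I, where
    P_i[j1, j2] = sigma2 * exp(-a * |t_{ij1} - t_{ij2}|) as in (8)."""
    t_i = np.asarray(t_i, dtype=float)
    P_i = sigma2 * np.exp(-a * np.abs(t_i[:, None] - t_i[None, :]))
    return U_i @ D @ U_i.T + P_i + tau2 * np.eye(len(t_i))

# Hypothetical example: random intercept only (U_i is a column of ones).
t_i = np.array([0.0, 0.5, 1.2, 2.0])   # this subject's measurement times
U_i = np.ones((4, 1))                  # random-intercept design
D = np.array([[0.8]])                  # Var(b_i)
V_i = subject_covariance(t_i, U_i, D, sigma2=1.0, a=0.7, tau2=0.3)
print(V_i)
```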
2.2. Likelihood based estimation and inferences

2.2.1. Conditional maximum likelihood estimation

Suppose that the variance-covariance matrix $V_i(t_i)$ of model (5) is determined by an $\mathbb{R}^q$-valued parameter vector $\alpha$. Denote $V_i(t_i; \alpha)$ to be the
variance-covariance matrix parametrized by $\alpha$. The log-likelihood function for model (4) is

\[ L(\beta, \alpha) = c + \sum_{i=1}^{n}\left\{ -\frac{1}{2}\log|V_i(t_i;\alpha)| - \frac{1}{2}(Y_i - X_i\beta)^T V_i^{-1}(t_i;\alpha)(Y_i - X_i\beta) \right\}, \tag{9} \]

where $c = \sum_{i=1}^{n}[(-n_i/2)\log(2\pi)]$. For a given $\alpha$, (9) can be maximized by

\[ \hat\beta(\alpha) = \left[\sum_{i=1}^{n} X_i^T V_i^{-1}(t_i;\alpha)X_i\right]^{-1}\left[\sum_{i=1}^{n} X_i^T V_i^{-1}(t_i;\alpha)Y_i\right]. \tag{10} \]

It is easy to verify that, under model (4), $\hat\beta(\alpha)$ is an unbiased estimator of $\beta$. Direct calculation also shows that the covariance matrix of $\hat\beta(\alpha)$ is

\[ \mathrm{Cov}[\hat\beta(\alpha)] = \left[\sum_{i=1}^{n} X_i^T V_i^{-1}(t_i;\alpha)X_i\right]^{-1}\left[\sum_{i=1}^{n} X_i^T V_i^{-1}(t_i;\alpha)\,\mathrm{Cov}(Y_i)\,V_i^{-1}(t_i;\alpha)X_i\right]\left[\sum_{i=1}^{n} X_i^T V_i^{-1}(t_i;\alpha)X_i\right]^{-1} = \left[\sum_{i=1}^{n} X_i^T V_i^{-1}(t_i;\alpha)X_i\right]^{-1}. \tag{11} \]

It is interesting to note that the second equality in (11) does not hold when the structure of the variance-covariance matrix is not correctly specified. Further derivation using Eqs. (4)–(11) shows that $\hat\beta(\alpha)$ has a multivariate normal distribution,

\[ \hat\beta(\alpha) \sim N\left(\beta,\ \left[\sum_{i=1}^{n} X_i^T V_i^{-1}(t_i;\alpha)X_i\right]^{-1}\right). \tag{12} \]
When α is known, this result can be used to develop inference procedures, such as confidence regions and test statistics, for β.
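As an illustration of how (10)–(12) translate into computation, here is a minimal numpy sketch (the function name is hypothetical; it assumes the subject-level arrays $X_i$, $Y_i$ and $V_i(t_i;\alpha)$ are supplied, for instance from the covariance construction sketched in Sec. 2.1).

```python
import numpy as np

def gls_beta(X_list, Y_list, V_list):
    """Generalized least squares estimator (10) and its model-based
    covariance (12), given the subject covariance matrices V_i(t_i; alpha)."""
    p = X_list[0].shape[1]
    A = np.zeros((p, p))   # accumulates sum_i X_i^T V_i^{-1} X_i
    b = np.zeros(p)        # accumulates sum_i X_i^T V_i^{-1} Y_i
    for X_i, Y_i, V_i in zip(X_list, Y_list, V_list):
        Vinv_X = np.linalg.solve(V_i, X_i)
        A += X_i.T @ Vinv_X
        b += Vinv_X.T @ Y_i
    cov_beta = np.linalg.inv(A)    # right-hand side of (12)
    beta_hat = cov_beta @ b        # estimator (10)
    return beta_hat, cov_beta
```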
2.2.2. Maximum likelihood estimation

When $\alpha$ is unknown, as in most practical situations, a consistent estimate of $\alpha$ has to be used. An intuitive approach is to estimate $\beta$ and $\alpha$ by maximizing (9) with respect to $\beta$ and $\alpha$ simultaneously. Maximum likelihood estimators (MLE) of this type can be computed by substituting (10) into (9) and then maximizing (9) with respect to $\alpha$. Denote the resulting ML estimators by $\hat\beta_{ML}$ and $\hat\alpha_{ML}$. The asymptotic distributions of $(\hat\beta_{ML}, \hat\alpha_{ML})$ can be developed using the standard approaches in large sample theory.

Although $(\hat\beta_{ML}, \hat\alpha_{ML})$ has some justifiable statistical properties, as for most likelihood-based methods, it may not be desirable in practice. To see why an alternative estimation method might be warranted in some situations, we consider the simple linear regression with independent errors and $n_1 = \cdots = n_n = m$,

\[ Y_i \sim N(X_i\beta,\ \sigma^2 I_m), \tag{13} \]

where $I_m$ is the $[m \times m]$ identity matrix. The parameters involved in the model are $\beta$ and $\sigma$. Let $\hat\beta_{ML}$ and $\hat\sigma_{ML}$ be the MLEs of $\beta$ and $\sigma$, respectively, and let RSS be the residual sum of squares defined by

\[ \mathrm{RSS} = \sum_{i=1}^{n}(Y_i - X_i\hat\beta_{ML})^T(Y_i - X_i\hat\beta_{ML}). \]

The MLE of $\sigma^2$ is $\hat\sigma^2_{ML} = \mathrm{RSS}/(nm)$. However, it is well-known that, for any finite $n$ and $m$, $\hat\sigma^2_{ML}$ is a biased estimator of $\sigma^2$. On the other hand, a slightly modified estimator $\hat\sigma^2_{REML} = \mathrm{RSS}/[nm - (k+1)]$ is unbiased for $\sigma^2$. Here, $\hat\sigma^2_{REML}$ is the restricted maximum likelihood (REML) estimator
for the model (13).

2.2.3. Restricted maximum likelihood estimation

This class of estimators was introduced by Patterson and Thompson29 for the purpose of estimating variance components in the linear models. The main idea is to consider a linear transformation of the original response variable so that the distribution of the transformed variable does not depend on $\beta$. Let $Y = (Y_1^T, \ldots, Y_n^T)^T$, $X = (X_1^T, \ldots, X_n^T)^T$ and $V$ be the block-diagonal matrix with $V_i(t_i)$ on the $i$th main diagonal and zeros elsewhere. Then, with $V$ parameterized by $\alpha$, model (4) is equivalent to

\[ Y \sim N(X\beta,\ V(\alpha)). \tag{14} \]

The REML estimator of $\alpha$, the variance component of (14), is obtained by maximizing the likelihood function of $Y^* = A^T Y$, where $A$ is an $[N \times (N-k-1)]$, $N = \sum_{i=1}^{n} n_i$, full rank matrix such that $A^T X = 0$. A specific construction of $A$ can be found in Diggle, Liang and Zeger.8 It follows from (14) that $Y^*$ has a mean zero multivariate Gaussian distribution with covariance matrix $A^T V(\alpha) A$. Harville16 showed that the likelihood function of $Y^*$ is proportional to

\[ L^*(\alpha) = \left\{\prod_{i=1}^{n} |V_i(t_i;\alpha)|^{-1/2}\right\}\left|\sum_{i=1}^{n} X_i^T V_i^{-1}(t_i;\alpha)X_i\right|^{-1/2}\left|\sum_{i=1}^{n} X_i^T X_i\right|^{1/2}\exp\left\{-\frac{1}{2}\sum_{i=1}^{n}(Y_i - X_i\hat\beta(\alpha))^T V_i^{-1}(t_i;\alpha)(Y_i - X_i\hat\beta(\alpha))\right\}. \tag{15} \]

The REML estimator $\hat\alpha_{REML}$ of $\alpha$ maximizes (15). The REML estimator $\hat\beta_{REML}$ of $\beta$ is obtained by substituting $\alpha$ of (10) with $\hat\alpha_{REML}$. Because (15) does not depend on the choice of $A$, the resulting estimators $\hat\beta_{REML}$ and $\hat\alpha_{REML}$ are free of the specific linear transformations.

The log-likelihood of $Y^*$, $\log[L^*(\alpha)]$, differs from the log-likelihood $L(\hat\beta(\alpha), \alpha)$ only through a constant, which does not depend on $\alpha$, and the term

\[ -\frac{1}{2}\log\left|\sum_{i=1}^{n} X_i^T V_i^{-1}(t_i;\alpha)X_i\right|, \]
which does not depend on $\beta$. Because both the REML and ML methods are based on the likelihood principle, both estimators have important theoretical properties such as consistency, asymptotic normality and asymptotic efficiency. In practice, neither is uniformly superior to the other in all situations. Their numerical values are also computed from different algorithms: for the ML method, the fixed effects and the variance components are estimated simultaneously, while for the REML method, only the variance components are estimated.
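The bias of $\hat\sigma^2_{ML}$ and the effect of the REML correction in the simple model (13) can be checked directly by simulation. The following sketch (hypothetical sample sizes and parameter values) reproduces the $[nm-(k+1)]/(nm)$ shrinkage of the ML variance estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k, sigma2 = 20, 4, 2, 1.5          # hypothetical sizes for model (13)
beta = np.array([1.0, -0.5, 2.0])        # (k + 1) coefficients incl. intercept

ml, reml = [], []
for _ in range(2000):
    X = np.column_stack([np.ones(n * m), rng.normal(size=(n * m, k))])
    Y = X @ beta + np.sqrt(sigma2) * rng.normal(size=n * m)
    beta_ml = np.linalg.lstsq(X, Y, rcond=None)[0]   # MLE of beta under (13)
    rss = np.sum((Y - X @ beta_ml) ** 2)
    ml.append(rss / (n * m))                         # sigma^2_ML, biased downward
    reml.append(rss / (n * m - (k + 1)))             # sigma^2_REML, unbiased
print(np.mean(ml), np.mean(reml), sigma2)            # ML mean falls below sigma2
```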
2.2.4. Inferences

The results established in the previous sections are useful for constructing inference procedures for $\beta$. We mention here only a few special cases. A more complete account of inferential and diagnostic tools may be found in Diggle,8 Zeger, Liang and Albert,43 Diggle, Liang and Zeger8 or Vonesh and Chinchilli,35 among others.

Suppose that we have a consistent estimator $\hat\alpha$ of $\alpha$, which may be either the ML estimator $\hat\alpha_{ML}$ or the REML estimator $\hat\alpha_{REML}$. Substituting $\alpha$ of (12) with $\hat\alpha$, the distribution of $\hat\beta(\hat\alpha)$ can be approximated, at least when $n$ is large, by

\[ \hat\beta(\hat\alpha) \sim N(\beta,\ \hat V), \tag{16} \]

where $\hat V = [\sum_{i=1}^{n} X_i^T V_i^{-1}(t_i;\hat\alpha)X_i]^{-1}$. Suppose that $C$ is a known $[r \times (k+1)]$ matrix with full rank. It follows immediately from (16) that, when $n$ is sufficiently large, the distribution of $C\hat\beta(\hat\alpha)$ can be approximated by

\[ C\hat\beta(\hat\alpha) \sim N(C\beta,\ C\hat V C^T). \tag{17} \]

Consequently, an approximate $100\times(1-a)\%$, $0 < a < 1$, confidence interval for $C\beta$ can be given by $C\hat\beta(\hat\alpha) \pm Z_{1-a/2}(C\hat V C^T)^{1/2}$. Taking $C$ to be the $(k+1)$ row vector with 1 at its $l$th place and zero elsewhere, an approximate $100\times(1-a)\%$ confidence interval for $\beta_l$ can be given by

\[ \hat\beta_l(\hat\alpha) \pm Z_{1-a/2}\sqrt{\hat V_l}, \tag{18} \]

where $\hat V_l$ is the $l$th diagonal element of $\hat V$.

The approximation in (17) can also be used to construct test statistics for linear statistical hypotheses. For example, suppose that we would like to test the null hypothesis $C\beta = \theta_0$ for a known vector $\theta_0$ against the general alternative $C\beta \neq \theta_0$. A natural test statistic would be

\[ \hat T = [C\hat\beta(\hat\alpha) - \theta_0]^T(C\hat V C^T)^{-1}[C\hat\beta(\hat\alpha) - \theta_0], \tag{19} \]

which has approximately a $\chi^2$-distribution with $r$ degrees of freedom, denoted by $\chi^2_r$, under the null hypothesis. A level $(100\times a)\%$ test based on $\hat T$ then rejects the null hypothesis when $\hat T > \chi^2_r(a)$, with $\chi^2_r(a)$ being the $[100\times(1-a)]$th percentile of $\chi^2_r$. For the special case of testing $\beta_l = 0$ versus $\beta_l \neq 0$, a simple procedure equivalent to (19) is to reject the null hypothesis when $|\hat\beta_l(\hat\alpha)| > Z_{1-a/2}\sqrt{\hat V_l}$, where $Z_{1-a/2}$ and $\hat V_l$ are defined in (18).
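A minimal sketch of the Wald-type inference in (16)–(19), assuming that `beta_hat` and its estimated covariance `V_hat` have already been computed (for instance by the generalized least squares sketch in Sec. 2.2.1); the function name and arguments are illustrative only.

```python
import numpy as np
from scipy import stats

def wald_inference(beta_hat, V_hat, C, theta0, a=0.05):
    """Chi-square test of C beta = theta0 via statistic (19), plus the
    approximate normal confidence intervals (18) for each coefficient."""
    C = np.atleast_2d(C)
    diff = C @ beta_hat - np.asarray(theta0, dtype=float)
    CVC = C @ V_hat @ C.T
    T = float(diff @ np.linalg.solve(CVC, diff))        # statistic (19)
    p_value = 1 - stats.chi2.cdf(T, df=C.shape[0])
    z = stats.norm.ppf(1 - a / 2)
    half = z * np.sqrt(np.diag(V_hat))
    ci = np.column_stack([beta_hat - half, beta_hat + half])  # intervals (18)
    return T, p_value, ci
```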
3. Partially Linear Models

As discussed in Sec. 1.3.3, this class of models has been studied by Zeger and Diggle42 and Moyeed and Diggle26 as a means to generalize the
marginal linear models. With further restrictions on the error process, (2) is equivalent to

\[ Y(t) = \beta_0(t) + \sum_{l=1}^{k}\beta_l X^{(l)}(t) + \epsilon(t), \tag{20} \]

where $\epsilon(t)$ is a mean zero stochastic process with variance $\sigma^2$ and correlation function $\rho(t)$, and $X^{(l)}(t)$, $l = 1, \ldots, k$, and $\epsilon(t)$ are independent. The errors $\epsilon_i(t_{ij})$ specified in (2) are then independent copies of $\epsilon(t)$. A useful way to view $\epsilon_i(t_{ij})$ is through the decomposition

\[ \epsilon_i(t_{ij}) = W_i(t_{ij}) + Z_{ij}, \tag{21} \]

where $W_i(t)$ are independent copies of a mean zero stationary process $W(t)$ with covariance function $\sigma_W^2\rho(t)$ and $Z_{ij}$ are independent identically distributed measurement errors with mean zero and variance $\sigma_Z^2$. The covariance structure of the measurements $Y_{ij}$ for $i = 1, \ldots, n$ and $j = 1, \ldots, n_i$ is

\[ \mathrm{Cov}(Y_{i_1 j_1}, Y_{i_2 j_2}) = \begin{cases} \sigma_Z^2 + \sigma_W^2, & \text{if } i_1 = i_2 \text{ and } j_1 = j_2,\\ \sigma_W^2\rho(t_{i_1 j_1} - t_{i_2 j_2}), & \text{if } i_1 = i_2 \text{ and } j_1 \neq j_2,\\ 0, & \text{otherwise}. \end{cases} \tag{22} \]

Although the above models can be classified as a special case of (3), a class of structural nonparametric models to be discussed in later sections, their estimation methods are quite different, a fact owing to the structural differences between these two classes of models. The rest of this section focuses on an iteration procedure for the estimation of $\beta_0(t), \beta_1, \ldots, \beta_k$. Inferential and alternative estimation methods, which constitute some major research activities in longitudinal analyses, are still not well understood and warrant considerable effort in further investigation.

3.1. Smoothing estimators for the mean response

Suppose for the moment that no covariate other than time is considered in modeling the mean response. The model (20) then reduces to

\[ Y(t) = \beta_0(t) + \epsilon(t). \tag{23} \]
Equivalently, with ǫ(t) defined in (20), β0 (t) is the mean response of Y (t) conditioning on time t; that is, β0 (t) = E[Y (t)|t]. A natural approach for estimating β0 (t) nonparametrically is to borrow smoothing techniques from the classical independent identically distributed
(i.i.d.) setting, while evaluating the statistical performance of the resulting estimators by taking the influence of the intra-subject correlations into account. A simple method is kernel smoothing, which amounts to estimating $\beta_0(t)$ through a weighted average of the measurements obtained within a neighborhood of $t$ defined by a kernel function. Let $K(u)$ be a continuous kernel function, usually a continuous probability density function, defined on the real line, and $h$ a positive bandwidth sequence which shrinks to zero as $n$ tends to infinity. A kernel estimator similar to the well-known Nadaraya–Watson type kernel estimators in the i.i.d. setting is

\[ \hat\beta_0^K(t) = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n_i} Y_{ij}\,K[(t - t_{ij})/h]}{\sum_{i=1}^{n}\sum_{j=1}^{n_i} K[(t - t_{ij})/h]}. \tag{24} \]

Here, (24) uses a uniform weight on each measurement and hence makes no distinction between subjects that have unequal numbers of repeated measurements; thus subjects with more repeated measurements are used more often than those with fewer repeated measurements. A general formulation is to assign a specific weight to each subject and estimate $\beta_0(t)$ by

\[ \hat\beta_0^K(t; w) = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n_i} Y_{ij}\,w_i\,K[(t - t_{ij})/h]}{\sum_{i=1}^{n}\sum_{j=1}^{n_i} w_i\,K[(t - t_{ij})/h]}, \tag{25} \]
where the weights, w = (w1 , . . . , wn ), satisfy wi ≥ 0 for all i = 1, . . . , n with strict inequality for some 1 ≤ i ≤ n. Clearly, (25) reduces to (24) when wi = 1/N . An intuitive weight choice other than wi = 1/N is to uniformly weight each subject, rather than each measurement, so that the resulting kernel estimator is (25) with wi = 1/(nni ). Other approaches for the estimation of (22) have also been studied.2,14,15,27,31 We omit these methods here and refer to their original articles for details. These methods, including (25) and the above alternative approaches, are essentially based on the fundamental spirit of local smoothing, hence, often lead to similar results in practice. This is in contrast to the smoothing methods to be discussed in the next section, where, because of the model complexity, different smoothing methods often produce very different results. A crucial step in obtaining an adequate kernel estimator for β0 (t) is to select an appropriate bandwidth h, while the choices of kernel functions are relatively less important. For estimation methods other than kernel smoothing, such as splines, this amounts to selecting an appropriate smoothing parameter. Rice and Silverman31 suggested a simple cross-validation for selecting a data-driven smoothing parameter which does not depend on
the intra-subject correlation structures of the data. Applying their cross-validation to the kernel estimator (25), we first define $\hat\beta_0^{(-i,K)}(t; w)$ to be the estimator computed using (25) and the remaining data after deleting the entire set of repeated measurements of the $i$th subject. Predicting the $i$th subject's outcome at time $t$ by $\hat\beta_0^{(-i,K)}(t; w)$, the cross-validation score of (25) is

\[ \mathrm{CV}(h) = \sum_{i=1}^{n}\sum_{j=1}^{n_i} w_i\left[Y_{ij} - \hat\beta_0^{(-i,K)}(t_{ij}; w)\right]^2. \tag{26} \]

Suppose that (26) can be uniquely minimized. The "leave-one-subject-out" cross-validated bandwidth $h_{cv}$ is the minimizer of (26). Heuristically, the use of $h_{cv}$ can be justified because, by minimizing (26), it approximately minimizes an average prediction error of (25). More details on the implementation and generalizations of this cross-validation will be discussed in Sec. 4.7.

Direct calculation of (26) can often be time consuming, as the algorithm repeats itself each time a new subject is deleted. Denote $K_{ij} = K[(t - t_{ij})/h]$,

\[ K_{ij}^* = \frac{w_i\,K[(t - t_{ij})/h]}{\sum_{i=1}^{n}\sum_{j=1}^{n_i} w_i\,K[(t - t_{ij})/h]} \qquad \text{and} \qquad K_i^* = \sum_{j=1}^{n_i} K_{ij}^* \]

for $i = 1, \ldots, n$. A computationally simpler approach, also suggested by Rice and Silverman,31 is to compute $[Y_{ij} - \hat\beta_0^{(-i,K)}(t_{ij}; w)]$ using the following expression:

\[ Y_{ij} - \hat\beta_0^{(-i,K)}(t_{ij}; w) = Y_{ij} - \left[\hat\beta_0^K(t_{ij}; w) - \sum_{j=1}^{n_i} Y_{ij}K_{ij}^*\right]\left(1 + \frac{K_i^*}{1 - K_i^*}\right) = \left[Y_{ij} - \hat\beta_0^K(t_{ij}; w)\right] + \left[\frac{\sum_{j=1}^{n_i} Y_{ij}K_{ij}^*}{K_i^*} - \hat\beta_0^K(t_{ij}; w)\right]\frac{K_i^*}{1 - K_i^*}. \tag{27} \]
Expression (27), as currently stated, is specifically targeted to the kernel estimators defined in (25). When other smoothing methods, such as splines, are used, we may not get an explicit expression like the right side of (27), and hence direct calculation of (26) has to be carried out by deleting the subjects one at a time.

Large sample inferences for $\hat\beta_0^K(t; w)$ can be derived based on the asymptotic expressions of its means and variances and its asymptotic distributions. Because $\hat\beta_0^K(t; w)$ is a linear statistic of $Y_{ij}$, its means and variances can be directly computed and, consequently, its asymptotic distributions can be easily established by checking the triangular array central limit theorem after taking the intra-subject correlations into account; see, for example, Wu, Chiang and Hoover39 and Wu and Chiang.38 Because $\hat\beta_0^K(t; w)$ is a special case of the kernel estimators of Sec. 4, details of pointwise and simultaneous inferences for $\beta_0(t)$ are discussed in Sec. 5.1.

3.2. Estimation of covariate effects

With covariates other than time entered into the model, the estimation of $(\beta_0(t), \beta_1, \ldots, \beta_k)$ can proceed by an iteration that combines smoothing with parametric estimation techniques. Suppose that the error terms $\epsilon_i(t)$ of (21) have known variance-covariance matrices $V_i(t_i)$ for $t_i = (t_{i1}, \ldots, t_{in_i})$ and all $i = 1, \ldots, n$. The iteration proceeds as follows (a sketch is given at the end of this section):

(a) Set $\beta_0(t)$ to zero and calculate an initial estimate of $(\beta_1, \ldots, \beta_k)^T$ using (10), an expression also for the generalized least squares, with $V_i(t_i; \alpha)$ replaced by $V_i(t_i)$.
(b) Based on the current estimate $(\hat\beta_1, \ldots, \hat\beta_k)$, calculate the residuals $r_{ij} = Y_{ij} - \sum_{l=1}^{k}\hat\beta_l X_{ij}^{(l)}$ and compute the kernel estimator $\hat\beta_0^K(t; w)$ of $\beta_0(t)$ using (24) with $Y_{ij}$ replaced by $r_{ij}$.
(c) Based on the current kernel estimator $\hat\beta_0^K(t; w)$, calculate the residuals $r_{ij} = Y_{ij} - \hat\beta_0^K(t_{ij}; w)$ and update the estimate of $(\beta_1, \ldots, \beta_k)$ using (10) with $(V_i(t_i; \alpha), Y_{ij})$ replaced by $(V_i(t_i), r_{ij})$.
(d) Repeat steps (b) and (c) until the estimates converge.

This algorithm is a special case of the more general backfitting algorithm described in Hastie and Tibshirani.17 The assumption of having a known correlation structure is unrealistic and can be relaxed. Although an incorrectly specified correlation structure may reduce the efficiency of the estimators, it generally does not affect
the consistency. When the variance-covariance matrix is parametrized by a parameter $\alpha$ and the error terms are from a mean zero Gaussian stationary process, the above iteration algorithm can be used in conjunction with the likelihood and restricted likelihood methods of the previous section; i.e. the generalized least squares estimators used in steps (a) and (c) can be replaced by the likelihood based estimators $\hat\beta_{ML}$ or $\hat\beta(\hat\alpha_{REML})$. Further computational details, statistical properties of the resulting estimators and a modified estimation procedure can be found in Zeger and Diggle42 and Moyeed and Diggle.26 Inferences based on the resulting estimators have not been systematically investigated and hence warrant substantial further development.
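A minimal sketch of the iteration (a)–(d) under working independence (each $V_i$ replaced by the identity matrix, so that the generalized least squares steps reduce to ordinary least squares), reusing the `kernel_fit` helper from the sketch in Sec. 3.1. Here `X` contains only the columns $X^{(1)}, \ldots, X^{(k)}$, since $\beta_0(t)$ plays the role of the intercept, and all names are illustrative.

```python
import numpy as np

def backfit_partially_linear(t, Y, X, h, n_iter=20, tol=1e-6):
    """Backfitting sketch for model (20): alternate between the kernel
    smoother (24) for beta_0(t) and least squares for (beta_1, ..., beta_k)."""
    w = np.full(len(Y), 1.0 / len(Y))                    # weight w_i = 1/N
    beta = np.linalg.lstsq(X, Y, rcond=None)[0]          # (a) initial fit, beta_0 = 0
    beta0 = np.zeros_like(Y)
    for _ in range(n_iter):
        beta0 = kernel_fit(t, t, Y - X @ beta, w, h)     # (b) smooth the residuals
        beta_new = np.linalg.lstsq(X, Y - beta0, rcond=None)[0]   # (c) update beta
        if np.max(np.abs(beta_new - beta)) < tol:        # (d) stop at convergence
            beta = beta_new
            break
        beta = beta_new
    return beta0, beta
```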
4. Smoothing for Varying-Coefficient Models

We present in this section a series of different smoothing methods for estimating the coefficient curves $\beta(t) = (\beta_0(t), \ldots, \beta_k(t))^T$ of (3). Inferences based on smoothing estimators of $\beta(t)$ will be discussed in Sec. 5.
4.1. Some useful expressions

In observational studies, the covariates are usually random as the subjects are randomly chosen, although they could in principle be either random or fixed. For generality, we assume throughout that $X(t)$ is random and that the matrix $E[X(t)X^T(t)] \equiv E_{XX^T}(t)$ exists. With a proper change of notation, our methods can be modified to accommodate the case of nonrandom covariates. An equivalent expression of (3) is then

\[ Y(t) = X^T(t)\beta(t) + \epsilon(t), \tag{28} \]

where $\epsilon(t)$ is a mean zero stochastic process and $\epsilon(t)$ and $X(t)$ are independent. Suppose that $E_{XX^T}(t)$ is invertible and its inverse is $E_{XX^T}^{-1}(t)$. It directly follows from (28) that $\beta(t)$ uniquely minimizes the second moment of $\epsilon(t)$ in the sense that

\[ E\{[Y(t) - X^T(t)\beta(t)]^2\} = \inf_{\text{all } b(\cdot)} E\{[Y(t) - X^T(t)b(t)]^2\}, \tag{29} \]

and is given by

\[ \beta(t) = E_{XX^T}^{-1}(t)\,E[X(t)Y(t)]. \tag{30} \]

When the covariates are time-invariant, we have $X(t) \equiv X$ and $E_{XX^T}(t) \equiv E_{XX^T}$, so that Eq. (30) reduces to

\[ \beta_r(t) = E\left[\left(\sum_{l=0}^{k} e_{rl}X^{(l)}\right)Y(t)\right], \tag{31} \]

where $e_{rl}$ is the element of $E_{XX^T}^{-1}$ at the $r$th row and $l$th column.
4.2. Smoothing based on least squares

4.2.1. General formulation

Intuitively, (29) suggests that $\beta(t)$ can be estimated by a method of local least squares using the measurements observed within a neighborhood of $t$. Assume that, for each $l$ and some integer $p \geq 0$, $\beta_l(t)$ is $p$ times differentiable and its $p$th derivative is continuous. Approximating $\beta_l(t_{ij})$ by a $p$th order polynomial $\sum_{r=0}^{p} b_{lr}(t)(t_{ij}-t)^r$ for all $l = 0, \ldots, k$, a local polynomial estimator of $\beta(t) = (\beta_0(t), \ldots, \beta_k(t))^T$ based on a kernel neighborhood is $\hat b_0(t) = (\hat b_{00}(t), \ldots, \hat b_{k0}(t))^T$, where $\{\hat b_{lr}(t);\ l = 0, \ldots, k,\ r = 0, \ldots, p\}$ minimizes

\[ L_p(t) = \sum_{i=1}^{n}\sum_{j=1}^{n_i} w_i\left\{Y_{ij} - \sum_{l=0}^{k}\left[X_{ij}^{(l)}\sum_{r=0}^{p} b_{lr}(t)(t_{ij}-t)^r\right]\right\}^2 K\left(\frac{t_{ij}-t}{h}\right), \tag{32} \]

where $w_i$ are the non-negative weights as in (25), $K(\cdot)$ is a kernel function, usually chosen to be a probability density function, and $h$ is a non-negative bandwidth. As a by-product of (32), $(r!)\hat b_{lr}(t)$ may be used to estimate the $r$th derivative $\beta_l^{(r)}(t)$ of $\beta_l(t)$, $r = 1, \ldots, p$.

4.2.2. Least squares kernel estimators

The simplest case of (32) is the ordinary least squares kernel estimator, also known as the local constant fit, obtained by minimizing (32) with $p = 0$. Using the matrix representation $Y_i = (Y_{i1}, \ldots, Y_{in_i})^T$,

\[ X_i = \begin{pmatrix} 1 & X_{i1}^{(1)} & \cdots & X_{i1}^{(k)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{in_i}^{(1)} & \cdots & X_{in_i}^{(k)} \end{pmatrix} \quad \text{and} \quad K_i(t) = \begin{pmatrix} K_{i1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & K_{in_i} \end{pmatrix} \]

with $K_{ij} = K[(t_{ij}-t)/h]$, if $\sum_{i=1}^{n} X_i^T K_i(t) X_i$ is invertible, then (32) with $p = 0$ can be uniquely minimized and its minimizer, the kernel estimator of $\beta(t)$, is given by

\[ \hat\beta^{LSK}(t) = \left(\sum_{i=1}^{n} w_i X_i^T K_i(t) X_i\right)^{-1}\left(\sum_{i=1}^{n} w_i X_i^T K_i(t) Y_i\right). \tag{33} \]

When the model incorporates no covariate other than time, i.e. $k = 0$, (33) reduces to a Nadaraya–Watson type kernel estimator of the conditional expectation $E[Y(t)|t]$; see, for example, Härdle.12
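To illustrate (33), the following sketch evaluates the least squares kernel estimator at a single time point with a Gaussian kernel; `X` stacks the rows $(1, X_{ij}^{(1)}, \ldots, X_{ij}^{(k)})$ over all subjects and measurements, and `w` repeats each subject's weight $w_i$ for each of its measurements. The names and the kernel choice are assumptions, not part of the original text.

```python
import numpy as np

def lsk_varying_coeff(t0, t, Y, X, w, h):
    """Least squares kernel estimator (33) of beta(t0) in model (3)."""
    K = np.exp(-0.5 * ((t - t0) / h) ** 2)      # kernel weights K_{ij} at t0
    WK = w * K
    A = (X * WK[:, None]).T @ X                 # sum_i w_i X_i^T K_i(t0) X_i
    b = (X * WK[:, None]).T @ Y                 # sum_i w_i X_i^T K_i(t0) Y_i
    return np.linalg.solve(A, b)

# Hypothetical use: evaluate the coefficient curves on a grid of time points.
# t_grid = np.linspace(t.min(), t.max(), 100)
# beta_hat = np.array([lsk_varying_coeff(s, t, Y, X, w, h=1.0) for s in t_grid])
```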
4.2.3. Least squares local linear estimators

Although (33) has a simple mathematical expression, it often leads to significant bias when $t$ is at the boundary of its support. An automatic procedure to reduce such boundary bias is to use higher order local polynomial fits. But a high order local polynomial fit can be impractical in some applications because it usually requires large sample sizes and may be computationally intensive. A practical approach that provides automatic boundary bias adjustment is the local linear fit that minimizes (32) with $p = 1$. Denote

\[ N_{lr} = \begin{pmatrix} \sum_{i,j} w_i X_{ij}^{(l)}X_{ij}^{(r)}K_{ij} & \sum_{i,j} w_i X_{ij}^{(l)}X_{ij}^{(r)}(t_{ij}-t)K_{ij} \\ \sum_{i,j} w_i X_{ij}^{(l)}X_{ij}^{(r)}(t_{ij}-t)K_{ij} & \sum_{i,j} w_i X_{ij}^{(l)}X_{ij}^{(r)}(t_{ij}-t)^2 K_{ij} \end{pmatrix}, \]

$N_r = (N_{0r}, \ldots, N_{kr})$, $N = (N_0^T, \ldots, N_k^T)^T$,

\[ M_r = \left(\sum_{i,j} w_i X_{ij}^{(r)}Y_{ij}K_{ij},\ \sum_{i,j} w_i X_{ij}^{(r)}(t_{ij}-t)Y_{ij}K_{ij}\right)^T, \]

$M = (M_0^T, \ldots, M_k^T)^T$, $b_l(t) = (b_{l0}(t), b_{l1}(t))^T$ and $b(t) = (b_0^T(t), \ldots, b_k^T(t))^T$ for $r, l = 0, \ldots, k$. Setting the partial derivatives of $L_1(t)$ with respect to $b_{lr}(t)$ to zero, the normal equation of (32) with $p = 1$ is

\[ N\,b(t) = M. \tag{34} \]

Suppose that the matrix $N$ is invertible at $t$. The solution of (34) then exists and is uniquely given by $\hat b(t) = N^{-1}M$. The least squares local linear estimator $\hat\beta_l^{LSL}(t)$ of $\beta_l(t)$ is then

\[ \hat\beta_l^{LSL}(t) = e_{2l+1}^T\hat b(t), \tag{35} \]
where $e_q$ is the $[2(k+1) \times 1]$ column vector with 1 at its $q$th place and zero elsewhere. Explicit expressions for the general higher order least squares local polynomial estimators can be similarly derived; see Hoover et al.18 We omit the details of these general higher order estimators, as a local linear fitting is sufficiently satisfactory in almost all the biomedical studies that have appeared in the literature.

4.2.4. Least squares with centered covariates

In some situations, some of the covariates used in (28) cannot take the value zero, so that the baseline coefficient curve $\beta_0(t)$ does not have a practical interpretation. Strictly positive covariates appear naturally both in the ASGA Study (Sec. 1.2.1), such as the mother's placental thickness and pre-pregnancy height, and in the HIV/CD4 Depletion Data (Sec. 1.2.2), such as the subject's pre-infection CD4 level. A useful remedy when such a situation arises is to use a centered version of the covariates in the model, so that the corresponding baseline coefficient can be interpreted as the conditional mean of $Y(t)$ when the centered covariates are set to zero. Let $X^{(*l)}(t) = X^{(l)}(t) - E[X^{(l)}(t)]$ be the centered version of $X^{(l)}(t)$ and $X^{(*)}(t)$ be the covariate vector with some or all of its components being centered. An equivalent form of (28) is

\[ Y(t) = (X^{(*)}(t))^T\beta^*(t) + \epsilon(t), \tag{36} \]

where $\beta^*(t) = (\beta_0^*(t), \beta_1(t), \ldots, \beta_k(t))^T$. Note that $\beta_0^*(t)$, the baseline coefficient curve of (36), represents the mean of $Y(t)$ when $X^{(*l)}(t)$, rather than $X^{(l)}(t)$, for $l = 1, \ldots, k$ are set to zero. The other coefficient curves of (36) can be interpreted the same way as those of (28).

The estimation of $\beta^*(t)$ can be obtained by first estimating the centered covariates $X_{ij}^{(*l)}$ of $X_{ij}^{(l)}$ and then minimizing (32) with $X_{ij}^{(l)}$ replaced by $X_{ij}^{(*l)}$. If $X^{(l)}(t)$ is a time-dependent covariate, then, using a kernel smoothing, a centered version of $X_{ij}^{(l)}$ can be estimated by $X_{ij}^{(*l)} = X_{ij}^{(l)} - \hat\mu_l(t_{ij})$ with

\[ \hat\mu_l(t) = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n_i} w_i X_{ij}^{(l)}\,\Gamma_l[(t - t_{ij})/\gamma_l]}{\sum_{i=1}^{n}\sum_{j=1}^{n_i} w_i\,\Gamma_l[(t - t_{ij})/\gamma_l]}, \tag{37} \]

where $(\Gamma_l(\cdot), \gamma_l)$ is a pair of kernel and bandwidth. On the other hand, if $X^{(l)}(t) \equiv X^{(l)}$ is time-invariant, then $X_{ij}^{(l)} \equiv X_i^{(l)}$ for all $j = 1, \ldots, n_i$, and $X_i^{(*l)}$ can be taken as $X_i^{(l)} - \bar X^{(l)}$, where $\bar X^{(l)} = n^{-1}\sum_{i=1}^{n} X_i^{(l)}$ is the sample mean of $X^{(l)}$. Let $X_i^{(*)}$ be the $[n_i \times (k+1)]$ centered covariate matrix whose $j$th row is $(1, X_{ij}^{(*1)}, \ldots, X_{ij}^{(*k)})$. A least squares kernel estimator of $\beta^*(t)$ is

\[ \hat\beta^{*LSK}(t) = \left[\sum_{i=1}^{n} w_i (X_i^{(*)})^T K_i(t) X_i^{(*)}\right]^{-1}\left[\sum_{i=1}^{n} w_i (X_i^{(*)})^T K_i(t) Y_i\right], \tag{38} \]
where $K_i(t)$ and $Y_i$ are defined as in (33). Wu, Yu and Chiang40 investigated the large sample properties of $\hat\beta^{*LSK}(t)$. Their results suggest that neither $\hat\beta^{LSK}(t)$ nor $\hat\beta^{*LSK}(t)$ is uniformly superior to the other. In particular, when the covariates are time-invariant, $\hat\beta^{LSK}(t)$ and $\hat\beta^{*LSK}(t)$ are asymptotically equivalent. However, when $X^{(l)}(t)$ for $l \geq 1$ changes significantly with $t$, theoretically and practically superior estimators of $\beta_l(t)$ may be obtained by centering $X^{(l)}(t)$. Of course, after a covariate is centered, the baseline coefficient curve of the model is changed. The decision on whether a covariate should be centered or not primarily depends on the biological interpretation of the corresponding baseline coefficient curve. Such a decision should be made based on the statistical properties of the estimators only if the effects of the covariates, rather than the baseline coefficient curve, are of primary interest in the investigation. Clearly, methods other than kernel smoothing may also be applied to the estimation with centered covariates. But, because of the complication caused by smoothing the covariates, statistical properties of estimators other than (38) have not been investigated in the literature.

4.2.5. A simple modification

The estimators mentioned above, both with and without covariate centering, rely on a single bandwidth to estimate all $(k+1)$ coefficient curves. This simple approach may work well when all the curves roughly belong to the same smoothness family. However, such an idealized scenario is often not anticipated in practice. A flexible method which automatically adjusts for the possibly different smoothing needs of different coefficient curves is always preferred. In the literature, the potential deficiency associated with the use of a single bandwidth has been reported.10,18,40 These authors have also proposed a number of alternative approaches (see Secs. 4.3–4.6) to overcome
this potential drawback. A simple method suggested by Wu, Yu and Chiang40 is to use a linear combination of the form

\[ \hat\beta(t; K, h) = \sum_{l=0}^{k} e_{l+1}^T\,\hat\beta(t; K_l, h_l), \tag{39} \]

where $K(\cdot) = (K_0(\cdot), \ldots, K_k(\cdot))$, $h = (h_0, \ldots, h_k)$, $e_p$ is the $[(k+1) \times 1]$ vector with 1 at its $p$th place and zero elsewhere and $\hat\beta(t; K_l, h_l)$ is the kernel estimator of $\beta(t)$ or $\beta^*(t)$ obtained from (33) or (38), respectively, using kernel $K_l(\cdot)$ and bandwidth $h_l$. Intuitively, $\hat\beta(t; K, h)$ relies on a specific pair of kernel and bandwidth to estimate the corresponding component of $\beta(t)$ or $\beta^*(t)$. As a general methodology, (39) is not limited to kernel estimators and may be applied to other local polynomial estimators as well.

4.2.6. Choices of $w_i$

An important factor that affects the theoretical and practical behaviors of the least squares local polynomial estimators of $\beta(t)$ is the choice of $w_i$ in (32). For cross-sectional studies with independent identically distributed data, a uniform weight choice, $w_i \equiv 1/N$, is often desirable. For the current sampling, it is conceivable that a proper choice of $w_i$ may depend on the intra-subject correlation structures and the numbers of repeated measurements $n_i$. In practice, however, the correlation structures of the data are often completely unknown and may be difficult to estimate, so that subjective choices such as $w_i = 1/N$ and $w_i = 1/(n n_i)$ are often considered. Intuitively, $w_i = 1/N$ assigns equal weight to each observation point, while $w_i = 1/(n n_i)$ assigns equal weight to each subject. Theoretically, the choice of $w_i = 1/N$ may produce inconsistent least squares kernel estimators when some $n_i$ are much larger than the others. On the other hand, the least squares kernel estimators based on $w_i = 1/(n n_i)$ are always consistent regardless of the choices of $n_i$.18,38

4.3. Penalized least squares

Suppose that all the components of $\beta(t)$ are twice continuously differentiable and have bounded and square integrable second derivatives with respect to $t$. A natural penalized least squares criterion is to minimize

\[ J(\beta, \lambda) = \sum_{i=1}^{n}\sum_{j=1}^{n_i}\left\{Y_{ij} - \sum_{l=0}^{k} X_{ij}^{(l)}\beta_l(t_{ij})\right\}^2 + \sum_{l=0}^{k}\lambda_l\int[\beta_l''(t)]^2\,dt \tag{40} \]
with respect to $\beta_l(t)$, where $\lambda = (\lambda_0, \ldots, \lambda_k)^T$ and the $\lambda_l$ are positive smoothing parameters. The existence and uniqueness of the minimizer of (40) depend on $t_{ij}$ and $X_{ij}^{(l)}$. Suppose that (40) can be uniquely minimized. The penalized least squares estimator $\hat\beta^{PLS}(t) = (\hat\beta_0^{PLS}(t), \ldots, \hat\beta_k^{PLS}(t))^T$ of $\beta(t)$ is then defined to be the unique minimizer of (40). Using similar techniques as in univariate smoothing, it can be shown that the $\hat\beta_l^{PLS}(t)$ are natural cubic splines with knots at the distinct values of $\{t_{ij} : i = 1, \ldots, n,\ j = 1, \ldots, n_i\}$ and can be expressed as linear functions of $\{Y_{ij} : i = 1, \ldots, n,\ j = 1, \ldots, n_i\}$.

One feature that distinguishes $\hat\beta^{PLS}(t)$ from the estimators obtained from (32) is the use of multiple smoothing parameters $\lambda_l$ in the penalty term. In (40), all $(k+1)$ smoothing parameters $\lambda_l$, $l = 0, \ldots, k$, can be adjusted in the penalty term. Numerical results presented in Hoover et al.18 demonstrated that the extra flexibility created by multiple smoothing parameters can indeed lead to better estimators than the least squares local polynomials that rely on a single smoothing parameter. However, because $\hat\beta^{PLS}(t)$ has knots at all the distinct time points, it can be extremely computationally intensive when the number of distinct time points is large, a case that often arises in unbalanced longitudinal studies.

4.4. A two-step method

In an attempt to provide flexible smoothing estimators that are computationally accessible with large longitudinal data, Fan and Zhang10 proposed to estimate $\beta(t)$ by a two-step smoothing method which uses $(k+1)$ smoothing parameters in a different way from (39) and (40). Their procedure calls for the following two steps: (i) computing the raw estimates $\hat\beta^{RAW}(s)$ of $\beta(s)$ at a set of distinct time points, say $s_1, \ldots, s_m$, where $m$ may depend on $n$ and $n_i$, $i = 1, \ldots, n$; (ii) estimating each coefficient curve $\beta_l(t)$ by smoothing the raw estimates $\hat\beta_l^{RAW}(s_r)$, $r = 1, \ldots, m$. Although Fan and Zhang10 used local polynomials to illustrate the method, other smoothing methods such as splines may in principle be used.

For the special case of balanced longitudinal data where all the subjects are observed at the same set of time points $\{s_j;\ j = 1, \ldots, m\}$ with $m = n_i$, $i = 1, \ldots, n$, the raw estimates can be computed by fitting linear models between $Y_{ij}$ and $X_{ij}$ at $s_j$ for all $j = 1, \ldots, m$. However, when the design is unbalanced and the numbers of subjects at some time points are sparse, as in most practical situations, it may be necessary to compute the raw
estimates by grouping the observations from adjacent time points. In particular, we can first compute $\hat\beta_l^{RAW}(s_r)$, $l = 0, \ldots, k$, using the local polynomial method (32) with a small bandwidth, and then, treating $\hat\beta_l^{RAW}(s_r)$ as the new data, estimate $\beta_l(t)$ by minimizing

\[ L_{p,l}^{TS}(t) = \sum_{j=1}^{m}\left\{\hat\beta_l^{RAW}(s_j) - \sum_{r=0}^{p} b_{lr}(t)(s_j - t)^r\right\}^2 K_l\left(\frac{s_j - t}{h_l}\right) \tag{41} \]

with respect to $b_{lr}(t)$, where $(K_l(\cdot), h_l)$ is a pair of kernel and bandwidth. Similar to (32), if $\hat b_{lr}^{TS}(t)$ for $r = 0, \ldots, k$ uniquely minimize (41), $\hat b_{l0}^{TS}(t)$ is the two-step $p$th order local polynomial estimator of $\beta_l(t)$, while $(r!)\hat b_{lr}^{TS}(t)$ can be used to estimate the $r$th derivative of $\beta_l(t)$.

In contrast to the estimators obtained from (32), where a single bandwidth must be used for all $\beta_l(t)$, the two-step method has in principle the flexibility to adjust for the specific smoothing need of each coefficient curve. However, a main difficulty with the current version of two-step smoothing is that it lacks a specific and practical guideline for constructing the raw estimates for unbalanced longitudinal data. Certain data-driven bandwidth procedures would be desirable for computing both the raw and the final estimates. The impacts of different raw estimates on the theoretical and practical properties of the final two-step estimators are still not well understood and require substantial further development.

4.5. Smoothing with time-invariant covariates

When the covariates of interest are time-invariant, such as in clinical trials where the treatments are kept fixed throughout the study periods, an effective way, motivated by (30), to provide flexible and computationally feasible estimators of $\beta(t)$ is to smooth each component of $\beta(t)$ separately. Let $Z^{(r)}(t) = [\sum_{l=0}^{k} e_{rl}X^{(l)}]Y(t)$, $X_i = (1, X_i^{(1)}, \ldots, X_i^{(k)})^T$ be the covariate vector of the $i$th subject and $\hat e_{rl}$ be the $(r, l)$th element of the matrix $(\hat E_{XX^T})^{-1}$, the inverse of the sample mean $\hat E_{XX^T} = (1/n)\sum_{i=1}^{n} X_i X_i^T$. A natural estimator of $Z^{(r)}(t)$ is $Z_{ij}^{(r)} = [\sum_{l=0}^{k}\hat e_{rl}X_i^{(l)}]Y_{ij}$. By (30), a componentwise smoothing estimator of $\beta_r(t)$ can be obtained by smoothing $Z_{ij}^{(r)}$ for $i = 1, \ldots, n$ and $j = 1, \ldots, n_i$. Specifically, a local polynomial estimator of $\beta_r(t)$ with order $p \geq 0$ is $\hat b_{r0}^{COM}(t)$, such that $\hat b_{rl}^{COM}(t)$, $l = 0, \ldots, p$, uniquely minimize

\[ L_{p,r}^{COM}(t) = \sum_{i=1}^{n}\sum_{j=1}^{n_i} w_i\left\{Z_{ij}^{(r)} - \sum_{l=0}^{p} b_{rl}(t)(t_{ij} - t)^l\right\}^2 K_r\left(\frac{t_{ij} - t}{h_r}\right), \tag{42} \]
with respect to $b_{rl}(t)$. For the local constant fitting with $p = 0$, (42) leads to the componentwise kernel estimator

\[ \hat\beta_r^{COM}(t) = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n_i} w_i Z_{ij}^{(r)}\,K_r[(t_{ij} - t)/h_r]}{\sum_{i=1}^{n}\sum_{j=1}^{n_i} w_i\,K_r[(t_{ij} - t)/h_r]}. \tag{43} \]
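A sketch of the componentwise estimator (43) for time-invariant covariates, with one bandwidth per coefficient curve; `Xs` holds the subject-level covariate vectors, `subj` maps each measurement to its subject, and all names and the Gaussian kernel are illustrative assumptions.

```python
import numpy as np

def componentwise_kernel(t0, t, Y, w, subj, Xs, h):
    """Componentwise kernel estimators (43) at time t0.
    Xs[i] = (1, X_i^(1), ..., X_i^(k)) is subject i's covariate vector and
    h[r] is the bandwidth used for the r-th coefficient curve."""
    n, kp1 = Xs.shape
    E_inv = np.linalg.inv(Xs.T @ Xs / n)          # inverse of the sample mean E_XX^T
    Z = (Xs @ E_inv.T)[subj] * Y[:, None]         # Z_ij^(r) = [sum_l e_rl X_i^(l)] Y_ij
    beta = np.empty(kp1)
    for r in range(kp1):
        K = np.exp(-0.5 * ((t - t0) / h[r]) ** 2)  # kernel K_r with bandwidth h_r
        beta[r] = np.sum(w * Z[:, r] * K) / np.sum(w * K)
    return beta
```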
Wu and Chiang38 established the large sample mean squared errors of $\hat\beta_r^{COM}(t)$, while Wu, Yu and Yuan41 developed a procedure for constructing approximate asymptotic pointwise and simultaneous confidence regions for $\beta_r(t)$. These results shed some light on the asymptotic behaviors of the higher order estimators $\hat b_{r0}^{COM}(t)$, although specific asymptotic risks and asymptotic distributions have not been established for the case with $p \geq 1$. The results of Wu and Chiang38 and Wu, Yu and Yuan41 indicate some clear advantages of $\hat\beta_r^{COM}(t)$ over the kernel estimator (33), both in terms of theoretical convergence rates and practical flexibility. Similar advantages over the least squares method of (32) are also expected for the componentwise local polynomial estimators.

Obviously, minimizing (42) is not the only componentwise smoothing approach. Suppose that the support of the design time points is contained in a compact set $[a, b]$ and $\beta_r(t)$ is twice differentiable with respect to $t$ in $[a, b]$. A viable alternative is to estimate $\beta_r(t)$ by the penalized least squares estimator $\tilde\beta_r^{COM}(t)$, where $\tilde\beta_r^{COM}(t)$ minimizes

\[ J_r(\beta_r, \lambda_r) = \sum_{i=1}^{n}\sum_{j=1}^{n_i} w_i\left[Z_{ij}^{(r)} - \beta_r(t_{ij})\right]^2 + \lambda_r\int_a^b[\beta_r''(s)]^2\,ds, \tag{44} \]

with $\lambda_r$ being a non-negative smoothing parameter. By the same rationale as in Sec. 4.3, it is easy to verify that $\tilde\beta_r^{COM}(t)$ is a natural cubic spline with knots at the distinct points of $\{t_{ij};\ i = 1, \ldots, n,\ j = 1, \ldots, n_i\}$. Furthermore, using the approach of equivalent kernels, Chiang, Rice and Wu4 derived the asymptotic mean squared errors and the asymptotic distributions of $\tilde\beta_r^{COM}(t)$. In contrast to the multiple penalized least squares of (40), whose solution is obtained by solving a large linear system involving all $(k+1)$ components, (44) significantly simplifies the computation by solving $(k+1)$ separate linear systems. This computational advantage ensures the practical implementability of (44) in many situations, while the intensive computational needs often make the optimization of (40) impracticable.

4.6. Smoothing via basis approximations

All the smoothing methods described above depend on local smoothing in the sense that only the measurements obtained within some neighborhood of $t$ are effectively used to estimate $\beta(t)$. Although local smoothing
Regression Models for the Analysis of Longitudinal Data
863
works well when all the coefficient curves βr (t) are nonparametric, it is not adequate when some of the coefficient curves have known parametric forms, as in the partially linear model (2). Compared with local smoothing, estimation using basis approximations has three important advantages. First, it can be used to estimate β(t) whether its components are parametric or nonparametric, hence, is suitable for both nonparametric and semiparametric varying-coefficient models. Second, when a random effect is desired, it provides a natural means to incorporate random effects into a nonparametric or semiparametric model. Third, because popular basis estimators, such as truncated polynomials or B-splines, often rely on far fewer knots or approximation terms than smoothing splines, they often enjoy considerable computationally advantage over smoothing splines or even local polynomials. Although estimation with mixed effects is of great interest in various settings, we only discuss here the case of marginal models. Extension to mixed effects models can be found in Rice and Wu.31 The main idea is to first approximate βr (t) by a basis function expansion with Kr terms, where Kr may or may not tend to infinity as n tends to infinity, and then estimate βr (t) by estimating the coefficients of this expansion. For each r = 0, . . . , k, let Brs (t), s = 1, . . . , Kr , be a set of basis functions. If βr (t) can be approximated by an expansion based on Brs (t), s = 1, . . . , Kr , there is a set of constants γrs so that βr (t) ≈
Kr X
γrs Brs (t) .
(45)
s=1
Substituting (45) into (3), an approximation of the varying-coefficient model is Yij ≈
Kr k X X
(r)
Xij γrs Brs (t) + ǫi (tij ) .
(46)
r=0 s=1
The approximation sign in (46) will be replaced by the equality sign if, for all r = 0, . . . , k, βr (t) belongs to a linear space spanned by {Brs (t); s = 1, . . . , Kr }. Using (46), the least squares estimators γˆrs of γrs can be obtained by minimizing " #2 Kr ni k X n X X X (r) wi Yij − ℓ(γ) = (Xij γrs Brs (tij )) , (47) i=1 j=1
r=0 s=1
864
C. O. Wu & K. F. Yu
where γ = (γ0T , . . . , γkT )T and γr = (γr1 , . . . , γrKr )T . If the minimizer of (47) uniquely exists, the basis function estimator of βr (t) is βˆrBAS (t) =
Kr X
[ˆ γrs Brs (t)] ,
(48)
s=1
where Kr may depend on n and ni , i = 1, . . . , n. Clearly, if Kr is finite and known and βr (t) belongs to the linear model spanned by Brs (t), s = 1, . . . , Kr , then (48) returns a parametric estimator of βr (t). On the other hand, if (45) holds with Kr unknown, a consistent nonparametric estimator produced by (48) may require Kr to be a function of n and ni , i = 1, . . . , n, which may tend to infinity as n tends to infinity. Depending on the underlying scientific nature of the data, many different bases may be used to approximate the components of β(t). The most popular basis system in the classical linear models is the polynomial basis {1, t, . . . , tKr −1 }. A general class of bases that have certain numerical advantages over the above polynomial basis is the class of piecewise polynomials. Examples of piecewise polynomial bases include B-spline bases, such as linear, quadratic or cubic splines, or other types of truncated power series; see de Boor6 for further details of the explicit expressions of piecewise polynomials and their numerical properties. If βr (t) is believed to exhibit periodicity, Fourier series are often natural basis choices. Huang, Wu and Zhou19 recently established the consistency of (48) and studied the practical performance of (48) with B-splines through an intensive simulation. In general, a B-spline estimator requires a smoothing parameter consisted of three aspects: degrees of the polynomials and number and location of the knots. Although generally desired, it is difficult, however, to simultaneously determine all three of these aspects from the data. Rice and Wu31 showed that the simple approach of using equally spaced knots often works well in practice, a finding also corroborated by the simulation of Huang, Wu and Zhou.19 4.7. A cross-validation procedure The most important factor that affects all of the above smoothing methods is the selection of appropriate smoothing parameters, such as the bandwidth, the positive penalty weight λ and the number and location of knots. It is of both theoretical and practical interest to select these values directly from the data.
Regression Models for the Analysis of Longitudinal Data
865
Selecting data-driven smoothing parameters for nonparametric regression with independent identically distributed data has been a subject of intense investigation in the literature. Under the current context, a widely used method, suggested by Rice and Silverman,31 is a cross-validation that deletes the entire repeated measurements of a subject, rather than an individual measurement, one at a time. Hart and Wehrly15 derived the consistency of this cross-validation for a simple nonparametric regression without the presence of covariates other than time. Without loss of ˆ ξ) generality, we denote ξ to be a vector of smoothing parameters, β(t; a smoothing estimator based on ξ and βˆ(−i) (t; ξ) an estimator computed ˆ ξ) but with the ith subject’s measurements using the same method as β(t; ˆ ξ) is deleted. The cross-validation score for β(t; CV(ξ) =
ni n X X i=1 j=1
T ˆ(−i) β (t; ξ)]2 } , {wi [Yij − Xij
(49)
ˆ ξ). The cross-validated which measures the predictive error of β(t; smoothing parameter ξcv is then the minimizer of CV(ξ), provided that the unique minimizer of CV(ξ) exists. The above cross-validation criterion is directly applicable to all the smoothing methods presented above, except the two-step smoothing of Sec. 2.4. For the estimators of Secs. 2.2, 2.3 and 2.5 and B-splines with equally spaced knots, minimizing the corresponding cross-validation scores would either return a univariate bandwidth or a Rk+1 -valued vector. An automatic search of the global minima usually requires a sophisticated optimization software. In practice, particularly when the smoothing parameter is multivariate, it is often reasonable to use a smoothing parameter whose cross-validation score is close to the global minima. There are three intuitive reasons to use the cross-validation criterion (49). First, by deleting the subjects one at a time, it preserves the correlation structure of the data. Second, in contrast to alternatives such as the AIC, the BIC and the generalized cross-validation,1,32,33,36 (49) does not depend on the structure of the intra-subject correlations, hence, can be implemented in almost all the practical situations. Third, when the number of subjects is sufficiently large, minimizing (49) leads to a smoothing parameter that approximately minimizes the average squared error: ˆ ξ)) = ASE(β(·;
ni n X X i=1 j=1
T ˆ ij ; ξ))]2 } . {wi [Xij (β(tij ) − β(t
(50)
866
C. O. Wu & K. F. Yu
The last assertion can be heuristically seen by the decomposition: CV(ξ) =
ni n X X i=1 j=1
+2
T {wi [Yij − Xij β(tij )]2 }
ni n X X i=1 j=1
+
ni n X X i=1 j=1
T T {wi [Yij − Xij β(tij )][Xij (β(tij ) − βˆ(−i) (tij ; ξ))]2 }
T {wi [Xij (β(tij ) − βˆ(−i) (tij ; ξ))]2 } .
(51)
Here, (50) and the definition of βˆ(−i) (t; ξ) imply that the third term at ˆ ξ)). Because the right side of (51) is approximately the same as ASE(β(·; the first term at the right side of (51) does not depend on the smoothing parameter and the second term is approximately zero, ξcv approximately ˆ ξ)). minimizes ASE(β(·; 5. Confidence Regions Based on Smoothing Confidence statements can be made either based on the asymptotic distributions of the estimators or through a bootstrap procedure. Currently, explicit expressions of asymptotic distributions have only been developed for the kernel estimators (33) and (43). A bootstrap approach that has broader appeal in longitudinal analysis is to resample the subjects of the original data. Although its theoretical properties have not been wellunderstood, practical performances of this “resampling-subject” bootstrap have been investigated by a number of simulation studies. We present in this section both asymptotic and bootstrap approaches based on the smoothing estimators of Sec. 4. 5.1. Asymptotic inferences for kernel estimators 5.1.1. Pointwise confidence intervals For both (33) and (43), their asymptotic distributions have been developed based on two important assumptions. First, the numbers of repeated measurements ni are non-random and may or may not tend to infinity as n tending to infinity. Second, the time design points tij are random and independent identically distributed according to an unknown density function f (·). These assumptions are made for practical considerations as well as mathematical tractability.
Regression Models for the Analysis of Longitudinal Data
867
We first consider the confidence procedures based on (33). Under the above assumptions and some additional mild regularity conditions, Wu, Chiang and Hoover39 showed that, if wi = 1/N , h = N −1/5 h0 and lim N −6/5
n→∞
n X
n2i = θ
i=1
for some constants h0 > 0 and 0 ≤ θ < ∞, βˆLSK (t) has an asymptotically multivariate normal distribution in the sense that (N h)1/2 [βˆLSK (t) − β(t)] → N(B(t), D ∗ (t)) ,
(52)
in distribution as n → ∞. The bias, B(t), and the variance-covariance matrix, D ∗ (t), of (52) are −1 T B(t) = [f (t)]−1 EXX T (t)(b0 (t), . . . , bk (t))
(53)
−1 1 D∗ (t) = [f (t)]−2 EXX T (t)D(t)EXX T (t)
(54)
and
where D(t) is a (k + 1) × (k + 1) matrix whose (l, r)th element is Z Dlr (t) = σ 2 (t)E[X (l) (t)X (r) (t)]f (t) [K(u)]2 du + θh0 ρǫ (t)E[X (l) (t)X (r) (t)][f (t)]2 , σ 2 (t) = E[ǫ2 (t)], ρǫ (t) = lima→0 E[ǫ(t + a)ǫ(t)] and ( Z k X 3/2 2 bl (t) = h0 u K(u)du {βc′ (t)[E[X (l) (t)X (c) (t)]]′ f (t) c=0
)
+ βc′ (t)E[X (l) (t)X (c) (t)]f ′ (t) + (1/2)βc′′ (t)E[X (l) (t)X (c) (t)]f (t)} . Then, there are lower and upper end points Lα (t) and Uα (t) given by {AT βˆLSK (t) − (N h)−1/2 AT B(t)} ± Zα/2 (N h)−1/2 [AT D∗ (t)A]1/2 , (55) where Zα/2 is the (1 − α/2) quantile of the standard normal distribution, so that lim P {Lα (t) ≤ AT β(t) ≤ Uα (t)} = 1 − α .
n→∞
(56)
Because B(t) and D ∗ (t) depend on unknown quantities, (55) is not implementable in practice. If B(t) and D ∗ (t) can be consistently estimated by
868
C. O. Wu & K. F. Yu
ˆ ˆ ∗ (t), a pointwise (1 − α) confidence interval for AT β(t) can be B(t) and D ˆ α (t), U ˆα (t)) with L ˆ α (t) and U ˆα (t) being the lower and approximated by (L upper end points given by ˆ ˆ ∗ (t)A]1/2 . ± Zα/2 (N h)−1/2 [AT D {AT βˆLSK (t) − (N h)−1/2 AT B(t)}
(57)
ˆ ˆ ∗ (t) by Wu, Chiang and Hoover39 suggested to compute B(t) and D 2 (l) (r) substituting f (t), σ (t), ρǫ (t), E[X (t)X (t)] and the required derivatives in (53) and (54) with their kernel estimators. Suppose that the kernel function K(·) is at least twice continuously differentiable in the interior of its support. These authors proposed to estimate f (t), σ 2 (t), ρǫ (t) and E[X (l) (t)X (r) (t)] by ni n X X tij − t −1 ˆ K , f(t) = (N h) h i=1 j=1 ni n X X tij − t 1 2 ǫˆi (tij )K , σ ˆ (t) = h N hfˆ(t) 2
i=1 j=1
ρˆǫ (t) =
Pn
i=1
and ˆ (l) (t)X (r) (t)] = E[X
P
t −t t −t ǫi (tij2 )K( ijh )K( ijh )} ǫi (tij1 )ˆ j1 6=j2 {ˆ P tij −t tij −t i=1 j1 6=j2 {K( h )K( h )}
Pn
ni n X X tij − t 1 (r) (l) Xi (tij )Xi (tij )K , ˆ h N hf(t) i=1 j=1
ˆ ij ) are the residuals, and to estimate the where ǫˆi (tij ) = Yij − XiT (tij )β(t first and second derivatives of f (t), βl (t) and E[X (l) (t)X (r) (t)] by the corˆ (l) (t)X (r) (t)]. Through an responding derivatives of fˆ(t), βˆlLSK (t) and E[X intensive simulation, these authors also suggested that the cross-validation bandwidth hcv obtained from (49) may be used to compute all of the above estimators, although, in general, different bandwidths may be used for these estimators. The above plug-in approach can also be extended to βˆrCOM (t) of (43) when the covariates are time-invariant. Wu, Yu and Yuan41 have derived the explicit expressions of the bias, B(βˆrCOM ; t), and the standard deviation, SD(βˆrCOM ; t), of βˆrCOM (t), and suggested to use the approximate (1 − α) confidence interval for βr (t) with end points ˆ βˆCOM ; t)} ± Z1−α/2 SD( c βˆCOM ; t) , {βˆrCOM (t) − B( r r
869
Regression Models for the Analysis of Longitudinal Data
ˆ βˆCOM ; t) and SD( c βˆCOM ; t) are plug-in estimators of B(βˆCOM ; t) where B( r r r COM and SD(βˆ ; t). Because of the similarity it shares with βˆLSK (t), we omit r
the details for this case. The above asymptotic intervals differ from their counterparts with independent identically distributed data in the inclusion of intra-subject correlations in the variance term. When ni are not negligible relative to n, θ in (54) may not be negligible, so that the contribution of the correlations may not be ignored. For the HIV/CD4 data (Sec. 1.2.2), the numbers of repeated measurements range from 1 to 14, while the number of subjects is 400. Asymptotic results that do not take the intra-subject correlations into account may not lead to adequate approximations. In this case, it is appropriate to estimate the correlations directly from the data. When the numbers of repeated measurements are negligible relative to the numbers of subjects, as in the ASGA data (Sec. 1.2.1), the contribution of the intrasubject correlation structures becomes negligible in the variances of the kernel estimators. The resulting confidence intervals are then similar to that with independent identically distributed samples. 5.1.2. Simultaneous bands
In most applications, the main interest of inference lies in the overall confidence regions of βl (t) within a proper range of t values, rather than the confidence intervals at a particular time point. When the data are from independent identically distributed samples, simultaneous confidence regions for regression curves may be constructed using either extreme value theory of Gaussian processes9 or variability bands bridged by pointwise intervals over a grid points.11,13,22 For longitudinal samples, analogous asymptotic theory of extreme values has not been developed. This leaves the latter approach to be the only practical simultaneous inferential tool in longitudinal analysis. To construct a simultaneous band for AT β(t) over t ∈ [a, b] based on the least squares kernel estimator βˆ(LSK) (t), we choose a positive integer M and partition [a, b] into M equally spaced intervals with grid points a = ξ1 < · · · < ξM+1 = b, such that ξj+1 − ξj = (b − a)/M for j = 1, . . . , M . A set of approximate (1 − α) simultaneous confidence intervals for AT β(ξj ), j = 1, . . . , M + 1, is then the collection of intervals (ˆlα (ξj ), u ˆα (ξj )), j = 1, . . . , M + 1, which satisfies lim P {ˆlα (ξj ) ≤ AT β(ξj ) ≤ uˆα (ξj ) for all j = 1, . . . , M + 1} ≥ 1 − α .
n→∞
(58)
870
C. O. Wu & K. F. Yu
The Bonferroni adjustment suggests ˆ α/(M+1) (ξj ), U ˆα/(M+1) (ξj )) , (ˆlα (ξj ), uˆα (ξj )) = (L
(59)
ˆ α (ξj ), U ˆα (ξj )) are defined in (57). where (L To establish a band that covers all the points between the grid points ξj , j = 1, . . . , M + 1, we first consider the interpolation of AT β(ξj ) defined by M (ξj+1 − t) M (t − ξj ) (AT β)(I) (t) = [AT β(ξj )] + [AT β(ξj+1 )] , b−a b−a (60) for t ∈ [ξj , ξj+1 ]. A simultaneous band for (AT β)(I) (t) over t ∈ [a, b] is (I) (I) (I) (I) lα (t) and uˆα (t) are the linear interpolations of ˆα (t)), where ˆ (ˆlα (t), u ˆlα (ξj ) and uˆα (ξj ), similarly defined as in (60). The gaps between the grid points are then bridged by the smoothness conditions of AT β(t). If AT β(t) satisfies sup |(AT β)′ (t)| ≤ c1 ,
for a known constant c1 > 0 ,
(61)
t∈[a,b]
then it follows that T
T
(I)
|A β(t) − (A β)
(t)| ≤ 2c1
M (ξj+1 − t)(t − ξj ) , b−a
for all t ∈ [ξj , ξj+1 ], and consequently ˆl(I) (t) − 2c1 M (ξj+1 − t)(t − ξj ) , α b−a M (ξj+1 − t)(t − ξj ) (I) uˆα (t) + 2c1 b−a
(62)
is an approximate (1 − α) confidence band for AT β(t). If AT β(t) satisfies sup |(AT β)′′ (t)| ≤ c2 ,
for a known constant c2 > 0 ,
(63)
t∈[a,b]
then T
T
(I)
|A β(t) − (A β)
c2 M (ξj+1 − t)(t − ξj ) (t)| ≤ , 2 b−a
for all t ∈ [ξj , ξj+1 ], and an approximate (1 − α) confidence band can be given by ˆl(I) (t) − c2 M (ξj+1 − t)(t − ξj ) , α 2 b−a c2 M (ξj+1 − t)(t − ξj ) . (64) u ˆ(I) (t) + α 2 b−a
Regression Models for the Analysis of Longitudinal Data
871
For smoothness conditions other than the ones considered in (61) and (63), the corresponding confidence bands may be similarly established. When the covariates are time-invariant, the same approach can be used to establish simultaneous confidence bands based on βˆCOM (t); see Wu, Yu and Yuan41 for details. 5.2. Bootstrap variability bands The above asymptotic inferences subject to two restrictions which, to some degree, limit their applications in longitudinal analysis. First, because the asymptotic distributions have so far only been developed for the two kernel type estimators, βˆLSK (t) and βˆCOM (t), confidence procedures for other estimators are still not available. Given that smoothing methods such as splines and local polynomials have exhibited a number of theoretical and practical advantages over the kernel methods, particularly at the boundary of the support of t, inferential procedures based on these smoothing methods are in demand. Second, because the plug-in estimators require the estimation of the design densities, covariance functions and the other quantities appeared in the bias and variance terms of the estimators, the procedure is usually computationally intensive and may introduce additional errors in its coverage probabilities. A more appealing inferential procedure that has been suggested in the ˆ literature is the “resampling-subject” bootstrap. Let β(t) = (βˆ0 (t), . . . , T ˆ βk (t)) be an estimator of β(t) constructed based on any of the previously mentioned smoothing method. An approximate (1− α) pointwise percentile ˆ interval for AT E[β(t)] can be constructed by the following steps: (1) Randomly draw n subjects with replacement from the original dataset ∗ ); i = and denote the resulting bootstrap sample to be {(Yij∗ , t∗ij , Xij 1, . . . , n, j = 1, . . . , ni }. (2) Compute the bootstrap estimator βˆboot (t), hence AT βˆboot (t), based on the above bootstrap sample and the smoothing method specified for ˆ β(t). (3) Repeating the above two steps B times, so that B bootstrap estimators AT βˆboot (t) are obtained. boot (4) Calculate Lboot (t), the lower and upper (α/2)th perα (t) and Uα centiles, respectively, of the B bootstrap estimators AT βˆboot (t). The boot approximate (1 − α) bootstrap interval is then (Lboot (t)). α (t), Uα
872
C. O. Wu & K. F. Yu
ˆ When AT E[β(t)] satisfies the smoothness conditions (61) or (63), simultaˆ neous confidence bands for AT E[β(t)] can be constructed using (62) and boot (64) with (59) replaced by (Lboot (ξ α/(M+1) j ), Uα/(M+1) (ξj )). The main advantages of this bootstrap are its generality and simplicity. It is not limited to kernel type estimators and does not depend on the correlations and designs of the data. Despite its potential, several related theoretical and practical issues have still yet to be resolved. Because the biases of the estimators have not been adjusted, the resulting intervals or bands may not always have desirable coverage probabilities for AT β(t). If a consistent estimator of the bias is also available, improved confidence regions for AT β(t) may be obtained by adjusting the boot bias appeared in (Lboot α/(M+1) (ξj ), Uα/(M+1) (ξj )). Currently, consistent bias estimators can only be obtained on a case-by-case basis, and no general procedure is available. A natural alternative to the percentile end points used in Step 4 is to consider normal approximated intervals with end points ˆ ± z(1−α/2) se AT β(t) b boot (t), where se b boot (t) is the sample standard error of the B bootstrap estimators AT βˆboot (t). Asymptotic properties for both the percentile and the normal approximation bootstrap procedures have not been investigated. 6. Two Examples 6.1. Alabama fetal growth study Normal fetal growth is naturally thought to influence infant survival and proper child development. Our objective is to investigate the effects of maternal risk factors and maternal anthropometric measurements on the patterns of fetal growth. Although the outcomes measured by fetal abdominal circumference, biparietal diameter and femur length are all time-dependent, the covariates of interest may be either time-dependent or time-invariant. A typical time-dependent covariate is the maternal placental thickness measured by ultrasound at each visit. On the other hand, mother’s height, weight and body mass index measured at the beginning of pregnancy, are time-invariant. Other variables, such as maternal habits of cigarette smoking and alcohol consumption, may be either time-dependent or time-invariant depending on how these variables are defined. A simple way to define time-invariant maternal smoking and drinking status is to categorize the mothers as smokers (ever smoked cigarettes during the pregnancy) versus non-smokers (never smoked cigarettes during the pregnancy) and non-drinkers/light-drinkers (consumed one beer/one glass of
Regression Models for the Analysis of Longitudinal Data
873
wine or less per day in average during the pregnancy) versus heavy-drinkers (consumed more than one beer or one glass of wine per day in average during the pregnancy). As in most self-reported questionnaires, the data contain the average numbers of cigarettes smoked and the average amount of alcohol consumed per day per subject. These actual cigarette and alcohol consumptions are clearly time-dependent as some of the participating subjects change their behaviors during the study. Depending on the specific scientific questions, both smoking and drinking categories and the actual consumptions could be considered in the analysis. For the purpose of illustration, the analysis present here focuses on the effects of maternal smoking/drinking categories and placental thickness on the growth of fetal abdominal circumference. Other covariate and outcome measurements can be similarly investigated, provided that the models have clear and meaningful biological interpretations. Although the general trend of Fig. 1 shows an upward growth pattern, it hardly provides any clue on the relationship between fetal growth and the covariates of interest. A nonparametric analysis with model (3) seems a natural start. Let Y (t) and X (1) (t) be the fetal abdominal circumference and placental thickness, respectively, at t weeks of gestation; X (2) and X (3) be the mother’s drinking and smoking categories defined by ( 1 if she is a heavy-drinker , (2) X = 0 if she is a non-drinker/light drinker , ( 1 if she is a smoker , (3) X = 0 otherwise ; and X (4) be the mother’s height (in centimeters) at the beginning of the pregnancy. In view that proper placental development may also be affected by drinking and smoking, we first consider the effects of the time-invariant covariate vector X = (1, X (2) , X (3) , X (4) )T . Although we can fit model (3) directly with (Y (t), t, X) and describe the covariate effects by β(t) = (β0 (t), β2 (t), β3 (t), β4 (t))T , a better biological interpretation can be obtained if X (4) were replaced by its centered version X (∗4) = X (4) − E[X (4) ], so that the covariate effects are characterized by β ∗ (t) = (β0∗ (t), β2 (t), β3 (t), β4 (t))T . For the latter case, the baseline coefficient curve β0∗ (t) represents the mean abdominal circumference at t weeks of gestation for a non-smoking and non-drinking/light-drinking mother whose height is at average, while, for the former, β0 (t) itself does not have a biological interpretation.
874
C. O. Wu & K. F. Yu
To fit model (3) with (Y (t), t, X (∗) ), X (∗) = (1, X (2) , X (3) , X (∗4) )T , we (∗4) computed Xi , i = 1, . . . , 1475, by subtracting the sample average of (4) (4) {Xj ; j = 1, . . . , 1475} from Xi . Figure 2 shows the estimated coefficient curves, including the baseline growth curve and the covariate effects characterized by alcohol consumption, cigarette smoking and mother’s height, and their corresponding 95% simultaneous confidence bands. These coefficient curves were computed using the componentwise estimators of (43) with the Epanechikov kernel, the cross-validated bandwidths described in (49) and wi = 1/(nni ). It is worthwhile noting that in this data set the numbers of repeated measurements, most of which are around 4, are much smaller compared with the number of subjects n = 1475. Thus, asymptotic results obtained by assuming n tending to infinity and ni remaining finite are expected to give adequate approximations. For kernel smoothing estimators, this means that both wi = 1/(nni ) and wi = 1/N lead to very similar estimates, and the intra-subject correlations can be ignored in the asymptotic variances of the estimators. Thus, no covariance estimators are needed in the construction of asymptotically approximate confidence bands. Based on the same kernel and bandwidths used in the coefficient curve estimates, the simultaneous confidence bands were computed using the asymptotic approximation (62) and the Bonferroni adjustment with M = 40 and c1 = 5. These graphs suggest an upward linear baseline curve Alcohol Effect
10
Baseline Ab. Cir. 20 30
Coef. of Alc. Status -15 -10 -5 0 5 10
15
40
Time Effect
15
20
25 Time(weeks)
30
35
15
20
25 Time(weeks)
30
35
30
35
Height Effect
-2
Coef. of Height -1 0 1
Coef. of Smk. Status -15 -10 -5 0 5 10
2
15
Smoking Effect
15
20
25 Time(weeks)
30
35
15
20
25 Time(weeks)
Fig. 2. ———— : componentwise kernel estimates of the coefficient curves (covariate effects) computed using the Epanechnikov kernel, the cross-validated bandwidths and wi = 1/(nni ). - - - - - - - - : the 95% Bonferroni-type confidence bands.
Regression Models for the Analysis of Longitudinal Data
875
Table 1. Parameter estimates and their standard errors computed using the MixedEffects Procedure in S-plus.
β00 β01 β2 β3 β4
Parameter Estimate
Standard Error
Z-ratio
−6.5496 1.0645 0.0026 0.1009 0.0007
0.0614 0.0021 0.0551 0.0516 0.0035
−106.5880 496.1262 0.0478 1.9555 0.1996
β0∗ (t) and undetectable effects from alcohol consumption, cigarette smoking and mother’s height. However, because the confidence bands used here tend to be conservative, they may not be sensitive enough to detect small influences of the covariates. We also computed the curve estimates and their corresponding confidence bands using the least squares kernel method of (33). We omit these results from the presentation, because they are similar to the ones shown in Fig. 2. The above nonparametric results, i.e. graphs shown in Fig. 2, suggest that the relationship between fetal abdominal circumference Y (t), gestational age t, alcohol consumption X (2) , cigarette smoking X (3) and centered maternal height X (∗4) can be reasonably described by the linear model Y (t) = β00 + β01 t + β2 X (2) + β3 X (3) + β4 X (∗4) + ǫ(t) , with unknown parameters (β00 , β01 , β2 , β3 , β4 ) and a mean zero error process ǫ(t). This model can be fitted using the Mixed-Effects Procedure in S-plus.3 Table 1 shows the parameter estimates and the corresponding standard errors computed from the above linear model and the S-plus procedure. The results from this linear model suggested clearly non-significant effects for alcohol consumption and maternal height and a very weak, but slightly positive, effect for cigarette smoking. The weak smoking effect shown in this linear analysis is likely caused by the random variations of the data, rather than any substantial association between fetal size and smoking. These results generally agree with the findings obtained from the above nonparametric analysis. When placental thickness X (1) (t) is added to the model, smoothing has to be carried out with time-dependent covariates. In order to obtain a meaningful biological interpretation for the baseline coefficient curve, we use the centered covariate X (∗1) (t) = X (1) (t) − E[X (1) (t)], the difference between a subject’s placental thickness at time t and the conditional mean at t. To avoid starting with a model that has too many covariates, we
876
0.0 -1.0
-0.5
Coef. of Pl. Th.
0.5
1.0
C. O. Wu & K. F. Yu
20
25
30
35
30
35
Time (weeks)
0.0 -1.0
-0.5
Coef. of Pl. Th.
0.5
1.0
(a) Placental Effect (CV)
20
25 Time (weeks)
(a) Placental Effect (Subjective) Fig. 3. ———— : estimated coefficient curve (covariate effect) for placental thickness, computed using (38) with the standard Gaussian kernel, wi = 1/N , cross-validated bandwidths (top panel) and bandwidth vector (γ1 , h0 , h1 , h4 ) = (1.5, 1.0, 2.0, 1.0) (bottom panel). · · · · · · · · · : the 95% pointwise intervals computed using the “resampling-subject” bootstrap percentiles.
Regression Models for the Analysis of Longitudinal Data
877
consider first fitting (3) with (t, X (∗1) (t), X (∗4) ) as the covariate vector. The top panel of Fig. 3 shows the estimated coefficient curve for X (∗1) (t) computed using the kernel method of (38) with the standard Gaussian kernel, the cross-validated bandwidths and wi = 1/N . This estimate appears to be undersmoothed, as it can not be explained by a clear biological interpretation. An alternative, perhaps biologically more transparent, estimated coefficient curve of X (∗1) (t), shown in the bottom panel of Fig. 3, is computed using the same method except with bandwidth vector (γ1 , h0 , h1 , h4 ) = (1.5, 1.0, 2.0, 1.0). This bandwidth vector was chosen because its cross-validation score was very close to that of the crossvalidated bandwidths. Bootstrap percentile intervals are used to demonstrate the variability of the estimates, while inferences based on asymptotic approximations are still not yet available for this type of estimators. Figure 3 suggests, at least qualitatively, some positive association between placental thickness and fetal abdominal circumference. The estimated coefficient curve for the centered maternal height X (∗4) (t) stays constantly close to zero, suggesting a non-significant effect for the maternal height. The estimated baseline coefficient curve is also very close to the one presented in Fig. 2. Hence, these curves are omitted from the presentation. Also omitted are the analysis with the mother’s drinking and smoking status, X (2) and X (3) , added to the model, as their effects are very similar to the ones shown in Fig. 2. 6.2. MACS CD4/HIV study Let tij denote the ith subject’s time length (in years) for his jth measurement since HIV infection. Our objective is to evaluate the effects of two factors, the pre-HIV infection CD4 percent X (1) and the smoking status X (2) (t), on the post-HIV infection depletion of CD4 percent Y (t) over time. The first covariate X (1) does not depend on the time since HIV infection. The second covariate X (2) (t) equals 1 if the subject is classified as a smoker at time t and zero otherwise. Because some of the subjects change their smoking habits during the study, X (2) (t) is a time-dependent variable. Owing to the lack of an existing parametric or semiparametric model that is known to describe the scientific relevance between these variables, it is reasonable to consider an initial analysis with the nonparametric model (3). The same rationale used in the analysis of the ASGA study suggests that, in terms of biological interpretability, the center variable X (∗1) = X (1) − E[X (1) ] is more preferable than its uncentered version X (1) in the
878
C. O. Wu & K. F. Yu
model (3). However, because X (2) (t) is a time-dependent binary variable, (∗1) estimated by subtracting it is unnecessary to be centered. Thus, with Xi (1) the corresponding sample mean from Xi , the model (3) can be fitted ∗ with the data {(Yij , tij , Xij ); i = 1, . . . , 400, j = 1, . . . , ni }. The baseline ∗ coefficient curve β0 (t) represents the mean CD4 percent at t years after the infection for those who are non-smokers at time t and have average level of CD4 percent before the infection. The effects β1∗ (t) and β2 (t) of X (∗1) and X (2) (t), respectively, can be interpreted the usual way. Besides the difference in covariate centering, there is another important difference in the estimation and inferences between this and the previous example. The numbers of repeated measurements in this data set can not be simply ignored compared with the number of subjects. Thus, at least for the known case of kernel estimation, the asymptotic approximations assuming n tending to infinity and ni remaining bounded may not lead to adequate estimators of the variances, although both wi = 1/(nni ) and wi = 1/N seem to be reasonable weight choices. Because the correlation structure of the data is completely unknown and difficult to be estimated accurately, Wu, Chiang and Hoover39 suggested that it is appropriate in this case to obtain conservative Bonferroni-type bands with the covariance ρǫ (t) in (55) replaced by the variance σ 2 (t), an upper bound for |ρǫ (t)|. The graphs in Fig. 4 show the individuals’ depletion of CD4 percent over time, the estimated coefficient curves and their corresponding conservative Bonferroni-type 95% asymptotic confidence bands. The estimated coefficient curves were computed using (33) with Epanechnikov kernel, the cross-validated bandwidth and the wi = 1/N weight. The confidence bands were computed using (57) and (62) with M = 138, c1 = 3 and ρǫ (t) replaced by σ 2 (t). The same kernel and bandwidth used in computing (33) were also used in computing all the plug-in kernel estimators required in (57). Figure 4(b) shows a declining baseline CD4 percent curve over time since HIV infection, which coincides with the basic trend suggested by the plot shown in Fig. 4(a). The simultaneous band for the coefficient curve of the pre-infection CD4 percent stays positive at least for the first four years after HIV infection, suggesting strongly the benefit of high pre-infection CD4 level for the initial period since the infection. However, the positive effect of the pre-infection CD4 percent on the post-infection CD4 percent appears tapering down at the later stage of the infection. Although the estimated curve in Fig. 4(c) stays positive throughout the 7-year time range considered in this data set, the confidence band obtained for this curve does not show any significant positive association between cigarette smoking
879
0.10
0.25
Baseline CD4
0.40
0.8 0.4 0.0
CD4 percentage
Regression Models for the Analysis of Longitudinal Data
0
2
4
6
0
Time since Infection
4
6
6
Time since Infection
(c) Smoking Effect
0.020 0.005 -0.010
Coef. of Pre-Infection CD4
0.1
2
4
(b) Time Effect
-0.1
Coef. of Smoking
(a) CD4 VS. Time
0
2
Time since Infection
0
2
4
6
Time since Infection
(d) Pre-Infection CD4 Effect
Fig. 4. (a) Individuals’ CD4 percent versus time (in years) since HIV infection. (b)–(d) Estimated baseline CD4 percent, coefficient curve for smoking and coefficient curve for pre-HIV infection CD4 percent (————) and their corresponding 95% simultaneous confidence bands (· · · · · · · · · ).
and post-infection CD4 level. This may be either caused by the weak association between these two variables or the conservative nature of our confidence bands. Clearly, our findings here only provide some exploratory insights on the data. Biomedical implications and parametric models that provide additional meaningful descriptions of the biological mechanisms have to be further developed and independently confirmed by other studies. Nevertheless, the usefulness of nonparametric regression, particularly the varying-coefficient models, in the initial exploration of longitudinal data is transparent, as was shown in this and the previous examples.
7. Summary and Discussion This article has presented a series of parametric, semiparametric and nonparametric models and their estimation and inferential methods for
880
C. O. Wu & K. F. Yu
the analysis of longitudinal data. These methods have a wide range of applications in biomedical studies. Theory and methods for parametric models, particularly the linear models, have been extensively studied in the literature. Estimation and inferences based on parametric models can be easily implemented using existing statistical software packages, such as SAS and S-plus. Methods based on semiparametric and nonparametric models, on the other hand, represent the most current progress in this active research field. The nonparametric estimation and inferential methods introduced here are all based on the general framework of varying-coefficient models. These methods have the advantage of being flexible while applicable to large longitudinal studies. Smoothing methods for these models have been developed using local polynomials and splines, each has its own advantages and disadvantages in practice. Generally speaking, the componentwise smoothing methods are flexible and computationally feasible when the covariates are time-invariant, while methods based on ordinary and penalized least squares and basis approximations can be applied to models with both time-dependent and time-invariant covariates. Pointwise and simultaneous confidence bands for the coefficient curves can be constructed using either asymptotic approximations or the “resampling-subject” bootstrap. The asymptotic confidence procedures have only been developed for the kernel methods. The “resampling-subject” bootstrap may in principle be used with any smoothing estimators. However, despite the usefulness of this bootstrap shown by a number of simulation studies, its theoretical properties have not been investigated. The approach of two-step smoothing appears to be useful to overcome some of the drawbacks of the ordinary least squares. But, in order for this approach to be useful in an unbalanced longitudinal study, further research is needed to establish specific methods for calculating the raw estimates and the asymptotic properties of the final estimators. Finally, a practical consideration is the use of the uniform weight wi = 1/N versus the uniform subject weight wi = 1/(nni ). Although none of these weight uniformly dominates the other in all the longitudinal designs and an ideal weight may depend on the unknown correlation structure and how fast ni , i = 1, . . . , n, tending to infinity relative to n, simulation studies that have been reported in the literature so far suggest that both weight choices are appropriate when all the subjects have approximately the same numbers of repeated measurements, while the wi = 1/(nni ) weight is usually preferred when the numbers of repeated measurements differ from each other significantly. There are a number of topics that warrant further investigation. First
Regression Models for the Analysis of Longitudinal Data
881
and foremost, although estimation and confidence tools are important in longitudinal analyses, methods that are enormously useful in biomedical studies are testing procedures that can evaluate the statistical evidence for different hypotheses. Such procedures distinguish a parametric submodel that explains a given scientific hypothesis from the general nonparametric model. The main task of decision making is to determine the distributions of the appropriate test statistics. Another practically important problem is to improve the confidence procedures. The procedures presented in this article are known to be conservative, which often hinders their usefulness in practice. Further work needs to focus on reducing the widths of the bands while maintaining satisfactory coverage probabilities. Finally, in view that the varying-coefficient models are still inadequate for a number of longitudinal settings, there is a need to further extend these models. A useful extension is to consider regression models where the outcome variable depends on the history as well as the current values of the covariates. All the estimation and inference methods will have to be redeveloped for this extension. Acknowledgment The first author was partial supported by the National Institute on Drug Abuse (R01 DA10184-01), the National Science Foundation (DMS 0103832) and the Acheson J. Duncan Fund of the Johns Hopkins University. The authors would like to thank Professor Joseph Margolick (Johns Hopkins School of Hygiene and Public Health) for providing the MACS Public Use Data Set Release PO4 (1984–1991) and Ms. Vivian W.-S. Yuan for computing many of the numerical results used in this work. References 1. Akaike, H. (1970). Statistical predictor identification. Annals Institute Statistics Mathematics 22: 203–217. 2. Altman, N. S. (1990). Kernel smoothing of data with correlated errors. Journal of the American Statistical Association 85: 749–759. 3. Bates, D. M. and Pinheiro, J. C. (1999). Mixed Effects Models in S, SpringerVerlag, New York. 4. Chiang, C.-T., Rice, J. A. and Wu, C. O. (2001). Smoothing spline estimation for varying coefficient models with repeatedly measured dependent variable. Journal of the American Statistical Association 96. 5. Davidian, M. and Giltinan, D. M. (1995). Nonlinear Models for Repeated Measurement Data, Chapman Hall, London; New York. 6. De Boor, C. (1978). A Practical Guide to Splines, Springer-Verlag, New York.
882
C. O. Wu & K. F. Yu
7. Diggle, P. J. (1988). An approach to the analysis of repeated measurements. Biometrics 44: 959–971. 8. Diggle, P. J., Liang, K.-Y. and Zeger, S. L. (1994). Analysis of Longitudinal Data, Oxford University Press, Oxford, England. 9. Eubank, R. L. and Speckman, P. L. (1993). Confidence bands in nonparametric regression. Journal of the American Statistical Association 88: 1287–1301. 10. Fan, J. Q. and Zhang, J.-T. (2000). Functional linear models for longitudinal data. Journal of Royal Statistical Society, Series B62: 303–322. 11. Hall, P. and Titterington, D. M. (1988). On confidence bands in nonparametric density estimation and regression. Journal Multiple Analysis 27: 228–254. 12. H¨ ardle, W. (1990). Applied Nonparametric Regerssion, Cambridge University Press, Cambridge, UK. 13. H¨ ardle, W. and Marron, J. S. (1991). Bootstrap simultaneous error bars for nonparametric regression. The Annals of Statistics 19: 778–796. 14. Hart, T. D. (1991). Kernel regression estimation with time series errors. Journal of Royal Statistical Society, Series B53: 173–187. 15. Hart, T. D. and Wehrly, T. E. (1986). Kernel regression estimation using repeated measurements data. Journal of the American Statistical Association 81: 1080–1088. 16. Harville, D. A. (1974). Bayesian inference for variance components using only error contrasts. Biometrika 61: 383–385. 17. Hastie, T. J. and Tibshirani, R. J. (1993). Varying-coefficient models. Journal of Royal Statistical Society, Series B55: 757–796. 18. Hoover, D. R., Rice, J. A., Wu, C. O. and Yang, L.-P. (1998). Nonparametric smoothing estimates of time-varying coefficient models with longitudinal data. Biometrika 85: 809–822. 19. Huang, J., Wu, C. O. and Zhou, L. (2002). Varying coefficient models and basis function approximations for the analysis of repeated measurements. Biometrika 89: 111–128. 20. Jones, R. H. and Ackerson, L. M. (1990). Serial correlation in unequally spaced longitudinal data. Biometrika 77: 721–731. 21. Jones, R. H. and Boadi-Boteng, F. (1991). Unequally spaced longitudinal data with serial correlation. Biometrics 47: 161–175. 22. Knafl, G., Sacks, J. and Ylvisaker, D. (1985). Confidence bands for regression functions. Journal of the American Statistical Association 80: 683–691. 23. Kaslow, R. A., Ostrow, D. G., Detels, R., Phair, J. P., Polk, B. F. and Rinaldo, C. R. (1987). The Multicenter AIDS Cohort Study: Rationale, Organization and Selected Characteristics of the Participants. American Journal of Epidemiology 126: 310–318. 24. Laird, N. M. and Ware, J. H. (1982). Random-effects models for longitudinal data. Biometrics 38: 963–974. 25. Liang, K.-Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika 73: 13–22. 26. Moyeed, R. A. and Diggle, P. J. (1994). Rates of convergence in
Regression Models for the Analysis of Longitudinal Data
27. 28. 29. 30.
31. 32. 33. 34. 35. 36. 37. 38. 39.
40.
41.
42.
43.
883
semiparametric modeling of longitudinal data. Australian Journal Statistics 36: 75–93. M¨ uller, H.-G. (1988). Nonparametric regression analysis of longitudinal data. Lecture Notes in Statistics 46, Springer-Verlag, Berlin. Pantula, S. G. and Pollock, K. H. (1985). Nested analysis of variance with autocorrelated errors. Biometrics 41: 909–920. Patterson, H. D. and Thompson, R. (1971). Recovery of inter-block information when block sizes are unequal. Biometrika 58: 545–554. Rice, J. A. and Silverman, B. W. (1991). Estimating the mean and covariance structure nonparametrically when the data are curves. Journal of Royal Statistical Society, Series B53: 233–243. Rice, J. A. and Wu, C. O. (2001). Nonparametric mixed effects models for unequally sampled noisy curves. Biometrics 57: 253–259. Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6: 461–464. Shibata, R. (1981). An optimal selection of regression variables. Biometrika 68: 45–54. Verbeke, G. and Molenberghs, G. (2000). Linear Mixed Models for Longitudinal Data, Springer, New York. Vonesh, E. F. and Chinchilli, V. M. (1997). Linear and Nonlinear Models for the Analysis of Repeated Measurements, Marcel Dekker, New York. Wahba, G. (1990). Spline Models for Observational Data, SIAM, Philadelphia. Ware, J. H. (1985). Linear models for the analysis of longitudinal studies. The American Statistician 39: 95–101. Wu, C. O. and Chiang, C.-T. (2000). Kernel smoothing on varying coefficient models with longitudinal dependent variable. Statistica Sinica 10: 433–456. Wu, C. O., Chiang, C.-T. and Hoover, D. R. (1998). Asymptotic confidence regions for kernel smoothing of a varying-coefficient model with longitudinal data. Journal of the American Statistical Association 93: 1388–1402. Wu, C. O., Yu, K. F. and Chiang, C.-T. (2000). A two-step smoothing method for varying-coefficient models with repeated measurements. Annals Institute Statistics Mathematics 52: 519–543. Wu, C. O., Yu, K. F. and Yuan, V. W. S. (2000). Large sample properties and confidence bands for component-wise varying-coefficient regression with longitudinal dependent variable. Communications in Statistics — Theory Methods 29: 1017–1037. Zeger, S. L. and Diggle, P. J. (1994). Semiparametric models for longitudinal data with application to CD4 cell numbers in HIV seroconverters. Biometrics 50: 689–699. Zeger, S. L., Liang, K.-Y. and Albert, P. S. (1988). Models for longitudinal data: a generalized estimating equation approach. Biometrics 44: 1049–1060.
884
C. O. Wu & K. F. Yu
About the Author Colin O. Wu is currently Mathematical Statistician in the Office of Biostatistics Research, Division of Epidemiology and Clinical Applications, National Heart Lung and Blood Institute. He received his BA (1985) in Mathematics from the University of California, Los Angeles, and his PhD (1990) in Statistics from the University of California, Berkeley. His past academic and research experience includes academic appointments at the Department of Statistics, The University of Michigan, Ann Arbor, and the Department of Mathematical Sciences, The Johns Hopkins University, joint appointment at the Department of Biostatistics, the Johns Hopkins Bloomberg School of Hygiene and Public Health, and guest researcher at the Division of Epidemiology, Statistics and Prevention Research, National Institute of Child Health and Human Development. His publications have spanned over a number of areas including asymptotic efficiency for semiparametric models, asymptotic minimaxity for kernel estimators, density estimation and nonparametric regression for biased sampling data, varyingcoefficient models for longitudinal data and nonparametric mixed-effects models. His current research is focused on the theory and methods of basis function approximations for the modeling, estimation and inferences with longitudinal data.
CHAPTER 24
LOCAL MODELING: DENSITY ESTIMATION AND NONPARAMETRIC REGRESSION JIANQING FAN Department of Statistics, Chinese University of Hong Kong Shatin, Hong Kong Tel: 852-2609-7942; [email protected] RUNZE LI Department of Statistics, 326 Joab L Thomas Building, Penn State University, University Park, PA 16802-2111, USA Tel: (814) 865-1555; [email protected]
1. Introduction Local modeling approaches are useful tools for exploring features of data without imposing a parametric model. These approaches have been received increasing attention in last two decades and successfully applied to various scientific disciplines, such as, economics, engineering, medicine, environmental science, health science and social science. There are a vast amount of literature on this topic.29,34 A comprehensive account of local modeling can be found in the books.6,28,48,72,75,76,85 see also Fan and Gijbels29 and Fan and M¨ uller34 for a brief overview on this topic. In this chapter, we will introduce fundamental ideas of local modeling and illustrate the ideas by real data examples. For ease of presentation, we will omit all technical parts. This chapter basically consists of two parts: Kernel density estimation and local polynomial fitting. In Sec. 2, the kernel density estimation method will be introduced. Important issues, including bandwidth selection, will be addressed. Real data examples will be used to illustrate the ideas how to implement this type of method. Local polynomial regression will be introduced in Sec. 3. In this section, we also discuss how to decide the amount 885
886
J. Fan & R. Li
of smoothing, and extend the ideas of local polynomial regression to other contexts. The idea is further extended to the local likelihood and local partially likelihood in Sec. 4. Section 5 introduces the ideas of nonparametric smoothing tests. Section 6 summarizes some applications of local modeling, including estimation of conditional quantile functions, conditional variance functions and conditional densities, and change point detection. 2. Density Estimation Suppose that X1 , . . . , Xn are an independent and identically distributed sample from a population with an unknown probability density f (x). Of interest is to estimate the density f . In explanatory data analysis, we may construct a histogram for the data. If the resulting histogram has a bell shape, then we may assume that the samples were taken from a normal distribution. In this situation, one may just estimate the population mean and variance using the sample mean and sample variance because a normal distribution is completely determined by its mean and variance. In general, parametric approaches to estimation of a density function assume that the density belongs to a parametric family of distributions, such as normal, gamma or beta family. In order to fully specify the density function, one has to estimate the unknown parameters using, for example, maximum likelihood estimation. One may use prior knowledge or scientific reasons to determine a parametric distribution family. In explanatory data analysis, data analysts frequently construct a histogram based on the sample, and then draw reasonable conclusions on the population density.
2.1. Histogram A histogram is usually formed by partitioning the range of data into equally length intervals, called bins, and then drawing a block over each interval with height being the proportion of the data falling in the bin divided by the width of the bin. Specifically, the histogram estimate at a point x is given by number of observations in the bin containing x , fˆ(x, h) = nh where h is the width of the bins, namely binwidth. For a fixed choice of ˆ h) is a maximum bins, it can be shown that under some mild conditions, f(·, likelihood estimate of the unknown density f . It is worthwhile to note that
Local Modeling: Density Estimation and Nonparametric Regression
887
the nonparametric maximum likelihood estimate of the unknown density f without any further restriction does not exist, since max R
{f :f ≥0, f =1}
n Y
i=1
f (Xi ) = ∞ .
When one constructs a histogram, one has to choose the binwidth and the centers of bins. Figure 1 depicts four histograms based on the same data set and the same binwidth, but using different locations of bin centers. It can be seen from Fig. 1 that the shapes of the resulting histograms are quite different. This implies that the histogram suffers the “edge” effect. Figure 2 shows four histograms of the lengths of crabs, collected from 1973 to 1986, but with different binwidths. The crab data set is available from the 1.5
1.5
1
1
0.5
0.5
0 −2
−1
0
1
2
0 −2
−1
(a) 1.5
1
1
0.5
0.5
−1
0
(c)
1
2
1
2
(b)
1.5
0 −2
0
1
2
0 −2
−1
0
(d)
Fig. 1. Histograms of a sample of size 300 from a mixture of normal distribution 1/3N (−1, 0.12 ) + 1/3N (0, 0.252 ) + 1/3N (1, 0.12 ).
888
J. Fan & R. Li
0.1
0.1
0.08
0.08
0.06
0.06
0.04
0.04
0.02
0.02
0
0
5
10
15
20
25
0
0
5
(a) 0.08
0.06
0.06
0.04
0.04
0.02
0.02
0
5
10
(c) Fig. 2.
15
20
25
15
20
25
(b)
0.08
0
10
15
20
25
0
0
5
10
(d)
Histograms for crab sizes. The data is the length of crab (cm).
website of statlib at Carnegie Mellon University at http://lib.stat.cmu.edu. From Fig. 2, if the binwidth h is too small, then the resulting histogram is rough, on the other hand, if the binwidth is too large, then the resulting histogram is too smooth. Thus constructing a histogram actually is not so simple! Usually one may start from an undersmoothed histogram, and then increase gradually the binwidth until getting a satisfactory result. The histogram is the oldest and most widely used nonparametric estimate of density. The choice of binwidth is a smoothing problem. The edge effect of histograms can be repaired by the kernel density estimation introduced in next section. Furthermore, the kernel estimate will result in a smooth density curve rather than a step function as in histograms. It is an improved technique over the kernel density estimation.
Local Modeling: Density Estimation and Nonparametric Regression
889
Kernel Density Estimate 0.5
density
0.4
0.3
0.2
0.1
0 −3
−2
−1
0 data
1
2
3
Fig. 3. Kernel density estimate for an hypothetical data set (thick curve). It smoothly redistributes the point mass at Xi by the function (nh)−1 K{(x − Xi )/h}. The small bumps show how point masses are redistributed.
2.2. Kernel density estimation A kernel density estimate is defined as fˆh = (nh)−1
n X i=1
K{(x − Xi )/h} ,
R where K(·) is a function satisfying K(x)dx = 1, called a kernel function and h is a positive number, called a bandwidth or a smoothing parameter. A density function such as the plot (thick curve) in Fig. 3 is usually obtained by evaluating the function fˆh (x) over a few hundred of grid points. From the definition, indeed, the kernel estimate is the average of density functions h−1 K{(x − Xi )/h}, which smoothly redistribute the point mass at the point Xi . Figure 3 depicts the redistribution of point masses. To facilitate notation, let Kh (t) = h1 K(t/h) be a rescaling function of K. This allows us to write fˆh = n−1
n X i=1
Kh (x − Xi ) .
(1)
It is well known that the choice of K is not very sensitive, scaled in a canonical form64 to the estimate fˆh (x). Thus it is assumed throughout this chapter that the kernel function is a symmetric probability density
890
J. Fan & R. Li
Symmetric Beta family kernels
Uniform Epanechnikov Biweight Triweight
0.0
0.0
0.2
0.1
0.4
0.2
0.6
0.3
0.8
1.0
0.4
Gaussian kernel
-4
-2
0
2
(a)
4
-1.0
-0.5
0.0
0.5
1.0
(b)
Fig. 4. Commonly-used kernels. (a) Gaussian kernel; (b) Symmetric Beta family of kernels that are renormalized to have maximum height 1.
function. The most commonly used kernel function is the Gaussian density function, given by
$$K(t) = \frac{1}{\sqrt{2\pi}}\exp(-t^2/2). \tag{2}$$
Other popular kernel functions include the symmetric beta family
$$K(t) = \frac{1}{\beta(1/2, \gamma+1)}(1 - t^2)_+^{\gamma}, \qquad \gamma = 0, 1, \ldots, \tag{3}$$
where $+$ denotes the positive part, which is assumed to be taken before exponentiation, so that the support of $K$ is $[-1, 1]$, and $\beta(\cdot,\cdot)$ is the beta function. The corresponding kernel functions for $\gamma = 0, 1, 2$ and $3$ are the uniform, the Epanechnikov, the biweight and the triweight kernel functions. Figure 4 shows these kernel functions.

The smoothing parameter $h$ controls the smoothness of the density estimate, acting as the binwidth does for histograms. The choice of the bandwidth is of crucial importance. If $h$ is chosen too large, the resulting estimate misses fine features of the data, while if $h$ is selected too small, spurious sharp structures become visible; see Fig. 6 for an example. In fact, it can be shown that under some mild conditions, when $n \to \infty$, $h \to 0$ and $nh \to \infty$,
$$E\hat{f}_h(x) - f(x) = \frac{f''(x)}{2}\mu(K)h^2 + o(h^2) \tag{4}$$
Fig. 5. Automatic kernel density estimate using the bandwidth chosen according to the rule of thumb. The data set is the crab size data collected from 1973 to 1986.
and
$$\mathrm{var}\{\hat{f}_h(x)\} = \frac{R(K)f(x)}{nh}(1 + o(1)), \tag{5}$$
where $\mu(K) = \int t^2 K(t)\,dt$ and $R(K) = \int K^2(t)\,dt$. Thus, from (4) and (5), a large bandwidth $h$ results in a large bias while a small bandwidth produces an estimate with a large variance. A good choice of bandwidth would balance the bias and variance trade-off. This is conveniently assessed by the Asymptotic Mean Integrated Square Error (AMISE), which is defined as
$$\mathrm{AMISE}(h) = \frac{\mu^2(K)h^4}{4}\int\{f''(x)\}^2\,dx + \frac{R(K)}{nh}. \tag{6}$$
Minimizing (6) with respect to $h$ gives the ideal bandwidth
$$h_I = \left[\frac{R(K)}{\mu^2(K)\int\{f''(x)\}^2\,dx}\right]^{1/5} n^{-1/5}, \tag{7}$$
which involves the unknown density function and therefore cannot be used directly in kernel smoothing. Since the choice of bandwidth is critical to kernel density estimation, there is a large literature on this topic; see Jones et al.56,57 for a survey. In practice, we may take the Gaussian density with variance
Fig. 6. A family of kernel estimates. The data set is the crab size data. The thick curve corresponds to $\hat{h}_I$.
$\sigma^2$ as a reference density. In this situation, Eq. (7) becomes
$$h_I = \left\{\frac{8\sqrt{\pi}R(K)}{3\mu^2(K)}\right\}^{1/5}\sigma n^{-1/5}. \tag{8}$$
Here we focus on a rule of thumb.75 The rule-of-thumb bandwidth is obtained by replacing $\sigma$ by the sample standard deviation $s_n$. Thus, for the Gaussian kernel,
$$\hat{h}_I = 1.06\,s_n n^{-1/5},$$
and for the symmetric beta family
$$\hat{h}_I = \left\{\frac{8\sqrt{\pi}\,\beta(1/2, 2\gamma+1)}{\{\beta(3/2, \gamma+1)\}^2}\right\}^{1/5} s_n n^{-1/5}.$$
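A minimal sketch of the Gaussian-kernel rule of thumb follows; the sample below is simulated and only stands in for the crab data.

```python
import numpy as np

def rule_of_thumb_bandwidth(data):
    # Normal-reference rule of thumb for the Gaussian kernel:
    # h = 1.06 * s_n * n^(-1/5), with s_n the sample standard deviation.
    data = np.asarray(data)
    return 1.06 * data.std(ddof=1) * data.size ** (-1 / 5)

# Example: bandwidth for a simulated sample standing in for the crab sizes.
sample = np.random.default_rng(2).normal(14, 3, size=400)
h_hat = rule_of_thumb_bandwidth(sample)
```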
Figure 5 depicts a kernel density estimate of the length of crabs using the bandwidth $\hat{h}_I$ with the Gaussian kernel. From the shape of the estimated density curve, it seems that a normal distribution is not appropriate for modeling the crab size. While the rule of thumb works well for many data sets, it tends to produce oversmoothed estimates because the reference density is Gaussian. Another way to avoid choosing a single optimal bandwidth is the family smoothing approach. This can be done by using a family of estimates
$$\{\hat{f}_h,\ h = 1.4^j\hat{h}_I,\ j = -3, -2, -1, 0, 1, 2\} \tag{9}$$
Fig. 7. Comparison of the length of crabs between 1976 and 1986. The sample mean and standard deviation for 1976 are 12.9020 and 4.2905, while the sample mean and standard deviation for 1986 are 14.4494 and 2.6257, respectively.
and then overlaying them in the same plot. The family smoothing approach allows us to explore possible patterns contained in the data using different scales of bandwidth. This is closely related to scale space ideas in computer science. Choosing a smaller bandwidth acts as a "zoom in", while selecting a larger bandwidth corresponds to a "zoom out" in the scale space. These ideas have been further developed in the SiZer map,13 which can detect significant features in the estimated curve at different scales. Figure 6 depicts a family smoothing plot for the crab size data.

The density estimation method is also a powerful graphical tool for comparing the results of two experiments. This is related to the classical two-sample mean problem. The advantage of the kernel smoothing approach over traditional two-sample tests is that the smoothing approach can show an overall pattern of the experiments, including the locations of the centers and the dispersions of the data. Further, it gives us some idea of the two population distributions. To illustrate the idea, we applied the smoothing techniques to two subsets of the crab size data: one contains the 1976 data and the other the 1986 data. The two estimated density curves are depicted in Fig. 7. They have different centers and dispersions.

In this section, the bandwidth remains constant; that is, it depends on neither the location x nor the datum point Xi. This kind of bandwidth is referred to as a global bandwidth. From (7), it is desirable to use a larger
bandwidth where the curvature of the underlying density changes slowly and a smaller bandwidth where it changes dramatically. This leads to the study of variable bandwidth selection, which suggests using a different bandwidth at different locations x. Usually, a global bandwidth is easier to choose than a variable bandwidth. In order to use a constant bandwidth, one may first transform the data by
$$Y_i = g(X_i), \qquad i = 1, \ldots, n,$$
where $g$ is a given monotone increasing function. The transformation $g$ should be chosen so that the transformed data have a density with a more homogeneous degree of smoothness, so that a global bandwidth for the transformed data is more appropriate. Then apply the kernel density estimate to the transformed data set to obtain the estimate $\hat{f}_Y(y)$. Finally, apply the inverse transform to obtain the density of X:
$$\hat{f}_X(x) = g'(x)\hat{f}_Y(g(x)) = g'(x)n^{-1}\sum_{i=1}^{n} K_h(g(x) - g(X_i)).$$
The performance of this type of estimate has been illustrated in Wand et al.86 Marron and Yang (1999) proposed an approach to selecting a good transformation g.

3. Local Polynomial Fitting

Regression is one of the most useful techniques in statistics. Consider the (d+1)-dimensional data $(X_1, Y_1), \ldots, (X_n, Y_n)$, which form an independent and identically distributed sample from a population (X, Y), where X is a d-dimensional random vector and Y is a random variable. Of interest is to estimate the regression function $m(x) = E(Y|X = x)$. In other words, the data are regarded as realizations from the model
$$Y = m(X) + \varepsilon,$$
where $\varepsilon$ is a random error with zero mean. For a given data set, one may try to fit the data using a linear regression model. If a nonlinear pattern appears in the scatter plot of Y against X, one may employ polynomial regression to reduce the modeling bias of linear regression. Consider, for example, the data plotted in Fig. 8, where the relationship between the concentration of nitric oxides in engine exhaust (taken as the dependent variable) and the equivalence ratio (taken as the independent variable), a measure of the richness of the air/ethanol mix, is depicted for the burning of ethanol in
Fig. 8. Polynomial fits to the ethanol data. Presented are the scatter plots of the concentration of nitric oxides against the equivalence ratio along with the fitted polynomial regression functions. Adapted from Fan and Gijbels. 29
a single-cylinder automobile test engine. From Fig. 8, it can be seen that the relationship between the concentration of nitric oxides and the equivalence ratio is highly nonlinear. Polynomial regression is used to fit the data. Figure 8 presents the resulting fits by using four different degrees of polynomials. One can easily see that all resulting fits have substantial biases.
Because polynomial functions have derivatives of all orders everywhere, and the polynomial degree cannot be controlled continuously, polynomial functions are not very flexible in modeling the features encountered in practice. Further, individual observations can have a large influence on remote parts of the curve in polynomial regression models. Nonparametric regression techniques can be used to repair these drawbacks of polynomial fitting. Fan and Gijbels28 give a detailed background and an excellent overview of various nonparametric regression techniques, which can be classified into two categories: one approximates the regression function globally, and the other parameterizes the regression function m(x) locally. Two common methods of global approximation are the spline approach and the orthogonal series method. In this section, we focus on the techniques of local modeling.

3.1. Kernel regression

Consider the bivariate data $(X_1, Y_1), \ldots, (X_n, Y_n)$, an i.i.d. sample from the model
$$Y = m(X) + \varepsilon,$$
where $\varepsilon$ is a random error with $E(\varepsilon|X) = 0$ and $\mathrm{var}(\varepsilon|X = x) = \sigma^2(x)$. The nonparametric regression problem is to estimate the regression function m(·) without imposing a parametric form. Usually, a datum point closer to x carries more information about the value of m(x). Therefore an intuitive estimator for the regression function m(x) is the running local average. An improved version of this is the locally weighted average. That is,
$$\hat{m}(x) = \sum_{i=1}^{n} w_i(x)Y_i \Big/ \sum_{i=1}^{n} w_i(x).$$
An alternative interpretation of locally weighted average estimators is that the resulting estimator is the solution to the following weighted least-squares problem:
$$\min_{\theta}\sum_{i=1}^{n}(Y_i - \theta)^2 w_i(x).$$
In other words, the kernel regression estimator is a weighted least-squares estimate at the point x using a local constant approximation.
Table 1. Leading terms in the asymptotic biases and variances.25

  Method          Bias                                    Variance
  NW Estimator    {m''(x) + 2m'(x)f'(x)/f(x)} b_n         V_n
  GM Estimator    m''(x) b_n                              1.5 V_n
  Local Linear    m''(x) b_n                              V_n

Here $b_n = \frac{1}{2}\int_{-\infty}^{+\infty} u^2 K(u)\,du\, h^2$ and $V_n = \frac{\sigma^2(x)}{f(x)nh}\int_{-\infty}^{+\infty} K^2(u)\,du$.
Setting the weights $w_i(x) = K_h(X_i - x)$ results in the NW kernel regression estimator, which is given by68,87
$$\hat{m}_h(x) = \frac{\sum_{i=1}^{n} K_h(X_i - x)Y_i}{\sum_{i=1}^{n} K_h(X_i - x)}. \tag{10}$$
See Nadaraya68 and Watson.87 Since the denominator in (10) is a random variable, it is inconvenient to take derivatives with respect to x and to derive the asymptotic properties of the estimator. Assume that the data have already been sorted according to the X-variable. Taking the local weights $w_i(x) = \int_{s_{i-1}}^{s_i} K_h(u - x)\,du$ with $s_i = (X_i + X_{i+1})/2$, $X_0 = -\infty$ and $X_{n+1} = +\infty$, we obtain the GM regression estimator, given by
$$\hat{m}_h(x) = \sum_{i=1}^{n}\left\{\int_{s_{i-1}}^{s_i} K_h(u - x)\,du\right\}Y_i.$$
See Gasser and Müller.41
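A minimal sketch of the NW estimator (10) with a Gaussian kernel; the data and bandwidth below are placeholders introduced only for illustration.

```python
import numpy as np

def nw_estimator(x_grid, X, Y, h):
    # Nadaraya-Watson estimator (10): locally weighted average of the Y's.
    K = lambda t: np.exp(-0.5 * t**2) / np.sqrt(2 * np.pi)
    W = K((x_grid[:, None] - X[None, :]) / h)   # kernel weights K_h(X_i - x)
    return (W @ Y) / W.sum(axis=1)

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, 200)
Y = np.sin(2 * np.pi * X) + 0.2 * rng.normal(size=200)
m_hat = nw_estimator(np.linspace(0, 1, 101), X, Y, h=0.05)
```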
Just like the kernel density estimate, the choice of bandwidth is critical to the quality of the estimate. Too large a bandwidth yields an oversmoothed estimate, while too small a bandwidth gives a rough estimate. The basic asymptotic properties of the NW and GM regression estimators have been well established; the asymptotic biases and variances of these two estimators are given in Table 1.23 The properties of the GM estimator were established in Mack and Müller63 and Chu and Marron.16

3.2. Local polynomial regression

As indicated in the last section, both the NW estimator and the GM estimator are local constant fits. It is natural to extend this to a local polynomial fit. The idea of local polynomial regression has been around for a long time. Since both local constant and local polynomial fits effectively use
data points in a local neighborhood, this idea is referred to as local modeling. It appeared early in the statistical literature.17,79 Stone80,81 shows that local regression achieves optimal rates in a minimax sense. Müller16 establishes the equivalence between a local polynomial fit and a local constant fit under an equally-spaced design model. Fan23,24 focuses on local linear regression in the random design case and shows that it has many advantages, such as a simple expression for the local bias and variance, spatial adaptation and high minimax efficiency. Fan and Gijbels28 proved that theoretically the local linear regression estimator adapts automatically to the boundary. This was also empirically observed by Tibshirani and Hastie.82 Ruppert and Wand70 extended the results of Fan and Gijbels28 to the case of local polynomial estimation. A thorough study of this topic can also be found in Chaps. 3 and 4 of Fan and Gijbels.28

Suppose that the regression function m is smooth. For z in a neighborhood of x, it follows from Taylor's expansion that
$$m(z) \approx \sum_{j=0}^{p}\frac{m^{(j)}(x)}{j!}(z - x)^j \equiv \sum_{j=0}^{p}\beta_j(z - x)^j. \tag{11}$$
Thus, for $X_i$ close enough to $x$,
$$m(X_i) \approx \sum_{j=0}^{p}\beta_j(X_i - x)^j \equiv X_i^T\beta,$$
where $X_i = (1, (X_i - x), \ldots, (X_i - x)^p)^T$ and $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$. Intuitively, data points further from x carry less information about m(x). This suggests using the locally weighted polynomial regression
$$\sum_{i=1}^{n}(Y_i - X_i^T\beta)^2 K_h(X_i - x). \tag{12}$$
Denote by $\hat{\beta}_j$ ($j = 0, \ldots, p$) the minimizer of (12). The above exposition suggests that an estimator for the regression function at $x_0$ is
$$\hat{m}(x_0) = \hat{\beta}_0(x_0). \tag{13}$$
Furthermore, an estimator for the $\nu$th order derivative of m at $x_0$ is
$$\hat{m}_\nu(x_0) = \nu!\hat{\beta}_\nu(x_0).$$
In general, local polynomial fitting has certain advantages over the NW and the GM estimators, not only for estimating regression curves but also for derivative estimation.
Fig. 9. Illustration of the local linear fit. For each given x0, a linear regression is fitted through the data contained in the strip x0 ± h, using the weight function indicated at the bottom of the strip. The intersections of the fitted lines and the short dashed lines are the local linear fits. Adapted from Fan and Gijbels.29
To better appreciate the above local polynomial regression, consider the ethanol data presented in Fig. 8. The window size h is taken to be 0.051 and the kernel is the Epanechnikov kernel. To estimate the regression function at the point x0 = 0.8, we use the local data in the strip x0 ± h to fit a regression line (cf. Fig. 9). The local linear estimate at x0 is simply the intersection of the fitted line and the line x = x0. If we wish to estimate the regression function at another point, say x0 = 0.88, another line is fitted using the data in the window 0.88 ± 0.051. The whole curve is obtained by estimating the regression function on a grid of points. Indeed, the curve in Fig. 9 was obtained from 101 local linear regressions, taking 101 grid points from 0.0535 to 1.232.

The local linear regression smoother is particularly simple to implement. Indeed, the estimator has the simple expression
$$\hat{m}_L(x) = \sum_{i=1}^{n} w_i(x)Y_i, \tag{14}$$
where, with $S_{n,j}(x) = \sum_{i=1}^{n} K_h(X_i - x)(X_i - x)^j$,
$$w_i(x) = K_h(X_i - x)\{S_{n,2}(x) - (X_i - x)S_{n,1}(x)\}/\{S_{n,0}(x)S_{n,2}(x) - S_{n,1}^2(x)\}. \tag{15}$$
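The following sketch implements (14)-(15) directly with the Epanechnikov kernel; the simulated data only stand in for the ethanol data, and the bandwidth value is the one quoted above.

```python
import numpy as np

def local_linear(x_grid, X, Y, h):
    # Local linear smoother via the explicit weights (14)-(15),
    # using the Epanechnikov kernel K(t) = 0.75 (1 - t^2)_+.
    K = lambda t: 0.75 * np.clip(1 - t**2, 0, None)
    fits = np.empty(len(x_grid))
    for k, x0 in enumerate(x_grid):
        Kx = K((X - x0) / h) / h
        S0, S1, S2 = (np.sum(Kx * (X - x0)**j) for j in range(3))
        w = Kx * (S2 - (X - x0) * S1) / (S0 * S2 - S1**2)
        fits[k] = np.sum(w * Y)
    return fits

# Example usage on simulated data (a stand-in for the ethanol data).
rng = np.random.default_rng(5)
X = rng.uniform(0.5, 1.25, 88)
Y = np.sin(6 * (X - 0.5)) + 0.3 * rng.normal(size=88)
curve = local_linear(np.linspace(0.55, 1.2, 101), X, Y, h=0.051)
```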
We can either use the explicit formula (15) or a regression package to compute it. The estimator has several nice properties, such as high statistical efficiency (in an asymptotic minimax sense), design adaptation24 and good boundary behavior.28,70 The asymptotic bias and variance of this estimator are
$$E\{\hat{m}_L(x)|X_1, \ldots, X_n\} - m(x) = \mu(K)\frac{m''(x)}{2}h^2 + o(h^2) \tag{16}$$
and
$$\mathrm{var}\{\hat{m}_L(x)|X_1, \ldots, X_n\} = R(K)\frac{\sigma^2(x)}{f(x)nh} + o\!\left(\frac{1}{nh}\right), \tag{17}$$
provided that the bandwidth h tends to zero in such a manner that nh → ∞, where f is the marginal density of X, namely, the design density.29 Table 1 lists the leading terms in the asymptotic bias and variance. By comparing the leading terms in the asymptotic variance, clearly the local linear fit uses locally one extra parameter without increasing its variability, yet this extra parameter creates opportunities for significant bias reduction, particularly in boundary regions and sloped regions. This is evidenced by comparing the asymptotic biases.

Local linear fitting requires a choice for the smoothing parameter h and for the kernel function K. It is well known that the choice of the kernel function is of less importance in kernel smoothing, and this holds true for local polynomial regression. It has been shown that the Epanechnikov kernel is optimal in a certain sense; see Gasser, Müller and Mamitzsch,42 Granovsky and Müller45 and Chap. 3 of Fan and Gijbels.28 The bandwidth selection is critical for all nonparametric estimators. Too large a bandwidth creates excessive bias in the nonparametric estimate, and too small a bandwidth results in a large variance. There are two basic choices of bandwidth: subjective and data-driven. In subjective choices, data analysts use different bandwidths to estimate the regression function and choose the one that visually balances the bias and variance trade-off; trial and error is needed in this endeavor. Alternatively, one can present the nonparametric estimates using a few different bandwidths (cf. Fig. 6 for a similar idea). A data-driven bandwidth lets the data themselves choose a bandwidth that balances the bias and variance, by minimizing a certain estimated Mean Integrated Square Error (MISE). We now briefly discuss some data-driven choices of the bandwidth. By (16) and (17), the weighted MISE of the local linear estimator is
$$\frac{\mu(K)^2 h^4}{4}\int\{m''(x)\}^2 w(x)\,dx + \frac{R(K)}{nh}\int\frac{\sigma^2(x)}{f(x)}w(x)\,dx.$$
The asymptotically optimal bandwidth, which minimizes the asymptotic weighted MISE of $\hat{m}_L(x)$, is given by
$$h_{\mathrm{opt}} = \left[\frac{R(K)\int\sigma^2(x)f^{-1}(x)w(x)\,dx}{\mu^2(K)\int\{m''(x)\}^2 w(x)\,dx}\right]^{1/5} n^{-1/5}, \tag{18}$$
where w(x) is a weight function. The optimal bandwidth involves the unknown regression function and the unknown density function of X; hence it cannot be applied directly. There are many references on the topic of bandwidth selection; see Chap. 4 of Fan and Gijbels28 and references therein. Here, we focus on the cross-validation method, which is conceptually simple but computationally intensive. Let $\hat{m}_{h,(-i)}(x)$ be the local linear regression estimator (12) computed without the ith observation $(X_i, Y_i)$. We then validate the "goodness of fit" by measuring the "prediction error" $Y_i - \hat{m}_{h,(-i)}(X_i)$. The cross-validation criterion measures the overall "prediction errors" and is defined by
$$CV(h) = n^{-1}\sum_{i=1}^{n}\{Y_i - \hat{m}_{h,(-i)}(X_i)\}^2. \tag{19}$$
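The following sketch computes (19) by leave-one-out refitting and searches a geometric grid of candidate bandwidths similar to the one used in the example below; the data here are simulated placeholders.

```python
import numpy as np

def local_linear_at(x0, X, Y, h):
    # Local linear estimate at a single point x0 (Epanechnikov kernel).
    K = lambda t: 0.75 * np.clip(1 - t**2, 0, None)
    Kx = K((X - x0) / h) / h
    S0, S1, S2 = (np.sum(Kx * (X - x0)**j) for j in range(3))
    w = Kx * (S2 - (X - x0) * S1) / (S0 * S2 - S1**2)
    return np.sum(w * Y)

def cv_score(h, X, Y):
    # Leave-one-out cross-validation criterion (19).
    errs = [Y[i] - local_linear_at(X[i], np.delete(X, i), np.delete(Y, i), h)
            for i in range(len(X))]
    return np.mean(np.square(errs))

rng = np.random.default_rng(6)
X = rng.uniform(0, 10, 150)
Y = np.cos(X) + 0.3 * rng.normal(size=150)
grid = [0.15 * 1.1**j * (X.max() - X.min()) for j in range(20)]
h_cv = min(grid, key=lambda h: cv_score(h, X, Y))
```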
The cross-validation bandwidth selector $\hat{h}_{CV}$ chooses the bandwidth that minimizes CV(h). In what follows, we illustrate the methodology of local linear regression in detail with an environmental data set. This data set consists of 612 observations on 15 variables and has been analyzed by Rawlings and Spruill;69 see Sec. 2 of Rawlings and Spruill69 for a detailed description. Here, we are interested in how the depth to mottling (DMOT) of soil affects the increment of diameter growth of some kinds of pine. Thus we take the increment of diameter as the response variable Y and the DMOT of the soil as the independent variable X. After excluding the data points with missing values, we have 216 observations. The scatter plot of the data is depicted in Fig. 10. The cross-validation method was used to search for a bandwidth over the 20 grid points 0.15 × 1.1^j times the range of the X variable, j = 0, ..., 19. With the smallest bandwidth, 0.15 times the range of X, we used about 15% of the data around x0 to estimate m(x0), while with the largest bandwidth, 0.15 × 1.1^19 times the range of X, we used about 92% of the data around x0. Here the Epanechnikov kernel was used. The plot of cross-validation scores against candidate bandwidths is depicted in Fig. 10(a). The corresponding $\hat{h}_{CV}$ is 12.776.

With the selected bandwidth, we are able to estimate the regression function. In nonparametric regression, one usually plots the curve of the
estimated regression function. Thus one has to evaluate the regression function over a grid of points, usually taken to be evenly distributed over the range of X. A natural question is how many grid points are needed. Figures 10(c)–(e) depict the resulting estimated curves with the number of grid points (Ngrid) being 100, 200 and 400, respectively. The plots show a nonlinear relationship between the increment and the DMOT. From these plots, the estimated curves
Fig. 10. Estimated regression functions. (a) and (b) are plots of cross-validation scores for the increment of diameter versus depth to mottling (DMOT) and versus log(DMOT), respectively. (c)–(e) are estimated regression function curves E(increment|DMOT) with the scatter plot of the data, corresponding to 100, 200 and 400 grid points, respectively. (h) is the estimated regression curve E(increment|log(DMOT)).
are almost the same, since the underlying estimate is relatively smooth. In practice, we recommend using 100 or 200 grid points to evaluate estimated regression functions. Now we take the natural logarithm of DMOT as the X-variable and then use the cross-validation method to choose a bandwidth. The CV scores are depicted in Fig. 10(b); this yields $\hat{h}_{CV} = 1.3271$. The estimated curve is depicted in Fig. 10(h), which shows that the increment of diameter growth versus log(DMOT) is nearly linear. This implementation took about 2 seconds (using MATLAB on a Pentium II 450 MHz PC) to compute the estimated function over 200 grid points, including bandwidth selection via cross-validation.

Direct implementation of local polynomial regression for a large data set needs a considerable amount of computation. Fast computation algorithms have been proposed in Fan and Marron.33 Many computer codes are available through the internet. For example, S-plus codes can be downloaded from Matt Wand's homepage at http://www.biostat.harvard.edu/ mwand/software.html, while Matlab codes can be downloaded from James S. Marron's homepage at http://www.stat.unc.edu/faculty/marron/marron software.html or through the authors. These codes can easily be used by directly plugging in data. There is also a kernel smoothing procedure in the latest version of SAS.
4. Local Likelihood and Local Partial Likelihood

The local likelihood approach was first proposed by Tibshirani and Hastie82 based on the running line smoother. As an extension of the local likelihood approach, local quasi-likelihood estimation using local constant fits was considered by Severini and Staniswalis.74 Fan, Heckman and Wand31 investigated the asymptotic properties of the local quasi-likelihood method using local polynomial modeling. Fan et al.27 addressed the issues of bandwidth selection and bias and variance assessment, and constructed confidence intervals in local maximum likelihood estimation. Fan and Chen26 proposed one-step local quasi-likelihood estimation and demonstrated that the one-step local quasi-likelihood estimator performs as well as the maximum local quasi-likelihood estimator using the ideal optimal bandwidth. Fan et al.30 extended the local likelihood approach to local partial likelihood in the context of censored survival data analysis, such as Cox's regression model. The ideas in this section are motivated by Fan et al.30,31 Carroll et al.11 extend the idea further to likelihood equations.

4.1. Generalized linear models and local likelihood estimate

4.1.1. Generalized linear models

Generalized linear models, introduced by Nelder and Wedderburn in 1972, extend the scope of the traditional least squares fitting of linear models. The relationship between a response variable and a set of covariates is modeled as a linear fit to the transformed conditional mean. A comprehensive account of generalized linear models can be found in McCullagh and Nelder.65 Suppose that we have n independent observations $(X_1, Y_1), \ldots, (X_n, Y_n)$ of a random vector (X, Y), where X is a d-dimensional real vector of covariates and Y is a scalar response variable. The conditional density of Y given covariate X = x belongs to the canonical exponential family:
$$f_{Y|X}(y|x) = \exp\{[\theta(x)y - b\{\theta(x)\}]/a(\phi) + c(y, \phi)\} \tag{20}$$
for known functions a(·), b(·) and c(·, ·). In parametric generalized linear models it is usual to model a transformation of the regression function m(x) = E(Y|X = x) as linear, that is,
$$\eta(x) = g\{m(x)\} = x^T\beta,$$
where g is a known link function. If $g = (b')^{-1}$, then g is called the canonical link, because it transforms the regression function into the canonical parameter: $(b')^{-1}\{m(x)\} = \theta(x)$.
Here are a few examples that illustrate model (20). The first example is that the conditional distribution of Y given X = x is normal with mean m(x) and variance $\sigma^2$. The normal density can be rewritten as
$$f_{Y|X}(y|x) = \exp\left\{\frac{m(x)y - m^2(x)/2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \log(\sqrt{2\pi}\sigma)\right\}.$$
It can easily be seen that $\phi = \sigma^2$, $a(\phi) = \phi$, $b(m) = m^2/2$ and $c(y, \phi) = -y^2/(2\phi) - \log(\sqrt{2\pi\phi})$. The canonical link function is the identity link g(t) = t. This model is useful for a continuous response with homoscedastic errors.

Suppose next that the conditional distribution of Y given X = x is Bernoulli with probability of success p(x), in which case
$$f_{Y|X}(y|x) = \exp\left(y\log[p(x)/\{1 - p(x)\}] + \log\{1 - p(x)\}\right).$$
The canonical parameter in this example is θ(x) = logit{p(x)}, and the logit function is the canonical link.

Under model (20), it can easily be shown that the conditional mean and conditional variance are given respectively by $m(x) = E(Y|X = x) = b'\{\theta(x)\}$ and $\mathrm{var}(Y|X = x) = a(\phi)b''\{\theta(x)\}$. Hence,
$$\theta(x) = (b')^{-1}\{m(x)\}.$$
Using the definition of η(·), we have
$$\theta(x) = (b')^{-1}\{g^{-1}[\eta(x)]\}. \tag{21}$$
Since our primary interest is to estimate the mean function, without loss of generality the factors related to the dispersion parameter φ are omitted. This leads to the following conditional log-likelihood function:
$$\ell\{\theta, y\} = \theta(x)y - b\{\theta(x)\}.$$
By (21), the above log-likelihood can be expressed as
$$\ell\{\theta, y\} = y\,(b')^{-1}\circ g^{-1}(\eta(x)) - b\{(b')^{-1}\circ g^{-1}(\eta(x))\}, \tag{22}$$
where ◦ denotes composition. In particular, when g is the canonical link,
$$\ell\{\theta, y\} = \eta(x)y - b\{\eta(x)\}.$$
4.1.2. Local likelihood estimate

It has been of interest to adapt these models to situations where the functional form of the dependence of g(m(x)) on x is unknown. In what follows, the covariate X is assumed to be a scalar random variable. If η(x) is a smooth function of x, then for $X_i$ close enough to a given point $x_0$,
$$\eta(X_i) \approx \sum_{j=0}^{p}\beta_j(X_i - x_0)^j \equiv X_i^T\beta, \tag{23}$$
where $X_i = (1, (X_i - x_0), \ldots, (X_i - x_0)^p)^T$ and $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$. Intuitively, data points close to $x_0$ carry more information about $\eta(x_0)$ than those away from $x_0$. Therefore, by (22), the local log-likelihood function based on the random sample $\{(X_i, Y_i)\}_{i=1}^n$ is
$$\ell(\beta) = \sum_{i=1}^{n}[Y_i\,(b')^{-1}\circ g^{-1}(X_i^T\beta) - b\{(b')^{-1}\circ g^{-1}(X_i^T\beta)\}]K_h(X_i - x_0). \tag{24}$$
Define the local maximum likelihood estimator of β to be
$$\hat\beta = \arg\max_{\beta\in R^{p+1}}\ \ell(\beta).$$
Thus $\eta(x_0)$ and the νth derivative of η at $x_0$ can be estimated by $\hat\eta(x_0) = \hat\beta_0$ and $\hat\eta^{(\nu)}(x_0) = \nu!\hat\beta_\nu$, respectively, assuming that η has p derivatives. When the canonical link $g = (b')^{-1}$ is used, (24) becomes
$$\ell(\beta) = \sum_{i=1}^{n}[Y_i X_i^T\beta - b(X_i^T\beta)]K_h(X_i - x_0).$$
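As a concrete illustration, the sketch below maximizes the local log-likelihood with the logit (canonical Bernoulli) link by Newton-Raphson; the data, bandwidth and point x0 are placeholders and the Gaussian kernel is an arbitrary choice.

```python
import numpy as np

def local_logistic(x0, X, Y, h, p=1, iters=25):
    # Local polynomial logistic regression: maximizes the local
    # log-likelihood (24) with the canonical (logit) link by Newton-Raphson.
    D = np.vander(X - x0, p + 1, increasing=True)      # design: (1, X_i - x0, ...)
    w = np.exp(-0.5 * ((X - x0) / h) ** 2) / (h * np.sqrt(2 * np.pi))  # K_h weights
    beta = np.zeros(p + 1)
    for _ in range(iters):
        mu = 1 / (1 + np.exp(-D @ beta))                # fitted probabilities
        grad = D.T @ (w * (Y - mu))                     # gradient of the local likelihood
        hess = -(D * (w * mu * (1 - mu))[:, None]).T @ D
        beta -= np.linalg.solve(hess, grad)             # Newton step
    return beta                                         # beta[0] estimates eta(x0)

# Example: estimate the logit of P(Y=1 | X = 0.5) from simulated data.
rng = np.random.default_rng(7)
X = rng.uniform(0, 1, 300)
Y = rng.binomial(1, 1 / (1 + np.exp(-(2 * X - 1))))
eta_hat = local_logistic(0.5, X, Y, h=0.2)[0]
```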
The log-likelihood function (24) is really a weighted log-likelihood and hence can be computed using existing software. In fact, suppose that we want to estimate η̂(·) on a given interval. Take a grid of points (200, say) in that interval. For each grid point x0, (24) can be maximized using existing software packages, such as SAS and S-plus, that contain a parametric GLIM function. The whole estimated function is obtained by plotting the estimates obtained at the grid points. The choice of the link function g is not as crucial as for parametric generalized linear models, because the fitting is localized. Indeed, it is conceivable to dispense with the link function and just estimate m(x) directly. But there are several drawbacks to having the link equal to the identity. An
identity link may yield a local likelihood that is not convex, allowing for the possibility of multiple maxima, inconsistency and computational problems. Use of the canonical link guarantees convexity. Furthermore, the canonical link ensures that the final estimate has the correct range. For example, in the logistic regression context, using the logit link leads to an estimate that is always a probability, whereas using the identity link does not guarantee this. A final reason for preferring the canonical link is that the estimate of m(x) approaches the usual parametric estimate as the bandwidth becomes large, which can be useful as a diagnostic tool.31

We now illustrate the local likelihood approach by analyzing the burn data set, collected by the General Hospital Burn Center at the University of Southern California. It is of interest to estimate the probability of survival given the age of the victim. The local likelihood estimate was computed over a grid of points with bandwidth equal to 0.4 times the range of X, and the estimated curves are depicted in Fig. 11. Note that the conditional probability function must be monotonic for the parametric linear model, whereas for the local linear model the conditional probability function can be any curve. The former model can overstate the probability of survival for the younger group and for the senior group. The solid curves in Fig. 11 suggest
Fig. 11. Illustration of local likelihood approach for the burn data. (a) Estimated logit transform of the conditional probability. (b) Estimated conditional probability. Solid curve — local modeling with about 40% of the data points in each local neighborhood; dashed curve — global parametric logit linear model. Taken from Fan and Gijbels. 29
that the conditional probability function is unimodal, which is reasonable in the current context.

4.2. Local partial likelihood estimate

In this section, we apply the local likelihood techniques to survival data analysis. Let T, C and X be respectively the survival time, the censoring time and the associated covariate. Correspondingly, let Z = min{T, C} be the observed time and δ = I(T ≤ C) the censoring indicator. It is assumed that T and C are conditionally independent given X and that the censoring mechanism is noninformative. Suppose that $\{(X_i, Z_i, \delta_i) : i = 1, \ldots, n\}$ is an i.i.d. sample from the population (X, Z, δ). For a thorough introduction to survival analysis, see the books by Fleming and Harrington40 and Andersen et al.2 Let h(t|x) be the conditional hazard rate function. The proportional hazards model assumes that
$$h(t|x) = h_0(t)\exp\{\theta(x)\}. \tag{25}$$
This model indicates that the covariate x inflates or deflates the hazard risk by a factor of exp{θ(x)}. The function θ(x) is called a hazard regression function, and characterizes the risk contribution of the covariate at value x. See Cox19 for proportional hazards models with time-dependent covariates. In the parametric model, a linear form θ(x) = βx is imposed on the hazard regression function. The local modeling methodology aims at removing this restriction and exploring possible nonlinearity, and is applicable to any smooth hazard regression function. For simplicity of discussion, we focus on the univariate case; for multivariate settings, a dimensionality reduction technique such as additive modeling should be used.52

A commonly-used technique for estimating the hazard regression function is the partial likelihood technique introduced by Cox.20 Let $t_1^o < \cdots < t_N^o$ denote the ordered observed failure times. Let (j) denote the label of the item failing at $t_j^o$, so that the covariates associated with the N failures are $X_{(1)}, \ldots, X_{(N)}$. Denote by $R_j = \{i : Z_i \ge t_j^o\}$ the risk set at the time instant just before $t_j^o$. Then the log-partial likelihood in our context is given by
$$\sum_{j=1}^{N}\left[\theta(X_{(j)}) - \log\left\{\sum_{k\in R_j}\exp\{\theta(X_k)\}\right\}\right]. \tag{26}$$
See Cox,20 Fleming and Harrington40 and Fan and Gijbels.28 Substituting the parametric form of θ(·) into (26) yields a maximum partial likelihood estimate of the hazard regression function.

We now apply the local modeling technique to estimate the hazard regression function θ(·). For a given x0, approximate θ(x) by
$$\theta(x) \approx \beta_0 + \cdots + \beta_p(x - x_0)^p \tag{27}$$
for x in a neighborhood of x0. Let
$$\beta = (\beta_1, \ldots, \beta_p)^T \quad\text{and}\quad X_j = \{(X_j - x_0), \ldots, (X_j - x_0)^p\}^T.$$
Then the local partial likelihood is
$$\sum_{j=1}^{N} K_h(X_{(j)} - x_0)\left[X_{(j)}^T\beta - \log\left\{\sum_{k\in R_j}\exp(X_k^T\beta)K_h(X_k - x_0)\right\}\right]. \tag{28}$$
See Fan et al.30 for a derivation of the local partial likelihood (28). When the kernel function is uniform and the bandwidth is of the nearest neighbor type, the local likelihood (28) was introduced by Tibshirani and Hastie.82 For a related approach based on the local likelihood, see Gentleman and Crowley.43

The function value θ(x0) is not directly estimable, since (28) does not depend on the intercept β0. However, the derivative functions are directly estimable. Let $\hat\beta(x_0)$ be the maximum local log-partial likelihood estimate that maximizes (28). An estimate $\hat\theta_\nu(x_0)$ of $\theta^{(\nu)}(x_0)$ is given by $\nu!\hat\beta_\nu(x_0)$. We impose the condition θ(0) = 0 for identifiability. With this extra constraint, the function θ(x) can be estimated by
$$\hat\theta(x) = \int_0^x\hat\theta'(t)\,dt, \tag{29}$$
where $\hat\theta'(t) = \hat\theta_1(t)$ is the derivative estimator. In practice, the function $\hat\theta_1(x)$ is often evaluated at either grid points or the design points. Assume that $\hat\theta_1(x_j) = \hat\beta_1(x_j)$ has been computed at points $\{x_0, \ldots, x_m\}$. Then $\hat\theta(x_i)$ can be approximated by the trapezoidal rule
$$\hat\theta(x_i) = \sum_{j=1}^{i}(x_j - x_{j-1})(\hat\beta_{1,j} + \hat\beta_{1,j-1})/2,$$
where $\hat\beta_{1,j} = \hat\beta_1(x_j)$.
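The integration step (29) is straightforward to code; the sketch below assumes the local slope estimates $\hat\beta_1(x_j)$ have already been obtained (here they are placeholder values), and imposes θ = 0 at the first grid point.

```python
import numpy as np

def integrate_theta(x_pts, beta1):
    # Reconstruct theta-hat via the trapezoidal rule, setting theta = 0
    # at the first grid point; beta1 holds the slope estimates beta1-hat(x_j).
    x_pts, beta1 = np.asarray(x_pts), np.asarray(beta1)
    increments = np.diff(x_pts) * (beta1[1:] + beta1[:-1]) / 2
    return np.concatenate([[0.0], np.cumsum(increments)])

# Example with hypothetical slope estimates on an age grid.
ages = np.linspace(30, 70, 41)
beta1_hat = 0.05 + 0.001 * (ages - 50)      # placeholder derivative estimates
theta_hat = integrate_theta(ages, beta1_hat)
```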
The coefficients can simply be computed using existing software packages for the parametric Cox proportional hazards model. We conclude this section with an analysis of the Primary Biliary Cirrhosis (PBC) data set, which can be found in Fleming and Harrington.40
Fig. 12. Local partial likelihood estimation of the hazard regression function. (a) Observed time versus age with “+” indicating censored observations. (b) Estimated derivative function θ ′ (·). (c) Estimated hazard regression function θ(·); solid curve — bandwidth = 10; short-dashed curve — bandwidth = 20; long-dashed curve — bandwidth = 30. From Fan and Gijbels.29
PBC is a rare but fatal chronic liver disease of unknown cause. The analysis is here based on the data collected at Mayo Clinic between January 1974 and May 1984. Of 312 patients who participated in the randomized trial, 187 cases were censored. In our analysis, we take the time (in days) between
registration and death, or liver transplantation, or the time of the study analysis (July 1986) as the response, and the ages of the patients as a covariate. The observed data are presented in Fig. 12(a). The local partial likelihood method (28) with p = 2 was employed for three different bandwidths, h = 10, 20 and 30. The estimated hazard regression function and its derivative function are given in Figs. 12(c) and (b), respectively. Note that since the hazard regression function is only identifiable up to a constant, the curves in Fig. 12(c) are normalized to have the same average height so that they can be better compared. Figure 12(c) reveals that it is reasonable to model the hazard regression function of the covariate age linearly.

5. Nonparametric Goodness of Fit Tests

Nonparametric goodness-of-fit testing has received increasing attention recently. A simple motivating example is the following nonparametric regression model. Suppose that $(X_1, Y_1), \ldots, (X_n, Y_n)$ are a random sample from the nonparametric regression model $Y_i = m(X_i) + \varepsilon_i$ with $E(\varepsilon_i|X_i) = 0$ and $\mathrm{var}(\varepsilon_i|X_i) = \sigma^2$. Of interest is to test the hypothesis $H_0: m(x) = \alpha_0 + \alpha_1 x_1 + \cdots + \alpha_p x_p$ versus $H_1: m(x) \ne \alpha_0 + \alpha_1 x_1 + \cdots + \alpha_p x_p$. This testing problem is well known as the test of linearity in the context of model diagnostics, where the question arises whether a family of parametric models fits the data adequately. It is natural to use the nonparametric model as the alternative hypothesis. On the other hand, it is known that nonparametric regression may yield a complicated model. Thus, after fitting a data set by a nonparametric model, we may check whether the data can be fitted by a less complicated parametric model. This leads to a nonparametric goodness-of-fit test. Hart51 gives a comprehensive study and presents many examples on this topic. Fan25 and Fan and Huang32 proposed some goodness-of-fit tests for various parametric and nonparametric models. Fan et al.37 proposed generalized likelihood ratio tests and established a general framework for nonparametric smoothing tests. Much related literature is available.1,3,4,21,22,50,55,58 In this section, we illustrate the idea of the nonparametric likelihood ratio test with generalized varying coefficient models. Some material in this section was extracted from Cai et al.,8 referred to as CFL.
5.1. Generalized varying coefficient models

A generalized varying-coefficient model has the form
$$\eta(u, x) = g\{m(u, x)\} = \sum_{j=1}^{p} a_j(u)x_j \tag{30}$$
for some given link function g(·), where $x = (x_1, \ldots, x_p)^T$ and m(u, x) is the mean regression function of the response variable Y given the covariates U = u and X = x with $X = (X_1, \ldots, X_p)^T$. Clearly, model (30) includes both the parametric generalized linear model65 and the generalized partially linear model.9,14,46,77 An advantage of model (30) is that by allowing the coefficients {a_j(·)} to depend on U, the modeling bias can be reduced significantly and the "curse of dimensionality" is avoided.

In this section, we focus on the case in which the response is discrete. For continuous responses, much work has been done. In the least-squares setting, model (30) with the identity link was introduced by Cleveland et al.18 and extended in various directions by Hastie and Tibshirani.53 Varying-coefficient models are a simple and useful extension of classical generalized linear models. They are particularly appealing in longitudinal studies, where they allow one to explore the extent to which the effects of covariates on the response change over time. See Hoover et al.,54 Brumback and Rice7 and Fan and Zhang38 for novel applications of varying-coefficient models to longitudinal data. For nonlinear time series applications, see Chen and Tsay15 and Cai et al.9 for statistical inference based on functional-coefficient autoregressive models. Kauermann and Tutz59 used varying-coefficient models for model diagnostics.

5.2. Estimation procedures

For simplicity, we consider the important case in which u is one-dimensional. Extension to multivariate u involves no fundamentally new ideas. However, implementations with u having more than two dimensions may have some difficulties due to the "curse of dimensionality". In this section, it is assumed that the conditional log-likelihood function ℓ(v, y) is known and linear in y for fixed v. This assumption is satisfied for the canonical exponential family, which is the focus of this section. The methods introduced in this section are directly applicable to situations in which one cannot fully specify the conditional log-likelihood function ℓ(v, y), but can model the relationship between the mean and
variance by $\mathrm{var}(Y|U = u, X = x) = \sigma^2 V\{m(u, x)\}$ for a known variance function V(·) and unknown σ. In this case, one needs only to replace the log-likelihood function ℓ(v, y) by the quasi-likelihood function Q(·, ·), defined by
$$\frac{\partial}{\partial\mu}Q(\mu, y) = \frac{y - \mu}{V(\mu)}.$$

5.2.1. Local MLE

Local linear modeling will be used here, though general local polynomial methods are also applicable. Suppose that $a_j(\cdot)$ has a continuous second derivative. For each given point $u_0$, $a_j(u)$ can be approximated locally by a linear function $a_j(u) \approx a_j + b_j(u - u_0)$ for u in a neighborhood of $u_0$. Based on a random sample $\{(U_i, X_i, Y_i)\}_{i=1}^n$, one may use the following local likelihood method to estimate the coefficient functions:
$$\ell_n(a, b) = \frac{1}{n}\sum_{i=1}^{n}\ell\left[g^{-1}\left\{\sum_{j=1}^{p}(a_j + b_j(U_i - u_0))X_{ij}\right\}, Y_i\right]K_h(U_i - u_0), \tag{31}$$
where $a = (a_1, \ldots, a_p)^T$ and $b = (b_1, \ldots, b_p)^T$. Note that $a_j$ and $b_j$ depend on $u_0$, and so does $\ell_n(\cdot,\cdot)$. Maximizing the local likelihood function $\ell_n(a, b)$ results in estimates $\hat a(u_0)$ and $\hat b(u_0)$. The components of $\hat a(u_0)$ provide an estimate of $a_1(u_0), \ldots, a_p(u_0)$. For simplicity of notation, let $\beta = \beta(u_0) = (a_1, \ldots, a_p, b_1, \ldots, b_p)^T$ and write the local likelihood function (31) as $\ell_n(\beta)$. Likewise, the local MLE is denoted by $\hat\beta_{MLE} = \hat\beta_{MLE}(u_0)$. The sampling properties have been established in CFL.

5.2.2. One-step local MLE

Computation of the above local MLE is expensive: we have to maximize the local likelihood (31) for usually hundreds of distinct values of $u_0$, with each maximization requiring an iterative algorithm, in order to obtain the estimated functions $\{\hat a_j(\cdot)\}$. To alleviate this expense, we replace the iterative local MLE by a one-step estimator, which has been frequently used in parametric models.5,61 The one-step local MLE does not lose any statistical efficiency provided that the initial estimator is good enough. See CFL for theoretical insights. Let $\ell_n'(\beta)$ and $\ell_n''(\beta)$ be the gradient and Hessian matrix of the local log-likelihood $\ell_n(\beta)$. Given an initial estimator $\hat\beta_0 = \hat\beta_0(u_0) = (\hat a(u_0)^T, \hat b(u_0)^T)^T$, one step of the Newton-Raphson algorithm updates its
solution by
$$\hat\beta_{OS} = \hat\beta_0 - \{\ell_n''(\hat\beta_0)\}^{-1}\ell_n'(\hat\beta_0), \tag{32}$$
thus featuring the computational expediency of least-squares local polynomial fitting. Furthermore, the sandwich formula
$$\widehat{\mathrm{cov}}(\hat\beta_{OS}) = \{\ell_n''(\hat\beta_0)\}^{-1}\widehat{\mathrm{cov}}\{\ell_n'(\hat\beta_0)\}\{\ell_n''(\hat\beta_0)\}^{-1}$$
can be used to estimate the standard errors of the resulting estimate.
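A minimal sketch of the one-step update (32) and the sandwich covariance; grad_fn, hess_fn and grad_cov are hypothetical callables/matrices supplied by whatever routine evaluates the local log-likelihood.

```python
import numpy as np

def one_step_update(beta0, grad_fn, hess_fn):
    # One Newton-Raphson step (32): beta_OS = beta0 - H(beta0)^{-1} g(beta0).
    g = grad_fn(beta0)
    H = hess_fn(beta0)
    return beta0 - np.linalg.solve(H, g)

def sandwich_cov(beta0, grad_cov, hess_fn):
    # Sandwich covariance estimate for the one-step estimator.
    H_inv = np.linalg.inv(hess_fn(beta0))
    return H_inv @ grad_cov @ H_inv
```

In practice the update would be applied repeatedly along the grid of u-values, with each $\hat\beta_{OS}$ serving as the initial value at the next grid point, as described next.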
The sandwich formula has been found in CFL to be sufficiently accurate for most practical purposes. In univariate generalized linear models, Fan and Chen26 carefully studied the properties of the local one-step estimator. In that setting, the least-squares estimate is a natural candidate for the initial estimator. However, in the multivariate setting, it is not clear how an initial estimator can be constructed. The following is proposed in CFL. Suppose that we wish to evaluate the functions $\hat a(\cdot)$ at grid points $u_j$, $j = 1, \ldots, n_{grid}$. Our idea for finding initial estimators is as follows. Take a point $u_{i_0}$, usually the center of the grid points, and compute the local MLE $\hat\beta_{MLE}(u_{i_0})$. Use this estimate as the initial estimate for the point $u_{i_0+1}$ and apply (32) to obtain $\hat\beta_{OS}(u_{i_0+1})$. Now use $\hat\beta_{OS}(u_{i_0+1})$ as the initial estimate at the point $u_{i_0+2}$ and apply (32) to obtain $\hat\beta_{OS}(u_{i_0+2})$, and so on. Likewise, we can compute $\hat\beta_{OS}(u_{i_0-1})$, $\hat\beta_{OS}(u_{i_0-2})$, etc. In this way, we obtain our estimates at all grid points.

A refined alternative of the above proposal is to calculate a fresh local MLE as a new initial value after iterating along the grid points for a while. For example, suppose we wish to evaluate the functions at 200 grid points and are willing to compute the local maximum likelihood at five distinct points. A sensible placement of these points is $u_{20}$, $u_{60}$, $u_{100}$, $u_{140}$ and $u_{180}$. Use, for example, $\hat\beta_{MLE}(u_{60})$ along with the idea in the last paragraph to compute $\hat\beta_{OS}(u_i)$ for $i = 40, \ldots, 79$, and use $\hat\beta_{MLE}(u_{100})$ to compute $\hat\beta_{OS}(u_i)$ for $i = 80, \ldots, 119$, and so on.

Note that $\ell_n''(\hat\beta_0)$ can be nearly singular for certain $u_0$, due to possible data sparsity in certain local regions. Seifert and Gasser73 and Fan and Chen26 explored the use of ridge regression as an approach to handling such problems in the univariate setting. See CFL8 for details.

5.3. Hypothesis testing

When fitting a varying-coefficient model, it is natural to ask whether the coefficient functions are actually varying or whether any particular covariate
Table 2. Normalization constant r_K.

  Kernel    Uniform    Epanechnikov    Biweight    Triweight    Gaussian
  r_K       1.2000     2.1153          2.3061      2.3797       2.5375
is significant in the model. For simplicity of description, we only consider the first hypothesis testing problem,
$$H_0: a_1(u) \equiv a_1, \ldots, a_p(u) \equiv a_p, \tag{33}$$
though the technique also applies to other testing problems. A useful procedure is based on the nonparametric likelihood ratio test statistic
$$T = 2\{\ell(H_1) - \ell(H_0)\}, \tag{34}$$
where $\ell(H_0)$ and $\ell(H_1)$ are respectively the log-likelihood functions computed under the null and alternative hypotheses. Note that the normalization constant in (34) does not change the testing procedure. However, in order for it to possess a $\chi^2$ distribution, it needs to be normalized as in Ref. 37:
$$T_K = r_K\{\ell(H_1) - \ell(H_0)\}, \quad\text{where}\quad r_K = \frac{K(0) - \frac{1}{2}\int K^2(t)\,dt}{\int\{K(t) - \frac{1}{2}K*K(t)\}^2\,dt}. \tag{35}$$
Table 2 gives the value of $r_K$ for a few commonly used kernels. For parametric models, it is well known that the likelihood ratio statistic asymptotically follows a $\chi^2$-distribution, and the asymptotic null distribution is independent of nuisance parameters under the null hypothesis. This is the Wilks type of phenomenon. Fan et al.37 have shown that the Wilks phenomenon still holds for nonparametric likelihood ratio tests. Furthermore, they showed that the null distribution of the nonparametric likelihood ratio test is a $\chi^2$-distribution in a certain sense and does not depend on the values of $a_1, \ldots, a_p$. Thus one may use the following conditional bootstrap to construct the null distribution of $T_K$ and hence the P-value. Let $\{\hat a_j\}$ be the MLE under the null hypothesis. Given the covariates $(U_i, X_i)$, generate a bootstrap sample $Y_i^*$ from the given distribution of Y with the estimated linear predictor $\hat\eta(U_i, X_i) = \sum_{j=1}^{p}\hat a_j X_{ij}$, and compute the test statistic $T_K^*$ in (34). Use the distribution of $T_K^*$ as an approximation to the distribution of $T_K$.
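A sketch of this conditional bootstrap for a Bernoulli response follows. The callables loglik_null and loglik_alt are hypothetical placeholders for routines that refit the null and alternative models and return the fitted log-likelihoods; they are not defined in the original text.

```python
import numpy as np

def bootstrap_null_distribution(U, X, a_hat_null, loglik_null, loglik_alt,
                                r_K, n_boot=1000, rng=None):
    # Conditional bootstrap for the null distribution of T_K (Bernoulli case).
    rng = rng or np.random.default_rng()
    eta = X @ a_hat_null                    # estimated linear predictor under H0
    p_hat = 1 / (1 + np.exp(-eta))          # fitted probabilities (logit link)
    T_star = np.empty(n_boot)
    for b in range(n_boot):
        Y_star = rng.binomial(1, p_hat)     # resample responses under H0
        T_star[b] = r_K * (loglik_alt(U, X, Y_star) - loglik_null(U, X, Y_star))
    return T_star

# The P-value is then approximated by np.mean(T_star >= T_K_observed).
```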
Note that the above conditional bootstrap method applies readily to setting without presence of dispersion parameter, such as the Poisson and Bernoulli distributions. It is really a simulation approximation to the conditional distribution of TK given observed covariates under the particular null hypothesis: H0 : aj (u) = a ˆj (j = 1, . . . , p). As pointed out above, this approximation is valid under both H0 and H1 as the null distribution does not asymptotically depend on the values of {aj }. In the case where model (30) involves a dispersion parameter (e.g. the Gaussian model), the dispersion parameter should be estimated based on the residuals from the alternative hypothesis. It is also of interest to investigate whether some covariates are significant. For example, we want to check whether the covariate X p can be excluded from the model. This is equivalent to testing the hypothesis H0 : ap (·) = 0, the above conditional bootstrap idea can be employed to obtain the null distribution of TK under the model (30) and the generalized likelihood ratio statistics continue to apply. In this case, the data should Pp−1 be generated from the mean function g{m(u, x)} = j=1 a ˆj (u)xj , where a ˆj (·) is an estimate under the alternative hypothesis. 5.4. An application We conclude this section via illustrating the proposed methodology to analyze the Burn Data set. The binary response variable Y is 1 for those victims who survived their burns and 0 otherwise, and covariates X1 = age, X2 = sex, X3 = log(burn area + 1) and binary variable X4 = Oxygen (0 if oxygen supply is normal, 1 otherwise) are considered. Of interest is to study how burn areas and the other variables affect the survival probabilities for victims at different age groups. This naturally leads to the following varying-coefficient model logit{p(x1 , x2 , x3 , x4 )} = a1 (x1 ) + a2 (x1 )x2 + a3 (x1 )x3 + a4 (x1 )x4 .
(36)
Figure 13 presents the estimated coefficients for model (36) via the onestep approach with bandwidth h = 65.7882, selected by a cross-validation method. See CFL8 for details. A natural question arises whether the coefficients in (36) are actually varying. To see this, we consider the parametric logistic regression model logit{p(x1 , x2 , x3 , x4 )} = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4
(37)
(d)
Fig. 13. The estimated coefficient functions (the solid curves) via the one-step approach with bandwidth chosen by the CV. The dot curves are the estimated functions plus/minus twice estimated standard errors. Adapted from Cai, Fan and Li. 8
as the null model. As a result, the MLE of $(\beta_0, \ldots, \beta_4)$ in model (37) and its standard deviations are (23.2213, −6.1485, −0.4661, −2.4496, −0.9683) and (1.9180, 0.6647, 0.2825, 0.2206, 0.2900), respectively. The likelihood ratio test statistic $T_K$ is 58.1284 with p-value 0.000, based on 1000 bootstrap samples (the sample mean and variance of $T_K^*$ are 6.3201 and 11.98023, respectively). This implies that the varying-coefficient logistic regression model fits the data much better than the parametric fit. It also allows us to examine the extent to which the regression coefficients vary over different ages. The estimated density of $T_K^*$ is depicted in Fig. 14, from which we can see that the null distribution is well approximated by a $\chi^2$ distribution with 6.5 degrees of freedom (a gamma distribution).
T
Fig. 14. The estimated density of TK by Monte Carlo simulation. The solid curve is the estimated density, and the dashed curve stands for the density of chi-squared distribution (gamma distribution) with 6.5 degrees of freedom.
To examine whether there is any gender gap across different age groups, or whether the variable X4 affects the survival probabilities for burn victims of different ages, we consider testing the null hypothesis H0: both a2(·) and a4(·) are constant under model (36). The corresponding test statistic $T_K$ is 3.4567 with p-value 0.7050 based on 1000 bootstrap samples. This suggests that the coefficient functions a2(·) and a4(·) are independent of age, and indicates that there are no gender differences across age groups. Finally, we examine whether the covariates sex and oxygen are statistically significant in model (36). The likelihood ratio test for this problem is $T_K$ = 11.9256 with p-value 0.0860, based on 1000 bootstrap samples (the sample mean and variance of $T_K^*$ are 5.5915 and 10.9211, respectively). Neither sex nor oxygen is significant at level 0.05. This suggests that gender and oxygen do not play a significant role in determining the survival probability of a victim.

6. Other Applications

There are many other applications of local modeling methods. This section briefly introduces some of them and gives some relevant references for those who wish for more details. Suppose that $(X_1, Y_1), \ldots, (X_n, Y_n)$ are a random sample from a population (X, Y). We are interested in estimating
a population parameter function θ. The function θ(·) can be, for example, the conditional mean function E(Y|X) or a conditional quantile function. In parametric settings, we model θ(x) using a parametric family θ(x) = g(x; β). To get an estimator of β, we optimize (either minimize or maximize) an objective function
$$L(\beta) = \sum_{i=1}^{n}\ell\{X_i, Y_i, g(X_i, \beta)\}. \tag{38}$$
Here ℓ is a discrepancy loss function or the log-likelihood function of an individual observation. For example, the L2-loss function leads to a least squares estimate, while the L1-loss function corresponds to a robust linear regression. The local modeling method can be used to relax the global parametric model assumption and to significantly reduce the modeling bias. For a given point x0, we replace the objective function by its local version
$$L\{\beta(x_0)\} = \sum_{i=1}^{n}\ell\{X_i, Y_i, g(X_i, \beta(x_0))\}K_h(X_i - x_0). \tag{39}$$
Optimizing (39) yields an estimate $\hat\beta(x_0)$, just like the local likelihood estimate discussed in the last section. Thus the function θ(·) is estimated by $\hat\theta(x_0) = g\{x_0; \hat\beta(x_0)\}$. Since the local estimate $\hat\beta(x_0)$ optimizes (39), the estimate $g\{x_0, \hat\beta(x_0)\}$ should converge to its population version. Therefore the estimate $\hat\theta(x_0)$ is a consistent estimator of θ(x0) if h → 0 in such a way that nh → ∞. For a given x0, by Taylor's expansion, we can parametrize the function in a local neighborhood of x0 as
$$g(x; \beta) = \beta_0(x_0) + \beta_1(x_0)(x - x_0) + \cdots + \beta_p(x_0)(x - x_0)^p. \tag{40}$$
Suppressing the dependence of the β's on x0, (39) can be rewritten as
$$L\{\beta(x_0)\} = \sum_{i=1}^{n}\ell\{X_i, Y_i, \beta_0 + \beta_1(X_i - x_0) + \cdots + \beta_p(X_i - x_0)^p\}K_h(X_i - x_0). \tag{41}$$
Let $\hat\beta_j$ ($j = 0, 1, \ldots, p$) optimize (41). Then, as in the last section,
$$\hat\theta(x_0) = \hat\beta_0 \quad\text{and}\quad \hat\theta_\nu(x_0) = \nu!\hat\beta_\nu, \qquad \nu = 1, \ldots, p,$$
where $\hat\theta_\nu(x_0)$ estimates the νth derivative of the function θ(x) at x = x0.
It is clear that local polynomial regression and the local likelihood approach are special cases of this framework. An extension of the ideas for estimating bias and variance can be found in Fan et al.,27 in which methods for selecting bandwidths and constructing confidence intervals are also proposed. A closely related framework is the local estimating equation method introduced by Carroll et al.11 and the kernel generalized estimating equation (GEE) approach proposed by Lin and Carroll.62

6.1. Estimation of conditional quantiles and median

In exploratory data analysis, quantiles provide an informative summary of a population. In regression analysis, conditional quantiles have important applications for constructing predictive intervals and detecting heteroscedasticity. When the error distribution is asymmetric, the conditional median regression function is more informative than the conditional mean regression. Take the loss function in (38) to be $\ell(x, y, \theta) = \ell_\alpha(y - \theta)$ with
$$\ell_\alpha(t) = |t| + (2\alpha - 1)t. \tag{42}$$
The minimizer of Eℓ(X, Y, θ) in this situation is the conditional α-quantile function ξα(x) = G^{−1}(α|x), where G(y|x) is the conditional distribution function of Y given X = x. Now we apply the local modeling approach to estimate the conditional quantile function. Minimize
Σ_{i=1}^n ℓα{Y_i − β0 − β1(X_i − x0) − · · · − βp(X_i − x0)^p} K_h(X_i − x0) ,   (43)
and the resulting estimator for ξα(x0) is simply β̂0. We now apply the proposed approach to the 12-month Treasury bill data presented in Fig. 15(a). Figure 15(b) depicts the estimated conditional median, the conditional 10th percentile and the conditional 90th percentile. The fan shape of the conditional quantiles shows that the variability gets larger as the interest rate gets higher. The intervals sandwiched between the conditional 10th and 90th percentiles are 80% predictive intervals. For example, given that the current interest rate is 10%, with probability 80% the difference between next week's rate and this week's rate falls in the interval [−0.373%, 0.363%].
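A short sketch of the local linear (p = 1) version of (43) is given below. It minimizes the kernel-weighted check loss (42) numerically; the simulated heteroscedastic data, the Epanechnikov kernel, and the bandwidth are illustrative assumptions, not part of the Treasury bill analysis above.

```python
import numpy as np
from scipy.optimize import minimize

def check_loss(t, alpha):
    # l_alpha(t) = |t| + (2*alpha - 1) * t, the loss in (42)
    return np.abs(t) + (2 * alpha - 1) * t

def local_linear_quantile(x, y, x0, h, alpha):
    """Estimate the conditional alpha-quantile xi_alpha(x0) by minimizing (43)
    with p = 1; the intercept beta_0 is the quantile estimate."""
    u = (x - x0) / h
    w = np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)   # Epanechnikov weights
    def objective(beta):
        resid = y - beta[0] - beta[1] * (x - x0)
        return np.sum(check_loss(resid, alpha) * w)
    res = minimize(objective, np.array([np.median(y), 0.0]), method="Nelder-Mead")
    return res.x[0]

# Illustration: 10th, 50th and 90th conditional percentiles at x0 = 1.0
rng = np.random.default_rng(1)
x = rng.uniform(0, 2, 500)
y = x**2 + (0.2 + 0.3 * x) * rng.standard_normal(500)   # heteroscedastic noise
for a in (0.1, 0.5, 0.9):
    print(a, local_linear_quantile(x, y, x0=1.0, h=0.3, alpha=a))
```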
Fig. 15. Quantile regression. (a) The yields of the 12-month Treasury bill. (b) Conditional quantiles: short-dashed curve, α = 0.1; solid curve, α = 0.5; long-dashed curve, α = 0.9. The vertical bar indicates the 80% predictive interval at the point x = 10. Taken from Fan and Gijbels.29
For robust estimation of the regression function, one can simply replace the loss function in (42) by an outlier-resistant loss function such as
ℓ(t) = t²/2   when |t| ≤ c ,   and   ℓ(t) = c|t| − c²/2   when |t| > c ,
namely, taking the derivative of ℓ(t) to be Huber's ψ-function: ψc(t) = max{−c, min(c, t)}. When the conditional distribution of Y given X = x is symmetric about the regression function m(x), the resulting estimates are consistent for all c ≥ 0. Another useful robust procedure is LOWESS, introduced by Cleveland,17 which reduces the influence of outliers by an iteratively reweighted least-squares scheme with weights determined by the residuals from the previous iteration. There is a large literature on nonparametric quantile regression and robust regression. Härdle and Gasser49 and Tsybakov84 considered local constant and local polynomial fitting, respectively. Other contributions in this area are also available.12,28,47,60,83
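A minimal sketch of a local linear fit with the Huber loss is given below. It is only one possible implementation under illustrative assumptions (Epanechnikov kernel, a conventional cutoff c = 1.345, Nelder-Mead optimization); it is not the LOWESS algorithm.

```python
import numpy as np
from scipy.optimize import minimize

def huber_loss(t, c):
    # l(t) = t^2/2 for |t| <= c and c|t| - c^2/2 otherwise; its derivative is psi_c(t)
    a = np.abs(t)
    return np.where(a <= c, 0.5 * t**2, c * a - 0.5 * c**2)

def local_linear_huber(x, y, x0, h, c=1.345):
    """Local linear fit at x0 with the L2 loss replaced by the Huber loss,
    so that outlying responses receive bounded influence."""
    u = (x - x0) / h
    w = np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)
    def objective(beta):
        return np.sum(huber_loss(y - beta[0] - beta[1] * (x - x0), c) * w)
    res = minimize(objective, np.array([np.median(y), 0.0]), method="Nelder-Mead")
    return res.x[0]   # robust estimate of m(x0)
```

In practice the responses are often standardized by a robust scale estimate (for example, the median absolute deviation of preliminary residuals) before the cutoff c is applied.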
6.2. Estimation of conditional variance

Conditional variance functions have many statistical applications, particularly in finance, where data are often dependent. For this reason we formulate the problem in a stochastic setup. Let {(Xi, Yi)} be a two-dimensional strictly stationary process having the same joint distribution as (X, Y). Let m(x) = E(Y|X = x) and σ²(x) = var(Y|X = x) be respectively the regression function and the conditional variance function. Our approach is based on the residuals of the local fit. Let m̂_{h1,K}(·) be the local fit of m(·) using a kernel K and a bandwidth h1. Consider the squared residuals
r̂_i = {Y_i − m̂_{h1,K}(X_i)}² .   (44)
Note that the conditional variance function can be expressed as
σ²(x) = E[{Y − m(X)}² | X = x] ,
which is the regression function of the squared residuals. Therefore, a natural procedure is to run a local fit on the squared residuals. Let σ̂²_{h2,W}(x) be the local fit based on the data {(X_i, r̂_i), i = 1, . . . , n}, using a bandwidth h2 and a kernel W. It was shown by Fan and Yao35 and Ruppert et al.71 that the estimator σ̂²_{h2,W}(x) performs as well as the ideal estimator, which is a local linear fit to the true squared residuals {(X_i, {Y_i − m(X_i)}²), i = 1, . . . , n} using the same bandwidth h2 and the same kernel W. They also obtained the order of the bias and variance of the resulting estimators. Their results suggest that if the bandwidth h1 is of order n^{−1/5}, then the residual-based conditional variance estimator performs asymptotically as well as the ideal one. In particular, the optimal bandwidth for estimating the mean regression function may be used for computing the residuals, so a data-driven procedure can be established.35 To illustrate the usefulness of this automatic method, consider the yields of the 12-month Treasury bill, using the refined global bandwidth selector of Fan and Gijbels28 and the Epanechnikov kernel. Figure 16(a) gives the estimated mean regression function; the bandwidth ĥ1 = 3.99 was chosen by the software. Figure 16(b) depicts the estimated conditional standard deviation (the volatility function) and the conditional variance function; the bandwidth ĥ2 = 3.63 was selected by the software. Visual inspection suggests that the volatility function should be a power function. Indeed,
Fig. 16. The regression function and the volatility function for the 12-month Treasury bill data. (a) The estimated mean regression function and (b) the estimated volatility function (thick curve) and the estimated conditional variance function (thin curve). The two dashed curves around a solid one indicate one standard error above and below the estimated mean regression function. From Fan and Gijbels.29
the correlation coefficient between {log(x_j)} and {log(σ̂(x_j))} is 0.997! Here x_j (j = 1, . . . , 201) are grid points in the interval [3, 14]. Fitting a line through the data {(log(x_j), log σ̂(x_j)), j = 1, . . . , 201}, we obtain the estimate σ̂(x) = 0.0154 x^{1.3347}. This estimate is presented as a thick dashed curve in Fig. 16(b). This is an example where a nonparametric analysis yields a good parametric model, σ(x) = αx^β. Based on a linear regression of log(r̂_i) on log(X_i), one can also obtain estimates of α and β directly.
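The two-step residual-based procedure of this subsection can be sketched in a few lines. The following is an illustrative implementation only (Epanechnikov kernel, user-chosen bandwidths), not the bandwidth-selection software used for the Treasury bill analysis.

```python
import numpy as np

def local_linear(x, y, grid, h):
    """Local linear estimates of E(Y|X = x) at the points in `grid`."""
    out = np.empty(len(grid))
    for k, x0 in enumerate(grid):
        u = (x - x0) / h
        w = np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)   # Epanechnikov weights
        X = np.column_stack([np.ones_like(x), x - x0])
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        out[k] = beta[0]
    return out

def conditional_variance(x, y, grid, h1, h2):
    """Step 1: estimate the mean function and form the squared residuals (44).
    Step 2: regress the squared residuals on X to estimate sigma^2(x)."""
    m_hat_at_x = local_linear(x, y, x, h1)     # fitted mean at the data points
    r = (y - m_hat_at_x) ** 2                  # squared residuals r_i
    return local_linear(x, r, grid, h2)        # sigma_hat^2 on the grid

# A power-law summary sigma(x) = alpha * x^beta can then be obtained by a
# straight-line fit of log(sigma_hat(x_j)) on log(x_j), e.g. with np.polyfit.
```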
6.3. Estimation of conditional density

It is well known that a probability density function is much more informative than the mean and the variance alone. Similarly, in regression settings, the conditional probability density function provides more information about the population than the conditional regression function. Plots of the conditional density show the center as well as the spread of the population, and its shape tells us whether the distribution is symmetric. This provides guidance on whether to summarize the population via the conditional mean regression function or the conditional median regression function. Suppose that (X1, Y1), . . . , (Xn, Yn) are a random sample from the population (X, Y) with conditional density g(y|x). Note that
E{K_{h2}(Y − y) | X = x} ≈ g(y|x) ,   as h2 → 0 .   (45)
Thus, g(y|x) can be regarded approximately as the regression function of the variable K_{h2}(Y − y) on X. Considerations of this nature lead to the following local polynomial regression problem: minimize
Σ_{i=1}^n { K_{h2}(Y_i − y) − Σ_{j=0}^p β_j(X_i − x)^j }² W_{h1}(X_i − x) ,   (46)
for a given bandwidth h1 and a kernel function W. Let {β̂_j(x, y), j = 0, . . . , p} be the solution of this least-squares problem. Then an estimator of g^{(ν)}(y|x) = ∂^ν g(y|x)/∂x^ν is ν! β̂_ν(x, y), and we write ĝ(y|x) = β̂_0(x, y) as the estimator of the conditional density. To apply the proposed approach to conditional density estimation, we have to choose the two bandwidths h1 and h2. The method for constructing a data-driven bandwidth for local polynomial regression can be used to compute a bandwidth h1, and the method for choosing a bandwidth for kernel density estimation can be employed to find a bandwidth h2.36 With the estimated conditional density function, one can derive many statistical estimators. For example, the mean regression function can simply be estimated by
m̂(x) = ∫ y ĝ(y|x) dy .
It can be shown that this estimator is the same as the local polynomial regression estimator when the kernel function K has mean zero. Similarly, one can derive estimates for the conditional variance and conditional quantile functions.
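For illustration, here is a sketch of the local constant (p = 0) special case of (46), which reduces to a double-kernel smoother; the Gaussian kernels, the bandwidths, and the grid-based integration are illustrative assumptions.

```python
import numpy as np

def gauss_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def cond_density(x, y, x0, y0, h1, h2):
    """Local constant (p = 0) version of (46): regress K_{h2}(Y_i - y0) on X
    to estimate g(y0 | x0)."""
    wx = gauss_kernel((x - x0) / h1) / h1
    ky = gauss_kernel((y - y0) / h2) / h2
    return np.sum(wx * ky) / np.sum(wx)

def cond_mean_from_density(x, y, x0, h1, h2, y_grid):
    """m_hat(x0) = integral of y * g_hat(y|x0) dy, approximated on a grid."""
    g = np.array([cond_density(x, y, x0, yg, h1, h2) for yg in y_grid])
    g = g / np.trapz(g, y_grid)                 # renormalize on the grid
    return np.trapz(y_grid * g, y_grid)
```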
6.4. Change point detection

Change point detection is useful in medical monitoring and quality control. For example, when treatment effects change suddenly without warning or planning, jump points arise. The statistical problem can be formulated as follows. Let (X1, Y1), . . . , (Xn, Yn) be a random sample from a population (X, Y) with conditional mean function m, which is smooth except for a small number of jump discontinuities. For simplicity, we assume that there is only one discontinuity point, also called a change point. One may regard the change point as the location where the derivative function |m′(·)| is maximized. Thus, a naive method is to first estimate the derivative curve and then find the maximizer of the absolute value of the estimated derivative function. Let D(x, h) be a derivative estimator resulting from a local polynomial fit of order p with bandwidth h and kernel K. For simplicity, assume that the support of K is [−1, 1]. The above idea translates into the following estimating scheme: plot the function |D(·, h)| for a range of values of h and identify the jump as the point x in the vicinity of which |D(x, h)| is consistently large for a range of values of h. More precisely, let x̃(h) be the global maximum of the function |D(·, h)|. Put
x̃−(h) = sup_{h1 ∈ [h, ηn]} {x̃(h1) − 2h1} ,   x̃+(h) = inf_{h1 ∈ [h, ηn]} {x̃(h1) + 2h1} ,   (47)
for h ≤ ηn, where ηn > 0 is a prescribed number tending to zero more slowly than n^{−1} log n. Let h̃ denote the infimum of the values h such that x̃−(h) ≤ x̃+(h). The proposed jump point estimator is x̂0 = x̃(h̃).44 Gijbels et al.44 also propose a further refinement of the above idea. For a given bandwidth h, pretend that the change point lies in the interval x̃(h) ± 2h and that the regression function is a step function on this interval. Then, find the location of the jump that minimizes the residual sum of squares, using only the data in the strip x̃(h) ± 2h. The resulting estimator is a refinement of the estimator x̃(h). In particular, we can take h = h̃ to yield a refinement of x̂0. Müller67 proposed an alternative method based on a one-sided kernel approach. The idea can be extended to the local polynomial setting as follows. Denote by K− a kernel function supported on [−1, 0] and by m̂−(x, h) a local polynomial fit using the bandwidth h and the kernel K−. Note that the estimator m̂−(x, h) uses only the local data on the left-hand side of the point x. Analogously, let K+ be a kernel function supported on [0, 1] and m̂+(x, h) be a local polynomial fit using the bandwidth h and the kernel K+. Then m̂+(x, h) uses only the data on the right-hand side of the point x. At smooth locations, the estimates m̂−(x, h) and m̂+(x, h) are about the same, since both are consistent estimates of m(x). At the discontinuity
point, however, they estimate respectively the left limit and the right limit of the function m at the point x. Thus, a natural estimator of the change point is the location at which the difference |m̂+(x, h) − m̂−(x, h)| is maximized. The bandwidth for detecting the change point is typically much smaller than the optimal bandwidth for curve estimation. Müller67 and Gijbels et al.44 also gave some interesting examples; a small numerical sketch of the one-sided idea follows.
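The sketch below uses local constant (rather than local polynomial) one-sided fits, which is the simplest instance of the approach; the simulated step function, bandwidth, and grid are illustrative assumptions.

```python
import numpy as np

def one_sided_change_point(x, y, grid, h):
    """For each candidate point, compare local averages built only from data on
    the left and only from data on the right; the estimated jump location
    maximizes the gap |m_hat_plus(x) - m_hat_minus(x)|."""
    gaps = np.empty(len(grid))
    for k, x0 in enumerate(grid):
        left = (x >= x0 - h) & (x < x0)
        right = (x > x0) & (x <= x0 + h)
        if left.sum() == 0 or right.sum() == 0:
            gaps[k] = -np.inf
            continue
        gaps[k] = abs(y[right].mean() - y[left].mean())
    return grid[np.argmax(gaps)]

# Illustration: a jump of size 1 at x = 0.6
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 600))
y = np.sin(2 * x) + (x > 0.6) + 0.15 * rng.standard_normal(600)
print(one_sided_change_point(x, y, grid=np.linspace(0.05, 0.95, 181), h=0.05))
```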
References

1. Aerts, M., Claeskens, G. and Hart, J. D. (1999). Testing the fit of a parametric function. Journal of the American Statistical Association 94: 869–879.
2. Andersen, P. K., Borgan, Ø., Gill, R. D. and Keiding, N. (1993). Statistical Models Based on Counting Processes, Springer-Verlag, New York.
3. Azzalini, A. and Bowman, A. N. (1993). On the use of nonparametric regression for checking linear relationships. Journal of the Royal Statistical Society, Series B 55: 549–557.
4. Azzalini, A., Bowman, A. N. and Härdle, W. (1989). On the use of nonparametric regression for model checking. Biometrika 76: 1–11.
5. Bickel, P. J. (1975). One-step Huber estimates in linear models. Journal of the American Statistical Association 70: 428–433.
6. Bowman, A. W. and Azzalini, A. (1997). Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations, Oxford Science Publications, Oxford.
7. Brumback, B. and Rice, J. (1998). Smoothing spline models for the analysis of nested and crossed samples of curves. Journal of the American Statistical Association 93: 961–976.
8. Cai, Z., Fan, J. and Li, R. (2000). Efficient estimation and inferences for varying-coefficient models. Journal of the American Statistical Association 95: 888–902.
9. Cai, Z., Fan, J. and Yao, Q. (2000). Functional-coefficient regression models for nonlinear time series. Journal of the American Statistical Association 95: 941–956.
10. Carroll, R. J., Fan, J., Gijbels, I. and Wand, M. P. (1997). Generalized partially linear single-index models. Journal of the American Statistical Association 92: 477–489.
11. Carroll, R. J., Ruppert, D. and Welsh, A. H. (1998). Local estimating equations. Journal of the American Statistical Association 93: 214–227.
12. Chaudhuri, P. (1991). Nonparametric estimates of regression quantiles and their local Bahadur representation. The Annals of Statistics 19: 760–777.
13. Chaudhuri, P. and Marron, J. S. (1999). SiZer for exploration of structures in curves. Journal of the American Statistical Association 94: 807–822.
14. Chen, H. (1988). Convergence rates for parametric components in a partly linear model. The Annals of Statistics 16: 136–146.
15. Chen, R. and Tsay, R. S. (1993). Functional-coefficient autoregressive models. Journal of the American Statistical Association 88: 298–308.
16. Chu, C. K. and Marron, J. S. (1991). Choosing a kernel regression estimator (with discussion). Statistical Science 6: 404–436.
17. Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 74: 829–836.
18. Cleveland, W. S., Grosse, E. and Shyu, W. M. (1992). Local regression models. In Statistical Models in S, eds. J. M. Chambers and T. J. Hastie, Wadsworth and Brooks, California, 309–376.
19. Cox, D. R. (1972). Regression models and life-tables (with discussion). Journal of the Royal Statistical Society, Series B 34: 187–220.
20. Cox, D. R. (1975). Partial likelihood. Biometrika 62: 269–276.
21. Eubank, R. L. and Hart, J. D. (1992). Testing goodness-of-fit in regression via order selection criteria. The Annals of Statistics 20: 1412–1425.
22. Eubank, R. L. and LaRiccia, V. M. (1992). Asymptotic comparison of Cramér-von Mises and nonparametric function estimation techniques for testing goodness-of-fit. The Annals of Statistics 20: 2071–2086.
23. Fan, J. (1992). Design-adaptive nonparametric regression. Journal of the American Statistical Association 87: 998–1004.
24. Fan, J. (1993). Local linear regression smoothers and their minimax efficiencies. The Annals of Statistics 21: 196–216.
25. Fan, J. (1996). Test of significance based on wavelet thresholding and Neyman's truncation. Journal of the American Statistical Association 91: 674–688.
26. Fan, J. and Chen, J. (1999). One-step local quasi-likelihood estimation. Journal of the Royal Statistical Society, Series B 61: 927–943.
27. Fan, J., Farmen, M. and Gijbels, I. (1998). A blueprint of local maximum likelihood estimation. Journal of the Royal Statistical Society, Series B 60: 591–608.
28. Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications, Chapman and Hall, London.
29. Fan, J. and Gijbels, I. (2000). Local polynomial fitting. In Smoothing and Regression: Approaches, Computation and Application, ed. M. G. Schimek, John Wiley and Sons, 228–275.
30. Fan, J., Gijbels, I. and King, M. (1997). Local likelihood and local partial likelihood in hazard regression. The Annals of Statistics 25: 1661–1690.
31. Fan, J., Heckman, N. E. and Wand, M. P. (1995). Local polynomial kernel regression for generalized linear models and quasi-likelihood functions. Journal of the American Statistical Association 90: 141–150.
32. Fan, J. and Huang, L. (2001). Goodness-of-fit test for parametric regression models. Journal of the American Statistical Association, to appear.
33. Fan, J. and Marron, J. S. (1994). Fast implementations of nonparametric curve estimators. Journal of Computational and Graphical Statistics 3: 35–56.
34. Fan, J. and Müller, M. (1995). Density and regression smoothing. In XploRe: An Interactive Statistical Computing Environment, eds. W. Härdle, S. Klinke and B. A. Turlach, Springer, Berlin, 77–99.
35. Fan, J. and Yao, Q. (1998). Efficient estimation of conditional variance functions in stochastic regression. Biometrika 85: 645–660.
36. Fan, J., Yao, Q. and Tong, H. (1996). Estimation of conditional densities and sensitivity measures in nonlinear dynamical systems. Biometrika 83: 189–206.
37. Fan, J., Zhang, C. and Zhang, J. (2001). Generalized likelihood ratio statistics and Wilks phenomenon. The Annals of Statistics 29, to appear.
38. Fan, J. and Zhang, J. (2000). Functional linear models for longitudinal data. Journal of the Royal Statistical Society, Series B 62: 303–332.
39. Fan, J. and Zhang, W. (1999). Statistical estimation in varying-coefficient models. The Annals of Statistics 27: 1491–1518.
40. Fleming, T. R. and Harrington, D. P. (1991). Counting Processes and Survival Analysis, Wiley, New York.
41. Gasser, T. and Müller, H.-G. (1984). Estimating regression functions and their derivatives by the kernel method. Scandinavian Journal of Statistics 11: 171–185.
42. Gasser, T., Müller, H.-G. and Mammitzsch, V. (1985). Kernels for nonparametric curve estimation. Journal of the Royal Statistical Society, Series B 47: 238–252.
43. Gentleman, R. and Crowley, J. (1991). Local full likelihood estimation for the proportional hazards model. Biometrics 47: 1283–1296.
44. Gijbels, I., Hall, P. and Kneip, A. (1995). On the estimation of jump points in smooth curves. Discussion Paper #9515, Institute of Statistics, Catholic University of Louvain, Louvain-la-Neuve, Belgium.
45. Granovsky, B. L. and Müller, H.-G. (1991). Optimizing kernel methods: A unifying variational principle. International Statistical Review 59: 373–388.
46. Green, P. J. and Silverman, B. W. (1994). Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach, Chapman and Hall, London.
47. Hall, P. and Jones, M. C. (1990). Adaptive M-estimation in nonparametric regression. The Annals of Statistics 18: 1712–1728.
48. Härdle, W. (1990). Applied Nonparametric Regression, Cambridge University Press, Boston.
49. Härdle, W. and Gasser, T. (1984). Robust non-parametric function fitting. Journal of the Royal Statistical Society, Series B 46: 42–51.
50. Härdle, W. and Mammen, E. (1993). Comparing nonparametric versus parametric regression fits. The Annals of Statistics 21: 1926–1947.
51. Hart, J. D. (1997). Nonparametric Smoothing and Lack-of-fit Tests, Springer, New York.
52. Hastie, T. J. and Tibshirani, R. (1990). Generalized Additive Models, Chapman and Hall, London.
53. Hastie, T. J. and Tibshirani, R. J. (1993). Varying-coefficient models (with discussion). Journal of the Royal Statistical Society, Series B 55: 757–796.
54. Hoover, D. R., Rice, J. A., Wu, C. O. and Yang, L. P. (1998). Nonparametric smoothing estimates of time-varying coefficient models with longitudinal data. Biometrika 85: 809–822.
55. Inglot, T., Kallenberg, W. C. M. and Ledwina, T. (1994). Power approximations to and power comparison of smooth goodness-of-fit tests. Scandinavian Journal of Statistics 21: 131–145.
56. Jones, M. C., Marron, J. S. and Sheather, S. J. (1996a). A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association 91: 401–407.
57. Jones, M. C., Marron, J. S. and Sheather, S. J. (1996b). Progress in data-based bandwidth selection for kernel density estimation. Computational Statistics 11: 337–381.
58. Kallenberg, W. C. M. and Ledwina, T. (1997). Data-driven smooth tests when the hypothesis is composite. Journal of the American Statistical Association 92: 1094–1104.
59. Kauermann, G. and Tutz, G. (1999). On model diagnostics using varying coefficient models. Biometrika 86: 119–128.
60. Koenker, R., Portnoy, S. and Ng, P. (1992). Nonparametric estimation of conditional quantile function. In Proceedings of the Conference on L1-Statistical Analysis and Related Methods, ed. Y. Dodge, Elsevier, 217–229.
61. Lehmann, E. L. (1983). Theory of Point Estimation, Wadsworth and Brooks/Cole, Pacific Grove, California.
62. Lin, X. and Carroll, R. J. (2000). Nonparametric function estimation for clustered data when the predictor is measured without/with error. Journal of the American Statistical Association 95: 520–534.
63. Mack, Y. P. and Müller, H.-G. (1989). Convolution type estimators for nonparametric regression estimation. Statistics and Probability Letters 7: 229–239.
64. Marron, J. S. and Nolan, D. (1988). Canonical kernels for density estimation. Statistics and Probability Letters 7: 195–199.
65. McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, Chapman and Hall, London.
66. Müller, H.-G. (1987). Weighted local regression and kernel methods for nonparametric curve fitting. Journal of the American Statistical Association 82: 231–238.
67. Müller, H.-G. (1992). Change-points in nonparametric regression analysis. The Annals of Statistics 20: 737–761.
68. Nadaraya, E. A. (1964). On estimating regression. Theory of Probability and Its Applications 9: 141–142.
69. Rawlings, J. O. and Spruill, S. E. (1994). Estimating pine seedling response to ozone and acidic rain. In Case Studies in Biometry, eds. N. Lange, L. Ryan, L. Billard, D. Brillinger, L. Conquest and J. Greenhouse, Wiley, New York, 81–106.
70. Ruppert, D. and Wand, M. P. (1994). Multivariate weighted least squares regression. The Annals of Statistics 22: 1346–1370.
71. Ruppert, D., Wand, M. P., Holst, U. and Hössjer, O. (1997). Local polynomial variance function estimation. Technometrics 39: 262–273.
72. Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization, John Wiley and Sons, New York.
73. Seifert, B. and Gasser, T. (1996). Finite-sample variance of local polynomials: Analysis and solutions. Journal of the American Statistical Association 91: 267–275.
74. Severini, T. A. and Staniswalis, J. (1994). Quasi-likelihood estimation in semiparametric models. Journal of the American Statistical Association 89: 501–511.
75. Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis, Chapman and Hall, London.
76. Simonoff, J. S. (1996). Smoothing Methods in Statistics, Springer, New York.
77. Speckman, P. (1988). Kernel smoothing in partial linear models. Journal of the Royal Statistical Society, Series B 50: 413–436.
78. Spokoiny, V. G. (1996). Adaptive hypothesis testing using wavelets. The Annals of Statistics 24: 2477–2498.
79. Stone, C. J. (1977). Consistent nonparametric regression. The Annals of Statistics 5: 595–645.
80. Stone, C. J. (1980). Optimal rates of convergence for nonparametric estimators. The Annals of Statistics 8: 1348–1360.
81. Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. The Annals of Statistics 10: 1040–1053.
82. Tibshirani, R. and Hastie, T. J. (1987). Local likelihood estimation. Journal of the American Statistical Association 82: 559–567.
83. Truong, Y. K. (1989). Asymptotic properties of kernel estimators based on local medians. The Annals of Statistics 17: 606–617.
84. Tsybakov, A. B. (1986). Robust reconstruction of functions by the local-approximation method. Problems of Information Transmission 22: 133–146.
85. Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing, Chapman and Hall, London.
86. Wand, M. P., Marron, J. S. and Ruppert, D. (1991). Transformations in density estimation. Journal of the American Statistical Association 86: 343–361.
87. Watson, G. S. (1964). Smooth regression analysis. Sankhyā, Series A 26: 359–372.
88. Yang, L. and Marron, J. S. (1999). Iterated transformation-kernel density estimation. Journal of the American Statistical Association 94: 580–589.
About the Author

Jianqing Fan is currently chair professor and chairman of the Department of Statistics, The Chinese University of Hong Kong. He is also a professor of Statistics at The University of North Carolina at Chapel Hill and a former professor at The University of California at Los Angeles. He is an elected fellow of the American Statistical Association and the Institute of Mathematical Statistics, and an associate editor of The Annals of Statistics, The Journal of the American Statistical Association and Statistica Sinica. He obtained his BS in Mathematics from Fudan University, his MS in Statistics from the Chinese Academy of Sciences, and his PhD in Statistics from the University
of California, Berkeley. He has published many papers in various aspects of applied, computational and theoretical statistics. His published work has been recognized with the Hettleman Prize from the University of North Carolina and the Presidents' Award of the Committee of Presidents of Statistical Societies. His current research interests include nonparametric modeling, statistical methods in finance, nonlinear time series, analysis of longitudinal data, model selection, wavelets, survival analysis, and generalized linear models, among others.
CHAPTER 25
BAYESIAN METHODS
MING-HUI CHEN
Department of Statistics, University of Connecticut, 215 Glenbrook Rd, U-4120, Storrs, CT 06269-4120, USA
Tel: 860-486-6984; [email protected]

KEYING YE
Department of Statistics, Virginia Polytechnic Institute and State University, USA
1. Introduction

In the practice of applied statistics and data analysis, summarizing data, making inferences about unknowns, fitting probability models and predicting the future are all important elements to be considered. Many statistical methodologies have been developed to deal with different problems in a world full of randomness. Two schools of statistics are popular nowadays. One of them is called classical or frequentist statistics. The other, which has progressed rapidly in the last decade and is more and more used in common practice, is called Bayesian statistics. Due to current advances in computing technology and the development of efficient computational algorithms, Bayesian statistics is becoming more popular in many applied fields such as agriculture, medicine, biology, public health, and epidemiology.

The Bayesian paradigm is based on specifying a probability model for the observed data D = (n, y, X), where n is the sample size, y is the n × 1 response vector, and X is the n × p matrix of covariates, given a vector of unknown parameters θ, leading to the likelihood function L(θ|D). Then we assume that θ is random and has a prior distribution denoted by π(θ). Inference concerning θ is then based on the posterior distribution, which is
obtained by Bayes' theorem. The posterior distribution is given by
π(θ|D) = L(θ|D)π(θ) / ∫_Ω L(θ|D)π(θ) dθ ,   (1)
where Ω denotes the parameter space of θ. From (1), it is clear that π(θ|D) is proportional to the likelihood multiplied by the prior,
π(θ|D) ∝ L(θ|D)π(θ) ,
and thus it involves a contribution from the observed data through L(θ|D) and a contribution from prior information quantified through π(θ). The quantity m(D) = ∫_Ω L(θ|D)π(θ) dθ is the normalizing constant of π(θ|D), and is often called the marginal distribution of the data or the prior predictive distribution. Notice that, from a Bayesian point of view, not only the data D but also the parameter θ is random. Foundationally speaking, to gain information about an unknown, say θ, it is natural to describe the knowledge of θ by a statistical distribution; the more information about the unknown is obtained through the data, the better the knowledge of the unknown becomes. The posterior in (1) can also be viewed as a prior distribution for future experimental observations, if any. Hence, Bayesian thinking entails a sequential learning process that leads to understanding unknowns in the scientific world. In this chapter, we describe a few aspects of Bayesian statistics, including posterior inference (Sec. 2), prior elicitation (Sec. 3), Bayesian computation (Sec. 4), and applied examples (Sec. 5).

2. Posterior Inference

2.1. Summary of posterior distributions

In Bayesian data analysis, many posterior quantities are of the form
E[h(θ)|D] = ∫_{R^p} h(θ)π(θ|D) dθ ,   (2)
where h(·) is a real-valued function of θ = (θ1, θ2, . . . , θp)′. We call (2) an integral-type posterior quantity, or the posterior expectation of h(θ). In (2), we assume that
E(|h(θ)| | D) = ∫_{R^p} |h(θ)|π(θ|D) dθ < ∞ .
Integral-type posterior quantities include posterior means, posterior variances, covariances, higher-order moments, and probabilities of sets, obtained by taking appropriate functional forms of h. For example, (2) reduces to: (a) the posterior mean of θ when h(θ) = θ; (b) the posterior covariance of θj and θj* if h(θ) = (θj − E(θj|D))(θj* − E(θj*|D))′, where E(θj|D) = ∫_{R^p} θj π(θ|D) dθ; (c) the posterior predictive density when h(θ) = f(z|θ), where f(z|θ) is the predictive density given the parameter θ; and (d) the posterior probability of a set A if h(θ) = 1{θ ∈ A}, where 1{θ ∈ A} denotes the indicator function. In (d), the posterior probability leads to a Bayesian p-value72 by taking an appropriate form of A. Some other posterior quantities, such as normalizing constants, Bayes factors, and posterior model probabilities, may not be written simply in the form of (2); however, they are functions of integral-type posterior quantities. Posterior quantiles, Bayesian credible intervals, and Bayesian highest posterior density (HPD) intervals are often viewed as nonintegral-type posterior quantities. Even these can be expressed as functions of integral-type posterior quantities under certain conditions. For example, let ξ = h(θ), and let ξ1−α be the (1 − α)th posterior quantile of ξ with respect to π(θ|D), where 0 < α < 1 and h(·) is a real-valued function. Then ξ1−α is the solution of the equation
∫_{R^p} 1{h(θ) ≤ t}π(θ|D) dθ = 1 − α .
Therefore, the posterior quantile is a function of the posterior expectation of 1{h(θ) ≤ t}.
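In practice these quantities are typically approximated from an MCMC sample. The following sketch shows the basic idea under an assumed setting (a bivariate normal stand-in for the posterior draws); it is an illustration, not a recommendation of any particular sampler.

```python
import numpy as np

def posterior_summaries(theta_draws, h, A=None, alpha=0.05):
    """Monte Carlo approximations of integral-type posterior quantities:
    sample averages of h(theta) estimate E[h(theta) | D], the sample variance
    estimates the posterior variance of h(theta), and an empirical quantile of
    xi = h(theta) estimates the posterior quantile."""
    xi = np.apply_along_axis(h, 1, theta_draws)          # h evaluated at each draw
    out = {"mean": xi.mean(),
           "var": xi.var(ddof=1),
           "quantile_1_minus_alpha": np.quantile(xi, 1 - alpha)}
    if A is not None:                                    # posterior probability of a set A
        out["prob_A"] = np.mean([A(t) for t in theta_draws])
    return out

# Illustration with draws standing in for an MCMC sample from pi(theta | D)
rng = np.random.default_rng(0)
draws = rng.multivariate_normal([0.0, 1.0], [[1.0, 0.3], [0.3, 2.0]], size=5000)
print(posterior_summaries(draws, h=lambda t: t[0] + t[1], A=lambda t: t[0] > 0))
```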
2.2. Predictive distributions

A major aspect of the Bayesian paradigm is prediction. Prediction is often an important goal in regression problems, and it usually plays an important role in model selection. The posterior predictive distribution of a future observation vector z given the data D is defined as
π(z|D) = ∫_Ω f(z|θ)π(θ|D) dθ ,   (3)
where f (z|θ) denotes the sampling density of z, and π(θ|D) is the posterior distribution of θ. We see that (3) is just the posterior expectation of f (z|θ),
and thus sampling from (3) is easily accomplished via the Gibbs sampler (see Sec. 4 for details) applied to π(θ|D). This is a nice feature of the Bayesian paradigm, since Eq. (3) shows that predictions and predictive distributions are easily computed once samples from π(θ|D) are available. Regarding the complementary roles of the predictive and posterior distributions in Bayesian data analysis, Box12 notes that the posterior distribution provides a basis for "estimation of parameters conditional on the adequacy of the entertained model" while the predictive distribution enables "criticism of the entertained model in light of current data". In this spirit, Gelfand et al.37 consider a cross-validation approach, in which the predictive distribution is used in various ways to assess model adequacy. The main idea of this cross-validation approach is to validate conditional predictive distributions arising from single-observation deletion against observed responses. Let y = (y1, y2, . . . , yn)′ denote the n × 1 vector of observed responses. Let X denote the n × p matrix of covariates whose ith row x′_i is associated with y_i. Then the observed data can be written as D = (n, y, X). Also let y^{(−i)} denote the (n − 1) × 1 response vector with y_i deleted, let X^{(−i)} denote the (n − 1) × p matrix that is X with the ith row x′_i deleted, and write the resulting observed data as D^{(−i)} = ((n − 1), y^{(−i)}, X^{(−i)}). In addition, let θ be the vector of model parameters. We assume that y_i ∼ f(y_i|θ, x_i), and we let π(θ) denote the prior distribution of θ. Then the posterior distribution of θ based on the data D is given by
π(θ|D) ∝ [ ∏_{i=1}^n f(y_i|θ, x_i) ] π(θ) ,   (4)
and the posterior distribution of θ based on the data D^{(−i)} is given by
π(θ|D^{(−i)}) ∝ [ ∏_{j≠i} f(y_j|θ, x_j) ] π(θ) .   (5)
Let z = (z1, z2, . . . , zn)′ denote future values of a replicate experiment. Also let π(z_i|x_i, D^{(−i)}) denote the conditional density of z_i given x_i and D^{(−i)}, defined as
π(z_i|x_i, D^{(−i)}) = ∫ f(z_i|θ, x_i)π(θ|D^{(−i)}) dθ ,   (6)
for i = 1, 2, . . . , n. The conditional predictive density π(zi |xi , D(−i) ) is also called the cross-validated predictive density. This density is to be checked against yi , for i = 1, 2, . . . , n in the sense that, if the model holds, yi may be viewed as a random observation from π(zi |xi , D(−i) ). To do this, we
take zi = yi in (6) and then we obtain the Conditional Predictive Ordinate (CPO): CPOi = π(yi |xi , D(−i) ) .
(7)
CPOi , which was proposed by Geisser36 and further discussed in Gelfand et al.,37 is a very useful quantity for model checking, since it describes how much the ith observation supports the model. Large CPO values indicate a good fit. Another application of the predictive distribution to construct the Bayesian standardized residual. Similar to the Studentized residuals with the current observation deleted, the Bayesian standardized residual can be computed as yi − E(zi |xi , D(−i) ) , di = E[g(zi , yi )|xi , D(−i) ] = p var(zi |xi , D(−i) )
(8)
where var(zi |xi , D(−i) ) is the variance of zi with respect to the predictive distribution π(zi |xi , D(−i) ) given by (6). Large |di |’s cast doubt upon the model but retaining the sign of di allows patterns of under or over fitting to be revealed. Example 1. Estimating apple production y in New Zealand.17 Let βj denote the average number of cartons per tree, conditional on the age of the tree, j, for j = 1, 2, . . . , 10. These averages are combined using the linear model y=
10 X
βj xj + ǫ ,
(9)
j=1
where xj is the number of trees of age j and and ǫ ∼ N (0, σ 2 ). Younger trees are known to produce fewer apples on average, so the model is subject to the constraints 0 ≤ β1 ≤ β2 ≤ · · · ≤ β10 .
(10)
Given data D on the number of trees and production by year and by orchard, Chen and Deely17 choose a noninformative prior for β1 , . . . , β9 , and σ 2 as well as a proper prior for β10 , which allow them to derive the full joint posterior density o n 2 −µ10 )2 10 N 1 X exp − (β102σ X 2 10 2 β x , exp y − − π(β, σ |D) = j ij i 2 c(D)σ N +1 2σ i=1 j=1 (11)
938
M.-H. Chen & K. Ye
where β = (β1 , β2 , . . . , β10 )′ , c(D) is the normalizing constant, 0 ≤ β1 ≤ β2 ≤ · · · ≤ β10 , and σ 2 > 0. For the New Zealand apple data, N = 207, 2 2 µ10 = 0.998, and σ10 = 0.0891, where µ10 and σ10 are specified using method-of-moments estimates from the growers’ data for trees of age 10. For the model given in (9), we have 1 (zi − x′i β)2 , f (zi |β, σ 2 , xi ) = √ exp − 2σ 2 2πσ where xi = (xi1 , xi2 , . . . , xi,10 )′ . Thus, Z E(zi |xi , D(−i) ) = (x′i β)π(β, σ 2 |D(−i) ) dβ dσ 2 .
Note that CPOi given by (7) can be rewritten as Z −1 1 (−i) 2 CPOi = f (yi |xi , D )= π(β, σ |D) dβ . f (yi |β, σ 2 , xi )
Let {(β l , σl2 ), l = 1, 2, . . . , L} denote a Gibbs sample from π(β, σ 2 |D) using the Gibbs sampler given in Sec. 4. Then, the Monte Carlo estimate of CPOi is given by " L #−1 X [i = L CPO , (12) (f (yi |βl , σl2 , xi ))−1 l=1
and the Monte Carlo estimates of E(zi |xi , D(−i) ) and var(zi |xi , D(−i) ) are given by ˆ i |xi , D(−i) )) = CPO [ i L−1 E(z and
L X l=1
x′i βl , f (yi |β l , σl2 , xi )
(13)
ˆ i2 |xi , D(−i) ) − [E(z ˆ i |xi , D(−i) )]2 var(z c i |xi , D(−i) )) = E(z [ i L−1 = CPO
L X σl2 + (x′i βl )2 f (yi |β l , σl2 , xi ) l=1
ˆ i |xi , D(−i) )]2 , − [E(z
(14)
respectively. Using (13) and (14), the Monte Carlo estimate of the Bayesian standardized residual di is ˆ i |xi , D(−i) ) yi − E(z . dˆi = p var(z c i |xi , D(−i) )
(15)
Bayesian Methods
Fig. 1.
939
The Bayesian standardized residual plot.
For the New Zealand apple data, Chen and Deely17 use 50,000 Gibbs iterations to obtain the dˆi ’s, and the results are displayed in Fig. 1. ˆ i |xi , From Fig. 1, it can be seen that: (i) the dˆi ’s are small when the E(z (−i) ˆ D )’s are small; and (ii) the di ’s are roughly symmetric about zero, which implies that the model is neither over-fitted nor under-fitted. Chen and Deely17 also check the distribution of dˆi and find that the dˆi ’s roughly follow a Student t distribution. Noting that f (yi |β, σ 2 , xi ) is a normal distribution and fˆ(yi |xi , D(−i) ) in (12) is a finite mixture of normal distributions, it follows from a result of Johnson and Geisser65 that f (yi |xi , D(−i) ) is approximately a Student t distribution. Hence the results obtained by Chen and Deely17 are consistent with the theoretical result of Johnson and Geisser,65 and give further support that the normal assumption of the error terms in the constrained multiple linear regression model is appropriate.
940
M.-H. Chen & K. Ye
2.3. Marginal distributions In Bayesian inference, a joint posterior distribution is available through the likelihood function and a prior distribution. One purpose of Bayesian inference is to calculate and display marginal posterior densities because the marginal posterior densities provide complete information about parameters of interest. Let θ(j) = (θ1 , . . . , θj )′
and θ (−j) = (θj+1 , . . . , θp )′
be the first j and last p − j components of θ, respectively. The support of the conditional joint marginal posterior density of θ (j) given θ (−j) is denoted by Ωj (θ(−j) ) = {(θ1 , . . . , θj )′ : (θ1 , . . . , θj , θj+1 , . . . , θp )′ ∈ Ω} ,
(16)
and the subspace of Ω, given the first j components θ ∗(j) = (θ1∗ , . . . , θj∗ )′ , is denoted by Ω−j (θ∗(j) ) = {(θj+1 , . . . , θp )′ : (θ1∗ , . . . , θj∗ , θj+1 , . . . , θp )′ ∈ Ω} .
(17)
Then the marginal posterior density of θ (j) evaluated at θ ∗(j) has the form Z π(θ ∗(j) , θ(−j) |D) dθ (−j) . (18) π(θ ∗(j) |D) = Ω−j (θ ∗(j) )
In general, the analytical evaluation of π(θ ∗(j) |D) is not available. Thus, a Monte Carlo method is much needed to estimate it. There are several Monte Carlo methods available. These include the kernel density estimation, the conditional density estimation, and the importance weighted marginal density estimation (IWMDE) of Chen.16 Here, we briefly describe how IWMDE works, and we refer the interesting readers to Chen et al.26 for detailed discussion of other methods. Consider the following identity: Z w(θ (j) |θ(−j) )π(θ ∗(j) , θ(−j) |D) π(θ|D) dθ , (19) π(θ ∗(j) |D) = π(θ|D) Ω where w(θ (j) |θ(−j) ) is a completely known conditional density whose support is contained in, or equal to, the support, Ωj (θ (−j) ), of the conditional density π(θ (j) |θ(−j) , D). Here, “completely known” means that w(θ (j) |θ(−j) ) can be evaluated at any point of (θ (j) , θ(−j) ). In other words, the kernel and the normalizing constant of this conditional density are
941
Bayesian Methods
available in closed form. Using the identity (19), the IWMDE of π(θ ∗(j) |D) is defined by n
π ˆ (θ∗(j) |D) = (j)
(−j)
∗(j) , θi |D) 1X (j) (−j) π(θ w(θ i |θi ) , (−j) (j) n i=1 |D) π(θ , θ i
(20)
i
(−j)
where {θ i = (θ i , θi ), i = 1, 2, . . . , n} is an MCMC sample from π(θ|D). In (20), w plays the role of a weight function. Further, π ˆ (θ ∗(j) |D) does not depend on the unknown normalizing constant c(D), since c(D) (−j) (j) (−j) cancels in the ratio π(θ ∗(j) , θi |D)/π(θ i , θi |D). In fact, using (1), we can rewrite (20) as n
π ˆ (θ ∗(j) |D) =
∗(j) , θi |D)π(θ ∗(j) , θi 1X (j) (−j) L(θ w(θ i |θ i ) n i=1 L(θi |D)π(θ i ) (−j)
(−j)
)
.
The choice of w and the properties of π ˆ (θ ∗(j) |D) can be found in Chen.16 Thus, the detail is omitted here for brevity. 2.4. Posterior Model Probabilities Suppose there are K models under consideration. Assume model m has a vector θ (m) of unknown parameters, with dimension pm , which may vary from model to model, for m = 1, 2, . . . , K. Under model m, the posterior distribution of θ (m) takes the form π(θ (m) |D, m) ∝ π ∗ (θ (m) |D, m) = L(θ (m) |D, m)π(θ (m) |m) ,
(21)
where L(θ (m) |D, m) is the likelihood function, D denotes the data, π(θ (m) |m) is the prior distribution, and π ∗ (θ(m) |D, m) is the unnormalized posterior density. Let p(m) denote the prior probability of model m. Then, using Bayes’ theorem, the posterior probability of model m can be written as p(D|m)p(m) , (22) p(m|D) = PK j=1 p(D|j)p(j) where
p(D|m) = =
Z
Z
L(θ(m) |D, m) π(θ (m) |m) dθ (m) π ∗ (θ (m) |D, m) dθ (m)
(23)
denotes the marginal distribution of the data D under model m. The marginal density p(D|m) is precisely the normalizing constant of the joint
942
M.-H. Chen & K. Ye
posterior density of θ (m) . We choose the model with the largest posterior model probability p(m|D). Model selection, in particular variable selection, is one of the most frequently encountered problems in statistical data analysis. In cancer or AIDS clinical trials, for example, one often wishes to assess the importance of certain prognostic factors such as treatment, age, gender, or race in predicting survival outcome. Bayesian approach to model selection is more attractive than a criterion-based classical method such as the Akaike Information Criterion (AIC)2 or Bayesian Information Criterion (BIC),90 since available prior information can be incorporated into the posterior model probability via p(m) and π(θ (m) |m) and thus more power can be achieved in order to identify the correct model. However, Bayesian model selection is often difficult to carry out because of the challenge in specifying prior distributions for the regression parameters for all possible models; specifying a prior distribution on the model space; and computations. Other than focusing on a particular model selection using the largest posterior probability, one may also use the probability in (22) as a weight function to incorporate model uncertainty in a prediction. Such a criterion is called Bayesian Model Averaging or simply BMA. Suppose that one is interested in predicting certain quantity ∆ such as predicting a future observation or a coefficient estimation for a regression problem. Instead of using just one model in prediction, one makes a prediction by average all feasible models through a weighted average whereas the weights are calculated from the posterior probabilities of the models. The posterior distribution given data D is X p(∆|D) = p(∆|m, D)p(m|D) . (24) model m
The models used in the above calculation are the ones with significant posterior probabilities. The purpose of this model averaging is to avoid any risk of believing that the data belongs to a particular model. Instead, it accounts for the model uncertainty in predictions. More details of BMA can be found in Raftery et al.83 and Hoeting et al.57 and the references therein. As an illustration, we look at the following example. Example 2. Assessment of health using body fat data. A variety of popular health books suggest that the readers assess their health, at least in part, by estimating their percentage of body fat. A data set used in Penrose et al.80 studied the predictive equations of human’s body fat with other variables such as age, weight, height, neck circumference,
943
Bayesian Methods Table 1.
Regular model selection results for the body fat data. Variable Names
R2
s2
Age, Weight, Neck, Abdomen, Thigh, Forearm, Wrist Age, Weight, Neck, Abdomen, Hip, Thigh, Biceps, Forearm, Wrist
0.7445 0.7447
18.41 18.32
Method Stepwise Adjusted R2
Table 2.
Posterior probabilities of models for the body fat data.
Model 1 2 3 4
Variable Names Weight, Weight, Weight, Weight,
Abdomen, Forearm, Wrist Abdomen, Wrist Abdomen, Biceps, Wrist Neck, Abdomen, Forearm, Wrist
R2
p(m|D)
0.7350 0.7277 0.7328 0.7379
0.47043 0.24656 0.16415 0.11885
chest circumference, abdomen 2 circumference, hip circumference, t high circumference, knee circumference, ankle circumference, biceps (extended) circumference, forearm circumference and wrist circumference. Apparently a multiple linear regression model can be used here. However, since there are a lot of independent variables, it is quite natural to use certain model selection techniques to obtain a “best” model. The two commonly used methods, namely stepwise regression and adjusted R2 method come up with two different models as follows. From Table 1, it seems both methods yielded quite comparative results in terms of variation explanation and regression accuracy. Suppose that we want to predict somebody’s body fat at the values Age = 36, Weight = 226.75, Height = 71.75, Neck = 41.5, Chest = 115.3, Abdomen = 108.8, Hip = 114.4, Thigh = 69.2, Knee = 42.4, Ankle = 24, Biceps = 35.4, Forearm = 21 and Wrist = 20.1. The stepwise regression gave an estimate of 21.39 with a prediction variance 22.73, while the adjusted R 2 resulted in an estimate of 21.73 with a prediction variance 22.81. Note that those variances are the variances of the predictions under given models. If the prediction model is not a correct one, then the variance in prediction would be very different. On the other hand, we may use the method of BMA to deal with this data set. The models with significant posterior probabilities are given in Table 2. In Table 2, it clearly shows that both models selected by classical sequential model selection methods are not with high posterior probabilities. To predict the person’s body fat for the same values as above, the prediction
is at 24.26 with a prediction variance of 26.82. However, this is the variance over all plausible models and it can be decomposed by two parts. The first part is 21.37 which is quite comparable to the variances we derived using sequential methods. This part is called pooled variances of all the models used in BMA. The second part, which is 5.45, is the part due to model uncertainty. Bayesian approaches are now feasible due to recent advances in computing technology and the development of efficient computational algorithms. In particular, Chen et al.21 and Ibrahim et al.61 propose informative prior distributions π(θ (m) |m) and p(m) for the parameter θ (m) and model m, and develop novel methods for computing the marginal distribution of the data. In addition, the stochastic search variable selection of George and McCulloch45 and a novel reversible jump MCMC algorithm proposed by Green52 make the computation of posterior model probabilities possible when K is large. In Sec. 6 below, we present a real data example from a series of animal toxicological experiments performed in the Department of Biology at the University of Waterloo to illustrate Bayesian model selection using informative priors.
3. Prior Elicitation

The prior distribution is one of the most important elements of Bayesian methodology. As a matter of fact, it is the most challenging element for practitioners. Rather than having a large amount of data for which large-sample theory can be used, most experiments yield data sets of small to moderate sample size. Thus, the prior distribution plays a very important role in Bayesian analysis. We insist that whenever a practitioner can summarize historical or subjective information on an unknown, an informative prior should be elicited. The difficulty in seeking an informative prior is how to connect the known or subjective information to a prior distribution. Conjugate priors have most commonly been sought because of the simplicity of their distributional forms and for computational reasons. However, recent developments in Bayesian computation overcome much of the difficulty of using non-conjugate priors. Hence, for a practitioner, it is important to summarize all the information about an unknown into an approximate distributional form and to use such a distribution as a prior. On the other hand, in many cases either historical or subjective knowledge of the unknown is not available, or there are too many parameters whose prior distributions need to be specified, and noninformative priors are constantly
used as alternatives. However, contrary to their name, all noninformative priors are actually informative. They are usually based on different criteria that people use to generate prior distributions serving different purposes. In the following subsections, we discuss informative and noninformative priors in more detail.
3.1. Informative priors Informative priors are useful in applied research settings where the investigator has access to previous studies measuring the same response and covariates as the current study. For example, in many cancer and AIDS clinical trials, current studies often use treatments that are very similar or slight modifications of treatments used in previous studies. We refer to data arising from previous similar studies as historical data. In carcinogenicity studies, for example, large historical databases exist for the control animals from previous experiments. In all of these situations, it is natural to incorporate the historical data into the current study by quantifying it with a suitable prior distribution on the model parameters. The methodology discussed here can be applied to each of these situations as well as in other applications that involve historical data. From a Bayesian perspective, historical data from past similar studies can be very helpful in interpreting the results of the current study. For example, historical control data can be very helpful in interpreting the results of a carcinogenicity study. According to Haseman et al.,55 historical data can be useful when control tumor rates are low and when marginal significance levels are obtained in a test for dose effects. Suppose, for example, that 4 of 50 animals in an exposed group develop a specific tumor, compared with 0 of 50 in a control group. This difference is not statistically significant (p = 0.12, based on Fisher’s exact test). However, the difference may be biologically significant if the observed tumor type is known to be extremely rare in the particular animal strain being studied. By specifying a suitable prior distribution on the control response rates that reflect the observed rates of a particular defect over a large series of past studies, one can derive a modified test statistic that incorporates historical information. If the defect is rare enough in the historical series, then even the difference of 4/50 versus 0/50 will be statistically significant based on a method that appropriately incorporates historical information. To fix ideas, suppose we have historical data from a similar previous study, denoted by D0 = (n0 , y0 , X0 ) where n0 is the sample size of the
historical data, y0 is the n0 × 1 response vector, and X0 is the n0 × p matrix of covariates based on the historical data. Chen et al.19 and Ibrahim and Chen58 proposed the power prior to incorporate historical information. The power prior is defined to be the likelihood function based on the historical data D0 , raised to a power a0 , where 0 ≤ a0 ≤ 1 is a scalar parameter that it controls the influence of the historical data on the current data. One of the most useful applications of the power prior is for model selection problems, since these priors inherently automate the informative prior specification for all possible models in the model space. They are quite attractive in this context, since specifying meaningful informative prior distributions for the parameters in each model is a difficult task requiring contextual interpretations of a large number of parameters. In variable subset selection, for example, the prior distributions for all possible subset models are automatically determined once the historical data D0 , and a0 are specified. Berger and Mallows8 refer to such priors as “semi-automatic” in their discussion of Mitchell and Beauchamp.76 Chen et al.23 use the power prior for heritability estimates from human twin data. Chen et al.22 demonstrate the use of the power prior in variable selection contexts for logistic regression. Ibrahim et al.61 and Chen et al.20 develop the power prior for the class of generalized linear mixed models. Ibrahim and Chen,59 Ibrahim et al.,60 Chen et al.18,21 develop the power prior for various types of models for survival data. Let π0 (θ) denote the prior distribution for θ before the historical data D0 is observed. We shall call π0 (θ) the initial prior distribution for θ. Given a0 , we define the power prior distribution of θ for the current study as π(θ|D0 , a0 ) ∝ L(θ|D0 )a0 π0 (θ) ,
(25)
where a0 is a scalar prior parameter that weights the historical data relative to the likelihood of the current study. The parameter a0 can be interpreted as a precision parameter for the historical data. It is reasonable to restrict the range of a0 to be between 0 and 1, and thus we take 0 ≤ a0 ≤ 1. One of the main roles of a0 is that it controls the heaviness of the tails of the prior for θ. As a0 becomes smaller, the tails of (25) become heavier. Setting a0 = 1, (25) corresponds to the update of π0 (θ|c0 ) using Bayes theorem. That is, with a0 = 1, (25) corresponds to the posterior distribution of θ from the previous study. When a0 = 0, then the prior does not depend on the historical data, and in this case, π(θ|D0 , a0 = 0) ≡ π0 (θ). Thus, a0 = 0 is equivalent to prior specification with no incorporation of historical data.
Therefore, (25) can be viewed as a generalization of the usual Bayesian update of π0(θ). The parameter a0 allows the investigator to control the influence of the historical data on the current study. Such control is important in cases where there is heterogeneity between the previous and current study, or when the sample sizes of the two studies are quite different. The hierarchical power prior specification is completed by specifying a (proper) prior distribution for a0. Thus we propose a joint power prior distribution for (θ, a0) of the form
π(θ, a0|D0) ∝ L(θ|D0)^{a0} π0(θ) π(a0|γ0) ,
(26)
where γ 0 is a specified hyperparameter vector. A natural choice for π(a0 |γ 0 ) is a beta prior. However, other choices, including a truncated gamma prior or a truncated normal prior can be used. These three priors for a0 have similar theoretical properties, and our experience shows that they have similar computational properties. In practice, they yield similar results when the hyperparameters are appropriately chosen. Thus, for a clear focus and exposition, we will use a beta distribution for π(a0 |γ0 ), which takes the form π(a0 |γ 0 ) ∝ aδ00 −1 (1 − a0 )λ0 −1 , where γ 0 = (δ0 , λ0 ). The beta prior for a0 appears to be the most natural prior to use and leads to the most natural elicitation scheme. The prior in (26) does not have a closed form in general, but it has several attractive theoretical and computational properties for the classes of models considered here. One attractive feature of (26) is that it creates heavier tails for the marginal prior of θ than the prior in (25), which assumes that a0 is a fixed value. This is a desirable feature since it gives the investigator more flexibility in weighting the historical data. In addition, the construction of (26) is quite general, with various possibilities for π0 (θ). If π0 (θ) is proper, then (26) is guaranteed to be proper. Further, (26) can be proper even if π0 (θ) is an improper uniform prior. Specifically, Ibrahim et al.62 and Chen et al.22 characterize the propriety of (26) for generalized linear models, and also show that for fixed a0 , the prior converges to a multivariate normal distribution as n0 → ∞. For the class of generalized linear mixed models, Ibrahim et al.,61 Chen et al.18,19 characterize the propriety of (26) and derive various other theoretical properties of the power prior. Ibrahim et al.,60 and Ibrahim and Chen59 characterize various properties of (26) for proportional hazards models, and Chen et al.21 examine various theoretical properties of (26) for a class of cure rate models.
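To make the construction concrete, the sketch below evaluates the joint power prior posterior (26) on a grid for a simple Bernoulli success probability θ, with a Beta(δ0, λ0) prior on a0 and a flat Beta(1, 1) initial prior π0(θ). The Bernoulli model, the grid evaluation, and the data values are illustrative assumptions only; they are not the models analyzed in this chapter.

```python
import numpy as np
from scipy.stats import beta as beta_dist, binom

def power_prior_posterior(y, n, y0, n0, delta0=1.0, lambda0=1.0, m=200):
    """Grid approximation to pi(theta, a0 | D, D0) proportional to
    L(theta | D) * L(theta | D0)^{a0} * pi0(theta) * pi(a0 | delta0, lambda0)."""
    theta = np.linspace(1e-4, 1 - 1e-4, m)
    a0 = np.linspace(1e-4, 1 - 1e-4, m)
    T, A = np.meshgrid(theta, a0, indexing="ij")
    log_post = (binom.logpmf(y, n, T)                    # current likelihood
                + A * binom.logpmf(y0, n0, T)            # historical likelihood raised to a0
                + beta_dist.logpdf(A, delta0, lambda0))  # beta prior on a0 (pi0 is flat)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()                                   # normalize on the grid
    return theta, a0, post

# Illustration with hypothetical counts: 4/50 events currently, 2/50 historically
theta, a0, post = power_prior_posterior(y=4, n=50, y0=2, n0=50, delta0=5, lambda0=5)
print("E[theta | D, D0] =", np.sum(theta[:, None] * post))
print("E[a0 | D, D0]    =", np.sum(a0[None, :] * post))
```

Larger (δ0, λ0) values concentrate the prior for a0 and therefore fix the amount of borrowing from the historical data more tightly, which mirrors the behaviour seen in Table 3 below.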
3.1.1. Example 3. An analysis of the AIDS study ACTG036 using the data from ACTG019 as historical data The ACTG019 study was a double blind placebo-controlled clinical trial comparing zidovudine (AZT) to placebo in persons with CD4 counts less than 500. The results of this study were published in Volberding et al.100 The sample size for this study, excluding cases with missing data, was n0 = 823. The response variable (y0) for these data is binary, with a 1 indicating death, development of AIDS, or AIDS related complex (ARC), and a 0 indicating otherwise. Several covariates were also measured. The ACTG036 study was also a placebo-controlled clinical trial comparing AZT to placebo in patients with hereditary coagulation disorders. The results of this study have been published by Merigen et al.74 The sample size in this study, excluding cases with missing data, was n = 183. The response variable (y) for these data is binary, with a 1 indicating death, development of AIDS, or AIDS related complex (ARC), and a 0 indicating otherwise. Several covariates were measured for these data. A summary of both data sets can be found in Chen et al.22 We let D0 denote the data from the ACTG019 study and D denote the data from the ACTG036 study. Chen et al.22 use the priors given by (26) and the logistic regression model to carry out variable subset selection, which yields the model containing an intercept, CD4 count (cell count per mm3 of serum), age, and treatment as the best model. For this model, we use the power prior (26) to obtain posterior estimates of the regression coefficients for various choices of (µa0, σa0), where µa0 = δ0/(δ0 + λ0) and σa0^2 = µa0(1 − µa0)(δ0 + λ0 + 1)^{−1}. The results based on the standardized covariates and the logit model with an improper uniform prior for the regression coefficients are given in Table 3. The values of (µa0, σa0) and the corresponding values of (δ0, λ0) are also reported in the table. We used 50,000 Gibbs iterations for all posterior computations and the Monte Carlo method of Chen and Shao24 to calculate 95% highest probability density (HPD) intervals for the parameters of interest. From Table 3, we see that as the weight for the ACTG019 study increases, the posterior mean of a0 (denoted E(a0|D, D0)) increases, the posterior standard deviations (SD) for all parameters decrease, and the 95% HPD intervals get narrower. Most noticeably, when (δ0, λ0) = (100, 1), none of the HPD intervals for the regression coefficients contain 0. Table 3 also indicates that the HPD intervals are not too sensitive to moderate changes in (µa0, σa0). This is a comforting feature, since it implies that the HPD intervals are fairly robust with respect to the hyperparameters
Table 3. Posterior estimates for AIDS data.

(δ0, λ0)   (µa0, σa0)     E(a0|D, D0)   Variable     Posterior Mean   Posterior SD   95% HPD Interval
(5, 5)     (0.50, 0.151)  0.02          Intercept        −4.389          0.725       (−5.836, −3.055)
                                        CD4 count        −1.437          0.394       (−2.238, −0.711)
                                        Age               0.135          0.221       (−0.314, 0.556)
                                        Treatment        −0.120          0.354       (−0.817, 0.570)
(20, 20)   (0.50, 0.078)  0.09          Intercept        −3.803          0.511       (−4.834, −2.868)
                                        CD4 count        −1.129          0.300       (−1.723, −0.559)
                                        Age               0.176          0.195       (−0.214, 0.552)
                                        Treatment        −0.223          0.300       (−0.821, 0.364)
(30, 30)   (0.50, 0.064)  0.13          Intercept        −3.621          0.436       (−4.489, −2.809)
                                        CD4 count        −1.028          0.265       (−1.551, −0.515)
                                        Age               0.194          0.185       (−0.170, 0.557)
                                        Treatment        −0.259          0.278       (−0.805, 0.288)
(50, 1)    (0.98, 0.019)  0.26          Intercept        −3.337          0.323       (−3.978, −2.715)
                                        CD4 count        −0.865          0.211       (−1.276, −0.448)
                                        Age               0.233          0.160       (−0.081, 0.548)
                                        Treatment        −0.314          0.230       (−0.766, 0.138)
(100, 1)   (0.99, 0.010)  0.53          Intercept        −3.144          0.231       (−3.601, −2.705)
                                        CD4 count        −0.746          0.161       (−1.058, −0.429)
                                        Age               0.271          0.135       (0.001, 0.529)
                                        Treatment        −0.356          0.181       (−0.717, −0.011)
of a0. This same robustness feature is also exhibited in posterior model probability calculations.22 3.2. Conjugate priors Conjugate priors were quite popular before the breakthrough of powerful Bayesian computational techniques. Suppose F is a class of prior distributions for θ, where θ is the parameter, and P is a class of sampling distributions f(y|θ). The class F is conjugate for P if, for any f(y|θ) in P, the prior π(θ) and the posterior π(θ|y) are all from the same family F. Since the posterior distribution has the same form as the prior, the closed form of the posterior can be derived accordingly. Hence, computing the posterior as described in Sec. 1 does not cause much trouble. Example 4. Suppose that a random variable Y follows a Poisson distribution with mean λ. The density function of Y given λ can be written as
f(y|λ) = e^{−λ} λ^y / y! ,   for y = 0, 1, . . . .   (27)
Further suppose that the prior distribution of λ is a Gamma distribution with parameters α and β, as follows
π(λ) ∝ λ^α e^{−βλ} ,
for λ > 0 .
(28)
The posterior distribution of λ thus can be calculated as
π(λ|y) ∝ π(λ) f(y|λ) ∝ λ^α e^{−βλ} e^{−λ} λ^y / y! ∝ λ^{α+y} e^{−(β+1)λ} ,
(29)
which is another Gamma distribution. Clearly, the prior (28) and the posterior (29) belong to the same Gamma family for any Poisson sampling distribution. The practical advantage of conjugate prior distributions is obvious, and this is the reason why they remain popular: practitioners often use them when they believe that their prior knowledge can be expressed in conjugate form. Although it is flexible to elicit a conjugate prior by varying the hyperparameter values, it is often not accurate to represent prior knowledge by choosing only one or two parameter values. On the other hand, a finite mixture of conjugate priors can sometimes overcome this difficulty, since a mixture of conjugate priors is also a conjugate prior.5 We note here that conjugacy depends on the families F and P one chooses. It also depends on the dimension of the parameter space. For instance, normal priors on the mean of a normal sampling distribution when the variance is known constitute a conjugate prior family, while inverse gamma priors on the variance of a normal sampling distribution when the mean is assumed known constitute another conjugate prior family. However, normal priors on the mean together with inverse gamma priors on the variance of the normal sampling distribution do not constitute a conjugate family, since the marginal posterior of the mean parameter is no longer normal. This suggests that one has to be careful when examining conjugacy for multi-dimensional parameter problems. Another point about conjugacy we would like to make is that conditional conjugacy is often quite useful for a multi-dimensional parameter model. Conditional conjugacy, sometimes called semi-conjugacy, means that the conditional posterior distribution of a set of parameters given the others and the prior distribution of the same set of parameters belong to the same distributional family, as in the example just mentioned with a normal mean and a normal variance. Although the marginal posterior distribution of the normal mean is no longer normal, the conditional posterior distribution of the normal mean
given the normal variance is still normal. Likewise, the conditional posterior distribution of the normal variance given the normal mean is still inverse gamma. This semi-conjugacy can be exploited in the full conditional distributions used in Gibbs sampling (see Sec. 4.1). 3.3. Noninformative priors As described in the opening of this section, none of the noninformative priors is truly non-informative; the derivations of these so-called noninformative priors all depend on certain informative criteria. One of the earliest methods of defining noninformative priors was based on the principle of insufficient reason. This method, sometimes referred to as Laplace's rule, prescribes a uniform prior on the parameter space Θ. Laplace used uniform priors on the probabilities of two binomial populations. Laplace's rule and the principle of insufficient reason are intuitively appealing: if no prior information is available that favors certain parameter values over others, then all parameter values should be considered equally likely. However, the immediate criticism of the uniform prior is that it is not invariant under parameter transformation. Example 5. Suppose that Y follows a Binomial distribution with parameters n and p, where n is known. Using Laplace's argument, if there is no subjective prior information available for p, a uniform prior π(p) = 1 for 0 < p < 1 should be used. Now assume that we are interested in the parameter q = 1/(1 + p). Since we still do not have information about q, we would have to assign q a uniform prior as well, i.e. π(q) = 2 for 1/2 < q < 1. On the other hand, since q is a transformation of p with Jacobian 1/q^2, the probability law implies that π(q) = π(p(q))/q^2 = 1/q^2 for 1/2 < q < 1. If both uniform distributions are used, this equality cannot hold, which shows that the uniform prior is not invariant under transformation. There is another issue that has been discussed extensively in the Bayesian school. Suppose we are going to use Laplace's rule and assign a uniform prior to the parameter. If the parameter space is finite, the uniform distribution is proper (that is, its integral over its domain is finite). However, if the parameter space is infinite, such as the mean of a normal distribution, a uniform prior is improper (not integrable). This phenomenon does not only happen for the uniform prior; it may happen to
many other noninformative priors we will discuss later. As a matter of fact, many of the noninformative priors we use are improper. Although an improper prior is not supported by Bayes' rule, it does not necessarily lead to problems in Bayesian analysis as long as the posterior distribution is proper. For more discussion of improper priors, readers are referred to Berger,5 Bernardo and Smith,10 and Kass and Wasserman.66 Once a posterior distribution is integrable, after normalization to a probability distribution, the final posterior distribution still represents post-data knowledge of the unknowns. Therefore, if an improper noninformative prior is used in practice, it is important to verify that the posterior distribution is integrable before making posterior inferences. This is even more important if simulation methods (see Sec. 4) are used to draw posterior inference, because an improper posterior distribution sometimes cannot be detected by simulation, which may lead to inappropriate conclusions when the true posterior distribution is actually improper. Since the pioneering work of Laplace in bringing Bayesian methodology into applied statistics, there have been many attempts to seek default or automatic prior distributions. In the following subsections, we discuss Jeffreys priors, the reference priors and the probability matching priors. 3.4. Jeffreys priors To overcome the difficulty of the non-invariant uniform prior criterion, Jeffreys64 derived a prior using invariance under parameter transformations. Before we present the Jeffreys prior, the expected Fisher information needs to be defined. Suppose that a random vector Y, given θ, has a probability density function f(y|θ) which is twice differentiable with respect to θ. The expected Fisher information matrix, denoted by I(θ) = {Iij(θ)}, is defined as
Iij(θ) = −Eθ[∂^2 log f(y|θ) / (∂θi ∂θj)] .   (30)
Once a sampling distribution is known with a density for which the Fisher information matrix exists, the Jeffreys prior is simply
πJ(θ) ∝ [det(I(θ))]^{1/2} ,   (31)
where det stands for the determinant. For any one-to-one transformation between two parameters θ and η, the priors for θ and η calculated using (31) do not lead to ambiguous results, i.e. the method used here
is invariant under parameter transformation. On the other hand, like a uniform prior, the Jeffreys prior may be improper. Example 6. Suppose that Y follows a Binomial distribution with unknown parameter p and known n. The density function of this distribution is f(y|p) ∝ p^y (1 − p)^{n−y}. Taking the second derivative of −log f(y|p) with respect to p yields y/p^2 + (n − y)/(1 − p)^2. The expectation of this quantity is n/[p(1 − p)]. Thus, the Jeffreys prior is proportional to 1/√(p(1 − p)).
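As a quick symbolic check of this calculation (an illustration, not part of the chapter), the Fisher information and the resulting Jeffreys prior kernel can be derived with sympy, using E(y) = np for the Binomial model.

```python
import sympy as sp

# Symbolic sketch of Example 6: the Jeffreys prior for the Binomial
# proportion p is proportional to 1/sqrt(p*(1 - p)).
y, n, p = sp.symbols('y n p', positive=True)

log_lik = y * sp.log(p) + (n - y) * sp.log(1 - p)     # kernel of log f(y|p)
neg_second_deriv = -sp.diff(log_lik, p, 2)            # y/p^2 + (n - y)/(1 - p)^2
fisher_info = sp.simplify(neg_second_deriv.subs(y, n * p))  # uses E[y] = n*p
jeffreys_kernel = sp.sqrt(fisher_info)

print(fisher_info)      # equivalent to n/(p*(1 - p))
print(jeffreys_kernel)  # proportional to 1/sqrt(p*(1 - p))
```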
Example 7. Suppose y1, . . . , yn form a random sample from a normal population N(µ, σ^2). The density function is given by f(y|µ, σ) ∝ σ^{−1} exp{−(y − µ)^2/(2σ^2)}. To calculate the Fisher information matrix, note that I11(µ, σ) = E(1/σ^2) = 1/σ^2, I12 = I21 = 0 and I22 = 2/σ^2. Hence the Jeffreys prior for (µ, σ) is 1/σ^2. Note that while the prior distribution in Example 6 is proper, the prior in Example 7 is improper. Yet, in the latter example, the posterior distributions are usually proper except in certain degenerate cases. Jeffreys priors are very commonly used for many different models, and many other developments of noninformative priors can be traced back to Jeffreys priors in some sense. One property of the Jeffreys prior is its invariance under parameter transformations. However, the use of the Jeffreys prior is not quite so appealing in multi-dimensional situations. For instance, the Jeffreys prior in Example 7 is 1/σ^2. For making inference about the mean parameter, even Jeffreys himself pointed out that this prior does not yield satisfactory results. Instead, in this case the prior 1/σ is usually used for the parameterization (µ, σ), which is the product of the Jeffreys prior of µ alone (uniform) and that of σ alone (1/σ). Here "alone" means that when the Jeffreys prior is calculated for one parameter, the other parameter is treated as fixed. In this case, neither individual prior depends on the fixed parameter. When the product of the two priors is used, "independence" of the prior knowledge of the two parameters is assumed. However, once the data are used, in the posterior analysis the two parameters are rarely independent. The above discussion raises the question of what kind of prior is good for multi-dimensional parameter problems. In most applied statistical problems there is more than one parameter; some are of primary interest and the others are treated as nuisance parameters. To find a noninformative prior for such models, Berger and Bernardo6 developed an
iterative algorithm to calculate noninformative priors for multi-parameter problems. Such priors are called reference priors and will be briefly described in the next subsection. 3.5. The reference priors The reference prior method, introduced by Bernardo9 and further developed by Berger and Bernardo,6,7 is motivated by the notion of maximizing the expected amount of information about the parameter θ provided by the data y. The amount of information provided by the experiment is quantified by the Kullback–Leibler divergence, defined by
D(g, h) = ∫_Θ g(θ) log[g(θ)/h(θ)] dθ ,
for two densities g and h. The expected information about θ provided by the data can be naturally defined as
EY(D(π(θ|y), π(θ))) ,
(32)
where π(θ) and π(θ|y) are the prior and posterior distributions, respectively. Theoretically, the reference prior approach is to find a prior such that the quantity (32) is maximized. However, the actual maximization involves a modification of (32), and an asymptotic argument based on infinitely many independent replications of the experiment is used. We now briefly describe the idea and procedure of the algorithm developed by Berger and Bernardo.7 To derive the reference priors for an experiment, one has to decompose the parameter space into ordered groups, ordered by importance: θ(1), θ(2), . . . , θ(m), where each group θ(j) contains one or more of the scalar parameters in θ. The reference prior is developed iteratively by first computing the marginal prior for θ(m), then the conditional prior for θ(m−1) given θ(m), then the conditional prior for θ(m−2) given θ(m−1) and θ(m), etc. Finally, a reference prior is obtained by multiplying all of the priors above together. In the derivation, the parameter spaces should be truncated to compact sets and certain limiting procedures may be used. The detailed algorithm can be found in Berger and Bernardo.7 Note that in addition to being divided into groups, the parameters in θ are also ordered. The ordering of the groups by importance may differ between users, even though the parameters of interest stay the same. Berger and Bernardo recommended the single-group ordering, in which there is only one parameter in each group.105,107 This recommendation is based on
their experience in applying the reference prior method to various applied problems. Since different groupings may yield different reference priors,7,106 it is possible that there exist different reference priors for the same model. Berger and Bernardo also state, concerning the ordering of the parameters in terms of inferential importance, that ". . . beyond putting the 'parameters of interest' first, it is too vague to be of much use." They recommend that, if possible, all reference priors for which the parameters of interest are placed first in the ordering should be computed. This provides a set of prior distributions which can be compared to assess the sensitivity of the resulting analyses to the choice of prior distribution. Finally, we point out that, interestingly, under certain regularity conditions the reference prior is the same as the Jeffreys prior, which means that with the Jeffreys prior the expected information about the parameter coming from the data alone is maximized. More studies of the reference priors can be found in Refs. 19, 46, 66, 94 and 105. In Sec. 5.3, we will discuss the reference priors for a statistical calibration model.
3.6. Probability matching priors From the inference point of view, confidence intervals are quite frequently used in practice. Although the concept of a confidence interval creates a lot of confusion in interpretation, such an interval gives quite important information about the accuracy of an estimate: over many runs of the experiment, the probability associated with a confidence interval is the coverage probability of the random intervals covering the true unknown. On the other hand, one can also derive a credible interval in a Bayesian analysis. Such an interval is quite similar to a confidence interval, except that the probability of the credible interval is the probability that the unknown parameter lies between two fixed numbers. This is actually how people usually interpret a confidence interval. The advantage of this interpretation comes from the fact that the parameter under consideration is random and the posterior probability is calculated over the parameter domain, not over the frequentist (repeated sampling) domain. Obtaining an automatic prior is one purpose of developing a noninformative prior method; the prior derived by such a method can be applied to any data generated from the statistical model under study. Obviously, not only the Bayesian credible interval is of interest, but also the confidence interval. Probability matching priors are priors with the property that the posterior probabilities of the posterior quantiles from
the resulting Bayesian analysis match the frequentist coverage probabilities of the same quantiles, at least asymptotically. Suppose that θ is a parameter of interest and the interval {φ(y) ≤ θ} has posterior probability α = P(φ(y) ≤ θ|y) (the αth posterior quantile). On the other hand, if we treat θ as fixed, the frequentist coverage probability of this interval can also be calculated as P(φ(y) ≤ θ|θ). If a prior can be obtained such that
α = P(φ(y) ≤ θ|y) ≈ P(φ(y) ≤ θ|θ) ,
(33)
for all y and θ, asymptotically, we call the prior a probability matching prior. Welch and Peers101 were the first to study such priors. In the one-dimensional case, they found that the Jeffreys prior satisfies this equality to order O(1/√n), which means that as n goes to infinity, the difference between the two probabilities in (33) goes to zero at the same rate as 1/√n. This is called first order matching. Stein93 and Tibshirani97 extended their work and used differential equations to obtain more first order matching priors. The probability matching priors have provided a certain justification for many of the noninformative priors. For instance, in many models the reference priors are matching priors, although on a few occasions they are not. Since there are usually many priors satisfying the differential equations that define the probability matching priors, this method alone does not lead to a single satisfactory prior. 4. Bayesian Computation There are two major challenges involved in advanced Bayesian computation: how to sample from posterior distributions and how to compute posterior quantities of interest using Markov chain Monte Carlo (MCMC) samples. Several books26,34,48,85,95 cover the development of MCMC sampling and advanced Monte Carlo (MC) methods for computing posterior quantities using the samples from the posterior distribution. 4.1. Sampling from posterior distribution During the last decade, Monte Carlo (MC) based sampling methods for evaluating high-dimensional posterior integrals have been developing rapidly. These sampling methods include MC importance sampling,44,54,84,102
Gibbs sampling,41,43 Metropolis–Hastings sampling52,56,75 and many other hybrid algorithms. 4.1.1. Basic Gibbs sampler The Gibbs sampler may be one of the best known MCMC sampling algorithms in the Bayesian computational literature. As discussed in Besag and Green,11 the Gibbs sampler is founded on the ideas of Grenander,53 while the formal term was introduced by Geman and Geman.43 The primary bibliographical landmark for Gibbs sampling in problems of Bayesian inference is Gelfand and Smith.41 A similar idea, termed data augmentation, was introduced by Tanner and Wong.96 Casella and George15 provide an excellent tutorial on the Gibbs sampler. Let θ = (θ1, θ2, . . . , θp)′ be a p-dimensional vector of parameters and let π(θ|D) be its posterior distribution given the data D. Then, the basic scheme of the Gibbs sampler is given as follows:
Step 0. Choose an arbitrary starting point θ0 = (θ1,0, θ2,0, . . . , θp,0)′, and set i = 0.
Step 1. Generate θi+1 = (θ1,i+1, θ2,i+1, . . . , θp,i+1)′ as follows:
• Generate θ1,i+1 ∼ π(θ1|θ2,i, . . . , θp,i, D);
• Generate θ2,i+1 ∼ π(θ2|θ1,i+1, θ3,i, . . . , θp,i, D);
• · · ·
• Generate θp,i+1 ∼ π(θp|θ1,i+1, θ2,i+1, . . . , θp−1,i+1, D).
Step 2. Set i = i + 1, and go to Step 1.
Thus each component of θ is visited in the natural order and a cycle in this scheme requires generation of p random variates. Gelfand and Smith41 show that under certain regularity conditions, the vector sequence {θi, i = 1, 2, . . .} has a stationary distribution π(θ|D). Schervish and Carlin88 provide a sufficient condition that guarantees geometric convergence. Other properties regarding geometric convergence are discussed in Roberts and Polson.87 Example 8. For the constrained linear model considered in Example 1, the posterior distribution for (β, σ^2) based on the New Zealand apple data D is given by (11). The Gibbs sampler can be implemented by taking
βj | β1, . . . , βj−1, βj+1, . . . , β10, σ^2, D ∼ N(θj, δj^2) ,
(34)
subject to βj−1 ≤ βj ≤ βj+1 (β0 = 0) for j = 1, 2, . . . , 9,
β10 | β1, . . . , β9, σ^2, D ∼ N(ψθ10 + (1 − ψ)µ10, (1 − ψ)σ10^2) ,
(35)
subject to β10 ≥ β9, and
σ^2 | β, D ∼ IG( n/2, Σ_{i=1}^n (yi − Σ_{j=1}^{10} xij βj)^2 / 2 ) ,
(36)
where in (34) and (35), ψ = σ10^2/(σ10^2 + δ10^2),
θj = (Σ_{i=1}^n xij^2)^{−1} Σ_{i=1}^n (yi − Σ_{l≠j} xil βl) xij ,
(37)
and
δj^2 = (Σ_{i=1}^n xij^2)^{−1} σ^2 ,
(38)
for j = 1, . . . , 10, and IG(ξ, η) denotes the inverse gamma distribution with parameters (ξ, η), whose density is given by
π(σ^2|ξ, η) ∝ (σ^2)^{−(ξ+1)} e^{−η/σ^2} .
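Before turning to the Metropolis–Hastings algorithm, the following minimal sketch illustrates the basic Gibbs scheme of this subsection in Python. It uses a simple semi-conjugate normal model with simulated data (an assumption made purely for illustration; it is not the constrained New Zealand apple model of Example 8).

```python
import numpy as np

# Minimal Gibbs sampler for a semi-conjugate normal model (simulated data):
# y_i ~ N(mu, sig2), mu ~ N(mu0, tau0sq), sig2 ~ IG(a0, b0).
rng = np.random.default_rng(1)
y = rng.normal(2.0, 1.5, size=100)
n, ybar = len(y), y.mean()
mu0, tau0sq, a0, b0 = 0.0, 100.0, 0.01, 0.01

n_iter = 5000
mu, sig2 = ybar, y.var()
draws = np.empty((n_iter, 2))
for i in range(n_iter):
    # full conditional of mu given sig2 is normal (semi-conjugacy, Sec. 3.2)
    prec = n / sig2 + 1.0 / tau0sq
    mean = (n * ybar / sig2 + mu0 / tau0sq) / prec
    mu = rng.normal(mean, prec**-0.5)
    # full conditional of sig2 given mu is inverse gamma
    a = a0 + n / 2.0
    b = b0 + 0.5 * np.sum((y - mu) ** 2)
    sig2 = 1.0 / rng.gamma(a, 1.0 / b)
    draws[i] = mu, sig2

burn = 1000
print(draws[burn:].mean(axis=0))  # MC estimates of the posterior means of (mu, sig2)
```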
4.1.2. Metropolis–Hastings algorithm The Metropolis–Hastings algorithm was developed by Metropolis et al.75 and subsequently generalized by Hastings.56 Tierney98 gives a comprehensive theoretical exposition of this algorithm, and Chib and Greenberg27 provide an excellent tutorial on this topic. Let q(θ, ϑ) be a proposal density, also termed a candidate-generating density by Chib and Greenberg,27 such that
∫ q(θ, ϑ) dϑ = 1 .
Also let U(0, 1) denote the uniform distribution over (0, 1). Then, a general version of the Metropolis–Hastings algorithm for sampling from the posterior distribution π(θ|D) can be described as follows:
Step 0. Choose an arbitrary starting point θ0 and set i = 0.
Step 1. Generate a candidate point θ∗ from q(θi, ·) and u from U(0, 1).
Step 2. Set θi+1 = θ∗ if u ≤ a(θi, θ∗) and θi+1 = θi otherwise, where the acceptance probability is given by
a(θ, ϑ) = min{ [π(ϑ|D) q(ϑ, θ)] / [π(θ|D) q(θ, ϑ)] , 1 } .
(39)
Step 3. Set i = i + 1, and go to Step 1.
The performance of a Metropolis–Hastings algorithm depends on the choice of the proposal density q. In the context of the random walk proposal density, which is of the form q(θ, ϑ) = q1(ϑ − θ), where q1(·) is a multivariate density, Roberts et al.86 show that if the target and proposal densities are normal, then the scale of the latter should be tuned so that the acceptance rate is approximately 0.45 in one-dimensional problems and approximately 0.23 as the number of dimensions approaches infinity, with the optimal acceptance rate being around 0.25 in six dimensions. For the independence chain, in which we take q(θ, ϑ) = q(ϑ), it is important to ensure that the tails of the proposal density q(ϑ) dominate those of the target density π(θ|D), which is similar to the requirement on the importance sampling function in Monte Carlo integration with importance sampling.
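Before turning to Example 9, the following minimal sketch implements the random-walk version of the algorithm for a one-dimensional target known only up to a normalizing constant (a standard normal, chosen purely for illustration). Because the proposal is symmetric, q cancels in the acceptance probability (39), and the scale is tuned toward the roughly 45% acceptance rate mentioned above.

```python
import numpy as np

# Random-walk Metropolis sketch for a 1-D target known up to a constant
# through its log density; the standard normal target below is an assumption
# for illustration only.
def log_post(theta):
    return -0.5 * theta**2  # replace with any log posterior known up to a constant

rng = np.random.default_rng(2)
n_iter, scale = 20000, 2.4   # scale chosen to give roughly 45% acceptance in 1-D
theta = 0.0
chain, accepted = np.empty(n_iter), 0
for i in range(n_iter):
    prop = theta + scale * rng.normal()           # random-walk proposal q1
    log_ratio = log_post(prop) - log_post(theta)  # q1 is symmetric, so q cancels
    if np.log(rng.uniform()) <= log_ratio:
        theta, accepted = prop, accepted + 1
    chain[i] = theta

print("acceptance rate:", accepted / n_iter)
print("posterior mean and sd:", chain.mean(), chain.std())
```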
Example 9. Consider a Poisson mixed model: yi ∼ P(µi), where µi = exp(x_i′β + ǫi) for i = 1, 2, . . . , n, xi is a p × 1 vector of covariates, and β is a p × 1 vector of regression coefficients. We assume the random effects ǫ = (ǫ1, ǫ2, . . . , ǫn)′ ∼ N(0, Σ), where
Σ = σ^2 ×
[ 1          ρ          ρ^2        · · ·   ρ^{n−1}
  ρ          1          ρ          · · ·   ρ^{n−2}
  .          .          .          .       .
  ρ^{n−1}    ρ^{n−2}    ρ^{n−3}    · · ·   1       ] .
Assume that a noninformative prior for (β, σ^2, ρ) has the form
π(β, σ^2, ρ) ∝ (σ^2)^{−(δ0+1)} exp(−σ^{−2} γ0) ,
where the hyperparameters δ0 > 0 and γ0 > 0 are prespecified. Then, the joint posterior distribution for (β, σ^2, ρ, ǫ) is given by
π(β, ρ, σ^2, ǫ|D) ∝ exp{ y′(Xβ + ǫ) − Jn′ Q(β, ǫ) − (1/2) ǫ′ Σ^{−1} ǫ } × [σ^n (1 − ρ^2)^{(n−1)/2}]^{−1} × (σ^2)^{−(δ0+1)} exp(−γ0/σ^2) ,
(40)
where y = (y1, y2, . . . , yn)′, Jn = (1, 1, . . . , 1)′, Q(β, ǫ) = (q1, q2, . . . , qn)′, qi = exp(x_i′β + ǫi) + log(yi!), X is the covariate matrix with the ith row equal to x_i′, and D = (n, y, X). To obtain a more efficient MCMC sampling algorithm, we consider a hierarchically centered reparameterization, which is given by η = Xβ + ǫ. Using (40), the reparameterized posterior for (β, σ^2, ρ, η) is written as
π(β, σ^2, ρ, η|D) ∝ exp{ y′η − Jn′ Q(η) − Jn′ C(y) } × (2πσ^2)^{−n/2} (1 − ρ^2)^{−(n−1)/2} × exp{ −(1/(2σ^2)) (η − Xβ)′ Σ^{−1}(η − Xβ) } ,
(41)
where η = (η1, η2, . . . , ηn)′, and Q(η) is an n × 1 vector with the ith element equal to qi = exp(ηi). We note that the hierarchical centering method of Gelfand et al.39,40 is a tool to improve convergence of MCMC sampling. As discussed in Chen et al.,26 this technique is particularly useful for the Poisson mixed model. To sample from the reparameterized posterior π(β, σ^2, ρ, η|D), the following steps are required:
Step 1. Draw η from its conditional posterior distribution
π(η|β, σ^2, ρ, D) ∝ exp{ y′η − Jn′ Q(η) − (η − Xβ)′ Σ^{−1}(η − Xβ)/(2σ^2) } .
(42)
Step 2. Draw β from
β|η, σ^2, ρ, D ∼ Np((X′Σ^{−1}X)^{−1} X′Σ^{−1} η, σ^2 (X′Σ^{−1}X)^{−1}) .
Step 3. Draw σ^2 from its conditional posterior
σ^2|β, ρ, η, D ∼ IG(δ∗, γ∗) ,
where δ∗ = δ0 + n/2, γ∗ = γ0 + (1/2)(η − Xβ)′Σ^{−1}(η − Xβ), and IG(δ∗, γ∗) is an inverse gamma distribution.
Step 4. Draw ρ from its conditional posterior
π(ρ|σ^2, β, η, D) ∝ (1 − ρ^2)^{−(n−1)/2} × exp{ −(1/(2σ^2)) (η − Xβ)′Σ^{−1}(η − Xβ) } .
(43)
In Step 1, it can be shown that π(η|β, σ^2, ρ, D) is log-concave in each component of η. Thus η can be drawn using the adaptive rejection sampling algorithm of Gilks and Wild.49 The implementation of Steps 2 and 3 is straightforward, which may be a bonus of hierarchical centering, since sampling β is much more expensive before the reparameterization. In Step 4, we use a so-called "Localized Metropolis" algorithm, which was introduced in Chen et al.26 The Localized Metropolis algorithm requires the following transformation:
ρ = (−1 + e^ξ)/(1 + e^ξ) ,   −∞ < ξ < ∞ .
Using (43), we have
π(ξ|σ^2, β, η, D) = π(ρ|σ^2, β, η, D) × 2e^ξ/(1 + e^ξ)^2 .
Now, we generate ξ by using a normal proposal N(ξ̂, σ̂_ξ̂^2), where ξ̂ is a maximizer of the logarithm of π(ξ|σ^2, β, η, D), which can be obtained by, for example, the Newton–Raphson algorithm, and σ̂_ξ̂^2 is the minus of the inverse of the second derivative of log π(ξ|σ^2, β, η, D) evaluated at ξ = ξ̂, i.e.
σ̂_ξ̂^{−2} = − d^2 log π(ξ|σ^2, β, η, D)/dξ^2 |_{ξ=ξ̂} .
The algorithm to generate ξ operates as follows:
(1) Let ξ be the current value.
(2) Generate a proposal value ξ∗ from N(ξ̂, σ̂_ξ̂^2).
(3) A move from ξ to ξ∗ is made with probability
min{ [π(ξ∗|σ^2, β, η, D) φ((ξ − ξ̂)/σ̂_ξ̂)] / [π(ξ|σ^2, β, η, D) φ((ξ∗ − ξ̂)/σ̂_ξ̂)] , 1 } ,
where φ is the N (0, 1) probability density function.
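The following one-dimensional sketch illustrates the idea behind this localized (independence) Metropolis step: the proposal is a normal density centered at the mode of the target with a curvature-based scale, and the acceptance probability is the one displayed above. The logistic-shaped target and all numerical values are stand-ins for illustration; this is not the actual conditional of ρ in Example 9.

```python
import numpy as np
from scipy import optimize

# Independence ("localized") Metropolis sketch: proposal N(xi_hat, s_hat^2)
# built from the mode and curvature of the target log density.
def log_target(xi):
    return xi - 2.0 * np.log1p(np.exp(xi))   # log of e^xi / (1 + e^xi)^2

# mode via scipy's scalar minimizer, curvature via a numerical 2nd derivative
res = optimize.minimize_scalar(lambda x: -log_target(x))
xi_hat = res.x
h = 1e-4
d2 = (log_target(xi_hat + h) - 2 * log_target(xi_hat) + log_target(xi_hat - h)) / h**2
s_hat = (-1.0 / d2) ** 0.5

rng = np.random.default_rng(3)
xi, chain = xi_hat, []
for _ in range(10000):
    prop = rng.normal(xi_hat, s_hat)
    # log acceptance ratio of the independence chain with normal proposal density
    log_acc = (log_target(prop) - log_target(xi)
               + 0.5 * ((prop - xi_hat) ** 2 - (xi - xi_hat) ** 2) / s_hat**2)
    if np.log(rng.uniform()) <= log_acc:
        xi = prop
    chain.append(xi)

print("posterior mean of xi:", np.mean(chain[1000:]))
```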
Note that the proposal N(ξ̂, σ̂_ξ̂^2) does not depend on the current value of ξ, which will typically produce a small autocorrelation among the ξ's. Recently, several Bayesian software packages have been developed. These include BUGS for analyzing general hierarchical models via MCMC (http://www.mrc-bsu.cam.ac.uk/bugs/), BATS for Bayesian time series analysis (http://www.stat.duke.edu/∼mw/bats.html), Matlab and Minitab Bayesian computational algorithms for introductory Bayesian analysis (http://www-math.bgsu.edu/∼albert/), and many others. A more complete listing and description of pre-1990 Bayesian software can be found in Goel.50 A listing of some of the Bayesian software developed since 1990 is given in Berger.4 4.2. Computing posterior quantities In Bayesian inference, MC methods are often used to compute the posterior expectation E(h(θ)|D), since the analytical evaluation of E(h(θ)|D) is typically not available. Assuming that {θi, i = 1, 2, . . . , n} is an MCMC sample from π(θ|D), the MC estimator of E(h(θ)|D) is given by
Ê(h) = (1/n) Σ_{i=1}^n h(θi) .
(44)
Asymptotic or small sample properties of Ê(h) depend on the algorithm used to generate the sample {θi, i = 1, 2, . . . , n}. Under certain regularity conditions such as ergodicity, the MC estimator Ê(h) is consistent. Since Ê(h) is a random quantity, it is important to compute the simulation standard error of Ê(h), as it provides the magnitude of the simulation accuracy of the estimator Ê(h). Let var(Ê(h)) be the variance of Ê(h), and let v̂ar(Ê(h)) be an estimate of var(Ê(h)). Then, the simulation standard error of Ê(h) is defined as
se(Ê(h)) = [v̂ar(Ê(h))]^{1/2} ,
(45)
which is the square root of the estimated variance of the MC estimator Ê(h). Since the sample generated by an MCMC sampling algorithm is often dependent, a complication that arises from the autocorrelation is that var(Ê(h)) is difficult to obtain. A variety of methods for obtaining a dependent-sample based estimate of var(Ê(h)) are discussed in system simulation textbooks.13,71,84 In this subsection, we briefly discuss a general overlapping batch statistics (obs) method considered in Schmeiser et al.89 for computing v̂ar(Ê(h)).
Suppose that {θi, i = 1, 2, . . . , n} is a dependent sample, from which a point estimator ξ̂ of the posterior quantity of interest is computed. (Here, ξ̂ = Ê(h).) The obs estimate of the variance of ξ̂ is
V̂(m) = [m/(n − m)] Σ_{j=1}^{n−m+1} (ξ̂j − ξ̂)^2 / (n − m + 1) ,
(46)
where ξ̂j is defined analogously to ξ̂, but is a function of only θj, θj+1, . . . , θj+m−1. Sufficient conditions for obs estimators to be unbiased and have variance inversely proportional to n are given in Schmeiser et al.89 Using (46), the simulation standard error of ξ̂ is se(ξ̂) = [V̂(m)]^{1/2}.
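As an illustration of (46), the following sketch computes the obs standard error of a posterior mean from a synthetic autocorrelated chain (an AR(1) series standing in for MCMC output, an assumption of the sketch); the naive i.i.d. standard error is shown for comparison.

```python
import numpy as np

# obs estimate (46) for the simulation standard error of a sample mean,
# applied to a synthetic autocorrelated (AR(1)) chain.
rng = np.random.default_rng(4)
n, phi = 20000, 0.9
theta = np.empty(n)
theta[0] = rng.normal()
for i in range(1, n):                      # dependent sample, as from MCMC
    theta[i] = phi * theta[i - 1] + rng.normal()

def obs_se(x, m):
    """Overlapping-batch standard error of the sample mean with batch size m."""
    n = len(x)
    xbar = x.mean()
    # means of the n - m + 1 overlapping batches x[j:j+m]
    csum = np.concatenate(([0.0], np.cumsum(x)))
    batch_means = (csum[m:] - csum[:-m]) / m
    v_hat = (m / (n - m)) * np.sum((batch_means - xbar) ** 2) / (n - m + 1)
    return np.sqrt(v_hat)

m = n // 15                                # batch size so that 10 <= n/m <= 20
print("naive iid se :", theta.std(ddof=1) / np.sqrt(n))
print("obs se       :", obs_se(theta, m))
```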
The primary difficulty in using the obs estimator is the choice of the batch size m to balance bias and variance, since no optimal batch size formula is known for general obs estimators. Limiting behavior of V̂(m) for some special obs estimators is discussed.51,92 For many situations, choosing m so that 10 ≤ n/m ≤ 20 is reasonable. During the last several years, many other efficient Monte Carlo methods have been developed for computing posterior quantities other than E(h(θ)|D). These include the bridge sampling method of Meng and Wong,73 the path sampling method of Gelman and Meng42 and the ratio importance sampling method of Chen and Shao25 for computing normalizing constants and Bayes factors; and the MC methods of Chen and Shao24 for calculating HPD intervals. The detailed description and discussion of these methods can be found in Chen et al.26 5. Applications and Examples 5.1. Bayesian analysis for survival data with a cure fraction The cure rate model is needed for modelling time-to-event data for various types of cancers, including breast cancer, non-Hodgkins lymphoma, leukemia, prostate cancer, melanoma, and head and neck cancer, where for these diseases, a significant proportion of patients are "cured". To demonstrate such a phenomenon, we consider a recent phase III clinical trial in malignant melanoma (E1684) undertaken by the Eastern Cooperative Oncology Group (ECOG). The graph in Fig. 2 gives the Kaplan–Meier survival curve for 284 patients in E1684, with the survival time given in years. We see from Fig. 2 that a plateau in the curve occurs at approximately 0.36, suggesting that a fraction of about 36% of the patients is "cured" after sufficient follow-up.
Fig. 2. Kaplan–Meier plot for E1684 data.
An important issue with cure rate modelling is model comparison. It will be of interest to compare various cure models to the Cox model. It will also be of interest to compare various semi-parametric models to obtain the most parsimonious and best fitting semi-parametric model. For model comparisons, Bayes factors require proper priors, and criterion-based statistics such as the L measure63 will not be well defined for cure rate models since the cure rate model does not have a proper probability density. As a result, we need to turn to other measures to carry out model comparisons. Here, we use the Conditional Predictive Ordinate (CPO) as a goodness-of-fit statistic that is well defined for these models and will allow us to do formal model comparisons. We compare three types of models for modelling time-to-event data.
5.1.1. Cox model A proportional hazards model is defined by a hazard function of the form h(t, x) = h0 (t) exp(x′ β) ,
(47)
where h0 (t) denotes the baseline hazard function at time t, x denotes the covariate vector for an arbitrary individual in the population, and β denotes a vector of regression coefficients. Suppose we have n subjects, and let y1 , . . . , yn denote the observed failure times or censoring times for the individuals, and νi is the indicator variable taking on the value 1 if yi is a failure time, and 0 if it is a censoring time. Our semi-parametric development for this model is based on a piece-wise constant hazard. We construct a finite partition of the time axis, 0 < s1 < · · · < sJ , with sJ > yi for all i = 1, 2, . . . , n. Thus, we have the J intervals (0, s1 ], (s1 , s2 ], . . . , (sJ−1 , sJ ]. In the jth interval, we assume a constant hazard λj . Throughout, we let D = (n, y, X, ν) denote the observed data for the current study, where y = (y1 , . . . , yn )′ , ν = (ν1 , . . . , νn )′ , and X is the n × p matrix of covariates with ith row x′i . Letting λ = (λ1 , . . . , λJ )′ , we can write the likelihood function of (β, λ) for all n subjects as L(β, λ|D) =
Π_{i=1}^n Π_{j=1}^J (λj exp(x_i′β))^{δij νi} exp{ −δij [λj(yi − s_{j−1}) + Σ_{g=1}^{j−1} λg(sg − s_{g−1})] exp(x_i′β) } ,
(48)
where δij = 1 if the ith subject failed or was censored in the jth interval, and 0 otherwise, x′i = (xi1 , . . . , xip ) denotes the p× 1 vector of covariates for the ith subject, and β = (β 1 , . . . , βp )′ is the corresponding vector of regression coefficients. The indicator δij is needed to properly define the likelihood over the J intervals for the semi-parametric models. The semi-parametric model in (48), sometimes referred to as a piecewise exponential model, is quite general and can accommodate various shapes of the baseline hazard over the intervals. Moreover, we note that if J = 1, then the model reduces to a parametric exponential model with failure rate parameter λ ≡ λj , j = 1, 2, . . . , J. This semi-parametric proportional hazards model is a useful and simple model for modelling survival data. It serves as the benchmark for comparisons with other semi-parametric or fully parametric models for survival data.
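As a concrete illustration of how (48) is evaluated, the following sketch (with made-up data and cut points, not the melanoma trial data discussed later) computes the log-likelihood of the piecewise exponential model for a given (β, λ); the indicator δij simply selects the interval containing yi.

```python
import numpy as np

# Log of the piecewise exponential likelihood (48) for one (beta, lambda);
# the data and cut points below are hypothetical, purely for illustration.
def piecewise_exp_loglik(beta, lam, y, X, nu, s):
    """s = (s_1, ..., s_J) with s_0 = 0; lam = (lambda_1, ..., lambda_J)."""
    s = np.concatenate(([0.0], s))
    eta = X @ beta                              # linear predictor x_i' beta
    loglik = 0.0
    for i in range(len(y)):
        j = np.searchsorted(s, y[i]) - 1        # interval with s_{j-1} < y_i <= s_j
        cum_haz = lam[j] * (y[i] - s[j]) + np.dot(lam[:j], np.diff(s[: j + 1]))
        loglik += nu[i] * (np.log(lam[j]) + eta[i]) - cum_haz * np.exp(eta[i])
    return loglik

# toy inputs (hypothetical): 4 subjects, 1 covariate, J = 3 intervals
y = np.array([0.5, 1.2, 2.7, 3.5]); nu = np.array([1, 0, 1, 0])
X = np.array([[0.0], [1.0], [0.0], [1.0]])
s = np.array([1.0, 2.0, 4.0])
print(piecewise_exp_loglik(np.array([0.3]), np.array([0.2, 0.1, 0.05]), y, X, nu, s))
```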
5.1.2. Parametric cure rate model We present a version of the cure rate model studied elsewhere.21,104 Suppose that for an individual in the population, we let N denote the number of metastatic-competent tumor cells for that individual left active after the initial treatment. A metastatic-competent tumor cell is a tumor cell which has the potential of metastasizing. Further, we assume that N has a Poisson distribution with mean θ. We let Zi denote the random time for the ith metastatic-competent tumor cell to produce detectable metastatic disease. That is, Zi can be viewed as an incubation time for the ith tumor cell. The variables Zi, i = 1, 2, . . ., are assumed to be independent and identically distributed with a common distribution function F(t) = 1 − S(t) and are independent of N. The time to relapse of cancer can be defined by the random variable T = min{Zi, 0 ≤ i ≤ N}, where P(Z0 = ∞) = 1 and N is independent of the sequence Z1, Z2, . . . . The survival function for T, and hence the survival function for the population, is given by
Sp(t) = P(no metastatic cancer by time t)
= P(N = 0) + P(Z1 > t, . . . , ZN > t, N ≥ 1)
= exp(−θ) + Σ_{k=1}^∞ S(t)^k [θ^k exp(−θ)/k!]
= exp(−θ + θS(t)) = exp(−θF(t)) .
(49)
Since Sp (∞) = exp(−θ) > 0, (49) is not a proper survival function. We see that (49) shows explicitly the contribution to the failure time of two distinct characteristics of tumor growth: the initial number of metastatic-competent cells and the rate of their progression. Thus the model incorporates parameters bearing clear biological meaning. The model in (49) is quite different from the standard mixture cure rate model proposed by Berkson and Gage,3 and has several attractive properties. For a detailed discussion of the various properties of (49), we refer the reader to Yakovlev and Tsodikov.104 Aside from the biological motivation, the model in (49) is suitable for any type of failure-time data with a surviving fraction. Thus, failure-time data which do not "fit" the biological definition given above can still certainly be modeled by (49) as long as the data has a surviving fraction and can be thought of as being generated by an unknown number N of latent independent competing risks (Zi 's). Yakovlev et al.103 discuss a similar modeling technique for tumor latency, but do not consider a Bayesian formulation of the model.
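To see the improper survival function numerically, the following sketch evaluates (49) under hypothetical values of θ and λ and an exponential incubation distribution (both assumptions made only for illustration, with θ chosen so that the plateau is near the 0.36 seen in Fig. 2).

```python
import numpy as np

# Numerical sketch of the improper population survival function (49) assuming
# an exponential incubation time, F(t) = 1 - exp(-lam * t).
theta, lam = 1.02, 0.5        # hypothetical values; exp(-1.02) is about 0.36
def S_pop(t):
    return np.exp(-theta * (1.0 - np.exp(-lam * t)))

for t in [0.0, 1.0, 5.0, 10.0, 50.0]:
    print(f"S_p({t:5.1f}) = {S_pop(t):.3f}")
print("cure fraction exp(-theta) =", round(np.exp(-theta), 3))
```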
We also see from (49) that the cure fraction (i.e. cure rate) is given by Sp (∞) ≡ P (N = 0) = exp(−θ) .
(50)
As θ → ∞, the cure fraction tends to 0, whereas as θ → 0, the cure fraction tends to 1. The sub-density corresponding to (49) is given by fp (t) = θf (t) exp(−θF (t)) ,
(51)
where f(t) = dF(t)/dt is a proper probability density function. The hazard function is given by
hp (t) = θf (t) .
(52)
Note that hp(t) is not a hazard function corresponding to a probability distribution since Sp(t) is not a proper survival function. Suppose we have n subjects and for the ith subject, let yi denote the observed survival time, let νi be the censoring indicator that equals 1 if yi is a failure time and 0 if it is right censored, and also let Ni denote the number of metastatic-competent tumor cells. Further, we assume that the Ni's are i.i.d. Poisson random variables with mean θi, which is related to the covariates by θi ≡ θ(x_i′β) = exp(x_i′β). Letting N = (N1, . . . , Nn)′, the "complete data" is given by Dcomp = (n, y, ν, N), where N is an unobserved vector of latent variables. Then, we can write the complete data likelihood of (β, λ) as
L(β, λ|Dcomp) = [ Π_{i=1}^n (Ni f(yi|λ))^{νi} S(yi|λ)^{Ni−νi} ] × exp{ Σ_{i=1}^n [Ni x_i′β − log(Ni!) − exp(x_i′β)] } ,
(53)
where f(yi|λ) is the exponential density given above, and S(yi|λ) = 1 − F(yi|λ) = exp(−λyi). Since the latent vector N is not observed, the likelihood function based on the observed data D = (n, y, X, ν) is obtained by summing (53) over N, leading to
L(β, λ|D) = Σ_N L(β, λ|Dcomp) .
(54)
5.1.3. Semi-parametric cure rate model We consider a semi-parametric version of the parametric cure rate model in (53) by considering a piecewise constant hazard model, and thus assume
that the hazard is equal to λj for the jth interval, j = 1, . . . , J. With this assumption, the complete data likelihood can be written as L(β, λ|Dcomp ) =
Π_{i=1}^n Π_{j=1}^J exp{ −(Ni − νi) δij [λj(yi − s_{j−1}) + Σ_{g=1}^{j−1} λg(sg − s_{g−1})] }
× Π_{i=1}^n Π_{j=1}^J (Ni λj)^{δij νi} exp{ −νi δij [λj(yi − s_{j−1}) + Σ_{g=1}^{j−1} λg(sg − s_{g−1})] }
× exp{ Σ_{i=1}^n [Ni x_i′β − log(Ni!) − exp(x_i′β)] } ,
(55)
where λ = (λ1 , . . . , λJ )′ . The model in (55) is a semi-parametric version of (53). If we take J = 1 in (55), then the model reduces to the fully parametric model given in (53). There are several attractive features of the model in (55). First, we note the degree of the non-parametricity is controlled by J. The larger the J, the more non-parametric the model is. However, by picking a small to moderate J, we get more of a parametric shape for the survival function. This is an important aspect for the cure rate model, since the estimation of the cure rate parameter θ could be highly affected by the non-parametric nature of the survival function. For this reason, it may be desirable to choose small to moderate values of J for cure rate modelling. In practice, we recommend doing analyses for several values of J to see the sensitivity of the posterior estimates of the regression coefficients. We recommend doing sensitivity analyses for small, moderate, and large values of J. Thus, the semi-parametric cure rate model (55) is quite flexible, as it allows us to model general shapes of the hazard function, as well as choose the degree of parametricity through suitable choices of J. Again, since N is not observed, the observed data likelihood, L(β, λ|D) is obtained by summing out N from (55) as in (54). 5.1.4. Prior distributions First, we consider a noninformative prior. Take π(β, λ) ∝
Π_{j=1}^J λj^{ζ0−1} exp(−τ0 λj) ,
(56)
where ζ0 ≥ 0 and τ0 ≥ 0, so that β has an improper uniform prior and λj ∼ gamma(ζ0, τ0). The prior given in (56) includes two special cases: (i) Jeffreys's prior for λ when ζ0 = τ0 = 0, and (ii) a uniform prior for λ when ζ0 = 1 and τ0 = 0. Second, we consider an informative prior. We use the power prior to formally construct an informative prior distribution from historical data D0. Let n0 denote the sample size for the historical data, y0 = (y01, y02, . . . , y0n0)′ be an n0 × 1 vector of right censored failure times for the historical data with censoring indicators ν0 = (ν01, ν02, . . . , ν0n0)′, and X0 be an n0 × k matrix of covariates with ith row x_0i′. Let D0 = (n0, y0, X0, ν0) denote the observed historical data. The power prior given by (26) has the form
π(β, λ, a0|D0) ∝ L(β, λ|D0)^{a0} π0(β, λ) a0^{α0−1} (1 − a0)^{λ0−1} ,
(57)
where L(β, λ|D0) is the likelihood function based on the observed historical data, and α0 and λ0 are prespecified hyperparameters. The quantity π0(β, λ) is the initial prior for (β, λ), which is (56). It is well known that with insufficient follow-up or with too few events, the estimate of the cure rate can be quite unreliable and unstable. In addition, the model itself may not be identifiable or nearly identifiable if there is insufficient follow-up and there are too few events. The use of informative prior distributions can help overcome such difficulties, which can provide better estimates of the cure rate and make the model identifiable. 5.1.5. Model assessment We use a summary statistic of the CPOi's given by (7), the logarithm of the pseudo-Bayes factor, for model assessment. The CPOi given by (7) has the form
CPOi = f(yi|y(i)) = ∫ f(yi|β, λ) π(β, λ|y(i)) dβ dλ ,
(58)
where yi denotes the response variable for case i, and y(i) denotes the entire response vector with the ith case deleted. Then, the logarithm of the pseudo-Bayes factor is defined as
B = Σ_{i=1}^n log(CPOi) .
(59)
In the context of survival data, the statistic B has been discussed before.30,38,91
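One common Monte Carlo route to (58) and (59) is a harmonic-mean identity for CPOi based on posterior draws; that particular estimator is an assumption of the sketch below (it is not a formula given in this section), illustrated on a toy normal model with simulated data.

```python
import numpy as np

# Sketch of a Monte Carlo approximation to CPO_i via the harmonic-mean form
# CPO_i ~ [M^{-1} sum_m 1/f(y_i | theta^(m))]^{-1} over posterior draws
# theta^(m); the toy normal model and data below are hypothetical.
rng = np.random.default_rng(5)
y = rng.normal(0.0, 1.0, size=30)                    # toy data
theta_draws = rng.normal(y.mean(), 1 / np.sqrt(len(y)), size=4000)  # stand-in posterior draws

def log_f(y_i, theta):
    return -0.5 * np.log(2 * np.pi) - 0.5 * (y_i - theta) ** 2

log_cpo = np.empty(len(y))
for i, y_i in enumerate(y):
    inv_dens = np.exp(-log_f(y_i, theta_draws))      # 1 / f(y_i | theta^(m))
    log_cpo[i] = -np.log(inv_dens.mean())
B = log_cpo.sum()                                    # the statistic B in (59)
print("B =", B)
```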
We see from (58) that B is always well defined as long as the posterior predictive density is proper. Thus, B is well defined under improper priors, and in addition, it is very computationally stable. Therefore, B has a clear advantage over the Bayes factor as a model assessment tool, since it is well known that the Bayes factor is not well defined with improper priors, and is generally quite sensitive to vague proper priors. Thus, the Bayes factor is not applicable for many of our models here, since we consider several models involving improper priors. In addition, the B statistic also has clear advantages over other model selection criteria, such as the L measure.63,70 The L measure is a Bayesian criterion requiring finite second moments of the sampling distribution of yi, whereas the B statistic does not require existence of any moments. Since the cure rate models in (53) and (55) have improper survival functions, no moments of the sampling distribution exist, and therefore the L measure is not well defined for these models. Thus, for the models considered here, the B statistic is well motivated. 5.1.6. E1690 melanoma study To demonstrate the methodologies, we consider the second ECOG trial, E1690, in malignant melanoma. This study had n = 427 patients on the high dose interferon arm and observation arm combined. ECOG initiated this trial, right after the completion of E1684, to attempt to confirm the results of E1684 and to study the benefit of Interferon Alpha-2b (IFN) given at a lower dose. The E1690 trial accrued patients from 1991 until 1995, and was unblinded in 1998. The E1690 trial was designed for exactly the same patient population as E1684, and the high dose IFN arm in E1690 was identical to that of E1684. See Kirkwood et al.67,68 for details. We carry out a Bayesian analysis of E1690 using E1684 as historical data, with relapse-free survival (RFS) as the response variable and the treatment as a covariate. Since E1684 has longer follow-up than E1690, the use of E1684 as a historical dataset may help to improve the accuracy of the estimates of cure rates based on E1690. But, in this example, we solely focus on model comparisons. We consider the Cox model, the parametric cure rate (PCR) model, and the semi-parametric cure rate (SPCR) model with J = 1, J = 5, and J = 10 for noninformative and informative priors. Table 4 shows the results of Pseudo-Bayes Factors (B's) when a0 = 0 with probability 1, E(a0|D) = 0.05, 0.20, 0.30, and 0.60, and a0 = 1 with probability 1. We note that a0 = 0 implies the use of a noninformative prior, which is equivalent to a prior specification with no incorporation of historical data, while with a0 = 1, we simply combine D and D0 together.
Table 4. Pseudo-Bayes factors (B's).

                            E(a0|D) ≃
Model           a0 = 0      0.05       0.20       0.30       0.60      a0 = 1
Cox    J = 1   −575.60   −575.45    −575.23    −575.13    −574.95    −574.64
       J = 5   −522.30   −522.05    −521.67    −521.59    −521.61    −522.24
       J = 10  −523.62   −523.20    −522.39    −522.12    −522.02    −522.71
SPCR   J = 1   −519.75   −519.61    −519.39    −519.34    −519.40    −519.67
       J = 5   −520.24   −519.89    −519.43    −519.31    −519.67    −520.16
       J = 10  −524.42   −523.82    −522.83    −522.53    −522.56    −522.97
Fig. 3. Plot of pseudo-Bayes factors for SPCR with J = 5.
Table 4 is quite informative. First, for the degree of parametricity, J = 5 is better than J = 1 or J = 10; however, for SPCR, J = 1 and J = 5 are fairly close. Second, for both J = 1 and J = 5, the cure rate model yields a better fit than the Cox model. Third, the incorporation of E1684 in the analysis improves the model fit over the exclusion of the historical data. Fourth, in all cases, B is a concave function of E(a0|D); see Fig. 3 for an illustration. This is an interesting feature of B in that it demonstrates that there is an "optimal" weight for the historical data with respect to the statistic B, and thus this property is potentially very useful in selecting a model and the prior weight a0.
5.2. Bayesian model selection for multivariate mortality data with large families To illustrate Bayesian model selection, we consider an analysis of multivariate mortality data with large families from a series of animal toxicological experiments performed in the Department of Biology at the University of Waterloo. One of these experiments was designed to study the toxic effect of potassium thiocyanate (KSCN) on the mortality of trout fish eggs. In this experiment, each of six levels of KSCN was added to different tanks, each containing many trout fish eggs (61 to 179 eggs per vial). Half of the tanks were water hardened before the application of the KSCN (x1 = 1) and the other half were water hardened after the application of the KSCN (x1 = 0). Another covariate is the continuous variable x2, which is defined as the natural logarithm of the level of KSCN. Each experimental condition was replicated 4 times, so there were in total 48 tanks. Another similar experiment was conducted in the same laboratory with a different toxicant, sodium thiocyanate (NaSCN). For the KSCN data, mortality counts for each tank were taken at 5, 11, 19, 31 and 35 days after the application of KSCN, while for the NaSCN data mortality counts for each tank were taken at 1, 6, 13, 20, and 27 days after the application of the NaSCN. We refer the reader to O'Hara Hines77 and O'Hara Hines and Lawless78 for a more detailed description of these two experiments. Since NaSCN is a toxicant similar to KSCN and both experiments were similar in design and purpose, the data from the NaSCN experiment will be used to build an informative prior distribution for the KSCN study. For the KSCN data, there are K = 48 tanks (families) of fishes (subjects). Suppose that the observation times for each tank are t0 = 0 < t1 < · · · < tm (m = 5). Moreover, it is given that t1 = 5, t2 = 11, t3 = 19, t4 = 31 and t5 = 35. The cumulative mortality counts (the observed data) in each tank at these five days can be summarized as {(dkj, rkj) : j = 1, . . . , 5; k = 1, . . . , 48}, where dkj is the number of fishes dying and rkj is the number of fishes at risk in the kth tank during the time interval (tj−1, tj] = Ij. Let hkj be the mortality rate, which can also be interpreted as the discrete hazard rate, for a fish from the kth tank in the time interval Ij: hkj = P(Ti ∈ Ij|i ∈ Rkj, xi), where Ti is the time of death for a fish from the kth tank, and Rkj is the set of fishes still "at risk" (alive) in the kth tank at the beginning of the time interval (tj−1, tj] (for j = 1, 2, . . . , 5 and k = 1, 2, . . . , K). There are two
covariates x1k and x2k for each tank, where x2k is the primary covariate, the natural logarithm of the KSCN level applied to the kth tank, and x1k is the water hardening status of the kth tank. Thus, the number of fishes dying in the kth tank during Ij is distributed as Binomial with success probability hkj and the number of trials equal to the number of fishes at risk during that interval in the kth tank, i.e. dkj ∼ Binomial(rkj; hkj). Chen et al.18 assume that, for a logit link function,
logit(hkj) = log[hkj/(1 − hkj)] = x_k′β + gγ(tj) + ekj ,
(60)
where gγ is a known function with unknown parameters γ, possibly vector valued, xk = (1, x1k, x2k)′, and β = (β0, β1, β2)′ is a 3 × 1 vector of regression coefficients with respect to the covariates xk. In (60), the ekj's are the random effects for the hazard rate of the fishes in the kth tank during the time interval Ij. Let ek = (ek1, ek2, . . . , ek,m)′ for k = 1, 2, . . . , K. We assume that these random effects, the ekj's, are dependent on each other within the same tank (family) but independent between any two different tanks. Also let A = diag(a1, a2, . . . , am), where aj = (tj − tj−1)^{1/2} is the square root of the length of the jth time interval for j = 1, 2, . . . , m. We build the dependence structure using an m-dimensional multivariate normal distribution within each tank, so that
ek ∼ Nm(0, σ^2 Σ) ,
(61)
where Σ is an m × m matrix defined by
Σ = A ×
[ 1          ρ          ρ^2        · · ·   ρ^{m−1}
  ρ          1          ρ          · · ·   ρ^{m−2}
  .          .          .          .       .
  ρ^{m−1}    ρ^{m−2}    ρ^{m−3}    · · ·   1       ] × A .
(62)
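The following sketch (with hypothetical values of ρ and σ², and the KSCN observation times as the grid) assembles the covariance matrix σ²Σ of (61) and (62) and checks the determinant identity used later in the likelihood.

```python
import numpy as np

# Assemble sigma^2 * Sigma from (61)-(62): an AR(1)-type correlation matrix
# pre- and post-multiplied by A = diag(sqrt of interval lengths).
t = np.array([0.0, 5.0, 11.0, 19.0, 31.0, 35.0])   # t_0, ..., t_m for the KSCN design
rho, sigma2 = 0.6, 0.4                              # hypothetical parameter values
m = len(t) - 1

A = np.diag(np.sqrt(np.diff(t)))                    # a_j = (t_j - t_{j-1})^{1/2}
R = rho ** np.abs(np.subtract.outer(np.arange(m), np.arange(m)))  # (i,j) entry rho^|i-j|
Sigma = A @ R @ A                                   # the matrix Sigma in (62)
cov_e = sigma2 * Sigma                              # covariance of e_k in (61)

print(np.round(cov_e, 3))
# determinant identity: |Sigma| = (1 - rho^2)^(m-1) * prod_j a_j^2
print(np.isclose(np.linalg.det(Sigma), (1 - rho**2) ** (m - 1) * np.prod(np.diff(t))))
```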
The function gγ(t) in (60) is a function of time known up to the unknown parameter γ. The choice of gγ depends on the entertained response-time model for these kinds of mortality data. The simplest form of gγ(t) is linear, i.e. gγ(t) = γ1 t; the constant term is absent to preserve the identifiability of the model. There are many other possible choices for the function gγ(t), such as a quadratic, gγ(t) = γ1 t + γ2 t^2, and even more complex ones such as a spline function of known order with a known number of knots but unknown knot positions. In this example, we use a Bayesian variable selection approach to explore a suitable form for gγ(t) from the data.
Here, we assume that gγ (t) takes the form gγ (t) = γ1 g1 (t) + γ2 g2 (t) + · · · + γq gq (t) ,
(63)
where the gj(t) are known functions of t and q ≥ 0. For notational convenience, we denote gγ(t) ≡ 0 when q = 0. We further denote γ = (γ1, γ2, . . . , γq)′. Let D denote the complete data, that is, D = ((dk, rk, xk, g1(tk1), . . . , gq(tkm)), k = 1, . . . , 48), where dk = (dkj, j = 1, 2, . . . , m) and rk = (rkj, j = 1, 2, . . . , m). Also let φm(ek|µ, σ^2Σ) denote the m-dimensional normal density of the random effect ek with mean µ and covariance matrix σ^2Σ, i.e.
φm(ek|µ, σ^2Σ) = |Σ|^{−1/2} (2πσ^2)^{−m/2} exp{ −(1/(2σ^2)) (ek − µ)′ Σ^{−1}(ek − µ) } .
(64)
Then, it can be shown that the likelihood function is given by
L(β, γ, σ^2, ρ|D) = Π_{k=1}^K { ∫ Π_{j=1}^m (hkj)^{dkj} (1 − hkj)^{rkj−dkj} φm(ek|0, σ^2Σ) dek } ,
(65)
where hkj and ekj are given in (60). Notice that from (62), we have |Σ| = (1 − ρ^2)^{m−1} × Π_{j=1}^m aj^2.
We use the NaSCN added fish tank data77 to build our prior distributions. Let D0 denote the complete NaSCN data, that is, D0 = ((d0k , r 0k , x0k , g1 (t01 ), . . . , gq (t0m0 )), k = 1, 2, . . . , K0 = 36), where x0k = (1, x01k , x02k )′ , d0k = (d0kj , j = 1, 2, . . . , m0 = 5) and r 0k = (r0kj , j = 1, 2, . . . , m0 = 5). In the NaSCN data, d0kj is the number of fishes dying and r0kj is the number of fishes at risk in the kth tank during the time interval I0j = (t0,j−1 , t0j ) for j = 1, 2, . . . , m0 = 5, where t01 = 1, t02 = 6, t03 = 13, t04 = 20, t05 = 27, and m0 = 5 for k = 1, 2, . . . , K0 (K0 = 36). To determine the form of gγ , let M denote the model space with each model containing β and a specific choice of covariates gl (tj ). The full model is defined here as the model containing all of the available covariates in the toxicity experiment. Also, let θ (Q) = (β′ , γ1 , γ2 , . . . , γq )′ and let θ (m) denote
a qm × 1 vector of regression coefficients for model m with β and a specific choice of qm − 3 covariates gl(tj). We write θ^(Q) = (θ^(m)′, θ^(−m)′)′, where θ^(−m) is θ^(Q) with θ^(m) deleted. Using the power prior given by (26), we construct the prior distribution for (θ^(m), σ^2, ρ) under model m as
π(θ^(m), σ^2, ρ|m) ∝ π0*(θ^(m), σ^2, ρ|D0, m)
= ∫_0^1 Π_{k=1}^{K0} { ∫ Π_{j=1}^{m0} [ exp{a0 d0kj ((x_{0kj}^{*(m)})′ θ^(m) + e0kj)} / [1 + exp{(x_{0kj}^{*(m)})′ θ^(m) + e0kj}]^{a0 r0kj} ] φ_{m0}(e0k|0, σ^2 Σ0) de0k } π1(σ^2) π2(ρ) π3(a0) da0 ,
(66)
where e0k is an m0 × 1 vector of random effects, and x_{0kj}^{*(m)} is a qm × 1 vector
of covariates corresponding to θ (m) . Note that under the full model, x∗0kj = (1, x01k , x02k , g1 (t0kj ), . . . , gq (t0kj ))′ . In (66), we specify an inverse gamma prior for σ 2 given as π1 (σ 2 ) ∝ (σ 2 )−(δ0 +1) exp(−σ −2 ζ0 ), a scaled beta prior for ρ given as π2 (ρ) ∝ (1 + ρ)ν0 −1 (1 − ρ)ψ0 −1 , and independent beta priors for each a0 given as π3 (a0 ) ∝ a0α0 −1 (1 − a0 )λ0 −1 . Let L(θ (m) , σ 2 , ρ|D, m) denote the likelihood function given by (65) under model m. Then, the marginal likelihood given in (23) has the following expression: Z p(D|m) = L(θ(m) , σ 2 , ρ|D, m)π(θ (m) , σ 2 , ρ|m) dθ (m) dσ 2 dρ . (67) To completely determine the posterior probability of model m given by (22), we elicit the prior model probability p(m) as: R ∗ (m) 2 π (θ , σ , ρ|D0 , m) dθ(m) dσ 2 dρ . (68) p(m) = PQ 0 R π0∗ (θ(j) , σ 2 , ρ|D0 , j) dθ (j) dσ 2 dρ j=1
The choice for p(m) in (68) is a natural one since the numerator is just the normalizing constant of the joint prior of (θ^(m), σ², ρ) under model m. The prior model probabilities in (68) are based on coherent Bayesian updating. It can be shown that p(m) in (68) corresponds to the posterior probability of model m based on the data D0, using a uniform prior p0(m) = 2^{−q}, m ∈ M, on the model space for the previous study, as α0 → ∞. That is, p(m) ∝ p(m|D0), and thus p(m) corresponds to the usual Bayesian update of p0(m) using D0 as the data.
Next, we briefly discuss how to compute the posterior model probability p(m|D). From (67) and (68), it can be seen that p(m|D) is a function of ratios of analytically intractable prior and posterior normalizing constants, which are expensive to compute. However, the following result greatly eases this computational burden. Using (22), (67) and (68) along with the Savage–Dickey ratio,99 it follows directly from Ibrahim et al.61 that the posterior probability p(m|D) in (22) of model m reduces to

p(m \mid D) = \frac{\pi(\theta^{(-m)} = 0 \mid D, Q)}{\sum_{j=1}^{Q} \pi(\theta^{(-j)} = 0 \mid D, Q)} ,  m = 1, \ldots, Q ,    (69)

where π(θ^(−m) = 0|D, Q) is the marginal posterior density of θ^(−m) evaluated at θ^(−m) = 0 under the full model. In (69), for notational convenience we set π(θ^(−Q) = 0|D, Q) = 1. Note that the joint posterior distribution of θ is given by

\pi(\theta \mid D, Q) \propto \int L(\theta, \sigma^2, \rho \mid D)\, \pi(\theta, \sigma^2, \rho \mid Q)\, d\sigma^2\, d\rho .

The result in (69) is attractive since it shows that the posterior model probability p(m|D) is simply a function of the marginal posterior densities of θ^(−m) under the full model, evaluated at θ^(−m) = 0. The formula does not algebraically depend on the prior model probability p(m), since it cancels out in the derivation due to the structure of p(m). This is an important feature, since it allows us to compute the posterior model probabilities directly and efficiently, without numerically computing the prior model probabilities. Although analytical evaluation of π(θ^(−m) = 0|D, Q) does not appear possible due to the complexity of our model, it is easily computed by the IWMDE method discussed in Sec. 2.3.

We implement the above Bayesian variable subset selection to determine the form of gγ. The terms t, t² and ln t were previously used in the toxicological mortality estimation literature77,78 to model the effect of time. We do not know which form of g(t) fits the data best before the analysis, and therefore we use a formal Bayesian model selection procedure that allows any of these terms to enter the selected model. Thus, we take the full model for gγ(tj) to contain tj, t²j and ln tj, that is, gγ(tj) = γ1 tj + γ2 t²j + γ3 ln tj, so that the model space M has dimension Q = 2³ = 8.
Table 5.  Posterior model probabilities for the fish tank data.

(µa0, σa0)        Model              p(m|D)
(0.50, 0.109)     (x1, x2, T)        0.486
                  (x1, x2, T²)       0.352
                  (x1, x2, ln T)     0.153
(0.50, 0.050)     (x1, x2, T)        0.484
                  (x1, x2, T²)       0.351
                  (x1, x2, ln T)     0.153
(0.91, 0.027)     (x1, x2, T)        0.484
                  (x1, x2, T²)       0.350
                  (x1, x2, ln T)     0.153
(0.98, 0.006)     (x1, x2, T)        0.483
                  (x1, x2, T²)       0.350
                  (x1, x2, ln T)     0.154
We specify noninformative priors for ρ and σ². Specifically, we take a uniform prior for ρ on [−1, 1] (i.e. ν0 = ψ0 = 1) and an inverse gamma prior for σ², π1(σ²) ∝ (σ²)^{−(δ0+1)} exp(−σ^{−2}ζ0), with δ0 = ζ0 = 0.005. We then consider several choices of the hyperparameters (α0, λ0) for a0 in order to perform a small-scale sensitivity study. As in Example 3, we let µa0 = α0/(α0 + λ0) and σa0 = (µa0(1 − µa0)(α0 + λ0 + 1)^{−1})^{1/2}. We generate 50,000 Gibbs iterations from the posterior distribution under the full model to obtain the IWMDE estimates. Table 5 gives the results for the top three models for several values of (µa0, σa0). In Table 5, T, T² and ln T denote time, time squared and the natural logarithm of time, respectively. It can be seen that (i) for all choices of (µa0, σa0) the order of the top three models does not change, with model (x1, x2, T) clearly the top model, and (ii) the posterior model probabilities of the top three models are almost the same for all choices of (µa0, σa0), even for the one with a strong prior on a0 ((µa0, σa0) = (0.98, 0.006)). Therefore, model choice is reasonably robust to the choice of (µa0, σa0). In addition, the total posterior probability of the top three models is close to 1; for example, this sum equals 0.988 for (µa0, σa0) = (0.50, 0.050). This implies that, for the purpose of posterior prediction, it suffices to use these three models and apply model averaging techniques82 to incorporate model uncertainty in the posterior densities of the parameters. Also, from the principle of parsimony and from the result of the Bayesian variable selection, the best model to use is g(t) = γ1 t.
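The mechanics of converting marginal density values into posterior model probabilities via (69) are simple, as the following sketch shows. It enumerates the Q = 2³ = 8 models defined by subsets of {t, t², ln t}; the density values π(θ^(−m) = 0|D, Q) used here are placeholders, whereas in the analysis above they come from the IWMDE estimator applied to the Gibbs output.

```python
from itertools import combinations

terms = ("t", "t2", "lnt")
models = [frozenset(c) for r in range(len(terms) + 1) for c in combinations(terms, r)]

# Placeholder values standing in for IWMDE estimates of pi(theta_(-m) = 0 | D, Q);
# by the convention stated after Eq. (69), the full model gets the value 1.
dens0 = {m: 1.0 if m == frozenset(terms) else 0.1 * (1 + len(m)) for m in models}

total = sum(dens0.values())
post = {m: v / total for m, v in dens0.items()}          # Eq. (69)
for m, p in sorted(post.items(), key=lambda kv: -kv[1]):
    print(sorted(m), round(p, 3))
```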
5.3. Reference prior analysis for a statistical calibration model

A statistical calibration problem (or, more precisely, an absolute statistical calibration problem) bears a resemblance to a regression problem. It is still assumed that the variables x and y are related through a function of specified form. In calibration, however, interest centers on the estimation of an unknown value of x corresponding to an observed value of y. Inferences are based on two samples of data. At the first stage of data collection, n pairs (xi, yi) are observed, with the x values fixed at known levels. At the second stage, c replications of the response variable y are observed, corresponding to an unknown value x0 of the regressor; estimation of this regressor value is of primary interest. A review of statistical calibration is given by Osbourne.79 Noninformative priors for the linear calibration problem have been presented by several authors.31,47,69,81 Eno and Ye32 studied reference priors for polynomial calibration models, as well as a probability matching prior for an extended calibration problem.33

5.3.1. An example of polynomial calibration

The data set was presented by Aitchison and Dunsmore.1 These data resulted from an assay of an antibiotic, based on the "clearance circle" technique. The goal of such an experiment is to estimate the concentration of the active constituent in a particular test preparation of the antibiotic. In a clearance circle assay, the regressor variable x is a precise measure of the concentration of active constituent in a preparation of antibiotic. This concentration is controlled in a laboratory experiment, where it is set at several different values by diluting a known full-strength antibiotic preparation to varying degrees. Each response yi is obtained by placing a drop of antibiotic solution (of a specified volume) on a petri dish which is uniformly infected with bacteria. The response yi is the measured diameter of the circle that has been disinfected by the antibiotic preparation after a specified period of time. The diameter of this clearance circle is expected to depend on the concentration of the active constituent in the antibiotic solution. Based on the known dilutions of the standard preparation, this regression relationship can be estimated.
Fig. 4.  Scatterplot of the transformed data in clearance circle assay.
At the same time that the clearance circles corresponding to the known antibiotic concentrations are measured, clearance circles are also measured for the test preparation whose unknown antibiotic concentration is of interest. What we want to do here is to use a reference prior Bayesian analysis to estimate this unknown antibiotic concentration. The data are plotted in Fig. 4, with the response transformed via the square root transformation and the regressor transformed via the log transformation. The plot suggests that a quadratic model fits the transformed data well.

5.3.2. The model and the reference priors

The polynomial calibration problem can be formally stated as follows. Data in the form of n pairs (xi, yi) are collected. In addition to these n data pairs, we observe c values of the response, y_{n+1}, y_{n+2}, . . . , y_{n+c}, which correspond to a single unknown value of the regressor x0. The response variable y is assumed to be related to the regressor x via a polynomial function of order p:

y_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_p x_i^p + \varepsilon_i ,  for i = 1, 2, \ldots, n + c .

For convenience, we have written xi in place of x0 for i = n + 1, n + 2, . . . , n + c. We assume that the errors εi are independent and identically distributed normal deviates with mean 0 and standard deviation σ. Of primary interest in this problem is the estimation of the unknown regressor value x0.
Fig. 5.  Reference prior for the quadratic model, for the clearance circle assay example.
A feature that distinguishes the polynomial calibration problem from the linear calibration problem is that, since a polynomial function need not be monotonic, more than one value of x0 may give rise to a particular mean response ȳ0. This issue, in the context of the clearance circle assay described above, has been discussed and addressed in Eno and Ye.32 Reference priors for the univariate polynomial calibration problem, as described above, are given as follows:

\pi_k(x_0, \alpha, \beta, \sigma) \propto \sigma^{-k} \left( \frac{\zeta_0'\zeta_0}{1 + c\,\xi_0'(X_{\alpha,1}'X_{\alpha,1})^{-1}\xi_0} \right)^{1/2}
= \sigma^{-k} (\zeta_0'\zeta_0)^{1/2} \left( 1 + \frac{cn}{n+c}\,(x_0 - \bar{x})'(X_1'X_1)^{-1}(x_0 - \bar{x}) \right)^{-1/2} ,    (70)
where X1 is the n × p matrix whose ith row is x′i, the vector of regressors at the ith observation, Xα,1 is the n × (p + 1) matrix whose ith row is (1, x′i), ζ′0 = (1, 2x0, 3x0², . . . , p x0^{p−1}) is the vector of derivative terms in x0, and k is the number of parameters in the group involving σ in the implementation of the reference prior algorithm. Clearly the prior in (70) is improper. As we noted in Sec. 3.3, it is necessary to check the propriety of the posterior once an improper prior is used. In Eno and Ye,32 propriety of the posterior arising from the reference prior of the form (70) was proven. The prior function is shown in Fig. 5 for the clearance circle assay problem.
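The first expression in (70) can be evaluated directly on a grid of x0 values once a design is fixed. The sketch below does this for a quadratic model (p = 2) with σ held fixed; the design points and the number of replicates c are invented for illustration and are not the clearance circle design.

```python
import numpy as np

x_design = np.linspace(-1.0, 1.0, 9)                       # hypothetical regressor levels
X_alpha1 = np.column_stack([np.ones_like(x_design), x_design, x_design**2])
XtX_inv = np.linalg.inv(X_alpha1.T @ X_alpha1)
c = 3                                                      # second-stage replicates (assumed)

def ref_prior(x0, sigma=1.0, k=1):
    """Reference prior (70), first form, up to a proportionality constant."""
    zeta0 = np.array([1.0, 2.0 * x0])                      # derivative vector (1, 2*x0), p = 2
    xi0 = np.array([1.0, x0, x0**2])
    return sigma**(-k) * np.sqrt((zeta0 @ zeta0) / (1.0 + c * xi0 @ XtX_inv @ xi0))

grid = np.linspace(-2.0, 2.0, 9)
print(np.round([ref_prior(x0) for x0 in grid], 3))
```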
5.3.3. Posterior results

Applying the reference prior (70) to the clearance circle assay problem, we are ready to estimate the antibiotic concentration corresponding to the observed responses. The marginal posterior distribution of log(concentration) is shown in Fig. 6. The bimodality of both the prior and the posterior is clearly visible, since the model considered here is quadratic. Furthermore, it is easily seen that the small bump in the posterior distribution corresponds to values of x0 far beyond the range of the original regressor, so this local mode is arguably not relevant for this data set.
Fig. 6.  Marginal reference posterior for log(x0).

Fig. 7.  Marginal reference posterior for x0.
Since the model is not likely to be reliable outside the range of the controlled regressor values, we truncate the range of the posterior density and transform the logarithm back to the original scale. Figure 7 shows the marginal posterior distribution of x0 . References 1. Aitchison, J. and Dunsmore, I. R. (1975). Statistical Prediction Analysis, Cambridge University Press. 2. Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In International Symposium on Information Theory, eds. B. N. Petrov and F. Csaki, Akademia Kiado, Budapest, 267–281. 3. Berkson, J. and Gage, R. P. (1952). Survival curve for cancer patients following treatment. Journal of the American Statistical Association 47: 501–515. 4. Berger, J. O. (1999). Bayesian analysis today and tomorrow. Technical Report 99-30. Institute of Statistics and Decision Sciences, Duke University. 5. Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd edn., Wiley, New York. 6. Berger, J. O. and Bernardo, J. (1989). Estimating a product of normal means: Bayesian analysis with reference priors. Journal of the American Statistical Association 84: 200–207. 7. Berger, J. O. and Bernardo, J. (1992). On the development of the reference prior method. In Bayesian Statistics 4, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, Oxford University Press, London, 35–60. 8. Berger, J. O. and Mallows, C. L. (1988). Discussion of Bayesian variable selection in linear regression. Journal of the American Statistical Association 83: 1033–1034. 9. Bernardo, J. (1979). Referce posterior distributions for Bayesian inference (with discussion). Journal of the Royal Statistical Society B41: 113–147. 10. Bernardo, J. M. and Smith, A. F. M. (1995). Bayesian Theory, John Wiley and Sons, New York. 11. Besag, J. and Green, P. J. (1993). Spatial statistics and Bayesian computation. Journal of the Royal Statistical Society, Series B55: 25–37. 12. Box, G. E. P. (1980). Sampling and Bayes’ inference in scientific modelling and robustness (with discussion). Journal of the Royal Statistical Society, Series A143: 383–430. 13. Bratley, P., Fox, B. L. and Schrage, L. E. (1987). A Guide to Simulation, 2nd edn., Springer-Verlag, New York. 14. Carlin, B. P. and Louis, T. A. (2000). Bayes and Empirical Bayes Methods for Data Analysis, 2nd edn, Chapman and Hall, New York. 15. Casella, G. and George, E. I. (1992). Explaining the Gibbs sampler. The American Statistician 46: 167–174. 16. Chen, M.-H. (1994). Importance weighted marginal Bayesian posterior density estimation. Journal of the American Statistical Association 89: 818–824.
17. Chen, M.-H. and Deely, J. J. (1996). Bayesian analysis for a constrained linear multiple regression problem for predicting the new crop of apples. Journal of Agricultural, Biological and Environmental Statistics 1: 467–89. 18. Chen, M.-H., Dey, D. K. and Sinha, D. (2000). Bayesian analysis of multivariate mortality data with large families. Applied Statistics 49: 129–144. 19. Chen, M.-H., Ibrahim, J. G. and Shao, Q.-M. (2000). Power prior distributions for generalized linear models. Journal of Statistical Planning and Inference 41: 121–137. 20. Chen, M.-H., Ibrahim, J. G., Shao, Q.-M. and Weiss, R. E. (2002). Prior elicitation for model selection and estimation in generalized linear mixed models. Journal of Statistical Planning and Inference, in print. 21. Chen, M.-H., Ibrahim, J. G. and Sinha, D. (1999). A new Bayesian model for survival data with a surviving fraction. Journal of the American Statistical Association 94: 909–919. 22. Chen, M.-H., Ibrahim, J. G. and Yiannoutsos, C. (1999). Prior elicitation and Bayesian computation for logistic regression models with applications to variable selection. Journal of the Royal Statistical Society, Series B61: 223–242. 23. Chen, M.-H., Manatunga, A. K. and Williams, C. J. (1998). Heritability estimates from human twin data by incorporating historical prior information. Biometrics 54: 1348–1362. 24. Chen, M.-H. and Shao, Q.-M. (1999). Monte Carlo estimation of Bayesian credible and HPD intervals. Journal of Computational and Graphical Statistics 8: 69–92. 25. Chen, M.-H. and Shao, Q.-M. (1997). On Monte Carlo methods for estimating ratios of normalizing constants. The Annals of Statistics 25: 1563–1594. 26. Chen, M.-H., Shao, Q.-M. and Ibrahim, J. G. (2000). Monte Carlo Methods in Bayesian Computation, Springer-Verlag, New York. 27. Chib, S. and Greenberg, E. (1995). Understanding the Metropolis–Hastings algorithm. The American Statistician 49: 327–335. 28. Cox, D. R. and Reid, N. (1987). Parameter orthogonality and approximate conditional inference (with discussion). Journal of the Royal Statistical Society 49: 1–39. 29. Datta, G. S. and Ghosh, M. (1996). On the invariance of noninformative priors. Technical Report 94-20, Department of Statistics, University of Georgia. 30. Dey, D. K., Chen, M.-H. and Chang, H. (1997). Bayesian approach for nonlinear random effects models. Biometrics 53: 1239–1252. 31. du Plessis, J. L., van der Merwe, A. J. and Groenewald, P. C. N. (1995). Reference priors for the multivariate calibration problem. South African Statistical Journal 29: 155–168. 32. Eno, D. R. and Ye, K. (2000). Bayesian reference prior analysis for polynomial calibration models. Test 9: 191–208. 33. Eno, D. R. and Ye, K. (2001). Probability matching priors for an extended statistical calibration model. Canadian Journal of Statistics 29: 19–35.
34. Gamerman, D. (1997). Markov Chain Monte Carlo, Chapman and Hall, London. 35. Geisser, S. (1993). Predictive Inference: An Introduction, Chapman and Hall, London. 36. Geisser, S (1980). In discussion of G. E. P. Box. Journal of the Royal Statistical Society, Series A143: 416–417. 37. Gelfand, A. E., Dey, D. K. and Chang, H. (1992). Model determinating using predictive distributions with implementation via sampling-based methods (with discussion). In Bayesian Statistics 4, eds. J. M. Bernado, J. O. Berger, A. P. Dawid and A. F. M. Smith, Oxford University Press, 147–167. 38. Gelfand, A. E. and Mallick, B. (1995). Bayesian analysis of proportional hazards models built from monotone functions. Biometrics 51: 843–852. 39. Gelfand, A. E., Sahu, S. K. and Carlin, B. P. (1996). Efficient parametrisations for generalized linear mixed models (with discussion). In Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, Oxford University Press, 165–180. 40. Gelfand, A. E., Sahu, S. K. and Carlin, B. P. (1995). Efficient parametrisations for normal linear mixed models. Biometrika 82: 479–488. 41. Gelfand, A. E. and Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association 85: 398–409. 42. Gelman, A. and Meng, X.-L. (1998). Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statistical Science 13: 163–185. 43. Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6: 721–741. 44. Geweke, J. (1989). Bayesian inference in econometrics models using Monte Carlo integration. Econometrica 57, 1317–1340. 45. George, E. I. and McCulloch, R. E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association 88: 881–889. 46. Ghosh, J. K. and Mukerjee, R. (1992). Noninformative priors. In Bayesian Statistic 4, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, Oxford University Press, London, 195–210. 47. Ghosh, M., Carlin, B. P. and Srivastava, M. S. (1998). Probability matching priors for linear calibration. Test 4: 333–357. 48. Gilks, W. R., Richardson, S. and Spiegelhalter, D. J. (1996). Markov Chain Monte Carlo in Practice, Chapman and Hall, London. 49. Gilks, W. R. and Wild, P. (1992). Adaptive rejection sampling for Gibbs sampling. Applied Statistics 41: 337–348. 50. Goel, P. K. (1988). Software for Bayesian analysis: Current status and additional needs. In Bayesian Statistics 3. Eds. J. M. Bernardo, M. H. DeGroot, D. V. Lindley and A. F. M. Smith, Oxford University Press, 173–188. 51. Goldsman, D. and Meketon, M. S. (1986). A comparison of several variance estimators. Technical Report J-85-12. School of Industrial and Systems Engineering, Georgia Institute of Technology.
52. Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82: 711–732. 53. Grenander, U. (1983). Tutorial in pattern theorey. Technical Report. Division of Applied Mathematics, Brown University, Providence, RI. 54. Hammersley, J. M. and Handscomb, D. C. (1964). Monte Carlo Methods, Methuen, London. 55. Haseman, J. K., Huff, J. and Boorman, G. A. (1984). Use of historical control data in carcinogenicity studies in rodents. Toxocologic Pathology 12: 126–135. 56. Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57: 97–109. 57. Hoeting, J. A., Madigan, D., Raftery, A. E. and Volinsky, C. T. (1999). Bayesian model averaging: A tutorial. Statistical Science 14: 382–417. 58. Ibrahim, J. G. and Chen, M.-H. (2000). Power prior distributions for regression models. Statistical Sciences 15, 46–60. 59. Ibrahim, J. G. and Chen, M.-H. (1998). Prior distributions and Bayesian computation for proportional hazards models. Sankhy¯ a, Series B60: 48–64. 60. Ibrahim, J. G., Chen, M.-H. and MacEachern, S. N. (1999). Bayesian variable selection for proportional hazards models. The Canadian Journal of Statistics 27: 701–717. 61. Ibrahim, J. G., Chen, M.-H. and Ryan, L.-M. (2000). Bayesian variable selection for time series count data. Statistica Sinica 10: 971–987. 62. Ibrahim, J. G., Ryan, L.-M. and Chen, M.-H. (1998). Use of historical controls to adjust for covariates in trend tests for binary data. Journal of the American Statistical Association 93: 1282–1293. 63. Ibrahim, J. G. and Laud, P. W. (1994). A predictive approach to the analysis of designed experiments. Journal of the American Statistical Association 89: 309–319. 64. Jeffreys, H. (1961). Theory of Probability, Oxford University Press. 65. Johnson, W. and Geisser, S. (1983). A predictive view of the detection and characterization of influential observations in regression analysis. Journal of American Statistical Association 78: 137–144. 66. Kass, R. E. and Wasserman, L. A. (1996). Formal rules for selecting prior distributions. Journal of the American Statistical Association 91: 1343–1370. 67. Kirkwood, J. M., Ibrahim, J. G., Sondak, V. K., Richards, J., Flaherty, L. E., Ernstoff, M. S., Smith, T. J., Rao, U., Steele, M. and Blum, R. H. (1999). The role of high- and low-dose interferon Alfa-2b in high-risk melanoma: First analysis of intergroup trial E1690/S9111/C9190. Journal of Clinical Oncology 18: 2444–2458. 68. Kirkwood, J. M., Strawderman, M. H., Ernstoff, M. S., Smith, T. J., Borden, E. C. and Blum, R. H. (1996). Interferon alfa-2b adjuvant therapy of high-risk resected cutaneous melanoma: The Eastern Cooperative Oncology Group trial EST 1684. Journal of Clinical Oncology 14: 7–17. 69. Kubokawa, T. and Robert, C. P. (1994). New perspectives on linear calibration. Journal of Multivariate Analysis 51: 178–200.
70. Laud, P. W. and Ibrahim, J. G. (1995). Predictive model selection. Journal of the Royal Statistical Society, Series B57: 247–262. 71. Law, A. M. and Kelton, W. D. (1991). Simulation Modeling and Analysis. 2nd edn., McGraw-Hill, New York. 72. Meng, X.-L. (1994). Posterior predictive p-values. The Annals of Statistics 22: 1142–1160. 73. Meng, X.-L. and Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica 6: 831–860. 74. Merigan, T. C., Amato, D. A., Balsley, J., Power, M., Price, W. A., Benoit, S., Perez-Michael, A., Brownstein, A., Kramer, A. S., Brettler, D., Aledort, L., Ragni, M. V., Andes, A. W., Gill, J. C., Goldsmith, J., Stabler, S., Sanders, N., Gjerset, G., Lusher, J. and the NHF-ACTG036 Study Group (1991). Placebo-controlled trial to evaluate zidovudine in treatment of human immunodeficiency virus infection in asymptomatic patients with hemophilia. Blood 78: 900–906. 75. Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953). Equations of state calculations by fast computing machines. Journal of Chemical Physics 21: 1087–1092. 76. Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression (with discussion). Journal of the American Statistical Association 83: 1023–1036. 77. O’Hara Hines, R. J. (1989). Some methods for the analysis of toxicological mortality data grouped over time. Unpublished PhD Thesis, Department of Statistics and Actuarial Science, University of Waterloo, Canada. 78. O’Hara Hines, R. J. and Lawless, J. F. (1993), Modelling overdispersion in toxicological mortality data grouped over time. Biometrics 49: 107–122. 79. Osbourne, C. (1991). Statistical calibration: A review. Journal of Statistical Review 59: 309–336. 80. Penrose, K., Nelson, A. and Fisher, A. (1985). Generalized body composition prediction equation for men using simple measurement techniques (Abstract). Medicine and Science in Sports and Exercise 17: 189. 81. Philippe, A. and Robert, C. P. (1998). A note on the confidence properties of reference priors for the calibration model. Test 7: 147–160. 82. Raftery, A. E. (1996). Approximate Bayes factors and accounting for model uncertainty in generalized linear models. Biometrika 83: 251–266. 83. Raftery, A. E., Madigan, D. M. and Hoeting, J. (1997). Model selection and accounting for model uncertainty in linear regression models. Journal of the American Statistical Association 92: 179–191. 84. Ripley, B. D. (1987). Stochastic Simulation, Wiley, New York. 85. Robert, C. P. and Casella, G. (1999). Monte Carlo Statistical Methods, Springer-Verlag, New York. 86. Roberts, G. O., Gelman, A. and Gilks, W. R. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. Annals of Applied Probability 7: 110–120.
87. Roberts, G. O. and Polson, N. G. (1994). On the geometric convergence of the Gibbs sampler. Journal of the Royal Statistical Society, Series B56: 377–384. 88. Schervish, M. J. and Carlin, B. P. (1992). On the convergence of successive substitution sampling. Journal of Computational and Graphical Statistics 1: 111–127. 89. Schmeiser, B. W., Avramidis, A. N. and Hashem, S. (1990). Overlapping batch statistics. In Proceedings of the 1990 Winter Simulation Conference, 395–398. 90. Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6: 461–464. 91. Sinha, D. and Dey, D. K. (1997). Semiparametric Bayesian analysis of survival data. Journal of the American Statistical Association 92: 1195–1212. 92. Song, W.-M. T. and Schmeiser, B. W. (1995). Optimal mean-squared-error batch sizes. Management Science 41: 110–123. 93. Stein, C. (1985). On the coverage probability of confidence sets based on a prior distribution. In Sequential Methods in Statistics, ed. R. Zielinski, PWN-Polish Scientific Publishers, Warsaw, 485–514. 94. Sun, D. and Ye, K. (1995). Reference prior Bayesian analysis for normal mean products. Journal of American Statistical Association 90: 589–597. 95. Tanner, M. A. (1996). Tools for Statistical Inference, 3rd edn., SpringerVerlag, New York. 96. Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association 82: 528–549. 97. Tibshirani, R. (1989). Noninformative priors for one parameter of many. Biometrika 76: 604–608. 98. Tierney, L. (1994). Markov chains for exploring posterior distributions (with discussions). The Annals of Statistics 22: 1701–1762. 99. Verdinelli, I. and Wasserman, L. (1995). Computing Bayes factors using a generalization of the Savage-Dickey density ratio. Journal of the American Statistical Association 90: 614–618. 100. Volberding, P. A., Lagakos, S. W., Koch, M. A., Pettinelli, C., Myers, M. W., Booth, D. K., Balfour, H. H., Reichman, R. C., Bartlett, J. A., Hirsch, M. S., Murphy, R. L., Hardy, D., Soeiro, R., Fischl, M. A., Bartlett, J. G., Merigan, T. C., Hylsop, N. E., Richman, D. D., Valentine, F. T., Corey, L. and the AIDS Clinical Trials Group of the National Institute of Allergy and Infectious Diseases (1990). Zidovudine in asymptomatic human immunodeficiency virus infection. New England Journal of Medicine 322: 941–949. 101. Welch, B. L. and Peers, H. W. (1963). On formulae for confidence points based on integrals of weighted likelihoods. Journal of the Royal Statistical Society 25: 318–329. 102. Wolpert, R. L. (1991). Monte Carlo importance sampling in Bayesian statistics. In Statistical Multiple Integration, eds. N. Flournoy and R. Tsutakawa, Contemporary Mathematics 116: 101–115.
103. Yakovlev, A. Y., Asselain, B., Bardou V. J., Fourquet, A., Hoang, T. Rochefediere, A. and Tsodikov, A. D. (1993). A simple stochastic model of tumor recurrence and its applications to data on premenopausal breast cancer. In Biometrie et Analyse de Donnees Spatio-Temporelles, # 12, eds. B. Asselain, M. Boniface, C. Duby, C. Lopez, J. P. Masson and J. Tranchefort, Rennes, France, 66–82. 104. Yakovlev, A. Y. and Tsodikov, A. D. (1996). Stochastic Models of Tumor Latency and Their Biostatistical Applications, World Scientific, New Jersey. 105. Ye, K. (1993). Reference priors when the stopping rule depends on the parameter of interest. Journal of American Statistical Association 88: 360–363. 106. Ye, K. (1994). Bayesian reference prior analysis on the ratio of variances for the balanced one-way random effect model. Journal of Statistical Planning and Inference 41: 267–280. 107. Ye, K. and Berger, J. O. (1989). Noninformative priors for inferences in exponential regression models. Biometrika 78: 645–656.
About the Authors Ming-Hui Chen is currently an associate professor at Department of Statistics, University of Connecticut. He has been an associate professor at Department of Mathematical Sciences, Worcester Polytechnic Institute before he joined the current position. He obtained BS in Mathematics from Hangzhou University, MS in Applied Probability from Shanghai Jiao Tong University, and MS in Applied Statistics and PhD in Statistics from Purdue University. He has coauthored the books: Applied Statistics for Engineers (Prentice-Hall, INC., 1999), Monte Carlo Methods in Bayesian Computation (Springer-Verlag, 2000) and Bayesian Survival Analysis (SpringerVerlag, 2001). His current research interests include Bayesian statistical methodology, Bayesian computation, categorical data analysis, Monte Carlo methodology, prior elicitation, missing data analysis, statistical analysis for prostate cancer study, variable selection, and survival models.
Keying Ye is currently an associate professor at Department of Statistics, Virginia Polytechnic Institute and State University, Blacksburg, VA, USA. Dr. Ye received his BS in Mathematics, Fudan University, MS in Mathematics, Institute of Applied Mathematics, Academia Sinica, in the People’s Republic of China, and PhD in Statistics from Purdue University, USA. His research interests are Bayesian methodologies in applications to various fields such as environmental and ecological systems, gene expressions,
industrial and quality improvement. Recent works involve Bayesian hierarchical modeling, prediction incorporating model uncertainty, experimental designs, multivariate statistical analysis in environmental data, spatial effects in data analysis and response surface experimental designs with model uncertainty.
CHAPTER 26
STOCHASTIC PROCESSES AND THEIR APPLICATIONS IN MEDICAL SCIENCE JIQIAN FANG∗ and LI CAIXIA Department of Medical Statistics, School of Public Health, Sun Yat-sen University, 74 Zhongshan Road II, Guangzhou 510080, PR China Tel: 86-20-87330671; ∗[email protected] A stochastic process X = {X(t), t ∈ T } is a t-indexed collection of random variables. i.e. for any t ∈ T , X(t) is a random variable and t is a parameter. We often interpret t as time. If the set T is a countable set, we call the process a discrete-time stochastic process, usually denoted by {Xn , n = 1, 2, . . .}. And if T is continuum, we call it a continuous-time stochastic process, usually denoted by {X(t), t ≥ 0}. X(t) is called the state of process at time t. The collection of possible values of X(t) is called state space. Stochastic processes have ever applied in many fields. Now we introduce some important stochastic processes and their applications in medical science.
1. Markov Chains 1.1. Discrete-time Markov chains Suppose that we roll a six-sided dice. The probability of rolling 1 is denoted p1 (0 < p1 < 1). Now consider a sequence of consecutive rolls. Suppose that they are all independent. If we let Xn denote the accumulative number of rolling 1 after n consecutive rolls, it is easy to see that the variables {Xn , n = 1, 2, . . .} are not independent. However, If the value of Xn is given, for example Xn = i, we can see Xn+1 takes either the value i (with probability 1−p1 ) or the value i+1 (with probability p1 ). In other words, the process {Xn , n = 1, 2, . . .} shows the property that conditional distribution of the future state Xn+1 , given the present state Xn , depends only on the present state and is independent of the past states of X1 , X2 , . . . , Xn−1 .
This property is called Markovian property. Markov chains are discretestate stochastic processes with Markovian property. Definition.2,19,21 Consider a stochastic process {Xn , n = 0, 1, 2, . . .} that takes on a finite or countable values. {Xn , n = 0, 1, 2, . . .} is said to be Markov chain if P {Xn+1 = j|Xn = i, Xn−1 = in−1 , . . . , X1 = i1 , X0 = i0 } = P {Xn+1 = j|Xn = i} .
(1)
For all states i0 , i1 , . . . , in−1 , i, j and all n ≥ 0. P {Xn+1 = j|Xn = i} in Eq. (1) is associated with a transition taking place in one step, so it is called (one-step) transition probability and is denoted as pi,j (n, n + 1). A Markov chain is said to be homogeneous if pi,j (n, n + 1) is independent of n. Then pi,j (n, n + 1) can be denoted as pij . Let P denote the matrix of transition probability pij , so that p00 p01 p02 · · · P = b (pij ) = p10 p11 p12 · · · . It is obvious that
P
j
···
···
···
···
pij = 1 for any i.
1.1.1. A simple example — simple random walk A particle make a random walk on the integer points. Wherever it is, it will either goes up one step (with probability p) or down one step (with probability 1 − p). Let Xn denote the site of the particle after n steps. It is easy to see the simple random walk is a Markov chain. Its transition probability j =i+1 p pij = 1 − p j = i − 1 i = 0, ±1, . . . 0 else
As similar as one-step transition probability pij . m-step (m > 1) transition probability is (m)
pij
= b P {Xn+m = j|Xn = i} . (m)
(m)
Let P (m) denote the matrix of pij , i.e. P (m) = (pij ).
1.1.2. Chapman–Kolmogorov equationm (C–K equation) For any m, n, (m+n)
pij
=
X
(m) (n)
pik pkj ,
(2)
k
or in terms of the transition probability matrices: P (m+n) = P (m) · P (n) ,
(3)
especially, P (n) = [P (1) ]n . So C–K equation can be used to derive higher order transition probability from one-step transition probability. Chiang11 pointed out that (n)
pij =
s X l=1
A′ij (λl )λnl , m=1 (λl − λm )
Qs
i, j = 1, 2, . . . , s ,
(4)
m6=l
when s × s transition probability matrix P has s distinct eigenvalues λ1 , λ2 , . . . , λs , where the matrix A′ (λl ) = (λl I − P )′ . The probability distribution of Xn is X (n) P {X0 = i}P {Xn = j|X0 = i} b P {Xn = j} = pj = i
=
X i
(n)
P {X0 = i}pij .
And in terms of the matrices, (n)
PXn = (pj ) = PX0 · P (n) .
(5)
So, the probability distribution of a Markov chain can be derived from transition probability matrix and initial distribution. 1.1.3. Example: Hardy–Weinberg law of equilibrium in genetics 2,11 Consider a biological population. Each individual in the population is assumed to have a genotype AA or Aa or aa, where A and a are two alleles. Suppose that the initial genotype frequency composition (AA, Aa, aa) equals to (d, 2h, r), where d + 2h + r = 1. Then the gene frequencies of A and a are p and q, where p = d + h, q = r + h and p + q = 1. We can use Markov chain to describe the heredity process. We number the three genotypes AA, Aa and aa by 1, 2, 3 and denote by pij the probability that
994
J.-Q. Fang & C.-X. Li
an offspring has genotype j given that a specified parent has genotype i. For example, p12 = P {a child has genotype Aa|his mother has genotype AA} = P {his father has gene a|his mother has genotype AA} . Under random mating assumption, P {his father has gene a|his mother has genotype AA} = P {his father has gene a} . So p12 = q. Similar computations yield the other transition probabilities. The one-step transition probability matrix is p q 0 p11 p12 p13 1 . P = p21 p22 p23 = 1 p 1 q 2 2 2 p31 p32 p33 0 p q (k)
Let pi denote the probability that the kth generation has genotype i. The initial genotype distribution of the 0th generation (0)
(0)
(0)
(p1 , p2 , p3 ) = (d, 2h, r) . And then the genotype distribution of the first generation (1)
(1)
(1)
(0)
(0)
(0)
(p1 , p2 , p3 ) = (p1 , p2 , p3 )P p 1 = (d, 2h, r) p 2 0
q 1 2 p
0
1 q 2 q
= ((d + h)p, dq + h + rp, (h + r)q) = (p2 , 2pq, q 2 ) .
The genotype distribution of the second generation (2)
(2)
(2)
(1)
(1)
(1)
(p1 , p2 , p3 ) = (p1 , p2 , p3 )P p = (p2 , 2pq, q 2 ) 1 p 2 0
q 1 2 p
0
1 q 2 q
995
Stochastic Processes and Their Applications
= (p2 (p + q), pq(p + q + 1), q 2 (p + q)) = (p2 , 2pq, q 2 ) and has the same distribution as that of the first generation. Similar computations show the distributions of the 3rd, 4th, . . . are all same and still are (p2 , 2pq, q 2 ). This is Hardy–Weinberg law of equilibrium. That is, whatever the parent genotype frequency compositions (d, 2h, r) may be, under random mating assumption, the first generation progenies will have the genotype composition (p2 , 2pq, q 2 ) and this composition will remain in equilibrium forever. 1.2. Stationary distribution and limiting distribution (n)
State j is said to be accessible2 from state i if for some n ≥ 0, pij > 0. Two states accessible to each other are said to communicate.2 We say that the Markov chain is irreducible2 if all states communicate with each other. (n) State i is said to have period d if pii = 0 whenever n is not divisible by d and d is the greatest integer with the property. A state with period 1 is called aperiodic.20 A probability distribution {πj } related to a Markov chain is called stationary if it satisfied the relation X πi pij . (6) πj = i
If the initial distribution {P (X0 = i)} is stationary distribution, then X X πi pij = πj P {X0 = i}P {X1 = j|X0 = i} = P {X1 = j} = i
i
and by induction, the probability P {Xn = j} = πj . Therefore the distribution of Xn is independent of n (time) and the corresponding process is in a statistical equilibrium. In the example of Hardy–Weinberg law of equilibrium, the stationary distribution of the Markov chain is (p2 , 2pq, q 2 ). A Markov chain is called finite2 if the chain has finite states. There must exist unique stationary distribution {πj } in a finite and irreducible Markov P chain,2 the πj , j ≥ 0, are the unique solution of Eq. (6) and j πj = 1. If there is a distribution {πj } such that X (n) (7) πj pij = πj for any i, j lim n→∞
i
{πj } is called long-run distribution (or limiting distribution).
996
J.-Q. Fang & C.-X. Li
If a Markov chain has long-run distribution {πj }, the chain has asymptotic distribution {πj } no matter what the initial distribution is. A long-run distribution must be a stationary distribution. If a finite Markov chain is aperiodic, its stationary distribution is long-run distribution.2,18,20 1.2.1. Example: Social status change2,18 In sociology, there is a question about how much effect be made on son’s social status by father’s social status. We take one’s occupation indicates his social status. Now, consider the conditional probability distribution for son’s occupation. In a research report about social status change, probability distribution is provided as following: Table 1. Son’s occupation
Father’s occupation
good
median
bad
good median bad
0.448 0.054 0.011
0.484 0.699 0.503
0.068 0.247 0.486
We consider social status’s change as transition between states. If Markovian property is satisfied in the states, we can use a finite (three states) Markov chain to describe the social status’s change. This chain is irreducible and aperiodic, and there must be long-run distribution (π1 , π2 , π3 ) satisfied (π1 , π2 , π3 )P = π1 , π2 , π3 . So we can get
0.067 π1 π2 = 0.624 .
π3
0.309
We can say, if social status’s change is a Markov chain with above transition probability, the social status take asymptotically the proportions: 6.7 for good, 62.4 for median, 30.9 for bad. 1.3. Continuous-time Markov chain Definition. For all states i, j, xu and all s, t ≥ 0, if the equation P {Xt+s = j|Xs = i, Xu = xu , 0 ≤ u < s} = P {Xt+s = j|Xs = i}
(8)
997
Stochastic Processes and Their Applications
is satisfied, the discrete process {Xt , t ≥ 0} is called continuous-time Markov chain. P {Xt+s = j|Xs = i} in the equation is also called transition probability, denoted by pi,j (s, s + t). A continuous-time Markov chain is said to be homogeneous if pi,j (s, s + t) is independent of (denoted by pij (t) here). If the state is i at time t, the chain transform into state j with the probability pij (∆t) = P {X(t + ∆t) = j|X(t) = i} after ∆t. Let δij
(
j 6= i
0 1
j=i
.
p (∆t)−δ
qij = b lim∆t→0+ ij ∆t ij is said to be transition intensity.20 The matrix Q= b (qij ) is called transition intensity matrix. qij dt = P {X(t + dt) = j|X(t) = i} ,
j 6= i
qii dt = P {X(t + dt) = i|X(t) = i} − 1 = −P {X(t + dt) 6= i|X(t) = i} . So
X
qij = 0 .
j
The continuous-time Markov chain with intensity matrix Q = (qij ). (1) The sojourn time of state i have exponential distribution with the mean −qii . q after (2) The chain step into state j(j 6= i) with the probability pij = − qij ii leaving state i. The transition probability satisfies Chapman–Kolmogorov equation X pij (t + s) = pik (t)pkj (s) k
and two Chapman–Kolmogorov differential equations which are Chapman– Kolmogorov forward equation X p′ij (t) = pik (t)qkj , i.e. P ′ (t) = P (t)Q (9) k
and Chapman–Kolmogorov backward equation X p′ij (t) = qik pkj (t) , i.e. P ′ (t) = QP (t) k
for all i, j and t ≥ 0.
(10)
998
J.-Q. Fang & C.-X. Li
We can obtain transition probabilities from the differential equations. Example. Now consider a two-state continuous-time Markov chain. The sojourn time of state 0 has exponential distribution with rate λ and the sojourn time of state 1 has exponential distribution with rate u. Therefore the intensity matrix is ! −λ λ . µ −µ From forward equation p′00 (t) = −λp00 (t) + µp01 (t) = −λp00 (t) + µ(1 − p00 (t)) = −(λ + µ)p00 (t) + µ we have p00 (t) =
µ λ + exp(−(λ + µ)t) . λ+µ λ+µ
p11 (t) =
µ λ + exp(−(λ + µ)t) . λ+µ λ+µ
Similarly,
Hence, transition probability matrix λ µ + exp(−(λ + µ)t) λ + µ λ + µ P (t) = µ µ − exp(−(λ + µ)t) λ+µ λ+µ
λ λ − exp(−(λ + µ)t) λ+µ λ+µ . λ µ + exp(−(λ + µ)t) λ+µ λ+µ
Generally, for s-state chain, when the intensity matrix has single eigenvalues λ1 , λ2 , . . . , λs , Chiang11 presented the solution of Chapman– Kolmogorov differential equations Pij (t) =
s X A′ (λl ) exp(λl t) Qijs , m=1 (λl − λm ) l=1
′
i, j = 1, 2, . . . , s
(11)
m6=l
′
where A (λl ) = (λl I − Q) .
2. Applications of Markov Chains Markov chain is usually used to describe a systems with Markovian property. For example, We usually divide a certain disease into several states in medical science. Under Markovian property assumption, we can get the transition information among states.
999
Stochastic Processes and Their Applications Table 2.
Frequence of tree species in 1983 and 1988.
1983
1988
Total
1
2
3
4
5
6
1 2 3 4 5 6
83 8 12 2 7 39
8 81 2 0 0 27
0 0 116 6 3 44
0 0 2 75 0 5
1 0 1 0 35 4
12 18 6 1 4 362
104 107 139 84 49 481
Total
151
118
169
82
41
403
964
2.1. Example 1 : Predict the structure of future system In the paper of Chen Jianzhong et al.1 analysis the data investigated in 964 areas of NanPing in 1983 and 1988, a Markov model is built to predict forest resources with tree species. The tree species include cunninghamia lanceolata (state 1), pinus massoniana (state 2), broad-leaved trees (state 3), phyllostachys pubescens (state 4), economic trees (state 5) and others (state 6). The data from two investigations see Table 2. One step (five years) transition probability matrix P is estimated as follows: 79.81 7.69 0 0 0.96 11.54 7.48 75.70 0 0 0 16.82 8.63 1.44 83.45 1.44 0.72 4.32 P = . 2.38 0 7.14 89.29 0 1.19 14.29 0 6.12 0 71.43 8.16 8.11
5.61
9.15
1.04
0.83
75.26
The chain with transition matrix P has stationary distribution. Based on the initial distribution in 1988 and transition matrix P , the distributions of trees from 1993 to 2023 are computed in Table 3. The stationary distribution shows the structure at present is unreasonable and needs to be adjusted in accordance to future structure. The transition probability matrix after adjustment becomes 79.81 7.69 0 0 0.96 11.54 7.48 75.70 0 0 0 16.82 2.88 1.44 89.21 1.44 0.72 4.32 P = . 2.38 0 7.14 89.29 0 1.19 14.29 0 6.12 0 71.43 8.16 3.95 5.61 7.28 3.12 4.78 75.26
1000
J.-Q. Fang & C.-X. Li
Table 3.
Prediction for occupied % of tree species in different years.
Year
1
2
3
4
5
6
1983 1988 1993 1998 2003 2008 2013 2018 2023 .. . Stationary
0.1079 0.1566 0.1913 0.2159 0.2335 0.2460 0.2550 0.2614 0.2661 .. . 0.2805
0.1111 0.1224 0.1307 0.1369 0.1418 0.1457 0.1488 0.1514 0.1536 .. . 0.1663
0.1442 0.1753 0.1932 0.2028 0.2073 0.2087 0.2084 0.2071 0.2054 .. . 0.1901
0.0871 0.0851 0.0828 0.0805 0.0783 0.0761 0.0740 0.0721 0.0703 .. . 0.0534
0.0508 0.0425 0.0366 0.0324 0.0295 0.0273 0.0259 0.0248 0.0241 .. . 0.0226
0.4990 0.4180 0.3653 0.3318 0.3097 0.2961 0.2879 0.2831 0.2805 .. . 0.2871
Table 4. Prediction for occupied % of tree species in different years after adjustment. Year
1
2
3
4
5
6
1983 1988 1993 1998 2003 2008 2013 2018 2023 .. . Stationary
0.1079 0.1276 0.1437 0.1562 0.1658 0.1728 0.1779 0.1815 0.1840 .. . 0.1872
0.1111 0.1224 0.1285 0.1315 0.1330 0.1337 0.1339 0.1339 0.1338 .. . 0.1324
0.1442 0.1743 0.1965 0.2131 0.2255 0.2348 0.2420 0.2474 0.2572 .. . 0.2698
0.0871 0.0954 0.1008 0.1041 0.1062 0.1075 0.1083 0.1087 0.1089 .. . 0.1084
0.0508 0.0622 0.0669 0.0680 0.0672 0.0657 0.0639 0.0622 0.0607 .. . 0.0545
0.4990 0.4180 0.3637 0.3270 0.3023 0.2855 0.2741 0.2663 0.2609 .. . 0.2477
The distributions of trees from 1993 to 2023 are computed similarly in Table 4. 2.2. Example 2 : Decision analysis and cost-effectiveness analysis Helicobacter pylori (HP) infection is a factor on tummy cancer. A markov model is provided for cost analysis in Wang Qian et al.9 Four states in the chain are without HP infection (state 1), HP infection (state 2), cancer (state 3) and death (state 4). The transition probabilities are given in Fig. 1. Suppose that is 50% individuals in the population is HP infectious and the cancer incidence is 27/105. For cost analysis, assume that the heath
1001
Stochastic Processes and Their Applications
Fig. 1. Table 5.
Transition probabilities.
The cost in the population (10,000 individuals) without screening.
T
States
Communication value
1
2
3
4
S
Q
Cost
S
Q
Cost
0 1 2 3 4 5 . ..
0.5 0.49175 0.48364 0.47566 0.46781 0.46009 . ..
0.5 0.50160 0.50310 0.50452 0.50584 0.50708 . ..
0 0.00027 0.00049 0.00066 0.00080 0.00091 . ..
0 0.00638 0.01277 0.01916 0.02555 0.03192 . ..
0 9936 9872 9808 9745 9681 . ..
0 9684 9617 9551 9486 9421 . ..
0 27050 48660 65916 79687 90667 . ..
0 9936 19808 29617 39361 49042 . ..
0 9684 19301 28852 38338 47759 . ..
0 27050 75710 141627 221314 311981 . ..
28 29 30
0.31382 0.51532 0.00129 0.16957 8304 8038 129002 254820 247422 3181863 0.30864 0.51496 0.00129 0.17512 8249 7982 128703 263069 255404 3310566 0.30355 0.51454 0.00128 0.18063 8194 7927 128387 271263 263332 3438953
T = time
S = survival time
Q = quality survival time
values of four states are 1, 0.95, 0.3, 0, respectively. The cost of a patient with cancer is $104 per year. The transitions and costs for the population are given in Table 5. Suppose that the sensitivity and specificity are both 90% in the screening test and the cost of the screening test is $25 per individual. The cure rate of HP infection is 80% and the cost for cure is $300. Similarly, the transitions and costs for the population with screening can be computed. They are given in Table 6. There is a contrast between them and see Table 7. 2.3. Example 3 : Using the transition dependent on covariates to analysis the factors for illness In the paper of Fang et al.,3 two non-homogeneous Markov chains were used to study a two-stage model with time-dependent covariates for latent
1002
J.-Q. Fang & C.-X. Li
Table 6.
The cost in the population (10,000 individuals) with screening.
T
0 1 2 3 4 5 .. . 28 29 30
States
Communication value
1
2
3
4
S
Q
Cost
S
Q
Cost
0.86 0.84581 0.83186 0.81813 0.80464 0.79136 .. .
0.14 0.14765 0.15510 0.16236 0.16944 0.17634 .. .
0 0.00016 0.00029 0.00040 0.00048 0.00055 .. .
0 0.00638 0.01275 0.01911 0.02544 0.03175 .. .
9936 9873 9809 9746 9682 .. .
9861 9793 9725 9658 9590 .. .
16070 29082 39642 48236 55251 .. .
9936 19809 29618 39363 49046 .. .
9816 19654 29379 39037 48627 .. .
1766070 1795152 1834794 1883030 1938281 .. .
0.53977 0.29162 0.00092 0.16769 8323 8171 91973 255075 251730 3848064 0.53086 0.29504 0.00092 0.17318 8268 8114 92283 263343 259844 3940347 0.52210 0.29834 0.00093 0.17863 8214 8085 92571 271557 267902 4032918
Table 7.
The costs between two populations.
S (year) Q (year) Cancer frequency Summary cost ($) Screening cost ($) Cost for cancer ($)
Fig. 2.
Screening
Non-Screening
Difference
271557 267902 55 4032918 1750000 2282918
271263 263332 82 3438953 — 3438953
294 4570 −27 593965 1750000 −1156035
The transitions between four states.
period of cancer. There are four states in the chains, which are inapparent illness (state 0), soakage stage (state 1), non-soakage stage (state 2) and observable clinic state (state 3). The transitions between the states are in Fig. 2. 2.3.1. Model 1. A non-homogeneous discrete-time Markov model Let pij (t) = P {X(t + 1) = j|X(t) = i} and Z(t) = (Z1 (t), Z2 (t), . . . , Zp (t))′ ,
where Z1 (t), Z2 (t), . . . , Zp (t) are p covariates. The transition probability matrix is p00 (t) p01 (t) 0 0 p (t) p (t) p (t) 0 11 12 10 P (t) = , 0 0 p22 (t) p23 (t) 0
0
0
1
where
p01 (t) = a01 · θ(t) ,
p00 (t) = 1 − p01 (t) ,
p10 (t) = a10 · (1 − θ(t)) , p12 (t) = a12 · θ(t) , p11 (t) = 1 − p10 (t) − p12 (t) , p23 (t) = a23 · θ(t) ,
p22 (t) = 1 − p23 (t) ,
θ(t) = 1 − exp(−C ′ Z(t)) ,
C = (r1 , r1 , . . . , rp )′ .
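A direct way to see how the covariates enter Model 1 is to assemble P(t) from θ(t) = 1 − exp(−C′Z(t)); the sketch below uses made-up values for a01, a10, a12, a23, C and Z(t) — they are not the fitted parameters reported later.

```python
import numpy as np

def transition_matrix(z, a01, a10, a12, a23, C):
    """P(t) of Model 1 for one subject at one time, given covariates z = Z(t)."""
    theta = 1.0 - np.exp(-C @ z)
    p01, p10 = a01 * theta, a10 * (1.0 - theta)
    p12, p23 = a12 * theta, a23 * theta
    return np.array([[1 - p01, p01,           0.0,     0.0],
                     [p10,     1 - p10 - p12, p12,     0.0],
                     [0.0,     0.0,           1 - p23, p23],
                     [0.0,     0.0,           0.0,     1.0]])

# Example call with hypothetical parameters and covariates
# (sex disorder, sex health, age, age-square, cervicitis).
C = np.array([0.5, 0.05, 0.01, 1e-4, 0.4])
z = np.array([1.0, 0.0, 40.0, 1600.0, 1.0])
print(np.round(transition_matrix(z, 0.05, 0.3, 0.02, 0.01, C), 4))
```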
2.3.2. Model 2. A non-homogeneous continuous-time Markov model In the model, the transition intensities λij (t)dt = P {X(t + dt) = j|X(t) = i} Let λ01 (t) = A0 + A1 Z1 (t) + · · · + Ap Zp (t) , λ10 (t) = B0 + B1 Z1 (t) + · · · + Bp Zp (t) , λ12 (t) = C0 + C1 Z1 (t) + · · · + Cp Zp (t) , λ23 (t) = D0 + D1 Z1 (t) + · · · + Dp Zp (t) . The two models were applied to analyze a set of 12-year and 6-run screening data of cervical cancer in Jingan county, Jiangxi Province, China. The covariates are sex disorder, sex health, age, age-square and cervicitis. In model 1, the estimation of parameter C = (r1 , r1 , . . . , rp )′ is rˆ1 = 0.7095 , rˆ2 = 0.0189 , rˆ3 = 0.0152 , rˆ4 = 2.23 × 10−4 , rˆ5 = 0.631 , i.e. Cˆ = (0.7095, 0.0189, 0.0152, 2.23 × 10−4 , 0.631). In the likelihood ratio tests for the hypothesis ri = 0, the χ2 statistics are 58.65, 59.62, 22.97, 39.72 and 77.38 respectively. P < 0.01 for all. The results show the 5 covariates have effect on the transition.
In model 2, let Ci = 0, Di = 0(i = 1, 2, 3, 4, 5). The estimations of other parameters are Aˆ0 = 2.81 × 10−5 ,
Aˆ1 = 0.765 ,
Aˆ2 = 0.963 ,
Aˆ3 = 4.16 × 10−4 ,
Aˆ4 = 7.61 × 10−4 ,
ˆ0 = 8.25 × 10−3 , B
ˆ1 = 0.019 , B
ˆ3 = 2.32 × 10−3 , B
ˆ4 = 3.12 × 10−3 , B
Cˆ0 = 7.03 × 10−4 ,
ˆ 0 = 0.0279 . D
Aˆ5 = 1.333 ,
ˆ2 = 0.116 , B ˆ5 = 0.109 , B
After hypothesis test, except the covariates age and age-square, the others have effect on the transition. 2.4. Example 4 : Modeling for sequence with short-term memory In a sequence {Xt }, the value of Xs usually has effect on the value of Xs+t , i.e. Cov(Xs , Xs+1 ) 6= 0. The effect attenuates gradually when the time length t gets longer. The sequence is said to have short-term memory if the attenuation is fast and said to have long-term memory if the attenuation is slow. Markov chain is a sequence with short-term. Ion-channels sometime is open and sometime is close. The single ionchannel patch-clamp recordings recorded by patch-clamp are shown in Fig. 3. Fang et al.15 proposed two-state Markov model to study quantitatively memory existing in ion-channels. A two-state Markov process with constant transition intensities well fitted the short-term memory and a twostate Markov process within a kind of random environment well fitted the long-term memory. In the short-term memory model, the auto-correlation function is exp[−(λ + µ)t].
Fig. 3.
Single ion-channel patch-clamp recordings.
3. Generalized Markov Chains 3.1. Markov chains in random environment A Markov chain in random environment15 is called if the transition intensities are random variables. Example. Considering a two-state continuous-time Markov chain {X(t), t ≥ 0}, the transition intensities λ = q01 , µ = q10 are two independent random variables. The probability density functions are f (λ), g(µ) respectively. When λ, µ are given, the transition matrix of X(t) is λ λ λ µ + exp(−(λ + µ)t) − exp(−(λ + µ)t) λ + µ λ + µ λ+µ λ+µ . P (t) = µ µ λ µ − exp(−(λ + µ)t) + exp(−(λ + µ)t) λ+µ λ+µ λ+µ λ+µ The conditional distribution {pj|λ,µ (t), j = 0, 1} of X(t) satisfies pj|λ,µ (t) = b {X(t) = j|λ, µ} =
1 X i=0
P {X(0) = i|λ, µ}P {X(t) = j|X(0) = i, λ, µ} .
So the distribution of X(t) becomes Z ∞ pj (t) = b P {X(t) = j} = P {X(t) = j|λ, µ}f (λ)g(µ)dλdµ . 0
The long-term memory model for ion-channels proposed by Fang et al.15 is Markov model in random environment. The distributions of transition intensities are Γ distributions, Γ(α1 , β), Γ(α2 , β). The auto-correlation attenuates as (βt + 1)−(α1 +α2 ) . The attenuation shows long-term memory. 3.2. Semi-Markov processes A continuous-time Markov chain {X(t), t ≥ 0}. The time τi spending for the transition between two successive states i → j(j 6= i) have exponential distribution. The distribution is dependent with state i and independent with state j. The chain is called semi-Markov process if the distribution of τi is arbitrary distribution and the distribution is dependent with state i and state j. Markov chain is a semi-Markov process.
....
Xt–1
Xt
Xt+1
....
....
Yt–1
Yt
Yt+1
....
Fig. 4.
The relation between {Xt } and {Yt }.
3.3. Hidden Markov chains A hidden Markov chain19 {Xt , Yt } is composed by two processes. One is a underline finite-state Markov chain {Xt }. The other is observed processes {Yt } and the distribution of Yt is dependent with Xt . For example, the ion-channel sometimes is open and sometimes is close. The open (state 1) or close (state 0) cannot be observed directly because there are noise in the channel. Hidden Markov model can be used here. Let {Xt } denote the sequence of states and {Yt } denote the observed sequence recorded by patch-clamp. Suppose that {Xt } has Markov property. Yt has normal distribution N (µ0 , σ02 ) when Xt = 0 and has normal distribution N (µ1 , σ12 ) when Xt = 1. If {Xt } is given, there is conditional independence among Yt s. The relation between {Xt } and {Yt } as shown in Fig. 4. At present, hidden Markov model is also used for biological sequence analysis. Example.19 Let {Yt , t = 1, 2, . . . , n}, Yt ∈ {a, c, g, t} = b {1, 2, 3, 4} is a DNA sequence. St , St ∈ {1, 2, . . . , r} denote the type of homogeneous segment at position t in the sequence. St , St ∈ {1, 2, . . . , r} is unobservable. A hidden Markov processes modeled for {St , Yt }. Supposed {Yt } is a Markov chain, the transition probability P {Yt = yt |St = st , y1 , y2 , . . . , yt−1 } = P {Yt = yt |St = st , yt−1 } t) = py(st−1 yt ,
dependent with the segment type St . (k) Suppose that the conjugate prior distribution for the row vectors pi of the transition matrix is Dirichlet distribution. Then the posterior distribution is still Dirichlet distribution. The segment type can be decided by the conditional probability P (St = k|y) (k = 1, 2, . . . , r) and the probability P (St = k, St+1 6= k|y).
Stochastic Processes and Their Applications
1007
3.4. Time-reversible Markov chain Consider a stationary Markov chain with transition matrix P = (pij ). If the initial distribution is the stationary distribution {πi }, the distribution of the chain at any time will remain the same. Now, we trace the sequence of states backwards in time. The reversible sequence Xn , Xn−1 , . . . , is still a Markov chain. The transition probability p∗ij = P {Xm = j|Xm+1 = i, Xm+2 = i2 , . . . , Xm+k = ik } =
P {Xm = j, Xm+1 = i, Xm+2 = i2 , . . . , Xm+k = ik } P {Xm+1 = i, Xm+2 = i2 , . . . , Xm+k = ik }
P {Xm = j}P {Xm+1 = i|Xm = j} × P {Xm+2 = i2 , . . . , Xm+k = ik |Xm = j, Xm+1 = i} = P {Xm+1 = i}P {Xm+2 = i2 , . . . , Xm+k = ik |Xm+1 = i} = =
πj pji P {Xm+2 = i2 , . . . , Xm+k = ik |Xm+1 = i} πi P {Xm+2 = i2 , . . . , Xm+k = ik |Xm+1 = i} πj pji . πi
If p∗ij = pij , for all i, j, the Markov chain is called time-reversible.20 If a Markov chain is time-reversible, then πi pij = πj pij ,
for all i, j .
(12)
4. Applications in Statistic Computation 4.1. MCMC In statistic computation, we usually compute the expectation of a function f (x): Z Eπ f = f (x)π(x)dx ,
where x = (x1 , x2 , . . . , xk ) is k-dimensional vector and π(x) is a density function. Markov chain Monte Carlo (MCMC)8 methods are applied for computation. At present, there are two different definitions about Markov chain .In MCMC, the Markov chain is a discrete-state Markov process. There are transition probabilities for discrete-time chain and transition intensities for continuous-time chain. The transition probabilities and transition intensities are called transition kernel for discrete-state Markov process.
1008
J.-Q. Fang & C.-X. Li
In MCMC, an aperiodic irreducible Markov chain {X (0) , X (1) , X (2) , . . .} with stationary distribution π(x) was built. If the initial state X (0) has distribution π(x), X (t) will have the same distribution π(x). An aperiodic irreducible Markov chain with stationary distribution π(x) has limiting distribution which is also π(x). The distribution of X (m) has little effect on the initial state X (0) and gets close to π(x) when m is large enough. Therefore, the n − m state, X (m+1) , X (m+2) , . . . , X (n) , is used for computation. The steps in MCMC methods as follows. Step 1. Building a Markov chain with stationary distribution π(x). Step 2. Getting a sample X (0) , X (1) , X (2) , . . . , X (n) . Step 3. Taking m, n(m < n) and estimating of Eπ f : ˆπ f = E
n X 1 f (X (t) ) . n − m t=m+1
(13)
In MCMC methods, the transition kernel p(x, x′ ) is very important, where x′ = (x′1 , x′2 , . . . , x′k ). In the different MCMC methods, the kernel is different. In MCMC methods, the full conditional distribution π(xT |x−T ) are used, where T ⊂ {1, 2, . . . , k}, xT = {xi , i ∈ T }, x−T = {xi , i ∈ / T }. π(xT |x−T ) = R
π(xT ) ∝ π(x) . π(x)dxT
Example. Suppose the joint distribution of X1 , X2 is 1 π(x1 , x2 ) ∝ exp − (x1 − 1)2 (x2 − 1)2 . 2 Then the full conditional distribution
1 π(x1 |x2 ) ∝ π(x1 , x2 ) ∝ exp − (x1 − 1)2 (x2 − 1)2 2
= N (1, (x2 − 1)−2 ) 1 2 2 π(x2 |x1 ) ∝ π(x1 , x2 ) ∝ exp − (x1 − 1) (x2 − 1) 2 = N (1, (x1 − 1)−2 ) Two important MCMC methods, Gibbs method and MetropolisHastings method, are introduced respectively.
1009
Stochastic Processes and Their Applications
4.1.1. Gibbs sampling method Gibbs sampling method is proposed by Geman S. and Geman D. (1984).21 Let x′−T = x−T . Consider the conditional distribution of XT |X−T . Let b π(x′T |x−T ). So x′T can be gotten the transition kernel p(xT , x′T |x−T ) = ′ from the distribution π(•|x−T ). It can be proved that X ′ = (XT′ , X−T ) has ′ distribution π(x ). The simple Gibbs method is called if only one element is in T . Sampling from full conditional distribution becomes simple. The steps are given as follows. (0) (0) (0) Suppose that initial value x(0) = (x1 , x2 , . . . , xk ) is given and the (t − 1)th iterative value is x(t−1) . The tth iterative value x(t) is gotten as follow k steps. (t)
(t−1)
(t)
(t)
(t)
(t)
Step 1. Sampling x1 from full conditional distribution π(x1 |x2 (t−1) xk ); ...... ......
,...,
Step i. Sampling xi from full conditional distribution π(xi |x1 , . . . , (t−1) (t−1) (t) ); xi−1 , xi+1 , . . . , xk ...... ...... Step k. Sampling xk from full conditional distribution π(xk |x1 , . . . , (t) (t) (t) (t) xk−1 ). Let x(t) = (x1 , x2 , . . . , xk ). 4.1.2. Metropolis–Hastings sampling method Metropolis–Hastings sampling method is proposed by Metropolis (1953)22 and Hastings (1970).23 The transition kernel p(x, x′ ) is built as follows. Suppose that q(x, x′ ) is an irreducible transition kernel. Let p(x, x′ ) = q(x, x′ )α(x, x′ ) ,
x 6= x′
where α(x, x′ ) is a function and 0 < α(x, x′ ) ≤ 1. When the current state is x, i.e. X (t) = x. We propose a transition x → x′ with intensity q(x, x′ ). The proposal is not automatically accepted. The probability of acceptance is α(x, x′ ). Therefore, the successive state will be changed as x′ with probability α(x, x′ ) and not changed with probability 1 − α(x, x′ ). That is to say, ( x′ u ≤ α(x, x′ ) x(t+1) = , x u > α(x, x′ ) where u is a random number from [0, 1] uniform distribution.
1010
J.-Q. Fang & C.-X. Li Table 8.
The capture-recapture data when k = 3. 3rd sample
1st sample
Observed Non-observed
Observed
Non-observed
2nd sample
2nd sample
Observed x123 x¯123
Non-observed x1¯23 x¯1¯23
Observed x12¯3 x¯12¯3
Non-observed x1¯2¯3 x¯1¯2¯3
q(x, ·) is a probability function or density function. It is called proposal distribution in Metropolis–Hastings sampling. α(x, x′ ) is called acceptance probability. An expression for α(x, x′ ) is derived to ensure the chain with stationary distribution π(x). π(x′ )q(x′ , x) ′ . (14) α(x, x ) = min 1, π(x)q(x, x′ )
For full conditional distribution, let x′−T = x−T . The proposal distribution is q(xT , x′T |x−T ) and acceptance probability π(x′T |x−T )q(x′T , xT |x−T ) ′ . (15) α(x, x ) = min 1, π(xT |x−T )q(xT , x′T |x−T )
Gibbs sampling is a special Metropolis–Hastings sampling, where proposal distribution is π(x′T |x−T ) and acceptance probability is constant 1. In the dissertation of Gao6 about Bayesian analysis of capture-recapture data, an application for MCMC is provided. In capture-recapture model, the size of a closed population is N . N is unknown and need to be estimated by k samples from the population. Suppose that the size of the ith sample is ni and n(n < N ) individuals are observed in all k samples and. Every individual was captured with probability pi in the ith sample. When k = 3, the capture-recapture data as following Table 8, where x¯1¯2¯3 is unknown. Let u = x¯1¯2¯3 . Then N = n+u. Let {ω} = {123, 12¯ 3, 1¯ 2¯ 3, ¯123, ¯1¯23, ¯12¯3} and {ω ′ } = {ω} + {¯1¯2¯3}. When the three samples are independent, the likelihood function is 3
Y N! Q pni (1 − pi )N −ni . p({xω }|N, {pi }) = (N − n)! ω xω ! 1 i
The posterior distribution (N, {pi }) is Y N! Q p(N, {pi }|xω′ ) ∝ (pi )xω (1 − pi )N −ni π(N, {pi }) , (N − n)! ω xω ! ω
1011
Stochastic Processes and Their Applications
where π(N, {pi }) is prior distribution of (N, {pi }). If π(N, {pi }) is uniform distribution, the full conditional distribution of (N, {pi }) is p(pi |u, p−i , {xω }) ∝ pni i (1 − pi )u+n−ni ∼ β(ni + 1, u + n − ni + 1) , where β(ni + 1, u + n − ni + 1) is β distribution, and ! u+n p(u|{pi }, {xω }) = (p∗ )u (1 − p∗ )n = b(n + u, p∗ ) , u where p∗ =
Q3
1 (1
− pi ).
4.2. Reversible jump MCMC computation In above MCMC methods, the dimension k of vector x is known and fixed. They are not available when k is not fixed. Peter J. Green16 proposed reversible jump MCMC samplers that jump between parameter subspaces of differing dimensionality. 4.2.1. The general case Suppose that we have a countable collection of candidates model {M k , k ∈ K}. Model Mk has a vector θ (k) of unknown parameters, where θ (k) ∈ Rnk and is a nk dimensional vector. Let x = (k, θ (k) ), Ωk = {k} × Rnk and S Ω = k Ωk , x ∈ Ωk . For a given k, π(x) is the joint posterior distribution of k and θ(k) , i.e. π(x) = p(k, θ(k) |y) = p(k|y)p(θ (k) |k, y) , where y is an observed sample. When the current state is x, we propose a move x → dx′ of type m with probability qm (x, dx′ ). Thus the successive state is not changed with P P probability 1 − m qm (x, Ω), where m qm (x, Ω) ≤ 1. Let αm (x, x′ ) is the acceptance probability of the move of type m. The transition kernel is XZ P (x, B) = qm (x, dx′ )αm (x, x′ ) + s(x)I(x ∈ B) , m
B
where B ⊂ Ω, I(·) is indicator function and XZ X s(x) = qm (x, dx′ ){1 − αm (x, x′ )} + 1 − qm (x, Ω) m
B
is the probability of not moving from x.
m
1012
J.-Q. Fang & C.-X. Li
αm (x, x′ ) given by Peter J. Green16 is π(dx′ )qm (x′ dx) αm (x, x′ ) = min 1, . π(dx)qm (x, dx′ )
Suppose that π(dx)qm (x, dx′ ) has a finite density fm (x, x′ ). Then fm (x′ , x) . αm (x, x′ ) = min 1, fm (x, x′ )
(16)
(17)
4.2.2. Switching between two simple subspaces A simple example is given first. Let two subspaces Ω1 = {1} × R, Ω2 = {2} × R2 x = (1, θ) ∈ Ω1 when k = 1 and x = (2, θ1 , θ2 ) ∈ Ω2 when k = 2. Consider a move between Ω1 and Ω2 . A move from Ω1 to Ω2 is defined as (1, θ) → (2, θ + u, θ − u), where u and θ are independent random variables. Then the reversible move from Ω2 to Ω1 is (2, θ1 , θ2 ) → (1, 12 (θ1 + θ2 )). For dimensional matching of (θ, u) and (θ1 , θ2 ), there is a bijection between (θ, u)and (θ1 , θ2 ). S In general, Ω1 = {1} × Rn1 , Ω2 = {2} × Rn2 , Ω = k Ωk , x = (k, θ(k) ). For a given k, x ∈ Ωk , k = 1, 2). Consider just one move type between Ω1 and Ω2 . The proposal distribution is q(x, dx′ ) and this move is chosen with probability j(x). From Ω1 to Ω2 , a m1 dimension random vector u(1) independent with θ(1) is generated. Set θ (2) to be some function of θ (1) and u(1) . Then the move (1, θ (1) ) → (2, θ(2) ) is defined. Similarly, to switch back, a m2 dimension random vector u(2) independent with θ(2) is generated and set θ(1) to be some function of θ (2) and u(2) . Then there is a bijection between (θ(1) , u(1) ) and (θ(2) , u(2) ). For dimensional matching, m1 and m2 must satisfy n1 + m1 = n2 + m2 . Suppose that the densities of u(1) and u(2) are q1 (u(1) ) and q2 (u(2) ) respectively. Let x = (1, θ (1) , µ(1) ) ∈ Ω, x′ = (2, θ(2) , µ(2) ) ∈ Ω and f (x, x′ ) = p(1, θ(1) |y)j(1, θ(1) )q1 (u(1) ) , (2) , u(2) ) ′ (2) (2) (2) ∂(θ f (x , x) = p(2, θ |y)j(2, θ )q2 (u ) . ∂(θ(1) , u(1) ) Then the acceptance probability p(2, θ(2) |y)j(2, θ(2) )q2 (u(2) ) ∂(θ(2) , u(2) ) ′ . α(x, x ) = min 1, p(1, θ(1) |y)j(1, θ(1) )q1 (u(1) ) ∂(θ(1) , u(1) )
(18)
Sometimes, m1 or m2 is 0. For example, when m2 is 0, Eq. (18) becomes ∂(θ(2) ) p(2, θ(2) |y)j(2, θ(2) ) . α(x, x′ ) = min 1, p(1, θ(1) |y)j(1, θ(1) )q1 (u(1) ) ∂(θ(1) , u(1) )
1013
Stochastic Processes and Their Applications
5. Branching Processes 5.1. Branching processes Branching Processes were studied by Galton and Watson in 1874. Consider a population, in which the individuals can produce offspring. Each individual can produce k new offspring with probability pk , k = 0, 1, 2, . . . , independently of the others’ producing. That is to say, all individuals’ producings have independent identical distribution (i.i.d.). Suppose that the numbers of initial individuals is X0 , which is called the size of the 0th generation. The size of the first generation, which is constituted by all (n) offspring of the 0th generate, is denoted by X1 , . . . . Let Zj denote the number of the offspring produced by the jth individual in the nth generation. Then, Xn−1 (n−1)
Xn = Z 1
(n−1)
+ Z2
(n−1)
+ · · · + ZXn−1 =
X
(n−1)
Zj
.
j=1
It shows that Xn is a sum of Xn−1 random variables with i.i.d. {pk , k = 0, 1, 2, . . .}. The process {Xn } is called branching processes.18 The Branching Processes is a Markov Chain and its transition probability is ! i X (n) pij = P {Xn+1 = j|Xn = i} = P Zk = j . k=1
Suppose that there are x0 individuals in the zeroth generation, i.e. X0 = P∞ P∞ (n) (n) x0 . Let E(Zj ) = k=0 kpk = µ and var(Zj ) = k=0 (k − µ)2 pk = σ 2 . Then it is easy to see E(Xn ) = x0 µn , n−1 x2 µn−1 σ 2 µ 0 µ−1 var(Xn ) = 2 2 nx0 σ
µ 6= 1
.
µ=1
Now, we can see that the expectation and variance of the size will increase when µ > 1 and will decrease when µ < 1. In branching processes, the probability π0 that the population dies out is shown in the following theorem. Theorem. Suppose that p0 > 0 and p0 + p1 < 1. Then
1014
J.-Q. Fang & C.-X. Li
(1) π0 = q x0 if µ > 1, where q is the smallest positive number satisfying the equation ∞ X
x=
pk xk .
(19)
k=0
(2) π0 = 1 if µ ≤ 1.
Example. Suppose that each individual in a population can produce 0, 1, 2 and 3 offspring with probabilities 1/8, 3/8, 3/8 and 1/8, respectively. Then the mean number of offspring per individual is 1 3 3 1 + 1 × + 2 × + 3 × = 1.5 > 1 . 8 8 8 8 If the size of 0th generation is 1, i.e. x0 = 1, the probability π0 that the population dies out satisfies µ=0×
x = x0 ×
1 3 3 1 + x1 × + x2 × + x3 × , 8 8 8 8
i.e. x3 + 3x2 − 5x + 1 = 0. √ √ √ This equation has 3 roots, 1, − 5 − 2 and 5 − 2. So q = 5 − 2. Lotka2,18 proposed a branching process model to study the white American family. Suppose that a white generates k sons (k = 0, 1, 2, . . .) with probability bck−1 k 6= 0 ∞ X pk = bck−1 k = 0 , 1 − k=1
where b, c is positive number and b + c < 1. In this branching process, µ=
∞ X
kpk =
k=0
The equation
x=
∞ X
b . 1 − c2
pk xk
k=0
1−(b+c) c(1−c) .
has two roots, i.e. 1 and q = Lotka got b = 0.2126, c = 0.5893 according to a data collected in 1920. So we will obtain µ = 1.26 > 1 ,
q = 0.819 .
Stochastic Processes and Their Applications
1015
That is to say, given x0 = 1, the probability that the population will eventually die out is 0.819. In addition, branching processes has ever used in genes mutation, genetics and epidemiology.
5.2. Generalized branching processes Branching processes introduced above is in very simple cases. (1) Each individual in certain generation produces offspring with the identical probability distribution. (2) The probability distribution above is independent of generation. (3) Any two individuals’ producing is independent each other. (4) Each individual will not die before it produces offspring. (5) The population is close. In this simple case, the size of the population will eventually become 0 or infinity. In fact, the above assumptions usually come into broken before the size become infinity. There are some cases to generalize simple branching processes. (i) Suppose that each individual survives with probability r before producing. (ii) The distribution {pk , k = 0, 1, 2, . . .} is dependent on generation n. (iii) The population is not close. Suppose that the population in simple branching processes is not close when the generation n is produced, there are Yn individuals of the same kind immigrating and the Yn individuals’ producing independently. The size of population which is not close can be stationary if the immigration satisfies some certain conditions. Suppose that survival times of each individual in simple branching processes have independent identical distribution F . Before dying, each individual will have produced k new offspring with probability pk . Let X(t) denote the number of living individuals at time t. For the process {X(t), t ≥ 0}, when t is large enough, E(X(t)) ≈
(µ − 1)eαt R∞ , µ2 α 0 xe−αx dF (x)
1016
J.-Q. Fang & C.-X. Li
P∞ where µ = k=0 kpk is expectation of each individual’s offspring. α is the positive number which satisfies the follow equation Z ∞ 1 e−αx dF (x) = . µ 0
Lucas17 used branching processes to study plasmodia’s producing. After a plasmodium comes into the cell in the liver, it propagates rapidly. When the number of plasmodia is large enough, the cell will be broken and plasmodia will go into red cells in blood. And with the number of plasmodia produced increasing, the red cell is broken and go into other red cells. The red cell’s broken is periodical. Every period is about 48–72 hours. A branching process is used here. The initial plasmodia which go into red cells constitute 0th generation. The offspring of 0th generation is called 1st generation, and so on. Two models were considered. In model 1, each plasmodium’s survival probability is r before producing. It is easy to see E(Xn ) = x0 rn−1 µn . So we can conclude that the expectation of the number of plasmodia increase if rµ > 1, decrease if rµ < 1, and is fixed if rµ = 1. This model cannot fit well the data given in Table 9. In model 2, suppose that the survive probability is dependent on generation n and denotes rn . Then E(Xn ) = x0 r1 r2 · · · rn−1 µn . Therefore E(Xn+1 ) = rn µE(XN ). If µ is given, rn can be estimated from this equation. When µ = 10, the estimations are given in Table 10. Table 9.
The number of invaded from 106 red cells every 48 hours. Date
Time
Invaded number
10/24 10/26 10/28 10/30 11/1 11/3 11/5 11/7 11/9
22:15 22:30 22:00 22:30 22:30 22:30 22:30 23:00 23:15
5600 1220 1330 1200 1560 1440 2000 1370 161
1017
Stochastic Processes and Their Applications Table 10.
The estimations of rn when µ = 10.
Date
Invaded number
rn µ
rn
10/24 10/26 10/28 10/30 11/1 11/3 11/5 11/7 11/9
5600 1220 1330 1200 1560 1440 2000 1370 161
0.218 1.090 0.902 1.300 0.923 1.389 0.685 0.118
0.0218 0.1090 0.0902 0.1300 0.0923 0.1389 0.0685 0.0118
6. Birth Death Processes 6.1. Birth death process Birth–death processes {N (t), t ≥ 0} were discussed in the processes that a population grow and decline. Birth–death processes are Markov chains with transition probabilities satisfying qij = 0 if |i − j| ≥ 2. Let λi = qi,i+1 ,
µi = qi,i−1 .
The transition probability matrix for Birth–death processes is −λ0 λ0 0 0 ··· µ1 −(λ1 + µ1 ) µ1 0 ··· . 0 µ2 −(λ2 + µ2 ) µ2 · · · .. .. .. .. .. . . . . .
Let N (t) denote the size of a population at time t. We say that 1 birth (or 1 death) occurs in the population when the size increase by 1 (decrease by 1, respectively). λi and µi are called birth rate and death rate respectively. According to the C–K forward equation, the distribution pk (t) = b P {N (t) = k} satisfies the equation p′0 (t) = −(λ0 + µ0 )p0 (t) + µ1 p1 (t) ,
p′k (t) = −(λk + µk )pk (t) + λk−1 pk−1 (t) + µµ+1 pk+1 (t)
k≥1
when N (0) = 0. Example.20 There are M mice in a cage, and there are infinity foods to provide them to eat. A mouse will stop eating at time t + h with probability µh + o(h) if it is eating at time t and will have been eating before time t + h
1018
J.-Q. Fang & C.-X. Li
with probability λh + o(h) if it is not eating at time t. Each mice’s eating is independent of others’ eating. Let N (t) denote the number of mice which are eating at time t. {N (t)} is a birth–death process, and P {N (t + h) = i + 1|N (t) = i} = (M − i)λh + o(h) , P {N (t + h) = i − 1|N (t) = i} = iµh + o(h) . Therefore, the birth rate is λi = (M − i)λ, the death rate is µi = iµ. The processes discussed above are homogeneous. A birth–death process is called homogeneous when the birth rate and death rate depend on time. 6.2. Pure birth process A birth–death process is called pure birth process if µi = 0 for all i. Therefore, the distribution {pk (t)} of pure birth process satisfies p′0 (t) = −λ0 p0 (t) ,
p′k (t) = −λk pk (t) + λk−1 pk−1 (t) k ≥ 1 . 6.2.1. Example20 Mckendrick model There is one population which is constituted by 1 infected and N − 1 susceptible individuals. The infected state is a absorbing state. Suppose that any given infected individual will cause, with probability βk + o(h), any given susceptible individual infected in time interval (t, t + h), where β is called infection rate. Let X(t) denote the number of the infected individuals at time t. Then {X(t)} is a pure birth process with birth rate λn (t) = (N − n)nβ . This epidemic model was proposed by A. M. Mckendrick in 1926. Let T denote the time until all individuals in the population are infected and Ti denote the time from i infective to i + 1 infective. Then Ti has 1 exponential distribution with mean λ1i = (N −i)iβ . Therefore ! N −1 N −1 X 1 X 1 Ti = ET = E . β i(N − i) i=1 i=1 6.2.2. Example: M. J. Faddy and J. S. Fenlon13 Some stochastic models based on pure birth processes are constructed to describe the invasion process of nematodes in fly larvae. Let X(t) denote
Stochastic Processes and Their Applications
1019
the number of nematodes which have invaded the host at time t. Then {X(t)} is a pure birth process with birth rate λn , i.e. P {X(t + ∆t) = n + 1|X(t) = n} = λn ∆t + o(∆t) , λn = (N − n)an , where N is the number of nematodes outside the host at time 0 and X(0) = 0. Five models in which appropriate forms for λn are given constructed as follows. Let λn = (N − n)an . Model 1. Let an = a, where a is a constant. From the differential equations, we can know that X(t) has binomial distribution ! N P {X(t) = n} = (1 − exp(−at))n exp(−at)N −n . n However, in practice, such a model is unlikely to be adequate. Model 2. Let an =
(
a0
n=0
a1
n≥1
,
where a1 > a0 .
Model 3. Let an = exp(a + bn), where b > 0. Model 4. Let an =
a , 1 + exp(b + cn)
where c < 0 .
Model 5. Let an =
a exp(−dN ) , 1 + exp(b + cn)
where c < 0, d > 0 .
The solution of the four differential equations for the latter 4 models can be calculated numerically using MATLAB software. Three data sets are given in Tables 11, 12 and 13 respectively. Three data sets are analyzed. All models fitted to these data resulted in a log-likelihood. Let Li denote the log-likelihood value for model i. For Table 11, L1 = −626.80, L2 = −602.57, L3 = −588.88, L4 = −588.46 and L5 = −588.21. Model 3 is good enough. In model 3, a ˆ = −1.17(0.05), ˆb = 0.25(0.03) where the values in parentheses are standard
1020
J.-Q. Fang & C.-X. Li Table 11.
Numbers of invading nematodes for various N .
Number of larvae with the following numbers of invading nematodes N
0
1
2
3
4
5
6
7
8
9
10
10 7 4 2 1
1 9 28 44 158
8 14 18 26 60
12 27 17 6
11 15 7
11 6 3
6 3
9 1
6 0
6
2
0
Table 12.
Numbers of invading nematodes for various N .
Number of larvae with the following numbers of invading nematodes N
0
1
2
3
4
5
6
7
8
9
10
10 7 4 2 1
4 12 32 35 165
11 21 22 26 59
15 17 15 17
10 12 6 2
10 7 0
11 5
8 0
3 0
0
0
0
Table 13.
Numbers of invading nematodes for various N .
Number of larvae with the following numbers of invading nematodes N
0
1
2
3
4
5
6
7
8
9
10
10 7 4 2 1
21 34 35 45 186
13 15 19 26 40
11 13 12 3
11 1 3
9 2 2
4 3
2 1
2 1
1
0
0
errors. After combining some of the entries in Table 11 with low counts, a χ2 goodness-of-fit statistic is calculated. χ2 = 17.42, degree of freedom df = 16 and p-value p ≈ 0.36. For Table 12, L1 = −559.28, L2 = −545.04, L3 = −542.12, L4 = −540.041 and L5 = −540.039. Model 4 is better. In model 4, the estimators of parameters are a ˆ = 0.72(0.20), ˆb = 0.60(0.42), cˆ = −0.61(0.25). After combining some of the entries in Table 12 with low counts, χ2 = 7.75, df = 13 and p ≈ 0.86. For Table 13, L1 = −554.81, L2 = −517.60, L3 = −512.39, L4 = −509.58 and L5 = −503.71. Model 5 is the best. In model 5, the estimators
Stochastic Processes and Their Applications
1021
of parameters are a ˆ = 1.94(0.78), ˆb = 2.07(0.41), cˆ = −0.87(0.19), dˆ = 0.062(0.018). After combining some of the entries in Table 13 with low counts, χ2 = 19.22, df = 13 and p ≈ 0.12. When the birth rate λi in homogeneous process or λi (t) in nonhomogeneous process takes appropriate forms, some special processes are gotten. 6.3. Poisson process 6.3.1. Poisson process — λi = λ A Poisson process {X(t)} is a pure birth process with constant birth rate. The solution of the differential equations in pure birth processes gives the distribution of X(t) pk (t) = b P {X(t) = k} =
exp(−λt)(λt)k . k!
It is a Poisson distribution with mean λt. 6.3.2. Non-homogeneous Poisson process — λi (t) = λ(t) Non-homogeneous Poisson processes {X(t)} are time dependent Poisson processes. The distribution of X(t) is Poisson distribution Rt Rt exp(− 0 λ(s)ds)( 0 λ(s)ds)k pk (t) = b P {X(t) = k} = , k! Rt with mean 0 λ(s)ds. 6.3.3. Weighted Poisson process11 — λ is a random variable The Poisson process describe the frequency of occurrence of an event to an individual with risk parameter λ. The variability of individuals with respect to this risk takes into account. That is to say, we permit λ is various with the density function f (λ). The conditional distribution pk|λ (t) of X(t) given λ is pk|λ (t) = b P {X(t) = k|λ} =
exp(−λt)(λt)k . k!
Thus the distribution of X(t) is Z ∞ exp(−λt)(λt)k pk (t) = f (λ)dλ . k! 0
1022
J.-Q. Fang & C.-X. Li
For example, if λ has a gamma distribution, we can get that the distribution of X(t) is a negative binomial distribution. Weighted Poisson model which incorporates variability of risk has been found useful in studies of accident proneness. 6.3.4. Compound Poisson process A stochastic process {X(t), t ≥ 0} is called a compound Poisson process if it is represented by N (t)
X(t) =
X
Yn ,
n=1
where {N (t), t ≥ 0} is a Poisson process, {Yn , n = 1, 2, . . .} is a collection of random variables with independent and identically distribution (i.i.d.), and {N (t), t ≥ 0} and {Yn , n = 1, 2, . . .} are independent. Example.2 Suppose that the insurants in a insurance company die at time τ1 , τ2 , . . . , (0 < τ1 < τ2 < · · · ). Their deaths are Poisson events with rate λ. The insurance is Yn when the death occurs at time τn . Let X(t) denote the total amount of insurance by time t. Then, N (t)
X(t) =
X
Yn ,
n=1
where N (t) is the number of deaths by time t and it is a Poisson process with rate λ. {X(t)} is a compound Poisson process. 6.4. Yule process 6.4.1. Yule process — λi = iλ The Yule process is a pure birth process with linear birth rate. Suppose that all individuals alive at time t give birth to another individual with the same rate λ and that individuals give birth independently of each other. Let N (t) denote the total number of the population at time t. Then N (t) is a Yule process. The Yule process, starting from i individuals, has a negative binomial distribution ! j−1 pij (t) = b P {N (t) = j|N (0) = i} = exp(−λt)i (1 − exp(−λt))j−i . i−1
Stochastic Processes and Their Applications
1023
Yule18 used this process to study evolution. Let N (t) is the number of animal or plant species in a certain genus at time t. Suppose that each species would not be extinct when it come into being. In interval (t, t + h), a new species is generated with probability λN (t)h. Suppose that the new genera generated at time τ1 , τ2 , . . . , (0 < τ1 < τ2 < · · · ) with nonhomogeneous Poisson rate. The mean function of the Poisson process is m(t) = N0 eat , where N0 and a are positive constant. Let X (n) (T ) denote be the numbers of genera in which n species are contained at the given time T . Then X (n) (T ) can be represented by X (n) (T ) =
∞ X
(n) Wm (T, τm ) ,
m=1
where
(n) Wm (T, τm ) ( 1 if the genus generated at time τm contains n species at time T . = 0 else
We can get E[W (n) (t, τ )] = p1,n (t − τ ) = exp(−λ(t − τ )){1 − exp(−λ(t − τ ))}n−1 , and E[X
(n)
When T is large enough, E[X
(n)
(T )] ≈
Z
(T )] ≈ c
T
W (n) (T, τ )dm(τ ) .
0
Z
0
1
(1 − y)n−1 y a/λ dy ,
where c is a constant independent of n. Therefore R 1 a/λ y dy E[X (1) (T )] 1 P∞ = R 10 . = (n) a/λ−1 1 + λ/a (T )] y dy n=1 E[X 0
Let M denote the total number of genera and M1 denote the number of the genera in which only one species is contained. If M and M1 are large enough, we can estimate λ/a from the equation 1 M1 = . M 1 + λ/a
1024
J.-Q. Fang & C.-X. Li
The estimator of λ/a is λ M − M1 . = a M1 6.4.2. Non-homogeneous Yule process — λi (t) = iλ(t) Non-homogeneous Yule process {N (t)} is a time dependent Yule process. Similarly, N (t) has probability distribution pij (t) = b P {N (t) = j|N (0) = i} ! j−1 = exp[−(−ρ(s))]i {1 − exp[−(ρ(t) − ρ(s))]}j−i , i−1 where ρ(t) =
Rt 0
λ(τ )dτ .
6.5. Pure death process A birth–death process {N (t)} is said to be a pure death process if birth rates λi = 0 for all i. The pure death process is exactly analogous to the pure birth process. In the usual applications, N (t) is the number of individuals alive at time t and time t is interpreted as age. Let µ(t) denote the intensity that an individual alive at time t will die in the interval (t, t + ∆t). µ(t) is known as force of mortality, intensity of risk of dying, or failure rate. When N (t) = i, one death event occur in the interval (t, t + ∆t) with probability iµ(t)∆t + o(∆t). As we can see, {N (t)} is a pure death process with death rates iµ(t). And N (t) has binomial distribution pni (t) = b P {N (t) = i|N (0) = n} ! i Z t n−i Z t n µ(τ )dτ 1 − exp − µ(τ )dτ exp − = i 0 0 i = 0, 1, . . . , n . Suppose that i individuals are independent and have the same force of mortality. Let T denote the individual’s survival time. The survival function is defined by S(t) = b P {T > t} = 1 − F (t) ,
Stochastic Processes and Their Applications
1025
where F (t) is distribution of T . It is easy to show Z t µ(τ )dτ , S(t) = exp − 0
Z t f (t) = µ(t) exp − µ(τ )dτ , 0
µ(t) =
f (t) f (t) = +∞ . S(t) ft f (s)ds
For example, when T has Weibull distribution f (t) = µγtγ−1 exp(−µtγ−1 ) . Then we can calculate µ(t) = µγtγ−1 , and S(t) = exp(−µtγ−1 ) . 7. Counting Processes and Regression Models for Survival Data 7.1. Life table There are two kinds of life tables, current life table and cohort life table, working for two different kinds of studies, cross-sectional study and followup study. For current life table, Chiang12 proposed a method to calculate the probability of death. ni M i , qi = 1 + (1 − ai )ni Mi
where qi is age specific probability of death, the probability in (xi , xi + ni ), and Mi is age specific death rate. Based on stage processes, a new life table is constructed by Chiang.12 In this new table, probability of death is not only dependent on age, but also dependent on stage of disease. The stage process is usually used to describe the development of chronic diseases. Generally, chronic diseases advance with time from mild through intermediate stages to death. The process often is irreversible but a patient may die while being in any one of stages. For example, evolution of cancer is always a stage process. There are many staging phenomena in many other areas, birth order and child spacing, engagement-marrige-divorce in demography, and so on.
1026
J.-Q. Fang & C.-X. Li
7.2. Counting process A process {N (t), t ≥ 0} is called a counting process if N (t) represents the total number of events that occurred in (0, t]. A counting process must be a non-negative integer valued process. Let τi denotes the time of the ith event and it is said to be the arrival time of the ith event. τ1 , τ2 , . . . are random variables and 0 < τ1 < τ2 < · · · . Let T1 = τ1 , T2 = τ2 − τ1 , . . . , Tn = τn − τn−1 , . . . .
{Ti , i = 1, 2, . . .} is called the sequence of interarrival times. A counting process is called a renewal process if the interarrival times have independent and identically distribution. It is called Poisson process if the distribution is exponential distribution. Now consider k different kinds of events may occur. Let Ni (t) denote the total number of the ith kind of events that occurred in (0, t). N (t) = (N1 (t), N2 (t), . . . , Nk (t)) is called multiple counting process with k dimensions. Let X1 , X2 , . . . , Xn denote survival times of n individuals. They are independent and have the same survival function S(t). Let n X I(Xi ≤ t) , N (t) = #(i : Xi ≤ t) = i=1
where #(·) is a counting function and I(·) is a indicator function. Then N (t) is the total number of death that occurs in (0, t] and {N (t), t ≥ 0} is a counting process. In survival analysis problems, compete data is not possible. We can ˜ i , Di ), i = 1, 2, . . . , n, where Di is a censoring indicator. observe that (X Then ˜ i , if Di = 1 , Xi = X ˜i , Xi > X
if Di = 0 .
Let ˜ i ≤ t, Di = 1} . N (t) = #{i : X 7.3. Kaplan Meier estimator Kaplan–Meier estimator is a non-parametric estimator for survival function. It is also called product-limit estimator: Y ∆N (s) b = , S(t) 1− Y (s) s≤t
Stochastic Processes and Their Applications
1027
˜ i ≥ t} is the number at risk where ∆N (s) = N (s) − N (s−), Y (s) = #{i : X just before time t. 7.4. Cox regression In above assumption, survival times of n individuals have identical distribution. However, survival time usually depend on some covariates. If the values of covariates are different for individuals, survival times of individuals have different distribution. In other words, survival function depends on the covariates. Regression model can be used here. When the distribution is known as a certain distribution, for example, exponential distribution, a parametric regression model can be used. However, the distribution usually is unknown. Some semiparametric models are considered. Cox regression model is a semiparametric model. In Cox regression model, the intensity of hazard is µi (t, z) = µ0 (t) exp(β ′ Zi ) , where Zi = (zi1 , zi2 , . . . , zip )′ is covariates vector, β is regression coefficient vector and µ0 (t) is an intensity of risk independent of individuals. This model is also called proportional hazard model. Because µ0 (t) is unknown, the estimators of parameters are based on partial likelihood function. Suppose that we observed d individuals, denoted by (1), (2), . . ., (d), dead. Let X(i) denote the survive time of individual ˜ j ≥ X(i) }, the number (i). And X(1) < X(2) < · · · < X(d) . Let R(i) = {j : X at risk just before the time X(i) . Then the partial likelihood function is ) ( d Y individual (i) die at time X(i) | one P L= individual in R(i) die at time X(i) i=1 =
d Y
i=1
P
exp(β ′ Z(i) ) . ′ j∈R(i) exp(β Z(j) )
Then logarithm partial likelihood function is d X X β ′ Z(i) − ln exp(β ′ Z(j) ) . ln L = i=1
j∈R(i)
Therefore, the maximal partial likelihood estimator of β can be calculated. It is also called Cox estimator. Cox estimator is consistent, i.e. the estimator → β when the sample size n → ∞.
1028
J.-Q. Fang & C.-X. Li
Sometimes, the covariates are time-dependent. Some counting process models with time-dependent covariates are given next. 7.5. Multiple renewal process model A multiple renewal process with time-dependent covariates is used as a model for acute respiratory infections (ARI) in the paper of Fang.13 Consider a marked renewal process with g marks D1 , . . . , Dg corresponding g classes of diseases. Let µsi (t), s = 1, 2, . . . , g, i = 1, 2, . . . , n denote the intensity that disease Ds for individual i happens at time t. Let Zi (t) = (zi′ (t), (t − tri ), (t − tri )2 , . . . , (t − tri )q )′ . It contains a set of p time dependent covariates zi (t) and quasi-covariates, (t − tri ), (t − tri )2 , . . . , (t − tri )q , representing the effect of time, where tri is the latest renewal time of individual i before time t. And let µsi (t) = exp(Cs′ Zi (t) + θs ) , where Cs = (cs1 , cs2 , . . . , cs,p+q )′ is a p+q dimensional column vector and θs are parameters related to the occurrence of Ds and expected to be estimated from the data. For individual i, the records in the data include the beginning and the end of observed time, t0i and tei , the occurrence times t1i , t2i , . . . , tki i , and the corresponding states d1i , d2i , . . . , dki i , where ki is the number of transitions happening. The full log-likelihood function for n individuals can be written as " g # ki n Z tei X n X X X µsi (t) dt . ln µdj i (tji ) − ln L = i=1 j=1 ki 6=0
i=1
t0i
s=1
As a especial case, if the parameter vectors Cs are assumed to be equal to C for all s = 1, 2, . . . , g, the parameters are fewer. To estimate the vector C, a numerical method such as the Newton–Raphson algorithm is used. The child survey data on ARI was analyzed. Eighteen covariates are dealt with, of which nine are indices of health, seven are weather indices, and the last two are (t − tr ) which is length of time since the latest illness (TEF), and (t − tr )2 which is square of TEF (TEF2 ), respectively, serving to explore the effect of time. The nine indices of health include hemochrome (HEM), history of tracheitis (HTR), rickets (RIC), age (AGE), history of tuberculosis (HTB), dental caries (CAR), sex (SEX), ratio of height and weight (RHW), family history of tracheitis (FHT). The seven indices of
1029
Stochastic Processes and Their Applications
weather include low temperature for days (LTM), range of temperature for days (RTM), relative humidity for days (RHU), difference of minimal temperatures for two days (DLT), maximal wind velocity for days (MWV), atmospheric pressure for days (ATP). By means of the likelihood ratio test, we find that one index of health HEM, three weather indices LTM, RTM, and RHU, and the effect of time, TEF and TEF2 are significant on the risk of occurrence of disease D. The estimates of parameter θs corresponding to the six types of ARI are given as −2.0759, −3.9243, −7.6897, −4.4043, −2.0402 and −6.4517. The regression coeffients are 0.5115, −0.2971, 0.3055, 0.2176, 2.2937, −4.9689, respectively. 7.6. Markov counting process model In the paper of Fang et al.,4 a Markov counting process with time-dependent covariates is used as a model. Let Nij denote the numbers of transitions i → j(i 6= j) that occur in (0, t]. Let µijh (t) denote the intensity that the transition i → j for individual h happens at time t. We assume ′ µijh (t, z) = µij0 (t) exp(βij Zijh (t)) .
Then the partial likelihood function is L=
YY
t i,j,h
′ exp(βij Zijh (t)) Pn ′ h=1 exp(βij Zsi (t))Yih (t)
!∆Nijh (t)
,
where Nijh (t) is the number of transitions i → j(i 6= j) that occur in (0, t] for the individual h, ∆Nijh (t) = Nsi (t) − Nsi (t−) and Yih is indicator of the individual h at risk, corresponding state i. i.e. ( 1 if individual h at risk at time t, corresponding state i , Yih (t) = 0 else . Using this model, Fang4 analyzed a set of 12-year and 6-run screening data of cervical cancer in Jingan county, Jiangxi Province, China. The covariates are sex disorder, sex health, age, age-square and cervicitis. There are four states and five covariates are dealt with. To obtain maximal partial likelihood estimator of parameter vector β, Marquardt modification algorithm was used. The estimators are solved as ′ = (0.13488, −0.6567, 0.1088∗, −0.0012∗, 0.1838∗) , β01 ′ β10 = (0.0450, −0.8642, 0.0542∗, −0.0010∗, −0.0173) ,
1030
J.-Q. Fang & C.-X. Li ′ β12 = (−0.2087, −0.7219, 0.0691∗, −0.0008∗, 0.5787∗) , ′ β23 = (1.7469∗, 0.0000, 0.0544, 0.0014∗, 0.6952) ,
where
∗
means the covariate is significant.
7.7. General multiple counting process model Andersen et al.10 proposed a statistical model based on multiple counting process. Now consider k types of event. Let µsi (t, z) denote the occurrence intensity of type s event for individual i with time-dependent covariate vector z. We assume µsi (t, z) = µs0 (t, θ)f (β ′ Zsi (t)) ,
s = 1, 2, . . . , k ,
where Zsi (t) = (zsi1 (t), zsi2 (t), . . . , zsip (t))′ is covariate vector, β is a regression coefficient vector and θ is a parameter. The partial likelihood function is given as ∆Nsi (t) YY f (β ′ Zsi (t)) Pn , L= ′ i=1 f (β Zsi (t))Ysi (t) t s,i
where Nsi (t) is the occurrence number of type s event by time t for the individual i, ∆Nsi (t) = Nsi (t) − Nsi (t−), and Ysi is indicator of the individual i at risk, corresponding type s event. References 1. Cheng, J. Z., Zhou, S. Y. and Xu, H. Y. (1994). Application of Markov process in structural dynamic forecasting of forest resources with tree species structure in Nanping region of Fujian province as an example. Chinese Journal of Applied Ecology 5(3): 232–236. 2. Deng, Y. L. (1994). Stochastic Models and Their Applications, Advanced Education Press. 3. Fang, J. Q., Mao, J. H., Zhou, W. Q. et al. (1995). Two-stage models, nonhomogeneous Markov chains, for cancer latent period. Chinese Journal of Applied Probability and Statistics 11(2). 4. Fang, J. Q. and Wu, C. B. (1995). Two-stage model, counting process, and Bootstrap for cancer latent period. Chinese Journal of Applied Probability and Statistics 11(2). 5. Fudan University (1981). Stochastic Processes, Advanced Education Press. 6. Gao, G. M. (2000). Probability models and Bayesian analysis of capturerecapture data via Markov chain Monte Carlo simulation. The PhD dissertation in Sun Yat-sen University.
Stochastic Processes and Their Applications
1031
7. Huang, Z. N. (1995). Multiple Analysis in Medical, Hunan Science and Technology Press. 8. Mao, S. S., Wang, J. L. and Pu, X. L. (1998). Advanced Statistics, Advanced Education Press. 9. Wang, Q. and Jin, P. H. (2000). Applications in economic hygiene for Markov model. Chinese Journal of Health Statistics 17(2): 86–88. 10. Anderson, P. K., Borgan, Ø., Gill, R. D. et al. (1993). Statistical Models Based on Counting Processes, Springer-Verlag, New York. 11. Chiang, C. L. (1980). An Introduction to Stochastic Processes and Their Application, Robert E. Krieger Publishing Company, New York. (The Chinese version is translated by Fang, J. Q., Shanghai translation press, 1986.) 12. Chiang, C. L. (1983). The Life Table and Its Application. (The Chinese version is translated by Fang, J. Q., Shanghai translation press, 1984). 13. Faddy, M. J. and Fenlon, J. S. (1999). Stochastic modeling of the invasion process of nematodes in fly larvae. Applied Statistics 48, Part 1: 31–37. 14. Fang, J.-Q., Shi, Z. L., Wang, Y. et al. (1990). Parametric inference in a renewal process with time-dependentCovariates. Biometrics 46(3): 849–854. 15. Fang, J. Q., Ni, T. Y., Fan, Q. et al. (1996). Two-state stochastic models for memory in ion channels. Acta Pharmacologica Sinica 17(1): 13–18. 16. Green, P. J. (1995). Reversible jump Markov chain Monto Carlo computation and Bayesian model determination. Biometrika 82(4): 711–732. 17. Lucas, W. F. (1983). Modules in Applied Mathematics, Vol. 4: Life Science Models. Springer-Verlag, New York. 18. Parzen, E. (1962). Stochastic Processes, Holden-Day, San Francisco. (The Chinese version is translated by Deng, Y. L. and Yang, Z. M., 1987.) 19. Richard, J. B., Daniel, A. H. and Darren, J. (2000). Wilkinson, Detecting homogeneous segments in DNA sequences by hidden Markov models, Applied Statistics 49, Part 2: 269–285. 20. Ross, S. M. (1983). Stochastic Processes, John Wiley and Sons Inc., 1983. 21. Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. Pat. Anal. Mach. Intel. 6: 721–741. 22. Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N. et al. (1953). Equations of state calculations by fast computing machines. J. Chem. Phys. 21: 1087–1091. 23. Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57: 97–109.
About the Author Jiqian Fang born in Shanghai 1939, BS of Mathematics in 1961 from the Fudan University and PhD of Biostatistics in 1985 from the University of California at Berkeley. 1985 to 1990, Professor and Director, the Department of Biostatistics and Biomathematics, Beijing Medical
1032
J.-Q. Fang & C.-X. Li
University; Since 1991, Director and Chair Professor, Department of Medical Statistics, School of Public Health, Sun Yat-Sen University. His research projects covers widely various fields, including “Stochastic Models of Life Phenomena”, “Gating Dynamics of Ion Channels”, “Biostatistical Theory and Methods for Research on Cancer Prevention”, “Bootstrap Studies on Multi-state Models”, “Statistical Methods for Data on Quality of Life”, “Health and Air Pollution”, “Analysis of DNA Finger Printing”, and “Linkage Analyses between Complex Trait and Multiple Genes”, etc. Some of them have received awards from the Beijing Municipal Government or Ministry of Public Health of China for their significant advances in the biostatistics fields.
CHAPTER 27
TREE-BASED METHODS
HEPING ZHANG Department of Epidemiology and Public Health, Yale University School of Medicine, New Haven, CT 06520-8034 [email protected]
1. Introduction In this chapter, I describe the development and applications of tree-based methods. The thrust of these methods is the recursive partitioning technique that facilitates a process to divide an initially heterogeneous sample of observations into smaller subgroups within each of which the outcome of interest is relatively homogeneous. The book by Breiman et al.2 on classification and regression trees (CARTT M ) is the milestone of the tree-based methodology. It provides much historical background and describes the methods and applications systematically. The associated CART program has become a commercial software. Since 1984, there has been a great deal of methodological developments as well as applications of tree-based methods, particularly in the area of survival trees. As in CART, the idea of recursive partitioning is still the heart and soul of the more recently developed methods. Please refer to Zhang and Singer28 for a detailed introduction to those methods. The rest of this chapter is organized as follows. First, the basic ingredients in CART is introduced. Then, survival trees is described. Finally, tree-based methods for analyzing multiple correlated responses is discussed. 2. The Basics of CART One of the important and original applications of CART was to develop expert systems that can assist physicians in diagnosing patients potentially suffering heart attacks. Traditionally, the physicians made diagnoses in a 1033
1034
H. P. Zhang
1 2
3 4
5 6
Fig. 1.
7
A sample tree structure.
subjective, intuitive, idiosyncratic manner. A data-driven classification tree would enable the physician to interpret a patient’s conditions by taking advantage of the empirical information from a large number of patients with similar conditions.20 Now, the applications of the tree-based methods are far reaching.1,3,4,8,11,12,14,24,27 CART is arguably the most popular method among classification trees. All of tree-based approaches have in common the successive partitioning of a “feature space” of predictors into subsets. The partitioning is done on the basis of a learning sample, and it is sometimes validated by a test sample. Some of classification trees make use of a multi-level partition of a non-terminal node (a sub-group of the learning sample that is subject to a further division). However, only binary trees will be presented, i.e. a non-terminal node has exactly two daughter nodes. It is noteworthy that a tree that is constructed in a binary manner is not confined to be presented in the same manner as illustrated in Fig. 1 of Zhang and Bracken27 for the sake of an easier interpretation. In other words, a multi-level partition can be derived in principle by repeated binary partitions on the same variable. Suppose that we have collected p covariates x and a response y from n subjects. For the ith subject, the measurements are xi = (xi1 , . . . , xip )′
and yi , i = 1, . . . , n .
The objective is to model the probability distribution of P (y|x) or sometimes a function of this probability. The covariates x can be an array of mixed categorical (nominal or ordinal) and continuous variables, and they may have missing values for some subjects. In this section, we consider a single response y of either a categorical or continuous form. Later sections will deal with censored response or multiple responses. The characteristics of y mandate the choice of methodology.
Tree-Based Methods
1035
Let us begin with an arbitrary tree as depicted in Fig. 1. This tree has four layers of nodes. In general, the number of layers varies from case to case. At the top is the unique root node. Including the root node, there are three non-terminal, or internal nodes. They are represented by the circles and labeled as 1, 3, and 5. The tree has four terminal nodes, represented by boxes and labeled as 2, 4, 6, and 7. The root and the internal nodes are connected to two nodes in the next layer that are called left and right daughter nodes, whereas the terminal nodes do not have “offspring.” Moreover, the tree is not necessarily symmetric in that one of the two daughter nodes can be an internal node and the other a terminal one; for instance, nodes 2 and 3 are both the daughter nodes of node 1, and node 2 is terminal whereas node 3 is internal. The connection from a parent node to the two daughter nodes is determined by a splitting rule that I will elaborate in detail shortly. Although the details and the implementation are complex, the nutshell of tree construction is really a few key questions: (a) How are the nodes determined from the data? (b) How do we split a node? (c) When does a node become terminal? I divide the answers to these questions in two steps: tree growing and tree pruning. After a tree is constructed, we need to interpret the tree structure and make statistical inferences to reveal the relationships among the predictors and the response. This issue is important and may determine the final tree, but it belongs to the use and interpretation of trees and does not have a clear-cut answer.
2.1. Tree growing through node splitting Node is the most basic element of a tree. A node is simply a collection of observations. For example, the root node contains the learning sample, namely, all of the observations that are used during a tree construction. All nodes except the root node are subsets of the learning sample. When an internal node is divided into its daughter nodes, it means that a subset of the sample is further divided into sub-groups. Because the node division is exclusive, the terminal nodes are disjoint subgroups of the learning sample and the union of all terminal nodes is the root node. The tree growing procedure begins with the split of the root node into its two daughter nodes. Once this is done, the resulting two daughters can be split recursively in the same way. This is why the procedure is called recursive partitioning. Obviously, the fundamental step is to partition one parent node (e.g. the root node) to the two daughter nodes. How do we split a node? The division of the root node is carried out through a predictor.
1036
H. P. Zhang
The purpose of splitting is to generate two daughter nodes within which the distributions of the response are more homogeneous than that in the parent node. Every predictor in x competes against another to achieve a combined maximum homogeneities in the two daughters. If xj is an ordered covariate such as age, two subgroups result from the question of the form “Is xj > c?” Here the cutoff point c is in the range of the observed values of xj . The ith subject goes to the right or left node according to whether or not xij > c. The number of such distinct questions that can be asked upon an ordered covariate is one fewer than the number of the distinctly observed value of xj . On the other hand, if xj is nominal such as nationality, we can send a subject to the left or right node by asking questions such as “is the subject an Asian?” and “is the subject a Hispanic or an African?” If xj has k levels, we can ask 2k−1 − 1 meaningfully different questions, considering that the designation of left and right daughter nodes is arbitrary. For example, if xj has four levels, A, B, C, and D, we can make seven distinct cut as follows: {A}, {B}, {C}, {D}, {A, B}, {A, C}, and {A, D}. We do not list {B, C} and others, because its compliment {A, D} is listed and asking “xij ∈ {B, C}?” or “xij ∈ {A, D}?” has the same effect. Considering p covariates and the number of possible cutoff points from each of them, we see that there are usually many possibilities to split a parent node into two daughter nodes. Therefore, we need a criterion to decide which split is preferable over the rest. This leads to the concept of impurity. Let us use age as a predictor and cancer status as the response to explain how to evaluate the splits for a node (t) based on this age predictor. Suppose that we consider an age cutoff at c, e.g. 35. As a result of the question “Is xj (age) > c (35)?”, we have the following table:
Left Node (tL ) Right Node (tR )
xj ≤ c
xj > c
Normal n11
Cancer n12
n1·
n21
n22
n2·
n·1
n·2
What do we like to see? As stated earlier, we want a split such that the distributions of y in the daughter nodes are homogeneous. In other words, we would like to push as many observations as possible either along the diagonal n11 , n22 or along the off-diagonal n12 , n21 . A perfect example is n11 = n22 = 0. In this case, the two nodes are pure (or completely homogeneous) because each of them contains either the cancer patients only or the normal subjects only. In contrast, their parent node includes
Tree-Based Methods
1037
a mixture of n21 normal subjects and n12 cancer patients. Thus, the two daughter nodes are “more desirable” than their parent node. However, in most applications, whether the daughter nodes are more desirable than their parent node is not so clear cut and it generally requires a mathematical criterion to make the comparison. One commonly used measure of node impurity for a categorical response is defined through the entropy function as follows: n12 n11 n12 n11 − . (1) log log i(tL ) = − n1· n1· n1· n1· Likewise, i(tR ) and i(t) can be defined. Then, we select a split that minimizes the weighted node impurity: n·1 n·2 i(tL ) + i(tR ) , (2) n n which can be regarded as the node splitting criterion. For the later discussions, it is useful to note that −i(tL ) is simply the maximum log likelihood of y by assuming that it follows a binomial distribution in node tL . Minimizing criterion (2) amounts to maximizing the likelihood or homogeneity in this case. When y is a continuous response, a node is pure when the responses within the node equal to the same constant. However, when the withinnode responses are not constant, commonly used node impurity measures are the within-node variance and the absolute distance toward the median. So far, I have described the splitting procedure based on completely observed ages from all subjects. In the presence of missing ages for some subjects, two strategies are available to deal with the splitting. One makes use of surrogate splits. The idea is this. If we cannot use age to decide how to send a subject to the left or right daughter node due to missing information, we try to find a split based on another predictor that hopefully resembles the age split sufficiently. The other strategy is much easier. We simply create another level for missing values. Then, all subjects with missing information will be sent to the same daughter node. 2.2. Tree pruning by determining terminal nodes Applying the node splitting procedure described above to the root node, then to the resulting daughter nodes, and so on, we usually end up with a tree of excessive nodes. We do not need to worry about when to stop the recursive partitioning process because it stops by itself when further splitting is not possible or meaningless. In an extreme case, for example,
1038
H. P. Zhang
we cannot split a node with one observation. And, the number of study subjects is always finite. What we need to be concerned with is how to deal with a tree of excessive size. In usual practice, such a tree is too large to be useful and trustworthy. This is why we need the tree pruning step to trim off some over-fitting nodes. Let us pretend that the tree in Fig. 1 is large and subject to pruning. We need to address the question: “can we prune away some of the nodes?” If we have a general answer for this question, then we can prune any tree. Tree pruning starts at the bottom of a tree, and the pruned tree is a subtree of the original one. Thus, pruning the tree in Fig. 1 is equivalent to selecting one of its subtrees. The latter requires a measure of tree quality, reflecting our objective of extracting homogeneous subgroups of the study sample. Whether we construct trees for classification or prediction purpose, we make our decision based on the distributions of the response in the terminal nodes. All internal nodes play an intermediate role ultimately to lead to relatively homogeneous terminal nodes. Therefore, the quality of a tree depends on the quality of its terminal nodes. Let Q(T ) denote a certain quality measure of tree T, and we have X p(t)q(t) , (3) Q(T ) = t∈T˜
where T˜ is the set of terminal nodes of tree T, q(t) summarizes the quality of node t, and p(t) is the proportion of subjects falling into node t. For binary outcomes, q(t) is usually replaced with the within-node misclassification cost r(t), and Q(T ) with a tree misclassification cost R(T ). In other words, a tree is assessed by X p(t)r(t) . R(T ) = t∈T˜
There are two types of misclassifications, each of which is associated with a certain misclassification cost. The misclassification cost should reflect the severity of the error, for instance, when a cancer patient is classified to be cancer free or vice versa. Let C(i|j) be the misclassification cost that a class j patient is classified as a class i patient. Here, there are two classes of subjects: 0 for normal and 1 for cancer patients. For medical reasons, it is natural to choose C(0|1) > C(1|0) because the consequence is potentially more severe when a patient with disease is wrongly diagnosed than when a normal person is classified to have the disease. Without loss of generality, we can set C(1|0) = 1 as the cost unit and let C(0|1) = c, which means that the
Tree-Based Methods
1039
a false positive diagnosis costs as many as c false negative ones. In addition, there is no cost when the classification is correct, namely, C(i|i) = 0. Unless a node is pure, we makes mistakes one way or another. The within-node misclassification cost is the minimum of the two possible misclassification costs. Although defining the measure in Eq. (3) is easy, using it is not as straightforward unless there is an independent set of sample–the so-called test sample. When a test sample is available, we can estimate p(t) and r(t) from it, leading to an estimate of R(T ). Then, we can select a subtree ˆ ). However, in many applications, that has the lowest estimated cost R(T such a second sample is not feasible or is too costly. Sample re-use methods are used as an alternative. For these methods, the size of a tree is another important aspect, indicating the tree complexity. Note that the total number of nodes in a tree, T, is 2|T˜| − 1, where |T˜| is the number of the terminal nodes of T. Hence, the complexity of T can be defined directly as |T˜|. Usually, a unit penalty, called a complexity parameter, is assigned to each terminal node, and the sum of these penalties becomes the penalty for the tree complexity. Therefore, the final quality measure of a tree is the following cost-complexity: Rα (T ) = R(T ) + α|T˜| ,
(4)
where α(> 0) is the complexity parameter. For a given complexity parameter and an initial tree such as the one in Fig. 1, there is a unique smallest subtree of the initial tree that minimizes the cost-complexity measure Eq. (4). Importantly, if α1 > α2 the optimally pruned subtree corresponding to α1 turns out to be a subtree of the one corresponding to α2 . So, as we increase the complexity parameter, we have a sequence of nested optimally pruned subtrees. The fact that the successive optimally pruned subtrees are nested can entail important savings in computation.2 This nested sequence of subtrees has a finite length, because the number of subtrees is finite, and the last one is the root node. On the other hand, the complexity parameter takes a continuous value, which implies that an interval of the complexity parameter must correspond to the same subtree. Let T0 be the initial tree. To prune off some nodes from T0 , we need to find the smallest α, denote by α1 , to allow some of the terminal nodes to be removed such that Rα1 (T1 ) for the pruned T1 is better than Rα1 (T0 ) of the initial tree. It turns out α1 = min t∈T˜0
r(t)p(t) − R(T (t)) , |T˜(t)| − 1
1040
H. P. Zhang
where T (t) represents a tree rooted at node t. Likewise, we proceed to find the next smallest α, denoted by α2 , to allow some of the terminal nodes to be removed from T1 such that Rα2 (T2 ) for the pruned T2 is better than Rα2 (T1 ). As we continue this process until we reach the tree with the single root node, we end up with a sequence of increasing complexity parameters m {αi }m 0 (here α0 = 0) and a sequence of nested and shrinking subtrees {T i }0 (here Tm is the single root node tree). The next step is to select a subtree from the nested sequence, and a cross-validation procedure is usually recommended. For example, we can randomly divide the study sample into several, say 5, sub-samples of about the same size. We use 4 of the 5 sub-samples to grow a large tree and prune it using the sequence {αi }m 0 that leads to a new sequence of subtrees. Then, we compute R(T ) for each of those subtrees based on the left-over subsample, giving us one set of estimates for {R(Ti )}m 0 . We can do this 5 times and the average will be the final estimates for {R(Ti )}m 0 . With this sequence of estimates, we can select the subtree with the smallest or near ˆ ). Please refer to Breiman et al.2 and Zhang and Singer28 the smallest R(T for details. Once the subtree is selected, the pruning step is accomplished.
3. Survival Trees Although CART is a well-known brand name, the most frequently used tree-based method in biomedical research is survival trees, partly because survival analysis per se is a major topic in the health sciences. In this section, how to adapt the ideas expressed above for censored survival data will be explained. We face the same basic issues. One is to define a splitting criterion to divide a node into two, and the other is to choose a “right-sized” tree for subsequent use. Many criteria have been proposed in the literature, but they differ primarily in the way of declaring which daughter nodes are desirable. A few major ideas have emerged. First, as in CART, we can split a node to achieve better impurities in its daughter nodes. The concept of impurity is very intuitive in CART; however, for survival trees, we have to decide what we mean by node impurity. The second idea is to maximize the distributional difference between the two daughter nodes. In classical ANOVA, reducing within-group variances increases the between-group variances. But, for survival trees, it is not clear that reducing node impurity increases the distributional difference between the two daughter nodes. Finally, as hinted earlier, there is a connection between node impurity and
Tree-Based Methods
1041
maximum likelihood in CART. Although this connection may not hold in survival trees, we can nonetheless base our splitting decision on likelihood. In the following, the focus will be on presenting the approaches that have been implemented in the author’s STREE program. Also see Zhang and Singer28 and Zhang et al.25 3.1. Use of the difference I begin the introduction of splitting rules with the use of Wasserstein metrics to measure the between-nodes distributional difference as proposed by Gordon and Olshen.13 The within-node survival distribution is estimated by the Kaplan–Meier curve.17 A desirable split can be characterized as one that results in two very different survival functions in the daughter nodes. Gordon and Olshen13 used the so-called Lp Wasserstein metrics, dp (· , ·), as the measure of discrepancy between the two survival functions. Specifically, for p = 1, the Wasserstein distance, d1 (SL , SR ), between two Kaplan–Meier curves, SL and SR , is the area sandwiched by the two Kaplan–Meier curves. Suppose that SL and SR are respectively the Kaplan–Meier curves for the left and right daughter nodes. We choose the split that maximizes the distance, d1 (SL , SR ). As before, we employ the recursive partitioning process to produce an initially large tree that will be pruned later. A standard approach for comparing the survival times of two groups is the log-rank test. Thus, it is no surprise in the literature that the logrank test is also used to separate the left and right daughter nodes. Indeed, Ciampi et al.6 and Segal18 adopted the log-rank test statistic as the splitting criterion. 3.2. Use of likelihood functions One very flexible way of forming a splitting criterion is to use likelihood functions. Not only is this true for survival trees, but it is also the case for analyzing more complex responses. This approach is useful and convenient because we can assume a simple within-node distribution when we assess a split or node. In fact, a few likelihood based splitting and pruning criteria have been proposed. Davis and Anderson9 assume that the survival function within any given node is an exponential function with a constant hazard. LeBlanc and Crowley15 and Ciampi et al.7 assume the proportional hazard models in two daughter nodes, but the hazard functions are unknown, but they respectively used the full and partial likelihoods for maximization.
1042
H. P. Zhang
3.3. Use of impurity Note that we observe a binary death indicator and the (failure or censored) time. If we take these two outcomes separately for the moment, we can compute the within-node impurity, iδ , of the death indicator and the within-node variation, iy , of the time toward the median. By considering both of them together, we have the within-node impurity for both the death indicator and the time using a weighted combination, wδ iδ + wy iy . Zhang21 examined a similar approach and recommended some choices of weights wδ and wy . Even though this approach does not fully incorporate the relationship between the censoring and observed time variables, existing evidence suggests that this simple extension outperforms the more sophisticated ones in discovering the underlying structures of data. 3.4. Pruning survival trees I explained various ways of growing a survival tree. There is also a variety of options to prune a survival tree. We need the same recipes as before: a quality measure and a cost-complexity of a tree. They enable us to use the cross validation procedure again to finish the pruning step. Gordon and Olshen13 suggested using the deviation of survival times toward their median as a measure of node quality q(t) for a node t and model (4) as the cost-complexity where R(T ) is taken as Q(T ). In addition, a variety of tree cost-complexities has also been proposed by using the likelihood ratio statistic that compares the survival times in a parent node with those in its daughter nodes. A related method due to Therneau et al.19 makes use of what are termed martingale residuals from the Cox model as the input to a cost-complexity scheme using least squares as the cost. LeBlanc and Crowley16 introduced the notion of “goodness-of-split” complexity as a substitute for cost-complexity in pruning the tree. Now, let q(t) be the value of the log-rank test at node t. Then the split-complexity measure is X q(t) − α(|T˜| − 1) , Q(T ) = t∈T˜
where the summation above is over the set of internal (non-terminal) nodes rather than the terminal nodes as in Eq. (4). The negative sign in front of the complexity part is a reflection of the fact that Q is to be maximized here, whereas the cost-complexity R is minimized there. In CART, cross
Tree-Based Methods
1043
validation is used to determine an optimal complexity parameter value, while here LeBlanc and Crowley16 recommend choosing α between 2 and 4. I refer to their work for the justification of this choice. Although the tree pruning procedures as described above are statistically elegant, they are rather sophisticated for researchers outside of the statistical society to comprehend. Even within the statistical community, we do not necessarily agree upon the correct use of these procedures. Thus, a practical and intuitive approach is appealing. Segal18 recommended the following alternative for pruning survival trees. For each internal node (including the root node) of an initial tree, we assign it a value that equals the maximum of the log-rank statistics over all splits starting from the internal node of interest. Then, plot the values for all internal nodes in an increasing order and decide a threshold from the graph. If an internal node corresponds to a smaller value than the threshold, we prune all of its offspring. Although this usually works out fine, the choice of threshold could be arbitrary. In the author’s RTEE program, pruning a tree (not necessarily a survival tree) at a different significance level is chosen. Analyzing genetic data, Zhang and Bonney23 demonstrated how to decide a final tree based on both the scientific implication and computer output. It is clear that there are plenty of choices to construct survival trees, which is good and bad. More choices give the data analysts the opportunity to select the ones that make a better scientific sense. Sometimes, however, too many choices can also lead the data analysts to wonder what to do. The state of the art is still to construct survival trees using a number of approaches and discuss them with experts to come up one or more trees that are as simple, informative, and interpretable as possible.
4. Classification Trees for Multiple Binary Responses In this section, I introduce a tree-based approach for analyzing multiple correlated binary outcomes. Such outcomes are sometimes referred to as clustered outcomes. Correlated discrete responses can be generated from a single endpoint by repeatedly measuring it on individuals in a temporal or spatial domain. They are called longitudinal discrete responses. The correlated responses may also consist of distinct endpoints, which are actually the focus of this section. Suppose that Yi = (Yi1 , . . . , Yiqi )′ is a vector of binary responses for subject i, i = 1, . . . , n. The index length qi may vary from individual to individual. This is particularly useful when some responses are missing for
1044
H. P. Zhang
some individuals. Multiple correlated outcomes arise from many applications. For example, (Y1 , Y2 ) may indicate the blindness of the left and right eyes or cancer status for a sib pair. We have seen in both classification and survival trees that some of the simple parametric distributions form the foundations for tree constructions. For multiple correlated responses, the following distribution from an exponential family is proven useful: " q i X X θij yij + θij1 j2 yij1 yij2 P {Yi = yi } = exp j1 0). E step:
Liu and Rubin18 proposed a version of the EMCE algorithm with a CMQ-step that is the same as the above CM-step 1 and a CML-step that maximizes the actual constrained likelihood function of ν L(µ, σ 2 , ν) = n ln[Γ((ν + 1)/2)] − n ln[Γ(ν/2)] − −
n ln(νπ) 2
n ν+1 ln(σ 2 ) − ln[1 + (yi − µ)2 /(νσ 2 )] , 2 2
instead of Q(ν), with µ and σ 2 fixed at their current estimates. As with the CM-step 2 of ECM (or EM) for updating ν, the CML step needs a one-dimensional search algorithm. They showed that this version of ECME converges dramatically faster than both EM and ECM. 3.3. Accelerating EM via Parameter Expansion: The PX-EM algorithm Notice that the scale parameter 2/ν of the γ distribution (1) for the missing τi is constrained to the inverse of the shape parameter ν/2. The scale parameter could be considered as an unknown parameter to be estimated from the complete data X = {(τi , yi ) : i = 1, . . . , n}. For the sake of clarity, denote by θx = (µx , σx2 , νx , λx ) the corresponding parameters for X, where µx , σx2 , and νx correspond to µ, σ 2 , and ν, respectively, and λx (> 0) is the extra scale parameter for τi .
Maximum Likelihood Estimation from Incomplete Data
1059
More precisely, the model for the complete data is written as follows νx νx τi |(νx , λx ) ∼ G , and yi |(τ, νx , µx , σx2 ) ∼ N(µx , σx2 /τi ) . , 2 2λx Integrating out the τi from the joint distribution of (τi , yi ) gives the marginal distribution of yi −(ν+1)/2 Γ((νx + 1)/2) (y − µx )2 , y ∈ (−∞, ∞) . 1+ ν(σx2 /λx ) Γ(νx /2)(νx π)1/2 (σx2 /λx )1/2 (6) The parameters σx2 and λx in the marginal distribution (6) of y is unidentifiable from the observed data Y = {yi : i = 1, . . . , n} when τ1 , . . . , and τn are missing. It is thus necessary to constrain (σx2 , λx ). For example, fix λx at λx = 1 as in Sec. 2.2. However, the EM algorithm converges faster if it is applied to the expanded model with the extra parameter λx . This extra parameter can be used to capture the information of “imputed” Sτ (t) (i.e. Sτ ) which goes to n as t → ∞, and then the information is used to adjust the current estimate of σ 2 . This leads to the parameter-expanded EM algorithm.21 PX-EM expands the complete-data model f (X|θ) (θ ∈ Θ) used in EM to a larger model fx (X|θx , λx ) ((θx , λx ) ∈ Θ×Λ) with the expansion satisfying two conditions: (i) the observed-data model is preserved in the sense that there is a many-to-one reduction function θ = R(θx , λx ) ,
θx ∈ Θ; λx ∈ Λ; θ ∈ Θ
(7)
from Θ × Λ onto Θ such that Y |θx ∼ g(Y |θ = R(θx , λx )); and (ii) there is a (fixed) null value λx , λ, in the sense that, for all θ, fx (X|θx = θ, λx = λ) = f (X|θ). PX-EM extends EM by replacing the complete-data model f (X|θ) with the expanded complete-data model fx (X|θx , λx ). More specifically, (0) (0) starting with (θx = θ(0) , λx = λ), the tth iteration of PX-EM consists of a parameter-expanded E-step and a parameter-expanded M-step. 3.3.1. The PX-EM algorithm21 (t−1)
(t−1)
PX-E step: Compute Qx (θx , λx |θx , λx ) = E{log fx (X|θx , λx )|Y, (t−1) (t−1) }. , λx θx (t−1) (t−1) (t−) (t−) ), , λx PX-M step: Find (θx , λx ) that maximizes Qx (θx , λx |θx (t) then apply the reduction function R(θx , λx ) to obtain θ = (t−) (t−) (t) (t) R(θx , λx ) and set θx = θ(t) and λx = λ.
1060
C. Liu
For the t-distribution with known number of degrees of freedom νx = ν, the reduction function is given by the many-to-one mapping µ = µx
and σ 2 = σx2 /λx
with the null value of λx : λ = 1. The sufficient statistics for the expanded parameter (µx , σx2 , λx ) are the same as those for (µ, σ 2 ) given in Condition (2). The complete-data maximum likelihood estimates of µx and σx2 are the same as those of µ and σ 2 given in Eq. (3), respectively, that is, " # Sτ2y Sτ y 1 2 2 S 2− . and σ ˆx = σ ˆ = µ ˆx = µ ˆ= Sτ n τy Sτ The complete-data maximum likelihood estimate of λx is ˆ x = Sτ . λ n The null value of λx is λ = 1. Thus, PX-EM t-distribution with the sample Y = {yi : i = 1, . . . , n}, can be written as follows. 3.3.2. The PX-EM algorithm (for the t-distribution with a known number of degrees of freedom) PX-E step: This is the same as the E-step of EM PX-M step: This is the same as the M-step of EM except that (σ 2 )(t) is replaced by " # P (t) (t) n (t) 2 (Sτ y )2 1 (t) 2 (t) i=1 τi (yi − µ ) = (σ ) = (t) Sτ y2 − . P (t) (t) n Sτ Sτ i=1 τi
This algorithm for the t-distribution is first proposed by Kent et al.4 Meng and van Dyk29 showed that this algorithm is an EM with a different complete-data model. Liu et al.21 provided this PX-EM version. The ECME algorithm can also be applied to the expanded model with unknown degrees of freedom.12 4. General Linear Mixed Models 4.1. The model Mixed-effects models are among the most important applications of EM. van Dyk42 and Pinherio et al.32 provided recent such examples. The most
Maximum Likelihood Estimation from Incomplete Data
1061
commonly used linear mixed-effects model was proposed by Laird and Ware,5 which, without loss of generality, is simplified as follows y i = X i β + Z i b i + ei ,
i = 1, . . . , m ,
(8)
where i is the subject index, y i is an ni -dimensional vector of observed responses, X i is a known ni × p design matrix corresponding to the pdimensional fixed effects vector β = (β1 , . . . , βp )′ , Z i is a known ni × q design matrix corresponding to the q-dimensional random effects vector bi = (ri,1 , . . . , ri,q )′ , and ei = (ei,1 , . . . , ei,ni )′ is an ni -dimensional vector of within-subject errors. The random effects b1 , . . . , and bm are independent of each other. For each i, bi follows the q-dimensional normal distribution bi ∼ Nq (0, Ψ) ,
i = 1, . . . , m ,
where 0, a vector of q zeros, is the mean vector and Ψ is the (q × q) variance-covariance matrix. The errors e1 , . . . , and em are independent of each other and independent of b1 , . . . , and bm . For each i, ei follows the ni -dimensional normal distribution ei ∼ Nni (0, σ 2 I ni ) with a common unknown variance parameter σ 2 , where I ni denotes the (ni × ni ) identity matrix. For more discussion of the structure of the variance-covariance matrices for the random effects and errors, see Pinherio et al.32 and the references therein. Integrating out the unobservable (missing) random effects b1 , . . . , and bm leads to the observed data model y i |(X i , Z i , β, Ψ, σ 2 ) ∼ Nni (X i β, Z i ΨZ ′i + σ 2 I ni ) ,
i = 1, . . . , m . (9)
It is difficult to find the maximum likelihood estimates of the parameter θ = (β, Ψ, σ 2 ) from the observed data Y = {y i : i = 1, . . . , m}, especially when q, the dimension of Ψ, is large. 4.2. The complete-data maximum likelihood estimates Consider the complete data that consist of both observed data Y = {y i : i = 1, . . . , m} and missing data {bi : i = 1, . . . , m}. The complete-data
1062
C. Liu
model can also be written as " # bi (X i , Z i , β, Ψ, σ 2 ) yi ∼
Nni +q
"
0
X iβ
# " ,
Ψ
ΨZ ′i
ZiΨ
Z i ΨZ ′i + σ 2 I ni
#!
.
(10)
The complete-data model belongs to the exponential family with the sufficient statistics Sbb′ =
m X
bi b′i ,
i=1
S(y−zb)′ (y−zb) =
m X
(y i − Z i bi )′ (y i − Z i bi ) ,
m X
X ′i (y i − Z i bi )
i=1
Sx′ (y−zb) =
i=1
and
(11)
Pm for the parameter θ = (β, Ψ, σ 2 ). Let Sx′ x = i=1 X ′i X i and let Sn = Pm i=1 ni . The complete-data maximum likelihood estimates of β, Ψ, and 2 σ are given as follows ˆ = S −1 β x′ x Sx′ (y−zb) ,
ˆ = 1 Sbb′ , Ψ m
and σ ˆ2 = =
1 (S(y−zb)′ (y−zb) − Sx′ ′ (y−zb) Sx−1 ′ x Sx′ (y−zb) ) Sn m 1 X ˆ − Z i bi ) . ˆ − Z i bi )′ (y − X i β (y − X i β i Sn i=1 i
4.3. EM-type algorithms 4.3.1. The EM algorithm Given the observed data and the current estimate θ (t−1) = (β (t−1) , Ψ(t−1) , (σ 2 )(t−1) ), under the complete-data model (10), the bi are independent of each other and ˆi , Vˆ i ) , bi |(y i , θ(t−1) ) ∼ Nq (b
1063
Maximum Likelihood Estimation from Incomplete Data
where the mean vector and variance-covariance matrix are ˆ bi ≡ E{bi |Y, θ(t−1) } = Ψ(t−1) Z ′i (Z i Ψ(t−1) Z ′i + (σ 2 )(t−1) I ni )−1 (y i − X i β (t−1) ) = [(σ 2 )(t−1) (Ψ(t−1) )−1 + Z ′i Z i ]−1 Z ′i (y i − X i β (t−1) )
(12)
and Vˆ i ≡ cov{bi |Y, θ(t−1) } = Ψ(t−1) − Ψ(t−1) Z ′i (Z i Ψ(t−1) Z ′i + (σ 2 )(t−1) I ni )−1 Z i Ψ(t−1) −1 1 (13) = (Ψ(t−1) )−1 + 2 (t−1) Z ′i Z i (σ ) respectively. Thus, the (standard) EM algorithm18 is given as follows. ˆi and Vˆ i for i = 1, . . . , m, and then E step: Compute b Sˆbb′ =
m X
ˆ′ + ˆi b b i
m X
Vˆ i ,
Sˆx′ (y−zb) =
i=1
i=1
i=1
m X
ˆi ) , X ′i (y i − Z i b
and Sˆ(y−zb)′ (y−zb) =
m X i=1
(y i − Z iˆbi )′ (y i − Z iˆbi ) +
m X
tr(Z i Vˆ i Z ′i ) ,
i=1
where the trace function tr(Z i Vˆ i Z i ) denotes the sum of the diagonal elements of the (q × q) matrix Z i Vˆ i Z i . M step: Compute ˆ β (t) = Sx−1 ′ x Sx′ (y−zb) ,
Ψ(t) =
1 ˆ Sbb′ , m
and 1 (σ 2 )(t) = Pm
i=1
Sn
ˆ [Sˆ(y−zb)′ (y−zb) − Sˆx′ ′ (y−zb) Sx−1 ′ x Sx′ (y−zb) ]
" m 1 X ˆi ) ˆi )′ (y − X i β (t) − Z i b (y − X i β(t) − Z i b = i Sn i=1 i #
+ tr(Z i Vˆ i Z ′i ) .
1064
C. Liu
4.3.2. The ECME algorithm It is well known that EM for general linear mixed-effects models can converge very slowly. Many methods for accelerating EM have been proposed in the literature.6 A modified version of the above EM was proposed as an EM implementation by Laird and Ware.5 This algorithm can be represented as the ECME algorithm with a CML step that updates β and a CMQ step that updates (Ψ, σ 2 ). 4.3.2.1. The ECME algorithm (version 1) E step: This is the same as the E-step of EM CMQ step: Update the estimates of Ψ and σ 2 : Ψ(t) = Sˆbb′ /m and " m 1 X 2 (t) (σ ) = (y − X i β (t−1) − Z i bi )′ Sn i=1 i × (y i − X i β
(t−1)
− Z i bi ) +
m X
tr(Z i V
#
′ iZ i)
i=1
.
CML step: Updates the estimate of β with β and Ψ fixed at their current estimates: #−1 "m X X ′i (Z i Ψ(t) Z ′i + (σ 2 )(t) I ni )−1 X i β(t) = i=1
×
"m X i=1
X ′i
(t)
ZiΨ
Z ′i
2 (t)
+ (σ ) I ni
−1
#
yi .
Liu and Rubin18 considered another version of ECME with three CM steps: A CMQ step for updating the estimate of Ψ, a CML step for updating the estimate of β, and a CML step for updating the estimate of σ 2 . The CML step for updating σ 2 does not have a closed-form solution. A modified version (see Schafer38 and Lindstrom and Bates9 ) has a closed-form solution for σ 2 . 4.3.2.2. The ECME algorithm (version 2) E step: This is the same as the E-step of EM CMQ step: Update the estimate of Ψ : Ψ(t−) = Sˆbb′ /m.
1065
Maximum Likelihood Estimation from Incomplete Data
CML step: Update the estimates of σ 2 and β with Φ ≡ Ψ/σ 2 fixed at its current estimate Φ(t) = Ψ(t−) /(σ 2 )(t−1) : β
(t)
=
"m X
X ′i (Z i Φ(t) Z ′i
−1
+ I ni )
Xi
i=1
×
"
m X
X ′i (Z i Φ(t) Z ′i
−1
+ I ni )
#−1 #
yi .
i=1
and (σ 2 )(t) =
n 1 X (y − X i β(t) )′ (Z i Φ(t) Z ′i + I ni )−1 Sn i=1 i
× (yi − X i β (t) ) ; and then adjust the current estimate of Ψ: Ψ(t) = Φ(t) (σ 2 )(t) =
(σ 2 )(t) Ψ(t−) . (σ 2 )(t−1)
As was noted by van Dyk,42 this ECME version can also be obtained using the more convenient parameterization that replaces Ψ by σ 2 Φ. 4.3.3. The PX-EM algorithm Meng and van Dyk30 considered efficient data augmentation for deriving fast implementations of EM for general linear mixed model. The basic idea is to augment less missing information, as was described in Meng and van Dyk.29 As shown by Liu et al.,21 PX-EM is easier to implement and converges faster than the efficient implementation of EM by Meng and van Dyk.30 The idea of PX-EM is to expand the complete-data model (10) to capture the information on (i) the covariance matrix between missing data bi and the observed responses y i , which is implicitly constrained by the variance-covariance matrix Ψ of bi in the original model; and (ii) the mean vector of bi , which is fixed at 0 in the original model. First, we describe the PX-EM version of Liu et al.21 (see also Liu13 ), who considered the case (i) and implemented the model with ni = 1 for all i = 1, . . . , m, is presented. Second, we discuss a new version that takes both (i) and (ii) into account.
1066
C. Liu
The expanded model that activates the covariance matrix between bi and y i can be written as " # bi (X i , Z i , β∗ , Ψ∗ , σ∗2 , C ∗ ) yi ∼
Nni +q
"
0
X iβ∗
# " ,
Ψ∗
Z i C ∗ Ψ∗
Ψ∗ C ′∗ Z ′i
Z i C ∗ Ψ∗ C ′∗ Z ′i + σ∗2 I ni
#!
,
(14)
where C ∗ is a (q × q) matrix. The observed-data model is thus given by the marginal distribution of y i in model (14). The reduction function is β = β∗ ,
σ 2 = σ∗2 ,
and Ψ = C ∗ Ψ∗ C ′∗
with the null values of C ∗ : C = I q , the q-dimensional identity matrix. The complete-data maximum likelihood estimates of θ∗ = (β ∗ , Ψ∗ , σ∗2 , C ∗ ) is obtained from the following linear model ~ + ei , y i = X i β ∗ + (b′i ⊗ Z i )C
~ denotes the vector obtained where ⊗ denotes the Kronecker operator, C ′ ~ by stacking the columns of C (i.e. C = (C 1 , . . . , C ′q )′ , which C j is the jth column of C), and ei ∼ Nni (0, σ∗2 I ni ) for i = 1, . . . , m. This leads to the following PX-EM algorithm. 4.3.3.1. The PX-EM algorithm (version 1) PX-E step: This is the same as the E-step of EM PX-M step: Compute Ψ(t) , which is the same as the M-step of EM, " !#−1 (t) m ˆ′ ⊗ Z i ) X β∗ X ′i X i X ′i (b i ~(t) = ˆ′ + Vˆ i ) ⊗ (Z ′ Z i ) ˆi ⊗ Z ′ )X i (b ˆi b (b C∗ i=1 i i i "m ! # X X ′i × yi , ˆi ⊗ Z ′ b i=1
and
(σ∗2 )(t)
i
# " n n X (t) ′ (t) 1 X ′ ′ ~ ˆ ~ = , (C ∗ ) (V i ⊗ (Z i Z i ))C ∗ eˆi eˆi + Sn i=1 i=1
(t) (t) ˆ where eˆi = y i − X i β∗ − Z i C ∗ b i , and then apply the reduction function to obtain (t)
β(t) = β ∗ ,
(σ 2 )(t) = (σ∗2 )(t) ,
(t)
(t)
(t)
and Ψ(t) = C ∗ Ψ∗ (C ∗ )′ .
1067
Maximum Likelihood Estimation from Incomplete Data
In order to activate the mean vector of the random effects, suppose that there is a known (p×q) matrix K such that Z i = X i K for all i = 1, . . . , m. For example, the design matrix Z i consists of the columns of X i , as is often the case in practice. In this situation, PX-EM can be used to accelerate EM further by activating the mean vector of the random effects. The completedata model can be written as bi |(β ∗ , Ψ∗ , σ∗2 , C ∗ , µ∗ ) ∼ Nq (µ∗ , Ψ∗ ) y i |(bi , β ∗ , Ψ∗ , σ∗2 , C ∗ , µ∗ ) ∼ Nni (X i β∗ + Z i C ∗ bi , σ∗2 I ni ) , The corresponding complete-data model is y i |(β ∗ , Ψ∗ , σ∗2 , C ∗ , µ∗ ) ∼
Nni (X i (β ∗ + KC ∗ µ∗ ), Z i C ∗ Ψ∗ C ′∗ Z ′i + σ∗2 I ni ) .
The reduction function is then given by β = β∗ + KC ∗ µ∗ ,
σ 2 = σ∗2 ,
and Ψ = C ∗ Ψ∗ C ′∗ ,
with the null values of the extra parameters C = I q and µ = 0. Thus, we have the following new version of PX-EM. 4.3.3.2. The PX-EM algorithm (version 2) PX-E step: This is the same as the E step of EM. PX-M step: Compute m
(t−)
=
1 Xˆ bi , m i=1
(t−)
=
1 X ˆ (t−) ˆ (t−) ′ [(bi − µ∗ )(b ) + Vˆ i ] , i − µ∗ m i=1
µ∗
m
Ψ∗ (t−)
(t−)
and (β ∗ , C~ ∗ , (σ∗2 )(t−) ), that is the same as that in M step of PX-EM version 1; and then apply the reduction function to obtain (t)
(t−)
β(t) = β∗ = β ∗
(t−)
+ KC ∗
(σ 2 )(t) = (σ∗2 )(t) = (σ∗2 )(t−) , (t−)
Ψ(t) = C ∗ (t−)
µ∗
= 0,
(t−)
Ψ∗
(t−) ′
(C ∗
(t−)
and C ∗
(t−)
µ∗
) ,
= Iq .
,
1068
C. Liu
Another PX-EM version can be obtained by activating only the mean vector of the random effects. Parameter expansion can also be used to accelerate the ECM and ECME algorithms. van Dyk42 provides such examples with the expanded model that includes the extra parameter C ∗ . 5. Discussion There are many other important issues about EM that this brief review has not touched on. These include how to compute asymptotic variancecovariance matrix,14,25,27 the relationship between EM and Markov chain Monte Carlo methods, e.g. EM and the Data-Augmentation algorithm;40 ECM and the Gibbs sampler;3 ECME and collapsed Gibbs sampler;18 and PX-EM and the Parameter-Expanded-Data-Augmentation algorithm,24,31 and Monte Carlo EM algorithms.43,44 Acknowledgement The author thanks Dr. Diane Lambert for her helpful comments. References 1. Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society B39: 1–38. 2. Fessler, J. A. and Hero, A. O. (1994). Space-alternating generalized expectation-maximization algorithm. IEEE Transactions on Signal Processing 42: 2664–2677. 3. Gelfand, A. E. and Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association 85: 141–151. 4. Kent, J. T., Tyler, D. E. and Vardi, Y. (1994). A curious likelihood identity for the multivariate t distribution. Communication Statistical Simulation Computing 23: 441–453. 5. Laird, N. M. and Ware, J. H. (1982). Random effects models for longitudinal data. Biometrics 38: 963–974. 6. Laird, N., Lange, N. and Stram, D. (1987). Maximizing likelihood computations with repeated measures: Application of the EM algorithm. Journal of the American Statistical Association 82: 97–105. 7. Lange, K. and Carson, R. (1984). EM reconstruction for emission and transmission tomography. Journal of Computing Assistant Tomograph 8: 306–312. 8. Lange, K. L., Little, R. J. A. and Taylor, J. M. G. (1989), Robust statistical modeling using the t distribution, Journal of the American Statistical Association 84: 881–896.
Maximum Likelihood Estimation from Incomplete Data
1069
9. Lindstrom, M. J. and Bates, D. M. (1988). Newton–Raphson and EM algorithms for linear mixed-effects models for repeated-measures data. Journal of the American Statistical Association 83: 1014–1022 10. Little, R. J. A. and Rubin, D. B. (1987). Statistical Analysis With Missing Data, Wiley, New York. 11. Liu, C. (1996). Bayesian robust multivariate linear regression with incomplete data, Journal of the American Statistical Association 91: 1219–1227. 12. Liu, C. (1997a). ML estimation of the multivariate t distribution and the EM algorithm, Journal of Multivariate Analysis 63: 296–312. 13. Liu, C. (1997b). Comment on “The EM algorithm — An old folk song sung to fast new tune” by Meng and van Dyk, Journal of the Royal Statistical Society B59: 557–558. 14. Liu, C. (1998). Information matrix computation from conditional information. Biometrika 85: 973–979. 15. Liu, C. (1999). Efficient ML estimation of the multivariate normal distribution from incomplete data, Journal of Multivariate Analysis 69: 206–217. 16. Liu, C. (2000a). Estimation of discrete distributions with a class of simplex constraints, Journal of the American Statistical Association. 17. Liu, C. (2000b). Robit regression: A simple robust alternative to logistic and probit regression, Technical report, Bell Labs. 18. Liu, C. and Rubin, D. B. (1994). The ECME algorithm: A simple extension of EM and ECM with faster monotone convergence. Biometrika 81: 633–648. 19. Liu, C. and Rubin, D. B. (1995). ML estimation of the multivariate t distribution with unknown degrees of freedom. Statistica Sinica 5: 19–39. 20. Liu, C. and Rubin, D. B. (1998). Maximum likelihood estimation of factor analysis using the ECME algorithm with complete and incomplete data. Statistical Sinica 8: 729–747. 21. Liu, C., Rubin, D. B. and Wu, Y. (1998). Parameter expansion to accelerate EM: The PX-EM algorithm. Biometrika 85: 755–770. 22. Liu, C. and Sun, D. (1997). Acceleration of the EM algorithm for mixture models using ECME. American Statistical Association Proceedings of the Statistical Computing Section, 109–114. 23. Liu, J. S. (1994). The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem. Journal of the American Statistical Association 89: 958–966. 24. Liu, J. S. and Wu, Y. (1999). Parameter expansion scheme for data augmentation. Journal of the American Statistical Association 94: 1264–1274. 25. Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society B44: 226–233. 26. Meng, X. L. and Pedlow, S. (1992). EM: A bibliographic review with missing articles. Proceedings of the Statistical Computing Section of the American Statistical Association, 24–27. 27. Meng, X. L. and Rubin, D. B. (1991). Using EM to obtain asymptotic variance-covariance matrices: The SEM algorithm. Journal of the American Statistical Association 86: 899–909.
1070
C. Liu
28. Meng, X. L. and Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika 80: 267–278. 29. Meng, X. L. and van Dyk, D. (1997). The EM algorithm — An old folk song sung to a fast new tune (with Discussion). Journal of the Royal Statistical Society B59: 511–67. 30. Meng, X. L. and van Dyk, D.(1998). Fast EM implementations for mixedeffects models. Journal of the Royal Statistical Society B60: 559–578. 31. Meng, X. L. and van Dyk, D. (1999). Seeking efficient data augmentation schemes via conditional and marginal augmentation. Biometrika 86: 301–320. 32. Pinheiro, J. C., Liu, C. and Wu, Y. (2000). Efficient algorithms for robust estimation in linear mixed-effects models using the multivariate t-distribution, revised for Journal of Computational and Graphical Statistics. 33. Rubin, D. B. (1974). Characterizing the estimation of parameters in incomplete-data problems. Journla of the American Statistical Association 69: 467–474. 34. Rubin, D. B. (1976). Inference and missing data. Biometrika 63: 581–590. 35. Rubin, D. B. (1978). Multiple imputation in sample surveys — A phenomenological Bayesian approach to nonresponse. Proceedings of the Survey Research Methods Section of the American Statistical Association, 20–34. Also in Imputation and Editing of Faulty or Missing Survey Data, US Dept. of Commerce, Bureau of the Census, 1–23. 36. Rubin, D. B. (1983). Iteratively reweighted least squares. In Encyclopedia of Statistical Sciences. 37. Rubin, D. B. and Thayer, D. T. (1982). EM algorithms for ML factor analysis. Psychometrika 47: 69–76. 38. Schafer, J. L. (1999). Some improved procedures for linear mixed effects. Unpublished technical report. 39. Shepp, L. A. and Vardi, Y. (1982). Maximum likelihood reconstruction for emission tomography. IEEE Transactions on Image Processing 2: 113-122. 40. Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation (with discussion). Journal of the American Statistical Association 82: 528–550. 41. Titterington, D. M., Smith, A. F. M. and Makov, U. E. (1985). Statistical Analysis of Finite Mixture Distributions. John Wiley and Sons, New York. 42. van Dyk, D. A. (2000a). Fitting mixed-effects models using efficient EM-type algorithms. Journal of Computational and Graphical Statistics. 43. van Dyk, D. A. (2000b). Nesting EM algorithms for computational efficiency. Statistica Sinica. 44. Wei, G. C. G. and Tanner, M. A. (1990). A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithm. Journal of the American Statistical Association 85: 699–704. 45. Wu, C. F. J. (1983). On the convergence properties of the EM algorithm. The Annals of Statistics 11: 95–103.
Maximum Likelihood Estimation from Incomplete Data
1071
About the Author Chuanhai Liu is a technical staff member of Bell Laboratories, Lucent Technologies, (http://www.stat.bell-labs.com/stat/liu/). He obtained MS in Probability and Statistics in 1987 from Wuhan University, MA in Statistics (1990) from Harvard University, and PhD in Statistics (1994) also from Harvard University. His research areas include Bayesian statistics, missing data problems, multiple imputation, robust statistics, time series, EM algorithms, and Markov chain Monte Carlo methods.
This page intentionally left blank
CHAPTER 29
INTRODUCTION TO ARTIFICIAL NEURAL NETWORKS XIA JIELAI and JIANG HONGWEI Department of Health Statistics, The Fourth Military Medical University, Xi’an, Shanxi 710033, PR China Tel: (86) 29-3376979; [email protected] TANG QIYI Department of Plant Protection, Zhejing University, 268 Kaixuan Road, Hangzhou, 310029, PR China Tel: (86) 571-86971621; [email protected]
1. Introduction While recent progressions of neurology lead the rapid development of artificial neural networks (ANN), the growing requirement of digital computer and artificial intelligence (AI) also promotes ANN. Today, in all problems that involve AI, human intelligence is still performed over AI. To develop new generation of intelligent computers, we must fully understand the human intelligent processes; in particular, the mechanisms of dealing with information by the neural network systems in human brains. On the other hand, although the initial intention of ANN was merely to explore and simulate informational processing of human, its superior capability has been demonstrated in problems that traditional digital computer systems and artificial intelligence encountered. Indeed, ANNs can be viewed as a major new break-through to various fields such as computational methodology and AI, etc. Artificial neural networks (ANN) is an engineering method that simulates the structures and operating principles in the information processing systems possessed by human brain. It was a milestone that psychologist McCulloch and mathematician Pitts had originally proposed the first 1073
1074
J. L. Xia, H. W. Jiang & Q. Y. Tang
mathematical model of ANNs in 1940s. Since then ANN has made rapid progresses, and various perceptron models have been brought forth subsequently by many researchers such as F. Rosenblatt, Widrow, Hopf and J. J. Hopfield, etc. In a magnitude of ANN studies, simulated annealing (SA)1,2,4 and Genetic Algorithm (GA)3,4 are two popular stochastic optimization algorithms. The former was proposed by Metropolis to simulate the annealing process of metal heating, and the latter was proposed by Holland to simulate the natural evolutional process of living beings. Although the stimulated objectives are at all different, both algorithms are extremely similar each other in formulation of algebraic structures. SA holds for the ergodicity of state spaces by generation functions and ensures the directions of iteration processes by acceptation operator. GA holds for the ergodicity of state spaces by crossover operator and mutation operator, and ensures the directions of iteration processes by selection operator. The traditional statistics, especially parametric statistics, usually assume a population distribution with unknown parameter. It is the mostly perplexing to assure the validity of the assumptions that samples indeed come from the population specified before using statistical techniques such as t test, ANOVA, regression and so on. However, in accordance to directly learning from data sets, ANN dynamically modulates the “weight” of neurons, and sequentially be able to perceive newly resembled data. Because of its favorable resilience against distortions, ANN has unique advantages to processing imperfect data sets and to problems of complex nonlinear systems. Statistically ANN can be described as the nonparametric nonlinear models. Its applications include predictions, cluster analysis, pattern recognition engines, time series analysis and wick relationship gauge among complex systems. Depending on the nonlinear linkage of numerous simple rule sets (neurons), ANN, especially multilayer perceptron networks, is different in essence from normal expert systems, which is some enumerative procedures based on comprehensive rule systems. As knowledge of experts is collected and represented using some traditional measurements, the establishment of expert systems is more difficult than ANN.
2. Back Propagation (BP) Neural Networks There are many different types of ANN, including the popular Hopfield model,5 the connection networks by Feldmann,6 the Baltzmann machine model by Hinton,7 the multilayer perceptron model by Rumelhart8 and
1075
Introduction to Artificial Neural Networks
Input layer
Hidden layer
Fig. 1.
Output layer
A BP neural network model.
the self-organization networks models by Kohonen,9 etc. Multilayer perceptron model is the most general among these ANN models. Although ANN had been around since the late 1940’s, no major progress was made until the mid-1980’s when the multilayer forward-propagation perceptron model was proposed by Minsky10 and became sophisticated enough for general applications through combination with back-propagation (BP) algorithm by Rumelhart and simulated annealing (SA) algorithms. A manifold of three-layer BP neural network is shown in Fig. 1. The BP neural networks (BPNN) systems with the hierarchical structure, including one input layer, several hidden layers and one output layer. Each layer consists of various neutrons taking on two phases: Activity and inactivity. Figure 1 illustrates a typical network with one input layer, one hidden layer and one output layer. Typically in BPNN, after having processed the signals received from the input layer, the neutrons of the hidden layers propagate it forward to the output layer that completes the finial procedure to export the results. A conventional stimuli function of every neutron usually is a S-shaped curve function such as the logistic function. f (x) =
1 . 1 + e−x/Q
Here, Q is the threshold parameter to adjust the formulation of stimuli function. The learning procedure of this algorithm is made up of forwardpropagation and backward-propagation. The special characteristic of this type of network is its simple dynamics: when a signal is inputted into the BPNN, it is propagated to the next layer by the interconnections between
1076
J. L. Xia, H. W. Jiang & Q. Y. Tang
estimated error E/mm
0.20
0.15
0.10
0.05
0 2
Fig. 2.
6
10 14 18 units of hidden layer n/layer
22
Relationships between units of the hidden layer and estimated error.
the neurons. The sign is processed by the neurons of one layer and then be propagated onto the next layer. It means that the state of each layer locally influence the next layer. This procedure will not stop until the signal reaches the output layer sending out the processed signal. In order to upgrade the precision of the system, signal errors feedback across the same pathways. Through modifications of the weights for all units of each layer, the differences between the expected and observed outcome are minimized. At present, there is no matured theory on how to select the number of units and hidden layers. In general, the more units of hidden layers neural networks possess, the more complexity they reflect and higher precision of learning. Nevertheless, with the increment of units in hidden layers, overfitting to the learning data comes into being easily. If an ANN model is trained on a learning data set very well, its ability to predict subsequently future data set will be enhanced. Figure 2 shows a special example in which the lowest error achieved when the system has 6 units in the hidden layer. For simplicity, we assume that there are n sigmoid type units in a neural network which possesses only one unit x in the input layer and one unit y in the output layer. Let (xk , yk ) (k = 1, 2, 3, . . . , N ) be observations in which xk is the input signal and yk is the output signal for the kth sample. Also, let the output of any unit i as Oik and the input of unit j is X Wij Oik . netjk = i
Introduction to Artificial Neural Networks
1077
And the error function is E=
N 1 X (yk − yˆk )2 . N k=1
In this function yˆk is the predicted value of the network output. If ∂Ek Ek = (yk − yˆk )2 , δjk = ∂net and Ojk = f (netjk ), then jk ∂Ek ∂Ek ∂netjk ∂Ek = = Oik = δjk Oik . ∂Wij ∂netjk ∂Wij ∂netjk If the unit j is in the output layer, Ojk = yˆk δjk =
∂Ek ∂ yˆk = −(yk − yˆk )f ′ (netjk ) . ∂ yˆk ∂netjk
(1)
Else if unit j is not in the output layer, then δjk =
∂Ek ∂Ojk ∂Ek ′ ∂Ek = = f (netjk ) ∂netjk ∂Ojk ∂netjk ∂Ojk
X ∂Ek ∂netmk ∂Ek = ∂Ojk ∂netmk ∂Ojk m =
= Thus,
X ∂Ek ∂ X Wmi Oik ∂netmk ∂Ojk i m
X X ∂Ek X Wmj = δmk Wmj . ∂netmk i m m δjk = f ′ (netjk )
∂Ek = δmk Oik . ∂Wij
X
δmk Wmj
m
(2)
If a neural network has M layers in which the M th only owns the output units and the first layer only possesses the input units, then BP algorithms are (i) Select the initial weights W . (ii) Repeat following procedures until converging: a. For K from 1 to N (a) Calculate Oik , netjk and yˆk (in the procedure of forwardpropagation)
1078
J. L. Xia, H. W. Jiang & Q. Y. Tang
(b) Implement the reversed calculation of layers from M to 2 (in the procedure of backward-propagation) b. For the same unit j ∈ M , calculated δjk by (1) and (2). PN ∂E ∂E (iii) Modulate weights, Wij = Wij − δ ∂W , δ > 0, for ∂W = k ij ij
∂Ek ∂Wij
From BP algorithms, it concludes that BP models transform input-tooutput patterns of sampling data sets into the optimization of nonlinear models. Its optimization is different at all from the traditional gradient descend method. So neural networks are the absolutely nonlinear mapping projects between input and output. The focal design of a neural network lies in how to estimate the structure of models and the selection of learning algorithms. To establish appropriate learning algorithms and model structure, we must rely on current theoretical developments of ANN and train these systems with enormous datasets. By dynamically adjusting the parameters of networks in the continuous procedures of learning, ANN can reach the precision required. 3. Introduction to Operation of DPS Data Process System ANN packages have been embedded in statistical software packages, such as SPSS, MATLAB and so forth. They can be browsed in those statistical software websites. DPS, Data Processing System, programmed by Qiyi Tang, will be showed below in this section. The basic data structure is that each row is a single case (observation), each column is a single variable and the left is the data of input units (independent variables), the right is the data of output units (dependent variables). And all values of cases are entered one by one. Do not need to enter the outputs (dependent variables) for the individuals to be recognized (predicted). After the data-entering step has finished, press CTRL and right button of mouse to define the predicted data as the second block. Before the learning procedure of neural networks, an optional dialogue, showed in Fig. 3 below, will appear to require some parameters of neural network. The principles of setting parameters are: (1) Number of units: The number of units of the input layer equals to the number of characteristic factors (independent variables), and the units of the output layer just amount to the number of system targets. Generally the number of units in hidden layers, greatly varying according to individual experiences, is 75% units of input layer. For
Introduction to Artificial Neural Networks
1079
Figure 3 optional dialogue of parameters in neural network Fig. 3.
(2)
(3)
(4) (5)
Optional dialogue of parameters in neural network.
example, if there are 7 units in an input layer and 1 unit in an output layer, the number of units in a hidden layer will be 5 to constitute a 7-5-1 model of neural network. In practice, due to comparing the output consequences of various units in a hidden layer, the most reasonable structure is established conclusively after the learning procedure of neural network. Initial weights: All of initial weights must not be exactly equal to each other. For the fact that it has been verified that once the initial weights are identical, even if there exists a set of diverse weights so as to the minimum error of neural network, the weights of units will remain to be equal. Thus, in our software, a random generator is programmed to yield a set of random numbers ranged from −0.5 to +0.5 as the initial weight of neural network. Optimum learning speed : As a typical BP algorithm, the larger the learning speed is, the greater the change of the weights is, and the faster the convergence is. However when learning speed is beyond a certain limitation, the neural network will oscillate. Consequently learning speed is larger with the guarantee against system oscillation. So, in DPS, learning speed is optimized automatically, though user can specify a certain value, say 0.9. Dynamic coefficient : It is chosen empirically too, just as the range from 0.6 to 0.8. Error tolerance: Generally ranges from 0.001 to 0.00001. If the error between the results of two successive iterations is below the tolerance, computing stops systematically to provide the results.
1080
J. L. Xia, H. W. Jiang & Q. Y. Tang
(6) Times of iteration: The default value is 1000. Due to the possible divergence of neural network computing, the maximum iteration times is given beforehand. (7) Coefficient of Sigmoid function: The value, regulating the stimuli formulas of neutron, ranges from 0.9 to 1.0 generally. (8) Data transformation: DPS has advantage of allowing data transformations in several functions, such as logarithm, square root and normalization. 4. Application Examples Example 1 is an illustration to use our software. Physicians under randomization collect the dataset. The influencing factors set of body surface area consists of 4 physical factors: Sex, age, weight and height. Figure 4 shows the nonlinear relationships between predictor variables and response variable. Before establishing BP neural network, we split the data set into two segments: From No. 1 to No. 70 as learning sample and from No. 71 to No. 90 severally as predicted sample. The data structure is defined as the following blocks in Table 1.
Fig. 4.
3-D Scatter of weight, height and body surface area.
1081
Introduction to Artificial Neural Networks
Table 1.
No. Sex
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
1 0 0 1 1 0 0 1 1 0 0 1 0 1 1 0 0 1 1 1 0 1 1 0 0 0 0 0 1 1 0 1 1 1 0 1 1 1 0 1
Random allotment of 90 persons’ physical measurements.
Body Body Age Weight Height Surface No. Sex Age Weight Height Surface (year) (kg) (cm) Area (year) (kg) (cm) Area (cm2 ) (cm2 ) 13 5 0 11 15 11 5 5 3 13 3 0 7 13 0 0 7 11 7 11 9 9 0 11 0 5 9 13 3 3 9 13 9 1 15 15 1 7 1 7
30.5 15 2.5 30 40.5 27 15 15 13.5 36 12 2.5 19 28 3 3 21 31 24.5 26 24.5 25 2.25 27 2.25 16 30 34 16 11 23 30 29 8 42 40 9 22 9.5 25
138.5 101 51.5 141 154 136 106 103 96 150 92 51 121 130.5 54 51 123 139 122.5 133 130 124 50.5 129 53 105 133 148 99 92 126 138 138 76 165 151 80 123 77 125
10072.9 6189 1906.2 10290.6 13221.6 9654.5 6768.2 6194.1 5830.2 11759 5299.4 2094.5 7490.8 9521.7 2446.2 1632.5 7958.8 10580.8 8756.1 9573 9028 8854.5 1928.4 9203.1 2200.2 6785.1 10120.8 11397.3 6410.6 5283.3 8693.5 9626.1 10178.7 4134.5 13019.5 12297.1 4078.4 8651.1 4246.1 8754.4
46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71∗ 72∗ 73∗ 74∗ 75∗ 76∗ 77∗ 78∗ 79∗ 80∗ 81∗ 82∗ 83∗ 84∗ 85∗
0 0 0 0 1 1 1 0 1 0 1 1 1 0 0 0 1 0 0 0 1 1 0 0 1 1 0 1 1 0 1 0 0 1 0 0 0 1 1 0
15 13 3 15 5 1 5 3 0 1 9 9 5 11 7 1 11 5 3 5 15 7 9 7 15 15 7 3 7 0 1 15 11 0 11 3 13 5 9 15
43 27.5 12 40.5 15 9 16.5 12.5 3.5 10 25 33 16 29 20 7.5 29.5 15 13 15 45 21 23 17 43.5 50 18 14 20 3 9.5 44 32 3 29 15 44 15.5 22 40
152 139 91 153 100 80 112 91 56.5 77 126 138 108 127 114 77 134.5 101 91 98 157 120 127 104 150 168 114 97 119 54 74 163 140 52 141 94 140 105 126 159.5
12998.7 9569.1 5358.4 12627.4 6364.5 4380.8 7256.4 5291.5 2506.7 4180.4 8813.7 11055.4 6988 9969.8 7432.8 3934 9970.5 6225.7 5601.7 6163.7 13426.7 8249.2 8875.8 6873.5 13082.8 14832 7071.8 6013.6 7876.4 2117.3 4314.2 13480.9 10583.8 2121 10135.3 6074.9 13020.3 6406.5 8267 12769.7
1082
J. L. Xia, H. W. Jiang & Q. Y. Tang
Table 1.
No. Sex
41 42 43 44 45
1 1 0 0 0
(Continued).
Body Body Age Weight Height Surface No. Sex Age Weight Height Surface (year) (kg) (cm) Area (year) (kg) (cm) Area (cm2 ) (cm2 ) 13 3 0 1 1
Note: The sign
36 15 3 9 7.5 ∗
143 94 51 74 73
11282.4 6101.6 1850.3 3358.5 3809.7
86∗ 87∗ 88∗ 89∗ 90∗
1 0 1 0 1
1 13 13 9 11
9.5 32 40 22 31
76 144 151 124 135
3845.9 10822.1 12519.9 8586.1 10120.6
denote predicted sample.
following.
Figure 5 Diagram of the data editor window for BP neural network Fig. 5.
Diagram of the data editor window for BP neural network.
Figure 5 is an editor window of DPS system. The format of data is inputted as the following. After launching the learning procedure of neural network, a window similar to Fig. 5 will be displayed. And then assign the parameter of network: Units in input layer is 4, hidden layer has 2 layers, optimum learning speed is 0.1, dynamic coefficient is 0.6, the coefficient of Sigmoid function is 0.9, error tolerance is 0.00001, maximum times of iteration are 2000, and the selection of data transformation is normalization. Then press the “OK” button. Then we set 5 to the units of the first hidden layer and set 3 to the units of second hidden layer. After 2000
1083
Introduction to Artificial Neural Networks
iterations, the error is 0.00000170. The weights of neutron in output layer is shown below: Weights matrix of units in the first hidden layer 0.597230
−0.824710
0.566580
−1.065810
0.051900
0.750700
−0.151260
0.172180
−0.369140
−0.139280
−0.715220
2.044800
0.194420
−0.869060
−2.243410
1.46518
−0.12269
−2.33773
−0.07406
1.168
Weights matrix of units in the second hidden layer −0.507940
−4.616450
3.675080
−0.928980
1.937520
−1.824980
−0.268910
0.21253
−3.046950
−0.708560
−4.81552
2.276470
−0.08934
−4.92864
−0.5283
Weights matrix of units in the output layer 1.12130 7.20956 −5.5911 Table 2 compared the predicted values with the observed values of the body surface area. And these predicted values from No. 71 to No. 90, used as predicted sample, are very close to the observed values. So the facts illustrate that the neural network has favorable abilities in model fitting and predicting.
Table 2. The predicted values and the observed values of the neural network.

No.  Predicted   Observed    No.  Predicted   Observed    No.   Predicted   Observed
 1   10226.87    10072.9     31    8511.77     8693.5     61     3793.63     3934
 2    6370.82     6189       32   10085.21     9626.1     62    10024.22     9970.5
 3    2076.073    1906.2     33   10138.25    10178.7     63     6370.82     6225.7
 4   10320.32    10290.6     34    3896.33     4134.5     64     5528.03     5601.7
 5   12589.19    13221.6     35   12830.50    13019.5     65     6183.77     6163.7
 6    9544.59     9654.5     36   12484.89    12297.1     66    13052.96    13426.7
 7    6651.57     6768.2     37    4353.68     4078.4     67     8072.70     8249.2
 8    6510.22     6194.1     38    8353.89     8651.1     68     8536.07     8875.8
 9    6034.13     5830.2     39    4053.10     4246.1     69     6733.68     6873.5
10   11881.13    11759       40    8993.87     8754.4     70    12893.16    13082.8
11    5461.66     5299.4     41   11713.91    11282.4     71∗   13282.39    14832
12    2166.992    2094.5     42    6090.36     6101.6     72∗    7358.28     7071.8
13    7748.67     7490.8     43    2082.186    1850.3     73∗    6173.54     6013.6
14    9342.00     9521.7     44    3746.90     3358.5     74∗    7875.99     7876.4
15    2275.284    2446.2     45    3485.76     3809.7     75∗    2163.63     2117.3
16    2082.186    1632.5     46   12914.47    12998.7     76∗    3909.322    4314.2
17    8109.89     7958.8     47    9703.54     9569.1     77∗   13002.22    13480.9
18   10528.90    10580.8     48    5387.18     5358.4     78∗   10854.05    10583.8
19    8803.51     8756.1     49   12658.19    12627.4     79∗    2213.59     2121
20    9157.74     9573       50    6306.22     6364.5     80∗   10094.41    10135.3
21    8894.74     9028       51    4353.68     4380.8     81∗    6026.73     6074.9
22    8797.66     8854.5     52    7228.11     7256.4     82∗   12950.38    13020.3
23    2145.221    1928.4     53    5457.31     5291.5     83∗    6703.66     6406.5
24    9414.36     9203.1     54    2386.589    2506.7     84∗    8315.04     8267
25    2104.641    2200.2     55    4120.19     4180.4     85∗   12601.92    12769.7
26    6732.92     6785.1     56    8871.79     8813.7     86∗    4075.91     3845.9
27   10243.62    10120.8     57   11159.89    11055.4     87∗   10903.51    10822.1
28   11424.97    11397.3     58    6947.78     6988       88∗   12564.37    12519.9
29    6584.85     6410.6     59    9869.52     9969.8     89∗    8277.76     8586.1
30    5395.49     5283.3     60    7658.39     7432.8     90∗   10423.74    10120.6

Note: The sign ∗ denotes the prediction sample.
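Readers who wish to experiment outside the DPS system may find a minimal sketch of the above configuration useful. Assuming a 4–5–3–1 architecture with sigmoid units, learning rate 0.1, momentum (dynamic) coefficient 0.6, min-max normalized data, an error tolerance of 0.00001 and at most 2000 iterations, the following Python/NumPy code trains such a network by back propagation. The function and variable names (train_bp, X, y) and the random weight initialization are illustrative assumptions, not the DPS implementation; the sigmoid-function coefficient of 0.9 is omitted for simplicity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_bp(X, y, sizes=(4, 5, 3, 1), lr=0.1, momentum=0.6,
             n_iter=2000, tol=1e-5, seed=0):
    """Gradient-descent BP training with a momentum term.

    X : (n_samples, 4) array of the four input variables.
    y : (n_samples,) vector of body surface areas.
    Inputs and target are min-max normalized, as in the text.
    """
    rng = np.random.default_rng(seed)
    Xn = (X - X.min(0)) / (X.max(0) - X.min(0))
    yn = ((y - y.min()) / (y.max() - y.min())).reshape(-1, 1)

    # connection weights and thresholds for the 4-5, 5-3 and 3-1 layers
    W = [rng.normal(scale=0.5, size=(a, b))
         for a, b in zip(sizes[:-1], sizes[1:])]
    b = [np.zeros(k) for k in sizes[1:]]
    vW = [np.zeros_like(w) for w in W]        # momentum buffers
    vb = [np.zeros_like(t) for t in b]

    mse = np.inf
    for _ in range(n_iter):
        # forward pass, keeping every layer's activations
        acts = [Xn]
        for w, t in zip(W, b):
            acts.append(sigmoid(acts[-1] @ w + t))
        err = yn - acts[-1]
        mse = float(np.mean(err ** 2))
        if mse < tol:                         # error tolerance reached
            break
        # backward pass: output-layer delta, then propagate backwards
        delta = err * acts[-1] * (1 - acts[-1])
        for layer in range(len(W) - 1, -1, -1):
            gW = acts[layer].T @ delta / len(Xn)
            gb = delta.mean(axis=0)
            if layer > 0:                     # propagate before updating
                delta_prev = (delta @ W[layer].T) * acts[layer] * (1 - acts[layer])
            vW[layer] = momentum * vW[layer] + lr * gW
            vb[layer] = momentum * vb[layer] + lr * gb
            W[layer] += vW[layer]
            b[layer] += vb[layer]
            if layer > 0:
                delta = delta_prev
    return W, b, mse
```

With the 70 fitting samples supplied as X and y, train_bp would return weight matrices of shapes 4 × 5, 5 × 3 and 3 × 1, corresponding in shape to the three weight matrices reported above.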
5. ANNs Based on Genetic Algorithm

The Genetic Algorithm (GA), first proposed in 1975 by Holland at the University of Michigan, USA, is inspired by Darwinian natural selection and the machinery of genetics. As a global optimization technique, the algorithm uses the principles of population evolution to optimize the prediction weights iteratively and eventually finds optimal or nearly optimal solutions. Because it is simple, general, robust and well suited to parallel computing, the method has been used effectively in a wide variety of fields, such as computing, scheduling (dispatch) optimization, transportation problems and configuration optimization.
Fig. 6. The conjunction of GA and ANN. (Diagram; box labels: GA, Evolution, Fitness estimate, Evaluation, Population, Decode, New network, Train, Test, Learning set, Testing set, Trained network.)
GA has increasingly been applied, in place of many traditional methods, to the design of ANNs and to their weight-learning step. The conjunction of GA and ANN, shown in Fig. 6, is applied to crop evaluation in the current research.

5.1. Value-encoding GA for learning the weights of an ANN

5.1.1. Encoding

Learning the weights of a neural network is a complicated continuous parameter-optimization problem. Binary encoding gives too many possible chromosomes even with a small number of alleles; moreover, it is often unnatural for such problems, corrections must sometimes be made after crossover and/or mutation, and the weights can only change in discrete steps, which limits the precision of learning in the neural network. Therefore, value (real-number) encoding is adopted in the current study.

5.1.2. Fitness function

The weights encoded in a chromosome are assigned to the ANN, and the learning data set supplies the inputs and outputs. The inverse of the sum of squared errors produced by the ANN is then defined as the fitness function:
$$f = 1 \Big/ \sum_{i=1}^{n} e_i^2 ,$$
where $e_i$ is the error of the $i$th learning pattern.
5.1.3. Weight initialization

Unlike ordinary BP algorithms, which draw the initial weights uniformly between 0.0 and 1.0, the initial weights of the neural network are generated according to a density proportional to $e^{-|\gamma|}$. Extensive trials support this choice, because it allows the whole solution space to be explored; as a result, the absolute values of the weights remain relatively small after the network converges.

5.1.4. Genetic operators

Although the genetic operators differ across application settings, weight crossover and weight mutation are the two most important operators.

5.1.5. Selection

The selection probability of each individual is determined not by fitness-proportional selection but by an elite ratio $S$, a measure of the survival ratio of offspring. This can be written as
$$P_2 = P_1 \cdot S, \qquad P_3 = P_2 \cdot S, \qquad \ldots$$
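The following Python sketch shows one way Secs. 5.1.1–5.1.5 could fit together: chromosomes are real-valued weight vectors (value encoding), initial weights follow a Laplace-type density mimicking $e^{-|\gamma|}$, fitness is the inverse of the sum of squared errors, and survival probabilities decay geometrically with the elite ratio $S$. The function ann_sse, mapping a weight vector to the network's sum of squared errors, is assumed to be supplied by the user (for example, built from a BP network such as the one sketched in Sec. 4); all other names and numerical settings (population size, $S = 0.8$, crossover and mutation rates) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def init_population(pop_size, n_weights):
    # value encoding: each chromosome is a real weight vector drawn from
    # a Laplace-type density, mimicking the e^{-|gamma|} initialization
    return rng.laplace(loc=0.0, scale=1.0, size=(pop_size, n_weights))

def fitness(chrom, ann_sse):
    # inverse sum of squared errors, f = 1 / sum(e_i^2)
    return 1.0 / (ann_sse(chrom) + 1e-12)

def select(pop, fits, elite_ratio=0.8):
    # survival probabilities decay geometrically with rank:
    # P2 = P1*S, P3 = P2*S, ... (the fittest individual has the largest P)
    order = np.argsort(fits)[::-1]
    probs = elite_ratio ** np.arange(len(pop), dtype=float)
    probs /= probs.sum()
    idx = rng.choice(order, size=len(pop), p=probs)
    return pop[idx]

def crossover_and_mutate(pop, p_cross=0.8, p_mut=0.05, sigma=0.1):
    children = pop.copy()
    for i in range(0, len(pop) - 1, 2):
        if rng.random() < p_cross:          # arithmetic weight crossover
            alpha = rng.random()
            a, b = pop[i], pop[i + 1]
            children[i] = alpha * a + (1 - alpha) * b
            children[i + 1] = alpha * b + (1 - alpha) * a
    mask = rng.random(children.shape) < p_mut   # weight mutation
    children[mask] += rng.normal(scale=sigma, size=mask.sum())
    return children

def ga_train(ann_sse, n_weights, pop_size=50, n_gen=200):
    pop = init_population(pop_size, n_weights)
    for _ in range(n_gen):
        fits = np.array([fitness(c, ann_sse) for c in pop])
        pop = crossover_and_mutate(select(pop, fits))
    fits = np.array([fitness(c, ann_sse) for c in pop])
    return pop[np.argmax(fits)]             # best weight vector found
```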
where $P_1, P_2, P_3, \ldots$ denote the selection probabilities of the individuals ranked by fitness: first-rate, second-rate, third-rate and so on.

5.2. The conjunction of GA and ANN

Consider a 3-layer BP neural network with one input layer, one hidden layer and one output layer. First, using a training sample $A_k$ and its expected output $C_k$ $(k = 1, 2, \ldots, m)$, we calculate the stimulus values passed from the input layer to the hidden layer by
$$b_i = f\!\left(\sum_{h=1}^{n} a_h V_{hi} + \theta_i\right),$$
where $i = 1, 2, \ldots, p$; $a_h$ denotes the value of unit $h$ of the input layer, $V_{hi}$ the connection weight from the input layer to the hidden layer, and $\theta_i$ the threshold of unit $i$ of the hidden
layer. The initial values of $V_{hi}$ and $\theta_i$ are drawn from the $e^{-|\gamma|}$ distribution. The stimulus function is the logistic function
$$f(x) = 1/(1 + e^{-x}). \qquad (3)$$
Then, using formulas (3) and (4), we compute the stimulus values of the units in the output layer:
$$C_j = f\!\left(\sum_{i=1}^{p} W_{ij} b_i + \gamma_j\right), \qquad (4)$$
where $j = 1, 2, \ldots, q$; $W_{ij}$ are the connection weights from the hidden layer to the output layer and $\gamma_j$ are the thresholds of the units in the output layer; the initial values of $W_{ij}$ and $\gamma_j$ are likewise drawn from the $e^{-|\gamma|}$ distribution. The normalized error of the output layer is computed by
$$d_j = C_j (1 - C_j)(C_j^k - C_j), \qquad (5)$$
where $C_j^k$ is the expected value of unit $j$ of the output layer for sample $k$. Finally, the error of each unit in the hidden layer is computed from the $d_j$:
$$e_i = b_i (1 - b_i) \sum_{j=1}^{q} W_{ij} d_j .$$
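To make the notation concrete, here is a short Python sketch (under the same illustrative assumptions as the earlier code) that evaluates Eqs. (3)–(5) and the hidden-layer error for a single training pattern; V, theta, W and gamma stand for $V_{hi}$, $\theta_i$, $W_{ij}$ and $\gamma_j$, drawn from a Laplace-type ($e^{-|\gamma|}$) density as in Sec. 5.1.3, and the pattern and its expected output are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 4, 5, 3                      # input, hidden and output units

# weights and thresholds drawn from a Laplace-type density (Sec. 5.1.3)
V = rng.laplace(size=(n, p)); theta = rng.laplace(size=p)
W = rng.laplace(size=(p, q)); gamma = rng.laplace(size=q)

f = lambda x: 1.0 / (1.0 + np.exp(-x))  # logistic stimulus function, Eq. (3)

a = rng.random(n)                      # one training pattern A_k (placeholder)
C_k = rng.random(q)                    # its expected output C_k (placeholder)

b = f(a @ V + theta)                   # hidden-layer stimuli b_i
C = f(b @ W + gamma)                   # output-layer stimuli C_j, Eq. (4)
d = C * (1 - C) * (C_k - C)            # output-layer errors d_j, Eq. (5)
e = b * (1 - b) * (W @ d)              # hidden-layer errors e_i
```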
Following the above steps, and recombining the population by crossover and mutation, we adjust the hidden-layer-to-output-layer connection weights and the thresholds of the units in the output layer, and then the input-layer-to-hidden-layer connection weights and the thresholds of the units in the hidden layer. When the error between the expected and observed outputs converges to within the pre-determined error tolerance, the learning of the neural network stops.

6. Future Research Trends

Because intelligent computing techniques such as ANNs have succeeded in many applications, they have received considerable attention and play an important role in a great number of research fields. As a popular tool, they should become more mature in the future. But from the standpoint of intelligent computing itself and of statistics, several outstanding obstacles urgently need to be solved in this field: (1) It is known that AI still does not possess many of the inherent characteristics of the brain, such as fault tolerance and robustness. Although ANNs have
solved some problems in AI, their theories are far from perfect and are still in their infancy. Moreover, AI and ANN may be completely different realizations of the natural principles on which the brain is based, so a body of mathematical principles needs to be developed now; this is ignored in many standard textbooks and reviews. Researchers should pay more attention to studying the existing theories and to establishing mathematical foundations. To avoid getting lost in the "forest" of biological mechanisms, the nature of AI and ANN theories needs to be worked out. With the wealth of material and experience now available, it is time to replace many seemingly faultless algorithms, which are full of analogies and metaphors, with specific, objective and quantitative methods and theories.

(2) It will take a long time to reveal and understand the intelligent mechanisms of human beings. These biological discoveries are increasingly seen as a way to broaden the scope of AI and ANN. Once biology, neurology and genetics make breakthroughs, AI and ANN will be able to simulate high-level intelligent mechanisms and solve some of the problems that are unsolvable today; in turn, they will encourage biological research to unveil more of the remaining puzzles of intelligence. Although AI and ANN are gray-box algorithms that only approximately correspond to human intelligence, they can provide useful clues and foundations for further research. As the rationality of previous models is verified, the methodology of AI and ANN, and the way to construct more sophisticated models, will become clearer and more distinctive.

(3) Although AI and ANN have been developing for nearly 50 years, their terminology is by no means standardized. Especially with regard to random systems, most networks either completely conceal or rigidly apply the underlying statistics. In the words of Anderson, Pellionisz and Rosenfeld,22 "neural networks are statistics for amateurs." Most statisticians still watch the development of ANNs soberly and will not accept them soon, because ANNs remain quite imperfect compared with statistics; especially in statistical applications, statisticians are reluctant to waste valuable data and time on automatic computer processing. However, with the maturing of MCMC theory and Gibbs sampling and the increasing interaction between the frequentist and Bayesian schools, ANNs based on Bayesian theory are growing rapidly. The advantages of ANNs have mostly been evaluated empirically in practical applications rather than in theoretical comparisons with statistical methods, so the impact of ANNs on the theory and application of statistics remains rather obscure at
this stage. At present, since ANNs are a system method somewhere between a gray box and a black box, we cannot evaluate their advantages in a general setting. Combining ANNs with statistics may therefore be a feasible way out of the current dilemma, and statisticians should pay more attention to AI and ANN.

References

1. Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics 21: 1087–1092.
2. Rumelhart, D. E., Hinton, G. E. and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature 323(6188): 533–536.
3. Holland, J. H. (1992). Adaptation in Natural and Artificial Systems. MIT Press, Cambridge, Massachusetts.
4. Dong, C., Li, Z. N., Xia, R. W. and He, Q. Z. (1995). Advance and some problems on multilayer perceptron neural network. Mechanics Advance 25(2): 186–196.
5. Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences USA 79: 2445–2458.
6. Feldmann, R., Monien, B. and Mysliwietz, P. (1990). Distributed game tree search. In Parallel Algorithms for Machine Intelligence and Vision, eds. Kumar, Kanal and Gopalakrishan, Springer-Verlag, New York.
7. Hinton, G. E. and Van Camp, D. (1993). Keeping neural networks simple by minimizing the description length of the weights. Proceedings of the 6th Annual ACM Conference on Computational Learning Theory, Santa Cruz, 5–13.
8. Rumelhart, D. E., Hinton, G. E. and Williams, R. J. (1986a). Learning representations by back-propagating errors. Nature 323: 533–536.
9. Kohonen, T. (1984). Associative Memory and Self-Organization, Springer-Verlag, New York.
10. Minsky, M. L. and Papert, S. A. (1988). Perceptrons, Expanded Edition, MIT Press, Cambridge, Massachusetts. (First edn 1969.)
11. Lai, Y. X. and Lu, Y. S. (1999). BP neural network application in the distribution study of fluid diversity. Journal of Beijing Chemistry University 26(2).
12. Radford, M. J. (1996). Bayesian Learning for Neural Networks, Springer, New York.
13. Watt, R. (1991). Understanding Vision, Academic Press, London.
14. Barndorff-Nielsen, O. E. and Jensen, J. L. (1993). Networks and Chaos — Statistical and Probabilistic Aspects, Chapman and Hall, London.
15. Zhang, S. Q., Chen, C. and Wan, E. P. (1998). Gray system application in production evaluation. Geography 18(6): 581–585.
16. Hecht-Nielsen, R. (1989). Theory of the back propagation neural network. Proceedings of the International Joint Conference on Neural Networks 1: 593–605.
17. Bornholdt, A. (1992). General asymmetric neural network and structure design by genetic algorithms. Neural Networks 5(2): 327–334.
18. Li, M. Q., Xu, B. Y. and Kou, J. S. (1999). The combination of GA and ANN. Theory and Practice of System Engineering 21(2): 65–69.
19. Jin, L., Luo, Y., Mou, Q. L. et al. (1998). Study on ANN prediction model of the humidity in the soil of cornfield. Agrology 35(1): 25–35.
20. Li, Z. and Zhang, J. T. (2000). The study on evaluation of mealie based on the combination of GA and ANN. Natural Resource Journal 15(3).
21. Deng, J. N. (1992). The Basic Methods of Gray System, Central China Sci. Tech. Univ. Press, 304–312.
22. Anderson, J. A., Pellionisz, A. and Rosenfeld, E. (1990). Neurocomputing 2: Directions for Research, MIT Press, Cambridge, Massachusetts.
About the Author

Xia Jielai, professor at the Fourth Military Medical University, earned his bachelor's degree in applied mathematics from Anhui University and his master's and PhD degrees in health statistics from the Fourth Military Medical University. He has taught in the biostatistics department of the university as an assistant professor (1983–1988), lecturer (1988–1995), associate professor (1995–1998) and professor (1998–present). He was a visiting scholar at the Clinical and Epidemiological Research Center of the Prince of Wales Hospital, The Chinese University of Hong Kong (January–February 1994), and at the Department of Genetics and Biometry, Louisiana State University Medical Center (March–October 1994). His research fields are biostatistics and data processing, including the theory and methods of statistical modeling and the software development of NoSA (Non-typical data Statistical Analysis system). He has published more than 50 articles in scientific journals.