English Pages 251 Year 2011
Multivariable Analysis A Practical Guide for Clinicians and Public Health Researchers
Why do you need this book? Multivariable analysis is confusing! Whether you are performing your first research project or attempting to interpret the output from a multivariable model, you have undoubtedly found this to be true. Basic biostatistics books are of little or no help to you, since their coverage often stops short of multivariable analysis. However, existing multivariable analysis books are too dense with mathematical formulae and derivations and are not designed to answer your most basic questions. Is there a book that steps aside from the math and simply explains how to understand, perform, and interpret multivariable analyses? Yes. Multivariable Analysis: A Practical Guide for Clinicians and Public Health Researchers, as this new edition is titled, is precisely the reference that will lead your way. In fact, Dr. Mitchell Katz has asked and answered all of your questions for you! Why should I do multivariable analysis? How do I choose which type of multivariable to use? How many subjects do I need to do multivariable analysis? What if I have repeated observations of the same persons? Answers and detailed explanations to these questions and more are found in this book. Also, it is loaded with useful tips, summary charts, figures, and references. If you are a medical student, resident, or clinician, Multivariable Analysis: A Practical Guide for Clinicians and Public Health Researchers will prove an indispensable guide through the confusing terrain of statistical analysis. This third edition has been fully revised to build on the enormous success of its predecessors. New features include new sections on Poisson and negative binomial regression, proportional odds analysis, and multinomial logistic regression, and an expanded section on interpretation of residuals.
Praise for first edition
“This is the first nonmathematical book on multivariable analysis addressed to clinicians. Its range, organization, brevity, and clarity make it useful as a reference, a text, and a guide for self-study. This book is ‘a practical guide for clinicians.’” Leonard E. Braitman, Ph.D., Annals of Internal Medicine
Mitchell H. Katz is Clinical Professor of Medicine, Epidemiology and Biostatistics at the University of California, San Francisco; and Director of the Los Angeles Department of Health Services, Los Angeles, USA.
Multivariable Analysis A Practical Guide for Clinicians and Public Health Researchers Third Edition
Mitchell H. Katz Department of Medicine, Epidemiology and Biostatistics, University of California, USA
CAMBRIDGE UNIVERSITY PRESS
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo, Delhi, Tokyo, Mexico City
Cambridge University Press, The Edinburgh Building, Cambridge CB2 8RU, UK
Published in the United States of America by Cambridge University Press, New York
www.cambridge.org
Information on this title: www.cambridge.org/9780521760980
© M. H. Katz, 1999, 2006, 2011
This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.
First published 1999. Second edition published 2006. Third edition published 2011.
Printed in the United Kingdom at the University Press, Cambridge
A catalog record for this publication is available from the British Library
Library of Congress Cataloging in Publication data
Katz, Mitchell H., 1959– author. Multivariable analysis : a practical guide for clinicians and public health researchers / Mitchell H. Katz, Department of Medicine, Epidemiology, and Biostatistics, University of California, USA. – 3rd Edition. p. ; cm. Includes bibliographical references and index. ISBN 978-0-521-76098-0 (hardback) – ISBN 978-0-521-14107-9 (paperback) 1. Medicine–Research–Statistical methods. 2. Multivariate analysis. 3. Biometry. 4. Medical statistics. I. Title. [DNLM: 1. Multivariate Analysis. 2. Biometry–methods. WA 950] R853.S7K38 2011 610.72–dc22 2010052187
ISBN 978-0-521-76098-0 Hardback
ISBN 978-0-521-14107-9 Paperback
Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.
Every effort has been made in preparing this book to provide accurate and up-to-date information which is in accord with accepted standards and practice at the time of publication. Although case histories are drawn from actual cases, every effort has been made to disguise the identities of the individuals involved. Nevertheless, the authors, editors and publishers can make no warranties that the information contained herein is totally free from error, not least because clinical standards are constantly changing through research and regulation. The authors, editors and publishers therefore disclaim all liability for direct or consequential damages resulting from the use of material contained in this book. Readers are strongly advised to pay careful attention to information provided by the manufacturer of any drugs or equipment that they plan to use.
To my parents, for their unwavering support
Contents
Preface
1 Introduction
1.1 Why should I do multivariable analysis?
1.2 What are confounders and how does multivariable analysis help me to deal with them?
1.3 What are suppressers and how does multivariable analysis help me to deal with them?
1.4 What are interactions and how does multivariable analysis help me to deal with them?
2 Common uses of multivariable models
2.1 What are the most common uses of multivariable models in clinical research?
2.2 How is multivariable analysis used in observational studies of etiology?
2.3 How is multivariable analysis used in intervention studies (randomized and nonrandomized)?
2.4 How is multivariable analysis used in studies of diagnosis?
2.5 How is multivariable analysis used in studies of prognosis?
3 Outcome variables in multivariable analysis
3.1 How does the nature of the outcome variable influence the choice of which type of multivariable analysis to do?
3.2 What type of multivariable analysis should I use with an interval outcome?
3.3 What type of multivariable analysis should I use with a dichotomous outcome?
3.4 What type of multivariable analysis should I use with an ordinal variable?
3.5 What type of multivariable analysis should I use with a nominal outcome?
3.6 What type of multivariable analysis should I use with a time-to-outcome variable?
3.7 How likely is it that the censoring assumption is valid in my study?
3.8 How can I test the validity of the censoring assumption for my data?
3.9 What is the proportionality assumption of proportional hazards analysis?
3.10 What type of multivariable analysis should I use with counts?
3.11 What type of multivariable analysis should I use with an incidence rate?
3.12 May I change the coding of my outcome variable to use a different type of multivariable analysis?
4 Independent variables in multivariable analysis
4.1 How do I incorporate independent variables into a multivariable analysis?
4.2 How do I incorporate nominal independent variables into a multivariable analysis?
4.3 How do I incorporate interval-independent variables into a multivariable model?
4.4 Assuming that my interval-independent variable fits a linear assumption, is there any reason to group it into interval categories or create multiple dichotomous variables?
4.5 How do I incorporate ordinal independent variables into a multivariable model?
5 Relationship of independent variables to one another
5.1 Does it matter if my independent variables are related to each other?
5.2 How do I assess whether my variables are multicollinear?
5.3 What should I do with multicollinear variables?
6 Setting up a multivariable analysis
6.1 What independent variables should I include in my multivariable model?
6.2 How do I decide what confounders to include in my model?
6.3 What independent variables should I exclude from my multivariable model?
6.4 How many subjects do I need to do multivariable analysis?
6.5 What if I have too many independent variables given my sample size?
6.6 What should I do about missing data on my independent variables?
6.7 What should I do about missing data on my outcome variable?
7 Performing the analysis
7.1 What numbers should I assign for dichotomous or ordinal variables in my analysis?
7.2 Does it matter what I choose as my reference category for multiple dichotomous (“dummied”) variables?
7.3 How do I enter interaction terms into my analysis?
7.4 How do I enter time into my proportional hazards or other survival analysis?
7.5 What about subjects who experience their outcome on their start date?
7.6 What about subjects who have a survival time shorter than physiologically possible?
7.7 How do I incorporate time into my Poisson analysis?
7.8 What are variable selection techniques?
7.9 My model won’t converge. What should I do?
8 Interpreting the results
8.1 What information will my multivariable analysis produce?
8.2 How do I assess how well my model fits the data?
8.3 What do the coefficients tell me about the relationship between each variable and the outcome?
8.4 How do I interpret the results of interaction terms?
8.5 Do I have to adjust my multivariable regression coefficients for multiple comparisons?
9 Delving deeper: Checking the underlying assumptions of the analysis
9.1 How do I know if the assumptions of my multivariable model are met?
9.2 What are residuals? How are they used to assess the fit of models?
9.3 How do I test the normal distribution and equal variance assumptions of a multiple linear regression model?
9.4 How do I test the linearity assumption of a multivariable model?
9.5 What are outliers and how do I detect them in a multivariable model?
9.6 What should I do when I detect outliers?
9.7 What is the additive assumption and how do I assess whether my multiple independent variables fit this assumption?
9.8 How do I test the proportional odds assumption?
9.9 How do I test the proportionality assumption?
9.10 What if the proportionality assumption does not hold for my data?
10 Propensity scores
10.1 What are propensity scores? Why are they used?
11 Correlated observations
11.1 What circumstances lead to correlated observations?
11.2 Should I avoid study designs that lead to correlated observations?
11.3 How do I analyze correlated observations?
11.4 How do I calculate the needed sample size for studies with correlated observations?
12 Validation of models
12.1 How can I validate my models?
13 Special topics
13.1 What if the independent variable changes value during the course of the study?
13.2 What are the advantages and disadvantages of time-dependent covariates?
13.3 What are classification and regression trees (CART) and should I use them?
13.4 How can I get best use of my biostatistician?
13.5 How do I choose which software package to use?
14 Publishing your study
14.1 How much information about how I constructed my multivariable models should I include in the Methods section?
14.2 Do I need to cite a statistical reference for my choice of method of multivariable analysis?
14.3 Which parts of my multivariable analysis should I report in the Results section?
15 Summary: Steps for constructing a multivariable model
Index
Preface
There has been astounding growth in the use of multivariable analysis in clinical research. When the first edition of this book was published in 1999, logistic regression and proportional hazards models were cutting-edge techniques. Now, for many researchers, these are old, staid models and the new edge is mixed-effects models, generalized estimating equations, Poisson regression, and propensity score analysis. The use of these more sophisticated models is fueled by the development of user-friendly software for constructing multivariable models, increased availability of electronic databases (medical records, disease and procedure registries) that provide longitudinal data on large populations, and increased funding for and interest in clinical effectiveness studies – studies comparing different treatments in use – as a method of improving quality and reducing healthcare costs.
What hasn’t changed in the past 11 years is the need for an easy-to-follow guide for nonstatisticians on how to perform and interpret these models. Although the available software (e.g., SPSS, SAS, S-plus, R) doesn’t require programming experience or mathematical aptitude to conduct the analyses, if the analysis is not set up correctly, the answer is sure to be wrong! Even when the analysis is performed correctly, researchers may not draw the correct conclusions from the output. To prevent these problems, throughout the book I have focused on how to set up and interpret multivariable analysis. I use examples from the medical and public health literature because illustrations of how to correctly analyze data and present the results will help you analyze and present your data correctly. Modeling your work based on successful published studies is one of the best and most efficient strategies for correctly analyzing data.
The biggest changes in this edition are that I have written new sections on Poisson and negative binomial regression, proportional odds analysis, and multinomial logistic regression because these models are increasingly in use. I have improved the section on mixed-effects models and generalized estimating equations, and also expanded the section on checking the underlying assumptions of multivariable models (Chapter 9) using residuals and other techniques.
While taking on new and more complicated material, I have maintained the basic organization of the book. Besides retaining the question-and-answer approach, the order of the book mirrors the process of doing multivariable analysis: deciding whether you need to do multivariable analysis (Chapters 1 and 2), choosing the correct model (Chapter 3), preparing your independent variables (Chapters 4 and 5), setting up the model (Chapter 6), performing the analysis (Chapter 7), interpreting the basic output (Chapter 8), delving deeper into the underlying assumptions of the model (Chapter 9), validating your model (Chapter 12), and publishing your study (Chapter 14).
One of the reasons I prefer this approach to the more traditional approach (i.e., having a separate chapter on each type of multivariable model) is that it illustrates the similarities and differences of the different approaches. In my experience, when the results are strong, different (but reasonable) approaches lead to similar answers; conversely, when the results are very different with different techniques, be suspicious. Also, I have found that the most efficient way to end an argument over the best way to analyze a data set is to analyze it multiple ways and see whether the results differ. If there are few differences, then you have strengthened your results. When there are differences, you have probably learned something important about the nature of your data. Also, structuring the book to parallel the research process allows readers to join the book at whatever stage of the research process they have reached.
This book assumes that you are familiar with basic biostatistics. If not, I recommend S. Glantz’s Primer of Biostatistics (sixth edition, McGraw-Hill, 2005).
I have also written a basic statistics book, Study Design and Statistical Analysis: A Practical Guide for Clinicians (Cambridge University Press, 2006), which uses a question-and-answer approach similar to that used in this book. Some reviewers have suggested that the two books be combined, and while I see the merit in that, I also see a much fatter text that might be more expensive and off-putting to clinical researchers. Please forgive me therefore if I cite that book or my other book on performing interventions (Evaluating Clinical and Public Health Interventions, Cambridge University Press, 2010). It is not an exercise of ego, but rather an attempt to keep each book inexpensive and short.
One of the challenges in writing a book for clinical researchers is deciding how much detail to include. One could easily have (and many have) written books larger than this about just one of the procedures described. To keep the presentations short and the material accessible, I direct readers who wish to know more about a particular procedure to more detailed sources in the footnotes. Since statistical textbooks are expensive, and many journal articles are not easy to find, I have particularly emphasized web resources that I have found useful.
Twenty years of students in the University of California, San Francisco, Clinical Research Program have contributed to this book through their insightful questions and observations. Serving as the Deputy Editor for the Archives of Internal Medicine during the past two years has definitely sharpened my eye as to how best to conduct multivariable research. For this opportunity I am grateful to the Editor, Rita Redberg, M.D., our two biostatistical editors who have taught me much, John Neuhaus, Ph.D. and David Glidden, Ph.D., and the other editors, Patrick O’Malley, M.D. and Kirsten Johansen, M.D., who have shared their critical observations with me on hundreds of articles. I greatly appreciate the support of my editor Richard Marley and the staff at Cambridge University Press for encouraging me to do this third edition.
The best part of writing and updating this book is the number of researchers who have emailed me with their comments, compliments, and questions. Writing textbooks is a lonely business and I wouldn’t do it unless I had evidence that the books were actually helping people to conduct better research. If you have questions or suggestions for future editions, email me at [email protected]
1 Introduction
1.1 Why should I do multivariable analysis?
DEFINITION: Multivariable analysis is a tool for determining the relative contributions of different causes to a single event.
We live in a multivariable world. Most events, whether medical, political, social, or personal, have multiple causes. And these causes are related to one another. Multivariable analysis¹ is a statistical tool for determining the relative contributions of different causes to a single event or outcome.
Clinical researchers, in particular, need multivariable analysis because most diseases have multiple causes, and prognosis is usually determined by a large number of factors. Even for those infectious diseases that are known to be caused by a single pathogen, a number of factors affect whether an exposed individual becomes ill, including the characteristics of the pathogen (e.g., virulence of strain), the route of exposure (e.g., respiratory route), the intensity of exposure (e.g., size of inoculum), and the host response (e.g., immunologic defense).
Multivariable analysis allows us to sort out the multifaceted nature of risk factors and their relative contribution to outcome. For example, observational epidemiology has taught us that there are a number of risk factors associated with premature mortality, notably smoking, a sedentary lifestyle, obesity, elevated cholesterol, and hypertension. Note that I did not say that these factors cause premature mortality. Statistics alone cannot prove that a relationship between a risk factor and an outcome is causal.² Causality is established on the basis of biological plausibility and rigorous study designs, such as randomized controlled trials, which eliminate sources of potential bias. Identification of risk factors of premature mortality through observational studies has been particularly important because you cannot randomize people to many of the conditions that cause premature mortality, such as smoking, sedentary lifestyle, or obesity.
¹ The terms “multivariate analysis” and “multivariable analysis” are often used interchangeably. In the strict sense, multivariate analysis refers to simultaneously predicting multiple outcomes. Since this book deals with techniques that use multiple variables to predict a single outcome, I prefer the more general term multivariable analysis.
² Throughout the text I use the terms “associated with” and “related to” interchangeably. Similarly, I use the terms “risk factor,” “exposure,” “predictor,” and “independent variable,” and the terms “outcome” and “dependent variable,” interchangeably. Although some of these terms, such as “risk factor,” “predictor,” and “outcome,” imply causality, remember that causality can never be proven with statistical analysis. The best way to establish causality is through rigorous study design (e.g., randomization to eliminate confounding, longitudinal observations to minimize the chance that the “outcome” caused the “risk factor”).
And yet these conditions tend to occur together; that is, people who smoke tend to exercise less and be more likely to be obese. How does multivariable analysis separate the independent contribution of each of these factors? Let’s consider the case of exercise. Numerous studies have shown that persons who exercise live longer than persons with sedentary lifestyles. But if the only reason that persons who exercise live longer is that they are less likely to smoke and more likely to eat low-fat meals leading to lower cholesterol, then initiating an exercise routine would not change a person’s life expectancy.
The Aerobics Center Longitudinal Study tackled this important question.³ They evaluated the relationship between exercise and mortality in 25,341 men and 7080 women. All participants had a baseline examination between 1970 and 1989. The examination included a physical examination, laboratory tests, and a treadmill evaluation to assess physical fitness. Participants were followed for an average of 8.4 years for the men and 7.5 years for the women. Table 1.1 compares the characteristics of survivors to persons who had died during the follow-up. You can see that there are a number of significant differences between survivors and decedents among men and women. Specifically, survivors were younger, had lower blood pressure, lower cholesterol, were less likely to smoke, and were more physically fit (based on the length of time they stayed on the treadmill and their level of effort).
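The comparison in Table 1.1 can be turned into a rough crude (unadjusted) risk ratio for low fitness. The sketch below is only an illustration, not a calculation from the study itself: it assumes that approximate counts can be recovered by multiplying the group sizes by the rounded percentages in Table 1.1, it ignores differences in length of follow-up, and the name `crude_risk_ratio` is made up for this example.

```python
# Crude (unadjusted) risk of death by fitness level, rebuilt from Table 1.1.
# Assumption: counts are recovered from rounded percentages, so they are approximate.

TABLE_1_1 = {
    # sex: (survivors, decedents, % low fitness in survivors, % low fitness in decedents)
    "men": (24740, 601, 20.1, 41.6),
    "women": (6991, 89, 18.8, 44.9),
}


def crude_risk_ratio(survivors, decedents, pct_low_surv, pct_low_dec):
    """Risk of death in the low-fitness group divided by the risk in everyone else."""
    low_dead = decedents * pct_low_dec / 100
    low_alive = survivors * pct_low_surv / 100
    other_dead = decedents - low_dead
    other_alive = survivors - low_alive
    risk_low = low_dead / (low_dead + low_alive)
    risk_other = other_dead / (other_dead + other_alive)
    return risk_low / risk_other


crude = {sex: crude_risk_ratio(*row) for sex, row in TABLE_1_1.items()}
for sex, ratio in crude.items():
    print(f"{sex}: crude risk ratio for low fitness = {ratio:.2f}")
```

Both ratios come out well above 1, but they mix any effect of fitness with the differences in age, smoking, blood pressure, and the other characteristics listed in the table.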
Table 1.1 Baseline characteristics of survivors and decedents, Aerobics Center Longitudinal Study.
                                              Men                         Women
Characteristics                               Survivors     Decedents     Survivors     Decedents
                                              (n = 24 740)  (n = 601)     (n = 6991)    (n = 89)
Age, y (SD)                                   42.7 (9.7)    52.1 (11.4)   42.6 (10.9)   53.3 (11.2)
Body mass index, kg/m² (SD)                   26.0 (3.6)    26.3 (3.5)    22.6 (3.9)    23.7 (4.5)
Systolic blood pressure, mm Hg (SD)           121.1 (13.5)  130.4 (19.1)  112.6 (14.8)  122.6 (17.3)
Total cholesterol, mg/dL (SD)                 213.1 (40.6)  228.9 (45.4)  202.7 (40.5)  228.2 (40.8)
Fasting glucose, mg/dL (SD)                   100.4 (16.3)  108.1 (32.0)  94.4 (14.5)   99.9 (25.0)
Fitness, %
  Low                                         20.1          41.6          18.8          44.9
  Moderate                                    42.0          39.1          40.6          33.7
  High                                        37.9          19.3          40.6          21.3
Current or recent smoker, %                   26.3          36.9          18.5          30.3
Family history of coronary heart disease, %   25.4          33.8          25.2          27.0
Abnormal electrocardiogram, %                 6.9           26.3          4.8           18.0
Chronic illness, %                            18.4          40.3          13.4          20.2
Adapted with permission from Blair, S. N., et al. “Influences of cardiorespiratory fitness and other precursors on cardiovascular disease and all-cause mortality in men and women.” JAMA 276 (1996): 205–10. Copyright 1996, American Medical Association. Additional data provided by authors.
Although the results are interesting, Table 1.1 does not answer our basic question: Does being physically fit independently increase longevity? It doesn’t answer the question because whereas the high-fitness group was less likely to die during the study period, those who were physically fit may just have been younger, been less likely to smoke, or had lower blood pressure.
To determine whether exercise is independently associated with mortality, the authors performed proportional hazards analysis, a type of multivariable analysis. The results are shown in Table 1.2. If you compare the number of deaths per 10 000 person-years in men, you can see that there were more deaths in the low-fitness group (38.1) than in the moderate/high fitness group (25.0). This difference is reflected in the elevated relative risk for lower fitness (38.1/25.0 = 1.52). These results are adjusted for all of the other variables listed in the table. This means that low fitness is associated with higher mortality, independent of the effects of other known risk factors for mortality, such as smoking, elevated blood pressure, cholesterol, and family history. A similar pattern is seen for women.
DEFINITION: Stratified analysis assesses the effect of a risk factor on outcome while holding another variable constant.
Was there any way to answer this question without multivariable analysis? One could have performed stratified analysis. Stratified analysis assesses the effect of a risk factor on outcome while holding another variable constant. So, for example, we could compare physically fit to unfit persons separately among smokers and nonsmokers. This would allow us to calculate a relative risk for the impact of fitness on mortality, independent of smoking. This analysis is shown in Table 1.3.
Unlike the multivariable analysis in Table 1.2, the analyses in Table 1.3 are bivariate.⁴ We see that the mortality rate is greater among those at low fitness
³ Blair, S. N., Kampert, J. B., Kohl, H. W., et al. “Influences of cardiorespiratory fitness and other precursors on cardiovascular disease and all-cause mortality in men and women.” JAMA 276 (1996): 205–10.
⁴ Some researchers use the term “univariate” to describe the association between two variables. I think it is more informative to restrict the term univariate to analyses of a single variable (e.g., mean, median), while using the term “bivariate” to refer to the association between two variables.
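The arithmetic behind the relative risk quoted for fitness in men (38.1/25.0 = 1.52) can be repeated for the other fitness and smoking rows of Table 1.2. The sketch below is only a consistency check on the published numbers, under the assumption that dividing the adjusted rates is a fair approximation. The reported relative risks themselves come from the authors' proportional hazards model, not from this division, and the dictionary name `TABLE_1_2` is made up for this example.

```python
# Adjusted death rates per 10 000 person-years and reported adjusted relative
# risks, copied from Table 1.2 as (exposed rate, reference rate, reported RR).
TABLE_1_2 = {
    ("men", "low fitness"): (38.1, 25.0, 1.52),
    ("women", "low fitness"): (27.8, 13.2, 2.10),
    ("men", "current or recent smoker"): (39.4, 23.9, 1.65),
    ("women", "current or recent smoker"): (27.8, 14.0, 1.99),
}

rate_ratios = {}
for (sex, factor), (exposed, reference, reported) in TABLE_1_2.items():
    # Ratio of the adjusted rate in the exposed group to the reference group.
    ratio = exposed / reference
    rate_ratios[(sex, factor)] = ratio
    print(f"{sex:5s} {factor:25s} {exposed}/{reference} = {ratio:.2f} "
          f"(reported {reported:.2f})")
```

Each ratio lands within about 0.01 of the reported value, which is what one would expect given that the published rates are rounded to one decimal place.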
Table 1.2 Multivariable analysis of risk factors for all-cause mortality, Aerobics Center Longitudinal Study.
                            Men                                      Women
                            Deaths per 10 000  Adjusted relative     Deaths per 10 000  Adjusted relative
                            person-years       risk (95% CI)         person-years       risk (95% CI)
Fitness
  Low                       38.1               1.52 (1.28–1.82)      27.8               2.10 (1.36–3.26)
  Moderate/High             25.0               1.0 (ref.)            13.2               1.0 (ref.)
Smoking status
  Current or recent smoker  39.4               1.65 (1.39–1.97)      27.8               1.99 (1.25–3.17)
  Past or never smoked      23.9               1.0 (ref.)            14.0               1.0 (ref.)
Systolic blood pressure ≥140 mm Hg