Regression Models as a Tool in Medical Research
Werner Vach
CRC Press Taylor & Francis Group 6000 Broken Sound Parkway NW, Suite 300 Boca Raton, FL 33487-2742 © 2013 by Taylor & Francis Group, LLC CRC Press is an imprint of Taylor & Francis Group, an Informa business No claim to original U.S. Government works Version Date: 2012924 International Standard Book Number-13: 978-1-4665-1749-3 (eBook - PDF) This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint. Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers. For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged. Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe. Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com
Dedicated to my wife Kirstin and to my children Hilka and Martin
Contents

Preface
Acknowledgments
About the Author

I  The Basics

1  Why Use Regression Models?
   1.1  Why Use Simple Regression Models?
   1.2  Why Use Multiple Regression Models?
   1.3  Some Basic Notation

2  An Introductory Example
   2.1  A Single Line Model
   2.2  Fitting a Single Line Model
   2.3  Taking Uncertainty into Account
   2.4  A Two-Line Model
   2.5  How to Perform These Steps with Stata
   2.6  Exercise 5-HIAA and Serotonin
   2.7  Exercise Haemoglobin
   2.8  Exercise Scaling of Variables

3  The Classical Multiple Regression Model

4  Adjusted Effects
   4.1  Adjusting for Confounding
   4.2  Adjusting for Imbalances
   4.3  Exercise Physical Activity in Schoolchildren

5  Inference for the Classical Multiple Regression Model
   5.1  The Traditional and the Modern Way of Inference
   5.2  How to Perform the Modern Way of Inference with Stata
   5.3  How Valid and Good are Least Squares Estimates?
   5.4  A Note on the Use and Interpretation of p-Values in Regression Analyses

6  Logistic Regression
   6.1  The Definition of the Logistic Regression Model
   6.2  Analysing a Dose Response Experiment by Logistic Regression
   6.3  How to Fit a Dose Response Model with Stata
   6.4  Estimating Odds Ratios and Adjusted Odds Ratios Using Logistic Regression
   6.5  How to Compute (Adjusted) Odds Ratios Using Logistic Regression in Stata
   6.6  Exercise Allergy in Children
   6.7  More on Logit Scale and Odds Scale

7  Inference for the Logistic Regression Model
   7.1  The Maximum Likelihood Principle
   7.2  Properties of the ML Estimates for Logistic Regression
   7.3  Inference for a Single Regression Parameter
   7.4  How to Perform Wald Tests and Likelihood Ratio Tests in Stata

8  Categorical Covariates
   8.1  Incorporating Categorical Covariates in a Regression Model
   8.2  Some Technicalities in Using Categorical Covariates
   8.3  Testing the Effect of a Categorical Covariate
   8.4  The Handling of Categorical Covariates in Stata
   8.5  Presenting Results of a Regression Analysis Involving Categorical Covariates in a Table
   8.6  Exercise Physical Occupation and Back Pain
   8.7  Exercise Odds Ratios and Categorical Covariates

9  Handling Ordered Categories: A First Lesson in Regression Modelling Strategies

10  The Cox Proportional Hazards Model
   10.1  Modelling the Risk of Dying
   10.2  Modelling the Risk of Dying in Continuous Time
   10.3  Using the Cox Proportional Hazards Model to Quantify the Difference in Survival Between Groups
   10.4  How to Fit a Cox Proportional Hazards Model with Stata
   10.5  Exercise Prognostic Factors in Breast Cancer Patients—Part 1

11  Common Pitfalls in Using Regression Models
   11.1  Association versus Causation
   11.2  Difference between Subjects versus Difference within Subjects
   11.3  Real-World Models versus Statistical Models
   11.4  Relevance versus Significance
   11.5  Exercise Prognostic Factors in Breast Cancer Patients—Part 2

II  Advanced Topics and Techniques

12  Some Useful Technicalities
   12.1  Illustrating Models by Using Model-Based Predictions
   12.2  How to Work with Predictions in Stata
   12.3  Residuals and the Standard Deviation of the Error Term
   12.4  Working with Residuals and the RMSE in Stata
   12.5  Linear and Nonlinear Functions of Regression Parameters
   12.6  Transformations of Regression Parameters
   12.7  Centering of Covariate Values
   12.8  Exercise Paternal Smoking versus Maternal Smoking

13  Comparing Regression Coefficients
   13.1  Comparing Regression Coefficients among Continuous Covariates
   13.2  Comparing Regression Coefficients among Binary Covariates
   13.3  Measuring the Impact of Changing Covariate Values
   13.4  Translating Regression Coefficients
   13.5  How to Compare Regression Coefficients in Stata
   13.6  Exercise Health in Young People

14  Power and Sample Size
   14.1  The Power of a Regression Analysis
   14.2  Determinants of Power in Regression Models with a Single Covariate
   14.3  Determinants of Power in Regression Models with Several Covariates
   14.4  Power and Sample Size Calculations When a Sample from the Covariate Distribution Is Given
   14.5  Power and Sample Size Calculations Given a Sample from the Covariate Distribution with Stata
   14.6  The Choice of the Values of the Regression Parameters in a Simulation Study
   14.7  Simulating a Covariate Distribution
   14.8  Simulating a Covariate Distribution with Stata
   14.9  Choosing the Parameters to Simulate a Covariate Distribution
   14.10  Necessary Sample Sizes to Justify Asymptotic Methods
   14.11  Exercise Power Considerations for a Study on Neck Pain
   14.12  Exercise Choosing between Two Outcomes

15  Selection of the Sample
   15.1  Selection in Dependence on the Covariates
   15.2  Selection in Dependence on the Outcome
   15.3  Sampling in Dependence on Covariate Values

16  Selection of Covariates
   16.1  Fitting Regression Models with Correlated Covariates
   16.2  The “Adjustment versus Power” Dilemma
   16.3  The “Adjustment Makes Effects Small” Dilemma
   16.4  Adjusting for Mediators
   16.5  Adjusting for Confounding—A Useful Academic Game
   16.6  Adjusting for Correlated Confounders
   16.7  Including Predictive Covariates
   16.8  Automatic Variable Selection
   16.9  How to Choose Relevant Sets of Covariates
   16.10  Preparing the Selection of Covariates: Analysing the Association Among Covariates
   16.11  Preparing the Selection of Covariates: Univariate Analyses?
   16.12  Exercise Vocabulary Size in Young Children—Part 1
   16.13  Preprocessing of the Covariate Space
   16.14  How to Preprocess the Covariate Space with Stata
   16.15  Exercise Vocabulary Size in Young Children—Part 2
   16.16  What Is a Confounder?

17  Modelling Nonlinear Effects
   17.1  Quadratic Regression
   17.2  Polynomial Regression
   17.3  Splines
   17.4  Fractional Polynomials
   17.5  Gain in Power by Modelling Nonlinear Effects?
   17.6  Demonstrating the Effect of a Covariate
   17.7  Demonstrating a Nonlinear Effect
   17.8  Describing the Shape of a Nonlinear Effect
   17.9  Detecting Nonlinearity by Analysis of Residuals
   17.10  Judging of Nonlinearity May Require Adjustment
   17.11  How to Model Nonlinear Effects in Stata
   17.12  The Impact of Ignoring Nonlinearity
   17.13  Modelling the Nonlinear Effect of Confounders
   17.14  Nonlinear Models
   17.15  Exercise Serum Markers for AMI

18  Transformation of Covariates
   18.1  Transformations to Obtain a Linear Relationship
   18.2  Transformation of Skewed Covariates
   18.3  To Categorise or Not to Categorise

19  Effect Modification and Interactions
   19.1  Modelling Effect Modification
   19.2  Adjusted Effect Modifications
   19.3  Interactions
   19.4  Modelling Effect Modifications in Several Covariates
   19.5  The Effect of a Covariate in the Presence of Interactions
   19.6  Interactions as Deviations from Additivity
   19.7  Scales and Interactions
   19.8  Ceiling Effects and Interactions
   19.9  Hunting for Interactions
   19.10  How to Analyse Effect Modification and Interactions with Stata
   19.11  Exercise Treatment Interactions in a Randomised Clinical Trial for the Treatment of Malignant Glioma

20  Applying Regression Models to Clustered Data
   20.1  Why Clustered Data Can Invalidate Inference
   20.2  Robust Standard Errors
   20.3  Improving the Efficiency
   20.4  Within- and Between-Cluster Effects
   20.5  Some Unusual but Useful Usages of Robust Standard Errors in Clustered Data
   20.6  How to Take Clustering into Account in Stata

21  Applying Regression Models to Longitudinal Data
   21.1  Analysing Time Trends in the Outcome
   21.2  Analysing Time Trends in the Effect of Covariates
   21.3  Analysing the Effect of Covariates
   21.4  Analysing Individual Variation in Time Trends
   21.5  Analysing Summary Measures
   21.6  Analysing the Effect of Change
   21.7  How to Perform Regression Modelling of Longitudinal Data in Stata
   21.8  Exercise Increase of Body Fat in Adolescents

22  The Impact of Measurement Error
   22.1  The Impact of Systematic and Random Measurement Error
   22.2  The Impact of Misclassification
   22.3  The Impact of Measurement Error in Confounders
   22.4  The Impact of Differential Misclassification and Measurement Error
   22.5  Studying the Measurement Error
   22.6  Exercise Measurement Error and Interactions

23  The Impact of Incomplete Covariate Data
   23.1  Missing Value Mechanisms
   23.2  Properties of a Complete Case Analysis
   23.3  Bias Due to Using ad hoc Methods
   23.4  Advanced Techniques to Handle Incomplete Covariate Data
   23.5  Handling of Partially Defined Covariates

III  Risk Scores and Predictors

24  Risk Scores
   24.1  What Is a Risk Score?
   24.2  Judging the Usefulness of a Risk Score
   24.3  The Precision of Risk Score Values
   24.4  The Overall Precision of a Risk Score
   24.5  Using Stata's predict Command to Compute Risk Scores
   24.6  Categorisation of Risk Scores
   24.7  Exercise Computing Risk Scores for Breast Cancer Patients

25  Construction of Predictors
   25.1  From Risk Scores to Predictors
   25.2  Predictions and Prediction Intervals for a Continuous Outcome
   25.3  Predictions for a Binary Outcome
   25.4  Construction of Predictions for Time-to-Event Data
   25.5  How to Construct Predictions with Stata
   25.6  The Overall Precision of a Predictor

26  Evaluating the Predictive Performance
   26.1  The Predictive Performance of an Existing Predictor
   26.2  How to Assess the Predictive Performance of an Existing Predictor in Stata
   26.3  Estimating the Predictive Performance of a New Predictor
   26.4  How to Assess the Predictive Performance via Cross-Validation in Stata
   26.5  Exercise Assessing the Predictive Performance of a Prognostic Score in Breast Cancer Patients

27  Outlook: Construction of Parsimonious Predictors

IV  Miscellaneous

28  Alternatives to Regression Modelling
   28.1  Stratification
   28.2  Measures of Association: Correlation Coefficients
   28.3  Measures of Association: The Odds Ratio
   28.4  Propensity Scores
   28.5  Classification and Regression Trees

29  Specific Regression Models
   29.1  Probit Regression for Binary Outcomes
   29.2  Generalised Linear Models
   29.3  Regression Models for Count Data
   29.4  Regression Models for Ordinal Outcome Data
   29.5  Quantile Regression and Robust Regression
   29.6  ANOVA and Regression

30  Specific Usages of Regression Models
   30.1  Logistic Regression for the Analysis of Case-Control Studies
   30.2  Logistic Regression for the Analysis of Matched Case-Control Studies
   30.3  Adjusting for Baseline Values in Randomised Clinical Trials
   30.4  Assessing Predictive Factors
   30.5  Incorporating Time-Varying Covariates in a Cox Model
   30.6  Time-Dependent Effects in a Cox Model
   30.7  Using the Cox Model in the Presence of Competing Risks
   30.8  Using the Cox Model to Analyse Multi-State Models

31  What Is a Good Model?
   31.1  Does the Model Fit the Data?
   31.2  How Good Are Predictions?
   31.3  Explained Variation
   31.4  Goodness of Fit
   31.5  Model Stability
   31.6  The Usefulness of a Model

32  Final Remarks on the Role of Prespecified Models and Model Development

V  Mathematical Details

A  Mathematics Behind the Classical Linear Regression Model
   A.1  Computing Regression Parameters in Simple Linear Regression
   A.2  Computing Regression Parameters in the Classical Multiple Regression Model
   A.3  Estimation of the Standard Error
   A.4  Construction of Confidence Intervals and p-Values

B  Mathematics Behind the Logistic Regression Model
   B.1  The Least Squares Principle as a Maximum Likelihood Principle
   B.2  Maximising the Likelihood of a Logistic Regression Model
   B.3  Estimating the Standard Error of the ML Estimates
   B.4  Testing Composite Hypotheses

C  The Modern Way of Inference
   C.1  Robust Estimation of Standard Errors
   C.2  Robust Estimation of Standard Errors in the Presence of Clustering

D  Mathematics for Risk Scores and Predictors
   D.1  Computing Individual Survival Probabilities after Fitting a Cox Model
   D.2  Standard Errors for Risk Scores
   D.3  The Delta Rule

Bibliography

Index
Preface

Regression models are today a standard tool in medical research. However, their application and the interpretation of their results are often challenging. It is the purpose of this book to introduce medical researchers to the basic concepts of regression models and to the other important aspects that have to be taken into account in their application, in order to ensure optimal use and appropriate interpretation of results. In this way the book should support the adequate addressing of subject matter questions.
Three regression models are particularly popular in medical research: the classical regression model for a continuous outcome, the logistic regression model for a binary outcome, and the Cox proportional hazards model for survival data. Additionally, there are other regression models for more specific types of data or for specific situations, for example, Poisson regression for count and incidence data or conditional logistic regression for the analysis of matched case-control studies. We will meet all these models in this book. The emphasis will be, however, on topics of relevance across all types of regression models. Such topics are, for example,
• the interpretation of effect estimates, confidence intervals, and p-values
• the adequate presentation of results of regression analyses in a publication
• typical pitfalls in the interpretation of the results
• the impact of different types of research questions (establishing an effect, quantification of an effect, prediction) and research designs (observational studies, experiments) on the use of regression models
• the preparatory steps prior to a regression analysis, such as the choice of variables, their coding, and the use of transformations
• techniques to extend the scope of regression models, for example allowing nonadditive or nonlinear effects
• the impact of the selection of the sample on the results of a regression analysis
• the impact of measurement error or incomplete covariate data on the results of a regression analysis
• the use of regression models to construct risk scores or to make predictions on the individual level
• techniques to evaluate the quality of such risk scores and predictions
• the communication of the goodness of a model
The book is divided into five parts. In the first part, the basic definitions of the three most popular models are reviewed, and basic techniques like the handling of categorical covariates are discussed. The second and central part of the book focuses on slightly advanced topics and techniques, ranging from the comparison of
regression coefficients over the selection of covariates, the modelling of nonlinear and nonadditive effects, and the analysis of clustered or longitudinal data, to the impact of selection mechanisms, measurement error, and incomplete covariate data. A central aspect will be power considerations, as power is one main limiting factor in using regression models. If studies were large enough, regression models would allow the modelling of very fine structures of our data. But with the typically very limited sample sizes in medical research, we are forced to stick to answering rather crude questions about “the effect” of a covariate, etc. Moreover, as the power depends on the unknown truth, it becomes obvious why it is often difficult in medical research to give definite advice on the optimal way to use regression models.
The third part covers the use of regression models to construct risk scores and predictors. The aim here is to introduce the basic concepts and techniques, which are rather different from those we usually use when focusing on the estimation of effects. In the fourth part I give a short overview of some more specific regression models, specific usages of regression models, and alternatives to regression modelling. These sections cannot go into depth, but should be useful to get an idea about some possibilities when entering situations not covered by the three popular models and their typical use. Further, the question of the goodness of a model is addressed, and many of the general aspects of regression modelling mentioned in the book are summarised in some considerations about the important role of prespecifying models in medical research. Mathematical details behind the estimation and inference techniques of some models are described in appendices in the fifth part.
Additional material can be found on the homepage of the book at www.imbi.uni-freiburg.de/RegModToolInMedRes, in particular, solutions to all exercises and all data sets used in the examples and exercises.
Some principles applied in preparing this book are as follows:
• In the health sciences, researchers are today usually exposed to regression models in two ways: they see the results of a regression analysis as a computer output in their own research, or they see the results of other researchers in publications. This book tries to take these two perspectives into account by referring to the typical way results of a regression analysis are shown in a computer output or in medical journals.
• Most medical journals today require that the limitations of a study are discussed in the publication. If the analysis of a study involves the use of regression models, then such limitations can arise from limitations of the regression modelling, for example, an inadequate modelling or sensitivity to selection processes. For this reason we try to discuss in some detail how and when regression models can fail. But, fortunately, regression models are rather robust against many phenomena, such that often a possible limitation of a study actually vanishes because we use regression models.
• The adequate use of regression models in medical research does not require a deeper understanding of the computational techniques and the mathematical details behind these computations. It only requires an understanding of what a regression model actually models, how we have to interpret regression parameter estimates, confidence intervals, and p-values, and which assumptions we have to
make to justify their use. Hence, we will describe in this book only the basic principles behind the computations done by standard statistical software, and we will omit most details. For those readers who feel more comfortable knowing the mathematical details, some of these details are described in the appendices. This does not imply that this book is free of mathematical notation. We will use some basic mathematics, mainly to describe the regression models, such that we can say precisely what we assume, and for this task mathematics is very useful.
• Most of the aspects of regression models we discuss in this book belong to the “statistical folklore,” that is, a common knowledge the scientific community has built up over decades. I often do not make any attempt to address the history and origin of this wisdom. Hence, there are actually very few references in this book, except when I refer to very recent developments or when I would like to point to good books or overview papers that are suitable for further reading. With respect to the history and origin of the statistical techniques I present in this book, I refer the reader to the books of Weisberg (2005) for the classical regression model, of Hosmer and Lemeshow (1989) and Kleinbaum (1994) for logistic regression, and of Kalbfleisch and Prentice (2002) for the Cox proportional hazards model. Another useful complement to this book is the book by Andersen and Skovgaard (2010), which also presents a joint view on the most popular models, but with more emphasis on statistical and mathematical aspects than this book.
• Any written material on a complex topic like regression models suffers from the problem that the single steps can only be presented in a linear sequence. It will frequently happen that the material presented raises a particular question, which might be addressed 50 pages later. It is often hard to predict which questions a reader will raise. Similarly, it is hard to predict what a reader will expect under a certain topic. To address these problems, each section typically ends with some remarks. The topics covered in these remarks are never essential for the following sections, so it should be possible to follow this book without reading the remarks. And if a reader has trouble understanding a remark, he or she should not worry about this: an understanding of all remarks is not essential for following the book. Some remarks try to address questions the reader may raise in connection with the material presented, and typically they point the reader to a later section where an answer might be found. Some remarks try to explain why the presentation of a certain topic in this book may be different from presentations you can find in other textbooks.
• Although this book tries to address many problems in the practical use of regression models, the reader should not expect a cookbook. The adequate use of regression models requires some insight into regression modelling; regression modelling is far away from being a black-box technique. You will sometimes find suggestions or recommendations in this book, but you should be aware that there are (hopefully) always some arguments behind these suggestions and recommendations, which you should have understood before you can really apply the
suggestions or recommendations. It is not the aim that you can later refer to this book in the “... has said that ...” manner to justify your choice of a particular analysis strategy. You should be able to justify it by your own arguments.
• This book covers a lot of topics. They are selected according to the principle “This is something you should know if working with regression models.” However, this does not imply that every topic is of equal relevance for one's own work. If you feel that a topic is not relevant for your own work, then just try to grasp the main messages, typically summarised at the end in a “nutshell.” If it is more relevant, then try to understand the details also.
• All examples in this book are only for illustrative purposes. Most of the data sets we will use are fake data sets, created for the purpose of this book. However, they are usually motivated by a real study, and they are constructed in such a way that they are close to real data sets. The results of examples and exercises should never be cited, even if we refer to a publication, because the data and their presentation are usually simplified for the use in this book.
• Each practical technique presented in this book is first illustrated using a concrete example without any reference to a specific software package. In this way we try to focus on the understanding of the basic properties of each technique and the “theory” behind it. Only in a second step do we explain the practical implementation in the statistical package Stata in a separate section. Readers who prefer “learning by doing” may work through this section in parallel to reading the other sections. The choice of Stata is somewhat arbitrary, as nearly all techniques considered in this book are today available in all major packages like SPSS, SAS, or R. Stata has the advantage of being very easy to learn, and on the homepage of the book there is a short introduction to Stata focusing only on those aspects that are necessary to follow the exercises and examples of this book. In particular, it should be emphasised that this is not a book about Stata. Stata can do much more than shown in this book and, in particular, Stata's many postestimation facilities are covered in this book only to a limited degree. All examples in the book are written in Stata 11, but as only the very basic Stata commands are involved, they should also work in earlier or later versions of Stata.
I hope that many medical researchers will enjoy reading this book. And I sincerely hope that, in this way, the book will contribute to a better understanding of the role of regression models as a tool in medical research.
Werner Vach
Freiburg, April 2012
Acknowledgments

First of all, I would like to thank all participants of the internet-based courses I offered in the last few years for their comments on the teaching material, which has now been transformed into a book. I am indebted to my colleagues Lars Korsholm and Ivan Iachine for helping to start this project many years ago. My colleagues Sonja Wehberg and Primrose Beryl have contributed many helpful comments on the manuscript. Monika Richards assisted me in preparing some of the figures, and Carolin Fischer and Edith Motschall assisted me in preparing the bibliography. I highly appreciate the many useful comments—not only on the Stata code—provided by Bill Rising from Stata. I am grateful to Dorte Gilså Hansen, Steinbjørn Hansen, and Jan Hartvigsen for their kind permission to use their data in preparing some of the data sets used in the book, and to Merel Ritskes-Hoitinga for providing information on the growth of mice. Many thanks to all my colleagues at the former Department of Statistics at the University of Southern Denmark and at the Institute of Medical Biometry and Medical Informatics at the University of Freiburg for all the discussions about various aspects of regression models. Last but not least, I have to thank all those scientists from the health sciences I met during the last 20 years and who introduced me to so many fields in which regression models can be useful.
About the Author

Werner Vach studied statistics at the University of Dortmund, Germany, finishing his studies with a diploma in 1989. From 1990 to 1998 he worked at the Institute of Medical Biometry and Medical Informatics and at the Centre for Data Analysis and Modeling at the University of Freiburg, Germany. During this time, he obtained a PhD in statistics from the University of Dortmund in 1993. In 1998 he became professor of medical statistics at the Medical Faculty of the University of Southern Denmark. In the following decade he built up a research group in biostatistics there. In 2008 he switched to the Faculty of the Humanities. In 2009 he became professor of medical informatics and clinical epidemiology at the University of Freiburg, Germany. He has worked together with many medical researchers from various fields and has coauthored more than 150 publications in medical journals. He has also contributed to the methodology of biostatistics in the areas of incomplete covariate data, prognostic studies, diagnostic studies, and agreement studies.
Part I
The Basics
Chapter 1
Why Use Regression Models?
In this chapter we present some examples from medical research, where the use of regression models can help answer a basic research question.
1.1 Why Use Simple Regression Models?
In many situations in medical research, we are interested in investigating and quantifying the effect of a variable X on another variable Y. Figure 1.1 illustrates a simple example from an experiment investigating the effect of calcium intake on body weight in mice. We can use the data of such an experiment to try to answer the question:

How much does the (expected) weight of a mouse increase if we increase the calcium dose by a certain amount Δ?

and regression models will help us to give an answer to this question. If prior to the experiment we are in doubt whether calcium has any effect on the body weight of mice, we may use the experiment to answer a more fundamental question:

Does calcium intake have any effect on the body weight of mice?
Figure 1.1 Results of an experimental study.
Here again, regression models provide a framework to answer this question. If we are able to quantify the effect of an increase of calcium intake on body weight, we can inspect how far the effect is away from 0, as an effect of 0 means that calcium intake has no effect on body weight. A third type of question we can ask is

What is the (expected) weight of a mouse fed with a calcium intake of 2.5 mg?

or

How much calcium do we need to give to a mouse so that we can expect a body weight of 20 g?

and regression models provide us with an appropriate framework to answer this type of question, too.
The use of regression models is not restricted to the case of a continuous outcome like body weight. For example, regression models can also be used in the case of a binary outcome, where we would like to investigate how the probability of a certain event depends on some variable X.

1.2 Why Use Multiple Regression Models?

In many situations in medical research, it is not sufficient to investigate the influence of a single variable X on an outcome Y, but — for several reasons — we have to take other variables into account as well. The following examples illustrate some typical situations:

(a) Epidemiological studies on risk factors
Assume that we have collected data on a potential risk factor X—for example, smoking—in a cohort of subjects at some time point t, and that we observe for each subject whether or not a certain disease is diagnosed within the next 5 years. If we now observe that the frequency of the disease is higher in smokers than in nonsmokers, we cannot immediately conclude that smoking increases the risk for this disease. It might be that in our cohort smoking is more frequent in women than in men, and that the disease is also more frequent in women than in men. So it might be that the observed association between smoking and the disease is just a consequence of the fact that both smoking and the disease are associated with gender. In such a situation, the research question of interest is

How big is the association between smoking and the disease that cannot be explained by the association of both smoking and the disease with gender?

Multiple regression models can give an answer to exactly this question. Moreover, they can take several variables into account simultaneously, because in practice we typically have several variables like gender which may disturb the study of the association between smoking and the disease.

(b) Prognostic factor studies
In a prognostic factor study, we try to identify factors which help us to predict whether patients with a specific disease will in the long run experience a certain
event like progression or death. In establishing a new prognostic factor, for example, a new tumour marker, it is not sufficient to establish an association between the marker and the patient's survival. The new tumour marker is only useful if it provides additional prognostic value beyond the established factors. So we have to find an answer to the question

Does the new factor improve the prognosis we can make today based on established factors only?

Once we have identified several prognostic factors, we are usually interested in combining them into a prognostic index describing the risk of a patient for a certain event. For example, in oncology it is popular to develop indices describing the 5-year survival probability of a patient based on patient-specific factors like age and performance status or tumour-specific factors like size and histological grading. The choice of the therapy may then be influenced by the value of the index: If the patient has a low 5-year survival probability, we might choose a more aggressive therapy than if the patient has a high 5-year survival probability. So the basic research question is here:

How can we best combine the values of several prognostic factors to form a prognostic index?

For both questions, multiple regression models provide an appropriate framework.

(c) Diagnostic studies
Diagnostic studies are very similar to prognostic studies. Instead of predicting an event in the future, we would like to determine the current disease status of a subject based on some variables we can measure. As in prognostic factor studies, we might be confronted with the question of evaluating the value of a new diagnostic marker, or we would like to know how to combine several markers into one diagnostic rule. Diagnostic studies are different from prognostic studies in the sense that it is not enough to estimate the probability of having the disease, but that at the end a final diagnosis is required, that is, a yes/no decision. However, this does not preclude the use of multiple regression models.

(d) Multifactorial experiments
In conducting experimental studies, we often try to investigate several factors simultaneously to use the available material (e.g., animals) in an optimal manner. In the example of the last section, we might decide to study simultaneously the effect of vitamin B intake on the body weight of mice. Choosing three dose levels for calcium, three dose levels of vitamin B, and 9 animals, we can apply each single combination of dose levels to exactly one mouse. If we now analyse the effect of calcium, we may ignore the vitamin B dose and just compare the average values within the three calcium dose level groups. In contrast to the example of the epidemiological study, it is allowed to ignore the other variable: Since within each calcium dose group the distribution of vitamin B is balanced (one mouse per vitamin B dose level), a possible effect of vitamin B on the body weight cannot disturb the association between calcium intake and body weight. However, it can be shown that it is still an advantage to take the vitamin B level into account in the analysis, as this increases the power to show an effect of the calcium dose.
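To make the balance argument concrete, one possible allocation of the 9 animals is sketched below (the labels "low", "medium", and "high" are purely illustrative; the text does not state which three calcium and vitamin B doses are used):

                         vitamin B: low   medium   high
   calcium: low                      1        1       1
   calcium: medium                   1        1       1
   calcium: high                     1        1       1
   (number of mice per dose combination)

Within each calcium dose group there is exactly one mouse at each vitamin B dose level, so the vitamin B distribution is identical across the calcium groups, and a possible vitamin B effect cannot distort the comparison of the calcium groups.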
(e) Effect modification
In some areas of medical research, it is of interest to investigate whether the effect of one variable depends on the value of another variable. For example, in treatment research, after demonstrating that a new treatment is superior to a standard treatment in the patient populations of phase III studies, it is of high clinical interest to identify subgroups with no benefit from the new therapy or even a negative benefit. On the other hand, identification of subgroups with a treatment effect much higher than the average effect may contribute to a better understanding of why a treatment is working. Typical variables used to identify such subgroups where the treatment effect is below or above the average effect are age, gender, the disease stage, or—as a recent development—certain genotypes. Any of these variables may modify the treatment effect.
In assessing the effect of occupational exposures, we can often imagine that the presence of other exposures or certain lifestyle factors may increase or decrease the effect of the exposure. For example, a high intake of vitamin C may protect against the negative effect of some exposures, that is, may modify their effect. Knowledge about such effect modification can contribute to a better understanding of the actual mechanisms behind the effect of certain exposures and to the development of preventive strategies.
Multiple regression models do not only provide a framework to describe the effect of some variables on another. They also allow one to describe how these effects may depend on other variables, and hence can also be used to address the question of effect modification.

1.3 Some Basic Notation
As we have seen, in medical research we are often interested in analysing the effect of a set of variables on one variable of interest. In the sequel, we will call this variable of interest the outcome variable, and we will denote it by Y. Variables whose influence on the outcome Y we would like to investigate will be called covariates. We will denote them by X1, X2, ..., Xp, with p being the number of covariates. Multiple regression models are just simple tools to describe the joint influence of the covariates X1, X2, ..., Xp on the outcome Y and to quantify the effect of the single covariates.
Typically, we will have measurements of the outcome variable Y and the covariates X1, X2, ..., Xp for a set of units, which may be—depending on the concrete example—patients, (healthy) subjects, animals, regions, studies, etc. We will number the units from 1 to n, where n denotes our sample or population size. The outcome variable measured at unit i will be denoted by Yi, and the corresponding covariate values by Xi1, Xi2, ..., Xip, such that Xij denotes the value of covariate Xj at unit i. If we would like to stress that we mean a concrete realisation of a covariate Xj or the outcome Y at unit i, we use small letters xij and yi. If we have only one covariate, we typically use X and Xi instead of X1 and Xi1.
Remark: In some textbooks, the outcome variable Y is called the dependent variable, reflecting that this variable is modelled as being dependent on the covariates. Consequently, the covariates are sometimes called independent variables. However, this is somewhat confusing. In the statistical and generally in the scientific literature, two variables are usually called independent or (stochastically) independent if knowledge of the value of one variable does not tell us anything about the value of the other variable. In other words, we call them independent if there is no association or correlation between the two variables. And—as we will see in many examples in this book—it is typical that there is some correlation between covariates; that is, they are not (stochastically) independent.
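As a small illustration of this notation, consider a study with n = 3 units and p = 2 covariates (the variable names and numbers below are hypothetical and do not refer to any data set used in this book):

   unit i    Xi1 (age)    Xi2 (sex)    Yi (outcome)
     1          54            1           17.2
     2          61            0           20.5
     3          47            1           15.9

Here, for example, X22 = 0 is the value of the second covariate at unit 2, and y3 = 15.9 is the observed outcome of unit 3.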
THIS CHAPTER IN A NUTSHELL
Regression models allow one to address a variety of questions in medical research. They are characterised by the interest in the effect of one or several covariates on an outcome of interest.
Chapter 2
An Introductory Example
In this example we illustrate some typical steps in analysing a data set by use of regression models.
2.1 A Single Line Model
Let us assume we are interested in analysing the effect of calcium intake on the growth of mice. We perform a small experimental study, where we add calcium to the food of newborn mice, such that we have food with 1, 2, 3, and 5 mg daily calcium intake. For each calcium level, we use two mice, and the weight 16 weeks after birth is used as an indicator of their growth. The resulting data are shown in Figure 2.1. It looks as though the weight increases with increasing calcium intake. Our interest is now to quantify this relationship. We have here the outcome

Yi = Weight of mouse i at age 16 weeks (in g)

and the single covariate
Xi = Daily dose of calcium (in mg) for mouse i

Figure 2.1 The data set.
Figure 2.2 The data set with the true line.

Figure 2.3 The true line and its interpretation.
To describe the effect of calcium dose on weight, we can use a single line model. In such a model, we assume that there exists a true linear relationship between the dose level X and the expected weight of a mouse exposed to this dose level X. In Figure 2.2, we have added to the observed data a line depicting this true linear relationship. The line indicates that for a mouse exposed to 2 mg/day we would expect a weight of 18.2 g and for a mouse exposed to 4 mg/day we would expect a weight of 20.1 g. This is illustrated in Figure 2.3. By “expected weight” we mean the weight we can observe on an average if we expose a large number of mice to the corresponding dose level. In the sequel we use the symbol μ (x) to denote the expected weight for mice exposed to a calcium dose of x. We can regard Figure 2.3 as the graph of the function μ (x). Now if the graph is a line, then we can represent the function μ (x) in the form
μ(x) = β0 + β1 x

with two specific values for β0 and β1. The graph shown in Figure 2.3 corresponds to the values β0 = 16.38 and β1 = 0.95.
The value of β1 describes the steepness of our line, and it is called the slope parameter. The larger the value of β1, the steeper is the increase of μ(x) with increasing x. The exact interpretation of β1 is as follows: If we increase the calcium dose by 1, then we increase the expected weight by β1. Consequently, if we increase the dose by Δ, then the expected weight is increased by β1 × Δ. This is illustrated in Figure 2.4. The interpretation of the intercept parameter β0 follows directly from the fact that μ(0) = β0. This implies that β0 describes the expected weight of a mouse with a calcium dose of 0. We can regard the relation
μ(x) = β0 + β1 x
Figure 2.4 Illustration of the meaning of β1: If the length of the dashed line is Δ, then the length of the dotted line is β1 × Δ.
as our first regression model. We model here the expected weight as a linear function of the dose, using two regression parameters β0 and β1.

2.2 Fitting a Single Line Model
Assuming a model is not the same as knowing a model exactly. All figures considered so far are based on assumed values for β0 and β1 , but in reality we do not know these values. β0 and β1 are unknown regression parameters, and we have to estimate them. This can be done by fitting a line to the available data, because each line corresponds to a certain choice of β0 and β1 . As shown in Figure 2.5, we can fit many lines to the data, so we need a criterion to select the best-fitting line. The most popular criterion is the so-called least squares criterion. This means, we consider the deviation between the observed values yi and the expected value μ (xi ), that is, the residuals ri := yi − μ (xi ) = yi − β0 − β1 xi , which are illustrated in Figure 2.6. Then we can try to minimise the overall amount of the deviation measured by the sum of the squares of the residuals. The line with the minimal residual sum of squares gives the best-fitting line and the parameters of this line are our estimates for the regression parameters. (For more details, see Appendix A.1.) In our example, we obtain the estimates
β̂0 = 16.2 and β̂1 = 1.03

and the corresponding line is shown in Figure 2.7. So, the results of this experiment suggest that

Increasing the calcium dose by 1 mg increases the expected weight at 16 weeks after birth by 1.03 g.
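For the single covariate case, the least squares criterion has an explicit solution (the derivation is given in Appendix A.1). Stated here only for reference, the estimates minimising the residual sum of squares are

β̂1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)²    and    β̂0 = ȳ − β̂1 x̄ ,

where x̄ and ȳ denote the means of the observed dose values and weights. In practice these computations are done by statistical software.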
Figure 2.5 The data set with several possible lines.

Figure 2.6 Illustration of residuals (dotted lines).

Figure 2.7 The data and the best-fitting line.
The next question may be how to interpret the value of β̂0. As mentioned above, β0 corresponds to the expected weight of a mouse with a calcium intake of 0. However, interpreting β̂0 in this way must be regarded as a risky extrapolation, because in our experiment we have only considered doses between 1 and 5, and our model is unlikely to be valid for lower doses, because a mouse without any calcium intake would probably die within the first 16 weeks of its life. So we should not interpret the intercept parameter. There are other, more useful, things the intercept parameter can be used for. We can, for example, ask what may be the expected weight of a mouse fed with 4 mg calcium each day. Our model implies that the value of μ(4) is the correct answer, and we can estimate this number by inserting the estimated parameters in our model equation:

μ̂(4) = β̂0 + β̂1 × 4 = 16.2 + 1.03 × 4 = 20.4 .
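In Stata, one way to obtain such an interpolated value directly after fitting the model is sketched below (this is only an illustration; the lincom command, which also reports a confidence interval, is introduced in the exercises at the end of this chapter):

. regress weight dose
. * point estimate of the expected weight at dose 4
. display _b[_cons] + 4*_b[dose]
. * the same estimate together with a 95% confidence interval
. lincom _cons + 4*dose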
Figure 2.8 A second data set based on repeating the experiment with other mice.
So we come to the conclusion:

The expected weight 16 weeks after birth of a mouse fed with 4 mg calcium/day is 20.4 g.

Here, we have performed an interpolation, because we did not have the dose level 4 in our experiment, but the neighbouring values 3 and 5.
Remark: It is rather common in medical applications for the intercept parameter not to be interpreted, because β0 corresponds to μ(0), that is, the expected value of the outcome for x = 0, and—if 0 is not within the range of measured x values—any interpretation of μ̂(0) is an extrapolation. The slope parameter β1 usually reflects the quantity of interest, whereas the intercept parameter is just a nuisance parameter we can and should ignore in our analysis.
2.3 Taking Uncertainty into Account
If a second researcher repeats our experiment, he or she will obtain a different data set, for example, the data set shown in Figure 2.8. And consequently, he or she will obtain a different line and different estimates: β̂1 = 1.19 instead of 1.03. So, the conclusions we arrived at above are incomplete, because they ignore the fact that a repetition of the experiment will give different results, and we do not know how different they will be. There exist, however, statistical techniques to describe the uncertainty associated with an estimate such as β̂1 = 1.03. The main techniques are standard errors, confidence intervals, and p-values. We would like to remind the reader now of the definition of these terms when applied to describe the precision of an estimated slope parameter β̂1:

• The standard error describes the expected variability of the slope parameter if we repeat our experiment in the same manner, that is, with the same dose levels and with the same sample size within each dose level. If we repeat the experiment many times, and compute the slope parameter in each experiment, then we can
observe a certain standard deviation among these estimated slope parameters, and this standard deviation is the standard error.
• A 95% confidence interval describes the precision of the estimate. It is always defined in a way such that at least 95% of all confidence intervals we compute cover the true slope parameter. Hence, we can be pretty sure that a 95% confidence interval covers the true slope parameter β1, and consequently a small confidence interval tells us that we have rather precise knowledge about the slope parameter.
• A p-value refers to testing the null hypothesis β1 = 0, that is, the null hypothesis of no effect of X on Y. The p-value is defined as the probability of observing a slope estimate at least as large (in absolute value) as the observed one under the assumption that the null hypothesis is true. The smaller the p-value, the higher the evidence against the null hypothesis. Usually, p-values smaller than 0.05 are regarded as sufficient evidence to reject the null hypothesis, allowing us to state that there is an effect of X on Y. Typically, we say that we have shown a “significant” effect of X on Y.

The output of a computer program to perform a least squares regression applied to our original example typically looks like this:

variable      beta      SE       95%CI                p-value
intercept     16.256    1.803    [11.845, 20.667]     <0.001
dose          1.034     0.577    [-0.378, 2.447]      0.123
------------------------------------------------------------------------------
      weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        dose |   1.034286   .5773455     1.79   0.123    -.3784276       2.447
       _cons |   16.25571   1.802761     9.02   0.000     11.84452    20.66691
------------------------------------------------------------------------------
The regress command produces a lot of output. At present, the reader should ignore most of the output and should look only at the results in the lower part. Here, you can find the estimated regression parameters (called regression coefficients in Stata), the standard errors, confidence intervals, and p-values. Note that Stata abbreviates the intercept parameter as _cons.
If we would like to visualise the fitted line, we can use Stata's lfit command, which plots the least squares line, and we can combine it with scatter:

. scatter weight dose || lfit weight dose
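A closely related variant, shown here only as a sketch and not used in the book's examples, adds a confidence band around the fitted line by using lfitci instead of lfit:

. * least squares line with a pointwise 95% confidence band, data points drawn on top
. twoway (lfitci weight dose) (scatter weight dose)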
Now we use a data set which also includes the variable sex:

. use calciumfull, clear
. list
     +---------------------+
     | dose   sex   weight |
     |---------------------|
  1. |    1     1     20.8 |
  2. |    1     0     14.2 |
  3. |    2     1     18.9 |
  4. |    2     0     16.8 |
  5. |    3     1     21.9 |
     |---------------------|
  6. |    3     0     17.4 |
  7. |    5     1     21.7 |
  8. |    5     0     21.1 |
     +---------------------+
To visualise the relationship between dose, weight, and sex, we make two scatterplots, one for each sex group, and glue them together:

. scatter weight dose if sex==0 || scatter weight dose if sex==1
To fit the model with the two covariates, we can use

. regress weight dose sex

      Source |       SS       df       MS              Number of obs =       8
-------------+------------------------------           F(  2,     5) =    9.50
       Model |  42.5255822     2  21.2627911           Prob > F      =  0.0198
    Residual |  11.1944253     5  2.23888507           R-squared     =  0.7916
-------------+------------------------------           Adj R-squared =  0.7083
       Total |  53.7200076     7   7.6742868           Root MSE      =  1.4963

------------------------------------------------------------------------------
      weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        dose |   1.034286   .3576818     2.89   0.034     .1148356    1.953736
         sex |       3.45   1.058037     3.26   0.022     .7302291    6.169771
       _cons |   14.53071   1.235815    11.76   0.000     11.35395    17.70748
------------------------------------------------------------------------------
We will learn later how to create a graph with the data and the two regression lines.
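As a preview, and only as a sketch (model-based predictions and the predict command are treated in detail in Chapter 12), one possible way to produce such a graph is to compute the fitted values and overlay them on the scatterplots:

. regress weight dose sex
. * fitted values from the model with dose and sex
. predict fitted, xb
. twoway (scatter weight dose if sex==0) (scatter weight dose if sex==1) (line fitted dose if sex==0, sort) (line fitted dose if sex==1, sort)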
2.6 Exercise 5-HIAA and Serotonin
The data set hiaa includes measurements of 5-HIAA urine excretion (μmol/l) and serotonin plasma level (μmol/l) from 40 individuals of different age and sex.
(a) Describe the dependence of the two laboratory parameters on age and sex. (Note: sex==1 indicates a male subject.)
(b) Can we conclude that the 5-HIAA urine excretion varies with age?
(c) What is the expected difference of the 5-HIAA urine excretion between two females differing in age by 13 years? Compute also a confidence interval for this quantity. (You can do this by hand, but you can also use Stata's lincom command. Type help lincom for further information.)
(d) What is the expected 5-HIAA excretion of a 49-year-old male?

2.7 Exercise Haemoglobin
The data set hb includes measurements of the Hb-value for 10,000 German schoolchildren.
(a) Describe the dependence of the Hb value on age and sex. How much does the Hb value increase with each year of age? (Note: age is measured in this data set in months! sex==1 indicates male children.)
(b) Select 3 times randomly 11 children and repeat the above analysis. Observe how often the six 95% confidence intervals for the variables age and sex include the results of (a). (Note: You can assume that the children in this data set are in random order, so you can just use, for example,

. regress hb age sex in 7242/7252

to select a random sample of 11 children.)
(c) If 50 readers of this book perform part (b), such that they have computed 300 times a 95% confidence interval, how often do you expect to find the results of (a) in these intervals?

2.8 Exercise Scaling of Variables
Suppose that five students have performed the calcium experiment together. After finishing the whole experiment, each student wrote down the results on a piece of paper and went home. At home each student types the results into his or her computer and applies the least squares regression procedure of his or her package.
– Student A types the weight of the mice in g, the calcium dose in mg, and uses the codes 0 for females and 1 for males.
– Student B types the weight of the mice in g, the calcium dose in g, and uses the codes 0 for females and 1 for males.
– Student C types the weight of the mice in kg, the calcium dose in mg, and uses the codes 0 for females and 1 for males.
– Student D types the weight of the mice in g, the calcium dose in mg, and uses the codes 1 for females and 2 for males.
– Student E types the weight of the mice in g, the calcium dose in mg, and uses the codes 0 for males and 1 for females.
How do you expect the effect estimates of students B, C, D, and E to differ from the effect estimates of student A?
THIS CHAPTER IN A NUTSHELL
Regression models allow one to estimate regression coefficients for each covariate. These coefficients describe the expected difference in the outcome if we compare two subjects (or other units) differing by 1 unit in the covariate of interest, keeping all other covariate values fixed. Confidence intervals allow one to describe the uncertainty of estimated regression coefficients. The p-values express the evidence we have against the null hypothesis that a covariate has no effect.
Chapter 3
The Classical Multiple Regression Model
In this chapter we briefly introduce the classical multiple regression model.
The two line model of the previous chapter was a first example of the classical multiple regression model. In general, in the classical multiple regression model the expected value of an outcome variable Y is modeled as a function of the covariates X1 , X2 , . . . , Xp . Denoting the expected value of Y given the covariate values x1 , x2 , . . . , x p by μ (x1 , x2 , . . . , x p ), we can express this model as
μ(x1, x2, ..., xp) = β0 + β1 x1 + β2 x2 + ... + βp xp .
The regression coefficients β j in such a model have a clear and unique interpretation: If we compare two subjects with identical covariate values except for a difference in covariate X j , and if the subjects differ by an amount of Δ in this covariate, then the expected values of Y differ between these two subjects by Δ × β j . This follows immediately from the model above, because
μ(x1, ..., xj + Δ, ..., xp) − μ(x1, ..., xj, ..., xp)
   = [β0 + β1 x1 + β2 x2 + ... + βj (xj + Δ) + ... + βp xp] − [β0 + β1 x1 + β2 x2 + ... + βj xj + ... + βp xp]
   = βj (xj + Δ) − βj xj
   = βj Δ .

In particular, if the subjects differ by an amount of 1 in Xj, then the expected values of Y differ by βj. This applies, of course, also to a binary variable Xj, for which βj is just the difference in the expected values between two subjects differing only in this covariate, but sharing all other covariate values.
The general principle to obtain estimates of the regression parameters is the least squares principle; that is, we try to minimise the sum of squares of the residuals

ri = yi − μ(xi1, xi2, ..., xip) = yi − β0 − β1 xi1 − β2 xi2 − ... − βp xip
over all possible values of β0 , β1 , . . . , β p . Statistical software packages solve this minimisation task for us (for more details see Appendix A.2), and they provide us also with confidence intervals and p-values. 21
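As a minimal sketch of how such a model is fitted in practice (the variable names here are hypothetical placeholders, not taken from a data set used in this book), the least squares estimates, confidence intervals, and p-values are obtained in Stata with a single command:

. * classical multiple regression of an outcome y on three covariates
. regress y x1 x2 x3

Each line of the resulting table then corresponds to one regression coefficient βj with its confidence interval and p-value.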
Chapter 4
Adjusted Effects
In this chapter we consider the use of regression models to compute adjusted effects.
4.1 Adjusting for Confounding
Suppose we are interested in proving that smoking increases the blood pressure. We collect data on the amount of smoking during the last 12 months in a sample of 302 subjects and measure their systolic blood pressure today. The resulting data is shown in Figure 4.1, indicating that the systolic blood pressure increases with the daily number of cigarettes smoked. If we fit a simple linear regression model to this data, we obtain an output like

variable     beta      SE      95%CI               p-value
intercept    119.817   0.981   [117.887,121.748]

------------------------------------------------------------------------------
             | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smokem |       1.68   .2210022     3.94   0.000     1.298179    2.174122
------------------------------------------------------------------------------

. lrtest A

Likelihood-ratio test                                 LR chi2(1)  =     12.69
(Assumption: . nested in A)                           Prob > chi2 =    0.0004
and we can continue with testing the effect of the maternal smoking status by

. logistic allergyc allergym

Logistic regression                               Number of obs   =      1125
                                                  LR chi2(1)      =      5.49
                                                  Prob > chi2     =    0.0191
Log likelihood = -690.71179                       Pseudo R2       =    0.0040

------------------------------------------------------------------------------
    allergyc | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    allergym |   1.367219    .181867     2.35   0.019     1.053444    1.774453
------------------------------------------------------------------------------

. lrtest A

Likelihood-ratio test                                 LR chi2(1)  =     22.98
(Assumption: . nested in A)                           Prob > chi2 =    0.0000
Remark: The p-values of the Wald test and the likelihood ratio test shown in Stata's output coincide. However, if you look at the test statistics reported as chi2(1) in the output of the test command and reported as LR chi2(1) in the output of lrtest, you can see that the likelihood ratio test yields slightly larger statistics. This illustrates that the likelihood ratio test has slightly more power than the Wald test.

Remark: Stata stores many results of procedures internally, such that they can be accessed after execution of the command. Both the test and the lrtest command allow the saved results to be inspected using the command return list. This way you can really see that the p-value of the likelihood ratio test is smaller:

. logistic allergyc allergym smokem

Logistic regression                               Number of obs   =      1125
                                                  LR chi2(2)      =     28.48
                                                  Prob > chi2     =    0.0000
Log likelihood = -679.2193                        Pseudo R2       =    0.0205

------------------------------------------------------------------------------
    allergyc | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    allergym |   1.651714   .2328059     3.56   0.000     1.253024    2.177259
      smokem |   1.927018   .2672664     4.73   0.000     1.468348    2.528964
------------------------------------------------------------------------------

. estimates store A
. test allergym

 ( 1)  [allergyc]allergym = 0

           chi2(  1) =   12.68
         Prob > chi2 =   0.0004
. return list

scalars:
              r(drop) =  0
              r(chi2) =  12.67552733594214
                r(df) =  1
                 r(p) =  .0003704727137245
. logistic allergyc smokem

Logistic regression                               Number of obs   =      1125
                                                  LR chi2(1)      =     15.79
                                                  Prob > chi2     =    0.0001
Log likelihood = -685.56407                       Pseudo R2       =    0.0114

------------------------------------------------------------------------------
    allergyc | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smokem |       1.68   .2210022     3.94   0.000     1.298179    2.174122
------------------------------------------------------------------------------

. lrtest A

Likelihood-ratio test                                 LR chi2(1)  =     12.69
(Assumption: . nested in A)                           Prob > chi2 =    0.0004
. return list

scalars:
                 r(p) =  .000367709582478
              r(chi2) =  12.68952542992452
                r(df) =  1
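As a small additional sketch (not part of the original output), the saved results can also be used directly in further computations, for example to print the likelihood ratio p-value with more digits:

. display %10.8f r(p)

This only works as long as no other command has overwritten the saved r() results in the meantime.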
THIS CHAPTER IN A NUTSHELL
The maximum likelihood principle provides us with estimates for the coefficients of a logistic regression model, which are—at least in large samples—optimal. The likelihood ratio test is an alternative to the usual Wald test.
Chapter 8
Categorical Covariates
In this chapter we discuss how to incorporate categorical covariates into a regression model.
8.1 Incorporating Categorical Covariates in a Regression Model
In many applications there is some interest in investigating the effect of a covariate which divides the whole population into several subgroups (not only into two, as in the case of a binary covariate). For example, in the study of allergies in early childhood, the parents can be divided according to their smoking status into four groups. A corresponding covariate X1 can be defined as

    X1 = 1   mother and father are both nonsmokers
         2   mother nonsmoker, father smoker
         3   mother smoker, father nonsmoker
         4   mother and father are both smokers
and we are interested in the differences between the groups with respect to the probabilities

    π(x) = P(Y = 1 | X1 = x)

with

    Y = 1   child does suffer from an allergy
        0   child does not suffer from an allergy .
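As a minimal sketch (assuming the allergy data set with the variables smokep for the four-category parental smoking status and allergyc for the child's allergy status, which are used later in this chapter), these probabilities can be estimated directly from a cross-tabulation in Stata:

. tabulate smokep allergyc, row nofreq

The row percentages for allergyc = 1 then correspond to the estimates π̂(x) shown in Table 8.1.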
Category   π̂(x)    logit π̂(x)
   1       0.22      -1.295
   2       0.33      -0.720
   3       0.28      -0.920
   4       0.40      -0.425

Table 8.1 The four categories defined by X1 and the relative frequency π̂ of children suffering from allergies within each category on the probability and the logit scale.

   k    l    δ̂kl = logit π̂(l) − logit π̂(k)
   1    2     0.575
   1    3     0.375
   1    4     0.870
   2    3    -0.200
   2    4     0.295
   3    4     0.495

Table 8.2 The six pairwise comparisons: Differences in empirical frequency of children with allergies expressed on the logit scale.

We can, of course, simply estimate these probabilities by the corresponding relative frequency of children with an allergy in each category as shown in Table 8.1. However, our aim is to assess the differences between the categories, that is, we are
interested in the six pairwise comparisons and a quantification of the corresponding six differences. We can quantify these differences on the logit scale by introducing the parameters

    δkl = logit π(l) − logit π(k) ,

and using the empirical estimates for π(x) we can, of course, obtain estimates for the parameters δkl as indicated in Table 8.2.

The reader should note that, indeed, any of the six values δkl is of subject matter interest in this application: δ12 describes the effect of "father only smoking", δ13 describes the effect of "mother only smoking", δ14 describes the effect of "father and mother smoking", δ24 describes the effect of "mother smoking additional to father smoking", δ34 describes the effect of "father smoking additional to mother smoking", and δ23 compares the effect of "mother only smoking" with the effect of "father only smoking".

If we would like to approach the estimation of δkl in the framework of a regression model, we have a slight complication. This is related to the fact that if we know, for example, δ12 and δ23, we also know δ13 = δ12 + δ23, and if we know δ12 and δ13 we know δ23 = δ13 − δ12, etc. So although we have six parameters, we have only three "free" parameters in the sense that if we know for example δ12, δ13, and δ14, then we also know δ23 = δ13 − δ12, δ24 = δ14 − δ12 and δ34 = δ14 − δ13. In a regression model formulation we will find only three "free" parameters, and they are typically chosen as

    βx(1) = δ1x   for x = 1, 2, 3, 4

with the convention β1(1) = δ11 = 0. The logistic model for π(x) = P(Y = 1 | X1 = x) now reads

    logit π(x) = β0 + βx(1)   with x = 1, 2, 3, or 4 .

We can simply verify that βx(1) is identical to δ1x by noting

    δ1x = logit π(x) − logit π(1) = β0 + βx(1) − β0 = βx(1) .

If we fit this logistic regression model, we obtain an output like
variable    beta    SE      95%CI             p-value
smokep 2    0.575   0.207   [ 0.168, 0.981]   0.006
smokep 3    0.375   0.203   [-0.024, 0.773]   0.065
smokep 4    0.870   0.163   [ 0.550, 1.190]   <0.001

                                                  Number of obs   =      1125
                                                  LR chi2(6)      =     48.70
                                                  Prob > chi2     =    0.0000
                                                  Pseudo R2       =    0.0351

------------------------------------------------------------------------------
    allergyc |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smokep |
          2  |   .5829577   .2090785     2.79   0.005     .1731714     .992744
          3  |    .512587   .2089995     2.45   0.014     .1029555    .9222186
          4  |   1.016536   .1702369     5.97   0.000     .6828781    1.350194
             |
    allergyp |
          2  |   .3668414   .1755227     2.09   0.037     .0228233    .7108595
          3  |   .5231563   .1870375     2.80   0.005     .1565696     .889743
          4  |   .7586479   .1907552     3.98   0.000     .3847746    1.132521
             |
       _cons |  -1.692582   .1639953   -10.32   0.000    -2.014007   -1.371157
------------------------------------------------------------------------------
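The same model could also have been fitted by constructing the indicator (dummy) variables by hand; the following is only a sketch of that idea (the generated variable names smokep_1, ..., smokep_4 and allergyp_1, ..., allergyp_4 are what tabulate's generate() option produces):

. tabulate smokep, generate(smokep_)
. tabulate allergyp, generate(allergyp_)
. logit allergyc smokep_2 smokep_3 smokep_4 allergyp_2 allergyp_3 allergyp_4

Omitting smokep_1 and allergyp_1 makes category 1 the reference category, exactly as the i. prefix does.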
The i. prefix does nothing else but tell Stata to add the indicator variables described in Section 8.2. These indicators do not remain in the data set, but we can take a look at them by using them in a list command:

. list smokep i.smokep in 1/10
     +--------------------------------------------+
     |             1b.      2.      3.      4.    |
     | smokep  smokep  smokep  smokep  smokep     |
     |--------------------------------------------|
  1. |      3       0       0       1       0     |
  2. |      1       0       0       0       0     |
  3. |      4       0       0       0       1     |
  4. |      4       0       0       0       1     |
  5. |      1       0       0       0       0     |
     |--------------------------------------------|
  6. |      2       0       1       0       0     |
  7. |      4       0       0       0       1     |
  8. |      4       0       0       0       1     |
  9. |      4       0       0       0       1     |
 10. |      4       0       0       0       1     |
     +--------------------------------------------+
. list allergyp i.allergyp in 1/10
     +------------------------------------------------------+
     |                 1b.        2.        3.        4.    |
     | allergyp  allergyp  allergyp  allergyp  allergyp     |
     |------------------------------------------------------|
  1. |        3         0         0         1         0     |
  2. |        2         0         1         0         0     |
  3. |        1         0         0         0         0     |
  4. |        1         0         0         0         0     |
  5. |        1         0         0         0         0     |
     |------------------------------------------------------|
  6. |        4         0         0         0         1     |
  7. |        1         0         0         0         0     |
  8. |        2         0         1         0         0     |
  9. |        1         0         0         0         0     |
 10. |        1         0         0         0         0     |
     +------------------------------------------------------+
We can see that Stata has created indicator variables for the categories 2, 3, and 4 of the covariates smokep and allergyp. Actually, there is also a variable referring to the base (or reference) category, which is just filled with 0s. It is always good to remember how Stata names these indicator variables, as they are used in some commands later.

If we now want to know the additional effect of the father's smoking on top of the mother's smoking, we have to compare category 4 = mother and father are both smokers with category 3 = mother smoker, father nonsmoker. We can do this using Stata's lincom command:

. lincom 4.smokep - 3.smokep

 ( 1)  - [allergyc]3.smokep + [allergyc]4.smokep = 0
------------------------------------------------------------------------------
    allergyc |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   .5039492   .1916113     2.63   0.009     .1283979    .8795004
------------------------------------------------------------------------------

such that we obtain as the estimate of this difference δ̂43 = β̂4(1) − β̂3(1) = 1.02 − 0.51 = 0.50 together with a standard error and a confidence interval.

To test the null hypothesis that the smoking status of the parents is not associated with the allergy status of the child (taking the allergy status of the parents into account), we can use Stata's test or testparm command to obtain the overall p-value:

. test 2.smokep 3.smokep 4.smokep

 ( 1)  [allergyc]2.smokep = 0
 ( 2)  [allergyc]3.smokep = 0
 ( 3)  [allergyc]4.smokep = 0

           chi2(  3) =   36.12
         Prob > chi2 =    0.0000
. testparm i.smokep

 ( 1)  [allergyc]2.smokep = 0
 ( 2)  [allergyc]3.smokep = 0
 ( 3)  [allergyc]4.smokep = 0

           chi2(  3) =   36.12
         Prob > chi2 =    0.0000
Both commands do exactly the same, namely perform a Wald test of the null hypothesis that all three regression coefficients are 0. test requires specifying all parameters, whereas testparm allows the i. notation to be used. In our example we can see that the effect of parental smoking is highly significant: the p-value is less than 0.0001. (Stata shows the value 0.0000, but this means nothing else but that the p-value is so small that it cannot be represented by four decimal digits. Hence, the correct way to report this result is to write "p < 0.0001".)

                                                  Number of obs   =      1125
                                                  LR chi2(6)      =     48.70
                                                  Prob > chi2     =    0.0000
                                                  Pseudo R2       =    0.0351

------------------------------------------------------------------------------
    allergyc |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smokep |
          2  |   .5829577   .2090785     2.79   0.005     .1731714     .992744
          3  |    .512587   .2089995     2.45   0.014     .1029555    .9222186
          4  |   1.016536   .1702369     5.97   0.000     .6828781    1.350194
             |
    allergyp |
          2  |   .3668414   .1755227     2.09   0.037     .0228233    .7108595
          3  |   .5231563   .1870375     2.80   0.005     .1565696     .889743
          4  |   .7586479   .1907552     3.98   0.000     .3847746    1.132521
             |
       _cons |  -1.692582   .1639953   -10.32   0.000    -2.014007   -1.371157
------------------------------------------------------------------------------

. estimates store A
. logit allergyc i.allergyp
Iteration 0:   log likelihood = -693.45852
Iteration 1:   log likelihood = -688.01197
Iteration 2:   log likelihood = -687.99853
Iteration 3:   log likelihood = -687.99853

Logistic regression                               Number of obs   =      1125
                                                  LR chi2(3)      =     10.92
                                                  Prob > chi2     =    0.0122
Log likelihood = -687.99853                       Pseudo R2       =    0.0079

------------------------------------------------------------------------------
    allergyc |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    allergyp |
          2  |   .3579979   .1726649     2.07   0.038     .0195809    .6964149
          3  |   .3292157   .1790193     1.84   0.066    -.0216556    .6800871
          4  |   .5549968   .1820931     3.05   0.002      .198101    .9118927
             |
       _cons |  -1.057454   .1042384   -10.14   0.000    -1.261758   -.8531508
------------------------------------------------------------------------------

. lrtest A

Likelihood-ratio test                                 LR chi2(3)  =     37.78
(Assumption: . nested in A)                           Prob > chi2 =    0.0000
We can again observe that the effect of smoking is highly significant.

Remark: The i. notation in Stata declares a so-called factor variable. You can read more about factor variables by typing help fvvarlist. In particular you can learn how to change the reference category.

8.5 Presenting Results of a Regression Analysis Involving Categorical Covariates in a Table
In most publications, results of a regression analysis are summarised in a table. As long as only binary or continuous covariates are involved, it is straightforward to present the estimated regression coefficients, confidence intervals and p-values. However, when working with a categorical covariate, we have regression coefficients, estimates of pairwise differences, confidence intervals and p-values for all of them, and the overall p-values. So naturally the question arises which of these numbers we should present.

Let us extend our example of the study on allergies in early childhood by considering the following five covariates:
• Allergy status of parents: A categorical covariate with the four categories mentioned above.
• Social class: A categorical covariate with the categories I, II, and III.
• Region: A categorical covariate with the two categories rural and urban.
• Breast feeding: A binary variable (yes or no).
• Age of mother at birth.
covariate                               β̂       95% CI           p-value
Allergy of parents (reference: none)
  mother only                           0.43    [ 0.11, 0.76]    .0085
  father only                           0.30    [-0.04, 0.63]    .083
  both                                  0.86    [ 0.51, 1.20]
Social class (reference: I)
  II
  III
Region
Breast feeding
Age of mother at birth

    X = 1   if age > 50
        0   if age ≤ 50

and the probability π(t, x) of a subject with X = x to die in the interval t (here t = 1, 2, ..., 6), given the subject is still alive at the start of the interval. If we denote with Ỹ the time interval in which the subject dies, we can express π(t, x) as

    π(t, x) = Pr(Ỹ = t | Ỹ ≥ t, X = x) .

If we now want to formulate a regression model for π(t, x), we can start with formulating a model for each value of t. However, since we are interested in quantifying the effect of age by one number, it is convenient to assume that the effect of age is the same in all time intervals. So we can, for example, formulate a model for the logarithm of π(t, x) as

    log π(t, x) = β0(t) + β1 x ,

that is, we have different intercept parameters for each interval, but a common slope parameter.
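One way such an interval-based model could be fitted in practice is as a log-linear binomial model on a person–interval data set. The following is only a sketch under assumptions not made in the original text: it assumes a hypothetical data set with one record per subject and time interval, containing the variables died (1 if the subject died in that interval, 0 otherwise), interval (1–6), and agegroup (the binary X defined above):

. * separate intercept per interval, common age effect on the log scale
. glm died i.interval agegroup, family(binomial) link(log) eform

The eform option would then report exp(β1), that is, the ratio of the death rates between old and young subjects.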
Figure 10.1 Death rates in the first 6 years after radiation therapy, stratified by age.

Figure 10.2 Death rates in the first 6 years after radiation therapy, stratified by age. Note the logarithmic scale of the y-axis.
The reader may wonder why we formulate the model on the log scale and not on the logit scale. One reason for this is that we have observed in Figure 10.1 a constant factor between the death rates of young and old subjects, which corresponds to a constant difference on the log scale, as illustrated in Figure 10.2. Moreover, we can easily relate the slope parameter β1 to the ratios π(t,1)/π(t,0), that is, the ratio of the death rates between old and young patients: We have

    log ( π(t,1) / π(t,0) ) = log π(t,1) − log π(t,0) = β0(t) + β1 − β0(t) = β1

and hence

    π(t,1) / π(t,0) = e^β1 ,

that is, e^β1 is the ratio of interest. Indeed, if we fit this model by the ML principle, we obtain β̂1 = 0.63 and e^β̂1 = 1.87, which fits the observed factor between 1.5 and 2 in Figure 10.1.

Remark: Death rates are usually small, in particular if we consider small intervals. Now for probabilities p close to 0 we have log (p/(1−p)) ≈ log p, as 1 − p is close to 1. Hence, there is actually no big difference between differences on the log scale and on the logit scale.
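This closeness of the two scales for small probabilities is easy to check numerically; a small sketch using Stata's built-in ln() and logit() functions (the value 0.03 is just an arbitrary example, not taken from the data):

. display ln(.03)
. display logit(.03)

Both commands print values of about −3.5, differing only in the second decimal.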
10.2 Modelling the Risk of Dying in Continuous Time
If we know the exact death time Y of each patient in our cohort, the above approach based on death rates in intervals is not completely convincing, as we do not use all available information and our results depend on the intervals chosen. To approach
Figure 10.3 The probability of dying in intervals of 1 year (filled dots), 6 months (empty dots), and 3 months (diamonds) estimated in our data set: On the left side the relative frequency is shown; on the right side, the empirical rates, that is, the relative frequency divided by the length of the interval. The scale of the y-axis in the right figure is probability per year (in %).
this problem, we might decide to decrease the length Δ of the time intervals. However, with decreasing values of Δ the probability to die in a certain interval becomes smaller and smaller and tends to 0. This is illustrated on the left hand side of Figure 10.3. The situation is different if we consider the rate of dying, that is, the probability of dying relative to the length of the time interval, or, in other words, the ratio π(t,x)/Δ. These rates converge to some function h(t, x) as indicated on the right hand side of Figure 10.3. This function h(t, x) is called a hazard function, and we can define it formally as the limit of the rates π(t,x)/Δ if we decrease the length of the intervals to 0:

    h(t, x) = lim(Δ→0) Pr(Y ≤ t + Δ | Y ≥ t, X = x) / Δ .

Note that t is now any possible time point on a continuous time scale. The function h(t, x) describes the hazard of dying of a subject with the covariate value x at time t, given that the subject is still alive at time t. The value of the hazard function is not a probability, but a rate, and hence somewhat difficult to interpret. However, if we compare two subjects i1 and i2 with covariate values xi1 and xi2, respectively, then we can easily interpret the ratio h(t, xi1)/h(t, xi2). This ratio describes just the ratio between the probabilities to die in any short interval after time t of the two subjects, given that both subjects are still alive at time t. This follows from the fact that the probability to die in a short interval of length Δ after time t is approximately h(t, xij) × Δ for both subjects, and in taking the ratio of these two probabilities we obtain the ratio h(t, xi1)/h(t, xi2). Hence, such ratios, called hazard ratios, are easy to interpret.

The famous Cox proportional hazards model (Cox, 1972) is now nothing else but a regression model for the hazard function, which results in parameters that can be interpreted as (the logarithm of) such hazard ratios. The model looks identical to the
model considered in the last section:

    log h(t, x) = log h0(t) + β1 x .

The only difference is that h0(t) is no longer just a finite set of values, but a function defined for all values t ≥ 0, the so-called baseline hazard function. We can of course extend this model to the case of several covariates X1, X2, ..., Xp. Then it reads

    log h(t, x1, x2, ..., xp) = log h0(t) + β1 x1 + β2 x2 + ... + βp xp .

The interpretation of the parameters is as usual: If we compare two subjects differing only in covariate Xj, and in this covariate by Δ, then the difference between the hazard functions of these two subjects is βj × Δ on the log scale at all time points t. Or, equivalently, the ratio between the two hazard functions is e^(βj × Δ) at all time points t. If Δ = 1, then the ratio of the two hazard functions is e^βj, and hence many statistical packages report these numbers just as hazard ratios (HR). It is also possible to say that the hazard rate increases by the factor e^βj. The assumption that the hazard functions of two subjects show a constant ratio at all time points t can also be expressed by saying that the two functions are proportional. This explains the name "proportional hazards model."

The Cox proportional hazards model is fitted by maximising a so-called partial likelihood. This principle is close to the maximum likelihood principle, and the two share many properties. In particular, the estimates are consistent and approximately normal in large samples, which can be used to define standard errors, confidence intervals, and p-values.

It is an important feature of fitting the Cox proportional hazards model by maximising a partial likelihood that this principle allows censored observations to be incorporated in an easy manner. The presence of (right) censored observations, that is, subjects for whom we only know that they have survived until a certain time point, but for whom we do not know the exact time of death, is typical in medical applications. In the example of the previous section, we have 237 subjects who are still alive after 6 years, and we have stopped following this cohort any longer, so that is all we know. All these subjects have survival times that are censored after 6 years. In practice, such a cohort will typically comprise all subjects treated in a certain period at a hospital, whom we can follow until today. Then subjects still alive today are censored, but for some this may mean that we know that they have survived the first 2 years after treatment, and for others that they have survived 5 years after treatment. Consequently, the time point of censoring varies from subject to subject. It is no problem to incorporate also this type of censoring in fitting a Cox proportional hazards model. Actually, any (right) censoring is allowed. We only have to ensure that this censoring is noninformative, that is, that it does not tell us anything about the risk of dying of a patient compared to a patient with the same covariate values who is still alive and not censored at this time point. For example, if patients in the cohort decide to start another treatment at another hospital and we are unable to follow them any longer, we only know that the patient was alive until this decision. However, in this situation censoring might be informative, as it tells us that the patient was not satisfied by the first treatment and hence might have an increased risk
Figure 10.4 Kaplan-Meier curves comparing two groups of subjects and corresponding hazard ratios based on fitting a Cox proportional hazards model. (The six panels correspond to hazard ratios of 0.61, 1.09, 1.48, 1.73, 2.29, and 2.66.)
of dying compared to other patients in the cohort who have survived and remain in our cohort.

It is important to note that the Cox proportional hazards model can be used for any type of time-to-event data, not only the classical survival data with time to death. You can use it, for example, to analyse the influence of covariates on the time until a patient leaves the hospital after surgery, on the time until a patient needs a second prescription after the first prescription, or on the time until a patient pays his or her bill after contact with a GP. You can even use it in situations in which "time" might be measured on an unusual scale. For example, in small children it is known that their vocabulary size may be a better marker for their development than age. Then it may be of interest to look at an outcome like time until the first three-word sentence on the time scale vocabulary size.
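In Stata, the censoring information just discussed is supplied when the data are declared as survival data; the following is only a minimal sketch, assuming a hypothetical data set with the analysis time stored in time and an event indicator died that equals 0 for censored subjects:

. stset time, failure(died)

All subsequent st commands, such as stcox in Section 10.4, then automatically treat the observations with died == 0 as right censored.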
10.3 Using the Cox Proportional Hazards Model to Quantify the Difference in Survival Between Groups
If we want to compare the survival of patients between groups, the standard analysis in the medical literature is to draw the two corresponding Kaplan-Meier curves and to apply the log rank test to assess the significance of the difference. If, in addition, we want to quantify this difference, the hazard ratio estimated by fitting a Cox proportional hazards model with a corresponding binary covariate can be used. Figure 10.4 shows several examples, which may also give the reader an idea of what a certain hazard ratio means if we think in terms of survival functions instead of hazard functions.

In the example considered in the first section we actually have two covariates, age
Figure 10.5 Kaplan-Meier curves corresponding to the combinations of the two covariates age and radiation dose.
and radiation dose. If we represent them as two binary variables,

    X1 = 1   if age > 50
         0   if age ≤ 50

and

    X2 = 1   if radiation dose high
         0   if radiation dose low

we can consider the four possible combinations of the values of X1 and X2. The four corresponding Kaplan-Meier curves are shown in Figure 10.5. We can fit a Cox proportional hazards model of the type

    log h(t, x1, x2) = log h0(t) + β1 x1 + β2 x2

and obtain an output like

variable   beta     SE      95%CI            p-value
age        0.685    0.080   [0.527,0.842]
treat      0.178    0.077   [0.027,0.329]

                                                  LR chi2(2)      =     78.11
                                                  Prob > chi2     =    0.0000

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    agegroup |   1.983285   .1593592     8.52   0.000     1.694299    2.321561
       treat |   1.194707   .0919163     2.31   0.021      1.02748    1.389152
------------------------------------------------------------------------------
Stata presents parameter estimates of the Cox model as hazard ratios, that is, as e^βj. You can also request parameter estimates on the log hazard scale:

. stcox age treat, nohr

         failure _d:  died == 1
   analysis time _t:  time

Iteration 0:   log likelihood = -4286.7773
Iteration 1:   log likelihood = -4247.7342
Iteration 2:   log likelihood =  -4247.724
Refining estimates:
Iteration 0:   log likelihood =  -4247.724

Cox regression -- Breslow method for ties

No. of subjects =          918                  Number of obs    =        918
No. of failures =          681
Time at risk    =      1208737
                                                LR chi2(2)       =      78.11
Log likelihood  =    -4247.724                  Prob > chi2      =     0.0000

------------------------------------------------------------------------------
          _t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    agegroup |   .6847546   .0803512     8.52   0.000     .5272693      .84224
       treat |   .1779011   .0769363     2.31   0.021     .0271088    .3286934
------------------------------------------------------------------------------
You can also use the i. prefix in stcox, and commands like lincom (which allows an hr option to express results as hazard ratios) or test can likewise be used after the stcox
command. As a small example, you can see here how you can perform a (partial) likelihood ratio test to test whether there is a difference between low and high radiation dose:

. estimates store A
. stcox age, nohr

         failure _d:  died == 1
   analysis time _t:  time

Iteration 0:   log likelihood = -4286.7773
Iteration 1:   log likelihood = -4250.4047
Iteration 2:   log likelihood = -4250.3946
Refining estimates:
Iteration 0:   log likelihood = -4250.3946

Cox regression -- Breslow method for ties

No. of subjects =          918                  Number of obs    =        918
No. of failures =          681
Time at risk    =      1208737
                                                LR chi2(1)       =      72.77
Log likelihood  =   -4250.3946                  Prob > chi2      =     0.0000

------------------------------------------------------------------------------
          _t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    agegroup |   .6690755   .0800623     8.36   0.000     .5121564    .8259946
------------------------------------------------------------------------------

. lrtest A

Likelihood-ratio test                                 LR chi2(1)  =      5.34
(Assumption: . nested in A)                           Prob > chi2 =    0.0208
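The log rank test mentioned in Section 10.3 is also available once the data have been declared with stset; the following line is only a sketch added here for completeness (it is not part of the original example), using the binary dose variable treat:

. sts test treat

By default sts test performs the log rank test comparing the survival curves of the groups defined by treat.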
If you do not know how to make Kaplan-Meier plots with Stata, you can find here a small example:

. sts graph, by(age treat)

         failure _d:  died == 1
   analysis time _t:  time

10.5 Exercise Prognostic Factors in Breast Cancer Patients—Part 1
The data set breast contains information on the prognostic factors age, tumour grading (in three levels), tumour size (in cm), and lymph node status.¹ The latter is coded as

    0   no lymph nodes affected
    1   1–3 lymph nodes affected
    2   ≥ 4 lymph nodes affected

Additionally, you find information on the survival time (in days) after treatment and an indicator of whether the patient died or was censored. Consider first a model only with the covariates age and tumour size. Try to answer the following questions:

(a) How much does the risk of dying increase with an increase of the tumour size by 1 cm?
(b) How much does the risk of dying increase with an increase of age by 10 years?
(c) What is the ratio of the probability of dying within 3 months after treatment between a 75-year-old patient with a tumour size of 3 cm and a 55-year-old patient with a tumour size of 4 cm?
(d) Can we conclude from this data that an increase of the tumour size by 2 cm is associated with at least a 50% increase of the risk of dying?
(e) Try now to take all four prognostic covariates into account. Is it reasonable to model the two categorical covariates lymph node status and grading as continuous ones?
(f) Can we state that all prognostic factors have independent prognostic value?

¹ This data set has been motivated by the study by Hansen et al. (2000).
THIS CHAPTER IN A NUTSHELL
The Cox proportional hazards model is a very well-suited regression model for survival data. It describes the effect of a covariate as differences on the log hazard scale or as hazard ratios.
Chapter 11
Common Pitfalls in Using Regression Models
In this chapter we discuss some common pitfalls in the interpretation of the results of regression models.
11.1 Association versus Causation
If we fit a regression model to a data set and we obtain a regression parameter estimate β̂j (with sufficient precision), we can conclude that if we compare two subjects differing in covariate Xj (and only in Xj) by Δ, then the expected value of Y / the probability of Y = 1 on the logit scale / the log hazard function of Y at any time point t differs by about β̂j × Δ. This is the explanation given so far, and it is the correct one. However, if we report results of a regression analysis, we usually use a shorter description; for example, we say that β̂j is the estimated effect or just the effect of the covariate Xj. We may argue that this is just an abbreviation of the complex description given above. However, we must be aware that a phrase like effect has a certain meaning in common language and scientific language, and we have to justify that we can interpret the results of the regression analysis in accordance with the meaning of "effect" in common or scientific language.

If we talk about the effect of a covariate Xj on an outcome variable Y in common language, we mean something like that a difference between two subjects in this covariate is, or may be, responsible for some difference in the outcome variable. We often express this by saying that differences in Xj have caused or are causal for differences in Y. For example, if we observe—as in Section 4.1—that with increasing amount of smoking the blood pressure increases, we would like to conclude that smoking a lot causes high blood pressure.

If we use a regression model, the only thing we do is describe associations. For example, if we compare two subjects with a certain difference in Xj, we have to expect "some difference in Y," and we can quantify this "difference in Y" in relation to the difference in Xj in an adequate manner using regression models. Whether this association is due to a "causal effect" or can be explained in another way is a question we cannot answer based on the results of our regression model. It is a question we have to discuss based on our background knowledge, which may allow us to identify or exclude certain explanations for the associations described.
So whenever we start to interpret a regression parameter estimate as the "effect" of a covariate, we have to think about whether this might be misleading. And there are, indeed, many situations in which such an interpretation is misleading. In the following, we discuss some typical examples.

• Hidden "causal" variable
Many classical examples of the misinterpretation of associations as causal effects are characterised by the fact that there is a common, "causal" variable in the background. For example, if data from general practitioners on the number of counselings and the number of prescriptions within one year is collected and a regression model with Y = number of prescriptions and X = number of counselings is fitted in order to investigate whether GPs talking a lot with their patients avoid drug treatment, then it is unlikely to find the expected negative regression coefficient. It is more likely to find a positive association, because the bigger the patient population of a GP and the more hours a GP works, the higher will be the overall number of counselings and the overall number of prescriptions. So, practice population size and working hours are here two variables which may act as "causal" variables in the background.

• Hidden confounders
Whenever results of a regression analysis are interpreted, the question can be raised whether it has been forgotten to adjust for an important confounder. This is a point typically to be addressed in the discussion of a paper, and sometimes it might become some type of competition between an author and the reviewer, whether the latter can find more potential confounders than the author (cf. Section 16.5). There are many questions one would like to address by an epidemiological study, but which are (nearly) impossible to address, because we have just too many hidden confounders. One typical example is the question whether sticking to a vegetarian diet is beneficial or perhaps harmful. It is a straightforward idea just to look at some health outcomes in a group of vegetarians and to compare it with a group of nonvegetarians. However, we have to expect that vegetarians also differ in many other lifestyle variables from nonvegetarians, so that it is hopeless to address this type of question in such a simple design. Many of these lifestyle variables have to be considered as potential confounders, and whenever we decide to measure some of them, we can easily find another potential confounder of relevance not measured.

• Reversal of causality
It may happen that we believe that a covariate X has some (positive) effect on an outcome Y, but that a study indicates that the effect goes in the opposite direction. A classical example is the so-called "healthy worker effect" we can often find in studies on occupational risk factors, where more vigorous occupations are associated with lower mortality or morbidity rates than less vigorous occupations. The usual explanation is that it is not the vigorousness of the occupation which influences the health of the subjects, but the health of a subject determines whether he or she can stand a vigorous occupation or not. We have seen such an effect in the example on the relation between physical occupation and lower back pain
(Exercise 8.6), where we failed to show an association between the type of occupation and the presence of lower back pain when both were measured at the same time point. Here, the outcome lower back pain has probably already partially influenced the choice of the occupation.

In summary, it may be questioned whether it is a good idea to refer to regression coefficients as "effects," or whether one should decide to use a more neutral term. However, there are no good candidates. It is also so common to use this term that it is difficult to change this tradition. And in some settings (for example, the experimental one considered in Chapter 2), it is highly justified to use this term. So it just always has to be kept in mind that the "effect" of a covariate described and assessed in a regression model is in the first place just a description of an association, that any further and deeper interpretation has to be justified by subject matter considerations, and that the term "effect of a covariate" (or similar ones like "influence") may be misleading.

In justifying a "causal" interpretation of an effect, regression models might be very helpful in the process of excluding possible explanations for an association (or its absence). This is exactly what we do if we decide to adjust for potential confounders: We would like to exclude the possibility that the observed association is due to an association of the covariate of interest with the confounder and a simultaneous association between the confounder and the outcome of interest.

11.2 Difference between Subjects versus Difference within Subjects
If we consider a covariate whose values can be changed within an individual within a short time period, we often interpret the effect estimate of a regression analysis—which describes the difference between subjects—as the effect we have to expect if an individual changes his or her value of this covariate over time. This is, for example, an implicit assumption in all analyses of the effect of smoking on some outcome if we compare smokers and nonsmokers (and adjust for the relevant confounders). If we, for example, observe that a difference of 10 cigarettes daily corresponds to a certain difference in the blood pressure, we usually interpret it in the way that if an individual reduces his or her smoking by 10 cigarettes per day, he or she can expect a corresponding decrease in blood pressure. We must be aware that such an interpretation is not necessarily correct, and there are several reasons why such a transfer of between-subjects effects to within-subject effects can be dangerous.

The first reason is that subjects may be used to the covariate value they have and that their whole physiology or lifestyle has been adapted to this value. So a change of the covariate value may not be very beneficial, as it forces the subject or his or her body to perform a new adaptation, and the individual might be just too old to do it without developing serious health problems. So it might be better for an individual who has had a certain value for many years to stick to this personal value than to change to a value which might be associated with a lower risk.

A second reason is that subjects may have adapted their value of Xj to a value which is optimal for them. For example, there exist many studies looking at the effect of low dose alcohol consumption on some health outcomes, and some of them come
to a conclusion like "a glass of red wine each day is optimal." This is typically based on the result that in this subgroup the lowest rate of health problems was found. However, this does not justify recommending one glass of wine each day to all subjects. It might be that for some subjects the "optimal" value is 0, for some 1, and for some 2. The crucial point is that we cannot exclude (or perhaps really have to expect) that many subjects have found out their "optimal" value and stick to it. For example, if you get a headache from one glass of red wine, you probably stick most days to your optimal dose of 0. So, forcing more subjects to drink 1 glass of wine may just imply that more subjects deviate from their optimal dose, and we get more health problems in the whole population.

A third reason arises from the fact that it might be questionable whether subjects can really change the value of one covariate without changing the value of another covariate. For example, it has been discussed that carpeted floors may constitute a risk for children to develop atopic diseases. We may find in a regression model that there is a difference between children living in a house with carpeted floors or not, taking all relevant confounders into account. However, if we recommend to parents with children at risk for an atopic disease to remove the carpets, the parents may instead buy flowers to make their home pretty again, or they may buy a cat or a dog for the child, because parents want their children to have again something which is cosy. So they may start to do something which is not beneficial with respect to the risk of developing atopic diseases.

All these problems illustrate that we have to be careful with interpreting the "effect" of a covariate observed in a regression analysis based on comparisons between subjects as the "effect" we can expect when changing the value of the covariate within a subject. This may be a valid interpretation, but it may also be an invalid interpretation. So whenever we would like to make such an interpretation (in particular, to justify a certain intervention), we need additional arguments beyond the results of a regression analysis to justify it, for example, by arguing against the presence of any of the three reasons mentioned above.

11.3 Real-World Models versus Statistical Models
If we use the term "model" outside of the field of statistics, we typically mean some ideas or theoretical constructs which can help to explain or understand phenomena of the real world. One typical example is Niels Bohr's model of the atom, describing it as a small nucleus and electrons moving on circular orbits around the nucleus and explaining the emission of electromagnetic energy at certain levels. Often, such models include explicit formulas to describe relations between certain quantities, for example, the predator-prey cycle in population biology, relating dynamically the size of a predator population to the size of a prey population.

Regression models often look like real-world models, as we try to explain the values of the outcome Y by the values of the covariates X1, X2, ..., Xp. However, there is no guarantee that they are real-world models. We have introduced regression models as tools to describe the mechanism generating the values of Y if X1, X2, ..., Xp is known. In this sense, they are "models,"
Figure 11.1 A data set with two variables V1 and V2. Left side: The data and the fitted regression line with V2 as outcome and V1 as covariate. Right side: The data and the fitted regression line with V1 as outcome and V2 as covariate.
as they do not just describe the observed data but take into account that we are interested in certain aspects of the underlying mechanism. This is also reflected by supplementing estimates of regression parameters with confidence intervals and p-values, reminding us always that we would like to generalise our results beyond the actual data set. On its own, a regression model can in the best case only describe the data generating mechanism adequately, and this can be, but need not be, a real-world model. Whether we can interpret a regression model as a real-world model or not is a matter of interpretation.

Many of the pitfalls mentioned in the two previous sections stem exactly from (over)interpreting a regression model as a real-world model. For example, the question of causality is a typical aspect of a real-world model, but not of a pure statistical model. And also the interpretation of between-subjects effects as within-subject effects is exactly the step from the statistical model to a real-world model. Fortunately, many regression models are considered in the medical literature precisely because we have a strong belief that they are real-world models, so these problems are not always present. But we should also remember that there can be a difference.

Using just the available data, there is usually no way to find out whether a statistical model is a real-world model or not. This is illustrated in Figure 11.1: If we have two variables V1 and V2, we can always use V1 as outcome and V2 as covariate, or vice versa, and both regression models give a nice fit to the data, although probably at most one of the two models is a real-world model in the sense that we can (causally) explain one variable by the other.

There are even instances in which regression models which are completely contrary to the real-world model can be justified. For example, in forensic medicine there is often some interest in reconstructing a certain amount of a substance in a body in the past based on the current measurement. So, in a study the subjects may be exposed at baseline to a certain amount of the substance and the concentration in
the blood or urine is measured 24 hours later. From a real-world perspective the 24 hour measurements are a function of the baseline values. However, for the purpose of predicting the baseline values from the 24 hour values we may consider a regression model with the baseline values as outcome and the 24 hour measurement as covariate. Only if we have a very good real-world model in the sense of a pharmacokinetic model may we go the other way and then invert the relation found by the real-world model.

11.4 Relevance versus Significance
One of the most frequent misuses of p-values and the concept of significance is to interpret significance as a proof of the importance of a covariate. This is a very common misinterpretation. A small p-value tells us only that we have evidence against the null hypothesis H0: βj = 0. This evidence depends on the size of the effect, but it also depends on the sample size. An increasing sample size implies that the precision of our estimates increases, and hence small regression coefficients can also become significant if our sample is large enough. A small p-value only tells us that the estimated regression coefficient is big enough relative to its standard error (and hence to the sample size) to allow rejection of the null hypothesis of no effect. It does not tell us anything about the magnitude of the effect. This is illustrated in Figure 11.2: with increasing sample size, the p-values become smaller and smaller, even if the regression parameter estimate is constant.

Instead of focusing on the significance, it is more important to ask whether an estimated effect is of relevant magnitude. If we look at the example of the Hb values in schoolchildren, we observe a regression coefficient of 0.011. As age is measured in months in this data set, this means that with one month of age the Hb values increase by 0.011. This increase looks very small, and taking the observed variation of the Hb value (roughly between 10 and 14) in the population into account, 0.011 seems to be irrelevant. However, if we compare children 10 years apart, we have to expect a difference of 0.011 × 12 × 10 = 1.32. And this is relevant, because if we, for example, use our results to construct norm intervals in dependence on age, the interval for children at age 16 would lie 1.32 above the interval for children at age 6.

In general, to judge the relevance of a regression parameter for a continuous covariate, it usually does not make sense to look directly at the regression coefficient, as this value depends on the unit we have chosen to measure the covariate. It is usually wise to translate this coefficient to a difference between subjects who are at the lower and upper end of the "normal" range of the covariate of interest, such that we can judge whether the difference in Y suggested by our model for two such individuals is relevant. For example, to judge the relevance of the effect of smoking, we may compare no smoking with 2 packages per day. In many situations, the upper and lower 10% percentiles of the distribution of Xj can give a useful choice for the two values to be compared; a small sketch of such a computation is shown below.
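As a minimal sketch of this percentile-based translation (the variable name age and the assumption that a model containing age has just been fitted are placeholders, not an analysis from this book), one could compute in Stata:

. summarize age, detail
. display "difference over the central 80% range: " (r(p90) - r(p10)) * _b[age]

Here r(p10) and r(p90) are the saved 10% and 90% percentiles of the covariate, and _b[age] is the estimated regression coefficient, so the displayed number is the expected difference in the outcome between two subjects at these two percentiles.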
Figure 11.2 Four subsamples of different size of the data set used in Exercise 2.7. The haemoglobin (Hb) value is plotted against the age of the child (in months); the two different symbols refer to boys and girls. In addition, the estimated regression coefficient of age (adjusted for gender) and the corresponding p-value are shown. (The estimated coefficients and p-values in the four panels are β = 0.0107, p = 0.3654; β = 0.0110, p = 0.1302; β = 0.0111, p = 0.0082; and β = 0.0106, p = 0.0003.)

                       N      n      %       OR     95% CI        p-value
fatigue          A    233    101   43.3%
                 B    223    118   52.9%    1.47   [1.02,2.12]     0.041
cardiac events   A    233      8    3.4%
                 B    223     15    6.7%    2.03   [0.84,4.88]     0.115

Table 11.1 Results from a randomised clinical trial with respect to the occurrence of the side effects fatigue and cardiac events. N = number of patients in treatment arm, n = number of patients with side effect, "OR" = odds ratio.

Let us take a look at another example. Table 11.1 reports the results of the occurrence of two side effects in a clinical trial comparing a standard therapy A with a new therapy B. Logistic regression was used to compute (unadjusted) odds ratios
to compare the frequencies of each side effect between the two treatment arms. A simple look at the p-values would suggest that we have evidence for a difference with respect to the occurrence of fatigue, but not with respect to the occurrence of cardiac events. Nevertheless, the results with respect to cardiac events are here of higher relevance. We can observe that cardiac events are nearly twice as frequent under the new therapy B compared to the standard therapy A, implying an odds ratio in the magnitude of 2. If this difference reflects the truth, then it will probably imply that therapy B will never come into use. On the other hand, the frequency of fatigue is only increased by a factor of 1.5 on the odds scale. If this increase reflects the truth,
it would probably not prohibit recommending therapy B, if the therapy demonstrates a clear benefit with respect to its efficacy, for example, a distinct improvement in overall survival. Fatigue is a side effect which may be acceptable or manageable, for example, by giving patients some advice on handling episodes of fatigue. So, in the final judgement and presentation of the results, we have to focus on the results for cardiac events, not on the results for fatigue. Of course, we cannot be sure that therapy B increases the risk of cardiac events, but it is a likely possibility. The following may be an adequate description:

Our study demonstrates a moderate, but significant increase of fatigue under the new therapy. Of higher relevance is, however, a doubling of the frequency of cardiac events. This may imply a severe objection against the new therapy. However, the difference is not statistically significant, and hence further investigations have to clarify the true cardiac risk implied by therapy B.

Of course, we should not only refer to further investigations, but also try by ourselves to investigate whether other data on the new treatment or on treatments with a similar mechanism may already now allow further insights into the cardiac risk of the treatment. Also, a comparison of the rate of cardiac events under therapy A with previous investigations of this treatment can be helpful, as they may indicate that the rate under treatment A is low by chance in this study, and that the rate under treatment B is actually close to what had been observed under treatment A in other studies.

In summary, any reasonable interpretation of the results of a regression analysis with respect to the effect of a covariate should take into account both the p-value and the effect estimate, together with the uncertainty of the latter represented by the confidence interval.

Remark: The reason for the conflict between the magnitude of the odds ratios and the magnitude of the p-values in the last example is the difference in prevalence: Cardiac events are rather rare, whereas fatigue is rather frequent. This implies that it is more difficult to assess differences in frequencies for cardiac events than for fatigue, which is also visible in the difference in the width of the confidence intervals. This influence of the prevalence of the outcome on the power of a regression analysis is further discussed in Section 14.2.

Remark: There exists no uniform strategy to assess the relevance of an effect. The approach to this can be very different from application to application, and it has to be chosen in the light of the subject question. It is often wise to look at the potential impact for individuals, decisions, or society. In Chapter 13 we will discuss further techniques which might be helpful to assess the relevance of an effect.

11.5 Exercise Prognostic Factors in Breast Cancer Patients—Part 2
Which of the four prognostic factors considered in Exercise 10.5 would you regard as having a relevant effect?
THIS CHAPTER IN A NUTSHELL
Regression models allow one to estimate the "effect" of a covariate in a well-defined manner, but this does not necessarily coincide with what we mean by "effect" in daily life. Hidden "causal" variables or confounders may be one reason, the difference between between-subject differences and within-subject differences another. The interpretation of any effect has to take into account the magnitude of the effect and its relevance, not just the p-value. In any case, the interpretation of the results of a regression analysis requires common sense and relating the results to background subject knowledge.
Part II
Advanced Topics and Techniques
Chapter 12
Some Useful Technicalities
12.1 Illustrating Models by Using Model-Based Predictions
It is common to all regression models considered so far that we model a certain quantity of interest for each subject as a (linear) function of the covariates. In the classical regression model, this is the expected value of the outcome, in the logistic model it is the probability of Y = 1 (on the logit scale), and in the Cox model the (logarithm of the) hazard to die at time t, given one has survived until time t. Once we have fitted a regression model and have obtained estimates of the regression parameters, we can consider for each subject an estimate for the quantity of interest by inserting the estimates in the model equation. We have actually done this already several times for single subjects with specific covariate values, for example, in Exercise 2.7 and Exercise 6.6. Often the numbers obtained this way are referred to as "predictions," and we stick to this terminology for the moment. Later, in Chapters 24 and 25, we will present a more subtle look at this topic, when the assessment of individual risks is the real focus of an investigation. However, for the moment we just use these numbers as a technical tool in illustrating certain aspects of a model.

One simple example of illustrating a model is to add the fitted regression line to the data. We have already done this on several occasions, and Figure 12.1 repeats two of our examples. In both cases, the lines are just the predictions according to the fitted model in dependence on the covariate values. In the case of the logistic model on the right side of the figure, the predictions are shown on the probability scale.

Another useful illustration arises when computing the predictions for all subjects of a data set and looking at the distribution of these values. This way we can judge how good a model is in discriminating between different subjects. For example, in the data set allergy2 used in Exercise 6.6 we can compute for each child the probability of developing an allergy according to the fitted model. If we visualise the distribution of these values (Figure 12.2), we can observe that the model allows identification of low-risk children with a risk below 0.2 as well as high-risk children with a risk above 0.5 (cf. question d.3 of Exercise 6.6). However, the vast majority of the children have just a risk between 0.2 and 0.35, that is, show a rather limited variation. So although the model has been very useful for investigating the effects of the four covariates, it is not useful to separate all children clearly in a low-risk or a high-risk category.
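A distribution of predictions such as the one in Figure 12.2 could be produced along the following lines; this is only a sketch under the assumption that a logistic regression model has just been fitted to the allergy2 data with logistic or logit:

. predict phat, pr
. histogram phat, percent

Here the pr option of predict stores the predicted probabilities in the new variable phat, and histogram displays their distribution in percent. Section 12.2 discusses the predict command in more detail.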
Figure 12.1 Two examples of illustrating a fitted model.

Figure 12.2 The distribution of the probability of developing an allergy according to the fitted model for all children in the data set allergy2.
12.2 How to Work with Predictions in Stata
Stata offers a predict command to compute model-based predictions after fitting any type of regression model. The command requires specifying a new variable name to be used to store the predictions. We start by looking at the use of predict after fitting the classical regression model in order to visualise a model with two regression lines (cf. Section 2.4):

. use calciumfull, clear
. regress weight dose sex

      Source |       SS       df       MS              Number of obs =       8
-------------+------------------------------           F(  2,     5) =    9.50
       Model |  42.5255822     2  21.2627911           Prob > F      =  0.0198
    Residual |  11.1944253     5  2.23888507           R-squared     =  0.7916
-------------+------------------------------           Adj R-squared =  0.7083
       Total |  53.7200076     7   7.6742868           Root MSE      =  1.4963

------------------------------------------------------------------------------
      weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        dose |   1.034286   .3576818     2.89   0.034     .1148356    1.953736
         sex |       3.45   1.058037     3.26   0.022     .7302291    6.169771
       _cons |   14.53071   1.235815    11.76   0.000     11.35395    17.70748
------------------------------------------------------------------------------
Next we compute the model-based predictions, which are stored in the variable pred:

. predict pred
(option xb assumed; fitted values)

. list

     +--------------------------------+
     | dose   sex   weight       pred |
     |--------------------------------|
  1. |    1     1     20.8     19.015 |
  2. |    1     0     14.2     15.565 |
  3. |    2     1     18.9   20.04929 |
  4. |    2     0     16.8   16.59929 |
  5. |    3     1     21.9   21.08357 |
     |--------------------------------|
  6. |    3     0     17.4   17.63357 |
  7. |    5     1     21.7   23.15214 |
  8. |    5     0     21.1   19.70214 |
     +--------------------------------+
As pred is now a variable in the data set, we can use it in a line command to draw the lines according to the estimated model:

. scatter weight dose if sex==0 || scatter weight dose if sex==1 ||    ///
>       line pred dose if sex==0, lpat(solid) lcol(black) ||           ///
>       line pred dose if sex==1, lpat(solid) lcol(black) legend(off)

resulting in a graph like
[Graph: weight plotted against dose for male and female mice, with the two fitted regression lines]
A smarter way to obtain such a graph is to use the c(L) option of the line command, which allows several lines to be drawn: Stata connects the points as long as the variable on the x-axis is increasing, but starts a new line if the values decrease. So if we first sort the data set by the sex of the mice, we can also use

. sort sex dose
. list

     +--------------------------------+
     | dose   sex   weight       pred |
     |--------------------------------|
  1. |    1     0     14.2     15.565 |
  2. |    2     0     16.8   16.59929 |
  3. |    3     0     17.4   17.63357 |
  4. |    5     0     21.1   19.70214 |
  5. |    1     1     20.8     19.015 |
     |--------------------------------|
  6. |    2     1     18.9   20.04929 |
  7. |    3     1     21.9   21.08357 |
  8. |    5     1     21.7   23.15214 |
     +--------------------------------+
. scatter weight dose if sex==0 || scatter weight dose if sex==1 ||    ///
>       line pred dose, c(L) legend(off) lpat(solid)
resulting in the same graph. Next we consider the example of visualising a logistic dose response model. We use the example of Section 6.2, but restrict ourselves to the cells not exposed to an increased oxygen level:

. use toxic, clear
. keep if oxygen==0
(700 observations deleted)
We start with fitting the logistic model:

. logistic damage dose

Logistic regression                               Number of obs   =        700
                                                  LR chi2(1)      =     383.63
                                                  Prob > chi2     =     0.0000
Log likelihood = -260.77586                       Pseudo R2       =     0.4238

------------------------------------------------------------------------------
      damage | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        dose |   1.119994   .0093131    13.63   0.000     1.101889    1.138397
------------------------------------------------------------------------------
We use predict to compute for each dose the probability of damage according to our model. After fitting a logistic model, predict automatically computes the predictions on the probability scale:

. predict prob
(option pr assumed; Pr(damage))
Now we can plot the line corresponding to the fitted logistic model:
. line prob dose

[Graph: Pr(damage) plotted against dose]
As there are only a few dose levels in the data set, the line does not look very smooth. However, predict also works if we apply it to a new data set, so we now create a new data set with dense values of the dose. We start by forgetting everything about the current data (clear), then set the number of observations to 81, and then create a variable dose with values from 0 to 80. (_n is an internal variable available in any data set in Stata, corresponding to the observation number.)
. clear
. set obs 81
obs was 0, now 81
. gen dose=_n-1
. list in 1/10

     +------+
     | dose |
     |------|
  1. |    0 |
  2. |    1 |
  3. |    2 |
  4. |    3 |
  5. |    4 |
     |------|
  6. |    5 |
  7. |    6 |
  8. |    7 |
  9. |    8 |
 10. |    9 |
     +------+
. predict prob
(option pr assumed; Pr(damage))

. line prob dose

[Graph: Pr(damage) plotted against dose, now as a smooth curve]
The graph now looks much nicer. But we may miss the raw data for a comparison. So we now save the data for the curve:

. save linedata, replace
file linedata.dta saved
and then compute the empirical frequencies

. use toxic, clear
. keep if oxygen==0
(700 observations deleted)
. collapse (mean) freq=damage, by(dose)
. list

     +-------------+
     | dose   freq |
     |-------------|
  1. |   10    .1  |
  2. |   20    .28 |
  3. |   30    .53 |
  4. |   40    .77 |
  5. |   50    .91 |
     |-------------|
  6. |   60    .98 |
  7. |   70    .99 |
     +-------------+
and then append the data for the curve and plot both in one graph:

. append using linedata
. scatter freq dose || line prob dose, legend(off)

[Graph: observed frequencies of damage and the fitted logistic curve plotted against dose]
Finally, we consider how to generate a graph like that shown in Figure 12.2:

. use allergy2, clear
. logistic allergyc allergym allergyf smokef smokem

Logistic regression                               Number of obs   =       1125
                                                  LR chi2(4)      =      48.39
                                                  Prob > chi2     =     0.0000
Log likelihood = -669.26113                       Pseudo R2       =     0.0349

------------------------------------------------------------------------------
    allergyc | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    allergym |   1.593087   .2283784     3.25   0.001     1.202859    2.109911
    allergyf |   1.368414   .1855205     2.31   0.021     1.049101    1.784916
      smokef |   1.716233   .2430943     3.81   0.000     1.300195    2.265394
      smokem |   1.602263   .2367101     3.19   0.001     1.199449    2.140356
------------------------------------------------------------------------------

. predict prob
(option pr assumed; Pr(allergyc))

. twoway histo prob, start(0.1) width(0.05) percent
Remark: If we are interested in obtaining predictions on the logit scale after using logistic, then we can use the xb option of the predict command. With this option, the linear predictor β̂_0 + β̂_1 x_i1 + β̂_2 x_i2 + ... + β̂_p x_ip is obtained for each subject.

Remark: Commands like predict work because Stata internally saves many results of any regression command. After any regression command in Stata, you can use the command ereturn list to see which information is stored internally. The meaning of all these numbers is explained in the help file of each command.
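As a minimal illustration of these two remarks (a sketch assuming the logistic model for allergyc fitted above is still the most recent estimation command; the variable name lp is introduced here only for illustration), one could type:

. predict lp, xb
. ereturn list

The first command stores the linear predictor on the logit scale in the new variable lp; the second lists the internally stored results, for example the number of observations e(N), the log likelihood e(ll), and the coefficient vector e(b).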
12.3 Residuals and the Standard Deviation of the Error Term
One typical use of predictions is the computation of residuals. Remember that in the classical regression model we can define the error for each subject in the data set as the difference between Y_i and the true expectation:

e_i = Y_i − μ(x_i1, x_i2, ..., x_ip)

(cf. Section 5.1). If we have fitted a model, we have estimates

μ̂(x_i1, x_i2, ..., x_ip) = β̂_0 + β̂_1 x_i1 + β̂_2 x_i2 + ... + β̂_p x_ip

of the true expectations μ(x_i1, x_i2, ..., x_ip). So we can now try to approximate the errors e_i by the residuals

r_i = Y_i − μ̂(x_i1, x_i2, ..., x_ip) = Y_i − (β̂_0 + β̂_1 x_i1 + β̂_2 x_i2 + ... + β̂_p x_ip)

mentioned already in Section 2.2 and Section 5.1, and we can try to examine properties of the errors e_i by examining the residuals. For example, we can make a histogram of the residuals to look for single outliers, that is, single observations Y_i much farther away from the regression line than all other observations.
Figure 12.3 Histogram of the residuals from fitting a regression model with the outcome systolic blood pressure and the covariates alcohol consumption and smoking to the data set used in Section 4.1.
One interesting question is often how far the observations Y_i typically are from the true regression line. A simple histogram of the residuals allows a quick answer. Figure 12.3 shows the distribution of the residuals in the example considered in Section 4.1 with the systolic blood pressure as outcome. We can observe that most residuals are in the range between −20 and 20; that is, the observations are rather close to the model. However, some subjects have residuals up to values of 40 or above, which are rather large deviations, in particular if we take into account that in our data set the observations on systolic blood pressure range only from 79 to 175.

A traditional way to quantify the deviation of the observations from the true regression line is to look at the standard deviation σ_e of the error term. This is, of course, unknown, but we may expect to get an estimate by using the empirical standard deviation of the residuals, that is,

√( 1/(n−1) Σ_i r_i² ).

However, this is slightly too optimistic. As the residuals stem from fitting the best line to the observed data, they are on average slightly smaller than the true errors. So the correct way to estimate the standard deviation of the error term is to use the formula

σ̂_e = √( 1/(n−(p+1)) Σ_i r_i² ) = √( 1/(n−(p+1)) Σ_i (Y_i − (β̂_0 + β̂_1 x_i1 + β̂_2 x_i2 + ... + β̂_p x_ip))² )

with n − (p + 1) instead of n − 1 in the denominator. This quantity is also called the root mean squared error (RMSE) in the statistical literature. In the example with the systolic blood pressure, we obtain an RMSE of 12.6. As this is a standard deviation, we can apply the so-called 2σ rule: If the errors are roughly normally distributed (and Figure 12.3 supports this assumption), then 95% of the errors are roughly between −2 × σ̂_e and 2 × σ̂_e, that is, between −25.1 and 25.1, which corresponds nicely with the impression from the histogram.

Remark: Unfortunately, there exists no simple analogue to residuals and the root
mean squared error in the logistic model and the Cox model. This is just a simple consequence of the fact that we cannot define something like an error term in these models in a comparable and straightforward way.

12.4 Working with Residuals and the RMSE in Stata
In Stata the root mean squared error can be found in the output of the regress command. In addition, we can obtain residuals using the resid option of the predict command. So we can approach the example used in the previous section in the following manner:

. use sbp, clear
. regress sbp alcohol smoking

      Source |       SS       df       MS              Number of obs =     302
-------------+------------------------------           F(  2,   299) =   56.30
       Model |  17765.5618     2  8882.78091           Prob > F      =  0.0000
    Residual |  47174.1468   299  157.773066           R-squared     =  0.2736
-------------+------------------------------           Adj R-squared =  0.2687
       Total |  64939.7086   301   215.74654           Root MSE      =  12.561

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     alcohol |   .0758207   .0205325     3.69   0.000     .0354143    .1162272
     smoking |   .3031225   .0554814     5.46   0.000      .193939     .412306
       _cons |   118.7952   1.000166   118.78   0.000     116.8269    120.7634
------------------------------------------------------------------------------

. predict resid, resid
. list in 1/10

     +------------------------------------------+
     | id   smoking   alcohol   sbp       resid |
     |------------------------------------------|
  1. |  1         0         0    99   -19.79518 |
  2. |  2        23      11.7   130    3.345896 |
  3. |  3         0         0   116   -2.795184 |
  4. |  4         0         0   140    21.20482 |
  5. |  5        23      20.5   139    11.67867 |
     |------------------------------------------|
  6. |  6         0         0   123    4.204816 |
  7. |  7        16     108.3   159    27.14347 |
  8. |  8        21     116.4   131     -2.98629 |
  9. |  9         4      76.4   124   -1.800378 |
 10. | 10         7         0   121    .0829587 |
     +------------------------------------------+
. twoway histo resid, freq
and we obtain a graph similar to Figure 12.3.
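As a small check (a sketch assuming the regress and predict commands above have just been run; the variable name resid2 is introduced here only for illustration), we can reproduce the RMSE by hand from the residuals, using n − (p + 1) = 302 − 3 in the denominator:

. gen resid2 = resid^2
. quietly summarize resid2
. display sqrt(r(sum)/(e(N) - 3))
. display e(rmse)

Both display commands should give, up to rounding, the Root MSE of 12.561 reported by regress.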
12.5 Linear and Nonlinear Functions of Regression Parameters
We have already seen on several occasions that Stata allows computation of the standard error of, and confidence intervals for, linear functions of the regression parameters using the lincom command. Stata also allows consideration of nonlinear functions of regression parameters, for example, ratios of regression parameters. Such ratios are of interest, for example, if logistic dose-response models are fitted as in Section 6.2 and there is an interest in the dose which implies a certain probability of the response. If the interest is in the dose d_20, implying a response with probability 20%, and if the dose is the only covariate X_1 in the model, then d_20 satisfies the relation

0.2 = π(Y = 1 | x_1 = d_20)

or

logit 0.2 = logit π(Y = 1 | x_1 = d_20) = β_0 + β_1 d_20,

which implies

d_20 = (logit 0.2 − β_0)/β_1 = (log(0.2/0.8) − β_0)/β_1 = (log 0.25 − β_0)/β_1 = (−1.386 − β_0)/β_1
and we obtain an estimate of d_20 as

d̂_20 = (−1.386 − β̂_0)/β̂_1 .

Once we have computed this estimate, we are, of course, also interested in its precision, and this is what we can obtain with Stata's nlcom command. So in the example of Section 6.2, if we are interested in the dose which damages cells with a probability of 20% without adding oxygen, we can perform in Stata the following steps:

. use toxic, clear
. keep if oxygen==0
(700 observations deleted)
. logistic damage dose

Logistic regression                               Number of obs   =        700
                                                  LR chi2(1)      =     383.63
                                                  Prob > chi2     =     0.0000
Log likelihood = -260.77586                       Pseudo R2       =     0.4238

------------------------------------------------------------------------------
      damage | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        dose |   1.119994   .0093131    13.63   0.000     1.101889    1.138397
------------------------------------------------------------------------------

. nlcom (log(0.2/0.8) - _b[_cons])/_b[dose]

       _nl_1:  (log(0.2/0.8) - _b[_cons])/_b[dose]
------------------------------------------------------------------------------
      damage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _nl_1 |   16.66973    1.41877    11.75   0.000     13.88899    19.45047
------------------------------------------------------------------------------
and we obtain an estimate of 16.7 with a confidence interval of [13.9, 19.5], suggesting rather precise knowledge of the dose that causes cell damage in 20% of the cells.

Remark: The statistical theory behind the lincom command is quite simple: If the regression parameter estimates are normally distributed, then any linear combination is normally distributed, too, and the standard error can be computed from the variance-covariance matrix of the parameter estimates. The statistical theory behind the nlcom command is slightly more complicated, as nonlinear combinations are only approximately normally distributed, and the standard errors can also only be computed approximately, using the so-called delta rule outlined in Appendix D.3.

Remark: The nlcom command expects to refer to the estimated regression parameters using the _b[varname] notation. This reflects the fact that the internally stored regression parameters can be accessed as a vector with name _b, indexed by the names of the covariates.

12.6 Transformations of Regression Parameters
In some situations, there is some interest in transformations f(β_j) of a regression parameter. Odds ratios and hazard ratios are simple examples, with f denoting the exponential function. If f is a monotone increasing function (i.e., if x < x′ implies f(x) < f(x′)), then there exist two ways to compute a confidence interval for f(β_j). We can take the confidence interval [l, u] for β_j and transform the limits of the interval by applying f; that is, we use [f(l), f(u)] as confidence interval. Or we can try to compute the standard error SE_f of f(β̂_j) and compute confidence intervals of the type f(β̂_j) ± 1.96 SE_f, and this is exactly what the nlcom command does. So now we have two ways to compute a confidence interval, and this raises a simple question: Which is the better way? The answer is in principle rather simple: The first approach is based on the assumption that β̂_j is normally distributed, the second on the assumption that f(β̂_j) is normally distributed. So we should prefer the approach for which the assumption is more valid. As a simple rule of thumb, the distribution of estimated regression parameters is typically close to a normal distribution, so the first approach is preferable. This is also the reason why Stata uses the first approach in computing confidence intervals for odds ratios and hazard ratios in the logistic model and the Cox model, respectively. You may have already realized that the confidence intervals of odds ratios and hazard ratios are typically asymmetric around the estimate, which is due to using this approach. The question becomes more complicated if we consider transformations of
several parameters as in the previous section. For example, if we are interested in a ratio like (−1.386 − β_0)/β_1, we may apply the log transformation

log((−1.386 − β_0)/β_1) = log(−1.386 − β_0) − log(β_1),

compute a confidence interval for this quantity using nlcom, and then backtransform the limits by applying the exponential function. Does this give better confidence intervals? A general answer is difficult, and we will point out one possibility of arriving at an answer in Chapter 14. We should, however, mention that ratios of parameters often give rise to distributions far away from a normal distribution if the denominator can come close to 0, as then the ratios can become very large. So as a rule of thumb, the confidence interval for the denominator alone should be distinctly away from 0.
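The two ways of computing a confidence interval for a transformed parameter can be illustrated with the dose response example of Section 6.2. This is only a sketch: logit is used instead of logistic so that the coefficient (rather than the odds ratio) is reported, and the hand-computed limits use 1.96 instead of the exact normal quantile, so the last digits may differ slightly from Stata's own output.

. use toxic, clear
. keep if oxygen==0
. logit damage dose
. display exp(_b[dose])                                          // odds ratio of dose
. display exp(_b[dose] - 1.96*_se[dose]) "  " exp(_b[dose] + 1.96*_se[dose])
. nlcom exp(_b[dose])

The second display line corresponds to the first approach (transforming the limits of the confidence interval for the coefficient), whereas nlcom corresponds to the second approach (a symmetric interval around the odds ratio itself).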
12.7 Centering of Covariate Values
It is a basic property of effect parameters in a regression model that they are invariant to adding or subtracting a constant from a covariate. For example, in analysing the effect of systolic blood pressure on some health outcome, we may have the idea of using the difference from 120 as covariate, as 120 is often regarded as an optimal value of the systolic blood pressure. This idea looks nice at first sight, but it actually has no impact on the effect estimate. This follows simply from the definition of β_j: It describes the change in the outcome when comparing two subjects differing only in the covariate X_j by 1 unit. And subtracting 120 from the blood pressure for both subjects does not change the difference between the two subjects.

The situation is different for the intercept. The intercept is by definition the expected outcome or the probability of Y = 1 on the logit scale, respectively, for a subject with all covariates equal to 0. Hence, if we subtract a constant from a covariate, the meaning of 0 changes. For example, if we use systolic blood pressure as a covariate, the intercept refers to a subject with a systolic blood pressure of 0, and is hence meaningless. If we use the difference from 120, the intercept refers to a subject with the optimal value. So if we ensure that for all covariates the value 0 is of interest for us, the intercept acquires a meaning. If a model includes only continuous covariates, one simple way to arrive at a meaningful interpretation of the intercept is to center each covariate, that is, to subtract the mean of each covariate from the covariate. Then the intercept refers to a subject for whom all covariates are equal to their mean values. Such a subject may be interpreted as "the average subject." If a model also includes binary or categorical covariates, there are several ways to define such an "average subject", for example, by choosing the most frequent category as reference category.

However, all these considerations are only relevant if there is a priori an interest in the intercept. As this is rarely the case, centering of covariates is not a standard technique and should be used only in special cases. We will later see some examples in which centering can be helpful.

Remark: In older textbooks on regression models, a general recommendation can often be found to center covariates in order to improve the numerical stability of the
maximisation algorithms used to compute estimates. As modern statistical packages have built-in centering, this is no longer necessary today.
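In Stata, centering can easily be done by hand. The following sketch uses the blood pressure data from Section 12.4; the variable names alcoholC and smokingC are introduced here only for illustration. The slope estimates are unchanged compared to the uncentered analysis, but the intercept now refers to a subject with average alcohol consumption and average smoking.

. use sbp, clear
. quietly summarize alcohol
. gen alcoholC = alcohol - r(mean)
. quietly summarize smoking
. gen smokingC = smoking - r(mean)
. regress sbp alcoholC smokingC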
12.8 Exercise Paternal Smoking versus Maternal Smoking
It has been postulated that, with respect to the risk of children developing allergies, the effect of maternal smoking is twice as large as the effect of paternal smoking. Try to falsify this statement using the data set allergy2, used previously in Exercise 6.6.
THIS CHAPTER IN A NUTSHELL Beyond the simple estimation of the effect of covariates, regression models can often provide further information. Computation of model-based predictions, residuals, or nonlinear combinations of regression coefficients can add useful insights.
Chapter 13
Comparing Regression Coefficients
In many applications, there is some interest in comparing the (absolute) size of regression coefficients in order to rank them according to their “strength” or “importance.” In this chapter, we discuss some strategies to approach this aim and their limitations.
13.1 Comparing Regression Coefficients among Continuous Covariates
Table 13.1 shows results from a study investigating differences among general practitioners (GPs) in their prescription behaviour with respect to antidepressant drugs. (This example is motivated by the study of Hansen et al. (2003).) The results shown are based on all GPs with a solo practice in a well-defined region, covering one calendar year. The outcome of interest Y is the fraction of patients (in %) receiving at least one prescription of an antidepressant drug among all patients enrolled at a GP, standardised for sex and age. Different covariates measurable at each practice were investigated with respect to their association with this fraction by linear regression, and the results are shown in Table 13.1. The regression coefficients vary highly in magnitude, but this is mainly due to the fact that the covariates are measured on different scales. For example, the median of Workload is about 5200, whereas the medians of Experience, Counselling and Drug prescription are about 10, 0.36, and 60, respectively. Hence, the regression coefficient of Workload is probably small because the unit "1 patient" is small compared to the unit "1 year" for Experience.

This example illustrates that it is in general not possible to compare regression coefficients directly, because a regression coefficient β_j describes the effect of changing the covariate X_j by one unit, and hence its size depends on the scale we use to measure the covariate. For example, in Table 13.1 the regression coefficient of Experience is −0.024, which looks rather small. However, if we use Decades in Practice instead, we obtain a 10-fold value (i.e., −0.249), which looks more impressive. (For the same reason, the effect of Age looks small in most applications, because we are used to measuring age in years, and the effect of being 1 year older is usually small.)

So if we want to compare regression coefficients, we must ensure that they are measured on comparable scales or that we express them in a comparable manner. One approach is to consider the difference between a subject for whom the covariate x_j is equal to the 10% percentile of the distribution of X_j and a subject for whom the covariate is equal to the 90% percentile.
covariate                                                    β̂           95% CI
Experience (Number of years in practice)                   −0.024       [−0.057, 0.0074]
Workload (Number of surgery consultations/year/GP)         −0.00012     [−0.00033, 0.000083]
Counselling (Number of counsellings/year/100 patients)     −0.98        [−1.94, −0.03]
Drug prescription (Number of patients with a
  prescription/year/100 patients)                            0.100       [0.035, 0.165]

Table 13.1 Association of different covariates with the standardised prevalence of prescriptions of antidepressants. Regression coefficients of linear regression models together with 95% confidence intervals are shown.
covariate              β̂        95% CI
Experience            −0.48     [−1.10, 0.14]
Workload              −0.47     [−1.23, 0.30]
Counselling           −0.61     [−1.21, −0.02]
Drug prescription      1.49     [0.52, 2.47]

Table 13.2 Association of different covariates with the standardised prevalence of prescriptions of antidepressants. Regression coefficients of a linear regression model together with 95% confidence intervals are shown. All coefficients are multiplied by the 80% range.
This can be easily achieved by multiplying the observed regression coefficient β_j by the observed difference Δ_j between the 90% percentile and the 10% percentile. In Table 13.2 we have applied this technique to the data of our example. For example, the effect of Counselling is now −0.61 instead of −0.98, as the 90% percentile of Counselling is 0.62, the 10% percentile is 0.00, the difference is 0.62, and −0.98 × 0.62 = −0.61. Now it becomes obvious that Drug prescription has the biggest effect, whereas the effects of the other covariates are rather small. Note that we can also obtain confidence intervals just by multiplying the boundaries by Δ_j. Additionally, it is now also easier to judge the relevance of the effect: The median prevalence of the prescription of antidepressants is about 4%, and hence a difference of 1.49 percentage points between practices in the upper and lower end of the distribution indicates that Drug prescription can really explain some deviation from the median. On the other hand, a factor like Workload with a difference of −0.47 percentage points can only explain deviations from the median of a magnitude that is probably not relevant.

In general, we can try to compute a quantity Δ_j for each covariate which describes a change on the corresponding scale that we regard as comparable among the covariates, and then define the standardised coefficients as
β̂_j^ST = β̂_j × Δ_j
and can transform the standard error and the boundaries of the confidence intervals in the same manner. One possible choice for Δ_j is a range between certain percentiles, as used above. It is usually not wise to use the overall range, that is, the difference between maximum and minimum, as these values can be influenced by single outliers. More useful alternatives are measures of the spread of X_j, for example, the standard deviation.

Instead of standardising the regression coefficients, we can also standardise the covariates, that is, apply the transformation X_j^ST = X_j/Δ_j and fit a model using these standardised covariates to obtain new regression coefficients. However, this way we obtain exactly the same standardised regression coefficients as with the first approach; the two approaches are equivalent. In my experience, most people intuitively accept that dividing a variable by its standard deviation is something useful, whereas dividing by a difference between percentiles seems stranger. In contrast, most people can understand the idea of expressing an effect as the difference between subjects corresponding to certain percentiles, whereas they have difficulty understanding why it is useful to consider the difference between two subjects differing by 1 standard deviation. Hence, although mathematically equivalent, both interpretations are useful, depending on how the scaling factor Δ_j is constructed.

Standardisation of covariates can also be performed by categorisation. For example, we can divide each covariate into 5 groups of equal size (often called quintiles), code these groups by 1, 2, 3, 4, and 5, and use this new covariate as a continuous covariate in the regression model. The resulting regression coefficients have the simple interpretation of describing the change in the outcome if a subject moves from one quintile to the next. Such a change may be regarded as comparable among different covariates. The results of applying this idea to our example are shown in Table 13.3.

covariate              β̂        95% CI
Experience            −0.14     [−0.31, 0.03]
Workload              −0.15     [−0.40, 0.09]
Counselling           −0.17     [−0.33, −0.00]
Drug prescription      0.43     [0.17, 0.69]

Table 13.3 Association of different covariates with the standardised prevalence of prescriptions of antidepressants. Regression coefficients of simple linear regression models together with 95% confidence intervals are shown. All covariates are divided into quintiles.

We can again observe that Drug prescription has the biggest effect, and we are again able to judge the relevance of the effect: Comparing practices in the upper and lower quintile, we have to expect a difference of 4 × 0.43 = 1.71 in the prevalence of AD prescriptions, and as the median prevalence is about 4%, this is relevant. However, we must be aware that we now fit a slightly different model, whereas previously the model itself was only reparametrised, not changed. The pros and cons of categorising a covariate will be discussed further in Section 18.3.

In applying any method of standardisation, we have to be aware that the standardised regression coefficients depend not only on the fitted regression model, but also on the population chosen. If, for example, two researchers draw samples from the same population, but the first restricts the age range to 30 to 65 whereas the second allows subjects between 18 and 75, then they may obtain very similar unstandardised regression coefficients for the effect of age, but the standardised coefficients will be quite different, as the variation of age will be larger in the second study. So whenever we have selected subjects in dependence on certain covariate values, we have to be careful with the interpretation of standardised coefficients. In our example, we can argue that we have used the complete population of all GPs in the county of Funen, so that there was no selection. However, even when using unselected populations, we still have population dependence. If, for example, the same study is repeated in another country, it might happen that some of the covariates have a higher spread, because, for example, practitioners in other countries have greater or less freedom to choose how much of their time they spend on counselling. Then differences in the standardised coefficients between the two studies may mainly reflect differences in the variation of the covariates between the countries.

All these points illustrate a general limitation in comparing standardised coefficients. The techniques described above allow comparison of regression coefficients because they are now expressed in units that are mathematically identical among the covariates. However, we still have to ensure that the units also have conceptually the same meaning. For example, if we consider a regression model with Smoking, measured in number of cigarettes/day, and Body Mass Index (BMI) as covariates, and we standardise both covariates by, for example, dividing them by their standard deviations, it is hard to justify that a change of 1 SD has the same conceptual meaning for both variables. With respect to smoking, it is possible for individuals to switch the exposure immediately from one extreme to the other; that is, heavy smokers can stop smoking immediately, and nonsmokers can start smoking. Furthermore, the "optimal" value for an individual is probably 0, that is, at one end of the range. For the body mass index it is much harder for an individual to achieve a change, and it is not advisable to go from one extreme to the other. The "optimal" value is somewhere in the middle of the range, and it may vary from subject to subject depending on stature and perhaps genes. So, changing the BMI by 1 SD has conceptually a different meaning than changing the daily amount of smoking by 1 SD. Another issue arises if one covariate has a skewed distribution and another a symmetric distribution. Then it can be questioned whether a unit like an 80% range is really comparable. This is closely related to the general question of interpreting regression coefficients for covariates with skewed distributions; see Section 18.2.

The use of regression models to adjust for potential confounders is another source of difficulties in comparing regression coefficients. Suppose we would like to compare the effects of X_1 and X_2, but we have a third variable X_3 in our model, which is a potential confounder with respect to the effect of X_1, because we know that it is associated with X_1 and has an influence on the outcome. However, if X_3 is not a confounder for X_2, is it fair to compare the (standardised) regression coefficients β̂_1 and β̂_2? If we were to add a fourth covariate, which is a potential confounder for X_2, this may lead to a reduction of both the unstandardised and the standardised regression coefficient of X_2, but not of the coefficients of X_1.

Taking all these limitations into account, we should regard standardised regression coefficients as an attempt to make regression coefficients more comparable, but we should not regard them as completely comparable. In any interpretation, we should take the above limitations into account. In particular, we should provide some arguments that the units are indeed comparable among covariates and that none of the above-mentioned difficulties are present.

covariate             OR for preterm delivery    rel. freq. in population
BMI above 25                   1.3                        48.1%
low social class               1.8                        37.6%
mother smoking                 2.1                        24.3%
maternal age < 18              2.9                         7.1%

Table 13.4 The influence of four maternal risk factors on preterm delivery. Results are based on a multiple logistic regression model. (Artificial data)

13.2 Comparing Regression Coefficients among Binary Covariates
At first sight we may expect that it is much easier to compare regression coefficients among binary covariates, because they are by definition measured on the same scale. However, we can again raise the question of whether switching from 0 to 1 has the same conceptual meaning for all binary covariates. One typical difficulty is illustrated by the results of an (artificial) study on maternal risk factors for preterm delivery shown in Table 13.4. We can observe that Maternal age < 18 shows the highest odds ratio, whereas BMI above 25 shows an odds ratio close to 1. However, as we can also see in Table 13.4, only a few mothers are actually younger than 18, whereas according to the definition used about half of the mothers fall into the BMI above 25 category. So we may argue that Maternal age < 18 is a more extreme condition than BMI above 25, and that hence it is not surprising to observe a smaller effect of the latter. And with respect to the other odds ratios in Table 13.4, the relative frequencies suggest a similar explanation.

It is indeed to some degree true that the effect of a binary covariate is expected to increase if the frequency of X = 1 decreases. This is illustrated in Table 13.5, where we analyse two artificial data sets with an originally continuous covariate X* (cf. Figures 13.1 and 13.2) by defining binary covariates X_c based on different cutpoints c via

X_c = 1 if X* ≥ c and X_c = 0 if X* < c

and applying simple linear regression to the binary covariate.
Figure 13.1 An artificial example.

Figure 13.2 An artificial example.

Data set of Figure 13.1
cutpoint c    rel. freq. of X_c = 1    β̂_1
0                  50.0                1.48
.5                 34.3                1.63
1                  21.0                1.82
1.5                11.3                2.15
2                   5.3                2.11

Data set of Figure 13.2
cutpoint c    rel. freq. of X_c = 1    β̂_1
0                  66.5                2.24
1                  49.5                2.35
2                  35.5                2.61
3                  24.5                2.79
5                  11.5                3.60
10                  1.0                5.17

Table 13.5 Result of analysing two data sets with varying cutpoints.
The size of the regression coefficient β̂_1 increases with increasing cutpoint, that is, with decreasing frequency of X_c = 1. This is especially the case if the distribution of X* is skewed, as in the second data set. Figure 13.3 illustrates this further by showing the mean values of Y in the two subgroups defined by X* ≥ c and X* < c, respectively, as the regression coefficient estimate β̂_1 is just the difference between the two mean values.

Figure 13.3 The mean of Y in the data set of Figure 13.2 in the two subgroups defined by X* ≥ c and X* < c for six different choices of c.

Note that our considerations about the comparison of adjusted effects presented in the previous section also apply to binary covariates. Hence, in summary, for binary covariates too we have to be careful about comparing regression coefficients directly.
13.3 Measuring the Impact of Changing Covariate Values
In public health research, the importance of a risk factor is typically related to the impact we can expect if we are able to remove or modify this factor. Such an impact depends on at least two factors: the effect on certain health outcomes and the population distribution of the factor. The latter becomes obvious if a binary risk factor
is considered: Removing a rare exposure from a population may have a smaller effect than removing a frequent exposure, even if the rare exposure has a larger effect (i.e., a larger odds ratio). It is rather simple to get an impression of such a possible impact once a regression model has been fitted in a relevant population. Let us take, for example, the case of a binary outcome variable Y and a logistic regression model. Then, based on the estimated regression coefficients β̂_j, we can compute for each subject i the probability of Y = 1 according to our model as
π̂_i = π̂(x_i1, x_i2, ..., x_ip) = logit⁻¹(β̂_0 + β̂_1 x_i1 + β̂_2 x_i2 + ... + β̂_p x_ip),

that is, predictions on the probability scale as introduced in Section 12.1. If we sum up these probabilities, we know the number E of events we can expect in our population according to our model. (And as we have fitted the model in this population, E should be close to the actual number of events, that is, the number of subjects with Y_i = 1 we can observe in our population.) Now, we may "remove" a covariate X_j from this model by setting for all subjects the covariate value x_ij to a constant value c. For example, if X_j is the number of daily cigarettes, we set all x_ij to 0, or if X_j is the body mass index, we may set all x_ij to 21, which is within the range of values typically regarded as optimal. Now we can for each subject compute again the probability of
Y = 1 with the new value c for x_ij, that is,

π̃_i = π̂(x_i1, ..., x_i,j−1, c, x_i,j+1, ..., x_ip) = logit⁻¹(β̂_0 + β̂_1 x_i1 + ... + β̂_{j−1} x_i,j−1 + β̂_j c + β̂_{j+1} x_i,j+1 + ... + β̂_p x_ip)
and by summing up these probabilities we obtain an estimate for the number Ẽ of events we can expect in our population if we assume that X_j = c for all subjects. Hence, E − Ẽ tells us how many events we could avoid by "removing" the risk factor in our population. So, such a computation can give a nice illustration of the "importance" of a covariate, and it is not restricted to public health research. Applying it to different covariates in a model may hence allow a comparison of their "importance." And it allows comparison of binary, continuous, and categorical covariates. However, we should be aware that it mainly serves an illustrative purpose. The approach is based on the assumption that between-subject effects are identical to within-subject effects (cf. Section 11.2), and it is typically somewhat unrealistic to "remove" a risk factor completely in a population.

If the assumption of a complete "removal" of a risk factor is unrealistic, we may consider other scenarios, for example, what happens if all subjects reduce the number of cigarettes by 50%. We can also try to take into account that changing the behaviour with respect to one covariate may imply a change in another covariate. For example, we may consider a scenario in which all subjects reduce their smoking by 50%, but half of them increase their fat intake by 25%. So there are many variations of the basic idea of measuring the impact of changing covariate values.

Remark: This idea can also be used if the outcome variable is continuous. Then we have to compute the conditional expectation μ̂(x_i1, x_i2, ..., x_ip) = β̂_0 + β̂_1 x_i1 + β̂_2 x_i2 + ... + β̂_p x_ip for each subject and take the average over all subjects, which can be regarded as an estimate of the population mean of Y. Then we can study the change of this population mean under different scenarios. If the outcome is a survival time, we can use survival probabilities for such computations; cf. Section 24.1.

Remark: The fraction (E − Ẽ)/E describes the proportion of events we can avoid by "removing" the risk factor. This fraction is often called the etiological fraction or attributable risk in epidemiology.
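As a sketch of how such a computation can be done in Stata, we can revisit the allergy2 example from Section 12.2 and ask how many cases of allergy we would expect if no mother smoked. Setting smokem to 0 for all children is, of course, a purely hypothetical scenario, and the scalar names E and Enew as well as the variable names prob and probnew are introduced here only for illustration:

. use allergy2, clear
. logistic allergyc allergym allergyf smokef smokem
. predict prob
. quietly summarize prob
. scalar E = r(sum)                  // expected number of events under the fitted model
. replace smokem = 0                 // "remove" maternal smoking
. predict probnew
. quietly summarize probnew
. scalar Enew = r(sum)               // expected number of events in the hypothetical scenario
. display "E = " E "   E-tilde = " Enew "   etiological fraction = " (E - Enew)/E

The last line reports E, Ẽ, and the etiological fraction (E − Ẽ)/E for this scenario.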
13.4 Translating Regression Coefficients
We have demonstrated in Section 13.1 that the different scales underlying different covariates make the comparison of regression coefficients cumbersome. On the other hand, this often opens a nice opportunity to communicate the magnitude of effect estimates. The basic idea is to use covariates for which most people have an intuitive understanding of the associated risk, for example, age or the daily number of cigarettes, to express the effect of other covariates. For example, we may have a logistic regression model with several covariates including age (in years) and physical activity in leisure time (in hours per week), and we observe regression coefficients of 0.21 for
age and −0.42 for physical activity. So 2 years of age have the same effect as 1 hour of physical activity per week, but in the opposite direction, which we may reduce to the simple message: "One hour of physical activity in your leisure time per week makes you 2 years younger." If we also have a binary covariate like Watching more than 14 hours of television per week in the model, and it obtains a regression coefficient around 1.05, we may express this as "Watching a lot of television makes you 5 years older," because 1.05 = 5 × 0.21.

In this approach, it may be necessary first to transform the scale of a covariate to a unit which is easy to understand. If in the example above we have a covariate alcohol intake measured in g per day with an effect estimate of 0.030, it makes little sense to say that "one g of alcohol per day makes you 0.030/0.21 = 0.14 of a year older," even if we express the latter as "about 1.7 months older," as most people have no idea how much one g of alcohol per day is. Here it may be useful to transform the scale to a more popular unit like "one bottle of beer." In many countries a bottle of beer has about 16 g of alcohol, hence one bottle has an effect of 16 × 0.030 = 0.48, such that one bottle of beer per day makes you 0.48/0.21 = 2.3 years older.

Formally, what we do here is to express the effect of a change of one unit in the covariate X_2 as the number of units of the covariate X_1 giving the same effect. And this means nothing other than computing the ratio of the regression coefficients, that is, β_2/β_1. Of course, we should also ask for the precision of such ratios, and fortunately it is possible to compute standard errors and confidence intervals for such ratios (cf. Section 12.5). Additionally, we have to be aware of whether we express the results as "A 1 hour difference in the amount of physical exercise each week has the same effect as a difference of 2 years of age," which remains at the between-subjects perspective, or as "1 hour of physical exercise per week makes you 2 years younger", which takes the within-subject perspective with all the problems mentioned in Section 11.2.
13.5 How to Compare Regression Coefficients in Stata
We start by taking a look at the data set:

. use GPprevAD, clear
. list in 1/10

     +--------------------------------------------------------+
     | id     prevAD   exper   workload      couns     drugpr |
     |--------------------------------------------------------|
  1. |  1     2.8537     5.3       5231   .5297675   53.17889 |
  2. |  2   6.348848     8.0       3467   .6196641    59.6736 |
  3. |  3   5.401683     9.8       6331   .1633966   56.60002 |
  4. |  4   1.846023    23.9       2430    .824449    50.3332 |
  5. |  5   3.443628     3.0       2882   .8499712   61.24242 |
     |--------------------------------------------------------|
  6. |  6   2.905964     7.5       4205   .3030116   49.83715 |
  7. |  7   7.650514     3.8       4088   .2966704   63.39799 |
  8. |  8   5.458002     3.0       5319          0   58.37924 |
  9. |  9   4.711245     6.8       4708   .3428961   62.39655 |
 10. | 10   4.933899    15.3       4871   .6056716   61.80287 |
     +--------------------------------------------------------+
and continue with fitting the regression model of interest:

. regress prevAD exper workload couns drugpr, vce(robust)

Linear regression                                      Number of obs =     110
                                                       F(  4,   105) =    5.07
                                                       Prob > F      =  0.0009
                                                       R-squared     =  0.1827
                                                       Root MSE      =  1.2306

------------------------------------------------------------------------------
             |               Robust
      prevAD |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       exper |  -.0248877   .0162733    -1.53   0.129    -.0571546    .0073793
    workload |  -.0001279   .0001061    -1.21   0.231    -.0003383    .0000825
       couns |  -.9834353   .4823002    -2.04   0.044    -1.939748     -.027123
      drugpr |   .0999148    .032829     3.04   0.003      .034821    .1650086
       _cons |  -.3417752   1.542115    -0.22   0.825    -3.399504    2.715953
------------------------------------------------------------------------------
To standardise the regression coefficients, we take a look at the 10% and 90% percentiles:

. tabstat exper workload couns drugpr, s(p10 p90)

   stats |     exper   workload      couns     drugpr
---------+--------------------------------------------
     p10 |  3.045205     3476.5          0   49.98628
     p90 |  22.34931     7117.5   .6223555   64.94479
------------------------------------------------------
To obtain standardised regression coefficients, it is typically more convenient in a statistical package to standardise the covariates than to standardise the coefficients directly. Stata's egen command allows us to generate variables containing the percentiles:

. egen p10exper=pctile(exper), p(10)
. egen p90exper=pctile(exper), p(90)
. gen experST=exper/(p90exper-p10exper)
. list id exper experST p90exper p10exper in 1/10
     +---------------------------------------------+
     | id   exper    experST   p90exper   p10exper |
     |---------------------------------------------|
  1. |  1     5.3   .2722112   22.34932   3.045206 |
  2. |  2     8.0   .4147034   22.34932   3.045206 |
  3. |  3     9.8   .5055351   22.34932   3.045206 |
  4. |  4    23.9   1.235737   22.34932   3.045206 |
  5. |  5     3.0   .1555492   22.34932   3.045206 |
     |---------------------------------------------|
  6. |  6     7.5    .389015   22.34932   3.045206 |
  7. |  7     3.8   .1945785   22.34932   3.045206 |
  8. |  8     3.0   .1555492   22.34932   3.045206 |
  9. |  9     6.8   .3499858   22.34932   3.045206 |
 10. | 10    15.3   .7949191   22.34932   3.045206 |
     +---------------------------------------------+
You can now do the same steps with the other variables. But you can also program a loop using the foreach construct:

. foreach var of varlist workload couns drugpr {
  2.   egen p10`var'=pctile(`var'), p(10)
  3.   egen p90`var'=pctile(`var'), p(90)
  4.   gen `var'ST=`var'/(p90`var'-p10`var')
  5. }
Now we can perform the regression analysis with the standardised covariates:

. regress prevAD experST workloadST counsST drugprST, vce(robust)

Linear regression                                      Number of obs =     110
                                                       F(  4,   105) =    5.07
                                                       Prob > F      =  0.0009
                                                       R-squared     =  0.1827
                                                       Root MSE      =  1.2306

------------------------------------------------------------------------------
             |               Robust
      prevAD |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     experST |   -.480434   .3141418    -1.53   0.129    -1.103319    .1424511
  workloadST |  -.4655753   .3863188    -1.21   0.231    -1.231574    .3004236
     counsST |  -.6120462   .3001622    -2.04   0.044    -1.207212      -.01688
    drugprST |   1.494576   .4910722     3.04   0.003     .5208705    2.468281
       _cons |  -.3417745   1.542115    -0.22   0.825    -3.399503    2.715954
------------------------------------------------------------------------------
Next we try the standardisation by building quintiles. The xtile, n(5) command divides a variable into 5 groups of equal size:

. xtile qu5exper=exper, n(5)
. xtile qu5workload=workload, n(5)
. xtile qu5couns=couns, n(5)
. xtile qu5drugpr=drugpr, n(5)
. list id exper qu5exper in 1/10

     +-----------------------+
     | id   exper   qu5exper |
     |-----------------------|
  1. |  1     5.3          2 |
  2. |  2     8.0          2 |
  3. |  3     9.8          3 |
  4. |  4    23.9          5 |
  5. |  5     3.0          1 |
     |-----------------------|
  6. |  6     7.5          2 |
  7. |  7     3.8          1 |
  8. |  8     3.0          1 |
  9. |  9     6.8          2 |
 10. | 10    15.3          4 |
     +-----------------------+
. regress prevAD qu5exper qu5workload qu5couns qu5drugpr, vce(robust)

Linear regression                                      Number of obs =     110
                                                       F(  4,   105) =    5.49
                                                       Prob > F      =  0.0005
                                                       R-squared     =  0.1924
                                                       Root MSE      =  1.2233

------------------------------------------------------------------------------
             |               Robust
      prevAD |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    qu5exper |  -.1399966   .0832519    -1.68   0.096    -.3050698    .0250765
 qu5workload |  -.1535496   .1249432    -1.23   0.222    -.4012889    .0941897
    qu5couns |  -.1674876   .0826692    -2.03   0.045    -.3314053    -.0035699
   qu5drugpr |   .4262558   .1307075     3.26   0.001      .167087    .6854247
       _cons |   4.233402   .4644146     9.12   0.000     3.312554    5.154251
------------------------------------------------------------------------------
Next we study how, according to our model, the average prevalence of antidepressant prescriptions would change if all GPs increased the number of counselling sessions. The predict command allows us to compute the expected prevalence in each practice according to our original model:

. regress prevAD exper workload couns drugpr, vce(robust)

Linear regression                                      Number of obs =     110
                                                       F(  4,   105) =    5.07
                                                       Prob > F      =  0.0009
                                                       R-squared     =  0.1827
                                                       Root MSE      =  1.2306

------------------------------------------------------------------------------
             |               Robust
      prevAD |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       exper |  -.0248877   .0162733    -1.53   0.129    -.0571546    .0073793
    workload |  -.0001279   .0001061    -1.21   0.231    -.0003383    .0000825
       couns |  -.9834353   .4823002    -2.04   0.044    -1.939748     -.027123
      drugpr |   .0999148    .032829     3.04   0.003      .034821    .1650086
       _cons |  -.3417752   1.542115    -0.22   0.825    -3.399504    2.715953
------------------------------------------------------------------------------
. predict expect
(option xb assumed; fitted values)

. list id expect exper workload couns drugpr in 1/10

     +--------------------------------------------------------+
     | id     expect   exper   workload      couns     drugpr |
     |--------------------------------------------------------|
  1. |  1   3.650923     5.3       5231   .5297675   53.17889 |
  2. |  2   4.368539     8.0       3467   .6196641    59.6736 |
  3. |  3   4.100293     9.8       6331   .1633966   56.60002 |
  4. |  4   2.972051    23.9       2430    .824449    50.3332 |
  5. |  5   4.498106     3.0       2882   .8499712   61.24242 |
     |--------------------------------------------------------|
  6. |  6   3.615112     7.5       4205   .3030116   49.83715 |
  7. |  7   5.084651     3.8       4088   .2966704   63.39799 |
  8. |  8   4.736303     3.0       5319          0   58.37924 |
  9. |  9   4.785191     6.8       4708   .3428961   62.39655 |
 10. | 10   4.232846    15.3       4871   .6056716   61.80287 |
     +--------------------------------------------------------+
For example, we can see that in practice 4 we expect according to our model a rather low prevalence of prescriptions of antidepressant drugs, as this GP has many years of experience, a high number of counsellings, and a low overall rate of drug prescriptions. Note that the average of the expected prevalences is equal to the average of the observed prevalences:

. tabstat prevAD expect, s(mean)

   stats |    prevAD    expect
---------+--------------------
    mean |  4.142272  4.142272
------------------------------
Now we can increase the number of counselling sessions per 100 patients per year by 0.5, and again use predict to compute the expectations:

. replace couns=couns+0.5
(110 real changes made)

. predict newexpect
(option xb assumed; fitted values)

. tabstat prevAD expect newexpect, s(mean)

   stats |    prevAD    expect  newexp~t
---------+------------------------------
    mean |  4.142272  4.142272  3.650554
----------------------------------------
So we can see that the average prevalence would decrease from 4.142 to 3.651, that is, by about 0.49.
Finally, we may have the idea of convincing (young) GPs to do more counselling by expressing the effect of counselling in terms of the number of years of experience.

. use GPprevAD, clear
. regress prevAD exper workload couns drugpr, vce(robust)

Linear regression                                      Number of obs =     110
                                                       F(  4,   105) =    5.07
                                                       Prob > F      =  0.0009
                                                       R-squared     =  0.1827
                                                       Root MSE      =  1.2306

------------------------------------------------------------------------------
             |               Robust
      prevAD |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       exper |  -.0248877   .0162733    -1.53   0.129    -.0571546    .0073793
    workload |  -.0001279   .0001061    -1.21   0.231    -.0003383    .0000825
       couns |  -.9834353   .4823002    -2.04   0.044    -1.939748     -.027123
      drugpr |   .0999148    .032829     3.04   0.003      .034821    .1650086
       _cons |  -.3417752   1.542115    -0.22   0.825    -3.399504    2.715953
------------------------------------------------------------------------------
To express the effect of 1 more counselling session per 100 patients per year on the scale "number of years of experience," we now have to take a look at the ratio of the regression coefficients. This can be approached using the nlcom command, as pointed out in Section 12.5:

. nlcom _b[couns]/_b[exper]

       _nl_1:  _b[couns]/_b[exper]

------------------------------------------------------------------------------
      prevAD |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _nl_1 |   39.51498   36.63846     1.08   0.283    -33.13232    112.1623
------------------------------------------------------------------------------
We can observe that one more counselling session has the same effect as 39.5 years of experience. Note, however, that the confidence interval of this ratio is extremely wide and includes negative values. This is due to the fact that the effect of experience is actually quite small, and we cannot even be sure that it exists. So here it is actually a poor idea to use the effect of experience to express the effect of counselling. Remember also that the median number of counselling sessions per 100 patients per year is only 0.36, so increasing this number by 0.5 or 1 is actually a very big difference compared to current practice.

Remark: If you want to standardise the covariates to a standard deviation of 1, you can just use the std function in the egen command.
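A minimal sketch of this remark, continuing the GP example (the SD suffix in the new variable names is introduced here only for illustration):

. use GPprevAD, clear
. foreach var of varlist exper workload couns drugpr {
  2.   egen `var'SD = std(`var')
  3. }
. regress prevAD experSD workloadSD counsSD drugprSD, vce(robust)

The coefficients then express the expected change in the prevalence when a covariate changes by one standard deviation.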
13.6 Exercise Health in Young People
In a study, the relation between some lifestyle factors and the health of young people aged between 16 and 21 has been investigated. The health status of the adolescents has been assessed by two variables: a general health questionnaire resulting in a score between 0 and 100, and the actual costs for the public health care system within one calendar year. Our interest is in three selected lifestyle factors: the average number of fruits and vegetables consumed each day, the daily hours spent on physical activities during leisure time, and the money spent on visiting fitness centres and similar facilities each month. All information can be found in the data set healthintheyoung, and the variable names are gh, costs, vegetables, hours and money. To prevent subjects with very large (and hence suspicious) values in the three lifestyle factors from having an overwhelming influence, the variables have been truncated: If the number of fruits and vegetables exceeded 6, the value was set to 6. If the number of hours spent on physical activity each day exceeded 4, the value was set to 4. And if someone spent more than 100 EUR per month in fitness centres, the number was set to 100. Information on hours was collected with a precision of 15 minutes, and on money with a precision of 10 EUR. Try to use this data set in order to (a) convince young people to eat more fruits and vegetables, (b) convince politicians to start a campaign to promote the consumption of fruit and vegetables in the young, and (c) make a statement on the lifestyle factor with the greatest influence.
THIS CHAPTER IN A NUTSHELL

Regression coefficients cannot be directly compared. However, there exist different techniques to make regression coefficients more comparable or to illustrate the impact of different covariates. The interpretation of regression coefficients can sometimes be supported by expressing the effect of one covariate as a multiple of the effect of another covariate.
Chapter 14
Power and Sample Size
In this chapter we discuss the major determinants of the power of a regression analysis and demonstrate how we can perform power and sample size calculations by means of simulations.
14.1
The Power of a Regression Analysis
In the planning of any study in the health sciences, we would like to ensure as far as possible that we finally obtain a convincing result which adds new insights into the subject matter of interest and which is worth publishing. If a regression analysis is planned as the main analytical approach, a convincing result means that we can estimate the regression coefficient(s) of interest with a precision which allows some new or at least interesting conclusions to be drawn. A convincing result does not necessarily mean that the effect estimate has to be as large as was originally anticipated and that it has to be significantly different from zero. An effect estimate close to zero can also be a very interesting and convincing result, if it is accompanied by a narrow confidence interval such that we can conclude that if there is any effect, it can only be rather small and hence of little relevance. Similarly, moderate effect estimates with narrow confidence intervals can provide an interesting and convincing result, as they allow one to conclude that on the one hand there exists an effect, but that on the other hand the effect is smaller than anticipated. In any case, in the planning of a study we would like to ensure that the study has sufficient "power" to deliver such a convincing and interesting result.

The term "power" in connection with scientific studies is used in a broad sense as well as in a narrow, statistical sense. The statistical term "power" refers to the probability of obtaining a significant result when testing the null hypothesis H0 : β j = 0 under the assumption of a certain (true) value for β j. This is the relevant concept if we are mainly interested in demonstrating that a certain covariate (after appropriate adjustment) has an effect. The more general term "power of a study" refers to the likelihood that a study can demonstrate or find what it is intended to demonstrate or find. This may refer to a general aim like "finding new genetic markers for disease Y" or a more specific aim like "determining a critical threshold for the exposure to X." Typically, statistical concepts can be found to translate such notions of power into probabilities or expectations. The statistical "power" as the probability of being able to reject the null hypothesis of no effect is just one example. The most prominent alternative is to consider the expected length of a confidence interval.
This is a relevant concept if there is no doubt that a covariate X j has an effect, but there is still some doubt about the magnitude of the effect. If, for example, some studies have suggested that the effect of a covariate is around 2.0, and others that it is around 1.0, and we would like to clarify this point, then it is reasonable to require that the length of the confidence interval should be at most of a magnitude of 1. Or, in other words, that the half-length of the confidence interval should be no more than 0.5; that is, the confidence interval can be computed as βˆ j ± 0.5. This would ensure that if we obtain an estimate around 1.0, for example 1.2, the upper bound of the CI is still distinctly below 2.0.

Such power considerations typically occur when choosing the sample size in planning a study. The larger the sample size, the more precise are the estimates, the narrower are the confidence intervals, and the smaller are the p-values. Or, in other words: The larger the sample size, the larger the power. On the other hand, large sample sizes are often expensive or require a long recruitment period. Finding a reasonable compromise ensuring sufficient power for the main question of interest without increasing the sample size to an unnecessary degree is the aim of sample size calculations, and hence some practical advice on performing such calculations is an essential part of this chapter. However, understanding the determinants of the power of a regression analysis is also of more general interest, as it often allows one to judge the "general power" of published studies or to become aware of the most crucial power issues in planning a new study prior to any formal sample size calculation. This understanding is also essential for many other design issues like the selection of covariates to be included in a regression model. The following two sections are hence devoted to giving some insights into the basic determinants of the power in a regression analysis.

14.2
Determinants of Power in Regression Models with a Single Covariate
Binary outcome—binary covariate

We start by considering a simple logistic regression model relating a binary outcome Y to a binary covariate X via

logit π(x) = logit P(Y = 1|X = x) = β0 + β x .

We can also say that we are interested in a simple odds ratio, that is, exp(β). The joint distribution of Y and X can be described by the prevalence p = P(X = 1) of X, the regression coefficient β, and the prevalence q = P(Y = 1) of Y, and hence these three values also determine the power. For any choice of p, q, β, and the sample size n, we can now compute characteristics of the power of a regression analysis using the simple logistic model. For example, we can compute the power in the narrow sense, that is, the probability of obtaining a p-value less than 0.05 when testing the null hypothesis H0 : β = 0. Results of such computations are shown in Table 14.1, illustrating the dependence of the power on the true effect β and the sample size n when keeping the prevalence of X and Y fixed at 0.5. Not surprisingly, we can observe that increasing the sample size increases the power, and it is easier to obtain a significant result if the true β is large. If we take a closer look at the two lower lines
                               n
   β       100     200     300     400     500
  0.40    0.18    0.31    0.41    0.52    0.62
  0.50    0.26    0.45    0.58    0.70    0.81
  0.60    0.33    0.58    0.74    0.85    0.92
  0.70    0.44    0.71    0.85    0.94    0.98
  0.80    0.54    0.81    0.93    0.98    0.99
Table 14.1 Probability of rejecting H0 : β = 0 in dependence on the true value of β and the sample size n in the case of a balanced binary covariate X (i.e., p = 0.5) and a balanced outcome Y (i.e., q = 0.5).
                                      p
   q      0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
  0.1    0.34   0.48   0.53   0.54   0.51   0.43   0.32   0.11   0.00
  0.2    0.48   0.67   0.76   0.78   0.77   0.73   0.63   0.45   0.11
  0.3    0.53   0.76   0.85   0.88   0.88   0.85   0.78   0.63   0.33
  0.4    0.54   0.78   0.88   0.92   0.92   0.90   0.85   0.73   0.44
  0.5    0.51   0.77   0.88   0.92   0.93   0.92   0.88   0.78   0.52
  0.6    0.43   0.73   0.85   0.90   0.92   0.92   0.88   0.79   0.54
  0.7    0.32   0.63   0.78   0.85   0.88   0.88   0.85   0.77   0.54
  0.8    0.11   0.45   0.63   0.73   0.78   0.79   0.77   0.67   0.48
  0.9    0.00   0.11   0.33   0.44   0.52   0.54   0.54   0.48   0.34
Table 14.2 Probability of rejecting H0 : β = 0 in dependence on the prevalence p of X and the prevalence q of Y in the case of a true β of 0.8 and a sample size of n = 300.
in this table, we can observe that the power increases rapidly with the sample size up to a power of about 90%, but afterwards the power increases only rather slowly. Hence, increasing sample sizes further than necessary to achieve a power of 90% is typically rather costly, as the additional gain in power is small. In Table 14.2, we take a look at the power in dependence on the prevalence p of X and the prevalence q of Y . We observe maximal power if p and q are close to 0.5, that is, if both X and Y are balanced. Looking at the middle column corresponding to p = 0.5, we can observe that the power is maximal if q = 0.5, that is, if the distribution of Y is balanced. This may be intuitively clear: If the vast majority of the values in Y are 0 or if the vast majority of the values in Y are 1 there is little information in Y . Looking at the middle row corresponding to q = 0.5, we observe a similar pattern with respect to p: The power decreases if the prevalence p tends to 0 or 1. The latter may be less intuitively clear. To facilitate understanding, it can help to take a look at three artificial studies summarised in Table 14.3. In these three studies, the frequency of Y = 1 given X = 0 is always 40%, and the frequency of Y = 1 given X = 1 is always 60%, such that we have in all three studies an estimated OR of 2.25 or a βˆ of 0.81. The only difference between the three studies is the prevalence of X:
study 1
                 Y
   X         0      1    rel. freq     95% CI        βˆ        95% CI       p-value
   0        60     40      40.0%    [30.3,50.3]     0.81    [0.25,1.38]      0.005
   1        40     60      60.0%    [49.7,69.7]

study 2
                 Y
   X         0      1    rel. freq     95% CI        βˆ        95% CI       p-value
   0        90     60      40.0%    [32.1,48.3]     0.81    [0.16,1.46]      0.015
   1        20     30      60.0%    [45.2,73.6]

study 3
                 Y
   X         0      1    rel. freq     95% CI        βˆ        95% CI       p-value
   0       105     70      40.0%    [32.7,47.7]     0.81    [-0.04,1.67]     0.063
   1        10     15      60.0%    [38.7,78.9]
Table 14.3 Three artificial studies. The cross-tabulation of Y and X, the relative frequency of Y = 1 in the two strata defined by X with 95% confidence intervals, and the effect estimate from the logistic regression with 95% confidence interval and p-value are given for each study.
In study 1, the prevalence is 50%, in study 2 it is 25%, and in study 3 it is 12.5%. If we now take a look at the confidence intervals for the relative frequency of Y = 1 in the strata defined by X and compare study 2 with study 1, we can observe that the confidence interval in the stratum given by X = 0 has become slightly narrower, as we have increased the number of available subjects from 100 to 150. However, the confidence interval in the stratum given by X = 1 has become distinctly wider, as we have reduced the number of subjects from 100 to 50, that is, by a factor of 2. So the gain in information in the stratum defined by X = 0 is smaller than the loss of information in the other stratum. Now, βˆ is nothing else but the difference between these two estimates on the logit scale, and due to the asymmetry in loss and gain of information, we have overall a loss of information: The confidence interval for β becomes wider and the p-value larger. And the same phenomenon occurs when moving from study 2 to study 3: Again, we have a loss of information by reducing the prevalence further.

Looking at the other rows and columns in Table 14.2, we can observe that a balanced X or a balanced Y is not always the optimal constellation if Y or X, respectively, is unbalanced. This can be explained by a second criterion influencing the power: If X and Y have the same prevalence, then we can obtain a 100% agreement between X and Y; that is, it is possible that we observe Xi = Yi for all subjects. However, if the prevalences are different, this is impossible, and hence it becomes more difficult to obtain high effect estimates. Indeed, we observe the lowest power in Table 14.2 if p = 0.1 and q = 0.9 or vice versa. The power in terms of the probability of obtaining a significant result is relevant,
                               n
   β       100     200     300     400     500
  0.40    0.79    0.56    0.46    0.39    0.35
  0.50    0.79    0.56    0.46    0.40    0.35
  0.60    0.80    0.56    0.46    0.40    0.35
  0.70    0.80    0.56    0.46    0.40    0.36
  0.80    0.80    0.57    0.46    0.40    0.36
Table 14.4 Expected median length of the 95% confidence interval for β in dependence on the true value of β and the sample size n in the case of a balanced binary covariate X (i.e., p = 0.5) and a balanced outcome Y (i.e., q = 0.5).
                                      p
   q      0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
  0.1    1.03   0.82   0.77   0.77   0.80   0.88   0.99   1.23   2.02
  0.2    0.82   0.64   0.59   0.58   0.59   0.64   0.71   0.85   1.23
  0.3    0.77   0.59   0.52   0.50   0.51   0.54   0.60   0.71   0.99
  0.4    0.77   0.58   0.50   0.48   0.47   0.49   0.54   0.63   0.88
  0.5    0.80   0.59   0.51   0.47   0.46   0.47   0.51   0.59   0.80
  0.6    0.88   0.64   0.54   0.49   0.47   0.48   0.50   0.57   0.77
  0.7    0.99   0.71   0.60   0.54   0.51   0.50   0.52   0.59   0.77
  0.8    1.23   0.85   0.71   0.63   0.59   0.57   0.59   0.64   0.82
  0.9    2.02   1.23   0.99   0.88   0.80   0.77   0.77   0.82   1.02
Table 14.5 Expected median length of the 95% confidence interval for β in dependence on the prevalence p of X and the prevalence q of Y in the case of a true β of 0.8 and a sample size of n = 300.
if we aim to demonstrate the existence of an effect. If the aim is to study the magnitude of the effect, it is not enough to obtain significance, we also need a narrow confidence interval. Hence, when planning a study, it is more relevant to consider the expected length of the confidence interval, and Tables 14.4 and 14.5 present the corresponding numbers in analogy to Tables 14.1 and 14.2. We can observe that the magnitude of β has nearly no influence on the expected median length. We can further observe that increasing the sample size by a factor of 4 reduces the expected length by a factor of 2. This reflects the well-known fact that the precision of an estimate increases only by the square root of n. The prevalence of X and Y has again a big impact on the expected length: In the worst case (p = 0.1 and q = 0.9 or vice versa), the expected length is more than four times as large as in the case of both X and Y being balanced. Remark: One may wonder why the intercept β0 does not appear in these power considerations. Indeed, it does appear, but only implicitly in the prevalence q of Y . The prevalence q is just a function of β0 , and the larger β0 is, the larger is q. However, for power considerations it is more convenient to consider q. It is harder to interpret
β = 1.0:
                                      p
  σe      0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
  1.0    0.86   0.98   0.99   1.00   1.00   1.00   0.99   0.98   0.86
  2.0    0.33   0.52   0.62   0.68   0.70   0.68   0.62   0.52   0.33
  3.0    0.17   0.27   0.33   0.37   0.38   0.37   0.33   0.27   0.17
  4.0    0.12   0.17   0.21   0.23   0.23   0.23   0.21   0.17   0.12
  5.0    0.09   0.13   0.15   0.16   0.17   0.16   0.15   0.13   0.09

β = 2.0:
                                      p
  σe      0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
  1.0    1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
  2.0    0.86   0.98   0.99   1.00   1.00   1.00   0.99   0.98   0.86
  3.0    0.53   0.76   0.86   0.90   0.91   0.90   0.86   0.76   0.53
  4.0    0.33   0.52   0.62   0.68   0.70   0.68   0.62   0.52   0.33
  5.0    0.23   0.36   0.44   0.49   0.51   0.49   0.44   0.36   0.23
Table 14.6 Probability of rejecting H0 : β = 0 in dependence on the prevalence p of X and the standard deviation σe of the error term for n = 100 and β = 1.0 (upper part) and β = 2.0 (lower part). A normal distribution of the error term is assumed.
β0 compared to q, as β0 is the probability of Y = 1 (on the logit scale) for a subject with X = 0. Moreover, β0 changes if we recode X (i.e., exchange the values 0 and 1), whereas q does not.

Continuous outcome—binary covariate

If the outcome Y is a continuous variable, it is obvious that the power depends on the magnitude of the errors

ei = Yi − μ(Xi) = Yi − (β0 + β Xi) .

If the errors are small, then the points (xi, yi) are close to the regression line, and hence it is easy to estimate the regression coefficient, and we have a high power. If the errors are large, the points are widely scattered around the line, it is difficult to estimate the regression coefficients, and the power is low. We can quantify the magnitude of the errors by the standard deviation σe of the error term e (cf. Section 12.3), and in Table 14.6 we can observe how the power decreases with increasing σe if we keep β, the sample size n, and the prevalence p of X fixed. And again we observe that the power depends on p, and now we have the clear message that a balanced distribution of X implies the highest power. So the four basic determinants of the power are now the sample size n, the effect β, the standard deviation σe of the error term, and the prevalence p of X.

Remark: It can easily be shown that in the case of a continuous outcome the power depends on β and σe only through the ratio β/σe. This follows from the fact that if we rescale Y by dividing by some constant c, both the error term and β are divided by c, too (cf. Exercise 2.8). We can also observe this in Table 14.6: If we go from the upper part to the lower part, that is, from β = 1.0 to β = 2.0, we find identical powers when comparing the lines differing in σe by a factor of 2.0.
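A brief worked version of this rescaling argument (a sketch; Y′ denotes the rescaled outcome):

Y′ = Y/c = β0/c + (β/c) X + e/c ,   so   β′ = β/c  and  σe′ = σe/c ,   and hence   β′/σe′ = β/σe .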
Remark: In the scientific literature, we can often find statements of the type that the power of a study is related to the signal-to-noise ratio. The ratio β/σe is such a signal-to-noise ratio: β is just the difference in the expectation of Y between X = 0 and X = 1, that is, the magnitude of the signal, and σe is the magnitude of the noise.

Remark: In the statistical literature the ratio β/σe is often referred to as effect size. This does make sense in connection with power and sample size calculations, as this ratio is—in addition to p—sufficient to compute power and sample sizes. However, in many other situations we would like to call the value of β itself the size of the effect. So it is important to clearly distinguish between these two different concepts.

Remark: In the power computations of Table 14.6, we have assumed a normal distribution of the error term. For another error distribution, we can use the modern way of inference and perform corresponding computations; cf. Section 14.10.

Remark: The intercept β0 has no impact on the power in the case of a continuous outcome. This follows from the simple fact that we can always add a constant c to the outcome Y without changing any effect estimate, but only the intercept.

Binary outcome—continuous covariate

The distribution of a binary covariate X is uniquely determined by its prevalence p. The situation is more complicated in the case of a continuous covariate, for which we have at least three characteristics: the location, typically expressed by the mean or the median; the spread, typically expressed by the standard deviation or percentiles; and the shape, in particular whether we have a symmetric distribution close to a normal distribution or an asymmetric, skewed distribution. The location fortunately has no effect on the power, as we can add any constant value c to X without changing the results of a regression analysis with respect to β (only the intercept will change; cf. Section 12.7). However, the spread has a substantial impact, as we can observe in Table 14.7: For given β, sample size n and prevalence q of Y, the power increases if we increase the standard deviation σX of X. This is rather intuitively clear, especially if we compare two data sets with similar values of βˆ as shown in Figure 14.1: The higher the spread of X, the higher is the variation of π(x) = logit−1(β0 + β1 x) when comparing x values from the lower and upper end of the distribution, and hence the easier it is to demonstrate that the covariate X has an effect. However, not only the spread, but also the shape of the distribution of X has an effect on the power. We can observe this in Table 14.8, which compares the power under the assumption of a normal distribution of X with the power under the assumption of a specific skewed distribution with the same standard deviation. We can observe that the power can be smaller in the case of a skewed distribution, but it can also be larger. This depends on the prevalence of Y: If the prevalence is low, the essential change from Y = 0 to Y = 1 happens in the upper half of the distribution of X, and in the case of a right-skewed distribution (as in our case) there is a wide spread of X in this part of the distribution, increasing the power (cf. Figure 14.2). If
Figure 14.1 Two data sets with a binary outcome Y and a continuous covariate X based on identical regression models, but with different spreads of X. The estimated logistic regression curves are shown in addition to the data. The p-values are shown at the top of the figures.
                                      q
  σX      0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
 0.50    0.20   0.33   0.41   0.46   0.48   0.46   0.41   0.33   0.20
 0.75    0.39   0.61   0.73   0.79   0.80   0.78   0.73   0.61   0.38
 1.00    0.59   0.83   0.92   0.95   0.95   0.95   0.92   0.83   0.59
 1.25    0.76   0.94   0.98   0.99   0.99   0.99   0.98   0.94   0.76
 1.50    0.87   0.98   1.00   1.00   1.00   1.00   1.00   0.98   0.87
Table 14.7 Probability of rejecting H0 : β = 0 in dependence on the prevalence q of Y and the standard deviation σX of X for β = 0.8 and n = 100. A normal distribution of X is assumed.
the prevalence is high, the change happens in the lower half of the distribution with limited variation, which decreases the power.

We should finally mention that we can also describe the spread of the distribution of X by parameters other than the standard deviation. For reasons which will become clear in Section 17.12, it is actually more reasonable to consider measures like the 80% range, that is, the difference between the 90% and the 10% percentile. This is done in Table 14.9, and we observe tendencies similar to those in Table 14.8. A final look at Tables 14.8 and 14.9 reveals that in the case of a continuous, normally distributed covariate, the optimal prevalence of Y is again 0.5, but in the case of a skewed distribution the optimal prevalence may be different.

Continuous outcome—continuous covariate

We do not present any results for this case, as they do not give any new insights. The power depends on the signal-to-noise ratio, the sample size, the spread of X, and the shape of the distribution of X.
                                             q
  σX    dist       0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
 0.50   normal    0.20   0.33   0.41   0.46   0.48   0.46   0.41   0.33   0.20
        skewed    0.29   0.40   0.45   0.47   0.44   0.40   0.32   0.21   0.09
 0.75   normal    0.39   0.61   0.73   0.79   0.80   0.78   0.73   0.61   0.38
        skewed    0.55   0.71   0.77   0.78   0.77   0.70   0.60   0.42   0.17
 1.00   normal    0.59   0.83   0.92   0.95   0.95   0.95   0.92   0.83   0.59
        skewed    0.77   0.90   0.94   0.94   0.94   0.90   0.82   0.63   0.29
 1.25   normal    0.76   0.94   0.98   0.99   0.99   0.99   0.98   0.94   0.76
        skewed    0.89   0.97   0.99   0.99   0.99   0.97   0.93   0.79   0.41
 1.50   normal    0.87   0.98   1.00   1.00   1.00   1.00   1.00   0.98   0.87
        skewed    0.96   0.99   1.00   1.00   1.00   0.99   0.98   0.90   0.53
Table 14.8 Probability of rejecting H0 : β = 0 in dependence on the prevalence q of Y and the standard deviation σX of X for β = 0.8 and n = 100. “dist” indicates the distribution of X; cf. Figure 14.2.
Figure 14.2 The two distributions considered as distributions of X in the power calculations of Tables 14.8 and 14.9.
Survival outcome

In the case of a survival outcome affected by censoring, the basic determinants related to the covariate—besides the magnitude of the effect itself—remain the same. However, the effect of the sample size and the distribution of the outcome are merged into one single, major determinant: the expected number of events we can observe (Schoenfeld, 1983). If the censoring times are small compared to the typical time spans until an event, we will observe hardly any events, and hence we are unable to show any effect of a covariate. The distribution of the time until the event has to allow enough events to appear prior to censoring. So the number of events to be expected is one major determinant of the power if we want to use a Cox proportional hazards model to assess the effect of a covariate.
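To get a feeling for how the censoring distribution drives the number of events, one can simulate survival times under a proportional hazards model with a constant baseline hazard and apply administrative censoring. A minimal sketch in Stata (all names, parameter values and the censoring time are purely illustrative and not part of the example data used in this book):

. clear
. set obs 300
. gen x = runiform() < 0.5                    // a balanced binary covariate
. gen t = -ln(runiform())/(0.1*exp(0.7*x))    // exponential survival time: baseline hazard 0.1, true beta = 0.7
. gen event = t < 2                           // administrative censoring after 2 time units
. replace t = 2 if event == 0
. count if event == 1                         // the number of observed events is the key determinant of the power
. stset t, failure(event)
. stcox x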
                                                  q
 80% range   dist      0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
   1.00      normal   0.14   0.22   0.28   0.31   0.32   0.31   0.27   0.22   0.14
             skewed   0.23   0.30   0.34   0.35   0.34   0.30   0.24   0.16   0.07
   1.50      normal   0.26   0.42   0.53   0.58   0.60   0.59   0.53   0.42   0.26
             skewed   0.42   0.57   0.63   0.65   0.63   0.57   0.46   0.32   0.13
   2.00      normal   0.41   0.64   0.76   0.81   0.83   0.81   0.76   0.65   0.41
             skewed   0.64   0.80   0.85   0.86   0.85   0.79   0.69   0.50   0.21
   2.50      normal   0.57   0.82   0.91   0.94   0.95   0.94   0.90   0.82   0.57
             skewed   0.80   0.92   0.95   0.96   0.95   0.92   0.85   0.67   0.32
   3.00      normal   0.71   0.92   0.97   0.98   0.99   0.98   0.97   0.92   0.71
             skewed   0.90   0.98   0.99   0.99   0.99   0.98   0.94   0.80   0.42
Table 14.9 Probability of rejecting H0 : β = 0 in dependence on the prevalence q of Y and the 80% range of X for β = 0.8 and n = 100. “dist” indicates the distribution of X; cf. Figure 14.2.
14.3
Determinants of Power in Regression Models with Several Covariates
Binary outcome—two continuous covariates We start by considering a logistic regression model with two covariates; that is, a model of the type logit π (x1 , x2 ) = β0 + β1 x1 + β2 x2 . Compared to the previous section we have now further characteristics which may influence the power to investigate the influence of X1 : The distribution of X2 , the association between X2 and X1 , and the effect of X2 on the outcome of interest. To keep it simple, we assume that both X1 and X2 follow a normal distribution with standard deviation 1, and we describe the association between the two covariates by their correlation ρ . Then we can study the power to reject the null hypothesis H0 : β1 = 0 as a function of the correlation ρ and the effect β2 of X2 . This is done in Table 14.10 for one specific choice of all other determinants. Considering any column in Table 14.10, we can observe that the power is maximal if the two covariates are uncorrelated. If (the absolute value of) the correlation becomes large, then there remains nearly no power. This relationship between the association among the covariates and the power to demonstrate their effect is a fundamental property with a high impact on the potential benefits and limitations of using regression models in medical research. Hence, it is worth spending some further thoughts in order to understand this property. In Figures 14.3 and 14.4, we can take a look at two data sets corresponding to the situation considered in Table 14.10. In Figure 14.3 the two covariates are uncorrelated, and in Figure 14.4 they are correlated. Now we have to remember that in fitting a regression model with two covariates, the regression parameter β1 describes the change in outcome if we change X1 but keep X2 fixed. So the information we can get about β1 stems from considering for each value of X2 how the outcome Y changes if we move X1 up or down. If the two covariates are correlated, then there is not much space to
                           β2
   ρ       0.0    0.2    0.4    0.6    0.8
  -0.8    0.37   0.37   0.36   0.35   0.34
  -0.6    0.59   0.58   0.57   0.56   0.54
  -0.4    0.71   0.70   0.69   0.68   0.65
  -0.2    0.76   0.76   0.75   0.74   0.71
   0.0    0.78   0.78   0.77   0.75   0.73
   0.2    0.76   0.76   0.75   0.74   0.71
   0.4    0.71   0.70   0.69   0.68   0.65
   0.6    0.59   0.58   0.57   0.56   0.54
   0.8    0.37   0.37   0.36   0.35   0.34
Table 14.10 Probability of rejecting H0 : β1 = 0 in dependence on the correlation ρ between X1 and X2 and the effect β2 of X2 . The true effect of β1 is 0.5, and the sample size is 200. Both X1 and X2 follow a standard normal distribution, and the prevalence of Y is 0.5.
Figure 14.3 A data set with two uncorrelated covariates and a binary outcome. The range of X2 available for three values of X1 is indicated by dashed lines.
Figure 14.4 A data set with two correlated covariates and a binary outcome. The range of X2 available for three values of X1 is indicated by dashed lines.
move X1 up and down, as indicated by the three dashed lines for three selected values of X2 in Figure 14.4. And if we compare this with the situation of two uncorrelated covariates in Figure 14.3, we can see that there is much more space to move X1 for a fixed value of X2. And hence it is easier to demonstrate the effect of X1 in the case of uncorrelated covariates compared to the case of correlated covariates. We can also explain this from the perspective of confounding. If X2 is correlated with X1 and has an effect of its own on Y (i.e., if β2 ≠ 0), then there is some danger that without adjustment for X2 the effect we estimate for X1 is confounded with the effect of X2. So there is some need to adjust for X2, and the higher the correlation, the higher is this need, and the higher is the impact; that is, the difference between the
adjusted and the unadjusted effect. If the correlation is close to 0, there is no need for adjustment, and hence we expect that the adjustment for X2 does not substantially change the unadjusted effect, and hence we have to pay only a limited price for adjusting. And this is exactly what we see in the increase in power as the correlation tends to 0. If we take a look at the rows in Table 14.10, we can observe that the power slightly decreases with increasing size of β2. This can again be explained from the perspective of confounding: The larger the effect of X2, the larger the potential for confounding, hence the larger the impact of an adjustment, and hence the larger the price we have to pay in terms of a reduction in power. But compared to the correlation, the influence of the effect of X2 is rather limited.

Remark: The difficulty of assessing the effect of highly correlated covariates and the resulting loss of power is often referred to as the problem of collinearity in the statistical literature.

Continuous outcome—two binary covariates

We now exchange the roles of the continuous and the binary variables: We consider the case of a continuous outcome and two binary covariates, and hence a classical regression model of the type
μ (x1 , x2 ) = β0 + β1 x1 + β2 x2 . So again we want to take a look at the additional determinants: The distribution of X2 (i.e., the prevalence of X2 ), the effect β2 of X2 , and the association between X1 and X2 . We express the latter by the odds ratio between X1 and X2 . This odds ratio may refer to a logistic regression of X2 versus X1 or vice versa, as in the case of two binary variables these two regression models define the same odds ratio. (This property of the odds ratio is further discussed in Section 28.3.) In Table 14.11 we can take a look at the power to demonstrate an effect of X1 in dependence on the odds ratio between X1 and X2 and the effect β2 of X2 given one specific choice of all other determinants. Looking within each column, we observe as above a substantial influence of the association between the two covariates with again the case of independence of the two covariates being the optimal case. This influence can be explained in the following way: If X1 and X2 are independent and have both a prevalence of 0.5, then the four possible combinations of values for X1 and X2 (i.e., (x1 = 0, x2 = 0), (x1 = 1, x2 = 0), (x1 = 0, x2 = 1), and (x1 = 1, x2 = 1)), have all the same probability of 0.25. So each of the four values μ (x1 , x2 ) can be estimated with the same precision. In contrast, if there is a high association between X1 and X2 , the combinations (x1 = 0, x2 = 0) and (x1 = 1, x2 = 1) are frequent, but the combinations (x1 = 0, x2 = 1) and (x1 = 1, x2 = 0) are infrequent. So now in estimating the effect of X1 we have to compare the estimates of μ (x1 , x2 ) (i.e., the mean values within each of the four groups defined by X1 and X2 ) for fixed values of x2 ; that is, we have to compare μˆ (1, 0) with μˆ (0, 0) and μˆ (1, 1) with μˆ (0, 1). So, we always compare a mean based on many observations with a mean based on few observations. Consequently, we are back in the situation considered previously in Table 14.3: An imbalanced distribution of observations into two groups makes
                           β2
   OR      0.0    0.2    0.4    0.6    0.8
   1.0    0.80   0.80   0.80   0.80   0.80
   2.0    0.79   0.79   0.79   0.79   0.79
   4.0    0.76   0.76   0.76   0.76   0.76
   8.0    0.70   0.70   0.70   0.70   0.70
  16.0    0.62   0.61   0.62   0.61   0.61
  32.0    0.53   0.53   0.53   0.53   0.53
Table 14.11 Probability of rejecting H0 : β1 = 0 in dependence on the odds ratio between X1 and X2 and the effect β2 of X2 . The true effect β1 of X1 is 0.4, and the sample size is 200. Both X1 and X2 have a prevalence of 0.5, and the standard deviation σe of the error term is 1.0. A normal distribution of the error term is assumed.
one confidence interval smaller and one bigger, but the loss in precision in the smaller group is larger than the gain in precision in the larger group, and for the difference between the two groups we have an overall decrease in precision.

Looking at the rows in Table 14.11, we observe nearly no influence of the effect β2. This is due to two counterbalancing effects: On the one hand, the potential for confounding, and hence the impact of the adjustment on the difference between adjusted and unadjusted effects, increases, which diminishes the power. On the other hand, as we have a continuous outcome, with increasing β2 the effect of X2 can explain more and more of the variation of the outcome, which increases the power. (We will come back to this point later in Section 16.7.)

We finally take a look at the influence of the distribution of X2 on the power to demonstrate an effect of X1. In Table 14.12, we can observe in each row the effect of the prevalence p2 of X2: The power is smallest if X2 is balanced. This somewhat surprising result can be explained in the following way: If there are only a few events in X2, there can also be only a few events in X2 in the two subgroups defined by X1 = 0 and X1 = 1. So, in any case, most of the observations with X1 = 0 are coupled with X2 = 0, and most of the observations with X1 = 1 are coupled with X2 = 0, too. So even if X2 has an effect, it affects only the few subjects with X2 = 1 in the group with X1 = 1 and (in the case of a positive association between X1 and X2) the even fewer subjects with X2 = 1 in the group with X1 = 0. So the unadjusted effect of X1 cannot be confounded with the effect of X2 to a high degree, and if there is no big need for adjustment, it is easier to adjust, and hence the power increases.

Other constellations

We can now consider many other constellations, but the two we have considered so far have already demonstrated the essential points. First, the association among covariates is one of the main determinants of the power of a regression analysis: The higher the association between two covariates, the more difficult it becomes to dissect their effects. Second, it is hard to predict how the distribution of a covariate actually affects the power: We have, for example, seen that a high or low prevalence in a binary covariate can be an advantage or a disadvantage, and the same is true
                                      p2
   OR     0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
   1.0   0.80   0.80   0.80   0.80   0.80   0.80   0.80   0.80   0.80
   2.0   0.80   0.80   0.79   0.79   0.79   0.79   0.79   0.80   0.80
   4.0   0.79   0.77   0.77   0.76   0.76   0.76   0.77   0.77   0.79
   8.0   0.78   0.75   0.72   0.71   0.70   0.71   0.72   0.75   0.78
  16.0   0.77   0.73   0.68   0.63   0.61   0.63   0.68   0.73   0.77
  32.0   0.76   0.70   0.63   0.56   0.53   0.56   0.63   0.70   0.76
Table 14.12 Probability of rejecting H0 : β1 = 0 in dependence on the odds ratio between X1 and X2 and the prevalence p2 of X2 . The true effect β1 of X1 is 0.4, the true effect β2 of X2 is 0.8, and the sample size is 200. The prevalence of X1 is 0.5, and the standard deviation σe of the error term is 1.0. A normal distribution of the error term is assumed.
for the shape of a covariate distribution. So it is wise to perform for each study a separate power calculation, taking the known features of the covariate distribution into account as carefully as possible.

14.4
Power and Sample Size Calculations When a Sample from the Covariate Distribution Is Given
When planning a new study, it is often possible to obtain a sample from the distribution of the covariates X1, X2, . . . , Xp which is close to the distribution we expect in our study. For example, in a university clinic it is often possible to obtain from a clinical information system used in the daily routine the data with all relevant information for the patients of the last year, which probably resemble the patients of the next year. In an epidemiological setting, it may be possible to use covariate data from a previous study which has looked at another disease. If such a sample is given, and if it is of the size we expect in our new study, it is rather simple to perform power calculations for any given choice of the regression coefficients β0, β1, . . . , βp by simulation. This follows from the fact that the regression coefficients uniquely determine the distribution of Yi given Xi. So we can "simply" simulate the study: For each subject i in our sample, we generate an observation Yi according to our regression model, then we apply the intended regression method and note the results of interest, for example, whether we can reject the null hypothesis of interest. Then we simulate the study not only once, but many times (say 1000 times), and count how often we can reject the null hypothesis. The relative frequency of rejections then approximates the power of our study for the chosen values of the regression parameters. And varying the regression parameters allows us to consider different scenarios and to identify the most crucial assumptions. Due to the fact that most statistical packages offer the possibility of generating random numbers by some prespecified functions, it is indeed not very cumbersome to do such simulations, and we will now describe the basic steps in more detail.

In the case of a binary outcome, we can compute for each subject i in our sample
the probability of Yi = 1 as
πi = π(xi1, xi2, . . . , xip) = logit−1(β0 + β1 xi1 + β2 xi2 + . . . + βp xip) .

Statistical packages typically offer the opportunity to generate uniform random numbers Ui, which are uniformly distributed between 0 and 1. We just have to compare Ui with πi; that is, we generate Yi according to

Yi = 1 if Ui < πi    and    Yi = 0 otherwise,

and this way we ensure P(Yi = 1) = πi. The only additional difficulty in this case is the choice of β0. By definition, β0 is the probability (on the logit scale) of Y = 1 for a subject with all covariate values set to 0. Typically, we have no idea about this probability. However, we typically have an idea about the expected prevalence q of Y in the new study. Now, the prevalence q is nothing else but the average over all πi. So we can just try some values of β0, compute for each choice all πi and their average, and see for which choice the average is closest to q.
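As a minimal sketch of this step in Stata (the covariates x1 and x2 and the coefficient values are purely illustrative):

. gen pi = 1/(1+exp(-(-2 + 0.5*x1 + 0.3*x2)))    // pi_i for assumed values beta0=-2, beta1=0.5, beta2=0.3
. gen y = runiform() < pi                        // Y_i = 1 if U_i < pi_i, and Y_i = 0 otherwise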
The case of a continuous outcome is even simpler: Here we can just compute the conditional expectation

μi = μ(xi1, xi2, . . . , xip) = β0 + β1 xi1 + β2 xi2 + . . . + βp xip

and then we just have to draw Yi from a normal distribution with mean μi and standard deviation σe. There is no need to think about the choice of β0, as the power does not depend on the value of β0. However, it is often somewhat difficult to obtain a good guess for σe, as we are not used to thinking about the standard deviation of the error term, and although most statistical packages provide an estimate for this (cf. Section 12.4), it is seldom published. One solution is to try several values and to take a look in the simulations at the distribution of Y in the whole sample, as we may have an idea about the spread of Y in the population expected for the study.

The case of a survival outcome to be analysed by a Cox proportional hazards model is slightly more complicated: The simulation approach may require specifying the baseline hazard function or, equivalently, a survival time distribution. A useful framework for this task is described by Bender et al. (2005).

In principle, in all three cases we can simulate with limited effort the study we plan for different choices of the regression coefficients and compute for each choice the power to reject a null hypothesis on a certain regression parameter. We can also study the length of confidence intervals, and as the distribution of the length is often asymmetric, typically the median length is considered.

It will frequently happen that the sample available for a power calculation is not of the same size as the study you plan. If the sample is larger than the study planned, you can in each simulation just draw a random sample of the desired size. If the sample is smaller, you can fill the sample in each simulation by randomly drawing, for any missing subject, one subject from the sample. So you are working in the simulation with samples in which some subjects occur once, some occur twice,
and some three or four times. (This is a special case of a statistical technique called bootstrapping.) If you are interested in determining the necessary sample size to achieve a certain power, you just compute the power for several sample sizes, perform a regression of the power versus the sample size, and find the necessary sample size by interpolation.

Remark: If simulations are used to compute a power, the results are, of course, affected by some random noise. Two simulations with, for example, 1000 repetitions will not give identical results for the power. You can investigate the precision of a simulated power by standard statistical techniques, for example, by computing a confidence interval for the power. A popular choice for the number of repetitions in power calculations is 2500, because then the standard error is in any case less than 0.01, and hence the confidence interval for the true power extends at most about ±0.02 around the estimated power. Determining the expected median length of a confidence interval typically requires fewer repetitions, as this is easier to estimate from simulations.

14.5
Power and Sample Size Calculations Given a Sample from the Covariate Distribution with Stata
To illustrate the use of power calculations, let us assume that we are interested in planning a study of breast cancer patients at a university clinic, and that the breastx data set mirrors the patients we have seen during the last year in the clinic. (breastx is identical to the breast data set except that we have removed the information on survival.) We start with the case that all patients are treated with the same therapy, and we plan to measure a new binary surrogate outcome 3 months after the start of the therapy, which divides the patients into "responders" and "nonresponders." We would like to know whether, with the number of patients we expect during the next year, the prognostic value of the factors age, tumor size, number of affected lymph nodes, and tumor grading with respect to this new outcome can be established. We expect for the variables number of affected lymph nodes and tumor grading an effect (on the logit scale) of 0.5 when going from one category to the next, and this effect should correspond to the effect of a difference in tumor size of 5 cm and to the effect of an age difference of 25 years. So, if we handle all four covariates as continuous ones, we assume a true regression coefficient of 0.5 for number of affected lymph nodes and tumor grading, of 0.1 for tumor size, and of 0.02 for age. We are not quite sure about the fraction of responders according to the new surrogate outcome, but expect a fraction between 0.1 and 0.5.

The first step is now to write a small Stata program to generate the outcome data in dependence on the regression parameters β0, β1, . . . , βp. Such a program (best written as a do-file) looks as follows:

. program define mysimdata1
  1.         syntax, b0(real) bage(real) bnode(real) bsize(real) bgrad(real)
  2.         use breastx, clear
  3.         gen pi=1/(1+exp(- (`b0' + `bage'*age + `bnode'*nodestat + ///
>                 `bsize'*tumorsize + `bgrad'*grad)))
  4.         gen y=runiform()<pi
  5.         end

. mysimdata1, b0(-2.0) bage(0.02) bnode(0.5) bsize(0.1) bgrad(0.5)
. logit y age nodestat tumorsize grad

Logistic regression                               Number of obs   =        533
                                                  LR chi2(4)      =      52.60
                                                  Prob > chi2     =     0.0000
                                                  Pseudo R2       =     0.0823

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0274514   .0079533     3.45   0.001     .0118631    .0430396
    nodestat |   .6442009   .1541734     4.18   0.000     .3420265    .9463753
   tumorsize |   .0032659   .1007006     0.03   0.974    -.1941035    .2006354
        grad |   .5786099   .1491734     3.88   0.000     .2862354    .8709844
       _cons |  -2.304992   .5859042    -3.93   0.000    -3.453343   -1.156641
------------------------------------------------------------------------------
However, our choice β0 = −2 was completely arbitrary, and the actual prevalence of the outcome in the simulated data set is larger than we wish:

. tab y

          y |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        153       28.71       28.71
          1 |        380       71.29      100.00
------------+-----------------------------------
      Total |        533      100.00
So we now try smaller values of β0 and look at the average values of the individual probabilities πi, as they provide more stable estimates of the prevalence than the relative frequency of Y:

. mysimdata1, b0(-3.0) bage(0.02) bnode(0.5) bsize(0.1) bgrad(0.5)
. tabstat pi, s(mean)

    variable |      mean
-------------+----------
          pi |  .4885479
------------------------

. mysimdata1, b0(-4.0) bage(0.02) bnode(0.5) bsize(0.1) bgrad(0.5)
. tabstat pi, s(mean)

    variable |      mean
-------------+----------
          pi |  .2774076
------------------------

. mysimdata1, b0(-5.0) bage(0.02) bnode(0.5) bsize(0.1) bgrad(0.5)
. tabstat pi, s(mean)

    variable |      mean
-------------+----------
          pi |   .131072
------------------------
Consequently, values of β0 in the range between -5 and -3 are relevant for our power calculations, as we expect a prevalence between 0.1 and 0.5. It now remains to repeat the simulation many times and to study how often we
can reject the null hypotheses of interest. The first step is to add the analyses of interest to a single simulation of the study:

. program define mysimdata2
  1.         syntax, b0(real) bage(real) bnode(real) bsize(real) bgrad(real)
  2.         use breastx, clear
  3.         gen pi=1/(1+exp(- (`b0' + `bage'*age + `bnode'*nodestat + ///
>                 `bsize'*tumorsize + `bgrad'*grad)))
  4.         gen y=runiform()<pi
  5.         logistic y age nodestat tumorsize grad
  6.         test age
  7.         test nodestat
  8.         test tumorsize
  9.         test grad
 10.         end

Logistic regression                               Number of obs   =        533
                                                  LR chi2(4)      =      30.46
                                                  Prob > chi2     =     0.0000
                                                  Pseudo R2       =     0.0686

------------------------------------------------------------------------------
           y | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1.016623   .0098925     1.69   0.090     .9974182    1.036199
    nodestat |   1.831843   .3118795     3.56   0.000     1.312101    2.557462
   tumorsize |   1.100515   .1280009     0.82   0.410      .876179     1.38229
        grad |   1.605914   .3185028     2.39   0.017     1.088685    2.368875
------------------------------------------------------------------------------

 ( 1)  [y]age = 0

           chi2(  1) =    2.87
         Prob > chi2 =    0.0902

 ( 1)  [y]nodestat = 0

           chi2(  1) =   12.64
         Prob > chi2 =    0.0004

 ( 1)  [y]tumorsize = 0

           chi2(  1) =    0.68
         Prob > chi2 =    0.4102

 ( 1)  [y]grad = 0

           chi2(  1) =    5.70
         Prob > chi2 =    0.0169
We can now repeat this command many times and count how often the p-values are
less than 0.05. Stata supports this process with its simulate command, which allows one to execute a program many times and to store the results in a data set. However, this requires, as a first step, ensuring that the program not only generates output, but also stores some results internally. So we have to use Stata's return mechanism. This mechanism allows one, after any command, to access the main results with expressions like r(...). After the test command, for example, the p-values can be accessed by r(p). And we can also allow our own program to create internal results by using the return command. So we extend our program:

. program define mysimdata3, rclass
  1.         syntax, b0(real) bage(real) bnode(real) bsize(real) bgrad(real)
  2.         use breastx, clear
  3.         gen pi=1/(1+exp(- (`b0' + `bage'*age + `bnode'*nodestat + ///
>                 `bsize'*tumorsize + `bgrad'*grad)))
  4.         gen y=runiform()<pi
  5.         logistic y age nodestat tumorsize grad
  6.         test age
  7.         return scalar page=r(p)
  8.         test nodestat
  9.         return scalar pnode=r(p)
 10.         test tumorsize
 11.         return scalar psize=r(p)
 12.         test grad
 13.         return scalar pgrad=r(p)
 14.         end

Logistic regression                               Number of obs   =        533
                                                  LR chi2(4)      =      30.46
                                                  Prob > chi2     =     0.0000
                                                  Pseudo R2       =     0.0686

------------------------------------------------------------------------------
           y | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1.016623   .0098925     1.69   0.090     .9974182    1.036199
    nodestat |   1.831843   .3118795     3.56   0.000     1.312101    2.557462
   tumorsize |   1.100515   .1280009     0.82   0.410      .876179     1.38229
        grad |   1.605914   .3185028     2.39   0.017     1.088685    2.368875
------------------------------------------------------------------------------

 ( 1)  [y]age = 0

           chi2(  1) =    2.87
         Prob > chi2 =    0.0902

 ( 1)  [y]nodestat = 0

           chi2(  1) =   12.64
         Prob > chi2 =    0.0004

 ( 1)  [y]tumorsize = 0

           chi2(  1) =    0.68
         Prob > chi2 =    0.4102

 ( 1)  [y]grad = 0

           chi2(  1) =    5.70
         Prob > chi2 =    0.0169
. return list

scalars:
              r(pgrad) =  .0169222265163092
              r(psize) =  .4102375482545775
              r(pnode) =  .0003774055660454
               r(page) =  .0902080707524761
The return list command allows one to take a look at all internal results generated by the last command executed, and hence we can see that the four p-values are now stored internally. Note that we have added an rclass option in the program define line. This is necessary to allow the use of the return command. So now we can use Stata's simulate command to generate, for example, 10 repetitions of the study and store the p-values in a data set:

. simulate page=r(page) pnode=r(pnode) psize=r(psize) pgrad=r(pgrad), ///
>         reps(10): mysimdata3, b0(-5.0) bage(0.02) bnode(0.5) bsize(0.1) bgrad(0.5)

      command:  mysimdata3, b0(-5.0) bage(0.02) bnode(0.5) bsize(0.1) bgrad(0.5)
         page:  r(page)
        pnode:  r(pnode)
        psize:  r(psize)
        pgrad:  r(pgrad)

Simulations (10)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5

. list

     +-------------------------------------------+
     |     page      pnode      psize      pgrad |
     |-------------------------------------------|
  1. | .0048062   .0021121   .0355471   .0011184 |
  2. |  .015484   .0003594   .6125847   .0001055 |
  3. | .0722209   .0038082   .9783179   .4512527 |
  4. | .0606583   .0092659   .3567294   .0004139 |
  5. | .1663147    .004979   .0814798   .0025115 |
     |-------------------------------------------|
  6. | .2256224   .0052241   .0342396   .0287104 |
  7. | .1789334   .0002863   .2126401   .0042908 |
  8. | .0392135   .5980729   .0503882   .0427329 |
  9. | .9427388   .0225225   .9216042   .1816052 |
 10. | .0027463   .0000717   .4514185   .0051899 |
     +-------------------------------------------+
The expressions after the simulate command told Stata to store the results of the
mysimdata3 program in the given names. So now we are prepared for a real power calculation using 2500 repetitions by changing the value of the reps() option:

. simulate page=r(page) pnode=r(pnode) psize=r(psize) pgrad=r(pgrad), ///
>         nodots reps(2500): ///
>         mysimdata3, b0(-5.0) bage(0.02) bnode(0.5) bsize(0.1) bgrad(0.5)

      command:  mysimdata3, b0(-5.0) bage(0.02) bnode(0.5) bsize(0.1) bgrad(0.5)
         page:  r(page)
        pnode:  r(pnode)
        psize:  r(psize)
        pgrad:  r(pgrad)
We have also used the nodots option to suppress the plotting of a dot for each simulation. It now remains to define indicators for significant results and to look at the relative frequency of significant results by computing the means of these indicator variables:

. gen sigage=page<0.05
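The same simulation approach works for a continuous outcome; the data-generating program then needs a sigma() option for the standard deviation of the error term instead of b0(). A minimal sketch of such a program, written in the pattern of mysimdata1 above (the exact listing and the assumed effect sizes may differ; rnormal() is used to draw the errors):

. program define mysimdata4, rclass
  1.         syntax, sigma(real) bage(real) bnode(real) bsize(real) bgrad(real)
  2.         use breastx, clear
  3.         gen mu = `bage'*age + `bnode'*nodestat + `bsize'*tumorsize + `bgrad'*grad
  4.         gen y = mu + rnormal(0,`sigma')
  5.         regress y age nodestat tumorsize grad
  6.         test age
  7.         return scalar page=r(p)
  8.         test nodestat
  9.         return scalar pnode=r(p)
 10.         test tumorsize
 11.         return scalar psize=r(p)
 12.         test grad
 13.         return scalar pgrad=r(p)
 14.         end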
and then use it in simulate:

. simulate page=r(page) pnode=r(pnode) psize=r(psize) pgrad=r(pgrad), ///
>         nodots reps(2500): ///
>         mysimdata4, sigma(25) bage(0.5) bnode(10) bsize(2.5) bgrad(10)

      command:  mysimdata4, sigma(25) bage(0.5) bnode(10) bsize(2.5) bgrad(10)
         page:  r(page)
        pnode:  r(pnode)
        psize:  r(psize)
        pgrad:  r(pgrad)
. gen sigage=page<0.05
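To also vary the sample size, a program like mysimdata5 can take an additional n() option and draw a random sample of the requested size before generating the outcome. A minimal sketch under this assumption (using, for example, the sample command; the exact listing may differ):

. program define mysimdata5, rclass
  1.         syntax, n(real) sigma(real) bage(real) bnode(real) bsize(real) bgrad(real)
  2.         use breastx, clear
  3.         sample `n', count
  4.         gen mu = `bage'*age + `bnode'*nodestat + `bsize'*tumorsize + `bgrad'*grad
  5.         gen y = mu + rnormal(0,`sigma')
  6.         regress y age nodestat tumorsize grad
  7.         test age
  8.         return scalar page=r(p)
  9.         test nodestat
 10.         return scalar pnode=r(p)
 11.         test tumorsize
 12.         return scalar psize=r(p)
 13.         test grad
 14.         return scalar pgrad=r(p)
 15.         end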
. simulate page=r(page) pnode=r(pnode) psize=r(psize) pgrad=r(pgrad), ///
>         nodots reps(2500): ///
>         mysimdata5, n(250) sigma(25) bage(0.5) bnode(10) bsize(2.5) bgrad(10)

      command:  mysimdata5, n(250) sigma(25) bage(0.5) bnode(10) bsize(2.5) bgrad(10)
         page:  r(page)
        pnode:  r(pnode)
        psize:  r(psize)
        pgrad:  r(pgrad)
------------------------------------------------------------------------------
    numwords |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |    20.7845   .8602082    24.16   0.000     19.08742    22.48157
       resid |   5.769928   1.071965     5.38   0.000     3.655081    7.884775
       _cons |  -398.6991   17.64418   -22.60   0.000    -433.5087   -363.8894
------------------------------------------------------------------------------
If we now fit a model only with age, we obtain the same regression coefficient for age as above:

. regress numwords age, noheader
------------------------------------------------------------------------------
    numwords |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |    20.7845   .9226259    22.53   0.000     18.96434    22.60465
       _cons |  -398.6991   18.92446   -21.07   0.000    -436.0333   -361.3649
------------------------------------------------------------------------------
We now continue with the dataset vocgrowthall, which includes data for the whole age range from 18 to 30 months of age. We start by looking at the data with a scatter plot like that in Figure 17.7:

. use vocgrowthall
. scatter numwords age, msize(*0.6)
The mkspline command creates the dummy variables for a linear spline. The names of the dummy variables for each interval are just placed between the values for the cutpoints. The first dummy variable age1 will be identical to the variable age. Note the use of the marginal option!
. mkspline age1 19.5 age2 21 age3 = age , marginal
. list in 1/10
     +----------------------------------------------------------+
     | id        age   numwords       age1       age2      age3 |
     |----------------------------------------------------------|
  1. |  1       27.4        281       27.4        7.9       6.4 |
  2. |  2    28.5333        379   28.53333   9.033333  7.533333 |
  3. |  3    22.7333         82   22.73333   3.233334  1.733334 |
  4. |  4    27.2667        210   27.26667   7.766666  6.266666 |
  5. |  5    24.4333        136   24.43333   4.933332  3.433332 |
     |----------------------------------------------------------|
  6. |  6    18.3333          5   18.33333          0         0 |
  7. |  7    25.5333        114   25.53333   6.033333  4.533333 |
  8. |  8       25.5        144       25.5          6       4.5 |
  9. |  9    21.7333         35   21.73333   2.233334  .7333336 |
 10. | 10       26.4        187       26.4        6.9       5.4 |
     +----------------------------------------------------------+
Now we can just apply the classical regression model:

. regress numwords age1 age2 age3, noheader
------------------------------------------------------------------------------
    numwords |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        age1 |   2.544484   7.275573     0.35   0.727    -11.73771    16.82668
        age2 |   13.62793   10.77826     1.26   0.206    -7.530159    34.78602
        age3 |   17.74485   4.677895     3.79   0.000     8.561986    26.92772
       _cons |  -43.27293   137.7613    -0.31   0.754    -313.7031    227.1572
------------------------------------------------------------------------------
To plot the fitted regression model, we again use predict:

. predict muhat
(option xb assumed; fitted values)

. sort age
. scatter numwords age, msize(*0.6) || line muhat age
and we obtain a graph like Figure 17.8. We can also test the null hypotheses of no effect of age as well as of a purely linear effect:

. test age1 age2 age3

 ( 1)  age1 = 0
 ( 2)  age2 = 0
 ( 3)  age3 = 0

       F(  3,   774) =  2200.62
            Prob > F =    0.0000

. test age2 age3

 ( 1)  age2 = 0
 ( 2)  age3 = 0

       F(  2,   774) =    43.44
            Prob > F =    0.0000
Without the marginal option, mkspline uses an alternative way to define the dummy variables, such that the regression coefficients coincide now directly with the slopes in each interval. The fitted curve is, of course, the same:

. use vocgrowthall, clear
. mkspline age1 19.5 age2 21 age3 = age
. regress numwords age1 age2 age3

      Source |       SS       df       MS              Number of obs =     778
-------------+------------------------------           F(  3,   774) = 2200.62
       Model |  9690110.25     3  3230036.75           Prob > F      =  0.0000
    Residual |  1136066.43   774  1467.78608           R-squared     =  0.8951
-------------+------------------------------           Adj R-squared =  0.8947
       Total |  10826176.7   777  13933.3033           Root MSE      =  38.312

------------------------------------------------------------------------------
    numwords |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        age1 |   2.544484   7.275573     0.35   0.727    -11.73771    16.82668
        age2 |   16.17241   4.388551     3.69   0.000      7.55754    24.78729
        age3 |   33.91726   .5404273    62.76   0.000     32.85639    34.97814
       _cons |  -43.27293   137.7613    -0.31   0.754    -313.7031    227.1572
------------------------------------------------------------------------------
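As a consistency check, the two parameterisations can be reconciled: with the marginal option the slope in the j-th interval is the sum of the first j coefficients, which reproduces the coefficients of the fit shown here:

slope in interval 1:  2.544484                           =  2.544484
slope in interval 2:  2.544484 + 13.62793                = 16.17241
slope in interval 3:  2.544484 + 13.62793 + 17.74485     = 33.91726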
To obtain a cubic spline, we use the option cubic, and with knots we specify the knots. The dummy variables are automatically numbered, and the first is identical to age:

. use vocgrowthall, clear
. mkspline agec=age , cubic knots(19 20 21 22)
. list in 1/10
1. 2. 3. 4. 5. 6. 7. 8.
+----------------------------------------------------------+ | id age numwords agec1 agec2 agec3 | |----------------------------------------------------------| | 1 27.4 281 27.4 13.46667 4.266666 | | 2 28.5333 379 28.53333 15.73333 5.022222 | | 3 22.7333 82 22.73333 4.133334 1.155556 | | 4 27.2667 210 27.26667 13.2 4.177778 | | 5 24.4333 136 24.43333 7.533331 2.288888 | |----------------------------------------------------------| | 6 18.3333 5 18.33333 0 0 | | 7 25.5333 114 25.53333 9.733333 3.022222 | | 8 25.5 144 25.5 9.666667 3 |
9. | 9 21.7333 35 21.73333 2.137548 .4909961 | 10. | 10 26.4 187 26.4 11.46667 3.6 | +----------------------------------------------------------+ . regress numwords agec* Source | SS df MS -------------+-----------------------------Model | 9690565.08 3 3230188.36 Residual | 1135611.6 774 1467.19845 -------------+-----------------------------Total | 10826176.7 777 13933.3033
Number of obs  =      778
F(  3,   774)  =  2201.60
Prob > F       =   0.0000
R-squared      =   0.8951
Adj R-squared  =   0.8947
Root MSE       =   38.304
-----------------------------------------------------------------------------numwords | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------agec1 | 5.280469 5.686891 0.93 0.353 -5.883089 16.44403 agec2 | 20.39088 18.73093 1.09 0.277 -16.37856 57.16033 agec3 | -18.13039 48.77307 -0.37 0.710 -113.8736 77.61278 _cons | -94.24322 108.4806 -0.87 0.385 -307.1944 118.7079 ------------------------------------------------------------------------------
We obtain the fitted curve in the usual way: . predict muchat (option xb assumed; fitted values) . sort age . scatter numwords age , msize(*0.5) xlab(18(3)30) || line muchat age
and obtain a graph similar to Figure 17.9. To check the adequacy of the fitted spline, we can take a look at the residuals:
. predict resid, resid
. lowess resid age, msize(*0.5) xlab(18(3)30)
[Lowess smoother of the residuals versus age (bandwidth = .8)]
We cannot detect any structure in the mean values of the residuals, so we can regard the restricted cubic spline as a rather adequate fit. The tests of the null hypotheses of no effect of age and of a purely linear effect of age can be performed in the following way:
. test agec1 agec2 agec3
 ( 1)  agec1 = 0
 ( 2)  agec2 = 0
 ( 3)  agec3 = 0
       F(  3,   774) = 2201.60
            Prob > F =   0.0000
. test agec2 agec3
 ( 1)  agec2 = 0
 ( 2)  agec3 = 0
       F(  2,   774) =   43.62
            Prob > F =   0.0000
If we want to select the knots following the general suggestion by Harrell (2001), we can specify just the nknots option. . drop muchat agec* . mkspline agec=age , cubic nknots(4) . regress numwords agec* Source | SS df MS -------------+-----------------------------Model | 9677251.12 3 3225750.37 Residual | 1148925.56 774 1484.39994 -------------+-----------------------------Total | 10826176.7 777 13933.3033
Number of obs  =      778
F(  3,   774)  =  2173.10
Prob > F       =   0.0000
R-squared      =   0.8939
Adj R-squared  =   0.8935
Root MSE       =   38.528
-----------------------------------------------------------------------------numwords | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------agec1 | 17.61758 1.926059 9.15 0.000 13.83666 21.3985 agec2 | 28.68546 5.775575 4.97 0.000 17.34781 40.0231 agec3 | -61.38871 17.23953 -3.56 0.000 -95.23049 -27.54693 _cons | -330.5325 38.92893 -8.49 0.000 -406.9513 -254.1137 -----------------------------------------------------------------------------. predict muchat (option xb assumed; fitted values)
. scatter numwords age , msize(*0.5) xlab(18(3)30) || line muchat age
[Scatter plot of numwords versus age with the fitted spline (4 knots) overlaid]
With 4 knots we do not obtain a satisfactory result, because all of the curvature in this example is located in a small interval. With 6 knots, however, it works reasonably well:
. drop muchat agec*
. mkspline agec=age , cubic nknots(6)
. regress numwords agec*
Source | SS df MS -------------+-----------------------------Model | 9699246 5 1939849.2 Residual | 1126930.68 772 1459.75476 -------------+-----------------------------Total | 10826176.7 777 13933.3033
Number of obs  =      778
F(  5,   772)  =  1328.89
Prob > F       =   0.0000
R-squared      =   0.8959
Adj R-squared  =   0.8952
Root MSE       =   38.207
-----------------------------------------------------------------------------numwords | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------agec1 | 5.358568 3.687843 1.45 0.147 -1.880822 12.59796 agec2 | 156.3433 33.90001 4.61 0.000 89.79619 222.8905
agec3 | -410.8409 110.0908 -3.73 0.000 -626.9537 -194.7281 agec4 | 393.1064 148.5527 2.65 0.008 101.4913 684.7216 agec5 | -210.5436 141.0942 -1.49 0.136 -487.5173 66.43015 _cons | -95.98105 71.64109 -1.34 0.181 -236.6155 44.65339 -----------------------------------------------------------------------------. predict muchat (option xb assumed; fitted values)
. scatter numwords age , msize(*0.5) xlab(18(3)30) || line muchat age
[Scatter plot of numwords versus age with the fitted spline (6 knots) overlaid]
To fit fractional polynomials, Stata offers the fracpoly command. It requires specifying the desired degree of the FP model and the regression model you want to fit. We start with fitting an FP1 model: . use vocgrowthall . fracpoly, degree(1) : regress numwords age -> gen double Iage__1 = X^3-14.27783188 if e(sample) (where: X = age/10) Source | SS df MS -------------+-----------------------------Model | 9659684.91 1 9659684.91 Residual | 1166491.77 776 1503.21104 -------------+-----------------------------Total | 10826176.7 777 13933.3033
Number of obs  =      778
F(  1,   776)  =  6426.03
Prob > F       =   0.0000
R-squared      =   0.8923
Adj R-squared  =   0.8921
Root MSE       =   38.771
-----------------------------------------------------------------------------numwords | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------Iage__1 | 17.18737 .2144065 80.16 0.000 16.76649 17.60826 _cons | 132.4486 1.404544 94.30 0.000 129.6915 135.2058 ------------------------------------------------------------------------------
Deviance: 7897.21. Best powers of age among 8 models fit: 3.
You can take a look at the fitted model using the fracplot command: . fracplot, msize(*0.5) xlab(18(3)30)
[Fractional Polynomial (3): fracplot of predictor+residual of numwords versus age]
We can see that the fit (by an FP1 model with power 3) is not satisfying. Next we try an FP2 model: . fracpoly, degree(2) : regress numwords age -> gen double Iage__1 = X^-2-.1699124325 if e(sample) -> gen double Iage__2 = X^-1-.4122043577 if e(sample) (where: X = age/10) Source | SS df MS -------------+-----------------------------Model | 9691217.82 2 4845608.91 Residual | 1134958.86 775 1464.46304 -------------+-----------------------------Total | 10826176.7 777 13933.3033
Number of obs  =      778
F(  2,   775)  =  3308.80
Prob > F       =   0.0000
R-squared      =   0.8952
Adj R-squared  =   0.8949
Root MSE       =   38.268
-----------------------------------------------------------------------------numwords | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------Iage__1 | 7278.924 364.382 19.98 0.000 6563.631 7994.216 Iage__2 | -7995.117 316.5525 -25.26 0.000 -8616.519 -7373.715 _cons | 136.7653 1.961529 69.72 0.000 132.9148 140.6159 -----------------------------------------------------------------------------Deviance: 7875.89. Best powers of age among 44 models fit: -2 -1. . fracplot, msize(*0.5) xlab(18(3)30)
[Fractional Polynomial (−2 −1): fracplot of predictor+residual of numwords versus age]
Now the fit is more satisfying. Note that fracplot also adds a confidence region to the fitted line, describing the uncertainty of the fitted function. To test the null hypotheses of no effect or of a purely linear effect, we can add the compare option: . fracpoly, degree(2) compare : regress numwords age -> gen double Iage__1 = X^-2-.1699124325 if e(sample) -> gen double Iage__2 = X^-1-.4122043577 if e(sample) (where: X = age/10) Source | SS df MS -------------+-----------------------------Model | 9691217.82 2 4845608.91 Residual | 1134958.86 775 1464.46304 -------------+-----------------------------Total | 10826176.7 777 13933.3033
Number of obs  =      778
F(  2,   775)  =  3308.80
Prob > F       =   0.0000
R-squared      =   0.8952
Adj R-squared  =   0.8949
Root MSE       =   38.268
-----------------------------------------------------------------------------numwords | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------Iage__1 | 7278.924 364.382 19.98 0.000 6563.631 7994.216 Iage__2 | -7995.117 316.5525 -25.26 0.000 -8616.519 -7373.715 _cons | 136.7653 1.961529 69.72 0.000 132.9148 140.6159 -----------------------------------------------------------------------------Deviance: 7875.89. Best powers of age among 44 models fit: -2 -1. Fractional polynomial model comparisons: -----------------------------------------------------------------------------age df Deviance Res. SD Dev. dif. P (*) Powers -----------------------------------------------------------------------------Not in model 0 9630.573 118.039 1754.678 0.000 Linear 1 7959.425 40.3528 83.531 0.000 1 m = 1 2 7897.215 38.7713 21.321 0.000 3 m = 2 4 7875.894 38.2683 --- -2 -1 -----------------------------------------------------------------------------(*) P-value from deviance difference comparing reported model with m = 2 model
Here, the fitted model is compared with the model of no effect of age (first line)
and the model with a pure linear effect (second line); hence, the p-values reported correspond exactly to our needs. This option can be also used to find out whether an FP3 model may give a better fit: . fracpoly, degree(3) compare : regress numwords age -> gen double Iage__1 = X^-2-.1699124325 if e(sample) -> gen double Iage__2 = X^-2*ln(X)-.1505825211 if e(sample) -> gen double Iage__3 = X^3-14.27783188 if e(sample) (where: X = age/10) Source | SS df MS -------------+-----------------------------Model | 9691931.92 3 3230643.97 Residual | 1134244.76 774 1465.43251 -------------+-----------------------------Total | 10826176.7 777 13933.3033
Number of obs  =      778
F(  3,   774)  =  2204.57
Prob > F       =   0.0000
R-squared      =   0.8952
Adj R-squared  =   0.8948
Root MSE       =   38.281
-----------------------------------------------------------------------------numwords | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------Iage__1 | 872.6452 236.872 3.68 0.000 407.6575 1337.633 Iage__2 | -7408.828 1676.239 -4.42 0.000 -10699.34 -4118.315 Iage__3 | 3.360137 2.981288 1.13 0.260 -2.492232 9.212506 _cons | 136.9374 2.091128 65.48 0.000 132.8325 141.0424 -----------------------------------------------------------------------------Deviance: 7875.40. Best powers of age among 164 models fit: -2 -2 3. Fractional polynomial model comparisons: -----------------------------------------------------------------------------age df Deviance Res. SD Dev. dif. P (*) Powers -----------------------------------------------------------------------------Not in model 0 9630.573 118.039 1755.168 0.000 Linear 1 7959.425 40.3528 84.020 0.000 1 m = 1 2 7897.215 38.7713 21.810 0.000 3 m = 2 4 7875.894 38.2683 0.490 0.784 -2 -1 m = 3 6 7875.405 38.281 --- -2 -2 3 -----------------------------------------------------------------------------(*) P-value from deviance difference comparing reported model with m = 3 model
This is not the case here, as indicated by the high p-value in the fourth line. If you want to use fractional polynomials simultaneously for several covariates, you have to use the mfp command.
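As a small illustration only (not part of the book's transcript): mfp uses the same prefix syntax as fracpoly, so with a hypothetical second continuous covariate weight one could type
. mfp: regress numwords age weight
and mfp would then cycle through the covariates and select a fractional polynomial transformation for each of them in turn.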
Finally, we take a look at the data behind the dose response experiment in Section 17.1:
. use toxic2, clear
. list in 1/10
+-------------------------------+
| dose oxygen cell damage |
|-------------------------------|
| 10 0 1 0 | | 10 0 2 1 | | 10 0 3 0 | | 10 0 4 0 | | 10 0 5 0 | |-------------------------------| | 10 0 6 0 | | 10 0 7 0 | | 10 0 8 0 | | 10 0 9 0 | | 10 0 10 0 | +-------------------------------+
We fit the quadratic model:
. gen dose2=dose^2
. logit damage dose dose2 oxygen
Iteration 0:   log likelihood = -968.75397
Iteration 1:   log likelihood = -506.07485
Iteration 2:   log likelihood = -501.45684
Iteration 3:   log likelihood = -500.73551
Iteration 4:   log likelihood = -500.73503
Iteration 5:   log likelihood = -500.73503

Logistic regression                             Number of obs   =       1400
                                                LR chi2(3)      =     936.04
                                                Prob > chi2     =     0.0000
Log likelihood = -500.73503                     Pseudo R2       =     0.4831
-----------------------------------------------------------------------------damage | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------dose | -.0056997 .0254765 -0.22 0.823 -.0556328 .0442334 dose2 | .0016444 .0003429 4.80 0.000 .0009723 .0023164 oxygen | -.9762224 .1664017 -5.87 0.000 -1.302364 -.650081 _cons | -2.436032 .450425 -5.41 0.000 -3.318849 -1.553215 ------------------------------------------------------------------------------
We can translate the quadratic effect into effect estimates at different doses, for example, at doses of 20 mg and 60 mg. Since the slope of β1*dose + β2*dose^2 at a given dose is β1 + 2*β2*dose, we obtain the slope at dose 20 with lincom as follows:
. lincom dose + 2 * dose2 * 20
( 1)
[damage]dose + 40*[damage]dose2 = 0
-----------------------------------------------------------------------------damage | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------(1) | .060075 .0125491 4.79 0.000 .0354792 .0846708 -----------------------------------------------------------------------------. lincom dose + 2 * dose2 * 60
( 1)  [damage]dose + 120*[damage]dose2 = 0
-----------------------------------------------------------------------------damage | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------(1) | .1916244 .0174105 11.01 0.000 .1575004 .2257484 ------------------------------------------------------------------------------
So, we can see that the effect of the dose is higher at higher doses compared to lower doses, which corresponds to what we can see in Figure 17.5. If we would like to express the effects as odds ratios, we can just add the or option to lincom. If we would like to know the effect of increasing the dose by 10 mg, we just have to multiply the slope at each dose by 10: . lincom 10*(dose + 2 * dose2 * 20), or ( 1)
10*[damage]dose + 400*[damage]dose2 = 0
-----------------------------------------------------------------------------damage | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------(1) | 1.823486 .2288314 4.79 0.000 1.425884 2.331958 -----------------------------------------------------------------------------. lincom 10*(dose + 2 * dose2 * 60), or ( 1)
10*[damage]dose + 1200*[damage]dose2 = 0
-----------------------------------------------------------------------------damage | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------(1) | 6.795387 1.183112 11.01 0.000 4.830761 9.559006 ------------------------------------------------------------------------------
Note, however, that we should be careful with the phrase “increasing the dose by 10 mg”. This should not be interpreted as a change from 20 to 30 mg or from 60 to 70 mg, respectively, as the slope changes over the interval. It would be more appropriate to interpret this as a change from 15 to 25 mg or from 55 to 65 mg, respectively. To visualise the fitted model, we make use of the predict command: . predict p (option pr assumed; Pr(damage)) . collapse (mean) freq=damage, by(dose oxygen p) . sort oxygen dose . scatter freq dose if oxygen==0, msym(O) || /// > scatter freq dose if oxygen==1, msym(Oh) || line p dose, c(L)
and obtain a graph similar to the left side of Figure 17.5. If we want the graph on the logit scale, we perform the following transformations:
. gen logitp=log(p/(1-p)) . gen logitf=log(freq/(1-freq)) . scatter logitf dose if oxygen==0, msym(O) || /// > scatter logitf dose if oxygen==1, msym(Oh) || line logitp dose, c(L)
and obtain a graph similar to the right side of Figure 17.5. We can also use fractional polynomials for this data. We just have to use logit instead of regress. fracpoly can be used with several covariates, but only the first covariate is subjected to transformations.
. use toxic2, clear
. fracpoly, degree(1): logit damage dose oxygen
-> gen double Idose__1 = X^2-16 if e(sample)
(where: X = dose/10)
Iteration 0:   log likelihood = -968.75397
Iteration 1:   log likelihood = -506.36519
Iteration 2:   log likelihood = -500.82165
Iteration 3:   log likelihood = -500.76001
Iteration 4:   log likelihood =    -500.76

Logistic regression                             Number of obs   =       1400
                                                LR chi2(2)      =     935.99
                                                Prob > chi2     =     0.0000
Log likelihood =    -500.76                     Pseudo R2       =     0.4831
-----------------------------------------------------------------------------damage | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------Idose__1 | .1569958 .0079587 19.73 0.000 .1413971 .1725945 oxygen | -.976046 .1663686 -5.87 0.000 -1.302122 -.6499696 _cons | -.0188411 .1104458 -0.17 0.865 -.2353109 .1976287 -----------------------------------------------------------------------------Deviance: 1001.52. Best powers of dose among 8 models fit: 2.
We can observe that fracpoly suggested a quadratic transformation as the most suitable one. We can take a look at whether an FP2 model gives a better fit:
. fracpoly, degree(2) compare: logit damage dose oxygen
-> gen double Idose__1 = X^3-64 if e(sample)
-> gen double Idose__2 = X^3*ln(X)-88.72283911 if e(sample)
(where: X = dose/10)
Iteration 0:   log likelihood = -968.75397
Iteration 1:   log likelihood = -510.11491
Iteration 2:   log likelihood = -500.09626
Iteration 3:   log likelihood = -499.84168
Iteration 4:   log likelihood = -499.84103
Iteration 5:   log likelihood = -499.84103

Logistic regression                             Number of obs   =       1400
                                                LR chi2(3)      =     937.83
                                                Prob > chi2     =     0.0000
Log likelihood = -499.84103                     Pseudo R2       =     0.4840
-----------------------------------------------------------------------------damage | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------Idose__1 | .0781717 .0127383 6.14 0.000 .0532051 .1031383 Idose__2 | -.0293431 .0067726 -4.33 0.000 -.0426172 -.0160689 oxygen | -.9845484 .1673635 -5.88 0.000 -1.312575 -.6565219 _cons | -.0086653 .1256435 -0.07 0.945 -.254922 .2375915 -----------------------------------------------------------------------------Deviance: 999.68. Best powers of dose among 44 models fit: 3 3. Fractional polynomial model comparisons: --------------------------------------------------------------dose df Deviance Dev. dif. P (*) Powers --------------------------------------------------------------Not in model 0 1920.933 921.251 0.000 Linear 1 1025.667 25.985 0.000 1 m = 1 2 1001.520 1.838 0.399 2 m = 2 4 999.682 --- 3 3 --------------------------------------------------------------(*) P-value from deviance difference comparing reported model with m = 2 model
This is not the case, as indicated by the p-value in the third line.
17.12 The Impact of Ignoring Nonlinearity
We have mentioned at the start of Section 17.5 that it is unrealistic to assume that any true model is exactly linear. Nevertheless, most analyses are done with models assuming a purely linear effect. We have presented in Section 17.6 an argument based on power considerations which explains why this is reasonable. But what about the effect estimates? Can we trust effect estimates from a linear model if the true model is not linear? The answer is: almost always. This is related to the fact that estimates of the slope mainly depend on the observations in the upper and lower end of the distribution of X (cf. Section 15.3). And this also holds when the true model is nonlinear. So in this case the slope estimate describes the change of Y when moving from values in the lower end of the distribution of X to values in the upper end, ignoring what happens in between. And hence the slope catches the “general trend” in our data. Both aspects are illustrated in the first five data sets of Figure 17.19: The regression line (in grey) catches the overall trend, and its slope is nearly identical to that of a regression line using only the observations with the 15% smallest and 15% largest values of X. (Note that the intercept can be quite different between the two lines!) So estimated slopes are not misleading just because we have a nonlinear relationship, at least as long as the relation is monotone. Of course, in the case of a distinctly nonmonotone relation like the one in the last data set in Figure 17.19, such a slope can be very misleading. But again this is something we can detect by inspection of our data. Even if it is known that the effect of a covariate is nonlinear, we may prefer to
Figure 17.19 Scatter plots of six data sets. For each data set a linear regression is fitted using all points (the grey line) or only the points marked in black (black dashed line), respectively.
model a linear effect, because a linear effect has the great practical advantage that we can express the effect of a covariate with a single number. Whenever we use nonlinear models, we have to describe the effect of the covariate by a graphical display or at least by several numbers (cf. the considerations for the quadratic model above). And this makes the presentation of the results rather complicated. Hence, for the sake of simplicity and in order to be able to send short and comprehensive messages we may prefer to use linear modelling, although we know that the effect is (slightly) nonlinear.
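The observation that the slope mainly reflects the two ends of the distribution of X can easily be checked empirically. Here is a minimal simulated illustration (not based on any of the book's data sets; the data-generating model is arbitrary): we generate a monotone but nonlinear relationship and compare the slope using all observations with the slope using only the observations in the outer 15% tails of X.
. clear
. set obs 200
. set seed 1234
. gen x = rnormal()
. gen y = x + 0.2*x^3 + rnormal()
. regress y x
. centile x, centile(15 85)
. regress y x if x < r(c_1) | x > r(c_2)
The two slope estimates should typically be rather close, in line with the behaviour seen in Figure 17.19.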
17.13 Modelling the Nonlinear Effect of Confounders
So far we have focused on the question how to model the nonlinear effect of the covariate we are most interested in. However, continuous covariates may also be included in a regression model because of their role as a potential confounder. This raises the question whether it may be relevant to model also the effect of a confounder in a nonlinear manner. Indeed, examples can be found in which only a nonlinear modelling of the effect of the confounder gives a correct assessment of the effect of the covariate of interest. Figure 17.20 shows such an example: If the effect of X2 is modelled correctly by a quadratic function, we obtain a much larger effect of X1 (i.e., a larger distance between the two regression lines) than if we assume incorrectly
Figure 17.20 A data set with a binary covariate X1 , a continuous covariate X2 , and a continuous outcome Y together with a fitted quadratic regression model (left side) and a linear model (right side).
a linear relationship. This can also be read from the two corresponding outputs when comparing the effects of x1:
variable      beta     SE     95%CI           p-value
intercept     19.65    3.30   [13.19,26.11]
|t| [95% Conf. Interval] -------------+---------------------------------------------------------------sex | 2.828571 11.09429 0.25 0.811 -27.97412 33.63126 | sex#c.dose | 0 | .1428571 .1576874 0.91 0.416 -.2949532 .5806675 1 | .5885714 .1576874 3.73 0.020 .1507611 1.026382 | _cons | 47.71429 7.844848 6.08 0.004 25.9335 69.49507 ------------------------------------------------------------------------------
the same results as in our first analysis. Read more about this by typing help fvvarlist. We now consider the example of Figure 19.4 and redo the analyses performed in Section 19.1: . use hyper . list in 1/10
+----------------------------+ | id age smoking hyper | |----------------------------| | 1 47 0 0 | | 2 53 1 1 | | 3 50 1 1 | | 4 40 1 1 | | 5 33 1 0 | |----------------------------| | 6 47 0 0 | | 7 54 1 1 | | 8 38 0 0 | | 9 59 0 1 | | 10 67 0 1 | +----------------------------+
We would like to investigate the effect of smoking as a function of age. Since age is a continuous covariate, we have to follow the interaction approach:
. gen agesmoking=age*smoking
. logit hyper age smoking agesmoking
Iteration 0:   log likelihood = -877.28077
Iteration 1:   log likelihood = -795.08929
Iteration 2:   log likelihood = -793.22769
Iteration 3:   log likelihood = -793.22413
Iteration 4:   log likelihood = -793.22413

Logistic regression                             Number of obs   =       1332
                                                LR chi2(3)      =     168.11
                                                Prob > chi2     =     0.0000
Log likelihood = -793.22413                     Pseudo R2       =     0.0958
-----------------------------------------------------------------------------hyper | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------age | .0512461 .0064305 7.97 0.000 .0386426 .0638497 smoking | 2.955036 .5061917 5.84 0.000 1.962918 3.947153 agesmoking | -.0326584 .0087536 -3.73 0.000 -.0498152 -.0155016 _cons | -3.905869 .3847093 -10.15 0.000 -4.659885 -3.151852 ------------------------------------------------------------------------------
Now we can assess the change of the effect of smoking over 10 years: . lincom agesmoking*10 ( 1)
10*[hyper]agesmoking = 0
-----------------------------------------------------------------------------hyper | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------(1) | -.3265839 .0875364 -3.73 0.000 -.4981521 -.1550156 ------------------------------------------------------------------------------
and we can use lincom’s or option to obtain the result as the factor corresponding to the change of the odds ratio: . lincom agesmoking*10, or ( 1)
10*[hyper]agesmoking = 0
-----------------------------------------------------------------------------hyper | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------(1) | .7213839 .0631474 -3.73 0.000 .6076525 .8564018 ------------------------------------------------------------------------------
Now we look at the effect of smoking at age 35 and age 75: . lincom _b[smoking] + 35 * _b[agesmoking] ( 1)
[hyper]smoking + 35*[hyper]agesmoking = 0
------------------------------------------------------------------------------
hyper | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------(1) | 1.811992 .2217385 8.17 0.000 1.377393 2.246591 -----------------------------------------------------------------------------. lincom _b[smoking] + 75 * _b[agesmoking] ( 1)
[hyper]smoking + 75*[hyper]agesmoking = 0
-----------------------------------------------------------------------------hyper | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------(1) | .5056565 .2059231 2.46 0.014 .1020546 .9092584 ------------------------------------------------------------------------------
Finally, we take a look at how to perform likelihood ratio tests (which are again slightly more powerful). First, we test the null hypothesis of no interaction by comparing the model above with a model without an interaction:
. estimates store A
. logit hyper age smoking
Iteration 0:   log likelihood = -877.28077
Iteration 1:   log likelihood = -801.10435
Iteration 2:   log likelihood =  -800.2788
Iteration 3:   log likelihood = -800.27872

Logistic regression                             Number of obs   =       1332
                                                LR chi2(2)      =     154.00
                                                Prob > chi2     =     0.0000
Log likelihood = -800.27872                     Pseudo R2       =     0.0878
-----------------------------------------------------------------------------hyper | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------age | .0344195 .0043207 7.97 0.000 .0259511 .0428878 smoking | 1.139242 .1214511 9.38 0.000 .901202 1.377281 _cons | -2.936872 .2597851 -11.31 0.000 -3.446042 -2.427703 -----------------------------------------------------------------------------. lrtest A Likelihood-ratio test (Assumption: . nested in A)
LR chi2(1) =     14.11
Prob > chi2 =    0.0002
Next, we test the null hypothesis of “no effect of smoking at all” by comparing the full model with the model in which all terms involving smoking are omitted:
. logit hyper age
Iteration 0:   log likelihood = -877.28077
Iteration 1:   log likelihood = -846.18595
Iteration 2:   log likelihood = -846.07003
Iteration 3:   log likelihood = -846.07002

Logistic regression                             Number of obs   =       1332
                                                LR chi2(1)      =      62.42
                                                Prob > chi2     =     0.0000
Log likelihood = -846.07002                     Pseudo R2       =     0.0356
-----------------------------------------------------------------------------hyper | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------age | .0320192 .0041419 7.73 0.000 .0239012 .0401373 _cons | -2.278813 .2361742 -9.65 0.000 -2.741706 -1.81592 -----------------------------------------------------------------------------. lrtest A Likelihood-ratio test (Assumption: . nested in A)
LR chi2(2) =    105.69
Prob > chi2 =    0.0000

19.11 Exercise Treatment Interactions in a Randomised Clinical Trial for the Treatment of Malignant Glioma
The data set gliom includes data from a randomised clinical trial1 comparing a mono chemotherapy (BCNU) with a combination therapy (BCNU + VW 26) for the treatment of malignant glioma in adult patients. The variable time includes the survival time (in days) after chemotherapy (if died==1), or the time until the last contact with the patient (if died==0). (a) It has been claimed that the benefit from the combination therapy depends on the performance status of the patient. The variable ps divides the subjects into those with poor performance status (ps==0) and those with good performance status (ps==1). Try to clarify the claim based on the available data. Try to support the results of your analysis by appropriate Kaplan-Meier curves. (b) Some people have even raised doubts that the combination therapy is of any benefit for the patients. Try to clarify this question based on the available data. (c) It has been claimed that the benefit from the combination therapy depends on the age of the patient. Try to clarify this claim based on the available data. Try to support the results of your analysis by appropriate Kaplan-Meier curves. (d) The data set includes, besides the variable ps, also a variable karnindex, which divides the subjects into three groups according to their Karnofsky index, a more detailed measurement of the performance status. Try to reanalyse (a) again using this variable. Try to support the results of your analysis by appropriate KaplanMeier curves. 1 This data set is based on a real clinical trial. However, the trial has never been published in a medical journal. It has been used in several methodological investigations (e.g., Sauerbrei and Schumacher (1992) and Royston and Sauerbrei (2008)).
THIS CHAPTER IN A NUTSHELL
Regression models also allow investigation of the effect of one covariate on the effect of another covariate. Such effect modifications can often be modelled directly, or they can be handled using interactions. Some care is necessary in making statements on the effect of a covariate if it is involved in an interaction. A possible influence of the choice of the outcome scale and ceiling effects should be taken into account in the interpretation of interactions, in particular, if effect modifications are interpreted with reference to a real-world model. Hunting for interactions should be avoided.
Chapter 20
Applying Regression Models to Clustered Data
In this chapter we discuss the impact of clustering on the validity of regression analyses and present several approaches to ensure valid inference.
20.1 Why Clustered Data Can Invalidate Inference
It is a basic assumption behind all inference procedures described in this book that the different subjects in your study (corresponding to different rows in your data set) provide independent information on the association of interest. This means that the value of Yi for one subject does not allow prediction of the Y value of any other subject in any way which cannot be explained by the covariate values. Sometimes this is not the case. In particular, if the data is collected in certain clusters like families, practices, hospital departments, etc., it may happen that these clusters have an influence of their own without being covariates. For example, if we look at an outcome like fat intake in subjects, and we collect data in families, we may have to expect a tendency that the shared cooking and eating habits of a family influence the outcome. So if we know for one member of a family that he or she has a high fat intake (relative to his or her covariate pattern), then we have to expect to some degree that the other family members also tend to have a high fat intake. So the outcomes are no longer independent, and we have a correlation of the outcomes within each cluster, which we cannot explain by the covariate values of the subjects. Let us look at a concrete example in Figure 20.1, in which we depict the association of the satisfaction of 109 patients with their GP (measured on a visual analog scale between 0 and 100) with the experience of the GP (measured as number of years acting as GP). At first sight the association looks rather convincing, and a simple regression analysis of this data gives the following output:
variable      beta     SE     95%CI           p-value
experience    1.57     0.51   [0.56,2.57]     0.0026
sex           17.81    5.60   [6.69,28.92]    0.0020
We can observe a significant effect of the experience suggesting an increase of the patient satisfaction by 1.57 on the visual analog scale with each year of experience. However, we may come to doubt this result if we take into account that the
Figure 20.1 A scatter plot of data on patient satisfaction versus experience of their general practitioner in 109 subjects.
Figure 20.2 A scatter plot of data on patient satisfaction versus experience of the general practitioner in 109 subjects from 20 practices.
data covers only 20 different GPs. In Figure 20.2 we can, for example, observe that all patients in practice 14 were rather satisfied with their GP, and this was a GP with long experience. On the other hand, all patients of the GP with the shortest experience, in practice 19, were not very satisfied. So these two practices may be responsible for much of the observed association. That the patients in practice 14 are so satisfied and those in practice 19 so unsatisfied may be due to completely different factors than the experience of the GP. However, in computing confidence intervals and p-values the statistical programs assume that all these patients provide independent information, that is, that they come from different practices. Hence, they cannot explain the common low and high values by the fact that they come from the same practice. And, hence, p-values and confidence intervals are too optimistic, and they overstate the precision of the estimates and the evidence for effects. So, there is some need for methods to take into account that we have in this example not 109 independent observations, but 109 observations clustered into 20 practices.
20.2 Robust Standard Errors
In Section 5.1 we have already seen that there exists a mathematical approach to perform inference for the classical regression model, which does not assume a normal distribution of the error term. It resulted in so-called “robust standard errors.” The same mathematical approach can be also used to compute standard errors in the presence of clustered observations. (Mathematical details are outlined in Appendix C.2.) As in Section 5.1, this means that we stick to the effect estimates of the classical regression, and only change the way to compute standard errors. Of course, this time we have to tell the statistical program which observations are clustered.
Applying this technique, we now obtain an output like
variable      beta     SE     95%CI           p-value
experience    1.57     0.59   [0.34,2.80]     0.015
sex           17.81    5.34   [6.64,28.98]    0.003
We can see that the effect estimates are the same as above, but that the standard errors, confidence intervals, and p-values have changed. Taking the clustering in practices into account, the standard error of the effect of experience has increased, that is, we now know that the true precision of this effect estimate is lower than indicated by our first, incorrect analysis. The p-value has increased correspondingly, but the effect is still significant. It is important to note that these corrections are not just a function of the number of clusters and of the cluster sizes. They take into account that the impact of the clustering can differ from effect estimate to effect estimate. For example, we can see in the example above that the standard error of sex is much less affected than the standard error of experience. The reason for this is that we have patients of both sexes in nearly all practices, such that some of the information on the effect of sex comes from comparisons within practices. And these comparisons remain valid even if the single practices have an effect of their own. So the precision of this effect is less affected by clustering. Actually, the amount of correction performed by the robust standard errors depends on the degree of correlation of the outcomes within the clusters. If there is no correlation, that is, if the clusters do not have an effect of their own, then the robust standard errors are close to the ordinary standard errors, and they may be even smaller. So we do not necessarily pay a price if we erroneously adjust the inference for clustering when it is not necessary. So this way we have a simple tool to take a clustering of the observations into account. And the technique can be used for any type of regression model, in particular, also for logistic regression and Cox regression. Remark: Robust standard errors are only valid in large samples. The essential point is, however, no longer the sample size, but the number of clusters (or, to be precise, some number between the number of clusters and the overall number of observations, depending on the degree of correlation). In particular, one should be cautious with using robust standard errors in the case of few clusters with many observations. If you are in doubt, a small simulation study can help to clarify the question of the validity of the standard errors and confidence intervals.
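Such a simulation study requires only a few lines of Stata code. The following is a minimal sketch (not from the book; all names and data-generating assumptions are illustrative): we generate data with only 10 clusters of 30 subjects each, a cluster-level covariate with no true effect, and a cluster effect in the outcome, and then check how often the cluster-robust analysis rejects the null hypothesis at the 5% level.
capture program drop clustersim
program define clustersim, rclass
    clear
    set obs 10                        // only 10 clusters
    gen clustid = _n
    gen u = rnormal()                 // cluster effect
    gen x = rnormal()                 // covariate, constant within each cluster
    expand 30                         // 30 subjects per cluster
    gen y = u + rnormal()             // the true effect of x is 0
    regress y x, vce(cluster clustid)
    test x
    return scalar p = r(p)
end
simulate p=r(p), reps(1000) seed(421): clustersim
count if p < .05                      // should be close to 50 (i.e., 5%) if the level is kept
If the rejection rate is clearly above 5%, the robust standard errors cannot be trusted with so few clusters.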
20.3 Improving the Efficiency
In the last section we saw a solution to take clustering into account in the inference without changing the usual effect estimates. This has a lot of practical advantages, since if we use other, more sophisticated methods which also change the effect estimate, we may have to convince readers that we still estimate what we usually estimate. We will see later that some of the more sophisticated methods actually estimate different parameters. In changing only the standard error, we can tell the reader that
all regression estimates are obtained in the usual way, and only with respect to the inference have we taken the clustering into account. However, we may raise the question whether this is an optimal solution with respect to efficiency in estimating the parameters of interest and power. In the above approach, we still give all subjects the same weight. If clusters are varying in size, we may expect to gain efficiency, if we give subjects in large clusters a lower weight: As the outcomes of the subjects within a cluster are correlated, adding a new subject to a single cluster adds less information than adding a new cluster with a single subject. Downweighting of the subjects from large clusters can take this into account. There are two popular approaches to implement this idea. This first one is given by a so-called random intercept model. The main idea behind this approach is to explicitly model the effect of each cluster by introducing a cluster-specific level for each cluster. Then—with numbering the clusters from j = 1 to J and the subjects within each cluster from i = 1 to the cluster size n j —the expected value of Y ji reads (in the case of a single covariate)
μ_ji = β0 + θ_j + β1 x_ji
with θ_j denoting the extra level of cluster j. These θ_j are assumed to vary randomly around 0, actually coming from a normal distribution with mean 0 and unknown standard deviation σθ. So although the model reads like a regression model, it cannot be fitted by standard software for regression and requires specific procedures. Besides the effect estimates
variable      beta     SE     95%CI           p-value
experience    1.61     0.75   [0.13,3.08]     0.032
sex           18.04    5.47   [7.32,28.75]    0.001
which are very similar to the previous ones, such programs also give an estimate for the standard deviation σθ of the random effect. In our example the estimate of σθ is 11.54. The second approach is known by the name GEE, abbreviating Generalized Estimating Equations, and does not change the underlying model, but just modifies the estimation procedure in a certain way under the assumption of a constant correlation ρ among the outcomes in a cluster given the covariate values. Applying this approach yields an output like
variable      beta     SE     95%CI           p-value
experience    1.61     0.78   [0.08,3.15]     0.040
sex           18.09    5.30   [7.69,28.48]    0.001
with, again, very similar effect estimates. Although some gain in efficiency using these procedures can be expected, we should not expect too much. In Table 20.1 we compare the two approaches described above with the simple use of robust standard errors, and we can observe that a gain in power can only be expected if the correlation is substantial and we have both small and large clusters. The latter is not surprising, as in this case we have both subjects we would like to down-weight as well as subjects we do not want to down-weight. So as long as the variation of cluster sizes is moderate, there is no need to replace the
1 .5
fraction of clusters of size 2 3 4 5 6 .5 J = 160 .5
.5
.33 .34 J = 70
.33
7
J = 60 .33
.34
.33 J = 135
.25
.25
.25 .25 J = 110
.2
.2
.2 J = 95
.2
.2
.4
.3
.1 .1 J = 125
.1
.3
.3
.1 .1 J = 130
.1
.1
.3
.2
.1 .1 J = 120
.1
.1
.1
method robust se re GEE robust se re GEE robust se re GEE robust se re GEE robust se re GEE robust se re GEE robust se re GEE robust se re GEE robust se re GEE
0.0 0.86 0.85 0.86 0.88 0.87 0.89 0.90 0.89 0.90 0.90 0.89 0.91 0.90 0.89 0.91 0.91 0.89 0.91 0.90 0.89 0.90 0.95 0.94 0.95 0.97 0.95 0.97
0.2 0.82 0.82 0.82 0.70 0.69 0.70 0.82 0.82 0.83 0.73 0.73 0.74 0.79 0.80 0.80 0.77 0.78 0.79 0.80 0.81 0.81 0.85 0.86 0.86 0.84 0.86 0.87
ρ 0.4 0.78 0.78 0.79 0.55 0.55 0.56 0.75 0.77 0.78 0.60 0.61 0.62 0.69 0.72 0.72 0.66 0.68 0.69 0.70 0.74 0.75 0.74 0.78 0.79 0.72 0.76 0.77
0.6 0.75 0.76 0.77 0.47 0.46 0.47 0.68 0.71 0.72 0.51 0.51 0.52 0.62 0.66 0.66 0.57 0.61 0.61 0.62 0.68 0.69 0.64 0.71 0.72 0.60 0.69 0.69
0.8 0.70 0.73 0.74 0.40 0.40 0.41 0.63 0.67 0.68 0.44 0.44 0.45 0.55 0.59 0.60 0.49 0.54 0.55 0.55 0.62 0.63 0.57 0.65 0.66 0.53 0.62 0.63
Table 20.1 The probability to reject the null hypothesis H0 : β = 0 in a simple linear regression model with one continuous covariate X for three approaches (robust se = robust standard error, GEE = generalized estimation equations, re = random effects model) to take clustering into account and in dependence on the correlation ρ of the errors within each cluster. X is assumed to be constant within each cluster and is normally distributed with mean 0 and standard deviation 1.0. The standard deviation of the error term is 1.0, and β is set to 0.2. J denotes the number of clusters, and the expected fraction of clusters of each size is given on the left side of the table.
use of robust standard errors by any of the two other approaches in order to obtain a substantial gain in power. Remark: Random effects models can be also used to extend the logistic regression model or the Cox model. In the latter case they are known under the name “shared frailty models.” The true effect parameters in these models are no longer identical to the true effect parameters in the ordinary logistic or Cox model without a random effect. This is due to the fact that adding an uncorrelated covariate in these models
changes the true effect, as demonstrated in Section 16.7. The true effects in these models are called subject specific effects in contrast to population average effects in the ordinary models. In this context the ordinary models are often called marginal models. Remark: Inference for the GEE approach as well as for the random effects models are based on asymptotic arguments and hence only valid in large samples (cf. the remark in the previous section). Remark: In some situations the correlation structure within a cluster may be more complicated. For example, when we have families as clusters, the correlation between the parents may be different from the correlations between children and parents or among children, or children of the same sex may correlate differently from children of different sex. If these factors are also in the regression model as covariates, then the random effects model as well as the correlation structure assumed in the GEE approach are misspecified. In the random effects models this may introduce some (usually small) bias, but the GEE approach will produce (in large samples) still unbiased estimates and correct standard errors, confidence intervals and p-values. This is one reason for the popularity of the GEE approach. 20.4
Within- and Between-Cluster Effects
If a covariate is varying within clusters, it may happen that the effect of a covariate within a cluster is different from the effect of the covariate between clusters. For example, if we consider a covariate like fat intake, and we consider two subjects from two different families with different fat intakes, then such a difference will reflect often mainly differences in the average intake in these two families, and hence the effect may be confounded with other covariates varying between families like income, lifestyle and environmental factors. Within each family these potential confounders are often constant, and hence the effect of fat intake within families suffers less from confounding. Consequently, the within- and between-cluster effect differs. In Figure 20.3 we have illustrated some data sets with an obvious difference between within-cluster effects and between-cluster effects. The estimated effect of a regression analysis can be any number between these two effects, depending on the ratio between the spread of the covariates within and between the clusters, as also illustrated in the Figure 20.3. If we expect a difference with respect to between- and within-cluster effects, there exists a simple way to analyse this. We have just to split the covariate X ji into two covariates: The first is constant within each cluster and is defined by the mean value X¯ j in each cluster. The second is defined by the deviation of the covariate from the mean within each cluster, that is, by X ji − X¯ j . So the model reads now
μ_ji = β0 + β1 X̄_j + β2 (X_ji − X̄_j)
such that β1 describes the effect of a difference in the cluster means between two clusters and β2 describes the effect of a difference between two subjects within a cluster. So these two numbers have a very simple interpretation. Fitting this model to the middle data set of Figure 20.3 results in an output such as
Figure 20.3 Three data sets with four clusters of observations marked by different symbols. The regression lines of within-cluster effects obtained by regressing Y on X within each cluster are shown as black, solid lines, the regression line of the between-cluster effect obtained by regressing Y on the cluster specific mean values of X is shown as a dashed, black line, and the regression line of regressing Y on X in the whole data set is shown as a grey line.
variable        beta     SE      95%CI              p-value
meanx           0.170    0.160   [-0.339,0.680]     0.366
xminusmeanx    -0.550    0.071   [-0.776,-0.325]    0.004
We can observe a significant, negative effect of X within each cluster, and a positive effect of X between the clusters. The latter is not significant, which is not surprising as we have only four clusters, that is, a sample size of four to estimate this effect. As typically X̄_j will not explain all variation between the clusters, one still has to take the clustering into account in the inference, for example, by using robust standard errors. The approach of splitting up into a within- and a between-cluster effect can be used with logistic regression and Cox regression models, too (a sketch of the corresponding commands is given below). Remark: Further discussions of techniques to separate within- and between-cluster effects can be found in Neuhaus and Kalbfleisch (1998) and Neuhaus and McCulloch (2006).
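A minimal Stata sketch of this split (not from the book; the variable names are hypothetical, with an outcome y, a covariate x, and a cluster identifier clustid):
. egen meanx = mean(x), by(clustid)
. gen xminusmeanx = x - meanx
. regress y meanx xminusmeanx, vce(cluster clustid)
The coefficient of meanx then estimates the between-cluster effect and the coefficient of xminusmeanx the within-cluster effect, with robust standard errors accounting for the clustering.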
20.5 Some Unusual but Useful Usages of Robust Standard Errors in Clustered Data
Clustering is not only a problem when estimating regression parameters, but also for other types of statistical analyses. The simplest examples are the estimation of means and proportions from clustered data, for which the standard formulas and procedures for confidence intervals and p-values are also based on the assumption of independent observations. Interestingly, statistical software to compute robust standard
errors in a regression analysis can also be used to take clustering into account for estimates of means. We just have to note that in the classical regression model without any covariates, the intercept describes the expected value of the outcome, and the estimated intercept is nothing else but the mean. Similarly, in a logistic regression model without covariates, the intercept corresponds to the prevalence of the outcome on the logit scale, and a simple back transformation of the estimated intercept to the probability scale gives exactly the relative frequency of the outcome. And as the software also gives us robust standard errors of the intercept in these models, the problem is solved. Examples will be given in the next section. Sometimes we wish to compare regression coefficients from different models fitted to the same data set. For example, we may be interested in two adverse events which can happen after a therapy, for example, nephrological toxicity and diarrhea, and would like to compare the effect of a covariate X on the time until nephrological toxicity with the effect on the time until diarrhea. So we fit two Cox models, one with the time until nephrological toxicity and one with the time until diarrhea as outcome. But now we want to compute the difference between the two effect estimates and would like to have a confidence interval and a p-value. We cannot simply use, for example, the lincom command in Stata, as the two estimates come from two separate model fits. However, we can rearrange our data such that there are two rows for each subject in the data set, with a common variable for the two event times. If we also duplicate the covariates and set one version to 0 in the first row and one version to 0 in the second row of each subject, then we obtain by fitting a stratified Cox model (cf. Section 30.6) effect estimates for both effects (which are identical to those from fitting the two models separately). If we use robust standard errors to take the clustering into account, we can now simply perform inference for the difference between the two effect estimates (a sketch of this rearrangement is given at the end of this section). And the same trick can also be applied if we want to summarise the evidence for an effect of a covariate over different outcomes, just by looking at the average effect. Note that we can also avoid the doubling of the covariates and fit directly a model with interactions between X and the status first/second row, cf. Chapter 19. It is even allowed in such tricks that the two outcomes are identical for each subject. For example, you may find in the literature two adjusted effect estimates that have been adjusted for different covariates, and you may be interested to check whether the difference can be explained by a true difference of the adjusted effects. If you have all the covariates involved measured in your data, you can fit the two corresponding models in your data similarly as above, just by setting in the first row all covariates to 0 which are not in the first model, and in the second row all covariates to 0 which are not in the second model. You only have, in addition, to ensure that two separate intercepts are fitted, for example, by adding an indicator variable for the status first or second row. Remark: It should be noted that the terms “cluster” and “clustering” are used in statistics in many different situations, which have no relation to the clustering considered in this chapter.
In particular, the term “cluster analysis” refers to a completely different task, namely, how to find clusters (i.e., groups of similar subjects) based on a set of variables measured at each subject.
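To make the rearrangement described above concrete, here is a minimal Stata sketch (not from the book; all variable names are hypothetical). We assume one row per subject with an identifier id, a covariate x, and event times timeneph and timedia with event indicators eventneph and eventdia:
. expand 2
. bysort id: gen outcome = _n                 // 1 = nephrological toxicity, 2 = diarrhea
. gen time  = cond(outcome==1, timeneph, timedia)
. gen event = cond(outcome==1, eventneph, eventdia)
. gen x1 = cond(outcome==1, x, 0)             // copy of x, active only in the first row
. gen x2 = cond(outcome==2, x, 0)             // copy of x, active only in the second row
. stset time, failure(event)
. stcox x1 x2, strata(outcome) vce(cluster id)
. lincom x1 - x2
The stratification by outcome gives separate baseline hazards for the two event types, the robust standard errors account for the two rows per subject, and lincom then provides a confidence interval and p-value for the difference between the two effects.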
20.6 How to Take Clustering into Account in Stata
We start with taking a look at the dataset we have seen in Figure 20.2: . use GPexp, clear . list in 1/10
+----------------------------------------+ | pract experi~e id sex satisf~n | |----------------------------------------| | 1 18 1 f 41 | | 1 18 2 f 54 | | 1 18 3 m 51 | | 1 18 4 m 0 | | 2 20 1 f 67 | |----------------------------------------| | 2 20 2 m 28 | | 2 20 3 m 46 | | 2 20 4 f 79 | | 2 20 5 f 100 | | 3 14 1 m 16 | +----------------------------------------+
With . scatter satisf exp, mlab(prac) mlabpos(0) msym(i)
we generate a graph similar to Figure 20.2. A naive analysis ignoring the clustering would look like . regress satisf exp sex, vce(robust) Linear regression
Number of obs =        97
F(  2,    94) =     10.28
Prob > F      =    0.0001
R-squared     =    0.1695
Root MSE      =    27.631
-----------------------------------------------------------------------------| Robust satisfaction | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------experience | 1.567665 .5062939 3.10 0.003 .562406 2.572923 sex | 17.80866 5.597744 3.18 0.002 6.694209 28.92311 _cons | 17.83489 11.13332 1.60 0.113 -4.270578 39.94036 ------------------------------------------------------------------------------
where we have taken the probably nonnormal error distribution into account by using the vce(robust) option. To compute robust standard errors taking the clustering into account we have to specify the clusters using the vce(cluster ...) option (which automatically implies the vce(robust) option):
. regress satisf exp sex, vce(cluster pract) Linear regression
Number of obs =        97
F(  2,    19) =     10.58
Prob > F      =    0.0008
R-squared     =    0.1695
Root MSE      =    27.631
(Std. Err. adjusted for 20 clusters in pract) -----------------------------------------------------------------------------| Robust satisfaction | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------experience | 1.567665 .5874832 2.67 0.015 .3380482 2.797281 sex | 17.80866 5.335342 3.34 0.003 6.641662 28.97566 _cons | 17.83489 12.4801 1.43 0.169 -8.286256 43.95604 ------------------------------------------------------------------------------
To fit the random intercept models, we have first to declare the data as “panel” data in Stata using the xtset command. (A panel is, roughly speaking, the same as a cluster.) Then we can use the xtreg command to fit a random intercept model: . xtset prac panel variable:
pract (unbalanced)
. xtreg satisf exp sex Random-effects GLS regression Group variable: pract
Number of obs Number of groups
= =
97 20
R-sq:
Obs per group: min = avg = max =
2 4.8 7
within = 0.1167 between = 0.2565 overall = 0.1695
Random effects u_i ~ Gaussian corr(u_i, X) = 0 (assumed)
Wald chi2(2) Prob > chi2
= =
16.34 0.0003
-----------------------------------------------------------------------------satisfaction | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------experience | 1.60886 .7521878 2.14 0.032 .1345995 3.083121 sex | 18.03559 5.46582 3.30 0.001 7.322783 28.7484 _cons | 16.5594 15.67251 1.06 0.291 -14.15815 47.27694 -------------+---------------------------------------------------------------sigma_u | 11.5385 sigma_e | 24.900001 rho | .17677424 (fraction of variance due to u_i) ------------------------------------------------------------------------------
Note that you can find the estimate for the standard deviation of the random intercept in the line labeled sigma_u. The corresponding command for logistic regression has the name xtlogit. To fit a shared frailty Cox model you can use the shared() option of the stcox command.
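As a brief, hedged sketch of the corresponding calls (the binary outcome y01 and the availability of survival data that have already been declared with stset are assumptions made only for this illustration):
. xtlogit y01 exp sex
. stcox exp sex, shared(pract)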
Finally, we use the xtgee command to apply the GEE approach. Here we have to specify three options: . xtgee satisf exp sex Iteration 1: tolerance = .07640108 Iteration 2: tolerance = .00025609 Iteration 3: tolerance = 8.756e-07 GEE population-averaged model Group variable: pract Link: identity Family: Gaussian Correlation: exchangeable Scale parameter:
740.1026
Number of obs Number of groups Obs per group: min avg max Wald chi2(2) Prob > chi2
= = = = = = =
97 20 2 4.8 7 16.64 0.0002
-----------------------------------------------------------------------------satisfaction | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------experience | 1.613903 .7842852 2.06 0.040 .0767319 3.151074 sex | 18.08677 5.303274 3.41 0.001 7.692548 28.481 _cons | 16.39142 16.32067 1.00 0.315 -15.59652 48.37935 ------------------------------------------------------------------------------
If you want to apply the GEE approach in fitting a logistic regression model to data with a binary outcome, then you have to use the options family(binomial) link(logit). (The meaning of these options will become clearer in Section 29.2.) Let us now further assume that our study population is a representative sample, such that it makes sense to use it to estimate the mean satisfaction. To obtain an estimate we can just use . tabstat satis, s(mean) variable | mean -------------+---------- satisfaction | 57.78351 ------------------------
If we now want to compute a confidence interval for this number, we have just to take into account the clustering in practices. Here we can use . regress satis, vce(cluster pract) Linear regression
Number of obs = F( 0, 19) = Prob > F = R-squared = Root MSE =
97 0.00 . 0.0000 30.003
(Std. Err. adjusted for 20 clusters in pract) ------------------------------------------------------------------------------
| Robust satisfaction | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------_cons | 57.78351 4.552564 12.69 0.000 48.25488 67.31213 ------------------------------------------------------------------------------
and the confidence interval for the intercept is the confidence interval for the mean. If we are interested in the prevalence of a satisfaction above 80 on the visual analog scale, we can use . gen satis80=satisf>80 . tab satis80 satis80 | Freq. Percent Cum. ------------+----------------------------------0 | 69 71.13 71.13 1 | 28 28.87 100.00 ------------+----------------------------------Total | 97 100.00 . logit satis80, vce(cluster pract) Iteration 0: Iteration 1:
log pseudolikelihood = log pseudolikelihood =
Logistic regression
Log pseudolikelihood =
-58.29189
-58.29189 -58.29189 Number of obs Wald chi2(0) Prob > chi2 Pseudo R2
= = = =
97 . . -0.0000
(Std. Err. adjusted for 20 clusters in pract) -----------------------------------------------------------------------------| Robust satis80 | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------_cons | -.901902 .2877067 -3.13 0.002 -1.465797 -.3380073 ------------------------------------------------------------------------------
The intercept is now an estimate for the prevalence on the logit scale, and we can use nlcom to transform it to the probability scale: . nlcom 1/(1+exp(-_b[_cons])) _nl_1:
1/(1+exp(-_b[_cons]))
-----------------------------------------------------------------------------satis80 | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------_nl_1 | .2886598 .0590763 4.89 0.000 .1728723 .4044473 ------------------------------------------------------------------------------
and obtain an estimate of the prevalence together with a confidence interval.
Finally, we take a look at the dataset presented in the middle of Figure 20.3: . use middle . list in 1/10
1. 2. 3. 4. 5. 6. 7. 8. 9. 10.
+------------------------------+ | id group x y | |------------------------------| | 1 1 1.647 .2083 | | 2 4 4.0419 1.937 | | 3 1 .7689 .544 | | 4 4 4.1266 1.7448 | | 5 2 1.8631 2.0677 | |------------------------------| | 6 2 1.2299 2.0502 | | 7 2 2.9268 1.9919 | | 8 4 3.521 2.3006 | | 9 1 1.9768 .0634 | | 10 1 -.6067 1.246 | +------------------------------+
To separate the within and between cluster effect of X, we define the two variables required and fit the model: . bys group: egen meanx=mean(x) . gen diff=x-meanx . regress y meanx diff, vce(cluster group) Linear regression
Number of obs = F( 2, 3) = Prob > F = R-squared = Root MSE =
80 38.20 0.0073 0.3173 .72514
(Std. Err. adjusted for 4 clusters in group) -----------------------------------------------------------------------------| Robust y | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------meanx | .170195 .1601154 1.06 0.366 -.3393635 .6797536 diff | -.5503743 .070822 -7.77 0.004 -.7757615 -.3249871 _cons | .7583906 .5378455 1.41 0.253 -.9532737 2.470055 ------------------------------------------------------------------------------
THIS CHAPTER IN A NUTSHELL
The existence of clusters of observations can invalidate the assumption of independence of the outcomes given the covariate values between subjects, and hence confidence intervals and p-values. The use of robust standard errors solves this problem. There is often no need for more sophisticated approaches. In some applications it can be useful to dissect within-cluster and between-cluster effects.
Chapter 21
Applying Regression Models to Longitudinal Data
In this chapter we discuss a variety of possibilities to use regression models in the analysis of longitudinal data, short time series, serial measurements, and repeated measurements.
21.1 Analysing Time Trends in the Outcome
Longitudinal studies are characterised by measuring the outcome variables Y at different time points in each subject. In addition, covariates may be measured at each time point or at subject level. One basic question in longitudinal studies is the existence and degree of a general trend over time. A typical example is the course of a disease after a treatment. Figure 21.1 shows the development of the Hamilton depression score in 17 patients over 3 months, with baseline measurements at time 0 prior to the start of the therapy. We can observe a more or less distinct trend in many patients to lower values of the Hamilton depression score, that is, some improvement. So, a first basic question here is whether we can conclude that there is overall a tendency for a decrease of the depression score over time. To assess such an overall tendency, we can take a look at the mean values over time as shown in Figure 21.2. Here, we can observe a slight decrease over time. To estimate the degree of this decrease and to assess the evidence for such a decrease, we may now wish to fit a regression line to these mean values, that is, using them as outcome variables and the time points as covariates. However, this way we would not make use of our knowledge that each mean value is in itself an estimate. Instead of using the mean values as outcome, we can use the individual measurements as outcome. This means that we look at a scatter plot of the individual score values versus time as shown in Figure 21.3, and we can fit a regression line to these values. So we consider a regression model of the type
μ(t) = β0 + β1 t

where μ(t) denotes the expected value of the outcome Y at time point t. Note that conceptually this regression model does exactly the same as a model regressing the mean values against time: If any Yit has expectation μ(t), then the mean value of
Figure 21.1 Measurements of the Hamilton depression score in 17 patients at baseline (t = 0) and 1, 2, and 3 months after start of a therapy.

Figure 21.2 The mean values of the Hamilton depression score at each time point in the 17 patients shown in Figure 21.1.

Figure 21.3 Measurements of the Hamilton depression score in 17 patients at 4 different time points. The size of the bubbles reflects the number of patients with the given score value in that month.
Y at time point t has expectation μ(t), too. Actually, fitting the regression line to the individual values or to the mean values gives exactly the same estimates for slope and intercept. If we fit a regression model to the individual data, we obtain an output like

variable   beta     SE      95%CI              p-value
month     -0.688   0.315   [-1.306, -0.071]    0.0325

suggesting a decrease in the Hamilton depression score of 0.69 points from month to month. However, our statistical software has now assumed that we have 68 independent measurements of the Hamilton depression score. Actually, we have only 17 subjects, each contributing four measurements, and it is questionable whether these measurements provide independent information. In particular, we can observe in Figure 21.1 that patients 4, 15, and 16 show a very distinct decrease, such that these three patients are mainly responsible for the trend. This makes the significance of the trend demonstrated above questionable, because if it depends on only three patients, it is hard to believe that we can be sure to observe a decrease on average in future patients. As in Section 20.2 we can again use robust standard errors after telling our software that there are clusters of observations. In our example this results in an output like

variable   beta     SE      95%CI              p-value
month     -0.688   0.339   [-1.352, -0.024]    0.0592

Now the standard error of the effect of time is larger, and we lose significance at the 5% level. So indeed three patients with a distinct decrease are not enough to allow a firm conclusion about a general trend on average.

Remark: Similar to our considerations in Section 20.3, we can in principle improve the efficiency of the estimate of the time trend by modelling the correlation among the outcome variables in each subject. However, the correlation structure is now more complicated. We may again have that each subject has its own level, suggesting a constant correlation between observations. In addition, there may be some smooth development over time in each subject, suggesting that outcomes measured close in time are more highly correlated than outcomes further apart in time. This is often modelled by a correlation structure called "auto-regressive." Also individual variations in the time trends impose a certain correlation structure. One can use both the GEE approach and random effects models (cf. Section 21.4) to take these correlation structures into account. The basic advantage of the GEE approach is that it provides unbiased estimates of the regression parameters and valid confidence intervals and p-values even if the correlation structure is misspecified.

Remark: Estimation of time trends typically does not need to be adjusted for other covariates. Subject-specific covariates like age or sex cannot be correlated with time. Time-varying covariates can, of course, be correlated, but in most situations we will regard them as mediators of the time trend. This means that a change in another covariate may explain a part of the time trend, but there is no reason to "subtract" this effect from the observed time trend.
           cycle
           1      2      3      4
female   58.1   55.9   53.8   54.8
male     36.6   40.9   46.2   53.8

Table 21.1 Frequency of vomiting in the first four cycles of chemotherapy in female and male patients.
Remark: Multiple measurements over time occur in very different settings in medical research. Similarly, the name for this type of data may range from longitudinal data over short time series and serial measurements to repeated measurements.
21.2 Analysing Time Trends in the Effect of Covariates
Besides time trends in the outcome, other trends can also be of interest in analysing longitudinal studies. One typical question is about a time trend in the effect of covariates. An example is given in Table 21.1, in which we can observe that there is a rather big sex difference with respect to the occurrence of vomiting as a side effect of a chemotherapy in the first cycle, but that this difference seems to become smaller over the four cycles of the chemotherapy. If we now want to assess the degree of this change, we have to realize that this is just a question of an interaction between time and sex, so we can address it by the methods described in Chapter 19. The outcome of interest is the binary variable Yit indicating the occurrence of vomiting at cycle t of subject i. We can hence formulate a logistic regression model with the time t and a binary sex indicator X as covariates, and a sex-specific time trend:

logit π(t, x) = β0 + β1 x + βmale t      if x = 0
logit π(t, x) = β0 + β1 x + βfemale t    if x = 1

We can fit this model by defining two versions of the covariate t, one for male and one for female subjects, and, using again robust standard errors to take the correlation within each patient into account, we obtain an output like

variable      beta    SE     95%CI             p-value
gender       -1.16   0.39   [-1.92, -0.40]     0.003
cyclemale     0.23   0.11   [ 0.02,  0.45]     0.032
cyclefemale  -0.05   0.09   [-0.22,  0.12]     0.577

and in a further step we can obtain for the difference between the two sex-specific trends a confidence interval of [0.01, 0.55] and a p-value of 0.043. So we have evidence for a difference in the trends between male and female patients or, equivalently, for a change of the sex difference over the cycles.
21.3 Analysing the Effect of Covariates
We can, of course, in a longitudinal study also ask the simple question of whether a covariate X has an effect on the outcome Y, and we may want to adjust this effect for some potential confounders. Now, a longitudinal study provides just several observations of X and Y for every subject, and this is the only difference compared to nonlongitudinal studies. So we can address this question just by fitting regression models, and we obtain regression parameters with the usual interpretation. We also obtain valid inference by taking the correlation of the outcomes within each subject into account, for example, by using robust standard errors. For all regression analyses described so far in this chapter we actually did not need longitudinal data. All questions could also have been addressed in a cross-sectional study, in which we would have collected data for each subject only at one time point, covering different time points by different subjects. This would not have made much sense in our examples, as there the data appeared naturally in a longitudinal manner. In other settings this can be different, for example, if we are interested in the development of certain diseases with age. Then it may be much easier to collect data of subjects of different ages in a cross-sectional design than to follow subjects over many years. In the next sections we will discuss some types of analyses which are only possible in a longitudinal design, and hence which really make use of the information provided by the longitudinal measurements. The coincidence between cross-sectional and longitudinal studies with respect to using the same regression models to address the same type of questions has, however, one practical advantage: It implies that we can compare regression coefficients between cross-sectional and longitudinal studies, and that we do not have to think about whether the regression coefficients have different interpretations. (This common structure is often called "marginal models" in the statistical literature.) The coincidence in the interpretation between cross-sectional and longitudinal studies is, however, only true if the covariates are subject-specific variables which do not change over time. If we include time-varying covariates in the regression models then we are back to the problem we have discussed already in Section 20.4: it may be that the (average) effect of the covariate within a subject is different from the effect of the covariate between subjects. We can approach this problem again by techniques similar to the approach mentioned in Section 20.4. However, instead of the mean of X in each cluster, it may be more useful to use the first measurement as the cluster-/subject-specific characteristic value.
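A hedged sketch of this variant in Stata, analogous to the within/between decomposition of Section 20.6 but with the first (baseline) measurement as the subject-specific value; the variable names id, time, x, and y are assumed for illustration only:
. bys id (time): gen basex = x[1]
. gen diffx = x - basex
. regress y basex diffx, vce(cluster id)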
21.4 Analysing Individual Variation in Time Trends
In our example of the course of the degree of depression (Figure 21.1) we have seen a substantial variation in the individual trends. Often, this individual variation is in itself of scientific interest. We might, for example, be interested in whether different patients respond differently to a therapy or not. And if we have evidence for individuality in their response, we may want to quantify this individual variation. And
if some patients do not respond, we may want to know how many respond and how many do not respond. Random effects models allow an answer to these questions. They simply assume that each subject has its own regression line. So if we denote with μi(t) the expected value of the outcome Yit in subject i at time point t, such a model reads simply:
μi(t) = β0i + β1i t

with the additional assumption that the subject-specific intercept parameters β0i are drawn from a normal distribution with mean μ0 and standard deviation σ0, the subject-specific slopes β1i are drawn from a normal distribution with mean μ1 and standard deviation σ1, and the intercept and the slope may show a correlation ρ. In fitting such a random effects model, the individual parameters are not estimated, but only the five parameters μ0, μ1, σ0, σ1, and ρ. The parameter σ1 is of highest interest, as it describes the variation of the individual slopes, that is, to which degree the slopes vary from subject to subject. Fitting such a model to the data shown in Figure 21.1, we obtain the following estimates and confidence intervals:

     μ̂0              μ̂1              σ̂0             σ̂1              ρ̂
   19.52           −0.68            2.25           1.36           −0.35
[18.4, 20.6]  [−1.35, −0.03]   [1.56, 3.23]   [0.95, 1.95]   [−0.70, 0.14]
The value 19.52 of μ̂0 tells us that the patients start on average at baseline with a score of about 19.5, and the confidence interval tells us that we are pretty confident in this value. The value 2.25 of σ̂0 tells us that this initial level has some variation, and using the fact that for a normal distribution the interval mean ± standard deviation covers 2/3 of all values, the two values together suggest that two thirds of all patients start with a level between μ̂0 − σ̂0 = 19.52 − 2.25 = 17.27 and μ̂0 + σ̂0 = 19.52 + 2.25 = 21.77. The value of μ̂1 tells us that the patients have on average a decrease of −0.68 in the Hamilton depression score per month, but the confidence interval tells us that we are not very confident about this value. The value 1.36 of σ̂1 tells us that there is a substantial variation in this decrease from patient to patient. The estimates suggest that two thirds of the patients have a slope between μ̂1 − σ̂1 = −0.68 − 1.36 = −2.05 and μ̂1 + σ̂1 = −0.68 + 1.36 = 0.67, so we have both patients with distinct decreases as well as patients with distinct increases. If we are interested in the fraction of subjects with a slope smaller than a given value s, we can use the fact that the probability of a slope less than s is given by the expression Φ((s − μ̂1)/σ̂1), with Φ denoting the distribution function of a standard normal distribution. Of particular interest is often the fraction of subjects with a negative slope, which is given by Φ((0 − μ̂1)/σ̂1). Inserting our estimates in this expression yields Φ((0 − (−0.68))/1.36) = Φ(0.5) = 0.69. Or, in other words, 31% of the patients do not seem to respond to the treatment. The estimate −0.35 for the correlation ρ tells us that patients with a high baseline level tend to have a smaller slope, that is, a larger decrease, than patients with a low baseline level.
We can rewrite our random effects model also as
μi(t) = μ0 + β̃0i + (μ1 + β̃1i) t

with β̃0i now drawn from a normal distribution with mean 0 and standard deviation σ0, and β̃1i now drawn from a normal distribution with mean 0 and standard deviation σ1. And then we can rewrite it as
μi(t) = μ0 + μ1 t + β̃0i + β̃1i t

Now, the first part looks like an ordinary regression model, and we have just added a part involving random deviations of the individual parameters from the values μ0 and μ1. So we can interpret the parameters μ0 and μ1 as the "fixed effects," corresponding to the regression parameters we have considered in the previous sections (and known from "marginal models"). And, indeed, they coincide with them. Also, the estimates are numerically identical. For this reason, random effects models are also known as "mixed models," mixing fixed effects with random effects.

It is important to know that the analysis of individual variation in time trends is not restricted to the case of a continuous outcome. Such an analysis can also be done in the case of a binary outcome, although in the binary case it is more difficult to be sure about individual trends than in the case of a continuous outcome. Table 21.2 depicts the individual patterns behind the data of Table 21.1: the occurrence of vomiting as a side effect over the four cycles of a chemotherapy. We can, for some patients, observe a decreasing and for some an increasing pattern, possibly reflecting that some patients adapt over time to the emetogenic potential of the chemotherapy, whereas others tend to develop this reaction over time. However, within four cycles such patterns can also occur by chance, and we can see that the majority of patients actually do not have such a pattern. A model with a random intercept and a random slope now reads

logit πi(t) = β0i + β1i t

with the same assumptions on the distribution of the random intercept and the random slope as above. If we fit such a model to the male subjects of Table 21.2, we obtain

     μ̂0             μ̂1             σ̂0             σ̂1              ρ̂
   −1.44           0.42           3.75           1.38           −0.93
[−2.6, −0.3]  [0.02, 0.83]   [2.34, 6.00]   [0.87, 2.18]   [−0.97, −0.85]

We observe that on average we have an increase of the probability (on the logit scale) of vomiting from cycle to cycle of 0.42, with a substantial individual variation described by a standard deviation of 1.38. The correlation between intercept and slope is definitely negative, suggesting that the lower the level a patient starts with in the first cycle, the higher the increase. If we fit this model to the female subjects we obtain

     μ̂0             μ̂1             σ̂0             σ̂1              ρ̂
    0.38          −0.05           0.53           0.12            1.00
[−0.2, 0.9]   [−0.25, 0.16]  [0.06, 4.93]   [0.00, 6.13]   [−1.00, 1.00]
                          frequency in       vomiting in cycle
                        females   males      1   2   3   4
constant pattern            5       13       0   0   0   0
                           19        8       1   1   1   1
increasing pattern          7       13       0   0   0   1
                            3       12       0   0   1   1
                            4        7       0   1   1   1
decreasing pattern          8        8       1   0   0   0
                            4        7       1   1   0   0
                            5        3       1   1   1   0
slightly increasing         0        4       0   0   1   0
pattern                     6        4       0   1   0   1
                            4        2       1   0   1   1
slightly decreasing         5        0       0   1   0   0
pattern                     6        2       1   0   1   0
                            1        3       1   1   0   1
patterns with no trend      8        5       0   1   1   0
                            7        0       1   0   0   1

Table 21.2 Frequency of vomiting patterns in the first four cycles of chemotherapy in female and male patients.
Now on average there is neither an increase nor a decrease, and the narrow confidence interval of [−0.25, 0.16] for μ1 suggests that we are pretty sure about this. In addition, there seems to be nearly no individual variation, as indicated by the small value of the estimate of σ1. However, here the confidence interval is very wide, so we should be careful with our conclusions. Note the extremely wide confidence interval for the correlation. This reflects that if there is nearly no individual variation in the slope, then it is hard to say something about the correlation between intercept and slope. The random effects in a logistic model can also be split into fixed effects and random effects. However, it no longer holds that the fixed effects are identical to the effects in the marginal models. We have again a (slight) difference between subject-specific effects and population average effects (cf. Section 20.2). It is obvious that random effects models are a powerful tool if we are interested in analysing individual variability in trends. They are not restricted to the analysis of variability in time trends, but they can also be used to estimate the variability of individual effects of covariates varying over time. It can also be studied how much of
the individual variability can be explained by subject-specific factors like age and sex just by adding these covariates and their interactions with time to the model. Hence random effects models play an important role in the analysis of longitudinal data as well as other data with some structure within each subject. The books of Brown and Prescott (1999), Verbeke and Molenberghs (2000), and Skrondal and Rabe-Hesketh (2004) allow some further insights into the potential of using random effects models.

Remark: Mixed models for the analysis of longitudinal data have been used for decades, as software for this type of model has been available since the 1970s. Until the invention of techniques like robust standard errors or GEE (becoming popular shortly after the papers by Liang and Zeger (1986) and Zeger et al. (1988)), they were the standard approach also for analyses in which we are mainly interested in the fixed effects. The random effects part is used in these applications mainly to take the association structure into account and in this way to obtain realistic standard errors.

Remark: Software to fit random effects models for nonnormal outcome variables (like binary variables) became available only rather recently. Hence analyses like that of the vomiting patterns are not very common in the medical literature, but they will be used more frequently in the future.
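Returning to the idea of explaining individual variability by subject-specific factors, a hedged sketch for the Hamilton data might add sex and its interaction with time to the random slope model of this section (the indicator variable female is an assumption made only for this illustration):
. gen femalemonth = female*month
. xtmixed hamd month female femalemonth || id: month , cov(unstructured)
A reduction of the estimated standard deviation of the random slope compared to the model without these covariates would indicate that sex explains part of the individual variation.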
21.5 Analysing Summary Measures
Looking at the individual curves in Figure 21.1 we may get another idea to analyse these data: We may fit a regression line for each subject and then use the estimated slope in each subject as a measure for the individual trend. Then we can use this measure as the new outcome of interest. We can, for example, use a one-sample t-test to demonstrate that the average slope is different from 0. This is a completely valid approach. It may be slightly less efficient than using the regression approach described in Section 21.1, as we ignore that each estimated slope has an imprecision we can estimate from our data, but this loss is typically small. We can also compare the average slope between different groups or use the estimated slope as the outcome variable of a regression model with subject-specific covariates. However, we must be aware that we cannot use the estimated slopes to judge the individual variation of the true slopes, as some of the variation of the estimated slopes is due to the fact that they are estimates and hence affected by random noise. The most interesting and useful feature of the summary measure approach is its ability to analyse properties of the patterns in individual curves that do not correspond to a simple linear trend. A typical example is shown in Figure 21.4. We can observe a common pattern in all patients (a peak within the first 24 hours), but both the time and the height of the peak vary from patient to patient. Here, we can use summary measures like the time until the peak or the maximum in each subject, if the peak is of interest. If it is more of interest to quantify the overall amount in the rise of the laboratory parameter, we may use the area under the curve (AUC) as the summary measure. Remark: As the use of mixed models or similar model-based approaches is quite
Figure 21.4 The course of a laboratory parameter in 40 patients in the first 24 hours after giving a certain dose of a drug.
common in the analysis of longitudinal data, the use of summary measures is often regarded as "too simple" and hence inadequate, even if it gives a very direct answer to the scientific question of interest. Here, it can be useful to cite the paper by Matthews et al. (1990) defending and propagating the use of summary measures.
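As a hedged sketch of the summary measure approach for the Hamilton data (using the statsby command to collect the subject-specific slopes; the dataset and variable names are those used in Section 21.7):
. use hamilton, clear
. statsby slope=_b[month], by(id) clear: regress hamd month
. ttest slope == 0
The one-sample t-test then addresses whether the average slope differs from 0; a summary measure like the AUC or the time until the peak would be computed per subject in a similar way and used as the outcome instead.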
21.6 Analysing the Effect of Change
One of the basic difficulties in interpreting associations analysed by regression models as causal relations lies in the impossibility of observing the temporal order of the actual events behind the covariates X and the outcome Y (cf. Section 16.4). Longitudinal studies have the potential to overcome this limitation, as we have the possibility to observe first a change of X in some subjects and subsequently a change in Y, hence adding evidence for a causal relation. So it can be of high interest to perform regression analyses in which the changes ΔXij = Xi,j+1 − Xi,j in X are the covariates and the changes ΔYij = Yi,j+1 − Yi,j in Y are the outcome variables. However, this simple version rarely works, and typically we need some refinements before we can come to meaningful and convincing results. Some typical problems we may have to take into account are the following:
• Just using the difference in X from one time point to the next may not be a reliable indicator for a change, and most differences will be due to chance. So, often the estimation of the difference has to be stabilised, for example, by considering a difference in the average over three time points before and after a given time point. On the other hand, short-term changes may also be relevant.
• Changes may occur gradually, so that we have to take the slope over several time points to obtain a reliable indicator for a change.
• Y may not react immediately to a change in X. Hence it may be necessary to allow some time lag between the change in X and the change in Y. In addition, this time lag may vary from subject to subject. Then it may be a good idea to use as outcome a quantity like "time until a relevant change in Y."
• Consecutive changes tend to be correlated negatively. If X goes down from time point j − 1 to time point j, there is some likelihood that X goes up again from time point j to j + 1. So this opens the door to rather surprising results, in particular if the lag in Y is modelled incorrectly. Similarly, a large change ΔXij may be just due to the fact that Xi,j has been larger or smaller than usual. So we may not really catch by ΔXij a relevant change, but rather a somewhat extreme observation in X at time point j.
• Relevant changes in one covariate often occur together with relevant changes in other covariates, so also changes can act as confounders to each other. For example, changes in lifestyle factors are often correlated as they reflect certain events in life like a new job, a move, founding a family, etc. So we have to make appropriate adjustments in the analysis.
Hence, many considerations about the adequate definition of changes in the covariates and the outcome and about adequate adjustments are necessary before a convincing regression analysis to assess the effect of a change in a covariate can be performed. Unfortunately, some of the obvious solutions to the problems mentioned above are not necessarily compatible with each other; for example, it is hard to combine the idea of a lag in Y with that of a gradual change of X.
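As a rough, hedged sketch of the simple change-on-change version described at the beginning of this section (before any of the refinements listed above; the variables id, time, x, and y and the one-period lag are assumptions made only for this illustration):
. xtset id time
. gen dx = x - L.x
. gen dy = y - L.y
. regress dy dx, vce(cluster id)
. * allowing Y to react one time point later:
. regress dy L.dx, vce(cluster id)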
21.7 How to Perform Regression Modelling of Longitudinal Data in Stata
We start with a look at the data on the Hamilton depression score: . use hamilton . list in 1/10
1. 2. 3. 4. 5. 6. 7. 8. 9. 10.
+-------------------+ | id month hamd | |-------------------| | 1 0 20 | | 1 1 19 | | 1 2 21 | | 1 3 21 | | 2 0 19 | |-------------------| | 2 1 19 | | 2 2 17 | | 2 3 18 | | 3 0 19 | | 3 1 18 | +-------------------+
It is always a good idea to start the analysis of longitudinal data by a look at the individual curves: . line hamd month, c(L) by(id, rows(2))
This generates a graph like that in Figure 21.1. Next, we analyse the overall trend by a regression analysis. We take the clustering into account by using the vce(cluster ...) option: . regress hamd month, vce(cluster id) Linear regression
Number of obs = F( 1, 16) = Prob > F = R-squared = Root MSE =
68 4.13 0.0592 0.0674 2.9043
(Std. Err. adjusted for 17 clusters in id) -----------------------------------------------------------------------------| Robust hamd | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------month | -.6882353 .338811 -2.03 0.059 -1.406483 .030012 _cons | 19.51765 .5625886 34.69 0.000 18.32501 20.71028 ------------------------------------------------------------------------------
To analyse the variation of the individual trends, we fit a random effects model (or mixed model) using Stata's xtmixed command. In this command first the outcome is specified and then the fixed effects, in the way we are used to specifying covariates for all regression commands. Then, after two "|" signs the random effects are specified. First, we specify the variable defining the unit of each random effect. Then, after the ":" we specify the covariates for which we assume that the parameter is varying from unit to unit. Stata assumes that in any case the intercept _cons varies from unit to unit, so we have only to specify here the variable month to indicate a random slope. Finally, we specify that we make no assumptions on the dependence structure among the two random effects, which is expressed by the option cov(unstructured). . xtmixed hamd month || id: month , cov(unstructured) Performing EM optimization: Performing gradient-based optimization: Iteration 0: Iteration 1:
log restricted-likelihood = -121.22627 log restricted-likelihood = -121.22627
Computing standard errors: Mixed-effects REML regression Group variable: id
Number of obs Number of groups
= =
68 17
Obs per group: min = avg = max =
Log restricted-likelihood = -121.22627
Wald chi2(1) Prob > chi2
= =
4 4.0 4
4.19 0.0407
-----------------------------------------------------------------------------hamd | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------month | -.6882353 .3362731 -2.05 0.041 -1.347319 -.0291521 _cons | 19.51765 .5583745 34.95 0.000 18.42325 20.61204 ----------------------------------------------------------------------------------------------------------------------------------------------------------Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval] -----------------------------+-----------------------------------------------id: Unstructured | sd(month) | 1.361444 .2496833 .9503704 1.950324 sd(_cons) | 2.249314 .41676 1.56436 3.234174 corr(month,_cons) | -.3486286 .2273615 -.7019617 .1424191 -----------------------------+-----------------------------------------------sd(Residual) | .5866153 .0711375 .4625195 .7440065 -----------------------------------------------------------------------------LR test vs. linear regression: chi2(3) = 94.25 Prob > chi2 = 0.0000 Note: LR test is conservative and provided only for reference.
The first part of the output is structured like that of other regression commands. Here, we can find estimates of the “fixed effects” or the mean values of the random effects. In a second part we can find estimates for the standard deviations of the random effects and the correlation, together with standard errors and confidence intervals. Note that Stata does not provide in this output p-values for testing whether the standard deviations are different from 0. To obtain an estimate for the fraction of subjects with a negative slope, that is, a decreasing trend, we use the nlcom command: . nlcom normal( -_b[month]/ exp([lns1_1_1]_b[_cons])) _nl_1:
normal( -_b[month]/ exp([lns1_1_1]_b[_cons]))
-----------------------------------------------------------------------------hamd | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------_nl_1 | .6934027 .0926258 7.49 0.000 .5118595 .8749459 ------------------------------------------------------------------------------
Note the wide confidence interval for the fraction, reflecting that with 17 subjects we cannot say much more than at least half of the subjects have a negative trend. Now we take a look at the dataset with the binary outcome:
. use vomiting . list in 1/10
1. 2. 3. 4. 5. 6. 7. 8. 9. 10.
+-----------------------------------------+ | id gender vom1 vom2 vom3 vom4 | |-----------------------------------------| | 1 male 0 0 1 1 | | 2 female 1 1 1 1 | | 3 male 0 0 1 0 | | 4 male 1 0 0 0 | | 5 male 1 0 0 0 | |-----------------------------------------| | 6 female 0 0 1 0 | | 7 female 1 0 0 1 | | 8 female 0 1 1 0 | | 9 female 1 1 1 1 | | 10 male 0 0 0 0 | +-----------------------------------------+
We start with looking at the frequency of vomiting in each cycle for males and females: . tabstat vom1-vom4, s(mean) by(gender) Summary statistics: mean by categories of: gender gender | vom1 vom2 vom3 vom4 -------+---------------------------------------female | .5806452 .5591398 .5376344 .5483871 male | .3655914 .4086022 .4623656 .5376344 -------+---------------------------------------Total | .4731183 .483871 .5 .5430108 ------------------------------------------------
To perform a regression analysis, we have to reshape the data, such that each observation of the outcome is in a single row. We use Stata’s reshape command for this: . reshape long vom, i(id) j(cycle) (note: j = 1 2 3 4) Data wide -> long ----------------------------------------------------------------------------Number of obs. 186 -> 744 Number of variables 6 -> 4 j variable (4 values) -> cycle xij variables: vom1 vom2 ... vom4 -> vom ----------------------------------------------------------------------------. list in 1/10 +---------------------------+ | id cycle gender vom |
1. 2. 3. 4. 5. 6. 7. 8. 9. 10.
|---------------------------| | 1 1 male 0 | | 1 2 male 0 | | 1 3 male 1 | | 1 4 male 1 | | 2 1 female 1 | |---------------------------| | 2 2 female 1 | | 2 3 female 1 | | 2 4 female 1 | | 3 1 male 0 | | 3 2 male 0 | +---------------------------+
Now we want to take a look at the gender specific time trends. So we define corresponding variables, and fit the model: . gen cyclemale=cycle*(gender==2) . gen cyclefemale=cycle*(gender==1) . logit vom gender cyclemale cyclefemale, vce(cluster id) Iteration Iteration Iteration Iteration
0: 1: 2: 3:
log log log log
pseudolikelihood pseudolikelihood pseudolikelihood pseudolikelihood
= = = =
-515.7015 -507.7465 -507.744 -507.744
Logistic regression
Log pseudolikelihood =
-507.744
Number of obs Wald chi2(3) Prob > chi2 Pseudo R2
= = = =
744 11.65 0.0087 0.0154
(Std. Err. adjusted for 186 clusters in id) -----------------------------------------------------------------------------| Robust vom | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------gender | -1.160026 .3886489 -2.98 0.003 -1.921763 -.3982878 cyclemale | .2330545 .1087994 2.14 0.032 .0198117 .4462974 cyclefemale | -.0479417 .0858673 -0.56 0.577 -.2162384 .1203551 _cons | 1.506815 .5741896 2.62 0.009 .3814243 2.632206 ------------------------------------------------------------------------------
We now take a look at the difference between the trends in female and male patients: . lincom cyclemale-cyclefemale ( 1)
[vom]cyclemale - [vom]cyclefemale = 0
-----------------------------------------------------------------------------vom | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------(1) | .2809962 .1386019 2.03 0.043 .0093414 .552651 ------------------------------------------------------------------------------
Finally, we want to take a look at the individual variation of the trends. So we want to fit a logistic mixed model, which is provided in Stata by the xtmelogit command. The syntax is exactly like that of the xtmixed command. We fit these models separately for male and female patients: . xtmelogit vom cycle if gender==2 || id: cycle , cov(unstructured) Refining starting values: Iteration 0: Iteration 1: Iteration 2:
log likelihood = -264.91644 log likelihood = -248.54386 log likelihood = -246.99078
(not concave) (not concave)
Performing gradient-based optimization: Iteration Iteration Iteration Iteration Iteration Iteration
0: 1: 2: 3: 4: 5:
log log log log log log
likelihood likelihood likelihood likelihood likelihood likelihood
= = = = = =
-246.99078 -243.95631 -240.42853 -240.29646 -240.29491 -240.2949
Mixed-effects logistic regression Group variable: id
Integration points = 7 Log likelihood = -240.2949
(not concave)
Number of obs Number of groups
= =
372 93
Obs per group: min = avg = max =
4 4.0 4
Wald chi2(1) Prob > chi2
= =
4.31 0.0379
-----------------------------------------------------------------------------vom | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------cycle | .4247844 .2046304 2.08 0.038 .0237162 .8258525 _cons | -1.443652 .5708964 -2.53 0.011 -2.562588 -.3247158 ----------------------------------------------------------------------------------------------------------------------------------------------------------Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval] -----------------------------+-----------------------------------------------id: Unstructured | sd(cycle) | 1.380396 .3220722 .8737806 2.180745 sd(_cons) | 3.747773 .8986774 2.342416 5.99629 corr(cycle,_cons) | -.9349265 .0285768 -.9727456 -.8486545 -----------------------------------------------------------------------------LR test vs. logistic regression: chi2(3) = 24.21 Prob > chi2 = 0.0000 Note: LR test is conservative and provided only for reference.
. xtmelogit vom cycle if gender==1 || id: cycle , cov(unstructured)
Refining starting values: Iteration 0: Iteration 1: Iteration 2:
log likelihood = -269.16401 log likelihood = -251.7894 log likelihood = -251.47435
(not concave)
Performing gradient-based optimization: Iteration Iteration Iteration Iteration Iteration Iteration
0: 1: 2: 3: 4: 5:
log log log log log log
likelihood likelihood likelihood likelihood likelihood likelihood
= = = = = =
-251.47435 -251.40162 -251.37241 -251.36701 -251.36633 -251.36631
Mixed-effects logistic regression Group variable: id
Integration points = 7 Log likelihood = -251.36631
Number of obs Number of groups
= =
372 93
Obs per group: min = avg = max =
4 4.0 4
Wald chi2(1) Prob > chi2
= =
0.20 0.6570
-----------------------------------------------------------------------------vom | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------cycle | -.0456022 .1026821 -0.44 0.657 -.2468554 .1556511 _cons | .3757106 .2811202 1.34 0.181 -.1752748 .926696 ----------------------------------------------------------------------------------------------------------------------------------------------------------Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval] -----------------------------+-----------------------------------------------id: Unstructured | sd(cycle) | .1196747 .2403085 .0023376 6.126713 sd(_cons) | .5308585 .6034892 .0571897 4.927647 corr(cycle,_cons) | .9999996 .0057263 -1 1 -----------------------------------------------------------------------------LR test vs. logistic regression: chi2(3) = 7.95 Prob > chi2 = 0.0470 Note: LR test is conservative and provided only for reference.
By comparison with the results presented in Section 21.4 it becomes obvious how to read the output of xtmelogit.
21.8 Exercise Increase of Body Fat in Adolescents
Today the body fat can be measured rather precisely using dual-energy x-ray absorptiometry. Using this technique, body fat measurements have been performed in a longitudinal study in several hundred children at age 11, 13, and 15. You can find the
data in the data set bodyfat. Besides information on the body fat (measured in kg) you can find information on the age and gender, with girls coded with a 1 and boys coded with a 2. Try to use this data to address the following questions:
(a) Does body fat increase with age for both girls and boys?
(b) Is there a difference between girls and boys with respect to the increase of the body fat with age?
(c) Do girls tend to have more body fat than boys?
(d) How large is the individual variation of the increase of the body fat with age in girls and in boys? How many boys and girls have an increase of more than 1.5 kg body fat per year?
THIS CHAPTER IN A NUTSHELL
Regression models can contribute to the analysis of longitudinal data in various ways. The extension to random effects models allows us to describe inter-individual variation in time trends. Summary measures allow us to relate specific properties of patterns in individual curves to covariates.
Chapter 22
The Impact of Measurement Error
In this chapter we discuss the impact of measurement error in or misclassification of covariate values on the results of a regression analysis.
22.1 The Impact of Systematic and Random Measurement Error
Most continuous covariates we measure in medical research are subject to some measurement error. For example, a blood pressure measurement depends on the equipment used and the conditions under which we measure. In addition, blood pressure is not a quantity fixed in each subject for the whole day, so we have to take some biological variation into account. If in measuring a certain variable we depend on the information given by a subject, measurement errors can occur due to limitations in the subject's ability to remember or due to active manipulation by the subject. Hence, in interpreting the results of a regression analysis we often have to take into account that our covariate measurements have not been perfect. This requires a basic understanding of the potential impact of measurement errors in the covariates on the results. We start with the case of random measurement error, that is, we can express the measured covariate value X as

X = X∗ + ε

with X∗ denoting the true value of the covariate and ε a random measurement error
with expected value 0. So we just add "random noise" to the covariate value. In Figures 22.1 and 22.2 we can observe the impact on the regression line if there is a positive association between Y and X: the regression slope becomes smaller. If the association is negative, the slope also becomes smaller in absolute value, as we can observe in Figures 22.3 and 22.4. As the slope is negative here, this means that the slope becomes larger. However, we can summarise the common behaviour in the simple statement that we observe a bias towards 0, that is, estimates of the slope tend to get closer to 0. And this attenuation is not only the case in our two examples: it can be shown in general that a random measurement error implies a bias towards 0 in the expected value of the regression coefficient (e.g., Fuller (1987)). This phenomenon is also called regression dilution.
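As a hedged aside, the size of this attenuation can be made explicit for the classical error model sketched above (this is a standard result, see, e.g., Fuller (1987); the notation for the variances of X∗ and ε is introduced here only for this formula): if ε is independent of X∗, the slope estimated using the mismeasured X estimates

\[ \beta_X \;=\; \beta_{X^*}\,\frac{\sigma^2_{X^*}}{\sigma^2_{X^*}+\sigma^2_{\varepsilon}} , \]

a quantity that is always closer to 0 than the true slope; if the error variance is known, dividing the estimated slope by this factor gives the correction mentioned in the remark at the end of this section.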
Figure 22.1 A data set with a continuous outcome variable Y and a continuous covariate X measured without error and the corresponding regression line.

Figure 22.2 The impact of adding random covariate measurement error to the data set shown in Figure 22.1. The true values of X∗ are shown as grey dots, the mismeasured values X as black dots. The dashed grey lines correspond to the measurement error. The regression line using X∗ is shown in grey, the regression line using X is shown in black.

Figure 22.3 A data set with a continuous outcome variable Y and a continuous covariate X measured without error and the corresponding regression line.

Figure 22.4 The impact of adding random covariate measurement error to the data set shown in Figure 22.3. The true values of X∗ are shown as grey dots, the mismeasured values X as black dots. The dashed grey lines correspond to the measurement error. The regression line using X∗ is shown in grey, the regression line using X is shown in black.

Figure 22.5 A data set with a continuous outcome variable Y and a continuous covariate X measured without error and the corresponding regression line.

Figure 22.6 The impact of adding a systematic covariate measurement error to the data set shown in Figure 22.5. The true values of X∗ are shown as grey dots, the mismeasured values X as black dots. The dashed grey lines correspond to the measurement error. The regression line using X∗ is shown in grey, the regression line using X is shown in black.
The mean of the measurement error need not always be zero. Often, we have also a systematic component in the measurement error; for example, a laboratory parameter may be measured by some assay always 0.2 mg/l higher than the true value. Such a measurement error has no impact on the estimated regression slope, as the slope measures the impact of a difference in the X values on the outcome, and differences in X remain the same if we add a constant to X. However, a systematic component in the measurement error can have an impact if its magnitude depends on the value of X. For example, in subjects' reporting of their income we can imagine that subjects with a high income tend to underreport, and subjects with a low income to overreport. Similar phenomena can be imagined for many other variables in which at least for one end of the distribution subjects may tend to under- or overreport. The impact of such a systematic measurement error is illustrated in Figures 22.5 and 22.6 for the case of a positive association between X and Y: the slope becomes steeper. This is just a consequence of the fact that we make differences in X between subjects smaller without changing the outcome values. For example, in the case of the covariate income, a true difference of 1000 USD may be reduced to 800 USD, and hence the effect of an increase by 1 USD becomes larger. In the case of a negative association we can observe in Figures 22.7 and 22.8 that the slope becomes steeper, too. So this specific type of systematic measurement error implies a bias away from 0. If the systematic measurement error went in the opposite direction, that is, if large values of X were overreported and small values of X underreported, then we would observe a bias towards 0.
Figure 22.7 A data set with a continuous outcome variable Y and a continuous covariate X measured without error and the corresponding regression line.

Figure 22.8 The impact of adding a systematic random covariate measurement error to the data set shown in Figure 22.7. The true values of X∗ are shown as grey dots, the mismeasured values X as black dots. The dashed grey lines correspond to the measurement error. The regression line using X∗ is shown in grey, the regression line using X is shown in black.
Remark: As just shown, if subjects with high covariate values tend to underreport and subjects with low covariate values tend to overreport, this leads to an increase of the effect. However, this is only true as long as subjects with high values report, on average, values that are higher than those reported by subjects with low values. If there is no longer any difference on average, the effect may vanish!

Remark: If we have an idea about the standard deviation of the random measurement error ε, we can correct the estimated slope using a correction factor. Similarly, we can correct for a systematic measurement error by a correction factor if the slope of a regression of X on X∗ is known, and this holds also for other types of regression models like the logistic one (Fuller, 1987; Rosner et al., 1989). However, we should be aware that such corrections do not change the p-value, that is, we cannot become "more significant."

Remark: Also continuous outcome variables can be affected by random or systematic measurement error. Random measurement error does not matter, as it just becomes part of the error term. Systematic measurement error can again imply bias.
22.2 The Impact of Misclassification
Also, binary and categorical covariates can be affected by errors in their assessment. If we ask people about their smoking status or similar variables, some people may just give an incorrect answer.
Figure 22.9 A data set with a binary covariate X and a continuous outcome Y and no measurement error in X (left side) and with measurement error (right side). The colour of the markers refers to the true state X∗, whereas the measured value X is shown on the x-axis. The mean of Y in the two groups defined by X are shown as vertical, grey lines.
If we look at the disease history of a subject, gaps in the patient records may lead to some events being overlooked. Errors in a binary or categorical variable are typically not called measurement error, but misclassification. The impact of misclassification in a binary covariate is illustrated in Figure 22.9. Misclassification implies that some subjects move from one group to the other, and, as long as this happens independently of the actual value of Y, this means that some subjects move from the group with the higher average of Y to the group with the lower average and vice versa. So the difference between the two groups tends to become smaller. So again a misclassification implies a bias towards 0 for the effect estimate. Note that these considerations are independent of how many subjects move from 0 to 1 and how many move from 1 to 0 in X, so also in the case of different fractions (which may be regarded as a systematic misclassification) we have a bias towards 0.

Remark: Also a binary outcome Y may be affected by misclassification. This typically implies some bias.
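Returning to the misclassified binary covariate, a small hedged numerical illustration of this attenuation (the numbers are invented, and equal group sizes as well as misclassification independent of Y are assumed): if the true means of Y are 10 in the group with X∗ = 0 and 20 in the group with X∗ = 1, and 10% of each group is misclassified, the observed group means become

\[ 0.9 \cdot 10 + 0.1 \cdot 20 = 11 \qquad\text{and}\qquad 0.9 \cdot 20 + 0.1 \cdot 10 = 19 , \]

so the estimated effect shrinks from 10 to 8, a bias towards 0.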
22.3 The Impact of Measurement Error in Confounders
So far we have focused on the impact of the measurement error in a covariate X on the effect estimation for this covariate. If we want to adjust the effect of a covariate for other covariates reflecting potential confounders, then measurement errors in these confounders can have not only an impact on the effect estimates of the confounders but on the adjusted effect of the covariate of interest, too. This is illustrated using an example in Figure 22.10, in which the effect of X1 in an unadjusted analysis (corresponding to the grey, dashed line) is confounded
Figure 22.10 Left side: A data set with a continuous covariate X1 , a binary covariate X2 , and a continuous outcome Y and no measurement error. The regression lines from fitting a model with both covariates are shown as solid black lines. The regression line from fitting a model only with X1 is shown as grey, dashed line. Right side: The same data set with a few misclassifications in X2 . The regression lines from fitting a model with both covariates are shown again as black, solid lines.
with the effect of X2 , which is negatively associated with X1 but has a positive effect of its own. Hence, the adjusted effect (corresponding to the slope of the two black lines) is larger. If we now change the status of X2 for a few observations, that is, assume some misclassification in X2 , then the slopes in the adjusted analysis become smaller; that is, they move closer to the unadjusted effect. This is also intuitively convincing: If we have measurement error in the confounder, our adjustment has to be imperfect, and hence we move closer to the unadjusted effect. Hence, as a rule of thumb, we can say that (random) measurement error in a confounder implies a bias towards the unadjusted effect. If there is (in addition) a systematic component, this can be different. Moreover, as soon as several variables are involved in the considerations about the impact of the measurement error, we seldom have just one measurement error problem, and the story can quickly become more complicated.
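The mechanism shown in Figure 22.10 can be mimicked by a simulation along the following lines; this is only a sketch with hypothetical variable names and arbitrarily chosen numbers:

clear
set obs 1000
set seed 2712
generate x2 = runiform() < 0.5                        // binary confounder
generate x1 = rnormal(2 - 2*x2, 1)                    // X2 is negatively associated with X1
generate y = x1 + 3*x2 + rnormal(0, 1)                // X2 has a positive effect of its own
regress y x1                                          // unadjusted effect of X1 (confounded, smaller than 1)
regress y x1 x2                                       // adjusted effect, close to the true value 1
generate x2mis = cond(runiform() < 0.1, 1 - x2, x2)   // a few misclassifications in the confounder
regress y x1 x2mis                                    // adjusted effect moves towards the unadjusted effect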
22.4 The Impact of Differential Misclassification and Measurement Error
So far we have assumed that the occurrence of measurement error or misclassification is independent of the value of the outcome variable. This is also a reasonable assumption in prospective studies, in which data on X is collected before we have a
chance to observe Y . In retrospective studies, in which data on X is collected when the value of Y is already known, this assumption need no longer be true. In particular, in case control studies, in which we collect data for diseased subjects (cases) and healthy controls, we can imagine many mechanisms implying differences in the measurement errors between cases and controls. Differential misclassification (and measurement error) can imply a bias towards 0 as well as away from 0. Let us consider two typical examples from case control studies on cancer. If we ask controls about their smoking behaviour in previous years, there may be some random measurement error (as subjects remember incorrectly) and systematic measurement error in the form of a general underreporting. If we ask cases, we have to expect an additional underreporting, as cancer patients have an additional reason to feel guilty about their previous smoking behaviour. So if smoking is a risk factor, the cases this way make themselves more similar to the controls, and the estimated effect of smoking becomes smaller; hence, we have to expect a bias towards 0. In the second example, we are interested in the effect of family history, that is, whether the occurrence of cancer in relatives increases the risk of developing cancer. Now, it is always somewhat difficult to ask a subject whether he or she knows of cases of cancer in his or her family, as this requires knowledge about all relatives covered by the term “family” in the present study. So the answer “yes” is typically reliable, whereas the answer “no” typically means only that there is no case of cancer among those relatives the subject is familiar with to a sufficient degree. So there is probably some misclassification here: if an aunt or uncle died of cancer 30 years ago, this is often not known to the subject. The degree of this misclassification will, however, vary between cases and controls, as the cases are suffering from cancer and, hence, have probably already talked with some relatives about the occurrence of cancer in the family. In this way they may have heard about cases among their relatives that had been unknown to them before. So we have to expect that the misclassification is less pronounced among cases. But this means that we underestimate the relative frequency of cancer in the family to a higher degree in the controls than in the cases, and hence we overestimate the effect of family history. It may even happen that the true effect of family history is 0, and we erroneously claim an effect due to the higher underreporting in the controls.
22.5 Studying the Measurement Error
So far we have seen that simple, random measurement error or misclassification in a covariate introduces a bias towards 0 in this covariate, but that systematic or differential measurement error/misclassification or measurement error/misclassification in other covariates can also introduce a bias away from 0. If random measurement error and random misclassification did not imply a bias towards 0, regression models would probably be much less popular in medical research than they are today. As we have in nearly all medical studies using regression models some measurement error/misclassification in the covariates, we have to expect that nearly all these studies report biased estimates and hence may lead to incor-
rect conclusions. The fact that (random) measurement error/misclassification typically implies a bias towards 0 now provides the essential argument why, nevertheless, we can expect that medical studies using regression models are, on average, useful. We typically underestimate the effects of interest, and as we are mainly interested in demonstrating effects, there is little danger of erroneously demonstrating an effect that does not exist just because of measurement error. However, this general statement that measurement error does, on average, not do much harm must not be used to ignore the potential impact of measurement error/misclassification in a single study. If we have some evidence or belief that some of the covariates we use are affected by some measurement error/misclassification, and we have some idea about the direction and magnitude and the relation to other variables (including the outcome), we can often get some idea about the potential bias along the lines we have used in this chapter. (An example is given in the exercise below.) Simulation studies can also be helpful to improve the understanding of the potential impact. If a serious bias in the results of a study due to measurement error/misclassification in the covariates cannot be excluded, it can be wise to include a substudy to investigate this measurement error further. A common approach is to perform a validation substudy, in which in a subsample of the whole study an attempt is made to measure the true covariate value X ∗ by additional efforts. For example, we may measure the nutrition of the subjects in the whole study by a simple questionnaire, but in the subsample the subjects are asked to keep a detailed protocol over two weeks of anything they eat or drink. Such validation substudies allow us to detect and estimate random, systematic, and differential measurement errors/misclassification; combining the information from the main study with that of the validation substudy can lead to unbiased estimates. Many methods to perform this task have been proposed in the literature, and a nice overview can be found in Thurigen et al. (2000). However, performing validation substudies is a nontrivial task, as it is often hard to find error-free measurement instruments and to ensure that being a member of the validation sample does not affect a subject’s behaviour in reporting X in the main study. Another approach is to measure X twice in all subjects or in a subsample of the participants of a study. This allows us to get an idea about the magnitude of the random component of the measurement error, but only to a very limited degree about systematic or differential measurement error. The book by Carroll et al. (1995) gives a nice overview of methods to handle such studies and similar approaches. Remark: A comprehensive overview of the impact of measurement error on the results of a regression analysis and of possible correction approaches can be found in the book by Buonaccorsi (2010).
22.6 Exercise Measurement Error and Interactions
In many medical studies we can imagine that one covariate has an influence on the magnitude of the measurement error in another variable. For example, if we ask people about their alcohol and drug consumption in their adolescence, we may expect
a smaller measurement error in young subjects than in old subjects, as the younger people can remember their adolescence better and may be more open to reporting on this topic. Imagine now that you are interested in investigating whether alcohol and drug consumption in adolescence has the same impact on the quality of life in young and old subjects, that is, in an interaction between age and alcohol/drug consumption. How will the measurement error in the covariate early alcohol/drug consumption probably affect the results of the study?
THIS CHAPTER IN A NUTSHELL Measurement error or misclassification in the covariates is a common problem in many regression analyses. It often implies an underestimation of effects, that is, a bias towards 0, but this need not always be the case. Hence, the potential impact on the results of a regression analysis always has to be discussed in light of the specific study.
Chapter 23
The Impact of Incomplete Covariate Data
Missing values in the covariates are a common problem in regression analyses. We discuss in this chapter various approaches to handle this problem.
23.1 Missing Value Mechanisms
The occurrence of missing values in the covariates is a common problem in many studies. They can be due to several reasons: Subjects may refuse to answer certain questions or cannot remember events in the past (e.g., vaccinations), laboratory measurements may fail or samples may be lost, or patient records may be incomplete. The standard approach of most statistical software packages is to remove all subjects with a missing value in at least one of the covariates from the analysis, and this is known as a complete case analysis. To understand the potential disadvantages (and advantages) of a complete case analysis, we need to briefly discuss typical and important properties of missing value mechanisms, that is, the mechanisms generating the missing values. Missing At Random (MAR) If there is no relation between the true value of X and the occurrence of a missing value in X, then it is said that the missing values occur at random and that the missing value mechanism satisfies the MAR assumption. The MAR assumption is always questionable if missing values occur due to an active refusal to answer a question, as the decision to refuse the answer is often related to the true answer. One can easily imagine that missing values in questions on income, alcohol abuse, heavy smoking, or sexual activities do not satisfy the MAR assumption. But missing values due to the response “I do not know” may also not occur at random, even if the subject is telling the truth. For example, if you ask subjects about the occurrence of diseases in family members, it is much more likely that they remember such an occurrence if there is an affected family member than that they know the exact status of all their family members if none has experienced the disease. Similarly, if a patient had shown an allergic reaction to an antibiotic, it is likely that this is noted in the patient’s record, but if the patient has never experienced this, it is less likely that such a statement is found in the records. Objective measurements by technical devices, too, may produce missing
values not satisfying the MAR assumption, for example, if they cannot produce a measurement of a lab parameter because the concentration is too low. On the other hand, the MAR assumption allows the occurrence of missing values to depend on other, measured covariates or on the outcome. For example, if we know that old subjects refuse more often than young subjects to answer questions about sexual activity, and age is one of the covariates in the regression model, then the missing values in the variable sexual activity may still occur at random, if they are not related to the true answer. Missing dependent on X (MDX) This assumption allows the occurrence of missing values in a covariate to be related to the true value of this covariate, to the value of any other observed covariate, or to the true value of any other covariate with a missing value in this subject, but it is not allowed to depend on the value of the outcome variable Y . This assumption is typically satisfied in prospective studies, in which data on all covariates are collected prior to the measurement of Y and also prior to the events that are responsible for the final measurement of Y . It is typically more questionable in retrospective studies, in which data on X is collected when the value of Y is already known. In particular, in case control studies, the MDX assumption is often highly questionable, as the diseased cases will remember their history differently than the healthy controls, or—when the cases are already dead—we have to collect data on X using relatives or patient records. Consequently, we have different missing rates in cases and controls, and hence the MDX assumption is violated (but MAR may still hold!). It is also likely that there is not only a quantitative difference between cases and controls, but also a qualitative one. For example, healthy controls have little reason to refuse an answer on their smoking habits even if they smoke a lot, so that missing values in this subgroup may be regarded as occurring at random. However, heavy smokers among the diseased cases may feel guilty and hence prefer to refuse to answer. Remark: There are (rare) situations in which, also in a prospective study, the occurrence of missing values carries information on the value of the outcome. This can happen if both the missing values and the outcome are related to some latent variable like the attitude towards one’s own disease. Patients who have given up on themselves may be more likely to refuse to answer questions and simultaneously have a worse prognosis than patients who are eager to overcome their disease.
23.2 Properties of a Complete Case Analysis
In a complete case analysis we select those subjects for the analysis who have no missing values in their covariates. If we now assume that the missing value mechanism satisfies the MDX assumption, then this selection depends only on the (true) covariate values of X. But we have already seen in Section 15.1 that such a selection usually does not introduce any bias in our effect estimates. So under the MDX assumption a complete case analysis is a clean way to estimate effects. And—which is highly important—this does not require the MAR assumption. So even if we, for example, know that subjects with high income refuse more often to show the in-
terviewer their last tax report than subjects with a low income, we can still use a regression model to assess the influence of income on some outcome Y , and we do not have to be afraid of a bias from a complete case analysis. And a complete case analysis is (nearly) the only method with this property. But what about efficiency? We are throwing away many observations with information on the outcome and at least some of the covariates. So this may suggest that we use more sophisticated methods like those we will outline in Section 23.4, which can make use of this information and hence may provide more efficient estimates and a higher power. However, the gain that is possible by such methods should not be overestimated. For example, if we are interested in estimating the effect of X1 , observations with a missing value in X1 cannot provide any information about the effect of X1 . So if X1 is the only covariate affected by missing values, we cannot expect a gain by using more sophisticated methods than a complete case analysis. Only those observations with missing values in covariates other than X1 can contribute information about the effect of X1 that is neglected by a complete case analysis. However, if we have an observation with a missing value in, for example, X2 but an observed value in X1 , this observation cannot carry much information about the effect of X1 adjusted for X2 . It can mainly contribute to the adjustment that is necessary because of the covariates we could observe in this observation. So, again, it carries limited information. In summary, in many situations the potential gain in using sophisticated methods may be smaller than expected at first sight. In addition, we do not get this (small) gain in efficiency for free: The sophisticated methods require the MAR assumption, which is often highly questionable, whereas the complete case analysis does not require this assumption. On the other hand, if the MDX assumption is questionable—in particular in retrospective studies—then the complete case analysis can be seriously biased (Vach and Blettner, 1991), and then we need sophisticated methods to overcome this bias. Remark: A further discussion of the advantages and disadvantages of a complete case analysis can be found in White and Carlin (2010).
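As a practical aside, a complete case analysis usually requires no extra programming, since Stata's estimation commands automatically drop observations with a missing value in any variable of the model. A sketch with hypothetical variable names:

misstable summarize y x1 x2               // how many missing values do we have?
regress y x1 x2                           // complete case analysis (listwise deletion by default)
regress y x1 x2 if !missing(y, x1, x2)    // the same selection made explicit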
23.3 Bias Due to Using ad hoc Methods
In medical studies we can often find the application of some ad hoc methods to make use of the observations with missing values in some covariates. In particular, it has been popular to regard missing values in a categorical covariate just as an additional category. Similarly, it has been popular to set missing values in continuous variables to 0 and to add a binary indicator variable indicating the occurrence of a missing value. However, both approaches introduce bias. Let us take a look at the situation with two covariates and missing values in X2 . If we apply either of the two approaches above, it implies that in subjects with no missing value we make an adjustment for X2 , but in subjects with a missing value, X2 is constant, so in these observations we do not make any adjustment. So, in the first case, we estimate the adjusted effect and, in the second case, the unadjusted effect. Hence, on average, we estimate a number between these two effects, that is, we introduce a bias towards the unadjusted effect.
The latter only holds if the two subgroups, with and without a missing value, are random subsamples. If this is not the case, then a bias of any magnitude and any direction can appear, even if the MAR assumption holds (Vach and Blettner, 1991). So these approaches are a very poor idea if we are interested in using the subjects with incomplete observations to adjust estimates. However, there may be situations in which this approach is very useful, namely when the missing values are in themselves highly predictive. For example, Commenges et al. (1992) report a study comparing different tests to diagnose dementia. They found missing values in the variables corresponding to the results of two tests to be highly predictive, because the missing values reflected a subject’s failure to comprehend the test.
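For concreteness, the indicator approach for a continuous covariate described above can be coded as follows (hypothetical variable names; remember that, in general, this is a biased strategy and is only advisable in situations like the one just described):

generate x2miss = missing(x2)             // indicator for a missing value in x2
generate x2zero = cond(x2miss, 0, x2)     // missing values replaced by 0
regress y x1 x2zero x2miss                // "missing indicator" analysis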
23.4 Advanced Techniques to Handle Incomplete Covariate Data
The MAR assumption is the key to using the data from observations with missing values in some covariates. It allows us to estimate, for any missing value, its distribution, because we know that the true value has no influence on the occurrence of a missing value. So, for example, if we have two covariates X1 and X2 , and we observe a missing value in X2 for a specific subject, we can take a look at all other subjects sharing the value of X1 (or at least approximately sharing this value) and also sharing the value of Y , and use their values of X2 to estimate the distribution of the unobserved value of X2 . There have been many proposals on how to use this idea, but in the end one approach became most popular: multiple imputation. The simple, basic idea of multiple imputation is to estimate the distribution of each missing value, then to generate many observations from these estimated distributions, to analyse all of these filled data sets, and to average the results. If this generation is performed in a specific way that also takes into account that the distributions are estimated (called proper imputation), then it can be shown that valid estimates are obtained, and if the averaging of the standard errors is done in a particular way, valid confidence intervals and p-values are also obtained. So a number of technicalities have to be taken into account in performing multiple imputation, and in addition we have to tackle the problem of missing patterns varying from subject to subject and the necessity to condition on the outcome Y (which is a censored observation in the case of using the Cox model). But fortunately all these problems have been solved, and multiple imputation is today available as a general technique in most statistical packages. We would like to refer the reader to Schafer (1997) and Royston (2004). Stata provides multiple imputation with the mi command, and the documentation of this command also includes a nice introduction and many further references. However, we should always remember that multiple imputation requires us to assume MAR. This is a severe restriction, and whenever it is in doubt, it may be preferable to stick to a complete case analysis. Remark: Multiple imputation can also be used if the MAR assumption is in doubt, but then a rather precise idea of how the MAR assumption is violated is required. For example, if we believe that subjects with high income refuse more often to report
their income, we can generate income values higher than suggested by the MAR assumption. Such an imputation can also be part of a sensitivity analysis, that is, an analysis of the sensitivity of the results obtained by multiple imputation to the MAR assumption. This is usually recommended today when using multiple imputation. Remark: If a covariate is affected by missing values, then it is often likely that it is also affected by some measurement error, in particular if the missing values are due to lack of response from subjects: Some subjects may prefer to refuse to answer (i.e., generate missing values), but others may prefer to lie (i.e., generate measurement error or misclassification). Hence, in some situations it can be rather artificial to use advanced methods to exploit the information from subjects with missing values in the analysis, but to ignore the potential impact of measurement error (cf. Chapter 22). Remark: Of course, an outcome variable Y can also be affected by missing values. Since the subjects with a missing value in the outcome carry no information on the relation between the covariates of interest and the outcome, a complete case analysis is more or less the only method available. It will (only) be biased if the missing values in Y do not satisfy the MAR assumption. The situation is different in the case of a longitudinal outcome, as then we actually have several measurements of the outcome variable for each subject, and only some of them may be missing. Then advanced methods like multiple imputation may allow us to make efficient use of the available data. However, if random effects models are used, they automatically make efficient use of the available data under the MAR assumption, and no further actions are needed.
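A minimal sketch of the multiple imputation workflow with Stata's mi command may look as follows; the variable names are hypothetical, x2 is the covariate affected by missing values, and 20 imputations are drawn under the MAR assumption:

mi set wide
mi register imputed x2
mi register regular y x1
mi impute regress x2 x1 y, add(20)        // impute x2, conditioning also on the outcome
mi estimate: regress y x1 x2              // analyse all imputed data sets and combine the results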
23.5 Handling of Partially Defined Covariates
In some instances, a covariate is only defined within a certain subgroup. For example, when analysing risk factors in the elderly, age at menopause is often known to be a relevant factor and/or confounder in women, but this covariate is undefined in men. Another example arises when analysing the effect of the type of current occupation on some health outcome, when some of the subjects are currently unemployed and hence lack a status with respect to the current occupation. It is important to be aware that this type of “missing information” on covariate values is fundamentally different from the “missing values” discussed so far in this chapter. The term “missing value” indicates that the true value of the covariate is just unobserved, but it exists. In the case of partially defined covariates, no true values of the covariate exist for a subgroup. Hence, it is not appropriate to use any approach that tries to reconstruct such values, like multiple imputation. Instead, incorporating such covariates has to be approached by conceptual considerations about what we want to estimate in the specific situation, followed by a technical realization within the framework of regression models. With respect to age at menopause the essential step is to define the effect of gender in a meaningful manner. Since we have no substitute for age at menopause in men, the usual way to define the effect of gender as the expected difference between a woman and a man with identical values for all other covariates does not work. We
can only compare a man with a woman with a specific value of age at menopause fixing all other covariate values. The choice of this age is arbitrary. A typical choice may be the average age in the study population or in the general population. Once such a choice is made, we have just to impute this value for all men in the variable age at menopause and fit a model with an indicator variable for gender, the filled variable age at menopause, and all other covariates. This way we obtain the gender effect of interest. If you are not interested in the gender effect itself, but just aim at adjusting the effect of all other covariates for the effects of gender and age at menopause, then it does not matter which constant value you impute, as this choice does not affect the other estimates. With respect to handling the type of current occupation in unemployed subjects, the first straightforward approach is to regard unemployed as an additional category of the covariate type of current occupation. Then we can estimate the difference between any category of type of current occupation and unemployed, and the effects of all other covariates are adjusted for both the employment status as well as the current type of occupation. Only if there is an interest in assessing the effect of employed versus unemployed do we lack a single number giving a direct answer. Again, we have to ask for a meaningful definition of this effect, taking into account that the current type of occupation also has an influence. One answer can be a weighted average of the effects between the single categories of type of current occupation and the status unemployed. The weights may be chosen proportional to the frequency of each category of the current type of occupation in the study population or the general population. Such a weighted average with a confidence interval and a p-value can be easily assessed using Stata’s lincom command when using unemployed as reference category (cf. Chapter 8).
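The two strategies described in this section may be sketched as follows; all variable names, the imputed value of 50 for age at menopause, and the weights in the lincom statement are hypothetical and have to be adapted to the study at hand:

* gender effect with age at menopause defined only in women
generate menoage_f = cond(female == 0, 50, menoage)   // fixed value imputed for all men
regress y i.female menoage_f x3
* occupation with "unemployed" as an additional category
regress y i.occupation x3                 // assumes occupation==0 codes "unemployed" (reference category)
lincom 0.4*1.occupation + 0.4*2.occupation + 0.2*3.occupation   // weighted average versus unemployed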
THIS CHAPTER IN A NUTSHELL Missing values in the covariates are a frequent problem in regression analyses. Ad hoc methods to handle them can introduce serious bias. A complete case analysis is nearly always a proper approach in studies based on a prospective collection of the outcome, but can be seriously biased otherwise. Sophisticated methods like multiple imputation should be used in studies with retrospective sampling or if we have to expect a substantial loss in power from a complete case analysis. They require, however, assuming that missing values occur at random. Partially defined covariates should not be addressed by methods for handling missing values.
Part III
Risk Scores and Predictors
Chapter 24
Risk Scores
24.1 What Is a Risk Score?
We have already pointed out on several occasions that we can apply the estimated regression parameters of a regression model to the covariate values x1 , x2 , . . . , x p of any subject—either a subject from our sample or a new subject. The resulting value ηˆ (x1 , x2 , . . . , x p ) = βˆ0 + βˆ1 x1 + βˆ2 x2 + . . . + βˆ p x p is often called a risk score, as it estimates a quantity describing the risk of a subject with respect to the outcome of interest. We start now by recapitulating the meaning of this quantity in the different regression models. In the case of the classical linear regression model, the risk score is an estimate for η (x1 , x2 , . . . , x p ) = β0 + β1 x1 + β2 x2 + . . . + β p x p and this is nothing else but μ (x1 , x2 , . . . , x p ), that is, the expected value of our outcome variable Y given the covariate values x1 , x2 , . . . , x p of a subject. In other words, the true value η (x1 , x2 , . . . , x p ) is the value of Y we can expect on average if we have many subjects all with the same covariate values x1 , x2 , . . . , x p . And our risk score ηˆ (x1 , x2 , . . . , x p ) is an estimate for this value. For example, in Section 4.1 we developed a model relating systolic blood pressure to the daily number of cigarettes (X1 ) and the daily alcohol intake (X2 ). The estimated regression coefficients were βˆ0 = 118.8, βˆ1 = 0.30, and βˆ2 = 0.076. According to this model, we should expect for a subject smoking 30 cigarettes per day and drinking 8 g alcohol per day, a systolic blood pressure of ηˆ (30, 8) = βˆ0 + βˆ1 × 30 + βˆ2 × 8 = 128.5. In the case of the logistic regression model, the risk score
ηˆ (x1 , x2 , . . . , x p ) = βˆ0 + βˆ1 x1 + βˆ2 x2 + . . . + βˆ p x p is an estimate for
η (x1 , x2 , . . . , x p ) = β0 + β1 x1 + β2 x2 + . . . + β p x p = logit π (x1 , x2 , . . . , x p ), that is, the probability of observing Y = 1 for a subject with covariate values x1 , x2 , . . . , x p , expressed on the logit scale. This true value η (x1 , x2 , . . . , x p ) is the relative frequency of Y = 1 (expressed on the logit scale) we can expect if we have many subjects all with the same covariate values x1 , x2 , . . . , x p . And our risk score
ηˆ (x1 , x2 , . . . , x p ) is an estimate for this value. To facilitate the interpretation of the risk score, the risk score value ηˆ (x1 , x2 , . . . , x p ) is usually transformed from the logit scale to the probability scale; that is, the final risk score is πˆ (x1 , x2 , . . . , x p ) = logit−1 (ηˆ (x1 , x2 , . . . , x p )), which tries to estimate π (x1 , x2 , . . . , x p ). Examples of computations of such risk score values have been given in Section 6.2 and in Section 13.3. Risk scores can also be computed on the odds scale as
oddsˆ (x1 , x2 , . . . , x p ) = πˆ (x1 , x2 , . . . , x p ) / (1 − πˆ (x1 , x2 , . . . , x p )) = exp(ηˆ (x1 , x2 , . . . , x p )) .
The situation is slightly more complicated in the case when the outcome of interest is a survival time and we use the Cox proportional hazards model. In fitting this model, we obtain directly only estimates for the regression parameters, but not for the baseline hazard function h0 (t). Consequently, the risk score is usually defined as
ηˆ (x1 , x2 , . . . , x p ) = βˆ1 x1 + βˆ2 x2 + . . . + βˆ p x p which is an estimate for
η (x1 , x2 , . . . , x p ) = β1 x1 + β2 x2 + . . . + β p x p . As the Cox regression model reads now log h(t, x1 , x2 , . . . , x p ) = log h0 (t) + η (x1 , x2 , . . . , x p ) , the values of ηˆ (x1 , x2 , . . . , x p ) work in principle fine as risk scores: Low values of ηˆ (x1 , x2 , . . . , x p ) imply that a subject is at low risk of an event at all time points t compared to a subject with high values of ηˆ (x1 , x2 , . . . , x p ). However, it is hard to interpret the absolute value of ηˆ (x1 , x2 , . . . , x p ), as the actual risk of a subject depends also on the unknown baseline hazard function. The risk score of a subject describes only the risk of the subject relative to a subject, for whom all covariate values are 0, as the latter will in any case obtain a risk score of 0. As a consequence, a risk score of 0.5 may indicate a high-risk subject in one application and a low-risk subject in another application. So we can use the risk score only to compare subjects, but not to assess the absolute risk of a single subject. To overcome this difficulty, we can try to estimate the probability of a subject with a certain covariate pattern to survive at least until the time point t, that is, the probability πt (x1 , x2 , . . . , x p ) = P(Y ≥ t|x1 , x2 , . . . , x p ) However, this is not only a function of the risk score η (x1 , x2 , . . . , x p ), but also of the so-called baseline survival function S0 (t), which is the survival function of a subject for whom all covariate values are 0. Some statistical packages provide an estimate for this baseline survival function, but an estimate can also be computed manually (for
details see Appendix D.1). Once we have these estimates, we can obtain estimates for πt (x1 , x2 , . . . , x p ) using
πˆt (x1 , x2 , . . . , x p ) = Sˆ0 (t)^exp(ηˆ (x1 , x2 , . . . , x p )) . The values of Sˆ0 (t) change only for those values of t that correspond to an observed survival time, and hence the same is true for πˆt (x1 , x2 , . . . , x p ). At each of these time points, πt (x1 , x2 , . . . , x p ) decreases, so in the end we can combine all these values into a step function similar to a Kaplan-Meier estimate. For example, in Exercise 10.5 we have fitted a Cox proportional hazards model to analyse the effect of the covariates age, tumour grading (with possible values 1, 2 and 3), tumour size (in cm), and lymphnode status on disease-free survival. If we consider lymphnode status as a binary variable (≤ 3 lymph nodes versus ≥ 4 lymph nodes), and the other three variables as continuous variables, then we obtain the following regression estimates: age 0.039, tumour grading 0.403, tumour size 0.113, lymphnode status 1.066. If we now consider a 65-year-old patient with a tumour of grade 3 and of 8 cm size, and the patient has 7 affected lymph nodes, then we obtain a raw risk score of
ηˆ (65, 3, 8, 1) = 0.039 × 65 + 0.403 × 3 + 0.113 × 8 + 1.066 × 1 = 5.76 . In contrast, if we consider a 40-year-old patient with a tumour of grade 1 and of 2 cm size, and the patient has 2 affected lymph nodes, we obtain a raw risk score of
ηˆ (40, 1, 2, 0) = 0.039 × 40 + 0.403 × 1 + 0.113 × 2 + 1.066 × 0 = 2.20 . So we can see that the first patient has a much higher risk of dying than the second, because she is older, her tumour is bigger, and has a higher grade, and many of her lymph nodes are affected. We can now translate these values into 3-year survival probabilities, if we know that the baseline survival function is estimated at 3 years as Sˆ0 (3) = 0.9973. Then we obtain the estimated 3-year survival probabilities as Sˆ0 (3)exp(ηˆ (65,3,8,1)) = 0.9973exp(5.76) = 0.423 and Sˆ0 (3)exp(ηˆ (40,1,2,0)) = 0.9973exp(2.20) = 0.976, respectively. We can also look at all values of πt for the two patients arranged as survival curves (Figure 24.1), and we can again see the big difference between the two patients. Note that a publication should include values of Sˆ0 (t) for some selected values of t (e.g., 1, 2, 3, and 5 years) such that it becomes possible to compute survival probabilities for a new subject. Remark: It is common to express the raw scores from a Cox proportional hazards model on the hazard rate scale instead of the log hazard rate scale, that is, to consider eη (x1 ,x2 , ..., x p ) . These numbers are hazard ratios comparing, again, a given subject with a subject for whom all covariate values are 0. That is, the problem of the
Figure 24.1 The estimated survival curves for a high-risk and a low-risk patient.
interpretation of the absolute value of the risk score does not vanish. However, the comparison of risk score values between different subjects is now somewhat easier, as ratios of risk score values describe the factor by which the hazard rate is increased or decreased. Remark: We have chosen the term risk score to reflect the common idea behind such scores across various areas of application. In specific areas, more specific terms like prognostic score or diagnostic score are used.
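The two 3-year survival probabilities computed in the worked example above can be reproduced directly, for example with Stata's display command, using the rounded risk score values 5.76 and 2.20 given in the text:

display 0.9973^exp(5.76)   // high-risk patient, about 0.42
display 0.9973^exp(2.20)   // low-risk patient, about 0.98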
24.2 Judging the Usefulness of a Risk Score
The aim of a risk score is typically to provide a tool to distinguish high risk from low risk subjects, given their covariate values. So whenever fitting a regression model to obtain a risk score, we should check whether the risk score fulfils its aim. This can be easily approached by looking at the distribution of the ηˆ or πˆ values in the data we have used to construct the risk score. For example, Figure 24.2 shows the distribution of the estimated 3-year survival probabilities and the estimated 5-year survival probabilities in the data set breast introduced in Exercise 10.5. We can, in particular, with respect to the 5-year survival probabilities, observe a substantive spread. We can identify a low risk group with probabilities above 0.9 comprising 16% of the patients, and if we regard a survival probability of 0.8 still as a low risk, about half of the patients are at low risk. It turns out to be much harder to define a high risk group of substantial size. Even if we regard 5-year survival probabilities below 0.5 as an indicator for high risk, only 12% of the patients fall into this group. So we can see that it is rather simple to judge the usefulness of a risk score by inspecting its distribution. Any type of method to inspect a distribution like box plots or summary statistics can be used, if we want to predict how the risk score will behave in populations similar to that of our study. We should, however, be cautious and remember always that we look at the distribution of the estimated risk score values ηˆ (xi1 , xi2 , . . . , xip ), not at the distribution of the true values η (xi1 , xi2 , . . . , xip ). The
Figure 24.2 The distribution of the individual estimated 3-year survival probabilities (left side) and of the individual estimated 5-year survival probabilities (right side) in the data set breast.
latter has a somewhat smaller spread, as the estimated risk score values are subject to some estimation uncertainty (cf. also the next section). In particular, if the true effect of all covariates in a model is 0, that is, if η (xi1 , xi2 , . . . , xip ) is a constant for all subjects, we will nevertheless observe some variation in ηˆ (xi1 , xi2 , . . . , xip ). Consequently, the distribution of ηˆ (xi1 , xi2 , . . . , xip ) should not be used as an argument that the covariates have some effect. Remark: The distribution of the estimated risk scores reflects not only properties of the underlying regression model, but also the distribution of the covariates in the study population. Often, risk scores are developed from studies with restrictive inclusion criteria like clinical trials. In clinical trials age is often restricted to below 65 or below 70, because the higher morbidity and mortality in the elderly dilutes the treatment effect. However, we typically wish to apply the risk score later also in patients above the age of 65 or 70, respectively, and such an extrapolation is often justified. But this also implies that the number of high risk patients increases (if age is a risk factor), and hence the spread of the risk score values will also increase, implying that the score is more useful than suggested by the current study.
24.3 The Precision of Risk Score Values
In Section 24.1 we have seen that risk scores are estimates for quantities describing the risk of a subject with respect to the outcome of interest. Actually, they provide (at least in large samples) unbiased and optimal estimates for the true values η (x1 , x2 , . . . , x p ). Since risk scores are estimates, they also have some imprecision, and in many situations it can be essential to take this imprecision into account. For example, if a risk score indicates that a cancer patient has a 5-year survival probability of 95% after surgery, we may decide to avoid any further chemo- or radiation therapy, as the side effects of the therapy may outweigh the small improvement in
x1   x2   x3      n    %Y=1    ηˆ(x1 , x2 , x3 )    SE(ηˆ(x1 , x2 , x3 ))
 1    1    1    112    67.0      0.706                 0.168
 1    1    0     64    60.9      0.403                 0.201
 1    0    1     79    57.0      0.273                 0.188
 1    0    0     28    46.4     −0.030                 0.233
 0    1    1     53    47.2     −0.021                 0.211
 0    1    0     32    43.8     −0.325                 0.231
 0    0    1     23    43.5     −0.455                 0.245
 0    0    0     17    29.4     −0.758                 0.275
Table 24.1 An artificial data set with three binary covariates and a binary outcome variable (left side) and estimated risk scores with their standard errors based on a fitted logistic model (right side). Here, n denotes the number of subjects with the given covariate pattern, and %Y=1 denotes the relative frequency of Y = 1 among subjects with the given covariate pattern.
survival probability we may gain. However, if the 95% confidence interval of the risk score is [0.58, 0.99], we may change our decision, as the prognosis might be much worse than indicated by the risk score alone. Fortunately, for a risk score value ηˆ (x1 , x2 , . . . , x p ) a standard error can also be computed in a rather simple manner, using the variance-covariance matrix (details are given in Appendix D.2). And risk score values tend to fluctuate around the true value approximately like a normal distribution, such that we can construct a confidence interval by taking ηˆ (x1 , x2 , . . . , x p ) ± 1.96 SE(ηˆ (x1 , x2 , . . . , x p )) .
It is important to realize that the standard error of risk score values can substantially vary from subject to subject. Let us, for example, consider the data set of Table 24.1. A logistic regression analysis yields the following output:
variable      beta     SE      95% CI              p-value
intercept    -0.758    0.275   [-1.297, -0.219]    0.006
x1            0.728    0.220   [ 0.295,  1.160]    0.001
x2            0.433    0.212   [ 0.017,  0.850]    0.041
x3            0.303    0.214   [-0.116,  0.723]    0.156
Based on these parameter estimates we can compute for each possible covariate pattern the risk score ηˆ (x1 , x2 , x3 ) and its standard error, as shown on the right side of Table 24.1. We can see that the standard error of the risk score of a subject with covariate pattern (0,0,0) is much higher than the standard error for a subject with covariate pattern (1,1,1). This is due to the fact that we have many subjects with a covariate pattern equal or similar to (1,1,1), but only few equal or similar to (0,0,0). In the case of logistic regression we typically transform risk scores at the end to values πˆ (x1 , x2 , . . . , x p ) = logit−1 (ηˆ (x1 , x2 , . . . , x p )) on the probability scale. Standard errors for these values can also be computed (see Appendix D.2). However, to construct a confidence interval for πˆ (x1 , x2 , . . . , x p ), it is preferable to just apply the inverse logit transformation to the boundaries of the confidence interval
for η (x1 , x2 , . . . , x p ) (cf. Section 12.6). In the case of the Cox proportional hazards model, it is unfortunately more cumbersome to obtain standard errors and confidence intervals for πˆt (x1 , x2 , . . . , x p ). Hence, we typically focus on the standard errors of the risk scores ηˆ (x1 , x2 , . . . , x p ). However, some care is necessary in computing and interpreting the standard error of ηˆ (x1 , x2 , . . . , x p ) from a Cox model. As mentioned above, ηˆ (x1 , x2 , . . . , x p ) describes the risk of a subject relative to a subject with all covariate values equal to 0. Now if we have a covariate like age in the model, the value 0 for age is far away from the actually observed values of age, and hence we can estimate the risk of a subject with age 0 only with a very low precision. And as ηˆ (x1 , x2 , . . . , x p ) is the difference to this subject, this risk score is also imprecise. To avoid this problem, we have to ensure that the subject with the value 0 in all covariates is actually somewhere in our sample, ideally somewhere in the middle. This can be obtained by centering all continuous covariates, that is, we subtract the population mean from each covariate. Additionally, we can ensure that for all categorical covariates the most frequent category is used as reference category. Actually, such transformations do not change any parameter estimate, and they preserve the differences between the risk score values for any pair of subjects. But they change the values of the risk scores themselves and their standard errors, and ensure that both are interpretable. We must be aware that there is today one practical limitation in computing standard errors for (published) risk scores. Publications about risk scores typically include only the regression parameter estimates and their estimated standard errors. This information is not sufficient to compute the standard errors of individual risk scores, as they also depend on the correlation among the regression coefficients, and estimates for these correlations are typically not published. Remark: In interpreting standard errors of individual risk scores and confidence intervals for risk scores, we implicitly assume that the risk score is a consistent estimate of the underlying quantity of interest. This is the case if our regression model is correctly specified. However, the assumption might be violated if a quadratic term or a substantial interaction term is missing in the model. In most applications this is no major matter of concern, as confidence intervals for risk scores are typically large, such that a moderate bias of a risk score does not heavily affect the coverage probability of a confidence interval. In our example of Table 24.1 we have estimated 4 regression parameters based on 408 observations, which is a rather comfortable situation. Nevertheless, the confidence interval for πˆ (1, 1, 1) ranges from 0.593 to 0.738, and the confidence interval for πˆ (0, 0, 0) ranges from 0.215 to 0.445. Remark: Centering the covariates in fitting a Cox model can also be useful when computing individual survival probabilities. The baseline survival function is computed for the subject with all covariates equal to 0, and if this subject is far away from our sample, this survival function can behave rather strangely, in particular it can be close to 1 for all relevant values of t. This may imply numerical instabilities, which are avoided by centering the covariates.
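Centering the continuous covariates before fitting the Cox model takes only a few lines. The following sketch uses the covariates of the breast cancer example and assumes that the data have already been declared with stset; the centered model yields the same regression coefficients, but interpretable risk score values and standard errors:

summarize age, meanonly
generate age_c = age - r(mean)               // centered age
summarize tumorsize, meanonly
generate tumorsize_c = tumorsize - r(mean)   // centered tumour size
stcox age_c grad nodestat tumorsize_c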
24.4 The Overall Precision of a Risk Score
Once we have fitted a regression model to a data set and we want to propose the resulting risk score to be used in the future to assess individual risk, we would also like to inform the potential users about the precision they can expect in using the risk score values. Since the precision varies from subject to subject, depending on the covariate values, we have to define an appropriate average precision. As we are used to describing precision by means of standard errors, a useful measure is the average standard error
SE(ηˆ) = (1/n) ∑_{i=1}^{n} SE(ηˆ (xi1 , xi2 , . . . , xip )) .
The interpretation of the average standard error is straightforward. It describes the value we can expect on average if we randomly draw a subject from our population and look at the estimated standard error of its risk score. In spite of this simple interpretation, it remains a question when we should regard an average standard error as big or small. Here, it is useful to remember that we compute confidence intervals for risk scores as ηˆ (x1 , x2 , . . . , x p ) ± 1.96 SE(ηˆ (x1 , x2 , . . . , x p )); that is, the average length of such a confidence interval is about 4 times the average standard error. If we have a continuous outcome, it is relatively simple to judge such a length, as the length refers directly to the scale of the outcome variable. If the outcome is, for example, the systolic blood pressure, then an average SE of 2.5 will be regarded as small, as it implies that we can estimate the expected blood pressure with a confidence interval of about ±5. On the other hand, an average SE of 10 will be regarded as (too) big, as it implies confidence intervals of ±20 for the expected blood pressure. For a binary outcome, the average SE and the average length of the confidence interval refer to the logit scale, so we are back to the problem of judging differences on the logit scale. Hence, it can be helpful to translate the length of a CI on the logit scale to the length of the CI on the odds scale, expressed as the ratio between the upper and the lower bound, which is just given as exp(4 SE(ηˆ (x1 , x2 , . . . , x p ))). And the average length of the CIs on the logit scale is translated to the geometric mean of these ratios. Table 24.2 presents this computation for some selected values of the average SE(ηˆ), together with a suggestion for a classification of the average standard error. With respect to the Cox model the same argument can be applied. The average standard error SE(ηˆ) can be transformed to a geometric mean of the ratios between upper and lower bound of the individual CIs if we work on the hazard rate scale instead of the log hazard rate scale. So Table 24.2 can be used as a guideline again. However, the judgement is slightly overoptimistic: if we go one step further and transform the risk score to survival probabilities, we add additional imprecision, as we have to estimate the baseline survival function. Remark: In principle, the standard errors of the estimates πˆ (xi1 , xi2 , . . . , xip ) or πˆt (xi1 , xi2 , . . . , xip ) could also be averaged. However, this is not very convincing, as the precision of an estimated probability typically depends on its size, cf. the varying width of the confidence intervals in Table 24.1.
              average SE(ηˆ)       GM
excellent         ≤ 0.05         ≤ 1.22
very good         ≤ 0.1          ≤ 1.49
good              ≤ 0.2          ≤ 2.23
moderate          ≤ 0.3          ≤ 3.33
low               ≤ 0.4          ≤ 4.95
poor              > 0.4          > 4.95
Table 24.2 Translation of the average standard error SE(ηˆ) on the logit scale to a geometric mean (GM) of the ratio between the upper and the lower bound of the confidence intervals for the individual risk scores on the odds scale.
24.5 Using Stata’s predict Command to Compute Risk Scores
After any regression command, Stata’s predict command allows us to compute the value of the risk score on various scales for all subjects in the data set (cf. Section 12.2). For example, we may analyse the data set shown in Table 24.1 in the following way:
. use riskscoreexample
. list in 1/5
1. 2. 3. 4. 5.
+-----------------------+ | id y x1 x2 x3 | |-----------------------| | 1 1 1 1 1 | | 2 1 1 1 0 | | 3 0 1 0 1 | | 4 0 1 0 1 | | 5 1 1 1 1 | +-----------------------+
. logit y x1 x2 x3

Iteration 0:   log likelihood = -280.42688
Iteration 1:   log likelihood = -272.09704
Iteration 2:   log likelihood = -272.09107
Iteration 3:   log likelihood = -272.09107

Logistic regression                               Number of obs   =        408
                                                  LR chi2(3)      =      16.67
                                                  Prob > chi2     =     0.0008
Log likelihood = -272.09107                       Pseudo R2       =     0.0297

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |    .727524   .2204529     3.30   0.001     .2954442    1.159604
          x2 |   .4333494   .2123535     2.04   0.041     .0171442    .8495547
          x3 |   .3033271   .2139756     1.42   0.156    -.1160574    .7227116
       _cons |  -.7578997   .2748286    -2.76   0.006    -1.296554   -.2192456
------------------------------------------------------------------------------

. predict p
(option pr assumed; Pr(y))
. list in 1/5
1. 2. 3. 4. 5.
+----------------------------------+ | id y x1 x2 x3 p | |----------------------------------| | 1 1 1 1 1 .6695833 | | 2 1 1 1 0 .5994019 | | 3 0 1 0 1 .5678173 | | 4 0 1 0 1 .5678173 | | 5 1 1 1 1 .6695833 | +----------------------------------+
The predict command has added the variable p to our data set, which includes the value of the risk score on the probability scale for each subject. If we want to judge the usefulness of this risk score, we can take a look at its distribution:
. histogram p, start(0.0) width(0.1) xlab(0(0.1)1.0) frequency (bin=7, start=0, width=.1)
and we observe that we have a very limited spread of the risk score. This is due the fact that we have only three covariates with rather small effects in this study. If we just want the risk scores ηˆ (x1 , x2 , x3 ) = βˆ0 + βˆ1 x1 + βˆ2 x2 + βˆ3 x3 on the logit scale, we can use the xb option of predict: . predict eta, xb . list in 1/5 +---------------------------------------------+ | id y x1 x2 x3 p eta | |---------------------------------------------| 1. | 1 1 1 1 1 .6695833 .7063008 | 2. | 2 1 1 1 0 .5994019 .4029737 |
3. | 3 0 1 0 1 .5678173 .2729513 | 4. | 4 0 1 0 1 .5678173 .2729513 | 5. | 5 1 1 1 1 .6695833 .7063008 | +---------------------------------------------+
We can obtain the standard errors for these values using the stdp option: . predict seeta, stdp . list in 1/5
1. 2. 3. 4. 5.
+--------------------------------------------------------+ | id y x1 x2 x3 p eta seeta | |--------------------------------------------------------| | 1 1 1 1 1 .6695833 .7063008 .1679462 | | 2 1 1 1 0 .5994019 .4029737 .2012551 | | 3 0 1 0 1 .5678173 .2729513 .1877256 | | 4 0 1 0 1 .5678173 .2729513 .1877256 | | 5 1 1 1 1 .6695833 .7063008 .1679462 | +--------------------------------------------------------+
If we want to compute confidence intervals for the risk score values on the probability scale, we have to transform the confidence interval for the risk score values on the logit scale: . gen low=1/(1+exp(- ( eta-1.96*seeta))) . gen up= 1/(1+exp(- ( eta+1.96*seeta))) . list y x1 x2 x3 p low up
1. 2. 3. 4. 5.
in 1/5
+---------------------------------------------------+ | y x1 x2 x3 p low up | |---------------------------------------------------| | 1 1 1 1 .6695833 .5931798 .7379761 | | 1 1 1 0 .5994019 .5021284 .6894253 | | 0 1 0 1 .5678173 .4762701 .6549554 | | 0 1 0 1 .5678173 .4762701 .6549554 | | 1 1 1 1 .6695833 .5931798 .7379761 | +---------------------------------------------------+
The overall precision of the risk score (on the logit scale) can be described by the average value of the estimated standard errors: . tabstat se, s(mean) variable | mean -------------+---------seeta | .2008677 ------------------------
According to the classification suggested in Table 24.2 we are close to a good precision for our risk scores. If we want to compute the risk score value for a single subject with a certain covariate pattern, for example, the values 1,1,1, it is convenient to obtain the score
value on the logit scale together with standard errors and confidence intervals using lincom:
. lincom _cons + x1 + x2 + x3

 ( 1)  [y]x1 + [y]x2 + [y]x3 + [y]_cons = 0

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   .7063008   .1679462     4.21   0.000     .3771322    1.035469
------------------------------------------------------------------------------
If we want to know this value on the probability scale, we can use . nlcom 1/(1+exp(-(_b[_cons] + _b[x1] + _b[x2] + _b[x3]))) _nl_1:
1/(1+exp(-(_b[_cons] + _b[x1] + _b[x2] + _b[x3])))
-----------------------------------------------------------------------------y | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------_nl_1 | .6695833 .0371567 18.02 0.000 .5967575 .742409 ------------------------------------------------------------------------------
Since computation of risk scores for a specific choice of covariate values is a common task, Stata supports this by the margins command. So you can obtain the same numbers by . margins, at(x1=1 x2=1 x3=1) predict(xb) Adjusted predictions Model VCE : OIM Expression at
Number of obs
=
408
: Linear prediction, predict(xb) : x1 = 1 x2 = 1 x3 = 1
-----------------------------------------------------------------------------| Delta-method | Margin Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------_cons | .7063008 .1679462 4.21 0.000 .3771322 1.035469 ------------------------------------------------------------------------------
or

. margins, at(x1=1 x2=1 x3=1)

Adjusted predictions                              Number of obs   =        408
Model VCE    : OIM

Expression   : Pr(y), predict()
at           : x1              =           1
               x2              =           1
               x3              =           1

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   .6695833   .0371567    18.02   0.000     .5967575     .742409
------------------------------------------------------------------------------
respectively.
     The predict command allows computing of the risk score values also for subjects in another sample than the original one, just by changing to a new data set. For example, we can obtain the numbers of Table 24.1 by

. use riskscoreexample, clear
. logit y x1 x2 x3

Iteration 0:   log likelihood = -280.42688
Iteration 1:   log likelihood = -272.09704
Iteration 2:   log likelihood = -272.09107
Iteration 3:   log likelihood = -272.09107

Logistic regression                               Number of obs   =        408
                                                  LR chi2(3)      =      16.67
                                                  Prob > chi2     =     0.0008
Log likelihood = -272.09107                       Pseudo R2       =     0.0297

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |    .727524   .2204529     3.30   0.001     .2954442    1.159604
          x2 |   .4333494   .2123535     2.04   0.041     .0171442    .8495547
          x3 |   .3033271   .2139756     1.42   0.156    -.1160574    .7227116
       _cons |  -.7578997   .2748286    -2.76   0.006    -1.296554   -.2192456
------------------------------------------------------------------------------

. clear
. input x1 x2 x3

  1. 1 1 1
  2. 1 1 0
  3. 1 0 1
  4. 1 0 0
  5. 0 1 1
  6. 0 1 0
  7. 0 0 1
  8. 0 0 0
  9. end

. predict eta, xb
. predict se, stdp
. list

     +-------------------------------------+
     | x1   x2   x3         eta         se |
     |-------------------------------------|
  1. |  1    1    1    .7063008   .1679462 |
  2. |  1    1    0    .4029737   .2012551 |
  3. |  1    0    1    .2729513   .1877256 |
  4. |  1    0    0   -.0303758   .2331476 |
  5. |  0    1    1   -.0212232    .211357 |
     |-------------------------------------|
  6. |  0    1    0   -.3245503   .2314261 |
  7. |  0    0    1   -.4545726   .2445921 |
  8. |  0    0    0   -.7578998   .2748286 |
     +-------------------------------------+
The predict command works in an analogous way also after regress and stcox. In the latter case, we need to make some additions to also obtain the value π̂t. We demonstrate this here using the breast data:

. clear
. use breast
. list in 1/5

     +---------------------------------------------------------+
     | id   age   grad   nodestat   tumors~e   survtime   died |
     |---------------------------------------------------------|
  1. |  1    67      1          0        1.5       5636      0 |
  2. | 15    37      1          0         .6       4076      0 |
  3. | 45    53      1          0        1.5       4635      0 |
  4. | 57    49      1          1        2.2       3463      0 |
  5. | 59    65      1          0        1.5       3503      0 |
     +---------------------------------------------------------+
Since it is easier to work with years than with days, we change the survival times to years:

. replace survtime=survtime/365.25
(617 real changes made)
In the previous analysis we have decided to handle lymph node status as a binary covariate:

. replace nodestat=nodestat==2
(321 real changes made)
Now we fit a Cox model:

. stset survtime, failure(died==1)

     failure event:  died == 1
obs. time interval:  (0, survtime]
 exit on or before:  failure

------------------------------------------------------------------------------
        617  total obs.
          0  exclusions
------------------------------------------------------------------------------
        617  obs. remaining, representing
        284  failures in single record/single failure data
   5104.397  total analysis time at risk, at risk from t =         0
                             earliest observed entry t =          0
                                  last observed exit t =   16.81314

. stcox age grad nodestat tumorsize, nohr

         failure _d:  died == 1
   analysis time _t:  survtime

Iteration 0:   log likelihood =  -1700.131
Iteration 1:   log likelihood = -1621.7125
Iteration 2:   log likelihood = -1611.6826
Iteration 3:   log likelihood = -1611.6367
Iteration 4:   log likelihood = -1611.6367
Refining estimates:
Iteration 0:   log likelihood = -1611.6367

Cox regression -- Breslow method for ties

No. of subjects =          617                    Number of obs   =        617
No. of failures =          284
Time at risk    =  5104.396991
                                                  LR chi2(4)      =     176.99
Log likelihood  =    -1611.6367                   Prob > chi2     =     0.0000

------------------------------------------------------------------------------
          _t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0388857   .0047472     8.19   0.000     .0295814      .04819
        grad |   .4033945    .091863     4.39   0.000     .2233463    .5834428
    nodestat |   1.065892   .1374865     7.75   0.000     .7964229     1.33536
   tumorsize |   .1128235   .0372052     3.03   0.002     .0399027    .1857443
------------------------------------------------------------------------------
We can use predict to obtain the risk scores:

. predict eta, xb
. list id-tumorsize eta in 1/5

     +--------------------------------------------------+
     | id   age   grad   nodestat   tumors~e        eta |
     |--------------------------------------------------|
  1. |  1    67      1          0        1.5   3.177972 |
  2. | 15    37      1          0         .6    1.90986 |
  3. | 45    53      1          0        1.5   2.633572 |
  4. | 57    49      1          0        2.2   2.557006 |
  5. | 59    65      1          0        1.5   3.100201 |
     +--------------------------------------------------+
To estimate the individual survival probabilities, we also need the baseline survival function. Stata provides us with an estimate for this via the basesurv option of predict. The baseline survival function is stored in a new variable with the given name, showing the value of the function at the time point given by the survival time of each subject.

. predict bs, basesurv
. list survtime bs in 1/10

     +---------------------+
     | survtime         bs |
     |---------------------|
  1. | 15.43053   .9778309 |
  2. | 11.15948   .9853304 |
  3. | 12.68994   .9829489 |
  4. | 9.481177   .9875943 |
  5. | 9.590692   .9873664 |
     |---------------------|
  6. | 3.271732   .9967416 |
  7. | 4.191649   .9951761 |
  8. | 10.22861   .9866531 |
  9. | 5.015743   .9935123 |
 10. |  7.45243   .9896646 |
     +---------------------+
Suppose now we want to estimate the probabilities πt for a 65-year-old subject with a tumour of grade 3 and of 8.5 cm size, and many affected lymph nodes. We can use lincom to compute the value of η̂(65, 3, 1, 8.5) for this subject:

. lincom age*65 + grad*3 + nodestat*1 + tumorsize*8.5

 ( 1)  65*age + 3*grad + nodestat + 8.5*tumorsize = 0

------------------------------------------------------------------------------
          _t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   5.762646   .4624879    12.46   0.000     4.856186    6.669105
------------------------------------------------------------------------------
or the margins command:

. margins, at(age=65 grad=3 nodestat=1 tumorsize=8.5) predict(xb)

Adjusted predictions                              Number of obs   =        617
Model VCE    : OIM

Expression   : Linear prediction, predict(xb)
at           : age             =          65
               grad            =           3
               nodestat        =           1
               tumorsize       =         8.5

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   5.762646   .4624879    12.46   0.000     4.856186    6.669105
------------------------------------------------------------------------------
In both cases we obtain the value 5.76 for η̂(65, 3, 1, 8.5). Now we can compute πt for all time points t in our data set as

. gen prob=bs^exp(5.76) if died==1
(333 missing values generated)
. list survtime prob in 1/10

     +---------------------+
     | survtime       prob |
     |---------------------|
  1. | 15.43053          . |
  2. | 11.15948          . |
  3. | 12.68994          . |
  4. | 9.481177          . |
  5. | 9.590692          . |
     |---------------------|
  6. | 3.271732   .3549642 |
  7. | 4.191649   .2155528 |
  8. | 10.22861          . |
  9. | 5.015743   .1267451 |
 10. |  7.45243          . |
     +---------------------+
Note that the values in the variable prob have no meaning with respect to the subjects in the data set. We only borrow the survival time of the subject, as πt(65, 3, 1, 8.5) changes its value at all observed survival times. Now we can plot the values as a survival function using Stata's line command, requesting a step function via the connect(stairstep) option:

. sort survtime
. line prob survtime if died==1, co(stairstep)
[Figure: step-function plot of the variable prob against survtime, that is, the estimated survival curve for the covariate pattern (65, 3, 1, 8.5).]
Fortunately, Stata has summarised this step in a single command called stcurve:

. stcurve, survival at(age=65 grad=3 nodestat=1 tumorsize=8.5)
[Figure: stcurve output, "Cox proportional hazards regression": estimated survival curve plotted against analysis time.]
You can actually plot curves for several individuals using stcurve. Type help stcurve for more details.
     To assess the precision of the risk scores, we first center the two continuous covariates. egen allows computing a new variable with the mean of a given variable, and then we just have to subtract this mean:

. egen mage=mean(age)
. egen mtumorsize=mean(tumorsize)
. gen cage=age-mage
. gen ctumorsize=tumorsize-mtumorsize
With respect to the categorical variables, we take a look at their distribution:

. tab grad

       grad |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        126       20.42       20.42
          2 |        272       44.08       64.51
          3 |        219       35.49      100.00
------------+-----------------------------------
      Total |        617      100.00

. tab nodestat

   nodestat |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        499       80.88       80.88
          1 |        118       19.12      100.00
------------+-----------------------------------
      Total |        617      100.00
nodestat is already 0 in its most frequent category. The most frequent category of grad is 2, and as we use this covariate as a continuous one, we just subtract the value 2, such that all subjects in the most frequent category have the value 0.

. gen cgrad=grad-2
Now we can fit the model and compute the standard error of the risk score value for each individual:

. stcox cage cgrad nodestat ctumorsize, nohr

         failure _d:  died == 1
   analysis time _t:  survtime

Iteration 0:   log likelihood =  -1700.131
Iteration 1:   log likelihood = -1621.7125
Iteration 2:   log likelihood = -1611.6826
Iteration 3:   log likelihood = -1611.6367
Iteration 4:   log likelihood = -1611.6367
Refining estimates:
Iteration 0:   log likelihood = -1611.6367

Cox regression -- Breslow method for ties

No. of subjects =          617                    Number of obs   =        617
No. of failures =          284
Time at risk    =  5104.396991
                                                  LR chi2(4)      =     176.99
Log likelihood  =    -1611.6367                   Prob > chi2     =     0.0000

------------------------------------------------------------------------------
          _t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        cage |   .0388857   .0047472     8.19   0.000     .0295814      .04819
       cgrad |   .4033945    .091863     4.39   0.000     .2233463    .5834428
    nodestat |   1.065892   .1374865     7.75   0.000     .7964229     1.33536
  ctumorsize |   .1128235   .0372052     3.03   0.002     .0399027    .1857443
------------------------------------------------------------------------------

. predict se, stdp
We can take a look at the mean value

. tabstat se, s(mean)

    variable |      mean
-------------+----------
          se |  .1097745
------------------------
and according to our suggestion we have to regard this as close to a very good precision. To compute the estimated 3-year survival probabilities for all subjects, we first have to find the value of the baseline survival function at 3 years:

. list survtime bs if 2.9

              Classified + if predicted Pr(D) >= .294
True D defined as y != 0
--------------------------------------------------
Sensitivity                     Pr( +| D)   70.34%
Specificity                     Pr( -|~D)   46.90%
Positive predictive value       Pr( D| +)   35.62%
Negative predictive value       Pr(~D| -)   79.10%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)   53.10%
False - rate for true D         Pr( -| D)   29.66%
False + rate for classified +   Pr(~D| +)   64.38%
False - rate for classified -   Pr( D| -)   20.90%
--------------------------------------------------
Correctly classified                        53.81%
--------------------------------------------------
Note that estat class also computes the positive and negative predictive values, assuming that the prevalence in the current sample is identical to the prevalence in future subjects. lsens views sensitivity and specificity as a function of the cut point πc and is, hence, most useful in determining the final cut point:
. lsens

[Figure: lsens output, showing sensitivity and specificity as functions of the probability cutoff.]
Finally, lroc draws the ROC curve:

. lroc

Logistic model for y

number of observations =      801
area under ROC curve   =   0.6214

[Figure: ROC curve, sensitivity plotted against 1 − specificity; area under ROC curve = 0.6214.]
Note that lroc plots sensitivity versus 1-specificity. Unfortunately, there is no option to add the values of the cut points to this graph.
The ROC curve illustrates that in this example we cannot come to a good prediction of y based on x1, x2, and x3.
     In the case of a survival outcome, we can compute the probabilities π̂t(x1*, x2*, ..., xp*) as described in Section 24.5, and they can be transformed into predictions Ŷ or Ŷt by inspection of these values.

25.6 The Overall Precision of a Predictor
To describe the overall precision of a predictor, we may define an average standard error of the predictor, similar to the way we have defined the average standard error of a risk score. However, we can apply this only in the case of a continuous outcome, in which we have the possibility to compute the standard error of a single prediction. In the case of a binary outcome, we may be tempted to report the sensitivity and specificity we have observed for the cutpoint finally chosen. However, we will see in the next chapter that this typically gives a too-optimistic result.
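For a continuous outcome, such an average standard error is easy to obtain. The following is only a minimal sketch with hypothetical variable names (outcome y, covariates x1 and x2); one natural choice here is predict's stdf option, which gives the standard error of an individual prediction after regress:

regress y x1 x2
predict sef, stdf        // standard error of a single prediction (hypothetical variable sef)
tabstat sef, s(mean)     // average standard error over the sample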
Chapter 26
Evaluating the Predictive Performance
26.1 The Predictive Performance of an Existing Predictor
Whenever a predictor has been constructed, it is a straightforward question how good the predictor is, that is, how close Ŷ will be to Y* in new subjects. We have already seen in the last chapter how we can describe this at the level of a single subject. However, to judge the predictive performance in general, we are interested in some type of average performance, reflecting the intended use in many subjects.
     We start by considering the situation that a predictor has been published, that is, we are able to compute predictions Ŷ for any new subject in dependence on its covariate values X1*, X2*, ..., Xp*. Typically this means that we have access to the regression parameters of the fitted model and—if necessary—to some cutpoint. Now the aim is to describe the performance of this predictor in our current sample. Of course, we can just apply the predictor to all subjects in our sample such that we obtain a prediction Ŷi for each subject. It just remains to describe how close Ŷi is to Yi on average.
     In the case of a continuous outcome, we can just look at the distribution of the differences di = Ŷi − Yi and visualise this distribution or describe it by some characteristics. Most useful are upper and lower percentiles, for example the 10% percentile and the 90% percentile, such that we know that for 80% of the subjects the differences between the prediction and the true value are within this range. It is also a widespread tradition to compute the average squared distance or its root, that is,

    (1/n) Σ_{i=1}^{n} (Ŷi − Yi)²    or    √[ (1/n) Σ_{i=1}^{n} (Ŷi − Yi)² ].

These numbers are often referred to as the (average) quadratic prediction error or the root mean squared error of prediction.
     In the binary case it is most useful to look at sensitivity and specificity, that is, at estimates of the probability of the two types of errors we can make (cf. Section 25.3). We can estimate the sensitivity as the fraction of subjects with Ŷi = 1 among all subjects with Yi = 1, and the specificity as the fraction of subjects with Ŷi = 0 among all subjects with Yi = 0. It is also common to compute the overall error rate, that is, the relative frequency of Ŷi ≠ Yi. However, this should be restricted to the case in which it can be justified that misclassifications are of equal importance in both directions.
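These summaries are simple to compute. The following minimal sketch uses hypothetical variable names: for a continuous outcome, y is the observed value and yhat the published prediction; for a binary outcome, ybin is the published binary prediction:

* continuous outcome: distribution of the differences and RMSE of prediction
gen d = yhat - y
centile d, centile(10 90)
gen d2 = d^2
summarize d2
display "root mean squared error of prediction: " sqrt(r(mean))
* binary outcome: sensitivity and specificity from a simple cross tabulation
tab y ybin, row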
In the case of a survival outcome and a predictor Ŷ of the true survival time, we can—as in the continuous case—investigate the distribution of the differences Ŷi − Yi. The only additional difficulty is that typically we have censored observations, such that only a lower bound of Yi is available. However, if we just change the differences to di′ = Yi − Ŷi, the lower bound for Yi translates to a lower bound for di′, and we can use the Kaplan-Meier estimate to estimate the distribution function of these differences and in this way obtain at least some percentiles of the distribution. (It might be necessary to add some constant to all differences to convince a statistical package to perform the Kaplan-Meier estimation on these differences, as most packages will not accept negative values in the time variable.)
     If using predictions Ŷt for the survival status at a certain time point t, we can consider sensitivity and specificity or the overall error rate. However, we run into the problem that the true survival status Yt is not known for subjects censored prior to t, such that we cannot compute these quantities directly. This problem can be solved in the following manner: Since we know Ŷt for all subjects, we are able to estimate the positive and negative predictive value in the current population. We just have to apply the Kaplan-Meier estimator to the outcome Y within the subjects with Ŷi = 1 and can find the positive predictive value ppv as the estimated survival probability at time point t, and to apply the Kaplan-Meier estimator to the outcome Y within the subjects with Ŷi = 0 and can find the negative predictive value npv as 1 minus the estimated survival probability at time point t. Then we need only the relative frequency π̂ of Ŷt = 1, and we can obtain estimates of sensitivity and specificity as

    ŝe = ppv·π̂ / [ppv·π̂ + (1 − npv)(1 − π̂)]

and

    ŝp = npv·(1 − π̂) / [(1 − ppv)·π̂ + npv·(1 − π̂)].
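As a purely hypothetical illustration of these formulas (the numbers are invented and not taken from any data set in this book): suppose the Kaplan-Meier estimates gave ppv = 0.93 and npv = 0.35, and the relative frequency of subjects with Ŷt = 1 was π̂ = 0.80. The estimates can then be computed directly:

local ppv = 0.93                  // hypothetical positive predictive value
local npv = 0.35                  // hypothetical negative predictive value
local pi  = 0.80                  // hypothetical relative frequency of Yhat_t = 1
display "sensitivity: " `ppv'*`pi'/(`ppv'*`pi' + (1-`npv')*(1-`pi'))
display "specificity: " `npv'*(1-`pi')/((1-`ppv')*`pi' + `npv'*(1-`pi'))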
A further discussion of this approach can be found in Antolini and Valsecchi (2012).

Remark: It is not unlikely that the performance of a predictor varies in different subgroups. A predictor may work well in young subjects, but poorly in old subjects. We may have a good sensitivity and specificity to diagnose myocardial ischemia by myocardial perfusion imaging techniques, but the method may break down in subjects who have already experienced a myocardial infarction. Hence, it is usually wise to analyse the predictive performance also in important subgroups to obtain a complete assessment. Any of the techniques mentioned above can just be applied in each subgroup.

Remark: The techniques described above refer to the situation in which we use the prediction directly. In some applications, the emphasis will be on the prediction interval, for example, if we want to predict the bilirubin level in newborn children. In such a situation we have, of course, to evaluate the performance of the prediction interval, for example, by checking whether indeed 95% of the values of Yi are in the prediction intervals.

Remark: In publications suggesting a predictor, we can often find an assessment of the predictive performance based on the data set that was also used to construct the predictor. Such an assessment is typically too optimistic (cf. the next section). Hence, the assessment of the predictive performance in a new, independent data set is often referred to as a validation, as the original assessment is validated. However, this type of
validation does not directly address the question whether the prediction or risk score is biased or not. This requires other techniques.

Remark: Sensitivity and specificity for a given cutpoint are the basic ingredients to judge the performance of a binary predictor or a predictor in a survival context. There are many techniques to combine these (time dependent) values into further numbers. An overview can be found in Steyerberg et al. (2010). Some further fundamental aspects are summarised in Altman et al. (2009).

26.2 How to Assess the Predictive Performance of an Existing Predictor in Stata
Partin et al. (2001) published a table to support the prediction of the pathological stage in localised prostate cancer in dependence on three prognostic factors: the TNM stage, the PSA level using a monoclonal assay, and the Gleason score. We focus here on the pathological stage Organ-Confined Disease. Table 26.1 includes the numbers presented by Partin et al. (2001), that is, the estimated probability to have an organ-confined disease.
     The data set partinassess includes data of an (artificial) study in patients with localised prostate cancer. We use this data in the following to assess the predictive performance of the Partin table. The variable ocd indicates whether the patient's pathological stage is Organ-Confined Disease or not, and the variable partinpr includes the risk score of the patient (expressed on the probability scale as a percentage) according to the Partin table. We start as usual with looking at the data set.

. use partinassess, clear
. list in 1/5

     +-------------------------------------------------+
     | id       psa   gleason   stage   partinpr   ocd |
     |-------------------------------------------------|
  1. |  1     0-2.5      8-10     T2b         37     1 |
  2. |  2     0-2.5       2-4     T2b         88     1 |
  3. |  3     0-2.5     3+4=7     T2c         51     0 |
  4. |  4   2.6-4.0       5-6     T2a         71     1 |
  5. |  5   2.6-4.0       2-4     T1c         92     1 |
     +-------------------------------------------------+
The Partin score is a risk score. So if we want to assess the predictive performance, we first have to change the risk score values into a (binary) predictor for the status Organ-Confined Disease. We start with using a cut point of 0.5:

. gen ocdpred=partinpr>50

Now we can make a cross tabulation to inspect how well this coincides with the true status given by the variable ocd:

. tab ocd ocdpred, row
                        PSA = 0–2.5
     Gleason      T1c      T2a      T2b      T2c
     2–4           95       91       88       86
     5–6           90       81       75       73
     3+4=7         79       64       54       51
     4+3=7         71       53       43       39
     8–10          66       47       37       34

                        PSA = 2.6–4.0
     Gleason      T1c      T2a      T2b      T2c
     2–4           92       85       80       78
     5–6           84       71       63       61
     3+4=7         68       50       41       38
     4+3=7         58       39       30       27
     8–10          52       33       25       23

                        PSA = 4.1–6.0
     Gleason      T1c      T2a      T2b      T2c
     2–4           87       76       69       67
     5–6           75       58       49       46
     3+4=7         54       35       26       24
     4+3=7         43       25       19       16
     8–10          37       21       15       13

                        PSA = 6.1–10.0
     Gleason      T1c      T2a      T2b      T2c
     2–4           80       65       57       54
     5–6           62       42       33       30
     3+4=7         37       20       14       11
     4+3=7         27       14        9        7
     8–10          22       11        7        6

Table 26.1 Risk score values (on the probability scale) for organ-confined disease according to Partin et al. (2001). Reproduced from Partin, AW and Mangold, LA and Lamm, DM and Walsh, PC and Epstein, JI and Pearson, JD (2001). "Contemporary update of prostate cancer staging nomograms (Partin Tables) for the new millennium", Urology 59, 843–848, by kind permission of Elsevier.
+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

           |        ocdpred
       ocd |         0          1 |     Total
-----------+----------------------+----------
         0 |        82         69 |       151
           |     54.30      45.70 |    100.00
-----------+----------------------+----------
         1 |        41        220 |       261
           |     15.71      84.29 |    100.00
-----------+----------------------+----------
     Total |       123        289 |       412
           |     29.85      70.15 |    100.00
We can see that we have a sensitivity of 84.3% and a specificity of 54.3%. We can now redo this analysis with various cut points and combine the results into an ROC curve. This is exactly what the roctab command does:
. roctab ocd partinpr, graph

[Figure: ROC curve, sensitivity plotted against 1 − specificity; area under ROC curve = 0.7914.]
We can observe a rather limited diagnostic value. For example, there is no cut point allowing both a sensitivity and a specificity above 75%. One explanation for this is that the Partin score assigns to many subjects a probability between 25% and 75%, indicating that we are not very confident about the true status of the patient with respect to the spread of the disease.
26.3 Estimating the Predictive Performance of a New Predictor
The assessment of the predictive performance described above has been done in a new sample, which was independent of the sample that had been used to construct the predictor. It might be tempting to perform the above assessment directly on the original sample, but this is a dangerous idea. Using the data twice, for the construction of the predictor and for the assessment of its performance, usually results in too-optimistic conclusions. This is due to the fact that in fitting a model to a data set we just try to decrease the above measures as much as possible.
     If we want to construct a predictor and to assess its predictive performance in
one and the same data set, we can split the data set into a so-called training data set, which we use to construct the predictor, and into a validation or test set, which we use to assess the predictive performance. This is actually done in many areas of application, and it is the cleanest way. However, in medical research, data collection is often rather expensive, and we would like to use all available data for the construction to obtain the best predictor.
     Hence, it is very popular to use all data for the construction and to use a technique called cross-validation to assess the predictive performance. The basic idea of cross-validation is to split the data several times into a training set and a validation set and to average the results. As most measures of predictive performance can be computed as a sum over the single subjects, and as we wish the results in the training set to be as close as possible to the results in the whole data set, a very popular choice is leave-one-out cross-validation, that is, to use each single subject once as a validation data set and all remaining subjects as the corresponding training data set. So this means we just compute for each subject a prediction Ŷ(i) by fitting a regression model to all subjects except subject i. This way we have not used the data of subject i to compute Ŷ(i). So we can now compare the values Ŷ(i) and Yi with any of the methods mentioned in the last section, and we obtain an assessment that does not suffer from overoptimism due to using the data twice.

Remark: To be more precise, cross-validation estimates the predictive performance we have to "expect" from a predictor that is fitted by the same method as our predictor. All the predictors we construct during cross-validation are slightly different, so what we obtain is only some average predictive performance. However, as long as all these predictors have a similar performance, it does not matter that we take an average, and we can be pretty sure that the average is close to the performance of our specific predictor. However, if we fit very complex models, for example, with many interactions, to small data sets, the performance may vary substantially, and hence we can be less confident that the average performance measured by cross-validation agrees with the one of our particular predictor.

Remark: It is known that cross-validation is not necessarily the most efficient method to assess the predictive performance. Other resampling methods, for example, based on the bootstrap, may be more efficient. However, cross-validation is still a very popular method and seems to be sufficient for most practical purposes.

Remark: In applying cross-validation, it is essential to perform the complete construction process. This often implies more than just fitting the regression model. For example, in using logistic regression to construct a predictor, the process also includes the choice of the cutpoints. As a consequence, we can apply cross-validation only if the construction process can be described by a precise algorithm. So it is not possible to use cross-validation if a cutpoint was just chosen by looking in an informal way at one or the other graph. The process of choosing the cutpoint has to be specified such that we can repeat it. For example, we may define a priori the rule to choose the maximal cutpoint with a sensitivity of at least 80%. If the predictor is the result of trying many different mod-
els, then the use of cross-validation does not solve the problem that the assessment of the predictive performance might be too optimistic.

26.4 How to Assess the Predictive Performance via Cross-Validation in Stata
We illustrate the use of cross-validation using the data set partinassess, which we have used above in Section 26.2 to assess the predictive performance of the Partin tables. We now use this data set to construct a new risk score to describe the probability of organ-confined disease. As we have a reasonable number of observations, we decide to use all three covariates as categorical ones.

. use partinassess
. logit ocd i.psa i.gleason i.stage

Iteration 0:   log likelihood = -270.71254
Iteration 1:   log likelihood = -209.81877
Iteration 2:   log likelihood = -208.25991
Iteration 3:   log likelihood =  -208.2582
Iteration 4:   log likelihood =  -208.2582

Logistic regression                               Number of obs   =        412
                                                  LR chi2(10)     =     124.91
                                                  Prob > chi2     =     0.0000
Log likelihood =  -208.2582                       Pseudo R2       =     0.2307

------------------------------------------------------------------------------
         ocd |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         psa |
          2  |  -.1895232   .3223004    -0.59   0.557    -.8212204    .4421739
          3  |  -1.192635   .3282145    -3.63   0.000    -1.835924   -.5493465
          4  |  -1.945909   .3640727    -5.34   0.000    -2.659479    -1.23234
             |
     gleason |
          2  |  -.7932126   .3344301    -2.37   0.018    -1.448684   -.1377415
          3  |  -1.696966   .3828736    -4.43   0.000    -2.447384   -.9465472
          4  |  -1.226046   .3794045    -3.23   0.001    -1.969665   -.4824264
          5  |  -1.728524   .4244207    -4.07   0.000    -2.560373   -.8966744
             |
       stage |
          2  |  -1.274228   .3155943    -4.04   0.000    -1.892782   -.6556749
          3  |  -1.563822   .3484053    -4.49   0.000    -2.246684   -.8809602
          4  |  -1.860817   .3532933    -5.27   0.000    -2.553259   -1.168375
             |
       _cons |   3.125293   .3730015     8.38   0.000     2.394224    3.856362
------------------------------------------------------------------------------

. lroc

Logistic model for ocd

number of observations =      412
area under ROC curve   =   0.8120
[Figure: ROC curve from lroc, sensitivity plotted against 1 − specificity; area under ROC curve = 0.8120.]
The ROC curve looks very similar to that of the Partin score created in Section 26.2, with a slightly larger area under the curve. However, this figure might be too optimistic, as we have used the data twice, both for the construction and for the assessment. To investigate this further, we take a look at the specific case of a cutpoint of 0.5:

. predict p
(option pr assumed; Pr(ocd))
. gen ocdhat=p>0.5
. tab ocd ocdhat, row

+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

           |         ocdhat
       ocd |         0          1 |     Total
-----------+----------------------+----------
         0 |        91         60 |       151
           |     60.26      39.74 |    100.00
-----------+----------------------+----------
         1 |        34        227 |       261
           |     13.03      86.97 |    100.00
-----------+----------------------+----------
     Total |       125        287 |       412
           |     30.34      69.66 |    100.00
. drop p
Compared to the analysis of the Partin score in Section 26.2, both sensitivity and specificity look better, but without a cross-validation we cannot trust this result.
To perform a leave-one-out cross-validation, we have to loop over all observations. We hence start with determining the number of observations. This number is stored internally in the variable _N, and we assign this value to the local macro N. (A macro is just the name for a piece of text that we can refer to later by `name'.)

. local N=_N
A further preparation is to define a variable yhat filled with missing values.

. gen yhat=.
(412 missing values generated)
Now we use forvalues to loop over all observations. In each step of the loop, we have a local macro i with the number of the step, such that i goes through all observations. In each step, we fit a logistic regression model without using the ith subject (if _n!=`i'), compute the probability of Y = 1 for each subject (predict p), and replace yhat for subject i (if _n==`i') with the prediction we obtain by comparing p with 0.5.

. forvalues i=1/`N' {
  2.   qui logit ocd i.psa i.gleason i.stage if _n!=`i'
  3.   qui predict p
  4.   qui replace yhat=p>0.5 if _n==`i'
  5.   drop p
  6. }
(The qui prefix suppresses any output on the screen.) Now we can compare yhat with the true values ocd:

. tab ocd yhat, row

+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

           |          yhat
       ocd |         0          1 |     Total
-----------+----------------------+----------
         0 |        84         67 |       151
           |     55.63      44.37 |    100.00
-----------+----------------------+----------
         1 |        40        221 |       261
           |     15.33      84.67 |    100.00
-----------+----------------------+----------
     Total |       124        288 |       412
           |     30.10      69.90 |    100.00
and obtain a cross-validated sensitivity of 84.7% and a specificity of 55.6%, which are smaller than those computed above. And we are close to the sensitivity and the specificity we have observed for the original Partin score in Section 26.2.
26.5 Exercise Assessing the Predictive Performance of a Prognostic Score in Breast Cancer Patients
In this exercise we use again the breast data set. We assume that previously a publication has reported the analysis of a similar data set by a Cox model, resulting in the regression parameters

     age                  0.042
     tumour grading       0.530
     tumour size          1.270
     lymph node status    0.071

and that this publication reported a 5-year baseline survival probability of 0.9953.
(a) Assess the predictive performance of predictions of the 5-year survival status based on this risk score using the breast data set. There is one patient in this data set who is censored prior to reaching a follow-up time of 5 years. We suggest that you just ignore this patient. Otherwise, you have to follow the approach outlined above to compute the predictive values using the Kaplan-Meier method and to insert the results in the formula for sensitivity and specificity. Note that with sts list, by(...) you can see the numerical values of the Kaplan-Meier curve.
(b) Assess the predictive performance of the risk score constructed using the breast data set of Exercise 24.7. If you want to use cross-validation, you can use the following piece of code to assess the baseline survival function at 5 years and to save it in the local macro bs5:

preserve
qui keep if survtime

The p > n situation occurs typically when working with high-throughput methods to measure genetic or molecular markers.
     The basic difficulty with all approaches to construct parsimonious predictors is that they try to minimise the (average) prediction error. This is useful as long as we are purely interested in constructing predictions that work well on average. But with respect to many other purposes, this principle does not imply good properties. In particular, methods following this principle
• often do not provide unbiased estimates of regression coefficients,
• often do not provide valid confidence intervals and valid p-values,
• often do not provide unbiased estimates of individual risk scores, and
• in particular, underestimate the risk for subjects with high risk and overestimate the risk for subjects with low risk.
Some of the methods popular in this area try to avoid some of these difficulties, but none of them can circumvent all. Hence, when using these methods, you have to be aware of these deficiencies, and many of the properties of regression models described in this book no longer hold.
     There are many areas in which the construction of parsimonious predictors is the main task, or—especially in the p > n situation—is a very reasonable starting point. However, it is very difficult to make recommendations about particular methods, as this is still an area of active research. I would like to point to the books of Royston and Sauerbrei (2008), Hastie et al. (2009), and Steyerberg (2009) to get an overview about the current status and trends. The book by Hand (1997) still provides a very nice introduction to this area.
Part IV
Miscellaneous
Chapter 28
Alternatives to Regression Modelling
Regression models are only one way to analyse the influence of a set of covariates X1 , X2 , . . . , X p on an outcome Y . In this chapter we present some alternatives, which may be useful in particular situations.
28.1 Stratification
In Section 4.1 we used a stratified analysis to introduce the basic idea of adjusted effects: If we are interested in the effect of a covariate X1, and we are afraid that a binary (or categorical) covariate X2 may act as a confounder, we may stratify the analysis by the values of X2. As we now analyse the effect of X1 within each stratum defined by a single value of X2, we can rule out that the effect of X1 we observe in each stratum can be explained by any association between X1 and X2, as there cannot be such an association—X2 is constant.
     We argued in Section 4.1 that, in general, it is not a good idea to perform such an analysis as the main analysis of a study, because we get two effect estimates. A regression analysis providing one overall effect estimate is more useful, in particular if we are interested in demonstrating the existence of an effect. The latter is basically an argument about power: The confidence interval of the overall effect is substantially narrower than the confidence intervals in each stratum, as we have in the overall analysis a larger sample size than in the single strata. Similarly, the test of the null hypothesis of "no effect" using all data is typically much more powerful than the corresponding tests in each stratum. However, if a study is large enough, we may have sufficient power in each stratum, and a corresponding stratified analysis may be a feasible and interesting alternative.
     As an example, let us take a look at a reanalysis of the breast cancer data set introduced in Exercise 10.5. If we are interested in demonstrating that the lymph node status has an effect on the survival of the patients that we cannot explain by confounding with the effect of grading, we can take a look at the effect of the lymph node status on survival within the three strata corresponding to the three levels of grading. And as within each stratum we are only interested in the effect of one covariate, we can simply use a graphical approach to visualise the effect, that is, in this case Kaplan-Meier plots as shown in Figure 28.1. Such a stratified analysis may be regarded by a reader of a paper as more convincing than an estimate for the effect of lymph node status and a corresponding p-value from a Cox model with grading as additional covariate. There can be several
[Figure 28.1: Kaplan-Meier plots of survival (x-axis: years) by lymph node status within grading strata; panel label visible: "grading 3".]
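Kaplan-Meier plots of this kind can be produced directly in Stata; a minimal sketch, assuming the breast data have been stset as shown earlier and nodestat is the binary lymph node status:

sts graph if grad==1, by(nodestat)
sts graph if grad==2, by(nodestat)
sts graph if grad==3, by(nodestat)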
[Figure 28.2: classification and regression tree (see the caption below).]
Figure 28.2 A classification and regression tree obtained from the breast data set. For each node the defining condition, the number of patients in this node and the 10-year survival probability based on a Kaplan-Meier estimate are given. “ln” abbreviates “number of lymph nodes.” “size” abbreviates “tumour size” measured in cm.
adequateness of using a regression model in the analysis, in particular as such trees are useful to pick up patient groups which behave completely differently from all other patients. On the other hand, we have to be aware that classification and regression trees are not a substitute for a formal analysis using regression models. It is not possible to conclude that any split shown in the tree reflects a "significant" property, as we have systematically searched for the smallest p-value. Classification and regression trees also tend to overlook covariates with small, but relevant, effects, as in the lower nodes of a tree the groups are typically too small to allow the detection of further effects. For example, the fact that tumour size appears only once in our tree—and actually only for the largest patient group at the third level—probably reflects a lack of power in the remaining groups, and not an absence of the effect. Moreover, the trees obtained are often very unstable in the sense that the second "most important" split often has a p-value close to that of the "most important" split, and if we select the second "most important" split, the structure of the tree may change substantially. So it is important to always keep in mind that a classification and regression tree provides mainly a "snapshot" of the data, but cannot tell the full story.
THIS CHAPTER IN A NUTSHELL
In particular situations, alternative approaches can be useful. If the power is sufficient, stratified analyses may be more informative than a regression analysis. If the selection of one variable as outcome would be artificial, measures of association with appropriate adjustment for confounding can be useful. Propensity score based analyses can be useful if one binary covariate plays a distinguished role. Classification and regression trees provide a simple tool to explore a data set from a regression perspective.
Chapter 29
Specific Regression Models
In this chapter we present some specific regression models that are sometimes used in medical research.
29.1 Probit Regression for Binary Outcomes
In some areas of medical research influenced by the social sciences, probit regression is a common method for binary outcome data. The idea behind probit regression is that a linear combination β0 + β1x1 + β2x2 + ... + βpxp of the covariate values competes with a latent variable L. If β0 + β1x1 + β2x2 + ... + βpxp is larger than L, then Y is 1, and otherwise it is 0. The interpretation of L will vary from application to application. If we consider risk factors for a disease, L may be interpreted as the individual susceptibility to the risk factors. If we assume a standard normal distribution for L, the probability of a subject with covariate pattern x1, x2, ..., xp to show Y = 1 is

    P(Y = 1 | x1, x2, ..., xp) = P(L < β0 + β1x1 + β2x2 + ... + βpxp)
                               = Φ(β0 + β1x1 + β2x2 + ... + βpxp)

with Φ denoting the distribution function of the standard normal distribution, and hence

    Φ⁻¹(P(Y = 1 | x1, x2, ..., xp)) = β0 + β1x1 + β2x2 + ... + βpxp.

So the probit model is very similar to a logistic model, except that the logit function is replaced by the so-called probit function probit(p) = Φ⁻¹(p). In Figure 29.1, we can observe that the logit function and the probit function are similar, but that the logit function is somewhat steeper. Hence, effect estimates from probit regression tend to be smaller than effect estimates from logit regression, but otherwise probit regression and logistic regression give very similar results with respect to the order of magnitude of the effect estimates.

Remark: Logistic regression can also be motivated by a latent variable model. We just have to change the distributional assumption for L: Instead of a normal distribution, a so-called logistic distribution is assumed.

Remark: There exists also a connection between logistic regression and the normal
Figure 29.1 The probit function (dashed line) and the logit function (solid line).
distribution. If we assume that the covariate values in the two groups defined by Y = 0 and Y = 1 follow a multivariate normal distribution with different mean values, but identical variance/covariance structure, then this implies that the distribution of Y given X1, X2, ..., Xp follows a logistic regression model.
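A minimal sketch of fitting both models in Stata for comparison, assuming a hypothetical binary outcome y and covariates x1 and x2:

probit y x1 x2
estimates store probitfit
logit y x1 x2
estimates store logitfit
estimates table probitfit logitfit, se    // probit coefficients tend to be somewhat smaller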
29.2 Generalised Linear Models
A generalised linear model is not a new, specific regression model, but the name of a wide class of models covering many different regression models. They are characterised by a relation between μ(x1, x2, ..., xp), denoting the expectation of the outcome Y given the covariate values x1, x2, ..., xp, and a linear predictor β0 + β1x1 + β2x2 + ... + βpxp via a link function g, such that

    g(μ(x1, x2, ..., xp)) = β0 + β1x1 + β2x2 + ... + βpxp,

and a relation between the expected value μ(x1, x2, ..., xp) of Y given the covariates x1, x2, ..., xp and the variance V(x1, x2, ..., xp) of Y given x1, x2, ..., xp, which is of the type

    V(x1, x2, ..., xp) = φ v(μ(x1, x2, ..., xp))

with φ denoting an unknown (or known) scale parameter and v(μ) a function relating the mean to the variance.
     These specifications are actually sufficient to obtain valid estimates of the regression parameters using an estimation technique called quasi likelihood, and the validity of these estimates requires (at least in large samples) only that the mean structure is correctly specified; the variance structure may be incorrectly specified. And using robust standard errors, valid confidence intervals and p-values can also be obtained under the same conditions. So the assumption about the variance structure is only a working assumption: If it is true, we know that the resulting estimates are also optimal, but if it is wrong, it does not invalidate our estimates; they may be only
(typically slightly) inefficient. This effect of the choice of the variance function can be explained in the way that it roughly determines the weight of each observation in the analysis: Observations with a high (modelled) variance get a lower weight than observations with a low (modelled) variance. Using an incorrect variance function hence means only a suboptimal weighting of the observations.
     Both the classical regression model and the logistic regression model are special cases in the class of generalised linear models. In the classical regression model, the link function is the identity function, and the variance function evaluates just to the constant 1. In the logistic regression model, the link function is just the logit function, and the variance function is v(μ) = μ(1 − μ) with the scale parameter set to 1. Probit regression is also covered by this class, just by changing the logit link to a probit link. In all these specific cases, the quasi likelihood principle reduces to the ordinary maximum likelihood principle, hence we obtain estimates identical to the usual ones.
     The general framework provided by generalised linear models is mainly of theoretical interest, providing a general theory for a large class of models. It is also beneficial for computations, as it implies a general, common algorithmic approach known as iterative weighted least squares. But sometimes it is also useful in practice, if we have an idea about a specific nonstandard variance function. A typical example is the analysis of percentages arising in image analysis, for example, the percentage of an image with a certain colour or intensity. As these percentages are bounded by 0 and 100, they often exhibit a variance structure close to that of relative frequencies: The variance is highest around 50% and becomes close to 0 if the percentages approach 0% or 100%. Then a reasonable variance function is μ(1 − μ) as in logistic regression, but now allowing a scale parameter. And this can then be combined with a logit or identity link for the mean structure. Other quantities restricted in their values to a certain range, like VAS scores, often exhibit a similar variance function. A further example with another variance function is given in the next section.
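As an illustration of this idea, the following minimal sketch (with hypothetical variable names) fits such a model for a percentage outcome pct, rescaled to the interval from 0 to 1, with covariates x1 and x2; the logit link is combined with the variance function μ(1 − μ), the scale parameter is estimated from the Pearson statistic rather than fixed at 1, and robust standard errors are requested:

* pct, x1, x2 are hypothetical variable names; pct must lie between 0 and 1
glm pct x1 x2, family(binomial) link(logit) scale(x2) vce(robust)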
29.3 Regression Models for Count Data
The number of mutated cells in an experimental study on the mutagenic effect of a substance, the number of emetic episodes during the chemotherapy of a cancer patient, or the number of children a woman has born are typical examples of count data appearing as the outcome variable Y in a regression analysis. In some cases, count outcome data can be handled like continuous outcomes using the classical regression model, but in particular if many counts are 0, this is often not very adequate. It may imply that μ(x1, x2, ..., xp) becomes negative for some values of the covariates, that is, we predict a negative number for the expected count, which is not very convincing (cf. Section 5.1).
     To ensure that a model always predicts a positive count, the expectation can be modelled on the log scale; that is, we consider the modelling

    log μ(x1, x2, ..., xp) = β0 + β1x1 + β2x2 + ... + βpxp

with μ(x1, x2, ..., xp) still denoting the expected value of Y in a subject with covariate pattern x1, x2, ..., xp. As then

    μ(x1, x2, ..., xp) = exp(β0 + β1x1 + β2x2 + ... + βpxp),

we ensure that μ(x1, x2, ..., xp) is always positive. The interpretation of the regression coefficients has to be performed on the log scale.
     There are two popular models using this modelling, which only differ in the distributional assumptions made for Y. The essential difference lies, however, in the relation assumed between the expectation μ(x1, x2, ..., xp) and the variance V(x1, x2, ..., xp) (cf. the previous section). In Poisson regression, it is assumed that the variance is equal to the mean. This is a basic property of the Poisson distribution, which occurs naturally if we count independent events in a single subject. In some situations this is reasonable, but the crucial point is that "independent" means that for each single event the probability of its occurrence is only a function of μ(x1, x2, ..., xp). So it is not allowed that there are any other subject-specific variables with an influence on the occurrence of the single events. As this assumption is often unrealistic, a more reasonable model is the negative binomial model, assuming a negative binomial distribution. This distribution arises if we allow a subject-specific, latent variable L to vary from subject to subject such that, given L = l, the counts follow a Poisson distribution with an expected value on the log scale of β0 + β1x1 + β2x2 + ... + βpxp + l. The essential property of the negative binomial distribution is that the variance is not equal, but only proportional, to the expected value (i.e., V(x1, x2, ..., xp) = φ μ(x1, x2, ..., xp)). And it is highly reasonable to assume that the variance increases with the mean, as if the expected count is close to 0, Y cannot vary a lot, as it is bounded by 0.
     Today, it is very popular to analyse count data by a Poisson regression or negative binomial regression, but using robust standard errors, such that we rely neither on the specific distributional assumption nor on the specific assumption on the variance structure, and obtain in any case valid effect estimates and valid confidence intervals and p-values. The choice between the two models is then just a question about the working assumption on the variance structure (cf. the previous section).

Remark: Poisson regression is used not only for count data but also for incidence data. Incidence data often arise from cohort studies, in which subjects spend different times in different risk groups defined by categorical covariates such as age classes or working conditions. Then we count, for each possible combination of the values of the categorical covariates, over all subjects the time spent in each group and the number of events occurring in this time, and can then present a table with the incidence rates, that is, the number of events divided by the overall time spent in each group, typically called the time at risk. A regression analysis of these incidence rates is then possible using Poisson regression for the counts in each group, using the time at risk as an additional covariate with a regression coefficient fixed at 1. This reflects that the expected number of events is of course proportional to the time at risk. (Such a covariate with a regression coefficient fixed at 1 is often called an offset.)

Remark: Actually, Poisson regression can also be used to analyse incidence data at the subject level, following each subject until the first event (resulting in Y = 1) or
until the end of follow-up (resulting in Y = 0). So the data no longer look like count data, but it is still allowed to use Poisson regression with the individual times at risk as offset. This is based on an analogy between the likelihood implied by a Poisson regression model and the likelihood of a proportional hazards model with a piecewise constant baseline hazard function (see, for example, Section 5.2.1 in Aalen et al. (2008)).

Remark: If it is more reasonable to use an identity link than a log link for the mean structure in analysing count data, it is often still reasonable to assume proportionality between variance and mean. Such a model can be easily fitted within the framework of generalised linear models.
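A minimal sketch of these analyses in Stata, using hypothetical variable names (y a count outcome, x1 and x2 covariates, events and timeatrisk the grouped incidence data):

poisson y x1 x2, vce(robust)              // Poisson working model, robust standard errors
nbreg y x1 x2, vce(robust)                // negative binomial working model
* incidence data: the time at risk enters as an offset on the log scale
poisson events x1 x2, exposure(timeatrisk) vce(robust)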
29.4 Regression Models for Ordinal Outcome Data
If the outcome of a study is the subjective evaluation of a disease severity or of a treatment effect by a patient, the outcome is often measured on an ordinal scale with values such as low, middle, high or much worse, worse, unchanged, better, much better. If we are interested in the effect of covariates on these ratings, we need regression models that can handle this type of outcome variable.
     A widely used approach is to assign numerical values 1, 2, 3, ... to the different categories, that is, to use a k-point scale with k denoting the number of categories, and then to use the classical regression model (with robust standard errors to take the nonnormal error distribution into account). Effect estimates from such regression models are easy to interpret: For example, in the case of a binary covariate, they just express the expected difference on the k-point scale between two subjects differing only in this covariate. In spite of this simple interpretation of the effect estimates, the approach is often criticised for ignoring the ordinal character of the outcome scale and because using different, nonequidistant numbers can give different results. Both objections are not very substantial: As we assign the numbers in such a way that they reflect the order of the categories, we do not ignore the ordinal character. And the choice of specific (typically equidistant) numbers for the categories is a highly transparent decision and more natural than arbitrary, and there is no way to analyse ordinal outcome data in an efficient way without making some decision that can be questioned. A more serious objection can be that, using classical linear regression, μ(x1, x2, ..., xp) may take values outside the range of the k-point scale.
     There have been many attempts to define regression models for ordinal outcome data, but none of the approaches has turned out to be convincing in all situations. Stata offers an ologit command to perform an ordered logistic regression based on generalising the latent logistic variable justification of the logistic regression model mentioned in Section 29.1. In this model, it is assumed that there exist cut points c0, ..., ck such that

    P(Y falls in category j | x1, x2, ..., xp) = P(c_{j-1} ≤ β1x1 + β2x2 + ... + βpxp + L ≤ c_j)

and both the cut points and the regression parameters are estimated from the data.
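In Stata this model can be fitted as follows; a minimal sketch, assuming a hypothetical ordinal outcome y with four categories and covariates x1 and x2:

ologit y x1 x2
predict p1 p2 p3 p4, pr        // predicted probabilities of the four categories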
However, the interpretation of the parameters of this model as differences on a logit scale or as odds ratios is cumbersome. An alternative allowing such an interpretation is the proportional odds model. In this model the ordinal outcome is split into k − 1 binary indicators, always dividing the k ordered categories into two groups using the k − 1 possible cut points. Then for each binary indicator a logistic model is fitted, and assuming that the regression coefficients are identical for all indicators (except for the intercept, which has to vary), we can fit a joint model and obtain one effect estimate for each covariate. And it can be easily interpreted as an effect on the logit scale or an odds ratio, just with respect to any of the binary indicators. The proportional odds model can be easily fitted, including valid inference, by defining for each subject clustered observations of the different indicators (cf. Section 20.5); a sketch is given at the end of this section. We may also allow the effect of the covariates to vary linearly with the cut point of the binary indicators to allow for some variation of the effect from indicator to indicator. Technically, this can be approached by including interactions between the covariates and a variable numbering the different splits from 1 to k − 1.
     As a fundamental alternative, it can sometimes be useful to use sequential splitting of the ordinal outcome. For example, if the outcome is the subjective assessment of the severity of a subject's back pain and we are interested in relating the decision to some factors, it can be wise to focus in a first step on a split into low or middle versus high, and then to analyse in a second step only those subjects who did not answer high, and split the outcome into low and middle. This way we may obtain rather different effect estimates, as some factors may influence the subjects' decisions to cry loud, whereas other factors may influence the subjects' decisions to differentiate between low and middle, given their decision not to cry loud. Such an approach can offer interesting insights, but we may fail to obtain significant results due to estimating two effect estimates for each covariate. To overcome this problem, we can just estimate the average of the two effects. This can again be easily approached by generating clustered indicators for each subject, such that each subject with outcome high contributes one observation, and all others contribute two observations.

Remark: A discussion of a variety of regression models for ordinal outcome data can be found in Ananth and Kleinbaum (1997) and Bender and Benner (2000).
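The following minimal sketch illustrates the clustered-indicator approach to the proportional odds model mentioned above, assuming a hypothetical ordinal outcome y with k = 4 categories coded 1 to 4, covariates x1 and x2, and one observation per subject identified by id:

expand 3                               // k-1 = 3 binary indicators per subject
bysort id: gen split = _n              // cut point number 1, 2, 3
gen ybin = y > split                   // indicator "above cut point"
logit ybin i.split x1 x2, vce(cluster id)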
29.5 Quantile Regression and Robust Regression
In some applications with a continuous outcome, characteristics of the conditional distribution of Y given x1, x2, . . . , xp other than the expected value of Y are of interest. For example, if we consider the construction of age-specific norm curves for a lab parameter, we are interested in knowing how the lower and upper 2.5% percentiles change with age. Then we can try to model a specific percentile of the conditional distribution of the lab parameter given age as covariate, and this is exactly what quantile regression does. In Stata, such a model can be fitted using the qreg command.

Quantile regression is of specific interest if the variation of Y is not constant. Only then can we obtain a substantial difference between modelling the mean and modelling a quantile.
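A minimal qreg sketch for the norm-curve example above; the variable names labvalue and age are hypothetical:

* model the upper 2.5% percentile (97.5% quantile) as a function of age
qreg labvalue age, quantile(0.975)
* and the lower 2.5% percentile
qreg labvalue age, quantile(0.025)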
Figure 29.2 A data set with a continuous outcome Y and a continuous covariate X represented by a scatter plot. Left side: The regression line from the classical regression model is added. Right side: The regression line according to a median regression is added.
For example, if we consider a treatment against obesity and we study the course of treated patients over time, it may happen that the variation of the BMI increases with time, as the treatment works in the majority of the patients, but has the opposite effect in a minority. If we model the mean BMI as a function of time, we may observe a decrease over time, but for the upper 5% quantile, we may observe an increase.

A special case of quantile regression is median regression, that is, we model the median instead of the mean. At first sight we would not expect a big difference from the classical regression model since, if the distribution of the error term is symmetric, there is no difference between the mean and the median. However, median regression is typically performed by minimising the sum of the absolute values of the residuals instead of the sum of the squares of the residuals (cf. Appendix A.1). And this makes the results less sensitive to outliers, that is, to single subjects whose value of Y deviates strongly from the regression line. This is illustrated in Figure 29.2, where median regression catches the decreasing trend we can see in the majority of the observations, whereas the classical regression model indicates a steep increase due to a single outlier.

Median regression is only one specific example of a robust regression method. "Robust" means here that the results cannot be dominated by the values of one or a few outliers. Although this is a clear advantage, these methods never became popular in medical research. One reason may be that it seems preferable to identify outliers by a careful inspection of the data and then to remove them manually if we can find a reasonable explanation (e.g., some type of obvious measurement error). However, the latter approach only works if we can inspect the data. In some situations, it is necessary to fit many regression models to different data sets, and then a systematic visual inspection becomes rather cumbersome. This happens, for example, when working with many genetic markers and we would like to find those
that correlate best with age or another continuous patient characteristic. When fitting, for each of, say, 10,000 markers, a regression of the marker values on age, we have to ensure that we do not end up focusing just on those markers with outliers in the measurements. So we need a robust method here to allow an automated analysis; a sketch of such analyses in Stata is given below.
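A minimal sketch of median regression and of Stata's built-in robust regression estimator; the variable names marker and age are hypothetical:

* median regression of a marker value on age
qreg marker age, quantile(0.5)
* an alternative robust regression estimator based on iterative reweighting
rreg marker age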
29.6 ANOVA and Regression
In some publications we find a method called Analysis of Variance (ANOVA) used in situations in which we might expect a regression to be used. This is not surprising, as an ANOVA is identical to specific regression models. However, due to the long history of ANOVA, in particular with respect to experimental studies, there are some specific traditions we should be aware of.

A one-way ANOVA is identical to a classical regression with one categorical covariate, and the overall p-value of the one-way ANOVA corresponds to the overall p-value we introduced in Chapter 8, that is, we test the null hypothesis of no difference among the categories. In a one-way ANOVA, p-values referring to the single pairwise comparisons of the different categories are often reported in addition, hopefully adjusted for multiple testing.

A two-way ANOVA is identical to a classical regression with two categorical covariates. However, whereas in a regression context we typically start with a model with the two covariates and no interaction, many ANOVA programs perform a standard two-way ANOVA by considering a model with all interactions (which can be many in the case of categorical covariates). The effects reported for the single variables (called main effects) refer to the approach of assessing the effect of single covariates in the presence of interactions described in Section 19.5. In particular, different ANOVA programs may arrive at different main effects due to differences in the weighting of the subgroup-specific effects.

ANOVA was developed at the start of the last century and, hence, is a method that can be applied without a pocket calculator or computer. It requires computing specific sums of squares, which are then reported in a so-called ANOVA table; this partially explains the way results from an ANOVA are reported.

Analysis of Covariance (ANCOVA) is an extension of ANOVA allowing one also to incorporate continuous covariates. It corresponds to a regression analysis with both categorical and continuous covariates.
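A small Stata sketch of these correspondences; the variable names y, group, group1, and group2 are hypothetical:

* one-way ANOVA and the equivalent regression with one categorical covariate
anova y group
regress y i.group
* standard two-way ANOVA with interaction and its regression counterpart
anova y group1 group2 group1#group2
regress y i.group1##i.group2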
Chapter 30
Specific Usages of Regression Models
In this chapter we discuss the use of regression models in some specific situations occurring in medical research.
30.1 Logistic Regression for the Analysis of Case-Control Studies
Case-control studies are a very popular study type for investigating risk factors for a rare disease. The basic idea is to collect data on relevant covariates (typically some exposures the subjects have experienced during their life) for a large fraction of all subjects who have the disease (the cases), and only for a small fraction of all subjects who do not have the disease (the controls). Then the distribution of the risk factors is compared between cases and controls: If a risk factor is more frequent among the cases than among the controls, this gives a hint that this risk factor increases the risk of developing the disease.

So the original idea is to look from the case-control status onto the covariate values, which is exactly the opposite of what we do in regression modelling. Nevertheless, it can be shown that it is possible, allowed, and useful to analyse data from case-control studies using logistic regression with the case-control status as outcome, that is, with Y = 1 for the cases and Y = 0 for the controls. The key argument for this is that odds ratios are invariant under case-control sampling.

Let us assume that we would not do a case-control study but collect data on the covariates and the disease status in the whole population. The (true) unadjusted odds ratio for the effect of a binary covariate X on a binary outcome Y is then defined by

OR = [ P(Y = 1 | X = 1) / P(Y = 0 | X = 1) ] / [ P(Y = 1 | X = 0) / P(Y = 0 | X = 0) ]
Now we introduce the cell probabilities of the joint distribution of Y and X as pjk := P(Y = j, X = k) and, for the marginal probabilities describing the distribution of X, we introduce p.k := P(X = k) = p1k + p0k. Then we have the simple relation

P(Y = j | X = k) = P(Y = j, X = k) / P(X = k) = pjk / p.k
Hence, we can express the odds ratio as

OR = [ (p11/p.1) / (p01/p.1) ] / [ (p10/p.0) / (p00/p.0) ] = (p11 p00) / (p01 p10)
(cf. also Section 28.3). Now we take a look at the cell probabilities in the subjects included in the case-control study. If we introduce a random indicator variable S, with S = 1 if the subject is selected for the study and S = 0 if the subject is not selected, we can express the cell probabilities in the case-control population as

p̃jk = P(Y = j, X = k | S = 1) ,

that is, as the probability of Y = j, X = k given that a subject is selected for the case-control study. This probability is just the probability to show Y = j, X = k and to be selected for the case-control study, divided by the probability to be selected. Hence, we have

P(Y = j, X = k | S = 1) = P(Y = j, X = k, S = 1) / P(S = 1)

We now introduce the selection probabilities q1 in cases and q0 in controls as

qj = P(S = 1 | Y = j) .

Due to the case-control sampling, these probabilities are independent of the value of X, that is, they satisfy

qj = P(S = 1 | Y = j) = P(S = 1 | Y = j, X = k) for k = 0 and k = 1 .

If we further use that the probability of Y = j, X = k, S = 1 is just the product of the probability of Y = j, X = k and the probability of S = 1 given Y = j, X = k, then we have

P(Y = j, X = k, S = 1) = P(S = 1 | Y = j, X = k) P(Y = j, X = k) = qj pjk

and hence, the cell probabilities in the case-control study satisfy

p̃jk = P(Y = j, X = k | S = 1) = P(Y = j, X = k, S = 1) / P(S = 1) = qj pjk / P(S = 1)
So for the true odds ratio in the subjects of the case-control study we obtain

OR̃ = (p̃11 p̃00) / (p̃01 p̃10) = (q1 p11 q0 p00) / (q0 p01 q1 p10) = (p11 p00) / (p01 p10) ,

so this odds ratio is identical to the odds ratio in the whole population.
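As a purely hypothetical numerical illustration (the cell probabilities are invented for this example): Suppose that in the whole population p11 = 0.008, p10 = 0.002, p01 = 0.192, and p00 = 0.798, so that OR = (0.008 × 0.798)/(0.192 × 0.002) ≈ 16.6. If we select all cases (q1 = 1) but only 1% of the controls (q0 = 0.01), the cell probabilities in the case-control sample become proportional to 0.008, 0.002, 0.00192, and 0.00798, and the odds ratio is (0.008 × 0.00798)/(0.00192 × 0.002) ≈ 16.6 again, although the fraction of cases has risen from 1% in the population to about 50% in the sample.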
This argument of the invariance of the odds ratio under case-control sampling can be extended to adjusted odds ratios and logistic regression models in the following way: If in the whole population the logistic model

logit P(Y = 1 | x1, x2, . . . , xp) = β0 + β1 x1 + β2 x2 + . . . + βp xp

holds, then in the case-control sample the model

logit P(Y = 1 | x1, x2, . . . , xp, S = 1) = β̃0 + β1 x1 + β2 x2 + . . . + βp xp

holds with β̃0 = β0 + log(q1/q0), that is, only the intercept changes. So all effect estimates we obtain from a case-control study can be interpreted as if they had been obtained by analysing the whole population. For this reason, logistic regression is the standard tool to analyse case-control studies.

Remark: The argument above only provides a justification of the effect estimates. It is a different question whether standard errors, confidence intervals, and p-values are also correct, as due to the case-control sampling we no longer have a simple sample of independent observations. However, it can be shown that the usual inference procedures may be used without any modification (Prentice and Pyke, 1979).

Remark: Case-control studies are typically used to study rare diseases. Consequently, in the general population, the prevalence is small and we are modelling small probabilities. But this means that there is no big difference between p and p/(1 − p), as the denominator in the latter is close to 1. Consequently, there is also no big difference between odds ratios (the ratio between two odds) and relative risks (the ratio between two probabilities). For this reason, in particular in the literature up to the 1990s, the estimates of a logistic regression analysis are often referred to as relative risks, and not odds ratios.
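A minimal Stata sketch of such a case-control analysis; the variable names case (1 = case, 0 = control), exposure, age, and sex are hypothetical:

* ordinary logistic regression of the case-control status on the exposure,
* adjusted for age and sex; odds ratios are reported directly
logistic case exposure age i.sex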
30.2 Logistic Regression for the Analysis of Matched Case-Control Studies
In case-control studies, the distribution of the covariates in the cases is often known prior to the selection of the controls. This can be used to improve the power of the case-control study by matching, if a strong confounder is known a priori. A typical example is age. If we observe that the majority of the diseased cases is above the age of 60, and if we decide to select our controls from the general population, then the majority of the controls will be below 60. Now if we adjust for age in the analysis of the other risk factors, we have to compare, among the young subjects, many controls with a few cases and, among the old subjects, a few controls with many cases. And this implies poor power (in analogy to the considerations we made in Section 14.2).

So the basic idea is now to select for each case a fixed number (usually 1, 2, or 3) of controls from all subjects of the same age, that is, we match to each case a fixed number of controls. This way we allow for a more powerful adjustment for age, but we lose, of course, the possibility of estimating the effect of age, as we manipulate the age distribution in the controls and make it artificially identical to that of the cases.

In spite of this matching, ordinary logistic regression can still be used to analyse
the data, and estimates of the same effects as in an unmatched case-control study are obtained. This can be seen in the following way: Matching for age means that for each age a we perform an ordinary case-control study in the subpopulation of all subjects of age a. So, according to the considerations in the last section, we have within the selected subjects, and with age as the first covariate, that

logit P(Y = 1 | X1 = a, X2 = x2, . . . , Xp = xp, S = 1) = β̃0a + β1 a + β2 x2 + . . . + βp xp

with β̃0a depending on the fractions of cases and controls selected in the age stratum a. So we have in each age stratum a logistic model, with intercept β̃0a + β1 a depending on a. But this implies that if we analyse the data as a whole, we can use a logistic regression model including the covariate age, and typically it is sufficient to allow just a linear "effect" of age. However, we cannot interpret this "effect" as the true effect of age, as it also reflects the intercept β̃0a varying with age. So we have the somewhat strange situation that although we know that we cannot estimate the effect of age, we have to include age as a covariate in the analysis. This can be explained by the fact that matching does not remove the correlation between age and the other covariates, so age is still a confounder, and hence we have to adjust for it.

Remark: In many countries, it is difficult to draw controls for a case-control study from the general population due to the lack of a population register. Then controls are often sampled by looking for a disease-free subject living in the same street as the case, or by looking for a subject with another, unrelated disease in the same hospital. These studies are also matched case-control studies, with a matching factor such as local area or hospital. As these matching factors cannot be used as covariates, logistic regression cannot be used directly. However, there exists a variant called conditional logistic regression that allows one to perform a logistic regression also for this type of matched case-control study. Conditional logistic regression can also be used for matched studies with matching based on covariates; it then gives very similar results.

Remark: More considerations about the difference between matched and unmatched studies and the analysis of matched studies can be found in Chapters 19 and 29 of Clayton and Hills (1993).
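Two hedged Stata sketches of these analyses; the variable names case, exposure, age, and the matched-set identifier setid are hypothetical:

* age-matched case-control study analysed by ordinary logistic regression,
* with age included as a covariate despite the matching
logistic case exposure age
* matching on a factor such as hospital or local area: conditional
* logistic regression within matched sets
clogit case exposure, group(setid)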
30.3 Adjusting for Baseline Values in Randomised Clinical Trials
In clinical trials with a continuous outcome, the outcome variable can often also be measured at baseline, that is, prior to the therapy. A typical example is shown in Figure 30.1, illustrating data from a randomised clinical trial comparing two therapies for patients suffering from severe apoplexia cerebri. The patients' performance with respect to activities of daily living is assessed prior to and after the therapy by the Barthel score (pre- and posttherapy measurements).

In principle, such a trial can be analysed by just comparing the posttherapy measurements between the two treatment groups. It is allowed to ignore the baseline values, as they are roughly balanced between the two treatment groups due to the randomisation.
Figure 30.1 Measurements of the Barthel score pre- and posttherapy in the two treatment groups of a randomised clinical trial.
However, this may not be the optimal way. If we take a closer look at Figure 30.1, we can observe that patients tend to keep their baseline level under the treatment in spite of some treatment effect: Patients starting at a low level tend to remain at a low level, and patients starting at a high level tend to remain at a high level. In other words, the baseline values are predictive of the posttreatment values.

In such a situation, the difference between the post- and pretherapy measurements is often used as the outcome variable of interest, as it reflects the change in each single patient and is hence a direct measurement of the effect of the therapy in each single patient. Then we can simply compare the two groups by considering the difference in mean values between the two groups. In our example, we obtain a mean change of −0.68 under therapy A and a mean change of 9.21 under therapy B, and a t-test gives a p-value of 0.0041; a sketch of this analysis is given below.

The approach of using the change between post- and pretherapy as outcome instead of the posttherapy values themselves is very popular, and if the correlation between pre- and posttherapy values is at least 0.5, using the change is indeed more powerful. However, the approach is not perfect. The mean change is equivalent to the difference between the mean value of the posttreatment measurements and the mean value of the pretherapy measurements, and hence we can visualise the results of such a study also in a graph like that shown in Figure 30.2. Now it becomes obvious that the difference in mean change partially reflects the difference in the means of the baseline values between the two treatment groups. And we know that the observed difference at baseline is a chance result: We have a randomised trial, and consequently the true mean values are identical. Hence, in our example we have to conjecture that the true difference in change is larger than the estimated one, as the true, common mean baseline value is probably somewhere in between the two mean baseline values observed.
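A minimal Stata sketch of this change-score comparison; the variable names barthel_post, barthel_pre, and therapy are hypothetical:

* compare the mean change between the two treatment groups
gen change = barthel_post - barthel_pre
ttest change, by(therapy)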
Figure 30.2 Mean values of the Barthel score pre- and posttherapy in the two treatment groups of a randomized clinical trial.
Figure 30.3 A scatter plot of post- versus pretherapy measurements in the two treatment groups of a randomized clinical trial and the regression lines from fitting a regression model with the posttreatment measurements as outcome and the baseline values and a treatment indicator as covariates.
Since in the treatment group with the larger change the mean baseline value is larger than in the group with the smaller change, we are probably underestimating the evidence we have for a difference between the two treatment groups. The results of such a trial can, of course, also be the other way round: If the treatment group with the larger mean change has a smaller mean baseline value than the treatment group with the smaller mean change, then we may suspect that we overestimate the difference between the two groups. In any case, this approach is never completely satisfying, as the chance difference at baseline has a direct impact on the results.

Now there is a simple way to overcome this problem. As mentioned above, the pretherapy values are predictive of the posttherapy values. We have already learned in Section 16.7 that we can improve the power of a regression analysis by adding predictive, but independent, covariates. Estimation of a treatment effect means nothing else but regressing the outcome (i.e., the posttreatment values) on a binary covariate, the treatment indicator. And the baseline values are independent of the treatment status due to the randomisation, and they are highly predictive. So it is wise to add them to the regression model, that is, to consider the two covariates treatment group (X1) and baseline value (X2) and the regression model
μ(x1, x2) = β0 + β1 x1 + β2 x2

with β1 reflecting the treatment effect. Figure 30.3 visualises the fit of such a model, and the distance between the two lines corresponds to the estimated treatment effect.
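A minimal Stata sketch of this baseline-adjusted analysis; the variable names barthel_post, barthel_pre, and therapy are hypothetical:

* posttherapy outcome regressed on the treatment indicator and the baseline value
regress barthel_post i.therapy barthel_pre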
As we condition on the baseline values, a difference in the mean baseline values no longer has any effect on the results. The output from fitting the regression model looks like

variable   beta   SE     95%CI          p-value
pre        0.92   0.09   [0.73,1.11]