English Pages 153 Year 2020
Logistic Regression SECOND EDITION
Quantitative Applications in the Social Sciences A SAGE PUBLICATIONS SERIES 1. Analysis of Variance, 2nd Edition Iversen/ Norpoth 2. Operations Research Methods Nagel/Neef 3. Causal Modeling, 2nd Edition Asher 4. Tests of Significance Henkel 5. Cohort Analysis, 2nd Edition Glenn 6. Canonical Analysis and Factor Comparison Levine 7. Analysis of Nominal Data, 2nd Edition Reynolds 8. Analysis of Ordinal Data Hildebrand/Laing/ Rosenthal 9. Time Series Analysis, 2nd Edition Ostrom 10. Ecological Inference Langbein/Lichtman 11. Multidimensional Scaling Kruskal/Wish 12. Analysis of Covariance Wildt/Ahtola 13. Introduction to Factor Analysis Kim/Mueller 14. Factor Analysis Kim/Mueller 15. Multiple Indicators Sullivan/Feldman 16. Exploratory Data Analysis Hartwig/Dearing 17. Reliability and Validity Assessment Carmines/Zeller 18. Analyzing Panel Data Markus 19. Discriminant Analysis Klecka 20. Log-Linear Models Knoke/Burke 21. Interrupted Time Series Analysis McDowall/ McCleary/Meidinger/Hay 22. Applied Regression, 2nd Edition Lewis-Beck/ Lewis-Beck 23. Research Designs Spector 24. Unidimensional Scaling McIver/Carmines 25. Magnitude Scaling Lodge 26. Multiattribute Evaluation Edwards/Newman 27. Dynamic Modeling Huckfeldt/Kohfeld/Likens 28. Network Analysis Knoke/Kuklinski 29. Interpreting and Using Regression Achen 30. Test Item Bias Osterlind 31. Mobility Tables Hout 32. Measures of Association Liebetrau 33. Confirmatory Factor Analysis Long 34. Covariance Structure Models Long 35. Introduction to Survey Sampling, 2nd Edition Kalton 36. Achievement Testing Bejar 37. Nonrecursive Causal Models Berry 38. Matrix Algebra Namboodiri 39. Introduction to Applied Demography Rives/Serow 40. Microcomputer Methods for Social Scientists, 2nd Edition Schrodt 41. Game Theory Zagare 42. Using Published Data Jacob 43. Bayesian Statistical Inference Iversen 44. Cluster Analysis Aldenderfer/Blashfield 45. Linear Probability, Logit, and Probit Models Aldrich/Nelson 46. 
Event History and Survival Analysis, 2nd Edition Allison 47. Canonical Correlation Analysis Thompson 48. Models for Innovation Diffusion Mahajan/Peterson 49. Basic Content Analysis, 2nd Edition Weber 50. Multiple Regression in Practice Berry/Feldman 51. Stochastic Parameter Regression Models Newbold/Bos 52. Using Microcomputers in Research Madron/Tate/Brookshire
53. Secondary Analysis of Survey Data Kiecolt/ Nathan 54. Multivariate Analysis of Variance Bray/ Maxwell 55. The Logic of Causal Order Davis 56. Introduction to Linear Goal Programming Ignizio 57. Understanding Regression Analysis, 2nd Edition Schroeder/Sjoquist/Stephan 58. Randomized Response and Related Methods, 2nd Edition Fox/Tracy 59. Meta-Analysis Wolf 60. Linear Programming Feiring 61. Multiple Comparisons Klockars/Sax 62. Information Theory Krippendorff 63. Survey Questions Converse/Presser 64. Latent Class Analysis McCutcheon 65. Three-Way Scaling and Clustering Arabie/ Carroll/DeSarbo 66. Q Methodology, 2nd Edition McKeown/ Thomas 67. Analyzing Decision Making Louviere 68. Rasch Models for Measurement Andrich 69. Principal Components Analysis Dunteman 70. Pooled Time Series Analysis Sayrs 71. Analyzing Complex Survey Data, 2nd Edition Lee/Forthofer 72. Interaction Effects in Multiple Regression, 2nd Edition Jaccard/Turrisi 73. Understanding Significance Testing Mohr 74. Experimental Design and Analysis Brown/Melamed 75. Metric Scaling Weller/Romney 76. Longitudinal Research, 2nd Edition Menard 77. Expert Systems Benfer/Brent/Furbee 78. Data Theory and Dimensional Analysis Jacoby 79. Regression Diagnostics, 2nd Edition Fox 80. Computer-Assisted Interviewing Saris 81. Contextual Analysis Iversen 82. Summated Rating Scale Construction Spector 83. Central Tendency and Variability Weisberg 84. ANOVA: Repeated Measures Girden 85. Processing Data Bourque/Clark 86. Logit Modeling DeMaris 87. Analytic Mapping and Geographic Databases Garson/Biggs 88. Working With Archival Data Elder/Pavalko/Clipp 89. Multiple Comparison Procedures Toothaker 90. Nonparametric Statistics Gibbons 91. Nonparametric Measures of Association Gibbons 92. Understanding Regression Assumptions Berry 93. Regression With Dummy Variables Hardy 94. Loglinear Models With Latent Variables Hagenaars 95. Bootstrapping Mooney/Duval 96. Maximum Likelihood Estimation Eliason 97. 
Ordinal Log-Linear Models Ishii-Kuntz 98. Random Factors in ANOVA Jackson/Brashers 99. Univariate Tests for Time Series Models Cromwell/Labys/Terraza 100. Multivariate Tests for Time Series Models Cromwell/Hannan/Labys/Terraza
101. Interpreting Probability Models: Logit, Probit, and Other Generalized Linear Models Liao 102. Typologies and Taxonomies Bailey 103. Data Analysis: An Introduction Lewis-Beck 104. Multiple Attribute Decision Making Yoon/ Hwang 105. Causal Analysis With Panel Data Finkel 106. Applied Logistic Regression Analysis, 2nd Edition Menard 107. Chaos and Catastrophe Theories Brown 108. Basic Math for Social Scientists: Concepts Hagle 109. Basic Math for Social Scientists: Problems and Solutions Hagle 110. Calculus Iversen 111. Regression Models: Censored, Sample Selected, or Truncated Data Breen 112. Tree Models of Similarity and Association Corter 113. Computational Modeling Taber/Timpone 114. LISREL Approaches to Interaction Effects in Multiple Regression Jaccard/Wan 115. Analyzing Repeated Surveys Firebaugh 116. Monte Carlo Simulation Mooney 117. Statistical Graphics for Univariate and Bivariate Data Jacoby 118. Interaction Effects in Factorial Analysis of Variance Jaccard 119. Odds Ratios in the Analysis of Contingency Tables Rudas 120. Statistical Graphics for Visualizing Multivariate Data Jacoby 121. Applied Correspondence Analysis Clausen 122. Game Theory Topics Fink/Gates/Humes 123. Social Choice: Theory and Research Johnson 124. Neural Networks Abdi/Valentin/Edelman 125. Relating Statistics and Experimental Design: An Introduction Levin 126. Latent Class Scaling Analysis Dayton 127. Sorting Data: Collection and Analysis Coxon 128. Analyzing Documentary Accounts Hodson 129. Effect Size for ANOVA Designs Cortina/Nouri 130. Nonparametric Simple Regression: Smoothing Scatterplots Fox 131. Multiple and Generalized Nonparametric Regression Fox 132. Logistic Regression: A Primer Pampel 133. Translating Questionnaires and Other Research Instruments: Problems and Solutions Behling/Law 134. Generalized Linear Models: A Unified Approach, 2nd Edition Gill/Torres 135. Interaction Effects in Logistic Regression Jaccard 136. Missing Data Allison 137. 
Spline Regression Models Marsh/Cormier 138. Logit and Probit: Ordered and Multinomial Models Borooah 139. Correlation: Parametric and Nonparametric Measures Chen/Popovich 140. Confidence Intervals Smithson 141. Internet Data Collection Best/Krueger 142. Probability Theory Rudas 143. Multilevel Modeling, 2nd Edition Luke 144. Polytomous Item Response Theory Models Ostini/Nering 145. An Introduction to Generalized Linear Models Dunteman/Ho 146. Logistic Regression Models for Ordinal Response Variables O’Connell
147. Fuzzy Set Theory: Applications in the Social Sciences Smithson/Verkuilen 148. Multiple Time Series Models Brandt/Williams 149. Quantile Regression Hao/Naiman 150. Differential Equations: A Modeling Approach Brown 151. Graph Algebra: Mathematical Modeling With a Systems Approach Brown 152. Modern Methods for Robust Regression Andersen 153. Agent-Based Models, 2nd Edition Gilbert 154. Social Network Analysis, 3rd Edition Knoke/Yang 155. Spatial Regression Models, 2nd Edition Ward/Gleditsch 156. Mediation Analysis Iacobucci 157. Latent Growth Curve Modeling Preacher/Wichman/MacCallum/Briggs 158. Introduction to the Comparative Method With Boolean Algebra Caramani 159. A Mathematical Primer for Social Statistics Fox 160. Fixed Effects Regression Models Allison 161. Differential Item Functioning, 2nd Edition Osterlind/Everson 162. Quantitative Narrative Analysis Franzosi 163. Multiple Correspondence Analysis LeRoux/ Rouanet 164. Association Models Wong 165. Fractal Analysis Brown/Liebovitch 166. Assessing Inequality Hao/Naiman 167. Graphical Models and the Multigraph Representation for Categorical Data Khamis 168. Nonrecursive Models Paxton/Hipp/ Marquart-Pyatt 169. Ordinal Item Response Theory Van Schuur 170. Multivariate General Linear Models Haase 171. Methods of Randomization in Experimental Design Alferes 172. Heteroskedasticity in Regression Kaufman 173. An Introduction to Exponential Random Graph Modeling Harris 174. Introduction to Time Series Analysis Pickup 175. Factorial Survey Experiments Auspurg/Hinz 176. Introduction to Power Analysis: Two-Group Studies Hedberg 177. Linear Regression: A Mathematical Introduction Gujarati 178. Propensity Score Methods and Applications Bai/Clark 179. Multilevel Structural Equation Modeling Silva/Bosancianu/Littvay 180. Gathering Social Network Data adams 181. Generalized Linear Models for Bounded and Limited Quantitative Variables, Smithson and Shou 182. Exploratory Factor Analysis, Finch 183. 
Multidimensional Item Response Theory, Bonifay 184. Argument-Based Validation in Testing and Assessment, Chapelle 185. Using Time Series to Analyze Long Range Fractal Patterns, Koopmans 186. Understanding Correlation Matrices, Hadd and Rodgers 187. Rasch Models for Solving Measurement Problems, Engelhard and Wang
Logistic Regression A Primer SECOND EDITION
Fred C. Pampel University of Colorado Boulder
FOR INFORMATION:

SAGE Publications, Inc.
2455 Teller Road
Thousand Oaks, California 91320
E-mail: [email protected]

SAGE Publications Ltd.
1 Oliver’s Yard
55 City Road
London EC1Y 1SP
United Kingdom

SAGE Publications India Pvt. Ltd.
B 1/I 1 Mohan Cooperative Industrial Area
Mathura Road, New Delhi 110 044
India

SAGE Publications Asia-Pacific Pte. Ltd.
18 Cross Street #10-10/11/12
China Square Central
Singapore 048423

Copyright © 2021 by SAGE Publications, Inc.

All rights reserved. Except as permitted by U.S. copyright law, no part of this work may be reproduced or distributed in any form or by any means, or stored in a database or retrieval system, without permission in writing from the publisher.

All third party trademarks referenced or depicted herein are included solely for the purpose of illustration and are the property of their respective owners. Reference to these trademarks in no way indicates any relationship with, or endorsement by, the trademark owner.

Printed in the United States of America

Library of Congress Cataloging-in-Publication Data

Names: Pampel, Fred C., author.
Title: Logistic regression: a primer / Fred C. Pampel, University of Colorado, Boulder.
Description: 2nd edition. | Thousand Oaks, Calif.: SAGE, [2021] | Includes bibliographical references and index.
Identifiers: LCCN 2020031266 | ISBN 9781071816202 (paperback ; alk. paper) | ISBN 9781071816196 (epub) | ISBN 9781071816189 (epub) | ISBN 9781071816172 (ebook)
Subjects: LCSH: Logistic regression analysis.
Classification: LCC HA31.3 .P36 2021 | DDC 519.5/36--dc23
LC record available at https://lccn.loc.gov/2020031266

This book is printed on acid-free paper.

Acquisitions Editor: Helen Salmon
Editorial Assistant: Elizabeth Cruz
Production Editor: Astha Jaiswal
Copy Editor: Integra
Typesetter: Hurix Digital
Proofreader: Jennifer Grubba
Cover Designer: Rose Storey
Marketing Manager: Shari Countryman

20 21 22 23 24 10 9 8 7 6 5 4 3 2 1
CONTENTS

Series Editor Introduction
Preface
Acknowledgments
About the Author

Chapter 1: The Logic of Logistic Regression
    Regression With a Binary Dependent Variable
    Transforming Probabilities Into Logits
    Linearizing the Nonlinear
    Summary

Chapter 2: Interpreting Logistic Regression Coefficients
    Logged Odds
    Odds
    Probabilities
    Standardized Coefficients
    Group and Model Comparisons of Logistic Regression Coefficients
    Summary

Chapter 3: Estimation and Model Fit
    Maximum Likelihood Estimation
    Tests of Significance Using Log Likelihood Values
    Model Goodness of Fit
    Summary

Chapter 4: Probit Analysis
    Another Way to Linearize the Nonlinear
    The Probit Transformation
    Interpretation
    Maximum Likelihood Estimation
    Summary

Chapter 5: Ordinal and Multinomial Logistic Regression
    Ordinal Logistic Regression
    Multinomial Logistic Regression
    Summary

Notes

Appendix: Logarithms
    The Logic of Logarithms
    Properties of Logarithms
    Natural Logarithms
    Summary

References
Index
SERIES EDITOR’S INTRODUCTION

It is with great pleasure that I introduce the second edition of Logistic Regression: A Primer by Fred C. Pampel. This straightforward and intuitive volume preserves the best of the first edition while adding and updating material that leads to an even more useful guide to logistic regression.

As explained in the Preface, Professor Pampel’s manuscript is a “primer” that makes explicit what other treatments take for granted. In the first chapter, he develops the logistic regression model as one solution to the problems encountered by linear models when the dependent variable is a dichotomous yes/no type of variable. He also shows very clearly the new problems that arise when shifting from linear to logistic regression, for example, explaining why and how modeling a dependent variable with a ceiling (Y=1) and a floor (Y=0) makes the influence of the independent variables nonadditive and interactive.

In the second chapter, Professor Pampel takes up the multiple ways to interpret effects in logistic regression. He discusses the differences between effects on the logged odds, on the odds, and on probabilities and the advantages and disadvantages of each. He also explains marginal effects at the mean, marginal effects at representative values, and the average marginal effect. The volume includes output from Stata, SPSS, and R so as to familiarize readers with the look and feel of each.

The third chapter addresses estimation and model fit, doing so in a way that demystifies these procedures. Professor Pampel explains the likelihood function, the log likelihood function, why its values are negative, why negative values further from zero indicate better fit, and the reason for multiplying the difference between the baseline and model log likelihood by −2 in assessing fit. He explains the connection between the baseline log likelihood and null deviance, and between the model log likelihood and residual deviance, and also explains why these are called “deviance.” This is the kind of “inside knowledge” that can baffle novices but that experienced users of statistical methods may not even realize they have.

The fourth and fifth chapters introduce probit analysis and ordinal and multinomial logistic regression. The focus is on how these more complex forms relate to dichotomous logistic regression, with particular attention to how estimation and interpretation are extensions of material discussed earlier in the volume.

Examples are key to the pedagogy throughout. Model estimation, fit, and the interpretation of effects are illustrated with specific applications. The outcomes range from smoking behavior to support for federal spending, views about gay marriage, attitudes on the legalization of marijuana, and rankings of the importance of various world problems. Data come from the National Health Interview Survey 2017, opinion data from multiple waves of the General Social Survey, and the World Values Survey India 2014. Data to replicate the examples, along with code in Stata, SPSS, and R needed to reproduce the analyses presented in the volume, are provided on the companion website at study.sagepub.com/researchmethods/qass/pampel-logistic-regression-2e.

This little green cover is a gem. The material is easily accessible to novices; it can serve as a helpful supplemental text in an undergraduate or graduate methods course. Readers more experienced in statistical methods will appreciate that the volume is not a cookbook, far from it; those who use it will learn how the method actually works. Given changes in practice since the original version was published 20 years ago, an updated edition of the primer was urgently needed. Professor Pampel has delivered it.

Barbara Entwisle
Series Editor
PREFACE

I call this book a primer because it makes explicit what treatments of logistic regression often take for granted. Some treatments explain concepts abstractly, assuming readers have a comfortable familiarity with odds and logarithms, maximum likelihood estimation, and nonlinear functions. Other treatments skip the logical undergirding of logistic regression by proceeding directly to examples and the interpretation of actual coefficients. As a result, students sometimes fail to gain an understanding of the intuitive logic behind logistic regression. The first edition aimed to introduce this logic with elementary language and simple examples.

The second edition sticks to this approach but updates the material in several ways. It presents results from several statistical packages to help interpret the meaning of logistic regression coefficients. It presents more detail on variations in logistic regression for multicategory outcomes. And it describes potential problems in using logistic regression to compare groups, compare coefficients across models, and test for statistical interaction.

Chapter 1 briefly presents a nontechnical explanation of the problems of using linear regression with binary dependent variables, and then more thoroughly introduces the logit transformation. Chapter 2 presents central material on interpreting logistic regression coefficients and the programs available to help in the interpretations. Chapter 3 takes up the meaning of maximum likelihood estimation and the explanatory power of models in logistic regression. Chapter 4 reviews probit analysis, a similar though less commonly used way to analyze a binary outcome. Chapter 5 introduces ordinal logistic regression and multinomial logistic regression as extensions to analyze multicategory ordered and nominal outcomes. Because the basic logic of logistic regression applies to the extensions in Chapters 4 and 5, however, the later topics receive less detailed discussion than the basics of logistic regression in Chapters 1 to 3.

An online supplement available at study.sagepub.com/researchmethods/qass/pampel-logistic-regression-2e includes the three data sets and Stata, SPSS, and R commands needed to reproduce all the tables and figures in the book. Finally, the Appendix reviews the meaning of logarithms and may help some students understand the use of logarithms in logistic regression as well as in other types of models.
ACKNOWLEDGMENTS

I thank Scott Menard, Jani Little, Melissa Hardy, Dennis Mileti, Rick Rogers, Scott Eliason, Jane Menken, Tom Mayer, Michael Lewis-Beck, and several anonymous reviewers for helpful comments on the first edition of the book. For the second edition, I thank Barbara Entwisle, the series editor, and the following reviewers for helpful comments and guidance:

Prince Allotey, University of Connecticut
Thomas J. Linneman, The College of William and Mary
Eliot Rich, University at Albany
Jason C. Immekus, University of Louisville
Anthea Chan, New York University
Victoria Landu-Adams, Walden University
Travis Loux, Saint Louis University
David Han, University of Texas at San Antonio
ABOUT THE AUTHOR

Fred C. Pampel is Research Professor of Sociology and a Research Associate in the Population Program at the University of Colorado Boulder. He received a Ph.D. in sociology from the University of Illinois, Champaign-Urbana, in 1977, and has previously taught at the University of Iowa, University of North Carolina, and Florida State University. His research focuses on socioeconomic disparities in health behaviors, smoking in particular, and on the experimental and quasi-experimental methods for evaluation of social programs for youth. He is the author of several books on population aging, cohort change, and public policy, and his work has appeared in American Sociological Review, American Journal of Sociology, Demography, Social Forces, and European Sociological Review.
To Steven
Chapter 1
THE LOGIC OF LOGISTIC REGRESSION

Many social phenomena are discrete or qualitative rather than continuous or quantitative in nature: an event occurs or it does not occur, a person makes one choice but not the other, an individual or group passes from one state to another. A person can have a child, die, move (either within or across national borders), marry, divorce, enter or exit the labor force, receive welfare benefits, vote for one candidate, commit a crime, be arrested, quit school, enter college, join an organization, get sick, belong to a religion, or act in myriad other ways that involve a characteristic, an event, or a choice. Sometimes continuous scales are converted into discrete categories, such as income below the poverty level or birth weight below a specified level. Likewise, large social units (groups, organizations, and nations) can emerge, break up, go bankrupt, face rebellion, join larger groups, or pass from one type of discrete state into another.

Binary discrete phenomena usually take the form of a dichotomous indicator or dummy variable. Although it is possible to represent the two values with any numbers, employing variables with values of 1 and 0 has advantages. The mean of a dummy variable equals the proportion of cases with a value of 1 and can be interpreted as a probability.
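As a quick check of the last point, the following Python sketch computes the mean of a dummy variable and the proportion of cases coded 1. The ten responses are invented for illustration and do not come from any survey discussed in this book.

```python
# Hypothetical responses (not real survey data): 1 = smokes, 0 = does not smoke.
smokes = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]

# The mean of a 0/1 variable is the sum of the 1s divided by the number of cases.
mean_value = sum(smokes) / len(smokes)

# The proportion of cases coded 1, obtained by counting them directly.
proportion_ones = smokes.count(1) / len(smokes)

print(mean_value, proportion_ones)
```

The two quantities agree because summing a 0/1 variable is the same as counting the cases coded 1.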
Regression With a Binary Dependent Variable

A binary dependent variable with values of 0 and 1 seems suitable on the surface for use with multiple regression. Regression coefficients have a useful interpretation with a dummy dependent variable: they show the increase or decrease in the predicted probability of having a characteristic or experiencing an event due to a one-unit change in the independent variables. Equivalently, they show the change in the predicted proportion of respondents with a value of 1 due to a one-unit change in the independent variables. Given familiarity with proportions and probabilities, researchers should feel comfortable with such interpretations.

The dependent variable itself only takes values of 0 and 1, but the predicted values for regression take the form of mean proportions or probabilities conditional on the values of the independent variables. The higher the predicted value or conditional mean, the more likely that any individual with particular scores on the independent variables will have a characteristic or experience the event. Linear regression assumes that the conditional proportions or probabilities define a straight line for values of the independent variables.

To give a simple example, the 2017 National Health Interview Survey asked respondents if they currently smoke cigarettes or not. Assigning those who smoke a score of 1 and those who do not a score of 0 creates a dummy dependent variable. Taking smoking (S) as a function of years of completed education (E) and a dummy variable for gender (G) with men coded 1 produces the regression equation:

S = .388 − .018E + .039G

The coefficient for education indicates that for a 1-year increase in education, the predicted probability of smoking goes down by .018, the proportion smoking goes down by .018, or the percentage smoking goes down by 1.8 points. Women respondents with no education have a predicted probability of smoking of .388 (the intercept). A woman with 10 years of education has a predicted probability of smoking of .388 − (.018 × 10) = .208. One could also say that the model predicts 20.8% of such respondents smoke. The dummy variable coefficient for gender shows that men have a predicted probability of smoking .039 higher than women with the same education. With no education, men have a predicted probability of smoking of .388 + .039 = .427.

Despite the uncomplicated interpretation of the coefficients for regression with a dummy dependent variable, the regression estimates face two sorts of problems. One type of problem is conceptual in nature, while the other type is statistical in nature. The problems may prove serious enough to use an alternative to ordinary regression with binary dependent variables.
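The arithmetic in this example can be reproduced with a few lines of Python. The sketch below simply encodes the rounded coefficients printed in the equation above; the function name `predicted_smoking` is an illustrative choice, not something taken from the book's companion code.

```python
def predicted_smoking(education_years, male):
    """Predicted probability of smoking from the linear equation
    S = .388 - .018E + .039G, using the rounded coefficients in the text."""
    return 0.388 - 0.018 * education_years + 0.039 * male

# A woman (male = 0) with 10 years of education.
woman_10_years = predicted_smoking(10, 0)

# A man (male = 1) with no education.
man_no_education = predicted_smoking(0, 1)

# In a linear model the gap between men and women is the same at every
# level of education: it always equals the gender coefficient.
gap_at_10_years = predicted_smoking(10, 1) - predicted_smoking(10, 0)

print(round(woman_10_years, 3))    # .388 - (.018 x 10)
print(round(man_no_education, 3))  # .388 + .039
print(round(gap_at_10_years, 3))
```

Because the printed coefficients are rounded to three decimals, predictions computed this way may differ slightly in the third decimal from values based on the unrounded estimates.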
Problems of Functional Form

The conceptual problem with linear regression with a binary dependent variable stems from the fact that probabilities have maximum and minimum values of 1 and 0. By definition, probabilities and proportions cannot exceed 1 or fall below 0. Yet, the linear regression line will continue to extend upward as the values of the independent variables increase, and continue to extend downward as the values of the independent variables decrease. Depending on the slope of the line and the observed X values, a model can give predicted values of the dependent variable above 1 and below 0. Such values make no sense and have little predictive use.

A few charts can illustrate the problem. The normal scatterplot of two continuous variables shows a cloud of points as in Figure 1.1(a). Here, a line through the middle of the cloud of points would minimize the sum of squared deviations. Further, at least theoretically, as X extends on to higher or lower levels, so does Y. The same straight line can predict the large Y values associated with large X values as well as it can predict medium or small values. The scatterplot of a relationship of a continuous independent variable to a dummy dependent variable in Figure 1.1(b), however, does not portray a cloud of points. It instead shows two parallel sets of points. Fitting a straight line seems less appropriate here. Any line (except one with a slope of 0) will eventually exceed 1 and fall below 0.

Figure 1.1 (a) Scatterplot, continuous variables and (b) scatterplot, dummy dependent variable. (Horizontal axis: X; vertical axis: Y.)

Some parts of the two parallel sets of points may contain more cases than others, and certain graphing techniques can reveal the density of cases along the two lines. For example, jittering reduces overlap of the scatterplot points by adding random variation to each case. In Figure 1.2, the jittered distribution for a binary dependent variable (smokes or does not smoke) by years of education suggests a slight relationship. Cases with higher education appear less likely to smoke than cases with lower education. Still, Figure 1.2 differs from plots between continuous variables.

Figure 1.2 Jittered scatterplot for a binary dependent variable, smoking or nonsmoking, by years of education. (Horizontal axis: Education; vertical axis: Smoking.)

Predicted probabilities below 0 or above 1 can occur, depending on the skew of the outcome, the range of values of the independent variable, and the strength of the relationship. With a skewed binary dependent variable, that is, with an uneven split in the two categories, predicted values tend to fall toward the extremes. In the example of smoking, where the split equals 15:85, the lowest predicted value of .062 occurs for women with the maximum education of 18; the highest predicted value of .427 occurs for men with the minimum education of 0. However, simply adding age to the model produces predicted values below 0 for females aged 75 years and over with 18 years of education.

The same problem can occur with a less skewed dependent variable. From 1973 to 2016, the General Social Survey (GSS) asked respondents if they agree that the use of marijuana should be made legal. With the 30% agreeing coded 1 and the 70% disagreeing coded 0, a regression of agreement (M) on years of education (E), a dummy variable for gender (G) with males coded 1, and a measure of survey year (Y) with the first year, 1973, coded 0 and each year thereafter coded as the years since 1973, gives

M = −.104 + .017E + .083G + .007Y

The intercept, which applies to females with no years of education responding in 1973, gives a nonsensical predicted probability well below 0. Although a problem in general, reliance on the assumption of linearity proves particularly inappropriate in this model.¹
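The boundary problem in the marijuana equation can be checked directly. The following Python sketch encodes the rounded coefficients printed above; the function name and the 0 to 20 search range for education are illustrative assumptions rather than details taken from the GSS or the book's companion code.

```python
def predicted_support(education_years, male, survey_year):
    """Predicted probability of favoring legalization from the linear equation
    M = -.104 + .017E + .083G + .007Y, where Y counts years since 1973."""
    years_since_1973 = survey_year - 1973
    return (-0.104
            + 0.017 * education_years
            + 0.083 * male
            + 0.007 * years_since_1973)

# The intercept case: a woman with no education responding in 1973.
baseline = predicted_support(0, 0, 1973)

# Smallest whole number of years of education at which the prediction for a
# woman in 1973 stops being negative (search range 0-20 is an assumption).
first_nonnegative = next(
    e for e in range(0, 21) if predicted_support(e, 0, 1973) >= 0
)

# For contrast, a man with 12 years of education responding in 2016.
later_case = predicted_support(12, 1, 2016)

print(round(baseline, 3))
print(first_nonnegative)
print(round(later_case, 3))
```

Under these rounded coefficients, every woman interviewed in 1973 with fewer than 7 years of education receives a predicted probability below 0, which is the kind of out-of-range prediction the text describes.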
Alternative to Linearity

One solution to the boundary problem would assume that any value equal to or above 1 should be truncated to the maximum value of 1. The regression line would be straight until this maximum value, but afterward changes in the independent variables would have no influence on the dependent variable. The same would hold for small values, which could be truncated at 0. Such a pattern would define sudden discontinuities in the relationship, whereby at certain points the effect of X on Y would change immediately to 0 (see Figure 1.3(a)).

Figure 1.3 (a) Truncated linear relationship and (b) S-shaped curve. (Horizontal axis: X; vertical axis: Y.)

However, another functional form of the relationship might make more theoretical sense than truncated linearity. With a floor and a ceiling, it seems likely that the effect of a unit change in the independent variable on the predicted probability would be smaller near the floor or ceiling than near the middle. Toward the middle of a relationship, the nonlinear curve may approximate linearity, but rather than continuing upward or downward at the same rate, the nonlinear curve would bend slowly and smoothly so as to approach 0 and 1. As values get closer and closer to 0 or 1, the relationship requires a larger and larger change in the independent variable to have the same impact as a smaller change in the independent variable at the middle of the curve. To produce a change in the probability of the outcome from .95 to .96 requires a larger change in the independent variable than it does to produce a change in the probability from .45 to .46. The general principle is that the same additional input has less impact on the outcome near the ceiling or floor, and that increasingly larger inputs are needed to have the same impact on the outcome near the ceiling or floor.

Several examples illustrate the nonlinear relationship. If income increases the likelihood of owning a home, an increase of $10,000 in income from $70,000 to $80,000 would increase that likelihood more than an increase from $500,000 to $510,000. High-income persons would no doubt already have a high probability of home ownership, and a $10,000 increase would do little to increase their already high probability. The same would hold for an increase in income from $0 to $10,000: since neither income is likely to be sufficient to purchase a house, the increase in income would have little impact on ownership. In the middle range, however, the additional $10,000 may make the difference between being able to afford a house and not being able to afford a house.
Similarly, an increase of 1 year in age on the likelihood of first marriage may have much stronger effects during the late twenties than at younger or older ages. Few will marry under age 18 despite growing a year older, and few unmarried by 50 will likely marry by age 51. However, the change from age 29 to 30 may result in a substantial increase in the likelihood of marriage. The same kind of reasoning would apply in numerous other instances: the effect of the number of delinquent peers on the likelihood of committing a serious crime, the effect of the hours worked by women on the likelihood of having a child, the effect of the degree of party identification on the support for a political candidate, and the effect of drinking behavior on premature death are all likely stronger at the midrange of the independent variables than at the extremes.

A more appropriate nonlinear relationship would look like that in Figure 1.3(b), where the curve levels off and approaches the ceiling of 1 and the floor of 0. Approximating the curve would require a succession of straight lines, each with different slopes. The lines nearer the ceiling and floor would have smaller slopes than those in the middle. However, a constantly changing curve more smoothly and adequately represents the relationship. Conceptually, the S-shaped curve makes better sense than the straight line.

Within the range of a sample, the linear regression line may approximate a curvilinear relationship by taking the average of the diverse slopes implied by the curve. However, the linear regression line still understates the actual relationship in the middle and overstates it at the extremes (unless the independent variable has values only in a region where the curve is nearly linear). Figure 1.4 compares the S-shaped curve with the straight line; the gap between the two illustrates the nature of the error and the potential inaccuracy of linear regression.

Figure 1.4 Linear versus curvilinear relationship. (Horizontal axis: X; vertical axis: Y, from −0.25 to 1.25.)
Nonadditivity

The ceiling and floor create another conceptual problem besides nonlinearity in regression models of a dichotomous response. Regression typically assumes additivity, that is, that the effect of one independent variable on the dependent variable stays the same regardless of the levels of the other independent variables. Models can include selected product terms to account for nonadditivity, but a binary dependent variable likely violates the additivity assumption for all combinations of the independent variables. If the value of one independent variable reaches a sufficiently high level to push the probability of the dependent variable to near 1 (or to near 0), then the effects of other variables cannot have much influence. Thus, the ceiling and floor make the influence of all the independent variables inherently nonadditive and interactive.

To return to the smoking example, those persons with 20 years of education have such a low probability of smoking that only a small difference exists between men and women; in other words, gender has little effect on smoking at high levels of education. In contrast, larger gender differences likely exist when education is lower and the probability of smoking is higher. Although the effect of gender on smoking likely varies with the level of education, additive regression models assume that the effect is identical for all levels of education (and the effect of education is identical for men and women). One can use interaction terms in a regression model to partly capture nonadditivity, but that does not address the nonadditivity inherent in all relationships in a probability model.
Problems of Statistical Inference

Even if a straight line approximates the nonlinear relationship in some instances, other problems emerge that, despite leaving the estimates unbiased, reduce their efficiency. The problems involve the fact that regression with a binary dependent variable violates the assumptions of normality and homoscedasticity. Both these problems stem from the existence of only two observed values for the dependent variable. Linear regression assumes that in the population, a normal distribution of error values around the predicted Y is associated with each X value, and that the dispersion of the error values for each X value is the same. The assumptions imply normal and similarly dispersed error distributions. Yet, with a dummy variable, only two Y values and only two residuals exist for any single X value. For any value Xi, the predicted probability equals b0 + b1Xi. Therefore, the residuals take the values of

$$
\begin{aligned}
&1 - (b_0 + b_1 X_i) && \text{when } Y_i \text{ equals } 1, \text{ and} \\
&0 - (b_0 + b_1 X_i) && \text{when } Y_i \text{ equals } 0.
\end{aligned}
$$
Even in the population, the distribution of errors for any X value will not be normal when the distribution has only two values. The error term also violates the assumption of homoscedasticity or equal variances because the regression error term varies with the value of X. To illustrate this graphically, review Figure 1.1(b), which plots the relationship between X and a dummy dependent variable. Fitting a straight line that goes from the lower left to the upper right of the figure would define residuals as the vertical distance from the points to the line. Near the lower and upper extremes of X, where the line comes close to the floor of 0 and the ceiling of 1, the residuals are relatively small. Near the middle values of X, where the line falls halfway between the ceiling and floor, the residuals are relatively large. As a result, the variance of the errors is not constant (Greene, 2008, p. 775).2

While nonnormality creates few problems with large samples, heteroscedasticity has more serious implications. The sample estimates of the population regression coefficients are unbiased, but they no longer have the smallest variance and the sample estimates of the standard errors will be too small. Thus, even with large samples, the standard errors in the presence of heteroscedasticity will be incorrect, and tests of significance will be biased in the direction of being too generous. Using robust standard errors or weighted least squares estimates can deal with this problem, but they do not solve the conceptual problems of nonlinearity and nonadditivity.
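The link between the two possible residuals and the nonconstant error variance can be checked numerically. The sketch below assumes that the predicted value b0 + b1Xi equals the true probability P at that X value; under that assumption the error variance works out to P(1 − P), which peaks at P = .5 and shrinks toward the floor and ceiling. The function name `error_variance` is hypothetical and used only for illustration.

```python
def error_variance(p):
    """Variance of the error for a binary outcome with predicted probability p.

    The residual equals 1 - p with probability p (when Y = 1) and
    0 - p with probability 1 - p (when Y = 0).
    """
    resid_if_one = 1 - p
    resid_if_zero = 0 - p
    # The expected residual is 0 under the stated assumption.
    mean = p * resid_if_one + (1 - p) * resid_if_zero
    return (p * (resid_if_one - mean) ** 2
            + (1 - p) * (resid_if_zero - mean) ** 2)


for p in (0.05, 0.25, 0.50, 0.75, 0.95):
    print(f"P = {p:.2f}  error variance = {error_variance(p):.4f}")
```

The printed variances differ across values of P, which is the heteroscedasticity described above.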
Transforming Probabilities Into Logits

To review, linear regression has problems in dealing with a dependent variable having only two values, a ceiling of 1 and a floor of 0: the same change in X has a different effect on Y depending on how close the curve corresponding to any X value comes to the maximum or minimum Y value. We need a transformation of the dependent variable that captures the decreasing effects of X on Y as the predicted Y value approaches the floor or ceiling. We need, in other words, to eliminate the floor and ceiling inherent in binary outcomes and probabilities.

The logistic function and logit transformation define one way to deal with the boundary problem. Although many nonlinear functions can represent the S-shaped curve (Agresti, 2013, Chapter 7), the logistic or logit transformation has become popular because of its desirable properties and relative simplicity. The logistic function takes probabilities as a nonlinear function of X in a way that represents the S-shaped curve in Figure 1.3(b). We will review the logistic function in more detail shortly. For now, simply note that the function defines a relationship between the values of X and the S-shaped curve in probabilities. As will become clear, the probabilities need to be transformed in a way that defines a linear rather than nonlinear relationship with X. The logit transformation does this.

Assume that each value of Xi has a probability of having a characteristic or experiencing an event, defined as Pi. Since the dependent variable has values of only 0 and 1, this Pi must be estimated, but it helps to treat the outcome in terms of probabilities for now. Given this probability, the logit transformation involves two steps. First, take the ratio of Pi to 1 – Pi, or the odds of the outcome. Second, take the natural logarithm of the odds. The logit thus equals Li = ln [Pi/(1 – Pi)], or, in short, the logged odds.

It is worth seeing how the equation works with a few numbers. For example, if Pi equals .2, the odds equal .25 or .2/.8, and the logit equals –1.386, the natural log of the odds. If Pi equals .7, the odds equal 2.33 or .7/.3, and the logit equals 0.847. If Pi equals .9, the odds equal 9 or .9/.1, and the logit equals 2.197. Although the computational formula to convert probabilities into logits is straightforward, it requires some explanation to show its usefulness. It turns out to transform the S-shaped nonlinear relationship between independent variables and a distribution of probabilities into a linear relationship.
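The two steps of the logit transformation can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the text; the function name `logit` is an assumption, not something defined in the book.

```python
import math


def logit(p):
    """Logged odds of a probability p, defined only for 0 < p < 1."""
    odds = p / (1 - p)     # step 1: probability to odds
    return math.log(odds)  # step 2: natural log of the odds


for p in (0.2, 0.7, 0.9):
    print(f"P = {p}: odds = {p / (1 - p):.3f}, logit = {logit(p):.3f}")
```

Rounded to three decimals, the logits match the values worked out above (–1.386, 0.847, and 2.197).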
Meaning of Odds

The logit begins by transforming probabilities into odds. Probabilities vary between 0 and 1, and express the likelihood of an outcome as a proportion of both occurrences and nonoccurrences. Odds or P/(1−P) express the likelihood of an occurrence relative to the likelihood of a nonoccurrence. Both probabilities and odds have a lower limit of 0, and both express the increasing likelihood of an outcome with increasingly large positive numbers, but otherwise they differ.

Unlike a probability, odds have no upper bound or ceiling. As a probability gets closer to 1, the numerator of the odds becomes larger relative to the denominator, and the odds become an increasingly large number. The odds thus increase greatly when the probabilities change only slightly near their upper boundary of 1. For example, probabilities of .99, .999, .9999, .99999, and so on result in odds of 99, 999, 9999, 99999, and so on. Tiny changes in probabilities result in huge changes in the odds and show that the odds increase toward infinity as the probabilities come closer and closer to 1.

To illustrate the relationship between probabilities and odds, examine the values below:

Pi       .01   .1    .2    .3    .4    .5    .6    .7    .8    .9    .99
1 – Pi   .99   .9    .8    .7    .6    .5    .4    .3    .2    .1    .01
Odds     .01   .111  .25   .429  .667  1     1.5   2.33  4     9     99
Note that when the probability equals .5, the odds equal 1 or are even. As the probabilities increase toward one, the odds no longer have the ceiling of the probabilities. As the probabilities decrease toward 0, however, the odds still approach 0. At least at one end, then, the transformation allows values to extend linearly beyond the limit of 1.

Manipulating the formula for odds gives further insight into their relationship to probabilities. Beginning with the definition of odds (Oi) as the ratio of the probability to one minus the probability, we can with simple algebra express the probability in terms of odds: Pi/(1 – Pi) = Oi implies that Pi = Oi/(1 + Oi). The probability equals the odds divided by one plus the odds.3 Based on this formula, the odds can increase to infinity, but the probability can never equal or exceed one. No matter how large the odds become in the numerator, they will always be smaller by one than the denominator. Of course, as the odds become large, the gap between the odds and the odds plus 1 will become relatively small and the probability will approach (but not reach) one. To illustrate, the odds of 9 translate into a probability of .9, as 9/(9+1) = .9, the odds of 999 translate into a probability of .999 (999/1000 = .999), and the odds of 9999 translate into a probability of .9999, and so on.

Conversely, the probability can never fall below 0. As long as the odds equal or exceed 0, the probability must equal or exceed 0. The smaller the odds in the numerator become, the larger the relative size of the 1 in the denominator. The probability comes closer and closer to 0 as the odds come closer and closer to 0.

Usually, the odds are expressed as a single number, taken implicitly as a ratio to 1. Odds above 1 mean the outcome is more likely to occur than to not occur. Thus, odds of 10 imply the outcome will occur 10 times for each time it does not occur. Since the single number can be a fraction, there is no need to keep both the numerator and denominator as whole numbers. The odds of 7 to 3 can be expressed equally well as a single number of 2.33 (to 1).
Even odds equal 1 (1 occurrence to 1 nonoccurrence). Odds below 1 mean the outcome is less likely to occur than it is to not occur. If the probability equals .3, the odds are .3/.7 or .429. This means the outcome occurs .429 times per each time it does not occur. It could also be expressed as 42.9 occurrences per 100 nonoccurrences.
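The two formulas relating probabilities and odds, Oi = Pi/(1 – Pi) and Pi = Oi/(1 + Oi), can be checked against each other with a short sketch. The function names below are hypothetical and chosen only for illustration.

```python
def odds_from_prob(p):
    """Odds of an outcome with probability p (requires 0 <= p < 1)."""
    return p / (1 - p)


def prob_from_odds(odds):
    """Probability implied by odds (requires odds >= 0)."""
    return odds / (1 + odds)


# Large odds push the probability toward, but never to, 1.
for odds in (0.429, 1, 9, 999, 9999):
    print(f"odds = {odds:>8}: probability = {prob_from_odds(odds):.4f}")

# Converting a probability to odds and back recovers the probability.
print(prob_from_odds(odds_from_prob(0.3)))
```

Because the denominator always exceeds the numerator by 1, the probability returned by `prob_from_odds` stays below 1 however large the odds become.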
Comparing Odds

Expressed as a single number, any odds can be compared to another odds, but the comparison is based on multiplying rather than on adding. Odds of 9 to 1 are three times higher than odds of 3 to 1. Odds of 3 are one-third the size of odds of 9. Odds of .429 are .429 the size of even odds of 1, or half the size of odds of .858. In each example, one odds is expressed as a multiple of the other.

It is often useful to compare two different odds as a ratio. Consider the odds of an outcome for two different groups. The ratio of odds of 8 and 2 equals 4, which shows that the odds of the former group are four times as large as (or 400% of) the odds of the latter group. If the odds ratio is below 1, then the odds of the first group are lower than the second group. An odds ratio of .5 means the odds of the first group are only half or 50% the size of the second group. The closer the odds ratio to 0, the lower the odds of the first group relative to the second. An odds ratio of one means the odds of both groups are identical. Finally, if the odds ratio is above one, the odds of the first group are higher than the second group. The greater the odds ratio, the higher the odds of the first group relative to the second.

To prevent confusion, keep in mind the distinction between odds and odds ratios. Odds refer to a ratio of probabilities, while odds ratios refer to ratios of odds (or a ratio of probability ratios). According to the 2016 GSS, for example, 65.9% of men and 57.2% of women favor legalization of marijuana. Since the odds of support for men equal 1.93 (.659/.341), it indicates that around 1.9 men support legalization for 1 who does not. The odds of support for legalization among women equal 1.34 (.572/.428) or about 1.3 women support legalization for 1 who does not. The ratio of odds of men to women equals 1.93/1.34 or 1.44. This odds ratio is a group comparison. It reflects the higher odds of supporting legalization for men than women. It means specifically that the odds of supporting legalization are 1.44 times as large for men as for women.

In summary, reliance on odds rather than probabilities provides for meaningful interpretation of the likelihood of an outcome, and it eliminates the upper boundary.
Odds will prove useful later for interpreting coefficients, but note now that creating odds represents the first step of the logit transformation.
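The GSS comparison can be reproduced with a minimal sketch that takes the two percentages from the text as inputs. The variable names are illustrative assumptions. Note that dividing the unrounded odds gives a ratio of roughly 1.45, while dividing the odds after rounding them to two decimals, as the text does, gives 1.44.

```python
def odds(p):
    """Odds of an outcome with probability p."""
    return p / (1 - p)


p_men, p_women = 0.659, 0.572  # proportions favoring legalization (2016 GSS, from the text)

odds_men = odds(p_men)
odds_women = odds(p_women)

# Odds ratio from unrounded odds versus odds rounded to two decimals.
ratio_unrounded = odds_men / odds_women
ratio_rounded = round(odds_men, 2) / round(odds_women, 2)

print(f"odds, men:   {odds_men:.2f}")
print(f"odds, women: {odds_women:.2f}")
print(f"odds ratio (unrounded odds): {ratio_unrounded:.3f}")
print(f"odds ratio (rounded odds):   {ratio_rounded:.3f}")
```

Either way, the ratio compares odds, not probabilities: the ratio of the two proportions themselves (.659/.572) is a different and smaller number.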
Logged Odds

Taking the natural log of the odds eliminates the floor of 0 much as transforming probabilities into odds eliminates the ceiling of 1. Taking the natural log of odds above 0 but below 1 produces negative numbers; of odds equal to 1 produces 0; and of odds above 1 produces positive numbers. (The logs of values equal to or below 0 do not exist; see the Appendix for an introduction to logarithms and their properties.)

The first property of the logit, then, is that, unlike a probability, it has no upper or lower boundary. The odds eliminate the upper boundary of probabilities, and the logged odds eliminate the lower boundary of probabilities as well. To see this, if Pi = 1, the logit is undefined because the odds of 1/(1 − 1) or 1/0 do not exist. As the probability comes closer and closer to 1, however, the logit moves toward positive infinity. If Pi = 0, the logit is undefined because the odds equal zero (0/(1 − 0) = 0) and the log of 0 does not exist. As the probability comes closer and closer to 0, however, the logit proceeds toward negative infinity. Thus, the logits vary from negative infinity to positive infinity. The ceiling and floor of the probabilities (and the floor of the odds) disappear.

The second property is that the logit transformation is symmetric around the midpoint probability of .5. The logit when Pi = .5 is 0 (.5/.5 = 1, and the log of 1 equals 0). Probabilities below .5 result in negative logits because the odds fall below 1 and above 0; Pi is smaller than 1 – Pi, thereby resulting in a fraction, and the log of a fraction results in a negative number (see the Appendix). Probabilities above .5 result in positive logits because the odds exceed 1 (Pi is larger than 1 – Pi). Furthermore, probabilities the same distance above and below .5 (e.g., .6 and .4, .7 and .3, .8 and .2) have logits of the same size but opposite signs (e.g., the logits for these probabilities equal, in order, .405 and –.405, .847 and –.847, 1.386 and –1.386). The distance of the logit from 0 reflects the distance of the probability from .5 (again noting, however, that the logits do not have boundaries as do the probabilities).

The third property is that the same change in probabilities translates into different changes in the logits. The principle is that as Pi comes closer to 0 and 1, the same change in the probability translates into a greater change in the logged odds.
You can see this by example:

Pi       .1     .2     .3     .4     .5   .6    .7    .8    .9
1 − Pi   .9     .8     .7     .6     .5   .4    .3    .2    .1
Odds     .111   .25    .429   .667   1    1.5   2.33  4     9
Logit    –2.20  –1.39  –.847  –.405  0    .405  .847  1.39  2.20
A change in probabilities of .1 from .5 to .6 (or from .5 to .4) results in a change of .405 in the logit, whereas the same probability change of .1 from .8 to .9 (or from .2 to .1) results in a change of .811 in the logit. The change in the logit for the same change in the probability is twice as large at this extreme as in the middle. To repeat, the general principle is that small differences in probabilities result in increasingly larger differences in logits when the probabilities are near the bounds of 0 and 1.
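The second and third properties can be verified with a brief sketch that reuses the definition of the logit. The helper name `logit` is again an assumption made for illustration.

```python
import math


def logit(p):
    """Logged odds of a probability p, defined only for 0 < p < 1."""
    return math.log(p / (1 - p))


# Symmetry: probabilities equally far from .5 have logits of equal size
# and opposite sign.
for low, high in ((0.4, 0.6), (0.3, 0.7), (0.2, 0.8)):
    print(f"logit({low}) = {logit(low):.3f}, logit({high}) = {logit(high):.3f}")

# The same .1 change in probability gives different changes in the logit.
mid_change = logit(0.6) - logit(0.5)
end_change = logit(0.9) - logit(0.8)
print(f"change from .5 to .6: {mid_change:.3f}")
print(f"change from .8 to .9: {end_change:.3f}")
print(f"ratio of the two changes: {end_change / mid_change:.3f}")
```

The ratio is exactly 2 here because the change from .8 to .9 equals ln(9/4), which is twice ln(1.5), the change from .5 to .6.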
Linearizing the Nonlinear

It helps to view the logit transformation as linearizing the inherent nonlinear relationship between X and the probability of Y. We would expect the same change in X to have a smaller impact on the probability of Y near the floor or ceiling than near the midpoint. Because the logit expands or stretches the probabilities of Y at extreme values relative to the values near the midpoint, the same change in X comes to have similar effects throughout the range of the logit transformation of the probability of Y. Without a floor or ceiling, in other words, the logit can relate linearly to changes in X. One can now treat a relationship between X and the logit transformation as linear. The logit transformation straightens out the nonlinear relationship between X and the original probabilities.

Conversely, the linear relationship between X and the logit implies a nonlinear relationship between X and the original probabilities. A unit change in the logit results in smaller differences in probabilities at high and low levels than at levels in the middle. Just as we translate probabilities into logits, we can translate logits into probabilities (the formula to do this is discussed shortly):

Logit    –3     –2     –1     0      1      2      3
Pi       .047   .119   .269   .5     .731   .881   .953
Change   n/a    .072   .150   .231   .231   .150   .072
A one-unit change in the logit translates into a greater change in probabilities near the midpoint than near the extremes. In other words, linearity in logits defines a theoretically meaningful nonlinear relationship with the probabilities.
Obtaining Probabilities From Logits

The logit transformation defines a linear relationship between the independent variables and a binary dependent variable. The linear relationship of X with the predicted logit appears in the following regression model: ln[Pi/(1 – Pi)] = b0 + b1Xi. Like any linear equation, the coefficient b0 shows the intercept or logged odds when X equals 0 and the b1 coefficient shows the slope or the change in the logged odds for a unit change in X. The difference is that the dependent variable has been transformed from probabilities into logged odds.
To express the probabilities rather than the logit as a function of X, first take each side of the equation as an exponent. Because the exponent of a logarithm of a number equals the number itself (e raised to ln X equals X), exponentiation or taking the exponential eliminates the logarithm on the left side of the equation:

$$
P_i/(1 - P_i) = e^{b_0 + b_1 X_i} = e^{b_0} \times e^{b_1 X_i}.
$$

Furthermore, the equation can be presented in multiplicative form because the exponential of X + Y equals the exponential of X times the exponential of Y. Thus, the odds change as a function of the coefficients treated as exponents. Solving for Pi gives the following formula4:

$$
P_i = e^{b_0 + b_1 X_i} / (1 + e^{b_0 + b_1 X_i}).
$$

To simplify, define the predicted logit Li as ln[Pi/(1 – Pi)], which is equal to b0 + b1Xi. We can then replace the longer formula by Li in the equation, remembering that Li is the logged odds predicted by the value of Xi and the coefficients b0 and b1. Then

$$
P_i = e^{L_i} / (1 + e^{L_i}).
$$

This formula takes the probability as a ratio of the exponential of the logit to 1 plus the exponential of the logit. Given that $e^{L_i}$ produces odds, the formula corresponds to the equation Pi = Oi/(1 + Oi) presented earlier. Moving from logits to exponents of logits to probabilities shows

L          –4.61  –2.30  –1.61  –.223  0     1.61   2.30   4.61   6.91
e^L        .01    .1     .2     .8     1     5      10     100    1000
1 + e^L    1.01   1.1    1.2    1.8    2     6      11     101    1001
P          .010   .091   .167   .444   .5    .833   .909   .990   .999
Note first that the exponentials of the negative logits fall between 0 and 1, and that the exponentials of the positive logits exceed 1. Note also that the ratio of the exponential to the exponential plus 1 will always fall below one—the denominator will always exceed the numerator by 1. The transformation of logits into probabilities replicates the S-shaped curve in Figures 1.3(b) and 1.4. With logits defining the X-axis and probabilities defining the Y-axis, the logits range from negative infinity to positive infinity, but the probabilities will stay within the bounds of 0 and 1.
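The algebra behind the step from odds to probabilities can be written out explicitly. The derivation below is a reconstruction using only the definitions already given (the predicted logit Li and the odds Pi/(1 – Pi)); it is offered as a worked check rather than as the author's own presentation.

```latex
\begin{align*}
\frac{P_i}{1 - P_i} &= e^{L_i}
  && \text{odds equal the exponential of the logit} \\
P_i &= e^{L_i}\,(1 - P_i)
  && \text{multiply both sides by } 1 - P_i \\
P_i + P_i\,e^{L_i} &= e^{L_i}
  && \text{collect the terms in } P_i \\
P_i\,(1 + e^{L_i}) &= e^{L_i}
  && \text{factor out } P_i \\
P_i &= \frac{e^{L_i}}{1 + e^{L_i}}
  && \text{divide by } 1 + e^{L_i}
\end{align*}
```

Since the exponential of any real number is positive, the last line shows directly why the probability stays strictly between 0 and 1.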
Consider how this transformation demonstrates nonlinearity. For a one-unit change in X, L changes by a constant amount but P does not. The exponents in the formula for Pi make the relationship nonlinear. Consider an example. If Li = 2 + .3Xi, the logged odds change by .3 for a one-unit change in X regardless of the level of X. If X changes from 1 to 2, L changes from 2 + .3 or 2.3 to 2 + .3 × 2 or 2.6. If X changes from 11 to 12, L changes from 5.3 to 5.6. In both cases, the change in L is identical. This defines linearity. Take the same values of X, and the L values they give, and note the changes they imply in the probabilities:

X          1       2       11      12
L          2.3     2.6     5.3     5.6
e^L        9.97    13.46   200.3   270.4
1 + e^L    10.97   14.46   201.3   271.4
P          .909    .931    .995    .996
Change     n/a     .022    n/a     .001
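The values in the table above can be recomputed with a short sketch. The coefficients 2 and .3 come from the example Li = 2 + .3Xi; the function and constant names are illustrative assumptions.

```python
import math

B0, B1 = 2.0, 0.3  # intercept and slope from the example Li = 2 + .3Xi


def predicted_logit(x):
    """Logged odds predicted by the linear equation."""
    return B0 + B1 * x


def prob_from_logit(logit):
    """Probability as the exponential of the logit over 1 plus that exponential."""
    e_l = math.exp(logit)
    return e_l / (1 + e_l)


for x in (1, 2, 11, 12):
    l = predicted_logit(x)
    print(f"X = {x:>2}  L = {l:.1f}  e^L = {math.exp(l):7.2f}  "
          f"P = {prob_from_logit(l):.3f}")

# The same one-unit change in X gives unequal changes in P.
change_low = prob_from_logit(predicted_logit(2)) - prob_from_logit(predicted_logit(1))
change_high = prob_from_logit(predicted_logit(12)) - prob_from_logit(predicted_logit(11))
print(f"change in P, X from 1 to 2:   {change_low:.3f}")
print(f"change in P, X from 11 to 12: {change_high:.3f}")
```

The change in L is .3 in both comparisons, while the change in P differs by more than an order of magnitude.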
The same change in L due to a unit change in X results in a greater change in the probabilities at lower levels of X and P than at higher levels. The same would show at the other end of the probability distribution. This nonlinearity between the logit and the probability creates a fundamental problem of interpretation. We can summarize the effect of X on the logit simply in terms of a single linear coefficient, but we cannot do the same with the probabilities: the effect of X on the probability varies with the value of X and the level of the probability. The complications in interpreting the effects on probabilities require a separate chapter on the meaning of logistic regression coefficients. However, dealing with problems of interpretation proves easier having fully discussed the logic of the logit transformation.

One last note. For purposes of calculation, the formula for probabilities as a function of the independent variables and coefficients takes a somewhat simpler but less intuitive form:

$$
P_i = e^{b_0 + b_1 X_i} / (1 + e^{b_0 + b_1 X_i}),
$$

$$
P_i = 1 / (1 + e^{-(b_0 + b_1 X_i)}),
$$

$$
P_i = 1 / (1 + e^{-L_i}).
$$
This gives the same result as the other formula.5 If the logit equals –2.302, then we must solve for P = e^(–2.302)/(1 + e^(–2.302)) or 1/(1 + e^(–(–2.302))). The exponential of –2.302 equals approximately .1, and the exponential of the negative of –2.302 or 2.302 equals 9.994. Thus, the probability equals .1/1.1 or .091, or calculated alternatively equals 1/(1 + 9.994) or .091. The same calculations can be done for any other logit value to get probabilities.
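A quick numerical check, sketched below, confirms that the two forms of the formula agree. The function names `prob_long_form` and `prob_short_form` are hypothetical labels for the two expressions in the text.

```python
import math


def prob_long_form(logit):
    """P = e^L / (1 + e^L)."""
    return math.exp(logit) / (1 + math.exp(logit))


def prob_short_form(logit):
    """P = 1 / (1 + e^(-L))."""
    return 1 / (1 + math.exp(-logit))


L = -2.302  # logit used in the worked example

print(f"e^L    = {math.exp(L):.4f}")
print(f"e^(-L) = {math.exp(-L):.3f}")
print(f"long form:  P = {prob_long_form(L):.3f}")
print(f"short form: P = {prob_short_form(L):.3f}")
```

The short form follows from the long form by dividing the numerator and denominator by the exponential of the logit.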
Summary

This chapter reviews how the logit transforms a dependent variable having inherent nonlinear relationships with a set of independent variables into a dependent variable having linear relationships with a set of independent variables. Logistic regression models (also called logit models) thus estimate the linear determinants of the logged odds or logit rather than the nonlinear determinants of the probabilities. Obtaining these estimates involves complexities left until later chapters. In the meantime, however, it helps to view logistic regression as analogous to linear regression on a dependent variable that has been transformed to remove the floor and ceiling.

Another justification of the logistic regression model and the logit transformation takes a different approach than offered in this chapter. It assumes that an underlying, unobserved, or latent continuous dependent variable exists. It then derives the logistic regression model by making assumptions about the shape of the distribution of the underlying unobserved values and its relationship to the observed values of 0 and 1 for the dependent variable. This derivation ends up with the same logistic regression model but offers some insights that may be useful. See, for example, Long (1997, pp. 40–51), Maddala and Lahiri (2009, p. 333), or Greene (2008, pp. 776–777).6

In linearizing the nonlinear relationships, logistic regression also shifts the interpretation of coefficients from changes in probabilities to less intuitive changes in logged odds. The loss of interpretability with the logistic coefficients, however, is balanced by the gain in parsimony: the linear relationship with the logged odds can be summarized with a single coefficient, while the nonlinear relationship with the probabilities is less easily summarized. Efforts to interpret logistic regression coefficients in meaningful and intuitive ways define the topic of the next chapter.
Chapter 2 INTERPRETING LOGISTIC REGRESSION COEFFICIENTS

As is true for nonlinear transformations more generally, the effects of the independent variables in logistic regression have multiple interpretations. Effects exist for probabilities, odds, and logged odds, and the interpretation of each effect varies. To preview, the effects of the independent variables on the logged odds are linear and additive (each X variable has the same effect on the logged odds regardless of its level or the level of other X variables), but the units of the dependent variable, logged odds, have little intuitive meaning. The effects of the independent variables on the probabilities have intuitive meaning but are nonlinear and nonadditive (each X variable has a different effect on the probability depending on its level and the level of the other independent variables). Despite the interpretable units, the effects on probabilities are less easily summarized in the form of a single coefficient. The interpretation of the effects of the independent variables on the odds offers a popular alternative. The odds have more intuitive appeal than the logged odds and can still express effects in single coefficients, but the effects on odds are multiplicative rather than additive.

This chapter examines the multiple ways to interpret effects in logistic regression results. It gives particular attention to interpretations of probability effects, the most informative but also the most complex way to understand logistic regression results.
Logged Odds

One interpretation directly uses the coefficients obtained from the estimates of a logistic regression model. The logistic regression coefficients show the change in the predicted logged odds of experiencing an event or having a characteristic for a one-unit increase in the independent variables, holding other independent variables constant. The coefficients are similar to linear regression coefficients in that a single linear and additive coefficient summarizes the relationship. The difference is that the dependent variable takes the form of logged odds.

Consider an example. Returning to the 2017 National Health Interview Survey (NHIS) data and the binary outcome measure of currently smokes, a simple model includes continuous measures of age (26–85+) and years of education (0–18) plus categorical measures of gender (a dummy variable with males coded 1), race (four dummy variables with whites as the referent), and Hispanic ethnicity (with Hispanics coded 1). The sample size is 23,786. Selected output from a logistic regression in Stata produces the results in Table 2.1.

For the continuous variables, the predicted logged odds of smoking on average decrease by .183 with a 1-year increase in education and by .024 with a 1-year increase in age, controlling for other predictors. For the categorical variables, a change of one unit implicitly compares the indicator group to the reference or omitted group. The coefficient of .254 for gender indicates that the predicted logged odds of smoking are higher by .254 for men than women. The coefficients for race show that, compared to whites, the log odds of smoking are lower by .084 for African Americans, higher by .288 for Native Americans, lower by .810 for Asian Americans, and higher by .436 for multi-race respondents. The gap relative to whites is largest for Asian Americans with the controls, but the gap between Asian Americans and multi-race respondents is still larger. An additional measure shows that the logged odds of smoking are lower by 1.040 for Hispanics than non-Hispanics.

Table 2.1 Stata Output: Logistic Regression Model of Current Smoking, NHIS 2017

Smoker                   Coef.     Std. Err.       z     P>|z|    [95% Conf. Interval]
Education            -.1830747     .0067791   -27.01    0.000    -.1963616   -.1697878
Age                   -.023966     .0011756   -20.39    0.000    -.0262701   -.0216619
Gender
  Male                .2535793     .0369909     6.86    0.000     .1810785    .3260801
Race
  African American   -.0835037     .0574981    -1.45    0.146    -.1961979    .0291905
  Native American     .2882139     .1523181     1.89    0.058    -.0103242     .586752
  Asian American     -.8100691     .1092474    -7.41    0.000     -1.02419    -.595948
  Multiple Race       .4363179     .1180186     3.70    0.000     .2050056    .6676301
Ethnicity
  Hispanic           -1.039563     .0700205   -14.85    0.000    -1.176801   -.9023256
_cons                 2.053864      .127611    16.09    0.000     1.803751    2.303976
The coefficients represent the relationship, as in ordinary regression, with a single coefficient. Regardless of the value of an independent variable (small, medium, or large) or the values of the other independent variables, a one-unit change has the same effect on the dependent variable. According to the model, the difference in the logged odds of smoking between white women and men is the same as the difference in the logged odds between Asian-American women and men. Similarly, the effect of education in the model does not differ between men and women or between any of the race-ethnic groups. Indeed, logistic regression aims to simplify the nonlinear and nonadditive relationships inherent in treating probabilities as dependent variables.

Despite the simplicity of their interpretation, the logistic regression coefficients, as mentioned, lack a meaningful metric and offer little substantive information other than the sign. Statements about the effects of variables on changes in logged odds reveal little about the relationships and do little to help explain the substantive results. Interpreting the substantive meaning or importance of the coefficients requires something more than reporting the expected changes in logged odds.
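The contrast between constant effects on logged odds and varying effects on probabilities can be illustrated with the coefficients reported in Table 2.1. The sketch below is an illustration only: the profile (a white, non-Hispanic respondent aged 40, with 10 or 18 years of education) is a hypothetical choice, and the function names are assumptions rather than part of any Stata or SPSS output.

```python
import math

# Coefficients copied from Table 2.1 (logged-odds metric).
CONS = 2.053864
B_EDUC = -0.1830747
B_AGE = -0.023966
B_MALE = 0.2535793

AGE = 40  # hypothetical age chosen only for illustration


def predicted_logit(educ, age, male):
    """Predicted logged odds of smoking for a white, non-Hispanic
    respondent (all race and ethnicity dummies equal 0)."""
    return CONS + B_EDUC * educ + B_AGE * age + B_MALE * male


def prob_from_logit(logit):
    return 1 / (1 + math.exp(-logit))


def gender_gaps(educ):
    """Return the male-female gap in logged odds and in probabilities."""
    l_women = predicted_logit(educ, AGE, male=0)
    l_men = predicted_logit(educ, AGE, male=1)
    logit_gap = l_men - l_women
    prob_gap = prob_from_logit(l_men) - prob_from_logit(l_women)
    return logit_gap, prob_gap


for educ in (10, 18):
    logit_gap, prob_gap = gender_gaps(educ)
    print(f"education = {educ:>2}: gap in logged odds = {logit_gap:.3f}, "
          f"gap in probability = {prob_gap:.3f}")
```

The gap in logged odds equals the gender coefficient at both education levels, while the gap in predicted probabilities shrinks at the higher level, where smoking is less likely.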
Tests of Significance

Tests of significance often receive much attention, perhaps too much attention, in logistic regression. Because the coefficients have little intuitive meaning in terms of substantive importance, it is easy simply to note which coefficients are statistically significant and which are not. Then, the signs of the significant coefficients offer a crude but quick summary of the results.

As in regression, the size of a coefficient relative to its standard error provides the basis for tests of significance in logistic regression. The logistic regression procedures in Stata and R present the coefficient divided by its standard error, which can be evaluated with the z distribution. The significance of the coefficient (the likelihood that the coefficient in the sample could have occurred by chance alone when the population parameter equals 0) is then interpreted as usual. However, since we know little about the small sample properties of logistic regression coefficients, tests of significance for samples smaller than 100 prove risky (Long, 1997, p. 54).

Table 2.1 lists, along with the coefficients, the standard errors of the coefficients, the z values, the probabilities of the z values under the null hypothesis, and 95% confidence intervals around the coefficients. With a sample size of 23,786, all but two of the coefficients reach statistical significance at the .001 level. The coefficients for Native Americans and African Americans are not significant at the usual .05 level. Although the logistic regression results treat race as four separate dummy variables, it is of course a single categorical measure. It is important to test for the significance of all the racial categories together using a procedure discussed in the next chapter.

The logistic regression procedure in SPSS calculates the Wald statistic for a (two-tailed) test of a single coefficient. Table 2.2 shows SPSS output from the same logistic regression model as presented above. The coefficients and standard errors are identical. The Wald test appears different but in fact is the same as the z value squared (Hosmer, Lemeshow, & Sturdivant, 2013).7 The Wald statistic in SPSS has a chi-square distribution with one degree of freedom. As before, all coefficients except two are significant at .001.

Statistical significance has obvious importance but depends strongly on sample size. The p values provide little information on the strength or substantive meaning of the relationship. Large samples, in particular, can produce significant p values for otherwise small and trivial effects. Despite the common reliance of studies on statistical significance (and the sign of the coefficient) in interpreting logistic regression coefficients, p values best serve only as an initial hurdle to overcome before interpreting the coefficient in other ways.

Table 2.2 SPSS Output: Logistic Regression Model of Current Smoking, NHIS 2017

Step 1(a)        B        S.E.    Wald      df   Sig.   Exp(B)
Education        −.183    .007    729.302   1    .000   .833
Age              −.024    .001    415.612   1    .000   .976
Gender (1)       .254     .037    46.994    1    .000   1.289
Race                              76.138    4    .000
Race (1)         −.084    .057    2.109     1    .146   .920
Race (2)         .288     .152    3.580     1    .058   1.334
Race (3)         −.810    .109    54.982    1    .000   .445
Race (4)         .436     .118    13.668    1    .000   1.547
Ethnicity (1)    −1.040   .070    220.420   1    .000   .354
Constant         2.054    .128    259.040   1    .000   7.798

a. Variable(s) entered on step 1: Education, Age, Gender, Race, and Ethnicity.
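The equivalence between the z value and the Wald statistic can be checked by hand. The following is a minimal sketch, not the procedure any package uses internally: the function name `wald_test` is chosen here for illustration, and the inputs are the estimates and standard errors as printed in the R output (Table 2.3), so results agree with the SPSS table only up to rounding.

```python
import math

def wald_test(b, se):
    """Return (z, Wald chi-square, two-tailed p value) for one coefficient."""
    z = b / se            # coefficient divided by its standard error
    wald = z ** 2         # the Wald statistic is the squared z value
    # Two-tailed p value from the standard normal distribution; this equals
    # the chi-square (1 df) tail probability of z squared.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, wald, p

# Education and African American (Race.f2) rows from the R output.
z_edu, wald_edu, p_edu = wald_test(-0.183075, 0.006779)
z_aa, wald_aa, p_aa = wald_test(-0.083504, 0.057498)

print(f"Education:        z = {z_edu:.3f}, Wald = {wald_edu:.3f}, p = {p_edu:.3g}")
print(f"African American: z = {z_aa:.3f}, Wald = {wald_aa:.3f}, p = {p_aa:.3f}")
```

Squaring the z values recovers, within rounding, the Wald entries of 729.302 and 2.109 shown in Table 2.2.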
Odds

The second interpretation comes from transforming the logistic regression coefficients so that the independent variables affect the odds rather than the logged odds of the dependent variable. Recall that the odds equal the probability of a binary outcome divided by one minus the probability, or P/(1 − P). To find the effects on the odds, take the exponent or antilogarithm of the logistic regression coefficients. Exponentiating both sides of the logistic regression equation eliminates the log of the odds and shows the influences of the variables on the odds. The transformation from logged odds to odds for logistic regression with multiple predictors is as follows:

\[
\begin{aligned}
\ln\left[P/(1-P)\right] &= b_0 + b_1 X_1 + b_2 X_2, \\
e^{\ln[P/(1-P)]} &= e^{b_0 + b_1 X_1 + b_2 X_2}, \\
P/(1-P) &= e^{b_0} \times e^{b_1 X_1} \times e^{b_2 X_2}.
\end{aligned}
\]
With the odds rather than the logged odds as the outcome, the right-hand side of the equation becomes multiplicative rather than additive. The odds are a function of the exponentiated constant $e^{b_0}$ multiplied by the exponentiated product of the coefficient and $X_1$ ($e^{b_1 X_1}$) and the exponentiated product of the coefficient and $X_2$ ($e^{b_2 X_2}$). The effect of each variable on the odds (rather than the logged odds) thus comes from taking the antilog of the coefficients. If not already presented in the output, the exponentiated coefficients can be obtained using any calculator by typing the coefficient and then the $e^x$ function. The exponentiated coefficients of −.183, −.024, and .254 from Tables 2.1 and 2.2 equal, respectively, .833, .976, and 1.289. These are conveniently listed in the last column of the SPSS output and can be easily obtained with options in Stata.

The fact that the equation determining the odds is multiplicative rather than additive shifts the interpretation of the exponentiated coefficients. In an additive equation, a variable has no effect when its coefficient equals 0. The predicted value of the dependent variable sums the values of the variables times the coefficients; when adding 0, the predicted value does not change. In a multiplicative equation, the predicted value of the dependent variable does not change when multiplied by a coefficient of 1. Therefore, 0 in the additive equation corresponds to 1 in the multiplicative equation. Furthermore, the exponential of a positive number exceeds 1 and the exponential of a negative number falls below 1 but above 0 (as the exponential of any number is always greater than 0).
For the exponentiated coefficients, then, a coefficient of 1 leaves the odds unchanged, a coefficient greater than 1 increases the odds, and a coefficient smaller than 1 decreases the odds. Moreover, the more distant the coefficient from 1 in either direction, the greater the effect in changing the odds. Recall as well that the odds are not symmetric around 1. They vary between 0 and 1 on one end, but from 1 to positive infinity on the other.
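The exponentiation step is easy to verify. The sketch below uses the three rounded coefficients quoted above; the values of 12 and 45 for the two predictors are assumptions made only to illustrate that the additive and multiplicative forms agree in a two-predictor equation.

```python
import math

# Rounded logit coefficients quoted in the text (Tables 2.1 and 2.2).
coefs = {"Education": -0.183, "Age": -0.024, "Gender (male)": 0.254}

# Exponentiating each coefficient gives its multiplicative effect on the odds.
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: exp(b) = {ratio:.3f}")

# Additive on the logged-odds scale equals multiplicative on the odds scale.
# b0 is the constant; x1 and x2 are illustrative values, not from the text.
b0, b1, b2 = 2.054, -0.183, -0.024
x1, x2 = 12, 45
odds_additive = math.exp(b0 + b1 * x1 + b2 * x2)
odds_multiplicative = math.exp(b0) * math.exp(b1 * x1) * math.exp(b2 * x2)
print(math.isclose(odds_additive, odds_multiplicative))
```

A coefficient of 0 exponentiates to exactly 1, which is why 1 plays the role of "no effect" on the odds scale.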
Interpretation

To illustrate the interpretations of the exponentiated coefficients, or the effects on odds, Table 2.3 presents logistic regression output from R, again using the model of current smoking. The commands required to obtain the exponentiated logistic regression coefficients and the format of the output differ from Stata and SPSS. But the results are the same.
Table 2.3 R Output: Logistic Regression Model of Current Smoking, NHIS 2017

Coefficients:
               Estimate    Std. Error   z value   Pr(>|z|)
(Intercept)    2.053864    0.127611     16.095    < 2e-16    ***
Education      -0.183075   0.006779     -27.006   < 2e-16    ***
Age            -0.023966   0.001176     -20.387   < 2e-16    ***
Gender         0.253579    0.036991     6.855     7.12e-12   ***
Race.f2        -0.083504   0.057498     -1.452    0.146422
Race.f3        0.288214    0.152318     1.892     0.058466   .
Race.f4        -0.810069   0.109247     -7.415    1.22e-13   ***
Race.f5        0.436318    0.118019     3.697     0.000218   ***
Ethnicity.f1   -1.039563   0.070020     -14.847   < 2e-16    ***

Odds Ratio and Confidence Interval
               Odds ratio   Lower bound   Upper bound
(Intercept)    7.7979707    6.0750055     10.0186332
Education      0.8327060    0.8216888     0.8438194
Age            0.9763189    0.9740669     0.9785661
Gender         1.2886296    1.1985124     1.3855425
Race.f2        0.9198877    0.8210606     1.0286780
Race.f3        1.3340426    0.9826226     1.7869569
Race.f4        0.4448273    0.3570509     0.5481475
Race.f5        1.5470004    1.2225119     1.9425137
Ethnicity.f1   0.3536091    0.3077680     0.4049986
For example, exponentiating the coefficient of −.183 indicates that the predicted odds of smoking are reduced by a multiplicative factor of .833 with a 1-year increase in years of education, holding the other predictors constant. If, hypothetically, the odds of smoking for someone with 12 years of education equal .300, then the predicted odds of smoking for someone with 13 years of education equal .300 × .833 or .250. The exponentiated coefficient for age of .976 indicates that the odds of smoking are reduced by a multiplicative factor of .976 with a 1-year increase in age. If the predicted odds at age 25 are .400, then the predicted odds at age 26 would fall to .390 (or .400 × .976).

The same relationships can be restated in terms of odds ratios. The ratio of the predicted odds of smoking for someone with 13 years of education to someone with 12 years of education, or for someone with 18 years of education to someone with 17 years of education, equals the exponentiated logistic regression coefficient of .833. The ratio of predicted odds for someone aged 26 years (or age 56) to someone aged 25 years (or age 55) equals .976. Thus, the exponentiated coefficient shows the ratio of odds for those one unit higher to those one unit lower on the independent variable.

For categorical predictors in the form of dummy variables, a similar interpretation follows. The exponentiated coefficient for men of 1.289 indicates that their odds of smoking are higher than those for women by a factor of 1.289. Here, a one-unit increase defines the comparison of men to the reference group of women. If the predicted odds of smoking equal .200 for women, they equal .200 × 1.289 or .258 for men. Equivalently, the ratio of the odds of smoking for men to women is 1.289. The exponentiated coefficient of 1.547 shows higher odds of smoking for multi-race respondents compared to whites.
The odds for Hispanics are lower by a factor of .354 than non-Hispanics, and the ratio of odds for Hispanics to non-Hispanics is .354. Since the distance of an exponentiated coefficient from 1 indicates the size of the effect, a simple calculation can further aid in interpretation. The difference of a coefficient from 1 exhibits the increase or decrease in the odds for a unit change in the independent variable. In terms of a formula, the exponentiated coefficient minus 1 and times 100 gives the percentage increase or decrease due to a unit change in the independent variable:
\[
\%\Delta = (e^{b} - 1) \times 100.
\]

For education, the exponentiated coefficient says that the odds of smoking decline by 16.7% or are 16.7% lower with an increase of 1 year in education. This appears more meaningful than to say the logged odds decline by .183. The size of the effect on the odds also depends on the units of measurement of the independent variables; changes in odds for variables measured in different units do not warrant direct comparison. Still, the interpretation of percentage change in the odds has intuitive appeal.8 For men, the exponentiated logistic regression coefficient of 1.289 means that the odds of smoking are 28.9% higher than for women. The exponentiated coefficient for Hispanics of .354 indicates that their odds of smoking are 64.6% lower than for non-Hispanics.

In interpreting the exponentiated coefficients, remember that they refer to multiplicative changes in the odds rather than probabilities. It is incorrect to say that an additional year of education makes smoking 16.7% less probable or likely, which implies probabilities rather than odds. More precisely, the odds of smoking are multiplied by .833, or are 16.7% smaller, with an additional year of education.
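The percentage-change formula, and the warning about confusing odds with probabilities, can be illustrated with a short calculation. This is a sketch under stated assumptions: the coefficients are the unrounded estimates from the R output, the starting odds of .300 are the hypothetical value used earlier in this section, and the function names are chosen here for illustration.

```python
import math

def percent_change_in_odds(b):
    """Percentage change in the odds for a one-unit change: (e^b - 1) * 100."""
    return (math.exp(b) - 1) * 100

def odds_to_prob(odds):
    """Convert odds to a probability: P = odds / (1 + odds)."""
    return odds / (1 + odds)

pc_education = percent_change_in_odds(-0.183075)
pc_male = percent_change_in_odds(0.253579)
pc_hispanic = percent_change_in_odds(-1.039563)
print(f"Education: {pc_education:.1f}%  Male: {pc_male:.1f}%  Hispanic: {pc_hispanic:.1f}%")

# Hypothetical odds of .300 at 12 years of education, as in the text.
odds_12 = 0.300
odds_13 = odds_12 * math.exp(-0.183075)
p_12, p_13 = odds_to_prob(odds_12), odds_to_prob(odds_13)
pc_probability = (p_13 - p_12) / p_12 * 100
print(f"Odds fall by {abs(pc_education):.1f}%, "
      f"but the probability falls by only {abs(pc_probability):.1f}%")
```

The gap between the two percentages in the last line is the reason the text cautions against reading a change in odds as a change in probability.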
Probabilities

The third strategy of interpreting the logistic regression coefficients involves translating the effects on logged odds or odds into the effects on probabilities. Since the relationships between the independent variables and probabilities are nonlinear and nonadditive, they cannot be fully represented by a single coefficient. The effect on the probabilities has to be identified at a particular value or set of values. The choice of values to use in evaluating the effect on the probabilities depends on the concerns of the researcher and the nature of the data, but an initial strategy has the advantage of simplicity: examine the effect on the probability for a typical case.

Before interpreting probability effects from logistic regression, it helps to introduce two related concepts of predicted probabilities and marginal effects. First, logistic regression produces a predicted value for each observation in the data, but the predicted value can take the form of logged odds or probabilities. The predicted logits or logged odds are calculated for each observation by substituting that observation’s values on the independent variables, multiplying by the estimated logit coefficients, and summing the products. The predicted probabilities can then be obtained by using the formula transforming logits to probabilities. As presented in Chapter 1, the formula shows that the probabilities are a function of the logits, $L_i$: $P_i = e^{L_i} / (1 + e^{L_i})$. Of course, the predicted values can be obtained directly from statistical packages. For example, the logistic regression model of smoking in Tables 2.1
to 2.3 can generate predicted logits and probabilities for each of the 23,786 observations. The summary statistics from Stata in Table 2.4 list the following:

• The outcome (Smoker) has values of only 0 or 1, with a mean of .154 (i.e., 15.4% of the sample currently smokes).
• The predicted logits from the model vary between −4.649 and 1.636. Persons with negative values show relatively low predicted smoking and those with positive values show relatively high predicted smoking. The predicted logits are skewed toward nonsmoking.
• The predicted probabilities, which have limits of 0 and 1, vary from .009 to .837. Persons with probabilities below .5 are less likely to smoke and persons with probabilities above .5 are more likely to smoke. The mean predicted probability is the same as the mean of the dependent variable and shows that most of the sample does not smoke.

The table also presents, for comparison, the predicted probabilities from a linear regression, which are not bounded by 0 and 1. They vary between −.168 and .595. These values illustrate the point that, unlike linear regression, logistic regression keeps predicted probabilities within the limits.

Second, marginal effects refer to the influence of independent variables on a dependent variable. A marginal effect is defined in general terms as the change in the expected value of a dependent variable associated with a change in an independent variable, holding other independent variables constant at specified values. In linear regression, the marginal effect is simply the slope coefficient for an independent variable. In logistic regression, however, the marginal effect on probabilities varies. It is not fully represented by a single coefficient.

Table 2.4 Stata Output: Summary Statistics for Observed Values of Current Smoking and Predicted Values From Logistic Regression and Regression Models of Current Smoking, NHIS 2017

Variable        Obs      Mean        Std. Dev.   Min         Max
Smoker          23,786   .1539141    .3608739    0           1
Logit_Smoker    23,786   -1.837902   .6406989    -4.648903   1.636395
Prob_Smoker     23,786   .1539141    .0829396    .0094813    .8370438
Reg_Smoker      23,786   .1539141    .0797695    -.1680726   .5951874
There are two varieties of marginal effects. One involves marginal change in continuous independent variables and the other involves discrete change in categorical independent variables. The two types of marginal effects associated with each type of variable involve different calculations and strategies for interpreting the results.
Continuous Independent Variables

One way to understand the marginal effect of a continuous independent variable on probabilities involves calculating the linear slope of the tangent line of the nonlinear curve at a single point. The slope of the tangent line is defined by the partial derivative of the nonlinear equation relating the independent variables to the probabilities (Agresti, 2013, p. 164). The partial derivative shows the change in the outcome for an infinitely small or marginal change in the predictor. More intuitively, it represents a straight line that meets the logistic curve at a single point. Figure 2.1 depicts the tangent line where the logistic curve intersects Y = P = .76. The tangent line identifies the slope only at that particular point, but it allows for easy interpretation. Its slope shows the linear change in the probability at a single point on the logistic curve.

The change in probability or the linear slope of the tangent line comes from a simple equation for the partial derivative. The partial derivative equals

\[
\partial P / \partial X_k = b_k P (1 - P).
\]

Simply multiply the logistic regression coefficient by the selected probability P and 1 minus the probability. The formula for the partial derivative nicely reveals the nonlinear effects of an independent variable on probabilities. The effect of b (in terms of logged odds) translates into a different effect on the probabilities depending on the level of P. The effect will be at its maximum when P equals .5 since .5 × .5 = .25, .6 × .4 = .24, .7 × .3 = .21 and so on. The closer P comes to the ceiling or floor, the smaller the value P(1 − P), and the smaller the effect a unit change in X has on the probability. Multiplying the coefficient times .5 × .5 shows the maximum effect on the probabilities, but may overstate the influence for a sample in which the split on the dependent variable is not so even. Substituting the mean of the dependent variable, P, in the formula gives a more typical effect.
For smoking, the logistic regression coefficient for years of education equals −.183, and the mean of the dependent variable or the probability of smoking equals .154. The marginal change at the mean equals −.183 × .154 × .846 or −.024. A marginal or instantaneous change in education reduces the probability of smoking by .024. The effect reaches its maximum of −.046 when P = .5.

Figure 2.1 Tangent line of logistic curve at Y = P = .76. (Horizontal axis: X; vertical axis: Y.)

While this example illustrates the logic underlying marginal effects for continuous variables in logistic regression, it oversimplifies things. A question remains: At what values should the marginal effect be calculated? We want to calculate a marginal effect in a way that best represents the relationships for the sample. To do that, three types of marginal effects are commonly recommended (Breen, Karlson, & Holm, 2018; Long & Freese, 2014; Williams, 2012). These are marginal effects at the means, marginal effects at representative values, and average marginal effects (AME). Consider each in turn.
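The tangent-slope calculation described above reduces to one line of arithmetic. The sketch below uses the rounded values from the text (b = −.183 and a mean probability of .154); the function name is chosen here for illustration.

```python
def marginal_effect(b, p):
    """Slope of the tangent to the logistic curve at probability p: b * p * (1 - p)."""
    return b * p * (1 - p)

b_education = -0.183

# Effect evaluated at the sample mean of the outcome and at P = .5.
me_at_mean = marginal_effect(b_education, 0.154)
me_maximum = marginal_effect(b_education, 0.5)
print(f"At P = .154: {me_at_mean:.3f}   At P = .5: {me_maximum:.3f}")

# P(1 - P) shrinks as P moves away from .5 in either direction.
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"P = {p:.1f}  P(1 - P) = {p * (1 - p):.2f}  effect = {marginal_effect(b_education, p):.4f}")
```

The loop makes the symmetry visible: the effect at P = .3 equals the effect at P = .7, and both are smaller in magnitude than the effect at P = .5.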
Three Types of Marginal Effects

First, the marginal effect at the means is calculated when all independent variables in the model take their mean value. The predicted probability is obtained from multiplying each logistic regression coefficient times the mean of the corresponding independent variable, summing the products and the intercept, and transforming the predicted logit into a predicted probability. Then, this predicted probability can be used with the formula for the partial derivative to calculate marginal effects for each continuous independent variable.
Table 2.5 Stata Output: Marginal Effects at Means From Logistic Regression Model of Current Smoking, NHIS 2017

Expression   : Pr(Smoker), predict()
dy/dx w.r.t. : Education Age 1.Gender 2.Race 3.Race 4.Race 5.Race 1.Ethnicity
at           : Education   = 13.90932 (mean)
               Age         = 54.294   (mean)
               0.Gender    = .5514588 (mean)
               1.Gender    = .4485412 (mean)
               1.Race      = .8100143 (mean)
               2.Race      = .1101909 (mean)
               3.Race      = .0113092 (mean)
               4.Race      = .0501976 (mean)
               5.Race      = .0182881 (mean)
               0.Ethnicity = .8853107 (mean)
               1.Ethnicity = .1146893 (mean)

                                   Delta-method
                       dy/dx       Std. Err.   z        P>|z|   [95% Conf. Interval]
Education              -.0216849   .0007684    -28.22   0.000   -.023191    -.0201788
Age                    -.0028387   .0001353    -20.98   0.000   -.0031039   -.0025736
Gender
  Male                 .030344     .0044637    6.80     0.000   .0215954    .0390926
Race
  African American     -.0098693   .0066399    -1.49    0.137   -.0228833   .0031447
  Native American      .0388422    .0224991    1.73     0.084   -.0052551   .0829396
  Asian American       -.0733998   .0072834    -10.08   0.000   -.0876749   -.0591247
  Multiple Race        .0618207    .0190291    3.25     0.001   .0245243    .0991171
Ethnicity
  Hispanic             -.0924188   .0044527    -20.76   0.000   -.1011459   -.0836917

Note: dy/dx for factor levels is the discrete change from the base level.
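The education and age entries of Table 2.5 can be approximated by hand from the coefficients in Table 2.3 and the means listed at the top of Table 2.5. This is a sketch of the arithmetic only, not of how Stata computes the results; the descriptive dictionary keys are labels introduced here, and small differences from the table reflect rounding of the printed inputs.

```python
import math

# Logit coefficients from the R output (Table 2.3); key names are chosen here.
COEF = {
    "Intercept": 2.053864, "Education": -0.183075, "Age": -0.023966,
    "Male": 0.253579, "AfricanAmerican": -0.083504, "NativeAmerican": 0.288214,
    "AsianAmerican": -0.810069, "MultipleRace": 0.436318, "Hispanic": -1.039563,
}

# Means of the independent variables as listed in Table 2.5.
MEANS = {
    "Education": 13.90932, "Age": 54.294, "Male": 0.4485412,
    "AfricanAmerican": 0.1101909, "NativeAmerican": 0.0113092,
    "AsianAmerican": 0.0501976, "MultipleRace": 0.0182881, "Hispanic": 0.1146893,
}

# Step 1: predicted logit with every predictor at its mean.
logit_at_means = COEF["Intercept"] + sum(COEF[k] * MEANS[k] for k in MEANS)

# Step 2: transform the logit into a predicted probability.
p_at_means = 1.0 / (1.0 + math.exp(-logit_at_means))

# Step 3: apply the partial derivative b * P * (1 - P) to each continuous variable.
me_education = COEF["Education"] * p_at_means * (1 - p_at_means)
me_age = COEF["Age"] * p_at_means * (1 - p_at_means)

print(f"Logit at means = {logit_at_means:.4f}, probability = {p_at_means:.3f}")
print(f"Marginal effect: education = {me_education:.4f}, age = {me_age:.4f}")
```

The logit at the means also matches the mean predicted logit in Table 2.4, as expected for a linear predictor, while the probability at the means (about .137) differs from the mean predicted probability (.154) because the transformation is nonlinear.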
Marginal effects at the means are best done with program commands. In Stata, the margins command generates marginal effects at the means. Table 2.5 displays the output from a margins command following the logistic regression (“margins, dydx(*) atmeans”). For the moment, we can focus on the two continuous measures of education and age. The results show that the marginal effect of education, or the effect of an infinitely small change in education, is −.022 when all independent variables are at their mean values, that is, for a hypothetical person who is average on all characteristics. This indicates that the expected probability of smoking decreases by .022 with a marginal change in education. Alternatively, one can view −.022 as the slope of the tangent line at the predicted probability when the independent variables are at their means. Note that the top of the table lists the means for each independent variable used in the calculations.

These marginal effects make for a more intuitive interpretation than logged odds or odds. The coefficients of −.022 for education and −.003 for age represent changes on a probability scale ranging from 0 to 1. Although interpretations still depend on the measurement units and variation of the independent variables, they rely on more familiar units for the dependent variable. Always remember, however, that these effects are specific to a predicted probability determined by the means of the independent variables.

Second, marginal effects at representative values use a typical case on the independent variables rather than the means of the independent variables. Note that the means do not typically refer to the characteristics of an actual person. In the NHIS sample, the mean education of 13.91 and the mean age of 54.29 are not observed values for the measures. The concern is more obvious for the categorical variables. The mean for gender is .449 (44.9% male) and the mean for African American is .110. Obviously, the NHIS categorical measures do not treat gender or race as a proportion. To represent a typical case, we might select values for someone with 12 years of education, 45 years old, male (Gender = 1), white (Race = 1), and non-Hispanic (Ethnicity = 0).
Margins in Stata can again be used to obtain these marginal effects (“margins, dydx(*) at (Education = 12 Age = 45 Gender = 1 Race = 1 Ethnicity = 0)”). Table 2.6 lists the values used to obtain the predicted probability and the marginal effects at those values. The predicted probability for a person with these representative values is .275, which is higher than the predicted probability of .137 when the independent variables are at their means. The expected change in the probability of smoking for an infinitely small change in education at these representative values is −.037 (vs. −.022 previously). Other models at different representative values could give substantially different results, however.

Table 2.6 Stata Output: Marginal Effects at Representative Values From Logistic Regression Model of Current Smoking, NHIS 2017

Expression   : Pr(Smoker), predict()
dy/dx w.r.t. : Education Age 1.Gender 2.Race 3.Race 4.Race 5.Race 1.Ethnicity
at           : Education = 12
               Age       = 45
               Gender    = 1
               Race      = 1
               Ethnicity = 0

                                   Delta-method
                       dy/dx       Std. Err.   z        P>|z|   [95% Conf. Interval]
Education              -.0365248   .0016434    -22.23   0.000   -.0397457   -.0333038
Age                    -.0047814   .0002657    -18.00   0.000   -.0053021   -.0042607
Gender
  Male                 .0476225    .0069547    6.85     0.000   .0339915    .0612535
Race
  African American     -.0163434   .0111053    -1.47    0.141   -.0381095   .0054226
  Native American      .0610315    .0339325    1.80     0.072   -.0054749   .127538
  Asian American       -.1307434   .014079     -9.29    0.000   -.1583378   -.103149
  Multiple Race        .0948479    .0273988    3.46     0.001   .0411472    .1485486
Ethnicity
  Hispanic             -.1568755   .0088443    -17.74   0.000   -.17421     -.139541

Note: dy/dx for factor levels is the discrete change from the base level.

Third, the average marginal effect is obtained differently. It first calculates the marginal effect for each observation by using the actual values of each observation rather than the means or representative values. For an independent variable, there are as many marginal effects as observations in the analysis. It then calculates the average of those marginal effects. Note the difference in strategy. The marginal effects at the means and representative values define a single marginal effect for each independent variable. The average marginal effect defines a distribution of marginal effects for the sample and then computes the mean of the distribution for each independent variable.
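The difference between a marginal effect at representative values and an average marginal effect can be sketched as follows. The coefficients are from Table 2.3 and the representative case is the one used in Table 2.6. The four "observations" are hypothetical rows invented here purely to show the mechanics of averaging; they are not NHIS data, so the resulting average is not comparable to any published figure.

```python
import math

# Logit coefficients from the R output (Table 2.3); key names are chosen here.
COEF = {
    "Intercept": 2.053864, "Education": -0.183075, "Age": -0.023966,
    "Male": 0.253579, "AfricanAmerican": -0.083504, "NativeAmerican": 0.288214,
    "AsianAmerican": -0.810069, "MultipleRace": 0.436318, "Hispanic": -1.039563,
}

def predicted_prob(case):
    """Predicted probability for one case; omitted dummies count as 0."""
    logit = COEF["Intercept"] + sum(COEF[k] * v for k, v in case.items())
    return 1.0 / (1.0 + math.exp(-logit))

def marginal_effect_education(case):
    """Partial derivative b * P * (1 - P) evaluated at this case's probability."""
    p = predicted_prob(case)
    return COEF["Education"] * p * (1 - p)

# Marginal effect at representative values: a single case, a single effect.
representative = {"Education": 12, "Age": 45, "Male": 1}
me_representative = marginal_effect_education(representative)
print(f"At representative values: {me_representative:.4f}")

# Average marginal effect: one effect per observation, then the mean.
# HYPOTHETICAL rows for illustration only.
sample = [
    {"Education": 10, "Age": 30, "Male": 1},
    {"Education": 16, "Age": 52, "Male": 0},
    {"Education": 12, "Age": 67, "Male": 0, "Hispanic": 1},
    {"Education": 18, "Age": 41, "Male": 1, "AsianAmerican": 1},
]
effects = [marginal_effect_education(row) for row in sample]
ame = sum(effects) / len(effects)
print("Effect per observation:", [round(e, 4) for e in effects])
print(f"Average of the per-observation effects: {ame:.4f}")
```

With real data, the list `sample` would hold one row per respondent, and the average would be taken over all of them.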
Table 2.7 presents the AME obtained from the margins command in Stata (“margins, dydx(*)”). The AME are similar but not identical to those obtained at the means or representative values. Holding other covariates constant, the average change in the probability of smoking for an infinitely small change in education is −.023 (vs. −.022 and −.037 previously).

Table 2.7 Stata Output: Average Marginal Effects From Logistic Regression Model of Current Smoking, NHIS 2017

Expression   : Pr(Smoker), predict()
dy/dx w.r.t. : Education Age 1.Gender 2.Race 3.Race 4.Race 5.Race 1.Ethnicity

                                   Delta-method
                       dy/dx       Std. Err.   z        P>|z|   [95% Conf. Interval]
Education              -.0225815   .0008187    -27.58   0.000   -.024186    -.020977
Age                    -.0029561   .0001436    -20.59   0.000   -.0032375   -.0026747
Gender
  Male                 .0315158    .0046228    6.82     0.000   .0224553    .0405763
Race
  African American     -.0102466   .0069154    -1.48    0.138   -.0238005   .0033073
  Native American      .0396333    .0226697    1.75     0.080   -.0047986   .0840652
  Asian American       -.0783194   .0079958    -9.80    0.000   -.0939909   -.0626479
  Multiple Race        .0626203    .0189212    3.31     0.001   .0255354    .0997052
Ethnicity
  Hispanic             -.0988868   .0049608    -19.93   0.000   -.1086097   -.0891638

Note: dy/dx for factor levels is the discrete change from the base level.

Each type of marginal effect has strengths and weaknesses (Muller & MacLehose, 2014). The marginal effects at the means represent central tendency and in that sense are typical of the sample. However, the means on all independent variables define a hypothetical example rather than a real person, group, organization, or other unit of analysis. The marginal effects at representative values are based on observed values of the independent variables but may not be typical of the sample. Although researchers will select key groups or characteristics that are common in the sample, the representative values miss those in other groups or with other characteristics. The average marginal effect can be viewed as the effect for a case picked at random from the sample (Breen et al., 2018). It has the advantages of using all observed values in the sample and thereby representing everyone. Long and Freese (2014, Chapter 6) offer a qualified recommendation for the average marginal effect or AME: “Broadly speaking, we believe that the AME is the best summary of the effect of a variable.”

Note that, even after selecting one of the marginal effects, any single coefficient showing the change in probability is potentially misleading. The coefficient will not fully reflect the nonlinear and nonadditive relationship of the independent variables with the probabilities. To get more information but also add more complexity, a researcher might compute marginal effects for a range of values on the independent variables and present the marginal effects for the extremes as well as the middle of the distribution. Ways to present a more complete summary of the range of influences of a variable on probabilities are discussed in more detail below.
Categorical Independent Variables

The partial derivative works best with continuous variables for which small changes in the independent variables are meaningful. For dummy variables, the relevant change occurs from 0 to 1, and the tangent line for infinitely small changes in X makes little sense. Instead, the marginal effect for categorical variables is best shown by the discrete change from one category to another. It is possible to compute predicted probabilities for two categories and then measure the difference in the probabilities. This marginal effect refers to a discrete change in the independent variable rather than a marginal change. The two may approximate one another, but calculating the predicted probabilities for categorical independent variables based on the discrete change makes more sense. Remember, however, that the marginal effect based on differences in predicted probabilities, like the partial derivative, varies across points on the logistic curve and specific predicted probabilities.

The three strategies for estimating marginal effects on probabilities for continuous independent variables apply to categorical independent variables. Each strategy finds a predicted probability for the omitted group, finds a predicted probability for the dummy variable group, and subtracts the former probability from the latter. The differences in strategies come from setting the values assigned to the other independent variables. The three most common strategies can be reviewed and adapted to categorical independent variables using the same model and examples in Tables 2.5 to 2.7.
First, the marginal effect at the means uses the predicted values for categorical variables when one value is 0, the other value is 1, and all other independent variables in the model take their mean values. The discrete change in the predicted values from 0 to 1 for the categorical comparison defines the marginal effect. In Table 2.5, the Stata output notes at the bottom that the marginal effect (symbolized by dy/dx) for factor levels (values of a categorical independent variable) is the discrete change from the base level. For gender, the base level or reference category refers to females, and the marginal effect for males of .030 shows that the expected probability of smoking is higher for males than females by .030, when the other independent variables are held constant at their means. With education, age, race, and ethnicity taking their mean values, additional runs show that the predicted probability of smoking is .1244 for females and .1547 for males. The difference of .0303 is the same as the gender coefficient for males in Table 2.5. For race, the base level or reference category is white. The coefficients of −.010 for African Americans, .039 for Native Americans, −.073 for Asian Americans, and .062 for multi-race respondents show considerable diversity in smoking across the groups. The marginal effect is largest for Asians, with the difference in expected probabilities of −.073 compared to whites, when other independent variables are held constant at their means.

Second, marginal effects at representative values use a typical case on the independent variables rather than the means of the independent variables. To reexamine the discrete change for gender, the predicted value for both males and females can be obtained when the other independent variables are set, for example, at 12 years of education, 45 years old, white (Race = 1), and non-Hispanic (Ethnicity = 0). The results are shown in Table 2.6.
The marginal effect for gender equals .048, which is larger than the marginal effect at the means. The marginal effect for Hispanics relative to non-Hispanics is −.157 at the selected representative values. As always, using different representative values can give substantially different results. Third, the AME for the categorical variables are listed in Table 2.7. The average marginal effect calculates the predicted probability for each observation when gender equals 0 and the other independent variables equal their observed values. The same is done when gender equals 1 and the difference is obtained for each observation. The average of the differences equals the average marginal effect. Table 2.7, which presents the average marginal effect for all independent variables, shows coefficients of .032 for gender and −.099 for Hispanic ethnicity. The same strengths and weaknesses in the three summary marginal effects of continuous independent variables apply to categorical independent variables.
The interpretation shifts from the change in probabilities due to an infinitely small change in a continuous independent variable to the discrete change from a base level to a group level for a categorical independent variable. Otherwise, the issues faced in interpreting probability effects still apply. Along with selecting the type of marginal effect, one should consider examining the marginal effects at varied levels of the independent variables. An easy way to do this is discussed below.
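The discrete change for gender can be reproduced approximately from the coefficients in Table 2.3, first with the other predictors at the means listed in Table 2.5 and then at the representative values of Table 2.6. This is a sketch of the arithmetic under those inputs, with dictionary keys and function names chosen here; it is not a description of how Stata computes the values.

```python
import math

# Logit coefficients from the R output (Table 2.3); key names are chosen here.
COEF = {
    "Intercept": 2.053864, "Education": -0.183075, "Age": -0.023966,
    "Male": 0.253579, "AfricanAmerican": -0.083504, "NativeAmerican": 0.288214,
    "AsianAmerican": -0.810069, "MultipleRace": 0.436318, "Hispanic": -1.039563,
}

def predicted_prob(case):
    """Predicted probability for one case; omitted dummies count as 0."""
    logit = COEF["Intercept"] + sum(COEF[k] * v for k, v in case.items())
    return 1.0 / (1.0 + math.exp(-logit))

def discrete_change_male(others):
    """Return (P for females, P for males, difference) with other values fixed."""
    p_female = predicted_prob({**others, "Male": 0})
    p_male = predicted_prob({**others, "Male": 1})
    return p_female, p_male, p_male - p_female

# Other predictors held at their means (Table 2.5).
others_at_means = {
    "Education": 13.90932, "Age": 54.294, "AfricanAmerican": 0.1101909,
    "NativeAmerican": 0.0113092, "AsianAmerican": 0.0501976,
    "MultipleRace": 0.0182881, "Hispanic": 0.1146893,
}
pf_means, pm_means, diff_means = discrete_change_male(others_at_means)
print(f"At means: females {pf_means:.4f}, males {pm_means:.4f}, change {diff_means:.4f}")

# Other predictors at representative values (white, non-Hispanic; Table 2.6).
others_representative = {"Education": 12, "Age": 45}
pf_rep, pm_rep, diff_rep = discrete_change_male(others_representative)
print(f"At representative values: females {pf_rep:.4f}, males {pm_rep:.4f}, change {diff_rep:.4f}")
```

The same logit coefficient for males yields a larger change in probability at the representative values because the predicted probabilities there lie closer to .5.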
Graphing Marginal Effects

The three ways to summarize marginal effects (at the means, at representative values, and the average marginal effect) do not represent the variation of the effects around the average. A single coefficient is quite useful but incomplete. One way to capture this variation is to select a set of values at which to calculate marginal effects. For example, the marginal effect for a continuous variable can be calculated when the variable takes values −2, −1, 0, 1, and 2 standard deviations from its mean and the other independent variables are at their means, at representative values, or at their actual values as in the AME. The AME of education are −.038, −.030, −.022, −.015, and −.010, respectively, at −2, −1, 0, 1, and 2 standard deviations from its mean. The marginal effects get weaker at higher levels of education, as the probability of smoking gets lower. Alternatively, marginal effects might be calculated when a continuous independent variable takes its maximum, mean, and minimum values. The marginal effects of education are −.039 at 0, the education minimum, −.022 at the education mean, and −.013 at 18, the education maximum.

An easier procedure involves graphing the marginal effects. In Stata, the average marginal effect for each value of education again comes from the margins command (“margins, dydx(education) at(education = (0(1)18))”) followed by “marginsplot.” The results of this command are shown in Figure 2.2. The graph plots the average marginal effect on the Y axis for each observed value of education on the X axis. The bars represent the 95% confidence interval around the marginal effects. The scale for the marginal effects on the Y axis varies from only −.05 to −.01 and the marginal effects range from −.013 to −.043. Note the nonlinear relationship of education with the probability of smoking. The graph reveals the strongest marginal effects for those with few years of completed schooling, who also tend to smoke more than others.
The peak negative effect of −.043 occurs with four years of education. The marginal effect moves toward 0, reaching −.013 at 18 years of education. The pattern of the marginal effects reaffirms the point made earlier that marginal effects are strongest when predicted probabilities are near .5, a level at which those with little education come closest. As smoking falls to levels farther from .5 and closer to 0 for those with advanced education, the negative marginal effect becomes weaker. Similar results could be obtained for marginal effects at the means or marginal effects at representative values. The pattern of marginal effects in Figure 2.2 demonstrates the nonlinearity in probability effects. That is, the average marginal effect of education varies in a nonlinear pattern with the level of education.
Graphs can also demonstrate nonadditivity, or how the marginal effect of one independent variable on probabilities varies with the level of another variable. Consider gender and age. Figure 2.3 shows the marginal effect from the discrete change in gender for each age. The marginal effect of being male is larger at younger ages when current smoking is higher. The marginal effect decreases at older ages, as many former smokers have quit or died. The marginal effect falls from .045 at age 26 to .018 at age 85 and over. The change is not large but nonetheless illustrates how the difference between men and women varies with age.
Figure 2.2 Average marginal effects of education at values of education from logistic regression model of current smoking, NHIS 2017. [Graph: Average Marginal Effects of Education With 95% CIs. Y axis: Effects on Pr(Smoker), −.05 to −.01. X axis: Years of Education Attained, 0 to 18.]
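The link between marginal effects and predicted probabilities can be checked with a few lines of arithmetic. For a continuous variable in a logistic regression, the marginal effect on the probability at a given point equals b × P × (1 − P), which is largest in absolute value when P = .5. The sketch below uses the education coefficient of −.183 quoted in the text; the function name and the grid of probabilities are illustrative assumptions, not output from the NHIS model, and the values are effects at single probabilities rather than the average marginal effects plotted in Figure 2.2.

```python
def marginal_effect(b, p):
    """Change in probability for a very small change in X: b * P * (1 - P)."""
    return b * p * (1.0 - p)

B_EDUCATION = -0.183  # logit coefficient for education quoted in the text

# Illustrative probabilities (assumed values, not model predictions)
effects = {p: marginal_effect(B_EDUCATION, p) for p in (0.05, 0.10, 0.25, 0.50)}

for p, effect in effects.items():
    print(f"P = {p:.2f}  marginal effect = {effect:.4f}")
```

The effect shrinks toward 0 as P moves away from .5, which matches the weaker effects of education at high levels of schooling, where smoking is rare.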
Figure 2.3 Average marginal effects of gender (1 = males) at values of age from logistic regression model of current smoking, NHIS 2017. [Graph: Average Marginal Effects of 1.Gender With 95% CIs. Y axis: Effects on Pr(Smoker), .01 to .06. X axis: Age, 20 to 80.]
Figure 2.4 Average marginal effects of education at values of age from logistic regression model of current smoking, NHIS 2017. [Graph: Average Marginal Effects of Education With 95% CIs. Y axis: Effects on Pr(Smoker), −.035 to −.01. X axis: Age, 20 to 80.]
Nonadditivity also holds for continuous independent variables. Figure 2.4 plots the average marginal effect of education for selected ages. The negative marginal effect of education is strongest at the youngest ages, when current smoking is highest. It becomes weaker and closer to 0 at older ages when current smoking is lowest. Although the effect of education on the logged odds is the same across all ages, the effects on probabilities are nonadditive.
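A small sketch can make the nonadditivity concrete. It uses the constant and the education and age coefficients reported in the SPOST output (Table 2.9) and evaluates the marginal effect of education at several ages. The representative case is an assumption made here for illustration: 14 years of education with all indicator variables set to 0. The results are marginal effects at representative values, so they will not reproduce the average marginal effects in Figure 2.4 exactly.

```python
import math

# Coefficients as reported in Table 2.9 (constant, education, age)
B_CONS, B_EDUC, B_AGE = 2.0539, -0.1831, -0.0240

def prob_smoker(education, age):
    """Predicted probability for a case with all indicator variables at 0."""
    logit = B_CONS + B_EDUC * education + B_AGE * age
    return 1.0 / (1.0 + math.exp(-logit))

def education_effect(education, age):
    """Marginal effect of education on the probability: b * P * (1 - P)."""
    p = prob_smoker(education, age)
    return B_EDUC * p * (1.0 - p)

# The logit coefficient is constant, but the probability effect changes with age
for age in (26, 45, 65, 85):
    print(age, round(education_effect(14, age), 4))
```

Because the predicted probability falls with age, the same logit coefficient translates into a smaller probability effect for older respondents.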
Graphing Predicted Probabilities
Marginal effects examine the change in predicted probabilities for an infinitely small change or discrete change in the independent variables. Some insight into the nature of marginal effects and, more generally, into the relationships of the independent variables with the dependent variable can come from examining the predicted probabilities themselves. Graphing of predicted probabilities, like graphing of marginal effects, proves helpful.
Figure 2.5 graphs the average predicted probabilities of men and women at selected ages (when the other independent variables take their observed values for each case). The predicted probabilities on the Y axis range from just above .05 to just below .3. The probability scale differs from the smaller scale for marginal effects. According to the graph, women have a lower probability of smoking than men. Furthermore, the gap between men and women, which reflects the average marginal effect of gender, varies with age. Although the change is not large, the gap between women and men appears narrower at older than younger ages. This affirms findings for the marginal effects of gender: Figure 2.3 shows these effects get smaller at older ages. The lines in Figure 2.5 again demonstrate the nonlinear and nonadditive relationships of gender and age with the predicted probabilities of smoking.
Figure 2.5 Predicted probabilities by gender and age from logistic regression model of current smoking, NHIS 2017. [Graph: Predictive Margins of Gender With 95% CIs. Y axis: Pr(Smoker), .05 to .3. X axis: Age, 25 to 85. Lines: Female, Male.]
As one more example, Figure 2.6 presents the average predicted probabilities for non-Hispanics and Hispanics by education level. The line for non-Hispanics is higher than for Hispanics, but it declines more quickly with education. The gap thus narrows with education. It is worth comparing the predicted probabilities from the logistic regression to the predicted probabilities from linear regression. Using the same independent variables as the logistic regression, the linear regression plus the margins command produces Figure 2.7. The predicted probabilities form a straight line, and the gap between non-Hispanics and Hispanics across years of education is constant. In other words, the marginal effects of Hispanic ethnicity and education are both linear and additive. The two straight and parallel lines from the linear regression in Figure 2.7 contrast with the curved and nonparallel lines from the logistic regression in Figure 2.6. The linear regression shows the same gap at all levels of education, while the logistic regression shows a larger gap at low education than at high education. Reflecting this difference, the logistic regression further shows higher predicted smoking for non-Hispanics at low levels of education than the linear regression. Although the differences between the graphs are not huge, the one for the logistic regression appears more accurate.
Figure 2.6 Predicted probabilities by Hispanic ethnicity and education from logistic regression model of current smoking, NHIS 2017. [Graph: Predictive Margins of Ethnicity With 95% CIs. Y axis: Pr(Smoker), 0 to .8. X axis: Years of Education Attained, 0 to 18. Lines: Non-Hispanic, Hispanic.]
Figure 2.7 Predicted probabilities by Hispanic ethnicity and education from linear regression model of current smoking, NHIS 2017. [Graph: Predictive Margins of Ethnicity With 95% CIs. Y axis: Linear Prediction, −.2 to .6. X axis: Years of Education Attained, 0 to 18. Lines: Non-Hispanic, Hispanic.]
Standardized Coefficients
Regression programs ordinarily present standardized coefficients along with unstandardized coefficients. Many find standardized coefficients to be helpful in interpreting regression results. Unstandardized coefficients show relationships between variables measured in their original metric. If the measurement units differ, as is typically the case, the unstandardized coefficients are not directly comparable. For the model of current smoking, a unit change in gender (from females to males) obviously differs from a 1-year change in completed education. Even comparisons of education and age are risky, despite both being measured in terms of years. For the sample, age varies from 26 to 85 and has a standard deviation of 16.6. In contrast, education varies from 0 to 18, with a standard deviation of 2.85. A 1-year change has a different meaning for these two variables.
Standardized coefficients have the advantage of showing relationships when the independent and dependent variables have a common scale. They can be understood as regression coefficients when all variables are measured as standard scores with means of 0 and standard deviations of 1. They then show the expected change in standard units of the dependent variable for a standard unit change in the independent variables, controlling for other independent variables. Given the comparable units of the variables, standardized coefficients help in comparing the relative strength of the relationships.
Unlike multiple regression programs, logistic regression programs do not routinely compute standardized coefficients. The problem with standardized coefficients in logistic regression stems partly from ambiguity in the meaning of standard scores or standard units for a binary dependent variable. Standardizing a binary variable merely translates values of 0 and 1 into two other values. If the mean of the dependent variable Y equals the probability P, the variance equals P(1 − P) (Agresti, 2013, p. 117). Then, the standard score z has only two values: Y values of 1 have z values equal to $(1 - P)/\sqrt{P(1 - P)}$, and Y values of 0 have z values equal to $(0 - P)/\sqrt{P(1 - P)}$. With only two values, a standardized binary dependent variable does not represent variation in the underlying probability of the outcome.
Standardizing the Independent Variables
One way around the problem involves semistandardizing coefficients in logistic regression (also called X-standardizing). Standardizing only the independent variables does not require a standard deviation for the binary dependent variable, but it still allows for some useful comparisons. The semistandardized coefficients show the expected change in the logged odds of the outcome associated with a standard deviation change in each of the independent variables. With a comparable metric for the independent variables, semistandardized coefficients reflect the relative importance of variables within a model.
Semistandardized coefficients can be obtained in one of two ways. First, they come from multiplying by hand the logistic regression coefficient for independent variables in their original metric by the standard deviation of the variables. The formula is simple and has some intuitive value in understanding how standardizing the independent variables works:
$$b_{yx}^{\text{semi}} = b_{yx}^{\text{unstand}} \times sd_x.$$
If the unstandardized coefficient shows the change in the logged odds for a unit change in X, then multiplying the coefficient by the standard deviation shows the change in the logged odds for a standard deviation change in X. For example, the logistic regression coefficient for education in its original units is −.183 and the standard deviation of education is 2.85. The semistandardized coefficient equals −.183 × 2.85 = −.522. The logistic regression coefficient for age is −.024, but the standard deviation of 16.6 is larger. The semistandardized coefficient is −.024 × 16.6 = −.398. It indicates that age (−.398) is less strongly associated with smoking than education (−.522).
Second, to obtain semistandardized coefficients directly, the independent variables can be standardized before being included in the logistic regression model. The resulting logistic regression coefficients will show the effects on the logged odds of a standard deviation change in each of the independent variables. Table 2.8 lists the results using this procedure. First note that, after standardizing the independent variables using the means and standard deviations of the variables for the sample used in the logistic regression analysis, all independent variables will have a mean of 0 and a standard deviation of 1. Next note that the logistic regression coefficients shown in the table for education and age match (within rounding error) the calculations done by hand. In addition, the list of all semistandardized coefficients together makes for easy comparison of the size of the coefficients. It can be seen that education has the strongest effect, followed by age and Hispanic ethnicity.
The interpretation of the outcome is still in logged odds, but taking the exponent of the coefficients will show the change in the odds for a standard deviation change in the independent variables.
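The hand calculation can be written as a short sketch. The coefficients and standard deviations are the values quoted in the text for education and age; the function name is a hypothetical helper introduced here for illustration.

```python
import math

def semistandardize(b, sd_x):
    """X-standardized coefficient: change in logged odds per SD change in X."""
    return b * sd_x

# (unstandardized coefficient, standard deviation) as quoted in the text
b_semi_education = semistandardize(-0.183, 2.85)
b_semi_age = semistandardize(-0.024, 16.6)

print(round(b_semi_education, 3), round(b_semi_age, 3))

# Exponentiating gives the multiplicative change in the odds per SD change in X
print(math.exp(b_semi_education), math.exp(b_semi_age))
```

Small differences from Table 2.8 are expected because the inputs here are rounded.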
Table 2.8 Stata Output: Semi-Standardized Coefficients From Logistic Regression Model of Current Smoking With Standardized Independent Variables, NHIS 2017
Smoker          Coef.        Std. Err.    z         p>|z|    [95% Conf. Interval]
zEducation      -.5212059    .0192999     -27.01    0.000    -.559033     -.4833787
zAge            -.3980747    .0195263     -20.39    0.000    -.4363455    -.3598038
zGender          .126119     .0183976       6.86    0.000     .0900603     .1621777
zNativeAmer      .0304768    .0161067       1.89    0.058    -.0010917     .0620454
zMultiRace       .0584639    .0158138       3.70    0.000     .0274695     .0894584
zAfricanAmer    -.0261479    .0180046      -1.45    0.146    -.0614363     .0091406
zAsianAmer      -.1768843    .0238549      -7.41    0.000    -.2236391    -.1301295
zEthnicity      -.3312603    .0223123     -14.85    0.000    -.3749915    -.287529
_cons           -1.837902    .0200325     -91.75    0.000    -1.877165    -1.798639

Standardizing the Independent Variables and Dependent Variable
Semistandardized coefficients, while simple to understand and calculate, have a limitation. They identify the effects of different independent variables for the same dependent variable and within the same model. However, the scale for the dependent variable remains as logged odds, and comparing the effects of independent variables across models with different dependent variables can be misleading. Broader comparisons are possible, as in linear regression, with fully standardized coefficients. A fully standardized coefficient (also called XY-standardized coefficient) adjusts for the standard deviations of both X and Y as in the following formula:
$$b_{yx}^{\text{full}} = b_{yx}^{\text{unstand}} \times (sd_x / sd_y).$$
However, the problem of how to obtain the standard deviation of Y remains. To obtain a meaningful measure of the standard deviation of a binary dependent variable, Long (1997, pp. 70–71) recommends using the predicted logits from the model. This approach is based on the idea that the observed binary values of an outcome in logistic regression are manifestations of an underlying latent continuous variable. This latent continuous variable is assumed to have a variance but one that is unobserved. However, the predicted logged odds from logistic regression have an observed variance that will reflect the underlying unobserved variance. In addition, the error term in the logistic regression equation has a variance, arbitrarily defined in the logistic distribution as $\pi^2/3$. Together, the variance of the predicted logits plus the variance of the error term offers an estimate of the variance of the unobserved continuous dependent variable. Taking the square root of the variance provides a measure of the standard deviation of the continuous latent variable. Using this standard deviation for the latent variable in the formula for the standardized coefficient will show the standard deviation change in the latent variable underlying the logged odds for a one standard deviation change in the independent variables.
In practice, the variance of predicted logged odds can be obtained by saving the predicted logged odds and requesting descriptive statistics. For the model of smoking, the variance of the predicted logged odds is .410, and $\pi^2/3$ is 3.290. The sum is 3.700 and the square root of 3.700 is 1.924. To get the fully standardized coefficient for education, substitute values into the formula above: −.183 × (2.85/1.924) = −.271. Although tedious and prone to error when done by hand, the same steps can be used to obtain fully standardized coefficients for the other independent variables. For Stata users, an easier way to get the coefficients is available.
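The steps for the fully standardized coefficient can be sketched as follows. The variance of the predicted logged odds (.410), the education coefficient (−.183), and the standard deviation of education (2.85) are taken from the text; the function names are hypothetical helpers introduced only for this illustration.

```python
import math

def latent_sd(var_predicted_logit):
    """SD of the latent outcome: sqrt(variance of predicted logits + pi^2 / 3)."""
    return math.sqrt(var_predicted_logit + math.pi ** 2 / 3)

def fully_standardize(b, sd_x, sd_y):
    """XY-standardized coefficient: b * (sd_x / sd_y)."""
    return b * sd_x / sd_y

sd_y = latent_sd(0.410)                       # about 1.924
b_full_education = fully_standardize(-0.183, 2.85, sd_y)

print(round(sd_y, 3), round(b_full_education, 3))
```

The same two functions can be applied to any other independent variable by substituting its coefficient and standard deviation.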
A Note on SPOST
Scott Long and Jeremy Freese (2014) have created a suite of programs called SPOST that can be used with Stata to interpret the coefficients obtained from nonlinear models. SPOST is free to download by Stata users. The authors’ book presents a full introduction to the SPOST commands and their uses, and one of the many commands calculates semistandardized and fully standardized coefficients. SPOST uses a simple command (“listcoef, std”) after a logistic regression command.
Table 2.9 presents the output from the SPOST command. It lists the coefficients, z values, and probabilities of z from the usual logistic regression output. The column labeled bStdX lists the semistandardized or X-standardized coefficients. These coefficients show the expected change in the logged odds for a one standard deviation change in the independent variables. They match the coefficients created with the independent variables as standard scores but are easier to obtain. The standard deviations used to obtain these X-standardized coefficients are listed in the last column. Standardizing both the dependent and independent variables gives the fully standardized or XY-standardized coefficients in the column labeled bStdXY. The fully standardized coefficient for education shows that a one standard deviation change in education is associated with an expected decrease in current smoking of .271 standard deviations. Note that the standard deviation for the dependent variable refers to the underlying latent variable. These coefficients again show the strongest effects from education, age, and Hispanic ethnicity.
The output lists other information that is helpful in understanding logistic regression. It is possible to standardize on Y but not X, just as it is possible to standardize on X but not Y. The table lists these Y-standardized coefficients in the column labeled bStdY. Also, the top of the table lists the observed standard deviation of the binary outcome of current smoking as .361, but for reasons mentioned above, that number is of limited value. The latent standard deviation of 1.924 is more meaningful. It refers to the underlying, unobserved distribution of smoking that produces the observed binary outcomes of 0 and 1.⁹
Caution is warranted in interpreting the meaning of a standard deviation change for a categorical independent variable, which is less straightforward than the meaning of a standard deviation change for a continuous independent variable. And comparison of standardized coefficients across groups is generally not recommended. But standardized coefficients are well suited for the interpretation of the relative strength of relationships within a single model and group.

Table 2.9 SPOST Output: Standardized Coefficients From Logistic Regression Model of Current Smoking, NHIS 2017
Observed SD: 0.3609
Latent SD: 1.9236
                          b          z          P>|z|    bStdX     bStdY     bStdXY    SDofX
Education                 -0.1831    -27.006    0.000    -0.521    -0.095    -0.271     2.847
Age                       -0.0240    -20.387    0.000    -0.398    -0.012    -0.207    16.610
Gender
  Male                     0.2536      6.855    0.000     0.126     0.132     0.066     0.497
Race
  African American        -0.0835     -1.452    0.146    -0.026    -0.043    -0.014     0.313
  Native American          0.2882      1.892    0.058     0.030     0.150     0.016     0.106
  Asian American          -0.8101     -7.415    0.000    -0.177    -0.421    -0.092     0.218
  Multiple Race            0.4363      3.697    0.000     0.058     0.227     0.030     0.134
Ethnicity
  Hispanic                -1.0396    -14.847    0.000    -0.331    -0.540    -0.172     0.319
constant                   2.0539     16.095    0.000     .         .         .         .
Group and Model Comparisons of Logistic Regression Coefficients
Researchers often test theories and hypotheses by making comparisons across groups and models. The comparison across groups might come from estimating the same models for two or more groups, such as males and females or young, middle-aged, and older persons. The group comparisons are typically tested using interaction or moderation terms. The comparison across models might come from sequentially adding confounding or mediating independent variables to a model for the same group and sample and comparing the coefficients across the models. The effect of gender on an outcome might be first shown without controls and then examined with controls for education and work status. Given the use of the same sample, the first model is nested within the second, and the coefficients of an independent variable change from the first to the second model.
Although these strategies are appropriate with multiple regression, they are generally not appropriate with logistic regression coefficients. The editors of American Sociological Review (Mustillo, Lizardo, & McVeigh, 2018), in presenting guidelines for authors submitting articles, say the following: “don’t use the coefficient of the interaction term to draw conclusions about statistical interaction in categorical models such as logit, probit, Poisson, and so on.” The same problem noted for interaction terms applies to comparisons of coefficients across separate groups and across nested models. Comparing the size of logit coefficients or odds ratios or using tests of significance for differences in the size of the logit coefficients or odds ratios presents challenges for interpretation. The sources of the problems are beyond the scope of this book, but several articles describe them in detail (Allison, 1999; Breen et al., 2018; Mood, 2010; Williams, 2009).
Very briefly, comparisons must assume that the errors are the same across the multiple groups and models, but the errors for the logits are not known. Comparing coefficients across different groups and models, which may have different but unknown error variances, can be misleading. Breen et al. (2018) offer an overview of approaches that avoid or minimize the problem of comparing coefficients in logistic regression and related models. Possible solutions include Y-standardizing or fully XY-standardizing the coefficients, focusing on marginal probability effects, and using linear regression with robust standard errors. More complex solutions involve specifying additional model constraints when comparing coefficients across different groups (Allison, 1999; Williams, 2009) or residualizing the added control variables when comparing coefficients across nested models for the same sample (Breen et al., 2018).
To illustrate one approach, consider an example of group comparisons using interaction terms and interpretations based on the marginal effects on probabilities. Table 2.10 presents a model of smoking that allows education and Hispanic ethnicity to interact. The positive and significant interaction term indicates that the negative effect of Hispanic ethnicity is smaller at higher levels of education. The size of the Hispanic ethnicity logged odds coefficient should not be compared across education levels, but marginal effects on probabilities are appropriate for analysis (Long & Mustillo, 2018). Figure 2.8 shows the average marginal effect of Hispanic ethnicity as a discrete change across years of completed education as implied by the logistic regression model. As the graph shows, the negative marginal effect is smaller at higher levels of education.

Table 2.10 Stata Output: Logistic Regression Model of Current Smoking With the Interaction Between Hispanic Ethnicity and Education, NHIS 2017
Smoker                      Coef.        Std. Err.    z         p>|z|    [95% Conf. Interval]
Education                   -.2233305    .0078442     -28.47    0.000    -.2387048    -.2079562
Ethnicity
  Hispanic                  -3.115895    .2149463     -14.50    0.000    -3.537182    -2.694608
Ethnicity#c.Education
  Hispanic                   .1802465    .0168571      10.69    0.000     .1472071     .2132858
Gender
  Male                       .25177      .0371321       6.78    0.000     .1789925     .3245476
Age                         -.0243334    .0011835     -20.56    0.000    -.0266531    -.0220138
Race
  African American          -.1187599    .0579308      -2.05    0.040    -.2323022    -.0052175
  Native American            .2494087    .1524716       1.64    0.102    -.0494302     .5482476
  Asian American            -.8317383    .1108915      -7.50    0.000    -1.049082    -.614395
  Multiple Race              .4151453    .1182646       3.51    0.000     .183351      .6469396
_cons                        2.622645    .139376       18.82    0.000     2.349473     2.895817
Figure 2.8 Average marginal effects of Hispanic ethnicity at values of education from the logistic regression model of current smoking with the interaction between Hispanic ethnicity and education, NHIS 2017. [Graph: Average Marginal Effects of Ethnicity With 95% CIs. Y axis: Effects on Pr(Smoker), −.8 to 0. X axis: Years of Education Attained, 0 to 18.]
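The shrinking gap can be illustrated with the coefficients in Table 2.10. The sketch below computes the discrete change in the predicted probability of smoking for Hispanics versus non-Hispanics at several education levels. The representative case is an assumption made for this illustration (age 50, with the gender and race indicators set to 0), so the values are effects at representative values and will differ from the average marginal effects plotted in Figure 2.8.

```python
import math

# Coefficients as reported in Table 2.10
B = {
    "cons": 2.622645,
    "educ": -0.2233305,
    "hisp": -3.115895,
    "hisp_x_educ": 0.1802465,
    "age": -0.0243334,
}

def prob_smoker(education, hispanic, age=50):
    """Predicted probability; gender and race indicators are held at 0."""
    logit = (B["cons"] + B["educ"] * education + B["age"] * age
             + hispanic * (B["hisp"] + B["hisp_x_educ"] * education))
    return 1.0 / (1.0 + math.exp(-logit))

def hispanic_gap(education, age=50):
    """Discrete change in probability: Hispanic minus non-Hispanic."""
    return prob_smoker(education, 1, age) - prob_smoker(education, 0, age)

for education in (0, 6, 12, 18):
    print(education, round(hispanic_gap(education), 3))
```

The gap is computed on the probability scale, which is the comparison the text recommends in place of comparing logit coefficients across education levels.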
Despite its value, the single graph simplifies the complexities involved with interactions in logistic regression. Interpreting the probability effects in general requires care and thoroughness—a recommendation that is doubly important for models with interactions.
Summary
Logistic regression coefficients provide a simple linear and additive summary of the influence of a variable on the logged odds of having a characteristic or experiencing an event, but they lack an intuitively meaningful scale of interpretation of change in the dependent variable. Standard tests of significance offer another common way to interpret the results, but by themselves say little about the substantive meaning of the coefficients. Raising e to the coefficient b allows interpretation of the resulting coefficient in terms of multiplicative odds or percentage change in the odds. For still more intuitive coefficients, the marginal effects of independent variables on the probability of an outcome are helpful. However, effects on probabilities depend on the values of the independent variables at which the effect is calculated. Calculating standardized coefficients may help but warrants some caution given difficulties in standardizing the binary outcome. Making comparisons across groups or, equivalently, using statistical interaction terms in logistic regression models similarly warrants caution, as do comparisons across nested models. Care and thoroughness are needed to fully understand the relationships being modeled in logistic regression.
Chapter 3 ESTIMATION AND MODEL FIT
The last chapter focused on logistic regression coefficients, emphasizing the linear and additive relationships with the logged odds, the multiplicative relationships with the odds, and the nonlinear and nonadditive relationships with the probabilities. Although the discussion focused on the probability of experiencing an event or having a characteristic, data on individuals usually include values of only 0 and 1 for the dependent variable rather than the actual probabilities. Without known probabilities, the estimation procedure must use observed values of 0 and 1 on binary dependent variables to obtain predicted probabilities. As discussed earlier, the binary dependent variable may make estimation using ordinary least squares problematic. Logistic regression relies instead on maximum likelihood procedures to obtain the coefficient estimates.
As a general and flexible strategy, maximum likelihood estimation applies to a variety of models (Eliason, 1993), but this chapter illustrates the logic of the estimation technique for logistic regression. Relying on simple terms and examples, the chapter highlights maximum likelihood concepts and their uses for interpreting logistic regression results. Knowledge of estimation procedures helps to explain the source of common hypothesis tests and measures of model accuracy.
Maximum Likelihood Estimation
Maximum likelihood estimation finds the estimates of model parameters that are most likely to give rise to the pattern of observations in the sample data. To illustrate the maximum likelihood principle, consider a simple example involving tossing a coin. Suppose a coin tossed 10 times gives 4 heads and 6 tails. Letting P equal the probability of a head and 1 − P the probability of a tail, the probability of obtaining 4 heads and 6 tails equals:
$$P(4 \text{ heads}, 6 \text{ tails}) = \frac{10!}{4!\,6!}\,[P^4 \times (1 - P)^6].$$
We might normally assume that, with a fair coin, P equals .5 and compute the probability of obtaining four heads. If P is unknown and we need to evaluate the coin’s fairness, however, the question becomes: how can P be estimated from the observed outcome of 4 heads over 10 tosses? Maximum likelihood estimation chooses the P that makes the probability of getting the observed outcome as large as possible.
Table 3.1 Likelihood Function for Coin Flipping Example
P     P^4 × (1 − P)^6     P     P^4 × (1 − P)^6     P     P^4 × (1 − P)^6
.1    .0000531            .4    .0011944            .7    .0001750
.2    .0004194            .5    .0009766            .8    .0000262
.3    .0009530            .6    .0005308            .9    .0000007
In finding the maximum likelihood estimate of P, we can focus on the $P^4 \times (1 - P)^6$ component of the preceding formula. This formula expresses the likelihood of obtaining four heads as a function of varied values of P. Substituting possible values of P into the likelihood function gives the results in Table 3.1. It appears that the maximum value of .0011944 occurs when P equals .4.¹⁰ Further checking the likelihood using the same formula when P varies from .35 to .45 confirms that .4 produces the maximum likelihood. Given the data, the most likely or maximum likelihood estimate of P equals .4. In this way, we pick as the parameter estimate for P the value that gives the highest likelihood of producing the actual observations.
For logistic regression, the procedure begins with an expression for the likelihood of observing the pattern of occurrences (Y = 1) and nonoccurrences (Y = 0) of an event or characteristic in a given sample. This expression, termed the likelihood function, depends on unknown logistic regression parameters. As in the coin tossing example, maximum likelihood estimation finds the model parameters that give the maximum value for the likelihood function. It thereby identifies the estimates for model parameters that are most likely to give rise to the pattern of observations in the sample data.
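The grid search described for the coin example can be sketched in a few lines. The function and variable names are illustrative assumptions; the grid in steps of .01 is a finer version of the values tried in Table 3.1.

```python
def likelihood(p, heads=4, tails=6):
    """Kernel of the binomial likelihood: P^heads * (1 - P)^tails."""
    return p ** heads * (1 - p) ** tails

# Candidate values of P from .01 to .99
grid = [i / 100 for i in range(1, 100)]

# Maximum likelihood estimate on the grid
p_hat = max(grid, key=likelihood)

print(p_hat, round(likelihood(p_hat), 7))
```

A grid search is used here only because it mirrors the table; the same maximum follows analytically as the sample proportion of heads, 4/10.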
Likelihood Function
The likelihood function in logistic regression (Hilbe, 2009, p. 63) parallels the previous formula:
$$LF = \prod \{P_i^{Y_i} \times (1 - P_i)^{1 - Y_i}\},$$
where LF refers to the likelihood function, $Y_i$ refers to the observed value of the dichotomous dependent variable for case i, and $P_i$ refers to the predicted probability for case i. Recall that the $P_i$ values come from a logistic regression model and the formula $P_i = 1/(1 + e^{-L_i})$, where $L_i$ equals the logged odds determined by the unknown parameters β and the independent variables. $\prod$ refers to the multiplicative equivalent of the summation sign and means that the function multiplies the values for each case. The key is to identify β values that produce $L_i$ and $P_i$ values that maximize LF. This brief statement is critical to understanding the process of maximum likelihood estimation. It defines a standard, maximizing the likelihood function, for estimating the logistic model coefficients.
Consider how this formula works. For a case in which $Y_i$ equals 1, the formula reduces to $P_i$ because $P_i$ raised to the power 1 equals $P_i$ and $(1 - P_i)$ raised to the power 0 (that is, $1 - Y_i$) equals 1. Thus, when $Y_i = 1$, the value for a case equals its predicted probability. If, based on the model coefficients, the case has a high predicted probability of the occurrence of an outcome when $Y_i = 1$, it contributes more to the likelihood than if it has a low probability of the outcome. For a case in which $Y_i$ equals 0, the formula reduces to $1 - P_i$ because $P_i$ raised to the power 0 equals 1 while $(1 - P_i)$ raised to the power 1 equals $1 - P_i$. Thus, when $Y_i = 0$, the value for a case equals 1 minus its predicted probability. If the case has a low predicted probability of the occurrence of an outcome based on the model coefficients when $Y_i = 0$, it contributes more to the likelihood than if it has a high probability (e.g., if $P_i = .1$, then $1 - P_i = .9$, which counts more than if $P_i = .9$ and $1 - P_i = .1$).
Take, for example, four cases. Two have scores of 1 on the dependent variable, and two have scores of 0. Assume that the estimated coefficients in combination with the values of the independent variables produce the predicted probabilities for each of the four cases listed in Table 3.2. Using the probabilities with the formula gives the results for each case. The values in the last column indicate the likelihood of the observations given the estimated coefficients; in this example, the observations have relatively high likelihoods.
Compare these results to another set of estimated coefficients that in combination with the values of X produce different predicted probabilities and results for the likelihood formula (see Table 3.3). Here, the estimated coefficients do worse in producing the actual Y values, and the likelihood values are lower.

Table 3.2 Hypothetical Example 1 of Likelihood Function
Y_i    P_i    P_i^Y_i       (1 − P_i)^(1 − Y_i)    P_i^Y_i × (1 − P_i)^(1 − Y_i)
1      .8     .8^1 = .8     .2^0 = 1               .8
1      .7     .7^1 = .7     .3^0 = 1               .7
0      .3     .3^0 = 1      .7^1 = .7              .7
0      .2     .2^0 = 1      .8^1 = .8              .8
Table 3.3 Hypothetical Example 2 of Likelihood Function
Y_i    P_i    P_i^Y_i       (1 − P_i)^(1 − Y_i)    P_i^Y_i × (1 − P_i)^(1 − Y_i)
1      .2     .2^1 = .2     .8^0 = 1               .2
1      .3     .3^1 = .3     .7^0 = 1               .3
0      .7     .7^0 = 1      .3^1 = .3              .3
0      .8     .8^0 = 1      .2^1 = .2              .2
Given a set of estimates for the model parameters, each case has a probability of observing the outcome. Multiplying these probabilities gives a summary indication over all cases of the likelihood that a set of coefficients produces the actual values. Multiplying probabilities means that the total product cannot exceed 1 or fall below 0. It will equal 1 in the unlikely event that every case with a 1 has a predicted value of 1 and every case with a 0 has a predicted value of 0. This likelihood equals .8 × .7 × .7 × .8 or .3136 for the first set of coefficients, and .2 × .3 × .3 × .2 or .0036 for the second. What is already obvious from the more detailed results now shows in a single number. The hypothetical coefficients in the first example produce a larger likelihood function value than the second and are more likely to have given rise to the observed data.
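The two hypothetical examples can be reproduced with a direct translation of the likelihood function. The sketch below assumes Python 3.8 or later for `math.prod`; the function name and variable names are illustrative.

```python
import math

def likelihood(y, p):
    """Product over cases of P_i^Y_i * (1 - P_i)^(1 - Y_i)."""
    return math.prod(pi ** yi * (1 - pi) ** (1 - yi) for yi, pi in zip(y, p))

y = [1, 1, 0, 0]  # observed outcomes for the four cases

lf_example_1 = likelihood(y, [0.8, 0.7, 0.3, 0.2])  # predicted probabilities, Table 3.2
lf_example_2 = likelihood(y, [0.2, 0.3, 0.7, 0.8])  # predicted probabilities, Table 3.3

print(round(lf_example_1, 4), round(lf_example_2, 4))
```

The first set of predicted probabilities yields the larger product, in line with the comparison made in the text.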
Log Likelihood Function

To avoid multiplication of probabilities (and typically having to deal with exceedingly small numbers), the likelihood function can be turned into a logged likelihood function. Since ln(X × Y) = ln X + ln Y, and ln(X^Z) = Z × ln X, the log likelihood function sums the formerly multiplicative terms. Taking the natural log of both sides of the likelihood equation gives the log likelihood function:

$$\ln LF = \sum_i \left[ Y_i \ln P_i + (1 - Y_i) \ln (1 - P_i) \right].$$
If the likelihood function varies between 0 and 1, the log likelihood function will vary from negative infinity to 0 (the natural log of 1 equals 0, and the natural log of 0 is undefined, but as the probability gets closer to 0 the natural log becomes an increasingly large negative number). The closer the likelihood value is to 1, the closer the log likelihood value is to 0, and the more likely it is that the parameters could produce the observed data. The more distant the negative value from 0, the less likely that the parameters could produce the observed data. To illustrate the log likelihood function, we can go through the same examples that appear earlier. In Table 3.4, the sum equals −1.160. The same calculation for the second set of coefficients appears in Table 3.5. The sum equals −5.626. Again, coefficients that best produce the observed values show a higher value (i.e., smaller negative number) for the log likelihood function.
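The two sums can be checked with a short Python sketch. It is only an illustration of the log likelihood formula applied to the four hypothetical cases; the function name `log_likelihood` is an assumption made for the example.

```python
from math import log

def log_likelihood(y, p):
    """Sum Y_i * ln(P_i) + (1 - Y_i) * ln(1 - P_i) over all cases."""
    return sum(yi * log(pi) + (1 - yi) * log(1 - pi) for yi, pi in zip(y, p))

y = [1, 1, 0, 0]
ll_first = log_likelihood(y, [0.8, 0.7, 0.3, 0.2])    # first set of coefficients
ll_second = log_likelihood(y, [0.2, 0.3, 0.7, 0.8])   # second set of coefficients
print(round(ll_first, 3), round(ll_second, 3))
```

Because the log is a monotonic transformation, the set of probabilities with the larger likelihood also has the log likelihood closer to 0.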
Table 3.4 Hypothetical Example 1 of Log Likelihood Function

Yi    Pi    Yi × ln Pi    (1 − Yi) × ln(1 − Pi)    (Yi × ln Pi) + [(1 − Yi) × ln(1 − Pi)]
1     .8    1 × −.223     0 × −1.609               −.223
1     .7    1 × −.357     0 × −1.204               −.357
0     .3    0 × −1.204    1 × −.357                −.357
0     .2    0 × −1.609    1 × −.223                −.223

Table 3.5 Hypothetical Example 2 of Log Likelihood Function

Yi    Pi    Yi × ln Pi    (1 − Yi) × ln(1 − Pi)    (Yi × ln Pi) + [(1 − Yi) × ln(1 − Pi)]
1     .2    1 × −1.609    0 × −.223                −1.609
1     .3    1 × −1.204    0 × −.357                −1.204
0     .7    0 × −.357     1 × −1.204               −1.204
0     .8    0 × −.223     1 × −1.609               −1.609

Estimation

Maximum likelihood estimation aims to find those coefficients that have the greatest likelihood of producing the observed data. In practice, this means maximizing the log likelihood function (Harrell, 2015, Chapter 10).¹¹ Hypothetically, we could proceed in a bivariate model something like this.
1. Pick coefficients for the parameters, say, for example, 1 and .3 in a model with a constant and single predictor.
2. For the first case, multiply b by the X value and add the product to the constant to get a predicted logit (if X equals 2 for the first case, the predicted logit equals 1 + 2 × .3 = 1.6).
3. Translate the logit into a probability using the formula Pi = 1/(1 + e^(−Li)) = e^(Li)/(1 + e^(Li)). For the first case, the probability equals 1/(1 + e^(−1.6)) = 1/(1 + .2019) = .832.
4. If Y = 1, then the contribution to the log likelihood function for this case equals 1 × ln .832 + 0 × ln .168 = −.1839.
5. Repeat steps 2 to 4 for each of the other cases, and sum the components of the log likelihood function to get a total value.
6. Repeat the steps for another pair of coefficients and compare the log likelihood value to that for the first set of coefficients.
7. Do this for all possible coefficients and pick the ones that generate the largest log likelihood value (i.e., closest to 0).

Of course, mathematical formulas and computing procedures allow logistic regression programs to more efficiently identify the estimates that maximize the log likelihood function (see Hilbe, 2009, pp. 58–61 for more detail). Programs usually begin with a model in which all b coefficients equal the least squares estimates. They then use an algorithm to successively choose new sets of coefficients that produce larger log likelihoods and better fit with the observed data. They continue through the iterations or cycles of this process until the increase in the log likelihood function from choosing new parameters becomes so small (and the coefficients change so little) that little benefit comes from continuing any further.
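The hypothetical procedure in the numbered steps can be sketched in Python as a brute-force search over a grid of coefficient pairs. This is an illustration of the logic only, not how logistic regression programs actually estimate the model. The eight data points, the grid limits, and the function names are assumptions invented for the example.

```python
from math import exp, log

def probability(b0, b1, x):
    """Steps 2 and 3: predicted logit, then predicted probability."""
    logit = b0 + b1 * x
    return 1 / (1 + exp(-logit))

def log_likelihood(b0, b1, xs, ys):
    """Steps 4 and 5: sum each case's contribution to the log likelihood."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = probability(b0, b1, x)
        total += y * log(p) + (1 - y) * log(1 - p)
    return total

# Worked numbers from the text: constant 1, slope .3, X = 2, Y = 1.
p_first = probability(1, 0.3, 2)
contribution = 1 * log(p_first) + 0 * log(1 - p_first)
print(round(p_first, 3), round(contribution, 4))

# Hypothetical data (not from the text): one predictor, binary outcome.
xs = [0, 1, 1, 2, 3, 4, 4, 5]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

# Steps 6 and 7: try every pair on a coarse grid and keep the best one.
# Constants run from -3.0 to 3.0 and slopes from -2.0 to 2.0 in steps of 0.1.
best_ll, best_b0, best_b1 = max(
    (log_likelihood(i / 10, j / 10, xs, ys), i / 10, j / 10)
    for i in range(-30, 31)
    for j in range(-20, 21)
)
print(best_b0, best_b1, round(best_ll, 3))
```

A grid search of this kind scales poorly as predictors are added, which is one reason programs rely on iterative algorithms instead.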
Tests of Significance Using Log Likelihood Values

The log likelihood value reflects the likelihood that the data would be observed given the parameter estimates. It is also the deviation from a perfect or saturated model in which the log likelihood equals 0. The larger the value (i.e., the closer the negative value to 0), the better the parameters do in producing the observed data. Although it increases with the effectiveness of the parameters, the log likelihood value has little intuitive meaning because it depends on the sample size as well as on the goodness of fit. We therefore need a standard to help evaluate its relative size.

One way to interpret the size of the log likelihood involves comparing the model value to the initial or baseline value when all the b coefficients equal 0. The baseline log likelihood comes from including only a constant term in the model, the equivalent of using the mean probability as the predicted value for all cases. The greater the difference between the baseline log likelihood and the model log likelihood, the better the model coefficients (along with the independent variables) do in producing the observed sample values.

Consider an example using the General Social Survey (GSS) data. A logistic regression model treats the binary measure of support for legalized marijuana as the dependent variable. The independent variables include continuous measures of time (years since 1973, the first year in which the question was asked), age (18–89+), and years of education (0–20) plus categorical measures of gender (a dummy variable with females as the referent), and size of place of residence (five dummy variables with living in the 12 largest metropolitan areas as the referent). The sample size over the period from 1973 to 2016 is 35,914. The output from Stata in Table 3.6 shows the logistic regression results. The output lists the baseline log likelihood of −22,045.785. The log likelihood values move closer toward 0 with the iterations of the estimation algorithm. By iteration 4, the estimates converge at a log likelihood of −20,193.291. Compared to the baseline model with no independent variables, the model with the independent variables listed in the table moves the log likelihood 1,852.494 closer to 0 (the baseline value minus the model value equals −1,852.494). That value still has little meaning, but it has uses for hypothesis testing and goodness of fit measures.
Table 3.6 Stata Output: Logistic Regression Model of Support for Legalized Marijuana, GSS 1973–2016

Iteration 0:   log likelihood = -22045.785
Iteration 1:   log likelihood = -20237.596
Iteration 2:   log likelihood = -20193.327
Iteration 3:   log likelihood = -20193.291
Iteration 4:   log likelihood = -20193.291

Logistic regression                      Number of obs   =     35,914
                                         LR chi2(9)      =    3704.99
                                         Prob > chi2     =     0.0000
Log likelihood = -20193.291              Pseudo R2       =     0.0840

Grass_Legal             Coef.        Std. Err.    z         P>|z|    [95% Conf. Interval]
Education               .0705661     .0041961     16.82     0.000    .0623418     .0787904
Time                    .0396423     .0009967     39.78     0.000    .0376889     .0415957
Age                     -.0209352    .000748      -27.99    0.000    -.0224014    -.0194691
Gender
  Male                  .4106996     .02425       16.94     0.000    .3631706     .4582287
SizeOfPlace
  smsa's 13-100         -.1102056    .0502961     -2.19     0.028    -.2087842    -.011627
  suburb, 12 lrgst      -.2255893    .0524727     -4.30     0.000    -.328434     -.1227447
  suburb, 13-100        -.2623289    .0495089     -5.30     0.000    -.3593646    -.1652932
  other urban           -.4139728    .0442316     -9.36     0.000    -.5006653    -.3272804
  other rural           -.5736571    .0541428     -10.60    0.000    -.6797751    -.4675392
_cons                   -1.575737    .0773388     -20.37    0.000    -1.727318    -1.424156

Hypothesis Testing

The difference in the baseline and model log likelihood values evaluates the null hypothesis that b1 = b2 = ··· = bk = 0. It does so by determining if the difference is larger than would be expected from random error alone. The test proceeds as follows. Take the difference between the baseline log likelihood and the model log likelihood. Multiplying that difference by −2 gives a chi-square value with degrees of freedom equal to the number of independent variables (not including the constant, but including squared and interaction terms). Used in combination with the chi-square distribution, the chi-square value tests the null hypothesis that all coefficients other than the constant equal 0. It reveals if the change in the log likelihood due to all independent variables could have occurred by chance beyond a prespecified significance level (i.e., the improvement in the log likelihood does not differ significantly from 0). For a given degree of freedom, the larger the chi-square value, the greater the model improvement over the baseline, and the less likely that all the variable coefficients equal 0 in the population. Multiplying the log likelihood difference by −2 to obtain the chi-square value is equivalent to multiplying the baseline log likelihood by −2 and the model log likelihood values by −2, and then taking the difference in the
values to measure the model improvement. The value equal to the baseline log likelihood times −2 is called the null deviance, and the value equal to the model log likelihood times −2 is called the residual deviance. Both represent the deviance of the log likelihood value from a saturated or perfect model and, after being multiplied by −2, both are positive rather than negative values. The difference between the null deviance and the residual deviance gives the chi-square value.

Using the four cases presented earlier illustrates this significance test (Table 3.7). Without knowledge of X, the baseline model would use the mean of Y, say .5, as the predicted probability for each case. Using the likelihood and log likelihood functions, and substituting predicted probabilities of .5 for each case, gives a likelihood of .0625 and a log likelihood of −2.773 for the baseline model. If X relates to Y, however, the log likelihood knowing X should be closer to 0 and reflect a better model than the log likelihood not knowing X. Assume that the log likelihood value computed earlier is the maximum. It has a likelihood value of .3136 and a log likelihood value of −1.160. A summary comparison of the baseline and final models in Table 3.7 shows the improvement from knowing X. Although the units of these figures still make little intuitive sense, one can see improvement in the final model compared to the initial or baseline model. A test of significance using the chi-square distribution tells if the 3.226 improvement likely could have occurred by chance alone (at a preselected probability level). With 1 degree of freedom for the one independent variable, the critical chi-square value at .05 equals 3.8414. Since the actual chi-square does not reach the critical value, we cannot reject the null hypothesis that the independent variable has no influence on the dependent variable.

Table 3.7 Hypothetical Example 1 of Difference in Log Likelihood Functions

Model             LF        LLF       −2(LLF)
Baseline model    .0625     −2.773    5.546
Final model       .3136     −1.160    2.320
Difference        −.2511    −1.613    3.226

LF = likelihood function, LLF = log likelihood function.
Of course, this artificial example with only four cases makes it difficult to reach any level of statistical significance, but it illustrates the use of the chi-square test.

Researchers often refer to the chi-square difference or the improvement in the log likelihood as the likelihood ratio. The log of the ratio of the baseline likelihood to the model likelihood equals the difference between the two log likelihoods. The general principle is that ln X − ln Y = ln(X/Y). In the example, the ratio of the likelihood values divides .0625 by .3136. The log of this value equals −1.613, which is identical to the difference in the log likelihood values. Multiplying the log of the likelihood ratio by −2 gives the chi-square value for the test of the overall model significance.

Refer back to Table 3.6 for an actual example. The difference between the baseline and model log likelihood values is −1852.494. Multiplying that value by −2 gives 3704.988, the value listed in the output as the LR chi2(9). The low probability of the chi-square at 9 degrees of freedom means it is unlikely that the coefficients for all the independent variables equal 0 in the population. With a large sample size and meaningful independent variables, that conclusion is obvious. But the test may be more valuable for other models and data.

In review, then, the likelihood values range from 0 to 1, while the log likelihood values range from negative infinity to 0. The baseline model typically shows lower likelihood and log likelihood values than the final model. The larger the likelihood and log likelihood values for the final model are relative to the baseline model values, the greater the improvement from estimating nonzero parameters. The log likelihood values times −2, which range from 0 to positive infinity, reverse the direction of interpretation so that the baseline model typically shows a higher value or deviance than the final model. Again, however, the larger the difference between the two models, the larger the improvement in the model due to the independent variables.
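A brief Python sketch can reproduce both likelihood ratio chi-square values discussed above. The function names are illustrative assumptions. The p-value shortcut uses the identity that a chi-square variable with 1 degree of freedom is a squared standard normal variable, so it applies only to the four-case example and not to the 9-degree-of-freedom GSS test.

```python
from math import erfc, log, sqrt

def lr_chi_square(ll_baseline, ll_model):
    """Return -2 times (baseline log likelihood minus model log likelihood)."""
    return -2 * (ll_baseline - ll_model)

def p_value_1df(chi_square):
    """Upper-tail probability of a chi-square value with 1 degree of freedom."""
    return erfc(sqrt(chi_square / 2))

# Four-case example: baseline probabilities of .5, final likelihood of .3136.
ll_baseline_toy = 4 * log(0.5)
ll_model_toy = log(0.3136)
chi2_toy = lr_chi_square(ll_baseline_toy, ll_model_toy)
p_toy = p_value_1df(chi2_toy)
print(round(chi2_toy, 3), round(p_toy, 3))

# GSS example: log likelihoods reported in Table 3.6.
chi2_gss = lr_chi_square(-22045.785, -20193.291)
print(round(chi2_gss, 3))
```

The toy p-value falls between .05 and .10, consistent with the chi-square of 3.226 not reaching the critical value of 3.8414.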
Comparing Models

The logic of the chi-square test of the difference between the baseline and the final models applies to the comparison of any two nested models for the same sample. If a full model contains k (e.g., 10) variables, and a restricted model contains h fewer variables than the full model (e.g., 6 fewer or 4 total), the chi-square can test the null hypothesis that the coefficients for the h variables added to the restricted model all equal 0. Simply subtract the log likelihood of the full model from the log likelihood of the restricted model and multiply that result by −2. Equivalently, subtract −2 times the log likelihood for the full model from −2 times the log likelihood for the restricted model. In both cases, the result equals a chi-square value with h degrees of freedom. The test of the baseline model represents a subcase of the more general nested model where h includes all variables in the full model.
The results in Table 3.6 for support of legalized marijuana can illustrate the use of the log likelihood values for the comparison of models. Table 3.8 displays the logistic regression results with the categorical size of place variable excluded from the model. Such an exclusion is helpful in checking for the overall significance of a categorical measure like size of place rather than relying on the significance of the individual dummy variables that make up the categorical measure. The likelihood ratio chi-square of 3518.51 in Table 3.8 is smaller than the likelihood ratio chi-square of 3704.99 for the full model in Table 3.6. The difference of 186.48 can be evaluated at 5 degrees of freedom (i.e., the difference in the degrees of freedom of the two models). The associated probability is .0000. The null hypothesis that all the size of place coefficients equal 0 in the population can be rejected.

Table 3.8 Stata Output: Logistic Regression Model of Support for Legalized Marijuana, Without Size of Place of Residence, GSS 1973–2016

Iteration 0:   log likelihood = -22045.785
Iteration 1:   log likelihood = -20327.426
Iteration 2:   log likelihood = -20286.559
Iteration 3:   log likelihood = -20286.53
Iteration 4:   log likelihood = -20286.53

Logistic regression                      Number of obs   =     35,914
                                         LR chi2(4)      =    3518.51
                                         Prob > chi2     =     0.0000
Log likelihood = -20286.53               Pseudo R2       =     0.0798

Grass_Legal     Coef.        Std. Err.    z         P>|z|    [95% Conf. Interval]
Education       .0762428     .0041488     18.38     0.000    .0681114     .0843743
Time            .0396614     .0009933     39.93     0.000    .0377146     .0416082
Age             -.0213682    .0007461     -28.64    0.000    -.0228305    -.0199059
Gender
  Male          .404574      .0241696     16.74     0.000    .3572025     .4519455
_cons           -1.933154    .0677671     -28.53    0.000    -2.065975    -1.800333

This procedure can test for the significance of a single variable by simply comparing the log likelihoods with and without the variable in question (in this case, h refers to one variable). Subtracting −2 times the log likelihood for the model with the variable (i.e., the full model) from −2 times the log likelihood for the model without the variable (i.e., the restricted model) provides a chi-square statistic for the individual variable with 1 degree of freedom.
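The nested-model comparison can be sketched in Python using the log likelihoods reported in Tables 3.6 and 3.8. The variable names are assumptions for illustration, and the critical value of 11.07 is the standard .05 chi-square value at 5 degrees of freedom.

```python
# Log likelihoods and numbers of predictors from Tables 3.6 and 3.8.
ll_full, k_full = -20193.291, 9               # with size of place
ll_restricted, k_restricted = -20286.53, 4    # without size of place

# Restricted minus full, times -2, gives the chi-square for the h dropped terms.
chi2_diff = -2 * (ll_restricted - ll_full)
df_diff = k_full - k_restricted

critical_05 = 11.07  # chi-square critical value at .05 with 5 degrees of freedom
print(round(chi2_diff, 2), df_diff, chi2_diff > critical_05)
```

The result matches the difference between the two LR chi-square values in the Stata output (3704.99 − 3518.51 = 186.48).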
Model Goodness of Fit

The log likelihood and deviance values are useful for significance tests but lack a meaningful scale for comparisons across different outcomes and samples. Two other approaches to model goodness of fit are sometimes useful, although each has limitations.
Pseudo-Variance Explained

Least-squares regression programs routinely compute and print the R² or variance explained. Because the dependent variable in logistic regression does not have variance in the same way continuous variables do in regression, maximum likelihood procedures in logistic regression offer measures that are analogous but not identical to those from least squares regression. Because of the difference, we refer to the pseudo-variance explained or pseudo-R² in logistic regression.

As in tests of significance, it makes intuitive sense to compare a model knowing the independent variables to a model not knowing them. In regression, the total sum of squares follows from a model not knowing the independent variables, the error sum of squares follows from a model knowing the independent variables, and the difference indicates the improvement due to the independent variables. The difference divided by the total sum of squares defines the R² or variance explained. The result shows the proportional reduction in error of the model.

In logistic regression, the baseline log likelihood (ln L0) times −2 represents the null deviance with parameters for the independent variables equaling 0. The model log likelihood (ln L1) times −2 represents the residual deviance with the estimated parameters for the independent variables. The improvement in the residual deviance of the model relative to the null deviance at baseline shows the improvement due to the independent variables. Accordingly, these two log likelihoods define a type of proportional reduction-in-error measure¹²:

$$R^2 = \frac{(-2 \ln L_0) - (-2 \ln L_1)}{-2 \ln L_0}.$$

The numerator shows the reduction in the deviance due to the independent variables, and the denominator shows the deviance without using the independent variables. The resulting value shows the improvement in the model relative to the baseline. It equals 0 when all the coefficients equal 0 and has a maximum that comes close to 1.¹³ However, the measure does not represent explained variance since log likelihood functions do not deal with variance defined as the sum of squared deviations. This particular pseudo-variance explained or pseudo-R² is commonly called the McFadden measure. Tables 3.6 and 3.8 from Stata each present this pseudo-R² value. It is .0840 for the full model and .0798 for the reduced model.

SPSS lists two similar measures in its logistic regression output. The Cox and Snell (1989) pseudo-R² is based on raising the ratio of the likelihood values to the power 2/n. The Nagelkerke (1991) pseudo-R² adjusts the Cox and Snell measure to ensure a maximum of 1. The SPSS output listed in Table 3.9 is based on the full logistic regression model of support for legalized marijuana. The logit coefficients are identical to those from Stata, but the Cox and Snell measure of .098 is larger than the McFadden measure of .084, and the Nagelkerke measure of .139 is still larger.

Nearly a dozen other measures have been proposed. A sense of the diversity of the measures comes from an SPOST command in Stata ("fitstat") that follows a logistic regression command. Table 3.10 lists selected output from fitstat. The McFadden value matches that from the Stata logistic regression output (Table 3.6), while the Cox-Snell/ML and the Cragg-Uhler/Nagelkerke values match those from the SPSS output. The other measures vary, sometimes widely.

Specialized R packages will calculate many of these measures, but some basic commands allow one to get the main information. The R output in Table 3.11 replicates the coefficients from Stata and SPSS and includes information on the null deviance and the residual deviance. Recall that the deviance equals −2 times the log likelihood values. The difference between the two gives the LR chi-square value (3705) and the degrees of freedom for the chi-square (9).
The probability of the chi-square for the degrees of freedom can be obtained from R, web calculators, or a table. The McFadden pseudo-R² can also be obtained easily by taking the difference in the deviances (3705) divided by the null deviance (44,092). The calculation gives .084. Long (1997, pp. 104–113) and Menard (2002) review and compare the numerous measures that appear in the literature on logistic regression, but the details will not be crucial for most users of logistic regression. The pseudo-R² is limited by the potential for different measures to give different results and by the ambiguity of their meaning in the absence of a real measure of variance to be explained. Researchers can use these measures as rough guides without attributing great importance to a precise figure. In fact, it is important to note that few published articles using logistic regression present a measure of the pseudo-variance explained.
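As a rough check on the three measures reported by Stata and SPSS, the Python sketch below computes them from the baseline and model log likelihoods in Table 3.6. The variable names are illustrative assumptions. The Cox and Snell and Nagelkerke formulas follow the verbal descriptions above (the likelihood ratio raised to the power 2/n, then rescaled by its maximum possible value).

```python
from math import exp

n = 35914               # sample size
ll_null = -22045.785    # baseline log likelihood, ln L0
ll_model = -20193.291   # model log likelihood, ln L1

null_deviance = -2 * ll_null
residual_deviance = -2 * ll_model

# McFadden: proportional reduction in deviance.
mcfadden = (null_deviance - residual_deviance) / null_deviance

# Cox and Snell: 1 minus (L0 / L1) raised to the power 2/n, computed on the log scale.
cox_snell = 1 - exp(2 * (ll_null - ll_model) / n)

# Nagelkerke: Cox and Snell divided by its maximum attainable value.
nagelkerke = cox_snell / (1 - exp(2 * ll_null / n))

print(round(mcfadden, 3), round(cox_snell, 3), round(nagelkerke, 3))
```

Working on the log scale avoids computing the likelihoods themselves, which are far too small to represent directly for a sample of this size.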
Table 3.9 SPSS Output: Logistic Regression Model of Support for Legalized Marijuana, GSS 1973–2016

Omnibus Tests of Model Coefficients

                  Chi-square    df    Sig.
Step 1   Step     3704.988      9     .000
         Block    3704.988      9     .000
         Model    3704.988      9     .000

Model Summary

Step    −2 Log likelihood    Cox and Snell R²    Nagelkerke R²
1       40386.581a           .098                .139

a Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.

Classification Table a

                                  Predicted Grass_Legal
Observed (Step 1)                 0         1        Percentage Correct
Grass_Legal         0             23345     1667     93.3
                    1             8694      2208     20.3
Overall Percentage                                   71.2

a The cut value is .500

Variables in the Equation

Step 1a           B         S.E.    Wald        df    Sig.    Exp(B)
Education         .071      .004    282.808     1     .000    1.073
Time              .040      .001    1582.089    1     .000    1.040
Age               −.021     .001    783.257     1     .000    .979
Gender(1)         .411      .024    286.831     1     .000    1.508
SizeOfPlace                         186.247     5     .000
SizeOfPlace(1)    −.110     .050    4.801       1     .028    .896
SizeOfPlace(2)    −.226     .052    18.483      1     .000    .798
SizeOfPlace(3)    −.262     .050    28.075      1     .000    .769
SizeOfPlace(4)    −.414     .044    87.595      1     .000    .661
SizeOfPlace(5)    −.574     .054    112.260     1     .000    .563
Constant          −1.576    .077    415.119     1     .000    .207

a Variable(s) entered on step 1: Education, Time, Age, Gender, SizeOfPlace.
Table 3.10 SPOST Output: Pseudo-R² Measures From Logistic Regression Model of Support for Legalized Marijuana, GSS 1973–2016

Measure                    R²
McFadden                   0.084
McFadden (adjusted)        0.084
McKelvey & Zavoina         0.152
Cox-Snell/ML               0.098
Cragg-Uhler/Nagelkerke     0.139
Efron                      0.099
Tjur's D                   0.099
Count                      0.712
Count (adjusted)           0.050
Table 3.11 R Output: Logistic Regression Model of Support for Legalized Marijuana, GSS 1973–2016

Coefficients:
                  Estimate      Std. Error    z value    Pr(>|z|)
(Intercept)       -1.5757370    0.0773388     -20.374    < 2e-16 ***
Education          0.0705661    0.0041961      16.817    < 2e-16 ***
Time               0.0396423    0.0009967      39.776    < 2e-16 ***
Age               -0.0209352    0.0007480     -27.987    < 2e-16 ***
Gender.f1          0.4106996    0.0242500      16.936    < 2e-16 ***
SizeOfPlace.f2    -0.1102056    0.0502961      -2.191    0.0284 *
SizeOfPlace.f3    -0.2255893    0.0524727      -4.299    1.71e-05 ***
SizeOfPlace.f4    -0.2623289    0.0495089      -5.299    1.17e-07 ***
SizeOfPlace.f5    -0.4139728    0.0442316      -9.359    < 2e-16 ***
SizeOfPlace.f6    -0.5736571    0.0541428     -10.595    < 2e-16 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 44092 on 35913 degrees of freedom
Residual deviance: 40387 on 35904 degrees of freedom

Group Membership

Another approach to model goodness of fit compares predicted group membership with observed group membership. Using the predicted probabilities for each case, logistic regression programs also predict group membership. Based on a typical cut value of .5, those cases with predicted probabilities at .5 or above are predicted to score 1 on the dependent variable and those cases with predicted probabilities below .5 are predicted to score 0. Cross-classifying the two categories of the observed dependent variable with the two categories of the predicted dependent variable produces a 2 × 2 table. A highly accurate model would show that most cases fall in the cells defined by 0 on the observed and 0 on the predicted group membership and by 1 on the observed and 1 on the predicted group membership. Relatively few cases would fall into the cells defined by a mismatch of observed and predicted group membership.

A simple summary measure equals the percentage of all cases in the correctly predicted cells. A perfect model would correctly predict group membership for 100% of the cases; a failed model would do no better than chance by correctly predicting 50% of the cases. The percentage of correctly predicted cases from 50 to 100 provides a crude measure of predictive accuracy. The SPSS output routinely lists the classification table, and a simple Stata command after a logistic regression command ("estat classification") produces the table plus various measures. Table 3.9 shows that 71.2% of cases were correctly predicted.

However, if one category of the dependent variable is substantially larger than the other, a model can do better than 50% by simply predicting the largest category for all cases. A more accurate measure takes the percentage of correctly predicted cases beyond the percentage that would be predicted by choosing the percentage in the largest category of the dependent variable (Long, 1997, pp. 107–108). For support of legalized marijuana, those disagreeing make up 69.6% of the sample. Correctly predicting the support of 71.2% of the sample shows only slight improvement.
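The percentages quoted above can be recomputed from the classification counts in Table 3.9. The Python below is a small sketch with assumed variable names. The final quantity is the share of correct predictions beyond those obtained by always predicting the largest category, which corresponds to the adjusted count value of 0.050 listed in Table 3.10.

```python
# Counts from the SPSS classification table, keyed by (observed, predicted).
counts = {(0, 0): 23345, (0, 1): 1667, (1, 0): 8694, (1, 1): 2208}

n = sum(counts.values())
correct = counts[(0, 0)] + counts[(1, 1)]
pct_correct = 100 * correct / n

# Size of the largest observed category (here, those scoring 0).
observed_0 = counts[(0, 0)] + counts[(0, 1)]
observed_1 = counts[(1, 0)] + counts[(1, 1)]
largest = max(observed_0, observed_1)
pct_largest = 100 * largest / n

# Correct predictions beyond the largest-category guess, as a proportion
# of the cases that guess would get wrong.
adjusted_count = (correct - largest) / (n - largest)

print(round(pct_correct, 1), round(pct_largest, 1), round(adjusted_count, 3))
```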
Measures of association for nominal and ordinal variables can more precisely summarize the strength of the relationship between predicted and observed values. Menard (2010, Chapter 4) discusses numerous measures of relationship strength for tabular data such as φ, τ, γ, and λ. However, results from predicting group membership can differ widely from results for the pseudo-variance explained. Other than an occasional listing of percent correctly predicted, few articles report more detail on the cross-classification of observed and predicted group membership.
Summary

For quick reference, some key terms used in estimation and model goodness of fit are listed below.

Term: Definition
Baseline log likelihood: A value ranging from negative infinity to 0 that reflects the log likelihood of a model with no predictors
Baseline or null deviance: A value equal to the baseline log likelihood times −2 that ranges from 0 to positive infinity
Model log likelihood: A value ranging from negative infinity to 0 that reflects the log likelihood of a model with predictors
Model or residual deviance: A value equal to the model log likelihood times −2 that ranges from 0 to positive infinity
Likelihood ratio: Difference between the baseline and model log likelihoods times −2, which has a chi-square distribution and degrees of freedom equal to the number of predictors
Pseudo-variance explained: A measure of goodness of fit in logistic regression that ideally ranges from 0 to 1
Chapter 4 PROBIT ANALYSIS

Logistic regression deals with the ceiling and floor problems in modeling a binary dependent variable by transforming probabilities of an outcome into logits. Although probabilities vary between 0 and 1, logits or the logged odds of the probabilities have no such limits; they vary from negative to positive infinity. Many other transformations also eliminate the ceiling and floor of probabilities. A number of functions define S-shaped curves that differ in how rapidly or slowly the tails approach 0 and 1. The logit transformation in logistic regression has the advantage of relative simplicity and is used most commonly. One other transformation based on the normal curve that appears in the published literature is worth reviewing.
Another Way to Linearize the Nonlinear

Probit analysis transforms probabilities of an event into scores from the cumulative standard normal distribution rather than into logged odds from the logistic distribution. Despite this difference, probit analysis and logistic regression give essentially equivalent results. This chapter examines probit analysis separately, but to emphasize similarities, uses the earlier material on logistic regression to explain the logic of probit analysis.

To transform probabilities with a floor of 0 and a ceiling of 1 into scores without these boundaries, the probit transformation relates the probability of an outcome to the cumulative standard normal distribution rather than to the logged odds. To explain this transformation, it helps to review the information contained in tables from any statistics text on areas of the standard normal curve. The tables match z scores (theoretically ranging from negative infinity to positive infinity, but in practice from −5 to 5) with a proportion of the area under the curve between the absolute value of the z score and the mean z score of 0. With some simple calculations, the standard normal table identifies the proportion of the area from negative infinity to the z score. The proportion of the curve at or below each of the z scores defines the cumulative standard normal distribution. Since the proportion equals the probability that a standard normal random variable will fall at or below that z score, larger z scores define greater probabilities in the cumulative standard normal distribution.

Conversely, just as any z score defines a probability in the cumulative standard normal distribution, any probability in the cumulative standard normal distribution translates into a z score. The greater the cumulative probability, the higher the associated z score. Furthermore, because probabilities vary between 0 and 1, and the corresponding z scores vary between positive and negative infinity, this suggests using the areas defined by the standard normal curve to transform bounded probabilities into unbounded z scores.

To illustrate, Figures 4.1 and 4.2 depict the standard normal curve and the cumulative standard normal curve. The normal curve in Figure 4.1 plots the height or density on the vertical axis for each z score on the horizontal axis, which approximates the probability of a single z value. In addition, each z score implicitly divides the curve into two portions: the portion between negative infinity and the z score, and the portion beyond the z score or between the z score and positive infinity. If the former area under the curve equals P, the latter area under the curve equals 1 − P. Note also that the height of the normal curve is greatest around values near 0 and smallest at the tails of the curve. Thus, P and 1 − P change more near the middle of the curve than near the extremes.

The cumulative standard normal curve in Figure 4.2 directly plots the area in the standard normal curve at or below each z score. As the z scores get larger, the cumulative proportion of the normal curve at or below the z score increases. As for the standard normal curve, the z scores define the X axis, but the Y axis refers to the proportion of area at or below that z score rather than to the height of the normal curve. Drawing a line up to the curve from a z score, and then drawing another perpendicular line across to the Y axis, shows the cumulative probability associated with each z score and the area of the standard normal curve at or below that z score.

[Figure 4.1 Standard normal curve. Horizontal axis: Z; vertical axis: height (density) of the curve.]

[Figure 4.2 Cumulative standard normal curve. Horizontal axis: Z; vertical axis: cumulative proportion from 0 to 1.]

The cumulative standard normal curve resembles the logistic curve, only with z scores instead of logged odds along the horizontal axis. The curve
approaches but does not reach 0 as the z scores decrease toward negative infinity, and the curve approaches but does not reach 1 as the z scores increase toward positive infinity. Although the probit curve approaches the floor and ceiling slightly faster than the logit curve, the differences are small. Thus, as logistic regression uses the logistic curve to translate probabilities into logits or logged odds, probit analysis uses the cumulative standard normal curve to translate probabilities into associated z scores. Although related nonlinearly to the probabilities, independent variables relate linearly to the z scores from the probit transformation.

To illustrate the properties of the transformation used in probit analysis, the numbers below match z scores with probabilities. The first row lists z scores, and the second row lists the associated probabilities of the cumulative standard normal distribution (i.e., the area of the normal curve between negative infinity and the z score).

z score       −4       −3       −2      −1      0     1       2       3        4
Probability   0.00003  0.00135  0.0228  0.1587  0.5   0.8413  0.9772  0.99865  0.99997
Note the nonlinear relationship between the z scores and probabilities: the same one-unit change in the z scores produces a smaller change in the probabilities near the floor of 0 and near the ceiling of 1 than in the middle. Conversely, the probabilities in the first row below define z scores in the second row.

Probability   0.1     0.2     0.3     0.4     0.5   0.6    0.7    0.8    0.9
z score       −1.282  −0.842  −0.524  −0.253  0     0.253  0.524  0.842  1.282
These figures likewise show nonlinearity: the same change in probabilities results in a bigger change in z scores as the probabilities approach 0 and 1.

These examples show that the probit transformation has the same properties as the logit transformation. It has no upper or lower boundary, as the domain of the normal curve extends to infinity in either direction. It is symmetric around the midpoint probability of .5; the z scores for probabilities .4 and .6 are identical except for the sign. Additionally, the same change in probabilities translates into larger changes in z scores for probabilities near 0 and 1. The transformation thus stretches the probabilities near the boundaries. In short, translating probabilities into z scores based on the cumulative standard normal curve has the characteristics necessary to linearize certain types of nonlinear relationships.
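The two rows of numbers above can be reproduced with a few lines of code. The following is a minimal sketch that assumes Python's standard-library `statistics.NormalDist` as a stand-in for the cumulative standard normal distribution and its inverse; the function names `cumulative_probability` and `probit` are illustrative and do not come from the text.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)


def cumulative_probability(z):
    """Area of the standard normal curve at or below z."""
    return std_normal.cdf(z)


def probit(p):
    """Inverse of the cumulative standard normal: probability -> z score."""
    return std_normal.inv_cdf(p)


# z scores to probabilities (first pair of rows in the text).
for z in range(-4, 5):
    print(f"z = {z:+d}   P = {cumulative_probability(z):.5f}")

# Probabilities to z scores (second pair of rows in the text).
for tenths in range(1, 10):
    p = tenths / 10
    print(f"P = {p:.1f}   z = {probit(p):+.3f}")
```

Comparing `probit(0.99) - probit(0.98)` with `probit(0.51) - probit(0.50)` shows the stretching near the boundaries: the same .01 change in probability corresponds to a much larger change in z near the ceiling than near the middle.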
The Probit Transformation

Like logistic regression, probit analysis relies on a transformation to make regression on a binary dependent variable similar to regression on a continuous variable. Given a probability of experiencing an event or having a characteristic, the predicted probit becomes the dependent variable in a linear equation determined by one or more independent variables: $Z_i = b_0 + b_1 X_i$. Z represents the nonlinear transformation of probabilities into z scores using the cumulative standard normal distribution. By predicting the z scores with a linear equation, probit analysis implicitly describes a nonlinear relationship with probabilities in which the independent variable has a greater effect on the probabilities near the middle of the curve than near the extremes.

In logistic regression, we can summarize the transformation of probabilities into logged odds and vice versa with relatively simple formulas. For probit analysis, the complex formula for the standard normal curve makes for more difficulty. Corresponding to the nonlinear equation for determining probabilities in logistic regression, $P_i = 1/(1 + e^{-L_i})$, the nonlinear equation for probit analysis takes $P_i$ as a function of $Z_i$ in the formula for the cumulative standard normal distribution. The formula involves an integral (roughly similar to summation for a continuous distribution) that transforms z scores from negative to positive infinity into probabilities with a minimum of 0 and maximum of 1. Based on the cumulative standard normal distribution, the cumulative probability associated with any z score equals:

$$P = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{Z} \exp(-U^2/2)\, dU,$$
where U is a random variable with a mean of 0 and standard deviation of 1. The formula merely says that the probability of the event equals the area under the standard normal curve between negative infinity and Z. The larger the value of Z, the larger the cumulative probability. Because of the complexity of the formula, however, computers normally do the calculations. In any case, keep in mind that the goal of the formula is to translate the linearly determined Z in the probit equation back to nonlinearly determined probabilities.

Corresponding to the formula in logistic regression for the logged odds, $L_i = \ln(P_i/(1 - P_i))$, the formula for probit analysis identifies the inverse of the cumulative standard normal distribution. If we represent the cumulative standard normal distribution by $\Phi$, then the equation above equals $P = \Phi(Z)$, and the equation for Z equals $Z = \Phi^{-1}(P)$, where $\Phi^{-1}$ refers to the inverse of the cumulative standard normal distribution. Although it cannot be represented by a simple formula, the inverse of the cumulative standard normal distribution transforms probabilities into linear Z scores that represent the dependent variable in probit analysis. With probits as the dependent variable, the estimated coefficients show the change in z score units rather than the change in probabilities.

Despite the similarities of the logit and probit transformations, the resulting coefficients differ, largely by an arbitrary scaling constant. The data typically include only the observed values of the dependent variable of 0 and 1 rather than the actual observed probabilities, and the predicted logit or probit values can range from negative to positive infinity. The logit and probit variables therefore have no inherent scale, and programs use an arbitrary normalization to fix the scale. Probit analysis sets the standard deviation of the error equal to 1, whereas logistic regression sets the standard deviation of the error equal to approximately 1.814.
The different error variances mean that one should not directly compare probit and logit coefficients. The logit coefficients will exceed the probit coefficients by an approximate factor of 1.6 or 1.7 (Long, 1997, p. 49). The factor will vary to some degree depending on the data and model, but even with the different scaling, probit analysis and logistic regression nearly always produce similar substantive results.
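To make the integral less abstract, the sketch below approximates the area under the standard normal curve numerically and compares it with the closed form based on the error function. It also computes the standard deviation of the standard logistic distribution, π/√3, which is the source of the value of approximately 1.814. The function names, the trapezoid rule, and the use of −10 as a stand-in for negative infinity are assumptions made for illustration, not part of the text.

```python
import math


def normal_density(u):
    """Height of the standard normal curve at u."""
    return math.exp(-u * u / 2.0) / math.sqrt(2.0 * math.pi)


def phi_by_integration(z, lower=-10.0, steps=20000):
    """Trapezoid-rule area under the normal curve from `lower` up to z.

    `lower` = -10 stands in for negative infinity (an assumption of this
    sketch); the area below -10 is negligibly small.
    """
    if z <= lower:
        return 0.0
    width = (z - lower) / steps
    total = 0.5 * (normal_density(lower) + normal_density(z))
    for i in range(1, steps):
        total += normal_density(lower + i * width)
    return total * width


def phi_closed_form(z):
    """Cumulative standard normal via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Standard deviation of the standard logistic distribution.
logistic_error_sd = math.pi / math.sqrt(3.0)

print(round(phi_by_integration(1.0), 4), round(phi_closed_form(1.0), 4))
print(round(logistic_error_sd, 3))
```

The two routes to Φ(1) agree to several decimal places, which is why software can treat the integral as a routine calculation.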
Interpreting Probit Coefficients

Given the transformed units of the dependent variable, probit coefficients are like logit coefficients in showing linear and additive change. The difference is that probit coefficients refer to the outcome in z-score units of the probit transformation rather than logged odds. While perhaps more intuitive than logged odds, standard units of the cumulative normal distribution still have little interpretive value. Interpretations usually begin with the sign of the coefficients and the value of the z ratios.

A probit model of smoking using the 2017 NHIS data illustrates the interpretation of coefficients. Table 4.1 presents the Stata probit results. The expected z-score for current smoking decreases by .103 with a unit change in education and by .013 with a unit change in age. The expected z-score for current smoking is higher by .141 for males than females and lower by .532 for Hispanics than non-Hispanics.

Table 4.1 Stata Output: Probit Model of Current Smoking, NHIS 2017

Iteration 0: log likelihood = -10214.631
Iteration 1: log likelihood = -9627.957
Iteration 2: log likelihood = -9620.7256
Iteration 3: log likelihood = -9620.7206
Iteration 4: log likelihood = -9620.7206

Probit regression                    Number of obs = 23,786
                                     LR chi2(8)    = 1187.82
                                     Prob > chi2   = 0.0000
Log likelihood = -9620.7206          Pseudo R2     = 0.0581

Smoker                  Coef.   Std. Err.       z   P>|z|   [95% Conf. Interval]
Education           -.1031338    .0037797  -27.29   0.000   -.1105417   -.0957258
Age                 -.0131973    .0006517  -20.25   0.000   -.0144747   -.0119199
Gender
  Male               .1414657    .0203892    6.94   0.000    .1015037    .1814277
Race
  African American   -.037627     .031909   -1.18   0.238   -.1001675    .0249135
  Native American    .1666979    .0875288    1.90   0.057   -.0048554    .3382512
  Asian American    -.4049622    .0546517   -7.41   0.000   -.5120775   -.2978469
  Multiple Race      .2583868    .0682866    3.78   0.000    .1245476     .392226
Ethnicity
  Hispanic          -.5318127    .0359806  -14.78   0.000   -.6023333    -.461292
_cons                1.084429    .0709507   15.28   0.000    .9453683     1.22349

For comparison, Table 4.2 presents the Stata logit results. The sign and significance of the coefficients are the same in both the probit and logit models, but the logit coefficients are larger than the probit coefficients. Note that exponentiating probit coefficients does not, as it does for logistic regression coefficients, produce the equivalent of the multiplicative change in odds. Given the usefulness of odds ratios in logistic regression, the lack of comparable coefficients in probit analysis may contribute to the greater popularity of logistic regression.

Table 4.2 Stata Output: Logit Model of Current Smoking, NHIS 2017

Iteration 0: log likelihood = -10214.631
Iteration 1: log likelihood = -9653.5905
Iteration 2: log likelihood = -9626.1301
Iteration 3: log likelihood = -9626.0135
Iteration 4: log likelihood = -9626.0135

Logistic regression                  Number of obs = 23,766
                                     LR chi2(8)    = 1177.23
                                     Prob > chi2   = 0.0000
Log likelihood = -9626.0135          Pseudo R2     = 0.0576

Smoker                  Coef.   Std. Err.       z   P>|z|   [95% Conf. Interval]
Education           -.1830747    .0067791  -27.01   0.000   -.1963616   -.1697878
Age                  -.023966    .0011756  -20.39   0.000   -.0262701   -.0216619
Gender
  Male               .2535793    .0369909    6.86   0.000    .1810785    .3260801
Race
  African American  -.0835037    .0574981   -1.45   0.146   -.1961979    .0291905
  Native American    .2882139    .1523181    1.89   0.058   -.0103242     .586752
  Asian American    -.8100691    .1092474   -7.41   0.000    -1.02419    -.595948
  Multiple Race      .4363179    .1180186    3.70   0.000    .2050056    .6676301
Ethnicity
  Hispanic          -1.039563    .0700205  -14.85   0.000   -1.176801   -.9023256
_cons                2.053864     .127611   16.09   0.000    1.803751    2.303976

The probit and logit coefficients can be made more comparable with standardization. Using the SPOST listcoef command gives fully standardized coefficients for both the probit and logit coefficients. The standardization adjusts for the standard deviation of the predicted logit and probit outcomes and for the different error variances.14 As shown in Table 4.3, the coefficients vary only slightly after standardization.
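The difference in scale between the two sets of estimates can be inspected directly. The sketch below copies four coefficients from Tables 4.1 and 4.2 and computes the ratio of each logit coefficient to its probit counterpart; it also exponentiates the logit coefficient for education to obtain an odds ratio, a step that has no probit equivalent. The dictionary and variable names are illustrative only, and the ratios are specific to this model and these data.

```python
import math

# Coefficients copied from Table 4.1 (probit) and Table 4.2 (logit).
probit_coefs = {
    "Education": -0.1031338,
    "Age": -0.0131973,
    "Male": 0.1414657,
    "Hispanic": -0.5318127,
}
logit_coefs = {
    "Education": -0.1830747,
    "Age": -0.023966,
    "Male": 0.2535793,
    "Hispanic": -1.039563,
}

# Ratio of logit to probit coefficient for each predictor.
ratios = {name: logit_coefs[name] / probit_coefs[name] for name in probit_coefs}
for name, ratio in ratios.items():
    print(f"{name:10s} logit/probit ratio = {ratio:.2f}")

# Exponentiating a logit coefficient gives an odds ratio; doing the same
# to a probit coefficient has no comparable interpretation.
odds_ratio_education = math.exp(logit_coefs["Education"])
print(f"Odds ratio for education = {odds_ratio_education:.4f}")
```

Because the ratio is not constant across predictors, a single conversion factor is only a rough guide, which is one reason the standardized coefficients offer a cleaner comparison.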
Table 4.3 SPOST Output: Fully Standardized Probit and Logit Coefficients for Model of Current Smoking, NHIS 2017

bStdXY                 Probit     Logit
Education              −0.277    −0.271
Age                    −0.207    −0.207
Gender
  Male                  0.066     0.066
Race
  African American     −0.011    −0.014
  Native American       0.017     0.016
  Asian American       −0.083    −0.092
  Multiple Race         0.033     0.030
Ethnicity
  Hispanic             −0.160    −0.172

Marginal Effects

As in logistic regression, transformation of probit effects into probability effects allows for more meaningful interpretations. As defined earlier, the marginal effect for continuous independent variables equals the partial derivative or the effect of an infinitely small change in an independent variable on the probability of an outcome. That effect is equivalent to the slope of the tangent line at a particular point of the nonlinear curve relating an independent variable to the probability. However, the partial derivative for probit analysis takes a different form than in logistic regression:

$$\partial P / \partial X_k = b_k \, f(Z),$$

where f(Z) is the density or height of the normal curve at the point Z.15 The key to the formula is that the $b_k$ coefficients translate into the largest marginal effects on probabilities when the value of the normal density function is largest (i.e., when the Z value is near 0). The $b_k$ translates into smaller effects on probabilities when the Z value is far from 0 and the density of the normal distribution is low.

The margins command shows the average marginal effects of the independent variables on the expected probability of smoking obtained from the probit model (Table 4.4). The coefficients differ little from the average marginal effects obtained from logistic regression coefficients. For example, the average marginal effects show that the probability of current smoking decreases by .023 with a marginal or infinitely small change in education and by .003 with a marginal or infinitely small change in age. One can compare these average marginal effects to those using logistic regression in Table 2.7 of Chapter 2. They are nearly identical.

For categorical independent variables, marginal effects come from the difference in predicted values for a discrete change. The average marginal effects in Table 4.4 show that the expected probability of smoking is higher by .032 for males than females and lower by .096 for Hispanics than non-Hispanics. Again, comparison of these average marginal effects shows them to be nearly identical to those obtained from logistic regression.

As always, note that in probit analysis, as in logistic regression analysis, a single coefficient may not fully describe the relationship of an independent variable with probabilities. Marginal effects vary depending on the selected values of the independent variables. Changes in probabilities will emerge larger for points near the middle of the cumulative standard normal curve than near the floor or ceiling.
Table 4.4 presents the average marginal effects using the observed values on the independent variables. Marginal effects at the means or at representative values may give different results. Marginal effects when an independent variable takes values 1 or 2 standard deviations below and above the mean give still different results. The strategies presented in Chapter 2 for logistic regression, including graphing the distribution of marginal effects, apply to probit analysis as well.
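The dependence of the marginal effect on the location along the curve can be sketched in a few lines. The example below evaluates the product of the probit coefficient and the normal density at several hypothetical Z values, using the education coefficient from Table 4.1, and checks the result against a finite-difference approximation of the derivative. The function names and the chosen Z values are assumptions for illustration; the calculation is a marginal effect at a point, not the average marginal effect reported by margins.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

B_EDUCATION = -0.1031338  # probit coefficient for education, Table 4.1


def probit_marginal_effect(b_k, z):
    """Marginal effect on the probability at z: coefficient times density."""
    return b_k * std_normal.pdf(z)


def finite_difference_effect(b_k, z, h=1e-5):
    """Numerical check: a change of h in X moves Z by b_k * h."""
    upper = std_normal.cdf(z + b_k * h)
    lower = std_normal.cdf(z - b_k * h)
    return (upper - lower) / (2.0 * h)


# Hypothetical Z values chosen to span the middle and tails of the curve.
for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    effect = probit_marginal_effect(B_EDUCATION, z)
    print(f"Z = {z:+.1f}   marginal effect of education = {effect:+.4f}")
```

The effect is largest in absolute size at Z = 0 and shrinks symmetrically toward the tails, which is why averaging over the observed values, evaluating at the means, and evaluating at representative values can give different answers.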
Maximum Likelihood Estimation

Like logistic regression, probit analysis uses maximum likelihood estimation techniques. To briefly review the material in Chapter 3, maximum likelihood estimation chooses the estimates of model parameters that are most likely to give rise to the pattern of observations in the sample data.
Table 4.4 Stata Output: Average Marginal Effects From Probit Model of Current Smoking, NHIS 2017

Expression   : Pr(Smoker), predict()
dy/dx w.r.t. : Education Age 1.Gender 2.Race 3.Race 4.Race 5.Race 1.Ethnicity

                                Delta-method
                        dy/dx   Std. Err.       z   P>|z|   [95% Conf. Interval]
Education           -.0230206    .0008226  -27.99   0.000   -.0246329   -.0214084
Age                 -.0029458    .0001438  -20.48   0.000   -.0032277   -.0026639
Gender
  Male                .031811    .0046118    6.90   0.000     .022772    .0408499
Race
  African American  -.0083856    .0070128   -1.20   0.232   -.0221304    .0053593
  Native American    .0408354     .022964    1.78   0.075   -.0041732    .0858439
  Asian American    -.0745558    .0080756   -9.23   0.000   -.0903836    -.058728
  Multiple Race      .0658185    .0191667    3.43   0.001    .0282525    .1033846
Ethnicity
  Hispanic          -.0958261    .0050703  -18.90   0.000   -.1057638   -.0858885

Note: dy/dx for factor levels is the discrete change from the base level.
Probit analysis maximum likelihood estimation proceeds identically to logistic regression maximum likelihood estimation in most ways. The procedure differs from logistic regression in its use of the cumulative standard normal distribution rather than the logistic distribution to obtain the parameter estimates. As in logistic regression, however, probit analysis maximizes the log likelihood function rather than the likelihood function. Since the log likelihood function produces negative values, the maximum value comes closest to 0. The estimation procedure uses an iterative method of estimation and re-estimation, which proceeds until the log likelihood function fails to change by a specified (and small) amount.

Each probit model therefore produces a log likelihood value that varies from negative infinity to 0. The larger the absolute value of the negative log likelihood, the less well the model does in producing the observed sample values. Comparing the baseline log likelihood and the model log likelihood gives a difference that, when multiplied by −2, produces a chi-square value and tests the null hypothesis that the coefficients for all the independent variables equal 0 in the population. Finally, the log likelihood values allow calculation of several pseudo-R2 coefficients. The tests of overall model significance and the measures of goodness of fit do not differ in interpretation from those for logistic regression discussed in the previous chapter.

The probit analysis output from Stata in Table 4.1 displays the same types of information as the logistic regression model in Table 4.2. The probit log likelihood values differ from those for logistic regression, but the likelihood ratio chi-squares are similar (1187.82 vs. 1177.23). The pseudo-R2 values are also similar (.0581 vs. .0576).
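The likelihood ratio chi-square and the pseudo-R2 reported in Tables 4.1 and 4.2 follow directly from the iteration logs. The sketch below recomputes both from the baseline (iteration 0) and final log likelihoods. It assumes that the reported pseudo-R2 is the McFadden measure, 1 minus the ratio of the model log likelihood to the baseline log likelihood; the function and constant names are illustrative.

```python
# Log likelihood values copied from the iteration logs in Tables 4.1 and 4.2.
LL_BASELINE = -10214.631  # iteration 0, the same for both models
LL_PROBIT = -9620.7206
LL_LOGIT = -9626.0135


def likelihood_ratio_chi2(ll_baseline, ll_model):
    """-2 times the difference between baseline and model log likelihoods."""
    return -2.0 * (ll_baseline - ll_model)


def mcfadden_pseudo_r2(ll_baseline, ll_model):
    """Proportional improvement of the model over the baseline log likelihood."""
    return 1.0 - ll_model / ll_baseline


for label, ll_model in (("probit", LL_PROBIT), ("logit", LL_LOGIT)):
    chi2 = likelihood_ratio_chi2(LL_BASELINE, ll_model)
    r2 = mcfadden_pseudo_r2(LL_BASELINE, ll_model)
    print(f"{label:6s} LR chi2 = {chi2:.2f}   pseudo R2 = {r2:.4f}")
```

Because the model log likelihood lies closer to 0 than the baseline, the chi-square value is positive and the pseudo-R2 falls between 0 and 1.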
Summary

Probit analysis deals with the ceiling and floor problems created by a binary dependent variable with a transformation based on the cumulative standard normal distribution. Despite the familiar nature of the normal curve, the changes in z-score units of the inverse of the cumulative standard normal distribution described by probit coefficients lack intuitive meaning. Furthermore, probit analysis does not allow calculation of the equivalent of odds ratios. In most circumstances, researchers will prefer logistic regression, but familiarity with the alternative logic of probit analysis adds to the more general understanding of strategies of analysis for binary dependent variables.
Chapter 5 ORDINAL AND MULTINOMIAL LOGISTIC REGRESSION

The previous chapters aimed to explain the basic principles underlying logistic regression (and the companion probit analysis) rather than to offer a comprehensive description or mathematical derivation of the techniques. Understanding the basic principles, however, can help to master more complex and advanced topics. In particular, the logic of the logit transformation and maximum likelihood estimation in binary logistic regression applies to models for dependent variables with three or more categories. To emphasize the generality of the underlying principles presented so far and to offer an introduction to more advanced material, this chapter extends the binary logistic regression model.

The focus here is on two commonly used models for outcomes with three or more categories: one with ordinal or ordered categories and one with unordered or nominal categories. The former is called ordinal or ordered logistic regression and the latter is called multinomial or polytomous logistic regression. As will be shown, there is blurring across the two types of dependent variables. Under some conditions, ordered categorical outcomes are best analyzed using multinomial logistic regression rather than ordered logistic regression. For both models, this chapter does not aim for comprehensive coverage of the techniques but rather shows how the basic principles underlying binary logistic regression apply, with some variation, to ordinal and multinomial logistic regression.

It might seem that a dependent variable with three or more categories would lend itself to separate binary logistic regressions. Although separate logistic regressions would allow for the same interpretations of coefficients as in a binary logistic regression, such an approach ignores overlap across equations. A more efficient method simultaneously maximizes the joint likelihood for all categories of the dependent variable.
Ordinal and multinomial logistic regression take this approach in jointly estimating parameters predicting all categories of the dependent variable. With only two categories of the dependent variable, ordinal and multinomial logistic regression estimates reduce to binary logistic regression estimates; the logic of maximum likelihood does not change, only the number of categories increases.
Ordinal Logistic Regression

Understanding ordinal logistic regression begins with a comparison to linear regression, much as understanding binary logistic regression begins with the linear probability model. With an ordinal outcome, linear regression presents a single coefficient for each predictor that summarizes the linear relationship between that predictor and a multicategory outcome. The coefficients show the change in the outcome units for a unit change in the predictors and thus depend on the scale of the outcome. However, by definition, the scale of an ordinal outcome is arbitrary; it has no real or natural measurement metric. Linear regression assumes that the distance between values of the ordinal outcome is the same such that a change from 1 to 2 is the same as the change from 2 to 3 or from 3 to 4. In other words, linear regression assumes the ordinal scale has interval or ratio properties.

Take two different examples that come from the General Social Survey. A three-category measure of support for more national spending on the environment distinguishes those answering that the nation spends (1) too much, (2) about the right amount, and (3) too little. The assignment of values of 1, 2, and 3 is arbitrary. Values of 10, 20, and 30 change the units distinguishing the categories, but both measurement metrics are alike in several ways. They reflect an underlying dimension of support by ordering the second category as higher than the first and the third category as higher than the second. They also assign equal distances between the three categories. The assigned values may well approximate a scale in which the difference between too much and the right amount is about the same as the difference between the right amount and too little. Even so, such a claim represents an assumption. One could argue that the change from about right to too little plausibly involves a greater commitment to the environment than the change from too much to about right.
A five-category measure comes from a question that asks if gays should have the right to marry, with responses on a Likert-type scale arbitrarily assigned numbers of (1) strongly disagree, (2) disagree, (3) neither agree nor disagree, (4) agree, and (5) strongly agree. The five categories may appear to better approximate an interval scale than the three-category measure. Again, however, the gap between strongly disagree and disagree may be qualitatively different than the gap between agree and strongly agree. Yet, using linear regression imposes the assumption of equal intervals. To the extent that this is not true, the regression coefficients will be misleading.

Ordinal logistic regression offers an alternative to linear regression that does not require the equal-interval assumption. It allows for the gaps between categories to vary. In avoiding this assumption, ordinal logistic regression is more general than linear regression. However, ordinal logistic regression is nonlinear in the relationship of the predictors to probabilities and brings complexities to the interpretation that linear regression does not. Linear regression may be appropriate in some cases when the ordinal scale has many categories and equal-interval properties. The potential for systematic bias of linear regression is serious enough, however, to recommend ordinal logistic regression or a variant in most cases (Liddell & Kruschke, 2018).

Ordinal logistic regression, like binary logistic regression, is based on the logged odds transformation that linearizes nonlinear relationships between the independent variables and probabilities. As explained in Chapter 1, the ceiling of 1 and the floor of 0 are removed by transforming probabilities into odds and odds into logged odds. And as explained in Chapter 2, effects can be presented in terms of logged odds, odds, and probabilities. The same concepts apply, with some variation, to ordinal logistic regression (Agresti & Tarantola, 2018). Note, however, that there are several types of ordinal logistic regression models (Hilbe, 2009, Chapter 12; O’Connell, 2006), but the most commonly used one is called, for reasons that will become clear, the cumulative logit or proportional odds model (Menard, 2010, Chapter 10). This model is the default in Stata and SPSS, and, without other qualifiers, is generally assumed when referring to ordinal logistic regression.
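The arbitrariness of the ordinal codes in linear regression can be illustrated with a small calculation. The sketch below uses entirely hypothetical data (eight made-up respondents, not GSS values) and a hand-written bivariate least squares slope. Recoding the categories from 1, 2, 3 to 10, 20, 30 multiplies the slope by 10, while an unequal-interval recode such as 1, 2, 5 changes the slope by a different factor. The variable and function names are illustrative.

```python
def ols_slope(x, y):
    """Bivariate least squares slope of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Hypothetical data for illustration only.
education = [8, 10, 12, 12, 14, 16, 16, 18]
support = [1, 1, 2, 2, 2, 3, 3, 3]  # ordinal codes 1, 2, 3

slope_123 = ols_slope(education, support)

# Equal-interval recode: 10, 20, 30.
slope_10_20_30 = ols_slope(education, [10 * v for v in support])

# Unequal-interval recode: the top category is placed farther away.
unequal_codes = {1: 1, 2: 2, 3: 5}
slope_unequal = ols_slope(education, [unequal_codes[v] for v in support])

print(f"codes 1,2,3    slope = {slope_123:.4f}")
print(f"codes 10,20,30 slope = {slope_10_20_30:.4f}")
print(f"codes 1,2,5    slope = {slope_unequal:.4f}")
```

All three codings preserve the ordering of the categories, yet the linear regression slope depends on which one is chosen; only the equal-interval recode leaves the substantive conclusion unchanged up to a change of units.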
Logged Odds and Odds Ratio Coefficients

The logged odds coefficients, although lacking an interpretable scale, are linear and additive. In binary logistic regression, they show the expected change in the logged odds of being in the higher category (Y = 1) than the lower category (Y = 0) of the dependent variable for a unit change in an independent variable. In ordinal logistic regression, the logged odds coefficients similarly show the expected change in the logged odds of being in a higher category than a lower category for a unit change in an independent variable. However, the comparison of higher and lower categories is more general. For a three-category dependent variable, it refers to categories 3 and 2 versus category 1, and it refers to category 3 versus categories 1 and 2. Multiple contrasts of higher to lower categories are involved; with J categories, there are J − 1 such contrasts. The coefficients in ordinal logistic regression are general in that they are invariant across multiple contrasts. They thus take advantage of the ordering of the categories to offer a single measure of association that encompasses multiple and cumulative comparisons of higher to lower categories.16

Table 5.1 presents the ordinal logistic regression results for the three-category outcome of support for environmental spending (1 = too much, 2 = about right, and 3 = too little).17 Given the multiple contrasts, the wording used to interpret the coefficients is nuanced. Broadly stated, the coefficient of education shows that the expected logged odds of being in a higher than a lower category of support for environmental spending rise by .035 with a 1-year increase in education, controlling for other independent variables. More precise interpretations focus on the two specific contrasts. The single coefficient implies the logged odds of supporting the same spending or more spending relative to less spending are higher by .035, and the logged odds of supporting more spending relative to supporting less spending or the same spending are again higher by .035. Similar interpretations apply to the categorical independent variables of gender and size of place; only the meaning of a unit change in the independent variables differs.

Table 5.1 Stata Output: Logged Odds Coefficients From Ordinal Logistic Regression Model of Support for More Spending on the Environment, GSS 1973−2016

Ordered logistic regression          Number of obs = 34,063
                                     LR chi2(9)    = 1848.45
                                     Prob > chi2   = 0.0000
Log likelihood = -28996.607          Pseudo R2     = 0.0309

Spend_Env               Coef.   Std. Err.       z   P>|z|   [95% Conf. Interval]
Education            .0351455    .0037142    9.46   0.000    .0278658    .0424251
Time                  .004804    .0008665    5.54   0.000    .0031057    .0065024
Age                 -.0218408    .0006526  -33.47   0.000   -.0231199   -.0205616
Gender
  Male              -.1431693     .022289   -6.42   0.000   -.1868549   -.0994836
SizeOfPlace
  smsa's 13-100     -.4045556    .0509949   -7.93   0.000   -.5045038   -.3046074
  suburb, 12 lrgst  -.3826241     .053147   -7.20   0.000   -.4867903   -.2784579
  suburb, 13-100    -.4131242    .0506766   -8.15   0.000   -.5124485   -.3137999
  other urban       -.5486332    .0447757  -12.25   0.000   -.6363919   -.4608745
  other rural       -.7077865    .0502175  -14.09   0.000    -.806211   -.6093619
/cut1               -3.401066    .0752252                   -3.548504   -3.253627
/cut2               -1.445853    .0729311                   -1.588796   -1.302911

Although interpreted in terms of logged odds, ordinal logistic regression coefficients are, in one sense, similar to linear regression coefficients. They compare multiple combinations of higher versus lower categories. However, ordinal logistic regression differs from linear regression in allowing for unequal intervals between categories. This property shows in cut point or threshold coefficients (labeled as cut1 and cut2 in the Stata output). These coefficients typically are not theoretically meaningful but are worth discussing briefly. They show where a continuous latent variable of support for more spending is divided to define predicted group membership. Cut 1 defines a point that separates the lower group from the middle group, and cut 2 defines a point that separates the middle group from the higher group. As a replacement for the usual intercept in linear regression or binary logistic regression, the cut points in ordinal logistic regression allow for unequal intervals between categories.

Table 5.2 presents the ordinal logistic regression results as odds ratios. They show the odds of supporting the same spending or more spending relative to less spending and supporting more spending relative to supporting less spending or the same spending. For education, these odds are higher by a factor of 1.04; for gender, the odds for men are lower by 13%. The interpretations of the likelihood ratio chi-square, pseudo R2, and significance of the odds ratios remain much the same as for the logged odds coefficients.18
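The role of the cut points and the meaning of a single coefficient across both contrasts can be sketched numerically. The example below assumes the parameterization in which the cumulative probability of being at or below category j equals the logistic function of the cut point minus the linear predictor, which is my understanding of how Stata reports ordered logit cut points. The cut points and the education coefficient come from Table 5.1; the linear predictor value of −1.0 is hypothetical, and the function names are illustrative.

```python
import math

# Values copied from Table 5.1.
CUT1, CUT2 = -3.401066, -1.445853
B_EDUCATION = 0.0351455


def logistic(x):
    """Translate logged odds into a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def category_probabilities(xb, cuts=(CUT1, CUT2)):
    """Probabilities of categories 1..J given the linear predictor xb.

    Assumes Pr(Y <= j) = logistic(cut_j - xb), with Pr(Y <= J) = 1.
    """
    cumulative = [logistic(cut - xb) for cut in cuts] + [1.0]
    probabilities = []
    previous = 0.0
    for value in cumulative:
        probabilities.append(value - previous)
        previous = value
    return probabilities


def odds_higher_vs_lower(xb, cut):
    """Odds of falling above the cut point rather than at or below it."""
    p_lower = logistic(cut - xb)
    return (1.0 - p_lower) / p_lower


xb = -1.0  # hypothetical value of the linear predictor
print([round(p, 4) for p in category_probabilities(xb)])

# One more year of education multiplies the odds at BOTH cut points
# by the same factor, exp(b): the proportional odds property.
for cut in (CUT1, CUT2):
    ratio = odds_higher_vs_lower(xb + B_EDUCATION, cut) / odds_higher_vs_lower(xb, cut)
    print(f"odds ratio at cut {cut:+.3f} = {ratio:.5f}")
```

Under this assumed parameterization, the odds ratio is the same at both cut points and equals the exponentiated coefficient, which is why a single coefficient can summarize both contrasts.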
Table 5.2 Stata Output: Odds Ratio Coefficients From Ordinal Logistic Regression Model of Support for More Spending on the Environment, GSS 1973−2016

Ordered logistic regression          Number of obs = 34,063
                                     LR chi2(9)    = 1848.45
                                     Prob > chi2   = 0.0000
Log likelihood = -28996.607          Pseudo R2     = 0.0309

Spend_Env          Odds Ratio   Std. Err.       z   P>|z|   [95% Conf. Interval]
Education             1.03577     .003847    9.46   0.000    1.028258    1.043338
Time                 1.004816    .0008707    5.54   0.000     1.00311    1.006524
Age                   .978396    .0006385  -33.47   0.000    .9771453    .9796483
Gender
  Male               .8666074    .0193158   -6.42   0.000    .8295641    .9053048
SizeOfPlace
  smsa's 13-100      .6672733    .0340275   -7.93   0.000    .6038051    .7374128
  suburb, 12 lrgst   .6820692    .0362499   -7.20   0.000    .6145959    .7569502
  suburb, 13-100     .6615801    .0335266   -8.15   0.000    .5990271    .7306652
  other urban        .5777389    .0258687  -12.25   0.000    .5291984    .6307318
  other rural        .4927337    .0247439  -14.09   0.000    .4465468    .5436977
/cut1               -3.401066    .0752252                   -3.548504   -3.253627
/cut2               -1.445853    .0729311                   -1.588796   -1.302911

Note: Estimates are transformed only in the first equation.

Probability Interpretations

Much as with binary logistic regression, ordinal logistic regression produces predicted probabilities but does so for each of the categories. For the three categories of support for environmental spending, there are three sets of predicted probabilities, with each observation having a predicted probability of falling into each of the three categories. Table 5.3 presents the descriptive statistics for these variables. The means sum to 1 and are nearly identical to the observed proportions in each category. About 9% support less spending, about 30% support the same spending, and about 61% support more spending. The predicted probabilities vary across the sample observations, but the minimum and maximum values fall within the bounds of 0 and 1.

Also as in binary logistic regression, average marginal effects on probabilities for ordinal logistic regression can be obtained from the margins command in Stata. The difference with ordinal logistic regression is that the average marginal effects are calculated for each of the three categories. With numerous coefficients in the model and three categories of the dependent variable, the margins command produces a large table, one that would be even larger with more categories in the dependent variable. Still, the meaning of the coefficients follows the logic for marginal effects presented for binary logistic regression in Chapter 2. Table 5.4 lists the average marginal effects for the model of support for environmental spending.
Table 5.3 Stata Output: Summary Statistics for Predicted Probabilities From Ordinal Logistic Regression Model of Support for More Spending on the Environment, GSS 1973−2016

Variable          Obs        Mean   Std. Dev.        Min        Max
pSpend_Env1    34,063    .0887557    .0407199   .0244017   .3321563
pSpend_Env2    34,063    .3016695    .0696465   .1257789   .4532657
pSpend_Env3    34,063    .6095748    .1092123   .2215315   .8498194
Table 5.4 Stata Output: Average Marginal Effects on Probabilities From Ordinal Logistic Regression Model of Support for More Spending on the Environment, GSS 1973−2016

Average marginal effects             Number of obs = 34,063
Model VCE    : OIM
dy/dx w.r.t. : Education Time Age 1.Gender 2.SizeOfPlace 3.SizeOfPlace 5.SizeOfPlace 6.SizeOfPlace
1._predict   : Pr(Spend_Env==1), predict(pr outcome(1))
2._predict   : Pr(Spend_Env==2), predict(pr outcome(2))
3._predict   : Pr(Spend_Env==3), predict(pr outcome(3))

                             Delta-method
                     dy/dx   Std. Err.       z   P>|z|   [95% Conf. Interval]
Education
  _predict 1     -.0027842    .0002968   -9.38   0.000    -.003366   -.0022025
  _predict 2      -.005161    .0005435   -9.50   0.000   -.0062261   -.0040958
  _predict 3      .0079452    .0008359    9.50   0.000    .0063068    .0095836
Time
  _predict 1     -.0003806    .0000689   -5.53   0.000   -.0005155   -.0002456
  _predict 2     -.0007055    .0001271   -5.55   0.000   -.0009545   -.0004564
  _predict 3       .001086    .0001956    5.55   0.000    .0007027    .0014694
Age
  _predict 1      .0017302    .0000576   30.02   0.000    .0016172    .0018432
  _predict 2      .0032072    .0000905   35.44   0.000    .0030299    .0033846
  _predict 3     -.0049375    .0001388  -35.56   0.000   -.0052096   -.0046653
0.Gender         (base outcome)
1.Gender
  _predict 1      .0114031    .0017928    6.36   0.000    .0078892     .014917
  _predict 2      .0210175    .0032684    6.43   0.000    .0146115    .0274235
  _predict 3     -.0324206    .0050492   -6.42   0.000   -.0423169   -.0225243
1.SizeOfPlace    (base outcome)
2.SizeOfPlace
  _predict 1      .0256972    .0031574    8.14   0.000    .0195089    .0318855
  _predict 2      .0602282    .0075004    8.03   0.000    .0455278    .0749286
  _predict 3     -.0859254    .0106015   -8.11   0.000    -.106704   -.0651468
3.SizeOfPlace
  _predict 1      .0240721    .0033156    7.26   0.000    .0175737    .0305706
  _predict 2      .0569325    .0078291    7.27   0.000    .0415878    .0722772
  _predict 3     -.0810046    .0110947   -7.30   0.000   -.1027498   -.0592594
4.SizeOfPlace
  _predict 1        .02634      .00314    8.39   0.000    .0201856    .0324944
  _predict 2      .0615149    .0074502    8.26   0.000    .0469128     .076117
  _predict 3     -.0878549    .0105314   -8.34   0.000    -.108496   -.0672138
5.SizeOfPlace
  _predict 1      .0371139    .0026749   13.88   0.000    .0318713    .0423566
  _predict 2      .0817234    .0065544   12.47   0.000     .068877    .0945698
  _predict 3     -.1188373    .0091289  -13.02   0.000   -.1367296   -.1009451
6.SizeOfPlace
  _predict 1      .0513151     .003531   14.53   0.000    .0443945    .0582358
  _predict 2      .1048247    .0073216   14.32   0.000    .0904746    .1191748
  _predict 3     -.1561399    .0106542  -14.66   0.000   -.1770217    -.135258

Note: dy/dx for factor levels is the discrete change from the base level.
First consider the continuous measure of education. The average marginal effect of education shows that with a 1-year increase in education, the expected probability of supporting less spending is lower by .003 and the probability of supporting the same spending is lower by .005. In contrast, the probability of supporting more spending is higher by .008 with a 1-year increase in education. Remember, however, that these effects are the average marginal effects across the full sample. The nonlinear effects of the independent variables on the probabilities vary across observations, and the coefficients in the table are an average of the diverse marginal effects. The marginal effects would differ were they evaluated at other values of the independent variables, such as the mean values or representative values.

The coefficients for dummy independent variables have a different interpretation. They show the average difference in the predicted probability for the dummy variable group relative to the omitted group. For example, the average marginal effect of gender, with men coded 1 and women coded 0, shows that the expected probability of supporting less spending or the same spending is higher by .011 and .021, respectively, for men than women. In contrast, the expected probability of supporting more spending is lower by .032 for men than women. The average marginal effect for rural residents (Size of Place = 6) relative to the omitted category of residents in the 12 largest metropolitan areas (Size of Place = 1) again differs by categories of the outcome measure. The coefficient is positive for less and the same spending but negative for more spending.
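The logic behind the average marginal effects can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cut points, the education coefficient, and the five linear predictors are hypothetical values (they are not the estimates from Table 5.1), and the derivative is approximated numerically rather than with the delta method that Stata reports. The sketch assumes the usual ordinal logit parameterization in which the cumulative probability of category j or lower is the logistic function of the cut point minus the linear predictor.

```python
import math

def logistic_cdf(z):
    """Cumulative distribution function of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-z))

def ordinal_probs(xb, cuts):
    """Return P(y = 1), ..., P(y = J) for an ordinal logit.

    xb   : linear predictor for one observation
    cuts : ordered cut points (J - 1 of them)
    """
    cumulative = [logistic_cdf(c - xb) for c in cuts] + [1.0]
    probs = [cumulative[0]]
    for j in range(1, len(cumulative)):
        probs.append(cumulative[j] - cumulative[j - 1])
    return probs

def average_marginal_effects(xb_values, beta, cuts, h=1e-6):
    """Average, over observations, the numerical derivative of each
    category probability with respect to one continuous predictor."""
    n_cat = len(cuts) + 1
    totals = [0.0] * n_cat
    for xb in xb_values:
        up = ordinal_probs(xb + beta * h, cuts)
        down = ordinal_probs(xb - beta * h, cuts)
        for j in range(n_cat):
            totals[j] += (up[j] - down[j]) / (2 * h)
    return [t / len(xb_values) for t in totals]

# Hypothetical values for illustration only (not estimates from the GSS model)
beta_education = 0.04
cuts = [-2.0, 0.0]
xb_values = [-0.5, 0.0, 0.3, 0.9, 1.4]

ame = average_marginal_effects(xb_values, beta_education, cuts)
ame_sum = sum(ame)

# The education effects reported in Table 5.4 show the same property:
# a gain in one category is offset by losses in the others.
table_5_4_education = [-0.0027842, -0.005161, 0.0079452]
table_sum = sum(table_5_4_education)
```

Because the category probabilities must sum to 1 for every observation, the marginal effects across the three categories sum to 0, both in the hypothetical example and in the education rows of Table 5.4.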
Testing a Key Assumption

Although ordinal logistic regression does not require the linear regression assumption of equal intervals between categories, it does make another stringent assumption. It assumes that the effect of an independent variable is the same across the multiple comparisons of higher to lower categories (Long, 1997, pp. 140−141; Menard, 2010, Chapter 10). In the example above, it assumes that education has the same influence on supporting the same or more spending relative to supporting less spending as it does on supporting more spending relative to the same or less spending. If the effects are in fact different, the single coefficient in ordinal logistic regression may be misleading. The assumption should be tested.

The literature refers to the proportional odds or parallel regression assumption. When the effects are the same across comparisons of categories of the dependent variable, the odds are proportional. Visually, the regression lines representing the relationship between an independent variable and each contrasted pair of the three outcome categories are parallel. An SPOST command in Stata that is easy to implement and interpret can evaluate the assumption with the Brant test. Following the “ologit” command, the command “brant, detail” provides the output listed in Table 5.5. Focus first on the bottom part of the table labeled “Brant test
Table 5.5 Stata Output: Brant Tests for Ordinal Logistic Regression Model of Support for More Spending on the Environment, GSS 1973−2016

Estimated coefficients from binary logits

                       y_gt_1              y_gt_2
Variable              b        t          b        t
Education         0.024     3.92      0.038     9.82
Time              0.005     3.00      0.005     5.49
Age              -0.023   -20.48     -0.022   -32.45
Gender
  Male           -0.487   -12.58     -0.085    -3.72
SizeOfPlace
  smsa’s 13..    -0.339    -3.67     -0.407    -7.87
  suburb, 1..    -0.176    -1.78     -0.404    -7.49
  suburb, 1..    -0.304    -3.31     -0.420    -8.17
  other urban    -0.404    -4.94     -0.564   -12.42
  other rural    -0.477    -5.32     -0.748   -14.55
_cons             3.621    27.99      1.407    18.86

legend: b/t

Brant test of parallel regression assumption

Variable            chi2    p>chi2    df
All               143.40     0.000     9
Education           5.21     0.023     1
Time                0.05     0.815     1
Age                 0.57     0.449     1
1.Gender          118.31     0.000     1
2.SizeOfPlace       0.64     0.424     1
3.SizeOfPlace       6.15     0.013     1
4.SizeOfPlace       1.83     0.176     1
5.SizeOfPlace       4.44     0.035     1
6.SizeOfPlace      10.29     0.001     1

A significant test statistic provides evidence that the parallel regression assumption has been violated.
of parallel regression assumption.” The chi-square statistic tests the null hypothesis of parallel regression or, in other words, that the coefficients for all contrasts across the dependent variable are the same. It can be seen that the null hypothesis is rejected overall and for several coefficients. Consider gender, which has a large chi-square value. The top part of the table further shows that the effect of gender on supporting the same or more spending relative to less spending (y_gt_1) equals −.487. It also shows that the effect of gender on supporting more spending relative to less or the same spending (y_gt_2) equals only −.085. The relationship changes substantially, with the former coefficient being 5−6 times stronger than the latter. It appears that men and women differ more in supporting less spending than supporting more spending. The proportional odds or parallel regression assumption of ordinal logistic regression is violated. There are several alternative models that allow for the effects of the independent variables to vary across contrasts, including multinomial logistic regression. However, the decision to use or not use ordinal logistic regression depends on substantive concerns as well as statistical tests. With a large sample, the null hypothesis of proportional odds or parallel regressions will frequently be rejected. Researchers may for substantive reasons sometimes prefer the simplicity of the ordinal model despite the varied effects across categories. The ordinal logistic regression model in Table 5.1 shows a logit coefficient for gender of −.143 that falls between the two separate coefficients. It conveys the general relationship between gender and support for environmental spending, albeit without the nuance of distinguishing between the contrasts of the outcome categories.
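The comparison of the two gender coefficients can be checked with simple arithmetic. The sketch below takes the binary logit coefficients and chi-square values from Table 5.5 as given; the helper function `chi2_pvalue_df1` is a name introduced here for illustration, and it relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable. It does not reproduce the Brant test itself, only the arithmetic used to read the table.

```python
import math

# Gender coefficients from the two binary logits in Table 5.5
b_gt_1 = -0.487   # same or more spending versus less spending
b_gt_2 = -0.085   # more spending versus less or the same spending

# How many times stronger the first contrast is than the second
ratio = b_gt_1 / b_gt_2

# The same coefficients expressed as odds ratios
or_gt_1 = math.exp(b_gt_1)
or_gt_2 = math.exp(b_gt_2)

def chi2_pvalue_df1(chi2):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom.

    Uses P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)) for standard normal Z.
    """
    return math.erfc(math.sqrt(chi2 / 2.0))

p_age = chi2_pvalue_df1(0.57)       # Brant chi-square for Age
p_gender = chi2_pvalue_df1(118.31)  # Brant chi-square for 1.Gender
```

The ratio falls between 5 and 6, as stated in the text, and the two odds ratios show that men have noticeably lower odds than women in the first contrast but nearly the same odds in the second. Because the chi-square values in the table are rounded, the computed p-values match the reported ones only approximately.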
Another Example

The five-category measure of support for gay marriage (available from 1988 to 2016) offers an example of a Likert-type scale that, as is common in survey research, has responses ranging from strongly disagree to strongly agree. Table 5.6 presents the ordinal logistic regression estimates for this outcome using the PLUM command in SPSS.19 The results show the logged odds coefficients along with the four cut points or thresholds. Education and time of survey increase the ordered logged odds of support for gay marriage, while age reduces support. For categorical independent variables, SPSS uses the last category as the referent, meaning in this case that females are more supportive than males. The referent for marital status, never married, shows stronger support than the other categories. SPSS has an omnibus test for parallel lines but no test for each predictor. In this case, the overall null hypothesis is rejected.

Table 5.6 SPSS Output: Logged Odds Coefficients From Ordinal Logistic Regression Model of Support for Gay Marriage, GSS 1988−2016

Model Fitting Information
Model             −2 Log Likelihood   Chi-Square   df   Sig.
Intercept Only    33453.685
Final             31163.193           2290.492      8   .000

Goodness-of-Fit
            Chi-Square    df       Sig.
Pearson     34844.815     34728    .328
Deviance    28069.060     34728    1.000

Pseudo R2
Cox and Snell    .176
Nagelkerke       .184
McFadden         .061

Parameter Estimates
                                                                        95% Confidence Interval
                              Estimate   Std. Error   Wald       df   Sig.   Lower Bound   Upper Bound
Threshold  [Gay_Marry = 1]    2.026      .117         298.189    1    .000   1.796         2.256
           [Gay_Marry = 2]    2.832      .119         569.092    1    .000   2.599         3.065
           [Gay_Marry = 3]    3.432      .120         819.122    1    .000   3.197         3.667
           [Gay_Marry = 4]    4.597      .122         1409.896   1    .000   4.357         4.837
Location   Education          .128       .006         509.506    1    .000   .117          .139
           Time               .068       .002         877.652    1    .000   .064          .073
           Age                −.018      .001         240.839    1    .000   −.021         −.016
           [Gender=0]         .335       .034         96.714     1    .000   .268          .402
           [Gender=1]         0(a)       .            .          0    .      .             .
           [Marital = 1]      −.547      .045         148.726    1    .000   −.635         −.459
           [Marital = 2]      −.261      .081         10.344     1    .001   −.421         −.102
           [Marital = 3]      −.185      .058         10.308     1    .001   −.297         −.072
           [Marital = 4]      −.168      .099         2.889      1    .089   −.361         .026
           [Marital = 5]      0(a)       .            .          0    .      .             .

Link function: Logit.
a. This parameter is set to zero because it is redundant.

Test of Parallel Lines(a)
Model             −2 Log Likelihood   Chi-Square   df   Sig.
Null Hypothesis   31163.193
General           30854.584           308.609      24   .000

The null hypothesis states that the location parameters (slope coefficients) are the same across response categories.
a. Link function: Logit.
A last table (Table 5.7) presents the R output for the same model but lists odds ratios in separate rows after the logit coefficients and significance details. Like Stata, R uses the first category as the referent, which produces different coefficients for gender and marital status than SPSS. It also produces different cut points or thresholds. Much as for the intercepts in linear regression or binary logistic regression, the cut points depend on the 0 values for the independent variables. Given the differences, it is important to understand the conventions used in ordinal logistic regression programs.

Table 5.7 R Output: Logged Odds Coefficients From Ordinal Logistic Regression Model of Support for Gay Marriage, GSS 1988−2016

                    Value    Std. Error      t value   p value
Education      0.12798336   0.005640855    22.688645         0
Time           0.06810884   0.002244824    30.340396         0
Age           -0.01846003   0.001191698   -15.490527         0
Gender.f1     -0.33526501   0.034118529    -9.826479         0
Marital.f2     0.28589563   0.068874228     4.150981         0
Marital.f3     0.36272014   0.048729131     7.443600         0
Marital.f4     0.37956473   0.094808049     4.003507         0
Marital.f5     0.54736692   0.044973559    12.170861         0
1|2            2.23794358   0.118579054    18.873009         0
2|3            3.04410112   0.119721584    25.426502         0
3|4            3.64378289   0.121057170    30.099687         0
4|5            4.80910468   0.124226128    38.712506         0

                     OR       2.5 %      97.5 %
Education     1.1365341   1.1240642   1.1491972
Age           0.9817093   0.9794170   0.9840023
Time          1.0704818   1.0657973   1.0752173
Gender.f1     0.7151485   0.6688698   0.7645854
Marital.f2    1.3309535   1.1627849   1.5232212
Marital.f3    1.4372336   1.3063341   1.5812756
Marital.f4    1.4616482   1.2135746   1.7600270
Marital.f5    1.7286952   1.5829088   1.8880801
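The dependence of the cut points on the referent categories can be illustrated by converting the SPSS thresholds in Table 5.6 into the R thresholds in Table 5.7. The sketch below is an illustration under stated assumptions rather than a description of what either program does internally: it assumes both programs model the cumulative probability as the logistic function of the threshold minus the linear predictor, and it uses the rounded three-decimal SPSS values, so agreement is only approximate.

```python
# SPSS estimates from Table 5.6 (referents: Gender = 1, Marital = 5)
spss_thresholds = [2.026, 2.832, 3.432, 4.597]
spss_gender0 = 0.335     # [Gender=0] relative to Gender = 1
spss_marital1 = -0.547   # [Marital = 1] relative to Marital = 5

# R estimates from Table 5.7 (referents: Gender = 0, Marital = 1)
r_thresholds = [2.23794358, 3.04410112, 3.64378289, 4.80910468]

# A respondent in R's referent groups (Gender = 0, Marital = 1) contributes 0
# to the linear predictor under R's coding but contributes the sum below
# under SPSS's coding. The thresholds absorb that difference.
shift = spss_gender0 + spss_marital1

converted = [t - shift for t in spss_thresholds]

# Largest discrepancy between converted SPSS thresholds and R thresholds
max_gap = max(abs(c - r) for c, r in zip(converted, r_thresholds))
```

Each converted threshold matches the corresponding R threshold to about three decimal places, which shows that the two programs fit the same model and differ only in the choice of referent categories.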
Multinomial Logistic Regression

Nominal or unordered categorical measures with three or more categories are suited for analysis with multinomial logistic regression. Ordinal measures that do not meet the proportional odds or parallel regression assumption also are suited for multinomial logistic regression. Compared to binary logistic regression, multinomial logistic regression is similar in its logic and interpretation of coefficients. In some ways, it is easier to understand and interpret than ordinal logistic regression. However, because it cannot assume an ordering of categories for the dependent variable, multinomial logistic regression presents a large number of coefficients that can complicate interpretations.

Understanding multinomial logistic regression begins with a comparison to binary logistic regression. With three or more unordered categories of the dependent variable, the model must make multiple comparisons. A three-category dependent variable might proceed with three separate binary logistic regressions and three dummy dependent variables: the first category versus all others, the second category versus all others, and the third category versus all others. The separate logistic regressions would allow for the same interpretations of coefficients as in a single logistic regression, but the results fail to isolate precise contrasts. More exact comparisons would involve category one versus category two, category one versus category three, and category two versus category three. Yet, the comparison of all categories with one another involves redundancy. Multinomial logistic regression provides precise, nonredundant contrasts by selecting a base category. It also is more efficient in using a single maximum likelihood estimation procedure rather than separate logistic regressions.

For example, with three categories and the last selected as the base, multinomial logistic regression would estimate sets of coefficients for two contrasts: category one with category three and category two with category three. Each set of coefficients represents the effects of a unit change in the independent variables on the logged odds or odds of belonging to each category relative to the base category. The coefficients are analogous to binary logistic regression coefficients, only the logged odds refer to the specific categories used in a contrast. With a four-category dependent variable, multinomial logistic regression presents coefficients for three comparisons, and with a five-category dependent variable, it presents coefficients for four. An outcome with J categories thus implies J-1 independent comparisons. As the number of categories increases, so does the number of comparisons.

Multinomial logistic regression thus produces a set of coefficients for each independent variable. Each independent variable affects the logged odds of each category relative to the base category. Although multinomial logistic regression programs initially estimate coefficients only for the nonredundant contrasts, one can obtain coefficients for other contrasts as well. Other than the difference in the specific contrasts, the logged odds, odds ratios, and marginal probability effects discussed in Chapter 2 apply to multinomial logistic regression coefficients.
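The point that coefficients for other contrasts can be obtained from the nonredundant ones can be sketched with a short calculation. The example below borrows the education coefficients from the model reported later in this chapter (Table 5.8, with poverty as the base) and re-expresses them with pollution as the base, which is the contrast shown in Table 5.10. The function name `rebase` is introduced here for illustration only; it simply subtracts the new base category's coefficient from each of the others.

```python
import math

# Education logit coefficients relative to the base category (poverty),
# taken from Table 5.8; the base category has a coefficient of 0 by definition.
b_vs_poverty = {
    "Poverty": 0.0,
    "Gender_Discrimination": 0.0377407,
    "Sanitation_Disease": 0.1038031,
    "Education": 0.1902373,
    "Pollution": 0.136086,
}

def rebase(coefs, new_base):
    """Re-express logit coefficients relative to a different base category."""
    return {name: b - coefs[new_base] for name, b in coefs.items()}

b_vs_pollution = rebase(b_vs_poverty, "Pollution")

# Exponentiating gives the ratios that Stata labels relative risk ratios
rrr_vs_pollution = {name: math.exp(b) for name, b in b_vs_pollution.items()}
```

The resulting values for education agree with the relative risk ratios that Stata reports when the model is re-estimated with pollution as the base (for example, about .906 for gender discrimination and about .873 for poverty), so no new estimation is needed to change the contrast.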
Example

Consider a five-category measure available in wave six (2010−2014) of the World Values Survey (WVS). The measure asks respondents which of the following they view as the most serious problem of the world: people living in poverty and need, discrimination against girls and women, poor sanitation and infectious diseases, inadequate education, or environmental pollution. The choices have no ordering and are suited for multinomial logistic regression. To illustrate, one of the countries can be selected for the analysis. India has a relatively large sample and serves as an example of a nation facing many of the problems listed in the question.20 A simple model includes a measure of education degree (ranging from (1) no formal education to (9) university degree), age (18−80+), gender (males = 1), and size of town (ranging from (1) under 2000 to (8) 500,000 or more). Table 5.8 presents the Stata output from the multinomial logistic regression for this model.

Note that the multinomial logistic regression jointly maximizes the likelihood that the estimates of the parameters predict each category of the dependent variable. With only two categories of the dependent variable, multinomial logistic regression estimates reduce to binary logistic regression estimates; the logic of maximum likelihood does not change, only the number of categories increases. Accordingly, the baseline and model log likelihood values, the chi-square statistics, and the pseudo-variance explained measures have similar interpretations in multinomial as in binary logistic regression, apart from the larger number of categories of the dependent variable.

The unordered dependent variable in the results has a base or omitted category. By default, Stata selects the largest outcome category as the base, in this case the category designating poverty as the most serious world problem (59% selected this choice, compared to 15% for gender discrimination, 4%
Table 5.8 Stata Output: Logged Odds Coefficients From Multinomial Logistic Regression Model of Most Serious World Problem (Poverty as Base), WVS India 2010−2014

Multinomial logistic regression          Number of obs = 3,901
                                         LR chi2(16)   = 274.42
                                         Prob > chi2   = 0.0000
Log likelihood = -4530.9585              Pseudo R2     = 0.0294

Most_Serious_Problem         Coef.   Std. Err.        z   P>|z|    [95% Conf. Interval]
Poverty                  (base outcome)
Gender_Discrimination
  Educ_Degree             .0377407    .0210137     1.80   0.072   -.0034453    .0789267
  Age                    -.0083628    .0036438    -2.30   0.022   -.0155046   -.0012209
  Gender Male            -1.005836      .09996   -10.06   0.000   -1.201753   -.8099176
  Size_Town                 .05394    .0277577     1.94   0.052   -.0004641    .1083442
  _cons                  -.8540531    .1988285    -4.30   0.000    -1.24375   -.4643563
Sanitation_Disease
  Educ_Degree             .1038031    .0353772     2.93   0.003    .0344651    .1731412
  Age                     .0034604    .0060625     0.57   0.568   -.0084218    .0153426
  Gender Male            -.3785881    .1690333    -2.24   0.025   -.7098873   -.0472888
  Size_Town              -.0936161    .0540943    -1.73   0.084    -.199639    .0124067
  _cons                  -2.800893    .3466175    -8.08   0.000    -3.48025   -2.121535
Education
  Educ_Degree             .1902373    .0203125     9.37   0.000    .1504255    .2300491
  Age                    -.0000743    .0036385    -0.02   0.984   -.0072057    .0070571
  Gender Male             .0592625    .1024831     0.58   0.563   -.1416005    .2601256
  Size_Town              -.0862654    .0305235    -2.83   0.005   -.1460903   -.0264405
  _cons                  -2.090137    .2081843   -10.04   0.000    -2.49817   -1.682103
Pollution
  Educ_Degree              .136086    .0264061     5.15   0.000     .084331    .1878409
  Age                     .0008324    .0046474     0.18   0.858   -.0082765    .0099412
  Gender Male             .1215552    .1328113     0.92   0.360   -.1387503    .3818606
  Size_Town              -.1061703    .0408536    -2.60   0.009   -.1862419   -.0260988
  _cons                  -2.484424     .267186    -9.30   0.000   -3.008099   -1.960749
for sanitation and disease, 15% for education, and 8% for pollution). Each of the other outcome categories is compared to this base.

Take the first listed coefficient of .038 for education and gender discrimination. It can be interpreted as an increase of .038 in the expected logged odds of selecting gender discrimination relative to poverty as the most serious problem with a one-unit increase in the education measure. The logit coefficients have little intuitive meaning, but they are similar to those in binary logistic regression, only the logged odds refer to the outcome category relative to the base category. For this contrast, the education coefficient is not significant. The coefficient of −.008 for age shows that with a 1-year increase in age, the expected logged odds of selecting gender discrimination relative to poverty as the most serious problem are lower by .008.

Coefficients for other categories of the dependent variable have similar interpretations. For the contrast of poor sanitation and infectious diseases, the coefficient of .104 for education shows the increase in the expected logged odds of selecting sanitation relative to poverty for a one-unit increase in education. Again, only the sign and significance are easily interpreted.

Examining effects in terms of odds rather than logged odds comes from taking the exponent of the logged odds coefficients. Stata generates a separate table with the exponentiated logit coefficients. However, it is common to refer to the resulting coefficients as relative risk ratios rather than odds ratios. The reason is that the coefficients compare one outcome category to another single outcome category. Odds refer to the probability of being in an outcome category divided by the probability of not being in the outcome category. For example, the odds would refer to choosing gender discrimination relative to choosing any of the four other categories. Multinomial logistic regression is different, however. In this case, it examines choosing gender discrimination relative to choosing poverty. Relative risk ratio thus refers more specifically to the category and the base category. In Table 5.9, the relative risk ratio for gender and gender discrimination of .366 shows that the odds of choosing gender discrimination relative to
Table 5.9 Stata Output: Relative Risk Ratio Coefficients From Multinomial Logistic Regression Model of Most Serious World Problem (Poverty as Base), WVS India 2010−2014

Multinomial logistic regression          Number of obs = 3,901
                                         LR chi2(16)   = 274.42
                                         Prob > chi2   = 0.0000
Log likelihood = -4530.9585              Pseudo R2     = 0.0294

Most_Serious_Problem           RRR   Std. Err.        z   P>|z|    [95% Conf. Interval]
Poverty                  (base outcome)
Gender_Discrimination
  Educ_Degree             1.038462    .0218219     1.80   0.072    .9965606    1.082125
  Age                     .9916721    .0036135    -2.30   0.022     .984615    .9987798
  Gender Male             .3657389    .0365593   -10.06   0.000    .3006665    .4448947
  Size_Town               1.055421    .0292961     1.94   0.052     .999536    1.114431
  _cons                   .4256861    .0846385    -4.30   0.000    .2883011    .6285396
Sanitation_Disease
  Educ_Degree             1.109382    .0392468     2.93   0.003    1.035066    1.189034
  Age                     1.003466    .0060835     0.57   0.568    .9916136    1.015461
  Gender Male             .6848277    .1157587    -2.24   0.025    .4916996    .9538119
  Size_Town               .9106322      .04926    -1.73   0.084    .8190264    1.012484
  _cons                   .0607558     .021059    -8.08   0.000    .0307997    .1198475
Education
  Educ_Degree             1.209537    .0245687     9.37   0.000    1.162329    1.258662
  Age                     .9999257    .0036383    -0.02   0.984    .9928202    1.007082
  Gender Male             1.061054      .10874     0.58   0.563    .8679679    1.297093
  Size_Town               .9173507    .0280007    -2.83   0.005    .8640797    .9739059
  _cons                   .1236702    .0257462   -10.04   0.000    .0822353    .1859825
Pollution
  Educ_Degree              1.14578    .0302556     5.15   0.000    1.087989    1.206642
  Age                     1.000833    .0046513     0.18   0.858    .9917577    1.009991
  Gender Male             1.129252    .1499774     0.92   0.360    .8704454    1.465008
  Size_Town               .8992715    .0367385    -2.60   0.009    .8300728    .9742388
  _cons                   .0833736    .0222763    -9.30   0.000    .0493855     .140753

Note: _cons estimates baseline relative risk for each outcome.
poverty are lower for males than females by a factor of .366 or by 63.4%. The relative risk ratio for education and pollution of 1.146 shows that the odds of choosing pollution relative to poverty are higher by a factor of 1.146 or higher by 14.6% for a one-unit increase in education.

Despite the more intuitive units of the relative risk ratios, it can be challenging to make sense of the numerous coefficients. First, each independent variable has multiple coefficients, often preventing a simple summary of the relationship with the dependent variable. Second, because the coefficients are specific to a contrast involving the base category, the coefficients can be presented differently. The results presented thus far refer to each outcome category relative to poverty as the most serious problem. Other categories can be selected as the base. Table 5.10 presents Stata relative risk ratios for pollution as the base. The coefficients change given the different contrasts involved. For example, education has a nonsignificant coefficient above 1 for selecting gender discrimination relative to poverty but has a significant coefficient below 1 for selecting gender discrimination relative to pollution. SPOST in Stata includes a listcoef command that presents coefficients for all possible base categories, and other SPOST commands plot the coefficients in an intuitive form. As Long and Freese (2014) suggest, examining the full range of coefficients in these ways can be helpful.

Focusing on probability effects has some advantages in multinomial logistic regression. Table 5.11 presents the average marginal effects from the margins command in Stata. Note that there is an average marginal effect for each independent variable and each category of the dependent variable. Along with using more interpretable probability units, the average marginal effects do not depend on selection of a base category. For example, a one-unit increase in education is associated with a lower probability of selecting
Table 5.10 Stata Output: Relative Risk Ratio Coefficients From Multinomial Logistic Regression Model of Most Serious World Problem (Pollution as Base), WVS India 2010−2014

Multinomial logistic regression          Number of obs = 3,901
                                         LR chi2(16)   = 274.42
                                         Prob > chi2   = 0.0000
Log likelihood = -4530.9585              Pseudo R2     = 0.0294

Most_Serious_Problem           RRR   Std. Err.        z   P>|z|    [95% Conf. Interval]
Poverty
  Educ_Degree             .8727676    .0230464    -5.15   0.000    .8287465     .919127
  Age                      .999168    .0046436    -0.18   0.858    .9901081    1.008311
  Gender Male             .8855422      .11761    -0.92   0.360    .6825902    1.148837
  Size_Town               1.112011    .0454296     2.60   0.009    1.026442    1.204714
  _cons                   11.99421    3.204684     9.30   0.000    7.104646    20.24886
Gender_Discrimination
  Educ_Degree             .9063359    .0281227    -3.17   0.002     .852859     .963166
  Age                      .990847    .0054514    -1.67   0.095    .9802198    1.001589
  Gender Male             .3238773    .0498592    -7.32   0.000    .2395203     .437944
  Size_Town                1.17364    .0539639     3.48   0.000    1.072499     1.28432
  _cons                   5.105767    1.580435     5.27   0.000    2.783432    9.365723
Sanitation_Disease
  Educ_Degree             .9682327     .040631    -0.77   0.442    .8917843    1.051235
  Age                     1.002632    .0073293     0.36   0.719    .9883688      1.0171
  Gender Male             .6064438    .1246038    -2.43   0.015    .4054127    .9071598
  Size_Town               1.012633    .0659168     0.19   0.847    .8913406    1.150431
  _cons                   .7287177    .3052341    -0.76   0.450    .3206425    1.656142
Education
  Educ_Degree             1.055644    .0319031     1.79   0.073    .9949312    1.120062
  Age                     .9990937    .0054341    -0.17   0.868    .9884996    1.009801
  Gender Male             .9396079    .1455691    -0.40   0.688    .6935438    1.272974
  Size_Town               1.020104    .0480394     0.42   0.673    .9301634    1.118742
  _cons                   1.483326    .4635426     1.26   0.207      .80396    2.736774
Pollution                (base outcome)

Note: _cons estimates baseline relative risk for each outcome.
Table 5.11 Stata Output: Average Marginal Effects From Multinomial Logistic Regression Model of Most Serious World Problem, WVS India 2010−2014

Average marginal effects                         Number of obs = 3,901
Model VCE    : OIM
dy/dx w.r.t. : Educ_Degree Age 1.Gender Size_Town
1._predict   : Pr(Most_Serious_Problem==Poverty), predict(pr outcome(1))
2._predict   : Pr(Most_Serious_Problem==Gender_Discrimination), predict(pr outcome(2))
3._predict   : Pr(Most_Serious_Problem==Sanitation_Disease), predict(pr outcome(3))
4._predict   : Pr(Most_Serious_Problem==Education), predict(pr outcome(4))
5._predict   : Pr(Most_Serious_Problem==Pollution), predict(pr outcome(5))

                             Delta-method
                     dy/dx    Std. Err.        z   P>|z|    [95% Conf. Interval]
Educ_Degree
  _predict 1     -.0271561    .0032854    -8.27   0.000   -.0335954   -.0207168
  _predict 2     -.0011455    .0024196    -0.47   0.636   -.0058878    .0035968
  _predict 3      .0022777    .0013404     1.70   0.089   -.0003495    .0049049
  _predict 4      .0195353    .0022834     8.56   0.000    .0150599    .0240108
  _predict 5      .0064885    .0017651     3.68   0.000    .0030291     .009948
Age
  _predict 1      .0006048    .0005843     1.04   0.301   -.0005403      .00175
  _predict 2     -.0010478    .0004325    -2.42   0.015   -.0018955   -.0002001
  _predict 3      .0001857    .0002351     0.79   0.430   -.0002751    .0006465
  _predict 4      .0001232    .0004179     0.29   0.768   -.0006959    .0009422
  _predict 5      .0001341    .0003167     0.42   0.672   -.0004867    .0007549
0.Gender         (base outcome)
1.Gender
  _predict 1      .0870805    .0161085     5.41   0.000    .0555084    .1186526
  _predict 2     -.1256607    .0121182   -10.37   0.000    -.149412   -.1019094
  _predict 3     -.0096888     .006738    -1.44   0.150    -.022895    .0035173
  _predict 4      .0285299    .0113613     2.51   0.012    .0062622    .0507976
  _predict 5      .0197391    .0086433     2.28   0.022    .0027986    .0366796
Size_Town
  _predict 1      .0092905    .0048244     1.93   0.054   -.0001652    .0187462
  _predict 2      .0099262    .0032662     3.04   0.002    .0035246    .0163279
  _predict 3     -.0031619    .0020995    -1.51   0.132    -.007277    .0009531
  _predict 4      -.009431    .0034884    -2.70   0.007   -.0162682   -.0025939
  _predict 5     -.0066238    .0027896    -2.37   0.018   -.0120913   -.0011562

Note: dy/dx for factor levels is the discrete change from the base level.
poverty (−.027) and a higher probability of selecting education (.020) and pollution (.006). Relative to females, males have a lower probability of selecting gender discrimination (−.126) and a higher probability of selecting poverty (.087), education (.029), and pollution (.020).

As always, it is important to note the nonlinear relationships of the predictors with the probabilities. The marginal effects would differ were they evaluated at other values of the independent variables, such as the mean values or representative values. More detailed analyses can graph the relationships across observations and values of the independent variables in ways that provide a fuller understanding of the relationships. Still, the average marginal effects typically offer an informative summary.

One other complication in multinomial logistic regression involves an assumption called the independence of irrelevant alternatives. The assumption is that adding or deleting categories does not change the relationships for the remaining categories. The assumption is most likely violated when two categories are so closely related that one serves as a substitute for the other. The classic example, called the red bus-blue bus problem, illustrates the assumption (McFadden, 1974). If one is choosing a way to ride to work between a car, a blue bus, and a red bus, the difference between a car and a bus is crucial, but the difference between a blue bus and red bus is irrelevant. Yet, in multinomial logistic regression, each alternative gets equal weight and thereby distorts the decision between a car and a bus. Another example involving a nominal outcome of preference for smoking cigarettes, vaping, or neither would likely violate the assumption. Vaping and smoking are close alternatives and differ from using neither. Deleting one category would change the relationships involving the other two categories. Long and Freese (2014) conclude that statistical tests of the assumption are seldom helpful, and that understanding the theoretical meaning of the alternatives in a nominal measure is more important.
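The red bus-blue bus problem can be made concrete with a small calculation. The sketch below is a stylized illustration, not an analysis of real data: it assumes a commuter who is indifferent between a car and a bus, so every alternative is given the same hypothetical linear predictor of 0, and the function name `mnl_probs` is introduced here only for the example.

```python
import math

def mnl_probs(linear_predictors):
    """Multinomial logit probabilities from a dict of linear predictors."""
    exps = {name: math.exp(v) for name, v in linear_predictors.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}

# Hypothetical commuter who is indifferent between car and bus
two_choices = mnl_probs({"car": 0.0, "blue_bus": 0.0})

# Add a red bus that is identical to the blue bus in every respect
three_choices = mnl_probs({"car": 0.0, "blue_bus": 0.0, "red_bus": 0.0})

# The odds of car versus blue bus are forced to stay the same
odds_before = two_choices["car"] / two_choices["blue_bus"]
odds_after = three_choices["car"] / three_choices["blue_bus"]
```

With two alternatives, the car has a probability of one half. After the red bus is added, the model assigns one third to each alternative, so the car falls to one third. Intuition suggests instead that the car should keep its one half and the two buses should split the remainder. The model cannot produce that result because it holds the odds of car versus blue bus constant regardless of which other alternatives are available.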
An Ordinal Dependent Variable

Multinomial logistic regression may be useful for ordinal dependent variables that have different relationships across their categories. The ordinal measure of support for more national spending on the environment serves as an example. Tests showed (Table 5.5) that several independent variables had different effects depending on the contrast across categories of the dependent variable. Multinomial logistic regression can be used to model these differences. Table 5.12 presents the SPSS output for a multinomial logistic regression using the last and largest category, support for more spending, as the base. The Exp(B) column shows that gender has divergent effects. Consider the coefficients for gender: .639 for category 1 and 1.023 for category 2. The relative risk ratio for supporting less spending relative to more spending is 36% lower for women (gender = 0) than men
Table 5.12 SPSS Output: Logged Odds Coefficients From Multinomial Logistic Regression Model of Support for More Spending on the Environment, GSS 1973−2016

Model Fitting Information
                  Model Fitting Criteria   Likelihood Ratio Tests
Model             −2 Log Likelihood        Chi-Square   df   Sig.
Intercept Only    52194.784
Final             50311.831                1882.953     16   .000

Pseudo R-Square
Cox and Snell    .054
Nagelkerke       .065
McFadden         .031

Likelihood Ratio Tests
                                    Model Fitting Criteria                Likelihood Ratio Tests
Effect                              −2 Log Likelihood of Reduced Model    Chi-Square   df   Sig.
Intercept                           50311.831(a)                          .000         0    .
Highest year of school completed    50453.089                             141.258      2    .000
Years since 1973                    50323.206                             11.375       2    .003
Age of respondent                   51074.744                             762.913      2    .000
Gender                              50445.844                             134.014      2    .000
Marital status                      50444.907                             133.076      8    .000

The chi-square statistic is the difference in −2 log likelihoods between the final model and a reduced model. The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are 0.
a. This reduced model is equivalent to the final model because omitting the effect does not increase the degrees of freedom.

Parameter Estimates
Nation Should Spend More on Environment(a)      B        Std. Error   Wald      df   Sig.   Exp(B)
1   Intercept                                   −2.642   .115         526.936   1    .000
    Highest year of school completed            −.049    .006         59.811    1    .000   .952
    Years since 1973                            −.004    .002         6.391     1    .011   .996
    Age of respondent                           .030     .001         497.645   1    .000   1.030
    [Gender=0]                                  −.448    .041         120.931   1    .000   .639
    [Gender=1]                                  0(b)     .            .         0    .      .
    [marital status=1]                          .373     .062         35.774    1    .000   1.453
    [marital status=2]                          .040     .095         .177      1    .674   1.041
    [marital status=3]                          .063     .082         .593      1    .441   1.065
    [marital status=4]                          .096     .127         .573      1    .449   1.101
    [marital status=5]                          0(b)     .            .         0    .      .
2   Intercept                                   −1.109   .071         244.763   1    .000
    Highest year of school completed            −.044    .004         113.330   1    .000   .957
    Years since 1973                            −.003    .001         7.498     1    .006   .997
    Age of respondent                           .018     .001         443.774   1    .000   1.018
    [Gender=0]                                  .023     .025         .799      1    .371   1.023
    [Gender=1]                                  0(b)     .            .         0    .      .
    [marital status=1]                          .275     .036         59.195    1    .000   1.316
    [marital status=2]                          .203     .059         12.011    1    .001   1.225
    [marital status=3]                          .022     .048         .213      1    .644   1.022
    [marital status=4]                          .062     .074         .711      1    .399   1.064
    [marital status=5]                          0(b)     .            .         0    .      .

a. The reference category is: 3.
b. This parameter is set to zero because it is redundant.
(gender = 1).21 The relative risk ratio for supporting the same spending relative to more spending is, in contrast, statistically equal for men and women. The coefficients for most of the other independent variables appear similar.
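The gender comparison can be checked by exponentiating the logged odds coefficients. What follows is a minimal sketch in Python rather than the SPSS procedure itself; the coefficient values are copied from Table 5.12, while the function names and dictionary labels are illustrative assumptions.

```python
import math

# Logged odds coefficients (B) for women, [Gender=0], copied from Table 5.12.
# The base category is 3 (support for more spending).
gender_b = {
    "less vs. more spending": -0.448,
    "same vs. more spending": 0.023,
}

def relative_risk_ratio(b):
    """Exponentiate a multinomial logit coefficient to obtain Exp(B)."""
    return math.exp(b)

def percent_change(rrr):
    """Percentage change in the relative risk ratio for a one-unit change."""
    return (rrr - 1.0) * 100.0

for contrast, b in gender_b.items():
    rrr = relative_risk_ratio(b)
    print(f"{contrast}: Exp(B) = {rrr:.3f}, change = {percent_change(rrr):.1f}%")
```

The first contrast gives an Exp(B) of about .639, or roughly 36% lower, and the second gives about 1.023, in line with the Exp(B) column of the table.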
Table 5.13 R Output: Logged Odds Coefficients From Multinomial Logistic Regression Model of Support for the Same Spending on the Environment, GSS 1973−2016

     y.level   term             estimate    std.error   statistic   p.value
 1   1         (Intercept)      -1.94       0.140       -13.9       9.69e-44
 2   1         Education        -0.00248    0.00670     -0.370      7.11e-1
 3   1         Time             -0.00170    0.00163     -1.04       2.99e-1
 4   1         Age              0.0105      0.00120     8.73        2.65e-18
 5   1         Gender.f1        0.505       0.0418      12.1        1.40e-33
 6   1         SizeOfPlace.f2   0.0634      0.101       0.630       5.29e-1
 7   1         SizeOfPlace.f3   -0.124      0.107       -1.16       2.44e-1
 8   1         SizeOfPlace.f4   0.0113      0.100       0.113       9.10e-1
 9   1         SizeOfPlace.f5   0.0205      0.0891      0.230       8.18e-1
10   1         SizeOfPlace.f6   -0.0158     0.0974      -0.163      8.71e-1
11   3         (Intercept)      1.52        0.0807      18.8        1.40e-78
12   3         Education        0.0371      0.00416     8.92        4.87e-19
13   3         Time             0.00451     0.000959    4.70        2.59e-6
14   3         Age              -0.0197     0.000731    -26.9       1.39e-159
15   3         Gender.f1        0.0269      0.0248      1.08        2.79e-1
16   3         SizeOfPlace.f2   -0.394      0.0564      -6.98       2.97e-12
17   3         SizeOfPlace.f3   -0.431      0.0585      -7.36       1.83e-13
18   3         SizeOfPlace.f4   -0.418      0.0560      -7.47       7.99e-14
19   3         SizeOfPlace.f5   -0.560      0.0495      -11.3       1.09e-29
20   3         SizeOfPlace.f6   -0.752      0.0557      -13.5       1.74e-41
Table 5.13 presents the results from R using the middle category, about the right spending, as the base. The different base changes the coefficients. For example, in Table 5.12, education has a negative coefficient for supporting less spending relative to more spending, while in Table 5.13, education has a nonsignificant coefficient for supporting less spending relative to the same spending. This result is consistent with the general finding that differences between supporting less and about the same spending are small relative to differences between supporting more spending and the other two categories.
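Within a single fitted multinomial model, changing the base category only re-expresses the same information: the coefficient for one contrast equals the difference between two coefficients measured against the old base. The sketch below illustrates that arithmetic with the education and gender coefficients from Table 5.12. The function `rebase` and the dictionary layout are hypothetical names introduced for illustration. Because Table 5.13 as printed lists a different set of covariates (size of place rather than marital status), its estimates need not match these derived values exactly.

```python
# Coefficients copied from Table 5.12 (base category 3, more spending).
base3 = {
    1: {"education": -0.049, "gender_0": -0.448},  # less vs. more spending
    2: {"education": -0.044, "gender_0": 0.023},   # same vs. more spending
}

def rebase(coefs, old_base, new_base):
    """Re-express multinomial logit coefficients relative to new_base.

    coefs maps each non-base category to {term: coefficient}. The old base
    category is implicit, with every coefficient equal to 0.
    """
    terms = list(next(iter(coefs.values())))
    rows = {old_base: {t: 0.0 for t in terms}, **coefs}
    ref = rows[new_base]
    # Subtract the new base's coefficients from every other category.
    return {
        cat: {t: row[t] - ref[t] for t in terms}
        for cat, row in rows.items()
        if cat != new_base
    }

base2 = rebase(base3, old_base=3, new_base=2)
for cat, row in sorted(base2.items()):
    for term, b in row.items():
        print(f"category {cat} vs. 2, {term}: {b:+.3f}")
```

Under these assumptions, the education coefficient for less versus the same spending is −.049 − (−.044) = −.005, close to zero, which is consistent with the nonsignificant education coefficient reported for that contrast in Table 5.13.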
Summary

Ordinal logistic regression and multinomial logistic regression add some complexities to the assumptions and interpretations used in binary logistic regression. This chapter can only briefly review the variations in analyses and interpretations. The key theme has been that understanding logged odds, marginal effects, and maximum likelihood estimation in binary logistic regression provides tools to understand more complex variations on the analysis of categorical dependent variables. The understanding also applies to other models for categorical outcomes such as Poisson or negative binomial models for count outcomes or hazard models for time-dependent binary outcomes.
NOTES

1. Nonsensical predicted values are by no means limited to binary dependent variables; unreasonable predictions at the extremes weaken models with continuous dependent variables as well. Such problems warrant attention to the functional form of the relationship.

2. Algebraically, the variance of the error term equals $\mathrm{Var}(e_i) = (b_0 + b_1X_i)[1 - (b_0 + b_1X_i)]$. If the variances are equal for all values of X, they would have no relationship to X. Yet the equation shows just the opposite: X values influence the size of the error. Taking $b_0 + b_1X_i$ as $P_i$, the equation becomes $\mathrm{Var}(e_i) = P_i(1 - P_i)$. As X affects $P_i$, it affects the error variance, which is greatest when $P_i = .5$ but becomes smaller as $P_i$ approaches 0 and 1.

3. The derivation is

$$
\begin{aligned}
O_i &= P_i/(1 - P_i),\\
P_i &= O_i \times (1 - P_i),\\
P_i &= O_i - O_i \times P_i,\\
P_i + O_i \times P_i &= O_i,\\
P_i(1 + O_i) &= O_i,\\
P_i &= O_i/(1 + O_i).
\end{aligned}
$$

4. The derivation is

$$
\begin{aligned}
P_i/(1 - P_i) &= e^{b_0 + b_1X_i},\\
P_i &= e^{b_0 + b_1X_i} \times (1 - P_i),\\
P_i &= e^{b_0 + b_1X_i} - P_i \times e^{b_0 + b_1X_i},\\
P_i + P_i \times e^{b_0 + b_1X_i} &= e^{b_0 + b_1X_i},\\
P_i \times (1 + e^{b_0 + b_1X_i}) &= e^{b_0 + b_1X_i},\\
P_i &= e^{b_0 + b_1X_i}/(1 + e^{b_0 + b_1X_i}).
\end{aligned}
$$

5. Noting that $e^{-X}$ equals $1/e^{X}$, and that $e^{X}$ equals $1/e^{-X}$, and letting $b_0 + b_1X_i$ equal $L_i$, the derivation is

$$
\begin{aligned}
P_i &= e^{L_i}/(1 + e^{L_i}),\\
P_i &= (1/e^{-L_i})/(1 + e^{L_i}),\\
P_i &= (1/e^{-L_i}) \times [1/(1 + e^{L_i})],\\
P_i &= 1/[e^{-L_i} \times (1 + e^{L_i})],\\
P_i &= 1/(e^{-L_i} + e^{L_i} \times e^{-L_i}).
\end{aligned}
$$

Since $e^{X} \times e^{Y}$ equals $e^{X+Y}$, and $e^{X-X}$ equals $e^{0}$ or 1, the formula reduces to $P_i = 1/(e^{-L_i} + 1) = 1/(1 + e^{-L_i})$.

6. To briefly summarize the approach, begin with an unobserved or latent continuous variable y* that, in the bivariate case, is represented by a linear regression model: $y_i^* = a + bX_i + e_i$. Then, specify that the observed value of y equals 1 (y = 1) when y* ≥ 0 and equals 0 (y = 0) when y* < 0. Because y* is unobserved, identifying the model requires specification of the variance of the error term, which for the standard logistic distribution equals $\pi^2/3$ or 3.29.

7. In some rare cases, the Wald statistic can be misleading (Agresti, 2013, pp. 174–175; Hosmer, Lemeshow, & Sturdivant, 2013, Chapter 1.3). Comparing the log likelihood ratio (discussed in the next chapter) for models with and without the variable provides another test for its significance.

8. Although the logistic regression coefficients are symmetric around 0, the factor and percentage change in odds do not have this property. The odds, odds ratio coefficients, and percentage change values have no upper bound but have a lower bound of 0. To compare negative and positive effects on odds, take the inverse. For example, the inverse of an exponentiated coefficient of 2.5 or a 150% increase in the odds equals 1/2.5 or .40, which translates into a 60% decrease.

9. Menard (2002, 2010, Chapter 5) reviews multiple ways to estimate fully standardized coefficients and the strengths and weaknesses of each. He recommends a formula for the variance of the dependent variable that uses a measure of variance explained (discussed in the next chapter), noting that it most closely resembles the standardized coefficient in linear regression.

10. In practice, the maximum likelihood estimate is found by setting the derivative of the likelihood function to 0 and solving for the parameter.

11. A problem in estimation occurs with complete separation or perfect prediction. It occurs, for example, when the binary dependent variable Y does not vary within categories of a categorical independent variable. The lack of variation defines a 0 cell in cross-classifying the two categorical variables. Another example might involve a dependent variable that equals 1 for all values of a continuous independent variable above a certain level and that equals 0 for all values below that level. Again, the dependent variable lacks variation within groupings of the independent variable. Such problems are rare and typically stem from small samples or definitional overlap of the independent and dependent variables, but they prevent maximum likelihood estimation and produce errors or warnings in software packages.

12. Using log likelihood values, the formula is $(\ln L_0 - \ln L_1)/\ln L_0$, or, equivalently, $1 - (\ln L_1/\ln L_0)$.

13. More precisely, it reaches 1 in practice only in the problematic case of perfect prediction, when the maximization procedure breaks down (Greene, 1993, p. 651).

14. Recall that the standardized coefficients are based on a measure of the standard deviation of the latent dependent variable, which comes from the sum of (1) the standard deviation of the predicted outcome and (2) the standard deviation of the model error. Both components differ in logistic regression and probit models, but the standardized coefficients adjust for both in a way that makes the coefficients comparable.

15. The density of the standard normal curve at the value of Z comes from the following formula:

$$f(Z) = \frac{1}{\sqrt{2\pi}} \exp(-Z^2/2).$$

The density value for a z score using this formula can be found with functions in SPSS, Stata, or R.

16. The formal model for ordinal logistic regression is more complex. It typically begins with odds equal to the probability of being in a lower category relative to a higher category, but the model then specifies the negative of the coefficients (Long & Freese, 2014, Section 7.1.2). See Long (1997), Menard (2010), and Hilbe (2009) for more formal derivations of the model. The result in practical terms is to produce coefficients that are interpreted as the odds (or logged odds) of being in a higher category relative to a lower category. However, take care to note when programs present results in terms of the original odds of being in a lower category relative to a higher category (O’Connell, 2006).

17. Of course, the interpretation of being in a higher category relative to a lower category depends on the coding of the dependent variable. Although always ordered, ordinal measures often do not have categories that are inherently higher than others. In this case, the measure may be coded as opposition to more environmental spending, with too much coded 3 and too little coded 1. The meaning of coefficients in ordinal logistic regression, like those in other models, depends on the direction of coding in the dependent variable.

18. The problem of complete separation or perfect prediction in binary logistic regression is also a concern in ordinal logistic regression, but it takes a special form. The highest and lowest category of the dependent variable must vary across categories of the independent variables. It is important to check for 0 cells with frequency tables of the dependent variables by the categorical independent variables.

19. PLUM does not have an option to present odds ratios. It is possible to save the coefficients and compute the odds ratios in SPSS, but it may be easier to compute the exponent by hand or in a spreadsheet.

20. Including all countries in the analysis would require mixed or multilevel models that adjust for the combination of country-level and individual-level data.

21. SPSS omits the last category of a categorical independent variable.
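The algebra in Notes 3 through 5 can be checked numerically. The following is a small sketch in Python, offered as an illustration rather than as part of the original notes; the function names are assumptions chosen for readability.

```python
import math

def odds_from_prob(p):
    """Note 3, first line: O = P / (1 - P)."""
    return p / (1.0 - p)

def prob_from_odds(odds):
    """Note 3, last line: P = O / (1 + O)."""
    return odds / (1.0 + odds)

def prob_from_logit_v1(logit):
    """Note 4, last line: P = e^L / (1 + e^L)."""
    return math.exp(logit) / (1.0 + math.exp(logit))

def prob_from_logit_v2(logit):
    """Note 5, last line: P = 1 / (1 + e^-L)."""
    return 1.0 / (1.0 + math.exp(-logit))

for p in (0.1, 0.5, 0.9):
    odds = odds_from_prob(p)
    logit = math.log(odds)
    # Both logistic forms should recover the starting probability.
    print(p, round(odds, 3), round(logit, 3),
          round(prob_from_logit_v1(logit), 3),
          round(prob_from_logit_v2(logit), 3))
```

Each row should end with the probability it started from, showing that the two forms of the logistic function in Notes 4 and 5 agree.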
APPENDIX: LOGARITHMS

Researchers often find it useful to distinguish between absolute and relative change in a variable. Absolute change ignores the starting level at which a change occurs; in absolute terms, income may increase by $1, $100, or $1000, but the change counts the same at all income levels. Relative change takes a change as a proportion or percentage of the starting level. As a result, the same absolute change counts less at higher starting levels than at lower levels. Using the income example again, a $100 change at $1000 shows a 10.0% increase ((100/1000) × 100), whereas a $100 change at $100,000 shows a 0.1% increase. The percentage represents relative change rather than absolute change.

Conversely, the same relative income change results in larger absolute increases at higher levels. Thus, a 10% increase translates into $100 at the starting level of $1000, and into $10,000 at the starting level of $100,000. Depending on the theoretical meaning of a variable, relative or percentage change may prove more appropriate than absolute change in modeling relationships in ordinary regression. It certainly is important in dealing with relationships involving odds in logistic regression.
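The income example can be reproduced in a few lines. This is a minimal sketch in Python; the function names are illustrative assumptions and the figures are those used in the paragraph above.

```python
def percent_change(start, change):
    """Relative change: the absolute change as a percentage of the start."""
    return change / start * 100.0

def absolute_change(start, percent):
    """Absolute change implied by a given percentage change."""
    return start * percent / 100.0

for start in (1_000, 100_000):
    print(f"$100 at ${start:,}: {percent_change(start, 100):.1f}%")
    print(f"10% at ${start:,}: ${absolute_change(start, 10):,.0f}")
```

The same $100 counts as 10.0% at $1,000 but only 0.1% at $100,000, while the same 10% amounts to $100 in one case and $10,000 in the other.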
The Logic of Logarithms

Logarithms offer an effective means of measuring relative change in a variable. The idea behind logarithms is simply to count by multiples rather than by adding ones. Multiples take the form of exponents or powers. For example, using a base of 10, the exponents of 1 to 5 give

$10^1 = 10$, $10^2 = 100$, $10^3 = 1000$, $10^4 = 10{,}000$, $10^5 = 100{,}000$.

As the power or exponent increases by 1, the resulting value increases by a multiple of 10. The outcome goes from 10 to 100 to 1000 and so on, with each successive value equaling 10 times the previous value. Note also that a one-unit increase in the exponent or power results in a constant percentage increase in the outcome. The absolute outcomes increase by values of 90, 900, 9000, and 90,000. However, the percentage increases all equal 9 × 100 or 900% (e.g., (90/10) × 100 = 900; (900/100) × 100 = 900). In general, the percentage increase equals the base of 10 minus 1, times 100.

To define logarithms, let the base equal b, the power or exponent equal n, and the outcome equal X. Then $b^n = X$. Given values of X, logarithms measure the power to which the base must be raised to produce the X values. They measure the power in the exponent formula rather than X. Therefore, we can define n as the log of X such that $b^{\log X} = X$.

The logarithm of X to the base 10, called a common logarithm, equals the power to which 10 must be raised to get X. As 10 raised to the second power equals 100, the base 10 log of 100 equals 2. The base 10 log of 1000 equals 3, the base 10 log of 10,000 equals 4, and so on. As before, an increase of one in a logarithm translates into an increase in X by a multiple of 10. An increase of 2 in the log translates into an increase by a multiple of 100 (10 × 10). In this terminology, X remains in its original absolute units, but the log of X reflects relative or percentage change.

As X gets larger, it requires a larger increase to produce a one-unit change in the logarithm. Taking the logarithm thus shrinks values of the original variable above 1, and the shrinkage increases as the values increase. Take the examples in Table A.1. As X increases by multiples of 10, the log of X increases by 1. As the log of X goes from 1 to 2, X moves from 10 to 100 or increases by 90; as the log of X goes from 2 to 3, X moves from 100 to 1000, or increases by 900; and as the log of X goes from 3 to 4, X moves from 1000 to 10,000, or increases by 9000. Reflecting the nature of percentage change, identical changes in the log of X translate into successively larger increases in X.
Table A.1

X          log X
10         1
100        2
1000       3
10,000     4
100,000    5
Following this logic, the same change in X translates into a smaller change in the log of X as X gets larger. A change in X from 10 to 11 implies a change in the log of X from 1 to 1.04. A change in X from 100 to 101 implies a change in the log of X from 2 to 2.004. A change in X from 1000 to 1001 implies a change in the log of X from 3 to 3.0004. Each time X increases by 1, but the log of X increases by successively smaller amounts: .04, then .004, and then .0004. This simply restates the principle that successively larger increases in X are needed to produce the same change in the log of X.

Taking the logarithm of a variable fits the substantive goal of modeling relative change. If the original X measures absolute change, the log of X measures percentage change. In original units, an increase in X of one unit means the same regardless of the initial starting point. In logged units, an increase in X of one unit translates into a larger change at low levels of X than at high levels of X.

Taking the logarithm also has the benefit of pulling in extreme values in a skewed distribution. For many variables, the extreme values lie on the positive or right side of the distribution. To obtain a more normal distribution, and shrink the gap between a few outliers and the rest of the distribution, take the log of such variables. Extremely large values will count less when taking the log of the original variable because of the shift to a percentage scale. In other words, the transformation may place all the cases on a similarly meaningful scale. It does not change the ordering of the cases: the lowest and highest unlogged values remain the lowest and highest logged values, but the relative position and the size of the gaps between the cases change because of the focus on percentage rather than on absolute differences.
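The shrinking increments described above are easy to verify. The sketch below uses Python's standard `math.log10`; the helper name `log_gain_from_one_unit` is an assumption made for this illustration.

```python
import math

def log_gain_from_one_unit(x):
    """Change in the common log when X rises from x to x + 1."""
    return math.log10(x + 1) - math.log10(x)

for x in (10, 100, 1000):
    print(f"X: {x} -> {x + 1}, log X rises by {log_gain_from_one_unit(x):.4f}")
```

The gains are roughly .04, .004, and .0004, so each tenfold increase in the starting level cuts the effect of a one-unit change in X on its log by about a factor of ten.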
Properties of Logarithms

Knowing the value of X, you can find the common logarithm of X on a hand calculator simply by typing in X and then the LOG key. Similarly, knowing the common log of X, you can easily find X by typing in the log value and then the $10^x$ key. To solve for a value of X given the logarithm of X, merely treat the log as an exponent. In calculating the values of logarithms and their exponents, note the following properties.

Logarithms are defined only for values of X above 0. No real number exists such that 10 (or any other base) raised to that power produces 0. The same holds for negative values: no real number exists such that 10 (or any other base) raised to that power produces a negative number. A logarithm exists only for numbers above 0. The logarithm of a variable with 0 or negative values is undefined for those values. It is necessary to add a constant to the variable so that all values exceed 0 before taking the logarithm.

For values of X between 0 and 1, logarithms are negative. This follows from the logic of exponents. A negative exponent such as in $10^{-2}$ equals $1/10^2$, 1/100, or .01. Thus, the power that 10 must be raised to produce an X value of .01 is −2. As $10^{-3}$ equals $1/10^3$, 1/1000, or .001, the log of .001 equals −3. As X becomes smaller and smaller and approaches 0, the logarithm of X becomes an increasingly large negative number. As X can become infinitely small without reaching 0, the log of X can become an infinitely large negative number. When X reaches 0, the logarithm is undefined.

When X equals 1, the logarithm equals 0 because any number raised to the power 0 equals 1. When X exceeds 1, logarithms produce positive values. As X can increase infinitely, so may the logarithm increase infinitely.

Overall, the X value of 1 and the log X value of 0 define dividing points. Values of X between 0 and 1 produce negative logarithms between 0 and negative infinity; values of X between 1 and positive infinity produce positive logarithms between 0 and positive infinity. Conversely, the larger the absolute value of a negative logarithm (i.e., the farther it falls from zero), the closer the original value comes to zero; the smaller the absolute value of a negative logarithm (i.e., the closer it comes to zero), the closer the original value comes to 1. The smaller the positive value of a logarithm, the closer the original value comes to 1.

Figure A.1(a) illustrates the logarithm function by plotting the common log of X by X. Figure A.1(b) presents the same graph only for values of X up to 20. Negative logarithms are shown in the graphs for values of X near 0, while positive logarithms are shown in the graphs for values of X greater than 1. The graphs also illustrate that as X increases, the logarithm changes less per unit change in X. At high levels of X, the curve rises very little: as X takes values ranging from near 0 to 1000, the common log of X rises only to 3. The graphs thus indicate that the logarithm shrinks numbers above 1, with the larger numbers shrinking more than smaller numbers.

The fact that logarithms represent multiples of a base value allows one to translate multiplication into addition of logarithms. Two properties of logarithms follow. First, the logarithm of a product of two numbers equals the sum of the separate logarithms: log(X × Y) = log X + log Y.
Figure A.1 (a) Common logarithms (open circles) and natural logarithms (open triangles) and (b) lower range of common logarithms (open circles) and natural logarithms (open triangles). [Figure not reproduced. Panel (a) plots X from 0 to 1000 against a vertical axis running from −3 to 9; panel (b) plots X from 0 to 20 against a vertical axis running from −3 to 3.]
For example, the log of (100 × 1000) equals (log 100) + (log 1000): since the log of 100 = 2, the log of 1000 = 3, and the log of 100,000 = 5, adding the logs gives the same result as logging the product. Second, the log of a quotient of two numbers equals the difference of the separate logs: log(X/Y) = log X − log Y. Thus, the log of (100/1000) equals (log 100) − (log 1000), that is, 2 − 3 = −1, the log of .1.

Another property proves useful in manipulating equations with logarithms. The logarithm of a power equals the exponent times the log of the base: $\log X^k = k \times \log X$. For example, $\log 10^5$ equals the log of 100,000 or 5; it also equals 5 × log 10 or 5 × 1.
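The properties in this section can be confirmed numerically. The following Python sketch checks the product, quotient, and power rules with the same numbers used in the text and shows that the logarithm is undefined at 0 and below; the helper `safe_log10` is a hypothetical convenience introduced here, not a standard function.

```python
import math

x, y, k = 100.0, 1000.0, 5

# Product rule: log(X * Y) = log X + log Y
product_ok = math.isclose(math.log10(x * y), math.log10(x) + math.log10(y))

# Quotient rule: log(X / Y) = log X - log Y
quotient_ok = math.isclose(math.log10(x / y), math.log10(x) - math.log10(y))

# Power rule: log(X^k) = k * log X, here with X = 10
power_ok = math.isclose(math.log10(10.0 ** k), k * math.log10(10.0))

def safe_log10(value):
    """Return the common log, or None where the log is undefined (value <= 0)."""
    try:
        return math.log10(value)
    except ValueError:
        return None

print(product_ok, quotient_ok, power_ok)
print(safe_log10(0.01), safe_log10(1), safe_log10(0), safe_log10(-5))
```

Values between 0 and 1 return negative logs, 1 returns 0, and 0 or negative inputs return `None`, which is why a constant must be added to such variables before logging them.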
Natural Logarithms

Despite their intuitive appeal, common logarithms find less use than another type of logarithm. Natural logarithms use the base of e, or approximately 2.718, instead of 10. This base has mathematical properties that make it useful in a variety of circumstances relating to computing compound interest and solving for derivatives and integrals in calculus. Otherwise, however, the logic of logarithms remains the same for e as for 10. The natural logarithm of X (symbolized by ln X) equals the power to which e must be raised to get X.

Natural logs still count by multiples rather than by adding ones but by multiples of e. The exponents of 1 to 5 of e give

$e^1 = 2.718$, $e^2 = 7.389$, $e^3 = 20.086$, $e^4 = 54.598$, $e^5 = 148.413$.

As the power or exponent increases by 1, the resulting values increase by a multiple of 2.718. The exponentials do not increase as quickly as with the base 10, since multiples of 10 exceed multiples of 2.718, but they still increase faster than counting by ones.

To obtain natural logarithms, simply turn this process around. Given X values of 2.718, 7.389, 20.086, 54.598, and 148.413, the natural logs equal 1, 2, 3, 4, and 5. We must raise e to the power 1 to get 2.718, to the power 2 to get 7.389, and so on. Typically, X is an integer, so the log of X is not. Let X equal 5, 27, 62, and 105, to pick some numbers at random. For the first number, 2.718 must be raised to a power between 1 and 2 since 5 falls between 2.718 and 7.389. The natural log of 5 equals 1.609. The X value of 27 falls between e raised to the powers 3 and 4. Its natural log is 3.296. The natural log of 62 is 4.127, and the natural log of 105 is 4.654. You can obtain the natural log from a calculator simply by typing X and then the LN key.

You can verify that, as X gets larger, a one-unit change in X results in increasingly small changes in the natural log of X. As illustrated in Table A.2 for values of X greater than or equal to 1, the log of X shrinks the values of X in proportion to their size. Note that, as for common logs, the natural log is not defined for values of 0 and lower, and that the log of X for values greater than 0 and less than 1 is negative. If a variable has values of 0 or lower, add a constant so that the minimum value exceeds 0 before taking the natural log.

The natural log, like the common log, has a straightforward percentage interpretation: a change in one logged unit represents a constant percentage increase in the unlogged variable. To show this, note that to change the log of X back to X, we simply have to raise e to the value of the log of X. For example, on your calculator, type 0, and then the key represented by $e^x$. The result equals 1. Transforming the log of X into X would show the results in Table A.3. To see how the natural log of X reflects a constant percentage or relative increase (rather than a constant absolute increase in single units), calculate the percentage change in X for a one-unit change in the log of X. As the log of X changes from 0 to 1, X changes from 1 to 2.718. The percentage change equals

$$\%\,\text{change} = [(2.718 - 1)/1] \times 100 = 171.8.$$

Table A.2

X      ln X
1      0
2      .693
3      1.099
101    4.615
102    4.625
103    4.635
Table A.3

ln X    X
0       1
1       2.718
2       7.389
3       20.086
4       54.598
For changes in the log of X from 1 to 2 and from 2 to 3, the percentage changes equal

$$\%\,\text{change} = [(7.389 - 2.718)/2.718] \times 100 = 171.8,$$

and

$$\%\,\text{change} = [(20.086 - 7.389)/7.389] \times 100 = 171.8.$$

In each case, the percentage change equals 2.718 − 1 times 100. Hence, X changes by the same percentage (171.8) for each unit change in the log of X. An increase of 171.8% is the same as multiplying the starting value by 2.718.

Figures A.1(a) and (b) plot X by the natural log of X along with the common log of X. Compared to the common log of X, the natural log of X reaches higher levels because 2.718 must be raised to a larger power than 10 to reach the same X. Overall, however, the shapes of the two curves show important similarities: both show a declining rate of change as X increases.
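The constant percentage change per logged unit can be confirmed with a short calculation. This is a minimal sketch using Python's standard `math.log` and `math.exp`; the function name `percent_change_per_log_unit` is an assumption for illustration.

```python
import math

# Natural logs of the example values discussed in the text.
for x in (5, 27, 62, 105):
    print(f"ln {x} = {math.log(x):.3f}")

def percent_change_per_log_unit(ln_x):
    """Percentage change in X when ln X rises from ln_x to ln_x + 1."""
    before = math.exp(ln_x)
    after = math.exp(ln_x + 1)
    return (after - before) / before * 100.0

for ln_x in range(4):
    print(f"ln X: {ln_x} -> {ln_x + 1}, X changes by "
          f"{percent_change_per_log_unit(ln_x):.1f}%")
```

Whatever the starting value of ln X, the percentage change in X comes out at about 171.8, which is (e − 1) × 100.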
Summary

Logarithms provide a means to count by multiples. They show the power to which a base value such as 10 or e must be raised to obtain a positive number. Compared to the original numbers, logarithms rise at a decreasing rate. When numbers greater than or equal to 1 go up by 1, their logs go up by less than 1. Moreover, the larger the original number, the less the logarithm increases for a one-unit increase in the original number. All this makes logarithms appropriate for measuring relative or percentage change rather than absolute change in ordinary regression. It also makes them appropriate for use with the odds of experiencing an event or having a characteristic as modeled in logistic regression.
REFERENCES

Agresti, A. (2013). Categorical data analysis (3rd ed.). Hoboken, NJ: John Wiley and Sons.
Agresti, A., & Tarantola, C. (2018). Simple ways to interpret effects in modeling ordinal categorical data. Statistica Neerlandica, 72, 210–223.
Allison, P. D. (1999). Comparing logit and probit coefficients across groups. Sociological Methods and Research, 28, 186–208.
Breen, R., Karlson, K. B., & Holm, A. (2018). Interpreting and understanding logits, probits, and other nonlinear probability models. Annual Review of Sociology, 44, 39–54.
Cox, D. R., & Snell, E. J. (1989). Analysis of binary data (2nd ed.). London, England: Chapman and Hall.
Eliason, S. R. (1993). Maximum likelihood estimation: Logic and practice (Sage University Papers Series on Quantitative Applications in the Social Sciences, series no. 07-096). Newbury Park, CA: Sage.
Greene, W. H. (2008). Econometric analysis (4th ed.). New York, NY: Macmillan.
Harrell, F. E., Jr. (2015). Regression modeling strategies (2nd ed.). New York, NY: Springer.
Hilbe, J. M. (2009). Logistic regression models. Boca Raton, FL: Taylor & Francis Group.
Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression (3rd ed.). Hoboken, NJ: John Wiley and Sons.
Liddell, T. M., & Kruschke, J. K. (2018). Analyzing ordinal data with metric models: What could possibly go wrong? Journal of Experimental Social Psychology, 79, 328–348.
Long, J. S. (1997). Regression models for categorical and limited dependent variables: Analysis and interpretation. Thousand Oaks, CA: Sage.
Long, J. S., & Freese, J. (2014). Regression models for categorical dependent variables using Stata (3rd ed.). College Station, TX: Stata Press.
Long, J. S., & Mustillo, S. A. (2018). Using predictions and marginal effects to compare groups in regression models for binary outcomes. Sociological Methods and Research. Advance online publication. doi:10.1177/0049124118799374.
Maddala, G. S., & Lahiri, K. (2009). Introduction to econometrics (4th ed.). Chichester, England: Wiley.
McFadden, D. (1974). Conditional logit analysis of qualitative choice. In P. Zarembka (Ed.), Frontiers of econometrics (pp. 105–142). New York, NY: Academic Press.
Menard, S. (2002). Applied logistic regression analysis (2nd ed.) (Sage University Papers Series on Quantitative Applications in the Social Sciences, series no. 07-106). Thousand Oaks, CA: Sage.
Menard, S. (2010). Logistic regression: From introductory to advanced concepts and applications. Thousand Oaks, CA: Sage.
Mood, C. (2010). Logistic regression: Why we cannot do what we think we can do, and what we can do about it. European Sociological Review, 26, 67–82.
Muller, C. J., & MacLehose, R. F. (2014). Estimating predicted probabilities from logistic regression: Different methods correspond to different target populations. International Journal of Epidemiology, 43(3), 962–970.
Mustillo, S. A., Lizardo, O. A., & McVeigh, R. M. (2018). Editors’ comment: A few guidelines for quantitative submissions. American Sociological Review, 83(6), 1281–1283.
Nagelkerke, N. J. D. (1991). A note on a general definition of the coefficient of determination. Biometrika, 78, 691–692.
O’Connell, A. A. (2006). Logistic regression models for ordinal response variables. Thousand Oaks, CA: Sage.
Williams, R. (2009). Using heterogeneous choice models to compare logit and probit coefficients across groups. Sociological Methods and Research, 37, 531–559.
Williams, R. (2012). Using the margins command to estimate and interpret adjusted predictions and marginal effects. The Stata Journal, 12(2), 308–331.
INDEX Agresti, A., 9, 28, 42, 83, 110 Allison, P. D., 47 alternatives to linearity, 5–7 American Sociological Review, 47 Brant test, 89–91 Breen, R., 29, 34, 47–48 chi-square, 22, 57–64, 67, 79, 85, 91–93, 96, 104–105 comparing models, 47–49, 60–62 Cox, D. R., 63–65, 92, 105 cumulative standard normal distribution: probabilities, 69–74, 77 properties, 69–71 deviance: model or residual, 59–63, 66–67 null, 59–63, 66–67 dummy variable: independent variable, 20, 26–27, 34 regression, 1, 8 scatterplot, 3 values, 1 variance, 42, 44 Eliason, S. R., 51 environmental support: measure, 82–85, 112 R output, 107 SPSS output, 104–107 Stata output, 84–91 equal interval assumption, 82–83 exponentiated logit coefficients, 15, 17, 23–26, 98, 111
Freese, J., 29, 34, 45, 100, 104, 122 functional form, 2–6 gay marriage: measure, 82, 92–94 R output, 94 SPSS output, 92–93 General Social Survey (GSS): data, 4 environmental support, 82–87, 90–91, 104–105, 107 gay marriage, 82, 92–94 marijuana, legalize, 4, 12, 57–58, 61, 63, 65–67 goodness of fit, 62–67, 85 Greene, W. H., 9, 17, 111 group comparisons, logistic regression, 47–49 group membership, goodness of fit, 65–67, 85 Harrell, F. E., 66 Hilbe, J. M., 52, 56, 83, 112 Holm, A., 29 homoscedasticity, 8, 9, 109 Hosmer, D. W., 22, 110 hypothesis testing, 60–62 independence of irrelevant alternatives assumption, 103–104 interaction, 8, 47–50, 57 Karlson, K. B., 29 Kruschke, J. K., 83 Lahiri, K., 17 Lemeshow, S., 22, 110 Liddell, T. M., 83 127
128 likelihood function, 52–53, 55–64 likelihood ratio, 59–61 linearity: alternatives to, 5–7 logged odds, 12–13, 19–21, 83–85 logistic regression, 10–11, 14–17, 49 logit transformation, 10–11, 14–17 ordinal logistic regression, 82–83, 84 partial derivative, 28 probit transformation, 69–73 regression, 40 tangent line, 28 Lizardo, O. A., 47 log likelihood: baseline, 59, 67 comparing models, 60–62 hypothesis testing, 60–62 model, 59, 67 significance tests, 56–60 log likelihood function, 54–56 logarithms: logged odds, 12, 15 logic, 115–117 natural, 120–123 properties, 117–120 logged odds: coefficients, 19–21, 83–85 properties, 12, 13, 15 logistic function, 9 logistic regression: compared to ordinal logistic regression, 81 complete separation, 111 group comparisons, 47–49 latent variable approach, 110 model comparisons, 47–49 perfect prediction, 111
logit (see also logged odds): coefficients, exponentiated, 21 compared to probit, 73–76 transformation, 9, 10, 14 Long, J. S., 17, 21, 29, 34, 44–45, 48, 63, 66, 73, 89, 100, 104, 112 MacLehose, R. F., 33 Maddala, G. S., 17 marginal effects: average, 31–35 categorical independent variables, 34–36 continuous independent variables, 28–34 defined, 27–28 graphs, 36–39 at means, 29–31, 35 multinomial logistic regression, 100–103 ordinal logistic regression, 85–89 probit analysis, 76–78 at representative values, 31, 35 marijuana, legalize: measure, 12 regression model, 4 R output, 63–66 SPSS output, 65–67 Stata output, 57–58, 61 maximum likelihood estimation, 51–56, 77–79 McFadden, D., 63, 65, 92, 103, 105 McVeigh, R. M., 47 Menard, S., 63, 67, 83, 89, 111–112 model comparisons, logistic regression coefficients, 47–49 Mood, C., 47
Muller, C. J., 33
multilevel model, 113
multinomial logistic regression:
  base category, 96
  independence of irrelevant alternatives assumption, 103–104
  versus logistic regression, 95
  marginal effects, 100–103
  probability effects, 100–103
  R, 107
  relative risk ratio, 98–100, 106
  SPSS, 104–106
  Stata, 97–103
Mustillo, S. A., 47, 48
Nagelkerke, N. J. D., 63–65, 92, 105
National Health Interview Survey:
  graphs, 37–41
  interaction, 48–49
  logistic regression, 19–20, 22–24, 27
  marginal effects, 30–33
  probit analysis, 74–76, 78
  regression, 2
  standardized coefficients, 44, 46
nominal categorical variables, 81
non-additivity:
  defined, 7–10
  probabilities, 26, 34, 37, 39–40, 49
nonlinear model of probabilities, 7
nonlinearity:
  linearizing, 14, 16–17, 19, 21
  ordinal logistic regression, 83
  multinomial logistic regression, 103
  probabilities, 26, 28, 34, 36–37, 40, 45
  probit transformation, 69, 71–73, 76
  relationships, 6–10
normal error violation, 8
O’Connell, A. A., 83, 112
odds, 10–12
odds coefficients:
  interpretation, 23–26
  percentage change, 25
odds ratios, 12, 83–85
ordered logistic regression (see ordinal logistic regression)
ordinal categorical variables, 81–82
ordinal logistic regression:
  Brant test, 89–91
  compared to binary logistic regression, 81
  complete separation, 112–113
  cut points, 85
  equal interval assumptions, 82–83
  formal model, 112
  key assumption, 89–91
  logged odds coefficients, 83–86
  marginal effects, 85–89
  odds ratios, 83–85
  parallel regression assumption, 89–91
  perfect prediction, 112–113
  predicted probabilities, 85–87
  probability effects, 85–89
  R, 94
  SPSS, 92–93
  Stata, 84–89
  types, 83
partial derivative, 28
probabilities:
  chi-square, 61, 63
  functional form, 6, 8–9
  graphing, 39–41
  group membership, 65
  interaction, 48–49
  interpretation, 19, 21, 26–29
  log likelihood, 57, 59–60
  logged odds, 12–13
  logit transformation, 9–10, 14–17
  marginal effects, 31, 34–37, 47, 76–77, 88–89, 103
  maximum likelihood estimation, 51–56
  multinomial logistic regression, 96–98
  odds, 10–12, 23
  ordinal logistic regression, 83, 85, 87, 112
  out of range, 4, 109
  probit transformation, 69–73
  regression, 1–2, 4, 8–9
  variance, 42
probit analysis:
  coefficients, 74–75
  compared to logit, 73–76
  cumulative standard normal distribution, 69–74, 77
  linearize probabilities, 69–72
  marginal effects, 76–78
  maximum likelihood estimation, 77–79
  model, 72
  transformation, 72–73
proportional odds assumption, 89–91
pseudo-variance explained:
  logic, 62
  R, 63, 66
  SPOST, 63
  SPSS, 63–65
R:
  logistic regression, 24
  multinomial logistic regression, 107
  ordinal logistic regression, 94
  pseudo-variance explained, 63, 66
regression, linear:
  equal interval assumption, 82–83
  multinomial logistic regression, 110–111
  ordinal logistic regression, 82–83, 85, 89, 94
  probability model, 1, 8, 27, 40, 41
relative risk ratio, 98–100
scatterplot, binary outcome, 3
scatterplot, jittering, 3
semistandardized coefficients (X standardized), 42–43
significance tests, 21–22, 56–60
smoking:
  graphs, 37–41, 49
  interaction, 48
  marginal effects, 30, 32–33
  R, 24
  regression, linear, 2
  SPOST, 76
  SPSS, 22
  Stata, 20, 30, 32–33, 44, 48, 77–79
Snell, E. J., 63–65
SPOST:
  Brant test, 89–91
  output, 46
  program, 45–46
  standardized coefficients, 45–46, 76–78
SPSS:
  logistic regression, 22–24, 63–67
  multinomial logistic regression, 104–107
  ordinal logistic regression, 92–93
  PLUM, 92, 113
  standard normal curve, 112
s-shaped curve, 6, 71
standard normal curve, 72–73, 112
standardized coefficients:
  logistic regression, 41–46, 111
  probit analysis, 76–78
  X-standardizing, 42–43, 47
  XY-standardizing, 44–47
Stata:
  graphs, 37–41
  interaction, 48–49
  logistic regression, 20–21, 23–24
  marginal effects, 30, 32, 33, 86, 102–104
  model comparison, 58–61
  multinomial logistic regression, 95–103
  ordinal logistic regression, 84–91
  predicted values, 27
  probit analysis, 69–79
  standardized coefficients, 44, 46
statistical inference, 8
Sturdivant, R. X., 22, 110
tangent line, 28–29
Tarantola, C., 83
truncated probability model, 5
variance, 8–9, 42, 44–45
Wald statistic, 22, 100
Williams, R., 29, 47
world problems, most important, 96–103
World Values Survey, 96–97, 99, 101–102