GENERALIZED LINEAR MODELS Second Edition
Quantitative Applications in the Social Sciences: A SAGE Publications Series (Volumes 1–180)
Sara Miller McCune founded SAGE Publishing in 1965 to support the dissemination of usable knowledge and educate a global community. SAGE publishes more than 1000 journals and over 800 new books each year, spanning a wide range of subject areas. Our growing selection of library products includes archives, data, case studies and video. SAGE remains majority owned by our founder and after her lifetime will become owned by a charitable trust that secures the company’s continued independence. Los Angeles | London | New Delhi | Singapore | Washington DC | Melbourne
GENERALIZED LINEAR MODELS A Unified Approach Second Edition
Jeff Gill American University
Michelle Torres Rice University
FOR INFORMATION:

SAGE Publications, Inc.
2455 Teller Road
Thousand Oaks, California 91320
E-mail: [email protected]

SAGE Publications Ltd.
1 Oliver’s Yard
55 City Road
London EC1Y 1SP
United Kingdom

SAGE Publications India Pvt. Ltd.
B 1/I 1 Mohan Cooperative Industrial Area
Mathura Road, New Delhi 110 044
India

SAGE Publications Asia-Pacific Pte. Ltd.
18 Cross Street #10-10/11/12
China Square Central
Singapore 048423

Copyright © 2020 by SAGE Publications, Inc.

All rights reserved. Except as permitted by U.S. copyright law, no part of this work may be reproduced or distributed in any form or by any means, or stored in a database or retrieval system, without permission in writing from the publisher.

All third party trademarks referenced or depicted herein are included solely for the purpose of illustration and are the property of their respective owners. Reference to these trademarks in no way indicates any relationship with, or endorsement by, the trademark owner.

Printed in the United States of America

Library of Congress Cataloging-in-Publication Data

Names: Gill, Jeff, author. | Torres, Michelle (Statistician), author.
Title: Generalized linear models : a unified approach / Jeff Gill, Michelle Torres.
Description: Second edition. | Thousand Oaks, California : SAGE, 2019. | Includes bibliographical references and index.
Identifiers: LCCN 2019007059 | ISBN 978-1-5063-8734-5 (pbk. : alk. paper)
Subjects: LCSH: Linear models (Statistics)
Classification: LCC QA276 .G455 2019 | DDC 519.5/4–dc23
LC record available at https://lccn.loc.gov/2019007059

Acquisitions Editor: Helen Salmon
Editorial Assistant: Megan O’Heffernan
Content Development Editor: Chelsea Neve
Production Editor: Rebecca Lee
Copy Editor: Gillian Dickens
Typesetter: Integra
Proofreader: Wendy Jo Dymond
Indexer: Laurie Andriot
Cover Designer: Candice Harman
Marketing Manager: Shari Countryman

This book is printed on acid-free paper.
CONTENTS

Series Editor Introduction
About the Authors

1. Introduction
   Model Specification
   Prerequisites and Preliminaries
   Looking Forward

2. The Exponential Family
   Justification
   Derivation of the Exponential Family Form
   Canonical Form
   Multiparameter Models

3. Likelihood Theory and the Moments
   Maximum Likelihood Estimation
   Calculating the Mean of the Exponential Family
   Calculating the Variance of the Exponential Family
   The Variance Function

4. Linear Structure and the Link Function
   The Generalization
   Distributions

5. Estimation Procedures
   Estimation Techniques
   Profile Likelihood Confidence Intervals
   Comments on Estimation

6. Residuals and Model Fit
   Defining Residuals
   Measuring and Comparing Goodness of Fit
   Asymptotic Properties

7. Extensions to Generalized Linear Models
   Introduction to Extensions
   Quasi-Likelihood Estimation
   Generalized Linear Mixed-Effects Model
   Fractional Regression Models
   The Tobit Model
   A Type 2 Tobit Model With Stochastic Censoring
   Zero-Inflated Accommodating Models
   A Warning About Robust Standard Errors
   Summary

8. Conclusion
   Summary
   Related Topics
   Classic Reading
   Final Motivation

Endnotes
References
Index
SERIES EDITOR INTRODUCTION
The generalized linear model (GLM) is a single methodology for addressing relationships between an outcome and a set of explanatory variables. It consists of a systematic linear component, a stochastic component, and a link function. The basic linear model is a special case of GLM, as, for example, are logit regression, multinomial regression, and Poisson regression models. There is something very satisfying about the elegance and simplicity of this unified approach.

In the second edition of Generalized Linear Models, Jeff Gill and Michelle Torres provide an introduction to and overview of GLMs. Each chapter carefully lays the groundwork for the next. Chapter 1 describes the promise and potential of GLMs and sets the foundation for the chapters to come. Chapter 2 explains the exponential family of distributions, Chapter 3 discusses likelihood theory and moments (especially the mean and variance), and Chapter 4 introduces the reader to the link function. These are the ingredients needed for a GLM. Chapter 5 addresses estimation (focusing on maximum likelihood), and Chapter 6 discusses model fit. Gill and Torres explicitly develop key ideas within and between chapters. Readers with knowledge of basic calculus, matrix algebra notation, probability density functions, and likelihood theory will benefit from the presentation.

This is a second edition. Compared to the first, it provides fuller explanations, discusses interpretation and inference in greater depth, and includes many new examples. Chapter 7 is completely new. It introduces extensions of the GLM, including the Tobit model, fractional regression models, quasi-likelihood estimation, generalized linear mixed-effects models (GLMM), robust standard errors, and zero-inflated models. The discussion is not intended to be comprehensive, but it does point to interesting potential applications of GLM tools that readers may wish to pursue.

Examples are critical to the pedagogy in this volume. Sometimes, they are abstract (e.g., when different members of the exponential family of distributions are explored in Chapter 2). Common probability functions are rewritten in exponential family form with the intermediate steps shown, providing readers with thorough grounding in how to manipulate the distributions and identify the canonical link. Other times, the examples are applications based on real data. There are several of these, developed across multiple chapters: capital punishment (a count of executions in each state), electoral politics in Scotland (percentage of votes in favor of granting parliament taxation powers), and voting intention in the U.S. Republican presidential primaries (candidate selected from a list). Additional examples are used to make particular points. These include educational standardized testing, assignment of bills to congressional committees after an election, campaign donations, and suicide rates. Software, scripts, documentation, and data for the examples are available from the authors’ websites (links at sagepub.com/gill2e) as well as through the R package GLMpack.

Throughout, Gill and Torres address challenges involved in the presentation and interpretation of results from GLMs, giving helpful pointers and useful advice. Readers just encountering GLMs for the first time, as well as more advanced users, will benefit from the applied orientation of this volume.

—Barbara Entwisle
Series Editor
ABOUT THE AUTHORS
Jeff Gill is Distinguished Professor in the Department of Government, Professor in the Department of Mathematics & Statistics, and Member of the Center for Behavioral Neuroscience at American University. He is also the inaugural director of the Center for Data Science and Co-Director of the graduate program in Data Science there. In addition to theoretical and methodological work in Bayesian statistics and statistical computing, his applied work centers on studying human beings from social, political, and biomedical perspectives.

Michelle Torres is Assistant Professor in the Department of Political Science at Rice University. Her core research covers political methodology, specifically survey methodology, computer vision, and causal inference. Substantively, she focuses on public opinion, participation, and psychological traits. She holds a PhD in Political Science and an AM in Statistics from Washington University in St. Louis.
CHAPTER 1. INTRODUCTION
Social scientists have a strong interest in regression-style specifications wherein variation in a collection of explanatory variables on the righthand side of an equality is claimed to model variation in a single outcome variable on the left-hand side. These are “semicausal” statements since the placement of variables implies an order of effects. This is a powerful tool for investigating how interrelated observed variables affect some outcome of interest and the relative strength of each of their contributions, as well as the quality of the overall specification. Most of these individual specifications are imported from the history of statistics, but social science methodologists are adept at creating new models as well. The result of these efforts is a large collection of regression-style tools that are readily available in modern software and relatively easy to use in practice. Unfortunately, in the description and conceptualization of associated regression approaches, specific techniques are often unnecessarily treated as distinct and particular. This is certainly true of a class of models that include logit and probit regression, truncated distribution models, event count models, probability outcome models, and the basic linear model. All of these (and more) are actually special cases of the generalized linear model: a single methodology for producing model-based parameter estimates that provide evidence for the strength of the relationship between a collection of explanatory variables and an outcome variable of interest. After the necessary prerequisites, a typical social science graduate methodological education starts with learning the linear model, followed by an introduction to discrete choice models, survival models, counting models, measurement models, and more. This leads to a very compartmented and necessarily limited view of the world. It also means that a multitude of special procedures, specifications, and diagnostics must be learned separately. Rather than see these tools as completely distinct, the approach taken in this volume is to see these approaches as special cases of one generalized procedure. This means that the material herein can be ingested in isolation or as a complement to traditional texts. This monograph explains and demonstrates a unified approach to applying regression models in the social sciences. Once the general framework is understood, then the appropriate choice of model configuration is determined simply by the structure of the outcome variable and
the nature of the dispersion of data points around the fitted model. This process not only leads to a better understanding of the theoretical basis of the model but also increases the researcher’s flexibility with regard to new data types. The standard linear model posits the Gauss-Markov assumptions, including that the error component be distributed independently, with mean zero and constant variance, to explain variance in unbounded and continuous outcomes. Although the linear model is robust to minor deviations from the Gauss-Markov assumptions, serious errors of estimation can occur with outcomes that are discrete, bounded, or possessing other special characteristics. The basic principle behind the generalized linear model is that the systematic component of the linear model can be transformed to create an analytical framework that closely resembles the standard linear model but accommodates a much wider variety of outcome variables. To achieve this, generalized linear models employ a “link function,” which defines the relationship between the linear systematic component of the model and outcomes that are not necessarily unbounded and continuous. This provides researchers with a rich class of regression-style models for the complex and nuanced type of data that social scientists routinely encounter. To unify seemingly diverse probabilistic forms, the generalized approach first recasts common probability density functions and probability mass functions into a consolidated exponential family form. This facilitates the development of a more rigorous and thorough theoretical treatment of the principles underlying transformationally developed linear models. The first unifying treatment by Nelder and Wedderburn (1972) demonstrated that an understanding of the results from applied statistical work can be greatly enhanced by this further development of the general theory. The approach taken here emphasizes the theoretical foundations of the generalized linear model rather than a laundry list of applications. As a result, most of the effort is spent on the mathematical statistical theory that supports this construct. Several familiar distributions are developed as examples, but the emphasis on theory means that readers will be able to develop an appropriate generalized linear model specification for their own data-analytic applications.
Model Specification George Box (1979) famously stated, “All models are wrong. Some are useful,” and John von Neumann (1947) asserted that the “truth is
much too complicated to allow anything but approximations.”¹ These giants of the 20th century were merely noticing that developing statistical models is a simplification of information provided by nature that cannot possibly fully reflect the complexity of the complete data-generating process. We also do not want it to do so; what we want is a data reduction process that highlights the important underlying processes in a parsimonious manner. Accordingly, model development is a process of determining what features of the data are important and what features need not be reported. This activity focuses on determining which explanatory variables to include and which to ignore, positing a mathematical and probabilistic relationship between the explanatory variables and the outcome variable, and establishing some criteria for success. Model specification and implementation produce summary statistics that are hopefully sufficient in the statistical and colloquial sense for unknown population parameters of interest.

Model specification is a combination of art and science in that a huge number of possible specifications can be developed from even a modest set of factors. For example, 30 available explanatory variables produces more than a billion possible right-hand side specifications: 2^30 = 1,073,741,824, and that is without lags, interactions, levels, or other enhancements. Now consider, for example, that the American National Election Study 2012 Direct Democracy Study has 1,037 variables. By the same calculation, if each is equally useful as an explanatory variable (which is clearly not true), then there are 2^1,037 ≈ 10^312 possible models, compared to “only” 10^24 known stars in the entire universe (the latter is an approximation due to time-distance). Generally, the researcher has a theoretical justification for some subset of specifications, and in most fields, there are conventions about variable inclusion.

At the foundation of this process is a trade-off between parsimony and fit. Specifying parsimonious models is efficient in that less-important effects are ignored. Such models are often highly generalizable because the conditions of applicability are more easily obtained (wide scope). The more simple the model becomes (the extreme case is describing all outcome variable behavior by the mean μ), the greater the probability that the error term contains important systematic information, holding all other considerations constant. The more complex we make the model, the more we confuse inferences from underlying effects since collections of explanatory variables are never independent in the social sciences and overlapping explanatory power will be unevenly distributed across competing coefficients.
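These magnitudes are easily verified; the following minimal R sketch simply reproduces the arithmetic, using the variable counts just discussed:

    2^30                 # 1073741824 possible right-hand-side specifications
    # 2^1037 overflows double precision, so compare magnitudes on the log10 scale
    1037 * log10(2)      # about 312, so 2^1,037 is roughly 10^312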
We can develop a model that is completely correct although limited in its ability to describe the underlying structure of the data. Such a model, called saturated or full, is essentially a set of parameters equal to the number of data points, each indexed by an indicator function. So every parameter is exactly correct since it perfectly describes the location of an observed data point. However, this model provides no data reduction and has limited inferential value.² Saturated models are tremendously useful heuristic devices that allow us to benchmark hypothesized model specifications (cf. Lindsey, 1997, pp. 214–215; Neter, Kutner, Nachtsheim, & Wasserman, 1996, pp. 586–587). Later it will be shown that the saturated model is required to create a residual-like deviance for assessing the quality of fit for a tested specification. Typical statistical models differ from saturated models in that they are an attempt to reduce the size and complexity of an observed set of data down to a small number of summary statistics. These models trade certainty for greater simplicity by making inferential claims about underlying population values. The estimated parameter values from this procedure are quite literally wrong, but by providing the associated level of uncertainty, the degree of reliability is assessed.

In general, the purpose of model specification is to develop Ŷ, a set of fitted values from the model that closely resemble the observed outcome variable values, Y. We do this by estimating the relative mathematical importance of this fit for each of the explanatory variables with a set of coefficients, θ. The closer Ŷ is to Y, the more we feel that our model accurately describes reality. However, this goal is not supreme, or we would simply be content with the saturated model. Thus, a good model balances the competing objectives of parsimony and fit.

Generalized linear models do not differ in any important way from regular linear models in terms of the process of model specification except that a link function is included to accommodate noncontinuous and possibly bounded outcome variables. Therefore, all of the admonitions about the dangers of data mining, stargazing, inverse probability misinterpretation, and probabilistic theory confirmation apply (Gill, 1999; Greenwald, 1975; Leamer, 1978; Lindsay, 1995; Miller, 1990; Rozeboom, 1960). It is also important to be aware that a single dataset can lead to many perfectly plausible model specifications and subsequent substantive conclusions (Raftery, 1995).

Some important restrictions should be noted. Generalized linear models still require uncorrelated cases as does the linear model, although there is still mild robustness to this assumption. Time-series and spatial problems can be accommodated but not without additional and sometimes complicated enhancements. Also, there can be only one error term specified in the model. While the distribution of this error term is no longer required to be asymptotically normal with constant variance (as in the linear model), approaches such as cell means models with “stacked” error terms are excluded in the basic framework. Finally, generalized linear models are inherently parametric in that the form of the likelihood function is completely defined by the researcher. Relaxation of this requirement is done with quasi-likelihood here and can also be done through the use of smoothers, which leads to the more flexible but more complicated form referred to as generalized additive models (Hastie & Tibshirani, 1986, 1990; Wood, 2006).
Prerequisites and Preliminaries This section describes some basic knowledge that is required to proceed in the text.
Probability Distributions

Distributions of random variables are described by their probability mass functions (discrete case, PMF) or their probability density functions (continuous case, PDF). Probability mass functions and probability density functions are just probabilistic statements (probability functions specifically) about the distribution of some random variable, Y, over a defined range (support), typically conditional on parameters. In the discrete case, it is denoted p(Y = y) for the probability that the random variable Y takes on some realization y, and in the continuous case, we just use f(y). If the random variable is conditioned on known or unknown terms, then it is common to be explicit about this relationship in the notation. For example, the distribution of a normal random variable is conditioned on the population mean (μ) and variance (σ²) and is thus denoted f(y|μ, σ²).

Example 1.1: Uniform Distributions Over the Unit Interval

A variable that is uniformly distributed over [0, 1] has the following probability function:

k-Category Discrete Case (PMF):

$$p(Y = y) = \begin{cases} \dfrac{1}{k}, & \text{for } y = 1, 2, \ldots, k \\ 0, & \text{otherwise} \end{cases}$$

Continuous Case (PDF):

$$f(y) = \begin{cases} \dfrac{1}{b-a}, & \text{for } a = 0 \le y \le b = 1 \\ 0, & \text{otherwise} \end{cases} \qquad (1.1)$$
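To make Example 1.1 concrete, a minimal R sketch (the number of categories, the sample sizes, and the seed are arbitrary choices, not part of the example) draws from both versions and confirms that the stated probability functions are proper:

    set.seed(1)
    k <- 6                                   # a k-category discrete uniform
    sum(rep(1 / k, k))                       # the PMF sums to 1
    draws_discrete   <- sample(1:k, size = 1000, replace = TRUE)
    draws_continuous <- runif(1000, min = 0, max = 1)
    # the PDF in Equation 1.1 integrates to 1 over [0, 1]
    integrate(dunif, lower = 0, upper = 1, min = 0, max = 1)$value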
There is no ambiguity here; a uniform random variable over [0, 1] can be discrete, or it can be continuous. A discrete example is the coding of the outcome of the toss of a fair coin, and a continuous example is perhaps the unconditional probability of a judicial decision. The essential requirement is that the specified PMF or PDF describe the characteristics of the data generation process: bounds and differences in probabilities. The uniform distribution is useful when describing equal probability events over some known range ([0, 1] in this case).

There are some mathematical necessities required for probability functions to be well defined. Probability distributions must be defined with regard to some measure (i.e., specified over some measure space). This means that a probability function has no meaning except with regard to a measure of the space to which it is applied so that there is some structure on the set of outcomes. Define a σ-algebra as a class of outcomes that includes (1) the full sample space (all possible outcomes), (2) the complement of any included outcome, and (3) the property that the union of any countable collection of outcomes is also included. A measure is a function that assigns nonnegative values to outcomes and collections of outcomes in the σ-algebra. The classic example of a measure is the Lebesgue measure: some specified k-dimensional finite Euclidean space in which specific subregions can be uniquely identified. Another germane measure is the counting measure, which is simply the set of integers from zero to infinity or some specified limit. Thus, these measures characterize the way outcomes are treated probabilistically. Greatly simplified, a probability function on a given measure has the requirements that something must happen with probability 1, nothing happens with probability 0, and the sum of the probability of disjoint events is equal to the probability of the union of these events. Furthermore, probabilities are bounded by zero and 1 over this measure, and any event outside of the measure has probability zero of occurrence. This theoretical “tidying up” is necessary to avoid pathologies such as negative probabilities and incomplete sample spaces. It is also required that PMFs sum to 1 and PDFs integrate to 1 (thus termed proper). Violation of this stipulation is equivalent to saying that the probability function uniformly underestimates or overestimates the probability of occurrences.

We will often talk of a family of distributions to indicate that parameterizations alter the characteristics of the probability function. For instance, the Gaussian-normal family of distributions is a familiar set of unimodal symmetric distributions that vary by location, determined by μ, and dispersion or scale, determined by σ². The idea of a family is very useful because it reminds us that these are mathematically similar forms that change only by altering specified parameter values. In particular, we focus on the exponential family of distributions in the development of generalized linear models.
The Linear Model

Generalized linear models provide a way to analyze the effects of explanatory variables in a way that closely resembles that of analyzing covariates in a standard linear model, except that the assumptions are far less confining. It is assumed that the reader is familiar with the multiple regression linear model in matrix notation summarized by

Y = Xβ + ε,  (1.2)

where Y is an n × 1 column vector containing the outcome variable, X is an n × k matrix of explanatory variables with rank k and a leading column vector of ones, β is a k × 1 column vector of coefficients to be estimated, and ε is an n × 1 column vector of disturbances. On the right-hand side, Xβ is called the systematic component, and ε is called the stochastic component. As the name implies, generalized linear models are built on the framework of the classic linear model, which dates back to the 18th century (Gauss and Legendre). The linear model, as elegant as it is, requires a seemingly strict set of assumptions. The Gauss-Markov theorem states that if

1. Functional form: Y(n×1) = X(n×k) β(k×1) + ε(n×1)
2. Mean zero errors: E[ε] = 0
3. Homoscedasticity: Var[ε] = σ²I
4. Noncorrelated errors: Cov[εᵢ, εⱼ] = 0, ∀ i ≠ j
5. Exogeneity of explanatory variables: Cov[εᵢ, X] = 0, ∀ i,

then the solution produced by selecting coefficient values that minimize the sum of the squared residuals (or maximize the corresponding likelihood function) is unbiased and has the lowest total variance among unbiased linear alternatives. Note that every one of these assumptions above has ε in it, meaning that these are assumptions about the underlying population values. The first two Gauss-Markov assumptions are
eliminated with the basic generalized linear model approach, and the third can be relaxed with more advanced forms. However, the dependence of the variance on the mean function must be known (except in the extension based on quasi-likelihood functions to be described). In addition, normality of the residuals or the outcome variable is not a Gauss-Markov assumption (even though some texts state that it is), but more accurately, we get normality with increasing sample size for the residuals vector: ε|X ∼ N(0, σ²I). There are three requirements, as opposed to assumptions:

1. Conformability of matrix/vector objects: All dimensions match for linear algebra calculations.
2. Explanatory variable matrix: X has full rank k, so X′X is invertible (nonzero determinant, nonsingular).
3. Identification condition: Not all points lie on a vertical line, for each Xⱼ versus Y scatterplot.

These requirements are not theoretically interesting, but they are necessary for the mechanics of linear regression to work. Finally, there are two features of the linear model, which we will collectively call toughness (our term, not a part of the literature):

1. Robustness: Minor violations of the Gauss-Markov assumptions can still produce unbiased and efficient estimates.
2. Resistance: Even substantial outliers (high leverage) can be tolerated unless they are far from the mean of that explanatory variable (high influence).

Obviously, there are many more theoretical and practical considerations for the linear model, but these are sufficient to allow us to move on.
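As a point of reference for what follows, a minimal R sketch (simulated data with arbitrary coefficient values) shows the least squares solution that the Gauss-Markov result refers to, both in closed form and as computed by lm():

    set.seed(99)
    n  <- 200
    x1 <- rnorm(n); x2 <- rnorm(n)
    X  <- cbind(1, x1, x2)                        # leading column of ones
    beta <- c(2, -1, 0.5)                         # arbitrary "true" coefficients
    y  <- as.numeric(X %*% beta + rnorm(n))       # mean-zero, constant-variance errors
    # closed-form least squares solution, (X'X)^{-1} X'y
    solve(t(X) %*% X) %*% t(X) %*% y
    # the same estimates from lm()
    coef(lm(y ~ x1 + x2))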
Linear Algebra and Calculus As one easily sees from the above discussion of the linear model, specifications and results will be discussed in matrix notation, but no linear algebra beyond Gill (2006, chap. 3) or the first half of an introductory undergraduate linear algebra text is required. Some basic calculus knowledge is quite helpful to understanding the theoretical underpinnings of the generalized linear model. This monograph assumes familiarity with calculus at roughly the level of Kleppner and Ramsey’s (1985) Quick Calculus, Gill (2006, chap. 5), or more generally a
first-semester undergraduate course. The discussion remains useful to someone without calculus knowledge with the sole exception that some of the derivations would be difficult to follow.
Software Understanding generalized linear models is not very useful without the means of applying this understanding in practice. Consequently, software, scripts, supporting documentation, data for the examples, and some extended mathematical derivations are available freely at both of the authors’ web pages (links provided at sagepub.com/gill2e) and in the GLMpack package at cran.wustl.edu, as well as at the dedicated page at dataverse.org. While these resources include several popular software packages, our focus is on providing model code and data handling in the R language, although no R code appears in the printed text. Generalized linear models are so essential to standard empirical work in the social sciences that every major statistical software package contains the basic forms. Originally, GLIM (Generalized Linear Interactive Modeling; Baker & Nelder, 1978) was the only package that supported generalized linear models and incorporated the associated numerical technique: iterative weighted least squares. However, virtually every popular package now has appropriate routines. Nonetheless, the effect that GLIM had on the development of generalized linear models was enormous. Since widespread use of the latest version, GLIM 4, has faded considerably, software support is not included in our supplied resources.
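For orientation only, the basic form of such a routine in R is a single function call; this is a schematic sketch, and the data frame mydata and the variable names in it are hypothetical placeholders rather than one of the book's examples:

    # 'counts', 'x1', 'x2', and 'mydata' are hypothetical placeholders
    fit <- glm(counts ~ x1 + x2, family = poisson(link = "log"), data = mydata)
    summary(fit)   # coefficients, standard errors, and deviance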
Looking Forward The plan of this monograph is as follows: We begin with a detailed discussion of the exponential family of distributions. This is important because the basic setup of generalized linear models applies only to parametric forms that fit this category. Next, the likelihood function for the common exponential family is derived, and from this we produce the mean and variance functions. The unified approach is apparent here because regardless of the original form of the probability function, the moments are derived in exactly the same manner. The next section introduces the linear structure and the link function, which allows the generalization to take place. This idea represents the core of the theory. The computational estimation procedure for generalized linear models is introduced, residuals and model fit are discussed in detail, and finally
we review common extensions to the GLM framework. Throughout, examples are provided as practical applications of the generalized linear model to actual data. The examples we provide cover different data settings and structures that are commonly used by social scientists. The examples are developed throughout the different chapters to clarify and illustrate the concepts covered in each of them.
CHAPTER 2. THE EXPONENTIAL FAMILY
The development of the theory of the generalized linear model is based upon the exponential family of distributions. This formalization recharacterizes familiar functions into a formula that is more useful theoretically and demonstrates similarity between seemingly disparate mathematical forms. The name refers to the manner in which all of the terms in the expression for these PDFs and PMFs are moved into the exponent to provide common notation. This does not imply some restrictive relationship with the well-known exponential probability density function. Quasi-likelihood models (McCullagh, 1983; Wedderburn, 1974), which we describe later, replace this process and only require the stipulation of the first two moments. This allows the separation of the mean and variance functions, and the estimation is accomplished by employing a quasi-likelihood function. This approach has the advantage of accommodating situations in which the data are found or assumed not to be independent and identically distributed (henceforth iid).
Justification

Fisher (1934) developed the idea that many commonly applied probability mass functions and probability density functions are really just special cases of a more general classification he called the exponential family. The basic idea is to identify a general mathematical structure to the function in which uniformly labeled subfunctions characterize individual differences. The label “exponential family” comes from the convention that subfunctions are contained within the exponent component of the natural exponential function (i.e., the irrational number e = 2.718281... raised to some specified power). This is not a rigid restriction as any subfunction that is not in the exponent can be placed there by substituting its natural logarithm.

The primary payoff to reparameterizing a common and familiar function into the exponential form is that the isolated subfunctions quite naturally produce a small number of statistics that compactly summarize even large datasets without any loss of information. Specifically, the exponential family form readily yields sufficient statistics for the unknown parameters. A sufficient statistic for some parameter is one that contains all the information available in a given dataset about that parameter. For example, if we are interested in estimating the true range, [a, b], for some uniformly distributed random variable: Xᵢ ∈ [a, b] ∀Xᵢ, then a sufficient statistic is the vector containing the first and last order statistics: [x₍₁₎, x₍ₙ₎] from the sample of size n (i.e., the smallest and largest of the sampled values). No other elements of the data and no other statistic that we could construct from the data would provide further information about the limits. Therefore, [x₍₁₎, x₍ₙ₎] provides “sufficient” information about the unknown parameters from the given data. A brief numerical illustration of this sufficiency property appears at the end of this section.

It has been shown (Barndorff-Nielsen, 1978, p. 114) that exponential family probability functions have all of their moments. The nth moment of a random variable about an arbitrary point, a, is μₙ = E[(X − a)ⁿ], and if a is equal to the expected value of X, then this is called the nth central moment. The first moment is the arithmetic mean of the random variable X, and the second moment along with the square of the first can be used to produce the following variance: Var[X] = E[X²] − E[X]². While we are often interested only in the first two moments, the infinite moment property is very useful in assessing higher-order properties in more complex settings. In general, it is straightforward to calculate the moment generating function and the cumulant generating function for exponential family forms. These are simply functions that provide any desired moment or cumulant (logged moments) with quick calculations.

Throughout this monograph, we describe in detail the form, elements, characteristics, and examples of the most common probability density functions: Poisson, binomial, normal, gamma, negative binomial, and multinomial. Extensions and adaptations are more briefly described. Two important classes of probability density functions are not members of the exponential family and therefore are not featured in this volume. The Student’s t and the uniform distribution cannot be put into the form of Equation 2.1. Also, in general, a probability function in which the parameterization is dependent on the bounds, such as the uniform distribution, is not a member of the exponential family. Even if a probability function is not an exponential family member, it can sometimes qualify under particular circumstances. The Weibull probability density function (useful for modeling failure times), f(y|γ, β) = (γ/β)y^(γ−1) exp(−y^γ/β) for y ≥ 0, γ, β > 0, is not an exponential family form since it cannot be rewritten in the required form of Equation 2.2. However, if γ is known (or we are willing to assign a fixed value), then the Weibull PDF reduces to an exponential family form. In the final chapter, we provide a brief introduction and description of other less common extensions to exponential forms that are designed to deal with certain data challenges. These include quasi-likelihood forms, zero-inflated models, generalized linear mixed-effects models, fractional regression, and Tobit models. Some widely used members of the exponential family that facilitate generalized linear models but are not discussed here include beta, curved normal, Dirichlet, Pareto, and inverse gamma. The theoretical focus of this monograph is intended to provide readers with an understanding necessary to successfully encounter these and other distributional forms.
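As promised above, here is a brief numerical illustration of the sufficiency of the extreme order statistics for uniform bounds, along with the variance identity from the moments discussion; the bounds 2 and 7, the sample size, and the seed are arbitrary choices for this sketch:

    set.seed(7)
    x <- runif(500, min = 2, max = 7)       # uniform draws with bounds to be estimated
    c(min(x), max(x))                       # the order statistics [x_(1), x_(n)]
    # the moment identity Var[X] = E[X^2] - E[X]^2, checked on the sample
    mean(x^2) - mean(x)^2
    var(x) * (length(x) - 1) / length(x)    # matches the line above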
Derivation of the Exponential Family Form

Suppose we consider a one-parameter conditional probability density function or probability mass function for the random variable Z of the form f(z|ζ). This is read as “f of z given zeta.” This function or, more specifically, this family of PDFs or PMFs is classified as an exponential family if it can be written in the following form:

$$f(z \mid \zeta) = \exp\left[t(z)u(\zeta)\right] r(z)\, s(\zeta), \qquad (2.1)$$

where r and t are real-valued functions of z that do not depend on ζ, s and u are real-valued functions of ζ that do not depend on z, and r(z) > 0, s(ζ) > 0 ∀z, ζ. Furthermore, Equation 2.1 can easily be rewritten according to

$$f(z \mid \zeta) = \exp\Big[\underbrace{t(z)u(\zeta)}_{\text{interaction component}} + \underbrace{\log(r(z)) + \log(s(\zeta))}_{\text{additive component}}\Big]. \qquad (2.2)$$
The second part of the right-hand side of the equation is labeled the “additive component” because the summed components are distinct and additive with regard to z and ζ. The first part of the right-hand side is labeled the “interaction component” because it is reminiscent of the interaction specification of two parameters in a standard linear model. In other words, it is the component that reflects the product-indistinguishable relationship between z and ζ. It should be noted that the interaction component must specify t(z)u(ζ) in a strictly multiplicative manner. So a term such as −(1/β)y^γ, as seen in the exponent of the Weibull PDF, disqualifies this PDF from the exponential family classification. In addition, the exponential structure of Equation 2.2 is preserved under random sampling such that the joint density function of independent, identically distributed (iid) random variables Z = {Z₁, Z₂, ..., Zₙ} is given by the following:

$$f(\mathbf{z} \mid \zeta) = \exp\left[u(\zeta)\sum_{i=1}^{n} t(z_i) + \sum_{i=1}^{n}\log(r(z_i)) + n\log(s(\zeta))\right]. \qquad (2.3)$$
This means that the joint distribution of a systematic random sample of variates with exponential family marginal distributions is also an exponential family form. While the following chapters develop the theory of generalized linear models with Equation 2.2 for simplicity, the joint density function, Equation 2.3, is the more appropriate form since multiple data are used in all practical work. Fortunately, there is no loss of generality since the joint density function is also an exponential family form. If it makes the exposition easier to follow, picture Equation 2.2 with subscript i as an index of the data: f (zi |ζ ) = exp [t(zi )u(ζ ) + log(r(zi )) + log(s(ζ ))].
Canonical Form

The canonical form is a handy simplification that greatly facilitates moment calculations as shown in Chapter 3. It is a one-to-one transformation (i.e., the inverse function of this function returns the same unique value) of terms of the probability function that reduces the complexity of the symbolism and reveals structure. It turns out to be much easier to work with an exponential family form when the format of the terms in the function says something directly about the structure of the data. If t(z) = z in Equation 2.2, then we say that this PDF or PMF is in its canonical form for the random variable Z. Otherwise, we can make the simple transformation y = t(z) to force a canonical form. Similarly, if u(ζ) = ζ in Equation 2.2, then this PDF or PMF is in its canonical form for the parameter ζ. Again, if not, we can force a canonical form by transforming θ = u(ζ) and call θ the canonical parameter. In many cases, it is not necessary to perform these transformations as the canonical form already exists or the transformed functions are tabulated for various exponential families of distributions. The final form after these transformations is the following general expression:

$$f(y \mid \theta) = \exp\left[y\theta - b(\theta) + c(y)\right]. \qquad (2.4)$$
Note that the only term with both y and θ is a multiplicative term. McCullagh and Nelder (1989, p. 30) call b(θ) the “cumulant function,” but b(θ) is also often called a “normalizing constant” because it is the only nonfunction of the data and can therefore be manipulated to ensure that Equation 2.4 sums or integrates to 1. This is a minor point here as all of the commonly applied forms of Equation 2.4 are well behaved in this respect. More important, b(θ) will play a key role in calculating the moments of the distribution. In addition, the form of θ, the canonical link between the original form and the θ-parameterized form, is also important. The canonical link is used to generalize the linear model by connecting the linear-additive component to the nonnormal outcome variable. The form of Equation 2.4 is not unique in that linear transformations can be applied to exchange values of y and θ between the additive component and the interaction component. In general, however, common families of PDFs and PMFs are typically parameterized in a standard form that minimizes the number of interaction terms. Also, it will sometimes be helpful to use Equation 2.4 expressed as a joint distribution of the data, particularly when working with the likelihood function (Chapter 3). This is just

$$f(\mathbf{y} \mid \theta) = \exp\left[\sum_{i=1}^{n} y_i\theta - n\, b(\theta) + \sum_{i=1}^{n} c(y_i)\right]. \qquad (2.5)$$
The canonical form is used in each of the developed examples in this monograph. There is absolutely no information gained or lost by this treatment; rather, the form of Equation 2.5 is an equivalent form to Equation 2.3 where certain structures such as θ and b(θ) are isolated for theoretical consideration. As will be shown, these terms are the key to generalizing the linear model.

To add more intuition to the exponential family form Equation 2.5, consider how likelihood is constructed and used. A likelihood function is just the joint distribution of an observed set of data under the iid assumption for a given PDF or PMF: f(X|θ) = f(X₁|θ) × f(X₂|θ) × ··· × f(Xₙ|θ). Fisher’s notational sleight of hand was to note that once we observe the data, they are known even if θ is unknown. So notationally, we use L(θ|X) for this product since we want the unknown value of θ that is most likely to have generated X. Returning to Equation 2.5, we see that there are three subcomponents of the function: one that interacts with the data and the parameter, one that is only a function of θ (multiplied by n, however), and finally one that is a function of the data only. If we care about the θ that most likely generated the data, then the latter is not going to be consequential. The first subcomponent shows how different values of y weight θ in the calculation, and b(θ) is a parametric statement about how we should treat θ in the maximum likelihood estimation process. In combination, these two demonstrate how data and distributional assumptions determine the final model that we produce. Obviously, some of these statements are slightly vague because we have yet to derive how the components of the exponential family form produce parameter estimates and predicted values.
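As a preview of how this form works for a concrete family member, the following sketch checks Equation 2.5 for Poisson data, using the canonical quantities derived in Example 2.1 below (θ = log(μ), b(θ) = exp(θ), c(y) = −log(y!)); the value of μ, the sample size, and the seed are arbitrary:

    set.seed(42)
    mu <- 3.2
    y  <- rpois(25, lambda = mu)
    theta <- log(mu)                        # canonical parameter for the Poisson
    prod(dpois(y, lambda = mu))             # joint density as a product of PMFs
    # the same joint density via Equation 2.5 with b(theta) = exp(theta), c(y) = -log(y!)
    exp(sum(y) * theta - length(y) * exp(theta) - sum(lfactorial(y)))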
Multiparameter Models

Up until now, only single-parameter forms have been presented. If generalized linear models were confined to single-parameter density functions, they would be quite restrictive. Suppose now that there are k parameters specified. A k-dimensional parameter vector, rather than just a scalar θ, is now easily incorporated into the exponential family form of Equation 2.4:

$$f(y \mid \boldsymbol{\theta}) = \exp\left[\sum_{j=1}^{k} y\theta_j - b(\theta_j) + c(y)\right]. \qquad (2.6)$$
Here the dimension of θ can be arbitrarily large but is often as small as two, as in the normal (θ = {μ, σ 2 }) or the gamma (θ = {α, β}). In the following examples, several common probability functions are rewritten in exponential family form with the intermediate steps shown (for the most part). It is actually not strictly necessary to show the process since the number of PDFs and PMFs of interest is relatively small. However, there is great utility in seeing the steps both as an instructional exercise and as a starting point for other distributions of interest not covered herein. Also, in each case, the b(θ ) term is derived. The importance of doing this will be apparent in Chapter 3. Example 2.1: Poisson Distribution Exponential Family Form The Poisson distribution is often used to model counts such as the number of arrivals, deaths, or failures in a given time period. The Poisson distribution assumes that for short time intervals, the probability of an arrival is fixed and proportional to the length of the interval. It is indexed by only one (necessarily positive) parameter, which is both the mean and variance.
Given the random variable, Y, distributed Poisson with expected number of occurrences per interval μ, we can rewrite the familiar Poisson PMF in the following manner:

$$f(y \mid \mu) = \frac{e^{-\mu}\mu^{y}}{y!} = \exp\Big[\underbrace{y\log(\mu)}_{y\theta} - \underbrace{\mu}_{b(\theta)} - \underbrace{\log(y!)}_{c(y)}\Big].$$
In this example, the three components from Equation 2.4 are labeled by the underbraces. The interaction component, y log(μ), clearly identifies θ = log(μ) as the canonical link. Also, b(θ) is simply μ. Therefore, the b(θ) term parameterized by θ (i.e., the canonical form) is obtained by taking the inverse of θ = log(μ) to solve for μ. This produces μ = b(θ) = exp(θ). Obviously, the Poisson distribution is a simple parametric form in this regard.

Example 2.2: Binomial Distribution Exponential Family Form

The binomial distribution summarizes the outcome of multiple binary outcome (Bernoulli) trials such as flipping a coin. This distribution is particularly useful for modeling counts of successes or failures given a number of independent trials such as votes received given an electorate, international wars given country dyads in a region, or bankruptcies given company starts. Suppose now that Y is distributed binomial(n, p), where Y is the number of “successes” in a known number of n trials given a probability of success p. We can rewrite the binomial PMF in exponential family form as follows:³

$$f(y \mid n, p) = \binom{n}{y} p^{y} (1-p)^{n-y} = \exp\left[\log\binom{n}{y} + y\log(p) + (n-y)\log(1-p)\right]$$
$$= \exp\Big[\underbrace{y\log\Big(\frac{p}{1-p}\Big)}_{y\theta} - \underbrace{\big(-n\log(1-p)\big)}_{b(\theta)} + \underbrace{\log\binom{n}{y}}_{c(y)}\Big].$$
From the first term in the exponent, we can see that the canonical link for the binomial distribution is θ = log(p/(1 − p)), so substituting the inverse of the canonical link function into b(θ) produces (with modest algebra) the following:

$$b(\theta) = \big[-n\log(1-p)\big]\Big|_{\theta=\log\left(\frac{p}{1-p}\right)} = n\log\left(1 + \exp(\theta)\right).$$
So the expression for the b(θ) term in terms of the canonical parameter is b(θ) = n log(1 + exp(θ)). In this example, n was treated as a known quantity or simply ignored as a nuisance parameter. Suppose instead that p was known and we developed the exponential family PMF with n as the parameter of interest:

$$f(y \mid n, p) = \exp\left[\log\binom{n}{y} + y\log(p) + (n-y)\log(1-p)\right] = \exp\left[\log(n!) - \log((n-y)!) - \log(y!) + \ldots\right]. \qquad (2.7)$$

However, we cannot separate n and y in log((n − y)!) and they are not in product form, so this is not an exponential family PMF in this context.

Example 2.3: Normal Distribution Exponential Family Form

The normal distribution is without question the workhorse of social science data analysis. Given its simplicity in practice and well-understood theoretical foundations, this is not surprising. The linear model (typically estimated with ordinary least squares [OLS]) is based on normal distribution theory, and as we shall see in Chapter 4, this comprises a very simple special case of the generalized linear model. Often, we need to explicitly treat nuisance parameters instead of ignoring them or assuming they are known as was done in the binomial example above. The most important case of a two-parameter exponential family is when the second parameter is a scale parameter. Suppose ψ is such a scale parameter, possibly modified by the function a(ψ), and then Equation 2.4 is rewritten:

$$f(y \mid \theta) = \exp\left[\frac{y\theta - b(\theta)}{a(\psi)} + c(y, \psi)\right]. \qquad (2.8)$$
The Gaussian normal distribution fits this class of exponential families. The subclass is called a location-scale family and has the attribute that it is fully specified by two parameters: a centering or location parameter and a dispersion parameter. It can be rewritten as follows:

f(y|μ, σ²) = (1/√(2πσ²)) exp[ −(1/(2σ²))(y − μ)² ]
           = exp[ −(1/2) log(2πσ²) − (1/(2σ²))(y² − 2yμ + μ²) ]
           = exp[ (yμ − μ²/2)/σ² − (1/2)( y²/σ² + log(2πσ²) ) ],

where a(ψ) = σ², yθ = yμ, b(θ) = μ²/2, and c(y, ψ) = −(1/2)( y²/σ² + log(2πσ²) ).
Note that the μ parameter (the mean) is already in canonical form (θ = μ), so b(θ) is simply b(θ) = θ²/2. This treatment assumes that μ is the parameter of interest and σ² is the nuisance parameter, but we might want to look at the opposite situation. However, in this treatment, μ is not considered a scale parameter. Treating σ² as the variable of interest produces

f(y|μ, σ²) = exp[ −(1/2) log(2πσ²) − (1/(2σ²))(y² − 2yμ + μ²) ]
           = exp[ (1/σ²)(yμ − y²/2) + ( −(1/2) log(2πσ²) − μ²/(2σ²) ) ],

where 1/σ² plays the role of θ, z = yμ − y²/2 is the data term, and the final parenthetical term is b(θ). Now the canonical link is θ = 1/σ². So σ² = θ^(−1), and we can calculate the new b(θ):

b(θ) = −(1/2) log(2πσ²) − μ²/(2σ²) = −(1/2) log(2π) + (1/2) log(θ) − (1/2) μ²θ.

Example 2.4: Gamma Distribution Exponential Family Form

The gamma distribution is particularly useful for modeling terms that are required to be nonnegative such as variance components. Furthermore, the gamma distribution has two important special cases: the χ² distribution is gamma(ρ/2, 1/2) for ρ degrees of freedom, and the exponential distribution is gamma(1, β), both of which arise quite often in applied settings.
Assume Y is now distributed gamma indexed by two parameters: the shape parameter and the inverse-scale (rate) parameter. The gamma distribution is most commonly written as

f(y|α, β) = (1/Γ(α)) β^α y^(α−1) e^(−βy),    y, α, β > 0.

For our purposes, a more convenient form is produced by transforming α = δ, β = δ/μ. The exponential family form of the gamma is produced by

f(y|μ, δ) = (1/Γ(δ)) (δ/μ)^δ y^(δ−1) exp(−δy/μ)
          = exp[ δ log(δ) − δ log(μ) − log(Γ(δ)) + (δ − 1) log(y) − δy/μ ]
          = exp[ ( −y/μ − log(μ) ) / (1/δ) + δ log(δ) + (δ − 1) log(y) − log(Γ(δ)) ],

where θy = −y/μ, b(θ) = log(μ), a(ψ) = 1/δ, and the remaining terms form c(y, ψ).
From the first term in the last equation above, the canonical link for the gamma family variable μ is θ = −1/μ. So b(θ) = log(μ) = log(−1/θ) with the restriction θ < 0. Therefore, b(θ) = −log(−θ).

Example 2.5: Negative Binomial Distribution Exponential Family Form

The binomial distribution measures the number of successes in a given number of fixed trials, whereas the negative binomial distribution measures the number of failures before the rth success.4 An important application of the negative binomial distribution is in survey research design. If the researcher knows the value of p from previous surveys, then the negative binomial can provide the number of subjects to contact to get the desired number of responses for analysis. If Y is distributed negative binomial with success probability p and a goal of r successes, then the PMF in exponential family form is produced by

f(y|r, p) = (r + y − 1 choose y) p^r (1 − p)^y
          = exp[ y log(1 − p) + r log(p) + log(r + y − 1 choose y) ],

where y log(1 − p) is the yθ term, r log(p) supplies b(θ), and log(r + y − 1 choose y) is c(y).
The canonical link is easily identified as θ = log(1 − p). Substituting this into b(θ) and applying some algebra gives b(θ) = r log(1 − exp(θ)).

Example 2.6: Multinomial Distribution Exponential Family Form

The multinomial distribution generalizes the binomial by allowing more than k = 2 nominal choices or events to occur. The set of possible outcomes for an individual i is a k − 1 length vector of all zeros except for a single 1 identifying the chosen response: Yi = [Yi2, Yi3, . . . , Yi(k)]. The first category k = 1 is left out of this vector and is called the reference (or baseline) category, and all model inferences are comparative to this category. Therefore, an individual picking the reference category will have all zeros in the Yi vector. We want to estimate the k − 1 length vector of category probabilities, g^(−1)(θ) = μ = [π1, π2, . . . , π(k−1)], for a sample size of n, from the dataset consisting of the n × (k − 1) outcome matrix Y and the n × p matrix of p covariates X including a leading column of 1s. The estimates are provided with a logit (or a probit) link function, giving for each of the k − 1 categories the probability that the ith individual picks category r:

p(Yi = r|X) = exp(Xi βr) / ( 1 + Σ_{s=1}^{k−1} exp(Xi βs) ),

where βr is the coefficient vector for the rth category (logit version). For simplicity of notation, consider k = 3 possible outcomes, without loss of generality, and drop the indexing by individuals. If there are n individuals in the data picking from these three categories, then the intuitive PMF that shows similarity to the binomial case is given by

f(Y|n, μ) = [ n! / (Y1! Y2! (n − Y1 − Y2)!) ] π1^Y1 π2^Y2 (1 − π1 − π2)^(n−Y1−Y2)
          = exp[ Y1 log( π1/(1 − π1 − π2) ) + Y2 log( π2/(1 − π1 − π2) )
                 − ( −log(1 − π1 − π2) ) n + log( n! / (Y1! Y2! (n − Y1 − Y2)!) ) ],

where the first two terms form Yθ, the third term supplies b(θ), and the last is c(y).
Note that this exponential form has two-dimensional structure for Y and θ , which is an important departure from the previous examples. The
two-dimensional link function that results from this form is

θ = (θ1, θ2) = g(π1, π2) = ( log( π1/(1 − π1 − π2) ), log( π2/(1 − π1 − π2) ) ).

We can therefore interpret the results in the following way for a single respondent:

θ1 = log( p(choice 1)/p(reference category) ) = log( π1/(1 − π1 − π2) ) = Xi β1
θ2 = log( p(choice 2)/p(reference category) ) = log( π2/(1 − π1 − π2) ) = Xi β2.

With minor algebra, we can solve for the inverse of the canonical link function:

π1 = (1 − π2) exp(θ1)/(1 + exp(θ1))
π2 = (1 − π1) exp(θ2)/(1 + exp(θ2)).

This allows us to rewrite b(θ) in terms of the two-dimensional canonical link function θ:

b(θ) = −log[ 1 − (1 − π2) exp(θ1)/(1 + exp(θ1)) − (1 − π1) exp(θ2)/(1 + exp(θ2)) ],

which reveals multinomial structure in this simplified case.
We have now shown that some of the most useful and popular PMFs and PDFs can easily be represented in the exponential family form. The payoff for this effort is yet to come, but it can readily be seen that if b(θ ) has particular theoretical significance, then isolating it as we have in the θ parameterization is helpful. This is exactly the case as b(θ ) is the engine for producing moments from the exponential family form through some basic likelihood theory. The reparameterization of commonly used PDFs and PMFs into the exponential family form highlights some well-known but not necessarily intuitive relationships between parametric forms. For instance, virtually all introductory statistics texts explain that the normal distribution is the limiting form for the binomial distribution. Setting the first and second derivatives of the b(θ ) function in these forms equal to each other gives the appropriate asymptotic reparameterization: μ = np, σ 2 = np(1 − p).
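These reparameterizations are easy to verify numerically. The following short Python sketch (an added illustration, not part of the original text) confirms that the exponential family forms of the Poisson and binomial PMFs reproduce the standard PMFs; the parameter values are arbitrary and scipy is assumed to be available.

```python
# Minimal numerical check that the exponential family rewritings reproduce
# the ordinary PMFs; distributions and values are arbitrary illustrations.
import numpy as np
from scipy.stats import poisson, binom
from scipy.special import gammaln, comb

def poisson_expfam(y, mu):
    # exp{ y*log(mu) - mu - log(y!) } with theta = log(mu), b(theta) = exp(theta)
    return np.exp(y * np.log(mu) - mu - gammaln(y + 1))

def binomial_expfam(y, n, p):
    # exp{ y*theta - (-n*log(1-p)) + log C(n, y) } with theta = log(p/(1-p))
    theta = np.log(p / (1 - p))
    b_theta = -n * np.log(1 - p)          # equivalently n*log(1 + exp(theta))
    return np.exp(y * theta - b_theta + np.log(comb(n, y)))

y = np.arange(0, 8)
print(np.allclose(poisson_expfam(y, mu=2.5), poisson.pmf(y, 2.5)))          # True
print(np.allclose(binomial_expfam(y, n=10, p=0.3), binom.pmf(y, 10, 0.3)))  # True
```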
CHAPTER 3. LIKELIHOOD THEORY AND THE MOMENTS
Maximum Likelihood Estimation

To make inferences about the unknown parameters, we would like to develop the likelihood and score functions for Equation 2.4. Maximizing the likelihood function with regard to coefficient values is without question the most frequently used estimation technique in applied statistics. Since asymptotic theory assures us that for sufficiently large samples, the likelihood surface is unimodal in k dimensions for exponential family forms (Fahrmeir & Kaufman, 1985; Jørgensen, 1983; Wedderburn, 1976), this process is equivalent to finding the k-dimensional mode. Our real interest lies in obtaining the posterior distribution of the unknown k-dimensional θ coefficient vector, given an observed matrix of data values: f(θ|X). This allows us to determine the "most likely" values of the θ vector using the k-dimensional mode (maximum likelihood inference; Fisher, 1925) or simply to probabilistically describe this distribution (as in Bayesian inference). This posterior is produced by the application of Bayes law:

f(θ|X) = f(X|θ) p(θ)/p(X),                                                                   (3.1)
where f(X|θ) is the n-dimensional joint PDF or PMF of the data (the probability of the sample for a fixed θ) under the assumption that the data are independent and identically distributed according to f(Xi|θ) ∀ i = 1, . . . , n, and p(θ), p(X) are the corresponding unconditional probabilities. The Bayesian approach integrates out p(X) (or ignores it using proportionality) and stipulates an assumed (prior) distribution on θ, thus allowing fairly direct computation of f(θ|X) from Equation 3.1 (see Gill, 2014). If we regard f(X|θ) as a function of θ for given observed data X (we can consider the observed data as fixed, p(X) = 1, since it has occurred), then L(θ|X) = f(X|θ) is called a likelihood function (DeGroot, 1986, p. 339). The maximum likelihood principle states that an admissible θ that maximizes the likelihood function probability (discrete case) or density (continuous case), relative to alternative values of θ, provides the θ that is most "likely" to have generated the observed data X, given the assumed parametric form. Restated, if θ̂ is
the maximum likelihood estimator for the unknown parameter vector, then it is necessarily true that L(θ̂|X) ≥ L(θ|X) ∀ θ ∈ Θ, where Θ is the admissible range of θ. The likelihood function differs from the inverse probability, f(θ|X), in that it is necessarily a relative function since probabilistic uncertainty is a characteristic of the random variable X, not the unknown but fixed θ. Barnett (1973) clarifies this distinction: "Probability remains attached to X, not θ; it simply reflects inferentially on θ" (p. 131). Thus, maximum likelihood estimation substitutes the unbounded notion of likelihood for the bounded definition of probability (Barnett, 1973, p. 131; Casella & Berger, 1990, p. 266; Fisher, 1922, p. 327; King, 1989, p. 23). This is an important theoretical distinction but of little significance in applied practice.

Typically, it is mathematically more convenient to work with the natural log of the likelihood function. This does not change any of the resulting parameter estimates because the likelihood function and the log-likelihood function have identical modal points. Using Equation 2.4, we add a scale parameter (as in the normal example); return to a single parameter of interest, θ; and accompany the case by a function of a scale parameter, a(ψ). This leads to a very simple basic likelihood function:

ℓ(θ, ψ|y) = log( f(y|θ, ψ) ) = log[ exp( (yθ − b(θ))/a(ψ) + c(y, ψ) ) ]
          = (yθ − b(θ))/a(ψ) + c(y, ψ).                                                      (3.2)

It is certainly not a coincidence that using the natural log of the exponential family form simplifies our calculations. One of the reasons for casting all of the terms into the exponent is that at this stage, the exponent becomes expendable and the terms are easy to work with. The score function is the first derivative of the log-likelihood function with respect to the parameters of interest. For the time being, the scale parameter, ψ, is treated as a nuisance parameter. The resulting score function, denoted ℓ̇(θ|ψ, y), is produced by

ℓ̇(θ|ψ, y) = ∂ℓ(θ|ψ, y)/∂θ = ∂/∂θ [ (yθ − b(θ))/a(ψ) + c(y, ψ) ] = ( y − ∂b(θ)/∂θ ) / a(ψ).   (3.3)

Setting ℓ̇(θ|ψ, y) equal to zero and solving for the parameter of interest gives the maximum likelihood estimate, θ̂. This is now the "most likely" value of θ from the parameter space treating the observed data
as given: θ̂ maximizes the likelihood function at the observed values. The likelihood principle (Birnbaum, 1962) states that once the data are observed and therefore treated as given, all of the available evidence for estimating θ̂ is contained in the likelihood function, ℓ(θ, ψ|y). This is a very handy data reduction tool because it tells us exactly what treatment of the data is important to us and allows us to ignore an infinite number of alternates. Suppose we use the notation for the exponential family form expressed as the joint probability function of observed iid data in Equation 2.5: f(y|θ) = exp[ Σ_{i=1}^{n} yi θ − nb(θ) + Σ_{i=1}^{n} c(yi) ]. Setting the score function from this joint PDF or PMF equal to zero and rearranging gives the likelihood equation:

Σ t(yi) = n ∂b(θ)/∂θ,                                                                        (3.4)
where t(yi) is the remaining function of the data, depending on the form of the PDF or PMF. The underlying theory is remarkably strong. Solving Equation 3.4 for the unknown coefficient produces an estimator that is unique (a unimodal posterior distribution), consistent (converges in probability), and asymptotically efficient (the variance of the estimator achieves the lowest possible value as the sample size becomes adequately large: the Cramér-Rao lower bound). This combined with the central limit theorem gives the asymptotic normal form for the estimator: √n(θ̂ − θ) → N(0, Σ_θ). Furthermore, Σ t(yi) is a sufficient statistic for θ, meaning that all of the relevant information about θ in the data is contained in Σ t(yi). For example, the normal log-likelihood expressed as a joint exponential family form as in Equation 2.5 is

ℓ(θ, ψ|y) = ( μ Σ yi − nμ²/2 )/σ² − (1/(2σ²)) Σ yi² − (n/2) log(2πσ²).

So t(y) = Σ yi, ∂/∂θ (nμ²/2) = nμ, and equating gives the maximum likelihood estimate of μ to be the sample average, which we know from basic texts: (1/n) Σ yi.
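This sufficiency result is easy to confirm numerically. The following Python sketch (an added illustration using synthetic data with arbitrary values, not drawn from the text) maximizes the joint Poisson log-likelihood in θ = log(μ) and recovers the sample average.

```python
# Illustrative check: for iid Poisson data, maximizing the joint exponential
# family log-likelihood over theta yields the sample mean as the MLE of mu.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

rng = np.random.default_rng(0)
y = rng.poisson(lam=3.2, size=200)              # synthetic data, true mu = 3.2

def neg_loglik(theta):
    # joint form: sum(y)*theta - n*exp(theta) - sum(log y!)
    return -(y.sum() * theta - y.size * np.exp(theta) - gammaln(y + 1).sum())

theta_hat = minimize_scalar(neg_loglik, bounds=(-5, 5), method="bounded").x
print(np.exp(theta_hat), y.mean())               # the two agree: mu_hat = ybar
```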
Calculating the Mean of the Exponential Family

An important quantity to calculate is the mean of the PDF or PMF in the context of Equation 2.4. The generalization of the linear model is done by connecting the linear predictor, θ = Xβ, from a standard linear models analysis of the explanatory variables to the nonnormal outcome variable through its mean function. Therefore, the expected value (first moment) plays a key theoretical role in the development of generalized
linear models. The expected value calculation of Equation 2.4 with respect to the data (Y) is

E_Y[ ( y − ∂b(θ)/∂θ ) / a(ψ) ] = 0
∫_Y [ ( y − ∂b(θ)/∂θ ) / a(ψ) ] f(y) dy = 0
∫_Y y f(y) dy − ∂b(θ)/∂θ ∫_Y f(y) dy = 0,                                                    (3.5)

where the first integral in the last line is E[Y] and the second is 1.
The last step requires general regularity conditions with regard to the bounds of integration, and all exponential family distributions meet this requirement (Casella & Berger, 1990). Specifically, this means that we need to apply Leibnitz's rule for constant bounds (a, b),

d/dψ ∫_a^b f(y, ψ) dy = ∫_a^b ∂f(y, ψ)/∂ψ dy,

or Lebesgue's dominated convergence theorem for infinite bounds,

d/dψ ∫_{−∞}^{∞} f(y, ψ) dy = ∫_{−∞}^{∞} ∂f(y, ψ)/∂ψ dy,

where there exists some function g(y) ≥ |f(y, ψ)| such that ∫_{−∞}^{∞} g(y) dy < ∞. For discrete random variables, replace the integration with summation in Equation 3.5. From all this effort, we get the wonderfully useful result that E[Y] = ∂b(θ)/∂θ. So all that is required from Equation 2.4 to get the mean of a particular exponential family of distributions, a quantity we will call μ for uniformity across examples, is b(θ). This is an illustration of the value of expressing exponential family distributions in canonical form, since the first derivative of b(θ) immediately produces the first moment.

Example 3.1: Mean for the Poisson PMF

The procedure for obtaining the expected value (mean) is just to perform the differentiation of b(θ) with regard to θ and then substitute in the canonical link and solve. Generally, this is a very simple process. Recall that for the Poisson distribution, the normalizing constant is b(θ) = exp(θ), and the canonical link function is θ = log(μ). So,

∂b(θ)/∂θ = ∂ exp(θ)/∂θ = exp(θ) |θ=log(μ) = μ.
Of course, the result that E[Y] = μ for a Poisson distributed random variable is exactly what we would expect.

Example 3.2: Mean for the Binomial PMF

For the binomial distribution, b(θ) = n log(1 + exp(θ)), and θ = log(p/(1 − p)). Therefore, from the following, we get the mean function:

∂b(θ)/∂θ = ∂/∂θ [ n log(1 + exp(θ)) ]
         = n (1 + exp(θ))^(−1) exp(θ) |θ=log(p/(1−p))
         = n ( 1 + p/(1 − p) )^(−1) ( p/(1 − p) )
         = n (1 − p) ( p/(1 − p) ) = np,

where some algebra is required in addition to taking the derivative. Once again, E[Y] = np is the expected result from standard moments analysis.
Example 3.3: Mean for the Normal PDF

The normal form of the exponential family has b(θ) = θ²/2, and simply θ = μ. Therefore,

∂b(θ)/∂θ = ∂/∂θ (θ²/2) = θ |θ=μ = μ.                                                         (3.6)

This is the most straightforward and important case: E[Y] = μ.
Example 3.4: Mean for the Gamma PDF

Recall that for the gamma exponential family form, θ = −1/μ and b(θ) = −log(−θ). This produces

∂b(θ)/∂θ = ∂/∂θ ( −log(−θ) ) = −1/θ |θ=−1/μ = μ.

For the gamma distribution, we found that E[Y] = μ. This is equivalent to E[Y] = α/β when the gamma PDF is expressed in the familiar rate form: f(y|α, β) = (1/Γ(α)) β^α y^(α−1) e^(−βy), (μ = α/β, δ = α).
Example 3.5: Mean for the Negative Binomial PMF

For the negative binomial distribution, b(θ) = r log(1 − exp(θ)), and θ = log(1 − p). The mean is obtained by

∂b(θ)/∂θ = ∂/∂θ [ r log(1 − exp(θ)) ]
         = r (1 − exp(θ))^(−1) exp(θ) |θ=log(1−p)
         = r (1 − p)/(1 − (1 − p)).

So for the negative binomial, we get the mean function E[Y] = r(1 − p)/p.

Example 3.6: Mean for the Multinomial PMF

For the multinomial distribution, b(θ) = −log[ 1 − (1 − π2) exp(θ1)/(1 + exp(θ1)) − (1 − π1) exp(θ2)/(1 + exp(θ2)) ]. The calculation for the mean requires more effort in this case and must be done for each parameter in θ. To make the calculations more straightforward, first define h(θ1) = exp(θ1)/(1 + exp(θ1)) and h(θ2) = exp(θ2)/(1 + exp(θ2)). The cumulant function can then be reexpressed as b(θ) = −log[ 1 − (1 − π2)h(θ1) − (1 − π1)h(θ2) ]. The derivative of the h(θ1) function for the first dimension is

∂h(θ1)/∂θ1 = [ exp(θ1)/(1 + exp(θ1)) ][ 1 − exp(θ1)/(1 + exp(θ1)) ] = h(θ1)(1 + h(θ1)).

Then the first derivative of the cumulant function with respect to θ1 is

∂b(θ)/∂θ1 = (1 − π2) h(θ1)(1 + h(θ1)) / [ 1 − (1 − π2)h(θ1) − (1 − π1)h(θ2) ].               (3.7)
The probability substitution into the h(θ1) function is given by

h(θ1) |θ1=log(π1/(1−π1−π2)) = [ π1/(1 − π1 − π2) ] [ (1 − π2)/(1 − π1 − π2) ]^(−1) = π1/(1 − π2).   (3.8)
Substituting Equation 3.8 and h(θ2) = π2/(1 − π1) into Equation 3.7 gives E[Y1] = π1 and by symmetry E[Y2] = π2. While this may seem like an inordinate amount of effort to specify mean functions for commonly used distributions, the value lies in further understanding the unified approach that results from expressing probability functions in exponential family form. The mean function is pivotal to the working of generalized linear models because, as we shall see in Chapter 4, the link function connects the linear predictor to the exponential family form.
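As an added numerical companion (not part of the original discussion), the sketch below checks the identity E[Y] = ∂b(θ)/∂θ by simple finite differences for the binomial and gamma forms; the parameter values are arbitrary.

```python
# Illustrative check of E[Y] = d b(theta)/d theta using a centered difference.
import numpy as np

def num_deriv(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

# Binomial: b(theta) = n*log(1 + exp(theta)), theta = log(p/(1-p)), mean np
n, p = 10, 0.3
theta = np.log(p / (1 - p))
print(num_deriv(lambda t: n * np.log1p(np.exp(t)), theta), n * p)

# Gamma: b(theta) = -log(-theta), theta = -1/mu, mean mu
mu = 4.0
print(num_deriv(lambda t: -np.log(-t), -1 / mu), mu)
```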
Calculating the Variance of the Exponential Family

Just as we have derived the first moment earlier, we can obtain the variance from the second moment. Since E[Y] = ∂b(θ)/∂θ and ℓ̇(θ̂, ψ|y) = 0, the variance calculations for the exponential family form are greatly simplified. First, we obtain the variance and derivative of the score function and then apply a well-known mathematical statistics relationship. The variance of the score function is

Var[ℓ̇(θ, ψ|y)] = E[ ( ℓ̇(θ, ψ|y) − E[ℓ̇(θ, ψ|y)] )² ] = E[ ( ℓ̇(θ, ψ|y) − 0 )² ]
              = E[ ( ( y − ∂b(θ)/∂θ ) / a(ψ) )² ] = E[ ( y − ∂b(θ)/∂θ )² / a²(ψ) ]
              = E[ (y − E[Y])² ] / a²(ψ)
              = (1/a²(ψ)) Var[Y].                                                            (3.9)
The derivative of the score function with respect to θ is

∂ℓ̇(θ, ψ|y)/∂θ = ∂/∂θ [ ( y − ∂b(θ)/∂θ ) / a(ψ) ] = −(1/a(ψ)) ∂²b(θ)/∂θ².                    (3.10)
The utility of deriving Equation 3.9 and Equation 3.10 comes from the following relation for exponential families (Casella & Berger, 1990, p. 312): E[ (ℓ̇(θ, ψ|y))² ] = E[ −∂ℓ̇(θ, ψ|y)/∂θ ]. This means that we can equate Equation 3.9 and Equation 3.10 to solve for Var[Y]:

(1/a²(ψ)) Var[Y] = (1/a(ψ)) ∂²b(θ)/∂θ²
Var[Y] = a(ψ) ∂²b(θ)/∂θ².                                                                    (3.11)
We now have expressions for the mean and variance of Y expressed in the terms of the exponential family format of Equation 2.4 with the a(ψ) term included.

Example 3.7: Variance for the Poisson PMF

Var[Y] = a(ψ) ∂²b(θ)/∂θ² = 1 · ∂² exp(θ)/∂θ² |θ=log(μ) = exp(log(μ)) = μ.

Once again, Var[Y] = μ is the expected result.
Example 3.8: Variance for the Binomial PMF

Var[Y] = a(ψ) ∂²b(θ)/∂θ²
       = 1 · ∂²/∂θ² [ n log(1 + exp(θ)) ]
       = ∂/∂θ [ n (1 + exp(θ))^(−1) exp(θ) ]
       = n exp(θ) [ (1 + exp(θ))^(−1) − (1 + exp(θ))^(−2) exp(θ) ] |θ=log(p/(1−p))
       = n ( p/(1 − p) ) [ ( 1 + p/(1 − p) )^(−1) − ( 1 + p/(1 − p) )^(−2) ( p/(1 − p) ) ]
       = np(1 − p).

Var[Y] = np(1 − p) is the familiar form for the variance of the binomial distribution.
Example 3.9: Variance for the Normal PDF

Var[Y] = a(ψ) ∂²b(θ)/∂θ² = σ² ∂²/∂θ² (θ²/2) = σ² ∂θ/∂θ = σ².                                 (3.12)

Var[Y] = σ² is the obvious result.
Example 3.10: Variance for the Gamma PDF

Var[Y] = a(ψ) ∂²b(θ)/∂θ² = (1/δ) ∂²/∂θ² ( −log(−θ) ) = (1/δ) ∂/∂θ ( −1/θ )
       = (1/δ) θ^(−2) |θ=−1/μ = μ²/δ.                                                        (3.13)

This result, Var[Y] = μ²/δ, is equivalent to α/β² in the other familiar rate notation for the gamma PDF.
Example 3.11: Variance for the Negative Binomial PMF

Var[Y] = a(ψ) ∂²b(θ)/∂θ²
       = 1 · ∂/∂θ [ r (1 − exp(θ))^(−1) exp(θ) ]
       = r exp(θ) [ (1 − exp(θ))^(−2) exp(θ) + (1 − exp(θ))^(−1) ] |θ=log(1−p)
       = r (1 − p) [ (1 − (1 − p))^(−2) (1 − p) + (1 − (1 − p))^(−1) ]
       = r(1 − p)/p².

Also, Var[Y] = r(1 − p)/p² is exactly what we expected.
Example 3.12: Variance for the Multinomial PMF

Using the same shorthand notation h(θ1) = exp(θ1)/(1 + exp(θ1)) and h(θ2) = exp(θ2)/(1 + exp(θ2)) from the mean calculation,

∂²b(θ)/∂θ1² = ∂/∂θ1 { (1 − π2) h(θ1)(1 + h(θ1)) [ 1 − (1 − π2)h(θ1) − (1 − π1)h(θ2) ]^(−1) }
            = (1 − π2) h(θ1)(1 + h(θ1)) g(θ)^(−1) [ (2h(θ1) + 1) + g(θ)^(−1) (1 − π2) h(θ1)(1 + h(θ1)) ],

where g(θ) denotes the bracketed term 1 − (1 − π2)h(θ1) − (1 − π1)h(θ2). Performing the substitutions

h(θ1) |θ1=log(π1/(1−π1−π2)) = π1/(1 − π2)   and   g(θ) |θ1=log(π1/(1−π1−π2)), θ2=log(π2/(1−π1−π2)) = 1 − π1 − π2,

and some algebra gives Var[Y1] = (π1)(1 − π1), as well as Var[Y2] = (π2)(1 − π2) from symmetry.
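The variance identity itself, Var[Y] = a(ψ) ∂²b(θ)/∂θ², can also be confirmed numerically. The sketch below (an added illustration, not part of the original text) uses second-order finite differences of b(θ) for three of the simpler families; the parameter values are arbitrary.

```python
# Illustrative check of Var[Y] = a(psi) * d^2 b(theta)/d theta^2.
import numpy as np

def second_deriv(f, x, h=1e-4):
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

# Poisson: a(psi) = 1, b(theta) = exp(theta), theta = log(mu) -> variance mu
mu = 2.5
print(second_deriv(np.exp, np.log(mu)), mu)

# Binomial: a(psi) = 1, b(theta) = n*log(1 + exp(theta)) -> variance np(1-p)
n, p = 10, 0.3
theta = np.log(p / (1 - p))
print(second_deriv(lambda t: n * np.log1p(np.exp(t)), theta), n * p * (1 - p))

# Normal: a(psi) = sigma^2, b(theta) = theta^2/2 -> variance sigma^2
sigma2 = 1.7
print(sigma2 * second_deriv(lambda t: t**2 / 2, 0.4), sigma2)
```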
The Variance Function

It is common to define a variance function for a given exponential family expression in which the θ notation is preserved for compatibility with the b(θ) form. The variance function is used in generalized linear models to indicate the dependence of the variance of Y on location and scale parameters. It is also important in developing useful residuals analysis as will be discussed in Chapter 6. The variance function is simply defined as τ² = ∂²b(θ)/∂θ², indexed by θ, meaning that Var[Y] = a(ψ)τ². Note that the dependence on b(θ) explicitly states that the variance function is conditional on the mean function, whereas there was no such stipulation with the a(ψ) form.

The variance of Y can also be expressed with prior weighting, usually coming from point estimation theory: Var[Y] = (ψ/w)τ², where ψ is a dispersion parameter and w is a prior weight. For example, a mean from a sample of size n, as well as a population with known variance σ², is Var[X̄] = ψτ²/w = σ²/n. It is convention to leave the variance function in terms of the canonical parameter, θ, rather than return it to the parameterization in the original probability function as was done for the variance of Y. Table 3.1 summarizes the variance functions for the distributions studied.
Table 3.1    Normalizing Constants and Variance Functions

Distribution          b(θ)                                  τ² = ∂²b(θ)/∂θ²
Poisson               exp(θ)                                exp(θ)
Binomial              n log(1 + exp(θ))                     n exp(θ)(1 + exp(θ))^(−2)
Normal                θ²/2                                  1
Gamma                 −log(−θ)                              1/θ²
Negative binomial     r log(1 − exp(θ))                     r exp(θ)(1 − exp(θ))^(−2)
Multinomial, with     −log[ 1 − (1 − π2)h(θ1)               (1 − π2)h(θ1)(1 + h(θ1))g(θ)^(−1)
 h(θ1) =                      − (1 − π1)h(θ2) ]               × [ (2h(θ1) + 1) + g(θ)^(−1)(1 − π2)h(θ1)(1 + h(θ1)) ]
 exp(θ1)/(1+exp(θ1))
CHAPTER 4. LINEAR STRUCTURE AND THE LINK FUNCTION
This is a critical chapter of this monograph because it describes the theory by which the standard linear model is generalized to accommodate nonnormal outcome variables such as discrete choices, counts, survival periods, truncated varieties, and more. The basic philosophy is to employ a function of the mean vector of the outcome to link the normal theory environment with Gauss-Markov assumptions to another environment that encompasses a wide class of outcome variables. The first part of this monograph explored the exponential family and showed how seemingly distinct probability functions had an underlying theoretical similarity. That similarity is exploited in this chapter by showing how the θ specification and the b(θ ) function lead to logical link functions under general conditions. In this regard, the utility of reexpressing distributions in exponential family form is that components of this expression immediately give the means of specifying nonlinear regression models. The payoff to using distributions in exponential family form is that correct assignment of generalized linear model (GLM) forms and the resulting properties are easily understood in the general regression modeling sense.
The Generalization

Consider the standard linear model meeting the Gauss-Markov conditions. This can be expressed as follows:

V = Xβ + ε                                                                                   (4.1)
E[V] = θ = Xβ,                                                                               (4.2)

where V, θ, and ε are n × 1, X is n × k, and β is k × 1.
The right-hand sides of the two equations are very familiar: X is the design or model matrix of observed data values, β is the vector of unknown coefficients to be estimated, Xβ is called the "linear structure vector," and ε are the independent (often normally distributed) error terms with constant variance: the random component. On the left-hand side of Equation 4.2, E[V] = θ is the vector of means: the systematic component. The variable, V, is distributed asymptotically iid normal
with mean θ and constant variance σ². So far, this is exactly the linear model described in basic statistics texts, and we use the notation V for the linear additive expectation to differentiate it from the generalization that follows. Now suppose we generalize slightly this well-known form with a new "linear predictor" based on the mean of the outcome variable:

g(μ) = θ = Xβ

(with g(μ) and θ of dimension n × 1 and Xβ of dimension (n × k)(k × 1) as before),
where g() is an invertible, smooth function (i.e., no discontinuities) of the mean vector μ of the outcomes, which is no longer equivalent to V. At this point, we drop the V vector of normal variates completely since it is an artificial construct for our purposes; these realizations never actually existed. The V vector is only useful in setting up the right-hand side of Equation 4.1 and Equation 4.2 before we generalized the left-hand side of the model to g(μ). Information from the explanatory variables is now expressed only through the link from the linear structure, Xβ, to the linear predictor, θ = g(μ), controlled by the form of the link function, g(). This is actually more commonly expressed as the inverse link function: g^(−1)(Xβ). This inverse link function connects the linear predictor to the mean of the outcome variable, not directly to the expression of the outcome variable itself (as in the linear model), so the outcome variable can now take on a variety of nonnormal forms. In this manner, the generalized linear model extends the standard linear model to accommodate nonnormal response functions with transformations to linearity. The generalization of the linear model now has three components derived from the earlier expressions.

1. Stochastic component: Y is the random or stochastic component that remains distributed iid according to a specific exponential family distribution such as those in Chapter 2, with mean μ. This component is sometimes also called the "error structure" or "response distribution."

2. Systematic component: θ = Xβ is the systematic component producing the linear predictor. So the explanatory variables, X, affect the observed outcome variable, Y, only through the functional form of the g() function.

3. Link function: the stochastic component and the systematic component are linked by a function of θ, which is exactly the canonical link function developed in Chapter 2 and summarized in
Table 4.1. The link function connects the stochastic component, which describes some response variable from a wide variety of forms, to all the standard normal theory supporting the systematic component through the mean function:

g(μ) = θ = Xβ
g^(−1)(g(μ)) = g^(−1)(θ) = g^(−1)(Xβ) = μ = E[Y].

So the inverse of the link function ensures that Xβ̂, where we insert β̂, the estimated coefficient vector, maintains the Gauss-Markov assumptions for linear models, and all the standard theory applies even though the outcome variable takes on a variety of nonnormal forms. We can think of g(μ) as "tricking" the linear model into thinking that it is still acting upon normally distributed outcome variables. The inverse link function connects the linear predictor, the systematic component (θ), to the expected value of the specified exponential family form (μ), and its form is determined by the noninverse version supplied by the exponential family form, which we call the canonical link function as described.

This setup is much more powerful than it initially appears. The outcome variable described by the exponential family form is affected by the explanatory variables strictly through the link function applied to the systematic component, g^(−1)(Xβ), and nothing else. This data reduction is accomplished because g^(−1)(Xβ) is a sufficient statistic for μ, given the assumed parametric form (PMF or PDF) and a correctly specified link function. Actually, although it is traditional to describe the generalized linear model by these three components, there are really four. The residuals comprise the fourth component and are critical determinants of model quality, as will be shown in Chapter 6.

The payoff to notating and understanding distributions in exponential family form is that the canonical link function is simply the θ = u(ζ) component from the interaction component in Equation 2.4 expressed in canonical form. In other words, once the exponential family form is expressed, the link function is immediately identified. For example, since the exponential family form for the negative binomial PMF is

f(y|r, p) = exp[ y log(1 − p) + r log(p) + log(r + y − 1 choose y) ],

the canonical link function is θ = log(1 − p). Even more simply, in standard linear models, the link function is the identity function: θ = μ. This
states that the canonical parameter equals the systematic component, so the linear predictor is just the expected value.
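To make the three components concrete, the short sketch below (an added illustration with a made-up design matrix and coefficient vector, not from the original text) builds a linear predictor and maps it through the inverse log link of a Poisson GLM.

```python
# Illustrative only: the three GLM components for a Poisson/log-link model,
# with an arbitrary design matrix X and coefficient vector beta.
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])  # leading column of 1s
beta = np.array([0.2, 0.5, -0.3])

theta = X @ beta        # systematic component: linear predictor theta = X beta
mu = np.exp(theta)      # inverse link g^{-1}(theta) = exp(theta) gives E[Y]
y = rng.poisson(mu)     # stochastic component: Y ~ Poisson(mu)
print(theta, mu, y)
```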
Distributions

Table 4.1 summarizes the link functions for the distributions included as running examples. Note that g() and g^(−1)() are both included. In Table 4.1, there are three expressions for the canonical link for the binomial PMF. The first link function, logit, is the one that naturally occurs from the exponential family form expression for the canonical term (Example 2.2). The probit link function (based on the cumulative standard normal distribution, denoted Φ) and the cloglog link function are close but not exact approximations of the same mathematical form and are practical conveniences rather than theoretically derived expressions. Figure 4.1 compares the three forms. The differences are really mostly noticeable in the tails of these distributions (especially with the cloglog). In general, with social science data, any of these functions can be used and will generally provide the same substantive conclusions.
Table 4.1    Natural Link Function Summary for Example Distributions

Distribution                 Canonical Link: θ = g(μ)                 Inverse Link: μ = g^(−1)(θ)
Poisson                      log(μ)                                   exp(θ)
Binomial   (logit link)      log( μ/(1 − μ) )                         exp(θ)/(1 + exp(θ))
           (probit link)     Φ^(−1)(μ)                                Φ(θ)
           (cloglog link)    log( −log(1 − μ) )                       1 − exp( −exp(θ) )
Normal                       μ                                        θ
Gamma                        −1/μ                                     −1/θ
Negative binomial            log(1 − μ)                               1 − exp(θ)
Multinomial                  θk = log( πk / Σ_{ℓ=1}^{K} πℓ )          πk = exp(θk) / Σ_{ℓ=1}^{K} exp(θℓ)
 (k = 1, . . . , K categories)
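The binomial links in Table 4.1 can also be compared directly in code. The following sketch (an added illustration, not part of the original text; scipy's standard normal CDF is used for the probit) implements each link and its inverse and verifies that the inverse undoes the link.

```python
# Minimal sketch of the three binomial links and their inverses.
import numpy as np
from scipy.stats import norm

links = {
    "logit":   (lambda mu: np.log(mu / (1 - mu)),    lambda t: np.exp(t) / (1 + np.exp(t))),
    "probit":  (norm.ppf,                            norm.cdf),
    "cloglog": (lambda mu: np.log(-np.log(1 - mu)),  lambda t: 1 - np.exp(-np.exp(t))),
}

mu = 0.37
for name, (g, g_inv) in links.items():
    print(name, g_inv(g(mu)))   # each inverse link returns the original 0.37
```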
Figure 4.1    Comparison of Binomial Link Functions

[Figure: μ plotted against θ for the logit, probit, and cloglog links.]
The logit link function does have one advantage over the other two with regard to describing uncertainty with odds rather than probability. The odds of an event is the ratio of the probability of an event happening to the probability of the event not happening:

odds = p(y = 1)/p(y = 0) = p/(1 − p)   ⟶   p = odds/(1 + odds),

where odds is obviously on the support (0 : ∞). This is essentially how logit regression works, since

log(odds) = log( p/(1 − p) ) = β0 + β1 x1 + β2 x2                                            (4.3)
for an example model with two explanatory variables. So if x2 is held constant, then a one-unit change in x1 gives a β1 change in the log-odds of success, or an exp(β1) change in the odds. So in this way, we get a statement that resembles the interpretation of a linear model coefficient estimate: for a one-unit change in x, we get a β expected change in y.

Some common modeling forms are not included in Table 4.1. The conditional logistic model (Bishop, Fienberg, & Holland, 1975) can be expressed in exponential family form, but its most common use with prospective study matched pair data is beyond the scope of this monograph. Surprisingly, the ordered logit/probit model is technically not a GLM even though it appears to be a "close relative" of the multinomial model where an ordering of outcomes is imposed. This is because the model is defined in cumulative terms, for example, in the logit case: F(θr − X′β) = p(Y ≤ r|X) = [1 + exp(−θr + X′β)]^(−1), meaning that the probability that the ith outcome Yi is at the rth ordered category or less is given by the inverse logit of the difference between the linear additive component and the cutpoint to the right of this category, θr. The resulting likelihood function is given by
L(β, θ|X, Y) = ∏_{i=1}^{n} ∏_{j=1}^{C−1} [ (1 + exp(θj − Xi β))^(−1) − (1 + exp(θj−1 − Xi β))^(−1) ]^(zij),
where zij = 1 if the ith case is in the jth ordered category, and zij = 0 otherwise. It is easy to see that it is mathematically impossible to collect the terms appropriately to express them in exponential family form. This discussion is provided here to demonstrate that just because a regression form is used as commonly as those in Table 4.1, this does not automatically imply that it will be a GLM.

Example 4.1: Poisson GLM of Capital Punishment Data

Consider an example in which the outcome variable is the number of times that capital punishment is implemented on a state level in the United States for 1997. Included in the data are explanatory variables for median per capita income in dollars, the percentage of the population classified as living in poverty, the percentage of Black citizens in the population, the rate of violent crimes per 100,000 residents for the year before (1996), a dummy variable to indicate whether the state is in the South, and the proportion of the population with a college degree of some kind.5 In 1997, executions were carried out in 17 states with a national total of 74. The original data for this problem are provided in Table 4.2 and constitute the X matrix in the earlier discussion (except that the X matrix necessarily contains a leading vector of 1s for the constant instead of the outcome variable in the first column).

Two models are generally used to analyze count variables such as the one in this example: the negative binomial and Poisson. Even though these modeling approaches share certain features (e.g., same mean structure), the negative binomial is useful when dealing with overdispersed data. Recall that overdispersion refers to the case where the variance exceeds the mean for such count models. From Table 4.1, we know that the mean of the Poisson distribution is the same as its variance. However, this is not the case for the negative binomial, and the inequality between these two terms is captured by a dispersion parameter that is held constant in the Poisson model. If there is no evidence of overdispersion in the data, the Poisson model should be preferred over the negative binomial. To illustrate this difference, we first use the negative binomial to model the outcome. The output from this model yields an estimate of δ that can be interpreted as the inverse of the overdispersion parameter. For the current example, δ = 30,963.4, which, once inverted, implies almost no overdispersion. Similarly, it is possible to compare the two models using a likelihood ratio test. The likelihood ratio test statistic for comparing the two models, 2(log(LNB) − log(LP)), is −0.0001 with 8 degrees of freedom for a chi-square distributed test statistic and provides further evidence that the Poisson model should be favored.
Table 4.2    Capital Punishment in the United States, 1997

State            Executions   Median Income   Percent Poverty   Percent Black   Violent Crime/100K   South   Proportion With Degrees
                 (EXE)        (INC)           (POV)             (BLK)           (CRI)                (SOU)   (DEG)
Texas            37           34,453          16.7              12.2              644                1       0.16
Virginia          9           41,534          12.5              20.0              351                1       0.27
Missouri          6           35,802          10.6              11.2              591                0       0.21
Arkansas          4           26,954          18.4              16.1              524                1       0.16
Alabama           3           31,468          14.8              25.9              565                1       0.19
Arizona           2           32,552          18.8               3.5              632                0       0.25
Illinois          2           40,873          11.6              15.3              886                0       0.25
South Carolina    2           34,861          13.1              30.1              997                1       0.21
Colorado          1           42,562           9.4               4.3              405                0       0.31
Florida           1           31,900          14.3              15.4            1,051                1       0.24
Indiana           1           37,421           8.2               8.2              537                0       0.19
Kentucky          1           33,305          16.4               7.2              321                0       0.16
Louisiana         1           32,108          18.4              32.1              929                1       0.18
Maryland          1           45,844           9.3              27.4              931                0       0.29
Nebraska          1           34,743          10.0               4.0              435                0       0.24
Oklahoma          1           29,709          15.2               7.7              597                0       0.21
Oregon            1           36,777          11.7               1.8              463                0       0.25

Source: U.S. Census Bureau, U.S. Department of Justice.
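Fitting the Poisson specification developed next is straightforward in standard software. The following Python/statsmodels sketch (an added illustration only, not the authors' code, and it does not reproduce the negative binomial comparison reported above) enters the Table 4.2 values directly and fits the log-link Poisson GLM, along with a simple Pearson-based overdispersion check in the spirit of the preceding discussion.

```python
# Illustrative Poisson GLM for the Table 4.2 capital punishment data.
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "EXE": [37, 9, 6, 4, 3, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "INC": [34453, 41534, 35802, 26954, 31468, 32552, 40873, 34861, 42562,
            31900, 37421, 33305, 32108, 45844, 34743, 29709, 36777],
    "POV": [16.7, 12.5, 10.6, 18.4, 14.8, 18.8, 11.6, 13.1, 9.4, 14.3, 8.2,
            16.4, 18.4, 9.3, 10.0, 15.2, 11.7],
    "BLK": [12.2, 20.0, 11.2, 16.1, 25.9, 3.5, 15.3, 30.1, 4.3, 15.4, 8.2,
            7.2, 32.1, 27.4, 4.0, 7.7, 1.8],
    "CRI": [644, 351, 591, 524, 565, 632, 886, 997, 405, 1051, 537, 321,
            929, 931, 435, 597, 463],
    "SOU": [1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0],
    "DEG": [0.16, 0.27, 0.21, 0.16, 0.19, 0.25, 0.25, 0.21, 0.31, 0.24,
            0.19, 0.16, 0.18, 0.29, 0.24, 0.21, 0.25],
})

X = sm.add_constant(df[["INC", "POV", "BLK", "CRI", "SOU", "DEG"]])
poisson_fit = sm.GLM(df["EXE"], X, family=sm.families.Poisson()).fit()
print(poisson_fit.summary())

# Rough overdispersion diagnostic: Pearson chi-square over residual degrees of
# freedom should be near 1 when the Poisson mean-variance assumption is adequate.
print(poisson_fit.pearson_chi2 / poisson_fit.df_resid)
```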
The model is developed from the Poisson link function in Table 4.1, θ = log(μ), with the objective of finding the best β vector in

g^(−1)(θ) = g^(−1)(Xβ) = exp[Xβ]
          = exp[ 1β0 + INCβ1 + POVβ2 + BLKβ3 + CRIβ4 + SOUβ5 + DEGβ6 ]
          = E[Y] = E[EXE],

a 17 × 1 vector. The systematic component here is Xβ, the stochastic component is Y = EXE, and the link function is θ = log(μ). The goal is to estimate the coefficient vector β = {β0, β1, β2, β3, β4, β5, β6} in the earlier context. From this notation, it is clear that the rate of executions is affected by the explanatory variables only through the link function. It should also be noted that the quality of the model is still partly a function of appropriate variable inclusion, casewise independence, and measurement quality exactly as in the standard linear model. In this generalized linear model, we have the additional assumption that θ = log(μ) is the appropriate link function.

Example 4.2: Gamma GLM of Electoral Politics in Scotland

On September 11, 1997, Scottish voters overwhelmingly (74.3%) approved the establishment of the first Scottish national parliament in nearly 300 years. On the same ballot, the voters gave strong support (63.5%) to granting this parliament taxation powers. This vote represents a watershed event in the modern history of Scotland, which was a free and independent country until 1707. Whether this event is simply an incremental part of the current Labour government's decentralization program or a genuine step toward renewed Scottish independence within Europe remains an open question. The popular press in the United Kingdom and elsewhere emphasized Scottish pride and nationalism as driving factors in the voters' minds. The question addressed here is whether or not social and economic factors were important as perhaps more rational determinants of the vote.

The data are aggregated to 32 Unitary Authorities (also called council districts). These are the official local divisions in Scotland since 1996, before which there were 12 administrative regions. Despite the greater journalistic attention paid to the first vote establishing the Scottish parliament, it can be argued that granting taxation powers to a new
legislature is more consequential. The outcome variable analyzed here is therefore the protaxation granting vote percentage measured at the council district level.

The dataset, collected from U.K. government sources, includes 40 potential explanatory variables from which 6 are selected for this model (all 40 are available to readers in the dataset provided at the authors' websites). Because the other local taxing body is the council, a variable for the amount of council tax is included, measured in £ Sterling as of April 1997 per two adults before miscellaneous adjustments. The data include several variables concerning employment and unemployment benefits. The variable selected here is the female percentage of total claims for unemployment benefits as of January 1998. Due to the complexities of measuring actual unemployment rates from nationally collected statistics on those who apply for benefits, female applicants appear to be a better indication of underlying unemployment activity in Scotland: They are more likely to apply when unemployed and less likely to participate in unrecorded economic activities. As a way of measuring regional variation in population aging, the standardized mortality rate (United Kingdom equals 100) is included. Interestingly, this measure is higher than the U.K. benchmark in 30 of the 32 Scottish council districts. To include general labor force activity, a variable is specified indicating the percentage of economically active individuals relative to the population of working age. Finally, as a way to look at family size and perhaps commitment to community building (and therefore an implied tolerance for greater taxation), the percentage of children aged 5 to 15 is included.

As a percentage (actually converted to a proportion here simply to make the scale of the coefficient estimates more readable), the outcome variable is bounded by 0 and 100. It is regrettably common to see researchers apply the standard linear model in this setting and then obtain estimates from ordinary least squares estimation. This is a flawed practice but in varying degrees. If the data are centered in the middle of the interval and no censoring is involved at the bounds, then the results, while theoretically unjustified, are likely to be quite reasonable. However, if the data are concentrated at either end of the interval or there is some amount of censoring at the bounds, then serious errors of estimation can occur.6 An appropriate model, provided that there is no censoring at the upper bound, is a generalized linear model with the gamma link function. This model is often used to model variance since the outcome variable is defined over the sample space [0, +∞]. Because vote percentages over 100 are not defined and do not exist, this is a good choice of model for this example.
The model for these data using the gamma link function is produced by

g^(−1)(θ) = g^(−1)(Xβ) = −1/(Xβ)
          = −[ 1β0 + COUβ1 + UNMβ2 + MORβ3 + ACTβ4 + AGEβ5 ]^(−1)
          = E[Y] = E[YES],

a 32 × 1 vector. The systematic component here is Xβ, the stochastic component is Y = YES, and the link function is θ = −1/μ. One challenge with the analysis of these data is that there is relatively little variation through each of the variables in Table 4.3. In some senses, this is a good problem to have, but it makes it slightly more challenging to identify regional differentiation.

Table 4.3    Taxation Powers Vote for the Scottish Parliament, 1997

Council District        Proportion    Council   % Female        Standardized   % Active        % Aged
                        Voting Yes    Tax       Unemployment    Mortality      Economically    5−15
                        (YES)         (COU)     (UNM)           (MOR)          (ACT)           (AGE)
Aberdeen City           0.603         712       21.0            105            82.4            12.3
Aberdeenshire           0.523         643       26.5             97            80.2            15.3
Angus                   0.534         679       28.3            113            86.3            13.9
Argyll & Bute           0.570         801       27.1            109            80.4            13.6
Clackmannanshire        0.687         753       22.0            115            64.7            14.6
Dumfries & Galloway     0.488         714       24.3            107            79.0            13.8
Dundee City             0.655         920       21.2            118            72.2            13.3
East Ayrshire           0.705         779       20.5            114            75.2            14.5
East Dunbartonshire     0.591         771       23.2            102            81.1            14.2
East Lothian            0.627         724       20.5            112            80.3            13.7
East Renfrewshire       0.516         682       23.8             96            83.0            14.6
Edinburgh City          0.620         837       22.1            111            74.5            11.6
Western Isles           0.684         599       19.9            117            83.8            15.1
Falkirk                 0.692         680       21.5            121            77.6            13.7
Fife                    0.647         747       22.5            109            77.9            14.4
Glasgow City            0.750         982       19.4            137            65.3            13.3
Highland                0.621         719       25.9            109            80.9            14.9
Inverclyde              0.672         831       18.5            138            80.2            14.6
Midlothian              0.677         858       19.4            119            84.8            14.3
Moray                   0.527         652       27.2            108            86.4            14.6
North Ayrshire          0.657         718       23.7            115            73.5            15.0
North Lanarkshire       0.722         787       20.8            126            74.7            14.9
Orkney Islands          0.474         515       26.8            106            87.8            15.3
Perth and Kinross       0.513         732       23.0            103            86.6            13.8
Renfrewshire            0.636         783       20.5            125            78.5            14.1
Scottish Borders        0.507         612       23.7            100            80.6            13.3
Shetland Islands        0.516         486       23.2            117            84.8            15.9
South Ayrshire          0.562         765       23.6            105            79.2            13.7
South Lanarkshire       0.676         793       21.7            125            78.4            14.5
Stirling                0.589         776       23.0            110            77.2            13.6
West Dunbartonshire     0.747         978       19.3            130            71.5            15.3
West Lothian            0.673         792       21.2            126            82.2            15.1

Source: U.K. Office for National Statistics, the General Register Office for Scotland, the Scottish Office.

Example 4.3: Multinomial GLM to Model Vote Intention in the U.S. Republican Presidential Primaries, 2016

Electoral results and voters' preferences have been some of the most important objects of analysis in political science. The most prominent theories of voting behavior emphasize different main factors behind people's choices: affiliation with demographic or social groups (Lazarsfeld, Berelson, & Gaudet, 1968), party identification (Campbell et al., 1960), and economic evaluations (Downs, 1957; Fiorina, 1981). There have been several studies providing evidence supporting these theories and others complementing the explanations by focusing on campaigns (Druckman, 2004; Iyengar & Simon, 2000; Vavreck, 2009), issue salience (Budge & Farlie, 1983; Carmines & Stimson, 1980), individual characteristics of candidates like gender or personality (Huddy & Terkildsen, 1993; Sanbonmatsu, 2002; Sapiro, 1983), and others. Most of these studies have focused on general elections, and very few have paid attention to preliminary stages of the electoral process where candidates are selected: the primary elections. More important, the explanations posed by some of these theories are not suitable given the characteristics of these races where, for example, there is hardly any partisan variation among the candidates and the electorate, or the evaluations of economic conditions and attribution of responsibility are implausible. The question regarding the factors behind a candidate's support became more relevant during the 2016 presidential election in
the United States, where candidate Donald Trump challenged several explanations of vote choice by running a very unorthodox but successful campaign. What is the profile of his supporters? How do they differ from other Republican voters? To answer these questions, we focus on responses to The American Panel Survey (TAPS). This survey was conducted among a nationally representative sample comprising about 2,000 respondents. The survey has several waves that span from 2012 to 2018, but we only use the answers provided to the April 2016 wave, which happened in the middle of the caucuses and primaries for the 2016 presidential election. Given that we are interested in the factors associated with support for Donald Trump, a Republican candidate, our sample includes 577 respondents who identified as Republicans.7

The outcome of interest is vote intention (PRIMVOTE): respondents picked an exclusive option from a list of candidates that included Donald Trump, Ted Cruz, John Kasich, and others.8 The nominal nature of this variable leads to the use of a multinomial model. The covariates to explain vote choice are age (AGE), gender (GENDER), education (EDUCATION), region of the country in which the respondent lives (REGION), ideology (ranging from conservative to liberal, IDEOLOGY), level of authoritarian attitudes (measured with the Right Wing Authoritarianism scale, RWA), and perceptions of whether Trump could win the election (TRUMPWIN). Figure 4.2 presents bar plots and histograms with the distribution of the variables under analysis. The probability of the ith respondent voting for candidate r (PRIMVOTE = r) is modeled as follows:

P(PRIMVOTE = r|X) = exp(Xi βr) / ( 1 + Σ_{s=1}^{k−1} exp(Xi βs) ),

where

Xi βr = 1β0,r + AGEi β1,r + MALEi β2,r + EDUCATIONi β3,r + REGIONi β4,r + RELIGIOSITYi β5,r + IDEOLOGYi β6,r + RWAi β7,r + TRUMPWINi β8,r.

Notice that the bold coefficients β1,r, β3,r, β4,r, and β8,r are also vectors of coefficients corresponding to each of the categories of their respective variable. Similarly, AGE, EDUCATION, REGION, and TRUMPWIN are matrices with the categories of each variable in columns and rows indicating the category to which each respondent belongs. We will return to this example providing the model results on page 64.
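A model of this kind can be fit with standard multinomial logit software. Because the TAPS data are not reproduced here, the following statsmodels sketch (an added illustration, not the authors' code) simulates a small synthetic dataset with placeholder covariate names and fits MNLogit; it shows only the mechanics, not the substantive results discussed later.

```python
# Hedged sketch: multinomial logit on simulated placeholder data.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2016)
n = 500
df = pd.DataFrame({
    "IDEOLOGY": rng.normal(size=n),
    "RWA": rng.normal(size=n),
    "AGE": rng.integers(18, 90, size=n),
    "MALE": rng.integers(0, 2, size=n),
})
# Synthetic 3-category outcome (0 = reference) generated from arbitrary coefficients.
eta1 = 0.5 * df["RWA"] - 0.3 * df["IDEOLOGY"]
eta2 = -0.2 * df["RWA"] + 0.4 * df["IDEOLOGY"]
p = np.exp(np.column_stack([np.zeros(n), eta1, eta2]))
p /= p.sum(axis=1, keepdims=True)
df["PRIMVOTE"] = [rng.choice(3, p=row) for row in p]

X = sm.add_constant(df[["IDEOLOGY", "RWA", "AGE", "MALE"]])
fit = sm.MNLogit(df["PRIMVOTE"], X).fit(disp=0)
print(fit.summary())   # one coefficient vector per non-reference category
```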
Figure 4.2    Histogram and Bar Plots of Variables: Primary Election Example

[Figure: bar plots and histograms showing the distributions of electoral preference (Trump, Cruz, Kasich, other), age categories, gender, education, region, ideology, religiosity, right-wing authoritarianism, and perceptions of whether Trump could win.]
CHAPTER 5. ESTIMATION PROCEDURES
Estimation Techniques

This chapter develops the statistical computing technique used to produce maximum likelihood estimates for coefficients in generalized linear models: iterative weighted least squares (IWLS, also called iteratively reweighted least squares, IRLS). All statistical software uses some iterative root-finding procedure to find maximum likelihood estimates; the advantage of iterative weighted least squares is that it finds these estimates for any generalized linear model specification based on an exponential family form (and a number of others as well; cf. Green, 1984). Nelder and Wedderburn (1972) proposed iteratively weighted least squares in their founding article as an integrating numerical technique for obtaining maximum likelihood coefficient estimates, and the GLIM package (Baker & Nelder, 1978) was the first to provide IWLS in a commercial form. All professional-level statistical computing implementations now employ IWLS to find maximum likelihood estimates for generalized linear models. To fully understand the numerical aspects of this technique, we first discuss finding coefficient estimates in nonlinear models (i.e., simple root finding), then discuss weighted regression, and finally the iterating algorithm. This provides a background for understanding the special nature of reweighting estimation.
Newton-Raphson and Root Finding

In most parametric data-analytic settings in the social sciences, the problem of finding coefficient estimates given data and a model is equivalent to finding the "most likely" parameter value in the parameter space. For instance, in a simple binomial experiment where 10 flips of a coin produce five heads, the most likely value for the unknown but true probability of a heads is 0.5. In addition, 0.4 and 0.6 are slightly less likely to be the underlying probability, 0.3 and 0.7 are even less likely, and so forth. So the problem of finding a maximally likely value of the unknown probability is equivalent to finding the mode of the function for the probability given the data over the parameter space, which happens to be [0, 1] in this case. This process as described is essentially maximum likelihood estimation. In many settings, the problem of finding the best possible estimate for some coefficient value is simply finding a mode. In nonlinear models, we
are often driven to use numerical techniques rather than well-developed theory. Numerical techniques in this context refer to the application of some algorithm that manipulates the data and the specified model to produce a mathematical solution for the modal point. Unlike well-proven theoretical approaches, such as that provided by least squares for linear models or the central limit theorem for simple sampling distributions, there is a certain amount of "messiness" inherent in numerical analysis due to machine-generated round-off and truncation in intermediate steps of the applied algorithm. Well-programmed numerical techniques recognize this state of affairs and are coded accordingly.

If we visualize the problem of numerical maximum likelihood estimation as that of finding the top of an "anthill" in the parameter space, then it is easy to see that this is equivalent to finding the parameter value where the derivative of the likelihood function is equal to zero: where the tangent line is horizontal. Fortunately, many techniques have been developed by mathematicians to attack this problem. The most famous, and perhaps most widely used, is called Newton-Raphson and is based on (Sir Isaac) Newton's method for finding the roots of polynomial equations.

Newton's method is based on a Taylor series expansion around some given point. This is the principle that there exists a relationship between the value of a mathematical function (with continuous derivatives over the relevant support) at a given point, x0, and the function value at another (perhaps close) point, x1, given by

f(x1) = f(x0) + (x1 − x0) f′(x0) + (1/2!)(x1 − x0)² f″(x0) + (1/3!)(x1 − x0)³ f‴(x0) + . . . ,

where f′ is the first derivative with respect to x, f″ is the second derivative with respect to x, and so on. Infinite precision is achieved only with infinite application of the series (as opposed to just the four terms provided earlier) and is therefore unobtainable. For the purposes of most statistical estimation, only the first two terms are required as a step in an iterative process. Note also that the rapidly growing factorial function in the denominator means that later terms will be progressively more unimportant. Suppose we are interested in finding the point, x1, such that f(x1) = 0. This is a root of the function, f(), in the sense that it provides a solution to the polynomial expressed by the function. It can also be thought of as the point where the function crosses the x-axis in a graph of x versus
f(x). We could find this point using the Taylor series expansion in one step if we had an infinite precision calculator:

0 = f(x0) + (x1 − x0) f′(x0) + (1/2!)(x1 − x0)² f″(x0) + (1/3!)(x1 − x0)³ f‴(x0) + . . . .

Lacking that resource, it is clear from the additive nature of the Taylor series expansion that we could use some subset of the terms on the right-hand side to at least get closer to the desired point:

0 ≅ f(x0) + (x1 − x0) f′(x0).                                                                (5.1)

This shortcut is referred to as the Gauss-Newton method because it is based on Newton's algorithm but leads to a least squares solution in multivariate problems. Newton's method rearranges Equation 5.1 to produce at the (j + 1)th step:

x^(j+1) = x^(j) − f(x^(j)) / f′(x^(j)),                                                       (5.2)
so that progressively improved estimates are produced until f(x^(j+1)) is sufficiently close to zero. It has been shown that this method converges rapidly (quadratically in fact) to a solution provided that the selected starting point is reasonably close to the solution. However, the results can be disastrous if this condition is not met. The Newton-Raphson algorithm when applied to mode finding in a statistical setting adapts Equation 5.1 to find the root of the score function Equation 3.3: the first derivative of the log-likelihood. First consider the single-parameter estimation problem where we seek the mode of the log-likelihood function Equation 3.2 from page 24. If we treat the score function provided by Equation 3.3 as the function of analysis from the Taylor expansion, then iterative estimates are produced by

θ^(j+1) = θ^(j) − [ ∂ℓ(θ^(j)|y)/∂θ ] / [ ∂²ℓ(θ^(j)|y)/∂θ∂θ ].                                 (5.3)
Now generalize Equation 5.3 by allowing multiple coefficients. The goal is to estimate a k-dimensional θ̂ given data and a model. The applicable multivariate likelihood updating equation is provided by

θ^(j+1) = θ^(j) − [ ∂²ℓ(θ^(j)|y)/∂θ∂θ′ ]^(−1) [ ∂ℓ(θ^(j)|y)/∂θ ].                             (5.4)
52 Sometimes the Hessian matrix, H = ∂θ∂∂θ (θ (j) |y), is difficult to calculate and is replaced by its expectation with regard to θ , ∂2 (j) A = Eθ ∂θ ∂θ (θ |y) . This modification is referred to as Fisher scoring (Fisher, 1925). For exponential family distributions and natural link functions (Table 4.1), the observed and expected Hessian matrix are identical (Fahrmeir & Tutz, 1994, p. 39; Lehmann & Casella, 1998, pp. 124–128). The Hessian matrix plays an important role in assessing the statistical quality of estimated regression coefficients in GLMs: The standard errors of these estimated coefficients are calculated from the square root of the diagonal of the variance-covariance matrix, which is the negative inverse of the expected Hessian matrix. For GLMs, the statistical reliability of a given estimate is produced by dividing this estimate by its corresponding standard error exactly as in linear regression. In terms of estimation, the Hessian matrix is repeatedly calculated during the IWLS process, and the final version is used in inference. Occasionally, there are numerical problems with the Hessian matrix in more complicated GLMs or with problematic data. See Gill and King (2004) for a discussion. At each step of the Newton-Raphson algorithm, a system of equations determined by the multivariate normal equations must be solved. This is of the following form: 2
(θ^(j+1) − θ^(j)) A = − ∂ℓ(θ^(j)|y)/∂θ^(j).    (5.5)
Given that there already exists a normal form, it is computationally convenient to solve on each iteration by least squares. Therefore, the problem of mode finding reduces to a repeated weighted least squares application in which the inverse of the diagonal values of A are the appropriate weights. The next subsection describes weighted least squares in the general context.
Weighted Least Squares

The least squares estimate of linear model regression coefficients is produced by β̂ = (X′X)⁻¹X′Y. This is not only a solution that minimizes the summed squared errors, (Y − Xβ)′(Y − Xβ), but is also the maximum likelihood estimate. A standard technique for compensating for nonconstant error variance (heteroscedasticity) is to insert a diagonal matrix of weights, Ω, into the calculation of β̂ such that the heteroscedasticity is mitigated. The Ω matrix is created by taking the error variance of the ith case (estimated or known), v_i, and assigning its inverse to the ith
diagonal: Ω_ii = 1/v_i. The idea is that large error variances are reduced by multiplication by the reciprocal. To further explain this idea of weighted regression, begin with the standard linear model from Equation 1.2:

Y_i = X_i β + ε_i.    (5.6)
Now observe that there is heteroscedasticity in the error term, so ε_i = v_i ε, where the shared (minimum) variance is ε (i.e., nonindexed), and differences are reflected in the v_i term. To give a trivial, but instructive, example, visualize a heteroscedastic error vector: E = [1, 2, 3, 4]. Then ε = 1, and the v-vector is v = [1, 2, 3, 4]. So by the earlier logic, the Ω matrix for this example is

Ω = diag(1/v₁, 1/v₂, 1/v₃, 1/v₄) = diag(1, 1/2, 1/3, 1/4).
We can premultiply each term in Equation 5.6 by the square root of the Ω matrix (i.e., by the standard deviation). This "square root" is actually produced from a Cholesky factorization: If A is a positive definite,⁹ symmetric (A = A′) matrix, then there must exist a matrix G such that A = GG′. In our case, this decomposition is greatly simplified because the Ω matrix has only diagonal values (all off-diagonal values equal to zero). Therefore, the Cholesky factorization is produced simply from the square root of these diagonal values. Premultiplying Equation 5.6 as such gives

Ω^(1/2) Y_i = Ω^(1/2) X_i β + Ω^(1/2) ε_i.    (5.7)
So if the heteroscedasticity in the error term is expressed as the diagonals of a matrix, ε ∼ (0, σ²V), then Equation 5.7 gives Ω^(1/2)ε ∼ (0, σ²ΩV) = (0, σ²I), and the heteroscedasticity is removed. Now instead of minimizing (Y − Xβ)′(Y − Xβ), we minimize (Y − Xβ)′Ω(Y − Xβ), and the weighted least squares estimator is found by β̂ = (X′ΩX)⁻¹X′ΩY. The latter result is found by rearranging Equation 5.7. The weighted least squares estimator gives the best linear unbiased estimate (BLUE) of the coefficient vector in the presence of heteroscedasticity. Note also that if the residuals are homoscedastic, then the weights are simply 1 and Equation 5.7 reduces to Equation 5.6.
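In matrix terms the estimator is a single line of linear algebra. The following Python sketch computes β̂ = (X′ΩX)⁻¹X′ΩY on invented heteroscedastic data; nothing in it comes from the book's examples.

    import numpy as np

    # Weighted least squares, a minimal sketch of beta_hat = (X' Omega X)^{-1} X' Omega Y
    # with Omega = diag(1/v_i). The design, outcome, and variances are invented.

    rng = np.random.default_rng(0)
    n = 50
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    v = np.linspace(1.0, 4.0, n)                  # case-specific error variances
    beta_true = np.array([1.0, 2.0])
    Y = X @ beta_true + rng.normal(scale=np.sqrt(v))

    Omega = np.diag(1.0 / v)                      # weights are reciprocal variances
    beta_wls = np.linalg.solve(X.T @ Omega @ X, X.T @ Omega @ Y)
    print(beta_wls)                               # close to beta_true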
Iterative Weighted Least Squares

Suppose that the individual variances used to make the reciprocal diagonal values for Ω are unknown and cannot be easily estimated, but it is known that they are a function of the mean of the outcome variable: v_i = f(E[Y_i]). So if the expected value of the outcome variable, E[Y_i] = μ, and the form of the relation function, f(), are known, then this is a very straightforward estimation procedure. Unfortunately, even though it is very common for the variance structure to be dependent on the mean function, it is relatively rare to know the exact form of the dependence. A solution to this problem is to iteratively estimate the weights, improving the estimate on each cycle using the mean function. Since μ = g⁻¹(Xβ), the coefficient estimate, β̂, provides a mean estimate, and vice versa. So the algorithm iteratively estimates these quantities using progressively improving estimates and weights. The weights are improved analogously to the treatment of heteroscedasticity in the previous example: They reduce the size of the residuals (Y_i − E[Y_i]) and thus improve overall model fit, not just once but on each iteration. The general steps are as follows:

1. Assign starting values to the weights, generally equal to 1 (i.e., 1/v_i^(1) = 1, an unweighted regression), and construct the diagonal matrix Ω, guarding against division by zero.
2. Estimate β using weighted least squares with the current weights. The jth estimate is β̂^(j) = (X′Ω^(j)X)⁻¹X′Ω^(j)Y.
3. Update the weights using the new estimated mean vector: v_i^(j+1) = Var(μ̂_i), so the new weight for case i is 1/v_i^(j+1).
4. Repeat Steps 2 and 3 until convergence (i.e., until Xβ̂^(j) − Xβ̂^(j+1) is sufficiently close to zero).

Under very general conditions, satisfied by the exponential family of distributions, the iterative weighted least squares procedure finds the mode of the likelihood function, thus producing the maximum likelihood estimate of the unknown coefficient vector, β̂. Furthermore, the matrix produced by σ̂²(X′ΩX)⁻¹ converges in probability to the variance matrix of β̂, as desired. All modern software for GLMs uses some variant of this procedure; a bare-bones computational sketch of the iteration follows.
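The Python sketch below implements these four steps for a Poisson model with a log link; the simulated data exist only so that the loop has something to run on, and production software adds many refinements (step halving, convergence safeguards, and so on).

    import numpy as np

    # Iterative weighted least squares for a Poisson GLM with log link,
    # a bare-bones sketch of the steps listed above. Data are simulated
    # purely for illustration.

    rng = np.random.default_rng(1)
    n = 200
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    beta_true = np.array([0.5, 0.3])
    y = rng.poisson(np.exp(X @ beta_true))

    beta = np.zeros(X.shape[1])                    # starting values
    for _ in range(25):
        eta = X @ beta                             # linear predictor
        mu = np.exp(eta)                           # inverse link
        W = np.diag(mu)                            # working weights: Var(mu_i) = mu_i for Poisson
        z = eta + (y - mu) / mu                    # working (adjusted) response
        beta_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)
        if np.max(np.abs(beta_new - beta)) < 1e-10:
            beta = beta_new
            break
        beta = beta_new

    print(beta)   # essentially the maximum likelihood estimate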
Because we have an explicit link function identified in a generalized linear model, the form of the multivariate normal equations (Equation 5.5) is modified to include this embedded transformation:

(θ^(j+1) − θ^(j)) A = − [∂ℓ(θ^(j)|y)/∂g⁻¹(θ)] [∂g⁻¹(θ)/∂θ].    (5.8)
It is easy to see that, in the case of the linear model, when the link is just the identity function, Equation 5.8 simplifies to Equation 5.5. The overall strategy of the IWLS procedure for generalized linear models is fairly simple: Newton-Raphson with Fisher scoring applied iteratively to the modified normal equations of Equation 5.8. For excellent detailed analyses and extensions of this procedure, the reader is directed to Green (1984) and del Pino (1989).

Example 5.1: Poisson GLM of Capital Punishment, Continued

Returning to the problem of modeling the application of capital punishment at the statewide level in the United States, we now implement the iterative weighted least squares algorithm to produce the desired β̂ coefficients in E[Y] = g⁻¹(Xβ̂). This produces the output in Table 5.1. The iterative weighted least squares algorithm converged in three iterations in this example, partly due to the simplicity of the example and partly due to the well-behaved structure of the likelihood surface. The standard errors are calculated from the square root of the diagonal of the variance-covariance matrix, which is the negative inverse of the expected Hessian matrix discussed previously: A = E_θ[∂²ℓ(θ^(j)|y)/∂θ∂θ′]. Since the expected Hessian calculation is used in this example, the estimation algorithm is Fisher scoring.

Table 5.1  Modeling Capital Punishment in the United States: 1997

                       Coefficient   Standard Error   95% Confidence Interval
(Intercept)              −6.3068         4.1823       [−14.5040, 1.8903]
Median Income             0.0003         0.0001       [0.0002, 0.0004]
Percent Poverty           0.0690         0.0799       [−0.0877, 0.2256]
Percent Black            −0.0950         0.0228       [−0.1398, −0.0502]
log(Violent Crime)        0.2212         0.4427       [−0.6465, 1.0890]
South                     2.3099         0.4291       [1.4689, 3.1509]
Degree Proportion       −19.7022         4.4679       [−28.4593, −10.9452]

Null deviance: 136.5728, df = 16        Maximized ℓ(): −31.7376
Summed deviance: 18.2122, df = 10       AIC: 77.4752

The variance-covariance matrix is often useful in these settings for determining the existence of problems such as multicollinearity (large off-diagonal values) and near-nonidentifiability (rows or columns with all values equal to or near zero). The variance-covariance matrix in this problem shows no signs of such pathologies:

VC = (−A)⁻¹ =

               Int       INC       POV       BLK   log(CRI)      SOU       DEG
Int        17.4917   −0.0001   −0.1990    0.0177   −1.4866    0.3678   −4.6768
INC        −0.0001    0.0000    0.0000    0.0000    0.0000    0.0000   −0.0010
POV        −0.1990    0.0000    0.0064    0.0002    0.0039   −0.0178    0.1221
BLK         0.0177    0.0000    0.0002    0.0005   −0.0033   −0.0051   −0.0337
log(CRI)   −1.4866    0.0000    0.0039   −0.0033    0.1960   −0.0012    0.3981
SOU         0.3678    0.0000   −0.0178   −0.0051   −0.0012    0.1841    0.3001
DEG        −4.6768   −0.0001    0.1221   −0.0337    0.3981    0.3001   19.9625
Several interesting substantive conclusions are provided by this model. The coefficients for the percentage of poverty in the state and the previous year's log-crime rate have 95% confidence intervals that bound zero. So there is no evidence provided by this model and these data that the rate of executions is tied to poverty levels or the preceding year's crime rate. These are often used as explanations for higher murder rates and therefore presumably higher execution rates. However, higher income and education levels both have 95% confidence intervals bounded away from zero (far away, in fact). The coefficient for income is positively signed, implying that higher levels of income are associated with more executions. Interestingly, states with higher education levels tend to have fewer executions. It has been suggested that increased education (generally at the university level) can instill a distaste for capital punishment. The negative sign on the coefficient for percent Black population is also interesting, as it has a 95% confidence interval bounded away from zero. One possible explanation is linked to the preponderance of evidence that the death penalty is applied disproportionately to Black prisoners (Baldus & Cole, 1980). That is, greater numbers of African Americans in a state constitute a greater political force against applying the death penalty. The large and positive coefficient indicating that a state is in the South is not surprising given the history of capital punishment in that region of the country. Two admonitions are warranted at this point. First, note that 95% confidence intervals are provided in Table 5.1 rather than t-statistics and p-values. Actually, the four coefficients with 95% confidence intervals bounded away from zero in this model could have reported 99.9% confidence intervals bounded away from zero if it were important to provide
such a finding. The use of confidence intervals rather than p-values or "stars" throughout the text is done to avoid the common misinterpretations of these devices that are prevalent in the social sciences (Gill, 1999). Confidence intervals provide all the information that p-values would supply: A 95% confidence interval bounded away from zero is functionally equivalent to a p-value lower than .05. Second, the coefficients in Table 5.1 should not be interpreted like linear model coefficients: A one-unit change in the kth explanatory variable value does not produce a βk change in the outcome variable because the relationship is expressed through the nonlinear link function. A more appropriate interpretation is to look at first differences: analyzing outcome variable differences at two researcher-determined levels of some explanatory variable, holding the other variables constant or at some interesting values. The recipe is as follows:

1. Pick one covariate of interest, X_q.
2. Choose two levels of this variable, X_{1,q} and X_{2,q}.
3. Set all other covariates besides X_q at their means, X̄_{−q}.
4. Create two predictions by running these values through the inverse link function:
   Ŷ₁ = g⁻¹(X̄_{−q} β̂_{−q} + X_{1,q} β̂_q)
   Ŷ₂ = g⁻¹(X̄_{−q} β̂_{−q} + X_{2,q} β̂_q).
5. Look at Ŷ₂ − Ŷ₁.

Naturally, this can be done with multiple variables of interest and reported in a table or a plot (we have several examples in other sections). One immediate question is what levels/quantiles of difference in X_q should be used for a continuous explanatory variable (discrete random variables immediately suggest using their specified values unless there are a lot of categories). One recommended suggestion is the interquartile range (IQR), the 0.25 and 0.75 quantile points, since this covers the middle 50% of the data and contrasts a low but typical level with a high but typical level. Some authors prefer to use the minimum and maximum of the data, but this often exaggerates the magnitude of the effect and by definition uses atypical observations. In the capital punishment example, if we hold all the explanatory variables constant at their means except for the dummy variable indicating
whether or not the state is in the South, then the first difference for this dummy variable is 8.156401. This means that there is an expected increase of about 8 executions per year just because a state is in the South. It should be noted, as we will see later when discussing residuals, that the Texas case is driving this finding to a great degree.
Figure 5.1 provides another way of looking at the output from the Poisson GLM of capital punishment. In this graphical display, the expected count of executions is plotted along the y-axis, and each of the explanatory variables is plotted over its observed range along the x-axis, with the dummy variable either off (thin line) or on (thick line). The variables not displayed in a particular graph are held constant at their means. In this way, we can see how changes in a specified explanatory factor differ in affecting the outcome variable depending on the status of the dichotomous variable, controlling for the others. For example, in Panel 1, we see that as income increases, the expected number of executions increases only slightly for non-South states but increases dramatically and seemingly exponentially for states in the South. There is a similar effect apparent between Panels 3 and 5. As both the percentage of Blacks and the education level increase, the expected number of executions for states in the South dramatically decreases until it nearly converges with non-South states at the upper limit. Note that this approach provides far more information than simply observing that the sign on the coefficient estimate is negative.

Figure 5.1  Controlling for the South: Capital Punishment Model. [Panels plot expected executions against the observed ranges of INCOME, PERPOVERTY, PERBLACK, log(VC100k96), and PROPDEGREE, with separate curves for South and non-South states.]
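A computational version of the first-difference recipe is easy to write. The Python sketch below uses statsmodels to fit a Poisson GLM and then contrasts predictions at SOUTH = 0 and SOUTH = 1 with all other covariates at their means; the data frame `df` and the column names are hypothetical stand-ins, so the printed number is not guaranteed to reproduce the 8.156401 reported above.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # First differences for a dummy variable in a Poisson GLM, a sketch of the
    # five-step recipe above. The data frame `df` and the variable names are
    # assumptions standing in for the capital punishment data.

    formula = "EXECUTIONS ~ INCOME + PERPOVERTY + PERBLACK + LOGCRIME + SOUTH + PROPDEGREE"
    fit = sm.GLM.from_formula(formula, data=df, family=sm.families.Poisson()).fit()

    # hold every covariate at its mean, then vary only the SOUTH dummy
    profile = df.mean(numeric_only=True).to_frame().T
    low, high = profile.copy(), profile.copy()
    low["SOUTH"], high["SOUTH"] = 0, 1

    preds = np.asarray(fit.predict(pd.concat([low, high], ignore_index=True)))
    print(preds[1] - preds[0])   # first difference in expected executions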
Profile Likelihood Confidence Intervals

In the previous example about the death penalty, we computed confidence intervals for all the estimates that the Poisson model yields, as a way of assessing whether any of the variables of interest have an effect on the outcome. Those reported in the fourth column of Table 5.1 are Wald-type intervals with upper and lower bounds computed in the following way:

[LB, UB] = β̂_k ± z_{1−α/2} × √(VC_{k,k}),    (5.9)

where z is the critical value for the 1 − α level, and k is the number of parameters (including the intercept). Note that the square root of the (k, k) entry of the variance-covariance matrix, VC = (−A)⁻¹, is the standard error of the kth coefficient. Alternatively, in cases with multiple parameters and a moderate number of observations, we use the quantile
t_{n−k, 1−α/2} from the Student's t distribution with n − k degrees of freedom instead of z_{1−α/2}.
The form given in Equation 5.9 assumes that the MLE is normally distributed. However, in certain cases (e.g., small to moderate sample size or sparse data), this assumption is not necessarily met, leading the Wald-type confidence intervals to perform poorly. It could also happen that the confidence interval is not helpful given that it contains values outside the support of a variable (e.g., when the parameter itself is close to the boundaries). Instead, researchers should consider the computation of profile likelihood confidence intervals. These intervals are based on an asymptotic χ² distribution of the log-likelihood ratio test and tend to perform better than Wald-type confidence intervals in certain cases, including small-n applications.
The underlying intuition of the profile likelihood confidence interval is to invert the likelihood ratio test to bound a parameter of interest. More specifically, a 100(1 − α)% CI for the parameter θ is the set of all values θ₀ such that a two-sided test of the null hypothesis (H₀: θ = θ₀) would not be rejected at the α level of significance. To have a better picture of this, consider the following likelihood ratio test and its asymptotic distribution:

LRT = 2[ℓ(θ̂, ψ̂) − ℓ(θ₀, ψ̂₀)] = 2[ℓ(θ̂, ψ̂) − ℓ̃(θ₀)] ∼ χ²(1),

where ℓ( ) is a log-likelihood function, θ̂ and ψ̂ are the MLEs of θ and ψ in the fully specified model, and ψ̂₀ is the MLE of ψ for the reduced model when θ = θ₀. If that null hypothesis is true, then the LRT will not be rejected at the α level if and only if

2[ℓ(θ̂, ψ̂) − ℓ̃(θ₀)] < χ²_{1−α}(1),

or, equivalently,

ℓ̃(θ₀) > ℓ(θ̂, ψ̂) − χ²_{1−α}(1)/2.
Note that χ²_{1−α}(1) is the 1 − α quantile of the χ² distribution with one degree of freedom. Furthermore, recall that ℓ(θ̂, ψ̂) is fixed, so we need to construct the profile log-likelihood function for the parameter of interest θ and find the interval over which the inequality above holds. This interval is the profile likelihood confidence interval. The profile log-likelihood function is the maximum of the log-likelihood function over the values of the remaining parameters; thus, ℓ̃(θ) in the following is not a log-likelihood function per se, but it is composed of maximum values of a log-likelihood function:

ℓ̃(θ) = max_ψ ℓ(θ, ψ).
Example 5.2: Profile Likelihood Confidence Intervals for the Death Penalty Example

Consider the death penalty example from previous sections. We have that E[Y|X] = exp(β₀ + β₁INC + β₂POV + β₃BLK + β₄log(CRI) + β₅SOU + β₆DEG), and for illustration purposes, we want to compute the profile likelihood confidence interval for the effect of belonging to the South, β₅. To achieve this, we conduct the following steps:

1. Determine an interval of potential values for β₅ (e.g., 200 values between β̂₅ ± 3 × √(VC₅,₅)).
2. Define a new model E[Y|X̃] = exp(X̃β̃), where X̃ does not include SOUTH, and β₅ is fixed to one of the values defined in Step 1 (i.e., β₅ will not be estimated). The tools to estimate the model are the same as the ones used for the full model, with the addition of an offset in the equation corresponding to each of the specified values of β₅ × SOUTH.
3. Recover the maximum log-likelihood ℓ̃ for each of the models run in Step 2. These values correspond to the profile log-likelihood of β₅. Figure 5.2 illustrates this curve.
4. Determine the level of confidence desired and compute the value of
χ²_{1−α}(1)/2. For 95% confidence, this component is approximately 1.92.
5. Compute the log-likelihood for the full model, ℓ(θ̂, ψ̂), which in this case is equal to −31.73761. The horizontal line in Figure 5.2 represents the quantity ℓ(θ̂, ψ̂) − χ²_{1−α}(1)/2; in numbers, this corresponds to −31.73761 − 1.92 = −33.66. The intersection of the profile log-likelihood curve and this line indicates the lower and upper bounds of the profile likelihood confidence interval. Mathematically, these bounds correspond to the roots of ℓ̃(β₅) when this function is set equal to the cutoff value.

Figure 5.2 shows the comparison between the Wald-type and profile likelihood confidence intervals.
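The following Python sketch follows Steps 1 through 5 using the offset trick described in Step 2; the objects `y`, `X` (a design matrix with a constant column but without the SOUTH indicator), and `south` are placeholders for the capital punishment data rather than the original analysis files.

    import numpy as np
    import statsmodels.api as sm
    from scipy.stats import chi2

    # Profile likelihood confidence interval for one coefficient, a sketch of
    # Steps 1-5 above. `y`, `X`, and `south` are assumed to exist already.

    full = sm.GLM(y, np.column_stack([X, south]), family=sm.families.Poisson()).fit()
    beta5_hat = full.params[-1]
    se5 = np.sqrt(full.cov_params()[-1, -1])
    cutoff = full.llf - chi2.ppf(0.95, df=1) / 2          # horizontal line in Figure 5.2

    grid = np.linspace(beta5_hat - 3 * se5, beta5_hat + 3 * se5, 200)
    profile = []
    for b5 in grid:
        # fix beta5 at b5 by entering it as an offset; only the other betas are estimated
        reduced = sm.GLM(y, X, family=sm.families.Poisson(), offset=b5 * south).fit()
        profile.append(reduced.llf)
    profile = np.array(profile)

    inside = grid[profile >= cutoff]                      # values not rejected by the LRT
    print(inside.min(), inside.max())                     # profile likelihood CI bounds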
Figure 5.2  Comparison of Confidence Intervals for the SOUTH Coefficient in the Death Penalty Example. [The profile log-likelihood is plotted against the coefficient of SOUTH, with the Wald-type interval shown for comparison.]
Although in this case both perform very similarly, we can still see that the latter is slightly wider. In this case, the bounds of the Wald-type CI are [1.4689, 3.1509], and the bounds of the profile likelihood CI are [1.5120, 3.2062]. The two types of confidence intervals for the rest of the variables are presented in Table 5.2.

Table 5.2  Wald-Type and Profile Likelihood Confidence Intervals (Death Penalty Example)

                      95% Wald Type             95% Profile Likelihood
(Intercept)           [−14.5040, 1.8903]        [−14.5512, 1.9583]
Median Income         [0.0002, 0.0004]          [0.0002, 0.0004]
Percent Poverty       [−0.0877, 0.2256]         [−0.0882, 0.2266]
Percent Black         [−0.1398, −0.0502]        [−0.1418, −0.0515]
log(Violent Crime)    [−0.6465, 1.0890]         [−0.6475, 1.0965]
South                 [1.4689, 3.1509]          [1.5120, 3.2062]
Degree Proportion     [−28.4593, −10.9452]      [−28.8877, −11.2536]

Example 5.3: Gamma GLM of Electoral Politics in Scotland, Continued

Returning to the Scottish voting example discussed in the previous chapter, we now run the GLM with the gamma link function, θ = −1/μ. This produces the output in Table 5.3. The iterative weighted least squares algorithm converged in two iterations in this example. The dispersion parameter, a(ψ) = 1/δ, is estimated to be 0.00358. We will use this information when we return to this example during the following discussion of residuals and model fit.

Table 5.3  Modeling the Vote for Parliamentary Taxation: 1997

                                     Coefficient   Standard Error   95% Confidence Interval
(Intercept)                            −1.7765        1.1479        [−4.0396, 0.4601]
Council Tax                             0.0050        0.0016        [0.0018, 0.0082]
Female Unemployment                     0.2034        0.0532        [0.0999, 0.3085]
Standardized Mortality                 −0.0072        0.0027        [−0.0125, −0.0019]
Economically Active                     0.0112        0.0041        [0.0032, 0.0191]
GDP                                    −0.0001        0.0001        [−0.0004, 0.0001]
Percent Aged 5−15                      −0.0519        0.0240        [−0.0991, −0.0049]
Council Tax × Female Unemployment      −0.0002        0.0001        [−0.0004, −0.0001]

Null deviance: 0.536072, df = 31        Maximized ℓ(): 63.89
Summed deviance: 0.087389, df = 24      AIC: −111.78

In addition, an interaction term between the council tax variable and the female unemployment variable is added to the model. This is done in exactly the same way as in standard linear models. Here it shows some
evidence that increasing amounts of council taxes are associated with a decrease in the slope of female unemployment change. It should be noted that no causality is thus asserted. The resulting model has a number of interesting findings. First, it is surprising that GDP is not 95% CI bounded away from zero (i.e., not statistically significant at the p < .05 level). One would think that the level of economic production in a given region would shape attitudes
about taxation policy and taxation authority, but there is no evidence of that effect from these data and this model. The other economic variable, the current level of the council tax, does have a coefficient that is 95% CI bounded away from zero. The sign is positive, suggesting that council districts with higher taxes (such as Glasgow) see parliamentary taxation as a potential substitute for an uneven levy that currently disadvantages them. Also, the higher-taxed districts are generally more urban, and it could be that urban voters rather than higher-taxed voters have a greater preference for parliamentary taxation authority (although this claim is not specifically tested here). Each of the social variables has a 95% CI bounded away from zero, and the model clearly favors these social effects over economic effects as explanations of the vote. Both employment-related variables are signed positively, which is a little befuddling. Higher levels of unemployment, as measured through female applications for benefits, are associated with greater support for the new taxation authority. Yet higher levels of working-age individuals participating in the economy are also associated with greater support. The mortality index is negatively signed, which seems to imply that those constituencies with higher death rates (potentially associated with poorer health services, higher crime rates, etc.) are less enthusiastic about parliamentary taxation or perhaps more skeptical and less receptive to change. This finding is in line with studies that find that mortality salience is associated with shifts toward conservatism and risk-averse attitudes. Finally, those districts with higher numbers of children in the 5 to 15 range are less likely to support the proposition. Since the current council tax provides breaks for families with children, this could reflect a concern that some new tax scheme in the future might have different priorities. The quality of the fit of the model developed in this example is analyzed in the following example. In the interim, one quick indication is that all but one of the explanatory variables has a 95% confidence interval bounded away from zero (equivalent to a Wald test).
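For readers who want to reproduce this kind of specification, the Python sketch below fits a gamma GLM with an interaction term using statsmodels; the data frame name and column names are assumptions standing in for the Scottish voting data, the default inverse link is used, and no claim is made that this is the exact code behind Table 5.3.

    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # A gamma GLM with an interaction term, a sketch of the kind of
    # specification reported in Table 5.3. `scotland` and its columns
    # (YES, COUTAX, UNEMPF, MOR, ACT, GDP, AGE) are assumed names.

    model = smf.glm(
        "YES ~ COUTAX * UNEMPF + MOR + ACT + GDP + AGE",
        data=scotland,
        family=sm.families.Gamma(),   # default inverse link for the gamma family
    )
    fit = model.fit()
    print(fit.summary())
    print(fit.scale)   # an estimate of the dispersion parameter a(psi)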
Example 5.4: Multinomial GLM to Model Vote Intention in the U.S. Republican Presidential Primaries, Continued

Let's return to the example about the primary elections in the United States that we introduced in the previous chapter. We can also run the iterative weighted least squares algorithm to obtain the estimates β̂ that will help us to assess the effects that multiple variables have on the probability of voting for Trump in the primaries. The output from
the multinomial model does not follow the same structure as the other models. Notice that we have a set of coefficients β for each of the r − 1 candidates that are not the baseline. For this example, the baseline is Trump, and therefore, the model yields three sets of coefficients, for Cruz, Kasich, and others, as well as standard errors for each of those sets. Figure 5.3 illustrates the magnitude of each set of coefficients for all the variables under analysis with their respective 95% confidence intervals.
Figure 5.3  Modeling Vote Intention in the Republican Primaries in the United States, 2016. [Coefficient estimates with 95% confidence intervals for Cruz and Kasich (Trump is the baseline), across AGE, GENDER, EDUCATION, REGION, RELIGIOSITY, IDEOLOGY, RWA (authoritarianism), and TRUMPWIN categories.]
The white bubbles with "K" represent the coefficients for Kasich, and the black bubbles with "C" are the estimated coefficients for Cruz. The baseline candidate excluded is Trump, and for the sake of space and parsimony, we excluded "Others" from the results. The figure shows interesting findings regarding the composition of the Trump electoral base. For example, the most educated respondents in the sample (those with a bachelor's degree or higher) are significantly more likely to support Kasich than Trump. We can also observe that higher scores on the ideology scale (corresponding to more liberal attitudes) are also associated with higher probabilities of supporting Kasich over Trump. In line with theories of bandwagon effects, the results from the model also indicate that those who consider that Trump can win the election are significantly more likely to vote for him.
To have a better understanding of the implications of these results, we can compute predicted probabilities of voting for each of these candidates as well as first differences. For example, the left panel of Figure 5.4 shows the probabilities of supporting each of the candidates depending on whether respondents consider that Trump could win the election. The rest of the characteristics are fixed at either the mode (for categorical variables) or the mean (for numerical variables). The plot shows that, on average, the probability of voting for Trump increases from 0.05 to 0.46 when respondents think that he could win. The results are inverted and of lower magnitude for the other candidates: Those in the "No" category are more likely to support Kasich and Cruz, providing support to the hypothesis that voters actively decide to choose "winners" or rationalize that their preferred choice has an actual chance of succeeding. In the other panel, we can analyze the effect of changing across the ideological spectrum. The probability of voting for Trump decreases by 0.5 when we change from the first to the third quartile of the ideology distribution in the liberal direction. In contrast, the probability of voting for Kasich increases by almost 0.15 when we move between those quartiles. While some of these results are in line with theoretical expectations regarding voting behavior, they also provide some interesting evidence about the composition of Trump's base that goes beyond traditional explanations related to partisanship, gender, or religiosity.

Figure 5.4  Predicted Vote Intention Depending on Perceptions of Winnability and Ideology. [Two panels plot the predicted probability of voting for Trump (T), Cruz (C), and Kasich (K): one by perceptions of whether Trump could win the election (No, Yes, Don't know) and one across ideology scores.]
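The predicted probabilities just described come from the multinomial (softmax) transformation of the linear predictors, with the baseline category's predictor fixed at zero. The Python sketch below shows that calculation; the coefficient matrix and covariate profiles are invented placeholders, not the estimates behind Figure 5.4.

    import numpy as np

    # Predicted probabilities from a multinomial logit with Trump as the
    # baseline category. `B` (one row of betas per non-baseline candidate)
    # and the covariate profiles are hypothetical.

    def predicted_probs(B, x):
        """Softmax over the baseline (0) and the r-1 non-baseline linear predictors."""
        eta = np.concatenate([[0.0], B @ x])      # baseline first, then Cruz, Kasich
        expeta = np.exp(eta - eta.max())          # subtract the max for numerical stability
        return expeta / expeta.sum()

    # invented profiles: intercept, winnability dummy, ideology score
    x_no  = np.array([1.0, 0.0, 0.2])
    x_yes = np.array([1.0, 1.0, 0.2])
    B = np.array([[0.4, -2.1, 0.8],               # invented coefficients for Cruz
                  [0.2, -1.8, 1.1]])              # invented coefficients for Kasich

    print(predicted_probs(B, x_yes) - predicted_probs(B, x_no))   # first differences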
Comments on Estimation

This chapter is about how GLM coefficients are estimated, how to interpret those estimates, and how to intuitively explain the implications of estimated quantities.
The most general definition of estimation, of course, is nothing more than using sample quantities to make statements about unknown (and generally unknowable) population quantities. In our case with GLMs, that is mostly putting a "hat" on the beta vector: β → β̂. But there is obviously more to it. We have focused here on the maximum likelihood estimator for GLMs because of the elegance, properties, and computational robustness of doing so. But there are other approaches, such as the generalized method of moments and Bayesian posterior inference. Each of these has strengths, but they are large topics on their own that are further separated from GLM theory and therefore not touched upon herein. Another related discussion is the contrast between inference and prediction. Certainly, there is great overlap since prediction needs inference to occur beforehand, and the goal of inference can (even strictly) be prediction. But with the advent of machine learning and other approaches that blur the lines between statistics and other fields, a few clarifications are worth noting. First, inference alone implies a generality of approach wherein the effects of major population phenomena are asserted to explain variation in an outcome variable. So the focus is on a semicausal relationship whose quality depends on many things, including the efficacy of the model, the quality of the data, the sampling process, measurement error, missing data, software, and more. Since prediction is downstream, it conditionally depends on those issues as well. However, the focus of prediction is more on making claims about the probability of specific outcomes: categories or ranges of the outcome variable. In a classic machine learning context, the point is to build and test models that make the most accurate possible predictions while not worrying so much about the reliability of individual coefficient estimates. This also can make the prediction process less general, as there is almost always a prediction accuracy gain from being less general. The emphasis here has been on estimation, and we discuss prediction more in future chapters.
CHAPTER 6. RESIDUALS AND MODEL FIT
Defining Residuals

Residuals (errors, disturbances) are typically analyzed in linear modeling with the goal of identifying poorly fitting values. If it is observed that there exists a sufficiently large number of these poorly fitting values, then often the linear fit is determined to be inappropriate for the data. Other common uses of residuals include looking for signs of nonlinearity, evaluating the effect of new explanatory variables, creating goodness-of-fit statistics, evaluating leverage (distance from the multidimensional explanatory mean vector) and influence (change exerted on the coefficients) for individual data points, and general model comparison. Because of the generalization to a wider class of outcome variable forms, residuals in generalized linear models are often not normally distributed and therefore require a more thoughtful postmodel analysis. Despite this challenge, we would very much like to have a form of the residuals from a generalized linear model that is as close as possible to normally distributed around zero, or at least "nearly identically distributed" in Cox and Snell's (1968) language. The motivation is that we can then apply a wide range of graphical and inferential tools developed for the linear model to investigate potential outliers and other interesting features. The core emphasis in this chapter is the discussion of deviance residuals, which are attempts to describe the stochastic behavior of the data relative to a constructed generalized linear model in a format that closely resembles the normal theory analysis of standard linear model residuals. Actually, four main types of residuals are used with generalized linear models and are discussed in this chapter: response, Pearson, working, and deviance. For the standard linear model, these forms are equivalent. However, for other exponential family forms, they can differ substantially and confusingly. There are also Anscombe residuals, which rely on differential equations to transform the residuals in such a way that first-order asymptotic skewness is mitigated and the form is approximately unimodal and symmetric (Anscombe, 1960, 1961). Since these are no longer used, they are not discussed here. A substantial advantage of the generalized linear model is its freedom from the standard Gauss-Markov assumption that the residuals have mean zero and constant variance. Yet this freedom comes with
the price of interpreting more complex stochastic structures. Currently, the dominant philosophy is to assess this stochastic element by looking at (summed) discrepancies: a function that describes the difference between observed and expected outcome data for some specified model: D = Σ_{i=1}^n d(θ, y_i). This definition is left intentionally vague for the moment to stress that the format of D is widely applicable. For instance, if the discrepancy in D is measured as the squared arithmetic difference from a single mean, then this becomes the standard form for the variance. In terms of generalized linear models, the squared difference from the mean will prove to be an overly restrictive definition of discrepancy, and a likelihood-based measure will show to be far more useful. For the standard linear model, the residual vector not only is quite easy to calculate but also plays a central role in determining the quality of fit of the model since it leads to fit summaries like R² and the F-statistic. The response residual vector is calculated simply as R_Response = Y − Xβ and is used to measure both the dispersion around the fitted line and the level of compliance with the Gauss-Markov assumptions. As applied to generalized linear models, the linear predictor needs to be transformed by the inverse link function to be comparable with the response vector. Therefore, the response residual vector for generalized linear models is R_Response = Y − g⁻¹(Xβ). It is seldom mentioned in introductory texts, but the linear model is moderately robust to minor deviations from the Gauss-Markov assumptions. Individual cases in social science datasets are not uncommonly correlated in some relatively mild or benign fashion. In fact, it is nearly impossible to produce a realistic and realistically large collection of social science explanatory variables that are truly independent (hence, we should not use the common phrase "independent variables"). In addition, there will sometimes be large outliers with or without influence (individually causing a nontrivial change in the estimated slopes). In many of these occurrences, the linear model is resistant, meaning that the substantive conclusions from the linear model are barely affected or at least minimally affected in comparison to the assumed effects of measurement error. In sum, resistance and robustness give the toughness of the linear model. With the linear model, asymptotic normality of the residuals is still achievable in a more general setting by appealing to the Lindeberg-Feller variant of the central limit theorem. This theorem relaxes casewise independence in favor of the condition that no single term dominates in
the sum. Counter to some textbooks' incorrect assertions, normality of residuals is not one of the Gauss-Markov assumptions (linear functional form, zero mean residuals, homoscedasticity, noncorrelated errors, and exogeneity of explanatory variables); it is a property that comes with sample size. It is far more typical with generalized linear models to produce residuals that deviate substantially rather than mildly from the basic conditions. In these cases, response residuals tell us very little.
A basic alternative to the standard response residual is the Pearson residual. This is the response residual scaled by the standard deviation of the prediction:

R_Pearson = (Y − μ) / √(Var[μ]).

Pearson residuals are an attempt to provide some sense of scale to the response residual by dividing by the standard error of the prediction. The name comes from the fact that the sum of the squared Pearson residuals for a Poisson generalized linear model is the Pearson χ² goodness-of-fit measure reported by all statistical packages. In ideal and large-sample situations, the Pearson residuals are approximately normally distributed. Unfortunately, like response residuals, Pearson residuals can be very skewed for GLMs and can therefore provide a misleading measure of typical dispersion.
In the process of fitting generalized linear models, software programs use the iterative weighted least squares algorithm. As described in Chapter 5, a set of working weights is calculated at each step of a linear estimation until the appropriate derivative is sufficiently close to zero. An occasionally useful quantity is the residual produced from the last (i.e., determining) step of the iterative weighting process: the difference between the current working response and the linear predictor. This is defined as

R_Working = (y − μ) ∂θ/∂μ.

This residual is sometimes used as a diagnostic for evaluating convergence of the fitting algorithm as well as an indication of model fit at this point. A lack of general theory for working residuals hampers their use in a broader context.
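To make the response and Pearson forms concrete, the short Python sketch below computes both for a Poisson fit; the observed counts and fitted means are invented numbers, not output from the text's examples.

    import numpy as np

    # Response and Pearson residuals for a Poisson GLM. `y` holds observed
    # counts and `mu` the fitted means g^{-1}(X beta-hat); both are invented
    # here purely for illustration.

    y = np.array([37, 9, 6, 4, 3, 2, 2])
    mu = np.array([35.3, 8.1, 1.4, 3.7, 2.0, 1.0, 1.9])

    resp = y - mu                        # response residuals
    pearson = resp / np.sqrt(mu)         # Var[mu_i] = mu_i for the Poisson family
    print(resp)
    print(pearson)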
The Deviance Function and Deviance Residuals

By far the most useful category of discrepancies for the generalized linear model is the deviance residual. This is also the most general
form. A common way to look at model specification is the analysis of a likelihood ratio statistic comparing a proposed model specification relative to the saturated model (n data points, n specified parameters, and using the exact same data and link function) and the null model (an intercept-only GLM, again with the exact same data and link function). The difference between these three models is measured by summing the individual deviances. Sometimes the deviance function (or summed deviance) is indexed by a weighting factor, w_i, to accommodate grouped data. Also, when a(ψ) is included, Equation 6.1 is called the scaled deviance function; otherwise, it is predictably called unscaled. Starting with the log-likelihood for a proposed model in the Equation 3.2 notation, add the hat notation as a reminder that it is evaluated at the maximum likelihood values:

ℓ(θ̂, ψ|y) = Σ_{i=1}^n [y_i θ̂ − b(θ̂)]/a(ψ) + c(y, ψ).

Also consider the same log-likelihood function with the same data and the same link function, except that it now has n coefficients for the n data points, that is, the saturated model log-likelihood function, with the tilde notation to indicate the n-length θ vector:

ℓ(θ̃, ψ|y) = Σ_{i=1}^n [y_i θ̃ − b(θ̃)]/a(ψ) + c(y, ψ).

This is the highest possible value for the log-likelihood function achievable with the given data, y. Yet it is also often unhelpful analytically except as a benchmark. Finally, define a third model with the same data and the same link function again, except that no explanatory variables are used and the X matrix is just a column of 1s. This is called the null or mean model and provides the "worst" fit of any model under these restrictions since there are no covariates to help explain variation in y; it is denoted here with the bar notation on the θ vector:

ℓ(θ̄, ψ|y) = Σ_{i=1}^n [y_i θ̄ − b(θ̄)]/a(ψ) + c(y, ψ).
This setup allows two important comparisons: the proposed model versus the saturated model and the proposed model versus the null model. The first comparison tells us how far we are from a perfect fit, and the
second comparison tells us how far we are from the most basic possible model of the same form. The deviance function for these comparisons is defined as minus twice the log-likelihood ratio, which is just an arithmetic difference since both terms are already on the log metric. For instance, the deviance function between the constructed model and the saturated model is

D(θ̃, y) = −2 [ℓ(θ̂, ψ|y) − ℓ(θ̃, ψ|y)]
         = −2 Σ_{i=1}^n { [y_i θ̂ − b(θ̂)]/a(ψ) + c(y_i, ψ) − [y_i θ̃ − b(θ̃)]/a(ψ) − c(y_i, ψ) }
         = −2 Σ_{i=1}^n { y_i(θ̂ − θ̃) − [b(θ̂) − b(θ̃)] } a(ψ)⁻¹,    (6.1)
where the same calculation substituting θ̄ for θ̃ above gives the deviance function between the constructed model and the null model. Here y_i(θ̂ − θ̃) is the data-weighted difference between the proposed-model and saturated-model maximum likelihood estimates, and b(θ̂) − b(θ̃) is the corresponding difference in the b(θ) terms. Observe also that the b(θ) function developed in Chapter 2 plays a critical role once again. The proposed model, which reflects the researcher's belief about the identification of the systematic and random components, has a summed deviance that is always higher than the saturated summed deviance of zero (every point is fit perfectly). Also, the proposed model has a summed deviance that is always lower than that of the null model. Conveniently, these differences are on the χ² metric, such that tail values indicate that the proposed model is statistically distinct from either the saturated or the null model, with the degrees of freedom equal to the difference in the number of parameters specified. Thus, for p explanatory variables, hypothesis tests are performed with

Saturated comparison:  D(θ̃, y) ∼ χ²_{n−p}
Null comparison:       D(θ̄, y) ∼ χ²_{p−1}.
These tests are based on asymptotic principles, and the asymptotic rate of convergence varies depending on the exponential family form, suggesting caution with small samples. Although calculating D(θ , y) is relatively straightforward, we usually do not need to develop this
calculation, as many texts provide the result for frequently used PDFs and PMFs, and modern statistical software provides the results for a given estimation. For the running examples, the deviance functions are given in Table 6.1.

Table 6.1  Deviance Functions

Poisson(μ): canonical parameter θ = log(μ); deviance 2 Σ_i [ y_i log(y_i/μ_i) − y_i + μ_i ].
Binomial(m, p): θ = log(μ/(1 − μ)); deviance 2 Σ_i [ y_i log(y_i/μ_i) + (m_i − y_i) log((m_i − y_i)/(m_i − μ_i)) ].
Normal(μ, σ²): θ = μ; deviance Σ_i [ y_i − μ_i ]².
Gamma(μ, δ): θ = −1/μ; deviance 2 Σ_i [ (y_i − μ_i)/μ_i − log(y_i/μ_i) ].
Negative Binomial(μ, p): θ = log(1 − μ); deviance 2 Σ_i [ y_i log(y_i/μ_i) + (1 + y_i) log((1 + μ_i)/(1 + y_i)) ].
Multinomial (k categories): θ = [ log(π₁/(1 − (π₁ + · · · + π_k))), . . . , log(π_k/(1 − (π₁ + · · · + π_k))) ]; deviance 2 Σ_cases Σ_categories y_ik log(y_ik/μ_ik).

A utility of the deviance function is that it also allows a look at the individual deviance contributions in a way analogous to linear model residuals. The single-point deviance function is just the deviance function for the ith observation (i.e., without the summation):

d(θ, y_i) = −2 { y_i(θ̂ − θ̃) − [b(θ̂) − b(θ̃)] } a(ψ)⁻¹.

To define the deviance residual at the point y_i, we take the square root:

R_Deviance = [(y_i − μ_i)/|y_i − μ_i|] √|d(θ, y_i)|,

where (y_i − μ_i)/|y_i − μ_i| is just a sign-preserving function. Pierce and Schafer (1986) study the deviance residual in detail and recommend that a continuity correction of E_θ[(y_i − μ_i)/Var[y_i]]³ be added to each term in the right-hand column of Table 6.1 to improve the normal approximation. Furthermore, in the case of binomial, negative binomial, and Poisson exponential family forms, they prescribe
adding or subtracting 1/2 to integer-valued y_i outcomes to move these values toward the mean. These are then called adjusted deviances. In general, the deviance approach is quite successful in producing residual structures that are centered at zero, have standard error of 1, and are approximately normal.

Example 6.1: Poisson GLM of Capital Punishment, Continued

Returning once again to the Poisson-based example using 1997 capital punishment data in the United States, we now look at various residuals from the model. These data were chosen specifically because there is one case with a noticeably large outcome variable value: Texas. This makes the residuals analysis particularly interesting given the perceived dominance of this one case. Table 6.2 provides the residual vectors for each type studied.
Table 6.2  Residuals From Poisson Model of Capital Punishment

                  Response    Pearson    Working    Deviance
Texas               1.7076     0.2874     0.0484      0.2852
Virginia            0.8741     0.3066     0.1076      0.3014
Missouri            4.5953     3.8773     3.2715      2.8693
Arkansas            0.2648     0.1370     0.0709      0.1355
Alabama             0.9596     0.6718     0.4703      0.6274
Arizona             0.9539     0.9327     0.9119      0.8274
Illinois            0.1393     0.1021     0.0748      0.1009
South Carolina     −0.3823    −0.2477    −0.1605     −0.2548
Colorado           −0.9590    −0.6852    −0.4895     −0.7571
Florida            −1.8222    −1.0847    −0.6457     −1.2527
Indiana            −2.1772    −1.2214    −0.6853     −1.4291
Kentucky           −2.3183    −1.2727    −0.6986     −1.4959
Louisiana          −1.6016    −0.9930    −0.6156     −1.1362
Maryland            0.1016     0.1072     0.1131      0.1053
Nebraska            0.0703     0.0729     0.0756      0.0720
Oklahoma            0.4992     0.7054     0.9967      0.6202
Oregon             −0.9051    −0.6557    −0.4751     −0.7219

Note from Table 6.2 that in no single case does the sign of the residual change across residual types. If a change of sign were observed, a coding error should be suspected. In addition, the Pearson residuals are not very different from the deviance residuals in this example except for one
point (Missouri) producing a notable skewness, exactly as the theoretical discussion predicted for Pearson residuals. The large, positive residuals for Missouri indicate that it has more executions than expected given the observed levels of the explanatory variables. Looking at the deviance column, one is inclined to worry less about the effect of Texas as an outlier and more about Florida. This is because the Texas case has great influence on the parameter estimates and therefore the resulting μ_i. Florida, for instance, is similar in many of the explanatory variable values to Texas but does not have nearly as many executions and is subsequently further separated from the fit. Pregibon (1981) suggests jackknifing out (temporarily removing for reanalysis) cases and looking at the resulting changes to the coefficient values. If the change is quite substantial, then we know that the jackknifed case had high leverage on that coefficient. This can be done manually by removing the case and rerunning the analysis, but Williams (1987) also provides a one-step estimate that is computationally superior for large datasets. Figure 6.1 is a modification of Pregibon's (1981) index plot construct in which all of the coefficients from the Poisson model of capital punishment are reestimated, jackknifing out the cases listed numerically on the x-axis in the order given in Table 4.2. The horizontal line indicates the coefficient value for the complete data matrix. Therefore, the distance between a point and the line indicates how much the coefficient estimate changes by removing that case. Figure 6.1 shows that Texas does indeed exert great leverage on the coefficient estimates. Index plots are an excellent way to show the effect of one or a few cases on the resulting estimates but only for a relatively small number of cases. Picture Figure 6.1 with a sample size of several thousand. In such cases, one of several approaches is helpful. The researcher could sort the jackknifed values, taking the top 5% to 10% in absolute value before plotting. These are the ones we are inclined to worry about anyway. Second, if interest is focused on an overall diagnostic picture rather than individual cases, the plot could show a smoothed function of the sorted index values rather than individual points, as was done here.

Figure 6.1  Jackknife Index Plot: Capital Punishment Model. [Six panels track the jackknifed coefficient estimates for INCOME, PERPOVERTY, PERBLACK, log(VC100k96), SOUTH, and PROPDEGREE against the index number of the removed case, with a horizontal line at the full-data estimate.]
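A brute-force version of this jackknife calculation (refitting rather than using the one-step approximation) is straightforward. In the Python sketch below, `X` and `y` are placeholders for a design matrix (with a constant column) and an outcome vector; it is not the code used to produce Figure 6.1.

    import numpy as np
    import statsmodels.api as sm

    # Jackknifed coefficient paths in the spirit of Pregibon's index plot:
    # refit the model leaving out one case at a time and store the coefficients.
    # `X` and `y` are assumed to exist already as numpy arrays.

    full_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
    n = len(y)
    jack = np.empty((n, X.shape[1]))
    for i in range(n):
        keep = np.arange(n) != i                           # drop case i
        jack[i] = sm.GLM(y[keep], X[keep], family=sm.families.Poisson()).fit().params

    # distance of each leave-one-out estimate from the full-data estimate
    print(np.abs(jack - full_fit.params))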
Measuring and Comparing Goodness of Fit

There are five common methods for assessing how well a particular generalized linear model fits the data: the chi-square approximation to the Pearson statistic, the summed deviance, the Akaike information
criterion, the Schwarz criterion, and graphical techniques. Each of these approaches is useful, but the summed deviance statistic appears to be the best overall measure of lack of fit because it provides the most intuitive indication of individual-level contributions to the fit. However, no single measure provides "the right answer," and whenever possible, more than one approach should be used.
The Pearson statistic is the sum of the squared Pearson residuals:

X² = Σ_{i=1}^n R²_Pearson = Σ_{i=1}^n [ (Y_i − μ_i) / √(Var[μ_i]) ]².    (6.2)

If the sample size is sufficiently large, then X²/a(ψ) ∼ χ²_{n−p}, where n is the sample size and p is the number of explanatory variables, including the constant. Unfortunately, this statistic is very poorly behaved for relatively small sample sizes, and readers should be wary of reported values based on double-digit sample sizes (as in Example 4.2). The utility of this distributional property is that large X² values will be determined to reside in the tail of the χ² distribution, and the model can therefore be "rejected" as poorly fitting.
The summed deviance has already been presented in detail in this chapter but not discussed as a measure of goodness of fit. Given sufficient sample size, it is true that D(θ̃, y)/a(ψ) ∼ χ²_{n−p} and D(θ̄, y)/a(ψ) ∼ χ²_{p−1}. However, for enumerative outcome data (dichotomous, counts), the convergence of the deviance function to a χ² statistic is much slower than for the Pearson statistic. In any case involving enumerative data, one is strongly advised to add or subtract 1/2 to each outcome variable in the direction of the mean. This continuity correction greatly improves the distributional result. Pierce and Schafer (1986) as well as Peers (1971) remind us, though, that just because the Pearson statistic is more nearly chi-square distributed, it does not mean that it is necessarily a superior measure of fit than the summed deviance. Although the summed deviance is somewhat more problematic with regard to the discreteness of the outcome variables, it is vastly superior when considering likelihood-based inference.
Deviance residuals are also very useful for comparing nested model specifications. In the preceding discussion, the outer nesting model was the saturated model, but it need not be. Suppose we are comparing two nested model specifications, M₁ and M₂, with p < n and q < p parameters, respectively. Then the likelihood ratio statistic, [D(M₁) − D(M₂)]/a(ψ), subject to a few complications, is distributed approximately χ²_{p−q}
(see Fahrmeir & Tutz, 1994; McCullagh & Nelder, 1989). If we do not know the value of a(ψ), such as in the Poisson case, then it is estimated and a modified likelihood ratio statistic is used: [D(M₁) − D(M₂)]/[â(ψ)(p − q)]. This modified statistic is distributed according to the F-distribution with n − p and p − q degrees of freedom.
A commonly used measure of goodness of fit is the Akaike information criterion (AIC; Akaike, 1973, 1974, 1976), also sometimes called the discrepancy or cross-entropy. The principle is to select a model that minimizes the negative likelihood penalized by the number of parameters:

AIC = −2ℓ(θ̂|y) + 2p,    (6.3)
where ℓ(θ̂|y) is the maximized model log-likelihood value and p is the number of explanatory variables in the model (including the constant). Akaike's idea is that the second "penalty" term is an approximation for the expected likelihood function under the actual data-generating process, and thus, the AIC is simply the estimated discrepancy between modeled and true likelihoods. Unsurprisingly, this approximation is best when n is large and p is small. This construct is very useful in comparing and selecting nonnested model specifications, but the practitioner should certainly not rely exclusively on AIC criteria. Many authors have noted that the AIC has a strong bias toward models that overfit with extra parameters since the penalty component is obviously linear with increases in the number of explanatory variables and the log-likelihood often increases more rapidly (Carlin & Louis, 2009, p. 53; Neftçi, 1982, p. 539; Sawa, 1978, p. 1280). However, a substantial benefit is that by including a penalty for increasing degrees of freedom, the AIC explicitly recognizes that basing model quality decisions on the value of the log-likelihood alone is a poor strategy since the likelihood never decreases by adding more explanatory variables, regardless of their inferential quality.
In response to such concerns, Hurvich and Tsai (1989) introduce the corrected AIC (AICc) for linear and generalized linear models, especially (but not necessarily) applied to time series, based on the expected Kullback-Leibler distance between the "true" and fitted models. In the linear case, first define σ̂² = (y − Xβ̂)′(y − Xβ̂)/n, so that

AICc = n log(2πσ̂²) + n(n + p)/(n − p − 2).    (6.4)

If the true model is a subset of the models that are possible to estimate by the researcher, then this is a minimum variance unbiased estimator for the Kullback-Leibler distance:

δ_KL(β̂, β_true) = E[−2 log L(β̂|X, y)],    (6.5)
where the expectation is taken with respect to the true distribution but substituting in the MLE estimate. Hurvich and Tsai (1991) later show that this adjustment reduces the bias in the AIC, and for linear models, it generally outperforms the AIC in model selection. Furthermore, they derive the mathematical connection between the AIC and the AICc; see also Cavanaugh (1997) and Prasad and Tsai (2001). Nonlinear forms are given in Hurvich and Tsai (1988) and Hurvich, Simonoff, and Tsai (1998). Define H as the hat (projection) matrix from the last step of the IWLS algorithm (Chapter 5), with degrees of freedom df_H = tr(H), and let D(θ̃, y) be the deviance function of the proposed model relative to the saturated model as introduced earlier in this chapter. Then a GLM version is given by

AICc = D(θ̃, y)/n + 2a(ψ)(df_H + 1)/(n − df_H − 2),    (6.6)
where a(ψ) is the general scale parameter given in Chapter 2. This is a very general form that is conveniently based on the summed deviance. The trace of the linear hat matrix (recall that IWLS makes linear steps) is the rank of the X matrix, but in this case, it needs to be the weighted hat matrix from the last step of IWLS, which some software packages provide more readily than others.
Another commonly used measure of goodness of fit is that proposed by Schwarz (1978), called both the Schwarz criterion and the Bayesian information criterion (BIC). Even though it is derived from a different statistical perspective, the BIC resembles the AIC in calculation:

BIC = −2ℓ(θ̂|y) + p log(n),    (6.7)
where n is the sample size. Despite the strong visual similarity between the AIC and the BIC, the two measures can indicate different model specifications from a set of alternatives as being optimal, with the AIC favoring more explanatory variables and a better fit and the BIC favoring fewer explanatory variables (parsimony) and a poorer fit (Koehler & Murphree, 1988, p. 188; Neftçi, 1982, p. 537; Sawa, 1978, p. 1280). Since the BIC explicitly includes sample size in the calculation,
it is obviously more appropriate in model comparisons where sample size differs, and a model that can achieve a reasonable log-likelihood fit with a smaller sample is penalized less than a comparable model with a larger sample. Whereas the AIC is just a convenient construction loosely derived from maximum likelihood and negative entropy (Amemiya, 1985, p. 147; Greene, 1997, p. 401; Koehler & Murphree, 1988, p. 189), the BIC is strongly connected with Bayesian theory. For instance, as n → ∞,

[(BIC_a − BIC_b) − log(BF)] / log(BF) → 0,    (6.8)

where BIC_a and BIC_b are the BIC quantities for competing model specifications a and b, and BF is the Bayes factor between these models. The Bayes factor is a model comparison tool roughly analogous to the likelihood ratio test (LRT). Very briefly (for more details, see Gill, 2014, chap. 7), suppose θ₁ and θ₀ represent two competing models through different sets of explanatory variable choices with the same outcome variable. Unlike the LRT, these need not be a nested comparison in which one model is a special case of the other by specifying fewer of the explanatory variables from the larger set. The researcher sets up prior distributions on the unknown associated coefficients, as well as prior model probabilities, p(θ₁) and p(θ₀), although in practice, these latter values are often assigned 0.5 each. Both models are updated by conditioning these priors on the observed data through the likelihood functions to produce the corresponding posterior distributions, π(θ₁|X) and π(θ₀|X). Evidence for model 1 over model 0 is given by the Bayes factor:

BF_{1,0} = π(θ₁|x) / π(θ₀|x),    (6.9)

which can be rewritten a number of different ways to show structural features. There is no corresponding reference distribution for testing purposes, so one wants a value much bigger or much smaller than 1 in order to make a comfortable decision in one direction or the other. Notice through this discussion that the BIC is an approximation to the log of the Bayes factor that is often much easier to calculate (Kass, 1993) and perfectly reasonable to apply in non-Bayesian settings. Other competing model selection criteria as well as modifications of the AIC and the BIC provide useful comparisons, but the basic Akaike and Schwarz constructs dominate empirical work. Although it can be shown that nearly all of these measures are asymptotically equivalent
(Zhang, 1992), Amemiya (1980) provides simulation evidence that the BIC finds the correct model more often than the AIC for small samples. However, Neftçi’s (1982) study found that the BIC was noticeably more sensitive to transformations on the data than the AIC. Since these measures are only one part of the assessment of model quality and neither has remarkably superior properties, the choice of which to use is primarily a function of personal preference. Amemiya (1981, p. 1505, note 13) states his preference for the AIC due to its simplicity, and statistical software packages generally give the AIC as the default measure.
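Since both criteria are simple functions of the maximized log-likelihood, the number of parameters, and (for the BIC) the sample size, they are easy to compute directly. The short Python sketch below uses made-up log-likelihood values for two hypothetical competing specifications; nothing in it is tied to a particular model from this monograph.

```python
import numpy as np

def aic(loglik, p):
    """Akaike information criterion: -2*loglik + 2p."""
    return -2.0 * loglik + 2.0 * p

def bic(loglik, p, n):
    """Schwarz/Bayesian information criterion: -2*loglik + p*log(n)."""
    return -2.0 * loglik + p * np.log(n)

# Hypothetical maximized log-likelihoods for two competing specifications
# fit to the same data set of n = 500 observations.
n = 500
candidates = {"larger model": (-612.4, 9), "smaller model": (-618.9, 4)}

for name, (loglik, p) in candidates.items():
    print(f"{name}: AIC = {aic(loglik, p):7.1f}  BIC = {bic(loglik, p, n):7.1f}")
```

With these made-up numbers the AIC prefers the larger model while the BIC prefers the smaller one, mirroring the tendency toward fit versus parsimony described above.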
Asymptotic Properties
It is convenient at this point to briefly review some asymptotic properties of the estimated quantities from a generalized linear model: coefficients, goodness-of-fit statistics, and residuals. The coefficients are maximum likelihood estimates generated by a numerical technique (iterative weighted least squares) rather than an analytical approach. This means that the extensive theoretical foundation developed for maximum likelihood estimation applies in this case, and this section reviews the conditions under which these principles hold. We also extend the discussion of the asymptotic chi-square distribution of the two primary test statistics, as well as review some large-n properties of the residuals from a generalized linear model fit. Chapter 5 introduced the Hessian matrix as the second derivative of the likelihood function at the maximum likelihood values, θ̂. The negative expectation of this matrix,

I(θ) = −E_θ[∂²ℓ(θ|y) / ∂θ∂θ′],    (6.10)
is called the information matrix and plays an important role in evaluating the asymptotic behavior of the estimator. A useful and theoretically important feature of a given square matrix is the set of eigenvalues associated with this matrix. Every p × p matrix, A, has p scalar values, λ_i, i = 1, . . . , p, such that Ah_i = λ_i h_i for some corresponding vector, h_i. In this decomposition, λ_i is called an eigenvalue of A and h_i is called an eigenvector of A. These eigenvalues show important structural features of the matrix. For instance, the number of nonzero eigenvalues is the rank of A, the sum of the eigenvalues is the trace of A, and the product of the eigenvalues is the determinant of A.
The information matrix is especially easy to work with because it is symmetric, and unless there are serious computational problems, it is also positive definite. Generalized linear models built upon exponential families and natural link functions produce positive definite information matrices under very reasonable regularity conditions:
1. The sample space for the coefficient vector is open in Rᵖ and convex for p explanatory variables.
2. The linear predictor transformed by the link function is defined over the sample space of the outcome variable.
3. The link function is twice differentiable.
4. The X′X matrix is full rank.
These conditions are explored in much greater detail in Fahrmeir and Kaufman (1985), Lehmann and Casella (1998), and Le Cam and Yang (1990). Given the regularity conditions described earlier, if (1) small changes in the coefficient estimate produce arbitrarily small changes in the normed information matrix in any direction, and (2) the smallest absolute eigenvalue of the information matrix produces asymptotic divergence, λ_min[I(θ)] → ∞ as n → ∞, then the coefficient estimator (1) exists,
(2) converges in probability to the true value, and (3) is asymptotically normal with a covariance matrix equal to the inverse of the information matrix. This is summarized by the following notation:

√n(θ̂ − θ) → N(0, I(θ)⁻¹).    (6.11)

A condition of divergence seems odd at first, but recall that the information matrix functions in the denominator of expressions for variances. These conditions are only very broadly stated here, and for details the reader should consult Fahrmeir and Tutz (1994, Appendix A2). It has been shown that the maximum likelihood estimates from these models are still reasonably well behaved under more challenging circumstances than those required in the last paragraph. These applications include clustered units (Bradley & Gart, 1962; Zeger & Karim, 1991), non-iid samples (Jørgensen, 1983; Nordberg, 1980), sparse tables (Brown & Fuchs, 1983), generalized autoregressive linear models (Kaufman, 1987), and mixed models (McGilchrist, 1994). It is easy to be relatively comfortable under typical or even these more challenging circumstances, provided a large sample size. Not having that luxury, what can be done to check asymptotic conditions? The primary diagnostic advice is to check the minimum absolute eigenvalue from the information matrix, which can also be normalized
prior to eigen-analysis to remove the scale of the explanatory variables. Some statistical packages make it relatively easy to evaluate this quantity. For any package that provides a variance-covariance matrix, there is a handy shortcut. If λ_i is an eigenvalue of the matrix A, then 1/λ_i is an eigenvalue of the matrix A⁻¹, provided it exists (nonsingular). So we can evaluate the inverse of the maximum eigenvalue of the variance-covariance matrix instead of the minimum eigenvalue of the information matrix. It is desirable that λ_min not have a double-digit negative exponent component, but since the quantity is scale dependent, it is often useful to first normalize the information matrix. When it is not convenient (or possible) to evaluate the eigen-structure of the information matrix, other symptoms may help. If at least one of the coefficients has an extremely large standard error, it may be an indication of a very small minimum eigenvalue. This is a handy but imperfect approach. The asymptotic properties of two important test statistics, the summed deviance and the Pearson statistic, were discussed in the last section. Under ideal circumstances, both of these converge in distribution to χ²_(n−p). Of the two, the Pearson statistic possesses a superior chi-square approximation because it is composed of more nearly normal terms, which are then squared. Pierce and Schafer (1986) show a noticeable difference in the behavior of these two statistics for a binomial model with n = 20 and m = 10. The advice here is to never depend exclusively on the asymptotic convergence of these measures for any sample size smaller than this. However, it is often the case that the social science researcher is not in a position to choose the sample size, and it is still always worth the effort to calculate these values. Checking the distributional properties of the residuals can be helpful in diagnostic situations. In addition to constructing test statistics, the residuals themselves often provide important information. While the residuals from a generalized linear model are not required to be asymptotically normal around zero, systematic patterns in the distribution can be an indication of misspecification or mismeasurement. By far the best method of evaluating the distribution of residuals is by graphing them in various ways. Different methods are presented in the examples.
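As a concrete version of this shortcut, the following Python sketch inverts the largest eigenvalue of a made-up variance-covariance matrix to recover the smallest eigenvalue of the information matrix; any package that reports the variance-covariance matrix of the coefficients could supply the input.

```python
import numpy as np

# Hypothetical variance-covariance matrix of the coefficient estimates,
# as reported by a statistical package after fitting a GLM.
vcov = np.array([[ 0.040, 0.002, -0.001],
                 [ 0.002, 0.090,  0.004],
                 [-0.001, 0.004,  0.150]])

# If lambda is an eigenvalue of A, then 1/lambda is an eigenvalue of A^-1,
# so the minimum eigenvalue of I(theta) is one over the maximum eigenvalue
# of the variance-covariance matrix.
lambda_min_info = 1.0 / np.linalg.eigvalsh(vcov).max()
print("minimum eigenvalue of the information matrix:", lambda_min_info)

# A result with a double-digit negative exponent (e.g., 1e-12) would be a
# warning sign; normalizing the matrix first removes the scale of the X's.
```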
Example 6.2: Poisson GLM of Capital Punishment, Continued
Returning very briefly to the Poisson model of capital punishment on page 55, we see that the deviance function of 18.212 on 10 degrees of freedom for the specified model provides a substantial improvement over the
null deviance (calculated from a model explaining all systematic effects by the mean) of 136.573. A chi-square test of the summed deviance finds that it is not in the tail at a predefined alpha level of 0.05, suggesting a reasonable overall fit. In general, a quick check of fit is to confirm that the summed deviance is not substantially larger than the degrees of freedom.
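That check is a one-line calculation once the deviance and its degrees of freedom are in hand; the Python sketch below simply evaluates the chi-square tail probability for the numbers quoted in this example.

```python
from scipy import stats

# Summed deviance and residual degrees of freedom for the capital
# punishment model discussed above.
deviance, df = 18.212, 10

# Upper-tail probability of a chi-square(df) variate exceeding the deviance.
p_value = stats.chi2.sf(deviance, df)
print(f"Pr(chi-square_{df} > {deviance}) = {p_value:.3f}")
# A probability above the 0.05 threshold means the deviance is not in the
# tail, consistent with the reasonable overall fit described in the text.
```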
Example 6.3: Gamma GLM of Electoral Politics in Scotland, Continued
Assessing the quality of fit for the Scottish elections model is a bit more subtle than the other examples because of the units of the outcome variable and the relatively low variability of the effects. In fact, if one were willing to live without any explanatory value, summarizing the outcome variable with a mean only produces an amazingly low null deviance of 0.536072. However, we are interested in producing models that explain outcome variable changes with regard to specific factors. In attempting to do so here, the model produces a summed deviance of 0.087389, a substantial proportional reduction. In addition, the minimum eigenvalue of the information matrix is 0.7564709. In the previous description of this model, it was argued that an interaction term between the variables for council tax and female unemployment claims was useful in describing outcome variable behavior. Some evidence supporting this claim was found in the 95% confidence interval for the coefficient being bounded away from zero. How it adds to the overall quality of the model becomes more apparent after performing an analysis of deviance. This is provided in Table 6.3, where the terms are sequentially entered into the model, and therefore into the calculation of the summed deviance, with the interaction term being last. Unfortunately, this analysis of deviance is always order dependent, so the deviance residual contribution from adding a specified variable as reported is conditional on the previously added variables. Regardless of this attribute, we can evaluate whether or not the interaction term contributes reasonably to the fit. It is placed last in the order of analysis so that whatever conclusion we reach, the value of this term is fully conditional on the rest of the terms in the model. From looking at the second and fourth columns, we can see that the marginal contribution from the interaction term is nontrivial. More important, we can test the hypothesis that the variable contributes to the fit of the model using the F-test described earlier (since the dispersion parameter, a(ψ), was estimated).
Table 6.3    Analysis of Deviance for the Scottish Vote Model

Individual Variable                  df   Summed Deviance   Residual df   Residual Deviance   F-Statistic   p(>F)
Null model                                                  31            0.5361
Council tax                          1    0.2323            30            0.3038              64.8037       0.0000...
Female unemployment                  1    0.1195            29            0.1843              33.3394       0.0000...
Standardized mortality               1    0.0275            28            0.1569               7.6625       0.0106
Economically active                  1    0.0230            27            0.1339               6.4109       0.0183
GDP                                  1    0.0005            26            0.1334               0.1435       0.7081
Percent aged 5−15                    1    0.0073            25            0.1260               2.0438       0.1657
Council tax: Female unemployment     1    0.0386            24            0.0874              10.7806       0.0031
For a model on row k of the table with p_k degrees of freedom nested within a model on row k − 1 with p_{k−1} degrees of freedom (necessarily larger), this test statistic is calculated:

f_{k,k−1} = [D(M_{k−1}) − D(M_k)] / [a(ψ̂)(p_{k−1} − p_k)].

So the F-statistic on the last row of Table 6.3 is calculated by

f_{9,8} = (0.12603 − 0.08739) / [(0.003584182)(25 − 24)] = 10.7806,
indicating strong evidence that the interaction term should be included in the model (note the p-value in the last column of Table 6.3). In addition, we can readily see corroborating evidence that gross domestic product (GDP) is not particularly reliable or important in the context of this model. Interestingly, the variable indicating the percentage of middle-range children does not seem to be contributing a substantial amount to our summed deviance reduction, even though it has a 95% confidence interval bounded away from zero.
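The nested F-test used in this analysis of deviance is also easy to reproduce by hand. The Python sketch below wraps the formula given above in a small helper function (not part of any package) and plugs in the interaction-term numbers from Table 6.3.

```python
from scipy import stats

def analysis_of_deviance_f(dev_reduced, dev_full, df_reduced, df_full, dispersion):
    """F-test for adding terms: reduced (fewer terms) vs. full (more terms) model."""
    # Numerator: drop in deviance per degree of freedom used by the added term(s),
    # scaled by the estimated dispersion parameter a(psi).
    f_stat = (dev_reduced - dev_full) / (dispersion * (df_reduced - df_full))
    p_value = stats.f.sf(f_stat, df_reduced - df_full, df_full)
    return f_stat, p_value

# Adding the council tax x female unemployment interaction moves the residual
# deviance from 0.12603 on 25 df to 0.08739 on 24 df (dispersion 0.003584182).
f_stat, p_value = analysis_of_deviance_f(0.12603, 0.08739, 25, 24, 0.003584182)
print(f"F = {f_stat:.4f}, p = {p_value:.4f}")   # approximately F = 10.78, p = 0.003
```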
Example 6.4: Binomial GLM of Educational Standardized Testing
Measuring and modeling the educational process is a particularly difficult empirical task. Furthermore, we are typically not only interested in describing the current state of the educational institutions and policies; we also seek to explain what policies “work” and why. Even defining whether or not a program is successful can be difficult, and about the only principle that scholars in this area agree on is that our current understanding is rudimentary at best (Boyd, 1998). There are two primary academic schools of thought on the problem. Economists (Becker & Baumol, 1996; Boyd & Hartman, 1998; Hanushek, 1981, 1986, 1994) generally focus on the parametric specification of the production function (a “systems” model of the process, which evaluates outputs as a function of definable and measurable inputs). Conversely, education scholars (Hedges, Laine, & Greenwald, 1994; Wirt & Kirst, 1975) tend to evaluate more qualitatively, seeking macro-trends across cases and time as well as the implications of changing laws and policies. These two approaches often develop contradictory findings, as evidenced by bitter debates such as in the determination of the marginal value of decreasing class size. This example examines California state data on educational policy and outcomes (STAR Program Results for 1998). The data come from standardized testing by the California Department of Education (CDE), which required students in the 2nd through 11th grades to be evaluated by the Stanford 9 test on a variety of subjects. These data are recorded for individuals and aggregated at various levels from schools to the full state.10 The level of analysis here is the unified school district, providing 303 cases. The outcome variable is the number of ninth graders scoring over the mathematics national median value for the district given the total number of ninth graders taking the mathematics exam (hence a binomial GLM). The explanatory variables are grouped into two functional categories. The first, environmental factors, includes four variables traditionally used in the literature that are typically powerful explanations of learning outcomes. The proportion of low-income students (LOWINC) is measured by the percentage of students who qualify for reduced-price or free lunch plans. Proportions of minority students are also included (PERASIAN, PERBLACK, and PERHISP). Poverty has been shown to strongly affect education outcomes. Racial variables are often important because according to numerous studies, economic factors and discrimination negatively and disproportionately affect particular minorities in terms of educational outcomes.
The second group, policy factors, includes eight explanatory variables. These are per-pupil expenditures in thousands of dollars (PERSPEN); median teacher salary, including benefits, also in thousands of dollars (AVSAL); mean teacher experience in years (AVYRSEXP); the pupil/teacher ratio in the classroom (PTRATIO); the percentage of minority teachers (PERMINTE); the percentage of students taking college credit courses (PCTAF); the percentage of schools in the district that are charter schools (PCTCHRT); and the percentage of schools in the district operating year-round programs (PCTYRRND). The model is set up with a logit link function g(μ) = log[μ/(1 − μ)], although nearly identical results were observed with the probit and cloglog link functions. In addition to the listed variables, several interactions are added to the model. The outcome is listed in Table 6.4. Each of the 20 included explanatory variables in the model produces a 95% confidence interval bounded away from zero (but not the intercept). Actually, if Table 6.4 were constructed with 99.9% intervals instead of 95%, then every interval would still be bounded away from zero. As expected, the environmental variables are reliable indicators of educational outcomes. Increasing the percentage of minority teachers appears to improve learning outcomes. This is consistent with findings in the literature that suggest that minority students are greatly assisted by minority teachers while nonminority students are not correspondingly ill-affected (Meier, Stewart, & England, 1991; Murnane, 1975). The large and negative coefficient for pupil/teacher ratio supports current public policy efforts (particularly in California) to reduce class sizes. This, of course, comes with a cost. Since the coefficient is large and positive for the teachers’ experience variable, bringing in many new and necessarily inexperienced teachers has at least a short-run negative effect. Furthermore, the negative coefficient on the interaction effect between the percentage of minority teachers and years of experience implies that more experienced teachers tend to be nonminorities. So the positive effect of hiring new minority teachers could be reduced slightly by their short-term inexperience.
Table 6.4    Modeling the Standardized Testing Results

                              Coefficient   Standard Error   95% Confidence Interval
(Intercept)                      2.9589        1.5467        [−0.0647, 5.9986]
LOWINC                          −0.0168        0.0004        [−0.0177, −0.0160]
PERASIAN                         0.0099        0.0006        [0.0087, 0.0111]
PERBLACK                        −0.0187        0.0007        [−0.0202, −0.0173]
PERHISP                         −0.0142        0.0004        [−0.0151, −0.0134]
PERMINTE                         0.2545        0.0299        [0.1958, 0.3132]
AVYRSEXP                         0.2407        0.0571        [0.1287, 0.3527]
AVSAL                            0.0804        0.0139        [0.0531, 0.1077]
PERSPEN                         −1.9522        0.3168        [−2.5756, −1.3336]
PTRATIO                         −0.3341        0.0613        [−0.4546, −0.2144]
PCTAF                           −0.1690        0.0327        [−0.2335, −0.1053]
PCTCHRT                          0.0049        0.0013        [0.0025, 0.0074]
PCTYRRND                        −0.0036        0.0002        [−0.0040, −0.0031]
PERMINTE.AVYRSEXP               −0.0141        0.0019        [−0.0178, −0.0103]
PERMINTE.AVSAL                  −0.0040        0.0005        [−0.0049, −0.0031]
AVYRSEXP.AVSAL                  −0.0039        0.0010        [−0.0058, −0.0020]
PERSPEN.PTRATIO                  0.0917        0.0145        [0.0634, 0.1203]
PERSPEN.PCTAF                    0.0490        0.0075        [0.0345, 0.0637]
PTRATIO.PCTAF                    0.0080        0.0015        [0.0051, 0.0110]
PERMINTE.AVYRSEXP.AVSAL          0.0002        0.0000        [0.0002, 0.0003]
PERSPEN.PTRATIO.PCTAF           −0.0022        0.0003        [−0.0029, −0.0016]

Null deviance: 34,345, df = 302        Summed deviance: 4,078.8, df = 282
Maximized ℓ(): −2,999.6                AIC: 6,039.2
Without a little background, the sign of the coefficient for percentage of schools within the district operating year-round is perplexing. A common argument for year-round schools is that the relatively long time that students have away from the classroom in the summer means that part of the school year is spent catching up and remembering previous lessons. However, year-round schools come in two flavors: single track, in which all of the students are on the same schedule, and multitrack, in which the students share classrooms and other resources by alternating schedules. Evidence is that multitrack schools perform noticeably worse, for various sociological reasons, than single-track schools and traditionally scheduled schools (Quinlan, 1987; Weaver, 1992). Often the best way to understand generalized linear models with interaction effects is by using first differences. As described in Chapter 5, the principle of first differences is to select two levels of interest for
a given explanatory variable and calculate the difference in impact on the outcome variable, holding all of the other variables constant at some value, usually the mean. Therefore, when looking at a variable of interest in a table of first differences, the observed difference includes the main effect as well as all of the interaction effects that include that particular variable. Table 6.5 provides first differences for each main effect variable in the two models over the interquartile range and over the whole range. Thus, for example, the interquartile range for Percent Low Income is 26.68% to 55.46%, and the first difference for this explanatory variable over this interval is −11.89%. In other words, districts do about 12% worse at the third quartile than the first quartile. Looking at first differences clarifies some of the perplexing results from Table 6.4. For instance, Per-Pupil Spending in Table 6.4 has a negative coefficient (−1.95217 using dollars as the measure). This is the contribution of this explanatory variable in the nonsensical scenario where all other interacting variables are fixed at zero. If there is a large discrepancy between the magnitude of the effect in the estimation table and the first difference result, it means that the interactions dominate the zero-effect marginal. The first differences for Per-Pupil Spending in the math model, moving from the first quartile to the third quartile, improve the expected pass rate about 1% and, moving across the whole range of spending for districts, improve the expected pass rate slightly less than 6%. Of great concern is the summed deviance of 4,078.8 on 282 degrees of freedom (4,054.928 for adjusted deviances). Clearly, this is in the tail of the chi-square distribution (no formal test needed), and none of the smoothing techniques in the literature will have an effect. First, it should be noted that the deviance was reduced about 90% from the null. This observation, along with the high quality of the coefficient estimates and the minimum eigenvalue (0.4173) of the information matrix, motivates further investigation as to whether the fit is acceptable.
Table 6.5    First Differences for the Standardized Testing Results Model

Main Effect                     Interquartile Range Values   First Difference   Full Range Values   First Difference
Percent Low Income              [26.68, 55.46]               −0.1189            [0.00, 92.33]       −0.3620
Percent Asian                   [0.88, 7.19]                  0.0154            [0.00, 63.20]        0.1555
Percent Black                   [0.85, 5.96]                 −0.0237            [0.00, 76.88]       −0.2955
Percent Hispanic                [13.92, 47.62]               −0.1184            [2.25, 98.82]       −0.3155
Percent Minority Teachers       [6.33, 19.18]                 0.0144            [0.00, 80.17]        0.0907
Average Years Experience        [13.03, 15.51]               −0.0024            [8.42, 20.55]       −0.0117
Average Salary                  [55.32, 62.21]                0.0210            [39.73, 80.57]       0.1243
Per-Pupil Spending              [3.94, 4.51]                  0.0080            [2.91, 6.91]         0.0559
Class Size                      [21.15, 24.12]                0.0042            [14.32, 28.21]       0.0197
Percent in College Courses      [23.45, 41.80]                0.0224            [0.00, 89.13]        0.1093
Percent Charter                 [0.00, 0.00]                  0.0000            [0.00, 71.43]        0.0875
Percent Year-Round              [0.00, 12.31]                −0.0109            [0.00, 100.00]      −0.0863

Note: Bold coefficient names denote 95% CI bounded away from zero.
Figure 6.2    Diagnostics: Education Policy Model
[Three-panel figure: a model fit plot of observed values against fitted values, a residual dependence plot of Pearson residuals against fitted values, and a normal-quantile plot of deviance residual quantiles against quantiles of N(0,1).]
Figure 6.2 provides three very useful diagnostics for looking at the quality of the fit for the developed model. The first panel provides a comparison of the fitted values, g⁻¹(Xβ), versus the observed outcome variable values, Y. The diagonal line in this panel is the linear regression line of the fitted values regressed on these observed values. If this were the saturated model, then all of the points would land on the line. So a model’s difference from the saturated benchmark is the degree to which the points deviate from the line. If there existed systematic bias in the fit, say from omitted variables, then the slope would be much different from 1. For instance, a slope noticeably less than 1 would indicate that the model systematically underfit cases with larger observed values and overfit cases with smaller observed values. In Figure 6.2, the linear regression produces an intercept and a slope (α = −0.00496 and β = 0.98982), which are very near the perfect ideal. Panel 2 in Figure 6.2 displays a residual dependence plot. These are the fitted values, g⁻¹(X_iβ̂), plotted against the Pearson residuals. Any discernible pattern or curvature in a residual dependence plot is an indication of either systematic effects contained within the stochastic term, a poor choice for the link function, or a very badly measured variable. This plot shows a very healthy residuals structure. The final panel in Figure 6.2 plots the deviance residual quantiles (y-axis) against the quantiles from an equal number of sorted normal variates with mean zero and standard deviation 1. The purpose of this normal-quantile plot is to evaluate whether or not the obtained deviance residuals are approximately normally distributed. If one were to plot perfectly normally distributed residuals against the N(0,1) benchmark here, the plot would be an approximately straight line with slope equal to the mean of the normal variates. For the model developed here, there is evidence that the deviance residuals are approximately normally distributed. Again, we can see some outliers at the end, but these are small in number and not terribly ill-behaved. Indications of residual distributions that deviate strongly from the desired normality show up as “S”-type curves in the normal-quantile plot. It should be noted again that generalized linear models are not required to have normally distributed residuals. Therefore, linearity in the normal-quantile plot in this context is not a condition of model quality but rather a helpful description of residuals behavior. This residuals analysis shows that it is unlikely that the model suffers from omitted variable bias affecting the stochastic component or misspecification of the link function. The summed deviance could probably be reduced significantly by including an explanatory variable that many researchers have pointed to as a critical determinant in the education production function: parental involvement, possibly measured by the level of PTA activity. Unfortunately, the state of California does not track and measure this variable for these data.

Example 6.5: Negative Binomial GLM, Congressional Activity: 1995
As a simple illustration of one use for a negative binomial GLM, consider the assignment of bills to committees during the first 100 days that the House of Representatives is in session after an election. This
period is typically quite busy as Congress usually sees election issues as subsequent legislative mandates (although not always with successful or intended successful outcomes). The first 100 days of the 104th House was certainly no exception to this observed phenomenon. The new Republican majority busily addressed 40 years of minority party frustration and attempted to fulfill their promises outlined in the “Contract with America.” The negative binomial distribution has the same sample space (i.e., on the counting measure) as the Poisson but contains an additional parameter that can be thought of as gamma distributed and therefore used to model a variance function. This configuration, Poisson-distributed mean and gamma-distributed variance, naturally produces the negative binomial PMF. In this manner, we can model counts while relaxing the assumption that the mean and variance are identical. This turns out to be enormously useful for count data that are observed to be overdispersed: Var[Y] = δE[Y], δ > 1. Often when there is noticeable heterogeneity in the sample, such overdispersion results. The data in this example contain the number of bills assigned to committee in the first 100 days of the 103rd and 104th Houses, the number of members on the committee, the number of subcommittees, the number of staff assigned to the committee, and a dummy variable indicating whether or not it is a high-prestige committee. These data are provided in Table 6.6. If bill assignments in the 104th House are perceived as events (at the committee level), then it is natural to consider applying a generalized linear model with a Poisson link function. Unfortunately, this model fits poorly as measured by some of the diagnostics already discussed (e.g., summed deviance of 393.43 on 14 degrees of freedom). The culprit appears to be a variance term that is larger than the expected value and thus violates a Poisson assumption, motivating the negative binomial specification developed here. The negative binomial model is developed with the link function θ = log(1 − μ). It is possible with many statistical packages to either assign a known value for the variance function or estimate it. Given that we have no prior information about the nature of the variance function, it is estimated here. The resulting output is provided in Table 6.7.
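For readers who want to reproduce this kind of model, the sketch below shows one way to fit a negative binomial GLM in Python with the statsmodels library. The file name and column names are assumptions made for illustration, the auxiliary dispersion parameter is fixed rather than estimated, and the default log link is used, so this is not the exact specification reported in Table 6.7.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical committee-level file mirroring Table 6.6; the column names
# (bills_104, size, subcommittees, staff, prestige, bills_103) are assumed.
committees = pd.read_csv("congress_committees.csv")

# Negative binomial GLM for bill counts in the 104th House, with the
# subcommittees-by-log(staff) interaction from the text.  alpha is the
# auxiliary parameter of the negative binomial family, fixed here at 1.0.
nb_model = smf.glm(
    "bills_104 ~ size + subcommittees + np.log(staff) + prestige "
    "+ bills_103 + subcommittees:np.log(staff)",
    data=committees,
    family=sm.families.NegativeBinomial(alpha=1.0),
)
nb_fit = nb_model.fit()
print(nb_fit.summary())
```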
Table 6.6    Bills Assigned to Committee, First 100 Days

Committee                               Size   Subcommittees   Staff   Prestige   Bills—103rd   Bills—104th
Appropriations                           58         13          109        1           9             6
Budget                                   42          0           39        1         101            23
Rules                                    13          2           25        1          54            44
Ways and Means                           39          5           23        1         542           355
Banking                                  51          5           61        0         101           125
Economic/Educational Opportunities       43          5           69        0         158           131
Commerce                                 49          4           79        0         196           271
International Relations                  44          3           68        0          40            63
Government Reform                        51          7           99        0          72           149
Judiciary                                35          5           56        0         168           253
Agriculture                              49          5           46        0          60            81
National Security                        55          7           48        0          75            89
Resources                                44          5           58        0          98           142
Transportation/Infrastructure            61          6           74        0          69           155
Science                                  50          4           58        0          25            27
Small Business                           43          4           29        0           9             8
Veterans Affairs                         33          3           36        0          41            28
House Oversight                          12          0           24        0         233            68
Standards of Conduct                     10          0            9        0           0             1
Intelligence                             16          2           24        0           2             4
Table 6.7    Modeling Bill Assignment—104th House, First 100 Days

                            Coefficient   Standard Error   95% Confidence Interval
(Intercept)                   −6.8071        2.0828        [−11.1398, −2.9731]
Size                          −0.0283        0.0171        [−0.0599, 0.0031]
Subcommittees                  1.3020        0.4447        [0.4855, 2.1309]
log(Staff)                     3.0103        0.6498        [1.8222, 4.3451]
Prestige                      −0.3231        0.3608        [−0.9708, 0.3627]
Bills in 103rd                 0.0066        0.0011        [0.0043, 0.0090]
Subcommittees: log(STAFF)     −0.3237        0.1022        [−0.5177, −0.1328]

Null deviance: 107.318, df = 19        Summed deviance: 20.949, df = 13
Maximized ℓ(): 10,559                  AIC: 205.39
From Table 6.7, we can see that the model provides a quite reasonable fit. The summed deviance term is not in the tail of a chi-square distribution for 13 degrees of freedom, and the smallest eigenvalue of the information matrix is 0.1366226. The dispersion parameter is estimated to be a(ψ) = 1.49475, indicating that we were justified in developing a negative binomial specification for these counts. Both the coefficients for the prestige variable and the variable for the size of the committee (measured by number of members) have 95% confidence intervals that bound zero. There is therefore no evidence that these are important determinants of the quantity of bills assigned, given these data and this formulated model. This is interesting because some committees in Congress are larger presumably because they have more activity. However, the size of the committee is likely to be affected by 40 years of Democratic policy priorities as well. Other measures of size and resources for a committee are its number of staff and its subcommittees. The corresponding coefficients for both of these have 95% confidence intervals bounded away from zero. Predictably, the interaction term for these variables also has a 95% confidence interval bounded away from zero, although the negative sign is mysterious. It is not surprising that the coefficient for the variable indicating committee counts in the 103rd House is also reliable. Seemingly, this tells us that party control and a change of agenda do not make huge changes in the assignment of bills to committee during the first 100 days of a Congress. In other words, a certain amount of work of a similar nature needs to take place regardless of the policy priorities of the leadership. Figure 6.3 shows another way of looking at model residuals. In this display, the Pearson residuals are shown by the dots and the deviance residuals are indicated by the length of the vertical lines. These residual quantities are both sorted by the order of the outcome variable.
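A display of this sort can be assembled from any fitted GLM results object. The matplotlib sketch below is a rough, hypothetical rendering that assumes nb_fit is the fitted negative binomial model from the earlier sketch; it is not the code used to produce Figure 6.3.

```python
import numpy as np
import matplotlib.pyplot as plt

# Residuals from the fitted model (nb_fit comes from the earlier sketch).
pearson = np.asarray(nb_fit.resid_pearson)
deviance = np.asarray(nb_fit.resid_deviance)
order = np.argsort(nb_fit.model.endog)      # sort both by the outcome variable

x = np.arange(len(order))
fig, ax = plt.subplots(figsize=(7, 4))
ax.vlines(x, 0, deviance[order], linewidth=1, label="Deviance residuals")
ax.plot(x, pearson[order], "o", markersize=4, label="Pearson residuals")

# Reference bands at plus/minus 1 and 2 standard errors of the Pearson residuals.
se = pearson.std()
for k in (1, 2):
    ax.axhline(k * se, linestyle="--", linewidth=0.7)
    ax.axhline(-k * se, linestyle="--", linewidth=0.7)

ax.set_xlabel("Order of fitted outcome variable")
ax.set_ylabel("Residual effect")
ax.legend()
plt.show()
```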
Figure 6.3    Residual Diagnostics: Bill Assignment Model
[Figure: Pearson residuals (points) and deviance residuals (vertical lines) plotted by the order of the fitted outcome variable, with horizontal reference bands.]
Therefore, if there were some unintended systematic effects in the stochastic term, we would expect to see some sort of a trend in the display. It is clear that none is present. The horizontal bands indicate 1 and 2 times the standard error of the Pearson residuals in both the positive and negative directions and can be used to look for outlying points.

Example 6.6: Campaign Donations in the United States
Campaign contributions are a central part of the electoral dynamics in the United States. The money that candidates receive is crucial to support and enhance the mechanisms through which campaigns affect voting behavior. Resources are necessary to fulfill a campaign’s objectives of persuading, informing, and mobilizing the electorate. Equally important, the fingerprints that donors and contributors leave when they decide to fund campaigns provide information on the ideology, policy positioning, and political network of the candidates receiving financial support (Bonica, 2014). However, beyond the effects and implications of campaign contributions, there is another central question regarding this issue: What types of representatives receive higher contributions? Although this is not an easy question to answer, a helpful approach is to
observe the characteristics of the candidates and the constituencies that they aim to represent. First, it is important to highlight that a campaign contribution is an investment that would yield returns in terms of policies: Interest groups, voters, political action committees (PACs), and donors expect to advance their own preferences by putting in office a candidate close to those interests and likely to fight for them. Therefore, the decision to make an investment on a given candidate will depend on the evaluations of ideological closeness and on perceptions of whether the candidate is likely to win. Three factors in particular can affect such perceptions and evaluations: gender of the candidate, incumbency status, and the party to which she or he belongs. First consider gender. In the past decades, scholars have identified a large gap between the resources that men and women have for political action. On one hand, women have fewer financial resources and a decreased ability to contribute to the campaigns of female candidates that potentially represent them better (Schlozman, Burns, & Verba, 1994). On the other hand, studies suggest that women have overall received less financial support than male candidates for reasons related to gender bias and the perception that female candidates are not “winners” worth backing (Epstein, 1981). Some authors suggest that although the amounts raised by female candidates tend to be lower than those of their male counterparts, once we make the comparison within incumbency status groups, this gap disappears (Burrell, 1985). The implication is that general disparities can be attributed to the lack of female incumbents, which highlights the financial dimension of the popular “incumbency advantage.” Multiple authors (Ansolabehere, Snyder, & Stewart, 2000; Fouirnaies & Hall, 2014; Jacobson, 2009) find that incumbency status is indeed a powerful factor driving contributions. For example, Fouirnaies and Hall (2014) find a strong positive causal effect of incumbency on contributions mainly driven by the interest groups trying to buy “access” to government. Therefore, we should expect incumbents to receive larger contributions than challengers and open-seat contenders. However, the latter should also receive more than challengers given the absence of any potential advantage. A similar intuition lies behind the mechanism through which the partisanship of a candidate affects the contributions that she or he receives. Although the literature suggests that Republican candidates are likely to receive more resources given their alignment with corporations and businesses that are, in general, wealthy, more recent trends highlight
the changing composition of the donor pool. For example, Powell and Wilcox (2010) suggest that the means of asking for contributions (such as the Internet and social media) are changing the traditional profile of the White, old, well-educated, and affluent donor. Early studies of Internet donors report that they are less likely to fit a consistently liberal or conservative ideology and are younger and more diverse. However, while these divisions are getting blurry, we should still expect independents to receive less financial support given not only the low expectations of victory but also the absence of the party apparatus to support them. Finally, an important factor affecting the level of funding a candidate receives is the expectations of contributors about her or his policy preferences (Hillman, Keim, & Schuler, 2004). If, for example, there is more potential for redistributive policies and higher tax collection, then the levels of contributions from companies and businesses should decrease. Similarly, we should expect fewer contributions in states in which controversial issues related to minorities are more likely to be salient (e.g., immigration and labor policies). A signal of the “pressure” that candidates may receive from constituents to enforce redistributive or minority-oriented policies is the proportion of minorities in the state, such as Hispanics. These populations tend to favor welfare policies that benefit them or that might conflict with the interests of big businesses and organizations. Therefore, contributors might consider the strength of these populations in a particular region when deciding on providing financial resources to a campaign. To explore the effects of these variables, consider the model for contributions to the 2014 electoral campaigns in the 53 districts of California for the seats in the U.S. House of Representatives (TOTCONTR, in thousands of dollars). We model this variable as a function of the following covariates: gender of the candidate (CANDGENDER), party to which she or he belongs (PARTY), incumbency status (INCUMCHALL), and percentage of Hispanic citizens in the state (HISPPCT). The data comprise 180 candidates in 53 districts, with the majority of districts having only two candidates. Because the outcome variable (contributions) is continuous, a common approach is to implement a regular linear regression. However, it is important to consider that this variable is bounded (e.g., contributions cannot be negative), so such an approach is not the appropriate tool for analyzing it. The ordinary linear model can be summarized as

E[TOTCONTR] = 1β0 + CANDGENDERβ1 + PARTYβ2 + INCUMCHALLβ3 + HISPPCTβ5.
Figure 6.4    Histogram of Contributions to the U.S. Congressional Campaigns in California, 2014
[Figure: frequency histogram of total campaign contributions (in thousands of dollars), showing a strongly right-skewed distribution.]
A histogram of the outcome variable displayed in Figure 6.4 indicates a strongly right-skewed distribution, suggesting that the linear model might not be an appropriate choice. A diagnosis of residuals and fitted values in Figure 6.5 confirms this statement. In the first panel of the plot, we observe that the residuals increase with the mean, suggesting a nonconstant variance. Panels 2 and 3 provide more evidence for this issue and suggest a nonnormal distribution of the residuals. Therefore, to improve the fit and to account for the censored nature of the outcome variable, we conduct a gamma GLM with θ = log(μ) and the following specification:

E[TOTCONTR] = g⁻¹[1β0 + CANDGENDERβ1 + PARTYβ2 + INCUMCHALLβ3 + HISPPCTβ5],

where g⁻¹(Xβ) is the inverse of the log link function applied to the linear component. Note that this specification does not follow the common gamma inverse link, 1/μ, as in the example of electoral politics in Scotland. The model specification above produces the results in Table 6.8.
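A sketch of both specifications in Python with statsmodels follows. The file name and the exact coding of the categorical variables are illustrative assumptions; the gamma family is paired with a log link so that E[TOTCONTR] = exp(Xβ), as in the equation above.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical candidate-level file; column names mirror the text.
donations = pd.read_csv("ca_house_2014.csv")
spec = "TOTCONTR ~ CANDGENDER + PARTY + INCUMCHALL + HISPPCT"

# Gamma GLM with a log link, respecting the positive, right-skewed outcome.
gamma_fit = smf.glm(spec, data=donations,
                    family=sm.families.Gamma(link=sm.families.links.Log())).fit()

# Ordinary linear model on the same specification, for comparison.
ols_fit = smf.ols(spec, data=donations).fit()

print(gamma_fit.summary())
print(ols_fit.summary())
```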
Figure 6.5
[Three-panel residual diagnostics for the linear model of campaign contributions: a model fit plot of observed against fitted values, a residual dependence plot of Pearson residuals against fitted values, and a normal-quantile plot of standardized residuals.]
For comparison purposes, we also present the results from the linear model in Table 6.9. The results show that Independent candidates receive significantly lower contributions than Democrats. This, however, is not the case when we compare them to Republicans: The model does not provide evidence that the effect is distinguishable from zero. The table also provides evidence that incumbency status matters. Incumbent candidates or those competing in open-seat races are more likely to receive larger contributions than are challengers. The effect is larger for incumbents, providing support for the hypothesis of an “incumbency financial advantage.” Finally, as expected, the higher the proportion of Hispanics in a district, the lower the contributions a candidate receives. These findings are in line with theoretical expectations. When we compare these results with those yielded by the linear model, we observe some striking differences. The linear model shows as reliable factors on contributions only (1) whether a candidate is an incumbent and (2) the proportion of Hispanics. However, other relevant variables like “Independent status” are not distinguishable from zero. Without a proper model definition, the evidence could mislead researchers to infer null effects of relevant variables. To have a better interpretation of the substantive impact of these variables on contributions, Figure 6.6 presents the predicted contributions and first differences for these variables across multiple values of the covariates of interest. For example, the top-right panel shows the differences in campaign contributions between the party groups to which a candidate might belong while keeping the rest of the variables at their means or modes. As we can see, when the candidate is an Independent, he or she receives $280,774.80 less than a Republican candidate. Similarly, the bottom-left panel shows the financial advantage of the incumbent: All else equal, she or he receives $1,283,040 more than challengers and $901,620.9 more than open-seat contenders. Finally, the bottom-right panel shows the predicted values depending on the percentage of Hispanics in a district and the change in those predicted contributions when we change from the first quartile of the Hispanic percentage distribution (18.33%) to the third quartile (48.18%): Contributions decrease by $224,400.5.
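First differences like those in Figure 6.6 can be approximated from a fitted model by predicting at two covariate profiles that differ only in the variable of interest. The fragment below continues the earlier gamma GLM sketch (it reuses gamma_fit and donations), and the baseline categories are assumptions chosen only for illustration.

```python
import pandas as pd

# Baseline profile: illustrative modal categories and the mean Hispanic share.
baseline = {"CANDGENDER": "Male", "PARTY": "Democrat",
            "INCUMCHALL": "Challenger", "HISPPCT": donations["HISPPCT"].mean()}

def predicted(fit, **changes):
    """Predicted contributions (thousands of dollars) for one covariate profile."""
    profile = dict(baseline, **changes)
    return fit.predict(pd.DataFrame([profile]))[0]

# First difference for incumbency status, all else held at the baseline values.
first_diff = (predicted(gamma_fit, INCUMCHALL="Incumbent")
              - predicted(gamma_fit, INCUMCHALL="Challenger"))
print(f"Incumbent vs. challenger: {first_diff:.1f} thousand dollars")
```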
Table 6.8    Gamma Model of Campaign Contributions

                                     Coefficient   Standard Error   95% Confidence Interval
(Intercept)                             6.4165        0.5252        [5.9238, 6.9091]
Gender of the candidate = Male         −0.1570        0.3865        [−0.5196, 0.2055]
Party = Independent                    −1.5911        0.6790        [−2.2279, −0.9542]
Party = Republican                      0.2528        0.3576        [−0.0826, 0.5882]
Incumbency status = Incumbent           1.7839        0.3826        [1.4250, 2.1428]
Incumbency status = Open seat           0.9052        0.4246        [0.5070, 1.3035]
% Hispanics in district                −1.9230        0.8583        [−2.7281, −1.1180]

Null deviance: 578.1605, df = 179        Summed deviance: 448.9029, df = 173
Maximized ℓ(): −1,281.171                AIC: 2,578.341

Table 6.9    Linear Model of Campaign Contributions

                                     Coefficient   Standard Error   95% Confidence Interval
(Intercept)                           553.8604        261.6336      [37.4556, 1,070.2653]
Gender of the candidate = Male        176.1828        192.5299      [−203.8271, 556.1927]
Party = Independent                  −325.7645        338.2318      [−993.3567, 341.8277]
Party = Republican                     92.1092        178.1287      [−259.4761, 443.6945]
Incumbency status = Incumbent       1,356.0105        190.6179      [979.7743, 1,732.2467]
Incumbency status = Open seat         289.4947        211.5055      [−127.9688, 706.9582]
% Hispanics in district            −1,036.3993        427.5418      [−1,880.2691, −192.5295]

Residual standard error: 1,041, df = 173        R-squared: 0.266
Figure 6.6    Predicted Values: Campaign Contributions
[Four-panel figure of predicted total campaign contributions (in thousands of dollars): by gender of the candidate (ŷF − ŷM = −44.05), by party of the candidate (ŷR − ŷD = 74.5081; ŷI − ŷR = −280.7748), by incumbency status (ŷI − ŷC = 1283.04; ŷO − ŷI = −901.6209), and by percentage of Hispanic population in the congressional district over its interquartile range, 0.183 to 0.482 (ŷQ3 − ŷQ1 = −224.4005).]
CHAPTER 7. EXTENSIONS TO GENERALIZED LINEAR MODELS
Introduction to Extensions
This chapter looks at various extensions to the standard GLM that are commonly used for particular data challenges. This is not meant to be a comprehensive list as that would require an enormous amount of space. Instead, we highlight some useful forms. In general, these were developed to solve particular data problems associated with applying standard GLMs such as overdispersion, an excess of zero values, aggregation hierarchies, clustering, and forms of the outcome not directly associated with an exponential family form. We first describe the quasi-likelihood extension of GLMs in which a parametric assignment is replaced with moment conditions in order to handle data problems like overdispersion with count or binomial outcomes. We describe the theoretical development and then illustrate its use with an application that seeks to explain counts of suicides in advanced industrialized democracies. Next we present generalized mixed models that introduce hierarchies (multilevels) to account for aggregation in the data at hand. These are an important extension since ignoring grouping in the data when setting up a model leads to incorrect standard errors on the subsequently estimated regression coefficients. Our example for these models is a random intercept specification applied to panel data where answers through time are nested within respondents from the Mexican Family Life Survey (MxFLS). After this we introduce fractional regression models for outcomes that are interval measured between 0 and 1. An example is provided using the proportion of growth for sub-Saharan African countries where the key question is whether dictatorship contributes to inflation. Next we discuss a family of “Tobit” models that can accommodate censoring problems on the outcome variable. The example from this section is a model for political corruption at the country level where the provided outcome variable has problems with both the lower and upper bounds. After this we describe a family of “zero-inflated” models designed to deal with different types of count outcomes where the responses are dominated by zero values, which can have a different interpretation in this context. Our example for this set of models comes from international relations and looks at the characteristics of peace agreements.
Quasi-Likelihood Estimation
All the models explained throughout this monograph assume that there is a probability model for the data under analysis. This implies that the researcher has sufficient knowledge about the data generation process or substantial experience with similar data. This assumption is not trivial; there are cases when there is not enough information about the distribution of the data, or the parametric form of the likelihood is known to be misspecified (Wedderburn, 1974). Obviously, this precludes the standard maximum likelihood estimation of unknown parameters since we cannot specify a full likelihood equation or a score function. However, even in those cases, we are generally able to specify whether the variables of interest are continuous or discrete, how the variability of the response changes with the mean response, and so on. This information is particularly useful if we consider that most features of GLMs only depend on the first two moments rather than the entire distribution. Therefore, it allows researchers to use a more flexible form that retains desirable GLM properties (i.e., those described in Fahrmeier & Kaufmann, 1985; Heyde, 2008; Wedderburn, 1976). This estimation procedure, known as quasi-likelihood, only requires specification of the mean function of the data and a stipulated relationship between this mean function and the variance function. The specification of these two components gives a replacement for the unknown specific form of the score function, the log-likelihood derivative, that still provides the necessary properties for maximum likelihood estimation:

E[∂ℓ(θ)/∂μ_i] = 0,    Var[∂ℓ(θ)/∂μ_i] = 1/(a(ψ)τ²),    −E[∂²ℓ(θ)/∂μ_i²] = 1/(a(ψ)τ²).

We imitate these three criteria of the score function with a function that contains significantly less parametric information: only the mean and variance. This new quasi-score function (McCullagh & Nelder, 1989, p. 325) can be defined as

q_i = (y_i − μ_i)/(a(ψ)τ²).

The associated contribution to the log-likelihood function from the ith point is defined by

Q_i = ∫_{y_i}^{μ_i} (y_i − t)/(a(ψ)τ²) dt.

Since the components of Y are independent by assumption, the log-quasi-likelihood for the complete data is the sum of the individual contributions:

Q(θ, a(ψ)|y) = Σ_{i=1}^{n} Q_i.
Thus, in order to find the maximum likelihood estimator, θ̂, we need to solve

∂Q(θ, ψ|y)/∂θ = ∂/∂θ Σ_{i=1}^{n} Q_i = ∂/∂θ Σ_{i=1}^{n} ∫_{y_i}^{μ_i} (y_i − t)/(a(ψ)τ²) dt
              = Σ_{i=1}^{n} [(y_i − μ_i)/(a(ψ)τ²)] ∂μ_i/∂θ
              = Σ_{i=1}^{n} [(y_i − μ_i)/(a(ψ)τ² g(μ_i))] x_i
              = −Σ_{i=1}^{n} y_i + nθ ≡ 0.
This process is comparable to the regular MLE estimation in which we substitute Q(θ, a(ψ)|y) for the log-likelihood equation ℓ(θ, a(ψ)|y) = Σ_{i=1}^{n} log f(y_i|θ, a(ψ)). So only the first two moments are required rather than a specific PDF or PMF. This literature begins roughly with Wedderburn (1974) and McCullagh (1983) but gains momentum with Chapter 9 of McCullagh and Nelder (1989). Some other works on quasi-likelihood generalized linear models have focused on modeling dispersion (Efron, 1986; Nelder & Pregibon, 1987; Pregibon, 1984). Therefore, it follows that other quantities of interest, like the deviance, could easily be derived using the quasi-score function. More specifically, the quasi-deviance function will be of the following form:

D(θ, ψ|y) = −2a(ψ) Σ_{i=1}^{n} Q_i = 2 Σ_{i=1}^{n} ∫_{μ_i}^{y_i} (y_i − t)/τ² dt.
To illustrate the process outlined earlier, consider the following example.
Table 7.1    Quasi-Likelihoods, Canonical Links, and Variance Functions

Distribution                τ²            Q(θ, ψ|y)                                   Canonical Link: θ = g(μ)
Normal                      1             −(y − μ)²/2                                 μ
Poisson                     μ             y log μ − μ                                 log μ
Gamma                       μ²            −y/μ − log μ                                −1/μ
Inverse Gaussian            μ³            −y/(2μ²) + 1/μ                              −1/(2μ²)
Unspecified distribution    μ^ζ           yμ^(1−ζ)/(1 − ζ) − μ^(2−ζ)/(2 − ζ)          μ^(1−ζ)/(1 − ζ)
Binomial                    μ(1 − μ)      y log[μ/(1 − μ)] + log(1 − μ)               log[μ/(1 − μ)]
Negative binomial           μ + μ²/k      y log[μ/(k + μ)] + k log[k/(k + μ)]         log[k/(k + μ)]
Suppose that we ignore the data-generating process of an outcome variable we are studying. However, we have information to determine that the relationship between the mean and the variance can be defined as τ² = μ. Then, the quasi-score function is

q_i = (y_i − μ_i)/(a(ψ)μ_i).

And from it we can derive the log quasi-likelihood:

Q(θ, ψ|y) = Σ_{i=1}^{n} ∫_{y_i}^{μ_i} (y_i − μ_i)/(a(ψ)μ_i) dμ_i = y log μ − μ.

This form resembles the likelihood for a Poisson distribution. In Table 7.1, we provide information on other quasi-likelihoods. In summary, when conducting quasi-models, we use the usual maximum likelihood engine for inference with complete asymptotic properties such as consistency and normality (McCullagh, 1983), by only specifying the relationship between the mean and variance functions as well as the link function (which actually comes directly from the form of the outcome variable data). This allows us to make inferences and estimate quantities of interest without the need for distributional assumptions while still obtaining the desirable properties of the GLMs. Quasi-likelihood estimators are consistent and asymptotically equal to the true estimand (Fahrmeir & Tutz, 2001, pp. 55–60; Firth, 1987;
McCullagh, 1983). However, a quasi-likelihood estimator is often less efficient than a corresponding maximum likelihood estimator and can never be more efficient. Formally, V_quasi(θ) ≥ [I(θ)]⁻¹. Thus, when available, researchers should use all the data and distributional information for the estimation of parameters.

Example 7.1: Quasi-Poisson GLM of Suicide Rates
In order to understand the application of a quasi-model, consider an example in which the outcome variable is the number of suicides per 100,000 individuals that happened in 2009 in countries that are members of the Organisation for Economic Co-operation and Development (OECD). The literature on the determinants of suicide points to both environmental and personal characteristics. Chief among them are psychological and physical health, social support and negative life events, economic environment, and even weather conditions. Across the social sciences, several studies highlight the role of economic conditions in shaping suicide rates. However, there are competing theories regarding the direction and magnitude of this effect. Durkheim (1897), for example, suggests a quadratic relation in which suicide counts are higher during both economic booms and depressions. In his view, under those settings, social integration and regulation are lower, which in turn promote rises in suicides. Ginsberg (1966) limits this effect to economic prosperity only. The argument is that without economic concerns, individuals focus on other personal dimensions of their lives that are more likely to generate suicidal behavior. In contrast, several authors (Dublin & Bunzel, 1933; Henry & Short, 1954; Ogburn & Thomas, 1922) posit that this phenomenon is more likely to occur during economic depressions due to their effects on unemployment, wealth, and increase in anxiety given financial concerns. Studies in other fields have determined that other major elements behind suicide rates go well beyond economic explanations. These studies focus on physical and psychological factors and have identified substance abuse and mental health disorders as crucial factors affecting suicidal behavior. More specifically, studies find that more severe mental health and substance disorders are associated with higher suicide rates. To test these expectations, we use data collected from members of the OECD. The sample covers the year 2009 and 36 countries that belong to this organization. We measure “economic conditions” with the real gross
domestic product per capita (GDP) in 2009 per country (in thousands of dollars),11 and “substance abuse” with the share of the population with an alcohol or drug use disorder.12 We also include “average temperature” for that year (in Celsius) given the relevance of weather as a factor behind suicidal behavior.13 The outcome variable is the number of suicides in the country per 100,000 individuals. A histogram with the distribution of these variables is presented in Figure 7.1. The outcome variable, suicides, can be treated as a count variable and therefore modeled with a Poisson model. The results are shown in the odd rows of Table 7.2 and indicate that both economic conditions and substance abuse have a reliable effect on suicide rates at conventional levels. Substantively, if we increase the GDP per capita from the first to the third quartile (24.23 to 40.12), we observe a decrease of 1.78 suicides. On the other hand, an increase from the first to the third quartile of the share of the population with a substance abuse disorder (2.233% to 3.265%) is associated with an increase of 2.27 suicides.
Table 7.2    Modeling Suicide Rates in OECD Countries: 2009

                        Model           Coefficient   Standard Error   95% Confidence Interval
(Intercept)             Poisson            2.479          0.276        [1.941, 3.023]
                        Quasi-Poisson      2.479          0.438        [1.628, 3.346]
Economic conditions     Poisson           −0.009          0.004        [−0.016, −0.001]
                        Quasi-Poisson     −0.009          0.006        [−0.021, 0.003]
Temperature             Poisson           −0.013          0.010        [−0.033, 0.006]
                        Quasi-Poisson     −0.013          0.016        [−0.044, 0.017]
Substance abuse         Poisson            0.173          0.047        [0.080, 0.264]
                        Quasi-Poisson      0.173          0.074        [0.025, 0.316]

Null deviance: 109.913, df = 35        Summed deviance: 78.623, df = 32
Dispersion parameter: 2.52             AIC (for Poisson): 242.42
Figure 7.1    Histogram of Variables: Suicide Rates Example
[Figure: four histograms, one for each variable: Suicide rate (number of suicides per 100,000 people), Economic Conditions (GDP per capita, in thousands of dollars), Substance abuse (% of population with alcohol or drug use disorders), and Temperature (mean temperature, in Celsius).]
After obtaining these results, we can test for overdispersion. More specifically, we calculate the sum of the product of the residuals and the working weights (the set of weights from the last iteration of the IWLS) and divide it by the residual degrees of freedom. This quantity is 2.519 (> 1), indicating overdispersion. Thus, we conduct a quasi-Poisson model to account for it. Recall that this would allow us to model the variance as a linear function of the mean, in contrast to the underlying assumption of a Poisson model that μ = τ². The results are shown in the even rows of Table 7.2. Although the coefficients did not change between the two models, the overdispersion parameter affects the standard errors and, in turn, the confidence intervals. Once we account for it, we observe that the coefficient of economic conditions is no longer distinguishable from 0, and therefore, we cannot reject the null hypothesis of no effect. Figure 7.2 illustrates the regression lines for the quasi-Poisson models and shows their predictive confidence intervals. As we can see, the reliable effect of GDP captured by the Poisson model might have been driven by some outliers, although the data do not show a strong pattern. However, the substance abuse curve suggests a positive and strong relationship between this variable and suicide rates.
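A minimal sketch of this comparison in R follows; the data frame oecd and the variable names (suicides, gdp, temp, substance) are illustrative assumptions, not the objects distributed with GLMpack, and the Pearson-based dispersion estimate shown here is one standard way of computing the quantity described above.

```r
# Poisson GLM of suicide rates
pois.fit <- glm(suicides ~ gdp + temp + substance,
                family = poisson(link = "log"), data = oecd)

# Pearson-based estimate of the dispersion parameter:
# sum of squared Pearson residuals over the residual degrees of freedom
phi.hat <- sum(residuals(pois.fit, type = "pearson")^2) / df.residual(pois.fit)
phi.hat   # values well above 1 suggest overdispersion

# Quasi-Poisson refit: identical coefficients, standard errors scaled by sqrt(phi)
qpois.fit <- glm(suicides ~ gdp + temp + substance,
                 family = quasipoisson(link = "log"), data = oecd)
summary(qpois.fit)
confint.default(qpois.fit)   # Wald-type 95% confidence intervals
```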
Generalized Linear Mixed-Effects Model

What are the consequences of victimization? Does experiencing a crime affect the propensity to migrate? Suppose we are interested in estimating the likelihood that a subject decides to leave her city as a function of whether she was a victim of a crime. In order to do this, we can interview subjects at different points in time and inquire about the crimes of which they have been victims as well as about their desire to migrate to a different place. This will yield multiple measures of these two variables for each of the individuals in our sample. In other words, these panel data will follow a "hierarchy" with different "levels": single responses through time are nested within respondents, which can in turn be nested within households, households within states, and so on. This particular data structure poses new challenges for the analysis of the relationship between covariates and outcome. Given that each individual is generating multiple outcomes through time, we can no longer assume that our observations are independent of one another. The standard models discussed in this monograph are generally inappropriate in these circumstances where we deal with correlated data. However, mixed-effects models consider the dependencies of the observations within
Figure 7.2    Relationship Between Economic Conditions and GDP (With Confidence Intervals)
[Figure: fitted quasi-Poisson regression curves with confidence intervals for the number of suicides per 100,000 people, plotted against GDP per capita (in thousands of dollars) and against the % of population with alcohol or drug use disorders.]
114 “clusters” and allow us not only to reach unbiased estimates of the effect of covariates of interest and their respective standard errors but also to address questions related to the variation between and within groups: analyze the trajectories of groups/individuals through time, assess the differences between clusters, and others. This approach is useful for panel data where responses recorded through time are perfectly “grouped” by panelist. Recall that GLMs are composed of three elements: a random component, a linear predictor on which the expected value of the outcome depends, and a link function. We can understand the generalized linear mixed-effects model (GLMM) as an extension of the GLM where we define new levels (Gelman & Hill, 2007). For GLMMs, we add random effects to the linear predictor and then express the expected value of the outcome conditional on those random effects. But what are random effects? Consider the example regarding the effect of crime on migration. The propensity of an individual to migrate might not only be affected by victimization up to time t but also by other factors and personal characteristics of the subject that are not likely to change through time, such as the attachment she has to the place where she lives, her network outside this city, her openness to new experiences, and so on. Therefore, when estimating the effect of crime on migration through time, we need to include an indicator for each individual in our sample representing the effect of being that individual i. If the subjects in our sample have been chosen randomly with the goal of treating them as a representation of the population of interest, then their effects on the outcome are also going to be random and generalizable to that same population. Therefore, each random effect ai is a random variable that not only will help to make inferences about the population but also allows us to assess the variation between individuals, predict outcomes for each of them, and incorporate the existent correlation between observations. In a simplified version of the example discussed earlier, we are interested in assessing the effect of victimization on the propensity of women to migrate. In order to do so, we collect responses from women at different points in time. Recall that for each woman i, we record t responses. Because the respondents are selected randomly, the effect that each of them has on the outcome is a random variable u, and as a consequence, the distribution of the outcome y is going to be conditional on it. Thus, y is generally assumed to consist of conditionally independent elements, each with a distribution from or similar to the exponential family. Let yit be a vector of length N (total number of observations at the
individual-wave level) containing the individual observations collected for i = 1, . . . , m individuals in the sample at t different points in time, and n_i the number of responses recorded for each individual i (i.e., waves in which we contacted and interviewed the respondent). Then we have that

    y_it | u, θ ∼ indep. f_{Y|u}(y_it | u, θ)

    f_{Y|u}(y_it | u, θ) = exp{ [y_it θ_it − b(θ_it)] / a(ψ) + c(y_it, ψ) }.

Therefore, for each y_i (the n_i × 1 response vector with the recorded observations from each ith group), we have that

    E[y_i | u_i, θ] = μ_i                                    (n_i × 1)
    Var[y_i | u_i] = a(ψ) τ_D Λ_i τ_D                        (n_i × n_i)

    g(μ_i) = θ_i                                             (n_i × 1)

    θ_i = X_i β + Z_i u_i                                    (X_i: n_i × p, β: p × 1, Z_i: n_i × q, u_i: q × 1)
    y_i independent of y_{i'} for i ≠ i',

where, as in the GLM case, θ_i is the linear predictor, g(·) is the link function, X is a matrix with explanatory variables, and β is a vector of coefficients or fixed effects for each cluster i. However, in contrast to the GLM, GLMMs also have a vector of subject-specific effects (random effects), u_i; a model matrix for these random effects, Z_i, that indicates to which individual or cluster i each observation belongs; and a matrix, Λ_i, that expresses the dependence structure of the responses within individuals (e.g., if the observations within a group are sampled independently, then Λ_i = I_i), and τ_D = diag[√τ²], where τ² is the variance function characteristic of the distributional family to which y belongs. In general, we assume that the random effects are multivariate normally distributed with mean zero. We also assume that the observations might be correlated and that there might exist unequal variances. Thus, in matrix form,

    u_i ∼ N_q(0, D)

    u_i independent of u_{i'} for i ≠ i',
where D is the q × q covariance matrix of the random effects. So far, the model specification has been conditional on the values of the random effects u. However, since we are interested in the marginal distribution of the response, we derive that the mean of y is

    E[y_i] = E[E[y_i | u_i]] = E[μ_i] = E[g^{−1}(X_i β + Z_i u_i)].

For the marginal variance of y, we have that

    Var[y_i] = Var[E[y_i | u_i]] + E[Var[y_i | u_i]] = Var[μ_i] + E[a(ψ) τ_D Λ_i τ_D].

As we can see, the presence of random effects affects both the marginal mean and variance of the outcomes. Furthermore, as we discussed earlier, they also introduce a correlation among observations that have random effects in common (e.g., that belong to the same individual/group). Combining all this information, we have that the likelihood function built from the conditional distribution of y|u is

    L = ∫ f_{Y|u}(y | u) f_U(u) du.

The task of evaluating this integral is not trivial, but several methods achieve it. Maximizing the likelihood function yields maximum likelihood estimates of the (fixed) effects β and the dispersion parameter ψ. Furthermore, we can approximate this function using methods such as penalized quasi-likelihood (PQL), Laplace approximation, and Gauss-Hermite quadrature.

Example 7.2: Binomial GLMM of Migration

To illustrate these concepts and definitions, consider the example outlined previously. We want to test whether being a victim of a crime increases the propensity of women living in a Mexican border state to migrate. Studies show that security conditions and perceptions of safety influence the decisions of individuals to change their residency even if it implies high costs. The data for this test come from the Mexican Family Life Survey (MxFLS), a three-wave survey that collects information at the individual and household levels in Mexico.14 The households and respondents were selected to achieve representativeness at various levels through a multistage, clustered probability sampling design. For this example, we study a sample of 644 women living in Coahuila, a state in Mexico located at the border with the United States, who were interviewed at three different points in time.
The outcome variable is whether respondents have thought of migrating (MIGR), and therefore it takes values of 0 and 1. Our main explanatory covariates are the number of crimes of which the female respondent has been a victim (NCRIME) and the mean seriousness of those crimes (SEV). We also include other variables that the literature identifies as determining factors of migration: evaluations of past life conditions (PAST), evaluations of future community conditions (FUT), and other demographic variables such as income (INC) and age (AGE). Furthermore, we include an indicator for the wave under analysis (WAVE) and random effects for respondents that we assume to be normally distributed. We use Z as a model matrix that identifies to which subject each of the tth responses corresponds and u as the vector of random effects. The model is set up with a logit link function and can be defined as

    Pr[MIGR = 1] = g^{−1}(θ) = logit^{−1}[Xβ + Zu] = exp(Xβ + Zu) / (1 + exp(Xβ + Zu))

    Xβ = 1β_0 + NCRIMEβ_1 + SEVβ_2 + PASTβ_3 + FUTβ_4
         + INCβ_5 + AGEβ_6 + WAVEβ_7.

The results presented in Table 7.3 suggest that neither the number of crimes nor crime seriousness has a reliable effect on the propensity of women to migrate.15 However, the 95% confidence interval of future life evaluations indicates that worse evaluations of future life conditions decrease the propensity to migrate. This finding, although apparently counterintuitive, is in line with the notion that pessimism and negative expectations of the future are associated with impaired performance, poor coping, and depression: elements that lead to inactivity and "resignation" over the current situation. The standard deviation indicates that there is some mild variance between the intercepts of the women who were interviewed, justifying the use of a mixed-effects model. Furthermore, after comparing the results with those from a logistic regression with no random effects, we find that the coefficient of the number of crimes becomes reliable. Although this particular finding generally holds for male populations, in this particular sample, the finding has a weak theoretical basis. In Mexico, people who decide to migrate are generally men who leave their homes with the objective of sending money back to their families. Thus, while victimization of a subject or of her relatives might motivate the decision to migrate, it generally applies to the head of household, typically the man, and not to women or whole families.
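A minimal sketch of this kind of model in R uses the lme4 package; the data frame mxfls and the respondent identifier id are hypothetical names (the covariates were scaled in the analysis, per endnote 15), and this is not the replication code for Table 7.3.

```r
library(lme4)

# Binomial GLMM with a logit link and a random intercept for each respondent
glmm.fit <- glmer(MIGR ~ NCRIME + SEV + PAST + FUT + INC + AGE + WAVE + (1 | id),
                  family = binomial(link = "logit"), data = mxfls)

summary(glmm.fit)                     # fixed effects and random-intercept variance
confint(glmm.fit, method = "Wald")    # quick Wald-type confidence intervals

# Ordinary logistic regression that ignores the panel structure, for comparison
glm.fit <- glm(MIGR ~ NCRIME + SEV + PAST + FUT + INC + AGE + WAVE,
               family = binomial(link = "logit"), data = mxfls)
```

By default glmer uses the Laplace approximation to the integrated likelihood discussed above; setting nAGQ greater than 1 switches to adaptive Gauss-Hermite quadrature for a single scalar random effect.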
Table 7.3    Victimization and Migration in Mexico

Fixed effects               Coefficient   Standard Error   95% Confidence Interval
(Intercept)                   −3.1777        0.9162         [−4.9735, −1.3818]
Number of crimes               0.6323        0.3876         [−0.1275, 1.3920]
Crime seriousness             −0.5106        0.3363         [−1.1698, 0.1486]
Past life evaluations          0.2644        0.2710         [−0.2666, 0.7955]
Future life evaluations       −1.2310        0.3468         [−1.9107, −0.5512]
Income                        −0.4095        0.3841         [−1.1624, 0.3434]
Age                           −0.2042        0.3473         [−0.8849, 0.4764]
Wave                          −0.0610        0.4794         [−1.0007, 0.8786]

Random effects              Variance      Standard Deviation   Number of Groups
Subject                       2.443           1.563                227

Summed deviance: 193.1, df = 297
Log-likelihood: −96.5
AIC: 211.1    BIC: 244.6
Fractional Regression Models

In many cases in the social sciences, an outcome of interest is bounded by zero and 1: y_i ∈ [0, 1], i = 1, . . . , n. Examples include the proportion of a subpopulation voting for a given candidate, a state's proportional cost of funding programs such as Medicaid, the proportion of free and reduced-price lunches provided by a set of schools, the proportion of antigovernment demonstrations that turn violent, and so on. Various approaches have been implemented over time, including a linear or truncated linear model and some dichotomous model applications (logit, probit, cloglog, Cauchit16), all with limited success. Aitchison (1982) developed a log-transformation from such (k + 1)-dimensional "compositional data," x_1, . . . , x_{k+1},
to the k-dimensional real space, R^k, via

    y_i = log(x_i / x_r),    i = 1, . . . , k (i ≠ r),               (7.1)
where x_r is arbitrarily chosen from the set of categories. This transformation would be applied to each case vector using the same x_r reference category in the denominator. One limitation is that no compositional value can equal zero, so a small amount needs to be added to true zero values. Also, taking the log of a very small number produces a very large negative value. In addition, like the multinomial logit model, summarizing results in a regression table can be awkward. By far the most principled approach is specifying that the outcome variable is beta distributed:

• PDF: BE(y|α, β) = [Γ(α + β) / (Γ(α)Γ(β))] y^{α−1} (1 − y)^{β−1},   0 < y < 1,  0 < α, β.

• E[Y] = α / (α + β).

• Var[Y] = αβ / [(α + β)²(α + β + 1)].
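For readers who want to fit such a beta-distributed outcome model directly, one option outside the examples in this monograph is the betareg package in R, which works with a mean-precision parameterization of the sort discussed next; the data frame and variable names in this sketch are hypothetical.

```r
library(betareg)   # not part of the GLMpack examples; shown only for illustration

# Beta regression with a logit link for the mean of a (0, 1) outcome
beta.fit <- betareg(y ~ x1 + x2, data = dat, link = "logit")
summary(beta.fit)  # mean-submodel coefficients plus a precision parameter (phi)
```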
Social science implementations include Brehm and Gates (1993) and Paolino (2001). A downside to this approach is that interpreting the coefficients is not very easy, although first differences are very helpful (recall the discussion starting on page 89). So a common reparameterization is to set α = μψ and β = (1 − μ)ψ. This means that E[Y] = μ and Var[Y] = μ(1 − μ)/(ψ + 1). The convenience of the mean in this form is obvious, and the convenience of the variance form is that ψ scales it away from a Bernoulli variance: The larger ψ, the smaller the variance is from the regular Bernoulli variance for a dichotomous experiment. Of course, if ψ = 0, then it equals the Bernoulli variance. The downside of this approach is that it only works on a closed interval for the outcome, [0, 1]. Ramalho, Ramalho, and Murteira (2011) describe a two-part model for [0, 1] that presages the zero-inflated Poisson model structure described later in this chapter. Stipulate:

    z = { 0  for y = 0
          1  for y ∈ (0, 1),                                         (7.2)

where the covariates enter through E[z|X] = G^{−1}(Xβ) and the capital G^{−1} denotes the cumulative density function, CDF, for the inverse link function. Here the link function can be one of the forms mentioned
(logit, probit, cloglog, Cauchit). They provide extensive Monte Carlo analysis of alternatives and properties. Ramalho, Ramalho, and Henriques (2010) provide a specification that reverses the bounds definition to [0, 1] with

    z = { 1  for y = 1
          0  for y ∈ (0, 1),                                         (7.3)

where again the covariates enter through E[z|X] = G^{−1}(Xβ). Note also that this approach is equivalent to a dichotomous GLM with a "sandwich" estimator for the standard error terms (see Section 7). It also turns out that fractional regression with both ends open, (0, 1), is an ongoing research area with some awkward approaches as of this writing.

Example 7.3: Fractional Regression for Increase in Inflation in Africa

This example uses data on Africa collected by Bratton and Van De Walle (1997) and made available with a codebook through the Inter-University Consortium for Political and Social Research (Study #6996). These authors collected data on political, economic, and geographic features for 47 sub-Saharan countries over the period from each country's colonial independence to 1989, with some additional variables collected for the period 1980 to 1989. The results from this effort are 98 variables describing governmental, economic, and social conditions for the 47 cases. Also provided are data from 106 presidential and 185 parliamentary elections, including information about parties, turnout, and political openness. Economic instability is one of the obvious characteristics of late 20th-century politics in sub-Saharan Africa. For a variety of reasons, including ethnic fragmentation, arbitrary borders, economic problems, outside intervention, and poorly developed governmental institutions, disproportionately high annual inflation rates are observed: 17.14% on average during this period (INFLATN in the data by country). Perhaps one reason is the levels of control of government by dictators during this period. The key variable to address this question in the dataset is DICTATOR, defined as the number of years of personal dictatorship that occurred from independence to 1989. The mean of this variable is 3.20. The control variables included are the area in thousand square kilometers at the end of the period (SIZE), the average annual gross national product (GNP) rate of growth in percent from 1965 to 1989 (GROWTH), the number of church-operated hospitals and medical clinics as of 1973 (CHURMED), the constitutional structure when not a dictatorship in ascending centrality (0 = monarchy, 1 = presidential,
Table 7.4    Does Dictatorship Effect Inflation in Sub-Saharan Africa?

              Coefficient   Standard Error   95% Confidence Interval
(Intercept)     −1.8601        0.5588         [−2.9553, −1.6540]
DICTATOR         0.0726        0.0281         [0.0174, 0.0833]
SIZE            −0.0001        0.0003         [−0.0008, 0.0010]
GROWTH          −0.2274        0.0628         [−0.3506, 0.1860]
CHURMED          0.0035        0.0913         [−0.1754, 0.2702]
CONSTIT          0.4703        0.2252         [0.0289, 0.6665]
REPRESS         −0.5664        0.2296         [−1.0163, 0.6795]

Null deviance: 15.671, df = 45
Summed deviance: 10.601, df = 39
2 = presidential/parliamentary mix, 3 = parliamentary; CONSTIT), and violence and threats of violence by the government against opposition political activity from 1990 to 1994 (REPRESS). Inflation numbers that were missing from the dataset were filled in from World Bank and EconStats resources. Missing values on the repression variable were filled in from online archival news reports. Specifying a fractional regression model with a logit link function gives the results in Table 7.4. The results support the claim that dictatorships increase inflation in the region. Of the control variables, only the variable for type of government is statistically reliable (CONSTIT), implying that purely parliamentary governments, presumably closer to the population and therefore at the other end of the political spectrum, also contribute to inflation. In short, these two results lend credence to the idea that governmental systems that require compromise and negotiation may manage the economy better in sub-Saharan Africa.
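A common way to fit a fractional logit of this sort in R is the quasi-binomial GLM, which borrows the binomial variance function without assuming binomial counts; dedicated routines for the Ramalho et al. estimators also exist on CRAN (e.g., the frm package). In this sketch the data frame africa, and the assumption that INFLATN has been rescaled to lie in [0, 1], are illustrative and not the book's replication code.

```r
# Fractional logit via a quasi-binomial GLM for a proportion outcome in [0, 1]
frac.fit <- glm(INFLATN ~ DICTATOR + SIZE + GROWTH + CHURMED + CONSTIT + REPRESS,
                family = quasibinomial(link = "logit"), data = africa)
summary(frac.fit)
confint.default(frac.fit)   # Wald-type 95% confidence intervals
```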
The Tobit Model

Censored and truncated data often need to be accounted for in the specification of a GLM. Recall that censoring occurs when data are unavailable to the researcher, such as when patients drop out of a trial or when a government withholds information. Conversely, truncation
occurs as part of a research design, for instance, if we specify only a certain range of a PDF in a model or if we are uninterested in behavior after a certain point in time. In general, censoring provides greater challenges. Suppose that an interval-measured outcome is observed for positive values only and those that would have been observed as negative values are assigned zero. For instance, Gill (2008) looks at public support for the death penalty in the U.S. states and its effect on counts of executions from 1993 to 1995. However, at the time, 15 states did not have a law allowing executions, and excluding them from the regression would likely give a biased result that is not representative of the country as a whole. That is, we are not allowed to see what the effect of public opinion is on the number of executions because executions are not allowed in these cases. A standard tool for handling this kind of censoring is the Tobit model (Tobin, 1958). Start with an interval-measured outcome variable y_1, . . . , y_n that is censored such that all values that would have naturally been observed as negative are reported as zero (this is generalizable to other threshold values). Assume the standard setup with the linear additive specification, except that in this case, it is linked to z, which is a latent outcome variable such that

    z = Xβ + ε.                                                      (7.4)

This unobserved outcome is assumed in the basic model to be normally distributed around Xβ with variance σ²:

    z_i ∼ N(Xβ, σ²).                                                 (7.5)

The relationship between the recorded outcomes and the latent variable is specified as

    y_i = { z_i  if z_i > 0
            0    if z_i ≤ 0.                                         (7.6)

In this way, we account for the censoring in the model by relating the covariates to the recorded data through the uncensored z. The likelihood function that results from these assumptions is given by

    L(β, σ²|y, X) = ∏_{y_i = 0} [1 − Φ(x_i β / σ)] ∏_{y_i > 0} (2πσ²)^{−1/2} exp[−(y_i − x_i β)² / (2σ²)],     (7.7)

which is easy to specify and contained in nearly all statistical software packages.
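As a sketch of how directly Equation 7.7 can be programmed, the following R code maximizes the Tobit log-likelihood with optim; the outcome y (censored at zero) and the model matrix X are assumed to exist in the workspace, and this is a generic illustration rather than the replication code for the executions example.

```r
# Tobit log-likelihood from Equation 7.7
tobit.loglik <- function(par, y, X) {
  k     <- ncol(X)
  beta  <- par[1:k]
  sigma <- exp(par[k + 1])              # log-scale keeps sigma positive
  xb    <- X %*% beta
  cens  <- y == 0                        # censored observations
  sum(pnorm(xb[cens] / sigma, lower.tail = FALSE, log.p = TRUE)) +  # 1 - Phi(x'b/sigma)
    sum(dnorm(y[!cens], mean = xb[!cens], sd = sigma, log = TRUE))  # normal density part
}

# start from OLS values and maximize
ols   <- lm(y ~ X - 1)
start <- c(coef(ols), log(summary(ols)$sigma))
fit   <- optim(start, tobit.loglik, y = y, X = X,
               control = list(fnscale = -1), hessian = TRUE, method = "BFGS")
fit$par   # beta estimates and log(sigma)

# Equivalent canned estimators exist as well, for example:
# AER::tobit(y ~ x1 + x2, left = 0, data = dat)
```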
A Type 2 Tobit Model With Stochastic Censoring

This section describes the Type 2 Tobit model (Amemiya, 1985, pp. 285–387), which generalizes the standard Tobit model by considering the censoring mechanism as stochastic rather than fixed. These are often called sample selection models since there may be a process by researchers or subjects that determines this censoring. The well-known Heckman two-step process (Heckman, 1979) used in political science is a special case. The Type 2 Tobit model is used to incorporate a prior filtering decision by studied actors that can be parameterized. Classic selection models are a special case. The model begins with the definition of the selection equation and the outcomes equation for the ith case:

    y*_i = x_{i1} β_1 + u_{i1}
    w_i  = I(y*_i > 0)
    m_i  = x_{i2} β_2 + u_{i2}
    log(y_i) = { m_i  if w_i = 1
                 −∞   if w_i = 0,                                    (7.8)

where Manning (1998) provides the rationale for using the logarithm of the outcome variable in this context. Here x_{i1} is a k_1-length vector of explanatory variable values that collectively affect the selection process, and x_{i2} is a k_2-length vector of explanatory variable values that collectively affect the continuous outcome variable value y_i for those cases with a positive outcome at the selection stage, y*_i > 0, since they were observed. Otherwise, y*_i ≤ 0, and we only see y_i = 0. Therefore, m_i = x_{i2} β_2 + u_{i2} is the log of the potential outcomes, which are observed when w_i = 1 and latent when w_i = 0. The challenge is that m_i is completely unobserved for w_i = 0 and fully observed for w_i = 1, and so an unbiased estimation process must include mechanisms for both outcomes. Conversely, the y*_i = x_{i1} β_1 + u_{i1} is assumed to be fully observed, meaning that we have a specification for selection and nonselection that is conditional on measured explanatory variables. Derivation of the likelihood function starts with the basic definition (Amemiya, 1985, p. 285):

    L(log(y)) = ∏_{w_i = 0} p(y*_i ≤ 0) × ∏_{w_i = 1} p(y*_i > 0) × ∏_{w_i = 1} p(log(y_i) | y*_i > 0),     (7.9)

where the three products are, respectively, p(y_i is unobserved), p(y_i is observed), and p(y_i given selection).
Applying the usual normal assumptions, this becomes

    L(log(y)) = ∏_{i=1}^{n} [1 − Φ(x_{i1}β_1/σ_1)]^{1−w_i} × ∏_{i=1}^{n} [Φ(x_{i1}β_1/σ_1)]^{w_i} [f(log(y_i) − x_{i2}β_2)]^{w_i}.     (7.10)
The third component obviously needs further clarification, so consider

    f(log(y_i) − x_{i2}β_2 | y* > 0) = [∫_0^∞ f(log(y_i) − x_{i2}β_2, y*) dy*] / p(y* > 0)
                                     = [∫_0^∞ f(log(y_i) − x_{i2}β_2) f(y* | log(y_i) − x_{i2}β_2) dy*] / p(y* > 0)
                                     = f(log(y_i) − x_{i2}β_2) [∫_0^∞ f(y* | log(y_i) − x_{i2}β_2) dy*] / p(y* > 0).     (7.11)
Denote φ_{[μ,σ²]} as a normal density function with mean μ and variance σ², and Φ_{[μ,σ²]} as the corresponding cumulative density function. Then applying the distributional assumptions to Equation 7.11 gives

    f(log(y_i) − x_{i2}β_2 | y* > 0)
      = φ_{[0,σ_2²]}(log(y_i) − x_{i2}β_2) (1 − Φ_{[x_{i1}β_1 + (ρσ_1/σ_2)(log(y_i) − x_{i2}β_2), σ_1²(1−ρ²)]}(0)) / Φ_{[0,1]}(x_{i1}β_1/σ_1)
      = [φ_{[0,1]}((log(y_i) − x_{i2}β_2)/σ_2)/σ_2] Φ_{[0,1]}([x_{i1}β_1 + (ρσ_1/σ_2)(log(y_i) − x_{i2}β_2)] / √(σ_1²(1−ρ²))) / Φ_{[0,1]}(x_{i1}β_1/σ_1).     (7.12)
Dropping the subscripts on the standard normal PDF and CDF designators (φ_{[0,1]} becomes φ and Φ_{[0,1]} becomes Φ) and inserting Equation 7.12 into Equation 7.10 simplifies to
    L(log(y)) = ∏_{i=1}^{n} [1 − Φ(x_{i1}β_1/σ_1)]^{1−w_i}
                × ∏_{i=1}^{n} [ Φ(x_{i1}β_1/σ_1) × (1/σ_2) φ((log(y_i) − x_{i2}β_2)/σ_2)
                                × Φ( [x_{i1}β_1 + ρ(σ_1/σ_2)(log(y_i) − x_{i2}β_2)] / √(σ_1²(1−ρ²)) ) / Φ(x_{i1}β_1/σ_1) ]^{w_i}

              = (1/σ_2) ∏_{i=1}^{n} [1 − Φ(x_{i1}β_1/σ_1)]^{1−w_i}
                × ∏_{i=1}^{n} [ φ((log(y_i) − x_{i2}β_2)/σ_2)
                                × Φ( x_{i1}β_1/√(σ_1²(1−ρ²)) + ρ(σ_1/σ_2)(log(y_i) − x_{i2}β_2)/√(σ_1²(1−ρ²)) ) ]^{w_i}.     (7.13)
Returning to the distribution of the error terms, we need to impose a restriction on the covariance matrix in the bivariate normal specification since u_{i2} is not observed for cases with w_i = 0. It is convenient and conventional to set σ_1² = 1. Furthermore, from the bivariate normal assumption, we know that

    Var(u_{i2} | u_{i1}) = σ_2² − σ_{12}² = σ_2² − (ρσ_1σ_2)²,       (7.14)

which is henceforth labeled as ξ². Note that this means that σ_2 = √(ξ² + σ_{12}²). The variance-covariance matrix is now

    [ 1         σ_{12}
      σ_{12}    ξ² + σ_{12}² ].                                      (7.15)
We can assign the prior 1/(α − 3) for the symmetric nonsingular matrix to make the calculations easier, where this is usually specified as a diagonal
matrix, and α > 3. The last parenthetical term of Equation 7.13 now becomes

    x_{i1}β_1/√(σ_1²(1−ρ²)) + ρ(σ_1/σ_2)(log(y_i) − x_{i2}β_2)/√(σ_1²(1−ρ²))
      = x_{i1}β_1/√(1−ρ²) + ρ(log(y_i) − x_{i2}β_2)/(σ_2√(1−ρ²))
      = σ_2 x_{i1}β_1/√(σ_2² − σ_{12}²) + σ_{12}(log(y_i) − x_{i2}β_2)/(σ_2√(σ_2² − σ_{12}²))
      = σ_2 x_{i1}β_1/ξ + σ_{12}(log(y_i) − x_{i2}β_2)/(σ_2 ξ)
      = √(ξ² + σ_{12}²) x_{i1}β_1/ξ + σ_{12}(log(y_i) − x_{i2}β_2)/(ξ√(ξ² + σ_{12}²))
      = [(ξ² + σ_{12}²) x_{i1}β_1 + σ_{12}(log(y_i) − x_{i2}β_2)] / (√(ξ² + σ_{12}²) ξ).     (7.16)
Now the likelihood function in Equation 7.13 is

    L(log(y)) = (1/√(ξ² + σ_{12}²)) ∏_{i=1}^{n} [1 − Φ(x_{i1}β_1/σ_1)]^{1−w_i}
                × ∏_{i=1}^{n} [ φ( (log(y_i) − x_{i2}β_2)/√(ξ² + σ_{12}²) )
                                × Φ( [(ξ² + σ_{12}²) x_{i1}β_1 + σ_{12}(log(y_i) − x_{i2}β_2)] / (√(ξ² + σ_{12}²) ξ) ) ]^{w_i}.     (7.17)

This form is much more convenient for estimation purposes. Importantly, this reduced form means that we only require estimation of the parameters β_1, β_2, ξ, σ_{12}, and y*.
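In practice, Heckman-type selection models of this class are usually fit with existing routines rather than by hand; one commonly used R implementation is the sampleSelection package, sketched below with hypothetical variable and data names (this is not the monograph's own code, and logy denotes the logged outcome, available only for selected cases).

```r
library(sampleSelection)   # one common implementation of sample selection models

# Probit selection equation plus a continuous outcome equation, estimated jointly by ML
sel.fit <- selection(selection = observed ~ z1 + z2,   # w_i = I(y*_i > 0)
                     outcome   = logy ~ x1 + x2,       # observed only when selected
                     data = dat, method = "ml")
summary(sel.fit)   # both equations plus the estimated error correlation (rho)
```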
127 Example 7.4: Tobit Model for a Censored Corruption Scale In an important contribution to the literature, Quinn (2008) looks at factors that influence corruption in 46 countries worldwide and fits Tobit models (as well as linear specifications) for various specifications. The outcome variable of interest here is a country-level compilation of effects into a 0 to 10 scale of increasing government corruption with an adjustment that modifies this range slightly. Quinn is concerned that this ordinal measure induces boundary effects and thus censures extreme cases, hence the motivation to run a Tobit model. The explanatory variables used here are a dichotomous measure indicating whether the government owns a majority of key industries (MSO), the log of the average per capita GDP from 1975 to 1983 (LOG.PC.GDP), the Polity IV democracy score from 1975 to 1983 (DEMOCRACY), average government spending as a percentage of GDP from 1980 to 1983 (GOVGDT), an index of the ability of capitalists to invest and move money (ECONFREE), and finally a dichotomous variable indicating whether the government has a federal system during this period (FEDERAL). To understand whether these last two measures have a joint effect on corruption, we include an interaction effect between them in the model. Substantively and qualitatively more details on these variables are given in the original article. The Tobit model specified here is based on Quinn’s model (6) on page 101 but differs in several ways. We use a smaller set of explanatory variables of interest and specify an interaction. In doing the original replication of the original Tobit model before specifying a different one, we noted minor differences in the results due to differing software. To be consistent with Quinn’s original intent, we made two decisions: (1) We set the truncation bounds to [2.6 : 10.7] to get nine left-censored cases and one right-censored case, and (2) we repeated the article’s practice of casewise deletion for consistency, even though this is wrong. The original model in the paper and our new specification gave slightly weaker results with imputation of the missingness. The results here are given in Table 7.5. The model results suggest a good fit for the Tobit specification since every coefficient estimate except that for FEDERAL is statistically reliable at conventional levels. Consistent with the original work, the presence of majority state ownership of key industries appears to provide an opportunity for corruption. However, we find a much stronger dampening effect of higher levels of democracy. The findings here on economic freedom and federal system status are more nuanced due to the interaction specification. Economic investment freedom is associated with lower levels of corruption, but if federal system status is also present (equal to 1),
Table 7.5    Factors Influencing Levels of Corruption

                      Coefficient   Standard Error   95% Confidence Interval
(Intercept)             25.0439        2.5433         [20.0590, 30.0288]
MSO                      2.1932        0.3748         [1.4586, 2.9279]
LOG.PC.GDP              −1.8659        0.2633         [−2.3820, −1.3497]
DEMOCRACY               −0.1360        0.0345         [−0.2035, −0.0685]
GOVGDT                  −0.0467        0.0139         [−0.0740, −0.0193]
ECONFREE                −0.5093        0.1983         [−0.8979, −0.1207]
FEDERAL                 −2.6484        1.3649         [−5.3236, 0.0269]
ECONFREE:FEDERAL         0.6119        0.2926         [0.0385, 1.1853]
log σ                    0.0200        0.1185         [−0.2123, 0.2522]

Log-likelihood: −60.01881 on 9 df
Log-likelihood: −60.01881 on 9 df
then this effect is substantially diminished (recall that these are both 0/1 variables), because the interaction term is activated by 1 × 1 expression of the two interacting covariates. We cannot say anything similar when ECONFREE is equal to zero about the effect of federal system status because it is not statistically reliable as a main effect and the interaction of federal system and economic freedom is negated by 0 × 1. For more details on interpreting interaction effects in GLMs, see Tsai and Gill (2013).
Zero-Inflated Accommodating Models A problem that can occur in the social sciences is count data with many zeros. For instance, if one is interested in studying militarized conflicts between each of the pairs of UN member countries, there are many zeros and very few ones. This phenomenon occurs when studying bankruptcies among U.S. corporations; violent crime events in large cities in North America, China, and Europe; and even terrorist attacks. This is also a problem in survey research when asking sensitive questions directly. Unfortunately, ignoring this lopsided distribution of data produces biased estimators and inefficient models. Erdman, Jackson, and Sinko (2008) show through simulation that both Poisson and negative binomial models lead to biased inferences in the presence of excess zeros. Kleinke and Reinecke (2013, 2015) deal with this problem by specifying Bayesian
129 generalized linear mixed-effects Poisson models for multilevel count data with missingness. Min and Agresti (2005) recommend a random-effect cumulative logit model in this context. Barry and Welsh (2002) use a generalized additive model (GAM) that includes smoothers in a regression specification to accommodate this effect. Lam, Xue, and Bun Cheuhg (2006) specify a similar semiparametric model by allowing explanatory variables to nonlinearly alter the log-link function of the Poisson model. Another approach specifies mixture distributions to counteract the effect of too many zeros with a separate contribution from the nonzero counts that leads to multimodal (Bayesian) forms (Angers & Biswas, 2003; Ghosh, Mukhopadhyay, & Lu, 2006; Janga, Leeb, & Kim, 2010). The literature for overcoming this problem is vast but generally centers on a small set of models and their variations, which we describe here.
Zero-Inflated Poisson Model Zero-inflated Poisson (ZIP) regression modes were first specifically developed by Lambert (1992), but earlier, more primitive forms can be found in Cohen (1963) and Yip (1988). Other contemporaneous developments are in Hall (2000), Yau and Lee (2001), and Böhning, Dietz, Schlattmann, Mendonça, and Kirchner (2002). The key idea is to account for structural zeros in the model: those that occur for specific reasons ancillary to the regular counts. This separation allows for a two-stage construction of the likelihood function. First, assume that the zeros are observed with probability π , and the rest of the observations come from a Poisson(λ) distribution with probability 1−π . Now assume that Y1 , . . . , Yn is an iid sample of size n each from 0 with probability πi Yi ∼ (7.18) Poisson(λi ) with probability 1 − πi . Therefore, the associated PMF is πi + (1 − πi )e−λi p(Yi = y) = y (1 − πi )e−λi λi /y!
for y = 0 for y = 1, 2, . . .
(7.19)
This means that the full model for the ZIP is a combination of two separate GLMs, a logit regression accounting for the zeros (logit(πi ) = Zi γ ) and a Poisson regression accounting for the nonzeros (log(λi ) = X i β). Thus, covariates explain variation in the count of zeros (Z), and
130 covariates explain variation in the positive integer counts (X). Such separation provides flexibility to account for more zeros than a regular Poisson GLM can handle by standard assumptions (see page 16).
Hurdle Model A similar approach to model zero-inflated count data without bias is given by Mullahy (1986). The hurdle model employs a zero-truncated y Poisson assumption: p(Yi = y|Yi > 0) = λi /((exp[λi ]−1)y!). This means that the associated PMF is now for y = 0 πi p(Yi = y) = (7.20) y (1 − πi )λi /((exp[λi ] − 1)y!) for y = 1, 2, . . . . The hurdle model has the added advantage of handling both zeroinflated count data (too many zeros for the Poisson assumption) and zero-deflated count data (too few zeros for the Poisson assumption). Recall that the Poisson assumes equal mean and variance and that we have discussed alternatives that deal with overdispersion in count data such as the negative binomial (see page 20) and the quasi-likelihood estimation (see page 106).
Zero-Inflated Negative Binomial Model A specification that accommodates excess zeros and overdispersion is proposed by Ridout, Hinde, and Demétrio (2001) and Yau, Wang, and Lee (2003) as the zero-inflated negative binomial model (ZINB). As done before in Section 2.5, define again for the PDF y trials to get r successes given p as the success probability on any trial giving Bernoulli r (1 − p)y for y = 1, 2, . . . For convenience Yau p p(Y = 0|r, p) = r+y−1 y et al. define the change of variable t = r/(r + μ), where μ = r(1 − p)/p (the mean of the negative binomial PMF; see page 33). This is just notational, since r r = = p. (7.21) t= r+μ r + r 1−p p We complement this with the probability of a zero count value, p(Y = 0|r, p) = p + (1 − p)tr , in a combined GLM fashion as done with the ZIP model. The combined moments are given by E[Y ] = (1 − p)μ μ1 + p) μ. Var[Y ] = (1 − p) 1 + ( r
(7.22) (7.23)
This becomes a regression specification by parameterizing the mean functions of the two model structures before combining:

    logistic:            p_i = logit^{−1}(Z_i γ)                     (7.24)

    negative binomial:   μ_i = exp(X_i β).                           (7.25)
It is also possible to do variance component and scale component for the ZINB model (see Section 3.4 of Yau et al., 2003).
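In the same spirit as the sketch above, a ZINB fit only requires switching the count distribution; the names here are again hypothetical rather than the monograph's code.

```r
# Zero-inflated negative binomial: excess zeros and overdispersion together
zinb.fit <- pscl::zeroinfl(y ~ x1 | z1, data = dat, dist = "negbin")
summary(zinb.fit)   # count coefficients, zero-inflation logit, and log(theta)
```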
Applications Because of their simplicity, the zero-inflated models are popular in the social sciences for dealing with awkward zero counts. Zorn (1998) reviews these models and notes U.S. congressional attempts to address or overturn Supreme Court rulings, motivating his empirical example. Cornman et al. (2015) use different zero-inflated approaches to understand differences in socioeconomic status on biological risk in the United States and Taiwan. Lundy and Dean (2018) use a ZIP model to discover which risk factors contribute to juvenile criminal behavior. Savun and Tirone (2017) use a ZIP model to investigate whether foreign aid is an effective tool for reducing terrorism with a hurdle model. Dagne (2010) uses a ZIP model for longitudinal count data. Sims and Addona (2014) look at whether age and relative age have an effect on the Major League Baseball draft also with a hurdle model. Boto-Garcia, Baños, and Álvarez (2018) analyze tourists’ length of stay at a vacation destination using a hurdle model. Lubell, Schneider, Scholz, and Mete (2002) apply the ZINB model to a public policy example. Fisher, Hartwell, and Deng (2017) use a ZINB model to analyze recidivism data for prisoners released who had diagnosed mental conditions while incarcerated. Hendrix and Haggard (2015) use a ZINB model to investigate whether food prices affect levels of conflictual politics in developing countries. This is hardly an exhaustive list, but it is intended to show the variety of applications for these approaches. Example 7.5: Characteristics of Peace Agreements The peace agreement data studied here come from the Uppsala Conflict Database (Peace Agreement Dataset Codebook, Version 2.0; Harbom, Högbladh, & Wallensteen, 2006; Högbladh, 2011). The data include 216 signed peace agreements between parties actively engaged in armed conflict from 1989 to 2005. The outcome variable of interest here is OUTISS, which is an ordinal variable indicating the scale of outstanding issues that were not resolved during the peace negotiations with
132 30% zero values. Obviously, there are international relations variables with much greater proportions of zeros such as war/not war, but in this case, running a conventional model that does not account for separate production of zero values yields substantially different and inferior results. The explanatory variable for the first (zero-generation) part of the model is PKO, which indicates whether or not the peace agreement included the deployment of peacekeeping forces. The covariates for the second (count) part of the model are all dichotomous measures: a variable indicating whether or not a rebel force is allowed to transform into a legal political party (PP), a variable for whether or not members of the rebel group are to be integrated into the civil service (INTCIV), a variable equal to 1 if there is an amnesty provision in the agreement and zero otherwise (AMN), a variable indicating the release of prisoners or not (PRIS), a variable indicating if the agreement provided for a federal state arrangement (FED), a dichotomous variable that is equal to 1 if the agreement establishes a commission or committee to oversee implementation (COMIMP), a variable for whether a federal state solution is included (FED), and a variable indicating if the agreement reaffirms earlier peace agreements (REAFFIRM). The model results are given in two blocks: the zero-generation component and the count component. Notice that there are two intercepts, which are interpreted in the standard way but play two different roles. In the first part, we see that PKO is a statistically reliable predictor of the zero-generation process, as is the corresponding intercept. The variable indicating that rebels are allowed to join the civil service has a reliable and negative coefficient implying that this provision is associated with fewer unresolved issues. This makes sense as this is a very strong indication of reconciliation. Oddly, the release of prisoners is associated with more unresolved issues, perhaps an indication of a deep conflict in which the taking of prisoners was common. Not surprisingly, if the peace agreement reaffirms previous agreements, then there are fewer loose ends at the conclusion of the process.
A Warning About Robust Standard Errors What if there is heterogeneity in the standard errors from a group definition in the data that is not directly accounted for in the model specification? This does not bias the coefficient estimates but will affect the estimated standard errors for both regular linear models and
Table 7.6    Characteristics of Peace Agreement Outcomes

Zero Generation        Coefficient   Standard Error   95% Confidence Interval
(Intercept)              −2.0444        0.3768         [−2.7830, −1.3058]
PKO                       1.3691        0.5784         [0.2355, 2.5027]

Count Explanation      Coefficient   Standard Error   95% Confidence Interval
(Intercept)               0.7724        0.0806         [0.6145, 0.9304]
PP                       −0.4998        0.2654         [−1.0200, 0.0204]
INTCIV                   −1.1548        0.3321         [−1.8057, −0.5039]
AMN                       0.2319        0.1672         [−0.0958, 0.5597]
PRIS                      0.4118        0.1418         [0.1339, 0.6897]
FED                       0.0580        0.2917         [−0.5137, 0.6297]
COMIMP                   −0.2535        0.1348         [−0.5177, 0.0107]
REAFFIRM                 −0.4668        0.1544         [−0.7694, −0.1641]

Log-likelihood: −352.2 on 10 df
GLMs. Cluster-robust (Huber-White) standard errors (Huber, 1967; White, 1980) adjust the variance-covariance matrix with a "sandwich estimation" approach:

    VC* = f_robust (X′X)^{−1} (U′U) (X′X)^{−1},                      (7.26)

where the two outer (X′X)^{−1} terms form the "bread" and (U′U) is the "meat" of the sandwich. Here U is an M × k matrix such that each row is produced by X_m ∗ ε_m for group/cluster m, the element-wise product of the N_m × k matrix of observations in group m and the N_m-length vector of corresponding residuals ε_m. Suppose there are M groups, with modified degrees of freedom for the model now equal to

    df_robust = [(N − 1)/(N − K)] × [M/(M − 1)],                     (7.27)
which means that it is possible to “fix” the standard errors after the production of the model and account for cluster-induced heterogeneity. Thus, the Huber-White sandwich estimator adjusts the estimate of the variances and does not affect the estimate of the coefficients when
134 the original model was incorrect. The more this model is incorrect, the greater the adjustment. Under tight conditions, the sandwich estimator gives adjusted variances for the maximum likelihood estimator that are asymptotically correct for an incorrectly specified model, hence the attraction and vast number of published applications. Furthermore, there are many variants on this estimator, typically tailored for a specification type of application. Plenty of caution is warranted, however, since incorrectly specified variance components may be accompanied by other incorrect features of the model specification that are not “fixed” (King & Roberts, 2015). Philosophically and practically, it is better to work hard to get a model that one believes is as correct as possible and does not require post hoc fixes. Hence, we recommend not relying on “robust” standard error calculations, without deep thought about the consequences, even though they are very easy to specify in commonly used statistical software packages.
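For completeness, and with the caveats above firmly in mind, cluster-robust standard errors of this kind are typically obtained in R along the following lines; the sandwich and lmtest packages are one common route, and the model, data, and cluster names below are assumptions for illustration only.

```r
library(sandwich)
library(lmtest)

# Some previously fit GLM (hypothetical model and data names)
fit <- glm(y ~ x1 + x2, family = poisson, data = dat)

# Cluster-robust (sandwich) variance-covariance matrix, clustering on 'group'
vc.cluster <- vcovCL(fit, cluster = ~ group)

# Coefficient table with the adjusted standard errors
coeftest(fit, vcov = vc.cluster)
```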
Summary The social sciences are the hard sciences. Real-world social science data are complex and pose a large number of challenges to researchers that natural science researchers often do not have to face. In this chapter, we aimed to introduce and describe a set of less common forms related to GLMs in order to help researchers overcome some of these data challenges: quasi-likelihood, linear mixed effects, and fractional, Tobit, and zero-inflated models. This chapter serves as an introduction to the basic intuition and interpretation of these forms. If the reader is interested in a more detailed and extensive description of these models and their structure, please see McCulloch, Searle, and Neuhaus (2008); Dupuy (2018); Amemiya (1984); McCullagh and Nelder (1989); Breslow and Clayton (1993); and Skrondal and Rabe-Hesketh (2004), as well as the emerging related literatures.
CHAPTER 8. CONCLUSION
Summary In this second edition of the monograph, we introduce new explanations, new datasets, and new models. While generalized linear models are a basic tool in the toolbox of any empirical social scientist, modifications and extensions continue to be developed. We started in Chapter 1 with a general introduction to the language and setup of generalized linear models. This was followed by some classic mathematical statistics theory in Chapter 2. Here we saw that most commonly used PDFs and PMFs can be expressed in a single exponential family form. This has the advantage of identifying and highlighting particular structural components such as b(θ ) and a(ψ). Likelihood theory and moment calculations using the exponential family form were provided in detail in Chapter 3. This material is important enough to fill volumes on its own (and in fact does so). Here we focused on calculating the first two moments and identifying the variance function. The most important chapter in the monograph followed. Chapter 4 provided the link from the standard linear setup, with interval measurement and assumed normality, to the broader class of outcome variable forms. The link function represents the core of generalized linear model theory since it is the mechanism that allows generalization beyond the Gauss-Markov assumptions. Chapter 5 discussed the important statistical computing issues associated with producing estimates for the generalized linear model. The basics of iterative weighted least squares were explained and demonstrated with examples. Chapter 6 contains the material that readers of generalized linear models care about most: Does the prescribed model provide a good fit to the data? Here we looked at residuals analysis as well as some commonly applied tests. Finally, we discussed various extensions and modifications to the GLM framework that address specific issues in the data in Chapter 7. Throughout the second half of the monograph, there has been an emphasis on looking at data, and all of the data are supplied in an R package GLMpack and in some cases printed in the text. Most of the examples included graphical displays to highlight various features of the data or model. The maxim that researchers should spend some time looking at the data before applying various parameterizations and summaries cannot be overemphasized. Furthermore, these examples 135
136 are real, original data-analytic problems, not simply contrivances for presentational convenience. It is not infrequently the case that some textbook will explain some data-analytic procedure with the aid of a stylized simple “dataset” that bears little resemblance to problems readers will subsequently face in their own work. This disconnect can be very frustrating when applying principles in practice. However, since the data included here address actual, unpublished problems, they are necessarily more “messy,” containing issues such as a dominant outlier, small outcome variation, reliable coefficients but large deviance, and the need for a two-stage process. This is an intentional feature of the monograph you are reading as it better reflects the actual process of social science data analysis.
Related Topics There are several associated and related topics not discussed in this monograph. An important application of generalized linear models is to the analysis of grouped and tabular data. Generalized linear models are quite adept at addressing these problems and the concerned reader is directed to Fahrmeir and Tutz (1994) or Lindsey (1997). Generalized additive models (GAMs) are a natural extension of generalized linear models in which the form of the relationship between the selected explanatory variable and the outcome variable can be defined nonparametrically. See the canonical text by Hastie and Tibshirani (1990) or the more modern book by Wood (2006). This is a marvelously flexible tool despite some of the complications that can arise. However, even a welldeveloped generalized additive model lacks something that a generalized linear model necessarily possesses: a direct analytical expression for the model relationship from the smoothed explanatory variables. Generalized estimating equations deal with nonindependent data where cases are clustered based on a panel structure or cases are clustered based on some shared characteristic through a categorical variable. These are implemented in most statistical packages, and interested readers are directed to the text by Hardin and Hilbe (2003) or pioneering articles such as Zeger, Liang, and Albert (1988); Lipsitz, Laird, and Harrington (1991); Zeger and Karim (1991); or, more recently, Hall and Severini (2012). This approach (GEE) does not require that the functional form be identified as an exponential family form yet uses the same mean and variance function developed for generalized linear models for computational ease
137 (Liang & Zeger, 1986; Zeger & Liang, 1986). Others such as Su and Wei (1991) look extensively at assessing model quality in this context in a variety of more complex settings.
Classic Reading The standard and classic reference for generalized linear models is McCullagh and Nelder (1989). Despite the popularity of this text, it remains somewhat distant to many social scientists due to the level of discussion and the preponderance of biostatistics examples (lizards, beetles, wheezing coal miners, etc.). The article by Nelder and Wedderburn (1972) is well worth reading as it is the original defining work on generalized linear models. The classic book by Lindsey (1997) provides a wealth of extensions such as spatial interaction, dynamic modeling, and polynomial specifications. The advanced-level book by Fahrmeir and Tutz (1994) is extremely rich in theory and offers many useful practical points to those with some experience in mathematical statistics. Dobson (1990) offers an accessible introduction with some useful problem sets (updated as Dobson & Barnett, 2008). The standard algorithm for computing parameter values, iterative weighted least squares (IWLS), is restricted at present to the exponential family form. Loosening this restriction is an important area of research. Härdle, Mammen, and Müller (1998) develop a generalized partially linear model and use Severini and Staniswalis’s (1994) quasi-likelihood estimation algorithm. Bayesian variants of the generalized linear model are common and incorporate prior information about the β vector. The explosion in computing power available to researchers has tremendously benefited Bayesian approaches, some of which build upon the generalized linear model. Cook and Broemeling (1994), Albert (1988), and Naylor and Smith (1982) provide excellent overviews with an emphasis on computing issues. Other provocative works include Ibrahim and Laud (1991) on the use of a Jeffreys prior, Walker and Mallick (1997) on frailty models, Zellner and Rossi (1984) on binary outcome variables, and West, Harrison, and Migon (1985) on forecasting. Hierarchical generalized linear models with Bayesian priors and hyperpriors are currently very common in applied methods in this area. Good examples include Daniels and Gatsonis (1999); Albert and Chib (1996); Ghosh, Natarajan, Stroud, and Carlin (1998); and Bennet, Racine-Poon, and Wakefield (1996).
138
Final Motivation The process of generating social science statistical models has four steps: obtaining and coding the data, specifying a probability model, applying the model to the data to obtain inferences, and finally determining the quality of the fit of the model to the data. This monograph directly addresses the last three steps with a unified process for developing and testing empirical models. Once a researcher is comfortable with the theoretical basis of the generalized linear model, then specification is simplified to two primary tasks: variable inclusion decisions and selection of an appropriate link function. In other words, it is not necessary to rattle through an extensive toolbox full of distinct and separate techniques. The generalized linear model is a flexible, unified framework for applying parametric models to social science data. The flexibility stems from the broad class of probability statements that are included under the exponential family form. The theory bridges the chasm between discrete and continuous probability models by recasting both PDFs and PMFs into this common exponential family form. Therefore, provided that an appropriate link function has been selected, the distinction between levels of measurement is not an important consideration. The GLM framework includes an integrated set of techniques for evaluating and presenting goodness of fit for the specified model. By concentrating quality assessment on a more general measure, deviances, generalized linear models provide a more cohesive framework for gauging model quality. Furthermore, this approach moves attention away from flawed measures of model fit such as the R2 measure in the linear model, common fixation with p-values, and linear misinterpretation of logit/probit coefficients. Given that every statistical computing software package readily accommodates the generalized linear modeling approach, there are few technical impediments to widespread use. The theoretical underpinnings can be somewhat challenging, particularly when explained to a different audience. This monograph has taken the approach that understanding the theory is critical but that it should be explained and applied in a way that social scientists find accessible.
ENDNOTES 1 Or more amusingly, but no less germane here, Oscar Wilde (1898) wrote, “The truth is
rarely pure and never simple.” 2 There are two notable exceptions. First, the interaction term in the saturated loglinear
model for a contingency table (saturated in this context means that there are as many parameters as cells in the table) demonstrates the strength of association of a hypothesized relationship and can be tested to provide inferential evidence of nonindependence. Useful discussions can be found in Bishop, Fienberg, and Holland (1975); Good (1986); Krzanowski (1988, chap. 10); and Upton (1991). The second useful application of the saturated model is in a time series where there exists a time-varying parameter and it is desired to have an estimate for each point. In this setup, the parameters can be allowed to vary as smooth functions of the other variables and as a function of time (Harvey, 1989; Harvey & Koopman, 1993; Hastie & Tibshirani, 1993). These structural time-series models are formulated to use unobserved features of the data that affect patterns of interest such as periodicity.
3 It should be noted that most statistical packages do not allow an explicit form for the binomial coefficient or "choose" operator, C(n, y) = n!/[y!(n − y)!], in an estimation routine. This is not a serious problem as the gamma function can be substituted according to C(n, y) = Γ(n + 1)/[Γ(y + 1)Γ(n − y + 1)], where Γ(a) = ∫_0^∞ t^{a−1} e^{−t} dt.
4 An alternative but equivalent form, f(y|r, p) = C(y − 1, r − 1) p^r (1 − p)^{y−r}, measures the number of trials necessary to get r successes.
5 This measure includes AA degree and above. The observed figures appear low since
children and currently enrolled college students are included in the denominator. It also does not count college attendance short of receiving a degree. 6 For an example, see “WORKSHOP: A Unified Theory of Generalized Linear Models,”
Jeff Gill, presented to the Boston Chapter of the American Statistical Association, February 1998. Available at http://www.calpoly.edu/˜jgill. 7 This includes people stating to be either strong or weak Republicans and respondents
who answered “Closer to the Republican party” when prompted if they felt closer to any of the parties. 8 Vote intention was measured with the question, “If the following people were the
Republican candidates for president in your state's primary or caucus, which one would you vote for?" 9 A matrix, A, is positive definite if for any nonzero p × 1 vector x, x′Ax > 0. 10 The datasets are freely available from the source (http://goldmine.cde.ca.gov) or
the author’s webpage. Demographic data are provided by CDE’s Educational Demographics Unit, and income data are provided by the National Center for Education
139
140 Statistics. For some nontrivial data collection and aggregation issues, see Theobald and Gill (1999). 11 Data come from The Madison Project. Benchmark is 2011. 12 Data comes from the “Global Burden of Disease Study 2016 (GBD 2016) Results”
published by the Institute for Health Metrics and Evaluation (IHME). 13 There is a substantial amount of work on the impact of weather conditions on suicidal
thoughts through the activation of chemical reactions due to exposure to sunlight and temperature. This implies that warmer places will have lower suicide rates than places with lower average temperature. 14 The three waves of this survey span over a period of 10 years: 2002, 2005–2006, and
2009–2012. The sample of the first wave reached a size of 35,000 individual interviews. 15 To achieve a better performance, variables were scaled. 16 The Cauchit link function is the inverse CDF for the standard Cauchy distribution and
therefore analogous to a probit link function but with heavier tails.
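To make the gamma-function substitution described in note 3 concrete, the following minimal sketch (not from the original text; Python rather than the GLIM-style software discussed in the monograph, and the function name log_binom_coef is an illustrative choice) shows how working with the log-gamma function keeps the "choose" term numerically stable inside a likelihood routine:

```python
import math

def log_binom_coef(n, y):
    """Log of the binomial coefficient C(n, y) computed as
    lgamma(n+1) - lgamma(y+1) - lgamma(n-y+1), which avoids the
    overflow that direct factorials would produce for large n."""
    return math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)

# C(10, 3) = 120 exactly; C(1000, 400) is far too large to form directly,
# but its logarithm is perfectly manageable.
print(round(math.exp(log_binom_coef(10, 3))))   # 120
print(log_binom_coef(1000, 400))                # log of an astronomically large number
```

Similarly, note 16 describes the cauchit link as the inverse CDF (quantile function) of the standard Cauchy distribution. The short sketch below (again my own illustration, with hypothetical names cauchit and inv_cauchit) writes out that link and its inverse and hints at why its tails are heavier than the probit's:

```python
import math

def cauchit(mu):
    """Cauchit link: eta = tan(pi*(mu - 1/2)), the standard Cauchy quantile function."""
    return math.tan(math.pi * (mu - 0.5))

def inv_cauchit(eta):
    """Inverse link: mu = 1/2 + arctan(eta)/pi, the standard Cauchy CDF."""
    return 0.5 + math.atan(eta) / math.pi

# For a large linear predictor the cauchit inverse link approaches 1 much more
# slowly than the probit inverse link, reflecting the Cauchy's heavy tails.
print(inv_cauchit(3.0))   # about 0.90, versus roughly 0.9987 under a probit link
```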
REFERENCES

Aitchison, J. 1982. The Statistical Analysis of Compositional Data. London: Chapman & Hall.
Akaike, H. 1973. "Information Theory and an Extension of the Maximum Likelihood Principle." Pp. 716–23 in Proceedings of the 2nd International Symposium on Information Theory, edited by B. N. Petrov and F. Csáki. Budapest, Hungary: Akadémiai Kiadó.
Akaike, H. 1974. "A New Look at Statistical Model Identification." IEEE Transactions on Automatic Control AU-19: 716–22.
Akaike, H. 1976. "Canonical Correlation Analysis of Time Series and the Use of an Information Criterion." Pp. 52–107 in System Identification: Advances and Case Studies, edited by R. K. Mehra and D. G. Lainiotis. New York: Academic Press.
Albert, J. H. 1988. "Computational Methods Using a Bayesian Hierarchical Generalized Linear Model." Journal of the American Statistical Association 83: 1037–44.
Albert, J. H., and S. Chib. 1996. "Bayesian Tests and Model Diagnostics in Conditionally Independent Hierarchical Models." Journal of the American Statistical Association 92: 916–25.
Amemiya, T. 1980. "Selection of Regressors." International Economic Review 21: 331–54.
Amemiya, T. 1981. "Qualitative Response Models: A Survey." Journal of Economic Literature XIX: 1483–536.
Amemiya, T. 1984. "Tobit Models: A Survey." Journal of Econometrics 24: 3–61.
Amemiya, T. 1985. Advanced Econometrics. Cambridge, MA: Harvard University Press.
Angers, J., and A. Biswas. 2003. "A Bayesian Analysis of Zero-Inflated Generalized Poisson Model." Computational Statistics & Data Analysis 42(12): 37–46.
Anscombe, F. J. 1960. "Rejection of Outliers." Technometrics 2: 123–47.
Anscombe, F. J. 1961. "Examination of Residuals." In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press.
Ansolabehere, S., J. M. Snyder Jr., and C. Stewart III. 2000. "Old Voters, New Voters, and the Personal Vote: Using Redistricting to Measure the Incumbency Advantage." American Journal of Political Science 44: 17–34.
Baker, R. J., and J. A. Nelder. 1978. GLIM Manual, Release 3. Oxford, UK: Numerical Algorithms Group and Royal Statistical Society.
Baldus, D. C., and J. W. L. Cole. 1980. Statistical Proof of Discrimination. New York: McGraw-Hill.
Barndorff-Nielsen, O. 1978. Information and Exponential Families in Statistical Theory. New York: John Wiley.
Barnett, V. 1973. Comparative Statistical Inference. New York: John Wiley.
Barry, S. C., and A. H. Welsh. 2002. "Generalized Additive Modelling and Zero-Inflated Count Data." Ecological Modelling 157(2): 179–88.
Becker, W. E., and W. J. Baumol (eds.). 1996. Assessing Educational Practices: The Contribution of Economics. Cambridge: MIT Press.
Bennet, J. E., A. Racine-Poon, and J. C. Wakefield. 1996. "MCMC for Nonlinear Hierarchical Models." In Markov Chain Monte Carlo in Practice, edited by W. R. Gilks, S. Richardson, and D. J. Spiegelhalter. London: Chapman and Hall.
Berelson, B., H. Gaudet, and P. F. Lazarsfeld. 1968. The People's Choice: How the Voter Makes Up His Mind in a Presidential Campaign. New York: Columbia University Press.
Birnbaum, A. 1962. "On the Foundations of Statistical Inference (With Discussion)." Journal of the American Statistical Association 57: 269–306.
Bishop, Y. M. M., S. E. Fienberg, and P. W. Holland. 1975. Discrete Multivariate Analysis: Theory and Practice. Cambridge: MIT Press.
Böhning, D., E. Dietz, P. Schlattmann, L. Mendonça, and U. Kirchner. 2002. "The Zero-Inflated Poisson Model and the Decayed, Missing and Filled Teeth Index in Dental Epidemiology." Journal of the Royal Statistical Society, Series A 162(2): 195–209.
Bonica, A. 2014. "Mapping the Ideological Marketplace." American Journal of Political Science 58: 367–86.
Boto-García, D., J. Baños-Pino, and A. Álvarez. 2018. "Determinants of Tourists' Length of Stay: A Hurdle Count Data Approach." Journal of Travel Research 57: 1–18. doi:10.1177/0047287518793041
Boyd, W. L. 1998. "Productive Schools From a Policy Perspective." Pp. 1–22 in Resource Allocation and Productivity in Education: Theory and Practice, edited by William T. Hartman and William Lowe Boyd. Westport, CT: Greenwood.
Boyd, W. L., and W. T. Hartman. 1998. "The Politics of Educational Productivity." Pp. 23–56 in Resource Allocation and Productivity in Education: Theory and Practice, edited by William T. Hartman and William Lowe Boyd. Westport, CT: Greenwood.
Box, G. E. P. 1979. "Robustness in the Strategy of Scientific Model Building." Pp. 201–36 in Robustness in Statistics, edited by R. L. Launer and G. N. Wilkinson. New York: Academic Press.
Bradley, R. A., and J. J. Gart. 1962. "The Asymptotic Properties of ML Estimators When Sampling From Associated Populations." Biometrika 49: 205–14.
Bratton, M., and N. Van De Walle. 1997. Political Regimes and Regime Transitions in Africa, 1910–1994. Ann Arbor, MI: Inter-University Consortium for Political and Social Research.
Brehm, J., and S. Gates. 1993. "Donut Shops and Speed Traps: Evaluating Models of Supervision on Police Behavior." American Journal of Political Science 37(2): 555–81.
Breslow, N. E., and D. G. Clayton. 1993. "Approximate Inference in Generalized Linear Mixed Models." Journal of the American Statistical Association 88: 9–25.
Brown, M. B., and C. Fuchs. 1983. "On Maximum Likelihood Estimation in Sparse Contingency Tables." Computational Statistics and Data Analysis 1: 3–15.
Budge, I., and D. Farlie. 1983. Explaining and Predicting Elections: Issue Effects and Party Strategies in Twenty-Three Democracies. Winchester, MA: Allen & Unwin.
Burrell, B. C. 1985. "Women's and Men's Campaigns for the US House of Representatives, 1972–1982: A Finance Gap?" American Politics Quarterly 13: 251–72.
Campbell, A., P. E. Converse, W. E. Miller, and D. E. Stokes. 1960. The American Voter. Chicago: University of Chicago Press.
Carlin, B. P., and T. A. Louis. 2009. Bayes and Empirical Bayes Methods for Data Analysis. 2nd ed. New York: Chapman & Hall.
Carmines, E. G., and J. A. Stimson. 1980. "The Two Faces of Issue Voting." American Political Science Review 74: 78–91.
Casella, G., and R. L. Berger. 1990. Statistical Inference. Pacific Grove, CA: Wadsworth & Brooks/Cole.
Cavanaugh, J. E. 1997. "Unifying the Derivations for the Akaike and Corrected Akaike Information Criterion." Statistics & Probability Letters 33: 201–8.
Cohen, A. C. 1963. "Estimation in Mixtures of Discrete Distributions." Pp. 373–78 in Proceedings of the International Symposium on Discrete Distributions. Montreal: Pergamon.
Cook, P., and L. D. Broemeling. 1994. "A Bayesian WLS Approach to Generalized Linear Models." Communications in Statistics: Theory and Methods 23: 3323–47.
Cornman, J. C., D. A. Glei, N. Goldman, C. D. Ryff, and M. Weinstein. 2015. "Socioeconomic Status and Biological Markers of Health: An Examination of Adults in the United States and Taiwan." Journal of Aging and Health 27: 75–102.
Cox, D. R., and E. J. Snell. 1968. "A General Definition of Residuals." Journal of the Royal Statistical Society, Series B 30: 248–65.
Dagne, G. 2010. "Bayesian Semiparametric Zero-Inflated Poisson Model for Longitudinal Count Data." Mathematical Biosciences 224: 126–30.
Daniels, M. J., and C. Gatsonis. 1999. "Hierarchical Generalized Linear Models in the Analysis of Variations in Healthcare Utilization." Journal of the American Statistical Association 94: 29–42.
DeGroot, M. H. 1986. Probability and Statistics. Reading, MA: Addison-Wesley.
del Pino, G. 1989. "The Unifying Role of Iterative Generalized Least Squares in Statistical Algorithms." Statistical Science 4: 394–408.
Dobson, A., and A. G. Barnett. 2008. An Introduction to Generalized Linear Models. New York: Chapman & Hall/CRC.
Dobson, A. J. 1990. An Introduction to Generalized Linear Models. New York: Chapman & Hall.
Downs, A. 1957. An Economic Theory of Democracy. New York: Harper.
Druckman, J. N. 2004. "Political Preference Formation: Competition, Deliberation, and the (Ir)relevance of Framing Effects." American Political Science Review 98: 671–86.
Dublin, L. I., and B. Bunzel. 1933. To Be or Not to Be: A Study of Suicide. New York: Harrison Smith and Robert Haas.
Dupuy, J. F. 2018. Statistical Methods for Overdispersed Count Data. Oxford, UK: Elsevier.
Durkheim, E. 1897. Suicide: A Sociological Study. Paris: Alcan.
Efron, B. 1986. "Double Exponential Families and Their Use in Generalized Linear Regression." Journal of the American Statistical Association 81: 709–21.
Epstein, C. 1981. "Women and Power: The Role of Women in Politics in the United States." Pp. 124–46 in Access to Power: Cross-National Studies of Women and Elites, edited by C. F. Epstein and R. L. Coser. Boston: George Allen & Unwin.
Erdman, D., L. Jackson, and A. Sinko. 2008. Zero-Inflated Poisson and Zero-Inflated Negative Binomial Models Using the Countreg Procedure. Cary, NC: SAS Institute.
Fahrmeir, L., and H. Kaufman. 1985. "Consistency and Asymptotic Normality of the Maximum Likelihood Estimator in Generalized Linear Models." Annals of Statistics 13: 342–68.
Fahrmeir, L., and G. Tutz. 2001. Multivariate Statistical Modelling Based on Generalized Linear Models. New York: Springer.
Fiorina, M. 1981. Retrospective Voting in American National Elections. New Haven, CT: Yale University Press.
Firth, D. 1987. "On the Efficiency of Quasi-Likelihood Estimation." Biometrika 74: 233–45.
Fisher, R. A. 1922. "On the Mathematical Foundations of Theoretical Statistics." Philosophical Transactions of the Royal Society of London, Series A 222: 309–60.
Fisher, R. A. 1925. "Theory of Statistical Estimation." Proceedings of the Cambridge Philosophical Society 22: 700–25.
Fisher, R. A. 1934. "Two New Properties of Mathematical Likelihood." Proceedings of the Royal Society A 144: 285–307.
Fisher, W. H., S. W. Hartwell, and X. Deng. 2017. "Managing Inflation: On the Use and Potential Misuse of Zero-Inflated Count Regression Models." Crime & Delinquency 63(1): 77–87.
Fouirnaies, A., and A. B. Hall. 2014. "The Financial Incumbency Advantage: Causes and Consequences." Journal of Politics 76: 711–24.
Gelman, A., and J. Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge, UK: Cambridge University Press.
Ghosh, M., K. Natarajan, T. W. F. Stroud, and B. P. Carlin. 1998. "Generalized Linear Models for Small-Area Estimation." Journal of the American Statistical Association 93: 273–82.
Ghosh, S. K., P. Mukhopadhyay, and J.-C. Lu. 2006. "Bayesian Analysis of Zero-Inflated Regression Models." Journal of Statistical Planning and Inference 136(4): 1360–75.
Gill, J. 1999. "The Insignificance of Null Hypothesis Significance Testing." Political Research Quarterly 52: 647–74.
Gill, J. 2006. Essential Mathematics for Political and Social Research. Cambridge, UK: Cambridge University Press.
Gill, J. 2008. "Is Partial-Dimension Convergence a Problem for Inferences From MCMC Algorithms?" Political Analysis 16(2): 153–78.
Gill, J. 2014. Bayesian Methods for the Social and Behavioral Sciences. New York: Chapman & Hall/CRC.
Gill, J., and G. King. 2004. "What to Do When Your Hessian Is Not Invertible: Alternatives to Model Respecification in Nonlinear Estimation." Sociological Methods and Research 33: 54–87.
Ginsberg, R. B. 1967. Anomie and Aspirations: A Reinterpretation of Durkheim's Theory. New York: Arno.
Good, I. J. 1986. "'Saturated Model' or 'Quasimodel': A Point of Terminology." Journal of Statistical Computation and Simulation 24: 168–9.
Green, P. J. 1984. "Iteratively Reweighted Least Squares for Maximum Likelihood Estimation, and Some Robust and Resistant Alternatives." Journal of the Royal Statistical Society, Series B 46: 149–92.
Greene, W. 1997. Econometric Analysis. 3rd ed. New York: Prentice Hall.
Greenwald, A. G. 1975. "Consequences of Prejudice Against the Null Hypothesis." Psychological Bulletin 82: 1–20.
Hall, D. B. 2000. "Zero-Inflated Poisson and Binomial Regression With Random Effects: A Case Study." Biometrics 56: 1030–9.
Hall, D. B., and T. A. Severini. 2012. "Extended Generalized Estimating Equations for Clustered Data." Journal of the American Statistical Association 93(444): 1365–75.
Hanushek, E. A. 1986. "The Economics of Schooling: Production and Efficiency in Public Schools." Journal of Economic Literature 24: 1141–77.
Hanushek, E. A. 1981. "Throwing Money at Schools." Journal of Policy Analysis and Management 1: 19–41.
Hanushek, E. A. 1994. "Money Might Matter Somewhere: A Response to Hedges, Laine, and Greenwald." Educational Researcher 23: 5–8.
Harbom, L., S. Högbladh, and P. Wallensteen. 2006. "Armed Conflict and Peace Agreements." Journal of Peace Research 43(5): 617–31.
Hardin, J. W., and J. M. Hilbe. 2003. Generalized Estimating Equations. Boca Raton, FL: Chapman & Hall/CRC Press.
Härdle, W., E. Mammen, and M. Müller. 1988. "Testing Parametric Versus Semiparametric Modeling in Generalized Linear Models." Journal of the American Statistical Association 89: 501–11.
Harvey, A. 1989. Forecasting, Statistical Time Series Models and the Kalman Filter. Cambridge, UK: Cambridge University Press.
Harvey, A., and S. J. Koopman. 1993. "Forecasting Hourly Electricity Demand Using Time-Varying Splines." Journal of the American Statistical Association 88: 1228–36.
Hastie, T. J., and R. J. Tibshirani. 1986. "Generalized Additive Models." Statistical Science 1: 297–318.
Hastie, T. J., and R. J. Tibshirani. 1990. Generalized Additive Models. New York: Chapman & Hall.
Hastie, T. J., and R. J. Tibshirani. 1993. "Varying-Coefficient Models." Journal of the Royal Statistical Society, Series B 55: 757–96.
Heckman, J. J. 1979. "Sample Selection as Specification Error." Econometrica 47: 153–62.
Hedges, L. V., R. D. Laine, and R. Greenwald. 1994. "Does Money Matter? A Meta-Analysis of Studies of the Effects of Differential School Inputs on Student Outcomes." Educational Researcher 23: 383–93.
Hendrix, C. S., and S. Haggard. 2015. "Global Food Prices, Regime Type, and Urban Unrest in the Developing World." Journal of Peace Research 52: 143–57.
Henry, A. F., and J. F. Short. 1954. Suicide and Homicide: Some Economic, Sociological and Psychological Aspects of Aggression. New York: Free Press.
Heyde, C. C. 2008. Quasi-Likelihood and Its Applications: A General Approach to Optimal Parameter Estimation. New York: Springer.
Hillman, A. J., G. D. Keim, and D. Schuler. 2004. "Corporate Political Activity: A Review and Research Agenda." Journal of Management 30: 837–57.
Högbladh, S. 2011. "Peace Agreements 1975–2011: Updating the UCDP Peace Agreement Dataset." States in Armed Conflict 5: 85–105.
Huber, P. J. 1967. "The Behavior of Maximum Likelihood Estimates Under Nonstandard Conditions." Pp. 221–33 in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. Berkeley: University of California Press.
Huddy, L., and N. Terkildsen. 1993. "The Consequences of Gender Stereotypes for Women Candidates at Different Levels and Types of Office." Political Research Quarterly 46(3): 503–25.
Hurvich, C. M., J. S. Simonoff, and C. L. Tsai. 1998. "Smoothing Parameter Selection in Nonparametric Regression Using an Improved Akaike Information Criterion." Journal of the Royal Statistical Society, Series B 60(2): 271–93.
Hurvich, C. M., and C. L. Tsai. 1988. "A Crossvalidatory AIC for Hard Wavelet Thresholding in Spatially Adaptive Function Estimation." Biometrika 85(3): 701–10.
Hurvich, C. M., and C. L. Tsai. 1989. "Regression and Time Series Model Selection in Small Samples." Biometrika 76: 297–307.
Hurvich, C. M., and C. L. Tsai. 1991. "Bias of the Corrected AIC Criterion for Underfitted Regression and Time Series Models." Biometrika 78: 499–509.
Ibrahim, J. G., and P. W. Laud. 1991. "On Bayesian Analysis of Generalized Linear Models Using Jeffreys' Prior." Journal of the American Statistical Association 86: 981–6.
Iyengar, S., and A. F. Simon. 2000. "New Perspectives and Evidence on Political Communication and Campaign Effects." Annual Review of Psychology 51: 149–69.
Jacobson, G. C. 2015. "It's Nothing Personal: The Decline of the Incumbency Advantage in US House Elections." Journal of Politics 77: 861–73.
Jang, H., S. Lee, and S. W. Kim. 2010. "Bayesian Analysis for Zero-Inflated Regression Models With the Power Prior: Applications to Road Safety Countermeasures." Accident Analysis & Prevention 42(2): 540–7.
Jørgensen, B. 1983. "Maximum Likelihood Estimation and Large-Sample Inference for Generalized Linear and Nonlinear Regression Models." Biometrika 70: 19–28.
Kass, R. E. 1993. "Bayes Factors in Practice." The Statistician 42: 551–60.
Kaufman, H. 1987. "Regression Models for Nonstationary Categorical Time Series: Asymptotic Estimation Theory." Annals of Statistics 15: 79–98.
King, G. 1989. Unifying Political Methodology: The Likelihood Theory of Statistical Inference. Cambridge, UK: Cambridge University Press.
King, G., and M. E. Roberts. 2015. "How Robust Standard Errors Expose Methodological Problems They Do Not Fix, and What to Do About It." Political Analysis 23(2): 159–79.
Kleinke, K., and J. Reinecke. 2013. "Multiple Imputation of Incomplete Zero-Inflated Count Data." Statistica Neerlandica 67(3): 311–36.
Kleinke, K., and J. Reinecke. 2015. "Multiple Imputation of Multilevel Count Data." Pp. 381–96 in Improving Survey Methods: Lessons From Recent Research, edited by U. Engel, J. B. Lynn, A. Scherpenzeel, and P. Sturgis. New York: Routledge, Taylor & Francis.
Kleppner, D., and N. Ramsey. 1985. Quick Calculus: A Self-Teaching Guide. New York: Wiley Self Teaching Guides.
Koehler, A. B., and E. S. Murphree. 1988. "A Comparison of the Akaike and Schwarz Criteria for Selecting Model Order." Applied Statistics 37(2): 187–95.
Krzanowski, W. J. 1988. Principles of Multivariate Analysis. Oxford, UK: Clarendon.
Lam, K. F., H. Xue, and C. Y. Bun Cheung. 2006. "Semiparametric Analysis of Zero-Inflated Count Data." Biometrics 62(4): 996–1003.
Lambert, D. 1992. "Zero-Inflated Poisson Regression With an Application to Defects in Manufacturing." Technometrics 34: 1–14.
Lazarsfeld, P., B. Berelson, and H. Gaudet. 1968. The People's Choice: How the Voter Makes Up His Mind in a Presidential Campaign. 3rd ed. New York: Columbia University Press.
Leamer, E. E. 1978. Specification Searches: Ad Hoc Inference With Nonexperimental Data. New York: John Wiley.
Le Cam, L., and G. L. Yang. 1990. Asymptotics in Statistics: Some Basic Concepts. New York: Springer-Verlag.
Lehmann, E. L., and G. Casella. 1998. Theory of Point Estimation. 2nd ed. New York: Springer-Verlag.
Liang, K. Y., and S. L. Zeger. 1986. "Longitudinal Analysis Using Generalized Linear Models." Biometrika 73: 13–22.
Lindsay, R. M. 1995. "Reconsidering the Status of Tests of Significance: An Alternative Criterion of Adequacy." Accounting, Organizations and Society 20: 35–53.
Lindsey, J. K. 1997. Applying Generalized Linear Models. New York: Springer-Verlag.
Lipsitz, S. R., N. M. Laird, and D. P. Harrington. 1991. "Generalized Estimating Equations for Correlated Binary Data: Using the Odds Ratio as a Measure of Association." Biometrika 78(1): 153–60.
Lubell, M., M. Schneider, J. Scholz, and M. Mete. 2002. "Watershed Partnerships and the Emergence of Collective Action Institutions." American Journal of Political Science 46(1): 148–63.
Lundy, E. R., and C. B. Dean. 2018. "Analyzing Heaped Counts Versus Longitudinal Presence/Absence Data in Joint Zero-Inflated Discrete Regression Models." Sociological Methods & Research, August, 1–30.
Manning, W. G. 1998. "The Logged Dependent Variable, Heteroscedasticity, and the Retransformation Problem." Journal of Health Economics 17: 283–95.
McCullagh, P. 1983. "Quasi-Likelihood Functions." Annals of Statistics 11: 59–67.
McCullagh, P., and J. A. Nelder. 1989. Generalized Linear Models. 2nd ed. New York: Chapman & Hall.
McCulloch, C. E., S. R. Searle, and J. M. Neuhaus. 2008. Generalized, Linear, and Mixed Models. 2nd ed. Hoboken, NJ: John Wiley.
McGilchrist, C. A. 1994. "Estimation in Generalized Mixed Models." Journal of the Royal Statistical Society, Series B 55: 945–55.
Meier, K. J., J. Stewart Jr., and R. E. England. 1991. "The Politics of Bureaucratic Discretion: Education Access as an Urban Service." American Journal of Political Science 35(1): 155–77.
Miller, A. J. 1990. Subset Selection in Regression. New York: Chapman & Hall.
Min, Y., and A. Agresti. 2005. "Random Effect Models for Repeated Measures of Zero-Inflated Count Data." Statistical Modelling 5(1): 1–19.
Mullahy, J. 1986. "Specification and Testing of Some Modified Count Data Models." Journal of Econometrics 33: 341–65.
Murnane, R. J. 1975. The Impact of School Resources on the Learning of Inner City Children. Cambridge, UK: Ballinger.
Naylor, J. C., and A. F. M. Smith. 1982. "Applications of a Method for the Efficient Computation of Posterior Distributions." Applied Statistics 31: 214–25.
Neftçi, S. N. 1982. "Specification of Economic Time Series Models Using Akaike's Criterion." Journal of the American Statistical Association 77: 537–40.
Nelder, J. A., and D. Pregibon. 1987. "An Extended Quasi-Likelihood Function." Biometrika 74: 221–32.
Nelder, J. A., and R. W. M. Wedderburn. 1972. "Generalized Linear Models." Journal of the Royal Statistical Society, Series A 135: 370–85.
Neter, J., M. H. Kutner, C. J. Nachtsheim, and W. Wasserman. 1996. Applied Linear Regression Models. Chicago: Irwin.
Nordberg, L. 1980. "Asymptotic Normality of Maximum Likelihood Estimators Based on Independent Unequally Distributed Observations in Exponential Family Models." Scandinavian Journal of Statistics 7: 27–32.
Ogburn, W. F., and D. S. Thomas. 1922. "The Influence of the Business Cycle on Certain Social Conditions." Journal of the American Statistical Association 18: 324–40.
Paolino, P. 2001. "Maximum Likelihood Estimation of Models With Beta-Distributed Dependent Variables." Political Analysis 9(4): 325–46.
Peers, H. W. 1971. "Likelihood Ratio and Associated Test Criteria." Biometrika 58: 577–89.
Pierce, D. A., and D. W. Schafer. 1986. "Residuals in Generalized Linear Models." Journal of the American Statistical Association 81: 977–86.
Powell, L. W., and C. Wilcox. 2010. "Money and American Elections." Pp. 629–48 in The Oxford Handbook of American Elections and Political Behavior, edited by J. E. Leighley. Oxford, UK: Oxford University Press.
Prasad, A. N., and C.-L. Tsai. 2001. "Single-Index Model Selections." Biometrika 88(3): 821–32.
Pregibon, D. 1981. "Logistic Regression Diagnostics." Annals of Statistics 9: 705–24.
Pregibon, D. 1984. "Review of Generalized Linear Models by McCullagh and Nelder." American Statistician 12: 1589–96.
Quinlan, C. 1987. Year-Round Education, Year-Round Opportunities: A Study of Year-Round Education in California. Sacramento: California State Department of Education.
Quinn, J. J. 2008. "The Effects of Majority State Ownership of Significant Economic Sectors on Corruption: A Cross-Regional Comparison." International Interactions 34(1): 84–128.
Raftery, A. E. 1995. "Bayesian Model Selection in Social Research." Pp. 111–95 in Sociological Methodology, edited by P. V. Marsden. Cambridge, MA: Blackwell.
Ramalho, E. A., J. J. Ramalho, and P. D. Henriques. 2010. "Fractional Regression Models for Second Stage DEA Efficiency Analyses." Journal of Productivity Analysis 34(3): 239–55.
Ramalho, E. A., J. J. Ramalho, and J. M. Murteira. 2011. "Alternative Estimating and Testing Empirical Strategies for Fractional Regression Models." Journal of Economic Surveys 25: 19–68.
Ridout, M., J. Hinde, and C. G. B. Demétrio. 2001. "A Score Test for Testing a Zero-Inflated Poisson Regression Model Against Zero-Inflated Negative Binomial Alternatives." Biometrics 57: 219–23.
Rozeboom, W. W. 1960. "The Fallacy of the Null Hypothesis Significance Test." Psychological Bulletin 57: 416–28.
Sanbonmatsu, K. 2002. "Gender Stereotypes and Vote Choice." American Journal of Political Science 46: 20–34.
Sapiro, V. 1983. The Political Integration of Women: Roles, Socialization, and Politics. Urbana: University of Illinois Press.
Savun, B., and D. C. Tirone. 2017. "Foreign Aid as a Counterterrorism Tool: More Liberty, Less Terror?" Journal of Conflict Resolution 62: 1607–1635. doi:10.1177/0022002717704952
Sawa, T. 1978. "Information Criteria for Discriminating Among Alternative Regression Models." Econometrica 46: 1273–91.
Schlozman, K. L., N. Burns, and S. Verba. 1994. "Gender and the Pathways to Participation: The Role of Resources." Journal of Politics 56: 963–90.
Schwarz, G. 1978. "Estimating the Dimension of a Model." Annals of Statistics 6: 461–4.
Severini, T. A., and J. G. Staniswalis. 1994. "Quasi-Likelihood Estimation in Semiparametric Models." Journal of the American Statistical Association 89: 501–11.
Sims, J., and V. Addona. 2014. "Hurdle Models and Age Effects in the Major League Baseball Draft." Journal of Sports Economics 17(7): 672–87.
Skrondal, A., and S. Rabe-Hesketh. 2004. Generalized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models. New York: Chapman and Hall/CRC.
Su, J. Q., and L. J. Wei. 1991. "A Lack-of-Fit Test for the Mean Function in a Generalized Linear Model." Journal of the American Statistical Association 86: 420–6.
Theobald, N., and J. Gill. 1999. "Looking for Data in All the Wrong Places: An Analysis of California's STAR Results." Paper presented at the Annual Meeting of the Western Political Science Association, Seattle, WA, March. Available at http://JeffGill.org.
Tobin, J. 1958. "Estimation of Relationships for Limited Dependent Variables." Econometrica 26: 24–36.
Tsai, T., and J. Gill. 2013. "Interactions in Generalized Linear Models: Theoretical Issues and an Application to Personal Vote-Earning Attributes." Social Sciences 2(2): 91–113.
Upton, G. J. G. 1991. "The Exploratory Analysis of Survey Data Using Log-Linear Models." The Statistician 40: 169–82.
Vavreck, L. 2009. The Message Matters: The Economy and Presidential Campaigns. Princeton, NJ: Princeton University Press.
von Neumann, J. 1947. "The Mathematician." Pp. 180–96 in Works of the Mind, edited by R. B. Haywood. Chicago: University of Chicago Press.
Walker, S. G., and B. K. Mallick. 1997. "Hierarchical Generalized Linear Models and Frailty Models With Bayesian Nonparametric Mixing." Journal of the Royal Statistical Society, Series B 59: 845–60.
Weaver, T. 1992. "Year-Round Education." ERIC Digest 68: ED342107.
Wedderburn, R. W. M. 1974. "Quasi-Likelihood Functions, Generalized Linear Models, and the Gauss-Newton Method." Biometrika 61: 439–47.
Wedderburn, R. W. M. 1976. "On the Existence and Uniqueness of the Maximum Likelihood Estimates for Certain Generalized Linear Models." Biometrika 63: 27–32.
West, M., P. J. Harrison, and H. S. Migon. 1985. "Dynamic Generalized Linear Models and Bayesian Forecasting." Journal of the American Statistical Association 80: 73–83.
White, H. 1980. "A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity." Econometrica 48(4): 817–38.
Wilde, O. 1898. The Importance of Being Earnest: A Trivial Comedy for Serious People. London: Leonard Smithers.
Williams, D. A. 1987. "Generalized Linear Model Diagnostics Using the Deviance and Single Case Deletions." Journal of the Royal Statistical Society, Series C 36(2): 181–91.
Wirt, F. M., and M. W. Kirst. 1975. Political and Social Foundations of Education. Berkeley, CA: McCutchan.
Wood, S. 2006. Generalized Additive Models. New York: Chapman & Hall/CRC.
Yau, K. K., and A. H. Lee. 2001. "Zero-Inflated Poisson Regression With Random Effects to Evaluate an Occupational Injury Prevention Programme." Statistics in Medicine 20: 2907–20.
Yau, K. K., K. Wang, and A. H. Lee. 2003. "Zero-Inflated Negative Binomial Mixed Regression Modeling of Over-Dispersed Count Data With Extra Zeros." Biometrical Journal: Journal of Mathematical Methods in Biosciences 45: 437–52.
Yip, P. 1988. "Inference About the Mean of a Poisson Distribution in the Presence of a Nuisance Parameter." Australian Journal of Statistics 30: 299–306.
Zeger, S. L., and R. Karim. 1991. "Generalized Linear Models With Random Effects: A Gibbs Sampling Approach." Journal of the American Statistical Association 86: 79–86.
Zeger, S. L., and K. Y. Liang. 1986. "Longitudinal Data Analysis for Discrete and Continuous Outcomes." Biometrics 42: 121–30.
Zeger, S. L., K.-Y. Liang, and P. S. Albert. 1988. "Models for Longitudinal Data: A Generalized Estimating Equation Approach." Biometrics 44(4): 1049–60.
Zellner, A., and P. E. Rossi. 1984. "Bayesian Analysis of Dichotomous Quantal Response Models." Journal of Econometrics 25: 365–93.
Zhang, P. 1992. "On the Distributional Properties of Model Selection." Journal of the American Statistical Association 87: 732–7.
Zorn, C. J. W. 1998. "An Analytic and Empirical Examination of Zero-Inflated and Hurdle Poisson Specifications." Sociological Methods & Research 26(3): 368–400.
INDEX
Additive component, 13, 15 Additive models, generalized, 5, 136 Adjusted deviances, 74–75 Akaike information criterion (AIC), 79–81, 82 Anscombe residuals, 69 Asymptotic properties Anscombe residuals, 69 BIC-AIC comparison, 81–82 hypothesis tests, 73 likelihood ratio test, 60 Lindeberg-Feller variant, 70–71 maximum likelihood estimation, 23, 25 Bayesian information criterion (BIC), 80–82 Bayes law, 23–24 Bernoulli variance, 119 BIC. See Bayesian information criterion Binary outcome trials, 17 Binomial distribution asymptotic properties, 84 exponential family form, 17–18 general linear mixed-effects model, 116–117 link functions, comparison of, 38 (figure) mean calculation, 27 residuals and model fit, 87–93, 89 (table), 91 (table) variance calculation, 30 Bounds censoring, 43 exponential family distributions, 26 fractional regression model, 118–121 profile likelihood confidence intervals, 58, 61–62 Tobit model, 127
Box, George, 2 Calculus, 8–9 Campaign donations example, 97–104, 100 (figure), 103 (table), 104 (figure) Canonical form, 14–16, 26 Canonical links, 15, 36–37, 37 (table) Capital punishment example (Poisson GLM), 40–42, 41 (table) iterative weighted least squares, 55–58, 55 (table), 59 (figure) profile likelihood confidence intervals, 61–62 residuals and model fit, 75–76, 75 (table), 77 (figure), 84–85 Censoring, 43, 121–128 Chi-square distribution, 40, 78, 82, 84–85, 90, 94 Cholesky factorization, 53 Classic sources of information, 137 Cloglog link function, 37 Cluster-robust standard errors, 133–134 Coefficient estimates. See Estimation techniques Conditional logistic model, 39 Confidence intervals, 55 (table), 56–58. See also Profile likelihood confidence intervals Conformability matrix/vector objects, 8 Congressional bill assignment example (negative binomial GLM), 93–94, 95 (table), 96 (table), 97 (figure) Corruption example. See Government corruption example (Tobit model) Count data Poisson distribution, 16, 40 quasi-Poisson model, 110 zero-inflated accommodating models, 128–129
152 Counting measure, 6 Cross-entropy, 79 Cumulant function, 12, 15, 28 Death penalty example. See Capital punishment example (Poisson GLM) Deviance, 107. See also Summed deviance Deviance function and deviance residuals, 71–76, 74 (table), 75 (table), 78, 80 Discrepancy, 70, 71–72, 79 Distributions link functions, 37–40 normalizing constants and variance functions, 33 (table) of random variables, 5–7 See also specific type Educational standardized testing example (binomial GLM), 87–93, 89 (table), 91 (table), 92 (figure) Eigenvalues, 82–84, 85, 90, 94 Electoral politics in Scotland example (gamma GLM), 42–44, 45–46 (table) profile likelihood confidence intervals, 62–63, 63 (table 5.3) residuals and model fit, 85–86, 86 (table) Errors, generalized vs. standard models, 5 Error structure, 35 Error variance. See Heteroscedasticity Estimation techniques inference vs. prediction, 68 iterative weighted least squares, 54–58 Newton-Raphson and root finding, 49–52 profile likelihood confidence intervals, 58–66 quasi-likelihood estimation, 106–112 weighted least squares, 52–53 Explanatory variable matrix, 8
Exponential family basics, 7 canonical form, 14–16 derivation of the, 13–14 Hessian matrix, 52 justification, 11–13 mean calculation, 25–29 multiparameter models, 16–22 natural link functions, 37 (table) variance calculation, 29–32 Extensions to linear models fractional regression models, 118–121 general linear mixed-effects model, 112, 114–117, 118 (table) quasi-likelihood estimation, 106–112 Tobit model, 121–128 zero-inflated accommodating models, 128–132 Family of distributions, 6–7. See also Exponential family First differences, 57–58, 66, 89–90, 91 (table), 102, 119 First moment. See Moments Fisher scoring, 52, 55 Fit advantages of generalized linear models, 138 parsimony and, 3–4 See also Residuals and model fit Flexibility of generalized linear models, 138 Gamma distribution campaign donations example, 100, 102, 103 (table 6.8) electoral politics in Scotland example, 42–44, 45–46 (table) exponential family form, 19–20 mean calculation, 27 profile likelihood confidence intervals, 62–63, 63 (table 5.3) residuals and model fit, 85–86, 86 (table) variance calculation, 31
153 See also Electoral politics in Scotland example (gamma GLM) Gaussian normal distribution, 6, 19 Gauss-Markov assumptions, 2, 7–8, 34, 36, 69–70 Gauss-Newton method, 51 Generalization of the linear model, 34–37 Generalized additive models, 5, 136 Generalized estimating equations, 136–137 Generalized linear mixed-effects model, 112–117, 118 (table) GLIM software, 9, 49 Goodness of fit. See Fit; Residuals and model fit Government corruption example (Tobit model), 127–128, 128 (table) Heckman two-step process, 123 Hessian matrix, 52, 55, 82 Heteroscedasticity, 52–54 Huber-White sandwich estimator, 133–134 Hurdle model, 130 Identification condition, 8 Inference Hessian matrix, 52 maximum likelihood estimation, 23, 24 prediction vs., 68 quasi-models, 108 summed deviance, 78 typical vs. saturated models, 4 zero-inflated models, 128 Infinite moment, 12 Information matrix, 82–84, 85, 90, 94 Interaction effects, 89–90, 128 Interactions binomial GLM, 88 canonical form, 36
derivation of the exponential family form, 13, 15, 17 first differences, 89–90 gamma GLM, 62–63, 85–86 negative binomial GLM, 96 Tobit model, 127–128 Interpretation confidence intervals, use of, 57 first differences, 57, 102, 119 Poisson model, 40 stochastic elements, 70 zero-inflated models, 132 Interquartile range, 57, 90, 91 (table), 104 (figure) Inverse link function, 35, 36, 37 (table), 119–120 Inverse probability compared to the likelihood function, 24 Iterative weighted least squares (IWLS), 49, 54–55 fit, 71, 80 gamma GLM example, 62–63 GLIM software, 9 multinomial GLM example, 64 Poisson GLM example, 55–58 Jackknifing, 76, 77 (figure) Joint density function, 13–14 Joint probability function, 25 Least squares. See Iterative weighted least squares; Ordinary least squares; Weighted least squares Lebesgue measure, 6 Lebesgue’s dominated convergence theorem, 26 Leibnitz’s rule for constant bounds, 26 Likelihood function canonical form, 15 logit model, 39 maximum likelihood estimation, 23–25 model specification, 5 Tobit model, 123–124 See also Estimation techniques; Log-likelihood function Likelihood principle, 25
154 Likelihood ratio test (LRT), 40, 60, 81 Likelihood theory canonical form, 15–16 maximum likelihood estimation, 23–25 mean of the exponential family, 25–29 variance function, 32–33 variance of the exponential family, 29–32 See also Estimation techniques Lindeberg Feller variant, 70–71 Linear algebra, 8 Linear model basics, 7–8 Linear structure, generalization of the, 34–37 Link functions distributions, 37–40, 37 (table), 38 (figure) generalization of the linear model, 34–37 generalized vs. standard models, 4 Poisson distribution, 42 Location-scale family, 19 Logit link function, 37, 39, 88, 117, 121 Log-likelihood function deviance function and deviance residuals, 72–73 goodness of fit, 79–80, 81 maximum likelihood estimation, 24, 25 Newton-Raphson technique, 51 (See also Profile log-likelihood function) profile likelihood confidence intervals, 60–61 quasi-likelihood estimation, 106–107 Log link function, 100, 129 Log-transformation, fractional regression model, 118–119 LRT. See Likelihood ratio test
Maximum likelihood estimation (MLE), 23–25, 80. See also Estimation techniques Mean first moment, 12 fractional regression models, 119, 120 generalization of the linear model, 34–35 generalized linear mixed-effects model, 115–116 Hurdle model, 130 iterative weighted least squares, 54 likelihood theory, 25–29 linear model, 7–8 negative binomial GLM, 94 overdispersion, 40, 112 Poisson GLM example, 57, 58 quasi-likelihood estimation, 106, 108 residuals, 69–70, 71, 100 zero-inflated negative binomial model, 130, 131 Mean model, 72, 93, 94, 100, 102 Migrant crime victimization example (binomial GLMM), 118 (table) Migration example (binomial GLMM), 116–117 Mixed-effects model. See General linear mixed-effects model MLE. See Maximum likelihood estimation Model fit. See Fit; Residuals and model fit Model specification, 2–5, 72, 78–81, 100, 116, 132–134 Moments canonical form, 14, 15 exponential family, 12, 22 mean of the exponential family, 25–26, 27 quasi-likelihood estimation, 106, 107
155 variance of the exponential family, 29 Multinomial distribution exponential family form, 21–22 mean calculation, 28–29 profile likelihood confidence intervals, 64–66, 65 (figure), 67 (figure) variance calculation, 32 vote intention example, 44, 47–48, 48 (figure) Multiparameter models. See specific models Natural link functions, 37 (table), 52, 83 Negative binomial distribution Congressional bill assignment example, 93–94, 95 (table), 96 (table), 97 (figure) exponential family form, 20–21 mean calculation, 28 Poisson distribution compared to, 40 variance calculation, 31 zero-inflated negative binomial model, 130–131 Nesting, 78–79, 81, 85–86, 105, 112 Neumann, John von, 2 Newton-Raphson technique, 49–52 Nonconstant error variance. See Heteroscedasticity Nonnested models, 79 Normal distribution exponential family form, 18–19, 22 generalization, 34–35 generalized linear mixed-effects model, 117 mean calculation, 27 residuals, 69, 70–71, 74–75 Tobit model, 122, 124 variance calculation, 31 See also Asymptotic properties Normalizing constants, 15, 26, 33 (table)
Nuisance parameters, 18–19, 24 Null model, 72–73 Numerical techniques. See Iterative weighted least squares Odds, 39 Ordinary least squares, 18, 43 Overdispersion, 40, 94, 105, 112, 130 Parameters generalized vs. standard models, 5 nuisance parameters, 18–19, 24 scale parameters, 18–19, 24, 32, 80 sufficiency, 11–12 unknown parameters, 11–12, 23, 24 Parsimony and fit, 3–4 Peace agreement characteristics example (zero-inflated accommodating models), 131–132, 133 (table) Pearson residual campaign donations example, 101 (figure) capital punishment example, 75–76, 75 (table) Congressional bill assignment example, 96–97, 97 (figure) defined, 71 educational standardized testing example, 92 (figure), 93 Pearson statistic, 78, 84 Poisson distribution capital punishment example, 40–42, 41 (table) exponential family form, 16–17 iterative weighted least squares, 55–58, 55 (table), 59 (figure) mean calculation, 26–27 Pearson residuals, 71 profile likelihood confidence intervals, 61–62 quasi-Poisson general linear model, 109–110
156 residuals and model fit, 75–76, 75 (table), 77 (figure), 84–85 variance calculation, 30 zero-inflated Poisson model, 129–130, 131 Prediction vs. inference, 68 Probability density function and probability mass function basics, 5–7. See also Exponential family Probit link function, 37, 39–40 Profile likelihood confidence intervals, 58–61 capital punishment example (Poisson GLM), 61–62, 62 (figure), 63 (table 5.2) electoral politics example (gamma GLM), 62–64, 63 (table 5.3) vote intention example (multinomial GLM), 64–66, 65 (figure), 67 (figure) Profile log-likelihood function, 60–61, 62 (figure) P-values, 56–57, 138 Quasi-likelihood estimation, 11, 106–112, 108 (table), 111 (figure), 113 (figure) Quasi-score function, 106, 107, 108 Quinn’s model, 127 Random effects, 114–116, 117, 129 Random variables, 5, 14, 26, 57 Reading recommendations, 137 Relative function, 24 Reliability, 4, 52, 68 Reparameterization, 119. See also Exponential family Residuals and model fit asymptotic properties, 82–84 campaign donations example, 97–104, 100 (figure), 103 (table), 104 (figure) capital punishment example (Poisson GLM), 75–76, 75 (table), 77 (figure)
congressional bill assignment example (negative binomial GLM), 93–94, 95 (table), 96 (table), 97 (figure) defining residuals, 69–75 deviance function and deviance residuals, 71–75, 74 (table) educational standardized testing example (binomial GLM), 87–93, 89 (table), 91 (table), 92 (figure) electoral politics in Scotland example (gamma GLM), 85–86, 86 (table) measuring and comparing goodness of fit, 76, 78–82 Resistance, 8, 70 Response residuals, 70, 71, 75 (table) Response variables (stochastic component), 36 Robustness, 8, 68, 70, 132–134 Root finding, 49–52 “Sandwich estimation,” 120, 133–134 Saturated model, 4, 72–73, 78, 80, 90 Scale parameters, 18–19, 24, 32, 80 Schwarz criterion. See Bayesian information criterion (BIC) Score function, 23, 24, 25, 29, 51. See also Quasi-score function Second moment, 12, 29. See Moments Software, 9, 49, 71, 138 Standard errors, 132–134 Standard linear model campaign donations example, 103 (table 6.9) Gauss-Markov conditions, 34 generalized linear models compared to, 2 residuals, 69–70 Stochastic censoring, 123–126
157 Stochastic component, 35 Structural zeros, 129 Suicide rates example (quasi-Poisson GLM), 109–110, 111 (figure), 113 (figure) Summed deviance asymptotic properties, 84 defining residuals, 72, 73 electoral politics in Scotland example (gamma GLM), 85–86, 90, 93 goodness of fit, 78, 80 Survey research design, 20, 128 Systematic component of the linear model, 2, 7, 34–37, 42, 44 Taylor series expansion, 50–51 Tobit model, 121–128, 128 (table) Toughness, 8, 70 Truncation, 121–122, 127 Uniform distributions over the unit interval, 5–6 Unknown parameters, 11–12, 23, 24 Variance binomial GLMM of migration, 117, 118 (table) fractional regression models, 119 gamma GLM, 43 generalization of the linear model, 35–36 generalized linear mixed-effects model, 115–116
goodness of fit, 80 heteroscedasticity, 52–53 iterative weighted least squares, 54 likelihood theory, 29–32 linear model, 7–8 negative binomial GLM, 94 overdispersion, 40, 112 quasi-likelihood estimation, 106, 108, 108 (table) residuals, 69, 70, 100 second moment, 12 Tobit model, 122, 124 zero-inflated negative binomial model, 130, 131 Variance-covariance matrix, 52, 55, 58, 83–84, 125, 133–134 Variance function, 32–33, 33 (table) Vote intention example multinomial GLM, 44, 47–48, 48 (figure) profile likelihood confidence intervals, 64–66, 65 (figure), 67 (figure) Wald-type intervals, 58, 60, 61–62, 62 (figure), 63 (table), 63 (table 5.2) Weibull probability density function, 12, 13 Weighted least squares, 52–53. See also Iterative weighted least squares Zero-inflated accommodating models (ZINB and ZIP models), 128–132, 133 (table)