Bayesian Statistics for the Social Sciences
Methodology in the Social Sciences
David A. Kenny, Founding Editor
Todd D. Little, Series Editor
www.guilford.com/MSS

This series provides applied researchers and students with analysis and research design books that emphasize the use of methods to answer research questions. Rather than emphasizing statistical theory, each volume in the series illustrates when a technique should (and should not) be used and how the output from available software programs should (and should not) be interpreted. Common pitfalls as well as areas of further development are clearly articulated.

RECENT VOLUMES

THEORY CONSTRUCTION AND MODEL-BUILDING SKILLS: A PRACTICAL GUIDE FOR SOCIAL SCIENTISTS, SECOND EDITION, James Jaccard and Jacob Jacoby
LONGITUDINAL STRUCTURAL EQUATION MODELING WITH Mplus: A LATENT STATE-TRAIT PERSPECTIVE, Christian Geiser
COMPOSITE-BASED STRUCTURAL EQUATION MODELING: ANALYZING LATENT AND EMERGENT VARIABLES, Jörg Henseler
BAYESIAN STRUCTURAL EQUATION MODELING, Sarah Depaoli
INTRODUCTION TO MEDIATION, MODERATION, AND CONDITIONAL PROCESS ANALYSIS: A REGRESSION-BASED APPROACH, THIRD EDITION, Andrew F. Hayes
THE THEORY AND PRACTICE OF ITEM RESPONSE THEORY, SECOND EDITION, R. J. de Ayala
APPLIED MISSING DATA ANALYSIS, SECOND EDITION, Craig K. Enders
PRINCIPLES AND PRACTICE OF STRUCTURAL EQUATION MODELING, FIFTH EDITION, Rex B. Kline
MACHINE LEARNING FOR SOCIAL AND BEHAVIORAL RESEARCH, Ross Jacobucci, Kevin J. Grimm, and Zhiyong Zhang
BAYESIAN STATISTICS FOR THE SOCIAL SCIENCES, SECOND EDITION, David Kaplan
LONGITUDINAL STRUCTURAL EQUATION MODELING, SECOND EDITION, Todd D. Little
Bayesian Statistics for the Social Sciences SECOND EDITION
David Kaplan
Series Editor’s Note by Todd D. Little
THE GUILFORD PRESS New York London
Copyright © 2024 The Guilford Press
A Division of Guilford Publications, Inc.
370 Seventh Avenue, Suite 1200, New York, NY 10001
www.guilford.com

All rights reserved. No part of this book may be reproduced, translated, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, microfilming, recording, or otherwise, without written permission from the publisher.

Printed in the United States of America

This book is printed on acid-free paper.

Last digit is print number: 9 8 7 6 5 4 3 2 1

Library of Congress Cataloging-in-Publication Data

Names: Kaplan, David, author.
Title: Bayesian statistics for the social sciences / David Kaplan.
Description: Second edition. | New York : The Guilford Press, [2024] | Series: Methodology in the social sciences | Includes bibliographical references and index.
Identifiers: LCCN 2023025118 | ISBN 9781462553549 (cloth)
Subjects: LCSH: Social sciences—Statistical methods. | Bayesian statistical decision theory. | BISAC: SOCIAL SCIENCE / Statistics | BUSINESS & ECONOMICS / Statistics
Classification: LCC HA29 .K344 2014 | DDC 519.5/42—dc23
LC record available at https://lccn.loc.gov/2023025118
Guilford Press is a registered trademark of Guilford Publications, Inc.
Series Editor’s Note
Anytime I see one of my series authors put together a second edition it's like falling in love again, because I know two things: it's a labor of love for the author, and you become even more enriched than the first time around. David Kaplan's second edition is simply lovely. As I said in the first edition, Kaplan is in a very elite class of scholar. He is a methodological innovator who is guiding and changing the way that researchers conduct their research and analyze their data. He is also a distinguished educational researcher whose work shapes educational policy and practice. I see David Kaplan's book as a reflection of his sophistication as both a researcher and statistician; it shows depth of understanding that even dedicated quantitative specialists may not have and, in my view, it will have an enduring impact on research practice.

Kaplan's research profile and research skills are renowned internationally, and his reputation is globally recognized. His profile as a prominent power player in the field brings instant credibility. As a result, when Kaplan says Bayesian is the way to go, researchers listen. As with the first edition, his book brings his voice to you in an engaging and highly serviceable manner.

Why is the Bayesian approach to statistics seeing a resurgence across the social and behavioral sciences (an approach that has been around for some time)? One reason for the delay in adopting Bayes is technological. Bayesian estimation can be computer intensive and, until about a score of years ago, the computational demands limited its widespread application. Another reason is that the social and behavioral sciences needed an accessible translation of Bayes for these fields so that we could understand not only the benefits of Bayes but also how to apply a Bayesian approach. Kaplan is clear and practical in his presentation and shares with us his experiences and helpful/pragmatic recommendations. I think a Bayesian perspective will continue to see widespread usage now that David has updated and expanded upon this indispensable resource. In many ways, the zeitgeist for Bayes is still favorable given that researchers are asking and attempting to answer more complex questions. This second edition provides researchers with the means to address well the intricate nuances of applying a Bayesian perspective to test intertwined theory-driven hypotheses.
This second edition brings a wealth of new material that builds nicely on what was already a thorough and formidable foundation. Kaplan uses the R interface to Stan that provides a fast and stable software environment, which is great news because the inadequacies of other software environments were an impediment to adopting a Bayesian approach. Each of the prior chapters has been expanded with new material. For example, Chapter 1 adds a discussion of coherence, Dutch book bets, and the calibration of probability assessments. In Chapter 2, there is an extended discussion of prior distributions, which is at the heart of Bayesian estimation. Chapter 3 continues with coverage of Jeffreys' prior and the LKJ prior for correlation matrices. In Chapter 4, the different algorithms utilized in the Stan software platform are explained, including the Metropolis-Hastings algorithm, the Hamiltonian Monte Carlo algorithm, and the No-U-Turn sampler, as well as an updated discussion of convergence diagnostics. Other chapters have new material such as new missing data material on the problem of model uncertainty in multiple imputation, expanded coverage of continuous and categorical latent variables, factor analysis, and latent class analysis, as well as coverage of multinomial, Poisson, and negative binomial regression. In addition, the important topics of model evaluation and model comparison are given their own chapter in the second edition.

New chapters on other critical topics have been added, including variable selection and sparsity, the Bayesian decision theory framework to explain model averaging, and the method of Bayesian stacking as a means of combining predictive distributions, along with a remarkably insightful chapter on Bayesian workflow for statistical modeling in the social sciences. All lovely additions to the second edition, which, as in the first edition, was already a treasure trove of all things Bayesian.

As always, enjoy!

Todd D. Little
At My Wit's End in Montana
(the name of my home)
Preface to the Second Edition
Since the publication of the first edition of Bayesian Statistics for the Social Sciences in 2014, Bayesian statistics is, arguably, still not the norm in the formal quantitative methods training of social scientists. Typically, the only introduction that a student might have to Bayesian ideas is a brief overview of Bayes' theorem while studying probability in an introductory statistics class. This is not surprising. First, until relatively recently, it was not feasible to conduct statistical modeling from a Bayesian perspective owing to its complexity and lack of available software. Second, Bayesian statistics represents a powerful alternative to frequentist (conventional) statistics, and, therefore, can be controversial, especially in the context of null hypothesis significance testing.1 However, over the last 20 years or so, considerable progress has been made in the development and application of complex Bayesian statistical methods, due mostly to developments in, and the availability of, proprietary and open-source statistical software tools. And, although Bayesian statistics is not quite yet an integral part of the quantitative training of social scientists, there has been increasing interest in the application of Bayesian methods, and it is not unreasonable to say that in terms of theoretical developments and substantive applications, Bayesian statistics has arrived.

Because of extensive developments in Bayesian theory and computation since the publication of the first edition of this book, I felt there was a pressing need for a thorough update of the material to reflect new developments in Bayesian methodology and software. The basic foundations of Bayesian statistics remain more or less the same, but this second edition encompasses many new extensions, and so the order of the chapters has changed in some instances, with some chapters heavily revised, some chapters updated, and some chapters containing all new material.

1 We will use the term frequentist to describe the paradigm of statistics commonly used today, which represents the counterpart to the Bayesian paradigm of statistics. Historically, however, Bayesian statistics predates frequentist statistics by about 150 years.
• Chapter 1 now contains new material on coherence, Dutch book bets, and the calibration of probability assessments.
• Chapter 2 now includes an extended discussion of prior distributions, including weakly informative priors and reference priors. As an aside, a discussion of Cromwell's rule has also been added.
• Chapter 3 now includes a description of Jeffreys' prior associated with each relevant probability distribution. I also add a description of the LKJ prior for correlation matrices (Lewandowski, Kurowicka, & Joe, 2009).
• Chapter 4 adds new material on the Metropolis-Hastings algorithm. New material is also added on Hamiltonian Monte Carlo and the No-U-Turn sampler underlying the Stan software program, which will be used throughout this book. An updated discussion of convergence diagnostics is presented. Chapter 4 also includes material on summarizing the posterior distribution and provides a fully worked-through example introducing the Stan software package. I also provide a brief discussion of variational Bayes, an alternative algorithm that does not require Markov chain Monte Carlo sampling. An example of variational Bayes is reserved for Chapter 11.
• Chapter 5 (Chapter 6 in the first edition) retains material on Bayesian linear and generalized linear models, but expands the material from the first edition to now cover multinomial, Poisson, and negative binomial regression.
• Chapter 6 (Chapter 5 in the first edition) covers model evaluation and model comparison as a separate chapter and now adds a critique of the Bayesian information criterion. This chapter also includes in-depth presentations of the deviance information criterion, the widely applicable information criterion, and the leave-one-out information criterion. This chapter serves as a bridge to describe more advanced modeling methodologies in Chapters 7 and 8.
• Chapter 7 (Chapter 8 in the first edition) retains the material on Bayesian multilevel modeling from the first edition with the exception that all examples now use Stan.
• Chapter 8 (Chapter 9 in the first edition) has been revamped in light of the publication of Sarah Depaoli's (2021) excellent text on Bayesian structural equation modeling. For this chapter, I have decided to focus attention on two general latent variable models and refer the reader to Depaoli's text for a more detailed treatment of Bayesian SEM, which focuses primarily on estimating a variety of latent variable models using Mplus (Muthén & Muthén, 1998–2017), BUGS (Lunn, Spiegelhalter, Thomas, & Best, 2009), and blavaan (Merkle, Fitzsimmons, Uanhoro, & Goodrich, 2021). Note that blavaan is a flexible interface to Stan for estimating Bayesian structural equation models. For this edition, I will focus on models for continuous and categorical latent variables, specifically factor analysis and latent class analysis, respectively. In both cases, I use Stan, and in the latent class example, I will discuss the so-called label-switching problem and a possible solution using variational Bayes.
• Chapter 9 (Chapter 7 in the first edition) addresses missing data with a focus on Bayesian perspectives. This chapter remains mostly the same as the first edition and serves simply as a general overview of missing data with a focus on Bayesian methods. The reader is referred to Enders' and Little and Rubin's seminal texts on missing data for more detail (Little & Rubin, 2020; Enders, 2022). This chapter does include new material on the problem of model uncertainty in multiple imputation.
• Chapter 10 provides new material on Bayesian variable selection and sparsity, focusing on the ridge prior, the lasso prior, the horseshoe prior, and the regularized horseshoe prior, all of which induce sparsity in the model. A worked-through example using Stan is provided.
• Chapter 11 is now dedicated to model averaging, reflecting both historical and recent work in this area. I frame this discussion in the context of Bayesian decision theory. I provide a discussion of model averaging that includes the general topic of M-frameworks, and I now discuss the method of Bayesian stacking as a means of combining predictive distributions. This chapter is primarily drawn from Kaplan (2021).
• Chapter 12 closes the book by providing a Bayesian workflow for statistical modeling in the social sciences and summarizes the Bayesian advantage over the conventional frequentist practice of statistics in the social sciences.
Data Sources

As in the first edition, the examples provided will primarily utilize large-scale assessment data, and in particular data from the OECD Program for International Student Assessment (PISA; OECD, 2019), the Early Childhood Longitudinal Study (ECLS; NCES, 2001), and the Progress in International Reading Literacy Study (PIRLS; Mullis & Martin, 2015). The datasets used for the examples are readily available on the Guilford website associated with the book (www.guilford.com/kaplan-materials).
Program for International Student Assessment (PISA)

Launched in 2000 by the Organization for Economic Cooperation and Development (OECD), PISA is a triennial international survey that aims to evaluate education systems worldwide by testing the skills and knowledge of 15-year-old students. In 2018, 600,000 students, statistically representative of 32 million 15-year-old students in 79 countries and economies, took an internationally agreed-upon 2-hour test.2 Students were assessed in science, mathematics, reading, collaborative problem solving, and financial literacy. PISA is arguably the most important policy-relevant international survey currently operating (OECD, 2002).

Following the overview in Kaplan and Kuger (2016), the sampling framework for PISA follows a two-stage stratified sample design. Each country/economy provides a list of all "PISA-eligible" schools, and this list constitutes the sampling frame. Schools are then sampled from this frame with sampling probabilities that are proportional to the size of the school, with size being a function of the estimated number of PISA-eligible students in the school. The second stage of the design requires sampling students within the sampled schools. A target cluster size of 35 students within schools was desired, though for some countries this target cluster size was negotiable. The method of assessment for PISA follows closely the spiraling design and plausible value methodologies originally developed for NAEP (see, e.g., OECD, 2017).

In addition to these so-called cognitive outcomes, policymakers and researchers alike have begun to focus increasing attention on the nonacademic contextual aspects of schooling. Context questionnaires provide important variables for models predicting cognitive outcomes, and these variables have become important outcomes in their own right, often referred to as non-cognitive outcomes (see, e.g., Heckman & Kautz, 2012). PISA also assesses these non-cognitive outcomes via a one-half-hour internationally agreed-upon context questionnaire (see Kuger, Klieme, Jude, & Kaplan, 2016). Data from PISA are freely available at https://www.oecd.org/pisa/ and can be accessed using their PISA Data Explorer.

2 PISA 2022 will have 85 participating countries, but the data were not released as of the writing of this book.
Early Childhood Longitudinal Study: Kindergarten Class of 1998–99

In addition to PISA, this book uses data from the Early Childhood Longitudinal Study: Kindergarten Class of 1998–99 (NCES, 2001). The ECLS-K:1998–99 implemented a multistage probability sample design to obtain a nationally representative sample of children attending kindergarten in 1998–99. The primary sampling units (PSUs) at the base year of data collection (Fall Kindergarten) were geographic areas consisting of counties or groups of counties. The second-stage units were schools within sampled PSUs. The third- and final-stage units were children within schools. For ECLS-K:1998–99, detailed information about children's kindergarten experiences, as well as their transition into formal schooling from Grades 1 through 8, was collected in the fall and the spring of kindergarten (1998–99), the fall and spring of 1st grade (1999–2000), the spring of 3rd grade (2002), the spring of 5th grade (2004), and the spring of 8th grade (2007), seven time points in total. For more detail regarding the sampling design for ECLS-K:1998–99, please see Tourangeau, Nord, Lê, Sorongon, and Najarian (2009).

Children, their families, teachers, schools, and care providers provided information on children's cognitive, social, emotional, and physical development. Information was also collected on children's home environment, home educational activities, school environment, classroom environment, classroom curriculum, teacher qualifications, and before- and after-school care. Data from ECLS-K:1998–99 are freely available at https://nces.ed.gov/ecls/.
Progress in International Reading Literacy Study (PIRLS)

The Progress in International Reading Literacy Study (PIRLS) is sponsored by the International Association for the Evaluation of Educational Achievement (IEA) and is an international assessment and research project designed to measure reading achievement at the fourth-grade level, as well as school and teacher practices related to instruction. The international target population for the PIRLS assessment is defined as the grade representing 4 years of formal schooling, counting from the first year of primary or elementary schooling.

PIRLS employs a two-stage random sample design, with a sample of schools drawn as a first stage and one or more intact classes of students selected from each of the sampled schools as a second stage. Intact classes of students are sampled rather than individual students from across the grade level or of a certain age because PIRLS pays particular attention to students' curricular and instructional experiences, and these typically are organized on a classroom basis. Sampling intact classes also has the operational advantage of less disruption to the school's day-to-day business than individual student sampling.

Fourth-grade students complete a reading assessment and questionnaire that addresses students' attitudes toward reading and their reading habits. In addition, questionnaires are given to students' teachers and school principals to gather information about students' school experiences in developing reading literacy. Since 2001, PIRLS has been administered every 5 years, with the United States participating in all past assessments. For the example used in this book, I will draw data from PIRLS 2016. The PIRLS databases are freely available at https://timssandpirls.bc.edu/pirls2016/index.html.
Software

For this edition, I will demonstrate Bayesian concepts and provide applications using primarily the Stan (Stan Development Team, 2021a) software program and its R interface RStan (Stan Development Team, 2020). Stan is a high-level probabilistic programming language written in C++. Stan is named after Stanislaw Ulam, one of the major developers of Monte Carlo methods. With Stan, the user can specify log density functions and, of relevance to this book, obtain fully Bayesian inference through Hamiltonian Monte Carlo and the No-U-Turn algorithm (discussed in Chapter 4). In some cases, other interfaces for Stan, such as rstanarm (Goodrich, Gabry, Ali, & Brilleman, 2022) and brms (Bürkner, 2021), will be used. These programs also call other programs, and for cross-validation we will be using the loo program (Vehtari, Gabry, Yao, & Gelman, 2019). For pedagogical purposes, I have written the code to be as explicit as possible. The Stan programming language is quite flexible, and many different ways of writing the same code are possible. However, it should be emphasized that this book is not a manual on Bayesian inference with Stan. For more information on the intricacies of the Stan programming language, see Stan Development Team (2021a). Finally, all code will be integrated into the text and fully annotated. In addition, all software code in the form of R files and data can be found on the Guilford companion website. Note that due to the probabilistic nature of Bayesian statistical computing, a reanalysis of the examples may not yield precisely the same numerical results as found in the book.
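Although fully worked-through examples appear later in the book, a minimal sketch of what an RStan session looks like may help orient readers now. The model and simulated data below are placeholders of my own, not examples from the book:

```r
library(rstan)

# A trivial Stan model: Gaussian likelihood with weakly informative priors
model_code <- "
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 10);     // weakly informative prior on the mean
  sigma ~ cauchy(0, 5);   // half-Cauchy prior on the standard deviation
  y ~ normal(mu, sigma);  // likelihood
}
"

y <- rnorm(50, mean = 2, sd = 1)  # illustrative data
fit <- stan(model_code = model_code,
            data = list(N = length(y), y = y),
            chains = 4, iter = 2000)
print(fit)  # posterior summaries, effective sample sizes, R-hat
```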
Philosophical Stance

In the previous edition of this book, I wrote a full chapter on various philosophical views underlying Bayesian statistics, including subjective versus objective Bayesian inference, as well as a position I took arguing for an evidence-based view of subjective Bayesian statistics. As a good Bayesian, I have updated my views since that time, and in the interest of space, I would rather add more practical material and concentrate less on philosophical matters. However, whether one likes it or not, the application of statistical methods, Bayesian or otherwise, betrays a philosophical stance, and it may be useful to know the philosophical stance that encompasses this second edition. In particular, my position regarding an evidence-based view of Bayesian modeling remains more or less unchanged, but my updated view is also consistent with that of Gelman and Shalizi (2013) and summarized in Haig (2018), namely, a neo-Popperian view that Bayesian statistical inference is fundamentally (or should be) deductive in nature and that the "usual story" of Bayesian inference characterized by updating knowledge inductively from priors to posteriors is probably a fiction, at least with respect to typical statistical practice (Gelman & Shalizi, 2013, p. 8).

To be clear, the philosophical position that I take in this book can be summarized by five general points. First, statistical modeling takes place in a state of pervasive uncertainty. This uncertainty is inherently epistemic in that it is our knowledge about parameters and models that is imperfect. Attempting to address this uncertainty is valuable insofar as it impacts one's findings whether it is directly addressed or not. Second, parameters and models, by their very definition, are unknown quantities, and the only language we have for expressing our uncertainty about parameters and models is probability. Third, prior distributions encode our uncertainty by quantifying our current knowledge and assumptions about the parameters and models of interest through the use of continuous or categorical probability distributions. Fourth, our current knowledge and assumptions are propagated via Bayes' theorem to the posterior distribution, which provides a rich way to describe results and to test models for violations through the use of posterior predictive checking via the posterior predictive distribution. Fifth, posterior predictive checking provides a way to probe deficiencies in a model, both globally and locally, and while I may hold a more sanguine view of model averaging than Gelman and Shalizi (2013), I concur that posterior predictive checking is an essential part of the Bayesian workflow (Gelman et al., 2020) for both explanatory and predictive uses of models.
Target Audience

Positioning a book for a particular audience is always a tricky process. The goal is to first decide on the type of reader one hopes to attract, and then to continuously keep that reader in mind when writing the book. For this edition, the readers I have in mind are advanced graduate students or researchers in the social sciences (e.g., education, psychology, and sociology) who are either focusing on the development of quantitative methods in those areas or who are interested in using quantitative methods to advance substantive theory in those areas. Such individuals would be expected to have good foundational knowledge of the theory and application of regression analysis in the social sciences and have had some exposure to mathematical statistics and calculus. It would also be expected that such readers would have had some exposure to methodologies that are now widely applied to social science data, in particular multilevel models and latent variable models. Familiarity with R would also be expected, but it is not assumed that the reader would have knowledge of Stan. It is not expected that readers would have been exposed to Bayesian statistics, but at the same time, this is not an introductory book. Nevertheless, given the presumed background knowledge, the fundamental principles of Bayesian statistical theory and practice are self-contained in this book.
Acknowledgments

I would like to thank the many individuals in the Stan community (https://discourse.mc-stan.org) who patiently and kindly answered many questions that I had regarding the implementation of Stan. I would also like to thank the reviewers who were initially anonymous: Rens van de Schoot, Department of Methodology and Statistics, Utrecht University; Irini Moustaki, Department of Statistics, London School of Economics and Political Science; Insu Paek, Senior Scientist, Human Resources Research Organization; and David Rindskopf, Departments of Educational Psychology and Psychology, The Graduate Center, The City University of New York. All of these scholars' comments have greatly improved the quality and accessibility of the book. Of course, any errors of commission or omission are strictly my responsibility.

I am indebted to my editor, C. Deborah Laughton. I say "my editor" because C. Deborah not only edited this edition and the previous edition, but also the first edition of my book on structural equation modeling (Kaplan, 2000) and my handbook on quantitative methods in the social sciences (Kaplan, 2004) when she was editor at another publishing house. My loyalty to C. Deborah stems from my first-hand knowledge of her extraordinary professionalism, but also from a deep-seated affection for her as my close friend.

Since the last edition of the book, there have been a few new additions to my family, including my daughter Hannah's husband, Robert, and their son, Ari, and daughter, June. My first grandchild, Sam, to whom the first edition was dedicated, now has a brother, Jonah. I dedicate this book to my entire family, Allison, Rebekah, Hannah, Joshua, Robert, Sam, Jonah, Ari, and June, but especially to my wife, Allison, and my daughters, Rebekah and Hannah, who have been the inspiration for everything I do and everything I had ever hoped to become.
Contents
PART I. FOUNDATIONS

1 • PROBABILITY CONCEPTS AND BAYES' THEOREM 3
1.1 Relevant Probability Axioms / 3
1.1.1 The Kolmogorov Axioms of Probability / 3
1.1.2 The Rényi Axioms of Probability / 4
1.2 Frequentist Probability / 5
1.3 Epistemic Probability / 6
1.3.1 Coherence and the Dutch Book / 6
1.3.2 Calibrating Epistemic Probability Assessments / 7
1.4 Bayes’ Theorem / 9
1.4.1 The Monty Hall Problem / 10
1.5 Summary / 11
2 • STATISTICAL ELEMENTS OF BAYES' THEOREM 13
2.1 Bayes' Theorem Revisited / 13
2.2 Hierarchical Models and Pooling / 15
2.3 The Assumption of Exchangeability / 16
2.4 The Prior Distribution / 18
2.4.1 Non-Informative Priors / 18
2.4.2 Jeffreys' Prior / 19
2.4.3 Weakly Informative Priors / 20
2.4.4 Informative Priors / 21
2.4.5 An Aside: Cromwell's Rule / 22
2.5 Likelihood / 23
2.5.1 The Law of Likelihood / 23
2.6 The Posterior Distribution / 25
2.7 The Bayesian Central Limit Theorem and Bayesian Shrinkage / 27
2.8 Summary / 29

3 • COMMON PROBABILITY DISTRIBUTIONS AND THEIR PRIORS 31
3.1 The Gaussian Distribution / 32
3.1.1 Mean Unknown, Variance Known: The Gaussian Prior / 32
3.1.2 The Uniform Distribution as a Non-Informative Prior / 33
3.1.3 Mean Known, Variance Unknown: The Inverse-Gamma Prior / 35
3.1.4 Mean Known, Variance Unknown: The Half-Cauchy Prior / 36
3.1.5 Jeffreys' Prior for the Gaussian Distribution / 37
3.2 The Poisson Distribution / 38
3.2.1 The Gamma Prior / 38
3.2.2 Jeffreys' Prior for the Poisson Distribution / 39
3.3 The Binomial Distribution / 40
3.3.1 The Beta Prior / 40
3.3.2 Jeffreys' Prior for the Binomial Distribution / 41
3.4 The Multinomial Distribution / 42
3.4.1 The Dirichlet Prior / 43
3.4.2 Jeffreys' Prior for the Multinomial Distribution / 43
3.5 The Inverse-Wishart Distribution / 44
3.6 The LKJ Prior for Correlation Matrices / 45
3.7 Summary / 46
4 • OBTAINING AND SUMMARIZING THE POSTERIOR DISTRIBUTION 47
4.1 Basic Ideas of Markov Chain Monte Carlo Sampling / 47
4.2 The Random Walk Metropolis-Hastings Algorithm / 49
4.3 The Gibbs Sampler / 50
4.4 Hamiltonian Monte Carlo / 51
4.4.1 No-U-Turn Sampler (NUTS) / 52
4.5 Convergence Diagnostics / 53
4.5.1 Trace Plots / 53
4.5.2 Posterior Density Plots / 53
4.5.3 Autocorrelation Plots / 54
4.5.4 Effective Sample Size / 54
4.5.5 Potential Scale Reduction Factor / 55
4.5.6 Possible Error Messages When Using HMC/NUTS / 55
4.6 Summarizing the Posterior Distribution / 56
4.6.1 Point Estimates of the Posterior Distribution / 56
4.6.2 Interval Summaries of the Posterior Distribution / 57
4.7 Introduction to Stan and Example / 60
4.8 An Alternative Algorithm: Variational Bayes / 66
4.8.1 Evidence Lower Bound (ELBO) / 67
4.8.2 Variational Bayes Diagnostics / 68
4.9 Summary / 70
PART II. BAYESIAN MODEL BUILDING

5 • BAYESIAN LINEAR AND GENERALIZED LINEAR MODELS 73
5.1 The Bayesian Linear Regression Model / 73
5.1.1 Non-Informative Priors in the Linear Regression Model / 74
5.2 Bayesian Generalized Linear Models / 85
5.2.1 The Link Function / 86
5.3 Bayesian Logistic Regression / 87
5.4 Bayesian Multinomial Regression / 91
5.5 Bayesian Poisson Regression / 94
5.6 Bayesian Negative Binomial Regression / 98
5.7 Summary / 99
6 • MODEL EVALUATION AND COMPARISON 101
6.1 The Classical Approach to Hypothesis Testing and Its Limitations / 101
6.2 Model Assessment / 103
6.2.1 Prior Predictive Checking / 104
6.2.2 Posterior Predictive Checking / 107
6.3 Model Comparison / 112
6.3.1 Bayes Factors / 112
6.3.2 Criticisms of Bayes Factors and the BIC / 116
6.3.3 The Deviance Information Criterion (DIC) / 117
6.3.4 Widely Applicable Information Criterion (WAIC) / 118
6.3.5 Leave-One-Out Cross-Validation / 119
6.3.6 A Comparison of the WAIC and LOO / 121
6.4 Summary / 123
7 • BAYESIAN MULTILEVEL MODELING 125
7.1 Revisiting Exchangeability / 126
7.2 Bayesian Random Effects Analysis of Variance / 127
7.3 Bayesian Intercepts as Outcomes Model / 135
7.4 Bayesian Intercepts and Slopes as Outcomes Model / 137
7.5 Summary / 141
8 • BAYESIAN LATENT VARIABLE MODELING 143
8.1 Bayesian Estimation for the CFA / 143
8.1.1 Priors for CFA Model Parameters / 144
8.2 Bayesian Latent Class Analysis / 150
8.2.1 The Problem of Label-Switching and a Possible Solution / 154
8.2.2 Comparison of VB to the EM Algorithm / 158
8.3 Summary / 160
PART III. ADVANCED TOPICS AND METHODS

9 • MISSING DATA FROM A BAYESIAN PERSPECTIVE 165
9.1 A Nomenclature for Missing Data / 165
9.2 Ad Hoc Deletion Methods for Handling Missing Data / 166
9.2.1 Listwise Deletion / 167
9.2.2 Pairwise Deletion / 167
9.3 Single Imputation Methods / 167
9.3.1 Mean Imputation / 168
9.3.2 Regression Imputation / 168
9.3.3 Stochastic Regression Imputation / 169
9.3.4 Hot Deck Imputation / 169
9.3.5 Predictive Mean Matching / 170
9.4 Bayesian Methods of Multiple Imputation / 170
9.4.1 Data Augmentation / 171
9.4.2 Chained Equations / 172
9.4.3 EM Bootstrap: A Hybrid Bayesian/Frequentist Method / 173
9.4.4 Bayesian Bootstrap Predictive Mean Matching / 175
9.4.5 Accounting for Imputation Model Uncertainty / 176
9.5 Summary / 177
10 • BAYESIAN VARIABLE SELECTION AND SPARSITY 179
10.1 Introduction / 179
10.2 The Ridge Prior / 181
10.3 The Lasso Prior / 183
10.4 The Horseshoe Prior / 185
10.5 Regularized Horseshoe Prior / 187
10.6 Comparison of Regularization Methods / 189
10.6.1 An Aside: The Spike-and-Slab Prior / 191
10.7 Summary / 191
11 • MODEL UNCERTAINTY 193
11.1 Introduction / 193
11.2 Elements of Predictive Modeling / 194
11.2.1 Fixing Notation and Concepts / 195
11.2.2 Utility Functions for Evaluating Predictions / 195
11.3 Bayesian Model Averaging / 196
11.3.1 Statistical Specification of BMA / 197
11.3.2 Computational Considerations / 197
11.3.3 Markov Chain Monte Carlo Model Composition / 199
11.3.4 Parameter and Model Priors / 200
11.3.5 Evaluating BMA Results: Revisiting Scoring Rules / 201
11.4 True Models, Belief Models, and M-Frameworks / 210
11.4.1 Model Averaging in the M-Closed Framework / 210
11.4.2 Model Averaging in the M-Complete Framework / 211
11.4.3 Model Averaging in the M-Open Framework / 211
11.5 Bayesian Stacking / 212
11.5.1 Choice of Stacking Weights / 212
11.6 Summary / 216
12 • CLOSING THOUGHTS 217
12.1 A Bayesian Workflow for the Social Sciences / 217
12.2 Summarizing the Bayesian Advantage / 220
12.2.1 Coherence / 220
12.2.2 Conditioning on Observed Data / 220
12.2.3 Quantifying Evidence / 221
12.2.4 Validity / 221
12.2.5 Flexibility in Handling Complex Data Structures / 222
12.2.6 Formally Quantifying Uncertainty / 222
LIST OF ABBREVIATIONS AND ACRONYMS 223
REFERENCES 225
AUTHOR INDEX 237
SUBJECT INDEX 241
ABOUT THE AUTHOR 249
The companion website (www.guilford.com/kaplan-materials) provides the data and R software code files for the book’s examples.
Part I
FOUNDATIONS
1
Probability Concepts and Bayes' Theorem

In this chapter, we consider fundamental issues in probability that underlie both frequentist and Bayesian statistical inference. We first discuss the axioms of probability that underlie both frequentist and Bayesian concepts of probability. Next, we discuss the frequentist notion of probability as long-run frequency. We then show that long-run frequency is not the only way to conceive of probability, and that probability can be considered as epistemic belief. A key concept derived from epistemic probability, and one that requires adherence to the axioms of probability, is coherence, and we describe this concept in terms of betting systems, particularly the so-called Dutch book. The concepts of epistemic probability and coherence lead to our discussion of Bayes' theorem, and we then show how these concepts relate by working through the famous Monty Hall problem.
1.1 Relevant Probability Axioms

Most students in the social sciences were introduced to the axioms of probability by studying the properties of the coin toss or the dice roll. These studies address questions such as (1) What is the probability that the flip of a fair coin will return heads? and (2) What is the probability that the roll of two fair dice will return a value of 7? To answer these questions requires enumerating the possible outcomes and then counting the number of times the event could occur. The probabilities of interest are obtained by dividing the number of times the event occurred by the number of possible outcomes, that is, the relative frequency of events. Before introducing Bayes' theorem, it is useful to review the axioms of probability that have formed the basis of frequentist statistics. These axioms can be attributed primarily to the work of Kolmogorov (1956).
1.1.1 The Kolmogorov Axioms of Probability

Consider two events denoted as A and B. To keep things simple, consider these both to be the flip of a fair coin. Then, using the standard notation for the union ∪ and intersection ∩ of sets, the following are the axioms of probability:

1. p(A) ≥ 0.
2. The probability of the sample space is 1.0.
3. Countable additivity: If A and B are mutually exclusive, then p(A or B) ≡ p(A ∪ B) = p(A) + p(B). Or, more generally, for a countable sequence of mutually exclusive events A_1, A_2, ...,

   p(∪_{j=1}^∞ A_j) = Σ_{j=1}^∞ p(A_j)   (1.1)
A number of other axioms of probability can be derived from these three basic axioms. Nevertheless, these three axioms can be used to deal with the relatively easy case of the coin-flipping example mentioned above. For example, if we toss a fair coin an infinite number of times, we expect it to land heads 50% of the time.1 This probability, and others like it, satisfies the first axiom that probabilities must be greater than or equal to zero. The second axiom states that over an infinite number of coin flips, the sum of all possible outcomes (in this case, heads and tails) is equal to one. Indeed, the number of possible outcomes represents the sample space, and the sum of probabilities over the sample space is one. Finally, with regard to the third axiom, assuming that one outcome precludes the occurrence of another outcome (e.g., the coin landing heads precludes the coin landing tails), the probability of the combined event p(A ∪ B) is the sum of the separate probabilities, that is, p(A ∪ B) = p(A) + p(B).

We may wish to add to these three axioms a fourth axiom that deals with the notion of independent events. If two events are independent, then the occurrence of one event does not influence the probability of another event. For example, with two coins A and B, the probability of A resulting in "heads" does not influence the result of a flip of B. Formally, we define independence as p(A and B) ≡ p(A ∩ B) = p(A)p(B). The notion that independent events allow the individual probabilities to be simply multiplied plays a critical role in the derivation of Bayes' theorem.

1 Interestingly, this expectation is not based on having actually tossed the coin an infinite number of times. Rather, this expectation is a prior belief, and arguably, this is one example of how Bayesian thinking is automatically embedded in frequentist logic.
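As a quick illustration of these axioms (a sketch of my own, not code from the book), a short R simulation can approximate the probabilities above by relative frequency:

```r
set.seed(123)
n <- 1e6

# Additivity for mutually exclusive events: one roll of a fair die
roll <- sample(1:6, n, replace = TRUE)
mean(roll == 1 | roll == 2)          # ~1/3
mean(roll == 1) + mean(roll == 2)    # ~1/3, so p(A U B) = p(A) + p(B)

# Independence of two fair coins: p(A n B) = p(A)p(B)
A <- rbinom(n, 1, 0.5) == 1
B <- rbinom(n, 1, 0.5) == 1
mean(A & B)        # ~0.25
mean(A) * mean(B)  # ~0.25
```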
1.1.2 The Rényi Axioms of Probability

Note that the Kolmogorov axioms do not take into account how probabilities might be affected by conditioning on the dependency of events. An extension of the Kolmogorov system that accounts for conditioning was put forth by Alfréd Rényi. As an example, consider the case of observing the presence or absence of lung cancer (C) and the behavior of smoking or not smoking (S). We may argue on the basis of prior experience and medical research that C is not independent of S, that is, the joint probability p(C ∩ S) ≠ p(C)p(S). To handle this problem, we define the conditional probability of C "given" S (i.e., p(C | S)) as

p(C | S) = p(C ∩ S) / p(S)   (1.2)
The denominator on the right-hand side of Equation (1.2) shows that the sample space associated with p(C ∩ S) is reduced by knowing S. Notice that if C and S were independent, then

p(C | S) = p(C ∩ S) / p(S) = p(C)p(S) / p(S) = p(C)   (1.3)

which states that knowing S tells us nothing about C.

Following Press (2003), Rényi's axioms can be defined with respect to our lung cancer example. Let G stand for a genetic history of cancer. Then,

1. For any events C and S, we have p(C | S) ≥ 0 and p(C | C) = 1.
2. For disjoint events C_j and some event S,

   p(∪_{j=1}^∞ C_j | S) = Σ_{j=1}^∞ p(C_j | S)

   where j indexes the collection of disjoint events.
3. For every collection of events (C, S, G), with S a subset of G (i.e., S ⊆ G) and 0 < p(S | G), we have

   p(C | S) = p(C ∩ S | G) / p(S | G)

Rényi's third axiom allows one to obtain the conditional probability of C given S while conditioning on yet a third variable G: for example, the conditional probability of lung cancer given observed smoking, conditioning in turn on whether there is a genetic history of cancer.
1.2 Frequentist Probability

Underlying frequentist statistics is the idea of long-run frequency. An example of probability as long-run frequency is the dice roll. In this case, the number of possible outcomes of one roll of a fair die is 6. If we wish to calculate the probability of rolling a 2, then we simply obtain the ratio of the number of favorable outcomes (here there is only 1 favorable outcome) to the total possible number of outcomes (here 6). Thus, the frequentist probability is 1/6 = 0.17. However, the frequentist probability of rolling a 2 is purely theoretical insofar as the die might not be truly fair or the conditions of the toss might vary from trial to trial. Thus, the frequentist probability of 0.17 relates to the relative frequency of rolling a 2 in a very large (indeed infinite) and perfectly replicable number of dice rolls.

This purely theoretical nature of long-run frequency nevertheless plays a crucial role in frequentist statistical practice. Indeed, the entire structure of Neyman–Pearson hypothesis testing (Neyman & Pearson, 1928) and Fisherian statistics (e.g., Fisher, 1941/1925) is based on the conception of probability as long-run frequency. Our conclusions regarding null and alternative hypotheses presuppose the idea that we could conduct the same study an infinite number of times under perfectly reproducible conditions. Moreover, our interpretation of confidence intervals also assumes a fixed parameter, with the confidence intervals varying over an infinitely large number of identical studies.
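The following small R simulation (my own illustration, not code from the book) shows the sense in which the relative frequency of rolling a 2 approaches the theoretical value of 1/6 as the number of trials grows:

```r
set.seed(42)
rolls <- sample(1:6, 1e5, replace = TRUE)

# Running relative frequency of rolling a 2 after each trial
running_freq <- cumsum(rolls == 2) / seq_along(rolls)
running_freq[c(10, 100, 1000, 100000)]  # settles near 1/6 = 0.167
```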
1.3 Epistemic Probability

But there is another view of probability, where it is considered as subjective belief. Specifically, a modification of the Kolmogorov axioms was advanced by de Finetti (1974), who suggested replacing the (infinite) countable additivity axiom with finite additivity and also suggested treating p(·) as a subjective probability.2 It may be of historical interest to note that de Finetti was a radical subjectivist. Indeed, the opening line of his two-volume treatise on probability states that "Probability does not exist," by which he meant that probability does not have an objective status, but rather represents the quantification of our experience of uncertainty. Furthermore, the notion of probability as something external to the individual, possessing an objective status "out there," is superstition, no different from postulating the existence of "... Cosmic Ether, Absolute Space and Time, ..., or Fairies and Witches ..." (p. x). Thus, for de Finetti (1974),

... only subjective probabilities exist – i.e. the degree of belief in the occurrence of an event attributed by a given person at a given instant with a given set of information. (pp. 4–5, italics de Finetti's)

As pointed out by Press (2003), the first mention of probability as a degree of subjective belief was made by Ramsey (1926), and it is this notion of probability as subjective belief that led to considerable resistance to Bayesian ideas. A detailed treatment of the axioms of subjective probability can be found in Fishburn (1986). The use of the term subjective is perhaps unfortunate, insofar as it promotes the idea of fuzzy, unscientific reasoning. Lindley (2007) relates the same concern and prefers the term personal probability to subjective probability. Howson and Urbach (2006) adopt the less controversial term epistemic probability to reflect an individual's greater or lesser degree of uncertainty about the problem at hand. Put another way, epistemic probability concerns our uncertainty about unknowns. This is the term that will be used throughout the book.

2 A much more detailed set of axioms for subjective probability was advanced by Savage (1954).
1.3.1 Coherence and the Dutch Book

How might we operationalize the notion of epistemic probability? The canonical example used to illustrate epistemic probability is betting behavior. If an individual enters into a bet that does not satisfy the axioms of probability, then they are not being coherent. Incoherent beliefs can lead a bettor to enter into a sequence of bets in which they are guaranteed to lose regardless of the outcome. In other words, the epistemic beliefs of the bettor do not cohere with the axioms of probability. This type of bet is referred to as a Dutch book or lock.

Example 1.1: An Example of a Dutch Book

Table 1.1 below shows a sequence of bets that one of the following teams goes to the World Series.

TABLE 1.1. Sequence of bets leading to a Dutch book

Team                 Odds            Implied probability  Bet price  Payout
Chicago Cubs         Even            1/(1+1) = .50        $100       $100 + $100
Boston Red Sox       3 to 1 against  1/(3+1) = .25        $50        $50 + $150
Los Angeles Dodgers  4 to 1 against  1/(4+1) = .20        $40        $40 + $160
New York Yankees     9 to 1 against  1/(9+1) = .10        $20        $20 + $180
Totals                               1.05                 $210       $200
Consider the first bet in the sequence, namely that the odds of the Chicago Cubs going to the World series is even. This is the same as saying that the probability implied by the odds is 0.50. The bookie sets the bet price at $100. If the Cubs do go to the World Series, then the bettor gets back the $100 plus the bet price. However, this is a sequence of bets that also includes the Red Sox, Dodgers, and Yankees. Taken together, we see that the implied probabilities sum to greater than 1.0, which is a clear violation of Kolmogorov’s Axiom 2. As a result, the bookie will pay out $200 regardless of who goes to the World Series while the bettor has paid $210 for the bet, a net loss of $10.
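The arithmetic of Example 1.1 is easy to verify. The following R snippet (a sketch of my own, not code from the book) reproduces the implied probabilities and the bettor's guaranteed loss:

```r
odds_against <- c(Cubs = 1, RedSox = 3, Dodgers = 4, Yankees = 9)
implied_prob <- 1 / (odds_against + 1)
sum(implied_prob)        # 1.05 > 1: violates Kolmogorov's second axiom

bet_price <- c(100, 50, 40, 20)
payout <- 200            # every winning ticket returns stake plus profit = $200
sum(bet_price) - payout  # 10: the bettor loses $10 no matter which team wins
```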
1.3.2 Calibrating Epistemic Probability Assessments
It is perhaps not debatable that if an individual is making a series of epistemic probability assessments, then it is rational for that individual to utilize feedback to improve their judgment. Such feedback is referred to as calibration. As a simple example, discussed in Press (2003), consider the problem of weather forecasting. Each and every day the forecaster (probability assessor) states, let's say, the probability of precipitation. The forecast could be the result of an ensemble of weather forecasting models, but in any event, a forecast is stated. This forecast is then compared to the actual event for that day, and over the long run, this forecast should improve. But how do we quantify the difference between the forecast and the actual event in a fashion that provides useful feedback to the forecaster? To provide useful feedback, we use scoring rules.

Scoring rules provide a measure of the accuracy of epistemic probabilistic assessments, and a probability assessment can be said to be well-calibrated if the assigned probability of the outcome matches the actual proportion of times that the outcome occurred (Dawid, 1982). A scoring rule is a utility function (Gneiting & Raftery, 2007), and the goal of the assessor is to be honest and provide a forecast that will maximize his/her utility. The idea of scoring rules is quite general, but one can consider scoring rules from a subjectivist Bayesian perspective. Here, Winkler (1996) quotes de Finetti (1962, p. 359):

The scoring rule is constructed according to the basic idea that the resulting device should oblige each participant to express his true feelings, because any departure from his own personal probability results in a diminution of his own average score as he sees it.

Because scoring rules only require the stated probabilities and realized outcomes, they can be developed for ex-post or ex-ante probability evaluations. Ex-post probability assessments utilize existing historical probability assessments to gauge accuracy, whereas ex-ante probability assessments are true forecasts into the future before the realization of the outcome. However, as suggested by Winkler (1996), the ex-ante perspective of probability evaluation should lead us to consider strictly proper scoring rules because these rules are maximized if and only if the assessor is honest in reporting their scores. Following the discussion and notation given in Winkler (1996; see also Jose, Nau, & Winkler, 2008), let p ≡ (p_1, ..., p_n) represent the assessor's epistemic probability distribution of the outcomes of interest, let r ≡ (r_1, ..., r_n) represent the assessor's reported epistemic probability of the outcomes of interest, and let e_i represent the probability distribution that assigns a probability of 1 if event i occurs and a probability of 0 for all other events. Then, a scoring rule, denoted as S(r, p), provides a score S(r, e_i) if event i occurs. The expected score obtained when the assessor reports r when their true distribution is p is

S(r, p) = Σ_i p_i S(r, e_i)   (1.4)
The scoring rule is strictly proper if S(p, p) ≥ S(r, p) for every r and p, with equality when r = p (Jose et al., 2008, p. 1147). We will discuss scoring rules in more detail in Chapter 11 when we consider model averaging. Suffice it to say here that there are three popular types of scoring rules:

1. Quadratic scoring rule (Brier score)

   S_k(r) = 2r_k − Σ_{i=1}^n r_i²   (1.5)

2. Spherical scoring rule

   S_k(r) = r_k / (Σ_{i=1}^n r_i²)^{1/2}   (1.6)

3. Logarithmic scoring rule

   S_k(r) = log r_k   (1.7)
Example 1.2: An Example of Scoring Rules

As a simple example, consider a hypothetical forecast of the amount of snowfall in Madison, Wisconsin, on February 2, 2023. The forecaster is required to provide forecasts for four possible intervals of snowfall in inches: [0, 3], (1, 4], [2, 5], [5, ∞). Now suppose the forecaster provides the following forecasts: r = (0.6, 0.2, 0.1, 0.1). On February 2, 2023, we observe that the actual snowfall is in the interval [0, 3]. Then, the forecaster's quadratic (Brier) scores would be (0.78, −0.02, −0.22, −0.22). The forecaster's spherical scores would be (0.93, 0.31, 0.15, 0.15). Finally, the forecaster's logarithmic scores would be (−0.51, −1.61, −2.30, −2.30). We see that in each case, the forecaster's score is maximized when the reported forecast aligns with the observed outcome. We will take up the idea of scoring rules in Chapter 11 when we discuss Bayesian model averaging.
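The scores in Example 1.2 follow directly from Equations (1.5)-(1.7). The R function below is a small check of my own, not code from the book:

```r
scores <- function(r, k) {
  c(quadratic   = 2 * r[k] - sum(r^2),    # Equation (1.5)
    spherical   = r[k] / sqrt(sum(r^2)),  # Equation (1.6)
    logarithmic = log(r[k]))              # Equation (1.7)
}

r <- c(0.6, 0.2, 0.1, 0.1)  # reported forecast over the four intervals
t(sapply(1:4, function(k) round(scores(r, k), 2)))
# Row k holds the three scores when interval k is realized; the realized
# interval [0, 3] (k = 1) maximizes all three scores
```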
1.4 Bayes' Theorem

Having set the stage with our discussion of epistemic probability, we are now ready to introduce Bayes' theorem. To begin, an interesting feature of Equation (1.2) is that joint probabilities are symmetric, namely, p(C ∩ S) = p(S ∩ C). Therefore, we can also express the conditional probability of smoking, S, given the observation of lung cancer, C, as

p(S | C) = p(S ∩ C) / p(C)   (1.8)
Because of the symmetry of the joint probabilities, we obtain

p(C | S)p(S) = p(S | C)p(C)   (1.9)

Therefore,

p(C | S) = p(S | C)p(C) / p(S)   (1.10)
Equation (1.10) is Bayes' theorem, also referred to as the inverse probability theorem (Bayes, 1763).3 In words, Bayes' theorem states that the conditional probability of an individual having lung cancer given that the individual smokes is equal to the probability that the individual smokes given that he/she has lung cancer, times the probability of having lung cancer, divided by the probability of smoking. The denominator of Equation (1.10), p(S), is the marginal probability of smoking. This can be considered the probability of smoking across individuals with and without lung cancer, which we write as p(S) = p(S | C)p(C) + p(S | ¬C)p(¬C).4 Because this marginal probability is obtained over all possible outcomes of observing lung cancer, it does not carry information relevant to the conditional probability. In fact, p(S) can be considered a normalizing factor, ensuring that the probability does not exceed 1, as required by Kolmogorov's second axiom described earlier. Thus, it is not uncommon to see Bayes' theorem written as

p(C | S) ∝ p(S | C)p(C)   (1.11)

Equation (1.11) states that the probability of observing lung cancer given smoking is proportional to the probability of smoking given observing lung cancer times the marginal probability of observing lung cancer. It is interesting to note that Bayesian reasoning resolves the so-called base-rate fallacy, that is, the tendency to equate p(C | S) with p(S | C). Specifically, without knowledge of the base rate p(C) (the prior probability) and the total amount of evidence in the observation p(S), it is a fallacy to believe that p(C | S) = p(S | C).

3 What we now refer to as Bayesian probability should perhaps be referred to as Laplacian probability.
4 The symbol ¬ denotes "not."
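A small numerical illustration may help. The rates below are hypothetical values of my own choosing, used only to show how the base rate enters the calculation:

```r
p_C      <- 0.01  # hypothetical base rate (prior) of lung cancer
p_S_C    <- 0.80  # hypothetical p(smoking | cancer)
p_S_notC <- 0.20  # hypothetical p(smoking | no cancer)

# Marginal probability of smoking via the law of total probability
p_S <- p_S_C * p_C + p_S_notC * (1 - p_C)

# Bayes' theorem: p(cancer | smoking)
p_C_S <- p_S_C * p_C / p_S
p_C_S  # ~0.04, far below p(S | C) = 0.80: equating the two is the fallacy
```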
1.4.1 The Monty Hall Problem

As a means of understanding connections among the concepts of conditional probability, coherence, and Bayes' theorem, let's consider the famous "Monty Hall" problem. In this problem, named after the host of a popular old television game show, a contestant is shown three doors, one of which has a desirable prize, while the other two have undesirable prizes. The contestant picks a door, but before Monty opens the door, he shows the contestant another door with an undesirable prize and asks the contestant whether they wish to stay with the chosen door or switch. At the start of the game, it is assumed that there is one desirable prize and that the probability that the desirable prize is behind any of the three doors is 1/3. Once a door is picked, Monty shows the contestant a door with an undesirable prize and asks if they would like to switch from the door they originally chose. It is important to note that Monty will not show the contestant the door with the desirable prize. Also, we assume that because the remaining doors have undesirable prizes, the door Monty opens is essentially random: given that there are two doors remaining in this three-door problem, the probability that he opens either one is 1/2. Thus, Monty's knowledge of where the prize is located plays a crucial conditioning role in this problem. Moreover, at no point in the game are the axioms of probability violated.

With this information in hand, we can obtain the necessary probabilities with which to apply Bayes' theorem. Assume the contestant picks door A. Then, the necessary conditional probabilities are

1. p(Monty opens door B | prize is behind A) = 1/2.
2. p(Monty opens door B | prize is behind B) = 0.
3. p(Monty opens door B | prize is behind C) = 1.

The final probability is due to the fact that there is only one door for Monty to choose given that the contestant chose door A and the prize is behind door C.
Let M represent Monty opening door B. Then, the joint probabilities can be obtained as follows:

p(M ∩ A) = p(M | A)p(A) = 1/2 × 1/3 = 1/6
p(M ∩ B) = p(M | B)p(B) = 0 × 1/3 = 0
p(M ∩ C) = p(M | C)p(C) = 1 × 1/3 = 1/3

Before applying Bayes' theorem, note that we have to obtain the marginal probability of Monty opening door B. This is

p(M) = p(M ∩ A) + p(M ∩ B) + p(M ∩ C) = 1/6 + 0 + 1/3 = 1/2

Finally, we can now apply Bayes' theorem to obtain the probabilities of the prize lying behind door A or door C:

p(A | M) = p(M | A)p(A) / p(M) = (1/2 × 1/3) / (1/2) = 1/3
p(C | M) = p(M | C)p(C) / p(M) = (1 × 1/3) / (1/2) = 2/3
Thus, from Bayes’ theorem, the best strategy on the part of the contestant is to switch doors. Crucially, this winning strategy is conceived of in terms of long-run frequency. That is, if the game were played an infinite number of times, then switching doors would lead to the prize approximately 66% of the time. This is an example of where long-run frequency can serve to calibrate Bayesian probability assessments (Dawid, 1982), as we discussed in Section 1.3.2.
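The 2/3 result is easily checked by simulation. The following R sketch (my own, not code from the book) plays the game many times and compares staying with switching:

```r
set.seed(1)
n <- 1e5
prize  <- sample(1:3, n, replace = TRUE)  # door hiding the prize
choice <- sample(1:3, n, replace = TRUE)  # contestant's initial pick

# Monty opens a door that is neither the contestant's pick nor the prize
opened <- mapply(function(p, c) {
  remaining <- setdiff(1:3, c(p, c))
  if (length(remaining) == 1) remaining else sample(remaining, 1)
}, prize, choice)

switched <- 6 - choice - opened  # the one unopened, unchosen door (1+2+3 = 6)
mean(choice == prize)            # ~1/3: win rate when staying
mean(switched == prize)          # ~2/3: win rate when switching
```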
1.5 Summary
This chapter provided a brief introduction to probabilistic concepts relevant to Bayesian statistical inference. Although the notion of epistemic probability predates the frequentist conception of probability, it had not significantly impacted the practice of applied statistics until computational developments brought Bayesian inference back into the limelight. This chapter also highlighted the conceptual differences between the frequentist and epistemic notions of probability. The importance of understanding the differences between these two conceptions of probability is more than just a philosophical exercise. Rather, their differences are manifest in the elements of the statistical machinery needed for advancing a Bayesian perspective for research in the social sciences. We discuss the statistical elements of Bayes’ theorem in the following chapter.
2 Statistical Elements of Bayes' Theorem

The material presented thus far concerned frequentist and epistemic conceptions of probability, leading to Bayes' theorem. The goal of this chapter is to present the role of Bayes' theorem as it pertains specifically to statistical inference. Setting the foundations of Bayesian statistical inference provides the framework for applications to a variety of substantive problems in the social sciences. The first part of this chapter introduces Bayes' theorem using the notation of random variables and parameters. This is followed by a discussion of the assumption of exchangeability. Following that, we extend Bayes' theorem to more general hierarchical models. Next are three sections that break down the elements of Bayes' theorem with discussions of the prior distribution, the likelihood, and the posterior distribution. The final section introduces the Bayesian central limit theorem and Bayesian shrinkage.
2.1 Bayes' Theorem Revisited
To begin, denote by Y a random variable that takes on a realized value y. For example, a student's score on a mathematics achievement test could be considered a random variable Y taking on a large set of possible values. Once the student receives a score on the mathematics test, the random variable Y is now realized as y. Because Y is an unobserved random variable, we need to specify a probability model to explain how we obtained the actual data values y. We refer to this model as the data generating process or DGP. Next, denote by θ a parameter that we believe characterizes the probability model of interest. The parameter θ can be a scalar, such as the mean or the variance of a distribution, or it can be vector-valued, such as a set of regression coefficients in regression analysis. To avoid too much notational clutter, for now we will use θ to represent either scalar- or vector-valued parameters, where the difference will be revealed by the context. In statistical inference, the goal is to obtain estimates of the unknown parameters given the data. The key difference between Bayesian statistical inference and frequentist statistical inference involves the nature of the unknown parameters θ. In the frequentist tradition, the assumption is that θ is unknown, but has a fixed value that we wish to estimate. In Bayesian statistical inference, θ is also considered
unknown, but instead of being fixed, it is assumed, like Y, to be a random variable possessing a prior probability distribution that reflects our uncertainty about the true value of θ before having seen the data. Because both the observed data y and the parameters θ are considered random variables, the probability calculus allows us to model the joint probability of the parameters and the data as a function of the conditional distribution of the data given the parameters, and the prior distribution of the parameters. More formally,

p(θ, y) = p(y | θ)p(θ)    (2.1)

where p(θ, y) is the joint distribution of the parameters and the data. Using Bayes' theorem from Equation (1.10), we obtain the following:

p(θ | y) = p(θ, y) / p(y) = p(y | θ)p(θ) / p(y)    (2.2)
where p(θ | y) is referred to as the posterior distribution of the parameters θ given the observed data y. Thus, from Equation (2.2) the posterior distribution of θ given y is equal to the data distribution p(y | θ) times the prior distribution of the parameters p(θ), normalized by p(y) so that the posterior distribution sums (or integrates) to 1. For discrete variables,

p(y) = Σ_θ p(y | θ)p(θ)    (2.3)

and for continuous variables,

p(y) = ∫_θ p(y | θ)p(θ) dθ    (2.4)
As an aside, for complex models with many parameters, Equation (2.4) will be very hard to evaluate, and it is for this reason we need the computational methods that will be discussed in Chapter 4. In line with Equation (1.11), the denominator of Equation (2.2) does not involve model parameters, so we can omit the term and obtain the unnormalized posterior distribution:

p(θ | y) ∝ p(y | θ)p(θ)    (2.5)
Consider the data density p(y | θ) on the right-hand side of Equation (2.5). When expressed in terms of the unknown parameters θ for fixed values of y, this term is the likelihood L(θ | y), which we will discuss in more detail in Section 2.5. Thus, Equation (2.5) can be rewritten as

p(θ | y) ∝ L(θ | y)p(θ)    (2.6)
Equations (2.5) and (2.6) represent the core of Bayesian statistical inference, and they are what separate Bayesian statistics from frequentist statistics. Specifically, Equation (2.6) states that our uncertainty regarding the true values of the parameters of our model, as expressed by the prior distribution p(θ), is weighted by the actual data
p(y | θ) (or equivalently, L(θ | y)), yielding (up to a constant of proportionality) an updated estimate of our uncertainty, as expressed in the posterior distribution p(θ | y). Following a brief digression to discuss the assumption of exchangeability, we will take up each of the elements of Equation (2.2) – the prior distribution, the likelihood, and the posterior distribution.
2.2 Hierarchical Models and Pooling
At this point, the careful reader might have noticed that because θ is a random variable described by a probability distribution, then perhaps θ can be modeled by its own set of parameters. Indeed this is the case, and such parameters are referred to as hyperparameters. Hyperparameters control the degree of informativeness in prior distributions. For example, if we use a normal prior for the mean of some outcome of interest, then a more accurate expression for this could be written as

p(θ | ϕ) ∼ N(0, 1)    (2.7)

where ϕ contains the hyperparameters µ = 0 and σ = 1. Naturally, the hierarchy can continue, and we can specify a hyperprior distribution for the hyperparameters — that is,

p(ϕ | δ) ∼ N(· | ·)    (2.8)

and so on. Although this feels like "turtles all the way down," for practical purposes, the hierarchy needs to stop, at which point the parameters at the end of the hierarchy are assumed to be fixed and known. Nevertheless, Bayesian methods do allow one to examine the sensitivity of the results to changes in the values of those final parameters.

Our discussion about hyperparameters leads to the important concept of pooling in Bayesian analysis. Specifically, we can distinguish between three situations related to Bayesian models: (a) no pooling, (b) complete pooling, and (c) partial pooling. To give an example, consider the situation of modeling mastery of some mathematics content for a sample of n students. In the no pooling case, each student has his/her own chance of mastery of the content; from a population perspective, this would imply an infinitely large variance. In the complete pooling case, it is assumed that each student has exactly the same chance of mastering the content; from the population perspective, this would imply zero variance. A balance of these extremes is achieved through partial pooling, wherein each student is assumed to have their own chance at mastering the content, but data from other students help inform each student's chance of mastery. Partial pooling is achieved through specifying Bayesian hierarchical models such as Equation (2.8). Note that our discussion also extends to groupings of observations, such as students nested in schools. Here, in the no pooling case, each of G schools could have its own separate parameters, say, θ_g (g = 1, 2, . . . , G), or there could be one set of parameters that describes the characteristics of all schools, as in the complete pooling case. Finally, each school could have its own set of parameters, in turn modeled by a set of hyperparameters, as in the partial pooling case. This
leads to a very natural way to conceive of multilevel models which we will take up in Chapter 7.
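To make the three pooling regimes concrete, the following small R sketch (all numbers, including the Beta(6, 4) hyperparameter values, are our own illustrative assumptions, not an example from this book) contrasts the no pooling, complete pooling, and partial pooling estimates of each student's mastery probability, with partial pooling implemented as the conjugate beta-binomial posterior mean:

set.seed(42)
n_students <- 8
n_items <- 10
true_theta <- rbeta(n_students, 6, 4)           # hypothetical true mastery rates
y <- rbinom(n_students, size = n_items, prob = true_theta)
no_pool <- y / n_items                          # each student stands alone
complete_pool <- rep(sum(y) / (n_students * n_items), n_students)
a <- 6; b <- 4                                  # illustrative shared hyperparameters
partial_pool <- (y + a) / (n_items + a + b)     # beta-binomial posterior mean
round(cbind(no_pool, complete_pool, partial_pool), 3)

Note how the partial pooling column pulls each student's raw proportion toward the group-level value; the strength of that pull is governed by the hyperparameters.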
2.3 The Assumption of Exchangeability
In most discussions of statistical modeling, it is common to assume that the data y₁, y₂, . . . , yₙ are independently and identically distributed — often referred to as the iid assumption. As Lancaster (2004, p. 27) has pointed out, the iid assumption suggests that probability has an "objective" existence; that there exists a random number generator that produces a sequence of independent random variables. However, such an objectivist view of probability does not accord with the Bayesian notion that probabilities are epistemic. As such, Bayesians invoke the deeper notion of exchangeability to produce likelihoods and address the issue of independence.

Exchangeability arises from de Finetti's representation theorem (de Finetti, 1974; see also Bernardo, 1996) and implies a judgment of the "symmetry" or "similarity" of the information provided by the observations. This judgment is expressed as

p(y₁, . . . , yₙ) = p(y_π[1], . . . , y_π[n])    (2.9)

for all subscript permutations π. A sequence of random variables is said to be (finitely) exchangeable if Equation (2.9) holds for a finite sequence of the random variables (Bernardo, 1996, p. 2). In other words, the joint distribution of the data, p(y₁, y₂, . . . , yₙ), is invariant to permutations of the subscripts. The idea of infinite exchangeability requires that every finite subsequence of an infinite sequence of random variables y₁, y₂, . . . be finitely exchangeable, with finite exchangeability defined in Equation (2.9) (Bernardo & Smith, 2000, p. 171).

Example 2.1: An Example of Finite Exchangeability

Consider the response that student i (i = 1, 2, . . . , 10) makes to a question on an educational questionnaire assessing attitudes toward their teacher, such as "My teacher is supportive of me," where

y_i = 1, if student i agrees
y_i = 0, if student i disagrees    (2.10)

Next, consider three patterns of responses by 10 randomly selected students:

p(1, 0, 1, 1, 0, 1, 0, 1, 0, 0)    (2.11a)
p(1, 1, 0, 0, 1, 1, 1, 0, 0, 0)    (2.11b)
p(1, 0, 0, 0, 0, 0, 1, 1, 1, 1)    (2.11c)
We have just presented three possible patterns, but notice that there are 2¹⁰ = 1,024 possible patterns of agreement and disagreement among the 10 students. If our task were to assign probabilities to all possible outcomes, this could become
prohibitively difficult. However, suppose we now assume that student responses are independent of one another, which might be reasonable if each student is privately asked to rate the teacher on supportiveness. Then, exchangeability implies that only the proportion of agreements matters, not the location of those agreements in the vector. In other words, given that the sequences are the same length n, we can exchange the response of student i for student j without changing our belief about the probability model that generated that sequence.

Exchangeability is a subtle assumption insofar as it means that we believe that there is a parameter θ that generates the observed data via a stochastic model and that we can describe that parameter without reference to the particular data at hand. As Jackman (2009) points out, the fact that we can describe θ without reference to a particular set of data is, in fact, what is implied by the idea of a prior distribution. Indeed, as Jackman notes, "the existence of a prior distribution over a parameter is a result of de Finetti's Representation Theorem (de Finetti, 1974), rather than an assumption" (p. 40, italics Jackman's).

It is important to note that iid random variables imply exchangeability. Specifically, iid implies that

p(y₁, . . . , yₙ) = p(y₁) × p(y₂) × · · · × p(yₙ)    (2.12)
Because the right-hand side can be multiplied in any order, it follows that the left-hand side is symmetric, and hence exchangeable. However, exchangeability does not necessarily imply iid. A simple example of this idea is the case of drawing balls from an urn without replacement (Suppes, 1986, p. 348). Specifically, suppose we have an urn containing one red ball and two white balls, and we draw the balls out one at a time without replacement. Then,

y_i = 1, if the ith ball is red
y_i = 0, otherwise    (2.13)

To see that y_i is exchangeable, note that

P(y₁ = 1, y₂ = 0, y₃ = 0) = 1/3 × 1 × 1 = 1/3    (2.14a)
P(y₁ = 0, y₂ = 1, y₃ = 0) = 2/3 × 1/2 × 1 = 1/3    (2.14b)
P(y₁ = 0, y₂ = 0, y₃ = 1) = 2/3 × 1/2 × 1 = 1/3    (2.14c)
We see that the permutations of (1, 0, 0) are exchangeable; however, they are not iid. Given that the first draw from the urn is red, we definitely will not get a red ball on the second draw, but if we do not obtain a red ball on the first draw, we will have a higher probability of getting a red ball on the second draw; thus there is a dependency in the sequence of draws.

Our discussion of exchangeability has so far rested on the assumption of independent responses. In the social sciences it is well understood that this is a very heroic assumption. Perhaps the best example of a violation of this assumption concerns the problem of modeling data from clustered samples, such as assessing
the responses of students nested in classrooms. To address this issue, we will need to consider the problem of conditional exchangeability, which we will discuss in Chapter 7, where we will deal with the topic of Bayesian multilevel modeling.
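The urn example can also be verified by direct enumeration. The helper function below is our own construction, introduced purely for illustration; it walks through a sequence of draws and multiplies the conditional probabilities:

# Probability of a given 0/1 sequence of red indicators when drawing
# one red and two white balls without replacement.
p_seq <- function(seq) {
  reds <- 1; whites <- 2; p <- 1
  for (d in seq) {
    total <- reds + whites
    if (d == 1) { p <- p * reds / total; reds <- reds - 1 }
    else { p <- p * whites / total; whites <- whites - 1 }
  }
  p
}
c(p_seq(c(1, 0, 0)), p_seq(c(0, 1, 0)), p_seq(c(0, 0, 1)))  # all equal 1/3

All three permutations have probability 1/3, confirming exchangeability, yet the draws are plainly not independent: P(y₂ = 1 | y₁ = 1) = 0, while the marginal P(y₂ = 1) = 1/3.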
2.4 The Prior Distribution
Returning to Equation (2.2), we begin with a discussion of the prior distribution, which is the key component that defines Bayesian statistics. It is useful to remind ourselves that no study is conducted in the complete absence of knowledge derived from previous research. Whether we are designing randomized experiments or sketching out path diagrams, the information gleaned from previous research is almost always incorporated into our choice of designs, the variables we choose to include in our models, or the conceptual diagrams that we draw. Indeed, researchers who postulate a directional hypothesis for an effect are almost certainly drawing on prior information about the direction that an estimate must reasonably take. Thus, when a researcher encounters a result that does not seem to align with previously established findings, then all other things being equal, the researcher's surprise is related to having held a belief about what is reasonable, usually based on prior research and/or elicited from expert judgment. Bayesian statistical inference simply requires that this prior knowledge be made explicit, and then confronts that prior knowledge with the actual data in hand. Moderation of prior knowledge by the data in hand is the key meaning behind Equation (2.2).

How then might we defend the use of prior distributions when conducting Bayesian analyses? A straightforward argument is that prior distributions directly encode our assumptions regarding reasonable values of model parameters. These assumptions reflect existing knowledge (or lack thereof) about the parameters of interest, and, consistent with our philosophical stance discussed in the Preface, these assumptions are testable against other assumptions and therefore fall squarely into a neo-Popperian framework (see Gelman & Shalizi, 2013). As we will see, these assumptions lie at the level of the model parameters as well as the models themselves.
2.4.1 Non-Informative Priors
In some cases, we may not be in possession of enough prior information to aid in drawing posterior inferences. Or, from a policy or clinical perspective, it may be necessary to refrain from providing substantive information regarding effects of interest and, instead, let the data speak. Regardless, from a Bayesian perspective, this lack of information is still important to consider and incorporate into our statistical analyses. In other words, it is equally important to quantify a lack of information as it is to quantify the cumulative understanding of a problem at hand. The standard approach to quantifying a lack of information is to incorporate non-informative prior probability distributions into our model specifications. Non-informative priors are also referred to as vague or diffuse priors. In the situation in which there is no prior knowledge to draw from, perhaps the most obvious non-informative prior distribution to use is the uniform probability distribution
U(α, β) over some sensible range of values from α to β. Application of the uniform distribution is based on the Principle of Insufficient Reason, first articulated by Laplace (1774/1951), which states that in the absence of any relevant (prior) evidence, one should assign one's degrees-of-belief equally among all the possible outcomes. In this case, the uniform distribution essentially indicates that our assumption regarding the value of a parameter of interest is that it lies in the interval [α, β] and that all possible values within that interval have equal probability. Care must be taken in the choice of the range of values over the uniform distribution. For example, U[−∞, ∞] is an improper prior distribution insofar as it does not integrate to 1.0, as required of probability distributions. We will discuss the uniform distribution in more detail in Chapter 3.
2.4.2 Jeffreys' Prior
A problem with the uniform prior distribution is that it is not invariant to simple transformations. In fact, a transformation of a uniform prior can result in a prior that is not uniform and will end up favoring some values more than others. Moreover, the expected values of the distributions are not the same. As a very simple example, suppose we assign a U(0, 1) prior distribution to a parameter θ and let θ′ = θ². Then the expected values of these two distributions would be

E[θ] = ∫₀¹ θ p(θ) dθ = ∫₀¹ θ dθ = 1/2    (2.15)

and

E[θ′] = ∫₀¹ θ′ p(θ′) dθ′ = ∫₀¹ (1/2) θ′^(1/2) dθ′ = 1/3    (2.16)
which shows the lack of invariance to a simple transformation of the parameters. In addressing the invariance problem associated with the uniform distribution, Jeffreys (1961) proposed a general approach that yields a prior that is invariant under transformations. The central idea is that the knowledge and information contained in the prior distribution of a parameter θ should not be lost when there is a one-to-one transformation from θ to another parameter, say ϕ, denoted as ϕ = h(θ). More specifically, using transformation-of-variables calculus, the prior distribution p(ϕ) will be equivalent to p(θ) when obtained as

p(ϕ) = p(θ) |dθ/dϕ|    (2.17)

On the basis of the relationship in Equation (2.17), Jeffreys (1961) developed a non-informative prior distribution that is invariant under transformations, written as

p(θ) ∝ [I(θ)]^(1/2)    (2.18)

where I(θ) is the Fisher information matrix for θ. The derivation of this result is as follows. Write the Fisher information matrix for a parameter θ as

I(θ) = −E_(y|θ) [ ∂² log p(y | θ) / ∂θ² ]    (2.19)
Next, we write the Fisher information matrix for ϕ as

I(ϕ) = −E_(y|ϕ) [ ∂² log p(y | ϕ) / ∂ϕ² ]    (2.20)

From the change-of-variables expression in Equation (2.17), we can rewrite Equation (2.20) as

I(ϕ) = −E_(y|θ) [ ∂² log p(y | θ) / ∂θ² ] × (dθ/dϕ)² = I(θ) (dθ/dϕ)²    (2.21)

Therefore,

[I(ϕ)]^(1/2) = [I(θ)]^(1/2) |dθ/dϕ|    (2.22)
from which we obtain the relationship to Equation (2.18). It should be noted that Jeffreys’ prior is part of a class of so-called reference priors that are designed to place the choice of priors on objective grounds based on a set of agreed upon rules (Bernardo, 1979; Kass & Wasserman, 1996; Berger, 2006). In other words, reference priors are an algorithmic approach to choosing a prior, and ideally, should be invariant to transformations of the parameter space. Jeffreys’ prior is one such reference prior, and is considered the most popular. We will provide the Jeffreys’ priors for common probability distributions used in the social and behavioral sciences in Chapter 3.
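The non-invariance illustrated in Equations (2.15) and (2.16) is easy to verify numerically. Here is a quick Monte Carlo check in R (the simulation size and seed are arbitrary choices of ours):

set.seed(1)
theta <- runif(1e6)         # theta ~ U(0, 1)
mean(theta)                 # approximately 1/2, matching Equation (2.15)
mean(theta^2)               # approximately 1/3, matching Equation (2.16)
hist(theta^2, breaks = 50)  # the implied prior on theta' piles up near 0

The histogram makes plain that a prior that is flat on θ is far from flat on θ′ = θ², which is precisely the problem Jeffreys' prior is designed to avoid.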
2.4.3 Weakly Informative Priors
Situated between non-informative and informative priors is the class of weakly informative priors. Weakly informative priors are probability distributions that provide one with a method for incorporating less information than one actually has in a particular situation. Specifying weakly informative priors can be useful for many reasons. First, it is doubtful that one has absolutely no prior information on which to base assumptions about parameters and for which a non-informative prior would be appropriate. Rather, it is likely that one can consider a range of parameter values that are reasonable. As an example, we know that the PISA 2018 reading assessment was internationally scaled to have a normal distribution with a mean of 500 and standard deviation of 100, but of course the means and standard deviations of the reading scores will be different for individual countries. Thus, if we were to model the reading scores for the United States, we would not want to place a non-informative prior distribution on the mean, such as the (improper) U(−∞, +∞) or even a vague Gaussian prior such as N(0, 100). Rather, a weakly informative prior distribution that contains a range of reasonable values might be N(500, 10) to reflect reasonable assumptions about the mean but also recognizing uncertainties in a given analysis. Second, weakly informative priors are useful in stabilizing the estimates of a model. As we will see, Bayesian inference can be computationally demanding,
particularly for hierarchical models, and so although one may have information about, say, higher level variance terms, such terms may not be substantively important, and/or they may be difficult to estimate. Therefore, providing weakly informative prior information may help stabilize the analysis without impacting inferences. Finally, as discussed in Gelman, Carlin, et al. (2014), weakly informative priors can be useful in theory testing where it may appear unfair to specify strong priors in the direction of one’s theory. Rather, specifying weakly informative priors in the opposite direction of a theory would then require the theory to pass a higher standard of evidence. In suggesting an approach to constructing weakly informative priors, Gelman, Carlin, et al. (2014) consider two procedures: (1) Start with non-informative priors and then shift to trying to place reasonable bounds on the parameters according to the substantive situation. (2) Start with highly informative priors and then shift to trying to elicit a more honest assessment of uncertainty around those values. From the standpoint of specifying weakly informative priors, the first approach seems the most sensible. The second approach appears more useful when engaging in sensitivity analyses — a topic we will take up later.
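To see what a weakly informative prior does in practice, consider the following illustrative R sketch of the conjugate Gaussian update for a mean with known variance (the update formulas are derived in Section 2.6). All numbers are hypothetical stand-ins rather than actual PISA results, and we read the 10 in N(500, 10) as a standard deviation purely for this illustration; if it is meant as a variance, set tau2 <- 10 instead:

post_normal <- function(ybar, n, sigma2, kappa, tau2) {
  m <- (kappa / tau2 + n * ybar / sigma2) / (1 / tau2 + n / sigma2)
  v <- 1 / (1 / tau2 + n / sigma2)
  c(mean = m, var = v)
}
ybar <- 478; n <- 500; sigma2 <- 100^2                   # hypothetical sample results
post_normal(ybar, n, sigma2, kappa = 500, tau2 = 10^2)   # weakly informative prior
post_normal(ybar, n, sigma2, kappa = 0, tau2 = 1000^2)   # very diffuse prior

The weakly informative prior pulls the estimate modestly toward 500, while the diffuse prior essentially reproduces the sample mean; with sparse data, the stabilizing effect of the weakly informative prior would be far more pronounced.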
2.4.4 Informative Priors
In the previous section, we considered the situation in which there may not be much prior information that can be brought to bear on a problem. In that situation we would rely on non-informative or weakly informative priors. Alternatively, it may be the case that previous research, expert opinion (O'Hagan et al., 2006; Hanea, Nane, Bedford, & French, 2021), or both, can be brought to bear on a problem and be systematically incorporated into prior distributions. Such priors are referred to as informative – though one could argue that applying non-informative priors in the case where one has little to no past information to rely on is, itself, quite informative. We will focus on one type of informative prior based on the notion of conjugacy. A conjugate prior distribution is one that, when combined with the likelihood function, yields a posterior that is in the same distributional family as the prior distribution. Conjugate distributions are convenient because if a prior is not conjugate, the resulting posterior distribution may have a form that is not analytically simple to solve. The existence of numerical simulation methods for Bayesian inference, such as Markov Chain Monte Carlo (MCMC) sampling (discussed in Chapter 4), does, however, render the importance of conjugate distributions less of a problem. Chapter 3 will outline conjugate priors for probability distributions commonly encountered in the social sciences; and throughout this book, conjugate priors will be used in examples when studying informative priors.

To motivate the use of informative priors, consider the problem in education research of academic achievement and its relationship to class size. In this case, we have a considerable amount of prior information based on previous studies regarding the increase in academic achievement when reducing class size (Whitehurst & Chingos, 2011). It may be that previous investigations used different tests of academic achievement, but when examined together, it has been found that
reducing class size to approximately 15 children per classroom results in a one-quarter standard deviation increase (say, about 8 points) in academic achievement. In addition to a prior estimate of the average achievement gain due to reduction in class size, we may also wish to quantify our uncertainty about the exact value of θ by specifying a probability distribution around the prior estimate of the average. Perhaps a sensible prior distribution would be a Gaussian distribution centered at θ = 8. However, let us imagine that previous research has shown that achievement gains due to class size reduction are almost never less than 5 points and almost never more than 14 points (almost a full standard deviation). Taking this range of uncertainty into account, we would propose as our initial assumption about the parameter of interest a prior distribution on θ that is N(8, 1).

The careful reader might have wondered if setting hyperparameters to these fixed values violates the essence of the Bayesian philosophy that all parameters are considered as unknown random variables requiring probability distributions to encode uncertainty. To address that concern, note first that the Bayesian approach does permit one to treat hyperparameters as fixed, but then they are presumably known. Of course, as in the above class size example, this represents a hypothesis of sorts and can be compared formally to other fixed values of hyperparameters that have resulted from different assumptions. In contrast, the frequentist approach treats all parameters as unknown and fixed. Second, as we discussed in Section 2.2, it is not necessary to set hyperparameters to known and fixed quantities. Rather, in a fully hierarchical Bayesian model, it is possible to specify hyperprior distributions on the hyperparameters. In the class size example, a hierarchical model would leave the hyperparameters as unknown but modeled in terms of a hyperprior distribution reflecting the variation in the class size parameters in terms of, say, variation in class size policies across the United States. Regardless, differences of opinion on the values specified for hyperparameters can be directly compared via Bayesian model testing, which we will discuss in Chapter 6.
2.4.5 An Aside: Cromwell's Rule
An important feature of specifying prior probabilities is that even though probabilities range from zero to one, prior probabilities should lie strictly between zero and one. That is, a prior probability should not be set to exactly zero or exactly one. This feature of prior probability is referred to as Cromwell's rule. The term Cromwell's rule was coined by Dennis V. Lindley (2007), and the reference is to Oliver Cromwell, who wrote to the General Assembly of the Church of Scotland on 3 August 1650, including a phrase that has become well known and frequently quoted:

I beseech you, in the bowels of Christ, think it possible that you may be mistaken.
That is, there must be some allowance for the possibility, however slim, that you are mistaken. If we look closely at Bayes' theorem, we can see why. Recall that Bayes' theorem is written as

p(θ | y) = p(y | θ)p(θ) / p(y)    (2.23)
and so, if you state your prior probability of an outcome to be exactly zero, then

p(θ | y) = p(y | θ) × 0 / p(y) = 0    (2.24)
and thus no amount of evidence to the contrary would change your mind. What if you state your prior probability to be exactly 1? In this case, recall that the denominator p(y) is a marginal distribution across all possible values of θ. So, if p(θ) = 1, the denominator p(y) collapses to only your hypothesis p(y | θ), and therefore

p(θ | y) = p(y | θ) / p(y | θ) = 1    (2.25)

Again, no amount of evidence to the contrary would change your mind; the probability of your hypothesis is 1.0 and you're sticking to your hypothesis, no matter what. Cromwell's rule states that unless the statements are deductions of logic, then one should leave some doubt (however small) in probability assessments.
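A short numerical illustration makes the point. The hypotheses and likelihood values below are invented solely for this sketch:

lik   <- c(H1 = 0.99, H2 = 0.01)  # p(y | H): the datum strongly favors H1
prior <- c(H1 = 0, H2 = 1)        # dogmatic prior: H1 ruled out in advance
post  <- prior * lik / sum(prior * lik)
post                              # H1 remains at 0 regardless of the evidence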
2.5 Likelihood
Whereas the prior distribution encodes our accumulated knowledge/assumptions about the parameters of interest, this prior information must, of course, be moderated by the data in hand before yielding the posterior distribution — the source of our current inferences. In Equation (2.5) we noted that the probability distribution of the data given the model parameters, p(y | θ), could be written equivalently as L(θ | y), the likelihood of the parameters given the data. The concept of likelihood is extremely important for both the frequentist and Bayesian schools of statistics. Excellent discussions of likelihood can be found in Edwards (1992) and Royall (1997). In this section, we briefly review the law of likelihood and then present simple expressions of the likelihood for the binomial probability and normal sampling models.
2.5.1 The Law of Likelihood
The likelihood can be defined as proportional to the probability of the data y given the parameter(s) θ. That is,

L(θ | y) ∝ p(y | θ)    (2.26)

where the constant of proportionality does not depend on θ. As a stand-alone quantity, the likelihood is of little value. Rather, what matters is the ratio of the likelihoods under different hypotheses regarding θ, and this leads to the law of likelihood. We define the law of likelihood as follows (see also Royall, 1997):
Definition 2.5.1. If hypothesis θ₁ implies that Y takes on the value y with probability p(y | θ₁), while hypothesis θ₂ implies that Y takes on the value y with probability p(y | θ₂), then the law of likelihood states that the realization Y = y is evidence in support of θ₁ over θ₂ if and only if L(θ₁ | y) > L(θ₂ | y). The likelihood ratio, L(θ₁ | y)/L(θ₂ | y), measures the strength of that evidence.

Notice that the law of likelihood implies that only the information in the data, as summarized by the likelihood, serves as evidence in corroboration (or refutation) of a hypothesis. This latter idea is referred to as the likelihood principle. Notice also that frequentist notions of repeated sampling do not enter into the law of likelihood or the likelihood principle. The issue of conditioning on data that were not observed will be revisited in Chapter 6 when we take up the problem of null hypothesis significance testing.

Example 2.2: The Binomial Probability Model

First, consider the number of correct answers on a test of length n. Each item on the test represents a Bernoulli trial, with outcomes 0 = wrong and 1 = right. The natural probability model for data arising from n Bernoulli sequences is the binomial sampling model. Under the assumption of exchangeability – meaning the indexes 1 . . . n provide no relevant information – we can summarize the total number of successes by y. Letting θ be the proportion of correct responses in the population, the binomial sampling model can be written as

p(y | θ) = Bin(y | n, θ) = C(n, y) θ^y (1 − θ)^(n−y) = L(θ | n, y)    (2.27)
where C(n, y) is read as "n choose y" and refers to the number of ways that y successes can be obtained from a sequence of n "right/wrong" Bernoulli trials on an n-item test. The symbol Bin is shorthand for the binomial density function.
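As an illustration of the likelihood as a function of θ, the following R sketch plots L(θ | n, y) for hypothetical data of y = 7 correct answers on an n = 10 item test:

theta <- seq(0.01, 0.99, by = 0.01)
L <- dbinom(7, size = 10, prob = theta)  # Bin(y | n, theta) read as L(theta | n, y)
plot(theta, L, type = "l", xlab = "theta", ylab = "likelihood")
theta[which.max(L)]                      # peaks at y/n = 0.7

Consistent with the law of likelihood, values of θ near y/n = 0.7 are better supported by these data than values far from it.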
Example 2.3: The Normal Sampling Model

Next consider the likelihood function for the parameters of the simple normal distribution, which we write as

p(y | µ, σ²) = (1/√(2πσ²)) exp( −(y − µ)² / (2σ²) )    (2.28)

where µ is the population mean and σ² is the population variance. Under the assumption of independent observations, we can write Equation (2.28) as

p(y₁, y₂, . . . , yₙ | µ, σ²) = ∏ᵢ p(yᵢ | µ, σ²)
                             = (1/(2πσ²))^(n/2) exp( −Σᵢ (yᵢ − µ)² / (2σ²) )
                             = L(θ | y)    (2.29)
where θ = (µ, σ). Thus, under the assumption of independence, the likelihood of the model parameters given the data is simply the product of the individual probabilities of the data given the parameters.
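The same point can be made numerically for the normal model. In this small R sketch (data simulated for illustration, with σ fixed at its true value), the log-likelihood for µ is maximized at the sample mean:

set.seed(7)
y <- rnorm(50, mean = 3, sd = 2)
loglik <- function(mu) sum(dnorm(y, mean = mu, sd = 2, log = TRUE))
optimize(loglik, interval = c(-10, 10), maximum = TRUE)$maximum
mean(y)   # essentially the same value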
2.6 The Posterior Distribution
Continuing with our breakdown of Equation (2.2), notice that the posterior distribution of the model parameters is obtained by multiplying the probability distribution of the data given the parameters by the prior distribution of the parameters. As we will see, the posterior distribution can be quite complex with no simple closed form solution, and modern Bayesian statistics typically uses computational methods such as MCMC for drawing inferences about model parameters from the posterior distributions. We will discuss MCMC methods in Chapter 4. Here we provide two small examples of posterior distributions utilizing conjugate prior distributions. In Chapter 3, we will consider the prior and posterior for other important distributions commonly encountered in the social sciences.

Example 2.4: The Binomial Distribution with Beta Prior

Consider again the binomial distribution used to estimate probabilities for successes and failures, such as obtained from responses to multiple choice questions scored right/wrong. As an example of a conjugate prior, consider estimating the number of correct responses y on a test of length n. Let θ be the proportion of correct responses. We first assume that the responses are independent of one another. The binomial sampling model was given in Equation (2.27) and is reproduced here:

p(y | θ) = Bin(y | n, θ) = C(n, y) θ^y (1 − θ)^(n−y)    (2.30)

One choice of a conjugate prior distribution for θ is the beta(a, b) distribution. The beta distribution is a continuous distribution appropriate for variables that range from 0 to 1. The terms a and b are the shape and scale parameters of the beta distribution, respectively. The shape parameter, as the term implies, affects the shape of the distribution. The scale parameter affects the spread of the distribution, in the sense of shrinking or stretching the distribution. For this example, a and b
will serve as hyperparameters because the beta distribution is being used as a prior distribution for the binomial distribution. The form of the beta(a, b) distribution is

p(θ; a, b) = [Γ(a + b) / (Γ(a)Γ(b))] θ^(a−1) (1 − θ)^(b−1)    (2.31)
where Γ is the gamma function. Multiplying Equation (2.30) and Equation (2.31) and ignoring terms that don't involve model parameters, we obtain the posterior distribution

p(θ | y) = [Γ(n + a + b) / (Γ(y + a)Γ(n − y + b))] θ^(y+a−1) (1 − θ)^(n−y+b−1)    (2.32)
which itself is a beta distribution with parameters a′ = a + y and b′ = b + n − y. Thus, the beta prior for the binomial sampling model is a conjugate prior that yields a posterior distribution that is also in the family of the beta distribution.
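The conjugate update in Equation (2.32) can be visualized directly in R. The prior and data values below are illustrative only:

a <- 2; b <- 2; n <- 10; y <- 7
theta <- seq(0, 1, length.out = 200)
plot(theta, dbeta(theta, a + y, b + n - y), type = "l", ylab = "density")  # posterior
lines(theta, dbeta(theta, a, b), lty = 2)                                  # prior

The solid posterior curve is simply a beta density with the updated parameters a′ = a + y and b′ = b + n − y.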
Example 2.5: The Gaussian Distribution with Gaussian Prior: σ² Known

This next example explores the Gaussian prior distribution for the Gaussian sampling model in which the variance σ² is assumed to be known. Thus, the problem is one of estimating the mean µ. Let y denote a data vector of size n. We assume that y follows the Gaussian distribution shown in Equation (2.28) and reproduced here:

p(y | µ, σ²) = (1/√(2πσ²)) exp( −(y − µ)² / (2σ²) )    (2.33)

Consider that our prior distribution on the mean is Gaussian with mean and variance hyperparameters, κ and τ², respectively, which for this example are assumed known. The prior distribution can be written as

p(µ | κ, τ²) = (1/√(2πτ²)) exp( −(µ − κ)² / (2τ²) )    (2.34)

After some algebra, the posterior distribution can be obtained as

p(µ | y) ∼ N( (κ/τ² + nȳ/σ²) / (1/τ² + n/σ²),  τ²σ² / (σ² + nτ²) )    (2.35)

Thus, the posterior distribution of µ is Gaussian with mean

µ̂ = (κ/τ² + nȳ/σ²) / (1/τ² + n/σ²)    (2.36)

and variance

σ̂²µ = τ²σ² / (σ² + nτ²)    (2.37)

We see from Equations (2.35), (2.36), and (2.37) that the Gaussian prior is conjugate for the likelihood, yielding a Gaussian posterior.
2.7 The Bayesian Central Limit Theorem and Bayesian Shrinkage
When examining Equation (2.36) carefully, some interesting features emerge. For example, notice that as the sample size approaches infinity, there is no information in the prior distribution that is relevant to estimating the mean and variance of the posterior distribution. To see this, first we compute the asymptotic posterior mean by letting n go to infinity:

lim(n→∞) µ̂ = lim(n→∞) (κ/τ² + nȳ/σ²) / (1/τ² + n/σ²)
            = lim(n→∞) (κσ²/(nτ²) + ȳ) / (σ²/(nτ²) + 1)
            = ȳ    (2.38)
Thus, as the sample size increases to infinity, the expected a posteriori estimate µ̂ converges to the maximum likelihood estimate ȳ.

In terms of the variance, first let 1/τ² and n/σ² refer to the prior precision and data precision, respectively. The role of these two measures of precision can be seen by once again examining the variance term for the normal distribution in Equation (2.37). Specifically, letting n approach infinity, we obtain

lim(n→∞) σ̂²µ = lim(n→∞) 1 / (1/τ² + n/σ²) = lim(n→∞) σ² / (σ²/τ² + n) = σ²/n    (2.39)
which we recognize as the maximum likelihood estimator of the variance of the mean, the square root of which yields the standard error of the mean. A similar result emerges if we consider the case where we have very little information regarding the prior precision; that is, choosing a very large value for τ² gives the same result. The fact that letting n increase to infinity leads to the same result as we would obtain from the frequentist perspective has been referred to as the Bayesian central limit theorem. This is important from a philosophical perspective insofar as it suggests that Bayesian and frequentist results will agree in the limit.

The posterior distribution in Equation (2.35) reveals another interesting feature regarding the relationship between the maximum likelihood estimator of the mean, ȳ, and the prior mean, κ. Specifically, the posterior mean, µ̂, can be seen as a compromise between the prior mean, κ, and the observed data mean, ȳ. To see this clearly, notice that we can rewrite Equation (2.35) as

µ̂ = [σ² / (σ² + nτ²)] κ + [nτ² / (σ² + nτ²)] ȳ    (2.40)
Thus, the posterior mean is a weighted combination of the prior mean and observed data mean. These weights are bounded by 0 and 1 and together
are referred to as the shrinkage factor. The shrinkage factor represents the proportional distance that the posterior mean has shrunk back toward the prior mean, κ, and away from the maximum likelihood estimator, ȳ. Notice that if the sample size is large, the weight associated with κ will approach zero and the weight associated with ȳ will approach one; thus µ̂ will approach ȳ. Similarly, if the data variance, σ², is very large relative to the prior variance, τ², this suggests little precision in the data relative to the prior, and therefore the posterior mean will approach the prior mean, κ. Conversely, if the prior variance is very large relative to the data variance, this suggests greater precision in the data compared to the prior, and therefore the posterior mean will approach ȳ.
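The behavior of the shrinkage factor is easy to trace numerically. The following R sketch evaluates the weights in Equation (2.40) for a grid of sample sizes, using hypothetical values of κ, τ², σ², and ȳ:

kappa <- 8; tau2 <- 1; sigma2 <- 30^2; ybar <- 10
for (n in c(10, 100, 1000, 10000)) {
  w_prior <- sigma2 / (sigma2 + n * tau2)        # weight on kappa
  post_mean <- w_prior * kappa + (1 - w_prior) * ybar
  cat("n =", n, " weight on prior =", round(w_prior, 3),
      " posterior mean =", round(post_mean, 3), "\n")
}

As n grows, the weight on the prior mean collapses toward zero and the posterior mean migrates from κ toward ȳ, exactly as described above.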
Example 2.6: The Gaussian Distribution with Gaussian Prior: σ² Unknown

Perhaps a more realistic situation that arises in practice is when the mean and variance of the Gaussian distribution are unknown. In this case, we need to specify a full probability model for both the mean µ and variance σ². If we assume that µ and σ² are independent of one another, then we can factor the joint prior distribution of µ and σ² as

p(µ, σ²) = p(µ)p(σ²)    (2.41)

We now need to specify the prior distribution of σ². There are two approaches that we can take to specify the prior for σ². First, we can specify a uniform prior on µ and log(σ²), because when converting the uniform prior on log(σ²) to a density for σ², we obtain p(σ²) = 1/σ².¹ With uniform priors on both µ and σ², the joint prior distribution is p(µ, σ²) ∝ 1/σ². However, the problem with this first approach is that the uniform prior over the real line is an improper prior. Therefore, a second approach would be to provide proper informative priors, but with a choice of hyperparameters such that the resulting priors are quite diffused. First, again we assume as before that y ∼ N(µ, σ²) and that µ ∼ N(κ, τ²). As will be discussed in the next chapter, the variance parameter, σ², follows an inverse-gamma distribution with shape and scale parameters, a and b, respectively. Succinctly, σ² ∼ IG(a, b), and the probability density function for σ² can be written as

p(σ² | a, b) ∝ (σ²)^(−(a+1)) e^(−b/σ²)    (2.42)

Even though Equation (2.42) is a proper distribution for σ², we can see that as a and b approach 0, the proper prior approaches the non-informative prior 1/σ². Thus, very small values of a and b can suffice to provide a prior on σ² to be used to estimate the joint posterior distribution of µ and σ².

The final step in this example is to obtain the joint posterior distribution for µ and σ². Assuming that the joint prior distribution is 1/σ², the joint posterior distribution can be written as

p(µ, σ² | y) ∝ (1/σ²) ∏ᵢ₌₁ⁿ (1/√(2πσ²)) exp( −(yᵢ − µ)² / (2σ²) )    (2.43)
¹Following Lynch (2007), this result is obtained via change-of-variables calculus. Specifically, let k = log(σ²), and p(k) ∝ constant. The change of variables involves the Jacobian J = dk/dσ² = 1/σ². Therefore, p(σ²) ∝ constant × J.
Notice, however, that Equation (2.43) involves two parameters, µ and σ². The solution to this problem is discussed in Lynch (2007). First, the posterior distribution of µ obtained from Equation (2.43) can be written as

p(µ | y, σ²) ∝ exp( −(nµ² − 2nȳµ) / (2σ²) )    (2.44)

Solving for µ involves dividing by n and completing the square, yielding

p(µ | y, σ²) ∝ exp( −(µ − ȳ)² / (2σ²/n) )    (2.45)
There are several ways to determine the posterior distribution of σ². Perhaps the simplest is to recognize that the joint posterior distribution for µ and σ² can be factored into their respective conditional distributions:

p(µ, σ² | y) = p(µ | σ², y) p(σ² | y)    (2.46)

The first term on the right-hand side of Equation (2.46) was solved above assuming σ² is known. The second term on the right-hand side of Equation (2.46) is the marginal posterior distribution of σ². An exact expression for p(σ² | y) can be obtained by integrating the joint distribution over µ; that is,

p(σ² | y) = ∫ p(µ, σ² | y) dµ    (2.47)

Although this discussion shows that analytic expressions are possible for the solution of this simple case, in practice the advent of MCMC algorithms renders the solution to the joint posterior distribution of model parameters quite straightforward.
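For this simple model, both pieces of the factorization in Equation (2.46) are available in closed form, so the joint posterior can be sketched by direct Monte Carlo in R. We use the standard result that, under the 1/σ² prior, the marginal posterior of σ² is a scaled inverse-χ² distribution on n − 1 degrees of freedom (see, e.g., Gelman, Carlin, et al., 2014); the data here are simulated purely for illustration:

set.seed(123)
y <- rnorm(30, mean = 5, sd = 2)
n <- length(y); ybar <- mean(y); s2 <- var(y)
m <- 5000
sigma2 <- (n - 1) * s2 / rchisq(m, df = n - 1)      # draws from p(sigma^2 | y)
mu <- rnorm(m, mean = ybar, sd = sqrt(sigma2 / n))  # draws from p(mu | sigma^2, y)
c(mean(mu), mean(sigma2))                           # posterior means

Each pass draws σ² from its marginal posterior and then µ from its conditional posterior, which is precisely the factorization strategy that MCMC methods generalize.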
2.8 Summary
With reference to any parameter of interest, be it a mean, variance, regression coefficient, or a factor loading, Bayes’ theorem is composed of three parts: (1) the prior distribution representing our cumulative knowledge about the parameter of interest; (2) the likelihood representing the data in hand; and (3) the posterior distribution, representing our updated knowledge based on the moderation of the prior distribution by the likelihood. By carefully decomposing Bayes’ theorem into its constituent parts, we also can see its relationship to frequentist statistics, particularly through the Bayesian central limit theorem and the notion of shrinkage. In the next chapter we focus on the relationship between the likelihood and the prior. Specifically, we examine a variety of common data distributions used in social and behavioral science research and define their conjugate prior distributions.
3 Common Probability Distributions and Their Priors

In Chapter 2, we introduced the core concepts of Bayesian statistical inference. Recall that the difference between Bayesians and frequentists is that Bayesians hold that all unknowns, including parameters, are to be considered as random variables that can be described by probability distributions. In contrast, frequentists hold that parameters are also unknown, but that they are fixed quantities. The key idea of Bayesian statistics is that the probability distribution of the data is multiplied by the prior distribution of the parameters of the data distribution in order to form a posterior distribution of all unknown quantities. Deciding on the correct probability model for the data and the prior probability distribution for the parameters is at the heart of Bayesian statistical modeling. Because the focus of Bayesian statistical inference is the probability distribution, it is essential that we explore the major distributions that are used in the social sciences and examine associated prior probability distributions on the parameters.

The organization of this chapter is as follows. We will first explore the following data distributions:

1. The Gaussian distribution
   • Mean unknown/variance known
   • Mean known/variance unknown
2. The Poisson distribution
3. The binomial distribution
4. The multinomial distribution
5. The inverse-Wishart distribution
6. The LKJ distribution for correlation matrices

For each of these data distributions, we will discuss how they are commonly used in the social sciences and then describe their shape under different sets of parameter values. We will then describe how each of these distributions is typically incorporated in Bayesian statistical modeling by describing the conjugate prior distributions typically associated with the respective data distributions. We
will also describe the posterior distribution that is derived from applying Bayes' theorem to each of these distributions. Finally, we provide Jeffreys' prior for each of the univariate distributions. We will not devote space to describing more technical details of these distributions, such as the moment generating functions or characteristic functions. A succinct summary of the technical details of these and many more distributions can be found in Evans, Hastings, and Peacock (2000).¹
3.1 The Gaussian Distribution
The Gaussian (normal) distribution figures prominently in the field of statistics, owing perhaps mostly to its role in the central limit theorem. Assumptions of Gaussian distributed data abound in the social and behavioral sciences literature, and the Gaussian distribution is particularly useful in the context of measuring human traits. It is typically assumed for test score data such as the various literacy outcomes measured in large-scale assessments. We write the Gaussian distribution as

y ∼ N(µ, σ²)    (3.1)

where

E[y] = µ    (3.2)

V[y] = σ²    (3.3)

and E[·] and V[·] are the expectation and variance operators, respectively. The probability density function of the Gaussian distribution was given in Equation (2.28) and is reproduced here:

p(y | µ, σ²) = (1/√(2πσ²)) exp( −(y − µ)² / (2σ²) )    (3.4)

In what follows, we consider conjugate priors for two cases of the Gaussian distribution: the first where the mean of the distribution is unknown but the variance is known, and the second where the mean is known but the variance is unknown. The Gaussian distribution is often used as a conjugate prior for parameters that are assumed to be Gaussian in the population, such as regression coefficients.
3.1.1 Mean Unknown, Variance Known: The Gaussian Prior
In the case where the mean is unknown but the variance is known, the prior distribution of the mean is Gaussian, as we know from standard statistical theory. Thus, we can write the prior distribution for the mean as

p(µ | µ₀, σ₀²) ∝ (1/σ₀) exp( −(µ − µ₀)² / (2σ₀²) )    (3.5)

where µ₀ and σ₀² are hyperparameters.

¹To create the plots, artificial data were generated using a variety of R functions available in scatterplot3d (Ligges & Mächler, 2003), pscl (Jackman, 2012), MCMCpack (Martin, Quinn, & Park, 2011), and MVA (Everitt & Hothorn, 2012). The R code is available on the accompanying website.

Example 3.1: The Gaussian Prior

Figure 3.1 below illustrates the Gaussian distribution with unknown prior mean and known variance under varying conjugate priors. For each plot, the dark dashed line is the Gaussian likelihood, which remains the same for each plot. The light dotted line is the Gaussian prior distribution, which becomes increasingly diffused. The gray line is the resulting posterior distribution.
FIGURE 3.1. Gaussian distribution, mean unknown/variance known with varying conjugate priors. Note how the posterior distribution begins to align with the distribution of the data as the prior becomes increasingly flat.
These cases make quite clear the relationship between the prior distribution and the posterior distribution. Specifically, the smaller the variance of the prior distribution on the mean (upper left figure), the more closely the posterior matches the prior distribution. However, in the case of a very flat prior distribution (bottom right figure), the posterior distribution matches the data distribution.
3.1.2 The Uniform Distribution as a Non-Informative Prior
The uniform distribution is a common choice for a non-informative prior, in particular for applications to the Gaussian data distribution. Historically, Pierre-Simon Laplace (1774/1951) articulated the idea of assigning uniform probabilities in cases where prior information was lacking and referred to this as the principle of insufficient reason. Thus, if one lacks any prior information favoring one parameter value over another, then the uniform distribution is a reasonable choice as a prior. For a continuous random variable Y, we write the uniform distribution as

Y ∼ U(α, β)    (3.6)

where α and β are the lower and upper limits of the uniform distribution, respectively. The standard uniform distribution sets α = 0 and β = 1. Under the uniform distribution,

E[y] = (α + β) / 2    (3.7)

V[y] = (β − α)² / 12    (3.8)
Generally speaking, it is useful to incorporate the uniform distribution as a non-informative prior for a distribution that has bounded support, such as (−1, 1). As an example of the use of the uniform distribution as a prior, consider its role in forming the posterior distribution for a Gaussian likelihood. Again, this would be the case where a researcher lacks prior information regarding the distribution of the parameter of interest.

Example 3.2: Uniform Prior with Different Bounds

Figure 3.2 below shows how different bounds on the uniform prior result in different posterior distributions.
FIGURE 3.2. Influence of varying uniform priors on the Gaussian distribution.

We see from Figure 3.2 that the effect of the uniform prior on the posterior distribution is dependent on the bounds of the prior. For a prior with relatively narrow bounds (upper left of figure), this is akin to having a fair amount of information, and therefore the prior and posterior roughly match up. However, as in the case of Figure 3.1, if the uniform prior has very wide bounds, indicating virtually no prior information (lower right figure), the posterior distribution will match the data distribution.
3.1.3 Mean Known, Variance Unknown: The Inverse-Gamma Prior
When the mean of the Gaussian distribution is known but the variance is unknown, the goal is to determine the prior distribution on the variance parameter. The inverse-gamma (IG) distribution is often, but not exclusively, used as the conjugate prior for the variance parameter, denoted as

σ² ∼ IG(a, b)    (3.9)

where a (> 0) is the shape parameter and b (> 0) is the scale parameter. The probability density function for the inverse-gamma distribution is written as

p(σ²) ∝ (σ²)^(−(a+1)) e^(−b/σ²)    (3.10)
The expected value and variance of the inverse-gamma distribution are

E(σ²) = b / (a − 1),  for a > 1    (3.11)

and

V(σ²) = b² / [(a − 1)²(a − 2)],  for a > 2    (3.12)

respectively.

Example 3.3: The Inverse-Gamma Prior

Figure 3.3 below shows the posterior distribution of the variance for different inverse-gamma priors that differ only with respect to their shape.
FIGURE 3.3. Inverse-gamma prior for variance of Gaussian distribution. Note that the “peakedness” of the posterior distribution of the variance is dependent on the shape of the inverse-gamma prior.
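For completeness, the conjugate update implied by the inverse-gamma prior can be sketched in R. It is a standard result that if σ² ∼ IG(a, b) and the mean µ is known, then σ² | y ∼ IG(a + n/2, b + SS/2), where SS is the sum of squared deviations from µ; all values below are illustrative:

set.seed(11)
mu <- 0
y <- rnorm(40, mean = mu, sd = 3)
a <- 2; b <- 10
SS <- sum((y - mu)^2)
a_post <- a + length(y) / 2
b_post <- b + SS / 2
b_post / (a_post - 1)   # posterior mean of sigma^2, near the true value of 9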
3.1.4 Mean Known, Variance Unknown: The Half-Cauchy Prior
Another distribution that has been advocated as a prior for unknown variances of the Gaussian distribution, and in particular for variance terms in Bayesian hierarchical models, is the half-Cauchy prior, denoted throughout this book as C⁺ (Gelman, 2006). The C⁺ distribution is parameterized by a location x₀ and a scale β. As it is a C⁺ distribution that we use for variance terms, the location is set to 0, and large scale values confer a greater degree of non-informativeness.
Example 3.4: The C⁺ Prior

Figure 3.4 below shows the C⁺ distribution for various values of the scale parameter β.
FIGURE 3.4. C⁺ distribution for the variance of the Gaussian distribution.

We see, as expected, that as β → ∞, p(σ²) → U(0, ∞). We will be employing the C⁺ prior for variance parameters throughout many of the examples in this book.
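Because base R has no half-Cauchy density function, the C⁺ density can be obtained by folding the Cauchy density onto the positive line, that is, doubling dcauchy for x ≥ 0. The scale values below are illustrative:

x <- seq(0.01, 25, length.out = 500)
plot(x, 2 * dcauchy(x, location = 0, scale = 1), type = "l", ylab = "density")
lines(x, 2 * dcauchy(x, location = 0, scale = 5), lty = 2)
lines(x, 2 * dcauchy(x, location = 0, scale = 25), lty = 3)  # flatter as scale grows

The flattening of the curves as the scale increases is the sense in which large scale values confer greater non-informativeness.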
3.1.5 Jeffreys' Prior for the Gaussian Distribution
Consider the case of the Gaussian distribution with unknown mean and unknown variance, treated as independent. The information matrix for the Gaussian distribution with unknown mean µ and unknown variance σ² is defined as

I(µ, σ²) = [ 1/σ²   0
             0      1/(2σ⁴) ]    (3.13)

Then, following our discussion in Section 2.4.2, Jeffreys' prior is the square root of the determinant of the information matrix, viz.,

p(µ, σ²) ∝ |I(µ, σ²)|^(1/2) = [1/(2σ⁶)]^(1/2) ∝ 1/σ³    (3.14)

Often we see Jeffreys' prior for this case written as 1/σ². This stems from the transformation found in Equation (2.18). Namely, the prior 1/σ³ based on p(µ, σ²)
is the same as having the prior 1/σ² on p(µ, σ). To see this, note from Equation (2.18) that

p(µ, σ) = p(µ, σ²) det(J)    (3.15)
        ∝ (1/σ³) × 2σ ∝ 1/σ²    (3.16)

where

J = [ 1   0
      0   2σ ]    (3.17)

is a Jacobian transformation matrix.
3.2 The Poisson Distribution
Often in the social sciences, the outcome of interest might be the count of the number of events that occur within a specified period of time. For example, one may be interested in the number of colds a child catches within a year, modeled as a function of access to good nutrition or health care. In another instance, interest might be in the number of new teachers who drop out of the teaching profession after 5 years, modeled as a function of teacher training and demographic features of the school. In cases of this sort, a random variable k, representing the number of occurrences of the event, is assumed to follow a Poisson distribution with parameter θ representing the rate of occurrence of the outcome per unit time. The probability density function for the Poisson distribution is written as

p(k | θ) = e^(−θ) θ^k / k!,  k = 0, 1, 2, . . . ;  θ > 0    (3.18)

3.2.1 The Gamma Prior
The conjugate prior for the parameter θ of the Poisson distribution is the gamma density with shape parameter a and rate parameter b. In this context, the gamma density is written as

G(θ) ∝ θ^(a−1) e^(−bθ)    (3.19)

The posterior density is formed by multiplying Equations (3.18) and (3.19), yielding

p(θ | k, a, b) ∝ θ^(k+a−1) e^(−(b+1)θ)    (3.20)

which we recognize as a gamma density with shape k + a and rate b + 1.
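The gamma-Poisson update in Equation (3.20) can be drawn directly with base R density functions; the prior values and observed count below are illustrative:

a <- 2; b <- 1; k <- 4
theta <- seq(0.01, 12, length.out = 400)
plot(theta, dgamma(theta, shape = k + a, rate = b + 1), type = "l", ylab = "density")
lines(theta, dgamma(theta, shape = a, rate = b), lty = 2)  # prior

The solid curve is the G(k + a, b + 1) posterior; relative to the dashed prior, it has been pulled toward the observed count.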
Example 3.5: Poisson Distribution with Varying Gamma-Density Priors

Figure 3.5 below shows the posterior distribution under the Poisson likelihood with varying gamma-density priors.
FIGURE 3.5. Poisson distribution with varying gamma-density priors.

Here again we see the manner in which the data distribution moderates the influence of the prior distribution to obtain a posterior distribution that balances the data in hand with the prior information we can bring regarding the parameters of interest. The upper left of Figure 3.5 shows this perhaps most clearly, with the posterior distribution balanced between the prior distribution and the data distribution. And again, in the case of a relatively non-informative gamma distribution, the posterior distribution matches up to the likelihood (lower right of Figure 3.5).
3.2.2 Jeffreys' Prior for the Poisson Distribution
To obtain Jeffreys' prior for the Poisson rate parameter, note that

$$\frac{\partial^2}{\partial \theta^2} \log p(k \mid \theta) = -\frac{k}{\theta^2} \qquad (3.21)$$

Thus the information matrix can be written as

$$I(\theta) = -E\left[\frac{\partial^2}{\partial \theta^2} \log p(k \mid \theta)\right] \qquad (3.22)$$
$$= \frac{\theta}{\theta^2} \qquad (3.23)$$
$$= \frac{1}{\theta} \qquad (3.24)$$
Jeffreys’ prior is then
p j (θ) = θ−1/2
(3.25)
Note that Jeffreys’ prior for the Poisson case is improper in that it cannot be integrated over [0, ∞). The solution to this problem is to note that the Jeffreys’ prior can be approximated by a G(.5,0). This in turn is not a defined distribution, thus we can use a G(.5,.00001) to approximate the Jeffreys’ prior for this case.
3.3 The Binomial Distribution
We encountered the binomial distribution in Chapter 2. To reiterate, the binomial distribution is used to estimate probabilities for successes and failures, where any given event follows a Bernoulli distribution, such as right/wrong responses to a test, or agree/disagree responses to a survey item. The probability density function for the binomial distribution can be written as

$$p(y \mid \theta) = \binom{n}{y} \theta^y (1-\theta)^{n-y} \qquad (3.26)$$

where θ ∈ [0, 1] is the success proportion, y ∈ {0, 1, . . . , n} is the number of successes, and n is the number of trials. Furthermore, E(y) = nθ and V(y) = nθ(1 − θ).
3.3.1 The Beta Prior
Perhaps one of the most important distributions in statistics, and one that is commonly encountered as a prior distribution in Bayesian statistics, is the beta distribution, denoted as B. The probability density function of the beta distribution with respect to the success proportion parameter θ can be written as

$$B(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta^{\alpha - 1} (1-\theta)^{\beta - 1} \qquad (3.27)$$

where α (> 0) and β (> 0) are shape parameters and Γ is the gamma function. The mean and variance of the beta distribution can be written as

$$E(\theta \mid \alpha, \beta) = \frac{\alpha}{\alpha + \beta} \qquad (3.28)$$
$$V(\theta \mid \alpha, \beta) = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)} \qquad (3.29)$$

Note also that the U(0, 1) distribution is equivalent to the B(1, 1) distribution. The beta distribution is typically used as the prior distribution when data are assumed to be generated from the binomial distribution, such as the example in Section 2.4, because the binomial parameter θ is continuous and ranges between zero and one.
Example 3.6: The Binomial Likelihood with Varying Beta Priors Figure 3.6 below shows the posterior distribution under the binomial likelihood with varying beta priors.
FIGURE 3.6. Binomial distribution with varying beta priors. We see that the role of the beta prior on the posterior distribution is quite similar to the role of the Gaussian prior on the posterior distribution in Figure 3.1. Note that the B(1, 1) prior distribution in the lower left-hand corner of Figure 3.6 is equivalent to the U(0, 1) distribution. The lower right-hand display is Jeffreys' prior for the binomial distribution, which is discussed next.
3.3.2 Jeffreys' Prior for the Binomial Distribution
Recall that the uniform distribution is not invariant to transformations, and that invariance can be achieved by deriving Jeffreys' prior for the binomial likelihood. From Equation (3.26) we have

$$\log p(y \mid \theta) \propto y \log\theta + (n - y)\log(1 - \theta) \qquad (3.30)$$

Taking the partial derivative of Equation (3.30) twice, we obtain

$$\frac{\partial^2 \log p(y \mid \theta)}{\partial \theta^2} = -\frac{y}{\theta^2} - \frac{n - y}{(1 - \theta)^2} \qquad (3.31)$$
The Fisher information matrix is defined as

$$I(\theta) = -E\left[\frac{\partial^2 \log p(y \mid \theta)}{\partial \theta^2}\right] \qquad (3.32a)$$
$$= \frac{n\theta}{\theta^2} + \frac{n - n\theta}{(1 - \theta)^2} \qquad (3.32b)$$
$$= \frac{n}{\theta(1 - \theta)} \qquad (3.32c)$$
$$\propto \theta^{-1}(1 - \theta)^{-1} \qquad (3.32d)$$

Jeffreys' prior is then the square root of Equation (3.32d). That is,

$$p_J(\theta) = \theta^{-1/2}(1 - \theta)^{-1/2} \qquad (3.33)$$
which is equivalent to the Beta(1/2, 1/2) distribution. The lower right-hand corner of Figure 3.6 shows Jeffreys' prior for the binomial distribution and its impact on the posterior distribution. The unusual shape of Jeffreys' prior has a very interesting interpretation. Note that, in general, the data have the least impact on the posterior distribution when the true value of θ is 0.5 and a greater impact on the posterior when θ is closer to 0 or 1. Jeffreys' prior compensates for this fact by placing more mass on the extremes of the distribution, thus balancing out the impact of the data on the posterior distribution.
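Because the beta prior is conjugate to the binomial, the posterior is available in closed form: with y successes in n trials and a B(a, b) prior, the posterior is B(y + a, n − y + b). The following R sketch (with made-up data, not the book's) compares the uniform, Jeffreys', and an informative prior.

n <- 20; y <- 14                       # hypothetical data: 14 successes in 20 trials
theta <- seq(0, 1, length.out = 500)

priors <- list(uniform = c(1, 1), jeffreys = c(0.5, 0.5), informative = c(10, 10))
for (nm in names(priors)) {
  a <- priors[[nm]][1]; b <- priors[[nm]][2]
  post <- dbeta(theta, y + a, n - y + b)          # posterior density over the grid
  cat(nm, ": posterior mean =", (y + a) / (n + a + b), "\n")
}

The informative B(10, 10) prior pulls the posterior mean toward 0.5, while the uniform and Jeffreys' priors leave it close to the sample proportion, mirroring Figure 3.6.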
3.4 The Multinomial Distribution
Another very important distribution used in the social sciences is the multinomial distribution. As an example, the multinomial distribution can be used to characterize responses to items on questionnaires where there are more than two alternatives. The multinomial distribution can also be used in models for categorical latent variables, such as latent class analysis (Clogg, 1995), where individuals are assigned to one and only one of, say, C latent classes. An example of the multinomial distribution for latent classes will be discussed in Chapter 8, where we examine the classification of reading skills in young children. A special case of the multinomial distribution is the binomial distribution discussed earlier. The probability density function for the multinomial distribution is written as

$$p(X_1 = x_1, \ldots, X_C = x_C) = \begin{cases} \dfrac{n!}{x_1! \cdots x_C!}\, \pi_1^{x_1} \cdots \pi_C^{x_C}, & \text{when } \sum_{c=1}^{C} x_c = n \\ 0, & \text{otherwise} \end{cases} \qquad (3.34)$$

where n is the sample size and π1, . . . , πC are parameters representing category proportions. The mean and variance of the multinomial distribution are written as

$$E(x_c) = n\pi_c \qquad (3.35)$$
$$V(x_c) = n\pi_c(1 - \pi_c) \qquad (3.36)$$
and the covariance between any two categories c and d can be written as

$$\mathrm{cov}(x_c, x_d) = -n\pi_c\pi_d \qquad (3.37)$$

3.4.1 The Dirichlet Prior
The conjugate prior distribution for the parameters of the multinomial distribution, πc, follows a Dirichlet distribution. The Dirichlet distribution is the multivariate generalization of the beta distribution. The probability density function can be written as

$$f(\pi_1, \ldots, \pi_{C-1}; a_1, \ldots, a_C) = \frac{1}{B(a)} \prod_{c=1}^{C} \pi_c^{a_c - 1} \qquad (3.38)$$

where

$$B(a) = \frac{\prod_{c=1}^{C} \Gamma(a_c)}{\Gamma\!\left(\sum_{c=1}^{C} a_c\right)} \qquad (3.39)$$

is the multinomial beta function expressed in terms of the gamma function.
3.4.2 Jeffreys' Prior for the Multinomial Distribution
As the multinomial distribution is an extension of the binomial distribution, we find that Jeffreys' prior for the multinomial distribution is Dirichlet with αc = 1/2, (c = 1, . . . , C). Example 3.7: Multinomial Likelihood with Varying Precision on the Dirichlet Prior Figure 3.7 below shows the multinomial likelihood and posterior distributions with varying degrees of precision on the Dirichlet prior.
FIGURE 3.7. Multinomial distribution with varying Dirichlet priors. As in the other cases, we find that a highly informative Dirichlet prior (top row) yields a posterior distribution that is relatively precise, with a shape that is similar to that of the prior. For a relatively diffuse Dirichlet prior (bottom row), the posterior more closely resembles the likelihood.
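The Dirichlet-multinomial update is also available in closed form: with counts x and a Dirichlet(a) prior, the posterior is Dirichlet(a + x). A minimal R sketch (with invented counts and hyperparameters) makes the precision effect concrete.

x <- c(30, 50, 20)            # hypothetical multinomial counts, C = 3
a_diffuse     <- c(1, 1, 1)   # relatively non-informative prior
a_informative <- c(50, 50, 50)

post_diffuse     <- a_diffuse + x       # posterior Dirichlet parameters
post_informative <- a_informative + x

# Posterior means E(pi_c) = a_c / sum(a)
post_diffuse / sum(post_diffuse)         # close to the sample proportions
post_informative / sum(post_informative) # pulled toward the uniform prior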
3.5 The Inverse-Wishart Distribution
Many applied problems in the social sciences focus on multivariate outcomes, and a large array of multivariate statistical models are available to address these problems. For this book we will focus on factor analysis, which can be considered a multivariate model for covariance matrices and which will be demonstrated in Chapter 8. Denote by y a P-dimensional vector of observed responses for n individuals. For example, these outcomes could be the answers of n students to a P-item survey (p = 1, . . . , P) on their school environment. For simplicity, assume that y is generated from a multivariate Gaussian distribution with a P-dimensional mean vector μ and a P × P covariance matrix Σ. The conjugate prior for Σ is the inverse-Wishart distribution (IW), denoted as

$$\Sigma \sim IW(\Psi, \nu) \qquad (3.40)$$

where Ψ is a scale matrix that is used to situate the IW distribution in the parameter space and ν > P − 1 are the degrees of freedom (df) used to control the degree of uncertainty about Ψ. The expected value and variance of the IW distribution are

$$E(\Sigma) = \frac{\Psi}{\nu - P - 1} \qquad (3.41)$$

and

$$V(\sigma_{pp}) = \frac{2\psi_{pp}^2}{(\nu - P - 1)^2(\nu - P - 3)} \qquad (3.42)$$
From Equation (3.41) and Equation (3.42) we see that the informativeness of the IW prior depends on the scale matrix Ψ and the degrees of freedom ν; that is, the IW prior becomes more informative as either the elements in Ψ become smaller or ν becomes larger. Finding the balance between these elements is tricky, and so common practice is to set Ψ = I and vary the values of ν.
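A small simulation sketch can make this tangible; it assumes the MCMCpack package, whose riwish(v, S) draws one matrix from an inverse-Wishart with df v and scale matrix S. The specific df values are arbitrary illustrations.

library(MCMCpack)
set.seed(123)
P <- 2
Psi <- diag(P)                                        # common practice: scale matrix = I

draws_loose <- replicate(1000, riwish(P + 2,  Psi))   # small df: diffuse prior
draws_tight <- replicate(1000, riwish(P + 50, Psi))   # large df: concentrated prior

# Spread of the (1,1) variance element under each prior
sd(draws_loose[1, 1, ])   # large: the prior says little about the variance
sd(draws_tight[1, 1, ])   # small: the prior is quite informative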
3.6 The LKJ Prior for Correlation Matrices
Finally, in many cases interest centers on inferences for models involving correlation matrices. An example would be the estimation of correlations of random effects in multilevel models. Indeed, factor analysis can also be considered a multivariate model for correlation matrices. Because correlations are model parameters, prior distributions are required. One approach could be to use the IW distribution for the covariance matrix and then standardize within the algorithm to provide correlations. However, practical experience has shown that the IW distribution is often difficult to work with, particularly in the context of modern Bayesian computation tools such as Stan, which will be discussed in Chapter 4. Addressing this problem requires recognizing that the covariance matrix Σ can be decomposed into a so-called quadratic form as

$$\Sigma = \sigma R \sigma \qquad (3.43)$$

where σ is a diagonal matrix of standard deviations and R is a correlation matrix. Based on work concerning the generation of random correlation matrices, Lewandowski et al. (2009) developed the so-called LKJ distribution, which can easily be applied to correlation matrices. A parameter η controls the degree of informativeness across the correlations. Specifically, η = 1 specifies a uniform distribution over valid correlations; that is, the correlations are all equally likely. Values of η greater than 1 place less prior mass on extreme correlations. Example 3.8: The LKJ Prior Figure 3.8 below shows the probability density plots for the LKJ distribution with different values of η. Notice that higher values of η place less mass on extreme correlations.
FIGURE 3.8. LKJ prior distribution with different degrees of informativeness.
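To make the use of the LKJ prior concrete, here is a minimal Stan sketch (written as a model string for rstan, as elsewhere in this book); it samples a correlation matrix from its LKJ prior alone, and the choice η = 2 is an illustrative assumption, not a recommendation.

lkj_model <- "
data {
  int<lower=2> P;
}
parameters {
  corr_matrix[P] R;
}
model {
  R ~ lkj_corr(2);  // eta = 2: mild shrinkage away from extreme correlations
}
"
# fit <- rstan::stan(model_code = lkj_model, data = list(P = 3))

Setting the shape to 1 instead would make all valid correlation matrices equally likely, as described above.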
3.7 Summary
This chapter presented the most common distributions encountered in the social sciences along with their conjugate priors and associated Jeffreys' priors. We also discussed the LKJ prior for correlation matrices, which is quite useful as an alternative when inverse-Wishart priors for covariance matrices prove difficult to work with. The manner in which the prior and the data distributions balance each other to result in the posterior distribution is the key point of this chapter. When priors are very precise, the posterior distribution will have a shape closer to that of the prior. When the prior distribution is non-informative, the posterior distribution will adopt the shape of the data distribution. This finding can be deduced from an inspection of the shrinkage factor given in Equation (2.40). In the next chapter we focus our attention on the computational machinery for summarizing the posterior distribution.
4 Obtaining and Summarizing the Posterior Distribution

As stated in the Introduction, the key reason for the increased popularity of Bayesian methods in the social sciences has been the (re)discovery of numerical algorithms for estimating posterior distributions of the model parameters given the data. Prior to these developments, it was virtually impossible to derive summary measures of the posterior distribution, particularly for complex models with many parameters. The numerical algorithms that we will describe in this chapter involve Monte Carlo integration using Markov chains – also referred to as Markov chain Monte Carlo (MCMC) sampling. These algorithms have a rather long history, arising out of statistical physics and image analysis (Geman & Geman, 1984; Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953). For a nice introduction to the history of MCMC, see Robert and Casella (2011). For the purposes of this chapter, we will consider three of the most common algorithms that are available in both open-source and commercially available software: the random walk Metropolis-Hastings algorithm, the Gibbs sampler, and Hamiltonian Monte Carlo. First, however, we will introduce some of the general features of MCMC. Then, we will turn to a discussion of the individual algorithms, and finally we will discuss the criteria used to evaluate the quality of an MCMC sample. A full example using the Hamiltonian Monte Carlo algorithm will be provided, which also introduces the Stan software package.
4.1 Basic Ideas of Markov Chain Monte Carlo Sampling
Within the frequentist school of statistics, a number of popular estimation approaches are available to obtain point estimates and standard errors of model parameters. Perhaps the most common approaches to parameter estimation are ordinary least squares and maximum likelihood. The focus of frequentist parameter estimation is the derivation of point estimates of model parameters that have desirable asymptotic properties such as unbiasedness and efficiency (see, e.g., Silvey, 1975).
In contrast to maximum likelihood estimation and other estimation methods within the frequentist paradigm, Bayesian inference focuses on calculating expectations of the posterior distribution, such as the mean and standard deviation. For very simple problems, this can be handled analytically. However, for complex, high-dimensional problems involving multiple integrals, the task of analytically obtaining expectations can be virtually impossible. So, rather than attempting to analytically solve these high-dimensional problems, we can instead use well-established computational methods to draw samples from a target distribution of interest (in our case, the posterior distribution) and summarize the distribution formed by those samples. This is referred to as Monte Carlo integration. Monte Carlo integration is based on first drawing S samples of the parameters of interest {θs, s = 1, . . . , S} from the posterior distribution p(θ | y) and approximating the expectation by

$$E[p(\theta \mid y)] \approx \frac{1}{S} \sum_{s=1}^{S} p(\theta_s \mid y) \qquad (4.1)$$
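As a quick illustration of Equation (4.1), the following R sketch (not from the book) uses independent Gaussian draws as a stand-in for posterior samples and shows the sample average stabilizing as S grows.

set.seed(42)
for (S in c(100, 10000)) {
  draws <- rnorm(S, mean = 2, sd = 1)    # stand-in for posterior draws of theta
  cat("S =", S, " Monte Carlo estimate of E(theta):", mean(draws), "\n")
}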
Assuming the samples are independent of one another, the law of large numbers ensures that the approximation in Equation (4.1) will be increasingly accurate as S increases. Indeed, under independent samples, this process describes ordinary Monte Carlo sampling. However, an important feature of Monte Carlo integration, and of particular relevance to Bayesian inference, is that the samples do not have to be drawn independently. All that is required is that the sequence θs, (s = 1, . . . , S) yields samples that have explored the support of the distribution (Gilks, Richardson, & Spiegelhalter, 1996a).1 One approach to sampling throughout the support of a distribution while also relaxing the assumption of independent sampling is through the use of a Markov chain. Formally, a Markov chain is a sequence of dependent random variables {θs}

$$\theta_0, \theta_1, \ldots, \theta_s, \ldots \qquad (4.2)$$
such that the conditional probability of θs given all of the past variables depends only on θs−1, that is, only on the immediate past variable. This conditional probability for the continuous case is referred to as the transition kernel of the Markov chain. For discrete random variables, this is referred to as the transition matrix. The Markov chain has a number of very important properties, not the least of which is that over a long sequence, the chain will forget its initial state θ0 and converge to its stationary distribution p(θ | y), which does not depend either on the number of samples S or on the initial state θ0. The number of iterations prior to the stability of the distribution is referred to as the warmup samples. Letting m represent the initial number of warmup samples, we can obtain an ergodic average of the posterior distribution p(θ | y) as

$$p(\theta \mid y) = \frac{1}{S - m} \sum_{s=m+1}^{S} p(\theta_s \mid y) \qquad (4.3)$$

1 The support of a distribution is the smallest closed interval (or set in the multivariate case) where the elements of the interval/set are members of the distribution. Outside the support of the distribution, the probability of the element is zero. Technically, MCMC algorithms explore the typical set of a probability distribution. The concept of the typical set will be taken up when discussing Hamiltonian Monte Carlo.
The idea of conducting Monte Carlo sampling through the construction of Markov chains defines MCMC. The question that we need to address is how to construct the Markov chain, that is, how to move from one parameter value to the next. Three popular algorithms have been developed for this purpose, and we take them up next.
4.2 The Random Walk Metropolis-Hastings Algorithm
One of the earliest, yet still common, methods for constructing a Markov chain is referred to as the random walk Metropolis-Hastings algorithm, or M-H algorithm for short (Metropolis et al., 1953). The basic steps of the M-H algorithm are as follows. First, a starting value θ0 is obtained to begin forming a sequence of draws θ0, . . . , θs−1. The next element in the chain, θs, is obtained by first drawing a proposal value θ∗ from a so-called proposal distribution (also referred to as a jumping distribution), which we will denote as q(θ∗ | θs−1).2 This proposal distribution could be, for example, a Gaussian distribution with mean zero and some variance. Second, the algorithm accepts the proposal value with an acceptance probability

$$p(\theta^* \mid \theta^{s-1}) = \min\left\{1, \frac{p(\theta^*)\, q(\theta^{s-1} \mid \theta^*)}{p(\theta^{s-1})\, q(\theta^* \mid \theta^{s-1})}\right\} \qquad (4.4)$$

where the ratio inside the brackets of Equation (4.4) is referred to as the Metropolis-Hastings ratio. Equation (4.4) can be simplified by noting that the random walk M-H algorithm uses symmetric proposal distributions, implying that q(θ∗ | θs−1) = q(θs−1 | θ∗), and therefore the ratio of these distributions is 1.0. Thus, Equation (4.4) can be written as

$$p(\theta^* \mid \theta^{s-1}) = \min\left\{1, \frac{p(\theta^*)}{p(\theta^{s-1})}\right\} \qquad (4.5)$$

Notice that the numerator of the M-H ratio in Equation (4.5) is the probability of the proposal value and the denominator is the probability of the current value. In essence, Equation (4.5) states that if the ratio p(θ∗)/p(θs−1) > 1.0, then the probability of acceptance of the proposal value is 1.0, that is, the algorithm will accept the proposal value with certainty. However, if the ratio is less than 1.0, then the algorithm can either move to the next value or stay at the current value. To decide this, the third step of the algorithm draws a random value from a U(0, 1) distribution. If the sampled value is between 0 and p(θ∗ | θs−1), then
2 For ease of notation, we are suppressing conditioning on the data y.
the algorithm moves to the next state, i.e., θs = θ∗ . Otherwise, if it is rejected, then the algorithm does not move, and θs = θs−1 . A remarkable feature of the M-H algorithm is that regardless of the proposal distribution, the stationary distribution of the algorithm will be p(·), which for our purposes is the posterior distribution. The technical details of this fact can be found in Gilks et al. (1996a).
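To make these steps concrete, here is a minimal random-walk Metropolis sketch in R (not from the book); a standard Gaussian density stands in for the posterior, and the proposal standard deviation of 0.5 and starting value are arbitrary assumptions. Log densities are used for numerical stability.

set.seed(1)
log_post <- function(theta) dnorm(theta, 0, 1, log = TRUE)  # stand-in target

S <- 5000
theta <- numeric(S)
theta[1] <- -3                                        # deliberately poor starting value
for (s in 2:S) {
  proposal  <- rnorm(1, mean = theta[s - 1], sd = 0.5)  # symmetric proposal
  log_ratio <- log_post(proposal) - log_post(theta[s - 1])
  if (log(runif(1)) < log_ratio) {
    theta[s] <- proposal              # accept the proposal
  } else {
    theta[s] <- theta[s - 1]          # reject: stay at the current value
  }
}
mean(theta[1001:S])                   # discard warmup draws, then summarize

Note that the min{1, ·} in Equation (4.5) is handled implicitly: when the ratio exceeds 1, log(runif(1)) is always below log_ratio and the proposal is accepted with certainty.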
4.3 The Gibbs Sampler
Another popular algorithm for MCMC is the Gibbs sampler. Consider that the goal is to obtain the joint posterior distribution of two model parameters, say, θ1 and θ2, given some data y, written as p(θ1, θ2 | y). These two model parameters can be, for example, two regression coefficients from a multiple regression model. Dropping the conditioning on y for notational simplicity, what is required is to sample from p(θ1 | θ2) and p(θ2 | θ1). In the first step, an arbitrary value for θ2 is chosen, say $\theta_2^0$. We next obtain a sample from $p(\theta_1 \mid \theta_2^0)$. Denote this value as $\theta_1^1$. With this new value, we then obtain a sample $\theta_2^1$ from $p(\theta_2 \mid \theta_1^1)$. The Gibbs algorithm continues to draw samples using previously obtained values until two long chains of values for both θ1 and θ2 are formed. After discarding the warmup samples, the remaining samples are then considered to be drawn from the marginal distributions p(θ1) and p(θ2). The formal algorithm can be outlined as follows. For notational clarity, let θ be a P-dimensional vector of model parameters with elements θ = {θ1, . . . , θP}. Note that information regarding θ is contained in the prior distribution p(θ). The Gibbs sampler begins with an initial set of starting values for the parameters, denoted as $\theta^0 = (\theta_1^0, \ldots, \theta_P^0)$. Given this starting point, the Gibbs sampler generates θs from θs−1 as follows:

1. Sample $\theta_1^s \sim p(\theta_1 \mid \theta_2^{s-1}, \theta_3^{s-1}, \ldots, \theta_P^{s-1})$
2. Sample $\theta_2^s \sim p(\theta_2 \mid \theta_1^s, \theta_3^{s-1}, \ldots, \theta_P^{s-1})$
⋮
P. Sample $\theta_P^s \sim p(\theta_P \mid \theta_1^s, \theta_2^s, \ldots, \theta_{P-1}^s)$
So, for example, in Step 1, a value for the first parameter θ1 at iteration s = 1 is drawn from the conditional distribution of θ1 given other parameters with start values at iteration 0 and the data y. At Step 2, the algorithm draws a value for the second parameter θ2 at iteration s = 1 from the conditional distribution of θ2 given the value of θ1 drawn in Step 1, the remaining parameters with start values
at iteration zero, and the data. This process continues until ultimately a sequence of dependent vectors is formed:

$$\begin{aligned} \theta^1 &= \{\theta_1^1, \ldots, \theta_P^1\} \\ \theta^2 &= \{\theta_1^2, \ldots, \theta_P^2\} \\ &\;\;\vdots \\ \theta^S &= \{\theta_1^S, \ldots, \theta_P^S\} \end{aligned}$$

This sequence exhibits the so-called Markov property insofar as θs is conditionally independent of {θ0, . . . , θs−2} given θs−1. Under some general conditions, the sampling distribution resulting from this sequence will converge to the target distribution as s → ∞.
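A minimal worked sketch (not the book's code) is a Gibbs sampler for a standard bivariate Gaussian with correlation ρ, whose full conditionals are known: θ1 | θ2 ∼ N(ρθ2, 1 − ρ²), and symmetrically for θ2.

set.seed(1)
S <- 5000; rho <- 0.8
theta1 <- theta2 <- numeric(S)
theta2[1] <- 4                                     # arbitrary starting value
for (s in 2:S) {
  theta1[s] <- rnorm(1, rho * theta2[s - 1], sqrt(1 - rho^2))  # draw from p(theta1 | theta2)
  theta2[s] <- rnorm(1, rho * theta1[s],     sqrt(1 - rho^2))  # draw from p(theta2 | theta1)
}
cor(theta1[1001:S], theta2[1001:S])                # should be close to rho after warmup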
4.4 Hamiltonian Monte Carlo
The development of the Metropolis-Hastings and Gibbs sampling algorithms and their implementation in Bayesian software programs have made it possible to bring Bayesian statistics into mainstream practice. However, these two algorithms suffer from a severe practical limitation: namely, as the number of parameters increases, the number of directions that the algorithm can search increases exponentially while the M-H acceptance probability in Equation (4.5) decreases. Thus, these two algorithms can take an unacceptably long time to converge to the posterior distribution, resulting in a highly inefficient use of computer resources (Hoffman & Gelman, 2014). An approach for addressing this problem has emerged from the development of Hamiltonian Monte Carlo (HMC). The mathematics behind HMC arises from the field of Hamiltonian dynamics, which was designed to address problems in quantum chromodynamics in the context of the orbital dynamics of fundamental particles. Hamiltonian Monte Carlo underlies the Stan programming environment, which we will be using for our examples throughout the book. In what follows, we draw on excellent intuitive introductions to HMC by Betancourt (2018b, 2019). The problem associated with the inefficient use of computer resources when implementing M-H or Gibbs algorithms is a result of the geometry of probability distributions when the number of parameters increases. In particular, although the density of a distribution is largest in the neighborhood near the mode, the volume of that neighborhood decreases as the number of parameters increases and thus has an inconsequential impact on the calculation of expectations. At the same time, as the number of parameters increases, the region far away from the mode has greater volume but much smaller density, and thus also contributes negligibly to the calculation of expectations. The neighborhood between these extremes is called the typical set, which is a subspace of the support of the distribution. This "Goldilocks zone" represents a region where the volume and density are just right, and where the mass is sufficient to produce reasonable expectations. Again, outside of the typical set, the contribution to the calculation of expectations is inconsequential and thus a waste of computing resources (Betancourt, 2018b).
The difficulty with the M-H and Gibbs algorithms is that although they will eventually explore the typical set of a distribution, they might do so so slowly that computer resources will be exhausted. This problem is due to the random walk nature of these algorithms. For example, in the ideal situation with a small number of parameters, the proposal distribution of the M-H algorithm (usually a Gaussian proposal distribution) will be biased toward the tails of the distribution where the volume is high, while the algorithm will reject proposal values if the density is small. This then pushes the M-H algorithm toward the typical set, as desired. However, as the number of parameters increases, the volume outside the typical set will dominate the volume inside the typical set, and thus the Markov chain will mostly end up outside the typical set, yielding proposals with low probabilities and hence more rejections by the algorithm. This results in the Markov chain getting stuck outside the typical set and thus moving very slowly, as is often observed when employing M-H in practice. The same problem holds for the Gibbs sampler as well. The solution to the problem of the Markov chain getting stuck outside the typical set is to come up with an approach that is capable of making large jumps across regions of the typical set, such that the typical set is fully explored without the algorithm jumping outside it. This is the goal of HMC. Specifically, HMC exploits the geometry of the typical set and constructs transitions that "...glide across the typical set towards new, unexplored neighborhoods" (Betancourt, 2018b, p. 18). To accomplish this controlled sojourn across the typical set, HMC exploits the correspondence between probabilistic systems and physical systems. As discussed in Betancourt (2018b), the physical analogy is one of placing a satellite in a stable orbit around Earth. A balance must be struck between the momentum of the satellite and the gravity of Earth. Too much momentum and the satellite will fly off into space. Too little, and the satellite will crash into Earth. Thus, the key to gliding across the typical set is to carefully add an auxiliary momentum parameter to the probabilistic system. This momentum parameter is essentially a first-order gradient calculated from the log-posterior distribution.
4.4.1 No-U-Turn Sampler (NUTS)
Hamiltonian Monte Carlo yields a much more efficient exploration of the posterior distribution compared to random-walk M-H and Gibbs. However, HMC requires user-specified parameters that can still result in a degree of computational inefficiency. These parameters are referred to as the step size ϵ and the number of so-called leapfrog steps L. If ϵ is too large, then the acceptance rates will be too low. On the other hand, if ϵ is too small, then computation time is being wasted because the algorithm is taking unnecessarily small steps. With regard to the leapfrog steps, if L is too small, then the draws will be too close to each other, resulting in random walk behavior and slow mixing of the chains. If L is too large, then computational resources will be wasted because the algorithm will loop back and repeat its steps (Hoffman & Gelman, 2014). Although ϵ can be adjusted “on the fly” through the use of adaptive MCMC, deciding on the appropriate value of L is more difficult, and a poor choice of either parameter can lead to serious computational inefficiency. To solve these problems, the No-U-Turn Sampler (NUTS)
algorithm was developed by Hoffman and Gelman (2014), which is designed to mimic the dynamics of HMC, while not requiring the user to specify ϵ or L. The NUTS algorithm is implemented in Stan (Stan Development Team, 2021a).
4.5 Convergence Diagnostics
Given the computational intensity of MCMC, it is absolutely essential for Bayesian inference that the convergence of the MCMC algorithm be assessed. The importance of assessing convergence stems from the very nature of MCMC in that it is designed to converge in distribution rather than to a point estimate. Because there is not a single adequate assessment of convergence, it is important to inspect a variety of diagnostics that examine varying aspects of convergence. We will consider five methods of assessing convergence that are used across M-H, Gibbs, and HMC, and then mention a few others that are specific to HMC. These diagnostics will be presented for many of the examples used throughout this book.
4.5.1 Trace Plots
Perhaps the most common diagnostic for assessing MCMC convergence is to examine the so-called trace or history plots produced by plotting the posterior parameter estimate obtained for each iteration of the chain. Typically, a parameter will appear to converge if the sample estimates form a tight horizontal band across the history of the iterations forming the chain. However, using this method as an assessment of convergence is rather crude, since merely viewing a tight trace plot does not indicate that convergence was actually obtained. As a result, this method is more likely to be an indicator of non-convergence (Mengersen, Robert, & Guihenneuc-Jouyaux, 1999). For example, if two chains for the same parameter are sampled from different areas of the target distribution and the estimates over the history of the chain stay separated, that would be evidence of non-convergence. Likewise, if a plot shows substantial fluctuation or jumps in the chain, it is likely that the chain linked to that parameter has not reached convergence.
4.5.2 Posterior Density Plots
Another useful tool for diagnosing any issues with the convergence of the Markov chain is the posterior density plot. This is the plot of the draws for each parameter in the model and for each chain. This plot is very important to inspect insofar as the summary statistics for the parameters of the model are calculated from these posterior draws. If we consider, for example, a regression coefficient with a Gaussian prior, then we would expect the posterior density in that plot to be normally distributed. Any strong deviations from normality, and in particular any serious bimodality in the plot, would suggest issues with convergence of that parameter, which could possibly be resolved through more iterations, a better choice of priors, or both.
4.5.3 Autocorrelation Plots
In addition to the trace plots, it is important to examine the speed with which the draws from the posterior distribution achieve independence. As noted earlier, draws from the posterior distribution using a Markov chain are not, in the beginning, independent of one another. However, we expect that the chain will eventually "forget" its initial state and converge to a set of independent and stationary draws from the posterior distribution. One way to determine how quickly the chain has forgotten its initial state is by inspection of the autocorrelation function (ACF) plot, defined as follows. Let θs (s = 1, . . . , S) be the sth draw of a stationary Markov chain. Then the lag-l autocorrelation can be written as

$$\rho_l = \mathrm{cor}(\theta_s, \theta_{s+l}) \qquad (4.6)$$
In general, we expect that the lag-1 autocorrelation will be close to 1.0. However, we also expect that the components of the Markov chain will become independent as l increases. Thus, we prefer that the autocorrelation decrease quickly over the number of iterations. If this is not the case, it is evidence that the chain is “stuck” and thus not providing a full exploration over the support of the target distribution. In general, positive autocorrelation will be observed, but in some cases negative autocorrelation is possible, indicating fast convergence of the estimated value to the equilibrium value.
4.5.4 Effective Sample Size
Related to the autocorrelation diagnostic is the effective sample size, denoted as n_eff in the Stan output, which is an estimate of the number of independent draws from the posterior distribution. In other words, it is the number of independent samples with the same estimation power as the S autocorrelated samples. Staying consistent with Stan notation, the n_eff is calculated as

$$n_{\mathrm{eff}} = \frac{S}{1 + 2\sum_{s=1}^{\infty} \rho_s} \qquad (4.7)$$
where S is the total number of samples. Because the samples from the posterior distribution are not independent, we expect from Equation (4.7) that n_eff will be smaller than the total number of draws. If the ratio of the effective sample size to the total number of draws is close to 1.0, this is evidence that the algorithm has achieved mostly independent draws. Much lower values could be a cause for concern, as they signal that the draws are not independent; but it is important to note that this ratio is highly dependent on the choice of MCMC algorithm, the number of warmup iterations, and the number of post-warmup iterations. One approach to addressing the problem of autocorrelation, and the associated problem of lower effective sample sizes, involves the use of thinning. Suppose we request that the algorithm take 3,000 draws from the posterior distribution. This would be the same as having the algorithm take 30,000 draws but thinning the sample by keeping only every 10th draw. While thinning is primarily a way to reduce memory burden, an advantage is that the autocorrelation among the retained draws is typically reduced, which also results in a higher effective sample size.
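A small R sketch (assuming the coda package) shows both diagnostics on a deliberately autocorrelated toy chain; the AR(1) simulation simply stands in for MCMC output.

library(coda)
theta <- as.numeric(arima.sim(list(ar = 0.8), n = 5000))  # toy autocorrelated chain
chain <- mcmc(theta)
autocorr.plot(chain)    # lag autocorrelations should decay quickly for a healthy chain
effectiveSize(chain)    # estimated n_eff; here far below 5000 because of autocorrelation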
4.5.5 Potential Scale Reduction Factor
When implementing an MCMC algorithm, one of the most important diagnostics is the potential scale reduction factor (see, e.g., Gelman & Rubin, 1992a; Gelman, 1996; Gelman & Rubin, 1992b), often denoted as rhat or R̂. This diagnostic is based on analysis of variance and is intended to assess convergence among several parallel chains with varying starting values. Specifically, Gelman and Rubin (1992a) proposed a method where an overestimate and an underestimate of the variance of the target distribution are formed. The overestimate of the variance of the target distribution is measured by the between-chain variance, and the underestimate is measured by the within-chain variance (Gelman, 1996). The idea is that if the ratio of these two sources of variance is equal to 1, then this is evidence that the chains have converged. If R̂ > 1.01, this may be a cause for concern. Brooks and Gelman (1998) added an adjustment for sampling variability in the variance estimates and also proposed a multivariate extension of the potential scale reduction factor that does not include the sampling variability correction. The R̂ diagnostic is calculated for all chains over all iterations. A problem with R̂, originally noted by Gelman, Carlin, et al. (2014) and further discussed in Vehtari, Gelman, Simpson, Carpenter, and Bürkner (2021), is that it sometimes does not detect non-stationarity, in the sense of the average or variability in the chains changing over the iteration history. A relatively new version of the potential scale reduction factor is available in Stan. This version is referred to as the split R̂ and is designed to address the problem that the conventional R̂ cannot reliably detect non-stationarity. The split R̂ quantifies the variation of a set of Markov chains initialized from different locations in parameter space. This is accomplished by splitting each chain in two and then calculating the split R̂ on twice as many chains. So, if one is using four chains with 5,000 iterations per chain, the split R̂ is based on eight chains with 2,500 iterations per chain.
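In practice the split R̂ can be computed directly from a matrix of draws. The sketch below assumes rstan (version 2.19 or later), whose Rhat() function implements the split version given an iterations-by-chains matrix; the two toy "chains" here are just simulated draws.

library(rstan)
draws <- cbind(rnorm(1000), rnorm(1000, 0.05))  # two toy chains, 1000 iterations each
Rhat(draws)                                     # values near 1.00 suggest convergence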
4.5.6 Possible Error Messages When Using HMC/NUTS
Because we will be relying primarily on the Stan software program for our examples, it may be useful to list some of the error messages that one may encounter when using the HMC/NUTS algorithm.
Divergent Transitions Much of what has been described so far represents the ideal case of the typical set being a nice smooth surface that HMC can easily explore. However, in more complex Bayesian models, particularly the Bayesian hierarchical models often applied in the social and behavioral sciences, the surface of the typical set is not always smooth. In particular, there can be regions of the typical set that have very high curvature. Algorithms such as Metropolis-Hastings might jump over such a region, but the problem is that there is information in that region, and if it is ignored, then the resulting parameter estimates may be biased. To compensate for not exactly exploring this region, MCMC algorithms will instead get very close to the boundary of this region and hover there for a long time. This can be seen through
a careful inspection of trace plots where, instead of a nice horizontal band, one sees a sudden jump in the plot followed by a long sequence of iterations at that jump point. After a while, the algorithm will jump back and, in fact, overcorrect. In principle, if the algorithm were allowed to run forever, these discontinuities would cancel each other out, and the algorithm would converge to the posterior distribution under the central limit theorem. However, in finite time, the resulting estimates will likely be biased. Excellent graphical descriptions of this issue can be found in Betancourt (2018b). The difficulty with the problem just described is that the M-H and Gibbs algorithms do not provide feedback as to the conditions under which they got stuck in the region of high curvature. With HMC as implemented in Stan, if the algorithm diverges sharply from the trajectory through the typical set, it will throw an error message that some number of transitions diverged. In Stan the error might read
1: There were 15 divergent transitions after warmup.
Regardless of the number of divergent transitions, this warning needs to be taken seriously, as it suggests that this region was not explored by the algorithm at all. Generally, the problem can be diagnosed by checking for potential problems with the data, by checking whether the model parameters are identified (Betancourt, 2018c), by checking the appropriateness of the priors, and by checking and possibly adjusting the step size. For more information, see Chapter 15 in the Stan Reference Manual (Stan Development Team, 2021b).
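Adjusting the step size from R is done through the control argument to rstan. The sketch below is illustrative; model_code and data_list are placeholders for a user's own model and data, and the particular values shown are common first responses rather than universal settings. Raising adapt_delta forces a smaller step size, at the cost of slower sampling.

fit <- rstan::stan(
  model_code = model_code,                                   # placeholder Stan program
  data       = data_list,                                    # placeholder data list
  control    = list(adapt_delta = 0.99, max_treedepth = 12)  # smaller steps, deeper trees
)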
4.6 Summarizing the Posterior Distribution
Once satisfactory convergence to the posterior distribution is obtained, the next step is to calculate point estimates and obtain relevant intervals. The expressions for point estimates and intervals of the posterior distribution follow from general expressions for conditional distributions.
4.6.1 Point Estimates of the Posterior Distribution
Specifically, for the continuous case, the mean of the posterior distribution of θ given the data y is referred to as the expected a posteriori (EAP) estimate and can be written as

$$E(\theta \mid y) = \int_{-\infty}^{+\infty} \theta\, p(\theta \mid y)\, d\theta \qquad (4.8)$$
Similarly, the variance of the posterior distribution of θ given y can be obtained as

$$\begin{aligned} V(\theta \mid y) &= E\big[(\theta - E[\theta \mid y])^2 \mid y\big] \\ &= \int_{-\infty}^{+\infty} (\theta - E[\theta \mid y])^2\, p(\theta \mid y)\, d\theta \\ &= \int_{-\infty}^{+\infty} \big(\theta^2 - 2\theta E[\theta \mid y] + E[\theta \mid y]^2\big)\, p(\theta \mid y)\, d\theta \\ &= E[\theta^2 \mid y] - E[\theta \mid y]^2 \end{aligned} \qquad (4.9)$$
The mean and variance provide two simple summary values of the posterior distribution. Another common summary measure is the mode of the posterior distribution, referred to as the maximum a posteriori (MAP) estimate. The MAP begins with the idea of maximum likelihood estimation. Maximum likelihood estimation obtains the value of θ, say $\hat{\theta}_{ML}$, which maximizes the likelihood function L(θ | y), written succinctly as

$$\hat{\theta}_{ML} = \underset{\theta}{\mathrm{argmax}}\; L(\theta \mid y) \qquad (4.10)$$

where argmax stands for the value of the argument for which the function attains its maximum. In Bayesian inference, however, we treat θ as random and specify a prior distribution on θ to reflect our uncertainty about θ. By adding the prior distribution to the problem, we obtain

$$\hat{\theta}_{MAP} = \underset{\theta}{\mathrm{argmax}}\; L(\theta \mid y)\, p(\theta) \qquad (4.11)$$

Recalling that p(θ | y) ∝ L(θ | y)p(θ) is the posterior distribution, we see that Equation (4.11) provides the maximum value of the posterior density of θ given y, corresponding to the mode of the posterior density.
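Given MCMC output, both summaries are easy to compute from the draws. In this sketch (with simulated draws standing in for a posterior), the MAP is approximated by the highest point of a kernel density estimate; this is an illustrative approximation, not the book's procedure.

draws <- rnorm(10000, mean = 1.5, sd = 0.4)   # stand-in posterior draws
eap <- mean(draws)                            # expected a posteriori estimate
dens <- density(draws)
map <- dens$x[which.max(dens$y)]              # approximate posterior mode
c(EAP = eap, MAP = map)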
4.6.2 Interval Summaries of the Posterior Distribution
Along with point summary measures and posterior probabilities, it is usually desirable to provide interval summaries of the posterior distribution. There are two general approaches to obtaining interval summaries of the posterior distribution. The first is the so-called posterior probability interval (also referred to as the credible interval), and the second is the highest posterior density (HPD) interval.
Posterior Probability Intervals One important consequence of viewing parameters probabilistically concerns the interpretation of confidence intervals. Recall that the frequentist confidence interval requires that we imagine a fixed parameter, say the population mean µ. Then, we imagine an infinite number of repeated samples from the population characterized by µ. For any given sample, we can obtain the sample mean x¯ and then form a
100(1 − α)% confidence interval. The correct frequentist interpretation is that 100(1 − α)% of the confidence intervals formed this way capture the true parameter µ under the null hypothesis. Notice that from this perspective, the probability that the parameter is in the interval is either 0 or 1. In contrast, the Bayesian framework assumes that a parameter has a probability distribution. Sampling from the posterior distribution of the model parameters, we can obtain its quantiles. From the quantiles, we can directly obtain the probability that a parameter lies within a particular interval. So here, a 95% posterior probability interval would mean that the probability that the parameter lies in the interval is 0.95. Notice that this is entirely different from the frequentist interpretation, and arguably aligns with common sense.3 In formal terms, a 100(1 − α)% posterior probability interval for a particular subset of the parameter space Θ is defined as

$$1 - \alpha = \int_C p(\theta)\, d\theta \qquad (4.12)$$
where C is the subset of the parameter space over which the interval is defined. Symmetric intervals are, of course, not the only interval summaries that can be obtained from the posterior distribution. Indeed, a major benefit of Bayesian inference is that any interval of substantive importance can be obtained directly from the posterior distribution through simple functions available in R. The flexibility available in being able to summarize any aspect of the posterior distribution admits a much greater degree of nuance in the kinds of research questions one may ask. We demonstrate how to obtain other posterior intervals of interest in the example below.
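As a small sketch of this flexibility (with simulated draws standing in for posterior samples), both central intervals and tail probabilities come straight from the draws.

draws <- rnorm(10000, mean = 1.5, sd = 0.4)   # stand-in posterior draws
quantile(draws, probs = c(0.025, 0.975))      # central 95% posterior probability interval
mean(draws > 1)                               # e.g., the posterior probability that theta > 1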
Highest Posterior Density Intervals The simplicity of the posterior probability interval notwithstanding, it is not the only way to provide an interval estimate of a parameter. Following arguments set down by G. Box and Tiao (1973), when considering the posterior distribution of a parameter θ, there is a substantial part of the region of that distribution where the density is quite small. It may be reasonable, therefore, to construct an interval in which every point inside the interval has a higher probability than any point outside the interval. Such a construction is the highest probability density (HPD) interval. More formally,

Definition 4.6.1. Let p(θ | y) be the posterior probability density function. A region R of the parameter space θ is called the highest probability density region of the interval 1 − α if

1. p(θ ∈ R | y) = 1 − α
2. For θ1 ∈ R and θ2 ∉ R, p(θ1 | y) ≥ p(θ2 | y)

3 Interestingly, the Bayesian interpretation is often the one incorrectly ascribed to the frequentist interpretation of the confidence interval.
In words, the first part says that given the data y, the probability that θ is in a particular region is equal to 1 − α, where α is determined ahead of time. The second part says that for two different values of θ, denoted as θ1 and θ2, if θ1 is in the region defined by 1 − α but θ2 is not, then θ1 has a higher probability than θ2 given the data. Note that for unimodal and symmetric distributions, such as the uniform distribution or the Gaussian distribution, the HPD is formed by choosing tails of equal density. The advantage of the HPD arises when densities are not symmetric and/or are multi-modal. A multi-modal distribution, for example, could arise as a consequence of a mixture of two distributions. Following G. Box and Tiao (1973), if p(θ | y) is not uniform over every region in θ, then the 1 − α HPD region is unique. Also, if p(θ1 | y) = p(θ2 | y), then these points are both included (or excluded) by a 1 − α HPD region. The opposite is true as well; namely, if p(θ1 | y) ≠ p(θ2 | y), then a 1 − α HPD region includes one point but not the other (G. Box & Tiao, 1973, p. 123). Figure 4.1 below shows the HPDs for a symmetric distribution centered at zero on the left, and an asymmetric distribution on the right.
FIGURE 4.1. Highest posterior density plot for symmetric and nonsymmetric distributions.

We see that for the symmetric distribution, the 95% HPD aligns with the 95% confidence interval as well as the posterior probability interval, as expected. Perhaps more importantly, we see the role of the HPD in the case of the asymmetric distribution on the right. Such distributions could arise from the mixture of two Gaussian distributions. Here, the value of the posterior probability interval would be misleading. The HPD, by contrast, indicates that, due to the asymmetric nature of this particular distribution, there is very little difference in the probability that the parameter of interest lies within the 95% or 99% intervals of the highest posterior density.
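An HPD interval can be computed from draws with the coda package (assumed installed); on a deliberately skewed stand-in posterior, the HPD differs visibly from the equal-tailed quantile interval.

library(coda)
skewed_draws <- rgamma(10000, shape = 2, rate = 1)  # asymmetric stand-in posterior
HPDinterval(mcmc(skewed_draws), prob = 0.95)        # every point inside beats every point outside
quantile(skewed_draws, c(0.025, 0.975))             # equal-tailed interval, for comparison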
4.7 Introduction to Stan and Example
Throughout this book we will make use of the Stan probabilistic programming language (Stan Development Team, 2021a). Stan was named after Stanislaw Ulam, a Polish-born American mathematician who, along with Nicholas Metropolis, was one of the early developers of Monte Carlo methods while working on the atomic bomb in Los Alamos, New Mexico. Stan is an open-source, freedom-respecting software program and is financially supported by the NUMFOCUS project (https://numfocus.org/). It is not the purpose of this book to provide a detailed tutorial on the Stan software program. Instead, we will introduce elements of Stan by way of examples. The Stan User's Guide can be found at https://mc-stan.org/docs/2_29/stan-users-guide/index.html, and a very friendly and helpful forum for questions related to Stan and its related programs can be found at https://discourse.mc-stan.org/. Example 4.1: Distribution of Reading Literacy To introduce some of the basic elements of Stan and its R interface RStan, we begin by estimating the distribution of reading literacy from the United States sample of the Program for International Student Assessment (PISA) (OECD, 2019). For this example, we examine the distribution of the first plausible value of reading for the U.S. sample. For detailed discussions of the design of the PISA assessment, see Kaplan and Kuger (2016) and von Davier (2013). In the following block of code we load RStan and bayesplot, read in the PISA data, and do some subsetting to extract the first plausible value of the reading competency assessment.4 We also create an object called data.list that allows us to rename variables and provide information to be used later. Note that we refer to the reading score as readscore in the code.
library(rstan)
library(bayesplot)
PISA18Data

$$p(k \mid \theta) = \frac{\theta^k e^{-\theta}}{k!}, \quad k = 0, 1, 2, \ldots, \quad \theta > 0 \qquad (5.20)$$
where k ranges over the whole numbers representing the counts of the events. The link function for the Poisson regression in Table 5.3 allows us to model the count data in terms of chosen predictors. Example 5.5: Bayesian Poisson Regression of Absentee Data This analysis utilizes data from a school administrator's study of the attendance behavior of high school juniors at two schools. The outcome is a count of the number of days absent, and the predictors include the gender of the student and standardized test scores in math and language arts. The source of the data is unknown, but the data are widely used as an example of regression for count data in a number of R programs. The data are available on the accompanying website. We begin by reading in the data and creating the data list for Stan.
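Before the book's code, it may help to sketch what a Stan Poisson regression with a log link generally looks like; this is an illustrative sketch only, and the variable names (daysabs, mathscore) are placeholder assumptions rather than the book's.

poisson_model <- "
data {
  int<lower=0> N;
  vector[N] mathscore;
  int<lower=0> daysabs[N];
}
parameters {
  real alpha;
  real beta;
}
model {
  alpha ~ normal(0, 5);
  beta  ~ normal(0, 5);
  daysabs ~ poisson_log(alpha + beta * mathscore);  // log link: log(theta) = alpha + beta*x
}
"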
poissonex

(e.g., .05). If the resulting test is significant, then the null hypothesis is rejected. However, if the resulting test is not significant, then no conclusion can be drawn. As Gigerenzer et al. (2004) have pointed out, Fisher developed a later version of his ideas wherein one only reports the exact significance level arising from the test and does not place a "significant" or "non-significant" label on the result. In other words, one reports, say, p = .045, but does not label the result as "significant" (Gigerenzer et al., 2004, p. 399). In contrast to Fisher's ideas, the approach advocated by Neyman and Pearson requires that two hypotheses be specified: the null and the alternative hypothesis. By specifying two hypotheses, one can compute a desired trade-off between two types of decision errors: Type I errors (the probability of rejecting the null when it is true, denoted as α) and Type II errors (the probability of not rejecting the null
when it is false, denoted as β, where 1 − β denotes the power of the test). As Raftery (1995) has pointed out, the dimensions of this hypothesis are not relevant; that is, the problem can be as simple as the difference between a treatment group and a control group, or as complex as a structural equation model. The point remains that only two hypotheses are of interest in the conventional practice. Moreover, as Raftery (1995) also notes, it is often far from the case that only two hypotheses are of interest. This is particularly true in the early stages of a research program, when a large number of models might be of interest to explore, with equally large numbers of variables that can be plausibly entertained as relevant to the problem. The goal is not, typically, the comparison of any one model taken as "true" against an alternative model. Rather, it is whether the data provide evidence in support of one of the competing models. The conflation of Fisherian and Neyman-Pearson hypothesis testing lies in the use and interpretation of the p-value. In Fisher's paradigm, the p-value is a matter of convention, with the resulting outcome being based on the data. In contrast, in the Neyman-Pearson paradigm, α and β are determined prior to the experiment being conducted and refer to a consideration of the cost of making one or the other decision error. Indeed, in the Neyman-Pearson approach, the problem is one of finding a balance between α, power, and sample size. However, even a casual perusal of the top journals in the social sciences will reveal that this balance is virtually always ignored and α is taken to be 0.05, with the 0.05 level itself being the result of Fisher's experience with small agricultural experiments and never intended to be a universal standard. The point is that the p-value and α are not the same thing. This confusion is made worse by the fact that statistical software packages often report a number of p-values that a researcher can choose from after having conducted the analysis (e.g., .001, .01, .05). This can lead a researcher to set α ahead of time (perhaps according to an experimental design), but then communicate a different level of "significance" after running the test. This different level of significance would have corresponded to a different effect size, sample size, and power, all of which were not part of the experimenter's original design. The conventional practice is even worse than described, as evidenced by nonsensical phrases such as results "trending toward significance," "approaching significance," or "nearly significant."1 One could argue that a poor understanding and questionable practice of NHST is not sufficient as a criticism against its use. However, it has been argued by Jeffreys (1961; see also Wagenmakers, 2007) that the statistical logic of the p-value underlying NHST is fundamentally flawed on its own terms. Consider any test statistic t(y) that is a function of the data y (such as the t-statistic). The p-value is obtained from p[t(y) | H0], together with that part of the sampling distribution of t(y^rep | H0) more extreme than the observed t(y). A p-value is obtained from the distribution of a test statistic over hypothetical replications (i.e., the sampling distribution). The p-value is the sum or integral over values of the test statistic that are at least as extreme as the one that is actually observed. In other words,
1 For a perhaps not-so-humorous account of the number of different phrases that have been used in the literature, see https://mchankins.wordpress.com/2013/04/21/still-not-significant-2/.
the p-value is the probability of observing data at least as extreme as the data that were actually observed, computed under the assumption that the null hypothesis is true. However, data points more extreme were never actually observed, and this constitutes a violation of the likelihood principle, a foundational principle in statistics (Birnbaum, 1962) that states that in drawing inferences or decisions about a parameter after the data are observed, all relevant observational information is contained in the likelihood function for the observed data, p(y | θ). This issue was echoed in Kadane (2011, p. 439), who wrote:

Significance testing violates the Likelihood Principle, which states that, having observed the data, inference must rely only on what happened, and not on what might have happened but did not.

Kadane (2011, p. 439) goes on to write:

But the probability statement...is a statement about $\bar{X}_n$ before it is observed. After it is observed, the event $|\bar{X}_n| > 1.96/\sqrt{n}$ either happened or did not happen and hence has probability either one or zero.

To be specific, if we observe an effect, say y = 5, then the significance calculations involve not just y = 5 but also more extreme values, y > 5. But y > 5 was not observed, and it might not even be possible to observe it in reality! To quote Jeffreys (1961, p. 385),

I have always considered the arguments for the use of P [sic] absurd. They amount to saying that a hypothesis that may or may not be true is rejected because a greater departure from the trial value was improbable; that is, that it has not predicted something that has not happened. ...This seems a remarkable procedure.
6.2 Model Assessment
In many respects, the frequentist and Bayesian goals of model building are the same. First, a researcher will specify an initial model relying on a lesser or greater degree of prior theoretical knowledge. In fact, at this first stage, a number of different models may be specified according to different theories, with the goal being to choose the "best" model, in some sense of the word. Second, these models will be fit to data obtained from a sample from some relevant population. Third, an evaluation of the quality of the models will be undertaken, examining where each model might deviate from the data, as well as assessing any possible model violations. At this point, model respecification may come into play. Finally, depending on the goals of the research, the "best model" will be chosen for some purpose. Despite the similarities between the two approaches with regard to the broad goals of model building, there are important differences. A major difference between the Bayesian and frequentist goals of model building lies in the model specification stage. In particular, because the Bayesian perspective explicitly incorporates uncertainty regarding model parameters in the form of prior probability distributions, the first phase of model building will require the specification of
a full probability model for the data and the parameters of the model, where the latter requires the specification of the prior distribution. The notion of model fit, therefore, implies that the full probability model fits the data. Lack of model fit may well be due to incorrect specification of the likelihood, the prior distribution, or both. Arguably, another difference between the Bayesian and frequentist goals of model building relates to the justification for choosing a particular model among a set of competing models. Specifically, model building and model choice in the frequentist domain are based primarily on choosing the model that best fits the data. This has certainly been the key motivation for model building, respecification, and model choice in the context of popular methods in the social sciences such as structural equation modeling (see, e.g., Kaplan, 2009). In the Bayesian domain, the choice among a set of competing models is based on which model provides the best posterior predictions. That is, the choice among a set of competing models should be based on which model will best predict what actually happened. In this chapter, we examine the components of what might be termed a Bayesian workflow (see, e.g., Bayesian Workflow, 2020; Schad, Betancourt, & Vasishth, 2019; Depaoli & van de Schoot, 2017), which includes a series of steps that can be taken by a researcher before and after estimation of the model. We discuss a possible Bayesian workflow in more detail in Chapter 12. These steps include (1) prior predictive checking, which can aid in identifying priors that might be in serious conflict with the distribution of the data, and (2) post-estimation steps including model assessment, model comparison, and model selection. We begin with a discussion of prior predictive checking, which essentially generates data from the prior distribution before the data are observed. Next, we discuss posterior predictive checking as a flexible approach to assessing the overall fit of a model as well as the fit of the model to specific features of the data. Then, we discuss methods of model comparison including Bayes factors, the related Bayesian information criterion (BIC), the deviance information criterion (DIC), the widely applicable information criterion (WAIC), and the leave-one-out cross-validation information criterion (LOO-IC).
6.2.1 Prior Predictive Checking
As part of the modeling workflow, it is often useful to have a sense of the appropriateness of a prior distribution before observing the data and performing a full Bayesian analysis. Such a check is useful in the presence of content knowledge wherein clearly absurd prior distributions can be ruled out. This approach to examining prior distributions is referred to as prior predictive checking and was originally discussed by Box (1980, 1983) and advocated by Gelman, Simpson, and Betancourt (2017; see also van de Schoot et al., 2021; Depaoli & van de Schoot, 2017). The prior predictive distribution can be written as

p(y^{rep}) = \int p(y^{rep} \mid \theta) \, p(\theta) \, d\theta    (6.1)
which we see is simply generating replications of the data from the prior distribution.

Example 6.1: Prior Predictive Checking

In this example, we conduct prior predictive checks on a very simple analysis of the distribution of the PISA 2018 reading outcome. Using information from the 2009 cycle of PISA, we have a good idea of the distribution, but here we examine both the case where our priors for 2018 are quite inaccurate and the case where they are based on more accurate knowledge of the 2009 reading outcome distribution. Setting up prior predictive checking in Stan involves generating MCMC samples from the prior distribution alone. In other words, the Stan code simply excludes the likelihood of the data from the model block, and so there is no posterior distribution being sampled. A minimal sketch of such a prior-only program is given below, followed by the beginning of the example code.
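The following is a hedged sketch, not the book's actual listing, of what a prior-only Stan program can look like: the model block states priors for a mean and a standard deviation but omits any likelihood statement, and replicated outcomes are drawn in the generated quantities block. The prior values and object names here are assumptions for illustration only.

library(rstan)

# Sketch of a prior-only Stan program; priors and names are assumed
prior_only_code <- "
data {
  int<lower=1> N;              // number of replicated observations
}
parameters {
  real mu;                     // mean of the outcome
  real<lower=0> sigma;         // standard deviation of the outcome
}
model {
  mu ~ normal(500, 100);       // assumed prior on the mean
  sigma ~ cauchy(0, 5);        // assumed half-Cauchy prior on the SD
  // no likelihood statement: draws come from the prior
}
generated quantities {
  vector[N] y_rep;             // prior predictive replications
  for (n in 1:N)
    y_rep[n] = normal_rng(mu, sigma);
}
"

priorcheck <- stan(model_code = prior_only_code,
                   data = list(N = 100),
                   chains = 2, iter = 1000, seed = 123)

The draws of y_rep can then be plotted (e.g., with bayesplot) and inspected for values that are implausible on substantive grounds.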
library(rstan)
library(loo)
library(bayesplot)
library(ggplot2)

## Read in data ##
PISA18Data

The degrees-of-freedom parameter must satisfy ν > P − 1, where P is the number of observed variables. Different choices for Ψ and ν will yield different degrees of "informativeness" for the inverse-Wishart distribution. The inverse-Wishart distribution was discussed in Chapter 3. Note that other prior distributions for θ_IW can be chosen. For example, Φ can be standardized to a correlation matrix and the LKJ(η) prior can be applied. Also, if the uniqueness covariance matrix Ξ is assumed to be a diagonal matrix of unique variances (which is typically the case), then the elements of Ξ can be given IG(α, β) priors, where α and β are shape and scale parameters, respectively, or C+(0, β) priors, where β is the scale parameter of the C+ distribution and the location x0 is set to zero by definition (see Section 3.1.3).

Example 8.1: Bayesian Confirmatory Factor Analysis

This example is based on a reanalysis of a confirmatory factor analysis described in the OECD technical report (OECD, 2010). The PISA background questionnaire distinguishes two forms of motivation to learn mathematics: (1) students may learn mathematics because they enjoy it and find it interesting, or (2) because they perceive learning mathematics as useful. These two constructs are central in self-determination theory (Ryan & Deci, 2009) and expectancy-value theory (Wigfield, Tonks, & Klauda, 2009). Data come from the U.S. sample of PISA 2012; students were asked the following: Thinking about your views on mathematics: to what extent do you agree with the following statements? (Please tick only one box in each row.) Strongly agree (1) / Agree (2) / Disagree (3) / Strongly disagree
(4). The items (with variable names in parentheses) were:

a) I enjoy reading about mathematics (enjoyread)
b) Making an effort in mathematics is worth it because it will help me in the work that I want to do later on (effort)
c) I look forward to my mathematics lessons (lookforward)
d) I do mathematics because I enjoy it (enjoy)
e) Learning mathematics is worthwhile for me because it will improve my career (career)
f) I am interested in the things I learn in mathematics (interest)
g) Mathematics is an important subject for me because I need it for what I want to study later on (important)
h) I will learn many things in mathematics that will help me get a job (job)

The confirmatory factor model in this example was specified to have two factors. The first factor, labeled IntrinsicMotiv, is measured by enjoyread, lookforward, enjoy, and interest. The second factor, labeled ExtrinsicMotiv, is measured by effort, career, important, and job. A hedged sketch of such a two-factor specification in Stan is given below; after it, the example code begins by reading in and selecting the data. Missing data are deleted listwise.
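As an illustration of the specification above, and not the book's actual program, the following sketch sets up a two-factor CFA in Stan using the priors just discussed: normal priors on loadings and intercepts, half-Cauchy priors on the unique standard deviations, and an LKJ prior on the factor correlation. The data object, prior values, and the positivity constraint used to identify the loadings' signs are all assumptions.

library(rstan)

# Sketch of a two-factor CFA; items 1-4 load on factor 1 (intrinsic),
# items 5-8 on factor 2 (extrinsic); names and priors are assumed
cfa_code <- "
data {
  int<lower=1> N;                // number of students
  matrix[N, 8] y;                // eight motivation items
}
parameters {
  matrix[N, 2] eta;              // latent factor scores
  vector<lower=0>[8] lambda;     // loadings, positive for identification
  vector[8] tau;                 // item intercepts
  vector<lower=0>[8] psi;        // unique standard deviations
  cholesky_factor_corr[2] L;     // Cholesky factor of the factor correlation
}
model {
  lambda ~ normal(0, 1);         // assumed weakly informative priors
  tau ~ normal(0, 5);
  psi ~ cauchy(0, 2.5);          // half-Cauchy via the lower bound
  L ~ lkj_corr_cholesky(1);      // LKJ prior on the factor correlation
  for (i in 1:N)
    eta[i]' ~ multi_normal_cholesky(rep_vector(0, 2), L);
  for (j in 1:4)
    col(y, j) ~ normal(tau[j] + lambda[j] * col(eta, 1), psi[j]);
  for (j in 5:8)
    col(y, j) ~ normal(tau[j] + lambda[j] * col(eta, 2), psi[j]);
}
"

# Assuming Y is an N x 8 matrix of item responses:
# fit <- stan(model_code = cfa_code, data = list(N = nrow(Y), y = Y),
#             chains = 2, iter = 2000)

Fixing the factor variances to one (via the correlation-scale prior on L) and constraining the loadings to be positive is one common identification strategy; fixing one loading per factor to unity is an equally common alternative.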
library(rstan)
library(loo)
library(bayesplot)
library(dplyr)

# Read in data
PISAcfa

B = \{ M_k : \exists \, M_l \in A', \ M_l \subset M_k, \ p(M_l \mid y) / p(M_k \mid y) > 1 \}    (11.8)
Again, in words, Equation (11.8) states that there exists a model Ml within the set A′ that is simpler than Mk. If a complex model receives less support from the data than a simpler submodel (again based on the Bayes factor), then it is assigned to B and thus excluded from consideration. Notice that this second step corresponds to the principle of Occam's razor (Madigan & Raftery, 1994).
With Step 1 and Step 2, the problem of reducing the size of the model space for BMA is simplified by replacing Equation (11.4) with

p(\tilde{y} \mid y, A) = \sum_{M_k \in A} p(\tilde{y} \mid M_k, y) \, p(M_k \mid y, A)    (11.9)
In other words, the models under consideration for BMA are those that are in A′ but not in B. Madigan and Raftery (1994) outline an approach to the choice between two models to be considered for Bayesian model averaging. To make the approach clear, consider the case of just two models M1 and M0, where M0 is the simpler of the two models. This could be the case where M0 contains fewer predictors than M1 in a regression analysis. In terms of the log-posterior odds, if the odds are positive, indicating support for M1, then we reject M0. If the log-posterior odds are large and negative, then we reject M1 in favor of M0. Finally, if the odds lie in between these pre-set criteria, then both models are retained. For linear regression models, the leaps and bounds algorithm combined with Occam's window is available through the bicreg function in the R package BMA (Raftery et al., 2020). A minimal sketch of the windowing step follows.
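To make the windowing idea concrete, here is a minimal sketch, using assumed posterior model probabilities rather than output from any of the book's analyses, of Step 1 of Occam's window: models whose posterior probability falls too far below that of the best model are dropped, and the remaining probabilities are renormalized. Step 2, the exclusion of complex models dominated by simpler submodels, is omitted for brevity.

# Step 1 of Occam's window: keep models within a factor C of the best
occams_window <- function(pmp, C = 20) {
  keep <- pmp >= max(pmp) / C        # drop models far below the best model
  pmp[keep] / sum(pmp[keep])         # renormalize over the retained models
}

# Assumed posterior model probabilities for four candidate models
pmp <- c(M1 = 0.55, M2 = 0.30, M3 = 0.13, M4 = 0.02)
occams_window(pmp, C = 20)
# M4 falls outside the window and is excluded from the average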
11.3.3 Markov Chain Monte Carlo Model Composition
Markov chain Monte Carlo model composition (MC3) is based on the Metropolis-Hastings algorithm (see, e.g., Gilks, Richardson, & Spiegelhalter, 1996b), discussed in Chapter 4, and is also designed to reduce the space of possible models that can be explored via Bayesian model averaging. Following Hoeting et al. (1999), the MC3 algorithm proceeds as follows. First, let M represent the space of models of interest; in the case of linear regression, this would be the space of all possible combinations of variables. Next, the theory behind MCMC allows us to construct a Markov chain {M(t), t = 1, 2, . . .} which converges to the posterior distribution over models, that is, p(Mk | y).

The manner in which models are retained under MC3 is as follows. First, for any given model currently explored by the Markov chain, we can define a neighborhood for that model which includes models with one more variable and one fewer variable than the current model. So, for example, if our model has four predictors x1, x2, x3, and x4, and the Markov chain is currently examining the model with x2 and x3, then the neighborhood of this model would include {x2}, {x3}, {x2, x3, x4}, and {x1, x2, x3}. Now, a transition matrix is formed such that moving from the current model M to a new model M′ has probability zero if M′ is not in the neighborhood of M and has a constant probability if M′ is in the neighborhood of M. The model M′ is then accepted for model averaging with probability

\min \left\{ 1, \ pr(M' \mid y) / pr(M \mid y) \right\}    (11.10)
otherwise, the chain stays at model M. We recognize the term inside Equation (11.10) as the Metropolis-Hastings acceptance ratio presented in Equation (4.4). A minimal sketch of one such step follows.
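The following is a hedged sketch of the MC3 idea for a small linear regression problem; it is not the book's code. Posterior model probabilities are approximated through BIC (which implicitly assumes uniform model priors), the data are simulated, and all names are illustrative.

set.seed(1)
n <- 200
X <- matrix(rnorm(n * 4), n, 4,
            dimnames = list(NULL, paste0("x", 1:4)))
y <- 1 + 0.5 * X[, 2] - 0.3 * X[, 3] + rnorm(n)
dat <- data.frame(y = y, X)

# log p(M | y) up to a constant, via the BIC approximation
log_post <- function(vars) {
  f <- if (length(vars) == 0) y ~ 1
       else reformulate(paste0("x", vars), response = "y")
  -BIC(lm(f, data = dat)) / 2
}

# one MC3 step: propose adding or dropping a single variable
mc3_step <- function(current) {
  flip <- sample(1:4, 1)
  proposal <- if (flip %in% current) setdiff(current, flip)
              else sort(c(current, flip))
  ratio <- exp(log_post(proposal) - log_post(current))  # Equation (11.10)
  if (runif(1) < min(1, ratio)) proposal else current
}

# run the chain and tabulate how often each model is visited
model <- c(2, 3)
visits <- character(2000)
for (t in 1:2000) {
  model <- mc3_step(model)
  visits[t] <- paste(model, collapse = ",")
}
sort(table(visits), decreasing = TRUE)

The relative visit frequencies approximate the posterior model probabilities, and models visited most often dominate the model average.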
11.3.4 Parameter and Model Priors
One of the key steps when implementing BMA is to choose priors for both the parameters of the model and the model space itself. Discussions of the choice of parameter and model priors can be found in Fernández, Ley, and Steel (2001a); Liang, Paulo, Molina, Clyde, and Berger (2008); Eicher, Papageorgiou, and Raftery (2011); and Feldkircher and Zeugner (2009), with applications found in Fernández et al. (2001a). An example comparing parameter and model priors with large-scale educational data can be found in Kaplan and Huang (2021).
Parameter Priors

A large number of choices for parameter priors are available in the R software program BMS (Zeugner & Feldkircher, 2015) and are based on variations of Zellner's g-prior (Zellner, 1986). Specifically, Zellner introduced a natural-conjugate normal-gamma g-prior for regression coefficients β under the normal linear regression model, written as

y_i = x_i' \beta + \varepsilon_i    (11.11)

where the ε_i are iid N(0, σ²). For a given model, say Mk, Zellner's g-prior can be written as

\beta_k \mid \sigma^2, M_k, g \sim N\left(0, \ \sigma^2 g \, (x_k' x_k)^{-1}\right)    (11.12)

Feldkircher and Zeugner (2009) have argued for using the g-prior for two reasons: its consistency in asymptotically uncovering the true model, and its role as a penalty term for model size.

The g-prior has nonetheless been the subject of some criticism. In particular, Feldkircher and Zeugner (2009) have pointed out that the particular choice of g can have a very large impact on posterior inferences drawn from BMA. Specifically, small values of g can yield a posterior mass that is spread out across many models, while large values of g can yield a posterior mass that is concentrated on fewer models. Feldkircher and Zeugner (2009) use the term supermodel effect to describe how values of g impact the posterior statistics, including posterior model probabilities (PMPs) and posterior inclusion probabilities (PIPs). To account for the supermodel effect, researchers such as Fernández et al. (2001a), Liang et al. (2008), Eicher et al. (2011), and Feldkircher and Zeugner (2009) have proposed alternative priors based on extensions of the work of Zellner (1986). Generally speaking, these alternatives can be divided into two categories: fixed priors and flexible priors. Examples of fixed parameter priors include the unit information prior, the risk inflation criterion prior, the benchmark risk inflation criterion prior, and the Hannan-Quinn prior (Hannan & Quinn, 1979). Examples of flexible parameter priors include the local empirical Bayes prior (E. George & Foster, 2000; Liang et al., 2008; Hansen & Yu, 2001) and the family of hyper-g priors (Feldkircher & Zeugner, 2009).
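To make the role of g concrete, recall that with a prior mean of zero, the conditional posterior mean of βk under Equation (11.12) is the OLS estimate shrunk by the factor g/(1 + g). The following sketch illustrates this with simulated data; the choice g = n corresponds to the unit information prior.

set.seed(2)
n <- 100
X <- matrix(rnorm(n * 3), n, 3)
y <- X %*% c(1, 0.5, 0) + rnorm(n)

beta_ols <- solve(crossprod(X), crossprod(X, y))
g <- n                                  # unit information prior sets g = n
beta_post <- (g / (1 + g)) * beta_ols   # posterior mean under the g-prior
data.frame(OLS = as.vector(beta_ols), g_prior = as.vector(beta_post))

With g = n, the shrinkage factor n/(1 + n) is close to one, so the posterior means sit near the OLS estimates; a much smaller g would pull them more strongly toward zero.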
Model Priors

In addition to parameter priors, it is essential to consider priors over the space of possible models, which concern our prior belief regarding the plausibility of each model in the space. Among those implemented in BMS, the uniform model prior is a common default which specifies that if there are Q predictors, each of the possible models receives prior probability 2^{-Q}. The problem with the uniform model prior is that the implied expected model size is Q/2, when in fact there are many more models of intermediate size than there are models of extreme size. For example, with six variables, there are more models of size 2 or 5 than of size 1 or 6. As a result, the uniform prior ends up placing more mass on models of intermediate size. An alternative is the binomial model prior, which places a fixed inclusion probability θ on each predictor of the model. The problem is that θ is treated as fixed, and so to remedy this, Ley and Steel (2009) proposed a beta-binomial model prior which treats θ as random, specified by a beta distribution. Unlike the uniform model prior, the beta-binomial prior will place equal mass across model sizes. A brief sketch of how these prior choices are specified in BMS follows.
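As an illustration of how these choices are expressed in software, the following sketch pairs two parameter priors with two model priors in BMS; the data frame mydata is an assumption (with the outcome in the first column, as BMS expects), and in BMS the argument mprior = "random" invokes the beta-binomial model prior.

library(BMS)

# mydata: assumed data frame, outcome in the first column
m_uip   <- bms(mydata, g = "UIP",   mprior = "uniform")  # unit information + uniform
m_hyper <- bms(mydata, g = "hyper", mprior = "random")   # hyper-g + beta-binomial
coef(m_uip)   # posterior inclusion probabilities and averaged coefficients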
11.3.5 Evaluating BMA Results: Revisiting Scoring Rules
With such a wide variety of parameter and model priors to choose from, it is important to have a method for evaluating the impact of these choices when applying BMA to substantive problems. Given that the utility of BMA lies in its optimal predictive performance, a reasonable method for evaluation should be based on measures that assess predictive performance, referred to as scoring rules, which we introduced in Chapter 1. A large number of scoring rules have been reviewed in the literature (see, e.g., Winkler, 1996; Bernardo & Smith, 2000; Jose et al., 2008; Merkle & Steyvers, 2013; Gneiting & Raftery, 2007). Here, however, we highlight two related, strictly proper scoring rules that are commonly used to evaluate predictions arising from Bayesian model averaging: the log predictive density score and the Kullback-Leibler divergence score (see, e.g., Fernández et al., 2001b; Hoeting et al., 1999; Kaplan & Yavuz, 2019; Kaplan & Huang, 2021, for examples).
The Log Predictive Density Score

The log predictive density (LPD) score (Good, 1952; Bernardo & Smith, 2000) can be written as

-\sum_i \log p(\tilde{y}_i \mid y, \tilde{x}_i)    (11.13)
where ỹ_i is the outcome for the ith person, y represents the model information for the remaining individuals, and x̃_i is the information on the predictors for individual i. The model with the lowest log predictive density score is deemed best in terms of long-run predictive performance. A sketch of how this score can be approximated from posterior draws follows.
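As a hedged sketch (the object names and the normal form of the predictive density are assumptions), the LPD score can be approximated by averaging each hold-out observation's predictive density over posterior draws and summing the negative logs:

# mu_draws: S x n matrix of posterior draws of each case's predictive mean
# sigma_draws: length-S vector of posterior draws of the residual SD
# y_new: length-n vector of hold-out outcomes
lpd_score <- function(y_new, mu_draws, sigma_draws) {
  n <- length(y_new)
  score <- 0
  for (i in 1:n) {
    dens_i <- dnorm(y_new[i], mean = mu_draws[, i], sd = sigma_draws)
    score <- score - log(mean(dens_i))  # minus log of the MC-averaged density
  }
  score  # lower is better
}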
Kullback-Leibler Divergence Score

The Kullback-Leibler divergence measure was introduced in our discussion of variational Bayes in Chapter 4. To briefly reiterate, we consider two distributions, p(y) and g(y | θ), where p(y) could be the distribution of observed reading literacy scores and g(y | θ) could be the prediction of these reading scores based on a model. The KLD between these two distributions can be written as

KLD(p, g) = \int p(y) \log \left( p(y) / g(y \mid \theta) \right) dy    (11.14)

where KLD(p, g) is the information lost when g(y | θ) is used to approximate p(y). For example, the actual reading outcome scores might be compared to the predicted outcomes obtained from Bayesian model averaging under different choices of model and parameter priors. The model with the lowest KLD measure is deemed best in the sense that the information lost when approximating the actual reading outcome distribution with the distribution predicted on the basis of the model is lowest.

Example 11.1: Bayesian Model Averaging

We will focus again on the reading literacy results from PISA 2018. The list of variables used in this example is given below in Table 11.1.

TABLE 11.1. PISA 2018 predictors of reading literacy

Variable name    Variable label
FEMALE           Sex (1 = Female)
ESCS             Index of economic, social, and cultural status
METASUM          Meta-cognition: summarizing
PERFEED          Perceived feedback
HOMEPOS          Home possessions (WLE)a
ADAPTIVE         Adaptive instruction (WLE)
TEACHINT         Perceived teacher's interest
ICTRES           ICT resources (WLE)
JOYREAD          Joy/Like reading
ATTLNACT         Attitude towards school: learning activities
COMPETE          Competitiveness
WORKMAST         Work mastery
GFOFAIL          General fear of failure
SWBP             Subjective well-being: positive affect
MASTGOAL         Mastery goal orientation
BELONG           Subjective well-being: sense of belonging to school (WLE)
SCREADCOMP       Perception of reading competence
SCREADDIFF       Perception of reading difficulty (WLE)
PISADIFF         Perception of difficulty of the PISA test
PV1READ          First plausible value reading score (outcome variable)b

a Weighted likelihood estimates. See OECD (2018) for more details.
b Plausible values. See OECD (2018) for more details.
For this example, we use the software package BMS (Zeugner & Feldkircher, 2015), which implements the so-called Birth/Death (BD) algorithm as a default for conducting MC3; see Zeugner and Feldkircher (2015) for more details on the BD algorithm. The analysis steps for this example are as follows:

1. We begin by implementing BMA with default unit information priors for the model parameters and the uniform prior on the model space. We then outline the major components of the results, including the posterior model probabilities and the posterior inclusion probabilities.

2. We next examine the results under different combinations of parameter and model priors available in BMS and compare results using the LPD and KLD scores.
BMA Results

We first call BMS using the unit information prior (g = "uip") and the uniform model prior (mprior = "uniform").
# The data object name is assumed here for illustration; g and mprior
# follow the settings described in the text
PISAbmsMod1 <- bms(PISAData, g = "uip", mprior = "uniform")