English Pages 518 [550] Year 2021
Bayesian Structural Equation Modeling
Methodology in the Social Sciences
David A. Kenny, Founding Editor
Todd D. Little, Series Editor

This series provides applied researchers and students with analysis and research design books that emphasize the use of methods to answer research questions. Rather than emphasizing statistical theory, each volume in the series illustrates when a technique should (and should not) be used and how the output from available software programs should (and should not) be interpreted. Common pitfalls as well as areas of further development are clearly articulated.

RECENT VOLUMES

PRINCIPLES AND PRACTICE OF STRUCTURAL EQUATION MODELING, FOURTH EDITION
Rex B. Kline

HYPOTHESIS TESTING AND MODEL SELECTION IN THE SOCIAL SCIENCES
David L. Weakliem

REGRESSION ANALYSIS AND LINEAR MODELS: CONCEPTS, APPLICATIONS, AND IMPLEMENTATION
Richard B. Darlington and Andrew F. Hayes

GROWTH MODELING: STRUCTURAL EQUATION AND MULTILEVEL MODELING APPROACHES
Kevin J. Grimm, Nilam Ram, and Ryne Estabrook

PSYCHOMETRIC METHODS: THEORY INTO PRACTICE
Larry R. Price

INTRODUCTION TO MEDIATION, MODERATION, AND CONDITIONAL PROCESS ANALYSIS: A REGRESSION-BASED APPROACH, SECOND EDITION
Andrew F. Hayes

MEASUREMENT THEORY AND APPLICATIONS FOR THE SOCIAL SCIENCES
Deborah L. Bandalos

CONDUCTING PERSONAL NETWORK RESEARCH: A PRACTICAL GUIDE
Christopher McCarty, Miranda J. Lubbers, Raffaele Vacca, and José Luis Molina

QUASI-EXPERIMENTATION: A GUIDE TO DESIGN AND ANALYSIS
Charles S. Reichardt

THEORY CONSTRUCTION AND MODEL-BUILDING SKILLS: A PRACTICAL GUIDE FOR SOCIAL SCIENTISTS, SECOND EDITION
James Jaccard and Jacob Jacoby

LONGITUDINAL STRUCTURAL EQUATION MODELING WITH Mplus: A LATENT STATE–TRAIT PERSPECTIVE
Christian Geiser

COMPOSITE-BASED STRUCTURAL EQUATION MODELING: ANALYZING LATENT AND EMERGENT VARIABLES
Jörg Henseler

BAYESIAN STRUCTURAL EQUATION MODELING
Sarah Depaoli
Bayesian Structural Equation Modeling
Sarah Depaoli
Series Editor’s Note by Todd D. Little
THE GUILFORD PRESS New York London
© 2021 The Guilford Press A Division of Guilford Publications, Inc. 370 Seventh Avenue, Suite 1200, New York, NY 10001 www.guilford.com All rights reserved No part of this book may be reproduced, translated, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, microfilming, recording, or otherwise, without written permission from the Publisher. Printed in the United States of America This book is printed on acid-free paper. Last digit is print number: 9 8 7 6 5 4 3 2 1 Library of Congress Cataloging-in-Publication Data Names: Depaoli, Sarah, author. Title: Bayesian structural equation modeling / Sarah Depaoli. Description: New York, NY : The Guilford Press, 2021. | Series: Methodology in the social sciences | Includes bibliographical references and index. Identifiers: LCCN 2021011543 | ISBN 9781462547746 (cloth) Subjects: LCSH: Bayesian statistical decision theory. | Social sciences–Statistical methods. Classification: LCC BF39.2.B39 D46 2021 | DDC 150.1/519542–dc23 LC record available at https://lccn.loc.gov/2021011543
To Mom and Dad, who taught me the only limits we face are those we place on ourselves

And, to my adorable family, Darrin, Andrew, and Jacob
Series Editor’s Note
It’s funny to me that folks consider it a choice to Bayes or not to Bayes. It’s true that Bayesian statistical logic is different from a traditional frequentist logic, but it’s not really an either/or choice. In my view, Bayesian thinking has permeated how most modern modeling of data occurs, particularly in the world of structural equation modeling (SEM). That being said, having Sarah Depaoli’s guide to Bayesian SEM is a true treasure for all of us. Sarah literally guides us through all the ways that a Bayesian approach enhances the power and utility of latent variable SEM. Accessible, practical, and extremely well organized, Sarah’s book opens a worldly window into the latent space of Bayesian SEM. Although her approach does assume some familiarity with Bayesian concepts, she reviews the foundational concepts for those who learned statistical modeling under the frequentist rock. She also covers the essential elements of traditional SEM, ever foreshadowing the flexibility and enhanced elements that a Bayesian approach to SEM brings. By the time you get to Chapter 5, on measurement invariance, I think you’ll be fully hooked on what a Bayesian approach affords in terms of its powerful utility, and you won’t be daunted when you work through the remainder of the book. As with any statistical technique, learning the notation involved cements your understanding of how it works. I love how Sarah separates the necessary notation elements from the pedagogy of words and examples. In my view, Sarah’s book will be received as an instant classic and a go-to resource for researchers going forward. With clearly developed code for each example (in both Mplus and R), Sarah removes the gauntlet of learning a challenging software package. She provides wonderful overviews of
each chapter’s content and then leads you along the path of learning and understanding. After your learning journey through an example analysis, she brilliantly provides a mock results write-up to facilitate dissemination of your own work. And her final chapter is a laudatory culmination of dos and don’ts, pitfalls and solutions. At a conference in the Netherlands organized by one of the leading Bayesian advocates, Rens van de Schoot, I reworked the lyrics to “Ring of Fire” and performed (very poorly) “Ring of Priors”:

Bayes Is a Burning Thing, and It Makes a Fiery Ring
Bound by Prior Knowledge, I Fell in to a Ring of Priors

I Fell in to a Burning Ring of Priors
I Went Down, Down, Down but the Posteriors Went Higher
And It Burns, Burns, Burns, the Ring of Priors, the Ring of Priors

The Taste of Knowledge Is Sweet When Minds Like Ours Do Meet
We Fell for Bayes Like a Child, and the Priors, They Went Wild
As you embark on your journey to becoming a proficient Bayesian structural equation modeler, you’ll fall in love with it (and won’t be able to get this little parody out of your head). As always, enjoy! You’ll be grateful you took the Bayesian plunge with Sarah’s book as your life raft.

TODD D. LITTLE
Isolating at my “Wit’s End” retreat
Lakeside, Montana
Preface

Until recent years, Bayesian methods within the social and behavioral sciences were seldom used. However, with advances in computational capacities, and exposure to alternative ways of modeling, there has been an increase in use of Bayesian statistics. Bayesian methods have had a strong place in theoretical and simulation work within structural equation modeling (SEM). In a systematic review examining the use of Bayesian statistics, we found that it was not until about 2012 that the field of psychology experienced an uptick in the number of SEM applications implementing Bayesian methods (van de Schoot, Winter, Ryan, Zondervan-Zwijnenburg, & Depaoli, 2017). There were many theoretical and simulation papers published, and a handful of technical books on the topic (see, e.g., Lee, 2007), but application was relatively scarce.

In part, the delayed use of Bayesian methods within SEM was due to a lack of exposure to applied users, as well as relatively complex software for implementation.¹ In 2010, Mplus (L. K. Muthén & Muthén, 1998-2017), one of the most comprehensive latent variable software programs, incorporated Bayesian estimation. The knowledge users needed to implement this new feature was minimal, making Bayesian methods more appealing for applied researchers. Each year in the last decade or so, more and more packages in R have been published that allow for Bayesian implementation or provide graphical resources. Because R is free and relatively easy to use given the extensive online documentation, it is a desirable choice for Bayesian implementation. Programs that can be implemented in R, such as Stan (Stan Development Team, 2020) and blavaan (Merkle & Rosseel, 2018), have provided relatively straightforward implementation of Bayesian methods to a variety of model types. Applied users no longer have to rely on learning a complex new programming language to implement Bayesian methods, and the days of requesting an annual “key” from the BUGS group are behind us.

¹ I use the term “complex” not to criticize certain programming languages. Rather, I want to acknowledge that the start-up knowledge required to implement some of these programs (e.g., WinBUGS) is deep and unappealing to many applied users.

With an increase in application, we have also seen an increase in tutorials and other discussions surrounding the benefits and extensions that Bayesian statistics afford latent variable modeling. Indeed, Bayesian methods offer a more flexible alternative to their frequentist counterparts. Many of these elements of flexibility are illustrated throughout this book. Bayesian statistics are, no doubt, an attractive alternative for implementing SEMs.

To date, there are still very few books tackling Bayesian SEM, and most are written at a more technical level. The goal of this book is to introduce Bayesian SEM to graduate students and researchers in the social and behavioral sciences. This book was written for social and behavioral scientists who are trained in “traditional” statistical methods but want to venture out and explore advanced implementations of SEM through the Bayesian framework. I assume that the reader has some experience with Bayesian statistics. However, for the novice reader who is still interested in this book, I include an introductory chapter with the essential components needed to get started. Each main chapter covers the basics of the models so the reader need not have a strong background in SEM prior to reading this book. A strong background in calculus is not needed, but some familiarity with matrix algebra is useful. However, I make an effort to explain every equation thoroughly so careful readers can pick up all the math that they need to know to understand the models.
Intended Audience and Suggestions for Use

This book was written with a broad audience in mind. Most Bayesian latent variable modeling books are written for advanced users of Bayesian statistics and include extensive derivations. These books are extremely useful resources, and the current book is not meant to directly compete with them. This book was written for an audience of graduate students (master’s or PhD level), faculty from quantitative or applied fields in the social and behavioral sciences, and data scientists looking to implement these techniques in industry. As a result, I have aimed the level of the book to accommodate a broad audience. The main reader is not intended to be a statistician looking to understand derivations. Rather, the intended reader will be a researcher (in quantitative methods or an applied social sciences field, e.g., a health psychologist or sociologist), who aims to implement these methods and properly convey results in articles or research reports.
This book is meant as a guide for implementing Bayesian methods for latent variable models. I have included thorough examples in each chapter, highlighting problems that can arise during estimation, potential solutions, and guides for how to write up findings for a journal article. The current book is not a replacement for a general Bayesian text, nor is it a reference for derivations. For general books on Bayesian statistics, I refer the reader to Gelman, Carlin, et al. (2014), Kaplan (2014), or Kruschke (2015), among many others. For a more technical treatment of Bayesian latent variable modeling, I refer the reader to Lee (2007) or Song and Lee (2012). These books complement the current text and provide more mathematical detail and derivations for the models. This book can be used in the classroom setting (at the master’s or PhD level) for an advanced SEM course, or for a specialized course on Bayesian latent variable modeling. Several areas within the social and behavioral sciences can benefit from this text. For example, researchers in the fields of Psychology, Health Sciences, Public Health, Education, Sociology, and Marketing all regularly implement latent variable models that would benefit from the Bayesian perspective. In order to fully align with these fields, a particular emphasis is placed on the examples provided within the book for sub-disciplines in Psychology, Education, and Public Health (or the Health Sciences).
Organization of the Book

This book is structured into 12 main chapters, beginning with introductory chapters comprising Part I. Chapter 1 is called “Background,” and it highlights the importance of Bayesian methods with SEM, as well as foundational information about SEMs. This chapter foreshadows benefits of the Bayesian perspective that are later illustrated through examples in subsequent chapters. Information regarding the datasets, examples, and notation is also provided in this chapter. Chapter 2 is called “Basic Elements of Bayesian Statistics,” and it provides a brief account of the key concepts underlying Bayesian methods. The purpose of this chapter is to act as a refresher regarding key concepts within Bayesian statistics.

Part II is entitled “Measurement Models and Related Issues.” This section presents Chapters 3-5. Each of these chapters deals with various models and techniques related to measurement models within SEM. Chapter 3 is “The Confirmatory Factor Analysis Model,” and it covers the implementation of Bayesian confirmatory factor analysis (CFA). Many of the key Bayesian elements are described in this chapter in relation to CFA; this provides a foundation for the remaining chapters. Chapter 4 is “Multiple-Group Models,” and it incorporates issues surrounding the assessment of model differences across observed groups. This chapter provides a foundation for Chapter 5, which is “Measurement Invariance Testing.” In general, invariance testing is an area where Bayesian methodology can act as a huge asset in providing added flexibility and improving the accuracy of model results obtained.

Part III is entitled “Extending the Structural Model,” and it contains Chapters 6 and 7. Chapter 6 is “The General Structural Equation Model,” and it includes important information about the addition of a structural part of the model. This added element complicates some features of Bayesian estimation because the prior distributions can be tricky to specify and implement. These issues are highlighted, as well as troubleshooting techniques. Chapter 7 presents an extension entitled “Multilevel Structural Equation Modeling,” which is an area where Bayesian methods can greatly improve the accuracy of model results. Several examples are provided, which highlight how some models that struggle under frequentist methods can shine when Bayesian methods are implemented.

Part IV is entitled “Longitudinal and Mixture Models,” and it contains Chapters 8-10. Chapter 8, “The Latent Growth Curve Model,” introduces the concept of assessing change over time through continuous latent variables. Concepts underlying measurement invariance testing resurface in this chapter as well.
Chapter 9 breaks with SEM tradition and presents “The Latent Class Model,” which covers the benefits of implementing Bayesian methods for categorical latent variable models; the idea of a mixture component is introduced (this is an element greatly benefited by the use of Bayesian methods).² Mixture model and longitudinal topics are extended in Chapter 10, which is “The Latent Growth Mixture Model.” This chapter combines the use of continuous and categorical latent variables and highlights how the Bayesian estimation framework is particularly beneficial for this sort of model.

² Traditionally the latent class model is described in the context of psychometric modeling. However, so-called second generation SEM incorporates models with continuous and categorical latent variables, which includes mixture models, such as this one. This chapter provides a treatment of mixture modeling through the LCA model.

Finally, Part V is called “Special Topics,” and it contains Chapters 11 and 12. Chapter 11 is “Model Assessment,” which covers many important elements regarding Bayesian model selection and assessment. I present “traditional” Bayesian methods, as well as extensions that can be used for determining final model solutions. The last chapter, Chapter 12, is called “Important Points to Consider.” This chapter is aimed toward promoting best practice for implementing latent variable models through the Bayesian framework. There is a particular emphasis placed on proper implementation of the estimation process, reporting standards, and the (strong) importance of conducting a sensitivity analysis (especially on some model parameters found largely in the latent variable modeling context). The goal of this chapter is to ensure that readers are properly implementing and reporting the techniques covered in the preceding chapters. Finally, the book closes with a Glossary, defining key terms used throughout.
Organization within Each Chapter

In order to keep a uniform organization throughout chapters covering specific models and techniques (Chapters 3-11), I have kept the main structure relatively consistent for each of these chapters as follows:

• A short introduction to the model is presented, including why it is important in the social and behavioral sciences.
• The model notation is presented, along with a diagram showing an illustration of the model. LISREL notation is used for the most part, unless otherwise noted.
• The Bayesian form of the model is presented, which includes all prior notation.
• At least one example of implementation is presented in each chapter. The examples have been constructed to show basic use of Bayesian methods for the model, as well as to highlight certain “problems” or “issues” that can arise during Bayesian estimation of the model. The data and examples are used for pedagogical purposes to illustrate different modeling techniques and statistical or estimation issues that arise. No substantive conclusions can be drawn from the examples.
• A section is included for how to write up results for a manuscript. In this section, I pull results from the example provided, write a mock data analysis plan and results section, and comment on topics that can be included in a discussion section write-up.
• A chapter summary and major take-home messages are included for each model presented.
• At the end of each chapter, I include a guide to notation, which helps to avoid overwhelming readers with many pages of notation provided for all chapters at once. The book is notation-heavy, and decentralizing the notation definitions allows for a quick guide to notation at the end of each chapter. See Section 1.4.3 for more details on this decision.
• I provide a small list of suggested readings, as well as annotation describing the content.
• Finally, each chapter is accompanied with software code highlighting select components of the examples provided. All code and datasets are available on the companion website. For each model and technique presented, I provide code for implementation in Mplus and R so that users can work within their preferred platform. The end of each chapter contains pertinent (but sometimes incomplete) sections of code, with full coding files stored on the companion website.
Examples: Data, Software, and Code

Each chapter includes at least one example of implementation. Prior to writing this book, I thought at length about what software or packages I wanted to use. I realized that the software component of the book should mimic how software is actually used. Software and programs are not equally developed within Bayesian implementation for SEMs. Some programs have extensions that others have not yet implemented. In addition, students and researchers are more likely to experiment with new models and techniques when working with software they are already familiar with. For these reasons, I have included example code in every chapter using Mplus, as well as options in the R programming environment. These two options were selected because Mplus is a powerful and popular tool for estimating latent variable models. In addition, R has growing Bayesian SEM capabilities, and it can easily interface with BUGS, JAGS, and Stan.

There are many different packages that can be used within R. I have provided example code using packages and code that represent the best match with the topics being described in the corresponding chapter. At times, the R examples I provide implement the BUGS language because this is the best tool available in R to execute the specific model or priors being described, and other examples use packages such as blavaan. I chose to include code from multiple programs to make the online component of the book (which includes all code and data) as helpful as possible for readers interested in replicating the work presented here. This choice also captures the notion that some programs (or programming languages) are more developed to implement certain models compared to others.

Chapter 1 discusses the data sources in detail. Datasets were selected to span across areas within the social and behavioral sciences. All data are made available on the companion website for download.
Extra Resources

The book is accompanied by a glossary at the end of the book, as well as a companion website.

Glossary

The Glossary includes brief definitions of select key terms used throughout the book.

Companion Website

The companion website (see the box at the end of the table of contents) includes all datasets, annotated code, and annotated output so that readers can replicate what is presented in the book. This is meant to act as a learning tool for readers interested in applying these techniques and models to their own datasets.
Acknowledgments

Years of building foundational knowledge and skills are needed before embarking on the journey to write a book. So many people have contributed to my development as a methodologist and a teacher, as well as my perspectives on the use (and misuse) of Bayesian statistics. An early, and perhaps most impactful, influence was my PhD advisor, David Kaplan. He helped me to harness my inquisitive mind, questioning the why behind techniques and processes. He also modeled a combination of professionalism, dedication, and light-hearted fun, which cannot be rivaled. I am so lucky to continue to have his mentorship, and I strive to play the same role for my own students.

I am also fortunate to have such supportive colleagues, whom I consider to be my teammates as we work toward building a noteworthy program at a new university. First, I am thankful for the Psychological Sciences Department at the University of California, Merced, which has always supported the growth of the quantitative program and embraced the presence of Bayesian statistics within the department. I also thank my colleagues in the Quantitative Methods, Measurement, and Statistics area: Fan Jia, Keke Lai, Haiyan Liu, Ren Liu, and Jack Vevea. They are truly a joy to work with, and I have learned so much from each of them. I am incredibly fortunate to be surrounded by so many brilliant and truly nice colleagues. I especially thank Keke Lai, who never hesitated to help me brainstorm about modeling and notation solutions for this project.

I am also grateful for the time that I got to spend with Will Shadish, who was the first person to encourage me to pursue authoring a book. I reflect on his mentoring advice often, and there were many times when I wished I could have shared my progress on this project with him.

I have collaborated with many people over the years, and they have all helped me gain insight and clarity on methodological topics.
One person in particular that I would like to thank is Rens van de Schoot, whom I partnered with early in my career to tackle issues of transparency within Bayesian statistics. I reflected on our projects many times when writing this book, and I am grateful for the line of work that we built together.
I would also like to thank the many graduate students that I have had the pleasure to work with, including: James Clifton, Patrice Cobb, Johnny Felt, Lydia Marvin, Sanne Smid, Marieke Visser, Sonja Winter, Yuzhu (June) Yang, and Mariëlle Zondervan-Zwijnenburg. Each person helped me to grow as a mentor and become a better teacher. In addition, I used many examples in this book from my previous work with several students, and I am thankful for their contributions in this respect. I would like to particularly thank Sonja Winter, who supplied software support in producing figures for the examples in this book.

It was my honor to work with C. Deborah Laughton, the Methodology and Statistics publisher at The Guilford Press. C. Deborah was a tremendous support as I ventured out to start this project. Her advice and encouragement throughout this process were unmatched. I cannot think of a better person to partner with, and I am so fortunate that I was able to work with her.

I am also grateful for reviews that I received on an earlier version of this book. The names of these reviewers were revealed after the writing for the book was completed. I thank each of them for their time and effort in going through the manuscript. Their advice was on point, and it helped me to refine the messages presented in each chapter. I thank the following people for their input:

• Peng Ding, Department of Statistics, University of California, Berkeley
• Katerina Marcoulides, Department of Psychology, University of Minnesota Twin Cities
• Michael D. Toland, Executive Director, The Herb Innovation Center, Judith Herb College of Education, University of Toledo

Finally, I am indebted to my husband, Darrin, for his unwavering support, patience, and love. His dedication to me and our boys is inspiring, and I am fortunate beyond words to have him in my life.
Contents
Part I. Introduction 1
Background 1.1 Bayesian Statistical Modeling: The Frequency of Use / 3 1.2 The Key Impediments within Bayesian Statistics / 6 1.3 Benefits of Bayesian Statistics within SEM / 9 1.4
1.5
2
1.3.1
A Recap: Why Bayesian SEM? / 12
1.4.1 1.4.2 1.4.3
The Fundamentals of SEM Diagrams and Terminology / 13 LISREL Notation / 17 Additional Comments about Notation / 19
1.5.1 1.5.2 1.5.3 1.5.4 1.5.5 1.5.6 1.5.7 1.5.8
Cynicism Data / 21 Early Childhood Longitudinal Survey–Kindergarten Class / 21 Holzinger and Swineford (1939) / 21 IPIP 50: Big Five Questionnaire / 22 Lakaev Academic Stress Response Scale / 23 Political Democracy / 23 Program for International Student Assessment / 24 Youth Risk Behavior Survey / 25
3
Mastering the SEM Basics: Precursors to Bayesian SEM / 12
Datasets Used in the Chapter Examples / 20
Basic Elements of Bayesian Statistics 2.1 A Brief Introduction to Bayesian Statistics / 26 2.2 Setting the Stage / 27 2.3 Comparing Frequentist and Bayesian Estimation / 29 2.4 The Bayesian Research Circle / 31 2.5 Bayes’ Rule / 32 2.6 Prior Distributions / 34 2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.6.6 2.6.7 2.6.8
26
The Normal Prior / 35 The Uniform Prior / 35 The Inverse Gamma Prior / 35 The Gamma Prior / 36 The Inverse Wishart Prior / 36 The Wishart Prior / 36 The Beta Prior / 37 The Dirichlet Prior / 37
xix
xx
Contents 2.6.9 Different Levels of Informativeness for Prior Distributions / 38 2.6.10 Prior Elicitation / 39 2.6.11 Prior Predictive Checking / 42
2.7 2.8
2.9
The Likelihood (Frequentist and Bayesian Perspectives) / 43 The Posterior / 45
2.8.1 2.8.2 2.8.3 2.8.4 2.8.5 2.8.6 2.8.7
An Introduction to Markov Chain Monte Carlo Methods / 45 Sampling Algorithms / 47 Convergence / 52 MCMC Burn-In Phase / 53 The Number of Markov Chains / 53 A Note about Starting Values / 54 Thinning a Chain / 54
2.9.1 2.9.2 2.9.3 2.9.4 2.9.5 2.9.6 2.9.7 2.9.8 2.9.9
Posterior Summary Statistics / 55 Intervals / 56 Effective Sample Size / 56 Trace-Plots / 57 Autocorrelation Plots / 57 Posterior Histogram and Density Plots / 57 HDI Histogram and Density Plots / 57 Model Assessment / 58 Sensitivity Analysis / 58
Posterior Inference / 55
2.10 A Simple Example / 62 2.11 Chapter Summary / 71
2.11.1 Major Take-Home Points / 71 2.11.2 Notation Referenced / 73 2.11.3 Annotated Bibliography of Select Resources / 75
Appendix 2.A: Getting Started with R / 76
Part II. Measurement Models and Related Issues 3
The Confirmatory Factor Analysis Model 3.1 Introduction to Bayesian CFA / 89 3.2 The Model and Notation / 91 3.3
3.4 3.5 3.6
3.2.1
Handling Indeterminacies in CFA / 93
3.3.1 3.3.2 3.3.3 3.3.4
Additional Information about the (Inverse) Wishart Prior / 97 Alternative Priors for Covariance Matrices / 100 Alternative Priors for Variances / 100 Alternative Priors for Factor Loadings / 101
3.6.1 3.6.2 3.6.3
Hypothetical Data Analysis Plan / 125 Hypothetical Results Section / 125 Discussion Points Relevant to the Analysis / 127
The Bayesian Form of the CFA Model / 96
Example 1: Basic CFA Model / 101 Example 2: Implementing Near-Zero Priors for Cross-Loadings / 120 How to Write Up Bayesian CFA Results / 124
89
Contents 3.7
Chapter Summary / 128 3.7.1 3.7.2 3.7.3 3.7.4 3.7.5
4
xxi
Major Take-Home Points / 128 Notation Referenced / 131 Annotated Bibliography of Select Resources / 132 Example Code for Mplus / 133 Example Code for R / 136
Multiple-Group Models 4.1 A Brief Introduction to Multiple-Group Models / 138 4.2 Introduction to the Multiple-Group CFA Model (with Mean Differences) / 139 4.3 The Model and Notation / 140 4.4 The Bayesian Form of the Multiple-Group CFA Model / 142 4.5 Example 1: Using a Mean-Difference, Multiple-Group CFA Model to Assess for School Differences / 144 4.6 Introduction to the MIMIC Model / 153 4.7 The Model and Notation / 153 4.8 The Bayesian Form of the MIMIC Model / 154 4.9 Example 2: Using the MIMIC Model to Assess for School Differences / 156 4.10 How to Write Up Bayesian Multiple-Group Model Results with Mean Differences / 158
138
4.10.1 Hypothetical Data Analysis Plan / 158 4.10.2 Hypothetical Results Section / 159 4.10.3 Discussion Points Relevant to the Analysis / 160
4.11 Chapter Summary / 161 4.11.1 4.11.2 4.11.3 4.11.4 4.11.5
5
Major Take-Home Points / 162 Notation Referenced / 163 Annotated Bibliography of Select Resources / 165 Example Code for Mplus / 166 Example Code for R / 167
Measurement Invariance Testing 5.1 A Brief Introduction to MI in SEM / 169 5.2 5.3 5.4 5.5
5.6
5.1.1 5.1.2
Stages of Traditional MI Testing / 170 Challenges within Traditional MI Testing / 172
5.5.1 5.5.2 5.5.3
Results for the Conventional MI Tests / 181 Results for the Bayesian Approximate MI Tests / 182 Results Comparing Latent Means across Approaches / 184
5.6.1 5.6.2 5.6.3
Hypothetical Data Analysis Plan / 187 Hypothetical Analytic Procedure / 188 Hypothetical Results Section / 189
Bayesian Approximate MI / 173 The Model and Notation / 174 Priors within Bayesian Approximate MI / 176 Example: Illustrating Bayesian Approximate MI for School Differences / 178
How to Write Up Bayesian Approximate MI Results / 186
169
xxii
Contents
5.7
5.6.4
Discussion Points Relevant to the Analysis / 190
5.7.1 5.7.2 5.7.3 5.7.4 5.7.5
Major Take-Home Points / 190 Notation Referenced / 192 Annotated Bibliography of Select Resources / 193 Example Code for Mplus / 194 Example Code for R / 195
Chapter Summary / 190
Part III. Extending the Structural Model
6
The General Structural Equation Model / 199
6.1 Introduction to Bayesian SEM / 199
6.2 The Model and Notation / 201
6.3 The Bayesian Form of SEM / 203
6.4 Example: Revisiting Bollen’s (1989) Political Democracy Example / 204
6.4.1 Motivation for This Example / 205
6.4.2 The Current Example / 206
6.5 How to Write Up Bayesian SEM Results / 213
6.5.1 Hypothetical Data Analysis Plan / 213
6.5.2 Hypothetical Results Section / 214
6.5.3 Discussion Points Relevant to the Analysis / 215
6.6 Chapter Summary / 216
6.6.1 Major Take-Home Points / 217
6.6.2 Notation Referenced / 219
6.6.3 Annotated Bibliography of Select Resources / 221
6.6.4 Example Code for Mplus / 222
6.6.5 Example Code for R / 223
Appendix 6.A: Causal Inference and Mediation Analysis / 224
7
Multilevel Structural Equation Modeling / 228
7.1 Introduction to MSEM / 228
7.1.1 MSEM Applications / 230
7.1.2 Contextual Effects / 232
7.2 Extending MSEM into the Bayesian Context / 233
7.3 The Model and Notation / 235
7.4 The Bayesian Form of MSEM / 238
7.5 Example 1: A Two-Level CFA with Continuous Items / 243
7.5.1 Implementation of Example 1 / 244
7.5.2 Example 1 Results / 246
7.6 Example 2: A Three-Level CFA with Categorical Items / 247
7.6.1 Implementation of Example 2 / 253
7.6.2 Example 2 Results / 253
7.7 How to Write Up Bayesian MSEM Results / 258
7.7.1 Hypothetical Data Analysis Plan / 258
7.7.2 Hypothetical Results Section / 259
7.7.3 Discussion Points Relevant to the Analysis / 260
7.8 Chapter Summary / 261
7.8.1 Major Take-Home Points / 262
7.8.2 Notation Referenced / 264
7.8.3 Annotated Bibliography of Select Resources / 267
7.8.4 Example Code for Mplus / 268
7.8.5 Example Code for R / 268
Part IV. Longitudinal and Mixture Models
8
The Latent Growth Curve Model / 275
8.1 Introduction to Bayesian LGCM / 275
8.2 The Model and Notation / 276
8.2.1 Extensions of the LGCM / 279
8.3 The Bayesian Form of the LGCM / 280
8.3.1 Alternative Priors for the Factor Variances and Covariances / 281
8.4 Example 1: Bayesian Estimation of the LGCM Using ECLS–K Reading Data / 283
8.5 Example 2: Extending the Example to Include Separation Strategy Priors / 287
8.6 Example 3: Extending the Framework to Assessing MI over Time / 291
8.7 How to Write Up Bayesian LGCM Results / 297
8.7.1 Hypothetical Data Analysis Plan / 297
8.7.2 Hypothetical Results Section / 298
8.7.3 Discussion Points Relevant to the Analysis / 299
8.8 Chapter Summary / 299
8.8.1 Major Take-Home Points / 300
8.8.2 Notation Referenced / 302
8.8.3 Annotated Bibliography of Select Resources / 304
8.8.4 Example Code for Mplus / 305
8.8.5 Example Code for R / 305
9
The Latent Class Model / 308
9.1 A Brief Introduction to Mixture Models / 308
9.2 Introduction to Bayesian LCA / 309
9.3 The Model and Notation / 310
9.3.1 Introducing the Issue of Class Separation / 312
9.4 The Bayesian Form of the LCA Model / 313
9.4.1 Adding Flexibility to the LCA Model / 314
9.5 Mixture Models, Label Switching, and Possible Solutions / 315
9.5.1 Identifiability Constraints / 319
9.5.2 Relabeling Algorithms / 320
9.5.3 Label Invariant Loss Functions / 321
9.5.4 Final Thoughts on Label Switching / 321
9.6 Example: A Demonstration of Bayesian LCA / 321
9.6.1 Motivation for This Example / 322
9.6.2 The Current Example / 324
9.7 How to Write Up Bayesian LCA Results / 340
9.7.1 Hypothetical Data Analysis Plan / 340
9.7.2 Hypothetical Results Section / 341
9.7.3 Discussion Points Relevant to the Analysis / 343
9.8 Chapter Summary / 344
9.8.1 Major Take-Home Points / 344
9.8.2 Notation Referenced / 346
9.8.3 Annotated Bibliography of Select Resources / 347
9.8.4 Example Code for Mplus / 348
9.8.5 Example Code for R / 352
10
The Latent Growth Mixture Model / 354
10.1 Introduction to Bayesian LGMM / 354
10.2 The Model and Notation / 356
10.2.1 Concerns with Class Separation / 359
10.3 The Bayesian Form of the LGMM / 363
10.3.1 Alternative Priors for Factor Means / 365
10.3.2 Alternative Priors for the Measurement Error Covariance Matrix / 365
10.3.3 Alternative Priors for the Factor Covariance Matrix / 365
10.3.4 Handling Label Switching in LGMMs / 365
10.4 Example: Comparing Different Prior Conditions in an LGMM / 366
10.5 How to Write Up Bayesian LGMM Results / 378
10.5.1 Hypothetical Data Analysis Plan / 378
10.5.2 Hypothetical Results Section / 379
10.5.3 Discussion Points Relevant to the Analysis / 381
10.6 Chapter Summary / 381
10.6.1 Major Take-Home Points / 382
10.6.2 Notation Referenced / 384
10.6.3 Annotated Bibliography of Select Resources / 386
10.6.4 Example Code for Mplus / 387
10.6.5 Example Code for R / 387
Part V. Special Topics
11
Model Assessment / 393
11.1 Model Comparison and Cross-Validation / 395
11.1.1 Bayes Factors / 395
11.1.2 The Bayesian Information Criterion / 398
11.1.3 The Deviance Information Criterion / 400
11.1.4 The Widely Applicable Information Criterion / 402
11.1.5 Leave-One-Out Cross-Validation / 403
11.2 Model Fit / 404
11.2.1 Posterior Predictive Model Checking / 404
11.2.2 Missing Data and the PPC Procedure / 409
11.2.3 Testing Near-Zero Parameters through the PPPP / 410
11.3 Bayesian Approximate Fit / 411
11.3.1 Bayesian Root Mean Square Error of Approximation / 412
11.3.2 Bayesian Tucker-Lewis Index / 413
11.3.3 Bayesian Normed Fit Index / 414
11.3.4 Bayesian Comparative Fit Index / 414
11.3.5 Implementation of These Indices / 415
11.4 Example 1: Illustrating the PPC and the PPPP for CFA / 416
11.5 Example 2: Illustrating Bayesian Approximate Fit for CFA / 419
11.6 How to Write Up Bayesian Approximate Fit Results / 422
11.6.1 Hypothetical Data Analysis Plan / 422
11.6.2 Hypothetical Results Section / 423
11.6.3 Discussion Points Relevant to the Analysis / 425
11.7 Chapter Summary / 425
11.7.1 Major Take-Home Points / 425
11.7.2 Notation Referenced / 427
11.7.3 Annotated Bibliography of Select Resources / 431
11.7.4 Example Code for Mplus / 432
11.7.5 Example Code for R / 432
12
Important Points to Consider / 434
12.1 Implementation and Reporting of Bayesian Results / 434
12.1.1 Priors Implemented / 435
12.1.2 Convergence / 435
12.1.3 Sensitivity Analysis / 435
12.1.4 How Should We Interpret These Findings? / 436
12.2 Points to Check Prior to Data Analysis / 436
12.2.1 Is Your Model Formulated “Correctly”? / 436
12.2.2 Do You Understand the Priors? / 440
12.3 Points to Check after Initial Data Analysis, but before Interpretation of Results / 443
12.3.1 Convergence / 443
12.3.2 Does Convergence Remain after Doubling the Number of Iterations? / 448
12.3.3 Is There Ample Information in the Posterior Histogram? / 450
12.3.4 Is There a Strong Degree of Autocorrelation in the Posterior? / 452
12.3.5 Does the Posterior Make Substantive Sense? / 455
12.4 Understanding the Influence of Priors / 456
12.4.1 Examining the Influence of Priors on Multivariate Parameters (e.g., Covariance Matrices) / 457
12.4.2 Comparing the Original Prior to Other Diffuse or Subjective Priors / 460
12.5 Incorporating Model Fit or Model Comparison / 462
12.6 Interpreting Model Results the “Bayesian Way” / 463
12.7 How to Write Up Bayesian Results / 464
12.7.1 (Hypothetical) Results for Bayesian Two-Factor CFA / 465
12.8 How to Review Bayesian Work / 469
12.9 Chapter Summary and Looking Forward / 470
Glossary / 473
References / 482
Author Index / 499
Subject Index / 504
About the Author / 52
Purchasers of this book can access a companion website that supplies datasets; annotated code for implementation in both Mplus and R, so that users can work with their preferred platform; and output for all of the book’s examples at www.guilford.com/depaoli-materials.
Part I
INTRODUCTION
1
Background
The current chapter provides background information and context for the Bayesian implementation of structural equation models (SEMs). The aim of this book is to highlight various aspects of this estimation process as it relates to SEM, including benefits and caveats (or dangers). The current chapter provides a basic introduction to the book, setting the stage for more complex topics in later chapters. First, the frequency of use of Bayesian estimation methods is described, and this is followed by a discussion of several impediments within Bayesian statistical modeling. Next, a presentation of benefits of implementing Bayesian methods within the SEM framework is provided. This is followed by a description of several foundational elements, including terminology and notation, within the SEM framework. These elements are needed prior to delving into the Bayesian implementation of SEMs in subsequent chapters. This chapter concludes with a description of the datasets implemented in the examples provided throughout the book.
1.1
Bayesian Statistical Modeling: The Frequency of Use
Bayesian analysis is an established branch of methodology for model estimation. This is partly due to a combination of two aspects: the advent of Markov chain Monte Carlo (MCMC) methods, and the increased popularity of Bayesian methodology. MCMC methods encompass a growing set of computational algorithms that can be used to solve high-dimensional and complex modeling situations. Among other applications, MCMC methods can be used to help with Bayesian estimation by reconstructing the posterior distribution, a topic I cover in greater detail in Chapter 2. Perhaps as a result of the computational advances, Bayesian methods have increased in use, thus exposing applied researchers to new tools rich with information and flexibility.
Bayesian methods are not yet entrenched in all substantive fields, but there has been a rather drastic increase in use. Several systematic reviews have shown a steady increase in use of Bayesian methods in the fields of Organizational Science (Kruschke, 2010), Health Technology Assessment (Spiegelhalter, Myles, Jones, & Abrams, 2000), and Epidemiology and Medicine (Ashby, 2006; Rietbergen, Debray, Klugkist, Janssen, & Moons, 2017), as well as within item response theory (IRT; Rupp, Dey, & Zumbo, 2004) and simulation work on SEMs (Smid, McNeish, Miočević, & van de Schoot, 2019). In addition, Bayesian methods are being further highlighted by being the focus of special issues of journals; for example, Frontiers in Psychology: Quantitative Psychology and Measurement has a new special issue entitled “Moving Beyond Non-Informative Prior Distributions: Achieving the Full Potential of Bayesian Methods for Psychological Research.” Trends across substantive fields can be viewed in Figure 1.1, which was constructed based on a cursory search on Scopus with the search word “Bayesian” (and excluding “Bayesian information criterion”). This figure illustrates an increase in use of Bayesian methods over time across many disciplines (1990-2015).
Recently, I was involved with a large systematic review of Bayesian methods focused on the Psychological Sciences (van de Schoot et al., 2017). The review spanned the literature from 1990-2015 to capture papers using Bayesian methodology within Psychology, or closely related fields. We identified 1,579 viable papers to include in the review, and there were several important trends.1 Within the group of papers, the following main article types were explored: empirical, theoretical, simulation, tutorial, Bayesian meta-analysis, and commentary. The use across all categories, with the exception of “commentary,” trended upward. Notably, there was a spike in empirical applications (across all regression-based model-types) of Bayesian methods starting in about 2010. The percentage of papers being published in the Psychological Sciences has increased steadily over time, indicating an increase in interest in Bayesian methods.2
1 For inclusion, the papers mentioned any of the following terms in the title, abstract, or keywords: Bayesian, Gibbs sampler, MCMC, prior distribution, or posterior distribution. Given that MCMC can be used for Bayesian or frequentist estimation, we further restricted this term. For papers noting “MCMC,” we only extracted those that used MCMC with observed data and a prior in order to sample from the posterior. We extracted all papers that were published in peer-reviewed journals with “Psychology” listed as one of the journal’s topics in Scopus. Journals could have also listed the following fields: Arts and Humanities, Business, Decision Sciences, Economics, or Sociology. Papers solely mentioning the “Bayesian information criterion” were excluded.
2 Although the percentage of papers (and absolute number) has increased over time, it is important to note that the overwhelming majority of publications are still implementing frequentist methods.
FIGURE 1.1. The Use of Bayesian Statistics across Fields (from a Cursory Scopus Search). This figure was extracted from van de Schoot et al. (2017).
When examining trends specifically within regression-based models, SEM was the second most common model for technical papers (16.2%) and simulation papers (25.8%) published using Bayesian methods. Out of all regression-based models examined, SEM had the most Bayesian applications (26.0%). We found in the review that technical papers, simulation papers, and applications of Bayesian SEM all increased over time (see Figure 1.2). Technical and simulation papers tended to increase at a steadier rate compared to applications, which have experienced a more drastic increase in the literature. Specifically, there was a striking increase in the number of Bayesian SEM applications starting around 2012. The systematic review concluded with a prediction that there will continue to be a faster increase in Bayesian SEM applications in the coming years.
FIGURE 1.2. Papers Using Bayesian Estimation for SEMs. This figure was extracted from van de Schoot et al. (2017).
1.2
The Key Impediments within Bayesian Statistics
There are several key impediments embedded within the movement toward broader implementation of Bayesian statistics. I have kept three main impediments in mind while writing various sections of this book: (1) proper implementation, (2) training of the next generation of users, and (3) training of the current scholars in teaching- or reviewing-type roles. The first issue that I see is that of proper implementation. Of course, this is not an issue unique to Bayesian statistics. Statisticians have been concerned with this issue throughout the entire timespan of modern statistical modeling. Many pieces have been written (see, e.g., Gigerenzer, Krauss, & Vitouch, 2004) about improper implementation and interpretation of statistical tools. Although this issue of proper use and interpretation spans all areas of statistics and modeling, it is a key issue within Bayesian methods.
When I was first learning about Bayesian statistics and estimation methods, there were no “easy” tools for implementation. Estimating a model using Bayesian statistics required extensive knowledge of all components of the model and estimation process being used. Sure, there was plenty that could still be improperly implemented in the process. However, the learning curve for programming was so steep that it required a complete understanding of the model and at least a semi-complete understanding of the estimation algorithm being implemented. It was pretty clear if something was not correctly programmed, and it was common practice to thoroughly (and I mean thoroughly) examine the resulting chains for anything appearing abnormal. Just as we have seen with statistical models, like SEMs, implementation of Bayesian methods has improved vastly over the last several decades. There are now extensive packages within R, and other user-friendly software, that require little to no knowledge of the underpinnings of Bayesian methods. The increase in simple tools to use makes implementation rather straightforward, but it also creates an issue that it is even easier to misuse the tools and interpret findings incorrectly. A user could (unintentionally) implement the Bayesian process incorrectly, not know how to check for problems, and report misleading results. It is frightening to think how easily this exact story can play out. As a result, the field needs to push thorough training in Bayesian techniques to ensure that users are implementing and interpreting findings correctly. I (and many others) have done some work in this area (see, e.g., Depaoli & van de Schoot, 2017), but I worry that modern advances have made it too easy to make mistakes in implementation and interpretation. This issue leads me to the next key impediment. The second key impediment, which I believe to be linked to the issue of implementation, is training the next generation of users. 
When I was interviewing for my first academic job, I had a particularly interesting meeting with the school Dean. He was not in my field and, to my (mistaken) knowledge, he was also not adept in statistics. I went in thinking that we would talk about the more “typical” issues that are discussed in Dean meetings (what resources I would need, how much lab space, etc.). Instead, he opened by asking me: Do you think you would ever teach Bayesian statistics (not just Bayes’ rule, but full implementation of Bayesian estimation) at the undergraduate level? Why or why not? My gut reaction was to say no. I was interviewing for a position in a Psychology department, and teaching about MCMC methods (for example) in the undergraduate statistics series seemed implausible. I cited the vast start-up knowledge required for Bayesian methods and the fact that there usually is not even enough time
to lay a proper foundation of conventional statistical theory. We had a nice conversation about this and moved on to other topics. Fast-forward several years, and this conversation still replays in my mind. Did I answer the question correctly? Do I still feel this way? I guess that I am still sorting out the issue. In fact, I have brought this topic up with many of my colleagues and friends in the field over the years. In these conversations, I have been met with a variety of answers, but almost everyone agrees that there is inevitably too much to teach in order to provide a solid foundation of statistics.
I feel that one of the issues we (i.e., the field) need to face head-on is training at a speed and level that will allow students to keep up with the growing demands and use of advanced statistical methods. A major issue that I see now is that some scholars-in-training want to implement methods that they have not yet been fully trained in. I believe this practice will only increase as Bayesian methods become more mainstream and straightforward to implement. So, when should the training start? I’m not sure that I have formulated a concrete answer to this question yet. There are a lot of things that get in the way of increasing content in undergraduate and graduate training programs. Maybe one solution is to promote short courses and workshops aimed at students. Or perhaps faculty advisors need to be more adept at saying “no, you are not trained to do that yet” when a student wants to use advanced methods. At any rate, the issue of proper training of students is still one that needs to be addressed as a large-scale, pedagogical issue within the social and behavioral sciences.
Finally, with use of these methods on the rise in statistical and applied journals, it is important that reviewers are trained on what to look out for. What makes for impeccable versus sloppy implementation of Bayesian methods?
How is an applied researcher, who is trained to be an expert in substantive theory (and not statistics), supposed to properly assess Bayesian work coming through journals or grant applications? I described in the previous point that thorough training for the next generation of users is needed, but this training is also key at the level of faculty, journal reviewers, and those sitting on grant review panels. In Chapter 12, I provide thorough points to consider for scholars filling these important roles. Overall, these impediments are issues that I believe methodologists should be mindful to address in our work. Promoting proper training and use, at all levels, is imperative to the production of good science. These issues are especially prevalent in the Bayesian implementation of latent variable models where, as I will demonstrate through a variety of applied examples, there are a whole host of issues that can go awry.
1.3
Benefits of Bayesian Statistics within SEM
There are many different reasons why a researcher may prefer Bayesian estimation to traditional, frequentist (e.g., maximum likelihood; ML) estimation. The main reasons for using Bayesian methods that I highlight here are as follows: (1) the models are too “complex” for traditional methods to handle (i.e., models can be made less computationally demanding, and new models can be explored that are not viable in the frequentist framework), (2) only relatively small sample sizes are available, (3) the researcher wants to include background information in the estimation process, and (4) there is a preference for the types of results that Bayesian methods produce.
The first listed issue refers to situations where the intended statistical model is either too “complex” in form or otherwise intractable (e.g., not identified) to implement in the frequentist framework. Some advanced models implement numerical integration, which can be intractable due to the high-dimensional integration that is needed to solve for the ML estimates. Model non-identification is also a reason that some researchers choose to move into the Bayesian framework. The Bayesian framework does not require traditional model identification to estimate model parameters. Non-identification of a model can cause some other issues with estimates in some modeling instances (especially in how the priors impact results), but the model can still be freely estimated under Bayes. When a model is not identified in the traditional sense, the Bayesian framework can allow for all parameters to be estimated (which would not otherwise be possible in the frequentist framework); see, for example, S.-Y. Kim, Suh, Kim, Albanese, and Langer (2013) for an example of non-identification. Another case of this was demonstrated in B. O. Muthén and Asparouhov (2012a), where the authors illustrated how Bayesian methods can provide a more flexible framework for substantive inquiries involving latent variables.
In addition, some models that have added complexity (e.g., mixture models, or some multilevel models) may be better off estimated in the Bayesian framework. For some models, extensive simulations have shown that estimation accuracy is quite poor in the frequentist setting. For example, Depaoli (2013) illustrated how the Bayesian framework can improve upon parameter estimate accuracy in the context of longitudinal mixture models (e.g., those with latent, or unobserved, classes). Some additional examples, where the Bayesian framework provides more accurate results, are with latent variable multilevel models (Depaoli & Clifton, 2015), multiple-group growth models with unbalanced group sizes (Zondervan-Zwijnenburg, Depaoli, Peeters, & van de Schoot, 2019), and measurement invariance
testing situations (Cieciuch, Davidov, Schmidt, Algesheimer, & Schwartz, 2014). A more specific example that is highlighted in various places in this book has to do with the fact that variances (and covariances) are more difficult to estimate compared to means. Typically, the likelihoods for variance and covariance parameters are flatter compared to the likelihood for a mean. Information from the prior can aid in improving the accuracy of estimation of these parameters, which are often important components of SEMs. The second issue listed deals with the size of the sample available to the researcher. Frequentist methods rely on large sample theory. Thus, some models are only appropriate for larger sample sizes in the frequentist framework. However, there are some substantive areas in which larger samples are simply not viable. It could be that data are very expensive to obtain or analyze, restricting the amount of data that can be collected for a given study. There are also cases in which there is limited access to data. For example, when studying a rare disease or a small population, researchers may not have access to a large pool of participants. In these cases, the frequentist framework may produce inaccurate results, leaving researchers with incorrect substantive conclusions. The Bayesian framework has been shown, with proper use, to aid in estimation when only smaller sample sizes are available (see, e.g., Depaoli, Rus, Clifton, van de Schoot, & Tiemensma, 2017; Zhang, Hamagami, Wang, Nesselroade, & Grimm, 2007; Zondervan-Zwijnenburg et al., 2019). The phrase used above, “with proper use,” is where many of the aims for this book come into the picture. The Bayesian estimation framework is not magic– it does not have the capability to create an accurate picture of results for small samples without additional information. This information is located in the prior distributions that are implemented in the estimation process. 
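The role that the prior plays when the sample is small can be illustrated with a minimal sketch. It is not taken from the book or its companion code: it assumes the simplest possible setting (a single normal mean with a known data variance and a conjugate normal prior), and the sample values, prior settings, and the function name `normal_posterior` are all hypothetical choices made for illustration.

```python
from statistics import mean


def normal_posterior(data, prior_mean, prior_var, data_var):
    """Posterior mean and variance of a normal mean (data variance treated as known)."""
    n = len(data)
    prior_precision = 1.0 / prior_var  # information carried by the prior
    data_precision = n / data_var      # information carried by the sample
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * mean(data))
    return post_mean, post_var


# Hypothetical small sample (n = 4); every number here is made up for illustration.
small_sample = [4.1, 5.3, 6.0, 4.6]
sample_mean = mean(small_sample)

# Three hypothetical normal priors for the same unknown mean.
informative = normal_posterior(small_sample, prior_mean=2.0, prior_var=1.0, data_var=4.0)
diffuse = normal_posterior(small_sample, prior_mean=2.0, prior_var=1000.0, data_var=4.0)
misplaced = normal_posterior(small_sample, prior_mean=-5.0, prior_var=1.0, data_var=4.0)

print(f"sample mean = {sample_mean:.3f}")
for label, (post_mean, post_var) in [
    ("informative", informative),
    ("diffuse", diffuse),
    ("misplaced", misplaced),
]:
    print(f"{label:<12} posterior mean = {post_mean:.3f}, posterior variance = {post_var:.3f}")
```

In this sketch the informative prior carries as much information as the four observations, so the posterior mean lands roughly midway between the prior mean and the sample mean, and the posterior variance is smaller than under the diffuse prior. The diffuse prior leaves the estimate close to the sample mean. The misplaced prior is just as confident as the informative one but centered far from the data, and it pulls the estimate just as strongly, which is one way a prior can mislead.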
The reason that instances with small samples can be aided by the Bayesian estimation framework is that extreme estimates “shrink” toward the prior. The use of priors in this context can be a very beneficial tool, but priors can also be quite misleading or even dangerous if misused. This book focuses much more on this issue of how prior distributions can be used to aid in proper estimation, especially when sample sizes are smaller (a common topic area in SEM; e.g., how low can the sample size go while still being able to implement the SEM?). I will also focus on how to examine the impact of these priors and avoid misleading results.
The third reason listed is when the researcher wants to implement the Bayesian framework. There may be no other reason (e.g., the model is identified and sample sizes are adequate) aside from the researcher simply wanting to use prior knowledge. It may be that researchers want to incorporate prior knowledge (e.g., obtained through experts, a meta-analysis, or some other means) into the estimation process in order to fully incorporate theory or knowledge; see, for example, Zondervan-Zwijnenburg, Peeters, Depaoli, and van de Schoot (2017) for an example of how to incorporate knowledge from experts. In this case, the researcher is acknowledging that there is previous information about model parameters, and she is directly incorporating it into the analysis process of current data. In the frequentist framework, there is no such mechanism that allows a researcher to directly acknowledge what the field has learned about model parameters. All model parameters are treated as completely unknown, but this may not actually be the case. For example, Zondervan-Zwijnenburg et al. (2017) used previous research and expert knowledge to determine a reasonable range for the initial status and change over time of adolescents’ working memory scores. Adding this knowledge into the estimation process was key to obtaining accurate and substantively important results.
Finally, the framework provides a more complete picture of population parameters, and researchers are able to narrate full distributions rather than a simple point estimate. The Bayesian estimation framework can be a rich source of information, regardless of how priors are used or what sort of model is being implemented. A nice example of this is provided in Kruschke (2013), which illustrates how the Bayesian estimation framework can make a model as simple as a t-test more informative to applied researchers. This article highlights the fact that Bayesian methods do not have to be reserved for complex modeling situations with small samples; this framework can be highly informative even in the simplest modeling contexts. All of these benefits extend to the SEM framework.
As I will demonstrate in various chapters, the use of Bayesian methods–and specifically priors–allows for a more flexible treatment of traditional latent variable models. Not only can Bayesian methods improve the accuracy of results obtained, but they can also allow new questions to be answered that frequentist methods cannot accommodate. Finally, there are effectively two groups of Bayesian SEMs. The careful reader will be able to distinguish the difference between these two throughout the book. In some examples, I use Bayesian methods for estimation purposes. These examples represent SEMs estimated via Bayesian methods. In other examples, I implement Bayesian tools into the model, thus changing the way the model is constructed. In other words, the model could not be implemented using frequentist methods because only Bayesian methods allow for the flexibility needed for such a model. One example of the latter
type of model is presented in Section 3.5, where I illustrate how Bayesian methods can change the specification of the model altogether.
1.3.1
A Recap: Why Bayesian SEM?
All of the models contained within the traditional SEM framework stand to benefit from the increased flexibility of releasing restrictive model constraints. For example, as described in Chapter 3, the CFA model can be implemented in a far more flexible manner when certain priors are implemented. This flexibility can be further extended into the context of measurement invariance testing and multiple-group comparisons with SEM. In addition, so-called second generation SEM incorporates continuous and categorical latent variables into a single model. This extended framework produces many different model forms of substantive interest, for example, the latent growth mixture model (see Chapter 10), in which continuous latent growth is captured within categorical latent classes. Bayesian methodology can provide useful tools that allow for more accurate results in situations implementing a combination of continuous and categorical latent variables. SEM is rich with different model types, and this modeling framework has been used in a variety of substantive contexts spanning virtually every major field. Bayesian implementation of SEMs allows for, in some cases, a more flexible and accurate account of findings. It also provides an expansion of the types of research questions that can be examined, and it produces a rich set of results to interpret. Before delving into specific model forms in subsequent chapters, it is important to cover some basic terminology and notation that defines the SEM framework.
1.4
Mastering the SEM Basics: Precursors to Bayesian SEM
There are many foundational elements needed prior to learning about the Bayesian treatment of SEMs. In many ways, it is beyond the scope of the current book to provide a thorough background treatment of SEM and Bayesian methodology prior to delving into Bayesian SEM; it would essentially require a three-volume series to cover all of these topics properly. I acknowledge that there are many aspects of each topic not presented in detail here. However, I will provide the basic prerequisite knowledge surrounding SEM and Bayesian methodology for readers to comprehend (and feel comfortable with) the content presented in subsequent chapters.
The remaining portions of the current chapter present the key elements of SEM that are necessary prior to delving into the model-based topics covered throughout this book. For readers already familiar with the basics underlying SEM, the following sections of this chapter can be skipped. For novice readers, or those looking for a refresher, these sections will provide important elements to get started with the remaining book content. In addition, Chapter 2 covers the basic elements of Bayesian statistical modeling that are required. A reader with a solid foundation in Bayesian statistical modeling may not need to cover Chapter 2 in great detail, but novice readers will find the material imperative for understanding the remaining chapters in the book.
1.4.1
The Fundamentals of SEM Diagrams and Terminology
This section is not meant to provide specifics surrounding certain model types. Instead, it is meant to familiarize the reader with terminology and diagram-based symbols that are commonly implemented within SEM (and are present throughout this book). One of the main elements within the SEM framework is the distinction between observed and latent variables. Observed variables represent the information that has been collected during the data collection process. These are the variables that represent the numeric columns in the datafile. Observed variables can be continuous or categorical (e.g., binary, Likert type, count). Latent variables represent unobserved constructs that are composed of observed variables. They represent entities that are not directly observable and, therefore, latent variables are not represented in the datafile. As an example, depression (depending on the definition) may not be directly observable. There is not a single element, or variable, that fully represents depression as a construct. In other words, depression is not an observable variable. Instead, observable symptoms or features of depression can be directly measured from participants. These observable variables can form a latent representation of depression, where the latent construct is able to capture many different facets of depression that are directly measurable (e.g., sleep and eating habits, lack of concentration, sadness, excessive crying, agitation, and social isolation). For SEM diagrams, which are common visual representations of models, observed variables are represented by squares and latent variables are represented by circles, as denoted in Figure 1.3.
FIGURE 1.3. An Example of Observed and Latent Variable Diagram Symbols.
The construction of a latent variable occurs through a measurement model, in which observed variables are used as indicators for a latent construct. In this context, observed items (sometimes called item indicators) capturing features of depression would be collected from participants. Participants may provide information about the number of hours they sleep per day, how many minutes a day they spend crying, and how often they speak to or see other people. These variables represent measurable features that are thought to represent elements of depression. The observed items can act as observed item indicators. Analyzing scores for these observed item indicators can form a latent construct called a factor. The measurement model within the SEM framework can be used to form latent factors based on patterns of responses obtained from observed item indicators. The interpretation of the latent factor is then based on the response patterns for the observed items.
As a simple visual example of this measurement model, let’s assume that we collected data on three observed variables: Sleep Amount, Crying Amount, and Contact with Others. A measurement model can be tested that combines these observed items together to form a latent construct called Depression. A simplified version of this model is in Figure 1.4.
FIGURE 1.4. A Simplified Example of a Measurement Model.
This figure introduces a new symbol, which is a single-sided arrow pointing from the latent construct into the observed item indicators. When constructing a latent variable, the arrows are considered in some software languages as “BY” statements, which represent factor loadings. Factor loadings capture the direct effect of the latent variable onto the observed item indicators. In general, a relatively larger loading corresponds to a larger effect for that item indicator.
Another key element to SEMs is called the structural model. This part of the model can grow in complexity, but here I provide a simple example to highlight certain concepts important to grasp. Figure 1.5 presents a simplified version of a structural path model, with three observed variables related through single-sided arrows.
FIGURE 1.5. A Simplified Example of a Path Model.
In this figure, X1 is acting as a direct predictor for Y1 , which is acting as a direct predictor of Y2 . The outcome in this model is Y2 , and we know this by examining the direction of the arrows that are present. In this simplified version of a structural model, the arrows represent “ON” statements, which reflect regression paths leading from a predictor to an outcome. Notice that the initial predictor, X1 , does not have any arrows pointing into it. Variables in SEM that act only as predictors (i.e., they do not have arrows pointing into them) are considered to be exogenous variables. In contrast, variables that have arrows pointing toward them act as endogenous variables. In the case of Figure 1.5, there are two endogenous variables: Y1 and Y2 . The direction of the arrow dictates the role that each variable plays in the model. Another important element of SEM diagrams concerns variances and covariances. These elements are represented by double-sided arrows in the model. For example, Figure 1.6 illustrates two covarying predictors (X1 and X2 ) for an outcome (Y1 ). It is clear that the predictors are allowed to covary because of the presence of the double-sided arrow, which points toward each predictor. The covariance is represented by a “WITH” statement in many software languages.
FIGURE 1.6. Illustrating Covariance in a Diagram.
Regarding variances, the symbol is similar to the covariance symbol in that a double-sided arrow is used. To denote that a variance is being considered, the double-sided arrow will point directly to the variable in question. Figure 1.7 illustrates a variance for predictors X1 and X2.
FIGURE 1.7. Illustrating Variance in a Diagram.
Finally, some models denote constants or intercepts by using triangles. One such model is the latent growth curve model, which is described in Chapter 8. A depiction of this model, along with the intercept term at the very bottom (the triangle with the “1” in it), can be found in Figure 1.8.
FIGURE 1.8. Illustrating Variance in a Diagram.
These are the basic symbols that are needed to understand the models presented in subsequent chapters. For more detailed discussions of the foundations and fundamentals of SEM, please see Hoyle (2012a). This book covers many topics related to SEM, including a basic introduction to the modeling framework (Hoyle, 2012b), historical advances (Matsueda, 2012), the use of path diagrams (Ho, Stark, & Chernyshenko, 2012), and details surrounding the use of latent variables within SEM (Bollen & Hoyle, 2012).
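The diagram conventions in this section (single-sided arrows for “BY” and “ON” statements, double-sided arrows for “WITH” statements) can be mirrored in a small data structure. The following Python sketch is illustrative only: the tuple format, the function names, and the combined model (the Depression factor, plus covarying predictors X1 and X2 for Y1, with Y1 predicting Y2) are assumptions made for this example and do not correspond to any particular software syntax.

```python
# Hypothetical model statements; each tuple is (left-hand side, operator, right-hand side).
statements = [
    ("Depression", "BY", "Sleep"),    # factor loading: Depression -> Sleep
    ("Depression", "BY", "Crying"),   # factor loading: Depression -> Crying
    ("Depression", "BY", "Contact"),  # factor loading: Depression -> Contact
    ("Y1", "ON", "X1"),               # regression path: X1 -> Y1
    ("Y1", "ON", "X2"),               # regression path: X2 -> Y1
    ("Y2", "ON", "Y1"),               # regression path: Y1 -> Y2
    ("X1", "WITH", "X2"),             # covariance: X1 <-> X2
]


def classify(statements):
    """Classify variables in the structural (ON) part as exogenous or endogenous."""
    outcomes = {lhs for lhs, op, rhs in statements if op == "ON"}
    predictors = {rhs for lhs, op, rhs in statements if op == "ON"}
    endogenous = outcomes               # at least one regression arrow points into them
    exogenous = predictors - outcomes   # they act only as predictors
    return sorted(exogenous), sorted(endogenous)


def single_sided_arrows(statements):
    """Return (source, target) pairs for every single-sided arrow in the diagram."""
    arrows = []
    for lhs, op, rhs in statements:
        if op == "BY":      # the latent variable points into its indicator
            arrows.append((lhs, rhs))
        elif op == "ON":    # the predictor points into the outcome
            arrows.append((rhs, lhs))
    return arrows


exogenous, endogenous = classify(statements)
print("Exogenous:", exogenous)    # Exogenous: ['X1', 'X2']
print("Endogenous:", endogenous)  # Endogenous: ['Y1', 'Y2']
```

Note that Y1 appears as both a predictor and an outcome, yet it is classified as endogenous because an arrow points into it, which matches the rule stated for Figure 1.5.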
1.4.2
LISREL Notation
LISREL (linear structural relations) notation is commonly implemented for models housed within the SEM framework. The LISREL program is a software program used for SEM that is matrix-based and follows specific notation. Many software programs, books, and research papers surrounding SEM follow this notation system. For most sections in this book, I also adhere to this notation (unless noted otherwise). As an example for describing the notation, I have included a picture of the model discussed in Chapter 6 in Figure 1.9.
FIGURE 1.9. An Example of LISREL Notation.
This figure has several elements, which include endogenous and exogenous variables, latent and manifest variables, factor loadings, and regression and covariance elements. Each of these features will be described next. First off, notice that there are three main latent variables in the model: ξ1 , η1 , and η2 . In LISREL notation, endogenous and exogenous latent variables are denoted with different notation. The exogenous variable is ξ1 , and the endogenous variables are denoted with η notation. In this case, η2 acts as the outcome in this model. The exogenous latent variable, ξ1 , has a variance term φ. In the case of multiple exogenous latent variables, the variance terms and covariances would be contained in matrix Φ. The endogenous latent variables η1 and η2 have disturbance terms called ζ. These disturbance terms, along with any covariances among the endogenous disturbance terms, would be contained in matrix Ψη .
Each of the three latent variables has observed item indicators. The item indicators associated with the exogenous latent variable (ξ1) are denoted with Xs, and the items associated with the endogenous latent variables (η1 and η2) are denoted with Ys. The Xs are tied to ξ1 through factor loadings λx for each item. These loadings are contained in a factor loading matrix for the exogenous variables, Λx. The numeric subscripts next to the λx terms represent the row and column (respectively) that these elements represent within Λx. The Xs have measurement errors called δ. These measurement errors are summarized by error variances (σ²δ). The σ²δ elements comprise the diagonal elements of a matrix Θδ. Any covariances among the δ terms would be in the off-diagonal elements of Θδ; no covariances are pictured in the figure for Θδ.
The Ys correspond with η1 and η2 through factor loadings λy for each item. These loadings are contained in a factor loading matrix for the endogenous variables, Λy. The numeric subscripts next to the λy terms represent the row and column (respectively) that these elements represent within Λy. For example, λy21 represents Item 2 loading onto Factor 1 within this matrix, and λy72 represents Item 7 loading onto Factor 2. The Ys have measurement errors called ε. These measurement errors are summarized by error variances (σ²ε). The σ²ε elements comprise the diagonal elements of a matrix Θε. The covariances among the ε terms would be in the off-diagonal elements of Θε. The covariances are denoted with the curved lines at the top of the figure with double arrows.
The three latent variables are linked together with regression paths, which define the endogenous and exogenous variables in the model. The paths regressing endogenous variables (both of the η terms) onto the exogenous variable (ξ) are denoted by γ notation. All pathways leading from exogenous variables to endogenous variables are contained within the matrix Γ. The pathway linking the two endogenous variables together is denoted with β, and all pathways akin to this would be contained in matrix B. As a quick summary, the notation is redefined in Table 1.1.
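For reference, the pieces described in this section are usually collected into three matrix equations. The block below is a sketch of the standard LISREL form written with the symbols from this section. The vectors x and y (stacking the X and Y item indicators) are shorthand introduced here, and intercept and mean terms are omitted as a simplifying assumption.

```latex
\begin{align}
\mathbf{x} &= \boldsymbol{\Lambda}_x \boldsymbol{\xi} + \boldsymbol{\delta},
  & \operatorname{Cov}(\boldsymbol{\delta}) &= \boldsymbol{\Theta}_\delta \\
\mathbf{y} &= \boldsymbol{\Lambda}_y \boldsymbol{\eta} + \boldsymbol{\epsilon},
  & \operatorname{Cov}(\boldsymbol{\epsilon}) &= \boldsymbol{\Theta}_\epsilon \\
\boldsymbol{\eta} &= \mathbf{B}\boldsymbol{\eta} + \boldsymbol{\Gamma}\boldsymbol{\xi} + \boldsymbol{\zeta},
  & \operatorname{Cov}(\boldsymbol{\xi}) &= \boldsymbol{\Phi}, \quad
    \operatorname{Cov}(\boldsymbol{\zeta}) = \boldsymbol{\Psi}_\eta
\end{align}
```

The first two lines are the measurement models for the exogenous and endogenous latent variables, and the third line is the structural model linking the latent variables through Γ and B.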
TABLE 1.1. LISREL Notation at a Glance
Parameter Notation      Matrix/Vector   Definition
ξ (xi)                  ξ               Exogenous latent variable
η (eta)                 η               Endogenous latent variable
λx (lambda)             Λx              Exogenous variable factor loadings
λy (lambda)             Λy              Endogenous variable factor loadings
φ (phi)                 Φ               (Co)variances for exogenous latent variables
ζ (zeta)                ζ               Endogenous latent variable disturbance terms
ψ (psi)                 Ψη              Endogenous disturbances and covariances
γ (gamma)               Γ               Regression paths from exogenous to endogenous
β (beta)                B               Regression paths from endogenous to endogenous
σ²δ (sigma2 delta)      Θδ              Error variances for exogenous variables
θδ (theta delta)        Θδ              θδ is a generic term for all elements in Θδ
σ²ε (sigma2 epsilon)    Θε              Error variances for endogenous variables
θε (theta epsilon)      Θε              θε is a generic term for all elements in Θε
1.4.3
Additional Comments about Notation
Section 1.4.2 acts as a centralized guide to the basics of SEM-based notation, but the Bayesian treatment of SEMs requires much more notation. Therefore, I find it useful to include a notation guide at the end of each chapter. To some extent, including notation guides at the end of each chapter breaks with convention. It is more common in statistical modeling books to have a centralized guide to all notation in the book. However, chapter-specific notation guides have been included here as an additional learning tool for readers. SEMs are notoriously notation-heavy, and introducing these models into the Bayesian estimation framework further complicates the notation needed when implementing prior distributions. Each chapter contains all notation needed to understand the modeling techniques described in that chapter. This structure ensures that each chapter can be handled as a standalone learning tool to facilitate grasping chapter content. For example, a reader can tackle one chapter at a time and have all necessary information contained within that chapter to learn and understand the material presented. The notation at the end of the chapter can act as a reference guide, and also as a “knowledge check,” as readers gain more comfort with notation used with Bayesian SEM.
1.5
Datasets Used in the Chapter Examples
In this section, I present information about all of the datasets that are used in the examples throughout the book. All data are freely available on the companion website. As a caveat, I want to be clear that none of the examples are meant to derive substantive conclusions. The models I constructed are used to highlight modeling and estimation features. These models are not necessarily constructed with substantive theory in mind, nor were all model alternatives tested. In fact, in many cases the models did not fit the data well, and this helps to highlight certain points about the analysis. The data and examples that were selected were done so for pedagogical reasons, and I do not make any substantive claims in this book. Each dataset is now discussed in turn.
1.5.1
Cynicism Data
Data were collected from 100 college students who participated for course credit. Three variables are used from this dataset:
• Cynicism: Higher values indicate greater cynicism
• Lack of Trust: Higher values indicate less trust in others
• Sex: Female = 0, Male = 1
1.5.2
Early Childhood Longitudinal Survey–Kindergarten Class
The Early Childhood Longitudinal Survey–Kindergarten class (ECLS–K; National Center for Education Statistics [NCES], 2001) was used. This database consists of information capturing child development, school readiness, and early school experiences. Data were extracted from the class of 1998-1999. The reading assessment measures basic skills such as letter recognition, sight recognition of words, and vocabulary in context. There are many scoring options within the database, and I opted to use the reading scores based on item response theory (IRT scores). I used 3,856 students who had reading assessment scores for the following times: fall kindergarten, spring kindergarten, fall first grade, and spring first grade. The measurement occasions were spaced as follows (and are described in more detail in Kaplan, 2002):
• Interval 1: October-November 1998
• Interval 2: April-May 1999
• Interval 3: September-October 1999
• Interval 4: April-May 2000
Notice that each time point is actually an interval of time that contains two months (e.g., October-November 1998). This interval indicates that data collection took place over a period of time for the children, rather than (for example) on a single day for all children.
1.5.3
Holzinger and Swineford (1939)
The Holzinger and Swineford (1939) dataset is a classic example for implementing factor structures and assessing for group differences. Data were collected from seventh- and eighth-grade students from two different schools. There were 145 students from the Grant-White school, and 156 students from the Pasteur school. Originally data were collected from 26 items of mental ability tests. However, Jöreskog (1969) used only nine of these items for studying models of correlation structures. I use these same nine items, which are thought to separate into three distinct factors as follows:
• Factor 1: Spatial Ability
  – Item 1. Visual Perception
  – Item 2. Cubes
  – Item 3. Lozenges
• Factor 2: Verbal Ability
  – Item 4. Paragraph Comprehension
  – Item 5. Sentence Completion
  – Item 6. Word Meaning
• Factor 3: Speed
  – Item 7. Addition
  – Item 8. Counting Dots
  – Item 9. Straight-Curved Capitals
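The item-to-factor assignment listed above can be written as a pattern matrix of the kind used to specify which loadings in Λ are estimated. The following Python sketch assumes a simple structure with no cross-loadings, which is an assumption made for illustration; the variable names are hypothetical.

```python
# Item-to-factor assignment taken from the list above.
factors = {
    "Spatial Ability": [1, 2, 3],
    "Verbal Ability": [4, 5, 6],
    "Speed": [7, 8, 9],
}

n_items = 9
factor_names = list(factors)

# Pattern matrix: rows are items, columns are factors.
# 1 = loading is freely estimated, 0 = loading is fixed to zero (assumed here).
pattern = [[0] * len(factor_names) for _ in range(n_items)]
for col, name in enumerate(factor_names):
    for item in factors[name]:
        pattern[item - 1][col] = 1

for item, row in enumerate(pattern, start=1):
    print(f"Item {item}: {row}")
```

Each row contains a single 1, reflecting the assumption that every item is an indicator of exactly one factor.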
1.5.4
IPIP 50: Big Five Questionnaire
The next database is a bit different in that it is one that constantly updates through online data collection. It is called the Open-Source Psychometrics Project database (2019). This database is freely available from www.openpsychometrics.org, which is an online repository for personality test data. Other users of this database include de Roover and Vermunt (2019) and J. C.-H. Li (2018). I pulled data from the Big Five test from the International Personality Item Pool (IPIP; https://ipip.ori.org/), which is a freely available database where information regarding more than 3,000 items and 250 personality scales is available. The IPIP is managed by the Oregon Research Institute. The IPIP 50 is a short version of the well-known Big Five (Costa & McCrae, 1992), which includes only 10 items per hypothesized factor; the items are publicly available at the IPIP website (https://ipip.ori.org). There are clear limitations to using data collected from online surveys (e.g., there could be duplicate entries, it is not controlled whether a single person filled out the survey or if multiple people contributed). However, the large database is useful for pedagogical reasons illustrated in Chapter 3. The following information was extracted on September 10, 2018: 50 Likert-type questions based on the IPIP Big Five questionnaire, gender, race, age, native language, and country. I extracted 19,719 participants and used their answers to 50 IPIP Big Five items.
1.5.5
Lakaev Academic Stress Response Scale
The Lakaev Academic Stress Response Scale (Lakaev, 2009) was designed to assess stress in university students. It is composed of 21 items and four stress response domains: Physiological Stress, Cognitive Stress, Affective Stress, and Behavioral Stress. Items are scored based on a 5-point Likert-type scale, ranging from 1 (not at all) to 5 (all of the time). Data were collected on this scale and reported in Winter and Depaoli (2019). The sample consisted of 144 undergraduate students, who were asked to fill out the questionnaire during the week leading up to the midterm, right after the midterm (before grades were posted), and a week after the midterm (when grades were posted). Data were collected from an Introductory Psychology course. There were some missing data at each of the measurement occasions: 140 undergraduates completed the first occasion (97.2%), 102 completed the second occasion (70.8%), and 115 participants completed the third occasion (79.9%). Overall, 127 students participated in at least two measurement occasions (88.2%). Data used in this book consist of answers to the following items:
• I couldn’t breathe.
• I had headaches.
• My hands were sweaty.
• I have had a lot of trouble sleeping.
• I had difficulty eating.
1.5.6
Political Democracy
Data on political democracy described in Bollen (1989) were used. Data were collected from 75 developing countries. Four measures of democracy were collected in 1960 and again in 1965, and data from three measures of industrialization were collected in 1960. The item content is as follows:
• Data from 1960 and 1965 (democracy)
  – Freedom of the press
  – Freedom of political opposition
  – Fairness of elections
  – Effectiveness of elected legislature
• Data from 1960 (industrialization)
  – Gross national product (GNP) per capita
  – Energy consumption per capita
  – Percentage of labor force in industry
1.5.7
Program for International Student Assessment
The Program for International Student Assessment (PISA) is an international study sponsored by the Organization for Economic Cooperation and Development (OECD, 2013). It is designed to assess academic performance among 15-year-old students in the domains of mathematics, reading, and science. The survey is conducted every 3 years in participating countries, with each assessment focusing on a different content domain. Examples use data from the 2003 and 2012 PISA data cycles, which emphasized students’ performance in, and attitudes toward, mathematics. Data from the 2003 cycle consisted of N = 5,376 students from 149 schools (average cluster size = 36) from South Korea. Data used from the 2012 cycle consisted of a sample of 30 schools randomly selected from the full group of South Korean schools (N = 617 students; average cluster size = 21). The final subsection of data used in this book consisted of data from all 65 countries sampled in 2012. There were a total of 308,238 students sampled from 17,952 schools (average cluster size = 17). The following item content was used:
• Train timetable
• Discount %
• Size (m²) of a floor
• Graphs in newspaper
• Distance on a map
• Petrol consumption rate
• 3x + 5 = 17
• 2(x + 3) = (x + 3)(x − 3)
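The average cluster sizes reported for the three PISA samples follow from dividing the number of students by the number of schools. The short Python check below is a sketch; it assumes the reported values were rounded to the nearest whole student, and the dictionary labels are made up for this example.

```python
# (students, schools) for each PISA sample described above.
samples = {
    "2003 South Korea": (5376, 149),
    "2012 South Korea (30 schools)": (617, 30),
    "2012 all 65 countries": (308238, 17952),
}

# Average cluster size = students per school, rounded to a whole number (assumed).
average_cluster_size = {
    label: round(students / schools)
    for label, (students, schools) in samples.items()
}

for label, size in average_cluster_size.items():
    print(f"{label}: average cluster size = {size}")
```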
1.5.8
Youth Risk Behavior Survey
Data were pulled from the 2005 and 2007 Youth Risk Behavior Survey (YRBS). The YRBS consists of nationally representative samples of U.S. high school students in grades 9-12 (Centers for Disease Control and Prevention, 2018). Data for the YRBS are collected every 2 years in an effort to examine the prevalence of health risk behaviors in adolescents. Examples in this book use the full 2005 (n = 13,917) and 2007 (n = 14,041) YRBS samples, in addition to a random subset of the 2007 data (n = 281). Motivation for using these data came from a previous application based on the 2005 YRBS sample by Collins and Lanza (2010). In that application, the authors presented an analysis of youth health risk behaviors using responses to 12 binary items (1 = yes, 2 = no) in which students were asked to indicate whether they had ever engaged in a particular health risk behavior. The following item content was used:
• Smoked first cigarette before age 13
• Smoked daily for 30 days
• Has driven when drinking
• Had first drink before age 13
• ≥ Five drinks in a row in the past 30 days
• Tried marijuana before age 13
• Used cocaine in life
• Sniffed glue in life
• Used methamphetamines in life
• Used Ecstasy in life
• Had sex before age 13
• Had sex with 4+ people
2
Basic Elements of Bayesian Statistics
The focus of this book is on the Bayesian treatment of SEMs. Before presenting different model types, it is important to review the basics of Bayesian statistical modeling. This chapter reviews concepts crucial to understanding Bayesian SEM. Conceptual similarities and differences between the frequentist and Bayesian estimation frameworks are highlighted. I also introduce the Bayesian Research Circle, which can be used as a visual representation of the steps needed to implement Bayesian estimation. The key ingredients of Bayesian methodology are described, and a simple example of implementation using a multiple regression model is presented. I provide a special focus on the concept of conducting a sensitivity analysis, which is equally applicable to the statistical model and the prior distributions. This chapter provides the basics needed to understand the material in the subsequent model-based chapters, but I also provide references to more detailed treatments of Bayesian methodology.
2.1
A Brief Introduction to Bayesian Statistics
The role of this chapter is to act as a refresher of key elements of Bayesian statistics. The chapter covers most of the main elements that are needed to get started with the remaining chapters in this book. However, it is by no means an exhaustive treatment of Bayesian statistics, or the elements needed for estimation and assessment of results. For a more thorough presentation of these topics, I refer the reader to Gelman, Carlin, et al. (2014), Kaplan (2014), or Kruschke (2015). Each of these resources provides a complete background of Bayesian statistics and can act as a springboard to the current book. There are many reasons that a researcher might want to use Bayesian methods in applied research. However, in order to do so, one must first become familiar with the various elements that make Bayesian estimation different from traditional (frequentist) estimation. In this chapter, I cover the main ingredients linked to estimation, which are: the model, the model parameters, and corresponding probability distributions. A refresher of
these elements will help ensure understanding of the model-specific information in subsequent chapters. The remainder of this chapter is structured as follows. Next, I provide basic background information for Bayesian methodology (Section 2.2), compare frequentist and Bayesian perspectives (Section 2.3), present the Bayesian Research Circle (Section 2.4), and then present a description of Bayes’ rule (Section 2.5). This is followed by a list of common prior distributions that are implemented in SEM (Section 2.6). The likelihood is described (Section 2.7), and this is followed by a description of the posterior (Section 2.8) and posterior inference (Section 2.9). An example of implementing Bayesian estimation is provided (Section 2.10). The chapter ends with a summary, including major take-home points, notation, and an annotated bibliography of helpful resources (Section 2.11). Finally, code for getting started with R is presented in Appendix 2A.
2.2
Setting the Stage
Ahead of data collection, the researcher often identifies one or more testable hypotheses that correspond to a statistical model. The idea is that these hypotheses can be tested on a sample of participants, and then results can be generalized to the population. The statistical model reflects the ideas (or theory) that the researcher has about the population, and it is represented by mathematical equations that link together predictor and outcome variables in a predetermined way (i.e., in a way that accurately reflects the research hypotheses). The model is characterized by unknown model parameters, which are then estimated. In the case of a simple linear regression model, the model may consist of a single predictor (e.g., SAT score) and a single outcome (e.g., college performance). The main model parameter of interest in this example may be the regression slope (i.e., the regression weight associated with the predictor of SAT score). This parameter sheds light on the strength and direction (e.g., positive versus negative) of the relationship between SAT scores and college performance. The researcher can use information gained through the parameter estimate to help ascertain how predictive SAT scores are of college performance in the population; this is the basis for inferential statistics, in which a sample dataset is used to learn something about the population.
Different methods can be employed to estimate model parameters. The main two methods are frequentist analysis (e.g., ML estimation) and Bayesian analysis.¹ These two methods differ in several important ways, with the main difference residing in the approach to how estimates and parameters are defined. Within the frequentist framework, focus is on identifying estimates that reflect the highest probability of representing the sample data. ML estimation, as an example of frequentist estimation, finds parameter estimates through maximizing a likelihood function using the observed sample data. A point estimate is obtained for each model parameter, and this estimate acts as the optimal value for the fixed population parameter. In this instance, the estimate is viewed as the value linked to the highest probability of observing the sample data being examined. ML estimation makes an assumption that the distribution of the parameter estimate is normal (i.e., symmetric), and this is rooted in the asymptotic theory that the approach is based on.
¹ Although frequentist analysis incorporates different methods, I will restrict my discussion here to the example of ML since it is the most commonly implemented method in SEM.
Within Bayesian statistics, there is an important element in this basic estimation phase that is added to the picture. Frequentist estimation treats model parameters as fixed, unknown quantities, whereas it is common for Bayesian estimation to treat model parameters as unknown random variables, with the random aspect captured through uncertainty surrounding the population value.² The researcher may have previous (or prior) beliefs, opinions, or knowledge about a population parameter. These beliefs can be used to capture the degree of uncertainty surrounding the parameter. For example, a parameter can be represented as the relationship between a predictor and an outcome variable. Bayesian methodology allows the researcher to incorporate this knowledge into the estimation process through probability distributions linked to each of the model parameters (e.g., the regression weight). It is important to note that these beliefs are determined before data analysis (i.e., they are typically independent of model results for the current data and current model under examination).
The beliefs can be particularly useful to help narrow down to a set of plausible values for a given model parameter. In the case of a regression weight, there may be a strong belief that the weight is somewhere around 1, so a value of 150 for this parameter would be highly unlikely. The prior knowledge incorporated into the estimation process can help the researcher by including information that some values (e.g., regression weight = 1) are more likely to occur in the population than others (e.g., regression weight = 150).
² Although I mentioned that it is common for model parameters to be treated as unknown and random, it is not a requirement to do so within the Bayesian estimation framework. Indeed, parameters can be modeled as fixed and estimated through frequentist methods within the Bayesian framework (see, e.g., Carlin & Louis, 2008). I focus on the more traditional approach for Bayesian modeling here, where parameters are treated as random.
Once the beliefs are set, model estimation can take place using the beliefs as part of the estimation process; again, these beliefs help to narrow down the range of possible values for the model parameter. When results are obtained from this statistical model, we have a new state of knowledge, one that incorporates previous beliefs with current data. This state of knowledge is referred to as the posterior belief, since it is based on results obtained post-data analysis. Bayesian methodology can be used to take us from the idea of prior beliefs to posterior beliefs, both of which are captured using probability distributions (as opposed to point estimates, as discussed for ML estimation). Unlike ML estimation, Bayesian methods do not rely on asymptotic theory, and the posterior need not be normal (i.e., it can be heavily skewed).
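The move from prior belief to posterior belief can be illustrated with a small numeric sketch in Python. Everything specific here is an assumption chosen for illustration: a single regression weight, a regression through the origin, a residual standard deviation treated as known, three made-up data points, and a normal prior centered at 1. Under these assumptions the posterior has a closed form and happens to be normal, which is a special case rather than a general property.

```python
import math

# Hypothetical toy data for a regression through the origin: y_i = beta * x_i + e_i
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
sigma = 1.0  # residual SD, assumed known so that the update has a closed form

# Prior belief: the regression weight is "somewhere around 1"
prior_mean, prior_sd = 1.0, 0.5

sxx = sum(xi * xi for xi in x)              # sum of squared predictor values
sxy = sum(xi * yi for xi, yi in zip(x, y))  # sum of cross-products

# ML point estimate of the regression weight (equals least squares here)
ml_estimate = sxy / sxx

# Normal prior combined with a normal likelihood: precisions (1 / variance) add
prior_precision = 1.0 / prior_sd ** 2
data_precision = sxx / sigma ** 2
post_precision = prior_precision + data_precision

# The posterior mean is a precision-weighted average of prior mean and ML estimate
post_mean = (prior_precision * prior_mean + data_precision * ml_estimate) / post_precision
post_sd = math.sqrt(1.0 / post_precision)

print(f"ML estimate:    {ml_estimate:.3f}")
print(f"Prior:          mean = {prior_mean:.3f}, SD = {prior_sd:.3f}")
print(f"Posterior:      mean = {post_mean:.3f}, SD = {post_sd:.3f}")
```

In this sketch the posterior mean lies between the prior mean and the ML estimate, and the posterior standard deviation is smaller than the prior standard deviation, which reflects the new state of knowledge that combines previous beliefs with the current data.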
2.3
Comparing Frequentist and Bayesian Estimation
The current book is by no means meant to position Bayesian and frequentist perspectives against one another. There are some aspects of SEM that can benefit from the Bayesian perspective, but I do not claim that frequentist methods are inferior. It was also not my intention to provide an exhaustive treatment (or comparison) of these approaches. However, I do think it is important to briefly cover the differences in estimation and interpretation in order to highlight similarities and differences. My aim is that the differences highlighted here will help the reader understand the philosophical underpinnings of the frameworks, which will aid in interpreting the Bayesian SEM findings in the remaining chapters. In this section, I will use the ML estimator as a proxy for frequentist estimation–ML estimation is not the only frequentist tool, but it is useful for illustration. Within ML estimation, the estimator is a function of the observed data, which come from a distribution (i.e., the data are assumed to be random). Since the data are assumed random, the ML estimator is also considered to be a random variable. The ML estimate, which is based on the observed data, represents the product (or realization) of the random variable. Specifically, the ML estimate is the point estimate of a fixed but unknown model parameter. It is important to reiterate that the model parameter is not viewed as probabilistic within the frequentist framework. Uncertainty in the ML point estimates is captured through standard errors and confidence intervals. In addition, interpretations of these elements are based on asymptotic theory, as well as the notion that model parameters are fixed. For example, the standard error captures variability in the ML estimates that would be obtained through the ML estimator upon repeatedly
sampling data from the population. The frequentist confidence interval is also interpreted in an asymptotic manner. Assume that a 95% confidence interval for sample data is [1, 10]. The frequentist interpretation indicates that, upon repeated sampling and estimation, 95% of the confidence intervals constructed in the same way would capture the true parameter value under the null hypothesis. It does not make a claim about the specific interval of [1, 10]. The probability of the parameter falling within this interval is either 0 or 1, and nobody can ever know which it is. Within frequentist methods, the parameters are assumed to be fixed over repeated samples of data, but the estimates (e.g., point or interval estimates) can vary across samples. This artifact is tied to the notion that frequentist probabilistic statements are associated with the ML estimates (obtained through repeated sampling) and not the parameters (assumed to be fixed across sampling). Bayesian estimation and interpretation rely on a different philosophical foundation, but there are several similarities with frequentist methods. Both estimation frameworks consider the data to be random and associated with a distribution. The treatment of the model parameters is largely where differences can be highlighted between the two estimation perspectives. The literature has produced ambiguous statements about whether parameters within the Bayesian estimation framework are considered to be fixed or random. Some authors present Bayesian parameters as being fixed and some argue they are random. The ambiguity in how this detail is presented is rooted in the distinction between what parameters are and how they are treated. The Bayesian perspective assumes that the true value of the parameter θ is indeed fixed and unknown–akin to the frequentist perspective. 
However, within the Bayesian estimation framework, the parameter value is treated as being random because there is uncertainty about θ expressed through the prior distribution. In other words, unknown and fixed values of θ are treated as unobserved random variables via the prior. Parameters are considered random due to the uncertainty surrounding them, but they represent fixed entities in the population.
There are also important conceptual and interpretation differences between the estimation approaches. For example, the Bayesian analog for a frequentist confidence interval is called the Bayesian credible interval (CI). Upon estimating the posterior for a given model parameter, the CI can be computed based on the posterior’s quantiles. The CI can be interpreted in terms of the probability that the parameter value falls within this particular interval. For example, if a Bayesian 95% CI is [20, 60], then this indicates that there is a 95% chance that the fixed parameter value is between the values of 20 and 60. Notice that this interpretation does not rely on asymptotic arguments, unlike its frequentist counterpart.
Before delving into more details surrounding Bayesian estimation, I will present the Bayesian Research Circle. This process represents the basic steps that should be addressed when implementing Bayesian methodology.
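The two interval interpretations can be contrasted with a small simulation. The following is a minimal sketch under stated assumptions, not a procedure taken from the text: the data are assumed normal with a known standard deviation, the true mean of 5 is hypothetical, and the N(0, 100) prior is chosen only for illustration. The first part counts how often the confidence interval procedure captures the fixed true value over repeated samples; the second part computes a credible interval from the quantiles of a single posterior.

```python
import random
from statistics import NormalDist

random.seed(1)
TRUE_MEAN, SIGMA, N = 5.0, 2.0, 25      # hypothetical population values
Z = NormalDist().inv_cdf(0.975)         # standard normal 97.5% quantile

def confidence_interval(sample):
    """95% confidence interval for the mean when sigma is known."""
    centre = sum(sample) / len(sample)
    half_width = Z * SIGMA / len(sample) ** 0.5
    return centre - half_width, centre + half_width

# Frequentist reading: repeat the study many times and count how often
# the interval procedure captures the fixed true mean.
n_reps = 5000
hits = 0
for _ in range(n_reps):
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    lo, hi = confidence_interval(sample)
    hits += lo <= TRUE_MEAN <= hi
coverage = hits / n_reps

# Bayesian reading: one dataset, one posterior, one interval from its quantiles.
sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
prior_mean, prior_var = 0.0, 100.0      # assumed N(0, 100) prior on the mean
post_var = 1.0 / (1.0 / prior_var + N / SIGMA**2)
post_mean = post_var * (prior_mean / prior_var + sum(sample) / SIGMA**2)
posterior = NormalDist(post_mean, post_var ** 0.5)
credible = (posterior.inv_cdf(0.025), posterior.inv_cdf(0.975))

print(f"share of intervals capturing the true mean: {coverage:.3f}")
print(f"95% credible interval for this sample: [{credible[0]:.2f}, {credible[1]:.2f}]")
```

The coverage figure describes the procedure across hypothetical replications, whereas the credible interval is a probability statement about the parameter given this one dataset and the assumed prior.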
2.4
The Bayesian Research Circle
The Bayesian Research Circle is a process adapted from van de Schoot et al. (2021), which involves conceptual and statistical phases of Bayesian statistical modeling. The processes are depicted in Figure 2.1.
FIGURE 2.1. The Bayesian Research Circle.
Whether working under the Bayesian or frequentist estimation framework, the typical research cycle begins with the elements listed in the left-hand panel (a) of Figure 2.1. The researcher defines an area of study (e.g., by defining research problems or gaps in the literature), references the literature, and formulates hypotheses. This phase is typically completed using an iterative approach, where elements within this portion of the cycle can influence one another–as indicated in the cyclic nature of this portion of the figure. After research questions and hypotheses are solidified, the researcher moves into the pre-analysis phase of the circle. In this pre-analysis phase, the researcher works to solidify the appropriate statistical model that aligns with the proposed research questions. This phase may also include preregistration of the design and analytic strategy, and then data collection or acquisition will take place. After data collection takes place, the Bayesian elements of this research circle appear. The right-hand panel (b) in Figure 2.1 highlights the various features underlying the Bayesian research process. In the next section, I detail the process as it is defined through Bayes’ rule (with elements of Bayes’ rule denoted in the colored circles in Figure 2.1), which is the result of a mathematical theorem. In the subsequent sections, I present on various elements comprising the Bayesian Research Circle while continuing to reference Figure 2.1 to highlight how these different features are related to one another.
2.5
Bayes’ Rule
Bayes’ rule is an important component of Bayesian estimation, and it helps us to understand how previous knowledge can be incorporated into the estimation process. Rényi’s axiom of probability allows for an examination of conditional probabilities. The basic form of a conditional probability, where Events A and B are dependent, can be written as
$$p(B \mid A) = \frac{p(B \cap A)}{p(A)} \tag{2.1}$$
where the probability of Event B occurring is conditional on Event A. This notion sets the foundation of Bayes’ rule, which recognizes that p(B|A) ≠ p(A|B) but p(B ∩ A) = p(A ∩ B). Bayes’ rule can be written as follows:
$$p(A \mid B) = \frac{p(A \cap B)}{p(B)} \tag{2.2}$$
which can be reworked as
$$p(A \mid B) = \frac{p(B \mid A)\,p(A)}{p(B)} \tag{2.3}$$
These principles of conditional probability can be extended to the situation of data and model parameters. For data y and model parameters θ, we can rewrite Bayes’ rule as follows:
$$p(\theta \mid y) = \frac{p(y \mid \theta)\,p(\theta)}{p(y)} \tag{2.4}$$
which is often simplified to
$$p(\theta \mid y) \propto p(y \mid \theta)\,p(\theta) \tag{2.5}$$
with the symbol ∝ representing “proportional to,” θ representing all model parameters (e.g., the regression weights tied to predictors in the model), and y representing the observed sample data. Notice that the denominator, p(y), was removed from Equation 2.4. This term acts as a constant, is not a function of the model parameters, and typically cannot be directly computed because it is represented by an intractable integral. The term p(y) is often viewed as a normalizing factor across all outcomes y, which do not contain information about the conditional probability. Given that this constant term p(y) only contains information about the data and not the model parameters, it is typically removed. The removal reduces the equation from equivalence to proportionality.
There are three main elements in this equation that reflect the key “ingredients” to Bayesian statistics. In Equation 2.5, the term on the far left, p(θ|y), is a conditional probability that denotes the probability of the model parameters given the sample data. This is the posterior distribution (or posterior), and it is what is solved for in the estimation process (i.e., this is the final result to interpret). The term p(y|θ) represents the probability of the sample data given the model parameters. This term is called the likelihood, and it represents the sample data and statistical model. Finally, p(θ) represents one’s prior beliefs about model parameter values that would be (un)likely to occur in the population. This term is called the prior distribution (or prior), and in many ways this element is the crux of Bayesian methodology. These three elements (the prior, likelihood, and posterior) are all represented in Figure 2.1 in the colored circles, and I describe them in more detail next.
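The relationship between the prior, likelihood, and posterior can be made concrete with a one-parameter grid approximation. This is a minimal sketch, assuming hypothetical binomial data (7 successes in 10 trials) and an assumed Beta(2, 2) prior on the success probability θ. With a single parameter the normalizing term p(y) can be approximated by a simple sum; in realistic SEM settings it generally cannot, which is why the proportional form is used.

```python
import math

# Hypothetical data: 7 successes in 10 trials; theta is the success probability.
successes, trials = 7, 10

grid = [i / 200 for i in range(1, 200)]     # candidate values of theta in (0, 1)
width = grid[1] - grid[0]

def likelihood(theta):
    """p(y | theta) for binomial data."""
    return math.comb(trials, successes) * theta**successes * (1 - theta) ** (trials - successes)

def prior(theta):
    """p(theta): Beta(2, 2) density, an assumed weakly informative prior."""
    return 6.0 * theta * (1.0 - theta)

# Likelihood times prior: the unnormalized posterior (proportional form).
unnormalized = [likelihood(t) * prior(t) for t in grid]
# Numerical stand-in for the normalizing term p(y).
evidence = sum(unnormalized) * width
# Dividing by p(y) turns proportionality into a proper density.
posterior = [u / evidence for u in unnormalized]

posterior_mean = sum(t * p for t, p in zip(grid, posterior)) * width
print(round(posterior_mean, 3))
```

Dividing by the evidence changes only the height of the curve, not its shape, so the location and spread of the posterior are already determined by the product of the likelihood and the prior.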
2.6
Prior Distributions
The top circle in panel (b) of Figure 2.1 represents the prior distribution. This phase of the Bayesian Research Circle requires that the researcher determine specifics surrounding the priors implemented in the model. All unknown model parameters receive a prior distribution, and formalizing the model priors can be a long and difficult process. The researcher must first take into account the distributional form of the prior (e.g., normal, gamma, or an unknown distributional form based on theory or another mechanism). In addition, the researcher must decide the hyperparameters for each prior, defining the level of informativeness each prior represents (or at least the level of informativeness it is intended to represent). Hyperparameters represent the terms that form the prior distribution. For example, the normal prior is defined through a mean and a variance hyperparameter. Knowing the values of the hyperparameters will allow for full reconstruction of the distribution. Priors can be elicited using many different methods, including from previous research and expert knowledge. After elicitation, and when the desired hyperparameter values are defined, it is advised to check the priors for consistency with the likelihood through some sort of prior-predictive checking process. The current section describes each of these elements of prior specification in more detail. It is important to remember while reading through this section that formalizing priors is an iterative process. This process requires referencing background knowledge, eliciting priors (whether from experts or otherwise), checking the priors through a prior-predictive checking process, and then potentially modifying the priors before implementation. What follows are details presented for common prior distributions that are implemented in Bayesian statistics. This is by no means an exhaustive list of distributional forms that can be used.
In fact, many programs (including the BUGS language) make it possible to specify priors of an unknown (i.e., non-conjugate) form. This latter type of prior can be particularly useful in cases in which, for example, the researcher wants to specify a prior distribution that captures knowledge from experts. This knowledge may not map directly onto a known distributional form, which means the researcher could write code representing the exact density he or she wants to use as a prior. This is more of a special-case treatment of priors. In this current section, I will briefly discuss some known prior distribution forms, as well as the sorts of parameters they are typically used for. However, keep in mind that the only limitation to what a prior can look like is one’s imagination.
2.6.1
The Normal Prior
For parameters such as means, factor loadings, and intercepts, we typically assume a normal prior (N). For a random variable X, assume
$$X \sim N[\mu_X, \sigma^2_X] \tag{2.6}$$
where the random variable (X) is captured by a normal distribution with mean hyperparameter $\mu_X$ and variance hyperparameter $\sigma^2_X$. The mean hyperparameter dictates the center of the prior, and the variance hyperparameter reflects the spread (or level of informativeness) of the prior. Depending on the coding scheme being used, the variance hyperparameter can also be specified as a precision (1/σ²) or standard deviation (σ).
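Because the same normal prior can be written with a variance, a standard deviation, or a precision, it is easy to specify a far tighter or wider prior than intended when moving between programs. The following minimal sketch converts between the three forms for a hypothetical N(0, 100) prior; the helper name `normal_prior_spec` is illustrative only and does not belong to any package.

```python
from statistics import NormalDist

def normal_prior_spec(variance):
    """Express the spread of one normal prior in three equivalent ways."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    return {"variance": variance, "sd": variance ** 0.5, "precision": 1.0 / variance}

spec = normal_prior_spec(variance=100.0)        # hypothetical N(0, 100) prior
prior = NormalDist(mu=0.0, sigma=spec["sd"])    # NormalDist expects a standard deviation

# Central 95% of the prior mass: a quick view of how informative the prior is.
interval_95 = (prior.inv_cdf(0.025), prior.inv_cdf(0.975))
print(spec)
print(interval_95)
```

Passing the value 100 where a standard deviation or a precision is expected would describe a very different prior, so it is worth confirming which parameterization a given program uses before entering hyperparameters.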
2.6.2
The Uniform Prior
For continuous random variables, where a bounded flat prior is desired, the uniform (U) prior can be used. The U prior places equal probability between a lower and upper bound specified by the user. For a continuous random variable X, this prior can be specified as
$$X \sim U[\alpha_u, \beta_u] \tag{2.7}$$
where $\alpha_u$ and $\beta_u$ represent the lower and upper bounds of the prior, respectively. In the current case, I used bracket notation of [. . .], which means that the values specified through α and β are included in the possible values for the U prior. If parentheses notation were used, such as (. . .), then α and β would not be included in the possible values for the U prior.
2.6.3
The Inverse Gamma Prior
When a variance is unknown, an inverse gamma (IG) prior can be used. This prior is specified as follows for an unknown variance σ²:
$$\sigma^2 \sim IG[a_{\sigma^2}, b_{\sigma^2}] \tag{2.8}$$
where the hyperparameters a and b represent the shape and scale parameters for the IG distribution, respectively. The hyperparameters are both set to positive (> 0) values to obtain a proper prior. Improper versions of this prior can also be implemented. One example that is used throughout this book is the following prior: IG[−1, 0]. This prior has a constant density of 1 on the interval (0, ∞), the admissible range for a variance, and has been shown to produce unbiased and efficient estimates in a variety of situations within SEM (Asparouhov & Muthén, 2010a).
2.6.4
The Gamma Prior
When working with an unknown precision (as opposed to a variance), the gamma (G) prior can be used. The G prior is defined as follows:
$$1/\sigma^2 \sim G[a_{1/\sigma^2}, b_{1/\sigma^2}] \tag{2.9}$$
where the hyperparameters a and b represent the shape and scale parameters for the G distribution, respectively. Just as with the IG, the hyperparameters are both set to positive (> 0) values to obtain a proper prior, but the scale hyperparameter (b) is an inverse scale compared to that in the IG prior.
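The link between a gamma prior on a precision and an inverse gamma prior on the corresponding variance can be checked by simulation. This is a minimal sketch that assumes b acts as a rate (inverse scale) in the G prior, so that 1/σ² ∼ G(a, rate b) implies σ² ∼ IG(a, scale b), whose mean is b/(a − 1) for a > 1. The values a = 5 and b = 4 are hypothetical.

```python
import random

random.seed(2)
a, b = 5.0, 4.0        # hypothetical shape a and rate (inverse scale) b

# random.gammavariate takes a shape and a *scale*, so the rate b enters as 1 / b.
precisions = [random.gammavariate(a, 1.0 / b) for _ in range(100_000)]
variances = [1.0 / tau for tau in precisions]     # implied draws of sigma^2

simulated_mean = sum(variances) / len(variances)
theoretical_mean = b / (a - 1.0)                  # mean of IG(a, scale b) for a > 1
print(simulated_mean, theoretical_mean)
```

The same hyperparameter values therefore describe the same beliefs in either form, provided the scale versus inverse-scale convention of the software is respected.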
2.6.5
The Inverse Wishart Prior
For covariance matrices, the inverse Wishart (IW) prior can be used. This prior is defined as follows for covariance matrix Σ:
$$\Sigma \sim IW[\Psi, \nu] \tag{2.10}$$
where Ψ is a positive definite matrix of size p and ν is an integer representing the degrees of freedom for the density.3 The value set for ν can vary depending on the informativeness of the prior distribution. If the dimension of Ψ is equal to 1, and Ψ = 1, then this prior reduces to an IG prior. Akin to the IG prior, an improper version of the IW is implemented in several examples. Specifically, IW(0, −p − 1) is used, which is a prior distribution that has a uniform density.
2.6.6
The Wishart Prior
For precision matrices, the Wishart (W) prior can be used. This prior is defined as follows for precision matrix Σ⁻¹:
$$\Sigma^{-1} \sim W[\Psi^{-1}, \nu] \tag{2.11}$$
where Ψ⁻¹ is a positive definite matrix of size p and ν is an integer representing the degrees of freedom for the density. The value set for ν can vary depending on the informativeness of the prior distribution. Similar to above, if the dimension of Ψ⁻¹ is equal to 1, and Ψ⁻¹ = 1, then this prior reduces to a G prior.
3 The Ψ hyperparameter should not be confused with $\Psi_\eta$ appearing in Table 1.1. The notation $\Psi_\eta$ is LISREL notation for a latent factor (η) covariance matrix, and Ψ (without a subscript) is used for the (inverse) Wishart hyperparameter.
2.6.7
The Beta Prior
The beta (B) prior is a continuous distribution bounded on the [0, 1] interval, which makes it a common choice of prior for a probability or proportion when the data are assumed to be distributed as binomial. It is specified as follows:
$$X \sim B(\alpha_B, \beta_B) \tag{2.12}$$
with two positive (> 0) shape parameters $\alpha_B$ and $\beta_B$.
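One reason the B prior pairs naturally with binomial data is that the update has a closed form: with a B(α, β) prior and s successes and f failures observed, the posterior is B(α + s, β + f). This is a standard conjugacy result rather than something derived in this section, and the B(2, 2) prior and the data below are hypothetical. A minimal sketch:

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Posterior Beta hyperparameters after observing binomial data."""
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

prior = (2.0, 2.0)                                   # hypothetical B(2, 2) prior
posterior = beta_binomial_update(*prior, successes=7, failures=3)

print("prior mean:", beta_mean(*prior))
print("posterior hyperparameters:", posterior)
print("posterior mean:", beta_mean(*posterior))
```

Read this way, the shape parameters behave like prior counts of successes and failures, which gives a practical handle on how informative a given B prior is.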
2.6.8
The Dirichlet Prior
The B prior can generalize to multiple (i.e., non-binary) variables and is called a Dirichlet (D) distribution. The D distribution is commonly used for data assumed to be distributed as multinomial. The prior can be written as follows for multinomial variable π:
$$\pi \sim D[d_1, \ldots, d_C] \tag{2.13}$$
The hyperparameters for this prior are $d_1, \ldots, d_C$, which control how uniform the distribution will be. Specifically, these parameters represent the proportion in each of the categories of π. Depending on how the software is set up, the Dirichlet prior may be formulated to be in terms of the proportion of cases in each category, or the user may need to specify the number of cases. The most diffuse version of this prior would be D(1, 1, 1) for a 3-category variable, where there is only a single case representing each category so there is no indication of the proportion of cases. A more informative version of this prior could be as follows. Assume that there are 100 participants in the dataset, and the researcher believes that proportions are set at: Category 1 = 45%, Category 2 = 50%, and Category 3 = 5%. In this case, the informative prior could be defined as D(45, 50, 5), where the hyperparameters of the prior reflect the number of cases in each of the categories. Note that, in this case, the Dirichlet hyperparameters are being written out in terms of absolute number of cases rather than as proportions (i.e., in the latter example, $d_1 + d_2 + d_3 = 100$ participants). There are additional ways that this prior can be formulated, all of which are technically equivalent to one another. Another option is to write the prior in terms of proportions for the C − 1 elements of the Dirichlet. Given that the last proportion is fixed to uphold the condition that $\sum_{c=1}^{C} \pi_c = 1.0$, the last category’s proportion is always a fixed and known value.
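The difference between D(1, 1, 1) and D(45, 50, 5) can be summarized through the prior mean and standard deviation of each category proportion. The sketch below uses the standard Dirichlet moment formulas (mean d_c / Σd and variance m(1 − m) / (Σd + 1) for a category with mean m); the function name is illustrative only.

```python
def dirichlet_mean_and_sd(d):
    """Mean and standard deviation of each category proportion under D(d_1, ..., d_C)."""
    total = sum(d)
    means = [dc / total for dc in d]
    sds = [(m * (1.0 - m) / (total + 1.0)) ** 0.5 for m in means]
    return means, sds

diffuse_means, diffuse_sds = dirichlet_mean_and_sd([1, 1, 1])
informative_means, informative_sds = dirichlet_mean_and_sd([45, 50, 5])

print("D(1, 1, 1):   means", diffuse_means, "SDs", diffuse_sds)
print("D(45, 50, 5): means", informative_means, "SDs", informative_sds)
```

The means reproduce the believed proportions, while the sum of the hyperparameters controls how tightly the prior concentrates around them, which is why writing the prior in terms of case counts also communicates its strength.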
2.6.9
Different Levels of Informativeness for Prior Distributions
Prior distributions must be specified for each model parameter within the Bayesian framework, and these distributions can be defined under several different levels of informativeness. Some of the different levels of informativeness will be briefly described in this section, ranging from an extreme level representing diffuse information to an extreme level of informativeness. It is important to recognize that, although I describe categories of prior informativeness, informativeness is really captured along a continuum. To begin, a completely non-informative prior distribution is also sometimes referred to as a diffuse prior. In this book, I prefer this term diffuse to the terminology non-informative because even a lack of information is informative to the posterior. A diffuse prior is a distribution that places a near equal probability for each possible value under that distribution. An example of a completely diffuse prior would be a uniform prior distribution specified for the range of U(−∞, ∞).4 Diffuse priors represent a complete lack of knowledge about the parameter being estimated. The next level of informativeness represents a prior distribution that is essentially somewhere between diffuse and informative but that still holds some useful information. These prior distributions are referred to as weakly informative priors. A weakly informative prior is perhaps more useful than a strictly diffuse prior since some information is conveyed within the distribution. For example, a uniform prior specified as U[0, 10] could be considered a weakly informative prior for a given parameter since a more restricted and potentially useful range was specified compared to U(−∞, ∞). Likewise, a normal prior with a larger variance (but not equal to infinity) would also be considered a weakly informative prior since there is not an equal probability placed on every possible value. 
In some sense, weakly informative priors can be considered to be more useful than diffuse priors because, although they can still be relatively vague, these priors do provide some indication about the range of plausible values for a parameter. Essentially, weakly informative priors do not supply any controversial information, yet are still strong enough to avoid inappropriate inferences (see, e.g., Gelman, 2006). A weakly informative prior can have an impact on the posterior if, for example, a plausible parameter space is specified through the prior. A plausible parameter space represents a range of values that are reasonable for that parameter to take on. One criticism of diffuse priors is that they can incorporate unreasonable or out-of-bound parameter values. In contrast, a potential criticism of a strictly informative prior (described next) is that the range of possible parameter values is not expansive enough, or it swamps the information encompassed in the likelihood (i.e., the prior dictates the posterior without much influence from the likelihood). Determining a plausible parameter space, as specified through a weakly informative prior, helps to mitigate the issues stemming from diffuse priors. Specifically, the weakly informative prior may not include out-of-bound parameter values, but it is also more inclusive than a strictly informative prior.
4 Note, however, that this is an improper prior in that it does not yield a probability distribution that integrates to 1.0. To avoid improper densities, researchers will often specify conjugate priors, but improper forms can still be implemented. I use proper and improper prior forms throughout the examples in this book. For more on improper priors, see Gelman (2006).
The other end of the spectrum includes prior distributions that contain strict numerical information that is crucial to the estimation of the model or represents strong opinions about population parameters. These priors are often referred to as informative prior distributions. Specifically, the hyperparameters for these priors are fixed to express particular information about the model parameters being estimated. This information can come from a variety of places, including from an earlier data analysis or from the published literature. For example, Gelman, Bois, and Jiang (1996) present a study looking at physiological pharmacokinetic models where prior distributions for the physiological variables were extracted from results from the literature. Although these prior distributions were considered to be quite specific, they were also considered to be reasonable given that they resulted from a similar analysis computed on another sample of data. Another example of an informative prior could be to simply decrease the variance of the distribution, creating a prior that only places high probability on certain plausible values in the distribution. This is a very common method used for defining informative prior distributions.
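The effect of moving along the informativeness continuum can be seen in a simple conjugate case. The following is a minimal sketch, assuming a normal mean with known data variance and three hypothetical normal priors centered at 0 that differ only in their variance; the sample summary (mean 10, n = 20, variance 25) is also hypothetical.

```python
def normal_posterior(prior_mean, prior_var, ybar, n, sigma2):
    """Posterior mean and variance for a normal mean with known data variance sigma2."""
    post_var = 1.0 / (1.0 / prior_var + n / sigma2)
    post_mean = post_var * (prior_mean / prior_var + n * ybar / sigma2)
    return post_mean, post_var

ybar, n, sigma2 = 10.0, 20, 25.0        # hypothetical sample summary

priors = {
    "diffuse N(0, 1e6)": (0.0, 1e6),
    "weakly informative N(0, 100)": (0.0, 100.0),
    "informative N(0, 0.5)": (0.0, 0.5),
}
results = {
    name: normal_posterior(m, v, ybar, n, sigma2) for name, (m, v) in priors.items()
}

for name, (post_mean, post_var) in results.items():
    print(f"{name}: posterior mean {post_mean:.2f}, posterior SD {post_var ** 0.5:.2f}")
```

Under the diffuse and weakly informative priors the posterior mean stays close to the sample mean, whereas the small-variance prior pulls it strongly toward the prior mean of 0, which is the situation in which the prior dominates the likelihood.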
2.6.10
Prior Elicitation
One of the main strengths of the Bayesian approach is the use of prior distributions, which incorporate previous knowledge into the estimation algorithm. However, the process of specifying priors may also be considered a point of controversy within this framework. This controversy is tied to the origin of the prior (e.g., method of prior elicitation) as well as the notion that informative priors can have a large impact on posterior estimates, especially when sample sizes are small (see, e.g., Zhang et al., 2007). It follows that one of the most critical features of using the Bayesian estimation framework is carefully identifying a method for defining the prior distributions.
There are many different methods that can be used for prior elicitation. I will briefly describe some of the most common methods here, but this should not be considered an exhaustive list. Expert elicitation is one method that can be implemented when defining priors. There are many different methods that can be used for gathering information from experts. The end goal is to be able to summarize their collective knowledge (or opinions) through a prior probability distribution. Content experts can help to determine the plausible parameter space to create (weakly) informative priors based on expert knowledge. Some reasons for using content experts include the following: (1) ensuring that the prior incorporates the most up-to-date information (recognizing that the published literature lags behind, especially in some fields), or (2) gathering opinions about hypothetical or rare events (e.g., What is the impact of nuclear war on industrialization? Fortunately, our world has limited experience with this, so asking experts for hypothetical information may help to supplement the little information gathered on the topic.). One potential criticism of this approach is that the priors will undoubtedly be skewed toward the subjective opinions of the experts. As a result, executing this process of expert elicitation in a transparent manner is key. The process of expert elicitation has many stages, and resources have been developed to aid in proper execution. One resource is called the SHeffield ELicitation Framework (SHELF), and it is available as a package of documents in R called SHELF (Oakley, 2020). In addition, examples of methods of expert elicitation can be found in Gosling, O’Hagan, and Oakley (2007) or Oakley and O’Hagan (2007), among others. As an alternative approach, the literature can be a terrific tool for gathering information about parameters. 
Systematic reviews and meta-analyses can be used to synthesize information across a body of literature about a topic and construct priors for a future analysis. There have been recent papers highlighting how to implement this process within SEM (see, e.g., van de Schoot et al., 2018; Zondervan-Zwijnenburg et al., 2017). One alternative method used for defining prior distributions when other methods are not possible is to specify a data-driven prior, where prior information is actually a function of the sample data. There are several different types of data-driven priors that can be specified in a Bayesian model. Perhaps one of the more common forms of data-driven priors is to use ML estimates to inform the prior distribution (see, e.g., J. Berger, 2006; Brown, 2008; Candel & Winkens, 2003; van der Linden, 2008). One criticism of using a data-driven prior derived in this manner is that the sample data have been utilized twice in the estimation: once when constructing the prior and again when estimating the posterior distribution. This “double-dipping” into the sample data can potentially distort parameter estimates, as well as artificially decrease the uncertainty in those estimates (Darnieder, 2011). In contrast to this method, there are also other processes that do not require initial parameter estimation (e.g., through ML estimation) when constructing the prior distributions. For example, Raftery (1996), Richardson and Green (1997), and Wasserman (2000) have all constructed methods of defining data-driven priors based on summary statistics (e.g., median, mean, variance, range of data) rather than parameter estimates.
Arguments surrounding the “proper” method(s) for defining priors can be linked to the philosophical underpinnings of the debate between subjective (see, e.g., Goldstein, 2006) versus objective (see, e.g., J. Berger, 2006) Bayesian approaches. Some researchers take the approach that subjectivity must be embedded within statistical methodology and that incorporating subjective opinions into statistics fosters scientific understanding (see, e.g., Lindley, 2000), whereas others (see, e.g., J. Berger, 2006) argue that a more objective approach should be taken (e.g., implementing reference priors). The subjective approach translates into Bayesian estimation through the specification of subjective priors based on opinions, or expert beliefs, surrounding parameter values. One of the main criticisms of this approach is typically rooted in the question: Where do these opinions or beliefs come from? It is certainly true that two researchers can specify completely different prior distributions and hyperparameter values for the same model and data.
The cautionary point when specifying any type of prior is that, although there is no “wrong” opinion for what the priors should be, there is also no “correct” opinion.5 Notice the keyword I used to describe priors (whether they were labeled objective, subjective, informative, or diffuse) was opinion, and I urge the reader to keep this notion in mind when reading the Bayesian literature. It is also worth noting here that there are many different subjective aspects to model estimation aside from the specification of priors. For example, model selection and model building are both features of estimation that incorporate subjectivity and opinion. Further, the subjectivity of these model-related features may even have a larger impact on estimates than the choice of subjective priors or the hyperparameters.
5 Delving into the philosophical underpinnings of Bayesian statistics, or statistics in general, is beyond the scope of the book. However, I am compelled to list a few sources for those interested in reading more on this topic. For a terrific treatment of subjective and objective aspects of statistics, see Gelman and Hennig (2017). For an introduction to the issues surrounding subjective versus objective Bayesian statistics, see Kaplan (2014, Chapter 10). Finally, Gelman and Shalizi (2012) present an important and insightful take on the philosophy of Bayesian statistics, noting in the end of the paper: “Likelihood and Bayesian inference are powerful, and with great power comes great responsibility” (p. 32).
2.6.11
Prior Predictive Checking
Finally, a related issue deals with the idea that, although there is no such thing as a “wrong” prior in application, some prior formations are not viable as it relates to the likelihood (described in the next section). The ability to draw inferences through Bayesian methods is, in part, based on the accuracy of the model. However, it is also based on the “accuracy” of the priors. Just as it is important to check the alignment of the model with the data (see Chapter 11), it is also important to assess the alignment of the priors with the data. Prior prediction methods can be used to aid in improving the understanding of the priors, but should not routinely be used as a method for changing the priors.6 Box (1980) suggested that a prior-predictive distribution can be derived from a specified prior. This prior predictive distribution comprises all possible samples that can occur, provided the model is true. The idea is that a prior that is true to the data will provide a prior-predictive distribution that is akin to the true data-generating distribution (Daimon, 2008; Ranganath & Blei, 2019). The compatibility between the prior and the data is captured through a p-value, where small p-values indicate that the observed data was unlikely to be generated by the model (i.e., the priors that were specified) (Daimon, 2008). For an example of Box’s version of the prior-predictive distribution, see van de Schoot et al. (2021). In addition, several modifications have been proposed to Box’s method (see, e.g., Evans & Moshonov, 2006; Evans & Jang, 2011). A main criticism of the prior-predictive checking process (regardless of the exact method used) is that it relies on the interpretation of the p-value. Basing the evaluation of the prior on a p-value leaves the identification of a prior-data conflict dependent on the interpretation of arbitrary cutoff values for the p-value. 
There have been several other methods proposed that do not rely on p-values, all aimed at identifying potential prior-data conflicts. Young and Pettit (1996) proposed a method using Bayes factors to compare models with two competing sets of priors. In addition, there are also methods that are based on Kullback-Leibler divergence (see, e.g., Bousquet, 2008; Nott, Drovandi, Mengersen, & Evans, 2018).
6 There may be some instances in which the priors are found to be virtually unaligned with the data (i.e., the prior generates data that are completely incorrect). In these cases, modifying the priors based on this finding could be warranted. However, I view this process of prior-predictive checking as a method for understanding the priors and the role that they play, and not, by definition, as a method for identifying changes that need to be made to the priors. If priors are changed as a result of this process, then this modification should be made transparent and be framed as a part of the model-building phase.
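The logic of a prior-predictive check can be illustrated by simulation. The sketch below is a simplified, simulation-based variant in the spirit of the prior-predictive idea and is not Box’s (1980) exact procedure: it assumes normal data with a known standard deviation, a normal prior on the mean, and uses the sample mean as the checking statistic. All numerical values are hypothetical.

```python
import random

random.seed(3)

def prior_predictive_p_value(observed_mean, prior_mean, prior_sd, sigma, n, n_sims=4000):
    """Share of prior-predictive datasets whose sample mean lies at least as far
    from the prior mean as the observed sample mean does (two-sided tail area)."""
    extreme = 0
    for _ in range(n_sims):
        theta = random.gauss(prior_mean, prior_sd)                 # parameter drawn from the prior
        y_rep = [random.gauss(theta, sigma) for _ in range(n)]     # data drawn given that parameter
        rep_mean = sum(y_rep) / n
        if abs(rep_mean - prior_mean) >= abs(observed_mean - prior_mean):
            extreme += 1
    return extreme / n_sims

observed_mean = 10.0    # hypothetical observed sample mean
p_tight = prior_predictive_p_value(observed_mean, prior_mean=0.0, prior_sd=1.0, sigma=5.0, n=20)
p_wide = prior_predictive_p_value(observed_mean, prior_mean=0.0, prior_sd=10.0, sigma=5.0, n=20)
print(p_tight, p_wide)
```

A very small value under the tight prior signals that data like those observed would almost never arise from that prior and model, whereas the wider prior is compatible with the data. As noted above, any cutoff applied to such a p-value is arbitrary, so the result is better read as a diagnostic for understanding the prior than as a decision rule.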
2.7
The Likelihood (Frequentist and Bayesian Perspectives)
Prior selection is often viewed as one of the most important stages in Bayesian statistical modeling because it is the facet most known for subjectivity. However, model subjectivity and uncertainty are just as important to consider. The second circle in panel (b) of Figure 2.1 represents the likelihood, which is used in frequentist and Bayesian estimation to capture how much support the data have for specific values for the unknown model parameters. The selection of the likelihood and priors will ultimately go hand-in-hand in that the prior distributions selected will depend on the parameters defining the likelihood. As a result, these two elements of the Bayesian Research Circle should be viewed as being dependent on one another.
Frequentist and Bayesian inference are based, in part, on a conditional distribution of data (y) given model parameters (θ) such that p(y|θ). The main difference between frequentist and Bayesian approaches is in how the model parameters are viewed. Within the frequentist framework, the model parameters are viewed as being fixed but unknown. In other words, probability statements about the unknown model parameters are not considered within frequentist approaches because the parameters are assumed to be fixed. The data within the frequentist framework are treated as random. As mentioned earlier, frequentist estimation entails converging upon a point estimate for each model parameter through methods such as ML estimation. The ML estimation approach maximizes the conditional probability p(y|θ) of the random data given the fixed but unknown model parameters. Specifically, the term p(y|θ) represents the conditional probability when viewed as a function of the unknown model parameters θ. However, once data (y) have been collected (and are therefore observed), they can be included in the conditional likelihood expression to form a likelihood function (or likelihood; L(θ|y)).
The only difference between the likelihood notation of L(θ|y) and the conditional probability notation of p(y|θ) is that the likelihood assumes data are observed and the expression is defined as varying over values of θ. Within the Bayesian estimation framework, the model parameters θ are still assumed to be unknown, but the approach also treats them as being random rather than fixed. In this framework, probability statements can be assigned to model parameters (via prior distributions), reflecting their random nature. The observed data are treated as fixed, which sets the stage for the likelihood L(θ|y). The likelihood, independent of being viewed through the frequentist or Bayesian perspectives, is a function that summarizes a statistical model, a range of possible values of the unknown model parameters (θ), and the observed data (y).
One aspect that is often lost in the discussion surrounding the inherent subjectivity of Bayesian inference is that the likelihood (again, independent of discussing frequentist or Bayesian estimation) is also subjective in nature; typically the focus is solely on prior subjectivity. There is natural subjectivity and uncertainty in the model formulation, which is embedded within the likelihood as a statistical model that stochastically generates the data. When complex phenomena and processes are under study, then it is more often the case that complex statistical models are implemented as an attempt to capture these processes. In many modeling situations, especially those found within SEM-based inquiries, the data-generating model is typically not known. The unknown nature of the data-generating model incorporates uncertainty into the process when the statistical model is specified. The choice of the statistical, data-generating model is a subjective one. As a result, the process used to determine the model formulation should be made transparent. The exact specification of the statistical model can alter the make-up of the likelihood, which comprises the backbone of model estimation, whether via frequentist or Bayesian methods. Given the subjective nature of the likelihood, Figure 2.1 illustrates several points to consider during this phase of constructing the likelihood. Just as with other elements presented here, this part of the figure depicts a circular process, implying an iterative model-building phase may be needed.
When constructing the likelihood, the researcher should consult background knowledge and previous research to determine the optimal statistical model.[7] Then data collection can occur, where data can then be cleaned and model assumptions can be checked. The statistical model and observed data determine the likelihood in this phase. One final element regarding the likelihood that should be discussed is the need for robustness checks. In order to fully understand the influence of subjectivity on final model estimates, the likelihood must be examined. Most sensitivity analyses focus on the influence that priors have on posterior estimates. However, the likelihood function is also subjective in modeling contexts in which the data-generating model is unknown. In cases such as these, a sensitivity analysis of the likelihood (e.g., on the assumed data-generating model) can help to illuminate the potential impact of model subjectivity on posterior inference. For more information on examining the robustness of likelihood functions in the Bayesian estimation framework, see Greco, Racugno, and Ventura (2008) or Agostinelli and Greco (2013).
[7] The model can be further examined after estimation using any of the methods described in Chapter 11.
2.8
The Posterior
The third circle in panel (b) of Figure 2.1 represents the posterior, which is a compromise of the likelihood (i.e., the statistical model and observed data) and priors (i.e., probability statements assigned to unknown model parameters). During the Bayesian estimation process, posterior distributions are formed for each unknown model parameter. The estimated posterior represents the final result that practitioners seek when using Bayesian methods. It can be summarized and interpreted, and inferences can be made based on the obtained distribution. In order to obtain the posterior distributions for a given model, an iterative sampling method is typically implemented. The most common process is to implement an algorithm called the Markov chain Monte Carlo (MCMC) method. This portion of the Bayesian Research Circle represents the stage at which the estimation process would be identified and implemented in order to obtain estimated posteriors for each unknown model parameter. The following section presents these processes in greater detail, introducing MCMC, which is commonly implemented in the Bayesian modeling setting. In addition, several aspects that are relevant to posterior estimation are presented.
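The idea of the posterior as a compromise between likelihood and prior can be illustrated with a small grid approximation. The sketch below assumes a normal likelihood with known standard deviation 1, a hypothetical N(3, 1) prior on the mean, and hypothetical data; none of these values come from the text. For this simple conjugate case the posterior mean is also available analytically, which gives a check on the grid result.

```python
import math

y = [4.1, 5.3, 4.8, 5.9, 5.0]    # hypothetical data; sigma = 1 is assumed known
prior_mean, prior_sd = 3.0, 1.0  # hypothetical N(3, 1) prior on mu

def log_prior(mu):
    return -0.5 * ((mu - prior_mean) / prior_sd) ** 2

def log_likelihood(mu):
    return sum(-0.5 * (obs - mu) ** 2 for obs in y)

# Posterior is proportional to likelihood times prior, evaluated on a grid.
grid = [i / 1000 for i in range(0, 10001)]  # mu from 0 to 10
unnormalized = [math.exp(log_prior(mu) + log_likelihood(mu)) for mu in grid]
total = sum(unnormalized)
posterior = [u / total for u in unnormalized]

post_mean = sum(mu * p for mu, p in zip(grid, posterior))

# Conjugate-normal check: precision-weighted average of prior mean and sample mean.
n = len(y)
y_bar = sum(y) / n
analytic_mean = (prior_mean / prior_sd**2 + n * y_bar) / (1 / prior_sd**2 + n)
```

The posterior mean falls between the prior mean and the sample mean, and it sits closer to the sample mean because five observations carry more information than this prior.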
2.8.1
An Introduction to Markov Chain Monte Carlo Methods
One important distinction between Bayesian and frequentist estimation is what exactly the researcher is trying to estimate. Under the frequentist estimation framework, we assume that model parameters are fixed but unknown values. Then sample data are collected and used to estimate a best “guess” at what those model parameter values are. The estimation process produces a single point estimate (i.e., a single number) that represents the most likely value for the model parameter. Bayesian methods work quite differently in that a posterior distribution is estimated for each model parameter. It is not typically possible to obtain direct inference on the posterior, which became a practical reason why many past researchers opted for frequentist methods instead. However, the advent of MCMC made this
inference possible (Gelfand & Smith, 1990). MCMC is a complex process used in a variety of settings. Although it has been a tool used in some fields for well over half a century, MCMC has not made its way into applied statistics and the related social and behavioral sciences until recent decades (for early work, see, e.g., Geyer, 1991; Gilks, Richardson, & Spiegelhalter, 1996). In general, MCMC is used as a means to simulate complex stochastic processes that cannot otherwise be easily implemented through analytic calculations (Geyer, 1991). MCMC utilizes a simulation process to compute, sometimes high-dimensional, integrals that are involved in various forms of statistical inference. In other words, many samples (thousands or more) are simulated from the posterior distribution for each model parameter. These samples are then constructed in a way that forms an estimate of the posterior distribution. One of the main advantages of utilizing the MCMC algorithm is the framework itself. This framework is much different from something like ML via the expectation-maximization algorithm (ML/EM), since the focus of MCMC is on converging to a distribution that carries certain distributional properties. This notion of converging in distribution is a fundamental difference between MCMC and some of the more traditional estimation processes (e.g., ML/EM). The goal of MCMC is to produce a distribution for a parameter rather than a point estimate. However, it is often the case that the distribution is then summarized by a central tendency measure– albeit some Bayesians strictly disagree with this practice. There are other advantages outside of the alternate interpretation of the results, such as the flexibility and ease of implementing complex models. However, as is true with all estimation algorithms, there are also some disadvantages of implementing MCMC. 
Specifically, this estimation algorithm can be easily misused without proper knowledge of the distributional properties being implemented within the context of Bayesian estimation. It is also sometimes difficult to detect the accuracy of MCMC results given the nature of Markov chain convergence (discussed in further detail in Chapter 12). This technique can be computationally demanding. Although some complex models can take up to several days to run, the increase in computer speed and available computational resources have made this a less problematic feature of MCMC than it once was. As a result, research implementing MCMC has increased in the methodological and the applied literature. Finally, I will note that, while MCMC is the most common algorithm used in Bayesian methods, it is not the only option. There are other algorithms that can be implemented. For example, sequential Monte Carlo can be used for real time processing of data (e.g., data obtained online in real
time) (Doucet, de Freitas, & Gordon, 2001). Approximate Bayesian computation can be used in cases in which the likelihood function is intractable (Sisson, Fan, & Beaumont, 2018), and integrated nested Laplace approximations (INLA; Martino & Riebler, 2019) can be used for latent Gaussian models (e.g., generalized additive models). These methods are all quite useful, but the focus of this book is in posterior inference based on MCMC.
2.8.2
Sampling Algorithms
As the name suggests, there are two elements at work within MCMC. This method can be thought of as having a Monte Carlo component as well as a Markov chain component. The estimator combines the two: Monte Carlo integration is carried out using Markov chains. Monte Carlo integration is a process used to draw samples from a distribution which are then averaged to approximate expected values (Gilks et al., 1996). These samples are drawn by a specified Markov chain that sometimes runs for a large number of iterations. Within MCMC, there are several different methods, called sampling methods, of constructing these chains.
A Conceptual Description of the Process
Akin to frequentist estimation, there is still an assumption that model parameters are fixed and unknown within Bayesian estimation. However, the Bayesian estimation framework allows the researcher to obtain a summary of an estimated posterior distribution, rather than a single point estimate. In other words, under the Bayesian estimation framework, the model parameter estimate will actually be a distribution that represents the degree of (un)certainty surrounding the model parameter value in the population. If the distribution is normally distributed, then picture a distribution that is very flat and wide. This posterior would indicate that there is a great deal of uncertainty surrounding the model parameter value. In contrast, the estimated posterior may reflect a very narrow distribution. In this case, there is relatively greater certainty surrounding the model parameter value in the population. Bayesian methods allow for a richer set of results because the results come in the form of probability distributions that help reflect the degree of (un)certainty in the final model results. Frequentist results produce a single point estimate (which can also be accompanied by a confidence interval).
Although some find it easier to interpret (because our statistical training is mostly focused on the frequentist school of thought), a point estimate is not nearly as revealing about the entire picture surrounding the model parameters, the degree of generalizations, and substantive conclusions we can derive surrounding the population.
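The Monte Carlo integration step mentioned at the start of this section (draw samples, then average them to approximate expected values) can be sketched in a few lines. The example assumes the draws come directly from a known N(1, 2²) distribution, so that the approximations can be compared against exact values; in a real MCMC run the draws would come from the Markov chain instead.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical "posterior": a normal distribution with mean 1 and SD 2.
draws = [random.gauss(1.0, 2.0) for _ in range(200_000)]

# Monte Carlo estimates are plain averages over the draws.
mc_mean = sum(draws) / len(draws)                          # approximates E[theta] = 1
mc_second_moment = sum(d * d for d in draws) / len(draws)  # approximates E[theta^2] = 1 + 4 = 5
mc_prob_positive = sum(d > 0 for d in draws) / len(draws)  # approximates P(theta > 0)
```

The same averaging idea yields any posterior summary, including probabilities, which is why a set of draws is a richer result than a single point estimate.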
In order to estimate probability distributions that reflect the population values of the model parameters, the Bayesian estimation process looks quite different. The posterior distribution cannot typically be directly solved for. Before describing how the posterior is found, let’s work with a quick analogy to aid in understanding the process. Imagine you are putting a puzzle together without a picture of what the puzzle looks like. You have many pieces in front of you, but you have no idea what these pieces form. In this example, you draw one puzzle piece at a time and then you fit it to the previous piece that you drew. Eventually, you will have drawn enough pieces to get an idea of what picture comprises the puzzle. In other words, you pulled puzzle pieces (one at a time) until you were able to reconstruct what the puzzle image was. Akin to this analogy, the posterior needs to be reconstructed one piece at a time. This process involves a long series of draws from the posterior. In other words, the best way to identify what the posterior distribution looks like is to repeatedly draw samples from it and then use those samples to construct a picture of what the posterior distribution looks like. Each of these samples is drawn from the posterior, one at a time, and is viewed as a likely value that the model parameter would take on. Once “enough” samples are drawn, then the researcher can get a relatively clear picture of what the posterior distribution looks like. We say that the posterior is converged upon.
A More Technical Description of the Process
The typical estimation process used to reconstruct the posterior distribution is through the MCMC method, often with a separate sampling algorithm also implemented. The MCMC method has two main parts: (1) the Markov chain part constructs a chain that is comprised of samples pulled from the posterior, and (2) the Monte Carlo part represents a simulation process that takes place to reconstruct the posterior.
The sampling algorithm is a specific process (and there are many sampling algorithms, e.g., Gibbs sampling or the Metropolis-Hastings algorithm) that can be used to determine what value is sampled from the posterior in a given stage of the estimation process. A Markov chain (i.e., a series, or chain, of samples) is formed for each model parameter in an iterative fashion. This chain is formed using Monte Carlo (simulation) procedures, as well as a sampling algorithm. The sampling algorithm will aid in determining a likely value for one model parameter, usually given current values of all other model parameters. This value will be “drawn” or “sampled” to form a single iteration in the chain. Then, a likely value for the next model parameter will be drawn from the posterior and will represent the next iteration in that parameter’s chain.
This process happens in turn for all model parameters (usually one parameter at a time), and it will be repeated sometimes many thousands (or even millions!) of times depending on the complexity of the model. Once there are enough draws from the posterior to result in a stable distributional form for every model parameter, then the chains are said to have converged. A converged chain represents an accurate estimate for the true form of the posterior. Another way of phrasing this is that once the chain has reached the stationary distribution, any subsequent draws from the Markov chain are dependent samples from the posterior (van de Schoot et al., 2021). In addition, once the stationary distribution is reached, the researcher must determine how many samples are needed to obtain reliable Monte Carlo estimates. To begin the process, the Markov chain receives starting values and is then defined through the transition kernel, or sampling method. The first sampling method was introduced by Metropolis and colleagues (Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953) and this algorithm has served as a basis for all other sampling methods developed within MCMC. However, the Metropolis algorithm as we use it today is actually a generalization of the original work by Metropolis et al. (1953), and this generalization was first introduced by Hastings (1970). Hastings reworked the Metropolis algorithm to relax certain assumptions, which resulted in a more flexible algorithm that is now referred to as the Metropolis-Hastings algorithm. As mentioned, the Metropolis-Hastings algorithm has given rise to several alternative samplers that can be used in different modeling situations. Perhaps one of the most utilized of these is the Gibbs sampler. The Gibbs sampler was originally introduced by Geman and Geman (1984) in the context of estimating Gibbs distributions (hence the name) within image-processing models (Casella & George, 1992).
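For orientation, a random-walk Metropolis sampler, the simplest member of the Metropolis-Hastings family introduced above, can be sketched as follows. The target is assumed to be a standard normal posterior known only up to a constant, and the function names, step size, starting value, and burn-in length are all hypothetical choices made for illustration. With a symmetric proposal, a candidate is accepted with probability min(1, p(candidate|y) / p(current|y)).

```python
import math
import random

def log_target(theta):
    """Unnormalized log posterior; a standard normal is assumed for illustration."""
    return -0.5 * theta * theta

def metropolis(n_iter, start, step_sd, seed=0):
    rng = random.Random(seed)
    chain = [start]                      # the chain begins at the starting value
    n_accepted = 0
    for _ in range(n_iter):
        current = chain[-1]
        candidate = current + rng.gauss(0.0, step_sd)  # symmetric proposal
        log_ratio = log_target(candidate) - log_target(current)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            chain.append(candidate)      # accept: move to the candidate
            n_accepted += 1
        else:
            chain.append(current)        # reject: repeat the current state
    return chain, n_accepted / n_iter

chain, accept_rate = metropolis(n_iter=50_000, start=5.0, step_sd=2.4)
kept = chain[5_000:]                     # discard early, start-dependent draws
post_mean = sum(kept) / len(kept)
post_var = sum((x - post_mean) ** 2 for x in kept) / (len(kept) - 1)
```

Only the ratio of posterior densities enters the acceptance step, so the normalizing constant of the posterior never has to be computed.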
The Gibbs sampler is often viewed as a special case of the Metropolis-Hastings algorithm since, akin to the Metropolis-Hastings algorithm, the Gibbs sampler also generates a Markov chain of random variables which converges to a stationary distribution. However, the difference between the Metropolis-Hastings algorithm and the Gibbs sampler is that the Gibbs sampler accepts every candidate point for the Markov chain with probability 1.0, whereas the Metropolis-Hastings algorithm does not. The Metropolis-Hastings algorithm allows an arbitrary choice of candidate points from a proposal distribution when forming the posterior distribution (Geyer, 1991). These candidate points are accepted as the next state of the Markov chain in proportion to their relative likelihood as seen in the proportionality relationship for the posterior distribution: p(θ|y) ∝ p(y|θ)p(θ). As a result, the proposal distribution can take on any form and still theoretically lead to the proper stationary distribution representing the posterior (Gilks et al., 1996). However, the Metropolis-Hastings algorithm was set up such that the proposal distribution could depend on the current state of the Markov chain, thus producing highly dependent states. This high dependence can result in the chain needing to run for a large number of iterations before reaching the stationary distribution.
The main contribution of the Gibbs sampler was the knowledge that the conditional distributions are sufficient to determine the joint (or marginal) distribution (Casella & George, 1992). In a sense, this is an indirect method of generating random variables from a marginal distribution in that the density need not ever be directly computed. Computing the joint distribution in a high-dimensional modeling situation can become quite complex, but the method employed by the Gibbs sampler can handle these situations with ease. Specifically, the Gibbs sampler avoids computing the integrals that are of specific concern in high-dimensional situations. This high-dimensional integration is instead replaced by a series of unidimensional random variable generations, which is much more straightforward to compute (Casella & George, 1992).
Gibbs sampling typically occurs one parameter at a time. This technique samples each parameter individually with respect to its conditional distribution and treats all other parameters as known. Specifically, one parameter is updated with respect to the conditional distribution given the remaining variables under the stationary distribution. This updating process typically occurs in a particular fixed order for the parameters and is sometimes referred to as scanning (Geyer, 1991). The process that Gibbs uses to generate a sample is as follows.
Suppose that you have a vector of model parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_q)$ that is given a starting point at time zero such that $\theta^{(0)} = (\theta_1^{(0)}, \ldots, \theta_q^{(0)})$. The Gibbs sampler generates state $s$ ($\theta^{(s)}$) of the Markov chain from state $s-1$ ($\theta^{(s-1)}$) such that

1. sample $\theta_1^{(s)} \sim p(\theta_1 \mid \theta_2^{(s-1)}, \theta_3^{(s-1)}, \ldots, \theta_q^{(s-1)})$
2. sample $\theta_2^{(s)} \sim p(\theta_2 \mid \theta_1^{(s)}, \theta_3^{(s-1)}, \ldots, \theta_q^{(s-1)})$
   $\vdots$
q. sample $\theta_q^{(s)} \sim p(\theta_q \mid \theta_1^{(s)}, \theta_2^{(s)}, \ldots, \theta_{q-1}^{(s)})$
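The steps above can be sketched for the simplest case of two parameters. The example assumes, for illustration only, that the target is a standard bivariate normal distribution with correlation 0.6, for which each full conditional is itself normal: θ1 given θ2 is N(ρθ2, 1 − ρ²), and symmetrically for θ2. The function name, starting values, and burn-in length are hypothetical.

```python
import math
import random

def gibbs_bivariate_normal(n_iter, rho, start=(0.0, 0.0), seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho**2)   # SD of each full conditional
    theta1, theta2 = start              # state at time zero
    chain = []
    for _ in range(n_iter):
        # Step 1: draw theta1 given theta2 from the previous state (s - 1).
        theta1 = rng.gauss(rho * theta2, cond_sd)
        # Step 2: draw theta2 given the theta1 just drawn (state s).
        theta2 = rng.gauss(rho * theta1, cond_sd)
        chain.append((theta1, theta2))  # every draw is accepted
    return chain

chain = gibbs_bivariate_normal(n_iter=20_000, rho=0.6, start=(4.0, -4.0))
kept = chain[1_000:]                    # drop early, start-dependent draws
n = len(kept)
mean1 = sum(t1 for t1, _ in kept) / n
mean2 = sum(t2 for _, t2 in kept) / n
cov12 = sum((t1 - mean1) * (t2 - mean2) for t1, t2 in kept) / (n - 1)
```

Only univariate draws are ever made, yet the retained pairs recover the joint structure of the target, including the covariance between the two parameters.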
This sampling algorithm generates a dependent sequence of vectors such that θ(s) depends on θ(0), θ(1), . . . , θ(s−1). However, given θ(s−1), θ(s) is conditionally independent of θ(0), θ(1), . . . , θ(s−2). This conditional independence is called the Markov property and the sequence produced is a Markov chain. Using a larger number of iterations within Gibbs sampling is optimal since, as the number of iterations s marches toward infinity, the chain will converge to the stationary distribution. A larger number of iterations gives a better approximation because the draws come to represent a sample from the marginal distribution (via conditional distributions), without the marginal density ever being computed directly (Casella & George, 1992). The Gibbs sampler and Metropolis algorithm can be inefficient because of their random walk behavior. As a result, many advances have been made to help improve the efficiency of these methods. Specifically, many extensions of the Gibbs sampler have been developed, which use different methods for generating correlation and covariance techniques (see, e.g., Asparouhov & Muthén, 2010b; Boscardin, Zhang, & Belin, 2008; Chib & Greenberg, 1995). Reparameterization of the model can aid in speeding mixing time, as can the implementation of jumping rules through different versions of these algorithms. However, in the case of high-dimensional models, these techniques may not be sufficient to resolve the problems. More advanced Markov chain simulation methods have been developed in order to combat issues with mixing. One such method is called Hamiltonian Monte Carlo, which is also referred to as a hybrid Monte Carlo approach. Hamiltonian Monte Carlo is described in detail in Gelman, Carlin, et al. (2014, Section 12.4). In short, Hamiltonian Monte Carlo is a generalization of the Metropolis algorithm, which moves through the parameter space faster and more efficiently.
It includes what is referred to as “momentum” variables that aid in moving farther in the parameter space in a given iteration. This aspect of the method helps to speed mixing time which, as mentioned, can be incredibly long for some high-dimensional models. Hamiltonian Monte Carlo can move much faster through the target distribution because it essentially suppresses the random walk aspect of the Metropolis algorithm. Each parameter in the model, θ, receives a “momentum” variable, and these two are updated together in a new Metropolis algorithm (Gelman, Carlin, et al., 2014). A jumping distribution for θ is determined by this “momentum” variable, making it possible to move rapidly through the parameter space of θ. For the sake of illustration, the “momentum” variable will be referred to as ω, a vector with elements corresponding to all θ model parameters. Within Hamiltonian Monte Carlo, the posterior p(θ|y) is supplemented by an independent distribution of the “momentum” variable, p(ω), resulting in a joint distribution as follows:

$$p(\theta, \omega \mid y) = p(\omega)\, p(\theta \mid y) \tag{2.14}$$

Simulations are obtained from the joint distribution, but only simulations of θ are of interest. The “momentum” vector, ω, is then viewed as an auxiliary variable with the sole purpose of speeding mixing time. The ω vector typically receives a multivariate normal distribution, with mean vector 0 and a Q-dimensional covariance matrix set to what is referred to as a “mass matrix,” M.[8] By implementing a diagonal matrix for M, the components of ω are independent and normal (N) priors can be implemented such that

$$\omega_q \sim N(0, M_{qq}) \tag{2.15}$$

where there are q = 1, . . . , Q elements in ω. First, the Hamiltonian Monte Carlo iteration begins by updating ω with a random draw from ω ∼ N(0, M). Then there is a simultaneous update of θ and ω. A more detailed explanation of this process can be found in Gelman, Carlin, et al. (2014). Hamiltonian Monte Carlo is growing in popularity due to its speed and accuracy in reconstructing the posterior distribution, and the No-U-Turn sampler (NUTS) variant of Hamiltonian Monte Carlo is the default in the program Stan (Stan Development Team, 2020).
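A bare-bones version of one Hamiltonian Monte Carlo scheme can be sketched for a single parameter. The sketch assumes a standard normal target (so the gradient of the log posterior is simply −θ), a scalar mass M = 1, and the commonly used leapfrog scheme for the simultaneous update of θ and ω; the step size, number of leapfrog steps, and function names are hypothetical tuning choices, and this is not the NUTS variant used by Stan.

```python
import math
import random

def log_post(theta):
    return -0.5 * theta**2   # assumed target: standard normal, unnormalized

def grad_log_post(theta):
    return -theta            # derivative of log_post

def hmc(n_iter, start, step_size=0.2, n_leapfrog=10, mass=1.0, seed=0):
    rng = random.Random(seed)
    theta, chain = start, []
    for _ in range(n_iter):
        omega = rng.gauss(0.0, math.sqrt(mass))   # fresh draw: omega ~ N(0, M)
        theta_new, omega_new = theta, omega
        # Leapfrog: half step for omega, alternating full steps, half step for omega.
        omega_new += 0.5 * step_size * grad_log_post(theta_new)
        for step in range(n_leapfrog):
            theta_new += step_size * omega_new / mass
            if step < n_leapfrog - 1:
                omega_new += step_size * grad_log_post(theta_new)
        omega_new += 0.5 * step_size * grad_log_post(theta_new)
        # Metropolis accept/reject on the joint density p(omega) * p(theta | y).
        log_joint_old = log_post(theta) - 0.5 * omega**2 / mass
        log_joint_new = log_post(theta_new) - 0.5 * omega_new**2 / mass
        if rng.random() < math.exp(min(0.0, log_joint_new - log_joint_old)):
            theta = theta_new
        chain.append(theta)                        # omega is discarded
    return chain

chain = hmc(n_iter=5_000, start=3.0)
kept = chain[500:]
hmc_mean = sum(kept) / len(kept)
hmc_var = sum((x - hmc_mean) ** 2 for x in kept) / (len(kept) - 1)
```

Each iteration travels along a trajectory guided by the gradient rather than taking a small random step, which is the sense in which the random walk behavior is suppressed.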
2.8.3
Convergence
Due to the nature of MCMC sampling, chain convergence is an important element to consider when evaluating posterior estimates. In fact, it is so important that the success of the entire estimation process resides in the ability to detect (non)convergence. Rather than assessing for convergence to a single value, this process surrounds convergence to a distribution. The key for proper assessment of convergence typically resides in examining several different elements of convergence as measured through different diagnostics. Each of the main diagnostics will pinpoint different elements of the chain. Examining results from several different convergence diagnostics will help to shed light onto different elements of the chain and ensure that decisions surrounding convergence are maximally informed. Chapter 12 covers specific convergence diagnostics (e.g., the potential scale reduction factor, R̂) in detail and provides examples for assessing results.
[8] Gelman, Carlin, et al. (2014) explain that Hamiltonian Monte Carlo is derived from physics and this “mass matrix” gets its name from the physical model of Hamiltonian dynamics.
2.8.4
MCMC Burn-In Phase
A Markov chain produced through MCMC, regardless of sampling method, tends to have a high level of dependency between samples in that new states within the chain are influenced by the previous states. As a result, it is common (and also necessary) to discard the beginning iterations of a chain, where dependency on starting values is the highest. Only samples drawn after this section will comprise the posterior distribution. This beginning stage is referred to as the burn-in (or warm-up) phase of the chain. Although the length of the burn-in phase is a function of model complexity, starting values, how similar the likelihood and prior distribution are, and sample dependence, there are some rules of thumb for determining the appropriate number of iterations to include in the burn-in. For example, Geyer (1991) indicated that 5% of the length of a long chain should be devoted to the burn-in phase. However, some research has indicated the need for longer burn-in phases (e.g., Sinharay, 2004). For some guidance in this matter, there are several convergence diagnostics available that together can help determine the optimal length of the burn-in phase.
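The effect of discarding a burn-in phase can be illustrated with a toy chain. The sketch below uses a first-order autoregressive process as a stand-in for a real MCMC chain (an assumption made so the example is self-contained); its stationary distribution is N(0, 1), and it is started from a deliberately poor value of 200. The 5% burn-in follows the rule of thumb quoted above.

```python
import math
import random

rng = random.Random(42)
phi = 0.9            # dependence between successive states (hypothetical)
n_iter = 2_000
theta = 200.0        # deliberately poor starting value

chain = []
for _ in range(n_iter):
    # Stand-in transition: AR(1) step whose stationary distribution is N(0, 1).
    theta = phi * theta + rng.gauss(0.0, math.sqrt(1 - phi**2))
    chain.append(theta)

burn_in = n_iter // 20        # first 5% of the chain
kept = chain[burn_in:]        # only these draws summarize the posterior

mean_all = sum(chain) / len(chain)   # contaminated by the starting value
mean_kept = sum(kept) / len(kept)    # close to the stationary mean of 0
```

The mean computed from the full chain is pulled toward the starting value, whereas the post-burn-in mean is not; a longer burn-in would be needed if the chain approached its stationary distribution more slowly.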
2.8.5
The Number of Markov Chains
Up until this point, I have referred to the MCMC process in the case involving a single-chain context. However, there are some researchers who prefer to run multiple chains rather than only one with fear that running only one chain can produce misleading answers (see, e.g., Gelman & Rubin, 1992a). This strategy usually incorporates several different Markov chains that have different parameter starting values. The aim of using more than one chain is that the chains should all converge to the same stationary distribution, regardless of the starting values. If this is the result, then the thought is that it can be viewed as support for parameter convergence. Some researchers use multiple chains to decrease the length of the chains needed. It is thought that if several chains converge to the same result, then the proper stationary distribution was obtained. However, this is quite misleading in several ways. It is the case that if two chains produced completely different distributions, this would be an indication that the chains are too short and perhaps a longer burn-in phase would produce proper convergence. Nonetheless, it cannot necessarily be concluded that convergence was obtained if two (or even more) short chains produced the same result. It could still be the case that a longer burn-in is needed for the proper stationary distribution to be obtained (Geyer, 1991; Gilks et al., 1996). Specifically, convergence in distribution to the proper stationary
distribution for multiple chains relies on the number of chains and the post-burn-in iterations marching toward infinity (Geyer, 1991). As a result, there should be caution with interpreting results from several shorter chains as found in Gelfand and Smith (1990), for example. Perhaps a more appropriate multiple-chain scenario would be to run several longer chains (see, e.g., Gelman & Rubin, 1992a). Although this defeats the purpose of saving computing time with shorter chains, it does help prevent prematurely concluding convergence was satisfied. I refer to this issue as local convergence and expand on it in Chapter 12. The take-home message here is that, no matter the number of chains, the main concern within MCMC is ensuring that the length of the chain is long enough to obtain convergence to the stationary distribution.
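A multiple-chain check can be sketched with the potential scale reduction factor in its classic form, which compares between-chain and within-chain variance. The chains here come from a stand-in autoregressive process with an N(0, 1) stationary distribution (an assumption for illustration), started from dispersed values; the function names, starting values, and chain lengths are hypothetical.

```python
import math
import random

def ar1_chain(n_iter, start, phi=0.9, seed=0):
    """Stand-in Markov chain whose stationary distribution is N(0, 1)."""
    rng = random.Random(seed)
    x, out = start, []
    for _ in range(n_iter):
        x = phi * x + rng.gauss(0.0, math.sqrt(1 - phi**2))
        out.append(x)
    return out

def sample_variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def psrf(chains):
    """Potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [sum(c) / n for c in chains]
    within = sum(sample_variance(c) for c in chains) / len(chains)
    between = n * sample_variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)

starts = [-20.0, 0.0, 20.0]   # dispersed starting values
short_chains = [ar1_chain(30, s, seed=i) for i, s in enumerate(starts)]
long_chains = [ar1_chain(5_000, s, seed=i)[500:] for i, s in enumerate(starts)]

rhat_short = psrf(short_chains)   # well above 1: chains still disagree
rhat_long = psrf(long_chains)     # near 1: chains overlap
```

A value near 1 indicates only that the chains agree with one another; as cautioned above, agreement among chains does not by itself guarantee that the proper stationary distribution has been reached.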
2.8.6
A Note about Starting Values
Just as with ML/EM, starting values are incredibly important in some cases within MCMC. Each parameter included in a model requires starting values in order for the sampling process to begin. Sometimes using random starts can suffice in estimation, but there are other cases in which the choice of starting values requires extra care. Theoretically, the starting values should not affect the distribution produced, since theory suggests the proper stationary distribution will be obtained through the Markov chain. However, the starting values can have a large impact on the length of the Markov chain needed for convergence (Gilks et al., 1996). In order to avoid needing an excessive burn-in phase of the chain, starting values should be chosen carefully, in particular for complex models. Having poor starting values will not only increase the number of burn-in iterations, but the number of post-burn-in iterations can also be affected. This occurs because poor starting values tend to create a Markov chain that moves gradually toward the stationary distribution, which inevitably creates a highly autocorrelated (dependent) sequence within the chain (Raftery & Lewis, 1996). It is often the case that this would be handled through a process called thinning that can require a higher number of post-burn-in iterations to receive an optimal sample for the chain.
2.8.7
Thinning a Chain
A process referred to as thinning can be employed when there is high autocorrelation between adjacent states throughout the chain. This signifies that there is a slower rate of convergence, or mixing, within the chain (J.S. Kim & Bolt, 2007). To lessen the dependency between the samples in the chain, every sth sample (s > 1) can be selected to comprise the post-burn-in
samples forming the stationary distribution. This process can also diminish the dependence on starting values, thus creating reasonably independent samples (Geyer, 1991). The thought is that the convergence rate will be faster if the chain is able to rapidly move through the sample space. Note, however, that a thinning process is not necessary to obtain convergence but rather it is just used to reduce the amount of data saved from an MCMC run (Raftery & Lewis, 1996). Without thinning, the resulting samples will be dependent, but convergence will still eventually be obtained with a long enough chain. Geyer (1991) indicated that the optimal thinning interval is actually 1 in many cases, indicating that no thinning should take place. Specifically, when computing sample variances, it is necessary to down-weight the terms for larger lags (higher thinning interval) in order to obtain a decent variance estimate, thus indicating thinning intervals are not always useful or helpful. However, when a thinning interval greater than 1 is desired, then a value less than 5 is often optimal. Settling on the thinning interval to use within a chain seems subjective in some ways. The purpose of including a thinning interval is basically to lower autocorrelation, but there is no steadfast guideline for how much decrease in autocorrelation is “enough.” Some researchers have suggested guidelines for determining the appropriate thinning interval (see, e.g., Raftery & Lewis, 1996). However, this can sometimes result in making a judgment call since the benefits of thinning reside mainly in mixing time.
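The relationship between thinning and autocorrelation can be checked directly. The sketch below again uses a stand-in autoregressive chain (an assumption) with strong dependence between adjacent states, keeps every 4th draw, and compares the lag-1 autocorrelation before and after thinning; the thinning interval and chain length are hypothetical.

```python
import math
import random

def autocorr(xs, lag=1):
    """Sample autocorrelation of a sequence at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    denom = sum((x - mean) ** 2 for x in xs)
    num = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag))
    return num / denom

rng = random.Random(7)
phi, x, chain = 0.9, 0.0, []
for _ in range(50_000):
    x = phi * x + rng.gauss(0.0, math.sqrt(1 - phi**2))  # stand-in chain
    chain.append(x)

thin = 4
thinned = chain[::thin]       # keep every 4th draw

ac_full = autocorr(chain)     # roughly phi = 0.9
ac_thin = autocorr(thinned)   # roughly phi**4, about 0.66
```

Thinning lowers the dependence between retained draws but also discards three quarters of them, which is consistent with the point that thinning mainly reduces the amount of data saved.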
2.9 Posterior Inference
The final circle in panel (b) of Figure 2.1 represents posterior inference. Upon obtaining estimated posterior distributions for each unknown model parameter, the researcher can begin the process of deciphering and interpreting findings. In the subsequent chapters, I present a variety of examples where posterior inference is highlighted. Each example is accompanied by summary statistics representing the obtained posteriors, as well as plots displaying different aspects of the results. In this section, I will briefly define the major summary statistics and highlight the plots that can be used to aid in interpreting posterior inference. All of the summary statistics and plots described here can be produced using the code presented in Appendix 2.A.
2.9.1 Posterior Summary Statistics
The posterior distribution resulting from the MCMC process can be summarized in several ways. Summary statistics can be used as an initial
indication of the posterior estimate. In this book, I focus on the posterior median and mean (i.e., the expected a posteriori or EAP estimate). As an indication of variance within the posterior, the standard deviation of the posterior is also provided.
2.9.2 Intervals
To summarize the full posterior, and not just a point estimate pulled from the posterior, it is important to also examine the intervals produced. Two separate intervals are displayed for each example provided: equal-tail 95% credible intervals (CIs) and 95% highest density intervals (HDIs), which need not have equal tails. Each of these intervals provides useful information about the width of the distribution. A wider interval indicates more uncertainty surrounding the posterior estimate, and a relatively narrow interval suggests more certainty. When displaying these intervals, it is advised to report the upper and lower bounds, as well as to show a visual plot of the intervals (e.g., through HDI plots; see below). In the examples presented throughout the book, the median for these intervals is reported for consistency purposes. However, it is sometimes advised to report the mode of the HDI and the median of the equal-tail CI when plotting the histogram and densities tied to these intervals. These intervals are particularly helpful in identifying the most and least plausible values in the posterior distribution, but they can require much longer chains to stabilize compared to other summary statistics of the posterior. For example, the median of the chain can stabilize rather quickly (i.e., with a relatively shorter chain) since it is located in a high density range of the posterior (Kruschke, 2015). The lower and upper bounds of the intervals are much more difficult to capture in a stable way because they are in the lowest density portions of the posterior (i.e., portions of the chain containing the fewest samples). One way of assessing whether the 95% intervals are stable is to examine the effective sample size.
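The difference between the two interval types can be illustrated with a short Python sketch, separate from the code in Appendix 2.A. It assumes a hypothetical set of right-skewed posterior draws (simulated from a gamma distribution purely for illustration) and a unimodal posterior; the function names are invented here. The equal-tail interval cuts 2.5% from each tail, whereas the HDI is found as the narrowest window containing 95% of the sorted draws.

```python
import math
import random


def equal_tail_interval(draws, mass=0.95):
    """Equal-tail credible interval: cut (1 - mass) / 2 from each tail."""
    ordered = sorted(draws)
    n = len(ordered)
    lower_idx = round((1 - mass) / 2 * (n - 1))
    upper_idx = round((1 + mass) / 2 * (n - 1))
    return ordered[lower_idx], ordered[upper_idx]


def highest_density_interval(draws, mass=0.95):
    """Narrowest interval containing `mass` of the draws (unimodal case)."""
    ordered = sorted(draws)
    n = len(ordered)
    window = math.ceil(mass * n)  # number of draws the interval must contain
    best_start = min(
        range(n - window + 1),
        key=lambda i: ordered[i + window - 1] - ordered[i],
    )
    return ordered[best_start], ordered[best_start + window - 1]


# Hypothetical right-skewed posterior draws (illustration only).
rng = random.Random(7)
draws = [rng.gammavariate(2.0, 1.0) for _ in range(10_000)]

eti = equal_tail_interval(draws)
hdi = highest_density_interval(draws)
print("equal-tail 95% CI:", tuple(round(b, 2) for b in eti))
print("95% HDI:          ", tuple(round(b, 2) for b in hdi))
```

For a symmetric posterior the two intervals nearly coincide; for a skewed posterior such as this one, the HDI shifts toward the region of higher density and is no wider than the equal-tail interval.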
2.9.3 Effective Sample Size
The effective sample size (ESS) is linked to the degree of autocorrelation in the chain. Adjacent iterations in the Markov chain are typically highly dependent on one another. ESSs take into account the amount of autocorrelation in the chain. If autocorrelation is high, then the ESS of the chain is going to be lower in order to account for the high degree of dependency among the samples in the chain. There are several rules of thumb surrounding minimum ESS values. For example, Kruschke (2015, Section 7.5.2, pp. 182+) indicates that ESSs must be at least 10,000 to ensure that the intervals
are stable, and Zitzmann and Hecht (2019) indicate that values greater than 1,000 can be sufficient. The best advice that I can give is to carefully inspect the parameters and the histograms of the posteriors. Parameters can be sorted based on ESSs, with those corresponding with the lowest ESSs being inspected first. If the posterior histogram is highly variable, lumpy, or not smooth, then it may be an indication that the parameter experienced low sampling efficiency that is due to high autocorrelation. I present ESSs for all examples. In some cases, the values are relatively low, and I highlight reasons why when applicable.
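The link between autocorrelation and ESS can be sketched in a few lines of Python (again separate from Appendix 2.A). The sketch assumes the common definition ESS = N / (1 + 2 × sum of autocorrelations) and a simple truncation rule that stops summing once the estimated autocorrelation drops below 0.05; both the rule and the function names are assumptions of this illustration, and software such as Stan uses more refined estimators. The two chains are simulated, hypothetical examples.

```python
import random


def autocorrelation(chain, lag):
    """Sample autocorrelation of the chain at the given lag."""
    n = len(chain)
    mean = sum(chain) / n
    denom = sum((x - mean) ** 2 for x in chain)
    num = sum((chain[t] - mean) * (chain[t + lag] - mean) for t in range(n - lag))
    return num / denom


def effective_sample_size(chain, cutoff=0.05, max_lag=1000):
    """ESS = N / (1 + 2 * sum of autocorrelations), truncated at `cutoff`."""
    n = len(chain)
    acf_sum = 0.0
    for lag in range(1, min(max_lag, n - 1) + 1):
        rho = autocorrelation(chain, lag)
        if rho < cutoff:  # assumed truncation rule for this sketch
            break
        acf_sum += rho
    return n / (1.0 + 2.0 * acf_sum)


rng = random.Random(11)
n_draws = 5_000

# Hypothetical chain with independent draws (no autocorrelation).
independent_chain = [rng.gauss(0.0, 1.0) for _ in range(n_draws)]

# Hypothetical chain with strong dependence between adjacent draws.
sticky_chain = [0.0]
for _ in range(n_draws - 1):
    sticky_chain.append(0.9 * sticky_chain[-1] + rng.gauss(0.0, 1.0))

ess_independent = effective_sample_size(independent_chain)
ess_sticky = effective_sample_size(sticky_chain)
print(f"independent draws: ESS = {ess_independent:.0f} of {n_draws}")
print(f"autocorrelated:    ESS = {ess_sticky:.0f} of {n_draws}")
```

Both chains contain the same number of draws, but the autocorrelated chain carries far less independent information, which is why interval bounds taken from it would be less stable.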
2.9.4 Trace-Plots
Trace-plots (or convergence plots) are used to track the movement of the chain across the iterations of the sampling algorithm. A converged trace-plot shows stability in its central tendency (horizontal center) and variance (vertical height).
2.9.5 Autocorrelation Plots
Autocorrelation plots illustrate the degree of dependency within a chain. Lower degrees of dependency are optimal, especially given that high dependency can be a sign of a poorly formed (i.e., mixed) chain or model mis-specification. Autocorrelation is a product of the model and the data, so higher degrees can be an artifact of the model (e.g., the way it is formulated) or certain features of the data. Sometimes autocorrelation can be reduced by changing the model, using different data, or even switching to Hamiltonian Monte Carlo methods (e.g., as implemented in Stan (Stan Development Team, 2020)).
2.9.6 Posterior Histogram and Density Plots
The posterior histogram and density plots show the overall shape of the posterior. These plots are very helpful in assessing for features such as skew and areas of higher density.
2.9.7 HDI Histogram and Density Plots
Closely related to the posterior histogram and density plots are the plots produced for the HDI. The HDI histogram and HDI density plots illustrate the highest density region of the posterior (i.e., the narrowest interval containing 95% of the distribution). These plots are very helpful in interpreting (un)likely values for the parameter.
2.9.8 Model Assessment
One of the main elements in the posterior inference portion of Figure 2.1 relates to model assessment. Issues such as model fit, posterior predictive checking, and model comparison are all detailed in Chapter 11. However, it is also important to mention here that part of the estimation process should indeed be to examine the integrity of the model post-estimation. Tools such as posterior predictive checking can help the researcher gain a firmer understanding of the performance of the statistical model via the likelihood. This sort of assessment can be a valuable way of gaining insight as to how well the statistical model captures the phenomena or patterns observed in the data. It can also inform future modeling changes that can be used to restart the Bayesian Research Circle for subsequent theories, statistical models, priors, or data.
2.9.9 Sensitivity Analysis
The posterior inference phase of Figure 2.1 highlights an important element called sensitivity analysis. I have already briefly mentioned this concept in terms of likelihood robustness, but the concept can be equally applied to priors. The idea underlying a sensitivity analysis is to examine the impact that priors (or a specified likelihood, i.e., statistical model) have on final model results. If the results are heavily influenced by the selection of priors (or the statistical model that was implemented), then this is valuable information for the researcher. The researcher can report on the influence that theory has on final model results, indicating clearly if slight alterations in theory (via priors or the model) have a strong influence on final model results. This type of finding may highlight the need for careful theory building or, at the very least, the need for reporting full sensitivity analysis results to highlight varying inferences based on different priors or models. In contrast, if results are relatively robust (or stable) to modifications of the priors (or model), then that is also valuable information to report. The researcher may find that modifications of theory (again, via the priors or model) have little to no impact on final model inferences. This result can be captured through a discussion of the robustness of results, where subjective decisions have little impact on final model inferences. The sensitivity analysis process allows researchers to convey how stable or malleable results are to subjective decisions made in the phases of defining the priors or likelihood.
One of the biggest criticisms of Bayesian methods is that priors have the ability to impact final model results in a considerable way. This impact effectively means that subjectivity embedded in the priors can influence inference. The attention in the literature is typically focused on the subjective nature of the priors but, as I argued when defining the likelihood, the statistical model is just as subjective. As a result, a sensitivity analysis can help identify the role that the subjective decisions played in inference. For the remainder of this section, I will describe sensitivity analysis in terms of priors, but the same concepts and points can be extended to a sensitivity analysis of the statistical model.
The Prior Sensitivity Analysis Process
There are many research scenarios in which informative (or user-specified) priors have an impact on posterior inference (see, e.g., Depaoli, Yang, & Felt, 2017; Golay, Reverte, Rossier, Favez, & Lecerf, 2013; van de Schoot et al., 2018). In addition, diffuse priors have also been found to influence final model estimates in important ways (see, e.g., Depaoli, 2013; Lambert, Sutton, Burton, Abrams, & Jones, 2005; van Erp, Mulder, & Oberski, 2018). Given that prior specification has the potential to alter obtained estimates (sometimes in an adverse way, as will be demonstrated throughout the book), it is always important to assess and report prior impact alongside the final model results being reported for a study. It is important to never blindly rely on default prior settings in software without having a clear understanding of their impact. A sensitivity analysis of priors allows the researcher to methodically examine the impact of prior settings on final results. The researcher will often specify original priors based on desired previous knowledge. After posteriors are estimated and inferences are described, the researcher can then examine the robustness of results to deviations in the priors specified in the original model. There are several steps that a researcher can take to conduct a sensitivity analysis of priors. These steps may include the following:
1. The researcher first determines a set of priors that will be used for the original analysis. These priors are obtained through methods described in panel (a) of Figure 2.1, where knowledge and previous research can be used to derive prior settings.
2. The statistical model is defined, data are collected, and the likelihood is formed.
3. Model estimation occurs using a sampling algorithm. Convergence is monitored, and estimated posterior distributions are obtained for all model parameters.
4. Upon obtaining model results for the original analysis, the researcher then defines a set of “competing” priors that can be examined. These “competing” priors are not meant to replace the original priors. The point here is not to alter the original priors in any way. Instead, the purpose of this phase is to examine how robust the original results are when the priors are altered, even if only slightly so.
5. Model estimation occurs for the sets of “competing” priors, and then the results are systematically compared to the original set of results. This comparison can take place through a series of visual and statistical assessments.
6. The final model results are written to reflect the original model results, as well as the sensitivity analysis results. Comments can be made about how robust (or not) the findings were when priors were altered.
When the prior settings are altered within the sensitivity analysis, there is also inherent subjectivity on the researcher’s part as to how the prior settings are altered. A researcher may decide to examine different hyperparameter settings without modifying the distributional form of the prior. In contrast, the researcher may decide to inspect the impact of different distributional forms and different hyperparameter settings. Consider the following example of modifying hyperparameter settings in a prior sensitivity analysis. A regression coefficient is assumed to be normally distributed with mean and variance hyperparameters as follows: N(1, 0.5). Assume, for the sake of this example, that there is no reason to believe the prior to be distributed as anything other than normal. The sensitivity analysis can then take place by systematically altering the mean and variance hyperparameters for a normal prior. Then the resulting posteriors from the sensitivity analysis can be compared to the results from the original prior. First, the researcher may choose to alter the mean hyperparameter, while keeping the variance hyperparameter at 0.5. The prior, N(μ, 0.5), can be altered in the following way:
• Original setting: μ = 1.
• Examine settings lower than 1, where μ = 0.5, 0, −0.5, and −1.
• Examine settings greater than 1, where μ = 1.5, 2, 2.5, and 3.
Next, the variance hyperparameter can be altered, while keeping the mean hyperparameter at 1. The prior, N(1, σ²), can be altered in the following way:
• Original setting: σ² = 0.5.
• Examine settings lower than 0.5, where σ² = 0.1 and 0.01.
• Examine settings greater than 0.5, where σ² = 1, 5, 10, 100, and 1,000.
Then the settings for the mean and variance hyperparameters would be fully crossed to form a thorough sensitivity analysis of the settings. In other words, each of the mean hyperparameter settings listed would be examined under all of the variance hyperparameter settings. This is just one example of how a sensitivity analysis can be conceptualized for a given model parameter. Regardless of how settings are determined, it becomes the duty of the researcher to ensure that a thorough sensitivity analysis was conducted on the impact of priors. The definition of “thorough” will vary by research context, model, and original prior settings. The most important aspect is to clearly define the settings for the sensitivity analysis, and describe the results in comparison to the original findings in a clear manner.
Many Bayesian researchers (see, e.g., Depaoli & van de Schoot, 2017; Kruschke, 2015; B. O. Muthén & Asparouhov, 2012a) recommend that a sensitivity analysis accompany original model results. This practice helps the researcher gain a firmer understanding of the robustness of the findings, the impact of theory, and the implications of results obtained. In turn, reporting the sensitivity analysis will also ensure that transparency is promoted within the applied Bayesian literature. Note that there is no right or wrong finding within a prior sensitivity analysis. If results are highly variable to different prior settings, then that is perfectly fine, and it is nothing to worry about. The point here is to be transparent about the role of the priors, and much of that comes from understanding their impact through a sensitivity analysis.
A Sensitivity Analysis Warning
There is one final caveat related to this issue.
It is important to report the results based on the original prior no matter what the sensitivity analysis results convey. In other words, do not modify the original priors because of something that was unveiled in the sensitivity analysis. The practice of modifying the priors based on finding more desirable results within a sensitivity analysis would be considered Bayesian HARKing (hypothesizing after results are known; Kerr, 1998). At the very least, this action would be considered a questionable research practice, but I would argue it is a misleading–or even deceiving–action.
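Returning to the fully crossed hyperparameter example described earlier in this section, the grid of settings can be enumerated with a short Python sketch (separate from Appendix 2.A). The variable names are hypothetical; the values are the mean and variance settings listed for the N(1, 0.5) prior. The original prior is kept as a fixed reference point and is never replaced, in line with the warning above.

```python
from itertools import product

# Original prior for the regression coefficient: N(mean = 1, variance = 0.5).
original_prior = (1.0, 0.5)

# Mean and variance hyperparameter settings from the example in the text.
mean_settings = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
variance_settings = [0.01, 0.1, 0.5, 1.0, 5.0, 10.0, 100.0, 1000.0]

# Fully crossed design: every mean setting under every variance setting.
prior_grid = list(product(mean_settings, variance_settings))

# The "competing" priors are all cells other than the original prior.
competing_priors = [prior for prior in prior_grid if prior != original_prior]

print(len(prior_grid), len(competing_priors))  # 72 71
```

Each of the 71 competing cells corresponds to one re-estimation of the model, so the size of the grid is itself a practical consideration when deciding what counts as "thorough."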
2.10 A Simple Example
This section presents a straightforward example for implementing and interpreting Bayesian results. For illustration purposes, a multiple regression model is used with a continuous outcome and two predictors. Specifically, the data used here were collected from 100 college students to predict levels of cynicism (higher values indicate greater cynicism) based on a measure of lack of trust in others (continuous predictor, with higher values indicating less trust in others) and sex (coded here as a binary predictor, with female = 0 and male = 1). The basic model can be written as follows:
$$Y_i = \beta_0 + \beta_1 (\text{Sex})_i + \beta_2 (\text{Trust})_i + \epsilon_i \tag{2.16}$$
where Y_i is the outcome measure of Cynicism for person i, β0 is the model intercept (e.g., the average level of cynicism when the two predictors are zero), β1 is the regression coefficient tied to the categorical predictor of Sex (X1), β2 is the regression coefficient associated with the continuous predictor of Lack of Trust (X2), and ε_i is the error. The basic form of the model can be viewed in Figure 2.2.
FIGURE 2.2. Multiple Regression Model.
The main parameters of interest are the regression weights, which are denoted by the lines connecting the predictors to the outcome. When estimating this model using Bayesian methods, each model parameter will be associated with a prior. The following list represents the four parameters that need priors:
• Intercept (β0)
• Regression weight 1 (β1)
• Regression weight 2 (β2)
• Error variance (σ²)
The priors can range from diffuse to informative, and the settings are defined at the discretion of the researcher. As an example, assume that the researcher had some prior knowledge that can be incorporated into the model. These priors are pictured in Figure 2.3.
FIGURE 2.3. Prior Densities for All Parameters.
Panels: (a) β1 - Sex; (b) β2 - Lack of Trust; (c) β0 - Intercept; (d) σ² - Error Variance.
The normally distributed prior for β1 (associated with Sex) is centered at 0 with a variance hyperparameter of 10 (N(0, 10)), indicating that the density mass covers a wide range of values for the regression weight. The prior for β2 (associated with Lack of Trust) is normally distributed, centered at 6 with a variance hyperparameter of 1 (N(6, 1)). This prior is more informative than the prior for β1 , with 95% of the density for the β2 prior between 4 and 8. This prior is relatively informative, indicating that the researcher has a strong expectation that a 1-point increase in Lack of Trust is related to a 4- to 8-point increase in Cynicism. The normally distributed prior on the intercept is centered at 41 and has a variance of 10 (N(41, 10)), which is meant to act as a weakly informative prior with 95% of the density
falling between 34.67 and 47.32. Finally, the error variance received an inverse gamma prior of IG(0.5, 0.5).
This example was implemented in R using the rstan package (Stan Development Team, 2020) with the package's default sampler, the No-U-Turn sampler (NUTS; Betancourt, 2017). Two Markov chains were requested, each with 5,000 burn-in samples and 5,000 samples for the posterior. Convergence was monitored through the potential scale reduction factor (PSRF, or R̂; Brooks & Gelman, 1998; Gelman & Rubin, 1992a; Vehtari, Gelman, Simpson, Carpenter, & Bürkner, 2019). The R̂ values for each parameter were < 1.001, thus pointing toward convergence. In addition, the ESSs ranged from 5,593 to 7,386, which was deemed sufficient given the simple nature of this example.
All relevant plots can be found in Figures 2.4-2.9. Figure 2.4 shows the trace-plots for the four model parameters. Notice that there appears to be visual confirmation of chain convergence. Both chains overlap nicely, forming a stable mean (horizontal center of the chain) and a stable variance (vertical height) for all parameters. Figure 2.5 shows that autocorrelation levels were low for all parameters. Each chain shows a quick visual decline of the degree of autocorrelation. Figures 2.6 and 2.7 present the posterior histogram and density plots for all parameters, respectively. These plots all show a smoothness of the densities, with the regression parameters linked to the two predictors (Plots (a) and (b)) and the intercept (Plot (c)) displaying distributions that approximate a normal distribution. Finally, Figures 2.8 and 2.9 illustrate the HDI histogram and density plots, respectively. These plots mimic the findings in Figures 2.6 and 2.7 in that relatively normal distributions were produced for the regression coefficients and the intercept. The results based on the estimated posteriors are presented in Table 2.1.
One interesting element is that the HDI includes the value 0 for the regression coefficient for the predictor Sex. Based on the prior information specified, this result indicates that there is no significant effect of sex. In contrast, the HDI does not include 0 for Lack of Trust, indicating that this predictor had some impact on levels of cynicism. Specifically, greater mistrust (higher values for the Lack of Trust predictor) is associated with greater cynicism (higher values for the Cynicism outcome). The result for the Lack of Trust predictor may have been influenced by the relatively informative prior that was placed on β2. One method that can be used to assess the impact of the prior distributions on final model estimates is to conduct a prior sensitivity analysis.
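The convergence check reported for this example relies on the potential scale reduction factor. As a rough illustration of what that statistic compares, the following Python sketch (separate from Appendix 2.A) computes the basic between-chain versus within-chain form described by Gelman and Rubin (1992a). This is an assumption of the sketch: it is not the rank-normalized split version of Vehtari et al. (2019), and the chains are simulated, hypothetical draws rather than the chains from this example.

```python
import random
import statistics


def potential_scale_reduction(chains):
    """Basic PSRF for equal-length chains: sqrt(pooled variance / within variance)."""
    n = len(chains[0])
    chain_means = [statistics.fmean(chain) for chain in chains]
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


rng = random.Random(42)

# Two hypothetical chains sampling the same distribution (well mixed).
mixed_chains = [[rng.gauss(2.5, 0.3) for _ in range(5_000)] for _ in range(2)]

# Two hypothetical chains stuck in different regions (not converged).
stuck_chains = [
    [rng.gauss(0.0, 1.0) for _ in range(5_000)],
    [rng.gauss(3.0, 1.0) for _ in range(5_000)],
]

rhat_mixed = potential_scale_reduction(mixed_chains)
rhat_stuck = potential_scale_reduction(stuck_chains)
print(f"well-mixed chains: PSRF = {rhat_mixed:.3f}")
print(f"stuck chains:      PSRF = {rhat_stuck:.3f}")
```

When the chains agree, the pooled and within-chain variances are nearly equal and the ratio sits close to 1; when the chains disagree, the between-chain variance inflates the ratio well above 1.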
FIGURE 2.4. Trace-Plots for All Parameters.
Panels: (a) β1 - Sex; (b) β2 - Lack of Trust; (c) β0 - Intercept; (d) σ² - Error Variance.
FIGURE 2.5. Autocorrelation Plots for All Parameters.
Panels: (a) β1 - Sex; (b) β2 - Lack of Trust; (c) β0 - Intercept; (d) σ² - Error Variance. Horizontal axis: Lag (0 to 20); vertical axis: Autocorrelation.
FIGURE 2.6. Posterior Histograms for All Parameters.
Panels: (a) β1 - Sex (horizontal axis: Cynicism on Sex); (b) β2 - Lack of Trust (horizontal axis: Cynicism on Lack of Trust); (c) β0 - Intercept (horizontal axis: Intercept of Cynicism); (d) σ² - Error Variance (horizontal axis: Error Variance of Cynicism).
FIGURE 2.7. Posterior Densities for All Parameters.
Panels: (a) β1 - Sex (horizontal axis: Cynicism on Sex); (b) β2 - Lack of Trust (horizontal axis: Cynicism on Lack of Trust); (c) β0 - Intercept (horizontal axis: Intercept of Cynicism); (d) σ² - Error Variance (horizontal axis: Error Variance of Cynicism).
FIGURE 2.8. HDI Histograms for All Parameters.
Panels (each marking the 95% HDI): (a) β1 - Sex (Median = 0.52; horizontal axis: Cynicism on Sex); (b) β2 - Lack of Trust (Median = 2.57; horizontal axis: Cynicism on Lack of Trust); (c) β0 - Intercept (Median = 42.73; horizontal axis: Intercept of Cynicism); (d) σ² - Error Variance (Median = 71.81; horizontal axis: Error Variance of Cynicism).
FIGURE 2.9. HDI Densities for All Parameters.
Panels (each marking the 95% HDI): (a) β1 - Sex (Median = 0.52; horizontal axis: Cynicism on Sex); (b) β2 - Lack of Trust (Median = 2.57; horizontal axis: Cynicism on Lack of Trust); (c) β0 - Intercept (Median = 42.73; horizontal axis: Intercept of Cynicism); (d) σ² - Error Variance (Median = 71.81; horizontal axis: Error Variance of Cynicism).
TABLE 2.1. Example: Multiple Regression Analysis Predicting Cynicism
                                           95% CI (Equal Tails)    95% HDI (Unequal Tails)
Parameter     Median    Mean      SD       Lower      Upper        Lower      Upper        ESS
Intercept     42.730    42.733    1.596    39.574     45.864       39.489     45.750       5593
β1             0.515     0.489    1.383    −2.288      3.161       −2.170      3.236       7050
β2             2.574     2.574    0.285     2.003      3.135        2.003      3.135       5652
Error Var.    71.811    72.479    8.585    57.621     90.867       56.291     89.112       7386
Note. β1 = Regression of Cynicism on Sex; β2 = Regression of Cynicism on Lack of Trust; Error Var. = Error variance; CI = credible interval.
The process of conducting a sensitivity analysis for this multiple regression analysis model is presented in Depaoli, Winter, and Visser (2020). A Shiny app was introduced, which can be used as a learning tool for the basics of conducting a prior sensitivity analysis. The app is available for download on the Open Science Framework (https://osf.io/eyd4r/). To run the app on your personal computer, open the ui.R and server.R files in RStudio and press the “Run App” link in the top-right-hand corner of the R Script section of the RStudio window. The app includes the Cynicism dataset, and it walks the user through the process of modifying priors on each of the model parameters comprising the multiple regression model. Before delving into the specifics of the app, I want to caution the user about one point. The app was constructed so that users must first examine the impact of priors placed on one parameter at a time. For example, the user will first explore different priors placed on β1 , examining the impact that the different prior settings for β1 have on the posterior for β1 . After exploring each parameter, one at a time, users can examine the combination of different priors at once. This combination approach is a more realistic view of the impact that priors can have. As I will demonstrate throughout this book, a prior on one model parameter can impact the results for another model parameter. Therefore, it is always important to examine prior settings in the context of the full model and not just focus on how a single posterior shifts when a prior is altered. This app initially walks the user through the process of a sensitivity analysis one parameter at a time (for pedagogical purposes). Then the last tab in the app compiles the information for each parameter and illustrates the impact of different prior settings on the entire model (rather than a single parameter at a time). 
This final approach mimics the impact of prior settings in an applied research context, where priors are examined in a sensitivity analysis with the full model results being tracked. As an example of a prior sensitivity analysis, consider the regression
parameter β1, which is the coefficient tied to the predictor Sex. The original prior for this parameter was N(0, 10). A sensitivity analysis can be performed on this prior by systematically altering the mean and variance hyperparameter values and assessing the resulting impact on the posterior (see footnote 9). This example explores two alternative specifications: N(5, 5) (called Alternative Prior 1) and N(−10, 5) (called Alternative Prior 2). These alternative priors have been plotted against the original prior in Figure 2.10 in order to highlight their discrepancies.
FIGURE 2.10. Sensitivity Analysis of Priors for Sex Regression Coefficient (β1).
Figure 2.11 illustrates the impact that altering the prior for β1 has on all of the model parameters. This figure shows the posterior densities for three different analyses:
1. Original priors for all model parameters
2. Alternative Prior 1 (N(5, 5)) for β1 and original priors for all other parameters
3. Alternative Prior 2 (N(−10, 5)) for β1 and original priors for all other parameters
It is clear that the formulation of the prior for β1 has an impact on findings. The greatest impact is indeed on the posterior for β1 (or β_sex), where the most discrepancy between the posteriors can be viewed. However, there is also some discrepancy in the overlaid posteriors for the other parameters, indicating that the prior setting for β1 impacts findings for the other parameters. Sensitivity analysis results are often most impactful when they can be visually displayed in plots such as these, as this highlights the degree of (non)overlap in posteriors. It can also be useful to examine statistics pulled from the different analyses to help judge whether point estimates (or HDIs) were substantively impacted by altering prior settings. Table 2.2 contains this information pulled directly from the app.
FIGURE 2.11. Estimated Posteriors When Altering Priors for Sex Regression Coefficient (β1).
The app results presented in Table 2.2 include several different types of information. The top half includes information when Alternative Prior 1 was used for β1 (or β_sex), and the bottom half includes information when Alternative Prior 2 was used. The last column in this table provides “percent deviation.” This column contains a comparison index that captures the amount of deviation between the original posterior mean (seventh column) and the posterior mean obtained under the alternative prior (third column; see footnote 10). The percent deviation was calculated here through the following equation:
Percent deviation = [(posterior mean from new analysis − original posterior mean)/original posterior mean] × 100.
This formula will allow for an interpretation of the percent deviation from one posterior mean to the next. If the deviation is relatively small (and not substantively meaningful), then this indicates that the results for the mean are robust to the different prior settings examined. If the posterior changes substantially as a result of the prior, then this indicates that the prior impacts the posterior (potentially) in a meaningful way. A figure such as Figure 2.11 can be useful to visually compare the posterior distributions, and a table such as Table 2.2 is a useful way to make estimate comparisons within the sensitivity analysis. Overall, the sensitivity analysis is an important process that can be used to better understand the role and impact of priors during the estimation process. Although only settings for β1 were altered here, all parameters would ideally be examined through a sensitivity analysis. Chapter 12 contains additional points to consider when conducting a prior sensitivity analysis.
Footnote 9: It is important to note here that the sensitivity analysis could (and perhaps should) examine the impact of different distributional forms of priors. For the sake of this simple example, the sensitivity analysis process will only focus on altering the hyperparameter values and not the distributional form of the prior.
Footnote 10: The researcher may also choose to compare the median, mode, or any substantively important percentile of the posteriors resulting from the different prior settings.
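The percent deviation index can be reproduced directly from the posterior means reported in Tables 2.1 and 2.2. The following Python sketch (separate from Appendix 2.A and from the Shiny app) uses a hypothetical helper function, `percent_deviation`, together with the posterior means printed in those tables.

```python
def percent_deviation(new_mean, original_mean):
    """[(new mean - original mean) / original mean] * 100."""
    return (new_mean - original_mean) / original_mean * 100.0


# Posterior means under the original priors (Table 2.1).
original_means = {
    "Intercept": 42.733,
    "Sex": 0.489,
    "Lack of Trust": 2.574,
    "Error Variance": 72.479,
}

# Posterior means under the alternative priors for the Sex coefficient (Table 2.2).
alternative_means = {
    "Alternative Prior 1": {
        "Intercept": 43.973,
        "Sex": 2.225,
        "Lack of Trust": 2.248,
        "Error Variance": 66.524,
    },
    "Alternative Prior 2": {
        "Intercept": 44.748,
        "Sex": -4.070,
        "Lack of Trust": 2.398,
        "Error Variance": 70.138,
    },
}

deviations = {
    label: {
        name: percent_deviation(new_mean, original_means[name])
        for name, new_mean in means.items()
    }
    for label, means in alternative_means.items()
}

for label, values in deviations.items():
    for name, value in values.items():
        print(f"{label} | {name}: {value:.3f}%")
```

The very large values for the Sex coefficient arise in part because the original posterior mean (0.489) is close to zero, so even modest absolute shifts translate into large percentages; this is one reason to read the index alongside the overlaid posteriors in Figure 2.11.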
2.11 Chapter Summary
This chapter presented a summary of Bayesian statistical modeling, highlighting the most important details regarding the estimation process. This treatment was meant to act as a review of the material. For more detailed information on these topics, please reference Gelman, Carlin, et al. (2014), Kaplan (2014), or Kruschke (2015). Important take-aways in this chapter include the fundamental differences between the frequentist and Bayesian estimation approaches. Although these two perspectives have similarities (e.g., both treat data as being random and having a distribution), there are several important distinctions that make the interpretation of results quite different across platforms. Within the frequentist framework, inference is based on asymptotic theory, which drives interpretation to surround repeated sampling from the population. Each approach assumes the true population parameter value is fixed, but the Bayesian perspective adds an important element of uncertainty. Specifically, the Bayesian approach treats model parameters as random, using prior information to capture uncertainty surrounding the true population value. This aspect allows for the computation of a conditional probability distribution called the posterior distribution.
2.11.1 Major Take-Home Points
In the following chapters, I will focus on a variety of models within the SEM framework. There are several aspects from the current chapter that should be kept in mind when reading about the Bayesian treatment of these models. Some final points to keep in mind regarding the process of Bayesian statistical modeling are as follows:
TABLE 2.2. Sensitivity Analysis Results When Altering Priors for Sex Regression Coefficient (β1)

Alternative Prior 1: Posterior Estimates

| Parameter | New Median | New Mean | New SD | 90% HDI (Unequal Tails) Lower | 90% HDI (Unequal Tails) Upper | Original Mean | Percent Deviation (Mean) |
| Intercept | 43.989 | 43.973 | 1.529 | 41.447 | 46.478 | 42.733 | 2.902 |
| Sex | 2.213 | 2.225 | 1.468 | −0.197 | 4.645 | 0.489 | 355.010 |
| Lack of Trust | 2.245 | 2.248 | 0.284 | 1.783 | 2.722 | 2.574 | −12.665 |
| Error Variance | 65.592 | 66.524 | 9.770 | 52.242 | 84.013 | 72.479 | −8.216 |

Alternative Prior 2: Posterior Estimates

| Parameter | New Median | New Mean | New SD | 90% HDI (Unequal Tails) Lower | 90% HDI (Unequal Tails) Upper | Original Mean | Percent Deviation (Mean) |
| Intercept | 44.767 | 44.748 | 1.576 | 42.112 | 47.274 | 42.733 | 4.715 |
| Sex | −4.052 | −4.070 | 1.535 | −6.612 | −1.581 | 0.489 | −932.311 |
| Lack of Trust | 2.392 | 2.398 | 0.298 | 1.919 | 2.898 | 2.574 | −6.838 |
| Error Variance | 69.032 | 70.138 | 10.704 | 54.582 | 89.586 | 72.479 | −3.230 |

Note. Intercept = Intercept in the regression model; Sex = Regression weight of Cynicism (outcome) on Sex (predictor); Lack of Trust = Regression weight of Cynicism (outcome) on Lack of Trust (predictor); New Median = Posterior median under the alternative prior; New Mean = Posterior mean under the alternative prior; New SD = Posterior standard deviation under the alternative prior; HDI = 90% highest posterior interval under the alternative prior; Original Mean = Posterior mean under the original set of priors; Percent Deviation (Mean) = [(new mean − original mean)/original mean] ∗ 100.
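The percent deviation column can be reproduced directly from the posterior means reported in Table 2.2. The following is a minimal sketch in R using the Alternative Prior 1 values; the helper name percent_deviation and the vector names are illustrative assumptions, not part of the original analysis.

```r
# Posterior means taken from Table 2.2 (original priors and Alternative Prior 1).
original_mean <- c(Intercept = 42.733, Sex = 0.489,
                   LackOfTrust = 2.574, ErrorVariance = 72.479)
new_mean_alt1 <- c(Intercept = 43.973, Sex = 2.225,
                   LackOfTrust = 2.248, ErrorVariance = 66.524)

# Percent deviation = [(new mean - original mean) / original mean] * 100.
# percent_deviation() is a hypothetical helper name used for illustration.
percent_deviation <- function(new, original) {
  (new - original) / original * 100
}

round(percent_deviation(new_mean_alt1, original_mean), 3)
```

Because the original posterior mean for Sex (0.489) is close to zero, even a modest absolute shift produces a very large percent deviation. For that reason, the percent deviation is best read alongside the raw estimates rather than on its own.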
1. The Bayesian Research Circle is a visual representation of the different elements that are involved in the estimation process. As indicated in Figure 2.1, the elements of this process have the potential to be iterative (or circular) in nature. Being transparent about decisions made at each step is imperative.
2. The impact of priors is often the biggest criticism of Bayesian methods. Critics point toward the inherent subjectivity of priors and the fact that they can impact results in substantively important ways. It is true that priors can be used to manipulate findings. Because of this, as well as some other elements of the estimation process that are prone to implementation problems (e.g., not properly assessing chain convergence), Bayesian statistical modeling has the potential to be misused (whether intentionally or not). In Chapter 12, I recommend several points that can be used to improve the transparency and credibility of the Bayesian process. Many of these points cover proper implementation and reporting standards, and I give special attention to the process of conducting a prior sensitivity analysis. After reading through the model-based chapters, I recommend that the points in Chapter 12 be carefully considered prior to implementation.
2.11.2 Notation Referenced
• p(·): probability
• θ: vector of unknown parameters
• y: data
• X: random variable
• N: normal prior distribution
• μ_X: mean hyperparameter for the normal distribution
• σ²_X: variance hyperparameter for the normal distribution
• U: uniform prior distribution
• α_u: lower-bound hyperparameter for the uniform distribution
• β_u: upper-bound hyperparameter for the uniform distribution
• σ²: variance parameter
• IG: inverse gamma prior distribution
• a_σ²: shape hyperparameter for the inverse gamma distribution
• b_σ²: scale hyperparameter for the inverse gamma distribution
• G: gamma prior distribution
• a_1/σ²: shape hyperparameter for the gamma distribution
• b_1/σ²: scale hyperparameter for the gamma distribution
• Σ: covariance matrix
• IW: inverse Wishart prior distribution
• Ψ: the scale hyperparameter for the inverse Wishart prior distribution
• ν: the degrees of freedom hyperparameter for the inverse Wishart prior distribution
• Σ⁻¹: precision matrix
• W: Wishart prior distribution
• Ψ⁻¹: the scale hyperparameter for the Wishart prior distribution
• ν: the degrees of freedom hyperparameter for the Wishart prior distribution
• B: beta prior distribution
• α_B: shape hyperparameter for the beta distribution
• β_B: shape hyperparameter for the beta distribution
• π: a multinomial variable (proportions)
• D: Dirichlet prior distribution
• d: hyperparameter for the Dirichlet prior distribution
• s: number of samples in a Markov chain
• q: number of parameters in a model being estimated
• ω: “momentum” variable in Hamiltonian Monte Carlo with Q elements
• L(·): likelihood function
• M: Q-dimensional mass matrix in Hamiltonian Monte Carlo
• R̂: R-hat, potential scale reduction factor used to monitor nonconvergence
• β0: regression weight for the intercept in the multiple regression example
• β1: regression weight for the first predictor (X1) in the multiple regression example predicting outcome Y
• β2: regression weight for the second predictor (X2) in the multiple regression example predicting outcome Y
• ε: error in the multiple regression example
• σ²: error variance in the multiple regression example
2.11.3 Annotated Bibliography of Select Resources
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2014). Bayesian data analysis (3rd ed.). Boca Raton, FL: Chapman & Hall.
• This book offers a thorough and technical treatment of Bayesian methodology. It is a terrific reference for the different elements comprising Bayesian estimation.
Kaplan, D. (2014). Bayesian statistics for the social sciences. New York, NY: The Guilford Press.
• This book is a great resource for applied users of Bayesian methods. It is geared toward social scientists and offers a detailed treatment of aspects of Bayesian estimation.
Kruschke, J. K. (2015). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan. San Diego, CA: Elsevier Inc.
• This book has a strong focus on implementation of Bayesian statistical modeling in the most common software programs offering Bayesian estimation.
Appendix 2.A: Getting Started with R
Whether models are estimated using Mplus, R, or another Bayesian program, there are many useful plotting functions in R that can aid in visually examining the estimated posterior distributions. This Appendix illustrates how to get started with R and obtain the plots and posterior statistics that are presented in the book. If you have not done so already, you should download the R programming environment from this website: https://cran.r-project.org/. From here, you will need to install the packages (using the install.packages() function) required for the various plots and statistics presented in this book. The following packages need to be installed for plotting purposes, and then loaded using the following code:
library(coda)
library(runjags)
library(MplusAutomation)
library(ggmcmc)
library(BEST)
library(bayesplot)
The coda package can be used to analyze MCMC output. It provides many diagnostics, including convergence diagnostics. The runjags package can be used as an interface to estimate models using MCMC through the program Just Another Gibbs Sampler (JAGS). The MplusAutomation package can be used alongside Mplus to run batches of models and extract information about model parameters that can be further analyzed in R. The ggmcmc package can be used to analyze results obtained through MCMC. This package is used to extract the chain information in vector form. The BEST package provides Bayesian estimation for t-tests, and it also can be used to compute posterior summary statistics.
The bayesplot package can be used to produce a variety of plots representing and summarizing the posterior. In cases in which Mplus is used for initial estimation, it is helpful to move the chain information over into the R programming environment to obtain all of the relevant plots and statistics used to summarize the posterior for each parameter. The following section of code shows how files can be imported from Mplus into R. The “mcmc” file is needed for all of the following commands, and it can be saved out from Mplus using these commands:
SAVEDATA: BPARAMETERS IS mcmc.txt;
PLOT: TYPE=PLOT2;
The BPARAMETERS are brought into R as an mcmc.list, which is appropriate for anything related to BUGS or JAGS (e.g., the coda package). The next line, which uses a function in MplusAutomation, pulls all MCMC iterations from Mplus into R. It uses the output file to make sense of what is in the BPARAMETERS output (i.e., all of the samples from the chains). If you want to keep the burn-in iterations, then change the code to read “discardBurnin = FALSE”:
getSavedata_Bparams("/Users/sarah/Desktop/mcmc.out", discardBurnin = TRUE)
The following command extracts all output from the “.out” file:
allOutput