Quantitative Methods in the Humanities and Social Sciences
Editorial Board: Thomas DeFanti, Anthony Grafton, Thomas E. Levy, Lev Manovich, Alyn Rockwood
Quantitative Methods in the Humanities and Social Sciences is a book series designed to foster research-based conversation with all parts of the university campus – from buildings of ivy-covered stone to technologically savvy walls of glass. Scholarship from international researchers and the esteemed editorial board represents the far-reaching applications of computational analysis, statistical models, computer-based programs, and other quantitative methods. Methods are integrated in a dialogue that is sensitive to the broader context of humanistic study and social science research. Scholars, including among others historians, archaeologists, new media specialists, classicists and linguists, promote this interdisciplinary approach. These texts teach new methodological approaches for contemporary research. Each volume exposes readers to a particular research method. Researchers and students then benefit from exposure to subtleties of the larger project or corpus of work in which the quantitative methods come to fruition.
More information about this series at http://www.springer.com/series/11748
Marvin Titus
Higher Education Policy Analysis Using Quantitative Techniques: Data, Methods and Presentation
Marvin Titus, Counseling, Special, and Higher Education, University of Maryland, College Park, MD, USA
ISSN 2199-0956          ISSN 2199-0964 (electronic)
Quantitative Methods in the Humanities and Social Sciences
ISBN 978-3-030-60830-9          ISBN 978-3-030-60831-6 (eBook)
https://doi.org/10.1007/978-3-030-60831-6

© Springer Nature Switzerland AG 2021

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.
Acknowledgments
There are many individuals who encouraged and inspired me, over the past few years, to write this book. I am grateful to my colleagues, Alberto Cabrera and Sharon Fries-Britt, at the University of Maryland who encouraged me to write this book. I am grateful to my students, whom I taught in a graduate course that covered many of the topics that are introduced in this book. They contributed to my deeper understanding of the topics that I introduced in the course and also inspired me to take on this project. I am particularly grateful to my former students who reviewed the draft chapters of this book. They are as follows: Christie De Leon, MacGregor Obergfell, Matt Renn, and Liz Wasden. They provided valuable comments, edits, and suggestions for improvement. I thank Ozan Jaquette, who graciously makes institution- and state-level data available to me and other researchers. Some of these data are used in many of the examples in this book. I would also like to thank Springer Publishing for their support in publishing this book and their patience as I wrote and revised the book. I would also like to thank my academic department, college, and university for a semester-long sabbatical, which I used to develop the book proposal. Finally, I would like to thank my wife Beverly, who provided encouragement and support as I spent an enormous amount of time away from her working on the draft manuscript of this book. I owe a great deal of gratitude to her.
Contents

1  Introduction
   References

2  Asking the Right Policy Questions
   2.1  Introduction
   2.2  Asking the Right Policy Questions
        2.2.1  The What Questions
        2.2.2  The How Questions
        2.2.3  The How Questions and Quantitative Techniques
        2.2.4  So Many Answers and Not Enough Time
        2.2.5  Answers in Search of Questions
   2.3  Summary
   References

3  Identifying Data Sources
   3.1  Introduction
   3.2  International Data
   3.3  National Data
   3.4  State-Level Data
   3.5  Institution-Level Data
   3.6  Summary
   References

4  Creating Datasets and Managing Data
   4.1  Introduction
   4.2  Stata Dataset Creation
        4.2.1  Primary Data
        4.2.2  Secondary Data
   4.3  Summary
   4.4  Appendix
   References

5  Getting to Know Thy Data
   5.1  Introduction
   5.2  Getting to Know the Structure of Our Datasets
   5.3  Getting to Know Our Data
   5.4  Missing Data Analysis
        5.4.1  Missing Data—Missing Completely at Random
   5.5  Summary
   5.6  Appendix
   References

6  Using Descriptive Statistics and Graphs
   6.1  Introduction
   6.2  Descriptive Statistics
        6.2.1  Measures of Central Tendency
        6.2.2  Measures of Dispersion
        6.2.3  Distributions
   6.3  Graphs
        6.3.1  Graphs—EDA
   6.4  Conclusion
   6.5  Appendix
   Reference

7  Introduction to Intermediate Statistical Techniques
   7.1  Introduction
   7.2  Review of OLS Regression
        7.2.1  The Assumptions of OLS Regression
        7.2.2  Bivariate OLS Regression
        7.2.3  Multivariate OLS Regression
        7.2.4  Multivariate Pooled OLS Regression
   7.3  Weighted Least Squares and Feasible Generalized Least Squares Regression
   7.4  Fixed-Effects Regression
        7.4.1  Unobserved Heterogeneity and Fixed-Effects Dummy Variable (FEDV) Regression
        7.4.2  Estimating FEDV Multivariate POLS Regression Models
        7.4.3  Fixed-Effects Regression and Difference-in-Differences
   7.5  Random-Effects Regression
        7.5.1  Hausman Test
   7.6  Summary
   7.7  Appendix
   References

8  Advanced Statistical Techniques: I
   8.1  Introduction
   8.2  Time Series Data and Autocorrelation
   8.3  Testing for Autocorrelations
        8.3.1  Examples of Autocorrelation Tests—Time Series Data
   8.4  Time Series Regression Models with AR Terms
        8.4.1  Autocorrelation of the Residuals from the P-W Regression
   8.5  Summary of Time Series Data, Autocorrelation, and Regression
   8.6  Examples of Autocorrelation Tests—Panel Data
   8.7  Panel-Data Regression Models with AR Terms
   8.8  Cross-Sectional Dependence
        8.8.1  Cross-Sectional Dependence—Unobserved Common Factors
        8.8.2  Tests to Detect Cross-Sectional Dependence—Unobserved Common Factors
   8.9  Panel Regression Models That Take Cross-Sectional Dependency into Account
   8.10  Summary
   8.11  Appendix
   References

9  Advanced Statistical Techniques: II
   9.1  Introduction
   9.2  The Context of Macro Panel Data and an Appropriate Statistical Approach
        9.2.1  Heterogeneous Coefficient Regression
        9.2.2  Macro Panel Data
        9.2.3  Common Correlated Effects Estimators
        9.2.4  HCR with a DCCE Estimator
        9.2.5  Error Correction Model Framework
        9.2.6  Mean Group Estimator
   9.3  Demonstration of HCR with DCCE and MG Estimators
        9.3.1  Macroeconomic Panel Data
        9.3.2  Tests for Nonstationary Data
        9.3.3  Tests for Cointegration
        9.3.4  Tests for Cross-Sectional Independence
        9.3.5  Test of Homogeneous Coefficients
        9.3.6  Results of the HCR with DCCE and MG Estimators
   9.4  Summary
   9.5  Appendix
   References

10  Presenting Analyses to Policymakers
    10.1  Introduction
    10.2  Presenting Descriptive Statistics
          10.2.1  Descriptive Statistics in Microsoft Word Tables
    10.3  Choropleth Maps
    10.4  Graphs
          10.4.1  Graphs of Regression Results
    10.5  Marginal Effects (with Continuous Variables) and Graphs
          10.5.1  Marginal Effects (Elasticities) and Graphs
    10.6  Marginal Effects and Word Tables
    10.7  Marginal Effects (with Categorical Variables) and Graphs
    10.8  Summary
    10.9  Appendix
    References

Index
About the Author
Marvin Titus's research focuses on the economics and finance of higher education and quantitative methods. While he has explored how institutional and state finance influences student retention and graduation, Dr. Titus's most recent work is centered on examining the determinants of institutional cost and productivity efficiency. He investigates how state higher education finance policies influence degree production. Through the use of a variety of econometric techniques, Dr. Titus is also exploring how state business cycles influence volatility in state funding of higher education. Named a TIAA Institute Fellow in 2018, Dr. Titus has published in top-tier research journals, including the Journal of Higher Education, Research in Higher Education, and Review of Higher Education. He is an associate editor of Higher Education: Handbook of Theory and Research and has served on the editorial board of Research in Higher Education, Review of Higher Education, and the Journal of Education Finance. Dr. Titus also serves on several technical review panels for national surveys produced by the National Center for Education Statistics. To conduct his research utilizing national and customized state- and institution-level datasets, Dr. Titus uses several statistical software packages such as Stata, Limdep, and HLM. He earned a BA in economics and history from York College of the City University of New York, an MA in economics from the University of Wisconsin-Milwaukee, and a PhD in higher education policy, planning, and administration from the University of Maryland.
Chapter 1
Introduction
Keywords Introduction · Chapters
Why write a book about using quantitative techniques in higher education policy analysis? There are several reasons why I decided to write this book. First, the idea for this book evolved out of a graduate-level course that I have been teaching over the past few years at the University of Maryland. In that course, I instruct students on how to conduct state-level higher education policy research that addresses such questions as how college enrollment rates across states are influenced by the economic and political context of state higher education policy or how college completion rates across states are affected by state governance and the regulation of higher education. Based on their interests, students are instructed on how to design and manage panel datasets. Students are introduced to, discuss, and may draw from such data sources. In the course, students are encouraged to think deeply about higher education policy research questions within the context of the concerns of policymakers and the broader public. This prompted me to think about how quantitative techniques in higher education policy research should be rigorous, relevant, and accessible to policymakers as well as the public in general, but also forward-looking (hence, Chap. 9).

Second, the idea for this book emerged out of a realization that higher education policy research involves the use of many different quantitative techniques. These techniques mostly include descriptive statistics, ordinary least squares (OLS) regression, panel data analysis (e.g., fixed-effects and random-effects regression), and, most recently, difference-in-differences. Comprehensive discussions of some of these techniques have appeared in separate volumes of Higher Education: Handbook of Theory and Research. What is missing in the higher education literature on quantitative methods is a comprehensive discussion and demonstration of some of the techniques that have recently been developed in other disciplines and fields. Discussion is also needed of why higher education policy analysts and researchers should think about what warrants applying the most appropriate technique to a given policy question. Additionally, as data become available over longer periods of time, some of these techniques become appropriate for use in higher education policy research. Hence, there is a need for a comprehensive discussion and demonstration of both the commonly used and the recently developed policy research techniques. A book provides a good venue for those discussions and demonstrations.

Third, there is a need to discuss and demonstrate how we should present the results of higher education policy research to policymakers and the public in general. Much of what is done in higher education policy research remains in academic journals and technical reports. Many of those articles and reports present the results of regression models in ways that are not easily digestible for some policy analysts, policymakers, and other lay people. Additionally, there have been claims that a "disconnect" exists between higher education policy researchers and policymakers (Birnbaum 2000). These claims, however, are not unique to the field of higher education. In general, higher education policy analysts should be able to conduct research and draw useful information from that research. In other words, there should be no divide. Based on a study conducted several years ago, there tends to be divergence between the expectations of individuals working as higher education policy analysts and the training of graduate students by faculty with respect to quantitative skills (Arellano and Martinez 2009). That study attributed the divergence to how quantitative techniques are taught to master's and doctoral students in higher education and public policy programs. It is claimed that, on the one hand, master's students are "undereducated" with respect to the quantitative skills required to conduct policy analysis. On the other hand, the study claims that doctoral students are "overeducated" with respect to the quantitative skills needed by policy analysts. The most likely "truth" is that both higher education graduate students and policy analysts should be provided with a comprehensive reference that offers a unified approach to understanding the use of quantitative techniques. Additionally, this reference should provide guidance to higher education graduate students and policy analysts with respect to the presentation of the results of quantitative policy research to a lay audience. In the form of a book, this reference introduces and demonstrates to both audiences the use of quantitative techniques and the presentation of results from those techniques to policymakers and the general public. This book, it is hoped, will help bridge the gaps between graduate students, practicing policy analysts, and policymakers.
This book will also touch on the subject of policy research questions. Higher education policy analysis is not only about asking the right questions; it is also about using the appropriate quantitative techniques to answer those questions. While acknowledging and touching on the former, this book focuses on the latter. Some books on higher education policy analysis show how to frame a research agenda (e.g., Hillman et al. 2015). A plethora of literature in a variety of journals addresses a wide range of higher education policy areas such as state funding, tuition, student financial aid, governance, accountability, and college completion. A smaller body of literature introduces higher education researchers to the use of specific quantitative techniques. As pointed out above, whole chapters in Higher Education: Handbook of Theory and Research have been devoted to a particular quantitative research method in higher education. However, to date, there is no comprehensive reference text that provides guidance to higher education policy analysts, researchers, and students with respect to the research design that may be necessary to answer important questions using quantitative techniques. A research design would include asking the "right" questions, identifying existing data sources or creating a customized dataset, and using the appropriate statistical techniques. This book goes beyond providing guidance to higher education policy analysts with respect to research design. On the front end, it also covers the identification of data sources and the management and exploration of data. On the back end, the book introduces advanced quantitative techniques and also demonstrates how to present research results to higher education policymakers and other lay people.

Consequently, the book is organized in the following fashion. Chapter 2 discusses the questions that higher education policy analysts and researchers who use quantitative methods should ask, and may not be able to answer. These questions may involve the use of a variety of data and statistical techniques.

Chapter 3 introduces the reader to various secondary data sources that can be used to answer policy or research questions or build custom datasets. This chapter will provide an overview of easily accessible data for higher education policy analysis across countries, U.S. states, institutions, and students. Most of these data are publicly available, but others are restricted and require a license. In this book, only data from publicly available sources are accessed and used in examples. Many higher education analysts and researchers have used data from these publicly available sources to examine various policy-related topics. It should be noted that this chapter does not provide an exhaustive list of higher education data sources.

Chapter 4 shows how to create analytic datasets and how to organize and manage datasets that can be used to answer specific higher education policy questions. By way of step-by-step instructions on how to build a custom dataset, this chapter shows how to import data into Stata datasets for analysis. Using examples, the organization and management of customized datasets are also demonstrated. This chapter discusses and demonstrates the use of Excel as well as Stata to create, organize, and manage datasets.

Chapter 5 discusses the importance of getting to "know thy data" even before doing any kind of data analysis. Because many higher education policy analysts and researchers import data from other sources, it is important to "clean" and "prep" such data before use. Utilizing examples, this chapter demonstrates how to address the nuances of imported data (e.g., missing data, string variables) before they are analyzed.

Chapter 6 demonstrates the use of various descriptive statistical methods and graphs that can be used to provide basic descriptive information to higher education policymakers and lay people. Building on the previous chapter, this chapter shows how exploratory data analysis (EDA) techniques can be used to present descriptive statistics in such a way that policymakers and others can better understand the nature of the data that are used to inform higher education policymaking. In many ways, EDA is the most important part of higher education policy analysis. It precedes and determines the extent to which intermediate or advanced level analyses are needed or required. If intermediate or advanced level analyses are needed, EDA also provides guidance with respect to the specific quantitative techniques that should be employed.

Chapter 7 shows how intermediate-level statistical techniques can be used to answer higher education policy-oriented questions. In this chapter, statistical techniques that include ordinary least squares (OLS), fixed-effects, and random-effects regression models are introduced to address the "what" questions with respect to the relationship of policy variables to outcomes of interest to policymakers. This chapter also introduces the use of regression-based models that can be modified to infer causation and address the "what effect" questions with respect to the adoption of, or changes in, specific policies on policy outcomes. More specifically, Chap. 7 introduces difference-in-differences regression.

Chapter 8 introduces advanced statistical techniques to address violations of the assumptions of OLS regression. This chapter covers time series analysis and autocorrelation, including autoregressive moving-average (ARMA and ARMAX) regression models, which are not yet widely used in higher education policy analysis but should be. Chapter 8 also introduces advanced statistical techniques that address cross-sectional dependence.

Chapter 9 introduces additional advanced statistical techniques that could be used to address higher education policy questions. In this chapter, it is demonstrated how these advanced techniques can take into account the complex nature of data that are increasingly becoming available to policy analysts. This is particularly the case with respect to cross-sectional dependence inherent in geographic-oriented units of analysis such as higher education institutions and jurisdictions such as states. So, a good part of Chap. 9 addresses how to deal with cross-sectional dependence in panel data by using recently developed advanced statistical techniques.
In this sense, Chap. 9 is more forward-looking with respect to the "state of the art" of quantitative techniques in higher education policy analysis and evaluation. Given the development of longer time series and larger panel datasets, the chapter lays out a set of methodological tools that policy analysts and researchers should use now and even more so in the future.

Chapter 10, the final chapter, demonstrates how to present the results of policy research to policymakers and other lay people. This chapter demonstrates how the results of descriptive statistics can be presented in Word files and thematic maps. In Chap. 10, it is also shown how the most relevant results from intermediate and advanced statistical techniques can be presented in simple graphs. These graphs make the results of sometimes complex analyses available to policymakers and the general public in "pictures" rather than numbers and technical jargon.

Beginning in Chap. 4 and continuing throughout the remainder of the book, Stata code and output are provided to demonstrate how we can conduct the analyses being discussed. Rather than relying on Stata's menus, I use Stata code in an interactive mode. This will enable readers to copy, paste, and modify the code in text or ado files for future use in their own work. Beginning in Chap. 4, an appendix is provided with the Stata code that was used in the respective chapter.

This book does not comprehensively cover all quantitative techniques that have been or could be used in higher education policy analysis and research. I do not discuss event history analysis (EHA), which has mainly been employed to explain when a state higher education policy is adopted. Others (e.g., DesJardins 2003; Lacy 2015) have provided comprehensive descriptions and demonstrations of the use of EHA in higher education. With the exception of difference-in-differences (DiD) regression, quantitative techniques that infer causation rather than correlation are not covered in this book. More specifically, I do not cover instrumental variable (IV) regression, synthetic control methods (SCM), and regression discontinuity (RD). Bielby et al. (2013) provide a good discussion and demonstration of IV regression, while McCall and Bielby (2012) present a comprehensive exposition of how RD can be used in higher education policy research. Because it has only recently been introduced in the higher education policy literature, I will not discuss the use of SCM to evaluate higher education policy. For those who are interested in how SCM has been applied to examining policy outcomes in higher education, I refer them to the work of Jaquette and associates (e.g., Jaquette et al. 2018; Jaquette and Curs 2015). While I introduce and demonstrate the use of difference-in-differences (DiD) regression, Furquim et al. (2020) provide a more comprehensive discussion of how to apply that technique when conducting higher education policy evaluation. I also do not cover spatial analysis and regression, quantitative techniques that are emerging in and increasingly being applied to higher education policy research. Several higher education scholars have begun to discuss (e.g., Rios-Aguilar and Titus 2018) and apply (e.g., Fowles and Tandberg 2017) spatial techniques in higher education policy analysis and evaluation. Like quantitative techniques that infer causality, spatial analysis in higher education policy evaluation is a topic for another book.

Beginning with Chap. 4, I provide an appendix with the Stata commands and syntax that were used to demonstrate procedures and examples throughout the chapter. The syntax is meant to provide a template for modification specific to the reader's data and statistical techniques rather than a guide to programming in Stata. This approach to the use of Stata is consistent with that of Acock (2018), who encourages readers to use help commandname in an interactive mode to get more information about specific Stata commands and routines.

Because of my research interests, most of the examples in this book involve the use of higher education finance-oriented policy data. But the statistical methods and techniques presented in this book can be applied to quantitative data used in other areas as well.
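As a concrete illustration of this interactive, copy-and-modify style, consider the minimal sketch below. The dataset name (state_panel.dta) and variables (enroll, tuition) are hypothetical placeholders rather than files used in the book; the actual code and output appear in the chapter appendices beginning with Chap. 4.

    * Load a (hypothetical) dataset, inspect its structure, and summarize two variables
    use state_panel.dta, clear
    describe
    summarize enroll tuition, detail

    * Following Acock (2018), use help interactively to learn more about any command
    help summarize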
References

Acock, A. C. (2018). A Gentle Introduction to Stata (6th ed.). A Stata Press Publication, StataCorp LLC.
Arellano, E. C., & Martinez, M. C. (2009). Does Educational Preparation Match Professional Practice: The Case of Higher Education Policy Analysts. Innovative Higher Education, 34(2), 105–116. https://doi.org/10.1007/s10755-009-9097-0
Bielby, R. M., House, E., Flaster, A., & DesJardins, S. L. (2013). Instrumental variables: Conceptual issues and an application considering high school course taking. In Higher Education: Handbook of Theory and Research (pp. 263–321). Springer.
Birnbaum, R. (2000). Policy Scholars Are from Venus; Policy Makers Are from Mars. The Review of Higher Education, 23(2), 119–132. https://doi.org/10.1353/rhe.2000.0002
DesJardins, S. L. (2003). Event history methods: Conceptual issues and an application to student departure from college. In J. C. Smart (Ed.), Higher Education: Handbook of Theory and Research (Vol. 18, pp. 421–471). Springer.
Fowles, J. T., & Tandberg, D. A. (2017). State Higher Education Spending: A Spatial Econometric Perspective. American Behavioral Scientist, 61(14), 1773–1798. https://doi.org/10.1177/0002764217744835
Furquim, F., Corral, D., & Hillman, N. (2020). A Primer for Interpreting and Designing Difference-in-Differences Studies in Higher Education Research. In L. W. Perna (Ed.), Higher Education: Handbook of Theory and Research (Vol. 35, pp. 667–723). Springer International Publishing. https://doi.org/10.1007/978-3-030-31365-4_5
Hillman, N. W., Tandberg, D. A., & Sponsler, B. A. (2015). Public Policy and Higher Education: Strategies for Framing a Research Agenda. ASHE Higher Education Report, 41(2), 1–98.
Jaquette, O., & Curs, B. R. (2015). Creating the Out-of-State University: Do Public Universities Increase Nonresident Freshman Enrollment in Response to Declining State Appropriations? Research in Higher Education, 56(6), 535–565. https://doi.org/10.1007/s11162-015-9362-2
Jaquette, O., Kramer, D. A., & Curs, B. R. (2018). Growing the Pie? The Effect of Responsibility Center Management on Tuition Revenue. The Journal of Higher Education, 89(5), 637–676.
Lacy, T. A. (2015). Event history analysis: A primer for higher education researchers. In M. Tight & J. Huisman (Eds.), Theory and Method in Higher Education Research (Vol. 1, pp. 71–91). Emerald Publishing Group.
McCall, B. P., & Bielby, R. M. (2012). Regression discontinuity design: Recent developments and a guide to practice for researchers in higher education. In Higher Education: Handbook of Theory and Research (pp. 249–290). Springer.
Rios-Aguilar, C., & Titus, M. A. (Eds.). (2018). Spatial Thinking and Analysis in Higher Education Research: New Directions for Institutional Research, No. 180. Wiley. https://onlinelibrary.wiley.com/toc/1536075x/2018/2018/180
Chapter 2
Asking the Right Policy Questions
Abstract  This chapter discusses asking the right policy questions. It points out how the nature of those questions and answers is shaped by the policy context. With the most appropriate methodological tools, policy analysts should be prepared to address follow-up questions, including "what" and "how" questions. The chapter also discusses how academic researchers have to simultaneously use rigorous methods and provide results of their research that are of use to policymakers and the general public.

Keywords  Policy questions · The why questions · The how questions
2.1 Introduction
This chapter discusses higher education policy analysis and evaluation with respect to the nature of policy questions. The first part of the chapter discusses the policy context within which the right policy question is addressed by policy analysts. The next section provides a perspective on the "what" questions that policymakers ask policy analysts to address. The chapter then discusses the "how" questions, followed by a section that explains how academic researchers may also provide answers in search of questions. The chapter ends with some concluding remarks in the summary section.
2.2 Asking the Right Policy Questions
Policy analysis involves asking the right questions and providing those answers. But how does one determine what constitutes the right questions? It is necessary to clearly identify the policy issue at hand, who is concerned about the issue, how to frame questions about the issue, and the possibility of providing the relevant answers. Identification of a policy issue in higher education is not as straightforward as one may think. Take, for example, the issue of college affordability. The context and focus of that same policy issue differ by who is discussing it. In the popular press, college affordability may be presented in terms of the increase in the price of college (i.e., tuition and fees). Among higher education advocacy groups such as the Institute for Higher Education Policy, college affordability may be discussed within the context of the extent to which students from low-income families are being priced out of the higher education market. Therefore, with respect to identifying policy issues, the audience also matters.

Even if the issue and audience have been identified, policy research and the policy issue have to be bridged (Ness 2010). With regard to an identified policy issue, the question that policy researchers and policymakers are asking may not be one and the same. Moreover, the decisions of policymakers may not be linked to answers to questions addressed by policy researchers. According to Ness (2010), a direct application of policy research to the policymaking process is more closely connected to the rational choice model. But a more realistic model of the policymaking process is the "multiple streams" model (Kingdon 2011). Policy analysts who operate under the assumptions of the "multiple streams" model of the policy process produce research for multiple audiences such as academics, advocacy groups, policymakers, the media, as well as the general public. Consequently, research findings have to be clearly articulated or written for a wide audience of users who may or may not influence the policy process or policymakers. Given the variety of groups, a variety of questions and answers may have to be posed and addressed. This is rather challenging for the policy analyst, who must be cognizant of her or his audience, the policy process, a variety of analytical techniques, modes of communicating the results, and the possible implications for policy.

Different questions will require different methods and analytical techniques. In general, the "why" questions usually require a qualitative research design. The "what" and "how" questions generally necessitate a quantitative research design, which includes continuous and categorical data, measures or variables, and statistical techniques. But to answer the questions, the policy analyst or researcher must choose the appropriate data and statistical techniques, which depend on several factors.
2.2.1 The What Questions
In some cases, policymakers may want to know how an outcome or phenomenon was related to a set of policy-oriented variables. For example, a state higher education policymaker may inquire about the following:

1. When changes in resident undergraduate tuition occurred at public 4-year state colleges and universities, what was the outcome with respect to changes in resident undergraduate college enrollment at those institutions in the state?

A similar question about specific groups could also be posed:

2. When changes in resident undergraduate tuition occurred at public 4-year state colleges and universities, what was the outcome with respect to changes in resident undergraduate college enrollment among low-income students at those institutions in the state?

Some policymakers may even go further, expand on the question above, and ask the following:

3. When changes in resident undergraduate tuition at public 4-year state colleges and universities and changes in state need-based aid occurred, what was the outcome with respect to changes in resident undergraduate college enrollment among low-income students at those institutions in the state?

It may be prudent for policy analysts to anticipate the second and third "what" questions. In some cases, a cascade of questions may follow an initial "what" question. Consequently, the policy analyst must be prepared to answer the "what" questions that have not been posed but may be coming.

The astute reader may have already noticed that the "what" questions above are retrospective and relational in nature. But in many instances, policymakers may want to have answers to "what if" questions. For example, policymakers may want to know the following:

4. If resident undergraduate tuition increased (or decreased) at public 4-year state colleges and universities, then what would be the outcome with respect to changes in resident undergraduate college enrollment at those institutions in the state?

At first glance, this may appear to be a rather challenging question to answer. But a skilled policy analyst could approach this question in several ways. First, the question could be approached by observing the history (i.e., time series) of resident undergraduate tuition at state colleges and universities and resident undergraduate college enrollment at those institutions in the state. Based on historical trends, the analyst could then project the changes going forward. A second approach could be to examine the relationship between resident undergraduate tuition at state colleges/universities and resident undergraduate college enrollment at those institutions across institutions or states in a particular year. Based on a snapshot (i.e., cross-section) in time, the analyst would be able to determine if a relationship exists and then make an assumption about the particular state of interest. While the first approach involves an extrapolation over time, the second approach involves an extrapolation across units of analysis or groups (institutions or states). A third approach could include the use of data across time and units of analysis. (A brief Stata sketch of these three approaches appears at the end of this section.)

All three approaches involve a set of implicit assumptions regarding the relationship among variables. Those assumptions are originally embedded in the "what" questions. In our example above, the policymaker has to implicitly assume or hypothesize that there is a relationship between enrollment in college and tuition price that should be tested. This assumption or hypothesis is based on an underlying theory with regard to the relationship between enrollment in college and tuition price. In an effort to address the policy questions above, the policy analyst must make an effort to test the implicit hypothesis regarding the relationship between college enrollment and tuition price or, more generally, the market demand for higher education within a state or across states.

This may seem like a straightforward task. It is, however, quite complex and involves a set of underlying questions. If the policy analyst chooses to answer the initial question by looking at how resident undergraduate college enrollment has changed over time, how does she or he know whether the trend reflects changes in the demand for college, changes in the supply of college (e.g., physical capacity, admissions, etc.), or both (see note 1)? If the analyst ignores possible changes in supply and focuses on changes in demand with respect to tuition price, then she or he is implicitly assuming that "all else is held constant". But what if median family income, the traditional college-age population (18- to 24-year-olds), the college wage premium (the difference between the wages earned by college graduates and high school graduates), or "tastes" (based on expected social norms) for attending college changed over time? Obviously, the analyst cannot attribute a change in resident undergraduate college enrollment solely to a change in tuition price if all of these other things are assumed to change as well. Therefore, she or he must simplify reality by assuming the other variables did not change or propose an alternative set of policy questions. Perhaps that alternative set of policy questions could be centered on changes in college enrollment and changes in affordability rather than tuition price (see note 2).

This set of questions would require data on affordability, which could be measured as a ratio of average tuition price to income. But this would require some agreement with respect to the "right" measure of tuition price. Should the analyst use the sticker (before financial aid) tuition price or the net (after financial aid) tuition price? It may also require agreement with regard to the "right" measure of income. Should the analyst use average family income or median family income? Even if there is agreement with regard to the use of college affordability, the other variables mentioned above will still have to be ignored or held constant. Ignoring the other variables overly simplifies reality, while holding the other variables constant has implications for what statistical techniques the analyst will use to answer the policy questions.

Additionally, while it may be the most useful information to the general public, college affordability may not always be something higher education policymakers can directly change. Why? College affordability is composed of both tuition price (and possibly other prices related to attending college, such as housing, meals, books, etc.) and family income. State-level higher education policymakers may have varying control over tuition prices. For example, in 38 states, tuition price setting was controlled by multicampus or single-campus boards during 2012 (Zinth and Smith 2012). Clearly, state higher education policymakers do not have control over changes (at least not in the short run) in family income.

For the same reasons mentioned above, the third policy question may also have to be re-stated in terms of affordability, but with a slight modification. For example, one could ask:

5. As changes in state need-based financial aid and resident tuition for undergraduates at public 4-year colleges occurred, what changes in resident undergraduate enrollment occurred at those same institutions?

If the policymaker is interested in changes in both tuition and state need-based aid as well as their implicit influence on changes in enrollment, then the answer to question 5 becomes a bit more nuanced. Question 5 generates three analytical questions:

5(a) What is the relationship between changes in resident undergraduate enrollment and changes in resident undergraduate tuition at public 4-year colleges and universities in the state?

5(b) What is the relationship between changes in resident undergraduate enrollment and changes in state need-based financial aid for undergraduate students at public 4-year colleges and universities in the state?

5(c) How do changes in state need-based financial aid for undergraduate students at public 4-year colleges and universities in the state condition (influence) the relationship between changes in resident undergraduate enrollment and changes in tuition at those institutions?

In this example, it is not clear whether questions 5(a) and 5(b) can be addressed without addressing question 5(c). It is quite possible that the relationship between changes in resident undergraduate enrollment and changes in tuition at public 4-year colleges and universities in the state can only be discerned by observing changes in state need-based financial aid for undergraduate students at those institutions. Therefore, a "how" question may actually be embedded within a "what" question.

It is also possible that a particular type of "what" policy question also requires qualitative data and techniques or a mixed methods approach to addressing the question (Creswell and Creswell 2018). For example, a policymaker may ask whether there are differences in the interpretation of articulation policy between high school administrators and college administrators and, if they exist, "what" those differences mean for students in terms of their enrollment in college courses. To address the first part of this question, the analyst will have to interview high school and college administrators. To answer the second part of the question, the analyst will have to examine student enrollment in college courses.

Notes:
1. For more discussion on this, see Toutkoushian, R. K., & Paulsen, M. B. (2016). Economics of Higher Education: Background, Concepts, and Applications (1st ed.). Springer.
2. The issue of college affordability has increasingly received attention at the state and national level. For example, see Miller, G., Alexander, F. K., Carruthers, G., Cooper, M. A., Douglas, J. H., Fitzgerald, B. K., Gregoire, C., & McKeon, H. P. "Buck." (2020). A New Course for Higher Education. Bipartisan Policy Center. https://bipartisanpolicy.org/wp-content/uploads/2020/01/WEB_BPC_Higher_Education_Report_RV8.pdf
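To make the three approaches sketched above concrete, a minimal Stata sketch follows. All variable names (enroll, tuition, medinc, stateid, year), the specific state and year values, and the affordability variable are hypothetical placeholders introduced for illustration only; the regression commands themselves are introduced formally in Chap. 7.

    * Hypothetical variables: enroll (resident undergraduate enrollment),
    * tuition (resident tuition price), medinc (median family income),
    * stateid (numeric state identifier), year (academic year)

    * Approach 1: project one state's enrollment from its own time trend
    regress enroll year if stateid == 24
    predict enroll_hat, xb

    * Approach 2: relate enrollment to tuition across states in a single year
    regress enroll tuition medinc if year == 2018

    * Approach 3: use both dimensions at once (a state-by-year panel)
    xtset stateid year
    xtreg enroll tuition medinc, fe

    * One possible affordability measure discussed above: tuition relative to income
    generate afford = tuition / medinc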
2.2.2 The How Questions
Many higher education policy inquiries are "how" questions. A state policymaker may inquire how a particular policy may have affected a particular outcome or output. For example, policymakers in Maryland may want to know how the adoption of a state-wide policy on articulation has affected transfer rates from community colleges to 4-year institutions in Maryland. Using quantitative techniques, the policy analyst can approach this question in several different ways. First, he or she may want to answer this question from the perspective of Maryland's transfer rates before and after the adoption of a state-wide policy on articulation, without comparison to other states that have articulation policies. This is probably the easiest but not necessarily the best way to answer this question. The second way to answer this question is to compare Maryland's transfer rates before and after the adoption of a state-wide policy on articulation with those of comparable states that have no articulation policy. This approach to answering the question involves collecting data on comparable states. But this prompts the analyst to ask the following set of questions: What states are considered to be comparable to Maryland? Are only border states comparable to Maryland? Are states in the same regional compact, the Southern Regional Education Board (SREB), comparable to Maryland? Are states with characteristics similar to Maryland comparable to Maryland? The answers to these more analytical questions follow the original policy-oriented question and have implications for the data used and the quantitative techniques employed. (A brief Stata sketch of the before-and-after and comparison-state setups appears at the end of this section.)

But quantitative techniques may not be appropriate for some "how" questions that policymakers may ask. For example, a policymaker may ask how a particular state-wide higher education policy (e.g., dual enrollment) is being implemented across a state. The answer to this inquiry would require interviews with different stakeholders (e.g., high school and college administrators) across the state. Therefore, to address that particular type of "how" question, it is necessary for an analyst who does not have qualitative or interviewing skills to consult with her or his colleagues who do possess those skills.
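The two ways of answering the Maryland articulation question can be sketched in a few lines of Stata. The panel dataset, the variable transfer_rate, the state identifier, and the 2013 adoption year are hypothetical placeholders; difference-in-differences regression itself is introduced in Chap. 7.

    * Hypothetical state-by-year panel; suppose Maryland (stateid == 24)
    * adopted a state-wide articulation policy in 2013

    * First way: Maryland only, before versus after adoption
    summarize transfer_rate if stateid == 24 & year < 2013
    summarize transfer_rate if stateid == 24 & year >= 2013

    * Second way: add comparison states without an articulation policy,
    * which moves toward a difference-in-differences setup
    gen post  = (year >= 2013)
    gen treat = (stateid == 24)
    regress transfer_rate i.treat##i.post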
2.2.3 The How Questions and Quantitative Techniques
With respect to “how” questions, descriptive statistics of data on current patterns or past trends of policy indicators or variables, may be necessary but is certainly not sufficient to show relationships or test hypotheses. On the other hand, most regression models used in higher education policy research are correlational and used to examine the relationships among variables. The relationships may be between variables that policymakers can influence (e.g., undergraduate resident tuition at public 4-year colleges and universities) and variables (e.g., the enrollment of undergraduate resident students in public 4-year colleges and universities) that are implicitly or explicitly theorized to be influenced by the actions of policy. The regression models are used to take into account other observed variables (e.g., state median family income, traditional college-age population, etc.) or unobserved variables (state culture or habitus with regard to college enrollment) that may be related to the outcome of policy action. Policymakers, however, may not be able to “control” those other related variables. Consequently, policy analysts should include “control” variables and take into account “unobservable” factors. Most regression models are used in higher education policy research to examine relationships among variables. They involve asking the “what” question. More specifically, the questions being asked are: what policy-oriented variables (controlling for other variables) are important with respect to the policy outcome? When using most quantitative techniques, the answers do not prove cause. Therefore, it is very important to use the appropriate language when presenting the results. The results from regression models, such as instrumental variable (IV), difference-in-differences (DiD), and discontinuity
The results from regression models such as instrumental variable (IV), difference-in-differences (DiD), and regression discontinuity (RD) support stronger causal inferences than those from ordinary least squares (OLS), fixed-effects (FE), and random-effects (RE) regression models. Other quantitative techniques that support causal inference include the synthetic control method (SCM), a recently developed technique. An experimental research (ER) design or “scientific” method is utilized to establish the cause-effect relationship among a group of variables, with random assignment to treatment and control groups. While ER is considered the “gold standard” of research design because the researcher can manipulate the policy intervention or “treatment”, in most instances it cannot be used to conduct policy analysis or evaluation, due to legal, ethical, or practical reasons. Therefore, the vast majority of analyses of higher education policy are conducted using either descriptive statistics, correlational methods such as OLS, FE, and RE, or quasi-experimental methods such as IV, DiD, RD regression, or SCM. The nature of the policy research question and data should determine the most appropriate method to be utilized by the analyst. For example, if the question refers to the incidence of the adoption of a state policy (e.g., free tuition for community college students) across the United States by year, the use of descriptive statistics or exploratory data analysis (EDA) may be all that is needed. If the question is about the relationship between an outcome (e.g., the enrollment of full-time students in community colleges within a state) and a state higher education policy (e.g., free tuition at community colleges), an ordinary least squares (OLS) regression model may be more appropriate.3 If relatively few states (e.g., 20 of the 50) have implemented similar free tuition policies across many (e.g., 10) years, then a fixed-effects regression model may be the most appropriate technique to address the question in terms of the “average” influence of such policies.4 If complete data are available in only a subset of states, then a random-effects regression model probably should be employed.5 Finally, if the question refers to how the adoption of a particular policy in a particular state affected an outcome in that state (compared to similar states without such a policy), then a difference-in-differences (DiD) regression may be the most appropriate method.6 If one chooses to address the question with respect to the effect of the policy in a specific state (e.g., Tennessee) or group of states (e.g., Tennessee and Maryland), but has access to data for only a few comparable states (e.g., members of the Southern Regional Education Board—SREB) and a few years, DiD regression or SCM may be the method of choice.
3 OLS regression is discussed in Chap. 7.
4 Fixed-effects regression is presented in Chap. 7.
5 Random-effects regression is explained in Chap. 7.
6 DiD regression is discussed in Chap. 7.
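For concreteness, a minimal sketch of the two-period DiD comparison described above might look like the following in Stata. This is only an illustration with hypothetical variable and file names (a state-by-year panel with a transfer-rate outcome, a treatment indicator, and a post-adoption indicator); the method itself is presented in Chap. 7.

* Hypothetical state-by-year panel; variable and file names are illustrative only
use "state_transfer_panel.dta", clear
* treated = 1 for the state(s) that adopted the articulation policy
* post    = 1 for years after the policy was adopted
regress transfer_rate i.treated##i.post, vce(cluster stateid)
* The coefficient on 1.treated#1.post is the DiD estimate of the policy effect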
If the analyst is aware of the assumptions of DiD regression and chooses to relax them, the SCM may be the preferred method. In addition to the nature of the question and the available data, the skill level of the analyst may also determine the method that is actually used to address the question. In many instances, the higher education analyst or researcher makes a judgment with regard to which method is used to answer a question, based on the set of tools that are in his or her “toolbox”. Therefore, it is important that policy analysts and researchers have a full set of “tools” in their “toolbox” to answer different questions in different ways for different audiences. In many ways, the presentation of the analyses is one of the most important aspects of policy analysis and evaluation. Whether conducting EDA or applying more advanced statistical techniques such as DiD regression or SCM, tables and graphs should also be employed to clearly convey the results of the analyses. This is particularly pertinent when presenting the results of an analysis to policymakers, who may have a limited amount of time available to consume the information. Therefore, it is imperative that the results of analyses or evaluations be presented in a way that clearly and succinctly highlights key points.
2.2.4 So Many Answers and Not Enough Time
When responding to questions from policymakers, policy analysts run the risk of providing “many answers” that distract from the main answer. As discussed above, there may be several reasons for providing multiple answers. But policy analysts have to be careful not to lose the audience by providing too many answers to the analytical questions that may arise during the analysis or evaluation of policy. Many of the analytical questions posed by the analyst may be interesting from a methodological perspective but of less importance to the world of policymakers. In the interest of time (and space), it may be prudent for policy analysts to provide answers to only the main policy-oriented questions. For many policymakers, time is of the essence when it comes to receiving answers. This does not mean that secondary, analysis-related answers should never be provided to policymakers and the general public. It may be possible to include those questions and answers in appendices or supplementary reports.
2.2.5 Answers in Search of Questions
Many policy analysts, particularly academic researchers, provide answers that are in search of questions. These answers may be very important to the analyst, and possibly to other academic researchers, with regard
to philosophical or theoretical questions. As Birnbaum (2000) asserts, it takes time for academic researchers to develop theories. So, we cannot expect academic researchers who are interested in higher education policy and theory to simply abandon the latter in favor of the former. Birnbaum also claims that policy research may eventually make its way into the policy world and be used by policymakers. Hence, when academic researchers publish the results of their policy analyses or evaluations, they should do so with a mixed audience in mind. This requires providing rigorous research for the academic community and, at the same time, language free of technical jargon that informs policy discussions. In many ways, this is more challenging than presenting the results of a policy analysis or evaluation to policymakers or the general public. Those who are able to strike this balance are the most successful in influencing the world of policymakers over the long run.
2.3 Summary
This chapter discussed asking and answering higher education policy questions. It pointed out how the nature of those questions and answers is shaped by their context. These questions are not always straightforward and may lead to additional questions from policymakers. Policy analysts should be prepared to address follow-up questions. This chapter also discussed the nature of policy inquiries, which may include “what” questions, “how” questions, or both. Policy analysts have to choose the appropriate methods to address these questions. The chapter ended with a discussion of how academic researchers may have to simultaneously use rigorous methods and provide results of their research that are of use to policymakers and the general public.
References

Birnbaum, R. (2000). Policy Scholars Are from Venus; Policy Makers Are from Mars. The Review of Higher Education, 23(2), 119–132.
Creswell, J. W., & Creswell, J. D. (2018). Research design: Qualitative, quantitative, and mixed methods approaches (5th ed.). Sage Publications.
Kingdon, J. W. (2011). Agendas, Alternatives, and Public Policies. Longman.
Ness, E. C. (2010). The role of information in the policy process: Implications for the examination of research utilization in higher education policy. In J. C. Smart (Ed.), Higher education: Handbook of theory and research (Vol. 25, pp. 1–49). Springer.
Zinth, K., & Smith, M. (2012). Tuition-Setting Authority for Public Colleges and Universities (p. 10). Education Commission of the States.
Chapter 3
Identifying Data Sources
Abstract  There are varied sources of data available to higher education analysts and researchers at the international, national, state, and institutional levels. These data are provided by international organizations, the federal government, regional compacts, and independent organizations. Most of these data are available to the public without restrictions. Many higher education analysts and researchers have used these data to examine numerous topics.

Keywords  Data sources · International · National · State-level · Institutional
3.1 Introduction
This chapter identifies and discusses some of the major data sources that are available to conduct higher education policy research. The first part of the chapter introduces sources of data from international organizations. The next section discusses U.S. national-level data from the U.S. Department of Education and other sources. Higher education institutional-level data are introduced and discussed in the following section of the chapter. The last section of the chapter provides concluding statements on data sources.
3.2 International Data
Many international organizations provide data on a variety of topics, including higher education (sometimes referred to as tertiary education). One of those premier organizations is the World Bank, which collects and makes comprehensive higher education data available to policy analysts and researchers. World Bank (WB) education data are compiled by the United Nations Educational, Scientific, and Cultural Organization (UNESCO) Institute for Statistics from the surveys and reports provided by education officials in each country. These data are accessible via its website at: https://data.worldbank.org/topic/education. A Stata add-on module, wbopendata (Azevedo 2020), can be used to access WB data. The user can access a menu of specific data across countries or a set of data from a specific country. The data can be downloaded directly into Stata or Excel files. Using wbopendata, WB education data can be joined or merged with other data on other WB “topics”, such as Agriculture and Rural Development, Aid Effectiveness, Economy and Growth, Environment, Health, and Social Development. World Bank data can also be accessed with other Stata user-written programs, getdata (Gonçalves 2016) and sdmxuse (Fontenay 2018). Many of the higher education-oriented studies that used World Bank data have focused on the relationship between economic growth and educational attainment (e.g., Chatterji 1998; Holmes 2013; Knowles 1997). World Bank data enable policy analysts and researchers to examine the relationship between higher education and other topical areas across countries.

The Organisation for Economic Co-operation and Development (OECD) also provides international education data. The OECD covers 37 countries that span Europe, Asia, and North and South America. Like the World Bank, the OECD organizes its data by topic. The OECD topic of education covers higher (tertiary) education data, which can be accessed in several ways. In addition to going to the OECD webpage on education (https://data.oecd.org/education.htm), analysts can utilize Stata programs (i.e., getdata and sdmxuse) to extract data. Education policy analysts, researchers, advocacy organizations, and government agencies have used OECD data to examine educational attainment rates and spending across countries.
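To illustrate how wbopendata can be used to pull tertiary education data directly into Stata, a minimal sketch is shown below. The indicator code (SE.TER.ENRR, the gross tertiary enrollment ratio), the country list, and the year range are assumptions chosen for illustration and should be checked against the module's help file.

* Install the module (once) and download gross tertiary enrollment ratios
ssc install wbopendata, replace
wbopendata, country(usa;can;gbr) indicator(se.ter.enrr) year(2000:2018) clear long
* The data arrive in long (country-year) format and can be saved or merged with other WB topics
save "wb_tertiary_enrollment.dta", replace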
3.3 National Data
Countries collect and provide varying amounts of data on higher education. This section will focus on the United States. The U.S. government collects a tremendous amount of data from a variety of sources. Much of the education data are collected by the U.S. Department of Education (DOE). The U.S. DOE provides data from several nationally representative surveys that can
be used to examine higher education policy at many different levels. The U.S. DOE’s National Center for Education Statistics (NCES) is primarily responsible for collecting information on education. NCES collects, via surveys, information on primary (elementary), secondary, and postsecondary students. It also collects data on primary (elementary) and secondary schools and postsecondary education institutions. Brief descriptions of NCES surveys that focus on aspects of postsecondary education are provided below.

National Education Longitudinal Study of 1988 (NELS:88). NELS:88 is a nationally representative longitudinal survey of eighth graders in 1988 that included follow-up questionnaires through their secondary education, postsecondary education, and/or labor market years. NELS:88 is based on a multistage sampling frame in which middle schools were first selected, followed by random sampling of students within each school. In addition to providing information on students, NELS:88 also provides information on the schools they attended. The NELS:88 follow-up in 2000 also includes postsecondary education transcripts (PETS). The NELS:88 base year and follow-up microdata are accessible as a public use file (PUF) and a restricted use file (RUF). The NELS:88 PUF is a subset of the NELS:88 and does not include all the student variables that are available in the NELS:88 RUF. Users have to apply for a license from the Institute of Education Sciences (IES) and NCES to obtain the NELS:88 RUF. (The IES is the research and evaluation arm of the U.S. Department of Education.) Information on accessing the NELS:88 PUF and RUF can be found at: https://nces.ed.gov/surveys/nels88/data_products.asp. Many higher education policy analysts and researchers have used NELS:88 data to examine determinants of college enrollment. The final NELS:88 follow-up, however, was completed in 2000. Consequently, the most recent NELS:88 data are at least 20 years old.

Education Longitudinal Study of 2002 (ELS:2002). The ELS:2002 is a nationally representative longitudinal study of tenth graders in 2002 and 12th graders in 2004 who were followed throughout their secondary education, postsecondary education, and/or labor market years. Like NELS:88, ELS:2002 is based on a multistage sampling of schools and students. The ELS:2002 final follow-up was completed in 2012. ELS:2002 includes PETS information for 2013. Information on accessing ELS:2002 can be found at: https://nces.ed.gov/surveys/els2002/. With the exception of associated PETS information, ELS:2002 is available as a PUF. Several higher education policy analysts and researchers have used ELS:2002 to examine college enrollment (e.g., D. Kim and Nuñez 2013; Lee et al. 2013; Savas 2016; You and Nguyen 2012), college choice (e.g., Belasco and Trivette 2015; Hemelt and Marcotte 2016; Kim and Nuñez 2013; Lee et al. 2013), and college retention (e.g., Glennie et al. 2015; Morgan et al. 2015; Rowan-Kenyon et al. 2016; Schudde 2011, 2016).

High School Longitudinal Study of 2009 (HSLS:09). The HSLS:09 is a nationally representative longitudinal study that surveyed more than 23,000
students, beginning in the ninth grade in 2009, at 944 schools. The first follow-up of the HSLS:09 was in 2012. In 2013, there was an update to HSLS:09. A second follow-up, conducted in 2016, collected information on students in postsecondary education and/or the workforce. In 2017, HSLS:09 was supplemented with PETS. Information on accessing HSLS:09 can be found at: https://nces.ed.gov/surveys/hsls09/. Recently, a few higher education policy analysts and researchers have used HSLS:09 to examine college readiness (e.g., Alvarado and An 2015; George Mwangi et al. 2018; Kurban and Cabrera 2020; Pool and Vander Putten 2015) and college enrollment (e.g., Engberg and Gilbert 2014; Goodwin et al. 2016; Nienhusser and Oshio 2017; Schneider and Saw 2016).

National Postsecondary Student Aid Study (NPSAS). NPSAS is a nationally representative cross-sectional survey, with a focus on financial aid, of students enrolled in postsecondary education institutions. Beginning in 1987, the NPSAS survey has been conducted almost every other year. A NPSAS survey is planned for 2020, which will include state-representative data for most states. In addition to student interviews, NPSAS includes data from institution records and government databases. Analysts and researchers can perform analysis on NPSAS data only through NCES, via its Datalab at: https://nces.ed.gov/surveys/npsas/. NPSAS microdata or restricted use file data are only available to analysts and researchers who have been granted a license from IES/NCES. The federal government, higher education advocacy groups, and researchers have used NPSAS data to produce reports to help inform policy on federal financial aid.

Beginning Postsecondary Students Longitudinal Study (BPS). The BPS, a spin-off of the NPSAS, is a nationally representative survey, based on a multistage sample of postsecondary education institutions and first-time students. Drawing on cohorts from the NPSAS, the BPS surveys collect data on student demographic characteristics, PSE experiences, persistence, transfer, degree attainment, entry into the labor force, and/or enrollment in graduate or professional school. The first BPS survey was conducted in 1990 (BPS:90/94) and followed a cohort of students through 1994. Since then, BPS surveys of students have been conducted at the end of their first, third, and sixth year after entering a postsecondary education (PSE) institution. The BPS has been repeated every few years. Beginning with the BPS:04/09, PETS information is also provided. The most recent BPS (BPS:12/17) survey followed a cohort of 2011–2012 first-time beginning students, with a follow-up in 2017. The next BPS survey will collect information on students who began their postsecondary education in the academic year 2019–2020 and will follow that cohort in surveys to be conducted in 2020, 2022, and 2025. Users can access a limited amount of BPS data through the NCES Datalab. Information on accessing data from the BPS can be obtained from NCES at: https://nces.ed.gov/surveys/bps/. The complete BPS with microdata are available to restricted use file license holders. Many higher education policy
analysts and researchers (too numerous to mention) have used the BPS to investigate college student persistence and completion.

The Baccalaureate and Beyond Longitudinal Study (B&B) is a nationally representative survey, based on a sample of postsecondary education students and institutions, of college students’ education and labor force experiences after they complete a bachelor’s degree. Drawing from cohorts in the NPSAS, the B&B surveys also collect information on degree recipients’ earnings, debt repayment, as well as enrollment in and completion of graduate and professional school. Students in the B&B survey are followed up in their first, fourth, and tenth year after receiving their baccalaureate degree. The first B&B survey was conducted in 1993, with follow-ups in 1994, 1997, and 2003. The second B&B survey (B&B:2000/01) had only one follow-up, which was in 2001. The B&B:2008/12, which focuses on graduates from STEM education programs, was completed in 2008 and included follow-ups in 2009 and 2012. The B&B:2008/18 will include a follow-up in 2018. Using the NCES Datalab, analysts can perform limited analyses of B&B data. Microdata from the B&B surveys, which include PETS information, are only available to users who are given a license by IES/NCES to use restricted use files. Numerous analysts and researchers have used the B&B to examine such topics as: labor market experiences and workforce outcomes of college graduates (e.g., Bastin and Kalist 2013; Bellas 2001; Joy 2003; Strayhorn 2008; Titus 2007, 2010); graduate and professional school enrollment and completion (e.g., English and Umbach 2016; Millett 2003; Monaghan and Jang 2017; Perna 2004; Strayhorn et al. 2013; Titus 2010); student debt and repayment (e.g., Gervais and Ziebarth 2019; Millett 2003; Scott-Clayton and Li 2016; Velez et al. 2019; Zhang 2013); and family formation (e.g., Velez et al. 2019) and career choices (e.g., Xu 2013, 2017; Zhang 2013) of bachelor’s degree recipients.

Digest of Education Statistics. In addition to providing microdata on institutions and students, the U.S. Department of Education (DOE) also produces statistics at an aggregated or macro level on postsecondary education. For example, IES/NCES publishes the Digest of Education Statistics, which provides national- and state-level statistics on various areas of education, including postsecondary education (PSE). For PSE, these areas include: institutions; expenditures; revenues; tuition and other student expenses; financial aid; staff; student enrollment; degrees completed; and security and crime. The statistics on PSE are mostly based on aggregated data from NCES surveys discussed above (e.g., IPEDS, NPSAS, BPS, B&B). The statistics, aggregated over time and in some cases across states, are provided in tables. The tables can be downloaded in an Excel format, which can be used to either produce reports or merge with data from other sources to conduct statistical analyses.

Current Population Survey (CPS). The U.S. Census Bureau also provides national-level postsecondary education data to the public in the form of the CPS. U.S. Census Bureau microdata sample data files are available to researchers who are given authorization to use specific datasets at one of the
secure Federal Statistical Research Data Centers. For example, the restricted use dataset of the CPS School Enrollment Supplement provides microdata at the household level. The CPS School Enrollment Supplement has been used to examine demographic differences in postsecondary enrollment (e.g., Hudson et al. 2005; Jacobs and Stoner-Eby 1998; Kim 2012).

Other Sources of National Data. The U.S. government collects and disseminates national higher education data that focus on specific areas. The Office of Postsecondary Education of the U.S. Department of Education (DOE) provides data on campus safety and security. The College Scorecard, which is maintained by the U.S. DOE, provides a national database on student completion, debt and repayment, earnings, and other measures.

The College Board. There are other sources of aggregate national postsecondary education data, such as the College Board, that draw on nationally representative surveys and federal administrative information. The College Board data, however, are focused mainly on tuition prices and college student financial aid across years and, to a limited extent, across states. The data can be accessed at the College Board website (https://research.collegeboard.org/trends/trends-higher-education) and can be downloaded in Excel format. Policy analysts use the College Board data to explain patterns and trends in average higher education tuition prices (e.g., Baum and Ma 2012; Heller 2001; Mitchell 2017; Mitchell et al. 2016) and college student financial aid (e.g., Baum and Payea 2012; Deming and Dynarski 2010).
3.4 State-Level Data
There are a variety of sources of state-level postsecondary education data. Based on NCES collection efforts (see the Integrated Postsecondary Education Data System below) and federal government administrative data, the Digest of Education Statistics is a source of much of the state-level postsecondary education data. These data are institutional-level postsecondary education data aggregated at the state level. Some of the state-level data are available across years. For example, the Digest of Education Statistics provides state-level postsecondary education statistics by year on enrollment, degrees, institutional revenues, and institutional expenditures.

National Association of State Student Grant and Aid Programs. Other sources of state-level postsecondary education data include the National Association of State Student Grant and Aid Programs (NASSGAP). Data from the NASSGAP surveys (https://www.nassgapsurvey.com/), which focus on state financial aid, are available in Excel file format.1 Many higher education policy analysts and researchers have used NASSGAP data to
1 Surveys prior to 2015–2016 are available in pdf format.
examine state need- and merit-based financial aid (e.g., Cohen-Vogel et al. 2008; Doyle 2006; Hammond et al. 2019; Titus 2006).

State Higher Education Executive Officers. Another source of state-level higher education data is the State Higher Education Executive Officers (SHEEO). SHEEO provides data on higher education finance (i.e., state appropriations and net tuition revenue) and postsecondary student unit record systems. The SHEEO finance data, some of which go as far back as fiscal year 1980, can be downloaded (https://shef.sheeo.org/data-downloads/) in an Excel file format. SHEEO finance data have been used by several higher education policy analysts and researchers to produce reports and studies on state support for higher education (e.g., Doyle 2013; Lacy and Tandberg 2018; Lenth et al. 2014; Longanecker 2006).

National Science Foundation (NSF). The National Science Foundation (NSF) is another source of state-level higher education data. More specifically, NSF provides statistics based on Science and Engineering Indicators (SEI) State Indicators (https://ncses.nsf.gov/indicators/states/). These statistics include the number of science and engineering (S&E) degrees conferred, academic research and development (R&D) expenditures at state colleges and universities, academic S&E article output, and academic patents awarded. The data are available to the public and can be downloaded in Excel file format. Utilizing NSF/SEI state-level data, a few analysts and researchers (e.g., Coupé 2003; Fanelli 2010; Wetter 2009) have addressed the topic of academic R&D.

Regional Compacts. There are several academic common markets or regional compacts that provide state-level higher education data. The Southern Regional Education Board (SREB) is a regional compact that includes 16 member states in the South and provides state-level information to the public. With respect to higher education, SREB produces a “factbook” (https://www.sreb.org/fact-book-higher-education-0) which contains tables on data such as the population and economy, enrollment, degrees, student tuition and financial aid, faculty, administrators, revenue, and expenditures. These tables can be downloaded in an Excel file format. The Western Interstate Commission for Higher Education (WICHE) is an academic common market that is composed of 15 Western states and member U.S. Pacific Territories and Freely Associated States (which currently include the Commonwealth of the Northern Mariana Islands and Guam). WICHE produces a regional “factbook” for higher education that contains “policy indicators” (https://www.wiche.edu/pub/factbook). Similar to SREB’s, the WICHE higher education factbook provides state-level data in the following areas: demographics (including projections); student preparation, enrollment, and completion; affordability; and finance. The Midwest Higher Education Compact (MHEC) is an academic common market that is composed of 12 states in the Midwest. MHEC, via its Interactive Dashboard (https://www.mhec.org/policy-research/mhec-interactive-dashboard), provides state-level data and key performance indicators of
college context, preparation, participation, affordability, completion, finance, and the benefits of college. Data for all states in the MHEC common market are provided and can be downloaded in several formats, though not in Excel. The New England Board of Higher Education (NEBHE) is an academic common market that is composed of six New England states. The state-level data from NEBHE, which are limited to tuition and fees across the six states (accessed via https://nebhe.org/policy-research/policy-analysis-research-home/data/), can be downloaded to Excel files.

The Education Commission of the States (ECS) is an interstate compact that provides information on education policy, including postsecondary education policy. ECS maintains a database of information on state-level postsecondary education governance structures, policies, and regulations. Education analysts and researchers can track and identify changes in those areas. That information, however, has to be manually entered into the individual datasets that are created and analyzed by analysts. Several researchers have used ECS information to examine the role of governance structures in the adoption of state higher education policies (e.g., Mokher and McLendon 2009) and in conditioning the effect of those policies on policy outcomes (e.g., Tandberg 2013).

Other State-Level Sources. There are a few other sources of state-level postsecondary education data and information. These sources include the National Association of State Budget Officers (NASBO) and the Center for the Study of Education Policy (University of Illinois). While both sources of data cover all states over several years, those data are limited to higher education finance (i.e., state spending on higher education).
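Because most of these state-level sources distribute their data as Excel downloads, a common workflow is to import each file into Stata and merge the files on a state identifier. The sketch below is illustrative only; the file and variable names are assumptions, and both files are assumed to share a common state variable.

* Hypothetical file and variable names; both files are assumed to share a "state" column
import excel "sheeo_state_finance.xlsx", firstrow clear
save "state_finance.dta", replace
import excel "digest_state_enrollment.xlsx", firstrow clear
merge 1:1 state using "state_finance.dta"
* _merge flags states that appear in only one of the two files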
3.5 Institution-Level Data
The federal government collects postsecondary education institutional-level data via the Integrated Postsecondary Education Data System (IPEDS). Colleges and universities as well as proprietary institutions submit institutional-level data to IPEDS via 12 different surveys. These surveys include:

1. 12-month Enrollment (E12);
2. Academic Libraries (AL);
3. Admissions (ADM);
4. Completions (C);
5. Fall Enrollment (EF);
6. Finance (F);
7. Graduation Rates (GR);
8. Graduation Rates (GR200);
9. Human Resources (HR);
10. Institutional Characteristics (IC);
11. Outcome measures (OM);
12. Student Financial Aid (SFA).

A brief description of each of the 12 IPEDS surveys, which are collected over different collection periods, can be found at this website: https://nces.ed.gov/ipeds/report-your-data/overview-survey-components-data-cycle. IPEDS data can be accessed by users at: https://nces.ed.gov/ipeds/use-the-data. Using either SAS, SPSS, or Stata, users can download statistical program routines to extract data from an entire survey or selected variables within any of the 12 IPEDS surveys. Additionally, longitudinal data (Delta Cost Project) from several IPEDS surveys are made available by the American Institutes for Research on an NCES webpage at https://nces.ed.gov/ipeds/deltacostproject/. The data can be downloaded using statistical software packages (SAS, SPSS, Stata) or Excel. IPEDS data can be aggregated at the state and national level. Higher education policy analysts, institutional researchers, and others have relied heavily on IPEDS data to produce thousands of reports and studies on many topics related to postsecondary education institutions. Researchers have also merged IPEDS data with data from other NCES datasets, such as NELS, BPS, and B&B, to address questions related to students, taking into account the characteristics and policies of the institutions they attend and/or the states in which they reside. IPEDS data are limited in that they do not provide detailed information within an institution at the academic unit (e.g., college, school, or department) level. Over several years, there have been changes in the way in which data on selected items in some of the IPEDS surveys have been collected.2
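For analysts who prefer to work with the raw IPEDS downloads rather than the NCES-supplied program routines, a minimal sketch of importing and combining two IPEDS files in Stata is shown below. The file names are illustrative assumptions (they follow the naming pattern of the IPEDS complete data files), and unitid is the institution identifier used to link records across IPEDS surveys.

* Hypothetical IPEDS data files; unitid links institutions across surveys
* Institutional Characteristics header file
import delimited "hd2018.csv", case(lower) clear
save "ipeds_hd2018.dta", replace
* Fall Enrollment survey file
import delimited "ef2018a.csv", case(lower) clear
merge m:1 unitid using "ipeds_hd2018.dta"
* Institutional characteristics are now attached to each enrollment record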
3.6 Summary
This chapter provided an overview of data sources available for higher education analysts and researchers. These sources include international, national (U.S.), state, and institutional data, surveys, and datasets. The national data include public use and restricted use files from nationally representative surveys conducted by IES/NCES and the CPS. The state-level data are largely aggregated data originating from national surveys and federal administrative datasets. Independent organizations also collect and make available state-level data. Those data, however, are limited to finance-related information. Regional compacts also collect and make available higher education data. In most cases, those data are only collected for certain states within specific regions of the country. Institutional data collected by NCES,
2 This is particularly the case with respect to the Finance (F) survey.
via IPEDS, are made available to the public but also have limitations. The overview of data sources that are provided in this chapter is, by no means, an exhaustive list of all sources of data.
References Alvarado, S. E., & An, B. P. (2015). Race, Friends, and College Readiness: Evidence from the High School Longitudinal Study. Race and Social Problems, 7 (2), 150–167. https:/ /doi.org/10.1007/s12552-015-9146-5 Azevedo, J. P. (2020). WBOPENDATA: Stata module to access World Bank databases. In Statistical Software Components. Boston College Department of Economics. https:/ /ideas.repec.org/c/boc/bocode/s457234.html Bastin, H., & Kalist, D. E. (2013). The Labor Market Returns to AACSB Accreditation. Journal of Labor Research, 34 (2), 170–179. https://doi.org/10.1007/s12122-012-91558 Baum, S., & Ma, J. (2012). Trends in College Pricing, 2012. Trends in Higher Education Series. (pp. 1–40). College Board Advocacy & Policy Center. https://files.eric.ed.gov/ fulltext/ED536571.pdf Baum, S., & Payea, K. (2012). Trends in Student Aid, 2012. Trends in Higher Education Series. (pp. 1–36). College Board Advocacy & Policy Center. https://files.eric.ed.gov/ fulltext/ED536570.pdf Belasco, A. S., & Trivette, M. J. (2015). Aiming low: Estimating the scope and predictors of postsecondary undermatch. The Journal of Higher Education, 86 (2), 233–263. Bellas, M. L. (2001). Investment in higher education: Do labor market opportunities differ by age of recent college graduates? Research in Higher Education, 42 (1), 1–25. Chatterji, M. (1998). Tertiary education and economic growth. Regional Studies, 32 (4), 349–354. Cohen-Vogel, L., Ingle, W. K., Levine, A. A., & Spence, M. (2008). The “Spread” of MeritBased College Aid: Politics, Policy Consortia, and Interstate Competition. Educational Policy, 22 (3), 339–362. https://doi.org/10.1177/0895904807307059 Coupé, T. (2003). Science Is Golden: Academic R&D and University Patents. The Journal of Technology Transfer, 28 (1), 31–46. https://doi.org/10.1023/A:1021626702728 Deming, D., & Dynarski, S. (2010). College aid. In P. B. Levine & D. J. Zimmerman (Eds.), Targeting investments in children: Fighting poverty when resources are limited: Vol. Targeting Investments in Children: Fighting Poverty When Resources are Limited (pp. 283–302). University of Chicago Press. https://www.nber.org/chapters/c11730.pdf Doyle, W. R. (2006). Adoption of merit-based student grant programs: An event history analysis. Educational Evaluation and Policy Analysis, 28 (3), 259–285. Doyle, W. R. (2013). Playing the Numbers: State Funding for Higher Education: Situation Normal? Change: The Magazine of Higher Learning, 45 (6), 58–61. Engberg, M. E., & Gilbert, A. J. (2014). The Counseling Opportunity Structure: Examining Correlates of Four-Year College-Going Rates. Research in Higher Education, 55 (3), 219–244. https://doi.org/10.1007/s11162-013-9309-4 English, D., & Umbach, P. D. (2016). Graduate school choice: An examination of individual and institutional effects. The Review of Higher Education, 39 (2), 173–211. Fanelli, D. (2010). Do Pressures to Publish Increase Scientists’ Bias? An Empirical Support from US States Data. PLoS ONE, 5 (4). https://doi.org/10.1371/journal.pone.0010271 Fontenay, S. (2018). SDMXUSE: Stata module to import data from statistical agencies using the SDMX standard. In Statistical Software Components. Boston College Department of Economics. https://ideas.repec.org/c/boc/bocode/s458231.html
George Mwangi, C. A., Cabrera, A. F., & Kurban, E. R. (2018). Connecting School and Home: Examining Parental and School Involvement in Readiness for College Through Multilevel SEM. Research in Higher Education. https://doi.org/10.1007/s11162-0189520-4 Gervais, M., & Ziebarth, N. L. (2019). Life After Debt: Postgraduation Consequences of Federal Student Loans. Economic Inquiry, 57 (3), 1342–1366. https://doi.org/10.1111/ ecin.12763 Glennie, E. J., Dalton, B. W., & Knapp, L. G. (2015). The influence of precollege access programs on postsecondary enrollment and persistence. Educational Policy, 29 (7), 963– 983. Gonçalves, D. (2016). GETDATA: Stata module to import SDMX data from several providers. In Statistical Software Components. Boston College Department of Economics. https://ideas.repec.org/c/boc/bocode/s458093.html Goodwin, R. N., Li, W., Broda, M., Johnson, H., & Schneider, B. (2016). Improving College Enrollment of At-Risk Students at the School Level. Journal of Education for Students Placed at Risk, 21 (3), 143–156. https://doi.org/10.1080/10824669.2016.1182027 Hammond, L., Baser, S., & Cassell, A. (2019, June 7). Community College Governance Structures & State Appropriations for Student Financial Aid. 36th Annual SFARN Conference. http://pellinstitute.org/downloads/sfarn_2019Hammond_Baser_Cassell.pdf Heller, D. E. (2001). The States and Public Higher Education Policy: Affordability, Access, and Accountability. JHU Press. Hemelt, S. W., & Marcotte, D. E. (2016). The changing landscape of tuition and enrollment in American public higher education. RSF: The Russell Sage Foundation Journal of the Social Sciences, 2 (1), 42–68. Holmes, C. (2013). Has the expansion of higher education led to greater economic growth? National Institute Economic Review, 224 (1), R29–R47. Hudson, L., Aquilino, S., & Kienzl, G. (2005). Postsecondary Participation Rates by Sex and Race/Ethnicity: 1974–2003. Issue Brief. NCES 2005-028. (NCES 2005–028; pp. 1–3). National Center for Education Statistics, Institute of Education Sciences, U.S. Department of Education. Jacobs, J. A., & Stoner-Eby, S. (1998). Adult Enrollment and Educational Attainment. The Annals of the American Academy of Political and Social Science, 559, 91–108. JSTOR. Joy, L. (2003). Salaries of recent male and female college graduates: Educational and labor market effects. ILR Review, 56 (4), 606–621. Kim, D., & Nuñez, A.-M. (2013). Diversity, situated social contexts, and college enrollment: Multilevel modeling to examine student, high school, and state influences. Journal of Diversity in Higher Education, 6 (2), 84. Kim, J. (2012). Welfare Reform and College Enrollment among Single Mothers. Social Service Review, 86 (1), 69–91. https://doi.org/10.1086/664951 Knowles, S. (1997). Which level of schooling has the greatest economic impact on output? Applied Economics Letters, 4 (3), 177–180. https://doi.org/10.1080/135048597355465 Kurban, E. R., & Cabrera, A. F. (2020). Building Readiness and Intention Towards STEM Fields of Study: Using HSLS: 09 and SEM to Examine This Complex Process among High School Students. The Journal of Higher Education, 91 (4), 1–31. Lacy, T. A., & Tandberg, D. A. (2018). Data, Measures, Methods, and the Study of the SHEEO. In D. A. Tandberg, A. Sponsler, R. W. Hanna, J. Guilbeau P., & R. Anderson E. (Eds.), The State Higher Education Executive Officer and the Public Good: Developing New Leadership for Improved Policy, Practice, and Research (pp. 282–299). Teachers College Press.
Lee, K. A., Leon Jara Almonte, J., & Youn, M.-J. (2013). What to do next: An exploratory study of the post-secondary decisions of American students. Higher Education, 66 (1), 1–16. https://doi.org/10.1007/s10734-012-9576-6 Lenth, C. S., Zaback, K. J., Carlson, A. M., & Bell, A. C. (2014). Public Financing of Higher Education in the Western States: Changing Patterns in State Appropriations and Tuition Revenues. In Public Policy Challenges Facing Higher Education in the American West (pp. 107–142). Springer. Longanecker, D. (2006). A tale of two pities. Change: The Magazine of Higher Learning, 38 (1), 4–25. Millett, C. M. (2003). How undergraduate loan debt affects application and enrollment in graduate or first professional school. The Journal of Higher Education, 74 (4), 386–427. Mitchell, J. (2017, July 23). In reversal, colleges rein in tuition. The Wall Street Journal. http://opportunityamericaonline.org/wp-content/uploads/2017/07/INREVERSAL-COLLEGES-REIN-IN-TUITION.pdf Mitchell, M., Leachman, M., & Masterson, K. (2016). Funding down, tuition up. Center on Budget and Policy Priorities. https://www.cbpp.org/sites/default/files/atoms/files/519-16sfp.pdf Mokher, C. G., & McLendon, M. K. (2009). Uniting Secondary and Postsecondary Education: An Event History Analysis of State Adoption of Dual Enrollment Policies. American Journal of Education, 115 (2), 249–277. https://doi.org/10.1086/595668 Monaghan, D., & Jang, S. H. (2017). Major Payoffs: Postcollege Income, Graduate School, and the Choice of “Risky” Undergraduate Majors. Sociological Perspectives, 60 (4), 722– 746. https://doi.org/10.1177/0731121416688445 Morgan, G. B., D’Amico, M. M., & Hodge, K. J. (2015). Major differences: Modeling profiles of community college persisters in career clusters. Quality & Quantity, 49 (1), 1–20. https://doi.org/10.1007/s11135-013-9970-x Nienhusser, H. K., & Oshio, T. (2017). High School Students’ Accuracy in Estimating the Cost of College: A Proposed Methodological Approach and Differences Among Racial/Ethnic Groups and College Financial-Related Factors. Research in Higher Education, 58 (7), 723–745. https://doi.org/10.1007/s11162-017-9447-1 Perna, L. W. (2004). Understanding the decision to enroll in graduate school: Sex and r racial/ethnic group differences. The Journal of Higher Education, 75 (5), 487–527. Pool, R., & Vander Putten, J. (2015). The No Child Left Behind Generation Goes to College: A Longitudinal Comparative Analysis of the Impact of NCLB on the Culture of College Readiness (SSRN Scholarly Paper ID 2593924). Social Science Research Network. https://doi.org/10.2139/ssrn.2593924 Rowan-Kenyon, H. T., Blanchard, R. D., Reed, B. D., & Swan, A. K. (2016). Predictors of Low- SES Student Persistence from the First to Second Year of College. In Paradoxes of the Democratization of Higher Education (Vol. 22, pp. 97–125). Emerald Group Publishing Limited. https://doi.org/10.1108/S0196-115220160000022004 Savas, G. (2016). Gender and race differences in American college enrollment: Evidence from the Education Longitudinal Study of 2002. American Journal of Educational Research, 4 (1), 64–75. Schneider, B., & Saw, G. (2016). Racial and Ethnic Gaps in Postsecondary Aspirations and Enrollment. RSF: The Russell Sage Foundation Journal of the Social Sciences, 2 (5), 58–82. JSTOR. https://doi.org/10.7758/rsf.2016.2.5.04 Schudde, L. (2016). The Interplay of Family Income, Campus Residency, and Student Retention (What Practitioners Should Know about Cultural Mismatch). 
Journal of College and University Student Housing, 43 (1), 10–27. Schudde, L. T. (2011). The causal effect of campus residency on college student retention. The Review of Higher Education, 34 (4), 581–610. Scott-Clayton, J., & Li, J. (2016). Black-white disparity in student loan debt more than triples after graduation. Economic Studies, 2 (3), 1–9.
Strayhorn, T. L. (2008). Influences on labor market outcomes of African American college graduates: A national study. The Journal of Higher Education, 79 (1), 28–57. Strayhorn, T. L., Williams, M. S., Tillman-Kelly, D., & Suddeth, T. (2013). Sex Differences in Graduate School Choice for Black HBCU Bachelor’s Degree Recipients: A National Analysis. Journal of African American Studies, 17 (2), 174–188. https://doi.org/ 10.1007/s12111-012-9226-1 Tandberg, D. A. (2013). The Conditioning Role of State Higher Education Governance Structures. The Journal of Higher Education, 84 (4), 506–543. https://doi.org/10.1353/ jhe.2013.0026 Titus, M. A. (2006). No college student left behind: The influence of financial aspects of a state’s higher education policy on college completion. The Review of Higher Education, 29 (3), 293–317. Titus, M. A. (2007). Detecting selection bias, using propensity score matching, and estimating treatment effects: An application to the private returns to a master’s degree. Research in Higher Education, 48 (4), 487–521. Titus, M. A. (2010). Exploring Heterogeneity in Salary Outcomes Among Master’s Degree Recipients: A Difference-in-Differences Matching Approach (SSRN Scholarly Paper ID 1716049). Social Science Research Network. https://papers.ssrn.com/abstract=1716049 Velez, E., Cominole, M., & Bentz, A. (2019). Debt burden after college: The effect of student loan debt on graduates’ employment, additional schooling, family formation, and home ownership. Education Economics, 27 (2), 186–206. Wetter, J. (2009). Policy effect: A study of the impact of research & development expenditures on the relationship between Total Factor Productivity and US Gross Domestic Product performance [PhD Thesis, The George Washington University]. https://search.proquest.com/openview/2461d479184375adc5db8da77a9ebc15/ 1?pq-origsite=gscholar&cbl=18750&diss=y Xu, Y. J. (2013). Career Outcomes of STEM and Non-STEM College Graduates: Persistence in Majored-Field and Influential Factors in Career Choices. Research in Higher Education, 54 (3), 349–382. https://doi.org/10.1007/s11162-012-9275-2 Xu, Y. J. (2017). Attrition of Women in STEM: Examining Job/Major Congruence in the Career Choices of College Graduates. Journal of Career Development, 44 (1), 3–19. https://doi.org/10.1177/0894845316633787 You, S., & Nguyen, J. (2012). Multilevel analysis of student pathways to higher education. Educational Psychology, 32 (7), 860–882. https://doi.org/10.1080/ 01443410.2012.746640 Zhang, L. (2013). Effects of college educational debt on graduate school attendance and early career and lifestyle choices. Education Economics, 21 (2), 154–175.
Chapter 4
Creating Datasets and Managing Data
Abstract  This chapter provides a discussion and demonstration of creating datasets. The management of Excel and Stata datasets is also presented. These datasets include primary and secondary data. While this chapter discusses and demonstrates how to create datasets based on primary data, it focuses on the creation and management of the datasets based on secondary data.

Keywords  Creating datasets · Managing datasets · Primary data · Secondary data
4.1 Introduction
A substantial amount of the time spent conducting higher education policy research and analysis is devoted to dataset creation and management. Even though they may draw on secondary data sources such as those discussed in the previous chapter, researchers and analysts may need to create and manage customized datasets to address specific policy-related questions. This chapter discusses customized datasets that are based on secondary sources of data. It also demonstrates how to create, organize, and manage datasets using Excel and Stata.1 The Stata commands and syntax that are used throughout this chapter are included in an appendix.
1 It is assumed the reader is familiar with Excel.
4.2 Stata Dataset Creation
Data that can be used in Stata may be generated from surveys created and inputted by the analyst or imported from an external source. The former is a primary data source, while the latter is a secondary data source. That is, data produced by the analyst from original surveys are primary data, while data originally compiled by another party are secondary data. In the sections below, we discuss both.
4.2.1 Primary Data
If we are entering data from a very short survey, then we use the input command. The example below shows how data for three variables (variable_x, variable_y, and variable_z) can be entered in Stata by typing the following:

input variable_x variable_y variable_z
31 57 18
25 68 12
35 60 13
38 59 17
30 59 15
end
To see the data that were entered above, type:

list

which would show the following:

. list

     +--------------------------------+
     | variab~x   variab~y   variab~z |
     |--------------------------------|
  1. |       31         57         18 |
  2. |       25         68         12 |
  3. |       35         60         13 |
  4. |       38         59         17 |
  5. |       30         59         15 |
     +--------------------------------+

To save the above data, type:

save "Example 1.0.dta"
To use the Stata editor to enter additional data in Example 1.0, type:

edit

Importing data from a data management (e.g., dBase) file or a spreadsheet (e.g., Excel) file would be a more efficient way to enter data in Stata. There are several ways we can do this. We can import data from comma-delimited Excel (csv) files. For example, the data above may be imported from an Excel comma-delimited (csv) file by typing in the following:

insheet using "Example 1.csv", comma

The use of primary data requires careful planning and a well-developed data collection process. Many of these processes involve conducting computer-assisted personal interviews (CAPI). If we need to collect data, there are several Stata-based tools available to assist in such an effort. One such tool is a user-created package of Stata commands, iefieldkit, developed by the World Bank’s Development Research Group Impact Evaluations team (DIME). The most recent version of the package can be installed in Stata by typing in “ssc install iefieldkit, replace”. Information on iefieldkit can be found at the website address: https://dimewiki.worldbank.org/wiki/Iefieldkit. Once installed, iefieldkit allows for the automatic creation of Excel files containing the collected data.
4.2.2 Secondary Data
Using secondary data sources, customized datasets can be easily created for use when conducting higher education policy research. The most basic dataset is one that captures a snapshot in time or is cross-sectional in nature. For example, we can download a table containing data on the participation rate of U.S. high school graduates in 2012 who attended degree-granting postsecondary education institutions in the same year, by state, from the 2017 version of The Digest of Education Statistics (Table 302.50) in an Excel format (https://nces.ed.gov/programs/digest/d16/tables/xls/tabn302.50.xls). The data, however, need to be reformatted before they can be imported or copied and pasted into Stata for further analysis. Hence, the following steps need to be taken:

1. Blank columns and rows should be deleted.
2. All rows with text containing titles, subtitles, notes, and footnotes should be deleted.
3. A column needs to be inserted to include the state id number.
4. Each column should have an appropriate, simple, one-word title. For example, the columns from left to right could be named stateid, state, total, public, private, anystate, homestate, anyrate, and homerate.
5. Because the total number (N) of cases is 51 (50 states plus the District of Columbia), the state id numbers should be entered, ranging from 1 to 51 to reflect N. (If the analyst chooses to delete one or more cases, then the range of the state id would reflect the modified N.)
6. All numbers should be formatted as numeric with the appropriate decimal places and not as text characters.
7. Any characters that are not alpha-numeric should be removed from all cells.
8. After steps 1–7, the file should be saved in an Excel format in the “working” directory (as discussed in the previous chapter).
9. Open Stata and change to the same “working directory” as in step 8. For this chapter, the Stata command to change to the “working directory” which contains the Excel file is as follows:

cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Excel files"
10. The entire Excel file can now be imported into Stata. Be sure to specify the option indicating that the first row contains variable names. Using the same file from above, the Stata command is:

import excel "tabn302.50 - reformatted.xls", firstrow

11. Open the Stata Data editor in either edit or browse mode to look at the imported data. In the Stata Data editor, you should see the following:
Fig. 4.1 Stata dataset, based on Excel tabn302.50
In Fig. 4.1, take note of the column with the state names, which are in red text. This indicates that state is a string variable. We may want to include Federal Information Processing Standard Publication (FIPS) codes and the abbreviations of state names in a state-level dataset. Using the user-created Stata program statastates, the FIPS codes and state abbreviations can be easily added to any state-level dataset that includes the state name. (In our example from above, the variable containing the state name is state.) This is demonstrated in the two steps below:

1. ssc install statastates
2. statastates, name(state)

We can then delete the variable _merge, which was created when we added the FIPS codes and state abbreviations. This is done by simply typing:

drop _merge

We may also want to move the FIPS codes and state abbreviations somewhere near the front of our dataset. This can be accomplished by typing the following Stata command:

order state_abbrev state_fips, before(state)

The dataset should look like Fig. 4.2:
Fig. 4.2 Stata dataset, based on the modified Excel tabn302.50
We can then save this file with a new, more descriptive name, such as “US high school graduates in 2012 enrolled in PSE, by state”, in a working directory containing Stata files (e.g., C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Stata files). After changing to the working directory and
reopening the new Stata file, we can show a description of our dataset by typing:

describe

The output is the following:

. describe

Contains data from US high school graduates in 2012 enrolled in PSE, by state.dta
  obs:            51
 vars:            11
----------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------
Stateid         byte    %10.0gc               Stateid
state_abbrev    str2    %9s
state_fips      byte    %8.0g
state           str20   %20s                  state
total           long    %10.0g                total
public          long    %10.0g                public
private         int     %10.0g                private
anystate        long    %10.0g                anystate
homestate       long    %10.0g                homestate
anyrate         double  %10.0g                anyrate
homerate        double  %10.0g                homerate
----------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.

Take note that none of the variables have labels. To create labels, based on the column names in the Excel file, we use the label variable (lab var) command for each variable. Here is an example:

lab var Stateid "Stateid"
lab var state_abbrev "State abbreviation"
lab var state_fips "FIPS code"
lab var state "State name"
lab var total "Total number of graduates from HS located in the state"
lab var public "Number of graduates from public HS located in the state"
lab var private "Number of graduates from private HS located in the state"
from HS 12 months enrolled in any state” (Notice that labels cannot be more than 80 characters. So we have to shorten the label.) lab var anystate “Number of 1st-time freshmen graduating from HS enrolled in any state” lab var homestate “Number of 1st-time freshmen graduating from HS enrolled in home state” lab var anyrate “Estimated rate of HS graduates going to college in any state” lab var homerate “Estimated rate of HS graduates going to college in home state” Typing the describe command, the output is this: . describe Contains data from C:\Users\Marvin\Dropbox \Manuscripts\Book\Chapter 4\Stata\US high school graduates in 2012 enrolled in PSE, by state.dta obs: 51 vars: 11 25 Jun 2018 16:16 ---------------------------------------------------------------storage display value variable name type format label variable label ---------------------------------------------------------------Stateid byte %10.0gc Stateid state_abbrev str2 %9s State abbreviation state_fips byte %8.0g FIPS code state str20 %20s State name total long %10.0g Total number of graduates from HS located in the state public long %10.0g Number of graduates from public HS located in the state private int %10.0g Number of graduates from private HS located in the state anystate long %10.0g Number of 1st-time freshmen graduating from HS enrolled in any state homestate long %10.0g Number of 1st-time freshmen graduating from HS enrolled in home state anyrate double %10.0g Estimated rate of HS graduates going to
college in any state Estimated rate of HS graduates going to college in home state ---------------------------------------------------------------Sorted by: homerate
double %10.0g
We then re-save the dataset with the same name. The example above is a cross-sectional dataset that can be used to provide descriptive statistics, which will be discussed in the next chapter. Time-series datasets can be used to observe changes in phenomena over time. For example, data on the enrollment of recent high school completers in college from 1960 through 2016 constitute a time series. These data are also provided by the National Center for Education Statistics (NCES) in The Digest (2019) and can be downloaded to an Excel file by going to https://nces.ed.gov/programs/digest/d17/tables/dt17_302.10.asp. Focusing on the percent of recent high school completers who enrolled in college between 1960 and 2016, the data can be copied directly from the downloaded Excel table into Stata. More specifically, we can copy the data in column H (total percent of recent high school completers who enrolled in college) into the Stata data editor. In Stata, if we type describe, we will see there are 68 instead of 57 observations (1960–2016). Because the Excel file had blank rows, Stata treated those blank rows as cases with missing data (.). Therefore, we drop the cases with missing data:
drop if var1==.
We rename var1 totalpct by typing:
rename var1 totalpct
We then create a year variable that has values that range from 1960 (1959 + 1) to 2016:
gen year = 1959 + _n
We relocate the year variable to the beginning of the dataset by typing:
order year, first
Then we declare the dataset to be a time series. (This has to be done only once before saving the file.)
tsset year, yearly
We then change the working directory to the one with our Stata files:
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Stata"
Finally, we save the file with a descriptive name (Fig. 4.3):
save "Percent of US high school graduates in PSE, 1960 to 2016.dta"
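As a quick check that the time-series declaration worked as intended, we can ask Stata to report any gaps in the series and inspect the first few rows. This is a minimal sketch, not part of the original example, assuming the dataset created above is in memory.
tsreport
* inspect the first five years of the series
list year totalpct in 1/5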
A word of caution when using secondary data such as the NCES Excel files. Many of those files contain non-numeric characters, such as commas and dollar signs which will yield string variables in Stata. Before we copy and paste data from those types of files, we have to properly reformat the cells with data so they contain no non-numeric characters. Using the timeseries data, we can create graphs (which we will discuss in the next chapter). In many instances, cross-sectional time-series or panel data are used to conduct higher education policy research. In some cases, analysts have direct access to data in a panel format, such as some tables that are published in the Digest of Education Statistics. Consistent with the examples above and the use of panel data, we download the Excel version of Table 304.70 from the 2018 version of The Digest.2 Because the data on total fall enrollment of undergraduate students in degree-granting postsecondary education institutions by state are for selected years 2000 through 2017, we can characterize those data as panel in nature. Unlike the above example of time-series data, none of the panel data in this format can be easily copied and pasted into the Stata Data Editor. Prior to copying or importing them into Stata, the data have to be properly formatted. The easiest way to reformat data is in Excel worksheets, containing data on each of the variables to be subsequently analyzed in Stata. For example, some of the data on undergraduate students by state from Table 304.70 of the 2018 version of
Fig. 4.3 Stata dataset, based on Excel Table 302.10
2 Table 304.70—Total fall enrollment in degree-granting postsecondary institutions, by level of enrollment and state or jurisdiction: Selected years, 2000 through 2017. The table can be found at: https://nces.ed.gov/programs/digest/d18/tables/dt18_304.70.asp.
The Digest can be stored in an Excel worksheet named "Undergrads". This worksheet could be one of many in an Excel workbook named "Enrollment". The Excel worksheet looks like this (Fig. 4.4). Before we can import or copy and paste the data into Stata, we must do the following:
1. Copy the worksheet "Digest 2018 Table 304.70" to another worksheet and rename it Ugrad in the same workbook. In the Ugrad worksheet:
2. Remove all borders and unmerge all cells.
3. Remove all irrelevant rows (e.g., table titles, United States, District of Columbia, table footnotes, etc.) and blank rows.
4. Remove all irrelevant columns.
5. Insert a new column and create a column header named "id".
6. Beginning with the number 1, create an id number for each state.
7. Create a column header for the State names, "State".
8. Create column headers (variable names) for each year of data, beginning with "Ugrad", which reflects undergraduate enrollment (e.g., Ugrad2000 for the year 2000, Ugrad2010 for the year 2010, etc.).
9. Reformat all data cells so they contain no non-numeric characters (e.g., commas, dollar signs, etc.). Note—if we copy and paste or import numbers with non-numeric characters into Stata, it will treat them as string variables, which cannot be analyzed (see the destring sketch after Fig. 4.4 for a Stata-side alternative).
10. Save the Excel workbook to a new name, such as "Undergraduate enrollment data".
Fig. 4.4 Digest 2018 Table 304.70 (Excel)
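As an alternative to reformatting every cell in Excel (step 9 above), the cleaning can also be done on the Stata side after importing. The sketch below is illustrative only; it assumes the enrollment columns arrived as strings containing commas and that the variable names follow the convention described above.
* convert string columns such as Ugrad2000-Ugrad2017 to numeric, stripping commas
destring Ugrad*, replace ignore(",")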
As a result of steps 1–10, our new worksheet should look like this (Fig. 4.5). As we can see, this worksheet allows us to view and manage the data that we are interested in and, if necessary, access the source of those data in the other worksheet (i.e., Digest 2018 Table 304.70). We can import this worksheet from the Excel workbook into Stata via the following syntax (all on one line):
import excel "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Excel files\College enrollment data.xls", sheet("Ugrad") firstrow
Take note that the option sheet("Ugrad") refers to the specific worksheet we would like to import. The option firstrow tells Stata to use the first row of the worksheet as the variable names. The result is as follows:
. import excel "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Excel files\College enrollment data.xls", sheet("Ugrad") firstrow
(8 vars, 50 obs)
We then save this Stata file with our panel data: save “C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Stata\ Undergraduate enrollment data - Wide.dta” Take note that we used “Wide” as part of the naming convention. This is because our panel dataset is in a “wide” format. While it is easier to manage in this format, it is difficult if not impossible to conduct analysis
Fig. 4.5 Digest 2018 Table 304.70 (Excel)—modified
on "wide" format panel data. To conduct panel data analysis, we have to convert the data from a "wide" to a "long" format using the built-in reshape command or the much faster user-created sreshape command (Simons 2016). In Stata, type search sreshape, all, click on dm0090, install, and type:
sreshape long Ugrad, i(id) j(year)
Notice the results:
. import excel "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Excel files\College enrollment data.xls", sheet("Ugrad") firstrow
(8 vars, 50 obs)
. sreshape long Ugrad, i(id) j(year)
(note: j = 2000 2010 2012 2015 2016 2017)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                       50   ->    300
Number of variables                   8   ->      4
j variable (6 values)                     ->   year
xij variables:
              Ugrad2000 Ugrad2010 ... Ugrad2017   ->   Ugrad
-----------------------------------------------------------------------------
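The same conversion can be done with Stata's built-in reshape command, which is slower on very large files but requires no installation. A short sketch under the same variable naming:
reshape long Ugrad, i(id) j(year)
* confirm that id and year now uniquely identify each row
isid id year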
We now have 300 observations and four variables, including a year variable. This new dataset, in a long format, now has to be "declared" a panel dataset by typing:
xtset id year, yearly
The result is:
. xtset id year, yearly
       panel variable:  id (strongly balanced)
        time variable:  year, 2000 to 2017, but with gaps
                delta:  1 year
The example above is a strongly balanced panel dataset with gaps in the years. Panel datasets can be strongly balanced, strongly balanced with gaps, weakly balanced, or unbalanced. In a panel dataset, the total number (N) of observations equals the number of units (e.g., states or institutions) or panels (p) multiplied by the number of time points (t) (e.g., days, weeks, months, or years), or N = p × t. A strongly balanced dataset is one in which all the panels are observed at the same time points. Panel datasets in which all the panels are observed at the same time points but with gaps between those time points are known as strongly balanced with gaps in the years. A weakly balanced dataset exists if each panel has the same number of observations but not the same time points. An unbalanced dataset is one in which each panel does not have the same number of
observations. In order of priority, we should strive for strongly balanced and then weakly balanced panel datasets. For reasons we will discuss later in the book, we should try to avoid using unbalanced panel datasets. Once we "declare" our data to be a panel dataset (which only has to be done one time), we save it to a new file. A good practice is to save it with "Long" as part of its naming convention. For example, we save our "declared" panel data file as follows:
save "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Stata\Undergraduate enrollment data - Long.dta"
We then close Stata by typing:
exit
Unlike our example above, most panel datasets have more than one variable that can be analyzed, such as Ugrad. We can add more variables for the same years to our dataset in a few ways. We can manually add more variables. But as pointed out above, this is a very time-consuming and possibly error-prone process. Instead, the additional variables can be added in another Excel worksheet in the same or another Excel workbook, using a similar naming convention for the variable in the worksheet. For example, we can download data on public high school graduates from the NCES' Digest of Education Statistics (https://nces.ed.gov/programs/digest/d18/tables/dt18_219.20.asp) in an Excel file and follow the steps above, using "HSGrad" as the naming convention for the same years as in the Ugrad worksheet. Our updated Excel workbook (which we saved as Example 4) now looks like this (Fig. 4.6). We open Stata, change our working directory to the directory that contains the Excel file, and import the HSGrad worksheet into Stata by typing:
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Excel files"
import excel "Example 4.xls", sheet("HSGrad") firstrow
We then change our working directory to where we want to save our Stata file and save it, by typing:
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Stata"
save "HSGrad - Wide.dta"
Using the same syntax as above but simply substituting HSGrad for Ugrad, we can reformat our file from wide to long, declare it a panel dataset, and save it to a new file. We show these steps below.
sreshape long HSGrad, i(id) j(year)
xtset id year, yearly
save "HSGrad - Long.dta"
Like the dataset of first-time college students, this dataset is also strongly balanced but with gaps in the years. This is not a problem if we want to merge
Fig. 4.6 Digest 2018 Table 219.20 (Excel)
the two datasets, based on id, into one that would contain two variables that we can analyze: FirsTim and HSGrad. We do this by specifying the dataset ("First-Time - Long.dta") that contains the variable we would like to add to the dataset that is currently open. We carry out this procedure by typing the following:
joinby id year using "First-Time - Long.dta", unmatched(none)
Because that file contains the same yearly data on first-time college students as the data on public high school graduates, we would not have to specify "year" as a matching variable. But as shown in the next example below, it is a good practice to include that variable as well. Given our example, our new Stata file looks like this (Fig. 4.7). In the data editor, we can see the two variables (HSGrad and FirsTim) that we can later analyze. If we have data for additional variables in other worksheets located in the same working directory (e.g., "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Excel files"), we would simply repeat the steps above, referring to the specific Excel files/worksheets that we want to import. Similarly, we would reshape these datasets from wide to long and ultimately join them to our current file in memory. We could also join two or more Stata files that were reshaped from wide to long and have the variables State, id, and year. For example, if in our current directory we have a file that contains state-level undergraduate need-based financial aid (Undergraduate state financial aid - need.dta) and another that has merit-based financial aid (Undergraduate state financial
Fig. 4.7 Stata file based on Digest 2018 Table 219.20 (Excel)
aid - merit.dta) data, we could add the data from those two files to our long-format panel dataset on undergraduate college enrollment (Undergraduate enrollment data - Long.dta) by executing the following commands:
use "Undergraduate enrollment data - Long.dta", clear
joinby id year using "Undergraduate state financial aid - need"
joinby id year using "Undergraduate state financial aid - merit"
xtset id year, yearly
save "Example - 4.1.dta"
Notice that in the joinby syntax we did not have to include the option unmatched(none). We also did not have to include the extension dta as part of the names of the Stata files. We did, however, have to declare our dataset as a panel dataset and save it with a new file name (e.g., Example 4.1). We can see in our Stata data editor that we now have six variables in our new panel dataset (Fig. 4.8).
Fig. 4.8 Modified Stata file based on Digest 2018 Table 219.20 (Excel)
After closing the Stata data editor, we can see how our new panel dataset is structured by typing the command xtdescribe or the shortened version (xtdes):
. xtdes
      id:  1, 2, ..., 50                               n =     50
    year:  2000, 2010, ..., 2016                       T =      5
           Delta(year) = 1 year
           Span(year)  = 17 periods
           (id*year uniquely identifies each observation)
Distribution of T_i:   min     5%    25%    50%    75%    95%    max
                         5      5      5      5      5      5      5
     Freq.  Percent    Cum. |  Pattern
 ---------------------------+-------------------
       50    100.00  100.00 |  1.........1.1..11
 ---------------------------+-------------------
       50    100.00         |  X.........X.X..XX
We can see that, like our original dataset of only undergraduate college enrollment, our new merged panel dataset spans 17 years, has 250 observations (50 states × 5 years), and is strongly balanced but with gaps in the years. This structure is acceptable when conducting basic data analysis, such as descriptive statistics and some regression models (which we will cover in other chapters). But as we shall see in those chapters, a strongly balanced panel dataset with no gaps in the time periods is required to conduct more advanced statistical analyses.
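Because id and year uniquely identify the observations in each of these long files, an equivalent way to combine them is Stata's merge command. The sketch below uses file names from the examples above and keeps only the matched years; it is an alternative to joinby, not the procedure shown in the original example.
use "Undergraduate enrollment data - Long.dta", clear
merge 1:1 id year using "HSGrad - Long.dta", keep(match) nogenerate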
4.3
Summary
This chapter discussed and demonstrated how higher education policy analysts can create primary and secondary datasets to address specific policy-related questions. More specifically, the chapter demonstrated how to create and organize datasets using Excel and Stata. Furthermore, it described how these customized datasets need to be managed.
4.4
Appendix
*Chapter 4 Syntax
*Primary data
*example below shows how data for three variables (variable_x, variable_y, ///
and variable_z) can be entered in Stata
input variable_x variable_y variable_z
31 57 18
25 68 12
35 60 13
38 59 17
30 59 15
end
*To see the data that was entered above, type
list
*To save the above data, type:
save "Example 1.0.dta"
*To use the Stata editor to enter additional data in Example 1.0, type:
edit
*the data above may be imported from an Excel comma delimited file (csv) ///
by typing in the following:
insheet using "Example 1.csv", comma
*Secondary data
*to change to the "working directory" which contains the Excel file ///
is as follows:
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Excel files"
*Using the same file from above, the Stata command is:
import excel "tabn302.50 - reformatted.xls", firstrow
*Using the user-created Stata program "statastates", the FIPS codes ///
and state abbreviations can be easily added to any state-level data ///
set that includes the state name. (In our example, the state-name ///
variable is state.) This is demonstrated in the steps below:
ssc install statastates
statastates, name(state)
*We can delete the variable _merge, which was created when we added ///
the FIPS codes and state abbreviations. This is done by simply typing:
drop _merge
*We may also want to move the FIPS codes and state abbreviations ///
somewhere near the front of our dataset. This can be accomplished by ///
typing the following Stata command:
order state_abbrev state_fips, before(state)
*Stata dataset, based on the modified Excel tabn302.50
describe
*To create labels, based on the column names in the Excel file, ///
we use the label variable (lab var) command for each variable. ///
Here is an example:
lab var Stateid "Stateid"
lab var state_abbrev "State abbreviation"
lab var state_fips "FIPS code"
lab var state "State name"
lab var total "Total number of graduates from HS located in the state"
lab var public "Number of graduates from public HS located in the state"
lab var private "Number of graduates from private HS located in the state"
lab var anystate ///
"Number of first-time freshmen graduating from HS 12 months enrolled in any state"
*Labels cannot be more than 80 characters. So we have to shorten the label.
lab var anystate ///
"Number of 1st-time freshmen graduating from HS enrolled in any state"
lab var homerate ///
"Estimated rate of HS graduates going to college in home state"
describe
*we drop the cases with missing data.
drop if var1==.
*We rename var1 totalpct by typing:
rename var1 totalpct
gen year = 1959 + _n
*We relocate the year variable to the beginning of the dataset by typing:
order year, first
*Then we declare the dataset to be a time series.
tsset year, yearly
*change the working directory to the one with our Stata files and save the file.
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Stata"
*Finally, we save the file with a descriptive name.
save "Percent of US high school graduates in PSE, 1960 to 2016.dta"
*import worksheet from Excel workbook into Stata, via the following syntax
clear all
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Excel files"
import excel "College enrollment data.xls", sheet("Ugrad") firstrow
*save this Stata file with our panel data:
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Stata"
save "Undergraduate enrollment data - Wide.dta"
*convert the data from a "wide" to a "long" format using the reshape ///
or the much faster user-created sreshape (Simons 2016)
*install sreshape
net install dm0090.pkg, replace
sreshape long Ugrad, i(id) j(year)
*declare a panel dataset by typing:
xtset id year, yearly
*save our "declared" panel data file as follows:
save "Undergraduate enrollment data - Long.dta"
*change the working directory and import the HSGrad worksheet
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Excel files"
import excel "Example 4.xls", sheet("HSGrad") firstrow
*change our working directory to where we want to save our Stata ///
file and save it, by typing:
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Stata"
save "HSGrad - Wide.dta"
*reformat our file from wide to long, declare it a panel data set, ///
and save it to a new file
sreshape long HSGrad, i(id) j(year)
xtset id year, yearly
save "HSGrad - Long.dta"
*join the two datasets, based on id, into one dataset that would ///
contain two variables
joinby id year using "First-Time - Long.dta", unmatched(none)
*join two or more Stata files that were reshaped from wide to long ///
and have the variables State, id, and year.
use "Undergraduate enrollment data - Long.dta", clear
joinby id year using "Undergraduate state financial aid - need"
joinby id year using "Undergraduate state financial aid - merit"
xtset id year, yearly
save "Example - 4.1.dta"
*see how our new panel dataset is structured, by typing the command ///
xtdescribe or the shortened version:
xtdes
*end
References
Simons, K. L. (2016). A sparser, speedier reshape. The Stata Journal, 16(3), 632–649.
Chapter 5
Getting to Know Thy Data
Abstract This chapter discusses and demonstrates the importance of getting to know the data that we use to conduct higher education policy analysis and evaluation. More specifically, this chapter addresses the need to know the structure of datasets. The identification and exploration of missing data are also discussed in this chapter. Keywords Dataset structure · Missing data · Missing data analysis
5.1
Introduction
In the first section of this chapter, we demonstrate how to explore the structure of a dataset. In the next section, we explore or "get to know thy data". But this is part of a broader point with regard to our datasets. We should be well acquainted with all aspects of our data, including the strengths and limits of their use. The limitations include the extent to which we have missing data. Therefore, the last part of this chapter discusses how to identify and analyze missing data patterns. This chapter presents ways in which we can determine and discuss the strengths and limitations of the data we use to conduct higher education policy analysis. The Stata commands and syntax that are used throughout this chapter are included in an appendix.
5.2
Getting to Know the Structure of Our Datasets
In Chap. 4, we discussed the types of data (primary and secondary) and how we construct our own dataset from secondary sources. We ended that chapter by showing how we can explore the structure of our panel dataset using the xtdescribe (or xtdes) command in Stata. We can also use the describe command to show information on data storage and on the variables in any type of dataset. Using that command, we can look at the structure of the time-series data that we introduced in the previous chapter.
. describe
Contains data obs: 56 vars: 2 --------------------------------------------------------------storage display value variable name type format label variable label --------------------------------------------------------------year float %ty totalpct float %8.0g --------------------------------------------------------------Sorted by: year We see that “year” and “totalpct” are stored as a floating or float type. By default, Stata stores all numbers as floats, also known as single-precision or 4byte reals (StataCorp 2019). Compared to the integer storage type, the float storage type uses more memory.1 While it may be necessary for the “totalpct” variable, this level of precision is not necessary for the year variable, which is an integer. So we can reduce the amount of memory required by float by compressing the data using the compress command.2 The use of this command automatically changes the storage type for the year variable from float to integer (int). We see from the output below that we save 112 bytes. . compress variable year was float now int (112 bytes saved) . describe Contains data 1 For a complete description of the storage types, see page 89 of Stata User’s Guide Release 16. 2 For more information on compress, see pages 77–78 of the Stata User’s Guide Release 16.
obs: 56 vars: 2 --------------------------------------------------------------storage display value variable name type format label variable label --------------------------------------------------------------year int %ty totalpct float %8.0g --------------------------------------------------------------Sorted by: year Note: Dataset has changed since last saved. Therefore, it is a good practice to invoke the compress command, particularly using large datasets with numeric variables that are actually integers. As an example, we will use an enhanced version of one of our panel data files that we created in the previous chapter and saved to a new file name, Example 5.0. With the exception of how state expenditures on financial aid for undergraduates is measured in millions of dollars, this file contains the same data as in the file we used as in example Chap. 4 (Example 4.1). Most likely, we would have either imported these data on state expenditures on financial aid for undergraduates from National Association of State Student Grant and Aid Programs (NASSGAP) Excel files or copied and pasted the data, or manually entered the data from NASSGAP pdf files into a Stata file. Because it has implications for how our variables are stored, it is important that we are aware of whether or not the state financial aid data in our dataset are also measured in millions. If they are measured in millions, then those variables are not stored as integers. We can verify this by typing the describe command. cd ”C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 5\Stata files“ use ”Example 5.0.dta“ . describe Contains data from C:\Users\Marvin\Dropbox\Manuscripts\Book \Chapter 5\Stata files\Example 5.0.dta obs: 250 vars: 6 18 Jul 2020 15:57 --------------------------------------------------------------storage display value variable name type format label variable label --------------------------------------------------------------id float %10.0gc id year int %ty State str20 %20s State Ugrad long %10.0g Undergraduate enrollment need float %9.0g State spending on need-
                                       based aid (in millions)
merit           float   %9.0g          State spending on merit-
                                       based aid (in millions)
---------------------------------------------------------------
Sorted by: id year
We see that only the “year” variable is stored as an integer. Because we know that “id” is also numeric and an integer, we can compress the data. This results in the following output. . compress variable id was float now byte variable Ugrad was long now int variable State was str20 now str14 (2,750 bytes saved) We see this reduces the required memory for the dataset by changing the storage type for “id” to byte and the storage type for “State” from a string variable with 20 bytes to one with 14 bytes, resulting in a total of 2750 bytes saved. After typing the command describe again, we see the following: Contains data from C:\Users\Marvin\Dropbox\Manuscripts\Book \Chapter 5\Stata files\Example 5.0.dta obs: 250 vars: 6 18 Jul 2020 15:59 --------------------------------------------------------------variable storage display value name type format label variable label --------------------------------------------------------------id byte %10.0gc id year int %ty State str14 %14s State Ugrad int %10.0g Undergraduate enrollment need float %9.0g State spending on needbased aid (in millions) merit float %9.0g State spending on meritbased aid (in millions) --------------------------------------------------------------Sorted by: id year We see that “id” is not stored as an integer. So we will have to use another Stata command, recast, to accomplish that task. recast int id . recast int id We then retype describe, and see the following: . describe
Contains data from C:\Users\Marvin\Dropbox\Manuscripts\Book \Chapter 5\Statafiles\Example 5.0.dta obs: 250 vars: 6 18 Jul 2020 15:59 --------------------------------------------------------------variable storage display value name type format label variable label --------------------------------------------------------------id int %10.0gc id year int %ty State str14 %14s State Ugrad int %10.0g Undergraduate enrollment need float %9.0g State spending on needbased aid (in millions) merit float %9.0g State spending on meritbased aid (in millions) --------------------------------------------------------------Sorted by: id year We then save our file with the same name (e.g., Example 5.0). In some cases, we may be using a large amount of data from secondary data sources such as the National Center for Education Statistics’ public-use High School Longitudinal Study of 2009 (HSLS:09) student dataset. Because this dataset has several thousand variables, we set the maximum variables to 10,0000 (set maxvar 10,000) in Stata.3 We download and import all the student data from the HSLS:09 dataset in Stata and we use the command describe, short. We can see that we have 23,503 observations and 8509 variables. If we use the memory command, we can also see that this huge dataset uses about 1 gigabyte of memory. . set maxvar 10000 . use ”C:\Users\Marvin\Google Drive\HSLS\hsls_16_student_v1_0.dta“ . describe, short Contains data from C:\Users\Marvin\Google Drive\HSLS\hsls_16_ student_v1_0.dta obs: 23,503 vars: 8,509 Sorted by:. memory
3 If we are using Stata/IC, then the maximum number of variables is 798. If we are using Stata/MP, then the maximum number of variables is 65,532. In this example, we are using Stata/SE which has as a maximum 10,998 variables.
Memory usage
                                       used          allocated
---------------------------------------------------------------
data                          1,026,681,549      1,241,513,984
strLs                                     0                  0
---------------------------------------------------------------
data & strLs                  1,026,681,549      1,241,513,984
---------------------------------------------------------------
[the rest of the output omitted] If we compress the data, we would fail to save any bytes. . compress (0 bytes saved) This suggests that the dataset is structured in such a way that it is efficiently using our computer’s memory. In yet another example, we download, reformat, modify, and import some state higher education finance data from Excel files provided by the State Higher Education Executive Officers (SHEEO). These data are saved to a Stata file (Example 5.2.dta)4 This particular example will include postGreat Recession (i.e., after fiscal year 2009) state-level data on net tuition revenue (gross tuition revenue minus state financial aid), state appropriations to public higher education, state financial aid to students, full-time equivalent (FTE) students (net of medical students) and cost indices (COLI, EMI, HECA), which SHEEO uses to adjust the data when comparing the data across years and states. (For the purposes of this example, we will not use those indices.) To access the data from the worksheet within the Excel workbook that contains our downloaded data, we change the working directory: cd ”C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 5\Excel files“ and import the Excel file (type the import command in its entirety on one line) import excel ”SHEEO_SHEF_FY18_Nominal_Data.xlsx“, sheet (”State and U.S. Nominal Data (2“) firstrow Because we want to use only post-Great Recession data, we drop observations if they are prior to fiscal year (FY) 2010. (According to the National
4 These
data can be found at: https://sheeo.org/project/state-higher-education-finance/.
Bureau of Economic Research, the Great Recession in the U.S. began in December 2007 and ended in June 2009). Based on the output below, we see that 1,532 observations were dropped.
. drop if FY < 2010
(1,532 observations deleted)
Prob > chi-square = 0.0000
We see from the output above that even after including race-ethnicity, the data in the two variables (S3CLGPELL and P1TUITION) are not MCAR.
There are at least two implications. First, it is probably not a good idea to delete cases with missing data that are not MCAR, because doing so can bias the resulting estimates. Second, statistical methods that assume no missing data are valid when the missing data are MCAR. In the next few chapters, some of those statistical methods will be discussed.
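For readers who want to reproduce this kind of inspection, Stata's built-in misstable command summarizes missing-data counts and joint missingness patterns. The sketch below is illustrative only and assumes the HSLS:09 variables named in the text are in memory.
* count missing values and show joint missingness patterns
misstable summarize S3CLGPELL P1TUITION
misstable patterns S3CLGPELL P1TUITION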
5.5
Summary
This chapter discussed and demonstrated the importance of getting to know the data that we use to conduct higher education policy analysis and evaluation. More specifically, this chapter addressed the need to know the structure of our datasets. The identification and exploration of missing data were also discussed and demonstrated.
5.6 Appendix *Chapter 5 Syntax *use time series data from Chap. 4. cd ”C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Stata“ use ”Percent of US high school graduates in PSE, 1960 to 2016.dta“ *examine structure of the dataset describe *reduce the amount of memory required by float by compressing the data using /// compress
*compare after compressing, show structure describe *open panel dataset cd ”C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 5\Stata files“ use ”Example 5.0.dta“ *compress the data and show structure compress describe * recast int id describe *save save ”Example 5.0.dta“, replace *clear all *using a large amount of data from secondary data sources such as the /// National Center for Education Statistics’ NCES /// public-use High School Longitudinal Study of 2009 (HSLS:09) student dataset *set the maximum variables to 10,0000
set maxvar 10000
*download all student data from the HSLS:09 dataset in Stata
*examine a shortened version of the HSLS:09 dataset's structure
describe, short
*look at how much memory this dataset uses
memory
*try to see if we can compress the data
compress
*close dataset
clear all
*import an Excel file (SHEEO finance data)
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 5\Excel files"
import excel ///
"SHEEO_SHEF_FY18_Nominal_Data.xlsx", sheet("State and U.S. Nominal Data (2") firstrow
*Because we want to use only post-Great Recession data, we drop observations ///
that are prior to (i.e., less than) fiscal year (FY) 2010.
drop if FY < 2010

. reg netuit_fte stapr_fte

      Source |       SS           df       MS      Number of obs   =        50
-------------+----------------------------------   F(1, 48)        =      7.19
       Model |    62525800         1    62525800   Prob > F        =    0.0100
    Residual |   417181345        48  8691278.01   R-squared       =    0.1303
-------------+----------------------------------   Adj R-squared   =    0.1122
       Total |   479707145        49  9789941.74   Root MSE        =    2948.1

------------------------------------------------------------------------------
  netuit_fte |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   stapr_fte |   -.354383    .132125    -2.68   0.010    -.6200382   -.0887278
       _cons |   10192.75   1107.684     9.20   0.000     7965.599     12419.9
------------------------------------------------------------------------------
We can see that R2, 0.1303, is the same value as what was shown in Fig. 6.18. But the regression output also provides an analysis-of-variance (ANOVA) table with the model and residual (error) sums of squares (SS), the degrees of freedom (df), and mean squares (MS).2 Information from the ANOVA table can be used to calculate R2, which is the regression model sum of squares (RSS) divided by the total sum of squares (TSS), or R2 = RSS/TSS, where

$$RSS = \sum_{i=1}^{n} \left(\hat{Y}_i - \bar{Y}\right)^2$$
1 For OLS regression formulas with more than one independent variable, see introductory mathematical statistics texts. 2 It is assumed the reader is familiar with ANOVA.
$$TSS = \sum_{i=1}^{n} \left(Y_i - \bar{Y}\right)^2$$
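As a quick arithmetic check, the R2 and adjusted R2 reported above can be recovered from the ANOVA table with display (the figures are taken from the output shown earlier):
display (479707145 - 417181345)/479707145       // R-squared, about 0.1303
display 1 - (1 - 0.1303)*(50 - 1)/(50 - 1 - 1)  // adjusted R-squared, about 0.1122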
With respect to the overall regression model, the output includes the F statistic and its statistical significance, the adjusted R2, and the root mean squared error (root MSE), where

$$\text{adjusted } R^2 = \left(R^2 - \frac{k}{n-1}\right)\left(\frac{n-1}{n-k-1}\right)$$

with k the number of independent variables and n the number of observations.
The F-statistic compares a model with no independent variables (an intercept-only model) to the model with one or more independent variables. The null hypothesis is that the intercept-only model and the model with independent variables fit the data equally well. The null hypothesis is rejected if the model with one or more independent variables reduces the unexplained variation significantly relative to the intercept-only model. If we can reject the null hypothesis (Prob > F is less than 0.05), then we can conclude that the model with one or more independent variables provides a better fit than the intercept-only model. We see from the output above that this is indeed the case. In the above output, we also see the root MSE is the square root of the MS of the residual (shown in the ANOVA table), or the standard deviation of the residuals. It is an indication of the concentration of the data around the regression line. The lower the value of the root MSE, the better the regression model fits the data. The estimated beta coefficients for state appropriations per FTE student (stapr_fte) and the constant (_cons), as well as the standard error (Std. Err.), t statistic (t), statistical significance (P>|t|), and 95% confidence intervals are also shown. The standard error reflects the average distance between the data points and the regression line and is represented in the following formula:
$$s_{\beta} = \sqrt{\frac{\sum \left(Y_i - \hat{Y}_i\right)^2 / (N-2)}{\sum \left(X_i - \bar{X}\right)^2}}$$

The t statistic is estimated as follows:

$$t_{n-2} = \frac{\hat{\beta}}{s_{\beta}}$$
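Applying the formula to the bivariate output above reproduces the reported t statistic (simple arithmetic, shown with display):
display -.354383/.132125     // about -2.68, matching the output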
The smaller sβ, the larger tn−2, and the more likely it is that the null hypothesis will be rejected and the parameter estimate (β) declared statistically significant. If we can reject the null hypothesis with more
than 95% certainty (95% of the values of β lie within the mean ±1.96 standard deviations), then we can say β is not equal to zero, or not the result of statistical chance. This is the same as saying there is less than a 5% probability (p value < 0.05).

      Source |       SS           df       MS      Number of obs   =        50
-------------+----------------------------------   F(3, 46)        =      7.29
       Model |   154551308         3  51517102.7   Prob > F        =    0.0004
    Residual |   325155837        46  7068605.16   R-squared       =    0.3222
-------------+----------------------------------   Adj R-squared   =    0.2780
       Total |   479707145        49  9789941.74   Root MSE        =    2658.7
------------------------------------------------------------------------------netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------stapr_fte | -1.608785 .5061895 -3.18 0.003 -2.627692 -.5898788 stapr_fte2 | .0000543 .0000232 2.33 0.024 7.45e-06 .0001011 pc_income | .1322943 .0533078 2.48 0.017 .0249912 .2395974 _cons | 9744.101 3472.105 2.81 0.007 2755.115 16733.09 -------------------------------------------------------------------------------
We see the adjusted R2 is now 0.278, suggesting the model explains about 28% of the variability in net tuition revenue per FTE student across states in 2016. More importantly, the size of the estimated beta coefficient for state appropriations per FTE student is now −1.61. Because it relies only on cross-sectional data for 2016, this multiple regression model may actually have produced biased estimates of the beta coefficients. With data from only 50 cases (i.e., states), a multiple regression model also limits the number of independent variables that may be included in the model. For example, suppose we include seven independent variables in our model (e.g., by adding the variable reflecting states grouped by regional compacts) while the number of independent units of analysis remains 50. This means that the residual degrees of freedom (the number of observations minus the number of estimated coefficients, including the constant) will be
reduced. Multiple regression models with very low degrees of freedom may result in inefficient estimates of the beta coefficients.
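The squared term stapr_fte2 that appears in this model is assumed to have been created from stapr_fte beforehand; a minimal sketch of how that might look:
* generate the quadratic term for state appropriations per FTE (name from the example)
gen stapr_fte2 = stapr_fte^2
label variable stapr_fte2 "State appropriations per FTE student, squared"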
7.2.4
Multivariate Pooled OLS Regression
If available, data should be used that allow us to overcome possible problems of low degrees of freedom and, consequently, inefficient estimates of beta coefficients. The availability of panel data (discussed in Chap. 4) would enable us to run pooled OLS (POLS) regression models. The following example illustrates this point, where we regress net tuition revenue per FTE student on the same variables shown in the previous example. However, now we use panel data (50 states across 27 years).
reg netuit_fte stapr_fte stapr_fte2 pc_income
The output is:
. reg netuit_fte stapr_fte stapr_fte2 pc_income

      Source |       SS           df       MS      Number of obs   =     1,350
-------------+----------------------------------   F(3, 1346)      =    610.98
       Model |  5.1916e+09         3  1.7305e+09   Prob > F        =    0.0000
    Residual |  3.8124e+09     1,346   2832408.8   R-squared       =    0.5766
-------------+----------------------------------   Adj R-squared   =    0.5756
       Total |  9.0040e+09     1,349  6674588.45   Root MSE        =      1683
------------------------------------------------------------------------------netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------stapr_fte | -1.018307 .0773341 -13.17 0.000 -1.170015 -.8665983 stapr_fte2 | .0000329 4.33e-06 7.60 0.000 .0000244 .0000413 pc_income | .2036221 .0048243 42.21 0.000 .1941581 .2130862 _cons | 2403.068 320.5399 7.50 0.000 1774.256 3031.88 -------------------------------------------------------------------------------
From the results of the POLS, we see the number of observations, at 1350 (50 × 27), is substantially larger than in the previous output. The adjusted R2, at 57.6%, is also greater, while the root MSE is smaller, indicating a better model fit. But more relevant to a higher education policy analyst would be the estimated beta coefficients, specifically for stapr_fte, which is now −1.018. This indicates that while there is still a negative relationship between net tuition revenue per FTE student and state appropriations per FTE student, the magnitude of the estimated beta coefficient is smaller (in absolute value) than when using the 2016 cross-sectional data. The larger number of observations will also enable us to include more independent variables in our POLS regression model without being too concerned about low degrees of freedom. For example, we can now include the categorical variable representing region compacts (region_compact) in
our model. Because region_compact is a categorical or factor variable, we include i.region_compact in the Stata syntax below.
reg netuit_fte stapr_fte stapr_fte2 pc_income i.region_compact
The output is:
. reg netuit_fte stapr_fte stapr_fte2 pc_income i.region_compact

      Source |       SS           df       MS      Number of obs   =     1,350
-------------+----------------------------------   F(7, 1342)      =    329.13
       Model |  5.6898e+09         7   812822803   Prob > F        =    0.0000
    Residual |  3.3143e+09     1,342  2469642.47   R-squared       =    0.6319
-------------+----------------------------------   Adj R-squared   =    0.6300
       Total |  9.0040e+09     1,349  6674588.45   Root MSE        =    1571.5
------------------------------------------------------------------------------netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------------+-------------------------------------------------------------stapr_fte | -1.04053 .0750848 -13.86 0.000 -1.187826 -.8932333 stapr_fte2 | .0000383 4.22e-06 9.09 0.000 .00003 .0000466 pc_income | .1917324 .0047585 40.29 0.000 .1823976 .2010672 | region_compact | SREB | 185.804 194.7014 0.95 0.340 -196.1481 567.7562 WICHE | -957.9857 199.7539 -4.80 0.000 -1349.85 -566.1219 MHEC | 99.67403 197.3705 0.51 0.614 -287.5143 486.8623 NEBHE | 1100.607 215.7601 5.10 0.000 677.3429 1523.87 | _cons | 2712.485 366.2787 7.41 0.000 1993.944 3431.027 -------------------------------------------------------------------------------
We see that controlling for regional compact does not substantially change the estimated beta coefficient for state appropriations per FTE student or for any of the variables. It is worth noting that compared to states that are not members of regional compacts, WICHE states have lower net tuition revenue per FTE student and NEBHE states have higher net tuition revenue per FTE student.
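If we also want a single joint test of whether the regional-compact indicators matter as a group, testparm can follow the regression. This is a short sketch, not part of the original output:
testparm i.region_compact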
7.2.4.1
Multivariate Pooled OLS Regression with Interaction Terms
Because we are using pooled data, we can also include more variables, including interaction terms. Interaction terms are combinations of existing variables. The combination may include the following (see the syntax sketch after this list):
1. two or more categorical variables
2. two or more continuous variables
3. one or more categorical variables with one or more continuous variables
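A compact sketch of the factor-variable syntax for each case, using variable names from this chapter's examples (## adds both the main effects and the interaction; continuous variables must carry the c. prefix):
* 1. categorical x categorical
reg netuit_fte i.region_compact##i.ugradmerit
* 2. continuous x continuous
reg netuit_fte c.stapr_fte##c.state_needFTE
* 3. categorical x continuous
reg netuit_fte i.tuitset##c.stapr_fte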
For an example of 1, we will use regional compact (region_compact) and undergraduate merit aid program (ugradmerit). The double hashtag ## is used to create the interaction term and to include the constituent main effects separately in the regression model. To get a sense of the omitted reference categories, we include allbaselevels as an option.
. reg netuit_fte stapr_fte i.region_compact##i.ugradmerit, allbaselevels

      Source |       SS           df       MS      Number of obs   =     1,350
-------------+----------------------------------   F(10, 1339)     =     33.36
       Model |  1.7957e+09        10   179571843   Prob > F        =    0.0000
    Residual |  7.2083e+09     1,339  5383346.82   R-squared       =    0.1994
-------------+----------------------------------   Adj R-squared   =    0.1935
       Total |  9.0040e+09     1,349  6674588.45   Root MSE        =    2320.2
-----------------------------------------------------------------------------------------netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval] --------------------------+-------------------------------------------------------------stapr_fte | .0275013 .0289854 0.95 0.343 -.0293604 .084363 | region_compact | None | 0 (base) SREB | -4376.048 663.3507 -6.60 0.000 -5677.367 -3074.728 WICHE | -4350.232 610.8338 -7.12 0.000 -5548.527 -3151.937 MHEC | -3356.754 660.639 -5.08 0.000 -4652.754 -2060.754 NEBHE | -917.9704 637.625 -1.44 0.150 -2168.823 332.8823 | ugradmerit | No | 0 (base) Yes | -2149.968 648.7122 -3.31 0.001 -3422.571 -877.3649 | region_compact#ugradmerit | None#No | 0 (base) None#Yes | 0 (base) SREB#No | 0 (base) SREB#Yes | 3477.178 732.8958 4.74 0.000 2039.429 4914.927 WICHE#No | 0 (base) WICHE#Yes | 2837.446 696.7433 4.07 0.000 1470.619 4204.274 MHEC#No | 0 (base) MHEC#Yes | 3084.481 735.2593 4.20 0.000 1642.096 4526.867 NEBHE#No | 0 (base) NEBHE#Yes | 3028.864 743.8004 4.07 0.000 1569.723 4488.005 | _cons | 6658.134 600.8424 11.08 0.000 5479.439 7836.829 ------------------------------------------------------------------------------------------
We can test to see if there is an interaction effect between being a member of a regional compact and having a state merit aid program for undergraduates that explains more variance in net tuition revenue per FTE enrollment. We do so by quietly (qui) running the models and storing (est sto) the model results without (model1) and with the interaction terms (model2). . qui reg netuit_fte stapr_fte i.region_compact . est sto model1 . qui reg netuit_fte stapr_fte i.region_compact##i.ugradmerit . est sto model2
Then we conduct a likelihood-ratio test of the nested models to test whether the difference between the R2 of the main-effects model and the R2 of the interaction model is equal to zero.
. lrtest model1 model2

Likelihood-ratio test                                 LR chi2(4)  =     23.33
(Assumption: model1 nested in model2)                 Prob > chi2 =    0.0001
Because the p value of the test is 0.0001, we can reject the null hypothesis and conclude that the model with the interaction terms helps to explain more variance in net tuition revenue per FTE enrollment. Using the testparm command, the statistical significance of the interaction terms can also be checked.
. testparm i.region_compact#i.ugradmerit

 ( 1)  1.region_compact#1.ugradmerit = 0
 ( 2)  2.region_compact#1.ugradmerit = 0
 ( 3)  3.region_compact#1.ugradmerit = 0
 ( 4)  4.region_compact#1.ugradmerit = 0

       F(  4,  1339) =    5.84
            Prob > F =    0.0001
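A side-by-side comparison of the two stored models can also be displayed; a sketch using the results stored above:
estimates table model1 model2, stats(N r2 r2_a) b(%9.3f) star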
The test results above indicate that the interaction terms as a whole are statistically significant. What if we wanted to investigate if the difference in net tuition revenue per FTE enrollment by tuition-setting authority (i.tuitset) changes with the amount of state appropriations per FTE enrollment? This is an example of number 3 above, where the interaction term is composed of one continuous variable and one categorical variable. The following syntax includes "c.", which indicates state appropriations per FTE enrollment (c.stapr_fte) is a continuous variable.
. reg netuit_fte i.ugradmerit i.region_compact c.stapr_fte##i.tuitset

      Source |       SS           df       MS      Number of obs   =     1,350
-------------+----------------------------------   F(12, 1337)     =     39.02
       Model |  2.3356e+09        12   194633061   Prob > F        =    0.0000
    Residual |  6.6684e+09     1,337  4987601.41   R-squared       =    0.2594
-------------+----------------------------------   Adj R-squared   =    0.2527
       Total |  9.0040e+09     1,349  6674588.45   Root MSE        =    2233.3
------------------------------------------------------------------------------------netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval] --------------------+---------------------------------------------------------------ugradmerit | No | 0 (base) Yes | 561.8776 152.1065 3.69 0.000 263.4843 860.271 |
     region_compact |
               None |          0  (base)
               SREB |  -1500.987   276.8269    -5.42   0.000   -2044.049   -957.9247
              WICHE |  -2130.328   283.3574    -7.52   0.000   -2686.201   -1574.454
               MHEC |  -1020.981   280.5026    -3.64   0.000   -1571.254   -470.7078
              NEBHE |    1018.53   320.9729     3.17   0.002    388.8648    1648.195
                    |
          stapr_fte |   .2060764    .217321     0.95   0.343   -.2202508    .6324036
                    |
            tuitset |
        Legislature |          0  (base)
   State-Wide Board |   10193.37   1678.683     6.07   0.000    6900.234    13486.51
       System Board |   1900.957   1390.018     1.37   0.172   -825.8971     4627.81
             Campus |   3296.063   1411.991     2.33   0.020    526.1053    6066.022
                    |
tuitset#c.stapr_fte |
   State-Wide Board |  -1.310195    .273725    -4.79   0.000   -1.847173   -.7732179
       System Board |  -.1069439   .2198251    -0.49   0.627   -.5381836    .3242958
             Campus |  -.2043566    .224125    -0.91   0.362   -.6440316    .2353184
                    |
              _cons |   1895.032   1400.074     1.35   0.176   -851.5488    4641.613
-------------------------------------------------------------------------------------
. testparm c.stapr_fte#i.tuitset

 ( 1)  2.tuitset#c.stapr_fte = 0
 ( 2)  3.tuitset#c.stapr_fte = 0
 ( 3)  4.tuitset#c.stapr_fte = 0

       F(  3,  1337) =   17.31
            Prob > F =    0.0000
The results of the regression model show that the difference in net tuition revenue per FTE enrollment by tuition-setting authority (specifically for state-wide boards compared to the reference category, the legislature) declines with increases in state appropriations per FTE enrollment. The results of the post-estimation test indicate that the interaction terms are statistically significant. What if we wanted to find out how the relationship between net tuition revenue per FTE enrollment and state appropriations changes as the amount of state total need-based financial aid (state_needFTE) changes? In that case, the regression model should include an interaction term that is composed of two continuous variables, as shown in the output below.
. reg netuit_fte i.region_compact c.stapr_fte##c.state_needFTE

      Source |       SS           df       MS      Number of obs   =     1,350
-------------+----------------------------------   F(7, 1342)      =     50.88
       Model |  1.8885e+09         7   269778857   Prob > F        =    0.0000
    Residual |  7.1156e+09     1,342  5302211.49   R-squared       =    0.2097
-------------+----------------------------------   Adj R-squared   =    0.2056
       Total |  9.0040e+09     1,349  6674588.45   Root MSE        =    2302.7
--------------------------------------------------------------------------------------------netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval] ----------------------------+---------------------------------------------------------------region_compact | None | 0 (base)
                       SREB |  -729.9702    309.623    -2.36   0.019   -1337.368   -122.5725
                      WICHE |  -1546.882   316.8258    -4.88   0.000    -2168.41   -925.3543
                       MHEC |  -187.8332   309.1457    -0.61   0.544   -794.2946    418.6282
                      NEBHE |    1762.95   327.8046     5.38   0.000    1119.885    2406.016
                            |
                  stapr_fte |   .1285269   .0364127     3.53   0.000     .057095    .1999588
              state_needFTE |   3.372105   .4932087     6.84   0.000    2.404562    4.339649
                            |
c.stapr_fte#c.state_needFTE |  -.0003921    .000074    -5.30   0.000   -.0005372    -.000247
                            |
                      _cons |   3349.423   373.6221     8.96   0.000    2616.476     4082.37
---------------------------------------------------------------------------------------------
The results above indicate that the relationship between net tuition revenue per FTE enrollment and state appropriations per FTE enrollment depends on the interaction term that combines state appropriations and total need-based aid. Equivalently, the relationship between net tuition revenue per FTE enrollment and state total need-based aid changes as state appropriations per FTE enrollment changes. (This would be the case even if state appropriations per FTE enrollment by itself were not statistically significant.) The interpretation of the results of a regression with an interaction term that is composed of two continuous variables is facilitated with the use of the margins and marginsplot post-estimation commands. To restrict some of the output, we include the vsquish option.
. margins, dydx(stapr_fte) at(state_needFTE=(0(3000)10000)) vsquish

Average marginal effects                        Number of obs     =      1,350
Model VCE    : OLS
Expression   : Linear prediction, predict()
dy/dx w.r.t. : stapr_fte
1._at        : state_needFTE   =           0
2._at        : state_needFTE   =        3000
3._at        : state_needFTE   =        6000
4._at        : state_needFTE   =        9000
-----------------------------------------------------------------------------| Delta-method | dy/dx Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------stapr_fte | _at | 1 | .1285269 .0364127 3.53 0.000 .057095 .1999588 2 | -1.047798 .2012571 -5.21 0.000 -1.442611 -.6529854 3 | -2.224123 .4220569 -5.27 0.000 -3.052086 -1.39616 4 | -3.400448 .6435904 -5.28 0.000 -4.663001 -2.137895 ------------------------------------------------------------------------------
The value in the margins command indicates the amount of change in net tuition revenue per FTE enrollment with a one unit (i.e., one dollar) change in state appropriations per FTE enrollment at different values of state total need-based aid per FTE enrollment. In our example, we are holding
state total need-based aid per FTE enrollment at $0, $3000, $6000, $9000. The output above indicates that state appropriations per FTE enrollment is statistically significant for all of those values of state total need-based aid per FTE enrollment. We can show this relationship changes at each value in a graph by entering the following syntax. . qui margins, at(stapr_fte=(0 10000) state_needFTE=(0(3000)10000)) vsquish . . marginsplot, noci x(stapr_fte) recast(line) xlabel(0(3000)10000)
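If the figure is needed for a report, it can be saved right after marginsplot; a sketch with a hypothetical file name:
graph export "fig7-1-margins.png", width(1200) replace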
We can see from Fig. 7.1 that where there is no state need-based financial aid per FTE enrollment, the relationship between net tuition revenue per FTE enrollment and state appropriations per FTE enrollment is slightly positive. But as the amount of state need-based financial aid per FTE enrollment increases, the relationship between net tuition revenue per FTE enrollment and state appropriations per FTE enrollment becomes increasingly negative. This suggests that as states increase their funding directly to students, net tuition revenue to institutions declines more rapidly in response to higher amounts of state appropriations. But it is possible that the estimated beta coefficients in this POLS regression are biased due to violations of one or more of the seven classical OLS assumptions presented in Sect. 7.2.1. More specifically, we can and should check to see whether some of the assumptions have been violated by performing post-estimation diagnostics. One such diagnostic is a residual-
Fig. 7.1 Predictive margins of net tuition revenue per FTE by state need-based aid per FTE
versus-fitted plot that can be created immediately after running the regression by simply typing the Stata command syntax, rvfplot. This command graphs the following plot. We can see from Fig. 7.2 that the residuals are more dispersed in the middle of the graph than at the right and left. This indicates there is a violation of the assumption that the error term (ε) has a constant variance or of homoscedasticity. Additionally, it is quite possible that the errors are not normally distributed. So a comprehensive post-estimation test should be conducted to detect if, in addition to the assumption of violation of normally distributed errors, there is also heteroscedasticity. This is done by typing the Stata command syntax estat imtest, which produces the following output: . estat imtest Cameron & Trivedi’s decomposition of IM-test --------------------------------------------------Source | chi2 df p ---------------------+----------------------------Heteroskedasticity | 189.76 24 0.0000 Skewness | 63.95 7 0.0000 Kurtosis | 9.56 1 0.0020 ---------------------+----------------------------Total | 263.27 32 0.0000 --------------------------------------------------The p values indicate the assumptions of homoscedasticity and normally distributed errors have been violated. To take into account these two violations of assumptions, we should rerun our POLS regression model using the robust option. . reg netuit_fte stapr_fte stapr_fte2 pc_income i.region_compact, robust Linear regression
Linear regression                               Number of obs     =      1,350
                                                F(7, 1342)        =     249.37
                                                Prob > F          =     0.0000
                                                R-squared         =     0.6319
                                                Root MSE          =     1571.5
------------------------------------------------------------------------------
               |               Robust
    netuit_fte |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------+--------------------------------------------------------------
     stapr_fte |   -1.04053   .0721036   -14.43   0.000    -1.181978   -.8990817
    stapr_fte2 |   .0000383   4.03e-06     9.51   0.000     .0000304    .0000462
     pc_income |   .1917324   .0060699    31.59   0.000     .1798248      .20364
               |
region_compact |
          SREB |    185.804   199.3342     0.93   0.351    -205.2365    576.8446
         WICHE |  -957.9857   180.9863    -5.29   0.000    -1313.033   -602.9389
          MHEC |   99.67403   178.2546     0.56   0.576     -250.014     449.362
         NEBHE |   1100.607   221.0546     4.98   0.000     666.9565    1534.257
               |
         _cons |   2712.485   353.6409     7.67   0.000     2018.736    3406.235
------------------------------------------------------------------------------

Fig. 7.2 Residual-versus-fitted plot (rvfplot)
From this output we see the estimated beta coefficients are the same but some of the standard errors have changed. But it is also possible that the variability of the dependent variable is unequal across groups, that is, there is group-wise heteroscedasticity. In other words, observations of net tuition revenue per FTE student within each state may not be independent, so the residuals may be correlated within states. To detect group-wise heteroscedasticity, another test should be conducted. This test, which is robust to non-normality, is the Levene test of homogeneity of variances and is conducted in the following steps.4

quietly: reg netuit_fte stapr_fte stapr_fte2 pc_income i.region_compact
predict double eps, residual
robvar eps, by(state)

. robvar eps, by(state)
4 For a full description of the Levene test, see Levene, H. (1960). Robust tests for equality of variances. In I. Olkin, S. G. Ghurye, W. Hoeffding, W. G. Madow, & H. B. Mann (Eds.), Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 278–292). Stanford University Press.
State | abbreviatio | Summary of Residuals n | Mean Std. Dev. Freq. ------------+-----------------------------------AK | 1122.2639 939.42995 27 AL | 1616.6929 1956.758 27 AR | 162.56587 334.56975 27 AZ | -137.19617 477.91889 27 CA | -2498.2149 1121.2912 27 CO | -90.322544 701.35696 27 CT | -845.30401 699.56268 27 DE | 4531.8731 3107.7021 27 FL | -2373.7873 730.65775 27 GA | -810.79875 723.35512 27 HI | 1297.0811 471.53693 27 IA | 1141.2379 504.31709 27 ID | 468.24506 360.61989 27 IL | -1369.4974 1150.3627 27 IN | 1404.8174 1242.093 27 KS | -933.50045 366.96024 27 KY | 623.90684 721.21209 27 LA | -1011.2335 521.60785 27 MA | -2230.438 826.23532 27 MD | -752.82225 415.40112 27 ME | 873.48914 1048.5964 27 MI | 1873.1069 1620.3943 27 MN | -134.58677 664.61368 27 MO | -865.87429 391.30801 27 MS | 504.26361 703.23813 27 MT | 483.33723 393.45575 27 NC | -630.84519 306.35169 27 ND | 227.6154 755.3359 27 NE | -662.25153 372.39504 27 NH | -1012.948 760.14257 27 NJ | 223.57714 440.59753 27 NM | 254.49388 293.12376 27 NV | -464.44503 327.67539 27 NY | -1739.7618 502.25295 27 OH | 385.59024 496.92629 27 OK | -845.73432 563.76589 27 OR | 500.09117 664.07122 27 PA | 1516.1847 582.46832 27 RI | -78.700309 720.27019 27 SC | 878.38713 769.57448 27 SD | 172.33711 638.97413 27 TN | -197.12965 483.98213 27 TX | -1033.8177 637.22766 27 UT | 680.8684 424.88764 27 VA | -1089.88 508.53689 27 VT | 3293.9012 1334.3393 27 WA | -1094.1928 523.50132 27 WI | -1238.9945 541.81062 27
         WV |   428.35934    789.9017          27
         WY |   -522.0092   1628.6087          27
------------+------------------------------------
      Total |  -1.169e-13    1567.427       1,350

W0  = 21.243149   df(49, 1300)   Pr > F = 0.00000000
W50 = 10.597663   df(49, 1300)   Pr > F = 0.00000000
W10 = 19.183198   df(49, 1300)   Pr > F = 0.00000000
The output above shows that Delaware (DE), Alabama (AL), Wyoming (WY), Michigan (MI), and Vermont (VT) have very large standard deviations, which suggests they are outliers. But more relevant to the Levene test, the p values of the W0, W50, and W10 statistics all indicate that the null hypothesis of equal variances is rejected. This strongly suggests there is group-wise heteroscedasticity. To address this particular violation of the assumption of homoscedasticity, we use the cluster option, with state as the cluster variable, in our POLS regression model.

. reg netuit_fte stapr_fte stapr_fte2 pc_income i.region_compact, cluster(state)
Linear regression                               Number of obs     =      1,350
                                                F(7, 49)          =      34.92
                                                Prob > F          =     0.0000
                                                R-squared         =     0.6319
                                                Root MSE          =     1571.5
(Std. Err. adjusted for 50 clusters in state) -----------------------------------------------------------------------------| Robust netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------------+-------------------------------------------------------------stapr_fte | -1.04053 .2449388 -4.25 0.000 -1.532753 -.5483067 stapr_fte2 | .0000383 .0000122 3.13 0.003 .0000137 .0000629 pc_income | .1917324 .0167239 11.46 0.000 .1581245 .2253404 | region_compact | SREB | 185.804 915.4101 0.20 0.840 -1653.781 2025.39 WICHE | -957.9857 851.9214 -1.12 0.266 -2669.986 754.0144 MHEC | 99.67403 841.7089 0.12 0.906 -1591.803 1791.151 NEBHE | 1100.607 1053.032 1.05 0.301 -1015.539 3216.753 | _cons | 2712.485 1294.855 2.09 0.041 110.3759 5314.595 ------------------------------------------------------------------------------
Compared to the previous regression model with the robust option, this model produces different results with respect to the statistical significance of the categorical variables reflecting regional compacts. In this model, net tuition revenue per FTE student is not related to state membership in regional compacts. Using this example, we can see that not taking into account the clustered nature of the residuals with respect to the states results in making false claims about the statistical significance of certain variables or Type I errors (rejection of true null hypotheses). Therefore,
when employing POLS regression models we should always test for group-wise heteroscedasticity and, if called for, use standard errors that relax the assumption of intragroup independence.
7.3 Weighted Least Squares and Feasible Generalized Least Squares Regression
When the assumption of homoscedasticity is violated and the variance of the dependent variable is known, we can use weighted least squares (WLS). When the form of heteroscedasticity is known, we can employ feasible generalized least squares (FGLS). The use of WLS, however, requires a great deal of judgment on the part of the analyst regarding the weight that should be used. Consequently, there may be different results across analysts estimating the same regression model with the same variables but with different weights. According to Hoechle (2007), FGLS regression models are inappropriate for data in which the number of time periods (T) (e.g., years) is less than the number of panels (m) (e.g., states). In higher education policy research, T < m is the rule rather than the exception. Additionally, both WLS and FGLS multivariate regression models may also be inappropriate statistical methods when using what econometricians refer to as microeconometric (T < m) panel data in an effort to take into account unobserved heterogeneity.
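As an aside not taken from the text, the sketch below shows one way these estimators might be invoked in Stata. The weighting variable fte_weight is hypothetical, and the xtgls line assumes the data have already been declared as a panel with xtset.

* WLS: weight observations by a variable assumed proportional to the inverse of
* the error variance (fte_weight is a hypothetical weighting variable)
regress netuit_fte stapr_fte stapr_fte2 pc_income [aweight = fte_weight]

* FGLS: allow a panel-specific (heteroskedastic) error variance; as noted above,
* this is questionable when the number of years is smaller than the number of panels
xtgls netuit_fte stapr_fte stapr_fte2 pc_income, panels(heteroskedastic)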
7.4 Fixed-Effects Regression
Even with the use of cluster-robust standard errors, multivariate POLS regression models may be limited by their inability to take into account unobserved differences, or heterogeneity, between units of analysis (e.g., students, institutions, states). This takes us to a discussion of unobserved heterogeneity and fixed-effects regression models.
7.4.1 Unobserved Heterogeneity and Fixed-Effects Dummy Variable (FEDV) Regression
When conducting higher education policy research, unobserved heterogeneity may influence findings. For example, state culture with regard to higher education (which we may not be able to observe) may influence the extent to which states allow public higher education institutions to be funded by tuition revenue. With respect to a regression model, unobserved heterogeneity enters the equation below through the error term, which is composed of ui (unobserved characteristics of a group or entity such as an institution, state, or country) and εit (the idiosyncratic residual):

\hat{Y}_{it} = \beta_0 + \beta_1 X_{1it} + \beta_2 X_{2it} + \cdots + \beta_n X_{nit} + u_{it} + \varepsilon_{it} \qquad (7.6)

In Eq. (7.6), we include uit, which is constant or fixed over a reasonable amount of time, that is, a time-invariant group effect (e.g., institutional culture, state culture, national identity) that can be represented by "dummy" variables in a multivariate regression model. Therefore, Eq. (7.6) can be expanded and rewritten as a regression model containing dummy variables for the number of units or groups (N) minus one (N − 1):

\hat{Y}_{it} = \beta_0 + \beta_1 X_{1it} + \beta_2 X_{2it} + \cdots + \beta_n X_{nit} + \alpha_1 D_{2t} + \alpha_2 D_{3t} + \cdots + \alpha_{N-1} D_{Nt} + u_{it} + \varepsilon_{it} \qquad (7.7)

where each α is the estimated coefficient for the respective dummy variable (D). Equation (7.7) excludes the first dummy variable (D1), which is the reference group. Applying the above equation to a state-level panel dataset, αi is a state fixed-effect because the "effect" of state i is "fixed" across all years. In Eq. (7.7), each α represents a different state fixed-effect, while β1 . . . βn are the same for all states.
7.4.2 Estimating FEDV Multivariate POLS Regression Models
Using the panel data from the example above and dummy variables, we show how state fixed-effects can be taken into account by adding i.stateid to the multivariate POLS regression model (without the regional compact categorical variable) above. . reg netuit_fte stapr_fte stapr_fte2 pc_income i.stateid, cluster(state) Linear regression
                                                Number of obs     =      1,350
                                                F(2, 49)          =          .
                                                Prob > F          =          .
                                                R-squared         =     0.8989
                                                Root MSE          =      837.7
(Std. Err. adjusted for 50 clusters in state) ------------------------------------------------------------------------------| Robust netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval] ----------------+------------------------------------------------------------stapr_fte | -.5639993 .1274783 -4.42 0.000 -.8201766 -.3078221 stapr_fte2 | 9.09e-06 7.15e-06 1.27 0.210 -5.29e-06 .0000235 pc_income | .2109376 .0114659 18.40 0.000 .1878961 .2339792 | stateid | Alaska | -864.7734 504.8033 -1.71 0.093 -1879.214 149.6669 Arizona | -2570.101 179.2046 -14.34 0.000 -2930.226 -2209.976 Arkansas | -1487.844 35.80258 -41.56 0.000 -1559.792 -1415.896 California | -5408.939 120.4583 -44.90 0.000 -5651.009 -5166.87 Colorado | -2641.914 231.0978 -11.43 0.000 -3106.323 -2177.506 Connecticut | -1804.824 263.9297 -6.84 0.000 -2335.21 -1274.437 Delaware | 2795.367 112.9649 24.75 0.000 2568.355 3022.378 Florida | -4036.336 82.02393 -49.21 0.000 -4201.169 -3871.502 Georgia | -2552.067 74.2346 -34.38 0.000 -2701.247 -2402.887 Hawaii | -1178.139 376.8727 -3.13 0.003 -1935.493 -420.7845 Idaho | -2360.993 45.92879 -51.41 0.000 -2453.29 -2268.696 Illinois | -3203.982 83.94839 -38.17 0.000 -3372.683 -3035.282 Indiana | -323.1194 58.14713 -5.56 0.000 -439.9704 -206.2683 Iowa | -661.722 48.34359 -13.69 0.000 -758.8721 -564.5719 Kansas | -2656.893 97.87067 -27.15 0.000 -2853.571 -2460.214 Kentucky | -1039.455 33.75404 -30.79 0.000 -1107.286 -971.6237 Louisiana | -2613.661 26.13406 -100.01 0.000 -2666.179 -2561.143 Maine | 45.72522 35.23359 1.30 0.200 -25.07933 116.5298 Maryland | -2589.094 157.934 -16.39 0.000 -2906.474 -2271.714 Massachusetts | -3270.657 154.2259 -21.21 0.000 -3580.585 -2960.728 Michigan | 290.352 136.2497 2.13 0.038 16.54801 564.1559 Minnesota | -2032.941 87.92352 -23.12 0.000 -2209.63 -1856.252 Mississippi | -1056.209 33.40168 -31.62 0.000 -1123.332 -989.0861 Missouri | -2468.3 120.12 -20.55 0.000 -2709.69 -2226.91 Montana | -2046.337 137.3619 -14.90 0.000 -2322.376 -1770.298 Nebraska | -2522.536 59.83506 -42.16 0.000 -2642.779 -2402.293 Nevada | -3389.331 60.91085 -55.64 0.000 -3511.736 -3266.926 New Hampshire | -1343.527 313.4941 -4.29 0.000 -1973.517 -713.5372 New Jersey | -1970.144 153.7648 -12.81 0.000 -2279.146 -1661.142 New Mexico | -2521.355 101.9705 -24.73 0.000 -2726.273 -2316.438 New York | -3830.118 132.7709 -28.85 0.000 -4096.931 -3563.305 North Carolina | -2394.794 85.28162 -28.08 0.000 -2566.174 -2223.415 North Dakota | -1529.233 55.4164 -27.60 0.000 -1640.596 -1417.87 Ohio | -1248.71 117.3901 -10.64 0.000 -1484.615 -1012.806 Oklahoma | -2539.597 20.97655 -121.07 0.000 -2581.751 -2497.443 Oregon | -2087.886 153.6461 -13.59 0.000 -2396.649 -1779.122 Pennsylvania | -235.4195 158.476 -1.49 0.144 -553.8891 83.05 Rhode Island | -794.3782 137.5683 -5.77 0.000 -1070.832 -517.9245
 South Carolina |  -667.1569   58.70243   -11.37   0.000    -785.1238   -549.1899
   South Dakota |  -1501.664   108.4995   -13.84   0.000    -1719.702   -1283.626
      Tennessee |  -1878.185   23.56557   -79.70   0.000    -1925.541   -1830.828
          Texas |  -2752.335   49.84661   -55.22   0.000    -2852.506   -2652.165
           Utah |  -2040.838   38.52044   -52.98   0.000    -2118.248   -1963.428
        Vermont |   2953.999   246.3382    11.99   0.000     2458.964    3449.034
       Virginia |  -2760.425   161.7475   -17.07   0.000    -3085.469   -2435.381
     Washington |  -4003.913   114.8606   -34.86   0.000    -4234.734   -3773.092
  West Virginia |  -1083.476   46.69777   -23.20   0.000    -1177.319   -989.6333
      Wisconsin |  -2920.358   121.1009   -24.12   0.000    -3163.719   -2676.996
        Wyoming |  -3131.875   236.0225   -13.27   0.000     -3606.18    -2657.57
                |
          _cons |    2177.63   574.5532     3.79   0.000     1023.023    3332.238
-------------------------------------------------------------------------------
Before we interpret the results of the above output, we should determine whether the state fixed-effects as a whole are statistically significant. Immediately after we run the above regression, we do this by typing the following:

testparm i.stateid

We see output that looks like this:

. testparm i.stateid

 ( 1)  2.stateid = 0
 ( 2)  3.stateid = 0
 ( 3)  4.stateid = 0
 ( 4)  5.stateid = 0
 ( 5)  6.stateid = 0
 ( 6)  7.stateid = 0
 ( 7)  8.stateid = 0
 ( 8)  9.stateid = 0
 ( 9)  10.stateid = 0

[omitted output]

 (45)  46.stateid = 0
 (46)  47.stateid = 0
 (47)  48.stateid = 0
 (48)  49.stateid = 0
 (49)  50.stateid = 0

       F(  3,    49) =    30.34
            Prob > F =    0.0000
We reject the null that the coefficients for all 49 state dummy variables are jointly equal to zero. Therefore, state fixed-effects can be retained in the regression model. We see from the output that every state except the first state, Alabama, was included in the regression results. Compared to Alabama, net tuition revenue per FTE student is lower in every state
except Delaware, Maine (no statistically significant difference), Michigan, and Vermont. Compared to the multivariate POLS regression model without state fixed-effects, the beta coefficient for state appropriations per FTE student is also statistically significant but substantially smaller at −0.564. (We also see that the squared term of state appropriations per FTE student is no longer statistically significant.) Many analysts (particularly economists) view dummy variables as "nuisance" variables that are not discussed when presented in studies. Therefore, it may not be necessary to show the estimated beta coefficients of the dummy variables reflecting group (e.g., state) fixed-effects. In most instances, we can simply indicate (e.g., "Yes") that state or institution fixed-effects have been included in a POLS regression model that has been fitted to panel data. By adding the dummy variables (Ds) for each state, we are estimating the pure effects of the independent variables (Xs); each dummy variable (D) absorbs the effects particular to its state. This concept is the basis for an alternative approach that produces the same results as those shown above by using the following Stata syntax:

areg netuit_fte stapr_fte stapr_fte2 pc_income, cluster(stateid) absorb(stateid)
The resulting output is:

. areg netuit_fte stapr_fte stapr_fte2 pc_income, cluster(stateid) absorb(stateid)

Linear regression, absorbing indicators         Number of obs     =      1,350
Absorbed variable: stateid                      No. of categories =         50
                                                F(   3,      49)  =     118.57
                                                Prob > F          =     0.0000
                                                R-squared         =     0.8989
                                                Adj R-squared     =     0.8949
                                                Root MSE          =   837.7043
(Std. Err. adjusted for 50 clusters in stateid) -----------------------------------------------------------------------------| Robust netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------stapr_fte | -.5639993 .1274783 -4.42 0.000 -.8201766 -.3078221 stapr_fte2 | 9.09e-06 7.15e-06 1.27 0.210 -5.29e-06 .0000235 pc_income | .2109376 .0114659 18.40 0.000 .1878961 .2339792 _cons | 339.0282 574.4332 0.59 0.558 -815.3384 1493.395 ------------------------------------------------------------------------------
Minus the estimated beta coefficients for the 49 states, the results are exactly the same as in the previous output. This option is very useful when running a FEDV multivariate POLS regression model with many units or groups (e.g., institutions). For example, suppose we are conducting a study of how education and general (E&G) expenditures across 220 public master's colleges and universities (over 10 years) are related to state appropriations (controlling
for other variables) using a FEDV multivariate POLS regression model. Clearly, it would be more efficient to use this (areg) option than to include 219 dummy variables in the regression model. This is shown below:

cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 7\Stata files"
use "Example 7.1.dta"
areg eg statea tuition totfteiarep ftfac ptfac D, cluster(opeid5_new) absorb(opeid5_new)
The output is as follows:

. areg eg statea tuition totfteiarep ftfac ptfac D, cluster(opeid5_new) absorb(opeid5_new)

Linear regression, absorbing indicators         Number of obs     =      1,978
Absorbed variable: opeid5_new                   No. of categories =        220
                                                F(   6,     219)  =     221.65
                                                Prob > F          =     0.0000
                                                R-squared         =     0.9714
                                                Adj R-squared     =     0.9677
                                                Root MSE          =  9.127e+06
(Std. Err. adjusted for 220 clusters in opeid5_new) -----------------------------------------------------------------------------| Robust eg | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------statea | .6341792 .0779977 8.13 0.000 .480457 .7879014 tuition | 1.193035 .0634888 18.79 0.000 1.067908 1.318162 totfteiarep | 1167.479 1007.936 1.16 0.248 -819.0184 3153.976 ftfac | 27583.39 29489.68 0.94 0.351 -30536.51 85703.28 ptfac | 7026.641 14691.85 0.48 0.633 -21928.88 35982.16 D | -5818460 2680039 -2.17 0.031 -1.11e+07 -536491.5 _cons | -1.93e+07 7709459 -2.50 0.013 -3.45e+07 -4108069 ------------------------------------------------------------------------------
(Note: eg = education and general expenditures; statea = state appropriations; tuition = tuition revenue; totfteiarep = total FTE students per IPEDS report; ftfac = full-time faculty; ptfac = part-time faculty; D = whether the institution confers doctoral degrees (0 = no/1 = yes).)
7.4.2.1 Unobserved Heterogeneity and Within-Group Estimator Fixed-Effects Regression
While the use of the Stata command areg enables us to run a FEDV regression model that takes into account unobserved time-invariant heterogeneity, the command xtreg allows us to do the same via the within-group estimator. The within-group estimator involves the indirect use of the between-effects model, which regresses the group mean of the dependent variable on the group means of the independent variables, as reflected in Eq. (7.8). The within-group estimator fixed-effects regression is obtained by subtracting Eq. (7.8) from Eq. (7.6).
\bar{Y}_i = \beta_1 \bar{X}_{1i} + \beta_2 \bar{X}_{2i} + \cdots + \beta_n \bar{X}_{ni} + \mu_i + \bar{\varepsilon}_i \qquad (7.8)
The result of this subtraction, also known as "time demeaning" the data, is the disappearance of the ui term, or time-invariant unobserved heterogeneity. In Stata, this is equivalent to using the xtreg command with the fe option.

. xtreg eg statea tuition totfteiarep ftfac ptfac, fe cluster(opeid5_new)

Fixed-effects (within) regression               Number of obs     =      1,978
Group variable: opeid5_new                      Number of groups  =        220

R-sq:                                           Obs per group:
     within  = 0.7784                                         min =          2
     between = 0.9312                                         avg =        9.0
     overall = 0.9011                                         max =         10

                                                F(5,219)          =     284.17
corr(u_i, Xb)  = -0.7836                        Prob > F          =     0.0000
(Std. Err. adjusted for 220 clusters in opeid5_new) -----------------------------------------------------------------------------| Robust eg | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------statea | .6359503 .0721045 8.82 0.000 .4938427 .7780578 tuition | 1.20119 .0598787 20.06 0.000 1.083178 1.319202 totfteiarep | 1050.312 1031.059 1.02 0.309 -981.7563 3082.38 ftfac | 32819.51 31076.35 1.06 0.292 -28427.49 94066.5 ptfac | 7375.78 14858.29 0.50 0.620 -21907.76 36659.32 _cons | -2.51e+07 6197457 -4.05 0.000 -3.73e+07 -1.29e+07 -------------+---------------------------------------------------------------sigma_u | 21316368 sigma_e | 9198872.8 rho | .84300893 (fraction of variance due to u_i) ------------------------------------------------------------------------------
While the output shows that the estimated beta coefficients are the same as those produced by the FEDV POLS regression model with dummy variables using the areg command, the above output provides more information. First, it shows the within R2, the between R2, and the overall R2. The within R2 measures how much of the variation in the dependent variable within groups (e.g., institutions) is explained over time by the regression model. The between R2 measures how much of the variation in the dependent variable between groups is captured by the model. The overall R2 is a weighted average of the within R2 and the between R2. In some cases, the within R2 will be higher than the between R2, and in other cases the reverse may hold true. Because most higher education policy research is more concerned with the importance (i.e., the statistical significance of beta coefficients) of policy-oriented variables, there is less focus on the R2s. Second, information is provided about the time-invariant group-specific error term (μi) and the idiosyncratic error term (εi). The sigma_u is the
standard deviation of μi and sigma_e is the standard deviation of εi. The rho (fraction of variance due to u_i) indicates the proportion of the unexplained variance that is due to unobserved time-invariant group heterogeneity. In the above example, we see that about 84% of the unexplained variance is due to unobserved time-invariant group heterogeneity. Third, the output provides information on the number of observations per group. The example above shows that the panel dataset is unbalanced, with a minimum of two observations per group and a maximum of ten observations per group. Fourth, corr(u_i, Xb) shows the correlation between unobserved time-invariant group heterogeneity and the independent variables in the fixed-effects regression model. The output above indicates there is a sizable negative correlation (−0.7836) between the unobserved time-invariant group heterogeneity and the independent variables. When using a fixed-effects model, any such correlation is acceptable. It is not acceptable when using a random-effects model (discussed below), which assumes no correlation between unobserved time-invariant group heterogeneity and the independent variables.
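The reported rho can be verified directly from sigma_u and sigma_e in the output above (a worked calculation, not shown in the Stata output):

\rho = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2} = \frac{21316368^2}{21316368^2 + 9198872.8^2} \approx 0.843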
7.4.2.2 Limitations of Fixed-Effects Regression Models
While they take into account unobserved time-invariant group heterogeneity, fixed-effects regression models cannot include an observed time-invariant group variable. For example, a fixed-effects regression model cannot estimate the effect of the variable reflecting membership in a regional compact, which does not vary over time.
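One quick way to see this limitation (a sketch, assuming the state panel has already been declared with xtset stateid year) is to add the regional compact variable to a within-group model; Stata drops it because it is collinear with the state fixed-effects:

xtreg netuit_fte stapr_fte stapr_fte2 pc_income i.region_compact, fe cluster(stateid)
* the i.region_compact indicators would be reported as omitted because membership
* in a regional compact does not vary within states over time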
7.4.3 Fixed-Effects Regression and Difference-in-Differences
Like a pooled OLS or random-effects regression model, a fixed-effects regression model does not by itself support causal inference. In our example above, we cannot conclude that a change in any of the independent variables "causes" the dependent variable (E&G expenditures) to change. In order to infer causation, a difference-in-differences (DiD) estimator has to be included in a fixed-effects regression model. In general, the DiD estimator takes the difference in average outcomes for a treated group (e.g., students, institutions, states) compared to an untreated comparison or control group before and after the treatment.
7.4.3.1 The DiD Estimator
Drawing heavily from Furquim et al. (2020) and using their notation, the DiD estimator is based on the following:

\delta_{DiD} = \left( \bar{Y}_1^T - \bar{Y}_0^T \right) - \left( \bar{Y}_1^C - \bar{Y}_0^C \right) \qquad (7.9)

where Ȳ is the average outcome, T is the treated group, C is the control group, 0 is before treatment, and 1 is after treatment. In a regression model, this is represented as:

Y_{it} = \alpha + \beta T_i + \gamma P_t + \delta_{DiD} (T_i \times P_t) + \theta Con + \varepsilon_{it} \qquad (7.10)

where Yit is the treatment outcome for group i in year t, Ti is a binary variable indicating treatment status (treated group = 1 and untreated group = 0), Pt is a binary variable indicating the time periods t in which the treatment is in effect, and Con is a vector of control variables. In Eq. (7.10), the interaction of T and P takes a value of 1 for all observations of groups in the treatment group in the treatment and post-treatment time periods. The treatment effect is δDiD, or the "average" treatment effect.5 To demonstrate the use of a fixed-effects regression-based DiD model, we will refer to an actual higher education policy change that occurred in a state. Using a fixed-effects regression-based DiD model and the relevant data, we will show how a question about that policy's effect can be addressed.
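As a stylized illustration of Eq. (7.9), with purely hypothetical numbers rather than figures from the example that follows: if the treated group's average outcome rises from 5,000 to 6,000 between the pre- and post-treatment periods, while the control group's average rises from 5,200 to 5,700, then

\delta_{DiD} = (6{,}000 - 5{,}000) - (5{,}700 - 5{,}200) = 1{,}000 - 500 = 500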
7.4.3.2 Fixed-Effects Regression-Based DiD: An Example
In 2004, Colorado enacted Senate Bill 189 (SB 04-189) to establish the College Opportunity Fund (COF) program. Starting in 2005, the COF-designated higher education institutions no longer received state appropriations. Instead, funding was provided to resident undergraduate students in the form of a stipend to help pay their tuition. The legislation also required that 20% of increased resident tuition be set aside for financial aid. This suggested that net tuition should not increase substantially. If Colorado state policymakers ask whether COF had an effect on net tuition revenue, then a fixed-effects regression-based DiD model is an appropriate technique that analysts can use to address this question.

use "Example 7.1.dta", clear
5 For an excellent comprehensive description, discussion, and example of regression-based DiD techniques, see Furquim et al. (2020).
We create the treatment variable (T).

gen T=0
replace T=1 if state=="CO"

The post-treatment variable (P) is then created.

gen P=0
replace P=1 if year>=2004

Based on every state other than the treatment state (Colorado), we create the first control group.

gen C1 = 0
replace C1=1 if state !="CO"

Based on every state that is a member of the Western Interstate Commission for Higher Education (WICHE) other than the treatment state (Colorado), we create a second control group.

gen C2 = 0
replace C2=1 if state !="CO" & region_compact==2

In order to avoid additional keystrokes, we use the Stata global command to create temporary variables reflecting the dependent variable, net tuition revenue per FTE enrollment (y),

global y "netuit_fte"

and the set of control variables, state appropriations to higher education per FTE enrollment (stapr_fte) and state per capita income (pc_income).

global controls "stapr_fte pc_income"

We run a DiD regression model that includes the controls, year dummy variables (i.year), and state dummy variables (i.fips). The model is run covering the year 2000 to the most recently available year and the states in the treatment group or first control group. To take heteroscedasticity into account, we include robust (rob) as an option in the syntax.

reg $y i.T i.P T#P $controls i.year i.fips if year>=2000 & (C1==1 | T==1), rob

. reg $y i.T i.P T#P $controls i.year i.fips if year>=2000 & (C1==1 | T==1), rob
note: 2016.year omitted because of collinearity
note: 8.fips omitted because of collinearity

Linear regression                               Number of obs     =        850
                                                F(68, 781)        =     192.85
                                                Prob > F          =     0.0000
                                                R-squared         =     0.9393
                                                Root MSE          =     685.82
------------------------------------------------------------------------------
| Robust netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------1.T | -1111.343 365.6843 -3.04 0.002 -1829.184 -393.5028 1.P | 4634.093 427.505 10.84 0.000 3794.899 5473.288 | T#P | 1 1 | 501.2044 202.7361 2.47 0.014 103.2322 899.1765 | stapr_fte | -.1933747 .0320378 -6.04 0.000 -.2562652 -.1304842 pc_income | .0001359 .0198814 0.01 0.995 -.0388913 .0391632 | year | 2001 | 219.4993 172.0082 1.28 0.202 -118.1539 557.1525
[omitted output]

        fips |
          2  |  -165.3607   397.6397    -0.42   0.678    -945.9298    615.2084

[omitted output]
             |
       _cons |    5565.35    539.019    10.32   0.000     4507.252    6623.447
------------------------------------------------------------------------------
We see from the output above that the DiD coefficient (δDiD) is positive and statistically significant (beta = 501, p < 0.05). This suggests that, after the passage of SB 04-189, net tuition revenue per FTE enrollment in Colorado was, on average, about $501 higher than would be expected given the change in net tuition revenue per FTE enrollment in all other states. The within-group fixed-effects DiD regression model (xtreg) can also be employed.

xtreg $y T##P $controls i.year if year>=2000 & (C1==1 | T==1) , fe rob

. xtreg $y T##P $controls i.year if year>=2000 & (C1==1 | T==1) , fe rob
note: 1.T omitted because of collinearity
note: 2016.year omitted because of collinearity

Fixed-effects (within) regression               Number of obs     =        850
Group variable: fips                            Number of groups  =         50

R-sq:                                           Obs per group:
     within  = 0.8217                                         min =         17
     between = 0.1378                                         avg =       17.0
     overall = 0.3530                                         max =         17

                                                F(18,49)          =          .
corr(u_i, Xb)  = 0.0532                         Prob > F          =          .

                                 (Std. Err. adjusted for 50 clusters in fips)
------------------------------------------------------------------------------
             |               Robust
  netuit_fte |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+---------------------------------------------------------------1.T | 0 (omitted) 1.P | 4634.093 1141.528 4.06 0.000 2340.108 6928.079 | T#P | 1 1 | 501.2044 162.1192 3.09 0.003 175.4137 826.995 | stapr_fte | -.1933747 .0767252 -2.52 0.015 -.3475598 -.0391896 pc_income | .0001359 .0554728 0.00 0.998 -.1113407 .1116126 | year | 2001 | 219.4993 66.723 3.29 0.002 85.41442 353.5842
[omitted output] | _cons | 4291.512 1517.395 2.83 0.007 1242.192 7340.831 -------------+---------------------------------------------------------------sigma_u | 2066.6057 sigma_e | 685.81506 rho | .90079681 (fraction of variance due to u_i) ------------------------------------------------------------------------------
In the within-group fixed-effects model, the DiD coefficient (δDiD) is positive and statistically significant and has the same value as the regression model with state dummy variables. For comparison, we run the within-group fixed-effects model with the second control group (states in WICHE).

xtreg $y T##P $controls i.year if year>=2000 & (C2==1 | T==1) , fe rob

. xtreg $y T##P $controls i.year if year>=2000 & (C2==1 | T==1) , fe rob
note: 1.T omitted because of collinearity
note: 2016.year omitted because of collinearity

Fixed-effects (within) regression               Number of obs     =        221
Group variable: fips                            Number of groups  =         13

R-sq:                                           Obs per group:
     within  = 0.8405                                         min =         17
     between = 0.1823                                         avg =       17.0
     overall = 0.4986                                         max =         17

                                                F(12,12)          =          .
corr(u_i, Xb)  = -0.0404                        Prob > F          =          .

                                 (Std. Err. adjusted for 13 clusters in fips)
------------------------------------------------------------------------------
             |               Robust
  netuit_fte |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         1.T |          0  (omitted)
         1.P |   3694.226   1339.566     2.76   0.017     775.5632    6612.888
             |
         T#P |
        1 1  |   947.8925   215.3771     4.40   0.001     478.6261    1417.159
             |
   stapr_fte |  -.1722081   .1092674    -1.58   0.141    -.4102812      .065865
   pc_income |   .0047754   .0758187     0.06   0.951    -.1604195     .1699702
             |
        year |
        2001 |   106.4016   75.83226     1.40   0.186    -58.82273     271.6259
[omitted output] | _cons | 3187.94 2251.055 1.42 0.182 -1716.689 8092.568 -------------+---------------------------------------------------------------sigma_u | 1228.8195 sigma_e | 543.94789 rho | .83615752 (fraction of variance due to u_i) ------------------------------------------------------------------------------
We see that when the second control group is used, the DiD coefficient (δDiD) is also positive and statistically significant, but its value is higher ($948). The preferred regression-based DiD model is a matter of choice for analysts. The choice depends on the treatment period in which the analyst thinks the adoption of the policy began to take full effect, the control variables, and the control group.
7.4.3.3 DiD Placebo Tests
In order to determine whether there is "real" evidence of the effect of a policy, rather than of some unknown factor, placebo tests are required. The tests involve estimating the treatment effect after changing the treatment timing. This is demonstrated below, where we change the timing in our example to 2000 and simulate the treatment as occurring before 2005.

gen placebo_2000 = 1 if year>=2000
recode placebo_2000 (.=0)
xtreg $y T##placebo_2000 $controls if (year>1995 & year<2005) & (C2==1 | T==1), fe rob

. xtreg $y T##placebo_2000 $controls if (year>1995 & year<2005) & (C2==1 | T==1), fe rob
(Std. Err. adjusted for 13 clusters in fips) -----------------------------------------------------------------------------| Robust netuit_fte | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------------+-------------------------------------------------------------1.T | 0 (omitted) 1.placebo_2000 | -278.5172 228.006 -1.22 0.245 -775.2996 218.2653 | T#placebo_2000 | 1 1 | 404.7803 377.3187 1.07 0.304 -417.3266 1226.887 | stapr_fte | -.3260768 .1256597 -2.59 0.023 -.5998658 -.0522878 pc_income | .1858099 .0281743 6.60 0.000 .1244233 .2471964 _cons | -629.3751 939.2766 -0.67 0.516 -2675.883 1417.133 ---------------+-------------------------------------------------------------sigma_u | 1113.4546 sigma_e | 687.0107 rho | .7242707 (fraction of variance due to u_i) ------------------------------------------------------------------------------
We see from the output directly above that the coefficient for the placebo interaction (T#placebo_2000) is statistically insignificant (p = 0.304). This suggests the estimated effect of SB 04-189 on net tuition revenue per FTE enrollment is "real." Policy analysts are encouraged to conduct several placebo tests and to use different control groups to validate their findings.
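For example, one additional placebo might move the assumed treatment year to 2002 (a sketch; the variable name placebo_2002 and the sample window are illustrative choices, not taken from the text):

gen placebo_2002 = (year>=2002)
xtreg $y T##placebo_2002 $controls if (year>1997 & year<2005) & (C2==1 | T==1), fe rob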
7.5 Random-Effects Regression
Like fixed-effects regression, random-effects regression models (also known as random intercept models) allow us to take into account unobserved time-invariant variables. With random-effects models, we can use generalized least squares (GLS) or maximum likelihood (ML) estimating techniques. ML estimating techniques tend to have desirable asymptotic properties (as the sample size increases, the efficiency and consistency of the estimates are maintained). The random-effects regression model is reflected in the following equation:

\hat{Y}_{it} = \beta_0 + \beta_1 X_{1it} + \beta_2 X_{2it} + \cdots + \beta_n X_{nit} + \gamma Z_{it} + \mu_{it} + \varepsilon_{it} \qquad (7.11)

where Z is an observed time-invariant categorical variable and γ is the estimated beta coefficient for Z. In Eq. (7.11), Z does not vary with time (t), while the unobserved time-invariant heterogeneity of the group error (μit) is
assumed to be random and uncorrelated with the independent variables, which allows time-invariant variables to play a role as explanatory variables. The random-effects estimator is a weighted average of the "within" and "between" estimators. The weight is based on the between-group variances and is derived from the variances of μit and εit, which produces the random-effects model estimates. Using the data from one of the examples above, we can run a random-effects regression model that includes the regional compact variable (region_compact) by entering the Stata command xtreg with the option re. When running random-effects models, GLS techniques tend to produce larger standard errors than ML techniques. Because it is the default option and can be used when estimating cluster-robust standard errors, GLS is implicitly part of the following syntax.

xtreg netuit_fte stapr_fte stapr_fte2 pc_income i.region_compact, re cluster(stateid)
The output is below.

. xtreg netuit_fte stapr_fte stapr_fte2 pc_income i.region_compact, re cluster(stateid)

Random-effects GLS regression                   Number of obs     =      1,350
Group variable: stateid                         Number of groups  =         50

R-sq:                                           Obs per group:
     within  = 0.8085                                         min =         27
     between = 0.4145                                         avg =       27.0
     overall = 0.6173                                         max =         27

                                                Wald chi2(7)      =     396.73
corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000
(Std. Err. adjusted for 50 clusters in stateid) -----------------------------------------------------------------------------| Robust netuit_fte | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------------+-------------------------------------------------------------stapr_fte | -.5766974 .1224583 -4.71 0.000 -.8167112 -.3366836 stapr_fte2 | 9.83e-06 6.98e-06 1.41 0.159 -3.84e-06 .0000235 pc_income | .2106405 .0113809 18.51 0.000 .1883344 .2329467 | region_compact | SREB | 343.5133 950.5288 0.36 0.718 -1519.489 2206.516 WICHE | -629.2418 910.5012 -0.69 0.490 -2413.791 1155.308 MHEC | 276.6268 904.9169 0.31 0.760 -1496.978 2050.231 NEBHE | 1304.324 1162.57 1.12 0.262 -974.2709 3582.919 | _cons | 226.986 967.0053 0.23 0.814 -1668.31 2122.282 ---------------+-------------------------------------------------------------sigma_u | 1235.0691 sigma_e | 837.70428 rho | .68491108 (fraction of variance due to u_i) -------------------------------------------------------------------------------
Notice that, with the exception of corr(u_i, X) = 0 (assumed), the format of the output is the same as what we would find when we run a fixed-effects regression model. How do we know whether the random-effects model is more appropriate than a POLS regression model? By conducting a test immediately after we run the random-effects regression model, we can provide an answer to this question. This test is the Breusch and Pagan Lagrangian multiplier test for random effects (xttest0).6

. xttest0

Breusch and Pagan Lagrangian multiplier test for random effects

        netuit_fte[stateid,t] = Xb + u[stateid] + e[stateid,t]

        Estimated results:
                         |       Var     sd = sqrt(Var)
                ---------+-----------------------------
                netuit_e |   6674588       2583.522
                       e |  701748.5       837.7043
                       u |   1525396       1235.069

        Test:   Var(u) = 0
                             chibar2(01) =   8010.80
                          Prob > chibar2 =    0.0000
The rejection of the null hypothesis above indicates that, compared to a POLS regression model, the random-effects regression is more appropriate. But how do we know whether a random-effects or a fixed-effects regression is the most appropriate model? According to some econometricians (Judge et al. 1988), it depends on judgment and/or statistical tests. If we are using a sample of units (e.g., 35 out of 50 states), or we suspect our independent variables and the unobserved heterogeneity are uncorrelated, or we are including time-invariant variables, then we may want to use a random-effects regression model. But what if we are doing an analysis that does not include observed time-invariant variables? This is where the use of statistical tests is required.
7.5.1 Hausman Test
The Hausman test is most commonly employed to determine whether to use a fixed-effects or random-effects model.7 Using data from one of our examples above and a set of Stata commands, we can easily run this test
6 For a complete discussion of this test, see Breusch and Pagan (1980).
7 For a technical discussion of the Hausman test, see Hausman (1978).
in five steps. First, we quietly run (i.e., without showing the results) a within-group fixed-effects model. Second, we store those estimated results (i.e., est sto fixed) to memory, which is illustrated below. Third, we quietly run a random-effects model. Fourth, we store those estimated results (i.e., est sto random). Fifth, we run the Hausman test (i.e., hausman fixed random).

quietly: xtreg eg statea tuition totfteiarep ftfac ptfac, fe
est sto fixed
quietly: xtreg eg statea tuition totfteiarep ftfac ptfac, re
est sto random
hausman fixed random
Given the ordering of the stored estimated results in the last line of syntax, a rejection of the null would indicate the fixed-effects regression is the more appropriate model. The output is below. . . . . . . . . .
quietly: xtreg eg statea tuition totfteiarep ftfac ptfac, fe est sto fixed quietly: xtreg eg statea tuition totfteiarep ftfac ptfac, re est sto random hausman fixed random
Note: the rank of the differenced variance matrix (3) does not equal the number of coefficients being tested (5); be sure this is what you expect, or there may be problems computing the test. Examine the output of your estimators for anything unexpected and possibly consider scaling your variables so that the coefficients are on a similar scale. ---- Coefficients ---| (b) (B) (b-B) sqrt(diag(V_b-V_B)) | fixed random Difference S.E. -------------+---------------------------------------------------------------statea | .6359503 .711084 -.0751337 .0078128 tuition | 1.20119 1.078007 .1231832 .0101439 totfteiarep | 1050.312 -332.6668 1382.979 296.0531 ftfac | 32819.51 10317.97 22501.54 6505.906 ptfac | 7375.78 5765.71 1610.069 2575.428 -----------------------------------------------------------------------------b = consistent under Ho and Ha; obtained from xtreg B = inconsistent under Ha, efficient under Ho; obtained from xtreg Test:
Ho:
difference in coefficients not systematic
chi2(3) = (b-B)’[(V_b-V_B)ˆ(-1)](b-B) = 45.67 Prob>chi2 = 0.0000
While the above results of the Hausman test suggest we should use the fixed-effects regression model, the note at the beginning of the output states there may be problems with computing the test and recommends rescaling the variables. To rescale the variables, we log-transform them and rerun the test. This time we will show the entire output, including the results of each regression model, by excluding the quietly (qui) prefix.

. xtreg lneg lnstatea lntuition lntotfteiarep lnftfac ptfac, fe

Fixed-effects (within) regression               Number of obs     =      1,978
Group variable: opeid5_new                      Number of groups  =        220

R-sq:                                           Obs per group:
     within  = 0.6691                                         min =          2
     between = 0.9157                                         avg =        9.0
     overall = 0.8825                                         max =         10

                                                F(5,1753)         =     708.86
corr(u_i, Xb)  = -0.8252                        Prob > F          =     0.0000
-------------------------------------------------------------------------------
         lneg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
     lnstatea |   .0128249   .0045956     2.79   0.005     .0038114    .0218384
    lntuition |   .5562887   .0157124    35.40   0.000     .5254718    .5871057
lntotfteiarep |    .113466   .0386945     2.93   0.003     .0375738    .1893581
      lnftfac |   .5642428   .0458174    12.32   0.000     .4743802    .6541054
        ptfac |   .0003861   .0000482     8.01   0.000     .0002915    .0004806
        _cons |   3.801971   .3236401    11.75   0.000      3.16721    4.436732
--------------+----------------------------------------------------------------
      sigma_u |  .32099191
      sigma_e |  .12532947
          rho |  .86771903   (fraction of variance due to u_i)
-------------------------------------------------------------------------------
F test that all u_i=0: F(219, 1753) = 16.90               Prob > F = 0.0000

. est sto fixed

. xtreg lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac, re

Random-effects GLS regression                   Number of obs     =      1,978
Group variable: opeid5_new                      Number of groups  =        220

R-sq:                                           Obs per group:
     within  = 0.6536                                         min =          2
     between = 0.8999                                         avg =        9.0
     overall = 0.8697                                         max =         10

                                                Wald chi2(5)      =    5302.90
corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000
------------------------------------------------------------------------------lneg | Coef. Std. Err. z P>|z| [95% Conf. Interval] --------------+--------------------------------------------------------------lnstatea | .0192334 .004174 4.61 0.000 .0110526 .0274143 lntuition | .5254071 .0151305 34.72 0.000 .4957518 .5550624 lntotfteiarep | .0644028 .0331053 1.95 0.052 -.0004824 .129288 lnftfac | .3408924 .0344727 9.89 0.000 .273327 .4084577 lnptfac | .0417042 .0071786 5.81 0.000 .0276343 .0557741 _cons | 5.823929 .1957728 29.75 0.000 5.440222 6.207637 --------------+--------------------------------------------------------------sigma_u | .15783994 sigma_e | .12699568 rho | .60703286 (fraction of variance due to u_i) ------------------------------------------------------------------------------. . est sto random . . hausman fixed random ---- Coefficients ---| (b) (B) (b-B) sqrt(diag(V_b-V_B)) | fixed random Difference S.E. -------------+---------------------------------------------------------------lnstatea | .0128249 .0192334 -.0064085 .0019229 lntuition | .5562887 .5254071 .0308817 .0042361 lntotfteiap | .113466 .0644028 .0490632 .0200325 lnftfac | .5642428 .3408924 .2233504 .0301806 -----------------------------------------------------------------------------b = consistent under Ho and Ha; obtained from xtreg B = inconsistent under Ha, efficient under Ho; obtained from xtreg Test:
Ho:
difference in coefficients not systematic
chi2(4) = (b-B)’[(V_b-V_B)ˆ(-1)](b-B) = 454.45 Prob>chi2 = 0.0000
While the results of the Hausman test using the log-transformed variables are more accurate, they are based on models that do not allow us to take into account heteroscedasticity via cluster-robust standard errors. This is a limitation of the standard Hausman test provided by Stata. For this reason, we now turn to a Stata user-written Hausman routine (rhausman) by Kaiser (2015) that addresses this limitation. (We have to download this program by typing ssc install rhausman.) Using this program, the log-transformed variables, and the models with cluster-robust standard errors, we rerun the Hausman test. The options reps(400) and cluster are included to allow for random sampling
with replacement (i.e., 400 times) and take into account the cluster variable institution and cluster variable opeid5_new, respectively.8 . quietly: xtreg lneg lnstatea lntuition lntotfteiarep lnftfac ptfac, cluster(opeid5_new) fe . est sto fixed . quietly: xtreg lneg lnstatea lntuition lntotfteiarep lnftfac ptfac, cluster(opeid5_new) re . est sto random . rhausman fixed random, reps(400) cluster bootstrap in progress ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 .................................................. 50 (This bootstrap will approximately take another 0h. 1min. 10sec.) .................................................. 100 .................................................. 150 .................................................. 200 .................................................. 250 .................................................. 300 .................................................. 350 .................................................. 400 ------------------------------------------------------------------------------Cluster-Robust Hausman Test (based on 400 bootstrap repetitions) b1: obtained from xtreg lneg lnstatea lntuition lntotfteiarep lnftfac ptfac, cluster(opeid5_new) fe b2: obtained from xtreg lneg lnstatea lntuition lntotfteiarep lnftfac ptfac, cluster(opeid5_new) re Test:
Ho:
difference in coefficients not systematic
chi2(5) = (b1-b2)’ * [V_bootstrapped(b1-b2)]ˆ(-1) * (b1-b2) = 49.03 Prob>chi2 = 0.0000
Given the output directly above, we can now be confident that the results of the Hausman test accurately indicate that the fixed-effects regression model is more appropriate. Using a fixed-effects regression model and our panel data on public master's universities and colleges, we conclude that E&G expenditures are positively related to state appropriations (lnstatea), tuition revenue (lntuition), total FTE students (lntotfteiarep), full-time faculty (lnftfac), and part-time faculty (ptfac).
8 For more on bootstrapping in Stata, see Guan (2003).
7.6 Summary
This chapter introduced intermediate statistical methods that are used in higher education policy correlational studies. Starting with pooled ordinary least squares (POLS) and continuing with fixed-effects and random-effects regression models, this chapter demonstrated how we can use these statistical techniques to analyze panel data with Stata syntax. We also showed how various tests can be conducted to determine the appropriate method that should be employed in correlational studies. The chapter also introduced how fixed-effects regression can be modified to infer causal effects by including difference-in-differences estimators.
7.7 Appendix
*Chapter 7 Stata syntax

*Bivariate OLS Regression
*use dataset from the previous chapter
use "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 6\Stata files\Example 6.3.dta", clear
*generate bivariate (one independent variable) OLS regression output
regress netuit_fte stapr_fte if year ==2016
*create a new variable reflecting, say, the squared term (or quadratic) of another variable
gen stapr_fte2 = stapr_fte*stapr_fte
*include new variable in the regression model
regress netuit_fte stapr_fte stapr_fte2 pc_income if year ==2016

*Multivariate Pooled OLS Regression
reg netuit_fte stapr_fte stapr_fte2 pc_income
*include the categorical variable (region_compact) in regression model
reg netuit_fte stapr_fte stapr_fte2 pc_income i.region_compact

*Multivariate Pooled OLS Regression with Interaction Terms
reg netuit_fte stapr_fte i.region_compact##i.ugradmerit, allbaselevels
*test to see if there is an interaction effect by quietly (qui) running the models
*and storing (est sto) the results without (model1) and with (model2) the interaction terms
qui reg netuit_fte stapr_fte i.region_compact
est sto model1
qui reg netuit_fte stapr_fte i.region_compact##i.ugradmerit
est sto model2
lrtest model1 model2
*Using the testparm command, the statistical significance of the interaction terms can also be checked.
testparm i.region_compact#i.ugradmerit
*if the interaction term is composed of one continuous (c) variable and one categorical (i) variable
reg netuit_fte i.ugradmerit i.region_compact c.stapr_fte##i.tuitset
testparm c.stapr_fte#i.tuitset
*change working directory
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 7\Stata files"
*use new dataset
use "Example 7.1.dta", clear
*if two continuous variables are included
reg netuit_fte i.region_compact c.stapr_fte##c.state_needFTE
*using margins (with the vsquish option)
margins, dydx(stapr_fte) at(state_needFTE=(0(3000)10000)) vsquish
qui margins, at(stapr_fte=(0 10000) state_needFTE=(0(3000)10000)) vsquish
*and marginsplot with different patterns
marginsplot, noci x(stapr_fte) recast(line) xlabel(0(3000)10000) ///
    plot1opts(lpattern("...")) plot2opts(lpattern("-..-") color(black)) ///
    plot3opts(lpattern("---") color(black)) plot4opts(color(black))
*residual-versus-fitted plot
rvfplot, mcolor(black)
*comprehensive post-estimation
estat imtest
*POLS regression model using the robust option
reg netuit_fte stapr_fte stapr_fte2 pc_income i.region_compact, robust
*Levene test of homogeneity
quietly: reg netuit_fte stapr_fte stapr_fte2 pc_income i.region_compact
predict double eps, residual
robvar eps, by(state)
*To address this violation of the assumption of homoscedasticity, we use the
*cluster option, with state as the cluster variable, in our POLS regression model.
reg netuit_fte stapr_fte stapr_fte2 pc_income i.region_compact, cluster(state)
*Fixed-Effects Regression
*Unobserved Heterogeneity and Fixed-Effects Dummy Variable (FEDV) Regression
*Estimating FEDV Multivariate POLS Regression Models
reg netuit_fte stapr_fte stapr_fte2 pc_income i.stateid, cluster(state)
*we determine if the state fixed-effects as a whole are statistically
*significant, immediately after we run the above regression
testparm i.stateid
*alternative approach to producing the same results
areg netuit_fte stapr_fte stapr_fte2 pc_income, cluster(stateid) absorb(stateid)
*open another dataset
use "Example 7.1.dta"
areg eg statea tuition totfteiarep ftfac ptfac D, cluster(opeid5_new) absorb(opeid5_new)
*Unobserved Heterogeneity and Within-Group Estimator Fixed-Effects Regression
*using the xtreg command with the fe option
xtreg eg statea tuition totfteiarep ftfac ptfac, fe cluster(opeid5_new)
*Fixed-effects regression and difference-in-differences (DiD)
*The DiD Estimator
*Fixed-effects Regression-based DiD: An Example
use "Example 7.1.dta", clear
*We create the treatment variable (T).
gen T=0
replace T=1 if state=="CO"
*The post-treatment (P) is then created.
gen P=0
replace P=1 if year>=2004
*Based on every state other than the treatment state (Colorado), we create the first control group.
gen C1 = 0
replace C1=1 if state !="CO"
*we create a second control group.
gen C2 = 0
replace C2=1 if state !="CO" & region_compact==2
*we use the global command to create temporary variables reflecting the
*dependent variable net tuition revenue per FTE enrollment (y)
global y "netuit_fte"
*and the set of control variables, state appropriations to higher education per
*FTE enrollment (stapr_fte) and state per capita income (pc_income)
global controls "stapr_fte pc_income"
*To take heteroscedasticity into account, we include robust (rob) as an option in the syntax.
reg $y i.T i.P T#P $controls i.year i.fips if year>=2000 & (C1==1 | T==1), rob
*The within-group fixed-effects DiD regression model can also be employed.
xtreg $y T##P $controls i.year if year>=2000 & (C1==1 | T==1) , fe rob
*For comparison, we run the within-group fixed-effects model with the second
*control group (states in WICHE).
xtreg $y T##P $controls i.year if year>=2000 & (C2==1 | T==1) , fe rob
*DiD Placebo Tests
gen placebo_2000 = 1 if year>=2000
recode placebo_2000 (.=0)
xtreg $y T##placebo_2000 $controls if (year>1995 & year<2005) & (C2==1 | T==1), fe rob

                                                     Prob > F      =     0.0000
    Residual |  .048564458        43  .001129406     R-squared     =     0.4912
-------------+----------------------------------    Adj R-squared  =     0.4676
       Total |  .095456753        45  .002121261     Root MSE      =     .03361

------------------------------------------------------------------------------
D.lnenpub2yr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  lntupub2yr |
         D1. |  -.2340422   .0944225    -2.48   0.017    -.4244633     -.043621
             |
 lnunemprate |
         D1. |   .1763452   .0336472     5.24   0.000     .1084892     .2442012
             |
       _cons |   .0265284   .0053751     4.94   0.000     .0156885     .0373684
------------------------------------------------------------------------------
Next, the Stata command racplot is used to create an autocorrelation function, or correlogram, of the residuals from the regression model above. The correlogram of the residuals is shown in Fig. 8.3. We can see that after the first and second lags, the values of the autocorrelation of the residuals decline substantially with the number of lags and lie inside the 95% confidence interval (the shaded area). The partial autocorrelations of the residuals (Fig. 8.4) provide additional visual evidence of a first-order autoregressive (AR1) disturbance. This figure is created by
Fig. 8.3 Autocorrelation (correlogram) of the residuals from the regression model
generating the residuals from the model (predict residuals, resid) and creating a graph of partial autocorrelations (pac residuals, yw). We see that after the first lag, the partial autocorrelations of the residuals dissipate at the higher lags and are well within the 95% confidence interval (Fig. 8.4). Combined, these visuals suggest evidence of first-order autocorrelation (AR1) that should be addressed before using a final regression model. However, for more definitive evidence, we should conduct statistical tests.
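Pulling the plotting steps just described together, a minimal sketch using the official ac and pac commands (assuming the differenced regression above has just been run, the data are tsset, and the residuals variable has not already been created) is:

predict residuals, resid      // residuals from the differenced regression
ac residuals                  // autocorrelation function (correlogram), cf. Fig. 8.3
pac residuals, yw             // partial autocorrelations via Yule-Walker, cf. Fig. 8.4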
8.3 Testing for Autocorrelations
When using time series data, the most common tests for autocorrelation are the Durbin–Watson (D-W) test (Durbin and Watson 1950) and the Breusch–Godfrey (B-G) test (Breusch 1978; Godfrey 1978). The D-W test is based on a measure of autocorrelation in the residuals from a regression model. That measure, the D-W d statistic, always has a value between 0 and 4. A d statistic with a value of 2.0 indicates there is no autocorrelation, a value from 0 to less than 2 indicates positive autocorrelation, and a value from 2 to 4 indicates negative autocorrelation. Different versions of the D-W test are based on different assumptions regarding the exogeneity of the independent variables. The results of the D-W test that are based on the work of Durbin and Watson (1950) assume the independent variables are exogenous and the residuals are normally distributed. An alternative version of the D-W test relaxes the assumptions of exogenous independent variables, normally distributed residuals, and homoscedasticity (Davidson and MacKinnon 1993). The B-G test is limited in that while it relaxes the assumption of strictly exogenous regressors, it does not take into account violations of the assumption of homoscedasticity. The Wooldridge (2002) test or the Arellano–Bond (A-B) test (Arellano and Bond 1991) is conducted when using panel data. The lesser-known Cumby–Huizinga (C-H) general test (Cumby and Huizinga 1992) can be used with both time series and panel data.

Fig. 8.4 Partial autocorrelations of residuals
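For reference, the D-W d statistic is computed from the regression residuals et; the standard textbook expression (not shown in the original) is

d = Σ(t=2 to T) (et − et−1)² / Σ(t=1 to T) et² ≈ 2(1 − ρ̂1),

so d falls toward 0 as the first-order residual autocorrelation ρ̂1 approaches 1 and rises toward 4 as it approaches −1.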
8.3.1 Examples of Autocorrelation Tests—Time Series Data
To demonstrate how to conduct a D-W test when using time series data, we use the same data from above, assume the independent variables are exogenous, and use the Stata post-estimation time series command estat dwatson.

. estat dwatson

Durbin-Watson d-statistic(  3,    47) =  .8127196
While the value of the D-W d statistic shown above is 0.813, the D-W test does not tell us whether or not the value is statistically different from 2. In addition, the results are based on the assumptions of exogenous independent variables, a normal distribution of the residuals or errors (ε), and homoscedastic errors. (The OLS regression model that we ran did not take into account possible heteroscedasticity.) So, we “quietly” (i.e., do not show the output) rerun the regression model with the robust (rob) option and use the alternative D-W test, the post-estimation time series command estat durbinalt, with the force option.

. quietly: reg D1.lnenpub2yr D1.lntupub2yr D1.lnunemprate, rob

. estat durbinalt, force

Durbin's alternative test for autocorrelation
---------------------------------------------------------------------------
    lags(p)  |          chi2               df                 Prob > chi2
-------------+-------------------------------------------------------------
      1      |         24.039               1                   0.0000
---------------------------------------------------------------------------
   H0: no serial correlation
We can see from the above output of the D-W alternative test that the null hypothesis (H0) of no serial correlation (no autocorrelation) is rejected (p < 0.001).
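The Breusch–Godfrey test described earlier can be run in much the same way. A minimal sketch follows, assuming the same OLS model is the most recent estimation command; estat bgodfrey is Stata's standard post-estimation B-G test, and the lags(1/4) choice is illustrative.

. quietly reg D1.lnenpub2yr D1.lntupub2yr D1.lnunemprate
. estat bgodfrey, lags(1/4)

Small p-values at a given lag order would again indicate serial correlation at that order, with the caveat noted above that the B-G test does not address heteroscedasticity.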
8.4 Time Series Regression Models with AR terms
When we find a violation of the assumption of no autocorrelation, we have to use a time series regression model that includes first-order serially correlated residuals, that is, an autoregressive (AR1) disturbance term. This regression model can be estimated via several estimating techniques. (See Davidson and MacKinnon (1993) for a complete discussion of these estimating techniques.) The time series regression model with an AR term can be calibrated via the Prais–Winsten (P-W) estimator. In Stata, this is accomplished by using the prais command in place of the regress command. The prais command is used (with the default rhotype(regress), which bases rho (ρ) on a single-lag OLS regression of the residuals) along with the same dependent and independent variables and the rob option as in the regression model above.

. prais D1.lnenpub2yr D1.lntupub2yr D1.lnunemprate, rob

Iteration 0:   rho = 0.0000
Iteration 1:   rho = 0.5658
Iteration 2:   rho = 0.6117
Iteration 3:   rho = 0.6147
Iteration 4:   rho = 0.6149
Iteration 5:   rho = 0.6149
Iteration 6:   rho = 0.6149

Prais-Winsten AR(1) regression -- iterated estimates

Linear regression                               Number of obs     =         47
                                                F(2, 44)          =      25.33
                                                Prob > F          =     0.0000
                                                R-squared         =     0.5486
                                                Root MSE          =     .02624
------------------------------------------------------------------------------
             |             Semirobust
D.lnenpub2yr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  lntupub2yr |
         D1. |   -.240876   .0914333    -2.63   0.012    -.4251476    -.0566044
             |
 lnunemprate |
         D1. |   .1335829   .0257963     5.18   0.000     .0815939     .1855718
             |
       _cons |   .0279557   .0100677     2.78   0.008     .0076656     .0482458
-------------+----------------------------------------------------------------
         rho |   .6149498
------------------------------------------------------------------------------
Durbin-Watson statistic (original)    0.812720
Durbin-Watson statistic (transformed) 2.145852
The rho (ρ) in the output above shows there is positive autocorrelation. The regression model with an AR1 term shows that the first-differenced, log-transformed tuition and unemployment variables are statistically significant. We also see that the value of the transformed D-W d statistic is approximately 2 (2.146), suggesting no autocorrelation. However, we should examine the autocorrelation and partial autocorrelation functions of the residuals from the P-W regression. We do so by first generating residuals from the P-W regression (predict residuals_PW, resid) and creating graphs of autocorrelations and partial autocorrelations of those residuals, as sketched below.
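A minimal sketch of this step, assuming the prais model above is the most recent estimation command (the lags(20) choice is illustrative); the resulting graphs correspond to Figs. 8.5 and 8.6.

. predict residuals_PW, resid
. ac residuals_PW, lags(20)
. pac residuals_PW, yw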
8.4.1 Autocorrelation of the Residuals from the P-W Regression
Given Fig. 8.5, it appears as if there is still autocorrelation even after we ran the P-W regression. The partial autocorrelation function in Fig. 8.6 provides further evidence of first-order autocorrelation (AR1). Because the alternative D-W test for autocorrelation does not work after running a P-W regression in Stata, we use the Cumby-Huizinga (C-H) general test of the residuals. However, the Stata user-written program for the C-H test has to be downloaded first (ssc install actest). We will check for autocorrelation of the residuals from the P-W regression (residuals_PW) for up to four lags (lag(4)), specify the null hypothesis of no autocorrelation at
Fig. 8.5 Autocorrelation of the residuals from the P-W regression
Fig. 8.6 Partial autocorrelations of the residuals from P-W regression
any lag order (q=0), and take into account possible heteroscedasticity (rob). So our Stata syntax for this test is: actest residuals_PW, lag(4) q0 rob. The output is as follows.
. actest residuals_PW, lag(4) q0 rob

Cumby-Huizinga test for autocorrelation
H0: disturbance is MA process up to order q
HA: serial correlation present at specified lags >q
-----------------------------------------------------------------------------
  H0: q=0 (serially uncorrelated)        |  H0: q=0 (serially uncorrelated)
  HA: s.c. present at range specified    |  HA: s.c. present at lag specified
-----------------------------------------+-----------------------------------
     lags  |     chi2     df     p-val   |  lag |     chi2     df     p-val
-----------+-----------------------------+------+----------------------------
    1 - 1  |    5.369      1    0.0205   |   1  |    5.369      1    0.0205
    1 - 2  |    6.147      2    0.0463   |   2  |    5.812      1    0.0159
    1 - 3  |    6.255      3    0.0998   |   3  |    3.141      1    0.0763
    1 - 4  |    6.344      4    0.1749   |   4  |    1.201      1    0.2731
-----------------------------------------------------------------------------
Test robust to heteroskedasticity
Looking at the panel on the right in the output from the C-H test, we can see that the null hypothesis of no autocorrelation is rejected at both the first (chi2 = 5.369, p < 0.05) and second (chi2 = 5.812, p < 0.05) lags, indicating first-order (AR1) and second-order (AR2) autocorrelation.2 Unfortunately, Stata’s Prais–Winsten (prais) regression allows for including only AR1. Therefore, we have to use an autoregressive–moving-average (ARMA) model with only autoregressive terms and exogenous independent variables, commonly known as an ARMAX model.3 ARMA models can accommodate autoregressive disturbance terms with more than one lag and are reflected in an expansion of Eq. (7.1) in the previous chapter to include the following:

Yt = βXt + μt
μt = Σ(i=1 to p) ρi μt−i + Σ(j=1 to q) θj εt−j + εt          (8.2)
where p is the number of autoregressive terms, q is the number of moving-average terms, ρ is the autoregressive parameter, θ is the moving-average parameter, j is the lag, t is time, and εt is the error term. Using the Stata command arima, we then estimate an ARMAX model with first-order (AR1) and second-order (AR2) autoregressive terms, with the dependent variable the first-differenced log of enrollment (D1.lnenpub2yr) and the exogenous independent variables the first-differenced log of tuition (D1.lntupub2yr) and the first-differenced log of unemployment
2. According to Hoechle (2007), an AR process can be approximated by an MA process.
3. The ARMAX model is an extension of the Box–Jenkins autoregressive integrated moving average (ARIMA) model with exogenous variables. This particular example does not include a moving-average (MA) term. The decision to include an MA term is determined by observing the autocorrelation function of the dependent variable. Although not shown, including the MA term in this example does not change the results. For more information on the ARIMA model, see Box and Jenkins (1970).
rate (D1.lnunemprate). The vce(robust) option is included, which produces semi-robust standard errors.

. arima D1.lnenpub2yr D1.lntupub2yr D1.lnunemprate, ar(1 2) vce(robust)

(setting optimization to BHHH)
Iteration 0:   log pseudolikelihood =  104.65434
Iteration 1:   log pseudolikelihood =  106.70775
Iteration 2:   log pseudolikelihood =   107.3222
Iteration 3:   log pseudolikelihood =  107.57829
Iteration 4:   log pseudolikelihood =  107.68275
(switching optimization to BFGS)
Iteration 5:   log pseudolikelihood =   107.7467
Iteration 6:   log pseudolikelihood =  107.80569
Iteration 7:   log pseudolikelihood =  107.81934
Iteration 8:   log pseudolikelihood =  107.82036
Iteration 9:   log pseudolikelihood =  107.82039
Iteration 10:  log pseudolikelihood =  107.82039

ARIMA regression

Sample: 1971 - 2017                             Number of obs     =         47
                                                Wald chi2(4)      =     111.29
Log pseudolikelihood = 107.8204                 Prob > chi2       =     0.0000
------------------------------------------------------------------------------
             |             Semirobust
D.lnenpub2yr |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lnenpub2yr   |
  lntupub2yr |
         D1. |  -.2147784    .085312    -2.52   0.012    -.3819868      -.04757
             |
 lnunemprate |
         D1. |   .1324922   .0213737     6.20   0.000     .0906004     .1743839
             |
       _cons |   .0296133   .0197496     1.50   0.134    -.0090953     .0683219
-------------+----------------------------------------------------------------
ARMA         |
          ar |
         L1. |   .4546644   .1601833     2.84   0.005     .1407109     .7686179
         L2. |   .3434536   .2337644     1.47   0.142    -.1147162     .8016235
-------------+----------------------------------------------------------------
      /sigma |   .0241711   .0022081    10.95   0.000     .0198433     .0284988
------------------------------------------------------------------------------
Note: The test of the variance against zero is one sided, and the two-sided
      confidence interval is truncated at zero.
From the results of the ARMAX model shown above, we see that the AR1 disturbance term is statistically significant (beta = 0.455, p < 0.01) but not the AR2 disturbance term. However, we examine the residuals from the ARMAX model to see if there is any autocorrelation and conduct a final test.

. predict residuals_ARMX12, resid
(1 missing value generated)
Fig. 8.7 Autocorrelations of residuals from ARMX12
Both Figs. 8.7 and 8.8 suggest there is no autocorrelation of the residuals from the ARMAX model with AR1 and AR2 disturbance terms. Using the C-H general test, we conduct a final test to detect autocorrelation.

. actest residuals_ARMX12, lag(4) q0 rob

Cumby-Huizinga test for autocorrelation
H0: disturbance is MA process up to order q
HA: serial correlation present at specified lags >q
-----------------------------------------------------------------------------
  H0: q=0 (serially uncorrelated)        |  H0: q=0 (serially uncorrelated)
  HA: s.c. present at range specified    |  HA: s.c. present at lag specified
-----------------------------------------+-----------------------------------
     lags  |     chi2     df     p-val   |  lag |     chi2     df     p-val
-----------+-----------------------------+------+----------------------------
    1 - 1  |    0.120      1    0.7288   |   1  |    0.120      1    0.7288
    1 - 2  |    0.208      2    0.9014   |   2  |    0.089      1    0.7648
    1 - 3  |    2.721      3    0.4367   |   3  |    2.567      1    0.1091
    1 - 4  |    2.756      4    0.5995   |   4  |    0.746      1    0.3877
-----------------------------------------------------------------------------
Test robust to heteroskedasticity
We see from the C-H general test results above that the null hypothesis of no autocorrelation cannot be rejected. So, the results of the C-H test combined with Figs. 8.7 and 8.8 allow us to conclude there is no remaining autocorrelation when using our time series data and the ARMAX model above. Given the ARMAX model results, it can now be stated with confidence that enrollment in public 2-year colleges is negatively related to
Fig. 8.8 Partial autocorrelations of residuals from ARMX12
published tuition and fees at public 2-year colleges and positively related to unemployment rates. Because the ARMAX model was used with first-differenced variables, the interpretation of the results is based on the average short-term (1-year) relationship rather than the average long-term (e.g., 47-year) relationship. If we wanted to make a statement based on the latter, we would have to fit an ARMAX model to the data in levels rather than to their first differences. Fortunately, we can do this by using the diffuse option in Stata.4 (The nolog option is also included so as not to show the iteration log.) We show the output below.

. arima lnenpub2yr lntupub2yr lnunemprate, ar(1 2) rob diffuse nolog

ARIMA regression

Sample: 1970 - 2016                             Number of obs     =         47
                                                Wald chi2(4)      =   30163.62
Log pseudolikelihood = 86.21852                 Prob > chi2       =     0.0000
------------------------------------------------------------------------------
             |             Semirobust
  lnenpub2yr |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lnenpub2yr   |
  lntupub2yr |  -.1989154   .0891667    -2.23   0.026    -.3736788      -.024152
 lnunemprate |   .1478185    .029172     5.07   0.000     .0906425      .2049945
       _cons |   16.97047   .7746909    21.91   0.000     15.45211      18.48884
-------------+----------------------------------------------------------------
ARMA         |
          ar |
         L1. |   1.305528   .0153133    85.25   0.000     1.275514      1.335542
         L2. |  -.3506986   .0030851  -113.67   0.000    -.3567453     -.3446518
-------------+----------------------------------------------------------------
      /sigma |    .022082   .0020042    11.02   0.000     .0181539      .0260101
------------------------------------------------------------------------------
Note: The test of the variance against zero is one sided, and the two-sided
      confidence interval is truncated at zero.

4. For more information on the diffuse option, see the Stata Reference Time-Series Manual, Release 16, Ansley and Kohn (1985), and Harvey (1989).
From the results above, we can see that all the independent variables are statistically significant. We also see that both autoregressive disturbance terms (AR1 and AR2) are statistically significant. As with the ARMAX model using first differences, the C-H test is used to detect any remaining autocorrelation.

. predict residuals_nsARMA12dn, resid

. actest residuals_nsARMA12dn, q0 rob lag(4)

Cumby-Huizinga test for autocorrelation
H0: disturbance is MA process up to order q
HA: serial correlation present at specified lags >q
-----------------------------------------------------------------------------
  H0: q=0 (serially uncorrelated)        |  H0: q=0 (serially uncorrelated)
  HA: s.c. present at range specified    |  HA: s.c. present at lag specified
-----------------------------------------+-----------------------------------
     lags  |     chi2     df     p-val   |  lag |     chi2     df     p-val
-----------+-----------------------------+------+----------------------------
    1 - 1  |    0.892      1    0.3450   |   1  |    0.892      1    0.3450
    1 - 2  |    0.977      2    0.6136   |   2  |    0.067      1    0.7958
    1 - 3  |    1.797      3    0.6156   |   3  |    0.503      1    0.4782
    1 - 4  |    2.237      4    0.6923   |   4  |    0.256      1    0.6127
-----------------------------------------------------------------------------
Test robust to heteroskedasticity
We see the results of the test indicate there is no remaining autocorrelation up through lag 4. (There is no reason to think there is autocorrelation at any lags beyond lag 4.) While these results may be good news from a statistical perspective, they may not be helpful to policymakers who are interested in how changes in tuition and fees at public community colleges may influence enrollment at those institutions. It is quite possible that a shift in the demand for higher education at public 2-year institutions may influence a change in tuition and fees at those colleges (Toutkoushian and Paulsen 2016). So, in order to avoid “reverse causality”, we have to regress enrollment on at least a 1-year lag of tuition. In Stata, we do this by including the lag operator (L1) in a re-calibrated ARMAX model. To not lose an additional observation, we use data through 2017.

. arima lnenpub2yr L1.lntupub2yr lnunemprate, ar(1 2) rob diff nolog

numerical derivatives are approximate
flat or discontinuous region encountered
numerical derivatives are approximate
flat or discontinuous region encountered

ARIMA regression

Sample: 1971 - 2017                             Number of obs     =         47
                                                Wald chi2(4)      =    4111.19
Log pseudolikelihood = 83.0424                  Prob > chi2       =     0.0000
------------------------------------------------------------------------------
             |             Semirobust
  lnenpub2yr |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lnenpub2yr   |
  lntupub2yr |
         L1. |   .1337166    .082935     1.61   0.107     -.028833      .2962662
             |
 lnunemprate |    .171104   .0317356     5.39   0.000     .1089033      .2333046
       _cons |   14.27244   .6893422    20.70   0.000     12.92135      15.62353
-------------+----------------------------------------------------------------
ARMA         |
          ar |
         L1. |   1.225775    .190853     6.42   0.000     .8517105       1.59984
         L2. |  -.2997325   .1754243    -1.71   0.088    -.6435578      .0440928
-------------+----------------------------------------------------------------
      /sigma |     .02378   .0025161     9.45   0.000     .0188485      .0287114
------------------------------------------------------------------------------
Note: The test of the variance against zero is one sided, and the two-sided
      confidence interval is truncated at zero.
With respect to the influence of tuition on enrollment, this model shows results that differ substantially from the previous model. In this model, tuition lagged 1 year (L1.lntupub2yr) is statistically insignificant. Finally, we can fit an ARIMA model to the same data using slightly different Stata syntax, where arima(2 0 0) indicates the model should include first-order (AR1) and second-order (AR2) autoregressive terms, no (0) differencing, and no (0) moving-average (MA) term.

. arima lnenpub2yr L1.lntupub2yr lnunemprate, arima(2 0 0) rob nolog diffuse

numerical derivatives are approximate
flat or discontinuous region encountered
numerical derivatives are approximate
flat or discontinuous region encountered

ARIMA regression

Sample: 1971 - 2017                             Number of obs     =         47
                                                Wald chi2(4)      =    4111.19
Log pseudolikelihood = 83.0424                  Prob > chi2       =     0.0000
------------------------------------------------------------------------------
             |             Semirobust
  lnenpub2yr |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lnenpub2yr   |
  lntupub2yr |
         L1. |   .1337166    .082935     1.61   0.107     -.028833      .2962662
             |
 lnunemprate |    .171104   .0317356     5.39   0.000     .1089033      .2333046
       _cons |   14.27244   .6893422    20.70   0.000     12.92135      15.62353
-------------+----------------------------------------------------------------
ARMA         |
          ar |
         L1. |   1.225775    .190853     6.42   0.000     .8517105       1.59984
         L2. |  -.2997325   .1754243    -1.71   0.088    -.6435578      .0440928
-------------+----------------------------------------------------------------
      /sigma |     .02378   .0025161     9.45   0.000     .0188485      .0287114
------------------------------------------------------------------------------
Note: The test of the variance against zero is one sided, and the two-sided
      confidence interval is truncated at zero.
We see that the results are the same as in the previous output. One final test after fitting a final ARMAX model is to check the model’s stability. More specifically, the estimated dependent variable should not increase without bound over time, and its variance should be independent of time. In particular, the estimated parameters (ρ) in our second-order AR (AR2) model must meet the following conditions:

ρ2 + ρ1 < 1
ρ2 − ρ1 < 1
−1 < ρ2 < 1

In other words, the inverse roots (eigenvalues) of the AR polynomial must all lie inside the unit circle.5 To check the stability of the ARMAX model, the post-estimation test estat aroots is conducted. (We include the option dlabel to show each eigenvalue along with its distance from the unit circle.)

. estat aroots, dlabel

Eigenvalue stability condition
+----------------------------------------+
|        Eigenvalue        |   Modulus   |
|--------------------------+-------------|
|         .8883852         |   .888385   |
|         .3373902         |   .33739    |
+----------------------------------------+
All the eigenvalues lie inside the unit circle.
AR parameters satisfy stability condition.

As indicated in the test results and shown in Fig. 8.9, the ARMAX model is stable.
5 For more information on inverse roots, see the Stata Reference Time-Series Manual, Release 16 and Hamilton (1994).
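As a hand check of these conditions, the sketch below plugs in the AR estimates reported above for the final model (ρ1 = 1.225775, ρ2 = −.2997325) and solves for the eigenvalues, i.e., the roots of z² − ρ1z − ρ2 = 0; the scalar names are hypothetical.

. scalar rho1 = 1.225775
. scalar rho2 = -.2997325
. scalar disc = sqrt(rho1^2 + 4*rho2)
. display "eigenvalue 1 = " (rho1 + disc)/2
. display "eigenvalue 2 = " (rho1 - disc)/2
. display "stability conditions met: " ((rho1 + rho2 < 1) & (rho2 - rho1 < 1) & (abs(rho2) < 1))

The two displayed eigenvalues (about 0.888 and 0.337) match the moduli reported by estat aroots, and all three stability conditions evaluate to true (1).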
Fig. 8.9 Inverse roots of the AR polynomial
8.5 Summary of Time Series Data, Autocorrelation, and Regression
As longer time series data become available to higher education analysts, greater care must be taken to use available visual and statistical tools prior to providing regression-based information to policymakers. The visual tools include simple line graphs as well as correlograms, partial autocorrelation functions, and unit circles. The statistical tools include tests to detect serial correlation, nonstationary data processes, and unstable time series regression models.
8.6 Examples of Autocorrelation Tests—Panel Data
Autocorrelation may also be present in the errors of fixed-effects and random-effects regression models. Consequently, we should conduct autocorrelation tests when we are using those models. With respect to panel-data models, the most commonly used autocorrelation test is the Wooldridge (2002) test. Under the null of no first-order autocorrelation (AR1), the errors from a regression of the first-differenced variables should have a first-order autocorrelation of −0.5. In Stata, this test is conducted via the command xtserial, which is demonstrated below.

. use "Balanced panel data - state.dta", clear
We include the option output to show the results of the regression of the first-differenced variables.

. xtserial lnnetuit lnstapr lnfte lnpc_income, output

Linear regression                               Number of obs     =      1,300
                                                F(3, 49)          =     266.10
                                                Prob > F          =     0.0000
                                                R-squared         =     0.3332
                                                Root MSE          =     .09355
                                (Std. Err. adjusted for 50 clusters in stateid)
------------------------------------------------------------------------------
             |               Robust
  D.lnnetuit |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     lnstapr |
         D1. |  -.2611843   .0510568    -5.12   0.000    -.3637868     -.1585818
             |
       lnfte |
         D1. |   .6437485   .1010357     6.37   0.000     .4407098      .8467873
             |
 lnpc_income |
         D1. |   1.377408    .060067    22.93   0.000     1.256699      1.498117
------------------------------------------------------------------------------

Wooldridge test for autocorrelation in panel data
H0: no first-order autocorrelation
    F(  1,      49) =     83.583
           Prob > F =     0.0000
We see from the results of the test that there is first-order autocorrelation in the errors of our model. Assuming all of the other assumptions of OLS hold, we would use a fixed- or random-effects model with an AR(1) disturbance term to generate our regression results.
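For intuition, the mechanics behind xtserial can be sketched by hand. The sketch below assumes the panel has already been declared with xtset stateid year; the residual name de is hypothetical, and the noconstant and cluster-robust choices simply mirror how the test is commonly implemented.

. quietly reg D.lnnetuit D.lnstapr D.lnfte D.lnpc_income, noconstant vce(cluster stateid)
. predict de, resid
. quietly reg de L.de, noconstant vce(cluster stateid)
. test L.de = -0.5

If the level errors were serially uncorrelated, the first-differenced errors would have an autocorrelation of −0.5, so rejecting this hypothesis mirrors the rejection reported by xtserial above.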
8.7 Panel-Data Regression Models with AR Terms
After we discover there is first-order autocorrelation in the errors in our regression model, we will need to include an AR disturbance term. In Stata, we do this by using the command xtregar for either a fixed-effects model (xtregar with the option fe) or a random-effects model (xtregar with the option re).

. xtregar lnnetuit lnstapr lnfte lnpc_income, fe

FE (within) regression with AR(1) disturbances  Number of obs     =      1,300
Group variable: stateid                         Number of groups  =         50
R-sq:                                           Obs per group:
     within  = 0.3472                                         min =         26
     between = 0.8435                                         avg =       26.0
     overall = 0.8380                                         max =         26
                                                F(3,1247)         =     221.11
corr(u_i, Xb)  = 0.5483                         Prob > F          =     0.0000
------------------------------------------------------------------------------
    lnnetuit |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     lnstapr |  -.2734652    .038685    -7.07   0.000      -.34936     -.1975704
       lnfte |   .7158662   .0699253    10.24   0.000      .578682      .8530503
 lnpc_income |   1.390187   .0669224    20.77   0.000     1.258894       1.52148
       _cons |   2.735897   .1438201    19.02   0.000     2.453741      3.018053
-------------+----------------------------------------------------------------
      rho_ar |  .85955152
     sigma_u |  .55845958
     sigma_e |  .09146923
     rho_fov |  .97387421   (fraction of variance because of u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(49,1247) = 8.76                   Prob > F = 0.0000
We see from the output that the AR1 parameter (rho_ar) is 0.86. It should be noted that the xtregar command is rather limited in several ways. First, it does not allow for the use of higher-order autoregressive (AR) disturbance terms. Second, there is no option to estimate robust standard errors. Third, it cannot take into account possible cross-sectional dependence in the data, which we will discuss later in the chapter. The use of xtregar, as shown above, is appropriate if we have stationary time series data in our panel. However, if we are uncertain our data are stationary, we should conduct a series of tests prior to using xtregar. Fortunately, there are several first-generation panel unit root tests (PURTs) we can choose from in Stata.6 Here we will use the Stata user-written routine xtpurt, the most recently developed and available second-generation PURT, which takes into account autocorrelation (Herwartz et al. 2018). Herwartz and Siedenburg (2008) contend that second-generation PURTs allow for cross-sectional error correlation. (The xtpurt routine, however, requires a balanced panel dataset.) We include the default option hs, reflecting the Herwartz and Siedenburg test.7

. xtpurt lnnetuit

Herwartz and Siedenburg (2008) unit-root test for lnnetuit
-----------------------------------------------------------
Ho: Panels contain unit roots            Number of panels   =  50
Ha: Panels are stationary                Number of periods  =  27
                                         After rebalancing  =  22
Constant: Included                       Prewhitening: BIC
Time trend: Not included                 Lag orders: min=0 max=4
------------------------------------------------------------------------------
        Name            Statistic          p-value
------------------------------------------------------------------------------
        t_hs               2.8239           0.9976
------------------------------------------------------------------------------
6. For more information on unit root tests for panel data, see the Stata Longitudinal Data/Panel Data Reference Manual, Release 16.
7. For information on the other tests, please see Herwartz et al. (2018).
. xtpurt lnstapr

Herwartz and Siedenburg (2008) unit-root test for lnstapr
----------------------------------------------------------
Ho: Panels contain unit roots            Number of panels   =  50
Ha: Panels are stationary                Number of periods  =  27
                                         After rebalancing  =  23
Constant: Included                       Prewhitening: BIC
Time trend: Not included                 Lag orders: min=0 max=3
------------------------------------------------------------------------------
        Name            Statistic          p-value
------------------------------------------------------------------------------
        t_hs               1.3166           0.9060
------------------------------------------------------------------------------

. xtpurt lnfte

Herwartz and Siedenburg (2008) unit-root test for lnfte
--------------------------------------------------------
Ho: Panels contain unit roots            Number of panels   =  50
Ha: Panels are stationary                Number of periods  =  27
                                         After rebalancing  =  23
Constant: Included                       Prewhitening: BIC
Time trend: Not included                 Lag orders: min=0 max=3
------------------------------------------------------------------------------
        Name            Statistic          p-value
------------------------------------------------------------------------------
        t_hs               0.6345           0.7371
------------------------------------------------------------------------------

. xtpurt lnpc_income

Herwartz and Siedenburg (2008) unit-root test for lnpc_income
--------------------------------------------------------------
Ho: Panels contain unit roots            Number of panels   =  50
Ha: Panels are stationary                Number of periods  =  27
                                         After rebalancing  =  22
Constant: Included                       Prewhitening: BIC
Time trend: Not included                 Lag orders: min=1 max=4
------------------------------------------------------------------------------
        Name            Statistic          p-value
------------------------------------------------------------------------------
        t_hs               0.2176           0.5861
------------------------------------------------------------------------------
The results of the unit root tests show the null hypothesis that the panels contain unit roots was not rejected for any of the variables, indicating nonstationary time series data in the panel. This suggests that we have to include first-differenced variables in our final fixed- or random-effects regression model with an AR1 disturbance term. We run this model using quietly (or qui for short) to omit the output of the regression results.

. qui xtregar D1.lnnetuit D1.lnstapr D1.lnfte D1.lnpc_income, re
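As a check on this choice, one could also verify that first-differencing removes the unit roots. A minimal sketch follows, assuming the panel has been xtset; explicit first-difference variables (hypothetical names) are generated and passed to xtpurt, and a rejection of the null would indicate that the differenced series are stationary.

. gen D1lnnetuit = D.lnnetuit
. gen D1lnstapr = D.lnstapr
. gen D1lnfte = D.lnfte
. gen D1lnpc_income = D.lnpc_income
. xtpurt D1lnnetuit, test(hs)
. xtpurt D1lnstapr, test(hs)
. xtpurt D1lnfte, test(hs)
. xtpurt D1lnpc_income, test(hs)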
However, we conduct a test to see if there is any remaining autocorrelation in the residuals. We do this by using the Cumby-Huizinga (C-H) general test for autocorrelation, which we discussed earlier in the chapter. First, we generate residuals from the model.

. predict ar_residuals_re, ue
(50 missing values generated)

Then we conduct the C-H autocorrelation general test of the residuals.

. actest ar_residuals_re, lags(10) q0 robust

Cumby-Huizinga test for autocorrelation
H0: disturbance is MA process up to order q
HA: serial correlation present at specified lags >q
-----------------------------------------------------------------------------
  H0: q=0 (serially uncorrelated)        |  H0: q=0 (serially uncorrelated)
  HA: s.c. present at range specified    |  HA: s.c. present at lag specified
-----------------------------------------+-----------------------------------
     lags  |     chi2     df     p-val   |  lag |     chi2     df     p-val
-----------+-----------------------------+------+----------------------------
    1 - 1  |    2.768      1    0.0962   |   1  |    2.768      1    0.0962
    1 - 2  |    2.901      2    0.2345   |   2  |    0.962      1    0.3268
    1 - 3  |    3.089      3    0.3781   |   3  |    0.064      1    0.7998
    1 - 4  |    5.877      4    0.2086   |   4  |    3.207      1    0.0733
    1 - 5  |    6.552      5    0.2562   |   5  |    0.000      1    0.9990
    1 - 6  |    7.300      6    0.2940   |   6  |    1.302      1    0.2539
    1 - 7  |    9.940      7    0.1920   |   7  |    2.166      1    0.1411
    1 - 8  |   11.225      8    0.1892   |   8  |    0.615      1    0.4331
    1 - 9  |   13.556      9    0.1390   |   9  |    0.827      1    0.3632
    1 - 10 |   13.583     10    0.1929   |  10  |    0.431      1    0.5115
-----------------------------------------------------------------------------
Test robust to heteroskedasticity
We see from the results of the test that autocorrelation of the model’s residuals is not present. Unfortunately, xtregar is limited in that its estimated standard errors are not robust to heteroscedasticity and cross-sectional dependence, which we will discuss in the next section.
8.8 Cross-Sectional Dependence
As discussed in Chap. 4, higher education policy analysis and evaluation also involve the use of cross-sectional data and time series/cross-sectional or panel data. When using those data, we may encounter situations where there is correlation between cases or units (e.g., institutions, states, nations), that is, cross-sectional dependence. This violates an implicit assumption of OLS regression: that the data are based on randomly and independently drawn samples from the population. Cross-sectional dependence may arise if observations
are not independently drawn, leading to those observations having an effect on each other’s outcomes. Common unobserved shocks due to state higher education policies may result in cross-sectional dependence among institutions. Some units of analysis, such as institutions or states, may also be highly interconnected across space, leading to cross-sectional dependence. The latter type of interconnectedness is based on spatial dependence or spatial autocorrelation.
8.8.1 Cross-Sectional Dependence—Unobserved Common Factors
When using regression models with cross-sectional dependence in our data, the effect of unobserved common factors may be transmitted through the residual or error. This may result in biased estimated standard errors and biased estimated beta coefficients. Therefore, when using cross-sectional or panel datasets consisting of units of analysis such as institutions or states, we should test for, and may have to correct for, possible nonspatial cross-sectional dependence or spatial autocorrelation before providing regression-based information to policymakers.
8.8.2 Tests to Detect Cross-Sectional Dependence—Unobserved Common Factors
There are several tests that use the common factor approach to detect cross-sectional dependence in panel data. These tests include the Pesaran (2004), Friedman (1937), and Frees (1995) tests, which were made available in Stata by De Hoyos and Sarafidis (2006). Each of these tests is based on the correlation coefficients of residuals from OLS regression models of time series data within each individual unit (e.g., institution, state, etc.) in a panel. After running a fixed-effects (xtreg, fe) or random-effects (xtreg, re) regression model in Stata, we can conduct the Pesaran, Friedman, and Frees tests by using the post-estimation commands xtcsd, pesaran; xtcsd, friedman; and xtcsd, frees, respectively. We demonstrate the use of all three tests below. First, we have to install the Stata user-written routine xtcsd (De Hoyos and Sarafidis 2006).

. ssc install xtcsd
checking xtcsd consistency and verifying not already installed
all files already exist and are up to date.

We change our working directory and open our dataset.
. cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 8\Stata files"

. use "Unbalanced panel data - institutional.dta"

We use the xtdescribe (or the shortened version, xtdes) command to get a sense of the distribution of observations per unit (i.e., institution) in the panel dataset.

. xtdes

opeid5_new:  1004, 1005, ..., 31703                        n =        203
   endyear:  2004, 2005, ..., 2013                         T =         10
             Delta(endyear) = 1 year
             Span(endyear)  = 10 periods
             (opeid5_new*endyear uniquely identifies each observation)

Distribution of T_i:   min     5%    25%    50%    75%    95%    max
                         8      8      9      9     10     10     10

     Freq.  Percent    Cum. |  Pattern
 ---------------------------+------------
       95     46.80   46.80 |  1111111111
       43     21.18   67.98 |  1.11111111
       33     16.26   84.24 |  1.1.111111
        7      3.45   87.68 |  111.111111
        7      3.45   91.13 |  1111111.11
        4      1.97   93.10 |  1.111.1111
        4      1.97   95.07 |  111.1.1111
        3      1.48   96.55 |  11111.1111
        2      0.99   97.54 |  1.11111.11
        5      2.46  100.00 |  (other patterns)
 ---------------------------+------------
      203    100.00         |  XXXXXXXXXX
From the output above, we see clearly that this is a slightly unbalanced panel dataset with observations per institution ranging from eight to 10 years. Next, we “quietly” run our fixed-effects regression model using the within regression estimator (xtreg, with the fe option).

. qui: xtreg lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac, fe

We run the Pesaran test and the Friedman test.

. xtcsd, pesaran

Pesaran's test of cross sectional independence =    82.069, Pr = 0.0000

. xtcsd, friedman

Friedman's test of cross sectional independence =   293.510, Pr = 0.0000
The results of both tests show that the null of cross-sectional independence is rejected (p < 0.001), which indicates cross-sectional dependence. The Frees test is also conducted.

. xtcsd, frees

Frees' test of cross sectional independence =    44.948
|--------------------------------------------------------
 Critical values from Frees' Q distribution
        alpha = 0.10 :   0.4892
        alpha = 0.05 :   0.6860
        alpha = 0.01 :   1.1046
The Frees test statistic of 44.948 is above the critical values of α at all levels shown, indicating a rejection of the null of cross-sectional independence. So, based on all three tests, we can say with some degree of certainty there is cross-sectional dependence.8

8. For more information on the use of these tests, see De Hoyos and Sarafidis (2006).

Using the common factor approach, Eberhardt (2011) extended the Stata routine by De Hoyos and Sarafidis and developed a cross-sectional dependence test (xtcd) that can be applied to variables in the pre-estimation rather than the post-estimation stage. Below, we show how this test can be conducted using a few variables from the same panel dataset. First, we download the most recent version of xtcd (Eberhardt 2011).

. ssc install xtcd, replace
checking xtcd consistency and verifying not already installed...
all files already exist and are up to date.

Then we run the test on variables of interest from the same panel dataset.

. xtcd lneg lntuition lnftfac lnptfac

Average correlation coefficients & Pesaran (2004) CD test
Variables series tested: lneg lntuition lnftfac lnptfac
Group variable: opeid5_new             Number of groups: 203
Average # of observations: 8.74        Panel is: unbalanced
---------------------------------------------------------
    Variable |  CD-test   p-value     corr    abs(corr)
-------------+-------------------------------------------
        lneg |   362.27     0.000    0.859        0.871
-------------+-------------------------------------------
   lntuition |   351.21     0.000    0.833        0.866
-------------+-------------------------------------------
     lnftfac |    90.10     0.000    0.212        0.531
-------------+-------------------------------------------
     lnptfac |    83.18     0.000    0.194        0.453
---------------------------------------------------------
Notes: Under the null hypothesis of cross-section independence CD ~ N(0,1)
As we can see from the results above, the null hypotheses of cross-sectional independence are rejected for all of the variables. If we cannot or choose not to include all of the variables at one time, we can test the residuals from a regression model. Using the variables that we included in the fixed-effects
model above, we employ a random-effects regression model and apply the test to the residuals.

. qui xtreg lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac, re

. predict ue_residuals_re, ue

. xtcd ue_residuals_re

Average correlation coefficients & Pesaran (2004) CD test
Variables series tested: ue_residuals_re
Group variable: opeid5_new             Number of groups: 203
Average # of observations: 8.74        Panel is: unbalanced
---------------------------------------------------------
      Variable |  CD-test   p-value     corr    abs(corr)
---------------+-----------------------------------------
   ue_residuae |   144.14     0.000    0.338        0.544
---------------------------------------------------------
Notes: Under the null hypothesis of cross-section independence CD ~ N(0,1)

We can see from the above results of the test that there is cross-sectional dependence. Thus far, the tests we have discussed are based on “strong” correlation of the residuals between units in a panel. In other words, the correlation converges to a constant as the number of units approaches infinity (Pesaran 2004). If the correlation approaches zero as the number of units approaches infinity, then we have what is called a “weak” correlation (Pesaran 2015). The Stata routine xtcd2, which implements the Pesaran (2015) test, allows us to test for weak cross-sectional dependence. After installing the xtcd2 routine, this is shown below.

. ssc install xtcd2, replace
checking xtcd2 consistency and verifying not already installed...
the following files will be replaced:
    c:\ado\plus\x\xtcd2.ado
    c:\ado\plus\x\xtcd2.sthlp
installing into c:\ado\plus\...
installation complete.

. quietly: xtreg lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac, fe

. xtcd2

Pesaran (2015) test for weak cross-sectional dependence.
Residuals calculated using predict, e from xtreg.
Unbalanced panel detected, test adjusted.
H0: errors are weakly cross-sectional dependent.
CD      =    77.124
p-value =     0.000
The test results indicate there is at least weak cross-sectional dependence. Finally, another Stata user-written program, xtcdf (Wursten 2017), allows for a much faster estimation of the Pesaran cross-sectional dependence test and provides additional statistics. The xtcdf routine also enables us to conduct a test on several variables as well as on the residuals from a regression model. As customary, we first install the most recent version of the Wursten-written Stata routine.

. ssc install xtcdf, replace
checking xtcdf consistency and verifying not already installed...
all files already exist and are up to date.

Then we “quietly” run our fixed-effects regression model and generate the residuals.

. qui xtreg lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac, fe
. predict ue_residuals_fe, ue

We conduct the test, which includes the variables as well as the residuals from the fixed-effects regression.

. xtcdf lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac ue_residuals_fe

The output from the test is shown below.

xtcd test on variables lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac ue_residuals_fe
Panelvar: opeid5_new
Timevar: endyear
-------------------------------------------------------------------------------+
      Variable  |  CD-test   p-value   average joint T  |  mean ρ   mean abs(ρ) |
----------------+---------------------------------------+-----------------------|
          lneg  +  362.268     0.000              8.69  +    0.86          0.87 |
      lnstatea  +  147.015     0.000              8.69  +    0.35          0.49 |
     lntuition  +  351.212     0.000              8.69  +    0.83          0.87 |
 lntotfteiarep  +   154.81     0.000              8.69  +    0.36          0.56 |
       lnftfac  +   90.103     0.000              8.69  +    0.21          0.53 |
       lnptfac  +   83.181     0.000              8.69  +    0.19          0.45 |
 ue_residualfe  +   82.069     0.000              8.69  +    0.19          0.48 |
-------------------------------------------------------------------------------+
Notes: Under the null hypothesis of cross-section independence, CD ~ N(0,1)
P-values close to zero indicate data are correlated across panel groups.
From the results of the test we can see there is at least weak cross-sectional dependence across all the variables and residuals from the fixed-effects regression model. The output also shows the mean and mean absolute correlation (ρ) between institutions.
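Before turning to estimators that address this dependence, it may help to note the form of the CD statistic that these tests keep reporting. In its balanced-panel form (Pesaran 2004), it aggregates the pairwise correlations ρ̂ij of the within-unit residuals over N units observed for T periods (the unbalanced-panel versions used by xtcd and xtcdf adjust the weights accordingly):

CD = sqrt(2T / (N(N − 1))) × Σ(i=1 to N−1) Σ(j=i+1 to N) ρ̂ij,

which is approximately distributed N(0,1) under the null of cross-sectional independence.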
8.9 Panel Regression Models That Take Cross-Sectional Dependency into Account
In the previous section, we discussed and demonstrated how to detect one type (unobserved common factors) of cross-sectional dependence. After discovering this type of cross-sectional dependence, what is the most appropriate regression model to use? It depends. If the number of periods (T) is greater than or equal to the number of panels (m), then a regression model that is estimated via feasible generalized least squares (FGLS) is the most appropriate. In higher education policy research, however, T ≥ m is rarely the case. Additionally, Stata’s FGLS regression command xtgls can only be used with balanced panel datasets when taking into account correlated panels or cross-sectional dependence. Consequently, regression models with Driscoll and Kraay (1998) standard errors are the most appropriate. The most recent routine for estimating regression models with Driscoll and Kraay (D-K) standard errors was made available for use in Stata by Hoechle (2018) and can be downloaded by typing ssc install xtscc, replace. A variety of regression models with D-K standard errors can be estimated: pooled OLS, weighted least squares (WLS), fixed-effects (within), or generalized least squares (GLS) random-effects. In addition to being robust to cross-sectional dependence, D-K standard errors are also robust to heteroscedasticity and to autocorrelation with higher-order lags. Using our institution-level panel data, we demonstrate their use below with a fixed-effects regression model.

. use "Unbalanced panel data - institutional.dta", clear

. xtscc lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac, fe lag(2)

Regression with Driscoll-Kraay standard errors  Number of obs     =       1875
Method: Fixed-effects regression                Number of groups  =        203
Group variable (i): opeid5_new                  F(  5,     9)     =     624.23
maximum lag: 2                                  Prob > F          =     0.0000
                                                within R-squared  =     0.6572
-------------------------------------------------------------------------------
              |             Drisc/Kraay
         lneg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
     lnstatea |   .0142957   .0101544     1.41   0.193    -.0086751      .0372664
    lntuition |    .570156    .026773    21.30   0.000     .5095913      .6307208
lntotfteiarep |   .1831277   .0617837     2.96   0.016     .0433633      .3228921
      lnftfac |   .6344542   .1151675     5.51   0.000     .3739271      .8949813
      lnptfac |   .0306096   .0038811     7.89   0.000     .0218299      .0393893
        _cons |   2.454215   .4902524     5.01   0.001     1.345188      3.563243
-------------------------------------------------------------------------------
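The same routine can also fit the pooled OLS variant mentioned above; a minimal sketch (in xtscc, omitting the fe and re options yields pooled OLS with D-K standard errors):

. xtscc lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac, lag(2)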
Next, we demonstrate the use of a random-effects model with D-K standard errors.
. xtscc lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac, re lag(2)

Regression with Driscoll-Kraay standard errors  Number of obs     =       1875
Method: Random-effects GLS regression           Number of groups  =        203
Group variable (i): opeid5_new                  Wald chi2(5)      =   33301.28
maximum lag: 2                                  Prob > chi2       =     0.0000
corr(u_i, Xb)  = 0 (assumed)                    overall R-squared =     0.8692
-------------------------------------------------------------------------------
              |             Drisc/Kraay
         lneg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
     lnstatea |   .0186992    .008593     2.18   0.058    -.0007394      .0381378
    lntuition |   .5332011   .0532264    10.02   0.000     .4127946      .6536076
lntotfteiarep |    .056763   .0708909     0.80   0.444    -.1036034      .2171294
      lnftfac |   .3563393   .0977645     3.64   0.005     .1351806       .577498
      lnptfac |   .0448512   .0114099     3.93   0.003     .0190403      .0706622
        _cons |   5.659183   .5206644    10.87   0.000     4.481358      6.837008
--------------+----------------------------------------------------------------
      sigma_u |   .1600267
      sigma_e |  .12709341
          rho |  .61321262   (fraction of variance due to u_i)
-------------------------------------------------------------------------------
When we include year fixed effects in a fixed-effects or random-effects regression model with D-K standard errors, there appears to be no cross-sectional dependence in the residuals.

. qui xtscc lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac i.endyear, fe lag(2)
. predict xtscc_residuals_fe2y, resid
. xtcdf xtscc_residuals_fe2y

xtcd test on variables xtscc_residuals_fe2y
Panelvar: opeid5_new
Timevar: endyear
-------------------------------------------------------------------------------+
      Variable  |  CD-test   p-value   average joint T  |  mean ρ   mean abs(ρ) |
----------------+---------------------------------------+-----------------------|
 xtscc_resid2y  +     .266     0.790              8.69  +    0.00          0.41 |
-------------------------------------------------------------------------------+
Notes: Under the null hypothesis of cross-section independence, CD ~ N(0,1)
P-values close to zero indicate data are correlated across panel groups.

. qui xtscc lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac i.endyear, re lag(2)
. predict xtscc_residuals_re2y, resid
. xtcdf xtscc_residuals_re2y

xtcd test on variables xtscc_residuals_re2y
Panelvar: opeid5_new
Timevar: endyear
-------------------------------------------------------------------------------+
      Variable  |  CD-test   p-value   average joint T  |  mean ρ   mean abs(ρ) |
----------------+---------------------------------------+-----------------------|
 xtscc_resre2y  +     .417     0.677              8.69  +    0.00          0.41 |
-------------------------------------------------------------------------------+
Notes: Under the null hypothesis of cross-section independence, CD ~ N(0,1)
P-values close to zero indicate data are correlated across panel groups.
As a final comparison of the estimated coefficients of interest to policy analysts or researchers, we run and store the results of three regression
models: (1) a fixed-effects model without year fixed-effects; (2) a fixed-effects model with year fixed-effects; and (3) a fixed-effects model with year fixed-effects and D-K standard errors.

. eststo: qui xtreg lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac, fe
(est1 stored)

. eststo: qui xtreg lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac i.endyear, fe
(est2 stored)

. eststo: qui xtscc lneg lnstatea lntuition lntotfteiarep lnftfac lnptfac i.endyear, fe lag(2)
(est3 stored)

Then we use Stata’s esttab command (with the label, p[(fmt)], and keep options as well as the estout varwidth option) to create a table of the stored regression results to compare the estimated beta coefficients of variables of interest across the three models.

. esttab, label keep(lnstatea lntuition lntotfteiarep lnftfac lnptfac) varwidth(30) beta(%8.3f)
(tabulating estimates stored by eststo; specify "." to tabulate the active results)
------------------------------------------------------------------------------
                                      (1)             (2)             (3)
                                  log(eg)         log(eg)         log(eg)
------------------------------------------------------------------------------
Log of state appropriations        0.040**         0.019*          0.019
                                   (3.07)          (2.05)          (0.77)
Log of tuition revenue             0.714***        0.070**         0.070*
                                  (34.83)          (3.04)          (2.91)
Log of FTE students                0.178***        0.073**         0.073
                                   (4.67)          (2.60)          (1.96)
Log of full-time faculty           0.554***        0.421***        0.421***
                                  (13.19)         (13.54)          (7.48)
Log of part-time faculty           0.051***       -0.010          -0.010
                                   (3.76)         (-1.00)         (-1.91)
------------------------------------------------------------------------------
Observations                         1875            1875            1875
------------------------------------------------------------------------------
Standardized beta coefficients; t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001

variables in mean group regression =  450       R-squared         =       0.26
variables partialled out           =  800       R-squared (MG)    =       0.60
                                                Root MSE          =       0.04
                                                CD Statistic      =      -1.33
                                                p-value           =     0.1828
-------------------------------------------------------------------------------
        D.lny1 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
Short Run Est. |
---------------+----------------------------------------------------------------
Mean Group:    |
       LD.lny1 |   .0247934   .0480515     0.52   0.606    -.0693857      .1189726
       LD.lnx1 |  -.0290142   .0436854    -0.66   0.507     -.114636      .0566075
       LD.lnx2 |    .011547   .1082887     0.11   0.915     -.200695       .223789
       LD.lnx4 |   .2466661   .1515284     1.63   0.104    -.0503242      .5436563
---------------+----------------------------------------------------------------
Long Run Est.  |
---------------+----------------------------------------------------------------
Mean Group:    |
            ec |  -.7831215   .0569085   -13.76   0.000    -.8946601     -.6715829
          lnx1 |   -.555794    .240716    -2.31   0.021    -1.027589     -.0839993
          lnx2 |   .5886828   .3961468     1.49   0.137    -.1877507      1.365116
          lnx4 |   .4912524   .2131553     2.30   0.021     .0734756      .9090292
         _cons |  -4.048029   4.794766    -0.84   0.399     -13.4456       5.34954
-------------------------------------------------------------------------------
Mean Group Variables: LD.lny1 LD.lnx1 LD.lnx2 LD.lnx4 _cons
Cross-sectional Averaged Variables: lny1(3) lnx1(3) lnx2(3) lnx4(3)
Long Run Variables: ec lnx1 lnx2 lnx4 _cons
Cointegration variable(s): L.lny1

Estimation of Cross-Sectional Exponent (alpha)
--------------------------------------------------------------
      variable |      alpha   Std. Err.     [95% Conf. Interval]
---------------+----------------------------------------------
     residuals |   .1295795   .0096851      .1105971    .1485619
--------------------------------------------------------------

These statistical techniques include heterogeneous coefficient regression (HCR) with dynamic coefficient common correlated estimation (DCCE) and mean group (MG) estimators, which allow for distinguishing between short-run and long-run relationships between variables.
These techniques also enable analysts to examine adjustment to shocks to the long-run or “equilibrium” relationships between policy variables. Finally, this chapter showed that HCR with DCCE and MG estimators also allows for state-specific estimates of short-run, long-run, and error-correction (EC) coefficients.
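For readers who want the mechanics behind these statements, a generic error-correction sketch (not the author's exact specification) of the mean group model reported above is

Δyit = φi (yi,t−1 − θ0i − θ1i x1,it − θ2i x2,it − θ4i x4,it) + short-run and cross-sectional average terms + εit,

where the term in parentheses is the deviation from the long-run (equilibrium) relationship and φi is the error-correction (ec) coefficient. In the output above, the average ec estimate of about −0.78 implies that roughly three-quarters of a shock to the equilibrium relationship is corrected within a year.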
9.5 Appendix
*Chapter 9 Stata Syntax
*create Fig. 9.1. Trends in Log of Appropriations by State, FY 1980 to FY 2018
twoway (line lny1 fy), by(State) xlabel(1980 (8) 2018, labsize(small)) ///
 ytitle(Logs) ytitle(Log of Appropriations) xtitle(Fiscal Year)
*create Fig. 9.2. Trends in Log of GSP by State, FY 1980 to FY 2018
twoway (line lnx4 fy), by(State) xlabel(1980 (8) 2018, labsize(small)) ///
 ytitle(Logs) ytitle(Log of GSP) xtitle(Fiscal Year)
*Use the Stata routine xtpurt, with test options proposed by Herwartz and ///
 Siedenburg (2008), Demetrescu and Hanck (2012), and ///
 Herwartz et al. (2019). In the three test options, the null ///
 hypothesis is that the panels (i.e., states) contain non-stationary data ///
 or unit roots.
* xtpurt, with test options proposed by Herwartz and Siedenburg (hs)
xtpurt lny1, test(hs)
xtpurt lnx1, test(hs)
xtpurt lnx2, test(hs)
xtpurt lnx4, test(hs)
* xtpurt, with test options proposed by Demetrescu and Hanck (dh)
xtpurt lny1, test(dh)
xtpurt lnx1, test(dh)
xtpurt lnx2, test(dh)
xtpurt lnx4, test(dh)
* xtpurt, with test options proposed by Herwartz, Maxand, and Walle (hmw)
xtpurt lny1, test(hmw) trend
xtpurt lnx1, test(hmw) trend
xtpurt lnx2, test(hmw) trend
xtpurt lnx4, test(hmw) trend
* xtpurt, with all test options with first-differences (D1)
xtpurt D1lny1, test(all)
xtpurt D1lnx1, test(all)
xtpurt D1lnx2, test(all)
xtpurt D1lnx4, test(all)
* xtcointtest - tests for cointegration
*test for no cointegration with and without demeaning the data ///
 (first subtracting the cross-sectional averages from the series)
xtcointtest kao lny1 lnx1 lnx2 lnx4
xtcointtest kao lny1 lnx1 lnx2 lnx4, demean
xtcointtest pedroni lny1 lnx1 lnx2 lnx4
xtcointtest pedroni lny1 lnx1 lnx2 lnx4, demean
xtcointtest westerlund lny1 lnx1 lnx2 lnx4
xtcointtest westerlund lny1 lnx1 lnx2 lnx4, demean
*ECM-based cointegration test, developed by Westerlund (2007), that is robust ///
 to structural breaks in the intercept and slope of the cointegrated ///
 regression, serial correlation, and heteroscedasticity.
xtwest lny1 lnx1 lnx2 lnx4, constant lags(0 3)
*Tests using Stata user-written routine xtcdf (Wursten 2017) for ///
 cross-sectional independence, using updated version
ssc install xtcdf, replace
xtcdf lny1 lnx1 lnx2 lnx4
*Tests of homogeneous coefficients utilize the Stata user-written ///
 xthst (Ditzen and Bersvendsen 2020) routine
ssc install xthst, replace
xthst D1.lny1 D1.L1.lny1 D1.lnx1 D1.lnx2 D1.lnx4, hac whitening
xthst lny1 L1.lny1 lnx1 lnx2 lnx4, hac whitening
*HCR with DCCE and MG estimators
*using the Stata user-written xtdcce2 (Ditzen 2018b)
search xtdcce2, all
*click on st0536, then install or type: net install st0536.pkg, replace
*run an autoregressive model with distributed lags (ARDL) of (1 1 1) and ///
 cross-sectional lags (3 3 3 3) within an ECM framework
xtdcce2 D1.lny1 L1.D1.lny1 L1.D1.lnx1 L1.D1.lnx2 L1.D1.lnx4, reportc ///
 cr(_all) cr_lags(3 3 3 3) lr(L1.lny1 lnx1 lnx2 lnx4) lr_options(ardl)
*Pesaran (2015) test for weak cross-sectional dependence
xtcd2
*run xtdcce2 with the options lr(xtpmg) and exponent
xtdcce2 D1.lny1 L1.D1.lny1 L1.D1.lnx1 L1.D1.lnx2 L1.D1.lnx4, reportc ///
 cr(_all) cr_lags(3 3 3 3) lr(L1.lny1 lnx1 lnx2 lnx4) lr_options(xtpmg) exponent
*If we want to see the estimates for the individual states, then we include the ///
 option showindividual.
xtdcce2 D1.lny1 L1.D1.lny1 L1.D1.lnx1 L1.D1.lnx2 L1.D1.lnx4, ///
 reportc cr(_all) cr_lags(1 3 3 3) lr(L1.lny1 lnx1 lnx2 lnx4) ///
 lr_options(ardl) exponent showin
*end
References Baltagi, B. (2008). Econometric analysis of panel data. John Wiley & Sons. Blomquist, J., & Westerlund, J. (2013). Testing slope homogeneity in large panels with serial correlation. Economics Letters, 121 (3), 374–378. Cheslock, J. J., & Rios-Aguilar, C. (2011). Multilevel analysis in higher education research: A multidisciplinary approach. In J. Smart & M. B. Paulsen (Eds.), Higher education: Handbook of theory and research (Vol. 46, pp. 85–123). Springer. Chudik, A., & Pesaran, M. H. (2015). Common correlated effects estimation of heterogeneous dynamic panel data models with weakly exogenous regressors. Journal of Econometrics, 188 (2), 393–420. Chudik, A., Pesaran, M. H., & Tosetti, E. (2011). Weak and strong cross-section dependence and estimation of large panels. The Econometrics Journal, 14 (1), C45– C90. Demetrescu, M., & Hanck, C. (2012). A simple nonstationary-volatility robust panel unit root test. Economics Letters, 117 (1), 10–13. Ditzen, J. (2016). xtdcce: Estimating Dynamic Common Correlated Effects in Stata. In SEEC Discussion Papers (No. 1601; SEEC Discussion Papers). Spatial Economics and
Econometrics Centre, Heriot-Watt University.
Ditzen, J. (2018a). Cross-country convergence in a general Lotka–Volterra model. Spatial Economic Analysis, 13(2), 191–211.
Ditzen, J. (2018b). Estimating dynamic common-correlated effects in Stata. The Stata Journal, 18(3), 585–617.
Ditzen, J., & Bersvendsen, T. (2020). XTHST: Stata module to test slope homogeneity in large panels. Statistical Software Components, Boston College Department of Economics.
Engle, R. F., & Granger, C. W. J. (1987). Co-integration and error correction: Representation, estimation, and testing. Econometrica, 55(2), 251–276.
Herwartz, H., & Siedenburg, F. (2008). Homogenous panel unit root tests under cross sectional dependence: Finite sample modifications and the wild bootstrap. Computational Statistics & Data Analysis, 53(1), 137–150.
Herwartz, H., Maxand, S., & Walle, Y. M. (2019). Heteroskedasticity-robust unit root testing for trending panels. Journal of Time Series Analysis, 40(5), 649–664.
Hildreth, C., & Houck, J. P. (1968). Some estimators for a linear model with random coefficients. Journal of the American Statistical Association, 63(322), 584–595.
Hsiao, C. (1975). Some estimation methods for a random coefficient model. Econometrica: Journal of the Econometric Society, 43(2), 305–325.
Kapetanios, G., Pesaran, M. H., & Yamagata, T. (2011). Panels with non-stationary multifactor error structures. Journal of Econometrics, 160(2), 326–348.
Liddle, B. (2017). Accounting for nonlinearity, asymmetry, heterogeneity, and cross-sectional dependence in energy modeling: US state-level panel analysis. Economies, 5(3), 30.
Passamani, G., & Tomaselli, M. (2018). Air pollution and health risks: A statistical analysis aiming at improving air quality in an Alpine Italian province. In C. H. Skiadas & C. Skiadas (Eds.), Demography and health issues: Population aging, mortality and data analysis (pp. 199–216). Springer International Publishing.
Patel, P. C. (2019). Minimum wage and transition of non-employer firms intending to hire employees into employer firms: State-level evidence from the US. Journal of Business Venturing Insights, 12, e00136.
Pesaran, M. H. (2006). Estimation and inference in large heterogeneous panels with a multifactor error structure. Econometrica, 74(4), 967–1012.
Pesaran, M. H. (2015a). Testing weak cross-sectional dependence in large panels. Econometric Reviews, 34(6–10), 1089–1117.
Pesaran, M. H. (2015b). Time series and panel data econometrics. Oxford University Press.
Pesaran, M. H., Shin, Y., & Smith, R. P. (1999). Pooled mean group estimation of dynamic heterogeneous panels. Journal of the American Statistical Association, 94(446), 621–634.
Pesaran, M. H., & Smith, R. (1995). Estimating long-run relationships from dynamic heterogeneous panels. Journal of Econometrics, 68(1), 79–113.
Pesaran, M. H., Ullah, A., & Yamagata, T. (2008). A bias-adjusted LM test of error cross-section independence. The Econometrics Journal, 11(1), 105–127.
Pesaran, M. H., & Yamagata, T. (2008). Testing slope homogeneity in large panels. Journal of Econometrics, 142(1), 50–93.
Swamy, P. A. (1970). Efficient inference in a random coefficient regression model. Econometrica: Journal of the Econometric Society, 311–323.
Wells, R. S., Kolek, E. A., Williams, E. A., & Saunders, D. B. (2015). "How we know what we know": A systematic comparison of research methods employed in higher education journals, 1996–2000 v. 2006–2010. The Journal of Higher Education, 86(2), 171–198.
Westerlund, J. (2005). New simple tests for panel cointegration. Econometric Reviews, 24(3), 297–316.
Westerlund, J. (2007). Testing for error correction in panel data. Oxford Bulletin of Economics and Statistics, 69(6), 709–748.
Wursten, J. (2017). XTCDF: Stata module to perform Pesaran's CD-test for cross-sectional dependence in panel context. Statistical Software Components, Boston College Department of Economics.
Chapter 10
Presenting Analyses to Policymakers
Abstract This chapter discusses and demonstrates how we can prepare analyses for presentation to higher education policymakers. The chapter details how to present descriptive statistics in a user-friendly Microsoft Word document format. It also shows how we can use choropleth maps to illustrate data spatially and demonstrates how graphs and tables of regression results and marginal effects are created.
Keywords Tables · Choropleth maps · Graphs · Marginal effects
10.1
Introduction
The analyses that were discussed and demonstrated in the previous chapters range from simple descriptive statistics to advanced statistical techniques. The consumers of the results of these analyses are also varied and include, but are not limited to, policymakers. However, many analysts target their work toward policymakers. Therefore, it is necessary to produce policymaker-friendly presentations. Using some of the routines in Stata, this chapter demonstrates how we can accomplish this critical part of higher education policy analysis and evaluation. These routines, commands, and syntax are included in an appendix at the end of the chapter.
10.2
Presenting Descriptive Statistics
Descriptive statistics are the most common form of quantitative analysis presented to higher education policymakers. Although these analyses are rather basic, care should still be taken to present the data to policymakers in a clear way. The presentation of descriptive statistics should highlight key points, such as patterns and trends. This requires the use of tables, charts, and graphs. Tables with lots of variables and data should be avoided. Charts and graphs should be clearly labeled and uncluttered. Below, we demonstrate the use of Stata commands and routines to create presentation-ready tables, charts, and graphs displaying descriptive statistics for policymakers and others.
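As a simple illustration of that guidance, the sketch below draws a single, clearly labeled trend line for one state. It is not taken from the chapter's worked examples; the variable names (y_pop for appropriations per capita, fy for fiscal year, and fips for the state FIPS code) are assumptions that follow the naming used later in the chapter.

*an illustrative, uncluttered trend chart for a single state (Maryland, fips 24)
twoway (line y_pop fy if fips==24), ///
    ytitle("State appropriations per capita ($)") xtitle("Fiscal year") ///
    title("Maryland: State Appropriations per Capita", size(medium))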
10.2.1
Descriptive Statistics in Microsoft Word Tables
The Stata user-written module asdoc (Shah 2019) is one of the most comprehensive routines for creating presentation-ready tables in Microsoft Word. For the most recent version of asdoc, in Stata, type:
net install asdoc, from(http://fintechprofessor.com) replace
To get a sense of the comprehensive nature of the asdoc module, type:
help asdoc
In this demonstration, we will use data (supplemented with state tax revenue and personal income data) from the previous chapter. First, we change our working directory to where we want to save our tables.
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 10\Tables"
Then we invoke the sum command for the previously noted variables of interest: state appropriations (y); net tuition revenue (x1); full-time equivalent enrollment (x2); state total personal income (x3); gross state product (x4); and state tax revenue (x5).

. sum y x1 x2 x3 x4 x5

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
           y |     1,900    9.73e+08    1.45e+09   1.81e+07   1.57e+10
          x1 |     1,900    5.57e+08    7.38e+08    7900000   5.22e+09
          x2 |     1,900    178357.1    213093.8      10530    1639923
          x3 |     1,900    1.60e+11    2.28e+11   4.02e+09   2.26e+12
          x4 |     1,900    1.87e+11    2.73e+11   4.40e+09   2.66e+12
-------------+---------------------------------------------------------
          x5 |     1,900    1.70e+10    2.58e+10   4.60e+08   2.43e+11
With the exception of x2, most of the variables have values that are displayed in scientific notation. Therefore, before we can create a presentation-ready table, the data for y, x1, x3, x4, and x5 need to be rescaled to millions. We can either create rescaled variables by hand or use the Stata user-written routine rescale to create them automatically. To do the latter, type the following:
net install rescale, from(http://digital.cgdev.org/doc/stata/MO/Misc) replace
To rescale y, x1, x3, x4, and x5 into millions, we use replace and the millions option.
rescale y, millions
rescale x1, millions
rescale x3, millions
rescale x4, millions
rescale x5, millions
We then rerun the sum command and see the results below.

. sum y x1 x2 x3 x4 x5

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
           y |     1,900    973.1793    1452.862       18.1   15692.18
          x1 |     1,900    556.8281    738.1183        7.9   5216.492
          x2 |     1,900    178357.1    213093.8      10530    1639923
          x3 |     1,900    159660.9    228314.1     4015.1    2263890
          x4 |     1,900    187043.9    273491.1     4398.6    2657798
-------------+---------------------------------------------------------
          x5 |     1,900    16974.89    25750.39    459.909   243082.1
Above we see the values are no longer in scientific notation, but they are still not what we typically present to policymakers and other users. Each variable should also be normalized in a way that makes it comparable across states and over time. For example, state appropriations (y) should be divided by either population or FTE enrollment. Net tuition revenue (x1) should be divided by full-time equivalent (FTE) enrollment. State total personal income (x3), gross state product (x4), and state tax revenue (x5) should be divided by population. Tandberg and Griffith (2013) suggest that state appropriations per capita is a measure of adequacy or effort that is easily understood by policymakers and the general public. However, they also caution that the measure is limited in that larger-population states are not always higher-income states. In most cases, users would like to see one or two statistics, the mean and the median. Generally, we also do not want to include decimal places, but we do want to include commas (format(%9.0fc)). We use the Stata command tabstat (for documentation, type help tabstat), with the options below, to produce the following:
. tabstat y_pop x1fte x3_pop x4_pop x5_pop, statistics(mean median) column(statistics) format(%9.0fc)

    variable |      mean       p50
-------------+--------------------
       y_pop |       201       189
       x1fte |     4,178     3,479
      x3_pop |    32,641    31,754
      x4_pop |    38,230    37,070
      x5_pop |     3,475     3,269
----------------------------------
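The per-capita and per-FTE variables summarized above (y_pop, x1fte, x3_pop, x4_pop, and x5_pop) are not created in the listing. A minimal sketch of how they might be generated is shown below; it assumes the dataset contains a state population variable (here called pop), that the rescaled y, x1, x3, x4, and x5 are in millions of dollars, and that x2 is FTE enrollment.

*generate per-capita and per-FTE measures (pop is an assumed state population variable)
gen y_pop  = (y*1000000)/pop    //state appropriations per capita
gen x1fte  = (x1*1000000)/x2    //net tuition revenue per FTE enrollment
gen x3_pop = (x3*1000000)/pop   //state personal income per capita
gen x4_pop = (x4*1000000)/pop   //gross state product per capita
gen x5_pop = (x5*1000000)/pop   //state tax revenue per capita
lab var y_pop  "State appropriations per capita"
lab var x1fte  "Net tuition revenue per FTE enrollment"
lab var x3_pop "State personal income per capita"
lab var x4_pop "Gross state product per capita"
lab var x5_pop "State tax revenue per capita"

Labeling the new variables here matters because asdoc, with the label option used below, picks up these labels as the row names of the Word table.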
We combine the use of asdoc and tabstat to make this table presentation-ready with the necessary table title and variable labels. To ensure that we are replacing any existing table with the same name, we include the option replace. Because we want to show variable labels with long names, we also include the option abb(.).

. asdoc tabstat y_pop x1fte x3_pop x4_pop x5_pop, statistics(mean median) column(statistics) format(%9.0fc) dec(0) long title(Table 10.1 Descriptive Statistics) save(Table 10.1.doc) label abb(.) replace

             |      mean       p50
-------------+---------------------
       y_pop |  201.0187  189.0372
       x1fte |  4177.586  3479.234
      x3_pop |  32641.16   31753.6
      x4_pop |  38230.36   37069.8
      x5_pop |  3474.925  3268.786

(note: file Table 10.1.doc not found)
Click to Open File: Table 10.1.doc

When we click to open Table 10.1, we see the following:
See Table 10.1. Many state higher education policymakers, however, may be interested in how their particular state compares to other states on several different indicators or metrics. For example, state policymakers in Maryland may want to know how their state compares to the rest of the nation. The policy analyst may have to create a categorical variable representing Maryland (MD) and create a table.

Table 10.1 Descriptive statistics

                                            Mean    Median
State appropriations per capita              201       189
Net tuition revenue per FTE enrollment     4,178     3,479
State personal income per capita          32,641    31,754
Gross state product per capita            38,230    37,070
State tax revenue per capita               3,475     3,269

Note: The current version of asdoc does not produce tables with numbers that include commas in the Word file, so we later added the commas to the numbers in the table.
gen MD=0
lab var MD "Comparisons"
replace MD=1 if fips==24
label define MD1 1 "Maryland" 0 "All Other States"
label values MD MD1

It is also useful to create a categorical variable that reflects different time periods. In this example, we create a variable decade and code and label it accordingly.

gen decade=0
lab var decade "Decades"
replace decade=1 if fy>=1980 & fy<=1989
replace decade=2 if fy>=1990 & fy<=1999
replace decade=3 if fy>=2000 & fy<=2009
replace decade=4 if fy>=2010 & fy<=2018
label define decade1 1 "1980 to 1989" 2 "1990 to 1999" 3 "2000 to 2009" 4 "2010 to 2018"
label values decade decade1
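With MD defined, one way to answer the comparison question is to pass it to tabstat through its by() option; this is a sketch rather than an example reproduced from the text, and the choice of statistics shown is ours. Wrapping the call with asdoc, as above, would again send the result to a Word table.

*compare Maryland with all other states on the normalized measures
tabstat y_pop x1fte x3_pop x4_pop x5_pop, by(MD) statistics(mean median) ///
    format(%9.0fc) nototal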
-------------------------------------------------------------------------------------
                     |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------------+---------------------------------------------------------------
         LDlnstateap |  -.0097169   .0411761    -0.24   0.813    -.0904206    .0709869
             LDlnfte |   .1521267   .0761399     2.00   0.046     .0028952    .3013583
          LDlnperinc |   -.466823   .1204983    -3.87   0.000    -.7029952   -.2306507
   __00000M_Dlnnetut |   .9814087   .1471938     6.67   0.000     .6929142    1.269903
__00000L_LDlnstateap |   .0389984   .0805769     0.48   0.628    -.1189294    .1969263
    __00000L_LDlnfte |  -.1537738   .2039478    -0.75   0.451     -.553504    .2459565
 __00000L_LDlnperinc |   .4572597   .1699901     2.69   0.007     .1240854    .7904341
               _cons |   .0011686   .0062927     0.19   0.853    -.0111648     .013502
--------------------------------------------------------------------------------------
Root Mean Squared Error (sigma): 0.0676
Cross-section averaged regressors are marked by the suffix: _Dlnnetut, _LDlnstateap, _LDlnfte, _LDlnperinc respectively.
We repeat the Stata syntax to extract the estimated coefficients from the matrix produced by the regression model with the CCEMG estimator:
mata: st_matrix("e(box)", (st_matrix("e(b)") :- 2 \st_matrix("e(b)") :+ 2))
Then we modify the coefplot syntax to include the variables of interest from the regression model with the CCEMG estimator. We also change the orientation from horizontal to vertical, add titles, and bold the text we want to bring attention to in the graph.3

coefplot, xline(0) keep(LDlnstateap LDlnfte LDlnperinc) mlabel format(%9.2g) mlabposition(0) msymbol(i) ciopts(recast(. rbar) barwidth(. 0.35) fcolor(. white) lwidth(. medium)) rescale(10) levels(95 99) coeflabels(LDlnstateap = "{bf:State Appropriations}" LDlnfte = "FTE Enrollment" LDlnperinc = "State Personal Income", labsize(medium)) vertical title("Short-Run Change in {bf:Net Tuition Revenue} Due to a 10% Change in" "{bf:State Appropriations} (controlling for other factors)", size(medium) margin(small) justification(center))
The graph is shown below.

Fig. 10.8 Pct. change in appropriations, FTE and personal income due to a Pct. change in net tuition revenue

We see in Fig. 10.8 that, with titles and bolded text, the user is directed to the areas of the graph that focus on the information that is most relevant to the results from the regression model with CCEMG. We can also graph the results from an even more advanced model: the HCR model with DCCE and MG estimators and a first-order autoregressive distributed lag (ARDL) of each of the variables. We quietly (qui) run the model to suppress the output.

qui xtdcce2 Dlnnetut L1.Dlnnetut LDlnstateap LDlnfte LDlnperinc, reportc cr(_all) cr_lags(3 3 3 3) lr(L1.Dlnnetut LDlnstateap LDlnfte LDlnperinc) lr_options(ardl)

We then retype the following syntax.
mata: st_matrix("e(box)", (st_matrix("e(b)") :- 1 \st_matrix("e(b)") :+ 1))
Then we reenter the coefplot syntax from above [not shown here]. The result is the graph shown below.

Fig. 10.9 Pct. change in appropriations, FTE and personal income due to a Pct. change in net tuition revenue

We see from Fig. 10.9 that the results are the same with regard to the short-run change in net tuition revenue (not statistically significant) as a
result of a 10% change in state appropriations, controlling for other variables. The results presented in Fig. 10.9, however, are based on the HCR model with DCCE and MG estimators. This model takes into account nonstationary data, heterogeneous coefficients, and cross-sectional dependence (all of which policymakers and other users are most likely to have no interest in knowing, but which may be of interest to policy researchers).

3 For a complete description and examples of the options for coefplot, see Jann, B. (2019, May 28). coefplot: Plotting regression coefficients and other estimates in Stata. http://repec.sowi.unibe.ch/stata/coefplot/getting-started.html.
10.5
Marginal Effects (with Continuous Variables) and Graphs
Marginal effects and graphs are another way to present the results of regression models to policymakers and other users. For regression models composed of continuous variables, the Stata commands margins and coefplot provide a way to carry this out. This section will discuss and demonstrate the use of these very flexible commands as a way to provide information to policymakers. Marginal effects are the changes in the dependent variable due to changes in a specific continuous independent variable, holding all other independent variables constant. They are calculated by defining the marginal effect of one variable (x) on another (y) as the change (Δ) in y given a change in x, or a partial
derivative (Δy/Δx) of a function f(x, y). Using calculus, the derivative provides the rate of change over a very small interval that approaches zero. The average derivative computed across all observations is the average marginal effect (AME). The marginal effect at the average (MEA) is the derivative evaluated at the averages of the variables. Marginal effects, based on the results of a regression model, are predictions. After running a regression model, marginal effects can show the change in the dependent variable, given a change in an independent variable (holding all other independent variables at some constant level). Marginal effects also allow analysts to look at the percent change of the dependent variable given a change in a particular independent variable, holding other independent variables at their median, various percentiles, or other specified levels. Stata's margins command can also be used to estimate elasticity. The concept of elasticity is the percent change in a dependent variable, given a 1% change in an independent variable, holding all other variables at some constant level. This concept is helpful when variables are measured using different metrics. Consequently, marginal effects are useful when interpreting regression results for policymakers and other nontechnical audiences. Utilizing regression models, the concept of marginal effects is demonstrated below. Suppose a state legislator wants to find out how the number of administrators (executives and managers) in public higher education changes with state funding of public colleges and universities. Using data and Stata, we demonstrate how a higher education policy analyst could approach this.
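Stated compactly, in the chapter's notation (this restatement is ours, not a formula reproduced from the text, and ŷ denotes the model's predicted value):

AME = (1/n) Σ (Δŷ/Δx)    (the observation-level derivative averaged over all n observations)
MEA = Δŷ/Δx evaluated at the sample means of the independent variables
Elasticity = (Δy/Δx) × (x/y)    (the percent change in y for a 1% change in x)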
We change our working directory. . cd ”C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 10\Stata\Data“ C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 10\Stata\Data
Then we open the relevant Stata data file, which is based on IPEDS data.
. use "Example 10.dta", clear
Upon careful inspection of the data, we see that the dataset spans 46 states and 13 years (2000–2012) with a 1 year gap (2001 is missing). So, we drop the first year (2000). It is preferable that we have no yearly gaps in the data when we are including 1-year lags of independent variables in our regression models. drop if year==2000 (46 observations deleted)
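A quick way to confirm that no gaps remain before adding lags is to redeclare the panel; the identifiers id and year used here are those reported in the regression output later in this section.

*redeclare the panel and check the reported time structure for gaps
xtset id year
*if Stata notes gaps here, L1. lags would introduce additional missing values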
In this analysis, the dependent variable, administrators (adminstaff), is measured by the total number of executive and managerial employees. The main independent variable is net tuition revenue (net_tuition_rev_adj). The other independent variables include state appropriations (state_appro_adj), revenue from the federal government (fedrev_r), and full-time equivalent students (FTE_enroll). We will use global macro names to reduce the number of keystrokes.

. global y "adminstaff"
. global x1 "net_tuition_rev_adj"
. global x2 "state_appro_adj"
. global x3 "fedrev_r"
. global x4 "FTE_enroll"
Descriptive statistics [not shown] indicate the data are highly skewed. Because prior testing [not shown] revealed there is serial correlation and cross-sectional dependence in the data among the variables we plan to use, we estimate a pooled OLS regression model with Driscoll-Kraay (D-K) standard errors. Because we want to avoid reverse causation, we lag the independent variables by 1 year in the regression model.

. xtscc $y L1.$x1 L1.$x2 L1.$x3 L1.$x4
Regression with Driscoll-Kraay standard errors Method: Pooled OLS Group variable (i): id maximum lag: 2
Number of obs = 460 Number of groups = 46 F( 4, 9) = 184.89 Prob > F = 0.0000 R-squared = 0.8083 Root MSE = 867.9764 ------------------------------------------------------------------------------------| Drisc/Kraay adminstaff | Coef. Std. Err. t P>|t| [95% Conf. Interval] --------------------+---------------------------------------------------------------net_tuition_rev_adj | L1. | 1.21e-06 1.24e-07 9.81 0.000 9.34e-07 1.49e-06 | state_appro_adj | L1. | 2.47e-07 1.27e-07 1.95 0.083 -3.99e-08 5.34e-07
| fedrev_r | L1. | -1.67e-07 1.62e-07 -1.03 0.330 -5.35e-07 2.00e-07 | FTE_enroll | L1. | .0025754 .0014926 1.73 0.119 -.000801 .0059518 | _cons | 136.6033 74.6967 1.83 0.101 -32.37241 305.5789 -------------------------------------------------------------------------------------
Then we use margins to calculate the average marginal effects (AME). . margins, dydx(L1.$x1 L1.$x2 L1.$x3 L1.$x4) Average marginal effects Number of obs = 460 Model VCE : Drisc/Kraay Expression : Fitted values, predict() dy/dx w.r.t. : L.net_tuition_rev_adj L.state_appro_adj L.fedrev_r L.FTE_enroll ------------------------------------------------------------------------------------| Delta-method | dy/dx Std. Err. z P>|z| [95% Conf. Interval] --------------------+---------------------------------------------------------------net_tuition_rev_adj | L1. | 1.21e-06 1.24e-07 9.81 0.000 9.72e-07 1.46e-06 | state_appro_adj | L1. | 2.47e-07 1.27e-07 1.95 0.051 -1.59e-09 4.95e-07 | fedrev_r | L1. | -1.67e-07 1.62e-07 -1.03 0.303 -4.86e-07 1.51e-07 | FTE_enroll | L1. | .0025754 .0014926 1.73 0.084 -.00035 .0055007 -------------------------------------------------------------------------------------
Given the very large numbers, the AME are difficult to interpret. So we should make an effort to calculate the elasticities, using the option eyex. . margins, eyex(L1.$x1 L1.$x2 L1.$x3 L1.$x4) could not calculate numerical derivatives -- discontinuous region with missing values encountered r(459);
This clearly does not work! Why? The "average" elasticity cannot be calculated for any of the independent variables. Instead, we should try to calculate the elasticities of each of the variables at their average or mean levels.
. margins, eyex(L1.$x1 L1.$x2 L1.$x3 L1.$x4) at((mean) _all)
Conditional marginal effects Number of obs = 460 Model VCE : Drisc/Kraay Expression : Fitted values, predict() ey/ex w.r.t. : L.net_tuition_rev_adj L.state_appro_adj L.fedrev_r L.FTE_enroll at : L.net_tuitj = 8.99e+08 (mean) L.state_apj = 1.11e+09 (mean) L.fedrev_r = 8.26e+08 (mean) L.FTE_enroll = 200801.8 (mean) ------------------------------------------------------------------------------------- | Delta-method | ey/ex Std. Err. z P>|z| [95% Conf. Interval] --------------------+---------------------------------------------------------------- net_tuition_rev_adj | L1. | .5804154 .0613005 9.47 0.000 .4602685 .7005622
| state_appro_adj | L1. | .1456065 .0764509 1.90 0.057 -.0042345 .2954474 | fedrev_r | L1. | -.0734286 .0703213 -1.04 0.296 -.2112558 .0643986 | FTE_enroll | L1. | .2748182 .1556239 1.77 0.077 -.0301991 .5798356 -------------------------------------------------------------------------------------
Because we know that data are highly skewed, we should also calculate elasticities for variables at the median rather than the mean to see if the results are substantially different. . margins, eyex(L1.$x1 L1.$x2 L1.$x3 L1.$x4) at(( median) _all) Conditional marginal effects Number of obs = 460 Model VCE : Drisc/Kraay Expression : Fitted values, predict() ey/ex w.r.t. : L.net_tuition_rev_adj L.state_appro_adj L.fedrev_r L.FTE_enroll at : L.net_tuitj = 6.07e+08 ( median) L.state_apj = 7.86e+08 ( median) L.fedrev_r = 5.95e+08 ( median) L.FTE_enroll = 150336 ( median) ------------------------------------------------------------------------------------| Delta-method | ey/ex Std. Err. z P>|z| [95% Conf. Interval] --------------------+---------------------------------------------------------------net_tuition_rev_adj | L1. | .5436902 .067622 8.04 0.000 .4111535 .6762268 | state_appro_adj | L1. | .1432976 .0776547 1.85 0.065 -.0089028 .2954979 | fedrev_r | L1. | -.0735113 .0701083 -1.05 0.294 -.210921 .0638984 | FTE_enroll | L1. | .2857197 .1582156 1.81 0.071 -.0243771 .5958164 -------------------------------------------------------------------------------------
We see that at the median, only the change in net tuition revenue has an effect on the change in the number of administrators. The results suggest that a 1% increase in net tuition revenue contributes to a 0.54% increase in administrators at public colleges and universities. This is only slightly less than the 0.58% increase calculated at the mean of net tuition revenue (and of the other variables).
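As a rough check on how the eyex output relates to the dydx output above (the implied staffing level is our own back-of-the-envelope calculation, not a figure reported in the text): the elasticity at the median is the coefficient times the ratio of the median regressor to the predicted outcome, roughly 1.21e-06 × (6.07e+08 / ŷ). For this to equal the reported 0.54, the predicted number of administrators at the median values of the regressors must be about ŷ ≈ 1,350.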
10.5.1
Marginal Effects (Elasticities) and Graphs
Next, we should display these results in a graph similar to Fig. 10.9. To do so, we save the marginal effects (in terms of elasticities) by including the option post in the following syntax. margins, eyex(L1.$x1 L1.$x2 L1.$x3 L1.$x4) at((median) _all) post
We enter the following syntax.
mata: st_matrix("e(box)", (st_matrix("e(b)") :- 1 \st_matrix("e(b)") :+ 1))
We then modify the coefplot syntax to produce the graph with the relevant titles.
coefplot, xline(0) keep(L.net_tuition_rev_adj L.state_appro_adj L.fedrev_r L.FTE_enroll) mlabel format(%9.2g) mlabposition(0) msymbol(i) ciopts(recast(. rbar) barwidth(. 0.35) fcolor(. white) lwidth(. medium)) levels(95 99) coeflabels(L.net_tuition_rev_adj = "{bf:Net Tuition Revenue}" L.state_appro_adj = "State Appropriations" L.fedrev_r = "Federal Revenue" L.FTE_enroll = "FTE Enrollment") title("Percent Change in {bf:Administrators} Due to a 1% Change in" "{bf:Net Tuition Revenue} (controlling for other factors)", size(medium) margin(small) justification(center))
Figure 10.10 can be displayed by analysts to clearly explain the results of an advanced regression model. Using this figure, we can easily show that a 0.5% increase in administrators is brought about by a 1% increase in net tuition revenue (with net tuition revenue and all other variables at their median levels). We may want to:
1. show the percent change (rescaled to a 10% change) on the vertical axis;
2. show the independent variables on the horizontal axis; and
3. create custom legends with regard to the significance of the independent variables (this may require additional explanation to a lay audience).
Fig. 10.10 Pct. change in administrators due to a Pct. change in net tuition revenue
To carry out steps 1 through 3, we enter a very long line of syntax that produces the graph below. coefplot (., keep(L.net_tuition_rev_adj) color(black)) (., keep(L.state_appro_adj) color(gray)) (., keep(L.fedrev_r) color(gray)) (., keep (L.FTE_enroll) color(gray)), legend(on) xline(0) nooffsets pstyle(p1) recast(bar) barwidth(0.4) fcolor(*.8) coeflabels(L.net_tuition_rev_adj = ”{bf:Net Tuition Revenue}“ L.state_appro_adj = ”State Appropriations“ L.fedrev_r = ”Federal Revenue“ L.FTE_enroll = ”FTE Enrollment“ , labsize(small)) title(”Percent Change in {bf:Administrators} Due to a 10% Change in“ ”{bf:Net Tuition Revenue} (controlling for other factors)“, size(medium) margin(small) justification(center)) addplot(scatter @b @at, ms(i) mlabel(@b) mlabpos(1) mlabcolor(black)) vertical noci format(%9.1f) rescale(10) p2(nokey) p3(nokey) p1(label(”Different from Zero“)) p4(label(”Ignore not different zero“)) ytitle(Percent) xtitle(”At the Median“, size(small))
The graph looks like this (Fig. 10.11).

Fig. 10.11 Pct. change in administrators due to a Pct. change in net tuition revenue

Using the same steps above, a graph can also be created to show the change in administrators with respect to the change at the 25th percentile of net tuition revenue (and other variables). This is accomplished by replacing "median" with "p25" in the margins syntax as follows:

. margins, eyex(L1.$x1 L1.$x2 L1.$x3 L1.$x4) at((p25) _all) post
Conditional marginal effects Number of obs = 460 Model VCE : Drisc/Kraay Expression : Fitted values, predict()
ey/ex w.r.t. : L.net_tuition_rev_adj L.state_appro_adj L.fedrev_r L.FTE_enroll
: L.net_tuitj = 2.57e+08 (p25) L.state_apj = 3.70e+08 (p25) L.fedrev_r = 2.34e+08 (p25) L.FTE_enroll = 67108.5 (p25) ------------------------------------------------------------------------------------| Delta-method | ey/ex Std. Err. z P>|z| [95% Conf. Interval] --------------------+---------------------------------------------------------------net_tuition_rev_adj | L1. | .4636343 .0776525 5.97 0.000 .3114382 .6158304 | state_appro_adj | L1. | .1355566 .077485 1.75 0.080 -.0163113 .2874245 | fedrev_r | L1. | -.0581332 .0555159 -1.05 0.295 -.1669424 .0506759 | FTE_enroll | L1. | .2563384 .1394503 1.84 0.066 -.0169792 .5296561 ------------------------------------------------------------------------------------at
We take note of which independent variables are statistically significant (p < 0.05). The mata syntax from above is then reentered.
mata: st_matrix("e(box)", (st_matrix("e(b)") :- 1 \st_matrix("e(b)") :+ 1))
We modify the xtitle option (i.e., xtitle("At the 25th Percentile", size(small))) in the coefplot syntax above to produce the graph below. We see from Fig. 10.12 that administrators increase by 4.6% for every 10% increase in net tuition revenue, with net tuition revenue and the other variables at the 25th percentile.
Fig. 10.12 Pct. change in administrators due to a Pct. change in net tuition revenue
After rerunning the regression model, the margins syntax is then changed to reflect elasticities at the 75th percentile of net tuition revenue and other variables. . margins, eyex(L1.$x1 L1.$x2 L1.$x3 L1.$x4) at((p75) _all) post Conditional marginal effects Number of obs = 460 Model VCE : Drisc/Kraay Expression : Fitted values, predict() ey/ex w.r.t. : L.net_tuition_rev_adj L.state_appro_adj L.fedrev_r L.FTE_enroll at : L.net_tuitj = 1.21e+09 (p75) L.state_apj = 1.32e+09 (p75) L.fedrev_r = 1.02e+09 (p75) L.FTE_enroll = 232360.3 (p75) ------------------------------------------------------------------------------------| Delta-method | ey/ex Std. Err. z P>|z| [95% Conf. Interval] --------------------+---------------------------------------------------------------net_tuition_rev_adj | L1. | .6224404 .0570452 10.91 0.000 .5106339 .7342469 | state_appro_adj | L1. | .1381845 .0706619 1.96 0.051 -.0003102 .2766792 | fedrev_r | L1. | -.0719685 .0693115 -1.04 0.299 -.2078166 .0638796 | FTE_enroll | L1. | .2534847 .1466091 1.73 0.084 -.0338638 .5408332 -------------------------------------------------------------------------------------
The mata syntax is reentered.
mata: st_matrix("e(box)", (st_matrix("e(b)") :- 1 \st_matrix("e(b)") :+ 1))
A slightly modified version of the coefplot syntax (i.e., xtitle("At the 75th Percentile", size(small))) is reentered to create the graph below.

Fig. 10.13 Pct. change in administrators due to a Pct. change in net tuition revenue

Figure 10.13 clearly shows that administrators increase by 5.9% for every 10% increase at the 75th percentile of net tuition revenue and other variables. Taken together, the figures (Figs. 10.11, 10.12, and 10.13) show that the change in administrators increases with the change at every level of net tuition revenue, but that the change is even greater at higher levels of net tuition revenue.
10.6
Marginal Effects and Word Tables
It may be useful to produce a publication-ready table that could be included as part of an appendix in reports provided to policymakers and other consumers of the information. These tables could show the detailed results of the regression models that were used to produce the graphs discussed in the previous section. The creation of these tables can be accomplished by using the Stata user-written routine esttab, which is part of the program estout (Jann 2019b). To use esttab, we install the most recent version (net install st0085_2, replace). Then we enter the following syntax.
First, we change the working directory to where we would like to place a Word table. cd ”C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 10\Tables“
Then, we repeat the following steps:
1. Run our OLS regression model;
2. Calculate elasticities;
3. Produce a Word table with the results from step 2.

Steps 1 and 2: elasticities at the 25th percentile
qui xtscc $y L1.$x1 L1.$x2 L1.$x3 L1.$x4
qui margins, eyex(*) at((p25) _all) cont post
eststo marginalp25
Steps 1 and 2: elasticities at the 50th percentile (median)
qui xtscc $y L1.$x1 L1.$x2 L1.$x3 L1.$x4
qui margins, eyex(*) at((median) _all) cont post
eststo marginalmed
Steps 1 and 2: elasticities at the 75th percentile
qui xtscc $y L1.$x1 L1.$x2 L1.$x3 L1.$x4
qui margins, eyex(*) at((p75) _all) cont post
eststo marginalp75
Table 10.3 Percent change in administrators due to a 1% change in net tuition revenue, controlling for other factors (state appropriations, federal revenue, and FTE enrollment)
                          25th percentile      Median      75th percentile
L.Net tuition revenue        0.427***          0.531***       0.590***
                            (0.079)           (0.078)        (0.076)
L.State appropriations       0.135             0.149          0.150
                            (0.073)           (0.080)        (0.079)
L.Federal revenue           -0.0195           -0.0246        -0.0258
                            (0.085)           (0.107)        (0.112)
L.FTE enrollment             0.193             0.211          0.202
                            (0.151)           (0.164)        (0.159)
Observations                 322               322            322
Standard errors in parentheses * p < 0.05, ** p < 0.01, *** p < 0.001
Step 3: create the Word file
esttab marginalp25 marginalmed marginalp75 using Table_Appendix, label se(3) title("Percent Change in Administrators" "Due to a One Percent Change" "in Net Tuition Revenue, Controlling for Other Factors" "(State Appropriations, Federal Revenue, and FTE Enrollment)") mtitle("25th Percentile" "Median" "75th Percentile") nonumbers rtf replace
[Stata output cut]
(output written to Table_Appendix.rtf)
We can click Table_Appendix.rtf to access the table (Table 10.3) in Word. In Word, the table above can be edited and placed in an appendix.
10.7
Marginal Effects (with Categorical Variables) and Graphs
Marginal effects can also be used with categorical variables to answer a range of policy questions. For example, suppose higher education policymakers would like to know if the relationship between administrators and net tuition revenue differs by the extent to which higher education is regulated by the state. In this example, we measure regulation by whether (Yes = 1) or not (No = 0) a state has a higher education consolidated governing board (CGB). The following steps are carried out to produce a graph of the marginal effects by whether or not states have a consolidated governing board.
1. Shorthand notation and global macros are used to save keystrokes.
gen y = adminstaff
global x "L1.net_tuition_rev_adj L1.state_appro_adj L1.fedrev_r L1.FTE_enroll"
2. We “quietly” run a pooled OLS regression with D-K standard errors for states with no consolidated governing board. qui xtscc y $x if CGB==0
3. The marginal effects, specifically the elasticities, are calculated. qui margins, eyex(*) post
4. We enter the mata syntax. mata: st_matrix(”e(box)“, (st_matrix(”e(b)“) :- 1 \st_matrix(”e(b)“) :+ 1))
5. The calculation of the marginal effects is stored. eststo NoCGB
Steps 2 through 5 are repeated for states with a consolidated governing board.
qui xtscc y $x if CGB==1
qui margins, eyex(*) post
mata: st_matrix("e(box)", (st_matrix("e(b)") :- 1 \st_matrix("e(b)") :+ 1))
eststo CGB
The graph, with appropriate labels and titles, is then created using the following syntax. coefplot NoCGB CGB, xline(0) format(%9.0f) rescale(10) recast(bar) barwidth(0.3) fcolor(*.5) coeflabels(L.net_tuition_rev_adj = ”{bf:Net Tuition Revenue}“ L.state_appro_adj = ”State Appropriations“ L.fedrev_r = ”Federal Revenues“ L.FTE_enroll = ”FTE Enrollment“, labsize(small)) vertical p1(label(”No CGB“)) p4(label(”CGB“)) ytitle(Percent) ylabel(4(2)10) title(”Percent Change in {bf:Administrators} Due to a 10% Change in“ ”{bf:Net Tuition Revenue} (controlling for other factors)“, size(medium) margin(small) justification(center))
The graph is shown below.

Fig. 10.14 Pct. change in administrators due to a 10% change in net tuition revenue (and other factors) by consolidated governing board (CGB)

Figure 10.14 shows that the confidence interval lines cross the horizontal line at zero for most of the bars, indicating they are not different from zero. It also indicates that the percent increase in administrators as it relates to net tuition revenue only occurs in states with no consolidated governing boards.
10.8
Summary
Many questions from higher education policymakers can be answered by providing basic descriptive statistics. Using the Stata user-written routine asdoc, this chapter showed how the results of basic descriptive statistics can be presented in Word tables for higher education policymakers and other consumers of this information. The chapter also demonstrated how policy questions that call for spatial descriptive statistics can be answered with data displayed in maps. Other policy questions require the use of more advanced statistical techniques, like regression models. This
chapter demonstrates how the Stata commands margins and coefplot can be used to create graphs to show the results to policymakers and others who may not be familiar with or interested in regression models.
10.9
Appendix
*Chapter 10 Syntax

*Use the Stata user-written module asdoc (Shah, 2019) to create ///
presentation-ready tables in Word
net install asdoc, from(http://fintechprofessor.com) replace

*change our working directory to where we want to save our tables
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 10\Tables"

*open a dataset
use "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 10\Stata\Data\Example 10.1.dta"

*we invoke the sum command to produce descriptive statistics
sum y x1 x2 x3 x4 x5

*We can either create rescaled variables by hand or utilize the Stata ///
user-written routine rescale to automatically rescale the variables
*install rescale
net install rescale, from(http://digital.cgdev.org/doc/stata/MO/Misc) replace
*rescale y, x1, x3, x4, and x5 into millions; we use replace and the ///
millions option
rescale y, millions
rescale x1, millions
rescale x3, millions
rescale x4, millions
rescale x5, millions

*rerun the sum command
sum y x1 x2 x3 x4 x5

*use the Stata command tabstat
tabstat y_pop x1fte x3_pop x4_pop x5_pop, statistics(mean median) ///
column(statistics) format(%9.0fc)

*combine the use of asdoc and tabstat, with the replace and abb(.) options
asdoc tabstat y_pop x1fte x3_pop x4_pop x5_pop, statistics(mean median) ///
column(statistics) format(%9.0fc) dec(0) long ///
title(Table 10.1 Descriptive Statistics) save(Table 10.1.doc) ///
label abb(.) replace

*create a categorical variable representing Maryland (MD) and create a table
gen MD=0
lab var MD "Comparisons"
replace MD=1 if fips==24
label define MD1 1 "Maryland" 0 "All Other States"
label values MD MD1

*create a categorical variable that reflects different time periods. ///
In this example, we create a variable decade and code and label it accordingly
gen decade=0
*label variable
lab var decade "Decades"
replace decade=1 if fy>=1980 & fy<=1989
replace decade=2 if fy>=1990 & fy<=1999
replace decade=3 if fy>=2000 & fy<=2009
replace decade=4 if fy>=2010 & fy<=2018