233 46 54MB
English Pages 531 [532] Year 2023
Springer Texts in Social Sciences
Manuel S. González Canché
Spatial Socio-econometric Modeling (SSEM) A Low-Code Toolkit for Spatial Data Science and Interactive Visualizations Using R
Springer Texts in Social Sciences
This textbook series delivers high-quality instructional content for graduates and advanced graduates in the social sciences. It comprises self-contained edited or authored books with comprehensive international coverage that are suitable for class as well as for individual self-study and professional development. The series covers core concepts, key methodological approaches, and important issues within key social science disciplines and also across these disciplines. All texts are authored by established experts in their fields and offer a solid methodological background, accompanied by pedagogical materials to serve students, such as practical examples, exercises, case studies etc. Textbooks published in this series are aimed at graduate and advanced graduate students, but are also valuable to early career and established researchers as important resources for their education, knowledge and teaching. The books in this series may come under, but are not limited to, these fields: – – – – –
Sociology Anthropology Population studies Migration studies Quality of life and wellbeing research
Manuel S. González Canché
Spatial Socio-econometric Modeling (SSEM) A Low-Code Toolkit for Spatial Data Science and Interactive Visualizations Using R
Manuel S. González Canché University of Pennsylvania Philadelphia, PA, USA
ISSN 2730-6135 ISSN 2730-6143 (electronic) Springer Texts in Social Sciences ISBN 978-3-031-24856-6 ISBN 978-3-031-24857-3 (eBook) https://doi.org/10.1007/978-3-031-24857-3 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
This book is the result of a collective effort fueled by the wealth of tools and knowledge that the R Project for Statistical Computing community has made available. Founded upon the notion that the storytelling capabilities and statistical power that spatial socio-econometric modeling brings to the analytic table should not be concentrated among few well-resourced researchers, this book aims to remove computer programming barriers, hoping to, in doing so, expand access to spatial data science tools. Many unthought questions and hypotheses may emerge when we start thinking relationally, and when we start considering the spaces and places (i.e., splaces) wherein those connections emerge. Accordingly, I not only dedicate this book to my family, teachers, mentors, as well as to all the authors cited, but also to you, who are willing to embark in this spatial data science journey. From this view, the continuous use of “WE” rather than “I” throughout all chapters, aims to capture this communal effort to expand unrestricted access to spatial data science tools.
Preface
Despite the vast availability of data that can be linked to earth’s surface (also known as geo-referenced or geo-located), to date most research studies in the social sciences continue to ignore or omit the influence of location on the outcomes of interest. For example, assume one is interested in measuring the impact of an intervention taking place in some school districts across a given state. Assume further that a variety of school-level administrative indicators are readily available for model inclusion, whereas other place-based indicators require extra data gathering, management, and methods before they can be statistically accounted for. In scenarios like this, researchers typically proceed to model the intervention effects without considering other placebased variables that may surely impact students’ performance over and above such students within school experiences. That is, given that where individuals experience life transcends school and home boundaries, the omission of out-of-school and out-of-home localized factors such as neighborhood-level poverty, crime, or unemployment levels, equates to ignoring and/or assuming that those factors bear no effect on such individuals’ prospects of success. The main goal of this book is to provide researchers with a set of minimal (or low) code tools 1 to easily start accounting for these place-based indicators via spatial socio-econometric modeling.
1 Minimal or low-code tools ascribe to data science democratizing efforts by removing vast amounts of statistical and computer programming proficiency requirements (or barriers) to apply sophisticated and rigorous state-of-the-art analytic and data interactive data visualization techniques. Accordingly, all functions and loops designed for this book may be implemented with a few lines of code that package hundreds or thousands of steps in the back end of those functions. We hope these minimal code functions, which ascribe to the algorithmic model culture (see Breiman, 2001), help expand access to SSEM and related data science techniques.
vii
viii
PREFACE
Spatial socio-econometric modeling (SSEM) is a set of conceptual and statistical and econometric methodological tools for modeling and understanding how attributes that can be located on Earth’s surface (i.e., geolocated or geo-referenced2 ) may be used to explain their inhabitants’ social and economic outcomes and how these outcomes may likely be part of feedback loops that translate into concentrations of (dis )advantages (Manuel S. González Canché, 2019c, 2022). For example, living in high-income neighborhoods is associated with higher salary and education attainment prospects, whereas the opposite expectations are true for those living in lower income areas. The geo-located attributes to be used for SSEM are not limited to social constructs (i.e., socio-demographic or socio-economic indictors), but rather may include physical characteristics of a given area. Note, however, that for convenience and clarity, our examples build from social constructs that have been found to impact social mobility prospects. Accordingly, and having briefly noted that spatial attributes may be of “any” kind, the main tenet of SSEM is that, since concentration of place-based attributes, with either positive or negative connotations, may lead to, or result in, feedback loops that in turn may affect participants’ experiences and outcomes, such geo-contextualized attributes must be included in our statistical models along with our more traditional individual- or unit-level features. From this view, SSEM analyses strive to understand how the constant intermingling of spaces and places (i.e., splaces) may inform the mechanisms behind reproductions of privilege or even disadvantages. Accordingly, with nuanced understandings that account for individual and geo-contextual attributes, we may gain deeper insights about potential sources of biases resulting from systematic disparities across the neighborhoods or splaces where participants experience life. Despite the conceptual and analytic benefits associated with accounting for and modeling these place-based attributes, the mainstream use of these methods has been, and continues to be, constrained by computer and statistical programming and coding barriers, for most of these procedures require advanced data handling and manipulation skills. That is, as depicted in subsequent chapters, SSEM analyses require the data access, handling, and manipulation of shapefiles (databases containing points, lines, or polygon geo-located features), matrices (accounting for distances and/or relationships among units), and geo-located databases that may need to be merged with those shapefiles and matrices. The simultaneous handling of these nonstandard data elements results in cumbersome and long programming and coding procedures that may overwhelm even advanced quantitative researchers. To remove programming hurdles, with the purpose of democratizing access to spatial data science tools and state of the art, interactive, and dynamic spatial data visualizations, this book offers dozens of minimal or low-code
2
Geo-located and geo-referenced units and attributes are typically used interchangeably.
PREFACE
ix
tools—see page xxxvii for the complete list.3 These minimal code tools are presented in the form of handcrafted functions that were designed to ease the use and implementation of SSEM analyses. These minimal code tools or functions, which were developed throughout the past 10 years, practically mirror standard analytic procedures typically conducted in Stata (or similar userfriendly statistical software), where a standard regression call like reg y x1 x2 computes all the processes in the background and then renders the resulting outcome. Our minimal code functions then were designed to implement complex and sophisticated analytical procedures (not yet available elsewhere) that, building from dozens of packages and a myriad of functions, automatize multiple data manipulation procedures, including the merging of matrices, shapefiles, and databases, with an easy-to-implement function. For example, the function geographical_networks(...) is capable of even automatically downloading a shapefile directly from the TIGER/Line Shapefiles and join this file with our local databases. This call, which internally manages a myriad of elements4 will then result in a visualization like https://rpubs.com/ msgc/two-mode_geographical_networks where connections are identified in a contextualized space, and interactive data retrieval and mining elements are included as part of this standard output resulting from the execution of this minimal code function. Throughout this book we will also emphasize the relevance of estimating distances among units or participants that, more realistically and more accurately, capture the travel distances and navigation times that inhabitants of these areas experience on a daily basis. This emphasis is important when considering that the standard method for distance computations, currently implemented in spatial modeling, still assumes that we can reach a point by “flying from one location to another.” This standard method is called “as the crow flies” and does not incorporate road information during the estimation of distances among points. Based on this “flying” assumption then, the “as the crow flies” method may likely underestimate travel distances or commuting times. Considering that distance estimates are at the hearth of SSEM analyses, we devoted an entire chapter (Chap. 5) to illustrate how to move beyond the “as the crow flies” distance estimation method. Accordingly, two of the functions presented in Chap. 5 are called humans_walk_batch(...), which mirrors Google maps walk navigation distances, and travel_times(...), which retrieves commuting times from the “OpenStreetMap” server. Both functions are capable of retrieving millions of distances with a single call. This latter statement is particularly relevant for the travel_times(...) function, which was designed to iteratively connect to the “OpenStreetMap” 3
Although all our replication exercises provide direct access to these minimal code functions---and data, the list of replication Listings can also be accessed at https://github. com/democratizing-data-science/SSEM. 4
For example, the elements internally managed are db, db2, mat, threshold, db.ID, y, higher, shapefile, shape.ID, unit.weight, link.weight, link.label, neighborless, lat, long, as shown in code Listing 9.5.
x
PREFACE
server as many times as required to circumvent the 10,000 travel times restrictions per call imposed by such a server. Although 10,000 distances may sound appealing, it represents the distances of 100 data points per call. Subsequently in the book, our minimal code functions will provide us with the choice of relying on “as the crow flies” distances or use our “as humans walk” or “as humans drive” matrices for the execution of SSEM analyses. Before moving on, let us also note that another salient contribution of SSEM to spatial modeling, in general, is our reliance on network transformations that expand the possibility of including more than one unit type in the analytic procedures. That is, typically spatial analyses only consider connections among units of the same type, namely, 4-year colleges (see Manuel S. González Canché, 2017; Manuel S. González Canché, 2014; McMillen, Singell Jr., and Waddell, 2007). However, as we depict in https://rpubs.com/msgc/ two-mode_geographical_networks, our SSEM modeling approaches showcase strategies to model relationships emerging from the inclusion of units of different type, like public 2-year colleges and 4-year colleges or like students selecting colleges (Manuel S. González Canché, 2018b). So far, we have yet to see the mainstream implementation of these types of analyses, which although relevant also require advanced data handling tools, which we have packaged in, and automatized with, our minimal code functions. Building from this brief discussion, let us note that the main goal of this book is devoted to describing these dozens of minimal code functions uniquely designed for this book—see complete list of replication exercises on page xxxvii. To achieve our goal of expanding access to spatial data science tools, we are also providing all access to data and reproducible examples of the entire conceptual and analytic processes required to conduct SSEM analyses and advanced data visualizations. These SSEM analyses and visualizations include both cross-sectional and spatio-temporal processes and data structures. Applications of SSEM included in the book are analyses of concentration of migrants deaths along with strategies to assess changes in this death concentration across different causes of death and across different seasons of the year (see Fig. 9.8 and map https://rpubs.com/msgc/spatial_density_mig rants_deaths, for instance). Analyses of whether single global estimates may suffice to capture the relationships of predictors and the outcome of interest, or whether variations of these relationships may be modeled and found across our spaces (see Fig. 8.10, for example). The modeling of the impact of a place-based scholarship in increasing college attendance rates of inhabitants over time (see spatio-temporal difference in differences estimation section and exploratory spatio-temporal data mining and visualization results in Figs. 9.2 and 9.3). Our analyses and examples also include changes in college affordability given the local availability of 4-year colleges, which rely on geographical network analyses (see Figs. 9.10, 9.11, and 9.12). From this description then, this book was designed as a standalone guide that relies on free software tools (The R Project for Statistical Computing) designed to equip social science researchers and data analysts with all concepts,
PREFACE
xi
explanations, and tools required to tackle practical, real-world questions such as ... • What times of the day or months of the year are more dangerous in a given neighborhood? • Is there evidence to suggest that our outcomes are affected by our localized circumstances? Can these principles be extrapolated to relationships measured from social interactions? If so, how is this extrapolation useful in expanding our sociological understandings? • Do economically disadvantaged students who enroll in schools located in higher income neighborhoods perform better in mathematics than their economically disadvantaged peers enrolled at schools located in poor neighborhoods? • How come some schools located in impoverished areas may reach similar success rates than schools located in “better off” neighborhoods? What other localized factors may help explain these cases? • Is there any indication that tax break policies for higher education are effective in reaching lower income taxpayers, after accounting for spatial dependence, and when considering taxpayers’ own levels of wealth and resources within their Zip code tabulated areas of ascription? • Are students who attended college nearby their high school homes less likely to succeed in college than their peers who out-migrated from home to attend college? How can we identify these nearby institutions? How can we minimize or address feedback loops in these analyses? • Can we rely on SSEM tools, like geographical networks, to identify treated and control units as a function of some characteristics of their localized communities and circumstances (like the local presence of more prestigious institutions) and then use these treatment and control classifications to apply standard quasi-experimental analyses? This book may serve as a complete guide that provides concepts and techniques for conducting spatial data science modeling with interactive data visualizations and minimal code functions requirements. Clear and concise explanations of state-of-the-art methodological applications (geospatial point density analyses, multilevel spatial autoregressive models, difference in differences spatio-temporal models) are reinforced with actual data and code used in recent published research. Accordingly, the goal of this book is to provide social scientists with a standalone, complete set of open-access minimal code tools to... • Identify and assess place-based data sources and formats. • Conduct advanced data management, including crosswalks, joining, and matching.
xii
PREFACE
• Formulate research questions designed to incorporating or accounting for localized or place-based factors in model specification and assessing their relevance compared to individual- or unit-level indicators. • Estimate distance measures across units that follow road network paths rather than being calculated by assuming that humans may just “fly” from point A to point B. • Create sophisticated and interactive HTML data visualizations (cross sectionally or longitudinally), to strengthen their research storytelling capabilities. • Follow best practices for presenting spatial analyses and findings for SSEM research in the social sciences. • Master theories on neighborhood effects, equality of opportunity, and geography of (dis )advantage that undergird SSEM applications and methods. • Assess multicollinearity issues via machine learning that may affect coefficients’ estimates and, instead, identify relevant predictors. • Think of strategies to address feedback loops by using SSEM as an identification framework than can be merged with standard quasi-experimental techniques like propensity score models, instrumental variables, and difference in differences. • Expand the SSEM analyses to connections that emerge via social interactions, like co-authorship and advice networks, for example.
Unique Contribution to Social Sciences Currently there is no book that offers fully reproducible, state-of-the-art databases, and minimal code functions to replicate the analytic procedures and advanced data mining, machine learning and interactive visualizations, described in the book. All the procedures are prepared with free software that is both free to share and cost-free, so there is no monetary barrier for readers to use and master the analytic tools described in the book. Furthermore, the content of the book was designed to be easily updated based on new advancements of methods. For example, when users load a function from our server, any potential updates, including dates, will be described. Finally, spatial data science is increasing its presence in both academic literature and the mainstream media. For example, in the academic sphere, the work of Raj Chetty (https://opportunityinsights.org/) relies heavily on spatial data science and visualization techniques, and this work has attracted plenty of media and academic attention. However, even initiatives like “opportunities insight,” (see link above) directed by Chetty, provide no resources to conduct spatial analyses and much less spatial data science procedures. Similarly, in terms of mainstream media, the New York Times and the Washington Post are particularly active in providing their readers with interactive spatial visualizations, but do not provide code to replicate these visualizations. This book
PREFACE
xiii
distinctively provides minimal code tools, datasets, and examples of how to use and implement those applications to tackle real-world spatial research questions as well as to making these types of analyses and visualizations attainable and reproducible. Relatedly, the content of the book was designed so that only essential statistical coding (no computer programming) is required to conduct the analyses. This implies that readers with minimal (or no) programming proficiency will be able to replicate the minimal code functions provided. This ease of access and implementation is the primary goal and motivation of this book. Spatial data science modeling and storytelling via interactive and sophisticated data visualizations should not be the exclusive tools of a handful of well-resourced researchers and media giants.
SSEM Statistical Modeling Culture The analytic approach that this book takes follows the algorithmic model culture described by Breiman (2001). In this algorithmic model framework, there are no preconceived notions of a model that preemptively fit the data (i.e., data modeling culture). Instead, the data gathering process aims to capture the real world and real circumstances faced by its inhabitants as close as possible and with the understanding that localized circumstances vary across zones (i.e., from concentration of wealth to concentration to poverty indicators across areas). With these data elements in mind, we will then proceed to apply functions and algorithms designed to offer estimates that, as realistically as possible, may account for these localized set of circumstances. For example, we have dedicated an entire chapter to discuss estimation strategies that measure distances among points that account for actual infrastructures in the form of road availability, as opposed to assuming that we can “fly” from point A to point B without accounting for the presence or absence of roads or bridges. This SSEM algorithmic modeling approach has been developed to identify local features and then estimate distances. Although this nuanced distance estimation process should be the standard approach in spatial modeling, surprisingly most, if not all, readily available spatial analytic procedures are based on the “as the crow flies” distance estimation method. From this view, our algorithmic estimation strategies developed via minimal code tools are a fundamental feature of SSEM procedures. Another example of the algorithmic process followed in this book consists of the application of functions designed to identify the most optimal model specifications. For example, assume that we know that outcome dependence may be present in the data (see Chap. 7). As explained later in the book, this means that our outcomes are affected by our peers’ or neighbors’ performance while, at the same time, we are also affecting our peers’ or neighbors’ performances or outcomes. In this instance, rather than assuming that a single model may fit the data well, we will test a set of N models with changes in neighboring structure specifications designed to assess whether the first order
xiv
PREFACE
or direct neighbors is a reasonable or optimal model, or whether our specification should rather account for higher order neighboring structures (i.e., the neighbors of our neighbors). From this view, the analytic procedures showcased in this book will illustrate how to apply data-driven approaches to detect optimal solutions, namely, for distance selection, optimal order of neighbors, assumptions of global or local regression estimates cross sectionally or longitudinally, and even in the assessment of potential multicollinearity issues in our variable selection. Specifically, for this latter case, guided by our conceptual frameworks of neighborhood effects and geographies of (dis )opportunity, it is likely that spatial clustering may reflect that high poverty zones may also have high indices of violence and high unemployment rates. If this is the case, then it would be necessary to assess which of these three highly correlated indicators should be included in our models. Accordingly, in Chap. 8, we showcase how to apply algorithms designed to detect redundancy in the predictive or classificatory power of these predictor and control indicators for a given outcome. Overall, the title of the book Spatial Socio-econometric Modeling aims to capture the need to rely on statistics and econometrics to work with data that capture social structures and real-world circumstances as close as possible. In SSEM, functions and algorithms do not assume any predetermined answer or model to test sociological theory against these data, rather model specifications may vary across regions given the localized circumstances that their inhabitants experience. In a sense, then, this book offers a set of innovative and diverse set of minimal code tools (in the forms of functions and algorithms) to highlight structures (Breiman, 2001) and hopefully design strategies that may contribute to close persistent socio-economic gaps. In sum, the applications offered in this book have been developed to explain how geographically based circumstances, as experienced by inhabitants, may impact their outcomes and the theoretical and empirical relevance of accounting for these external influences in our model specifications, rather than ignoring their influence on our participants’ prospects of success or upward mobility.
Level at Which This Book is Aimed The pedagogical approach of the book requires minimal expertise in statistical analyses. An applied course where the main assumptions of multivariate linear regression are covered will suffice. Each chapter is designed to provide readers with the knowledge and skillsets required to master its content. For example, when pertinent, a given topic will revisit linear regression models’ assumptions as part of the justification for using spatial analyses to test for and address assumption violations. With this book as a guide, social scientists in training, as well as professional researchers and scholars, may become well suited to use free software and minimal code tools to leverage the abundance of geo-referenced data to discover and analyze the influence of place on critical issues in their research.
PREFACE
xv
And they can create interactive spatial visualizations to bring those findings alive with no computer programming background. We truly hope these efforts to democratize access to these spatial data science tools and interactive data visualizations via minimal code tools, may result in an increase of the use of these free, powerful, and sophisticated analytic procedures. On a personal and professional note, I remain convinced that SSEM may enable us to test innovative hypotheses in the social sciences with important societal implications. Philadelphia, USA June 2023
Manuel S. González Canché
References Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical science, 16(3), 199–231. González Canché, M. S. [Manuel S.]. (2014). Localized competition in the nonresident student market. Economics of Education Review, 43, 21–35. González Canché, M. S. [Manuel S]. (2017). The heterogeneous non-resident student body: Measuring the effect of out-of-state students’ home-state wealth on tuition and fee price variations. Research in Higher Education, 58(2), 141–183. González Canché, M. S. [Manuel S.]. (2018b). Nearby college enrollment and geographical skills mismatch: (re)conceptualizing student out-migration in the american higher education system. The Journal of Higher Education, 89(6), 892–934. https://doi.org/10.1080/00221546.2018.1442637. eprint: https:// doi.org/10.1080/00221546.2018.1442637. González Canché, M. S. [Manuel S.]. (2019c). Repurposing standardized testing for educational equity:: Can geographical bias and adversity scores expand true college access? Policy Insights from the Behavioral and Brain Sciences, 6(2), 225–235. González Canché, M. S. [Manuel S.]. (2022). Post-purchase federal financial aid: How (in) effective is the irs’s student loan interest deduction (slid) in reaching lowerincome taxpayers and students? Research in Higher Education, 1–54. https://doi. org/10.1007/s11162-021-09672-6. McMillen, D. P., Singell Jr, L. D., & Waddell, G. R. (2007). Spatial competition and the price of college. Economic Inquiry, 45(4), 817–833.
Acknowledgements
Spatial Socio-econometric Modeling (SSEM) was only possible based on the conceptual and methodological contributions of many researchers and scholars who made their code functions free to use, share, and reproduce. The generosity of those scholars has paved the road for the development of minimal code functions fueling the analyses presented in this book. I have striven to provide full access to codes, functions, and datasets that may, as well, be modified and strengthened, similar to how I modified functions and procedures implemented by the many scholars cited. Moreover, note that although the minimal code functions presented here were designed specifically for this book, each of these functions builds from a plethora of procedures taken from dozens of packages. Accordingly, each function, when used for the first time in a given R session, will display all the packages required, which also serves as an acknowledgment of these contributions. Accordingly, it is this communal sense of knowledge building, sense-making, and data storytelling that has motivated the development and realization of this book. Thank you.
xvii
Contents
Part I
Conceptual and Theoretical Underpinnings
1
SPlaces SPlaces: Spaces, Places, and Spatial Socioeconometric Modeling Spaces Places SPlaces Inequality in Mobility Prospects Measuring Inequality and Growing Inequality Neighborhood Effects and Concentration of ( dis)Advantages Splace-Based Modeling Challenges Causality in Spatial Modeling Closing Thoughts and Next Steps Next Steps Discussion Questions References
3 3 4 5 7 9 13 14 17 17 20 20 20 21
2
Operationalizing SPlaces Delimiting and Operationalizing Neighborhoods as Splaces Representing Physical Spaces and Nesting Structures Zooming in Across Administrative Boundaries Shapefiles as Spaces Place-Based Indicators Contributing to Building Splaces Neighborhood Operationalization and Disaggregation Data Point Differentiation Across Neighborhood Levels Illustration of Splaces and Data Point Gains Tradeoffs of Data Point Differences Bringing Concepts, Shapefiles, and Place-Based Indicators Together
25 26 27 28 28 30 32 34 34 36 39
xix
xx
CONTENTS
ACS Published or Pre-tabulated Data Identifying Proxies for Poverty Identifying Proxies for Median Income Identifying Proxies for Unemployment Identifying Proxies for Housing Quality Identifying Proxies for Family Structure Closing Thoughts and Next Steps Next Steps Discussion Questions References 3
Data Formats, Coordinate Reference Systems, and Differential Privacy Frameworks Types of Geo-Referenced Data: Raster and Vector Data Raster Data Vector Data Vector to Raster Transformations and Vice Versa Vector to Raster Data Transformations Moving From Raster to Vector Data Coordinate Reference Systems Elements of CRS Implications of Distortions Resulting from Map Projections Commonly Used Coordinate Reference Systems Differential Privacy Framework (DPF) and Changes to Census Micro Data Are Differential Privacy and Synthetic Data the Same Privacy Protection Strategy? What are Differential Privacy Algorithms? Relevance of Differential Privacy For SSEM Strategies to Protect Privacy Methodological Implications of DPF for SSEM Next Steps Discussion Questions References
Part II 4
41 43 45 47 48 50 52 52 53 53 55 56 56 61 66 66 70 74 74 78 79 81 82 84 86 87 90 92 92 93
Data Science SSEM Identification Tools: Distances, Networks, and Neighbors
Access and Management of Spatial or Geocoded Data R Tutorial Installation R Infrastructure Code Rationale Reading Data from an External Source Creating Datasets from Within R Merging Joining Data
97 97 98 99 99 101 101 104
CONTENTS
5
xxi
Installing Packages Reading Polygon Shapefiles Reading Polygons at the Country Level Reading Polygons at the County Level Reading Polygons at the ZIP Code Tabulated Area (ZCTA) Level Reading Polygons at the Census Tract Level Reading Polygons at the Block Group Level Reading Line Shapefiles All Roads Shapefiles Primary Roads Shapefiles Primary and Secondary Roads Shapefiles Reading Point Shapefiles Point Geocoding or Georeferencing Batch Geocoding Using Addresses in R Point Batch Geocoding Using ZCTAs in R Crosswalking Lower to Higher Level Crosswalking Place-Based Data Access at the Polygon Level Applying for a Census API Key Place-Based Data Access at the Point Level Joining Points with Polygons Data Closing Thoughts Next Steps Discussion Questions Replication Exercises References
106 109 109 113
Distances Distances Geolocated Data: Polygons, Points, or Both? Why is Distance Estimation Relevant for SSEM? Data Source and Data Requirements Projections, Distortions, and Bias Concerns? Network Analysis Tools and Data Transformations Approaches to Distance Connections Identification Multiple Sources of Points Matrix to Edgelist Transformations As the Crow Flies Distance Calculations From a Matrix to a List of Connections (Edgelist) with Distances “As the Crow Flies” Distances Including Multiple Unit Types Network Route Distance Calculations: As Humans Walk As Humans Walk Distances Between Two Points Batch “As Humans Walk” Distances
165 165 166 167 169 172 173 173 175 178 183
116 118 120 121 121 123 125 128 132 133 138 141 145 146 148 155 158 160 160 161 162 162
188 190 191 194 196
xxii
6
CONTENTS
humans_walk_batch Function Application Navigation/Travel Time Distances “Travel Distance” Data Format and Requirements travel_times Function Applications Closing Thoughts Next Steps Discussion Questions Replication Exercises References
200 206 207 208 213 213 214 214 215
Geographical Networks as Identification Tools Neighboring Structures and Networks What is a Network and How is it Different from or Similar to Neighboring Structures? From Distances (or Travel Times) to Networks and Neighboring Structures Point-Based Network and Neighboring Structures Identification Rules Radius-Based Approach Kth Closest Neighbor(s) Approach Inverse Distances From Neighboring Structures to Weights Code Application Moving Forward and Beyond These Standard Identification Approaches Crow Flies Versus Road Networks Distances Crow Flies Applying Road Networks Distances: “As Humans Walk” Using our Own Network Distances (and/or Travel Times) to Identify Neighboring Structures rad Function klosest Function rad_kth_row Function rad_kth_inv Function Moving Forward Identifying Neighboring Structures Among Different Types of Units Identifying the Local Presence of Units of Different Type Indirect Neighboring Structures Moving Forward and Feedback Loops Two-Mode Kth Closest Identification and Selection Polygons and Matrices of Influence Rook’s Bishop’s Queen’s
217 218 219 220 221 221 223 224 225 227 229 230 230 231 233 235 237 238 239 240 241 241 247 254 256 264 264 265 266
CONTENTS
Application Higher Order Neighbors Closing Thoughts Next Steps Discussion Questions Replication Exercises References
xxiii
266 269 271 273 273 274 275
Part III SSEM Hypothesis Testing of Cross-Sectional and Spatio-Temporal Data and Interactive Visualizations 7
SODA: Spatial Outcome Dependence or Autocorrelation SODA: Spatial Outcome Dependence or Autocorrelation Why Is SODA Statistically Concerning? Assessing SODA Based on Polygon Data Moran’s I Regression Approach Moran’s I Code Application with Polygon Data Is the First Order Neighboring Structure Enough? Assessing SODA Based on Point Data Machine Learning Tools to Assess SODA Decadence Moran’s I Code Application with One-Mode Point Data Data Source and Outcome of Interest Neighboring Structures and Weight Matrix Moran’s I Code Application with Two-Mode Point Data Two- To One-Mode Transformations and Rationale Causal Chains Through Spillovers in SSEM Local Moran’s I Visualizing Local Moran’s I Code Application Local Moran’s I Polygon Data Code Application Local Moran’s I One-Mode Point Data Code Application Local Moran’s I Two-Mode Point Data To Retain or Exclude Neighborless Units Code Application to Exclude Neighborless Units Social Outcome Dependence or Autocorrelation: SODA 2.0 Relationships in SODA 2.0 Application of SODA 2.0 Next Steps Discussion Questions Replication Exercises References
279 280 280 282 284 286 290 296 297 297 298 299 305 306 312 313 315 316 319 323 327 328 332 332 337 346 347 348 349
8
SSEM Regression Based Analyses Residual SODA and the Importance of Spatial Regression Modeling SODA Mechanisms in Regression Residuals Testing for RSODA
353 354 355 356
xxiv
CONTENTS
Simultaneous Autoregressive (SAR) Modeling Mechanisms and Implications of RSODA SAR Application to Polygon Data Assessing Whether RSODA was Handled Building a SAR Model While Addressing Place-based Multicollinearity Application of Feature Selection Via Random Forests Application Simultaneous Autoregressive Models SAR Application to Two-Mode Point Data Data Preparation and Transformations Outcome Indicators and Feature Selection Rationale Two-mode to One-mode Transformations Feature Selection with Point Data Building SAR Model While Addressing Place-based Multicollinearity Multilevel SAR Models Multilevel Data Statistical Description of Multilevel SAR Multilevel SAR Function Application SAR or Multilevel SAR Testing for Spatial Heterogeneity Via Geographically Weighted Regression How Does SAR differ from GW Approaches? Distance and Travel Time Matrices and Kernel Functions Do we Need GW?: GW Multiscale Summary Statistics Geographically Weighted Regression and Visualization Spatio-Temporal SAR: A Difference in Differences Application Spatio-Temporal Data or Panel Data with Spatial Information Testing for RSODA in Panel SAR SAR Panel Set Up SAR Panel Data Source and Setting Falsification Test Identification SAR Panel Application SAR Panel Function SAR and Multilevel SAR with Social Data Multilevel SAR Constrains for SODA 2.0 Social Multilevel SAR social_multilevel_SAR(…) Application Closing Thoughts and Next Steps Discussion Questions Replication Exercises References
359 359 361 362 365 366 370 373 373 375 376 380 380 384 385 387 392 401 401 403 404 406 413 420 421 422 422 424 425 426 427 433 434 435 436 440 441 442 444
CONTENTS
9
10
Visualization, Mining, and Density Analyses of Spatial and Spatio-Temporal Data SSEM Visualizations Polygon Data Visualization poly_map(…) Function Application Exploratory Spatio-Temporal Data Mining and Visualization spatio_panel_visual(…) Implementation Point Data Visualization point_map(...) Function Application Geospatial Point Density Methodological Approach What Questions May We Address with Geospatial Point Density? Code Application for the Maps Next Steps in Gesopatial Point Density Geographical Network Visualizations Data Sources Preparation Rationale geographical_networks(...) Application Two-Mode Networks Application One-Mode Geographical Networks Procedures Closing Thoughts Discussion Questions Replication Exercises References Final Words References
xxv
447 447 449 449 452 453 456 458 459 460 462 464 468 468 470 470 471 473 478 479 480 482 482 485 489
Glossary
491
Index
497
Acronyms
ACS CRS CT DiD DPF FBI GIS i.i.d. IPEDS IRS LCV OLS PUMAs PUMS RSODA 2.0 RSODA SAR SLID SODA 2.0 SODA ZCTAS
American Community Survey Coordinate reference system Critical threshold detected via the largest closest distance discussed in Chap. 6 Difference in differences Differential Privacy Framework Federal Bureau of Investigation Geographical information systems Identical and independent distribution Integrated post-secondary data system Internal Revenue Service Localized coefficient of variation Ordinary least squares Public Use Microdata Areas Public Use Microdata Sample Residual SODA 2.0 Residual SODA Simultaneous Autoregressive Models Student Loan Interest Deduction Social outcome dependence or autocorrelation Spatial outcome dependence or autocorrelation ZIP code tabulated areas
xxvii
List of Figures
Fig. 1.1 Fig. 2.1 Fig. 2.2 Fig. 2.3
Fig. 2.4 Fig. 2.5
Fig. 2.6
Fig. 2.7
Fig. 2.8 Fig. 2.9 Fig. 2.10 Fig. 2.11 Fig. 2.12
Fig. 2.13 Fig. 2.14
Spaces and places: splaces Tri-state Area: New York, New Jersey, Pennsylvania, USA, example Example of census tract shapefile or space Example of place-based information data storage format. Source American Community Survey, 2015–2019 Table B15001 Zooming in and operationalizing neighborhoods, Philadelphia, PA, USA, example Operationalizing splaces, Philadelphia, PA, USA, example. Source American Community Survey, 2015–2019, 5-year ACS estimates Table B15001 Visualizing splaces using college access among high school graduates, Philadelphia, PA, USA, example. Source American Community Survey, 2015–2019 Table B15001 Operationalizing splaces, Philadelphia, PA, USA, example. Source American Community Survey, 2015–2019 Table B15001 Table shells for all detailed Tables Click Here For Direct Table Access Example of poverty estimate Example of condensed poverty estimate Example of median income tables, source ACS Tables “B06011” and “B07011” Example of median income total for tables, source ACS Tables “B06011” and “B07011”. We will discuss the code to access these data below Example of unemployment estimates in tables: Tables “B23025” and “B23001” Example of housing quality proxy in Table “B25014” and others
4 27 29
31 33
35
35
37 42 43 45 46
46 47 49
xxix
xxx
LIST OF FIGURES
Fig. 2.15 Fig. Fig. Fig. Fig. Fig.
2.16 3.1 3.2 3.3 3.4
Fig. 3.5 Fig. 3.6 Fig. 3.7
Fig. Fig. Fig. Fig. Fig. Fig. Fig.
3.8 3.9 3.10 3.11 3.12 3.13 3.14
Fig. 3.15 Fig. 3.16 Fig. 3.17 Fig. 3.18 Fig. 3.19 Fig. 3.20 Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig. Fig.
4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 4.10
Fig. 4.11 Fig. 4.12
Example of housing quality proxy by lack of plumbing facilities (Table “B25016”) and type of heating fuel used (“B25040”) Example of family structure proxy, Table “B09002.” Example of raster cell changes within same space River raster representation Land raster representation Example of raster data representation not showing the land feature: river in blue, single houses in red, and settlements in magenta Example of three layers of raster data representing: a river (blue), three single houses (red), and four settlements (magenta) Example of three layers of raster data representing: a river (blue), three single houses (red), and four settlements (magenta) Depiction of University City’s ZCTAs (polygons), two main roads (lines), and Universities (points), all located in Philadelphia County Raster side by side comparison Representation of bounding source Google Maps Representation of raster to point transformation Representation of raster to point transformation Representation of raster to polygon transformation Representation of Y and X grid of latitudes and longitudes Representation of geodesic and planar CRSs. Code adapted from https://texample.net/tikz/examples/spherical-and-car tesian-grids/ Representation of geodesic and planar CRSs of the continental United States Representation of data structures of geodesic and planar CRSs using the continental United States Data protection mechanisms, changes, and challenges Example of noise infused locations. College location source The IPEDS Data Center Example of noise polygon locations. Source The TIGER/Line Shapefiles Locations of Migrants’ deaths 1990–2022. Source Humane Boarders Inc. (Fronteras Compasivas) R Layout Creating a data frame example Merging/joining data frames example Default content of the USA shapefile Plotting the USA using default and modified options Default ZCTA shapefile Default and modified ZCTA shapefiles All roads in Philadelphia county shapefile Primary roads nation and Philadelphia county shapefiles Primary and secondary roads in Pennsylvania and Philadelphia county shapefiles Primary and secondary roads in Pennsylvania delaware shapefiles Point shapefile structure featuring the State of Pennsylvania
50 51 57 58 59
60 60 63
65 67 68 69 72 73 75
76 79 80 82 88 89 91 99 102 106 110 112 117 118 123 124 126 127 129
LIST OF FIGURES
Fig. 4.13 Fig. 4.14 Fig. 4.15 Fig. 4.16 Fig. 4.17 Fig. 4.18
Fig. 4.19 Fig. 4.20 Fig. 5.1 Fig. 5.2 Fig. 5.3
Fig. 5.4
Fig. 5.5 Fig. 5.6
Fig. 5.7 Fig. Fig. Fig. Fig. Fig.
5.8 5.9 5.10 6.1 6.2
Fig. 6.3 Fig. 6.4 Fig. Fig. Fig. Fig.
6.5 6.6 6.7 6.8
Fig. 6.9 Fig. 6.10
Schools and academies in Pennsylvania and Philadelphia county Geocoding or georeferencing one address in Google Maps Geocoding or georeferencing all public schools, Source The Common Core Data Side by side comparison of schools’ distribution by georeferencing process Crosswalking ZCTAs to counties example, Source HUD-USPS Crosswalk Files Crosswalking ZCTAs to counties (higher to lower) and to tracts (lower to higher), Source HUD-USPS Crosswalk Files IRS example, including student loan interest deduction, Source ZIP Code Data Users Guide and Record Layouts IPEDS Example, merging two databases, Source Complete data files, IPEDS Centroids and polygons. Hypothetical distances in miles Distance conceptualizations and operationalizations Distance conceptualizations and operationalizations between the Community College of Philadelphia and The University of Pennsylvania Distance matrix of all public 2- and 4-year colleges and all private not for profit 4-year colleges in Philadelphia county, Pennsylvania Roads as a network—Philadelphia county roads shapefile Points added to the roads network (Community College of Philadelphia and the University of Pennsylvania)—Philadelphia county roads shapefile Snapping points to road network—Philadelphia county roads shapefile Point objects appended to a single geocoded database Matrices to edgelists transformation rationales Point objects appended to a single geocoded database Locations and inverse distance networks Connections conceptualizations and operationalizations: radius-based (500 feet), Kth neighbor (closest), and inverse distance approaches Neighboring structures across 2- and 4-year colleges within 27.6 min apart Neighboring structures across 2-year colleges that share at least one 4-year neighbor Rook’s neighboring specification Bishop’s neighboring specification Queen’s neighboring specification Rook’s neighboring specification in the contiguous United States Queen’s (magenta) and Rook’s (black) neighboring specifications in the contiguous United States Neighbors of neighbors neighboring structure using Queen’s as the input
xxxi 131 132 137 142 143
147 154 156 167 168
171
188 192
193 193 197 199 212 224
230 248 253 265 265 266 268 268 272
xxxii Fig. Fig. Fig. Fig. Fig. Fig.
LIST OF FIGURES
7.1 7.2 7.3 7.4 7.5 7.6
Fig. 7.7 Fig. 7.8 Fig. 7.9 Fig. 7.10 Fig. Fig. Fig. Fig. Fig.
7.11 7.12 7.13 7.14 7.15
Fig. 8.1 Fig. 8.2 Fig. 8.3 Fig. 8.4
Fig. Fig. Fig. Fig.
8.5 8.6 8.7 8.8
Fig. 8.9 Fig. 8.10 Fig. Fig. Fig. Fig. Fig.
8.11 8.12 8.13 9.1 9.2
Fig. 9.3 Fig. 9.4 Fig. 9.5
Intuition Moran’s I with two groups Monte-Carlo simulations of Moran’s I s Spatial correlogram with 10 higher order neighbors One-mode neighboring structure Higher order neighboring structures for one mode point data SODA based on continuous distances and neighboring structures for one mode point data Rationale of two-to one-mode transformations Higher order neighboring structures for one-mode transformed point data SODA based on continuous distances and neighboring structures for one mode point data Local Moran’s I representation in the contiguous united states for polygons Local Moran’s I representation for point data Local Moran’s I representation for point data Adjacency list to edgelist transformation Transformation to individual publication record Higher order analysis and local Moran’s I representation of co-authorship network RSODA test of SAR residuals state level data example Boruta plot for state indicators and SAR model building—see interactive version here https://cutt.ly/EZdhhHL RSODA for three SAR modes shown in Table 8.1 Boruta plot for point indicators and SAR model building—see interactive version here https://rpubs.com/msgc/point_ based_Boruta RSODA for three SAR modes shown in Table 8.2 Multilevel SAR framework and nesting structures Kernel functions for geographically weighted approach Geographically weighted and bootstrapped estimates per institution Geographically weighted and bootstrapped estimates per ZCTA (and county for Crime) Geographically weighted regression approach with multiscale adaptive bandwidths and bootstrapped standard deviations Santa Cruz and neighboring counties in California Residuals for women and men SAR and Naïve models RSODA 2.0 coauthorship networks Polygon mapping results Exploratory spatio-temporal data mining and visualization women results Exploratory spatio-temporal data mining and visualization men results Point mapping results—four out of all features requested Geospatial point density rationale
283 286 291 298 303 304 306 310 311 320 321 326 334 340 344 363 368 372
381 383 386 405 407 408 417 425 431 434 452 455 456 459 461
LIST OF FIGURES
Fig. 9.6
Fig. 9.7 Fig. 9.8
Fig. 9.9
Fig. 9.10
Fig. 9.11
Fig. 9.12
Aggregated geospatial point density rationale all deaths—interactive version at https://rpubs.com/msgc/spa tial_density_migrants_deaths Point mapping results disaggregated Aggregated geospatial point density rationale all deaths—interactive version at https://rpubs.com/msgc/spa tial_density_migrants_deaths Visual analysis of distances versus travel times—interactive version at https://rpubs.com/msgc/two-mode_geographical_ networks Public 2-year colleges with and without 4-year neighbors—interactive version at https://rpubs.com/msgc/ two-mode_geographical_networks Two-mode and one-mode transformed representations—the interactive version of the one mode transformed version is available at https://rpubs.com/ msgc/one-mode_transformed_geographical_networks Public 2-year colleges with and without 2-year neighbors—interactive version at https://rpubs.com/msgc/ one-mode_geographical_networks
xxxiii
463 464
466
469
476
477
478
List of Tables
Table 1.1 Table 2.1 Table Table Table Table Table Table Table Table Table Table Table Table Table
3.1 4.1 4.2 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 5.10
Table 6.1
Table 6.2 Table 7.1
Table 8.1 Table 8.2
Average years of education (with SDs) by city and ethnicity of inhabitants ages 18–24 Poverty thresholds for the 48 contiguous states and the district of Columbia Example of counts and noise (ε) added to microdata Ten first cases of address components IPEDS “complete data files” Example Point geocoded data storage—“pt” object in code Listing 5.7 Our own list of edges or links (A.K.A. edgelist) A matrix representation of all possible distances Origin and destinations for travel distances Appended origin and destination data sources Matrix of relationships including multiple types Trimmed matrix including units of different type From two individual matrices to one Edgelist of matrix with as the crow flies distances in miles Matrix representation populated with “as humans walk” distances Identification of neighboring structures of Fig. 6.1, radius threshold rule= 0.50 miles, Kth = closest, Inverse Distance = 0.60 miles CT Comparison of crow flies and as humans walk distances and resulting neighbors lists Cutoff Values Proposed in the Statistics (Evans 1996) and in the Social Networks literatures (Meghanathan 2016) to Assess Correlation Strength SAR models SAR models for point data
11 44 86 136 157 170 174 174 176 176 177 177 185 189 201
225 232
295 371 382
xxxv
xxxvi Table Table Table Table Table
LIST OF TABLES
8.3 8.4 8.5 8.6 9.1
Higher level connections (M) shown in Fig. 8.6 Lower-level units representations (W ) of Fig. 8.6 Δ or block diagonal design matrix representation of Fig. 8.6 SAR models Data structures and transformations for geographical networks
390 390 390 402 471
Code Listings for Replication Exercises

Listing 4.1 Steps to merge or join in R
Listing 4.2 Shapefile of the United States and Data Management—Access to code here https://cutt.ly/6GP3Qpn
Listing 4.3 Shapefile of the United States' Counties and Data Management—Access to code here https://cutt.ly/QGAG3bS
Listing 4.4 Shapefile of the United States' ZCTAs and Data Management—Access to code here https://cutt.ly/SGDN82e
Listing 4.5 Shapefile of the United States' Census Tracts and Data Management—Access to code here https://cutt.ly/bGFHx00
Listing 4.6 Shapefile of the United States' Census Block Groups and Data Management—Access to code here https://cutt.ly/bGFHx00
Listing 4.7 Shapefile of All Roads in Philadelphia County and Data Management—Access to code here https://cutt.ly/eGG70tO
Listing 4.8 Shapefile of Primary Roads in the Nation and in Philadelphia County and Data Management—Access to code here https://cutt.ly/rGJiFve
Listing 4.9 Multiple States Primary and Secondary Roads—Access to code here https://cutt.ly/YCfwumg
Listing 4.10 Shapefile of Primary and Secondary Roads in Pennsylvania and Philadelphia County and Data Management—Access to code here https://cutt.ly/rGJiFve
Listing 4.11 Shapefile of Point Landmarks in Pennsylvania and Data Management—Access to code here https://cutt.ly/kGMQesj
Listing 4.12 Geocoding Pennsylvania Public Schools and Data Management—Access to code here https://cutt.ly/nGMQWMe
Listing 4.13 Geocoding Pennsylvania Public Schools from ZCTAs and Data Management—Access to code here https://cutt.ly/nGMQWMe
Listing 4.14 Crosswalk Example Using ZCTAs and Counties—Access to code here https://cutt.ly/UHbYHuF
Listing 4.15 Installing Census API Key
Listing 4.16 Accessing Data from ACS and Crosswalking—Access to code here https://cutt.ly/kHWexso
Listing 4.17 Accessing IRS Data Directly from Within R
Listing 4.18 Accessing Data from the IRS and Merging with ACS—Access to code here https://cutt.ly/dHIooIX
Listing 4.19 Accessing Data from the IRS and Merging with ACS—Access to code here https://cutt.ly/CHA42zp
Listing 5.1 Data Structures Showing Multiple Data Points of Different Types—Access to code here https://cutt.ly/GCfrGlB
Listing 5.2 Network Transformations Among One Type of Units—Access to code here https://cutt.ly/dCftam4
Listing 5.3 Network Transformations Among Two Types of Units—Access to code here https://cutt.ly/UCftMlp
Listing 5.4 Crow Flies Distance Computations—Access to code here https://cutt.ly/3H9JWCO
Listing 5.5 Example Non-square Matrix Including Units of Multiple Types
Listing 5.6 Humans Walk Distance Computations Between Two Points—Access to code here https://cutt.ly/OJwpzmF
Listing 5.7 Batch Humans Walk Distance Computations—Access to code here https://cutt.ly/dJrX77w
Listing 5.8 Batch Humans Walk Distance Computations, All Public 2- and 4-Year and Private Not-for-Profit 4-Year Colleges Located in Philadelphia County—Access to code here https://cutt.ly/FJtfBkD
Listing 5.9 Batch Humans Walk Distance Computations, Two Types of Units—Access to code here https://cutt.ly/eJYOJsW
Listing 5.10 Travel Distance Computations—Access to code here https://cutt.ly/3Ja55Sy
Listing 5.11 Comparing Performance with Google Maps
Listing 6.1 Standard Network Creation—Access to code here https://cutt.ly/dJ0HZiR
Listing 6.2 Functions Available for Neighboring Structures and Matrices of Weights—Access to code here https://cutt.ly/gCfFU2a
Listing 6.3 Data Structure Edgelist with Distances
Listing 6.4 Radius-Based Neighboring Structures Identification—Access to code here https://cutt.ly/ACfFNbB
Listing 6.5 Kth Neighboring Structures Identification—Access to code here https://cutt.ly/4CfF6TW
Listing 6.6 Inverse Distances (Radius and Kth) Neighboring Structures Identification (Row Standardized)—Access to code here https://cutt.ly/pCfGfG2
Listing 6.7 Inverse Distances (Radius and Kth) Neighboring Structures Identification (Inverse Distances)—Access to code here https://cutt.ly/dCfGnQi
Listing 6.8 Two Mode Neighbors Identification—Access to code here https://cutt.ly/lKzTylC
Listing 6.9 Example of Travel Time in Minutes
Listing 6.10 Result of Matching with Count of Neighbors
Listing 6.11 Two- to One-Mode Transformations
Listing 6.12 From Two- to One-Mode Neighbors Identification—Access to code here https://cutt.ly/EKvv4lI
Listing 6.13 Counts and Row Standardization that Captures Neighbor Strength Heterogeneity
Listing 6.14 Students' Homes for Nearby Enrollment
Listing 6.15 Institutions Locations
Listing 6.16 Homes and Colleges Distances
Listing 6.17 Nearby Selection Identification—Access to code here https://cutt.ly/PKQFzmX
Listing 6.18 Nearest Identification
Listing 6.19 Three Closest Units Identification
Listing 6.20 Closest Neighbor Enrollment
Listing 6.21 Identifying Any of the Three Closest Neighbors Enrollment
Listing 6.22 From Polygon to Neighbors—Access to code here https://cutt.ly/3KTx7v2
Listing 6.23 Neighbors of Neighbors—Access to code here https://cutt.ly/GKYXIJn
Listing 7.1 Moran's I Procedures with Polygon Data—Access to code here https://cutt.ly/dK1f2U3
Listing 7.2 Moran's I Procedures with Higher Order Neighbors—Access to code here https://cutt.ly/GK1VWwd
Listing 7.3 College Score Card Data Structure
Listing 7.4 One Mode Moran's I Procedures—Access to code here https://cutt.ly/sLpi2n0
Listing 7.5 Frequentist and Monte Carlo Simulations of Moran's I for One-Mode Point Data
Listing 7.6 Two Mode Moran's I Procedures—Access to code here https://cutt.ly/6LsfQcF
Listing 7.7 Parametric and Monte Carlo Simulations of Moran's I for Two-Mode Point Data
Listing 7.8 Local Moran's with Polygon Data—Access to code here https://cutt.ly/dLb7DW3
Listing 7.9 Local Moran's with Point Data—Access to code here https://cutt.ly/dLnuVkT
Listing 7.10 Local Moran's with Point Data Two Mode—Access to code here https://cutt.ly/FCclsHi
Listing 7.11 Removing Neighborless Units—Access to code here https://cutt.ly/iLb1XR2
Listing 7.12 List of Co-authorship and Strength Based on Number of Co-publications
Listing 7.13 List of Number of Publications from 2010 to 2020 per Author
Listing 7.14 Format of Co-authorship Relationships
Listing 7.15 SODA 2.0 Analyses—Access to code here https://cutt.ly/5LYUScW
Listing 7.16 Distribution of Individual Publication
Listing 8.1 RSODA Procedures with Polygon Data—Access to code here https://cutt.ly/DL91d3d
Listing 8.2 RSODA Test Based on Monte Carlo Simulations
Listing 8.3 Null SAR Model and Test for RSODA Procedures—Access to code here https://cutt.ly/CZqLtSG
Listing 8.4 Model Building via Feature Relevance and Machine Learning—Access to code here https://cutt.ly/aZrwCNc
Listing 8.5 Code to Showcase the Creation of the Travel Time Matrix for SAR with Point Data
Listing 8.6 Model Building via Feature Relevance and Machine Learning for Point Data—Access to code here https://cutt.ly/aZrwCNc
Listing 8.7 Components of the Multilevel SAR Function Provided for SSEM
Listing 8.8 Multilevel SAR Function Application—Access to code here https://cutt.ly/AZmN437
Listing 8.9 Summary Approach for Geographically Weighted Methods—Access to code here https://cutt.ly/wZIh3em
Listing 8.10 Extract of Summary Information
Listing 8.11 Spatio-Temporal Structure of the Database Analyzed
Listing 8.12 Spatio-Temporal SAR Analyses—Access to code here https://cutt.ly/uXjGB7b
Listing 8.13 Data Preparation and Balancing Checks
Listing 8.14 Women Panel SAR Results
Listing 8.15 Men Panel SAR Results
Listing 8.16 Placebo Estimates for Men in Panel SAR Results
Listing 8.17 Procedures to Use SAR with Social Data and Test for RSODA 2.0—Access to code here https://cutt.ly/KCfHfgu
Listing 8.18 Simulation of Collaborations
Listing 8.19 Social Multilevel SAR Function Application—Access to code here https://cutt.ly/sXPSxch
Listing 9.1 Mapping Procedures Polygon Data—Access to code here https://cutt.ly/4XGAvNC
Listing 9.2 Exploratory Spatio-temporal Data Mining and Visualization—Access to code here https://cutt.ly/HX3rFEw
Listing 9.3 Mapping Procedures Point Data—Access to code here https://cutt.ly/MXJktgE
Listing 9.4 Point Density Procedures—Access to code here https://cutt.ly/1XCUyUs
Listing 9.5 Geographical Network Visualization Procedures—Access to code here https://cutt.ly/EX0D6vP
Listing 9.6 Two- to One-Mode Transformations—Access to code here https://cutt.ly/xCfHIu3
PART I
Conceptual and Theoretical Underpinnings
The first three chapters of this book are dedicated to setting the conceptual and theoretical bases for Spatial Socio-econometric Modeling (SSEM). Chapter 1 introduces notions of space, place, and splaces (configured by bringing together the former two) and their relevance for SSEM. A vital discussion here is inequality in mobility prospects and its interaction/intersection with neighborhoods. Without a clear understanding of inequalities, SSEM becomes just a spatial modeling tool, rather than a comprehensive set of methods and concepts that aim to model how splaces affect their inhabitants' potentialities for success. Chapter 2 is more applied in nature, but the discussions remain at the conceptual level, for we have yet to start using methodological tools to apply the concepts discussed in this chapter. Specifically, we illustrate the diverse sources and levels of spatial data publicly available and how we may use them to operationalize concentrations of advantages and disadvantages. Finally, Part I closes with Chap. 3. In this chapter, we discuss types of spatial data (raster and vector), transformations among these types, as well as the relevance of coordinate reference systems. The chapter closes with a discussion of disclosure avoidance mechanisms implemented by the United States Census and their presence in decennial data for many decades—starting in 1930. Particular emphasis will be placed on how new developments of these disclosure avoidance techniques, in the form of differential privacy algorithms (DPF), and the use of synthetic (or simulated rather than collected) data may impact our SSEM analyses. Once Part I is completed, we will be ready to start discussing applications that involve actual data access and manipulation, relying on spatial data science and interactive visualization tools, as we discuss in Part II.
CHAPTER 1
SPlaces
Abstract Where we grew up shaped or at least impacted our experiences. Take a moment and remember your childhood neighborhood. What do you see? How clean does it look? How does it smell? What sounds do you hear? Are there trees? Are there parks? Is it safe to walk around? What about public libraries? How far was the closest college? As discussed in this chapter, the answers to these questions require the intermingling of spaces and places, and your answers, to a great extent, determine how you experienced life growing up in those neighborhoods and the types of memories you created, over and above the specific experiences and memories you had and created inside your home. This introductory chapter serves two main purposes. First, it discusses spaces and places as constructs with significant differences but that form part of a continuum, which we will call "splaces," with clear conceptual and methodological implications for spatial socioeconometric modeling. Second, we elaborate on inequality concepts and subsequently connect these notions to spaces and places, or splaces as conceptualized in this chapter. As part of this discussion, we then highlight the relevance of accounting for geographies of disopportunity in spatial socioeconometric modeling based on our goal of understanding processes, mechanisms, and structures that may help explain inequalities in our participants' decisions and outcomes. Finally, we close with a brief discussion of threats to validity and modeling challenges that emerge from the inclusion of place-based indicators in our spatial socioeconometric models.
SPlaces: Spaces, Places, and Spatial Socioeconometric Modeling

Spatial socio-econometric modeling (SSEM) is a set of conceptual and methodological tools for modeling and understanding how geo-located or geo-referenced attributes may be used to explain their inhabitants' social and economic outcomes, and how these outcomes may be affected by feedback loops that translate into concentrations of (dis)advantages. From this perspective, the discussion of spaces and places and their characteristic inequalities in resources that may
impact their inhabitants' outcomes is at the heart of spatial socio-econometric modeling as a conceptual and methodological framework. Accordingly, to begin our formal presentation of SSEM, let us first discuss spaces, places, and the analytic and conceptual gains that may emerge from understanding them together (Agnew, 2005; Agnew & Livingstone, 2011; Cresswell, 2008) as splaces. Although spaces and places are part of a continuum (Agnew, 2005; Cresswell, 2008; Tuan, 1977), for identification strategies (i.e., the systematic and well-documented process of gathering indicators, attributes, and characteristics that will enable us to build models) and modeling/analytic processes (i.e., a set of structured steps followed to execute those models and test hypotheses) it is useful to highlight their conceptual and empirical differences.

Spaces

As depicted in Fig. 1.1, spaces, and more specifically geographical spaces, are inherently physical and therefore objectively measurable, identifiable on, and delimitable around earth's surface (Cresswell, 2008; Tuan, 1977). Spaces are abstract (Cresswell, 2008; Tuan, 1977) for they can be represented with grids or systems of mathematical coordinates (Agnew, 2005) that are not necessarily earth bound. However, when this grid or coordinate system is applied to earth's surface, we refer to spaces as being geolocated or georeferenced, for the geo root or prefix comes from gê, which in Greek means earth.
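To make the idea of georeferencing concrete, the following minimal sketch places a pair of longitude/latitude coordinates on earth's surface through a coordinate reference system. It assumes the sf package, and the coordinates (approximately Philadelphia's City Hall) are illustrative only.

# A minimal, illustrative sketch of georeferencing: a longitude/latitude pair
# becomes a geolocated point once a coordinate reference system (here WGS 84,
# EPSG:4326) ties the abstract grid of coordinates to earth's surface.
library(sf)

philly <- st_sfc(st_point(c(-75.1652, 39.9526)), crs = 4326)

st_coordinates(philly)  # the point expressed in the coordinate grid
st_crs(philly)          # the reference system that anchors it to earth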
Fig. 1.1 Spaces and places: splaces. Spaces allow us to measure distances between/among objects and to create lists of connections among them; since spaces are physical and measurable, we can then apply statistical algorithms and make inferences using their information. Places have meaning, are experienced or lived, may be longed for, and have affected or shaped our decisions and outcomes. Spatial socioeconometric modeling builds from spaces' and places' (splaces') information to identify processes, mechanisms, and structures that may help explain inequalities in decisions and outcomes.
Spaces can also be defined as planes on which objects are observed and events may be recorded (Agnew, 2005). Accordingly, once spaces are mathematically delimited using planes or grids, we can then quantify indicators that are observed and/or measured in a given geolocation. Some examples of these indicators are elevation, humidity, temperature, and type of terrain. These objective measures may vary over short periods of time, like temperature in the morning versus at night, or in the summer months versus the winter season. Some other attributes may remain relatively constant or take many years to change in a given space, such as altitude or elevation.

Other sets of objectively measurable attributes of spaces that are of particular importance for spatial analyses involve distances among objects observed within spaces. As further discussed throughout this book, it is this distance among objects located in those spaces that will eventually allow us to define rules to identify relationships or connections among such objects. Moreover, it is this identification of relationships that will also eventually allow us to measure influence or dependence of attributes among objects located in those spaces (Bivand et al., 2013; Lovelace et al., 2019; Pebesma & Bivand, 2020). Chapter 6 details the methodological steps required to identify these connections; for now, it may suffice to simply describe them as contiguity- or distance-based identification rules that link two or more objects of interest observed in a given space.

Spatial and objectively measurable properties are fundamental to SSEM. These quantifiable properties—which emerge from identifying or establishing connections among objects (based on physical distances observed in spaces) following well-defined sets of rules (i.e., contiguities or distance thresholds)—enable the application of statistical algorithms as strategies to understand mechanisms and processes. As further discussed in this chapter, with these conceptual and methodological tools, one can then gain deeper understandings of the social reproduction of advantages and disadvantages as a function of spatial location and feedback loops (Jargowsky & Tursi, 2015). These processes, however, require us to introduce and define the concept of places.

Places

Thus far, this opening discussion about abstract yet measurable features serves to exemplify that although fundamental for SSEM, spaces themselves have no meaning (Cresswell, 2008) for they are more conceptual than real (Agnew, 2005; Tuan, 1977). It is only when the objects located within spaces become meaningful, that is, when they are experienced and remembered (Cresswell, 2008; Tuan, 1977), that the true analytic power of SSEM aimed at detecting, modeling, and explaining social structures becomes more feasible to attain. More specifically, let us start referring to those objects as more concrete places that we have experienced, such as school and home, for example, or as Cresswell (2008) may refer to them: place = location + meaning. For example, take a moment and remember your childhood neighborhood. What do you see? How clean does it look? How does it smell? What sounds do
you hear? Are there trees? Are there parks? Is it safe to walk around? What about public libraries? How far was the closest college? With this exercise of retrieving our memories associated with those places, we may also remember measurable attributes such as distances we may have traveled from home to school along with other physical attributes that may have affected our experiences associated with those places (i.e., temperature, elevation). More to the point, these attributes may go beyond physical ones and may include socially based indicators like poverty, crime, and educational attainment levels. From this perspective then, places differ from yet complement spaces in that places have meanings that transcend their mere physical properties (Tuan, 1977) but are experienced in those spaces. That is, although places are inherently linked to spaces (Agnew, 2005; Cresswell, 2008) for they are physically located on or linked to earth (i.e., geolocated or georeferenced), places also go beyond physical measurable attributes and notions. Places have and convey meaning; they are experienced and lived and, as such, can be remembered and have likely impacted, affected, or even shaped important aspects of our lives. We long for our childhood neighborhoods (Agnew, 2005; Cresswell, 2008; Tuan, 1977), regardless of how good or bad our experiences in those places were. For example, despite growing up across the street from a local gang, having had several bicycles stolen, and eventually having been assaulted and sent to the hospital by some of these gang members for refusing to join their gang, these negative experiences are only part of my memories. I still remember and miss the times when I played soccer or baseball pickup games with my (gang and non-gang-affiliated) neighbors and the times my siblings and my parents took me for walks at our local park. The experiences we form inside our homes, which interact with the specific set of unique spatial configurations and physical attributes these places offer, become life experiences that jointly prepare us to interact in other spaces and places. To this day, for example, I still feel confident about my ability to quickly assess and identify factors that may constitute real risks (vis-à-vis violence or crime) in West Philadelphia, even though my childhood experiences happened in a different country and several decades ago. For modeling and analytic reasons, spaces and places may be differentiable yet complementary, and although it is useful to identify them individually, their analytic power is better served when we examine them together (Agnew, 2005). Going back to Fig. 1.1, note that spaces are physical and serve to apply mathematical operations via the connections identified among objects following distance rules. Places, on the other hand, have meanings and are experienced and lived. Considering only one or the other in our modeling efforts may limit our understandings of their combined influence. That is, to continue with a previous example, allow me to use the first person to share some of my personal experiences growing up. I grew up in a space that has several parks and green areas near my parents' house. These public areas included a public park with rapid soccer (played on concrete rather than grass) and basketball courts located a few houses down
the block. Moreover, I grew up in Mérida, the capital of the State of Yucatán, and the largest city of the region (southeast of Mexico). Even though this city is located in the south of the country, the vast majority of houses in the city, and certainly all houses in my neighborhood, had running water, gas, electricity, public lighting (illumination), and immediate local access to public transportation. In terms of college availability near my parents' home, the travel time by bus from my home to the campus where I attended college was about 20 minutes, and I typically biked in about the same time. All these measurable factors are considered positive attributes that may resemble the attributes of socioeconomically privileged neighborhoods. However, failing to account for the level of violence, poverty, unemployment, or under-employment associated with my home neighborhood, all of which also shaped my experiences, would offer a limited vision of the relevance of places and their interaction with spaces in creating our memories and affecting our decisions. Specifically, it was the assault I experienced and the heightened crime levels typical of that area that, to a great extent, motivated me to pursue scholarships and fellowships out of my home state. Similarly, ignoring the space-based benefits, such as local college availability, which translated into the increased affordability that living at home implied, for example, would also limit our understandings of how spaces and places intermingle to shape our experiences and opportunities. From this view, it is only when, or if, we can include a more complete picture or portrait of the spaces and places (or splaces) surrounding people's lives and experiences that our models may more realistically capture their impact on people's decisions and outcomes.

SPlaces

This discussion aims to elaborate on how places and spaces (splaces) intermingle on a daily basis and how their joint, yet differentiable contributions to spatial socioeconometric modeling may enrich our understandings of processes, mechanisms, and the explanations of the outcomes observed (Agnew, 2005; Cresswell, 2008; Tuan, 1977). To begin, let us discuss the first law of geography (Tobler, 1970). Tobler (1970) said that although everything is related to everything else, closer things are more related than distant things. The operationalization of this statement requires the identification of spatial components, namely distance among the objects of interest, as well as the incorporation of their place-based indicators or attributes that may enable us to measure the extent to which this law is realized in a given geographical area or zone. Specifically, imagine we may be interested in measuring the degree to which an indicator of food security varies across city neighborhoods. As further discussed in Chap. 6, from a spatial modeling perspective we first need to establish some form of spatial connections among neighborhoods to measure dependence—or the notion that nearer things might be more related than distant things. Once we have established or identified this set of spatial connections (i.e., a spatial weights matrix), we can
then include place-based indicators of interest (i.e., food security in this example) and apply algorithms to mathematically assess spatial dependence, using Moran's I—Chap. 7 details this analytic technique (a brief code sketch of this workflow appears at the end of this section). Although the Moran's I coefficient will suffice to address the goal of measuring the extent to which the first law of geography is realized when discussing this food security indicator, more nuanced understandings of the underlying structural issues, processes, and factors that may influence the variation of this indicator require both additional analytic tools (e.g., regression models) and additional place-based indicators. Note that the inclusion of place-based indicators does not mean that we are actually capturing the meanings and memories of individuals who lived and experienced life in those areas. This would be naïve, for those memories are personal, created and recreated at the individual level in interaction with social and splatial structures. Instead, our efforts to include place-based indicators are more modest in nature. We assume that, when a place or neighborhood is delimited (or operationalized) in the form of a block, tract, or ZIP code tabulated area [ZCTA] within a county (or a city as discussed in Chap. 2), individuals who know (i.e., lived or grew up in) those places may have been systematically exposed to similar experiences, and, to the extent to which our selection of place-based indicators approximates the conditions our participants lived in, we may be able to capture the impact of this systematic exposure in our model estimates, even if we are not capable of truly understanding what it actually meant for them to live in those areas. In short, we do not argue that place-based attributes capture memories, but simply approximate, in the best-case scenario, common exposures that translated into common experiences capable of altering or impacting their outcomes. From this view then, our goal as place-based analysts consists of aiming to control for as much of the variation as possible that living in and experiencing those places may have exerted on their inhabitants' prospects of social mobility in interaction with such inhabitants' own sociodemographic attributes. This latter point deserves more attention. Specifically, although unemployment, crime, and violence indicators in a given neighborhood may, to a great extent, impact most of its inhabitants similarly, on average, important variations may be captured within areas by considering individuals' gender, age, race, and ethnicity indicators. That is, as further detailed below, being a black male teenager in an affluent neighborhood may translate into quite different experiences compared to being a white male teenager in the same neighborhood. Similarly, being a black male teenager in a high-crime, high-poverty area also likely results in different life experiences compared to being a white male teenager in such a neighborhood. The same may be true for the intersection of other gender, age, and ethnicity groups, for instance. This is the reason why, whenever possible, our analyses including splace notions may be richer, in terms of the variations we may be able to capture, to the extent we can disaggregate our models to account for the lived experiences of our participants considering their sociodemographic attributes in place-based contextualized conditions instead of assuming linear
or homogeneous effects of splaces (see Crane, 1991; Faber & Sharkey, 2015; Jencks & Mayer, 1990). To exemplify how individuals' traits may impact their place-based experiences and outcomes, in Chap. 8 we will present an example of how a place-based policy change was associated with divergent effects given inhabitants' gender. This policy consisted of the enactment of a scholarship to attend a local community college that only targeted county high school graduates—that is, students who attended high school in other counties were not eligible to benefit from this program. In this analysis we will assess the extent to which this scholarship impacted the college participation rates of 18–24-year-old men and 18–24-year-old women differently in that county. In a similar analysis of these same data, González Canché (2018c) estimated the expected differential impact of this policy for men and women who lived nearby (or closer) to that college (i.e., within 12 miles) compared to its impact among those men and women who lived farther away, also within the same county. Although these analyses could and should be expanded to include ethnicity indicators (for example), the inclusion of disaggregated estimates by gender intersected with distance measures revealed differential patterns of the impact of this policy change on college-going rates. As distance from college increased, men were less likely to benefit from this policy, whereas their women counterparts were not negatively affected by this distance indicator. In other words, distance from college negatively affected men's enrollment rates but had no negative impact on women's college participation in that community college. However, the analyses shown in Chap. 8 indicate that, compared to men in neighboring counties, those who lived in the county where the policy was implemented increased their college-going participation rates. In the case of women in this county, the passing of this policy did not impact their college-going rates compared to those of their women counterparts in neighboring counties. In combining both examples, then, we may conclude that men who lived nearby this college were likely impacted the most, certainly more than their within-county peers who lived farther away from this campus, who were also eligible for this place-based scholarship. The main goal of including these splace notions, in interaction with the attributes of their inhabitants and with distances as identification strategies, is to account, as much as possible, for structural inequalities that continue to impact social mobility prospects and may help capture place-based inequalities. To this end, the following subsection presents a discussion of the main conceptual lenses guiding spatial socioeconometric modeling to be used throughout the book and how these lenses may help guide the selection of sociodemographic and socioeconomic indicators to be included in our SSEM analyses.
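Before turning to inequality concepts, the food security illustration introduced above can be made concrete with the minimal sketch referenced earlier: building a spatial weights matrix from neighborhood polygons and assessing spatial dependence with Moran's I. The sketch assumes the sf and spdep packages; the object and variable names (city_nbhds, food_insecurity_rate) are hypothetical placeholders, and Chaps. 6 and 7 present the full procedures.

# A minimal sketch, assuming city_nbhds is an sf polygon object with one row
# per neighborhood and a numeric column food_insecurity_rate.
library(sf)
library(spdep)

nb <- poly2nb(city_nbhds)          # contiguity-based neighboring structure
lw <- nb2listw(nb, style = "W")    # row-standardized spatial weights matrix

# Global Moran's I: analytical test and a Monte Carlo (permutation) version
moran.test(city_nbhds$food_insecurity_rate, lw)
moran.mc(city_nbhds$food_insecurity_rate, lw, nsim = 999)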
Inequality in Mobility Prospects

Inequality refers to variation in one characteristic, resource, or attribute (Osberg, 2001) among units configuring a group or population in a given society (Jasso,
2015; Koh, 2020; Osberg, 2001). Typically, this variation is not randomly distributed across groups; instead, it often affects certain groups persistently and systematically, which may lead to unequal or even unjust distributions of resources and opportunities in society (Koh, 2020). Considering that inequality has been a prevalent issue in social settings since the origins of recorded history (Jasso, 2015), its understanding should be a central task of any study that involves social processes. Furthermore, the study of persistent inequality issues and their effects is particularly relevant for studies that consider spatial/geographical dimensions (Anderson, 2013; Jargowsky & Tursi, 2015; Jasso, 2015; Koh, 2020), likely due to the tendency of socioeconomic and sociodemographic indicators to be splace-based clustered or concentrated. The study of inequality involves both the comparison of resources and the identification of levels of analysis (Jasso, 2015; Osberg, 2001). This means that, once we have identified the resources of interest (i.e., income, as further discussed below), we can measure this form of inequality at the individual, city, or country level, for example (Jasso, 2015; Koh, 2020; Osberg, 2001). More to the point, the inclusion of geographical indicators allows us to further identify potential variations of these resources within geographical regions and across distinct groups (Koh, 2020; Osberg, 2001). From a geographical perspective, inequality may be measured at any level of aggregation, with individuals' location rendering the most disaggregated form of these levels. Since individuals live in households (Osberg, 2001), the second level of aggregation is then the household-level estimate, which may account for the income prospects of other household members (Osberg, 2001). By the same token, census block estimates, which are configured by household estimates (wherein each household estimate is itself configured by its members' income prospects), constitute yet another aggregation level. This same rationale applies to ZIP code tabulated areas, which are configured by census blocks, and then cities that nest ZIP codes, blocks, and households, and so on. For modeling considerations, note that higher levels of spatial aggregation (i.e., cities versus ZIP code tabulated areas or countries versus states) condense more information, and the more information accounted for in a single estimate, the higher the possibilities of blurring or masking inequalities that may be observed at more local or disaggregated levels. Specifically, when referring to countries we may simply refer to them as median and/or average years of formal education in Country A and median and/or average years of formal education in Country B. Although these descriptions may be useful for broader understandings and comparisons, they do not allow us to clearly observe more nuanced differences taking place within those countries. That is, regions configuring countries may present important variations, and within those regions there may be some states that are thriving, whereas there may be others that are struggling, yet this information is masked when only looking at higher levels of aggregation. Masking issues, however, are not restricted to geographical analyses. Even if we "zoom in" from a county to a city-level depiction (with the understanding that counties may account for several cities) and present estimates without
considering inhabitants' attributes, we may continue to mask unequal distributions of resources. Specifically, if we present the average number of years of education or unemployment rates in City A and City B (instead of County A and County B), we would still fall short in capturing how inhabitants from distinct groups may be faring in these indicators. Accordingly, a more detailed approach to depict variations consists of describing variations across subsets of the population (i.e., subgroups) inhabiting those splaces. Procedurally, this approach requires the computation of variations by attributes of groups inhabiting those cities/counties or neighborhoods. Specifically, we can estimate differences in disposable income, years of education, or employment rates, for example, by ethnicity, gender, or the intersection of these attributes across cities. From this view then, our understandings of inequalities across indicators may be strengthened by presenting splace-based analyses by specific attributes of their inhabitants. To illustrate this point, let us assume we have three cities, and we want to understand the degree to which their inhabitants ages 18–24 have similar or dissimilar average years of education attained. Moreover, assume we may also be curious to know how these indicators may vary by ethnicity indicators of their inhabitants. To present and analyze these estimates we could tabulate these indicators as shown in Table 1.1.

Table 1.1 Average years of education (with SDs) by city and ethnicity of inhabitants ages 18–24

Groups         City A        City B         City C         Group average
White          13 (2.5)      15 (1.0)       11 (3.5)       13 (2.33)
Black          9 (3.5)       8 (2.0)        13 (2.5)       10 (2.67)
Hispanic       10 (1.5)      8.5 (3.0)      16 (0.5)       11.5 (1.67)
Asian          15 (2.5)      15.5 (1.0)     14.5 (2.0)     14.83 (1.83)
City average   11.75 (2.5)   11.75 (1.75)   13.63 (2.13)   12.33 (2.13)

To read Table 1.1, note that the most aggregated level of information is contained in the intersection of the column "Group average" and the row "City average." This indicator, then, reflects that the average years of education attainment across these cities and ethnicity groups is 12.33 years with a standard deviation (SD) of 2.13. The next level of aggregation, which does not account for location, is captured by row averages, which are separated by ethnicity. That is, using this information we can read that Asian and Black participants had the highest and lowest average years of education with 14.83 years (SD = 1.83) and 10 (SD = 2.67), respectively, but these estimates alone do not tell us anything about potential variations by cities. When we look at column totals, we gain information about splace-based estimates but, by itself, this information tells us nothing about differences by ethnicity. That is, inhabitants ages 18–24 in City C have a mean of 13.63 years of education on average (SD = 2.13). It is only when we intersect rows and
columns that we gain disaggregated estimates by city and ethnicity, as shown next. The disaggregated indicators shown in Table 1.1 indicate that Asian inhabitants, ages 18–24, have the highest attainment levels in City B with 15.5 years of formal education, on average (SD = 1.0). However, we also see that Hispanic inhabitants, in the same age group, but in City C, performed even better, for they attained the equivalent of a four-year college degree. This estimate also had the lowest variability across all groups and cities (SD = 0.5) and practically doubled the average number of years of education of their Hispanic counterparts living in City B. On the other hand, Black inhabitants in City B had the lowest attainment years across all cities and groups, whereas Asian inhabitants consistently had high attainment levels across cities. This descriptive example serves to showcase how the addition of inhabitants' attributes to their geographical locations adds to our understandings of splace-based inequalities. Although these estimates themselves do not allow us to explain why those differences may be happening, they highlight that notable educational attainment inequalities are realized across and within those cities and among and within most ethnic groups. These variations, then, merit further exploration and justify the need to incorporate other sets of indicators or attributes of those cities (i.e., splace-based) and groups into our analyses in our quests to understand the factors that may be driving these inequalities in educational attainment. From the previous example and following Jasso (2015) and Osberg (2001), note that when designing studies on inequality we should consider whether we are interested in aggregated or disaggregated estimates. The selection of this focus depends on our research goals. International comparative reports typically refer to aggregated country estimates disaggregated by groups based on age and socioeconomic status, for instance, whereas for evaluation and policy studies the interest is typically placed on how sub-groups may be impacted by a policy change, for example, therefore focusing on more nuanced and disaggregated estimates. However, when crafting or designing an evaluation or policy-analysis study, we should not immediately or automatically discard aggregated estimates. Indeed, one strategy to justify the study of subgroup variation consists of first presenting aggregated estimates and then discussing disparities by sub-groups and/or geographies. Returning to the example presented in Table 1.1, one could first highlight that the aggregated average number of years of education is 12.33 (SD = 2.13) across inhabitants ages 18–24 in these cities. After that, one can proceed to discuss how the disaggregated estimates by ethnicity and by city vary. For example, we can note that City C has both the highest mean attainment years across cities and that this estimate may be in part explained by this city also having the ethnic group estimate with the highest average years of education across all groups in the sample. This descriptive information may serve to motivate the analysis of factors that may be driving the high attainment level of Hispanics in this city, as well as the identification of other factors that
may explain the low academic attainment, in terms of average years of education, realized at City B by Black and Hispanic inhabitants. The following subsection further elaborates on specific strategies and indicators that have been used in the literature to measure inequality. After that discussion, we will focus on how these inequality concepts and indicators may be operationalized in the neighborhood or splace-based effects literature.

Measuring Inequality and Growing Inequality

The study of inequality is important, for growth in inequality gaps leads to worse mental health outcomes, higher levels of crime, food insecurity, and other societal issues that pose threats to the well-being of cities, states, and nations (Gilbert, 2015; Jasso, 2015; Koh, 2020). From this view, it is important to be familiar with standard ways to measure inequality gaps (Osberg, 2001). In this respect, according to Gilbert (2015), one of these standard measures is annual household income, or more precisely, annual household disposable income. Nonetheless, although this indicator has the term "annual," Gilbert (2015) also mentions that, when possible, multi-year estimates of disposable income (as opposed to annual estimates) are better indicators of variations in gap growth. The added stability of accounting for multi-year estimates is based on the potential instability associated with a single annual measure of income, which is particularly true for lower-skilled workers, who may also be more negatively affected by fast-evolving technological changes prevalent in our current times (Gilbert, 2015). Another important aspect related to the study of inequality is the identification of factors that may lead to equality of opportunity (or dis-opportunity), social mobility (or immobility), and (un)just societies (Yaish, 2015). From equality of opportunity perspectives, inequality is founded on stratification issues that explain lack of social mobility and inequality of results based on unequal starting conditions. For example, Yaish mentions that a child of highly educated parents is likely better equipped to benefit more from schooling than children whose parents have fewer years of formal schooling. Accordingly, equality of opportunity does not mean having both types of kids in the same classroom and referring to this as equality of access or leveled playing fields. Yaish (2015) refers to these latter forms of equality (i.e., being in the same classroom as an indicator of equality) as thin definitions of equality, wherein opportunities are assumed equal unless there are overt discriminatory practices based on gender, race/ethnicity, or socioeconomic grounds. The thin or naïve nature of this definition consists of ignoring that individuals' differential access to resources and sources of support is associated with the types of outcomes they attain (González Canché, 2017a; 2018b; 2019). From this perspective, functional and meritocratic notions indicating that "competent individuals" should be brought to the top regardless of where they are coming from (Yaish, 2015) typically ignore that "competence development" does not happen on a leveled playing field. Therefore, rewards based on merit typically ignore people's circumstances that led
to different competence development, decisions, or opportunities (González Canché, 2017a; 2018b; 2019; Yaish, 2015). Initial conditions or starting points have a durable impact on outcome variation and remuneration prospects. For example, in a study involving scientists, defined as doctorate and/or Ph.D. holders in science, technology, engineering and mathematics (STEM), González Canché (2017a) found that scientists who started college in the public 2-year sector had lower salaries than their counterparts who began college in the 4-year sector, and this salary gap did not disappear even 10 years after having been conferred their doctoral degrees (González Canché, 2017a). Considering that the analytic samples of that study account for the most prestigious terminal degree holders (i.e., doctorate recipients in STEM), these findings then suggest that initial conditions effectively preserve inequality in outcomes—even among those at the top of the hierarchy of scientific competency and prestige in arguably the most developed higher education system in the world. Those starting in advantaged positions have historically been able to maintain those advantages even when compared against peers with comparable (or identical) credentials and competence levels. The following discussion builds upon these notions of persistence of (dis)advantages and (in)equalities but with a particular focus on neighborhood and splace-based effects based on their relevance for spatial socioeconometric modeling.
Neighborhood Effects and Concentration of (dis)Advantages

Neighborhoods, as splaces, have meanings, attributes, and characteristics that impact the life chances of their inhabitants (Besbris et al., 2015; Faber & Sharkey, 2015) and can also be physically delimited and measured based on administrative lines. According to Faber and Sharkey (2015), neighborhoods represent a fundamental dimension of inequality because systems of stratification have historically been organized along spatial boundaries (see also Koh, 2020). From this view then, spatial or geographical stratification (operationalized via administrative lines) serves to maintain and reproduce inequality across multiple dimensions (Faber & Sharkey, 2015; Jargowsky & Tursi, 2015). One of these dimensions is exposure time to neighborhood characteristics, wherein individuals who have spent their entire childhood in high-poverty zones may experience more severe neighborhood effects compared to individuals who were exposed for shorter periods of time to those same neighborhood conditions (Chetty & Hendren, 2018; Faber & Sharkey, 2015). In this respect, the current consensus indicates that families benefit when they leave high-poverty areas and move to less impoverished neighborhoods (Besbris et al., 2015; Chetty & Hendren, 2018; Faber & Sharkey, 2015). From neighborhood and geographical concentration of opportunities perspectives, individuals' common exposure to spatially contextualized situations (González Canché, 2019) shapes their opportunities of upward social mobility
(Chetty et al., 2020) by comprehensively affecting their cultural, ethnic, and socioeconomic identities (Rosen, 1985). Applying these concepts to explain education attainment and college enrollment prospects (see González Canché, 2022, for a recent example), one can say that growing up in lower income neighborhoods, where the vast majority of individuals did not finish high school or did not enter college, typically translates into reduced opportunities to learn about careers that require college education, which may not only shape students’ college aspirations or expectations but also affect their employment prospects, salary levels, and exposure to crime and incarceration rates (Chetty et al., 2020; González Canché, 2019; 2022; Iriti et al., 2018). More to the point, even when students, growing up in these types of neighborhoods, observe a few individuals with some college or college degrees, they may form a belief that exposure to college did not help these college graduates (i.e., their neighbors) to overcome difficulties to find employment or increase earnings (Rosen, 1985; Weicher, 1979). The latter may reinforce their negative views about the long-term benefits associated with a college education (Iriti et al., 2018). On the other hand, growing up in more affluent neighborhoods, either since birth or moving there from high-poverty housing at younger ages, has been found to causally affect individuals’ prospects of upward income mobility (Chetty et al., 2020). For individuals experiencing life in more affluent neighborhoods, obtaining college degrees, and securing employment become normalized views that translate into greater certainty about rates of return associated with investing in education and the expectation of success derived from college attendance. This certainty is not only obtained at home but also through community networks and resources (Iriti et al., 2018). Although the previous discussion focused on education attainment and college access, the impact of geographical contexts may affect many other outcomes. For example, with respect to health outcomes, Jones and Duncan (1995) illustrated that individuals with nearly identical personal attributes and socioeconomic characteristics, but who live in different areas, tend to realize divergent health conditions. The same can be said about other outcomes such as employment and disposable income prospects, for example. In terms of the mechanisms upon which these effects are realized, neighborhood effects may be conceptualized relying upon notions of concentrated disadvantage (Anderson, 2013; Jargowsky & Tursi, 2015), and geography of opportunity (Pastor, 2001; Tate, 2008) or disadvantage (Pacione, 1997). Using these lenses, individuals who live in the same geographical area (e.g., neighborhood), are assumed to systemically and systematically be affected by their exposure to the same factors, such as their attendance in the same school districts and schools, exposure to similar levels of localized crime, violence, unemployment, poverty rates, and even access to food options with similar health quality levels (González Canché, 2019; 2022). Hence, it is expected that individuals’ common location and the corresponding shared exposure to “life” not only affects their sociodemographic identities and cultural views but also influences the positive or negative co-variation of their out-
comes, namely teenage parenthood, unemployment, welfare dependence, crime (Wilson, 1987), college-going expectations, safety prospects, and even performance in standardized test scores (González Canché, 2019), to name a few examples. The splatial clustering of economically and socially disadvantaged individuals within a set of neighborhoods results in feedback effects that exacerbate problems associated with poverty that have historically led to vicious cycles (Jargowsky & Tursi, 2015). That is, this feedback loop translates into systematic concentration of disadvantages, wherein high-crime neighborhoods also experience high unemployment, low housing quality, single-parent households, high poverty, and high rates of alcohol and drug abuse (González Canché, 2019; 2022; Jargowsky & Tursi, 2015). Notably, these patterns are also reproduced in affluent neighborhoods but in opposite directions: low-poverty zones also have low unemployment rates, low crime and alcohol and drug abuse rates, to mention a few indicators. Therefore, the feedback loops characteristic of affluent zones may be depicted as experiencing virtuous cycles with their configuring neighborhoods realizing systematic concentration of advantages. Note, however, that neighborhood effects, conceptualized and operationalized through the aforementioned concentration of disadvantage notions, are not linear or homogeneous across groups of inhabitants (Crane, 1991; Faber & Sharkey, 2015; Jencks & Mayer, 1990). As depicted in the discussion of splaces, individuals' attributes such as ethnicity, age group, and gender or sexual identity interact with their geographies of disopportunity and affect how they experience life and may contribute to explain their outcomes and potentialities for social mobility (Anderson, 2013; Besbris et al., 2015; Chetty et al., 2020; Chetty & Hendren, 2018; Faber & Sharkey, 2015; Rosen et al., 1985; Weicher, 1979; Wilson, 1987). More specifically, a young black male in an affluent neighborhood is likely to have different experiences than in a poor neighborhood—even if experiencing microaggressions (Sue et al., 2007) in this wealthy splace. The same may be true for other combinations of individual traits at different geographical zones. Another version of this non-linearity in the effects of splaces may be observed when studying individual-level socioeconomic indicators and their interaction with environment-based SES levels. Specifically, assume we are able to identify (a) low-income students, or students in need of financial assistance, and (b) non-financially constrained students. Assume further we can identify them in schools located in (i) high-poverty and (j) affluent neighborhoods. Using these pieces of information, we may be able to estimate whether low-income students attending schools located in low-income areas realize different academic attainment levels than (A) their non-low-income peers attending schools located in wealthy neighborhoods, (B) low-income peers attending schools located in wealthy neighborhoods, and (C) non-low-income peers attending schools located in poor areas (a brief code sketch of this type of comparison appears at the end of this subsection). Of course, based on data availability we can further disaggregate the analyses by ethnicity and gender. This more detailed level of analysis may not only enable a more nuanced understanding of the non-linearity of neighborhood
effects but, more importantly, it may highlight the need for more targeted early interventions, for example. A working paper version of a study that addresses the types of questions discussed in the previous paragraph can be found here https://cutt.ly/7Cqzy5N. This study relies on New York State's administrative data, which contain detailed multi-cohort, multi-year information. The analyses and visualizations presented in that working paper may be replicated with the procedures described later in this book (see Chap. 8).
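A hypothetical sketch of the comparison just described is shown below: student-level low-income status interacted with the poverty level of the school's neighborhood. The data frame and variable names (students, attainment, low_income, high_poverty_school_area, ethnicity, gender) are illustrative placeholders rather than the specification used in the working paper, and with richer data the same logic extends to the disaggregation by ethnicity and gender mentioned above.

# A hypothetical sketch (not the working paper's actual specification):
# the interaction term lets the association between students' low-income
# status and attainment differ across school-neighborhood contexts.
fit <- lm(attainment ~ low_income * high_poverty_school_area + ethnicity + gender,
          data = students)
summary(fit)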
Splace-Based Modeling Challenges

Although neighborhood effects justify the need to account for splace-based indicators, the selection and inclusion of these attributes in our models requires a discussion of some methodological challenges associated with their analyses. The first challenge discussed next refers to sorting mechanisms leading to self-selection issues. The second challenge is based on high levels of correlation among these indicators, precisely due to these sorting mechanisms and feedback loops.

Causality in Spatial Modeling

So far we have described that experiencing life in a given neighborhood has been found to impact the outcomes of its inhabitants. Inhabitants with similar individual (non-place-based) attributes but who grew up in different neighborhoods are likely to realize different outcomes in life. Note, however, that these statements tend to be associative in nature, as opposed to causal, an important distinction considered next. Faber and Sharkey (2015) note that selection bias is an important hurdle to arguing causation in spatial modeling. In this form of bias, typical of observational data, it is difficult to disentangle outcomes based on individuals' attributes and the attributes of their neighborhoods. That is, although we may observe that individuals' outcomes are affected by their common exposure to their neighborhoods' characteristics, we must first consider the mechanisms in motion that "placed" these individuals in those neighborhoods. These mechanisms may be conceptualized as unobserved factors that both (a) contributed to those inhabitants deciding (or being constrained/forced) to select those neighborhoods and (b) may also be associated with their outcomes (Pearman, 2019) as we illustrate next. Let us assume we want to measure the impact of outmigration from home to attend college compared to attending college locally on the probability of college graduation and initial post-college salary (González Canché, 2018b). This research question has a clear geographical component wherein we may be observing a set of students who left home and another set who did not leave home to pursue college. If we simply measure graduation and salary outcomes across these groups of students and ignore that, since college outmigration is
more expensive than nearby college enrollment (González Canché, 2014; 2017; 2018a), those who were able to afford to move may …

1. not only be systematically different compared to their peers who did not or could not afford to move in both observed (i.e., socioeconomic indicators) and unobserved (i.e., interests or motivation) indicators, but also
2. these same attributes that enabled mobility may also be affecting their outcomes resulting from moving.

From this view, failing to consider these disparities in factors that enabled mobility, which may also affect their outcomes, poses a serious threat to causality. The implications of these observed and unobserved mechanisms that result in geographical sorting are that what we typically observe in our quantitative databases is the already sorted set of inhabitants with specific characteristics concentrated in a given place, rather than the factors that drove them to decide to live in a given neighborhood. In following with the feedback loop notion (Jargowsky & Tursi, 2015) just discussed, note that neighborhood statistical descriptors are a function of the composition of their inhabitants' attributes. In returning to our previous discussion of concentrated advantages, to the extent that a neighborhood is configured by wealthier inhabitants, we would not only be observing that the neighborhood has higher income levels than other neighborhoods, but also that, to afford to move into those neighborhoods, prospective inhabitants would need the resources to cover the costs of such a decision. From this view, those who can afford to move into wealthy neighborhoods either have the means to afford doing so (observables), are extra-motivated (unobservables), or both. At any rate, these observed and unobserved factors continue to feed this virtuous loop. On the other end of the spectrum, below-poverty-line neighborhoods will also tend to attract inhabitants whose only options may be constrained to living in those areas. From this discussion, then, the questions become: are neighborhood effects the result of the neighborhoods' characteristics themselves, the result of the attributes of their inhabitants, the result of unobserved sorting mechanisms, or a combination of all of them? Most importantly, what can we do to offer estimates that may help us account for observable and unobservable factors with the goal of reducing bias in our estimates? As we will discuss in Chap. 6, spatial analysis principles and tools may be used as identification strategies where "treated" and "control" status is a function of participants' decisions (or possibilities) to out-migrate or not to attend college, for example. In those discussions, we will elaborate on how the analyses may benefit from the use of quasi-experimental tools such as instrumental variables (IV), propensity score modeling (PSM), regression discontinuity (RD), or difference in differences (DiD) to reduce biases based on observables (PSM, DiD) and unobservables (RD and IV). Moreover, in Chap. 8 we will discuss how simultaneous autoregressive (SAR) models handle residual dependence, which constitutes a serious violation of linear regression models (Bivand et al., 2013; Dong et al., 2015; Pebesma & Bivand, 2020), and the relevance of testing for
and addressing this source of bias. In this same Chap. 8 we will also showcase how spatio-temporal SAR analyses may be paired with quasi-experimental analysis (i.e., difference in differences) to reduce bias in the estimates. For now, discussing both the concentration of inequalities and the societal sorting mechanisms that contributed to the selection of the areas where people live is important in our conceptual quest to build models that may address selection bias.
Individual and Place-based Multicollinearity
This discussion, rather than discouraging the use of spatial-based indicators as control or predictor variables due to their potential confounding effects based on self-selection issues, aimed to highlight the relevance of including them in our models. However, their inclusion in statistical models should be carefully assessed, for there are two main forms of multicollinearity (high correlations among two or more independent variables) that may emerge from their inclusion:
• Multicollinearity among individual and place-based indicators (i.e., parents' income and neighborhood median income)
• Multicollinearity among place-based indicators (i.e., proportion of adults living below the poverty line and unemployment rates of the adult population in a given neighborhood).
That is, based on our previous discussion, if place-based indicators are a function of the prevalence of these places' inhabitants' attributes, would we not expect to observe high levels of correlation between individuals and their splace-based indicators? More to the point, if neighborhood indicators are so highly correlated that they explain similar parts of the total variation of the outcome(s) of interest, the question then becomes how to identify which indicators are most relevant to include in the models. The tenets of the geography of advantage/disadvantage suggest that geographical indicators may be highly correlated; that is, zones with high crime are likely to have high poverty levels, for example. This correlation, which is typically observed in studies modeling environmental factors (Li et al., 2016), may affect the observed variable importance of the predictors. Following Li et al. (2016), before model estimation it is recommended that researchers assess variable inclusion criteria relying on a feature selection algorithm (Kursa & Rudnicki, 2010), for example, to detect all non-redundant variables that predict outcome variation via machine learning. Feature selection algorithms like Boruta (Kursa & Rudnicki, 2010) may effectively address multicollinearity issues by identifying, and easing the exclusion of, features that are redundant in the presence of other features for a specific outcome. As illustrated in Chap. 8, the Boruta function is based on a Random Forest regression procedure. Boruta is a wrapper algorithm that subsets control and predictor features and trains a model using them to try to capture all the relevant indicators with respect to an outcome variable.
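To make this wrapper logic concrete before the full treatment in Chap. 8, the following is a minimal, hedged sketch of a Boruta run in R. The simulated data frame and variable names are placeholders (they are not the book's Chap. 8 example), but the function calls come from the Boruta package itself.

```r
# Minimal sketch of a Boruta feature-selection run on simulated (placeholder)
# splace-like indicators; requires install.packages("Boruta").
library(Boruta)

set.seed(1234)                         # Boruta uses random forests; fix the seed
n <- 500
median_income <- rnorm(n, 60000, 15000)
poverty_rate  <- pmin(pmax(0.35 - median_income / 2e5 + rnorm(n, 0, 0.05), 0), 1)
unemployment  <- runif(n, 0.02, 0.15)
noise         <- rnorm(n)              # irrelevant feature Boruta should reject
college_access <- 0.30 + 0.000004 * median_income - 0.50 * poverty_rate +
  rnorm(n, 0, 0.05)

dat <- data.frame(college_access, median_income, poverty_rate,
                  unemployment, noise)

# Each real feature is compared against its shuffled "shadow" copy
boruta_fit <- Boruta(college_access ~ ., data = dat, doTrace = 0)
print(boruta_fit)                            # confirmed / tentative / rejected
final_fit <- TentativeRoughFix(boruta_fit)   # resolve any tentative features
getSelectedAttributes(final_fit)             # non-redundant predictors to keep
```

In this toy setup, the income and poverty columns carry real signal while the noise column does not, so Boruta should separate them accordingly; the mechanics behind the shadow features are described next.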
As indicated by Kursa and Rudnicki (2010), relevance is identified when there is a subset of attributes in the dataset among which a given indicator is not redundant when predicting the outcome of interest. Procedurally, Boruta duplicates the dataset and shuffles the values in each duplicated column, referring to these shuffled indicators as shadow features. Then, a Random Forest algorithm is used to learn whether the actual feature performs better than its randomly generated shadow. Our examples will showcase, with minimal code functions and interactive visualizations (see https://rpubs.com/msgc/point_based_Boruta), how to apply and interpret Boruta using real data.
Closing Thoughts and Next Steps
Based on the mechanisms described in this chapter, the splaces where our participants experience life may have important effects on their outcome variation, over and above their individual level attributes. As such, these effects are important to consider in our modeling efforts. However, the mere inclusion of these indicators may not be enough to address other potential sources of bias that emerge precisely from their inclusion. Unobserved sources of geographical sorting may still threaten the validity of our estimates. Similarly, multicollinearity issues represent yet another source of variation that may impact our estimates of interest. These validity threats will be revisited when we discuss the analytic approaches presented in this book, along with strategies to minimize their influence, including the use of quasi-experimental estimators with spatial data (see panel SAR difference in differences in Chap. 8) and the reliance on modeling outcome and/or residual dependence (Chaps. 7 and 8).
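For readers who want the notation in advance, the two forms of dependence mentioned above are commonly written in the general spatial econometrics literature as follows (this is the standard textbook formulation, not yet the book's own notation, which is developed in Chaps. 7 and 8):

```latex
% Outcome (spatial lag) dependence: the outcome depends on neighbors' outcomes
% through the matrix of influence W and the autoregressive parameter rho
y = \rho W y + X\beta + \varepsilon

% Residual (spatial error) dependence: the dependence operates through the
% error term u with spatial parameter lambda
y = X\beta + u, \qquad u = \lambda W u + \varepsilon
```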
Next Steps
In terms of next steps, note that so far our discussion of neighborhood effects has been abstract in the sense that we have yet to empirically showcase the different strategies typically employed to identify these neighborhood (macro, general) and neighboring (micro, localized) structures. Accordingly, our next chapter presents a detailed description of these strategies, along with some tradeoffs associated with zooming in or out of a given geographical zone (González Canché, 2018c) by moving from states to counties, to ZIP code tabulated areas (ZCTAs), to census tracts, and to census block groups. As part of this presentation, we will also provide examples of how to operationalize splaces by identifying proxies of spatially concentrated disadvantage indicators relying on data from the American Community Survey (ACS) that can be merged to spaces to form splaces.
Discussion Questions
1.1 What are Places and Spaces? How are they similar yet different? Is there any value added associated with their separate conceptualization?
1.2 How relevant are these notions of concentration of advantages and disadvantages for spatial socioeconometric modeling? Explain some of the mechanisms through which these notions reproduce the system.
1.3 Why should we consider splace-based indicators as important sources of variation worthy of model inclusion?
1.4 In your own words, what are the mechanisms that may pose threats to causality?
1.5 Explain the mechanisms that may account for sorting. How is this relevant for causal arguments/claims in spatial socioeconometric modeling?
1.6 Why should we or should we not be worried about multicollinearity issues? If multicollinearity is a concern, describe in your own words the types of multicollinearity that may be observed. Are we missing any types?
1.7 Why were we referring to splaces? What are they, and how are they relevant for spatial socioeconometric modeling?
References
Agnew, J. (2005). Space: Place. In Spaces of geographical thought: Deconstructing human geography's binaries (pp. 81–96).
Agnew, J., & Livingstone, D. N. (2011). The Sage handbook of geographical knowledge. Sage Publications.
Anderson, E. (2013). Streetwise: Race, class, and change in an urban community. University of Chicago Press.
Besbris, M., Faber, J. W., Rich, P., & Sharkey, P. (2015). Effect of neighbourhood stigma on economic transactions. Proceedings of the National Academy of Sciences, 112(16), 4994–4998.
Bivand, R. S., Pebesma, E. J., & Gómez-Rubio, V. (2013). Applied spatial data analysis with R (2nd ed.). Springer. https://asdar-book.org/
Chetty, R., Friedman, J. N., Hendren, N., Jones, M. R., & Porter, S. R. (2020). The opportunity atlas: Mapping the childhood roots of social mobility (tech. rep.). National Bureau of Economic Research. https://opportunityinsights.org/wp-content/uploads/2018/10/atlaspaper.pdf
Chetty, R., & Hendren, N. (2018). The impacts of neighborhoods on intergenerational mobility I: Childhood exposure effects. The Quarterly Journal of Economics, 133(3), 1107–1162.
Crane, J. (1991). The epidemic theory of ghettos and neighborhood effects on dropping out and teenage childbearing. American Journal of Sociology, 96(5), 1226–1259.
Cresswell, T. (2008). Place: Encountering geography as philosophy. Geography, 93(3), 132–139.
Dong, G., Harris, R., Jones, K., & Yu, J. (2015). Multilevel modelling with spatial interaction effects with application to an emerging land market in Beijing, China. PLoS ONE, 10(6), e0130761.
Faber, J. W., & Sharkey, P. (2015). Neighborhood effects. International Encyclopedia of the Social & Behavioral Sciences (pp. 443–449).
Gilbert, D. (2015). Income inequality. International Encyclopedia of the Social & Behavioral Sciences.
González Canché, M. S. (2014). Localized competition in the non-resident student market. Economics of Education Review, 43, 21–35.
González Canché, M. S. (2017a). Community college scientists and salary gap: Navigating socioeconomic and academic stratification in the U.S. higher education system. The Journal of Higher Education, 88(1), 1–32. https://doi.org/10.1080/00221546.2016.1243933
González Canché, M. S. (2017b). The heterogeneous non-resident student body: Measuring the effect of out-of-state students' home-state wealth on tuition and fee price variations. Research in Higher Education, 58(2), 141–183.
González Canché, M. S. (2018a). Geographical network analysis and spatial econometrics as tools to enhance our understanding of student migration patterns and benefits in the US higher education network. The Review of Higher Education, 41(2), 169–216.
González Canché, M. S. (2018b). Nearby college enrollment and geographical skills mismatch: (Re)conceptualizing student out-migration in the American higher education system. The Journal of Higher Education, 89(6), 892–934. https://doi.org/10.1080/00221546.2018.1442637
González Canché, M. S. (2018c). The statistical power of "zooming in": Applying geographically based difference in differences using spatio-temporal analysis to the study of college aid and access. New Directions for Institutional Research, 2018(180), 85–107.
González Canché, M. S. (2019). Repurposing standardized testing for educational equity: Can geographical bias and adversity scores expand true college access? Policy Insights from the Behavioral and Brain Sciences, 6(2), 225–235. https://doi.org/10.1177/2372732219861123
González Canché, M. S. (2022). Post-purchase federal financial aid: How (in)effective is the IRS's student loan interest deduction (SLID) in reaching lower-income taxpayers and students? Research in Higher Education, 1–54. https://doi.org/10.1007/s11162-021-09672-6
Iriti, J., Page, L. C., & Bickel, W. E. (2018). Place-based scholarships: Catalysts for systems reform to improve postsecondary attainment. International Journal of Educational Development, 58, 137–148.
Jargowsky, P. A., & Tursi, N. O. (2015). Concentrated disadvantage. International Encyclopedia of the Social & Behavioral Sciences.
Jasso, G. (2015). Inequality analysis: Overview. International Encyclopedia of the Social & Behavioral Sciences (2nd ed., pp. 885–893). Elsevier Inc.
Jencks, C., & Mayer, S. E. (1990). The social consequences of growing up in a poor neighborhood. Inner-City Poverty in the United States, 111, 186.
Jones, K., & Duncan, C. (1995). Individuals and their ecologies: Analysing the geography of chronic illness within a multilevel modelling framework. Health & Place, 1(1), 27–40.
Koh, S. Y. (2020). Inequality. In A. Kobayashi (Ed.), International encyclopedia of human geography (2nd ed., pp. 269–277). Elsevier. https://doi.org/10.1016/B978-0-08-102295-5.10196-9
Kursa, M. B., & Rudnicki, W. R. (2010). Feature selection with the Boruta package. Journal of Statistical Software, 36, 1–13.
Li, J., Tran, M., & Siwabessy, J. (2016). Selecting optimal random forest predictive models: A case study on predicting the spatial distribution of seabed hardness. PLoS ONE, 11(2), e0149089.
Lovelace, R., Nowosad, J., & Muenchow, J. (2019). Geocomputation with R. CRC Press. https://geocompr.robinlovelace.net/
Osberg, L. (2001). Inequality. In N. J. Smelser & P. B. Baltes (Eds.), International encyclopedia of the social & behavioral sciences (pp. 7371–7377). Pergamon. https://doi.org/10.1016/B0-08-043076-7/01898-2
Pacione, M. (1997). The geography of educational disadvantage in Glasgow. Applied Geography, 17(3), 169–192.
Pastor, M. (2001). Geography and opportunity. America Becoming: Racial Trends and Their Consequences, 1, 435–468.
Pearman, F. A. (2019). The effect of neighborhood poverty on math achievement: Evidence from a value-added design. Education and Urban Society, 51(2), 289–307.
Pebesma, E., & Bivand, R. S. (2020). Spatial data science. Open access rmarkdown/bookdown.
Rosen, H. S. (1985). Housing subsidies: Effects on housing decisions, efficiency, and equity. In Handbook of public economics (pp. 375–420). Elsevier.
Sue, D. W., Capodilupo, C. M., Torino, G. C., Bucceri, J. M., Holder, A., Nadal, K. L., & Esquilin, M. (2007). Racial microaggressions in everyday life: Implications for clinical practice. American Psychologist, 62(4), 271.
Tate, W. F., IV. (2008). Geography of opportunity: Poverty, place, and educational outcomes. Educational Researcher, 37(7), 397–411.
Tobler, W. R. (1970). A computer movie simulating urban growth in the Detroit region. Economic Geography, 46(sup1), 234–240. http://www.geog.ucsb.edu/_tobler/publications/pdfdocs/A-Computer-Movie.pdf
Tuan, Y.-F. (1977). Space and place: The perspective of experience. University of Minnesota Press.
Weicher, J. (1979). Urban housing policy. In P. Mieszkowski & M. Straszheim (Eds.), Current issues in urban economics. Johns Hopkins University Press.
Wilson, W. J. (1987). The truly disadvantaged. University of Chicago Press.
Yaish, M. (2015). Equality of opportunity. International Encyclopedia of the Social & Behavioral Sciences (pp. 903–905).
CHAPTER 2
Operationalizing SPlaces
Abstract In Chap. 1 we described spaces as planes with mathematical properties that allow us to identify objects located in those planes. We further identified these objects as places with meaning that we may experience, live, and remember. Subsequently, we defined splaces as the intermingling of spaces and places that may enrich our understandings of processes, mechanisms, and the explanations of the outcomes observed. From this view, splaces eased our discussion of geographies of opportunity and disopportunity, feedback loops, sorting or self-selection, and multicollinearity issues. We, however, have so far limited these discussions to conceptual understandings and have yet to discuss and showcase how to operationalize these splaces. Accordingly, the purpose of this chapter is to translate these concepts into practical applications using real and publicly available data. To this end, we will discuss how to delimit and operationalize splaces, with particular emphasis on data format structures and sources. We will also discuss tradeoffs associated with zooming in (i.e., going from higher level areas [counties] to lower level areas [census tracts]) into splaces, which, although it may result in data point gains, may be computationally expensive. Finally, we close this chapter with examples of attributes readily available in the American Community Survey (ACS) that can be used to operationalize concentrated disadvantages.
Delimiting and Operationalizing Neighborhoods as Splaces
Our discussion of neighborhood effects and concentration of disadvantages has been abstract in the sense that we have yet to discuss how to identify these splaces. The following sections discuss procedural and practical ways in which researchers may achieve this splace identification, a process that can be understood as the merger of spaces and place-based indicators. Conceptually, we can define neighborhoods as physical spaces and, as such, we can represent them with grids or systems of mathematical coordinates, or as planes on which events are recorded (Agnew, 2005). Operationally, neighborhoods can be identified or defined via administrative boundaries such as census block groups, census tracts, and/or ZIP Code Tabulation Areas (ZCTAs) (Her & Yu, 2021). These boundaries have the attributes of spaces, for they are measurable, geo-located, and abstract (Cresswell, 2008; Tuan, 1977). This measurability property implies that boundaries may be drawn in the same plane or space at distinct levels, so that some boundaries may be contained or nested within other levels. For example, counties may contain ZIP code tabulated areas while, at the same time, counties may also be dissected into more granular zones like census block groups. In terms of nesting, consider that counties themselves are contained within (or nested in) states, states belong to regions, and regions are nested in countries. Note that in the previous depictions of space levels we have purposefully used the terms nested, belong, contained, dissected, and located to identify or refer to these space levels and boundaries. From a geographical analysis perspective all these descriptions are valid, for they convey the idea of aggregation or disaggregation in spaces, referring to their physical and mathematical properties. As such, as long as we are clear and consistent with their usage in a particular research project, all these terms may be used to describe our spatial analytic strategies and processes. There are, however, certain methodological approaches, such as multilevel simultaneous autoregressive models (Dong et al., 2015), that prefer the "nesting" language, due to the hierarchical nature of that framework. Another common way to refer to nesting levels is as higher or lower levels (Dong et al., 2015). As further discussed in the following section, higher levels will denote the nesting units in our models, and lower level units will be more granular or disaggregated, referring to units ascribed to, or located in, their respective higher level areas. For example, assume we identify housing units in a given ZIP code tabulated area (ZCTA). Because these units are located within that ZCTA, the latter will be their respective higher level, whereas the houses themselves will be the lower level units of analysis. The same rationale applies to ZCTAs contained within counties. In this example, counties will be higher level and ZCTAs will be lower level units given their nesting within counties. In bringing all these spatial units together, the lowest level unit in these two cases will be the houses, for they are located within ZCTAs and also within counties.
Representing Physical Spaces and Nesting Structures
These different nesting structures and hierarchical levels are represented in Figs. 2.1 and 2.4. The left-hand side subfigure of Fig. 2.1 shows the Tri-State Region in the Northeast United States. This region accounts for the states of New York, New Jersey, and Pennsylvania. Although, as discussed above, each of these states is configured by counties (and counties by other lower level units), for our discussion and explanation purposes, this Tri-State depiction only shows the location of Philadelphia County in Pennsylvania in the dotted, darker circle in this subfigure. Let us assume that we are interested in focusing our analyses on Philadelphia County. To accomplish this, we first need to "zoom in" from this Tri-State Region into this smaller geographical zone. This resulting "zoomed in" version of Philadelphia County is depicted in the right-hand side subfigure of Fig. 2.1, which as of now does not show any of its nested lower level administrative boundaries. This latter statement implies that Philadelphia County (and any other county) may be further decomposed into its configuring zones. The selection of levels of aggregation has important methodological implications. If analyses are left aggregated at this county level, then all its configuring "neighborhoods" will be represented with a single homogeneous spatial and splace-based set of values. In terms of spatial values, all physical attributes of lower level units will be lost by only considering their contribution to the entire county measurement in terms of water and land (see Figs. 2.6 and 2.7, discussed below in this chapter). In terms of sociodemographic or splace-based indicators, all the neighborhoods configuring Philadelphia County will also have a single uniform estimate, instead of their individual actual values. For example, North Philadelphia, West Philadelphia, and Philadelphia's Historic District will
Fig. 2.1 Tri-state Area: New York, New Jersey, Pennsylvania, USA, example. Left panel: Tristate Area, circled area: Philadelphia County, PA; right panel: zooming in into Philadelphia County, PA
all be associated with county level estimates. If these estimates are crime, poverty, or violence, all these county regions will "lose" their own values and will instead be given the value of Philadelphia County. As with most, if not all, cities, anyone who is familiar with Philadelphia knows that, in reality, all these neighborhoods vary quite drastically across these indicators. Building from this brief discussion, we can say that there are at least two benefits associated with disaggregation efforts. The first is that disaggregation may lead to more nuanced and accurate depictions of these neighborhoods' indicators and their impact on how their inhabitants experience life. The second benefit is that the higher the disaggregation, the more data points our models may be able to incorporate. That is, instead of one single estimate for a given county, we will get all the lower level estimates configuring such a county (we discuss this in Fig. 2.4). That is, by zooming in, we increase the analytic sample size, even when the actual splace remains the same, and this increase translates into more statistical power. In reality, these two points converge into modeling approaches that may more realistically or accurately estimate neighborhood effects and concentrations of disadvantages.
Zooming in Across Administrative Boundaries
The process of spatial disaggregation may be conceptualized as zooming in (González Canché, 2018c), or decomposing higher order level information into its configuring parts. For example, we may move from a county to its configuring ZIP code tabulated areas (ZCTAs), or to their census tracts, or to their census block groups. Procedurally, to conduct this "zooming in," we typically rely on official administrative boundaries (Her & Yu, 2021). In the United States, these administrative boundaries are available from the TIGER/Line Shapefiles provided by the United States Census Bureau (Her & Yu, 2021). Shapefiles are databases that contain georeferenced information (i.e., latitude and longitude coordinates) referred to as geometries, for they take the forms of points (i.e., housing units, buildings, a street corner), lines (i.e., rivers, highways, streets, a border separating two countries), or polygons (i.e., ZCTAs, tracts, states, countries) that enable visualizing their location on earth in map form (United States Census Bureau, 2021, p. 1).
Shapefiles as Spaces
Like spaces, as described and conceptualized in Fig. 1.1, shapefiles account for physical properties that enable objective physical measurements. Indeed, when accessing these georeferenced databases, their information focuses on describing their physical properties, such as areas covered by land and by water. To provide an example, let us discuss Fig. 2.2. This figure represents the shapefile of census tracts in the State of Pennsylvania as downloaded from the TIGER/Line Shapefiles—we will discuss how to access these files later in the
Fig. 2.2 Example of census tract shapefile or space
book (Chap. 4), both directly from a server and by downloading them and loading them from our local hard drive.
Elements of a Shapefile
The first four columns are identification numbers, starting with:
• "STATEFP", the ID of the state of Pennsylvania, which for the United States Census Bureau is 42. Since we are only analyzing Pennsylvania, this number presents no variation.
• "COUNTYFP", the county ID. Even though we have only one state, this state is configured by 67 counties. However, for our description purposes, we limited the shapefile to the county of Philadelphia; accordingly, this column has one unique value: 101, which is the number associated with Philadelphia County—this number does not reflect the number of counties.
• "TRACTCE", the census tract identification numbers within the county of Philadelphia. This column has 384 values, which is the number of census tracts configuring Philadelphia County.
In addition, these three IDs configure a super or complex ID that concatenates (unites) the previous three IDs (state, county, and tract). The union or concatenation of these pieces of information yields a unique ID for each administrative unit in the United States. In this case, this unique ID is called GEOID and means that in the entire country there may not be two administrative units with the exact same combination of state, county, and tract. Before describing the remaining columns, note that, when analyzing all census tracts of the United States, this GEOID serves as the key to add place-based attributes to this shapefile, and the merger of space and place-based indicators results in a splace dataset—more on this below. However, when the analyses only include a single state's census tracts, the column "TRACTCE" in Fig. 2.2 suffices to merge space with place-based indicators, as further discussed below. The subsequent two columns in this shapefile are land ("ALAND") and water ("AWATER") areas. Next, we can see two more columns denoting latitude and longitude coordinates, which enable us to geolocate these tracts on
earth. Note, however, that these coordinates indicate the center of the census tract (also known as its centroid) rather than the collection of points forming lines that eventually configure each polygon. These latter shape attributes (i.e., the collection of points configuring lines that form polygons) are stored under the multipolygon feature, which will enable depicting these polygons in map form based on projections on earth for visualization and measurement purposes—we will further discuss projections in Chap. 3.
Place-Based Indicators Contributing to Building Splaces
Going back to our discussion of places, spaces, and splaces, note that shapefiles per se would not allow us to conduct spatial socioeconometric analyses. In an analogous way, place-based indicators without the spatial features shown in Fig. 2.2 would also fall short in helping us conduct spatial socioeconometric modeling. To elaborate on this latter scenario, let us analyze the content of Fig. 2.3. This figure represents a subset of a place-based indicator stored in "Table B15001: Sex by Age by Educational Attainment [2015–2019]." This table contains the estimates at the census tract level, as reported by the American Community Survey (ACS), of the educational attainment of inhabitants by age group and sex in each census tract of Philadelphia County.1 Since these estimates are of Philadelphia County, these tracts also correspond to the state of Pennsylvania, which allows us to continue building from our example presented in Fig. 2.2. Note that the first column represented in Fig. 2.3 is called "tract" and its content matches the "TRACTCE" column shown in Fig. 2.2. As mentioned above, because these two databases only include Pennsylvania information, these two columns are enough to merge the shapefile and the place-based indicator, for there is no repetition of tracts within a single state. Note, however, that there will be repetition of tracts across different states. That is, Alabama and Pennsylvania may both have a tract ID "0001001," for example, making it impossible to rely on TRACTCE and tract alone as the merging keys across these (or other) states. From this view, if spatial data management (or spatial feature engineering) includes multiple states, we must rely on GEOID as the merging key to build splaces (i.e., bringing a shapefile and its corresponding place-based indicators together for further SSEM). The subsequent set of columns contained in Fig. 2.3 corresponds to the total number of men and women ages 18–24 in that tract (columns "B15001_003" and "B15001_011"). Moreover, with some basic arithmetic computations (as we showcase below in this chapter), we estimated the proportion of these men and women inhabitants who did not enter college between 2015 and 2019 (or 5-year estimates, as detailed below). Once more, let us note that there is no spatial or geographical information in the database shown in Fig. 2.3. Consequently, as stated above, these data
1 All the content of this table is available at https://censusreporter.org/tables/B15001/.
Fig. 2.3 Example of place-based information data storage format. Source American Community Survey, 2015–2019 Table B15001
per se do not enable the implementation of SSEM. It is only when we combine these place-based attributes with spaces or shapefiles that our models may become suitable for spatial modeling. And in this example, we may use the "tract" column to link this place-based information to its corresponding space or geography in the state of Pennsylvania. Once more, when more than one state is present in our shapefiles and place-based datasets, then the GEOID column of Fig. 2.2 must be used to merge or link these space and place databases to form splatial datasets suitable for SSEM. Although we will cover these joining or merging processes in more detail in Chap. 3, for now it is worth noting that the GEOID or complex key may not necessarily be readily available in place-based databases, like the one depicted in Fig. 2.3. This implies that we may need to create these key GEOIDs. The good news is that this is a straightforward process based on concatenating2 the columns "state," "county," and "tract" that are available in the complete database versions like the one represented in Fig. 2.3, similar to the columns "STATEFP," "COUNTYFP," and "TRACTCE" shown in Fig. 2.2.3 Once these shapefiles are linked to or merged with attributes or characteristics measured at those splaces, we can then analyze spatial processes (relying on mathematical algorithms), such as the clustering or concentration of disadvantages, for this now geolocated set of attributes (i.e., crime and poverty levels, years of educational attainment, family structure, all in a given area). To summarize this discussion, and to be as clear as possible, let us refer to Her and Yu (2021), who stated that although shapefiles per se "do not include demographic data, […,] they do contain geographic entity codes (GEOIDs) that can be linked to the Census Bureau's demographic data, available on data.census.gov" (Her & Yu, 2021, p. 2).
2 Basically, pasting these values together to form a single string. That is, assume we have state = 42, county = 91, and tract = 210700. To concatenate these values into a new column called GEOID, we use the command paste(...) in R, which yields GEOID = "4291210700", or a unique ID for that specific tract in the United States.
3 Note, however, that recent developments to access place-based datasets (see the package "tidyverse", for example) by default extract the GEOID column, rather than the decomposition of state, county, and tract. When using these new approaches, we simply need to rely on GEOID, with no need to concatenate as shown in this example.
Chapter 4 details the processes followed to link or merge information, a process that can be conceptualized as adding attributes to these empty shells or administrative boundaries.
Neighborhood Operationalization and Disaggregation
After this brief discussion of spaces (shapefiles or administrative units) and place-based indicators' data structures, let us go back to the strategies we can use to zoom in on different neighborhood configurations. To showcase the different levels of neighborhood operationalization, we can rely on Fig. 2.4. Going left to right, we can see that counties and county level information may be decomposed into their ZIP Code Tabulation Areas (ZCTAs), census tracts, and block groups. Although each of these areas may be used to represent neighborhoods, typically researchers need to select one of these three options, and this selection conveys warranted and even unwarranted tradeoffs. For example, although ZCTAs are arguably the areas with the highest data availability after counties, ZCTAs tend to be mistakenly considered the least stable administrative units over time. Note, however, that ZCTAs are only changed once every ten years.4 The origin of this mistake is the similarity in names between two related yet different administrative units: ZIP codes and ZIP code tabulated areas. It is ZIP codes (not ZIP code tabulated areas) that are subject to periodic changes in the pursuit of efficiency in mail delivery routes (United States Census Bureau, 2022, no page number). Notably, however, based on the usefulness and popularity of ZIP codes, the US Census Bureau created ZCTAs to "approximate area representations of U.S. Postal Service (USPS) [ZIP codes]." Accordingly, ZIP codes and ZCTAs are highly related yet present slight variations that may potentially grow over time, for up to 10 years, until the United States Census Bureau adjusts ZCTAs. Despite these variations between these administrative units, ZCTAs are particularly useful for SSEM given the vast availability of data that "the Census Bureau [offers] using whole blocks to present statistical data from censuses and surveys" (United States Census Bureau, 2022, no page number). This implies that, when using ZCTAs in SSEM, the discrepancies with respect to ZIP codes are inconsequential for our estimates. All data provided by the United States Census Bureau are mapped to ZCTAs (not to ZIP codes). From this perspective, there is no downside to using ZCTAs in terms of the instability of these administrative units within a decade. The modeling challenge may emerge when analyses expand across decades, but this challenge is not unique to ZCTAs; it also applies to census tracts and block groups.5
4 See https://www.policymap.com/2016/03/what-are-zip-code-tabulation-areas/.
5 The full United States glossary of terms is available at https://www.census.gov/programs-surveys/geography/about/glossary.html.
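Before zooming in across the levels shown in Fig. 2.4, here is a small, self-contained sketch of the GEOID concatenation (footnote 2) and the space-to-place merge described above. The toy tables and values are placeholders rather than the book's data; Chaps. 3 and 4 develop the actual workflow.

```r
# Illustrative sketch: building a GEOID key and merging a "space" attribute
# table with a "place"-based table to form a splace dataset.
library(dplyr)

# Toy space attributes (as a shapefile's attribute table would contain)
space <- data.frame(STATEFP  = c("42", "42"),
                    COUNTYFP = c("101", "101"),
                    TRACTCE  = c("000100", "000200"),
                    ALAND    = c(336634, 506213))   # made-up land areas

# Toy place-based table (as an ACS extract would contain)
place <- data.frame(state  = c("42", "42"),
                    county = c("101", "101"),
                    tract  = c("000100", "000200"),
                    no_college_men = c(0.42, 0.31))  # made-up proportions

# 1. Concatenate state, county, and tract into a GEOID key on both sides
space$GEOID <- paste(space$STATEFP, space$COUNTYFP, space$TRACTCE, sep = "")
place$GEOID <- paste(place$state,  place$county,  place$tract,   sep = "")

# 2. Merge space and place on the shared key: the result is a splace dataset
splace <- left_join(space, place, by = "GEOID")
head(splace)
```

Within a single state, joining on TRACTCE/tract alone would work as well, as noted above; across states, the full GEOID is required.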
Fig. 2.4 Zooming in and operationalizing neighborhoods, Philadelphia, PA, USA, example. Panels: ZCTA (N = 48); Census Tract (N = 384); Census Block Group (N = 1,336)
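As a preview of the boundary-access workflow detailed in Chap. 4, the following hedged sketch shows one way the three Philadelphia County levels in Fig. 2.4 might be pulled with the community tigris package; this is not the book's own code, and the year and arguments are illustrative.

```r
# Hedged sketch: pulling the three administrative levels shown in Fig. 2.4
# with the tigris package (TIGER/Line boundaries). Requires internet access.
library(tigris)
options(tigris_use_cache = TRUE)   # cache downloads locally

# ZCTAs are a national file; Philadelphia ZIP codes begin with "191"
phl_zcta  <- zctas(starts_with = "191", year = 2019)

# Census tracts and block groups can be requested directly for the county
phl_tract <- tracts(state = "PA", county = "Philadelphia", year = 2019)
phl_bg    <- block_groups(state = "PA", county = "Philadelphia", year = 2019)

# Counts should approximate Fig. 2.4: about 48 ZCTAs, 384 tracts, 1,336 block groups
sapply(list(ZCTA = phl_zcta, tract = phl_tract, block_group = phl_bg), nrow)
```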
The next level of administrative units available from the TIGER/Line Shapefiles corresponds to census tracts, which are shown in the subfigure located at the center of Fig. 2.4. These boundaries are as stable as ZCTA areas, for they are updated "prior to each decennial census as part of the Census Bureau's Participant Statistical Areas Program" (United States Census Bureau, 2022, no page number). In terms of criteria for updating these administrative units, tracts with more than 8,000 people are split into subtracts so that each of these units has about 4,000 inhabitants. This split can be observed in Fig. 2.3. Specifically, note that Census Tract 158 was split into 158.1 and 158.2, indicating population growth in that tract with respect to the previous decade. Similarly, tracts whose population decreased to fewer than 1,200 inhabitants by the end of the decade are merged with a neighboring tract. These mergers and splits ensure that census tracts average about 4,000 inhabitants, with a minimum population of 1,200 and a maximum of not more than 8,000 inhabitants.6
6 More information can be accessed at https://www2.census.gov/geo/pdfs/education/CensusTracts.pdf.
Finally, each of these census tracts is configured by block groups, as shown in the right-hand side of Fig. 2.4. These block groups "are statistical divisions of census tracts, are generally defined to contain between 600 and 3,000 people and are used to present data and control block numbering" (United States Census Bureau, 2022, no page number). Note that although block level shapefiles or administrative units are also available from the TIGER/Line Shapefiles, there are still no publicly available attributes to be linked to these shapefiles (US Census Bureau, 1994). That is, in the United States, block groups are "the smallest geographic entity for which the decennial census tabulates and publishes sample data" (US Census Bureau, 1994, p. 1). Note the publication year of this reference, 1994, which indicates that almost three decades later, census block data are still not available. This latter point indicates that, since no place-based indicators may be added to these block level spaces, no SSEM can be conducted at that level using publicly available information; however, if we can access or collect our own data at this block level, then we can conduct these SSEM analyses.
Data Point Differentiation Across Neighborhood Levels
Each of these shapefiles accounts for a different number of data points to be analyzed. Going back to our Philadelphia County example shown in Fig. 2.4, we can see that, within the exact same space, each of these administrative or neighborhood levels accounts for a different number of spatial units of analysis. In the case of ZCTAs, Philadelphia County accounts for 48 ZCTAs, which operationally represents up to 48 different estimates of each place-based indicator of interest. For example, returning to our indicator of college access shown in Fig. 2.3, if instead of tracts we downloaded ZCTA estimates, we would have 48 values for men and 48 estimates for women. If this area had been left aggregated at the county level, we would only have a single estimate assigned to this county. If, instead of ZCTAs, we zoom in to the next administrative boundary level, the number of spatial data points will increase, along with a corresponding increase in statistical power. Specifically, as depicted in Fig. 2.4, by relying on census tracts instead of ZCTAs, we would have moved from 48 data points to 384 (see Fig. 2.4). Accordingly, when linking this shapefile with the place-based indicators, we would have up to 384 spatial data points. In the latter sentence we said up to, for there may be some instances where no estimates are available due to very few inhabitants (more on this in the tradeoffs subsection below). Finally, in the case of block groups, we can see that in Philadelphia County these administrative boundaries account for a total of 1,336 units, an increase of almost one thousand units with respect to census tracts. Once more, in addition to an important increase in data points, the resulting analyses may allow for more nuanced understandings of place-based advantages and/or disadvantages.
Illustration of Splaces and Data Point Gains
The example presented in this section builds from the integration of Figs. 2.2 and 2.3. Specifically, using the columns "TRACTCE" and "tract" in Figs. 2.2 and 2.3, respectively, we joined (or merged) both datafiles. Effectively, as shown in Fig. 2.5, we added place-based indicators to this space, formally operationalizing the concept of splaces discussed above. Figure 2.5 now contains the columns that account for the American Community Survey's five-year estimates (2015–2019) of the proportion of men and
Fig. 2.5 Operationalizing splaces, Philadelphia, PA, USA, example. Source: American Community Survey, 2015–2019, 5-year ACS estimates, Table B15001
Fig. 2.6 Visualizing splaces using college access among high school graduates, Philadelphia, PA, USA, example. Source: American Community Survey, 2015–2019, Table B15001. Panels: County Aggregate (N = 1), Men and Women Mean College Access = 0.394 (SD = 0.267); Census Tract (N = 384), Men Mean College Access = 0.435 (SD = 0.286); Census Tract (N = 384), Women Mean College Access = 0.354 (SD = 0.245). Each disaggregated panel shows a quintile access distribution and 10 missing tracts
women, 18–24 years old, who finished high school but did not access college across all census tracts in Philadelphia County. The data point gains are illustrated in Fig. 2.6. The left-hand side subfigure contains the aggregated values of women and men who finished high school (or its equivalent) but did not access college. Although two renderings may be achieved by plotting these women's and men's college access estimates separately, there would be no color variation between these estimates (they fall in the same color bracket shown in the disaggregated subfigures), which is why we decided to combine their mean values. In this county level depiction, also note that although the 384 census tract polygons are shown, every single one is assigned the aggregate mean value
of 0.427 (SD = 0.24). Typically, county level estimates do not show these census tract borders, but we added them for illustrative purposes. The center and right subfigures present the disaggregated values for men and women, respectively. Each of these subfigures also contains a quintile distribution of these estimates of college access. For each of these estimates, there were 10 census tracts with no available values, which are represented in pink—we discuss the reason for this lack of availability as a tradeoff or compromise below. Note that lighter colors in these figures indicate higher rates of college access among high school graduates in those tracts. To briefly illustrate the influence of contextual factors, we also highlight the neighborhood known as University City, where the University of Pennsylvania and Drexel University are located. This neighborhood is represented with a circle with a radius of three miles; that is, this circle has a diameter of six miles in total from side to side. This circle consistently reflected higher access rates for both men and women in University City. This variation, however, is lost in the aggregate subfigure.
Tradeoffs of Data Point Differences
In terms of tradeoffs associated with zooming in, although more statistical power is gained by increasing the number of units of analysis to be included in the models, there are two main challenges. One is the precision of the estimates. The more disaggregated the estimates, either the less precise they will be, for they are based on surveys rather than censuses (Spielman et al., 2014), or the more likely they "will be near zero" (U.S. Department of Commerce, Economics and Statistics Administration, U.S. Census Bureau, 2020), or both. That is, the more granular the administrative level, the higher the chance that not enough information from lower level units may be available to form acceptable margins of error. If this is the case, "the calculated value of the lower confidence bound may be less than zero, …, [and since] a negative number of people [in a given area] does not make sense" (U.S. Department of Commerce, Economics and Statistics Administration, U.S. Census Bureau, 2020, p. 54), these estimates may be reported as zero instead, so that some areas appear to have zero inhabitants and therefore have no available estimates. This implies that, as the level of aggregation at the neighborhood level grows, that is, moving from block groups to tracts and from tracts to ZCTAs, counts will increase and valid margins of error may be computed. To showcase an application of this process with actual data, let us replicate Fig. 2.6 but using ZCTA level splaces instead of the census tract level. The result of this process is located in Fig. 2.7. Note that in this case, there were only two ZCTAs (instead of 10 census tract units) where estimates were reported as zero. Although this output shows a reduction of eight missing units (i.e., moving from 10 in census tracts to 2 in ZCTAs), it also reflects that census tracts resulted in an overall availability of 374 spatial units, whereas for ZCTAs the total number of available spatial units in Philadelphia County was 46. This discussion serves to highlight that no rule of thumb exists regarding the
Fig. 2.7 Operationalizing splaces, Philadelphia, PA, USA, example. Source: American Community Survey, 2015–2019, Table B15001. Panels: County Aggregate (N = 1), Men and Women College Access = 0.373 (SD = 0.164); ZCTAs (N = 48), Men Mean College Access = 0.416 (SD = 0.191); ZCTAs (N = 48), Women Mean College Access = 0.416 (SD = 0.191). Each disaggregated panel shows a quintile access distribution and 2 missing ZCTAs
selection of the area level. Instead, there may be other considerations, such as previous literature or conceptual lenses, or even the computing power required to conduct these analyses. The latter constraint constitutes our second tradeoff. The second tradeoff or challenge associated with relying on smaller (or the smallest) geographic boundaries is based on computing costs. These costs are relevant for they could prevent the use of lower level units. Specifically, as further discussed in Chap. 6, to apply statistical analyses that incorporate spatial information, we need to rely on matrices of influence, and these matrices may require vast amounts of RAM. For example, considering that each cell element in a matrix requires eight bytes, a 10 by 10 matrix would have 100 cells that translate into 800 bytes. By today's computing standards this represents absolutely no problem. However, even in analyses covering the exact same area, like Philadelphia County for example, the computing power required to conduct spatial analyses grows quite fast, as discussed next. Going back to the ZCTA depiction, if Philadelphia is represented in matrix form, we would have 2,304 cells (or 48 × 48). For census tracts and block groups, the corresponding numbers of cells will be 147,456 and 1,784,896. Although the RAM required to conduct spatial analyses with Philadelphia County is still minimal (less than 0.015 GB),7 we must consider the fact that this is a small geographical area that accounts for only one of the 3,006 counties in the United States. In practice, spatial analyses typically involve many counties or states, which becomes computationally expensive. For example, the State of Pennsylvania has 9,740 block groups, which requires a modest 0.76 GB of RAM to handle spatial statistical analyses. However, if analysts are interested in modeling all 217,740 block groups in the United States, these analyses will require 379.29 GB of RAM.
7 Estimate based on Wicklin (2014): (r × c × 8)/10⁹ GB, where r is the number of rows and c is the number of columns.
In the case of census tracts and ZCTAs, computers require 42.69 GB and 8.19 GB of RAM, respectively, to run one single regression. More to the point, as we will discuss in Chap. 8, recent advancements in spatial modeling may require the incorporation of more than one matrix in order to account for multiple sources of influence (see González Canché, 2018; 2022, for two recent examples). In these cases, if census tracts and ZCTAs are incorporated in one model, for example, the computing power required to run the analyses will be the result of adding 42.69 GB and 8.19 GB per model, which would require at least 50.88 GB of available RAM to execute these models across the United States. Although there may be important reasons to subset the models by regions of the United States, like college concentration (see González Canché, 2022), these modeling decisions should be justified based on valid analytic reasons rather than being made due to a lack of sufficient computational power to fit these models. So far, the computation of these matrices has considered polygon shapefiles. In Chap. 4 we will also discuss how we can handle point-based shapefiles, which will also translate into matrices of influence that add computing power requirements.
What Might be the Best Choice?
From our previous discussion, we can conclude that ZCTAs are the most cost-affordable administrative areas to operationalize and analyze neighborhood effects, for they decompose county-level information without resulting in unaffordably vast numbers of spatial data units. One point yet to be discussed is that ZCTAs may also have the highest place-based data availability in the United States. For example, the Internal Revenue Service provides a wealth of information on tax returns across the country at the ZCTA level only; that is, these data are not available at the census tract or census block group levels. Moreover, all this multi-year information is publicly available and may be accessed at https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-zip-code-data-soi. As further discussed in our statistical analyses, these data can be analyzed as response or control indicators, depending on each project's main goal. Having said this, it may be quite convenient to rely on ZCTAs, particularly when incorporating thousands of units in order to get enough statistical power. Nonetheless, in practice, models may account for more than one level of place-based indicators. That is, we can include ZCTA, county, and state level indicators in our models even if our main matrix of influence (used to conduct spatial regression analyses, as discussed in Chap. 6) is measured at the ZCTA level, for example. This mixing of administrative levels, however, should also be justified based on previous literature and/or on our theoretical frameworks. Similar to the selection of the administrative level used to build matrices of influence, we would need a clear reasoning and rationale for explaining the need to rely on these divergent place-based indicator levels. One reason, as we will discuss later in the book, may be that certain indicators are only available at
one level (i.e., county), whereas others may be available at different levels. For example, if crime rates are only available at the county level and tax breaks are only available at the ZCTA level, we may be constrained to rely on county level crime estimates if we consider that their inclusion in the models is relevant based on our conceptual framework, while also considering the results of our feature selection procedures.8
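Before turning to the selection of indicators, note that the RAM figures quoted in this section follow directly from the back-of-the-envelope formula in footnote 7; the quick sketch below reproduces them (the unit counts for all US tracts and ZCTAs are rounded approximations).

```r
# Back-of-the-envelope RAM estimate for a dense n-by-n matrix of influence,
# following footnote 7: bytes = n * n * 8, divided by 10^9 for gigabytes.
ram_gb <- function(n) n * n * 8 / 1e9

ram_gb(48)       # Philadelphia ZCTAs:           ~0.0000184 GB
ram_gb(9740)     # Pennsylvania block groups:    ~0.76 GB
ram_gb(217740)   # US block groups:              ~379.29 GB
ram_gb(73000)    # roughly all US census tracts: ~42.6 GB (text reports 42.69)
ram_gb(32000)    # roughly all US ZCTAs:         ~8.19 GB (text reports 8.19)
```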
Bringing Concepts, Shapefiles, and Place-Based Indicators Together
Once the neighborhood level (or levels) has been identified, the next important step consists of the identification of attributes to be linked to these administrative shapefiles. This selection process depends on the goals of the study, previous literature on the topic, and the conceptual frameworks used. Following our discussion presented in Chap. 1, if the framework of interest is concentration of disadvantage, Jargowsky and Tursi (2015) mention five fundamental indicators of socioeconomic distress:
• poverty rates,
• median income,
• unemployment,
• housing quality, and
• family structure.
This section elaborates further on the process and rationale followed to identify attributes available in the American Community Survey (ACS) that can be used to capture and/or approximate these indicators. In this respect, it is worth noting that the ACS offers two main forms of social, economic, housing, and demographic data (U.S. Department of Commerce, Economics and Statistics Administration, U.S. Census Bureau, 2021). One of these forms is called pre-tabulated, summary, or aggregated estimates. These estimates are derived from 1-year or 5-year survey responses, as further described below. The second form is called the Public Use Microdata Sample (PUMS).9 PUMS data consist of a subsample (about two-thirds of the total ACS surveys, or 1 percent of the United States population in a given year, or 5 percent in the 5-year survey) containing actual individual records with information about the characteristics of each person and housing unit in the survey. PUMS data allow users to create their own indicators (U.S. Department of Commerce, Economics and Statistics Administration, U.S. Census Bureau, 2021, p. 2) as opposed to relying on the tens of thousands of pre-tabulated indicators that Census employees have created for
8 Actually, in the examples presented in Chap. 8, we will see that, despite measuring crime rates at the county level based on their conceptual relevance for the models, this indicator was not found to be relevant using the feature relevance identification via Boruta (Kursa & Rudnicki, 2010); accordingly, crime was not included in the final set of models.
9 Pre-tabulated data are also referred to as summary or aggregate data (U.S. Department of Commerce, Economics and Statistics Administration, U.S. Census Bureau, 2021).
public use. PUMS data, then, are useful if a needed estimate is not available through pre-tabulated ACS data products (U.S. Department of Commerce, Economics and Statistics Administration, U.S. Census Bureau, 2021). Having said this, our book focuses on these pre-tabulated ACS estimates for the following two main reasons. In line with the goal of SSEM, pre-tabulated data are available for a vast array of regions10 and these regions include all the levels we have described in this chapter (state, county, ZCTAs, census tracts, and census block groups). On the other hand, PUMS data are available for the nation, regions, divisions, states, and Public Use Microdata Areas (PUMAs). The latter must have a population of at least 100,000 that is consistent throughout a decade to ensure confidentiality. In less densely populated areas, PUMAs may aggregate counties and census tracts over time, which may drastically limit or completely impede the types of spatial analyses we can conduct relying on this data source. The second reason is that PUMS data are in the process of undergoing a significant transformation for the sake of protecting the confidentiality of their respondents. Specifically, "at the April 2021 ACS Data Users conference, the Census Bureau announced that it will replace the ACS research data with 'fully synthetic' data" (U.S. Census and American Community Survey microdata, 2022, para. 10). Fully synthetic data consist of constructing or simulating data consistent with the distributions obtained via the ACS surveys. Although no further details currently exist, it seems that ACS surveys (along with their pre-tabulated estimates) will continue to rely on actual participants' responses, but instead of drawing a subsample (i.e., two-thirds of these responses) to make PUMS data available for research, the PUMS data will be simulated (U.S. Census and American Community Survey microdata, 2022). From this view, in addition to PUMS data (a) already being more complicated to analyze due to the creation of estimates with correct survey weights, and (b) being much narrower in scope and geographical reach, (c) most analyses "currently conducted with the ACS [PUMS and PUMAs data] are likely to become impossible with the shift to synthetic data" (U.S. Census and American Community Survey microdata, 2022, para. 11). In summary, given the narrower geographical scope and the upcoming reliance on simulations to synthesize PUMS data, our data access and data analysis discussion for this book focuses instead on ACS pre-tabulated estimates. Although there exists the possibility that these pre-processed tables do not have the specific indicator that we are interested in analyzing, this is quite unlikely; these published tables account for tens of thousands of indicators, available for a myriad of geographical levels, that have been carefully crafted by Census employees.
10 The complete list of shapefile levels (accounting for 86 spaces in total) can be accessed here https://api.census.gov/data/2019/acs/acs5/geography.html.
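As a preview of how such pre-tabulated tables can be pulled programmatically (Chaps. 3 and 4 cover the access workflow the book actually uses), here is a hedged sketch relying on the community tidycensus package; the table and geography mirror the B15001 example above, and the call is illustrative rather than the book's own code.

```r
# Hedged sketch: pulling a pre-tabulated ACS table at the census tract level
# with the tidycensus package (not the book's own access workflow).
library(tidycensus)

# A free Census API key is required once per machine:
# census_api_key("YOUR_KEY_HERE", install = TRUE)

b15001_phl <- get_acs(geography = "tract",
                      table     = "B15001",    # Sex by Age by Educational Attainment
                      state     = "PA",
                      county    = "Philadelphia",
                      year      = 2019,
                      survey    = "acs5")      # 5-year estimates

head(b15001_phl)   # columns: GEOID, NAME, variable (e.g., B15001_003), estimate, moe
```

Swapping the geography argument for "zcta" or "block group" would, in principle, move across the levels discussed above, subject to the availability rules described next.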
ACS Published or Pre-tabulated Data
As a first step, we should begin by familiarizing ourselves with the "Table Shells for All Detailed Tables" that the United States Census Bureau makes available here: https://www2.census.gov/programs-surveys/acs/summary_file/2019/documentation/user_tools/ACS2019_Table_Shells.xlsx. Note that this file covers a large list of indicators and has a total of 40,016 rows, which may be overwhelming to navigate at first. The data structure of this file contains five columns, "Table ID," "Line," "UniqueID," "Stub," and "Data Release," as shown in Fig. 2.8. The Table ID column is represented in Figs. 2.3, 2.5, 2.6, and 2.7; in those figures we described the Table ID "B15001." The Line column represents the column number that the database format will have when downloaded from the server—similar to the depiction shown in Fig. 2.3. Moreover, this number is used to assign each variable a Unique ID, as shown in the UniqueID column. For example, the Line with value three (3) forms a UniqueID value by concatenating the Table ID, an underscore, two leading zeroes, and the three, as follows: "B01001_003." The next column, called Stub, describes the meaning of each unique indicator, which in the case of B01001_003 is the number of men under 5 years of age living in a given area. From this description, and going back to Fig. 2.3, the column called "B15001_003" would have a Line number of 3, and its corresponding Stub would measure the total number of men ages 18–24 in a given area. This means that the Line value does not have the same meaning across Tables. This is the main reason why it is very important to access the "Table Shells for All Detailed Tables." The 2016–2020 version is available here: https://www2.census.gov/programs-surveys/acs/summary_file/2020/documentation/user_tools/ACS2020_Table_Shells.xlsx. Finally, the last column shown in Fig. 2.8 is called "Data Release." This column may take values of 1 or 5. When the values are "1,5" this indicates that the estimate is available as both 1-year and 5-year estimates. When the content of this column for a given Table ID is only 1 or 5, then the estimates for that indicator are only available for the corresponding value displayed. Similar to our discussion of tradeoffs associated with potentially reaching areas with no inhabitants at lower levels, although the 1-year estimates may seem appealing due to their added frequency, the 5-year estimates "have more statistical reliability […] compared with that of single-year estimates, particularly for small geographic areas and small population subgroups" (U.S. Department of Commerce, Economics and Statistics Administration, U.S. Census Bureau, 2020, p. 13). More to the point, the 1-year estimates only survey areas with populations of 65,000 or more and some "popular" tables with at least 20,000 people (U.S. Department of Commerce, Economics and Statistics Administration, U.S. Census Bureau, 2020, p. 13)—we were unable to find a measure or description of what constitutes this "popularity." Areas with fewer than 20,000 inhabitants are only available at the 5-year estimate. This means that our ZCTA, census tract,
Fig. 2.8 Table shells for all detailed tables (click here for direct table access)
This means that our ZCTA, census tract, and census block representations shown in Fig. 2.4 are only available as 5-year estimates, given their population sizes. Counties, states, divisions, regions, and the nation, in contrast, can be accessed at both the 1- and 5-year estimates given their aggregation levels (see U.S. Department of Commerce, Economics and Statistics Administration, U.S. Census Bureau, 2020, p. 8, Figure 2.1 for a more comprehensive explanation of these levels and frequencies). Note that our description of these tables and unique IDs has been based on identifying estimates of people’s attributes located in “a given area.” That is, we have not specified any particular spatial unit level.
This vagueness in our language is purposeful because, as just discussed, the ACS provides these estimates at all the levels we have described in this chapter (state, county, ZCTAs, census tracts, and census block groups). Having described the structure of these place-based databases, we can now proceed to identify proxies of the indicators of concentrated (dis)advantages discussed in our first chapter. Identifying Proxies for Poverty As briefly depicted above, the ACS Table Shells for All Detailed Tables contains over 40,000 rows, which poses a challenge when identifying the place-based attributes that we may include in our SSEM. One identification strategy to deal with this vast amount of information consists of searching for keywords within the Excel Table Shells for All Detailed Tables we described in Fig. 2.8. Specifically, in the case of poverty rates we can search for “poverty.” As shown in Fig. 2.9, the first result measures concentration of children’s poverty by family composition, including whether the parents immigrated to the United States. The Data Release value shows that this indicator is available for 1- and 5-year estimates. The concept of Ratio of Income to Poverty shown in Fig. 2.9 requires an understanding of poverty thresholds. Specifically, according to the Office of the Assistant Secretary for Planning and Evaluation of the United States, the poverty thresholds as of 2021 are represented in Table 2.1. These thresholds account for the combination of household members and salary.
Fig. 2.9 Example of poverty estimate
Table 2.1 Poverty thresholds for the 48 contiguous states and the District of Columbia

Household      Poverty guideline
1 person       $12,880
2 persons      $17,420
3 persons      $21,960
4 persons      $26,500
5 persons      $31,040
6 persons      $35,580
7 persons      $40,120
8 persons      $44,660
9+ persons     $4,540 per each additional member
Data source: US Department of Health and Human Services (2021)
Typically, households with salaries below the poverty line meet the financial eligibility criteria for certain federal assistance programs (US Department of Health and Human Services, 2021). The concept of ratio of income to poverty captured by the ACS is computed as follows. Assume we identify a household with 5 persons. According to Table 2.1, the combined salary this household would have to earn to reach the poverty line would be $31,040. If this salary threshold is exactly met, then the ratio of income to poverty for this household is 1 (or $31,040/$31,040)—meaning that this household is at the poverty line. However, if instead this household was only able to earn $15,520, the corresponding ratio will be 0.5 (or $15,520/$31,040). Accordingly, going back to Fig. 2.9, the “Under 1.00” Stubs, accounting for UniqueIDs “B05010_002” to “B05010_009,” reflect the number of households in that area that earned combined salaries, according to their family member configurations, below their income-to-poverty thresholds, as reported in Table 2.1. More to the point, all the remaining cases, that is, UniqueIDs “B05010_010” to “B05010_025,” were estimated to earn more than their respective income-to-poverty thresholds. Indeed, inhabitants classified in the “B05010_018” to “B05010_025” categories were estimated to have earned at least double the minimum amount required to qualify for public assistance (i.e., the poverty threshold). Table ID “B05010” allows for a detailed description of the family or household configuration. However, a more condensed description of this poverty estimate is available in Table ID “C17002,” shown in Fig. 2.10. This table also presents the same rationale for the ratio of income to poverty, but instead of disaggregating the analyses by family configuration and nativity, the estimates present more granular depictions of these ratios. For example, the under 0.5 ratio will count the estimated number of households with combined salaries that are less than half of their respective threshold. Going back to the 0.50 estimate just shown, if the threshold is $31,040, a given household with a ratio of less than 0.5 would have been estimated to earn less than $15,520—which is half of $31,040 (Fig. 2.10).
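As a concrete illustration, the sketch below computes the share of the Table “C17002” universe estimated to fall under the 0.50 ratio. It assumes the tidycensus package and a Census API key, and the specific UniqueIDs and the Philadelphia County census-tract example are assumptions used only for illustration, not necessarily the workflow this book adopts.

```r
# A sketch: proportion under 0.50 of the income-to-poverty ratio (Table C17002)
# Assumes the tidycensus package and a Census API key; UniqueIDs should be
# verified against the Table Shells before use.
library(tidycensus)
library(dplyr)

# census_api_key("YOUR_KEY_HERE", install = TRUE)  # one-time setup

poverty <- get_acs(
  geography = "tract",
  state     = "PA",
  county    = "Philadelphia",
  variables = c(total    = "C17002_001",   # total universe of the table (assumed UniqueID)
                under_50 = "C17002_002"),  # ratio of income to poverty under 0.50 (assumed UniqueID)
  year      = 2019,
  survey    = "acs5",
  output    = "wide"
)

poverty <- poverty %>%
  mutate(prop_under_50 = under_50E / totalE)  # the "E" suffix marks the estimate columns
head(poverty)
```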
Fig. 2.10 Example of condensed poverty estimate
Also note that Table ID “B05010” only allows us to estimate who is below the poverty threshold, as it only offers the under 1.00 ratio. In the case of Table ID “C17002,” we can estimate more precise depictions of poverty levels, with the worst category being under 0.50. The selection of one of these tables for model specifications should consider these differences. If we are interested in poverty by immigration status, Table ID “B05010” may be selected. If instead we are interested in capturing more extreme levels of poverty, then Table ID “C17002” is the more adequate data source. Identifying Proxies for Median Income The second indicator we will discuss here corresponds to median income. In this case the ACS estimates report a 1- and 5-year indicator of “MEDIAN INCOME IN THE PAST 12 MONTHS (IN 2019 INFLATION-ADJUSTED DOLLARS) BY PLACE OF BIRTH IN THE UNITED STATES.” This indicator may be accessed from Table ID “B06011” and is represented in Fig. 2.11. For comparison purposes, that same figure also includes Table “B07011.” As can be seen in this figure, Table “B06011” captures total median income in a given area, as well as median income estimates disaggregated by place of birth. In the case of Table “B07011,” although there is also an indicator of mobility, this indicator refers to migration regardless of place of birth. Unless differences by these birth or migration categories represented in this figure are important for our research purposes, the total information contained in each of these tables would suffice to capture this median income estimate. Moreover, since the total is the same across these tables, for it only captures the median income per area (see Fig. 2.12), the selection of Table “B06011” or Table “B07011” is inconsequential for our modeling purposes. Nonetheless, if our research interest is purposefully trying to capture changes in the composition of inhabitants in a given area, we need to select Table “B07011.” For example, Leigh and González Canché (2021) tested whether the enactment of a college promise program, which by design targets only county inhabitants and offers up to the full costs of tuition and fees, may have contributed to a change in the sociodemographic and economic composition of the counties that implemented this college promise program—by attracting inhabitants interested in taking advantage of this scholarship.
Fig. 2.11 Example of median income tables, source ACS Tables “B06011” and “B07011”
Fig. 2.12 Example of median income total for tables, source ACS Tables “B06011” and “B07011”. We will discuss the code to access these data below
Because of this research goal, Leigh and González Canché (2021) were interested in testing for changes in Table “B07011.” Specifically, the indicator “B07011_004” was of particular interest for it offers an estimate of the median income of inhabitants in the county where the college promise was implemented who moved from other counties within the same state. Once more, the rationale driving the selection of this estimate was that, since moving is costly, only those with the means to do so may have decided, or been able to afford, to move to the county where the program was implemented to “cash in” this benefit. Once more, in the absence of a clear need to capture these mobility patterns and socioeconomic and demographic changes, an aggregate estimate of median income may suffice to capture this indicator, recommended by Jargowsky and Tursi (2015) to operationalize concentration of disadvantages.
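For illustration, the sketch below retrieves both the aggregate median income and the movers-within-state indicator just discussed. The tidycensus workflow and the assumption that “B06011_001” is the total median income UniqueID are ours, not the book's; “B07011_004” is the UniqueID named in the text.

```r
# A sketch: median income estimates from Tables B06011 and B07011
# (verify UniqueIDs against the Table Shells before use)
library(tidycensus)

median_income <- get_acs(
  geography = "county",
  state     = "PA",
  variables = c(
    med_income_total       = "B06011_001",  # total median income in a given area (assumed UniqueID)
    med_income_in_migrants = "B07011_004"   # median income of movers from other counties in the same state
  ),
  year   = 2019,
  survey = "acs5"
)
head(median_income)
```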
Identifying Proxies for Unemployment Unemployment indicators are well represented in the ACS. There are estimates disaggregated by ethnicity, sex, age group, and the combination of all these indicators, including whether survey respondents are in the labor force (i.e., actively seeking employment) or are not part of the labor force. As we discussed in Chap. 1, disaggregated estimates may be of interest for they may render more nuanced understandings of place-based attributes and, in some cases, such a disaggregation may even be necessary based on a specific research question. For example, if there is a program whose outcome consists of incentivizing women with children to participate in the labor force, the outcome would need to be disaggregated by sex. Figure 2.13 shows two tables that may be suitable to capture unemployment. Table ID “B23025” represents the most straightforward estimate. Using that table, we have several options. Note that “In labor force” (UniqueID “B23025_002”) is configured by “Civilian labor force” and “Armed forces.” These estimates are structured so that adding the estimates of “Civilian labor force” and “Armed forces” renders the estimate for “In labor force.” From this view, if “In labor force” = 100, “Civilian labor force” = 95, and “Armed forces” = 5, then to get the proportion of the “Civilian labor force” who are unemployed, we need to divide “Unemployed” by “Civilian labor force.”
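The division just described can be sketched as follows, assuming the tidycensus package and that UniqueIDs “B23025_003” and “B23025_005” correspond to “Civilian labor force” and “Unemployed” (assumed IDs; confirm them in the Table Shells).

```r
# A sketch: civilian unemployment rate from Table B23025
library(tidycensus)
library(dplyr)

unemp <- get_acs(
  geography = "tract",
  state     = "PA",
  county    = "Philadelphia",
  variables = c(civ_labor_force = "B23025_003",   # assumed: "Civilian labor force"
                unemployed      = "B23025_005"),  # assumed: "Unemployed"
  year      = 2019,
  survey    = "acs5",
  output    = "wide"
) %>%
  mutate(unemp_rate = unemployedE / civ_labor_forceE)  # Unemployed / Civilian labor force
head(unemp)
```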
Fig. 2.13 Example of unemployment estimates in tables: Tables “B23025” and “B23001”
Assuming “Unemployed” = 30, we would have 30/95 = 0.316. However, if the denominator is “In labor force” we would get 0.30 (30/100 = 0.30), but this unemployment estimate will include non-civilians. Although these results are not wrong, per se, we need to be careful with our analytic decisions. To get the civilian women unemployment estimates, the process is longer. The estimates are separated into the following age categories: 16–19, 20–21, 22–24, 25–29, 30–34, 35–44, 45–54, 55–59, 60–61, 62–64, 65–69, 70–74, and 75 and over. Each of these estimates is further disaggregated by “In labor force,” “In Armed Forces,” “Civilian,” “Employed,” and “Unemployed.” This implies that we need to first add all the “Civilian” count estimates per each age category to form our denominator (i.e., B23001_092, B23001_099, B23001_106, B23001_113, B23001_120, B23001_127, B23001_134, B23001_141, B23001_148, B23001_155). Note that the ACS stops differentiating the civilian population after 64 years of age, so the last Unique ID would be B23001_155. After this arithmetic computation, we should then proceed to add the unemployed Unique IDs per each of these age categories. When looking at Table “B23001,” these values will be: B23001_094, B23001_101, B23001_108, B23001_115, B23001_122, B23001_129, B23001_136, B23001_143, B23001_150, and B23001_157. Note that if we prefer to expand these estimates beyond 64 years of age, we simply need to include the corresponding age groups in the creation of the numerator and denominator of interest. Identifying Proxies for Housing Quality The fourth indicator that Jargowsky and Tursi (2015) mentioned is housing quality. In the United States, the governmental entity called Healthy People defines Quality of Housing11 as the physical condition of a person’s home as well as the quality of the social and physical environment in which the home is located. The latter includes security, home space per individual, air quality, and the presence of asbestos or mold. It is worth noting that this housing quality construct is more difficult to operationalize than the previous proxies discussed up to this point, for the ACS may not have these exact indicators. However, we can rely on some of the indicators mentioned by Healthy People as part of our operationalizing strategy. For instance, the ACS provides 1- and 5-year estimates of the number of occupants per room in Table “B25014”. This estimate may approximate “home space per individual.” As shown in Fig. 2.14, this table presents estimates by owner-occupied and renter-occupied houses. And each of these categories is sub-classified into number of occupants per room, ranging from 0.5 or less to 2 or more. In addition, the ACS also presents estimates of this indicator by ethnicity in a set of tables (see Fig. 2.14), and in these instances the categories are 1 or less occupants per room or 1 or more.
11 See https://www.healthypeople.gov/2020/topics-objectives/topic/social-determinants-health/interventions-resources/quality-of-housing.
Fig. 2.14 Example of housing quality proxy in Table “B25014” and others
Once more, depending on our interests or needs, we would select whether we prefer estimates by ethnicity, in this case, or by owner versus renter and number of occupants per room. Accordingly, proportion estimates may be obtained by dividing the count of 1 or more occupants per room by the total count of occupied housing units in that area, for each of the ethnicity categories. The identification of housing quality may also rely on estimating the proportion of houses that lack plumbing facilities. In this respect, the ACS provides Table “B25016”, as shown in Fig. 2.15. Note that this table separates the estimates by owner-occupied and renter-occupied units. In this case, a good proxy of housing quality in a given area, regardless of home ownership status, may be created by adding together the counts of “Owner occupied” (UniqueID = “B25016_002”) and “Renter occupied” (UniqueID = “B25016_011”) to use as the denominator, and the totals for “Lacking complete plumbing facilities” (UniqueIDs = “B25016_007” and “B25016_016”) to capture the numerator of interest. The addition of these totals while ignoring the number of occupants per room may be warranted, particularly if we have created other indicators of housing quality that already accounted for these occupancy numbers per room.
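The numerator and denominator just described can be assembled as in the sketch below. The UniqueIDs are those named in the text; the tidycensus call itself is an assumed workflow, not the book's own code.

```r
# A sketch: share of occupied units lacking complete plumbing facilities (Table B25016)
library(tidycensus)
library(dplyr)

plumbing <- get_acs(
  geography = "tract",
  state     = "PA",
  county    = "Philadelphia",
  variables = c(owner_total   = "B25016_002",   # owner-occupied units
                renter_total  = "B25016_011",   # renter-occupied units
                owner_no_plb  = "B25016_007",   # owner-occupied, lacking complete plumbing
                renter_no_plb = "B25016_016"),  # renter-occupied, lacking complete plumbing
  year      = 2019,
  survey    = "acs5",
  output    = "wide"
) %>%
  mutate(prop_no_plumbing = (owner_no_plbE + renter_no_plbE) /
                            (owner_totalE + renter_totalE))
head(plumbing)
```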
Fig. 2.15 Example of housing quality proxy by lack of plumbing facilities (Table “B25016”) and type of heating fuel used (“B25040”)
Note also that Table “B25047” accounts for the number of units lacking complete plumbing facilities, which may be an easier indicator to build. On the other hand, if we are interested in variations by home ownership status, we must rely on Table “B25016” and obtain this estimate separately, which cannot be accomplished with Table “B25047”. Finally, as another indicator of housing quality, we present Table “B25040”, which measures the type of fuel used to heat a unit. In this instance, the lack of fuel usage may indicate hardship, especially in colder zones of the United States. In this case, to obtain this estimate, we would only need to divide the count of “No fuel used” (UniqueID = “B25040_010”) by the total number of occupied housing units (UniqueID = “B25040_001”). If we are interested in obtaining this estimate by home ownership status, we can rely on Table “B25117”, which separates Table “B25040” by “Owner occupied” and “Renter occupied” units. Identifying Proxies for Family Structure Another important measure of concentration of disadvantages is family structure. This indicator may be based on the number of children per household and/or single-mother-led households (Jargowsky & Tursi, 2015). Table “B09002” allows us to approximate this family structure indicator based on women-led households.
Specifically, as can be seen in Fig. 2.16, Table “B09002” accounts for count estimates of households with children (under 18 years of age), and has three main categories: Married couple families; Male householder, no spouse present; and Female householder, no spouse present. Moreover, each of these categories is separated into the following age ranges: under 3, 3–4, 5, 6–11, and 12–17 years of age. To operationalize the indicator of single-mother-led households, we can simply divide the total of Female householder, no spouse present (“B09002_015”) by the Total reported in this table (“B09002_001”). This estimate may then be read as the proportion of households in a given area led by single mothers. If these numbers are 40/400, this indicates that 10% (or 0.10) of these households are led by single mothers. However, this table also allows us to capture a more nuanced set of estimates by considering children’s ages, but the meaning of this new indicator is slightly different. Specifically, let us assume that the number of single-mother households is 40, as in the previous paragraph. Let us assume further that we are interested in computing the estimate of single-mother-led households with kids between 0 and 5 years. This requires us to add the corresponding Unique IDs, which according to Fig. 2.16 are “B09002_016,” “B09002_017,” and “B09002_018.” Finally, let us assume this total is 20. If we divide 20/40, this result will read as 50% of single-mother households having children between 0 and 5 years old. However, we could also use the total (400) as the denominator, which would read as 5% of all households in a given area being led by single mothers with children between 0 and 5 years of age. Our decision should be guided by our research questions and study purpose. However, once more, the more disaggregated our estimates are, the more likely we will be to face zeroes.
Fig. 2.16 Example of family structure proxy, Table “B09002.”
If this is the case, perhaps the total estimate (without considering children’s ages) may be more appropriate to avoid losing spatial data points. This latter concern aligns with the tradeoffs and compromises discussed earlier in the chapter.
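The aggregate version of this indicator can be sketched as follows, using the two UniqueIDs named above; as before, the tidycensus workflow is an assumption used only for illustration.

```r
# A sketch: proportion of households with children led by single mothers (Table B09002)
library(tidycensus)
library(dplyr)

fam <- get_acs(
  geography = "tract",
  state     = "PA",
  county    = "Philadelphia",
  variables = c(total_with_children = "B09002_001",   # total households with children
                female_no_spouse    = "B09002_015"),  # female householder, no spouse present
  year      = 2019,
  survey    = "acs5",
  output    = "wide"
) %>%
  mutate(prop_single_mother = female_no_spouseE / total_with_childrenE)
head(fam)
```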
Closing Thoughts and Next Steps This chapter aimed to provide more practical clarity on the spaces and places forming splaces. As part of this process, we discussed the data formats we will mostly be handling to conduct SSEM analyses. An important goal of our discussion consisted of highlighting both advantages and potential limitations associated with zooming in, and we illustrated these cases using Philadelphia County data and place-based estimates measured at the ZCTA and census tract levels. However, our discussion so far has been centered around polygon shapefiles. In practice, another truly relevant source of data that can be geocoded is referred to as point spatial data. As briefly noted in this chapter, these point data only need the intersection of latitude and longitude coordinates and are useful to represent institutions like schools, hospitals, and colleges, or to record events, such as crime-related incidents. These points, which are located in spaces, also need the inclusion of place-based attributes to conduct SSEM analyses. Notably, these place-based attributes can be added at their point level, that is, as attributes of the units representing those points (individuals, colleges, gas stations), or at a given area level. In the latter case, these attributes serve to capture neighborhood or context effects, for these attributes are not of the units (points) but of their location, which follows the same rationale of estimating the effects of concentrations of advantages or disadvantages. Once more, the selection of the neighborhood level depends on the research team’s goals and priorities, for it can be at the block group, census tract, or ZCTA level. County and/or state levels are areas that, while capturing spatial contextual factors, go beyond localized neighborhood impacts.
Next Steps Moving forward, Chap. 3 builds upon our discussion of shapefiles to present two different formats of spatial data: raster and vector data. Here we will discuss similarities and differences as well as strategies to move from one format to another. In addition, we will discuss the components of coordinate reference systems, including projections that allow us to represent geolocated information in a map. Finally, we will also discuss recent developments of differential privacy algorithms that have been designed to protect the anonymity and confidentiality of respondents and will also discuss how these algorithms may or may not pose threats to our SSEM modeling and identification strategies.
Discussion Questions
2.1 What are shapefiles? How are they similar to or different from our discussion or conceptualization of spaces?
2.2 What is the relevance of merging shapefiles with their place-based attributes? How would you refer to the resulting product of this union?
2.3 What is the meaning of zooming in? What are some of the advantages and perils associated with this process?
2.4 During our presentation we also mentioned the frequency of publication of data by the American Community Survey. This frequency was 1- and 5-year estimates. Can you elaborate on the advantages and/or limitations of selecting one or the other? That is, which estimates are more stable, and why?
2.5 What is a super key column, such as the GEOID we discussed in the chapter? Can you name or identify its components?
2.6 What is the relevance of such a super key? In case our place-based database does not have this indicator, how can we retrieve it or build it?
2.7 During our discussion and examples of indicators of concentrated disadvantages, we operationalized five of them. Could you name them and describe them in your own words?
2.8 Which indicator(s) turned out to be more complicated to operationalize, and how or why was this the case? Could you describe the process you would follow to identify place-based attributes that you may be interested in modeling? At what level would this be and with what frequency?
References
Agnew, J. (2005). Space: Place. Spaces of geographical thought: Deconstructing human geography’s binaries (pp. 81–96).
Cresswell, T. (2008). Place: Encountering geography as philosophy. Geography, 93(3), 132–139.
Dong, G., Harris, R., Jones, K., & Yu, J. (2015). Multilevel modelling with spatial interaction effects with application to an emerging land market in Beijing, China. PLoS ONE, 10(6), e0130761.
González Canché, M. S. (2018). The statistical power of “zooming in”: Applying geographically based difference in differences using spatio-temporal analysis to the study of college aid and access. New Directions for Institutional Research, 2018(180), 85–107.
González Canché, M. S. (2022). Post-purchase federal financial aid: How (in)effective is the IRS’s Student Loan Interest Deduction (SLID) in reaching lower-income taxpayers and students? Research in Higher Education, 1–54. https://doi.org/10.1007/s11162-021-09672-6.
Her, Y. G., & Yu, Z. (2021). Mapping the US census data using the TIGER/Line shapefiles: AE557, 05/2021. EDIS, 2021(3). https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html.
Jargowsky, P. A., & Tursi, N. O. (2015). Concentrated disadvantage. International Encyclopedia of the Social & Behavioral Sciences.
Kursa, M. B., & Rudnicki, W. R. (2010). Feature selection with the Boruta package. Journal of Statistical Software, 36(11), 1–13. https://doi.org/10.18637/jss.v036.i11.
Leigh, E. W., & González Canché, M. S. (2021). The college promise in communities: Do place-based scholarships affect residential mobility patterns? Research in Higher Education, 62(3), 259–308.
Spielman, S. E., Folch, D., & Nagle, N. (2014). Patterns and causes of uncertainty in the American Community Survey. Applied Geography, 46, 147–157.
Tuan, Y.-F. (1977). Space and place: The perspective of experience. Minneapolis: University of Minnesota Press.
United States Census Bureau. (2021). TIGER/Line shapefiles 2021: Technical documentation. https://www2.census.gov/geo/pdfs/maps-data/data/tiger/tgrshp2021/TGRSHP2021_TechDoc.pdf.
United States Census Bureau. (2022). Glossary. United States Census Bureau. https://www.census.gov/programs-surveys/geography/about/glossary.html.
U.S. Census and American Community Survey microdata. (2022). Changes to Census Bureau data products. IPUMS. https://www.ipums.org/changes-to-census-bureau-data-products.
US Census Bureau. (1994). Geographic areas reference manual. https://www2.census.gov/geo/pdfs/reference/GARM/.
U.S. Department of Commerce, Economics and Statistics Administration, U.S. Census Bureau. (2020). Understanding and using American Community Survey data: What all data users need to know. U.S. Government Publishing Office. https://www.census.gov/content/dam/Census/library/publications/2020/acs/acs_general_handbook_2020.pdf.
U.S. Department of Commerce, Economics and Statistics Administration, U.S. Census Bureau. (2021). Understanding and using the American Community Survey public use microdata sample files: What data users need to know. U.S. Government Printing Office. https://www.census.gov/content/dam/Census/library/publications/2021/acs/acs_pums_handbook_2021.pdf.
US Department of Health and Human Services. (2021). US federal poverty guidelines used to determine financial eligibility for certain federal programs. https://aspe.hhs.gov/topics/poverty-economic-mobility/poverty-guidelines/prior-hhs-poverty-guidelines-federal-register-references/2021-poverty-guidelines.
Wicklin, R. (2014). How much RAM do I need to store that matrix? https://blogs.sas.com/content/iml/2014/04/28/how-much-ram-do-i-need-to-store-that-matrix.html.
CHAPTER 3
Data Formats, Coordinate Reference Systems, and Differential Privacy Frameworks
Abstract This is our last conceptual chapter. In this chapter we formally discuss three topics with important practical implications for SSEM: spatial data formats (vector and raster), coordinate reference systems (projected and unprojected), and data privacy or protection frameworks (data swapping, differential privacy, and jittering). Accordingly, the purpose of this chapter is to provide readers with the set of practical elements and understandings required to start reading spatial data files and building, visualizing, and analyzing splace datasets while being aware of the relevance of protecting our participants’ privacy. From this perspective, we begin the chapter with a presentation of spatial data formats that include: (a) raster or grid data files, which represent units in space based on a matrix or grid, and (b) the vector data format, which stores and represents geographical features (or geometries) as points, lines, or polygons. As part of this presentation, we illustrate how to move from raster to vector data and vice versa, along with the implications of these transformations for SSEM. Subsequently, we introduce coordinate reference systems (CRSs) and discuss similarities and differences between projected (flattened) and unprojected (spherical) spatial data representations, once again while highlighting similarities, differences, and the implications of each CRS form for SSEM. Finally, we close this chapter with a discussion of the differential privacy frameworks implemented by the United States Census Bureau as a response to the conundrum of presenting counts of inhabitants that are as accurate as possible while protecting the identity and privacy of the respondents. From this view, although differential privacy was developed and implemented with the goal of protecting participants’ place-based anonymity based on their personal attributes, this framework may pose challenges with respect to modeling accuracy. Our presentation will discuss the implications of these data protections for SSEM and will also illustrate instances where the preservation of accurate spatial mapping and analyses is paramount.
Types of Geo-Referenced Data: Raster and Vector Data Spatial data may be classified into raster and vector formats. These data formats have been the focus of study and discussion for geographical information systems (GIS) and spatial modeling and visualization for many decades (Congalton, 1997; Piwowar et al., 1990; Wade et al., 2003). Not only do vector and raster spatial data formats have different data handling or manipulation requirements and computing power needs, but they have also been primarily used for different analytic procedures. Vector data are particularly useful for social sciences modeling, whereas raster data have been more relevant for the environmental sciences. In short, these data sources both store and handle data differently and are also used for different types of spatial analyses and processes. In keeping with the disciplinary prevalence of the use of vector and raster data, our discussion in previous chapters has focused on vector data, for this data format is the most commonly used in the social sciences (Lovelace et al., 2019). Vector data closely align with the traditional data storage, retrieval, and management practices used with non-spatial databases—i.e., administrative data. On the other hand, disciplines that rely on remote sensing data and satellite imagery that capture spatial data from airplanes, satellites, or drones, like many environmental sciences (Lovelace et al., 2019), mostly rely on raster data.1 And this data format and its handling differ from more traditional non-spatial administrative data formats. With this disciplinary focus in mind, although this book is written for social scientists, ignoring the availability of raster data would be a disservice to our conceptual and analytic purposes, particularly in cases where the only form of geo-referenced data available may be in raster format. Accordingly, an important takeaway from the following discussion is not so much the handling and analysis of raster data, per se, but its transformation to vector data, as we will describe in this chapter. With this in mind, our following section formally discusses and visualizes raster data. Raster Data Raster spatial data represent Earth’s surface in a matrix or grid wherein geographical elements are captured in pixels (Wasser, 2022) and each pixel is accounted for by the intersection of a row and column of that matrix (i.e., a cell). In other words, raster data are stored in a two-dimensional matrix (also known as a grid) wherein each cell represents a “pixel” of the world that indicates the space where a longitude and latitude intersect (Wasser, 2022). As with our discussion of shapefiles in Chap. 2, this grid may simply have space information
1 See also https://docs.qgis.org/2.8/en/docs/gentle_gis_introduction/raster_data.html.
Fig. 3.1 Example of raster cell changes within same space
(i.e., like land or water areas as discussed in Fig. 2.2 in Chap. 2), or it may also include or be merged with place-based indicators or attributes to form splaces. An empty grid in raster data may be conceived as a canvas where we may add information at the intersection of each row and column, and where such an intersection constitutes a pixel in an image. The smaller these pixels are in a given space, the more granular the information they convey, which will result in higher resolution images. Conversely, the bigger these cells or pixels are, the higher the distortion of the resulting image will be. Similar to our previous analogy of zooming in, as image resolution increases (i.e., the number of cells increases in a given matrix covering the same space), the dimensions of the grid or matrix will increase as well. To illustrate this process, let us observe Fig. 3.1.2 The area covered in this figure is the same in both grids. However, the top grid is a 16 by 16 matrix, whereas the bottom grid is a 32 by 32 matrix, which allows for a higher number of spatial units to be represented in this matrix or canvas as a function of the number of data cells or pixels available in this grid. From this description, it follows that higher resolutions in raster data, although they allow for more nuanced or detailed depictions of splaces, are also more computationally expensive.
2 The code used to create this figure was modified from Wikimedia Commons. The original code is available here https://commons.wikimedia.org/wiki/File:Raster_vector_tikz.svg.
Specifically, to illustrate this mechanism, let us go back to Fig. 3.1. As mentioned above, both grids in this figure cover the same space. However, the top grid is a 16 by 16 matrix, which contains a total of 256 cells (or pixels). The bottom grid doubles the number of rows and columns, rendering a 32 by 32 matrix with a total of 1,024 cells. Since 1,024/256 = 4, we can see that the bottom grid would accommodate four times as much information as the top grid, which may translate into more detailed renderings being accommodated in canvases with a higher number of pixels. Earlier, we mentioned that we can think of these grids as empty canvases and that each cell may be used as a pixel to represent the presence of objects (i.e., rivers, buildings, cities) in this matrix (or grid) and/or to capture changes in a given attribute’s (i.e., a place-based indicator’s) intensity. To illustrate the storage costs and processes used in raster data, let us use the 16 by 16 grid as an example. Assume there is a river crossing this space and, as depicted above, let us use each cell as a pixel to represent this spatial element by adding color to the cells wherein this river passes to record its presence. The number of cells covered in this representation would capture the thickness of this river, as we illustrate with Fig. 3.2. If we were interested in representing land area in that space, we could either add a layer to the raster depiction shown in Fig. 3.2, or fill the cells only with this land information. Our visualization decision in Fig. 3.3 was to only include the land information in the same planar space. In comparing Figs. 3.2 and 3.3, one would be prone to conclude that Fig. 3.2 is more computationally affordable for only a segment of the grid or matrix was used, compared to Fig. 3.3. Note, however, that in both depictions the size or dimensions of the matrix remained unchanged. That is, while the number of empty cells changed across figures, computationally speaking both matrices require the same computing power to process these raster data. From this view, even though in real social science SSEM applications land features are not included in the modeling process (we instead rely on distances among objects), the size of the matrix with or without this indicator remains unchanged.
Fig. 3.2 River raster representation
Fig. 3.3 Land raster representation
This is the reason why raster data are more expensive to store: even when non-essential features and indicators are removed, the grid size remains unchanged. As discussed in Chap. 2 (see section “What Might be the Best Choice?”), matrix storage may become very, or even too, expensive for our analytic procedures, which may prevent us from relying on raster data even more often than when using spatial information stored in vector format. With this understanding in mind, let us showcase how we can add other spatial features (i.e., representations of towns or houses) to the same grid. Similar to vector data, each cell of this grid may capture the presence of a feature (i.e., an object) in this space. Moreover, each pixel representing a spatial feature may be given color variation or intensity to further indicate changes in an attribute recorded in that spatial feature. This process also follows the rationale of adding attributes to polygons (ZCTAs) based on certain attributes or place-based indicators, such as college access rates, as discussed in Chap. 2. Specifically, to represent the presence of attributes, let us assume that, in addition to the river depicted above, we have detected three single houses and four settlements—or towns. Moreover, note that since the presence of grass or land is typically not part of the modeling strategy, we are not presenting this information in Fig. 3.4. Figure 3.4 shows three non-overlapping layers of spatial features. This representation indicates that each feature requires its own grid and that in each grid, only the pixels or cells with actual spatial information are non-missing. For example, the river raster representation occupies 25 cells, the town representation occupies 26 cells, and the houses representation occupies three pixels. From a data storage perspective, then, we can conclude that the least efficient data storage option for this grid space is the representation of the houses, given that only three cells are occupied in this 256-cell matrix, which means that this houses representation contains 253 “missing values.” Each of these layers may also be represented in a single grid, rather than with three rasters. This single grid representation is shown in Fig. 3.5. In this case, the grid is occupying 54 of the 256 cells, which means that there are still 202 cells with missing cases in Fig. 3.5.
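To make this storage contrast concrete, the short sketch below builds a 16 by 16 grid in base R, places three "houses" in it, and then recovers the same information as a three-row table of coordinates. This is a conceptual illustration only; the cell positions are made up and this is not how the figures in this chapter were produced.

```r
# Conceptual illustration of raster versus point-like storage (base R only)
grid <- matrix(NA_real_, nrow = 16, ncol = 16)   # empty 256-cell canvas
grid[3, 4]  <- 1                                 # three "houses" occupy three pixels
grid[9, 12] <- 1
grid[14, 6] <- 1

length(grid)          # 256 cells are stored either way, mostly empty
sum(!is.na(grid))     # only 3 cells carry information

# The same information stored compactly: one row per recorded unit
houses <- as.data.frame(which(!is.na(grid), arr.ind = TRUE))
names(houses) <- c("row", "col")
houses                # 3 rows, mirroring the vector-data storage rationale
```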
Fig. 3.4 Example of raster data representation not showing the land feature: river in blue, single houses in red, and settlements in magenta
Fig. 3.5 Example of three layers of raster data representing: a river (blue), three single houses (red), and four settlements (magenta)
For comparison purposes, note that this layered depiction is not unique to raster data. Indeed, most of our map representations shown in Chap. 2 (and the visualizations to be shown in Chap. 9) followed the same rationale depicted in Fig. 3.5: we needed a shapefile of the area of interest, which in Chap. 2 was Philadelphia County, and then we continued adding geolocated features, like ZCTAs and census blocks. In addition, the circle capturing University City, as shown in Fig. 2.7, was another example of adding other spatial attributes to this map, following the layer rationale represented in Fig. 3.5. Before discussing the vector data format, let us note, once more, that the pixel sizes of all the representations shown in Figs. 3.4 and 3.5 play an important role in capturing contours with higher or lower levels of detail. This is similar to using markers with an ultra-fine point versus using markers with a chisel tip to draw contours. The ultra-fine point marker may render more precise details than the chisel tip. Comparatively, in the raster data representation, the size of each cell plays a fundamental role in the level of fuzziness or pixelation of the resulting drawing or image. Smaller cell sizes (like ultra-fine tips) may capture more nuanced characteristics of the features represented in the resulting image than larger cells or pixels. Nonetheless, smaller cell sizes will only allow us to present a more detailed depiction of spatial features if the measurement matches the smaller pixel grid. That is, going back to Fig. 3.1, we know that for each cell of the top figure there are 4 cells in the bottom one (256 cells versus 1,024 cells). Knowing this, we could gain more precision with the 1,024 grid only if the measurement of spatial features matches this cell grid. In other words, if we have data measured using the 256 grid and aim to represent these data using the 1,024 grid with the expectation of gaining detail in the resulting analyses and visualizations, this data transformation will render no detail gain. Instead, the same representation will occur in terms of the space being occupied, with the exception that each of the cells represented in Fig. 3.4 will simply be divided into four smaller pixels, but no contour smoothing will occur. We can only gain more detail and smoother contours if the data collected match the more nuanced depiction resulting from having smaller cells in the raster or grid to store more spatial information. Our following section discusses vector data, highlighting how the same features represented in Fig. 3.5 are handled by the vector data format. Subsequently, our raster and vector comparison will showcase how place-based attributes, such as college access rates, shown in Fig. 2.7, may be added to both raster and vector data, building on our Philadelphia college access example discussed in Chap. 2. Vector Data Spatial vector data represent spatial units as geometries in a plane. As discussed earlier, these geometries take three main forms: points, lines, and polygons.
Notably, in terms of data storage, each of these geometries stored in a database has a unique identifier (i.e., GEOID as discussed in Chap. 2) that is linked to latitude and longitude coordinates (and/or a coordinate reference system) that allow us to identify them on Earth’s surface (Lovelace et al., 2019). From a practical point of view, only recorded geometries are stored in a spatial vector database. Accordingly, unrecorded units in a given space do not (need to) appear in the spatial dataset. Although this rationale may sound irrelevant, it translates into more computationally efficient data storage requirements compared to raster data format storage procedures. To illustrate this vector storage rationale and to compare it with the raster data storage approach, let us keep building from the example presented in the raster data section. So far, we have shown a 256-pixel grid that included a river, three houses, and four towns. The vector data representation of these features captured as geometries includes: lines to represent the river, points to represent the houses, and polygons to represent the towns. Point Geometries Of these three geometries, spatial points are the simplest form of vector data for they only include the intersection of one longitude and one latitude coordinate. We represent these units in the top subfigure of Fig. 3.6. For consistency with our raster representation, we preserved the same space (area) represented, and the three houses are depicted with the three points or dots in this same space. However, different from the grid representation that consisted of a 256-cell matrix with only three cells not having missing cases, the vector data storage format will consist of a database configured by only three rows. The number of columns of this spatial vector database will depend on the attributes of each of these points that may be added for SSEM purposes. In other words, the raster data requires a 16 by 16 matrix, while the vector representation of the same spatial units requires a 3 by n database, with n being the number of columns indicating attributes of the rows (i.e., GEOIDs). You may notice that this is the same storage rationale used by most administrative databases or spreadsheets. Line Geometries The next geometry represented in Fig. 3.6 is a line, or a sequence of connected points, which in this case is representing a river. Although our depiction in this figure has a width, typically a line geometry per se does not have this attribute. Instead, we add this place-based value to indicate width, depth, and/or contamination levels, for example. In terms of storage, this line alone would only account for one row in its corresponding vector spatial dataset. Similar to the point representation, this row will contain a column storing the collection of point coordinates that form the line and are needed to plot this information in map form. Once more, with the vector storage format we moved from requiring a 256-cell matrix to a database containing only one row. Also, as in the case of the point database, in line storage files the columns will contain other line attributes.
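A minimal sketch, assuming the sf package, of how the three houses and the river just described could be stored as point and line geometries; the coordinates below are made up for illustration.

```r
# Points and a line stored as sf vector objects (illustrative, made-up coordinates)
library(sf)

# Three houses: a 3-row table with one point geometry per row
houses <- st_as_sf(
  data.frame(id  = 1:3,
             lon = c(-75.20, -75.19, -75.18),
             lat = c( 39.95,  39.96,  39.94)),
  coords = c("lon", "lat"), crs = 4326
)

# One river: a single LINESTRING built from a sequence of connected points
river <- st_sf(
  name     = "river",
  geometry = st_sfc(st_linestring(cbind(c(-75.22, -75.20, -75.17),
                                        c( 39.93,  39.95,  39.97))),
                    crs = 4326)
)

nrow(houses)  # 3 rows, one per point
nrow(river)   # 1 row for the whole line
```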
Fig. 3.6 Example of three layers of vector data representing: a river (blue), three single houses (red), and four settlements (magenta)
In Chap. 5, we will discuss the relevance of merging line geometries (specifically roads) with point geometries to measure traveling distances. Accordingly, line attributes may be relevant to subset line geometry files (i.e., main roads, freeways) in our modeling strategies involving the measurement of distances. As part of this subsetting process, we may want to retain only the roads in one county, for example, and we discuss how to conduct these analyses in our applied sections and examples. Polygon Geometries Finally, the last spatial geometry type we showcase in Fig. 3.6 is the polygon. We are already familiar with these geometries for we have presented them at different levels in Chap. 2. Specifically, we represented them as states, counties, ZCTAs, census tracts, and block groups. In this case, we are representing them as towns in this example. Formally, polygons are a collection of lines that form a closed ring.
Considering that lines are configured by points, polygons are then configured by a collection of connected spatial points where the first and last longitude and latitude coordinates are the same—hence their closed ring form (Lovelace et al., 2019). Polygons, since they are configured by points and lines, are the most expensive data format for they require all these coordinates to be associated with each GEOID. However, once more, if an area only has a couple of polygons, our corresponding databases will only have that corresponding number of rows and the corresponding geometrical attributes (i.e., the collection of latitude and longitude coordinates configuring the points that form these polygons). That is, polygon vector data will require an m by n matrix, with m representing the number of polygons surveyed and n the number of columns or attributes associated with those polygons. Geometries as Layers Vector data can also be added as layers to a map. To achieve this, we need to have all the vector objects as standalone and separate databases. Moreover, these databases need to be harmonized with respect to their spatial features. This is typically referred to as being in the same coordinate reference system (CRS). Note that we discuss these CRSs in section “Coordinate Reference Systems” below in this chapter. For now, it suffices to know that for us to add layers to a map, all the geometries involved need to be measured in the same spatial unit (i.e., degrees or meters). In this respect, even if different vector datasets were measured in different CRSs, unit or CRS modification/transformation is a straightforward process, and CRS harmonization may represent no problem, as we will demonstrate in our applied sections. To illustrate the elements and steps required to add layers using vector data, let us assume that we are interested in plotting points and lines in ZCTAs. Specifically, let us plot the main roads that connect a neighborhood with other counties in Pennsylvania and the location of two universities located in such a neighborhood. This task requires us to identify:
1. the ZCTAs that configure the neighborhood of interest, which in this case is called University City and is located in Philadelphia County,
2. the two most important main roads that reach/connect this neighborhood to other places located outside this county, and
3. the location of the two four-year institutions located in such a neighborhood.
This task requires the following elements: the polygon shapefile (or spatial database) for the county,3 the polygon ZCTAs (19104, 19139, and 19143) that configure the University City neighborhood in Philadelphia County (the second spatial database), the two lines representing the main roads (Interstate 76 and Interstate 676) passing through or connecting this neighborhood with other counties (the third spatial database), and the two points of interest (i.e., universities in this case), the University of Pennsylvania and Drexel University (the fourth spatial database).
3 Although this geometry is optional, we consider it important given our goal to showcase how to plot a subset of the spatial elements in a vector data file.
Fig. 3.7 Depiction of University City’s ZCTAs (polygons), two main roads (lines), and Universities (points), all located in Philadelphia County
Although this discussion indicated that to present Fig. 3.7 we need four spatial databases, in reality the computing power required to achieve this visualization is minimal. The two polygon vector databases have four rows in total: one for the county shapefile, and three for the shapefile containing the ZCTAs. The line shapefile only has two rows, one for each interstate road. Finally, the points vector database also requires only two rows. In reality, the main challenge with the use of vector data consists of keeping track of these elements. In this respect, note that the minimal code functions we have prepared automate this process—see Chap. 9. Relatedly, the plotting order matters. The main layer should be the largest area where all the subsequent layers will be added. In this case, the largest area (or highest level, as described earlier) represents the entire county. Then we can add the second set of polygons (ZCTAs), the lines, and finally the points. Having showcased how these layers may be added, we consider it relevant to build from our previous representation of college access discussed in Chap. 2 to more clearly illustrate differences in the storage efficiency of vector and raster data, along with transformations from raster to vector data and vice versa, as we present next.
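The layering and plotting order just described can be sketched with the sf and ggplot2 packages as follows. The file names are hypothetical placeholders for the four spatial databases, and this is an illustration of the logic rather than the book's minimal-code functions introduced in Chap. 9.

```r
# Layered map sketch: county polygon, ZCTA polygons, roads (lines), campuses (points)
# File names below are hypothetical placeholders for the four spatial databases.
library(sf)
library(ggplot2)

county <- st_read("philadelphia_county.shp")    # largest area: plotted first
zctas  <- st_read("university_city_zctas.shp")  # ZCTAs 19104, 19139, 19143
roads  <- st_read("interstates_76_676.shp")     # two line features
campus <- st_read("penn_drexel_points.shp")     # two point features

ggplot() +
  geom_sf(data = county, fill = "grey95") +                 # base layer
  geom_sf(data = zctas,  fill = "lightblue", alpha = 0.6) + # polygons on top of the county
  geom_sf(data = roads,  linewidth = 0.8) +                 # then lines
  geom_sf(data = campus, size = 2) +                        # points last
  theme_minimal()
```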
Vector to Raster Transformations and Vice Versa We begin this section with an illustration of vector to raster data transformations using real data. Subsequently, we illustrate raster to vector transformations, focusing on raster to points and raster to polygons. Vector to Raster Data Transformations To further convey differences in data storage using a real example, let us discuss Fig. 3.10. To compile this figure, we build from our previous example of ZCTA college access in Philadelphia County (see Fig. 2.7). This example only relies on women’s ACS college access estimates (Fig. 3.8). As we may remember, the spatial data input required to produce Fig. 2.7 were shapefiles, or vector spatial data as we have described in this chapter. Moreover, the place-based data were obtained from the ACS and joined or merged with the ZCTAs shapefile to build this splace database. Figure 3.10b is the raster representation of these same data sources. However, to our knowledge there are no publicly available ZCTAs in raster form. Accordingly, we had to transform our vector data sources and the college access values (or attributes of each ZCTA) to a raster format. We were able to conduct this transformation using the stars R package. The data storage result of this transformation is represented in Fig. 3.10c. This subfigure also includes the minimal description of the splace file used to produce Fig. 3.10a. Let us discuss key points of convergence and divergence of these data formats. First of all, recall that the raster version is derived from the vector source; accordingly, there have to be points in common in these datasets. Let us start by noting the bounding box shown in the top rectangle of Fig. 3.10c. This box has four pieces of information: xmin and ymin are longitude and latitude values (forming an intersecting point) for the bottom left part of the space to be plotted. Following the same rationale, xmax and ymax are longitude and latitude values (forming another intersecting point) for the top right part of the same space to be plotted. We can think of this bounding box as selecting an image that we want to crop, starting from the bottom left and dragging while clicking to the upmost right side. We represent this selection process in Fig. 3.9. The shaded area in Fig. 3.9 represents the left-out area (the non-selected geographical zone). Conversely, the non-shaded area represents our attempt to, as closely as possible, delimit the spatial selection to only include Philadelphia County, being careful not to leave any of its corresponding ZCTAs or other spatial units out of this selection. However, given that Philadelphia County is not a perfect rectangle or square, our process will be imperfect, and we will likely include in our selection units that are not part of this county. For our illustrative purposes, however, this representation suffices to convey the rationale guiding bounding boxes.
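Before examining the output in detail, the sketch below shows the general pattern of this transformation with the stars package mentioned above. The object and column names (the PA_ZCTAs splace file and its women's college access attribute) are hypothetical stand-ins, and this is a sketch of the approach rather than the exact code used to produce Fig. 3.10.

```r
# Vector (sf) to raster (stars) sketch; object, file, and column names are placeholders
library(sf)
library(stars)

PA_ZCTAs <- st_read("philly_zctas_college_access.shp")  # hypothetical splace file (48 ZCTAs)

# Rasterize a single attribute (e.g., women's college access estimates) onto a grid
access_raster <- st_rasterize(PA_ZCTAs["women_college_access"])

access_raster          # prints dimensions, bounding box, and CRS, as in Fig. 3.10c
plot(access_raster)    # raster rendering of the same splace information
```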
Fig. 3.8 Raster side by side comparison
Returning to the xmin and ymin points, their coordinate values in Fig. 3.9 should be quite close to the more precise xmin and ymin values shown in Fig. 3.10c. The latter are more precise because we did not have to manually set this bounding box. Instead, R packages and shapefiles have specialized algorithms to select the best possible (or most optimal) bottom and upper bounding points. In terms of how this raster representation stores this bounding box, the rationale is similar to the vector format.
Fig. 3.9 Representation of bounding source Google Maps
the bottom left longitude (xmin) and the upper right latitude coordinate (ymax). This information is enough to build the matrix or grid we depicted above. Another point that is the same across depictions is the coordinate reference system (CRS) that we will discuss below. For now, simply note that the CRS information in the vector representation (under PA_ZCTAs) indicates that the CRS is “Geodetic.” As explained below this geodetic denomination implies that this system is representing these areas following Earth’s curvature and is using degrees to represent spatial features. As we will further elaborate later in this chapter, CRS may also represent spatial units in a planar space, which may pose challenges associated with representing a sphere in planar form.4 Note also that the raster representation follows the same CRS, since we did not apply and transformation to this coordinate system. Although we will elaborate further on
4 See also https://cran.r-project.org/web/packages/eRTG3D/vignettes/v6.html.
Fig. 3.10 Representation of raster to point transformation
this information, for now note that raster and vector data may be represented relying on Geodetic CRSs. As discussed above, the most salient difference between raster and vector data storage and visualizations is the amount of information required to render the same plots. In terms of data storage, as discussed in the raster data section, we indicated that this grid accounts for a matrix and the dimensions of this matrix may be computed from the information contained in Fig. 3.10c. Specifically note that in the “rasterv” section of Fig. 3.10c there are x and y values and two columns called “from” and “to”. The values associated with “to” in this case are 279 and 233 for x and y, respectively. These values indicate the dimensions of the grid or matrix and can therefore be used to estimate that this grid has 65,007 cells. In short, this means that the data input required to plot Fig. 3.10b comes from a matrix containing over 65,000 cells. As discussed above, the vector representation required to plot Fig. 3.10a is a database with 48 rows. Given that Fig. 3.10b was built from data transformed from Fig. 3.10a, the two images show congruent results. However, as shown in Fig. 3.10c, also under the “rasterv” section, the grid or matrix used to plot Fig. 3.10b, contains a total of 36,886 cells with missing cases. In the case of Fig. 3.10a, the splace database has a total two missing ACS estimates. Conceptually, however, note that of the 36,886 missing cases represented in the raster data, the vast majority are just empty cells rather than actual ACS missing cases. There are visualization implications associated with this vector to raster transformation. As shown in Fig. 3.10a the magenta color indicates that there are two ZCTAs with missing cases (one ZCTA is quite small, but you can zoom in to observe it). In the case of Fig. 3.10b the missing cells configuring these ZCTAs resulting from the vector transformation became part of the 36,886 cells with missing
cases. The implication is that, by only observing Fig. 3.10b, we could mistakenly conclude that there are no ZCTAs with missing estimates, for the larger ZCTA in the bottom right corner of Fig. 3.10a is not even represented in Fig. 3.10b. More to the point, if we wanted to represent the missing cases in the raster representation, as we did in Fig. 3.10a, then Fig. 3.10b would appear to have a magenta background, for all white areas in this figure are actually missing cases in the grid. Indeed, as depicted above, the raster grid has 65,007 cells and, since 36,886 of them are missing cases, we can conclude that only 28,121 cells in this matrix have non-missing values (the result of 65,007 − 36,886). Notably, during this transformation we lost track of which cells in this matrix correspond to actual missing cases. Before moving to the next set of transformations, let us note that, although our splace (vector) file has 13 fields (or columns, as shown in Fig. 3.10c under "PA_ZCTAs"), the raster representation only shows one attribute (the equivalent of a column). Although other raster objects that were created as rasters, rather than transformed from vectors to rasters (i.e., rasterized), can accommodate more attributes, all our vector to raster transformations were only able to render one attribute at a time, which considerably limits the types of analyses that we can conduct with this data format.5
Moving From Raster to Vector Data
As we mentioned above, an important skill related to SSEM consists of moving from raster to vector data. This is particularly important in instances where the data for our research project are only available in raster format. Accordingly, this section illustrates the rationale followed to move from raster data to point and polygon geometries while highlighting similarities and differences in these transformations.
From Raster to Points
To illustrate these data transformations, let us continue building from the ZCTA information we have just discussed. Recall that our original Philadelphia County shapefile has 48 ZCTAs and that, when we added information, there were 2 units with missing values. When we transformed this vector data to a raster format, the number of "missing cases in the grid" was 36,886, for these cells have no information; this is a function of the bounding box. Moreover, as discussed in the previous section, the size of the grid represented in Fig. 3.10b is 233 rows by 279 columns, which translates into 65,007 cells, with 28,121 of them having valid values. This brief refresher serves to illustrate the rationale followed by the raster to points conversion: all valid cells in the grid will be transformed into a point geometry, which indicates that the resulting splace vector file will be configured by 28,121 points (see the code sketch below).
5 The stars R package is supposed to now accommodate the inclusion of more than one attribute when rasterizing, but as of package version 0.4.3 this process is not working properly.
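To make this bookkeeping concrete, the following is a minimal sketch of the raster-to-points step just described. It assumes the raster package (terra offers equivalent helpers); the file name is hypothetical, and the counts in the comments refer to our running Philadelphia example.

```r
library(raster)   # assumption: terra provides equivalent helpers

# Hypothetical rasterized ZCTA layer (one attribute, as discussed above)
zcta_rast <- raster("philly_zcta_college.tif")

ncell(zcta_rast)               # 233 rows x 279 columns = 65,007 cells
sum(is.na(values(zcta_rast)))  # empty cells in the grid (36,886 in our example)

# Every valid (non-NA) cell becomes one point carrying that cell's value
zcta_points <- rasterToPoints(zcta_rast, spatial = TRUE)
nrow(zcta_points)              # 28,121 points, matching the valid cells
```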
Note that this transformation is made for illustrative purposes, for we should always respect the original data format when conducting these transformations. That is, if our original raster grid was in point format, as we showcased above with the three houses depiction in Fig. 3.4, and each point was represented with one cell, the process of moving from raster to points will recognize that all empty cells should be discarded, and the resulting point vector dataset will contain only those three points. The correct process to move from raster to vector, then, consists of respecting the shape that the raster aimed to capture or represent.
From Raster to Polygons
Once again, note that our original file was a splace vector polygon that was transformed to its raster version. In principle, then, we should be able to retrieve the original dimensions of the input data. In this case, we were dealing with 48 polygons, two of which had missing values. Note, however, that we have just learned that the raster data has 28,121 valid cells. Accordingly, the raster to polygon transformation will yield 28,121 polygons, unless we adjust the transformation parameters as follows. From all our previous depictions of these ZCTAs and our discussion of spatial unit levels, we know that within each ZCTA there is one value associated with the splace-based indicator of interest. This implies that in the rasterization process all the cells configuring a given ZCTA will be assigned the same value (more on this below). Note, however, that, similar to our bounding box selection discussed above, the decomposition of each ZCTA into a collection of cells will not be perfectly aligned, unless the ZCTA itself is a perfect square or rectangle measured in the exact same units as the cells.6 Despite distortions at the borders of polygons (neighboring ZCTAs in our case), most of the content of each ZCTA will be preserved with our rasterization approach (assuming that cells are small enough to avoid gross pixelations or distortions, as discussed with our ultra-fine marker tip example earlier in this chapter; see section "Vector Data"). This preservation of within-ZCTA uniformity of values is important in reconstructing the original polygon geometries from grids. To illustrate this process, let us look at Fig. 3.11. The top subfigure represents four towns in vector/polygon form. The subfigure in the middle represents the rasterization process we have discussed, wherein, albeit imperfectly, we aimed to capture the location of those towns in this space using the cells they occupy. As can be seen with the magenta color, all these gridded representations are still divided internally by the cell borders; however, in our attempt to retrieve the original values, we can melt those borders internally following this condition (and assumption):
6 Assume we have a square of 10 by 10 centimeters and each cell is 1 by 1 cm; this would perfectly match the square. However, if our square is 12.33 by 12.33 centimeters, we will not be able to perfectly match this polygon using 1 by 1 centimeter cells. In this case, approximations and distortions have to occur.
Fig. 3.11 Representation of raster to point transformation
if two adjacent cells have the same value, they should be merged, for they may represent the same polygon. In other words, the assumption here is that polygons sharing a border that also share the same value should belong to the same spatial polygon. The result of this process is shown in the bottom subfigure of Fig. 3.11, which, although it resulted in some loss of precision in the contours or borders of the original vector representation, captures the original geometries to a great extent. Returning to our example with real data, the result of this raster to vector transformation, wherein we melted these cells within assumed polygons as a function of having the same values, is shown in Fig. 3.12. Note that after melting all these 28,121 polygons (i.e., cells) we were able to reconstruct the 46 polygons with valid data, hence matching our original polygon count (again, considering only valid ZCTA data) shown in Fig. 2.7. As discussed above, since the vector
Fig. 3.12 Representation of raster to polygon transformation
to raster transformation resulted in the loss of the two ZCTAs with missing cases, we were not able to retrieve these polygons with this raster to vector transformation. From this view, the takeaway is that these transformations, although imperfect, perform well in retrieving the shapes that the raster aimed to capture. However, in terms of distortions, there are some important considerations. Both Figs. 3.10 and 3.12 have a legend that is consistent across transformations and matches the original range shown in Fig. 2.7. However, when computing means of this indicator, the disturbances and imprecisions at the borders of each ZCTA and the size of each polygon impacted the mean and standard deviation of these raster to vector transformations. Specifically, in both the raster to point and raster to polygon representations, the mean estimates were lower than the estimates obtained with the original untransformed data. So far there is no available fix to this issue, but recall once more that these transformations are only recommended when no vector data are available. Despite these concerns, observing the resulting raster to vector transformation shown in Fig. 3.12, we can conclude that this melting process works quite well in drawing the respective borders and recreating ZCTA approximations. From this discussion, it is worth noting that only when our raster file has attributes can we apply this melting process. That is, if we only had the delineation of the cells representing ZCTAs, without the college indicator attached to them, then the raster to vector transformation would represent the county contour rather than the ZCTAs we retrieved in Fig. 3.12. This concludes our discussion of these transformations. In the following section we discuss coordinate reference systems (CRS) and their implications and usefulness for SSEM and data visualization.
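Before turning to CRSs, here is a minimal sketch of the melting (dissolve) step just described, again assuming the raster and sf packages (older raster versions may delegate the dissolve to additional packages such as rgeos). The file name is hypothetical and the count refers to our Philadelphia ZCTA example.

```r
library(raster)
library(sf)

zcta_rast <- raster("philly_zcta_college.tif")   # hypothetical rasterized ZCTA layer

# Merge ("melt") cells that share the same value into a single polygon
zcta_poly <- rasterToPolygons(zcta_rast, na.rm = TRUE, dissolve = TRUE)
zcta_sf   <- st_as_sf(zcta_poly)                 # back to a vector (sf) object

nrow(zcta_sf)   # the ZCTAs with valid data are recovered (46 in our example)
```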
Coordinate Reference Systems
Throughout our discussion, we have consistently referred to latitude and longitude coordinates as the most fundamental pieces of information that enable us to locate geometries (points, lines, polygons) and rasterized spatial data on Earth's surface. These coordinates form part of coordinate reference systems (CRSs) that define how projections may relate to actual places on Earth's surface (Sutton et al., 2009). From this perspective, when our visualizations rely on a sole source of geocoded data, the CRS is less consequential than when we rely on two or more layers or types of geocoded information. When multiple shapefiles are employed, their CRSs need to be harmonized to ensure that the added layers of information render depictions that are as accurate as possible. Moreover, when distance computations are conducted, this CRS harmonization is required to avoid any source of variation that is due to differences in CRS projections rather than to actual distances among units. The following presentation describes the elements configuring CRSs and illustrates the challenges associated with representing an ellipsoid (i.e., three-dimensional) object7 on a flat plane, like a piece of paper or a computer screen, along with the implications of these projections or representations for SSEM.
Elements of CRS
As a system, all the elements of a coordinate reference system (CRS) are interrelated. According to Holdgraf and Wasser (2022) these elements are (a) a grid of coordinates, in quadrant form or a Cartesian coordinate system, (b) a collection of vertical and horizontal units, derived from or linked to that grid, (c) a datum indicating the origin of the grid, and (d) the projection information indicating the mathematical equations or the rationale for flattening Earth onto planes. Accordingly, from this description of the CRS components, the first element is a Y and X grid representing the latitude (Y axis, north to south pole) and longitude (X axis, east to west) coordinates of the Earth. This grid is represented in Fig. 3.13, wherein, as discussed above, we can see that the intersection of X and Y values forms a point. If we consider these latitude and longitude coordinates individually, their ranges vary. Longitude coordinates range from −180° to 180° and latitude coordinates range from −90° to 90°. The zero longitude, or 0 degrees longitude, is referred to as the Prime Meridian. From the Prime Meridian to the left, up to −180°, we move west; hence all the longitude values in the Americas are negative. Similarly, from the Prime Meridian to the right, up to +180°, we move east. The latitude coordinates are used to divide Earth into the northern and southern hemispheres. A latitude value of 0 degrees configures the Equator. Positive latitude
7 According to the National Ocean Service, Earth is an ellipsoid (three-dimensional) rather than a smooth sphere (two-dimensional), which further complicates its representation in, or projection to, a planar, two-dimensional space.
Fig. 3.13 Representation of Y and X grid of latitudes and longitudes
coordinate values help us identify the northern hemisphere (ranging from +1 to +90) and negative latitude coordinate values help us locate the southern hemisphere. From this view, all countries in North America will have positive latitude coordinates and negative longitude values. The datum is defined as the intersection of zero longitude and zero latitude and indicates the origin of this system. Following the latitude and longitude values just described, the zero degrees longitude and zero degrees latitude intersection falls in the Atlantic Ocean. Its main function is to serve as a global reference point. However, this value (0 latitude and 0 longitude) may also indicate that the geocoding process failed to produce a valid location.8 More to the point, when this 0 latitude and 0 longitude origin point is added as an input to calculate traveling distances, for example, the connection to the server will crash, indicating that, to "save resources," no null point (i.e., 0 latitude and 0 longitude) may be included in any traveling distance calculation.9 Our representations of points, such as those shown in Fig. 3.13, and those reflecting the location of the University of Pennsylvania and Temple University in Fig. 3.7, are straightforward, as we are used to observing these types of representations in any map search (e.g., Google Maps, MapQuest). Nonetheless, it is important to consider that such representations (of points, lines, and polygons)
8 See https://www.geographyrealm.com/zero-degrees-latitude-and-zero-degrees-longitude/.
9 See the OSRM package for more details.
Fig. 3.14 Representation of geodesic and planar CRSs. Code adapted from https://texample.net/tikz/examples/spherical-and-cartesian-grids/
are the product of a projection from a three-dimensional ellipsoid to a plane (or from the center of Earth to its surface), rather than the direct identification of those points on Earth’s surface. To illustrate these projections, let us observe Fig. 3.14, which represents Earth’s latitudes and longitudes, and includes our previously discussed grid and planar representations. This representation builds from the typical Cartesian coordinate system associated with Earth. In this representation, the origin of each point can be traced to the center of the Earth. This point is identified with the magenta line in Fig. 3.14. As before, this point is the result of intersecting a latitude and a longitude. Given that the origin is located at the center of the Earth, each point needs to be projected to the surface of Earth to be useful for visualization and/or navigation purposes. These projections are referred to as spherical (geodesic, geodetic, or unprojected) or planar. Recall that in our vector to raster transformation (see Fig. 3.10c) we observed that both data formats were measured in degrees and both CRSs referred to the
NAD83, which stands for the North American Datum of 1983 used by the United States Census Bureau. Additionally, in the PA_ZCTAs section of Fig. 3.10c the CRS was called "Geodetic," which indicates that this system was designed to capture the ellipsoid form of the Earth, making it an "unprojected" projection from the center of the Earth to its surface. We mention this because both the vector and raster data formats may represent spatial data as geodetic (unprojected) or planar (projected) representations of the Earth's surface. Having said this, for illustration purposes, let us continue building from our previous grid and vector discussion. In this respect, Fig. 3.14 shows the intersection of latitude and longitude coordinates that may be captured both in a spherical (i.e., geodesic or geodetic) form, in red, and/or in its planar or projected form, which we represented with a rectangle with black borders. What may seem confusing is that, although both the grid and planar representations are projections of the point, which comes from the center of the Earth, in GIS the spherical or geodetic representation is referred to as unprojected, for it is not "forced" onto a plane. This unprojected representation is measured in degrees, with the −90 to +90 latitude and −180 to +180 longitude ranges, and, as mentioned above, may be represented in raster or vector data (although our figure represents it as a grid for illustrative purposes). When we refer to projected CRSs, the units are measured in either meters or feet and, once more, both vector and raster formats may be projected. Because degree representations are not evenly spaced due to the unevenness of Earth's surface, the use of unprojected CRSs poses challenges to measuring distances. This is why services like Google Maps rely on projected or planar CRSs. Specifically, Google Maps relies on a Mercator projection of the World Geodetic System of 1984, with European Petroleum Survey Group (EPSG) code 3857. Its navigation purpose makes it useful for relatively small areas, as opposed to global representations of Earth's surface. In sum, the unprojected CRS is conventionally referred to as geodesic and is measured in degrees. Its distance measurements follow Earth's curvature but may be affected by the uneven distribution of the meridians, which may impact distances in some areas. The projected CRSs are called planar projections, wherein Earth's surface is mathematically flattened and distances are measured in meters or feet following straight lines on such a flat surface.10 Moreover, although only the planar CRS is referred to as projected, in reality both the planar and the geodesic CRSs are based on mathematical projections to the Earth's surface. Despite these efforts to preserve accuracy, note that, although the sphere shown in Fig. 3.14 is smooth, in reality Earth not only is not a perfect sphere, but its surface is also constantly (albeit quite slowly) changing. Both factors pose challenges in terms of accuracy. Accordingly, in addition to the inherent mathematical challenges of projecting the origin of spatial information to a geodesic grid or to a plane, the unevenness of Earth and
10 See also https://guides.library.duke.edu/r-geospatial/CRS for a summary of the CRSs available in the R Project.
continuous surface changes imply that "every flat map misrepresents the surface of the Earth in some way" (U.S. Department of the Interior, U.S. Geological Survey, 2022, p. 1).
Implications of Distortions Resulting from Map Projections
These distortions are much more problematic and/or challenging for mapmakers (U.S. Department of the Interior, U.S. Geological Survey, 2022) than for spatial analysts using SSEM. The reason is that mapmakers strive to yield maps with true directions, true distances, true areas, and true shapes (U.S. Department of the Interior, U.S. Geological Survey, 2022). Although one or several of these goals may be incorporated into a map, no map can meet all four conditions given the challenges of representing Earth's surface on a plane (U.S. Department of the Interior, U.S. Geological Survey, 2022), its unevenness, and the constant change of its surface, just mentioned. In terms of the implications of these inherent distortions of CRSs for spatial modeling and analysis, we may argue that SSEM is primarily concerned with distances; that is, SSEM is mostly concerned with only one of the four challenges that mapmakers face. This implies that in SSEM our goal is focused on representing true distances, or distances as close to true as possible, in order to measure influence, spillovers, or spatial dependence, as discussed in subsequent chapters of this book. Having said this, note that visualizations are quite relevant for SSEM, particularly when demonstrating the result of our identification strategies and model operationalizations. Accordingly, we will discuss strategies to craft interactive HTML maps with the understanding that these resulting maps and visualizations are shown for descriptive purposes, rather than for navigation (i.e., driving directions) or mapmaking purposes (i.e., geographical information systems, infomaps, etc.). If the SSEM maps are to be used for reference purposes (once colleges or other places of interest have been geolocated and represented in visualizations), readers may need to complement their navigation information with maps designed for navigation purposes, such as those readily available, like Google Maps. To summarize this brief discussion, note that for our SSEM requirements we are not as affected by projection distortions as mapmakers are. This implies that, although our visualization exercises will automate the harmonization of the CRSs used, our resulting products are, for the most part, minimally affected by the choice of a given CRS.
Why is CRS Harmonization Important?
When a single shapefile is assigned a CRS, all configuring units in that specific area are affected by this choice. This follows the same rationale as when using a scalar or even an equation to transform units in a column, like moving from meters to miles, for example. This implies that, within a single shapefile, all configuring spatial units are measured on the same scale. However, when relying on different shapefiles there is no guarantee that all CRSs are
Fig. 3.15 Representation of geodesic and planar CRSs of the continental United States
the same. For example, Figs. 2.6 and 2.7 included county, census tract, and ZCTA shapefiles as layers configuring parts of the resulting maps. These visualizations required that all of their configuring shapefiles had the same CRS. This is important because including different CRSs in map layers may result in either distorted or inaccurate visualizations, or in some features not being visualized due to being out of focus. This is why, in our minimal code function, we have ensured that when handling different shapefiles their CRSs are harmonized as part of the data preparation process.
Commonly Used Coordinate Reference Systems
As just discussed, for SSEM, georeferenced data may be collected with different CRSs. In these cases, it is important to transform them to a common CRS before mapping and analyzing them. This is the same rationale followed when we have two input datasets and one of them uses miles while the other uses kilometers as its unit of measurement. As discussed in Fig. 3.14, there are two main types of CRSs: geodesic and planar. In terms of visual differences, trained eyes may be able to more easily observe disparities. For example, Fig. 3.15 shows the geodesic or unprojected World Geodetic System of 1984 (WGS84), with code number 4326, and the projected planar Pseudo-Mercator, Web Mercator, or Spherical Mercator that is used by Google Maps, with code 3857. In terms of data structure, the main difference between these shapefiles is the representation of the geometries. In the geodesic or unprojected CRS, the geometries are represented in degrees, whereas in the projected CRS they are represented in meters. Everything else in terms of data structure remains the same. Specifically, Fig. 3.16 shows these elements. Although both shapefiles are in vector format, recall that if we transform them to their raster form, the CRSs will be transferred as well.
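As a concrete illustration of checking and harmonizing CRSs with the sf (Simple Features for R) package, consider the following minimal sketch; the shapefile names are hypothetical, and the EPSG codes shown in the comments are examples rather than guaranteed values.

```r
library(sf)

zctas    <- st_read("philadelphia_zctas.shp")   # hypothetical file names
colleges <- st_read("philly_colleges.shp")

st_crs(zctas)$epsg      # e.g., 4326 (geodetic, degrees)
st_crs(colleges)$epsg   # e.g., 3857 (planar, meters)

# Harmonize before layering maps or computing distances
colleges_harmonized <- st_transform(colleges, st_crs(zctas))
st_crs(zctas) == st_crs(colleges_harmonized)    # TRUE once harmonized
```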
Fig. 3.16 Representation of data structures of geodesic and planar CRSs using the continental United States
In the package Simple Features for R (sf), CRS transformations are achieved with a single command, as represented at the top of each subfigure of Fig. 3.16. As can be seen, the main differences between the projected and unprojected representations are the bounding boxes, which are in meters and degrees, respectively. The same is true for the geometry attributes. As expected, each of these shapefiles displays its CRS, with the name "Geodetic CRS: WGS 84" in the unprojected and "Projected CRS: WGS 84/Pseudo-Mercator" in the planar representation. The WGS in both projections indicates that they share the same datum (point of origin on Earth), the Earth's center of mass, which explains the similarity in their respective visualizations as shown in Fig. 3.15. Despite the ample availability of CRSs, for our purposes, and particularly when dealing with distance measurements, we will rely on the CRS code 3857 discussed here. This code has been crafted for navigation, hence its Mercator classification (U.S. Department of the Interior, U.S. Geological Survey,
2022). As discussed above, however, the most important criterion is for us to consistently use the same CRS when dealing with multiple shapefiles, to avoid distortions and/or incompatibilities. This concludes our CRS discussion. The last section of this chapter discusses differential privacy algorithms, which, like vector and raster data formats and CRSs, may pose some challenges for modeling purposes by creating distortions in units' locations, as we will detail next.
Differential Privacy Framework (DPF) and Changes to Census Micro Data
When it comes to spatial analyses, the United States Census is charged with two conflicting tasks: (a) to provide an actual and complete enumeration of all people living in the United States every 10 years (United States Census Bureau, 1787, art. I, Sect. 1), while (b) protecting respondents' anonymity by keeping all their personally identifiable information confidential for 72 years (United States Census Bureau, 2022). Naturally, the more accurate the data collection, the higher the probability of identifying individual respondents and their families. This is particularly true with the current availability of both computing power and big data (Dwork et al., 2014), which makes it easier now than ever before to merge multiple data sources to identify individual respondents. This tension between accuracy and privacy has served to justify the need to generate and implement mathematical algorithms whose main goal is to preserve anonymity with minimal distortions. The differential privacy framework (Dwork et al., 2014; United States Census Bureau, 2021) forms part of the United States Census Bureau's confidentiality protection program, also referred to as "disclosure avoidance" (United States Census Bureau, 2020, 2021). Differential privacy was implemented due to the threat of database reconstruction, or the "process for inferring individual-level responses from tabular data" (Ruggles & Van Riper, 2022, pp. 781–782). Clearly, if this threat materializes, it will result in a failure to protect respondents' anonymity. From this view, the development and application of differential privacy algorithms to add noise to responses aims to prevent the threat of data reconstruction from becoming a reality. The cost, however, is the potential reduction of the usefulness of census data for social, economic, and health research, with potentially serious compromises of basic demographic understandings (Ruggles & Van Riper, 2022). Although the research community is increasingly concerned about how disclosure avoidance may impact the usefulness of census data, particularly for the 2020 census and the American Community Survey and PUMS microdata (more on these latter two sources below), it is worth noting that disclosure avoidance does not constitute a new practice (see Fig. 3.17 for a summary of this discussion). What is new is the reliance on sophisticated differential privacy algorithms and even on artificial intelligence to create synthetic data, or data digitally generated or simulated rather than collected (Chen et al., 2021). This
Fig. 3.17 Data protection mechanisms, changes, and challenges
reliance on computing power, simulations, and artificial intelligence represents an important departure from previous disclosure avoidance techniques that have been used for many decades. For example, starting with the Census of 1930, the Census Bureau stopped publishing certain tables for small geographic areas (United States Census Bureau, 2021), which constituted the first documented form of disclosure avoidance. More recent disclosure avoidance strategies, like those applied to the 1990, 2000, and 2010 decennial censuses, consisted of data swapping (United States Census Bureau, 2021). Data swapping consisted of assigning some attributes of one household to another household of the same size in the same census block (McKenna, 2019). This swapping process, while it did not alter the distribution of the responses, made identification based on certain household attributes more difficult to accomplish (United States Census Bureau, 2021). A more comprehensive historical description of decennial census disclosure avoidance methods can be found in McKenna (2019). This section focuses on the disclosure avoidance measures implemented in the 2020 decennial census, including synthetic data generation, given the potential implications of these strategies for SSEM moving forward.
Are Differential Privacy and Synthetic Data the Same Privacy Protection Strategy?
Before starting, it is important to note that differential privacy algorithms are different from synthetic data generation, and the former only apply to the post-collection data processing of decennial census information. Synthetic data generation, on the other hand, is being developed to replace public use microdata with a simulated version to fully protect responses at the individual and household levels. From this view, although synthetic data generation adheres to the disclosure avoidance framework, it does not fully fall within differential privacy algorithms.
Having noted that differential privacy and synthetic data generation constitute different strategies, they share the same goal of protecting confidentiality by masking "potentially identifiable" microdata. Although the two strategies follow divergent methodological paths, both have faced backlash from the research community. Indeed, shortly after the United States Census Bureau's public announcement in April 2021, in which it disclosed its intention to rely on simulated microdata for the PUMS, a public uproar led the Bureau to retract its previous timeline, which had these simulated data being released in 2025. Given this retraction, and as briefly discussed in Chap. 2, there is currently no clarity on this timeline or on the scope of this proposed change (U.S. Census and American Community Survey microdata, 2022). The backlash associated with the implementation of differential privacy algorithms for the decennial census data has translated into peer-reviewed research (see Hauer & Santos-Lozada, 2021; Ruggles & Van Riper, 2022; Winkler et al., 2022, for example). In this line of research, scholars have indicated that "there has never been a single documented case where the identity of a respondent in the ACS or decennial census has been revealed by someone outside the Census Bureau" (U.S. Census and American Community Survey microdata, 2022, para. 1). Despite this fact, the big data explosion has increased fears of database reconstruction. This big data explosion implies not only that there has been fast-growing availability of individual data via a myriad of online transactions (like banking and online purchasing), but also that the public itself (which includes census respondents) has been making its individual and identifiable data publicly available via social media. These two factors together have contributed to the belief that the data swapping strategy employed in the 1990, 2000, and 2010 decennial censuses (United States Census Bureau, 2021) has become obsolete or weak in protecting the privacy of census respondents. In theory, then, with big data, information merging and matching from a variety of readily available sources have eased the retrieval or reconstruction of individual data responses (United States Census Bureau, 2021).11 In reality, however, the Census Bureau's own effort "to reconstruct the 2010 Census from published tabulations was incorrect in most cases" (U.S. Census and American Community Survey microdata, 2022, para. 6). More to the point, Ruggles and Van Riper (2022) found that "the database reconstruction technique does not perform much better than a random number generator" (p. 784). These findings are quite problematic for the following reasons. First, the implementation of differential privacy algorithms is fueled by the fear of database reconstruction, but evidence indicates that such a reconstruction, in addition to being expensive, is infeasible or no better than results obtained
11 Although in most cases one can say that data privacy threats are happening precisely due to individuals' decisions to make their personal data public, the Census Bureau is still bound to protect the confidentiality of census responses, leading to the adoption of the differential privacy framework.
by chance alone. Second, and just as or more importantly, "differential privacy will add error to every statistic the agency produces for geographic units below the state level, and this error will significantly reduce the usability of census data for social, economic, and health research, and will compromise basic demographic measures" (Ruggles & Van Riper, 2022, p. 782; see also Hauer & Santos-Lozada, 2021; Winkler et al., 2022). Having stated the risks of a decrease in the usefulness of census data (see Fig. 3.17) and its implications for "redistricting, allocation of funds, urban and regional planning, and studies of residential segregation" (U.S. Census and American Community Survey microdata, 2022, para. 4), let us briefly describe the algorithms that are being applied to the 2020 decennial census data.
What are Differential Privacy Algorithms?
Differential privacy algorithms may be defined as mathematical strategies to define and/or protect privacy (Dwork et al., 2014). Dwork et al. (2014) argue that, at its core, differential privacy addresses the "paradox of learning nothing about an individual while learning useful information about a population" (Dwork et al., 2014, p. 5). The power and promise of differential privacy consist in challenging the "Fundamental Law of Information Recovery that states that overly accurate answers to too many questions will destroy privacy in a spectacular way" (Dwork et al., 2014, p. 5). Taken together, survey, or in this case census, respondents are promised that their responses will not affect the study's conclusions, for such conclusions are to be reached with or without their participation in the survey or census (Dwork et al., 2014; United States Census Bureau, 2021). Not only does this latter statement sound counterintuitive, but it could also deter some respondents from participating. This premise may lead them to wonder why they would participate if their participation may not alter (or impact) the study's conclusions. On the other hand, when trying to convince potential respondents to participate based on this same premise, the incentive would be that they can be completely transparent, for no individual-case responses are to be reconstructed. According to the United States Census Bureau (2021), differential privacy algorithms have clear advantages over the data swapping and data suppression techniques employed in previous decennial censuses. These advantages include transparency and reproducibility. In line with this latter statement, the United States Census Bureau (2021) has made all Python code available to the public,12 with the only information not made public being the exact value of the "noise" added to these computations to fully avoid individual response reconstructions. In other words, without this noise value, as explained next, no individual response profile can be recreated and the premise of no individual responses affecting the outcome or conclusions is preserved.
12 See https://github.com/uscensusbureau.
In the differential privacy framework this noise is referred to as ε, a parameter that determines the amount of distortion to be added to the data (United States Census Bureau, 2021). A smaller ε yields better privacy and less accurate responses, and a higher value translates into less privacy and more accurate responses (Dwork et al., 2014, p. 6). Accordingly, the challenge consists of detecting the optimal noise value to preserve both privacy and accuracy. In its simplest form, this noise may be captured by flipping a coin before responding to a potentially "incriminating" or "socially undesirable" question (Dwork et al., 2014, p. 15). The process develops as follows: for each question asked, the answer (considered truthful) is registered, but the actual reported answer is the result of a coin flip that, if it lands on tails, has a value of "yes." If this coin flip did not land on tails, the answer is subjected to a second coin flip and is assigned "yes" or "no" if it lands on tails or heads, respectively. Dwork et al. (2014) indicate that even an incriminating response (i.e., answering "yes" to a question related to having committed a crime) is subject to plausible deniability, for this incriminating response has a 1/4 probability of being the true response. It is this randomization that makes anonymized records robust or resistant to linkage attacks that match anonymized records with non-anonymized records in a different dataset (Dwork et al., 2014). In this coin flip example, the amount of randomization using a fair coin is 1/2; however, differential privacy does not hold this ε, or noise uncertainty, constant, thus adding even more complexity to the virtually impossible recreation of individual records. As briefly depicted above, an important challenge in the differential privacy framework is the selection of the optimal amount of noise, for more noise (smaller ε) implies more data protection but lower accuracy or representativeness of the responses. To handle this decision, the United States Census Bureau (2021) relies on a "privacy loss budget […that] defines the absolute upper bound of privacy loss that can occur" (United States Census Bureau, 2021, p. 10). Notably, this privacy-loss budget is allocated considering population characteristics, housing attributes, and geographic levels. More to the point, the added noise may be positive or negative, which adds more complexity to any linkage attack efforts. In areas with small counts, the presence of negative noise may lead to negative counts, which, as discussed above, are converted into non-negative or even "zero counts" (or no available information) via post-noise-infusion processing (see Table 2.3 in United States Census Bureau, 2021, for an example of post-processing). As expected, data privacy concerns are higher for smaller geographical areas and for areas with low numbers of inhabitants. That is why the United States Census Bureau (2021) adds the highest noise to census blocks, for these splaces, particularly in non-densely populated areas, have the greatest risks of linkage or privacy attacks. For a mathematical depiction of the differential privacy framework see Dwork et al. (2014). For our purposes it is worth noting that this differential privacy noise is applied to every count, including totals. For instance, Table 3.1 shows
Table 3.1 Example of counts and noise (ε) added to microdata

          Actual population counts     Noise                       Noise infused counts
          White   Non-white   Total    White   Non-white   Total   White   Non-white   Total
Block 1   80      40          120      0       −6          −5      80      36          115
Block 2   60      50          110      5       0           3       65      50          113
a fictitious case following the examples presented by the United States Census Bureau (2021). In this case we have counts of White and non-White participants in two census blocks. Our goal here is to show that each cell in this table, including the totals, is subjected to ε independently of the sum of its row. That is, whereas the actual population counts follow the traditional sum of columns per row, the noise-infused counts do not. Specifically, note that in Block 1 there are 80 White inhabitants and 40 non-White inhabitants, which add to a total of 120 inhabitants in this block. In the "Noise infused counts," the total number of inhabitants reported is 115, which is the result of adding the noise (−5) shown under the "Total" column in the "Noise" subsection of this table to the "Total" column in the "Actual population counts" subsection. This process is different from adding the 80 White and 36 non-White inhabitants shown in the "Noise infused counts" subsection of this table (which would yield 116 rather than 115).
Relevance of Differential Privacy for SSEM
From the previous presentation it is clear that, unless SSEM is applied to decennial data gathered in 2020 or to the public use microdata (PUMS), the use of Census data sources such as the American Community Survey pre-tabulated survey data poses no methodological issues for SSEM, as ACS aggregated data are not subjected to the differential privacy framework. Nonetheless, the PUMS data will be impacted when synthetic or simulated data take the place of the two-thirds subset that is currently released for research use (U.S. Department of Commerce, Economics and Statistics Administration, U.S. Census Bureau, 2021). Beyond these sources, data privacy concerns are still relevant for SSEM in at least two interrelated circumstances. The first is when we collect our own data and have specific latitude and longitude coordinates for our participants. The second is when we use survey data gathered by other entities such as the National Center for Education Statistics (NCES), which has a myriad of databases, most of which are longitudinal in nature and contain data that either can be geocoded or are already georeferenced. An example of data that can be geocoded and that are typically available from the NCES is the inclusion of participants' ZCTAs (González Canché, 2018). In this case, we can rely on the ZCTAs provided by the TIGER/Line Shapefiles to add latitude and longitude coordinates associated with each ZCTA, as sketched below.
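The following is a minimal sketch of this ZCTA-based geocoding step, assuming the sf package; the file and column names (including the TIGER/Line layer and its ZCTA5CE20 field) are illustrative assumptions and should be replaced with those of the actual survey and shapefile in use.

```r
library(sf)

survey    <- read.csv("survey_responses.csv")   # hypothetical survey file with a "zcta" column
zcta_poly <- st_read("tl_2020_us_zcta520.shp")  # TIGER/Line ZCTA polygons (hypothetical path)

# One representative point per ZCTA (project first so centroids are computed on a plane)
zcta_pts <- st_centroid(st_transform(zcta_poly, 3857))
coords   <- data.frame(zcta = zcta_poly$ZCTA5CE20,
                       st_coordinates(st_transform(zcta_pts, 4326)))

# Attach approximate longitude (X) and latitude (Y) to each respondent by ZCTA code
survey_geo <- merge(survey, coords, by = "zcta", all.x = TRUE)
```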
Although ZCTAs may not reveal the precise location of these participants, they open the possibility of "inquiring" or asking around for survey respondents, which risks their identification. More recent National Center for Education Statistics data sources come with fully geolocated information, which could improve the precision with which survey respondents can be located. All these instances (including both our own data collection and the analysis of secondary data sources) imply that, with visualization techniques that depict key attributes of our respondents, anyone could reconstruct individual cases and reveal participants' identities. Although disclosure of individually identifiable information of National Center for Education Statistics participants is a federal crime, punishable with up to five years of imprisonment and a $250,000 fine,13 as data analysts we should always prioritize the protection of (our) respondents' privacy and identities.
Strategies to Protect Privacy
Since the main threat to data privacy, when no differential privacy algorithms or synthetic databases are applied to raw data, is individual identification based on physical location, the easiest solution to protect our participants may be to add noise to individuals' locations. Note, however, that spatial statistical models relying on distance inputs as identification strategies per se do not reveal participants' locations, for the results are typically presented in tabular form, like any other regression model output. This is important to note, for we do not need to add any noise to our data sources to estimate our models. Rather, this noise may be added to the resulting visualizations we produce, as a strategy to protect the identities of our participants by not revealing their exact locations. Note also that, although we are referring to human participants, the noise infusion process may be applied to all points. For example, we could add noise to colleges' locations in exactly the same way we would add location noise to participants' home addresses. To exemplify this spatial noise strategy, let us rely on point data in Philadelphia County to map all colleges and universities located in Philadelphia. The result of this process is shown in Fig. 3.18. Figure 3.18 contains 74 dots. The 37 black dots indicate the actual addresses (in terms of latitude and longitude coordinates) of these colleges and universities. The remaining 37 magenta dots result from infusing random location noise into those same coordinates. This noise is typically added with a jitter function that adds imprecision to the locations.14 When we apply the jitter function, the amount of imprecision is independently applied to each geometry or shape configuring a shapefile. That is, jitter applies noise to points, lines, and polygons as whole units. This means that even if a polygon is configured by lines, and lines by points, the amount of noise is
13 See https://nces.ed.gov/statprog/rudman/g.asp.
14 This noise infusion is achieved with the st_jitter() function in the Simple Features (sf) R package.
Fig. 3.18 Example of noise infused locations. College location source The IPEDS Data Center
uniformly applied to the polygon itself so that its original shape is not lost or distorted. In Fig. 3.18 each dot is its own unit, so the resulting noise infused depictions jump quasi-randomly and independently. In the case of polygons these jumps would not be as relevant or noticeable. To showcase this latter case, let us observe the ZCTA representation of Philadelphia County with and without the application of this jitter function. Figure 3.19 shows the result of adding a jitter function to ZCTA polygons in Philadelphia County. Unlike the jittered point representations, although there is some movement in each polygon, the jitter function did not produce major displacements of these locations. Moreover, as described above, none of the polygons' shapes morphed or were altered. When considering why these differences between noise added to polygons and noise added to points occur, note that, to a great extent, jitter as a form of protecting privacy is more relevant for individual cases than for aggregated
Fig. 3.19 Example of noise infused polygon locations. Source The TIGER/Line Shapefiles
shapes. Since points, by definition, represent single units located in space, the jitter or noise is more relevant for these lower-level units than it is for higher-level geometries. To close this subsection, note that, when the jitter is applied to points, some of the resulting points crossed county boundaries (and, in some cases in this example, even the state border). However, note once more that, for our actual statistical modeling purposes, these jumps represent no problem, for our statistical (or SSEM) models may be based on the black dots rather than on the jittered representations; indeed, the magenta representations need only be used in visualizations as a data privacy protection strategy.
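A minimal sketch of this jittering step with st_jitter() follows; the file name and the amount of jitter are hypothetical and for illustration only.

```r
library(sf)

colleges <- st_read("philly_colleges.shp")   # hypothetical point shapefile of IPEDS locations

set.seed(1)                                   # reproducible noise
colleges_jit <- st_jitter(colleges, amount = 0.01)   # roughly 0.01 degrees of displacement

# Model with the true locations; map the jittered ones for privacy
plot(st_geometry(colleges),     col = "black",   pch = 16)
plot(st_geometry(colleges_jit), col = "magenta", pch = 16, add = TRUE)
```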
Methodological Implications of DPF for SSEM
Regardless of whether DPF is needed to avoid database reconstruction (see Hauer & Santos-Lozada, 2021; Ruggles & Van Riper, 2022; Winkler et al., 2022), DPF has important implications for SSEM when using decennial census data. The resulting models will be subjected to an amount of noise that may potentially reduce the usefulness of decennial census data for social, economic, and health research (Ruggles & Van Riper, 2022). Having said this, let us also note that, as we briefly noted above, the process of adding uncertainty to decennial census data has been in place since 1930, with data swapping practices having been implemented since 1990. From this view, the methodological question for us becomes: how badly are these distortions expected to affect the resulting estimates? To answer this question, we must consider that the differential privacy framework is based on the premise that, despite the noise added, the survey conclusions would not be affected by this process. This statement, in theory, implies that our estimates would not be biased due to DPF. In addition, we should also remember that our SSEM models are and will continue to be approximations to reality (hopefully good approximations), rather than reality itself. From this view, although differential privacy adds noise, this noise is added randomly across units, so it is not expected to be over-concentrated in certain segments of the population. Having said this, note that SSEM does not deal with redistricting, urban planning, or allocation of funds, which are the areas about which experts are most concerned regarding the implementation of DPF (U.S. Census and American Community Survey microdata, 2022). In conclusion, while we do not expect these changes to impact SSEM, this does not rule out important downsides resulting from the implementation of both DPF and synthetic data. Moreover, from a comparative point of view, it remains unknown whether these noise infused decennial data estimates are more or less accurate than the 5-year estimates available from the American Community Survey (ACS), not the PUMS data. The purpose of making this comparison is not to doubt models or inferences obtained with ACS pre-tabulated survey data; instead, it is to acknowledge, once more, that although these inferences may be useful, there is always going to be a sense of uncertainty when relying on these data sources. When we collect our own data, via ethnographic observations that are geolocated, we should be particularly careful to protect our participants' identities and locations, which, as we have shown in Fig. 3.18, may be accomplished quite easily. However, there are other instances when precise locations really matter in terms of detecting areas with an increased occurrence or concentration of activities. For example, Humane Borders, Inc. (Fronteras Compasivas), in collaboration with the Pima County Office of the Medical Examiner in Arizona, continues to make geolocated information on migrant deaths publicly available. Since 2001, they have updated these counts, carefully documenting the locations and causes of death of over 3,900 migrants who died or were killed while crossing the Sonoran Desert.
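When accuracy matters, such points can be mapped without any jitter. The following is a minimal sketch, assuming the sf and leaflet packages (interactive HTML maps are covered in more depth later in the book); the file name is hypothetical.

```r
library(sf)
library(leaflet)

deaths <- st_read("migrant_deaths.shp")   # hypothetical point file of documented locations

leaflet(deaths) %>%
  addTiles() %>%
  addCircleMarkers(radius = 2, color = "magenta")   # exact, non-jittered locations
```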
Fig. 3.20 Locations of migrants' deaths 1990–2022. Source Humane Borders, Inc. (Fronteras Compasivas)
Although we will explore this database in more detail when discussing point pattern analyses, the accurate representation of these locations may be important to prevent deaths in zones where there is a concentration of incidents. For example, in Fig. 3.20 it is clear that there is a concentration of events closer to the border with Mexico, and in this type of representation accuracy should be preserved. In closing this section, note that the decision to rely on noise infusion as a form of data privacy protection depends on the goals of our research. In some instances we need to protect the identity of our participants, and in others we should preserve the accuracy of their locations. Finally, concerns related to the use of decennial census data and PUMS ACS data are more important when using these data sources for "redistricting, allocation of funds, urban and regional planning, and studies of residential segregation" (U.S. Census and American Community Survey microdata, 2022, para. 4). For example, Kenny et al. (2021) found that the United States Census disclosure avoidance system systematically undercounts the population in mixed-race and mixed-partisan precincts, yielding unpredictable racial and partisan biases (Kenny et al., 2021, p. 5). Accordingly, particular emphasis should be placed on the evaluation of the
performance of these synthetic and data privacy frameworks in areas wherein marginalized and historically disfranchised groups reside.
Next Steps
This chapter concludes our conceptual discussion of SSEM. Here, we illustrated data formats and transformations as well as the rationale and components of coordinate reference systems and differential privacy. These elements will be applied throughout the rest of the book. The next chapter starts with a brief tutorial of R, after which we proceed to manage spatial data, including merging, visualization, geocoding, and crosswalking. This indicates that the remainder of the book is applied, with all data and code discussed in the content of the book. Additionally, complete replication files will be made available.
Discussion Questions
3.1 What are some of the differences between raster and vector data? Are some fields or disciplines more inclined to use one format over the other? Which is more important for the social sciences and why?
3.2 Why is it relevant to be able to move from raster to vector data and vice versa? Explain some of the challenges discussed in the content of the book, like those involving missing cases.
3.3 Describe the elements of coordinate reference systems.
3.4 During our discussion of CRSs we mentioned that some representations are called projected and others unprojected. What does projected mean? What is their main difference with respect to their units of measurement?
3.5 Are the unprojected CRSs truly unprojected? Explain why.
3.6 Explain the process and relevance of adding layers to our visualization procedures. Can this process be achieved with vector and raster data?
3.7 What would happen if you had a shapefile that is unprojected and another that is projected and you needed to add them as layers of your map?
3.8 What is the conundrum that differential privacy aims to address?
3.9 In what cases would SSEM be affected by the noise added for data privacy? Relatedly, since this noise is added randomly across tables, should we be concerned about systematic bias resulting from the use of differential privacy data?
3.10 Would we always need to add noise to our visualizations? What may be some examples of cases where the representation of exact locations should be prioritized?
References
Chen, R. J., Lu, M. Y., Chen, T. Y., Williamson, D. F., & Mahmood, F. (2021). Synthetic data in machine learning for medicine and healthcare. Nature Biomedical Engineering, 5(6), 493–497.
Congalton, R. G. (1997). Exploring and evaluating the consequences of vector-to-raster and raster-to-vector conversion. Photogrammetric Engineering and Remote Sensing, 63(4), 425–434.
Constitution of the United States (1787).
Dwork, C., Roth, A., et al. (2014). The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4), 211–407.
González Canché, M. S. (2018). Nearby college enrollment and geographical skills mismatch: (Re)conceptualizing student out-migration in the American higher education system. The Journal of Higher Education, 89(6), 892–934. https://doi.org/10.1080/00221546.2018.1442637
Hauer, M. E., & Santos-Lozada, A. R. (2021). Differential privacy in the 2020 census will distort COVID-19 rates. Socius, 7, 2378023121994014.
Holdgraf, C., & Wasser, L. A. (2022). GIS in Python: Intro to coordinate reference systems in Python. Earth Lab. https://www.earthdatascience.org/courses/use-data-open-source-python/intro-vector-data-python/spatial-data-vector-shapefiles/intro-to-coordinate-reference-systems-python/
Kenny, C. T., Kuriwaki, S., McCartan, C., Rosenman, E. T., Simko, T., & Imai, K. (2021). The use of differential privacy for census data and its impact on redistricting: The case of the 2020 US Census. Science Advances, 7(41), eabk3283. https://imai.fas.harvard.edu/research/files/DAS.pdf
Lovelace, R., Nowosad, J., & Muenchow, J. (2019). Geocomputation with R. CRC Press. https://geocompr.robinlovelace.net/
McKenna, L. (2019). Disclosure avoidance techniques used for the 1960 through 2010 decennial censuses of population and housing public use microdata samples. Washington, DC: United States Census Bureau (Working Paper). https://www.census.gov/library/working-papers/2019/adrm/six-decennial-censuses-da.html
Piwowar, J. M., LeDrew, E. F., & Dudycha, D. J. (1990). Integration of spatial data in vector and raster formats in a geographic information system environment. International Journal of Geographical Information Systems, 4(4), 429–444.
Ruggles, S., & Van Riper, D. (2022). The role of chance in the Census Bureau database reconstruction experiment. Population Research and Policy Review, 41(3), 781–788.
Sutton, T., Dassau, O., & Sutton, M. (2009). A gentle introduction to GIS. Chief Directorate: Spatial Planning & Information, Department of Land Affairs, Eastern Cape, South Africa.
United States Census Bureau. (2020). Understanding differential privacy. https://www.census.gov/programs-surveys/decennial-census/decade/2020/planning-management/process/disclosure-avoidance/differential-privacy.html#basics
United States Census Bureau. (2021). Disclosure avoidance for the 2020 census: An introduction. https://www2.census.gov/library/publications/decennial/2020/2020-census-disclosure-avoidance-handbook.pdf
United States Census Bureau. (2022). The "72-year rule". https://www.census.gov/history/www/genealogy/decennial_census_records/the_72_year_rule_1.html
U.S. Census and American Community Survey microdata. (2022). Changes to Census Bureau data products. IPUMS. https://www.ipums.org/changes-to-census-bureau-data-products
U.S. Department of Commerce, Economics and Statistics Administration, U.S. Census Bureau. (2021). Understanding and using the American Community Survey public use microdata sample files: What data users need to know. U.S. Government Printing Office. https://www.census.gov/content/dam/Census/library/publications/2021/acs/acs_pums_handbook_2021.pdf
U.S. Department of the Interior, U.S. Geological Survey. (2022). Map projections. https://pubs.usgs.gov/gip/70047422/report.pdf
Wade, T. G., Wickham, J. D., Nash, M. S., Neale, A. C., Riitters, K. H., & Jones, K. B. (2003). A comparison of vector and raster GIS methods for calculating landscape metrics used in environmental assessments. Photogrammetric Engineering & Remote Sensing, 69(12), 1399–1405.
Wasser, L. A. (2022). Raster data in R - the basics. NEON, operated by Battelle. https://www.neonscience.org/resources/learning-hub/tutorials/raster-data-r
Winkler, R. L., Butler, J. L., Curtis, K. J., & Egan-Robertson, D. (2022). Differential privacy and the accuracy of county-level net migration estimates. Population Research and Policy Review, 41(2), 417–435.
PART II
Data Science SSEM Identification Tools: Distances, Networks, and Neighbors
The subsequent three chapters of SSEM contain the conceptual and methodological foundations of spatial data science. Each chapter focuses on developing interrelated yet specific skills designed to equip researchers with the conceptual and analytic tools to start using SSEM. The topics covered begin with a tutorial and data access tips, then move to distance estimation procedures, and finally to the establishment of networks and neighboring structures. Together, these chapters offer a comprehensive depiction of innovative spatial data science processes and identification strategies.

Chapter 4 is the first applied chapter. Its goal is to provide readers with the tools to start applying the concepts discussed in the first three chapters of the book, as well as all the remaining analytic procedures to be discussed. Accordingly, in this chapter, we present an applied R tutorial covering the most fundamental requirements to use all the minimal code functions contained in this book. We also illustrate the steps required to create splace datasets, which involve reading spatial datasets (i.e., shapefiles) and place-based datasets (i.e., ACS pre-tabulated data) and then merging them. Another important skill to be learned from this chapter is crosswalking, a data management approach that enables the merging of attributes measured at different geographical levels.

Chapter 5 relies on data science and network modeling to tackle distance and travel time estimations that go over and above naïve estimates that draw a straight line between two points (referred to as "as the crow flies" distances). In this chapter, we offer two functions that compute distances and travel times by following a road network. These distances are referred to as "as humans walk" and "as humans drive."
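To preview this distinction, the following minimal sketch (assuming the sf package and two hypothetical longitude/latitude points; it is not one of the book's own functions) computes the "as the crow flies" distance between two locations. Road-based "as humans walk" or "as humans drive" measures additionally require a routing engine, as discussed in Chap. 5.

library(sf)

# Two hypothetical points given as longitude/latitude pairs
p1 <- st_sfc(st_point(c(-75.1932, 39.9523)), crs = 4326)
p2 <- st_sfc(st_point(c(-77.0365, 38.8977)), crs = 4326)

# Great-circle ("as the crow flies") distance, returned in meters
st_distance(p1, p2)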
Building from the distance approaches estimated in Chap. 5, Chap. 6 relies on geographical network modeling principles to identify neighboring structures. Here, units in close proximity, or with shorter travel times between them, are identified as neighbors or sources of influence. Since closeness and travel time are relative concepts, we discuss feasible cutoff points that may aid in the selection of optimal distances or travel times for these neighborhood identification strategies. Moreover, we discuss the application of network-based transformations and data management steps that allow units of different types to enter this neighborhood identification process, which constitutes an important contribution to spatial modeling. Currently, these identification strategies are constrained to units of the same type and to "flying times." The content of Chap. 6 allows us to go over and above those methodological constraints. Once Part II is completed, we will be ready to start using spatial analyses to assess spatial dependence, conduct point pattern analyses, fit spatial econometric models, and build interactive geographical visualizations, as discussed in Part III.
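As a preview of the neighbor identification logic developed in Chap. 6, the following minimal sketch flags as neighbors all pairs of units that fall within a chosen distance cutoff. It assumes the spdep package and a hypothetical set of point coordinates; the book's own functions extend this idea to road-network distances and travel times.

library(spdep)

# Hypothetical longitude/latitude coordinates for five units
coords <- cbind(lon = c(-75.19, -75.16, -75.21, -76.61, -77.04),
                lat = c( 39.95,  39.95,  39.98,  39.29,  38.90))

# Treat units within 10 km of one another as neighbors
# (longlat = TRUE tells spdep to measure great-circle distances in kilometers)
nb <- dnearneigh(coords, d1 = 0, d2 = 10, longlat = TRUE)
summary(nb)

# A binary spatial weights list can then be built from the neighbor object;
# zero.policy = TRUE accommodates units that have no neighbors under this cutoff
lw <- nb2listw(nb, style = "B", zero.policy = TRUE)

The same logic extends to travel-time cutoffs once road-network measures replace the great-circle distances used here.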
CHAPTER 4
Access and Management of Spatial or Geocoded Data
Abstract This is the first applied chapter of the book. The content from this point forward will introduce The R Project for Statistical Computing and the minimal code functions required to replicate all the analyses presented in this book. The main goal of this code presentation is to illustrate the rationale analysts need to adapt and use this code in their own projects. Accordingly, in addition to the code discussion presented in these chapters, standalone code files are made available for download, along with all their respective datasets, to replicate all the procedures. The chapter starts with a brief R tutorial covering the fundamentals required to start using R. Subsequently, we read shapefiles and access data files from the American Community Survey, the National Center for Education Statistics, and the Internal Revenue Service. We then showcase how to geocode, merge or join, and crosswalk different data sources to form splaces. As part of this discussion, we also show how to replicate most of the figures presented in previous chapters.
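As a rough preview of the splace-building workflow detailed in this chapter, the following minimal sketch joins a shapefile with a place-based table on a shared identifier. It assumes the sf package and two hypothetical files, tracts.shp and acs_income.csv, linked by a GEOID column; these names, including the median_income attribute, are placeholders rather than the book's datasets.

library(sf)

tracts <- st_read("tracts.shp")                       # spatial dataset (polygons with a GEOID column)
acs    <- read.csv("acs_income.csv",                  # place-based dataset (pre-tabulated attributes)
                   colClasses = c(GEOID = "character"))

# Merging on GEOID attaches the attributes to the geometries, forming a splace dataset
splace <- merge(tracts, acs, by = "GEOID")

plot(splace["median_income"])                         # quick thematic map of a hypothetical attribute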
R Tutorial

R is a flexible and powerful tool for statistical analysis, data visualization, and data science. Since R is an object-oriented programming language, it offers high flexibility in accessing and merging or joining multiple data sources simultaneously while applying algorithms that are fully reproducible and modifiable. This R tutorial aims to spell out the most fundamental steps required to read and understand the rationale behind the SSEM approaches discussed in the following chapters. Accordingly, the goal of this tutorial is that, after studying it, readers will comfortably read and manipulate R code and adapt it to their own projects. The following information is based on over a decade of constant use of R. The tutorial content was selected based on 10 years of R teaching experience and on the functions developed for and used in this book. However, we also highly recommend studying other tutorials and manuals, such as "An Introduction to R" by
W. N. Venables, D. M. Smith, and the R Development Core Team, which can be accessed at https://cran.r-project.org/doc/manuals/R-intro.pdf. Other resources are the "R Reference Card V1" by Tom Short, available at https://cran.r-project.org/doc/contrib/Short-refcard.pdf, and the "R Reference Card V2" by Matt Baggott, available at https://cran.r-project.org/doc/contrib/Baggott-refcard-v2.pdf.

Installation

R is a multi-platform, open-source software application. To install it:

• Open your web browser and go to http://www.r-project.org/.
• On the left panel, click "CRAN" under "Download" and choose a mirror site (R repository) geographically close to you from the list. That is, if you are located in Pennsylvania, you can select "Hoobly Classifieds, Pittsburgh, PA" (https://cran.mirrors.hoobly.com/) rather than Duke University, Durham, NC (https://archive.linux.duke.edu/cran/), for example. This should speed up the downloading process.
• Under "Download and Install R", choose your operating system:
  – Windows users can select Download R for Windows and then click "Save File" to save the installer to their local disk.
  – Mac users can select Download R for macOS.
  – Linux users can select Download R for Linux (Debian, Fedora/Redhat, Ubuntu).
• Once the installation file is saved, execute it and select "Run", then choose your language, say "English", click "OK", click "Next", click "Next" again after reading the license information, and choose an installation folder (the default options work fine).
• You then need to choose the components you want to install. The "Full installation" for 64 bit requires about 79 MB.
• Once you accept all default choices, the whole process might run for anywhere from a few seconds to a few minutes.

An excellent companion to R is RStudio, as it provides a myriad of tools that ease the coding and analysis process. RStudio is also cross-platform and has a free version, which will handle all the approaches we discuss in this book. Note that for RStudio to work, a successful installation of R is required. Accordingly, after installing R, follow these steps:

• Go to http://www.rstudio.com/products/rstudio/download/ and select "RStudio Desktop."
• There are different versions:
  – Windows users can select https://download1.rstudio.org/desktop/windows/RStudio-2022.02.1-461.exe
  – Mac users can select https://download1.rstudio.org/desktop/macos/RStudio-2022.02.1-461.dmg
  – Ubuntu 18+/Debian 10+ users can select https://download1.rstudio.org/desktop/bionic/amd64/rstudio-2022.02.1-461-amd64.deb.
• In all cases, the default options are recommended.
• Once more, if R was correctly installed, you will be able to execute RStudio. If the R installation failed, RStudio will not start.
• You will be asked to select a mirror or R repository from which to download packages; once again, select the one closest to your current location.

R Infrastructure

RStudio uses a four-pane layout. Figure 4.1 shows the two main panes we use in our examples. Pane A is the R console, which executes our commands. These commands can be typed directly into the console or can be fed from the source pane. The latter contains all the commands (i.e., commands that can form a program of analysis) to be sent to the console to execute functions and conduct analyses. Typically, we type commands directly in the console pane to get quick information that does not need to be part of any program of analysis. Here, we may understand a program of analysis as the entirety of the code used to, for example, conduct all the analyses presented in one research study.

Code Rationale

In the remainder of the book, we will describe our actual code input. For example, since R is an object-oriented programming language, we will be required to define all the objects of analysis. That is, all the shapefiles, rasters, and datasets described in previous chapters required us to assign them to "objects," and to achieve this we relied on the following assigning sign: