HIGHWAY SAFETY ANALYTICS AND MODELING

Dominique Lord
Xiao Qin
Srinivas R. Geedipally
Elsevier
Radarweg 29, PO Box 211, 1000 AE Amsterdam, Netherlands
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States

Copyright © 2021 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies, and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-0-12-816818-9

For information on all Elsevier publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Joe Hayton
Acquisitions Editor: Brian Romer
Editorial Project Manager: Barbara Makinster
Production Project Manager: Swapna Srinivasan
Cover Designer: Mark Rogers

Typeset by TNQ Technologies
Dominique Lord: To my family (Leah and Javier), my mother (Diane), my brother (Sébastien), and my two former advisors (Dr. Ezra Hauer and Dr. Bhagwant Persaud).

Xiao Qin: To my family (Yuchen, Ethan, and Eva), my parents (Xingpo and Guangqin), my brother (Hui), and my former advisor (Dr. John Ivan).

Srinivas R. Geedipally: To my family (Ashwini, Akshath, and Svidha), my parents (Ram Reddy and Laxmi), and my brother (Rajasekhar Reddy). Special thanks to my former advisor, Dr. Lord, for involving me in this project.
Preface

The primary purpose of this textbook is to provide state-of-the-art knowledge about how to better analyze safety data given their unique characteristics. The textbook provides the latest tools and methods documented in the highway safety literature, some of which have been developed or introduced by the authors. It covers all aspects of the decision-making process, from collecting and assembling data to making decisions based on the analysis results, and is supplemented by real-world examples and case studies that help the reader understand the state of practice in the application of models and methods in highway safety. Where warranted, helpful hints and suggestions are provided by the authors to support the analysis and interpretation of safety data. The textbook is suitable for college students, safety practitioners (e.g., traffic engineers, highway designers, data analysts), scientists, and researchers who work in highway safety.

This textbook specifically complements the Highway Safety Manual (HSM) published by AASHTO and the Road Safety Manual (RSM) published by the World Road Association. The publication of the HSM, RSM, and other safety-oriented guidelines has substantially increased the demand for training engineers and scientists to understand the concepts and methods outlined within them. Hence, the content of this textbook helps fill this gap by describing the methods in greater depth, allowing readers to broaden their knowledge of the fundamental principles and theories of highway safety.

All three authors of this textbook have taught graduate-level courses in highway safety at different institutions. The material covered had to be drawn from various sources, including chapters (or parts of them) of various textbooks in areas within and peripheral to highway safety, published peer-reviewed papers, class notes from world leaders in highway safety (e.g., Dr. Ezra Hauer), research reports, and manuals published by national public agencies. Most of these materials did not contain exercises and problems that students could use to apply the knowledge acquired from these documents. Over the years, it became clear that a textbook was needed that could combine all these important topics into a single document, one from which students could read and learn about theoretical principles and apply them using observed (or simulated) data. In this regard, the textbook includes more than nine
datasets for more than 40 exercises. Most of these datasets have been used in peer-reviewed publications. All the datasets can be found at the lead author's website: https://ceprofs.civil.tamu.edu/dlord/Highway_Safety_Analytics_and_Modeling.htm.

The content of the textbook is based on an accumulation of more than 40 years of research and applications related to methods and tools for analyzing safety data. The textbook is divided into three general areas. The first area includes chapters that describe fundamental and theoretical principles associated with safety data analyses. This area covers the nature of the crash process from the human and statistical/mathematical perspectives, as well as key crash-frequency and crash-severity models that have been developed in the highway safety literature. The second area groups chapters that describe how the various models described in the first area are applied. The chapters include methods for exploring safety data, conducting cross-sectional and before-after studies, and identifying hazardous sites or sites with promise, as well as tools for incorporating spatial correlation and identifying crash risk on a near real-time basis. The third area assembles alternative safety analysis tools. The methods include how to use surrogate measures of safety and data mining techniques for extracting relevant information from datasets, including those categorized as big data (e.g., naturalistic data).

It is hoped that the content will help readers better understand the analytical tools used to analyze safety data and make informed decisions for reducing the negative effects associated with crashes across the globe. This is even more important given the Vision Zero programs that have been increasingly implemented by various agencies in Europe, North America, and Eurasia, among others. The content should also help improve or develop new tools aimed at estimating the safety performance of connected and automated vehicles, especially when they are deployed in mixed-driving environments (within the next decade).

For implementing the methods and techniques proposed in this textbook, the authors have provided computer code in three advanced software languages (SAS, WinBUGS, and R; see Appendix C). Of course, the methods are not restricted to these three; they can easily be implemented in many other languages given the parameterization described in the textbook. Along the same lines, Microsoft Excel provides simple, flexible, and adequate tools for implementing the simpler methods, such as the graphical methods presented in Chapter 5 or the before-after studies described in Chapter 7.

This textbook would never have come to completion without the significant help and input from numerous individuals, colleagues, and former and current graduate students: Zhi Chen, Soma Dhavala, Kathleen
Fitzgerald-Ellis, Ali Shirazi, Ioannis Tsapakis, Yuanchang Xie, Chengcheng Xu, and Lai Zheng. After a few requests on social media, several people offered information about getting access to safety databases or gave us permission to use datasets. They include Jonathan Aguero-Valverde (Costa Rica), Amir Pooyan Afghari (Australia), David Llopis Castelló (Spain), Aline Chouinard (Canada), Stijn Daniels (Belgium), Thomas Jonsson (Sweden), Neeraj Kumar (Netherlands), Pei Fen Kuo (Taiwan), Emad Soroori (Australia), Shawn Turner (New Zealand), and Simon Washington (Australia).

Finally, this textbook project would not have been possible without the support from Elsevier. First, a large thank you to Brian Romer, who first approached the authors several years ago and convinced us to prepare a book on highway safety (given our reluctance about the effort needed for such an endeavor). Thanks to the two book managers who kept us on our toes for the duration of this project: Barbara Makinster and Ali Afzal-Khan. Special thanks to Narmatha Mohan for helping us manage copyright information and the permission log, and to Swapna Srinivasan for handling the production of the textbook.

The content of this textbook has been partly funded by the A.P. and Florence Wiley Faculty Fellowship provided by the College of Engineering at Texas A&M University and by project 01-001 from the Safety through Disruption (Safe-D) University Transportation Center (UTC).

Dominique Lord, Texas A&M University
Xiao Qin, University of Wisconsin–Milwaukee
Srinivas R. Geedipally, Texas A&M Transportation Institute
CHAPTER 1
Introduction

1.1 Motivation

Although agencies across the world have expended a great deal of effort to reduce the number and severity of crashes1 via improvements in highway design, vehicle technology, traffic policy, emergency services, and the like, the effects of highway crashes on road transport networks are still a major source of morbidity (Lord and Washington, 2018). Fig. 1.1 illustrates the historical trend in roadway fatalities in the United States between 1913 and 2018 (similar trends have been observed in most industrialized countries). The figure shows that roadway fatalities have been slowly declining since the early 1970s, with sharp decreases during economic recessions (discussed further below). It also demonstrates that when the values are analyzed taking into account vehicle miles traveled (a measure of exposure), the rate has declined significantly since the federal government began officially collecting crash data. Even though the crash rate shows a great reduction, the raw numbers, as a public health measure, are still the most important factor guiding the allocation of resources. For example, although the crash rate is generally going down, the number of injured people arriving at the emergency rooms located within a jurisdiction, or the patient arrival rate, is the primary metric that hospital management uses to allocate medical services.
1 In this textbook, we use the term "crash" to reflect the outcome of a collision between a vehicle and a fixed object (i.e., an event where only one vehicle is involved), one or more other vehicles, or one or more vulnerable road users (e.g., pedestrians and cyclists). Although some people do not like to label a crash an "accident" because the word accident could absolve the driver of any responsibility, the word accident could still be employed, as it refers to the probabilistic nature of the event. If crashes arose from a deterministic system, we would be able to "predict" with certainty when one or more crashes would occur in the future. Obviously, in the context of this textbook, this is not possible.
FIGURE 1.1 Number of fatalities and fatalities per 100 million vehicle miles in the United States between 1913 and 2018 (NSC, 2018).
The same information is also needed, for example, for managing first responders, such as emergency medical services, firefighters, and national, regional, and local police forces. Hence, attention usually focuses on crash or injury counts for many safety interventions, although exposure in terms of vehicular traffic and/or segment length may still need to be incorporated into some of the methods utilized for assessing safety.

According to the World Health Organization (WHO), between 2000 and 2016, roadway-related deaths increased from about 1.15 million to 1.35 million globally (WHO, 2018). On an annual basis, about 80 million nonfatal injuries warranting medical care occur on highway networks (World Bank, 2014). Road traffic injuries rank as the eighth leading cause of death (2.5%) among people of all ages, just ahead of diarrheal diseases and tuberculosis (WHO, 2018). Vulnerable road users (i.e., pedestrians and cyclists) represent 26% of road injury deaths, while drivers and passengers of motorized two-wheel and three-wheel vehicles account for another 28% worldwide (WHO, 2018). Unfortunately, while a large proportion of high-income countries observed either a reduction or no change in traffic-related deaths between 2013 and 2016, a significant number of middle- and low-income countries observed an increase (WHO, 2018), in large part attributed to the rapid motorization occurring in developing countries (World Bank, 2014).

The economic burden of crashes significantly impacts the global economy. In the United States, for instance, highway crashes are estimated to have caused more than US$871 billion in economic loss and societal harm in 2010 (Blincoe et al., 2015). In Europe, it is estimated that
crashes caused more than US$325 billion (€280 billion) in economic harm in 2015 (this value is considered an underestimate) (Wijnen et al., 2017), while in Australia the economic burden was estimated to be US$23.9 billion (AU$33.2 billion) in 2016 (Litchfield, 2017). Globally, it is estimated that 3% of gross domestic product (GDP) is lost to highway crashes (all severities), a share that can be as high as 5% for middle- and low-income countries (WHO, 2015). In short, in addition to the pain and suffering that crashes cause their victims, highway crashes can significantly impede a country's economic growth or viability.

As shown in Fig. 1.1, it is now well established that economic activity is strongly linked to the number of fatalities observed on highways (Wijnen and Rietveld, 2015; Elvik, 2015; Wegman et al., 2017; Noland and Zhou, 2017; Shimu, 2019). In times of economic growth, the number of crashes increases, while during economic hardship (i.e., recession), the number of crashes decreases. Fig. 1.2 illustrates this relationship in detail for the "Great Recession" of 2007–09 in the United States (the right-hand side of Fig. 1.1). The influencing factors include the unemployment level, especially among young people, mode shifts by people who are unemployed, and lower exposure by high-risk drivers (e.g., drivers below 25 years old) during recession periods (Blower et al., 2019). The relationship between economic activity and crash risk must be well understood before analytical tools are applied to highway crash data, in order to avoid potential confounding effects when treatments for reducing the number and severity of crashes are implemented and evaluated.
FIGURE 1.2 Fatalities trend during the Great Recession of 2007–09 in the United States (NSC, 2018).
Given the magnitude of the problem associated with highway crashes, numerous public transportation agencies across the world, from national to local, have devoted substantial effort (i.e., labor, promotion, etc.) and allocated a large amount of funds to reducing the number and severity of crashes, especially over the last 25 years. For example, in the United States, the National Highway Traffic Safety Administration (NHTSA) devoted US$908 million in 2016 to highway safety initiatives related to vehicle safety, driver safety, and traffic enforcement (NHTSA, 2016). In 2019, the Federal Highway Administration (FHWA) allocated US$2.60 billion solely for safety projects, which include research, dissemination, engineering, and construction projects, among others (FHWA, 2019). Similar financial investments have been made by various transportation agencies in Europe, the Middle East, Asia, South Asia, and Oceania.

The strong commitment of decision-makers to reducing the negative effects of highway crashes can be seen in the Vision Zero movement (https://visionzeronetwork.org/), first introduced by the Swedish Government in 1997. This movement consists of finding new and innovative approaches and ways of thinking (i.e., shifting the responsibility for reducing crashes from road users to highway designers and engineers) for significantly reducing, if not eliminating, fatal and nonfatal injuries on highways, especially urban highways (Kristianssen et al., 2018). Vision Zero has been assertively implemented in various communities across the globe.

To respond to the increasing investment in safety-related projects and to help with the aim of reducing, if not eliminating (as per Vision Zero), highway crashes, research into methods and tools for analyzing crash data has grown exponentially over the same period. This growth has recently been documented in two scientometric overviews that visually mapped the knowledge in the field of highway safety (i.e., its key areas of research) and the impact of the research published in the leading journal Accident Analysis & Prevention (Zou and Vu, 2019; Zou et al., 2020). These authors identified "crash-frequency modeling analysis" as the core research topic in road safety studies, showing the relevance of the material covered in this textbook.

Although design and application manuals, such as the Highway Safety Manual (HSM) (AASHTO, 2010) or the Road Safety Manual (RSM) (PIARC, 2019), specialized textbooks, such as the one by Hauer (1997) on before-after studies or Tarko (2020) on surrogate measures of safety, and review papers (see Lord and Mannering, 2010; Savolainen et al., 2011; Mannering and Bhat, 2014) already exist, there is no single source that covers the fundamental (and up-to-date) principles related to the analysis of safety data. As discussed by Zou and Vu (2019), the field
of highway safety covers very wide areas of research and applications (e.g., psychology, human factors, policy, medicine, law enforcement, epidemiology). Manuals and textbooks have already been published on these topics (see Dewar and Olson, 2007; Shinar, 2007; Smiley, 2015). This textbook complements these published manuals and focuses on the actual analysis of highway safety data.

The primary purpose of this textbook is to provide information for practitioners, engineers, scientists, students, and researchers who are interested in analyzing safety data to make engineering- or policy-based decisions. The book provides the latest tools and methods documented in the literature for analyzing crash data, some of which have in fact been developed or introduced by the authors. It covers all aspects of the decision-making process, from collecting and assembling data to making decisions based on the results of the analyses. Several examples and case studies are provided to help the reader understand the models and methods commonly used for analyzing crash data. Where warranted, helpful hints and suggestions are provided by the authors to support the analysis and interpretation of crash data. The textbook is suitable for highway safety engineers, transportation safety analysts, highway designers, scientists, students, and researchers who work in highway safety. Readers are expected to have a basic knowledge of statistical principles or an introductory undergraduate-level course in statistics. This textbook specifically complements the HSM published by AASHTO and the RSM published by the World Road Association. The publication of these manuals has increased the demand for training engineers and scientists to understand the concepts and methods outlined in the HSM and the RSM.
1.2 Important features of this textbook

This textbook is needed for the following reasons:

(1) There are no manuals or textbooks that summarize, in a single document, all the techniques and statistical methods that can be utilized for analyzing crash data (although the words "crash data" are frequently used in this textbook, many methods and techniques can be applied to all types of safety data, such as surrogate measures of safety (i.e., traffic conflicts), speed-related incidents, citations, driver errors or distractions, and the like). The few manuals that cover highway safety concepts usually provide basic information, such as regression equations, figures, charts, or tables, that may not always be suitable for safety analyses. For example, transferring models from one jurisdiction to another may
not be feasible for methodological reasons. Furthermore, no manuals specifically explain how to develop crash-frequency models, crash-severity models, or data mining techniques, from the data collection procedures to the assessment of the models using data collected in the analyst's own jurisdiction.

(2) There are no textbooks that cover all aspects of safety data analyses and can be used in a teaching or classroom environment, such as data collection, statistical analyses, before-after studies, and real-time crash risk analysis, among others. This textbook can be used as a core textbook for a senior undergraduate or graduate course in highway safety. Individual chapters could also be used for senior-level undergraduate courses that cover some elements of highway safety, highway design, crash data analyses, or statistical analyses.

(3) Crash data are characterized by unique attributes not observed in other fields. These attributes include the low sample mean and small sample size problem, missing values, endogeneity, and serial correlation, among others (Lord and Mannering, 2010; Savolainen et al., 2011). These attributes can significantly affect the results of an analysis and are, to this day, often not considered in analyses conducted by transportation safety analysts. Not taking these attributes into account can lead to the misallocation of funds and, more importantly, could potentially increase the number and severity of injuries caused by motor vehicle crashes. The textbook addresses the nuances and complexity related to the analysis of crash and other types of safety data, as well as the pitfalls and limitations associated with the methods used to analyze such data.
1.3 Organization of textbook

The textbook is divided into three general areas. The first area includes chapters that describe fundamental and theoretical principles associated with safety data analyses. This area covers the nature of crash data from the human and statistical/mathematical perspectives, as well as key crash-frequency and crash-severity models that have been developed in the highway safety literature. The second area groups chapters that describe how the models described in the first area are applied for analyzing safety data. The chapters include methods for exploring safety data, conducting cross-sectional and before-after studies, and identifying hazardous sites or sites with promise, as well as tools for incorporating spatial correlation and identifying crash risk on a near real-time basis. The third area assembles alternative safety analysis tools. The methods include how to use surrogate measures of safety and data mining techniques for extracting relevant information from datasets, including those categorized as big data (e.g., naturalistic data).
1.3.1 Part I: theory and background

Chapter 2 (Fundamentals and Data Collection) describes the fundamental concepts related to the crash process and crash data analysis, as well as the data collection procedures needed for conducting these analyses. The chapter covers the crash process from the perspectives of drivers, roadways, and vehicles, and from theoretical and mathematical principles. It provides important information about sources of data and data collection procedures, as well as how to assemble crash and other related data. The chapter also describes a four-step modeling procedure for developing models and analyzing crash data, along with the methods for assessing the performance of these models.

Chapter 3 (Crash-Frequency Modeling) describes the basic nomenclature of the models that have been proposed for analyzing highway safety data and their applications. The chapter describes the most important crash-frequency models that have been proposed for analyzing crash count data, along with the important or relevant information about their characteristics. The models are grouped by their intended use and by their ability to handle specific characteristics associated with safety data. The chapter ends with a discussion about the modeling process for crash-frequency models.

Chapter 4 (Crash-Severity Modeling) introduces the methodologies and techniques that have been applied to model crash severity in safety studies. The discussion includes the different forms, constructs, and assumptions under which crash-severity models have been developed as a function of the prevailing issues related to crash data. The theoretical framework and practical techniques for identifying, estimating, evaluating, and interpreting factors contributing to crash injury severities are also explored.
1.3.2 Part II: highway safety analyses

Chapter 5 (Exploratory Analyses of Safety Data) describes techniques and methods for exploring safety data. They are divided into two general themes: (1) quantitative techniques that involve the calculation of summary statistics and (2) graphical techniques that employ figures or plots to summarize the data. Exploratory analyses help frame the selection of more advanced methodologies, such as those associated with cross-sectional analyses, before-after studies, the identification of hazardous sites, spatial correlation, and capacity and mobility.

Chapter 6 (Cross-Sectional and Panel Studies in Safety) describes different types of data and analysis methods, as well as how the models described in the previous part can be used to this end. The discussion includes data and modeling issues and presents some techniques to
overcome them. The chapter describes the characteristics of different functional forms, the selection of variables, and the modeling framework. Techniques for determining the required sample size, identifying outliers, and transferring models to other geographical areas are also presented. Lastly, a brief outline of other study designs that are not commonly used in highway safety is presented.

Chapter 7 (Before-After Studies in Safety) covers basic and advanced techniques for analyzing before and after data. The chapter describes the two critical issues that can negatively influence this type of study and the basic methods for conducting a before-after study with and without control groups. Then, the empirical Bayes and full Bayes methods in the context of before-after studies are presented. The last sections of the chapter document more recent methods, such as the naïve adjustment method, before-after studies using survival analysis, and the propensity score method. The chapter ends with a discussion about the sample size needed for conducting before-after studies.

Chapter 8 (Identification of Hazardous Sites) first discusses various hazardous site selection methods that rely on observed crashes, predicted crashes, or expected crashes, including each method's strengths and weaknesses. Then, the chapter presents geospatial hotspot methods that consider the effects of unmeasured confounding variables by accounting for spatial autocorrelation between crash events over a geographical space. The chapter also documents high crash concentration location procedures, because hazardous site selection methods may not efficiently identify the point locations where a deficiency exists. Proactive methods are then presented, as they identify sites before a crash occurs. Lastly, the screening evaluation methods are discussed in detail.

Chapter 9 (Models for Spatial Data) is dedicated to analyzing and modeling crash data within a spatial context. The chapter begins with an overview of the characteristics of spatial data and commonly used data models. Then, spatial indicators, such as Getis G and Moran's I, are introduced to help determine whether the distribution of crash locations is clustered, dispersed, or random. Next, techniques for analyzing crash point data are presented to facilitate the discovery of the underlying process that generates these points. Finally, spatial regression methods are introduced to explicitly consider the spatial dependency of crashes and the spatial heterogeneity in the relationship between crashes and their contributing factors.

Chapter 10 (Capacity, Mobility, and Safety) offers a perceptive account of one of the fastest-developing fields in highway safety analysis, involving traffic flow theory, driver behavior models, and statistical methods. The chapter first describes a theoretical car-following model to demonstrate the safety aspects of a classic driver behavior model, the
modeling of relationships between crashes and traffic volume, and how to map crash typologies to a variety of traffic regimes characterized by traffic variables. The use of Bayesian theory to predict crash probability given real-time traffic inputs, as well as real-time crash prediction models (RTCPMs), is also described. The chapter ends with a description of the motivation and methodology for developing RTCPMs from simulated traffic data when actual traffic data are not available.
1.3.3 Part III: alternative safety analyses

Chapter 11 (Surrogate Safety Measures) focuses on defining, analyzing, comparing, and applying state-of-the-art surrogate safety measures. Following a brief history of traffic conflicts, the chapter explains the basic characteristics of the traffic conflicts technique and the practice of observing and collecting traffic conflicts in the field. The chapter also covers both the pragmatic approach and the theoretical development of surrogate safety measures.

Chapter 12 (Data Mining and Machine Learning Techniques) introduces data mining and machine learning methodologies and techniques that have been applied to highway safety studies, including association rules, clustering analysis, decision tree models, Bayesian networks, neural networks, and support vector machines. The theoretical frameworks are illustrated through exemplary cases published in the safety literature and are supplemented with implementation information in the statistical software package R. The chapter ends with a description of a means of specifying the effect of an independent variable on the output, which can assist in selecting appropriate safety solutions.
1.3.4 Appendices

Appendix A describes the basic characteristics of the Negative Binomial model, the most popular model in crash data analysis (Lord and Mannering, 2010), with and without spatial interactions, and the steps for estimating the model's parameters using maximum likelihood estimation and Bayesian methods. Appendix B provides a historical description and a detailed, up-to-date list of crash-frequency and crash-severity models previously published in peer-reviewed publications (Lord and Mannering, 2010; Savolainen et al., 2011; Mannering and Bhat, 2014). Appendix C presents useful code for developing many of the models described in the textbook in the SAS, WinBUGS, and R software languages. Appendix D lists the available datasets for each chapter of this textbook. Finally, the datasets used for the examples described in various chapters are made available on
the personal website of the lead author (https://ceprofs.civil.tamu.edu/dlord/Highway_Safety_Analytics_and_Modeling.htm).
1.3.5 Future challenges and opportunities

The methods and analysis tools documented in this textbook are the accumulation of more than 40 years of research and applications in highway safety. Many of these methods and tools were introduced in this area when new methods were developed in other fields, such as statistics, econometrics, medicine, epidemiology, and the social sciences, or when methodological limitations were identified based on the unique characteristics associated with highway safety data (see Chapters 3 and 4). Despite foreseeable changes and uncertainties, the techniques and methods introduced in this textbook should not become outdated and should continue to serve as powerful tools for analyzing highway safety data. However, with the significant advancement in transportation technologies and computing power, existing methods may need to be adapted, and new ones developed, to properly measure the safety performance associated with these technologies over the next few decades.

Although the full development of connected and autonomous vehicles is several years, if not decades, away, the impacts of their deployment in mixed traffic conditions (a mixture of human-driven and automated vehicles) are not well understood. Automated vehicles have the potential to significantly reduce vehicle fatalities, but their safety benefits have so far mainly been estimated based on simulation analyses and surrogate measures of safety (Morando et al., 2018; Mousavi et al., 2019; Papadoulis et al., 2019; Sohrabi et al., 2020). On the other hand, crashes involving autonomous vehicles have been reported even with limited open-road testing, some involving fatalities. Hence, observational tools targeted at small samples would provide a more accurate and reliable picture than simulations and surrogate events.

In the new era of Big Data, a large amount of new and emerging data is becoming available (e.g., smart cities, disruptive technologies, naturalistic data, video processing). In Chapter 12 (Data Mining and Machine Learning Techniques), data mining and machine learning techniques for analyzing safety data, such as those collected from naturalistic studies or from connected vehicles (e.g., basic safety messages), are discussed. With such rich data, extracting useful and meaningful information becomes essential. Many competing techniques are capable of handling conventional safety data issues, but their strengths and limitations vary, as do their performance and results; examples include random forests versus gradient boosted trees, and convolutional neural networks versus recurrent neural networks. Understanding the
fundamentals of these techniques, which build on some of the methodological and modeling principles introduced in this textbook, will help the reader select the most appropriate method when confronted with different data issues and challenges. Despite their superior performance in handling high-dimensional data, machine learning methods have long been criticized for operating like black boxes, with no statistical inference, no measures of model goodness-of-fit, and no explicit relationships between outcomes and input variables. Hence, there is a trend to use machine learning techniques as a screening tool for a large quantity of factors, or as a clustering tool for grouping data into more homogeneous datasets, and then to apply conventional statistical models. Promising results have been reported for this combination of methods. On the other hand, with the increased use of naturalistic data in safety, new tools have been and are being developed to handle datasets that include video data, social media data, and vehicle performance data that record vehicle location, position, and kinematics every second or fraction of a second. Applying artificial intelligence methods, such as those currently used by YouTube (https://www.forbes.com/sites/bernardmarr/2019/08/23/the-amazing-ways-youtube-uses-artificial-intelligence-and-machine-learning/#1f27802e5852), for example, should be examined. As the authors, we hope this textbook will serve as a springboard for the reader to continue advancing the safety research frontier through better analytical methods.
References AASHTO, 2010. Highway safety manual, 1st Edition. In: American Association of State Highways and Transportation Officials, Washington, D.C. Blincoe, L., Miller, T.D., Zaloshnja, E., Lawrence, B.A., 2015. The Economic and Societal Impact of Motor Vehicle Crashes, 2010 (Revised). (Technical Report DOT HS 812 013). U.S. Department of Transportation, National Highway Traffic Safety Administration, Washington, D.C. Blower, D., Flannagan, C., Geedipally, S., Lord, D., Wunderlich, R., 2019. Identification of Factors Contributing to the Decline of Traffic Fatalities in the United States from 2008 to 2012. Final Report NCHRP Project 17-67. Transportation Research Board, Washington, D.C. Dewar, R.E., Olson, P.L., 2007. Human Factors in Traffic Safety, second ed. Lawyers & Judges Publishing Company, Inc., Tucson, AZ. Elvik, R., 2015. An analysis of the relationship between economic performance and the development of road safety. In: Why Does Road Safety Improve when Economic Times are Hard? ITF/IRTAD, Paris, pp. 43e142. FHWA, 2019. FHWA FY 2019 Budget. Federal Highway Administration, Washington, D.C. https://www.fhwa.dot.gov/cfo/fhwa-fy-2019-cj-final.pdf. Hauer, E., 1997. Observational BeforeeAfter Studies in Road Safety: Estimating the Effect of Highway and Traffic Engineering Measures on Road Safety. Pergamon Press, Elsevier Science, Ltd., Oxford, United Kingdom.
Kristianssen, A.-C., Andersson, R., Belin, M.-Å., Nilsen, P., 2018. Swedish Vision Zero policies for safety – a comparative policy content analysis. Saf. Sci. 103, 260–269.
Litchfield, F., 2017. The Cost of Road Crashes in Australia 2016: An Overview of Safety Strategies. The Australian National University, Canberra, Australia.
Lord, D., Mannering, F., 2010. The statistical analysis of crash-frequency data: a review and assessment of methodological alternatives. Transport. Res. Part A 44 (5), 291–305.
Lord, D., Washington, S., 2018. Safe Mobility: Challenges, Methodology and Solutions. Transport and Sustainability, vol. 11. Emerald Publishing Limited.
Mannering, F., Bhat, C.R., 2014. Analytic methods in accident research: methodological frontier and future directions. Anal. Methods Accid. Res. 1, 1–22.
Morando, M.M., Tian, Q., Truong, L.T., Vu, H.L., 2018. Studying the safety impact of autonomous vehicles using simulation-based surrogate safety measures. J. Adv. Transport. 2018, Article ID 6135183. https://doi.org/10.1155/2018/6135183 (open access).
Mousavi, S.M., Lord, D., Osman, O.A., 2019. Impact of urban arterial traffic LOS on the vehicle density of different lanes of the arterial in proximity of an unsignalized intersection for autonomous vehicle vs. conventional vehicle environments. Paper presented at the ASCE International Conference on Transportation & Development, Alexandria, VA, June 9–12.
NHTSA, 2016. Fiscal Year 2016 Budget Overview. National Highway Traffic Safety Administration, Washington, D.C.
Noland, R.B., Zhou, Y., 2017. Has the great recession and its aftermath reduced traffic fatalities? Accid. Anal. Prev. 98, 130–138.
NSC, 2018. Historical Fatality Trends. National Safety Council, Washington, D.C. https://injuryfacts.nsc.org/motor-vehicle/historical-fatality-trends/deaths-and-rates/ (Accessed 8 August 2020).
Papadoulis, A., Quddus, M., Imprialou, M., 2019. Evaluating the safety impact of connected and autonomous vehicles on motorways. Accid. Anal. Prev. 124, 12–22.
PIARC, 2019. Road Safety Manual, third ed. World Road Association, Paris, France.
Savolainen, P.T., Mannering, F.L., Lord, D., Quddus, M.A., 2011. The statistical analysis of highway crash-injury severities: a review and assessment of methodological alternatives. Accid. Anal. Prev. 43 (5), 1666–1676.
Shimu, T., 2019. Examining the Factors Causing a Drastic Reduction and Subsequent Increase of Roadway Fatalities on United States Highways between 2005 and 2016. MS Thesis, Zachry Department of Civil & Environmental Engineering, Texas A&M University, College Station, TX.
Shinar, D., 2007. Traffic Safety and Human Behavior. Emerald Group Publishing Limited, Bingley, UK.
Smiley, A., 2015. Human Factors in Traffic Safety, third ed. Lawyers & Judges Publishing Company, Inc., Tucson, AZ.
Sohrabi, S., Khodadadi, A., Mousavi, S.M., Dadashova, B., Lord, D., 2020. Quantifying Autonomous Vehicle Safety: A Scoping Review of the Literature, Evaluation of Methods, and Directions for Future Research. Working Paper, Zachry Department of Civil and Environmental Engineering, Texas A&M University, College Station, TX (submitted for publication).
Tarko, A., 2020. Measuring Road Safety with Surrogate Events. Elsevier Inc., Amsterdam, The Netherlands.
Wegman, F., Allsop, R., Antoniou, C., Bergel-Hayat, R., Elvik, R., Lassarre, S., Lloyd, D., Wijnen, W., 2017. How did the economic recession (2008–2010) influence traffic fatalities in OECD countries? Accid. Anal. Prev. 102, 51–59.
Wijnen, W., Rietveld, P., 2015. The impact of economic development on road safety: a literature review. In: Why Does Road Safety Improve when Economic Times are Hard? ITF/IRTAD, Paris, pp. 22–42.
Wijnen, W., Weijermars, W., Vanden Berghe, W., Schoeters, A., Bauer, R., Carnis, L., Elvik, R., Theofilatos, A., Filtness, A., Reed, S., Perez, C., Martensen, H., 2017. Crash Cost Estimates for European Countries. Research Report, Institute for Road Safety Research (SWOV), The Hague, The Netherlands. https://www.swov.nl/publicatie/crash-cost-estimates-european-countries.
WHO, 2015. Global Status Report on Road Safety 2015. World Health Organization, WHO Press, Geneva, Switzerland. http://www.who.int/violence_injury_prevention/road_safety_status/2015/en/.
WHO, 2018. Global Status Report on Road Safety 2018. World Health Organization, WHO Press, Geneva, Switzerland. http://www.who.int/violence_injury_prevention/road_safety_status/2018/en/.
World Bank, 2014. Transport for Health: The Global Burden of Disease from Motorized Transport. IHME, Seattle, WA; The World Bank, Washington, D.C. http://documents.worldbank.org/curated/en/984261468327002120/pdf/863040IHME0T4H0ORLD0BANK0compressed.pdf.
Zou, X., Vu, H.L., 2019. Mapping the knowledge domain of road safety studies: a scientometric analysis. Accid. Anal. Prev. 132, 105243.
Zou, X., Vu, H.L., Huang, H., 2020. Fifty years of accident analysis & prevention: a bibliometric and scientometric overview. Accid. Anal. Prev. 144, 105568.
CHAPTER 2
Fundamentals and data collection

2.1 Introduction

Crashes are very complex and multidimensional events. Although a single factor may be identified in a police report as the primary cause of a crash, several interrelated factors usually contribute to it. For example, consider a scenario in which an 18-year-old driver, traveling late at night during very windy conditions in a 20-year-old pickup truck, starts dozing off, runs off the road in a horizontal curve with low-friction pavement (due to lack of maintenance) and a radius below the design standards (but approved via a design exemption), and then hits a tree located within the designated clear zone. In this scenario, the primary factor may be identified as "falling asleep behind the wheel," but if any of the other factors were removed, the crash could have been avoided (e.g., adequate maintenance, no tree, no wind, a tangent section). In addition to showing that contributing and interrelated factors can be related to the driver, the vehicle, or the roadway, this scenario highlights that crashes are complex and probabilistic events (if they were deterministic events, we would be able to know when and where a crash would happen). Hence, all relevant events need to be analyzed with appropriate tools in order to account for the complexity and randomness of crash data.

This chapter describes the fundamental concepts related to the crash process and crash data analysis, as well as the data collection procedures needed for conducting these analyses. The first section covers the crash process from the perspective of drivers, roadways, and vehicles. The second section describes the crash process from a theoretical and
mathematical perspective. The third section provides important information about sources of data and data collection procedures. The fourth section describes how to assemble crash and other related data. The fifth section presents a four-step modeling procedure for developing models and analyzing crash data. The sixth section describes methods that can be used for evaluating models. The last section presents a heuristic method that allows models to be selected before they are fitted to the crash data.
2.2 Crash process: drivers, roadways, and vehicles

As explained earlier, the crash process is a very complex phenomenon that can involve a multitude of factors. These factors can generally be separated into three categories: the drivers (or human element), the roadways (or roadway element), and the vehicles. Over the years, studies have examined how risk factors associated with these three categories contribute to crashes. The most recent report from the National Highway Traffic Safety Administration (NHTSA), published in 2018, details in Table 2.1 the precrash causation factors (i.e., the last event in the crash causation chain) by proportion, based on a survey of 5740 crashes throughout the United States. In this table, driver errors accounted for about 94% of all the crashes, whereas roadways and vehicles each played a critical role in only 2% of all the crashes (NHTSA, 2018). Unknown causation was estimated at around 2% as well. Driver errors can be further categorized as recognition (41%), decision (33%), and performance (11%) errors. For vehicles, tire- or wheel-related (35%), brake-related (22%), and steering/suspension/transmission-related (3%) problems were the most common failures or causational factors. It should be pointed out that in 40% of the cases where the vehicle was the primary causation factor, the exact mode of failure was unknown. For roadways,
TABLE 2.1 Critical reasons for crash occurrences (NHTSA, 2018).

Crash causation | Percentage (standard error)
Drivers         | 94% (2.2%)
Vehicles        | 2% (0.7%)
Roadways        | 2% (1.3%)
Unknown         | 2% (1.4%)
Total           | 100%
slick roads (ice, debris, etc.) (50%), glare (17%), and view obstructions (11%) were the most common causation factors. Adverse weather accounted for about 8% of roadway-related factors. Unfortunately, the NHTSA report does not cover the interaction between the three categories, treating them as if they were independent of each other. In practice, however, many crash events may not have independent precrash factors, as illustrated in the example presented in the introduction. As such, it is not uncommon for drivers to be confused by elements of the roadway environment (geometric design, regulatory and commercial signs, lane occupancy, etc.), leading to critical driver errors (e.g., longer perception-reaction times, greater distraction) and eventually to a vehicular crash. In an older study, Rumar (1985) examined this interaction and found that up to 27% of crashes included both the roadway and the driver as primary precrash contributing factors (see Fig. 2.1). Unfortunately, no more recent study has examined the interaction between vehicles, drivers, and roadways as part of the precrash causational chain. With the prevalence of in-vehicle technology and distraction (e.g., talking on a cellphone, texting), such a study should be done in short order.

The Haddon Matrix, originally developed in 1970 by Dr. William Haddon, was used to better understand the mechanisms associated with nonintentional injuries and, consequently, to develop strategies (i.e., countermeasures) for reducing the number and severity of these injuries. The matrix was more specifically proposed to help identify the relative importance of risk factors and to tailor interventions to the most important and identifiable factors. Initially developed as a three-by-three matrix (that is, the three categories listed earlier), a fourth dimension that focuses on policy or the social environment was eventually added to the matrix.
FIGURE 2.1 Precrash causation factors for roadways, drivers, and vehicles (Rumar, 1985).
TABLE 2.2 Haddon matrix for urban crashes (Herbel et al., 2010).

Event     | Driver                           | Vehicle                       | Roadway                         | Social environment
Precrash  | Poor vision, speeding            | Failed brakes, worn-out tires | Poorly timed traffic lights     | Speeding culture, red light running
Crash     | Failure to use seatbelt          | Air bag failure               | Poorly designed breakaway pole  | Lack of vehicle regulations
Postcrash | Age (to sustain injury), alcohol | Poorly designed fuel tank     | Poor emergency communication    | Lack of support for EMS trauma systems
In 1980, Haddon (1980) revised the matrix to address injuries caused by motor vehicle crashes. The matrix is often used for targeting specific types of crashes, such as pedestrian crashes, urban or rural crashes, and those involving driving while intoxicated (DWI), among others. In the context of the crash process, the matrix can be very useful for classifying precrash causation factors. The matrix has three levels or categories on the vertical axis: precrash, crash (during the crash), and postcrash. It has four dimensions on the horizontal axis: the driver, the vehicle, the roadway, and the social environment. Table 2.2 illustrates an example of how the matrix can be used for crashes occurring in an urban environment.
2.3 Crash process: analytical framework

To develop and use the appropriate analytical tools for analyzing crash data, which are described in the subsequent chapters, the crash process needs to be represented in a mathematical form or as an analytical framework. This framework helps address characteristics that are specifically related to the crash generation process, some of which are covered in various chapters of the textbook. As discussed in Lord et al. (2005) and Xie et al. (2019), a crash is, in theory, the result of a Bernoulli trial. Each time a vehicle (or any other road user) enters an intersection, a highway segment, or any other type of entity (a trial) on a given transportation network, it will either crash or not crash. For consistency, a crash is termed a "success" and the failure to crash a "failure." In a Bernoulli trial, a random variable X can be defined with the following probability model: if the outcome w is a particular event (e.g., a crash), then X(w) = 1,
whereas if the outcome is a failure, then X(w) = 0. Thus, the probability model becomes

x        | 1 | 0
P(X = x) | p | q
where p is the probability of success (a crash) and q = (1 - p) is the probability of failure (no crash). In general, if there are N independent trials (vehicles passing through an intersection, road segment, etc.), each giving rise to a Bernoulli outcome, it is natural to consider the random variable Z that records the number of successes out of the N trials. Under the assumption that all trials are characterized by the same failure process (this assumption is revisited below), the appropriate probability model for a series of Bernoulli trials is the binomial distribution, given as

$$P(Z = n) = \binom{N}{n} p^n (1 - p)^{N - n} \qquad (2.1)$$

where n = 0, 1, 2, ..., N is the number of crashes or collisions (successes). The mean and variance of the binomial distribution are E(Z) = Np and VAR(Z) = Np(1 - p), respectively. Because typical motor vehicle (or other) crashes have a very low probability of occurrence over a large number of trials (e.g., million entering vehicles, vehicle miles traveled), the binomial distribution can be approximated by a Poisson distribution. Under the binomial distribution with parameters N and p, let p = λ/N, so that a large sample size N is offset by the diminution of p to produce a constant mean number of events λ. Then, as N → ∞, it can be shown that (see Olkin et al., 1980)

$$P(Z = n) = \binom{N}{n} \left(\frac{\lambda}{N}\right)^{n} \left(1 - \frac{\lambda}{N}\right)^{N - n} \approx \frac{\lambda^n}{n!} e^{-\lambda} \qquad (2.2)$$

where n = 0, 1, 2, ..., N and λ is the mean of the Poisson distribution. The approximation illustrated in Eq. (2.2) works well when the mean λ and p are assumed constant. In practice, however, it is not reasonable to assume that crash probabilities are constant across drivers and across road segments/intersections. Specifically, each driver-vehicle combination is likely to have a probability p_i that is a function of the driver (e.g., driving experience, attentiveness, mental workload, risk aversion), the roadway (e.g., lane and shoulder widths, deficiencies in design and operations, weather), and the vehicle (e.g., maintenance, safety features). All these factors (known and unknown) affect, to various degrees, the individual risk of a crash.
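Before turning to the case of unequal probabilities, the approximation in Eq. (2.2) can be checked numerically. The short R sketch below compares the binomial and Poisson probability mass functions for a small p and a large N; the values of N and λ are arbitrary choices made for illustration only.

```r
# Numerical check of Eq. (2.2): Binomial(N, p = lambda/N) vs. Poisson(lambda).
# N and lambda are arbitrary values chosen for illustration.
N <- 100000       # number of trials (e.g., vehicles entering an intersection)
lambda <- 3       # mean number of crashes
p <- lambda / N   # very small probability of a crash per trial

n <- 0:10
binom_pmf <- dbinom(n, size = N, prob = p)
pois_pmf  <- dpois(n, lambda = lambda)

# The two probability mass functions agree to several decimal places
round(cbind(n, binomial = binom_pmf, poisson = pois_pmf,
            abs_diff = abs(binom_pmf - pois_pmf)), 6)
```

For an N this large, the absolute differences are negligible (below about 10^-4), which is why the Poisson distribution is the natural starting point for modeling crash counts.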
Outcome probabilities that vary from trial to trial are known as Poisson trials (note: Poisson trials are not the summation of independent Poisson distributions; the term designates Bernoulli trials with unequal probabilities of an event). As discussed by Feller (1968), count data that arise from Poisson trials do not follow a standard distribution, but they are still considered a Poisson process. In this process, the variance, VAR(Z), is usually not equal to the mean, E(Z). In this regard, Nedelman and Wallenius (1986) have shown that unequal, independent outcome probabilities usually lead to overdispersed data (they referred to this characteristic as a convex relationship between the mean and variance). Of the 24 datasets they examined, 23 showed a convex relationship. The same characteristic has been observed with crash datasets (Abbess et al., 1981; Poch and Mannering, 1996; Hauer, 1997). Given the characteristics of Eq. (2.2), all the crash-frequency and crash-severity models described in the next two chapters serve to approximate a Poisson process with unequal probabilities of events and, in most cases, overdispersion. The "real" underlying process remains unknown to safety analysts and researchers. More details are provided in Lord et al. (2005).
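A small simulation can illustrate this overdispersion. The R sketch below lets the mean crash frequency vary from site to site by drawing it from a gamma distribution (the Poisson-gamma, or negative binomial, construction revisited in later chapters and in Appendix A); all parameter values are invented for illustration.

```r
# Illustration: heterogeneity in crash risk across sites produces overdispersion.
# Each site has its own mean lambda_i drawn from a gamma distribution; counts
# are Poisson given lambda_i, but the pooled counts show variance > mean.
set.seed(42)
n_sites <- 5000
phi <- 1.5   # gamma shape (inverse dispersion parameter); illustrative value
mu  <- 2.0   # overall mean crash frequency per site; illustrative value

lambda_i <- rgamma(n_sites, shape = phi, rate = phi / mu)  # site-specific means
y <- rpois(n_sites, lambda_i)                              # observed crash counts

mean(y)   # close to mu = 2
var(y)    # well above the mean: var = mu + mu^2/phi (about 4.7) in this setup
```

Under a constant-mean Poisson process, mean(y) and var(y) would be approximately equal; the excess variance here comes entirely from the site-to-site variation in the means.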
2.4 Sources of data and data collection procedures

To quantify safety, that is, to estimate the safety performance of entities or measure the safety effects of countermeasures, data need to be collected and analyzed. Unfortunately, in highway safety, the "best" sources of data usually involve collecting data from crashes that have already occurred (note: non-crash-based data exist and are covered in this chapter and elsewhere in the textbook). This means the data involve people who have been injured, sometimes fatally, and property, such as vehicles or roadside objects, that has been damaged. In addition to pain and suffering, traffic crashes often impose high direct and societal costs, as covered in Chapter 1 (Introduction). An example of such costs is illustrated in Table 2.3 for the United States. This table shows the latest comprehensive costs (2018) for people who may be injured in future highway-related projects (or, alternatively, the savings captured by reducing future crashes via improvements in highway design and operational characteristics or the implementation of countermeasures). These values are usually used for calculating the benefit-cost ratio of highway projects.1
¹ https://injuryfacts.nsc.org/all-injuries/costs/guide-to-calculating-costs/data-details/.
TABLE 2.3 Comprehensive crash costs (per person) (2018 dollars).

| Severity of injuries | Costs |
| --- | --- |
| Fatal (K) | $10,855,000 |
| Incapacitating (type A) | $1,187,000 |
| Nonincapacitating (type B) | $327,000 |
| Possible (type C) | $151,000 |
| No injury (property damage only or PDO) | $50,000 |

Source: NSC.¹
In this section, we will cover different sources of data used for analyzing safety. Section 2.4.1 describes traditional data that can be utilized for estimating the safety performance of highway entities and countermeasures. Section 2.4.2 covers relatively new sources of data that come from the perspective of a naturalistic driving environment. The last section describes data collected from disruptive technologies, such as those coming from smartphones.
2.4.1 Traditional data

Traditional data can generally be grouped into five broad categories: (1) crash data; (2) roadway data; (3) traffic flow data; (4) supplemental data; and (5) other safety-related data. Although other sources of safety-related data exist, such as citations or traffic conflicts (both briefly covered below), crash data remain the best source of information for understanding the safety performance of a system. Ultimately, crashes are the events that truly measure safety performance. To better understand crashes that occurred on the highway system, we also need to collect data about the characteristics of the highways under study. This includes obtaining information about the physical and operational characteristics of the highway and/or its users. Traffic flow data are primarily used for estimating the level of exposure in the system; if no traffic is present, no crashes can occur. Supplemental data refer to data that are not routinely collected by transportation agencies and are gathered manually, based on site visits or via tools such as Google Earth or Street View.

2.4.1.1 Crash data

Crash data are the fundamental type of data needed for conducting safety studies. They are usually collected from police reports, although
some transportation or law enforcement agencies may also collect self-reported forms from drivers involved in crashes that did not lead to any injuries. Over the last 20-25 years, most agencies have moved to providing the data in electronic format or databases (e.g., SAS, DBF, MS Excel). Essentially, these agencies have specially trained staff who code the data from hard copies filled out by police officers, most of whom were called to the scene of a crash. Usually, these agencies have a validation process to ensure the data are properly coded. In the United States, state agencies generally collect the data based on the guidelines outlined in the Model Minimum Uniform Crash Criteria² to maintain consistency in the numerous variables collected. To be reported in official statistics, crashes need to meet a set of criteria, such as a minimum level of damage (usually around $1200 US, but this could vary by state), at least one injured person, or, in some cases, at least one of the vehicles involved having to be towed away. A fatal injury is often defined as one in which a vehicle occupant dies within 30 days after the crash (caution: this may not be true everywhere and should be validated by the safety analyst). Given the characteristics described earlier, it is important that the safety analyst becomes familiar with the characteristics of the database that will be used for safety analyses. Table 2.4 lists some of the important crash-related variables that are relevant in analyzing the safety performance of highway entities or users. In the past, police officers would fill out the form manually, either at the scene of the crash or at the police station at the end of the shift (based on the notes taken at the scene). In recent years, reports are usually filled out electronically inside the police vehicle. The electronic crash databases usually include one line per crash, although some could include one line per vehicle involved. Each column contains information for each variable. From past experience, it is not uncommon to have files that contain more than 100 variables describing the characteristics of the crashes. It is important to point out that crash data variables collected by law enforcement agencies are often primarily utilized for determining whether the driver(s) involved in a crash will be cited or subjected to criminal charges, especially when one or more fatal injuries occurred or criminal conduct (such as DWI) happened before the crash. In many cases, variables that are irrelevant to this goal may not be adequately gathered. Hence, based on the authors' own experience, transportation agencies should establish a very good line of communication with law enforcement agencies to ensure that important variables utilized for determining the safety performance of the highway network are properly collected.
² https://www.nhtsa.gov/mmucc-1.
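As a practical illustration of becoming familiar with a crash database, the sketch below (assuming Python with the pandas package; the file name, column names, and severity codes are hypothetical, since coding conventions vary by agency) shows the kind of basic validation checks a safety analyst might run on an electronic crash file before analysis:

```python
# Basic sanity checks on a hypothetical electronic crash file.
import pandas as pd

crashes = pd.read_csv("crashes_2018.csv")  # one row per crash (hypothetical)

# Each crash report should have its own unique identification number.
assert crashes["crash_id"].is_unique, "duplicate crash IDs found"

# Severity should follow the KABCO scale; flag anything else.
valid_severity = {"K", "A", "B", "C", "O"}
bad = crashes[~crashes["severity"].isin(valid_severity)]
print(f"{len(bad)} records with unrecognized severity codes")

# Dates should parse; failures often reveal coding errors.
crashes["crash_date"] = pd.to_datetime(crashes["crash_date"], errors="coerce")
print(f"{crashes['crash_date'].isna().sum()} records with unparseable dates")
```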
TABLE 2.4 Important variables collected from crash data.

| Variable | Description |
| --- | --- |
| Identification number | Each crash report should have its own identification number. This ensures that each crash is unique and can be easily traced. |
| Location | The location can be identified using a linear system, such as control-section mile point on predefined maps maintained by the transportation agency. More recently, most agencies have been reliably coding crash data using geographic information system (GIS) technology. |
| Date and time | These two variables can be used to assign crashes to different seasons and to determine whether the crash occurred during nighttime, dusk, dawn, or daytime conditions. |
| Severity | This is used to characterize the most severe injury among all the occupants or vulnerable road users (pedestrians or bicyclists). For example, if a crash has three injuries, one incapacitating (type A) and two possible injuries (type C), the crash will be classified as an incapacitating injury (type A) crash. |
| Collision type or manner of collision | This variable describes the characteristic of the crash, such as right-angle, sideswipe, or left-turn/through collision. |
| Direction of travel | This variable explains the direction or trajectory of each vehicle or road user involved in the crash. |
| Alcohol or drugs | This variable explains whether any of the drivers or vulnerable road users was under the influence of alcohol or drugs. This variable will often be updated in the report after the crash to account for the time needed to get laboratory results back. |
| Vehicle occupants | This variable describes the gender and age of each vehicle occupant or road user. It may include the legal driving and insurance statuses of drivers. |
| Vehicles involved | This one describes the characteristics of each vehicle. This variable defines the crash as being a single-vehicle or multivehicle event. |
| Narratives | This section of the report is usually not coded electronically. However, the narrative is very important, as it provides information about the crash process (based on the testimony of witnesses and the visual assessment of the officer). It is usually accompanied by one or more figures or sketches that help explain what happened. Based on the authors' experience, many research projects have involved the review of these narratives for validating the electronic databases. This is a very time-consuming and costly process. |
2.4.1.2 Roadway data

Roadway data provide information about the design and operational characteristics of highway segments and intersections. Transportation agencies now also maintain these kinds of data electronically. Table 2.5 lists common variables that are found in these databases.

TABLE 2.5 Common variables found in roadway data.

• Location (control-section mile point or geographical coordinates)
• Highway classification (freeway, arterials, etc.)
• Segment length
• Type of pavement
• Type of lane and width
• Type of shoulder and width
• Type of median and width
• Number of lanes
• Divided/undivided
• Traffic control at intersections
• Speed limit
• Road alignment (tangent, curve)
• Road surface condition
• Right-of-way width
• Lighting
• Parking

2.4.1.3 Traffic flow data

Traffic data provide information about users traveling on the facilities under study. For segments, the traffic flow represents the number of vehicles, bicyclists, or pedestrians that travel on the segment. For intersections, it represents the number of vehicles, bicyclists, or pedestrians that enter the intersection. Usually, transportation agencies collect the traffic flow data from manual and automatic counters on their highway network. These data are also available electronically and are usually separated by year (i.e., each year has its own file). It is important to note that many traffic flow data are actually estimated values that are extrapolated using expansion factors based on when and where the traffic counts were performed on the network, such as the day of the week and the month of the year. Table 2.6 summarizes key traffic flow variables.

TABLE 2.6 Traffic flow data.

• Location (control-section mile point or geographical coordinates)
• Annual average daily traffic/AADT (vehicles/day)
• Average daily traffic/ADT (vehicles/day)
• Traffic mix (heavy vehicles, motorcycles, passenger cars, etc.)
• Speed distribution
• Short counts (hourly volumes, 15-min values, etc.)
• Vehicle occupancy (on instrumented urban freeway segments)
• Traffic density (on instrumented urban freeway segments)
• Turning movements at intersections

As discussed earlier, for intersections, entering flows are usually used as input variables for the crash-frequency and crash-severity models described in the subsequent chapters. Fig. 2.2 describes how the traffic flows are assigned to each (undivided or single traveled way) leg of a 4-legged intersection (note: 3-legged intersections work in the same manner). The figure shows values for the annual average daily traffic (AADT) for each street and the entering flow in vehicles per day for each leg, which is basically the AADT divided by two.

FIGURE 2.2 Entering flows in vehicles per day (AADT).

2.4.1.4 Supplemental data

Although transportation agencies collect a wide array of data, researchers and safety analysts often need to collect (traditional) data that are not routinely available. On other occasions, supplemental data could be collected for validating the data provided by these agencies. Traffic flow (collected by the analysts themselves) and speed limits are examples of variables that have been collected for validation purposes. Other supplemental data could include, for example:
• Number and types of driveways located in urban or rural areas
• Pedestrian and bicycle traffic flow counts
• Side slope along rural two-lane highways
• Superelevation on horizontal curves
• Deflection angles on horizontal curves
• Pavement friction
• Length of clear zones
• Location of roadside devices (e.g., longitudinal barriers)
Below are useful methods or tools that have been used to provide supplemental data.

Site visits: In many safety-related projects, site visits are commonly performed to collect supplemental data, such as pedestrian and bicycle counts at urban or suburban intersections. Site visits can also include the use of specially equipped vehicles that can collect on- and off-road data. Fig. 2.3 shows, for example, a screenshot of the Dewetron data collection system. This system can measure lateral clearance using LiDAR (top right corner of the screen). It also includes a video signal recorded by the cameras, as shown in the middle of the screen (forward camera on the left and side camera on the right), and allows the recording of the roadway profile created by the GPS signal, as shown at the bottom right.

FIGURE 2.3 Screenshot of the Dewetron data collection system (Lord et al., 2011).

Google Earth: This is a powerful mapping service software program that offers satellite views of essentially any location around the earth. It can provide aerial views for several years, distance measurements, horizontal curve radius measurements, and the location of railway lines, among others. The program can be used to collect variables such as driveway densities and to validate lane and shoulder widths, turning radii at intersections, etc. Fig. 2.4 shows a typical satellite image view in Google Earth.

FIGURE 2.4 Satellite view of a divided rural arterial and the location of driveways. Image Credit: Google Earth Mapping Service.

Street View: This tool is attached to Google Earth and allows the user to see the highway from the driver's perspective. The tool can be used to record the location and type of traffic control devices (such as signal type or rumble strip presence), the location of roadside objects, the severity of sideslopes, etc. Street View has been useful for collecting supplemental data without having to conduct site visits. Fig. 2.5 shows an image from the driver's perspective similar to typical images available in Street View (Google does not allow showing their Street View images in textbooks).

FIGURE 2.5 Image similar to Google's Street View.

Video Recording and Processing: Over the last 5-10 years, video recording has become increasingly useful for collecting safety data. Video recording has usually been utilized for collecting traffic conflicts (see Chapter 11 - Surrogate Safety Measures) and driver or pedestrian behavior in urban environments (say, crossing paths at intersections) as well as on fully instrumented urban freeways (e.g., closed-circuit TV). Usually, hours of video are recorded and then manually processed in a laboratory or back at the analysis center. Fig. 2.6 shows cyclist motion patterns at two intersections in Montreal, Canada, which were extracted through a video recording process.

FIGURE 2.6 Cyclist motion patterns at two intersections in Montreal, Canada (Niaki et al., 2019).

2.4.1.5 Other safety-related data and relevant databases

As part of traditional data, other categories of data are used for analyzing safety. Some of these data include:

• Citation records
• Hospital data (note: in the United States, there are some privacy issues that impede obtaining these kinds of data)
• Driver data (from governmental licensing agencies)
• Land use
• Demographics and population statistics
• Traffic conflicts (discussed in Chapter 11)
• Microsimulation output (conflicts, jerk, deceleration rate, etc.)
• Precipitation data (from the National Oceanic and Atmospheric Administration)
• Pavement friction data (from a Pavement Management Information System)
• Vehicle registration and driver records (from state departments of motor vehicles)

The data described earlier are not directly based on crashes that occurred on the system, with the exception of hospital data to some degree. However, these kinds of data have been used in the past to measure, for instance, the safety risk of road users when they are combined with crash data (e.g., regional-based crash-frequency models). Table 2.7 shows a list of potential databases that can be used for collecting data and conducting safety analyses. Some of these databases are available to the public, while others are only available to governmental employees or researchers, or may require special permission to access the data. In many cases, the person requesting the data needs to fill out forms before accessing them. The list, presented in Table 2.7, includes some of the publicly available safety databases, along with the name of the agency and the region or country. Although somewhat dated now, Montella et al. (2012) evaluated different crash databases around the world.
TABLE 2.7 Sample of national and regional public databases.

| Database | Agency | Region |
| --- | --- | --- |
| Highway Safety Information System (HSIS), http://www.hsisinfo.org/ | FHWA | USA |
| Fatality Analysis Reporting System (FARS), https://www-fars.nhtsa.dot.gov/Main/index.aspx | NHTSA | USA |
| National Automotive Sampling System (NASS), https://www.nhtsa.gov/national-automotive-sampling-system-nass/nass-general-estimates-system | NHTSA | USA |
| General Estimates System (GES), https://www.nhtsa.gov/national-automotive-sampling-system-nass/nass-general-estimates-system | NHTSA | USA |
| Crashworthiness Data System (CDS), https://www.nhtsa.gov/national-automotive-sampling-system/crashworthiness-data-system | NHTSA | USA |
| Crash Outcome Data Evaluation System (CODES), https://www.nhtsa.gov/crash-data-systems/crash-outcome-data-evaluation-system-codes | NHTSA | USA |
| Bureau of Transportation Statistics, https://www.bts.gov/content/motor-vehicle-safety-data | BTS | USA |
| Mobility and Transport - Road Safety, https://ec.europa.eu/transport/road_safety/specialist/statistics_en | European Commission | EU |
| Statistics Norway, https://www.ssb.no/en/transport-og-reiseliv/statistikker/vtu | Gov Norway | NO |
| Transport Analysis, https://www.trafa.se/en/road-traffic/road-traffic-injuries/ | Gov Sweden | SE |
| CBS Open Data Online (The Netherlands), https://opendata.cbs.nl/statline/portal.html?_la=en&_catalog=CBS&tableId=81452ENG&_theme=1160 | CBS | NL |
| Open Data Portal, https://www.data.qld.gov.au/dataset/crash-data-from-queensland-roads and https://www.webcrash.transport.qld.gov.au/webcrash2 | Gov Queensland | AU |
| National Collision Database Online, https://wwwapps2.tc.gc.ca/Saf-Sec-Sur/7/NCDB-BNDC/p.aspx?l=en | Transport Canada | CA |
| Road Accidents - OECD, https://data.oecd.org/transport/road-accidents.htm | OECD | EU |
| Road Safety Statistics, http://www.dgt.es/es/seguridad-vial/estadisticas-e-indicadores/ | DGT | ES |
| Statistics of Traffic Accidents in Kaohsiung City, https://data.gov.tw/dataset/127489 | Government of Taiwan | TW |
| A2 Road Traffic Accident in New Taipei City, https://data.gov.tw/dataset/125657 | Government of Taiwan | TW |

2.4.2 Naturalistic driving data

Naturalistic driving data are data that come from instrumented vehicles in which drivers are given no special instructions about how they should drive, nor are outside observers present when they travel in the vehicle. This means that the data collection procedure is considered unobtrusive (Dingus et al., 2006). The goal is to collect data in a "natural" environment, with the hope that the instrumented vehicles do not influence the behavior of drivers. In most cases, the owners of the vehicles have agreed to have their vehicles equipped with all sorts of sensors that measure the forces acting on the vehicle, cameras looking both inside and outside the vehicle, LiDAR(s) that can measure distances and relative speeds with respect to other vehicles and fixed objects, and a GPS unit that locates the vehicle on the highway system (after the vehicle moves away from home, for privacy reasons).
The first study on this topic was known as the 100-car naturalistic study, which examined drivers located in the Northern Virginia-Washington, D.C. metro area in the early 2000s (Dingus et al., 2006). The data were collected over a year-and-a-half period. Approximately 2,000,000 vehicle-miles of driving and about 43,000 h of data were recorded. This study was initially designed as a pilot program to evaluate the data collection procedure, study design methodologies, and potential tools for analyzing the data for a future larger-scale study. The 100-car study eventually led to the much larger Strategic Highway Research Program (known as SHRP 2³) Naturalistic Driving Study. As stated by the National Academy of Sciences (NAS⁴): "The central goal of the SHRP 2 Safety research program was to address the role of driver performance and behavior in traffic safety. This included developing an understanding of how the driver interacts with and adapts to the vehicle, traffic environment, roadway characteristics, traffic control devices, and the environment. It also included assessing the changes in collision risk associated with each of these factors and interactions. This information will support the development of new and improved countermeasures with greater effectiveness." The study involved more than 3000 vehicles located in six cities across the United States. The data file contained approximately 35 million vehicle-miles, 5.4 million trips, 2705 near-crashes, 1541 crashes, and more than one million hours of video (NAS, 2014). That study also included detailed roadway data collected on 12,538 centerline miles of highways in and around the study sites, which could be matched with the drivers who traveled on these segments. Other naturalistic driving research that has been performed across the world includes the UMTRI naturalistic driving study,⁵ the UDRIVE European naturalistic driving study,⁶ and the Australian Naturalistic Driving Study.⁷ Some of the data are available to the public and researchers, but users may need special permissions, such as approval by a researcher's Institutional Review Board. Naturalistic driving studies provide unique data that can help study driving behavior in a natural environment. Unfortunately, the amount of
³ https://www.shrp2nds.us/index.html.
⁴ http://www.trb.org/StrategicHighwayResearchProgram2SHRP2/SHRP2DataSafetyAbout.aspx.
⁵ http://www.umtri.umich.edu/our-focus/naturalistic-driving-data.
⁶ https://results.udrive.eu/.
⁷ http://www.ands.unsw.edu.au/.
data collected can be extremely large (these types of data are also known as "Big Data"). For example, the total collected data for SHRP 2 have required approximately 1.5 Petabytes (PB) of archival storage, 700 Terabytes (TB) of parametric data (sensors, etc.), and over 1.2 PB of video storage (NAS, 2014). The entire database contains millions of files. Analyzing such datasets can be very challenging, as traditional analytical tools, such as traditional crash-frequency or crash-severity models, are usually not adequate. Alternative tools are therefore needed for extracting information from extremely large datasets. Chapter 12 - Data Mining and Machine Learning Techniques provides data-driven tools that can be used for properly analyzing datasets categorized as Big Data.
2.4.3 Disruptive technological and crowdsourcing data

In general terms, disruptive technology refers to an innovation that significantly alters the way consumers, industries, or businesses operate and that usually has much better attributes than the older technology it replaces (Smith, 2020). Smartphones are one such technology (as compared to regular cell phones) (Appiah et al., 2019). Smartphones can provide a large amount of data that can be used for safety analyses, including deceleration rates, locations of crashes, where and when a driver was texting while driving, and the speed traveled by a vehicle on the highway network, among others (see, e.g., Bejani and Ghatee, 2018; Kanarachos et al., 2018; Stipancic et al., 2018). Such data are collected by third-party companies or vendors (who buy data directly from cellphone service providers) and sold to private and public agencies as well as private citizens/researchers. Streetlight Data and Safe 2 Save LLC are examples of such companies.

Crowdsourcing is defined as obtaining data or information from a large group of people. Over the last few years, some have used crowdsourced data for analyzing highway safety. Data can be collected from social media platforms such as Twitter, Reddit, Waze, Instagram, or Facebook. Similar to disruptive technological data, third-party vendors extract data from these platforms (often for a fee) and sell them to public and private agencies or anyone else who requests the data. These data can be used to validate police report data or to identify crashes not commonly collected by transportation agencies because they do not meet the minimum reportability criteria (see, e.g., Flynn et al., 2018; Goodall and Lee, 2019; Li et al., 2020).
2.4.4 Data issues

Traditional safety data exhibit unique characteristics that are not found in other types of data, such as those related to crime (Levine, 2008) or power failures caused by hurricanes (Guikema et al., 2014). Some of these issues are attributed to the huge cost associated with collecting crash and other related data (Lord and Bonneson, 2005; Lord and Mannering, 2010). Safety analysts should be made aware of these issues, as they can negatively influence the performance of crash-frequency and crash-severity models. For a full description of these issues, the reader is referred to Chapter 6 - Cross-Sectional and Panel Studies in Safety as well as Lord and Mannering (2010) and Savolainen et al. (2011).
2.5 Assembling data

After a crash is reported, the law enforcement agency investigates and completes the crash report with all factual information. If some information is unknown (such as driver distraction), law enforcement officials use their best judgment and record their considered opinions based on their investigation. As they collect an extensive list of variables, the data are usually stored in different electronic files, such as crash, vehicle, driver/passenger, citation/adjudication, and EMS/injury surveillance files. Each of these electronic data files contains a unique identifier (such as a crash ID or number) with which they can be combined. As described earlier, crash data may not necessarily provide a complete picture of the safety performance of an entity. They must therefore be combined with other data sources (such as traffic data, road inventory data, and vehicle registrations) for further investigation and analysis. For example, the main reason for an abundance of roadway departure crashes could be a narrow shoulder, a sharp curve, or fixed objects located within the clear zone, and it may not be possible to know this unless the crash data are combined with roadway and roadside data. The process of combining or merging multiple data sources is often called data assembly or integration. Deterministic and probabilistic integration are the two most common procedures used when assembling traffic safety data.

Most of the databases provided by state agencies use linear referencing, which allows roadway and traffic attributes to be stored individually and to be defined by the route, along with the start and end reference markers (such as mileposts). The primary advantage of linear
referencing is that it allows for a very detailed delineation of features along a route without breaking the route into very small segments. Crashes are also referenced using the same system. Therefore, crashes can be merged using the location code to exactly match records belonging to the same point on the roadway. This kind of data assembly is called deterministic integration because it relies on common elements shared among the datasets to make exact matches. Many agencies have already started identifying different data elements in a geographic information system (GIS) environment (using latitudes and longitudes). This makes the integration of databases attractive, as a map-based interface can effectively present and interpret data. In instances where an exact match cannot be made using unique identifiers, probabilistic integration, which relies on similar elements and values shared among the datasets to make matches, is often used. For instance, crash records often need to be linked to data collected by emergency medical services, hospital emergency departments, and hospital admission and discharge files. All these files are subject to strict confidentiality rules and regulations, which prevents, for example, merging databases by the names of the vehicle occupants who were injured and hospitalized. The probabilistic linkage instead uses common fields between databases, such as incident longitude/latitude, date, age/date of birth (if available), time of admission at the hospital, and seat position, among others.
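The two integration procedures can be illustrated with a short sketch (a simplified example assuming Python with the pandas package; all file and column names are hypothetical). Deterministic integration is an exact join on shared linear-referencing keys, while probabilistic integration scores candidate record pairs on similar fields and keeps the most plausible matches:

```python
import pandas as pd

# Deterministic integration: exact match on route, then locate each crash
# within the segment whose begin/end mileposts contain its milepost.
crashes = pd.read_csv("crashes.csv")   # route, milepost, date, hour, lat, lon
roadway = pd.read_csv("roadway.csv")   # route, begin_mp, end_mp, lanes, ...

merged = crashes.merge(roadway, on="route")
merged = merged[(merged["milepost"] >= merged["begin_mp"])
                & (merged["milepost"] < merged["end_mp"])]

# Probabilistic integration: no shared ID between crash and EMS records, so
# score candidate pairs on agreement across common fields (an ad hoc score
# for illustration; production linkage software uses formal match weights).
def match_score(crash, ems):
    score = 0.0
    score += 1.0 if crash["date"] == ems["incident_date"] else 0.0
    score += 1.0 if abs(crash["lat"] - ems["lat"]) < 1e-3 else 0.0
    score += 1.0 if abs(crash["lon"] - ems["lon"]) < 1e-3 else 0.0
    gap = abs(crash["hour"] - ems["admit_hour"])  # hours to hospital admission
    score += max(0.0, 1.0 - gap / 6.0)            # closer in time = more plausible
    return score
```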
2.6 4-stage modeling framework

This section describes a general four-stage modeling process that can be used for developing statistical models for crash data analyses. These stages are applicable to crash-frequency models (Chapter 3) and crash-severity models (Chapter 4). In the safety literature, statistical models have often been defined as or called safety performance functions (SPFs) and crash prediction models (CPMs). The former term is used in the AASHTO Highway Safety Manual (HSM) (AASHTO, 2010) and publications associated with the manual. In this textbook, we refer to all the models as statistical models, either crash-frequency or crash-severity models, unless the methodology is specifically tied to the HSM, in which case the model may be referred to as an SPF or a CPM.
2.6.1 Determine modeling objective matrix

The first step in developing statistical models is to lay out the objectives of the modeling effort. The main considerations in this step include
application needs (e.g., prediction, screening variables, etc., as described in Chapter 3 - Crash-Frequency Modeling), project requirements, data availability, logical scales (both spatial and temporal) of modeling units and their definitions, and the range, definition, and unit of key input and output variables. The latter characteristics are described in greater detail in Chapter 5 - Exploratory Analyses of Safety Data. Table 2.8 lists an example of a matrix describing the modeling objectives. This table shows how the highway network is divided into segments and intersections, and the outcome of potential models. For this hypothetical project, crash-frequency, crash-severity, and collision-type statistical models will be estimated, but crash cost will not be included in the analysis for segments and intersections. For ramps, only crash-frequency models will be developed.

It is critically important in this step to determine the logical scales of modeling units and their definitions, as well as the range, unit, and definition of key input and output variables. For instance, it is important to have a spatial and physical definition of intersections and segments and the exact types of traffic crashes (e.g., intersection, intersection-related, pedestrian-involved, or animal-involved crashes) to be assigned to each observational unit. These units may be associated with the highway network, but could also be related to the analyses of drivers, vehicle occupants, or pedestrians. The range of traffic flows can be used as another example. There is a need to determine the range of flows and geometric characteristics of interest to the study (e.g., AADT = 200-20,000 veh/day) and to make sure commensurable data can be obtained and enough data can be collected. The time unit of analysis (i.e., number of crashes per unit of time) is another critical element when developing statistical models. Whether one uses crashes per month, per year, per 3-year period, etc., will have considerable effects on modeling assumptions and consequently on model interpretation and applicability. As discussed in Chapter 3 - Crash-Frequency Modeling (and in Lord and Geedipally, 2018), using a small time or space scale could "artificially" increase the proportion of zero responses in the dataset. This could lead to erroneously selecting an inappropriate statistical model for analyzing such datasets.

TABLE 2.8 Modeling objective matrix.

| Highway entity | Crash frequency | Crash severity (KABCO) | Crash frequency by collision type | Crash cost |
| --- | --- | --- | --- | --- |
| Intersections | Y | Y | Y | N |
| Segments | Y | Y | Y | N |
| Ramps | Y | N | N | N |
2.6.2 Establish appropriate process to develop models

This step is used to ensure that the best possible statistical models are developed to achieve the modeling objectives identified in the previous step. This includes ensuring that (1) data sources and limitations, sampling design, and statistical, functional, and logical assumptions are clearly spelled out; (2) supporting theories are properly defended and/or cited (i.e., goodness-of-logic, as discussed by Miaou and Lord, 2003); (3) models are systematically developed and tested; and (4) modeling results are properly interpreted. Typical modeling procedures employed in developing statistical models can be grouped into five major processes: (1) establish a sampling model (such as those used in surveys with weight factors or stratified data); (2) choose an observational model (or conditional model) (note: most crash-frequency and crash-severity models fall into this category); (3) develop a process/state/system model (e.g., hierarchical/random-effects models); (4) develop a parameter model (for the Bayesian method and, to some degree, random-parameters models); and (5) construct model interrogation methods (e.g., interrogating theoretical models), including model comparison, sensitivity or robustness analysis, and specification tests, among others.
2.6.3 Determine inferential goals

The inferential goals determine whether a point prediction combined with a simple estimate of its standard error (i.e., the maximum likelihood estimation method or MLE), an interval prediction (e.g., 2.5 and 97.5 percentile "credible" intervals using the Bayesian method), or a full probability distribution for the prediction is needed (also based on the Bayesian method). As will be discussed in the next section, more detailed inferential goals require more sophisticated computational methods to fully capture the sampling variations in producing estimates and predictions. Fig. 2.7 shows an example of a posterior distribution generated by the WinBUGS software (Lunn et al., 2000) for the inverse dispersion parameter of an NB model developed from 868 signalized 4-legged intersections in Toronto, ON, using the Bayesian estimation method. The posterior mean, standard deviation, and median of the distribution were 7.12, 0.67, and 7.03, respectively. The inverse dispersion parameter (point estimate) originally estimated using the MLE was 7.15 (0.63) for the same dataset (Lord, 2000).
FIGURE 2.7 Posterior distribution for the inverse dispersion parameter.
2.6.4 Select computational techniques and tools

This is the process where Frequentists (analysts who use the likelihood-based method or MLE; McCullagh and Nelder, 1989) and analysts who use the Bayesian method (Carlin and Louis, 2008; Gelman et al., 2013) are likely to differ in their estimating approaches and in their use of different "stochastic approximations" to reduce the computational burden. Many statistical programs are now available for estimating the coefficients of statistical models, for both the Bayesian and the MLE methods, that fall under the exponential family of probability distributions (e.g., the Poisson model). More demanding inferential goals require more sophisticated computational methods to fully capture the sampling variations in producing estimates and predictions. Taking advantage of the unprecedented computing power available today, simulation-based methods, including various bootstraps and Markov Chain Monte Carlo (MCMC) methods (Gilks et al., 1996), have been particularly popular in the statistical community over the last 20 years, regardless of whether the MLE or the Bayesian estimating method is considered. The characteristics of the likelihood-based and Bayes methods are described below. Note that in the highway safety literature, crash-frequency and crash-severity models estimated using the Bayes method are often called "Full" Bayes (FB) models (Miaou and Lord, 2003). This terminology is used to distinguish models estimated using the Bayes method from techniques that employ the empirical Bayes (EB) method. The EB method is covered in Chapter 7 - Before-After Studies in Safety and Chapter 8 - Identification of Hazardous Sites.

2.6.4.1 The likelihood-based method

Under this method, one estimates the parameters by maximizing the likelihood function. The likelihood function is nothing more than the joint distribution of the observed data under a specified model, but it is seen as a function of the parameters, with fixed data. For example, when crash data are assumed to follow an NB distribution, the most popular
model used in highway safety (Lord and Mannering, 2010; details can be found in Appendix A), the likelihood is as follows:

$L(\beta) = \prod_{i=1}^{N} NB(y_i; \beta)$  (2.3)
where $y_i$ is the response variable for observation i; $\beta$ is a $p \times 1$ vector of estimable parameters; $x_i'$ is a vector of explanatory variables; and p is the number of parameters in the model. The parameters can be obtained by maximizing the likelihood using the Newton-Raphson or other search techniques, such as Fisher scoring (McCullagh and Nelder, 1989). Generally, the likelihood function is not directly optimized, for efficiency reasons. Instead, its logarithm, called the log-likelihood (LL), is preferred. As the likelihood function, under the independent and identically distributed (i.i.d.) assumption, is the product of the sampling distributions of the individual data points (as shown in Eq. 2.3), taking the logarithm converts the product into a sum of log densities. A sum of logs is numerically more stable than the log of a product. The Hessian of the LL obtained at the point of convergence is often used to report the standard errors of the parameters and to perform model selection. Appropriate distributional assumptions, such as the asymptotic normality of the test statistics, have to be satisfied for performing valid hypothesis testing on the parameters estimated by the MLE method.

In the above simplified model, besides the crash history $y_i$, no other data are used. However, typically, crash counts from several observation periods (e.g., years) are collected over presumably static roadway, intersection, or other entity conditions. It is fairly trivial to incorporate such additional data into the modeling process by assuming that the mean response depends on them. That is, the conditional mean is considered an unknown function of the covariates, as given in the following equation:

$E(y_i | \beta) = \mu_i = f(x_i' \beta)$  (2.4)

The application of the MLE method is now commonly implemented in all widely used statistical programs, such as SAS, Genstat, R, Python, and STATA. The theory associated with generalized linear models, the foundation for crash-frequency models, is also well covered in seminal textbooks on this topic (Cameron and Trivedi, 2013; Hilbe, 2011, 2014). For crash-severity models, the theory is well covered in Train (2009).
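As a concrete illustration of this estimation process, the sketch below (a minimal example using Python with the numpy and statsmodels packages; the data are simulated rather than drawn from a real network) fits an NB2 crash-frequency model by MLE:

```python
# Fit an NB2 crash-frequency model by maximum likelihood on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500
aadt = rng.uniform(2_000, 40_000, n)               # simulated traffic flows

# "True" model: mu = exp(b0 + b1*ln(AADT)) with gamma-distributed heterogeneity.
mu = np.exp(-6.0 + 0.8 * np.log(aadt))
phi = 2.0                                          # inverse dispersion parameter
y = rng.poisson(mu * rng.gamma(phi, 1.0 / phi, n)) # NB-distributed counts

X = sm.add_constant(np.log(aadt))                  # design matrix [1, ln(AADT)]
res = sm.NegativeBinomial(y, X).fit(disp=0)        # Newton-type optimization

print(res.params)  # intercept, ln(AADT) coefficient, and dispersion alpha
print(res.llf)     # maximized log-likelihood, used by the GOF measures in Section 2.7
```

The estimated alpha is the NB2 dispersion parameter, that is, the inverse of the inverse dispersion parameter discussed elsewhere in this chapter.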
Appendix A shows how the characteristics of the MLE are used for estimating NB models.

If panel or longitudinal data are used (data collected over time, as discussed in Chapter 6 - Cross-Sectional and Panel Studies in Safety), there is a strong probability that the observations will be correlated in time. In other words, the same observation is measured at different points in time (note that some researchers have labeled such datasets as data with repeated measurements). It has been shown that this kind of dataset will most likely create a temporal correlation that will influence the inferences associated with the parameter estimates (Diggle et al., 1994). To handle the temporal or serial correlation, generalized estimating equations (GEE) have been proposed for handling panel data. The GEE approach is also likelihood-based (more specifically, quasi-likelihood, which does not require the assumption of normality), but is designed specifically to handle the temporal correlation. Generally, the safety analyst needs to specify the covariance matrix, which can be defined as Independent, Exchangeable, Auto-Regressive Order One, or Unstructured. The models are estimated using quasi-likelihood estimators via an iterative process (note: quasi-likelihood equations are also called generalized estimating equations, hence the name GEE). Not accounting for the serial correlation often underestimates the standard errors of the parameters (Lord and Persaud, 2000; Hardin and Hilbe, 2013).

2.6.4.2 The Bayesian method

Under the MLE method, the likelihood function is solely responsible for encoding the knowledge about the model. However, in many cases, a safety analyst may know something about the problem even before collecting the data, often dubbed prior knowledge or expert knowledge. The Bayesian paradigm formally combines the prior knowledge and the likelihood via the Bayes rule: the posterior belief is proportional to the product of the prior belief and the likelihood. It is expressed as

$P(m|y) = \frac{P(m) P(y|m)}{P(y)} \propto P(m) P(y|m)$  (2.5)
where P(m) is the prior distribution of the parameters, P(y|m) is the likelihood function, and P(m|y) is the posterior distribution. The normalization constant P(y), which ensures that the posterior has a valid density, is only a function of the data. Another way to think about the Bayesian paradigm is that the prior belief is updated in light of evidence (collected
via the data), resulting in the posterior belief. It is the posterior belief that is of interest for inference. Inference is typically carried out by generating approximate samples from the posterior density using MCMC techniques. When only point estimates are sufficient or computational time is a concern, it is not uncommon to use Variational Inference or MAP (maximum a posteriori) estimates. The MAP estimate is the MLE analog in the Bayesian setting, where the posterior distribution is maximized instead of the likelihood. Models elicited under the Bayesian paradigm are typically framed as hierarchical or multilevel models. In highway safety, they are often defined as hierarchical Poisson-mixed models (for crash-frequency models) or simply as FB models, as explained earlier. Such a hierarchical modeling framework can be defined as follows:

(i) $y_i | u_i \sim \text{Poisson}(u_i) \iff y_i | u_i \sim \text{Poisson}(\mu_i e^{\varepsilon_i})$  (2.6a)
(ii) $e^{\varepsilon_i} | \eta \sim \pi_\varepsilon(\eta)$  (2.6b)
(iii) $\eta \sim \pi_\eta(\cdot)$  (2.6c)
where $u_i$ is the Poisson mean for observation i, and $\pi_\varepsilon$ is the prior distribution assumed for the unobserved model error ($e^{\varepsilon_i}$), which depends on the hyperparameter $\eta$ with hyper-prior $\pi_\eta$. Moreover, the parameters $\mu_i = f(x_i' \beta)$ and $\eta$ are assumed to be mutually independent (Rao, 2003). Various prior choices can be considered for modeling the parameters $e^{\varepsilon_i}$ and $\eta$. Depending on the specification of the priors $\pi_\varepsilon(\cdot)$ and $\pi_\eta(\cdot)$, different alternative hierarchical models can be defined. For the hierarchical NB (HNB) model, we specify a gamma prior on $e^{\varepsilon_i}$ with equal shape and scale parameters ($\phi$). This leads to the following error function:

$e^{\varepsilon_i} | \phi \sim \text{gamma}(\phi, \phi)$ and $\phi \sim \text{gamma}(a, b)$  (2.7)
Instead of assuming a gamma distribution as the prior for $e^{\varepsilon_i}$, the lognormal distribution can be used as an alternative. With this prior choice, the hierarchical Poisson-lognormal model, another very popular model in highway safety, is derived by assuming a proper hyper-prior for the parameter $\sigma^2$ such that (Lord and Miranda-Moreno, 2008):

$\varepsilon_i = \log(e^{\varepsilon_i}) | \sigma^2 \sim \text{Normal}(0, \sigma^2)$ and $\sigma^2 \sim \text{gamma}(a, b)$  (2.8)

The choice of prior for the parameter $\sigma^2$ relies on the fact that the conjugate distribution of the Normal distribution is the Inverse-gamma.
Convenient priors are conjugate distributions that produce full conditional posteriors of the same form. Furthermore, the hyper-prior parameters a and b have fixed values and must be specified by the safety analyst. Different from the MLE, which subscribes to the Frequentist paradigm, inference in the Bayesian setup is easy to interpret, as every quantity of interest is a probability statement; a case in point is credible intervals versus confidence intervals for quantifying the uncertainty of the model parameters. As the complete joint distribution of all model parameters is available, any question concerned with the parameters, expressed as a functional of the parameters, can be routinely answered. However, care must be taken both in model elicitation, of which prior specification is a big component, and in performing extensive checks on the inference technique. For example, when a vague or noninformative hyper-prior is used for defining the model's parameters, the posterior estimates (MAP) will be similar to the estimates provided by the MLE. Another advantage of the Bayes method is that information extracted from previous studies can be used to refine the hyper-priors (see, e.g., Heydari et al., 2013). Appendix A documents the characteristics for estimating the HNB model using the Full Bayes method. For the development of crash-frequency models, the Bayes method is preferred when crash data are characterized by low sample mean values and small sample sizes (see Chapter 3 - Crash-Frequency Modeling) (Lord and Miranda-Moreno, 2008). In such instances, informative hyper-priors based on prior knowledge, as obtained from previous studies, can be used. Some studies have developed FB models that account for temporal and/or spatial correlations (Huang et al., 2009; Jiang et al., 2013). Fawcett et al. (2017) proposed a novel FB hierarchical model that incorporated crash counts from multiple past periods, rather than from a single before period, in the identification of hazardous locations. These authors used a discrete-time indicator in the model to account for the effects of a temporal trend in crashes.
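To make the hierarchical setup concrete, the sketch below (a minimal example assuming Python with the PyMC and ArviZ packages; the counts, flows, and prior values are purely illustrative) estimates a Poisson-lognormal model along the lines of Eqs. (2.6a) and (2.8) using MCMC. Note that, as a common conjugate-style shortcut, the gamma hyper-prior is placed here on the precision of the site effects rather than on $\sigma^2$ directly:

```python
# Hierarchical Poisson-lognormal crash-frequency model (Full Bayes via MCMC).
import numpy as np
import pymc as pm
import arviz as az

# Hypothetical crash counts and traffic flows for 10 sites.
y = np.array([0, 2, 1, 5, 3, 0, 4, 7, 1, 2])
ln_aadt = np.log(np.array([3e3, 8e3, 5e3, 2e4, 1.2e4,
                           4e3, 1.5e4, 3e4, 6e3, 9e3]))

with pm.Model():
    b0 = pm.Normal("b0", mu=0.0, sigma=10.0)       # vague priors on coefficients
    b1 = pm.Normal("b1", mu=0.0, sigma=10.0)
    tau = pm.Gamma("tau", alpha=0.5, beta=0.5)     # hyper-prior on precision
    eps = pm.Normal("eps", mu=0.0, sigma=1.0 / pm.math.sqrt(tau),
                    shape=len(y))                  # lognormal site effects, Eq. (2.8)
    mu = pm.math.exp(b0 + b1 * ln_aadt + eps)      # Poisson mean, Eq. (2.6a)
    pm.Poisson("y_obs", mu=mu, observed=y)
    idata = pm.sample(2000, tune=2000, target_accept=0.9)

print(az.summary(idata, var_names=["b0", "b1", "tau"]))  # posterior summaries
```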
2.7 Methods for evaluating model performance

This section describes different methods that can be used for evaluating the performance of crash-frequency and crash-severity models. These methods measure the "goodness-of-fit" (GOF), that is, how well the model fits the data. Although evaluating the fit is an important part of the assessment of models, it should not be the sole criterion for selecting one model over another. It is also important to examine what is called the "goodness-of-logic" (Miaou and Lord, 2003). More details about this topic are covered in Chapter 3 - Crash-Frequency Models and Chapter 6 - Cross-Sectional Studies.
The GOF methods can be classified into two general groups. One group of methods relates to likelihood statistics, while the other assesses the model performance based on the model's errors. It is suggested to use several GOF methods from both groups to assess the performance of different models. It should be pointed out that likelihood-based methods should only be used to compare models estimated using the same dataset.
2.7.1 Likelihood-based methods

The methods presented in this section describe how well the model maximizes the likelihood function, under different parametrizations or with different penalty functions. Most of these likelihood-based methods can be used for either crash-frequency or crash-severity models. Although the basic equations are described here, all these methods can be calculated automatically (via predefined functions/modules or written code) in statistical software programs.

2.7.1.1 Maximum likelihood estimate

As the name implies, the most basic method consists of maximizing the LL function. This is accomplished by first taking the log of the likelihood function, then taking the partial derivatives (first-order conditions) of the LL with respect to each model parameter and setting each one equal to zero. These partial derivative equations are solved simultaneously to find the parameters, which are then substituted back into the log-likelihood function. Algorithms such as the Newton-Raphson search algorithm (second-order conditions) are needed to solve these equations. Fortunately, all statistical computer programs can now automatically compute the maximum log-likelihood or MLE estimate. The MLE statistic is given as follows:

$\text{MLE} = -2 \times LL$  (2.9)
The model with the largest log-likelihood (i.e., the smallest value of Eq. 2.9) provides the best fit. The MLE statistic is unfortunately not dependent on the number of parameters found in the model, which could potentially lead to an overfitted model. The other methods below can overcome this problem. Appendix A shows how to calculate the MLE for the NB model.

2.7.1.2 Likelihood ratio test

The likelihood ratio test is used to select models by comparing the log-likelihood of the fitted (unrestricted) model with the log-likelihood of a model with fewer or no explanatory variables (the restricted model). The formulation of the likelihood ratio test is

$\text{LRT} = 2[LL(\beta_U) - LL(\beta_R)]$  (2.10)
where $LL(\beta_R)$ is the log-likelihood at the convergence of the restricted model and $LL(\beta_U)$ is the log-likelihood at the convergence of the unrestricted model. The statistic is chi-square distributed with degrees of freedom equal to the number of parameters excluded from the restricted model; larger values indicate that the additional variables significantly improve the fit.

2.7.1.3 Likelihood ratio index

The likelihood ratio index statistic compares how well the model with estimated parameters performs against a model in which all the parameters are set to zero (or no model at all). This test is primarily used for assessing the GOF of crash-severity models. The index is more commonly called the McFadden R², the $\rho^2$ statistic, or sometimes just $\rho$. Its formulation is:

$\rho^2 = 1 - \frac{LL(\hat{\beta})}{LL(0)}$  (2.11a)

The estimation of potentially insignificant parameters is accounted for by estimating a corrected $\rho^2$:

$\rho^2_{corrected} = 1 - \frac{LL(\hat{\beta}) - p}{LL(0)}$  (2.11b)

where $LL(\hat{\beta})$ is the log-likelihood function at the estimated parameters $\hat{\beta}$, $LL(0)$ is the log-likelihood function with the parameters set to zero, and p is the number of parameters estimated in the model. This index ranges from zero (when the estimated parameters are no better than zero) to one (when the estimated parameters perfectly predict the outcome of the sampled observations). The name $\rho^2$ suggests a similarity to the R² statistic. However, while R² indicates the percentage of the variation in the dependent variable that can be "explained" by the estimated model, $\rho^2$ is the actual percentage increase in the log-likelihood function above the value taken at zero parameters. Unfortunately, the meaning of such an increase is unclear in terms of the explanatory power of the model. When comparing two models estimated using the same dataset with the same set of alternatives (the premise for model comparison, so that $LL(0)$ is the same for both models), the model with the higher $\rho^2$ fits the data better. This is equivalent to saying that the model with the higher value of the likelihood function is preferable.

2.7.1.4 Akaike information criterion

The Akaike information criterion (AIC) is a measure of fit that can be used to assess models. This measure uses the log-likelihood but adds a penalty term associated with the number of variables. It is well known
that by adding variables, one can improve the fit of models. Thus, the AIC tries to balance the GOF against the inclusion of variables in the model. The AIC is computed as follows:

$\text{AIC} = -2 \times LL + 2p$  (2.12)
where p is the number of unknown parameters included in the model (this also includes the dispersion or shape parameters of models, such as the inverse dispersion parameter of the NB model or a random spatial effect) and LL is the log-likelihood. Smaller values indicate better model fit.

2.7.1.5 Bayes information criterion

Similar to the AIC, the Bayes information criterion (BIC) also employs a penalty term, but this term is associated with both the number of parameters (p) and the sample size (n). This measure is also known as the Schwarz Information Criterion. It is computed the following way:

$\text{BIC} = -2 \times LL + p \ln(n)$  (2.13)
Like the AIC, smaller values indicate better model fit.

2.7.1.6 Deviance information criterion

When the Bayesian estimation method is used, the deviance information criterion (DIC) is often used as a GOF measure instead of the AIC or BIC. The DIC is defined as follows:

$\text{DIC} = \hat{D} + 2(\bar{D} - \hat{D})$  (2.14)

where $\bar{D}$ is the average of the deviance ($-2 \times LL$) over the posterior distribution, and $\hat{D}$ is the deviance calculated at the posterior mean parameters. As with the AIC and BIC, the DIC uses $p_D = \bar{D} - \hat{D}$ (the effective number of parameters) as a penalty term on the GOF. Differences in DIC from 5 to 10 indicate that one model is clearly better (Spiegelhalter et al., 2002).

2.7.1.7 Widely applicable information criterion

The widely applicable information criterion (WAIC) (Watanabe, 2010) is a measure that is similar to the DIC (i.e., it adds a penalty term to minimize overfitting), but it incorporates the variance of the individual deviance terms in Eq. (2.14). According to Gelman et al. (2014), the "WAIC has the desirable property of averaging over the posterior distribution rather than conditioning on a point estimate" (p. 9), as is done with the AIC and DIC. Because of this, the WAIC provides a better assessment of models estimated by the Bayesian method.
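All of the measures above are simple functions of the maximized log-likelihood, the number of parameters, and the sample size, so they are easy to compute from any fitted model. A minimal sketch (in Python with scipy; the log-likelihood values are hypothetical) covering the LRT, McFadden's ρ², AIC, and BIC:

```python
# Likelihood-based GOF measures from hypothetical maximized log-likelihoods.
import math
from scipy.stats import chi2

ll_full = -412.3   # unrestricted model: intercept + 3 covariates + dispersion
ll_base = -431.8   # restricted model: intercept + dispersion only
ll_zero = -455.0   # log-likelihood with all parameters set to zero
p, n = 5, 868      # parameters in the full model and sample size

lrt = 2 * (ll_full - ll_base)                # Eq. (2.10)
df = 3                                       # covariates dropped in the restricted model
print(f"LRT = {lrt:.1f}, p-value = {chi2.sf(lrt, df):.4f}")

rho2 = 1 - ll_full / ll_zero                 # Eq. (2.11a)
rho2_corr = 1 - (ll_full - p) / ll_zero      # Eq. (2.11b)
aic = -2 * ll_full + 2 * p                   # Eq. (2.12)
bic = -2 * ll_full + p * math.log(n)         # Eq. (2.13)
print(f"rho2 = {rho2:.3f} (corrected {rho2_corr:.3f}), "
      f"AIC = {aic:.1f}, BIC = {bic:.1f}")
```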
2.7.1.8 Bayes factors

The Bayes factor is a powerful tool for assessing different models estimated on the same dataset when the Bayes estimating method is used. For example, the Bayes factor $B_{12}$ compares model $M_1$ to model $M_2$ after observing the data (Lewis and Raftery, 1997). The Bayes factor is the ratio of the marginal likelihoods of the two models being compared, $B_{12} = p(y|M_1)/p(y|M_2)$. For calculating the marginal likelihood, the method developed by Lewis and Raftery (1997) can be used. The approximation of the marginal likelihood is carried out on the logarithmic scale such that:

$\log\{p(y|M)\} \approx \frac{p}{2}\log(2\pi) + \frac{1}{2}\log\{|H^*|\} + \log\{f(y|\beta^*)\} + \log\{p(\beta^*)\}$  (2.15)

where p is the number of parameters, $\log\{f(y|\beta^*)\}$ is the log-likelihood of the data at $\beta^*$, and $\log\{p(\beta^*)\}$ is the log of the prior distribution at $\beta^*$. One way of estimating $\beta^*$ is to find the value of $\beta$ at which $\log\{f(y|\beta)\} + \log\{p(\beta)\}$ achieves its maximum in the posterior simulation output. $|H^*|$ is the determinant of the variance-covariance matrix estimated from the Hessian at the posterior mode, which is asymptotically equal to the posterior variance-covariance matrix. This can be estimated from the sample variance-covariance matrix of the posterior simulation output. Assuming that the prior probabilities of the competing models are equal, $B_{12}$ is expressed as follows:

$\log\{B_{12}\} = \log\{p(y|M_1)\} - \log\{p(y|M_2)\}$  (2.16)
According to Kass and Raftery (1995), values of $B_{12}$ between 20 and 150 strongly support the selection of Model 1 over Model 2.

2.7.1.9 Deviance

The deviance is a measure of GOF and is defined as twice the difference between the maximum likelihood achievable ($y_i = \mu_i$) and the likelihood of the fitted model:

$D(y; \hat{\mu}) = 2\{LL(y) - LL(\hat{\mu})\}$  (2.17)

Smaller values mean that the model fits the data better. This GOF measure applies only to models with a defined likelihood function. As opposed to the DIC, this measure does not include a penalty function.
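For a Poisson model, Eq. (2.17) reduces to a closed form because the saturated model sets $\mu_i = y_i$. A minimal numpy sketch (with hypothetical observed and fitted values):

```python
# Poisson deviance, Eq. (2.17): D = 2*sum[y*ln(y/mu) - (y - mu)],
# using the convention y*ln(y/mu) = 0 when y = 0.
import numpy as np

y = np.array([0, 2, 1, 5, 3])              # observed counts (hypothetical)
mu = np.array([0.8, 1.6, 1.2, 4.1, 2.9])   # fitted Poisson means (hypothetical)

with np.errstate(divide="ignore", invalid="ignore"):
    term = np.where(y > 0, y * np.log(y / mu), 0.0)
deviance = 2.0 * np.sum(term - (y - mu))
print(f"Poisson deviance = {deviance:.3f}")
```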
2.7.2 Error-based methods

There are methods for estimating how well the model fits the data that are based on minimizing the model's errors (i.e., the difference between the observed and estimated values). The following methods can be applied to the entire dataset after the model is fitted to the data, or when the dataset is split into different proportions (say, the model is first estimated with 70% of the data and then applied to the rest of the data). The first four methods below have been proposed by Oh et al. (2003) to evaluate the fit of crash-frequency models. In addition, most statistical software programs have modules available for calculating the measures described in the following.

2.7.2.1 Mean prediction bias

The mean prediction bias (MPB) measures the magnitude and direction of the model bias. It is calculated using the following equation:

$\text{MPB} = \frac{1}{n}\sum_{i=1}^{n}(\mu_i - y_i)$  (2.18)
A positive value indicates that the model overestimates the observed values, while a negative value shows that the model underpredicts them.

2.7.2.2 Mean absolute deviation

The mean absolute deviation (MAD) calculates the average absolute difference between the estimated and observed values:

$$\mathrm{MAD} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{\mu}_i - y_i\right| \quad (2.19)$$
Smaller values are better.

2.7.2.3 Mean squared prediction error

The mean squared prediction error (MSPE) is a traditional indicator of error and calculates the average squared difference between the estimated and observed values. The equation is as follows:

$$\mathrm{MSPE} = \frac{1}{n}\sum_{i=1}^{n}(\hat{\mu}_i - y_i)^2 \quad (2.20)$$
Smaller values indicate that the model predicts the observed values better.

2.7.2.4 Mean squared error

The mean squared error (MSE) calculates the sum of the squared differences between the observed and estimated crash frequencies divided by the sample size minus the number of parameters in the model. The MSE is calculated as follows:

$$\mathrm{MSE} = \frac{1}{n - p}\sum_{i=1}^{n}(\hat{\mu}_i - y_i)^2 \quad (2.21)$$
The MSE value can be compared to the MSPE. If the MSE value is larger than the MSPE value, then the model may overpredict crashes.
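For reference, the four measures above reduce to a few lines of code. A minimal R sketch, assuming `y` holds the observed counts, `mu` the model estimates, and `p` the number of model parameters:

```r
# Minimal sketch of Eqs. (2.18)-(2.21); `y`, `mu`, and `p` are assumed
# to be supplied by the analyst.
fit_measures <- function(y, mu, p) {
  n <- length(y)
  c(MPB  = sum(mu - y) / n,            # Eq. (2.18): sign shows bias direction
    MAD  = sum(abs(mu - y)) / n,       # Eq. (2.19)
    MSPE = sum((mu - y)^2) / n,        # Eq. (2.20)
    MSE  = sum((mu - y)^2) / (n - p))  # Eq. (2.21)
}
```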
2.7.2.5 Mean absolute percentage error

The mean absolute percentage error (MAPE) is a statistical measure used for assessing how well a model predicts (future) values. It is measured as a percentage. The MAPE is calculated using this equation:

$$\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{A_i - P_i}{A_i}\right| \quad (2.22)$$

where $A_i$ is the actual value and $P_i$ is the predicted value for site or observation $i$. It should be pointed out that the equation will not work if one or more actual values equal 0. A smaller percentage indicates that a model is better at predicting values.

2.7.2.6 Pearson chi-square

Another useful likelihood-based statistic is the Pearson chi-square, defined as

$$\text{Pearson } \chi^2 = \sum_{i=1}^{n}\frac{(y_i - \hat{\mu}_i)^2}{\mathrm{Var}(y_i)} \quad (2.23)$$

If the mean and the variance are properly specified, then $E\left[\sum_{i=1}^{n}(y_i - \mu_i)^2/\mathrm{Var}(y_i)\right] = n$ (Cameron and Trivedi, 2013). Values closer to $n$ (the sample size) show a better fit.

2.7.2.7 Coefficient of determination $R_\alpha^2$

Miaou (1996) proposed using the dispersion-parameter-based coefficient of determination $R_\alpha^2$ to evaluate the fit of an NB model when it is used for modeling crash data. It is computed as follows:

$$R_\alpha^2 = 1 - \frac{\alpha}{\alpha_{null}} \quad (2.24)$$

where $\alpha$ is the dispersion parameter of the NB model that includes independent variables (i.e., $\mathrm{Var}(Y) = \mu + \alpha\mu^2$); and $\alpha_{null}$ is the dispersion parameter of the NB model when no independent variables are included in the model (intercept only).

2.7.2.8 Cumulative residuals

The cumulative residuals (CURE) method consists of plotting the cumulative difference between the estimated and observed values ($r_i = \hat{\mu}_i - y_i$, where $r_i$ represents the residual for observation or rank $i$) in increasing order of the variable that is being analyzed (Hauer and Bamfo, 1997). The CURE plot allows the safety analyst to examine how the cumulative difference varies around the zero-line, which can help determine where, in the range of the variable examined, the model over- or underestimates the number of crashes. To properly evaluate the fit, the 95%
confidence interval (CI) needs to be calculated. The CI is calculated using the variance of residual $i$ (i.e., $r_i^2$) and then cumulating the variance in the increasing order of the variable. The following equation can be used (Hauer and Bamfo, 1997) for this purpose:

$$\hat{s}_i^2 = s^2(n_i)\left[1 - \frac{s^2(n_i)}{s^2(N)}\right] \quad (2.25)$$

where $\hat{s}_i^2$ is the variance at observation/rank $i$; $s^2(n_i)$ is the cumulative squared residual up to residual $i$; and $s^2(N)$ is the cumulative squared residual for the last observation in the dataset. The last part of the equation incorporates the proportion of the cumulative residuals; for the last observation, the variance is equal to zero. The 95% CI can be calculated as follows ($1.96 \approx 2.0$) at observation $i$:

$$\pm 2\sqrt{\hat{s}_i^2} \quad (2.26)$$

Although some statistical programs provide CURE plots, they can also be created in a spreadsheet. Table 2.9 presents an example describing how the CURE plot can be calculated in a spreadsheet. The example uses a traffic flow variable (in vehicles per day). The dataset contains 215 observations ranked from the smallest flow to the largest flow. The last two columns apply Eqs. (2.25) and (2.26). Fig. 2.8 shows the CURE plot using the data shown in Table 2.9. The columns "Cumulative residuals," "Upper CI," and "Lower CI" were used for the figure.

In most cases, the CURE plot does not start nor end at 0 (zero), which may make it difficult to compare different models. To help with the comparison, the plot can be adjusted by proportionally changing the values along the curve to ensure it starts and ends at 0. Using the example shown in Table 2.9, Fig. 2.9 illustrates how the adjustment can be accomplished. The CURE plot starts at -12.4 (for the flow of 1542 veh/day) and ends at 71.7 (for the flow of 45,685 veh/day). The slope of the red (gray in printed version) line is 0.0019 per veh/day. The goal is to adjust the cumulative residual (add or subtract) by the proportion shown inside the triangle. Between 1542 and 8057 veh/day, the value is added to the cumulative residual, and between 8057 and 45,685 veh/day, the value is subtracted. For the first flow value, the adjusted cumulative residual will be -12.4 + 12.4 = 0. This adjustment can also be carried out in a spreadsheet.
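The spreadsheet steps in Table 2.9 also translate directly into a few lines of R. A minimal sketch, assuming vectors `y` (observed values), `mu` (estimated values), and `flow` (the ranking variable):

```r
# Minimal sketch of the CURE plot calculations (Eqs. 2.25 and 2.26).
ord   <- order(flow)
r     <- (mu - y)[ord]                       # residuals sorted by flow
cum_r <- cumsum(r)                           # cumulative residuals
s2n   <- cumsum(r^2)                         # cumulative squared residuals
s2    <- s2n * (1 - s2n / s2n[length(s2n)])  # Eq. (2.25); zero at the end
upper <- 2 * sqrt(s2)                        # Eq. (2.26), using 1.96 ~ 2.0
plot(flow[ord], cum_r, type = "l",
     xlab = "Flow (veh/day)", ylab = "Cumulative residuals")
lines(flow[ord],  upper, lty = 2)
lines(flow[ord], -upper, lty = 2)
abline(h = 0)
```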
2.8 Heuristic methods for model selection

The methods described in the previous section are only applicable after the model is applied and fitted to the data. This approach to selecting one model over another can be very time consuming, especially if complex models are evaluated. Lord et al. (2019) have proposed a heuristic
TABLE 2.9 CURE plot calculations.

Rank   Flow     Residuals   Cumulative   Squared     Cumulative     Upper CI   Lower CI
                            residuals    residuals   squared
                                                     residuals
1       1,542    -12.4       -12.4          152.6          152.6       24.7      -24.7
2       7,793    -30.0       -42.4          902.1        1,054.7       64.9      -64.9
3       8,425    -29.4       -71.8          864.1        1,918.8       87.6      -87.6
4       9,142    -53.2      -124.9        2,826.6        4,745.4      137.6     -137.6
5       9,474     74.1       -50.9        5,489.3       10,234.7      201.7     -201.7
6       9,856    -37.6       -88.4        1,412.9       11,647.6      215.1     -215.1
...     ...       ...         ...            ...            ...         ...        ...
215    45,685    258.3        71.7       66,733.8    1,660,753.9        0.0        0.0
FIGURE 2.8 Cumulative residuals for the data shown in Table 2.9.

FIGURE 2.9 Adjustment procedure for the cumulative residuals.
method that could be used before the models are fitted and evaluated. This method relies on simulating many datasets from competing distributions and recording key summary statistics for each dataset. A machine learning classifier (such as a decision tree or random forest) is then trained to distinguish one distribution from another. Once the classifier is trained, the descriptive statistics of a new dataset can be used to select one distribution over the other (see Shirazi et al., 2017; Shirazi and Lord, 2019). Table 2.10 provides a list of descriptive statistics that could be used for comparing distributions, most of which are described in Chapter 5, Exploratory Analyses of Safety Data.
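A hypothetical end-to-end sketch of this idea is given below: datasets are simulated from two competing distributions, their summary statistics recorded, and a classification tree trained on those statistics. The distributions, sample sizes, and parameter values are illustrative only, and the `rpart` and `moments` packages are assumed to be available.

```r
library(rpart)     # classification tree
library(moments)   # skewness() and kurtosis()

# Simulate one dataset from a named distribution and record its statistics.
one_dataset <- function(dist) {
  if (dist == "NB") {
    y <- rnbinom(500, size = 1, mu = 2)
  } else {           # Poisson-lognormal with the same marginal mean
    y <- rpois(500, exp(log(2) - 0.5^2 / 2 + rnorm(500, 0, 0.5)))
  }
  data.frame(dist = dist, skew = skewness(y), kurt = kurtosis(y),
             pct0 = mean(y == 0), vmr = var(y) / mean(y))
}

train <- do.call(rbind, lapply(rep(c("NB", "PLN"), each = 1000), one_dataset))
tree  <- rpart(dist ~ skew + kurt + pct0 + vmr, data = train, method = "class")
# A new dataset can then be classified from its summary statistics alone,
# e.g., predict(tree, newdata = <its statistics>, type = "class").
```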
Lord et al. (2019) have already compared the NB with the NB-Lindley (NB-L) and the NB with the Poisson-lognormal (PLN). The results are shown in Figs. 2.10 and 2.11, respectively. In Fig. 2.10, if the skewness of the data is greater than 1.92, the NB-L should be selected over the NB. For the NB versus PLN comparison, the safety analyst just needs to follow the tree-diagram for the statistics "percentage of zeros" and "kurtosis." All these models are described in the next chapter.

TABLE 2.10 Descriptive statistics needed for the heuristic methods.

Descriptive statistics
Coefficient of variation
Interquantile (10% to 90%, in increments of 10%)
Kurtosis
Mean
Percentage of zeros
Quantile (10% to 90%, in increments of 10%)
Range
Skewness
Standard deviation
Variance
Variance-to-mean ratio
FIGURE 2.10 Heuristics to select a model between the NB and NB-L distributions.

FIGURE 2.11 Heuristics to select a model between the NB and PLN distributions.
References
AASHTO, 2010. Highway Safety Manual, first ed. American Association of State Highway Transportation Officials, Washington, D.C.
Abbess, C., Jarett, D., Wright, C.C., 1981. Accidents at blackspots: estimating the effectiveness of remedial treatment, with special reference to the "Regression-to-Mean" effect. Traffic Eng. Contr. 22 (10), 535–542.
Appiah, D., Ozuem, W., Howell, K., 2019. Disruptive technology in the smartphones industry. In: Leveraging Computer-Mediated Marketing Environments, pp. 351–371. https://doi.org/10.4018/978-1-5225-7344-9.ch017.
Bejani, M.M., Ghatee, M., 2018. A context aware system for driving style evaluation by an ensemble learning on smartphone sensors data. Transport. Res. Part C 89, 303–320.
Cameron, A.C., Trivedi, P.K., 2013. Regression Analysis of Count Data, second ed. Cambridge University Press, Cambridge, U.K.
Carlin, B.P., Louis, T.A., 2008. Bayesian Methods for Data Analysis, third ed. Chapman and Hall/CRC, London, U.K.
Diggle, P.J., Liang, K.-Y., Zeger, S.L., 1994. Analysis of Longitudinal Data. Clarendon Press, Oxford, U.K.
Dingus, T.A., Klauer, S.G., Neale, V.L., Petersen, A., Lee, S.E., Sudweeks, J., Perez, M.A., Hankey, J., Ramsey, D., Gupta, S., Bucher, C., Doerzaph, Z.R., Jermeland, J., Knipling, R.R., 2006. The 100-Car Naturalistic Driving Study, Phase II: Results of the 100-Car Field Experiment. DOT HS 810. National Highway Traffic Safety Administration, Washington, D.C.
Fawcett, L., Thorpe, N., Matthews, J., Kremer, K., 2017. A novel Bayesian hierarchical model for road safety hotspot prediction. Accid. Anal. Prev. 99 (Pt A), 262–271.
Feller, W., 1968. An Introduction to Probability Theory and Its Applications, third ed., vol. 1. John Wiley, New York, N.Y.
Flynn, D.F.B., Gilmore, M.M., Sudderth, E.A., 2018. Estimating Traffic Crash Counts Using Crowdsourced Data: Pilot Analysis of 2017 Waze Data and Police Accident Reports in Maryland. DOT-VNTSC-BTS-19-01. U.S. DOT, Volpe National Transportation Systems Center, Cambridge, MA.
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B., 2013. Bayesian Data Analysis, third ed. Chapman & Hall/CRC Press, Boca Raton, FL.
Gelman, A., Hwang, J., Vehtari, A., 2014. Understanding predictive information criteria for Bayesian models. Stat. Comput. 24, 997–1016.
Gilks, W.R., Richardson, S., Spiegelhalter, D.J., 1996. Markov Chain Monte Carlo in Practice. Chapman and Hall, London.
Goodall, N., Lee, E., 2019. Comparison of Waze crash and disabled vehicle records with video ground truth. Transp. Res. Interdiscip. Perspect. 1, 100019.
Guikema, S.D., Nateghi, R., Quiring, S.M., Staid, A., Reilly, A.C., Gao, M., 2014. Predicting hurricane power outages to support storm response planning. IEEE Access 2, 1364–1373.
Haddon, W., 1980. Options for the prevention of motor vehicle crash injury. Isr. J. Med. Sci. 16 (1), 45–65.
Hardin, J.W., Hilbe, J.M., 2013. Generalized Estimating Equations, second ed. CRC Press, Taylor and Francis Group, Boca Raton, FL.
Hauer, E., 1997. Observational Before-After Studies in Road Safety: Estimating the Effect of Highway and Traffic Engineering Measures on Road Safety. Elsevier Science Ltd, Oxford.
Hauer, E., Bamfo, J., 1997. Two tools for finding what function links the dependent variable to the explanatory variables. In: Proceedings of the ICTCT 1997 Conference, Lund, Sweden.
Herbel, S., Laing, L., McGovern, C., 2010. Highway Safety Improvement Program Manual. Report No. FHWA-SA-09-029. Federal Highway Administration, Washington, D.C. https://safety.fhwa.dot.gov/hsip/resources/fhwasa09029/index.cfm#toc. (Accessed 4 June 2020).
Heydari, S., Miranda-Moreno, L.F., Fu, L., Lord, D., 2013. How to specify priors for full Bayes road safety studies? In: 4th International Conference on Road Safety and Simulation, Rome, Oct. 23rd–25th, 2013.
Hilbe, J.M., 2011. Negative Binomial Regression, second ed. Cambridge University Press, Cambridge, U.K.
Hilbe, J.M., 2014. Modelling Count Data. Cambridge University Press, Cambridge, U.K.
Huang, H., Chin, H.C., Haque, M.M., 2009. Empirical evaluation of alternative approaches in identifying crash hot spots. Transport. Res. Rec. 2103, 32–41.
Jiang, X., Huang, B., Zaretzki, R.L., Richards, S., Yan, X., 2013. Estimating safety effects of pavement management factors utilizing Bayesian random effect models. Traffic Inj. Prev. 14 (7), 766–775.
Kanarachos, S., Christopoulos, S.-R.G., Chroneos, A., 2018. Smartphones as an integrated platform for monitoring driver behaviour: the role of sensor fusion and connectivity. Transport. Res. Part C 95, 867–882.
Kass, R.E., Raftery, A.E., 1995. Bayes factors and model uncertainty. J. Am. Stat. Assoc. 90, 773–795.
Levine, N., 2008. CrimeStat: a spatial statistical program for the analysis of crime incidents. In: Shekhar, S., Xiong, H. (Eds.), Encyclopedia of Geographic Information Science. Springer, pp. 187–193.
Lewis, S.M., Raftery, A.E., 1997. Estimating Bayes factors via posterior simulation with the Laplace-Metropolis estimator. J. Am. Stat. Assoc. 92, 648–655.
Li, X., Dadashova, B., Turner, S., Goldberg, D., 2020. Rethinking Highway Safety Analysis by Leveraging Crowdsourced Waze Data. Presented at the 99th TRB Annual Meeting, Washington, D.C.
Lord, D., 2000. The Prediction of Accidents on Digital Networks: Characteristics and Issues Related to the Application of Accident Prediction Models. Ph.D. Dissertation. Department of Civil Engineering, University of Toronto, Toronto, Ontario.
Lord, D., Bonneson, J.A., 2005. Calibration of predictive models for estimating the safety of ramp design configurations. Transp. Res. Rec. 1908, 88–95.
Lord, D., Geedipally, S.R., 2018. Safety prediction with datasets characterised with excess zero responses and long tails. In: Lord, D., Washington, S. (Eds.), Safe Mobility: Challenges, Methodology and Solutions (Transport and Sustainability, vol. 11). Emerald Publishing Limited, pp. 297–323.
Lord, D., Mannering, F., 2010. The statistical analysis of crash-frequency data: a review and assessment of methodological alternatives. Transport. Res. Part A 44 (5), 291–305.
Lord, D., Miranda-Moreno, L.F., 2008. Effects of low sample mean values and small sample size on the estimation of the fixed dispersion parameter of Poisson-gamma models for modeling motor vehicle crashes: a Bayesian perspective. Saf. Sci. 46 (5), 751–770.
Lord, D., Persaud, B.N., 2000. Accident prediction models with and without trend: application of the generalized estimating equations (GEE) procedure. Transp. Res. Rec. 1717, 102–108.
Lord, D., Washington, S.P., Ivan, J.N., 2005. Poisson, Poisson-gamma and zero inflated regression models of motor vehicle crashes: balancing statistical fit and theory. Accid. Anal. Prev. 37 (1), 35–46.
Lord, D., Brewer, M.A., Fitzpatrick, K., Geedipally, S.R., Peng, Y., 2011. Analysis of Roadway Departure Crashes on Two-Lane Rural Highways in Texas. Report No. FHWA/TX-11/06031-1. Texas A&M Transportation Institute, College Station, TX.
Lord, D., Geedipally, S.R., Guo, F., Jahangiri, A., Shirazi, M., Mao, H., Deng, X., 2019. Analyzing Highway Safety Datasets: Simplifying Statistical Analyses from Sparse to Big Data. Report No. 01-001, Safe-D UTC. U.S. Department of Transportation, Washington, D.C.
Lunn, D.J., Thomas, A., Best, N., Spiegelhalter, D., 2000. WinBUGS – a Bayesian modelling framework: concepts, structure, and extensibility. Stat. Comput. 10, 325–337.
McCullagh, P., Nelder, J.A., 1989. Generalized Linear Models, second ed. Chapman and Hall, Ltd, London, U.K.
Miaou, S.-P., 1996. Measuring the Goodness-of-Fit of Accident Prediction Models. FHWA-RD-96-040, Final Report. Federal Highway Administration, McLean, VA.
Miaou, S.-P., Lord, D., 2003. Modeling traffic-flow relationships at signalized intersections: dispersion parameter, functional form and Bayes vs empirical Bayes. Transport. Res. Rec. 1840, 31–40.
Montella, A., Andreassen, D., Tarko, A.P., Turner, S., Mauriello, F., Imbriani, L.L., Romero, M.A., Singh, R., 2012. Critical review of the international crash databases and proposals for improvement of the Italian national database. Procedia Soc. Behav. Sci. 53, 49–61.
Nabavi Niaki, M.S., Saunier, N., Miranda-Moreno, L.F., 2019. Is that move safe? Case study of cyclist movements at intersections with cycling discontinuities. Accid. Anal. Prev. 131, 239–247.
National Academies of Sciences, Engineering, and Medicine, 2014. Naturalistic Driving Study: Technical Coordination and Quality Control. The National Academies Press, Washington, D.C. https://doi.org/10.17226/22362.
Nedelman, J., Wallenius, T., 1986. Bernoulli trials, Poisson trials, surprising variance, and Jensen's inequality. Am. Statistician 40 (4), 286–289.
NHTSA, 2018. Critical Reasons for Crashes Investigated in the National Motor Vehicle Crash Causation Survey. Traffic Safety Facts. National Highway Traffic Safety Administration, Washington, D.C.
Oh, J., Lyon, C., Washington, S.P., Persaud, B.N., Bared, J., 2003. Validation of the FHWA crash models for rural intersections: lessons learned. Transport. Res. Rec. 1840, 41–49.
Olkin, I., Gleser, L.J., Derman, C., 1980. Probability Models and Applications. MacMillan Publishing Co., Inc., New York, N.Y.
Poch, M., Mannering, F.L., 1996. Negative binomial analysis of intersection-accident frequency. J. Transport. Eng. 122 (2), 105–113.
Rao, J.N.K., 2003. Small Area Estimation. Wiley, Hoboken, New Jersey.
Rumar, K., 1985. The role of perceptual and cognitive filters in observed behavior. In: Evans, L., Schwing, R. (Eds.), Human Behavior in Traffic Safety. Plenum Press.
Savolainen, P.T., Mannering, F.L., Lord, D., Quddus, M.A., 2011. The statistical analysis of highway crash-injury severities: a review and assessment of methodological alternatives. Accid. Anal. Prev. 43 (5), 1666–1676.
Shirazi, M., Lord, D., 2019. Characteristics based heuristics to select a logical distribution between the Poisson-gamma and the Poisson-lognormal for crash data modelling. Transportmetrica A Transp. Sci. 15 (2), 1791–1803.
Shirazi, M., Dhavala, S.S., Lord, D., Geedipally, S.R., 2017. A methodology to design heuristics for model selection based on characteristics of data: application to investigate when the negative binomial Lindley (NB-L) is preferred over the negative binomial (NB). Accid. Anal. Prev. 107, 186–194.
Smith, T., 2020. Disruptive Technology. Investopedia. https://www.investopedia.com/terms/d/disruptive-technology.asp. (Accessed 12 June 2020).
Spiegelhalter, D.J., Best, N.G., Carlin, B.P., van der Linde, A., 2002. Bayesian measures of model complexity and fit. J. Roy. Stat. Soc. B 64, 583–639.
Stipancic, J., Miranda-Moreno, L., Saunier, N., 2018. Vehicle manoeuvers as surrogate safety measures: extracting data from the GPS-enabled smartphones of regular drivers. Accid. Anal. Prev. 115, 160–169.
Train, K.E., 2009. Discrete Choice Methods with Simulation. Cambridge University Press, Cambridge, U.K.
Watanabe, S., 2010. Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. J. Mach. Learn. Res. 11, 3571–3594.
Xie, K., Ozbay, K., Yang, H., Yang, D., 2019. A new methodology for before–after safety assessment using survival analysis and longitudinal data. Risk Anal. 39 (6), 1342–1357.
CHAPTER 3

Crash-frequency modeling

3.1 Introduction

As described in Chapter 2, Fundamentals and Data Collection, crash counts are nonnegative and discrete events occurring on roadways, which require specific analysis tools for assessing the safety performance of entities such as roadway segments, intersections, and corridors, among others. Although many analysis tools have been developed and applied in a wide variety of areas, the focus of this chapter is placed on statistical models tailored specifically to the characteristics associated with safety data. A comprehensive list of models that have been used in highway safety can be found in Appendix B of this book. The intent of this chapter is to describe the fundamental concepts related to count-data models that can be used for analyzing crash data. The chapter describes the basic nomenclature of the models that have been proposed for analyzing highway safety data and different applications of crash-frequency models (note: the models are not limited to crash data and can also be used or estimated for many other types of safety data, such as vehicle occupants, nonmotorized road users, or traffic conflicts). These applications depend on the objectives of the study or the type of analysis conducted by the safety analyst. The study objectives will have a large influence on the selection of the model, which is based on the "goodness-of-fit" or "goodness-of-logic" sought by the safety analyst (Miaou and Lord, 2003). In Chapter 2, a four-stage modeling process is presented that applies to all the model applications described later. This chapter describes the sources of data dispersion, which influence the selection of the crash-frequency model, and covers key models that have been proposed for analyzing count data, along with important or relevant information about their characteristics. The models are grouped by their intended use for handling specific characteristics associated with safety data. The chapter ends with a discussion about the model selection process.
3.2 Basic nomenclature

The basic nomenclature for the models described later follows this general form:

$$y_i = f(\mathbf{x}_i'\boldsymbol{\beta}) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} \quad (3.1)$$

where $y_i$ is the response variable for observation $i$; $\boldsymbol{\beta}$ is a $(p \times 1)$ vector of estimable parameters; $\mathbf{x}_i$ is a vector of explanatory variables; and $p$ is the number of parameters in the model (less the intercept). As we are interested in estimating the long-term mean of the observation (i.e., mean parameterization), Eq. (3.1) can be rewritten by calculating the expected value of the response variable as follows:

$$E[y_i|\mathbf{x}_i] = \mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} \quad (3.2)$$

where $E[y_i|\mathbf{x}_i]$ is the expected value of the response variable ($\mu_i$ is the mean of the response variable). Based on the generalized linear modeling relationship with an exponential canonical link function (McCullagh and Nelder, 1989), Eq. (3.2) leads to the following form:

$$\mu_i = \exp(\mathbf{x}_i'\boldsymbol{\beta}) = \exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) \quad (3.3)$$

All the models described later follow the form described in Eq. (3.3). Note that different functional forms exist and could be used for analyzing safety data, many of which are described in Chapter 6, Cross-Sectional and Panel Studies in Safety.
3.3 Applications of crash-frequency models

Crash-frequency models are needed because crash data are generally random and independent events. In other words, in general terms, a crash is not directly correlated with another crash in space and time. It is theoretically possible, however, that a crash that occurred, say, on an urban freeway section at 4:00 p.m. could have directly contributed to another crash that happened at 4:06 p.m. about 500 m upstream from the location of the first crash, also known as a secondary crash (because of the queue that builds up following the initial collision) (Chapter 10, Capacity, Mobility and Safety covers microscopic crash risk in greater detail). However, those events are extremely rare compared to all typical events, the correlation between the two would be difficult to quantify, and their influence on the model performance will most likely be negligible. Hence, the assumption of independence between observations, as defined earlier, will be the core attribute for the models described in this chapter and the rest of the textbook.
Crash-frequency (and crash-severity) models can be used for analyzing different aspects of the transportation system. The type of application will also govern which model to select and how to evaluate the performance of different components of the system and/or its users. Below, we describe how these models can be used for this purpose.
3.3.1 Understanding relationships

The first application consists of developing statistical models with the objective of learning a useful relationship between observable features and a potentially actionable outcome about the system from which the data are extracted and analyzed. More specifically, the objective consists of establishing a relationship between safety (crashes, vehicle occupants, and so forth) and variables that are assumed to influence safety or crash risk. In statistical terms, a relationship is established between a dependent variable and a series of independent variables. Examining the sign of a coefficient is an example of such an application. In general, this application requires very tight statistical controls so that the effects of covariates are in fact independent and do reflect the attributes of the "system." Multicollinearity among variables is an enemy of this application, as much of the focus is on the effect of the predictor variables and their interpretability. Similarly, the omitted-variable bias that arises when a model does not include important variables can also negatively affect the modeling results (Lord and Mannering, 2010).
3.3.2 Screening variables

The second application consists of developing models for screening purposes, where the objective is to determine which covariates or independent variables have specific or significant effects on the risk of collisions (an associative relationship, as opposed to the causal relationships described later). For this application, most of the attention is devoted to the covariates of the statistical models. Usually, this is accomplished by examining the statistical significance of the variables investigated, such as at the 5%, 10%, or 15% level (i.e., the p-value), or a feature importance score (i.e., a statistical technique used to rank the importance of each variable). In addition, multicollinearity can be problematic for this application.
3.3.3 Sensitivity of variables

The third application seeks to examine the sensitivity of the variables that have been identified as part of the screening process described earlier. Under this application, the coefficients of the model are investigated in greater detail. This can be accomplished by examining the marginal effects of the coefficients, which are obtained by taking the partial derivative of the mean with respect to the variables under investigation, one at a time. For example, the effect of changes in variable $x_j$ (associated with the $j$th coefficient) can be estimated as follows:

$$\mathrm{ME}_{x_j} = \frac{\partial \mu}{\partial x_j} = \beta_j \exp(\mathbf{x}'\boldsymbol{\beta}) \quad (3.4)$$
Other approaches for investigating the marginal effects of coefficients can be found in Cameron and Trivedi (2013). An emerging trend in the Machine Learning space is to use counterfactuals to provide explanations (Molnar, 2020).
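For a fitted log-link count model, Eq. (3.4) can be evaluated in one line. A minimal sketch, assuming `fit` is a model object from glm() or glm.nb() and `j` names a coefficient (the variable name is illustrative):

```r
# Minimal sketch of Eq. (3.4): beta_j * exp(x'beta), one value per site.
marginal_effect <- function(fit, j) coef(fit)[j] * fitted(fit)
# e.g., mean(marginal_effect(fit, "lane_width")) gives the average
# marginal effect of the (hypothetical) lane-width variable.
```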
3.3.4 Prediction

The fourth application aims at developing models for prediction purposes. In this application, the goal is to develop models with the best predictive capabilities. These models could be used with the data collected as part of the model development or with a completely new dataset. Predictive models are very useful for identifying hotspots or high-risk sites (discussed in Chapter 8, Identification of Hazardous Sites), as they can guide researchers in forecasting the expected number of crashes at selected locations, which are then compared to the observed values. Multicollinearity among variables is more acceptable in models aimed at prediction, although it still needs to be investigated to estimate the magnitude of the correlation, as most of the focus of the overall analysis is on the dependent variable. Variance inflation factors are a useful tool for identifying highly correlated variables (Myers, 2000). Note that a proper or logical relationship¹ must exist between the dependent and independent variables if the model is developed for predicting crashes or other types of safety data. This is to avoid having the model work as a black box, which could lead to biased estimates or large errors.

¹ The authors have noted models published in reports or manuscripts in which the number of crashes and traffic flow had an inverse relationship, which meant that the crash risk tended toward infinity as the traffic flow dropped to zero and became lower as traffic flow increased (i.e., risk = 1/flow).
3.3.5 Causal relationships

As discussed in the previous sections, the statistical models are used to describe relationships between variables. In a perfect world, one would seek to establish causal relationships, meaning being able to capture changes in risk factors that directly influence the risk for a crash to
occur, or to observe a change in the severity outcome when a crash happens. In other words, the causal relationship seeks to determine the deterministic mechanism that leads to a crash rather than rely on the statistical nature of the crash process. Although some studies have been conducted in highway safety on this topic, the field of research is still in its infancy, which can be attributed to the complexity of the crash process and the lack of detailed information related to crash occurrences found in police reports and other sources of information. Curious readers are referred to Davis (2000), Elvik (2003, 2011), and Hauer (2010, 2020) for further discussions about cause-effect relationships in highway safety. Pearl (2009) provided pioneering work in this area, which could be of interest to some readers.
3.4 Sources of dispersion

This section describes the characteristics of dispersion (both over- and underdispersion) in crash data. Overdispersion refers to the attribute that the variance of the observed or response variables is greater than the mean, $\mathrm{Var}[y_i] > E[y_i]$. On the other hand, underdispersion refers to a variance that is smaller than the mean, $\mathrm{Var}[y_i] < E[y_i]$. Equi-dispersion refers to when both the mean and variance are equal. With crash data, observed equi-dispersion is extremely rare (Lord and Mannering, 2010). Fig. 3.1 shows an example of residuals (after the model is fitted to the data) that exhibit over- and underdispersion.
3.4.1 Overdispersion

As discussed in the previous chapter, the fundamental crash process shows that crash counts are usually overdispersed, which, at its core, is attributed to the nature of the data. When looking at the raw count data, the process follows a sequence of Bernoulli trials with unequal probabilities of independent events, which is also known as Poisson trials (Lord et al., 2005).
FIGURE 3.1 Overdispersed (left) and underdispersed (right) residuals.
Within the context of Poisson trials, there are numerous attributes that lead to overdispersion or unobserved heterogeneity (i.e., the crash rate differs across observations). The crash process is very complex and involves numerous factors related to the interaction between the driver, the vehicle, and the environment (roadway and traffic characteristics), as discussed in Chapter 2, Fundamentals and Data Collection. Mannering et al. (2016) provide a list of factors that can lead to unobserved heterogeneity based on the interactions described earlier. They also discussed factors that influence crash severity once the crash has occurred, such as energy dissipation, impact angle, and the physiological characteristics of the vehicle occupants. The issue is that routinely collected data for safety analyses cannot capture all the factors that influence crash risk, even with the naturalistic data that are now being collected in the United States and Europe (see Chapter 12, Data Mining and Machine Learning Techniques). This will influence how much of the unobserved heterogeneity can be captured by statistical modeling. Later in the chapter, different models will be presented for minimizing overdispersion or reducing unobserved heterogeneity. It is important to note that as the sample mean increases and the number of zero observations becomes smaller, the characteristics of the crash data will tend toward equi-dispersion, irrespective of the missing factors that affect unobserved heterogeneity. This is particularly true for safety data that are analyzed at the county or state level (Stevens and Lord, 2005; Blower et al., 2020). When the data are characterized by a very large mean, the model will often revert to a "pure" Poisson process.
3.4.2 Underdispersion

In rare cases, crash data can exhibit underdispersion (Oh et al., 2006; Daniels et al., 2010). This can occur under two conditions. First, the data themselves exhibit underdispersion. This usually happens when the sample mean is very low and the data contain a large proportion of zero observations combined with a very short right-hand tail. With these characteristics, the data are almost characterized as a binomial distribution, whose main attribute is that the sample variance is always smaller than the sample mean. The dataset used by Oh et al. (2006) exhibits this characteristic, with a mean-to-variance ratio equal to 1.1; this dataset will be used for Exercise 3.3. Second, the underdispersion is observed after the model is fitted to the data. In this case, the original data are often marginally overdispersed or may be equi-dispersed (such as datasets with a very large mean). At that point, when the model is fitted, that is, the observations are conditionally modeled upon the mean $(y|\mu)$, the modeling output shows the model's residuals exhibiting underdispersion (Lord et al., 2010). It is therefore important for the safety analyst to adequately explore the data for
identifying whether there is a possibility that the modeling output could indicate underdispersion (see Chapter 5, Exploratory Analyses of Safety Data).
3.5 Basic count models

This section describes the basic models that have been proposed for analyzing safety data.
3.5.1 Poisson model

Because crash data consist of nonnegative integers, the application of standard ordinary least-squares regression (which assumes a continuous dependent variable) is not appropriate. Given that the dependent variable is a nonnegative integer, the Poisson regression model was initially used as a starting point for analyzing crash data (Jovanis and Chang, 1986). In a Poisson regression model, the probability of a roadway entity (segment, intersection, etc.) $i$ having $y_i$ crashes per some time period (where $y_i$ is a nonnegative integer) is given by

$$P(y_i|\mathbf{x}_i) = \frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!} \quad \text{with } y_i = 0, 1, 2, 3, \ldots \quad (3.5)$$

where $P(y_i|\mathbf{x}_i)$ is the probability of roadway entity (or observation) $i$ having $y_i$ crashes per unit of time and $\mu_i$ is the Poisson mean parameter for roadway entity $i$, which is equal to roadway entity $i$'s expected number of crashes per year, $E[y_i]$. Poisson regression models are estimated by specifying the Poisson parameter $\mu_i$ (the expected number of crashes per period) as a function of explanatory variables, the most common functional form being $\mu_i = \exp(\mathbf{x}_i'\boldsymbol{\beta})$, as described in Eq. (3.3). Although the Poisson model served as a starting point for crash-frequency analysis during the 1980s, it is rarely or no longer used for safety analyses for the reasons explained earlier. Furthermore, the Poisson model assumes that the crash risk (i.e., $\mu_i$) is the same for all entities that have the same characteristics (same covariates), which is theoretically impossible.
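In practice, Eq. (3.5) with the log link of Eq. (3.3) is fitted with standard GLM software. A minimal R sketch, assuming a hypothetical data frame `seg` with columns `y` (crash count), `aadt`, `lane_width`, and `length` (miles):

```r
# Minimal sketch: Poisson regression with a log link and a segment-length
# offset; the data frame `seg` and its column names are illustrative.
fit_pois <- glm(y ~ log(aadt) + lane_width + offset(log(length)),
                family = poisson(link = "log"), data = seg)
summary(fit_pois)
```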
3.5.2 Negative binomial model

The negative binomial (NB) (or Poisson-gamma) model is an extension of the Poisson model that was introduced to overcome the overdispersion observed in data. The NB model assumes that the Poisson parameter follows a gamma probability distribution. The model parameterization is characterized by a closed-form equation, and the mathematics to manipulate the relationship between the mean and the variance structures is relatively simple. Because of these properties, the NB has become a very
popular model among practitioners and researchers working in numerous fields (Hilbe, 2011). The NB model is derived by rewriting the Poisson parameter for each observation $i$ as $E[y_i|\mathbf{x}_i] = \mu_i = \exp(\mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i)$, where $\exp(\varepsilon_i)$ is a gamma-distributed error term with mean 1 and variance $\alpha$ (with $\alpha = 1/\phi$). The addition of this term allows the variance to differ from the mean, such that

$$\mathrm{Var}[y_i|\mathbf{x}_i] = E[y_i|\mathbf{x}_i]\left(1 + \alpha E[y_i|\mathbf{x}_i]\right) = E[y_i|\mathbf{x}_i] + \alpha E[y_i|\mathbf{x}_i]^2$$

The Poisson model is a limiting model of the NB regression model as $\alpha$ approaches zero, which means that the selection between these two models is dependent upon the value of $\alpha$. The parameter $\alpha$ is often called the dispersion parameter (or the overdispersion parameter); its inverse is called the "inverse dispersion parameter" and is denoted by $\phi$. The dispersion parameter is usually assumed fixed, but it can be made dependent on the covariates of the model (Miaou and Lord, 2003; Geedipally et al., 2009): $\alpha_i = \exp(\mathbf{z}_i'\boldsymbol{\gamma})$, where $\mathbf{z}$ is a vector of covariates that need not be the same as those used for the mean value and $\boldsymbol{\gamma}$ is a $(p \times 1)$ vector of parameters. Chapter 6, Cross-Sectional and Panel Studies in Safety provides more details about the variance structures of crash-frequency models. The probability mass function (PMF) of the negative binomial distribution has the following form:

$$P(y_i|\mathbf{x}_i, \alpha) = \frac{\Gamma(1/\alpha + y_i)}{\Gamma(1/\alpha)\,y_i!}\left(\frac{1/\alpha}{1/\alpha + \mu_i}\right)^{1/\alpha}\left(\frac{\mu_i}{1/\alpha + \mu_i}\right)^{y_i} \quad (3.6)$$

where, as in Eq. (3.5), $P(y_i|\mathbf{x}_i, \alpha)$ is the probability of roadway (or other type of) entity $i$ having $y_i$ crashes per time period; $\Gamma(\cdot)$ denotes the gamma function; and all other variables are as previously defined. This parameterization is defined as NB-2 in the econometric literature (Cameron and Trivedi, 2013). Another parameterization called NB-1, which also exists but is very seldom used in the safety literature, is described in Appendix A. Similar to other fields, the Poisson-gamma or NB model is the most frequently used model in crash-frequency modeling (Lord and Mannering, 2010). Although very popular, the model does have some limitations, most notably its inability to handle underdispersed data and potential biases in the dispersion parameter when the data are characterized by low sample mean values and small sample sizes, or by long tails and a large percentage of zeros (see Lord, 2006; Lord et al., 2007).
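The NB-2 model of Eq. (3.6) can be fitted by MLE with MASS::glm.nb(), reusing the illustrative `seg` data frame from the Poisson sketch above. Note that glm.nb() reports theta, which corresponds to the inverse dispersion parameter φ = 1/α in this section's notation.

```r
library(MASS)
# Minimal sketch: NB-2 regression; `seg` and its columns are illustrative.
fit_nb <- glm.nb(y ~ log(aadt) + lane_width + offset(log(length)), data = seg)
summary(fit_nb)     # 'Theta' is phi, the inverse dispersion parameter
1 / fit_nb$theta    # alpha, the dispersion parameter
```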
3.5.3 Poisson-lognormal model

The Poisson-lognormal (PLN) model provides a similar parameterization as the negative binomial model, but the error term $\exp(\varepsilon_i)$ is lognormal rather than gamma distributed. In this context, $\varepsilon_i \sim \mathrm{Normal}(0, \sigma^2)$, which translates to $E[y_i|\mathbf{x}_i] = \exp(\mathbf{x}_i'\boldsymbol{\beta} + \sigma^2/2)$. In terms of modeling, the
mean has to be adjusted to account for the added term associated with the variance. The variance of the estimated count is given as follows:

$$\mathrm{Var}(y_i|\mathbf{x}_i) = e^{\mathbf{x}_i'\boldsymbol{\beta} + \sigma^2/2} + \left(e^{\sigma^2} - 1\right)e^{2\mathbf{x}_i'\boldsymbol{\beta} + \sigma^2}$$

Compared to the Poisson-gamma model, the PLN model is more flexible than its counterpart, particularly for observations located at the tail end of the distribution, as seen in Fig. 3.2 (Khazraee et al., 2018). On the other hand, the NB tends to fit the data better near the zero counts. The PLN model is also more robust or reliable than the NB model
FIGURE 3.2 Probability distribution function of the Poisson-gamma (NB) and lognormal distributions for different mean-variance combinations (Khazraee et al., 2018).
for data characterized by low sample mean values and small sample sizes (Lord and Miranda-Moreno, 2008). With the latest advancements in Bayesian statistics, this model can be easily estimated or applied for analyzing safety data, as the PMF of the model does not have a closed form. Codes already exist in WinBUGS and MATLAB for estimating Bayesian models (see Appendix C). As discussed in the previous chapter, Shirazi and Lord (2019) described the boundaries within which the PLN model should be selected over the NB model before fitting the data. The analyst needs to summarize several types of statistics for the data. They showed that the kurtosis and the percentage of zeros in the data are among the most important summary statistics needed to distinguish between these two distributions. Fig. 3.3, reproduced from Chapter 2, Fundamentals and Data Collection, outlines the decision tree for selecting one model over the other. Exercise 3.1 provides an example of how to use this figure, and Exercise 3.2 shows an example in which the NB and PLN are compared using the same dataset.
FIGURE 3.3 Characteristics-based heuristic to select a model between the PG and PLN distributions (Lord et al., 2019) (the tree can be used for data with the characteristics of 0.1 < mean < 20 and 1 < VMR < 25).
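Because the PLN has no closed-form PMF, it is often easiest to explore by simulation. A minimal sketch, with an illustrative linear predictor and σ:

```r
# Minimal sketch: simulating Poisson-lognormal counts.
n     <- 10000
eta   <- log(1.5)   # hypothetical linear predictor x'beta
sigma <- 0.7
y     <- rpois(n, exp(eta + rnorm(n, 0, sigma)))
mean(y)             # compare with E[y] = exp(eta + sigma^2/2)
var(y) / mean(y)    # variance-to-mean ratio well above 1
```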
Exercise 3.1

Using the Texas Rural Two-Lane Highways Dataset, compare the characteristics of the data to determine which distribution, the NB or the PLN, appears to be more adequate.

First, using the information presented in Chapter 5, summarize the characteristics of this dataset. The summary statistics show that the percentage of zeros is 69.9% and the kurtosis is 123.6.

Second, examine the decision tree described in Fig. 3.3. Looking at this tree, for a kurtosis of 123.6 and a percentage of zeros equal to 69.9%, the selected model should be the PLN. The solution is found using the right-hand side of the tree.
Exercise 3.2

Estimate an NB model and a PLN model using the subset of the dataset used in Exercise 3.1 to compare the two models. Use the following variables: Annual Average Daily Traffic (AADT) (veh/day), Lane width (LW) (ft), Speed limit (SL) (mph), and Left shoulder width (LSW) (ft). Use the segment length (L) (mile) as an offset, that is, its coefficient is equal to 1.

First, determine the functional form:

$$\mu = \exp\left(\beta_0 + \beta_1 \ln(F) + \sum_{j=2}^{p}\beta_j x_j\right)$$

The functional form can be manipulated to get the following:

$$\mu = \beta_0^*\, L\, F^{\beta_1} \exp\left(\sum_{j=2}^{p}\beta_j x_j\right)$$

The natural log of the AADT, ln(AADT), needs to be used to characterize the nonlinear relationship $F^{\beta_1}$, which is taken out of the exponential term. This form ensures that when the traffic flow is equal to zero, there is zero risk of a crash. The coefficient of the intercept is given as $\beta_0^* = \exp(\beta_0)$.
Exercise 3.2 (cont'd)

Second, as the PLN cannot be estimated by maximum likelihood estimation (MLE), the coefficients of both models are estimated using the Bayesian method, with the following (hyper)priors:

Regression coefficients for both models: β_j ~ Uniform(−∞, +∞), j = 0, 1, 2, ..., p;
Inverse dispersion parameter of the PG: φ ~ Uniform(0, +∞);
Dispersion parameter of the PLN: σ ~ Uniform(0, +∞).

Third, estimate the posterior distribution of the coefficients using Markov Chain Monte Carlo (MCMC) simulation. The sampling of all parameters can be carried out using the Metropolis-Hastings algorithm or slice sampling (as in the WinBUGS software). Discard the first 20,000 samples and estimate the posterior distribution of each coefficient using the next 60,000 MCMC iterations.

Fourth, present the results of the posterior estimates.

                          Poisson-gamma                    Poisson-lognormal
Variable          Estimate  Std. errora  Pr(>|z|)   Estimate  Std. errora  Pr(>|z|)
Intercept (β0)     -7.3166    1.6729      0.0000     -7.5750    1.6729      0.0000
Ln(AADT) (β1)       0.5521    0.0949      0.0000      0.5477    0.0820      0.0000
RSW (β2)            0.0733    0.0695      0.2942      0.0710    0.7550      0.9252
LW (β3)             0.2366    0.1009      0.0184      0.2569    0.1068      0.0173
SL (β4)             0.0193    0.0192      0.0201      0.0204    0.0201      0.3181
LSW (β5)           -0.0285    0.0631      0.6513     -0.0246    0.0680      0.7202
φ                   2.4660    1.1162      0.0284
σ                                                     0.6898    0.1001      0.0000
DIC                 883.3                              891.8

a Based on the credible intervals.

The modeling results are very similar and the DIC values are close to each other, with the PG slightly favored over the PLN. This was also confirmed with the heuristic method (the kurtosis of this subset was close to the boundary between the PG and PLN).
Exercise 3.2 (cont'd)

The final model for the Poisson-gamma is as follows:

$$\mu = e^{-7.3166}\, L\, F^{0.55} \exp(0.073\,\mathrm{RSW} + 0.237\,\mathrm{LW} + 0.019\,\mathrm{SL} - 0.029\,\mathrm{LSW})$$
3.5.4 Other Poisson-mixture models

Following the introduction of the NB model, several other Poisson-mixture models have been examined or proposed in the context of crash data analysis. Some of these models include:

Poisson-Weibull: The Poisson-Weibull model performs as well as the NB model, and its coefficients can be easily estimated using the MLE (Connors et al., 2013; Cheng et al., 2013).

Poisson-inverse Gaussian (PIG): The PIG model performs similarly to the PLN, in that the model fits the data better at the tail end of the distribution (Zha et al., 2016). The coefficients are also easily estimated using the MLE.

Poisson-inverse gamma: This model also performs similarly to the PLN for long-tailed data, as shown in Fig. 3.2 (Khazraee et al., 2018). The model is estimated using the Bayesian estimating method, which requires more work than the MLE.

Sichel (SI): This model has been used or applied more frequently than the previous models (Wu et al., 2015). The SI model is recommended for long-tailed data (Zou et al., 2015). This model can be estimated using the MLE.

Poisson-Tweedie: Depending on the parameterization, Poisson-Tweedie models include the NB, PIG, and SI models as special cases (Debrabant et al., 2018). The same characteristics as those listed above apply here.

It should be noted that most of these models have not been evaluated for their stability when the safety data are characterized by low sample means and small sample sizes (Lord, 2006).
3.6 Generalized count models for underdispersion

This section describes models that have been proposed for analyzing datasets that either are underdispersed or whose modeling results exhibit underdispersion (conditional upon the mean).
3.6.1 Conway-Maxwell-Poisson model

The most well-known model used for analyzing underdispersion is the Conway-Maxwell-Poisson (COM-Poisson). This model was originally proposed by Conway and Maxwell (1962) to analyze queues and service rates. Shmueli et al. (2005) further elucidated the statistical properties of the COM-Poisson distribution using the formulation given by Conway and Maxwell (1962), and Kadane et al. (2006) developed the conjugate distributions for the parameters of the COM-Poisson. The latter researchers used the following parameterization for the PMF of the model:

$$P(y_i|\mathbf{x}_i) = \frac{\lambda_i^{y_i}}{(y_i!)^{\nu}}\,\frac{1}{Z(\lambda_i, \nu)} \quad (3.7a)$$

$$Z(\lambda_i, \nu) = \sum_{n=0}^{\infty}\frac{\lambda_i^n}{(n!)^{\nu}} \quad (3.7b)$$

where $y_i$ is the number of crashes per unit of time for observation $i$; $\lambda_i$ is a centering parameter that is approximately equal to the mean of the observations in many cases (not exactly the same as $\mu$) for observation $i$; and $\nu$ is defined as the shape parameter of the COM-Poisson distribution. The COM-Poisson can model both underdispersed ($\nu > 1$) and overdispersed ($\nu < 1$) data, and several common PMFs are special cases of the COM-Poisson with the original formulation. Specifically, setting $\nu = 0$ yields the geometric distribution; $\lambda < 1$ and $\nu \to \infty$ yields the Bernoulli distribution in the limit; and $\nu = 1$ yields the Poisson distribution. The parameterization above leads to the following approximate relationships for the mean and variance, respectively:

$$E[y_i|\mathbf{x}_i] = \mu_i \approx \lambda_i^{1/\nu} + \frac{1}{2\nu} - \frac{1}{2} \quad (3.8a)$$

$$\mathrm{Var}[y_i|\mathbf{x}_i] \approx \frac{1}{\nu}\,\lambda_i^{1/\nu} \quad (3.8b)$$
I. Theory and background
73
3.6 Generalized count models for underdispersion
Zðmi ; nÞ ¼ 0
N n n X m i
n¼0
mi ¼ exp@b0 þ
1
p X
bj xj A
j¼1
ni ¼ exp g0 þ
(3.9b)
n!
q X
(3.9c)
! g l xl
(3.9d)
l¼1 1
=
where mi ¼ li n , the mean of the response variable. With this parameterization, the shape parameter can be made dependent on the covariates of the model. Similar to the NB with a varying dispersion, the covariates do not need to be the same as for the mean function. Since its introduction, the COM-Poisson has become very popular not only in highway safety, but also in various areas, such as in management, economics, statistics, and ecology.
Exercise 3.3

Using the South Korean Dataset, estimate a COM-Poisson model. Although old, this dataset collected in South Korea has been used for comparing models characterized by underdispersion. In this case, the underdispersion is observed after the model is fitted, as the raw data show near equi-dispersion (as discussed in the text earlier). Use the following variables: AADT (veh/day), Average daily rail traffic (ADRT), Presence of commercial area (PCA), Train detector distance (mile) (TDD), Presence of track circuit controller (PTCC), Presence of guide (PG), and Presence of speed hump (PSH).

First, determine the functional form:

$$\mu = \beta_0^*\, F^{\beta_1} \exp\left(\sum_{j=2}^{p}\beta_j x_j\right)$$

In this functional form, F is the AADT flow on the segment. The natural log of the AADT, ln(AADT), needs to be used to characterize the nonlinear relationship $F^{\beta_1}$.

Second, estimate the coefficients of the model using the MLE (see the attached code^a).
Exercise 3.3 (cont'd)

Third, present the results of the model.

Variable          Estimate   Std. error   Pr(>|z|)
Intercept (β0)     -7.075      1.286       0.0000
Ln(AADT) (β1)       0.658      0.141       0.0000
ADRT (β2)           0.005      0.004       0.2112
PCA (β3)            1.500      0.515       0.0041
TDD (β4)            0.002      0.001       0.0456
PTCC (β5)           1.214      0.437       0.0063
PG (β6)             0.994      0.513       0.0522
PSH (β7)            1.571      0.539       0.0048
ν                   2.365
Log likelihood    -94.88

a https://cran.r-project.org/web/packages/COMPoissonReg/COMPoissonReg.pdf.

The modeling results show that the model output is heavily underdispersed (ν = 2.365).
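A model of this form can be estimated with the COMPoissonReg package referenced in the footnote. The hypothetical sketch below assumes a data frame `crossings` containing the variables listed earlier (column names are illustrative), and follows the glm.cmp() interface documented for the package.

```r
library(COMPoissonReg)
# Hypothetical sketch: COM-Poisson regression with a constant shape
# parameter nu; `crossings` and its columns are illustrative names.
fit_cmp <- glm.cmp(y ~ log(aadt) + adrt + pca + tdd + ptcc + pg + psh,
                   formula.nu = ~ 1, data = crossings)
summary(fit_cmp)    # nu > 1 indicates underdispersion
```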
3.6.2 Other generalized models

Other models that have been proposed for analyzing underdispersion include the following:

Gamma model (continuous distribution): This model cannot account for observations with zero counts. It is presented here as a caution against using it.

Gamma-count model: This modified gamma model was proposed by Winkelman (1995). Its parameterization assumes that the observations are directly correlated with each other in time. In safety, this means that a crash at time t is directly related to a crash at time t + n, which is, again, theoretically impossible.

Double Poisson: This model was initially proposed by Efron (1986). However, it has not been used often, as the normalizing constant of the model is not properly defined. Zou et al. (2013) proposed a different parameterization of the constant term and found results similar to the COM-Poisson.
Hyper-Poisson: The hyper-Poisson (hP) is a two-parameter generalization of the Poisson distribution. Similar to the COM-Poisson, it can model the variance function as a function of the covariates. It performs as well as the COM-Poisson and can be estimated using the MLE (Khazraee et al., 2015). Generalized Event Count: This model uses the theoretical statistics called “bilinear recurrence relationship” that was introduced by Katz (1965) for describing the dispersion parameter of the Poisson count model. Ye et al. (2018) applied the model to crash data and found its performance to be similar to the hP.
3.7 Finite mixture and multivariate models

This section describes models that have been proposed to examine unknown subpopulations within datasets (finite mixture models) and situations in which different dataset attributes are correlated and need to be accounted for simultaneously (multivariate models).
3.7.1 Finite mixture models

Finite mixture models are a class of models that can be utilized to examine heterogeneous populations. These models assume that the overall data are generated from several distributions that are mixed together, with the underlying principle that individual observations arise from an unknown number of distributions or subpopulations. As with the single-distribution models, the Finite Mixture Negative Binomial (FMNB) is the most popular finite mixture model (note: the notation FM is placed in front of the Poisson or NB model designation). There are several reasons to expect the existence of different subpopulations, as crash data are generally collected from various geographic, environmental, and geometric design contexts over some fixed time period. In such cases, it may be inappropriate to apply one aggregate NB model, which could lead to the misinterpretation of the modeling results. Therefore, it is reasonable to hypothesize that individual crashes occurring on highways or other safety-related entities are generated from a certain number (K) of hidden subgroups, or components, that are unknown to the transportation safety analyst. For the FMNB model, the final outputs include the number of components, the component proportions, the component-specific regression coefficients, and the degree of overdispersion within each component.
For the FMNB-K model, it is assumed that the marginal distribution of $y_i$ follows a mixture of negative binomial distributions (Park and Lord, 2009):

$$P(y_i|\mathbf{x}_i, \Theta) = \sum_{k=1}^{K} w_k\, \mathrm{NB}(\mu_{k,i}, \phi_k) = \sum_{k=1}^{K} w_k\left[\frac{\Gamma(y_i + \phi_k)}{\Gamma(y_i + 1)\Gamma(\phi_k)}\left(\frac{\mu_{k,i}}{\mu_{k,i} + \phi_k}\right)^{y_i}\left(\frac{\phi_k}{\mu_{k,i} + \phi_k}\right)^{\phi_k}\right] \quad (3.10)$$

with the expected value and variance given by

$$E(y_i|\mathbf{x}_i, \Theta) = \sum_{k=1}^{K} w_k\, \mu_{k,i} \quad (3.11)$$

$$\mathrm{Var}(y_i|\mathbf{x}_i, \Theta) = E(y_i|\mathbf{x}_i, \Theta) + \sum_{k=1}^{K} w_k\, \mu_{k,i}^2\left(1 + \frac{1}{\phi_k}\right) - E(y_i|\mathbf{x}_i, \Theta)^2 \quad (3.12)$$

where $\mu_{k,i} = \exp(\mathbf{x}_i'\boldsymbol{\beta}_k)$ for subgroup $k$; $\boldsymbol{\beta} = (\boldsymbol{\beta}_1, \boldsymbol{\beta}_2, \ldots, \boldsymbol{\beta}_K)$ denotes the vector of all parameters; and $\mathbf{w} = (w_1, w_2, \ldots, w_K)'$ refers to a weight distribution whose elements are restricted to be positive and sum to unity ($w_k > 0$ and $\sum w_k = 1$). In this case, even if all the component means are the same, the variance of $y_i$ will always be greater than the mean. This type of model has become increasingly popular among highway safety analysts, as it can be used for identifying potential sources of dispersion by way of mixed distributions. It should be noted that zero-inflated models are a special case of finite mixture models in which one of the two latent classes has a long-term mean equal to zero, which is theoretically impossible (Park and Lord, 2007). Zero-inflated models are briefly discussed in the last section of this chapter.
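As a hypothetical illustration, a two-component finite mixture of Poisson regressions (the FMP analogue of the FMNB; NB components require an extension driver) can be estimated with the flexmix package. The data frame `seg` and its columns are illustrative names.

```r
library(flexmix)
# Hypothetical sketch: two-component finite mixture of Poisson regressions.
fit_fm <- flexmix(y ~ log(aadt) + lane_width, k = 2, data = seg,
                  model = FLXMRglm(family = "poisson"))
summary(fit_fm)      # component weights (w_k) and sizes
parameters(fit_fm)   # component-specific regression coefficients
```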
3.7.2 Multivariate models

Multivariate models are used when different crash severities or collision types are analyzed simultaneously. These models are needed because the severity levels and collision types may not be independent. As shown by Kuo and Lord (2020), the locations of fatal crashes, for example, may also be associated with crashes that are not fatal but lead to incapacitating injuries (note: the crashes themselves are not correlated). In this case, factors such as the speed limit may explain why these severity types are related geographically. Treating the correlated crash counts as independent and applying a univariate model to each category can lead to less precise estimates of the effects of factors on crash risk. Fortunately, multivariate models can account for the correlation between crash counts directly in the modeling process.
This type of model has the same basic nomenclature as the univariate model. The difference is that the vectors are in fact matrices (p × m), where p refers to the number of parameters in the model and m corresponds to the number of different crash severity or collision types. The notation for multivariate models is as follows (Park and Lord, 2007):

$$\mathbf{Y} = \begin{bmatrix} y_{11} & \cdots & y_{1m} \\ \vdots & \ddots & \vdots \\ y_{n1} & \cdots & y_{nm} \end{bmatrix}; \quad \mathbf{x} = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{bmatrix}; \quad \boldsymbol{\beta} = \begin{bmatrix} \beta_{11} & \cdots & \beta_{1m} \\ \vdots & \ddots & \vdots \\ \beta_{p1} & \cdots & \beta_{pm} \end{bmatrix} \quad (3.13)$$
With the matrices above, the model is defined as follows (always assuming the crash count is Poisson distributed) (Park and Lord, 2007):

$$P(y_{im} \mid \mathbf{x}_i, b_i, \boldsymbol{\beta}_m) \sim \mathrm{Poisson}(\mu_{im}) \quad (3.14)$$

$$\mu_{im} = \exp(\mathbf{x}_i \boldsymbol{\beta}_m + b_{im}) \quad (3.15)$$

for m = 1, …, M and i = 1, …, n. The y_im's are assumed independent given the μ_im's. To model the correlations among the crash counts of M different severity or collision types, let

$$P(\mathbf{b}_i \mid \boldsymbol{\Sigma}) \sim N_M(\mathbf{0}, \boldsymbol{\Sigma}) \quad (3.16)$$
where Σ is an unrestricted covariance matrix and N_M denotes the M-dimensional multivariate normal distribution. It was shown in Chib and Winkelmann (2001) that the variance of y_im is greater than the mean (allowing for overdispersion) as long as the diagonal elements of Σ are greater than 0, and that the covariance between the counts y_im and y_im′ can be positive or negative depending on the sign of the corresponding off-diagonal element of Σ. Thus, the correlation structure of the crash counts is unrestricted. In the safety literature, the multivariate model has been proposed for both the Poisson-gamma (MVNB) (Ma and Kockelman, 2006) and the Poisson-lognormal (MVPLN) (Park and Lord, 2007) models. However, the correlation matrix for the MVNB model cannot contain negative values, which makes it less flexible than the MVPLN. Hence, since their introduction, the MVPLN has been the only multivariate model used for analyzing safety data.
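The correlation mechanism of Eqs. (3.14)-(3.16) can be illustrated by simulation. The sketch below draws MVPLN-style counts for two severity levels using assumed (hypothetical) coefficients and covariance; the off-diagonal element of Σ is what induces correlation between the simulated counts.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 1000, 2                       # sites and severity levels (illustrative)
beta = np.array([[0.5, 0.2],         # hypothetical coefficients (p x M)
                 [0.8, 0.6]])
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + covariate

# Covariance of the multivariate normal error b_i (Eq. 3.16); the
# off-diagonal term creates dependence between the two severity counts.
Sigma = np.array([[0.4, 0.25],
                  [0.25, 0.3]])
b = rng.multivariate_normal(np.zeros(M), Sigma, size=n)

mu = np.exp(X @ beta + b)            # Eq. (3.15)
y = rng.poisson(mu)                  # Eq. (3.14): conditionally independent

print(np.corrcoef(y[:, 0], y[:, 1])[0, 1])  # marginal correlation of counts
```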
3.8 Multi-distribution models
Multi-distribution models differ from multivariate models in that they include multiple shape/scale parameters or mix more than two distributions simultaneously (e.g., Poisson-gamma-Bernoulli). On occasion, some of these models have been called zero-centered models because the majority of the distribution weight lies near zero. Technically, multi-distribution models could also be manipulated to become multivariate models, but so far, no researchers have developed a combination of these two categories of models.
3.8.1 Negative Binomial-Lindley model
The Negative Binomial-Lindley (NB-L), as the name implies, is a mixture of the NB and the Lindley distributions (Lindley, 1958; Ghitany et al., 2008). This three-parameter distribution has interesting properties: it is characterized by a single long-term mean that is never equal to zero and a single variance function, similar to the traditional NB distribution (note: as discussed below, the third parameter is used to refine the estimate of the gamma parameter of the NB model). Before tackling the NB-L, recall that the NB distribution can be parameterized in two different manners: either as a mixture of the Poisson and gamma distributions (described earlier) or based on a sequence of independent Bernoulli trials. Using the latter parameterization, the PMF of the NB distribution can be given as follows:

$$P(y_i \mid \phi, q) = \frac{\Gamma(\phi + y_i)}{\Gamma(\phi)\, y_i!}\, q^{\phi} (1 - q)^{y_i}, \quad \phi > 0,\ 0 < q < 1 \quad (3.17)$$

In this formulation, q is defined as the probability of failure in each trial, given by

$$q = \frac{\phi}{\mu_i + \phi} \quad (3.18)$$

where μ_i and φ are as defined earlier. We can conveniently denote the above distribution as NB(y_i | φ, μ_i). We can now define the NB-L distribution as a mixture of the NB and Lindley distributions as follows:

$$P(y_i \mid \mu_i, \phi, \theta) = \int \mathrm{NB}(y_i \mid \phi, \varepsilon_i \mu_i)\, \mathrm{Lindley}(\varepsilon_i \mid \theta)\, d\varepsilon_i \quad (3.19)$$

Notice that μ_i, the mean of the NB distribution, is multiplied by a random term ε_i, which follows the Lindley distribution. The Lindley distribution is given as

$$f(x \mid \theta) = \frac{\theta^2}{\theta + 1} (1 + x) e^{-\theta x}, \quad \theta > 0,\ x > 0 \quad (3.20)$$

where θ is the shape parameter.
Under the assumption that the number of crashes y follows an NB-L(φ, θ) distribution, the mean function is given as follows:

$$E(y_i) = \mu_i E(\varepsilon_i) = \exp\!\left(\beta_0 + \sum_{j=1}^{p} \beta_j x_j\right) \frac{\theta + 2}{\theta(\theta + 1)} \quad (3.21)$$

The variance can be obtained as follows:

$$\mathrm{Var}(y_i) = \mu_i \frac{\theta + 2}{\theta(\theta + 1)} + \mu_i^2\, \frac{2(\theta + 3)}{\theta^2(\theta + 1)} \frac{(1 + \phi)}{\phi} - \left[\mu_i \frac{\theta + 2}{\theta(\theta + 1)}\right]^2 \quad (3.22)$$
Notice that the structural form of the regression is comparable to that of Eq. (3.3) if we rearrange the terms as $\beta_0' = \beta_0 + \log\!\left(\frac{\theta + 2}{\theta(\theta + 1)}\right)$.
For the GLM, the NB-L distribution, conditional upon the unobserved site-specific frailty term ε_i that explains additional heterogeneity, can be rewritten as follows (Geedipally et al., 2012):

$$P(y_i; \mu_i, \phi \mid \varepsilon_i) = \mathrm{NB}(y_i \mid \phi, \varepsilon_i \mu_i), \quad \varepsilon_i \sim \mathrm{Lindley}(\varepsilon_i \mid \theta) \quad (3.23)$$

The above formulation can be thought of as an instance of the generalized linear mixed model where the mixed effects follow the Lindley distribution. However, considering that the Lindley is not a standard distribution, the hierarchical representation of the Lindley distribution can be further utilized. As defined by Zamani and Ismail (2010), the Lindley distribution is a two-component mixture given by

$$\varepsilon \sim \frac{1}{1 + \theta}\, \mathrm{Gamma}(2, \theta) + \frac{\theta}{1 + \theta}\, \mathrm{Gamma}(1, \theta) \quad (3.24)$$

Recognizing the special structure in the mixture components, the above equation can be rewritten as follows:

$$\varepsilon_i \sim \sum_{z_i} \mathrm{Gamma}(1 + z_i, \theta)\, \mathrm{Bernoulli}\!\left(z_i \,\Big|\, \frac{1}{1 + \theta}\right) \quad (3.25)$$

Under a Bayesian framework, a hierarchical representation of the Lindley distribution can be written as

$$\varepsilon_i \sim \mathrm{Gamma}(1 + z_i, \theta) \quad (3.26a)$$

$$z_i \sim \mathrm{Bernoulli}\!\left(\frac{1}{1 + \theta}\right) \quad (3.26b)$$

whose marginal distribution is the Lindley distribution. The complete multilevel hierarchical model can now be given as follows (Geedipally et al., 2012):

$$P(y_i; \mu_i, \phi \mid \varepsilon_i) = \mathrm{NB}(y_i \mid \phi, \varepsilon_i \mu_i) \quad (3.27a)$$

$$\varepsilon_i \sim \mathrm{Gamma}(\varepsilon_i \mid 1 + z_i, \theta) \quad (3.27b)$$

$$z_i \sim \mathrm{Bernoulli}\!\left(z_i \,\Big|\, \frac{1}{1 + \theta}\right) \quad (3.27c)$$
The hierarchical model mixes three distributions: Poisson, gamma, and Bernoulli. Since its introduction, the NB-L model has been successfully applied for analyzing numerous datasets (see Appendix B for a list of studies). In all cases, the NB-L worked very well to capture excess zero responses. Recently, a random-parameters NB-L (RPNB-L) model was introduced for analyzing safety data, which further helps reduce unobserved heterogeneity (random-parameters models are described later) (Shaon et al., 2018). Shirazi et al. (2017), using a heuristic method, indicated that the NB-L is usually preferred over the NB when the skewness of the data is larger than 1.92. Because this boundary was calculated heuristically, the case for selecting one distribution or model over the other becomes stronger the further the skewness lies from the boundary (say, 2.5 or 3).
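The hierarchical representation in Eqs. (3.27a)-(3.27c) translates directly into a simulation recipe. The following minimal sketch, with arbitrarily chosen values of μ, φ, and θ, draws NB-L counts by layering the Bernoulli, gamma, and Poisson-gamma (NB) steps, and checks the empirical mean against Eq. (3.21).

```python
import numpy as np

rng = np.random.default_rng(1)
n, mu, phi, theta = 50_000, 2.0, 1.5, 1.2   # hypothetical NB-L parameters

z = rng.binomial(1, 1.0 / (1.0 + theta), size=n)   # Eq. (3.27c)
eps = rng.gamma(shape=1 + z, scale=1.0 / theta)    # Eq. (3.27b), rate theta
lam = rng.gamma(shape=phi, scale=eps * mu / phi)   # gamma mixing of the NB
y = rng.poisson(lam)                               # Poisson-gamma = NB(phi, eps*mu)

# The empirical mean should approach mu*(theta+2)/(theta*(theta+1)), Eq. (3.21).
print(y.mean(), mu * (theta + 2) / (theta * (theta + 1)))
```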
Exercise 3.4
Using the Texas Rural Divided Multilane Highways Dataset, compare the characteristics of the data to determine which distribution, between the NB and NB-L, appears to be more adequate. First, using the methods described in Chapter 5, summarize the characteristics of the Texas data and find the skewness. The skewness is equal to 4.0. Second, as the skewness is larger than 1.92, the NB-L is preferred over the NB. This is confirmed when other goodness-of-fit measures are used to compare the distributions, as shown in the following table (Shirazi et al., 2017).
Exercise 3.4 (cont'd)

Method               NB       NB-L     Criteria                      Favored distribution
Chi-square (χ²)      2.73     1.68     χ²(NB-L) < χ²(NB)             NB-L
Log-likelihood (LL)  −696.1   −695.1   LL(NB-L) > LL(NB)             NB-L
DT heuristic         —        —        Skewness > 1.92               NB-L
RF heuristic         —        —        Using the RF heuristic tool   NB-L
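As a rough illustration of the DT heuristic used in this exercise, the sketch below computes the sample skewness and applies the 1.92 boundary; the simulated counts merely stand in for the Texas data, which are not reproduced here.

```python
import numpy as np
from scipy.stats import skew

def dt_heuristic(counts, boundary=1.92):
    """Skewness-based DT heuristic of Shirazi et al. (2017)."""
    g = skew(counts)
    return g, ("NB-L" if g > boundary else "NB")

# Illustrative highly skewed counts (not the actual Texas dataset).
y = np.random.default_rng(7).negative_binomial(0.6, 0.25, size=2000)
print(dt_heuristic(y))
```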
3.8.2 Other multi-distribution models
Other models that are categorized as multi-distribution models include the following:

Negative Binomial-Generalized Exponential (Vangala et al., 2015): This model is very similar to the NB-L, but the Lindley distribution is replaced by the generalized exponential (GE) distribution. The NB-GE is defined as follows:

$$P(y_i \mid \mu_i, \phi, p) = \int \mathrm{NB}(y_i \mid \phi, \varepsilon_i \mu_i)\, \mathrm{GE}(\varepsilon_i \mid p, s)\, d\varepsilon_i \quad (3.28)$$

The GE distribution is given by Aryuyuen and Bodhisuwan (2013) as follows:

$$f(\varepsilon_i \mid p, s) = p\, s\, (1 - e^{-s \varepsilon_i})^{p-1} e^{-s \varepsilon_i}, \quad p, s > 0,\ \varepsilon_i > 0 \quad (3.29)$$

where p is the shape parameter and s is the scale parameter. This model has been found to perform very similarly to the NB-L (Vangala et al., 2015), both in terms of goodness-of-fit and computing effort.

The Negative Binomial-Crack Model (Saengthong and Bodhisuwan, 2013): The negative binomial-Crack (NB-CR) distribution is obtained by
mixing the NB distribution with a CR distribution. The NB-CR distribution can be defined such that

$$P(y_i \mid \mu_i, \phi, \alpha) = \int \mathrm{NB}(y_i \mid \phi, \varepsilon_i \mu_i)\, \mathrm{CR}(\varepsilon_i \mid \lambda, \theta, \gamma)\, d\varepsilon_i \quad (3.30)$$

The parameter μ_i is the mean response for the number of crashes and is assumed to have a log-linear relationship with the covariates. The parameter ε_i follows the CR distribution. This model has not been evaluated using crash data.
3.9 Models for better capturing unobserved heterogeneity
This section describes models that have been proposed to better capture and understand unobserved heterogeneity. The two principal types are random-effects and random-parameters models.
3.9.1 Random-effects/multilevel model
In the models described in Section 3.5, it is assumed that the size of the effect of the variables is fixed, meaning that one true effect size exists for all the observations and the difference between observations (that is, the unobserved heterogeneity) is purely random error. In practice, the unobserved heterogeneity may not be solely attributable to purely random error; it could also be partly explained by differences in the observations themselves and between groups or levels of observations (as described in Section 3.4). These are called the within-observation variance (pure random error) and the between-observation variance, respectively. In highway safety, for example, sites located within the same geographical area are expected to share similar characteristics (say, a more homogeneous population within a neighborhood), as opposed to sites located in other geographical neighborhoods within a very large city, and this needs to be accounted for in the modeling effort. Random-effects (RE) models, sometimes called multilevel models (Gelman and Hill, 2007), allow the variance that may exist within different levels of the data to be better depicted. This is accomplished by adding one or more RE terms or a random intercept term to capture the between-observation variance. Taking the basic models described in Section 3.5, the formulation becomes

$$\mu_{io} = \exp\!\left(\mathbf{x}_{io}' \boldsymbol{\beta} + \varpi_o + \varepsilon_{io}\right) \quad (3.31)$$

where μ_io is the mean of observation i belonging to group o and ϖ_o is a random-effect or intercept term for group o. The end result of the random-effects model is that it modifies the mean of observation i by changing the value of the intercept. There are different parameterizations of the random-effects model, and the random-effect term can also be used to account for temporal correlation, such as for panel data (covered in Chapter 6, Cross-Sectional and Panel Studies in Safety), or for analyzing safety data across clusters in the recently introduced cross-classified random-effects modeling (CCREM) approach (see Bakhshi and Ahmed, 2021).
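A quick way to see what the random intercept in Eq. (3.31) does is to simulate grouped Poisson data. In the hedged sketch below, the group structure, coefficients, and between-group standard deviation are all assumed values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
groups, per_group = 20, 50
beta = np.array([0.2, 0.7])                    # hypothetical fixed effects
sigma_group = 0.5                              # between-group std. deviation

g = np.repeat(np.arange(groups), per_group)    # group label o for each site i
varpi = rng.normal(0.0, sigma_group, groups)   # random intercepts of Eq. (3.31)
x = rng.normal(size=g.size)

mu = np.exp(beta[0] + beta[1] * x + varpi[g])  # group-shifted mean
y = rng.poisson(mu)

# Between-group variation shows up in the group-level mean counts.
print([y[g == o].mean().round(2) for o in range(5)])
```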
3.9.2 Random-parameters model
Random-parameters (RP) models can be viewed as an extension of random-effects models. However, rather than only influencing the intercept of the model, random-parameters models allow each estimated parameter of the model to vary across each individual observation in the dataset (after the parameters are identified as being random). These models attempt to account for the unobserved heterogeneity from one roadway site to another (Milton et al., 2008). Over the last decade, this type of model has become quite popular. As crash-frequency models, they have two general types of parameterization: (1) random parameters and (2) random parameters with means as a function of explanatory variables. A third parameterization that incorporates the potential correlation between the variables exists, but it is not covered here, as it has not been frequently used in practice, with a few exceptions (see Saeed et al., 2019; Matsuo et al., 2020). The two parameterizations are described as follows (Mannering et al., 2016; Washington et al., 2020).

3.9.2.1 Random parameters
The RP model follows the same characteristics as the fixed-parameter model, that is, $\mu_i = \exp(\mathbf{x}_i' \boldsymbol{\beta} + \varepsilon_i)$, but in this case the coefficients are allowed to vary from one observation to the next. Thus, the coefficient k for each observation i in the vector x_i can be defined as

$$\beta_{ik} = \beta_k + \nu_{ik} \quad (3.32)$$

where β_ik is the coefficient for explanatory variable k and observation i, β_k is the mean fixed parameter estimate across all observations for explanatory variable k, and ν_ik is a randomly distributed term that is used to capture unobserved heterogeneity across all observations. This term can assume any specified distribution, such as the normal or the gamma distribution.
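The parameterization in Eq. (3.32) can be sketched in a few lines: each observation receives its own slope drawn around a common mean. The values of β_k and the spread of ν_ik below are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
beta_k, sigma_k = 0.6, 0.25        # assumed mean coefficient and spread

nu = rng.normal(0.0, sigma_k, n)   # randomly distributed term nu_ik
beta_ik = beta_k + nu              # Eq. (3.32): observation-specific slope

x = rng.normal(size=n)
mu = np.exp(0.1 + beta_ik * x)     # each observation gets its own slope
y = rng.poisson(mu)
print(beta_ik[:5].round(3))        # the slopes vary across observations
```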
3.9.2.2 Random parameters with means as a function of explanatory variables
The parameterization described in Eq. (3.32), which assumes a single mean β_k across all observations, can be expanded by making the coefficient dependent on the explanatory variables themselves. In other words, the mean may vary directly from observation to observation as a function of the observation's characteristics. In this regard, Eq. (3.32) can be rewritten as follows (Mannering et al., 2016):

$$\boldsymbol{\beta}_i = \boldsymbol{\beta} + \boldsymbol{\Theta} \mathbf{z}_i \quad (3.33)$$

where z_i is a (k × 1) vector of explanatory variables for observation i that influence the random parameter vector, and Θ is a (k × k) matrix of estimable coefficients. As explained by Mannering et al. (2016, p. 9), "each row of Θ corresponds to the loadings of a specific element of the β_i vector on the z_i vector; if a specific column entry in a row of Θ is zero, it implies that there is no shift in the mean of the corresponding row element of the β_i vector due to the row element of the z_i vector corresponding to the column under consideration." This parameterization could help capture additional heterogeneity that a model with fixed coefficients could not. For example, drivers of different age groups could be driving vehicles that have unique characteristics, such as the age of the vehicle or the presence or absence of specific safety features, that are not common in other groups. In this regard, it is well known that younger drivers often drive older vehicles (hand-me-downs from parents).² RP models are powerful models that have become very popular in highway safety, in major part because they fit the data better (as they significantly reduce the unobserved heterogeneity). Although these models tend to provide a better fit than fixed-parameter models, the coefficients can be difficult to interpret. For instance, one may have difficulty explaining why the variable "lane width" at a given value is associated with an increase in crash risk for some sites but not for the majority of the others. Furthermore, recent work by Tang et al. (2019) and Huo et al. (2020) showed that RP models may not be adequate for predicting crashes (say, at new locations), at least with the datasets used in those two studies. Nonetheless, more work should be conducted on this topic.
² https://www.irishtimes.com/life-and-style/motors/older-smaller-cars-putting-younger-drivers-at-risk-1.2060380: "A US-based report has shown that teenagers and younger drivers are dying because their cars are older and less safe... The ages of these cars are observed by the report to be as much as 11 years, while 82% of teenage drivers killed in this period were driving a car at least 6 years old. The researchers behind the report flagged up the fact that older, smaller cars were clearly less safe than newer, larger vehicles." (accessed December 2020).
It is instructive to note that the distinction between RE and RP models appears more pronounced in traffic safety modeling than elsewhere. Even though the two modeling approaches focus on different structural aspects to reduce unobserved heterogeneity, both can be seen as special cases of a more general class of models called generalized linear latent and mixed models (GLLAMs) (McCulloch et al., 2008; Stroup, 2013; Skrondal and Rabe-Hesketh, 2004). Consequently, much of the inference machinery that has been the mainstay of fitting GLLAMs can be used here. However, with the introduction of either RE or RP terms, the choice of inference technique is not straightforward; sometimes it becomes philosophical as well. One can estimate the parameters using either the frequentist or the Bayesian method, and one can focus on marginal or conditional estimates. Under the frequentist method, an RP model can treat anywhere from a single parameter to all of the parameters as random (the latter being extremely rare). As the likelihood function does not have a closed form in this case, simulation needs to be used for estimating the coefficients. Green (2004) used Monte Carlo (MC) simulation, not MCMC, for estimating the parameters. On the other hand, under the Bayesian estimation method, all the parameters are always defined as random by the nature of Bayesian theory (Yamrubboon et al., 2019). A variety of Bayesian inference techniques can be employed, with MCMC simulation being the most commonly used by statisticians. Occasionally, some RP models may be equivalent to a Bayesian hierarchical model. In terms of reducing unobserved heterogeneity, it should be pointed out that RP models reduce the heterogeneity via the parameters themselves, while multi-distribution models reduce it by decreasing the modeling error (via multiple shape parameters). If the ultimate (and misguided) objective is solely to reduce unobserved heterogeneity, then multi-distribution models are usually better than RP models (Shaon et al., 2018). However, as discussed at the end of the chapter, relying solely on this objective is not recommended.
3.10 Semi- and nonparametric models
This section describes semi- and nonparametric models. Semiparametric models are similar to parametric models in the sense that the crash counts are assumed to be Poisson distributed, but the Poisson mean and/or the coefficients of the model are assumed to follow a nonparametric (distribution-free) distribution. Nonparametric models are completely distribution free.
3.10.1 Semiparametric models
As described earlier, semiparametric models assume that the crash counts are Poisson distributed. There are different categories, but the two most common ones that have gained popularity in highway safety estimate the Poisson mean or the coefficients of the model using smooth functions (e.g., splines). Two examples follow.

The first is known as the generalized additive model (GAM). For this type of model, each coefficient is characterized by a distinct smooth function. The relationship between the mean and the parameters can be defined as follows (Xie and Zhang, 2008):

$$\mu_i = \exp\!\left(\beta_0 + \sum_{j=1}^{p} f_j(x_{ij})\right) \quad (3.34)$$

where β_0 is the intercept of the model and f_j(x_ij) is a smooth function (e.g., P-splines, kernel cubic regression splines, smoothers, and thin-plate regression splines). Generalized additive models can also include a combination of fixed (parametric) terms and nonlinear functions:

$$\mu_i = \exp\!\left(\beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} + f_{k12}\!\left(x_{i(k+1)}, x_{i(k+2)}\right) + \sum_{j=k+3}^{p} f_j(x_{ij})\right) \quad (3.35)$$

where $\sum_{j=1}^{k} \beta_j x_{ij}$ is the parametric component of the model, $f_{k12}(x_{i(k+1)}, x_{i(k+2)})$ is a smooth function taking two input variables, and $\sum_{j=k+3}^{p} f_j(x_{ij})$ is the summation of nonparametric smooth functions.
Fig. 3.4 shows an example of the application of a GAM using crash and traffic flow data collected at signalized intersections in Toronto, Ontario. This dataset has been used extensively in the safety literature (Lord, 2000; Miaou and Lord, 2003). The functional form of the GAM is as follows:

$$\mu_i = \exp(\beta_0 + f_1(x_{i1}) + f_2(x_{i2})) \quad (3.36)$$

where f_1(x_i1) and f_2(x_i2) are smooth functions of the entering flows for the major and minor approaches, respectively. The relationship between the number of crashes and flow is illustrated in Fig. 3.4.
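A GAM of the form of Eq. (3.36) can be fitted, for example, with the third-party pygam package; this is a minimal sketch assuming pygam is installed, not the software used by Xie and Zhang (2008), and the flows below are synthetic stand-ins for the Toronto data.

```python
import numpy as np
from pygam import PoissonGAM, s   # assumes the pygam package is available

rng = np.random.default_rng(4)
n = 500
major = rng.uniform(5000, 60000, n)   # synthetic major-approach entering flow
minor = rng.uniform(200, 12000, n)    # synthetic minor-approach entering flow
mu = np.exp(-10 + 0.6 * np.log(major) + 0.4 * np.log(minor))
y = rng.poisson(mu)

# One smooth term per flow variable, mirroring Eq. (3.36).
gam = PoissonGAM(s(0) + s(1)).fit(np.column_stack([major, minor]), y)
print(gam.predict(np.array([[30000, 5000]])))
```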
The second example is known as the semi-nonparametric (SNP) Poisson model (Ye et al., 2018). For this model, the unobserved heterogeneity captured by the model's error term (ε) is represented by a polynomial function of order K.
FIGURE 3.4 Relationship between the number of crashes and entering flows (Xie and Zhang, 2008).
The polynomial, called the SNP distribution, can be used to approximate different distributions, such as the normal, gamma, or binomial distribution, among others. The probability density function (PDF) of the error term is given as follows (Ye et al., 2018):

$$f(\varepsilon) = \frac{\left[\sum_{m=0}^{K} a_m \varepsilon^m\right]^2 \phi(\varepsilon)}{\int_{-\infty}^{+\infty} \left[\sum_{m=0}^{K} a_m \varepsilon^m\right]^2 \phi(\varepsilon)\, d\varepsilon} \quad (3.37)$$

where K refers to the order of the polynomial, m is an index increasing from 0 to K, a_m is a constant coefficient, and φ(ε) represents the PDF of the standard normal distribution. The integral in the denominator guarantees that the density integrates to 1. The denominator of the PDF described in Eq. (3.37) can be expanded by including another index n that also increases from 0 to K (Ye et al., 2018):

$$\int_{-\infty}^{+\infty} \left[\sum_{m=0}^{K} a_m \varepsilon^m\right]^2 \phi(\varepsilon)\, d\varepsilon = \sum_{m=0}^{K} \sum_{n=0}^{K} a_m a_n \int_{-\infty}^{+\infty} \varepsilon^{m+n} \phi(\varepsilon)\, d\varepsilon \quad (3.38)$$

Using the recursive equation $I(n) = \int_{-\infty}^{+\infty} \varepsilon^n \phi(\varepsilon)\, d\varepsilon$, where I(0) = 1, I(1) = 0, and I(n) = (n − 1) I(n − 2) for n ≥ 2, the denominator of Eq. (3.38) can be defined such that

$$\int_{-\infty}^{+\infty} \left[\sum_{m=0}^{K} a_m \varepsilon^m\right]^2 \phi(\varepsilon)\, d\varepsilon = \sum_{m=0}^{K} \sum_{n=0}^{K} a_m a_n I(m + n) \quad (3.39)$$
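The recursion for I(n) is easy to verify numerically. The short sketch below implements it and checks the first few standard-normal moments against direct integration.

```python
from scipy.integrate import quad
from scipy.stats import norm

def I(n):
    """Standard-normal moments via I(0)=1, I(1)=0, I(n)=(n-1)*I(n-2)."""
    if n == 0:
        return 1.0
    if n == 1:
        return 0.0
    return (n - 1) * I(n - 2)

# Compare the recursion with direct numerical integration of eps^n * phi(eps).
for n in range(6):
    num, _ = quad(lambda e: e**n * norm.pdf(e), -10, 10)
    print(n, I(n), round(num, 6))
```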
By mixing the SNP with the Poisson distribution, the SNP-Poisson model can be defined as follows:

$$P(y_i \mid \mathbf{x}_i) = \int_{-\infty}^{+\infty} P(y_i \mid \varepsilon_i)\, f(\varepsilon_i)\, d\varepsilon_i = \int_{-\infty}^{+\infty} \frac{e^{-\mu_i} \mu_i^{y_i}}{y_i!} \left\{ \frac{\left[\sum_{m=0}^{K} a_m \varepsilon_i^m\right]^2 \phi(\varepsilon_i)}{\sum_{m=0}^{K} \sum_{n=0}^{K} a_m a_n I(m + n)} \right\} d\varepsilon_i \quad (3.40)$$

As the unconditional probability function of Eq. (3.40) does not have a closed form, the numerical method of Gauss-Hermite quadrature needs to be applied to approximate the unconditional probability:

$$P(y_i \mid \mathbf{x}_i) \approx \sum_{j=1}^{J} w_j \left\{ \frac{e^{-\mu_i} \mu_i^{y_i}}{y_i!}\, \frac{\left[\sum_{m=0}^{K} a_m \varepsilon_j^m\right]^2 \phi(\varepsilon_j)}{\sum_{m=0}^{K} \sum_{n=0}^{K} a_m a_n I(m + n)} \right\} \quad (3.41)$$

where the w_j's and ε_j's are the quadrature weights and nodes. As discussed in Press et al. (2007), Gaussian quadrature is a cutting-edge procedure that can correctly evaluate the integrals in the likelihood function with a small number of supporting points. Ye et al. (2018) presented a table listing the weights and the number of nodes needed to characterize a Poisson-gamma model. They also showed that, depending on the number of nodes, the model can reliably replicate the Poisson-gamma and Poisson-lognormal models. Another advantage of this model is that it can capture multimodal datasets, similar to the finite mixture models previously described.
3.10.2 Dirichlet process models
Models that employ the Dirichlet process (DP), widely used in the Bayesian literature, can technically be classified as either nonparametric or semiparametric depending on the modeling framework (Antoniak, 1974; Escobar and West, 1995). For semiparametric models, as applied with safety data, the count data still follow a Poisson distribution, but the mean or the error term is assumed to follow a Dirichlet distribution or process. As opposed to the Poisson or NB mixtures, in which two parametric distributions are mixed together, the DP is characterized by an infinite mixture of distributions, where the number of unique components or distributions and the component characteristics themselves can be learned from the data (Shirazi et al., 2016). The DP (Ferguson, 1973, 1974) is a stochastic process that is usually used as a prior in Bayesian nonparametric (or semiparametric) modeling. In this regard, Escobar and West (1998) defined the DP as a random probability measure over the space of all probability measures. In that
sense, the DP is considered a distribution over all possible distributions; that is, each draw from the DP is itself a distribution, which may not be the same as the previous draw. As described in Shirazi et al. (2016), let A_1, A_2, …, A_r be any finite measurable partition of the parameter space Θ. Let s be a positive real number and F_0(·|θ) a continuous distribution over Θ. Then F(·) ~ DP(s, F_0(·|θ)) if and only if (Escobar and West, 1998):

$$(F(A_1), F(A_2), \ldots, F(A_r)) \sim \mathrm{Dirichlet}(s F_0(A_1 \mid \theta), s F_0(A_2 \mid \theta), \ldots, s F_0(A_r \mid \theta)) \quad (3.42)$$

where s is defined as the precision (or concentration) parameter and F_0(·|θ) as the base (or baseline) distribution. Note that, based on the properties of the Dirichlet distribution, each partition A ⊂ Θ satisfies

$$E(F(A)) = F_0(A \mid \theta) \quad (3.43)$$

$$\mathrm{Var}(F(A)) = \frac{F_0(A \mid \theta)\left(1 - F_0(A \mid \theta)\right)}{1 + s} \quad (3.44)$$

Therefore, the base distribution F_0(·|θ) and the precision parameter s play significant roles in the DP definition. The expectation of the random distribution F(·) is the base distribution F_0(·|θ). Likewise, the precision parameter s controls the variance of the random distribution around its mean; in other words, s measures the variability of the target distribution around the base distribution. As s → ∞, F(·) → F_0(·|θ), while, on the other hand, as s → 0, the random distribution F(·) deviates further from F_0(·|θ). Eq. (3.42) defines the DP indirectly through the marginal probabilities assigned to a finite number of partitions and therefore gives no intuition about realizations of F(·) ~ DP(s, F_0(·|θ)). To simulate random distributions from the DP, however, Sethuraman (1994) introduced a straightforward stick-breaking constructive representation of this process as follows:

$$g_k \mid s \sim \mathrm{Beta}(1, s), \quad k = 1, 2, \ldots$$

$$\psi_k \mid \theta \sim F_0(\cdot \mid \theta), \quad k = 1, 2, \ldots$$

$$\pi_k = g_k \prod_{k' < k} (1 - g_{k'}), \quad k = 1, 2, \ldots$$
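The stick-breaking construction can be simulated directly. The sketch below draws a truncated approximation of F ~ DP(s, F_0) with a gamma base distribution; the truncation level K and the parameter values are arbitrary illustrative choices.

```python
import numpy as np

def stick_breaking(s, base_draw, K=200, rng=None):
    """Truncated stick-breaking draw F ~ DP(s, F0): weights pi_k, atoms psi_k."""
    rng = rng or np.random.default_rng()
    g = rng.beta(1.0, s, K)                 # g_k | s ~ Beta(1, s)
    # pi_k = g_k * prod_{k' < k} (1 - g_{k'})
    pi = g * np.concatenate(([1.0], np.cumprod(1 - g)[:-1]))
    psi = base_draw(K)                      # psi_k ~ F0
    return pi, psi

# Small precision s -> a few dominant atoms; large s -> F close to F0.
rng = np.random.default_rng(5)
pi, psi = stick_breaking(s=2.0, base_draw=lambda k: rng.gamma(2.0, 1.0, k), rng=rng)
print(pi[:5].round(3), psi[:5].round(3))
```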
                      Model 1                          Model 2                          Model 3
Variable             Estimate  Std. err  Pr(>|z|)     Estimate  Std. err  Pr(>|z|)     Estimate  Std. err  Pr(>|z|)
Intercept (β0)       4.779     0.979     0.0000       3.739     1.115     0.0008       7.547     0.1227    0.0000
Ln(ADT) (β1)         0.722     0.091     0.0000       0.630     0.106     0.0000       0.983     0.117     0.0000
Friction (β2)        0.02774   0.008     0.0006       0.02746   0.011     0.1300       0.01999   0.008     0.0126
Pavement (β3)        0.4613    0.135     0.0005       0.4327    0.217     0.0468       0.3942    0.152     0.0100
MW (β4)              0.00497   0.001     0.0000       0.00616   0.002     0.0021       0.00468   0.002     0.0195
Barrier (β5)         3.195     0.234     0.0000       3.238     0.326     0.0000       8.035     1.225     0.0000
Rumble (β6)          0.4047    0.131     0.0021       0.3976    0.213     0.0609       0.3780    0.150     0.0134
α = 1/φ              0.934     0.118     0.0000       0.238     0.083     0.0074       0.301     0.085     0.0042
DIC(a)               1900                             1701                             1638
MAD(b)               6.91                             6.89                             6.63
MSPE(c)              206.79                           195.54                           194.5

(a) Deviance information criterion. (b) Mean absolute deviance. (c) Mean squared predictive error.
Site    1    2    3    4    5    6    7    8    9    10
1      1.0  0.6  0.6  0.6  0.6  0.2  0.6  0.6  0.1  0.1
2      0.6  1.0  0.6  0.6  0.6  0.2  0.6  0.6  0.1  0.1
3      0.6  0.6  1.0  0.6  0.6  0.2  0.6  0.6  0.1  0.1
4      0.6  0.6  0.6  1.0  0.6  0.2  0.6  0.6  0.1  0.1
5      0.6  0.6  0.6  0.6  1.0  0.2  0.6  0.6  0.1  0.1
6      0.2  0.2  0.2  0.2  0.2  1.0  0.2  0.2  0.6  0.6
7      0.6  0.6  0.6  0.6  0.6  0.2  1.0  0.6  0.1  0.1
8      0.6  0.6  0.6  0.6  0.6  0.2  0.6  1.0  0.1  0.1
9      0.1  0.1  0.1  0.1  0.1  0.6  0.1  0.1  1.0  0.6
10     0.1  0.1  0.1  0.1  0.1  0.6  0.1  0.1  0.6  1.0

FIGURE 3.5 The heatmap representation of the partitioning matrix for the top 10 sites with the highest ADT values in the Indiana dataset (Shirazi et al., 2018).
To keep track of the clustering, the partitioning information matrix needs to be recorded at each iteration of the MCMC. The matrix can be used to investigate similarities between sites, especially with regard to recognizing unobserved variables or identifying safety issues and deploying countermeasures. An example is presented in Fig. 3.5, which shows the heatmap representation of the partitioning matrix for the top 10 sites with the highest ADT values. The figure shows the likelihood that sites "X" and "Y" fall into the same cluster. For simplicity, the probabilities were rounded to the first decimal; a higher likelihood is represented by a darker shade on the map. As observed in this figure, for instance, with relatively high probability (~60%), site "1" falls into the same cluster as site "2", site "3", or several others. This information can offer insights for identifying potential unobserved variables or safety issues and for deciding on appropriate countermeasures for site "1."
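A partitioning matrix such as the one in Fig. 3.5 can be assembled from saved MCMC cluster labels by counting how often each pair of sites shares a cluster. The sketch below uses a tiny set of hypothetical label draws, not output from an actual DP mixture fit.

```python
import numpy as np

def partition_matrix(labels):
    """Pairwise co-clustering probabilities from MCMC cluster labels.

    labels: (iterations x sites) array of cluster assignments saved at
    each MCMC iteration; entry (i, j) of the result estimates the
    probability that sites i and j fall into the same cluster.
    """
    labels = np.asarray(labels)
    same = labels[:, :, None] == labels[:, None, :]
    return same.mean(axis=0)

# Hypothetical cluster labels for 6 sites over 4 saved iterations.
draws = np.array([[0, 0, 1, 1, 2, 2],
                  [0, 0, 0, 1, 2, 2],
                  [1, 1, 1, 0, 2, 2],
                  [0, 0, 1, 1, 1, 2]])
print(partition_matrix(draws).round(2))
```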
3.10.3 Nonparametric models
Nonparametric models have been used relatively often in highway safety. Some recent popular models include the multilayer perceptron
(MLP) neural network (Xie et al., 2007; Kononov et al., 2011), convolutional neural networks (Ma et al., 2017), Bayesian neural networks (BNNs) (Xie et al., 2007), and the support vector machine (SVM) (Li et al., 2008). These and other models are described in greater detail in Chapter 12 (Data Mining and Machine Learning Techniques), which discusses methods related to the analysis of naturalistic and other types of data. For crash-count analyses, MLP, BNN, and SVM models have so far been used for predicting crashes rather than for examining relationships (Singh et al., 2018; Dong et al., 2019). They are very good at predicting crashes, but they unfortunately work as black boxes. Hence, the safety analyst needs to be very familiar with the characteristics of the data and the assumptions associated with their use. Because of their nonparametric nature, these models can easily overfit the data. Techniques such as sensitivity analyses can be used for assessing the relationship between the independent variables and the dependent variable (crash count). For example, Fish and Blodgett (2003) proposed a method for examining the sensitivity of MLP models. Their method can also be used for assessing the variables of BNNs and SVMs (Xie et al., 2007; Li et al., 2008).
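As an illustration of this kind of sensitivity analysis, the sketch below trains an MLP on synthetic count data and ranks the inputs with permutation importance from scikit-learn; this is a generic alternative to, not an implementation of, the Fish and Blodgett (2003) method, and the data and variable names are hypothetical.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(6)
n = 2000
X = rng.normal(size=(n, 3))       # e.g., log-AADT, lane width, pure noise
y = rng.poisson(np.exp(0.3 + 0.8 * X[:, 0] + 0.2 * X[:, 1]))  # col 2 unused

mlp = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000).fit(X, y)
imp = permutation_importance(mlp, X, y, n_repeats=10, random_state=0)
print(imp.importances_mean.round(3))  # the noise variable should rank last
```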
3.11 Model selection
In this chapter, we have presented different types of models that vary from the most basic to the very complex. Positive and negative attributes have been provided for most of these models, along with a description explaining when each model could be suitable given the known and unknown characteristics of the data. In the highway safety literature, a lot of work has been devoted to the development and application of statistical models (in fact, this is the majority of the work produced in highway safety according to Zou and Vu, 2019). A common theme of a new study is that the "new" proposed model is claimed to be better than previously published or widely applied models because it fits the data better; in other words, the "new" model reduces the unobserved heterogeneity more than the previous model. Although this can be a legitimate objective, the final selection of the model, as discussed earlier, should not be based solely on how well the model fits the data. The model also needs to adequately capture the data-generating process of the dataset under study. Miaou and Lord (2003) refer to this subject as the "goodness-of-logic," which consists of making sure the model properly characterizes the analyzed data and is methodologically sound. For example, the safety literature has shown that zero-inflated (ZI) and hurdle NB models usually provide a better statistical fit than the traditional NB. However, the main assumption of these models is that one of the two states has a long-term mean equal to zero (note: for the ZI model, a proportion of the 0s come from a Poisson mean equal to zero, whereas for the hurdle model, all the 0s come from such a distribution). It is obviously not possible to observe sites that could never experience any crashes despite road users traveling on the facilities³ (Lord et al., 2005, 2007). This fundamental issue and other related ones have also been raised in environmental science (Warton, 2005), the social sciences (Allison, 2012), substance abuse research (Xie et al., 2013), and criminology (Fisher et al., 2017; Britt et al., 2018). Solely looking at the fit could lead the safety analyst to miss important cues or information, such as counterintuitive coefficients or hidden correlation among variables. Along the same lines, using a very complex model does not necessarily mean that the model is better, even if the "fit" is superior. The model could be overly complex given the study objectives, or the gains it provides compared to traditional models could be marginal. As Dr. George Box famously said, "all models are wrong but some are useful" (Box, 1979) and "the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity" (Box, 1976), meaning that more complex models are not necessarily better. Furthermore, complex models often have not had the opportunity to be fully validated using theoretical principles, simulation, or a wide range of datasets, especially as many have only recently been introduced in the safety literature. It is not uncommon to see new models that are later found to suffer from important methodological limitations (e.g., zero-inflated models). Depending on the parameterization and the estimation method, a model can also take a very long time to provide results. For example, some models estimated using the Bayesian method can take several hours or even days for the MCMC posterior estimates to converge. A pragmatic approach is to use the MLE method when it can be used given the study objectives and characteristics of the data, but to use the Bayesian
³ Lambert (1992), who first introduced zero-inflated models, clearly stated that these models apply to two distinct categories of observations: a perfect state and a nonperfect state. The idea behind this kind of model is to assign a probability for the observations with 0s to fall into either one of these two states. She also mentioned that these models can be difficult to interpret; this limitation is even addressed in the abstract. Note that the use of the Vuong statistic (Vuong, 1989) to evaluate when these models should be used has also been criticized by other researchers (e.g., Wilson, 2015). Pew et al. (2020) argued that zero-inflated models could still be used for predicting crashes, irrespective of whether the relationship between the number of crashes and risk factors makes sense. More details about the inadequacy of zero-inflated models in different fields can be found on statistician and sociologist Dr. Paul D. Allison's website: https://statisticalhorizons.com/zero-inflated-models.
estimation method when the MLE cannot be used because of the complexity of the model. Although the NB model can become unreliable under particular conditions, this model, which is characterized by solid theoretical foundations, has been analyzed, tweaked, and used by researchers and practitioners across the globe for several decades and is considered more than adequate for most applications (Hilbe, 2014). Hence, safety analysts should not automatically reject this basic model for analyzing crash data. However, if the study objectives, as described in Chapter 2 (Fundamentals and Data Collection), and the characteristics of the data are such that a better model may be more suitable, then an alternative model should be selected (where the "fit" plays an important role but is not the sole decision factor). The same principles apply to the crash-severity models described in the next chapter.
References

Allison, P., 2012. Logistic Regression Using SAS: Theory and Application, second ed. SAS Institute Inc., Cary, NC. See "Do we really need zero-inflated models?" Statistical Horizons (blog by Emeritus Sociology Professor Dr. Paul Allison). Retrieved from http://statisticalhorizons.com/zero-inflated-models. (Accessed 29 July 2020).
Antoniak, C.E., 1974. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Ann. Stat. 2 (6), 1152-1174.
Aryuyuen, S., Bodhisuwan, W., 2013. Negative binomial-generalized exponential (NB-GE) distribution. Appl. Math. Sci. 7 (22), 1093-1105.
Bakhshi, A.K., Ahmed, M.M., 2021. Practical advantage of crossed random intercepts under Bayesian hierarchical modeling to tackle unobserved heterogeneity in clustering critical versus non-critical crashes. Accid. Anal. Prev. 149, 105855.
Blower, D., Flannagan, C., Geedipally, S., Lord, D., Wunderlich, R., 2020. Identification of Factors Contributing to the Decline of Traffic Fatalities in the United States from 2008 to 2012. NCHRP Research Report 928, Washington, D.C.
Box, G.E.P., 1976. Science and statistics. J. Am. Stat. Assoc. 71 (356), 791-799. https://doi.org/10.1080/01621459.1976.10480949.
Box, G.E.P., 1979. Robustness in the strategy of scientific model building. In: Launer, R.L., Wilkinson, G.N. (Eds.), Robustness in Statistics. Academic Press, pp. 201-236. https://doi.org/10.1016/B978-0-12-438150-6.50018-2.
Britt, C.L., Rocque, M., Zimmerman, G.M., 2018. The analysis of bounded count data in criminology. J. Quant. Criminol. 34, 591-607. https://link.springer.com/article/10.1007/s10940-017-9346-9.
Cameron, A.C., Trivedi, P.K., 2013. Regression Analysis of Count Data, second ed. Cambridge University Press, N.Y.
Cheng, L., Geedipally, S.R., Lord, D., 2013. Examining the Poisson-Weibull generalized linear model for analyzing crash data. Saf. Sci. 54, 38-42.
Chib, S., Winkelmann, R., 2001. Markov chain Monte Carlo analysis of correlated count data. J. Bus. Econ. Stat. 19, 428-435.
Connors, R.D., Maher, M., Wood, A., Mountain, L., Ropkins, K., 2013. Methodology for fitting and updating predictive accident models with trend. Accid. Anal. Prev. 56, 82-94.
Conway, R.W., Maxwell, W.L., 1962. A queuing model with state dependent service rates. J. Ind. Eng. Int. 12, 132-136.
Daniels, S., Brijs, T., Nuyts, E., Wets, G., 2010. Explaining variation in safety performance of roundabouts. Accid. Anal. Prev. 42 (2), 393-402.
Davis, G.A., 2000. Accident reduction factors and causal inference in traffic safety studies: a review. Accid. Anal. Prev. 32 (1), 95-109.
Debrabant, B., Halekoh, U., Bonat, W.H., Hansen, D.L., Hjelmborg, J., Lauritsen, J., 2018. Identifying traffic accident black spots with Poisson-Tweedie models. Accid. Anal. Prev. 111, 147-154.
Dong, C., Xie, K., Sun, X., Lyu, M., Yue, H., 2019. Roadway traffic crash prediction using a state-space model based support vector regression approach. PLoS One 14 (4), e0214866.
Efron, B., 1986. Double exponential families and their use in generalized linear regression. J. Am. Stat. Assoc. 81 (395), 709-721.
Elvik, R., 2003. Assessing the validity of road safety evaluation studies by analysing causal chains. Accid. Anal. Prev. 35 (5), 741-748.
Elvik, R., 2011. Assessing causality in multivariate accident models. Accid. Anal. Prev. 43, 253-264.
Escobar, M.D., West, M., 1995. Bayesian density estimation and inference using mixtures. J. Am. Stat. Assoc. 90 (430), 577-588.
Escobar, M.D., West, M., 1998. Computing nonparametric hierarchical models. In: Practical Nonparametric and Semiparametric Bayesian Statistics, pp. 1-22.
Ferguson, T.S., 1973. A Bayesian analysis of some nonparametric problems. Ann. Stat., 209-230.
Ferguson, T.S., 1974. Prior distributions on spaces of probability measures. Ann. Stat., 615-629.
Fish, K.E., Blodgett, J.G., 2003. A visual method for determining variable importance in an artificial neural network model: an empirical benchmark study. J. Target Meas. Anal. Market. 11 (3), 244-254.
Fisher, W.H., Hartwell, S.W., Deng, X., 2017. Managing inflation: on the use and potential misuse of zero-inflated count regression models. Crime Delinquen. 63 (1), 77-87.
Geedipally, S.R., Lord, D., Dhavala, S.S., 2012. The negative binomial-Lindley generalized linear model: characteristics and application using crash data. Accid. Anal. Prev. 45 (2), 258-265.
Geedipally, S.R., Lord, D., Park, B.-J., 2009. Analyzing different parameterizations of the varying dispersion parameter as a function of segment length. Transp. Res. Rec. 2103, 108-118.
Gelman, A., Hill, J., 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, Cambridge, UK.
Ghitany, M.E., Atieh, B., Nadarajah, S., 2008. Lindley distribution and its application. Math. Comput. Simul. 78, 39-49.
Green, W., 2004. Interpreting Estimated Parameters and Measuring Individual Heterogeneity in Random Coefficient Models. Working paper, Department of Economics, Stern School of Business, New York University.
Guikema, S.D., Coffelt, J.P., 2007. A flexible count data regression model for risk analysis. Risk Anal. 28 (1), 213-223.
Hauer, E., 2010. Cause, effect and regression in road safety: a case study. Accid. Anal. Prev. 42, 1128-1135.
Hauer, E., 2020. Crash causation and prevention. Accid. Anal. Prev. 143, 105528.
Heydari, S., Fu, L., Lord, D., Mallick, B.K., 2016. Multilevel Dirichlet process mixture analysis of railway grade crossing crash data. Anal. Methods Accid. Res. 9, 27-43.
Hilbe, J.M., 2011. Negative Binomial Regression, second ed. Cambridge University Press, Cambridge, UK.
Hilbe, J.M., 2014. Modeling Count Data. Cambridge University Press, Cambridge, UK.
Huo, X., Leng, J., Hou, Q., Zheng, L., Zhao, L., 2020. Assessing the explanatory and predictive performance of a random parameters count model with heterogeneity in means and variances. Accid. Anal. Prev. 147, 105759.
Ishwaran, H., James, L.F., 2001. Gibbs sampling methods for stick-breaking priors. J. Am. Stat. Assoc. 96 (453), 161-173.
Ishwaran, H., Zarepour, M., 2002. Exact and approximate sum representations for the Dirichlet process. Can. J. Stat. 30 (2), 269-283.
Jovanis, P.P., Chang, H.L., 1986. Modeling the relationship of accidents to miles traveled. Transp. Res. Rec. 1068, 42-51.
Kadane, J.B., Shmueli, G., Minka, T.P., Borle, S., Boatwright, P., 2006. Conjugate analysis of the Conway-Maxwell-Poisson distribution. Bayesian Anal. 1, 363-374.
Katz, L., 1965. Unified treatment of a broad class of discrete probability distributions. Class. Contag. Discret. Distrib. 1, 175-182.
Khazraee, S.H., Johnson, V., Lord, D., 2018. Bayesian Poisson hierarchical models for crash data analysis: investigating the impact of model choice on site-specific predictions. Accid. Anal. Prev. 117, 181-195.
Khazraee, S.H., Saez-Castillo, A.J., Geedipally, S.R., Lord, D., 2015. Application of the hyper-Poisson generalized linear model for analyzing motor vehicle crashes. Risk Anal. 35 (5), 919-930.
Kononov, J., Lyon, C., Allery, B., 2011. Relation of flow, speed, and density of urban freeways to functional form of a safety performance function. Transp. Res. Rec. 2236, 11-19.
Kuo, P.-F., Lord, D., 2020. Applying the colocation quotient index to crash severity analyses. Accid. Anal. Prev. 135, 105368.
Lambert, D., 1992. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34 (1), 1-14.
Li, X., Lord, D., Zhang, Y., Xie, Y., 2008. Predicting motor vehicle crashes using support vector machine models. Accid. Anal. Prev. 40 (4), 1611-1618.
Lindley, D.V., 1958. Fiducial distributions and Bayes' theorem. J. R. Stat. Soc. Series B Stat. Methodol. 20 (1), 102-107.
Lord, D., 2006. Modeling motor vehicle crashes using Poisson-gamma models: examining the effects of low sample mean values and small sample size on the estimation of the fixed dispersion parameter. Accid. Anal. Prev. 38 (4), 751-766.
Lord, D., Mannering, F., 2010. The statistical analysis of crash-frequency data: a review and assessment of methodological alternatives. Transp. Res. A 44 (5), 291-305.
Lord, D., Miranda-Moreno, L.F., 2008. Effects of low sample mean values and small sample size on the estimation of the fixed dispersion parameter of Poisson-gamma models for modeling motor vehicle crashes: a Bayesian perspective. Saf. Sci. 46 (5), 751-770.
Lord, D., Guikema, S.D., Geedipally, S., 2008. Application of the Conway-Maxwell-Poisson generalized linear model for analyzing motor vehicle crashes. Accid. Anal. Prev. 40 (3), 1123-1134.
Lord, D., Geedipally, S.R., Guo, F., Jahangiri, A., Shirazi, M., Mao, H., Deng, X., 2019. Analyzing Highway Safety Datasets: Simplifying Statistical Analyses from Sparse to Big Data. Report No. 01-001, Safe-D UTC, U.S. Department of Transportation, Washington, D.C.
Lord, D., Persaud, B.N., 2000. Accident prediction models with and without trend: application of the generalized estimating equations (GEE) procedure. Transp. Res. Rec. 1717, 102-108.
Lord, D., Washington, S.P., Ivan, J.N., 2005. Poisson, Poisson-gamma and zero-inflated regression models of motor vehicle crashes: balancing statistical fit and theory. Accid. Anal. Prev. 37 (1), 35-46.
Lord, D., Washington, S.P., Ivan, J.N., 2007. Further notes on the application of zero-inflated models in highway safety. Accid. Anal. Prev. 39 (1), 53-57.
Ma, X., Dai, Z., He, Z., Ma, J., Wang, Y., Wang, Y., 2017. Learning traffic as images: a deep convolutional neural network for large-scale transportation network speed prediction. Sensors 17 (4), 818.
Ma, J., Kockelman, K.M., 2006. Bayesian multivariate Poisson regression for models of injury count, by severity. Transp. Res. Rec. 1950, 24-34.
Mannering, F., Shankar, V., Bhat, C., 2016. Unobserved heterogeneity and the statistical analysis of highway accident data. Anal. Methods Accid. Res. 11, 1-16.
Matsuo, K., Sugihara, M., Yamazaki, M., Mimura, Y., Yang, J., Kanno, K., Sugiki, N., 2020. Hierarchical Bayesian modeling to evaluate the impacts of intelligent speed adaptation considering individuals' usual speeding tendencies: a correlated random parameters approach. Anal. Methods Accid. Res. 27, 100125.
McCulloch, C.E., Searle, S.R., Neuhaus, J.M., 2008. Generalized, Linear, and Mixed Models, second ed. John Wiley & Sons Inc., Hoboken, NJ.
McCullagh, P., Nelder, J.A., 1989. Generalized Linear Models, second ed. Chapman & Hall/CRC, Boca Raton, FL.
Miaou, S.-P., Lord, D., 2003. Modeling traffic-flow relationships at signalized intersections: dispersion parameter, functional form and Bayes vs empirical Bayes. Transp. Res. Rec. 1840, 31-40.
Milton, J., Shankar, V., Mannering, F., 2008. Highway accident severities and the mixed logit model: an exploratory empirical analysis. Accid. Anal. Prev. 40 (1), 260-266.
Molnar, C., 2020. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. eBook. https://christophm.github.io/interpretable-ml-book/index.html.
Myers, R., 2000. Classical and Modern Regression with Applications, second ed. Duxbury Press, Belmont, U.S.
Oh, J., Washington, S.P., Nam, D., 2006. Accident prediction model for railway-highway interfaces. Accid. Anal. Prev. 38 (2), 346-356.
Ohlssen, D.I., Sharples, L.D., Spiegelhalter, D.J., 2007. Flexible random-effects models using Bayesian semi-parametric models: applications to institutional comparisons. Stat. Med. 26 (9), 2088-2112.
Park, B.-J., Lord, D., 2009. Application of finite mixture models for vehicle crash data analysis. Accid. Anal. Prev. 41 (4), 683-691.
Park, E.S., Lord, D., 2007. Multivariate Poisson-lognormal models for jointly modeling crash frequency by severity. Transp. Res. Rec., 1-6.
Pearl, J., 2009. Causality: Models, Reasoning and Inference, second ed. Cambridge University Press, Cambridge, UK.
Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P., 2007. Numerical Recipes: The Art of Scientific Computing, third ed. Cambridge University Press.
Saeed, T.U., Hall, T., Baroud, H., Volovski, M.J., 2019. Analyzing road crash frequencies with uncorrelated and correlated random-parameters count models: an empirical assessment of multilane highways. Anal. Methods Accid. Res. 23, 100101.
Saengthong, P., Bodhisuwan, W., 2013. Negative binomial-crack (NB-CR) distribution. J. Pure Appl. Math. 84, 213-230.
Sethuraman, J., 1994. A constructive definition of Dirichlet priors. Stat. Sin. 4 (2), 639-650.
Shaon, M.R.R., Qin, X., Shirazi, M., Lord, D., Geedipally, S., 2018. Developing a random parameters negative binomial-Lindley model to analyze highly over-dispersed crash count data. Anal. Methods Accid. Res. 18, 33-44.
Shirazi, M., Dhavala, S.S., Lord, D., Geedipally, S.R., 2017. A methodology to design heuristics for model selection based on characteristics of data: application to investigate when the negative binomial-Lindley (NB-L) is preferred over the negative binomial (NB). Accid. Anal. Prev. 107, 186-194.
Shirazi, M., Lord, D., Dhavala, S.S., Geedipally, S.R., 2016. A semiparametric negative binomial generalized linear model for modeling over-dispersed count data with a heavy tail: characteristics and applications to crash data. Accid. Anal. Prev. 91, 10-18.
Shmueli, G., Minka, T.P., Kadane, J.B., Borle, S., Boatwright, P., 2005. A useful distribution for fitting discrete data: revival of the Conway-Maxwell-Poisson distribution. J. R. Stat. Soc. Ser. C 54, 127-142.
Singh, G., Sachdeva, S.N., Pal, M., 2018. Support vector machine model for prediction of accidents on non-urban sections of highways. Proc. Inst. Civ. Eng. Transp. 171 (5), 253-263.
Skrondal, A., Rabe-Hesketh, S., 2004. Generalized Latent Variable Modeling: Multilevel, Longitudinal and Structural Equation Models. Chapman & Hall/CRC, Boca Raton, FL.
Stroup, W.W., 2013. Generalized Linear Mixed Models: Modern Concepts, Methods and Applications. Chapman & Hall/CRC, Boca Raton, FL.
Tang, H., Gayah, V.V., Donnell, E.T., 2019. Evaluating the predictive power of an SPF for two-lane rural roads with random parameters on out-of-sample observations. Accid. Anal. Prev. 132, 105275.
Vangala, P., Lord, D., Geedipally, S.R., 2015. Exploring the application of the negative binomial-generalized exponential model for analyzing traffic crash data with excess zeros. Anal. Methods Accid. Res. 7, 29-36.
Vuong, Q., 1989. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 57, 307-333.
Warton, D.I., 2005. Many zeros does not mean zero inflation: comparing the goodness-of-fit of parametric models to multivariate abundance data. Environmetrics 16 (2), 275-289.
Washington, S., Karlaftis, M.G., Mannering, F., Anastasopoulos, P., 2020. Statistical and Econometric Methods for Transportation Data Analysis, third ed. CRC Press - Taylor & Francis Group, New York, N.Y.
Wilson, P., 2015. The misuse of the Vuong test for non-nested models to test for zero-inflation. Econ. Lett. 127, 51-53.
Winkelmann, R., 1995. Duration dependence and dispersion in count-data models. J. Bus. Econ. Stat. 13 (4), 467-474.
Wu, L., Lord, D., Zou, Y., 2015. Validation of CMFs derived from cross sectional studies using regression models. Transp. Res. Rec. 2514, 88-96.
Xie, Y., Lord, D., Zhang, Y., 2007. Predicting motor vehicle collisions using Bayesian neural networks: an empirical analysis. Accid. Anal. Prev. 39 (5), 922-933.
Xie, H., Tao, J., McHugo, G., Drake, R.E., 2013. Comparing statistical methods for analyzing skewed longitudinal count data with many zeros: an example of smoking cessation. J. Subst. Abuse Treat. 45, 99-108.
Xie, Y., Zhang, Y., 2008. Crash frequency analysis with generalized additive models. Transp. Res. Rec. 2061, 39-45.
Yamrubboon, D., Thongteeraparp, A., Bodhisuwan, W., Jampachaisri, K., Volodin, A., 2019. Bayesian inference for the negative binomial-Sushila linear model. Lobachevskii J. Math. 40 (1), 42-54.
Yang, M.A., Dunson, D.B., Baird, D., 2010. Semiparametric Bayes hierarchical models with mean and variance constraints. Comput. Stat. Data Anal. 54 (9), 2172-2186.
Ye, X., Wang, K., Zou, Y., Lord, D., 2018. A semi-nonparametric Poisson regression model for analyzing motor vehicle crash data. PLoS One 13 (5), e0197338.
Ye, Z., Xu, Y., Lord, D., 2018. Crash data modeling with a generalized estimator. Accid. Anal. Prev. 117, 340-345.
Zamani, H., Ismail, N., 2010. Negative binomial-Lindley distribution and its application. J. Math. Stat. 6 (1), 4-9.
Zha, L., Lord, D., Zou, Y., 2016. The Poisson inverse Gaussian (PIG) generalized linear regression model for analyzing motor vehicle crash data. J. Transp. Saf. Secur. 8 (1), 18-35.
Zou, Y., Geedipally, S.R., Lord, D., 2013. Evaluating the double Poisson generalized linear model. Accid. Anal. Prev. 59, 497-505.
Zou, X., Vu, H.L., 2019. Mapping the knowledge domain of road safety studies: a scientometric analysis. Accid. Anal. Prev. 132, 105243.
Zou, Y., Wu, L., Lord, D., 2015. Modeling over-dispersed crash data with a long tail: examining the accuracy of the dispersion parameter in negative binomial models. Anal. Methods Accid. Res. 5-6, 1-16.
CHAPTER 4
Crash-severity modeling

4.1 Introduction
"Vision Zero," an ambitious, multinational initiative to eliminate traffic fatalities and serious injuries, has made the reduction of fatal and severe injury crashes a top priority for transportation safety stakeholders around the world. To achieve this goal, researchers and safety professionals rely heavily on crash data, as they are the most relevant and informative resource for analyzing traffic injuries; however, the causes of an injury are very complicated because they involve a sequence of events and a number of factors (i.e., driver, vehicle, environment), as discussed in Chapter 2 (Fundamentals and Data Collection). Similar to the methodologies described in Chapter 3 (Crash-Frequency Modeling), statistical methodologies have been used extensively to explore the intriguing relationships between crash severities and other data elements. In particular, crash injury severity modeling helps describe, identify, and evaluate the factors contributing to various levels of injury severity. Unlike crash count, which is a nonnegative integer, injury severity has a finite number of outcomes (e.g., killed, injury type A, injury type B, injury type C, no injury) that are categorized on the KABCO scale. Discrete choice and discrete outcome models have been used to handle this type of response variable. Crash severity models are categorized as fixed- or random-parameter models according to the parameter assumptions. Crash-severity models can also be classified as nonordinal (e.g., multinomial logit (MNL) and multinomial probit) or ordered probabilistic (e.g., ordered probit and ordered logistic) if an ordinal structure for the response variable is assumed. Model variations are available if restrictions such as the independence of irrelevant alternatives (IIA), proportional odds, or homogeneity are
relaxed. Savolainen et al. (2011) performed an extensive review of the methodological alternatives for modeling highway crash injury severity, but the review did not yield agreement among experts on which model works best with crash severity data. Professionals do agree, however, that both statistical goodness-of-fit and model interpretation should be considered when modeling crash injury severity. This chapter introduces the methodologies and techniques that have been applied to model crash severity in safety studies. The discussion includes the different forms, constructs, and assumptions of crash severity models given the prevailing issues with crash data. The theoretical framework and practical techniques for identifying, estimating, evaluating, and interpreting factors contributing to crash injury severities are also explored. In addition, an extensive list of available crash-severity models is provided in Appendix B.
4.2 Characteristics of crash injury severity data and methodological challenges
Several prevailing issues related to crash injury data have come to light during model development, including unobserved heterogeneity, omitted-variable bias, temporal and spatial correlation, the ordinality of injury severity, and imbalanced observations across injury severity levels (Savolainen et al., 2011; Washington et al., 2020; Mujalli, 2016). Some of these issues are discussed in depth in Chapter 3 (Crash-Frequency Modeling) and Chapter 6 (Cross-Sectional and Panel Studies in Safety), as the roots of the problems are the same. The issues that are specific to crash severity data are discussed in detail in the following sections.
4.2.1 Ordinal nature of crash injury severity data
An ordinal scale ranks crashes from the highest to the lowest level of injury severity (i.e., KABCO). Recognizing this ordinal structure within the data is important because it aids in the selection of an appropriate methodology. Utilizing the intrinsic ordinal information preserved in the data may lead to the estimation of fewer parameters. Additionally, adjacent categories may share unobserved effects, creating potential dependency between them. If such a correlation exists but is not accounted for, it can lead to biased parameter estimates and incorrect inferences (Savolainen and Mannering, 2007). Nevertheless, the ordinality assumption should be applied with caution, as it can be overly restrictive under certain circumstances, such as when lower severity crashes are underreported.
4.2.2 Unobserved heterogeneity
Differences in drivers' risk-taking behaviors, physiological attributes, and other factors lead to unobserved heterogeneity among the road users involved in crashes. Such heterogeneity causes the model parameters to vary across injury observations. Large heterogeneous effects, when unaccounted for, can lead to biased parameter estimates and incorrect statistical inferences (McFadden and Train, 2000; Train, 2009).
4.2.3 Omitted variable bias
It is impossible to include all variables related to injury severity in one model. Some variables, albeit important, may not be available in a crash report (e.g., vehicle mass, speed, collision angles). However, the omission of important explanatory variables can result in inconsistent parameter estimates. This can occur when omitted variables are correlated with variables already included in the model, or when omitted variables contribute to different variances among injury severity levels (Washington et al., 2020). If the omission of relevant variables is a critical limitation of a crash prediction model, the possible implications for the model's application must be discussed.
4.2.4 Imbalanced data between injury severity levels
Crash injury severity data are usually imbalanced on the KABCO scale: the number of fatal or severe injuries is substantially smaller than the number of less severe and no-injury crashes. This imbalance across injury categories presents a challenge for classification algorithms. In predictive modeling, imbalanced data introduce a bias toward the majority class that results in less accurate predictions of severe crashes. A common method of treating imbalanced data is to combine similar injury types (i.e., K, A, B, and C) into one category on a new scale (i.e., injury and noninjury) to obtain more balanced data. Other methods for handling imbalanced data include resampling techniques that aim to create balanced data by oversampling underrepresented classes or undersampling overrepresented classes (Mujalli et al., 2016).
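As a brief illustration of the resampling idea, the following is a minimal base R sketch of random oversampling; the data frame "crashes" and its "severity" column are hypothetical stand-ins and are not part of the datasets used in this chapter.

    # A minimal sketch of random oversampling in base R.
    # "crashes" and "severity" are hypothetical stand-ins.
    set.seed(1)
    minority <- crashes[crashes$severity == "injury", ]
    majority <- crashes[crashes$severity == "no_injury", ]
    # Draw minority rows with replacement until the two classes are balanced
    boot_rows <- sample(nrow(minority), nrow(majority), replace = TRUE)
    balanced  <- rbind(majority, minority[boot_rows, ])
    table(balanced$severity)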
4.3 Random utility model
Crash severity models are driven by developments in econometric methods. In economics, utility is a measure of relative satisfaction. In the context of safety, we are looking for the combination of factors that leads to the worst injuries. The utility function usually favors the maximum utility
(e.g., high injury severity levels) and is usually a linear function of covariates, as follows:

U_ni = β_0i + β_1i x_n1i + β_2i x_n2i + ... + β_ki x_nki = x'_ni β_i        (4.1)
where U_ni is the utility value of crash n with injury severity level i; x_nki is the kth variable related to injury level i; β_0i is the constant for injury level i; and β_ki are the estimable coefficients for the covariates. Utility maximization is the process of choosing the alternative with the maximum utility value. In a binary outcome model with injury and no injury, if U(injury) > U(no injury), then the probability of injury Pr(injury) = 1; and if U(injury) < U(no injury), then Pr(injury) = 0. This is a deterministic choice, as depicted in Fig. 4.1. A random, unspecifiable error term, ε_ni, is added to the end of Eq. (4.1), as it is difficult to specify each crash observation's utility function with certainty. The utility function becomes a random utility function, as follows:

U_ni = β_0i + β_1i x_n1i + β_2i x_n2i + ... + β_ki x_nki + ε_ni = V_ni + ε_ni        (4.2)
where V_ni represents the deterministic portion of U_ni. The addition of the disturbance term ε helps with the previously mentioned issues of variables omitted from the utility function, an incorrectly specified functional form, the use of proxy variables, and unaccounted-for variations in β (β may vary across observations). A random utility model leads to an estimable model of discrete outcomes, with I denoting all possible outcomes for observation n, and P_ni denoting the probability of observation n having discrete outcome i (i ∈ I):

P_ni = Pr(V_ni + ε_ni > V_nj + ε_nj, ∀ j ≠ i) = Pr(ε_nj < V_ni + ε_ni - V_nj, ∀ j ≠ i)        (4.3)
FIGURE 4.1 Deterministic choice of a binary variable (vertical axis: P(injury); horizontal axis: U(injury) versus U(no injury)).
FIGURE 4.2 Stochastic choice of a binary variable (vertical axis: P(injury), ranging from 0 to 1; horizontal axis: V(injury) versus V(no injury)).
Models are estimated by assuming a distribution for the random error terms, the ε's. Now, instead of the outcome being deterministic, the probability of each outcome alternative is determined by the assumed distributional form (Fig. 4.2).
4.4 Modeling crash severity as an unordered discrete outcome
Treating a dependent variable with multiple responses as ordinal or as nominal significantly impacts which methodologies should be considered. From a model estimation perspective, it is desirable for the maximum of a set of randomly drawn values to have the same form of distribution as the one from which they are drawn. An error term (ε) distribution with such a property greatly simplifies model estimation because the property can be applied to the multinomial case by defining the highest utility value of all other options as x'_nj β_j (∀ j ≠ i). The normal distribution does not possess this property because the maximums of randomly drawn samples from the normal distribution are not normally distributed. The extreme value distribution, however, is different. Distributions of the maximums of randomly drawn samples from a distribution are called extreme value distributions (Gumbel, 1958), which can be categorized as Type 1, Type 2, or Type 3. The most common extreme value distribution is Type 1, or the Gumbel distribution. Based on the Gumbel (Type 1 extreme value) error distribution assumption, the best-known discrete choice model is the MNL model. However, MNL models rely on the independence of irrelevant alternatives (IIA) assumption, which states that the odds of having one outcome category over another do not depend on the presence or absence of other categories. The IIA assumption is violated when there is correlation
among multiple categories, causing the MNL model to generate biased estimates. Nested logit (NL) or mixed logit (ML) models offer a more appropriate methodological approach when the IIA assumption does not hold.
4.4.1 Multinomial logit model
The MNL model has been widely applied in crash severity studies to predict the probability of different crash outcomes. If ε_ni is considered known, Eq. (4.3) is the cumulative distribution for each ε_nj evaluated at V_ni + ε_ni - V_nj. When the Gumbel distribution is assumed (with density f(x) = e^(-x) e^(-exp(-x)) and CDF F(x) = e^(-exp(-x))) and the ε_ni's are independent, the cumulative distribution over all j ≠ i is the product of the individual cumulative distributions (Train, 2009):

P_ni | ε_ni = ∏_{j≠i} e^(-exp[-(V_ni + ε_ni - V_nj)])        (4.4)

As ε_ni is not given, the choice probability is the integral of P_ni | ε_ni over all values of ε_ni weighted by its density:

P_ni = ∫ { ∏_{j≠i} e^(-exp[-(V_ni + ε_ni - V_nj)]) } e^(-ε_ni) e^(-exp(-ε_ni)) dε_ni        (4.5)

This results in a closed-form expression known as the MNL model, formulated as follows:

P_ni = Pr(y_n = i) = exp(x'_ni β_i) / Σ_{i=1}^{I} exp(x'_ni β_i)        (4.6)

where x_n is a vector of explanatory variables that determines the severity of crash observation n, and β_i is a vector of estimable coefficients for injury severity level i (i ∈ I). The estimated coefficients β_i are usually presented as a log odds ratio between the probability of a given level i and a reference level, resulting in (I - 1) estimates for each independent variable. The odds ratio is defined as the ratio between the probabilities of two specific categories, and it quantifies the propensity of an observation falling into one category compared with another. If level I is the reference level, the model becomes

log[ P_n(i) / P_n(I) ] = x'_ni β_i        (4.7)

Note that in crash severity modeling, the lowest injury severity level, i = 1 (i.e., "no injury" or "property damage only" (PDO)), is usually set as the reference level instead of level I. The latter, however, is a more common choice in commercial statistical software.
Eq. (4.7) shows that the MNL model allows both the explanatory variables related to each injury severity and their parameter estimates to vary. Thus, the MNL model is an appropriate choice when the probabilities of different injury severities are related to different contributing factors or are affected differently by the same factor. Another unique property of the MNL model is the IIA assumption. According to Eq. (4.6), the ratio of the probabilities of any two alternatives A and B is P(A)/P(B) = exp(V_A - V_B), which is unaffected by any other alternative. This property can be a major restriction on the use of the MNL model. When the alternatives are distinctly different and independent, the MNL model should work well; conversely, the MNL model cannot be justified when the alternatives share some unobserved effects, and different modeling approaches should be considered. Maximum likelihood estimation (MLE) is the method used for estimating the model coefficients. Maximum likelihood estimators are known to have good properties in large samples because they are consistent, asymptotically efficient, and asymptotically normal. In statistical terms, "consistency" means the estimate approaches the true value as the sample size increases indefinitely. Asymptotic efficiency means that in large samples the estimates have standard errors at least as small as those of any other method. Asymptotic normality means the normal and chi-square distributions can be used to construct confidence intervals and calculate p-values for the coefficients.
Exercise 4.1
Estimate a multinomial logit model using the Large Trucks Dataset.
Crashes involving large trucks are generally more severe than those involving other vehicles due to the size, weight, and speed differentials between trucks and other vehicles. The exercise is adapted from the large truck safety study published by Qin et al. (2013a,b). The purpose of the exercise is to identify key contributing factors and their impacts on the severities of crashes involving large trucks using MNL models. The large truck crash dataset includes 10,000 traffic accidents, with 4905 (49%) PDO crashes; 3981 (40%) of injury type B (nonincapacitating injury) and C (possible injury); and 1114 (11%) of injury type A (incapacitating injury) and K (fatal). The MNL model includes the following explanatory variables: human factors and driver behavior (Young, Old, Female, Alcohol, Drugs, Safety constraints, Speed, Rule violation, Reckless behavior), highway and traffic conditions (Signal, Two-way, None, Total units), and environmental factors (Snow, Ice, Wet, and Dark). The following steps are taken to solve the problem.
First, determine the functional form:

P_ni = Pr(y_n = i) = exp(x'_ni β_i) / Σ_{i=1}^{I} exp(x'_ni β_i)

In this functional form, y_n is the crash injury severity with three levels: PDO (i = 1), B or C (i = 2), and K or A (i = 3); x_n is a vector of explanatory variables that determines the severity of crash observation n (n = 1, ..., 10,000); and β_i is a vector of coefficients for injury severity level i. Second, estimate the coefficients using the R "mlogit" package.
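The original call is truncated in this copy, so the following is a hedged sketch only: the data frame and column names are hypothetical stand-ins for the Large Trucks Dataset, which is not reproduced in this chapter.

    library(mlogit)
    # Each row is one crash; "severity" is a factor with levels PDO, BC, and KA
    truck <- mlogit.data(trucks_df, choice = "severity", shape = "wide")
    # Individual-specific covariates enter the second part of the formula;
    # reflevel sets PDO as the base severity level
    crash_mnl <- mlogit(severity ~ 1 | young + old + female + alcohol + drugs +
                          safety_constraints + speed + rule_violation +
                          reckless_behavior + signal + two_way + none +
                          total_units + snow + ice + wet + dark,
                        data = truck, reflevel = "PDO")
    summary(crash_mnl)

The model outputs are: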
                        B or C                           K or A
Variable                Estimate  Std. Error  Pr(>|z|)   Estimate  Std. Error  Pr(>|z|)
Intercept               1.1248    0.3605      0.0018     3.0400    0.5082      0.0000
Young                   0.0903    0.0595      0.1289     0.2779    0.0907      0.0022
Old                     0.0429    0.0595      0.4711     0.5054    0.0847      0.0000
Female                  0.8380    0.0530      0.0000     0.5557    0.0807      0.0000
Alcohol                 0.1466    0.1286      0.2542     0.8121    0.1556      0.0000
Drugs                   1.5013    0.4077      0.0002     2.5610    0.4193      0.0000
Safety constraints      0.7098    0.3319      0.0325     0.9989    0.4369      0.0222
Speed                   0.5290    0.0556      0.0000     0.6722    0.0849      0.0000
Rule violation          0.3192    0.0611      0.0000     0.9368    0.0888      0.0000
Reckless behavior       0.2263    0.0507      0.0000     0.3559    0.0767      0.0000
Signal                  0.6930    0.1471      0.0000     0.6216    0.2727      0.0227
Two-way                 0.7419    0.1555      0.0000     1.3280    0.2709      0.0000
None                    0.4295    0.1379      0.0018     0.8710    0.2580      0.0007
Total units             0.3264    0.0269      0.0000     0.3849    0.0358      0.0000
Snow                    0.6935    0.0753      0.0000     1.0676    0.1305      0.0000
Ice                     0.5375    0.1080      0.0000     0.7336    0.1836      0.0001
Wet                     0.0467    0.0675      0.4891     0.3037    0.1113      0.0064
Dark                    0.0991    0.0613      0.1059     0.3775    0.0901      0.0000

AIC: 17,869.32; Log-Likelihood: -8898.7; McFadden R2: 0.062873.
Finally, summarize your findings. In the MNL model, the coefficient estimates are interpreted as comparisons between injury level i and the base level, PDO (i = 1). As can be seen in the table, a driver usually sustained more severe injuries when alcohol or drugs were involved. If a driver was under the influence of drugs, his or her chance of being injured increases drastically, with the respective probabilities of level B or C and level K or A being 4.49 (e^1.5013) times and 12.95 (e^2.5610) times that of PDO. The exponentiated value of a logit coefficient is also called the odds ratio. Other factors related to unsafe driving behavior, such as speeding, violating traffic rules, and driving recklessly, all suggest an increased probability of serious injuries.
4.4.2 Nested logit model McFadden (1981) developed the generalized extreme value model (GEV) to overcome the IIA limitation in the MNL model. The NL model is the most well-known GEV model. The NL model generates two kinds of crash outcomes: those that are part of a nest (crash outcomes that are
FIGURE 4.3 Nested structure of accident severities: a "No Evident Injury" nest containing Property Damage Only (PDO) and Possible Injury (Type C); Evident Injury (Type B); and Incapacitating Injury or Fatality (Type A or K).
correlated) and those that are not. By grouping outcomes that share unobserved effects into conditional nests, the shared unobserved effects cancel out within each nest. Shankar et al. (1996) observed that the "property damage only" and "possible injury" severity levels were correlated due to shared unobserved factors, which is a sign that IIA has been violated. The authors proposed the nested structure shown in Fig. 4.3. The structure combines the two severity levels into one nest named "No Evident Injury," which is independent of the other two branches (evident injury and disabling injury or fatality). The crash severity probabilities for a nested outcome in the NL model consist of the nest probability as well as the outcome probability inside the nest. Assuming the disturbances are generalized extreme value distributed, the nested logit model can be formulated as (see McFadden, 1981):

P_ni = exp[x'_ni β_i + φ_i LS_ni] / Σ_{∀I} exp[x'_nI β_I + φ_I LS_nI]        (4.8a)

P_n(j|i) = exp[x'_nj|i β_j|i] / Σ_{∀J} exp[x'_nJ|i β_J|i]        (4.8b)

LS_ni = ln[ Σ_{∀J} exp(x'_nJ β_J|i) ]        (4.8c)
where P_ni is the unconditional probability of crash n resulting in injury outcome i; the x's are vectors of characteristics that determine the probability of injury severity, and the β's are vectors of estimable parameters. P_n(j|i) is the probability of crash observation n having injury outcome j conditioned on the injury outcome being in category i. J is the set of outcomes conditioned on i, and I is the unconditional set of outcome categories (for example, the upper three branches in Fig. 4.3: "no evident injury," "evident injury," and "disabling injury or fatality"). LS_ni is the inclusive value (logsum), which can be considered the expected maximum value of the attributes that determine the probabilities in severity category i, and φ_i is an estimable parameter. According to Eq. (4.8b), grouping the property damage only and possible injury crashes that share common unobserved effects into "no evident injury" cancels out the unobserved effects and therefore preserves the independence assumption. To be consistent with McFadden's generalized extreme value derivation of the model, the parameter estimate for φ_i in the nested logit model must be between zero and one. If φ_i equals one or is not significantly different from one, there is no correlation between the severity levels in the nest, meaning the model reduces to the multinomial logit model. If φ_i equals zero, a perfect correlation is implied among the severity levels in the nest, indicating a deterministic process by which crashes result in particular severity levels. The t-test can be used to test whether φ_i is significantly different from 1:

t = (φ_i - 1) / S.E.(φ_i)

Because φ_i is less than or equal to one, this is a one-tailed t-test (half of the two-tailed t-test; more details about the t-test can be found in Chapter 5 - Exploratory Analyses of Safety Data). It is important to note that the typical t-test implemented in many commercial software packages is against zero instead of one; thus, this t value must be calculated manually. The IIA assumption of an MNL model can also be tested with the Hausman-McFadden (1984) test, which has been widely implemented in commercial statistical software. Model estimates can be produced in a sequential fashion (i.e., estimating the conditional model as in Eq. 4.8b) using only the data in the sample that observed the subset of injury outcomes J; then, the logsum in Eq. (4.8c) is calculated using all observations, including those with injury severity in J and those without. Lastly, the calculated logsums are used as an independent variable in Eq. (4.8a). The sequential estimation procedure, however, may generate small variance-covariance matrices that lead to inflated t-statistics. Full information maximum likelihood (FIML) estimation does not have this problem. Its log-likelihood is

ln L = Σ_i ln[Prob(twig|branch)_i × Prob(branch)_i]

where "twig" denotes the outcome within the nest and "branch" the nest itself. FIML is more efficient than two-step estimation, and it ensures appropriate estimation of the variance-covariance matrices (see Greene, 2000 for additional details).
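Although the chapter does not include code for it, a minimal sketch of the Hausman-McFadden test with the R "mlogit" package might look as follows; the model, data, and column names are hypothetical stand-ins.

    library(mlogit)
    # Full MNL model versus the same model estimated on a subset of alternatives
    full_mnl   <- mlogit(severity ~ 1 | female + speed + snow, data = truck)
    subset_mnl <- mlogit(severity ~ 1 | female + speed + snow, data = truck,
                         alt.subset = c("PDO", "BC"))
    hmftest(full_mnl, subset_mnl)  # Hausman-McFadden test of the IIA assumption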
Exercise 4.2
Estimate a nested logit model using the Large Trucks Dataset and compare the model results with Exercise 4.1.
The exercise uses the same dataset as Exercise 4.1. Crash injury severity levels include PDO, injury type B (nonincapacitating injury), C (possible injury), injury type A (incapacitating injury), and K (fatal). The solution is as follows.
First, establish the nested structure of crash severities: a lower nest, "B, C, or PDO," containing the "B or C" and "PDO" outcomes, and a degenerate branch containing "K or A" alone.
Second, determine the functional form based on Eqs. (4.8a)-(4.8c). For example, P_n(j|i) is the probability of crash n having injury outcome B or C conditioned on the injury outcome not being a K or A injury. I is the unconditional set of outcome categories (the two upper branches in the figure: no K/A injury and K/A injury), and LS_ni is the inclusive value (logsum). Third, estimate the coefficients using the R "mlogit" package.
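The original call is truncated in this copy; a hedged sketch follows, with hypothetical nest and column names, in which "KA" forms a degenerate single-alternative nest:

    library(mlogit)
    nested_logit <- mlogit(severity ~ 1 | young + old + female + alcohol + drugs +
                             safety_constraints + speed + rule_violation +
                             reckless_behavior + signal + two_way + none +
                             total_units + snow + ice + wet + dark,
                           data = truck, reflevel = "PDO",
                           nests = list(no_ka = c("PDO", "BC"), ka = c("KA")),
                           un.nest.el = TRUE)  # one inclusive-value parameter
    summary(nested_logit)

The model outputs are: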
                        B or C (lower nest)              K or A
Variable                Estimate  Std. Error  Pr(>|z|)   Estimate  Std. Error  Pr(>|z|)
Intercept               0.8087    0.7161      0.2588     3.0882    0.4798      0.0000
Young                   0.0632    0.0677      0.3507     0.2658    0.0920      0.0039
Old                     0.0278    0.0487      0.5680     0.5136    0.0839      0.0000
Female                  0.6021    0.5037      0.2320     0.4299    0.2815      0.1267
Alcohol                 0.0975    0.1219      0.4239     0.7874    0.1506      0.0000
Drugs                   1.0946    0.9586      0.2535     2.3048    0.6835      0.0007
Safety constraints      0.5025    0.4731      0.2881     0.8868    0.4541      0.0508
Speed                   0.3802    0.3187      0.2330     0.5965    0.1864      0.0014
Rule violation          0.2284    0.1952      0.2420     0.8945    0.1265      0.0000
Reckless behavior       0.1621    0.1394      0.2447     0.3264    0.0980      0.0009
Signal                  0.4969    0.4280      0.2457     0.5304    0.3382      0.1168
Two-way                 0.5331    0.4587      0.2451     1.2299    0.3436      0.0003
None                    0.3059    0.2746      0.2652     0.8153    0.2862      0.0044
Total units             0.2328    0.1954      0.2337     0.3304    0.1208      0.0062
Snow                    0.5044    0.4233      0.2334     0.9792    0.2309      0.0000
Ice                     0.3875    0.3326      0.2439     0.6646    0.2431      0.0063
Wet                     0.0309    0.0550      0.5740     0.3124    0.1107      0.0048
Dark                    0.0756    0.0769      0.3254     0.3650    0.0961      0.0001
iv (inclusive value)    0.7161    0.5989      0.2319

AIC: 17,871.11; Log-Likelihood: -8898.6; McFadden R2: 0.062884.
Finally, compare the nested logit model with the multinomial logit model in Exercise 4.1. None of the variables in the lower nest appear to be statistically significant. The AIC value of the NL model (17,871.11) is also greater than that of the MNL model (17,869.32), indicating inferior performance. The inclusive value is 0.7161, and its t-value is -0.474 (calculated as (0.7161 - 1)/0.5989). Apparently, the logsum coefficient is not significantly different from 1. When the inclusive value equals one or is not significantly different from 1, there is no correlation between the severity levels in the nest, meaning the model reduces to a simple multinomial logit model. We can conclude that for this dataset, the MNL model is more appropriate.
4.4.3 Mixed logit model
The ML model (also known as the random parameters logit model) is highly flexible because it can approximate any random utility model (McFadden and Train, 2000). The mixed logit model addresses the limitations of the multinomial logit by allowing for heterogeneous effects and correlation in unobserved factors (Train, 2009). The mixed logit is a generalization of the multinomial structure in which the parameter vector β can vary across observations. We can consider the mixed logit probability as a weighted average of the logit function evaluated at different values of the parameter β. The weighted average of several functions is called a mixed function, and the density that provides the weights is called the mixing distribution. In crash severity modeling applications, common mixing distributions include the normal, lognormal, uniform, and triangular. Thus, the mixed logit is the integral of standard logit probabilities over a density of parameters, specified as (Train, 2009):

P_ni = ∫ [ exp(x'_ni β_i) / Σ_{∀J} exp(x'_nJ β_J) ] f(β|φ) dβ        (4.9)

where f(β|φ) is the density function of β, and φ is a vector of parameters that specify the density function, with all other terms as previously defined. The injury severity level probability is a mixture of logits. When all parameters β are fixed, the model reduces to the multinomial logit model. When β is allowed to vary, the model does not have a closed form, and the probability of crash observation n having a particular injury outcome i must be calculated through integration; simulation-based maximum likelihood methods, such as Halton draws, are usually used. The choice of the density function of β depends on the nature of the coefficient and the statistical goodness of fit. The lognormal distribution is useful when the coefficient is known to have the same sign for every observation. The triangular and uniform distributions have the advantage of being bounded on both sides. Furthermore, the triangular distribution assumes that the probability increases linearly from the lower bound to the midrange and then decreases linearly to the upper bound, while the uniform distribution assumes the same probability for any value within the range. If a coefficient is no longer fixed but random, its interpretation can be tricky because the impact of x on the injury outcome is case-specific. In Milton et al. (2008), the authors suggested that roadway characteristics were better modeled as fixed parameters, while volume-related variables (such as average daily traffic per lane, average daily truck traffic, and truck percentage) and weather effects were better modeled as random parameters. The authors also speculated that the random effect of ADT per lane increases injury severity in some cases while decreasing it in others, which captures the response and adaptation of local drivers to various levels of traffic volume. The number of interchanges per mile was
also found to be a random coefficient, suggesting some interchanges may be more complex in design and traffic patterns than others. In Chen and Chen (2011), the authors concurred that weather characteristics, such as snowy or slushy surface conditions, and a light-traffic indicator appeared to be random coefficients. Compared to fixed parameters, this "randomness" may offer new insights for a more comprehensive and better understanding of the complex relationship between observed factors and crash injury outcomes.
Exercise 4.3
Estimate a mixed logit model using the Large Trucks Dataset.
This exercise uses the same dataset as Exercise 4.1. The presentation of the coefficients of fixed parameters in the ML model is the same as in the MNL model; when a coefficient is a random variable, its standard deviation is also displayed. For computational efficiency, fewer explanatory variables are tested in the mixed logit model: Old, Female, Alcohol, Speed, Snow, and Dark. The dependent variable is the crash injury severity level: PDO (i = 1), B or C (i = 2), and K or A (i = 3). The solution is as follows. First, determine the density function f(β|φ). In the R "mlogit" package, the random parameter object "rpar" contains all the relevant information about the distribution of the random parameters; currently, the normal ("n"), log-normal ("ln"), zero-censored normal ("cn"), uniform ("u"), and triangular ("t") distributions are available. For illustration, the normal distribution is chosen as the density function of the random parameter β. Second, estimate the coefficients using the R "mlogit" package.
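The original call is truncated in this copy; a hedged sketch follows. The coefficient names passed to rpar are hypothetical and depend on how the alternatives are actually coded.

    library(mlogit)
    crash_mixed <- mlogit(severity ~ 1 | old + female + alcohol + speed + snow + dark,
                          data = truck, reflevel = "PDO",
                          rpar = c("snow:BC" = "n", "snow:KA" = "n"),  # normal densities
                          R = 100, halton = NA)  # 100 Halton draws for simulation
    summary(crash_mixed)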
AIC: 18,659.75; Log-Likelihood: -9309.9; McFadden R2: 0.030777.
Finally, summarize the findings. The ML model can account for data heterogeneity by treating coefficients as random variables. In this exercise, the coefficients associated with K or A injuries are of most interest and were therefore selected for testing whether or not they are random parameters. The coefficient of Snow is tested for both injury types B or C and injury types K or A to see whether snowy pavement has varying effects on injury severity due to drivers' risk compensation. According to the model outputs, the snowy surface parameter for truck K or A injuries is fixed (10.830); for severity B or C, it is normally distributed with a mean of 0.8892 and a standard deviation of 1.8159, meaning that the likelihood of a B or C injury increases for about 69% of truck crashes occurring on snowy pavement and decreases for the remaining 31%. It is plausible that people often drive more slowly and cautiously on snowy roads but that the slick conditions still have a tendency to cause crashes.
4.5 Modeling crash severity as an ordered discrete outcome
The primary rationale for using ordered discrete choice models for crash severity is that there is an intrinsic order among injury severities, with fatality being the highest order and property damage being the lowest. Including the ordinal nature of the data in the statistical model maintains data integrity and preserves information. Second, ordered response models avoid the undesirable properties of multinomial models, such as the independence of irrelevant alternatives in the case of the multinomial logit model or the lack of a closed-form likelihood in the case of the multinomial probit model. Third, ignoring the ordinality of the variable may cause a loss of efficiency (i.e., more parameters may be estimated than necessary if the order is ignored), which also increases the risk of obtaining insignificant results. Although the ordered model has many positives, its disadvantage is that imposing an ordering restriction on the data may not be appropriate despite the appearance of a rank. Therefore, it is important to test the validity of the ordered restriction. The rest of this section introduces three types of ordered choice models: the ordinal probit/logistic model, the generalized ordered logistic and partial proportional odds model, and the sequential logit/probit model.
4.5.1 Ordinal probit/logistic model
The ordinal logit/probit model applies a latent continuous variable, z_n, as a basis for modeling the ordinal nature of crash severity data. z_n is specified as a linear function of x_n:

z_n = x'_n β + ε_n        (4.10)
where x_n is a vector of explanatory variables determining the discrete ordering (i.e., injury severity) for the nth crash observation; β is a vector of estimable parameters; and ε_n is an error term that accounts for unobserved factors influencing injury severity. A higher value of z is expected to result in a higher level of observed injury y in the case of a crash. The observed discrete injury severity variable y_n is stratified by thresholds as follows:

y_n = 1, if z_n ≤ μ_1 (PDO or no injury)
      2, if μ_1 < z_n ≤ μ_2 (injury C)
      3, if μ_2 < z_n ≤ μ_3 (injury B)
      4, if μ_3 < z_n ≤ μ_4 (injury A)
      5, if μ_4 < z_n (K or fatal injury)        (4.11)

where the μ's are estimable thresholds, estimated along with the parameter vector β. The model is estimated using maximum likelihood estimation (Greene, 2000). If the random error term ε is assumed to follow a standard normal distribution, the model is an ordered probit model. The probabilities associated with the observed responses of an ordered probit model are as follows:

P_n(1) = Pr(y_n = 1) = Pr(z_n ≤ μ_1) = Pr(x'_n β + ε_n ≤ μ_1) = Pr(ε_n ≤ μ_1 - x'_n β) = Φ(μ_1 - x'_n β)
P_n(2) = Pr(y_n = 2) = Pr(μ_1 < z_n ≤ μ_2) = Pr(ε_n ≤ μ_2 - x'_n β) - Pr(ε_n ≤ μ_1 - x'_n β) = Φ(μ_2 - x'_n β) - Φ(μ_1 - x'_n β)
...
P_n(i+1) = Φ(μ_{i+1} - x'_n β) - Φ(μ_i - x'_n β)
...
P_n(I) = Pr(y_n = I) = Pr(z_n > μ_{I-1}) = 1 - Φ(μ_{I-1} - x'_n β)        (4.12)

where i is the ith level of injury and I represents the highest injury level (i.e., fatal). Φ(·) is the cumulative standard normal distribution.
The ordinal logistic model, also called the cumulative logit model, is formulated when the discrete outcomes are treated as the cumulative distribution of the response. Letting Pr(y_n > i) represent the cumulative probability of observation y_n belonging to categories higher than i, the ordinal logistic model is specified as:

log[ Pr(y_n > i) / (1 - Pr(y_n > i)) ] = a_i + x'_n β,  i = 1, ..., I - 1        (4.13)

where a_i differs for each equation and β is a single set of coefficients. Equivalently,

Pr(y_n > i) = exp(a_i + x'_n β) / [1 + exp(a_i + x'_n β)]        (4.14)
We can also derive the formulation from the latent variable z_n by assuming the error term ε_n to be logistically distributed across observations, with CDF F(ε_n) = exp(ε_n)/[1 + exp(ε_n)]. The equation is as follows:

Pr(y_n > i) = Pr(z_n > μ_i) = Pr(ε_n > μ_i - x'_n β) = 1 / [1 + exp(μ_i - x'_n β)] = exp(x'_n β - μ_i) / [1 + exp(x'_n β - μ_i)]        (4.15)

As can be seen, Eqs. (4.14) and (4.15) have the same form except for different symbols for the intercept. It is worth pointing out that in Eq. (4.13) we assume the regressors x_n do not include a column of ones, because the constant is absorbed in the cutpoints (i.e., thresholds) μ_i or a_i. Because the thresholds are increasing, a positive sign of β indicates higher injury severity as the value of the associated variable increases, while a negative sign suggests the opposite. β does not depend on the placement of the thresholds and stays the same across categories; the threshold values affect the intercepts and the relative numbers of crashes located in the different categories. McCullagh (1980) refers to ordinal models as proportional odds models because a covariate x increases or decreases the odds of a response in a category higher than i by the factor exp(x'_n β), meaning the effect is a proportionate change for all response categories. In contrast to the MNL coefficient β_i, which varies by injury outcome i, an important restriction associated with the β of an ordered logit/probit model is the proportional odds assumption (i.e., the parallel regression assumption or the parallel lines assumption). The use of the ordered probit/logit model may be inappropriate if this assumption is violated. The proportional odds restriction also creates an unintended consequence concerning how the explanatory variables affect the probabilities of the discrete outcomes. Consider a model with three injury levels: no injury, injury, and fatality. Suppose that one of the contributing factors determining the level of injury is airbag deployment. As shown in
FIGURE 4.4 Illustration of an ordered model with an increase in x'_n β.
Fig. 4.4, a negative parameter for the airbag indicator (1 if it was deployed and zero otherwise) implies that μ_i - x'_n β becomes greater, which shifts values to the right on the x-axis. Thus, the model constrains the airbag's effect to simultaneously decrease the probability of a fatality and increase the probability of no injury. We know for a fact that the activation of an airbag may itself cause injury and thus decrease the probability of no injury; unfortunately, ordered models cannot account for this bidirectional possibility because the shifts in the thresholds are constrained to move in the same direction.
Exercise 4.4
Estimate an ordinal probit model and an ordinal logistic model using the Large Trucks Dataset.
This exercise uses the same dataset as Exercise 4.1. An ordinal probit and an ordinal logistic regression model are applied, respectively, to recognize the ordinality of the injury level, the dependent variable. The problem is solved using the following steps. First, determine the functional form: Eq. (4.12) for the ordinal probit model and Eq. (4.15) for the ordinal logistic model. In both equations, the μ's are estimable thresholds, estimated along with the parameter vector β. Second, estimate the coefficients using the R "ordinal" package.
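The original calls are truncated in this copy; a hedged sketch of both fits follows, with hypothetical column names. The response must be coded as an ordered factor.

    library(ordinal)
    trucks_df$severity <- factor(trucks_df$severity,
                                 levels = c("PDO", "BC", "KA"), ordered = TRUE)
    f <- severity ~ young + old + female + alcohol + drugs + safety_constraints +
      speed + rule_violation + reckless_behavior + signal + two_way + none +
      total_units + snow + ice + wet + dark
    op_model <- clm(f, data = trucks_df, link = "probit")  # ordinal probit
    ol_model <- clm(f, data = trucks_df, link = "logit")   # ordinal logistic
    summary(op_model)
    summary(ol_model)

The model outputs are: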
                        Ordinal probit                   Ordinal logistic
Variable                Estimate  Std. Error  Pr(>|z|)   Estimate  Std. Error  Pr(>|z|)
Young                   0.0963    0.0309      0.0019     0.1514    0.0521      0.0037
Old                     0.1285    0.0307      0.0000     0.1961    0.0520      0.0002
Female                  0.3398    0.0270      0.0000     0.6116    0.0454      0.0000
Alcohol                 0.2977    0.0623      0.0000     0.4945    0.1082      0.0000
Drugs                   1.0187    0.1459      0.0000     1.7663    0.2455      0.0000
Safety constraints      0.4321    0.1642      0.0085     0.7798    0.2881      0.0068
Speed                   0.3090    0.0288      0.0000     0.5299    0.0488      0.0000
Rule violation          0.3329    0.0315      0.0000     0.5416    0.0534      0.0000
Reckless behavior       0.1569    0.0264      0.0000     0.2644    0.0446      0.0000
Signal                  0.3353    0.0799      0.0000     0.5744    0.1342      0.0000
Two-way                 0.5364    0.0830      0.0000     0.9019    0.1403      0.0000
None                    0.3190    0.0752      0.0000     0.5144    0.1265      0.0000
Total units             0.1635    0.0120      0.0000     0.2867    0.0210      0.0000
Snow                    0.4450    0.0400      0.0000     0.7666    0.0682      0.0000
Ice                     0.3358    0.0577      0.0000     0.5727    0.0974      0.0001
Wet                     0.0615    0.0357      0.0844     0.0906    0.0596      0.1288
Dark                    0.1237    0.0319      0.0001     0.2108    0.0540      0.0001

Threshold coefficients
1|2                     0.5518    0.1811                 0.8867    0.3154
2|3                     1.8791    0.1818                 3.1692    0.3171

AIC                     18,072.64                        18,036.83
Finally, summarize the findings. The positive coefficients suggest a likelihood of more severe injuries. Thus, all explanatory variables except for the use of safety constraints and adverse weather are associated with more severe injuries. The coefficient estimates of both models are consistent in signs and magnitudes. The threshold coefficients of the ordinal logistic model are greater than those of the ordinal probit model. According to the AIC values, the ordinal logistic model fits slightly better than the ordinal probit model.
4.5.2 Generalized ordered logistic and proportional odds model
A generalized ordered logistic model (gologit) provides results similar to those from running a series of binary logistic regressions/cumulative logit models. The ordered logit model is a special case of the gologit model in which the coefficients β are the same for each category. The partial proportional odds (PPO) model is in between: some of the coefficients β are the same for all categories while others may differ. A gologit model and an MNL model, whose variables are freed from the proportional odds constraint, both generate many more parameters than an ordered logit model. A PPO model allows the parallel lines/proportional odds assumption to be relaxed only for those variables that violate the assumption. In the gologit model, the probability of crash injury for a given crash can be specified as a set of (I - 1) equations:

Pr(y_n > i) = exp(x'_n β_i - μ_i) / [1 + exp(x'_n β_i - μ_i)],  i = 1, ..., (I - 1)        (4.16)

where μ_i is the cut-off point for the ith cumulative logit. Note that Eq. (4.16) differs from Eq. (4.14) in that β_i is no longer a single set of coefficients but varies by category i. In the PPO model formulation, it is assumed that some explanatory variables may satisfy the proportional odds assumption while others may not. The cumulative probabilities in the PPO model are calculated as follows (Peterson and Harrell, 1990):

Pr(y_n > i) = exp(x'_n β + T'_n γ_i - μ_i) / [1 + exp(x'_n β + T'_n γ_i - μ_i)],  i = 1, ..., (I - 1)        (4.17)
where x_n is a (p × 1) vector of independent variables for crash n, and β is a vector of regression coefficients in which each independent variable has one β coefficient. T_n is a (q × 1) vector (q ≤ p) containing the values of crash n on the subset of the p explanatory variables for which the proportional odds assumption is not assumed, and γ_i is a (q × 1) vector of regression coefficients. Thus, γ_i represents a deviation from the proportionality β, and T'_n γ_i is an increment associated only with the ith cumulative logit, i = 1, ..., (I - 1). An alternative but simplified way to think about the PPO model is to have two sets of explanatory variables: x_1, whose coefficients remain the same for all injury severities, and x_2, whose coefficients vary across injury severities. Note that x_1 and x_2 have no common variables. The PPO model is specified in Eq. (4.18):

Pr(y_n > i) = exp(x'_1 β_1 + x'_2 β_2i - μ_i) / [1 + exp(x'_1 β_1 + x'_2 β_2i - μ_i)]        (4.18)
where β_1 is a vector of parameters to be estimated for x_1 and is the same for all injury severities, and β_2i is a vector of parameters to be estimated for x_2 that varies across injury severities. The proportional odds assumption dictates whether a coefficient is the same or different. The parameterization in Eqs. (4.17) or (4.18) depends on the statistical software package (Williams, 2006). The gologit/PPO model has been applied in several recent studies as an extension of the ordered logit model, and results show that it consistently outperforms conventional ordered response models (Wang and Abdel-Aty, 2008; Qin et al., 2013a,b; Yasmin and Eluru, 2013). According to Williams (2016), the gologit/PPO model usually provides a substantially better fit to the data than the ordered logit model and is also much more parsimonious than other alternatives. However, interpretation and justification are less straightforward for the gologit model than for the ordered logit model. A test devised by Brant (1990) is commonly used to test the parallel regression assumption; the Brant test is available in most statistical software packages.
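As a hedged illustration, the Brant test can be run on a model fitted with MASS::polr using the R "brant" package; the variable names below are hypothetical.

    library(MASS)
    library(brant)
    # polr requires an ordered factor response
    om <- polr(severity ~ young + female + snow, data = trucks_df, method = "logistic")
    brant(om)  # tests the parallel regression (proportional odds) assumption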
Exercise 4.5
Estimate a PPO model using the Large Trucks Dataset.
The PPO model in this exercise uses the same crash data as Exercise 4.1. In the ordinal model, the coefficients use the cumulative probability to estimate the log odds ratio between all injury severities higher than level i and all injury severities lower than
and equal to level i (see Eq. 4.13). However, it is important to point out that some statistical software packages use

log[ Pr(y_n ≤ i) / (1 - Pr(y_n ≤ i)) ] = a_i + x'_n β,  i = 1, ..., I - 1

where the log odds ratio is for the lower levels. The solution is as follows. First, determine the functional form: Eq. (4.18) for the PPO model, where β_1 is a vector of parameters to be estimated for x_1 and is the same for all injury severities, and β_2i is a vector of parameters to be estimated for x_2 that varies across injury severities. Second, the coefficients are estimated using the R "ordinal" package.
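The original call is truncated in this copy; a hedged sketch follows. clm's "nominal" argument relaxes the proportional odds assumption for the listed variables, mirroring the x_2 set in Eq. (4.18); the column names are hypothetical.

    library(ordinal)
    ppo_model <- clm(severity ~ young + alcohol + drugs + safety_constraints +
                       speed + rule_violation + reckless_behavior + signal +
                       two_way + none + total_units + dark,
                     nominal = ~ old + female + snow + wet + ice,
                     data = trucks_df, link = "logit")
    summary(ppo_model)

The model outputs are: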
Variable                Estimate  Std. Error  Pr(>|z|)
Young                   0.1482    0.0529      0.0051
Alcohol                 0.4990    0.1087      0.0000
Drug                    1.7142    0.2449      0.0000
Safety constraints      0.7610    0.2856      0.0077
Speed                   0.5247    0.0490      0.0000
Rule violation          0.5410    0.0537      0.0000
Reckless behavior       0.2654    0.0448      0.0000
Signal                  0.5770    0.1355      0.0000
Two-way                 0.9097    0.1416      0.0000
None                    0.5176    0.1277      0.0001
Total units             0.2827    0.0210      0.0000
Dark                    0.2049    0.0542      0.0002

Threshold coefficients
1|2.(Intercept)         0.9253    0.3136
2|3.(Intercept)         3.0458    0.3161
1|2.Old                 0.0980    0.0548
2|3.Old                 0.4868    0.0764
1|2.Female              0.7905    0.0506
2|3.Female              0.1250    0.0728
1|2.Snow                0.7558    0.0703
2|3.Snow                0.8447    0.1196
1|2.Wet                 0.0224    0.0644
2|3.Wet                 0.3475    0.1044
1|2.Ice                 0.5680    0.1009
2|3.Ice                 0.5654    0.1722

AIC: 17,923.93
Readers can also run an ANOVA test between the PPO model and a proportional odds model such as the ordered probit model (Exercise 4.4) using "anova(op_model, ppo_model)":

                no.par    AIC       logLik     LR.stat   df   Pr(>Chisq)
op_model        19        18,073    -9017.3
ppo_model       24        17,924    -8938.0    158.71    5    <0.0001

The likelihood ratio statistic (158.71 with 5 degrees of freedom) is highly significant, indicating that the PPO model fits these data significantly better than the ordered probit model.

4.5.3 Sequential logit/probit model
The sequential logit/probit model treats injury severity as the outcome of a sequence of binary outcomes, each governed by a latent variable z_ni (i = 1, ..., I - 1) specified as a linear function of covariates plus a random error term. The observed injury severity y_n is determined as follows:

y_n = 1, if z_n1 ≥ 0
      2, if z_n1 < 0 and z_n2 ≥ 0
      3, if z_n1 < 0, z_n2 < 0, and z_n3 ≥ 0
      ...
      I, if z_n1 < 0, z_n2 < 0, ..., and z_n,I-1 < 0        (4.20)

The probability of y_n taking the different injury severities is written as follows:
P_n(1) = Pr(y_n = 1) = L(-(a_1 + x'_n β_1))
P_n(2) = Pr(y_n = 2) = L(-(a_1 + x'_n β_1)) L(-(a_2 + x'_n β_2))
P_n(3) = Pr(y_n = 3) = L(-(a_1 + x'_n β_1)) L(-(a_2 + x'_n β_2)) L(-(a_3 + x'_n β_3))
...
P_n(I) = Pr(y_n = I) = ∏_{i=1}^{I-1} L(-(a_i + x'_n β_i))        (4.21)

where L(·) represents the standard logistic CDF for the sequential logit model and the standard normal CDF for the sequential probit model. Take P_n(1) as an example: for the standard logistic CDF, L(-(a_1 + x'_n β_1)) = 1/[1 + exp(a_1 + x'_n β_1)], and for the standard normal CDF, L(-(a_1 + x'_n β_1)) = Φ(-(a_1 + x'_n β_1)). The probability of injury level i is thus a product of individual cumulative functions. This formulation reveals one major limitation of the sequential logit/probit model: it assumes independence between the error terms. On the other hand, an important practical feature of this hierarchical model is that the multinomial likelihood factors into the product of binomial likelihoods. Jung et al. (2010) applied the sequential logit model to assess the effects of rainfall on the severity of single-vehicle crashes on Wisconsin interstate highways. The sequential logit regression model outperformed the ordinal logit regression model in predicting crash severity levels in rainy weather when comparing goodness of fit, parameter significance, and
prediction accuracies. The sequential logit model identified that stronger rainfall intensity significantly increases the likelihood of fatal and incapacitating injury crash severity, while this was not captured in the ordered logit model. Yamamoto et al. (2008) also reported superior performance and unbiased parameter estimates with sequential binary models as compared with traditional ordered probit models, even when underreporting was a concern.
4.6 Model interpretation
Statistical modeling is only a data-driven tool for measuring the effects of variables on crash injury severity levels. It is expert domain knowledge that ultimately helps explain what factors cause or contribute to more severe injuries, the safety problem to be solved, and the context within which it occurs. To properly interpret model results, we need to be wary of the data formats, as the data can be structured differently depending on the method. The dependent variable can be treated as individual categories, categories higher than level i, or categories lower than level i. Independent variables can be continuous, indicator (1 or 0), or categorical. Categorical variables should be converted to dummy variables, with a dummy variable assigned to each distinct value of the original categories. The coefficient of a dummy variable can be interpreted as the log-odds for that particular value of the dummy minus the log-odds for the base value of 0 (e.g., the odds of being injured when drinking and driving is 10 times that of someone who is sober). The percent change in the odds for each one-unit increase in a continuous independent variable is calculated by subtracting 1 from the odds ratio and then multiplying by 100, or 100(e^β - 1).
The key concepts of marginal effect and elasticity are fundamental to understanding model estimates. The marginal effect is the unit change in y for a one-unit increase in x if x is a continuous variable. In a simple linear regression, the regression coefficient of x is the marginal effect, ∂y/∂x_k = β_k. Due to the nonlinear form of logit models, the marginal effect of a continuous independent variable is ∂p_i/∂x_ki = β_ki p_i (1 - p_i). Thus, the marginal effect depends on the logit regression coefficient for x_ki as well as the values of the other variables and their coefficients. If x_ki is a discrete variable (indicator or dummy variable), the marginal effect of x_ki is Pr(i|x, x_ki = 1) - Pr(i|x, x_ki = 0). Such marginal effects are called instantaneous rates of change because they are computed for one variable while holding all other variables constant. The marginal effect at the mean is a popular approach for both continuous and discrete variables in which all x's are set at their means.
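As a brief numerical illustration in R, using one coefficient from Exercise 4.1; the probability value is assumed for the example and is not taken from the model output.

    b   <- 0.8121                # alcohol coefficient for K or A in Exercise 4.1
    exp(b)                       # odds ratio, about 2.25
    100 * (exp(b) - 1)           # about a 125% increase in the odds
    p_i <- 0.11                  # assumed probability of a K or A outcome
    b * p_i * (1 - p_i)          # marginal effect of a continuous variable at p_i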
Elasticity can be used to measure the magnitude of the impact of specific variables on the injury-outcome probabilities. For a continuous variable, elasticity is the percent change in y given a 1% increase in x. It is computed from the partial derivative with respect to the continuous variable for each observation n. Using the partial derivative of the MNL probability P(i), the elasticity of a continuous variable is (n subscripting omitted):

E^{P(i)}_{x_ki} = [∂P(i)/∂x_ki] × [x_ki / P(i)] = [1 - P(i)] β_ki x_ki        (4.22)
where β_ki is the estimable coefficient associated with x_ki. Elasticity values give the percent effect that a 1% change in x_ki has on the injury severity probability P(i). For indicator or dummy variables (those taking on values of 0 or 1), a pseudoelasticity percentage can be written as follows:

E^{P(i)}_{x_ki} = { exp(Δβ_i x_i) Σ_{∀I} exp(β_kI x_kI) / [exp(Δβ_i x_i) Σ_{∀I_n} exp(β_kI x_kI) + Σ_{∀I≠I_n} exp(β_kI x_kI)] - 1 } × 100        (4.23)

where I_n is the set of injury severity outcomes with x_ki in the function determining the outcome, and I is the set of all possible injury severity outcomes. The pseudoelasticity of an indicator variable with respect to an injury severity category represents the percent change in the probability of that injury severity category when the variable is changed from zero to one.
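For instance, a hedged numerical reading of Eq. (4.22) in R, with an assumed probability and variable value:

    beta_ki <- 0.3849            # Total units coefficient for K or A in Exercise 4.1
    x_ki    <- 2                 # assumed number of units involved in the crash
    P_i     <- 0.11              # assumed probability of a K or A outcome
    (1 - P_i) * beta_ki * x_ki   # elasticity: about a 0.69% change in P(i) per 1% change in x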
References
Brant, R., 1990. Assessing proportionality in the proportional odds model for ordinal logistic regression. Biometrics 46, 1171-1178. https://doi.org/10.2307/2532457.
Chen, F., Chen, S., 2011. Injury severities of truck drivers in single- and multi-vehicle accidents on rural highways. Accid. Anal. Prev. 43 (5), 1677-1688.
Christensen, R.H.B., 2018. Cumulative link models for ordinal regression with the R package ordinal. Submitted to J. Stat. Softw., 1-40.
Greene, W., 2000. Econometric Analysis, fourth ed. Prentice Hall, Upper Saddle River, NJ.
Gumbel, E.J., 1958. Statistics of Extremes. Columbia University Press, New York.
Hausman, J.A., McFadden, D., 1984. A specification test for the multinomial logit model. Econometrica 52, 1219-1240.
Jung, S.Y., Qin, X., Noyce, D.A., 2010. Rainfall effect on single-vehicle crash severities using polychotomous response models. Accid. Anal. Prev. 42 (1), 213-224.
McCullagh, P., 1980. Regression models for ordinal data. J. Roy. Stat. Soc. B 42 (2), 109-127.
McFadden, D., 1981. Econometric models of probabilistic choice. In: Manski, C., McFadden, D. (Eds.), Structural Analysis of Discrete Data with Econometric Applications. MIT Press, Cambridge, pp. 198-272.
McFadden, D., Train, K., 2000. Mixed MNL models for discrete response. J. Appl. Econom. 15 (5), 447-470.
Milton, J.C., Shankar, V.N., Mannering, F., 2008. Highway accident severities and the mixed logit model: an exploratory empirical analysis. Accid. Anal. Prev. 40 (1), 260-266.
Mujalli, R.O., López, G., Garach, L., 2016. Bayes classifiers for imbalanced traffic accidents datasets. Accid. Anal. Prev. 88, 37-51.
Peterson, B., Harrell Jr., F.E., 1990. Partial proportional odds models for ordinal response variables. Appl. Stat., 205-217.
Qin, X., Wang, K., Cutler, C.E., 2013a. Analysis of crash severity based on vehicle damage and occupant injuries. Transport. Res. Rec. 2386 (1), 95-102.
Qin, X., Wang, K., Cutler, C.E., 2013b. Logistic regression models of the safety of large trucks. Transport. Res. Rec. 2392 (1), 1-10.
Savolainen, P.T., Mannering, F., Lord, D., Quddus, M.A., 2011. The statistical analysis of highway crash-injury severities: a review and assessment of methodological alternatives. Accid. Anal. Prev. 43 (5), 1666-1676.
Savolainen, P., Mannering, F., 2007. Probabilistic models of motorcyclists' injury severities in single- and multi-vehicle crashes. Accid. Anal. Prev. 39 (5), 955-963.
Shankar, V., Mannering, F., Barfield, W., 1996. Statistical analysis of accident severity on rural freeways. Accid. Anal. Prev. 28 (3), 391-401.
Train, K., 2009. Discrete Choice Methods with Simulation, second ed. Cambridge University Press.
Wang, X.S., Abdel-Aty, M., 2008. Analysis of left-turn crash injury severity by conflicting pattern using partial proportional odds models. Accid. Anal. Prev. 40 (5), 1674-1682.
Washington, S.P., Karlaftis, M.G., Mannering, F., 2020. Statistical and Econometric Methods for Transportation Data Analysis, third ed. Chapman and Hall/CRC.
Williams, R., 2016. Understanding and interpreting generalized ordered logit models. J. Math. Sociol. 40 (1), 7-20. https://doi.org/10.1080/0022250X.2015.1112384.
Williams, R., 2006. Generalized ordered logit/partial proportional odds models for ordinal dependent variables. Stata J. 6 (1), 58-82.
Yamamoto, T., Hashiji, J., Shankar, V., 2008. Underreporting in traffic accident data, bias in parameters and the structure of injury severity models. Accid. Anal. Prev. 40, 1320-1329.
Yasmin, S., Eluru, N., 2013. Evaluating alternate discrete outcome frameworks for modeling crash injury severity. Accid. Anal. Prev. 59, 506-521.
CHAPTER 5

Exploratory analyses of safety data

5.1 Introduction
Exploratory data analyses focus on presenting a variety of techniques for performing initial investigations on data with the help of summary statistics and graphical representations. They are used to accomplish the following objectives:
1. Understanding the data, mapping their underlying structure, and identifying data issues such as errors and missing information,
2. Selecting the most important variables and identifying possible relationships, in terms of direction and magnitude, between independent and outcome variables,
3. Detecting outliers whose values are significantly different from the other observations in the dataset,
4. Testing hypotheses and developing associated confidence intervals or margins of error,
5. Examining underlying assumptions to know whether the data follow a specific distribution, and
6. Choosing a preliminary model that fits the data appropriately.
This chapter describes different methods and techniques for exploring safety data. The exploratory data analyses are conducted using two different types of techniques: (1) quantitative techniques that involve the calculation of summary statistics, and (2) graphical techniques that employ charts to summarize the data. Additionally, exploratory data analyses can be divided into univariate or multivariate (typically bivariate) methods.
Univariate methods look at one variable (independent or outcome variable) at a time, while multivariate methods look at two or more variables (several independent variables alone or with an outcome variable) simultaneously to explore relationships. It is always recommended to initially perform a univariate analysis for each variable in a multivariable dataset before performing a multivariate analysis. The first part of the chapter focuses on the quantitative techniques, while the second part summarizes the graphical techniques.
5.2 Quantitative techniques This section describes five different quantitative techniques.
5.2.1 Measures of central tendency
Safety datasets are usually large, with many variables, so it is always useful to represent the variables using summary statistics. Central tendency is the most common statistic used to describe the "average," "middle," or "most common" value of a variable. The mean, median, and mode are the measures used to describe central tendency. It is suggested to compute and analyze the mean, median, and mode for a given dataset simultaneously, as they elucidate different aspects of the data; considering any of them alone can lead to misrepresentations of the data due to outliers or extreme values.
5.2.1.1 Mean
The arithmetic mean, or simply the mean, is calculated by dividing the sum of all observations in the dataset by the total number of observations. The mean is significantly affected by outliers, that is, extremely large or small values. The mean is also called the mathematical expectation, or average. The sample mean is denoted by x̄ (pronounced "x-bar") and is calculated using the following equation:

x̄ = (1/n) Σ_{i=1}^{n} x_i        (5.1)

where n is the total number of observations in the sample and x_1, x_2, ..., x_n are the individual observations. As the sample mean changes from one sample to another, it is considered a random variable. If the whole population is used, then x̄ is replaced by the Greek symbol μ and is given by

μ = (1/N) Σ_{i=1}^{N} x_i        (5.2)
where N is the total number of observations in the population. The population mean is always fixed and is a nonrandom variable.
5.2.1.2 Median
The median is the value that divides the dataset or a probability distribution into two halves. Sorting the data is important when calculating the median of a variable; the observations can be sorted in ascending or descending order. If the total number of observations in a dataset is odd, the median is simply the number in the middle of the sorted list of observations. If the total number of observations is even, the median is the average of the two middle values. When the data contain outliers, the median is not affected, so it is considered more robust than the mean.
5.2.1.3 Mode
The observation that occurs most frequently in the dataset is called the mode. When two or more observations occur equally frequently, the dataset has more than one mode. Like the mean and median, the mode is a measure of central tendency; it represents the most probable outcome in the data sample. Unlike the mean and median, however, the mode can be applied to nonnumerical or qualitative data (i.e., data measured on the nominal and ordinal scales).
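A minimal base R sketch of the three measures on a small hypothetical sample (base R has no built-in mode function, so one is improvised from a frequency table):

    x <- c(2, 3, 3, 5, 9, 40)               # hypothetical crash counts; 40 is an outlier
    mean(x)                                 # arithmetic mean, pulled upward by the outlier
    median(x)                               # robust to the outlier
    as.numeric(names(which.max(table(x))))  # mode: the most frequent value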
5.2.2 Measures of variability The central tendency measures do not always provide adequate information related to the data. The information on the variability is required to understand the amount of spread or dispersion in the dataset. Low dispersion is indicated by the observations clustering tightly around the center. Alternatively, high dispersion means that the observations fall further away from the center. 5.2.2.1 Range Range is the simplest measure that is used to calculate the amount of variability. The range is defined as the difference between the largest and smallest observations in the dataset. Although the range is simple to calculate and easy to understand, it is highly susceptible to outliers because its value is based on only the two most extreme observations in the dataset. When dealing with traffic crashes, in many situations, the value of the largest observation is unusually high, which affects the entire range.
Additionally, the range is significantly affected by the sample size of the dataset. For small samples with no likely outliers, the range is particularly suitable because the other measures cannot be calculated reliably. However, the chance of outliers increases as the sample size becomes larger; consequently, if we draw multiple random samples of different sizes from the same population, the range tends to increase with the sample size.

5.2.2.2 Quartiles and interquartile range
Quartiles separate the dataset into four equal parts after it is sorted in ascending order. Quartiles use percentage points (or percentiles) to divide the data into quarters. A percentile is defined as the value below which a given percentage of observations in the sample lies. The lowest or first quartile (Q1) is the 25th percentile value, the median or middle quartile (Q2) is the 50th percentile value, and the upper or third quartile (Q3) is the 75th percentile value. For setting speed limits on highways, the 85th percentile speed is the commonly used measure; it means that 85% of the sampled driver speeds are lower than the 85th percentile speed. The interquartile range (IQR) is the middle half of the data and is used to understand the data spread. The IQR is calculated as the difference between Q3 and Q1 and covers the 50% of observations that fall between Q1 and Q3. For skewed distributions, the IQR and median are the robust measures of variability and central tendency, respectively. Like the median, the IQR is not influenced significantly by outliers because it does not consider the extreme values.

5.2.2.3 Variance, standard deviation, and standard error
The variance and standard deviation are the two most frequently used measures of dispersion. Unlike the range and IQR, the variance and standard deviation consider all the observations by comparing each one to the mean. The variance is calculated with one of two equations, depending on whether the sample variance or the variance of the entire population is of interest. As observing the whole population is rarely possible, the sample variance is commonly used as an estimate of the population variance. The sample variance is calculated using the following equation:

s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1} \qquad (5.3)

where x̄ is the sample mean and n is the total number of observations in the sample. Like the sample mean, the sample variance changes from one sample to another, so it is considered a random variable. If the
whole population is used, then s² is replaced by the Greek symbol σ², and the population variance is given by:

\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N} \qquad (5.4)

where μ is the population mean and N is the total number of observations in the population. Unlike the sample variance, the population variance is fixed; it is a nonrandom quantity. There are two reasons for using n − 1 in the denominator, instead of n, when calculating the sample variance. First, because the sample mean x̄ is used in the calculation, 1 degree of freedom is lost, leaving n − 1 independent observations for calculating the variance. Second, with a small sample the variance tends to be underestimated, and dividing by n − 1 instead of n compensates for this. With a large sample, the difference between dividing by n − 1 and by n becomes negligible.

The standard deviation is the square root of the variance and can be interpreted as a typical difference between an observation and the mean. The standard deviation is small when the data points are grouped closely together and larger when they are spread out. The sample and population standard deviations are calculated using the following equations, respectively:

s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}} \qquad (5.5)

\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}} \qquad (5.6)

The standard deviation is more widely used than the variance because its units are the same as the original units of the data, which makes interpretation easier.

The standard error is often confused with the standard deviation. The standard deviation measures how scattered the observations are in the dataset. The standard error of an estimate, in contrast, is the standard deviation of its sampling distribution. For example, the mean calculated from different samples varies from one sample to another; this variation is described by a distribution called the "sampling distribution" of the mean, and the standard error estimates how much the sample means vary, that is, the standard deviation of this sampling distribution. The standard error of the sample mean is calculated with the following equation:

SE = \frac{s}{\sqrt{n}} \qquad (5.7)
where s is the sample standard deviation and n is the total number of observations in the sample. As the sample size increases, the standard error decreases, while the standard deviation tends to remain unchanged.

5.2.2.4 Coefficient of variation
The standard deviation measures variability without considering the magnitude of the variable's values. The coefficient of variation (CV), also called the relative standard deviation, is a unitless measure of relative variability that expresses the dispersion of observations in a dataset around the mean. The CV for a sample is calculated using the following equation:

CV = \frac{s}{\bar{x}} \qquad (5.8)

where x̄ is the sample mean. The CV is often expressed as a percentage and, because it has no units, is useful for comparing the degree of variability from one sample to another. For example, to compare the variation in traffic crashes between two facility types, the CV shows which facility type has more variation.
Exercise 5.1
The following are the crashes that occurred on 30 segments selected from the Roadway Segment Dataset. Provide the summary statistics for crashes.
1, 3, 0, 0, 6, 2, 0, 1, 4, 0, 1, 3, 16, 0, 1, 0, 2, 1, 1, 3, 2, 8, 5, 2, 3, 2, 0, 1, 0, 4

Statistic | Value
Mean | 2.4
Median | 1.5
Mode(s) | 0.0
Range | 16.0
25th percentile | 0.0
75th percentile | 3.0
IQR | 3.0
Variance | 10.5
Standard deviation | 3.2
CV | 1.3
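These values can be reproduced in a few lines of base R. The following is a quick sketch using the Exercise 5.1 data; note that base R has no built-in mode function, and quantile type 6 is requested so that the percentiles match the convention used in the table above.

crashes <- c(1, 3, 0, 0, 6, 2, 0, 1, 4, 0, 1, 3, 16, 0, 1, 0, 2, 1, 1, 3,
             2, 8, 5, 2, 3, 2, 0, 1, 0, 4)

mean(crashes)                                  # 2.4
median(crashes)                                # 1.5
as.numeric(names(which.max(table(crashes))))   # mode: 0
diff(range(crashes))                           # range: 16
quantile(crashes, c(0.25, 0.75), type = 6)     # 25th and 75th percentiles: 0 and 3
IQR(crashes, type = 6)                         # interquartile range: 3
var(crashes)                                   # sample variance, Eq. (5.3): 10.5
sd(crashes)                                    # sample standard deviation, Eq. (5.5): 3.2
sd(crashes) / sqrt(length(crashes))            # standard error of the mean, Eq. (5.7)
sd(crashes) / mean(crashes)                    # coefficient of variation, Eq. (5.8): 1.3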
5.2.2.5 Symmetrical and asymmetrical data
The data follow a symmetrical distribution when the observations occur at regular frequencies and all the measures of central tendency (i.e., mean, median, and mode) coincide. Fig. 5.1a shows a symmetrical distribution in the shape of a bell curve: if a line is drawn through the middle of the curve, the left side of the distribution mirrors the right side. Examples of data that are symmetrically distributed around their mean include free-flow speeds of vehicles (Berry and Belmont, 1951) and the logarithm of crash rates (Ma et al., 2015). For a symmetrical distribution with a large sample size, it is recommended to express the results in terms of the mean and standard deviation.

Asymmetrical distributions, also known as skewed distributions, can be either right-skewed or left-skewed and do not have the same value for the mean, median, and mode. A right-skewed or positively skewed distribution has a longer tail on the right, and the mean lies to the right of the mode (see Fig. 5.1b). Examples of data that are usually asymmetrically distributed around their mean and follow a positively skewed distribution include crash rate data (Ma et al., 2015), crash frequency data (Miaou, 1994), and travel time data (Berry and Belmont, 1951). A left-skewed or negatively skewed distribution has a longer tail on the left, and the mean lies to the left of the mode (see Fig. 5.1c); this shape is rarely found in traffic safety datasets. For asymmetrical distributions, or when the sample size is small, it is recommended to express the results in terms of the median and IQR.

5.2.2.6 Skewness
Skewness measures the degree of asymmetry of a distribution. In other words, it quantifies the degree of distortion from the normal distribution and differentiates extreme values in one tail from those in the other. A symmetrical distribution (such as the normal distribution) has a skewness of 0. The sample skewness can be calculated using the following equation:

g_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^3 / n}{s^3} \qquad (5.9)

where s is the sample standard deviation. The data are commonly characterized as follows:
• Symmetrical, if −0.5 ≤ g1 < +0.5
• Moderately negatively skewed, if −1.0 ≤ g1 < −0.5
• Moderately positively skewed, if +0.5 ≤ g1 < +1.0
• Highly negatively skewed, if g1 < −1.0
• Highly positively skewed, if g1 ≥ +1.0
FIGURE 5.1 Symmetrical and skewed distributions: (a) symmetrical; (b) positively skewed; (c) negatively skewed.
5.2.2.7 Kurtosis
Kurtosis measures the sharpness of the peak of a frequency distribution. The sample (excess) kurtosis can be calculated using the following equation:

g_2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^4 / n}{s^4} - 3 \qquad (5.10)

A positive kurtosis value suggests heavy tails, whereas a negative value indicates light tails. Tail heaviness or lightness is judged relative to the normal distribution and indicates whether the data distribution is flatter or more peaked than the normal. The unadjusted kurtosis of a standard normal distribution is 3.0; subtracting 3 in Eq. (5.10) therefore gives the normal distribution an excess kurtosis of 0. Kurtosis can be categorized into three types, as shown in Fig. 5.2. If the kurtosis statistic of a distribution is similar to that of the normal distribution, or bell curve, the distribution is called mesokurtic and has extreme-value characteristics similar to those of a normal distribution. If the kurtosis is greater than that of a mesokurtic distribution, the distribution is called leptokurtic. A leptokurtic distribution has long tails (due to the presence of many outliers); the outliers stretch the horizontal axis, and much of the data fall within a narrow central peak. The final type is the platykurtic distribution, whose kurtosis is smaller than that of a mesokurtic distribution. Platykurtic distributions have short tails (due to the presence of fewer outliers) and fewer extreme values than the normal distribution.
FIGURE 5.2 Kurtosis in the normal curve.
Exercise 5.2
Using the data presented in Exercise 5.1, calculate the skewness and kurtosis.

g_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^3 / n}{s^3} = \frac{87.808}{3.233^3} = 2.6

g_2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^4 / n}{s^4} - 3 = \frac{1190.45}{3.233^4} - 3 = 7.89

(The unrounded standard deviation, s = 3.233, is used in both calculations.) As the skewness is 2.6, the distribution is highly positively skewed, similar to the one shown in Fig. 5.1b. The mean, median, and mode statistics also confirm that the distribution is positively skewed. The kurtosis value of 7.89 indicates a relatively "skinny" (leptokurtic) distribution with longer tails, due to the presence of outliers. In this case, the site with 16 crashes is a potential outlier.
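The skewness and kurtosis figures can be checked with a short base R sketch that applies the moment definitions in Eqs. (5.9) and (5.10) directly:

crashes <- c(1, 3, 0, 0, 6, 2, 0, 1, 4, 0, 1, 3, 16, 0, 1, 0, 2, 1, 1, 3,
             2, 8, 5, 2, 3, 2, 0, 1, 0, 4)
s  <- sd(crashes)                                   # unrounded s = 3.233
g1 <- mean((crashes - mean(crashes))^3) / s^3       # skewness, Eq. (5.9): ~2.6
g2 <- mean((crashes - mean(crashes))^4) / s^4 - 3   # excess kurtosis, Eq. (5.10): ~7.9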
5.2.3 Measures of association
A measure of association quantifies the relationship between two variables. Correlation and regression analysis are among the several methods used to quantify association. Correlation between two variables refers specifically to a measure of their linear relationship, whereas association refers to any relationship between the variables. The choice of method for determining the strength of an association depends on the characteristics of the data: data can be observed on an interval/ratio (continuous) scale, an ordinal/rank (integer) scale, or a nominal/categorical (qualitative) scale.

5.2.3.1 Pearson's correlation coefficient
When the association between two variables measured on an interval/ratio (continuous) scale is sought, the appropriate measure of association is Pearson's correlation coefficient. Pearson's correlation coefficient r for a sample is defined by the following equation:

r = \frac{\mathrm{COV}(X, Y)}{s_x s_y} \qquad (5.11)

with

\mathrm{COV}(X, Y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1} \qquad (5.12)
where COV(X, Y) is the sample covariance between two random variables X and Y that are normally distributed with means x̄ and ȳ and standard deviations s_x and s_y, respectively. To calculate the population correlation coefficient, the sample means x̄ and ȳ are replaced by the population means μ_x and μ_y, and the sample standard deviations s_x and s_y are replaced by the population standard deviations σ_x and σ_y, respectively.

5.2.3.2 Spearman rank-order correlation coefficient
The Spearman rank-order correlation coefficient, a nonparametric method, is used to measure the strength of association between two variables when one or both are measured on an ordinal/ranked (integer) scale, or when the variables are not normally distributed. If one of the variables is on an interval scale, it must be transformed to a rank scale before the Spearman rank-order correlation coefficient can be applied, although this may result in a loss of information. Once the two variables are ranked and sorted in ascending order, the Spearman correlation coefficient r_s is calculated using the following equation:

r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \qquad (5.13)

where d_i (d_1, d_2, ..., d_n) are the differences in the ranks of the two variables x_i (x_1, x_2, ..., x_n) and y_i (y_1, y_2, ..., y_n).

The correlation coefficient takes values from −1.0 to +1.0. A value of −1.0 indicates a perfect negative linear relationship between the two variables: as one variable increases, the other decreases. Similarly, a value of +1.0 indicates a perfect positive linear relationship: as one variable increases, the other increases as well. A value of 0.0 indicates no linear relationship. Any value between −1.0 and 0.0, or between 0.0 and +1.0, indicates a negative or positive linear relationship, respectively, that does not follow an exact straight line. Hinkle et al. (2003) provided a rule of thumb for interpreting the correlation coefficient, as shown in Table 5.1.
TABLE 5.1 Interpreting the correlation coefficient (Hinkle et al., 2003).

Correlation coefficient* | Interpretation
+0.9 to +1.0 (−0.9 to −1.0) | Very high correlation
+0.7 to +0.9 (−0.7 to −0.9) | High correlation
+0.5 to +0.7 (−0.5 to −0.7) | Moderate correlation
+0.3 to +0.5 (−0.3 to −0.5) | Low correlation
−0.3 to +0.3 | Negligible correlation

* "+" means positive correlation and "−" means negative correlation.
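Both coefficients are available through the cor() function in base R. The following minimal sketch uses made-up AADT and crash-count values purely for illustration:

aadt    <- c(5200, 8100, 12300, 15800, 20100, 24500)   # hypothetical values
crashes <- c(2, 3, 5, 4, 8, 11)                        # hypothetical values

cor(aadt, crashes, method = "pearson")    # Pearson's r, Eq. (5.11)
cor(aadt, crashes, method = "spearman")   # Spearman's r_s, Eq. (5.13)
cor.test(aadt, crashes)                   # adds a significance test for r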
5.2.3.3 Chi-square test for independence
The chi-square test for independence is commonly used for testing the relationship between two sets of data measured on a categorical scale. The test measures the significance of the relationship, not the strength of the association. Before conducting the chi-square test, the data must be arranged in a contingency table (a matrix that shows the frequency distribution of the variables). The rows represent the bins or categories; the columns represent the frequencies of the two variables of interest (one represented as "observed" and the other as "expected"). A two-way table is similar to a frequency distribution except that the two variables' frequencies (observed and expected values) are shown simultaneously. The chi-square statistic χ² is calculated using the following equation:

\chi^2_{df} = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \qquad (5.14)
where O_i and E_i are the observed and expected frequencies in the ith category (i = 1, 2, ..., k). The term df denotes the degrees of freedom, equal to k − 1. The calculated chi-square value is compared to the critical value from a chi-square table for a chosen significance level α (e.g., α = 0.05). If the calculated chi-square value is greater than the critical value, it can be concluded that there is a significant difference between the observed and expected frequencies. The chi-square statistic is extremely sensitive to the sample size within each category: if a category has an expected frequency of fewer than 5, it should be combined with an adjacent category. The chi-square test for independence should not be confused with the chi-square goodness-of-fit test, although the formula is the same in both cases. The test for independence examines the association between two sets of data, whereas the goodness-of-fit test examines whether a data sample follows a certain distribution. It should be noted that the chi-square test can only be used for discrete distributions (e.g., the Poisson and binomial distributions). For continuous distributions (e.g., the normal and uniform distributions), other tests, such as the Kolmogorov–Smirnov goodness-of-fit test, should be used.
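In base R, Eq. (5.14) can be applied directly, or chisq.test() can be used; the category counts below are made up for illustration:

observed <- c(45, 30, 15, 10)    # hypothetical observed frequencies
expected <- c(40, 32, 18, 10)    # hypothetical expected frequencies

chi2 <- sum((observed - expected)^2 / expected)   # Eq. (5.14)
chi2 > qchisq(0.95, df = length(observed) - 1)    # compare with the critical value

# For a test of independence between two categorical variables,
# pass a contingency table (matrix) directly:
chisq.test(matrix(c(30, 10, 20, 40), nrow = 2))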
5.2.3.4 Relative risk and odds ratio
Relative risk and odds ratio are two other measures used to test the association between categorical variables. These two measures quantify the strength of the association but do not provide the significance of the relationship. To calculate the relative risk or odds ratio, the data must be arranged in a two-way contingency table (a two-by-two matrix that shows the frequency distribution of the variables in two groups for two outcomes). The following example of a two-way contingency table shows the frequency of two mutually exclusive outcomes (e.g., fatal vs. nonfatal crashes) for each of two groups (e.g., cars with airbags vs. cars without airbags), which could be used to understand the role of airbags in saving lives in a collision.

Group | Outcome 1 | Outcome 2
Treatment | A | B
Control | C | D
The relative risk (also known as the risk ratio) is used to evaluate the risk (or probability) of an outcome in one group relative to the risk of the same outcome in the other group. A relative risk of 1.0 indicates no difference in risk between the groups, whereas a relative risk other than 1.0 indicates that there is a difference between the groups. The relative risk RR is calculated using the following equation:

RR = \frac{A / (A + B)}{C / (C + D)} \qquad (5.15)
The odds ratio is used to evaluate the odds of an outcome in one group relative to the odds of the same outcome in the other group. As with the relative risk, an odds ratio of 1.0 indicates no difference in risk between the groups, whereas an odds ratio other than 1.0 indicates that there is a difference between the groups. The odds ratio OR is calculated using the following equation:

OR = \frac{A / B}{C / D} = \frac{AD}{BC} \qquad (5.16)
For rare events (such as traffic crashes), where the chance of occurrence is very low, the odds ratio provides a close approximation of the relative risk.
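Eqs. (5.15) and (5.16) are simple to compute; the sketch below uses made-up counts and also illustrates the rare-event approximation of the relative risk by the odds ratio:

A <- 20; B <- 480    # treatment group: outcome 1 / outcome 2 (hypothetical)
C <- 40; D <- 460    # control group:   outcome 1 / outcome 2 (hypothetical)

RR <- (A / (A + B)) / (C / (C + D))   # relative risk, Eq. (5.15)
OR <- (A * D) / (B * C)               # odds ratio, Eq. (5.16)
c(RR = RR, OR = OR)                   # with rare outcomes, OR is close to RR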
The cumulative distribution function of the GP distribution is

H(x) = \begin{cases} 1 - \left(1 + \dfrac{\xi (x - \mu)}{\sigma}\right)^{-1/\xi}, & \xi \neq 0 \\ 1 - \exp\left(-\dfrac{x - \mu}{\sigma}\right), & \xi = 0 \end{cases}

where ξ is the shape parameter, σ is the scale parameter, and x − μ is an exceedance in the range 0 ≤ x − μ < ∞ if ξ ≥ 0, or μ ≤ x < μ − σ/ξ if ξ < 0. If ξ is zero, the GP distribution applies to tails that decrease exponentially, such as the exponential distribution; if ξ is positive, the GP distribution applies to tails that decrease as a polynomial, such as Student's t (heavy tail); if ξ is negative, the GP distribution applies to finite tails, such as the beta distribution (light tail).

Just as the block size determines the sample size in the BM approach, the threshold value determines the sample size in the POT approach. An optimal threshold should balance the need to select genuinely extreme values against the need for a sufficient sample size: a high threshold produces fewer observations, which may have a large variance, whereas a low threshold may treat observations with ordinary values as extremes and compromise the asymptotic distribution of extreme values. The typical approaches for determining an optimal threshold value are the "mean residual life plot" and the "assessment of parameter stability" (Coles, 2001).

Ideally, the EVM is applied to a specific conflict point, or to a conflict line, at a specific location (e.g., an intersection or segment). The sample size may not be adequate when the observation time is relatively short. A common approach to fix this issue is to pool data from other conflict points/lines or from other locations to increase the sample size. However,
this approach leads to the issue of nonstationarity: the parameters of the EV distribution may not be the same across the pooled sources. A viable solution to the nonstationarity issue is to allow the parameters of the EV distribution to vary. Songchitruksa and Tarko (2006) and Zhang et al. (2014a,b) defined the location parameter μ in the GEV as a function of traffic volume, μ = x′β, where x is a set of covariates that characterize the changes in μ and β is the vector of regression coefficients.

The issue of serial dependency, or serial correlation, arises when extreme observations are correlated with their prior values because they are drawn sequentially. This may be more problematic for the BM method than for the POT method because extreme values are drawn sequentially from the blocks. In the case of a lane-change event, the observed PETs may depend on prior lane-change maneuvers. The standard method for detecting serial dependency is to plot observations chronologically or order them over space; serially correlated observations will reveal a trend over time, with peaks and valleys that typically repeat over fixed intervals. Since the mid-1990s, several methods have been discussed for declustering a series of extremes to extract a set of independent extremes. An additional method for handling clustered extremes is to delete the neighboring observations on both sides of a local maximum.
11.6.3 Block maxima or peak over threshold
Zhang et al. (2014a,b) conducted a study of freeway lane-change maneuver crashes to compare the BM and POT approaches when modeling PET. The authors found that when the observation time period is relatively short, the POT approach is superior to the BM approach in terms of data utilization, estimate accuracy, and estimation reliability. The sample size equals the observation time period divided by the block size; therefore, when the observation time is short, the block interval also needs to be short to ensure an adequate sample size. A short interval reduces the number of observations within each block and arbitrarily elevates some ordinary observations to extremes. Hence, models estimated from a mixture of extreme and ordinary observations may be biased and ineffective. The POT approach, on the other hand, makes full use of the possible extremes, provided that the threshold is properly set. In their study, the observation time for most of the freeway segments was approximately 3 h and the interval was 5 min, resulting in a number of blocks ranging from 21 to 41.

Another way to compare model performance is through the shape parameter. According to Smith (1985), the maximum likelihood estimators are not likely to be attainable if the shape parameter ξ < −1.0, whereas the estimator possesses the usual asymptotic properties, and is thus more reliable, if ξ > −0.5 (note: the theory of asymptotics and limit laws is essential for formulating the distributions of extremes). In Zhang's study, two out of 29 freeway segments yielded a shape parameter
greater than −0.5 in the BM approach, whereas the POT approach produced eight. Model parameters can be estimated with the maximum likelihood estimation method in the R packages "extRemes" or "evd: Functions for Extreme Value Distributions." Details on the statistical properties of the GEV and GP distributions can be found in Coles (2001). The safety literature on EVT has reported considerably different performance outcomes for the BM and POT approaches, and the statistical community continues to discuss the merits and appropriate circumstances of each. Readers are encouraged to refer to (Zhang et al., 2014a,b; Tarko, 2012; Farah, 2017) for more technical details and to run their own comparative studies based on their data and study design.
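As a minimal illustration, the sketch below fits both models with the extRemes package. The vector neg_ttc (negated TTC observations), the block identifier block_id, the 5-min block size, and the −1.0 threshold are hypothetical choices made for the example, not recommendations:

library(extRemes)

block_maxima <- tapply(neg_ttc, block_id, max)            # maxima of 5-min blocks
fit_bm  <- fevd(block_maxima, type = "GEV")               # block maxima model
fit_pot <- fevd(neg_ttc, threshold = -1.0, type = "GP")   # peak over threshold model

summary(fit_bm)    # location, scale, and shape estimates with standard errors
plot(fit_pot)      # diagnostic plots, including return levels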
Example 11.1
A driving simulator experiment was set up to estimate the probability of head-on collisions during passing maneuvers (Farah et al., 2017). The driving scene was projected onto a screen in front of the driver, and the image was updated at a rate of 30 frames per second. The subject vehicle overtakes a front vehicle while another vehicle approaches from the opposite direction. The simulator experiment produced 1287 completed passing maneuvers, including nine collisions. The minimum TTC is measured at the end of each passing maneuver. Use the BM and POT approaches to estimate the probability of a head-on collision and its confidence interval.

First, in the BM approach, a GEV distribution is fitted using the noncrash passing maneuvers and the respective minimum TTC measurements, with the block intervals defined as the entire passing maneuvers. Fig. 11.7 (left) shows the CDF of the minimum TTC for the full dataset, and Fig. 11.7 (right) shows the CDF of the minimum TTC for the filtered data (minimum TTC less than or equal to 1.5 s).
FIGURE 11.7 CDF of the minimum TTC. From H. Farah, C. Lima Azevedo, Safety analysis of passing maneuvers using extreme value theory, IATSS Res. 41 (2017) 12–21.
Example 11.1 (cont'd)
A smaller TTC value means a higher risk of head-on collision. For the maxima model, the negated TTC is used, because a larger value of −TTC indicates a higher risk of head-on collision. In the stationary block maxima model for the maxima of the negated TTC values (i.e., maximum(−TTC)), a head-on collision corresponds to (−TTC) ≥ 0. The driving simulator experiment used a 1.5-s filter and recorded 463 near head-on collisions and nine actual collisions. The empirical probability of a head-on collision, given the presence of a near head-on collision event during passing, is therefore 9/(463 + 9) = 0.0191, with a 95% confidence interval of (0.0088, 0.0359). Finally, the fitted GEV CDF has the following parameters:

\mu = -0.993\ (0.012); \quad \sigma = 0.383\ (0.0163); \quad \xi = -0.236\ (0.05)

where the values in parentheses are the standard errors. Fig. 11.8 presents the kernel density functions of the empirical and modeled negated TTC. According to the stationary model for (−TTC), the estimated probability of a head-on collision given an observed passing maneuver is 0.0179, with a 95% confidence interval of (0.0177, 0.0182) (Fig. 11.8, left).
FIGURE 11.8 Kernel density plots for the BM and POT models (Farah et al., 2017).
In the POT approach, the optimal threshold value can be determined through statistical methods such as mean residual life and parameter stability plots (Coles, 2001). Different stationary models were fitted using four threshold values for (−TTC), corresponding to TTC values of 1.5, 1.0, 0.5, and 0.25 s,
respectively. Among these, the estimated probability of a head-on collision closest to the empirical data, 0.00628 with a 95% confidence interval of (0.00612, 0.00643), was obtained with the 0.25-s threshold (Fig. 11.8, right). Nonstationary models were also tested: in the POT method, several covariates (e.g., speed, passing rate, curvature) were included in the scale parameter formulation (σ); in the BM method, covariates were included in the location parameter formulation (μ). All tests and validations led to the conclusion that the BM approach yields more stable results than POT; the predicted probability of head-on collisions based on the BM approach was very close to the probability of head-on collisions based on the empirical data.
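The empirical probability and confidence interval quoted in the example can be verified with an exact binomial test in base R:

binom.test(x = 9, n = 463 + 9)$estimate   # 9/472 = 0.0191
binom.test(x = 9, n = 463 + 9)$conf.int   # approximately (0.0088, 0.0359)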
11.7 Safety surrogate measures from traffic microsimulation models
Traffic microsimulation models, which replicate driver and vehicle behavior, have been used extensively to mimic vehicle movements and interactions in a traffic stream. The performance, accuracy, and reliability of contemporary traffic microsimulation models have improved significantly through the calibration of driver-behavior parameters such as car following, lane changing, and gap acceptance. The use of microscopic traffic simulation drastically increases the efficiency, and reduces the cost, of collecting surrogate measures. The obvious shortcoming of microsimulation models, however, is that they are incapable of "producing" collisions. This limitation casts doubt on whether surrogate measures taken from simulated traffic events can be a reliable predictor of crash frequency. Nevertheless, simulation models are ideal tools for comparing highway design and traffic operational alternatives before their implementation.

Because each vehicle in a computer simulation can be traced through its trajectory, its location, speed, and acceleration or deceleration are recorded on a second-by-second basis. The detailed, timestamped vehicle positions allow researchers to measure and estimate the spatial and temporal proximity between vehicles. The FHWA sponsored the development of the Surrogate Safety Assessment Model (SSAM), which derives surrogate measures from traffic simulation packages (https://highways.dot.gov/safety/ssam/surrogate-safety-assessment-model-overview) (Gettman et al., 2008; Pu et al., 2008). SSAM is a software application developed to automatically
identify, classify, and evaluate traffic conflicts in the vehicle trajectory data output by microscopic traffic simulation models. The model was built on the outputs of existing traffic simulation models such as PTV Vissim (a microscopic multimodal traffic flow simulation software package developed by PTV AG in Karlsruhe, Germany) and has built-in statistical analysis features for conflict frequency and severity measures that can aid analysts in designing safe traffic facilities. The SSAM approach has been assessed by comparing different surrogate safety measures, and it has been validated through field studies that compared its output to real-world crash records (Fan et al., 2013; Huang et al., 2013).

It should be noted that over the last few years, microsimulation, despite the limitation described earlier, has been used to estimate the safety effects of connected and autonomous vehicles (CAVs). As the deployment of CAVs is very limited, there are not enough crash data to properly evaluate their safety performance. Simulation, in this context, can be used to examine different penetration rates (say, 10%–100%) and how CAVs interact (i.e., traffic conflicts) with human-driven vehicles (Jeong et al., 2017; Mousavi et al., 2019, 2020). Input variables, such as headways and reaction times, need to be adjusted to properly simulate CAVs on (usually urban) transportation networks. The latest version of Vissim (2020) has a module that can be used specifically for CAVs.
11.8 Safety surrogate measures from video and emerging data sources
Measuring the spatial and temporal proximity between vehicles requires that vehicles be traced and their trajectory information extracted. As discussed in Chapter 2 (Fundamentals and Data Collection), on-site video cameras can be used to record vehicle trajectories, while computer vision techniques allow moving objects to be tracked and traffic conflicts to be detected from video. Microsoft Corp., the City of Bellevue, WA, and the University of Washington (UW) jointly developed a software application that uses a city's existing traffic cameras to count near-miss collisions and classify road users by turning movement (through, left, or right), direction of approach (northbound, southbound, eastbound, westbound), and mode (car, bus, truck, motorcycle, bicycle, and pedestrian). Additionally, speed and acceleration rates can be calculated continuously from the vehicle trajectories to better understand drivers' steering and braking behaviors (Loewenherz et al., 2017). The
video analytics technologies will play an increasingly important role in active safety management as video surveillance becomes more prevalent.

Vehicle trajectory information can also be gathered from in-vehicle longitudinal data, or GPS. Naturalistic driving study (NDS) data (Guo et al., 2010; Wu and Jovanis, 2012) and connected vehicle data (Liu and Khattak, 2016; He et al., 2018) are two emerging data sources that provide a great opportunity to better understand collision mechanisms and to develop novel safety metrics. The SHRP 2 Naturalistic Driving Study (SHRP 2 NDS) is the largest coordinated safety program ever undertaken in the United States. The SHRP 2 program consists of the NDS data and a companion Roadway Information Database (RID). The NDS data were collected from more than 3500 volunteer passenger-vehicle drivers aged 16–98 during a 3-year period (2010–12), with most drivers participating for one to two years. Additional details about the program background and the database can be found at https://insight.shrp2nds.us/.

The Safety Pilot Model Deployment (SPMD) program, a comprehensive data collection effort conducted under real-world conditions in Ann Arbor, Michigan, covered over 73 lane-miles and included approximately 3000 pieces of onboard vehicle equipment and 30 pieces of roadside equipment. In the SPMD program, the basic safety messages (BSMs) communicated between connected vehicles transmit in-vehicle data (e.g., vehicle size, position, speed, heading, acceleration, brake system status) approximately 10 times per second and can be an important supplement to a traditional crash-data-oriented safety analysis. SPMD data are text-based, accompanied by a downloadable data dictionary and metadata document, and are available through the research data exchange (www.its-rde.net).

Traffic conflicts have become the most prominent and promising proactive safety measure for identifying safety concerns and evaluating the effectiveness of safety treatments in the absence of crash data. Researchers have put significant effort into developing techniques and metrics that properly record traffic conflicts. This section has discussed both evasive action-based and proximity-based surrogate measures. As emerging data sources, such as BSMs from V2V and V2I technologies, and modern video image processing technologies become more available and pervasive, large-scale, continuous, and automatic traffic conflict observations will become more readily available. Rich data sources will accelerate advances in surrogate safety measures by including new and more complex crash types and by extending and calibrating the extreme value models.
References

AASHTO, 2010. Highway Safety Manual, first ed. American Association of State Highway and Transportation Officials, Washington, DC.
Allen, B.L., Shin, B.T., Cooper, P.J., 1978. Analysis of traffic conflicts and collisions. Transp. Res. Rec. 667, 67–74.
Almqvist, S., Hydén, C., Risser, R., 1991. Use of Speed Limiters in Cars for Increased Safety and a Better Environment. Transportation Research Record 1318, Transportation Research Board, Washington, DC, pp. 34–39.
Amundsen, F., Hydén, C., 1977. Proceedings of the 1st Workshop on Traffic Conflicts. Oslo, Norway.
Chin, H.-C., Quek, S.-T., 1997. Measurement of traffic conflicts. Saf. Sci. 26 (3), 169–185.
Coles, S., 2001. An Introduction to Statistical Modeling of Extreme Values. Springer-Verlag, London, UK.
Fan, R., Yu, H., Liu, P., Wang, W., 2013. Using VISSIM simulation model and Surrogate Safety Assessment Model for estimating field measured traffic conflicts at freeway merge areas. IET Intell. Transp. Syst. 7 (1), 68–77.
Farah, H., Lima Azevedo, C., 2017. Safety analysis of passing maneuvers using extreme value theory. IATSS Res. 41, 12–21.
Fisher, R.A., Tippett, L.H.C., 1928. Limiting forms of the frequency distribution of the largest or smallest member of a sample. Math. Proc. Camb. Philos. Soc. 24 (2), 180–190.
Gettman, D., Head, L., 2003. Surrogate safety measures from traffic simulation models. Transp. Res. Rec. 1840 (1), 104–115.
Gettman, D., Pu, L., Sayed, T., Shelby, S., Siemens ITS, 2008. Surrogate Safety Assessment Model and Validation (No. FHWA-HRT-08-051). Federal Highway Administration, Office of Safety Research and Development.
Glauz, W.D., Migletz, D.J., 1980. Report 219: Application of Traffic Conflict Analysis at Intersections. TRB, National Research Council, Washington, DC.
Guo, F., Klauer, S.G., Hankey, J.M., Dingus, T.A., 2010. Near crashes as crash surrogate for naturalistic driving studies. Transp. Res. Rec. 2147 (1), 66–74.
Hayward, J.C., 1971. Near Misses as a Measure of Safety at Urban Intersection. The Pennsylvania State University.
He, Z.X., Qin, X., Liu, P., Sayed, M.A., 2018. Assessing surrogate safety measures using a Safety Pilot Model Deployment dataset. Transp. Res. Rec. https://doi.org/10.1177/0361198118790861.
Hernandez, V., 1982. A Microscopic Digital Computer Simulation of Traffic Conflicts at 4-Way Fixed Time Signalized Urban Intersections. Ph.D. Dissertation. University of California, Los Angeles.
Huang, F., Liu, P., Yu, H., Wang, W., 2013. Identifying if VISSIM simulation model and SSAM provide reasonable estimates for field measured traffic conflicts at signalized intersections. Accid. Anal. Prev. 50, 1014–1024.
Hydén, C., 1987. The Development of a Method for Traffic Safety Evaluation: The Swedish Traffic Conflicts Technique. Lund Institute of Technology, Lund.
Ismail, K., Sayed, T., Saunier, N., 2011. Methodologies for aggregating indicators of traffic conflict. Transp. Res. Rec. 2237, 10–19.
Jeong, E., Oh, C., Lee, S., 2017. Is vehicle automation enough to prevent crashes? Role of traffic operations in automated driving environments for traffic safety. Accid. Anal. Prev., 115–124.
Johnsson, C., Laureshyn, A., De Ceunynck, T., 2018. In search of surrogate safety indicators for vulnerable road users: a review of surrogate safety indicators. Transp. Rev. 38 (6), 765–785. https://doi.org/10.1080/01441647.2018.1442888.
Laureshyn, A., Svensson, Å., Hydén, C., 2010. Evaluation of traffic safety, based on micro-level behavioural data: theoretical framework and first implementation. Accid. Anal. Prev. 42, 1637–1646. https://doi.org/10.1016/j.aap.2010.03.021.
Liu, J., Khattak, A.J., 2016. Delivering improved alerts, warnings, and control assistance using basic safety messages transmitted between connected vehicles. Transp. Res. C Emerg. Technol. 68, 83–100.
Loewenherz, F., Bahl, V., Wang, Y., March 2017. Video analytics towards vision zero. ITE J., 25–28.
Mahmud, S.M.S., Ferreira, L., Hoque, S., Tavassoli, A., 2017. Application of proximal surrogate indicators for safety evaluation: a review of recent developments and research needs. IATSS Res. 41, 153–163.
Minderhoud, M.M., Bovy, P.H., 2001. Extended time-to-collision measures for road traffic safety assessment. Accid. Anal. Prev. 33, 89–97. https://doi.org/10.1016/S0001-4575(00)00019-1.
Mousavi, S.M., Osman, O.A., Lord, D., 2019. Impact of urban arterial traffic LOS on the vehicle density of different lanes of the arterial in proximity of an unsignalized intersection for autonomous vehicle vs. conventional vehicle environments. In: International Conference on Transportation and Development, Alexandria, VA.
Mousavi, S.M., Lord, D., Mousavi, S.R., Shirinzad, M., 2020. Safety performance of autonomous vehicles on an urban arterial in proximity of a driveway. In: Paper Presented at the 99th Annual Meeting of the Transportation Research Board, Washington, DC.
Ozbay, K., Yang, H., Bartin, B., Mudigonda, S., 2008. Derivation and validation of new simulation-based surrogate safety measure. Transp. Res. Rec. 2083, 105–113. https://doi.org/10.3141/2083-12.
Parker Jr., M.R., Zegeer, C.V., 1989. Traffic Conflict Techniques for Safety and Operations: Observers Manual (No. FHWA-IP-88-027, NCP 3A9C0093). Federal Highway Administration, United States.
Perkins, S.R., Harris, J.I., December 7, 1967. Traffic Conflict Characteristics: Accident Potential at Intersections. General Motors Research Publication GMR-718.
Pu, L., Joshi, R., 2008. Surrogate Safety Assessment Model (SSAM): Software User Manual (No. FHWA-HRT-08-050). Turner-Fairbank Highway Research Center.
Saunier, N., Sayed, T., 2008. Probabilistic framework for automated analysis of exposure to road collisions. Transp. Res. Rec. 2083, 96–104.
Smith, R.L., 1985. Maximum likelihood estimation in a class of non-regular cases. Biometrika 72, 67–90.
Songchitruksa, P., Tarko, A.P., 2006. The extreme value theory approach to safety estimation. Accid. Anal. Prev. 38, 811–822.
Svensson, Å., 1998. A Method for Analysing the Traffic Process in a Safety Perspective. Department of Traffic Planning and Engineering, Lund Institute of Technology, Lund, Sweden, p. 174.
Tarko, A.P., 2012. Use of crash surrogates and exceedance statistics to estimate road safety. Accid. Anal. Prev. 45, 230–240. https://doi.org/10.1016/j.aap.2011.07.008.
Wu, K.F., Jovanis, P.P., 2012. Crashes and crash-surrogate events: exploratory modeling with naturalistic driving data. Accid. Anal. Prev. 45, 507–516.
Zheng, L., Sayed, T., 2019. From univariate to bivariate extreme value models: approaches to integrate traffic conflict indicators for crash estimation. Transp. Res. C Emerg. Technol. 103, 211–225.
Zheng, L., Ismail, K., Meng, X., 2014. Traffic conflict techniques for road safety analysis: open questions and some insights. Can. J. Civ. Eng. 41 (7), 633–641.
Zheng, L., Ismail, K., Meng, X.H., 2014. Freeway safety estimation using extreme value theory approaches: a comparative study. Accid. Anal. Prev. 62, 32–41.
CHAPTER 12
Data mining and machine learning techniques

12.1 Introduction
Although traditional safety data (e.g., crash data and highway, traffic, and environmental characteristics) remain the primary data sources for highway safety analysis, new and emerging data are becoming more and more available. The advent of autonomous and connected vehicle technologies and naturalistic driving studies offers far richer information than traditional data sources. Given large and complex datasets, we need advanced methods, in addition to the strategies introduced in the preceding chapters, to handle high-dimensional and nonlinear relationships in Big Data. In recent years, more people have turned their attention to data mining and machine learning techniques in hopes of discovering new, accurate, and useful patterns through "data-driven safety analysis." Readers should note that both data mining and machine learning fall under the umbrella of artificial intelligence (AI). Data mining is the process of discovering and extracting useful information from a vast amount of data. Machine learning incorporates the principles and techniques of data mining and uses the same algorithms to automatically learn from and adapt to the data. The canonical tasks in machine learning include, but are not limited to:
• Clustering: Clustering creates clusters in which similar data points are grouped together. Applications in highway safety analysis include crash pattern recognition and hot spot identification. Typical methods include k-means and latent class clustering.
• Classification: Classification assigns data to one of several categories. Applications of classification in highway safety analysis include
real-time crash prediction, crash injury severity prediction, crash type prediction, and the identification of risky driver behaviors. Typical methods for such tasks include logistic regression, support vector machines, neural networks, random forests, and Bayesian network classifiers.
• Regression: Regression predicts continuous quantities from input data. Applications in highway safety analysis include the prediction of crash counts, rates, and certain types of crash events. Typical methods include linear regression, neural networks, and Gaussian processes.

Examples of these three types of tasks can be found in this chapter. Given the multitude of choices, it is important to consider how the data mining and machine learning techniques are used, what limitations exist, and what explicit trade-offs (e.g., between prediction and interpretation) are involved. This chapter introduces data mining and machine learning methodologies and techniques that have been applied in highway safety studies, including association rules, clustering analysis, decision tree models, Bayesian networks, neural networks, and support vector machines. The theoretical frameworks are illustrated through exemplary cases published in the safety literature and are supplemented with implementation information for the statistical software package R. The sensitivity analysis section at the end of the chapter offers a means of quantifying the effect of an independent variable on the output, which can assist in deciding on appropriate safety solutions. Machine learning is transforming the way we collect and analyze safety data; this chapter helps facilitate curriculum changes that improve data literacy and better prepare the future workforce for this transformation.
12.2 Association rules
Association rule mining is a rule-based machine learning method for discovering relations between variables in large databases. Specifically, it identifies the relative frequency with which sets of variables (e.g., highway geometric features, traffic conditions, driver characteristics) occur alone and together in a given event, such as a crash. The rules have the form A → B, in which A is the antecedent and B is the consequent. Association rules are characterized by three indexes, Support, Confidence, and Lift, calculated as follows:

\mathrm{Support}(A \to B) = \frac{\#(A \cap B)}{N}

\mathrm{Confidence} = \frac{\mathrm{Support}(A \to B)}{\mathrm{Support}(A)} \qquad (12.1)

\mathrm{Lift} = \frac{\mathrm{Support}(A \to B)}{\mathrm{Support}(A) \cdot \mathrm{Support}(B)}
where N is the number of crashes and #(A ∩ B) is the number of crashes in which both condition A (the antecedent) and condition B (the consequent) are present. As shown in Eq. (12.1), Support is the percentage of the whole dataset in which a rule holds; Confidence is the proportion of consequents among antecedents; and Lift quantifies the statistical dependence, because it compares the number of cooccurrences of the antecedent and consequent to the expected number of cooccurrences under the assumption that the antecedent and the consequent are independent. A Lift value smaller than 1 indicates a negative dependence between the antecedent and the consequent, a Lift value equal to 1 indicates independence, and a Lift value greater than 1 indicates positive dependence. For example, in the rule "reckless driving → alcohol (support = 1%, confidence = 50%, lift = 5)," Support shows that the proportion of observations including both reckless driving and alcohol is 1% of all crashes; Confidence shows that, among crashes involving reckless driving, 50% also involve alcohol (i.e., for 50% of reckless driving crashes, the rule is correct); and Lift indicates that reckless driving is positively associated with alcohol.

The following four steps create association rules: (1) generate all two-item rules; (2) determine threshold values; (3) eliminate the rules whose Lift values do not meet the thresholds; and (4) eliminate the remaining rules that have both Support and Confidence values lower than the thresholds. In an effort to explore the types of driver errors leading to crashes at intersections (Wang and Qin, 2015), the threshold value for Support was set at 1%. For Confidence, the threshold was set at 30% for both careless driving and reckless driving at uncontrolled and signal-controlled intersections; for stop-controlled intersections, the Confidence threshold was set at 10% for careless driving and 50% for reckless driving. For Lift, the threshold was a value greater than or equal to 1.1 (positive correlation) or smaller than or equal to 0.9 (negative correlation). The purpose of choosing different Confidence thresholds was to accommodate the different distributions of careless driving and reckless driving among the three types of intersections. As an example, Table 12.1 shows the rules that involve the more severe mistakes (i.e., careless and reckless driving) at stop-controlled intersections.
TABLE 12.1 Association rules for driver errors at stop-controlled intersections.

Antecedent | Careless driving (Support / Confidence / Lift) | Reckless driving (Support / Confidence / Lift)
Old male | — / — / — | 9% / 87% / 1.3
Young female | — / — / — | 11% / 87% / 1.3
Middle-aged female | 3% / 11% / 0.8 | — / — / —
Old female | — / — / — | 8% / 87% / 1.3
DUI | — / — / — | 2% / 80% / 1.2
Horizontal curve | — / — / — | 5% / 54% / 0.8
Vertical curve | 2% / 9% / 0.7 | 10% / 54% / 0.8
Speed limit (35–55 mph) | 1% / 34% / 2.6 | — / — / —
Speed limit (>55 mph) | 1% / 11% / 0.8 | 3% / 80% / 1.2
Afternoon peak (4:00 p.m.–6:59 p.m.) | 3% / 10% / 0.8 | — / — / —
Nighttime (7:00 p.m.–6:59 a.m.) | 5% / 20% / 1.5 | — / — / —
Wet pavement | 2% / 10% / 0.8 | 13% / 74% / 1.1
Snowy pavement | 1% / 10% / 0.8 | 6% / 54% / 0.8
Icy pavement | — / — / — | 2% / 27% / 0.4
Passenger car | — / — / — | 76% / 87% / 1.3
Light truck | 2% / 22% / 1.7 | 5% / 54% / 0.8
Heavy truck | 1% / 18% / 1.4 | — / — / —

Note: "—" indicates that the rule is not applicable. Source: Wang, K., Qin, X., 2015. Exploring driver error at intersections: key contributors and solutions. Transp. Res. Rec. 2514 (1), 1–9.
According to Table 12.1, the antecedent with the highest Support value for rules leading to careless driving is nighttime (5%), and the antecedent with the highest Support value for rules leading to reckless driving is passenger car drivers (76%). Regarding Confidence, the highest value for rules leading to careless driving is a posted speed limit of 35–55 mph (34%). All Confidence values for rules leading to reckless driving are high except those for poor pavement conditions; these high Confidence values reflect the high percentage of reckless driving (67%) taking place at stop-controlled intersections.
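Rules of this kind can be mined with the arules package in R. The sketch below is a minimal, hypothetical example: crash_df stands for a data frame of categorical crash attributes (coded as factors), and the thresholds mirror those in the text:

library(arules)

trans <- as(crash_df, "transactions")     # one transaction per crash
rules <- apriori(trans,
                 parameter = list(support = 0.01, confidence = 0.30,
                                  minlen = 2, maxlen = 2))    # two-item rules
strong <- subset(rules, lift >= 1.1 | lift <= 0.9)   # keep dependent rules only
inspect(head(sort(strong, by = "lift"), 10))          # strongest associations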
12.3 Clustering analysis
In data mining, clustering analysis (CA) is an unsupervised learning technique whose principal objective is to divide a dataset into smaller subsets called clusters. CA is based on heuristic methods and tries to maximize the similarity between intracluster elements and the dissimilarity between intercluster elements (Fraley and Raftery, 2002). The two clustering methods most commonly used in safety research are K-means clustering (KC) and latent class clustering (LCC).
12.3.1 K-means clustering
KC is a nonhierarchical, similarity-based clustering method that partitions the observations into K clusters based on the distance of the observations from the clusters' means. Many algorithms can be used to partition a dataset, including naïve k-means (or Lloyd's algorithm), Forgy and Random Partition, and the Hartigan–Wong method. Before running the KC algorithm, a distance function and the value of K must be specified. A popular choice for the distance function is the Euclidean distance. The value of K can be determined visually through a scree plot that shows different K values against the corresponding intracluster homogeneity. An example of the K-means algorithm is presented in detail in Section 10.4 (Characterizing Crashes by Real-Time Traffic) of Chapter 10 (Capacity, Mobility and Safety).
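In R, the kmeans() function in the base stats package implements the algorithm. The sketch below assumes a hypothetical numeric matrix X of standardized site or crash attributes and uses a scree plot of the total within-cluster sum of squares to choose K:

set.seed(1)
wss <- sapply(1:10, function(k)
  kmeans(scale(X), centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters K",
     ylab = "Total within-cluster SS")     # look for the "elbow"

fit <- kmeans(scale(X), centers = 3, nstart = 25)   # K chosen from the plot
fit$cluster                                         # cluster memberships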
12.3.2 Latent class clustering
LCC is a model-based clustering method that assumes the data come from a mixture of probability densities. Given objects Y_1, ..., Y_n, each described by a set of features (y_1, ..., y_m), the mixture probability density for the whole dataset can be expressed as follows (Hagenaars and McCutcheon, 2002):

f(y_i \mid \theta) = \sum_{k=1}^{K} \pi_k f_k(y_i \mid C_k, \theta_k) \qquad (12.2)
where y_i denotes the ith object's scores on the set of observed variables, K is the number of clusters, π_k = p(C_k) denotes the prior probability of belonging to latent class k (the membership probability), and θ_k are the class-specific parameters. The distribution of y_i given the model parameters θ, f(y_i | θ), is thus a mixture of class-specific densities f_k(y_i | C_k, θ_k). Maximum likelihood (ML) and maximum a posteriori (MAP) are the two main estimation methods for LCC models, and most software packages use an expectation-maximization (EM) algorithm to find the ML or MAP estimates. As with the KC model, the number of clusters can be determined visually through a scree plot, but a better way to evaluate and decide the number of clusters in an LCC model is to use information criteria such as AIC, BIC, and CAIC (see Section 2.7 of Chapter 2, Fundamentals and Data Collection).
The model with the lowest BIC, AIC, or CAIC score is considered the best. Thus, LCC employs statistical criteria to decide the most appropriate number of clusters and allows probabilistic classifications to be determined from the posterior membership probabilities estimated with the maximum likelihood method. To further assess the quality of a clustering solution, the entropy R² can be calculated (McLachlan and Peel, 2000) as in Eq. (12.3), where p_ik denotes the posterior probability that case i belongs to cluster k:

\mathrm{Entropy}\ R^2 = 1 - \frac{-\sum_{i=1}^{n}\sum_{k=1}^{K} p_{ik} \ln(p_{ik})}{n \ln(K)} \qquad (12.3)

In a perfect classification the criterion equals 1; in the worst case it equals 0.

In a study by Depaire et al. (2008), 29 crash-contributing factors (e.g., crash, vehicle, driver, and environmental characteristics) were used to partition 4028 crashes.

TABLE 12.2 Feature probabilities by cluster.

Variable – value | C1 (%) | C2 (%) | C3 (%) | C4 (%) | C5 (%) | C6 (%) | C7 (%)
Accident type: collision with a pedestrian | 0 | 99 | 0 | 0 | 0 | 99 | 6
Crossroad: crossroad without traffic lights or priority road | 74 | 26 | 40 | 21 | 40 | 15 | 5
Crossroad: no crossroad | 5 | 48 | 1 | 56 | 33 | 76 | 42
Built-up area: outside built-up area | 1 | 1 | 1 | 1 | 1 | 0 | 47
Road type: highway, national, regional or provincial road | 25 | 25 | 55 | 23 | 24 | 7 | 99
Age road user 1: 0–18 years old | 14 | 11 | 16 | 16 | 42 | 96 | 11
Dynamics road user 2: road user is not moving | 0 | 11 | 0 | 79 | 74 | 35 | 0
Vehicle type road user 1: motorcycle or bicycle | 8 | 0 | 12 | 17 | 90 | 0 | 7
Vehicle type road user 1: car | 85 | 0 | 81 | 76 | 9 | 0 | 82
Size (%) | 23 | 19 | 15 | 15 | 14 | 10 | 4

Source: Depaire, B., Wets, G., Vanhoof, K., 2008. Traffic accident segmentation by means of latent class clustering. Accid. Anal. Prev. 40 (4), 1257–1266.
The results of the cluster analysis in Table 12.2 indicate distinctions between traffic crashes involving cars and those involving motorcycles or bicycles, and between crashes with and without pedestrians. The analysis also shows that other types of features, such as type of crossroad or road type, can be used to group the data. The seven-cluster model provides cluster-specific distributions for each variable, which helps to characterize every cluster as a unique and specific crash type.
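A latent class analysis of this kind can be sketched with the poLCA package in R. The example below is hypothetical: v1–v3 stand for categorical crash variables coded as positive integers in a data frame crash_df, and the entropy R² of Eq. (12.3) is computed from the posterior membership probabilities:

library(poLCA)

f <- cbind(v1, v2, v3) ~ 1                       # no covariates, pure clustering
fits <- lapply(2:7, function(k) poLCA(f, crash_df, nclass = k, verbose = FALSE))
sapply(fits, function(m) m$bic)                  # choose K with the lowest BIC

p <- pmax(fits[[6]]$posterior, 1e-12)            # K = 7 posteriors; guard against log(0)
entropy_R2 <- 1 - (-sum(p * log(p))) / (nrow(p) * log(ncol(p)))   # Eq. (12.3)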
12.4 Decision tree model
Among the various data mining techniques, the outcomes of tree-based models are relatively simple for nonstatisticians to interpret. A tree-based classification or regression model is constructed by recursively partitioning the data according to criteria such as the total sum of squared errors (SSE). In other words, the values of all the variables in the model, whether discrete or continuous, are examined to find the split that yields the maximum reduction in the variability of the response variable. The algorithm obtains optimal results by exhaustively searching all variables, as well as all values of each selected variable.
12.4.1 The CART model

The classification and regression trees (CART) methodology, first introduced by Breiman et al. (1998), refers to two types of decision trees: a classification tree and a regression tree. A classification tree is an algorithm where the response variable is categorical, such as crash injury severity levels; the algorithm identifies the "class" to which a response variable would most likely belong. A regression tree is an algorithm where the response variable is continuous, such as crash frequency or crash rate, and the algorithm is used to predict its value. The CART model has been used to predict crash injury severity and crash frequency, and to identify important contributing factors based on the tree structure.

The decision tree grows as the dataset is split into groups organized so that each group's data are as homogeneous as possible. Algorithms for constructing decision trees usually work top-down by choosing, at each step, the splitter that best divides the data at that node. Although different algorithms use different metrics, they usually measure the homogeneity of the target variable within the subsets using one of the following:

• Gini impurity: Gini impurity measures how often an element from the dataset would be labeled incorrectly if it were labeled randomly according to the distribution of labels in the subset. The Gini impurity of a node n is 1 minus the sum over all J classes of the squared fraction in each class (p_i):

\[ I_G(n) = 1 - \sum_{i=1}^{J} (p_i)^2 \]

• Variance reduction: The variance reduction of a node is defined as the total reduction of the variance of the target variable due to the split at the node. It is often used when the target variable is continuous, such as in a regression tree.
• Information gain: Information gain is based on the concepts of entropy and information content. At each step of building the tree, the split chosen is usually the one that results in the purest child nodes. Information gain is used by the ID3, C4.5, and C5.0 decision-tree algorithms.

The notations used in the tree algorithm are as follows: t denotes a node, T denotes a tree, \( \tilde{T} \) denotes the set of terminal nodes of T, \( |\tilde{T}| \) denotes the number of terminal nodes of T, \( T_t \) denotes a subtree of T with root node t, and {t} denotes a subtree of \( T_t \) containing only the root node t. The following describes the theory of Poisson regression and classification using the tree algorithm notations.

• First, crash data are modeled by the Poisson distribution when the equality of the mean and variance is not violated, or by the negative binomial distribution when overdispersion is present. If the number of crashes, the response variable \( Y_i \), is assumed to follow a Poisson distribution, the expected number of crashes, \( \mu_i \), can be expressed as the product of traffic exposure and the exponential function of the potential crash contributing factors or other explanatory variables: \( \mu_i = V \cdot \exp(\beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}) \). The linear relationship between the expected number of crashes and the corresponding vector of predictors is obtained from the logarithmic transformation, and the maximum likelihood estimates of \( \beta_0, \beta_1, \dots, \beta_k \) can be obtained by maximizing the likelihood function \( L(\mu|y) = \prod_{i=1}^{n} e^{-\mu_i}\mu_i^{y_i}/y_i! \) or, equivalently, the log-likelihood function \( LL(\mu|y) = \sum_{i=1}^{n} y_i \ln(\mu_i) - \sum_{i=1}^{n} \mu_i - \sum_{i=1}^{n} \ln(y_i!) \).
• Then, when partitioning a dataset recursively, an appropriate splitter needs to be selected. If the predictor variable \( x_i \) is numerically ordered, the dataset is partitioned by \( x_i \leq c \), and if \( x_i \) is categorical, the dataset is partitioned by \( x_i \in A \), where c is a constant
and A is a fixed subset of possible values of \( x_i \). The method proceeds by an iterative search, over all variables and all their possible levels or values in the model, for the split that results in the maximum reduction in the variability of the dependent variable. The best splitter, s*, is determined by the deviance D or by squared errors, where the squared error for a node t is defined as follows:

\[ D(t) = \sum_{x \in t} (y_n - \hat{\mu})^2 \]   (12.4)

where \( \hat{\mu} \) is an estimate of the mean or a sample mean \( \bar{y} \). For generalized linear models, the deviance is also called the log-likelihood (ratio) statistic, which is described in Section 2.7 of Chapter 2 (Fundamentals and Data Collection). In the Poisson case, the deviance simplifies to

\[ D = 2\left\{ \sum_{i=1}^{n} y_i \ln(y_i/\hat{\mu}_i) - \sum_{i=1}^{n} (y_i - \hat{\mu}_i) \right\} \]

If the deviance for a node t is denoted by D(t), the deviance for a tree T is

\[ D(T) = \sum_{t \in \tilde{T}} D(t) = \sum_{t \in \tilde{T}} \sum_{x \in t} (y_n - \bar{y}(t))^2 \]   (12.5)

For a binary partition by a splitter s, the difference achieved by s is defined as

\[ \Delta D(s, t) = D(t) - D(t_L) - D(t_R) \]   (12.6)

where \( t_L \) and \( t_R \) are the left and right child nodes of t, respectively.
• Finally, the best splitter, s*, is obtained by maximizing this difference, that is,

\[ \Delta D(s^*, t) = \max_{s \in S} \Delta D(s, t) \]   (12.7)
where S is the set of all possible splitters. The maximum reduction occurs at a specific value of a selected variable s. When the data are split at s into two subsamples, the resulting subsamples have a much smaller variance in Y than the original dataset; thus, the reduction at node t is greatest when the deviances at nodes \( t_L \) and \( t_R \) are smallest. In a study by Qin and Han (2008), classification criteria were developed to categorize sites into groups sharing similar attributes and consequences. A tree-based regression method significantly improves the
model efficiency. The variables used for the CART model include the total number of crashes (dependent variable), type of area (rural or urban), type of traffic control (all-way, side, signal), number of intersection approach legs (3, 4, or other), number of major roadway lanes (2, 4, or unknown), existence of a major roadway median, and existence of left-turn lane(s). In Fig. 12.1, the CART tree shows that the first splitter for predicting the crash rate at an intersection is the number of intersection approaches, followed by the type of intersection traffic control. R 3.5.0 (R Core Team, 2018) offers two packages, "rpart" and "tree," for building classification and regression trees. The key difference between the two packages is the way missing values are handled during the splitting and scoring processes. In "tree," an observation with a missing value for the primary split rule is dropped. "rpart" offers more flexibility, allowing users to decide how missing values are handled through surrogate splits by setting the "usesurrogate" parameter in rpart.control.
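A minimal sketch of fitting such a regression tree with "rpart" follows; the data frame and column names are hypothetical stand-ins for the intersection variables listed above, and the tuning values are illustrative.

```r
library(rpart)
# intersections: hypothetical data with the variables described above
fit <- rpart(crash_rate ~ area_type + traffic_control + legs + lanes +
               median_present + left_turn_lane,
             data = intersections, method = "anova",      # regression tree
             control = rpart.control(minsplit = 30, cp = 0.005,
                                     usesurrogate = 2))   # surrogates for NAs
printcp(fit)           # cross-validated error by tree size
plot(fit); text(fit)   # draw the tree with node labels
```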
FIGURE 12.1 CART model results. (Root node: N = 3193, crash rate = 0.688, first split on number of legs. Legs = 3 or other: N = 1846, crash rate = 0.569, further split by traffic control into all-way/side control, N = 1527, crash rate = 0.526, and signal control, N = 319, crash rate = 0.775. Legs = 4: N = 1347, crash rate = 0.852.)

Despite its convenience, the CART model has several limitations. First, it cannot quantitatively measure the effect of variables on injury severity because no coefficient is estimated for each variable. Second, it predicts the outcome from a single decision tree, whose classification accuracy can be unstable as the data, split variables, and tree complexity change (Chung, 2013). Third, the CART model is prone to overfitting (Duncan and Khattak, 1998).
An alternative to the CART model is ensemble learning, which predicts responses by growing a series of classification trees. The classification trees are grown from randomly selected samples drawn with replacement (i.e., bootstrap samples). Random forests (RF) and gradient boosted trees (GBT) are representative ensemble learning methods. The RF classifier is a type of bootstrap-aggregated (i.e., bagged) decision tree in which the ensemble builds multiple decision trees by repeatedly resampling the training data with replacement; the trees then vote for a consensus prediction. GBT, on the other hand, builds trees one at a time, with each new tree helping correct the mistakes made by the previously trained trees.
12.4.2 Random forest

The RF method generates many classifiers and aggregates their results. Breiman (2001) proposed the method as a prediction tool that uses a collection of tree-structured classifiers built from independent and identically distributed random vectors, in which each tree votes for the most popular class. The method performs very well compared with many other classifiers and is robust against overfitting, one of CART's limitations. The procedures for implementing RF are summarized in Algorithm 12.1 (Lee and Li, 2015).
Algorithm 12.1: Random forest algorithm
1. Select a bootstrap sample.
2. Grow a classification tree to fit the bootstrap sample so that the variable for each split in the classification tree can be selected only from a small subset of randomly selected variables.
3. Predict the response variable for the samples not selected in the bootstrap sample (i.e., the out-of-bag samples) by using the classification tree in Step 2. The predicted category of the response variable is the category with the highest proportion of samples.
4. Compare the observed and predicted categories of the response variable to calculate the rate of incorrect classification (the number of misclassified samples over the total number of samples) for each tree. This rate is defined as the misclassification rate (r_b).
5. For each predictor variable i, permute the values of the variable in the out-of-bag samples. Predict the response variable by using the classification tree in Step 2 to calculate the new misclassification rate of the tree (r_ai). The importance score for variable i is computed on the basis of the difference between the misclassification rates before and after the permutation [(r_ai − r_b)/r_b]. A higher difference between the two misclassification rates increases the importance score, meaning the variable importance is higher. The importance score for each variable is updated as more trees are trained on the out-of-bag samples.
6. Repeat Steps 1–5 until enough trees are grown by using different bootstrap samples. Calculate the average importance score for each variable across the different trees.
RF uses a random sample of the data to train each tree independently. This randomness makes the model more robust than a single tree and less likely to overfit the training data. However, RF does not identify whether a variable has a positive or negative effect on the response variable; hence, RFs are often used to rank the importance of variables, as a screening method for selecting input variables for other models such as logistic regression. For a categorical variable with multiple levels, RFs are biased in favor of attribute values with more observations and may produce unreliable variable importance scores. In addition, a large number of trees may make the algorithm slow for real-time prediction. The RF technique is implemented in the R package "randomForest," which is based on Breiman's (2001) method.
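The sketch below shows a typical call, again with hypothetical crash data; importance = TRUE requests the permutation-based importance scores described in Algorithm 12.1.

```r
library(randomForest)
# crashes: hypothetical data frame; severity is a factor (classification)
rf <- randomForest(severity ~ ., data = crashes,
                   ntree = 500,           # number of bootstrap trees
                   mtry = 3,              # variables tried at each split
                   importance = TRUE)     # permutation importance scores
rf$err.rate[500, "OOB"]   # out-of-bag misclassification rate
importance(rf)            # mean decrease in accuracy / Gini by variable
varImpPlot(rf)
```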
12.4.3 Gradient boosted trees

Gradient boosting is a machine learning technique for regression and classification problems. Friedman (2001) introduced the technique as the gradient boosting machine. The technique typically uses decision trees (especially CART trees) of a fixed size as base learners, so it is often called GBT. Unlike RF, which builds an ensemble of deep independent trees, GBT builds an ensemble of shallow, weak trees sequentially. Each new tree learns from and improves on the previously trained tree by applying a higher weight to incorrectly classified observations and a lower weight to correctly classified ones; as the weak learners are boosted, the heavily weighted observations become more likely to be classified correctly. Hence, the GBT model transforms an ensemble of weak
learners into a single strong model and predicts the cases that are difficult to classify. In the GBT model, a basis function f(x) describes a response variable y as the summation of weighted basis functions for the individual trees:

\[ f(x) = \sum_{n=1}^{m} \beta_n b(x; \gamma_n) \]   (12.8)

where \( b(x; \gamma_n) \) is the basis function for individual tree n; m is the total number of trees; \( \gamma_n \) is the split variable; and \( \beta_n \) is the estimated parameter that minimizes the loss function L(y, f(x)). GBT can be summarized in Algorithm 12.2 (Friedman, 2001):
Algorithm 12.2: GBT algorithm
1. Initialize \( f_0(x) \), which can be set to zero.
2. For n = 1, 2, 3, …, m (number of trees):
   a. For i = 1 to k (number of observations), calculate the residual \( r = -\partial L(y, f(x)) / \partial f(x) \), where \( L(y, f(x)) = (y - f(x))^2 \) and \( f(x) = f_{n-1}(x) \), the model fitted through the previous tree (n − 1).
   b. Fit a decision tree to r to estimate \( \gamma_n \).
   c. Estimate \( \beta_n \) by minimizing \( L(y_i, f_{n-1}(x) + \beta_n b(x; \gamma_n)) \).
   d. Update \( f_n(x) = f_{n-1}(x) + \beta_n b(x; \gamma_n) \).
3. Calculate \( f(x) = \sum_{n=1}^{m} \beta_n b(x; \gamma_n) \).
The GBT model can handle different types of predictor variables and can also accommodate missing data. It can fit complex nonlinear relationships and automatically account for interactions between predictors. Because boosted trees are built by optimizing an objective function, GBT can be used with almost any objective function for which the gradient is available. In addition, boosting's step-by-step focus on difficult cases is an effective strategy for handling unbalanced datasets because it strengthens the impact of positive cases. However, GBT training generally takes time because trees are built sequentially, and each tree is kept shallow; the quantitative effect of each variable on the response variable, such as crash injury severity, may also not be directly available. A study by Lee and Li (2015) compares the performance of CART and GBT models. The "gbm" R package implements the generalized boosted model method; interested readers can refer to Elith and Leathwick (2017) for details on building GBT with "gbm."
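A hedged sketch with "gbm" follows; the response and predictors are hypothetical, and the tuning values are illustrative rather than recommended.

```r
library(gbm)
# crashes: hypothetical data; injury coded 0/1 (e.g., severe vs. not severe)
fit <- gbm(injury ~ ., data = crashes,
           distribution = "bernoulli",   # logistic loss for a binary response
           n.trees = 2000, shrinkage = 0.01,
           interaction.depth = 3,        # shallow trees as weak learners
           cv.folds = 5)
best <- gbm.perf(fit, method = "cv")     # number of trees minimizing CV error
summary(fit, n.trees = best)             # relative influence of each predictor
pred <- predict(fit, newdata = crashes, n.trees = best, type = "response")
```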
12.5 Bayesian networks

A Bayesian network (BN) is a probabilistic graphical model that depicts a set of variables and their conditional dependencies via a directed acyclic graph (DAG). A BN has two main components: the causal network model (topology) and the conditional probability tables (CPTs). The causal relationships are represented as DAGs in which variables are denoted by nodes, and relationships (e.g., causality, relevance) among variables are described by arcs between nodes. CPTs explicitly specify the dependencies among variables in terms of conditional probability distributions. Let U = {x_1, …, x_n}, n ≥ 1, be a set of variables and B_p = {p(x_i | pa(x_i)), x_i ∈ U} be a set of CPTs, where pa(x_i) is the set of parents of x_i in the BN and i = 1, 2, 3, …, n. A BN represents the joint probability distribution P(U):

\[ P(U) = \prod_{x_i \in U} P(x_i \mid \mathrm{pa}(x_i)) \]   (12.9)

Bayes' theorem can be applied to predict any variable in U given the other variables using

\[ p(x_i \mid x_j) = \frac{p(x_j \mid x_i)\, p(x_i)}{p(x_j)} \]

For instance, the classification
task consists of classifying a variable y given a set of attribute variables U. A classifier h: U → y is a function that maps an instance of U to a value of y; the classifier is learned from a dataset consisting of samples over (U, y). The arcs in a BN model can represent causality, relevance, or relations of direct dependence between variables. In highway safety analysis, given the complexity of crashes, it is better to consider arcs as evidence of direct dependence between the linked variables rather than as causality. The absence of an arc means the absence of direct dependence between variables, which does not necessarily mean the absence of indirect dependence. Candidate BNs are assigned a goodness-of-fit "network score" that is maximized by heuristic algorithms; classes of heuristic algorithms include greedy hill climbing (HC), genetic algorithms, tabu search, and simulated annealing. The R package "bnlearn" helps learn the graphical structure of a BN, estimate its parameters, and perform useful inference. Functions in "bnlearn" include HC and tabu search, as well as network score functions such as the AIC and BIC.
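The sketch below, using hypothetical factor-coded crash data, learns a structure by hill climbing and by tabu search with the BIC network score and then estimates the CPTs.

```r
library(bnlearn)
# crash_data: hypothetical data frame in which every column is a factor
dag_hc   <- hc(crash_data, score = "bic")     # greedy hill climbing
dag_tabu <- tabu(crash_data, score = "bic")   # tabu search
score(dag_hc, crash_data, type = "bic")       # compare candidate networks
fitted <- bn.fit(dag_hc, crash_data)          # conditional probability tables
fitted$crash_type                             # CPT of one (hypothetical) node
```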
Prati et al. (2017) investigated factors related to the severity of bicycle crashes in Italy using Bayesian network analysis. The DAG in Fig. 12.2 shows the association between the severity of bicycle crashes and crash characteristics. The network consists of nine nodes, one for the target and one for each predictor. The BN model also indicates the relative importance of each predictor, using a darker color for more important relationships to the severity of bicycle crashes: crash type (0.31), road type (0.19), and type of opponent vehicle (0.18).

FIGURE 12.2 The Bayesian network model and predictor importance (Prati et al., 2017).

The BN model provides a conditional probability table for each node, in which each column represents a value of the predictor and each row represents a combination of values of the target and parent predictor variables. Table 12.3 summarizes the conditional probability of each crash type given its parents (i.e., a combination of bicycle crash severity and month). The conditional probabilities of crash type suggest that in angle crashes, fatalities are less likely to occur than injuries, especially from February to December. Fatalities were more likely than injuries in rear-end crashes.
TABLE 12.3 Crash type/month conditional probabilities (Prati et al., 2017).

Month | Severity | Run-off-the-road | Head-on collision | Sudden braking | Angle collision | Falling from the vehicle | Sideswipe collision | Rear-end collision | Hit pedestrian | Hit parked or stationary vehicle | Hit stopped vehicle | Hit obstacle in carriageway
August | Fatality | 0.03 | 0.05 | 0 | 0.41 | 0.03 | 0.21 | 0.27 | 0 | 0.01 | 0 | 0
September | Fatality | 0.05 | 0.15 | 0 | 0.42 | 0.04 | 0.14 | 0.19 | 0.01 | 0 | 0 | 0
October | Fatality | 0.07 | 0.05 | 0 | 0.31 | 0.06 | 0.2 | 0.2 | 0.01 | 0.05 | 0 | 0.03
November | Fatality | 0.01 | 0.11 | 0.02 | 0.31 | 0.06 | 0.21 | 0.26 | 0 | 0.02 | 0 | 0
December | Fatality | 0 | 0.07 | 0 | 0.42 | 0.05 | 0.15 | 0.28 | 0 | 0.03 | 0 | 0
January | Injury | 0.03 | 0.08 | 0 | 0.53 | 0.03 | 0.18 | 0.07 | 0 | 0.05 | 0.01 | 0.01
February | Injury | 0.03 | 0.07 | 0 | 0.54 | 0.04 | 0.18 | 0.07 | 0 | 0.05 | 0.01 | 0.01
March | Injury | 0.05 | 0.06 | 0 | 0.51 | 0.03 | 0.19 | 0.08 | 0 | 0.05 | 0.01 | 0.02
April | Injury | 0.04 | 0.06 | 0 | 0.51 | 0.04 | 0.2 | 0.07 | 0.01 | 0.05 | 0.01 | 0.01
May | Injury | 0.04 | 0.06 | 0 | 0.5 | 0.04 | 0.21 | 0.07 | 0.01 | 0.05 | 0.01 | 0.02
June | Injury | 0.04 | 0.07 | 0 | 0.49 | 0.04 | 0.2 | 0.08 | 0.01 | 0.05 | 0 | 0.02
July | Injury | 0.04 | 0.07 | 0 | 0.5 | 0.03 | 0.2 | 0.08 | 0 | 0.06 | 0.01 | 0.02
August | Injury | 0.04 | 0.07 | 0 | 0.5 | 0.04 | 0.2 | 0.08 | 0.01 | 0.05 | 0 | 0.02
September | Injury | 0.04 | 0.06 | 0 | 0.49 | 0.03 | 0.22 | 0.08 | 0 | 0.05 | 0.01 | 0.01
October | Injury | 0.04 | 0.05 | 0 | 0.53 | 0.04 | 0.19 | 0.07 | 0.01 | 0.06 | 0.01 | 0.01
November | Injury | 0.03 | 0.07 | 0 | 0.52 | 0.02 | 0.19 | 0.08 | 0 | 0.07 | 0.01 | 0.01
December | Injury | 0.03 | 0.05 | 0 | 0.57 | 0.03 | 0.17 | 0.08 | 0 | 0.06 | 0 | 0.01
12.6 Neural network

The artificial neural network (ANN) is a machine learning technique that models the response in a large dataset as a nonlinear (activation) function of linearly combined predictors. ANNs are an effective means of discovering new patterns and of correctly classifying data or making forecasts. The output signal from one neuron can be used as an input to other neurons, so a network of multiple neurons and multiple layers can effectively model and solve complex problems. Several types of ANNs are described in this section.
12.6.1 Multilayer perceptron neural network

The feed-forward neural network (FNN) effectively solves multivariate nonlinear regression and classification problems. The multilayer perceptron (MLP) neural network is a class of FNN in which neurons of the same layer are not connected to each other but are connected to the neurons of the preceding and subsequent layers. The output of one hidden layer serves as an input to the subsequent layer in the form of an activation function applied to a weighted summation of the previous layer's outputs. The weights are determined by solving an optimization problem, that is, by minimizing a given cost function; the algorithm most commonly used to determine the weights is back-propagation. In the MLP model of Fig. 12.3, the Xs are the independent variables (indicators), the Hs are the hidden nodes, and Y is the dependent variable; the αs are the estimated coefficients between the indicators and the hidden nodes, and the βs are the estimated coefficients between the hidden nodes and the dependent variable. The model can be described as follows:

\[ Y = g_Y(\beta_0 + \beta_1 H_1 + \dots + \beta_{m-1} H_{m-1}) + \varepsilon \]   (12.10)

where \( H_i = g_H\!\left(\sum_j \alpha_{i,j} X_j\right) \), j = 0, 1, …, p − 1; ε is the error term; and \( g_Y \) and \( g_H \) are the activation functions. In an ANN, the activation function of a node defines the output of that node given an input. Fig. 12.4 shows common activation functions such as the radial basis function (e.g., Gaussian), the sigmoidal function (e.g., logistic), the hyperbolic function (e.g., tanh), the rectified linear unit (ReLU), and the identity function. For example, if the activation function is the logistic function, \( g(z) = (1 + e^{-z})^{-1} \), then the model can be written as

\[ Y = \left[1 + \exp\!\left(-\beta_0 - \sum_{n=1}^{m-1} \beta_n \left[1 + \exp\!\left(-\sum_j \alpha_{n,j} X_j\right)\right]^{-1}\right)\right]^{-1} + \varepsilon = f(X; \alpha, \beta) + \varepsilon \]   (12.11)
FIGURE 12.3 The multilayer perceptron (MLP) ANN model.

FIGURE 12.4 Common activation functions: the sigmoid function σ(x) = 1/(1 + exp(−x)), the hyperbolic tangent function tanh(x), the ReLU function max(0, x), and the identity function f(x) = x.

Data normalization can improve data fitting and model performance given the different types and varying magnitudes of the input variables;
therefore, data normalization is required for ANNs. Variables are normalized using the min–max normalization formula, where the minimum of 0 and maximum of 1 match the lower and upper limits of the sigmoid activation function in ANN models. Categorical variables with more than two values are converted into N − 1 binary variables before the normalization. The normalization equation is as follows:

\[ x_i^n = \frac{x_i - \min(x_i)}{\max(x_i) - \min(x_i)} \]   (12.12)
Statistical software R has several packages for dealing with ANNs, such as "neuralnet," "nnet," and "RSNNS." However, not all packages include a plot function that allows analysts to visualize the neural network, and the ability to take separate or combined x and y inputs as data frames or as a formula may also not be included. Software package development is an evolving process, so readers are encouraged to check regularly for new developments and compare the differences between these options in R.
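As one possibility, the sketch below fits a single-hidden-layer MLP with "neuralnet" after min–max normalization per Eq. (12.12); the crash variables are hypothetical.

```r
library(neuralnet)
norm01 <- function(x) (x - min(x)) / (max(x) - min(x))   # Eq. (12.12)
d <- as.data.frame(lapply(crashes[, c("severe", "aadt",
                                      "speed", "lanes")], norm01))
nn <- neuralnet(severe ~ aadt + speed + lanes, data = d,
                hidden = 5,               # one hidden layer, five nodes
                act.fct = "logistic",     # sigmoid activation
                linear.output = FALSE)    # classification-style output
plot(nn)   # visualize weights and structure
```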
ANNs are not constrained by a predetermined functional form or specific distributional assumptions, so they can be expected to produce more accurate functions for predicting the number of crashes given the explanatory variables. Kononov et al. (2011) applied ANNs to explore the underlying relationships between crashes and other factors for urban freeway segments; the crash-frequency models (or SPFs in this case) developed from a sigmoid functional form through an ANN with one hidden layer showed a better fit than a traditional NB regression model. Several studies have used different types of ANN models, including MLP, to predict driver injury severity or identify significant factors of injury severity in traffic crashes, in hopes of better understanding the relationships between driver injury severity and factors related to the driver, vehicle, roadway, and environment (Abdelwahab and Abdel-Aty, 2002; Delen et al., 2006). The modeling results in these studies, when compared with traditional methods such as the MNL and ordered logit models, are promising. However, ANN models, like other classification models, suffer from the multiclass classification problem, meaning they ignore less-represented categories to improve the model's overall accuracy. This is a problem for injury severity studies because the data are highly skewed, with few high-severity injury crashes and many less severe injury crashes. One solution is to reduce the multiclass problem into a series of two-class (binary) classification problems. It can be argued that multiple binary models are more advantageous than a single multiclass model in that they provide better insight into the interrelationships between the different levels of injury severity and the crash factors (Delen et al., 2006; Li et al., 2012). Resampling is another technique used to handle the multiclass classification problem. Resampling involves oversampling less-represented classes, undersampling overrepresented classes, or using ensemble methods (e.g., bootstrap aggregating). Jeong et al. (2018) used undersampling, oversampling, and ensemble methods (majority voting and bagging) to classify motor vehicle injury severity. Yuan et al. (2019) used the Synthetic Minority Over-sampling Technique (SMOTE) algorithm for unbalanced classification problems to predict real-time crash risk with their long short-term memory recurrent neural networks; interested readers can refer to Chawla et al. (2002) for details on SMOTE. Long short-term memory recurrent neural networks are described in more detail below.

For specific data types, such as images or time series, MLP may not be a good choice. MLP does not scale well for image data because it uses one perceptron for each input, and the number of weights can grow rapidly for large images. Additionally, MLP ignores information based on the position of a pixel and its spatial correlation with neighbors, so spatial information is lost when an image is flattened for an MLP. For time-series data, MLP ignores the time sequence between data points. Hence, for image or time-series data, other neural network algorithms such as convolutional neural networks and recurrent neural networks should be considered.
12.6.2 Convolutional neural networks

Convolutional neural networks (CNNs, or ConvNets) are deep learning neural networks commonly applied to image analysis. The name "convolutional" indicates that the algorithm employs a mathematical operation called convolution. A CNN typically consists of convolutional layers, pooling layers, and fully connected layers. The convolutional layer is the core building block of a CNN. As shown in Fig. 12.5, a convolutional layer creates a filter that slides over the image spatially, producing a feature map; various filters produce many separate feature maps that are stacked to generate a volume.
FIGURE 12.5 Convolutional neural networks (CNN) structure.
The results are then passed to the next layer. Pooling layers reduce the size of the convolved features by combining the outputs of node clusters at one layer into a single node in the next layer. Pooling is a form of nonlinear down-sampling, with max pooling the most common among several nonlinear functions; it increases computational efficiency through dimensionality reduction. Finally, after several convolutional and max-pooling layers, the important features of an image can be captured; the matrix that represents the extracted features is flattened and fed into a traditional MLP neural network for classification. In the statistical software R 3.5.0 (R Core Team, 2018), the "MXNet" package provides deep CNNs for image recognition. Ma et al. (2017) applied a CNN to analyze time-series freeway speed data for incident detection. First, the freeway speed time series was converted to images by the Gramian Difference Angular Field (GDAP) method. Then, CNNs were used to identify high-level features in the images and predict the probability of a crash. The authors used the AlexNet structure (Krizhevsky et al., 2012), which contains eight layers: five convolutional layers and three fully connected layers. In the fully connected layers, the ReLU activation function was used to convert the two-dimensional image to a one-dimensional vector and capture complex feature interactions; finally, a sigmoid layer was used to predict the probability of a crash. The study used 5000 samples, which included 212 crash events and 4788 noncrash events, split into training and testing sets. The model performance was evaluated by the detection rate and the false alarm rate, and the AUC of the CNN model was 0.9662.
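A minimal sketch of such an architecture in the "keras" R interface is shown below; the image size, filter counts, and layer depth are assumptions for illustration, not the AlexNet configuration used by Ma et al. (2017).

```r
library(keras)
model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(64, 64, 1)) %>%    # 64 x 64 one-channel images
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%    # nonlinear down-sampling
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%                              # feed maps to an MLP head
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")   # crash probability
model %>% compile(optimizer = "adam", loss = "binary_crossentropy",
                  metrics = "AUC")
```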
12.6.3 Long short-term memory–recurrent neural networks

Recurrent neural networks (RNNs) are an important variant of neural networks that are heavily used in natural language processing. In an RNN, connections between nodes form a directed graph following a temporal sequence, allowing temporal patterns in the data to be modeled; RNNs can retain a "memory" of what is captured in the time-series data. Long short-term memory (LSTM) is a deep learning RNN. As shown in Fig. 12.6, a common LSTM unit includes a cell (C_t), an input gate (I_t), an output gate (O_t), and a forget gate (F_t). The cell remembers values over certain time intervals, and the three gates regulate the flow of information into and out of the cell. The model is trained by back-propagation. An RNN that uses LSTM units partially solves the vanishing gradient problem (i.e., gradients that tend to zero), a phenomenon that can occur when training an RNN with back-propagation.
FIGURE 12.6 Long short-term memory (LSTM).
The input gate decides which values from the input should be used to modify the memory: a sigmoid function decides which values to let through, and a tanh function assigns weights to the passing values according to their level of importance, ranging from −1 to 1. The forget gate determines which memories the cell may forget: a sigmoid function outputs a number between 0 and 1 for each number in the cell state \( c_{t-1} \), based on the previous state (\( h_{t-1} \)) and the current input (\( x_t \)). The output gate yields the output based on the input and the memory of the cell: functioning similarly to the input gate, a sigmoid function decides which values to let through, and a tanh function assigns weights to the passing values, ranging from −1 to 1, which are then multiplied by the output of the sigmoid function. An LSTM with a forget gate can be formulated as follows (Hochreiter and Schmidhuber, 1997; Gers et al., 1999):

\[ f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \]   (12.13)
\[ i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \]
\[ o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \]
\[ c_t = i_t \circ \tanh(W_c x_t + U_c h_{t-1} + b_c) + f_t \circ c_{t-1} \]
\[ h_t = o_t \circ \tanh(c_t) \]
\[ y_t = W_y y_{t-1} + b_y \]

where \( x_t \) is the input vector to the LSTM unit and \( y_t \) is the output vector; \( f_t \), \( i_t \), and \( o_t \) are the activation vectors of the forget, input/update, and output gates, respectively; \( h_t \) is the hidden state vector, also known as the output vector of the LSTM unit; \( c_t \) is the cell state vector; and W, U, and b are the weight matrices and bias vector parameters, respectively. σ is the sigmoid function; tanh is the hyperbolic tangent function; and "∘" denotes the Hadamard product, or element-wise product, of two matrices. The initial values are \( c_0 = 0 \) and \( h_0 = 0 \).

The R 3.5.0 (R Core Team, 2018) package "rnn" implements LSTM, gated recurrent unit (GRU), and vanilla RNN models. The "keras" R package, an interface to an open-source neural network library written in Python, is another option; "keras" was developed to enable fast experimentation and supports both convolution-based and recurrent networks. The R interface to "H2O" is yet another choice. H2O is a fully open-source machine learning platform that offers parallelized implementations of many supervised and unsupervised machine learning algorithms, such as generalized linear models, gradient boosting machines (including XGBoost), random forests, deep neural networks (deep learning), stacked ensembles, naive Bayes, Cox proportional hazards, k-means, PCA, Word2Vec, as well as a fully automatic machine learning algorithm (AutoML) (https://cran.r-project.org/web/packages/h2o/index.html).

In the highway safety literature, Yuan et al. (2019) used an LSTM recurrent neural network to predict real-time crash risk, attempting to predict 5-min crash risk updated every minute at signalized intersections. Crash, travel speed, signal timing, loop detector, and weather data were collected from 44 signalized intersections in Oviedo, FL. The LSTM-RNN was applied, and the results were compared with a conditional logistic model based on a matched case-control design; the comparison shows that the LSTM-RNN with the synthetic minority oversampling technique outperforms the conditional logistic model.
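A hedged sketch of a small LSTM classifier in "keras" follows; the sequence length and feature count are assumed purely for illustration.

```r
library(keras)
# input: sequences of 5 one-minute observations of 4 traffic features (assumed)
model <- keras_model_sequential() %>%
  layer_lstm(units = 32, input_shape = c(5, 4)) %>%  # LSTM memory cells
  layer_dense(units = 1, activation = "sigmoid")     # crash risk for the window
model %>% compile(optimizer = "adam", loss = "binary_crossentropy")
# model %>% fit(x_train, y_train, epochs = 20, batch_size = 64)
```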
12.6.4 Bayesian neural networks

The Bayesian neural network (BNN) model was initially proposed by Liang (2003, 2005), in which a fully connected MLP neural network structure with one hidden layer, as in Fig. 12.3, was applied. For the BNN model, the transfer functions used in the hidden layer and the output layer are the same as those used in the MLP model. Although the network structure of the BNN model is very similar to the MLP structure, the two differ in the prediction mechanism and the training process. An example illustrates the difference in the prediction mechanism. Assume that there are n sets of crash data \( \{(x_1, y_1), \dots, (x_i, y_i), \dots, (x_n, y_n)\} \), where the definitions of \( x_i \) and \( y_i \) are the same as those used for the NB regression and MLP models. Let θ denote all the network parameters or weights, \( \beta_j \), \( \alpha_k \), and \( \gamma_{jk} \) (j = 1, …, M; k = 0, …, P), in Fig. 12.3. The predicted number of crashes for site i using a BNN is given by Eq. (12.14):

\[ \hat{y}_i = \int f_B(x_i; \theta)\, P(\theta \mid (x_1, y_1), \dots, (x_n, y_n))\, d\theta \]   (12.14)

where \( f_B(x_i; \theta) \) is defined as

\[ f_B(x_i; \theta) = \alpha_0 + \sum_{k=1}^{P} \alpha_k x_{ik} + \sum_{j=1}^{M} \beta_j \tanh\!\left( \sum_{k=1}^{P} \gamma_{jk} x_{ik} + \gamma_{j0} \right) \]   (12.15)

and \( P(\theta \mid (x_1, y_1), \dots, (x_n, y_n)) \) in Eq. (12.14) is the posterior distribution of θ given the observed data \( \{(x_1, y_1), \dots, (x_n, y_n)\} \). The main difference between BNNs and MLPs is that in a BNN the network parameter θ follows a probability distribution, and prediction evaluates the integral of \( f_B(x_i; \theta)\, P(\theta \mid (x_1, y_1), \dots, (x_n, y_n)) \) over all possible values of θ, whereas in an MLP the network parameters are fixed. The actual BNN model is more complicated than the example given here. Readers are referred to Liang (2003, 2005) for a more detailed description of the BNN model and its evolutionary Monte Carlo (EMC) training algorithm, as well as to an application of BNNs to predicting motor vehicle crashes (Xie et al., 2007).
12.7 Support vector machines

The support vector machine (SVM) is a machine learning approach originally developed by Cortes and Vapnik (1995) and Vapnik (1998). SVMs comprise a set of supervised learning methods that can be used for classification and regression analysis. A simple two-class classification problem is illustrated in Fig. 12.7. First, the input data points are mapped from the data space to a high-dimensional feature space using a nonlinear kernel function, such as a Gaussian kernel. The SVM model then constructs two separating hyperplanes (see the dashed lines in Fig. 12.7A) in the high-dimensional space to separate the outcome into two classes so that the distance between them is as large as possible. The region bounded by the two separating hyperplanes is called the "margin," and the optimal separating hyperplane lies in the middle (see the solid line in Fig. 12.7A). The idea is to search for the optimal separating hyperplane by maximizing the margin between the classes' closest points; the points lying on the boundaries are called support vectors. Fig. 12.7B represents a typical neural network with one input layer, one hidden layer, and one output layer.

FIGURE 12.7 Classification of SVM models (Li et al., 2012).

Assume the training input is defined as the vectors \( x_i \in R^{In} \) for i = 1, 2, …, N, which are the set of explanatory variables, and the training output is defined as \( y_i \in R^1 \), which is the crash injury severity level. The SVM maps \( x_i \) into a feature space \( R^h \) (h > In) of higher dimension using a function \( \varphi(x_i) \) to linearize the nonlinear relationship between \( x_i \) and \( y_i \). The estimation function of \( y_i \) is

\[ \hat{y} = w^T \varphi(x) + b \]   (12.16)
where w is a normal vector perpendicular to the hyperplane. Both w and b are coefficients derived by solving the following optimization problem. For the two-class classification problem, given a training set of instance-label pairs \( (x_i, y_i) \), the SVM model solves the optimization problem in Eq. (12.17) (Cortes and Vapnik, 1995):

\[ \min_{w,\, b,\, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i \]   (12.17)

subject to \( y_i \left( w^T \varphi(x_i) + b \right) \geq 1 - \xi_i, \quad \xi_i \geq 0 \)
where \( \xi_i \) is a slack variable that measures the misclassification error, and C is a regularization parameter that penalizes errors (e.g., a large C value indicates a small margin and vice versa). The coefficient C is still undetermined, but this optimization problem can be solved using Lagrange multipliers:

\[ \max_{\alpha} \min_{w,\, b,\, \xi} \left\{ \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \left[ y_i \left( w^T \varphi(x_i) + b \right) - 1 + \xi_i \right] - \sum_{i=1}^{N} \beta_i \xi_i \right\} \]   (12.18)

where \( \alpha_i, \beta_i > 0 \) are the Lagrange multipliers. The max sign means that, among all hyperplanes separating the data, there is a unique one yielding the maximum margin of separation between the classes. \( \varphi(x_i)^T \varphi(x_j) \) is the kernel function; a radial basis function (RBF) is often used and is defined as follows:

\[ \varphi(x_i)^T \varphi(x_j) = K(x_i, x_j) = \exp\!\left( -\gamma \| x_i - x_j \|^2 \right) \]   (12.19)

where γ is a parameter, sometimes parameterized as \( \frac{1}{2\sigma^2} \), and \( \| x_i - x_j \|^2 \) is the squared Euclidean distance between the two feature vectors. This model can easily be extended to handle multiclass classification tasks, such as crash injury severity on the KABCO scale.

An alternative is called ν-SVC, in which the parameter C is replaced by a parameter ν ∈ [0, 1] that controls the number of support vectors. An additional variable ρ is optimized to remove the user-chosen error penalty factor C; introducing the new variable ρ adds another degree of freedom to the margin (Schölkopf et al., 2000). The optimization problem is formulated as follows:

\[ \min_{w,\, \xi,\, \rho,\, b} \; \frac{1}{2} w^T w - \nu\rho + \frac{1}{N} \sum_{i=1}^{N} \xi_i \]   (12.20)

subject to \( y_i \left( w^T \varphi(x_i) + b \right) \geq \rho - \xi_i, \; \xi_i \geq 0, \; \forall i = 1, \dots, N, \; \rho \geq 0 \)
The R interface to "libsvm" is provided in the package "e1071," whose svm() function includes C-classification, ν-classification, one-class classification (novelty detection), ε-regression, and ν-regression; svm() also offers linear, polynomial, radial basis function, and sigmoid kernels, a formula interface, and k-fold cross-validation. For further implementation details on libsvm, see Chang and Lin (2001).
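For example (hypothetical data and tuning values), C-classification and ν-classification fits might look as follows:

```r
library(e1071)
# crashes: hypothetical data; severity is a factor
fit_c  <- svm(severity ~ ., data = crashes, type = "C-classification",
              kernel = "radial", cost = 1, gamma = 0.1,
              cross = 10)                      # 10-fold cross-validation
fit_nu <- svm(severity ~ ., data = crashes, type = "nu-classification",
              nu = 0.2)                        # the nu-SVC variant
pred <- predict(fit_c, newdata = crashes)
```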
SVM's strong performance has made it popular in highway safety analysis, both for injury severity classification and for crash count prediction. Li et al. (2012) applied C-SVM models to predict the injury severity of crashes at freeway diverge areas; the SVM model predicted crash injury severity better than the ordered probit (OP) model. However, the performance of an SVM model depends highly on the learning procedure, which involves functional mapping and parameter selection, and the authors suggested using kernel functions other than the basic RBF kernel to improve model performance. Li et al. (2008) applied ν-SVM to predict motor vehicle crashes on frontage roads in Texas. The SVM results were compared with traditional NB regression models, and several sample sizes were evaluated to examine data fitting and model prediction capabilities; the SVM models consistently produced lower mean absolute deviation and mean absolute percentage error values than the NB regression models for all sample sizes.
12.8 Sensitivity analysis

Despite their excellent predictive performance, machine learning techniques such as ANN and SVM have long been criticized as black-box solutions that cannot be used directly to identify the relationship between the outcome and the input variables. For safety professionals and researchers, identifying crash contributing factors and quantifying their effects, in both direction and magnitude, is critical to informed decision-making, targeted investment, and improved safety. One method safety practitioners have used to address this concern is sensitivity analysis (Delen et al., 2006; Li et al., 2008, 2012). Sensitivity analysis studies how the uncertainty in the output of a mathematical model can be attributed to different sources of uncertainty in its inputs; it can be used to measure the relationship between the input variables and the output of a trained neural network model (Principe et al., 2000). When performing a sensitivity analysis, the neural network's learning ability is disabled so that the network weights are not affected. Each input variable of the black-box model (e.g., ANN, SVM) is perturbed by a user-defined amount, with the other variables fixed at their respective means or medians. The outputs before and after the perturbation of each input variable are recorded, and the impact of each input variable on the output is calculated. For example, in crash injury severity prediction, the percent change in each severity level for a one-unit change in an input variable can be estimated.
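A model-agnostic sketch of this perturbation procedure is given below; it assumes a fitted model with a predict() method and numeric inputs, and the perturbation size is a user choice.

```r
# perturb one input at a time, holding the others at their means
sensitivity <- function(model, data, var, delta = 0.1) {
  base <- data
  for (v in setdiff(names(base), var)) {
    if (is.numeric(base[[v]])) base[[v]] <- mean(base[[v]])
  }
  up <- base
  up[[var]] <- up[[var]] + delta
  mean(predict(model, up) - predict(model, base))   # average output shift
}
# e.g., effects <- sapply(c("aadt", "speed"),
#                         function(v) sensitivity(fit, crashes, v))
```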
References
Abdelwahab, H.T., Abdel-Aty, M.A., 2002. Artificial neural networks and logit models for traffic safety analysis of toll plazas. Transport. Res. Rec. 1784, 115–125.
Breiman, L., 2001. Random forests. Mach. Learn. 45 (1), 5–32.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1998. Classification and Regression Trees. Chapman & Hall/CRC, Boca Raton, FL.
Chang, C.-C., Lin, C.-J., 2001. Training ν-support vector classifiers: theory and algorithms. Neural Comput. 13 (9), 2119–2147.
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P., 2002. SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357.
Chung, Y.S., 2013. Factor complexity of crash occurrence: an empirical demonstration using boosted regression trees. Accid. Anal. Prev. 61, 107–118.
Cortes, C., Vapnik, V., 1995. Support-vector network. Mach. Learn. 20, 273–297.
Delen, D., Sharda, R., Bessonov, M., 2006. Identifying significant predictors of injury severity in traffic accidents using a series of artificial neural networks. Accid. Anal. Prev. 38 (3), 434–444.
Depaire, B., Wets, G., Vanhoof, K., 2008. Traffic accident segmentation by means of latent class clustering. Accid. Anal. Prev. 40 (4), 1257–1266.
Duncan, C.S., Khattak, A.J., 1998. Applying the ordered probit model to injury severity in truck–passenger car rear-end collisions. Transport. Res. Rec. 1635, 63–71.
Elith, J., Leathwick, J., 2017. Boosted Regression Trees for Ecological Modeling. https://cran.r-project.org/web/packages/dismo/vignettes/brt.pdf.
Fraley, C., Raftery, A.E., 2002. Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc. 97 (458), 611–631.
Friedman, J.H., 2001. Greedy function approximation: a gradient boosting machine. Ann. Stat., 1189–1232.
Gers, F.A., Schmidhuber, J., Cummins, F., 1999. Learning to Forget: Continual Prediction with LSTM.
Hagenaars, J.A., McCutcheon, A.L. (Eds.), 2002. Applied Latent Class Analysis. Cambridge University Press.
Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Comput. 9 (8), 1735–1780.
Jeong, H., Jang, Y., Bowman, P.J., Masoud, N., 2018. Classification of motor vehicle crash injury severity: a hybrid approach for imbalanced data. Accid. Anal. Prev. 120, 250–261.
Kononov, J., Lyon, C., Allery, B., 2011. Relation of flow, speed, and density of urban freeways to functional form of a safety performance function. Transport. Res. Rec. 2236, 11–19.
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst., 1097–1105.
Lee, C., Li, X., 2015. Predicting driver injury severity in single-vehicle and two-vehicle crashes with boosted regression trees. Transport. Res. Rec. 2514 (1), 138–148.
Li, X., Lord, D., Zhang, Y., Xie, Y., 2008. Predicting motor vehicle crashes using support vector machine models. Accid. Anal. Prev. 40 (4), 1611–1618.
Li, Z., Liu, P., Wang, W., Xu, C., 2012. Using support vector machine models for crash injury severity analysis. Accid. Anal. Prev. 45, 478–486.
Liang, F., 2003. An effective Bayesian neural network classifier with a comparison study to support vector machine. Neural Comput. 15 (8), 1959–1989.
Liang, F., 2005. Bayesian neural networks for nonlinear time series forecasting. Stat. Comput. 15 (1), 13–29.
Ma, X., Dai, Z., He, Z., Ma, J., Wang, Y., Wang, Y., 2017. Learning traffic as images: a deep convolutional neural network for large-scale transportation network speed prediction. Sensors 17 (4), 818.
McLachlan, G.J., Peel, D., 2000. Finite Mixture Models. Wiley, New York.
Prati, G., Pietrantoni, L., Fraboni, F., 2017. Using data mining techniques to predict the severity of bicycle crashes. Accid. Anal. Prev. 101, 44–54.
Principe, J.C., Euliano, N.R., Lefebre, W.C., 2000. Neural and Adaptive Systems: Fundamentals through Simulations. John Wiley and Sons, New York.
Qin, X., Han, J., 2008. Variable selection issues in tree-based regression models. Transport. Res. Rec. 2061 (1), 30–38.
R Core Team, 2018. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
Schölkopf, B., Smola, A.J., Williamson, R.C., Bartlett, P.L., 2000. New support vector algorithms. Neural Comput. 12, 1207–1245.
Vapnik, V., 1998. Statistical Learning Theory. Wiley, New York.
Wang, K., Qin, X., 2015. Exploring driver error at intersections: key contributors and solutions. Transport. Res. Rec. 2514 (1), 1–9.
Xie, Y., Lord, D., Zhang, Y., 2007. Predicting motor vehicle collisions using Bayesian neural networks: an empirical analysis. Accid. Anal. Prev. 39 (5), 922–933.
Yuan, J., Abdel-Aty, M., Gong, Y., Cai, Q., 2019. Real-time crash risk prediction using long short-term memory recurrent neural network. Transport. Res. Rec. 2673 (4), 314–326.
Appendix A

Negative binomial regression models and estimation methods

This appendix presents the characteristics of negative binomial regression models and discusses their estimation methods. The material described below was originally written as Appendix C of the CrimeStat IV (version 4.02) program documentation (Lord and Park, 2013; Levine, 2015) and has been adapted for this textbook with permission.
Probability density and likelihood functions

The properties of the negative binomial models with and without spatial interaction are described in the next two sections.
Poisson-gamma model

The Poisson-gamma model has properties very similar to those of the Poisson model: the dependent variable \( y_i \) is modeled as a Poisson variable with mean \( \lambda_i \), where the model error is assumed to follow a gamma distribution. As its name implies, the Poisson-gamma is a mixture of two distributions and was first derived by Greenwood and Yule (1920). This mixture distribution was developed to account for the overdispersion that is commonly observed in discrete or count data (Lord et al., 2005). It became very popular because the conjugate distribution (same family of functions) has a closed form and leads to the negative binomial distribution. As discussed by Cook (2009), "the name of this distribution comes from applying the binomial theorem with a negative exponent." Two major parameterizations have been proposed, known as NB-1 and NB-2, the latter being the most commonly known and used; NB-2 is therefore described first. Other parameterizations exist but are not discussed here (see Maher and Summersgill, 1996; Hilbe, 2012).
NB-2 model

Suppose that we have a series of random counts that follows the Poisson distribution:

\[ P(y_i \mid \lambda_i) = \frac{e^{-\lambda_i} \lambda_i^{y_i}}{y_i!} \]   (A.1)

where \( P(y_i \mid \lambda_i) \) is the probability of roadway entity (or observation) i having \( y_i \) crashes per unit of time, and \( \lambda_i \) is the mean of the Poisson distribution. If the Poisson mean is assumed to have a random intercept term that enters the conditional mean function in a multiplicative manner, we get the following relationship (Cameron and Trivedi, 2013):

\[ \lambda_i = \exp\!\left( \beta_0 + \sum_{j=1}^{K} x'_{ij} \beta_j + \varepsilon_i \right) = \exp\!\left( \beta_0 + \sum_{j=1}^{K} x'_{ij} \beta_j \right) e^{\varepsilon_i} = \mu_i \nu_i \]   (A.2)

where \( \nu_i = e^{\varepsilon_i} \) is defined as a random intercept; \( \mu_i = \exp\!\left( \beta_0 + \sum_{j=1}^{K} x'_{ij} \beta_j \right) \) is the log-link between the Poisson mean and the covariates or independent variables x's; and the β's are the parameters or regression coefficients. The relationship can also be formulated using vectors, such that \( \mu_i = \exp(x'_i \beta) \). The marginal distribution of \( y_i \) can be obtained by integrating out the error term \( \nu_i \):

\[ f(y_i \mid \mu_i) = \int_0^{\infty} g(y_i; \mu_i, \nu_i)\, h(\nu_i)\, d\nu_i = E_{\nu}[g(y_i; \mu_i, \nu_i)] \]   (A.3)

where \( h(\nu_i) \) is a mixing distribution. In the case of the Poisson-gamma mixture, \( g(y_i \mid \mu_i, \nu_i) \) is the Poisson distribution and \( h(\nu_i) \) is the gamma distribution. This mixture has a closed form and leads to the NB distribution. Let us assume that the variable \( \nu_i \) follows a two-parameter gamma distribution:

\[ k(\nu_i \mid \psi, \delta) = \frac{\delta^{\psi}}{\Gamma(\psi)} \nu_i^{\psi - 1} e^{-\delta \nu_i}, \quad \psi > 0, \; \delta > 0, \; \nu_i > 0 \]   (A.4)
where \( E[\nu_i] = \psi/\delta \) and \( VAR[\nu_i] = \psi/\delta^2 \). Setting \( \psi = \delta \) gives us the one-parameter gamma, where \( E[\nu_i] = 1 \) and \( VAR[\nu_i] = 1/\psi \). (Note: in the main text, \( \psi = 1/\alpha = \phi \). The notation is changed in this appendix since more Greek letters are needed to describe spatial modeling.) We can transform the gamma distribution as a function of the Poisson mean, which gives the following density (Cameron and Trivedi, 2013):

\[ k(\lambda_i \mid \psi, \mu_i) = \frac{(\psi/\mu_i)^{\psi}}{\Gamma(\psi)} \lambda_i^{\psi - 1} e^{-(\psi/\mu_i)\lambda_i} \]   (A.5)

Combining Eqs. (A.1) and (A.5) into Eq. (A.3) yields the marginal distribution of \( y_i \):

\[ f(y_i \mid \mu_i, \psi) = \int_0^{\infty} \frac{\exp(-\lambda_i)\lambda_i^{y_i}}{y_i!} \cdot \frac{(\psi/\mu_i)^{\psi}}{\Gamma(\psi)} \lambda_i^{\psi - 1} e^{-(\psi/\mu_i)\lambda_i}\, d\lambda_i \]   (A.6)

Using the properties of the gamma function, it can be shown that Eq. (A.6) can be expressed as follows:

\[ f(y_i \mid \mu_i, \psi) = \frac{(\psi/\mu_i)^{\psi}}{\Gamma(\psi)\Gamma(y_i + 1)} \int_0^{\infty} \lambda_i^{y_i + \psi - 1} \exp\!\left[ -\lambda_i \left( 1 + \frac{\psi}{\mu_i} \right) \right] d\lambda_i = \frac{(\psi/\mu_i)^{\psi}\, \Gamma(\psi + y_i) \left( 1 + \frac{\psi}{\mu_i} \right)^{-(\psi + y_i)}}{\Gamma(\psi)\Gamma(y_i + 1)} \]   (A.7)

The PMF of the NB-2 model is therefore obtained from the last part of Eq. (A.7):

\[ f(y_i \mid \mu_i, \psi) = \frac{\Gamma(y_i + \psi)}{\Gamma(y_i + 1)\Gamma(\psi)} \left( \frac{\psi}{\mu_i + \psi} \right)^{\psi} \left( \frac{\mu_i}{\mu_i + \psi} \right)^{y_i} \]   (A.8)

Note that the PMF has also been defined in the literature as follows:

\[ f(y_i \mid \psi, \mu_i) = \binom{y_i + \psi - 1}{\psi - 1} \left( \frac{\psi}{\mu_i + \psi} \right)^{\psi} \left( \frac{\mu_i}{\mu_i + \psi} \right)^{y_i} \]   (A.9)

The first two moments of the NB-2 are the following:

\[ E[y_i \mid \mu_i, \psi] = \mu_i \]   (A.10)

\[ VAR[y_i \mid \mu_i, \psi] = \mu_i + \frac{\mu_i^2}{\psi} \]   (A.11)
The next steps consist of defining the log-likelihood (LL) function of the NB-2. It can be shown that

\[ \ln\!\left[ \frac{\Gamma(y_i + \psi)}{\Gamma(\psi)} \right] = \sum_{j=0}^{y_i - 1} \ln(j + \psi) \]   (A.12)

By substituting Eq. (A.12) into Eq. (A.8), the log-likelihood can be computed using the following equation:

\[ \ln L(\psi, \beta) = \sum_{i=1}^{n} \left\{ \left( \sum_{j=0}^{y_i - 1} \ln(j + \psi) \right) - \ln(y_i!) - (y_i + \psi)\ln(1 + \psi^{-1}\mu_i) + y_i \ln(\psi^{-1}) + y_i \ln(\mu_i) \right\} \]   (A.13)

Note also that the log-likelihood has been expressed as follows:

\[ \ln L(\psi, \beta) = \sum_{i=1}^{n} \left\{ y_i \ln\!\left( \frac{\psi\mu_i}{1 + \psi\mu_i} \right) - \psi^{-1}\ln(1 + \psi\mu_i) + \ln\Gamma(y_i + \psi^{-1}) - \ln\Gamma(y_i + 1) - \ln\Gamma(\psi^{-1}) \right\} \]   (A.14)

Recall that \( \mu_i = \exp(x'_i \beta) \). In the statistical literature, the Poisson-gamma model has also been defined as follows:

\[ P(y_i \mid \lambda_i) = \text{Poisson}(\lambda_i), \quad i = 1, 2, \dots, I \]   (A.15)

where the mean of the Poisson is structured as

\[ \lambda_i = f(x_i; \beta)\exp(\varepsilon_i) = \mu_i \exp(\varepsilon_i) \]   (A.16)

and where \( f(\cdot) \) is a function of the covariates, x is a vector of explanatory variables, β is (as before) a vector of estimable parameters, and \( \varepsilon_i \) is the model error, independent of all the covariates, with mean equal to 1 and variance equal to \( 1/\psi \).
NB-1 model

The NB-1 is very similar to the NB-2, but the parameterization of the variance (the second moment) is slightly different from that in Eq. (A.11):

\[ E[y_i \mid \mu_i, \psi] = \mu_i \]   (A.17)

\[ VAR[y_i \mid \mu_i, \psi] = \mu_i + \frac{\mu_i}{\psi} \]   (A.18)

The log-likelihood of the NB-1 is given by

\[ \ln L(\psi, \beta) = \sum_{i=1}^{n} \left\{ \left( \sum_{j=0}^{y_i - 1} \ln(j + \psi\mu_i) \right) - \ln(y_i!) - (y_i + \psi\mu_i)\ln(1 + \psi^{-1}) + y_i \ln(\psi^{-1}) \right\} \]   (A.19)

The NB-1 is usually less flexible in capturing the variance and is not used very often by analysts and statisticians. Interested readers are referred to Cameron and Trivedi (2013) for additional information about this parameterization.
Poisson-gamma model with spatial interaction

The Poisson-gamma (or NB) model can also incorporate data that are collected spatially. To capture this kind of data, a spatial autocorrelation term needs to be added to the model. Using the notation described in Eq. (A.15), the NB-2 model with spatial interaction can be defined as follows:

\[ P(y_i \mid \lambda_i) = \text{Poisson}(\lambda_i) \]   (A.20)

with the mean of the Poisson-gamma organized as

\[ \lambda_i = \exp\!\left( x'_i \beta + \varepsilon_i + \phi_i \right) \]   (A.21)

The assumption on the uncorrelated error term \( \varepsilon_i \) is the same as in the Poisson-gamma model described earlier and, as before, \( \mu_i = \exp(x'_i \beta) \). The third term in the expression, \( \phi_i \), is a spatial random effect, one for each observation. Together, the spatial effects are distributed as a complex multivariate normal (or Gaussian) density function; in other words, the second model is a spatial regression model within a negative binomial model. There are two common ways to express the spatial component, either as a conditional autoregressive (CAR) or as a simultaneous autoregressive (SAR) function (De Smith et al., 2007). The CAR model is expressed as

\[ E(y_i \mid \text{all } y_{j \ne i}) = \mu_i + \rho \sum_{j \ne i} w_{ij} (y_j - \mu_j) \]   (A.22)
where mi is the expected value for observation i; wij is a spatial weight1 between observation i and all other observations j (and for which all weights sum to 1.0); and, r is a spatial autocorrelation parameter that determines the size and nature of the spatial neighborhood effect. The summation of the spatial weights times the difference between the observed and predicted values is over all other observations (i s j). The reader is referred to Haining (1990) and LeSage (2001) for further details. The SAR model has a simpler form and can be expressed as X
E yi allyjsi ¼ mi þ r wij yj (A.23) isj
ij
where the terms are as defined earlier. Note that in the CAR model the spatial weights are applied to the difference between the observed and expected values at all other locations whereas in the SAR model, the weights are applied directly to the observed value. In practice, the CAR and SAR models produce very similar results. Additional information about the Poisson-gamma-CAR is described in the following.
Estimation methods
This section describes two methods that can be used for estimating the coefficients of the NB regression models: maximum likelihood estimation (MLE) and the Bayesian approach based on Markov chain Monte Carlo (MCMC) sampling.
Maximum likelihood estimation
The coefficients or parameters of the NB regression model are estimated by setting the first-order conditions of the log-likelihood equal to zero. There are two first-order equations, one for the model's parameters and one for the dispersion parameter (Lawless, 1987). The two for the NB-2 are as follows:

$$\sum_{i=1}^{n}\frac{y_i-\mu_i}{1+\phi^{-1}\mu_i}\,\mathbf{x}_i=\mathbf{0} \quad (A.24a)$$
$$\sum_{i=1}^{n}\left\{\frac{1}{\left(\phi^{-1}\right)^{2}}\left(\ln\left(1+\phi^{-1}\mu_i\right)-\sum_{j=0}^{y_i-1}\frac{1}{j+\phi}\right)+\frac{y_i-\mu_i}{\phi^{-1}\left(1+\phi^{-1}\mu_i\right)}\right\}=0 \quad (A.24b)$$

where $\mathbf{x}_i$ is a vector of covariates. The series of equations can be solved using the Newton-Raphson procedure or the scoring algorithm. The confidence intervals on the $\beta$s and $\phi^{-1}$ can be calculated using the covariance matrix, which is assumed to be normally distributed:

$$\begin{bmatrix}\hat{\boldsymbol{\beta}}\\ \hat{\phi}^{-1}\end{bmatrix}\sim N\left(\begin{bmatrix}\boldsymbol{\beta}\\ \phi^{-1}\end{bmatrix},\begin{bmatrix}\mathrm{VAR}\big[\hat{\boldsymbol{\beta}}\big] & 0\\ 0 & \mathrm{VAR}\big[\hat{\phi}^{-1}\big]\end{bmatrix}\right) \quad (A.25)$$

where
$$\mathrm{VAR}\left[\hat{\boldsymbol{\beta}}\right]=\left(\sum_{i=1}^{n}\frac{\mu_i\,\mathbf{x}_i\mathbf{x}_i'}{1+\phi^{-1}\mu_i}\right)^{-1} \quad (A.26a)$$

$$\mathrm{VAR}\left[\hat{\phi}^{-1}\right]=\left(\sum_{i=1}^{n}\left\{\frac{\left[\ln\left(1+\phi^{-1}\mu_i\right)\right]^{2}}{\left(\phi^{-1}\right)^{4}}-\sum_{j=0}^{y_i-1}\frac{1}{(j+\phi)^{2}}+\frac{\mu_i}{\phi^{-1}\left(1+\phi^{-1}\mu_i\right)}\right\}\right)^{-1} \quad (A.26b)$$
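The following sketch illustrates the estimation step in practice. Rather than hand-coding the Newton-Raphson iterations for Eqs. (A.24a) and (A.24b), it maximizes the NB-2 log-likelihood with a quasi-Newton optimizer and uses the resulting inverse-Hessian approximation in place of the analytical covariance matrix of Eqs. (A.25)-(A.26); the optimizer choice, starting values, and simulated data are our assumptions, not part of the text.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import nbinom

def fit_nb2(X, y):
    """Fit the NB-2 model by maximum likelihood. A quasi-Newton optimizer
    (BFGS) stands in for the Newton-Raphson/scoring iterations; its
    inverse-Hessian approximation plays the role of the covariance matrix
    in Eqs. (A.25)-(A.26). phi is estimated on the log scale so it stays
    positive."""
    k = X.shape[1]

    def negloglik(theta):
        beta, phi = theta[:k], np.exp(theta[k])
        mu = np.exp(X @ beta)
        # NB-2 written as nbinom(n=phi, p=phi/(phi+mu)) so that E[y]=mu
        return -np.sum(nbinom.logpmf(y, phi, phi / (phi + mu)))

    res = minimize(negloglik, np.zeros(k + 1), method="BFGS")
    beta_hat, phi_hat = res.x[:k], np.exp(res.x[k])
    cov_hat = res.hess_inv            # approx. covariance of (beta, ln phi)
    return beta_hat, phi_hat, cov_hat

# Hypothetical usage with simulated data
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
mu = np.exp(0.5 + 0.8 * X[:, 1])
y = nbinom.rvs(2.0, 2.0 / (2.0 + mu), random_state=rng)
print(fit_nb2(X, y))
```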
It should be pointed out that the NB-2 with spatial interaction model (Poisson-gamma-CAR) cannot be estimated using the MLE method. It needs to be estimated using the MCMC technique, which is described next.
Markov chain Monte Carlo estimation
This section presents how to draw samples from the posterior distributions of the Poisson-gamma model and the Poisson-gamma-conditional autoregressive (CAR) model using the MCMC technique.
MCMC Poisson-gamma model
The Poisson-gamma model can be formulated as a two-stage hierarchical Poisson model:

$$(\text{Likelihood})\qquad y_i\mid\lambda_i\sim\text{Poisson}(\lambda_i) \quad (A.27a)$$

$$(\text{First stage})\qquad \lambda_i\mid\phi\sim\pi_\lambda(\phi) \quad (A.27b)$$

$$(\text{Second stage})\qquad \phi\sim\pi_\phi(\cdot) \quad (A.27c)$$
where $\pi_\lambda(\phi)$ is the prior distribution imposed on the Poisson mean, $\lambda_i$, with a prior parameter $\phi$, and $\pi_\phi(\cdot)$ is the hyper-prior on $\phi$ with known hyperparameters ($a$, $b$, for example). In Eqs. (A.27b) and (A.27c), if we specify $\lambda_i=\nu_i\mu_i$ (where $\nu_i\,(=e^{\varepsilon_i})\sim\text{Gamma}(\phi,\phi)$ in the first stage and $\phi\sim\text{Gamma}(a,b)$ in the second stage), the result is exactly the NB-2 regression model described in the previous section. With this specification, it is also easy to show that $\lambda_i$ in the first stage follows $\text{Gamma}(\phi,\phi/\mu_i)$, as shown in Eq. (A.5). Note that $\mu_i=\exp\left(\mathbf{x}_i'\boldsymbol{\beta}\right)$ as described earlier. For simplicity, if a flat uniform prior is assumed for each $\beta_j$ ($j=0,1,\ldots,J$) and the parameters $\beta$ and $\phi$ are mutually independent, the joint posterior distribution for the Poisson-gamma model is defined as

$$\pi(\boldsymbol{\lambda},\boldsymbol{\beta},\phi\mid\mathbf{y})\propto f(\mathbf{y}\mid\boldsymbol{\lambda})\cdot\pi(\boldsymbol{\lambda}\mid\boldsymbol{\beta},\phi)\cdot\pi(\beta_0)\cdots\pi(\beta_J)\cdot\pi(\phi\mid a,b) \quad (A.28a)$$

$$\propto\left[\prod_{i=1}^{n}\frac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!}\right]\left[\prod_{i=1}^{n}\frac{\left(\phi e^{-\mathbf{x}_i'\boldsymbol{\beta}}\right)^{\phi}}{\Gamma(\phi)}\,\lambda_i^{\phi-1}e^{-\phi e^{-\mathbf{x}_i'\boldsymbol{\beta}}\lambda_i}\right]\phi^{a-1}e^{-b\phi} \quad (A.28b)$$
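For illustration, the unnormalized log of Eq. (A.28b) can be coded directly; the hyperparameter values $a=b=0.01$ below are hypothetical defaults, not values prescribed by the text:

```python
import numpy as np
from scipy.special import gammaln

def log_joint_posterior(lam, beta, phi, X, y, a=0.01, b=0.01):
    """Unnormalized log of Eq. (A.28b): Poisson likelihood, conditional
    Gamma(phi, phi/mu_i) prior on lambda_i, flat priors on the betas,
    and a Gamma(a, b) hyper-prior on phi."""
    mu = np.exp(X @ beta)
    log_lik   = np.sum(-lam + y * np.log(lam) - gammaln(y + 1.0))
    log_prior = np.sum(phi * np.log(phi / mu) - gammaln(phi)
                       + (phi - 1.0) * np.log(lam) - (phi / mu) * lam)
    log_hyper = (a - 1.0) * np.log(phi) - b * phi
    return log_lik + log_prior + log_hyper
```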
The parameters of interest are $\boldsymbol{\lambda}=(\lambda_1,\ldots,\lambda_n)$, $\boldsymbol{\beta}=(\beta_0,\beta_1,\ldots,\beta_J)$, and the inverse dispersion parameter $\phi$ (or the dispersion parameter $\gamma=1/\phi$). Ideally, samples of each parameter would be drawn from the joint posterior distribution. However, the form in Eq. (A.28b) is very complex, and it is difficult to draw samples from such a distribution. Consequently, samples are drawn from the full conditional distributions sequentially (that is, one at a time). This iterative process is called the Gibbs sampling method. Therefore, once the full conditionals are known for each parameter, Gibbs sampling can be implemented by drawing samples of each parameter sequentially. The full conditional distributions for each parameter of the Poisson-gamma model can be easily derived from Eq. (A.28b) and are given as (Park, 2010)

$$\pi(\lambda_i\mid\boldsymbol{\beta},\phi,y_i)\propto f(y_i\mid\lambda_i)\cdot\pi(\lambda_i\mid\boldsymbol{\beta},\phi)=\text{Gamma}\left(y_i+\phi,\;1+\phi e^{-\mathbf{x}_i'\boldsymbol{\beta}}\right),\quad i=1,2,\ldots,n \quad (A.29a)$$

$$\pi(\beta_j\mid\boldsymbol{\lambda},\boldsymbol{\beta}_{-j},\phi)\propto\pi(\boldsymbol{\lambda}\mid\beta_j,\phi)\cdot\pi(\beta_j)=\exp\left\{-\phi\left[\sum_{i=1}^{n}x_{ij}\beta_j+\sum_{i=1}^{n}\lambda_i e^{-\mathbf{x}_i'\boldsymbol{\beta}}\right]\right\},\quad j=0,1,\ldots,J \quad (A.29b)$$
$$\pi(\phi\mid\boldsymbol{\lambda},\boldsymbol{\beta},a,b)\propto\pi(\boldsymbol{\lambda}\mid\boldsymbol{\beta},\phi)\cdot\pi(\phi\mid a,b)=\exp\left\{-n\ln\Gamma(\phi)+\phi\left[n\ln\phi-\sum_{i=1}^{n}\left(\mathbf{x}_i'\boldsymbol{\beta}-\ln\lambda_i+\lambda_i e^{-\mathbf{x}_i'\boldsymbol{\beta}}\right)\right]+(a-1)\ln\phi-b\phi\right\} \quad (A.29c)$$
However, unlike Eq. (A.29a), the full conditional distributions for the $\beta$s and $\phi$ (Eqs. (A.29b) and (A.29c)) do not belong to any standard distribution family, so it is not easy to draw samples directly from them. Although there are several approaches to sampling from such complex distributions, a popular algorithm used in practice is the Metropolis-Hastings (MH) algorithm with slice sampling of the individual parameters. The MCMC sampling procedure using the slice sampling algorithm within Gibbs sampling can therefore be summarized as follows (a code sketch follows the list):

1. Start with initial values $\boldsymbol{\lambda}^{(0)}$, $\boldsymbol{\beta}^{(0)}$, and $\phi^{(0)}$. Repeat the following steps for $t=1,\ldots,T_0,\ldots,T_0+T$.
2. Step 1: Conditional on knowing $\boldsymbol{\beta}^{(t-1)}$ and $\phi^{(t-1)}$, draw $\lambda_i^{(t)}$ from Eq. (A.29a) independently for $i=1,2,\ldots,n$.
3. Step 2: Conditional on knowing $\boldsymbol{\lambda}^{(t)}$ and $\phi^{(t-1)}$, draw $\beta_j^{(t)}$ from Eq. (A.29b) independently for $j=0,1,\ldots,J$ using the slice sampling method.
4. Step 3: Conditional on knowing $\boldsymbol{\lambda}^{(t)}$ and $\boldsymbol{\beta}^{(t)}$, draw $\phi^{(t)}$ from Eq. (A.29c) using the slice sampling method.
5. Step 4: Store the values of all parameters (i.e., $\boldsymbol{\lambda}^{(t)}$, $\boldsymbol{\beta}^{(t)}$, and $\phi^{(t)}$). Increase $t$ by one and return to Step 1.
6. Step 5: Discard the first $T_0$ draws as a burn-in period. After equilibrium is reached at the $T_0$th iteration, the sampled values are averaged to provide consistent estimates of the parameters:

$$\hat{E}[h(q)]=\frac{\sum_{t=T_0+1}^{T_0+T}h(q)^{(t)}}{T} \quad (A.30)$$

where $q$ denotes any parameter of interest in the model.
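The sketch below implements this sampling scheme under simplifying assumptions of ours: Step 1 uses the exact gamma full conditional of Eq. (A.29a), while the draws for the $\beta$s and $\phi$ use a random-walk Metropolis update in place of the slice sampler described above (both target the full conditionals in Eqs. (A.29b) and (A.29c)); the tuning constants, hyperparameters, and initial values are hypothetical.

```python
import numpy as np
from scipy.special import gammaln

def mcmc_nb2(X, y, T0=2000, T=5000, step=0.1, a=0.01, b=0.01, seed=1):
    """MCMC for the Poisson-gamma model: conjugate gamma draws for the
    lambda_i (Eq. A.29a) and random-walk Metropolis updates for beta and
    phi, substituted here for the slice sampler described in the text."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    beta, phi = np.zeros(k), 1.0
    lam = y + 0.5                       # crude initial values

    def log_cond(beta, phi):
        # log of pi(lambda | beta, phi) * pi(phi | a, b); flat priors on beta
        mu = np.exp(X @ beta)
        return (np.sum(phi * np.log(phi / mu) - gammaln(phi)
                       + (phi - 1.0) * np.log(lam) - (phi / mu) * lam)
                + (a - 1.0) * np.log(phi) - b * phi)

    draws = []
    for t in range(T0 + T):
        # Step 1: conjugate gamma draw for each lambda_i (Eq. A.29a)
        mu = np.exp(X @ beta)
        lam = rng.gamma(shape=y + phi, scale=1.0 / (1.0 + phi / mu))
        # Step 2: random-walk Metropolis for beta (targets Eq. A.29b)
        prop = beta + step * rng.normal(size=k)
        if np.log(rng.uniform()) < log_cond(prop, phi) - log_cond(beta, phi):
            beta = prop
        # Step 3: random-walk Metropolis for phi on the log scale
        # (targets Eq. A.29c; ln(prop_phi/phi) is the log-scale Jacobian)
        prop_phi = phi * np.exp(step * rng.normal())
        if np.log(rng.uniform()) < (log_cond(beta, prop_phi)
                                    - log_cond(beta, phi)
                                    + np.log(prop_phi / phi)):
            phi = prop_phi
        if t >= T0:                     # discard burn-in, store the rest
            draws.append(np.concatenate([beta, [phi]]))
    return np.array(draws).mean(axis=0) # posterior means, as in Eq. (A.30)
```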
MCMC Poisson-gamma-CAR model
For the Poisson-gamma-CAR model, the only difference from the Poisson-gamma model is the way $\lambda_i$ is structured. The mean of the Poisson-gamma-CAR is organized as

$$\lambda_i=\exp\left(\mathbf{x}_i'\boldsymbol{\beta}+\varepsilon_i+\varphi_i\right) \quad (A.31)$$

where $\varphi_i$ is a spatial random effect, one for each observation. As in the Poisson-gamma model, we specify $e^{\varepsilon_i}\sim\text{Gamma}(\phi,\phi)$ to model the independent error term. To model the spatial effect, $\varphi_i$, we assume the following:

$$\pi(\varphi_i\mid\boldsymbol{\varphi}_{-i})\propto\exp\left(-\frac{w_{i+}}{2\sigma_\varphi^2}\left[\varphi_i-\rho\sum_{j\neq i}\frac{w_{ij}}{w_{i+}}\varphi_j\right]^2\right) \quad (A.32)$$

where $\pi(\varphi_i\mid\boldsymbol{\varphi}_{-i})$ is the probability of a spatial effect given the spatial effects at all other locations, and $w_{i+}=\sum_{j\neq i}w_{ij}$, which sums the weights over all records $j$ (i.e., all other zones) except the record of interest, $i$. This formulation gives a conditional normal density with mean $\rho\sum_{j\neq i}\frac{w_{ij}}{w_{i+}}\varphi_j$ and variance $\frac{\sigma_\varphi^2}{w_{i+}}$. The parameter $\rho$ determines the direction and overall magnitude of the spatial effects. The term $w_{ij}$ is a spatial weight function between zones $i$ and $j$. In the algorithm, the variance is expressed through its inverse (the precision), $\sigma_\varphi^2=1/\tau_\varphi$, and the same variance is used for all observations. We define the spatial weight matrix $\mathbf{W}$ with the entries $w_{ij}$ and the diagonal entries $w_{ii}=0$. The matrix $\mathbf{D}$ is defined as a diagonal matrix with the diagonal entries $w_{i+}$. Sun et al. (1999) show that if $k_{\min}^{-1}<\rho<k_{\max}^{-1}$, where $k_{\min}$ and $k_{\max}$ are the smallest and largest eigenvalues of $\mathbf{W}\mathbf{D}^{-1}$, respectively, then $\boldsymbol{\varphi}$ has a multivariate normal distribution with mean $\mathbf{0}$ and nonsingular covariance matrix $\sigma_\varphi^2(\mathbf{D}-\rho\mathbf{W})^{-1}$:

$$\boldsymbol{\varphi}=(\varphi_1,\ldots,\varphi_n)'\sim\text{MVN}_n\left(\mathbf{0},\sigma_\varphi^2\mathbf{A}^{-1}\right)=\frac{|\mathbf{A}|^{1/2}}{\left(2\pi\sigma_\varphi^2\right)^{n/2}}\exp\left(-\frac{1}{2\sigma_\varphi^2}\boldsymbol{\varphi}'\mathbf{A}\boldsymbol{\varphi}\right) \quad (A.33)$$

where $\mathbf{A}=(\mathbf{D}-\rho\mathbf{W})$ and $k_{\min}^{-1}$