Smart Urban Mobility: Transport Planning in the Age of Big Data and Digital Twins 0128207175, 9780128207178

Smart Urban Mobility: Transport Planning in the Age of Big Data and Digital Twins explores the data-driven paradigm shif

English Pages 265 [268] Year 2023

Table of contents :
Front Cover
SMART URBAN MOBILITY 
SMART URBAN MOBILITY: TRANSPORT PLANNING IN THE AGE OF BIG DATA AND DIGITAL TWINS
Copyright
Contents
Preface
1 - Introduction
1.1 Objectives of the chapter
1.2 Word cloud
1.3 Introduction
1.4 Background
1.5 Why smart mobility and why now?
1.6 Audiences
1.6.1 Transport planners and practitioners
1.6.2 City officials and policy makers
1.6.3 University professors and students
1.6.4 Business analysts, data scientists, data engineers, and developers
1.6.5 Multidisciplinary urban planning and mobility projects managers
1.6.6 Citizen scientists and members of citizens' participation initiatives
1.6.7 Smart city and smart mobility advocates, consultants, and implementers
1.7 Chapter structure
1.7.1 Topics/chapters
1.7.1.1 Chapter 2: Introduction to smart mobility
1.7.1.2 Chapter 3: The new challenge of smart urban mobility
1.7.1.3 Chapter 4: Small and big data for mobility studies
1.7.1.4 Chapter 5: Data analytics
1.7.1.5 Chapter 6: Four step transport planning model and big data
1.7.1.6 Chapter 7: Data driven mobility management
1.7.1.7 Chapter 8: Digital twin
1.7.1.8 Chapter 9: Summary
References
2 - Introduction to smart mobility
2.1 Objectives of the chapter
2.2 Word cloud
2.3 Mobility
2.3.1 Terminology/definitions
2.3.2 Urban mobility
2.4 Smart city
2.4.1 Sustainable city
2.4.2 Quality of life
2.4.3 Role of the new technologies in smart city
2.4.4 Responsive city
2.4.5 Smart city domains
2.5 Smart mobility
References
3 - The new challenge of smart urban mobility
3.1 Objectives of the chapter
3.2 Word cloud
3.3 Urban population trends
3.3.1 Key urban population-related challenges
3.4 Multimodality
3.4.1 What transport modes exist in the city?
3.4.1.1 What is the difference between multimodal and intermodal transport?
3.4.1.2 What are sustainable transport modes?
3.4.2 Key multimodal mobility-related challenges
3.4.3 Example: transport mode competitiveness in an urban area
3.5 Connected mobility
3.5.1 Key connected mobility-related challenges
3.5.1.1 Data versus information
3.5.1.2 Some of the key mobility data-related challenges
3.5.1.2.1 Data standardization
3.5.1.2.2 Data availability
3.5.1.2.3 Data privacy
3.5.1.2.4 Measurability and quantification
3.5.1.2.5 Data openness
3.6 ConnectedX
3.6.1 Connected vehicles
3.6.2 Connected infrastructure
3.6.3 Connected traveler
3.6.4 Connected freight
3.6.5 Service-oriented perspective of ConnectedX
3.6.6 Autonomous vehicles
3.6.6.1 Example: autonomous vehicles (I)
3.6.6.2 Example: autonomous vehicles (II)
3.6.7 ConnectedX-related challenges
3.7 Electric vehicles
3.7.1 Electric vehicles related challenges
3.8 Shared mobility
3.8.1 Shared mobility-related challenges
3.8.2 Example: impact of shared mobility practices on electric vehicles
3.9 Mobility as a service
3.9.1 MaaS-related challenges
3.10 Governance
3.10.1 Governance-related challenges
3.11 Smart mobility innovations
3.11.1 Smart mobility innovation-related challenges
3.12 Change management
3.12.1 Change management-related challenges
3.13 State of the affairs
References
4 - Small and big data for mobility studies
4.1 Objectives of the chapter
4.2 Word cloud
4.3 Introduction
4.4 Traditional data collection approaches
4.5 Big data for mobility studies
4.5.1 Global navigation satellite systems data
4.5.1.1 Example: GNSS data (I)
4.5.1.2 Example: GNSS data (II)
4.5.2 Mobile network data
4.5.2.1 Example: mobile network data (I)
4.5.3 Mobile sensed data
4.5.3.1 Example: mobile sensed data (I)
4.5.3.2 Example: mobile sensed data (II)
4.5.4 Comparison of the three main big data sources for mobility studies
4.5.5 Other big data sources for mobility studies
4.5.5.1 Location-oriented sensing
4.5.5.1.1 Computer vision techniques
4.5.5.1.1.1 Example: computer vision
4.5.5.1.2 Bluetooth data
4.5.5.1.2.1 Example: Bluetooth data
4.5.5.1.3 Ticketing data
4.5.5.1.3.1 Example: ticketing data
References
5 - Data analytics
5.1 Objectives of the chapter
5.2 Word cloud
5.3 Data analytics introduction
5.4 Data analytics workflow
5.4.1 Descriptive analytics
5.4.1.1 Descriptive statistics
5.4.1.1.1 Measures of dispersion and central tendencies
5.4.1.1.1.1 Arithmetic mean
5.4.1.1.1.2 Median and mode
5.4.1.1.1.3 Minimum and maximum
5.4.1.1.1.4 Range
5.4.1.1.1.5 Quartile
5.4.1.1.1.6 Variance
5.4.1.1.1.7 Standard deviation
5.4.1.1.1.8 Skewness and kurtosis
5.4.1.2 Exploratory data analysis
5.4.2 Diagnostic analytics
5.4.2.1 Example: diagnostic analytics
5.4.3 Predictive analytics
5.4.4 Prescriptive analytics
5.4.4.1 Example: predictive analytics
5.5 Machine learning
5.5.1 Supervised learning
5.5.2 Unsupervised learning
5.5.3 Reinforcement learning
5.5.4 Building and evaluating a machine learning algorithm
5.5.5 Common machine learning methods used for mobility analytics
5.5.5.1 Regression methods
5.5.5.2 Support vector machines
5.5.5.3 Decision tree
5.5.5.4 Artificial neural networks
5.5.5.5 kNN
5.5.5.6 Clustering
5.5.5.7 K-means clustering
5.5.5.8 Cross-validation
5.5.6 Example classification: transport mode recognition
5.5.7 Example regression: travel time estimation
5.6 Data anonymization
5.6.1 Randomization
5.6.2 Generalization
5.6.3 Pseudonymization
References
6 - Transport planning and big data
6.1 Objectives of the chapter
6.2 Word cloud
6.3 Four-step transportation planning model
6.3.1 Trip generation step
6.3.2 Trip distribution step
6.3.3 Mode choice step
6.3.4 Trip assignment step
6.4 Literature review of big data advances for four-step transport planning model
6.4.1 Literature review of big data advances for trip generation step
6.4.2 Example: detection of trip generation zones for tourism population
6.4.3 Literature review of big data advances for trip distribution step
6.4.4 Example: construction of OD matrix from big data
6.4.5 Literature review of big data advances for mode choice step
6.4.6 Example: rule-based transport mode detection from GNSS and GIS data
6.4.7 Literature review of big data advances for route assignment step
6.4.8 Example: map matching
References
7 - Data-driven mobility management
7.1 Objectives of the chapter
7.2 Word cloud
7.3 Introduction
7.4 Big data-driven mobility system monitoring
7.5 Analytics-based mobility management decision making support
7.6 Example: incentivization of mobility behavior
7.6.1 Theory of planned behavior as a conceptual framework
7.6.2 Applied market segmentation
7.6.3 Obtained insights
7.7 Example: mobility management as a service
7.7.1 The MMaaS architecture
References
8 - Digital twin
8.1 Objectives of the chapter
8.2 Word cloud
8.3 Digital twin
8.3.1 Digital twin applications and complexities
8.3.2 Digital twin architecture
8.3.3 Digital twins' due time
8.3.4 Digital twin-related initiatives
8.4 Example: electric vehicle's digital shadow
8.5 Example: urban air mobility
References
9 - Summary
9.1 Objectives of the chapter
9.2 Word cloud
9.3 About the book
9.4 Features
9.5 Summary of chapters
9.5.1 Chapter 1: introduction
9.5.2 Chapter 2: introduction to smart mobility
9.5.3 Chapter 3: the new challenge of smart urban mobility
9.5.3.1 Examples in Chapter 3
9.5.4 Chapter 4: small and big data for mobility studies
9.5.4.1 Examples in Chapter 4
9.5.5 Chapter 5: data analytics
9.5.5.1 Examples in Chapter 5
9.5.6 Chapter 6: four step transport planning model and big data
9.5.6.1 Examples in Chapter 6
9.5.7 Chapter 7: data-driven mobility management
9.5.7.1 Examples in Chapter 7
9.5.8 Chapter 8: digital twin
9.5.8.1 Examples in Chapter 8
9.5.9 Chapter 9: summary
9.6 Some smart mobility lessons learned
9.6.1 User needs
9.6.2 Strategy
9.6.3 Data and technology
List of acronyms
Index
Back Cover

SMART URBAN MOBILITY
TRANSPORT PLANNING IN THE AGE OF BIG DATA AND DIGITAL TWINS

Ivana Cavar Semanjski
Faculty of Engineering and Architecture, Department of Industrial Systems Engineering and Product Design, Ghent University, Ghent, Belgium

Elsevier
Radarweg 29, PO Box 211, 1000 AE Amsterdam, Netherlands
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States

Copyright © 2023 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

ISBN: 978-0-12-820717-8

For information on all Elsevier Science publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Stacy Masucci
Acquisitions Editor: Kathryn Eryilmaz
Editorial Project Manager: Aera F. Gariguez
Production Project Manager: Selvaraj Raviraj
Cover Designer: Christian J. Bilbow
Typeset by TNQ Technologies

Preface

Traditionally, transport studies have mainly relied on surveys, or so-called small data, to understand travel behavior and mobility dynamics. The recent availability of various location-based techniques, such as Global Navigation Satellite Systems (GNSS) or Call Detail Records (CDR), has created new opportunities to better understand mobility needs and patterns. However, to bridge between traditional techniques, new possibilities, and mobility planning or decision-making needs, plenty of challenges still need to be addressed. The author believes that there is much to be gained from the meaningful application of big data analytics in the smart city and smart mobility context and hopes that this book will be instrumental in facilitating this fusion for the good of urban mobility and increased quality of life. The book addresses current challenges that arise in the urban mobility context and provides a systematic overview of big data sources and techniques, as well as data-driven advances and identified limitations, together with a number of examples and good practices, hopefully providing a point of reference to all those who are interested in exploring the topic further.

With this, the author would also like to thank the many people who have provided impact, influence, and support in the creation of this book. To my parents, Mate and Mirjana, for unconditional support and understanding; to my husband Silvio for all the kindness, encouragement, and constructive discussions that fueled this creative process; to Dragana and Dario, and to Toni and Petar, for all the joy that they have brought and the much needed focus on important things. Also, to the whole Ghent University ISyE team, for fruitful collaboration and shared research curiosity, as well as the knowledge and inspiration that they have brought.

CHAPTER 1

Introduction

Smart Urban Mobility https://doi.org/10.1016/B978-0-12-820717-8.00003-8

1.1 Objectives of the chapter

• What is the background of this book?
• What topics are addressed in this book?
• Why is it relevant now?
• Who is the intended audience of this book?
• What can the intended audiences expect to achieve by reading this book?

1.2 Word cloud

Fig. 1.1 illustrates a word cloud with an overview of the content of this chapter.

FIGURE 1.1 Introduction chapter word cloud.

1.3 Introduction

This book explores the data-driven paradigm shift in urban mobility planning and how well-established practices that are not likely to be abandoned by transport planners, and strong data analytics efforts, mainly invested by computer scientists, can be better aligned to fit transport planning practices and mobility management needs. Hence, the book explores the boundaries and interfaces between major subject areas such as urban, mobility, and smart city planning, and data science and analytics. This chapter sets the scene for the remainder of the book, gives more details regarding the background and intended audiences, and explains how the content will be tackled.

1.4 Background

The beginning of the 2020s finds connected multimodal mobility at the forefront of urban transformation. Cities grow larger, and based on the United Nations (UN) estimates, it is expected that the share of Europe’s population living in urban areas will increase from today’s 74% to about 83.7% in 2050 (United Nations, 2018). At the same time, cities increasingly face problems caused by transport and traffic. The European Environment Agency reports that congestion costs nearly 2% of the European Union’s (EU) Gross Domestic Product annually, while 85% of the EU’s urban population is exposed to fine particulate matter PM2.5 at levels deemed harmful to health (European Environment Agency, 2016). To face this challenge, cities increasingly strive to implement SUMPs (Sustainable Urban Mobility Plans), aiming to increase the quality of life in their areas and encourage economic growth. Cities currently generate 80% (World Bank, 2020) of all economic growth; hence, their advances resonate strongly through a wider area.

On the other hand, over the past decades, there has been strong development in the domain of sensing technologies. It comes as no surprise that today most of us carry mobile phones with various integrated sensors (e.g., Global Navigation Satellite System sensors, accelerometer, microphone, etc.) or interact with different sensors while doing our daily activities (e.g., bank card payments, inductive loops on roads, etc.). All such interactions and activities produce torrents of data as a by-product of their operations, and some estimates indicate that this equals an average of 1 GB of content per individual daily (Semanjski et al., 2016). This huge data stream is often referred to as big data and, among many things, is affecting the way we shape our urban space into smart cities and our mobility into smart mobility.

These circumstances make this an exciting time to be an urban mobility professional. Increased urban mobility complexity, rapidly rising data science capabilities, and mobility-related data availability together create a stimulating landscape for innovations and new possibilities, and it is a privilege to have the opportunity to write this book and communicate about many of these advances. Much of the book’s content is derived from a series of research and innovation activities that I had the privilege to be involved in over the past decades, as a researcher, project manager, or expert evaluator for funding bodies such as the European Commission. These activities afforded a unique opportunity to collaborate with different domains and experts linked with urban planning, decision making, and data science, from both the private and public sectors. In my modest opinion, much of the success related to data-driven mobility innovation is closely linked with a good understanding of the balance between the capabilities and limitations of big data applications as well as mobility planning practices. Hence, a good understanding of data availability and limitations as well as mobility and urban planning needs is crucial. This brings us to the motivation to write this book, that is, to support bridging between the mobility planning and data science domains as key driving forces behind smart mobility developments.

1.5 Why smart mobility and why now?

Both urban and mobility planning cater to human needs. Understanding human behavior seems to be a core element in achieving liveable cities, and big data availability provides us with a possibility to gain insights into this behavior at levels and volumes that were never possible before. This provides significant opportunities to improve decision making at all levels. The book is my modest attempt to share experience gained through the fusion of a background in transport and traffic sciences with a passion for big data analytics. These two branches are entwined throughout all of my professional experience, and this path has been full of revelations and new insights into how data analytics can benefit urban mobility.

The aim of this book is to build awareness, interest, and understanding within the smart mobility community, motivating researchers and practitioners to become familiar with and venture into new big data-driven possibilities for smart mobility. And while the term big data itself might seem omnipresent lately, there are significant opportunities and added value that can be achieved by integrating big data potential into mobility practice. The frequent use of the term can also indicate that a larger community has at least an awareness of the evolving domain, although this awareness might not always be built on a solid foundation. This is an additional motivation for writing this book at this particular moment. The hope is that this common interest in big data and related analytics will serve as a platform for gaining a better understanding of urban mobility interactions and increased efficiency that caters to more liveable cities.

1.6 Audiences

There is a twofold motivation behind highlighting the intended audience for the book. Firstly, it gives the author a clear vision of the intended target readership to steer the selection and presentation of the content. Secondly, it helps readers understand the value that can be attained by reading the book. The book intends to address the following readers:

• Transport planners and practitioners;
• City officials and policy makers;
• University professors and students;
• Business analysts, data scientists, data engineers, and developers;
• Smart city and smart mobility advocates, consultants, and implementers;
• Citizen scientists and members of citizens’ participation initiatives;
• Multidisciplinary urban planning and mobility projects managers.

The subsequent sections describe each of these readership groups and explain the value they can obtain from reading the book.

1.6.1 Transport planners and practitioners

This group includes urban mobility planners and practitioners such as traffic engineers, public transport operators, freight planners, and executive leaders in mobility. The book aims to provide this readership group with a general understanding of big data, sensing techniques, and analytics and how they can be applied in a practical and beneficial way to the urban mobility domain. To achieve this, the book gives an overview of the main big data sources used for this purpose and the related best practices. This overview is written to be easy to understand for those who have no technical or computer science background, while still clearly highlighting the important aspects of each of the data sources that can greatly affect the applicability and the outcomes of mobility measures and campaigns. It is complemented with practical examples, best practices, and lessons learned.

1.6.2 City officials and policy makers

This group includes city officials, city mayors, and other policy makers and executive leaders charged with implementing the smart city and smart mobility vision. The book aims to provide this group with an understanding of big data-driven potential and its added value, as well as existing limitations, when it comes to supporting smart mobility planning and policy making. It provides illustrative examples of best practices and hands-on advice based on the lessons learned by their peers.

1.6.3 University professors and students

University professors and students stand at the frontier of mobility-related research. This topic rarely comes isolated; hence, this book aims to support this readership group across related domains such as transport and traffic sciences and engineering, computer sciences, urban planning, social geography, social sciences, Geographic Information Systems, information and communication technology, and telecommunications by providing a state-of-the-art overview of big data applications in smart mobility. It is hoped that a systematic overview of, and insight into, the practical needs and applications of smart mobility professionals, as well as future smart mobility prospects, will spur research curiosity and inspire innovative research lines and efforts.

1.6.4 Business analysts, data scientists, data engineers, and developers

Business analysts, data scientists, data engineers, developers, and other data and analytics professionals are likely to be involved in the development of data-driven solutions for smart mobility and/or the application of existing market solutions to mobility problems. The book aims to assist them in understanding the needs of transport planners and smart mobility decision makers in order to derive clearly focused and enriching research questions, as well as to provide an overview of how their expertise and solutions can be applied in a practical and useful way to the smart mobility domain. To achieve this, the book also offers an introduction to transport planning information needs and a state-of-the-art research literature overview of current advances in this domain.

1.6.5 Multidisciplinary urban planning and mobility projects managers

Managing smart city and smart mobility projects integrates knowledge, skills, tools, and techniques into project activities to meet the project requirements. It also means managing multidisciplinary teams spanning various sectors (industry, administration, research, end-users ...). The book aims to provide this group with a thorough understanding of how big data and analytics can be applied to mobility in a practical way. It assists in building a common vocabulary and understanding of existing challenges among multidisciplinary team members to facilitate their collaboration and to bridge their areas of expertise.

1.6.6 Citizen scientists and members of citizens’ participation initiatives

Recent years have been marked by a number of co-creation activities in the mobility domain. Very often, these activities involve citizen scientists (nonprofessional scientists involved in scientific research) and/or citizen initiatives (participatory involvement of citizens to influence their local institutions) that aim to support the shaping of liveable smart cities. For this readership group, the book aims to provide up-to-date and useful resources on the world of big data and analytics within the smart city and smart mobility context. It is hoped that an explanation of how the latest data analytics techniques and technologies can be applied to liveable cities will facilitate communication and understanding among the initiatives and professionals to their mutual benefit.

1.6.7 Smart city and smart mobility advocates, consultants, and implementers

This readership group includes a wide spectrum of smart city enthusiasts and advocates who contribute to the landscape and shaping of new ideas and smart mobility applications. The book provides this group with information on current advances and best practices and assists in building an understanding of the multidisciplinary vocabulary and existing challenges within the smart city and smart mobility domains.

1.7 Chapter structure

Whether you are a student, a professional trying to balance private life with work, or a manager trying to find a couple of minutes between projects to grow, finding time to read a full book in one go is a challenging task in today’s world. Hence, I have divided each chapter into several subchapters: smaller chunks of content dedicated to a specific topic or question. The idea behind this division is to make the content readable and digestible in the small time pockets that might become available in our schedules: time on a train commute, the wait for boarding at the airport, those several minutes before falling asleep, or a quarter of an hour in the sun at the park before the next course starts. The subchapters are designed to be readable in no more than 15 min. This makes it easier to follow the content and round up ideas until the next opportunity to read comes.

Overall, the book contains nine chapters. Each chapter has a similar structure. It starts with three basic elements: the introduction, the objectives of the chapter, and the word cloud. The word cloud presents the words used most often within the chapter, with the font size of each word proportional to how frequently it is mentioned in the chapter. This way it provides you with a quick and simple visual overview of the chapter’s content. The chapter’s objectives highlight the key topics and questions that will be tackled in that chapter, while a short introduction sets the scene for the content.
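The word cloud construction described above (font size proportional to word frequency) can be sketched in a few lines of Python. This is only an illustrative sketch, not the tool used to produce the book's figures: the function name `word_cloud_sizes`, the small stop-word list, the sample sentence, and the 48 pt maximum font size are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop-word list (an assumption for this sketch);
# a real word cloud tool would filter many more common words.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "with"}


def word_cloud_sizes(text, max_pt=48.0, top_n=10):
    """Map the top_n most frequent words to a font size in points.

    The most frequent word gets max_pt; every other word is scaled
    linearly by its count relative to that most frequent word.
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    if not counts:
        return {}
    top = counts.most_common(top_n)
    highest = top[0][1]  # count of the most frequent word
    return {word: max_pt * count / highest for word, count in top}


# Made-up sample text standing in for a chapter's content.
sample = (
    "Smart mobility needs data. Big data supports mobility planning, "
    "and mobility planning shapes the smart city."
)
sizes = word_cloud_sizes(sample)
for word, size in sizes.items():
    print(f"{word}: {size:.1f} pt")
```

Linear scaling is the simplest reading of "proportional"; a plotting tool might instead clamp sizes to a minimum so that rare words stay legible.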

1.7.1 Topics/chapters

As already mentioned, the book comprises nine chapters. Each chapter addresses a range of subjects related to understanding smart mobility and data analytics challenges and follows a similar structure. The content of the subsequent chapters is summarized in the following sections.

1.7.1.1 Chapter 2: Introduction to smart mobility

This chapter sets the scene for the remainder of the book by introducing the key terminology and building an understanding of the scope of smart cities and smart mobility, which will be tackled in more detail through the following chapters.

1.7.1.2 Chapter 3: The new challenge of smart urban mobility This chapter explores the paradigm-shift in urban mobility planning and motivation behind this shift. It introduces key challenges of modern urban mobility as connected mobility, multimodal mobility, and mobility-related governance challenges. 1.7.1.3 Chapter 4: Small and big data for mobility studies The chapter small and big data for mobility studies sets the scene and presents the connection between the traditionally used data sources for mobility planning and big data potential. To do so, the chapter contains definitions of these terms, description of the data collection approaches, data examples, lists the main characteristic of each data source relevant for the transport planning and highlights the advantages and disadvantages of each of them. This includes, but is not limited to: (i) survey-based mobility data collection, (ii) global navigation satellite systems (as GPS, Galileo .) mobility data collection, (iii) smartphone-based mobility data collected, (iv) Call details records (CDR)-based mobility data collection, (v) other big data sources for mobility studies (e.g., Internet of Things (IoT), public transport ticketing data). 1.7.1.4 Chapter 5: Data analytics Chapter on data analytics introduces key data analytics concepts linked with the small and big data and their applications in the smart city and smart mobility contexts. The chapter provides a comprehensive and systematic overview of data analytics fundamentals with a focus on machine learning techniques. It presents several selected methods in detail, as support vector machines, k-nearest neighbors, k-means, decision tree, neural networks, and cross-validation, and provides a

number of illustrative and practical examples of their applications in the smart mobility context. The chapter is also intended to provide the requisite background to the reader for reading the chapters that follow. 1.7.1.5 Chapter 6: Four step transport planning model and big data The chapter moves the story forward by giving a short insight into the way transport planning is done and has been done for decades. It introduces the general transport planning framework and main transport planning and forecasting models. It gives a more detailed description of one of the best known transport planning models, the four-step transport planning model and includes subchapters dedicated to each step of the four-step transport planning model (trip generation, trip distribution, mode choice, and route assignment). For each of these steps, the overview of the state-of-the-art literature and the best results is given (based on each big data set introduced in the previous chapter). The idea behind this overview is to give a systematic reference, to both the researchers and the practitioners, on where are we at this point, what are the plausible applications of big data for smart mobility, what are the open questions that are crucial for fruitful implementation of big data-driven insights into smart mobility and transport planning. To researchers, this is a point of reference on where to focus their research in order to support smart mobility developments and for the transport planners and practitioners, it is a reference point to the existing advances and barriers related to the big data integrations. 1.7.1.6 Chapter 7: Data driven mobility management As the previous chapter is more related to strategic and longitudinal transport planning, this chapter tackles the big data potential for operational smart mobility management in


smart cities. It addresses questions such as data availability and real-time data analytics, data quality and privacy, and the use of open and commercial data in a unified framework. It gives an overview and lessons learned based on two examples, one related to the provision of incentives to support changes in mobility behavior and one related to the development of a data-driven mobility management as a service framework.

1.7.1.7 Chapter 8: Digital twin
This chapter tackles breakthrough methods for transport planning and mobility management, such as the concept of a digital twin of an urban area. It gives a stratified overview of the digital twin, its components, and its architecture. It discusses the role of the digital twin in the smart city context, either as a smart mobility tool or as a general tool for the life cycle monitoring, maintenance, and/or management of smart city assets. The chapter includes examples of digital twin applications in the smart mobility context.


1.7.1.8 Chapter 9: Summary
The book concludes with Chapter 9, which provides an overview of the essential elements covered within the book. It distils the key information into advice for a wide range of smart mobility professionals, providing a concise summary of actions to consider after digesting the content of the book. It is followed by a list of acronyms that gives a brief overview of the key acronyms and terminology used throughout the book.

References
European Environment Agency, 2016. Air quality in Europe report. European Environment Agency, Copenhagen, Denmark.
Semanjski, I., Bellens, R., Gautama, S., Witlox, F., 2016. Integrating big data into a sustainable mobility policy 2.0 planning support system. Sustainability 11 (1142), 8.
United Nations, 2018. World urbanization prospects: The 2018 revision. United Nations, Department of Economic and Social Affairs, Population Division, New York, USA.
World Bank, 2020. Urban development report. World Bank, Washington, USA.


CHAPTER 2
Introduction to smart mobility

Smart Urban Mobility https://doi.org/10.1016/B978-0-12-820717-8.00009-9
© 2023 Elsevier Inc. All rights reserved.

2.1 Objectives of the chapter
What is mobility?
What is urban mobility?
What is a smart city?
What is a sustainable city?
What is quality of life?
What are the smart city domains?
What is smart mobility?

2.2 Word cloud
Fig. 2.1 presents a word cloud with an overview of the content of this chapter.

FIGURE 2.1 Introduction to smart mobility word cloud.

2.3 Mobility
At the beginning of this book, it seems particularly important to establish a clear common understanding of the key terms that will be used throughout. This also helps readers coming from different backgrounds to understand the scope of the book and the context in which specific topics will be considered and discussed further on. The intention is to define the terminology as holistically as possible and then to place it in the context of particular uses, in our case the mobility context. Having said this, it seems natural to start from the definition of mobility itself.

Etymologically, the word mobility comes from the Latin word mobilis, indicating that something is movable or loose. In the 18th century, the term came into wider use, particularly in the military context, where it indicated the ability of a military unit to move or be transported to a new position. This trend continued in the following centuries, and in the 19th century the term can also be found in physics (to describe the degree to which particles of a liquid or gas are in movement) and in sociology (to depict people’s ability to move between different social levels or professional occupations) (Sorokin, 1998). Nowadays, the term mobility is so entwined with our everyday lives that some consider it a key component of the world today (Peter, 2017), and it is predominantly used in two main contexts: spatial and social. In the social context, it retains the usage that originates from the 19th century (movement between different social levels or professional occupations); in the spatial context, it refers to a movement between two spatial coordinates, hence capturing well the terminology from both the military and the physics domains. In the scope of this book, however, we will focus only on the spatial mobility context. In more detail, we will consider mobility from a traffic and transport perspective as movement (change of spatial coordinates over time) of transport entities (humans, freight, information) by means of transport modes utilizing the transport infrastructure over the predefined rules and the context linked with this activity. Please note that in this respect we do not consider, for example, pipelines as transport and/or transport infrastructure, as in this case the moving object (e.g., water, gas, oil) does not utilize the infrastructure over the predefined rules but rather moves freely following basic physical laws. Hence, little to nothing will be said about pipelines within this book. As the presented mobility definition seems rather long, it is worthwhile to look at its specific segments. If we consider traffic and transport (Fig. 2.2) to be a movement of transport entities by means of transport modes utilizing the transport

infrastructure over the predefined rules, then mobility can be seen as a term with a wider scope. Mobility (Fig. 2.3) encapsulates both traffic and transport, but with the addition of a wider context. This means that mobility does not look merely at the movement but also tries to understand its wider context: the purpose and reasoning behind the need for the movement of the entities, the consequences (e.g., pollution), the accessibility of the linked infrastructure and services, and the interaction with the overall built and non-built environment.

FIGURE 2.2 Transport and traffic.

FIGURE 2.3 Mobility.

Considering the transport entity mentioned in the definition, the three categories, humans, freight, and information, seem quite intuitive. They all use transport modes (for instance, freight uses containers, humans use trains, information uses data packets) to move across the transport network (infrastructure) under rules that are known and accepted in advance (e.g., for the utilization of the limited network capacity or for resolving conflicting flows at network nodes such as intersections). All three categories of transport entities are relevant for this book and will be captured throughout its content, although information will be viewed from the big data analytics perspective and not from the data network optimization and/or design point of view. Nevertheless, it seems worth noting that, from the perspective of this book, our interest lies mainly with humans. Why? Because both freight and information are transported to satisfy the needs of people, either for goods or for information. Hence, understanding human behavior and human needs is at the heart of any mobility planning. This brings us back to the notion that mobility concerns more than the technique and technology associated with the movement itself; it considers the full context of this movement, including intricately tied causalities such as human needs. Some of the key working definitions related to mobility that will be used in this book are given in the following section.

2.3.1 Terminology/definitions
Mobility: movement of transport entities by means of transport modes utilizing the transport infrastructure over the predefined rules and the context associated with this activity.
Movement or transport: change of spatial coordinates over time, but without changing the characteristics of the entity.
Transport entities: content that is being moved: humans, freight, and/or information.
Transport modes: the means or the way in which transport entities are being moved/transported.
Transport infrastructure: the basic static objects necessary for the operation of transport.
Traffic: stream of transport entities moving over the transport infrastructure (e.g., private cars on a public road or the messages or signals transmitted through a communications system).
Small terminology note: transportation is the same as transport; one term is used more in American English (transportation) and the other in British English (transport). In this book, we will use British English terminology, hence the term transport.

2.3.2 Urban mobility
Urban mobility is a form of mobility that takes place in urban areas. This means that the whole movement or trip, or at least a part of it, unfolds in an urban setting. Spatially, we can distinguish intraurban mobility (where both the trip’s start and end locations are situated in the same urban settlement), suburban mobility (where the trip either starts in an urban area and ends in a suburban/rural setting or vice versa), and interurban mobility (where the trip starts in one urban setting and ends in another). The common characteristic of all of these trips is that at least part of them takes place in an urban settlement (Fig. 2.4).
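The spatial distinction between intraurban, suburban, and interurban trips made in this part of the chapter can be sketched as a small classification rule. The following is a minimal illustration, not a method from the book: the `Place` type, its fields, the function name, and the "non-urban" label for trips with no urban end are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    settlement: str  # hypothetical settlement name or identifier
    is_urban: bool   # True for an urban setting, False for suburban/rural


def classify_trip(origin: Place, destination: Place) -> str:
    """Label a trip with the spatial categories named in the text."""
    if origin.is_urban and destination.is_urban:
        # Both ends urban: same settlement is intraurban, otherwise interurban.
        if origin.settlement == destination.settlement:
            return "intraurban"
        return "interurban"
    if origin.is_urban or destination.is_urban:
        # Exactly one end is urban: the text calls this suburban mobility.
        return "suburban"
    # Assumption: with no urban end, the trip is treated as outside urban mobility.
    return "non-urban"


home = Place("City A", True)
office = Place("City A", True)
village = Place("Village B", False)
other_city = Place("City C", True)

print(classify_trip(home, office))      # intraurban
print(classify_trip(home, village))     # suburban
print(classify_trip(home, other_city))  # interurban
```

The rule only looks at trip ends, which is a simplification: a trip between two non-urban places that passes through a city would also unfold partly in an urban setting.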

FIGURE 2.4 Urban mobility.

2.4 Smart city
The term smart has lately become quite prevalent in public discourse in connection with the city or urban environment, most often in the coined term smart city. This interest has positive implications: it opens up public discussion about the emerging concept and shows that a wider array of people has at least some awareness of its existence. However, concerns can also be heard that this awareness does not necessarily translate into a good and widespread understanding of the smart city concept and that the concept itself remains rather abstract for many (McQueen, 2017). There are several justified reasons for these concerns, including the fact that the smart city concept refers to a still relatively young and largely unexplored interdisciplinary field that, in many of its domains, still resides at a highly conceptual level and cannot yet be directly reflected in objectively measurable indices. However, smart cities are inevitably linked with smart mobility, and for this reason it seems particularly relevant in this book to go beyond merely mentioning the term toward reaching a working definition that will be used further on.

Looking at the evolution of the term smart city, the literature reports that it first appeared in 1994 (Dameri & Cocchia, 2013). Although this seems relatively long ago, the term was not widely accepted from the moment of its first appearance, and it took more than 15 years until it came into wider use among researchers and practitioners. More precisely, this happened in 2010, when the EU launched its smart city initiative and a number of projects dealing with smart cities started. In parallel, several attempts to define the smart city concept took place. Among the first, Hall (2000) defined a smart city as “a city that monitors and integrates conditions of all of its critical infrastructures, including roads, bridges, tunnels, rails, subways, airports, seaports, communications, water, power, even major buildings, can better optimize its resources, plan its preventive maintenance activities, and monitor security aspects while maximizing services to its citizens” (Hall, 2000). This link between the smart city concept and the monitoring perspective was also captured by Marsal-Llacuna et al. (2015), who see smart city initiatives as rooted in “the previous experiences of measuring environmentally friendly and liveable cities, embracing the concepts of sustainability and quality of life but with the important and significant addition of technological and informational components.” One can notice that, next to the monitoring perspective, Marsal-Llacuna et al. (2015) also highlighted the role of ICT as a relevant element associated with the monitoring process. Indeed, an extensive body of literature emphasizes the use of modern technologies as a key component of the smart city concept (González & Rossi, 2011; Hsieh et al., 2011; Jucevicius et al., 2014; Paroutis et al., 2014). For example, Lombardi et al.
(2012) accentuate the application of modern technologies in everyday urban living that results in innovative transport systems, infrastructures, logistics, and green and efficient energy systems, while Angelidou (2014) stresses the role of ICT in achieving prosperity, effectiveness, and competitiveness. Next to this, Su et al. (2011) share a strongly technology-focused point of view and define the smart city purely as a combined product of the Digital City and the Internet of Things concepts. More recently, McQueen (2017) portrays a smart city as “application of advanced technologies to the needs, issues, problems and objectives of people living in urban environment.”

Here we come to another component that, in addition to the new technologies, is often moved to the forefront when it comes to the realization of the smart city concept: the role of human capital (Hollands, 2008; Nam & Pardo, 2011; Neirotti, 2014). This “human capital thread” can be followed through a number of attempts to define the smart city concept. For example, Caragliu et al. (2011) describe a smart city as a place where “investments in human and social capital and traditional (transport) and modern (ICT) communication infrastructure fuel sustainable economic growth and a high quality of life, with a wise management of natural resources, through participatory governance.” Dameri sees a smart city as “a well-defined geographical area, in which high technologies such as ICT, logistic, energy production, and so on, cooperate to create benefits for citizens in terms of well-being, inclusion and participation, environmental quality, intelligent development” governed by “a well-defined pool of subjects, able to state the rules and policy for the city government and development” (Dameri, 2013).
Next to this, several authors emphasize the importance of urban services in the smart city context (Belanche et al., 2016; Correia & Wünstel, 2011), including the European Commission (European Commission, 2020), which defines a smart city as a “place where traditional networks and services are made more efficient with the use of digital and telecommunication technologies for the benefit of its inhabitants and business.” Finally, several efforts were also invested in differentiating the smart city concept from associated concepts such as digital and sustainable cities. Among others, Dameri and Cocchia (2013) systematically explore the relationship between smart and digital cities, while Ahvenniemi et al. (2017) take a deeper look at the interaction between sustainable and smart cities.

Even a brief literature review reveals that the smart city definitions diverge among various stakeholders. This should not come as a surprise, as it seems rather natural that stakeholders such as practitioners and researchers place different emphasis on the aspects that fall within their area of interest, and hence tend to view and define the smart city concept from their own perspective. However, this does not always bring us closer to clarifying the subject and can instead detract from achieving a holistic definition. Nonetheless, even a brief literature review makes apparent that two main streams can be identified throughout the smart city definition efforts: the first is the ICT and technology-oriented point of view, and the second is the people-oriented perspective. Having said this, and based on personal experience, realizing the smart city concept seems to be more of a process than a solution. To have a meaningful transformation process toward the realization of the smart city concept, it seems crucial to have a coherent and aligned understanding of what smart city means to the community and the involved stakeholders.
Hence, regardless of what one sees as the final aim of this process at the local level, whether that is an increase in the ratio of green-blue areas in the city, lower energy consumption, a more sustainable community, or something else, it inevitably starts with the joint vision of the citizens and the city, onboarding other stakeholders into this transformative process. In other words, there is no technological solution that will achieve the smart city vision goals by itself, without integrating the people-oriented perspective.

This brings us to the working definition of smart city that we will use in this book: a smart city is a responsive city where stakeholders share a common vision and strive to achieve a more sustainable city with a higher quality of life (QoL) by the use of the most suitable and efficient means, often involving the use of innovative technology. While innovative technologies open up a range of new opportunities to achieve the desired goals, they should not be seen as the only option on this journey toward the realization of the smart city vision. Sometimes the most efficient solution for a given community and desired goal can be a truly low-tech one. An example could be placing a physical barrier to close an area to private car traffic and create a pedestrian zone, as opposed to installing cameras or some other potentially costly technological solution that would limit private car access to this area. This can also be a part of the smart city concept and vision and, for some, it can reasonably be the most suitable and efficient means for various reasons, such as privacy-related issues and/or an unfavorable cost-benefit ratio in a given context. Hence, one could say that the true “smartness” in the smart city vision comes from people and the way in which they utilize the given spectrum of new possibilities and solutions, rather than from the implementation of new technologies per se. Hopefully, this book can support these efforts at least a little by shedding light on and building an understanding of the newly emerging data sources and data-driven possibilities, as well as the associated needs and limitations in the smart city context.
One can also notice that this people-oriented perspective seems to be more rooted in the smart city realizations across Europe than the ICT and technology-oriented vision, which seems to prevail in some other areas. For this reason, and for reasons of personal experience, the majority of the examples presented in this book come from the experiences of European cities.

2.4.1 Sustainable city
As a lot has been said in the literature lately about sustainability (Ahvenniemi et al., 2017) and the term seems to be widely accepted, we will not go deep into the development of the definition. However, as the sustainable city is an important element of the smart city definition, we will briefly summarize the key principles. A sustainable system is a system that can be continued at the same pace or level of activity without harming its efficiency or the people affected by it. In principle, sustainability rests on three balanced pillars, economy, environment, and people, which are closely related to the ideas of liveable, viable, and fair cities. These three concepts are often set as guiding principles for planning and policy when sustainability is placed on the agenda, and for this reason we will briefly reflect on each of them.

The notion of a liveable city, in the sense of “suitable to live in,” requires a continuous balance between two aspects. The first concerns the society or, in more detail, the characteristics of the population and their needs for goods and services such as housing, energy, water, food, health, education, entertainment, and social engagement. The second comprises the city’s built and non-built environment. This includes elements such as the built infrastructure, but also the ecosystems that provide the goods and services on which the city’s livelihoods depend. At the very least, this stems from the green-blue spaces in and around cities, such as forests and water bodies, that make valuable contributions to living conditions, for example, through local climate regulation and air quality. Hence, when the aspects of environment and society are in balance, we refer to such cities as liveable cities.

The notion of a viable city, in the sense of “the ability to survive and develop,” requires another two aspects to be, and remain, in balance with each other. The first is the above-mentioned physical and biological environment in and around cities that generates resources and amenities for the city’s population. The second concerns the ability of the city to create economic viability of its services and welfare for its inhabitants and businesses, and to maintain it over time. While it is conceivable that significant economic prosperity and welfare can be experienced over a limited time period while undermining the surrounding, and overall, biophysical ecosystem, over the long term the city’s prosperity is inevitably tied to environmental sustainability. In other words, the city’s infrastructures and ecosystems establish the boundary constraints that affect the ability of the city’s populations to flourish. On the other hand, the pressures that the city’s dwellers exert on the environment themselves intricately mold those constraints (Ruth & Coelho, 2007). Hence, for a city to be considered viable, the aspects of economy and environment need to be in balance with each other.

The notion of a fair city, in the sense of “the equitable community,” requires the remaining two aspects to be, and remain, in balance with each other. The first concerns the society and its needs for goods and services such as housing, energy, water, food, health, education, entertainment, and beyond. The second comprises economic prosperity and the capability to produce goods and services for the society as well as to ensure economic viability. Just as it is not possible to create successful economies in societies that fail, it is equally not possible to have successful communities without the capability to provide for the city’s requirements.
Hence, a balance needs to be achieved between how much one can contribute to the community and how much one needs in order to have a decent life in that community. This concerns questions such as the optimization of the use of goods and services, the prioritization of their allocation, and the fostering of their equitable distribution. It is reflected in, among others, a chance to have a good life and be able to contribute to society regardless of where one starts, opportunities for education and access to jobs, sharing of risks, inclusion, and equal opportunities regardless of aspects such as age or gender. Hence, when the economy and society aspects are in balance with each other, the city is considered to be a fair or equitable one.

The notions of liveable and viable cities are often promoted under the umbrella of environmental sustainability, as they portray the sustainability that is based on the environment pillar. Following the same logic, the notions of viable and fair cities are viewed as economic sustainability, while liveable and fair cities are seen as aspects of social sustainability. While cities with very different economic and social profiles, and different cultural norms, may place different emphasis when it comes to sustainability, a truly sustainable city needs to be all three: liveable, viable, and fair. Hence, a sustainable city is a city where social, environmental, and economic aspects are in balance, meaning that the city can achieve and maintain its liveability, viability, and fairness at the same pace of development or activity without harming its efficiency, its people, or the environment within the city and/or affected by the city (Fig. 2.5). As seen in the previous section, many smart city definition attempts integrate terms such as liveability and/or viability separately. However, as these terms are narrower in scope than sustainability, they are covered by the proposed smart city definition.
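The pairing of pillars described in this section can be written down as a small data structure, which makes it easy to check which pillar two notions share. This is only an illustrative sketch: the names `NOTIONS`, `shared_pillar`, and `is_sustainable` are assumptions introduced here, and the boolean notion of "in balance" is a deliberate simplification of the text.

```python
# Each notion is the balance between two of the three pillars (as in the text).
NOTIONS = {
    "liveable": frozenset({"society", "environment"}),
    "viable": frozenset({"economy", "environment"}),
    "fair": frozenset({"economy", "society"}),
}


def shared_pillar(notion_a: str, notion_b: str) -> str:
    """Return the single pillar that two different notions have in common."""
    (pillar,) = NOTIONS[notion_a] & NOTIONS[notion_b]
    return pillar


def is_sustainable(balanced_pairs: set) -> bool:
    """A city is sustainable here only if all three pillar pairs are in balance."""
    return all(pair in balanced_pairs for pair in NOTIONS.values())


print(shared_pillar("liveable", "viable"))  # environment
print(shared_pillar("viable", "fair"))      # economy
print(shared_pillar("liveable", "fair"))    # society

# A city that is liveable and viable but not fair is not yet sustainable.
print(is_sustainable({NOTIONS["liveable"], NOTIONS["viable"]}))  # False
```

The shared pillar matches the grouping in the text: liveable and viable cities relate to environmental sustainability, viable and fair cities to economic sustainability, and liveable and fair cities to social sustainability.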

FIGURE 2.5 Sustainable city.

2.4.2 Quality of life
Another important element of the smart city definition is the QoL. Whereas sustainability deals with the interaction among different elements (economy, society, and environment), QoL is more subjective and concerns a human being’s internal perception of their life and of their interaction with their surroundings. In this respect, QoL is seen as the general wellbeing of individuals and societies and concerns the expectations that an individual or society has of life, where these expectations are guided by the personal values, goals, and socio-cultural context in which the individual lives (Fig. 2.6). The QoL can be perceived through the prism of different domains of personal life, including, among others, physical health, family, education, employment, wealth, safety, security, freedom, religious beliefs, and the environment (Rapley, 2003).

FIGURE 2.6 Quality of life.

There have been several attempts to measure or quantify the QoL (Felce & Perry, 1995), either through the singular perspectives of distinct domains or by devising comprehensive measures. In this context, the prevailing opinion seems to be that the subjective views of individuals need to be incorporated directly into such estimation (Jenkinson, 2020). One such measurement, often considered in the context of the QoL, is life satisfaction, which represents the gap between one’s perception of their actual life and the desired QoL (Anand, 2016). Accordingly, this gap can be positive (actual life is seen as better than the expected one, in which case life satisfaction is high) or negative (actual life is seen as underachieving when compared to the expected one, in which case life satisfaction is low) (Fig. 2.7). One can notice that there are two principal paths to bridge a potential negative gap, that is, to increase life satisfaction. The first concerns bringing the vision of the expected personal life closer to one’s actual life, or “lowering the expectations.” The second refers to moving the components of the individual’s actual life (physical health, family, education, employment, wealth, safety, security, freedom, religious beliefs, and the environment) closer to the expected vision. A third path is then a combination of these two principal paths.
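The idea of life satisfaction as a gap, and the paths for closing a negative gap, can be illustrated with a few lines of code. This is a minimal sketch under stated assumptions: the text gives no scale or formula, so the integer scores, the signed difference, the function names, and the "neutral" label for a zero gap are all hypothetical.

```python
def satisfaction_gap(actual: int, expected: int) -> int:
    """Signed gap between the perceived actual life and the expected life."""
    return actual - expected


def satisfaction_level(actual: int, expected: int) -> str:
    """Map the gap to the qualitative levels used in the text."""
    gap = satisfaction_gap(actual, expected)
    if gap > 0:
        return "high"
    if gap < 0:
        return "low"
    return "neutral"  # assumption: the text does not name the zero-gap case


# Hypothetical scores on an arbitrary scale.
actual, expected = 5, 8
print(satisfaction_gap(actual, expected))    # -3
print(satisfaction_level(actual, expected))  # low

# Path 1: bring the expected life closer to the actual one.
print(satisfaction_gap(actual, expected - 3))      # 0
# Path 2: bring the actual life closer to the expected one.
print(satisfaction_gap(actual + 3, expected))      # 0
# Path 3: a combination of both paths.
print(satisfaction_gap(actual + 1, expected - 1))  # -1
```

A signed difference suits domains where more is better. For a domain such as body weight, where the goal is closeness to a target value, an absolute difference would be the more natural choice.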

As an illustrative example, suppose a segment of one’s expectations toward personal health is seen through the prism of an expected weight of, for example, 50 kg, while the actual weight is 80 kg. One can then bridge this gap, or increase the level of satisfaction, either by losing weight (moving the component of the individual’s actual life closer to the expected vision) or by “lowering the expectations” toward the desired body image and aiming for 80 kg, 70 kg, or some other value closer to the actual weight. Each movement of these borders (actual and expected) closer to each other should have the same effect in terms of life satisfaction and increased QoL, and the actual optimum of the selected measure depends on one’s context (health conditions, height, gender, age, etc.). The smart city sees an increased QoL as one of its main goals, and there is a lot that can be done to increase it by the use of the most efficient means, often involving innovative technology. These activities can range from moving the actual way of living in the city closer to the desired one to shaping the vision of the expected personal life so that it meets the potential bounds given by the local and global context. An example of the first approach is the implementation of improvements and solutions that increase the efficiency of city services such as the transport system (e.g., reducing the travel time between two locations to meet the desired quality of service). On the other hand, an example of “lowering expectations” is the awakening or encouragement of a sustainability vision, so that an individual decreases the need for personal indulgence to ensure cumulative sustainability at the level of the society; for example, shifting one’s vision of the desired transport mode for commuting (e.g., private car) toward a more sustainable option that still meets their needs, such as an electric car, a bike, or collective transport like the train.

FIGURE 2.7 Life satisfaction.

2.4.3 Role of the new technologies in smart city
As already mentioned, ICT and innovative technologies are seen as an integral element of many smart city processes and solutions. In more detail, they provide a new spectrum of possibilities closely related to: (i) advanced sensing (or monitoring) opportunities; (ii) insights extraction from emerging data sources; (iii) improved decision making based on the new knowledge; and (iv) increased responsiveness. In the context of smart cities, decision making encapsulates both the more informed decisions made at the city level and those made by citizens in their daily lives and/or by businesses in their daily activities. However, for complex systems such as the urban environment, there is often no simple transition from sensing to decision making and from decision making to actual effective responsiveness (action and reaction of the city components). The in-between steps strongly depend on data analytics and on knowledge/information/insights discovery capabilities. For this reason, more about the complexity of the urban environment, from the perspective of smart mobility, will be said in the following chapter (see Chapter 3), while sensing and data analytics will be discussed in the subsequent chapters (see Chapters 4 and 5). This will pave the way for the smart city and smart mobility decision-support considerations in the remainder of the book.

2.4.4 Responsive city
Responsive city is another term associated with the smart city vision. It is closely associated with the capability of the city (its services, environment, people, and businesses) to react appropriately to a given stimulus or change in the system. This includes changes such as the implementation of new policies, disturbances, and/or unforeseen situations such as natural disasters. Hence, a responsive city is a resilient city that can react appropriately and efficiently to planned and unforeseen stimuli and/or changes. Smart city developments have great potential to foster city responsiveness through timely sensing or analytics-based notification of an occurring change, improved communication and dissemination of appropriate messages to the target audiences, improved two-way communication among the city, its citizens, and businesses, improved and more informed decision making, as well as decreased reaction time and the overcoming of occurring barriers. More about these elements of the responsive city will be discussed in the following sections of the book and illustrated across a number of smart city and smart mobility examples.

2.4.5 Smart city domains
Smart city is a holistic concept, but more practically, it is reflected in the smart city domains or dimensions. These are the practical fields where the concept has so far shown the strongest transformative impact. These smart city domains are
• Mobility;
• Infrastructure;
• Ecosystem;
• Economic activity;
• Urban services;
• Living.

Living focuses on already discussed aspects of personal life in the city as improved QoL and its domains, while urban services include a wide spectrum of city related services like government, education, health, safety and security, waste management, etc. Some examples of the smart city urban services domain are advances and innovations in the health care sector, use of analytics to develop insight-driven policies, track performance and outcomes, enable constituent engagement, and improvements in government efficiency and advances in virtual learning opportunities, among others. The dimension of economic activity incorporates competitiveness of business and economic welfare of its citizens, leveraging the proximity of a high number of citizens with diverse skills and profiles as well as the availability of innovations. Domains of infrastructure and ecosystem include the build and nonbuild environment elements. Some of the examples include, construction of smart buildings, utilization of green energy, reduction of negative externalities of daily activities within the city, connected and

responsive infrastructural solutions, land use advances, etc. The mobility domain of the smart cities is often referred to as smart mobility, and whereas this topic will be discussed throughout the content of this book, next section gives a brief introduction and presents the working definition of the smart mobility that will be integrated in the reminder of this book (Fig. 2.8).

2.5 Smart mobility

Looking at the available smart city definitions, one can notice the strong role of mobility in contributing to the smart city vision. This aspect of the smart city vision is often referred to as smart mobility. The literature shows far fewer efforts to define smart mobility, but some worthwhile attempts have been made. As with the smart city concept, views on smart mobility fall into two camps: one focused on the role of ICT and technologies (Jeekel, 2017; Manders et al., 2018; van Audenhove, 2014) and another focused on citizens and their needs. Among the first, Batty et al. shared a vision in which the smart mobility concept is not limited to the diffusion of ICT but also looks at the needs of people and communities (Batty, 2012). Docherty et al. (2018) emphasized the differentiation between “digital” and “smart,” arguing that within the smart mobility vision these should not be seen as synonyms, as “urban mobility becomes smart when smart actors take advantage of smart technology in the context of smart regulations, policies, plans and interactions.” They link smart mobility closely with the use of big data to make short- and long-term predictions and, based on the resulting information, to take actions that improve the travel experience and mobility system operations while reducing the consumption of resources and the impact on the environment.


FIGURE 2.8 Smart city ecosystem.

A number of authors also see smart mobility as closely related to intelligent mobility (Papa & Lauwers, 2015) or intelligent transport systems (ITS). In this context, Chun and Lee (2015) define smart mobility as “a concept of comprehensive and smarter future traffic service in combination with smart technology. A smart mobility society is realized by means of the current intelligent traffic systems”. This is intriguing, as it links smart mobility with the existing ITS. However, the two are not equivalent. Anand et al. defined the ITS as a system “which arose as an application of the information technologies (communications, sensors, artificial vision, control systems, data storage management, etc.) to surface transportation networks” (Anand, 2017). The ITS mainly concerns the use of information technologies, and it was developed before big data became available. This is the first difference between the two terms. The second comes from the application of the ITS tools, which were primarily developed for road traffic, particularly on highways and motorways (e.g., motorway traffic monitoring and management, use of variable message signs, etc.). ITS later expanded into other transport modes (waterways, railways, etc.), again mainly on long-distance routes. Only later did ITS cautiously enter the urban mobility arena, mainly in the domain of traffic control improvements and applications. There are several reasons for these developments. Firstly, highways and motorways are long road network segments with barely any interactions between traffic flows (e.g., intersecting roads); hence, monitoring and managing them is considerably simpler than monitoring and managing urban mobility flows. A second simplification comes from the fact that only a limited number of transport modes are permitted on highways and motorways (mainly private cars and heavy and light goods vehicles). This limits the number of interactions among different transport modes and contains the complexity of the traffic flow. Therefore, it makes sense that the early adoptions of new technologies were developed for and integrated into an environment that is somewhat simplified when seen from the urban mobility perspective. This experience is valuable and comes in handy when looking at the smart mobility concept, which is linked to the smart city vision and is hence predominantly focused on the complexity of urban mobility, discussed in more detail in the following chapter (see Chapter 3).

So, what is smart mobility? Smart mobility can be defined as the mobility component of a smart city. In more detail, it is a common, citizen-centered mobility vision shared by all stakeholders that aims to achieve a more sustainable urban mobility system, one that improves overall urban performance and, above all, the QoL of citizens, often by integrating big data and innovative technologies. We will expand further on this basic terminology as we progress through the book. For now, it seemed particularly relevant to highlight some key terms and concepts in order to establish a clear understanding of the basis on which we will build. This also helps to set the scope of the book and to give different readership groups a shared understanding of the basic concepts and of the way in which they will be tackled throughout the book.

References

Ahvenniemi, H., Huovila, A., Pinto-Seppä, I., Airaksinen, M., 2017. What are the differences between sustainable and smart cities? Cities 60, 234–245.
Anand, P., 2016. Happiness explained. Oxford University Press, Oxford, UK.
Anand, P., 2017. Intelligent transportation systems. In: Romer, B. (Ed.), Intelligent vehicular networks and communications, fundamentals, architectures and solutions. s.l. Elsevier, pp. 1–242.
Angelidou, M., 2014. Smart city policies: A spatial approach. Cities 41, S3–S11.
van Audenhove, F., 2014. The future of urban mobility 2.0, s.l. Arthur D. Little and UITP.
Batty, M., 2012. Smart cities of the future. The European Physical Journal Special Topics 214 (1), 481–518.
Belanche, D., Casaló, L.V., Orús, C., 2016. City attachment and use of urban services: Benefits for smart cities. Cities 50, 75–81.
Caragliu, A., Del Bo, C., Nijkamp, P., 2011. Smart cities in Europe. Journal of Urban Technology 18 (2), 65–82.
Chun, B.T., Lee, S.H., 2015. Review on ITS in smart city. Advanced Science and Technology Letters 98, 52–54.
Correia, L.M., Wünstel, K., 2011. Smart cities applications and requirements, White Paper. Net, Brussels, Belgium.
Dameri, R.P., 2013. Searching for smart city definition: A comprehensive proposal. International Journal of Computers & Technology 11 (5), 2544–2551.
Dameri, R.P., Cocchia, A., 2013. Smart city and digital city: Twenty years of terminology evolution. ITAIS, Milano, Italy.
Docherty, I., Marsden, G., Anable, J., 2018. The governance of smart mobility. Transportation Research Part A: Policy and Practice 115, 114–125.
European Commission, 2020. Smart city [Online] Available at: https://ec.europa.eu/info/eu-regional-and-urban-development/topics/cities-and-urban-development/city-initiatives/smart-cities_en (Accessed 10 June 2020).
Felce, D., Perry, J., 1995. Quality of life: Its definition and measurement. Research in Developmental Disabilities 16 (1), 51–74.
Gonzalez, J.A., Rossi, A., 2011. New trends for smart cities, open innovation mechanism in smart cities, Brussels, Belgium: European commission with the ICT policy support programme.
Hall, P., 2000. Creative cities and economic development. Urban Studies 37 (4), 639–649.
Hollands, R.G., 2008. Will the real smart city please stand up? Intelligent, progressive or entrepreneurial? City 12 (3), 303–320.
Hsieh, H.-N., Chou, C.-Y., Chen, C.-C., 2011. The evaluating indices and promoting strategies for intelligent city in Taiwan. IEEE, Hangzhou, China.
Jeekel, H., 2017. Social sustainability and smart mobility: Exploring the relationship. Transportation Research Procedia 25, 4296–4310.
Jenkinson, C., 2020. Quality of life. In: Encyclopædia Britannica. Encyclopædia Britannica, inc, London, UK.
Jucevicius, R., Patasienė, I., Patasius, M., 2014. Digital dimension of smart city: Critical analysis. Procedia-Social and Behavioral Sciences 156 (26), 146–150.
Lombardi, P., Giordano, S., Farouh, H., Wael, Y., 2012. Modelling the smart city performance. Innovation: The European Journal of Social Science Research 25 (2), 137–149.
Manders, T.N., Wieczorek, A.J., Verbong, P., 2018. Understanding smart mobility experiments in the Dutch automobility system: Who is involved and what do they promise? Futures 96, 90–103.
Marsal-Llacuna, M.-L., Colomer-Llinas, J., Melendez-Frigola, J., 2015. Lessons in urban monitoring taken from sustainable and livable cities to better address the Smart Cities initiative. Technological Forecasting and Social Change 90, 611–622.
McQueen, B., 2017. Big data analytics for connected vehicles and smart cities. Artech House, Boston, USA.
Nam, T., Pardo, T.A., 2011. Conceptualizing smart city with dimensions of technology, people, and institutions. ACM, Maryland, USA.
Neirotti, P., 2014. Current trends in Smart City initiatives: Some stylised facts. Cities 38, 25–36.
Papa, E., Lauwers, D., 2015. Smart mobility: Opportunity or threat to innovate places and cities. Ghent, Belgium, s.n.
Paroutis, S., Bennett, M., Heracleous, L., 2014. A strategic view on smart city technology: The case of IBM Smarter Cities during a recession. Technological Forecasting and Social Change 89, 262–272.
Peter, A., 2017. Mobility. Taylor & Francis, Milton Park, United Kingdom.
Rapley, M., 2003. Quality of life research: A critical introduction. Sage, Thousand Oaks, USA.
Ruth, M., Coelho, D., 2007. Understanding and managing the complexity of urban systems under climate change. Climate Policy 7 (4), 317–336.
Sorokin, P.A., 1998. Social mobility, 3 ed. Taylor & Francis, Milton Park, United Kingdom.
Su, K., Li, J., Fu, H., 2011. Smart city and the applications. IEEE, Ningbo, China.


CHAPTER 3

The new challenge of smart urban mobility

Smart Urban Mobility https://doi.org/10.1016/B978-0-12-820717-8.00011-7
© 2023 Elsevier Inc. All rights reserved.

3.1 Objectives of the chapter

• What is the underlying complexity of urban mobility?
• What is multimodality?
• What is connected mobility?
• What are automated and connected vehicles?
• What are connected infrastructure and connected traveler?
• What are electric vehicles?
• What is shared mobility?
• What is MaaS?
• What is change management?
• What are living labs and what is co-creation?
• What is a quadruple helix?

3.2 Word cloud

Fig. 3.1 presents a word cloud with an overview of the content of this chapter.

FIGURE 3.1 The new challenge of the smart urban mobility word cloud.

3.3 Urban population trends

Several population-related trends have gained significant importance and are strongly affecting life in cities today, as well as shaping what cities are to become in the future. Firstly, and most noteworthy, the overall global population has increased rapidly. This pronounced increase is probably easiest to illustrate with a timeline: by the 1800s the global population had grown to 1 billion, and in the little more than 200 years since, it has grown more than sevenfold, to 7.7 billion today (United Nations, 2019a, 2019b, 2019c, 2019d). In parallel, the global median age has increased by almost 50%, moving from 21.5 years in 1970 to over 31 years in 2020. According to the United Nations, this trend is expected to continue, with the median age reaching 36 years by 2050 and 42 by 2100 (United Nations, 2019a, 2019b, 2019c, 2019d). In this context, regions like Latin America and the Caribbean, although long known for their younger populations, are projected to see the most significant shift, with the current median age of 31 poised to increase to 41, while Europe is anticipated to have the oldest median age, 47 years, in 2050 (Carneiro Freire et al., 2019). The latter is strongly related to the increase in average life expectancy. Since the 1900s, the global average life expectancy has more than doubled and is now above 70 years. However, life expectancy differs considerably across regions and nations, ranging from 53 years in the Central African Republic, the lowest in 2019, to 83 years in Japan. Furthermore, today about a quarter (26%) of the world population is younger than 15 years, some 16% are between 15 and 24 years old, half of the world population is in the working age bracket between 25 and 65, while 8% are older than 65. However, between 2020 and 2100, the number of people aged 80 and older is expected to increase from 146 million to 881 million, and it is projected that in 2073, for the first time in history, there will be more people aged 65 and older than under age 15 (United Nations, 2019a, 2019b, 2019c, 2019d). Hence, while the overall world population is growing, it is at the same time becoming older (both in relative and absolute numbers) while life expectancy is getting higher (Fig. 3.2). Secondly, in the majority of countries, there is an inequality in the regional population growth

as the population grows faster in urban areas than in rural ones. On the one hand, the number of urban centers has almost doubled, from about 6900 in 1975 to more than 13,100 in 2015. On the other hand, their population size has also grown and is expected to continue growing: the number of people living in urban areas doubled from 2.8 billion in 1975 to some 5.6 billion in 2015. Moreover, by 2100, approximately 85% of the population is expected to live in cities, with the urban population increasing eightfold from less than 1 billion in 1950 to 9 billion by 2100 (OECD, 2020). From a land-use planning perspective, it is also interesting to observe that about 60% of cities (Schiavina et al., 2019) have seen an increase in land consumed per new resident, while the combined urban surface area (built-up footprint) exceeded half a million square kilometers in 2015 (a 20% increase since 2000). However, urban areas have grown with distinct characteristics across regions of the world. While in Asia, Africa, Latin America, and the Caribbean the urban population increased in relative terms more than urban built-up areas, in Europe and Northern America the inverse occurred, meaning that more land is now being consumed to accommodate new urban residents than previously (Fig. 3.3). When it comes to the sizes of urban areas, almost 60% of the world’s urban residents live in settlements with fewer than 1,000,000 inhabitants, while 45% live in settlements with fewer than 250,000 inhabitants. There are about 500 urban centers that are home to at least 1 million inhabitants and, among them, 29 megacities with more than 10 million residents (Carneiro Freire et al., 2019). By 2030, it is predicted that there will be 662 cities with at least 1 million inhabitants and 43 megacities, most of which will be in developing regions (United Nations Population Division, 2016).

FIGURE 3.2 World population by broad age group (in millions), 1950–2020 and forecast 2020–2100 (United Nations, 2019a, 2019b, 2019c, 2019d).
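The growth figures quoted above can be put on a common footing by converting them into growth factors and implied average annual growth rates. The following is a minimal sketch that uses only the 1975 and 2015 values from the text; the function names are illustrative, and the constant-rate (compound growth) assumption is a simplification introduced here, not a claim made by the cited sources.

```python
def growth_factor(start, end):
    """Return how many times larger `end` is than `start`."""
    return end / start


def annual_growth_rate(start, end, years):
    """Average yearly rate implied by compound growth from `start` to `end`."""
    return (end / start) ** (1 / years) - 1


# Values quoted in the text for 1975 and 2015
urban_centers = (6900, 13100)        # number of urban centers
urban_population_bn = (2.8, 5.6)     # urban population in billions
years = 2015 - 1975

centers_factor = growth_factor(*urban_centers)
population_factor = growth_factor(*urban_population_bn)
centers_rate = annual_growth_rate(*urban_centers, years)
population_rate = annual_growth_rate(*urban_population_bn, years)

print(f"Urban centers: x{centers_factor:.2f}, {centers_rate:.2%} per year")
print(f"Urban population: x{population_factor:.2f}, {population_rate:.2%} per year")
```

Seen this way, a doubling over 40 years corresponds to an average growth of well under 2% per year, which illustrates how modest yearly changes accumulate into the large shifts described in this section.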

3.3.1 Key urban population-related challenges

Understanding the key trends in urbanization likely to unfold over the coming years is crucial to the implementation of smart city and smart mobility concepts and related agendas. It remains challenging and uncertain to say how cities will evolve, how they will react to specific measures, what the attitudes and expectations of certain age groups will be, etc., as many of these are affected in the long term by a number of factors. Nevertheless, some key population trends can be observed around the globe (Fig. 3.3). These trends can at least provide some indication of what lies ahead and allow us to plan better; as such, they are an important source of information for planners of all aspects of urban living, including mobility. Overall, it is expected that the urban population will grow and get older. Cities will grow as well, but their built-up footprint will develop differently across the globe, as new urban residents in Europe and North America will consume more land than previously, which will not be the case in the rest of the world. This also means that mobility solutions in cities will need to serve more travelers than they have so far and that these travelers are likely to have different mobility needs due to aging (e.g., an increased need for noncommuting trips), challenging the existing capacities and infrastructures as well as the interaction with the surrounding environment. Correspondingly, an increasing number of residents will be affected by the quality of life in the city and by the mobility solutions and services available to them in urban areas. Hence, decisions made at this point in time, and the ways in which smart mobility solutions are adopted, will have a long-term effect on an increasing number of urban residents.

FIGURE 3.3 Key World population trends.


3.4 Multimodality

Initially, the focus of transport planning was on connecting larger urban areas with each other. For this purpose, major roads, railways, and ports were built, as cities were seen as the origins and destinations of human mobility patterns and needs. Major investments were planned for long-distance infrastructure, and cities were seen merely as nodes that needed to be connected. It was not until industrialization and the emergence of private vehicles that transport planning entered the cities’ arena. Today, however, the situation is different: urban areas are arguably the most challenging mobility sphere due to the number of people and the volume of freight using the transport infrastructure and services, the number and intensity of trips made, and the complexity of interactions in this system. One of these complexities originates from the diversity of transport modes present in cities.

3.4.1 What transport modes exist in the city?

As mentioned earlier (see Chapter 2, Section 2.3 Mobility), transport modes are the means by which transport entities are moved. They are designed to carry passengers, freight, or information, but most modes can carry a combination of these. In general, transport modes fall into three basic types, depending on the medium that they move through or on: land, water, and air transport modes. Within the scope of this book, we focus on those used to transport people and/or freight. Land transport modes can be motorized or nonmotorized. The main motorized land transport modes are those that use roads, such as cars, motorcycles, buses, HGVs (heavy goods vehicles), and LGVs (light goods vehicles), and those that use railways, such as trains, trams, metro, and LRT (light rail transit). The main nonmotorized transport modes include bikes and walking. Water transport modes are divided into two general categories, those intended for inland waterways and those for maritime transport, and include ships and submarines (which can also navigate below the waterline). The air transport mode comprises aircraft, where the primary distinction is between those that are heavier than air (such as airplanes and helicopters) and those that are lighter than air (such as balloons, nonrigid airships (blimps), and dirigibles). Furthermore, aircraft can be categorized into those with a propulsion system (e.g., reciprocating engines, jet engines, etc.) and those without, and into those that are manned and those that are unmanned, such as UAVs (unmanned aerial vehicles) (Fig. 3.4). All of the transport modes can be found in the city, although some of them are typically located at the borders of urban settlements (e.g., airports as end destinations for larger airplanes). In such cases, the longest part of the trip is often made by larger (in size and capacity) transport modes, and the part of the trip that interacts with the city’s mobility system is called the “last mile”. Researchers and mobility professionals dedicate special attention to “last mile” trips due to their complex interaction with the urban system components and their impact on transport demand and on the overall competitiveness of the transport modes. However, transport modes do not only compete for the same urban mobility market but can also complement one another in aspects such as cost, speed, service frequency, quality of service, accessibility, etc. This complementarity can be strengthened if: (i) transport modes are foreseen to serve different areas.
A good example is that of “last mile” trips, where various transport modes enable continuity within the overall transport system as they serve different scales (for instance, airplane or train for the international part and public transport for the urban part of the trip); (ii) transport modes serve different transport entities in the same area, for instance, the complementary use of LGVs and public transport to transport freight or passengers; (iii) transport modes provide different levels of service (LoS) in the same area. For example, the use of a private car, train, or bike for commuting trips balances the cost and time aspects of the mobility service differently.

FIGURE 3.4 Transport modes.

However, any interaction between different transport modes requires specific attention when it comes to mobility planning and urban planning in general. This concerns not only network and spatial planning, such as dedicated areas for pedestrians and motorized transport in the city, but also shared network capacities, such as intersections between public transport and private car traffic, mixed flows of motorized and nonmotorized traffic, etc. Furthermore, the interaction between different transport modes requires interconnection points in the network. These points, or network nodes, can be complex (providing an interface between several transport modes, such as mobility hubs) or simple (involving only two transport modes, such as park&ride, parking locations, public transport stops, etc.). This is especially relevant for the growing number of multimodal trips in the urban environment. In practice, unimodal trips (trips made by using a single transport mode) are rare. Examples of such trips include walking trips (e.g., to visit a nearby bakery or cinema), but most often trips involve a combination of several (at least two) transport modes that are used between the trip’s origin and its destination. Such trips are called multimodal trips (European Commission, 2018a, 2018b, 2018c) (Fig. 3.5). The terms multimodal trip and multimodal transport are not always used in the same manner across the literature and by service providers. Therefore, in the following section, we provide a brief overview of different definitions and our motivation for using the terms mentioned above.

FIGURE 3.5 Unimodal and multimodal trips.

3.4.1.1 What is the difference between multimodal and intermodal transport?

There are two principal approaches to defining multimodal and intermodal transport: one is transport entity oriented and the other is contractor oriented. The transport entity-oriented approach seems to be more common in Europe and observes the way in which the transport entity travels. If the transport entity travels in a uniform, holistic unit (e.g., a container) that is transported using multiple transport modes without losing its integrity (e.g., without opening and repackaging the container’s content), then this is called intermodal transport. An example of intermodal transport would be a container that travels by ship to a port, where it is transferred to a train and/or from train to truck toward its final destination. The use of such holistic transport units facilitates handling and reduces the number of operations at the nodes that are specifically designed to accommodate the interfacing of two or more transport modes (e.g., intermodal terminals). In contrast, if the entity does not travel in a holistic unit from its origin to its destination, for instance, if the container is repacked at the network nodes where the switch between transport modes takes place, then this is considered to be multimodal transport (Steadie Seifi et al., 2014). The other approach, which prevails in North America, defines multimodal and intermodal transport by considering the contractor. From this perspective, intermodal transport is defined as the movement of transport entities from an origin to a destination relying on several modes of transport, where each carrier issues its own ticket (passengers) or contract (freight) (Slack, 1998). Correspondingly, multimodal transport refers to the movement of passengers or freight from an origin to a destination relying on several modes of transport but using a single ticket (passengers) or contract (freight) (Rodrigue, 2020). Although there are two principal approaches to their definitions, the concepts of multimodality and intermodality seem to be well established in the freight transport context. However, their definitions are not always directly transferable to passenger transport. This follows from the nature of humans as transport entities, which are inherently holistic. For these reasons, the terms multimodality and intermodality are often used interchangeably in the passenger transport context.
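The two definitional approaches described above can be contrasted in a short sketch. The `Journey` structure and both classification functions are hypothetical names introduced here for illustration only; they encode the two approaches as summarized in the text and are not an established standard or library.

```python
from dataclasses import dataclass


@dataclass
class Journey:
    modes: list              # ordered transport modes, e.g. ["ship", "train", "truck"]
    unit_kept_intact: bool   # True if the load unit (e.g., container) is never repacked
    contracts: int           # number of tickets (passengers) or contracts (freight)


def classify_entity_oriented(journey):
    """Transport entity-oriented view: does the holistic unit stay intact?"""
    if len(set(journey.modes)) < 2:
        return "unimodal"
    return "intermodal" if journey.unit_kept_intact else "multimodal"


def classify_contractor_oriented(journey):
    """Contractor-oriented view: one ticket/contract, or one per carrier?"""
    if len(set(journey.modes)) < 2:
        return "unimodal"
    return "multimodal" if journey.contracts == 1 else "intermodal"


# A sealed container moved by ship, train, and truck under a single contract
container = Journey(["ship", "train", "truck"], unit_kept_intact=True, contracts=1)
print(classify_entity_oriented(container))      # intermodal
print(classify_contractor_oriented(container))  # multimodal
```

The same journey receives different labels under the two approaches, which illustrates why the terms cannot be compared across sources without first checking which definition is in use.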
Nonetheless, all the above definitions have one thread in common: the movement of transport entities from an origin to a destination relies on several (multi, from Latin multus, meaning many) transport modes. For this reason, we will refer to all trips that use at least two transport modes as multimodal trips.

3.4.1.2 What are sustainable transport modes?

Not all transport modes score equally when it comes to sustainability. This concerns social (e.g., increased accessibility, reduced congestion, increased safety, fewer incidents and accidents), environmental (e.g., energy intensity, reduced pollution and emissions), and economic (e.g., reduced cost, increased speed, capacity, flexibility, and reliability) impacts. Transport modes that satisfy transport demand efficiently, with minimal negative impact on the above-mentioned social, environmental, and economic aspects, are considered to be sustainable transport modes. Fig. 3.6 compares some of the most frequent transport modes found in cities today with regard to emissions, space use, capacity, and infrastructure-related costs.

FIGURE 3.6 Comparison of the most frequent urban land transport modes. Based on Ministerio de energia, Chile. (2021). Consumo vehicular. [Online] Available at: http://www.consumovehicular.cl/. (Accessed 23 January 2021); Ribeiro, S. et al. (2012). Chapter 9 - energy end-use: Transport. In: Global energy assessment - toward a sustainable future (pp. 575–648). Cambridge: Cambridge University Press.

In general, sustainable urban transport modes can be divided into three categories:

- active transport modes, such as walking and cycling:
  o require human activity as an energy source,
  o usually slower when compared with other categories,
  o somewhat limited in capacity and traveling distance,
  o cost efficient and with low emissions.
- collective urban transport modes (e.g., public transport such as bus, tram, and metro services, or train service):
  o capable of transporting large numbers of people and achieving economies of scale,
  o spatially limited, as they provide publicly accessible mobility over specific parts of a city or along specific lines,
  o energy efficient, considering energy consumed per traveler,
  o environmentally efficient, considering emissions and pollution per traveler,
  o cost efficient per traveler, considering that the overall cost of the journey is, in some manner, shared among a large number of passengers.
- motorized individual means of travel, such as private cars:
  o convenient regarding trip coverage and travel times (for instance, the moment of departure), as they are highly adjustable per individual traveler,
  o usage is highly related to the occurrence of road congestion,
  o have a high impact on emissions and pollution in cities, e.g., noise (see how this relates to new propulsion systems, such as electric vehicles, in the following sections),
  o have a high space demand per vehicle (see how this relates to new mobility services, such as shared mobility, in the following sections),
  o have a high energy consumption per traveler.
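Several items in the list above are expressed per traveler, meaning that a per-vehicle quantity (energy, emissions, or cost) is shared among the people on board. The sketch below makes this arithmetic explicit. The function name is illustrative, and all numeric values are hypothetical placeholders chosen only to show the calculation; they are not measurements and are not taken from Fig. 3.6.

```python
def per_traveler(total_per_vehicle_km, travelers):
    """Split a per-vehicle quantity (energy, emissions, cost) evenly among travelers."""
    if travelers <= 0:
        raise ValueError("travelers must be positive")
    return total_per_vehicle_km / travelers


# HYPOTHETICAL placeholder values (grams of CO2 per vehicle-km), for illustration only
bus_emissions_g_per_km = 1000.0
car_emissions_g_per_km = 150.0

full_bus = per_traveler(bus_emissions_g_per_km, 40)      # well-used bus
empty_bus = per_traveler(bus_emissions_g_per_km, 5)      # nearly empty bus
average_car = per_traveler(car_emissions_g_per_km, 1.5)  # average car occupancy

print(f"Well-used bus:    {full_bus:.0f} g per traveler-km")
print(f"Nearly empty bus: {empty_bus:.0f} g per traveler-km")
print(f"Private car:      {average_car:.0f} g per traveler-km")
```

With these placeholder inputs, the per-traveler advantage of the collective mode depends entirely on how many people share the vehicle, so occupancy matters as much as the vehicle itself when modes are compared.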

As a rule of thumb, the above list represents the order of sustainable urban transport modes from those considered to be the most sustainable to those that are considered to be the least sustainable. However, in practice, for every community and a given need for mobility, the sustainability of the available transport modes can differ in their order. An example would be connectivity of a remote urban community with an elderly population or mobility-poor community. In such cases, although emissions and cost-wise use of shared bikes would be an interesting option, it might not be efficient due to the population characteristic (e.g., elderly population with specific mobility needs). On the other hand, a collective transport line might not be economically viable due to a low number of passengers and reduced frequency that would fail to satisfy the mobility demand. Hence, in

some cases, motorized individual means of transport might be the most suitable and sustainable option considering all aspects. However, in such cases use of innovative mobility services, such as shared mobility or on-demand public transport, as well as more sustainable vehicle propulsion systems, such as electric vehicles, are options worth exploring. Today, many cities and communities try to encourage the use of more sustainable transport modes in their local context. In general, these efforts rely on ASI (Avoid, Shift, Improve) approach, consisting of sets of steps that are ideally executed in sequential order. This encompasses: -

-

reduction of transport demand by designing the mobility system where unnecessary trips can be avoided (avoid). One of such examples is “close-by” mixed purpose land use neighborhoods approach, where the idea is to place mixed content such as grocery markets, libraries, shops, bakeries, parks, offices, restaurants, education, etc. entwined with residential buildings. This way, a number of trip motives and daily needs (as grocery shopping, jogging, leaving kinds in a kindergarten, etc.) can be satisfied in the vicinity of people’s homes, reducing the need for the use of motorized transport modes. promotion of more sustainable transport modes encourages a modal shift by capitalizing on the comparative advantages of sustainable transport modes over those less sustainable in a comparable mobility context. The aim is to encourage individuals, and urban logistics operators, to shift to another, more sustainable, transport mode if these comparative advantages are significant. Hence, many city authorities are actively trying to highlight comparative advantages as well as to provide additional incentives for citizens to encourage their modal shift. Such measures can involve rewarding systems, gamification of mobility services for some groups of users, tax benefits, etc.

FIGURE 3.7 Modal shift phases. Based on Rodrigue, J.-P. (2020). The Geography of transport systems. 5th ed. New York, USA: Routledge.

This modal shift often takes place over three sequential phases (Fig. 3.7) (Rodrigue, 2020):

o Inertia: during this phase, the modal shift starts slowly and is usually difficult to notice. This is mainly due to a strong level of inertia associated with familiarity, comfort, and investments related to the use of the initial mode of transport (for example, car ownership, experience gained, time invested in obtaining a driver’s license, etc.). At this stage, exogenous factors, including incentives, regulations, and policies, combined with endogenous factors, such as personal motivation and individuals’ attitudes toward sustainability, play an important role in the decision of a limited number of so-called early adopters to experiment with modal shift, often as part of a publicly subsidized initiative or living lab. Significant effects are not immediately observable at this stage and, due to the low numbers, are often interpreted as a lack of modal shift. However, this is an important stage that, if managed well, can lead to a more significant modal shift.
o Modal shift: this is the central phase, during which the benefits of sustainable modes of transport become widely recognized. The trend of adopting new modes of transport changes drastically during this phase, going from the poor performance of the inertia phase to exceeding expectations and becoming a widely accepted tendency. The rapidly growing numbers need to be carefully contextualized, and infrastructure investments need to be carefully considered in this phase, as a significant reduction in comparative advantage will ultimately spark the end of the modal shift phase. An example is the case where the sustainable mobility option becomes increasingly congested due to its high acceptance, while the initial modes of transport become less congested and therefore more attractive, familiar, and easy-to-return-to options.
o Maturity: this is the final phase, during which the comparative advantages of individual modes of transport, such as lower congestion, become more stable and vary less over time. At this stage, the mobility system has realized its potential for modal shift, and a new balance has been reached regarding the use of transport modes. This sets the stage for the next step of the ASI approach.

- The final step of the ASI approach (improve) concerns the improvement of vehicles and fuel efficiency across transport modes (for instance, the use of renewable energy sources) as well as the optimization of operational efficiency (e.g., for public transport and shared mobility). In this manner, the newly achieved balance is further improved by advancing the characteristics of the vehicles and the services so that their potential negative impact is reduced to a minimum.

3.4.2 Key multimodal mobility-related challenges

The presence of a variety of transport modes within the city area creates an opportunity to offer residents several diverse mobility options. However, to capitalize on this opportunity, aspects such as the complementarity and competitiveness of transport modes need to be addressed. This relates to questions such as:

- space utilization, sustainability, and efficiency (for instance, shared infrastructure such as road space that is shared between private cars, public transport, freight transport, and cyclists),
- efficient distribution of transport nodes and level of service (e.g., prioritization of different mobility options at signalized intersections),
- connectivity between different mobility options, such as the organization and placement of mobility hubs and the synchronization between mobility options (for instance, synchronization of railway timetables with public transport options, spatial distribution and placement of park and ride facilities, etc.),
- urban mobility policies, measures, and planning strategies, and
- above all, the capability to meet the mobility needs of residents in an efficient and sustainable manner.

3.4.3 Example: transport mode competitiveness in an urban area

The following example is situated in the urban area of the city of Leuven, Belgium, and illustrates the competitiveness among transport modes during afternoon peak hours. The city of Leuven is the capital of the province of Flemish Brabant and is located about 25 km east of Brussels (the capital of Belgium). The Brussels international airport is situated between these two cities, and Leuven is connected to it by road, rail, and bike highway networks. The Leuven municipality itself comprises the historic city, surrounded by the very busy ring road, and the neighboring municipalities (Heverlee, Kessel-Lo, a part of Korbeek-Lo, Wilsele, and Wijgmaal), covering a total area of 57.51 square kilometers (Stad Leuven, 2016). Leuven is home to almost 100,000 inhabitants and hosts a university with more than 50,000 students. All of this contributes to a very dynamic urban atmosphere that results in a lot of traffic and related congestion.

Fig. 3.8 illustrates the reachability of the main train station as observed in Leuven during the afternoon peak hour on weekdays. It shows the city center area (within the city ring) and the location of the main train station, situated in the north-eastern part of the city ring. The two colored areas show the observed competitiveness of the transport modes, in terms of travel times, when it comes to reaching the main train station during the afternoon peak hours. The mobile phone tracking data used for the analytics underlying this illustration were collected from 3400 people over a period of 6 months (Semanjski & Gautama, 2016a, 2016b). The blue area on the map indicates the parts of the city from which the train station is reached faster by bike than by car during the afternoon peak hour of an average working day. The yellow area illustrates the opposite: the parts of the city and the surrounding area from which the train station is reached faster by car during the afternoon peak hour.

FIGURE 3.8 Transport modes competitiveness within the city of Leuven, Belgium.

The map is a strong visualization component that was used by the city’s and province’s stakeholders to open up the discussion about the use of more sustainable mobility options in the city, as it provided an evidence-based view of the competitiveness between these two transport modes. It also provided one of the first insights into the impact of the bike highway on the overall mobility system and the accessibility of the city center area. The bike highway was introduced in Leuven in 2015, and the leaning part of the blue area (on the left side of the map) is directly associated with this new infrastructure. The bike highway stretches from the middle of the city center to the north and bends toward the west (in the direction of the Brussels airport). The blue area that leans toward the western surroundings of the city illustrates the extended reachability of the train station, but also of the overall city area, for active transport modes such as cycling. Furthermore, the collected data and the resulting illustrations allowed decision makers to see the immediate effect and acceptance of the new measures, the relative usage of the newly introduced connection in comparison with other options on the same route, and the direct impact of the infrastructure investments on the reachability of the city center.
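The comparison behind a map of this kind can be sketched in a few lines of code. The example below is a minimal illustration and not the method used in the cited study: the zone names, the travel times, and the function `faster_mode_per_zone` are hypothetical, and the median is an assumed choice of summary statistic. For each zone, it reports the mode with the lowest median observed travel time to the station.

```python
from statistics import median

# Hypothetical observations: (zone, mode, travel time to the station in minutes).
observations = [
    ("zone_a", "bike", 9.0), ("zone_a", "bike", 11.0),
    ("zone_a", "car", 14.0), ("zone_a", "car", 16.0),
    ("zone_b", "bike", 22.0), ("zone_b", "bike", 24.0),
    ("zone_b", "car", 15.0), ("zone_b", "car", 19.0),
]

def faster_mode_per_zone(obs):
    """Return {zone: mode with the lowest median travel time}."""
    times = {}
    for zone, mode, minutes in obs:
        # Group travel times by zone, then by mode.
        times.setdefault(zone, {}).setdefault(mode, []).append(minutes)
    return {
        zone: min(by_mode, key=lambda m: median(by_mode[m]))
        for zone, by_mode in times.items()
    }

print(faster_mode_per_zone(observations))
```

With these made-up numbers, zone_a would be colored as a "bike" area and zone_b as a "car" area. A real analysis would also need to handle zones with few observations and to separate peak from off-peak trips.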

3.5 Connected mobility

The question of connected mobility has lately gained significant attention in literature and practice, particularly in the context of smart mobility. There are two perspectives on connected mobility: the human-oriented and the technology-oriented one. Here, we focus on the human-oriented perspective, whereas the technology-oriented perspective on connected mobility will be discussed in more detail in the following section (see Chapter 3, ConnectedX). In urban areas, several transport modes are present. This often also means that several stakeholders are involved in the provision and planning of mobility services. For example, public transport providers might be interested in the efficient organization of public transport such as buses or subways. Traffic managers might be interested in optimizing road traffic, striving for synchronized and efficient traffic lights along key urban mobility corridors. However, in the context of multimodal urban traffic, their activities mainly affect a limited number of transport modes and, hence, a limited number of trips, or parts of trips, when it comes to personal mobility.

From the perspective of personal mobility, the focus is on the overall trip (from its origin to its destination), and this often involves the use of several transport modes during a single trip. Hence, from this perspective, connected mobility concerns the efficient use of a variety of mobility options and transport modes in a single trip. Whereas, for example, a public transport operator can be perfectly satisfied with the efficiency of a specific public transport line in terms of speed, capacity utilization, costs, etc., this might not be the case for a traveler who uses this line during his/her trip if seamless mobility is not achieved. An example would be a traveler who, after using this public transport line, needs to spend a long time waiting for a connecting line. In this context, seamless mobility concerns trips where, from the traveler’s perspective, there is no noticeable barrier or disturbance during any part of the journey or at the points where the parts of the trip join into a single journey. The dimensions of seamless mobility include:

• Cost efficiency:
o use of a single ticketing system: hence, there is no need to stop to purchase or validate an additional ticket.
• Time efficiency:
o synchronized urban mobility: e.g., arrival and departure moments of tram lines at rail stations are synchronized with the arrival/departure moments of trains; hence, no time is wasted on waiting when switching from one transport mode to another;
o time utilization: placement of desired facilities, such as shops, post offices, libraries, and/or other facilities, in (or close by) interchange locations, where they can contribute to a more efficient use of travelers’ time and/or to a more pleasant journey;
o efficient functioning of the transport network without congestion and delays.
• Space efficiency:
o interchange nodes among different transport modes are conveniently placed so that there is no need for undesired in-between trips (e.g., a long walk from a parking place to the train platforms);
o pleasant and safe space: for instance, use of natural elements such as greenery and water areas along cycling routes and at interchange locations, ensuring a safe environment, etc.
• Information efficiency:
o access to mobility information is available seamlessly and at one location across all user devices; for instance, pretrip, on-board, and posttrip information concerning the overall mobility is accessible in a single framework with an intuitive and timely presentation.
• Integration efficiency:
o personalization: travelers are able to plan and adjust their journeys based on personal preferences in a seamless manner across the whole network, including the provision of personalized pretrip, on-board, and posttrip information (provision of the right content at the right moment by appropriate means).
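The time-efficiency dimension listed above can be made measurable with a simple transfer-wait calculation. The following is a minimal sketch under stated assumptions: times are given as `HH:MM` strings within a single day, the function names `to_minutes` and `transfer_wait` are hypothetical, and the minimum transfer time of 3 minutes is an arbitrary illustrative value for walking between platforms.

```python
from bisect import bisect_left

def to_minutes(hhmm):
    """Convert an 'HH:MM' string to minutes after midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)

def transfer_wait(arrival, departures, min_transfer=3):
    """Minutes between arrival and the first reachable departure.

    A departure is reachable if it leaves at least `min_transfer`
    minutes after the arrival. Returns None if no connection exists.
    """
    arrived = to_minutes(arrival)
    deps = sorted(to_minutes(d) for d in departures)
    i = bisect_left(deps, arrived + min_transfer)
    if i == len(deps):
        return None
    return deps[i] - arrived

# Hypothetical tram arrival and train departures at the same station.
trains = ["17:04", "17:10", "17:40"]
print(transfer_wait("17:02", trains))  # the 17:04 train is too tight to catch
print(transfer_wait("17:08", trains))  # just misses the 17:10 train
```

In this made-up timetable, shifting the tram arrival from 17:08 to 17:02 cuts the transfer from 32 to 8 minutes, which is the kind of effect timetable synchronization aims for.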

3.5.1 Key connected mobility-related challenges

In the context of seamless mobility, the boundaries among private, shared, and public transport should be imperceptible from the travelers’ perspective, allowing them to choose among a variety of sustainable, affordable, and flexible options for travel between a trip’s origin and destination. However, in most cities this is still not the case, and room for improvement exists. Among others, this includes the ease of use of the available mobility options (e.g., an easy, efficient, accessible, and affordable mobility ticketing system), mobility system coverage, frequency, and synchronization among different options, as well as time utilization and efficiency of the overall mobility that would result in a smoothly functioning network without congestion and delays. Furthermore, in some cities there is room for improvement in the placement of mobility hubs, interchanges, and facilities and in what they offer, as well as in the safety, security, cleanliness, and comfort of these locations and of the overall system.

A strong role in enabling some of these improvements lies in the efficient utilization and exchange of mobility-related data among the stakeholders involved. Although a lot has been done lately on mobility data and its standardization to support these efforts, plenty of challenges remain in this domain. These challenges affect the data available in the smart mobility context, the possibility to improve the overall system, and the ability to correctly assess the effort, usefulness, and applicability of new smart mobility possibilities; therefore, we address them in more depth at this point. It should be noted, however, that these challenges concern mobility data and the provision of mobility information in a wider context and are not limited to the connected mobility challenge. For one, data are not the same as information.

3.5.1.1 Data versus information

The word data originates from the Latin word datum (which is also the singular of data in English), and from a philosophical perspective it stands for a simple, unprocessed, isolated fact. From the data analytics perspective, data are symbols used for the facts and terms that describe the properties of objects and their relationships in space and time.
Hence, in this context, data have no meaning in themselves, inside or outside of their existence; they are attached to the meaning by which we describe the properties of objects. Furthermore, data can exist in any form, be it usable or not. The word information comes from the Latin word informare, which means informing or notifying. Information is the result of analyzing and organizing data in a way that gives new knowledge to the recipient. Hence, data placed in a given context and combined within a structure make up information, which becomes knowledge when it is interpreted, put into context, or given meaning. Making good use of this knowledge is a property of wisdom, which builds on the insights generated from the original data. Hence, wisdom incorporates lessons learned and an understanding of how to react or respond appropriately to a given stimulus, which is integral to the smart city definition and is also referred to as the responsive city property.

3.5.1.2 Some of the key mobility data-related challenges

3.5.1.2.1 Data standardization

Data standardization is a complex challenge for several reasons. For one, mobility data sensing, storage, and exchange are still quite recent developments, triggered, as in all domains, by advances in telecommunications, ICT, and storage capacities. These rapid advancements left limited time for the discussion and development of unified and globally accepted data standards. Secondly, the mobility domain and the possibilities for data-driven services are evolving rapidly, meaning that new possibilities emerge frequently. This also means that it was difficult, if not impossible, to foresee all possible mobility data-driven services at the moment when the initial data standardization questions arose. This often left new advances and services out of the scope of existing standardization efforts.

An example is the shared mobility services developed in recent years. To the best of the author’s knowledge, at this moment there is no global or general standard for the exchange of such data, and it is often challenging to include shared mobility services’ data in any type of holistic mobility information service for a given area. Hence, mainly because there is no standard, even where there is a willingness to share such data, the services and solutions built on them lack scalability: data integration efforts are not replicable, and each shared mobility data source requires dedicated resources. This also applies to any associated metadata (data that provide information about other data). In addition, even if data on shared mobility in the designated geographical area are integrated, their divergence in content and structure means that additional preprocessing steps, such as unit transformations, coordinate system transformations, and timestamp format unification, will be needed, requiring additional effort. Furthermore, there is no guarantee that all shared mobility data providers gather data of the same nature. For instance, even if one integrated data from all the shared mobility service providers in the area and invested the required effort in transformations and preprocessing, one provider could be supplying data on the locations of available vehicles updated hourly, while another could offer data on vehicles’ energy consumption on a different time scale. Such scenarios make it challenging to plan and implement mobility data-based insights and services in a scalable and efficient manner. A significant contribution can be achieved with the introduction of applicable data exchange standards and data provision requirements. Recent developments from both industry and administrative bodies leave room for optimism in this respect and indicate the relevance of such questions for all stakeholders.
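The preprocessing effort described above can be illustrated with a small harmonization sketch. Everything in it is an assumption made for illustration: the two providers, their field names (`id`, `time`, `vehicle`, `epoch`, `lat_e6`, `lon_e6`), and the common target schema are hypothetical and do not correspond to any real shared mobility feed. The point is only that each source needs its own adapter before the records become comparable.

```python
from datetime import datetime, timezone

def from_provider_a(rec):
    # Hypothetical provider A: ISO 8601 local time with UTC offset,
    # coordinates in decimal degrees.
    ts = datetime.fromisoformat(rec["time"]).astimezone(timezone.utc)
    return {
        "vehicle_id": rec["id"],
        "lat": rec["lat"],
        "lon": rec["lon"],
        "time_utc": ts.isoformat(),
    }

def from_provider_b(rec):
    # Hypothetical provider B: Unix epoch seconds,
    # coordinates stored as integer microdegrees.
    ts = datetime.fromtimestamp(rec["epoch"], tz=timezone.utc)
    return {
        "vehicle_id": rec["vehicle"],
        "lat": rec["lat_e6"] / 1e6,
        "lon": rec["lon_e6"] / 1e6,
        "time_utc": ts.isoformat(),
    }

# The same moment and place, reported in two different formats.
a = from_provider_a({"id": "bike-17", "time": "2023-05-10T17:30:00+02:00",
                     "lat": 50.881, "lon": 4.716})
b = from_provider_b({"vehicle": "scooter-4", "epoch": 1683732600,
                     "lat_e6": 50881000, "lon_e6": 4716000})
print(a["time_utc"], b["time_utc"])
```

Each additional provider would require another adapter of this kind, which is why the absence of a common exchange standard limits scalability.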
Some of the most recent developments in this respect include the DATEX II, NeTEx (Network Timetable Exchange), and GTFS (General Transit Feed Specification) standards.

The GTFS standard originates from industry, more specifically from Google. Its creation was triggered by the lack of any standard for public transport timetables in the United States and by the need to make the integration of public transport data a scalable option for Google Maps solutions (Roush, 2012). Today, the GTFS covers information about public transport stops, lines, and timetabled journeys, and it also supports a few simple types of fare products. The standard implements data identifiers, which are specific to each data set and require registration with Google. Regarding the data files, the GTFS uses a traditional flat file format, which makes it compact and efficient but also requires multiple files to describe the different types of elements. This adds complexity, reflected in the need for additional rules for naming and packaging the files. Furthermore, custom-written tools are required to interpret and process the data. As a standard, the GTFS is widely known and used among developers and stakeholders interested in creating journey planning applications. However, stakeholders such as public transport organizations and authorities often did not find their needs met, as the standard does not include much of the information needed for, for example, the development of the timetables themselves.

The NeTEx is another standard dedicated to public transport information, but, in contrast to the GTFS, it originates from an authority-led initiative. The NeTEx is a Technical Specification of CEN, the European Committee for Standardization (European Committee for Electrotechnical Standardization, 2021), for exchanging public transport schedules and related data, including network descriptions, fare products, journey and timing patterns, connection timings, joined journeys, train makeup, etc. It enables the exchange of both the source datasets used to build timetables and the resulting timetables themselves.
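To make the flat-file layout of the GTFS described above more concrete, the following minimal sketch reads two of its comma-separated files from in-memory strings and joins them. The file names `stops.txt` and `stop_times.txt` and the column names used here are part of the GTFS specification; the sample rows, the helper names `read_table` and `trip_itinerary`, and the choice of Python’s standard `csv` module are assumptions made for illustration.

```python
import csv
import io

# Sample content of two GTFS files (rows are made up for illustration).
STOPS_TXT = """stop_id,stop_name,stop_lat,stop_lon
S1,Station,50.8810,4.7160
S2,Market Square,50.8790,4.7010
"""

STOP_TIMES_TXT = """trip_id,arrival_time,departure_time,stop_id,stop_sequence
T1,17:08:00,17:08:00,S1,2
T1,17:00:00,17:00:00,S2,1
"""

def read_table(text):
    """Parse one GTFS flat file into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def trip_itinerary(trip_id, stops_text, stop_times_text):
    """Return [(stop name, departure time)] for a trip, in stop order."""
    # stops.txt maps stop identifiers to names; stop_times.txt holds the
    # timetable, so the two files have to be joined on stop_id.
    names = {row["stop_id"]: row["stop_name"] for row in read_table(stops_text)}
    rows = [r for r in read_table(stop_times_text) if r["trip_id"] == trip_id]
    rows.sort(key=lambda r: int(r["stop_sequence"]))
    return [(names[r["stop_id"]], r["departure_time"]) for r in rows]

print(trip_itinerary("T1", STOPS_TXT, STOP_TIMES_TXT))
```

Even this small case needs a join across files, which illustrates why processing the GTFS requires dedicated tooling despite the simplicity of each individual file.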
Furthermore, as NeTEx uses XML (Extensible Markup Language), it can package a complete dataset as a single coherent document that can be managed and validated. Another potential comparative advantage of NeTEx, from a service provision perspective, is its capability to link to operational systems and to supply the additional information needed to provision real-time systems (such as destination displays on public transport vehicles or at stops). It also offers versioning and validity condition mechanisms, which make it suitable for the continuous peer-to-peer integration of data from various providers. Regarding the complementarity of the NeTEx and the GTFS, it is possible to generate a complete GTFS data set from NeTEx; however, the opposite does not hold. To support this, the NeTEx UML (Unified Modeling Language) model includes a GTFS mapping package, which describes how to populate each GTFS element from the corresponding NeTEx element. This is a positive step toward interoperability across standards.

In the European standardization context, the NeTEx does not come alone but as a part of the European Transmodel (CEN, 2021a, 2021b). The Transmodel stands for the European Standard “Public Transport Reference Data Model” (EN 12896) (CSN, 2016), which is designed as a higher-level reference data model that covers the whole area of public transport operations and provides a framework for defining and agreeing on data models. As such, it supports the development of integrated and interoperable information processing systems within an organization, as well as efficient communication and exchange of data among various stakeholders such as transport operators, authorities, or software suppliers.
Next to the NeTEx, the Transmodel has already been a fundamental input for the design of a number of EU standards, such as DVC (Data Communication on Vehicles) (CEN, 2020a, 2020b), IFOPT (Identification of Fixed Objects in Public Transport) (CEN, 2006), SIRI (Standard Interface for Real-Time Information) (CEN, 2015), DJP/OJP (Open API for distributed journey planning) (CEN, 2019), and OpRa (Operating Raw Data and statistics exchange) (CEN, 2020a, 2020b).

The DATEX II (EC, 2020) is another CEN standard. Contrary to the above-mentioned standards, the DATEX II is not public transport oriented but is dedicated to the exchange of traffic information and traffic data between the traffic centers of road operators and service providers. It is a unified XML-based format that includes traffic and travel information such as traffic flow, traffic measures, roadworks, accidents, parking, etc., defined in a multi-part CEN specification covering:

- Context and framework (part 1),
- Location referencing (part 2),
- Situation publication (part 3),
- Variable Message Sign (VMS) publications (part 4),
- Measured and elaborated data publications (part 5),
- Parking publications (part 6),
- Common data elements (part 7),
- Traffic management publications and urban extensions (part 8),
- Traffic signal management publications (part 9).

The DATEX II specifies the exchange protocols separately from the content specifications. This allows flexible use of the content specifications with any defined exchange protocol. For instance, information can be distributed in a manner that is independent of language or presentation format. This removes the possibility of misunderstandings and/or translation errors on the recipient’s side, as the recipient can, for example, choose to present the information as spoken text, as an icon or picture on a map, or to integrate it in a navigation calculation. In practice, simplified DATEX II versions (profiles) are sometimes used, depending on the needs and, respectively, the data being exchanged among parties. The standard itself is still being actively advanced by CEN to meet the increasing needs of diverse stakeholders and ongoing mobility-related advances. These efforts currently include the following application areas and working groups (WG) (CEN, 2021a, 2021b):

- ITS WG 1 Electronic Fee Collection.
- ITS WG 3 Public transport.
- ITS WG 4 Traffic and traveler information.
- ITS WG 7 ITS Spatial data.
- ITS WG 8 Road Traffic Data.
- ITS WG 15 eSafety (eCall).
- ITS WG 16 Cooperative ITS.
- ITS WG 17 Mobility integration.
- WG 2 Freight.
- WG 5 Traffic control.
- WG 6 Parking Management.
- WG 9 Dedicated Short Range Communication (DSRC).
- WG 10 Human-Machine interfacing.
- WG 11 Subsystem and intersystem interfaces.
- WG 12 Vehicle identification.
- WG 13 ITS architecture.
- WG 14 Recovery of stolen vehicles.

In conclusion, in the context of mobility, the development of data exchange standards and the definition of information requirements (e.g., for new services) are never-ending processes. They can be facilitated to some extent by a good base, such as general frameworks for defining and agreeing on data models and exchange protocol specifications. However, this base needs to be widely accepted and implemented to ensure interoperability, flexibility, and the opportunity to create new services that satisfy the needs of smart mobility stakeholders. Some of the challenges on this path involve: (i) the synchronized transition challenge, that is, unifying and phasing out the existing partial solutions while, in parallel, adopting and integrating the still advancing standardization efforts (and standards’ versioning); (ii) reduction of the time needed from the identification of needs to the definition of standards; and (iii) synchronization of the unfolding data standardization efforts on a wider (global) level. A good example in this direction is the significant effort invested in linking and harmonizing activities among regional and global bodies such as CEN and the ISO (International Organization for Standardization) (CEN, 2021).

3.5.1.2.2 Data availability

Another challenge related to mobility data is data availability. For one, not all cities, regions, or countries have equal data gathering practices and mobile data coverage. This results in varying mobility data availability across geographical areas. Secondly, the traditional mobility data collection approaches, such as travel diaries or surveys, are known to underestimate short trips (mainly made on foot). Consequently, active transport modes were usually underrepresented in mobility studies and datasets, which underpinned motorized transport-oriented mobility planning and investments as well as a somewhat skewed perception of the role of motorized transport in the overall urban mobility system. Thirdly, even with the advances in mobility data gathering, and the step from traditional data collection approaches toward emerging sensing possibilities, these imbalances were not overcome by default. As one might have noticed from the previous section, data-related standards are mainly focused on motorized transport modes. This is not without reason, as motorized transport modes often already have some type of sensing and data gathering gear installed, such as tachometers, speedometers, and others. In addition, the first implementations of positioning data gathering were often associated with motorized transport modes (e.g., sensors placed in personal vehicles or public transport vehicles). As a result, there is a large imbalance in data availability across transport modes: mobility datasets are more likely to be available for the motorized transport modes than for the active ones. This makes it somewhat challenging to obtain a holistic view of mobility in an area (or, in fact, across different urban areas).

More will be said about this imbalance in the next chapter. However, it should be noted at this point that the main data availability challenge related to smart mobility comes from the fact that sustainable and active transport modes (such as walking and cycling) are often not well represented in the available mobility datasets. Furthermore, from the human-oriented connected mobility perspective, we are actually less interested in the mobility of transport modes than in human mobility across multimodal urban mobility systems. This is an important part of the paradigm shift when it comes to smart mobility and smart cities in general. It should also be said that, with the advances in data sensing, a lot has been done, and can still be done, to support this paradigm shift. We will address this in more detail in the next chapter, as well as when addressing the data analytics topic.

3.5.1.2.3 Data privacy

One cannot speak of human mobility data without recognizing the challenge of personal data privacy. Personal data are considered to be any information that relates, directly or indirectly, to an identified or identifiable living individual (for instance, in cases where different pieces of information, collected together, can lead to the identification of a particular person) (European Commission, 2004). Examples of the personal data most frequently collected in the smart city and smart mobility context are shown in the upper part of Fig. 3.9. However, not all data are personal data, and distinguishing between the two, and sometimes being aware of all the provisions and requirements associated with specific data-related processes, can pose a challenge to the implementation of data-oriented activities.

FIGURE 3.9 Personal data.

A good starting point, in any case, is to ask whether, based on the collected data and information, a living individual is identified or identifiable, directly or indirectly. This also implies that if personal data have been rendered anonymous in such a manner that the individual is not, or is no longer, identifiable, then such data are no longer considered to be personal data. However, for data to be truly anonymized, the anonymization must be irreversible (there is no possibility to go back to a dataset in which individuals are identifiable). Another often discussed aspect of personal data is data aggregation. For instance, when data are aggregated to give an insight into the overall population of an urban area and individuals cannot be identified based on such data (and this process satisfies the condition of anonymization, i.e., there is no possibility to go back to a dataset in which individuals are identifiable), then one also departs from the definition of personal data. In this respect, data anonymization processes bear major significance but also remain a challenge. More about the current advances in data anonymization will be said in the data analytics chapter (see Chapter 5, Data anonymization).

It should not go without mentioning that one of the recent advances in personal data protection has been achieved with the introduction of the European General Data Protection Regulation (EU) 2016/679 (GDPR) (European Parliament, 2016). The GDPR was developed to protect personal data regardless of the technology used for processing or storing those data. Hence, it is irrelevant whether data are processed in an automated or a manual manner, or whether they are stored in an IT system, on paper, or in another manner. The regulation contains provisions and requirements related to the processing of personal data of individuals who are located in the European Economic Area (EEA), and it applies to any organization or subject that processes the personal information of individuals inside the EEA, regardless of the organization’s location and of the individuals’ citizenship or residence. By extension, it also addresses the transfer of personal data outside the EU and EEA. As such, the GDPR is a cornerstone of any data collection and processing within a smart city and smart mobility context in the EU and EEA, but it can also serve as a good guideline on how to protect personal data in any other context.

In conclusion, it is fine to collect personal data when this is reasonable for the purpose of the analytics and is done in line with the relevant regulations. However, although the GDPR was introduced several years ago, professionals who do not come from the data collection or processing domain sometimes find it challenging to be aware of all the provisions and requirements that the GDPR brings forward, which discourages stakeholders from exploring data-driven possibilities. Fig. 3.9 provides a brief overview of the key GDPR topics and the articles associated with them, in the hope of easing the first steps and facilitating exploration of, and engagement in, the creation of data-driven insights within smart cities, and smart mobility in particular.

3.5.1.2.4 Measurability and quantification

As already partially addressed in the previous chapter (see Chapter 2, Quality of life), smart mobility concerns not only observable and measurable dimensions, such as a count of pedestrians or bikes that have passed a certain location, but also subjective components, such as the quality of life, which often concern the human role in the mobility context. Hence, the challenge of measurability and quantification in smart urban mobility is twofold. Firstly, some components of urban mobility are not directly observable and measurable. It therefore becomes challenging to quantify such components and to ensure that they can be used in models jointly with the directly observable ones. This challenge concerns a diverse set of smart mobility applications, such as mobility-related decision support or the evaluation of new mobility measures. Secondly, some components of urban (and general) living are inherently subjective and should be carefully considered and integrated into urban planning. Research has shown that the perception of quality of life is one of them. One example is an investigation of the quality of life of persons with serious disabilities, in which outsiders, such as doctors and family members, tended to perceive the quality of life of these persons more negatively than the persons themselves did (Jenkinson, 2020). Such discrepancies between the perceptions of outside observers and those of the individuals concerned indicate the potential limitations of basing assessments purely on observers’ appraisals. Furthermore, subjective perceptions are also likely to vary across the stages of an individual’s life course as well as across geographic areas. An example is what one might find to be a desirable urban area, where the elderly, young families, or the student population might have quite conflicting perceptions. Moreover, even identical demographic groups in the same geographic area might hold different views on the same matter over time. For instance, the desired neighborhood to live in for young families has changed from a combination of ownership and suburban housing to urban areas, as young professionals now increasingly prefer the city both before and after household formation (Ehrenhalt, 2012). More on how the challenge of measurability and quantification can be tackled by data analytics will be discussed in the Data analytics chapter.

3.5.1.2.5 Data openness

The term open data has been widely used in recent years. Several questions have been at the focus of the debate on open data in this period. Firstly, what is open data and what is not? To some extent, the discussion has converged toward the definition given by the Open Definition (OKF, 2021), where open data is seen as data that can be freely used, reused, modified, and redistributed by anyone for any purpose. In practice, this means that open data need to be available as a complete dataset and at a reasonable reproduction cost. Furthermore, they should be obtainable in a convenient and editable form, ideally by downloading over the internet, and the terms of use should grant unlimited reuse and redistribution, including joining, fusion, or integration with other datasets. Ideally, the above-mentioned provisions should be universal, meaning that there is no discrimination against fields of endeavor, persons, or groups (for instance, restrictions that prevent “commercial” use of otherwise noncommercially available data and uses, etc.) (Fig. 3.10).

FIGURE 3.10 Open data.

The second question in the open data debate concerns which data should be open and which should not, or will not, be. Here, the discussion is still somewhat alive among practitioners; we highlight several thoughts on the matter and briefly discuss the associated challenges:

- In line with the discussion in the previous section, personal data should not be made open by default and might be made open only if this is in line with the GDPR and/or other regulatory frameworks that are in place for the area/individuals concerned.

- In line with recent developments regarding the increased provision of public sector and government data as open data, it is recognized that such efforts promote transparency, accountability toward citizens, and value creation by making data available to all. In particular, this is seen as a positive initiative, as it supports data integration and the development of new services and products, as well as insights associated with the results of related research activities. For mobility, a number of open datasets, such as data on infrastructure and public transport services, have proven to be a useful and practical resource for mobility-related planning and decision-making activities, as well as a fruitful contribution to research and to public discussions and activities such as citizens’ science initiatives. However, not all mobility data are open. The reasons range from regulatory ones (such as camera feeds, where security restrictions may apply) to operational ones (e.g., some public transport operators are reluctant to share data because they are not inclined to provide insight into their day-to-day operations, nor would they like to be in a position of having to justify their daily operational decisions to a broader audience).

- In line with industry developments, commercial data are the least likely to be provided under the open data umbrella. The main reason is that these data are often part of the business model and market position of the data provider. Moreover, even when this is not the case, companies are often not inclined to share their data because they consider that market competitors might in this way gain insight into their operations, customers’ characteristics, etc., challenging their position in the market.

Having said this, there will always be data that are open and data that are not. Hence, there will always be datasets that can be freely used, reused, modified, and redistributed by anyone for any purpose, and datasets that cannot be, or that can be only under specified conditions or licenses. This raises several key challenges, including the integration of open and nonopen data and the associated licensing propagation. An example could be the use of anonymized commercial floating car or mobile phone tracking data together with open datasets that resulted from a citizens’ science emissions observation campaign. Creating a new service and/or insight based on these two datasets would involve several challenges. If, for example, a commercial party wanted to create a new service based on these two types of data, it would face a scalability challenge, as there are no widely accepted data exchange standards for air pollution and emissions. Hence, it would be able to develop a service for a limited area and would need to completely reinvest its efforts in any other area. Developing such a service would likely turn out to be a challenging endeavor with a questionable cost-benefit ratio. On the other hand, if citizens were interested in using both types of data(sets), there is the question of reasonable cost and license but, more relevantly, the question of licensing propagation. The licensing propagation challenge relates to the propagation of the licensing and data provision conditions throughout all data preprocessing and processing steps, both for processes related to the initial data processing purpose and for all processes that might occur in the future. An example would be a citizens’ science project that receives the commercial dataset under the condition that it is never displayed on a collaborative base map or together with the data provided by the collaborative contributions that form such a map (e.g., OpenStreetMap (OSMF, 2022)). Processing activities could involve joining the (commercial) floating car data with the emissions data to obtain an indication of the air quality and potential noise pollution across different neighborhoods in the city. Such a dataset could be joined with statistical data (e.g., census data concerning household income, demographics, etc.) and form a series of new datasets that are used as descriptors of those neighborhoods.
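
To make the idea of licensing propagation more tangible, the following minimal Python sketch shows one way a provision condition could travel with derived datasets. It is an illustration only: the Dataset class, the join function, and origin tags such as collaborative_base_map are hypothetical names introduced here and are not part of any existing standard or library. Each derived dataset inherits the origins and restrictions of its parents, so a restriction attached to the floating car data remains visible on the neighborhood descriptors.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dataset:
    """A dataset together with the provision conditions attached to it (hypothetical model)."""
    name: str
    # Tags describing where the data come from, e.g. "collaborative_base_map".
    origins: frozenset = field(default_factory=frozenset)
    # Origin tags this dataset must never be combined with.
    forbidden_origins: frozenset = field(default_factory=frozenset)


def join(left: Dataset, right: Dataset, name: str) -> Dataset:
    """Join two datasets, refusing the join if either side's conditions forbid it."""
    conflict = (left.forbidden_origins & right.origins) | (
        right.forbidden_origins & left.origins
    )
    if conflict:
        raise PermissionError(
            f"{left.name} + {right.name} violates conditions on: {sorted(conflict)}"
        )
    # The derived dataset inherits origins and restrictions from both parents.
    return Dataset(
        name,
        left.origins | right.origins,
        left.forbidden_origins | right.forbidden_origins,
    )


# The commercial data carry the condition described in the text.
floating_car = Dataset(
    "floating_car",
    frozenset({"commercial_floating_car"}),
    frozenset({"collaborative_base_map"}),
)
emissions = Dataset("citizen_emissions", frozenset({"citizen_science"}))
census = Dataset("census", frozenset({"statistical_office"}))

air_quality = join(floating_car, emissions, "air_quality")
descriptors = join(air_quality, census, "neighborhood_descriptors")
print(sorted(descriptors.forbidden_origins))
```

With such bookkeeping, a later attempt to join the descriptors with data that originate from a collaborative base map would raise an error instead of silently breaching the original conditions.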
Sometime later, another initiative or a new project might want to see how cycling correlates with the number of parks in a city center. This project would use the neighborhood descriptors for the city center neighborhoods together with information about green and blue areas in the city, the length of cycling tracks per square kilometer, etc. The project would use geographical data provided by a geographical data company. These geographical data rely partially on input from collaborative base maps and on data collected by bike trackers that the company provides. The licensing propagation challenge is then how to ensure that the licensing conditions under which part of the original data were provided (data which were in the meantime preprocessed, so that the resulting dataset has a different name, metadata, volume, etc.) are propagated and inherited throughout these steps. The key information that must survive is that, because of part of the initial dataset, the neighborhood descriptors dataset cannot be joined with the geographical data provided by the geographical data company, because part of those data comes from a collaborative base map. This is certainly of interest to the original provider of the floating car data, since such a join would represent a breach of the conditions under which the dataset was provided. However, there are currently no norms or standards for data processing and licensing propagation, and this often represents a barrier in data sharing efforts when it comes to nonopen data(sets).

Another example involves city authorities. Cities or regions might be motivated to implement data-based mobility monitoring services in their area and to provide their citizens with useful and timely information related to their personal mobility. For this, they would be interested in acquiring all the available commercial and noncommercial data regarding their area of interest.
The motivation for this is that a variety of transport modes are available in the area, as well as a variety of stakeholders who collect and/or provide such data, including large international data providers, small local initiatives, and the authorities themselves. These datasets might partially overlap in terms of content but also complement each other. For instance, one dataset might cover car data based on a subsample of the vehicles on the roads, while another might contain car data gathered by a cellular phone service provider. On the other hand, commercial parties might be interested in moving into the new market, getting better insight into customer needs, and realizing the potential for developing new services for cities as clients. However, for a commercial party, it is not an appealing position to have a service consumer who can inspect its raw (unprocessed) data and compare the quality and reliability of its insights with those of a market competitor. For one, the consumer might realize that the competitor has data of higher quality and therefore not consider the commercial party in future projects and activities. In addition, to reduce this risk, commercial parties would like to know their advantages and disadvantages compared to the competition and be able to position themselves appropriately before presenting their data to a potential customer. So, it is an appealing but potentially risky journey for the stakeholders in this domain. Questions like this have arisen in a number of mobility-related projects in practice. So far, two potential facilitators of this process have been identified: (i) the availability of mobility data exchange standards, which were already discussed and which would allow new services to scale and facilitate the data exchange process overall; and (ii) the creation of a neutral body that would act as a data-related interface between end consumers and data providers. The role of the neutral body would be related to data anonymization, so that different datasets/streams from a variety of stakeholders can be formatted in such a way that the end consumer (the city, in the example above) receives all the car-related data for the area concerned but cannot directly identify the source of the data. The end consumer would therefore be able to use the data for its needs, and commercial stakeholders would be able to gain new insights into the potential for new service development and business model creation, while the inherent risk would be mitigated. Several EU projects, such as TMaaS (2020) and Socrates2 (2020), highlighted this challenge and the need for the creation of such a “neutral body.”
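
As a rough illustration of what the neutral body described above might do at its simplest, the Python sketch below pools records from several providers into one feed and removes the fields that point back to the source. The function name and the field names ("provider", "vehicle_id") are hypothetical, and dropping explicit identifiers is only a first step: it does not by itself guarantee anonymization, since patterns in the data may still reveal the source.

```python
import random


def pool_for_consumer(provider_feeds, seed=None):
    """Merge records from several providers into one feed without source identifiers.

    provider_feeds maps a provider name to a list of record dicts.
    """
    source_fields = {"provider", "vehicle_id"}  # hypothetical identifying fields
    pooled = []
    for records in provider_feeds.values():
        for record in records:
            # Keep only the measurement fields; drop anything naming the source.
            pooled.append({k: v for k, v in record.items() if k not in source_fields})
    # Shuffle so that record order does not reveal which provider a block came from.
    random.Random(seed).shuffle(pooled)
    return pooled


feeds = {
    "fleet_operator": [
        {"provider": "fleet_operator", "vehicle_id": "A1", "segment": 12, "speed_kmh": 38},
        {"provider": "fleet_operator", "vehicle_id": "A2", "segment": 13, "speed_kmh": 41},
    ],
    "telecom": [
        {"provider": "telecom", "segment": 12, "speed_kmh": 35},
    ],
}
city_feed = pool_for_consumer(feeds, seed=1)
print(len(city_feed))
```

The city in the example would receive all car-related records for its area in one feed, while the mapping from record to provider stays with the neutral body.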

3.6 ConnectedX

The technology-oriented perspective on connected mobility (ConnectedX) mainly concerns technology-oriented connectivity and has four main elements (Fig. 3.11):

• connected vehicles,
• connected infrastructure,
• connected traveler, and
• connected freight.

FIGURE 3.11 ConnectedX.

3.6.1 Connected vehicles

The term connected vehicles refers to two-way connectivity between the vehicle and its surroundings. In more detail, it covers the communication devices integrated into the vehicle which, based on a wireless telecommunication network, enable in-vehicle connectivity with other devices and/or the connection of the vehicle to external devices, networks, applications, and services. There are five main ways a vehicle can be connected to its surroundings and communicate with them (CAAT, 2020):

1. Vehicle to Vehicle (V2V) communication includes the two-way exchange of information between equipped vehicles within a 300 m radius (NHTSA, 2016). The aim of V2V communication is to increase traffic safety, ease congestion, and reduce the negative impact on the environment by exchanging information about the vehicle’s own speed, location, and heading and the position of surrounding vehicles.

2. Vehicle to Cloud (V2C) communication makes use of IoT solutions and enables the exchange of information about, and for applications of, the vehicle with a cloud system. This allows the vehicle to use information from other cloud-connected industries in the smart city context, such as energy, transport, and smart home solutions (Agarwal et al., 2018).

3. Vehicle to Infrastructure (V2I) communication includes the wireless exchange of data generated by the vehicle and the provision of information about the nearby infrastructure to the driver. V2I technologies communicate information about safety and mobility-related conditions (Kakan et al., 2016).

4. Vehicle to Pedestrian (V2P) communication encompasses information exchange between vehicles and a broad set of vulnerable road users, such as pedestrians, people with disabilities, public transport users, and other nonmotorized road users such as cyclists, via their smartphones and/or other technology-enabled means, and is intended to improve safety and mobility on the road (Rahimian et al., 2018). An example could be a road marking that turns red when a vehicle is approaching a public transport stop while passengers are embarking and/or disembarking, to warn about travelers who are obscured by the public transport vehicle and are trying to cross the street.

5. Vehicle to Everything (V2X) communication relates to the two-way exchange of information between a vehicle and all types of vehicles and infrastructure systems, such as ships, trains, and airplanes (Thompson & Perez, 2020).
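
As a simple illustration of the 300 m radius mentioned for V2V communication, the Python sketch below selects the vehicles that would fall within range of a given vehicle based on GPS coordinates. It assumes a great-circle distance computed with the haversine formula on a spherical Earth; actual V2V range depends on radio conditions, and the function and variable names are illustrative only.

```python
import math

V2V_RANGE_M = 300.0  # radius quoted in the text (NHTSA, 2016)
EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def vehicles_in_range(own, others, radius_m=V2V_RANGE_M):
    """Return the ids of vehicles whose position lies within radius_m of own."""
    return [
        vid
        for vid, (lat, lon) in others.items()
        if haversine_m(own[0], own[1], lat, lon) <= radius_m
    ]


own_position = (51.0, 4.0)
nearby = {
    "a": (51.001, 4.0),  # roughly 111 m to the north
    "b": (51.01, 4.0),   # roughly 1.1 km to the north
}
print(vehicles_in_range(own_position, nearby))
```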

3.6.2 Connected infrastructure

As technology advances, an increasing number of sensing and IoT solutions are being integrated into urban environments. This enables the exchange of data and information across various application areas and solutions. When these solutions are put to use for mobility-related purposes, they are seen as connected infrastructure components of technology-oriented connected mobility. There are four main ways infrastructure can be connected to its surroundings and communicate with them:

- Infrastructure to Vehicle (I2V) is equivalent to V2I and includes the wireless provision of information about the infrastructure to the driver. It covers communication from infrastructure components, such as lane markings, road signs, and traffic lights, to the vehicle, and vice versa. An example of an I2V application would be advanced road markings that communicate the required information to the connected vehicle’s sensors, enabling the detection of lines that lie outside the vision-based spectrum and improving lane detection and traffic safety in extreme weather conditions.

- Infrastructure to Infrastructure (I2I) mainly concerns two-way communication between mobility infrastructure and other infrastructure that is present in its vicinity or connected via cloud-based solutions. An example of I2I communication would be the exchange of information about school start and end times in the neighborhood, so that traffic is calmed while a large number of children are in the vicinity.

- Infrastructure to Pedestrian (I2P) communication refers to two-way communication between mobility infrastructure and vulnerable road users via smartphones or other communication means. I2P examples include informing public transport travelers with visual disabilities about upcoming traffic, traffic light status, and approaching public transport lines/waiting times, or informing commuters about the crowdedness of mobility hubs.

- Infrastructure to Everything (I2X) includes connectivity and the exchange of data between mobility infrastructure and all other connected components, such as smart home features. An example of such an application would be switching on the heating or cooling system when a person has departed from work toward home and is 30 min away from reaching it.

3.6.3 Connected traveler

Connected traveler communication refers to two-way communication between humans and their surroundings in the smart mobility context. It originates from technological advancements and the high number of wearable sensors, such as smartwatches, car keys, mobile phones, etc., that are present in society today and are particularly densely distributed in urban environments (Velaga et al., 2012). Such devices are capable of sensing a variety of data such as, among others, location, time, temperature, and heartbeat, as well as audio and visual information. An example of connected traveler sensing would be the collection and exchange of noise information, where mobile phone microphones could be used as sensing devices for traffic noise in the city. Such anonymized information could be used for aggregated noise maps of the city and could support the implementation of appropriate noise reduction measures and/or serve as a metric for aspects such as quality of life in the city neighborhoods. Furthermore, the provision of photos of mobility infrastructure conditions, such as potholes, damaged bike parking racks, or pavement in a pedestrian area, together with the timestamp and location, can support a timely and appropriate reaction by the services responsible for maintenance. The essence of such information is data about location and time, which is also a core characteristic of moving objects and can be a basis for mobility analysis in cities and beyond. For this reason, we will take a closer look at such data and their potential for mobility applications in the following chapter. Connected traveler communication is closely related to the aims of seamless mobility, as it allows the provision of timely and personalized information to individuals, and it can be achieved in four main ways:

- Traveler to Vehicle (T2V) communication enables two-way communication between the individual and all types of vehicles, such as private cars, trains, or boats. Such applications include already available services like mobile phone Bluetooth connectivity with the vehicle, so that individuals can receive a phone call via the vehicle’s audio devices to ensure hands-free conversation and, consequently, safer traffic conditions. Another example might include automatic detection of the driver (via his/her wearable sensors) so that the vehicle can automatically adjust to the desired driving style or in-vehicle conditions such as temperature, the position of seats, etc.

- Traveler to Cloud (T2C) refers to two-way connectivity between the traveler’s wearable devices and a cloud system, enabling the exchange of information about, and for applications of, the device. This allows the traveler to use information from other cloud-connected industries in the smart city context, such as energy, transport, and smart home solutions, but also to consume cloud-based services such as infotainment during travel.

- Traveler to Infrastructure (T2I) communication includes connectivity between the traveler and the infrastructure and covers potential services such as contactless payment for mobility services, and sharing information about the crowdedness of public space by sensing the number of nearby devices/travelers, and/or the crowdedness levels of mobility hubs/public transport, as well as the occupancy rates of parking places, etc.

- Traveler to Traveler (T2T) includes the exchange of data between travelers, such as traffic conditions (e.g., slow-moving traffic, traffic flow disturbances such as accidents), conditions of the bike path (e.g., based on accelerometer and gyroscope readings), social distancing requirements during a pandemic, etc.
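
As a rough illustration of the aggregated noise map mentioned earlier in this section, the Python sketch below groups anonymized phone readings into grid cells and computes one noise level per cell. It assumes readings of the form (latitude, longitude, level in dB) and a hypothetical grid cell size in degrees. Because decibels are logarithmic, the levels are averaged on the energy scale rather than arithmetically, which is the standard way of averaging sound levels.

```python
import math
from collections import defaultdict


def noise_map(readings, cell_deg=0.001):
    """Aggregate (lat, lon, dB) readings into grid cells with an energy-averaged level."""
    cells = defaultdict(list)
    for lat, lon, level_db in readings:
        # Identify the grid cell by the integer index of its lower-left corner.
        key = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        cells[key].append(level_db)
    result = {}
    for key, levels in cells.items():
        # Convert to relative sound energy, average, and convert back to dB.
        mean_energy = sum(10 ** (level / 10) for level in levels) / len(levels)
        result[key] = 10 * math.log10(mean_energy)
    return result


readings = [
    (51.0005, 4.0005, 60.0),
    (51.0006, 4.0004, 70.0),  # same cell as the first reading
    (51.0105, 4.0005, 55.0),  # a different cell
]
for cell, level in sorted(noise_map(readings).items()):
    print(cell, round(level, 1))
```

Note that the cell containing the 60 dB and 70 dB readings comes out at about 67.4 dB, not 65 dB, since the louder reading dominates the energy average.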

3.6.4 Connected freight

Connected freight refers to two-way connectivity, enabled by a wireless telecommunication network, between freight and its surroundings. It is an advanced concept of logistics, where logistics concerns the effective and efficient management of the procurement, supply, and movement of goods and materials from suppliers to factories and then to retailers, through efficiently operated warehousing facilities. It can also be perceived as a tiered system where, after the long-haul freight transportation tier, the “last mile” tier, also known as urban or city logistics, focuses mainly on the movement of goods and materials within the urban area (Rezende Amaral et al., 2019) and as such is an important component of urban living. Connected freight leverages technology and data to make logistics, and particularly city logistics, more efficient and flexible by integrating five main ways of achieving connectivity between the freight and its surroundings:

- Freight to Vehicle (F2V) refers to the wireless exchange of data between the freight and the vehicle. Examples of such connectivity and related services include the exchange of information about the freight, such as its weight, upon placing it in the vehicle, so that the cumulative load of the vehicle can be determined or the loading weight balance within the vehicle can be adjusted. Another example could be the exchange of data regarding goods that require uninterrupted refrigeration, such as pharmaceuticals, or high-value equipment that is sensitive to vibration or shock, so that in-vehicle conditions can be adjusted and the goods can have an associated condition tracking log available for inspections.

- Freight to Freight (F2F) refers to two-way data exchange between goods. An example of such an exchange would be a security application, such as issuing an audio or visual warning when two types of goods are placed at inadequate distances (e.g., corrosive acids, like hydrochloric acid, near flammable liquids).

- Freight to Individual (F2I) communication is a two-way exchange of data between goods and personal wearable devices such as a smartphone. Examples of such an exchange would be notification of the freight location, contactless pick-up of the shipment, or the exchange of information about approaching expiration dates for the goods.

- Freight to Cloud (F2C) communication comprises the wireless exchange of data between freight and cloud systems, including energy, transport, and smart home systems. An example could be sensing the moment when the consumer opens the shipment, so that future home/business delivery times can be adjusted and optimized.

- Freight to Infrastructure (F2I) refers to the exchange of data between the freight and the infrastructure in its surroundings. Potential applications include the exchange of humidity or temperature conditions for goods that need to be continuously refrigerated, so that an optimal microatmosphere can be achieved.
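
The F2F security application described above can be sketched in a few lines of Python. The example assumes that each item reports a hazard class and a position in metres within the loading area; the rule table, its 3 m separation value, and all names are hypothetical and chosen only for illustration, not taken from any dangerous goods regulation.

```python
import math
from itertools import combinations

# Hypothetical rule table: pairs of hazard classes and their minimum separation in metres.
MIN_SEPARATION_M = {frozenset({"corrosive", "flammable"}): 3.0}


def proximity_warnings(goods):
    """Return pairs of item ids that are closer together than their rule allows.

    goods is a list of (item_id, hazard_class, (x, y)) with positions in metres.
    """
    warnings = []
    for (id_a, class_a, pos_a), (id_b, class_b, pos_b) in combinations(goods, 2):
        required = MIN_SEPARATION_M.get(frozenset({class_a, class_b}))
        if required is not None and math.dist(pos_a, pos_b) < required:
            warnings.append((id_a, id_b))
    return warnings


load = [
    ("acid", "corrosive", (0.0, 0.0)),
    ("solvent", "flammable", (1.0, 1.0)),
    ("paper", "general", (0.5, 0.0)),
]
print(proximity_warnings(load))
```

In a real deployment, each warning would trigger the audio or visual signal mentioned in the text; here the function simply returns the offending pairs.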

3.6.5 Service-oriented perspective of ConnectedX

Whereas the “X2X” view of the technology-oriented perspective on connected mobility discussed above is a more communication-focused notation, one could also consider a more service-oriented point of view on ConnectedX. In this context, we can distinguish between in-vehicle services and in-between systems-based services. The in-vehicle services focus perceives vehicles as IoT-enabled, capable of collecting, sharing, and receiving data, but also of processing data and extracting information. As such, they show great potential for smart mobility integration and new service creation. Some of the first in-vehicle service applications include interactive advanced driver-assistance systems (ADASs). ADASs are electronic systems that assist drivers in operating the vehicle during operations such as driving and parking. The main purpose of such systems is to increase vehicle and road safety by supporting the driver and reducing the possibility of human error in traffic (EC, 2016), which is estimated to have played a role in 94% of traffic incidents and accidents (unlike incidents, accidents result in human casualties (UNECE, 1993)). A non-exhaustive list of ADAS technology features is given in Table 3.1.

TABLE 3.1 ADASs systems technology features.

Adaptive cruise control (ACC)
Alcohol ignition interlock devices
Antilock braking system
Automatic parking
Automotive head-up display
Automotive navigation system
Automotive night vision
Backup camera
Blindspot monitor
Collision avoidance system (precrash system)
Crosswind stabilization
Cruise control
Driver drowsiness detection
Driver monitoring system
Electric vehicle warning sounds used in hybrids and plug-in electric vehicles
Electronic stability control
Emergency driver assistant
Forward collision warning (FCW)
Intersection assistant
Glare-free high beam and pixel light
Hill descent control
Hill-start assist
Intelligent speed adaptation or intelligent speed advice (ISA)
Lane centering
Lane departure warning system (LDW)
Lane change assistance
Parking sensor
Pedestrian protection system
Rain sensor
Omni view technology
Tire pressure monitoring
Traction control system
Traffic sign recognition
Vehicular communication systems
Vibrating seat warnings
Wrong way driving warning

Whereas ADASs concern in-vehicle-related features, cooperative intelligent transport systems (C-ITS) refer to the creation of new mobility services based on cooperation between two or more ITS subsystems and hence reflect the in-between systems services. This is motivated by the possibility of offering better quality and an enhanced service level compared to the same service provided by a single ITS subsystem. Such cooperation can allow road users and traffic managers to share information and use it to coordinate their actions. While the technology for C-ITS features is relatively mature, efforts aiming at wide-area deployments have not yet progressed as far. There are several reasons for this, including legislative aspects and debates related to the connected vehicle standards that would enable cross-border and cross-industry deployments involving both private and public bodies, a challenge that seems to cut across many data-driven smart mobility developments. However, some C-ITS features are being deployed on smaller scales in urban areas, and it is expected that the early (Day 1) C-ITS services, when deployed in an interoperable way across Europe, will produce a benefit-cost ratio of up to 3 to 1 (every euro invested in Day 1 C-ITS services should generate up to three euros in benefits) in the time frame from 2018 to 2030 (European Commission, 2016). Table 3.2 shows the time planning of C-ITS services (European Commission, 2016; C2CCC, 2020).

TABLE 3.2 C-ITS services.

List of Day 1 C-ITS services
• Hazardous location notifications:
- Slow or stationary vehicle(s) and traffic ahead warning;
- Road works warning;
- Weather conditions;
- Emergency brake light;
- Emergency vehicle approaching;
- Other hazards.
• Signage applications:
- In-vehicle signage;
- In-vehicle speed limits;
- Signal violation/intersection safety;
- Traffic signal priority request by designated vehicles;
- Green light optimal speed advisory;
- Probe vehicle data;
- Shockwave damping.

List of Day 1.5 C-ITS services
• Information on fueling and charging stations for alternative fuel vehicles;
• Vulnerable road user protection;
• On street parking management and information;
• Off street parking information;
• Park and ride information;
• Connected and cooperative navigation into and out of the city (first and last mile, parking, route advice, coordinated traffic lights);
• Traffic information and smart routing.

List of Day 2 C-ITS services
• Advanced warnings
• Vulnerable road users protection (e.g., pedestrian, PTWs, + infra support)
• Semi automated driving (e.g., cooperative emergency brake assistance, cooperative adaptive cruise control)
• Cooperation with traffic light controllers

List of Day 3 C-ITS services
• Cooperative automated driving (cooperative adaptive cruise control strings, cooperative lane merging)
• Cooperation with infrastructure for automated driving (intersection crossing, assisted transition of control)

3.6.6 Autonomous vehicles Autonomous vehicles are seen as a component of the mature stage of the connected mobility technologies, assuming that the mature implementation of connected mobility solutions is a precondition to the introduction of autonomous vehicles in the traffic flows in a safe and reliable manner. Hence, connected mobility technologies are expected to be a fundamental component of mobility advances as automated driving, as they will allow the exchange of sensor and awareness data among vehicles, cooperative localization, and map updating, as well as facilitate cooperative manoeuvrers between automated vehicles (IEEE, 2022). In this context, a fully autonomous vehicle is considered to be a vehicle capable of sensing its environment and operating without human involvement. Fig. 3.12 gives an overview of the different levels of automation based on the EU Commission (European Commission, 2018a, 2018b, 2018c) and the Society of Automotive Engineers (Society of Automotive Engineers, 2021). Automated vehicles are still not ready to operate without human supervision in an urban

3.6 ConnectedX

FIGURE 3.12

55

Levels of driving automation.

mobility context. There are yet many challenges to be solved to ensure that they are fully capable of sensing their environment, correctly interpreting it, and taking appropriate action as a human driver does. However, the expected benefits for society and the economy are high as the new market for automated and connected vehicles is expected to grow exponentially, with, for instance, revenues exceeding EUR 620 billion by 2025 for the EU automotive industry and EUR 180 billion for the EU electronic sector (European Commission, 2018a, 2018b, 2018c). In addition, the deployment of autonomous vehicles in synergy with decarbonization measures is expected to contribute significantly to achieving key societal objectives such as the socalled Vision Zero, i.e., achieving no road fatalities on European roads by 2050 (European Commission, 2011). From the citizen acceptance perspective, initial studies indicate that a large number of European citizens have a good acceptance of fully autonomous vehicles with 58% willing to take

a ride in a driverless car (World Economic Forum, 2016). High expectations are also placed on their potential to contribute to the smart mobility vision by bringing mobility to those who cannot drive themselves (e.g., elderly or disabled people) or those who are underserved by public transport. In addition, they are expected to encourage innovative mobility schemes such as shared mobility, electromobility, and MaaS (discussed further in the next sections) and, correspondingly, to benefit urban planning by helping to free up the space wasted on parking. In this context, the first introductions of automated vehicles are already being implemented, mainly in secured environments such as dedicated traffic lanes or dedicated mobility systems (e.g., within airport services that utilize dedicated space and infrastructure). This way, the capabilities and functionalities of autonomous vehicles can already be showcased, even if there is still a full spectrum of requirements for introducing autonomous vehicles within mixed flows (i.e.,


traffic flows where both conventional and autonomous vehicles are present) and mobility systems where several transport modes are present, including (inherently not automated) humans as pedestrians, that are still not met.

3.6.6.1 Example: autonomous vehicles (I)

There are several autonomous vehicle tests and related initiatives ongoing at the moment. One such initiative is the LINC project (UIA, 2021a, 2021b, 2021c), which aimed at linking driverless technology with sustainable urban development through preparatory actions for the introduction of an autonomous shuttle bus service. The planned service foresees the use of driverless shuttle vehicles, with a capacity of 10–12 passengers, which would operate along the 28 km long Greater Copenhagen light rail and its 29 stops. The Greater Copenhagen light rail is planned to be established between 2018 and 2025 and would connect Denmark’s major urban development areas, serving 10 municipalities. The shuttle bus service, which would have an operating speed of up to 20 km per hour, would aim at extending the light rail’s range by offering sustainable mobility alternatives for collective transport. The project included two testbeds (the Technical University of Denmark campus in Lyngby and Hersted business park in the city of Albertslund). The testbeds were used to examine various obstacles that must be overcome when integrating level four (UIA, 2021a, 2021b, 2021c) autonomous vehicle technology (Fig. 3.13) into the mobility system, and to collect both quantitative data on the performance of the driverless shuttle bus and qualitative data on the experiences of the passengers with this new technology.
The project also identified key barriers that are associated with the possibility to introduce the self-driving electric shuttles in mixed traffic in the concerned area, including the prolonged application process for the autonomous vehicle approval and process and timeline discrepancies among different countries when it

FIGURE 3.13 LINC EU project.

comes to legal approval to operate the autonomous buses. Following the project execution, legal readiness, financial sustainability, and the possibility of upscaling were identified as key challenges along the way (UIA, 2021a, 2021b, 2021c).

3.6.6.2 Example: autonomous vehicles (II)

Another example related to the introduction of autonomous vehicles focuses on infrastructure readiness. The Vebimobe project (De Mol et al., 2016) focused on existing infrastructure and its potential to support autonomous vehicles’ navigation services. In particular, the project focused on the usability of the Flemish traffic sign database as a dynamic source of information that could support sustainable mobility routing capabilities (Semanjski & Gautama, 2019) and extend its functionalities toward the provision of routing and navigation services for autonomous vehicles. The project’s demonstration area was situated in the Merelbeke neighborhood of the city of Ghent, Belgium, which is the capital and largest city of the East Flanders province, with approximately 250,000 inhabitants. The city itself is accessible via two motorways (E40 and E17), a train network, and a seaport. The inner network consists of two main ring roads (R4, connecting


the outskirts of Ghent with each other and the surrounding villages, and R40, connecting the different downtown quarters with each other). It is also a university city and home to the largest designated cyclist area in Europe, with nearly 400 km of cycle paths. Fig. 3.14 illustrates the sustainable routing investigation, where the traffic sign database was used as a source of information regarding sustainability-related points of interest, such as the locations of parks or care facilities where reduced traffic noise is advised. In particular, Fig. 3.14 indicates the (a) location of traffic signs in the area retrieved from the traffic sign database,


(b) indication of school-related traffic signs (red dots) and the delivery location for heavy goods vehicles (blue dot), (c) shortest path route for goods delivery, and (d) sustainability routing option that would be activated during school start and end hours, when children are approaching and/or leaving the school area. Stakeholders involved in the research, such as citizens and authorities, perceived the sustainable routing option as the desired solution for safety reasons. The delivery company involved also found this option feasible, because the overall distance traveled by the delivery vehicle was less than 10% longer than the shortest path.

FIGURE 3.14 (A) The routing location with an indication of all traffic signs in the area; (B) The routing location with an indication of sustainability-related traffic signs (red, light gray in print) and delivery locations (blue, dark gray in print); (C) The Dijkstra-based shortest path route; (D) The route with the sustainability components integrated.
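The difference between the shortest path route and the sustainability routing option can be sketched in a few lines of code. The following is a minimal illustration only, not the Vebimobe implementation: the small road graph, the node names, the edge lengths, and the penalty factor are all hypothetical. The only elements taken from the text are the use of a Dijkstra-based shortest path and the idea of a routing option that is activated during school start and end hours.

```python
import heapq

# Hypothetical road graph: node -> {neighbour: edge length in metres}.
GRAPH = {
    "depot": {"A": 400},
    "A": {"depot": 400, "school": 300, "B": 500},
    "school": {"A": 300, "shop": 300},
    "B": {"A": 500, "shop": 150},
    "shop": {"school": 300, "B": 150},
}
# Nodes flagged by school-related traffic signs (hypothetical).
SCHOOL_NODES = {"school"}


def make_edge_cost(school_hours_active, penalty=10.0):
    """Build a cost function; the penalty factor is an assumed tuning value."""
    def edge_cost(u, v, length_m):
        near_school = u in SCHOOL_NODES or v in SCHOOL_NODES
        if school_hours_active and near_school:
            return length_m * penalty  # discourage edges next to the school
        return length_m
    return edge_cost


def dijkstra(graph, source, target, edge_cost):
    """Return (total cost, path) of the cheapest route from source to target."""
    queue = [(0.0, source, [source])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbour, length_m in graph[node].items():
            if neighbour not in settled:
                step = edge_cost(node, neighbour, length_m)
                heapq.heappush(queue, (cost + step, neighbour, path + [neighbour]))
    return float("inf"), []


def path_length(graph, path):
    """Physical length of a path in metres, ignoring any penalties."""
    return sum(graph[u][v] for u, v in zip(path, path[1:]))


# Plain shortest path versus the route used during school hours.
shortest_cost, shortest_path = dijkstra(GRAPH, "depot", "shop", make_edge_cost(False))
_, sustainable_path = dijkstra(GRAPH, "depot", "shop", make_edge_cost(True))
detour = path_length(GRAPH, sustainable_path) / shortest_cost - 1.0
```

In this toy network the school-hours route avoids the school node and is 5% longer than the shortest path, which illustrates the kind of trade-off (a detour of less than 10%) reported for the demonstration area.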


To evaluate the applicability of the designed sustainable mobility route, several evaluation rides were performed with a specially equipped vehicle. These rides were intended to evaluate the reliability of the traffic sign database, compared to the computer vision-based perception systems of autonomous vehicles, and its capability to feed routing planners with traffic sign-based insights relevant to sustainable routing planning. In this context, traffic sign information showed potential to be integrated into autonomous vehicles’ routing algorithms, as the traffic sign database was perceived as a more reliable source of information than computer vision-based traffic sign recognition solutions. This was mainly due to known challenges that such sensing techniques face, such as occlusion of traffic signs, variability in the locations of traffic signs (on the pavement or as marks on the road surface), variations in traffic sign standards across different countries, etc. However, to capitalize on this potential and realize such a traffic sign database-based service, challenges related to the database updating mechanisms and the reliability of traffic sign mapping should be resolved. One example is translating the “start of the populated area” traffic sign into a map polygon that carries the information about the speed limit in that area, together with the way that limit is overridden by speed limit traffic signs situated within the populated zone, by speed limits defined per specific traffic lane, or by speed limits for dedicated vehicle categories.

3.6.7 ConnectedX-related challenges

Technologically oriented connected mobility faces several key challenges:

1. Vehicle and technology
Vehicle and technology-related challenges concern issues such as the integration of autonomous vehicles into mixed traffic flows (where differing vehicle and connectivity lifecycles are present in a single mobility system), the management of unforeseen vehicle operations, and the design of cost-effective, scalable, and fault-tolerant support and architectures. Furthermore, challenges such as heterogeneity among cities and regions in terms of the readiness of digital infrastructures should also be considered.

2. Public acceptance and consumer readiness
Public acceptance and consumer readiness refer to the demand side of ConnectedX and aspects of consumer acceptance, including challenges such as driver and passenger education and trust in automated operations like assistance and control options. This involves ethical issues related to transferring the responsibility of driving to autonomous vehicles (for example, how an autonomous vehicle should react when an incident or accident cannot be avoided, and which criteria should be employed to determine the vehicle’s decision-making). Furthermore, from the consumer perspective, the required initial investment and privacy issues, such as the use of personal mobility data, need to be tackled in an acceptable and transparent manner, as do the potential impact on the labor market and the timely development of required skills and reskilling.

3. System of systems integrations
System of systems integration challenges concern both technical and mobility-related components. From a technical perspective, they relate to technical integration among connected moveable and stationary components of the mobility system, data standardization, compatibility, and reliability, as well as business integration, governance, and ownership issues. From the mobility system perspective, the integration of autonomous vehicles concerns the need to shape new mobility solutions and services that would complement other transport modes already present in the area concerned. In this context, challenges such as the possibility that the introduction of autonomous vehicles could create a negative modal shift, as well as the potential added value in the context of social inclusion and equity (e.g., increased mobility for the disabled and reduced mobility poverty), should be considered in a timely manner.

4. Legal and regulatory aspects
Legal and regulatory challenges concern questions of liability (system security and resilience toward ill-intended events such as signal spoofing, viruses, etc., as well as distraction and safety issues) and the lack of political and regulatory alignment applicable across wide areas.

3.7 Electric vehicles

An electric vehicle (EV) is a vehicle that uses electric motors or traction motors for propulsion. Whereas electric motors are applied in a wide range of vehicles, traction motors are more specific and are used in systems such as locomotives. The powering system for EVs can be external, for example powering through a collector system by electricity from off-vehicle sources, or self-contained within the vehicle, e.g., with a battery, fuel cells, or an electric generator. Electric vehicles can have only electric motors, but they can also combine one or more electric motors with at least one other propulsion system. Such vehicles are a special category of EVs, called electric hybrid vehicles (EHV). Although electric vehicles are at the focus of recent mobility discussions, they have existed for well over a century (Wakefield, 1994) and are present among a wide spectrum of transport modes (e.g., road and rail vehicles, surface and underwater vessels, electric aircraft, etc.). However, with the advances in the domain of energy storage solutions, the production of personal vehicles with electric propulsion has risen over the past decade. This is supported by a wide range of policy measures aiming to increase the share of electric vehicles in mobility systems (Table 3.3 gives an overview of EU Member State incentives for EVs and charging infrastructure). There are several reasons that motivate this increase in the share of electric vehicles, including the fact that


they emit no tailpipe carbon dioxide (CO2) or other pollutants such as nitrogen oxides (NOx), non-methane hydrocarbons (NMHC), and particulate matter (PM) at the point of use. The European Commission has evaluated that the EV “tank-to-wheels” efficiency is about three times higher than that of internal combustion engine vehicles (Greenmotion consortium, 2020). The emission of fine particles, which are less than or equal to 2.5 microns in diameter and are also known as PM2.5, has a particularly bad impact on human health. According to the European Environment Agency (EEA, 2018a, 2018b), the emissions of PM2.5, together with the emissions of nitrogen dioxide (NO2) and ozone (O3), are responsible for more than 480,000 premature deaths annually. Another study, by the International Council on Clean Transportation (Anenberg et al., 2019), shows that in EU countries between 70% and 75% of the deaths caused by the emissions of the transport sector are due to road transport in particular. Hence, air pollution is an important issue for many urban (and not only urban) authorities, as it is estimated to have a deeply negative impact on public health and strongly affects the smart city and smart mobility aspects of urban living. In this context, the introduction of EVs in place of traditional combustion engine vehicles can support the mitigation of transport-related negative externalities, which is particularly relevant in urban settlements due to the high density of traffic and the frequent occurrence of traffic congestion. Another positive side of EVs is that they provide quiet and smooth operation and consequently create less noise and vibration than conventional vehicles.
And although the noise emitted by urban road transport depends not only on the type of vehicles and their engines but also on the type of road surfaces, the traffic volume, or the traffic speed, in urban areas where the average speed is generally low, and the vehicles are often static, the noise emitted by engines accounts for a fair share of the total noise created by the traffic. At the same time,

TABLE 3.3 EU Member State incentives for EVs and charging infrastructure (Spöttle et al., 2018).

the EEA estimated that over 100 million Europeans are exposed to harmful levels of noise (i.e., above 55 dB) (EEA, 2018a, 2018b). Hence, EVs show great potential to contribute to reducing overall noise levels in a smart mobility context.

3.7.1 Electric vehicles related challenges

There are still a number of challenges related to the wider adoption of EVs and their higher share in urban transport systems. From a technical perspective, these include reliability


and durability of batteries, reduction of their weight and volume, safety, improved hybrid electric powertrains, charging infrastructure, and plug-in solutions, as well as making batteries easily recyclable. Although battery costs were reduced over the recent years and a lot of improvements were made regarding the battery energy density, this still remains a main limiting factor for the market penetration of EVs. The same goes for the batteries’ lifespan that ideally would be prolonged, allowing the EVs to charge faster without negative impacts. Next to this, substantial investment is also required for charging infrastructure, for which a common standard is still absent. Nonetheless, EU


legislation requires minimum standards for physical plugs and payment systems so that interoperability between operators will likely be implemented soon. Fig. 3.15 gives an overview of charging modes with an indication of related purchasing costs (excluding the installation, grid connection, and operating cost, which might vary across different areas) (Spöttle et al., 2018). From a multimodal transport system perspective, the adoption of EVs poses challenges in the public transport domain, as the introduction of electric buses into the fleet requires operational and technological adjustments. However, electric buses do not yet offer the same flexibility

FIGURE 3.15 Charging modes with an indication of related purchasing costs (excluding the installation, grid connection, and operating cost, which might vary across different areas).


as conventional buses. The same is the case for taxi operators, for whom active usage of the vehicle is directly linked with their business model; hence, breaks for battery charging have undesirable effects on both vehicle capacity utilization and drivers’ working time efficiency. In this context, high expectations are also placed on the further development of alternative charging technologies such as battery swapping, wireless charging, rapid bus charging, and supercapacitors. Furthermore, a higher share of EVs in cities also means higher energy demand, which in turn might have negative impacts if not addressed sufficiently in advance, preferably by using renewable energy sources. Also, although reduced noise is one of the positive points of EVs, related safety challenges need to be tackled, as humans perceive their environment with all their senses and relate the sound of traffic to the proximity of a vehicle. In this respect, the lack of sound cues can result in potentially unsafe situations, and an adjustment period will be needed both for other traffic participants and for drivers, who might find themselves in unexpected traffic situations requiring faster reactions. Here, a synergy with the above-mentioned connected vehicle technologies seems promising; however, dedicated EV regulations that would entwine these two segments to address this challenge are still not present. Another challenge related to EVs comes from the fact that cities need to carefully consider and align their strategies to achieve the desired effects. This strongly relates to the fact that electric vehicles are just another form of vehicle: the desired higher share of EVs in urban traffic is likely to have positive impacts on aspects such as emissions and noise.
However, EVs are likely to have a neutral impact on space occupancy (e.g., EVs will replace conventional vehicles but will still occupy the same parking capacity) and congestion (EVs will not by themselves either improve traffic flows or increase vehicle occupancies; hence, the same level of time lost in traffic (Fig. 3.16) is to be expected, all other aspects being unchanged). Consequently, to efficiently utilize all the opportunities offered by both EVs and connected vehicles, smart mobility strategies need to balance community needs and aims with synergies among the available emerging technology solutions and organizational and mobility schemes.

3.8 Shared mobility

One of the smart mobility strategies worth exploring is shared mobility, which belongs to the wider concept of the sharing economy. Shared mobility is a transport strategy that refers to the shared use of transport mode capacities, where users access transport services on an as-needed basis. It is an umbrella term that encompasses a variety of transport modes including, among others, car sharing, bike sharing, ride sharing, and carpooling practices (Fig. 3.17 and Fig. 3.18). Shared mobility is strongly supported by technological innovations, such as social networking, location-based services, and increased usage of mobile phones, which have enabled it to develop and expand quite swiftly in recent years. As a strategy, shared mobility reduces the need for personal ownership of a vehicle and addresses the issue of low usage of such vehicles. Studies have shown that personal vehicles are used on average for around 1 h per day (Meijkamp, 1998; Shaheen & Cohen, 2008), meaning that they are parked most of the time, taking up valuable space from society. This effect is particularly noticeable in urban areas, where demand for space is high and the outcome of the space use decision-making process has strong implications for overall liveability and, consequently, the sustainability of an urban settlement.

FIGURE 3.16 Hours spent in road congestion by the average driver in 2017 (EU countries) (European Commission, 2019).

FIGURE 3.17 Most common shared mobility practices.


FIGURE 3.18 Bike sharing station at the Palau de la Música Catalana, Barcelona, Spain.

Initial implementations of shared mobility strategies such as car sharing have shown a favorable impact on overall car use in two respects: reduced frequency of car use, down from 3.5 to 2.0 times a week, and a reduced number of driven kilometers, by 33% on average (Fishman et al., 2014; Meijkamp, 1998), compared to user behavior before engaging with car sharing or before the introduction of shared mobility services in the area. Positive implications in the form of reduced journey costs for individuals, by up to 50% (Fellows & Pitfield, 2000), and reduced parking pressure were also observed. Although the size of this impact is not equal across the globe, its trend is consistent: every shared car reduced the need for 4–10 privately owned vehicles in Europe, 6–23 in North America, and 7–10 in Australia (Shaheen & Cohen, 2008), significantly reducing local parking pressure (Fig. 3.19). Studies report that early adopters of such services find their motivation in potential financial savings and in the fact that they used their vehicles sporadically, indicating a high potential to replace second car ownership per household. In addition, the two most important predictors of shared mobility trip usage were observed to be the distance to the nearest vehicle station and the length of membership, and both factors have a greater influence on vehicle owners than on nonowners (Katzev, 2003).

For shared mobility options such as carpooling, positive implications for reducing air pollution, traffic congestion on the roads, and the need for parking spaces were also observed. However, the adoption of shared mobility options is not uniform, and new services need to be considered in relation to urban population characteristics and needs. For example, commuter carpooling is particularly popular among people who work in places with more jobs nearby and who live in places with higher residential densities (Belz & Lee, 2012). Also, the use of shared mobility options is significantly correlated with transport operating costs (such as fuel prices and commute length) and with measures of social capital (such as time spent with others). On the other hand, carpooling is significantly less likely among people who spend more time at work, elderly people, and homeowners (DeLoach & Tiemann, 2010).

FIGURE 3.19 Impact of one shared car on space use and parking pressure.
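The replacement ranges reported by Shaheen and Cohen (2008) lend themselves to a quick back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that the per-car effect scales linearly with fleet size; the function names and the fleet size of 50 cars are hypothetical.

```python
# Privately owned vehicles replaced per shared car, as quoted in the text
# (Shaheen & Cohen, 2008): (lower bound, upper bound) per region.
REPLACEMENT_RANGE = {
    "Europe": (4, 10),
    "North America": (6, 23),
    "Australia": (7, 10),
}


def private_cars_replaced(shared_cars, region):
    """Return a (low, high) estimate of private cars replaced by a shared fleet.

    Linear scaling with fleet size is an assumption made for this example only.
    """
    low, high = REPLACEMENT_RANGE[region]
    return shared_cars * low, shared_cars * high


def relative_reduction(before, after):
    """Fractional reduction when a quantity drops from `before` to `after`."""
    return (before - after) / before


# Hypothetical fleet of 50 shared cars in a European city.
europe_low, europe_high = private_cars_replaced(50, "Europe")
# Car use frequency quoted in the text: from 3.5 to 2.0 times a week.
weekly_use_reduction = relative_reduction(3.5, 2.0)
```

Under these assumptions, a hypothetical fleet of 50 shared cars in Europe would correspond to between 200 and 500 privately owned vehicles, and the reported drop from 3.5 to 2.0 car uses per week corresponds to a reduction of roughly 43%.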

The positive effect of shared mobility schemes in synergy with already available mobility services was also observed. Studies indicated an average decrease of 10% in public transport travel times on a single trip after the shared bike strategy was implemented in Helsinki (Jäppinen et al., 2013), as well as positive implications for the health and wellbeing of shared mobility users who used active shared transport modes such as bikes (Ma et al., 2018).

3.8.1 Shared mobility-related challenges

Although shared mobility has become a standard mobility offering in many cities, some open challenges still exist. Among them is vehicle availability at peak hours, when demand for mobility services is highest, which has a negative impact on an individual’s impression of personal resilience (e.g., vehicle availability in critical situations such as natural disasters). These challenges can only partially be addressed with online technologies that allow vehicles to be reserved at the desired time and location; sometimes even these demands cannot be met, resulting in potentially low quality and reliability of the service. On the other hand, the service provider needs to balance the number of vehicles to ensure the reliability of the service in a cost-effective way, which is often challenging, especially when accompanied by unsatisfactory user discipline (e.g., late returns). Another challenge comes from vehicle condition, which might not be in the desired state after previous use. This particularly relates to shared mobility use during pandemic periods, as well as to the general cleanliness of the vehicle, which cannot always be addressed between two uses. Additional challenges come from the ease of use, e.g., key exchange, in schemes such as peer-to-peer sharing, where flexibility is needed from both the vehicle user and the vehicle provider. Many of these challenges can be met by thoughtful consideration of mobility needs (e.g., incentivizing services for areas with low accessibility) or by synergizing with existing mobility options in the city to achieve desirable impacts while maintaining and/or improving mobility service and user experience. This includes complementarity with the existing mobility options as well as the adoption of innovative means, such as connected, automated, or electric vehicles, and solutions to maximize the ease of use and efficiency of the mobility system. One such solution, frequently mentioned lately, is mobility as a service.

3.8.2 Example: impact of shared mobility practices on electric vehicles

The availability of electric vehicles in shared mobility offerings has the potential to multiply the positive impacts. However, the literature suggests that many shared electric vehicle systems are failing to reach satisfactory commercial viability (Fukuda et al., 2003). A potential reason for this is the effect of higher vehicle usage, which is characteristic of shared mobility services such as car sharing, and its implication for the battery’s state of health (the battery’s ability to store and deliver electrical energy). Battery degradation is related to the cost of the battery, hence to the cost of the electric vehicle (the battery accounts for about 54% of the total production costs of the vehicle (Cars 21, 2016)) and, in turn, to the commercial viability of car sharing practices. During the e-Mobility project (e-Mobility, 2017), the battery’s state of health for two identical electric vehicles shared under two different car sharing practices was estimated and compared (Fig. 3.20). For this purpose, real-life transaction data from charging stations and from different sensors of the electric vehicles were used for two identical and equally old EVs. The vehicles were used in the same geographic area, and their use differed in the average state of charge (the ratio of the available capacity to the maximum possible charge that can be stored in a battery, i.e., the nominal capacity), the depth of discharge (the fraction or percentage of the capacity that has been removed from the fully charged battery before recharging it), and the percentage of fast charger utilization. The results indicated that insight into users’ driving and charging behavior can provide a valuable point of reference for car sharing system designers (Semanjski & Gautama, 2016a, 2016b). In particular, the results indicated that delayed charging (for example, prior to rather than right after the use of the electric vehicle) and lower utilization of fast chargers have the potential to slow down the rate of battery degradation. However, the third element that differed between the two car sharing practices, the depth of discharge, is mainly a result of users’ basic need for mobility (e.g., trip distance). Overall, the forecasting results from this study showed that the moment when an electric vehicle battery reaches its theoretical end of life can differ by as much as 1/4 of the time when vehicles are shared under different conditions (Fig. 3.21).

FIGURE 3.20 e-Mobility shared car services research.
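The quantities used in this example can be made concrete with a short sketch. The code below is an illustration under stated assumptions, not the method of the cited study: it computes the state of charge and depth of discharge as defined in the text, fits a straight line to state-of-health observations by ordinary least squares, and extrapolates that line to an end-of-life threshold. The 80% threshold, the function names, and the state-of-health readings are all hypothetical.

```python
def state_of_charge(available_kwh, nominal_kwh):
    """Ratio of the available capacity to the nominal capacity."""
    return available_kwh / nominal_kwh


def depth_of_discharge(removed_kwh, nominal_kwh):
    """Fraction of capacity removed from a fully charged battery."""
    return removed_kwh / nominal_kwh


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def months_to_end_of_life(months, soh_percent, eol_threshold=80.0):
    """Extrapolate the linear state-of-health trend to the threshold.

    The 80% default is an assumed threshold, not a value from the text.
    """
    slope, intercept = fit_line(months, soh_percent)
    if slope >= 0:
        return float("inf")  # no measurable degradation trend
    return (eol_threshold - intercept) / slope


# Made-up state-of-health readings (%) for two identical vehicles that are
# shared under different charging practices.
MONTHS = [0, 6, 12, 18, 24]
SOH_VEHICLE_A = [100.0, 99.0, 98.0, 97.0, 96.0]
SOH_VEHICLE_B = [100.0, 98.8, 97.6, 96.4, 95.2]

eol_a = months_to_end_of_life(MONTHS, SOH_VEHICLE_A)
eol_b = months_to_end_of_life(MONTHS, SOH_VEHICLE_B)
```

With these made-up readings, the two linear trends reach the assumed threshold after about 120 and 100 months, respectively, which shows how a modest difference in degradation rate translates into a noticeably different end-of-life moment.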

FIGURE 3.21 State of health linear trends extrapolation and theoretical end-of-life border.

3.9 Mobility as a service

Mobility as a Service (MaaS) refers to the integration of various forms of transport services, such as public transport, shared mobility, taxi, etc., from both public and private service providers, together with payment and booking options, through one platform, into a single mobility service accessible on demand to the end user. MaaS is still an emerging mobility service offering that is gaining pace in many cities around the world, which find themselves at different stages of integration (Fig. 3.22, as described by Sochor et al. (2018)). Due to the limited number of mature examples, the analysis of real-life MaaS demonstrations remains relatively limited. The initial, somewhat fragmented, documentation of such demonstrators indicates potential benefits in economic, societal, transport-related, and environmental aspects. These advantages can be perceived from the perspective of key MaaS stakeholders: end users, businesses, and the public sector (Hietanen, 2014).

FIGURE 3.22 Stages of MaaS integration.

From the end user (traveler) perspective, MaaS represents the best value proposition through the use of a single platform to provide access to mobility, with a single and easy-to-use payment and booking channel, replacing multiple ticketing and payment operations that would traditionally be required. In combination with data analytics, and by designing MaaS to integrate a wide spectrum of mobility services, it should be able to support end users by providing customized offerings and solving the inconvenient parts of individual journeys, hence facilitating their seamless mobility across all transport modes and making the traveling experience more pleasant and efficient. Such advances are particularly valued when it comes to individuals with disabilities or reduced mobility, such as the elderly or the mobility poor.

From the business sector’s perspective, the integration of a wide spectrum of mobility services brings possibilities for new business models and new ways to organize and operate the various transport options. Potential advantages for transport operators also include access to improved user and demand information, which, accompanied by data analytics skills, can provide new opportunities to serve unmet demand, while active transport operators may ensure cost reduction in individual operations.

From the public sector perspective, the implementation of MaaS has the potential to create new jobs and to improve transport system reliability, accessibility, and resource allocation efficiency. In addition, by promoting low energy consumption and environmentally friendly mobility options through MaaS, environmental and societal benefits are also possible, as well as reduced dependence on private vehicles (Cole, 2018).

3.9.1 MaaS-related challenges

To achieve economic or societal goals and to correct transport system issues that the urban area or region might have, it is pivotal for public authorities to establish the conditions for, and orchestrate, MaaS. For this, authorities need to define appropriate rules and regulations for the mobility market (Polydoropoulou et al., 2020), as MaaS innovation might bring disruptive changes which, if simply left to urban mobility market forces, may not bring the expected economic and societal benefits. This is one of the key challenges when it comes to achieving the desired smart mobility aims at the city level. However, the initial implementations of MaaS have demonstrated several additional implementation-related challenges. Firstly, MaaS encompasses the integration of mobility services that serve the same geographic area and the same market segments. By the nature of the mobility market, these services are thus competing, and providers are not likely to be inclined to exchange data with each other or to see immediate advantages in participating in such integration. Although urban mobility markets and offerings are specific to the local urban context, it seems that the key integrating element across them is the public transport service (Polydoropoulou et al., 2020), which significantly fosters further development of MaaS implementations. Fig. 3.23 illustrates key actors and enablers in this process. Secondly, associated challenges also concern MaaS-related costs and risks, particularly those linked to service deployment and market uptake, such as the revenue challenge and the revenue allocation challenge. On a more technical note, the question of mobility data-related standards seems particularly challenging. This concerns the lack of data standards in some mobility areas, particularly emerging services such as shared mobility, which increases the effort required for integration and for ensuring equal opportunities for all mobility service providers.

FIGURE 3.23 MaaS service actors and enablers.

3.10 Governance

Additional complexity related to urban mobility comes from the governance-related challenge.


Governance concerns the ownership and management of assets and resources to fulfill desired aims through the exercise of authority and institutional resources. In mobility contexts, this is particularly challenging because both the public and private sectors are actively involved and many mobility elements have a cross-jurisdictional character. One of the mobility system components where the governance-related challenge is immediately noticeable is transport infrastructure. Transport infrastructure, such as roads, rail, and telecommunication networks, is one of the key enablers of transport. It is not a mere convenience but a fundamental infrastructure that needs to be systematically and continuously available to its users.

A simple example of this challenge is connectivity between two urban settlements by a single transport mode, such as a personal vehicle. Both ends of the trip (and the associated infrastructure) could be owned by, and under the jurisdiction of, the respective urban authorities, while the infrastructure in between, such as bridges or motorways, might be under regional or national jurisdiction. Hence, to manage the relevant infrastructure efficiently and achieve the desired smart mobility-related aims, cooperation and coordination among several agencies would be required. However, the two urban authorities might have different local mobility aims and try to manage mobility between them in noncomplementary ways. For instance, one could be actively promoting the use of shared vehicles while the other might be focused on implementing urban vehicle access regulations and low-emission zones, postponing the implementation of shared mobility. This may discourage shared mobility users, who could not return the vehicle at a dedicated location in the second urban area (e.g., where they commute to) and would therefore have to pay for a full day of vehicle usage (until they travel back home) rather than just for the several hours during which they are actually using the car. The challenge is likely to become even more complex with additional elements such as private sector infrastructure ownership or management (e.g., parking locations) and when considering a multimodal trip instead of a unimodal one, as in our example.

Furthermore, within urban settlements, the question of governance concerns dense urban areas, where infrastructure is owned or managed by different private and public bodies. Such infrastructure elements are often located in close proximity to each other and sometimes even overlap geographically, for example a telecommunication network underneath a pavement. Moreover, it is not rare that different components of smart mobility, even if under the jurisdiction of the same public authority, are handled by different departments. An example could be the responsibility of the environmental department for air quality policies and measures and that of the transport department for the management of local parking areas. In such a setting, decisions made by the transport department on where to locate parking facilities might impact the demand for motorized transport modes in the area and affect air quality locally. On the other hand, the introduction of zero-emission zones by the environmental department might have a strong impact on local mobility and affect the activities of the transport department.
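
To make the shared-vehicle example above concrete, the short sketch below compares what a commuter would pay in the two situations. All figures (the hourly rate, the driving time, and the length of the working day) are hypothetical and chosen only for illustration, and the linear per-hour tariff is a simplifying assumption; real tariffs differ between operators.

```python
def rental_cost(hours_billed: float, hourly_rate: float) -> float:
    """Cost of a shared vehicle under a simplified linear per-hour tariff."""
    return hours_billed * hourly_rate


# Hypothetical values, not taken from any real operator.
HOURLY_RATE = 6.0          # price per billed hour
DRIVE_HOURS_ONE_WAY = 1.0  # driving time for each leg of the commute
HOURS_AT_WORK = 8.0        # time the vehicle stands idle at the destination

# Case A: the vehicle can be returned in the destination city,
# so only the two driving legs are billed.
cost_with_drop_off = rental_cost(2 * DRIVE_HOURS_ONE_WAY, HOURLY_RATE)

# Case B: no drop-off location exists in the destination city,
# so the vehicle stays rented for the whole working day.
cost_without_drop_off = rental_cost(
    2 * DRIVE_HOURS_ONE_WAY + HOURS_AT_WORK, HOURLY_RATE
)

print(f"With drop-off:    {cost_with_drop_off:.2f}")
print(f"Without drop-off: {cost_without_drop_off:.2f}")
```

Under these assumed values the commuter is billed for ten hours instead of two, which illustrates why unaligned policies between the two authorities can discourage shared mobility users.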

3.10.1 Governance-related challenges

Governance-related challenges mainly concern the presence of diverse public and private stakeholders in relation to mobility assets and the cross-jurisdictional character of their involvement. This is particularly noticeable in mobility infrastructure management and decision-making processes, as well as in the alignment of policy priorities among different policy makers. An example of this would be mobility infrastructure on the border between an urban settlement and a regional administration’s jurisdiction, where two potentially unaligned policy priorities exist (e.g., the regional aim to integrate remote settlements through fast and efficient road infrastructure and the urban authority’s aim to implement congestion charging or a similar policy). Another example relates to data governance: the local rules and regulations for accessing and/or exchanging data among a wide spectrum of public and private mobility stakeholders, the conditions for its use, and unsolved data-related challenges (e.g., licensing propagation).

3.11 Smart mobility innovations

In a smart mobility context, there is rarely one solution that fits all urban communities. This is primarily because residents’ lifestyles and the cultural, social, economic, environmental, and demographic characteristics form a local context that differs from one community to another. Hence, when shaping smart city and smart mobility innovations, it is beneficial to consider feedback on products and services, and on their innovation value in the local context, from residents who can contribute knowledge, inventiveness, and creativity. Such citizen-oriented interactions were not part of traditional innovation pathways, which focused on interaction among universities, industry, and government. They complement these pathways with the perspective of citizens and a creative civil society, and together they shape the quadruple helix model (Fig. 3.24). The quadruple helix model is still far from being a well-established concept in innovation research and policy. Hence, it can be said that smart cities and smart mobility on the one hand, and this novel approach to innovation on the other, are to some extent developing simultaneously and jointly, and the lessons learned from their interaction are beneficial to both domains. Two components of the quadruple helix concept, co-creation and living labs, are explored particularly frequently in the smart mobility context and in implementation projects.

• Co-creation is a design process in which input from stakeholders and citizens plays a vital role in all stages of solution development. As such, co-creation often entwines technology push and application pull through a diversity of views, constraints, and knowledge sharing, to nourish ideation on new potential usage scenarios, concepts, and related artifacts. The most frequently used co-creation activities include citizen workshops.
• Living labs represent an iterative, user-centered open-innovation ecosystem, often operating in a geographically bounded context (e.g., a city, region, or university campus), where the innovation processes are entwined with the participants’ daily lives.

Thus, while co-creation focuses on ideation and close collaboration between citizens and other mobility stakeholders to explore new scenarios, concepts, and related artifacts, living labs engage citizens at the level of their daily lives. This engagement can vary from simple, systematic reflection on their unmet mobility needs, through contributing their travel patterns or travel diaries for insight creation, to assessing the applicability of new service or technological solutions in their daily lives, as an enabling factor for early acceptance and larger rollout.

3.11.1 Smart mobility innovation-related challenges

Smart mobility-related innovation comprises the development of new mobility solutions and scenarios that cannot be achieved through traditional innovation paths. It supports the paradigm shift and brings a new central role for people in mobility and urban planning, shifting from “planning for cars” toward “planning for people” and from “meeting the transport demand” to “meeting the people needs”. However, this approach is still relatively new and relies on unpaved paths and emerging innovation concepts, such as the quadruple helix. On the one hand, it serves as a learning path for the innovation concept itself; on the other, it builds on theoretical and experimental knowledge that is still being developed and embedded in the local context. As such, it still faces many challenges that need attention for fruitful innovation to be established, including the engagement of citizens and stakeholders and the involvement of hard-to-reach and vulnerable groups, such as children and the elderly. It also faces challenges related to setting the scene and opening the discussion on mobility issues, building understanding among different stakeholders and resolving conflicts, as well as translating the gathered insights into measures, products, and solutions. Furthermore, great potential is seen in evaluating co-creation and living lab projects in their local context to gain a better understanding of the transferability of the lessons learned to other contexts, locally but also at a regional, national, or international level.

FIGURE 3.24 Quadruple helix open innovation model.

3.12 Change management

Enabling and implementing smart mobility innovations, together with the availability of new mobility services, schemes, and technological advances, requires resilient cities that are able to respond to ever-changing circumstances. This places many cities and local authorities under the spotlight, as these circumstances demand adjustments to deal with the transition or transformation of existing processes and technologies. In the business domain, this is known as change management. Change management in the smart mobility context comprises the successful implementation of smart mobility strategies, solutions, and measures for effecting change and helping people to accept and adapt to it. It involves planning for change, managing change, and reinforcing change through communication, incentives, coaching, training, and resistance management (Kingdon & Thurber, 2011; Paredis & Block, 2013). From a smart mobility perspective, this happens in five process streams:

• Problem stream refers to the process of identifying the existing smart mobility challenges and bringing them to attention. This stream concerns the emergence of urban challenges, the mapping of their evolution and reflection among different smart mobility stakeholders, and the co-creation of potential solutions and innovations.
• Political stream refers to bringing the decision makers on board and opening up the discussion regarding the change, its timing, and its implementation. This concerns several factors, including the alignment of the smart mobility vision with the political views of decision makers, the timing of change and its potential impacts on political results, etc.
• Policy stream refers to the policy changes that are needed to successfully implement the smart mobility-related change.
• Technology stream refers to mapping the availability of technology and its influence in the smart mobility life cycle, integrating citizens’ feedback on the applicability and acceptance of such solutions, mapping local industry and its capacity to support potential technological innovations, etc.
• Mobility services stream refers to mapping the evolution of the different mobility services before and after the implementation of change and their impacts on overall smart mobility aims (Guzman Vargas et al., 2019).

There are several approaches to managing change, mainly originating in the business domain. One such approach is the Prosci ADKAR model (Hiatt, 2006); Fig. 3.25 illustrates the model as applied to urban mobility change management.

3.12.1 Change management-related challenges

Business-oriented transition models need to be adjusted to the smart mobility and smart city-related change process in order to be applicable and successfully implemented. The challenges here stem from the differences between the organizational view that prevails in the business domain and urban complexity, including, among others, the large number of stakeholders concerned. This is reflected in challenges for the city administration and policy makers, as well as for those implementing the change, who all need to build their capacities and knowledge in order to implement change successfully and allow smart mobility to evolve. It also concerns building local capacities and innovative business models within the local community, support for job creation, education, and career development, as well as support for citizens in this change process, with a particular focus on vulnerable groups. Currently, there are several attempts to integrate such adjustments in dedicated smart mobility domains (e.g., the Civitas ReVeal transition framework (ReVeal, 2020)); however, holistic frameworks dedicated to smart mobility have not yet been widely tested.

FIGURE 3.25 Prosci ADKAR model.

3.13 State of the affairs

The growth of urban environments, and not only of them, has for some time been challenging the boundaries of our interactions with nature. This has become evident through discussions regarding air quality in cities globally, quality of life, and even the recent pandemic (COVID-19), which has been closely associated with our penetration deep into natural habitats that had so far had limited interaction (and exchange of viruses) with humans. It is also closely associated with the challenges of urban pollution, such as emissions, noise, and the footprint of cities. At the same time, it is becoming evident that the age structure of the population is changing and that we are growing in numbers, overall and in cities, faster than ever before. Hence, the way we shape cities today has become an essential question that will determine our well-being as well as our impact on the environment. This has shifted the focus of cities toward prioritizing sustainability while at the same time striving to be pleasant and liveable for their citizens. Moreover, in the smart mobility context, this is reflected in the paradigm shift from “planning for cars” to “planning for people” and from “meeting the transport demand” to “meeting the people needs”. It is also reflected in including citizens in planning and decision making through innovative paths such as living labs and co-creation, and in shaping a seamless travel experience for their connected and multimodal mobility while exploring, in an efficient and purposeful manner, the full range of innovative mobility options and services. Hence, the concept of car-oriented sprawl is being replaced by the encouragement of active mobility options and the creation of walkable and mixed (in terms of land use, life stages, and incomes) neighborhoods. In this way, cities aim to reduce mobility demand and pollution and to balance their footprint by integrating blue-green areas. Among motorized modes of transport in urban areas, the use of public transport is being encouraged while the system is being enriched in terms of connectivity and available options. Urban spaces, and particularly street spaces, are being redesigned to better reflect the desired use of transport modes in cities (whereas so far they were dedicated mainly to personal vehicles). As such, cities are aiming to be sustainable and pleasant environments where the need for motorized transport is managed in a manner that ensures and supports the quality of life of their citizens, and smart mobility underpins this transition.

References

Agarwal, Y., Kritika, J., Karabasoglu, O., 2018. Smart vehicle monitoring and assistance using cloud computing in vehicular Ad Hoc networks. International Journal of Transportation Science and Technology 7 (1), 60–73.
Anenberg, S., Miller, J., Henze, D., Minjares, R., 2019. A global snapshot of the air pollution-related health impacts of transportation sector emissions in 2010 and 2015. International Council on Clean Transportation, Washington, USA.
Belz, N., Lee, B., 2012. Composition of vehicle occupancy for journey-to-work trips: Evidence of Ridesharing from the 2009 National Household Travel Survey Vermont Add-on Sample. Transportation Research Board, Washington, USA.
C2CCC, 2020. CAR 2 CAR communication consortium: 2020 roadmap. CAR 2 CAR communication consortium, Braunschweig, Germany.
CAAT, 2020. Automated and connected vehicles [Online] Available at: http://autocaat.org/Technologies/Automated_and_Connected_Vehicles/ (Accessed 3 December 2020).
Carneiro Freire, S., Corban, C., Ehrlich, D., Florczyk, A., Kemper, T., Maffenini, L., Melchiorri, M., Pesaresi, M., Schiavina, M., Tommasi, P., 2019. Atlas of the human planet 2019. Publications Office of the European Union, Luxembourg.
Cars 21, 2016. How to reduce EV production costs? EV Battery Tech USA [Online] Available at: http://www.cars21.com/news/view/670 (Accessed 18 October 2016).
CEN, 2006. IFOPT [Online] Available at: http://www.transmodel-cen.eu/standards/ifopt/ (Accessed 3 June 2021).
CEN, 2015. SIRI [Online] Available at: http://www.transmodel-cen.eu/standards/siri/ (Accessed 1 September 2021).
CEN, 2019. OJP [Online] Available at: http://www.transmodel-cen.eu/standards/ojp/ (Accessed 3 May 2021).
CEN, 2020a. DCV [Online] Available at: http://www.transmodel-cen.eu/standards/dcv/ (Accessed 10 May 2021).
CEN, 2020b. OPRA [Online] Available at: http://www.transmodel-cen.eu/standards/opra/ (Accessed 23 January 2021).
CEN, 2021a. ITS standards [Online] Available at: https://www.itsstandards.eu/25-2/ (Accessed 11 November 2021).
CEN, 2021b. Transmodel [Online] Available at: http://www.transmodel-cen.eu/ (Accessed 13 August 2021).
Cole, M., 2018. Mobility as a Service: Putting transit front and center of the conversation. Cubic, San Diego, USA.
CSN, 2016. CSN EN 12896-1 [Online] Available at: https://www.en-standard.eu/csn-en-12896-1-public-transportreference-data-model-part-1-common-concepts/ (Accessed 10 February 2021).
De Mol, J., Defreyne, P., Semanjski, I., 2016. Vebimobe: Correcte Snelheidsinformatie voor correct Rijgedrag: Onderzoek naar Mogelijkheden Verkeersbordendatabank voor ITS-toepassingen. Verkeersspecialist 227 (2016), 20–22.
DeLoach, S., Tiemann, T., 2010. Not driving alone: Commuting in the Twenty-first century. Elon University Department of Economics, Elon, USA.
EC, 2016. Commission’s report on saving lives: Boosting car safety in the EU, COM(2016) 787. European Commission, Brussels, Belgium.
EC, 2020. DATEX II [Online] Available at: https://www.datex2.eu/datex2/specifications (Accessed 5 February 2021).
EEA, 2018a. EEA report, air quality in Europe – 2018 report. European Environment Agency, Brussels, Belgium.
EEA, 2018b. Population exposure to environmental noise. EEA, Copenhagen, Denmark.
Ehrenhalt, A., 2012. The great inversion and the future of the American city. Alfred A. Knopf, New York, USA.
e-Mobility, 2017. e-Mobility project [Online] Available at: http://e-mobility-nsr.eu/ (Accessed 17 June 2017).
European Commission, 2004. Data protection [Online] Available at: https://ec.europa.eu/info/law/law-topic/dataprotection/reform/what-personal-data_en (Accessed 7 November 2021).
European Commission, 2011. White paper: Roadmap to a single European transport area, COM(2011) 144. European Commission, Brussels, Belgium.

European Commission, 2016. COM(2016) 766 A European strategy on Cooperative Intelligent Transport Systems, a milestone towards cooperative, connected and automated mobility. European Commission, Brussels, Belgium.
European Commission, 2018a. Commission study [Online] Available at: https://ec.europa.eu/jrc/en/publication/eur-scientific-and-technical-research-reports/analysis-possible-socio-economic-effects-connected-cooperative-and-automated-mobility-ccam-europe (Accessed 11 June 2021).
European Commission, 2018b. Communication from the commission to the European Parliament, the Council, the European Economic and Social Committee, the Committee of the regions: On the road to automated mobility: An EU strategy for mobility of the future, COM(2018) 283. European Commission, Brussels, Belgium.
European Commission, 2018c. Logistics and multimodal transport [Online] Available at: https://ec.europa.eu/transport/themes/logistics-and-multimodal-transport/2018-year-multimodality_en (Accessed 14 June 2020).
European Commission, 2019. Transport [Online] Available at: https://ec.europa.eu/transport/facts-fundings/scoreboard/compare/energy-union-innovation/roadcongestion_en (Accessed 6 December 2020).
European Committee for Electrotechnical Standardization, 2021. CEN [Online] Available at: www.cen.eu (Accessed 5 February 2021).
European Parliament, 2016. Regulation (EU) 2016/679 [Online] Available at: https://eur-lex.europa.eu/eli/reg/2016/679/oj (Accessed 8 January 2022).
Fellows, N., Pitfield, D., 2000. An economic and operational evaluation of urban car-sharing. Transportation Research Part D: Transport and Environment 5, 1–10.
Fishman, E., Washington, S., Haworth, N., 2014. Bike share’s impact on car use: Evidence from the United States, Great Britain, and Australia. Transportation Research Part D: Transport and Environment 31, 13–20.
Fukuda, T., Kashima, S., Barth, M., 2003. Evaluating second car system, an electric vehicle sharing experiment in Tama New Town District, Inagi City, Tokyo. TRB, Washington, USA.
Greenmotion consortium, 2020. Greenmotion [Online] Available at: http://www.greenemotion-project.eu/ (Accessed 3 June 2021).
Guzman Vargas, D., Semanjski, I., Gautama, S., Lauwers, D., 2019. ReVeAL transition framework workshop report. Ghent University, Ghent, Belgium.
Hiatt, J., 2006. ADKAR: A model for change in business, government, and our community. Prosci, Fort Collins, USA.
Hietanen, S., 2014. Mobility as a service – the new transport model? Eurotransport. ITS & Transport Management Supplement 12 (2), 2–4.

IEEE, 2022. Connected vehicles [Online] Available at: https://site.ieee.org/connected-vehicles/ieee-connectedvechicles/connected-vehicles/ (Accessed 8 January 2022).
Jäppinen, S., Toivonen, T., Salonen, M., 2013. Modelling the potential effect of shared bicycles on public transport travel times in Greater Helsinki: An open data approach. Applied Geography 43, 13–24.
Jenkinson, C., 2020. Quality of life. In: Encyclopædia Britannica. s.l.: Encyclopædia Britannica, inc.
Kakan D., C., Rayamajhi, A., Chowdhury, M., Bhavsar, P., Martin, J., 2016. Vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communication in a heterogeneous wireless network – performance evaluation. Transportation Research Part C: Emerging Technologies 68 (2016), 168–184.
Katzev, R., 2003. A new approach to urban transportation problems. Analyses of Social Issues and Public Policy 3, 65–86.
Kingdon, J., Thurber, J., 2011. Agendas, alternatives, and public policies. Longman, Harlow, UK.
Ma, L., Zhang, X., Ding, X., Wang, G., 2018. Bike sharing and users’ subjective well-being: An empirical study in China. Transportation Research Part A: Policy and Practice 118, 14–24.
Meijkamp, R., 1998. Changing consumer behaviour through eco-efficient services: An empirical study of car sharing in the Netherlands. Business Strategy and the Environment 7, 234–244.
Ministerio de energia, Chile, 2021. Consumo vehicular [Online] Available at: http://www.consumovehicular.cl/ (Accessed 23 January 2021).
NHTSA, 2016. Vehicle to vehicle communication [Online] Available at: https://www.nhtsa.gov/technologyinnovation/vehicle-vehicle-communication (Accessed 23 January 2021).
OECD, 2020. OECD regions and cities at a glance 2020. OECD Publishing, Paris, France.
OKF, 2021. Open definition [Online] Available at: https://opendefinition.org/ (Accessed 20 December 2021).
OSMF, 2022. OpenStreetMap [Online] Available at: https://www.openstreetmap.org/about (Accessed 8 January 2022).
Paredis, E., Block, T., 2013. The art of coupling: Multiple streams and policy entrepreneurship in Flemish transition governance processes. Ghent University, Ghent, Belgium.
Polydoropoulou, A., Pagoni, I., Tsirimpa, A., Roumboutsos, A., Kamargianni, M., Tsouros, I., 2020. Prototype business models for mobility-as-a-service. Transportation Research Part A: Policy and Practice 131, 149–162.
Rahimian, P., O’Neal E., E., Zhou, S., Plumert M., J., Kearney K., J., 2018. Harnessing vehicle-to-pedestrian (V2P) communication technology: Sending traffic warnings to texting pedestrians. Human Factors 60 (6), 833–843.
ReVeal, 2020. Civitas ReVeAL project [Online] Available at: https://civitas-reveal.eu/about/approach/ (Accessed 7 March 2020).
Rezende Amaral, R., Semanjski, I., Gautama, S., Aghezzaf, E.H., 2019. Exploring the issue of integrating logistics and traffic control in urban areas. Transportation Planning and Technology 42 (6), 606–624.
Ribeiro, S., Figueroa, M.J., Creutzig, F., Dubeux, C., Hupe, J., Kobayashi, S., de Melo Brettas, L.A., Thrasher, T., Webb, S., Zou, J., 2012. Chapter 9 - energy end-use: Transport. In: Global energy assessment - toward a sustainable future. Cambridge University Press, Cambridge, pp. 575–648.
Rodrigue, J.-P., 2020. The geography of transport systems, 5th ed. Routledge, New York, USA.
Roush, W., 2012. Welcome to Google transit: How (and why) the search giant is remapping public transportation. Community Transportation 3, 1–6.
Schiavina, M., Melchiorri, M., Corbane, C., Florczyk, A.J., Freire, S., Pesaresi, M., Kemper, T., 2019. Multi-scale estimation of land use efficiency (SDG 11.3.1) across 25 years using global open and free data. Sustainability 11 (20), 5674.
Semanjski, I., Gautama, S., 2016a. Crowdsourcing mobility insights – Reflection of attitude based segments on high resolution mobility behaviour data. Transportation Research Part C: Emerging Technologies 71, 434–446.
Semanjski, I., Gautama, S., 2016b. Forecasting the state of health of electric vehicle batteries to evaluate the viability of car sharing practices. Energies 9, 1025.
Semanjski, I., Gautama, S., 2019. A collaborative stakeholder decision-making approach for sustainable urban logistics. Sustainability 11 (1).
Shaheen, S., Cohen, A., 2008. Growth in worldwide carsharing: An international comparison. Transportation Research Record: Journal of the Transportation Research Board 1992, 81–89.
Slack, B., 1998. Intermodal transportation. In: Hoyle, B., Knowles, R. (Eds.), Modern transport geography. Wiley, Chichester, UK, pp. 263–290.
Sochor, J., Arby, H., Karlsson, M., Sarasini, S., 2018. A topological approach to mobility as a service: A proposed tool for understanding requirements and effects, and for aiding the integration of societal goals. Research in Transportation Business & Management 27 (2018), 3–14.
Society of Automotive Engineers, 2021. Autonomous vehicles levels [Online] Available at: http://articles.sae.org/13573/ (Accessed 3 June 2021).
Socrates2, 2020. Socrates2 [Online] Available at: https://socrates2.org/ (Accessed 9 January 2020).

Spöttle, M., Jörling, K., Schimmel, M., Staats, M., Grizzel, L., Jerram, L., Drier, W., Gartner, J., 2018. Research for TRAN Committee – charging infrastructure for electric road vehicles, European Parliament. Policy Department for Structural and Cohesion Policies, Brussels, Belgium.
Stad Leuven, 2016. Leuven in cijfers - officiële site van de stad Leuven [Online] Available at: http://www.leuven.be/bestuur/leuven-in-cijfers/ (Accessed 23 May 2016).
Steadie Seifi, M., Dellaert, N.P., Nuijten, W., van Woensel, T., Raoufi, R., 2014. Multimodal freight transportation planning: A literature review. European Journal of Operational Research 233 (1), 1–15.
Thompson, A., Perez, Y., 2020. Vehicle-to-Everything (V2X) energy services, value streams, and regulatory policy implications. Energy Policy 137 (2020), 111136.
TMaaS, 2020. TMaaS [Online] Available at: drive.tmaas.eu (Accessed 9 January 2020).
UIA, 2021a. News [Online] Available at: https://www.uia-initiative.eu/en/news/understanding-legislative-framework-test-autonomous-vehicles-public-streets-zoom-2 (Accessed 4 May 2021).
UIA, 2021b. News [Online] Available at: https://www.uia-initiative.eu/en/news/linctuppac-expert-journalget-update-what-has-been-happening-last-6-months (Accessed 2 September 2021).
UIA, 2021c. UIA cities [Online] Available at: https://www.uia-initiative.eu/en/uia-cities/albertslund (Accessed 7 May 2021).
UNECE, 1993. Convention on Road Traffic of 8 November 1968, incorporating the amendments to the Convention which entered into force on 3 September 1993. Inland Transport Committee of the United Nations Economic Commission for Europe, Geneva, Switzerland.
United Nations, 2019a. World Population Prospects 2019, custom data acquired via website [Online] Available at: https://population.un.org/wpp2019/Download/Standard/Interpolated/ (Accessed 4 November 2020).
United Nations, 2019b. World population prospects 2019: Highlights. United Nations, Department of Economic and Social Affairs, Population Division, New York, USA.
United Nations, 2019c. World population prospects 2019: Ten key findings. United Nations, Department of Economic and Social Affairs, Population Division, New York, USA.
United Nations, 2019d. World population prospects 2019: Volume II: Demographic profiles. United Nations, Department of Economic and Social Affairs, Population Division, New York, USA.
United Nations Population Division, 2016. The world’s cities in 2016: Data booklet [Online] Available at: http://www.un.org/en/development/desa/population/publications/pdf/urbanization/the_worlds_cities_in_2016_data_booklet.pdf (Accessed 21 November 2020).
Velaga, N., Beecroft, M., Nelson, J.D., Corsar, C., Edwards, P., 2012. Transport poverty meets the digital divide: Accessibility and connectivity in rural communities. Journal of Transport Geography 21 (2012), 102–112.
Wakefield, E., 1994. History of the electric automobile. In: History of the electric automobile. s.l.: Society of Automotive Engineers, pp. 2–3.
World Economic Forum, 2016. Self-driving vehicles in an urban context. World Economic Forum, Cologny, Switzerland.

CHAPTER 4

Small and big data for mobility studies

Smart Urban Mobility https://doi.org/10.1016/B978-0-12-820717-8.00005-1
© 2023 Elsevier Inc. All rights reserved.

4.1 Objectives of the chapter

- What is small data?
- What is big data?
- What is GNSS?
- What is mobile/cellular network data?
- What is mobile phone/smartphone data?
- What are other big data sources that are being used for mobility studies?

4.2 Word cloud

Fig. 4.1 presents a word cloud with an overview of the content of this chapter.

FIGURE 4.1 Small and big data for mobility studies.

4.3 Introduction

Today one can easily get familiar with many details. We can swiftly find out how many people live in the world, what the most listened-to song in Europe is, what the GDP of a country is, what the best-selling fruit in the world is, and so much more, and we can know all of this because of data. Humans have recorded data for millennia in stone scripts (Kramer, 1988), metals (Qiu, 2000), paper (Crespo & Vinas, 1984), and all sorts of other ways. Today, we create quite a lot of digital data. In fact, the amount of data that was available on the entire Internet at the beginning of this century equals what is now created in a single second (Semanjski et al., 2016). Mobile phones, social networks, surveys, location-based services, bank card payments, Global Navigation Satellite Systems (GNSS), online activities, public transport ticketing, etc. all produce torrents of data as a by-product of their operations. In a way, we can say that since the beginning of this century there has been a fundamental shift from a data-scarce to a data-rich environment. In this context, we often speak of so-called “small data” and “big data”. When speaking of small data, we refer to traditionally collected data regarding a specific topic, for example, human mobility. These data are usually collected from a smaller sample that represents the population of interest, for the needs of studies that aim to answer specific questions. Big data, on the other hand, are often typified by the so-called 4 Vs definition (Chen et al., 2012; Laney, 2016). The 4 Vs (Fig. 4.2) define big data as:

- volume: the scale of generated and stored data becomes increasingly big.
- variety: data come in various forms, including structured, semistructured, and unstructured data, and originate from multiple sources.
- velocity: data are generated at high speeds, and the data collection-processing chain needs to be conducted promptly to maximally utilize the potential value of big data.
- veracity: the quality and trustworthiness of the data. It comprises both the provenance and the curation of the data. Provenance refers to a sufficient level of data documentation evidencing where the data come from and by which processes and methodology the data were produced or collected. Data curation, on the other hand, refers to the active and continuous management of data through their complete lifecycle, including elements such as annotation, publication, and presentation of the data, in a way that ensures their value and reusability are maintained over time.

In the following section, we will have a look at the traditional data collection approaches for transport planning (so-called small data) and the three most often used big data collection approaches for mobility studies. This is followed by a brief description of other big data sources for mobility studies (e.g., Bluetooth data, public transport ticketing data, etc.).


FIGURE 4.2 Big data.

4.4 Traditional data collection approaches

Travel surveys and interviews are considered to be a traditional and rather straightforward way to collect data on the travel behavior of the population of interest. Usually, they are conducted over a dedicated sample or, more rarely, during a census, when the whole population is surveyed or interviewed. The results can be obtained at the level of households or individuals. Initially, travel surveys and interviews were executed by means of pen and paper, whereas today phone, and even online, surveys and interviews are more common. Travel survey and interview data collection processes (Fig. 4.3) typically take place in a systematic and cyclical manner. Most often this happens every year, every two years or, more commonly, every five or even 10 years. Although the frequency varies across different regions, it usually has a uniform cyclical occurrence for a given region. During these surveys and interviews, respondents are asked to record or state their travel behavior on an average weekday. The collected data typically include the following socioeconomic and demographic context data:

• Household details:
  ○ household size,
  ○ dwelling type and ownership,
  ○ number of registered motor vehicles by type,
  ○ number of bicycles.
• Personal data about individuals in the household:
  ○ age and gender,
  ○ relationship to head of household,
  ○ employment status,
  ○ occupation,
  ○ industry of employment,
  ○ personal income,
  ○ education degree level.
• Travel details:
  ○ data for all travels made on the given day,
  ○ location of trip origin,
  ○ time of travel, including departure time and arrival time,
  ○ purpose of the travel,
  ○ location of the destination,
  ○ mode of transport used.
  ○ For trips made by vehicle:
    - vehicle used,
    - number of occupants,
    - roads used,
    - parking location used,
    - parking or toll-related details (amount paid and by whom).
  ○ For trips made by public transport:
    - type of ticket,
    - type of zone ticket,
    - type of fare paid.
• Reason for not traveling on the given day.

FIGURE 4.3 Travel surveys and interviews data collection process.
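The survey items listed above map naturally onto a nested record structure. The following is a minimal sketch in Python of how such a household, person, and trip hierarchy might be represented; the class names, field names, and example values are hypothetical and are not taken from any specific survey instrument.

```python
from dataclasses import dataclass, field
from datetime import time
from typing import List, Optional


@dataclass
class Trip:
    origin: str
    destination: str
    departure: time
    arrival: time
    purpose: str
    mode: str
    occupants: Optional[int] = None    # only meaningful for vehicle trips
    ticket_type: Optional[str] = None  # only meaningful for public transport trips

    def duration_min(self) -> int:
        """Reported travel time in whole minutes (same-day trips only)."""
        start = self.departure.hour * 60 + self.departure.minute
        end = self.arrival.hour * 60 + self.arrival.minute
        return end - start


@dataclass
class Person:
    age: int
    gender: str
    employment_status: str
    trips: List[Trip] = field(default_factory=list)
    reason_no_travel: Optional[str] = None  # filled in when no trips were made


@dataclass
class Household:
    size: int
    dwelling_type: str
    motor_vehicles: int
    bicycles: int
    members: List[Person] = field(default_factory=list)

    def trips_per_person(self) -> float:
        """Average number of reported trips per surveyed household member."""
        if not self.members:
            return 0.0
        return sum(len(p.trips) for p in self.members) / len(self.members)


# Illustrative (invented) responses for one household on the survey day.
household = Household(size=2, dwelling_type="apartment", motor_vehicles=1, bicycles=2)
commuter = Person(age=34, gender="F", employment_status="employed")
commuter.trips.append(Trip("home", "office", time(7, 40), time(8, 15), "work", "bicycle"))
commuter.trips.append(Trip("office", "home", time(17, 5), time(17, 45), "home", "bicycle"))
homebody = Person(age=36, gender="M", employment_status="employed",
                  reason_no_travel="worked from home")
household.members.extend([commuter, homebody])
```

Keeping trips nested under persons, and persons under households, mirrors the two levels (household and individual) at which the text notes that results can be obtained.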


In practice, travel surveys and interviews are commonly complemented with other mobility-related details that are collected to capture the overall context of the noted mobility behavior. These include, among others, land use data, an overview of existing infrastructure and services, traffic and person counts, as well as surveys on attitudes and elasticity of transport demand (e.g., stated preferences) for the same moment in time and geographic area of interest. Furthermore, the geographical area of interest does not solely relate to the residents of the considered area, but also to nonresidents who travel in or across the area. Their mobility behavior is collected via shorter travel surveys and interviews conducted at specific locations, usually at the border of the geographical area of interest. These types of surveys and interviews are commonly called intercept surveys/interviews.

Overall, data collected in such a manner capture the travel behavior for the given moment in time and for the defined geographical area. Collected at equidistant moments in time, these records form time series of the noted travel behaviors and their context, which allow transport planners and decision makers to have a systematic overview of the mobility patterns as well as to follow the main longitudinal trends observed in the travel behavior. However, numerous studies (European Institute of Retailing and Services Studies, 1996; Stopher & Greaves, 2007) have shown that data collected by means of travel surveys and interviews differ substantially, and deviate systematically, from the actual travel behavior. These discrepancies include, among others:

- Respondents’ tendencies to underreport small trips, for instance, a walk to the nearby bakery (Itoh & Hato, 2013). In most cases, the literature evidences a shortfall of around 20%–30%, and even up to 60% (Jones & Stopher, 2003), for telephone-based interviews and about 7%–12% in the reporting of trips in face-to-face interviews (Stopher et al., 2007).
- Limited accuracy of the provided details regarding the key components of respondents’ travel, such as a tendency to round travel times to the nearest 5, 10, or even 15 min, tendencies of car drivers to underestimate their travel times and of public transport users to overestimate their waiting and transfer times (Solomon & Peer, 2012; Uteng & Voll, 2016), and the inability to provide location details to the degree of precision required or the failure to provide even basic route information (Stopher & Greaves, 2007).
- High costs and difficulties in obtaining the interviews and responses. For example, the costs of the face-to-face interviews conducted as part of the Sydney Household Travel Survey are estimated to be around $350 (290 Euros) per completed household (Stopher et al., 2003).
- Increasing nonresponse rates (Wilson, 2004) among households. Stopher and Metcalf (1996) and Zimowski et al. (1997) report that a computer-assisted telephone interview (CATI) will usually attain about a 60% recruitment rate for households, followed by a 60% completion rate, resulting in an overall response rate of about 36%.
- Underrepresentation of households that travel more and/or are larger than the average (DRCOG, 2000; NCHRP, 2006). It is reported that such households are more likely to be among the nonrespondents, partly because of difficulties in reaching them (for example, those that travel more than the average), and partly because they see the survey task as significantly more burdensome (for example, due to the number of members of the household and the amount of travel that would need to be reported).

To avoid these pitfalls, paper travel diaries were introduced (Stopher & Wilmot, 2000). Here, people are asked to systematically note their travel behavior details with respect to travel times, transport modes, trip purposes, and frequencies, or in short, the inputs needed for the four-step transport planning model. The data collection time span of travel diaries usually covers a period of 1 week during the spring or autumn months (so-called nonholiday seasons). However, the literature suggests that participants tend to postpone filling in these diaries, which results in incomplete and inconsistent information (Arentze et al., 2001; Groves, 2006). Quite often this retrospective data collection results in the rounding of times and distances, in forgetting to mention some shorter trips, or in difficulties in defining the exact locations of places that were visited (Witlox, 2007). In addition, travel surveys are characterized by rising survey costs (Inbakaran & Kroen, 2011) and by the heavy burden they pose on respondents (Danalet & Mathys, 2017). Furthermore, as the shorter trips that are often omitted are usually made by active transport modes, like cycling or walking, such data collection practice can result in biased observed modal splits (Pooley et al., 2011; Schneider, 2013). For instance, short trips made by active transport modes are estimated to account for 20%–40% of all trips in Europe (OECD, 1998). Hence, it is easy to see how excluding such a considerable number of trips can feed the evolution of car-biased transport planning, which consequently moves us away from the sustainable transport paradigm that many cities and communities aim to achieve.

In response to the identified shortfalls, travel diaries have undergone significant evolutionary change over the past decades, mainly to improve their reporting capability and to reduce respondent burden. However, the nonresponse rates have continued to rise. This is often indicated as one of the major motivators, next to cost reduction, automated processing possibilities, and higher mobility data resolution, for exploring the potential of big data for transport planning needs (Jestico et al., 2016; Semanjski et al., 2017). We discuss this further in the following section.
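The CATI figures quoted in this section (a 60% recruitment rate followed by a 60% completion rate) combine multiplicatively into the overall response rate of about 36%. The short sketch below makes that arithmetic explicit and, as an illustration only, estimates how many households would need to be approached to reach a target number of completed surveys; the function names and the target of 1000 households are hypothetical.

```python
import math


def overall_response_rate(recruitment_rate: float, completion_rate: float) -> float:
    """Share of approached households that end up with a completed survey."""
    return recruitment_rate * completion_rate


def households_to_contact(target_completed: int,
                          recruitment_rate: float,
                          completion_rate: float) -> int:
    """Households to approach so that the expected completions reach the target."""
    rate = overall_response_rate(recruitment_rate, completion_rate)
    return math.ceil(target_completed / rate)


cati_rate = overall_response_rate(0.60, 0.60)          # about 0.36
needed = households_to_contact(1000, 0.60, 0.60)       # hypothetical target of 1000
```

The calculation ignores the underrepresentation effects described above: it estimates how many responses arrive, not whether they are representative.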

4.5 Big data for mobility studies

In this section we focus on the three main big data sources that are used for mobility studies, namely:

- pure Global Navigation Satellite Systems (GNSS) data,
- mobile network data, and
- mobile phone sensed data.

We describe these in more detail in the following sections.

4.5.1 Global navigation satellite systems data

GNSS is a general term that describes a constellation of Earth-orbiting satellites that have global coverage and broadcast signals along a line of sight from space, sending position and time data to dedicated receivers, which then use them to determine location. The performance of any GNSS is commonly evaluated based on four parameters:

- Accuracy, i.e., the discrepancy between a receiver’s measured and actual position, speed, or time,
- Integrity, defined as a system’s capacity to provide a threshold of confidence and a timely and valid indication if any anomaly in the positioning data occurs,
- Continuity, seen as the ability of a system to operate without interruption, and
- Availability, or the portion of the time a system can be used for the intended application (Sabatini et al., 2017).
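Two of these parameters are straightforward to illustrate numerically. The sketch below, written in Python under simplifying assumptions, converts coordinates from the DDMM.MMMM notation used in Table 4.1 into decimal degrees, computes accuracy as the horizontal distance between a measured and a known position, and computes availability as the share of epochs with a usable fix. The reference point, the list of fix flags, and all function names are hypothetical, and a spherical Earth is assumed.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; spherical approximation


def ddmm_to_decimal(value: str, hemisphere: str) -> float:
    """Convert a DDMM.MMMM (or DDDMM.MMMM) string to signed decimal degrees."""
    raw = float(value)
    degrees = int(raw // 100)        # leading digits are whole degrees
    minutes = raw - degrees * 100    # remainder is minutes with decimals
    decimal = degrees + minutes / 60.0
    return -decimal if hemisphere in ("S", "W") else decimal


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def availability(fix_valid: list) -> float:
    """Share of observation epochs in which a usable position was obtained."""
    return sum(fix_valid) / len(fix_valid)


# Measured position: coordinate strings as they appear in the Table 4.1 message.
measured_lat = ddmm_to_decimal("4238.451804", "N")
measured_lon = ddmm_to_decimal("1806.620622", "E")

# Hypothetical surveyed reference point, offset slightly to the north.
reference_lat = measured_lat + 0.00001
reference_lon = measured_lon

horizontal_error_m = haversine_m(measured_lat, measured_lon,
                                 reference_lat, reference_lon)
share_usable = availability([True, True, False, True])  # invented fix flags
```

Integrity and continuity depend on system-level monitoring and are not captured by a per-fix calculation of this kind.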

Furthermore, GNSS performance can be improved by the use of regional satellite-based augmentation systems (SBAS), such as the European Geostationary Navigation Overlay Service (EGNOS), or of fixed ground-based reference stations that broadcast the difference between the positions indicated by the GNSS and known fixed positions (Fig. 4.4). Examples of GNSS include, among others, the European Galileo, the US NAVSTAR Global Positioning System (GPS), the Russian Global’naya Navigatsionnaya Sputnikovaya Sistema (GLONASS), and the Chinese BeiDou satellite navigation system.

FIGURE 4.4 GNSS-based positioning.

Ever since they were first introduced for civilian use, GNSS have heralded a new potential regarding the way data for mobility studies can be collected (Feng & Timmermans, 2013; Wolf et al., 2004). In this context, the fruitfulness of potential GNSS applications arises from the fact that GNSS data consist of records of timestamps and location coordinates (together with additional qualitative and descriptive data). Table 4.1 illustrates an example of a GNSS message containing such data.

TABLE 4.1 GNSS message example.

Message example: $GPGGA,001010.00,4238.451804,N,1806.620622,E,4,13,1.00,107.414,M,21.200,M,0.10,0000*7F

Message element | Example value | Example meaning
GNSS system indication | GP | Global Positioning System
Timestamp | 001010.00 | Coordinated Universal Time (UTC) in hours, minutes, and seconds
Position - latitude | 4238.4518 | Latitude in the DDMM.MMMM format
Latitude orientation | N | Denotes North latitude
Position - longitude | 1806.6206 | Longitude in the DDMM.MMMM format
Longitude orientation | E | Denotes East longitude
Quality indicator | 4 | Centimeter precision
Number of satellites | 13 | Denotes the number of satellites used to compute the coordinates
Precision | 1.00 | Horizontal dilution of precision
Altitude | 107.414 | Denotes altitude of the antenna
Units of altitude | M | Meters
Geoidal separation | 21.200 | Denotes the geoidal separation
Units used by the geoidal separation | M | Meters
Age of the correction | 0.10 | Denotes the age of the correction
Correction station ID | 0000 | Denotes the correction station ID
Checksum | *7F | Denotes the checksum

Based on the sequence of positioning data measured at very short time intervals, usually seconds (and not years, as is the case with traditionally collected data), GNSS data allow the trajectories of an object or of human movement to be recreated. Hence, GNSS data provide a far more detailed description of the traveled path, in both spatial and temporal resolution, than traditional data collection methods. However, GNSS data by themselves lack other contextual information regarding the trip, such as the transport mode used or the purpose of the trip. Consequently, many mobility-related research efforts have been invested over the past decades to enhance the usability of GNSS-recorded trips for mobility studies, both by aiming to improve trip reconstruction from the collected GNSS data and by complementing trip data with contextual information by means of data analytics, which will be discussed in more detail in the following chapters.

From an evolutionary perspective, in early applications of GNSS for mobility studies, the GNSS devices were mainly installed in vehicles (Eisele et al., 1998). This had some advantages, as the device collected positioning data whenever the vehicle was in motion, and the individuals did not need to carry the devices with them. In early applications, the devices were in any case quite impractical to carry around due to their dimensions. In this context, the burden on a data collection participant was rather low, as one only needed to take care that the device was turned on and, depending on the data transfer methodology (whether data were transferred online or not), to take out the device in order to download the data. However, the fact that the GNSS devices were typically installed in vehicles also meant that they only tracked a small portion of mobility behavior (i.e., vehicle trips). In alternative scenarios, when portable handheld GNSS devices were introduced, effort and discipline were required from the respondent to carry the device continuously. This was reflected positively in the ability to track trips made by other transport modes (other than a vehicle). However, in the context of participants’ discipline and burden, forgetting the device had serious consequences for the data collection process, as it resulted in incomplete mobility patterns and unreported gaps in the mobility data.

4.5.1.1 Example: GNSS data (I)

As already mentioned, GNSS data are of higher spatial and temporal resolution than traditionally collected mobility data. In this context, they enable analyses and insights that were not possible before, or were highly unreliable (e.g., due to human error or the impact of reaction time) or resource-demanding (e.g., in terms of the investments needed to collect the inputs required for the analysis). Such insights are of added value both for making traditional transport planning models and approaches more efficient and reliable and for supporting decision-making in aspects that were so far based on scarce data, as they allow evidence-based decision-making and data-supported discussion on relevant matters such as more sustainable mobility options. For instance, many authorities today look at bikes as a more sustainable alternative to motorized vehicles. The use of bikes is particularly interesting in this context for trips shorter than 10 km, or 20 km when motorized bicycles are also considered. In this process, authorities try to encourage the modal shift by increasing the safety, comfort (infrastructure), speed, attractiveness, and directness of bike routes, as well as by encouraging the use of bikes from an early age, as this has a positive impact on future mobility habits (Broach et al., 2012; Gillis et al., 2020; Hunt & Abraham, 2007). Various studies emphasize that not all transport mode users value their travel time equally. We already know from traditional mobility data collection practices that, for instance, public transport users value their travel time more highly than users of other transport modes (particularly elements such as waiting times at stops). The literature evidences that the same is the case for bike users (Börjesson & Eliasson, 2012), for whom bike delays are seen as a relevant element of the attractiveness of bike routes.
The higher value of time for bikers is partially explained by the traveling conditions, which are more hazardous and less attractive than for other transport modes (e.g., due to direct exposure to weather conditions), as well as by the increased effort associated with delays, for instance, the effort required for stopping and restarting. The literature quantifies that the waiting time at intersections is perceived as an additional 2.0 biking minutes, while one stop for a signalized intersection is perceived as 1.1 biking minutes, in addition to the overall delay (Börjesson & Eliasson, 2012). The negative impact of delays on route attractiveness may also be reflected in undesired behavior in terms of route choice (for instance, avoidance of signalized intersections or routes with frequent stops) (Broach et al., 2012; Menghini et al., 2010; Segadilha & da Penha Sanches, 2014) as well as in red light running infringements (Pai & Jou, 2014; Richardson & Caulfield, 2015; Zhang et al., 2016). In this context, the European Parliament’s Committee on Transport and Tourism has also defined “making traffic lights more cycle-friendly” as one of the key areas for future intervention (European Parliament, 2010).

In current studies and transport planning practices, bike delays are rarely considered. This is mainly due to the difficulties in obtaining this type of insight. When they are considered, they are usually estimated as statistically expected averages derived from the characteristics of the traffic light control schemes (e.g., the distribution of the duration of green and red times and the total traffic light cycle) and not from empirical data. This is mainly due to the lack of transferability and applicability of the well-established practices and existing approaches used for other transport modes, such as car traffic. For instance, vehicle identification methods, in which roadside sensors record unique vehicle characteristics (such as a license plate, an inductive loop detector signature, or a Bluetooth or WiFi MAC address) at subsequent sensing locations in order to match the vehicle passages so that travel times can be derived (Wang et al., 2014), are not likely to be applicable to bike trips, as bikes do not have license plates and cyclists are less likely to have Bluetooth activated, for instance, to use hands-free phone systems or carry car keys. Hence, it is difficult to detect a unique signature per bike, as required for vehicle identification methods.
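The perceived-delay figures quoted above can be turned into a small calculation. The sketch below is only one possible reading of those figures, stated here as an assumption: each minute of waiting is weighted as 2.0 biking minutes, and, alternatively, each stop at a signalized intersection adds a penalty of 1.1 biking minutes on top of the measured delay. The constant and function names are hypothetical, and the original study should be consulted for the exact formulation.

```python
WAITING_WEIGHT = 2.0     # assumed: 1 min of waiting is perceived as 2.0 biking min
STOP_PENALTY_MIN = 1.1   # assumed: one signalized stop adds 1.1 perceived biking min


def perceived_waiting_min(wait_min: float) -> float:
    """Perceived biking minutes when waiting time is weighted."""
    return WAITING_WEIGHT * wait_min


def perceived_stop_min(delay_min: float, stops: int = 1) -> float:
    """Perceived biking minutes when each stop adds a fixed penalty to the delay."""
    return delay_min + stops * STOP_PENALTY_MIN


# A measured 2 min delay at one signalized intersection, under both readings.
weighted = perceived_waiting_min(2.0)     # 4.0 perceived minutes
penalized = perceived_stop_min(2.0)       # about 3.1 perceived minutes
```

Under either reading, a measured delay understates how costly the stop feels to the cyclist, which is the point the cited literature makes.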


However, some opportunities for transferability and automatic measurement of bike delays can be found among probe vehicle methods, where sensing devices are placed in vehicles participating in the traffic stream to collect data on position and speed. Fig. 4.5 gives an example of such an application, where bike delays at intersections during afternoon peak hours in an urban area are illustrated based on observations collected by wearable GNSS devices. The research was conducted in Flanders, the northern region of Belgium, where bikes represent 11.18% (Janssens et al., 2018) of all trips. The image is composed from 65,718 bike trips, and the corresponding GNSS data, collected during a period of 6 months. The red color indicates delays at intersections that are longer than 2 min (or between 4 and 3.1 min if the perceived delay is considered), yellow those that lasted between 1 and 2 min, and green those shorter than 1 min. A more detailed description of the methodology for analyzing bike delays from GNSS data can be found in the literature (Gillis et al., 2020).

FIGURE 4.5 Bike delays at intersections.

Transferring the probe vehicle approach to biking trips can also have other plausible applications that can assist transport planners and decision makers in gaining insight into different aspects of mobility, for instance, the mobility of vulnerable groups such as the elderly or children. Fig. 4.6 illustrates another bike-related analysis, regarding the perceived safety of school children on their bike routes to and from school. The perceived safety of children’s bike routes to school is emphasized in the literature as one of the most common barriers to children going to school by bike (Christie et al., 2011; Dellinger, 2002). However, although this finding is highlighted in the results of several surveys and formulated across multiple recommendations to authorities, automated methodologies to gain more insight into the locations and situations that are perceived as unsafe, especially by children, are rarely considered. The use of smartphones as a means of two-way communication to provide feedback regarding such situations and areas is not a plausible option, as many school-age children intentionally do not carry smartphones, either due to their young age or in order not to cause distraction. One potentially interesting means to automate this type of data collection is the use of GNSS sensors. Fig. 4.6 illustrates the intensity of the bike routes of school children, represented by red and yellow lines, when the motivation for the trip is to go to school, or from school to home. The data were gathered from 3511 children attending five schools in the Flemish Brabant region in Belgium. The data were collected by wearable GNSS sensors, in the form of keychain pendants, that were attached to the children’s school bags with the permission of their parents, who volunteered to participate in the data collection process. An additional function, a simple press button, was integrated into the pendants to enable the noting of a specific location once the button was pressed. This function enabled children to report specific situations and locations, such as the occurrence of incidents on their school-related routes, indicated by blue dots on the map. Although a minor interaction, pressing the button, was needed, the burden on the data collection participants was perceived as low and feasible during the bike rides to school. The results of the study were used to raise awareness of bike safety for school children in the area, to initiate the conversation among stakeholders, and to support decision-making regarding potential bike infrastructure improvements and investments, in order to encourage the use of sustainable mobility options by vulnerable mobility groups such as school children.

FIGURE 4.6 Bike-related incidents location reporting for school children.

4.5.1.2 Example: GNSS data (II)

With the aim of increasing the understanding of the potential of GNSS-based data collection for mobility studies, and of exploring complementarities with the traditional mobility data collection techniques, a study was conducted over the Belgian region of Brussels, which includes the Belgian capital Brussels and 18 other municipalities (Anderlecht, Auderghem, Berchem-Sainte-Agathe, Etterbeek, Evere, Forest, Ganshoren, Ixelles, Jette, Koekelberg, Molenbeek-Saint-Jean, Saint-Gilles, Saint-Josse-ten-Noode, Schaerbeek, Uccle, Watermael-Boitsfort, Woluwe-Saint-Lambert, Woluwe-Saint-Pierre) (City of Brussels, 2020).


Firstly, the official mobility statistics were considered. The official mobility statistics for the region are composed from traditionally collected data. In this context, the traditional data collection techniques involved were a combination of travel surveys that were delivered to participants by mail, with phone call reminders to complete them, and face-to-face interviews. Overall, the official mobility statistics were derived from a sample of 2223 households that responded by providing travel details for one working day for household members older than 6 years (Cornelis et al., 2010). Secondly, the GNSS-based data collection was considered. To achieve a GNSS sample comparable with the one behind the official mobility statistics, and as the GNSS devices were personal wearable sensors, the number of households that participated in the preparation of the official statistics was multiplied by the average household size in the area (2.3 members (Cornelis et al., 2010)). This number was adjusted to accommodate the nonresponse rates reported in similar studies (Geurs et al., 2015) in order to define the overall number of participants for the GNSS-based study in the same area. The recruitment of participants was then completed based on the same criteria as for the official statistics sample. Nonetheless, in the GNSS-based study the response rate was slightly higher than expected (87%, compared to 82% (Geurs et al., 2015)); hence, the number of participants considered for the next steps was reduced by random selection while maintaining the representativeness of the stratified sample. The result was 5113 individuals who actively contributed their GNSS tracks collected over one working day. The GNSS data collected by the wearable sensors were transferred wirelessly to a central data platform. There the data were analyzed to segment the contributed trajectories into trips and to annotate the transport mode likely to have been used (motorized transport mode, walking, or biking).
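The text does not specify how the platform segmented trajectories or inferred transport modes. Purely as an illustration of the general idea, the following Python sketch splits a track into trips wherever consecutive fixes are separated by a long time gap and labels each trip by its mean speed. The gap and speed thresholds, the function names, and the synthetic track are all assumptions made for this example and are not the method used in the study.

```python
import math
from typing import List, Tuple

Fix = Tuple[float, float, float]  # (time in seconds, latitude deg, longitude deg)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km on a spherical Earth (radius 6371 km)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def segment_trips(fixes: List[Fix], max_gap_s: float = 300.0) -> List[List[Fix]]:
    """Start a new trip whenever two consecutive fixes are more than max_gap_s apart."""
    trips: List[List[Fix]] = []
    current: List[Fix] = []
    for fix in fixes:
        if current and fix[0] - current[-1][0] > max_gap_s:
            trips.append(current)
            current = []
        current.append(fix)
    if current:
        trips.append(current)
    return trips


def mean_speed_kmh(trip: List[Fix]) -> float:
    """Path length divided by elapsed time for one trip."""
    dist = sum(haversine_km(a[1], a[2], b[1], b[2]) for a, b in zip(trip, trip[1:]))
    hours = (trip[-1][0] - trip[0][0]) / 3600.0
    return dist / hours if hours > 0 else 0.0


def annotate_mode(trip: List[Fix], walk_max: float = 7.0, bike_max: float = 25.0) -> str:
    """Assign a mode class from mean speed; thresholds in km/h are assumed values."""
    speed = mean_speed_kmh(trip)
    if speed <= walk_max:
        return "walking"
    if speed <= bike_max:
        return "biking"
    return "motorized"


# Synthetic track: a slow segment, a long pause, then a fast segment.
walk = [(i * 10.0, 50.85 + i * 0.0001, 4.35) for i in range(6)]
drive = [(1000.0 + i * 10.0, 50.86 + i * 0.002, 4.35) for i in range(6)]
trips = segment_trips(walk + drive)
modes = [annotate_mode(t) for t in trips]
```

A mean-speed rule of this kind confuses, for example, congested car traffic with cycling, which is one reason the study let participants confirm or correct the detected modes.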

Data collection participants could access their data, and the related results, online, visualized on a map and sequentially ordered. In this manner, they were able to confirm/validate both the observations and the estimations or to report errors (for instance, if the transport mode class was misclassified or the trip details, such as the start and the end of the trip, were not correctly detected). Overall, participants contributed 155,807 km of recorded trajectories and 298,492 min of trip data (Fig. 4.7). Fig. 4.8 shows the distribution of trip distances for the trips collected by wearable sensors (the x-axis is the percentage of trips, while the y-axis indicates trip distance classes), while Fig. 4.9 shows the same for the trip distances recorded in the official statistics (Cornelis et al., 2010). The GNSS data exhibited higher trip resolution (for instance, for a single trip it was possible to analyze each part of the trip that was made with a different transport mode), whereas official statistics report distances for full trips (assigned, over the full length of the trip, to the main transport mode class). However, when the gathered data were analyzed at the same level as the official statistics (main transport mode per trip), a higher share of shorter distances at the trip level was observed than in the official statistics. It should be noted that the higher share (around double the values) of short trips in GNSS studies is consistently reported in the literature (Bricka et al., 2012; Chen et al., 2010; Lee et al., 2016). Additionally, these findings are in line with previous studies that found that respondents tended to underreport small trips in traditional travel diaries/surveys (Itoh & Hato, 2013).

FIGURE 4.7 GNSS study data.

FIGURE 4.8 Trip distances GNSS data (data bins as reported in official statistics).

FIGURE 4.9 Trip distances (data bins as reported in official statistics). Based on Cornelis, E. et al. (2010). Mobiliteit in België in 2010: Resultaten van de BELDAM-enquête. Brussels, Belgium: Belspo.

FIGURE 4.10 Trips and trips parts that are shorter than 1 km by transport mode class.

To give more insight into the observed short trips, Fig. 4.10 illustrates, for each transport mode class, the share of trip parts made by a single transport mode class that are shorter than 1 km. The yellow color indicates trip parts made by a single transport mode class where this trip part (which is shorter than 1 km) is an element of a multimodal trip (i.e., the overall trip is made in combination with other transport mode classes and, hence, the overall trip would be longer than this section alone). The blue color indicates unimodal trips (made by only one transport mode class) that are shorter than 1 km. Hence, the official statistics would mainly report only the results indicated in blue. However, the results indicate that this would underreport trips made by active transport mode classes (in this specific case, the observed underreporting was around 12.5% for trips made by walking and 26.6% for bike trips).
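The distinction between short unimodal trips and short parts of multimodal trips can be made concrete with a small aggregation. The Python sketch below counts, per transport mode class, the trip parts shorter than 1 km and separates those that form a whole unimodal trip from those hidden inside a multimodal trip. It also shows a trip-level main mode, assumed here to be the mode of the longest part. The data are invented for illustration, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

TripPart = Tuple[str, float]  # (transport mode class, distance in km)


def main_mode(parts: List[TripPart]) -> str:
    """Trip-level mode, assumed here to be the mode of the longest part."""
    return max(parts, key=lambda part: part[1])[0]


def short_parts_by_mode(trips: List[List[TripPart]],
                        limit_km: float = 1.0) -> Dict[str, Dict[str, int]]:
    """Count parts shorter than limit_km, split by unimodal vs multimodal trips."""
    counts: Dict[str, Dict[str, int]] = {}
    for parts in trips:
        modes = {mode for mode, _ in parts}
        kind = "unimodal" if len(modes) == 1 else "multimodal"
        for mode, dist in parts:
            if dist < limit_km:
                bucket = counts.setdefault(mode, {"unimodal": 0, "multimodal": 0})
                bucket[kind] += 1
    return counts


# Invented example trips, each a list of single-mode parts.
example_trips = [
    [("walking", 0.4)],                                         # short unimodal walk
    [("walking", 0.3), ("motorized", 6.0), ("walking", 0.2)],   # walk legs around a ride
    [("biking", 0.8), ("motorized", 12.0)],                     # bike leg to a station
    [("biking", 3.5)],                                          # longer unimodal bike trip
]
short_counts = short_parts_by_mode(example_trips)
trip_level_modes = [main_mode(parts) for parts in example_trips]
```

In this toy example, reporting only the trip-level main mode would show a single short walking trip, while three further short active-mode parts remain hidden inside trips labeled as motorized.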

4.5.2 Mobile network data

Mobile network data for mobility studies involve two elements: positioning traces triggered by the serving network, so-called network signalization data, and positioning traces triggered by the user interaction activity itself, recorded in so-called Call Detail Records (CDR).
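As an illustration of what working with a CDR looks like in practice, the Python sketch below parses a record with the eight fields shown in Table 4.2 into a typed structure. It assumes that the fields are comma-separated in the order given in that table and that the timestamps are ISO 8601 strings with a UTC offset of +05:30. The class and function names are hypothetical, and real operator formats differ.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CallDetailRecord:
    record_id: str
    calling_number: str
    called_number: str
    session_start: datetime
    session_end: datetime
    session_type: str
    charge: float
    session_result: str


def parse_cdr(line: str) -> CallDetailRecord:
    """Parse one comma-separated CDR line with the assumed eight-field layout."""
    fields = [part.strip() for part in line.split(",")]
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    return CallDetailRecord(
        record_id=fields[0],
        calling_number=fields[1],   # kept as text to preserve leading zeros
        called_number=fields[2],
        session_start=datetime.fromisoformat(fields[3]),
        session_end=datetime.fromisoformat(fields[4]),
        session_type=fields[5],
        charge=float(fields[6]),
        session_result=fields[7],
    )


# Example record with the values of Table 4.2 (phone numbers are random).
example_line = (
    "043775f2-39c2-4c88-a2ac-6297c273ab83, 008372590762, 007341893461, "
    "2021-02-22T11:29:09.438+05:30, 2021-02-22T11:29:09.438+05:30, "
    "SMS, 0.3259588, Received"
)
record = parse_cdr(example_line)
session_seconds = (record.session_end - record.session_start).total_seconds()
```

Note that the record carries time and activity type but no coordinates. Location has to be joined in separately from the serving base station.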

The CDR represents standardized data, which are collected by mobile network operators regularly for billing purposes. Such data include records of all the activities initiated by the user such as phone calls, SMSs (short message services), internet and data services where each record includes spatial and temporal parameters for the incoming and outgoing data transfers. Table 4.2 shows an example of the CDR data. The geographical precision of these data is somewhat lower than it is the case with the GNSS data. The reason for this lies in the fact that the mobile network notes the base station (antenna) location, which covers the area where

TABLE 4.2 CDR record example.

CDR record example:
043775f2-39c2-4c88-a2ac-6297c273ab83, 008372590762, 007341893461, 2021-02-22T11:29:09.438+05:30, 2021-02-22T11:29:09.438+05:30, SMS, 0.3259588, Received

Message element        Example value
Record ID              043775f2-39c2-4c88-a2ac-6297c273ab83
Calling number         008372590762 (a)
Called number          007341893461 (a)
Session start time     2021-02-22T11:29:09.438+05:30
Session end time       2021-02-22T11:29:09.438+05:30
Session type           SMS
Charge                 0.3259588
Session result         Received

(a) Phone numbers are randomly generated numbers, for the purpose of this example.

the user is located, and not the actual mobile phone device location. One can improve this location precision by further processing (i.e., triangulation techniques, the use of which is forbidden in some countries, except for emergency calls and authority demands (Bachir et al., 2019)). However, for this, one needs to know the distance from other base stations in the vicinity, which is typically not recorded in the CDR dataset. It also adds a processing step to big data analytics, which is already sensitive to time and processing power. Furthermore, the location precision of the CDR data varies significantly. This is inherent to mobile network configurations, where higher coverage is required in urban areas (due to a denser population and larger telecommunication traffic load). To satisfy these quality-of-service requirements, telecom operators place base stations more densely in urban areas. This results in higher location precision in urban areas and lower precision in rural and less populated areas. Furthermore, in most countries, several generations of mobile networks coexist, as well as multiple network operators with different market shares. This means that several base stations cover the same

area (one for each operator, service provider, or network’s generation). Thus, when preprocessing the data, one needs to account for this fact and include additional steps to align the geographical data. This seems less challenging for mobile network generations that have the same underlying technology, for example, 3G and 4G mobile network generations. In practice, this means that the base stations for these network generations will share some base station components and thus the telecom operator will most likely situate them at the same location (Fig. 4.11). However, generations that have different underlying technology, like 3G and 5G, will have a significantly different configuration of the base stations (even when they belong to the same network operator). This poses an additional challenge when dealing with the geographical location alignment and resulting precision. Also, sometimes the device will actually not be located in the area covered by the base station that its location is assigned to. This is known as cell hopping and can occur for different reasons. For instance, one can live at the border of areas covered by two different base stations. Depending on factors such as the telecommunication


FIGURE 4.11 Mobile network architecture (2G and 3G).

traffic load and land configuration, the network can continuously reassign the device to different base stations (for instance, to the ones with lower current traffic) even if the user is staying at the same location (for example, at home). From a mobility patterns and trajectories point of view, this means that several “trips” might be observed (between these base stations) although the individual did not actually change location. For these reasons, the CDR data have recently been complemented with the network signalization data for big data analytic purposes. The network signalization data are not collected for billing purposes but are regularly recorded by network operators for their internal network synchronization and management needs. These data include regular updates of the device inventory (home and visitors), handovers or records of

devices (cellular transmissions) being transferred from one base station to another without losing connectivity, etc. Although the user is often unaware of these observations (he/she is not consciously initiating them), they are required for efficient management of network capacities and quality of service. From the trajectory reconstruction point of view, they improve the density of the observed locations. Regarding the time precision, they ensure that even when a user is idle for long time periods (not initiating the CDR data and not moving between the base stations), a record of his/her location exists. This is done at regular time intervals that vary among different operators and mobile network generations, but in general, it happens every 1 to 3 hours (Fig. 4.12).
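The cell hopping effect described above can be illustrated with a small sketch. It assumes a hypothetical input of (timestamp, base station id) observations for one device and an arbitrary threshold `max_excursion_s`; neither comes from the source, and a real preprocessing pipeline would need operator-specific rules. The idea is simply to treat a short A -> B -> A sequence as a single stay at A rather than as two “trips”.

```python
from typing import List, Tuple

Observation = Tuple[float, str]   # (timestamp in seconds, base station id), hypothetical format
Stay = Tuple[float, float, str]   # (first seen, last seen, base station id)

def collapse_cell_hopping(obs: List[Observation],
                          max_excursion_s: float = 300.0) -> List[Stay]:
    """Merge observations into stays and drop short A -> B -> A excursions."""
    stays: List[Stay] = []
    for t, cell in sorted(obs):
        if stays and stays[-1][2] == cell:
            # Same base station as before: extend the current stay.
            stays[-1] = (stays[-1][0], t, cell)
            continue
        if (len(stays) >= 2 and stays[-2][2] == cell
                and t - stays[-2][1] <= max_excursion_s):
            # Device is back at the previous station soon after leaving it:
            # treat the intermediate stay as cell hopping and remove it.
            stays.pop()
            stays[-1] = (stays[-1][0], t, cell)
            continue
        stays.append((t, t, cell))
    return stays

# Assumed example: a user at home bouncing between stations A and B, then moving to C.
observations = [(0, "A"), (60, "B"), (120, "A"), (180, "B"), (240, "A"), (4000, "C")]
stays = collapse_cell_hopping(observations)
```

Note that such a rule would also remove a genuine short round trip between two neighbouring cells, so the threshold is a trade-off rather than a safe default.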


FIGURE 4.12 Location update in a mobile network. Based on Bolhasani, H. R. (2021). Mobile networks overview (2G/3G/4G/5G). Available at: https://bolhasani.net/learning-material/.

Considering this, the spatial and temporal resolution of the mobile network data is quite different from, and lower than, that of the GNSS devices. However, mobile network data exhibit several important advantages: all network operators continuously collect them, they require no additional effort by users, they are cost efficient (no additional financial resources for their collection are needed), and they cover wide areas and large populations. This makes mobile network data an intriguing candidate for longitudinal analysis of human mobility patterns. Nonetheless, their usage for mobility and other studies is still hindered by several barriers including, among others, regulatory and privacy issues (e.g., the

anonymity of the phone holder), technological (for instance, allocation of data processing components or differences among network systems that are provided by various vendors), business-related (for example, low motivation by telecom operators to share their data and make it potentially accessible to competitors) and methodological ones (for instance, handling of the geographical references of the data or differentiation among mobile devices used for machine-to-machine services and those used by humans) (Ahas et al., 2014; Calabrese et al., 2014; Seidl et al., 2016; Vij & Shankari, 2015). A more detailed overview of mobile network characteristics and the applicability of mobile network data can be found in the literature


(Ahas, 2010; Ahas et al., 2014; Järv et al., 2014; Mishra, 2004, 2007).

4.5.2.1 Example: mobile network data (I)

When we speak of GNSS positioning systems, ideally (with the use of regional satellite-based augmentation systems and differential systems) we can speak of centimeter precision in optimal conditions. However, in a real-life system this is rarely the case, and general GNSS sensors will note the location with an error measured in decimeters or meters in most cases. Nonetheless, when we talk about mobile network-based precision, we speak of a different order of magnitude. Fig. 4.13A shows a 252 km² region, where the urban area of the city of Toulouse, capital of France’s southern Occitanie region, is located on the left side of the image and a somewhat less densely populated area on the right side. Characteristics of urban and suburban/rural space can also be recognized based on the density of the road network, which is visualized in yellow. In this example, the area is served by 28 base stations (Fig. 4.13B), meaning that one base station, on average, covers 9 km². As the mobile network notes the location of the base station that the device is assigned to, and not that of the device itself, the error of the reported location for the device can range between 0 and 2.12 km (when considering an idealized situation of a 3 × 3 km cell covered by a base station situated in the middle of this area; 2.12 km is half of the cell's diagonal). As one can see, this is a quite different level of precision than with GNSS positioning. Furthermore, when comparing Fig. 4.13A and B, one can see that the base stations are distributed more densely in the urban area than in the less populated suburban and rural areas. This also means that the location precision will be higher in the urban area (due to the shorter distance from the device to the base station) than in the less populated area, where base stations are much more widely spread from one another.
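The figures quoted in this example can be reproduced with a few lines of arithmetic. The sketch below uses the same idealization as the text (a square cell with the base station in its middle); the function names are illustrative only.

```python
import math

def mean_cell_area_km2(region_area_km2: float, n_base_stations: int) -> float:
    """Average area served by one base station."""
    return region_area_km2 / n_base_stations

def max_error_square_cell_km(cell_area_km2: float) -> float:
    """Largest distance from the centre of a square cell to any point in it
    (half of the diagonal), i.e. the worst-case location error."""
    side = math.sqrt(cell_area_km2)
    return side * math.sqrt(2) / 2

area = mean_cell_area_km2(252, 28)        # 9.0 km² per base station
error = max_error_square_cell_km(area)    # about 2.12 km for a 3 × 3 km cell
```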

Additionally, at the location of the base station, we can have omnidirectional antennas or directional antennas. Omnidirectional antennas radiate equal radio power in all azimuthal directions, considering a horizontal view perpendicular to an axis. The radio power of omnidirectional antennas varies with the elevation angle to the axis, when considering the vertical view, and declines to zero on the axis. The directional antennas, on the other hand, radiate or receive greater power in specific directions, allowing increased performance and reduced interference from undesired sources. Usually, when directional antennas are used for mobile networks, several of them (depending on the angle over which they radiate/receive power) are aligned to provide coverage in all azimuthal directions. For instance, three antennas with 120 degrees coverage each will cover a full 360 degrees (as an omnidirectional antenna would). In such cases, they are also called sectoral antennas as they divide the covered area into several sectors (one for each directional antenna) (Fig. 4.14). Fig. 4.13C indicates those base station locations that have sectoral antennas in place (each sector is indicated by a blue circle). This allows telecom operators to serve higher traffic at a higher level of service with fewer losses (for instance, fewer interrupted calls or failed data transfers). From the positioning perspective, it allows one to determine which antenna the device was assigned to. However, this alone does not help to reduce the positioning error, as the antennas are at the same location as the originally considered base station. Nonetheless, it does allow the application of base station location correction in order to reduce the positioning error, as indicated in Fig. 4.15. In our example, the space is partitioned based on Voronoi polygons.
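A minimal sketch of such a Voronoi-based correction is given here, under stated assumptions: each antenna is represented by one distinct generating point in planar kilometre coordinates (a hypothetical layout, not the Toulouse data), and the Voronoi cells are approximated on a regular grid rather than constructed exactly. Each grid point is assigned to its closest generating point, and the mean of the points in a cell serves as that cell's center of balance.

```python
import math
from collections import defaultdict

def centers_of_balance(antennas, bounds, step=0.05):
    """Approximate each antenna's Voronoi cell on a grid and return the cell centroids.

    antennas: dict of antenna id -> (x, y) generating point in km (assumed layout)
    bounds:   (xmin, ymin, xmax, ymax) of the study area in km
    step:     grid spacing in km (smaller is more accurate but slower)
    """
    xmin, ymin, xmax, ymax = bounds
    nx = round((xmax - xmin) / step)
    ny = round((ymax - ymin) / step)
    sums = defaultdict(lambda: [0.0, 0.0, 0])   # id -> [sum of x, sum of y, point count]
    for i in range(nx):
        for j in range(ny):
            x = xmin + (i + 0.5) * step
            y = ymin + (j + 0.5) * step
            # Voronoi rule: a point belongs to the closest generating point.
            nearest = min(antennas, key=lambda a: math.dist((x, y), antennas[a]))
            cell = sums[nearest]
            cell[0] += x
            cell[1] += y
            cell[2] += 1
    return {a: (sx / n, sy / n) for a, (sx, sy, n) in sums.items()}

# Two hypothetical antennas in a 6 × 3 km area; the cell boundary lies at x = 3.
corrected = centers_of_balance({"A": (1.0, 1.5), "B": (5.0, 1.5)}, (0.0, 0.0, 6.0, 3.0))
```

In this toy layout the corrected location of antenna A moves from (1.0, 1.5) to the middle of its cell at about (1.5, 1.5), which shortens the distance to the farthest point of the cell. An exact polygon construction (for example with a computational geometry library) would be the natural choice for real data.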
This means that the overall plane with n points (where n is the number of antennas) is partitioned into convex polygons in such a manner that each convex polygon contains exactly one generating point (directional or omnidirectional antenna) and that every point in an assigned polygon is closer to its generating point than to any other (Fig. 4.13D). Finally, for each polygon, a center of balance is calculated. This center of balance is then considered as a “corrected” base station location for each antenna (Fig. 4.13E). In this manner the location precision is increased, as the error of the reported location for the device (when considering the “corrected” base station location for each antenna) can range between 0 and 100 m in an area with higher density (urban area), or between 0 and 3 km in a rural area, or on average from 0 to 1 km (compared to 0–2.12 km from the start of this example). An interesting note here is that, as mentioned already, the overall positioning precision will be lower for the mobile network data than for the GNSS-based positioning. However, as illustrated in the above example, the precision of positioning based on the mobile network data will be higher in the urban areas than in the rural, or less densely populated, ones due to the higher density of the base stations and the nature of the mobile networks’ configurations in this context. Exactly the opposite relation holds for GNSS-based positioning. Here, one can expect a higher precision in rural areas due to lower signal occlusion (e.g., availability of wide and open areas) and a lower precision in urban ones due to effects such as urban canyons, multipath fading, path loss, shadowing, and increased time to first fix, which pose a challenge for GNSS-based positioning, especially in situations such as underground garages or street corridors in-between high buildings.

FIGURE 4.13 Mobile network precision example. A) region, B) base station locations, C) sectoral antennas, D) Voronoi polygons, E) centers of balance.

FIGURE 4.14 Omnidirectional and directional antennas.

4.5.3 Mobile sensed data

The mobile sensed data represent the third main big data source that is intensively being used and evaluated in the context of mobility studies. The mobile sensed data in this context refer to data collected from mobile phones’ sensors. Contrary to the case with the CDR and network signalization data, the mobile sensed data are collected at the users’, and not the mobile networks’, end. Furthermore, similarly to the GNSS data, the mobile sensed data include the location and timestamp information that is collected from one of the phone’s positioning dedicated sensors. This can be the GNSS sensor itself, but also the Wi-Fi network location and/or mobile networks’ base station location readings. Inherently, this positioning information

FIGURE 4.15 4-step process for increasing the location precision for sectoral antennas.

has the same advantages and disadvantages as previously discussed for the pure GNSS data and the mobile network data (depending on the source of the positioning information). The precision of the Wi-Fi location data depends on the Wi-Fi network density in the area. Similar to the CDR data, the Wi-Fi data actually note the location of the Wi-Fi network antenna and not the device itself. However, the Wi-Fi stations cover much smaller areas than the mobile network base stations, so the location precision is usually much more detailed (Fig. 4.16). Furthermore, much different from the pure GNSS data or the mobile network data, the mobile phone sensed data represent the fused collection of the data from multiple mobile phone sensors. This can include, among others, accelerometer, gyroscope, microphone, or camera data. The developer of the data collection

application has the possibility to adjust the frequency of the data collection and the selection of the sensor data that will be included in the dataset. Hence, mobile phone data are collectible only if the appropriate settings are applied (e.g., the GNSS sensor is activated) and the dedicated application is installed on the phone. Although the mobile phone application can be designed to activate or deactivate the required sensors, it is important that in such cases the users themselves are aware of these settings (both for privacy and for battery management reasons). Table 4.3 shows an example of the mobile phone data.

FIGURE 4.16 Wi-Fi network.

TABLE 4.3 Mobile network data example.

Mobile network data example:
013637f2-41c2-4c90-a4ac-5197c273ef71, GNSS, 181908.00, 51.0524452, N, 3.7250449, E, 10, 22, 0.2, 732

Message element                  Example value
Record ID                        013637f2-41c2-4c90-a4ac-5197c273ef71
Positioning system indication    GNSS
Timestamp                        181908.00
Position - latitude (WGS 84)     51.0524452
Latitude orientation             N
Position - longitude (WGS 84)    3.7250449
Longitude orientation            E
Altitude (m)                     10
Speed (km/h)                     22
Acceleration (m/s²)              0.2
Sensor recorded value            732

When considering the mobile sensed data for mobility studies, it is essential that users are equipped with mobile phones that have the capabilities to operate with the dedicated applications, that is, smartphones. The penetration rates of mobile phones and smartphones (usually expressed as the number of devices per 100 inhabitants) are continuously growing around the world (International Telecommunication Union, 2019) and seem promising for the future use of the mobile sensed data for mobility studies, especially since they also allow two-way communication (for instance, the user or the data collector can provide feedback). This capability has multiple advantages when it comes to the use of smartphones for mobility studies. For instance, since positioning data lack contextual information regarding the trip (e.g., the occupancy rate of the vehicle or the motivation for the trip) and the data processing efforts are still not at the level to be able to automatically extract all the information required for transport planning, users are in the position to provide such inputs via this two-way communication channel. Examples include reporting trajectory-related details such as the purpose of the trip or the transport mode used. This represents an additional burden to the user but is a useful capability that is not present when it comes to the other two big data sources (the GNSS data and the mobile network data). When speaking of the possibility to utilize two-way communication, we can therefore distinguish between:

- “raw” data collected directly from one or more of the mobile phone’s sensors, and
- mobility-related data that is confirmed/validated or directly reported by the users via means of available communication options.

This can be simple user-based reporting, for instance when users utilize mobile phone-based travel diaries, indicating every trip that they make and providing contextual information regarding it, or a feedback loop in which users have insight into the data analytics results (for example, analytics resulting in the recognition of the transport mode utilized for a given trip) and validate the correctness of these results or provide a correction. This distinction between the “raw” data that are not validated and the validated, or user-reported, ones (the “ground truth” data) is particularly relevant in the context of data analytics as it allows data analysts to apply different types of methods and techniques based on them. For instance, validated data can be considered as labeled data and used to train supervised machine learning-based models that can later be used to automatically detect some of the trip contextual details, such as the transport mode utilized for the trip (see Chapter 5, Machine learning). Supervised machine learning models can be trained and evaluated based on the user-reported inputs/validations and, once acceptable success rates are attained, they can perform the learned task without the need for user-provided inputs, which in turn reduces the user burden. In this context, based on the user burden and required level of interaction, we distinguish between the “passively” and “actively” collected data. The “passively” collected data are those data that are collected without any user interaction (other than simply turning on/off the device/sensor or the mobile phone application). Correspondingly, we call “actively” collected data those that require more intense interaction


(for instance, initiating the recording of the trajectory, validation of the contextual data, etc.).

4.5.3.1 Example: mobile sensed data (i)

As a number of mobile sensed data examples for mobility studies will also be given in the following chapters, here we focus a bit more on the user burden and data collecting discipline, as well as the usability of the collected mobile sensed data for trip reconstruction, based on two examples: one where the data collection process is initiated by the data contributors themselves, and one where the data collection process is initiated by stakeholders interested in the comparability between smartphone-based data collection for mobility studies and traditionally collected mobility data. The question of user burden and willingness to actively engage in mobility data collection via mobile phones is often raised in the literature on mobile sensed mobility data. It is evident that increased effort required from the users results in their demotivation to engage in the data collection process. Combined with the excessive use of the phone’s battery, this demotivation is sometimes reflected in withdrawal from active data collection, uninstallation of the application, or deactivation of some of the phone’s sensors (Harding et al., 2021; Karami & Kashef, 2020; Seo et al., 2019). From the mobility data collection point of view, this will result in unreported trips or gaps in the data. Fig. 4.17 illustrates an example of the users’ participation in the mobile sensed data collection process during a period of 3 months. The illustrated data were collected during the Flamenco project, which aimed to support citizens and stakeholders in creating and participating in so-called citizen observatory campaigns (Flamenco consortium, 2021). The intended concept of citizen observatory campaigns allowed citizens to gather around initiatives that were relevant to them, use a customizable mobile phone application toolset to design mobile phone applications

FIGURE 4.17 Mobile sensed data and users’ discipline example.

that collect data of interest, monitor the progress of the campaign, and produce outputs in various forms such as graphs, maps, or data tables. The flexibility of the process aimed to allow smaller groups of citizens to overcome barriers such as the lack of dedicated mobile phone applications for their intended data collection campaign, and to proactively manage their voluntarily contributed data and the campaigns they initiated. These campaigns included data collection regarding various mobility and urban living-related aspects such as mobility trajectories, traffic noise, or physical activity-related data sensing. During campaigns, citizens could either actively initiate the data collection or allow the data collection process to run passively on their mobile phones. This process is also known as a crowdsourcing or crowdsensing data collection campaign. Fig. 4.17 illustrates the daily evolution of the contributed trip data for one such citizen observatory campaign. The yellow color in the graph indicates the trip data that participants “actively” collected, meaning that participants initiated the trip data collection at the start of their trip and marked its end. The blue color,

on the other hand, indicates the “passively” collected trip data. The “passively” collected trip data include all the trips that were observed from the mobile phone’s positioning sensors, but whose start and end were not indicated by the participants. Finally, the orange color marks the noise in the collected data (for instance, records where data regarding the data collection mode were missing). The observed daily evolution of the contributed trip data indicates that users tend to report trips more actively at the beginning of the data collection process, but after some time prefer the passive data collection approach. There could be several reasons for this: they may find the burden of the data collection activity too demanding and prefer not to actively participate from some moment onwards, but it is also possible that their level of trust in the data collection process has increased and that they feel comfortable allowing the process to run in the background. From the data analytics perspective, one would prefer to have a “validated” data set, where there is a clear indication of trips’ start and end moments, as this reduces the required data processing efforts and also provides a “ground truth” for the training/evaluation of the analytics processes (e.g., automated recognition of trips’ start and end moments). However, the observed reporting discipline is not a disadvantage, as it allows data analytics experts to test their algorithms in the early phases of the campaigns (when there are validated mobility patterns) and to select adequate algorithm parameters, allowing these algorithms to be used in the later stages of the campaign, when there is a higher ratio of “passively” collected mobility data. Hence, we could say that in this context, this is a somewhat preferable scenario compared to the other way around (for instance, having data that are not validated at first and obtaining validated data samples only in the later phases of the campaigns).
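The workflow described above, selecting algorithm parameters on early validated data and then applying them to later passive data, can be sketched as follows. Everything in this example is an assumption made for illustration: the single speed threshold as the “algorithm”, the candidate values, and the sample data are not taken from the Flamenco campaign.

```python
def accuracy(threshold, labelled):
    """Share of validated points for which 'speed >= threshold' matches the user's label."""
    hits = sum((speed >= threshold) == moving for speed, moving in labelled)
    return hits / len(labelled)

def pick_threshold(labelled, candidates):
    """Choose the candidate threshold that best reproduces the validated labels."""
    return max(candidates, key=lambda th: accuracy(th, labelled))

# Early campaign phase: (speed in km/h, user-validated "moving" flag), hypothetical values.
validated = [(0.2, False), (0.8, False), (1.1, False),
             (1.9, True), (3.5, True), (4.2, True), (12.0, True)]
best = pick_threshold(validated, candidates=[0.5, 1.5, 2.5, 5.0])

# Later campaign phase: passively collected speeds without labels.
passive_speeds = [0.1, 2.0, 6.3]
predicted_moving = [speed >= best for speed in passive_speeds]
```

A real trip detection algorithm would use more features than speed and would be evaluated on held-out validated data, but the division of roles between “actively” and “passively” collected data stays the same.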


4.5.3.2 Example: mobile sensed data (ii)

Another example relates to a mobile sensed data collection process that was initiated by stakeholders interested in the comparability between mobile sensed data and traditionally collected mobility data in the context of official mobility statistics creation. For this purpose, mobile sensed data were collected over a period of 8 months. During this period, 385 participants voluntarily contributed their mobility data. The data were collected via a mobile phone application that continuously (unless switched off) “passively” sensed mobility behavior. Once trip-related movement was detected, the application activated the GNSS sensor and started more intense mobility data collection until the end of the trip was detected. The methodology used for trip start and end detection is described in more detail in Wolf et al. (2001). The motivation behind this principle for GNSS sensor activation and more intense mobility data collection was to manage the battery consumption as efficiently as possible. Therefore, more battery energy was consumed for data collection only when a trip was detected, and during other periods less battery-consuming positioning techniques were used (mobile network base station readings and Wi-Fi location readings). The following section gives more insights into the users’ activity and data collection discipline during the campaign, as well as into the possibility of correctly detecting trip-related activity from the sensed data in a battery-efficient manner. Fig. 4.18 shows the distribution of all the collected positioning locations based on their source. Here, as one might expect, the GNSS data are the ones with the highest spatial precision as they describe the actual location of the smartphone/user, rather than the location of the base station/antenna, as is the case for mobile/cellular or Wi-Fi networks. However, for the GNSS data to be collected, the GNSS sensor

FIGURE 4.18 Positioning location points.

needed to be activated (this is not the default setting on many devices). This results in higher consumption of the smartphone’s battery. On the other hand, for the Wi-Fi location data, it is sufficient that the Wi-Fi sensor is activated, while for the mobile network’s base station location data it is sufficient that the mobile phone is switched on (and not, for instance, in “airplane” mode), both of which drain the battery much less. Considering this, the data contributors had the possibility to directly manage their battery consumption. If the consumption was significant, they were able to simply deactivate their GNSS sensor. In such cases, the data collection process would continue solely based on the other two data sources. Alternatively, they were able to completely deactivate the data collection mobile phone application, which would discontinue the data collection process completely. Fig. 4.19 illustrates data contributors’ activity, that is, the period, expressed in the overall number of days, during which a participant contributed data (location points). 93.9% of the users contributed more than 1 day of data, 83% reported more than 3 days, while 68.6% of the users reported more than 7 days of mobility behavior. Considering that travel behavior studies that rely on the traditionally collected mobility data in the same area usually consider


FIGURE 4.19 Contributing days per participant.

one, or three, days of mobility behavior (Cornelis & Huynen, 2012) in terms of data collection period, and knowing the response rates reported in the literature (Zimowski et al., 1997), the obtained insights indicate that the described smartphone-based data collection approach is an interesting candidate to further automate the data collection process for mobility studies. However, ensuring continuity of data contributions, i.e., that travel behavior is reported for several consecutive days, depends on several factors including user motivation and discipline, as well as the provision of different incentives during the campaign. For instance, Montini et al. (2015) provided 150 € as an incentive to the participants of their campaign. They report that, despite this, not all participants fulfilled their commitment to actively record mobility behavior during the expected period. In their findings, they recognize that the level of compensation did not correlate with the participants’ effort and that it is conceivable that continuous active reporting of mobility behavior was simply not manageable for most individuals. One of the examples that they highlight is that the discontinuity of mobility behavior observations was not due to users forgetting and leaving their mobile phones at home, but rather due to the mobile phone

application being intentionally deactivated by the users. Moreover, they noticed that individuals who had participated in their preceding study (Nitsche et al., 2014) had higher data contribution and activity rates, which they explain by familiarity with the technology and the mobile phone application used. Similar to these findings, another study (Bohte & Kees, 2009) reported that 25% of individuals who participated in the data collection campaign forgot to bring their device with them at least once during the week, while Geurs et al. (2015) detail that up to 35% of participants completely stopped collecting their data, mainly because they were immobile due to injury or disease, or because they were on holiday. The authors conclude that this level of underreporting does not seem to be problematic, as according to official statistics in their area (Dutch national mobility survey), about 15% and 25% of the population is not mobile during working and weekend days, respectively. Hence, to gain a better understanding of the users’ data contribution discipline during the data collection campaign, we examined the first and the last data contribution day for each user and the occurrence of any data contribution gaps (days without any reported positioning data). Fig. 4.20 illustrates the number of

FIGURE 4.20 Share of data contributing activity.

campaign participants who continuously contributed their data and those for whom discontinuity in the data contribution was observed. Out of the 49% of participants for whom gaps in their observed mobility behavior were noted, 38% (19% of all users) had only a single day of missing data, for 33% (or 16% of all participants) more than 3 days of missing data were observed, while for 18% (or 8.8% of all participants) more than 7 days of missing data were observed (Fig. 4.21). It should be mentioned that, in the context of this study, discontinuous data contribution means that not even “passively” collected data were available for the participant for a given period; thus, it can only be concluded that the participant intentionally deactivated the mobile phone application for that period (likely for the reasons mentioned in previous studies). However, the above-mentioned studies had significantly shorter data collection periods (e.g., the campaign of Geurs et al. (2015) lasted only 2 weeks and that of Montini et al. (2015) 8 weeks) compared to the 8 months of the considered data collection campaign; thus, the obtained statistics regarding the reporting gaps cannot be directly compared.
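The gap analysis described above (first and last contribution day per user, and days in between without any positioning data) could be computed along the following lines. This is a sketch with invented sample data; the function names and the input layout (a set of contribution dates per participant) are assumptions, not the procedure used in the study.

```python
from datetime import date

def contribution_gaps(days):
    """Return (first day, last day, number of missing days in between) for one participant."""
    observed = set(days)
    first, last = min(observed), max(observed)
    span = (last - first).days + 1          # calendar days from first to last contribution
    return first, last, span - len(observed)

def share_with_gaps(days_per_user):
    """Fraction of participants with at least one missing day."""
    missing = [contribution_gaps(days)[2] for days in days_per_user.values()]
    return sum(m > 0 for m in missing) / len(missing)

# Hypothetical participants: u1 contributes on 3 consecutive days, u2 skips 2 days.
days_per_user = {
    "u1": [date(2021, 1, 1), date(2021, 1, 2), date(2021, 1, 3)],
    "u2": [date(2021, 1, 1), date(2021, 1, 4)],
}
gaps_u2 = contribution_gaps(days_per_user["u2"])[2]
share = share_with_gaps(days_per_user)
```

The percentages quoted in the text follow the same logic of nested shares: for example, 38% of the 49% of participants with gaps corresponds to 0.49 × 0.38 ≈ 0.19, i.e., about 19% of all users.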


Fig. 4.22 illustrates users’ data contribution discipline in more detail, where the vertical axis represents each individual campaign participant (data contributor) and the horizontal axis corresponds to the percentage of continuous data contributions (in blue) and observed gaps in data contributions (in yellow) for the respective participant. One can notice that, for those participants who had gaps in their data contributions, there is no specific pattern (e.g., there are no significant jumps at a distinct moment during the data collection process); rather, the ratio of their gaps is roughly linearly distributed (when sorted by percentage). Furthermore, participants had the opportunity to review their data and detected trips on a daily basis. In this way they were able to validate, or correct, the results of the trip detection procedure. Looking at the quantity of collected positioning data and the share of positioning data actually belonging to trip trajectories, one can notice that 45% of the collected data points were used to construct trip trajectories (Fig. 4.23), while the remaining data did not add value to the trip construction. Among these, 5.75% were GNSS data (indicated in dark yellow) resulting from false trip detection or collected after a trip had ended but its end was not correctly detected, which left the intensive GNSS-based data collection active (Fig. 4.24).

4.5.4 Comparison of the three main big data sources for mobility studies

FIGURE 4.21

Duration of data contributing discontinuity for participants with discontinuous contributing discipline.

Considering the three most used big data sources for mobility studies, the following tables give a summarized comparison of their characteristics. Table 4.4 gives an overview of the three big data sets with respect to user burden, while Table 4.5 summarizes the sensors required for the collection of each dataset.


4. Small and big data for mobility studies

FIGURE 4.22

Device logging activity.

Another important characteristic is the device energy consumption that each data collection process entails. This is particularly relevant for mobile phone-based data collection, as mobile phones run other applications besides those dedicated to data collection, and users are quite sensitive when it comes to battery consumption. Table 4.6 gives a more detailed overview of the battery consumption for each data collection approach.

FIGURE 4.23 Share of the location points that were a constituent part of a trip.


FIGURE 4.24

Location points used/ unused for trip trajectory construction per data contributor.

TABLE 4.4

User burden-based classification.

Passively collected: Signalization data; CDR; GNSS; Smartphone data (not validated)

Actively collected: Smartphone data (validated)

4.5.5 Other big data sources for mobility studies

This section presents a nonexclusive list of other big data sources that have the potential to be used for mobility studies and analytics within the smart city context.

4.5.5.1 Location-oriented sensing

TABLE 4.5

Data collection: underlying sensors.

Dataset type | Sensors
CDR | Mobile network transmitters
Smartphone | Accelerometer, gyroscope, magnetometer, GNSS, etc.
GNSS | GNSS, SBAS, etc.

While the aforementioned big data sources mainly focus on collecting the mobility data associated with a specific device in order to reconstruct travel behavior (trip trajectories), location-oriented data collection focuses on sensing all the traveling entities at one location (as opposed to sensing all the locations of one traveling entity/device). In other words, the location-oriented data collection process tries to capture travel entities that are

TABLE 4.6

Device battery consumption.

Dataset type | Battery consumption
CDR and signalization data | Low
GNSS | Medium
Smartphone data (passive collection) | Medium (a)
Smartphone data (active collection with prioritization of GNSS sensed locations) | High (a)

(a) Dependent on the settings (e.g., how frequently the location points are collected).

passing a predefined location. This can be a single measurement point within the transport network, but more commonly several sites are distributed geographically so that the complete area of interest is covered, including all entrance and exit points of the targeted part of the network (for instance, if the target area is a city, these can include the main roads entering/exiting the city, train stations, terminals, etc.). Drawing a parallel to traditionally collected mobility data, big data-based location-oriented sensing at the borders of the target area can be seen, in some respects, as a potential equivalent of intercept surveys (data collected at specific locations, usually at the border of the area of interest) or as an automation of traffic and person counts. The traditional and most straightforward way to collect location-oriented data is to manually note the counts of travel entities that have crossed the predefined location within a specified time interval. Most commonly for mobility studies, these counts are conducted for a whole week during months with traffic volumes within approximately 2% of the annual average daily traffic volume. Usually, these are the spring months (i.e., March, April, and May) and fall months (i.e., September, October, and November) or, as a general rule, months without major holiday periods, events (e.g., Olympics or festivals), school holidays, or weeks after daylight saving time clock adjustments. If the data collection period is required to be shorter than a week (for instance, due to limited resources), then usually one or more weekdays between Tuesday and Thursday are

selected and the counting is done for a period of 1 h (or 15 min) and then extrapolated to whole-day and whole-week patterns based on the known statistics for the area. During this type of data collection, other relevant mobility data can be noted alongside the counts, including vehicle occupancy rates or vehicle classifications (for instance, private cars, heavy goods vehicles, motorcycles, etc.). However, manual counting is a resource-demanding way to collect mobility data and suffers from known disadvantages such as occlusion and human error. For this reason, over the last decades, the process has advanced to automated counting of traffic entities. Examples include various types of data recorders and sensors placed under or on the traffic network surface (for instance, pneumatic road tubes, piezoelectric sensors, and inductive loops). Nonetheless, the implementation and operational costs (maintenance, support, and day-to-day operation) of automated counting systems tend to be high. Furthermore, although they successfully record the counts of transport entities, complementary details, such as vehicle occupancy rate, remain unreported. To overcome these pitfalls, video-based techniques for entity counting have also been developed. These techniques rely on entity identification, and more advanced approaches can include automated vehicle classification or the capture of vehicle occupancy rates.

4.5.5.1.1 Computer vision techniques

License plate recognition is a common and widely used group of techniques for vehicle identification. It comprises the collection of vehicle license plate characters and arrival times at various sensing locations (also called checkpoints). The most automated form of license plate recognition involves the use of video cameras and character recognition software to recognize and automatically transcribe the license plate number for subsequent computer processing (Jiao et al., 2009). Based on this information, collected at successive checkpoints, it is possible to recreate vehicle movements (if the data collection points are of adequate density) and to further derive transport network-related indicators such as travel times. This is highly dependent on the video-based method’s ability to correctly identify license plate characters, which is often influenced by factors such as vehicle speed, volume of vehicle flow, ambient illumination (daylight, night, direct sunlight, or shadow), spacing between vehicles (e.g., the occurrence of occlusion), weather conditions (rain, snow, fog), license plate variety, the plate’s physical position (tilt, rotation), etc. In general, license plate capture and recognition rates may vary from as low as 15% (for poor visibility/weather conditions) to as high as 85%-90% (Cavar, 2010).

Another advancing domain of computer vision applications for mobility involves equipping the vehicle with computer vision sensing to enable it to recognize its surroundings and adjust the driving behavior in line with the collected information. These types of applications are particularly interesting for autonomous vehicles as future mobility entities within the smart city context, but are also already successfully implemented in a number of vehicle functionalities, for instance, camera-assisted parking. In this type of application, the computer vision challenge is more demanding, as a vehicle needs to recognize surroundings that are of unknown format or composition.
For instance, when it comes to license plate recognition, the standardized license plate formats have a predefined composition of possible characters as well as a limited number of characters that can be placed in the different sections of the license plate. When it comes to the sensing and recognition of the environment, there are no predefined formats or compositions, and the environment can include, among others, road surfaces, other vehicles, and various static or moving objects or people. Detecting and identifying people has long been one of the most demanding tasks for a computer vision system. Besides the inherent challenges related to ambient conditions such as illumination, occlusions (for instance, carrying an umbrella), weather (rain, snow, fog), etc., the main challenge arises from the variety of appearances and the erratic way humans behave. So far, successful applications mainly focus on recognizing human silhouettes and classifying activities (standing, walking) (Bredereck et al., 2012; Morbee et al., 2010), which is beneficial for mobility tasks in which the transport mode needs to be identified, for instance, identifying a human as a pedestrian. Nonetheless, for mobility-related applications such as travel time estimation, reidentification of a specific human would be required at sequential sensing locations. In this context, reidentification would comprise the detection of specific characteristics that differentiate one individual from another, based on facial recognition. However, facial recognition is a more challenging computer vision task: it performs better on a smaller scale, while its performance is lower at locations where many people pass by daily due to limits in recognition (Buolamwini & Gebru, 2018). Furthermore, facial recognition allows the recognition of individuals and as such falls under the category of biometric data, the processing of which is regulated in different regions. For instance, in the EU, the GDPR generally forbids the processing of biometric data for the purpose of uniquely identifying a person (GDPR Art. 9(1); biometric data are defined in Art. 4(14)).

4.5.5.1.1.1 Example: computer vision

Fig. 4.25 shows an example of research activities under the Vebimobe project (De Mol et al., 2016), where the applicability of computer vision for automated recognition of traffic signs within the


FIGURE 4.25 Vebimobe: the test vehicle with equipment for computer vision-supported detection of traffic signs. From Semanjski, I. & Gautama, S. (2016). Sensing human activity for smart cities’ mobility management. In Smart cities technologies (211-232). Rijeka, Croatia: InTech.

city of Ghent, Belgium, was examined. The specially equipped vehicle tested the ability to recognize traffic signs using cameras that operate in different spectrums while being integrated into traffic flow (De Mol et al., 2017). One of the main aims of the Vebimobe project was to examine the readiness of the related technologies for applying such data collection techniques to automated vehicle speed adaptation and more sustainable route guidance applications (Semanjski & Gautama, 2019). The test conducted in the neighborhood of Merelbeke showed that computer vision was able to correctly recognize 87% of the speed limit-related traffic signs from the moving vehicle. The minimum horizontal detection distance to the traffic sign board was between 7 and 20 m, leaving between 0.5 and 1.4 s for any process activation (e.g., in the case of autonomous driving) at an average speed of 50 km/h.
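The relation between detection distance and available reaction time quoted above can be checked with a short calculation. The sketch below assumes a constant speed over the detection distance; the function name `time_to_sign` is illustrative only.

```python
def time_to_sign(distance_m, speed_kmh):
    """Seconds until a vehicle at constant speed reaches a point distance_m ahead."""
    speed_ms = speed_kmh / 3.6  # convert km/h to m/s
    return distance_m / speed_ms

# Detection distances of 7 m and 20 m at 50 km/h, as reported for the test.
t_min = time_to_sign(7, 50)
t_max = time_to_sign(20, 50)
print(f"{t_min:.1f} s to {t_max:.1f} s")  # prints "0.5 s to 1.4 s"
```

At 50 km/h (about 13.9 m/s), a sign detected 7 m ahead leaves roughly half a second, which illustrates how sensitive the available activation time is to the detection distance.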

Observed reasons for the incorrect interpretation of the traffic signs are shown in Fig. 4.26 and include:

- Atypical traffic sign plates (34%): for example, the traffic sign is the C43 (speed limit), but next to the typical speed limit number it contains the letters “km”;
- Occlusion (33%): such as growing greenery in the vicinity of the traffic sign or the presence of heavy goods vehicles at a short distance in front of the sensing vehicle;
- Incorrect interpretation (33%): for instance, the traffic sign is correctly recognized but the wrong speed limit value is assigned to it (e.g., 30 km/h instead of 50 km/h), or the traffic sign was correctly recognized but was not relevant as it was not related to the vehicle’s trajectory (e.g., the traffic sign was positioned close to the intersection but was relevant for another road than the one the vehicle was on).


FIGURE 4.26

Incorrect interpretation of traffic signs.

4.5.5.1.2 Bluetooth data

Bluetooth sensing has been suggested recently as an interesting alternative for mobility data sensing. Bluetooth is a wireless technology standard (Institute of Electrical & Electronics Engineers, 2002) for exchanging data over short distances between fixed and/or mobile devices. It uses short-wavelength ultra high frequency (UHF) radio waves in the industrial, scientific, and medical (ISM) band from 2.4 to 2.485 GHz (Bluetooth SIG, 2016). It was first introduced in 1994 by the telecom vendor Ericsson and has since found a wide range of implementations across different domains, from fixed and mobile devices to building personal area networks (Bluetooth SIG, 2016). The interesting feature of Bluetooth technology for mobility studies is that inquiry information is exchanged before a wireless connection between two devices is established. This allows for completely unobtrusive detection of nearby devices. In more detail, before two devices can connect, the protocol’s inquiry phase needs to be concluded. During this phase, an initiator device starts the service discovery by transmitting inquiry packets. Nearby devices (within 10 m) that allow themselves to be discoverable issue an inquiry response. This response includes information on the device ID (a 48-bit identifier of the mobile device, the MAC address) and clock (Bourk et al., 2008). This process can last up to 10 s, after which the initiator device should be aware of the discoverable devices in its vicinity.

Today, Bluetooth has become an almost omnipresent technology across a wide range of mobile devices. Due to its range limitations, for mobility-related applications it is better suited to location-oriented sensing, where, with additional analytics, user trajectories can be approximated based on the timestamp sequences at the sensing locations or checkpoints. Hence, by placing stationary Bluetooth sensors at strategic locations, one can obtain insights into the mobility of individuals (based on mobile devices such as laptops or mobile phones) or vehicles (based on vehicles’ keys) in a variety of contexts. Phua et al. (2015) compared Bluetooth-sensed data at a supermarket with manually noted data using systematic sampling and found that trip lengths and user demographics were similar, except for the underrepresentation of the elderly population. Other examples of sensing human mobility include travel time measurements of motorized traffic (Park & Haghani, 2015; Portugais & Khanal, 2014), tracking of pedestrians (Utsch & Liebig, 2012), detection of mobility-related incidents (Bullock et al., 2010; Margreiter, 2016), dynamics at mass events (Versichele et al., 2012), and others. However, the gathered findings suggest that Bluetooth-sensed data for mobility studies exhibit several limitations. First, the sensed location is limited to the preselected locations of the stationary Bluetooth devices. By processing the sequence of observed device IDs across different locations, movement data can be estimated. However, the exact itineraries remain unknown. To overcome this, the stationary Bluetooth devices can be placed at higher density, for instance, at each intersection of the


transport network. In return, such an implementation increases the required investments, which become significant for larger networks. The second challenge is related to sample size and data quality. Since only discoverable Bluetooth devices can be observed, reporting at the population level (e.g., absolute density or flow statistics) requires the proportion of discoverable Bluetooth devices across the population to be determined. This can be achieved based on manual counts of the total number of transport entities at sensing locations; however, this process tends to be resource-demanding.

4.5.5.1.2.1 Example: bluetooth data

Within the city of Ghent, Belgium, an annual music and theater festival, called the Ghent Festivities, takes place at the end of July. The festival lasts for 10 days (one full week, including both the starting and ending weekends) and spreads across the city over 11 locations. During this period, squares, streets, and the city itself act as major attractors hosting on-stage performances, food stands, and fairs that attract around two million visitors every year. During this period, city dynamics change. Many educational and administrative institutions close their doors, and an influx of people from nearby (and more distant) towns and regions joins the events. This requires additional dedication to city event management, including the organization, security, transport, and emergency service provision during the festivities. To support this, 22 city locations were covered with Bluetooth scanners. Fig. 4.27 shows an implementation of these Bluetooth scanners for monitoring crowd mobility during the event (as described in Versichele et al., 2012). The data collected in this manner represented people’s mobility within the festivity zone itself and the mobility to and from the festivity zone. Applications of the resulting data for the designed purpose were manifold. The most direct results were those related to the general statistics about visitors and their observed behavior (for instance, the number of daily visitors, their temporal and spatial distribution around the city, the (sequence of) squares visited by visitors, etc.) (Semanjski & Gautama, 2016). Additionally, the derived set of results provided insight into the distribution and dynamics of the crowd within the festivity zone and the city center. This insight is relevant for many services that monitor crowd density and dynamics to plan safety measures such as interim closures of access to overcrowded locations or facilitating circulation between festivity locations. Such insight is also relevant for the festivity visitors, to whom it was made available via a dedicated mobile phone application, as it assists them in planning their visit and avoiding overcrowded or temporarily closed areas. Another set of derived results relates to the accessibility of the festivity zone and is based on monitoring the travel times between key locations situated at the borders of the area of interest, such as train stations, public transport stops, and park and ride locations, and the festivity zone itself. For instance, by analyzing the sequence of Bluetooth observations of the same device IDs, starting from a park and ride facility toward the city center, prolonged travel times can be detected. These can suggest congestion problems along the route, where authorities should intervene to facilitate the circulation of public transport. In this manner, the sensed data also assist in optimizing the safety and comfort of the festivity visitors (Gautama et al., 2017).

4.5.5.1.3 Ticketing data

Ticketing data have become an increasingly interesting source of insights for mobility studies in recent years. Ticketing data are generated as a by-product of mobility billing services. The best-known ticketing data are those from public transport, but several other services, such as shared mobility or parking, generate ticketing data during transactions such as paying for the service or validating the ticket. Such data leave a trace of the moment when


FIGURE 4.27

Bluetooth scanning implementation for mass events. Based on Ghent University (2016). Move. [Online] Available at: http://move2.ugent.be.

the activity occurred, the location (either the fixed location of the ticketing station or an indication of the public transport vehicle where the activity occurred), and details regarding the service, such as fare data (Table 4.7).

In public transport networks, usually, the following types of tickets exist:

- Single ticket:
  - one journey (no time limit),
  - zonal single ticket,
  - origin-destination single ticket,
  - single ticket or several journeys within a limited duration (for example, 1 hour),
  - single ticket for a limited group of travelers/family ticket,
  - off-peak ticket/night ticket,
  - special event/tariff ticket (e.g., major sports events or conferences participants),
- Several journeys ticket:
  - return ticket,
  - multi-journey ticket (for instance, 5, 10, or 20 journeys),
  - season ticket (for example, day, week, month, or a year),
  - value ticket (pay-as-you-go),
- Operator/transport mode ticket:
  - single transport mode/single operator ticket,
  - multimodal/multi-operator ticket,
  - combined ticket (for example, park & ride).

TABLE 4.7

Ticketing data record example.

ID | Entry station | Entry time | Exit station | Exit time
1 | XIDAN | 2021/06/19 07:05:20 | Beijing railway station | 2021/06/19 07:15:24
2 | XIDAN | 2021/06/19 07:07:00 | DONGZHIMEN | 2021/06/19 07:28:20
3 | QIANMEN | 2021/06/19 07:20:20 | Beijing railway station | 2021/06/19 07:31:17
4 | XIDAN | 2021/06/19 07:35:20 | Beijing railway station | 2021/06/19 07:45:24
... | XIDAN | 2021/06/19 17:07:11 | DONGZHIMEN | 2021/06/19 17:28:12
n | QIANMEN | 2021/06/19 23:20:20 | Beijing railway station | 2021/06/19 23:31:17
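To illustrate how records of the form shown in Table 4.7 can be turned into simple mobility indicators, the following sketch derives per-journey travel times and origin-destination counts. It is a minimal example under stated assumptions: the tuple layout, the variable names, and the helper `travel_time_seconds` are illustrative, and only records 1 to 4 of the table are used.

```python
from collections import Counter
from datetime import datetime

TIME_FORMAT = "%Y/%m/%d %H:%M:%S"

# Records 1-4 transcribed from Table 4.7:
# (id, entry station, entry time, exit station, exit time)
records = [
    (1, "XIDAN", "2021/06/19 07:05:20", "Beijing railway station", "2021/06/19 07:15:24"),
    (2, "XIDAN", "2021/06/19 07:07:00", "DONGZHIMEN", "2021/06/19 07:28:20"),
    (3, "QIANMEN", "2021/06/19 07:20:20", "Beijing railway station", "2021/06/19 07:31:17"),
    (4, "XIDAN", "2021/06/19 07:35:20", "Beijing railway station", "2021/06/19 07:45:24"),
]

def travel_time_seconds(entry_time, exit_time):
    """Elapsed seconds between the entry and exit validation."""
    start = datetime.strptime(entry_time, TIME_FORMAT)
    end = datetime.strptime(exit_time, TIME_FORMAT)
    return (end - start).total_seconds()

od_counts = Counter()          # journeys per (origin, destination) pair
od_total_seconds = Counter()   # summed travel time per pair
for _record_id, origin, t_in, destination, t_out in records:
    od_counts[(origin, destination)] += 1
    od_total_seconds[(origin, destination)] += travel_time_seconds(t_in, t_out)

# Mean travel time in minutes per origin-destination pair.
od_mean_minutes = {
    pair: od_total_seconds[pair] / od_counts[pair] / 60 for pair in od_counts
}
```

Such aggregates are only available where both entry and exit are validated; systems with a single validation per journey would need the destination to be estimated by other means.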

The ticketing systems were highly influenced by the technological advances over the recent years, so it is not uncommon that there were several updates of ticketing systems present in past decades in a single area/mobility system. This makes it challenging to handle different data structures, resulting from different versions of the ticketing systems, and perform longitudinal analysis in the designated area. Furthermore, it is also not uncommon that several types of ticketing systems coexist at the same time in the same area. This can include earlier versions where tokens, paper tickets, or the magnetic ticketing system (where ticketing is done with an automatic belt drive or with a manual sweeping motion of the ticket by the traveler) were used. Newer systems include, among others, contactless smartcards and/or mobile phone ticketing. The contactless smartcards ticketing usually uses Radio Frequency Identification (RFID) or Near Field Communication (NFC) technology to establish a communication between the card and the

validation device. In such systems, each smartcard can be identified by a unique serial number and can be registered to a given individual or be anonymous. Mobile phone ticketing systems, on the other hand, rely on the traveler’s mobile phone, with tickets being issued using SMS, mobile barcodes, or QR (Quick Response) codes. In such cases, the ticket is selected by sending an SMS that either contains a specifying text or is sent to a phone number dedicated to a specific ticket. A digital ticket is then returned to the user. Alternatively, the traveler can purchase tickets by mobile phone and use the RFID or QR code in the same manner as a contactless smartcard. This is also known as e-ticketing. Ticketing data analytics and the extraction of potential insights are also highly influenced by the different ticket validation approaches or disciplines. In this context, the validation of tickets, in different ticketing systems, can be required once (most often when entering the vehicle or the station) or several times (e.g., each time a switch between vehicles is made, or upon entering and leaving the vehicle or the station). Here we can distinguish three main ticket validation disciplines:

- CICO (Check-In, Check-Out) is a ticketing discipline where the traveler needs to present his/her ticket at an in-vehicle validation device, while entering and/or leaving a vehicle, or alternatively at a platform.
- WIWO (Walk-In, Walk-Out) is a ticketing discipline where, at entrance and exit locations such as vehicle/platform doors, dedicated devices perform a registration by detecting the device carried by a traveler, without any action required from the traveler.
- BIBO (Be-In, Be-Out) is a discipline where ticketing systems detect the travelers’ devices while the vehicle is moving between stations, thus making it possible to register all travelers who are actually present on board at that time. This is again done without any action required from the traveler.

Ticketing data have been successfully used, particularly by transport service operators, to provide valuable analysis and information on transport network usage and travel patterns, which can then be used for planning, operation, and marketing purposes. Examples from the literature include the monitoring of public transport vehicles’ headways and punctuality, monitoring of boarding and alighting at stops (Agard et al., 2006; Trepanier & Morency, 2010), estimation of traveler volumes at stops (Blythe, 2004), estimation of ridership per operator and ticket type (Chu & Chapleau, 2008), analysis of travel patterns for different groups of passengers (Bagchi & White, 2005), the introduction of incentives (Halvorsen et al., 2020), estimation of origins and destinations of journeys (Devillaine et al., 2013; Munizaga & Palma, 2012), and estimation of travel times (Zhao et al., 2013), cost (Li et al., 2020), transport mode used (Lee et al., 2019), transfer information (Jang, 2010), etc. However, such data provide only a fairly limited understanding of transport demand, for several reasons, including the lack of contextual information such as trip purpose (Dempsey, 2008), often the impossibility to

identify individual users (when tickets are not registered to the user), so that only aggregated results can be obtained (which, in such cases, also removes the challenge of handling personal data), a frequently occurring imbalance in the sample when a small number of users travel frequently (for instance, to work or education-related activities) and represent a high proportion of all trips made (Utsunomiya et al., 2006), the impossibility to identify linked trips (when validation at the entrance to each vehicle is not required) (Zhao et al., 2019), and the high cost of introducing the system (Deakin & Kim, 2001).

4.5.5.1.3.1 Example: ticketing data

Fig. 4.28 illustrates ticketing data (orange dots) collected during a pilot study in the city of Antwerp, Belgium, where 542 commuters shared their data for a period of 10 days while at the same time registering their trips (white lines) via a dedicated mobile phone application as part of the TransMob project (WIM, 2020). The ticketing data collection included multimodal mobility ticketing transactions, containing, among others, shared bike, parking, and public transport-related transactions. The collected data allowed investigation of the impact of a potential mobility card introduction on multimodality.

FIGURE 4.28

TransMob ticketing data example.


Overall, 327,000 km traveled and 30,000 trips were registered. The results indicated that the use of a mobility card in the area could partially support the creation of a seamless mobility service, as it has the potential to remove some practical problems for travelers, for instance, the fact that one needs to pay with a different system for each and every movement, whether renting an e-bike or car, taking the train, or catching a bus or tram. The use of a mobility card would partially address this obstacle, as it would allow users to use a single payment option to pay for parking, shared vehicles, and public transport. The stakeholders were also able to gain valuable insight from this project regarding user behavior, for instance, where and for how long users park their cars, which bike share stations are hot spots, and which bike routes are most popular. These insights also allowed service providers and stakeholders to define user groups based on the mobility mix, and encouraged authorities to move forward from the pilot study toward a single payment system for every transport mode.

Overall, when considering location-based sensing techniques for mobility studies, the main challenges arise from the available level of detail. The spatial and temporal resolution of the collected data makes it difficult to obtain insights into utilized network connections and traffic flow dependencies. Additionally, location-sensing techniques are often dedicated to a specific group of participants (for instance, license plate recognition relates to vehicles), omitting other mobility system participants (pedestrians, bicyclists, public transport users, etc.). Furthermore, some of the location-based sensing techniques score well on some of these challenges but fail on others. For instance, sensors placed on or under the traffic network surface provide reliable counts of vehicles but cannot identify individual moving entities or vehicle occupancy rates. Computer vision-based techniques perform well in distinguishing among different transport modes but have limited success when it comes to identifying traveled trajectories. Bluetooth sensors easily identify individual devices and, based on this information, track their movement sequences between different locations, but they require a high density of sensing locations to reconstruct actual paths and cannot provide vehicle counts of the same accuracy as road sensors or a reliable estimation of the transport modes used. Ticketing data relate only to specific types of transactions and omit mobility options and the reconstruction of mobility behavior for other users of the mobility system (such as pedestrians, private car users who did not make any mobility-related payments, etc.), causing gaps in observed mobility data or the complete exclusion of some users. Hence, they are more convenient for gaining insights related to specific research questions than for gaining a view of the overall urban mobility and dynamics.
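As an illustration of the checkpoint-based reidentification that underlies the Bluetooth and license plate travel time estimates discussed in this section, the sketch below matches identifiers observed at two sensing locations and derives travel times between them. It is a simplified example, not the method of any cited study: the function name `checkpoint_travel_times`, the checkpoint labels, and the detections are hypothetical, and a real deployment would additionally need to filter outliers and handle repeated passages.

```python
from collections import defaultdict

def checkpoint_travel_times(detections, origin, destination):
    """Travel time in seconds per identifier from origin to destination.

    detections: iterable of (identifier, checkpoint, timestamp_seconds).
    For each identifier, the first detection at the origin is paired with
    the first later detection at the destination. Identifiers not seen at
    both checkpoints in that order are skipped.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for identifier, checkpoint, timestamp in detections:
        seen[identifier][checkpoint].append(timestamp)

    travel_times = {}
    for identifier, by_checkpoint in seen.items():
        if origin not in by_checkpoint or destination not in by_checkpoint:
            continue
        t_origin = min(by_checkpoint[origin])
        later = [t for t in by_checkpoint[destination] if t > t_origin]
        if later:
            travel_times[identifier] = min(later) - t_origin
    return travel_times

# Hypothetical detections (anonymized identifiers, seconds since start).
detections = [
    ("dev_a", "park_and_ride", 0),
    ("dev_b", "park_and_ride", 30),
    ("dev_a", "city_center", 600),
    ("dev_b", "city_center", 1500),
    ("dev_c", "city_center", 700),  # never seen at the origin, so skipped
]
result = checkpoint_travel_times(detections, "park_and_ride", "city_center")
```

In this toy data the second device takes far longer than the first, which is the kind of prolonged travel time that could point to congestion along the route; the result says nothing about the path actually taken between the two checkpoints.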


CHAPTER 5
Data analytics

5.1 Objectives of the chapter
• What is data analytics?
• What is big data analytics?
• What is descriptive analytics?
• What is diagnostic analytics?
• What is predictive analytics?
• What is prescriptive analytics?
• What is machine learning?
• What is supervised learning?
• What is unsupervised learning?
• What is reinforcement learning?
• What is regression?
• What is classification?
• What is clustering?

5.2 Word cloud
Fig. 5.1 presents a word cloud with an overview of the content of the chapter on data analytics.

5.3 Data analytics introduction
In its essence, data analytics can be defined as the science of fusing heterogeneous data from various sources, drawing relations and causalities among them, making predictions to gain insights, and supporting decision-making. More generally, the term analytics is also often used when referring to any data-driven decision-making.

Smart Urban Mobility https://doi.org/10.1016/B978-0-12-820717-8.00008-7

The recent emergence of big data, and with it the possibility of gaining a deeper understanding of processes and extracting useful insights, has given the data analytics domain an even more significant role than before, as well as greater challenges. The challenges emerge from the characteristics of big data described previously (see Chapter 4), while the significant role comes from strategic initiatives, across various organizations and domains, to leverage big data for innovation and to support smarter decision-making. Such data analytics are often referred to as big data analytics and are an integral part of data analytics today. The mobility domain is no exception: big data analytics are finding their way into supporting smarter and more informed mobility decision-making by various stakeholders, with the aim of improving the performance of the overall mobility system and efficiently tackling the challenges related to urban mobility.

With the humble aspiration of facilitating the recognition of potential opportunities for big data analytics, and of supporting efficient collaboration within multidisciplinary teams in the mobility domain, this chapter aims to provide a comprehensive and systematic overview of data analytics fundamentals with a focus on machine learning techniques. More detailed examples of potential applications of these techniques are given in the following chapter with respect to the four-step transport planning model; hence, this exposition is also intended to provide the requisite background for reading the chapters that follow.

© 2023 Elsevier Inc. All rights reserved.

FIGURE 5.1 Data analytics chapter word cloud.

5.4 Data analytics workflow
Although all data analytics aim to support the gaining of insights and data-driven decision-making, we can distinguish among different types of data analytics based on the intended purpose of these insights. With this in mind, data analytics can be grouped into four basic types (Fig. 5.2):
• descriptive analytics,
• diagnostic analytics,
• predictive analytics, and
• prescriptive analytics.

These types can also be seen as evolutionary advances (or phases) in the data analytics workflow that are to some degree interrelated and partially overlapping. From the standpoint of mobility projects and applied data analytics, this also means that, before entering the analytics phase of any project, a good understanding among the stakeholders and a clearly defined research question are needed, as these determine where to start and which type of analytics to focus on most in order to obtain useful insights.

The first phase of this workflow is descriptive analytics. In descriptive analytics, one focuses on the current state of the system by gleaning insights from the distribution of the data and the occurrence of outliers or anomalies. In this context, and depending on the well-defined research question, outliers can be either undesired by-products of the data collection process (for instance, sensing device malfunctions, errors, or noise in the communication channel) or desired insights of added value in themselves (for example, the detection of new infrastructure from positioning data collected from floating cars for the purpose of creating transport network maps). The following phase, called diagnostic analytics, aims to provide an understanding of what is influencing and causing what one has observed. Predictive analytics are future-oriented and aim to reveal the probability of future events using different statistical and mathematical models and methods. However, predictive analytics do not give any recommendations on how to ensure the desired outcome. This is the scope of prescriptive analytics. For example, predictive analytics can reveal a trend of increased demand for sustainable mobility options, and prescriptive analytics can support the shaping of actionable plans by recommending which incentives to use for each citizen profile to ensure a good match between sustainable mobility options and individuals’ needs and attitudes toward mobility and sustainability. This way, the desired outcome (for instance, a modal shift) could be achieved.

FIGURE 5.2 Data analytics types.

5.4.1 Descriptive analytics
Descriptive analytics focuses on historical data and aims to collect, organize, and present it in a manner that is easily understood. It provides both quantitative and qualitative information and insights into the past leading to the present. To achieve this, descriptive analytics utilizes descriptive statistics, interactive exploration of the data, and data mining techniques.

5.4.1.1 Descriptive statistics
Descriptive statistics comprise a set of tools that quantitatively describe the data in summary and graphical forms. Data analytics focuses on the analysis of variables (as opposed to constants), and descriptive statistics try to glean a basic understanding of the data (including its variability) by looking at the data’s distribution, dispersion, and central tendencies. Commonly used measures of central tendency are the mean, median, and mode, each of which measures a different type of central value in the data.

5.4.1.1.1 Measures of dispersion and central tendencies

5.4.1.1.1.1 Arithmetic mean
Given a set of values $X = \{x_1, \ldots, x_N\}$, the arithmetic mean ($\bar{x}$) is defined as the sum of the numerical values of all observations $x_1, x_2, \ldots, x_N$, divided by the total number of observations ($N$):

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i \tag{5.1}$$
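As a minimal sketch of Eq. (5.1), the arithmetic mean can be computed directly from its definition and cross-checked against Python's standard library. The trip distances are made-up illustrative values, not data from the chapter, and `arithmetic_mean` is a hypothetical helper name. The harmonic and geometric means named in the surrounding text are included only for comparison.

```python
import math
import statistics

def arithmetic_mean(values):
    """Arithmetic mean per Eq. (5.1): sum of observations divided by N."""
    n = len(values)
    if n == 0:
        raise ValueError("mean of an empty dataset is undefined")
    return sum(values) / n

# Hypothetical trip distances in km (illustrative values only).
trip_distances_km = [2.0, 3.5, 1.0, 12.0, 4.5]

mean_km = arithmetic_mean(trip_distances_km)
# Cross-check against the standard library implementation.
assert math.isclose(mean_km, statistics.fmean(trip_distances_km))

# Other means available in the standard library, for comparison only.
harmonic_km = statistics.harmonic_mean(trip_distances_km)
geometric_km = statistics.geometric_mean(trip_distances_km)

print(f"arithmetic mean: {mean_km:.2f} km")
print(f"geometric mean:  {geometric_km:.2f} km")
print(f"harmonic mean:   {harmonic_km:.2f} km")
```

For positive values the harmonic mean never exceeds the geometric mean, which never exceeds the arithmetic mean, so the three figures also give a quick impression of how strongly a single long trip pulls the arithmetic mean upward.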

Although the arithmetic mean is probably the most commonly used mean for descriptive statistics in the mobility domain, other well-known means, such as the harmonic, geometric, or root-mean-square, are sometimes also used.

5.4.1.1.1.2 Median and mode
Given a sample of $N$ variates $X_1, \ldots, X_N$, if one reorders them so that $Y_1 < Y_2 < \ldots < Y_N$, then $Y_i$ is called the $i$-th order statistic. Hence, given the order statistics $Y_1 = \min_j X_j, Y_2, \ldots, Y_{N-1}, Y_N = \max_j X_j$, the statistical median of the dataset can be defined by:

$$\tilde{x} = \begin{cases} Y_{(N+1)/2} & \text{if } N \text{ is odd} \\ \frac{1}{2}\left(Y_{N/2} + Y_{N/2+1}\right) & \text{if } N \text{ is even} \end{cases} \tag{5.2}$$

while the mode of a variable is the most commonly occurring value. Unlike the mean and median, the mode can return more than one result; for instance, two or more values can share the highest occurrence frequency in the dataset.

In contrast to the measures of central tendency, measures of dispersion, or variability, help us glean insight into the dispersion of numerical values around some central tendency. They include the minimum and maximum values, range, quartiles, standard deviation and variance, and distribution skewness and kurtosis.

5.4.1.1.1.3 Minimum and maximum
The minimum of a set of values $X = \{x_i\}_{i=1}^{N}$ is the smallest value of the variable ($\min_i x_i$) and is equal to the first element of an ordered version of $X$. Correspondingly, the maximum ($\max_i x_i$) is the largest value of the set and is equal to the last element of an ordered version of $X$.

5.4.1.1.1.4 Range
Following the definitions of the minimum and maximum values given above, the statistical range can be defined as:

$$R = \max_i x_i - \min_i x_i \tag{5.3}$$
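The order-statistic definitions above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than a reference implementation: the helper names (`median_from_order_statistics`, `modes`, `value_range`) and the sample of daily trip counts are hypothetical.

```python
from collections import Counter
import statistics

def median_from_order_statistics(values):
    """Median per Eq. (5.2), computed from the sorted sample Y_1..Y_N."""
    y = sorted(values)                  # order statistics (0-based indices below)
    n = len(y)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:                      # N odd: Y_{(N+1)/2}
        return y[mid]
    return (y[mid - 1] + y[mid]) / 2    # N even: mean of Y_{N/2} and Y_{N/2+1}

def modes(values):
    """All most frequently occurring values (the mode need not be unique)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

def value_range(values):
    """Range per Eq. (5.3): maximum minus minimum."""
    return max(values) - min(values)

# Hypothetical number of trips per user per day (illustrative values only).
trips_per_day = [2, 4, 4, 3, 2, 6, 1, 5]

median_trips = median_from_order_statistics(trips_per_day)
# Cross-check against the standard library implementation.
assert median_trips == statistics.median(trips_per_day)

print("median:", median_trips)
print("modes:", modes(trips_per_day))
print("min, max, range:", min(trips_per_day), max(trips_per_day),
      value_range(trips_per_day))
```

The sample is deliberately bimodal, so `modes` returns two values, which illustrates why the mode, unlike the mean and median, is returned as a list here.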

5.4.1.1.1.5 Quartile
In order statistics, the quartile divides the number of data points into four parts, or quarters, of more-or-less equal size with ranges as defined:

$$R_{Q1} = p_{0.25} - \min_i x_i \tag{5.4}$$

$$R_{Q2} = \tilde{x} - p_{0.25} \tag{5.5}$$

$$R_{Q3} = p_{0.75} - \tilde{x} \tag{5.6}$$

$$R_{Q4} = \max_i x_i - p_{0.75} \tag{5.7}$$

Hence, the 25th percentile ($p_{0.25}$) is also known as the first quartile ($Q_1$), the 50th percentile ($p_{0.5}$) as the median ($\tilde{x}$) or second quartile ($Q_2$), and the 75th percentile ($p_{0.75}$) as the third quartile ($Q_3$).

5.4.1.1.1.6 Variance
Variance ($\sigma^2$) is an arithmetic mean of the squared deviation of a numeric variable $X$ from its arithmetic mean:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N} \tag{5.8}$$

The variance is, by definition, expressed in squared units, which makes its interpretation somewhat challenging, as it is not in the same unit of measurement as the original data. For this reason, the standard deviation is used more frequently as a measure of variability.

5.4.1.1.1.7 Standard deviation
Standard deviation ($\sigma$) is closely related to variance and also measures the dispersion of a dataset relative to its mean. However, it is expressed in the same unit of measurement as the original data. For this reason, its value is much easier to interpret and, consequently, it is used much more often in applied analysis. It is defined as the positive square root of the variance:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}} \tag{5.9}$$
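The quartile ranges, variance, and standard deviation defined above can be sketched as follows. The travel times are made-up illustrative values, and the helper names are hypothetical. Note two assumptions: several interpolation conventions exist for percentiles, and the `method="inclusive"` option of `statistics.quantiles` is just one of them; and the variance follows the divisor-$N$ form of Eq. (5.8), which corresponds to `statistics.pvariance` rather than the sample variance.

```python
import math
import statistics

def population_variance(values):
    """Variance per Eq. (5.8): mean squared deviation from the mean (divisor N)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def population_std(values):
    """Standard deviation per Eq. (5.9): positive square root of the variance."""
    return math.sqrt(population_variance(values))

# Hypothetical travel times in minutes (illustrative values only).
travel_times_min = [12.0, 15.0, 11.0, 30.0, 14.0, 13.0, 16.0, 17.0]

# Quartiles p_0.25, median, p_0.75 (interpolation convention is an assumption).
p25, med, p75 = statistics.quantiles(travel_times_min, n=4, method="inclusive")

# Quartile ranges as in Eqs. (5.4)-(5.7); together they span the overall range.
r_q1 = p25 - min(travel_times_min)
r_q2 = med - p25
r_q3 = p75 - med
r_q4 = max(travel_times_min) - p75

var = population_variance(travel_times_min)
std = population_std(travel_times_min)

print(f"quartile ranges: {r_q1:.2f}, {r_q2:.2f}, {r_q3:.2f}, {r_q4:.2f}")
print(f"variance: {var:.2f} min^2, standard deviation: {std:.2f} min")
```

The printed units make the point of the text concrete: the variance comes out in squared minutes, while the standard deviation is in minutes and can be compared directly with the travel times themselves.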

5.4.1.1.1.8 Skewness and kurtosis
To define skewness and kurtosis, one first needs to consider the distribution of a variable. The distribution shows all the possible values of the variable and the occurrence frequency of each value. It is most frequently depicted using a table or a function. Another way to visualize occurrence frequencies is the histogram. Histograms are rather simple to construct; however, they are not the best means of determining the shape of a distribution, as the shape of a histogram is strongly affected by the number of bins one chooses. For instance, the bin width affects the ability of a histogram to identify local regions of higher incidence. With bins that are too wide, one does not get enough differentiation; with bins that are too narrow, the data cannot be meaningfully grouped. Hence, for determining the shape of a distribution, a kernel density plot is a more appropriate method.

Considering the distribution of a variable, skewness is a measure of the degree of asymmetry of the distribution. If the tail at the small end of the distribution (left tail) is more pronounced than the tail at the large end (right tail), the distribution is said to have negative skewness. If the reverse is true, it is said to have positive skewness. If the two tails are equal, it has zero skewness. Like skewness, kurtosis describes the shape of a distribution, but it refers to the degree of presence of outliers in the distribution. It is defined as a normalized form of the fourth central moment of the distribution (Fig. 5.3).
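The two ideas in this subsection, the sign of skewness and the sensitivity of a histogram to the bin width, can be illustrated with a short sketch. The text gives no formula for skewness, so the moment coefficient $m_3 / m_2^{3/2}$ used here is an assumption (one common convention); kurtosis is computed as the fourth central moment normalized by the squared variance, $m_4 / m_2^2$. All function names and the trip distances are hypothetical.

```python
def central_moment(values, k):
    """k-th central moment: mean of (x - mean)^k."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n

def skewness(values):
    """Moment coefficient of skewness, m3 / m2^(3/2) (assumed convention)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Fourth central moment normalized by the squared variance, m4 / m2^2."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

def histogram(values, bin_width):
    """Counts per bin of the given width, keyed by the lower bin edge."""
    counts = {}
    for x in values:
        edge = (x // bin_width) * bin_width
        counts[edge] = counts.get(edge, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical trip distances in km with a long right tail (illustrative only).
trip_distances_km = [1, 1, 2, 2, 2, 3, 3, 4, 5, 7, 12, 25]
mirrored = [-x for x in trip_distances_km]    # same shape, tail on the left

print(f"skewness (right tail): {skewness(trip_distances_km):.2f}")
print(f"skewness (left tail):  {skewness(mirrored):.2f}")
print(f"kurtosis:              {kurtosis(trip_distances_km):.2f}")

# The same data grouped with three different bin widths.
for width in (1, 5, 30):
    print(f"bin width {width:>2}: {histogram(trip_distances_km, width)}")
```

With a width of 30 km every trip falls into a single bin and all differentiation is lost, while a width of 1 km leaves most bins with one or two trips, which mirrors the trade-off described in the text.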

An interesting example here is Anscombe’s quartet (Anscombe, 1973), which can be used to illustrate the dangers of relying solely on measures of central tendency and dispersion (Table 5.1). The quartet comprises four datasets that appear rather similar based on these measures; however, they have very different distributions and look very different when graphed. For all four datasets, the means of x and y are 9 and 7.5, respectively. The variance of x is 11.0 and that of y is 4.12, the correlation between the variables is 0.816, and the linear regression line (to predict y for a given x value) is given by the equation y = 0.5x + 3. However, the differences between the datasets are clearly revealed in the scatter plots (Fig. 5.4). Dataset I consists of data points that conform to an approximately linear relationship, but with significant variance around it. For the variables in dataset II, the relationship between them is obvious and, in contrast to dataset I, it is not linear, as the data points seem to conform to a quadratic relationship. Furthermore, datasets I and III exhibit some resemblance, but the values in dataset III seem to conform to a linear relationship more tightly. Finally, the values in dataset IV illustrate how one high-leverage point is enough to produce a high correlation coefficient, even though the other data points do not indicate any relationship between the variables. Hence, in order to understand the true distributions of data, one would need to consider several methods.

TABLE 5.1 Anscombe’s quartet.

       I               II              III              IV
       x       y       x       y       x       y       x       y
   10.00    8.04   10.00    9.14   10.00    7.46    8.00    6.58
    8.00    6.95    8.00    8.14    8.00    6.77    8.00    5.76
   13.00    7.58   13.00    8.74   13.00   12.74    8.00    7.71
    9.00    8.81    9.00    8.77    9.00    7.11    8.00    8.84
   11.00    8.33   11.00    9.26   11.00    7.81    8.00    8.47
   14.00    9.96   14.00    8.10   14.00    8.84    8.00    7.04
    6.00    7.24    6.00    6.13    6.00    6.08    8.00    5.25
    4.00    4.26    4.00    3.10    4.00    5.39   19.00   12.50
   12.00   10.84   12.00    9.13   12.00    8.15    8.00    5.56
    7.00    4.82    7.00    7.26    7.00    6.42    8.00    7.91
    5.00    5.68    5.00    4.74    5.00    5.73    8.00    6.89

FIGURE 5.3 Skewness and kurtosis.

FIGURE 5.4 Scatter plots and linear regression models for Anscombe’s quartet.

5.4.1.2 Exploratory data analysis

Exploratory data analysis (EDA) comprises the interactive presentation, exploration, and


discovery of data, their trends, behaviors, and relationships, with visualization as one of the key tools in this process. Additionally, EDA is often used to support variable (or feature) selection for data analytics models and is an integral part of systems where a human in the loop (and machine-to-human interaction) is foreseen (for instance, in digital twins).

The initial step of exploratory data analysis is the presentation, which aims to provide a swift and cursory familiarity with the dataset. It involves computing and interactively visualizing descriptive statistics appropriate to the data type of each variable, utilizing a wide spectrum of visualization techniques (e.g., histograms, scatter plots, bubble charts, matrix plots, and box-and-whisker plots). The visual exploration aims to build an intuitive understanding of the overall structure of the data and to facilitate analytical reasoning by investigating the data from numerous viewpoints, observing intriguing patterns, and characterizing them qualitatively. Based on the visual exploration, a guided inquiry can be framed and a discussion among the stakeholders can be supported, so that clear and interesting research questions can be formulated. Finally, the discovery phase involves formulating hypotheses and performing ad hoc analyses to validate them against the data-based evidence.

Fig. 5.5 illustrates part of an overview of descriptive statistics for a mobility data set, showing overall counts and averages for some variables (e.g., system users, number of trips, and distances per user per day), the distribution of the observed trip distances, and the standard deviations of travel times per hour of the day. Simple visual exploration of such overviews can reveal major trends in the observed values and issues such as missing data and outliers, and it can spark conversation among the stakeholders on relevant topics, leading to the formulation of the key research questions to be tackled in the next steps of the analytics.
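Why exploration should go beyond summary overviews can be checked directly on Anscombe’s quartet from Table 5.1. The following is a minimal sketch, assuming sample (n − 1) variance and ordinary least squares; the function name `summarize` is introduced here for illustration only.

```python
from statistics import mean, variance

# Values transcribed from Table 5.1; datasets I to III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Means, sample variances, Pearson r, and least-squares line of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx, "mean_y": my,
        "var_x": variance(x), "var_y": variance(y),
        "r": sxy / (sxx * syy) ** 0.5,
        "slope": slope, "intercept": my - slope * mx,
    }


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, s in summaries.items():
    print(name, {key: round(value, 2) for key, value in s.items()})
```

All four datasets produce nearly identical numbers, so only a plot such as Fig. 5.4 reveals how different they are.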


FIGURE 5.5 Descriptive statistics example.

5.4.2 Diagnostic analytics

Diagnostic analytics focuses on unveiling interrelations and causalities based on historic data, so that the reasons why specific patterns or events occurred can be clarified in support of future decision-making. It involves techniques such as root cause analysis, online analytical processing (OLAP) drill-down, data discovery, data mining, and correlation analysis.

Root cause analysis tries to identify the root causes of events by following a sequence of steps. The process starts with a clear identification and description of the event of interest, after which a timeline is established from the last known standard status of the system up to the time the event occurred. Then the root cause is distinguished from other causal factors (for instance, by using event correlation), and a causal graph is established between the root cause and the event of interest. Another applicable approach is the simple drill-down technique, in which complex data is broken down into simpler units, for example from years to months to weeks. Data mining, in turn, can be used to help discover patterns, relationships (correlations) between variables, or unusual data points (anomalies) within a given data set that might assist in explaining why specific patterns or events occurred.

In mobility, the challenge also often lies with latent variables, that is, variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured). Models that aim to explain observed variables in terms of latent variables are called latent variable models. Such models are used in many disciplines, including demography, economics, engineering, mobility, physics, and the social sciences, among others. An example of a latent variable from the field of smart mobility is quality of life, which cannot be measured directly, so observable variables are used to infer it. Examples of observable variables that can be used to measure quality of life include employment, environment, physical and mental health, education, wealth, recreation and leisure time, and social belonging.

The result of diagnostic analytics is usually provided in the context of probability, likelihood, or a distributed outcome. The overall process is often both an exploratory and a labor-intensive endeavor.

5.4.2.1 Example: diagnostic analytics

Fig. 5.6 depicts an example of the evolution of the number of new daily data contributors for a mobile-sensed mobility data campaign.

FIGURE 5.6 Evolution of the number of new campaign participants.

Insights based on the evolution of this number allow the campaign initiators (such as authorities) and the involved researchers to monitor the overall participation in the campaign, manage the sample size and representativeness, and adjust the data processing chain when needed. The participants joined the campaign by downloading and installing an Android-based smartphone application that was developed for the purpose of the campaign and made available in the Google Play store free of charge (Ghent University, 2016). Upon installing the application, participants were informed about the purpose of the campaign and the use of the data. Following this, they were able to voluntarily contribute their demographic data, which in turn allowed the data representativeness to be managed, and to select which type of data contribution they preferred and how they wished their data to be managed during the campaign. For instance, one could contribute by manually reporting mobility behavior and/or by taking part in the positioning data collection based on the smartphone’s sensors.

The application was first tested by several test users from the development team and the local authority that initiated the campaign. The purpose of this step was to ensure that the resulting adjustments would meet the campaign needs and that any potential technical issues related to the use of different devices or other conditions would be resolved before the campaign rolled out. Handling such issues during the campaign (for instance, by updating the data collection application to better meet the campaign needs) can result in incompatible data (data collected before and after the update might differ in nature and structure), and such activities might discourage users from participating due to the additional effort needed to update the application or re-enter their details.

Once the campaign officially started, significant fluctuations were noticed in the number of new contributors joining the campaign each day. The aim of the diagnostic analytics was to identify potential reasons for this, so that the participants’ involvement could be managed and improved. This process involved the identification of the major changes in the trend of the number of new contributors, the establishment of the timeline, the assessment of variables that might help explain the reasons for specific patterns or events, and the analysis of relationships between those variables and the changes in the trend of interest.
The influencing variables considered included, among others, environmental conditions (such as weather), social and demographic descriptors (employment rates, age of participants, etc.), the occurrence of other major events during the campaign (for instance, festivals, sports events, holidays, and natural disasters), and interaction with the target audiences and the campaign participants (such as public talks). The relationship analysis indicated several potential contributors to the fluctuation in the number of new contributors, of which the most significant involved communication and dissemination activities, which seemed to be the major motivator for contributors to participate in this research. Fig. 5.6 illustrates some of the major communication and dissemination activities on the timeline, involving:

- initial mailing (by postal service) inviting the participants who had previously been involved in the official statistics data collection process (the representative sample);
- distribution of reminders (by postal service) to the previously contacted potential participants;
- distribution of leaflets, invitations via a regional online news portal, regional press article releases, and participation at major conferences and events.

The results indicated that the initial, representative sample considered for the creation of official statistics responded well to this common and well-established way of contacting. However, the traditional mail invitations and reminders had an effect only in the starting phase of the campaign; their effect faded as the campaign reached a more mature stage and new participants were nudged to join. Later in the campaign, external communication targeted at the general audience was more prominent in onboarding new participants. Once the number of active participants was considered sufficient, in size and representativeness, for the purpose of the campaign, the active onboarding activities concluded with the final campaign closing event, allowing the team to focus its efforts on the subsequent data processing and interpretation steps.
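One simple way to relate a timeline of activities to a daily series, as done in this example, is to compare the days shortly after each activity with all remaining days. The sketch below is only an assumption about how such a comparison could be set up: the daily counts, the event days, and the three-day window are all hypothetical and are not taken from the campaign data.

```python
from statistics import mean

# Hypothetical daily numbers of new contributors (day 0 = campaign start).
new_contributors = [2, 3, 25, 18, 9, 4, 3, 2, 14, 10,
                    5, 3, 2, 2, 30, 22, 11, 5, 3, 2]

# Hypothetical communication activities, keyed by the day they took place.
events = {2: "initial mailing", 8: "postal reminder", 14: "press release"}


def event_window_days(event_days, window, n_days):
    """Days within `window` days of any event, the event day included."""
    covered = set()
    for day in event_days:
        covered.update(d for d in range(day, day + window) if d < n_days)
    return covered


def compare_to_baseline(counts, event_days, window=3):
    """Mean daily count inside the event windows versus all other days."""
    in_window = event_window_days(event_days, window, len(counts))
    after = [c for d, c in enumerate(counts) if d in in_window]
    baseline = [c for d, c in enumerate(counts) if d not in in_window]
    return mean(after), mean(baseline)


after_mean, baseline_mean = compare_to_baseline(new_contributors, events)
print(f"after activities: {after_mean:.1f}, other days: {baseline_mean:.1f}")
```

A clear gap between the two means points to an association between the activities and onboarding. It does not by itself establish causation, so other variables such as weather or holidays would still need to be assessed.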


5.4.3 Predictive analytics

Predictive analytics focuses on predicting (or forecasting, in the case of time-series data) what is likely to happen in the future, or the outcome of an otherwise unknown event, based on the knowledge that can be gained from historical data and from data regarding the current state of the system (on-line or due-time data). Predictive analytics is essential for the efficient management of any system, including the urban mobility system, as it allows us to react promptly to potential future events and to implement corrective measures in a timely manner. Predictive models are developed based on training data, which contains known examples used to fit the parameters of a model, and they are probabilistic in nature.

Predictions are made based on the independent (or predictor) variables, which are seen as not depending on any other variable in the scope of the experiment in question (i.e., they are the variables that are manipulated in an experiment to observe the effect on the variable that is the focus of the analysis). The variable that is the focus of the analysis is called the dependent variable, also known as the criterion variable, and it is the value one is trying to predict with some degree of certainty. In general, one aims for the value of the dependent variable to depend as much as possible on the observations of the independent variables, while the independent variables used in the model are as independent of each other as possible. In this way, one tries to capture and describe as much of the variability of the dependent variable as possible through the independent variables. At the same time, through careful selection of the independent variables, one tries to ensure that each of them explains a component of the dependent variable’s variability that is not captured by the others.
Otherwise, if a model contained, for example, two independent variables that explain exactly the same aspect of the dependent variable’s variability, one of them would add no value to the model and would therefore be redundant. This is important, as having too many independent variables can lead to the effect known as “the curse of dimensionality”. This means that when dimensionality increases (each variable represents one dimension, as, for instance, x, y, and z in three-dimensional space), the volume of the space increases so fast that the available data becomes sparse. This sparsity is an issue for any method that requires statistical significance. To obtain a statistically sound and reliable result, the amount of data needed to support it usually increases exponentially with the dimensionality. Furthermore, organizing and searching data often relies on detecting areas where observations form groups with similar properties. However, in high-dimensional space, all observations (or data points) appear sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient. Hence, the selection of variables to be included in the model is critical for ensuring reliable and meaningful results as well as efficient analysis.

The process of selecting the variables that have the highest predictive value and should be included in the analysis is called feature selection. Several techniques are used for feature selection, including correlation, visualization based on scatter plots, and linear regression. The correlation coefficient (r) quantifies the degree to which two variables are statistically related. Several types of correlation coefficients exist, each with its own definition, range of applicability, and characteristics. However, they all take real values in the range from −1 to +1, where −1 and +1 indicate the strongest possible association and 0 indicates the weakest possible association between the variables.
A positive value of r indicates a proportional association between the variables (i.e., if the value of one variable increases, the value of the second variable increases too, and if the value of one variable decreases, the value of the other decreases too). A negative value of r indicates an inverse association between the variables, meaning that if the value of one variable increases, the value of the other decreases, and vice versa. However, one should be careful when interpreting correlation, as it indicates whether two variables are related but does not imply causation between them. In other words, a cause-and-effect relationship between two variables cannot legitimately be deduced solely from an observed association or correlation between them; for this, additional analyses are needed.

A scatter plot is an additional but integral part of the standard toolset for both descriptive and predictive analysis. It allows us to visualize the relationship between variables swiftly and to identify certain patterns even before any more complex analysis is conducted to confirm them. For example, in Fig. 5.4 we were able to observe that the data points in datasets I and III conform to an approximately linear relationship, while in dataset II they likely conform to a quadratic relationship. All of this is observable before conducting a more complex analysis, in which one would need to hypothesize a number of different types of functional association between the variables and test these hypotheses against the datasets. Furthermore, we can observe that using a linear relationship to predict the value of y for a given value of x would likely be more reliable for dataset III than for dataset I (as dataset III seems to conform to a linear relationship more tightly), whereas linear regression would likely give poor predictions for dataset IV (as the value of x remains constant, with one exception, across all values of y in the dataset).
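The sign and magnitude of r, and its use for spotting redundant predictors during feature selection, can be sketched as follows. This is a minimal illustration assuming Pearson's correlation coefficient; the feature names, the observations, and the 0.95 threshold are hypothetical choices made for the example.

```python
from itertools import combinations
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def redundant_pairs(features, threshold=0.95):
    """Pairs of feature names whose absolute correlation reaches the threshold."""
    return [(a, b) for a, b in combinations(features, 2)
            if abs(pearson_r(features[a], features[b])) >= threshold]


# Hypothetical daily observations for ten days.
features = {
    "rain_mm": [0, 2, 5, 0, 12, 8, 1, 0, 20, 3],
    "rain_hours": [0.0, 1.0, 2.5, 0.0, 6.0, 4.0, 0.5, 0.0, 10.0, 1.5],
    "day_of_week": [0, 1, 2, 3, 4, 5, 6, 0, 1, 2],
}
bike_trips = [520, 480, 410, 530, 300, 350, 500, 540, 210, 460]

r_rain = pearson_r(features["rain_mm"], bike_trips)
print(f"r(rain_mm, bike_trips) = {r_rain:.2f}")   # inverse association
print("redundant predictors:", redundant_pairs(features))
```

In this made-up data, rain and bike trips are inversely associated, while the two rain features carry the same information, so one of them could be dropped from a model without losing explanatory power.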
The linear regression line can thus be used to predict a value of y for a given value of x. If the slope of this line is negative, the correlation between the two variables is also negative. Conversely, if the slope of the line is positive, the correlation coefficient is also positive and the relationship between the variables is proportional (i.e., if one increases, the other increases as well). Linear regression is just one of the methods (and a rather simple one) that can be used for predictive analytics. Some of the best-known predictive models include neural networks, decision trees, naive Bayes, support vector machines, and k-nearest neighbors, among others.
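A least-squares line and its use for prediction can be sketched in a few lines. The example assumes simple linear regression with one predictor; the helper names `fit_line` and `predict` and all data values are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def predict(model, x_new):
    """Value of the fitted line at x_new."""
    slope, intercept = model
    return slope * x_new + intercept


# Hypothetical trip distances (km) and travel times (min): positive slope.
distance_km = [1.0, 2.0, 3.0, 4.0, 5.0]
travel_min = [2.1, 3.9, 6.2, 7.8, 10.1]
time_model = fit_line(distance_km, travel_min)

# Hypothetical traffic density (veh/km) and mean speed (km/h): negative slope.
density = [10.0, 20.0, 30.0, 40.0, 50.0]
speed_kmh = [48.0, 41.0, 35.0, 27.0, 20.0]
speed_model = fit_line(density, speed_kmh)

print("travel time for a 6 km trip:", round(predict(time_model, 6.0), 2))
print("speed slope:", round(speed_model[0], 2))
```

The first fitted line has a positive slope, matching a positive correlation, while the second has a negative slope, matching an inverse association between density and speed.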

5.4.4 Prescriptive analytics

Prescriptive analytics is a continuation of the previous analytics phases. It examines data to answer questions like “What should be done?” or “What can be done to make a certain event occur?”. As such, it extends diagnostic analytics by assisting in appropriately addressing the issues revealed there, and it extends predictive models by helping to increase the chances that forecast events will indeed happen. In other words, prescriptive analytics involves modeling and evaluating various potential what-if scenarios through simulation techniques to answer what should be done to maximize the occurrence of good outcomes while preventing the occurrence of bad ones. In this context, stochastic optimization techniques are frequently used to determine how to achieve better outcomes.

For prescriptive analytics to provide meaningful and useful outputs, a good understanding of the logic and rules of the considered process is important. For this reason, the involvement of domain experts and domain knowledge is crucial, as prescriptive analytics involves balancing best practices, lessons learned, constraints, preferences, system boundaries, and stakeholders’ responsibilities. Many data techniques have been developed over the years to assist in translating domain knowledge into specific rules, in collecting and representing knowledge, and in mimicking decision-making mechanisms by machines and algorithms, in order to facilitate prescriptive analytics functionalities. An example of this is the development of domain ontologies (such as the INSPIRE ontology (Beare et al., 2021; European Commission Joint Research Centre, 2021; Howard et al., 2021)) to describe concepts and categories, showing their properties and the relations between them.

Another type of technique that has gained much interest in recent years, and that is often employed for prescriptive analytics, is the mimicking of human cognitive functions (cognitive analytics). Cognitive analytics tries to extract features from structured, semistructured, and unstructured data and employs taxonomies and ontologies to enable reasoning and inference. It relies on an assortment of machine learning algorithms and inference engines to realize human-like cognitive capabilities, and on probabilistic algorithms to come up with multiple answers with varying degrees of relevance, each answer corresponding to a hypothesis. Evidence is then gathered in support of each hypothesis and used to score its relevance.

5.4.4.1 Example: prescriptive analytics

An example of prescriptive analytics can be seen in the modeling and evaluation of various potential what-if scenarios through simulation techniques to answer which future transport flow management strategies and infrastructure investments have the potential to maximize the occurrence of good outcomes while preventing the occurrence of undesired ones. For motorized transport, good outcomes of interventions can include a reduction of travel times or increased economic opportunity, while the occurrence of congestion could be seen as an undesired outcome. The lists of desired and undesired outcomes and priorities depend on the local context and might differ across stakeholders. However, considering different what-if scenarios based on the results of predictive analytics and on best practices, lessons learned by peers, various constraints, local preferences, mobility system boundaries, and the responsibilities of the involved stakeholders is a standard process when it comes to planning larger infrastructure investments in the transport network.

Fig. 5.7 gives an example of what-if scenarios for a small urban environment situated in Bjelovar-Bilogora County, Croatia. The county includes five cities and 18 districts. The county capital is Bjelovar, which has about 42,000 inhabitants, whereas the number of inhabitants in the other county towns and municipalities ranges between 5000 and 15,000. The town shown on the map (Cazma) is situated 30 km from Bjelovar and 60 km from the country’s capital. In the context of traffic planning, these two urban centers have a significant influence on the traffic situation of the town and its surroundings, resulting in heavy transit traffic through the town, including a large share of heavy cargo vehicles and buses (Cavar et al., 2014). To develop the what-if scenarios, several data inputs were considered, including infrastructural characteristics of the existing traffic network (route category, direction, number of lanes, permitted turns), observed traffic count data, demographic data such as the average size of households in the considered area, the structure of economic activities, land use data, the motorization rate (number of passenger cars per 1000 inhabitants), the share of various vehicle categories in overall traffic flows, traffic demand, etc. Furthermore, predictive analytics was used to forecast future traffic demand and its characteristics based on the causality among variables in the given local context.
This allowed the creation of realistic future scenarios, which were then used to evaluate different interventions, such as relocating the cargo-related terminal and the associated traffic further away from the town’s center, increasing the number of public transport lines and/or service frequencies in the area, and introducing new cycling lanes and pedestrian paths.

FIGURE 5.7 What-if scenario analysis.
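The logic of comparing what-if scenarios can be sketched with a toy calculation for a single road link. This is not the model used in the study above: it assumes the widely used Bureau of Public Roads (BPR) volume-delay function with its customary parameters (alpha = 0.15, beta = 4), and the link characteristics and the traffic reduction attributed to each scenario are hypothetical numbers chosen for illustration.

```python
def bpr_travel_time(free_flow_min, volume, capacity, alpha=0.15, beta=4.0):
    """Link travel time (min) from the BPR volume-delay function."""
    return free_flow_min * (1.0 + alpha * (volume / capacity) ** beta)


# Hypothetical main road through the town centre during the peak hour.
FREE_FLOW_MIN = 6.0
PEAK_VOLUME = 1400.0   # vehicles per hour
CAPACITY = 1200.0      # vehicles per hour

# Hypothetical share of the peak volume that remains under each scenario.
SCENARIOS = {
    "do nothing": 1.00,
    "relocate cargo terminal": 0.82,
    "more frequent public transport": 0.92,
    "new cycling and pedestrian paths": 0.97,
}


def evaluate(scenarios):
    """Travel time on the link for every scenario."""
    return {name: bpr_travel_time(FREE_FLOW_MIN, PEAK_VOLUME * share, CAPACITY)
            for name, share in scenarios.items()}


times = evaluate(SCENARIOS)
best = min(times, key=times.get)
for name, minutes in sorted(times.items(), key=lambda item: item[1]):
    print(f"{name:34s} {minutes:5.2f} min")
print("lowest travel time:", best)
```

In practice, each scenario would be evaluated on the whole network and against several criteria (travel times, congestion, costs, stakeholder preferences) rather than on one link and one indicator.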

5.5 Machine learning

More advanced data analytics techniques play an important role in the big data analytics domain today. In this context, machine learning comprises a set of computational methods that automate the acquisition of knowledge from experience. It is closely related to the domains of statistics, artificial intelligence, and data mining, with which it shares a number of principles and techniques but does not fully overlap in scope. For instance, machine learning uses several statistical techniques (Mendenhall & Sincich, 2015) (as other disciplines do too, such as quantum physics (Holevo, 2006)) but merges this subset of statistical techniques with additional principles and domains, such as computer science or biology (for instance, in neural network algorithms (Klambauer et al., 2017)), to achieve computer algorithms that improve automatically through experience and by the use of data. Compared to data mining (Chee et al., 2019), which refers to extracting and discovering patterns in databases and relies on human intervention (it cannot, for instance, learn or adapt by itself), machine learning is developed with the intention of enabling machines to teach themselves and not depend on human actions in the learning process. So, although machine learning can be used as a means of conducting efficient data mining, and data gathered through data mining can be used as an input for machine learning, the two differ in their purpose, overall set of techniques, and focus (human- or machine-oriented applications). Compared to artificial intelligence (Borges et al., 2021) (the science of making machines perform activities by mimicking cognitive functions associated with the natural intelligence displayed by humans or animals (Minsky, 2007)), machine learning has a much more focused scope (which can also be applied in domains other than mimicking aspects of intelligence by machines), while artificial intelligence also concerns activities such as reasoning, perception, motion, and social intelligence, which are outside the scope of machine learning. Hence, the relation between machine learning and the other related domains can be seen as illustrated in Fig. 5.8.

FIGURE 5.8 Relation between machine learning, statistics, artificial intelligence, and data mining.

In the following sections, we focus in more detail on machine learning techniques, as they have demonstrated significant applicative potential in the mobility domain, which will also be demonstrated through a number of practical examples in this and the following chapters. In general, machine learning approaches are divided into three broad categories: (i) supervised learning, where previously labeled data is used to learn a general rule that maps inputs to outputs; (ii) unsupervised learning, where only unlabelled data is used and the learner needs to find structure in its input; and (iii) reinforcement learning, where the learning process is guided by a series of feedback/reward cycles.

5.5.1 Supervised learning

Supervised learning methods infer a function that maps an input to an output based on given data in which both the input and the desired output variables are present. This data is known as training data and comprises a set of training examples, where each example has one or more inputs and the desired output (often also called a label or supervisory signal). Each training example is represented by an array or feature vector, and the training data is represented by a matrix. Through iterative optimization of an objective function, supervised learning methods learn a function that can be used for mapping new examples. An optimal function will allow the algorithm to correctly determine the output for the test data, that is, for input instances that were not part of the training data and are therefore unseen by the learner until that moment. This requires the learning algorithm to generalize well from the training data to new and unseen situations.

In general, the training and test data are defined before any processing of the data starts. This ensures that the test set remains unseen by, and unknown to, the learner, so that an unbiased evaluation of the obtained results can be achieved. This is usually done by randomly selecting a predefined quantity of data samples and setting them aside until the algorithm evaluation takes place. The size of the test data can vary (usually between 10% and 40% of the overall dataset), but more important than its relative size is that it is large enough to yield statistically meaningful results and that it is representative of the data set as a whole. Another dataset that is often used is a validation set. The validation set is a sample of the data held back from model training that is used to estimate model skill while tuning the model’s hyperparameters. It differs from the test data in that it provides an unbiased evaluation of a model fitted on the training data during hyperparameter tuning, whereas the test set provides an unbiased evaluation of the final model fitted on the training dataset. The relation between the training, validation, and test data is shown in Fig. 5.9.

The two main categories of supervised learning methods are regression and classification. For a classification problem, the aim of the machine learning algorithm is to categorize or classify given inputs based on the training data. The training data in a classification problem consists of input-output pairs categorized into classes. The simplest classification problems are binary ones, where only two classes are present (for instance, distinguishing between moving and stationary objects). More challenging problems involve several classes, for example, distinguishing between several road categories, land use categories, or different levels of service. For a regression problem, the aim of the machine learning algorithm is to establish a relationship between outputs and inputs using a continuous function that describes how the outputs change for a given input. Regression problems can also be envisioned as prediction problems.
For instance, given the known demand for shared bike services across the bike sharing stations, the output can be the average weekend day’s demand for the next period, allowing the service provider to plan the distribution of the bikes ahead of weekends.

Formal description: Let $X = \{x_1, \ldots, x_n\}$ be a set of $n$ examples (data points), where $x_i \in \mathcal{X}$ for all $i \in [n] := \{1, \ldots, n\}$. Generally, it is assumed that the points are drawn independently and identically distributed (i.i.d.) from a common distribution on $\mathcal{X}$. The goal of supervised learning is to learn a mapping from $x$ to $y$, given a training set made of pairs $(x_i, y_i)$, where $y_i \in \mathcal{Y}$ are called the labels or targets of the examples $x_i$. If the labels are numbers, $y = (y_i)^{T}_{i \in [n]}$ denotes the column vector of labels. The pairs $(x_i, y_i)$ are again sampled i.i.d. from some distribution, which here ranges over $\mathcal{X} \times \mathcal{Y}$. When $\mathcal{Y} = \mathbb{R}$ or $\mathcal{Y} = \mathbb{R}^d$ (the labels are continuous values), the task is called regression. When $y$ takes values in a finite set (discrete labels), the task is called classification.
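The interplay of training, validation, and test data described in this section can be sketched with a small binary classification task (moving versus stationary objects). Everything in the example is an illustrative assumption: the 60/20/20 split, the k-nearest neighbors classifier with k as the tuned hyperparameter, and the synthetic speed and acceleration values.

```python
import random
from collections import Counter


def make_examples(n, seed=1):
    """Synthetic (speed m/s, acceleration m/s^2) features with a class label."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        if rng.random() < 0.5:
            features = (rng.uniform(0.0, 1.0), rng.uniform(0.0, 0.3))
            examples.append((features, "stationary"))
        else:
            features = (rng.uniform(0.5, 15.0), rng.uniform(0.0, 3.0))
            examples.append((features, "moving"))
    return examples


def split_dataset(examples, train=0.6, validation=0.2, seed=0):
    """Shuffle once, then cut into training, validation, and test sets."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def knn_predict(train_set, features, k):
    """Majority label among the k training examples closest to `features`."""
    nearest = sorted(
        train_set,
        key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], features)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train_set, eval_set, k):
    """Share of examples in eval_set that are classified correctly."""
    hits = sum(knn_predict(train_set, f, k) == label for f, label in eval_set)
    return hits / len(eval_set)


train_set, val_set, test_set = split_dataset(make_examples(100))

# Tune the hyperparameter k on the validation set only.
candidates = (1, 3, 5)
best_k = max(candidates, key=lambda k: accuracy(train_set, val_set, k))

# The test set is touched once, for the final unbiased estimate.
test_accuracy = accuracy(train_set, test_set, best_k)
print(f"best k = {best_k}, test accuracy = {test_accuracy:.2f}")
```

The test set plays no role in choosing k, which is what keeps the final accuracy estimate unbiased.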

FIGURE 5.9 Training, validation, and test data sets.

5.5.2 Unsupervised learning

Unsupervised learning methods rely on unlabelled data to identify hidden patterns, instead of inferring models from input-output example pairs that are known in advance. Hence, unsupervised learning methods take a set of data that contains solely inputs and aim to find structure in the data. To achieve this, they identify commonalities in the data and react based on the presence or absence of such commonalities in each new data instance. Unsupervised learning can be a goal in itself (for instance, discovering hidden patterns in data) or a means toward an end (for example, feature learning). Very often, the correct interpretation of the outcomes of unsupervised learning methods requires both computer science (method) and domain (mobility) expertise to obtain the maximum possible value from the results.

Clustering and association are the two best-known groups of methods for unsupervised learning problems. Clustering methods focus on grouping data into multiple clusters based on similarities between data points. The aim is that data points within the same cluster are similar according to one or more criteria, while data points drawn from different clusters are dissimilar. Various clustering methods make different assumptions about the structure of the data. For this, they rely on mathematical models to identify similarities between unlabelled data points, usually defined by one similarity metric (for instance, Euclidean distance, Manhattan distance, Jaccard index, Minkowski distance, or cosine similarity). Following this, they evaluate the outcomes, for example, by internal compactness (the similarity between members of the same cluster) and separation (the difference between clusters). Other methods are based on estimated density and graph connectivity.

Association is a rule-based machine learning method that focuses on identifying one or more trends, or interesting relations, in the given data set that represent major data patterns. In other words, it is intended to identify significant association rules that connect data patterns to each other. A combination of supervised and unsupervised learning, which uses both labeled and unlabelled data, is called semisupervised learning.

Formal description: Let $X = \{x_1, \ldots, x_n\}$ be a set of $n$ examples (data points), where $x_i \in \mathcal{X}$ for all $i \in [n] := \{1, \ldots, n\}$. It is assumed that the points are drawn i.i.d. from a common distribution on $\mathcal{X}$. The goal of unsupervised learning is to find interesting structures in the data $X$.
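The similarity metrics named above can be illustrated with a minimal Python sketch. The function names and the toy feature vectors are assumptions made for illustration; they are not taken from the book or from any particular library.

```python
import math


def minkowski(a, b, p):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jaccard_index(s, t):
    """Size of the intersection divided by the size of the union of two sets."""
    s, t = set(s), set(t)
    return len(s & t) / len(s | t)


# Hypothetical feature vectors describing two bike sharing stations
station_a = [3.0, 4.0]
station_b = [0.0, 0.0]
euclidean = minkowski(station_a, station_b, p=2)
manhattan = minkowski(station_a, station_b, p=1)

# Hypothetical sets of transport modes used by two travellers
modes_overlap = jaccard_index({"bus", "tram"}, {"tram", "bike"})
```

A clustering method would typically call one such function for every pair of data points (or for every point and cluster centre), so the choice of metric directly shapes which points end up in the same cluster.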

5.5.3 Reinforcement learning

Reinforcement learning is concerned with how machines (intelligent agents) ought to take actions in an environment to maximize a notion of cumulative reward. Here, the agent aims to learn how to achieve a goal in an uncertain and potentially complex environment. The machine employs trial and error to come up with a solution to the problem. On this path, it gets either rewards or penalties for the actions it performs, based on the set reward policy. Its goal is to maximize the total reward. However, it does not receive any suggestions or examples of how to solve the problem; hence, it is up to the method to find out how to maximize the reward, starting from random trials and converging toward efficient tactics. Reinforcement learning therefore differs from supervised learning, as it does not require labeled input and output data pairs and does not need suboptimal actions to be explicitly corrected. Instead, the methods focus on finding an appropriate balance between exploration (of the unknown) and exploitation (of gained knowledge). It also differs from unsupervised learning, as its goal is to find good behavior (an action) that maximizes the long-term benefits the agent receives, rather than to find similarities and differences between data points. The overview of the main differences between the three categories of machine learning is given in Table 5.2.

TABLE 5.2 Comparison of machine learning categories.

| Criteria | Supervised machine learning | Unsupervised machine learning | Reinforcement machine learning |
| --- | --- | --- | --- |
| Principles | Learns based on the labeled data to infer a function that maps an input to an output | Trained using unlabelled data without guidance to identify hidden patterns | Employs interaction with the environment (trial and error) to learn how to achieve a goal |
| Type of data | Labeled data | Unlabelled data | No predefined data |
| Type of problems | Regression and classification | Association and clustering | Exploitation or exploration |
| Aim | Calculate outcomes | Discover underlying patterns | Learn a series of actions |
| Example of application | Forecasting travel time | Distinguishing among vehicle categories | Autonomous driving |

Formal description: Let $S$ be the set of states, where $s_t \in S$ denotes the state of the agent at time $t$, and let $A(s)$ be the set of available actions in state $s$, where $a_t \in A(s_t)$ denotes the action that the agent performs at time $t$. Let $R: S \times A \times S \to \mathbb{R}$ be a reward function, where $R^{a}_{ss'}$ denotes the expected reward when the agent transitions from state $s$ to state $s'$ after choosing to perform action $a$ in state $s$. The aim of a learning agent is to learn to select actions that maximize the accumulated reward over time ($\max \sum_i R_i$).
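The trial-and-error loop described above can be sketched with tabular Q-learning, one common reinforcement learning method that the text itself does not name. The five-state corridor environment, the reward of 1 at the goal, and all parameter values below are hypothetical choices made only for illustration.

```python
import random

N_STATES = 5          # states 0..4; state 4 is the terminal goal (assumption)
ACTIONS = (-1, +1)    # action index 0 = move left, 1 = move right


def step(state, action_index):
    """Toy environment: returns (next_state, reward, done)."""
    next_state = min(max(state + ACTIONS[action_index], 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=300, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values by trial and error with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore with probability epsilon, or when both actions look equal
            explore = rng.random() < epsilon or q[state][0] == q[state][1]
            if explore:
                action = rng.randrange(2)
            else:
                action = q[state].index(max(q[state]))  # exploit
            next_state, reward, done = step(state, action)
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


def greedy_policy(q):
    """Best known action index for every non-terminal state."""
    return [row.index(max(row)) for row in q[:-1]]


q_values = q_learning()
policy = greedy_policy(q_values)
```

The agent never sees a labeled example of the correct action; it only observes rewards. In this toy setting the learned greedy policy moves right in every state, which is the behaviour that maximizes the accumulated reward.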

5.5.4 Building and evaluating a machine learning algorithm

Fig. 5.10 depicts a typical machine learning algorithm development process. This process starts with problem identification and definition. During the problem definition stage, one needs to consider what output variable one is interested in, as well as what potential independent input variables would be of added value for the learning system. The input variables need to be available, obtainable, and compatible with the defined output variable (for instance, cover the same target population, etc.). An important question to answer during this stage is whether the identified problem requires the application of machine learning techniques or whether a simpler approach would be more appropriate. According to the literature (Tarassenko, 1998), the defined problem requires a machine learning solution if:

• Given a set of input variables $x$ and a set of output variables $y$, there is some logical evidence that a mapping between these variables exists such that $y = f(x)$.
• The above function for mapping between $x$ and $y$ is unknown, as no explicit algorithm or set of mathematical equations exists to describe the solution of the problem.
• The data to define the input ($x$) and output ($y$) variables exist, and the number of data samples is sufficient to train and/or test the learning algorithm.

FIGURE 5.10 Machine learning modeling workflow.

If the above requirements are satisfied, and based on the identification of the output variable, one needs to narrow down the spectrum of machine learning method types by identifying whether the considered problem corresponds to a supervised, unsupervised, or reinforcement machine learning problem, and what type of machine learning technique is applicable for solving the problem (for instance, clustering, regression, or classification).

Once the problem is identified and defined, and the spectrum of applicable machine learning method types is narrowed down, one needs to gather existing data or collect new (obtainable) data for the analysis. In order to reduce the dimensionality of the model, careful consideration of the selection of features is needed. In this process, the inclusion of mobility domain experts, and experts from other disciplines related to the specific problem at hand, is necessary. Experts can, at this stage, give valuable insights regarding the variables’ relevance, potential associations among different variables, as well as the availability of relevant data in the local context (for instance, in their organization, commercial or open data, local statistics, etc.). Another relevant detail is that gathered and collected data need to be representative (in terms of the population and the problem), aligned in time and area (cover the same area and period), and of sufficient size to produce the datasets needed for the analysis. Furthermore, due consideration regarding the use of personal data needs to be given before obtaining such data, as all the procedures (for instance, acquiring consent, anonymization procedures, availability of needed measures such as data protection offices, etc.) need to be in place. Also, the conditions under which each data source can be used need to be considered in the context of the planned analysis.

Following this, data preprocessing can start. Data preprocessing comprises all the processes that follow data collection and precede the analysis. One of the first processes here is data integration (for instance, integration of data streams based on the available application programming interfaces (APIs)) for online analysis or, more simply, data fusion of already integrated or stored historical data. Before conducting the actual data fusion, the data formatting needs to be checked. This includes simple checks such as data consistency (for example, that all the data sources use the same decimal separators) as well as more demanding ones (for instance, that all the geographical data use the same spatial reference system). For this, a good starting point is to look at the data’s metadata or to visually inspect the data. Alternatively, testing the process on smaller data samples may also be helpful, rather than running the entire dataset through the complete data processing pipeline (which can be quite time-consuming) only to find out that one needs to go back and repeat the process once the proper format check and necessary adjustments have been performed. Also, one should always note in the metadata, regardless of the processing stage, all the adjustments that were made to the data (and store the original dataset if possible, so that irreversible processes do not permanently change data that might be needed in its original form for another analysis). Following this, the data fusion/integration takes place. It aims to strategically integrate and combine the required and obtainable data sources into a unique, cohesive, and structured data space, so that the analysis to follow can extract hidden underlying knowledge.
An example of data fusion could be a fusion of noise and emissions data with the road traffic intensities, or walkability indexes, for a defined geographical area based on the timeline of observations and location details. To correctly conduct these processes, sometimes more demanding procedures are needed, for example, map-matching of observed vehicle tracks/intensities with the traffic network to correct potential errors and data noise. In general, several initial preprocessing steps might take place, if required by the analysis (Tan et al., 2006), including:

- Data aggregation comprises combining two or more variables (or records) into a single variable (or record). This can be done in order to reduce the overall number of variables or records or to change the scale of data, as for instance, households’ data aggregated into neighborhoods or days aggregated into weeks or months.

- Data sampling, for example, in a case when processing the entire set of data of interest is too resource-demanding or time-consuming. In such cases, one needs to take care that the data sample is representative (it has approximately the same properties of interest as the original set of data), as only a representative data sample has the potential to give insights comparable to those made on the entire data set. The best-known and most used sampling techniques for mobility studies include random sampling and stratified sampling.

- Discretization and binarization. Discretization refers to a conversion of continuous variables into ordinal ones (for instance, a potentially infinite number of values are mapped into a small number of categories), while binarization maps a continuous or categorical variable into one or more binary variables.

- Attribute transformation is a function that maps the entire set of values of a given variable to a new set of replacement values in a manner that each original value can be identified with one of the new values. An important attribute transformation step is data normalization. Data normalization refers to various techniques used to account for differences among variables in terms of their frequency of occurrence, mean, variance, and range. Hence, normalization is a process of conditioning the data within a certain boundary to reduce redundancy and improve the interpretability of the results. In statistics, normalization is done with respect to the mean and variance, as follows:

$$x_i^{norm} = \frac{x_i - \mu_x}{\sigma_x} \tag{5.10}$$

where $x_i$ is the $i$-th component of the variable $x$, $\mu_x$ is the mean, and $\sigma_x$ is the standard deviation of variable $x$. Alternatively, the values can also be normalized with respect to the range by rescaling to new minimum and maximum values:

$$x_i^{normR} = \frac{x_i - x_i^{min}}{x_i^{max} - x_i^{min}} \left( new_i^{max} - new_i^{min} \right) + new_i^{min} \tag{5.11}$$

where $x_i^{min}$ and $x_i^{max}$ are the minimum and the maximum values for the $i$-th component of the variable $x$, and $new_i^{max}$ and $new_i^{min}$ are, respectively, the maximum and minimum of the rescaled values of the $i$-th component of the variable $x$.

- Dimensionality reduction refers to the reduction of the number of input variables, either to avoid "the curse of dimensionality", to reduce the amount of time and memory required for the analysis, or to allow data to be more easily visualized. The best-known techniques for dimensionality reduction include principal components analysis (PCA) and singular value decomposition.

- Feature creation comprises the creation of new variables that can capture the important information in a data set much more efficiently than the original variables. For instance, deriving average speed on a road segment based on the observed positions and timestamps of the vehicles, or by a comparison of image data and recognition of vehicles at two consecutive visual sensing locations.

- Missing data handling refers to determining the policy for the treatment of missing data points (for instance, listwise or casewise deletion, mean substitution, regression imputation, last observation carried forward, maximum likelihood, etc.).
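The two normalization formulas in Eqs. (5.10) and (5.11) can be sketched in a few lines of Python. The function names and the speed values are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which variant is meant.

```python
import statistics


def z_score(values):
    """Eq. (5.10): centre on the mean and scale by the standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    return [(v - mean) / std for v in values]


def min_max(values, new_min=0.0, new_max=1.0):
    """Eq. (5.11): rescale to the range [new_min, new_max].

    Assumes the values are not all identical (otherwise the range is zero).
    """
    old_min, old_max = min(values), max(values)
    scale = (new_max - new_min) / (old_max - old_min)
    return [(v - old_min) * scale + new_min for v in values]


# Hypothetical observed speeds on a road segment, in km/h
speeds_kmh = [30.0, 50.0, 70.0]
speeds_standardized = z_score(speeds_kmh)
speeds_rescaled = min_max(speeds_kmh)
```

After standardization the values have zero mean and unit standard deviation, while min-max rescaling preserves the relative spacing of the values inside the chosen range.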

5.5.5 Common machine learning methods used for mobility analytics

As the mobility system is evolving from a technology-driven independent system to a data-driven integrated system of systems, the possibilities to apply advanced data analytics and machine learning techniques to improve the functioning and quality of the services in the mobility domain are also increasing. In practical applications, the types of utilized machine learning vary from linear regression or binary classification problems to more complex approaches involving a fusion of several methods, such as neuro-fuzzy systems. The following section presents some of the most popular machine learning algorithms used in the mobility domain, followed by some practical examples, more of which are presented in the following chapter (see Chapter 6, Transport planning and big data).

5.5.5.1 Regression methods

Given an output (target) variable $y_i$, which, up to measurement errors, depends on one or more input variables $x_i$, regression describes the nature of the dependency and quantifies the error variance by finding a fitting function that maps the input variables to the output. The simple forms of regression models include linear regression with one input variable, polynomial regression (when the relation between input and output variables is not linear), multiple regression (where there are multiple features or input variables), and multivariate regression (when there is more than one dependent, or output, variable). In this context, the training data is described as the output variables $y_i$, $i = 1, \ldots, n$, and the corresponding input variables $x_i$, where each of the input variables can be represented as a vector. The general regression model is given by:

$$y_i = f(x_i) + \varepsilon_i \tag{5.12}$$

where $\varepsilon_i$ is a regression error. Linear regression is a linear approach to modeling the relationship between a scalar output and one or more explanatory input (independent) variables. The case of one explanatory variable is called simple linear regression, where the model assumes a linear relationship between a single input variable and the single output variable, as in the following one-variable linear model:

$$y_i = f(x_i) + \varepsilon_i = w_0 + w_1 x_i + \varepsilon_i \tag{5.13}$$

where the unknown parameters $w_0$ and $w_1$ are called regression coefficients or weights. To find the best linear fit, we calculate a goodness-of-fit metric that measures how well the model fits a set of observations, for instance, the residual sum of squares (RSS):

$$RSS(w_0, w_1) = \sum_{i=1}^{n} \left( y_i - [w_0 + w_1 x_i] \right)^2 \tag{5.14}$$

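and minimize it over all possible $w_0$, $w_1$. By taking the derivatives of $RSS(w_0, w_1)$ with respect to $w_0$ and $w_1$ and setting them to zero, we get the optimal regression coefficients:

$$\hat{w}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} \tag{5.15}$$

and

$$\hat{w}_0 = \bar{y} - \hat{w}_1 \bar{x} \tag{5.16}$$

where $\bar{x}$ and $\bar{y}$ are the sample means of the input and output variables.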
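The closed-form coefficients of Eqs. (5.15) and (5.16) and the RSS of Eq. (5.14) can be checked numerically with a short sketch. The function names and the small distance and travel time data set are hypothetical and chosen so that the relation is exactly linear.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (w0, w1) from the closed-form least squares solution."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    numerator = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    denominator = sum(x * x for x in xs) - n * x_bar ** 2
    w1 = numerator / denominator   # Eq. (5.15)
    w0 = y_bar - w1 * x_bar        # Eq. (5.16)
    return w0, w1


def residual_sum_of_squares(xs, ys, w0, w1):
    """Eq. (5.14): sum of squared differences between observed and fitted values."""
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: trip distance (km) and travel time (min), time = 3 + 2 * distance
distances = [1.0, 2.0, 3.0, 4.0]
times = [5.0, 7.0, 9.0, 11.0]
w0, w1 = fit_simple_linear_regression(distances, times)
rss = residual_sum_of_squares(distances, times, w0, w1)
```

Because the toy data lie exactly on a line, the fitted coefficients recover the intercept and slope used to generate them and the RSS is zero; with noisy observations the RSS would be positive but still minimal among all straight lines.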

However, optimal regression coefficients may not have a closed-form solution for more general models, and numerical optimization methods (for instance, gradient descent) can be used in such cases. For polynomial regression models, the function $f$ is modeled as a $p$-th degree polynomial in $x$:

$$y_i = w_0 + w_1 x_i + w_2 x_i^2 + \ldots + w_p x_i^p + \varepsilon_i. \tag{5.17}$$

When $p$ is equal to one, the problem corresponds to the previously described linear regression problem, and a quadratic model is described with $p = 2$. In mobility-related analytics, polynomial regression models can be found in applications such as forecasting weather or demand modeling.

For the multiple regression model, multiple input features are introduced. Let $x_1, x_2, \ldots, x_d$ be a set of $d$ inputs believed to be related to an output variable $y$. The multiple linear regression model for the $i$-th data point has the form:

$$y_i = w_0 + w_1 x_{i1} + w_2 x_{i2} + \ldots + w_d x_{id} + \varepsilon_i \tag{5.18}$$

where $w_j$, $j = 0, 1, \ldots, d$, are unknown regression coefficients. Considering $n$ data records, the model can be organized into vectors and matrices:

$$Y = XW + \varepsilon \tag{5.19}$$

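where $Y$ is $n \times 1$, $X$ is $n \times (d+1)$, $W$ is $(d+1) \times 1$, and $\varepsilon$ is $n \times 1$.

Multivariate regression deals with the case where there is more than one dependent (output) variable and one independent input variable. In a case where there is more than one output variable and more than one input variable, the model is called multivariate multiple regression. Consider that we have $m$ output variables $Y_1, Y_2, \ldots, Y_m$ and a set of $d$ input variables $x_1, x_2, \ldots, x_d$ on each data unit. Each output follows its own multiple regression model:

$$Y_1 = w_{01} + w_{11} x_1 + \ldots + w_{d1} x_d + \varepsilon_1 \tag{5.20}$$

$$Y_2 = w_{02} + w_{12} x_1 + \ldots + w_{d2} x_d + \varepsilon_2 \tag{5.21}$$

$$\vdots$$

$$Y_m = w_{0m} + w_{1m} x_1 + \ldots + w_{dm} x_d + \varepsilon_m. \tag{5.22}$$

Suppose we have a sample of size $n$. As earlier (in the multiple regression example), the matrix $X$ has dimension $n \times (d+1)$, but now:

$$Y_{n \times m} = \begin{bmatrix} Y_{11} & Y_{12} & \cdots & Y_{1m} \\ Y_{21} & Y_{22} & \cdots & Y_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ Y_{n1} & Y_{n2} & \cdots & Y_{nm} \end{bmatrix} = \begin{bmatrix} Y_{(1)} & Y_{(2)} & \cdots & Y_{(m)} \end{bmatrix} \tag{5.23}$$

where $Y_{(i)}$ is the vector of $n$ measurements of the $i$-th variable. Likewise,

$$W_{(d+1) \times m} = \begin{bmatrix} w_{01} & w_{02} & \cdots & w_{0m} \\ w_{11} & w_{12} & \cdots & w_{1m} \\ \vdots & \vdots & \ddots & \vdots \\ w_{d1} & w_{d2} & \cdots & w_{dm} \end{bmatrix} = \begin{bmatrix} w_{(1)} & w_{(2)} & \cdots & w_{(m)} \end{bmatrix} \tag{5.24}$$

where $w_{(i)}$ are the $(d+1)$ regression coefficients in the model for the $i$-th variable. Lastly, the $m$ $n$-dimensional vectors of errors $\varepsilon_{(i)}$, $i = 1, \ldots, m$, can also be written in matrix notation as an $n \times m$ matrix:

$$\varepsilon_{n \times m} = \begin{bmatrix} \varepsilon_{11} & \varepsilon_{12} & \cdots & \varepsilon_{1m} \\ \varepsilon_{21} & \varepsilon_{22} & \cdots & \varepsilon_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \varepsilon_{n1} & \varepsilon_{n2} & \cdots & \varepsilon_{nm} \end{bmatrix} = \begin{bmatrix} \varepsilon_{(1)} & \varepsilon_{(2)} & \cdots & \varepsilon_{(m)} \end{bmatrix} = \begin{bmatrix} \varepsilon'_{1} \\ \varepsilon'_{2} \\ \vdots \\ \varepsilon'_{n} \end{bmatrix} \tag{5.25}$$

where the $m$-dimensional row vector $\varepsilon'_i$ includes the residuals for each of the $m$ output variables for the $i$-th data record. The multivariate multiple regression model can then take the form:

$$Y_{n \times m} = X_{n \times (d+1)} W_{(d+1) \times m} + \varepsilon_{n \times m} \tag{5.26}$$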
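The text does not give an estimator for the model in Eq. (5.26). As a hedged aside, the standard ordinary least squares result (a general textbook fact rather than something taken from this chapter) can be written as follows, assuming that $X^{T} X$ is invertible; the hat notation is introduced here only for illustration.

```latex
\hat{W}_{(d+1) \times m} = \left( X^{T} X \right)^{-1} X^{T} \, Y_{n \times m},
\qquad
\hat{w}_{(i)} = \left( X^{T} X \right)^{-1} X^{T} \, Y_{(i)}, \quad i = 1, \ldots, m
```

Under this assumption, the multivariate fit decomposes column by column into $m$ separate multiple regressions that share the same input matrix $X$.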

A common question when it comes to regression models is how to select the one that is suitable for the data, or the problem, at hand. In the literature (Akossou & Palm, 2013; Peixoto, 1990), one can find several ways to choose a suitable regression model for the analysis. However, the most often used measures are $R^2$ and adjusted $R^2$ (adj $R^2$), which indicate how much variation in the output variable is explained by the considered input variable(s). Hence, the higher the $R^2$, the more suitable the considered model is for the given data. The $R^2$ is given by:

$$R^2 = 1 - \frac{SCE_{res}}{SCE_{tot}} \tag{5.27}$$

and adj $R^2$ by:

$$adj\,R^2 = 1 - \frac{SCE_{res} / (n - d - 1)}{SCE_{tot} / (n - 1)} \tag{5.28}$$

where $SCE_{tot}$ is the total sum of squares (proportional to the variance of the data), $SCE_{res}$ is the sum of squares of residuals, $n$ is the number of observations, and $d$ is the total number of input variables in the model. The adj $R^2$ can be seen as an attempt to account for the phenomenon of the $R^2$ automatically and spuriously increasing when extra input variables are added to the model (Yin & Xitao, 2001). Hence, while $R^2$ assumes that every single variable explains the variation in the dependent variable, the adj $R^2$ gives the percentage of variation explained by only the independent (input) variables that actually affect the dependent (output) variable.

5.5.5.2 Support vector machines

Support vector machines (SVM) is a supervised machine learning method that can be used both for classification (Joo et al., 2015; Semanjski & Gautama, 2016; Cavar et al., 2011) and regression analysis (Tang et al., 2019; Vlahogianni, 2015) (Fig. 5.11). The simplest version of the algorithm is a linear binary classifier that classifies data patterns by identifying a decision hyperplane with the maximum margin between the data points of each class (Fig. 5.12). The underlying idea is to find a function with a set error margin that maps input variables to output variables in such a manner that the predicted output does not deviate from the actual output by more than the a priori set error margin.

FIGURE 5.11 An example of how a linear hyperplane can be used both for classification and regression problems.

FIGURE 5.12 Support vector machines.

More formally, we are given a set of labeled input and output pairs $S = \{(x_i, y_i)\}_{i=1}^{n}$, where $y_i \in \{-1, +1\}$ is the class label of the high-dimensional point $x_i$, that is, $(x_i, y_i) \in \mathbb{R}^{d+1}$, $d$ is the number of features, or dimensionality, and $n$ is the number of labeled data points. The classes $C_0$ and $C_1$ correspond to the data points with the two binary labels, and $S = C_0 \cup C_1$. The optimal classifier is determined by the parameters $w$ and $b$, through solving the convex optimization problem:

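$$\text{Minimize} \quad \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i \tag{5.29}$$

$$\text{Subject to} \quad y_i \left( w^{T} x_i + b \right) \geq 1 - \xi_i, \quad i = 1, \ldots, n \tag{5.30}$$

$$\xi_i \geq 0, \quad i = 1, \ldots, n \tag{5.31}$$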

with the corresponding linear function $f(x) = w^{T} x + b$. The magnitude of penalization for misclassification is controlled by the parameter $C$ (the capacity constant) and the slack variables $\xi_i$ (parameters for handling nonseparable data inputs). As it is not possible to know beforehand which value of the capacity constant $C$ is the best for a given problem (and its value is important to keep the training error small and to generalize well (Anguita & Oneto, 2011)), different validation techniques can be used for this purpose, utilizing the validation dataset. Furthermore, in practical problems, the data are often not separable into two classes by the hyperplane determined by $w$, and the use of structures more complex than linear ones is needed to correctly classify the data points. For this purpose, different mathematical functions, also called kernels, can be used to map $S$ into a new high-dimensional space (Chapelle & Vapnik, 2000; Vapnik et al., 1997) and find the hyperplane there. In such cases, Eq. (5.30) takes the form:

$$y_i \left( w^{T} \Phi(x_i) + b \right) \geq 1 - \xi_i, \quad i = 1, \ldots, n \tag{5.32}$$

where $\Phi$ stands for the mapping, induced by the kernel function, that transforms $S$. Some of the frequently used kernel functions include:

Linear: $K(x, x_i) = x \cdot x_i$ (5.33)
Polynomial: $K(x, x_i) = [\gamma \, (x \cdot x_i) + coef]^{d}$ (5.34)
Radial basis function (RBF): $K(x, x_i) = \exp\left( -\gamma \, \|x - x_i\|^2 \right)$ (5.35)
Sigmoid: $K(x, x_i) = \tanh\left( \gamma \, (x \cdot x_i) + coef \right)$ (5.36)

where $\gamma$, $coef$, and $d$ are the function parameters (here $d$ denotes the degree of the polynomial). Furthermore, in practical problems often more than two classes are present, requiring an adjustment of the original binary classifier. For this purpose, one can set dummy variables (created with case values of either 0 or 1) to represent the categorical variables. In such cases, a categorical dependent variable consisting of four levels, say $(A, B, C, D)$, is represented by a set of four dummy variables:

$$A: \{1\ 0\ 0\ 0\}; \quad B: \{0\ 1\ 0\ 0\}; \quad C: \{0\ 0\ 1\ 0\}; \quad D: \{0\ 0\ 0\ 1\} \tag{5.37}$$
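The kernel functions of Eqs. (5.33) to (5.36) and the dummy coding of Eq. (5.37) can be written down directly. The sketch below uses plain Python lists; the function names, default parameter values, and example vectors are assumptions made for illustration and do not correspond to any specific SVM library.

```python
import math


def dot(x, z):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(x, z))


def linear_kernel(x, z):
    return dot(x, z)                                   # Eq. (5.33)


def polynomial_kernel(x, z, gamma=1.0, coef=0.0, degree=2):
    return (gamma * dot(x, z) + coef) ** degree        # Eq. (5.34)


def rbf_kernel(x, z, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)         # Eq. (5.35)


def sigmoid_kernel(x, z, gamma=1.0, coef=0.0):
    return math.tanh(gamma * dot(x, z) + coef)         # Eq. (5.36)


def dummy_encode(label, levels):
    """Eq. (5.37): one binary indicator per level of a categorical variable."""
    return [1 if label == level else 0 for level in levels]


# Hypothetical two-feature data points
point_a, point_b = [1.0, 2.0], [3.0, 4.0]
k_linear = linear_kernel(point_a, point_b)
k_poly = polynomial_kernel(point_a, point_b, gamma=1.0, coef=1.0, degree=2)
k_rbf_same = rbf_kernel(point_a, point_a)
class_c = dummy_encode("C", ["A", "B", "C", "D"])
```

Each kernel returns a single similarity score for a pair of points, which is what allows the classifier to work in the transformed space without computing the transformation explicitly.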

5.5.5.3 Decision tree

A decision tree is a nonparametric supervised machine learning method that has a tree-like structure and can be used for classification and regression problems. The goal of a decision tree is to create a model that predicts the output value (target variable) by learning simple decision rules inferred from the input variables (features). Hence, a tree is built by making a decision about splitting the source set (represented in the root node of the tree) into subsets (represented in the branches) to form conclusions about the item’s target, or output, value (represented in the leaves). The process ends when all possible decisions and outcomes are considered, resulting in a tree-like structure in which the logical sequence of decisions leads from the input data to the corresponding output labels.

More formally, given training data that consist of $N$ labeled examples $(x_1, y_1), \ldots, (x_N, y_N) \in \mathcal{X} \times \mathcal{Y}$, the goal is to predict output labels $y \in \mathcal{Y}$ and the distribution $p(y|x)$ for unlabelled test points $x \in \mathcal{X}$, where $p(y|x)$ is the conditional distribution of label $y$ for a given $x$. For simplicity, we assume that $\mathcal{X} := \mathbb{R}^d$, where $d$ is the number of features. For classification problems, $\mathcal{Y} := \{1, \ldots, K\}$, where $K$ is the number of classes, and for regression problems, $\mathcal{Y} := \mathbb{R}$. Let $X_{1:n} := \{x_1, \ldots, x_n\}$, $Y_{1:n} := \{y_1, \ldots, y_n\}$, and $\mathcal{D}_{1:n} := (X_{1:n}, Y_{1:n})$. For every subset $S \subseteq \{1, \ldots, N\}$, let $Y_S = \{y_n : n \in S\}$, and likewise for $X_S$ and $\mathcal{D}_S$.

Again for simplicity, consider that a decision tree $T$ has a finite set of nodes such that every node $j$ has exactly one parent node (except for the root node $r$, which has no parent), and every node $j$ is the parent of exactly zero or two children nodes (branches or leaves). Let $l(T)$ denote the leaf nodes. Each node $j \in T$ is associated with a portion $P_j \subseteq \mathbb{R}^d$ of the input space with the following logic: at the root, $P_r \subseteq \mathbb{R}^d$; each branch node $j \in T \setminus l(T)$, which has two children denoted $0(j)$ and $1(j)$, divides its portion $P_j$ into two parts, where $\varphi_j \in \{1, \ldots, d\}$ and $\lambda_j$ denote the dimension and the location of the divide, respectively. The portions can be seen as:

$$P_{0(j)} := \left\{ x \in P_j : x_{\varphi_j} \leq \lambda_j \right\} \tag{5.38}$$

$$P_{1(j)} := \left\{ x \in P_j : x_{\varphi_j} > \lambda_j \right\} \tag{5.39}$$
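The way a branch node divides its portion of the input space along one dimension at one location, as in Eqs. (5.38) and (5.39), can be sketched as follows. The function name and the trip records are hypothetical, and the dimension is a zero-based index here, whereas the text counts dimensions from 1.

```python
def split_portion(points, dimension, location):
    """Divide the points of one node into two child portions.

    Points whose value in the chosen dimension is at most `location`
    go to the first child, all others go to the second child.
    """
    first_child = [x for x in points if x[dimension] <= location]
    second_child = [x for x in points if x[dimension] > location]
    return first_child, second_child


# Hypothetical trips described by (distance in km, duration in min)
trips = [(1.0, 6.0), (2.5, 12.0), (4.0, 15.0), (7.5, 30.0)]
short_trips, long_trips = split_portion(trips, dimension=0, location=3.0)
```

Applying such a split recursively to each resulting child, with a dimension and location chosen per node, yields the nested partition of the input space that the tree represents.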

We call the tuple $\sigma = (T, \varphi, \lambda)$, that is, the tree structure together with the dimensions and locations of its divides, a decision tree. Once we have a decision tree structure, we also need a rule for predicting the label of test points given training data. For instance, the classification tree assigns each leaf in $l(T)$ the most frequent label found in the corresponding portion of $\mathcal{X}$, and the regression tree provides a fitted value $E(Y | X \in P_{l(T)})$ within each portion as its estimate of the output value. The level of impurity is measured to evaluate the performance of the tree. For instance, if the decision tree classifies all data patterns into the classes to which they actually belong, the splits between classes and branches are considered pure. The impurity of a node $j$ can be computed based on several methods, such as:

- Entropy-based impurity:

$$Entropy = -\sum_{i=1}^{K} p_i(j) \log_2 p_i(j) \tag{5.40}$$

- Gini impurity:

$$Gini = 1 - \sum_{i=1}^{K} p_i(j)^2 \tag{5.41}$$

- Misclassification impurity:

$$Misclassification = 1 - \max_i p_i(j) \tag{5.42}$$

where $p_i(j)$ is the relative frequency of training instances that belong to class $i$ at the node $j$, and $0 \log_2 0 = 0$ in entropy calculations.

Some commonly used decision-tree-based methods for mobility studies include gradient boosting trees and random forests. In gradient boosting, the main idea is to compute a sequence of simple trees, where each successive tree is built for the prediction residuals of the preceding tree. In this manner, by combining many weak learners (high bias, low variance), gradient boosting forms a single strong learner (Appel et al., 2013) and decreases error primarily by decreasing bias (and, to some extent, variance, as it aggregates the output from several models). Contrary to the weak learners that work sequentially in gradient boosting, in the bagging approach several models are fitted on subsets of the data (usually drawn randomly with replacement) in a parallel manner. An example of bagging is the random forest. A random forest (RF) is built as a large collection of fully grown decision trees (low bias, high variance), each capable of producing a response when presented with a set of input values (Breiman, 2001). The random forest addresses the error-decreasing challenge by reducing variance, which is contrary to boosting trees. The trees are built to be uncorrelated to maximize the decrease in variance, yet the random forest cannot reduce bias (which is somewhat greater than the bias of an individual tree in the forest). Hence the need for fully grown (unpruned) trees, so that the bias is as low as possible from the beginning.

5.5.5.4 Artificial neural networks

Artificial neural networks (ANN), commonly referred to as neural networks (NN), are motivated by the way a human brain processes information. Our brains can be seen as remarkably complex, nonlinear, and parallel information processing systems that are able to organize their structural constituents (neurons) to perform certain tasks. Hence, the ANNs are designed to mimic these functions and this structural architecture. A fundamental unit of the artificial neural network is called an artificial neuron (as opposed to the neuron, which is a fundamental unit of the brain and nervous system), and it is a “transfer” function that calculates the output $y$ for a given input $x$. The artificial neurons are connected to form a network through which data flows. The connections among them are weighted connections that scale the data flow. This general relationship between the inputs $x$ and the outputs $y$ can be given as:

$$y_m = f\left( B_m + \sum_{i=1}^{n} W_{im} x_i \right) \tag{5.43}$$

where $y_m$ is the output of the $m$-th artificial neuron, $x_i$ is the $i$-th input from a layer $X$ with $n$ input variables, $B_m$ is the bias for the artificial neuron, $W_{im}$ is the connection weight from the $i$-th artificial neuron of the input layer to the $m$-th artificial neuron, and $f$ is the activation function (Fig. 5.13). The activation function is typically a nonlinear function (mimicking the nonlinear processing performed by the human brain), and some of the most frequently used functions for this purpose are:

Sigmoid: $f(x) = \frac{1}{1 + \exp(-x)}$ (5.44)
Hyperbolic tangent function (“tanh”): $f(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)} = \frac{\exp(2x) - 1}{\exp(2x) + 1}$ (5.45)
Hard threshold: $f_{\beta}(x) = \mathbb{1}_{x \geq \beta}$ (5.46)
Rectified linear unit (ReLU): $f(x) = \max(0, x)$ (5.47)
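The single artificial neuron of Eq. (5.43), combined with the activation functions of Eqs. (5.44) to (5.47), can be sketched in plain Python. The function names, weights, and inputs are hypothetical values chosen for illustration only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))                           # Eq. (5.44)


def tanh(x):
    return (math.exp(2 * x) - 1.0) / (math.exp(2 * x) + 1.0)    # Eq. (5.45)


def hard_threshold(x, beta=0.0):
    return 1.0 if x >= beta else 0.0                            # Eq. (5.46)


def relu(x):
    return max(0.0, x)                                          # Eq. (5.47)


def neuron_output(inputs, weights, bias, activation):
    """Eq. (5.43): activation applied to the bias plus the weighted sum of inputs."""
    weighted_sum = bias + sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum)


# Hypothetical neuron with two inputs
inputs = [1.0, 2.0]
weights = [0.5, 0.25]
out_sigmoid = neuron_output(inputs, weights, bias=-1.0, activation=sigmoid)
out_relu = neuron_output(inputs, [0.5, -0.25], bias=0.0, activation=relu)
```

Swapping the `activation` argument changes only the final nonlinearity, while the weighted sum stays the same, which mirrors how the choice of activation function is independent of the connection weights.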

FIGURE 5.13 Artificial neuron.

The structure of an ANN typically has three basic layers: the input, hidden, and output layers. The hidden layer connects the input and output layers with an extra set of artificial neurons, which are correspondingly called hidden neurons. For the $m$th hidden neuron in the hidden layer $H$, the output is given by:

$$H_m = f\left(B_{hm} + \sum_{i=1}^{n} W_{im} x_i\right) \qquad (5.48)$$

where $H_m$ is the output of the $m$th hidden neuron of the hidden layer $H$, $B_{hm}$ is the bias of the hidden neuron, and $x_i$ is the $i$th input from the input layer $X$ with $n$ input variables. $W_{im}$ is the connection weight from the $i$th artificial neuron of the input layer to the $m$th hidden neuron of the hidden layer. The output from the hidden layer $H$ becomes an input for the subsequent layer. In the simple case when there is only one hidden layer, $H_m$ becomes an input for the output layer; in other words, hidden layers have no connection to the outside world (only the input and output layers do). Thus, the prediction from an output neuron is given by:

$$P_k = f\left(B_0 + \sum_{i=1}^{n} W_{ik} H_i\right) \qquad (5.49)$$

where $P_k$ is the predicted output from the $k$th artificial neuron of the output layer $P$, $B_0$ is the bias of that artificial neuron, $H_i$ is the $i$th input from the hidden layer $H$ with $n$ hidden neurons, and $W_{ik}$ is the connection weight from the $i$th artificial neuron of the hidden layer to the $k$th artificial neuron of the output layer. Traditionally, ANN layers are fully connected (all the inputs from one layer are connected to every activation unit of the subsequent layer) and, as such, they are capable of learning global patterns in their input space. For some applications, however, specific types of ANNs have been developed. For instance, when dealing with spatial pattern analysis and image processing, it is useful to be able to learn local patterns. For these purposes, new types of layers, called convolutional and pooling layers, are introduced among the hidden layers. The ANNs containing these layers are


referred to as convolutional neural networks (CNN). A convolutional layer contains units whose receptive nodes (represented as spatial matrix fields) cover a patch of the preceding layer. The weight vector of such a unit is often called a filter. The pooling layers, on the other hand, are used to reduce the dimensions of the feature maps. Hence, the pooling layers reduce the number of parameters to learn and the amount of computation performed in the network. A pooling layer summarizes the features present in a region of the feature map generated by a convolutional layer. Following several convolutional and pooling layers, the CNN generally ends with several fully connected layers.

For the analysis of sequential data, such as time series or text, another type of ANN, called the recurrent neural network (RNN), is a better option. In an RNN, a hidden layer at time $t$ depends on the input at moment $t$, $x_i$, but also on the same hidden layer at the preceding moment $t - 1$ or on the output at time $t - 1$. This creates a loop from a hidden layer to itself or from the output to the hidden layer. Hence, the RNN can be seen as multiple copies of the same network, each passing information to its successor.

Based on the above equations, one can notice that the weights that connect the input to the hidden layer and the hidden to the output layer are important elements on which the performance of the algorithm depends. The learning, or training, of the ANN consists of determining these weights. Several methods can be used in this step. One popular method, based on gradient descent error minimization, is known as the backpropagation learning rule. The backpropagation neural network (BPNN) first runs a forward pass through the network to determine the state of each node. It then propagates the error backward, from the output layer to the input layer, and improves the accuracy by changing the weights and biases. The

weights are generally adjusted following each epoch to minimize the error between the target and predicted output. The error function of the network is then given by:

$$E = \frac{1}{2} \sum_{i=1}^{n} \left(T_i - P_i\right)^2 \qquad (5.50)$$

where $T_i$ is the target for the $i$th input pattern and $P_i$ is the predicted output from the network for the same input pattern. ANNs can be used for both classification and regression tasks.

5.5.5.5 kNN

k-nearest neighbors (kNN) is a nonparametric method that can also be used for both regression and classification tasks (Poloczek et al., 2014; Valenti et al., 2010). The main idea behind kNN is to position labeled data points in an $n$-dimensional space. Then, based on the defined distance metric among the data points and the size of the neighborhood ($k$), when a new input is provided, the corresponding value of its label (output) is determined from the known output labels of its $k$ nearest neighbors in the $n$-dimensional space. For classification tasks, this is done by a majority vote among the labels of the $k$ nearest points (Fig. 5.14). For regression tasks, where the output is continuous, the outcome is calculated by averaging the outputs of the $k$ nearest points in the neighborhood.

FIGURE 5.14 kNN.

More formally, consider a problem where we would like to classify objects, described by a vector $x \in \mathbb{R}^d$, among $m$ classes $Y := \{0, \ldots, m\}$, given a labeled dataset of $n$ data points $(x_i, y_i) \in \mathbb{R}^d \times Y$ for $i = 1, \ldots, n$. The data are assumed to be a realization of i.i.d. random variables $(x_i, y_i)$ following a distribution $v$. The aim is then to build a classifier, that is, a function $g : \mathbb{R}^d \to Y$, which minimizes the probability of mistakes $P_{(X, Y) \sim v}\{g(X) \neq Y\}$. The kNN classifier, given a new input $x \in \mathbb{R}^d$, looks at the $k$ nearest points to $x$ in the dataset $D_n = (x_i, y_i)$, where distance is measured with a predefined measure (for instance, the Euclidean distance), and predicts a majority vote among them.

One of the main challenges when using the kNN technique is the choice of $k$ (the neighborhood size), as it can strongly influence the quality of the forecast output labels (Everitt et al., 2011; Silva, 2009). For any given problem, a small value of $k$ will lead to a large variance in predictions, and a large value may lead to a large model bias. The literature suggests no exact solution for finding the optimal value of $k$ but rather relies on heuristic approaches (Hall et al., 2008; Nigsch et al., 2006) or on one of the known validation methods, such as v-fold cross-validation (Celisse & Mary-Huard, 2018; Ćavar et al., 2011).

5.5.5.6 Clustering

When dealing with data that lack labeled input-output pair examples, a frequent problem is how to partition the data in such a manner that similar data points belong to the same cohesive partitions, called clusters. This problem is called a clustering problem, and hard and fuzzy clustering are the two main groups of such data partitioning algorithms. The two groups differ in how data points are mapped to the set of partitions. In hard clustering, each object either belongs to a cluster or does not, while in fuzzy clustering each object belongs to each cluster to a certain degree (for instance, with a likelihood of belonging to the specific cluster). Furthermore, when selecting the data clustering approach, one needs to consider which similarity measures the algorithm uses and which criteria for separation between clusters the algorithm models and optimizes. Moreover, depending on the problem at hand and the algorithm, the number of clusters may or may not be known in advance. When the number of clusters is not set in advance, one should carefully consider how to select it, as it can have a significant impact on the results (similar to the impact of the number of bins when using a histogram to evaluate the distribution of data). Typical similarity measures include Euclidean distances, mutual information, cosine similarity, and many others. The separation criteria include maximization of similarities inside clusters, minimization of similarities between clusters, and minimization of distances between cluster elements and cluster centers, among others.

5.5.5.7 K-mean clustering

In the general clustering problem, a training set $\{x_1, \ldots, x_n\}$ needs to be delineated into cohesive partitions, or clusters, such that the data points in a cluster are similar (or related) to one another and different from (or unrelated to) the data points in other clusters. Hence, for the clustering problem, $x_i \in \mathbb{R}^d$, but no labels $y_i$ are given. The clustering problem therefore belongs to the group of unsupervised learning problems. One of the well-known and frequently used clustering algorithms is k-means clustering.
Considering $k$ to be a parameter of the algorithm that stands for the number of clusters, and $m_j$ the temporarily assumed positions of the clusters' centers, the k-means clustering algorithm involves two main steps. In the first step, the cluster centroids $m_1, m_2, \ldots, m_k \in \mathbb{R}^d$ are initialized randomly. After this step, the inner loop is initiated; it iteratively assigns each training data point $x_i$ to the nearest cluster centroid $m_j$ and then moves each cluster centroid $m_j$ to the mean of the points assigned to it. More formally, the following procedure is repeated until convergence:

For every $i$, set

$$c_i := \arg\min_{j} \left\| x_i - m_j \right\|^2 \qquad (5.51)$$

For each $j$, set

$$m_j := \frac{\sum_{i=1}^{n} 1\{c_i = j\}\, x_i}{\sum_{i=1}^{n} 1\{c_i = j\}} \qquad (5.52)$$
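The two steps of Eqs. (5.51) and (5.52) can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function names, the `initial_centroids` argument, and the choice to draw the random starting centroids from the data points themselves are assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(points, k, initial_centroids=None, max_iter=100, seed=0):
    """Alternate the assignment step (5.51) and the update step (5.52)."""
    rng = random.Random(seed)
    # Assumption: random initialisation picks k of the data points,
    # unless the caller supplies explicit starting centroids.
    if initial_centroids:
        centroids = list(initial_centroids)
    else:
        centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Eq. (5.51): assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments unchanged, so the centroids are stable
        labels = new_labels
        # Eq. (5.52): move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, c in zip(points, labels) if c == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels, centroids
```

For the nine one-dimensional points 1, 2, 3, 10, 11, 12, 20, 21, 22 and the starting centroids 0, 10, and 25, this sketch converges after a single update to the centroids 2, 11, and 21.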

In Eqs. (5.51) and (5.52), $c_i$ is a temporary class label for the data point $x_i$. Once the algorithm converges, that is, once the centroids are stable (they do not move in the following iterations), the algorithm terminates. Fig. 5.15 illustrates an example of the k-means clustering algorithm with one-dimensional data for $k = 3$. Nonetheless, depending on the positions of the randomly initialized cluster centroids, the result might vary. To overcome this challenge, one can run the k-means algorithm several times and measure how well each clustering performed by summing up the variance across all the clusters. The solution with the lowest total variance indicates the most cohesive partitions. Another important aspect is the selection of the parameter $k$. One way to select $k$ is to try different values of $k$. Going from $k = 1$ to $k = n$, the total variance across clusters will be reduced by each added cluster and reach zero

for $k = n$, as there will be only one data point in each cluster. By plotting the total variance against the value of $k$, one can observe a point (a value of $k$) after which the variance no longer decreases as quickly as before. This is known as an “elbow plot” because, when plotted, one can select $k$ by finding the “elbow” in the plot (Fig. 5.16).

FIGURE 5.15 K-means example.

FIGURE 5.16 Elbow plot.

5.5.5.8 Cross-validation

Cross-validation is not a machine learning method per se, but it is a frequently used strategy integrated into many machine learning algorithms because of its simplicity and universality. Its universality lies in the data-splitting heuristic, as it assumes only that the data are identically distributed and that the data samples in the training and validation datasets are independent, an assumption that can even be relaxed. Thus, cross-validation can easily be integrated into nearly any algorithm in nearly any context, such as regression (Browne, 2000; Stone, 1978), classification (JiHyun, 2009; Semanjski et al., 2020), or density estimation (Hall, 1982), among others. This property is not shared by most other model selection or parameter estimation procedures, which are commonly specific to the application context and can be altogether misleading in another one.

Validation, also known as the hold-out (Wagner & Devroye, 1979), relies on a single split of the data. More formally, let $X^{(t)}$ be a nonempty proper subset of $\{1, \ldots, n\}$, such that both $X^{(t)}$ and its complement $X^{(v)} = (X^{(t)})^c = \{1, \ldots, n\} \setminus X^{(t)}$ are nonempty. The hold-out estimator of the performance of the algorithm $A$, $A(S_n)$, with the training set $X^{(t)}$, is given by:

$$\widehat{E}^{HO}\left(A; S_n; X^{(t)}\right) := \frac{1}{n_v} \sum_{i \in S_n^{(v)}} g\left(A\left(S_n^{(t)}\right); x_i\right) \qquad (5.53)$$

where $x_1, \ldots, x_n \in X$ denote some random variables with common distribution $P$ (the observations), $g : S \times X \to [0, \infty)$ is a contrast function, $S_n^{(t)} := (x_i)_{i \in X^{(t)}}$ is the training sample of size $n_t = |X^{(t)}|$, $S_n^{(v)} := (x_i)_{i \in X^{(v)}}$ is the validation sample of size $n_v = n - n_t$, and $X^{(v)}$ is called the validation set.

Cross-validation by averaging consists of averaging several hold-out estimators of the algorithm performance corresponding to diverse data splits. Let $B \geq 1$ be an integer and $X_1^{(t)}, \ldots, X_B^{(t)}$ a sequence of nonempty proper subsets of $\{1, \ldots, n\}$. The cross-validation by averaging estimator of the performance of $A(S_n)$, with training sets $(X_j^{(t)})_{1 \leq j \leq B}$, is defined by:

$$\widehat{E}^{CV}\left(A; S_n; \left(X_j^{(t)}\right)_{1 \leq j \leq B}\right) := \frac{1}{B} \sum_{j=1}^{B} \widehat{E}^{HO}\left(A; S_n; X_j^{(t)}\right) \qquad (5.54)$$

In general, all cross-validation by averaging estimators of the performance are calculated as defined above (Eq. 5.54). They differ only in the choice of the splitting scheme $(X_j^{(t)})_{1 \leq j \leq B}$.

Another type of cross-validation is cross-validation by voting (Yang, 2006), which is often used for model selection. When two algorithms $A_1$ and $A_2$ are compared, $A_1$ is selected by cross-validation by voting if and only if

$$\widehat{E}^{HO}\left(A_1; S_n; X_j^{(t)}\right) < \widehat{E}^{HO}\left(A_2; S_n; X_j^{(t)}\right)$$

for a majority of the splits $j = 1, \ldots, B$. However, given two models of similar performance, one should prefer the simpler model over the more complex one (Occam's razor), as a complex model has a greater chance of fitting the data accidentally.

Most often, cross-validation estimators split the data with a fixed size $n_t$ of the training set, that is, $|X_j^{(t)}| \approx n_t$ for every $j$. The two main types of splitting schemes given $n_t$ are exhaustive data splitting (that is, considering all training sets of size $n_t$) and partial data splitting.

Among exhaustive data splitting schemes, the most frequently used ones are leave-one-out (Stone, 1977) and leave-p-out (Shao, 1993). Leave-one-out corresponds to the choice $n_t = n - 1$, meaning that each data point is successively “left out” from the sample and used for validation. Leave-p-out with $p \in \{1, \ldots, n - 1\}$ is the exhaustive cross-validation with $n_t = n - p$, meaning that every possible subset of $p$ data points is successively “left out” of the sample and used for validation. Hence, leave-one-out is a special case of leave-p-out where $p = 1$.

Because considering $\binom{n}{p}$ training sets makes exhaustive data splitting cross-validation computationally expensive, partial data splitting schemes have been proposed as alternatives. Probably the best known and most widely used among them is v-fold cross-validation. v-fold cross-validation relies on a preliminary partitioning of the data into $v$ subsets, where $v \in \{1, \ldots, n\}$, of approximately equal cardinality $n/v$. Each subset then successively takes up the role of the validation sample. More formally, let $A_1, \ldots, A_v$ be some partition of $\{1, \ldots, n\}$ with $|A_j| \approx n/v$ for all $j$. Then, the v-fold cross-validation estimator of the performance of $A(S_n)$ is given by Eq. (5.54) with $B = v$ and $X_j^{(t)} = A_j^c$ for $j = 1, \ldots, B$:

$$\widehat{E}^{vF}\left(A; S_n; \left(A_j\right)_{1 \leq j \leq v}\right) = \frac{1}{v} \sum_{j=1}^{v} \left[ \frac{1}{|A_j|} \sum_{i \in A_j} g\left(\widehat{s}\left(S_n^{(-A_j)}\right); x_i\right) \right] \qquad (5.55)$$

where $S_n^{(-A_j)} = (x_i)_{i \in A_j^c}$ and $\widehat{s}(x_1, \ldots, x_n)$ is the estimator that $A$ builds from the sample $x_1, \ldots, x_n$. Hence, the computational cost of v-fold cross-validation is only $v$ times that of training $A$ with $n - n/v$ points. The advantages of v-fold cross-validation include the fact that all the data are utilized for both training and validation, but never at the same time. The $v$ iterations remove the risk of having one particular data set that would give biased results, and the performance can be statistically validated using a two-tailed z-test ($k > 30$) or a two-tailed t-test ($k < 30$).

Another partial data splitting scheme is balanced incomplete cross-validation, which is often used as an alternative to v-fold cross-validation when the training sample size ($n_t$) is small. Balanced incomplete cross-validation takes the form of Eq. (5.54) with training sets $(A^c)_{A \in H}$, where $H$ is a collection of $B > 0$ subsets of $\{1, \ldots, n\}$ of size $n_v = n - n_t$ such that the cardinality $|\{A \in H \text{ s.t. } k \in A\}|$ does not depend on $k \in \{1, \ldots, n\}$ and $|\{A \in H \text{ s.t. } k, l \in A\}|$ does not depend on $k \neq l \in \{1, \ldots, n\}$.

Repeated learning-testing is another partial data splitting scheme, where the estimator of the performance of $A$ takes the form of Eq. (5.54) with any $B > 0$ and any sequence $(X_j^{(t)})_{1 \leq j \leq B}$ of different subsets of $\{1, \ldots, n\}$ that are chosen randomly, without replacement, and independently of the data.

Other cross-validation schemes include Monte-Carlo cross-validation (which allows the same split to be chosen several times), bias-corrected versions of v-fold cross-validation and repeated learning-testing, generalized cross-validation, analytic approximation cross-validation, and leave-one-out bootstrap, among others. Other well-known techniques to perform model selection and parameter evaluation include Bayesian marginal likelihood, the Akaike information criterion, the Bayesian information criterion, and stepwise regression, among others.
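As an illustration of how the hold-out estimator (Eq. 5.53) and the v-fold estimator (Eq. 5.55) relate, the following sketch implements both in plain Python. The function names are hypothetical, the blocks are formed by simple interleaving of indices rather than by a random partition, and the "algorithm" is a toy that predicts the mean of the training values with a squared-error contrast; all of these are assumptions made for the example.

```python
def v_fold_partition(n, v):
    """Split the indices 0..n-1 into v blocks of (almost) equal size."""
    return [list(range(j, n, v)) for j in range(v)]


def hold_out_estimate(sample, validation_idx, train, loss):
    """Train on the complement of validation_idx, average the loss on it."""
    held_out = set(validation_idx)
    training = [x for i, x in enumerate(sample) if i not in held_out]
    model = train(training)
    total = sum(loss(model, sample[i]) for i in validation_idx)
    return total / len(validation_idx)


def v_fold_cv(sample, v, train, loss):
    """Average of the v hold-out estimates, one per block."""
    blocks = v_fold_partition(len(sample), v)
    return sum(hold_out_estimate(sample, b, train, loss) for b in blocks) / v


def train_mean(values):
    """Toy algorithm: the fitted 'model' is the mean of the training values."""
    return sum(values) / len(values)


def squared_error(model, x):
    """Toy contrast function: squared difference to the observation."""
    return (model - x) ** 2
```

Calling `v_fold_cv(sample, len(sample), train_mean, squared_error)` gives leave-one-out as the special case in which every block holds a single observation.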

5.5.6 Example classification: transport mode recognition

For many mobility studies, knowing the transport mode people use for their travel is a key element (Mc Fadden, 1978), as it allows a better understanding of people's travel behavior (Bohte & Kees, 2009; Chen et al., 2016) and improved management of traffic flows (Asakura et al., 2000), and ensures better-informed transport and urban planning (Magnanti & Wong, 1984). More details regarding the presence of transport modes in the smart city and smart mobility context, and the challenge of multimodality, are given in previous chapters (see Chapter 3, Multimodality). In this example, however, we focus on land transport modes, such as private motorized cars, public transport, bicycle use, or walking, and on the possibility of applying data analytics to infer transport modes from big data composed of: (i) mobile sensed data that describe users' mobility behavior and (ii) publicly available geographical information that is used to define the spatial context in which this mobility behavior took place. As a practical example, we use the city of Leuven (Flanders, Belgium) (Fig. 5.17), which is described in more detail in the section on transport mode competitiveness in urban areas (see Chapter 3, Multimodality).

FIGURE 5.17 The location of Leuven in Flanders.

Data on mobility behavior were collected via an Android smartphone application (called Routecoach) developed at Ghent University for the province of Flemish-Brabant within the framework of the Interreg IVb NWE project entitled “New Integrated Smart Transport Options” (NISTO) (NISTO, 2021). The data collection process lasted for 4 months and, overall, 8303 users actively participated by downloading the freely available application and collecting data on one or more trips within the Leuven area. In total, more than 30,000 trips were recorded, leading to about 400,000 km of recorded data in driving, public transport, biking, and walking. For this purpose, a trip was defined as a one-way movement between origin and destination points (e.g., home or work location) in the traffic network. The smaller parts of a trip were called trip legs and referred to segments of the trip made by utilizing different transport modes. To initiate data collection for a trip, the campaign participant first had to select the transport mode used for the trip among five preoffered transport modes (i.e., car, public transport, train, bike, or on foot). The option

“public transport” referred to the use of the bus service, as tram and metro services were not available in the local area. Reporting of transport mode changes during a trip was also made possible by implementing a simple drag-and-drop option, from the active transport mode to the new one, which minimized the user effort in recording the timestamp, location, and actual transport mode change. The data collection stopped when the user reached the destination and marked the end of the trip. In this manner, for every recorded trip, the transport mode and GNSS locations (with a frequency of 1 Hz) were known. Furthermore, all the participants had password-protected access to their personal trip records, where they were able to perform data quality control and correct any wrongly introduced trip or trip leg information via a user-friendly GIS interface. In total, nearly four million GNSS location points and timestamps were registered. The shortest recorded trip had merely 30 location points (equivalent to 58 m made by walking) while the longest one had 3047 points (equal to 802 km made by car) (Table 5.3).

TABLE 5.3 Dataset details.

Variable                                    Value
Number of data contributors                 8303
Number of transport modes                   Five
Data collection process duration            Four months
Number of collected GNSS points             3,960,234
Overall trips' length                       340,000
Median of number of GNSS points per trip    1243
Median of trips per data contributor        8.7

Considering the overall dataset, most of the trips were made by bike (56%), then by car (24%) and walking (11%), while the fewest were made by train (2%). Fig. 5.18 provides a map with an overview of the trips' spatial density, where trips made by bike are visualized in green and trips made by car in blue. Several distinctive road network features are immediately recognizable, despite the high density of overall trips, such as the ring road that surrounds the city, with major transport axes going from North to West of the area (European route E314) and radially around the main ring (national road N2), connecting Leuven to Brussels (North-West) on one side and Hasselt (North-East) on the other; N3, connecting from the Brussels (South-West) area toward the Walloon region (South-East); N264 in the South; and N19 and N26 in the North of the ring. Bike routes, in green, highlight the city center with its spacious pedestrian zone and the frequently used connections toward the university campuses that are located in the South-West and South-East areas outside the main ring road.

FIGURE 5.18 Bike (green, light gray in print) and car (blue, dark gray in print) trips in the Leuven area (Flanders) (Semanjski et al., 2017).

To be able to infer the transport mode, the spatial context of the respondent's movements was considered next to the GNSS traces. This spatial context was defined based on the characteristics of objects situated in the surroundings of the collected GNSS locations, and it comprises knowledge of these objects' existence, purpose, and proximity to the observing point of view (in this case, the GNSS location point of interest). For this, GIS data available from OpenStreetMap (OSM) (Haklay, 2010; Jiang & Thill, 2015)

were used, which include much mobility-oriented information (for instance, the locations of car sharing points, parking, train stations, railways, cycleways, etc.). Fig. 5.19 gives an example of spatial context awareness. For this example, three predictor variables are noted for the smartphone location: an EV charging station (overall there are 14 EV charging stations in Leuven), a public transport stop (overall there are 652 public transport stops), and a bike highway (Leuven is connected with two bike highways, one going West toward Brussels and one going North toward Antwerp). Depending on the relative position, the GNSS location of interest (smartphone location) can enter or leave the zone of influence of the predictor variable. For every trip, the spatial context of every observed location point was analyzed, and aggregated values (at the trip level) were used as input variables that describe the spatial context for the given trip. Moreover, for every trip, two location points, the starting and ending points, were treated separately to define additional input variables describing the spatial context of the trip's origin and destination. The complete list of predictor variables that were used in this model to define the spatial context of the mobile sensed trips is given in Table 5.4.

FIGURE 5.19 Spatially aware context (as in Semanjski et al., 2017).

Fig. 5.20 illustrates an example of the spatial context for a trip made by public transport. For clarity of visualization, the figure indicates only the part of the spatial context that relates to the respective transport mode. As one can notice in this example, although the starting and ending points of the public transport trip are not located in the close vicinity of recognized public transport stops, it is still noticeable that most of the trip's locations lie along and in the vicinity of the public transport lines. There can be several reasons for this, such as issues with the time-to-first-fix or multipath reflection, which are common in built-up areas.

TABLE 5.4 Model predictor variables (Semanjski et al., 2017).

Spatial context predictor variables
- Percentage of trip's points in vicinity of motorways (class 50 m)
- Percentage of trip's points in vicinity of trunk roads (class 50 m)
- Percentage of trip's points in vicinity of primary roads (class 50 m)
- Percentage of trip's points in vicinity of secondary roads (class 50 m)
- Percentage of trip's points in vicinity of tertiary roads (class 50 m)
- Percentage of trip's points in vicinity of unclassified roads (class 50 m)
- Percentage of trip's points in vicinity of residential, service, or living roads (class 50 m)
- Percentage of trip's points in vicinity of tracks, bridleways, or paths (class 50 m)
- Percentage of trip's points in vicinity of public transport lines (class 50 m)
- Percentage of trip's points in vicinity of railway (class 50 m)
- Percentage of trip's points in vicinity of cycleway (class 50 m)
- Trip start points in vicinity of bus station (class 50 m)
- Trip start points in vicinity of train station (class 50 m)
- Trip start points in vicinity of car parking, car wash, car sharing location, car repair, electric vehicle charging station (class 50 m)
- Trip start points in vicinity of bike parking (class 50 m)
- Trip end points in vicinity of bus station (class 50 m)
- Trip end points in vicinity of train station (class 50 m)
- Trip end points in vicinity of car parking, car wash, car sharing location, car repair, electric vehicle charging station (class 50 m)
- Trip end points in vicinity of bike parking (class 50 m)

Mobile sensed data, which describe users' mobility behavior, and geographical information, which defines the spatial context, are used as an input dataset to develop a support vector machine model that infers the utilized transport mode. In this context, the SVM can be trained to learn from the user-reported transport mode (labeled data) and apply this knowledge in inferring the transport mode from new datasets and records that have not been used for training and validation purposes. To achieve this, the overall dataset was first divided into two parts: Z1, used for training and validation (75% of all data), and Z2, used to test the results of the training process (the remaining 25% of the data) (Fig. 5.21).
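To make the trip-level predictors of Table 5.4 more concrete, the sketch below computes, for one class of map features, the percentage of a trip's GNSS points that fall within 50 m of any feature, together with the start-point and end-point indicators. It is only an illustration under simplifying assumptions: features are treated as points given as (latitude, longitude) pairs, whereas roads and lines in OSM are polylines, and the function and key names are hypothetical rather than taken from the cited study.

```python
import math

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def near_any(point, features, radius_m=50.0):
    """True if the point lies within radius_m of at least one feature."""
    return any(haversine_m(point, f) <= radius_m for f in features)


def spatial_context(trip, features, radius_m=50.0):
    """Trip-level predictors for one feature class (e.g. bus stops)."""
    inside = sum(near_any(p, features, radius_m) for p in trip)
    return {
        # Share of the trip's GNSS points inside the zone of influence.
        "pct_points_near": 100.0 * inside / len(trip),
        # Origin and destination are treated separately.
        "start_near": near_any(trip[0], features, radius_m),
        "end_near": near_any(trip[-1], features, radius_m),
    }
```

Repeating the call for every feature class and concatenating the resulting values would give one input vector per trip.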

FIGURE 5.20 Spatial visualization of public transport mode trip with highlighted areas around public transport lines (10, 30 and 50 m) and bus stops (yellow points, white dots in print) (Semanjski et al., 2017).

FIGURE 5.21 Dataset division and purposes.

A linear kernel function (Eq. 5.33) was used to transform the inputs to the feature space. The value of C (Eq. 5.29) was determined by a grid search within the range from 1 to 10 (with an increment of 1). The value with the best average 10-fold cross-validation accuracy was 3, and it was selected for further use. Following this, the selected value of capacity was applied to train an SVM classifier $d(x_n)$ using the entire training sample. The accuracy was assessed with the test data estimate (the proportion of cases in the test dataset that are misclassified by the classifier constructed from the learning dataset), calculated as follows:

$$R(d) = \frac{1}{N_2} \sum_{(x_n, j_n) \in Z_2} X\left(d(x_n) \neq j_n\right) \qquad (5.56)$$

where $N_2$ is the number of records in the test dataset and $X$ is the indicator function, for which:
- $X = 1$ if the statement $d(x_n) \neq j_n$ is true;
- $X = 0$ if the statement $d(x_n) \neq j_n$ is false.

The designed SVM model (Table 5.5) was able to correctly infer the transport mode utilized for trips in 94% of cases. The confusion was the highest for the transport mode foot, which was misclassified as a bike trip. At the other end, no misclassifications occurred for the transport modes bike and train (Fig. 5.22).

TABLE 5.5 SVM model summary (Semanjski et al., 2017).

Parameter                 Value
Number of independents    54
SVM type                  C-SVM
Kernel type               Linear
Number of SVs             344 (0 bounded)
Number of SVs (BIKE)      199
Number of SVs (BUS)       76
Number of SVs (CAR)       22
Number of SVs (FOOT)      39
Number of SVs (TRAIN)     8
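The test-sample estimate of Eq. (5.56) and a confusion matrix of the kind shown in Fig. 5.22 can be computed as sketched below. The sketch is independent of the classifier: any callable that maps an input to a transport mode can be passed in. The speed-threshold rule used here is a purely hypothetical stand-in for the trained SVM, and normalizing each row (actual mode) to percentages is an assumption about how such a matrix is usually presented.

```python
from collections import Counter


def misclassification_rate(classifier, test_set):
    """Eq. (5.56): share of test records (x_n, j_n) with d(x_n) != j_n."""
    errors = sum(1 for x, label in test_set if classifier(x) != label)
    return errors / len(test_set)


def confusion_matrix_percent(classifier, test_set, modes):
    """Rows are actual modes, columns predicted modes, cells in % of the row."""
    counts = Counter((label, classifier(x)) for x, label in test_set)
    matrix = {}
    for actual in modes:
        row_total = sum(counts[(actual, predicted)] for predicted in modes)
        matrix[actual] = {
            predicted: (100.0 * counts[(actual, predicted)] / row_total
                        if row_total else 0.0)
            for predicted in modes
        }
    return matrix


def speed_rule(mean_speed_kmh):
    """Hypothetical stand-in classifier based only on mean trip speed."""
    if mean_speed_kmh < 7:
        return "foot"
    if mean_speed_kmh < 25:
        return "bike"
    return "car"
```

The accuracy reported for a model is then simply one minus the misclassification rate on the test dataset Z2.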

FIGURE 5.22 Confusion matrix (each column represents the percentage of trips made with the predicted transport mode while each row represents the percentage of trips made with an actual transport mode).

5.5.7 Example regression: travel time estimation

Travel time information is one of the key quantitative performance indicators of the mobility system, and it is widely used in many contexts and applications, such as dynamic route guidance (Kim et al., 2011; Xu et al., 2013; Yu et al., 2013), traveler information systems (Yin et al., 2002; Yu et al., 2010), or traffic management systems (Semanjski et al., 2019; Van Gheluwe et al., 2020). Another significant characteristic of travel time is that its relevance across diverse groups of stakeholders (decision-makers, transport system users, transport planners, etc.) is widely acknowledged and easily understood (Lyons & Urry, 2005); it is often used as a relevant quantifier when assessing the competitiveness and performance of different transport modes (Brnjac & Ćavar, 2009). Travel time is also one of the largest costs of transport, and its savings are frequently the primary justification for transport infrastructure interventions and improvements (Malchow et al., 1996).

From the perspective of data analysis, travel time is a continuous variable and, as such, regression methods can be used for its forecasting. In the literature, one can find different approaches to travel time forecasting. Yu et al. (2010) used support vector machines to predict bus arrival times at bus stations, Zong et al. (2012) applied a genetic algorithm to forecast daily commute travel times in Beijing, and Simroth and Zähle (2011) used a nonparametric distribution-free regression model. These forecasts are mainly made using GNSS-based data (Anusha et al., 2012; Huang et al., 2013), survey data (Sun et al., 2008), or data from different types of road detectors (Yu et al., 2010; Zheng et al., 2006), whereas the use of several data sources is considered less frequently (Lum et al., 1998). When it comes to the factors affecting travel time, the literature review identifies as relevant the free-flow travel speed, the occurrence of incident situations, holidays or other uncommon events, the congestion level, and weather conditions (Ćavar et al., 2011; Semanjski, 2015).


The following example gives an overview of the use of multiple data sources (road network data, GNSS, and weather forecast data) and analyzes the results of four supervised machine learning techniques (SVM, k-nearest neighbors, boosting trees, and random forest) in forecasting travel times on five different road categories in an urban setting. The use of multiple data sources allows one to consider several diverse influences while forecasting, but it also comes with challenges regarding data fusion, storage, and processing. For instance, when considering traditional trip diaries, information on the weekly trips of one person represents about 20 kilobytes of data. Today, just the GNSS tracks of the weekly trips of one person are of a different order of magnitude (around 10-20 megabytes) (Ćavar et al., 2011). Other variables and data sources add to this.

For the following example, three data sources were used: (i) spatiotemporal data from vehicles' GNSS tracks, (ii) a road network infrastructure database, and (iii) meteorological data collected by the network of sensors provided by the National Meteorological and Hydrological Service in the study area (DHMZ, 2021). The first two data sources were fused and map matched based on the spatial coordinates, while the third one was fused based on both spatial and temporal components. The resulting database contained 39 gigabytes of data. The spatiotemporal data were collected from 300 probe vehicles that used the urban road network in the City of Zagreb (the capital of Croatia) and were provided by Mireo (Mireo, 2021), a company specialized in mobility data collection and analytics. The GNSS devices installed in the vehicles were able to send data via the mobile network to the central database. These data included:
- vehicle ID,
- information on the location (x and y coordinates),
- vehicle speed,
- vehicle course,
- logging time.

Additionally, in the database, these data were enriched with attributes on the day of the week, special events (holidays, scheduled traffic flow disturbances, school days, etc.), historical average speeds recorded for the same road, and the standard deviation of historical speed records. The infrastructural database contained information on the overall urban road network, including:
- identification of the portion of the road between two sequential intersections (road segment),
- length of the road segment,
- road name,
- start and end coordinates of every road segment,
- directionality of the road segment (one-directional street or not),
- traffic modes that are allowed to use the road segment,
- number of traffic lanes per direction.

Meteorological data were obtained from the sensor network and provided by the national meteorological and hydrological service. This data set included information on:

- air temperature,
- ground temperature,
- pavement condition (wet or not),
- snow (falling or not, and the thickness of the new snow cover on the road),
- rain,
- humidity,
- wind (direction and strength),
- horizontal visibility.
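The fusion described above (spatial matching of GNSS records to road segments, followed by spatial and temporal matching of weather observations) can be sketched as follows. This is only a minimal illustration and not the pipeline used in the study: the record layouts, the field names, the nearest-segment matching rule, the one-hour weather window, and the sample values are all assumptions made for the example.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class GnssRecord:
    vehicle_id: str
    x: float          # planar coordinates in metres (assumed)
    y: float
    speed_kmh: float
    time_s: int       # logging time, seconds since an arbitrary epoch


@dataclass
class RoadSegment:
    segment_id: str
    x_start: float
    y_start: float
    x_end: float
    y_end: float


@dataclass
class WeatherObs:
    station_x: float
    station_y: float
    time_s: int
    visibility_m: float
    raining: bool


def distance_to_segment(px, py, seg):
    """Shortest planar distance from a point to a straight road segment."""
    dx, dy = seg.x_end - seg.x_start, seg.y_end - seg.y_start
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return hypot(px - seg.x_start, py - seg.y_start)
    # Project the point onto the segment and clamp to its end points.
    t = ((px - seg.x_start) * dx + (py - seg.y_start) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return hypot(px - (seg.x_start + t * dx), py - (seg.y_start + t * dy))


def match_segment(rec, segments):
    """Spatial fusion: assign the record to the closest road segment."""
    return min(segments, key=lambda s: distance_to_segment(rec.x, rec.y, s))


def match_weather(rec, observations, max_age_s=3600):
    """Spatiotemporal fusion: nearest station first, then closest in time."""
    recent = [o for o in observations if abs(o.time_s - rec.time_s) <= max_age_s]
    if not recent:
        return None
    return min(
        recent,
        key=lambda o: (hypot(o.station_x - rec.x, o.station_y - rec.y),
                       abs(o.time_s - rec.time_s)),
    )


def fuse(rec, segments, observations):
    """Combine one GNSS record with its road segment and weather context."""
    seg = match_segment(rec, segments)
    obs = match_weather(rec, observations)
    return {
        "vehicle_id": rec.vehicle_id,
        "segment_id": seg.segment_id,
        "speed_kmh": rec.speed_kmh,
        "time_s": rec.time_s,
        "visibility_m": obs.visibility_m if obs else None,
        "raining": obs.raining if obs else None,
    }


# Hypothetical sample data
segments = [RoadSegment("s1", 0, 0, 100, 0), RoadSegment("s2", 0, 50, 100, 50)]
weather = [
    WeatherObs(0, 0, 1000, 500.0, True),
    WeatherObs(0, 0, 5000, 9000.0, False),
    WeatherObs(1000, 1000, 1000, 8000.0, False),
]
fused = fuse(GnssRecord("v1", 40.0, 10.0, 47.5, 1200), segments, weather)
print(fused)
```

In practice, map matching on a real network usually also considers vehicle course and network topology; the nearest-segment rule is used here only to keep the sketch short.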

Before performing the travel time forecast, a hybrid approach based on the spatiotemporal and infrastructural data was used to classify the roads. For this purpose, multiple regression and factor analysis were used to identify the parameters that influence the road classification the most: the historical average speed, the standard deviation of the historical speed, the length of the road segment, and the count of the vehicles that used the road segment. Figs. 5.23 and 5.24 give an overview of the five road classes based on the most distinctive variables (more details on the road classification approach can be found in Cavar et al., 2011).

FIGURE 5.23 Summary of road class differences based on the average speed in km/h and standard deviation of the speed (Semanjski, 2015).

FIGURE 5.24 Summary of road class differences based on the average length of road segments (in meters, right y-axis) and the average number of records per road segment (count, left y-axis) (Semanjski, 2015).

For the travel time forecasting, four supervised machine learning regression techniques (k-nearest neighbors, support vector machines, boosting trees, and random forest) were applied to each of the five road classes. To illustrate the results of the travel time forecasting techniques, representative roads (the test sample) were selected for each road class. Table 5.6 gives the names of the selected roads and a brief description of their characteristics. Fig. 5.25 shows their spatial distribution in the urban road network of the City of Zagreb.

TABLE 5.6 Selected representatives for the travel time forecasting for each road class.

Road class | Road name | Brief description
Class A | Zagrebacka avenija; Slavonska avenija | Main urban network avenues, several traffic lanes for each direction, and lanes for different directions usually have a physical barrier between them
Class B | Avenija Dubrovnik; Selska cesta; Avenija grada Vukovara | Main urban avenues with several lanes for each direction (sometimes separated by the tram lines in the middle)
Class C | Savska cesta; Kralja Zvonimira; Maksimirska | Major urban roads with one or more lanes per direction, tram lines are mainly integrated into the road surface
Class D | Mirogojska cesta; Prisavlje | Urban roads, both directions present, smaller traffic flow than for class C, no tram tracks
Class E | Harambasiceva; Jordanovac; Hrvatskog proljeca | Urban roads within neighborhoods, one or two traffic directions, usually narrower traffic lanes than for previous classes

FIGURE 5.25 Spatial distribution of the selected test samples for each road category (Semanjski, 2015).

To gain better insight into the travel time forecasts across the different techniques and road classes, the mean absolute percentage error (MAPE) was used, defined as:

$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{A_i - F_i}{A_i}\right| \tag{5.57}$$

where $A_i$ stands for the observed value and $F_i$ for the forecasted value. The largest MAPE was observed for road class E and the support vector machines technique (Fig. 5.26). Although the forecasting residuals for this class were the smallest when measured in minutes, the error was the highest in relative terms (that is, relative to the road length and the travel time needed to traverse the whole road). Overall, road class E had the largest differences in forecast error among the considered techniques, whereas road classes C and D had the smallest. Furthermore, the analysis indicated that for roads with higher speeds and longer road segments, the supervised machine learning techniques yield the most uniform results. More demanding, in this urban context, seem to be the local and arterial roads, where the choice of an adequate travel time forecasting technique is crucial.

FIGURE 5.26 MAPE values for travel time forecast for all road categories and kNN, SVM, boosting trees (BT), and random forest (RF) technique.

Based on the obtained results, if one cannot afford a complex travel time forecasting system that takes different road categories into account, the use of the k-nearest neighbor (kNN) technique or random forest seems advisable, as these two yield the lowest overall error across the different road categories. Nevertheless, it should be noted that the kNN technique was more computationally expensive. One reason for this is the use of cross-validation to estimate the optimal size of the neighborhood. However, this step does not need to be repeated for every travel time forecast; it can be performed once, with the resulting value stored and recalled before every forecast calculation. Another detail worth noting about the kNN method is that it achieved the highest accuracy for the smallest neighborhood sizes (in real-time forecasting this would correspond to a short forecasting period, as only a small number of neighboring records would be needed).

If one were to design a travel time forecasting system that considers different road classes, the results indicate that the kNN technique (or, alternatively, random forest) would be a good option for higher-speed roads with long road segments (like highways) and for urban streets with an average speed of around 50 km/h and an average road segment length of more than 150 m. For urban roads with an average speed close to 40 km/h, kNN and boosting trees yielded the highest accuracy, and for local and residential roads the random forest technique did.

Furthermore, the results of the analysis suggest that it is worthwhile to invest effort in bringing multiple data sources together as, in the example, all three data sources had high importance when forecasting travel times. Nevertheless, one should be selective when choosing among the variables available from these sources, as not all were equally important. Among the GNSS data, the most significant variables were the standard deviation of the speeds on the given road with respect to the time of day, the frequency of road use, and the position of the vehicle along the road. Regarding the urban road network data, the most informative was the average length of the road segments (an indication of how often the traffic flow is interrupted). For the meteorological data, the most informative were the horizontal visibility and precipitation (for instance, whether it is snowing or raining, and whether the road surface is wet or frozen). Meteorological information also had a higher impact on the travel time forecasts for roads with higher speeds, as weather conditions influenced travel times there more severely.
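The remark above about kNN, that the neighborhood size can be estimated once by cross-validation and then reused, can be illustrated with a small sketch that scores candidates with the MAPE of Eq. (5.57). It is a simplified stand-in and not the setup of the cited study: the leave-one-out scheme, the Euclidean distance, the unweighted averaging of neighbors, and the toy feature values are assumptions chosen for brevity.

```python
from math import sqrt


def mape(actual, forecast):
    """Mean absolute percentage error, Eq. (5.57), returned as a fraction."""
    return sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def knn_predict(train_x, train_y, query, k):
    """Forecast as the mean target of the k training points nearest to query."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: sqrt(sum((p - q) ** 2 for p, q in zip(train_x[i], query))),
    )
    return sum(train_y[i] for i in order[:k]) / k


def choose_k(xs, ys, candidates):
    """Select the neighborhood size with the lowest leave-one-out MAPE.

    This is the expensive step; it can be run once and the result stored.
    """
    best_k, best_err = None, float("inf")
    for k in candidates:
        forecasts = []
        for i in range(len(xs)):
            # Hold out record i and forecast it from all remaining records.
            rest_x = xs[:i] + xs[i + 1:]
            rest_y = ys[:i] + ys[i + 1:]
            forecasts.append(knn_predict(rest_x, rest_y, xs[i], k))
        err = mape(ys, forecasts)
        if err < best_err:
            best_k, best_err = k, err
    return best_k, best_err


# Hypothetical records: (historical mean speed in km/h, visibility in km)
# mapped to an observed travel time in seconds.
features = [(52.0, 10.0), (50.0, 9.0), (48.0, 2.0), (35.0, 8.0), (33.0, 1.5), (30.0, 1.0)]
travel_times = [95.0, 100.0, 120.0, 150.0, 175.0, 190.0]

stored_k, cv_error = choose_k(features, travel_times, candidates=[1, 2, 3])
forecast = knn_predict(features, travel_times, (49.0, 5.0), stored_k)
print(stored_k, round(cv_error, 3), round(forecast, 1))
```

In a real application the features would normally be scaled before computing distances, since speed and visibility are on different scales.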

5.6 Data anonymization

Data anonymization refers to a group of techniques used to anonymize data. For data to be truly anonymized, they must be processed irreversibly, in such a manner that they can no longer be used to identify an individual by using “all the means likely reasonably to be used” (European Community, 1995). Overall, there are two main approaches, or strategies, to anonymization: randomization and generalization.

5.6.1 Randomization

Randomization refers to a group of techniques that alter the veracity of the data. The aim is to increase uncertainty sufficiently that the data can no longer be attributed to a specific individual. Randomization techniques include noise addition, permutation, and differential privacy, among others.

The noise addition technique consists of intentionally introducing noise into the dataset prior to publication by modifying features so that they become less accurate while retaining the overall distribution. An analyst processing such an anonymized dataset will assume that the values are accurate, but this will be correct only to a certain degree. For instance, if the coordinates of an individual's home location, as the starting location of trips in a tracking dataset, were originally observed to the nearest meter, the anonymized dataset may contain locations that are accurate to only several hundred meters in latitude and longitude. This should make the home location of an individual, and thus the individual, not identifiable from the dataset. Hence, if this randomization strategy is applied effectively, an independent observer should be able neither to identify individuals nor to detect the noise addition principles and, consequently, should not be able to restore the data to their original values. However, the level of noise introduced is always a trade-off between the level of information required (e.g., for specific mobility data analytics) and the impact on individuals' privacy of disclosing the protected features (European Commission, 2014). For this reason, noise addition is often used in combination with other anonymization techniques, such as the removal of obvious attributes and quasi-identifiers (identifiers that do not uniquely identify an individual in most cases but can do so in some instances or when combined with other quasi-identifiers).

The permutation technique consists of shuffling the values of features so that some of them are artificially associated with different individuals in the dataset. This technique is particularly convenient when it is relevant to retain the exact distribution of each feature within the dataset. Like noise addition, permutation may not provide anonymization by itself and should always be combined with the removal of obvious features and quasi-identifiers.

Differential privacy differs somewhat from the previous techniques, as it is used when anonymized views of a dataset need to be created. These views are typically produced at the request of an authorized third party and through a subset of queries. Every time a subset of queries is made, differential privacy, through its mathematical definition, makes it possible to calibrate the noise and to balance data privacy and data utility in a quantifiable way. However, differential privacy affects only the results of the queries (the data views) and does not change the original data. Thus, it is not an irreversible process, and as long as the original data remain, it will still be possible to identify individuals from them. Hence, such results also have to be considered personal data.
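The three randomization techniques can be illustrated with a short sketch. It is a toy example under stated assumptions, not a vetted privacy implementation: the Gaussian noise model, the 200 m standard deviation, the record layout, and the function names are hypothetical, and the differential privacy part shows only the basic Laplace mechanism for a single counting query.

```python
import random


def add_noise(coords, sigma_m, rng):
    """Noise addition: perturb planar (x, y) coordinates given in metres."""
    return [(x + rng.gauss(0.0, sigma_m), y + rng.gauss(0.0, sigma_m))
            for x, y in coords]


def permute_feature(records, feature, rng):
    """Permutation: shuffle one feature across records.

    The set of values (and hence the feature's distribution) is unchanged,
    but values are no longer tied to the individuals they came from.
    """
    values = [r[feature] for r in records]
    rng.shuffle(values)
    return [{**r, feature: v} for r, v in zip(records, values)]


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Differentially private view of a counting query.

    A count changes by at most 1 when one individual is added or removed,
    so the noise scale is 1 / epsilon. The original records are not altered.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(2023)  # fixed seed only to make the sketch repeatable

homes = [(457800.0, 5073200.0), (459150.0, 5074020.0)]
print(add_noise(homes, sigma_m=200.0, rng=rng))

trips = [
    {"person": "p1", "mode": "tram", "trip_km": 3.2},
    {"person": "p2", "mode": "car", "trip_km": 11.5},
    {"person": "p3", "mode": "bike", "trip_km": 2.4},
]
print(permute_feature(trips, "trip_km", rng))
print(private_count(trips, lambda r: r["trip_km"] > 3.0, epsilon=0.5, rng=rng))
```

A smaller epsilon gives stronger privacy and noisier answers; repeated queries consume privacy budget, which this sketch does not track.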

5.6.2 Generalization

Generalization is the second main strategy among anonymization techniques. It consists of generalizing the features of individuals by modifying the respective scale or creating a broader categorization, for instance, generalizing a home location from street-level precision to the neighborhood level, or using a week as the time reference rather than a day. Although generalization can be effective in preventing the isolation of some or all records that identify an individual in the dataset (singling out), it necessitates dedicated and sophisticated quantitative approaches to avert linkability (the ability to link records concerning the same individual, or a group of individuals, either in the same database or across different databases) and inference (the possibility to deduce, with significant probability, the value of a feature from other features) (Victor et al., 2016). Generalization techniques include aggregation and k-anonymity, with their extensions l-diversity and t-closeness.

Aggregation and k-anonymity techniques ensure that any individual in the dataset is indistinguishable from at least k − 1 other individuals in terms of quasi-identifier attribute values (Zhang et al., 2019). To achieve this, the feature values are aggregated to the extent that everyone in a group shares the same value. For instance, numerical attributes such as trip length or age can be generalized into intervals, such as trip lengths between 2 and 5 km or the age group of 20–29 years.

L-diversity is a refinement of k-anonymity that aims to ensure that, in each equivalence class, every attribute has at least l different, well-represented values. Limiting the occurrence of equivalence classes with poor attribute variability leaves a potential attacker who has background knowledge of a specific individual with significant uncertainty.

T-closeness is a further extension of l-diversity that ensures that the distance between the distribution of a sensitive feature in an equivalence class and the distribution of that feature in the whole dataset is no more than a threshold t. This technique is convenient when it is relevant to keep the data as close as possible to the original.
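A minimal sketch of generalization followed by k-anonymity and l-diversity checks is given below. The record layout, the band boundaries, and the choice of trip purpose as the sensitive feature are assumptions for illustration, and the l-diversity check implements only the simplest "distinct values" variant.

```python
from collections import defaultdict


def age_band(age, width=10):
    """Generalize an exact age into a band such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def trip_length_band(km):
    """Generalize an exact trip length into a coarse interval."""
    if km < 2:
        return "<2 km"
    if km <= 5:
        return "2-5 km"
    return ">5 km"


def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].append(r)
    return groups


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every equivalence class contains at least k records."""
    classes = equivalence_classes(records, quasi_identifiers)
    return all(len(group) >= k for group in classes.values())


def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """True if every class has at least l distinct sensitive values."""
    classes = equivalence_classes(records, quasi_identifiers)
    return all(len({r[sensitive] for r in group}) >= l
               for group in classes.values())


# Hypothetical raw records
raw = [
    {"age": 23, "trip_km": 3.1, "purpose": "work"},
    {"age": 27, "trip_km": 4.5, "purpose": "leisure"},
    {"age": 21, "trip_km": 2.2, "purpose": "work"},
    {"age": 34, "trip_km": 7.0, "purpose": "shopping"},
    {"age": 38, "trip_km": 9.4, "purpose": "shopping"},
]
generalized = [
    {"age": age_band(r["age"]),
     "trip_km": trip_length_band(r["trip_km"]),
     "purpose": r["purpose"]}
    for r in raw
]

qi = ["age", "trip_km"]
print(is_k_anonymous(raw, qi, 2))
print(is_k_anonymous(generalized, qi, 2))
print(is_l_diverse(generalized, qi, "purpose", 2))
```

In this toy data the generalized records satisfy 2-anonymity, but one equivalence class contains a single trip purpose, so anyone known to fall in that class has their trip purpose revealed. This is the kind of inference that l-diversity is meant to rule out.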

5.6.3 Pseudonymization

Pseudonymization is a different type of technique, closely related to anonymization and often mentioned in the same context. However, pseudonymization focuses on reducing the linkability of a dataset with the original identity of an individual. This is achieved by substituting a recognizable, or unique, feature in a dataset with another (pseudo) feature, for instance, replacing personal names in the dataset with unique numerical identifiers. The result of this process is not an anonymized dataset, as it is likely to still allow indirect identification of an individual. Hence, it is mainly used as a security measure in combination with the anonymization techniques described above. The most frequently used pseudonymization techniques include encryption, the use of a hash function, and tokenization.

Encryption with a secret key involves the use of a single secret key to encrypt and decrypt the dataset. In such cases, the personal data are still included in the dataset, though in an encrypted form, and anyone with knowledge of the key remains in a position to trivially reidentify each data subject through the decryption process.

Pseudonymization with a hash function involves the use of a function that returns a fixed-size output from an input of arbitrary size. Contrary to encryption, it cannot be reversed. Special cases of the hashing process involve the use of additions or keys. For instance, a salted hash function involves adding a “salt”, made up of random bits, to each feature before hashing it, and a keyed-hash function involves the addition of a secret key before hashing. In the latter case, the use of a key allows the creation of several different pseudonyms for the same feature, depending on the choice of the specific key.

Finally, tokenization refers to the use of tokens, pieces of data with no extrinsic or exploitable meaning, that stand in for the more valuable piece of information, the original feature. Tokens do not alter the type or length of the original data and are not mathematically derived from it. Tokens can be mapped back to the original feature through a tokenization system which, for this reason, must be secured and validated using security best practices.
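The hashing and tokenization variants can be sketched with the Python standard library. The sketch makes several assumptions: SHA-256 as the hash function, HMAC as the keyed-hash construction, and a hypothetical in-memory `TokenVault` class as the tokenization system. Unlike the tokens described above, the tokens generated here are fixed-length hexadecimal strings and do not preserve the type or length of the original value. Secret-key encryption is omitted because the standard library provides no symmetric cipher.

```python
import hashlib
import hmac
import secrets


def salted_hash(value, salt):
    """Pseudonym from a hash of the random salt followed by the value."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


def keyed_hash(value, key):
    """Pseudonym from a keyed hash (HMAC); a different key gives a
    different pseudonym for the same value."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


class TokenVault:
    """Hypothetical tokenization system: random tokens plus a lookup table.

    Tokens are not derived from the value, so the mapping itself must be
    stored and protected.
    """

    def __init__(self):
        self._value_to_token = {}
        self._token_to_value = {}

    def tokenize(self, value):
        if value not in self._value_to_token:
            token = secrets.token_hex(8)
            while token in self._token_to_value:  # avoid rare collisions
                token = secrets.token_hex(8)
            self._value_to_token[value] = token
            self._token_to_value[token] = value
        return self._value_to_token[value]

    def detokenize(self, token):
        return self._token_to_value[token]


salt = secrets.token_bytes(16)
key_a = secrets.token_bytes(32)
key_b = secrets.token_bytes(32)

print(salted_hash("vehicle-0042", salt))
print(keyed_hash("vehicle-0042", key_a))
print(keyed_hash("vehicle-0042", key_b))  # same feature, different pseudonym

vault = TokenVault()
token = vault.tokenize("vehicle-0042")
print(token, vault.detokenize(token))
```

The vault is the only way back from a token to the original value, which is why the text stresses that the tokenization system itself must be secured.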


References

Akossou, A.Y.J., Palm, R., 2013. Impact of data structure on the estimators R-square and adjusted R-square in linear regression. International Journal of Mathematics and Computation 20 (3), 84–93.
Anguita, D., Oneto, L., 2011. In-sample model selection for support vector machines. IEEE, San Jose, USA.
Anscombe, F.J., 1973. Graphs in statistical analysis. American Statistician 27 (1), 17–21.
Anusha, S.P., Anand, R.A., Vanajakshi, L., 2012. Data fusion based hybrid approach for the estimation of urban arterial travel time. Journal of Applied Mathematics 2012.
Appel, R., Fuchs, T., Dollar, P., Perona, P., 2013. Quickly boosting decision trees – pruning underachieving features early. Atlanta, USA, pp. 594–602.
Asakura, Y., Tanabe, J., Lee, Y., 2000. Characteristics of positioning data for monitoring travel behaviour. ERTICO, Tokio, Japan.
Beare, M., Howard, M., Payne, S., Watson, P., 2021. Inspire. Available at: https://inspire.ec.europa.eu/documents/development-technical-guidance-inspire-transformationnetwork-service.
Bohte, W., Kees, M., 2009. Deriving and validating trip purposes and travel modes for multi-day GPS-based travel surveys: A large-scale application in The Netherlands. Transportation Research C: Emerging Technologies 17 (3), 285–297.
Borges, A.F.S., Laurindo, F.J.B., Spínola, M.M., Gonçalves, R.F., Mattos, C.A., 2021. The strategic use of artificial intelligence in the digital era: Systematic literature review and future research directions. International Journal of Information Management 57, 102225.
Breiman, L., 2001. Random forests. Machine Learning 45 (1), 5–32.
Brnjac, N., Cavar, I., 2009. Example of positioning intermodal terminals on inland waterways. PROMET-Traffic & Transportation 21 (6), 433–439.
Browne, M.W., 2000. Cross-validation methods. Journal of Mathematical Psychology 44 (1), 108–132.
Cavar, I., Kavran, Z., Petrovic, M., 2011. Hybrid approach for urban roads classification based on GPS tracks and road subsegments data. PROMET-Traffic & Transportation 23 (4), 289–296.
Cavar, I., Petrovic, M., Kavran, Z., 2014. Small urban area transportation planning. In: Pratelli, A. (Ed.), Urban street design & planning. WIT Press, Southampton, pp. 1–10.
Celisse, A., Mary-Huard, T., 2018. Theoretical analysis of cross-validation for estimating the risk of the k-nearest neighbor classifier. The Journal of Machine Learning Research 19 (1), 2373–2426.
Chapelle, O., Vapnik, V., 2000. Model selection for support vector machines.

Chee, C.-H., Jaafar, J., Aziz, I.A., Hasan, M.H., Yeoh, V., 2019. Algorithms for frequent itemset mining: A literature review. Artificial Intelligence Review 52, 2603–2621.
Chen, C., Ma, J., Susilo, Y., Liu, Y., Wang, M., 2016. The promises of big data and small data for travel behavior (aka human mobility) analysis. Transportation Research C: Emerging Technologies 68, 285–299.
DHMZ, 2021. Meteo.hr. Available at: https://meteo.hr/. (Accessed 23 April 2021).
European Commission, 2014. Opinion 05/2014 on anonymisation techniques. European Commission, Brussels, Belgium.
European Commission Joint Research Centre, 2021. Inspire. Available at: https://inspire.ec.europa.eu/documents/conceptual-model-developing-interoperability-specifications-spatial-data-infrastructures. (Accessed 12 May 2021).
European Community, 1995. Directive 95/46/EC. European Community, Brussels, Belgium.
Everitt, B.S., Landau, S., Leese, M., Stahl, D., 2011. Miscellaneous clustering methods. In: Cluster analysis, 5th ed. John Wiley & Sons, Chichester, UK.
Ghent University, 2016. Move. Available at: http://move2.ugent.be/. (Accessed 20 March 2016).
Haklay, M., 2010. How good is volunteered geographical information? A comparative study of OpenStreetMap and ordnance survey datasets. Environment and Planning B: Planning and Design 37, 682–703.
Hall, P., 1982. Cross-validation in density estimation. Biometrika 69 (2), 383–390.
Hall, P., Park, B.U., Samworth, R.J., 2008. Choice of neighbor order in nearest-neighbor classification. The Annals of Statistics 36 (5), 2135–2152.
Holevo, A.S., 2006. Statistical problems in quantum physics. In: Maruyama, G., Prokhorov, Y.V. (Eds.), Proceedings of the second Japan-USSR Symposium on probability theory. Springer, Berlin, pp. 104–119.
Howard, M., Payne, S., Sunderland, R., 2021. Inspire. Available at: https://inspire.ec.europa.eu/documents/technical-guidance-inspire-schema-transformation-network-service. (Accessed 11 May 2021).
Huang, Y., Xu, L., Kuang, X., 2013. Urban road travel time prediction based on taxi GPS data. In: Improving multimodal transportation systems-information, safety, and integration, Wuhan, China, 2013, Vol 2013, pp. 1076–1083.
Ji-Hyun, K., 2009. Estimating classification error rate: Repeated cross-validation, repeated hold-out and bootstrap. Computational Statistics & Data Analysis 53 (11), 3735–3745.
Jiang, B., Thill, J.-C., 2015. Volunteered Geographic Information: Towards the establishment of a new paradigm. Computers, Environment and Urban Systems 53, 1–3.


Joo, S., Oh, C., Jeong, E., Lee, G., 2015. Categorizing bicycling environments using GPS-based public bicycle speed data. Transportation Research Part C: Emerging Technologies 56, 239–250.
Kim, M., Miller-Hooks, E., Nair, R., 2011. A geographic information system-based real-time decision support framework for routing vehicles carrying hazardous materials. Journal of Intelligent Transportation Systems: Technology, Planning, and Operations 15 (1), 28–41.
Klambauer, G., Unterthiner, T., Mayr, A., Hochreiter, S., 2017. Self-normalizing neural networks. In: Proceedings of the 31st international conference on neural information processing systems. Curran Associates Inc., New York, pp. 972–981.
Lum, K., Fan, H., Olszewski, S., 1998. Speed-flow modeling of arterial roads in Singapore. Journal of Transportation Engineering 124 (6), 213–222.
Lyons, G., Urry, J., 2005. Travel time use in the information age. Transportation Research A: Policy and Practice 39 (2–3), 257–276.
Magnanti, T.L., Wong, R.T., 1984. Network design and transportation planning: Models and algorithms. Transportation Science 18 (1), 1–55.
Malchow, M., Kanafani, A., Varaiya, P., 1996. The economics of traffic information: A state-of-the-art report. Institute of Transportation Studies, University of California at Berkeley, Berkeley.
Mc Fadden, D.L., 1978. Quantitative methods for analyzing travel behaviour of individuals: Some recent developments. In: Behavioural travel modelling. Croom Helm London, London, UK, pp. 279–318.
Mendenhall, W.M., Sincich, T.L., 2015. Statistics for engineering and the sciences, 6th ed. CRC Press, London.
Minsky, M., 2007. The emotion machine: Commonsense thinking, artificial intelligence, and the future of the human mind. Simon and Schuster, New York, USA.
Mireo, 2021. Mireo. Available at: https://www.mireo.hr/. (Accessed 16 October 2021).
Nigsch, F., Bender, A., Buuren, B.V., Tissen, J., Nigsch, E., Mitchell, J.B.O., 2006. Melting point prediction employing k-nearest neighbor algorithms and genetic parameter optimization. Journal of Chemical Information and Modeling 46 (6), 2412–2422.
NISTO, 2021. NISTO. Available at: http://www.nistoproject.eu/home.html. (Accessed 9 September 2021).
Peixoto, J.L., 1990. A property of well-formulated polynomial regression models. The American Statistician 44 (1), 26–30.
Poloczek, J., Andre, N., Kramer, O., 2014. KNN regression as geo-imputation method for spatio-temporal wind data. Advances in Intelligent Systems and Computing 299, 185–193.


Semanjski, I., 2015. Analysed potential of big data and supervised machine learning techniques in effectively forecasting travel times from fused data. PROMET-Traffic & Transportation 27 (6), 515–528.
Semanjski, I., Gautama, S., 2016. Crowdsourcing mobility insights-reflection of attitude based segments on high resolution mobility behaviour data. Transportation Research Part C: Emerging Technologies 71, 434–446.
Semanjski, I., Gautama, S., Ahas, R., Witlox, F., 2017. Spatial context mining approach for transport mode recognition from mobile sensed big data. Computers, Environment and Urban Systems 66, 38–52.
Semanjski, I., Gautama, S., Hendrikse, S., 2019. Traffic management as a service. ERTICO, Singapore, Singapore.
Semanjski, S., Semanjski, I., De Wilde, W., Muls, A., 2020. Use of supervised machine learning for GNSS signal spoofing detection with validation on real-world meaconing and spoofing data – Part I. Sensors 20 (4), 1171.
Shao, J., 1993. Linear model selection by cross-validation. Journal of the American Statistical Association 88 (422), 486–494.
Silva, D.F., 2009. How k-nearest neighbor parameters affect its performance. Simposio Argentino de Inteligencia Artificial, 24–28 August 2009, Mar del Plata, Argentina. Sociedad Argentina de Informatica, Buenos Aires, ASAI, pp. 95–106.
Simroth, A., Zähle, H., 2011. Travel time prediction using floating car data applied to logistics planning. IEEE Transactions on Intelligent Transportation Systems 12 (1), 243–253.
Stone, M., 1977. Asymptotics for and against cross-validation. Biometrika 64 (1), 29–35.
Stone, M., 1978. Cross-validation: A review. Statistics: A Journal of Theoretical and Applied Statistics 9 (1), 127–139.
Sun, L., Yang, J., Mahmassani, H., 2008. Travel time estimation based on piecewise truncated quadratic speed trajectory. Transportation Research Part A: Policy and Practice 42 (1), 173–186.
Tan, P.-N., Kumar, V., Steinbach, M., 2006. Introduction to data mining, 2nd ed. Pearson Addison Wesley, Boston, USA.
Tang, J., Chen, X., Hu, Z., Zong, F., Han, C., Li, L., 2019. Traffic flow prediction based on combination of support vector machine and data denoising schemes. Physica A: Statistical Mechanics and Its Applications 534, 120642.
Tarassenko, L., 1998. Guide to neural computing applications, 1st ed. Elsevier, Amsterdam, Netherlands.
Valenti, G., Lelli, M., Cucina, D., 2010. A comparative study of models for the incident duration prediction. European Transport Research Review 2 (2), 103–111.
Van Gheluwe, C., Lopez, A.J., Semanjski, I., Gautama, S., 2020. Repurposing existing traffic data sources for COVID-19 crisis management. IEEE, New Jersey, USA.


Vapnik, V., Poggio, T., Niyogi, P., Girosi, F., Burges, C., Sung, K.K., Schoelkopf, B., 1997. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Transactions on Signal Processing 45, 2758–2765.
Victor, N., Lopez, D., Abawajy, J.H., 2016. Privacy models for big data: A survey. International Journal of Big Data Intelligence 3 (1), 61–75.
Vlahogianni, E.I., 2015. Optimization of traffic forecasting: Intelligent surrogate modeling. Transportation Research Part C: Emerging Technologies 55, 14–23.
Wagner, T.J., Devroye, L., 1979. Distribution-free performance bounds for potential function rules. IEEE Transaction in Information Theory 25 (5), 601–604.
Xu, X., Chen, A., Cheng, L., 2013. Assessing the effects of stochastic perception error under travel time variability. Transportation 40 (3), 525–548.
Yang, Y., 2006. Comparing learning methods for classification. Statistica Sinica 16 (2), 635–657.
Yin, Y., Lam, W.H., Ieda, H., 2002. Modeling risk-taking behavior in queuing networks with advanced traveler information systems. Transportation and Traffic Theory 15, 309–328.
Yin, P., Xitao, F., 2001. Estimating R² shrinkage in multiple regression: A comparison of different analytical methods. The Journal of Experimental Education 69 (2), 203–224.
Yu, Z., Ni, M., Wang, Z., Zhang, Y., 2013. Dynamic route guidance using improved genetic algorithms. Mathematical Problems in Engineering 2013.
Yu, B., Yang, Z.Z., Chen, K., Yu, B., 2010. Hybrid model for prediction of bus arrival times at next station. Journal of Advanced Transportation 44 (3), 193–204.
Zhang, S., Li, X., Tan, Z., Peng, T., Wang, G., 2019. A caching and spatial K-anonymity driven privacy enhancement scheme in continuous location-based services. Future Generation Computer Systems 94, 40–50.
Zheng, W., Lee, D.-H., Shi, Q., 2006. Short-term freeway traffic flow prediction: Bayesian combined neural network approach. Journal of Transportation Engineering 132 (2), 114–121.
Zong, F., Lin, H., Yu, B., Pan, X., 2012. Daily commute time prediction based on genetic algorithm. Mathematical Problems in Engineering 2012.

CHAPTER 6

Transport planning and big data

Smart Urban Mobility https://doi.org/10.1016/B978-0-12-820717-8.00001-4
© 2023 Elsevier Inc. All rights reserved.

6.1 Objectives of the chapter

• What is the four-step transport planning model?
• What is trip generation?
• What is trip distribution?
• What is mode choice?
• What is route assignment?
• What is a TAZ?
• What is the OD matrix?
• What are the best big data practices for mobility?
• What are the advantages of big data-based analytics for transport planning?
• What are the known limitations of big data-based analytics for transport planning?
• What are the bottlenecks and opportunities related to big data-based analytics for transport planning?
• What are the open questions when it comes to the use of big data for transport planning?

6.2 Word cloud

Fig. 6.1 presents a word cloud with an overview of the content of this chapter.

FIGURE 6.1 Four-step transport planning model and big data word cloud.

6.3 Four-step transportation planning model

Including a wider set of stakeholders in the collaborative decision-making process and in the cocreation of new, smarter mobility solutions and approaches requires bridging different domains in order to capitalize on new possibilities. In this chapter, a rather traditional approach from the transport planning domain, the so-called four-step transport planning model, is interfaced with state-of-the-art big data advances. The aim is to highlight the potential in bringing these two domains together, as well as to indicate potential challenges, limitations, and possibilities for future developments. The reason for bringing something “new” and something “traditional” together in this manner is that traditional transport planning models are well established in the domain of mobility, and mobility data collection processes are often designed to serve these models, together with techniques and practices that are well developed and not likely to be abandoned by the experts in this domain. On the other hand, big data-driven possibilities are transforming many aspects of our lives and have the potential to support mobility planning efforts. Hence, bridging these two domains seems a good opportunity to facilitate efforts on both the mobility planning and the data analytics sides, as well as to support a wider set of stakeholders in assessing where and when these possibilities can add value in their local contexts to achieve smart city and smart mobility goals.

As already mentioned, one of the best-known and most widely used methods for transport planning is the four-step transport planning model. The four-step transport planning model is an umbrella term that encompasses four groups of transport planning submodels dedicated to answering specific mobility planning challenges. These steps are:

• trip generation,
• trip distribution,
• mode choice, and
• route assignment.

These steps are generally executed sequentially to describe the state of the transport system at a defined moment in time, either current or future, and they allow comparison among such states, as well as the evaluation and definition of the actions required to reach the desired outcomes. In this context, one of the basic elements of transport planning is the traffic analysis zone (TAZ). TAZs are geographical areas (zones) that are described by statistical descriptors, such as demographic and employment information, about the population in them. These zones are ideally homogeneous in terms of land use (for instance, mixed-use neighborhoods, large university campuses, city blocks, etc.) and separated from each other by transport network elements such as roads. Hence, both the zoning system and the traffic network details are important for achieving the desired level of model accuracy in balance with the related costs.

6.3 Four-step transportation planning model

Selecting the number of TAZs needed to cover a specific area is not a trivial question and is frequently done based on the needs of a specific study. However, such “need-based” creation of TAZs is often not a practical option, as it does not generalise well. For instance, if several studies are conducted over the same area, it becomes challenging to reuse the collected data and to compare the achieved results over time. A more practical approach is to use a hierarchical zoning system, where subzones can be aggregated into zones, which can be combined into districts and further into more general zones. This provides an appropriate level of detail for different studies while maintaining comparability and supporting efficient data collection and reuse.

As transport planning models often rely on network graph theory, TAZs are commonly represented by nodes called centroids. A centroid represents the concentration of a TAZ’s attributes and properties into a single network node, which is best thought of as floating in space rather than as an actual physical location within the TAZ. Each zone is then attached to the transport network by specialized links called centroid connectors, which represent the average costs, such as travel time and distance, of joining the transport system for trips with an origin or destination within the respective TAZ. The transport networks themselves are also described by network graphs consisting of a series of nodes (which graphically show the locations of intersections) and connections between nodes that are called links (e.g., streets). In general, it is advised to model at least one level below the transport network links of interest (for instance, if one is conducting a study focused on primary roads, one should also include at least the network level of secondary roads). This is because the largest errors occur at the lowest levels in the hierarchy of transport networks (Jansen & Bovy, 1982). In this context, each link within a transport network is usually described by its length, travel speed, capacity (generally expressed in passenger car equivalent units (PCU) per hour), type of road, road width, number of lanes, and access restrictions (for instance, dedicated public transport lanes). The nodes, on the other hand, are mainly described by the type of junction, signal timings, banned turns, the storage capacity of queues, and their presence at the start of the signal phase.
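The hierarchical zoning idea described above can be illustrated with a minimal sketch. The zone identifiers, the parent mappings, and the statistics below are all hypothetical and serve only to show how subzone descriptors could be rolled up to coarser levels so that different studies share the same base data.

```python
from collections import defaultdict

# Hypothetical hierarchy: each subzone belongs to one zone, each zone to one district.
SUBZONE_TO_ZONE = {"s1": "z1", "s2": "z1", "s3": "z2", "s4": "z2"}
ZONE_TO_DISTRICT = {"z1": "d1", "z2": "d1"}

# Illustrative (made-up) statistical descriptors per subzone.
SUBZONE_STATS = {
    "s1": {"population": 1200, "jobs": 300},
    "s2": {"population": 800, "jobs": 150},
    "s3": {"population": 500, "jobs": 900},
    "s4": {"population": 1500, "jobs": 400},
}


def aggregate(stats, parent_of):
    """Sum additive descriptors of child units into their parent units."""
    totals = defaultdict(lambda: defaultdict(int))
    for unit, values in stats.items():
        parent = parent_of[unit]
        for key, value in values.items():
            totals[parent][key] += value
    return {parent: dict(values) for parent, values in totals.items()}


# Roll up subzones -> zones -> districts.
zone_stats = aggregate(SUBZONE_STATS, SUBZONE_TO_ZONE)
district_stats = aggregate(zone_stats, ZONE_TO_DISTRICT)
```

Only additive descriptors (counts) can be aggregated this way; rates or averages would need to be recomputed from the underlying counts.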

6.3.1 Trip generation step

Within the transport planning context, interaction among traffic analysis zones occurs because each of them produces and attracts person trips. The number of produced and attracted trips is estimated from the information contained within each zone (for instance, population, households, and employment). In more detail, home-based (HB) trips are those where the home of the trip maker is either the origin or the destination of the trip. All other trips are considered nonhome-based (NHB) trips. Households are considered the primary producers of trips (note that traditional mobility data collection processes produce statistics at the household level), where a trip production is defined as the home end of an HB trip or as the origin of an NHB trip. Correspondingly, employment sites are the primary trip attractors, as a trip attraction is defined as the nonhome end of an HB trip or the destination of an NHB trip. These productions and attractions are translated into person trips that enter and leave each zone. The context of the trips, such as the fact that people make trips with different motivations and with different vehicle occupancy rates, is also accounted for in the model. Most often, the following classes of trip motivation, or purpose, are considered:
• going to work,
• trips to an education institution,
• shopping trips,
• trips related to social and recreational activities, and
• other trips.


Furthermore, details such as the distribution of trips over the day, particularly during peak and off-peak hours, are studied, as well as the notion that travel behavior depends on socioeconomic attributes, of which income level, car ownership, and household size are most often considered. Next to person trips, freight trips are also of interest for trip generation, where the number of employees, the number of sales, and the area of a company are seen as the variables most related to the number of generated freight trips. This entire process represents the initial part of the four-step transport planning model, called the trip generation step.
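As a rough illustration of the trip generation step, the sketch below estimates home-based work trip productions from households and attractions from jobs. The zone data and both trip rates are invented for illustration and are not calibrated values; scaling attractions so that their total matches total productions is a common simplifying assumption, not something prescribed by the text above.

```python
# Hypothetical zone data: households and jobs per TAZ.
ZONES = {
    "A": {"households": 1000, "jobs": 200},
    "B": {"households": 400, "jobs": 1500},
    "C": {"households": 600, "jobs": 300},
}

# Assumed daily home-based work trip rates (illustrative only).
PRODUCTION_RATE_PER_HOUSEHOLD = 1.5
ATTRACTION_RATE_PER_JOB = 1.2


def trip_generation(zones):
    """Return (productions, attractions) per zone for one trip purpose."""
    productions = {
        zone: data["households"] * PRODUCTION_RATE_PER_HOUSEHOLD
        for zone, data in zones.items()
    }
    raw_attractions = {
        zone: data["jobs"] * ATTRACTION_RATE_PER_JOB
        for zone, data in zones.items()
    }
    # Scale attractions so that total attractions equal total productions.
    scale = sum(productions.values()) / sum(raw_attractions.values())
    attractions = {zone: value * scale for zone, value in raw_attractions.items()}
    return productions, attractions


productions, attractions = trip_generation(ZONES)
```

In practice, separate rates would be estimated per trip purpose and per household category (for instance, by car ownership or household size).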

6.3.2 Trip distribution step

The trip distribution process determines where trips end up once they leave their traffic analysis zones. Trip distribution results in a matrix of origins and destinations between all zones for each trip purpose. These matrices are often called OD matrices. The main diagonal of the OD matrix corresponds to intra-TAZ trips, the sum of trips in a row should equal the total number of trips emanating from the TAZ, and the sum of trips in a column should equal the number of trips attracted to that TAZ. Trip distribution is based on the information contained within each TAZ (for instance, how many employment sites there are), on its proximity to other zones, and on the total number of trips generated in the corresponding zone. In this context, the principle of gravity indicates that zones that are closer to each other will have more trips flowing between them, all other things being equal. Hence, gravity models are often employed for the trip distribution step.
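A minimal sketch of a doubly constrained gravity model is given below. It assumes an exponential deterrence function with a hypothetical parameter `beta`, made-up travel costs, and iterative balancing of row and column factors; the source does not specify a functional form, so these choices are illustrative only.

```python
import math


def gravity_model(productions, attractions, cost, beta=0.1, iterations=100):
    """Distribute trips so rows match productions and columns match attractions."""
    n = len(productions)
    # Deterrence: trips become less likely as travel cost increases.
    f = [[math.exp(-beta * cost[i][j]) for j in range(n)] for i in range(n)]
    a = [1.0] * n  # row balancing factors
    b = [1.0] * n  # column balancing factors
    for _ in range(iterations):
        for i in range(n):
            a[i] = 1.0 / sum(b[j] * attractions[j] * f[i][j] for j in range(n))
        for j in range(n):
            b[j] = 1.0 / sum(a[i] * productions[i] * f[i][j] for i in range(n))
    return [
        [a[i] * productions[i] * b[j] * attractions[j] * f[i][j] for j in range(n)]
        for i in range(n)
    ]


# Illustrative inputs: three zones, travel costs in minutes (made up).
productions = [1500.0, 600.0, 900.0]
attractions = [300.0, 2250.0, 450.0]
cost = [
    [2.0, 10.0, 15.0],
    [10.0, 2.0, 8.0],
    [15.0, 8.0, 2.0],
]
od = gravity_model(productions, attractions, cost)
row_sums = [sum(row) for row in od]
col_sums = [sum(od[i][j] for i in range(3)) for j in range(3)]
```

The row and column sums of the resulting matrix reproduce the productions and attractions, which is the consistency property of the OD matrix described above.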

6.3.3 Mode choice step

Mode choice refers to the relative proportions of all trips between zones that use each mode of transport. It is likely one of the most relevant elements in transport planning and policy-making, as it affects the general efficiency with which one can travel in urban areas, the amount of space devoted to transport functions, and the range of choices available to the traveler. This step can be simplistic or very detailed, depending on the complexity of the transport network and the number of transport options/modes available in a given area. In general, the factors most often considered to influence mode choice are:
• Characteristics of the trip maker, such as car availability/ownership, possession of a driving license, income, household structure, attitude toward different mobility options, and residential density.
• Characteristics of the trip, such as the motivation for the trip and/or the time of day when the trip is undertaken.
• Characteristics of the transport facilities, either quantitative, such as travel time or related costs, or qualitative, such as comfort and convenience, reliability, and perceived safety and security.

Applying mode choice models over the whole population results in trips split by mode, hence the term modal split is frequently used to describe mode choices in the study area. Discrete choice models are often used for this purpose, particularly multinomial and nested logit models, which can account for a wide variety of mode choices such as specific types of public transport (bus, metro, tram, etc.). In practice, the mode choice step is often carried out through multiple iterations of trip distribution and assignment as part of a feedback loop.
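The multinomial logit model mentioned above can be sketched as follows. The utility specification, the coefficients, and the travel times and costs are all hypothetical; a real model would estimate them from survey or revealed-preference data.

```python
import math


def utility(time_min, cost_eur, asc, beta_time=-0.05, beta_cost=-0.2):
    """Linear utility with assumed (uncalibrated) coefficients."""
    return asc + beta_time * time_min + beta_cost * cost_eur


def mode_shares(utilities):
    """Multinomial logit: share of each mode is exp(U) / sum of exp(U)."""
    max_u = max(utilities.values())  # subtract the maximum for numerical stability
    exp_u = {mode: math.exp(u - max_u) for mode, u in utilities.items()}
    total = sum(exp_u.values())
    return {mode: value / total for mode, value in exp_u.items()}


# Illustrative attributes for one OD pair (made up).
utilities = {
    "car": utility(time_min=20, cost_eur=4.0, asc=0.0),
    "bus": utility(time_min=30, cost_eur=2.0, asc=-0.5),
    "bike": utility(time_min=35, cost_eur=0.0, asc=-1.0),
}
shares = mode_shares(utilities)
```

Multiplying these shares by the number of trips for each OD pair yields the modal split for that pair.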


6.3.4 Trip assignment step

Whereas the previous steps mainly focus on transport demand, trip assignment puts the focus on the supply side of the transport system. In this context, the supply side is made up of a road network and its links with associated costs (for instance, distance, free-flow speed, or capacity). The demand side is made up of the number of trips, by OD pair and transport mode, that would be made for a given level of service, mainly defined by travel time, cost, and/or comfort levels. Probably the simplest example of equilibrium between transport demand and supply is equilibrium in the road network, where travelers from a defined trip matrix pursue routes that minimize their travel cost (e.g., distance or travel time). Consequently, they will explore various available routes and potentially settle into a relatively stable arrangement. This allocation of trips to routes yields a pattern of path and network link flows, which are considered to be in equilibrium when travelers can no longer find better routes to their destination. However, finding the equilibrium of an urban multimodal transport system is not a simple task, as equilibrium can take place at different levels. For example, public transport users seek a combination of services or routes that reduces, for instance, their travel time, which is also affected by waiting and transfer times or by overcrowding of vehicles and stops. In urban transport networks, these travel times are likely to be affected by the presence of other transport modes. For instance, if car traffic congestion increases, public transport vehicles sharing the same road network infrastructure will be affected. Such an occurrence can potentially induce either an operator or a user to adjust their routes. These route choices consequently interact with those of car drivers, providing increased capacity on some links and, in turn, a new equilibrium point.


In this context, the trip assignment step focuses on determining which routes trips will take when traveling from one traffic analysis zone to another. It is generally divided into two components: route choice modeling and the loading of the trip matrix onto the identified routes. Route choice modeling relies on the assumption of a rational traveler who chooses the route with the least perceived individual cost. This choice is affected by a number of factors, such as travel time, distance, related monetary costs, potential disturbances (e.g., road works), route attractiveness (for instance, occurrence of congestion, scenery, type and condition of transport network infrastructure, reliability of the public transport service), and habit. Following the defined route choice model, a set of rules is used to load a fixed trip matrix onto the transport network and assign all the trips from all zones along the network to their destination zones. The result is a set of traffic volumes for all roads in the network.
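The two components described above, route choice and matrix loading, can be sketched with an all-or-nothing assignment, in which every trip of an OD pair is loaded onto the least-cost path. This is a deliberately simplified technique that ignores congestion and therefore does not compute an equilibrium; the network, link costs, and trip matrix below are made up.

```python
import heapq
from collections import defaultdict


def shortest_path(graph, origin, destination):
    """Dijkstra's algorithm; returns the node sequence of the least-cost path."""
    dist = {origin: 0.0}
    prev = {}
    queue = [(0.0, origin)]
    visited = set()
    while queue:
        d, node = heapq.heappop(queue)
        if node in visited:
            continue
        visited.add(node)
        if node == destination:
            break
        for nxt, link_cost in graph.get(node, {}).items():
            new_dist = d + link_cost
            if new_dist < dist.get(nxt, float("inf")):
                dist[nxt] = new_dist
                prev[nxt] = node
                heapq.heappush(queue, (new_dist, nxt))
    if destination not in dist:
        raise ValueError(f"no path from {origin} to {destination}")
    path = [destination]
    while path[-1] != origin:
        path.append(prev[path[-1]])
    path.reverse()
    return path


def all_or_nothing(graph, od_matrix):
    """Load every OD flow onto its least-cost path and return link volumes."""
    flows = defaultdict(float)
    for (origin, destination), trips in od_matrix.items():
        path = shortest_path(graph, origin, destination)
        for u, v in zip(path, path[1:]):
            flows[(u, v)] += trips
    return dict(flows)


# Illustrative directed network: graph[u][v] = travel cost of link u -> v.
graph = {"A": {"B": 5.0, "C": 2.0}, "C": {"B": 2.0}, "B": {}}
od_matrix = {("A", "B"): 100.0, ("C", "B"): 50.0}
link_flows = all_or_nothing(graph, od_matrix)
```

Equilibrium-oriented methods typically repeat such a loading step while updating link costs as a function of the assigned volumes.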

6.4 Literature review of big data advances for four-step transport planning model

6.4.1 Literature review of big data advances for trip generation step

As the aim of the trip generation step is to gain insight into the TAZs (trip production and attraction sites) and the characteristics of the related trips (for instance, the trips’ purposes and vehicle occupancy rates), here we look at existing efforts in the literature where big data analytics has made progress in supporting these aspects of transport planning. Among the first, Wolf et al. (2001, 2004) used GNSS data to derive trip destinations and purposes. To detect the trip destination, they applied a rule based on a time interval in which no vehicle movement was observed. They tested several no-movement intervals (120, 90, and 60 s) and found that the 120 s threshold yielded the best predictions of the true trip end. For trip purpose identification, they relied on land use GIS (Geographic Information System) data. Their results indicate the highest confusion for trips that end in mixed land use areas. Similarly, Stopher et al. (2008) applied land use information to detect trip purpose in their study. Bohte and Maat (2009) used GNSS logs, GIS data, and an interactive web-based validation tool with the goal of deriving and validating trip purposes and the transport modes used. Their research included 1104 data contributors. In comparison with the data from the national travel survey, their results had almost equal shares for the observed values. Feng and Timmermans (2011) focused on developing a Bayesian network, decision tree, and random forests methodology to detect trip purpose from GNSS data, while Lu et al. (2021, pp. 1-8) used a decision tree, support vector machine, and meta learner to detect trip purpose from GNSS and GIS data. Munizaga and Palma (2012) built the public transport OD matrix for Santiago, Chile, based on passively collected smartcard and GNSS data. From the available data, they obtained detailed information about the time and position of passengers boarding public transport and generated an estimation of the time and position of alighting for over 80% of the boarding transactions. Shen and Stopher (2013) had a sample of 2059 data contributors who participated in GNSS data collection. Based on the collected data, they proposed a rule-based approach that introduced some additional information (e.g., activity duration, tour information) for trip purpose imputation. Their achieved success rate is 66.5% for five trip purposes. Furthermore, Fanhas and Saptawati (2017) used data from GNSS and taximeters and applied graph clustering to discover frequent origin and destination locations, while Usyukov (2017) applied a rule-based model for identifying home-based tours for cyclists with a success rate of 92%.
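The rule-based trip end detection described above for GNSS data can be sketched as follows. This is not the procedure of Wolf et al. but a simplified illustration: it assumes each record is a `(timestamp_s, speed_m_per_s)` pair sorted by time, treats fixes above a hypothetical speed threshold as movement, and ends a trip when no movement is seen for at least the stop threshold (120 s by default).

```python
def split_trips(records, stop_threshold_s=120.0, moving_speed_mps=0.5):
    """Return (start_time, end_time) pairs for trips found in GNSS records.

    A trip ends when the gap between two consecutive moving fixes is at
    least stop_threshold_s. Both thresholds are illustrative assumptions.
    """
    trips = []
    trip_start = None   # timestamp of the first moving fix of the current trip
    last_moving = None  # timestamp of the most recent moving fix
    for timestamp, speed in records:
        if speed <= moving_speed_mps:
            continue  # stationary fix: does not extend the current trip
        if trip_start is not None and timestamp - last_moving >= stop_threshold_s:
            trips.append((trip_start, last_moving))
            trip_start = None
        if trip_start is None:
            trip_start = timestamp
        last_moving = timestamp
    if trip_start is not None:
        trips.append((trip_start, last_moving))
    return trips


# Illustrative log: movement, a 180 s stop, then movement again.
records = [
    (0, 8.0), (30, 9.5), (60, 7.0),    # first trip
    (90, 0.0), (150, 0.1), (210, 0.0),  # stationary
    (240, 6.0), (270, 8.5),             # second trip
]
trips = split_trips(records)
```

Lowering `stop_threshold_s` to 90 or 60 s would split trips at shorter stops, which mirrors the trade-off between the intervals tested in the literature.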

Xiao et al. (2016) relied on smartphones to collect data and incorporated a prompted recall survey to validate trip characteristics. After this step, they used artificial neural networks combined with particle swarm optimization to detect six different trip purposes. Four scenarios were then constructed by employing two methods for land use type coding, i.e., polygon-based information and point of interest (POI), and two methods for selecting the training dataset, i.e., equal proportion selection and equal number selection. They compared the accuracy of trip purpose detection under these scenarios by comparing the detected and validated trip purposes. The highest accuracy, 96.53% for the test dataset, was achieved under the scenario of polygon-based information and equal proportion selection. However, they achieved the lowest success rates for irregular trips like social visits or eating out. Ge and Fukuda (2016) use aggregated data of mobile phone traces to estimate work-related trips. They propose a sequential updater based on the maximum entropy principle. Trip production and attraction were calculated by solving a nonlinear programming problem. They demonstrate the approach in a case study of the city of Tokyo. Yazdizadeh et al. (2019) use a large-scale smartphone travel survey conducted in Montreal, Canada, augmented with GTFS and social network data, to train and validate a random forest model that predicts trip purpose (six categories). They apply cross-validation analysis and estimate a prediction accuracy of 71%. Dong et al. (2015) propose the traffic semantic concept to extract commuters’ origin and destination information from mobile phone CDRs. A k-means clustering method was used to classify a cell area (the area covered by a base station) and assign to it a certain land use category or traffic semantic attribute (e.g., working, residential, or urban road) based on four features (real-time user volume, inflow, outflow, and incremental flow) extracted from the CDR data. By combining the geographic information of mobile phone base stations, the roadway network within Beijing’s Sixth Ring Road was divided into 73 traffic zones using another k-means clustering algorithm. In addition, they proposed a traffic zone attribute index to measure the tendency of traffic zones to be predominantly residential or working areas. Yang et al. (2020) use location-based social network data (to create point of interest locations) and mobile network data to assess trip generation and trip attraction among the locations. They form a random tree regression model to select the most important features and an ordinary least squares model to establish trips. They also argue that in their case study, Nanjing, China, smaller TAZs are noticeable in the city center, where base stations also cover smaller areas.

The literature review reveals dedicated efforts in using GNSS and smartphone data to detect trip purposes (mainly by enriching them with GIS data), whereas efforts to detect trip purposes from CDR data are still quite limited (Dong et al., 2015), mainly due to the challenges related to the spatial-temporal resolution of this data. The most used GIS data include information on land use, from which trip purposes are approximated. The temporal component is often handled by a rule-based approach (e.g., detection of stationary overnight locations as potential home locations), while the association of the trip purpose with the land use details is made based on more advanced data analytics. However, the success of such approaches is highly dependent on the availability of precise and up-to-date land use data and is conditional on the inherent characteristics of the considered positioning information source. Consequently, the confidence in the obtained results depends on the abovementioned conditions, and the lowest success rates are observed in mixed land use areas, where the question is not only which floor of a mixed-use building one visited (for example, ground floors have a retail character while upper floors are residential or office spaces) but also which exact location/building one actually visited. Furthermore, an interesting detail is that the most widely accepted stationary time interval in GNSS data, 120 s, is used for trip splitting (an indication that one trip has ended and that the following observed movement represents the start of another trip). And whereas trip purpose detection has captured the interest of researchers, vehicle occupancy detection from big data is mainly still not tackled. Sparse examples of vehicle occupancy considerations, when dealing with big data, are still limited to manual input from users (Gonzalez et al., 2008).

Another relevant observation is that, as the next chapter will show in more detail, for the sequential steps of the four-step transport planning model researchers are mainly still using a simple division of the geographic area into a geographic grid (geographically simpler to apply) or into base station coverage areas (Caceres et al., 2007; Fanhas & Saptawati, 2017; Moreira-Matias et al., 2016). However, this is not a realistic practice for practical transport planning purposes, as such spatial units usually carry no information about the population, the homogeneity of land use, or additional content in the grid cell (all relevant elements for transport planning). It also poses an open question of how to match the idealized grid to the actual TAZs and available statistical data and how to extrapolate the correct information. So far, only a few (Järv et al., 2017) have tackled these relevant questions, and additional effort is needed in order to support transport planning efforts. Table 6.1 gives a summarized overview of the literature review for the trip generation step.

6.4.2 Example: detection of trip generation zones for tourism population

Knowing which zones generate and attract trips is important for a number of transport planning studies as well as for other related domains.

TABLE 6.1 Summary of literature review for the trip generation step.

Literature | OD estimation | OD estimation accuracy | Trip purpose estimation | Purpose accuracy | Vehicle occupancy rates | Occupancy rate accuracy | Duration of test data | Number of users | Dataset | User validated
Bohte & Maat (2009) | No | - | Yes | 43% | No | - | 1 week | 1104 | GNSS, GIS | Yes
Munizaga & Palma (2012) | Yes | 82% | No | - | No | - | 2 weeks | N/A (74 million observations) | Smartcard, GNSS (public transport only) | No
Shen and Stopher (2013) | No | - | Yes | 66.5% | - | - | 3 days | 2059 | GNSS | No
Xiao et al. (2016) |  | - | Yes | 96.53% | - | - | 1 week | 321 | Smartphone | Yes
Ge and Fukuda (2016) | Yes | 77% | No | - | No | - | 1 day | N/A (650,000 observations) | Smartphone | No
Dong et al. (2015) | Yes | - | No | - | No | - | 1 day | N/A | CDR | No
Wolf et al. (2001) | Yes | 37% | Yes | 60.9% | No | - | 3 days | 13 | GNSS, GIS | Yes
Stopher et al. (2008) | Yes | 75% | Yes | 60% | No | - | 56 days | 21 | GNSS, GIS | Yes
Lu et al. (2021) | No | - | Yes | 80.6% | No | - | 13 weeks | N/A (3188 trips) | GNSS, GIS | Yes
Feng and Timmermans (2011) | No | - | Yes | 96.8% | No | - |  | 329 |  | 
Yazdizadeh et al. (2019) | No | - | Yes | 71% | No | - | 1 month | 6845 | Smartphone | Yes
Usyukov (2017) | No | - | Yes | 92% | No | - | 5 days | 108 cyclists | GNSS | No


The following example concerns the identification of trip production and attraction sites within the context of touristic services planning. The example was part of a pilot project conducted in Zeeland province, set up between Ghent University and a regional tourism information agency, VVV Zeeland. Zeeland province is located in the southwest of the Netherlands and covers an area of 2934 square kilometers, of which almost 40% is water (Fig. 6.2A). It has a population of 383,689, of which a large proportion (around 27%) lives in the province’s capital Middelburg (population of 48,544) and the largest municipality Terneuzen (population of 54,589) (StatLine, 2021). Overall, Zeeland province includes 13 municipalities (Fig. 6.2B), namely:
• Zeeuws-Vlaanderen (Zeelandic Flanders) COROP region: (1) Hulst, (2) Sluis, (3) Terneuzen
• Overig Zeeland COROP region: (4) Noord-Beveland (North Beveland), (5) Schouwen-Duiveland, (6) Tholen
• Walcheren: (7) Middelburg, (8) Veere, (9) Vlissingen
• Zuid-Beveland (South Beveland): (10) Borsele, (11) Goes, (12) Kapelle, (13) Reimerswaal

FIGURE 6.2 (A) Zeeland province, (B) Municipalities in the province, (C) Observed trips in the Zeeland province.

Tourism is an important economic activity in Zeeland, hence the motivation to start a pilot project to gain a better understanding of visitors’ and tourists’ mobility behavior. The pilot project lasted for 5 months and relied on the new Zeeland mobile phone application, provided by VVV Zeeland, which integrated data logging capabilities. The application was dedicated to tourists and visitors who, upon installing it, were able to give permission to share their positioning data and could also see directly on their home screen that their data was being used for analysis. During the pilot study, 1505 users overall gave permission for the data collection, resulting in 124,725 trips identified based on the methodology presented in Wolf et al. (2001). The overall length of the observed trips was 124,725 km, resulting in an average length of the observed trips of around 17.5 km (Fig. 6.2C).

From the trip generation point of view, administrative borders can in this context be seen as a hierarchical zoning system, where country borders are at the highest level, provinces are at the next lower level, followed by municipality borders, and so on down to more detailed levels. At the highest level, from the tourism-related perspective, it can be interesting to see from which countries visitors and tourists arrive at the province as their destination (attraction zone). At the lower levels, on the other hand, it might be interesting to see which exact locations they visit as major attractors and with what potential motivation (purpose), as well as to potentially gain insight into the crowdedness of these locations in support of event or mobility management. To perform such an analysis, as an initial step, the data contributors were clustered into groups based on their touristic activity descriptors. More details regarding the unsupervised hierarchical clustering approach used for this purpose can be found in the literature (Rodríguez et al., 2018). The resulting data contributor clusters included:
• Internal - users for whom multiple trips were observed, all of which both started and ended within the Zeeland province. No trips were observed outside the province, nor did any of the observed trips cross the outer borders of Zeeland.
• External - users for whom trips outside and inside the Zeeland province were observed. These were further clustered into subclusters as follows:
  ◦ External 24 - captures visitors for whom several mobility patterns outside the Zeeland province were observed, but among the observed patterns only one visit to the Zeeland province, shorter than 24 h, was noted. This cluster contributes to the tourism class “day tourist”.
  ◦ External long - similar to the External 24 cluster, the External long cluster includes participants who visited Zeeland only once, but whose stay in the province was longer than 24 h. This cluster contributes to the tourism class “longer-stay tourist”.
  ◦ External recurring - captures the movement patterns where multiple visits to the Zeeland province were observed, either shorter or longer than 24 h or a combination of both.
  ◦ External unsorted - includes users for whom trips were observed, but none of them were in the Zeeland province during the pilot study.

As the focus of the pilot study was to gain a better understanding of potential tourism-related trips, the Internal cluster was excluded from further analysis. It was not evident whether these were local residents who used the mobile phone application or visitors who started using the application upon their arrival in Zeeland and might have uninstalled it before leaving, which would limit all of their observed trips to the duration of their stay within the province. For the remaining clusters, special attention was given to data contributors’ initial trip entering the Zeeland province for a specific stay (in the case of multiple stays, as in the External recurring subcluster). The intention was, by identifying the zone (country) in which the starting location of this trip was observed, to gain a better understanding of the potential country of origin (Fig. 6.3).

FIGURE 6.3 Country of origin for trips that end in the Zeeland region.

Notably, all four External subclusters had three main countries of origin for trips that ended in the Zeeland province:
• The Netherlands - overall, the country with the most trip origins, for each external users’ subcluster,
• Germany - the second most frequent country of origin for all external users’ subclusters except External 24,
• Belgium - the second most frequent country of origin for the External 24 subcluster and the third most frequent country of origin for the other external users’ subclusters.

Other countries that appear as origin locations for trips that end within the Zeeland province were Bulgaria, France, Luxembourg, Switzerland, and the United Kingdom. Furthermore, the most diverse set of origin countries is noted for the External recurring subcluster.

At the lower level of TAZs used in this study, touristic organizations were interested in exploring how specific destinations are being visited, where visitors come from, and what the potential motivations for their journeys are. One such location is Neeltje Jans island, an artificial island halfway between the Noord-Beveland and Schouwen-Duiveland municipalities (indicated in blue in Fig. 6.4A). It was originally constructed to facilitate the building of the Oosterscheldedam. After the Oosterscheldedam was constructed, a fun park with attractions and various other expositions was established on the island, attracting travelers to visit them. In the collected observations, all the trips that had a destination at Neeltje Jans island were considered. All these observed trips originated in the Netherlands; more precisely, 4.6% were from the South Holland region and 95.4% were from the Zeeland province and its 13 municipalities, as indicated in Fig. 6.4B.

FIGURE 6.4 (A) Geographic location of the Neeltje Jans island, (B) Locations of the trips’ origins for trips ending at the Neeltje Jans island.

Considering the land use data, the trip destinations were matched with the available details to derive potential trip purposes. More details regarding this procedure can be found in the literature (Rodriguez Echeverría et al., 2020; Semanjski et al., 2019). The results of the analysis aiming to detect the potential trip purpose at Neeltje Jans island are presented in Fig. 6.5.

FIGURE 6.5 Trip purpose identification for Neeltje Jans area.


Overall, the land use analysis revealed a high similarity between the most common trip purpose documented by the official statistics of the Province of Zeeland and the potential trip purposes derived from mobile phone positioning. The main trip motivation for visiting the area in the official statistics is recreation (KCKT, 2018), while the largest group of potential trip destination locations identified at the province level had a land use where recreational activities are developed. This suggests that mobile phone data have the potential to be used successfully to provide insights into tourism-related activity behavior. Furthermore, the areas where the land use is “Recreational” or “Dry natural terrain” represent only 2.35% of the area of the Province of Zeeland, indicating that a naive translation of land use percentages into trip destination purposes would yield significantly misleading results.

The lack of ground truth data can be seen as a potential limitation of the considered land use-based approach, as the actual success rates of the suggested trip purposes cannot be confirmed. This might be tackled by implementing a validation functionality based on a two-way communication channel through which users provide feedback about the activity they are engaged in while visiting the area. Furthermore, the considered approach is highly sensitive to the completeness and quality of the underlying land use data. As an example, the proposed approach identified that, on the overall dataset, 8.33% of destination locations had “Highway” as the corresponding land use (Semanjski et al., 2019). Due to the noise present in this type of data, weights based on the location accuracy of the data points and the reliability of the land use information could be considered when assigning the potential trip purpose. Another observed challenge relates to the characteristics of the specific study. For instance, tourist data is characterized by a relatively short duration of tourist visits, and in many cases the dataset will also contain information from data contributors’ everyday life, which can introduce noise when extracting tourism-related insights.
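The weighting idea mentioned above, where the location accuracy of a data point and the reliability of the land use information influence the assigned trip purpose, could be sketched as follows. The land use to purpose mapping, the reliability scores, the weighting formula, and the sample destinations are all hypothetical; the pilot study itself did not apply such weights.

```python
from collections import defaultdict

# Hypothetical mapping of land use classes to potential trip purposes.
LAND_USE_PURPOSE = {
    "Recreational": "recreation",
    "Retail": "shopping",
    "Residential": "home or visit",
    "Highway": "unknown",
}

# Assumed reliability (0..1) of each land use class as a purpose indicator.
LAND_USE_RELIABILITY = {
    "Recreational": 0.9,
    "Retail": 0.8,
    "Residential": 0.8,
    "Highway": 0.2,
}


def weighted_purpose_shares(destinations, reference_accuracy_m=50.0):
    """Return purpose shares where imprecise or unreliable matches count less.

    destinations: iterable of (land_use, location_accuracy_m) pairs.
    """
    totals = defaultdict(float)
    for land_use, accuracy_m in destinations:
        # Weight decreases as the positioning error radius grows.
        accuracy_weight = reference_accuracy_m / (reference_accuracy_m + accuracy_m)
        weight = accuracy_weight * LAND_USE_RELIABILITY[land_use]
        totals[LAND_USE_PURPOSE[land_use]] += weight
    grand_total = sum(totals.values())
    return {purpose: value / grand_total for purpose, value in totals.items()}


# Illustrative destinations: (matched land use, reported accuracy in meters).
destinations = [("Recreational", 10.0), ("Recreational", 20.0), ("Highway", 200.0)]
shares = weighted_purpose_shares(destinations)
```

With such weights, a destination matched to “Highway” with poor positioning accuracy contributes far less to the purpose shares than a precisely located recreational destination.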

6.4.3 Literature review of big data advances for trip distribution step The trip distribution step focuses on associating the trip origins and destinations at the TAZs level. The resulting element of this step is the OD matrixes between all zones, for each considered trip purpose. To investigate the trip distributions, Knapen et al., 2016 consider that people often do not use the least cost path through the transport network while making trips. Hence, they aimed to produce a statistical distribution extracted from sets of GNSS traces for both multimodal person movements and unimodal car trips. To do so they test the hypothesis that, for utilitarian trips, the route between origin and destination consists of a small number of concatenated least cost paths. This hypothesis is verified by analyzing routes extracted from large sets of recorded GNSS traces that constitute revealed preference information applicable for route choice sets extraction. Ge and Fukuda (2016) focus on aggregated data of mobile phone traces to estimate work-related trips. They develop a sequential updater based on the maximum entropy principle. The trip production and attraction are firstly calculated by a nonlinear programming problem followed by a matrix-fitting problem to distribute trips to each OD pair. Their case study, situated in Tokyo, Japan, indicates that the proposed updating approach can successfully note the change in travel patterns. Moreira-Matias et al., 2016 propose an incremental framework to maintain statistics on the urban mobility dynamics over a time-evolving OD matrix. Their proposed methodology settles on three steps. First, they use half-space trees to divide the city area into dense subregions of equal mass. The uncovered regions form an OD matrix, which can be updated


6. Transport planning and big data

by transforming the trees/leaves into conditional nodes (and vice versa). Second, they use a partitioning incremental algorithm that discretizes the target variable’s historical values in each matrix cell. Finally, they define a dimensional hierarchy to discretize the domains of the independent variables depending on the cell’s samples. They test this approach on a case study of a taxi network operating in a midsized city in Portugal. Fanhas and Saptawati (2017) also consider a taxi network, with the aim of identifying frequently occurring OD pairs. To do so, they propose a graph-based analysis of GNSS taxi data and apply graph clustering to discover frequent origin-destination flows. Li et al. (2017) propose an NMF-AR (nonnegative matrix factorization-autoregressive) model for predicting OD matrices by combining the nonnegative matrix factorization algorithm (to reveal the basic characteristics of travel flow) with the autoregressive model (to estimate the nonlinear time series coefficient matrix). They test their approach on a set of GNSS data collected in Beijing, China, and compare their results with several known methods (k-nearest neighbor, neural network, and classification algorithms), concluding that the proposed NMF-AR algorithm predicts OD matrices more effectively than the alternative models. Caceres et al. (2007) were among the first to try to use mobile network data for OD matrix estimation. However, they use simulated network signalization data from the second generation of mobile networks rather than actual users’ data. For this, they designed a mobile network simulator to extract the phone network data, process them, and convert them into an OD matrix. Their results indicate the potential of network data to replace traditional traffic data collection techniques in a more cost-effective way.
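The matrix-fitting idea mentioned earlier in this review, where calculated trip productions and attractions are distributed over OD pairs, can be illustrated with iterative proportional fitting (often called the Furness method in transport planning). This is a minimal sketch of a classical technique, not the sequential updater of Ge and Fukuda (2016); the seed matrix and the zone totals are made-up values used only for illustration.

```python
def ipf(seed, productions, attractions, iterations=100):
    """Scale a seed OD matrix so that row sums match the productions
    and column sums match the attractions (iterative proportional fitting)."""
    matrix = [list(map(float, row)) for row in seed]
    for _ in range(iterations):
        # Row step: match the trip productions of each origin zone.
        for i, target in enumerate(productions):
            total = sum(matrix[i])
            if total > 0:
                matrix[i] = [value * target / total for value in matrix[i]]
        # Column step: match the trip attractions of each destination zone.
        for j, target in enumerate(attractions):
            total = sum(row[j] for row in matrix)
            if total > 0:
                for row in matrix:
                    row[j] *= target / total
    return matrix


# Hypothetical two-zone example; productions and attractions sum to the same total.
seed = [[1, 2], [3, 4]]
productions = [40, 60]
attractions = [55, 45]
fitted = ipf(seed, productions, attractions)
```

The seed matrix carries the prior pattern of trips (for instance, an older OD matrix), while the row and column totals carry the updated zone-level information. The fitted matrix keeps the structure of the seed as far as the new totals allow.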
Candia et al. (2008) use call detail records to analyze human activities, capturing OD flows at a very large scale with the intention of identifying abnormalities. Calabrese et al. (2011) first analyze opportunistically collected mobile phone location data to estimate the OD matrix. They base their analysis on two consecutive steps, trip determination and origin-destination estimation, and achieve a relatively high success rate of 76%. Frias-Martinez et al. (2012) propose a method that generates a commuting OD flow matrix based on the temporal variation of association rules, using aggregated mobile phone data. The results show high variation in detecting home-work and work-home trips across different municipalities. Ma et al. (2013) use CDR and network signalization data to derive individuals’ daily activity chains and path choices, as revealed by their mobile phone usage, and map them to the transport network to obtain sample trip matrices. The sample trip matrices are projected to the total travel demand based on surveyed commute flows between subregions or cities. These projected travel demands are then fed into a traffic assignment and matrix estimation loop to derive the final time-varying OD matrices. Another interesting detail is that Ma et al. (2013) use the actual traffic analysis zones of the area. However, they do not tackle the issue of mapping the mobile network data to the TAZs, but rather map the base stations’ coordinates directly to the TAZs. The same year, Bahoken and Raimond (2013) use CDR data to estimate the OD matrix, but they idealize the base station coverage areas as Voronoi polygons, which are very heterogeneous from a spatial point of view. They observe that, in their case, the loss of information when merging Voronoi polygons to the base station location is substantial (55%). Iqbal et al. (2014) propose a methodology to develop OD matrices using mobile phone CDRs and limited traffic counts. CDR data are analyzed to detect trips occurring within certain time windows and to generate base station-to-base station OD matrices for different time periods. These are then associated with the corresponding nodes of the traffic network and converted to node-to-node OD matrices. Following this, they scale

6.4 Literature review of big data advances for four-step transport planning model

up the node-to-node OD matrices to get the final OD matrices for the area. An optimization-based approach, in conjunction with a microscopic traffic simulation platform, is used to determine the scaling factors that best match the observed traffic counts. The applicability of the methodology is supported by a validation study. Larijani et al. (2015) use CDR and network signalization data to model the OD matrix in order to reveal any continuous trends or any dominant trace of flow indicating a specific mode of transport. The same year, Bonnel et al. (2015) try to estimate the OD matrix based on network signalization data alone. In their study, trips are deduced from the spatiotemporal trajectory of devices, with activities defined through a hypothesis of stationarity within the area covered by a base station. Trips are then aggregated in an OD matrix, which is compared with traditional data (census data and household travel surveys). Alexander et al. (2015) present methods to estimate the average daily OD matrix from triangulated mobile phone records (CDR data). They first convert these records into clustered locations at which users engage in activities for an observed duration. They label these locations as different points of interest (e.g., home, work) depending on observation frequency, day of the week, and time of day; these locations represent a user’s origins and destinations. Since the arrival time and duration at these locations reflect the observed (based on call details) rather than true values, they probabilistically infer departure times using survey data on trips in major cities in a country. Trips are then constructed for each user between two consecutive observations in a day. They multiply these trips by expansion factors based on the population of the user’s home census tract and divide by the number of days on which the user was observed, distilling average daily OD matrices.
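The expansion step described for Alexander et al. (2015) can be sketched as follows. This is only an illustration under stated assumptions: the expansion factor is taken here as the census tract population divided by the number of sampled users living in that tract, which is one common choice and not necessarily the exact definition used by the authors. All names and numbers are hypothetical.

```python
from collections import defaultdict


def average_daily_od(trips, tract_population, users_per_tract, days_observed):
    """Expand sampled trips to the population and average them per day.

    trips: iterable of (user_id, home_tract, origin_zone, destination_zone)
    """
    od = defaultdict(float)
    for user, tract, origin, destination in trips:
        # Assumed expansion factor: residents represented by one sampled user.
        expansion = tract_population[tract] / users_per_tract[tract]
        # Divide by the number of days on which this user was observed.
        od[(origin, destination)] += expansion / days_observed[user]
    return dict(od)


# Hypothetical sample: two users from the same home census tract.
trips = [
    ("u1", "T1", "A", "B"),
    ("u1", "T1", "B", "A"),
    ("u2", "T1", "A", "B"),
]
tract_population = {"T1": 1000}
users_per_tract = {"T1": 10}
days_observed = {"u1": 2, "u2": 4}

daily_od = average_daily_od(trips, tract_population, users_per_tract, days_observed)
```

Dividing by the number of observed days keeps frequently observed users from dominating the matrix, while the expansion factor compensates for uneven sampling rates across census tracts.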
The applicability of the proposed methodology is supported by validation against the temporal and spatial distributions of trips reported in local and national surveys. Furthermore, Gundlegård et al. (2016) focus on


defining a tailored set of mobility metrics to determine travel demand and the OD matrix. Bachir et al. (2019) infer dynamic OD matrices by transport mode using CDR data from 2 million devices within the Greater Paris region. Their model combines CDR positioning with transport network geospatial data, travel survey, census, and travel card data in three steps: segmentation, origin-destination identification, and rescaling. They validate the results against the travel survey and the travel card data at different spatial scales and report a success rate of 61%. Mamei et al. (2019) use pure CDR data from three Italian regions (Piemonte, Emilia Romagna, and Lombardia) and apply a two-step analysis (identification of individual trips/mobile phone movements among base station cells, and aggregation to create the OD matrix), achieving a 53% success rate compared to local census data. Furthermore, Hadachi et al. (2020) apply a hidden Markov model and the Viterbi algorithm to assign the maximum likelihood sequence of hidden states. Based on this approach, they build an OD matrix between 15 Estonian counties, focusing on long-distance trips, which seems more in line with the basic characteristics of CDR positioning precision. Table 6.2 gives a summarized overview of the literature review for the trip distribution step. For determining OD matrices, CDR data seem to capture the most attention from researchers at the moment. This is not necessarily due to potential advantages of CDR data for this particular step of transport planning, but more opportunistically, as CDR data exhibit further challenges in other applications within the four-step transport planning framework. When considering OD matrix creation from big data, the key challenge lies in the precision of the positioning data, namely the inherent spatial and temporal precision associated with each particular big data source.
TABLE 6.2 Summary of literature review for the trip distribution step.

| Literature | Duration of test data | Number of users | Dataset | Accuracy |
|---|---|---|---|---|
| Fanhas and Saptawati (2017) | 2 months | N/A (16,337 records) | GNSS, taximeter | N/A |
| Knapen et al. (2016) | 1 week | N/A (999 records) | GNSS | N/A |
| Ge and Fukuda (2016) | 1 day | N/A | GNSS | 88% |
| Moreira-Matias et al. (2016) | 9 months | 441 | GNSS, taximeter | 79% |
| Li et al. (2017) | 1 month | 12,000 | GNSS | 80% |
| Calabrese et al. (2011) | 3 weeks | 1,300,000 | CDR | 76% |
| Ma et al. (2013) | 1 month | 128,000 | CDR and network signalization data | 78% |
| Frias-Martinez et al. (2012) | 1 month | 3 million | CDR | N/A |
| Bahoken and Raimond (2013) | 6 weeks | 10 million | CDR | 45% |
| Iqbal et al. (2014) | 1 month | 2.87 million | CDR | 86% |
| Larijani et al. (2015) | 1 day | 1.4 million | CDR and network signalization data | N/A |
| Bonnel et al. (2015) | 10 days | 4.1 million | Network signalization data | 82% |
| Alexander et al. (2015) | 2 months | 2 million | CDR and network signalization data | 65% |
| Gundlegård et al. (2016) | 2 weeks | 300,000 | CDR | N/A |
| Bachir et al. (2019) | 2 months | 2 million | CDR, GIS, travel survey, census, and travel card data | 61% |
| Mamei et al. (2019) | N/A | N/A | CDR | 53% |
| Hadachi et al. (2020) | N/A | 300,000 | CDR | N/A |

In this aspect, the OD matrices can be reliable only to the level that the precision of the underlying data enables. For example, GNSS data exhibit a higher level of accuracy, so reliable OD matrices can be created even for areas where TAZs are geographically smaller (e.g., within a city’s perimeter). Nonetheless, some data preprocessing activities can have a significant impact on this, for instance, data anonymization steps that cut off the starting and ending segments of trips over a predefined length. This practice is sometimes part of a data anonymization strategy aiming to prevent identification of

individuals from the positioning data (based on the frequent starting and ending locations of trips, one’s home and work locations could easily be detected and an individual from a given household could be identified). However, when such practices are used, this should be noted and specified in the metadata, as it is an important input for the OD matrix analysis. For instance, if each trip is shortened by 200 m at its start and end, trips shorter than 400 m would automatically be excluded from the analysis (and underreporting of short trips in traditional data collection methods is something that many expect the emerging data collection techniques to mitigate). In addition, the use of, for example, public transport stops, parking facilities, or mobility hubs could be misidentified, and this misidentification would propagate into classifying OD pairs into the OD matrix of the wrong transport mode or TAZ. When it comes to CDR and network signalization data, the issue of mapping base stations to the existing traffic analysis zones in the data preprocessing step is still barely tackled and is one of the major challenges in this field at the moment. Until reliable techniques for this process are defined, OD matrices based on CDR and network signalization data will carry inherent limitations related both to the location precision of the source data and to the unreliable mapping of locations to the TAZs (the literature (Bahoken & Raimond, 2013) suggests that the loss of information in such cases could be around 50%). Hence, the use of CDR and network signalization data, especially for areas with geographically smaller TAZs (e.g., intraurban trips), is not yet at a level that allows it to seamlessly replace the results of traditional data collection and processing. Nonetheless, for analyses that consider geographically larger TAZs, or clusters of TAZs at a lower hierarchical level (for instance, the analysis of suburban or interurban journeys, where one needs to identify the number of journeys with OD pairs corresponding to two cities), the obtainable results, with careful consideration of result scaling, are mature enough to replace traditional data collection approaches. This is particularly relevant as CDR and network signalization data represent a cost-efficient way to collect data on mobility (e.g., they require no additional funds to be collected). However, practical issues related to the use of such data still exist and mainly lie in the domains of business (for instance, companies are not willing to share their business-related insights) and privacy, as users’ privacy needs to be guaranteed and maintained throughout all stages of the analytics process. Also, due to the nature (and size) of the data, it is important to mention that related studies currently validate their results mainly against census or traffic count data, which is a practical but still imperfect approach. This is mainly because the analysis and the validation draw on two completely different (or only slightly overlapping) datasets and populations. Validation by the users whose data are actually used in the analysis limits the number of data samples and users (or becomes an extremely resource-demanding process). However, it would provide a much-needed reference point for the results in this domain. In conclusion, CDR and network signalization data are able to cover a wide population and provide long-term continuous sensing and insights into dynamic OD flows, which was not possible, at this level, with traditional data collection methods. Hence, CDR and network signalization data provide a promising foundation for the longitudinal analysis of human mobility and OD dynamics and, as such, underpin more detailed and better-informed mobility planning and decision-making.
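Returning to the anonymization practice discussed earlier in this section, the effect of cutting 200 m from the start and end of each trip can be checked with a few lines of arithmetic. The sketch below is illustrative only; the trip lengths are made-up values, and a real pipeline would trim the trajectory geometry rather than a single length value.

```python
def trim_for_anonymization(trip_lengths_m, cut_m=200.0):
    """Cut cut_m from both ends of every trip; trips with nothing left are dropped."""
    kept = []
    for length in trip_lengths_m:
        remaining = length - 2 * cut_m
        if remaining > 0:
            kept.append(remaining)
    return kept


# Hypothetical trip lengths in metres.
lengths = [150.0, 380.0, 650.0, 3000.0]
trimmed = trim_for_anonymization(lengths)

# Share of trips that disappear from the dataset because of the trimming.
lost_share = 1 - len(trimmed) / len(lengths)
```

In this made-up sample, the two trips shorter than 400 m vanish entirely, which is why the cut length belongs in the metadata of any dataset used for OD matrix analysis.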

6.4.4 Example: construction of OD matrix from big data

Based on the dataset presented in the previous example (see Chapter 6, Example: detection of trip generation zones for tourism population), the starting and ending locations of each trip were matched with the geographical borders of the corresponding municipalities in the Zeeland province. If one considers municipalities as the high-level TAZs for tourist mobility within the province, the OD matrix among them can be created based on the trip ID that is assigned to both the origin and the destination location of each trip. Fig. 6.6 illustrates the result for the whole Zeeland region, where the OD matrix gives the number of trips with the origin location situated in one of the municipalities listed in the first column and the destination location in the municipality listed in the last row of the matrix. The well-populated matrix diagonal indicates a large number of trips that start and end in the same municipality. The largest number of observed trips took place in Veere, closely followed by Schouwen-Duiveland, while the fewest were observed within Reimerswaal.

FIGURE 6.6 OD matrix for Zeeland province.

Taking into account the precision of the underlying mobile phone data, where the positioning readings were collected by comparing the GNSS, mobile network, and WiFi positioning readings and selecting the most accurate one for the given context, and the estimates from the literature, where positioning errors of 0–40 m at the 95% confidence level are observed for GNSS data (Quddus & Washington, 2015), we can say that there is great potential in considering smartphone data as an efficient way to construct OD matrices for TAZs of sizes comparable to those in our example.
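The construction used in this example, pairing the origin and destination zone of each trip through the shared trip ID and counting trips per OD pair, can be sketched as follows. The records are invented for illustration and the municipality of each point is assumed to be known already (in practice it would come from a point-in-polygon match against the municipal borders); only the municipality names are taken from the example.

```python
from collections import Counter

# Hypothetical records: (trip_id, role, municipality).
records = [
    (1, "origin", "Veere"), (1, "destination", "Veere"),
    (2, "origin", "Veere"), (2, "destination", "Schouwen-Duiveland"),
    (3, "origin", "Reimerswaal"), (3, "destination", "Veere"),
    (4, "origin", "Schouwen-Duiveland"), (4, "destination", "Schouwen-Duiveland"),
]


def build_od_matrix(records):
    """Pair origin and destination zones by trip ID and count trips per OD pair."""
    origins, destinations = {}, {}
    for trip_id, role, zone in records:
        if role == "origin":
            origins[trip_id] = zone
        else:
            destinations[trip_id] = zone
    counts = Counter()
    for trip_id, origin in origins.items():
        if trip_id in destinations:  # skip trips with a missing end point
            counts[(origin, destinations[trip_id])] += 1
    zones = sorted(set(origins.values()) | set(destinations.values()))
    # Rows are origin zones, columns are destination zones.
    matrix = [[counts[(o, d)] for d in zones] for o in zones]
    return zones, matrix


zones, matrix = build_od_matrix(records)
intra_zone_trips = sum(matrix[i][i] for i in range(len(zones)))
```

The diagonal of the resulting matrix holds the trips that start and end in the same municipality, which corresponds to the populated diagonal noted for Fig. 6.6.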


6.4.5 Literature review of big data advances for mode choice step

The mode choice step aims to identify the transport modes used for a trip and to give an overview of the relative proportions of all trips between zones that use each particular mode of transport. As such, it naturally builds on the information available from the previous steps and enriches it with information on the transport mode. First, Bohte and Maat (2009) use a rule-based approach to derive the transport mode from GNSS data collected over a period of 1 week. They achieve a success rate of 70% for five transport modes and find the train and public transport modes to be the most challenging to distinguish. For these two modes, they achieve the lowest success rates, of 34% and 0%, respectively. Gong et al. (2012) use a similar rule-based approach on a much smaller dataset but include some spatial descriptors in their model. This addition results in a comparable success rate for train trips (36%), but the correct detection of public transport trips increases significantly (by up to 65%). Bolbol et al. (2012) use speed, acceleration, distance, and changes in heading from accelerometer and GNSS sensors as input for a support vector machine-based model. They infer six transport modes with a success rate of 88%, demonstrating the applicability of a supervised machine learning-based approach to transport mode recognition. Next to speed and accelerometer data, they considered rail and bus stop proximity and, by doing so, showed that the success rate of the random forest-based approach increased from 76% to 93%. Biljecki et al. (2013) also use GNSS traces and implement trip segmentation by detecting potential transition points between consecutive parts of trips made by different transport modes. Following this, they use OpenStreetMap data to create an explicit knowledge base with a number of empirically derived fuzzy membership functions and infer different transport modes. A year later, Huss et al. (2014) applied a rule-based approach to infer transport modes from GNSS data without spatial descriptors. Their results indicate that the same level of success as when using spatial descriptors can be achieved if one infers all motorized transport modes (bus, train, and car) as one class. Xiao et al. (2015) identify transport modes with a Bayesian network, whose structure is established using the K2 algorithm and whose conditional probability tables are estimated with maximum likelihood methods. They compare the results for the five derived transport modes with those obtained in a prompted recall survey by telephone, indicating a success rate of 86%. Rasmussen et al. (2015) use tracking data of 101 individuals over 5 days and combine fuzzy logic with a GIS-based algorithm to process raw GNSS data. They derive five transport modes and validate their findings by comparing the results with a control questionnaire collected among the same individuals. Xiao et al. (2017) derive around 100 accelerometer- and speed-based features and distinguish between six transport modes with a success rate of 90%. For this, they apply random forest, gradient boosting decision tree, and XGBoost-based classification approaches. Wang et al. (2010) aim to infer transport mode share from CDR data. They rely on a clustering approach based on travel time differences to infer driving, public transport, and walking. They compare their results with Google Maps travel times. Abdelaziz and Youssef (2015) implement a transport mode identification system based on mobile network signalization data, arguing that in this way the devices’ battery consumption is not increased and it is possible to process data for a much higher number of individuals than with other big data collection approaches.
They present a transport mode detection system that leverages the phone speed (correlated with features extracted from both the serving cell tower ID and the received signal strength from it). They evaluate the proposed approach using 135 h of mobile network traces from four users. Larijani et al. (2015) use CDR and network signalization data to detect transport mode. They focus on the detection of commuter trains based on grouped handover update information. Bachir et al. (2019) try to use a semi-supervised approach to distinguish between road vehicles and trains based on CDR data, while Breyer et al. (2021) use three geometry-based mode classification methods and validate their results against 255 labeled trips, achieving a 92% success rate. Reddy et al. (2008, 2010) try to distinguish walking, biking, and the use of motorized transport modes using a mobile phone with a GNSS receiver and an accelerometer sensor. Their model consists of a decision tree followed by a first-order hidden Markov model and achieves an accuracy level of 90%. Manzoni et al. (2010) process mobile sensed data by fusing GNSS traces with other sensors integrated into smartphones. They implement a decision tree-based approach to differentiate between seven transport modes using data collected during 1 day from four persons. They achieve a comparable success rate of 82%. Stenneth et al. (2011) explore six individuals’ trajectories over a period of 3 weeks. They combine GNSS traces with external transport network data to distinguish between different motorized transport options based on Bayesian net, decision tree, random forest, naïve Bayesian, and multilayer perceptron models. Hemminki (2013) and Hemminki et al. (2013) focus on estimating the gravity component of accelerometer measurements, hypothesizing that it holds sufficient information to infer different transport modes. Chen et al. (2013) experiment with an online sequential extreme learning machine-based transfer learning method to recognize various transport modes. Their approach comprises three steps. First, they train an initial classifier on the labeled training data from the source domain.
Second, they calculate the mean and standard deviation as multi-class trustable intervals in the source domain, and then extract the partially trustable samples from the target domain. Last, they integrate the trustable samples. Their experimental results indicate a high success rate for distinguishing five transport modes. Xia et al. (2014) use mobile sensed GNSS traces and accelerometer data. They implement support vector machines with parameters optimized for pattern recognition. In addition, they employ ant colony optimization to reduce the dimension of the features and analyze their relative importance. The resulting classification system achieves an accuracy rate of 96% when applied to a dataset obtained from 18 mobile users to infer walking, biking, and motorized transport. Shin et al. (2015) develop an application for mobile phones that runs in the background and continuously collects data from built-in acceleration and network location sensors. They partition the collected tracks into activity segments. For this, they rely on identifying walking activity in the data stream, which, in turn, acts as a separator for partitioning the data stream into other activity segments. Each vehicle activity segment is then subclassified according to the vehicle type. They use a one-second sampling interval to minimize the device power consumption. Zhou et al. (2016) implement a chained random forest model to distinguish between motorized transport, biking, and foot (walking and running) modes. They use data on 12 people’s travel behavior spanning 6 days and correctly detect the use of these three transport modes in 94% of cases. Semanjski et al. (2017) combine actively collected mobile sensed positioning data with spatial context extracted from GIS data. They develop a support vector machine-based model to infer five transport modes and achieve a success rate of 94%.
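The segmentation idea described above for Shin et al. (2015), where detected walking acts as a separator between other activity segments, can be sketched as follows. The sketch assumes that a per-sample walking detector already exists and has produced a label for every sample; the label names and the sample sequence are hypothetical, and this is not the authors' implementation.

```python
from itertools import groupby


def split_on_walking(labels):
    """Group consecutive samples into (is_walking, start_index, end_index) segments.

    end_index is exclusive, so labels[start_index:end_index] is the segment.
    """
    segments = []
    index = 0
    for is_walking, run in groupby(labels, key=lambda label: label == "walk"):
        length = len(list(run))
        segments.append((is_walking, index, index + length))
        index += length
    return segments


# Hypothetical per-sample labels from an upstream walking detector.
labels = ["walk", "walk", "other", "other", "other", "walk", "other"]
segments = split_on_walking(labels)

# Non-walking segments are the candidates for vehicle-type subclassification.
vehicle_candidates = [(start, end) for is_walking, start, end in segments if not is_walking]
```

Each non-walking segment can then be passed to a separate classifier that decides on the vehicle type, mirroring the two-stage structure described in the text.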
Transport mode detection seems to be one of the most active fields when it comes to the application of big data in the transport planning domain. A broad spectrum of literature examines the potential to infer transport mode from GNSS traces and mobile sensed data. However, only a few studies examine the use of CDR data. The main reason for this is the lower spatial and temporal resolution of CDR data, which makes inferring transport mode quite challenging. Hence, only major patterns over a large spatial area are being somewhat reliably detected from CDR data. On the other hand, pure GNSS data and mobile sensed data show quite promising results. Most of the approaches rely on combining positioning data with accelerometer data. This seems to result in a satisfactory level of transport mode detection if one focuses on distinguishing between walking, biking, and motorized transport, but differentiating between different types of motorized transport modes remains a challenge. Additionally, human interaction with the device (for instance, using a phone while on public transport) results in increased noise in the accelerometer readings. This adds an extra challenge for accelerometer-based transport mode inference models. Another emerging trend is to rely on the spatial context of the positioning tracks. In this sense, one integrates spatial information, such as the location of a metro stop or a forbidden driving direction for cars, into the analysis. This extra insight serves as a powerful tool when it comes to distinguishing between different motorized transport modes and so far achieves the best results. Table 6.3 gives a summarized overview of the literature review for the mode choice step.

TABLE 6.3 Summary of literature review for the mode choice step.

| Literature | Number of land transport modes | Data | Duration of test data | Number of users | Accuracy (%) | User validated |
|---|---|---|---|---|---|---|
| Reddy et al. (2008) | 3 | Mobile sensed GNSS, accelerometer | 240 min | 6 | 90 | No |
| Bohte and Maat (2009) | 4 | GNSS, GIS | 1 week | 1104 | 70 | Yes |
| Wang et al. (2010) | 5 | CDR | 12 h | 56,715 | 70 | No |
| Reddy et al. (2010) | 3 | Mobile sensed GNSS, accelerometer | 24 h | 16 | 93 | No |
| Manzoni et al. (2010) | 7 | Mobile sensed GNSS, accelerometer | 1 day | 4 | 82 | Yes |
| Stenneth et al. (2011) | 5 | Mobile sensed GNSS, GIS | 3 weeks | 6 | 93 | Yes |
| Bolbol et al. (2012) | 6 | GNSS, speed, acceleration, distance, heading | 2 weeks | 81 | 88 | Yes |
| Gong et al. (2012) | 5 | GNSS, GIS | 5 days | 63 | 83 | Yes |
| Hemminki et al. (2013) | 4 | Mobile sensed accelerometer | 150 h | 16 | 60–85 | Yes |
| Chen et al. (2013) | 5 | Mobile sensed GNSS | N/A | 5 | 90 | No |
| Biljecki et al. (2013) | 7 | GNSS, GIS | 1 week | 1104 | 92 | No |
| Huss et al. (2014) | 3 | GNSS | 2 days | 12 | 83 | No |
| Xia et al. (2014) | 3 | Mobile sensed GNSS, accelerometer | 11 h | 18 | 96 | No |
| Shin et al. (2015) | 4 | Mobile sensed GNSS, WiFi, mobile networks’ base station location readings, accelerometer | 1 day | 30 | 82 | No |
| Xiao et al. (2015) | 5 | GNSS | 5 days | 202 | 86 | Yes |
| Larijani et al. (2015) | 1 | CDR and network signalization | 1 day | 1.4 million | N/A | No |
| Rasmussen et al. (2015) | 5 | GNSS, GIS | 5 days | 101 | 90 | Yes |
| Abdelaziz and Youssef (2015) | 2 | Mobile network signalization data | 135 h | 4 | 89 | No |
| Zhou et al. (2016) | 3 | Mobile sensed GNSS, accelerometer | 6 days | 12 | 94 | Yes |
| Xiao et al. (2017) | 6 | GNSS | N/A | N/A | 90 | No |
| Semanjski et al. (2017) | | Mobile sensed GNSS, GIS | 4 months | 8000 | 94 | Yes |
| Breyer et al. (2021) | | CDR | N/A | 255 | 92 | Yes |
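The use of spatial context discussed in this section, for instance the location of a metro stop or a train station, can be illustrated with a simple proximity check. The sketch below flags a trip segment as a rail candidate when both of its end points lie close to a known station. The station coordinates and the 150 m threshold are hypothetical values chosen for illustration and are not taken from any of the reviewed studies.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    radius_m = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))


def near_any(point, locations, threshold_m):
    """True if the point lies within threshold_m of at least one location."""
    return any(haversine_m(*point, *location) <= threshold_m for location in locations)


def is_rail_candidate(segment, stations, threshold_m=150.0):
    """segment: list of (lat, lon) points; both ends must be near a station."""
    return (near_any(segment[0], stations, threshold_m)
            and near_any(segment[-1], stations, threshold_m))


# Hypothetical station locations and trip segments.
stations = [(51.4990, 3.6100), (51.4430, 3.5730)]
rail_like = [(51.4991, 3.6101), (51.4700, 3.5900), (51.4431, 3.5731)]
car_like = [(51.5200, 3.7000), (51.4700, 3.5900), (51.4431, 3.5731)]

flags = [is_rail_candidate(rail_like, stations), is_rail_candidate(car_like, stations)]
```

A check like this is typically only one rule among several; speed profiles and the distance of intermediate points to the rail network would normally be needed to separate rail trips from car trips that happen to start and end near stations.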

6.4.6 Example: rule-based transport mode detection from GNSS and GIS data

As one example of transport mode detection based on a machine learning approach has already been given (see Chapter 5, Example classification: transport mode recognition), here we will briefly look at the construction of rule-based approaches that perform transport mode classification using GPS and GIS data, focusing in particular on the detection of nondominant transport modes in the dataset.

In our example, the nondominant transport mode in the dataset will be the train. In general, sustainable transport modes such as public transport, walking, or cycling are nondominant components of the modal split. When it comes to transport mode recognition, this can result in high success rates even if naïve or overfitted approaches are used. For instance, following the modal split example given in Fig. 6.7, even if one classified all trips as car trips in a data sample where five transport modes are present, the success rate would be 69.62%. On the other hand, if one completely failed to correctly detect one of the transport modes, such as train, misclassifying it as, for instance, car, the success rate would be 98.31%, which in many cases would be a generally acceptable success rate. It is nevertheless reasonable to ask whether a classifier can be considered good if it completely fails to identify specific transport modes present in the dataset. For this reason, the literature suggests that it is worthwhile to evaluate the success of the proposed classification methods by considering the overall success rate as well as the success rates (or misclassifications) for specific, nondominant transport modes (Rodriguez Echeverría et al., 2018).

FIGURE 6.7 Modal split example. Based on Department of Mobility and Public Works. (2015). Flemish travel behavior survey. [Online] Available at: http://www.mobielvlaanderen.be/pdf/ovg51/samenvatting.pdf. Accessed 10th March 2019.

Generally, rule-based approaches mainly rely on the number of necessary GNSS points and the distance between GNSS points and relevant network infrastructure, such as railways and/or train stations. For instance, Gong et al. (2012), after data preprocessing and extraction of trip segments (parts of the trip made using a single transport mode), apply the following ruleset to detect train as the transport mode used:
• Distance from first or last point of the trip segment: B to the nearest subway entrance